As a mobile developer, I often have to deal with audio- and video-related work. Strictly speaking, I am not a full-time audio/video engineer: I only got into the field through business needs back in 2016, around the time my open-source GSYVideoPlayer happened to become popular, and solving one problem after another turned me into a "half a bucket of water" (jack-of-half-a-trade) audio/video developer.
During the years of maintaining GSYVideoPlayer, I found that many developers are still unclear about the basic concepts of the audio/video field, so I often receive issues like these:
“Why does XXX play this video but GSY doesn’t?”
“Why won’t one of my MP4 videos play?”
“Why does an already buffered video have to re-request data after a seek?”
“Why does my video have black bars?”
“…”
These are common-sense questions in audio/video development, so this article walks through the basics in three parts: basic concepts, frequently asked questions, and usage scenarios.
Basic concepts
First, the figure below shows a .mov video file. In its info panel you can see two codecs, AAC and HEVC: these are the file's audio encoding and video encoding, while MOV itself is the container (encapsulation) format. These are exactly the basic concepts introduced next.
In general, before a video stream can be played it has to go through protocol handling, de-encapsulation (demuxing), and decoding. Here, protocol refers to the streaming protocol; encapsulation refers to the container format of the video; and coding splits into video coding and audio coding.
Protocols generally include HTTP, RTSP, RTMP, and so on. The most common is plain HTTP, while RTSP and RTMP are generally used for live streaming or for scenarios that need control signaling, such as remote monitoring.
The encapsulation format refers to the familiar suffixes such as MP4, AVI, RMVB, MKV, TS, and FLV. These are multimedia container formats that package the audio and the video together for transport, so before playback the container must be unpacked (demuxed) to extract the corresponding audio stream and video stream.
So if someone asks you what your video's encoding is, you can no longer answer “my video is encoded in MP4”.
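As a small illustration of de-encapsulation on Android, here is a hedged Kotlin sketch that uses MediaExtractor to demux a local file and list its audio and video tracks; `path` is just a placeholder for a local MP4/MOV/FLV file.

```kotlin
import android.media.MediaExtractor
import android.media.MediaFormat

// Demux a container and print the codec MIME type of each track.
fun listTracks(path: String) {
    val extractor = MediaExtractor()
    extractor.setDataSource(path)
    for (i in 0 until extractor.trackCount) {
        val mime = extractor.getTrackFormat(i).getString(MediaFormat.KEY_MIME)
        // For the MOV above this would print something like "video/hevc" and "audio/mp4a-latm".
        println("track $i -> $mime")
    }
    extractor.release()
}
```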
Audio coding
Audio encoding refers to how the audio data is encoded, for example MP3, PCM, WAV, AAC, or AC-3. Raw audio is generally too large to transmit directly; its size can be estimated as sample rate * bit depth * number of channels * duration. Assuming the MOV above has a sample rate of 44100 Hz, 16-bit samples, a single (mono) channel, and a duration of 24 seconds, its raw audio size would be roughly:
44100 * 16 * 1 * 24 / 8 ≈ 2 MB
Yet the actual size of the extracted audio, as shown in the figure below, is only about 200 KB. That is the job of audio encoding.
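To make the arithmetic above concrete, here is a tiny Kotlin sketch of the same calculation; the numbers are only the illustrative ones from the example.

```kotlin
// Raw PCM size in bytes = sample rate * bits per sample * channels * seconds / 8.
fun rawPcmSizeBytes(sampleRate: Int, bitsPerSample: Int, channels: Int, seconds: Int): Long =
    sampleRate.toLong() * bitsPerSample * channels * seconds / 8

fun main() {
    val bytes = rawPcmSizeBytes(sampleRate = 44100, bitsPerSample = 16, channels = 1, seconds = 24)
    println("raw PCM ≈ ${bytes / 1024 / 1024.0} MB") // ≈ 2 MB, versus ~200 KB after AAC encoding
}
```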
So audio is generally compressed with one of several encodings before transmission to remove redundancy. WAV/PCM keeps the best quality but is comparatively large; MP3 is lossy and shrinks the audio considerably while keeping acceptable quality; AAC is also lossy and comes in several profiles such as LC-AAC and HE-AAC.
Video coding
Video encoding refers to how the picture frames are encoded and compressed, typically H.263, H.264, HEVC (H.265), MPEG-2, or MPEG-4, with H.264 being the most common at the moment.
Normally we think of a picture as a combination of RGB, but the video world mostly uses the YUV format, where Y represents luminance (brightness, or grayscale) and U and V represent chrominance (color).
YUV is derived from RGB by separating the luminance from the color information. YUV420, for example, subsamples the chroma at a 2:1 rate, so several luma samples share one U and one V sample, and the full picture is reconstructed from luma plus chroma. We won't go deeper into YUV here; one of the historical reasons for using it is backward compatibility with black-and-white TV.
Why not just transmit raw YUV? Suppose the MOV above were stored as raw YUV420; a single 1080 x 1920 frame would then be:
1080 * 1920 * 1 + 1080 * 1920 * 0.5 ≈ 2.9 MB
Factor in a frame rate of 30 and a one-hour duration, and the raw size of a video becomes astronomical, which is clearly unsuitable for network transmission. That is why video encoding is used to compress the frames.
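As a rough sketch of this reasoning (assuming raw YUV420 at 1.5 bytes per pixel and the same numbers as above):

```kotlin
// One YUV420 frame: full-resolution Y plane plus quarter-resolution U and V planes.
fun yuv420FrameBytes(width: Int, height: Int): Long =
    width.toLong() * height * 3 / 2

fun main() {
    val frame = yuv420FrameBytes(1080, 1920)
    println("one frame ≈ ${frame / 1024 / 1024.0} MB")             // ≈ 2.9 MB
    val oneHour = frame * 30 * 3600                                // 30 fps for 3600 s
    println("one hour raw ≈ ${oneHour / 1024.0 / 1024 / 1024} GB") // hundreds of GB
}
```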
In video compression there are a few more concepts worth knowing:
- 1. IPB frames are a common inter-frame compression scheme. An I frame is a key frame and serves as the reference picture; a P frame is a forward-predicted frame; a B frame is a bi-directionally predicted frame. Put simply, an I frame can be decoded into a complete picture on its own, a P frame needs the preceding I or P frame to be decoded, and a B frame needs both the preceding I/P frame and the following P frame. I frames are therefore critical: compressing I frames reduces spatial size, while compressing P/B frames removes temporal redundancy. Key frames also matter for seeking: if playback jumps forward after a seek, your video is most likely compressed too aggressively, with key frames too far apart (see the sketch after this list).
- 2. There is also the IDR frame. Because H.264 uses multi-frame prediction, an ordinary I frame cannot by itself serve as a clean reference point, so a special kind of I frame, the IDR frame, exists for that purpose. Its key property is that as soon as the decoder receives an IDR frame, it immediately empties the reference frame buffer and treats the IDR frame as the new reference.
- 3. Decoding also involves DTS (Decoding Time Stamp) and PTS (Presentation Time Stamp). DTS is used for decoding, while PTS is used for display and audio/video synchronization after decoding. The two differ because packets are not decoded in display order (B frames are decoded after the frames they depend on): the DTS read from the stream determines when a packet is decoded, and the resulting PTS determines when the decoded picture is drawn.
- 4. GOP (Group Of Pictures) is the distance between two I frames. In general, the larger the GOP, the better the picture quality, but the longer decoding takes. With a fixed bit rate, a larger GOP means more P/B frames and therefore higher picture quality.
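To make key frames and timestamps a little more tangible, here is a hedged Kotlin sketch that uses Android's MediaExtractor to walk a video track and print where the sync (I) frames sit. `path` is a placeholder, and note that MediaExtractor reports presentation timestamps rather than DTS.

```kotlin
import android.media.MediaExtractor
import android.media.MediaFormat

fun dumpKeyFrames(path: String) {
    val extractor = MediaExtractor()
    extractor.setDataSource(path)
    // Select the first video track.
    for (i in 0 until extractor.trackCount) {
        val format = extractor.getTrackFormat(i)
        if (format.getString(MediaFormat.KEY_MIME)?.startsWith("video/") == true) {
            extractor.selectTrack(i)
            break
        }
    }
    // Walk the samples; sync samples correspond to the key (I/IDR) frames a seek can land on.
    while (extractor.sampleTime >= 0) {
        if ((extractor.sampleFlags and MediaExtractor.SAMPLE_FLAG_SYNC) != 0) {
            println("key frame at pts=${extractor.sampleTime} us")
        }
        extractor.advance()
    }
    extractor.release()
}
```

The spacing of those key frames is effectively the GOP: the farther apart they are, the farther a naive seek can jump.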
Q&A
Before the questions: FFmpeg stands for “Fast Forward MPEG”, so it is pronounced “ef ef em-peg”. FFmpeg generally does software decoding, that is, pure CPU decoding; MediaCodec, on the other hand, does hardware decoding, assisted by the GPU or dedicated decoding hardware.
Question 1: “Why does the same video play on phone A but not on phone B?”
This is most likely because hardware decoding via MediaCodec is being used, and MediaCodec support varies between phone models and system versions.
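A minimal Kotlin sketch of how you might check this on a given device, using Android's MediaCodecList; the MIME type and resolution are just example parameters.

```kotlin
import android.media.MediaCodecList
import android.media.MediaFormat

// Returns true if the device reports a decoder for the given video format.
fun hasDecoderFor(mime: String, width: Int, height: Int): Boolean {
    val format = MediaFormat.createVideoFormat(mime, width, height)
    val codecName = MediaCodecList(MediaCodecList.REGULAR_CODECS).findDecoderForFormat(format)
    return codecName != null
}

// e.g. hasDecoderFor(MediaFormat.MIMETYPE_VIDEO_HEVC, 1920, 1080) may be false on older phones.
```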
Question 2: “Both are based on FFmpeg, so why can VLC play this video while ijkplayer can't?”
This is because FFmpeg supports building with a custom configuration. You usually don't need everything, so support for individual formats is switched on or off in the configure step for an on-demand build, which means the formats an FFmpeg build supports can differ from project to project. For example:
```
# Support wav
--enable-libwavpack --enable-muxer=wav --enable-demuxer=wav --enable-decoder=wavpack --enable-encoder=wavpack --enable-decoder=wav --enable-encoder=wav --enable-encoder=pcm_s16le --enable-decoder=pcm_s16le --enable-encoder=pcm_u8 --enable-decoder=pcm_u8 --enable-muxer=pcm_u8 --enable-demuxer=pcm_u8

# Support mp2
--enable-encoder=mp2 --enable-decoder=mp2 --enable-muxer=mp2 --enable-decoder=mp2float --enable-encoder=mp2fixed

# Support H265
--enable-decoder=hevc
```
Question 3: “Why does my video request data again after a seek, even though it was already buffered?”
This comes down to the difference between buffering and caching:
- Buffering: think of taking out the trash. You don't run to the dump every time there is a piece of garbage; you first drop it into the trash can and only carry it out once the can is full. The buffer lives in memory, and you cannot hold an entire video in memory, so what looks like “buffered” data is just a temporary block; buffer blocks are continuously being filled and discarded.
- Caching: this one is simpler. It means the video is downloaded to local storage while it plays, so data that is already in the cache does not need to be requested a second time.
Question 4: “Why does the playback position jump after I drag the seek bar?”
As explained above, this is related to the video's key frames, as well as to the seek strategy the player chooses. For example, with FFmpeg's -accurate_seek option, the extra frames between the nearest key frame and the requested position are decoded and then discarded; this corresponds to the enable-accurate-seek option in ijkplayer.
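A hedged Kotlin sketch of the idea using MediaExtractor's seek modes; `extractor` is assumed to already have a video track selected.

```kotlin
import android.media.MediaExtractor

fun seekDemo(extractor: MediaExtractor, targetUs: Long) {
    // Without accurate seek, a player typically lands on the key (I) frame at or before
    // the target, which is why the position appears to jump.
    extractor.seekTo(targetUs, MediaExtractor.SEEK_TO_PREVIOUS_SYNC)
    println("asked for $targetUs us, landed on key frame at ${extractor.sampleTime} us")
    // Accurate seek instead decodes from that key frame and discards frames until the
    // target timestamp is reached, which is what -accurate_seek / enable-accurate-seek describe.
}
```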
Question 5: “Why are my audio and video out of sync?”
ijkplayer, for example, uses audio as the synchronization clock. If audio and video drift apart in ijkplayer, the video's bit rate or frame rate is probably too high for the device to decode in time; try enabling framedrop to drop late frames, or enable hardware decoding.
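A small, hedged sketch of those two mitigations through ijkplayer's option API; the option names (“framedrop”, “mediacodec”) are the commonly used ones, so verify them against the ijkplayer version you actually ship.

```kotlin
import tv.danmaku.ijk.media.player.IjkMediaPlayer

fun configureSyncOptions(player: IjkMediaPlayer) {
    // Allow dropping late video frames so audio, the sync clock, is not held back.
    player.setOption(IjkMediaPlayer.OPT_CATEGORY_PLAYER, "framedrop", 1L)
    // Switch video decoding to MediaCodec (hardware) to relieve CPU pressure.
    player.setOption(IjkMediaPlayer.OPT_CATEGORY_PLAYER, "mediacodec", 1L)
}
```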
Question 6: “Why is my video displayed at the wrong size or in the wrong orientation?”
Video metadata usually carries a rotation angle; a video recorded on an Android phone, for example, often has one. Rotation angles such as 90 and 270 therefore need to be taken into account when laying out and drawing.
In addition, the video also carries aspect-ratio information, namely the sample aspect ratio, which ijkplayer exposes as videoSarNum/videoSarDen. Only by combining this ratio with the decoded width and height do you get the true display width and height.
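A minimal Kotlin sketch of that calculation, assuming a rotation angle in degrees and the SAR numerator/denominator as reported by the player.

```kotlin
data class DisplaySize(val width: Int, val height: Int)

fun displaySize(videoWidth: Int, videoHeight: Int, rotation: Int, sarNum: Int, sarDen: Int): DisplaySize {
    // Apply the sample aspect ratio first: stored pixels are not necessarily square.
    val w = if (sarNum > 0 && sarDen > 0) videoWidth * sarNum / sarDen else videoWidth
    val h = videoHeight
    // A 90 or 270 degree rotation swaps the displayed width and height.
    return if (rotation == 90 || rotation == 270) DisplaySize(h, w) else DisplaySize(w, h)
}
```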
Question 7: “Why does my video have black bars?”
This is basic layout work: devices come in all kinds of resolutions, and the video is simply rendered into whatever Surface size you give it. By configuring your rendering view with different scaling modes, such as fit, crop, match-height, or match-width, you can get exactly the black-bar behavior your scene needs.
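A tiny Kotlin sketch of the two most common strategies, fit (keeps black bars) versus crop (fills the view but cuts part of the picture); the parameter names are illustrative.

```kotlin
import kotlin.math.max
import kotlin.math.min

// Returns the scale factor to apply to the video so it fits or fills the view.
fun scaleFor(viewW: Int, viewH: Int, videoW: Int, videoH: Int, crop: Boolean): Float {
    val sx = viewW / videoW.toFloat()
    val sy = viewH / videoH.toFloat()
    return if (crop) max(sx, sy) else min(sx, sy) // crop fills, fit letterboxes
}
```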
There are plenty more questions, such as “how do I grab the frame at a given timestamp”, “how do I play several videos at the same time”, “how do I implement filters”, and “how do I implement 2x speed”. If you are interested, browse the issues of GSYVideoPlayer or search for the corresponding FFmpeg implementation.
Usage scenarios
Finally, the usage scenarios of audio/video development. Why talk about them?
Because developers often think, “Isn't it just passing a URL to the player SDK?” It really is not, and anyone who has done audio/video development will have felt this deeply.
- 1. First, determine which container formats, video codecs, and audio codecs you need to support. There are countless encoding formats and you normally cannot support them all, so nail down the required range of formats when gathering requirements.
- 2. If users can upload their own videos, it is better to provide format normalization and multi-bit-rate transcoding on the server side. Checking the format and transcoding on the server standardizes the encoding, which reduces playback failures on the client caused by unsupported codecs; and offering the same video at several bit rates gives a better playback experience across phone models and systems, reducing the synchronization and stutter problems caused by overly high bit rates mentioned above. Both Alibaba Cloud and Tencent Cloud offer this kind of service.
- 3. Network playback has many scenarios of its own. For example, the network may change during playback, from 4G to Wi-Fi or from Wi-Fi to 4G, which involves two things: first, when the network changes, the original download connection is effectively broken, so a new connection has to be established and swapped into the playback kernel to keep playing; second, when switching between Wi-Fi and 4G, the user should be prompted and allowed to decide whether to continue.
- 4. For example, when a playing video moves from a list to a detail page, or from its original container to another renderable container, you need to hand the playback kernel a different Surface without pausing it (a sketch follows this list); and an opening ad plus preloading of the main video requires handling two different request scenarios, and so on.
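A hedged Kotlin sketch of the Surface-switching idea, shown with the platform MediaPlayer (ijkplayer exposes a similar setSurface); the surfaces are assumed to come from your list item and detail page views.

```kotlin
import android.media.MediaPlayer
import android.view.Surface

// Keep one playback kernel alive and only change its render target,
// so playback continues without re-preparing or pausing.
fun moveToNewContainer(player: MediaPlayer, newSurface: Surface) {
    player.setSurface(null)       // detach from the old container
    player.setSurface(newSurface) // attach the new container's surface
}
```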
Video playback involves data exchange between the front end and the back end, changes in the user's environment, and constantly iterating business requirements. So if your boss says to you:
“Just make a video like Bilibili and play it.”
Trust me, then you’re just getting started.