Preface
This article introduces a number of concepts related to audio and video development. Understanding these concepts determines how far we can go in audio and video development, so I hope you take the time to understand them.
1. Video encapsulation formats
Common video encapsulation formats include .mov, .avi, .mpg, .vob, .mkv, .rm, .rmvb, and so on. Why are there so many different file formats? Because they store video in different ways, and to understand the differences we need the concept of a "video encapsulation format".
1.1 Video encapsulation format
A video encapsulation format (video format for short) is a container for storing video information. It holds the video data, the audio data, and related configuration information (such as how the video and audio are associated and how they should be decoded). The most direct reflection of the encapsulation format is the video file format, as shown in the following table 👇
| Video file format | Video encapsulation format | Description |
| --- | --- | --- |
| .avi | AVI (Audio Video Interleave) | Good image quality, but files are large; compression standards are not unified, so there are compatibility problems between older and newer versions. |
| .wmv | WMV (Windows Media Video) | Can be played while downloading, well suited for online playback and transmission. |
| .mpg .mpeg .mpe .dat .vob .asf .3gp .mp4 | MPEG (Moving Picture Experts Group) | Three compression standards: MPEG-1, MPEG-2, and MPEG-4, designed for streaming high-quality video, with the goal of the best image quality with the least amount of data. |
| .mkv | Matroska | A newer encapsulation format that can hold multiple video encodings and more than 16 audio streams in different formats and languages in a single Matroska media file. |
| .rm .rmvb | RealVideo | The audio and video compression specification developed by RealNetworks, called RealMedia. With RealPlayer, different compression ratios can be used for different network transmission rates, enabling real-time transmission and playback of video over low-bandwidth networks. |
| .mov | QuickTime File Format | A video format developed by Apple; the default player is Apple's QuickTime. This format features a high compression ratio, excellent video resolution, and the ability to preserve an alpha channel. |
| .flv | Flash Video | A web video encapsulation format that grew out of Adobe Flash; adopted by many video sites. |
1.2 Container
Compressed video data and audio data are put together into a file according to a certain format. That file can be called a container; of course, it is only a shell.
Usually, in addition to the audio and video data, a container also holds some metadata used for synchronization, subtitles for example. These different kinds of data are processed by different programs, but they are bound together for transmission and storage.
2. Common video and audio coding formats
2.1 Video codec
Video encoding and decoding is the process of compressing or decompressing digital video.
2.2 Common video codec modes
- The H.26x series is led by the ITU Telecommunication Standardization Sector (ITU-T) and includes H.261, H.262, H.263, H.264, and H.265.
  - H.261 was used in older video conferencing and video telephony systems, and all later standards are based on it.
  - H.262 is equivalent to MPEG-2 Part 2 and is used in DVD, SVCD, and most digital video broadcast and cable distribution systems.
  - H.263 is mainly used in video conferencing, video telephony, and network video products. It is a significant performance improvement over the previous video coding standard, especially at low bit rates, where it saves considerable bit rate while preserving a given quality.
  - H.264 is equivalent to MPEG-4 Part 10, also known as Advanced Video Coding (AVC). It is a widely used high-precision video recording, compression, and distribution format. The standard introduced a series of new techniques that greatly improve compression performance, surpassing previous standards at both high and low bit rates.
  - H.265, High Efficiency Video Coding (HEVC), is the successor to H.264. HEVC is expected not only to improve image quality but also to achieve roughly twice the compression of H.264 (about a 50% reduction in bit rate at the same picture quality). It supports 4K and even ultra-high-definition TV, with resolutions up to 8192×4320 (8K), which is the current direction of development.
- The MPEG series is developed by the Moving Picture Experts Group (MPEG) under the International Organization for Standardization (ISO).
  - MPEG-1 Part 2: used mainly on VCD, and also for some online video. Codec quality is roughly equivalent to the original VHS videotape.
  - MPEG-2 Part 2: equivalent to H.262, used in DVD, SVCD, and most digital video broadcast and cable distribution systems.
  - MPEG-4 Part 2: usable for network transmission, broadcasting, and media storage; its compression performance improves on MPEG-2 Part 2 and the first version of H.263.
  - MPEG-4 Part 10: equivalent to H.264, a standard produced jointly by the two standardization bodies.
Why H265 is not chosen here:

- H265 is only supported from iOS 11.0 onward.
- Compared with H264, H265 puts a heavier load on the CPU, and of course the device heats up more.
2.3 Relationship between "codec mode" and "encapsulation format"
A "video encapsulation format" is a container holding the video, the audio, the video codec mode, and other information.
A "video encapsulation format" can support multiple "video codec modes". For example, QuickTime File Format (.mov) supports almost all video codec modes, and MPEG (.mp4) also supports a wide range of them.
The usual notation is A/B, where A is the "video codec mode" and B is the "video encapsulation format". For example, an H.264/MOV video file is encapsulated in QuickTime File Format and encoded with H.264.
2.4 Audio codec mode
Frequently used audio encoding methods are 👇🏻
- AAC, full name Advanced Audio Coding: an audio coding technology based on MPEG-2, developed in 1997 by Fraunhofer IIS, Dolby Labs, AT&T, Sony, and others. In 2000, after the MPEG-4 standard appeared, AAC reintegrated its features, adding SBR and PS technology; to distinguish it from the traditional MPEG-2 AAC, it is also called MPEG-4 AAC.
- MP3, full name MPEG-1 or MPEG-2 Audio Layer III: a once-popular lossy digital audio coding and compression format designed to dramatically reduce the amount of audio data. It was invented and standardized in 1991 by a group of engineers at the Fraunhofer-Gesellschaft research organization in Erlangen, Germany. The popularity of MP3 had a huge impact on the music industry.
- WMA, full name Windows Media Audio: a digital audio compression format developed by Microsoft, which includes both lossy and lossless variants.
2.5 Encoding format in live/small video
- Video encoding format 👉🏻 H264: low bit rate, high image quality, strong fault tolerance, strong network adaptability, and a high data compression ratio that can reach an astonishing 102:1.
- Audio encoding format 👉🏻 AAC: the currently popular lossy compression technology. It performs very well at bit rates below 128 kbit/s and is mostly used for the audio track in video. It has three main derived encoding formats: LC-AAC, HE-AAC, and HE-AAC v2 (a settings sketch follows this list).
  - LC-AAC is the traditional AAC, mainly used for encoding medium and high bit rate scenes (>= 80 kbit/s).
  - HE-AAC is mainly used for encoding low bit rate scenes (<= 48 kbit/s).
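As an illustration only (my sketch, not from the original article), here is roughly how an iOS app might request AAC audio output when writing media with AVFoundation's `AVAssetWriter`; the concrete sample rate and bit rate values are placeholder assumptions, not recommendations from the article:

```swift
import AVFoundation

// Sketch: audio settings asking AVAssetWriter to encode AAC audio.
// The numbers (44.1 kHz, stereo, 64 kbit/s) are example values only.
let aacSettings: [String: Any] = [
    AVFormatIDKey: kAudioFormatMPEG4AAC,      // AAC codec
    AVSampleRateKey: 44_100,                  // sample rate in Hz
    AVNumberOfChannelsKey: 2,                 // stereo
    AVEncoderBitRateKey: 64_000               // target bit rate in bit/s
]

let audioInput = AVAssetWriterInput(mediaType: .audio, outputSettings: aacSettings)
audioInput.expectsMediaDataInRealTime = true  // typical for capture / live scenarios
```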
2.6 RGB & YUV
We usually use the RGB model to represent color. In the RGB model, each color needs three components, R, G, and B, and each component occupies one byte (8 bits), so a pixel requires 24 bits in total.
So, is there a more efficient color model that uses fewer bits to represent color? 👉🏻 YUV: Y represents luminance (also the grayscale value), while U and V represent the chrominance components.
Now suppose we define a "luminance" value to represent the brightness of a color, which can be expressed in terms of R, G, and B as 👇

Y = kr*R + kg*G + kb*B

Here Y is the luminance, and kr, kg, kb are the weights of R, G, and B.
At this point, we can define a concept of “Chrominance” to represent the difference in color 👇
Cr = R - Y, Cg = G - Y, Cb = B - Y
Cr, Cg, and Cb are the chrominance components for R, G, and B respectively. This is the basic principle behind the YCbCr color model.

YCbCr, a member of the YUV family, is the most widely used color model in computer systems, for example in the video field discussed in this article.
In YUV, Y stands for “brightness”, or gray scale, while U and V stand for “chroma”.
The key to YUV is that the luminance signal Y is separated from the chrominance signals U and V: even with only the Y component and no U and V, we can still display the image, just as a black-and-white grayscale image. In YCbCr, Y is the luminance component, Cb the blue chrominance component, and Cr the red chrominance component.
Using the coefficients recommended by the ITU-R BT.601-7 standard, we get the conversion formulas between YCbCr and RGB 👇

Y = 0.299R + 0.587G + 0.114B
Cb = 0.564(B - Y)
Cr = 0.713(R - Y)

R = Y + 1.402Cr
G = Y - 0.344Cb - 0.714Cr
B = Y + 1.772Cb
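To make the formulas concrete, here is a small sketch (mine, not from the original article) that applies the BT.601 coefficients above to convert one RGB pixel to YCbCr and back:

```swift
// Minimal sketch of the BT.601 conversion above; values are treated as Double
// in the 0...255 range, with no clamping or rounding policy implied.
struct RGB { var r, g, b: Double }
struct YCbCr { var y, cb, cr: Double }

func toYCbCr(_ p: RGB) -> YCbCr {
    let y  = 0.299 * p.r + 0.587 * p.g + 0.114 * p.b
    let cb = 0.564 * (p.b - y)
    let cr = 0.713 * (p.r - y)
    return YCbCr(y: y, cb: cb, cr: cr)
}

func toRGB(_ p: YCbCr) -> RGB {
    let r = p.y + 1.402 * p.cr
    let g = p.y - 0.344 * p.cb - 0.714 * p.cr
    let b = p.y + 1.772 * p.cb
    return RGB(r: r, g: g, b: b)
}

// Example: a pure-red pixel
let red = toYCbCr(RGB(r: 255, g: 0, b: 0))   // y ≈ 76.2, cb ≈ -43.0, cr ≈ 127.5
```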
So far we have a preliminary understanding of the YCbCr color model. But if you look carefully, YCbCr still uses three numbers to represent a color. Can any bits be saved? See the image below 👇🏻
- Suppose the image consists of the pixels shown below.
- An image is an array of pixels. When the three components of every pixel are kept in full, that is YCbCr 4:4:4.
- In the next figure, the "luminance" value is kept for every pixel, but the "chrominance" values of the even-numbered pixels in each row are dropped, which saves bits. That is YCbCr 4:2:2.
- In the figure below that, even more chrominance samples are dropped, yet the quality of the image is not affected much. That is YCbCr 4:2:0.

Because of this, more than 90% of live streaming and small video uses this YCbCr 4:2:0 format.
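As a rough illustration (my own, not the article's), here is how the raw frame sizes compare for 4:4:4 versus 4:2:0 sampling at 8 bits per sample:

```swift
// Raw (uncompressed) bytes per frame for a given resolution, 8 bits per sample.
// YCbCr 4:4:4 keeps Y, Cb, Cr for every pixel; 4:2:0 keeps one Cb and one Cr
// per 2x2 block, so each chroma plane is a quarter of the luma plane.
// Width and height are assumed even for simplicity.
func bytes444(width: Int, height: Int) -> Int {
    return width * height * 3
}

func bytes420(width: Int, height: Int) -> Int {
    let luma = width * height
    let chroma = (width / 2) * (height / 2) * 2
    return luma + chroma            // = width * height * 3 / 2
}

let w = 1920, h = 1080
print(bytes444(width: w, height: h))   // 6,220,800 bytes per frame
print(bytes420(width: w, height: h))   // 3,110,400 bytes per frame -> half the data
```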
3. Explanation of live broadcast process
As an iOS developer, especially one working on audio- and video-related business, you need to be familiar with the live-streaming flow on the app side. It can be divided into 8 steps 👇
- Audio and video capture
- Video filters
- Audio and video coding
- Push streaming
- Streaming media server
- Pull streaming
- Audio and video decoding
- Audio and video playback
Audio and video capture
Audio and video capture is the AVFoundation capture we covered before 👉🏻 01-AVFoundation capture and 02-AVFoundation advanced capture.
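As a refresher only (a minimal sketch, not the full capture code from those articles), setting up video capture with AVFoundation looks roughly like this:

```swift
import AVFoundation

// Minimal capture sketch: one video input, one uncompressed video data output.
// Error handling, audio input, and the preview layer are omitted.
let session = AVCaptureSession()
session.sessionPreset = .hd1280x720

if let camera = AVCaptureDevice.default(for: .video),
   let input = try? AVCaptureDeviceInput(device: camera),
   session.canAddInput(input) {
    session.addInput(input)
}

let output = AVCaptureVideoDataOutput()
// Ask for a YCbCr 4:2:0 pixel format, matching the discussion above.
output.videoSettings = [
    kCVPixelBufferPixelFormatTypeKey as String:
        kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange
]
if session.canAddOutput(output) {
    session.addOutput(output)
}

session.startRunning()   // frames arrive via AVCaptureVideoDataOutputSampleBufferDelegate
```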
Video filters
Filters are generally implemented with GPUImage, so we should focus on understanding and mastering the principles underlying GPUImage.
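GPUImage's own API is not shown here; as a rough stand-in for what a filter stage does (take a frame in, produce a processed frame out), here is a Core Image sketch applying a sepia filter to a pixel buffer:

```swift
import CoreImage
import CoreVideo

// Stand-in for a filter stage: pixel buffer in, filtered CIImage out.
// GPUImage works differently internally, but its role in the pipeline is the same.
func applySepia(to pixelBuffer: CVPixelBuffer) -> CIImage? {
    let source = CIImage(cvPixelBuffer: pixelBuffer)
    guard let filter = CIFilter(name: "CISepiaTone") else { return nil }
    filter.setValue(source, forKey: kCIInputImageKey)
    filter.setValue(0.8, forKey: kCIInputIntensityKey)   // filter strength, example value
    return filter.outputImage
}
```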
Audio and video coding
Among the options, hard (hardware) encoding is handled by dedicated hardware rather than the CPU. Apple opened up two hardware codec frameworks, VideoToolbox and AudioToolbox, starting with iOS 8.0.
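As a minimal sketch (my illustration, with placeholder resolution and bit rate), creating an H.264 hardware encoder session with VideoToolbox looks roughly like this:

```swift
import Foundation
import VideoToolbox

// Called by VideoToolbox for each encoded frame; the sample buffer holds the
// H.264 data (and SPS/PPS in its format description for key frames).
let outputCallback: VTCompressionOutputCallback = { _, _, status, _, sampleBuffer in
    guard status == noErr, sampleBuffer != nil else { return }
    // package the encoded frame (e.g. for RTMP) and push it to the server
}

var session: VTCompressionSession?
VTCompressionSessionCreate(
    allocator: kCFAllocatorDefault,
    width: 1280, height: 720,                 // example resolution
    codecType: kCMVideoCodecType_H264,
    encoderSpecification: nil,
    imageBufferAttributes: nil,
    compressedDataAllocator: nil,
    outputCallback: outputCallback,
    refcon: nil,
    compressionSessionOut: &session
)

if let session = session {
    // Typical live-streaming properties; the numbers are illustrative.
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_RealTime, value: kCFBooleanTrue)
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_AverageBitRate, value: NSNumber(value: 1_500_000))
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_MaxKeyFrameInterval, value: NSNumber(value: 30)) // one I frame per 30 frames (GOP size)
    VTCompressionSessionPrepareToEncodeFrames(session)
}
```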
Push streaming
Streaming media server
Pull streaming
Audio and video decoding
Audio and video playback
The above are just some of the key technical points in the whole live-streaming pipeline; many details, pages, and other logic are not covered. Clearly this work cannot be done by one person, so a team is needed, with each person responsible for one link. So mastering everything to a high level in a short time is certainly impossible! 😂
The cover
Finally, let's look at how the cover image shown for a video is obtained. There are generally three ways 👇
- The server delivers it to you 👉🏻 live streaming (playback supported!)
- Client: by default, the first frame of the video is used as the cover 👉🏻 small video
- Client: an appropriate frame is selected from the video as the cover 👉🏻 small video
4. Analysis of live broadcast and small video architecture
4.1 Live Broadcast Architecture
The figure above is just a per-end breakdown of the earlier live-streaming mind map; each end lists some commonly used technical points or frameworks. Now let's look at the layered architecture diagram 👇🏻
Roughly divided into several parts 👇🏻
- Video capture end: use an iPhone or Android phone to record audio and video, add beautification effects, filters, and so on, encode (compress) the data, and upload it to the server over a streaming media protocol.
- Cloud server side: the streaming CDN performs a series of processing steps after receiving the audio and video stream, including protection measures such as encryption; once processing is complete, the stream is stored on the server and a playback address is generated.
- Playback end: using the appropriate player and the corresponding playback address, decrypt and decode the audio and video data and play it.
4.2 Small Video Architecture
The small video architecture is slightly simpler than the live broadcast architecture, but the general process is similar. It should be noted that 👇🏻
- Video capture for small video transmits frames as textures, and audio in PCM form.
- Small videos can be large (ultra HD, for example), in which case the FFmpeg framework is used for packaging. After packaging, the output stage does two things: one is local storage, the other is HTTP upload; resumable (breakpoint) upload should be considered for the upload scenario.
5. Architecture analysis of pan-entertainment and real-time interactive live streaming servers
Live streaming is divided into two categories 👇🏻
- Pan-entertainment live streaming 👉🏻 games and entertainment, such as Huya and Douyu
- Real-time interactive live streaming 👉🏻 educational live streams, such as Tencent Classroom
5.1 Pan-entertainment live streaming architecture
- First, the anchor end sends a request to the signaling server to create a room.
- The signaling server opens a room and returns the address of the streaming CDN server to the anchor end.
- The anchor end takes the streaming CDN server address and starts pushing audio and video data to the CDN.
- The playback end keeps pulling the stream from the streaming CDN server, and so it sees the anchor's audio and video.
5.2 Real-time interactive live broadcasting architecture
Why UDP?

Because live streaming must be fast and feel real-time; a little packet loss is not a big problem. If TCP were used, the client and server would wait until the data had been fully delivered; imagine having to wait until today to receive yesterday's data — it would be meaningless.
- The client transmits data to the server using the UDP protocol. As in the pan-entertainment architecture, the server side is divided into signaling servers and media servers.
- At the same time, the servers must provide stable, uninterrupted 24-hour service, so there is more than one signaling server and media server; even if one server has a problem, traffic can be switched to another in time, ensuring the stability and robustness of the service.
- Since there are multiple signaling servers and media servers, every server must be load balanced, and that is the responsibility of the control center. How is that done? 👇🏻
Each signaling server and media server acts as a node. Every node periodically reports its health index to the control center, including CPU usage, memory usage, IO usage, network usage percentage, and so on. Reporting at regular intervals ensures that every node is running normally. Of course, two special situations can also occur 👇🏻
- If a node fails to report, or its health index is below standard, there is a problem; the control center then switches the tasks on the problem node over to a normal node.
- If one node is busy while another is idle, the control center hands some of the busy node's tasks to the idle node (a small sketch of this node-selection idea follows this list).
- Moving on, there is an intermediate layer, which can be called the memory line or heartbeat line; it is responsible for consolidating the data from all the nodes and passing it to the media servers.
- Since the client transmits data over the UDP protocol, the media server converts the data to the RTMP protocol, turning real-time interactive live data into pan-entertainment live data so that the playback end can view the audio and video.
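The following is a toy sketch of the health-report idea described above (the `NodeHealth` type and the equal weighting are my own illustration, not part of any real control-center implementation):

```swift
import Foundation

// Toy model of the control center's node selection.
// Each node reports a health index; the control center routes new tasks
// to the node that currently looks least loaded.
struct NodeHealth {
    let nodeID: String
    let cpuUsage: Double      // 0.0 ... 1.0
    let memoryUsage: Double   // 0.0 ... 1.0
    let ioUsage: Double       // 0.0 ... 1.0
    let networkUsage: Double  // 0.0 ... 1.0
    let lastReport: Date

    // Illustrative load score: equal weights for the four metrics.
    var loadScore: Double { (cpuUsage + memoryUsage + ioUsage + networkUsage) / 4 }
}

func pickNode(from reports: [NodeHealth], now: Date = Date(), timeout: TimeInterval = 30) -> NodeHealth? {
    reports
        .filter { now.timeIntervalSince($0.lastReport) < timeout }  // drop nodes that stopped reporting
        .min { $0.loadScore < $1.loadScore }                        // choose the least-loaded node
}
```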
6. CDN network analysis
The purpose of a CDN network 👉🏻 to solve the problem of slow networks.
The CDN is roughly composed of three parts 👇🏻
- Edge nodes
- Secondary nodes: also called backbone nodes
- Source (origin) node
To understand the structure above with an example: I want to eat Peking duck, but I am not in Beijing; say I am in Changsha. If I fly from Changsha to Beijing, the trip is very time-consuming. But if Changsha also has a branch of Quanjude, then I can eat Peking duck right in Changsha. Delicious! Here Beijing is like the source node and Changsha is an edge node. What about the secondary node? 👉🏻 Suppose I am not in Changsha but travelling in Zhangjiajie; then I can only find the nearest branch, in Changsha, and have an order delivered. In that case Changsha is the secondary node, and Zhangjiajie is the edge node.
There is another situation, caused by human factors: there are two networks, for example China Unicom's network and China Telecom's network. If a resource cannot be found on the Unicom network, an inter-network line is built artificially, bridging the Unicom and Telecom networks, so the search can continue on the Telecom network. The main objects of this bridging are the backbone nodes (that is, the secondary nodes): the Unicom and Telecom backbone nodes are connected. This is also a special case of CDN networking.
7. H264-related concepts
Next, take a look at some basic concepts related to H264 codecs.
7.1 I, B, and P frames
I frame: a key frame, compressed with intra-frame compression.

For example, if a camera is pointed at you, very little about you actually changes within one second. Cameras typically capture dozens of frames per second: animation is usually 25 frames per second, and most video files are around 30 frames per second. For demanding scenarios that need precise motion capture, high-end cameras shoot 60 frames per second. Within such a group of frames very little changes, so to compress the data we save the first frame completely; the frames after the key frame cannot be decoded without it. That is why I frames are so critical.

P frame: a forward reference frame. Its compression refers only to the previous frame; this is inter-frame compression.

The first frame of the video is saved as a key frame. The frames that follow depend on the frames before them: frame 2 depends on frame 1, and each subsequent frame stores only its difference from the previous frame. That greatly reduces the amount of data and achieves a high compression ratio.

B frame: a bidirectional reference frame; during compression it refers to both the previous frame and the next frame. Also an inter-frame compression technique.

- A B frame references both the previous frame and the following frame, so it compresses better and stores less data; the more B frames, the higher the compression ratio. That is the advantage of B frames. Their biggest disadvantage is that in real-time interactive live streaming a B frame can only be decoded after the following frame has arrived over the network. This depends on the network: if the network is good, decoding is faster; if not, decoding is slower, and lost packets must be retransmitted. For interactive live streaming, B frames are generally not used.
- In pan-entertainment live streaming, where a certain amount of delay is acceptable, B frames can be used when a relatively high compression ratio is needed.
- In real-time interactive live streaming, where timeliness matters most, B frames cannot be used.
7.2 GOF (Group of Frames)
Say there are 30 frames in one second; those 30 frames can be treated as one group. If the camera or the scene does not change much for a whole minute, all the frames in that minute could also be treated as one group.
What is a set of frames?
It is the span from one I frame to the next I frame. This set of data, including the B and P frames in between, is what we call a GOF (also commonly called a GOP, Group of Pictures). The picture below shows it 👇
What are the benefits of GOF? What problem does it solve?
And this is related to what we’re going to do next.
7.3 SPS/PPS
SPS/PPS actually stores the parameters of a GOP.

SPS (Sequence Parameter Set): stores the number of frames, the number of reference frames, the decoded image size, the frame/field coding mode selection flag, and so on.

PPS (Picture Parameter Set): stores the entropy coding mode selection flag, the number of slices, the initial quantization parameters, the deblocking filter coefficient adjustment flag, and so on (information related to the image).

Just remember that SPS/PPS data must be received before a group of frames; without these parameters we cannot decode the group.

If an error occurs during decoding, first check whether SPS/PPS is present. If not, it is because the other end did not send it or it was lost during transmission.

SPS/PPS data is also classified together with I frames, and these two sets of data must never be lost. A small sketch of telling these NAL units apart is shown below 👇🏻
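In an H.264 Annex-B byte stream, each NAL unit begins with a 0x000001 (or 0x00000001) start code, and the low five bits of the first byte after it give the NAL unit type (7 = SPS, 8 = PPS, 5 = IDR/key-frame slice). A minimal scanner (my sketch, not production parsing code):

```swift
// Minimal Annex-B scanner: find start codes and report each NAL unit type.
// Types: 5 = IDR slice (I frame), 7 = SPS, 8 = PPS, 1 = non-IDR slice.
func nalUnitTypes(in stream: [UInt8]) -> [UInt8] {
    var types: [UInt8] = []
    var i = 0
    while i + 3 < stream.count {
        if stream[i] == 0, stream[i + 1] == 0, stream[i + 2] == 1 {
            // 3-byte start code 00 00 01
            types.append(stream[i + 3] & 0x1F)
            i += 4
        } else if i + 4 < stream.count,
                  stream[i] == 0, stream[i + 1] == 0, stream[i + 2] == 0, stream[i + 3] == 1 {
            // 4-byte start code 00 00 00 01
            types.append(stream[i + 4] & 0x1F)
            i += 5
        } else {
            i += 1
        }
    }
    return types
}

// Example: SPS, PPS, then an IDR slice
let sample: [UInt8] = [0, 0, 0, 1, 0x67, 0x42,   // 0x67 & 0x1F = 7 -> SPS
                       0, 0, 0, 1, 0x68, 0xCE,   // 0x68 & 0x1F = 8 -> PPS
                       0, 0, 1, 0x65, 0x88]      // 0x65 & 0x1F = 5 -> IDR (I frame)
print(nalUnitTypes(in: sample))                  // [7, 8, 5]
```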
7.4 Why corrupted pictures and stuttering happen
When watching a video we sometimes see a corrupted ("flowery") picture or stuttering. This is related to the GOF we just discussed.
- If a P frame in a GOP is lost, the decoded image on the receiving side will be wrong.
- To avoid showing a corrupted picture, when a P frame or I frame is found to be lost, none of the frames in that GOP are displayed; the player only refreshes the image at the next I frame.
- When that happens, the screen is not refreshed because the frames affected by the packet loss have all been thrown away, so the image stays stuck in one place. That is where the stutter comes from.
So to sum up 👇
A corrupted picture appears because a P frame or I frame was lost, causing a decoding error. Stuttering is the price of preventing that corruption: the erroneous GOP data is thrown away as a whole and the player waits for the next correct GOP before redrawing the screen. The time gap in between is what we perceive as stutter.
8. H264 compression techniques
- Intra-frame prediction compression 👉🏻 solves the problem of spatial data redundancy.
  What is spatial redundancy? Within the width and height of a single picture there is a lot of color and brightness data that the human eye can barely perceive. Such data can be considered redundant and simply compressed away.
- Inter-frame prediction compression 👉🏻 solves the problem of temporal data redundancy.
  As explained earlier, the data captured by the camera does not change much over a short period of time, so the data that stays the same across that period is compressed away. This is called temporal compression.
- Integer discrete cosine transform (DCT) 👉🏻 transforms spatially correlated data into largely independent frequency-domain data, which is then quantized.
  This is abstract and rooted in mathematics. If you understand the Fourier transform well, it will be easier; if not, it may be a little difficult. The Fourier transform can decompose a complex waveform into many sine waves that differ only in frequency and amplitude; once they differ in frequency, they can be compressed separately.
- CABAC compression 👉🏻 lossless (entropy) compression.
9. H264 coding principles in detail
Next, we will analyze some concepts related to H264 coding principle.
9.1 H264 macroblock partitioning and grouping
As shown above, H264 describes the upper-left corner of the image as a macroblock, here an 8 × 8 block of pixels. Taking out its color, it can be described as in the figure on the right. A complete picture described by macroblocks then looks like the figure below 👇🏻
This completes the basic image macroblock partitioning.
Sub-block division
So is every macroblock 8 × 8? Not exactly; there are also sub-block divisions 👇🏻
Within a large macroblock, you can subdivide further. The macroblock in the left-center of the figure is entirely blue, so it can be described with a single color block, which is much simpler.
To the right of the image above, MPEG-2 and H.264 are compared side by side: MPEG-2 stores the block in full, so it takes up relatively more space, while H.264 strips out a lot of that space; repeated colors, for example, are represented with very simple color blocks.
Frame grouping
For example, a billiard ball moves from one position to another. The desktop background stays the same; only the position of the ball shifts. So this set of frames can be put into one group.
9.2 Searching for Intra-group macroblocks
What is intra-group macroblock lookup?
As shown in the figure above, the billiard ball rolls from one corner to another, and we search for macroblocks between the two adjacent images. Scanning the picture line by line, by the third line we find the billiard ball, then look around it and find the similar blocks.
Motion estimation
Then the matched blocks are placed in the same picture. The billiard ball starts moving from position 1 to position 2 in the second picture, which gives a motion vector containing the direction and distance of the motion. Comparing all the pictures pairwise, we end up with the picture on the right: each red arrow is a motion vector, and many frames together form a continuous motion estimate.
So what do we want to achieve with this estimate?
Motion vector and compensation compression
Finally, the continuous motion estimates are converted into the result shown in the figure below 👇🏻

Next comes alignment and compression. All frames share the same background; what changes is the motion vector and the billiard-ball data. After the calculation, all that remains is the motion-vector data plus the residual data. By compressing between frames this way, we only need to store a small amount of data instead of the full image data of dozens of frames. That is the compression effect, and this process is called inter-frame compression.
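As a toy illustration of the block-matching idea (not the actual H.264 search algorithm, which is far more sophisticated), a macroblock can be compared against candidate positions in the previous frame using the sum of absolute differences (SAD), keeping the best offset as the motion vector:

```swift
// Toy full-search block matching on grayscale frames stored row-major as [UInt8].
// Returns the (dx, dy) offset in the reference frame with the smallest SAD.
func bestMotionVector(block: [UInt8], blockSize: Int,
                      reference: [UInt8], refWidth: Int, refHeight: Int,
                      blockX: Int, blockY: Int, searchRange: Int) -> (dx: Int, dy: Int) {
    var best = (dx: 0, dy: 0)
    var bestSAD = Int.max
    for dy in -searchRange...searchRange {
        for dx in -searchRange...searchRange {
            let ox = blockX + dx, oy = blockY + dy
            // candidate block must lie entirely inside the reference frame
            guard ox >= 0, oy >= 0, ox + blockSize <= refWidth, oy + blockSize <= refHeight else { continue }
            var sad = 0
            for row in 0..<blockSize {
                for col in 0..<blockSize {
                    let cur = Int(block[row * blockSize + col])
                    let ref = Int(reference[(oy + row) * refWidth + (ox + col)])
                    sad += abs(cur - ref)
                }
            }
            if sad < bestSAD {
                bestSAD = sad
                best = (dx, dy)
            }
        }
    }
    return best
}
```

The encoder would then store this motion vector plus the residual between the matched block and the current block, which is exactly the "motion vector + residual" data described above.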
9.3 Intra-frame prediction
Intra-frame compression is used for I frames, because what it removes is spatial data redundancy, while inter-frame compression removes temporal data redundancy. In the billiard-ball example above 👉🏻 the data that is identical over time is compressed away, leaving only the motion estimates and the residuals.
Principle of intra-frame compression
Within a frame, a different compression principle is used 👇🏻
As shown in the figure below, we must first calculate and choose which mode to use.
A different mode operation is applied to each macroblock.
Once a mode has been selected for each macroblock, the result is the one shown below. There are nine intra-frame prediction modes, as shown below 👇🏻
Introduction to the principle of the 9 intra-frame prediction modes (1)
After each macroblock has picked a mode, block prediction is performed using those modes, which produces a "prediction image" 👇
The prediction is on the left, and the source is on the right.
The calculated prediction differs from the original: the original image is relatively smooth, while the prediction is relatively rough.
Calculate intra-frame prediction residuals:
Then for these two graphs, do the difference calculation 👇🏻
The bottom image is our original. Taking the difference between the prediction and the original gives the result 👉🏻 the gray image, which is the residual.
Prediction modes and residual compression: once we have the residual, we compress. 👉🏻 During compression, the residual data and the mode information selected for each macroblock are saved. With these two pieces of data, the decoder first computes the prediction image from the macroblock's mode information, then adds the residual onto the prediction to restore the original image. This process is the principle of "intra-frame compression", as shown below 👇🏻
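A tiny sketch of that round trip (my illustration, on plain arrays rather than real macroblocks): subtract the prediction from the original to get the residual, then add the residual back onto the prediction to reconstruct the original.

```swift
// Prediction + residual round trip on flat arrays of samples.
func residual(original: [Int], prediction: [Int]) -> [Int] {
    zip(original, prediction).map { $0 - $1 }        // what the encoder stores (plus the mode)
}

func reconstruct(prediction: [Int], residual: [Int]) -> [Int] {
    zip(prediction, residual).map { $0 + $1 }        // what the decoder rebuilds
}

let original   = [120, 122, 125, 130]
let prediction = [121, 121, 126, 128]                // produced from the chosen intra mode
let res = residual(original: original, prediction: prediction)   // [-1, 1, -1, 2]
let back = reconstruct(prediction: prediction, residual: res)    // [120, 122, 125, 130]
```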
9.4 DCT compression
DCT compression is an integer discrete cosine transform technique. How does it compress? We first partition the image into macroblocks, then transform each macroblock with the DCT and quantize it.
After compression, the result looks like this 👇🏻
The data gathers in the upper-left corner, while the lower-right corner becomes empty, which reduces the amount of data. How does this work? The mathematics behind it runs deep; if you are interested, search for it online.
Further reading: DCT compression principle (1); DCT compression principle, Wikipedia.
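For a feel of why "the data gathers in the upper-left corner", here is a small one-dimensional DCT-II sketch (mine, for illustration; the 2D version used on macroblocks applies this along rows and then columns). A smooth input concentrates almost all of its energy in the first few coefficients:

```swift
import Foundation

// 1-D DCT-II (unnormalized): X[k] = Σ x[n] · cos(π/N · (n + 0.5) · k)
func dct(_ x: [Double]) -> [Double] {
    let n = x.count
    return (0..<n).map { k in
        (0..<n).reduce(0.0) { sum, i in
            sum + x[i] * cos(Double.pi / Double(n) * (Double(i) + 0.5) * Double(k))
        }
    }
}

// A smooth, slowly varying signal (like the pixels of a flat image region)...
let smooth: [Double] = [100, 101, 102, 103, 104, 105, 106, 107]
let coeffs = dct(smooth)
// ...yields coeffs[0] = 828 (the sum of the samples), while the higher-frequency
// coefficients are small by comparison. Quantization then turns those small
// high-frequency values into zeros, which is the "empty lower-right corner".
```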
9.5 VLC compression
VLC (variable-length coding) uses something similar to Huffman coding: short codes record high-frequency data, and long codes record low-frequency data.
VLC compression is lossless (see the small variable-length-code sketch below 👇🏻).
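As an illustration (my sketch, not taken from the article): one simple variable-length code that H264 uses for many syntax elements is unsigned Exp-Golomb coding, which writes a value v as a run of leading zero bits followed by the binary form of v + 1, so small (frequent) values get short codes and large (rare) values get long ones.

```swift
// Unsigned Exp-Golomb coding: small values -> short codes, large values -> long codes.
func expGolomb(_ value: UInt32) -> String {
    let codeNum = value + 1
    let bits = String(codeNum, radix: 2)                       // binary form of value + 1
    let prefix = String(repeating: "0", count: bits.count - 1) // one leading zero per extra bit
    return prefix + bits
}

// 0 -> "1", 1 -> "010", 2 -> "011", 3 -> "00100", 7 -> "0001000"
for v: UInt32 in [0, 1, 2, 3, 7] {
    print(v, expGolomb(v))
}
```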
CABAC compression (context-adaptive lossless compression)
VLC is essentially the technique used by MPEG2. H264 uses CABAC, a context-adaptive technique: on top of giving short codes to high-frequency data and long codes to low-frequency data, it adapts to the context, which increases the compression ratio further.
Comparison 👇🏻

- VLC: the compressed data blocks are relatively large; lossless compression.
- CABAC: as more data is compressed, more context information is accumulated and the compression ratio increases accordingly; the data blocks shrink from large to small.
Reference: CC teacher_HelloCoder, "Audio and video learning from zero to complete".