The audio and video industry has been developing for many years, and with the rise of mobile devices more and more audio and video apps have appeared, pushing the field to a new peak. But the learning cost of audio and video is high, so to help developers keep pace with the times I wrote this article on audio and video fundamentals. It explains the essential concepts and, I hope, lowers the “high threshold” of audio and video so that we can all make progress together.
Audio
The process of saving sound as audio is really the process of digitizing an analog audio signal. To achieve this, the analog audio must be sampled, quantized and encoded. Let’s go through each step in more detail.
Sampling
Sampling is the process of converting a signal from an analog signal in the continuous time domain into a discrete signal in the discrete time domain (discrete meaning discontinuous). According to the famous Nyquist theorem, sampling must be carried out at at least twice the highest frequency of the sound. The human ear can hear roughly 20 Hz ~ 20 kHz, so the typical sampling rate for sound is 44.1 kHz. (Why 44.1 kHz instead of exactly 40 kHz? The extra margin leaves room for the anti-aliasing filter, and historically 44.1 kHz was also convenient for early digital recorders that stored audio on video equipment.) Some other sampling rates are listed below for reference.
- 8,000 Hz – telephone sampling rate, sufficient for human speech
- 11,025 Hz
- 22,050 Hz – sampling rate used for radio broadcasting
- 32,000 Hz – miniDV digital camcorders and DAT (LP mode)
- 44,100 Hz – audio CD, also commonly used for MPEG-1 audio (VCD, SVCD, MP3)
- 47,250 Hz – the world’s first commercial PCM recorder, developed by Nippon Columbia (Denon)
- 48,000 Hz – digital sound for miniDV, digital TV, DVD, DAT, film and professional audio
- 50,000 Hz – the first commercial digital recorders, developed by 3M and Soundstream in the late 1970s
- 50,400 Hz – the Mitsubishi X-80 digital recorder
- 96,000 or 192,000 Hz – DVD-Audio, some LPCM DVD tracks, Blu-ray Disc tracks and HD DVD tracks
- 2.8224 MHz – Direct Stream Digital, a 1-bit sigma-delta modulation process developed by Sony and Philips for SACD
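To make the sampling step concrete, here is a minimal Kotlin sketch of my own (not code from the original article) that samples a 440 Hz sine tone at 44.1 kHz; each array element is one sample taken at a point on the time axis.

```kotlin
import kotlin.math.PI
import kotlin.math.sin

// Sample a pure 440 Hz tone (the musical note A4) at the CD rate of 44.1 kHz.
fun sampleSine(frequencyHz: Double, sampleRate: Int, durationSec: Double): DoubleArray {
    val totalSamples = (sampleRate * durationSec).toInt()
    return DoubleArray(totalSamples) { n ->
        // n / sampleRate is the time, in seconds, of the n-th sample on the X axis
        sin(2.0 * PI * frequencyHz * n / sampleRate)
    }
}

fun main() {
    val samples = sampleSine(440.0, 44_100, 1.0)
    println("1 second of audio -> ${samples.size} samples") // 44100
}
```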
Sampling, then, cuts the analog signal, which is continuous in time, into discrete points along the time (X) axis. Where there is an X axis there is also a Y axis, and the operation on the Y axis is quantization.
Quantization
What is quantization? It means representing the sampled information with concrete numbers: in terms of the picture, giving each point on the X (time) axis its corresponding value on the Y (amplitude) axis. To express the value of a sample we first have to decide how many bits to spend on it (this is called the quantization format, or bit depth); one binary unit is a bit, and common choices are 8 bits, 16 bits and 32 bits.
The decibel (dB) is the unit used to describe loudness. The useful range for the human ear is roughly 0 to 90 dB: 0 dB is the faintest sound we can hear, and hearing is seriously affected in a 90 dB environment. Each bit of quantization records roughly 6 dB of dynamic range, which gives the table below.
Bit depth | Range | Dynamic range (dB) |
---|---|---|
8 bits | 0 ~ 2^8-1 <=> 0 ~ 255 | 0 ~ 48 |
16 bits | 0 ~ 2^16-1 <=> 0 ~ 65535 | 0 ~ 96 |
32 bits | 0 ~ 2^32-1 <=> 0 ~ 4294967295 | 0 ~ 192 |
Generally we use 16 bits: its 0 ~ 65535 range is enough to cover the dynamic range of sound, whereas 8 bits only gives 0 ~ 255 levels and cannot record changes finely enough. Why not use 32 bits, then? Because 32 bits consumes too much storage; if you want finer audio you can still use it (a small sketch of the quantization step follows). Since quantization is applied to every sample, the next question is how to store all of these values, which is what the audio coding section below addresses.
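Here is a minimal sketch of my own showing what 16-bit quantization does to a single normalized sample (the function name and scaling are illustrative, not from the article):

```kotlin
// Map a normalized sample in [-1.0, 1.0] onto a signed 16-bit value (the usual PCM s16 range).
fun quantizeTo16Bit(sample: Double): Short {
    val clamped = sample.coerceIn(-1.0, 1.0)   // guard against out-of-range input
    return (clamped * 32767).toInt().toShort() // scale to -32767 ~ 32767
}

fun main() {
    println(quantizeTo16Bit(0.5))   // 16383
    println(quantizeTo16Bit(-1.0))  // -32767
}
```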
Channels
What is a channel? Mono and stereo are the most common configurations.
- Mono – a single sound source; in many places it has been replaced by stereo
- Stereo – two or more sound sources
When we stand in different positions and listen to mono, it sounds different from each spot, whereas stereo still sounds natural and full from different directions. CDs today are generally two-channel stereo, i.e. the channel count is 2.
Audio coding
So-called encoding means recording the sampled and quantized digital data in a particular format, either stored as-is in sequence or compressed to save space. Broadly it falls into two categories: uncompressed encoding and compressed encoding.
There are many storage formats. Raw audio data is usually what we call PCM, or Pulse-Code Modulation, a method of representing an analog signal digitally; it is an uncompressed format. In general, describing a piece of PCM data requires the following parameters:
- Quantization format – SampleFormat, typically 16 bits
- Sampling rate – SampleRate, typically 44.1 kHz
- Number of channels – Channel, typically 2
Generally we describe audio in terms of bit rate, that is, the number of bits per second (bps):
Bit rate = sampling rate × bit depth (quantization format) × number of channels
From this we get: bit rate = 44100 × 16 × 2 = 1,411,200 bps = 1411.2 kbps, and the amount of data in one minute: 1,411,200 × 60 / 8 / 1024 / 1024 ≈ 10.09 MB.
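The same arithmetic as a small Kotlin sketch of my own (the function name is illustrative):

```kotlin
// Raw PCM bit rate = sampling rate * bit depth * channel count, in bits per second.
fun pcmBitRate(sampleRate: Int, bitDepth: Int, channels: Int): Long =
    sampleRate.toLong() * bitDepth * channels

fun main() {
    val bps = pcmBitRate(44_100, 16, 2)
    println("bit rate: $bps bps = ${bps / 1000.0} kbps")          // 1411200 bps = 1411.2 kbps
    val bytesPerMinute = bps * 60 / 8
    println("one minute: ${bytesPerMinute / 1024.0 / 1024.0} MB") // about 10.09 MB
}
```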
The following is the basic information I obtained from an MP3 file. You can see that the quantization format, sampling rate and number of channels are the same as above, yet its bit rate is only 320 kbps, which does not match the formula we just derived. This is where compressed encoding formats come in.
Duration: 00:00:39.99, start: 0.011995, bitrate: 320 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, s16p, 320 kb/s
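For a rough sense of how much the encoder saves, dividing the raw PCM bit rate by the MP3 bit rate above gives the compression ratio (a back-of-the-envelope sketch of my own):

```kotlin
fun main() {
    val rawKbps = 44_100 * 16 * 2 / 1000.0  // raw PCM bit rate in kbps
    val mp3Kbps = 320.0                     // bit rate reported for the MP3 file above
    println("compression ratio ≈ ${rawKbps / mp3Kbps} : 1") // about 4.4 : 1
}
```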
Compression coding compresses the data. Its principle is to remove redundant signals, that is, signals the human ear cannot perceive, such as audio outside the audible frequency range and sounds masked by louder ones. Generally there are two kinds of compression:
- Lossless compression: no information is lost, and the data can be restored exactly to its original form after decompression
- Lossy compression: the decompressed data differs from the original, but is close enough that the difference is hard to perceive
In music applications we often see “lossless music”; if the file turns out to be in MP3 format, it is actually lossy compression, the so-called fake lossless. Common lossy formats are as follows:
- MP2
- MP3
- AAC
- WMA
- ADPCM
- ATRAC
- Dolby AC-3
- Musepack
- Ogg Vorbis
- Opus
- LDAC
Video
With audio out of the way, let’s talk about video. First of all, what is a video? We all know that a video is made up of images, frame by frame, so before talking about video, what is an image?
Image
We learned in physics that light can be dispersed into different colors by a triangular prism. Further study found that red (R), green (G) and blue (B) cannot be decomposed any further, so they are called the three primary colors of light.
When we buy a mobile phone we usually look at its resolution. The higher the resolution the better, because more pixels mean a clearer picture that is closer to the original appearance of things. Why is this?
In fact, to let people perceive images, the phone screen also uses this RGB model. Take “1080×1920” as an example: there are 1080 pixels horizontally and 1920 pixels vertically, so 1080×1920 = 2,073,600 pixels in total. Each pixel contains three sub-pixels, red (R), green (G) and blue (B), so that every pixel can show its own full color.
Image representation
We know that red (R), green (G) and blue (B) can each be represented by 00 ~ FF, i.e. 0 ~ 255, and from the audio section we know that 8 bits is exactly enough for 0 ~ 255. A pixel contains the three sub-pixels R, G and B, so it needs at least 24 bits. We usually also add an opacity (alpha, A) channel, so a pixel is actually 32 bits; this is the common RGBA_8888 representation. How much space, then, does a full-screen image need on a phone with the resolution above?
1080 × 1920 × 4 = 8,294,400 B = 8100 KB = 7.91 MB
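As an illustration of my own (the helper name is hypothetical), this is how an RGBA_8888 pixel packs into 32 bits and how the full-screen figure above is reached:

```kotlin
// Pack one RGBA_8888 pixel: 8 bits each for alpha, red, green and blue -> 32 bits (4 bytes).
fun packRgba(r: Int, g: Int, b: Int, a: Int): Int =
    (a and 0xFF shl 24) or (r and 0xFF shl 16) or (g and 0xFF shl 8) or (b and 0xFF)

fun main() {
    val opaqueRed = packRgba(255, 0, 0, 255)
    println("0x%08X".format(opaqueRed)) // 0xFFFF0000

    val bytes = 1080L * 1920 * 4        // 4 bytes per pixel
    println("$bytes B = ${bytes / 1024} KB = ${"%.2f".format(bytes / 1024.0 / 1024.0)} MB")
    // prints: 8294400 B = 8100 KB = 7.91 MB
}
```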
This is also how much memory a bitmap occupies. The raw data of each image is very large, so loading raw bitmaps directly on a phone exhausts memory quickly, and sending raw images over the network is even worse. Images are therefore generally compressed; the common formats are:
- BMP – essentially uncompressed (at most simple lossless run-length encoding)
- PNG – lossless compression
- JPEG – Lossy compression
This is why PNG is used for small images (such as icons) and JPEG for large ones: small images are compressed losslessly and can be scaled without becoming too blurry, while JPEG keeps large images acceptably clear at a much smaller size.
Video representation
We generally use YUV to represent raw video data. YUV is also a color encoding method, so why not use RGB? Compared with transmitting an RGB video signal, its biggest advantage is that it occupies far less bandwidth (RGB requires three independent video signals to be transmitted simultaneously).
“Y” stands for luminance (luma), also known as the gray-scale value. “U” and “V” stand for chrominance (chroma), which describes the color and saturation of the image and specifies the color of each pixel. If you ignore U and V, you are left with only the gray Y signal, just like the old black-and-white TV signal; in fact YUV was invented to allow the transition from black-and-white TV to color TV.
U and V are carried as Cb and Cr: Cb reflects the difference between the blue component of the RGB input and the luminance of the RGB signal, while Cr reflects the difference between the red component and the luminance. The chroma signals tell the display how far to shift a color away from a neutral baseline; the larger the UV values, the more saturated the pixel.
To save bandwidth, most YUV formats use fewer than 24 bits per pixel on average. The main chroma-subsampling formats are YCbCr 4:2:0, YCbCr 4:2:2, YCbCr 4:1:1 and YCbCr 4:4:4, described with the A:B:C notation:
- 4:4:4 represents full chroma sampling, comparable to RGB
- 4:2:2 represents 2:1 horizontal subsampling and full vertical sampling
- 4:2:0 represents 2:1 horizontal subsampling and 2:1 vertical subsampling
- 4:1:1 represents 4:1 horizontal subsampling and full vertical sampling
The data volume of a frame of video displayed on a 1080×1920 mobile phone is as follows:
YUV format | Size (1080×1920 resolution) |
---|---|
4:4:4 | 1080 × 1920 × 3 = 6,220,800 B = 6075 KB = 5.93 MB |
4:2:2 | 1080 × 1920 × (1+0.5+0.5) = 4,147,200 B = 4050 KB = 3.96 MB |
4:2:0 | 1080 × 1920 × (1+0.5+0) = 3,110,400 B = 3037.5 KB = 2.97 MB |
4:1:1 | 1080 × 1920 × (1+0.25+0.25) = 3,110,400 B = 3037.5 KB = 2.97 MB |
As the table above shows, a YUV 4:2:0 frame is almost 3 MB smaller than the same frame in RGB, which is the main reason video uses YUV instead of RGB.
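The table values can be reproduced with a short sketch of my own; the factor for each format is the average number of bytes spent per pixel on Y, U and V:

```kotlin
// Average bytes per pixel for each chroma-subsampling scheme (Y + U + V).
val bytesPerPixel = mapOf(
    "4:4:4" to 3.0,  // 1 + 1 + 1
    "4:2:2" to 2.0,  // 1 + 0.5 + 0.5
    "4:2:0" to 1.5,  // 1 + 0.5 + 0 (chroma on alternate rows), 1.5 on average
    "4:1:1" to 1.5   // 1 + 0.25 + 0.25
)

fun main() {
    val pixels = 1080 * 1920
    for ((format, bpp) in bytesPerPixel) {
        val bytes = pixels * bpp
        println("$format: ${"%.0f".format(bytes)} B = ${"%.2f".format(bytes / 1024 / 1024)} MB")
    }
}
```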
It should be noted that video generally uses YUV 4:2:0. YUV 4:2:0 does not mean that only U (Cb) is kept and V (Cr) is always 0; it means that in each row only one of the two chroma components is sampled, at half the horizontal resolution, and the rows alternate between U and V. If one row is 4:2:0, the next row is 4:0:2, the next 4:2:0 again, and so on.
So how is YUV converted into the RGB data that is actually presented on the phone? That requires a conversion formula.
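The formula itself varies by standard; as a hedged example of my own, here is the commonly used BT.601 full-range conversion, with U and V stored around an offset of 128:

```kotlin
// Convert one 8-bit YUV (BT.601, full range) pixel to RGB.
// U and V carry an offset of 128 so that "no color" sits in the middle of their range.
fun yuvToRgb(y: Int, u: Int, v: Int): Triple<Int, Int, Int> {
    val cb = u - 128
    val cr = v - 128
    val r = (y + 1.402 * cr).toInt().coerceIn(0, 255)
    val g = (y - 0.344136 * cb - 0.714136 * cr).toInt().coerceIn(0, 255)
    val b = (y + 1.772 * cb).toInt().coerceIn(0, 255)
    return Triple(r, g, b)
}

fun main() {
    println(yuvToRgb(128, 128, 128)) // mid gray   -> (128, 128, 128)
    println(yuvToRgb(255, 128, 128)) // pure white -> (255, 255, 255)
}
```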
At this point we know what video really is: a video is composed of a sequence of frames, each frame is raw YUV data, YUV can be converted to and from RGB, and the converted RGB is what finally appears on the phone.
Video coding
Before introducing coding, let’s introduce two concepts:
- Frame rate (fps) – the number of frames displayed per unit of time (s). 24 fps is generally acceptable for video; for games, anything below 30 fps feels incoherent, which is what we usually call stuttering.
- Bit rate – the amount of data transferred per unit of time (s).
We can calculate the raw data rate for playing YUV 4:2:0 video on a 1080×1920 phone:
Data rate = 1080 × 1920 × (1 + 0.5 + 0) bytes × 24 fps = 74,649,600 B/s ≈ 71.2 MB/s
The amount of data in a 90-minute movie follows from that:
Total = 71.2 MB/s × 60 × 90 ≈ 375.4 GB
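The same calculation as a Kotlin sketch of my own:

```kotlin
fun main() {
    val bytesPerFrame = 1080 * 1920 * 1.5      // YUV 4:2:0 uses 1.5 bytes per pixel on average
    val bytesPerSecond = bytesPerFrame * 24    // 24 frames per second
    val mbPerSecond = bytesPerSecond / 1024 / 1024
    println("raw data rate ≈ ${"%.1f".format(mbPerSecond)} MB/s") // about 71.2 MB/s

    val movieGb = mbPerSecond * 60 * 90 / 1024 // a 90-minute movie
    println("90-minute movie ≈ ${"%.1f".format(movieGb)} GB")     // about 375.4 GB
}
```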
This is definitely unacceptable, so video has to be encoded. Similar to audio, it generally uses compressed encoding; otherwise there is no way to ship it.
Unlike audio, video data has very strong internal correlation, that is, a large amount of redundant information, which comes in two forms: temporal redundancy and spatial redundancy.
- Temporal redundancy – in video data, adjacent frames are usually strongly correlated; this correlation is temporal redundancy.
- Spatial redundancy – within the same frame, adjacent pixels are usually strongly correlated; this correlation is spatial redundancy.
There are two common families of video coding standards:
- The MPEG series – including MPEG-1 (used by VCD), MPEG-2 (used by DVD) and MPEG-4 AVC (the most widely used for streaming media today)
- The H.26x series – including H.261, H.262, H.263 and H.264 (the most widely used for video today; H.264 is in fact the same standard as MPEG-4 AVC)
I, P and B frames
MPEG defines I frames, P frames and B frames, and applies a different compression strategy to each.
- I frame – an intra-coded frame, usually the first frame of a group; it is compressed on its own into a complete picture, so an I frame removes redundancy in the spatial dimension of the frame.
- P frame – a forward-predicted frame; it must reference a preceding I or P frame to be decoded into a complete picture.
- B frame – a bidirectionally predicted (interpolated) frame; it must reference both the preceding I or P frame and the following P frame to produce a complete picture. P and B frames therefore remove redundancy in the time dimension.
Note: among I frames there is a special kind called the IDR frame. It is still an I frame, but once an IDR frame appears, no subsequent frame may reference any frame before it; it acts as a watershed in the stream.
As we said earlier, a video is a sequence of images, and each image is a frame. Frames are processed in groups; a commonly used group is 15 frames arranged as IBBPBBPBBPBBPBB. Such a group of frames is also called a GOP, as shown in the figure below.
As the figure shows, when B frames are present the decoding order differs from the display order; without B frames the two orders are the same.
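As a toy illustration of my own (real decoders manage reference frames far more carefully), reordering a display-order GOP so that every B frame comes after the forward reference it needs yields the decode order:

```kotlin
// Turn a display-order sequence of frame types into a (simplified) decode order:
// each B frame must wait until the I or P frame it references forward has been decoded.
fun decodeOrder(displayOrder: String): String {
    val pendingB = StringBuilder()
    val out = StringBuilder()
    for (f in displayOrder) {
        if (f == 'B') {
            pendingB.append(f)   // hold B frames until their forward reference arrives
        } else {
            out.append(f)        // decode the I or P frame first
            out.append(pendingB) // then the B frames that depend on it
            pendingB.clear()
        }
    }
    out.append(pendingB)         // trailing B frames (in reality they wait for the next GOP)
    return out.toString()
}

fun main() {
    val gop = "IBBPBBPBBPBBPBB"
    println("display order: $gop")
    println("decode order : ${decodeOrder(gop)}") // IPBBPBBPBBPBBBB
}
```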
That covers the fundamentals of audio and video. If anything above is unclear or incorrect, please leave a message in the comments section below; I hope we can encourage each other.
Reference
Advanced Audio and Video Development Guide: Practice based on Android and iOS platforms