1.4 Understanding digital media
1.4.1 Digital media sampling
There are two main ways to digitize media. The first is time sampling, which captures how a signal changes over time: when you record audio, for example, every change in pitch and tone is captured over the course of the recording. The second is spatial sampling, which is generally used to digitize pictures and other visual media: the brightness and chromaticity of an image are captured at a certain resolution, and the result is a digital image composed of pixel data.
1.4.2 Digital Audio
Sound is a wave generated by the vibration of objects; it spreads through a medium (such as air) and can be perceived by human or animal auditory organs. In essence, a vibrating object makes the surrounding medium vibrate as well, producing alternating regions of compression and rarefaction that travel as a longitudinal wave.
Sound has three important characteristics: pitch, loudness and timbre. Pitch refers to how high or low a sound is and is determined by frequency (measured in Hz, Hertz): the higher the frequency, the higher the pitch. The human ear hears frequencies from roughly 20 Hz to 20,000 Hz; below 20 Hz is infrasound and above 20,000 Hz is ultrasound. Loudness is the subjectively perceived size of a sound (commonly called volume) and is determined by the amplitude and by the distance between the listener and the sound source: the greater the amplitude, and the smaller the distance, the greater the loudness. Timbre (also called tone quality) is determined by the waveform of the sound.
Digital audio is the technology of recording, storing, editing, compressing and playing sound by digital means. Because a sound can be decomposed, via the Fourier transform, into a superposition of sine waves of different frequencies and intensities, an analog signal can be converted into an electrical signal and stored on a computer as 0s and 1s.
Digital audio involves two important variables. One is the sampling frequency, i.e. how often the signal is measured and stored as an electrical value: the shorter the sampling period, the higher the frequency, and the more faithfully the sound is reproduced. The other is the quantization depth (bit depth), i.e. how many bits are used to store each sample. A computer can only store a limited amount of data and cannot represent the measured values with infinite precision, so each value has to be rounded to the nearest quantization level. Commonly used bit depths are 8, 16 and 32 bits.
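To make these two variables concrete, here is a minimal sketch in plain Python (the 440 Hz tone, sampling rate and bit depth are arbitrary example values, not taken from the text) that samples a sine wave and quantizes each sample to a signed integer:

```python
import math

def sample_and_quantize(freq_hz=440.0, sample_rate=44100, bit_depth=16, duration_s=0.01):
    """Sample a sine wave and quantize each sample to a signed integer."""
    max_level = 2 ** (bit_depth - 1) - 1          # e.g. 32767 for 16-bit audio
    n_samples = int(sample_rate * duration_s)     # shorter period -> more samples
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                       # time of the n-th sample
        analog = math.sin(2 * math.pi * freq_hz * t)   # "analog" value in [-1, 1]
        samples.append(round(analog * max_level))      # quantize: round to nearest level
    return samples

print(sample_and_quantize()[:5])   # first few 16-bit sample values
```

A higher sampling rate means more samples per second, and a larger bit depth means each sample is rounded less coarsely; both bring the stored data closer to the original signal at the cost of more storage.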
Digitizing sound requires a lot of space if the original data is kept without any compression: one minute of 44.1 kHz, 16-bit, two-channel LPCM audio takes up roughly 10 MB. The industry has therefore introduced a number of standard formats for compressing digital audio. The following are some commonly used formats:
- WAV: an audio format developed by Microsoft; it supports compressed audio but is most often used to store uncompressed, lossless audio
- MP3: a common lossy compression format used to dramatically reduce the size of audio files
- AAC: currently one of the most popular formats; compared with MP3 it offers better sound quality at smaller file sizes, and in the ideal case can compress a file to about 1/18 of its original size
- APE: a lossless compression format that can shrink a file to roughly half its original size
- FLAC: a free, lossless compression format
The raw audio data is generally too large to use directly. Its size can be calculated as sampling rate × sample format (bit depth) × number of channels × duration. Assuming the MOV audio mentioned earlier has a sampling rate of 44100 Hz, a 16-bit sample format, one channel (mono) and a length of 24 seconds, its raw size would be:
44100 * 16 * 1 * 24 / 8 ≈ 2 MB
The audio stream actually extracted from the file is only about 200 KB, and that is where audio coding comes in.
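As a sanity check on the arithmetic above, here is a minimal sketch in plain Python, using the 44100 Hz / 16-bit / mono / 24-second example values plus the two-channel per-minute figure mentioned earlier:

```python
def raw_pcm_size_bytes(sample_rate, bit_depth, channels, duration_s):
    """Raw (uncompressed) PCM size = rate * bits per sample * channels * seconds / 8."""
    return sample_rate * bit_depth * channels * duration_s // 8

size = raw_pcm_size_bytes(44100, 16, 1, 24)
print(size, "bytes ≈", round(size / 1024 / 1024, 2), "MB")    # ~2.02 MB

# The same formula reproduces the "10 MB per minute" figure for 44.1 kHz / 16-bit stereo:
per_minute = raw_pcm_size_bytes(44100, 16, 2, 60)
print(round(per_minute / 1024 / 1024, 1), "MB per minute")    # ~10.1 MB
```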
1.4.3 Digital Media Compression
In general, before a loaded video stream can be played, it has to go through de-protocol, de-encapsulation (demuxing) and decoding steps. Here, protocol refers to the streaming-media protocol; encapsulation refers to the container format of the video; and coding is divided into video coding and audio coding.
The encapsulation format is what the familiar file suffixes such as MP4, AVI, RMVB, MKV, TS and FLV refer to. These are multimedia container formats that package the audio and video streams together for transmission, so before playback this layer has to be unpacked to extract the corresponding audio and video bitstreams.
Video coding
Video coding refers to how the picture frames are encoded and compressed. Common codecs include H263, H264, HEVC (H265), MPEG-2 and MPEG-4, among which H264 is currently the most widely used.
We usually think of a picture as a combination of RGB values, but the video field tends to use the YUV format, where Y represents luminance (brightness, essentially a grayscale value) and U and V represent chrominance (the color information).
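To illustrate the relationship between the two color spaces, here is a minimal sketch of an RGB-to-YUV conversion using the BT.601 coefficients (one common definition; real video pipelines may use different matrices and value ranges):

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel (0-255 per channel) to Y'UV using BT.601 coefficients."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance: weighted sum of R, G, B
    u = 0.492 * (b - y)                     # chrominance: blue difference
    v = 0.877 * (r - y)                     # chrominance: red difference
    return y, u, v

print(rgb_to_yuv(128, 128, 128))   # pure gray pixel: Y ≈ 128, U ≈ 0, V ≈ 0
```

Note that a pure gray pixel ends up with zero chrominance, so the Y plane alone already carries a usable black-and-white picture.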
YUV is obtained from RGB through a specific transformation, with the color information stored separately from the brightness. YUV420, for example, stores the chroma (U and V) at a quarter of the luma resolution, so the two chroma planes together add only half a byte per pixel on top of the luma, and the picture is reconstructed from luma plus chroma. We will not go further into YUV here, but one reason it is used at all is backward compatibility with black-and-white television, which only needs the Y component. Why not simply transmit raw YUV? Assuming the MOV video above used the YUV420 format directly, the size of a single 1920×1080 frame would be:
1080 * 1920 * 1 + 1080 * 1920 * 0.5 ≈ 2.9 MB
If we then factor in the frame rate (30 fps) and the length of the video (say one hour), the raw size of a single video becomes astronomical (roughly 2.9 MB × 30 × 3600, i.e. over 300 GB). That is clearly unsuitable for network transmission, which is why video coding is used to compress the images.
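The following sketch in plain Python (using the 1920×1080, 30 fps, one-hour example figures above) works through that estimate:

```python
def yuv420_frame_bytes(width, height):
    """One 8-bit YUV420 frame: 1 byte of luma per pixel plus half a byte of chroma."""
    luma = width * height            # Y plane: full resolution
    chroma = width * height // 2     # U + V planes: quarter resolution each
    return luma + chroma

frame = yuv420_frame_bytes(1920, 1080)
print(round(frame / 1024 / 1024, 2), "MB per frame")               # ~2.97 MB

raw_video = frame * 30 * 3600        # 30 fps for one hour, uncompressed
print(round(raw_video / 1024 ** 3), "GB per hour of raw video")    # ~313 GB
```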
In video compression, there are a few more concepts to know, such as:
- 1. IPB frames are a common frame-level compression scheme. The I frame is a key frame and serves as the reference picture; the P frame is a forward-predicted frame; the B frame is a bidirectionally predicted frame. Put simply, an I frame can be decoded into a complete picture on its own, a P frame needs the preceding I or P frame in order to decode a complete picture, and a B frame needs both the preceding I/P frame and the following P frame to reconstruct its picture.
The I frame is therefore very important: compressing the I frame reduces spatial redundancy within a picture, while compressing P/B frames removes redundant information across time. I frames are also the key to seeking; if seeking in a video always jumps forward to a later position, the video has most likely been compressed too aggressively (too few I frames).
- 2. There is also the concept of an IDR frame. Because H264 uses multi-frame prediction, an ordinary I frame cannot by itself serve as a clean reference point, so a special kind of I frame, the IDR frame, is defined for this purpose. The key property of the IDR frame is that as soon as the decoder receives one, it immediately empties the reference-frame buffer and treats the IDR frame as the new reference.
- 3. Video decoding also involves DTS (Decoding Time Stamp) and PTS (Presentation Time Stamp). DTS is used for decoding, while PTS is used for synchronization and display during playback.
Because the packets in a video stream are not stored in display order, the DTS obtained from the data source determines when each packet should be decoded, and the resulting PTS determines when the decoded picture is actually drawn (see the sketch after this list).
- 4. GOP (Group Of Pictures) is the distance between two I frames. Generally, the larger the GOP, the better the picture can look but the longer decoding takes. With a fixed bit rate, a larger GOP means more P/B frames per I frame, and therefore higher picture quality.
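As referenced in item 3, here is a minimal sketch in plain Python, with made-up timestamps, showing how decode order (DTS) can differ from presentation order (PTS) once B frames are involved: a B frame is shown before the P frame it depends on, but must be decoded after it.

```python
# A tiny GOP in decode order: each tuple is (frame_type, dts, pts).
# The B frames reference the P frame that follows them in presentation order,
# so the decoder receives that P frame first (lower DTS) even though it is shown later.
packets = [
    ("I", 0, 0),
    ("P", 1, 3),
    ("B", 2, 1),
    ("B", 3, 2),
    ("P", 4, 6),
    ("B", 5, 4),
    ("B", 6, 5),
]

decode_order = [f for f, _, _ in sorted(packets, key=lambda p: p[1])]
display_order = [f for f, _, _ in sorted(packets, key=lambda p: p[2])]

print("decode order :", decode_order)    # I P B B P B B  (order the decoder consumes packets)
print("display order:", display_order)   # I B B P B B P  (order the pictures appear on screen)
```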