An overview of audio and video development

With the rise of the Internet, the dominant form of data transmission has kept evolving. The general trend looks like this:

Plain text messages (SMS, QQ) -> image-and-text posts (Qzone, Weibo, WeChat Moments) -> WeChat voice messages -> live-streaming apps -> short video (Douyin, Kuaishou)

Audio and video technology is expanding into all kinds of industries, from distance learning in education and face recognition in transportation to telemedicine and more. The field has taken a very important position, yet very few articles genuinely introduce it. A fresh graduate may find it hard to get started, because audio and video involves a great deal of theory, and writing the code requires combining it with that theory, so understanding topics such as encoding and decoding really matters. I have worked on audio and video projects since my internship and have read many people's articles; here I summarize an easy-to-understand introduction so that more students preparing to learn audio and video can get started faster.

Notes

The theory in this article is distilled from the basic principles of audio and video coding covered in many other articles, together with some conclusions of my own. If you spot any errors, please point them out in the comments; they will be corrected after review.

To keep things from feeling too abstract, I spent three months writing demos for the most common and most important features to accompany the theory; reading the article alongside them works best. The articles for each part can be explored through the links at the beginning of each chapter, and the GitHub addresses for those articles are given in the links. Each demo can be downloaded and run for testing.

If you find this useful, please give it a like. Reposting is welcome; please attach a link to the original.

Principles

  • Capture

    Whether on iOS or Android, we rely on the official APIs to implement these features. First we need to be clear about what we want: to begin with, we need a phone. The camera is an indispensable part of a smartphone, so through a handful of system APIs we can obtain the video data captured by the physical camera and the audio data captured by the microphone.

  • Processing

    Raw audio and video data is essentially a large block of bytes, which the system packages into its own structures and usually hands to us through callback functions. After receiving the audio and video data, each project can apply whatever special processing it needs, for example: rotation, scaling, filters, beautification, and cropping for video; mono mixing, noise reduction, echo cancellation, and muting for audio.

  • Encoding

    After this custom processing, the raw data can be transmitted. In a live-streaming feature, for instance, the video data is sent to a server and eventually watched by all of the fans. Transmission itself depends on the network environment, however, and the huge amount of raw data must be compressed before it can be carried. Think of it like moving house: we pack our belongings into suitcases before taking them with us.

  • Transmission

    Encoded audio and video data is usually transmitted using RTMP, a protocol designed specifically for audio and video transmission. Because the many video data formats cannot be unified, a standard is needed to define the transmission rules, and that is exactly what a protocol does.

  • Decoding

    After the receiving end gets the encoded data we sent, it has to decode it back into raw data, because encoded data cannot be fed straight to the physical hardware for playback; only decoded raw data can be used.

  • Audio and video synchronization

    Every decoded audio and video frame carries a timestamp set at recording time, and we need to play the frames out according to those timestamps. However, some data may be lost or delayed in network transmission, so we need certain strategies to keep audio and video in sync. These include buffering a certain amount of video data, syncing video to audio, and so on.

The push-stream and pull-stream process

  • Push stream: the audio and video captured by the phone are sent to a remote player for display; the player can run on Windows, Linux, or the Web. In other words, the phone acts as the capture device: the video from its camera and the audio from its microphone are combined, encoded, and then sent to the player on the corresponding platform.

  • Pull stream: playing remote audio and video data on the phone, the reverse of pushing. The data coming from the Windows, Linux, or Web side is decoded and sent to the corresponding audio and video hardware, and finally the video is rendered and played on the phone's screen.

The push-stream process is as follows:

The pull-stream process is as follows:

Detailed analysis

Pushing and pulling a stream are essentially inverse processes; we will walk through them starting from capture.

1. Capture

Capture is the first link in the push-stream pipeline and the source of the raw audio and video data. The raw data types are PCM for audio and YUV or RGB for video.

1.1. Audio capture

  • Further reading

    • Introduction to iOS Core Audio
    • iOS Audio Session: managing the audio context
    • iOS Audio Queue: capturing and playing audio data
    • iOS Audio Queue: capturing audio data
    • iOS Audio Unit: capturing audio data
  • Capture sources

    • The built-in microphone
    • External devices with microphone capability (cameras, microphones…)
    • The system photo album
  • Main audio parameters

    • Sample rate: the number of samples taken per second when digitizing the analog signal. The higher the sample rate, the more data and the better the sound quality.
    • Channels: mono or stereo. (The iPhone cannot capture two channels directly, but it can simulate stereo by duplicating the captured mono channel; some Android models can capture true stereo.)
    • Bit depth: the size of each sample. More bits mean finer resolution and better sound quality; 8 or 16 bits is typical.
    • Data format: the raw audio captured by iOS devices is linear PCM.
    • Others: you can also set the precision of the sample values, how many frames of data go into each packet, how many bytes are in each frame, and so on.
  • Audio frame

    Unlike video, where every frame is a picture, audio is a stream and has no inherent notion of a frame. In practice, for convenience, a chunk of 2.5 ms to 60 ms of data is treated as one audio frame.

  • Calculation

    Data rate (bytes per second) = (sample rate (Hz) × bit depth (bits) × number of channels) / 8

    Mono has 1 channel and stereo has 2. B stands for bytes; 1 MB = 1024 KB = 1024 × 1024 B. These parameters and this formula are shown in code in the sketch below.
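
To make the audio parameters concrete, here is a minimal Swift sketch (the values are chosen for illustration, not taken from the article's demos) that describes 44.1 kHz, 16-bit, mono linear PCM with an AudioStreamBasicDescription, the structure that Audio Queue and Audio Unit based capture code is configured with, and then applies the data-rate formula above.

```swift
import AudioToolbox

// Describe 44.1 kHz, 16-bit, mono, interleaved linear PCM (the raw format discussed above).
let sampleRate: Float64    = 44_100
let channels: UInt32       = 1
let bitsPerChannel: UInt32 = 16
let bytesPerFrame = channels * bitsPerChannel / 8

var asbd = AudioStreamBasicDescription(
    mSampleRate:       sampleRate,
    mFormatID:         kAudioFormatLinearPCM,
    mFormatFlags:      kLinearPCMFormatFlagIsSignedInteger | kLinearPCMFormatFlagIsPacked,
    mBytesPerPacket:   bytesPerFrame,        // for PCM: one frame per packet
    mFramesPerPacket:  1,
    mBytesPerFrame:    bytesPerFrame,
    mChannelsPerFrame: channels,
    mBitsPerChannel:   bitsPerChannel,
    mReserved:         0)

// Data rate = (sample rate * bit depth * channels) / 8
let bytesPerSecond = sampleRate * Float64(bitsPerChannel) * Float64(channels) / 8
print("PCM data rate: \(bytesPerSecond) bytes/s")   // 88200 bytes/s for this format

// `asbd` is what would be handed to AudioQueueNewInput or an Audio Unit when setting up capture.
```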

1.2. Video capture

  • Further reading
    • iOS AVCaptureSession collects video data
    • iOS AVCaptureSession collects video data (Demo)
    • Introduction to raw video data (YUV)
  • Capture sources
    • The camera
    • Screen recording
    • External devices with camera capture capability (cameras, DJI drones, Osmo…)
    • The system photo album

Note: some setups use an external camera, for example capturing with a dedicated camera and then using the phone to process, encode, and send the data. That also works, but we need to look at the data path: from the camera's HDMI output through a capture cable, from the cable to USB, and from USB to Apple's Lightning connector; FFmpeg can then be used to read the data.

  • Main video parameters

    • Image format: YUV or RGB (red, green, and blue mixed in different proportions to form all other colors).
    • Resolution: the capture resolution, up to the maximum supported by the current device.
    • Frame rate: the number of frames captured per second.
    • Others: white balance, focus, exposure, flash, and so on.
  • Calculation (RGB)

1 frame = resolution (width * height) * number of bytes per pixel (usually 3 bytes)

Note that this calculation is not the only one, because there are many video data formats; for YUV420, for example, one frame is resolution (width × height) × 3/2 bytes. The sketch below shows how the capture parameters above are set in code.
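
Here is a minimal Swift sketch of video capture with AVCaptureSession to tie the parameters above to real code; the 1080p preset, the 30 fps lock, the NV12 pixel format, and the class name are illustrative choices, not details taken from the article's demos.

```swift
import AVFoundation

final class VideoCapturer: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    let session = AVCaptureSession()
    private let queue = DispatchQueue(label: "video.capture.queue")

    func start() throws {
        session.sessionPreset = .hd1920x1080          // resolution parameter

        guard let camera = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                   for: .video, position: .back),
              let input = try? AVCaptureDeviceInput(device: camera),
              session.canAddInput(input) else { return }
        session.addInput(input)

        // Frame-rate parameter: lock the camera to 30 fps.
        try camera.lockForConfiguration()
        camera.activeVideoMinFrameDuration = CMTime(value: 1, timescale: 30)
        camera.activeVideoMaxFrameDuration = CMTime(value: 1, timescale: 30)
        camera.unlockForConfiguration()

        // Image-format parameter: ask for NV12, a YUV 4:2:0 layout.
        let output = AVCaptureVideoDataOutput()
        output.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String:
                                kCVPixelFormatType_420YpCbCr8BiPlanarFullRange]
        output.setSampleBufferDelegate(self, queue: queue)
        if session.canAddOutput(output) { session.addOutput(output) }

        session.startRunning()
    }

    // Raw frames arrive here as CMSampleBuffers wrapping CVPixelBuffers (YUV).
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Hand the frame on to processing / encoding.
    }
}
```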

1.3. Summary

Suppose the video to be uploaded is 1080p at 30 fps (resolution 1920*1080) and the audio is 48 kHz, 16-bit, stereo. The amount of data per second is then:

Audio = (48000 * 16 * 2) / 8 = 192000 B = 0.192 MB

From this we can see that if the raw data were transmitted directly, a single movie would need more than 1000 gigabytes of video. That would be unmanageable, which is why the encoding step described later is needed.
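
The audio figure above, and the corresponding video figure, can be reproduced with a quick back-of-the-envelope calculation. Here is a small Swift sketch applying the formulas from sections 1.1 and 1.2; the exact video number depends on whether the raw frames are stored as RGB or YUV420.

```swift
// Raw data rate for 1080p @ 30 fps video plus 48 kHz / 16-bit / stereo audio.
let width = 1920, height = 1080, fps = 30

// Audio: (sample rate * bit depth * channels) / 8
let audioBytesPerSecond = 48_000 * 16 * 2 / 8                  // 192_000 B ≈ 0.192 MB/s

// Video, RGB (3 bytes per pixel): width * height * 3 * fps
let rgbBytesPerSecond   = width * height * 3 * fps             // ≈ 186.6 MB/s

// Video, YUV420 (1.5 bytes per pixel): width * height * 3/2 * fps
let yuvBytesPerSecond   = width * height * 3 / 2 * fps         // ≈ 93.3 MB/s

print(audioBytesPerSecond, rgbBytesPerSecond, yuvBytesPerSecond)
// Either way, raw video runs to the order of 100-200 MB every second,
// which is why the encoding step in section 3 is unavoidable.
```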

2. Processing

  • Further reading (to be added)
    • Efficiently cropping video
    • Implementing a volume meter based on sound level

From the previous step we obtain the raw audio and raw video data. On mobile they generally come from the official APIs of each platform; the links in the previous sections show how to do this.

After that, the raw data can be processed. Such processing can only happen before encoding, because once the data is encoded it is only suitable for transmission. For video, for example, we can apply:

  • Beautification
  • Watermarks
  • Filters
  • Cropping
  • Rotation
  • …

For audio, we can apply:

  • Mixing
  • Echo cancellation
  • Noise reduction
  • …

There are many popular frameworks dedicated to processing video and audio, such as OpenGL, OpenAL, and GPUImage, and open-source libraries exist for all of the operations above. The basic pattern is: take the raw audio or video frame, hand it to the library, get the processed frame back, and continue with your own pipeline. Of course, many open-source libraries still need some modification and wrapping to fit a particular project's requirements.
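
As a small, hedged example of that "raw frame in, processed frame out" pattern, here is a Swift sketch that applies a Core Image filter to the CVPixelBuffer delivered by the capture callback. Core Image stands in here for GPUImage or OpenGL, and the specific filter and values are illustrative.

```swift
import CoreImage
import CoreVideo

let ciContext = CIContext()   // reuse one context; creating it per frame is expensive

/// Apply a simple filter-style adjustment to a raw captured frame, in place.
func process(pixelBuffer: CVPixelBuffer) {
    // Wrap the raw frame.
    let source = CIImage(cvPixelBuffer: pixelBuffer)

    // Example processing: a brightness/saturation tweak (a stand-in for filters or beautification).
    let filter = CIFilter(name: "CIColorControls")!
    filter.setValue(source, forKey: kCIInputImageKey)
    filter.setValue(0.05, forKey: kCIInputBrightnessKey)
    filter.setValue(1.1,  forKey: kCIInputSaturationKey)

    guard let output = filter.outputImage else { return }

    // Render the processed image back into the same pixel buffer,
    // which then continues down the pipeline to the encoder.
    ciContext.render(output, to: pixelBuffer)
}
```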

3. Encoding

  • Further reading
    • iOS video encoding
    • iOS audio encoding

3.1. Why encode

At the end of the capture section we saw that raw video amounts to one or two hundred megabytes of data per second. Transmitting the raw data directly would consume enormous network bandwidth and memory, so video must be encoded before transmission.

Think of moving house: if we carry things loose, we need many trips; if we pack clothes and belongings into suitcases, a few cases do it in one trip. When we reach the new home, we take things out and rearrange them. That is essentially how encoding and decoding work.

3.2. Lossy compression vs. lossless compression

  • Lossy compression

    • Video exploits characteristics of human vision, trading a certain amount of objective distortion for data compression. For example, the eye has recognition and visibility thresholds for luminance and is less sensitive to chroma than to luminance, so appropriate errors can be introduced during encoding without being noticed.

    • Audio takes advantage of the fact that humans are insensitive to certain frequency components of sound, allowing some information to be discarded during compression by removing redundant components. Redundant components are the parts of the audio signal that the human ear cannot detect and that contribute nothing to timbre, pitch, and so on. The reconstructed data differs from the original, but this does not distort the information the original data conveys.

    Lossy compression is used where the reconstructed signal does not have to be identical to the original signal.

  • Lossless compression

    • Video contains spatial redundancy, temporal redundancy, structural redundancy, and entropy redundancy, meaning there is strong correlation among the pixels of an image; eliminating these redundancies does not lose any information.
    • Lossless audio compression exploits the statistical redundancy in the data and can restore the original exactly, without any distortion. However, the compression ratio is limited by that statistical redundancy and generally ranges from 2:1 to 5:1.

Thanks to these compression methods, the amount of video data can be greatly reduced, which benefits both transmission and storage.

3.3. Video encoding

  • Principle: how does encoding make such a large amount of data smaller? It mainly exploits the following kinds of redundancy:

    • Spatial redundancy: There is a strong correlation between adjacent pixels of the image
    • Temporal redundancy: Similar content between adjacent images of a video sequence
    • Coding redundancy: Different pixel values have different probabilities
    • Visual redundancy: The human visual system is insensitive to certain details
    • Knowledge redundancy: The structure of regularity can be derived from prior and background knowledge
  • Compression coding methods

    • Transform coding (for details, please Google)

      The image signal described in the spatial domain is transformed into the frequency domain, and the transform coefficients are then encoded. Generally speaking, images have strong spatial correlation; transforming to the frequency domain removes that correlation and concentrates the energy. Common orthogonal transforms include the discrete Fourier transform and the discrete cosine transform.

    • Entropy coding (please Google for details)

      Entropy coding is so named because the average code length after encoding approaches the entropy of the source. It is usually implemented with Variable Length Coding (VLC): symbols that appear with high probability in the source are given short codes, and symbols with low probability are given long codes, which yields a statistically shorter average code length. Common variable-length codes include Huffman coding, arithmetic coding, and run-length coding.

    • Motion estimation and motion compensation (important)

      Motion estimation and motion compensation are effective ways to remove the temporal correlation in an image sequence. The transform coding and entropy coding introduced above operate on a single frame and remove the spatial correlation among its pixels. Besides spatial correlation, however, a video signal also has temporal correlation. For digital video with a static background and little subject motion, such as a news broadcast, consecutive pictures differ very little and are highly correlated. In that case we do not need to encode every frame independently; we can encode only the parts of adjacent frames that change, further reducing the amount of data. This is what motion estimation and motion compensation achieve.

    A. Motion estimation techniques

    The current input image is divided into small, non-overlapping sub-blocks. For example, a 1280*720 frame is first divided, grid-fashion, into 80*45 non-overlapping blocks of 16*16 pixels. Then, for each block, the most similar block is searched for within a search window in the previous (or next) image. This search process is called motion estimation.

    B. Motion compensation

    Computing the positional offset between a block and its most similar block yields a motion vector. During encoding, the block in the current image is subtracted from the block that the motion vector points to in the reference image, giving a residual block. Because the pixel values in a residual block are small, it can be compressed at a much higher ratio.
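
To make the idea concrete, here is a toy Swift sketch of block-matching motion estimation: for one 16*16 block of the current frame it searches a small window in the reference frame for the position with the lowest sum of absolute differences (SAD). The residual used in motion compensation is then the current block minus that best-matching block. Real encoders use far more sophisticated searches (sub-pixel, hierarchical, early termination), so treat this only as an illustration of the principle.

```swift
/// A luma plane stored as a flat array, row-major, `width` pixels per row.
struct Plane {
    let pixels: [UInt8]
    let width: Int
    let height: Int
    func pixel(_ x: Int, _ y: Int) -> Int { Int(pixels[y * width + x]) }
}

/// Find the motion vector of the 16x16 block at (bx, by) in `current`
/// by exhaustively searching a +/- `range` window in `reference`.
func estimateMotion(current: Plane, reference: Plane,
                    bx: Int, by: Int, range: Int = 8) -> (dx: Int, dy: Int) {
    let n = 16
    var best = (dx: 0, dy: 0)
    var bestSAD = Int.max

    for dy in -range...range {
        for dx in -range...range {
            let rx = bx + dx, ry = by + dy
            // The candidate block must lie entirely inside the reference frame.
            guard rx >= 0, ry >= 0,
                  rx + n <= reference.width, ry + n <= reference.height else { continue }

            // Sum of absolute differences between the two 16x16 blocks.
            var sad = 0
            for y in 0..<n {
                for x in 0..<n {
                    sad += abs(current.pixel(bx + x, by + y) - reference.pixel(rx + x, ry + y))
                }
                if sad >= bestSAD { break }   // early exit: already worse than the best match
            }
            if sad < bestSAD { bestSAD = sad; best = (dx, dy) }
        }
    }
    // Motion compensation would now encode the residual:
    // current block minus the reference block shifted by `best`.
    return best
}
```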

  • Compressed frame types

    Because of motion estimation and motion compensation, the encoder classifies each input frame into one of three types according to its reference images: I frames, P frames, and B frames.

    • I frame: encoded using only the data within the frame itself; no motion estimation or motion compensation is needed.

    • P frame: encoded using a previous I or P frame as the reference image for motion compensation; what is actually encoded is the difference between the current image and the reference image.

    • B frame: encoded by predicting from both a preceding I or P frame and a following I or P frame. Thus a P frame uses one reference image, while a B frame requires two.

In practice, hybrid coding (transform coding + motion estimation and motion compensation + entropy coding) is used.

  • Encoders

    After decades of development, encoders have become very powerful and come in many varieties; the following are some of the most mainstream ones.

    • H.264

    It can deliver high-quality video at a lower bitrate than older standards (half or less of the bitrate of MPEG-2, H.263, or MPEG-4 Part 2) without so much added design complexity that implementations become impractical or too expensive. A minimal sketch of driving the system H.264 encoder on iOS follows this list.

    • H.265/HEVC

    High Efficiency Video Coding (HEVC) is a video compression standard regarded as the successor to ITU-T H.264/MPEG-4 AVC. HEVC is expected not only to improve video quality but also to achieve twice the compression of H.264/MPEG-4 AVC (i.e. a 50% bitrate reduction at the same image quality).

    • VP8

    VP8 is an open video compression format, first developed by On2 Technologies and later released by Google.

    • VP9

    The development of VP9 began in the third quarter of 2011. The goal is to reduce the file size by 50% compared to VP8 coding with the same picture quality. Another goal is to surpass HEVC coding in coding efficiency.
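
On iOS the system H.264 encoder is exposed through the VideoToolbox framework. The following is a minimal, abridged Swift sketch of creating a compression session for the 1080p/30fps case and feeding it captured pixel buffers; the bitrate, keyframe interval, and callback body are illustrative, and error handling and SPS/PPS extraction are omitted.

```swift
import VideoToolbox

// Called by VideoToolbox with each encoded frame (a CMSampleBuffer holding H.264 NAL units).
let encodedFrameCallback: VTCompressionOutputCallback = { _, _, status, _, sampleBuffer in
    guard status == noErr, let sampleBuffer = sampleBuffer else { return }
    // Extract NAL units / SPS / PPS here and pass them on to the muxer (e.g. FLV) or the RTMP sender.
}

var session: VTCompressionSession?
VTCompressionSessionCreate(allocator: kCFAllocatorDefault,
                           width: 1920, height: 1080,
                           codecType: kCMVideoCodecType_H264,
                           encoderSpecification: nil,
                           imageBufferAttributes: nil,
                           compressedDataAllocator: nil,
                           outputCallback: encodedFrameCallback,
                           refcon: nil,
                           compressionSessionOut: &session)

if let session = session {
    // Typical live-streaming settings: real time, no frame reordering, ~2 Mbps, keyframe every 60 frames.
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_RealTime, value: kCFBooleanTrue)
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_AllowFrameReordering, value: kCFBooleanFalse)
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_AverageBitRate, value: NSNumber(value: 2_000_000))
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_MaxKeyFrameInterval, value: NSNumber(value: 60))
    VTCompressionSessionPrepareToEncodeFrames(session)
}

/// Feed one captured frame (from the AVCaptureSession callback) into the encoder.
func encode(pixelBuffer: CVPixelBuffer, pts: CMTime) {
    guard let session = session else { return }
    VTCompressionSessionEncodeFrame(session,
                                    imageBuffer: pixelBuffer,
                                    presentationTimeStamp: pts,
                                    duration: .invalid,
                                    frameProperties: nil,
                                    sourceFrameRefcon: nil,
                                    infoFlagsOut: nil)
}
```

Disabling frame reordering here turns off B frames, which keeps latency low for live streaming at some cost in compression efficiency.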

3.4. Audio encoding

  • Principle

    Digital audio compression coding compresses the audio signal as much as possible while keeping it free of audible distortion. It works by removing the redundant components of the sound: the parts of the signal that the human ear cannot detect and that contribute nothing to timbre, pitch, and so on.

    Redundant signals include audio outside the range of human hearing and masked audio signals. For example, the human ear can detect frequencies from 20 Hz to 20 kHz; anything outside that range is inaudible and therefore redundant. In addition, because of the physiology and psychology of human hearing, when a strong signal and a weak signal are present at the same time, the weak one is masked by the strong one and cannot be heard, so it too can be treated as redundant and need not be transmitted. This is the masking effect of human hearing.

  • Compression coding methods

    • Spectral masking

    The human ear cannot hear a frequency once its sound energy falls below a certain threshold, called the minimum audible threshold. When another high-energy sound is present, the threshold near that sound's frequency rises considerably; this is known as the masking effect.

    The ear is most sensitive to sounds between 2 kHz and 5 kHz and far less sensitive to very low or very high frequencies. For example, when a 0.2 kHz tone at 60 dB is present, the audibility threshold near that frequency rises sharply.

    • Temporal masking

    When a strong signal and a weak signal occur close together in time, there is also a temporal masking effect, which comes in three forms: pre-masking, simultaneous masking, and post-masking.

    - Pre-masking: in the short time before the strong signal is heard, an existing weak signal is masked and cannot be heard.
    - Simultaneous masking: while the strong signal and the weak signal are present at the same time, the weak signal is masked by the strong one and cannot be heard.
    - Post-masking: after the strong signal disappears, it takes some time before the weak signal can be heard again. All of these masked weak signals can be treated as redundant.
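
In practice, on iOS the psychoacoustic model described above lives inside the system AAC encoder, so application code only chooses formats and a bitrate. Here is a minimal Swift sketch of compressing captured PCM to AAC with AVAudioConverter; the 48 kHz/16-bit/stereo input, the 96 kbps bitrate, and the pcmBuffer source are assumptions for the sketch, not details from the original article.

```swift
import AVFoundation

// Input: 48 kHz, 16-bit, stereo, interleaved PCM, i.e. the raw format from the capture step.
let pcmFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                              sampleRate: 48_000, channels: 2, interleaved: true)!

// Output: AAC-LC. Most fields are 0 so the codec fills them in; AAC uses 1024-frame packets.
var aacDescription = AudioStreamBasicDescription(mSampleRate: 48_000,
                                                 mFormatID: kAudioFormatMPEG4AAC,
                                                 mFormatFlags: 0,
                                                 mBytesPerPacket: 0,
                                                 mFramesPerPacket: 1024,
                                                 mBytesPerFrame: 0,
                                                 mChannelsPerFrame: 2,
                                                 mBitsPerChannel: 0,
                                                 mReserved: 0)
let aacFormat = AVAudioFormat(streamDescription: &aacDescription)!

let converter = AVAudioConverter(from: pcmFormat, to: aacFormat)!
converter.bitRate = 96_000    // lossy: the encoder decides what the ear will not miss

/// Encode one PCM buffer (e.g. filled from the microphone callback) into AAC packets.
func encode(pcmBuffer: AVAudioPCMBuffer) -> AVAudioCompressedBuffer? {
    let outBuffer = AVAudioCompressedBuffer(format: aacFormat,
                                            packetCapacity: 8,
                                            maximumPacketSize: converter.maximumOutputPacketSize)
    var error: NSError?
    let status = converter.convert(to: outBuffer, error: &error) { _, inputStatus in
        inputStatus.pointee = .haveData
        return pcmBuffer
    }
    return status == .error ? nil : outBuffer   // outBuffer now holds AAC packets ready for muxing
}
```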

4. Encapsulating the encoded data

  • Further reading
    • iOS complete push-stream process
    • Setting up the FFmpeg environment on iOS

4.1. Definition

Encapsulation (muxing) synchronizes and combines the audio and video produced by the encoders into a playable video file, so that what we see and what we hear stay in sync. In other words, it produces a container that holds the audio and video streams along with other information (such as subtitles and metadata).

4.2. Formats

  • AVI (.avi):
    • The advantage is good image quality; since lossless AVI can preserve alpha channels, it is often used.
    • The drawbacks are many: files are large and, worse still, the compression standards are not unified.
  • MOV (.mov): a video format developed by Apple; the default player is Apple's QuickTime.
  • MPEG (.mpg, .mpeg, .mpe, .dat, .vob, .asf, .3gp, .mp4)
  • WMV (.wmv, .asf)
  • Real Video (.rm, .rmvb): different compression ratios are defined for different network speeds, enabling real-time transmission and playback of video over low-bandwidth networks.
  • Flash Video (.flv): a web video container format that grew out of Adobe Flash. With the proliferation of video sites, this format has become very popular.

4.3. Synthesizing the encoded data into a stream

On mobile we need the FFmpeg framework. As mentioned above, FFmpeg can not only encode and decode, it can also mux streams, such as the common FLV and ASF streams.

Finally, the muxed data can be written to a file or distributed over the network.

Aside: FFmpeg (a must-learn framework)

FFmpeg is an open-source framework for recording, converting, and streaming audio and video. It includes libavcodec, an audio/video codec library used by many projects, and libavformat, an audio/video container (muxing/demuxing) library.

It currently supports the three mainstream platforms Linux, macOS, and Windows, and it can also be compiled for Android or iOS. On macOS it can be installed with Homebrew: brew install ffmpeg --with-libvpx --with-libvorbis --with-ffplay

4.4. Introduction to the FLV format

  • Overview

    FLV stands for Flash Video, a packaging (container) format widely used for video on the Internet; video sites such as YouTube and Youku use FLV to encapsulate their videos.

    Flash Video (FLV) is a popular streaming media format designed and developed by Adobe. Its small overhead and simple packaging make it well suited to the Internet. FLV can also be played by Flash Player, which is installed in most browsers worldwide, so FLV videos on the web are easy to play. Mainstream video sites such as Youku, Tudou, and LeTV all use the FLV format; its file extension is .flv.

  • Structure

    An FLV file consists of a File Header and a File Body, and the File Body is made up of a series of Tags, giving the structure shown in Figure 1.

    Each Tag is also followed by a Previous Tag Size field that records the size of the preceding Tag. A Tag can be of type video, audio, or script, and each Tag carries only one of these three kinds of data. Figure 2 shows the detailed structure of an FLV file.
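
Using only the fields described above (a 9-byte file header, then for each tag a 1-byte type, a 3-byte data size, a 3+1-byte timestamp, a 3-byte stream ID, the payload, and a trailing 4-byte Previous Tag Size), a small Swift sketch can walk the tags of an FLV byte stream. It is a simplified reader for illustration, not a full parser.

```swift
import Foundation

struct FLVTag {
    let type: UInt8        // 8 = audio, 9 = video, 18 = script data
    let dataSize: Int      // size of the tag payload
    let timestamp: Int     // milliseconds
}

/// Walk an FLV byte stream and list its tags. Returns nil if the signature is wrong.
func parseFLV(_ data: Data) -> [FLVTag]? {
    // File Header: "FLV", version, flags, then a 4-byte data offset (normally 9).
    guard data.count > 13, data[0] == 0x46, data[1] == 0x4C, data[2] == 0x56 else { return nil }
    let headerSize = Int(data[5]) << 24 | Int(data[6]) << 16 | Int(data[7]) << 8 | Int(data[8])

    var tags: [FLVTag] = []
    var offset = headerSize + 4                     // skip PreviousTagSize0
    while offset + 11 <= data.count {
        // Tag Header: type(1) + dataSize(3) + timestamp(3) + timestampExtended(1) + streamID(3)
        let type      = data[offset]
        let dataSize  = Int(data[offset + 1]) << 16 | Int(data[offset + 2]) << 8 | Int(data[offset + 3])
        let timestamp = Int(data[offset + 7]) << 24 |                     // extended byte holds the high 8 bits
                        Int(data[offset + 4]) << 16 | Int(data[offset + 5]) << 8 | Int(data[offset + 6])
        tags.append(FLVTag(type: type, dataSize: dataSize, timestamp: timestamp))

        // Skip the 11-byte header, the payload, and the 4-byte PreviousTagSize that follows every tag.
        offset += 11 + dataSize + 4
    }
    return tags
}
```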

5. Transmitting data with RTMP

  • Advantages

    • Good CDN support: mainstream CDN vendors support it
    • The protocol is simple and easy to implement on various platforms
  • Disadvantages

    • Based on TCP, so transmission overhead is high; the problem becomes significant on weak networks with high packet loss rates
    • Browser push is not supported
    • A proprietary Adobe protocol that Adobe no longer updates

The stream we push out has to be delivered to the audience, and that entire link is the transmission network.

5.1. Overview

RTMP is an application-layer protocol in the five-layer Internet (TCP/IP) architecture. The basic unit of data in RTMP is the Message; when RTMP transmits data over the network, messages are broken into smaller units called Chunks.

5.2. Messages

Messages are the basic unit of data in the RTMP protocol. Different kinds of messages carry different Message Type IDs, which represent different functions. The RTMP specification defines more than ten message types, each with its own role.

  • Messages with type IDs 1-7 are used for protocol control. They manage the RTMP protocol itself, and users normally do not need to touch their contents
  • Messages with type IDs 8 and 9 carry audio data and video data, respectively
  • Messages with type IDs 15-20 carry AMF-encoded commands and handle the interaction between client and server, such as play, pause, and so on
  • The Message Header contains the Message Type ID, the payload length, the timestamp, and the Stream ID of the media stream the message belongs to

5.3. Message chunks

When data is sent over the network, messages need to be split into smaller blocks that suit the network environment. The RTMP specification states that a Message is divided into Chunks for transmission.

The Chunk Header consists of three parts:

  • The Chunk Basic Header, which identifies the chunk stream
  • The Chunk Message Header, which identifies the Message that the chunk payload belongs to
  • The Extended Timestamp, which appears only when the timestamp overflows

5.4. Message chunking

When a Message is split into chunks, the Message body is divided into blocks of a fixed size (128 bytes by default; the last block may be smaller), and a Chunk Header is added in front of each block to form a Chunk. Figure 5 shows the process, in which a 307-byte message is split into chunks of 128 bytes each (except the last one).

When RTMP transmits media data, the sender encapsulates the media data into Messages, splits each Message into Chunks, and sends the Chunks over TCP. After receiving the data over TCP, the receiver reassembles the Chunks into Messages and then unpacks the Messages to recover the media data.
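
The chunking step itself is easy to sketch. Below is a simplified Swift illustration of splitting one message payload into 128-byte chunks: the first chunk carries a full type-0 header and every following chunk carries only the one-byte type-3 basic header. It only handles chunk stream IDs below 64 and ignores extended timestamps, so it illustrates the idea rather than being a complete RTMP implementation.

```swift
import Foundation

/// Split one RTMP message payload into chunks of `chunkSize` bytes (128 by default).
/// Simplified: csid < 64, no extended timestamp.
func chunkMessage(payload: Data, timestamp: UInt32, messageTypeID: UInt8,
                  messageStreamID: UInt32, csid: UInt8, chunkSize: Int = 128) -> [Data] {
    var chunks: [Data] = []
    var offset = 0
    var first = true

    while offset < payload.count {
        let length = min(chunkSize, payload.count - offset)
        var chunk = Data()

        if first {
            // Type 0 chunk: basic header (fmt=0 in the top two bits) + 11-byte message header.
            chunk.append(csid & 0x3F)
            chunk.append(contentsOf: [UInt8(timestamp >> 16 & 0xFF),                    // timestamp, 3 bytes
                                      UInt8(timestamp >> 8 & 0xFF),
                                      UInt8(timestamp & 0xFF)])
            let len = UInt32(payload.count)                                             // message length, 3 bytes
            chunk.append(contentsOf: [UInt8(len >> 16 & 0xFF), UInt8(len >> 8 & 0xFF), UInt8(len & 0xFF)])
            chunk.append(messageTypeID)                                                 // message type id, 1 byte
            withUnsafeBytes(of: messageStreamID.littleEndian) { chunk.append(contentsOf: $0) } // stream id, 4 bytes LE
            first = false
        } else {
            // Type 3 chunk: basic header only; the message header is inherited from the first chunk.
            chunk.append(0xC0 | (csid & 0x3F))
        }

        chunk.append(payload.subdata(in: offset ..< offset + length))
        chunks.append(chunk)
        offset += length
    }
    return chunks
}

// Example: a 307-byte message becomes chunks carrying 128 + 128 + 51 payload bytes.
let parts = chunkMessage(payload: Data(count: 307), timestamp: 0,
                         messageTypeID: 9, messageStreamID: 1, csid: 4)
```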

5.5. Logical structure in RTMP

The RTMP protocol specifies two prerequisites for playing a stream:

  • Step 1: establish a network connection (NetConnection)
  • Step 2: create a network stream (NetStream)

The network connection represents the basic link between the client and the server-side application, while a network stream represents a channel for sending multimedia data. Only one network connection can be established between a client and a server, but many network streams can be created on top of that connection. Their relationship is shown below:

5.6. Connection process

Playing an RTMP stream requires the following steps:

  • Handshake
  • Establish the connection
  • Create the stream
  • Play

Every RTMP connection starts with a handshake. The connection phase establishes the "network connection" between client and server; the stream phase establishes the "network stream"; and the playback phase transmits the video and audio data.

6. Parsing and decoding the video stream

  • Further reading
    • iOS complete pull-stream process: parsing, decoding, synchronizing, and rendering audio and video streams
    • FFmpeg parses video data
    • iOS uses FFmpeg to implement hardware video decoding
    • iOS uses VideoToolbox to implement hardware video decoding
    • iOS uses FFmpeg to implement hardware audio decoding
    • iOS native audio decoding

So far we have covered the complete push-stream process; what follows is the reverse process, pulling the stream.

The receiver gets an encoded video stream but ultimately wants to render the pictures on the screen and play the audio through the speaker or another output device. Following the steps above, the receiver obtains the stream data over RTMP and then uses FFmpeg to parse it, because the audio and video have to be demultiplexed first; once separated, each is decoded on its own. Decoded video is YUV/RGB data and decoded audio is linear PCM.

Note that the decoded data still cannot be used directly: to play it, the mobile device has to wrap it in a specific data structure. On iOS, video data must be placed into a CMSampleBufferRef, which is itself built from CMTime, CMVideoFormatDescription, and CMBlockBuffer, so we need to supply that information to produce something the system can play.
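
For example, a decoded frame that arrives as a CVPixelBuffer (YUV) can be wrapped for display roughly as follows. This sketch builds the format description and timing info and hands the result to an AVSampleBufferDisplayLayer; the 30 fps duration is an assumption, and the CMBlockBuffer path (used when feeding compressed data instead) is not shown.

```swift
import AVFoundation
import CoreMedia

let displayLayer = AVSampleBufferDisplayLayer()   // added to the view's layer tree elsewhere

/// Wrap one decoded frame in a CMSampleBuffer and hand it to the system for rendering.
func display(pixelBuffer: CVPixelBuffer, pts: CMTime) {
    // CMVideoFormatDescription describing this frame's pixel format and dimensions.
    var formatDescription: CMVideoFormatDescription?
    CMVideoFormatDescriptionCreateForImageBuffer(allocator: kCFAllocatorDefault,
                                                 imageBuffer: pixelBuffer,
                                                 formatDescriptionOut: &formatDescription)
    guard let format = formatDescription else { return }

    // CMTime-based timing info (assuming 30 fps here).
    var timing = CMSampleTimingInfo(duration: CMTime(value: 1, timescale: 30),
                                    presentationTimeStamp: pts,
                                    decodeTimeStamp: .invalid)

    // Assemble the CMSampleBuffer the display layer expects.
    var sampleBuffer: CMSampleBuffer?
    CMSampleBufferCreateForImageBuffer(allocator: kCFAllocatorDefault,
                                       imageBuffer: pixelBuffer,
                                       dataReady: true,
                                       makeDataReadyCallback: nil,
                                       refcon: nil,
                                       formatDescription: format,
                                       sampleTiming: &timing,
                                       sampleBufferOut: &sampleBuffer)
    if let sampleBuffer = sampleBuffer {
        displayLayer.enqueue(sampleBuffer)
    }
}
```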

7. Audio and video synchronization and playback

  • Further reading
    • iOS video rendering
    • AudioQueue real-time audio stream playback in practice

Once we have decoded audio and video frames, the first thing to consider is how to synchronize them. Under normal network conditions no extra synchronization work is needed, because the parsed audio and video frames carry the timestamps assigned at capture time; as long as we fetch the frames within a reasonable time and send them to the screen and the speaker, playback stays in sync. However, network fluctuations mean that some frames may be lost or arrive late, and when that happens audio and video drift apart, so explicit synchronization is needed.

One way to picture it: an ant (the video) walks along a ruler next to a pacer (the audio). The pacer moves at a constant speed while the ant is sometimes fast and sometimes slow; when it falls behind you prod it to make it run, and when it gets ahead you hold it back, and in this way the two stay together. The key point is that the audio clock is steady while the video is non-linear.

After obtaining the audio and video PTS separately, we have three options: synchronize video to audio (compute the difference between the two PTS values to decide whether the video is ahead or behind), synchronize audio to video (adjust the number of audio samples according to the PTS difference, i.e. change the audio buffer size), or synchronize both to an external clock. Because adjusting the audio too aggressively produces sound the user can clearly hear as distorted, we usually choose the first option.

Our strategy is to predict the PTS of the next frame from the previous PTS and the current PTS. Meanwhile, we synchronize the video to the audio: we keep an audio clock as an internal variable that tracks how far the audio has actually played, and the video thread uses this value to work out whether the video is running fast or slow.

Now suppose we have a get_audio_clock function that returns the audio clock. Once we have that value, what do we do when audio and video are out of sync? Simply jumping ahead to the "correct" packet is not a good solution. What we do instead is adjust the timing of the next refresh: if the video is behind we refresh sooner, and if it is ahead we refresh later. With the adjusted delay in hand, we compare frame_timer with the system clock. frame_timer keeps accumulating the delays we calculate during playback; in other words, it is the moment at which the next frame should be shown. We simply add the newly computed delay to frame_timer, compare it with the system time, and use the difference as the interval before the next refresh.
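
A hedged Swift sketch of that clock-comparison logic might look like the following; the names (frameTimer, audioClock) and the 10 ms sync threshold are illustrative, following the strategy described above rather than any particular implementation.

```swift
import Foundation

final class AVSynchronizer {
    private var frameTimer = Date().timeIntervalSince1970   // when the next frame should appear
    private var lastDelay: Double = 1.0 / 30.0               // last inter-frame delay, in seconds
    private var lastPTS: Double = 0
    private let syncThreshold = 0.01                          // 10 ms; below this we don't correct

    /// Returns the audio clock: how many seconds of audio have actually been played so far.
    var audioClock: () -> Double = { 0 }

    /// Given the PTS (in seconds) of the video frame about to be shown,
    /// return how long to wait before displaying it.
    func delayUntilNextFrame(videoPTS: Double) -> Double {
        // Predicted inter-frame delay from the last two PTS values.
        var delay = videoPTS - lastPTS
        if delay <= 0 || delay >= 1.0 { delay = lastDelay }    // implausible value: reuse the previous delay
        lastDelay = delay
        lastPTS = videoPTS

        // Sync video to audio: positive diff = video ahead, negative = video behind.
        let diff = videoPTS - audioClock()
        let threshold = max(delay, syncThreshold)
        if diff <= -threshold { delay = 0 }                    // video is late: show the frame immediately
        else if diff >= threshold { delay *= 2 }               // video is early: hold the frame longer

        // frameTimer accumulates the delays: it is the moment the next frame is due.
        frameTimer += delay
        let actualDelay = frameTimer - Date().timeIntervalSince1970
        return max(actualDelay, 0.001)                         // schedule the next refresh after this interval
    }
}
```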

References

  • Lei Shen (雷神)
  • Alibaba Cloud and Qiniu Cloud