2016 was the year mobile live streaming exploded. In less than half a year, countless mobile live-streaming apps set off a nationwide live-broadcasting craze. In my opinion, however, the barrier to entry is relatively high: professional expertise is needed on the streaming (push) side, the server side, and the player side, and there is a lot to learn on the streaming side alone. At present, most live broadcasts use the RTMP protocol. I wrote a simple demo here to help you better understand the live-streaming process, covering audio/video capture, audio/video encoding, data packaging, the RTMP protocol, and other related topics. The structure of the project is very clear, and each module is separated by a protocol (interface), which makes it convenient to study the different modules independently.
First, the overall flow of pushing a stream:
- Establish a TCP connection
- Establish the RTMP connection and send the various control commands
- Capture raw video and audio data
- Compress and encode the raw video and audio data
- Package the encoded video and audio data
- Send the packaged audio and video data
The roles of the classes in the project:
- SGSimpleSession is the API layer, responsible for exposing directly callable interfaces. It is also the data distribution center: the raw audio/video data and the encoded data obtained are dispatched from here to the different classes for processing.
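Since the modules are separated by protocols, the handoff between them is easy to sketch. Below is a minimal, hypothetical Swift illustration of that decoupling; the protocol and method names are mine, not the demo's actual declarations.

```swift
import Foundation
import CoreMedia

// Hypothetical delegate protocols sketching how modules hand data along;
// names are illustrative, not the demo's actual declarations.
protocol VideoSourceDelegate: AnyObject {
    func videoSource(didOutput sampleBuffer: CMSampleBuffer)    // raw frames
}

protocol VideoEncoderDelegate: AnyObject {
    func videoEncoder(didEncode frame: Data, isKeyframe: Bool)  // H.264 frames
}

// SGSimpleSession would adopt such protocols: it receives raw frames from
// the source, feeds them to the encoder, and passes encoded frames on to
// the packager and then the RTMP session.
```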
- Video-related classes
- SGVideoSource: the raw-video capture class, implemented on top of the AVFoundation framework. It provides raw, unencoded video data to the outside and also provides an image preview. Beauty filters, camera switching, mirroring, flash, and similar operations are handled here too (a capture sketch follows the definitions below).
Raw video frames: raw video data is really a sequence of frames. They are not compressed or encoded, and each frame carries image information and time information; the picture is extracted from this data in code.
FPS: the number of frames in one second is the frame rate (FPS). FPS generally ranges from 15 to 30; the higher the FPS, the smoother the picture and the higher the bandwidth consumption. In real live broadcasts, 15 to 20 is usually enough.
Resolution: the size of one frame. iOS natively supports 352×288, 640×480, 1280×720, and so on. Live broadcasts generally use 640×480 and then crop to 640×360.
Bit rate: the number of bits transmitted per unit of time. You can think of the bit rate as determining the display precision of a frame: within a certain range, the higher the bit rate, the sharper the image and the greater the bandwidth consumption or file size; beyond that range, clarity no longer improves. At 640×480, a bit rate of 512 kbps is generally enough to keep the picture clear.
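As a concrete reference for the capture side, here is a minimal AVFoundation sketch using the 640×480 preset and a 20 fps cap discussed above. It is illustrative only, not SGVideoSource's actual code.

```swift
import AVFoundation

// Minimal capture sketch (illustrative, not SGVideoSource's actual code).
let session = AVCaptureSession()
session.sessionPreset = .vga640x480   // 640x480, later cropped to 640x360

if let camera = AVCaptureDevice.default(for: .video),
   let input = try? AVCaptureDeviceInput(device: camera),
   session.canAddInput(input) {
    session.addInput(input)

    // Raw, unencoded frames arrive via the delegate's
    // captureOutput(_:didOutput:from:) callback.
    let output = AVCaptureVideoDataOutput()
    // output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "video"))
    if session.canAddOutput(output) { session.addOutput(output) }

    // Cap the frame rate at 20 fps (minimum frame duration of 1/20 s).
    try? camera.lockForConfiguration()
    camera.activeVideoMinFrameDuration = CMTime(value: 1, timescale: 20)
    camera.unlockForConfiguration()

    session.startRunning()
}
```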
- SGVideoConfig: the video configuration class, mainly covering the compression level (profile), resolution, bit rate, and so on.
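For concreteness, a configuration holding the defaults used throughout this article might look like the hypothetical struct below.

```swift
// Hypothetical config type collecting the defaults used in this article.
struct VideoConfig {
    var width = 640
    var height = 480
    var bitRate = 512_000     // 512 kbps keeps 640x480 acceptably sharp
    var fps = 20              // 15-20 fps is enough for live streaming
    var gopSeconds = 2        // keyframe interval = gopSeconds * fps frames
    var profile = "baseline"  // I and P frames only; lower latency
}
```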
- SGH264Encoder: the encoder class. Its main job is to encode and compress the raw video frames. Hardware encoding is used here, and the output format is H.264 (a sketch follows the GOP notes below).
Encoding: encoding compresses the raw frame data so that it is smaller and easier to transmit over the network. Raw data is bulky and very inconvenient to send, so it must be compressed. The current mainstream video compression standard is H.264. H.264 has different compression levels (profiles) with different compression ratios; the common profiles are baseline, main, and high.
Hardware encoding: hardware ("hard") encoding is the counterpart of software ("soft") encoding. Soft encoding is computed on the CPU, which costs CPU performance and takes a lot of time, but has good compatibility; it generally uses FFmpeg or x264. Hard encoding, by contrast, encodes on the GPU, which is very fast and efficient. This demo uses the hardware encoder provided by iOS, which requires iOS 8 or later.
Compressed video frames: compressed video has three frame types: I, B, and P. An I frame, also called a keyframe, decodes into a complete image on its own. A P frame is a forward-predicted frame; it can only be decoded into a complete image by referencing the previous frame. A B frame is a bidirectionally predicted frame; it needs both the previous and the next frame to decode. I frames therefore have the lowest compression ratio, around 0.7, since they can only use intra-frame compression. P frames compress better, reaching about 0.5, and B frames compress best, at 0.3 to 0.5; B and P frames use inter-frame compression (motion estimation), which exploits the fact that adjacent frames are partly identical (the technical term is temporal redundancy). The baseline profile contains only I and P frames, while main and high contain all three types, so their overall compression ratio is higher than baseline's. However, because a B frame can only be displayed after both its previous and next reference frames arrive, it easily causes stutter: if the next frame has not arrived yet, the B frame cannot be displayed. In practice (live apps), the baseline profile is therefore generally used.
GOP: my attempt at describing this: since frames other than I frames cannot be rendered independently, in theory you would need only one I frame and could make everything else P/B frames, giving the highest compression ratio. But because B and P frames reference other frames, errors can creep in and accumulate over time, so the image drifts further and further from the original and becomes distorted. The solution is to split the stream into small segments, with the first frame of each segment being an I frame; that way, if something goes wrong in one segment, it does not affect the next. Each such segment is called a closed GOP, and the first frame of every GOP must be a keyframe, because it has nothing earlier to reference. GOP size is generally set to 1 s to 3 s, so the keyframe interval is the number of frames in 1 s to 3 s; since the number of frames in 1 s is by definition the FPS, the keyframe interval is 1×FPS to 3×FPS frames. One common optimization is to reduce the GOP size: the first frame of each GOP is a keyframe that renders independently, and users enter a live room at random moments, so a smaller GOP lets them receive a keyframe, and render the first image, sooner. At the same time, the smaller the GOP, the more keyframes there are, and the greater the bandwidth consumption.
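To tie the profile, bit rate, and GOP settings above to the iOS hardware encoder, here is a minimal VideoToolbox sketch. The values follow this article (640×480, 512 kbps, baseline, 2 s GOP at 20 fps); the code is a sketch, not SGH264Encoder's actual implementation.

```swift
import VideoToolbox

// Create an iOS hardware H.264 encoder session (sketch).
var session: VTCompressionSession?
VTCompressionSessionCreate(allocator: kCFAllocatorDefault,
                           width: 640, height: 480,
                           codecType: kCMVideoCodecType_H264,
                           encoderSpecification: nil,
                           imageBufferAttributes: nil,
                           compressedDataAllocator: nil,
                           outputCallback: nil, refcon: nil,
                           compressionSessionOut: &session)

if let session = session {
    // Baseline profile: only I and P frames, no B frames (lower latency).
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_ProfileLevel,
                         value: kVTProfileLevel_H264_Baseline_AutoLevel)
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_RealTime,
                         value: kCFBooleanTrue)
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_AverageBitRate,
                         value: 512_000 as CFNumber)
    // GOP size: 2 s * 20 fps = one keyframe every 40 frames.
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_MaxKeyFrameInterval,
                         value: 40 as CFNumber)
    VTCompressionSessionPrepareToEncodeFrames(session)
    // Frames are then fed in with VTCompressionSessionEncodeFrame(...).
}
```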
- SGH264Packager: responsible for packaging the encoded H.264 frames into RTMP-compliant data before they are sent.
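RTMP carries video in FLV video-tag format, so the packaging step mainly prepends a small header and switches the NALU to a length-prefixed layout. A hypothetical sketch (not SGH264Packager's actual code), assuming `nalu` has its Annex-B start code already stripped:

```swift
import Foundation

// Wrap one encoded H.264 NAL unit as an FLV/RTMP video tag body (sketch).
func packageVideoFrame(nalu: Data, isKeyframe: Bool) -> Data {
    var body = Data()
    // Byte 0: frame type (1 = keyframe, 2 = inter frame) | codec id (7 = AVC)
    body.append(isKeyframe ? 0x17 : 0x27)
    // Byte 1: AVC packet type (1 = NALU; 0 would be the sequence header)
    body.append(0x01)
    // Bytes 2-4: composition time offset (0, since baseline has no B frames)
    body.append(contentsOf: [0x00, 0x00, 0x00])
    // 4-byte big-endian NALU length, then the NALU itself (AVCC layout)
    withUnsafeBytes(of: UInt32(nalu.count).bigEndian) { body.append(contentsOf: $0) }
    body.append(nalu)
    return body
}
```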
- Audio-related classes
- SGAudioSource: responsible for recording audio and outputting raw audio frames in PCM format.
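One simple way to get raw PCM on iOS is an AVAudioEngine input tap, sketched below. This is an assumption for illustration; SGAudioSource itself may use an Audio Unit or AVCaptureSession instead.

```swift
import AVFoundation

// Illustrative PCM capture via an input tap (sketch, not SGAudioSource).
let engine = AVAudioEngine()
let input = engine.inputNode
let format = input.outputFormat(forBus: 0)
input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
    // `buffer` holds raw, unencoded PCM samples; hand them to the AAC encoder.
}
do { try engine.start() } catch { print("audio engine failed to start: \(error)") }
```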
- SGAudioConfig: the audio configuration class, mainly covering the channel count, bit rate, and sample rate.
- SGAACEncoder: encodes and compresses the raw PCM audio; the result is audio data in AAC format. Hardware encoding is used here; the usual library for soft encoding is FAAC.
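On iOS the hardware AAC codec is reached through an AudioConverter. A minimal setup sketch, assuming 44.1 kHz 16-bit mono PCM in and AAC out (illustrative values, not necessarily SGAACEncoder's):

```swift
import AudioToolbox

// Input: 44.1 kHz, 16-bit signed integer, mono, packed PCM.
var inFormat = AudioStreamBasicDescription(
    mSampleRate: 44100, mFormatID: kAudioFormatLinearPCM,
    mFormatFlags: kLinearPCMFormatFlagIsSignedInteger | kLinearPCMFormatFlagIsPacked,
    mBytesPerPacket: 2, mFramesPerPacket: 1, mBytesPerFrame: 2,
    mChannelsPerFrame: 1, mBitsPerChannel: 16, mReserved: 0)

// Output: AAC; 1024 PCM frames go into each AAC packet.
var outFormat = AudioStreamBasicDescription(
    mSampleRate: 44100, mFormatID: kAudioFormatMPEG4AAC,
    mFormatFlags: 0, mBytesPerPacket: 0, mFramesPerPacket: 1024,
    mBytesPerFrame: 0, mChannelsPerFrame: 1, mBitsPerChannel: 0, mReserved: 0)

// Ask specifically for Apple's hardware AAC encoder.
var desc = AudioClassDescription(mType: kAudioEncoderComponentType,
                                 mSubType: kAudioFormatMPEG4AAC,
                                 mManufacturer: kAppleHardwareAudioCodecManufacturer)
var converter: AudioConverterRef?
AudioConverterNewSpecific(&inFormat, &outFormat, 1, &desc, &converter)
// PCM is then converted with AudioConverterFillComplexBuffer(...).
```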
- SGAACPackager: packages the encoded AAC data into RTMP-compliant data.
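The audio side mirrors the video side: RTMP carries AAC in FLV audio-tag format, with a two-byte header in front of each frame. A hypothetical sketch, not SGAACPackager's actual code:

```swift
import Foundation

// Wrap one AAC frame as an FLV/RTMP audio tag body (sketch).
func packageAudioFrame(aac: Data, isSequenceHeader: Bool) -> Data {
    var body = Data()
    // 0xAF: sound format 10 (AAC), 44 kHz, 16-bit samples, stereo flag
    body.append(0xAF)
    // AAC packet type: 0 = AudioSpecificConfig (sequence header), 1 = raw data
    body.append(isSequenceHeader ? 0x00 : 0x01)
    body.append(aac)
    return body
}
```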
- RTMP-related classes
- SGStreamSession: establishes the TCP connection, reads and sends the underlying data, and reports the connection state via callbacks; it is important throughout the project.
- SGRtmpSession: everything RTMP-related. It is responsible for interacting with the server, including the RTMP handshake, sending commands, and further encapsulating the data into messages before sending. For example, after the handshake completes, the chunk size should be renegotiated: the default of 128 bytes is too small and hurts efficiency, so it is generally raised, to 16 KB here. Making it too large is not good either, because it wastes bandwidth. This class plays roughly the role of the librtmp library.
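The chunk-size renegotiation mentioned above is an ordinary RTMP protocol control message (type ID 1, sent on chunk stream 2, message stream 0). A sketch of building it by hand; the function name is hypothetical:

```swift
import Foundation

// Build an RTMP "Set Chunk Size" message, e.g. makeSetChunkSizeMessage(16 * 1024).
func makeSetChunkSizeMessage(_ size: UInt32) -> Data {
    var msg = Data()
    msg.append(0x02)                            // fmt 0, chunk stream id 2
    msg.append(contentsOf: [0x00, 0x00, 0x00])  // timestamp (3 bytes)
    msg.append(contentsOf: [0x00, 0x00, 0x04])  // payload length: 4 bytes
    msg.append(0x01)                            // message type id 1 = Set Chunk Size
    msg.append(contentsOf: [0x00, 0x00, 0x00, 0x00]) // message stream id 0
    withUnsafeBytes(of: size.bigEndian) { msg.append(contentsOf: $0) } // new size
    return msg
}
```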
That is the basic structure of the whole project. The whole flow works like a factory assembly line, and you can replace and study each module on its own. There are many comments in the demo to make it easier to understand. Does it feel like a lot of information? Some places may not be entirely rigorous; corrections are welcome.
The project was finished around July last year; some odds and ends were added afterwards, and then it was shelved as I moved on to a new project (still live streaming). I wrote several introductory articles along the way and planned a full series, but was too busy to complete it. At the start of the New Year, while work is not too busy, I quickly tidied this up as plain text; if you have any questions, feel free to leave a message.
- Full code attached: github.com/iOSSinger/S…
- Personal blog: www.jianshu.com/u/7246ea6d0…
- RTMP Chinese documentation: raw.githubusercontent.com/iOSSinger/S…
Study blogs:
- Lei Xiaohua's blog: very good technical articles on audio/video development; if you are interested in A/V, take a look: blog.csdn.net/leixiaohua1…
- Hardware encoding in detail: www.jianshu.com/p/a6530fa46…