Preface
This article explains the key knowledge points of H264 video coding 👇 it is roughly divided into three parts, covering the explanation of each concept followed by a hands-on coding section.
I. H264 structure and code stream analysis
1.1 H264 structure diagram
In the H264 structure above, the encoded data of one video image is called a frame. A frame is composed of one or more slices, and a slice is composed of one or more macroblocks (MB). A macroblock consists of 16×16 pixels of YUV data and can be further divided into sub-blocks; the macroblock is the basic unit of H264 coding.
- Field and frame: a field or a frame of video is used to produce one encoded image.
- Slice: within each image, the macroblocks are arranged into slices. Slices are divided into I slices, B slices, P slices and some other types.
  - An I slice contains only I macroblocks, a P slice can contain P and I macroblocks, and a B slice can contain B and I macroblocks.
  - I macroblocks perform intra-frame prediction using already-decoded pixels of the current slice as a reference.
  - P macroblocks perform inter-frame prediction using a previously encoded image as the reference image.
  - B macroblocks perform inter-frame prediction using bidirectional reference images (a preceding and a following frame).
  - The purpose of slices is to limit the spread of bit errors and keep slices independent of each other: the prediction in one slice must not be based on macroblocks in another slice, so that a prediction error in one slice does not propagate to other slices.
- Macroblock: an encoded image is usually divided into several macroblocks. One macroblock consists of a 16×16 block of luma (brightness) pixels together with an 8×8 Cb and an 8×8 Cr chroma block.
1.2 H264 coding layer
The H264 coding architecture is divided into two layers.
- NAL layer (Network Abstraction Layer): when H264 is transmitted over a network, each Ethernet packet carries at most about 1500 bytes, while an H264 frame is often larger than 1500 bytes. The frame therefore has to be unpacked, i.e. split into multiple packets for transmission, and packed back together on the receiving side. All of this unpacking and packing is handled by the NAL layer.
- VCL layer (Video Coding Layer): compresses the raw video data.
1.3 Basic concepts of bit stream
- SODB (String of Data Bits): the raw bit string produced by the VCL layer. Its length is not necessarily a multiple of 8, which makes it awkward to handle.
- RBSP (Raw Byte Sequence Payload, SODB + trailing bits): a trailing bit of 1 is appended to the SODB (a 1 rather than a 0, otherwise the decoder could not tell where the data ends), and then 0 bits are padded until the result is byte aligned.
- EBSP (Encapsulate Byte Sequence Payload): after the compressed stream is generated, a start code is also added before each frame, usually 0x00 00 00 01 in hexadecimal. But two consecutive 0x00 bytes may also appear inside the encoded data itself, which would conflict with the start code. What then? The H264 specification states that whenever two consecutive 0x00 bytes are encountered, an extra 0x03 byte is inserted. This prevents the compressed data from colliding with the start code.
- NALU: NAL Header (1 byte) + EBSP. A NALU is an EBSP with a network header added.
Key points of EBSP decoding
- Each NAL is preceded by a start code 0x00 00 01 (or 0x00 00 00 01). The decoder detects each start code as the start marker of a NAL; when the next start code is detected, the current NAL ends.
- At the same time, H.264 states that detecting 0x00 00 01 also marks the end of the current NAL. So what happens when 0x000001 or 0x000000 occurs inside a NAL? H.264 introduces an emulation-prevention mechanism: if the encoder detects 0x000001 or 0x000000 inside the NAL data, it inserts an extra byte 0x03 in front of the last byte of that pattern, so that when the decoder later detects 0x000003 it discards the 0x03 and restores the original data (the de-escaping operation, sketched in code below).
- When decoding, the decoder first reads the NAL data byte by byte and computes the NAL length, then starts decoding.
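To make the de-escaping operation concrete, here is a minimal sketch of my own (not code from this project): it strips the 0x03 emulation-prevention bytes from an EBSP buffer, assuming the caller provides an output buffer at least as large as the input.

#include <stddef.h>
#include <stdint.h>

// Removes the 0x03 emulation-prevention bytes: every 0x00 0x00 0x03 becomes 0x00 0x00.
// Returns the length of the de-escaped data.
static size_t ebsp_to_rbsp(const uint8_t *ebsp, size_t ebspSize, uint8_t *rbsp) {
    size_t outSize = 0;
    size_t zeroCount = 0;                  // consecutive 0x00 bytes seen so far
    for (size_t i = 0; i < ebspSize; i++) {
        if (zeroCount == 2 && ebsp[i] == 0x03) {
            zeroCount = 0;                 // skip the inserted emulation-prevention byte
            continue;
        }
        zeroCount = (ebsp[i] == 0x00) ? zeroCount + 1 : 0;
        rbsp[outSize++] = ebsp[i];
    }
    return outSize;
}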
1.4 Details on NAL Unit
The detailed structure diagram of NALU is as follows:
- NAL units are composed of a NALU header + a slice.
- Slice can be subdivided into "slice header + slice data".
- Each slice data contains many macro blocks.
- Each macro block contains the type of the macro block, the prediction of the macro block, and the residual data.
H264 code stream layered structure diagram
- Annex-B format data: start code + NAL Unit data
- NAL Unit: NALU header + NALU data
- NALU body: composed of slices; a slice consists of a slice header + slice data
- Slice data: composed of macroblocks
- Macro block: macro block type + PCM data (for PCM macro blocks), or macro block type + prediction mode + residual data
- Residual: the residual block
⚠️ This diagram is important and worth studying carefully.
II. Introduction to VideoToolBox
VideoToolBox is Apple's native hardware-encoding framework, available since iOS 8.0. It uses hardware accelerators and is based on Core Foundation (it is a C-language API).
2.1 Procedure
When using the VideoToolBox framework, the things we need to do include 👇
- Create a session -> set the encoding parameters -> start encoding -> repeatedly feed in source data (YUV data obtained directly from the camera) -> get the encoded H264 data -> end encoding
- Build the H264 file; what is transmitted over the network is essentially this H264 file data.
2.2 Basic data structures
A CMSampleBuffer is different before and after encoding 👇
- Unencoded 👇 the data is stored in a CVPixelBuffer.
- Encoded 👇 the data is stored in a CMBlockBuffer as stream data.
2.3 Coding process
In the figure above, the raw data is encoded to produce an H264 stream. However, that stream cannot be handed to the decoder directly; the decoder can only process data in the H264 file format.
2.4 The H264 file
In the figure above 👇
- SPS and PPS come first: they must be decoded before the data that follows can be parsed.
- Then come the I, B and P frames; refer to 03-video coding ## 7, H264 related concepts.
- No matter which framework you use (VideoToolBox, FFmpeg, hardware or software encoding) and no matter which platform you are on (Mac, Windows or mobile), you must follow the H264 file format.
SPS and PPS:
- SPS (Sequence Parameter Set)
- PPS (Picture Parameter Set)
A general understanding of these is enough for now.
2.5 Determining the frame type: I, B and P
As we know, video is composed of frames, and a frame is composed of one or more slices. During network transmission a slice may be very large, so it has to be split into packets for sending and reassembled after it is received. This raises a question:
How do we identify the type of a frame and distinguish I, B and P frames?
III. NALU unit data in detail
NALU = NAL Header + NAL Body
H264 bit streams are actually transmitted over the network in the form of NALUs. Each NALU consists of a 1-byte header and the RBSP, as shown in the figure below 👇
3.1 NAL Header Parsing
The NAL Header is 1 byte, i.e. 8 bits. What do these 8 bits contain? 👇
- Bit 0: F
- Bits 1-2: NRI
- Bits 3-7: TYPE, which is how the frame type (I frame, B frame, P frame) is determined

F: forbidden_zero_bit. In H264 this first bit must be 0; there is nothing more to it, just remember that.
NRI: indicates the importance of the current NALU. 00 is the least important and 11 the most important, i.e. the larger the value, the more important the NALU. The decoder may discard NALUs of importance 0 when decoding fails. In practice it is of little use.
TYPE: indicates the NAL type. The table below has many entries, but you only need to remember a few common ones 👇 (a small parsing sketch follows after this list)
- 5: IDR image slice (can be understood as an I frame; an I frame is composed of multiple I slices)
- 7: SPS (Sequence Parameter Set)
- 8: PPS (Picture Parameter Set)
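As a quick illustration of the bit layout above, here is a minimal sketch of my own (the function name and the idea of operating on a single header byte are my assumptions, not code from this project):

#include <stdint.h>

// Reads the three fields of the 1-byte NAL header (the byte right after the start code).
static void parse_nal_header(uint8_t nalHeader) {
    uint8_t forbiddenBit = (nalHeader & 0x80) >> 7;  // F: must be 0
    uint8_t nri          = (nalHeader & 0x60) >> 5;  // NRI: importance, 0-3
    uint8_t nalType      =  nalHeader & 0x1F;        // TYPE
    if (nalType == 5) {
        // IDR slice -> this NALU belongs to an I frame
    } else if (nalType == 7) {
        // SPS
    } else if (nalType == 8) {
        // PPS
    }
    (void)forbiddenBit;
    (void)nri;
}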
3.2 NAL Types
- Single NALU type: an RTP packet contains exactly one NALU, i.e. the H264 frame contains only one slice; P and B frames are typically sent this way.
- Aggregation (combination) type: an RTP packet contains multiple NALUs, types 24-27. Units such as SPS and PPS are usually packed this way because they are very small.
- Fragmentation type: one NALU is split across multiple RTP packets, types 28-29.
Single NALU RTP packet
RTP packets that combine NALU
RTP packets of fragmented NALU
First byte: FU Indicator (fragment unit indicator). Second byte: FU Header (fragment unit header). If a NALU is split into multiple fragments, every fragment carries an FU Header.
FU Header:
- S (start bit): marks the start of a fragmented NALU. Packets arrive one by one over the network; we know they are fragments, but how do we tell which packet is the first and which is the last? If S is 1, this packet is the first fragment.
- E (end bit): marks the end of a fragmented NALU.
- R: unused; set it to 0.
- Type: the NAL type of the fragment. Whether the NAL unit is a key frame or a non-key frame, an SPS or a PPS, is judged from this Type.
After the network transmission is complete, the fragments still need to be reassembled into the NALU unit.
Consider: during transmission a frame is cut into multiple fragments. If they arrive out of order, or one of the fragments is lost, how do we judge whether the NALU unit was transmitted completely?
Solution 👇
Use the S/E bits of the FU Header together with the RTP packet header: the RTP header carries a sequence number for every packet. If the packet with the S bit set and the packet with the E bit set have both been received, and the sequence numbers of the packets in between are continuous, the NALU is complete and the fragments can be reassembled; otherwise packets were lost. A small sketch of reading the FU fields follows below.
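To make the FU fields concrete, here is a small sketch of my own (the buffer name and the FU-A layout assumptions are mine, not code from this project) showing how the FU Indicator and FU Header of a fragmented payload could be read:

#include <stdint.h>

// Reads the FU Indicator and FU Header at the start of an FU-A (type 28) RTP payload.
// "payload" is assumed to point at the RTP payload of one fragment.
static void parse_fu_a(const uint8_t *payload) {
    uint8_t fuIndicator = payload[0];
    uint8_t fuHeader    = payload[1];

    uint8_t startBit = (fuHeader & 0x80) >> 7;      // S: 1 on the first fragment
    uint8_t endBit   = (fuHeader & 0x40) >> 6;      // E: 1 on the last fragment
    uint8_t nalType  =  fuHeader & 0x1F;            // original NAL type (5 = IDR, 7 = SPS, 8 = PPS ...)

    if (startBit) {
        // First fragment: the original NAL header can be rebuilt as (fuIndicator & 0xE0) | nalType.
        uint8_t rebuiltNalHeader = (uint8_t)((fuIndicator & 0xE0) | nalType);
        (void)rebuiltNalHeader;                     // a real implementation would start a new NALU here
    }
    if (endBit) {
        // Last fragment: the NALU is complete if every RTP sequence number in between was received.
    }
}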
IV. AVFoundation video data capture (1)
Next we demonstrate in code how to capture video data. You may recall 02-AVFoundation advanced capture, where we implemented video recording based on the system camera without touching video coding, so this demo is different 👇
- Data capture 👇 based on the AVFoundation framework (this should be familiar)
- Video encoding 👇 based on the VideoToolBox framework
The whole process is roughly 👇
Data capture -> encoding -> H264 file -> write to the sandbox / network transmission
4.1 Data Collection
I'm sure you are all familiar with the data capture flow by now, so I will not go through the code in detail here.
- First declare the properties 👇
@interface ViewController () <AVCaptureVideoDataOutputSampleBufferDelegate>

@property (nonatomic, strong) UILabel *cLabel;
@property (nonatomic, strong) AVCaptureSession *cCapturesession;            // capture session
@property (nonatomic, strong) AVCaptureDeviceInput *cCaptureDeviceInput;    // capture input
@property (nonatomic, strong) AVCaptureVideoDataOutput *cCaptureDataOutput; // capture output
@property (nonatomic, strong) AVCaptureVideoPreviewLayer *cPreviewLayer;    // preview layer

@end
Unlike the earlier video-recording example, the output here is an AVCaptureVideoDataOutput, so the delegate protocol to conform to is AVCaptureVideoDataOutputSampleBufferDelegate.
Then we need queues for the two things we do 👇 capture and encoding 👇
@implementation ViewController {
    int frameID;                                // frame ID
    dispatch_queue_t cCaptureQueue;             // capture queue
    dispatch_queue_t cEncodeQueue;              // encode queue
    VTCompressionSessionRef cEncodeingSession;  // encoding session
    CMFormatDescriptionRef format;
    NSFileHandle *fileHandele;                  // file handle for writing to the sandbox
}
viewDidLoad 👇
- (void)viewDidLoad {
    [super viewDidLoad];
    // Do any additional setup after loading the view.
    _cLabel = [[UILabel alloc] initWithFrame:CGRectMake(20, 20, 200, 100)];
    _cLabel.text = @"CC class H.264 hardware encoding";
    _cLabel.textColor = [UIColor redColor];
    [self.view addSubview:_cLabel];

    UIButton *cButton = [[UIButton alloc] initWithFrame:CGRectMake(200, 20, 100, 100)];
    [cButton setTitle:@"play" forState:UIControlStateNormal];
    [cButton setTitleColor:[UIColor whiteColor] forState:UIControlStateNormal];
    [cButton setBackgroundColor:[UIColor orangeColor]];
    [cButton addTarget:self action:@selector(buttonClick:) forControlEvents:UIControlEventTouchUpInside];
    [self.view addSubview:cButton];
}
Next comes the button click event
- (void)buttonClick:(UIButton *)button {
    // Check whether _cCapturesession exists and is currently capturing
    if (!_cCapturesession || !_cCapturesession.isRunning) {
        // Update the button state
        [button setTitle:@"Stop" forState:UIControlStateNormal];
        // Start capturing
        [self startCapture];
    } else {
        [button setTitle:@"Play" forState:UIControlStateNormal];
        // Stop capturing
        [self stopCapture];
    }
}
- Start capturing video 👇
- (void)startCapture {
    self.cCapturesession = [[AVCaptureSession alloc] init];
    // Set the capture resolution to 640x480
    self.cCapturesession.sessionPreset = AVCaptureSessionPreset640x480;
    // Use global queues for capturing and encoding
    cCaptureQueue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    cEncodeQueue  = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    AVCaptureDevice *inputCamera = nil;
    // Get the iPhone video capture devices, e.g. front and back cameras
    NSArray *devices = [AVCaptureDevice devicesWithMediaType:AVMediaTypeVideo];
    for (AVCaptureDevice *device in devices) {
        // Pick the back camera
        if ([device position] == AVCaptureDevicePositionBack) {
            inputCamera = device;
        }
    }

    // Wrap the capture device in an AVCaptureDeviceInput object
    self.cCaptureDeviceInput = [[AVCaptureDeviceInput alloc] initWithDevice:inputCamera error:nil];
    // Check whether the back camera can be added as an input device
    if ([self.cCapturesession canAddInput:self.cCaptureDeviceInput]) {
        // Add the device to the session
        [self.cCapturesession addInput:self.cCaptureDeviceInput];
    }

    self.cCaptureDataOutput = [[AVCaptureVideoDataOutput alloc] init];
    // Do not discard late video frames
    [self.cCaptureDataOutput setAlwaysDiscardsLateVideoFrames:NO];
    // Set the pixel format to YUV 4:2:0
    [self.cCaptureDataOutput setVideoSettings:[NSDictionary dictionaryWithObject:[NSNumber numberWithInt:kCVPixelFormatType_420YpCbCr8BiPlanarFullRange] forKey:(id)kCVPixelBufferPixelFormatTypeKey]];
}
kCVPixelFormatType_420YpCbCr8BiPlanarFullRange means YUV 4:2:0. If you have not seen this before, let's take a closer look at YUV.
V. YUV color details
The color system we are more familiar with is RGB, where each color channel occupies 1 byte. YUV is the one you constantly meet in audio/video development, and its characteristics are 👇
YUV (also known as YCbCr) is a color encoding method used in television systems. Y represents brightness (luma), i.e. the gray-scale value; it is the basic signal. U and V represent chroma; they describe the color and saturation of the image and are used to specify the color of a pixel.
The relationship between YUV and video: The video recorded by the camera is YUV.
5.1 Common YUV formats
- YUV 4:2:0 (YCbCr 4:2:0) 👇 half the size of RGB
- YUV 4:2:2 (YCbCr 4:2:2) 👇 one third smaller than RGB, saving a lot of space
- YUV 4:4:4 (YCbCr 4:4:4) 👇 can be understood as 1:1:1, i.e. every 4 Y values correspond to 4 U and 4 V values
YUV 4:4:4
In 4:4:4 mode all color information is kept, as shown below 👇
Each of the four adjacent pixels A, B, C and D has its own Y, U and V. During chroma subsampling every pixel keeps its own YUV, which is what 4:4:4 means.
YUV 4:2:2
Take four adjacent pixels A (Y0, U0, V0), B (Y1, U1, V1), C (Y2, U2, V2), D (Y3, U3, V3). After subsampling, A keeps (Y0, U0), B keeps (Y1, V1), C keeps (Y2, U2) and D keeps (Y3, V3). That is, every pixel keeps its own Y (brightness), while the U and V values are kept only every other pixel, which finally becomes 👇
In other words, A borrows B's V1, B borrows A's U0, C borrows D's V3, and D borrows C's U2. This is the legendary 4:2:2. A 1280 * 720 image sampled as YUV 4:2:2 takes 👇
(1280 * 720 * 8 + 1280 * 720 * 0.5 * 8 * 2) / 8 / 1024 / 1024 = 1.76 MB
So a YUV 4:2:2 image saves one third of the storage space compared with an RGB image, and the bandwidth used during transmission is reduced accordingly.
YUV 4:2:0
In the 4:2:2 scheme above, the U and V of two horizontally adjacent pixels are borrowed from each other left and right. Can they also be borrowed between the rows above and below? Of course they can 👇
YUV 4:2:0 sampling does not mean sampling only the U component and never the V component. Instead, each row scans only one chroma component (U or V), while the Y component is sampled at a 2:1 ratio within the row.
For example, the first row samples Y and U at 2:1, while the second row samples Y and V at 2:1. For each chroma component, both its horizontal and vertical sampling rates are 2:1 relative to Y. Assuming the first row scans the U component and the second row scans the V component, two rows are needed to complete one set of UV components.
Looking at the mapped pixels, four Y components share one set of UV components, laid out as 2×2 squares. Compared with YUV 4:2:2, where two Y components share a set of UV components, this saves even more space. The size of a 1280 * 720 image sampled at YUV 4:2:0 is:
(1280 * 720 * 8 + 1280 * 720 * 0.25 * 8 * 2) / 8 / 1024 / 1024 = 1.32 MB, which saves half the space compared with the 2.63 MB RGB image. (A small helper that reproduces these calculations follows below.)
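The two size calculations above can be wrapped in a tiny helper; this is just my own sketch, assuming 8 bits per sample:

// Size in bytes of one raw frame; chromaSamplesPerLuma is the number of chroma samples
// (U plus V together) kept per Y sample: 2.0 for 4:4:4, 1.0 for 4:2:2, 0.5 for 4:2:0.
static double rawFrameSizeBytes(int width, int height, double chromaSamplesPerLuma) {
    double lumaBits   = (double)width * height * 8.0;   // 8 bits per Y sample
    double chromaBits = (double)width * height * chromaSamplesPerLuma * 8.0;
    return (lumaBits + chromaBits) / 8.0;               // bits -> bytes
}

// rawFrameSizeBytes(1280, 720, 1.0) / 1024 / 1024  ~= 1.76 MB  (4:2:2)
// rawFrameSizeBytes(1280, 720, 0.5) / 1024 / 1024  ~= 1.32 MB  (4:2:0)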
5.2 YUV storage formats
- Planar formats: the Y of all pixels is stored consecutively, followed by the U of all pixels and then the V of all pixels, e.g. YYYY YYYY UU VV.
  - I420: YYYYYYYY UU VV -> YUV420P (common on PC)
  - YV12: YYYYYYYY VV UU -> YUV420P
- Packed formats: the Y, U and V of each pixel are stored consecutively and interleaved, e.g. YUV YUV YUV YUV, similar to RGB. (NV12 and NV21 below are, strictly speaking, semi-planar: a full Y plane followed by an interleaved UV plane.)
  - NV12: YYYYYYYY UVUV -> YUV420SP
  - NV21: YYYYYYYY VUVU -> YUV420SP
During development, for example when exchanging video between Android and iOS, you may find after decoding that the image appears upside down or flipped. This is most likely because the YUV formats do not match: I420 is common on PC, Android generally defaults to NV21, and iOS defaults to NV12. If the two sides need to interoperate, you have to make sure the storage format is consistent, as in the conversion sketch below.
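As a rough illustration of how such a mismatch could be bridged, here is a minimal NV12-to-I420 conversion sketch of my own (it assumes tightly packed planes and ignores row strides; it is not code from this project):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Converts one NV12 frame (Y plane + interleaved UV plane) into I420 (Y plane + U plane + V plane).
static void nv12_to_i420(const uint8_t *nv12, uint8_t *i420, int width, int height) {
    size_t ySize  = (size_t)width * height;
    size_t uvSize = ySize / 4;                 // each chroma plane is a quarter of the Y plane

    memcpy(i420, nv12, ySize);                 // the Y plane is identical in both formats

    const uint8_t *uvSrc = nv12 + ySize;       // interleaved UVUVUV...
    uint8_t *uDst = i420 + ySize;              // destination U plane
    uint8_t *vDst = i420 + ySize + uvSize;     // destination V plane
    for (size_t i = 0; i < uvSize; i++) {
        uDst[i] = uvSrc[2 * i];                // even bytes are U
        vDst[i] = uvSrc[2 * i + 1];            // odd bytes are V
    }
}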
VI. AVFoundation video data capture (2)
Now that we understand the YUV color system, let's continue and finish the video capture flow 👇
- (void)startCapture {
    self.cCapturesession = [[AVCaptureSession alloc] init];
    // Set the capture resolution to 640x480
    self.cCapturesession.sessionPreset = AVCaptureSessionPreset640x480;
    // Use global queues for capturing and encoding
    cCaptureQueue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    cEncodeQueue  = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    AVCaptureDevice *inputCamera = nil;
    // Get the iPhone video capture devices, e.g. front and back cameras
    NSArray *devices = [AVCaptureDevice devicesWithMediaType:AVMediaTypeVideo];
    for (AVCaptureDevice *device in devices) {
        // Pick the back camera
        if ([device position] == AVCaptureDevicePositionBack) {
            inputCamera = device;
        }
    }

    // Wrap the capture device in an AVCaptureDeviceInput object
    self.cCaptureDeviceInput = [[AVCaptureDeviceInput alloc] initWithDevice:inputCamera error:nil];
    // Check whether the back camera can be added as an input device
    if ([self.cCapturesession canAddInput:self.cCaptureDeviceInput]) {
        // Add the device to the session
        [self.cCapturesession addInput:self.cCaptureDeviceInput];
    }

    self.cCaptureDataOutput = [[AVCaptureVideoDataOutput alloc] init];
    // Do not discard late video frames
    [self.cCaptureDataOutput setAlwaysDiscardsLateVideoFrames:NO];
    // Set the pixel format to YUV 4:2:0
    [self.cCaptureDataOutput setVideoSettings:[NSDictionary dictionaryWithObject:[NSNumber numberWithInt:kCVPixelFormatType_420YpCbCr8BiPlanarFullRange] forKey:(id)kCVPixelBufferPixelFormatTypeKey]];
    // Set the capture delegate and the capture queue
    [self.cCaptureDataOutput setSampleBufferDelegate:self queue:cCaptureQueue];
    // Check whether the output can be added
    if ([self.cCapturesession canAddOutput:self.cCaptureDataOutput]) {
        // Add the output
        [self.cCapturesession addOutput:self.cCaptureDataOutput];
    }

    // Create a connection
    AVCaptureConnection *connection = [self.cCaptureDataOutput connectionWithMediaType:AVMediaTypeVideo];
    // Set the video orientation
    [connection setVideoOrientation:AVCaptureVideoOrientationPortrait];

    // Initialize the preview layer
    self.cPreviewLayer = [[AVCaptureVideoPreviewLayer alloc] initWithSession:self.cCapturesession];
    // Set the video gravity
    [self.cPreviewLayer setVideoGravity:AVLayerVideoGravityResizeAspect];
    // Set the layer frame
    [self.cPreviewLayer setFrame:self.view.bounds];
    // Add the preview layer
    [self.view.layer addSublayer:self.cPreviewLayer];

    // File path for writing to the sandbox
    NSString *filePath = [[NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES) lastObject] stringByAppendingPathComponent:@"cc_video.h264"];
    // Remove any existing file first
    [[NSFileManager defaultManager] removeItemAtPath:filePath error:nil];
    // Create a new file
    BOOL createFile = [[NSFileManager defaultManager] createFileAtPath:filePath contents:nil attributes:nil];
    if (!createFile) {
        NSLog(@"create file failed");
    } else {
        NSLog(@"create file success");
    }
    NSLog(@"filePath = %@", filePath);
    fileHandele = [NSFileHandle fileHandleForWritingAtPath:filePath];

    // Initialize VideoToolBox
    [self initVideoToolBox];
    // Start capturing
    [self.cCapturesession startRunning];
}
VII. Configuration of VideoToolBox video coding parameters
Next comes the initialization of VideoToolBox, including the configuration of the video encoding parameters. The things to do are 👇
- Create an encoding session 👇 cEncodeingSession
- Configure the encoding parameters
7.1 Creating an encoding session
The encoding session is created with the C function VTCompressionSessionCreate 👇
The meaning of each parameter 👇
- Parameter 1: the allocator; pass NULL for the default allocator.
- Parameter 2: the width of the resolution, in pixels; if the value is invalid, the system changes it to a reasonable one.
- Parameter 3: the height of the resolution; same as above.
- Parameter 4: the encoding type, e.g. kCMVideoCodecType_H264.
- Parameter 5: the encoding specification; passing NULL lets VideoToolbox choose.
- Parameter 6: the source pixel buffer attributes; pass NULL so that VideoToolbox does not create a pixel buffer pool and you provide the buffers yourself.
- Parameter 7: the compressed data allocator; pass NULL for the default allocator.
- Parameter 8: the callback function; it is called asynchronously after each VTCompressionSessionEncodeFrame call finishes compressing a frame. ⚠️ Note: if you pass NULL here, you need to call VTCompressionSessionEncodeFrameWithOutputHandler to process the compressed frames instead, which requires iOS 9.0 or later.
- Parameter 9: a client-defined reference value passed to the callback; we bridge self here so that the C callback can call OC methods.
- Parameter 10: the encoding session variable (output).
7.2 Configuring the encoding parameters
The encoding parameters are configured with the C function VTSessionSetProperty 👇
This function is simple; its parameters are 👇
- Parameter 1: the object being configured, i.e. cEncodeingSession
- Parameter 2: the property name
- Parameter 3: the property value
7.3 Complete Initialization Code
// Initialize VideoToolBox
- (void)initVideoToolBox {
    dispatch_sync(cEncodeQueue, ^{
        frameID = 0;
        // Resolution: the same as the AVFoundation capture resolution
        int width = 480, height = 640;

        // 1. Call VTCompressionSessionCreate to create the encoding session
        OSStatus status = VTCompressionSessionCreate(NULL, width, height, kCMVideoCodecType_H264, NULL, NULL, NULL, didCompressH264, (__bridge void *)(self), &cEncodeingSession);
        NSLog(@"H264:VTCompressionSessionCreate:%d", (int)status);
        if (status != 0) {
            NSLog(@"H264:Unable to create a H264 session");
            return;
        }

        // 2. Configure the parameters
        // Set real-time encoding output (to avoid delay)
        VTSessionSetProperty(cEncodeingSession, kVTCompressionPropertyKey_RealTime, kCFBooleanTrue);
        // Use the Baseline profile (drops B frames)
        VTSessionSetProperty(cEncodeingSession, kVTCompressionPropertyKey_ProfileLevel, kVTProfileLevel_H264_Baseline_AutoLevel);
        // Do not produce B frames (no frame reordering)
        VTSessionSetProperty(cEncodeingSession, kVTCompressionPropertyKey_AllowFrameReordering, kCFBooleanFalse);

        // Set the key frame (GOP size) interval
        int frameInterval = 10;
        /* CFNumberCreate(CFAllocatorRef allocator, CFNumberType theType, const void *valuePtr)
           allocator: kCFAllocatorDefault for the default allocator
           theType:   the data type
           valuePtr:  pointer to the value */
        CFNumberRef frameIntervalRaf = CFNumberCreate(kCFAllocatorDefault, kCFNumberIntType, &frameInterval);
        VTSessionSetProperty(cEncodeingSession, kVTCompressionPropertyKey_MaxKeyFrameInterval, frameIntervalRaf);

        // Set the expected frame rate (not the actual frame rate)
        int fps = 10;
        CFNumberRef fpsRef = CFNumberCreate(kCFAllocatorDefault, kCFNumberIntType, &fps);
        VTSessionSetProperty(cEncodeingSession, kVTCompressionPropertyKey_ExpectedFrameRate, fpsRef);

        // Bit rate: a high bit rate is clearer, but the file is larger
        int bitRate = width * height * 3 * 4 * 8;
        CFNumberRef bitRateRef = CFNumberCreate(kCFAllocatorDefault, kCFNumberSInt32Type, &bitRate);
        VTSessionSetProperty(cEncodeingSession, kVTCompressionPropertyKey_AverageBitRate, bitRateRef);

        // Set the rate limit, in bytes
        int bigRateLimit = width * height * 3 * 4;
        CFNumberRef bitRateLimitRef = CFNumberCreate(kCFAllocatorDefault, kCFNumberSInt32Type, &bigRateLimit);
        VTSessionSetProperty(cEncodeingSession, kVTCompressionPropertyKey_DataRateLimits, bitRateLimitRef);

        // Start encoding
        VTCompressionSessionPrepareToEncodeFrames(cEncodeingSession);
    });
}
For the bit-rate calculation formula, refer to the figure below 👇
VIII. AVFoundation video data capture (3)
There are two steps left in the video capture process: stopping the capture and preparing for video encoding.
8.1 Stopping Capture
Before using VideoToolBox for video encoding, let's go back to the capture flow. We have already implemented startCapture; stopCapture is still missing 👇
- (void)stopCapture {
    // Stop the capture session
    [self.cCapturesession stopRunning];
    // Remove the preview layer
    [self.cPreviewLayer removeFromSuperlayer];
    // End the VideoToolBox session
    [self endVideoToolBox];
    // Close the file
    [fileHandele closeFile];
    fileHandele = NULL;
}
The code that ends VideoToolBox is 👇
-(void)endVideoToolBox {
    VTCompressionSessionCompleteFrames(cEncodeingSession, kCMTimeInvalid);
    VTCompressionSessionInvalidate(cEncodeingSession);
    CFRelease(cEncodeingSession);
    cEncodeingSession = NULL;
}
8.2 Video coding preparation
The preparation, as you can probably guess, happens in the delegate method of the output. The output we use is AVCaptureVideoDataOutput, its delegate protocol is AVCaptureVideoDataOutputSampleBufferDelegate, and the method that delivers the video stream is 👇
- (void)captureOutput:(AVCaptureOutput *)captureOutput didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection {
    // Called for every captured camera frame; hand the unencoded / uncompressed frame to the encode queue
    dispatch_sync(cEncodeQueue, ^{
        [self encode:sampleBuffer];
    });
}
But there is a problem: both the video and the audio data captured by AVFoundation end up in delegate methods like this one. So how do we tell video data apart from audio data? 👇
Through the captureOutput object: check whether it is the AVCaptureVideoDataOutput or the AVCaptureAudioDataOutput, as sketched below.
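A minimal sketch of that check, using this demo's own output property (the audio branch is hypothetical here, since this demo only captures video):

- (void)captureOutput:(AVCaptureOutput *)captureOutput didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection {
    if (captureOutput == self.cCaptureDataOutput) {
        // video frame -> hand it to the video encoder
    } else if ([captureOutput isKindOfClass:[AVCaptureAudioDataOutput class]]) {
        // audio sample -> hand it to an audio encoder (hypothetical in this demo)
    }
}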
IX. VideoToolBox video coding implementation (1)
9.1 Encoding Function
Like session creation, the video encoding function VTCompressionSessionEncodeFrame is a C function 👇
Its parameters are defined as 👇
- Parameter 1: the encoding session variable.
- Parameter 2: the unencoded data.
- Parameter 3: the presentation timestamp of the data obtained from the sample buffer. Every timestamp passed to the session must be greater than the previous one.
- Parameter 4: the duration of the frame; if no time information is available, pass kCMTimeInvalid.
- Parameter 5: frameProperties: properties of this frame. A change here affects subsequent encoded frames.
- Parameter 6: sourceFrameRefCon: a reference of your own that the callback receives for this frame.
- Parameter 7: infoFlagsOut: points to a VTEncodeInfoFlags that receives information about the encode operation: kVTEncodeInfo_Asynchronous is set if the encode ran asynchronously, kVTEncodeInfo_FrameDropped is set if the frame was dropped; pass NULL if you do not want this information.
9.2 Video encoding: encode
- (void)encode:(CMSampleBufferRef)sampleBuffer {
    CVImageBufferRef imageBuffer = (CVImageBufferRef)CMSampleBufferGetImageBuffer(sampleBuffer);
    // Set the frame time; otherwise the timeline will be wrong
    CMTime presentationTimeStamp = CMTimeMake(frameID++, 1000);
    VTEncodeInfoFlags flags;
    // The encoding function
    OSStatus statusCode = VTCompressionSessionEncodeFrame(cEncodeingSession,
                                                          imageBuffer,
                                                          presentationTimeStamp,
                                                          kCMTimeInvalid,
                                                          NULL,
                                                          NULL,
                                                          &flags);
    if (statusCode != noErr) {
        NSLog(@"H.264: VTCompressionSessionEncodeFrame failed with %d", (int)statusCode);
        // End encoding
        VTCompressionSessionInvalidate(cEncodeingSession);
        CFRelease(cEncodeingSession);
        cEncodeingSession = NULL;
        return;
    }
    NSLog(@"H264: VTCompressionSessionEncodeFrame Success");
}
At this point the encoding call is in place, and two questions remain 👇
- Where can I get successfully encoded H264 stream data?
- What do you do after you’ve got the data encoded successfully?
9.3 The encoding-complete callback
To answer question 1: when we created the session cEncodeingSession, we specified the callback function didCompressH264, and that is where the H264 stream data is obtained 👇
void didCompressH264(void *outputCallbackRefCon, void *sourceFrameRefCon, OSStatus status, VTEncodeInfoFlags infoFlags, CMSampleBufferRef sampleBuffer)
Remember the H264 file format we talked about earlier? See below 👇
In the NALU stream, units 0 and 1 are the SPS and PPS, which carry many key parameters, so they have to be handled first; and to obtain the SPS and PPS we first need a key frame. That is exactly question 2: what to do once we have the successfully encoded data.
9.3.1 Judging key frames
It takes roughly three steps 👇
- Get the attachments array from the sampleBuffer:
CFArrayRef array = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer, true);
- Get the dictionary at index 0 from the array:
CFDictionaryRef dic = CFArrayGetValueAtIndex(array, 0);
- Determine whether it is a key frame:
bool isKeyFrame = !CFDictionaryContainsKey(dic, kCMSampleAttachmentKey_NotSync);
9.3.2 The C function for obtaining SPS/PPS
- Parameter 1: the format description of the encoded image (CMFormatDescriptionRef)
- Parameter 2: the index, 0 for the SPS and 1 for the PPS
- Parameters 3, 4 and 5: addresses that receive the SPS/PPS parameter data, its size and the parameter set count
- Parameter 6: output information; 0 is passed by default
9.3.3 Generating H264 files
void didCompressH264(void *outputCallbackRefCon, void *sourceFrameRefCon, OSStatus status, VTEncodeInfoFlags infoFlags, CMSampleBufferRef sampleBuffer) {
    NSLog(@"didCompressH264 called with status %d infoFlags %d", (int)status, (int)infoFlags);
    // Status error
    if (status != 0) {
        return;
    }
    // Data not ready yet
    if (!CMSampleBufferDataIsReady(sampleBuffer)) {
        NSLog(@"didCompressH264 data is not ready");
        return;
    }
    // Convert the ref (the self we bridged earlier) back to the ViewController
    ViewController *encoder = (__bridge ViewController *)outputCallbackRefCon;

    // Judge whether the current frame is a key frame
    bool keyFrame = !CFDictionaryContainsKey((CFDictionaryRef)CFArrayGetValueAtIndex(CMSampleBufferGetSampleAttachmentsArray(sampleBuffer, true), 0), kCMSampleAttachmentKey_NotSync);

    // The SPS (Sequence Parameter Set) and PPS only need to be obtained once, from a key frame
    if (keyFrame) {
        // Format description of the encoded data
        CMFormatDescriptionRef format = CMSampleBufferGetFormatDescription(sampleBuffer);

        // Get the SPS (index 0) from the key frame
        size_t sparameterSetSize, sparameterSetCount;
        const uint8_t *sparameterSet;
        OSStatus statusCode = CMVideoFormatDescriptionGetH264ParameterSetAtIndex(format, 0, &sparameterSet, &sparameterSetSize, &sparameterSetCount, 0);
        if (statusCode == noErr) {
            // Get the PPS (index 1) from the key frame
            size_t pparameterSetSize, pparameterSetCount;
            const uint8_t *pparameterSet;
            OSStatus statusCode = CMVideoFormatDescriptionGetH264ParameterSetAtIndex(format, 1, &pparameterSet, &pparameterSetSize, &pparameterSetCount, 0);
            // SPS and PPS obtained successfully; prepare to write them to the file
            if (statusCode == noErr) {
                // SPS & PPS -> NSData
                NSData *sps = [NSData dataWithBytes:sparameterSet length:sparameterSetSize];
                NSData *pps = [NSData dataWithBytes:pparameterSet length:pparameterSetSize];
                if (encoder) {
                    // Write them to the file
                    [encoder gotSpsPps:sps pps:pps];
                }
            }
        }
    }
    // There are other operations...
}
The implementation of gotSpsPps:pps: 👇
Here the start code 00 00 00 01 is added in front of each parameter set (the method signature below is filled in to match the call [encoder gotSpsPps:sps pps:pps] above):
NSLog(@" gotspp %d %d",(int)[SPS length],(int)[PPS length]); // Add start bit 00 00 00 01 const char bytes[] = "\x00\x00\x00\x01"; Size_t length = (sizeof bytes) -1; NSData *ByteHeader = [NSData dataWithBytes:bytes length:length]; [fileHandele writeData:ByteHeader]; [fileHandele writeData:sps]; [fileHandele writeData:ByteHeader]; [fileHandele writeData:pps]; }Copy the code
X. VideoToolBox video coding implementation (2)
The SPS/PPS handling is done. Next comes the NALU stream data, i.e. the CMBlockBuffer shown below 👇
The CMBlockBuffer holds the encoded data stream; we need to extract it and convert it into the H264 file format.
10.1 Getting the CMBlockBuffer
Again a C function, of course 👇
Very simple, just one line 👇
CMBlockBufferRef dataBuffer = CMSampleBufferGetDataBuffer(sampleBuffer);
We can think of dataBuffer as an array that we need to traverse to get at the data inside. How do we traverse it? Three things are required 👇
- The length of a single element
- The length of the total data
- The starting address
These are obtained with another C function 👇
CMBlockBufferRef dataBuffer = CMSampleBufferGetDataBuffer(sampleBuffer);
size_t length, totalLength;
char *dataPointer;
// With the single element length, the total length of the NALU stream and the start address,
// we can traverse the whole data stream -- think of it as an array
OSStatus statusCodeRet = CMBlockBufferGetDataPointer(dataBuffer, 0, &length, &totalLength, &dataPointer);
if (statusCodeRet == noErr) {
    // traverse the NALU data -- see 10.3 below
}
10.2 Big-endian and little-endian byte order
Before traversing the data there is one thing to consider 👇 big-endian versus little-endian byte order.
In computer hardware, data can be stored in two byte orders: big-endian and little-endian.
- Big-endian byte order: the high-order byte comes first and the low-order byte comes last.
- Little-endian byte order: the low-order byte comes first and the high-order byte comes last.
For example, for the hexadecimal value 0x01234567, the big-endian byte order is 01 23 45 67 and the little-endian byte order is 67 45 23 01.
Why do we have little endian order?
Because computer circuits process the low-order byte first, little-endian is more efficient, so computers work in little-endian order internally, while human reading and writing habits match big-endian. Therefore, apart from the computer's internal representation, data is generally kept in big-endian order.
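A concrete illustration of my own (the length value 25 is made up): if the 4-byte length prefix of a NALU is 00 00 00 19, reading it directly into a uint32_t on a little-endian CPU yields 0x19000000, which is why a byte swap is needed before using it. This is the same CFSwapInt32BigToHost call used in the traversal code in 10.3.

#include <stdint.h>
#include <string.h>
#include <CoreFoundation/CoreFoundation.h>

// Reads the 4-byte big-endian length prefix in front of a NALU.
// For the bytes 00 00 00 19 the swap restores the value 25.
static uint32_t readNaluLength(const uint8_t *lengthPrefix) {
    uint32_t rawValue = 0;
    memcpy(&rawValue, lengthPrefix, sizeof(rawValue));
    return CFSwapInt32BigToHost(rawValue);
}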
10.3 Traversing the NALU data
There are two ways to traverse: advance a pointer with p++, or advance an offset by a step size. We use the latter here; the code is 👇
size_t bufferOffset = 0;
static const int AVCCHeaderLength = 4;
// The first 4 bytes of the returned NALU data are not the 00 00 00 01 start code,
// but the frame length in big-endian byte order.
while (bufferOffset < totalLength - AVCCHeaderLength) {
    uint32_t NALUnitLength = 0;
    // Read the length of this NALU
    memcpy(&NALUnitLength, dataPointer + bufferOffset, AVCCHeaderLength);
    // Convert from big-endian to host (little-endian) byte order
    NALUnitLength = CFSwapInt32BigToHost(NALUnitLength);
    NSData *data = [[NSData alloc] initWithBytes:(dataPointer + bufferOffset + AVCCHeaderLength) length:NALUnitLength];
    // Write the NALU data to the file
    [encoder gotEncodedData:data isKeyFrame:keyFrame];
    // Move to the next NALU
    bufferOffset += AVCCHeaderLength + NALUnitLength;
}
10.4 The full version of didCompressH264
The complete code 👇
void didCompressH264(void *outputCallbackRefCon, void *sourceFrameRefCon, OSStatus status, VTEncodeInfoFlags infoFlags, CMSampleBufferRef sampleBuffer) {
    NSLog(@"didCompressH264 called with status %d infoFlags %d", (int)status, (int)infoFlags);
    // Status error
    if (status != 0) {
        return;
    }
    // Data not ready yet
    if (!CMSampleBufferDataIsReady(sampleBuffer)) {
        NSLog(@"didCompressH264 data is not ready");
        return;
    }
    ViewController *encoder = (__bridge ViewController *)outputCallbackRefCon;

    // Judge whether the current frame is a key frame
    bool keyFrame = !CFDictionaryContainsKey((CFDictionaryRef)CFArrayGetValueAtIndex(CMSampleBufferGetSampleAttachmentsArray(sampleBuffer, true), 0), kCMSampleAttachmentKey_NotSync);

    // The SPS (Sequence Parameter Set) and PPS only need to be obtained once, from a key frame
    if (keyFrame) {
        // Format description of the encoded data
        CMFormatDescriptionRef format = CMSampleBufferGetFormatDescription(sampleBuffer);

        // Get the SPS (index 0) from the key frame
        size_t sparameterSetSize, sparameterSetCount;
        const uint8_t *sparameterSet;
        OSStatus statusCode = CMVideoFormatDescriptionGetH264ParameterSetAtIndex(format, 0, &sparameterSet, &sparameterSetSize, &sparameterSetCount, 0);
        if (statusCode == noErr) {
            // Get the PPS (index 1) from the key frame
            size_t pparameterSetSize, pparameterSetCount;
            const uint8_t *pparameterSet;
            OSStatus statusCode = CMVideoFormatDescriptionGetH264ParameterSetAtIndex(format, 1, &pparameterSet, &pparameterSetSize, &pparameterSetCount, 0);
            // SPS and PPS obtained successfully; prepare to write them to the file
            if (statusCode == noErr) {
                // SPS & PPS -> NSData
                NSData *sps = [NSData dataWithBytes:sparameterSet length:sparameterSetSize];
                NSData *pps = [NSData dataWithBytes:pparameterSet length:pparameterSetSize];
                if (encoder) {
                    // Write them to the file
                    [encoder gotSpsPps:sps pps:pps];
                }
            }
        }
    }

    CMBlockBufferRef dataBuffer = CMSampleBufferGetDataBuffer(sampleBuffer);
    size_t length, totalLength;
    char *dataPointer;
    // With the single element length, the total length of the NALU stream and the start address,
    // we can traverse the whole data stream -- think of it as an array
    OSStatus statusCodeRet = CMBlockBufferGetDataPointer(dataBuffer, 0, &length, &totalLength, &dataPointer);
    if (statusCodeRet == noErr) {
        size_t bufferOffset = 0;
        static const int AVCCHeaderLength = 4;
        // The first 4 bytes of the returned NALU data are not the 00 00 00 01 start code,
        // but the frame length in big-endian byte order.
        while (bufferOffset < totalLength - AVCCHeaderLength) {
            uint32_t NALUnitLength = 0;
            // Read the length of this NALU
            memcpy(&NALUnitLength, dataPointer + bufferOffset, AVCCHeaderLength);
            // Convert from big-endian to host (little-endian) byte order
            NALUnitLength = CFSwapInt32BigToHost(NALUnitLength);
            NSData *data = [[NSData alloc] initWithBytes:(dataPointer + bufferOffset + AVCCHeaderLength) length:NALUnitLength];
            // Write the NALU data to the file
            [encoder gotEncodedData:data isKeyFrame:keyFrame];
            // Move to the next NALU
            bufferOffset += AVCCHeaderLength + NALUnitLength;
        }
    }
}
Then comes the implementation of gotEncodedData:isKeyFrame: 👇
- (void)gotEncodedData:(NSData *)data isKeyFrame:(BOOL)isKeyFrame {
    NSLog(@"gotEncodedData %d", (int)[data length]);
    if (fileHandele != NULL) {
        /* Each NAL is preceded by the start code 0x00 00 00 01; the decoder detects the next
           start code to know where the current NAL ends.
           To prevent 0x000001 from occurring inside a NAL, H.264 uses the "emulation prevention"
           mechanism: if two consecutive 0x00 bytes are detected inside a NAL, a 0x03 is inserted;
           when the decoder sees 0x000003 inside a NAL, it discards the 0x03 and restores the data.
           Generally speaking, there are two ways to package an H264 stream. One is the Annex-B
           byte stream format, the default output of most encoders: the first 3-4 bytes of each
           frame are the start code 0x00000001 or 0x000001. In the other packaging format, the
           first few bytes (1, 2 or 4) are the length of the NAL instead of a start code; in that
           case we must use some global data (the encoder's profile, level, SPS, PPS and so on)
           before decoding. */
        const char bytes[] = "\x00\x00\x00\x01";
        // sizeof bytes includes the trailing \0 terminator, so subtract 1
        size_t length = (sizeof bytes) - 1;
        NSData *ByteHeader = [NSData dataWithBytes:bytes length:length];
        // Write the start code
        [fileHandele writeData:ByteHeader];
        // Write the H264 data
        [fileHandele writeData:data];
    }
}
Conclusion
- H264 structure and code stream analysis
  - The H264 structure
    - An encoded video image 👇 a frame; one or more slices compose a frame; one or more macroblocks (MB) compose a slice
  - H264 coding layers
    - NAL layer (Network Abstraction Layer)
    - VCL layer (Video Coding Layer)
  - Bit stream
    - SODB (String of Data Bits, the raw data bits)
    - RBSP (Raw Byte Sequence Payload, SODB + trailing bits)
    - EBSP (Encapsulate Byte Sequence Payload)
    - NALU: NAL Header (1B) + EBSP 👇 this is the focus
  - NAL Unit
    - NAL Unit = one NALU header + one slice
    - Slice = slice header + slice data
    - Slice data = macroblock + ... + macroblock
    - Macroblock = type + prediction + residual data
- VideoToolBox
  - A native hardware-encoding framework introduced in iOS 8.0, based on Core Foundation and written in C
  - Basic data structure 👇 CMSampleBuffer
    - Unencoded 👇 CVPixelBuffer
    - Encoded 👇 CMBlockBuffer
  - Coding process 👇 CVPixelBuffer -> video encoder -> CMBlockBuffer -> H264 file format
  - The H264 file
    - The H264 file format is NALU stream data
    - The sequence of frames 👇 SPS + PPS + I/B/P frames
    - To identify I, B and P frames: convert the hexadecimal byte to binary, take bits 4-8 (the last five bits), convert them to decimal, and look the result up in the type table
- NALU unit data in detail
  - NALU = NAL Header (1 byte) + NAL Body
  - NAL Header parsing: 1 byte, i.e. 8 bits
    - Bit 0: F, must be 0
    - Bits 1-2: NRI, the importance 👇 00 is least important, 11 most important
    - Bits 3-7: TYPE, used to determine the frame type (I, B, P)
      - 5 indicates an I frame (IDR slice)
      - 7 indicates the SPS sequence parameter set
      - 8 indicates the PPS picture parameter set
  - NAL types (over RTP)
    - Single type: an RTP packet contains only one NALU, i.e. the H264 frame contains only one slice
    - Aggregation (combination) type: an RTP packet contains multiple NALUs, such as PPS and SPS
    - Fragmentation type: one NALU is split into multiple RTP packets
      - First byte: FU Indicator, the fragment unit indicator
      - Second byte: FU Header, the fragment unit header (one per fragment)
        - S: start bit, marks the start of the fragmented NALU
        - E: end bit, marks the end of the fragmented NALU
        - R: unused, set to 0
        - Type: the NAL type of the fragment; whether it is a key frame or a non-key frame, an SPS or a PPS, is judged from this
  - How to tell that a NALU was transmitted completely
    - The packets carrying the S bit and the E bit have been received
    - The sequence numbers of the packets in between are continuous
- YUV color system
  - Also known as YCbCr, a color encoding method used in television systems. Y is brightness (the gray-scale value), the basic signal; U and V are chroma, describing the color and saturation of the image and specifying the pixel color
  - Common YUV formats
    - YUV 4:2:0 (YCbCr 4:2:0) 👇 half the size of RGB
    - YUV 4:2:2 (YCbCr 4:2:2) 👇 one third smaller than RGB
    - YUV 4:4:4 (YCbCr 4:4:4) 👇 can be understood as 1:1:1
  - YUV storage formats
    - Planar formats
      - I420: YUV420P (common on PC)
      - YV12: YUV420P
    - Packed formats
      - NV12: YUV420SP (iOS default)
      - NV21: YUV420SP (Android default)
- AVFoundation video data capture
  - The whole process 👇 data capture -> encoding -> H264 file -> write to the sandbox / network transmission
  - Data capture 👇 based on the AVFoundation framework
    - The output source is AVCaptureVideoDataOutput, and we conform to AVCaptureVideoDataOutputSampleBufferDelegate
    - Queues: two things are done 👇 capture and encoding
    - The pixel format of the captured video is YUV 4:2:0: kCVPixelBufferPixelFormatTypeKey : kCVPixelFormatType_420YpCbCr8BiPlanarFullRange
  - Video encoding 👇 based on the VideoToolBox framework
    - Initialize VideoToolBox
      - Create an encoding session 👇 VTCompressionSessionCreate
      - Configure the encoding parameters 👇 VTSessionSetProperty
        - Real-time encoding: kVTCompressionPropertyKey_RealTime
        - Profile/level (Baseline drops B frames): kVTCompressionPropertyKey_ProfileLevel
        - Whether to produce B frames (frame reordering): kVTCompressionPropertyKey_AllowFrameReordering
        - Key frame (GOP size) interval: kVTCompressionPropertyKey_MaxKeyFrameInterval
        - Expected frame rate: kVTCompressionPropertyKey_ExpectedFrameRate
        - Rate limit: kVTCompressionPropertyKey_DataRateLimits
        - Average bit rate: kVTCompressionPropertyKey_AverageBitRate
- VideoToolBox video coding
  - Stop capturing
    - Stop the capture session
    - Remove the preview layer
    - End VideoToolBox
    - Close the file
  - Pre-encoding preparation
    - The encoding entry point 👇 the AVCaptureVideoDataOutputSampleBufferDelegate method
      -(void)captureOutput:(AVCaptureOutput *)captureOutput didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer fromConnection:(AVCaptureConnection *)connection
  - Encoding
    - Get each unencoded frame 👇 CMSampleBufferGetImageBuffer
    - Encoding function 👇 VTCompressionSessionEncodeFrame
  - Getting the successfully encoded H264 stream data
    - The encoding-complete callback 👇 the callback function specified in VTCompressionSessionCreate
      - Get the attachments array from the sampleBuffer 👇 CMSampleBufferGetSampleAttachmentsArray
      - Get the CFDictionaryRef at index 0 from the array
      - Judge the key frame: !CFDictionaryContainsKey(dic, kCMSampleAttachmentKey_NotSync)
  - Generating the H264 file format
    - Get the SPS/PPS 👇 CMVideoFormatDescriptionGetH264ParameterSetAtIndex
    - Write them to the file
      - Build NSData from the size and the address pointer
      - Configure the header
        - Add the start code "\x00\x00\x00\x01"
        - Drop the trailing \0 terminator
      - Write order 👇 header + spsData + header + ppsData
    - Get the CMBlockBuffer 👇 CMSampleBufferGetDataBuffer
    - Traverse the CMBlockBuffer to obtain the NALU data
      - Single element length + total data length + starting address, traversed by offset
      - Convert from big-endian to the host's little-endian byte order (the Mac is little-endian by default)
    - Write the NALU data to the file
      - Configure the header just as for SPS/PPS
      - Write order 👇 header + NALData
CC teacher _HelloCoder audio and video learning from zero to whole