Preface

Audio and video development: popular but intimidating

Today, short video apps are booming. With their rise, audio and video development is receiving more and more attention. However, because it spans a wide range of knowledge, the barrier to entry is relatively high, which scares many developers away.

Why write this series of posts

There are plenty of learning roadmaps for audio and video development online, but the knowledge is scattered across relatively independent topics: some posts cover "audio and video decoding", some cover "OpenGL", and others cover "FFmpeg". For a newcomer, piecing all of this together into one coherent understanding is very difficult.

While learning audio and video development myself, I deeply felt the confusion and pain caused by this scattered knowledge. So I hope, through my own understanding, to summarize the relevant knowledge into a series of articles that works through each step of the process. First, it lets me consolidate what I have learned; second, I hope it helps those who want to get started with audio and video development.

[Disclaimer]

First of all, this series is based on my own understanding and practice; there may be mistakes, and corrections are welcome.

Secondly, this is an introductory series that covers only the essentials; for deeper topics there are many posts online. Finally, while writing I refer to articles shared by others and list them at the end of each post; thanks to those authors for sharing.

Writing is not easy; please credit the source when reposting!

Tutorial code: [GitHub]

Contents

1. Android audio and video hard decoding:
  • 1. Basic audio and video knowledge
  • 2. The audio and video hard decoding process: packaging a basic decoding framework
  • 3. Audio and video playback: audio and video synchronization
  • 4. Audio and video demuxing and remuxing: generating an MP4
2. Using OpenGL to render video frames:
  • 1. A first look at OpenGL ES
  • 2. Rendering video images with OpenGL
  • 3. Rendering multiple videos with OpenGL: picture-in-picture
  • 4. A closer look at OpenGL's EGL
  • 5. OpenGL FBO data buffers
  • 6. Android audio and video hard encoding: generating an MP4
3. Android FFmpeg audio and video decoding:
  • 1. Compiling the FFmpeg .so libraries
  • 2. Integrating FFmpeg into Android
  • 3. Android FFmpeg video decoding and playback
  • 4. Android FFmpeg + OpenSL ES audio decoding and playback
  • 5. Android FFmpeg + OpenGL ES video playback
  • 6. Android FFmpeg simple MP4 synthesis: video demuxing and remuxing
  • 7. Android FFmpeg video encoding

What you can learn from this article

As an introductory article, this one looks at what audio and video are made of, along with some common terms and concepts.

1. What is video?

Flip books

You may have played with flip books as a child: when you flip the pages quickly, the drawings turn into an animation, much like today's GIF images.

A static booklet becomes an interesting little animation once you flip through it; if there are enough pages and you flip fast enough, it is effectively a small video.

And that is exactly how video works. Because of how the human eye works, when pictures change quickly enough, persistence of vision makes them appear as continuous motion. So a video is essentially a sequence of pictures.

Video frame

A frame is a basic concept in video: it is a single image, like one page of the flip book above. A video is made up of many, many frames.

Frame rate

Frame rate is the number of frames displayed per unit of time, measured in frames per second (FPS). In flip-book terms, it is how many pages you see in one second: the more pages, the smoother the motion and the more natural the transitions.

Typical values of frame rate are as follows:

24/25 FPS: a typical movie frame rate of 24/25 frames per second.

30/60 FPS: 30/60 frames per second, typical game frame rates. 30 FPS is acceptable; 60 FPS feels smoother and more realistic.

Above roughly 85 FPS, the human eye can barely perceive the difference, so higher frame rates add little for video.

Color space

Here we will only talk about two commonly used color spaces.

  • RGB

RGB is probably the color model we are most familiar with; it is widely used in modern electronic devices. Any color can be mixed from the three primaries R, G, and B.

  • YUV

YUV deserves more attention, since it is less familiar. It is a color format that separates luminance (brightness) from chrominance (color).

Early TVs were black and white, so they only carried a luminance value, Y. When color TV arrived, two chrominance components, U and V, were added, forming YUV (also known as YCbCr).

Y: luminance, i.e. the gray level. Besides carrying the brightness signal, it also contains a relatively large contribution from the green channel.

U: the difference between the blue channel and luminance.

V: the difference between the red channel and luminance.

What are the advantages of using YUV?

The human eye is sensitive to luminance but much less sensitive to chrominance, so part of the UV data can be discarded without the eye noticing. By subsampling the UV resolution, the size of the video can be reduced without affecting perceived quality.

RGB and YUV conversion
    RGB to YUV:
    Y =  0.299R + 0.587G + 0.114B
    U = -0.147R - 0.289G + 0.436B
    V =  0.615R - 0.515G - 0.100B

    YUV to RGB:
    R = Y + 1.14V
    G = Y - 0.39U - 0.58V
    B = Y + 2.03U
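To make the relationship concrete, here is a minimal Kotlin sketch (an illustration only, not code from this series) that applies the coefficients above to a single normalized pixel; real pipelines operate on whole planes and usually use integer math:

    // Illustration only: convert one normalized RGB pixel (components in 0.0..1.0)
    // to YUV and back, using the coefficients listed above.
    fun rgbToYuv(r: Float, g: Float, b: Float): Triple<Float, Float, Float> {
        val y = 0.299f * r + 0.587f * g + 0.114f * b
        val u = -0.147f * r - 0.289f * g + 0.436f * b
        val v = 0.615f * r - 0.515f * g - 0.100f * b
        return Triple(y, u, v)
    }

    fun yuvToRgb(y: Float, u: Float, v: Float): Triple<Float, Float, Float> {
        val r = y + 1.14f * v
        val g = y - 0.39f * u - 0.58f * v
        val b = y + 2.03f * u
        return Triple(r, g, b)
    }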

2. What is audio?

The most common way to carry raw audio data is pulse code modulation (PCM).

In nature, sound is continuous, an analog signal. How do we store it? By digitizing it, that is, converting it into a digital signal.

We know that sound is a wave with amplitude and frequency, so to store a sound we need to store its amplitude at various points in time.

However, a digital signal cannot store the amplitude at every instant continuously. In fact, this is not necessary: a sampled signal is enough to reconstruct sound that is acceptable to the human ear.

According to the Nyquist sampling theorem, to reconstruct the analog signal without distortion, the sampling frequency must be at least twice the highest frequency in the signal's spectrum.

Based on the above, PCM acquisition can be divided into the following steps:

Analog signal -> Sampling -> Quantization -> Encoding -> Digital Signal

Sampling rate and sampling number

The sampling rate is how frequently samples are taken.

As mentioned above, the sampling rate must be more than twice the highest frequency of the original sound, and the highest frequency the human ear can hear is about 20 kHz. To satisfy human hearing, the sampling rate therefore needs to be at least 40 kHz; 44.1 kHz is the usual choice, and 48 kHz is common at the higher end.

The number of bits per sample (bit depth) relates to the amplitude quantization mentioned above. The amplitude of an analog waveform is continuous, while a digital signal is discrete, so after quantization each amplitude can only be stored as an approximate integer value. To record these amplitudes, the sampler uses a fixed number of bits per sample, usually 8, 16, or 32 bits.

Bits   Minimum value    Maximum value
8      0                255
16     -32768           32767
32     -2147483648      2147483647

The more bits, the more accurately the value is recorded and the higher the fidelity of the reconstruction.

Finally, encoding. Since a digital signal consists of 0s and 1s, the amplitude values must be converted into sequences of 0s and 1s for storage; that is encoding, and the result is the digital signal: a stream of 0s and 1s.

The whole process runs: analog signal -> sampling -> quantization -> encoding -> digital signal.
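To make the sampling and quantization steps concrete, here is a minimal Kotlin sketch (an illustration only; the 440 Hz tone, one-second duration, and mono layout are arbitrary example choices) that produces 16-bit PCM samples of a sine wave:

    import kotlin.math.PI
    import kotlin.math.sin

    // Illustration only: "record" one second of a 440 Hz sine tone as 16-bit mono PCM.
    fun generateSinePcm(sampleRate: Int = 44100, toneHz: Double = 440.0): ShortArray {
        val samples = ShortArray(sampleRate)                              // one second of samples
        for (i in samples.indices) {
            val t = i.toDouble() / sampleRate                             // sampling: discrete points in time
            val amplitude = sin(2.0 * PI * toneHz * t)                    // the "analog" value, in -1.0..1.0
            samples[i] = (amplitude * Short.MAX_VALUE).toInt().toShort()  // quantization to 16 bits
        }
        return samples                                                    // encoding: a sequence of 16-bit integers
    }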

Number of channels

The number of channels is the number of independent sound signals that can be carried.

Mono: 1 channel
Dual channel: 2 channels
Stereo: 2 channels by default
Stereo (4-channel): 4 channels

Bit rate

Bit rate is the amount of data transmitted per second, measured in bps (bits per second).

Bit rate = sampling rate × bits per sample × number of channels
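As a quick check of the formula, here is a tiny Kotlin helper (just to illustrate the arithmetic) applied to CD-quality PCM, i.e. 44.1 kHz, 16 bits, stereo:

    // Bit rate of uncompressed PCM audio, per the formula above.
    fun pcmBitRate(sampleRate: Int, bitsPerSample: Int, channels: Int): Int =
        sampleRate * bitsPerSample * channels

    // 44100 * 16 * 2 = 1,411,200 bps, i.e. roughly 1.4 Mbps for CD-quality audio.
    val cdAudioBitRate = pcmBitRate(44100, 16, 2)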

3. Why encode?

The encoding here is not the same as the PCM encoding mentioned in the audio section above; it refers to compression encoding.

We know that in the computer world everything is made of 0s and 1s, and audio and video data are no exception. The raw data is huge: storing it as a raw stream would consume enormous space and make transmission impractical. In practice, audio and video contain a great deal of redundant data, which can be compressed with suitable algorithms.

This is especially true of video: because the picture transitions gradually, a video contains a large amount of repeated picture/pixel data, which leaves a lot of room for compression.

Therefore, encoding can greatly reduce the size of audio and video data, making audio and video easier to store and transmit.

4. Video encoding

Video encoding format

There are many video encoding formats, such as the H26x series and the MPEG series; each appeared to meet the needs of its time.

Among them, the H26x (1/2/3/4/5) series is led by the International Telecommunication Union (ITU).

The MPEG (1/2/3/4) series is led by MPEG (the Moving Picture Experts Group, an organization under ISO).

They have also developed joint standards: H264, the current mainstream coding format, and its more advanced next-generation successor, H265.

Introduction to H264 coding

H264 is currently the most popular video coding standard, so we will use this format as a benchmark for future articles.

H264 was developed jointly by the ITU and MPEG, and is Part 10 of the MPEG-4 standard.

The H264 coding algorithm is very complex and cannot be explained in a moment (nor is it within my current ability), so here I only introduce the concepts you need to understand in day-to-day development. In practice, video encoding and decoding is usually handled by frameworks (such as Android's hardware codec or FFmpeg), so ordinary developers rarely touch the algorithm itself.

  • Video frame

We already know that a video is made of individual frames, but the encoded video data does not actually store every frame in full (if it did, compression encoding would be pointless).

Based on how frames change over a period of time, H264 encodes selected frames completely, while the following frames record only their differences from previously decoded frames; this is a dynamic compression process.

In H264, there are three types of frame data:

I frame: intra-frame coding frame. It’s a complete frame.

P frame: forward-predicted frame. An incomplete frame generated by referencing a previous I frame or P frame.

B frame: bidirectionally predicted (interpolated) frame. It is generated by referencing frames both before and after it: a B frame depends on the closest preceding I or P frame and the closest following P frame.

  • Group of pictures (GOP) and key frames (IDR)

GOP, short for Group of Pictures, refers to a group of video frames whose content does not vary much.

The first frame of a GOP is the key frame, called an IDR frame.

An IDR frame is an I frame whose purpose is to stop a decoding error in one frame from corrupting all subsequent frames. When the decoder reaches an IDR frame, it discards the previous reference frames and starts a new sequence, so even if a serious error occurred earlier, it cannot propagate into the data that follows.

Note: all key frames are I frames, but an I frame is not necessarily a key frame.

  • DTS and PTS

DTS stands for Decoding Time Stamp. It indicates when data that has been read into memory should be fed into the decoder; in other words, it is the timestamp of the decoding order.

PTS stands for Presentation Time Stamp. It indicates when the decoded video frame should be displayed.

Without B frames, DTS order and PTS order are the same; once B frames are present, they differ.
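For example, a short sequence that is displayed as I B B P must be decoded as I P B B, because both B frames reference the later P frame:

    Display order (PTS): I  B  B  P
    Decode order  (DTS): I  P  B  B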

  • Frame color space

Earlier we introduced the RGB and YUV color spaces; H264 uses YUV.

YUV storage is divided into two categories: planar and packed.

Planar: stores all the Y first, then all the U, and finally V;

Packed: the Y, U, and V values of each pixel are stored interleaved, pixel by pixel.

The planar layout is as follows:

The packed layout is as follows:

However, packed storage is rarely used; most videos use planar storage.

As mentioned above, because the human eye is not sensitive to chrominance, storage can be saved by omitting some chrominance information, i.e. letting several luminance samples share one set of chrominance values. Accordingly, planar YUV is divided into the following formats: YUV444, YUV422, and YUV420.

YUV 4:4:4 sampling, each Y corresponds to a set of UV components.

YUV 4:2:2 sampling, every two Y share a set of UV components.

YUV 4:2:0 sampling, every four Y values share a set of UV components.

The most commonly used is the YUV420.

  • YUV420 format storage mode

YUV420 is planar storage, but divided into two types:

YUV420P: three-plane storage. The data is YYYYYYYYUUVV (for example, I420) or YYYYYYYYVVUU (for example, YV12).

YUV420SP: two-plane storage. There are two types: YYYYYYYYUVUV (such as NV12) or YYYYYYYYVUVU (such as NV21)
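As a rough illustration of the memory layout (a sketch that assumes 8-bit samples and no row padding; real camera and decoder buffers often add stride/padding), the plane sizes of a 4:2:0 frame can be computed like this:

    // Illustration only: plane sizes of an 8-bit 4:2:0 frame without row padding.
    fun yuv420Sizes(width: Int, height: Int): Triple<Int, Int, Int> {
        val ySize = width * height       // one luminance byte per pixel
        val chromaSize = ySize / 4       // each chroma plane covers 2x2 pixel blocks
        return Triple(ySize, chromaSize, ySize + 2 * chromaSize)
    }

    // For 1920x1080: Y = 2,073,600 bytes, U = V = 518,400 bytes, total = 3,110,400 bytes
    // (about 3 MB per frame, versus roughly 6 MB for 24-bit RGB).
    //
    // I420 (YUV420P):  [ all Y ][ all U ][ all V ]
    // NV21 (YUV420SP): [ all Y ][ V and U interleaved, 2 * chromaSize bytes ]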

The H264 coding algorithm and data structures (such as the network abstraction layer NAL, SPS, and PPS) involve far more knowledge and space than this article can cover; there are plenty of tutorials online for those who want to dig deeper.

Start to understand H264 coding

5. Audio encoding

Audio encoding format

Raw PCM audio data is also very large, so it too needs compression encoding.

Like video, audio has many encoding formats, such as WAV, MP3, WMA, APE, and FLAC. Music lovers will be familiar with these, especially the last two lossless compression formats.

However, today's protagonist is none of these, but another compression format: AAC.

AAC is a newer-generation lossy audio compression technology with a high compression ratio. The audio track in most MP4 videos is compressed with AAC.

Introduction to AAC coding

There are two AAC formats: ADIF and ADTS.

ADIF: Audio Data Interchange Format. Its defining feature is that the start of the audio data can be located unambiguously, and decoding cannot begin in the middle of the stream; it must start from the clearly defined beginning. This format is commonly used for files on disk.

ADTS: Audio Data Transport Stream. This format is a bit stream with sync words, so decoding can start at any point in the stream; in that respect it is similar to the MP3 stream format.

ADTS can be decoded from any frame because every frame carries its own header; ADIF has a single header at the start, so decoding requires all the data. The two header formats also differ. Encoded audio streams today are generally in ADTS format.

ADIF data format:

header | raw_data

ADTS data format of one frame (this is the middle portion; the ellipses on either side represent the preceding and following frames):
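To show what "header information in each frame" looks like in practice, here is a minimal Kotlin sketch that reads a few fields out of a 7-byte ADTS header (the bit layout used below is the commonly documented CRC-less fixed+variable header; treat it as an illustration and check the spec before relying on it):

    // Illustration only: parse a few fields of a CRC-less, 7-byte ADTS frame header.
    fun parseAdtsHeader(header: ByteArray): String? {
        if (header.size < 7) return null
        val b = IntArray(7) { header[it].toInt() and 0xFF }
        // 12-bit syncword: all ones (0xFFF) marks the start of a frame
        if (b[0] != 0xFF || (b[1] and 0xF0) != 0xF0) return null
        val profile = (b[2] shr 6) and 0x03            // 2-bit AAC profile (object type - 1)
        val samplingIndex = (b[2] shr 2) and 0x0F      // index into the sample-rate table (4 = 44100 Hz)
        val channelConfig = ((b[2] and 0x01) shl 2) or ((b[3] shr 6) and 0x03)
        val frameLength = ((b[3] and 0x03) shl 11) or (b[4] shl 3) or ((b[5] shr 5) and 0x07)
        return "profile=$profile, samplingIndex=$samplingIndex, " +
               "channels=$channelConfig, frameLength=$frameLength bytes"
    }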

The internal structure of AAC will not be described further here; see "AAC file parsing and decoding process" for details.

6. Audio and video containers

Careful readers may have noticed that among all the audio and video encoding formats introduced above, none are the "video formats" we usually talk about, such as MP4, RMVB, AVI, MKV, or MOV.

That is because these familiar formats are actually containers that wrap the encoded audio and video data: they multiplex video and audio streams, encoded to particular standards, into a single file.

For example, MP4 supports video encoding such as H264 and H265 and audio encoding such as AAC and MP3.

MP4 is the most popular container format at present; on mobile, video is generally packaged as MP4.

7. Hard decoding and soft decoding

The difference between hard decoding and soft decoding

Some players let you choose between hard decoding and soft decoding, but most of the time users cannot tell the difference; as long as playback works, that is enough for the average user.

So what’s the difference between them?

A phone or PC contains hardware such as a CPU, a GPU, and dedicated decoders. Normally our software runs its computation on the CPU, while the GPU handles rendering the picture (a form of hardware acceleration).

Soft decoding means decoding with the CPU's computing power. If the CPU is not very powerful, decoding is relatively slow and the phone may heat up. However, because a uniform algorithm is used, compatibility is good.

Hard decoding uses a dedicated decoding chip in the phone to accelerate decoding. It is usually much faster, but because each manufacturer implements it differently, quality is uneven and compatibility problems are common.

Hard decoding on Android

I’ve finally come to the Android section, which ends this article and serves as a starting point for the next one.

MediaCodec is the codec interface introduced in Android 4.1 (API 16); it is a hurdle that every developer doing audio and video on Android has to clear.

Because of Android's severe fragmentation, hardware decoding still holds unexpected pitfalls: it has improved greatly over the years, but manufacturer implementations still differ.

Compared with FFmpeg, Android's hardware decoding is relatively easy to get started with. So, in the following articles, I will start with MediaCodec and explain how to encode and decode video, then introduce OpenGL for video editing, and finally cover soft decoding with FFmpeg. This is a fairly conventional path into audio and video development.
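As a small preview of what the MediaCodec setup looks like (a minimal sketch only, not the decoding framework built later in this series; the drain/render loop, error handling, and audio are all omitted):

    import android.media.MediaCodec
    import android.media.MediaExtractor
    import android.media.MediaFormat
    import android.view.Surface

    // Sketch: open a file, find its first video track, and configure a hardware decoder for it.
    fun createVideoDecoder(path: String, surface: Surface): Pair<MediaExtractor, MediaCodec>? {
        val extractor = MediaExtractor()
        extractor.setDataSource(path)
        for (i in 0 until extractor.trackCount) {
            val format = extractor.getTrackFormat(i)
            val mime = format.getString(MediaFormat.KEY_MIME) ?: continue
            if (mime.startsWith("video/")) {                 // e.g. "video/avc" for H264
                extractor.selectTrack(i)
                val codec = MediaCodec.createDecoderByType(mime)
                codec.configure(format, surface, null, 0)    // render decoded frames to the Surface
                codec.start()
                return extractor to codec
            }
        }
        return null
    }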

References

Basic knowledge of audio and video development

YUV color coding parsing

YUV data format

Audio Basics

AAC file parsing and decoding process