In daily life, video apps take up more and more of our time, and companies have all jumped into this battlefield: short-video apps such as Douyin and Kuaishou, live-streaming apps such as Huya and Douyu, long-form video platforms such as Tencent Video, iQIYI, and Youku, and video editing and beautification apps such as VUE and Meipai. There is always one that suits you.

In the future, with the popularization of 5G and falling data costs, audio and video have very broad prospects. On the other hand, whether it is audio/video encoding and decoding, players, the various algorithms behind video editing and beautification, or the combination of video with artificial intelligence (AI-assisted editing, video restoration, super resolution, and so on), all of these touch many layers of low-level knowledge. The learning curve is steep and the entry threshold is relatively high, which is one reason there is currently a shortage of audio and video talent. If you are interested in audio and video development, I highly recommend giving it a try. I am personally very enthusiastic about it.

Of course, experience in audio and video development grows out of climbing out of one pitfall after another. Let's take a look at my understanding of and thinking about audio and video.

Whether as developers or as users, we are exposed to all kinds of short-video and live-streaming apps every day, and the audio and video development behind them is becoming more and more important. For most Android developers, though, Android audio and video is still a fairly niche area: there may not be many developers working in it right now, but the amount of knowledge involved is large.

Basic knowledge of audio and video

1. Concepts related to audio and video

Speaking of audio and video, let's start with video formats, which are at once familiar and unfamiliar to us.

For us, the most common video format is MP4, which is a general-purpose container format. "Container format" means that the file internally carries the corresponding data streams. Since it is a video, there must be an audio track and a video track, and each of these tracks has its own encoding format. Common video track and audio track formats include:

  • Video track: H.265 (HEVC) and H.264. Most Android phones support hardware encoding and decoding of H.264 directly. For H.265, devices running Android 5.0 or above generally support hardware decoding, but hardware encoding is only supported by some high-end chips, such as the Qualcomm 8xx series and Huawei 98x series. For video encoding, the higher the resolution, the greater the performance cost and the longer the encoding time.
  • Audio track: AAC. This is a long-established audio encoding format that most Android phones can encode and decode directly in hardware, with few compatibility issues. As an audio track format, AAC is very mature.

As for the encoding itself, the formats mentioned above are all lossy, so compression also needs an indicator to measure the amount of data after compression; that indicator is the bit rate. Under the same compression format, the higher the bit rate, the better the quality. For more information about the codecs supported by Android, you can refer to the official documentation.

To summarize: to shoot an MP4 video, we need to encode the video track and the audio track separately, and then mux them into an MP4 file.

2. The process of audio and video coding

Next, let's take a look at how a video is made. Since it is about shooting, we must deal with the camera and the microphone. In terms of process, taking H.264/AAC encoding as an example, the overall flow of video recording is as follows:

We collect data from the camera and the recording device, send it into the encoders to encode the video track and the audio track respectively, then send the encoded streams into a muxer (MediaMuxer, or a processing library such as MP4v2 or FFmpeg), and finally output an MP4 file. Next, I will mainly use the video track as an example to introduce the encoding process.

The easiest way is to record the entire video directly with the system's MediaRecorder, which outputs an MP4 file directly. However, this interface offers very little room for customization. For example, if we want to record a square video, unless the camera itself supports a resolution with equal width and height, we can only post-process the output or resort to hacks. In practice, unless the requirements on the video are not particularly high, apps generally do not use MediaRecorder directly.
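
For completeness, here is a minimal sketch of the MediaRecorder approach, assuming the camera and audio permissions have already been granted and that outputPath points to a writable file; the size and bit rate are placeholder values.

import android.media.MediaRecorder

// Minimal MediaRecorder sketch: records camera + microphone straight to MP4.
// Assumes permissions are granted and outputPath is writable; values are placeholders.
fun startSimpleRecording(outputPath: String): MediaRecorder {
    val recorder = MediaRecorder()
    recorder.setAudioSource(MediaRecorder.AudioSource.MIC)
    recorder.setVideoSource(MediaRecorder.VideoSource.CAMERA)
    recorder.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4)
    recorder.setAudioEncoder(MediaRecorder.AudioEncoder.AAC)
    recorder.setVideoEncoder(MediaRecorder.VideoEncoder.H264)
    recorder.setVideoSize(1280, 720)            // must be a size the camera actually supports
    recorder.setVideoEncodingBitRate(4_000_000) // placeholder bit rate
    recorder.setOutputFile(outputPath)
    recorder.prepare()
    recorder.start()
    return recorder                             // call stop() and release() when done
}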

Processing the video track is the more complex part of video recording. The input source is camera data, and the final output is encoded H.264/H.265 data. Let me introduce two processing models.

The first method is to use the Camera API to obtain the camera's raw output data (for example via onPreviewFrame), preprocess it (scaling, cropping, and so on), and then send it to the encoder, which outputs H.264/H.265.

The raw data output by the camera is in NV21, a YUV color format. Compared with RGB, YUV data takes up less space and is widely used in video encoding.

Generally speaking, the NV21 frame size output by the camera does not necessarily match the final video size, and the encoder usually requires its input in another YUV layout (typically YUV420P). So after obtaining the NV21 data, we still need to perform various scaling, cropping, and conversion operations. Libraries such as FFmpeg and libyuv are commonly used to process YUV data; a rough illustration of the conversion follows.
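
As an illustration of what that repacking looks like, here is a CPU-only sketch of converting one NV21 frame to I420 (YUV420P); in practice libyuv's optimized routines would be used instead.

// Sketch: repacking an NV21 camera frame into I420 (YUV420P) on the CPU.
// NV21 stores the full Y plane followed by interleaved V/U samples;
// I420 stores Y, then the U plane, then the V plane.
fun nv21ToI420(nv21: ByteArray, width: Int, height: Int): ByteArray {
    val ySize = width * height
    val out = ByteArray(ySize * 3 / 2)
    // The Y plane is identical in both layouts.
    System.arraycopy(nv21, 0, out, 0, ySize)
    val uOut = ySize                // start of the U plane in I420
    val vOut = ySize + ySize / 4    // start of the V plane in I420
    var i = 0
    while (i < ySize / 4) {
        out[vOut + i] = nv21[ySize + 2 * i]     // V comes first in NV21
        out[uOut + i] = nv21[ySize + 2 * i + 1] // then U
        i++
    }
    return out
}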

Finally, the data is fed into the encoder. For the video encoder we can directly choose the system MediaCodec and use the phone's own hardware encoding capability. However, if the size of the final output video is strictly constrained, the bit rate used will be low, and in that case the output quality of the hardware encoder on most phones can be poor. Another common choice is to encode with x264, which produces a better-looking picture but is much slower than the hardware encoder, so in practice it is best to choose according to the scenario.
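
A hedged sketch of configuring a hardware H.264 encoder via MediaCodec, using the byte-buffer (YUV) input model; the resolution, bit rate, and frame rate are placeholder values and would be chosen per scenario.

import android.media.MediaCodec
import android.media.MediaCodecInfo
import android.media.MediaFormat

// Sketch: configuring a hardware H.264 encoder with MediaCodec (byte-buffer input).
fun createH264Encoder(width: Int, height: Int): MediaCodec {
    val format = MediaFormat.createVideoFormat(MediaFormat.MIMETYPE_VIDEO_AVC, width, height).apply {
        setInteger(MediaFormat.KEY_BIT_RATE, 4_000_000)     // placeholder bit rate
        setInteger(MediaFormat.KEY_FRAME_RATE, 30)
        setInteger(MediaFormat.KEY_I_FRAME_INTERVAL, 1)      // one key frame per second
        setInteger(
            MediaFormat.KEY_COLOR_FORMAT,
            MediaCodecInfo.CodecCapabilities.COLOR_FormatYUV420Flexible // feed YUV buffers
        )
    }
    return MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_VIDEO_AVC).apply {
        configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE)
    }
}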

Besides processing raw camera data directly, another common processing model uses a Surface as the input source of the encoder.

For Android camera preview, you need to set a Surface/SurfaceTexture as the output of the camera preview data, and since API 18 MediaCodec can create a Surface as the encoder's input via createInputSurface. The idea is then to render the content of the camera's preview Surface onto MediaCodec's input Surface.
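
A sketch of this Surface-input model, assuming API 18+ and the same placeholder encoding parameters as before; the returned Surface is what the OpenGL/EGL side renders the camera frames into.

import android.media.MediaCodec
import android.media.MediaCodecInfo
import android.media.MediaFormat
import android.view.Surface

// Sketch: Surface-input encoder (API 18+). Instead of feeding YUV buffers,
// we ask the encoder for an input Surface and render frames into it.
fun createSurfaceEncoder(width: Int, height: Int): Pair<MediaCodec, Surface> {
    val format = MediaFormat.createVideoFormat(MediaFormat.MIMETYPE_VIDEO_AVC, width, height).apply {
        setInteger(MediaFormat.KEY_BIT_RATE, 4_000_000)
        setInteger(MediaFormat.KEY_FRAME_RATE, 30)
        setInteger(MediaFormat.KEY_I_FRAME_INTERVAL, 1)
        setInteger(
            MediaFormat.KEY_COLOR_FORMAT,
            MediaCodecInfo.CodecCapabilities.COLOR_FormatSurface // Surface input
        )
    }
    val codec = MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_VIDEO_AVC)
    codec.configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE)
    val inputSurface = codec.createInputSurface() // must be called between configure() and start()
    codec.start()
    return codec to inputSurface
}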

As for the choice of encoder: although the input Surface is created by MediaCodec, and at first glance it seems that only MediaCodec can be used for encoding rather than x264, with the preview Surface we can create an OpenGL context. All drawn content can then be read back with glReadPixels, converted to YUV, and fed into x264 (and if a GLES 3.0 environment is available, we can use a PBO to speed up glReadPixels).
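
The following is a rough GLES 3.0 sketch of the PBO readback idea, assuming a current EGL context and an RGBA framebuffer of the given size; a real implementation would usually double-buffer two PBOs and read one frame behind to actually gain asynchrony.

import android.opengl.GLES30
import java.nio.ByteBuffer

// Sketch: reading back the rendered frame through a Pixel Buffer Object (GLES 3.0).
class PboReader(private val width: Int, private val height: Int) {
    private val pbo = IntArray(1)

    init {
        GLES30.glGenBuffers(1, pbo, 0)
        GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, pbo[0])
        GLES30.glBufferData(
            GLES30.GL_PIXEL_PACK_BUFFER, width * height * 4, null, GLES30.GL_STREAM_READ
        )
        GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, 0)
    }

    // dst must have capacity >= width * height * 4 bytes.
    fun readRgba(dst: ByteBuffer) {
        GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, pbo[0])
        // With a PBO bound, the last argument is an offset into the buffer, not a pointer.
        GLES30.glReadPixels(0, 0, width, height, GLES30.GL_RGBA, GLES30.GL_UNSIGNED_BYTE, 0)
        val mapped = GLES30.glMapBufferRange(
            GLES30.GL_PIXEL_PACK_BUFFER, 0, width * height * 4, GLES30.GL_MAP_READ_BIT
        ) as ByteBuffer
        dst.put(mapped)          // copy out while the mapping is still valid
        GLES30.glUnmapBuffer(GLES30.GL_PIXEL_PACK_BUFFER)
        GLES30.glBindBuffer(GLES30.GL_PIXEL_PACK_BUFFER, 0)
    }
}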

Since we have an OpenGL context here, the various filters and beauty effects seen in today's video apps can all be implemented on top of OpenGL.

As for the code to do this, you can see the grafika example.

Video processing

1. Video editing

In today's video apps, you can see various video clipping and editing functions, such as:

  • Clipping out a segment of a video.
  • Splicing multiple videos together.

For video clipping and splicing, Android directly provides the MediaExtractor interface. Combined with seekTo and readSampleData for reading frame data, we can obtain the frame content at the corresponding timestamp. What we read out is already encoded data, so there is no need to re-encode; it can be fed directly into the muxer to produce a new MP4.

We just need to seek to the timestamp in the original video where the clip should start, then keep reading sample data and sending it to MediaMuxer. This is the simplest way to implement video clipping; a sketch follows.
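
A sketch of this copy-without-re-encoding approach, handling a single track for brevity (a real clipper would copy the audio track the same way); as discussed next, its accuracy is limited to key frames.

import android.media.MediaCodec
import android.media.MediaExtractor
import android.media.MediaMuxer
import java.nio.ByteBuffer

// Sketch: clip one track of an MP4 without re-encoding, copying encoded
// samples between startUs and endUs into a new file.
fun clipTrack(srcPath: String, dstPath: String, trackIndex: Int, startUs: Long, endUs: Long) {
    val extractor = MediaExtractor().apply { setDataSource(srcPath) }
    val format = extractor.getTrackFormat(trackIndex)
    extractor.selectTrack(trackIndex)

    val muxer = MediaMuxer(dstPath, MediaMuxer.OutputFormat.MUXER_OUTPUT_MPEG_4)
    val outTrack = muxer.addTrack(format)
    muxer.start()

    extractor.seekTo(startUs, MediaExtractor.SEEK_TO_PREVIOUS_SYNC)
    val buffer = ByteBuffer.allocate(1 shl 20)   // sized generously for large key frames
    val info = MediaCodec.BufferInfo()
    while (true) {
        info.size = extractor.readSampleData(buffer, 0)
        if (info.size < 0 || extractor.sampleTime > endUs) break
        info.presentationTimeUs = extractor.sampleTime   // timestamps kept as-is
        info.flags = if (extractor.sampleFlags and MediaExtractor.SAMPLE_FLAG_SYNC != 0)
            MediaCodec.BUFFER_FLAG_KEY_FRAME else 0
        muxer.writeSampleData(outTrack, buffer, info)
        extractor.advance()
    }
    muxer.stop(); muxer.release(); extractor.release()
}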

In practice, seekTo does not work for arbitrary timestamps. For example, for a 4-minute video, we may want to seek to around the 2-minute mark and read data from there, but after calling seekTo to the 2-minute position and then reading data from MediaExtractor, you will notice that the data you actually get may be a little ahead of or behind the 2-minute mark. This is because MediaExtractor can only seek to the position of a video key frame, and the position we want does not necessarily contain a key frame. The problem goes back to video encoding, where there is a certain distance between two key frames.

As shown in the figure above, a key frame is called an I-frame; it can be thought of as a self-contained frame that can be decoded without relying on other video frames. Between two key frames there are compressed frames such as B-frames and P-frames, which depend on other frames to reconstruct a complete picture. The interval between two key frames is known as a GOP. MediaExtractor cannot seek directly to a frame inside a GOP, because this class is not responsible for decoding; it can only seek to the key frames before and after. If the GOP is large, clipping becomes very inaccurate (in fact, some phone ROMs have modified the MediaExtractor implementation so that it can seek accurately).

In this case, the only way to achieve accurate clipping is to rely on the decoder. The decoder can reconstruct the content of every frame, and once decoding is introduced the whole clipping process becomes the following.

We seek to the I-frame preceding the desired position and send the data to the decoder. Each time the decoder outputs a frame, we check whether the frame's PTS is within the required timestamp range. If so, we send the data to the encoder, re-encode it to get new H.264 video track data, and then mux it into the MP4 file. The key step is sketched below.
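
A compact sketch of that PTS check inside the decoder drain loop; feedEncoder is a hypothetical callback standing in for whatever pushes the decoded frame into the re-encoder (for example by rendering it to the encoder's input Surface).

import android.media.MediaCodec

// Sketch of the key step in frame-accurate clipping: drain the decoder and only
// forward frames whose PTS falls inside [startUs, endUs] to the encoder.
fun drainDecoder(
    decoder: MediaCodec, startUs: Long, endUs: Long,
    feedEncoder: (MediaCodec, Int, MediaCodec.BufferInfo) -> Unit   // hypothetical helper
) {
    val info = MediaCodec.BufferInfo()
    while (true) {
        val index = decoder.dequeueOutputBuffer(info, 10_000)
        if (index < 0) break   // no output ready; a real loop would keep polling / handle format changes
        if (info.presentationTimeUs in startUs..endUs) {
            feedEncoder(decoder, index, info)         // e.g. releaseOutputBuffer(index, true) onto a Surface
        } else {
            decoder.releaseOutputBuffer(index, false) // decoded, but outside the clip range: drop it
        }
        if ((info.flags and MediaCodec.BUFFER_FLAG_END_OF_STREAM) != 0 ||
            info.presentationTimeUs > endUs
        ) break
    }
}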

That is the basic video clipping process. Video stitching is similar: obtain multiple segments of H.264 data and feed them into the muxer.

In addition, in real video editing we also add many video effects and filters. In the earlier shooting scenario, we used a Surface as the input of MediaCodec and created an OpenGL context with it. When MediaCodec is used as a decoder, it is also possible to specify a Surface in configure as the output of its decoding. Since most video effects can be implemented with OpenGL, the general process for applying video effects is as follows.

We hand the decoded frames to OpenGL for rendering, and then output the result to the encoder's input Surface, completing the whole encoding process. A small sketch of the decoder side is shown below.
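
As a small sketch, configuring the decoder with a Surface output (for example a SurfaceTexture attached to an OpenGL texture) looks roughly like this; format would come from MediaExtractor.getTrackFormat() and glSurface from the OpenGL side of the effect pipeline.

import android.media.MediaCodec
import android.media.MediaFormat
import android.view.Surface

// Sketch: a decoder whose output goes to a Surface instead of byte buffers.
fun createSurfaceDecoder(format: MediaFormat, glSurface: Surface): MediaCodec {
    val mime = format.getString(MediaFormat.KEY_MIME)!!
    val decoder = MediaCodec.createDecoderByType(mime)
    decoder.configure(format, glSurface, null, 0)   // flags = 0: this is a decoder, not an encoder
    decoder.start()
    return decoder
}
// When draining output, releaseOutputBuffer(index, true) renders the frame onto glSurface.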

2. Video playback

Any video app involves playback; recording, editing, and playback together form a complete video experience. The easiest way to play an MP4 file is to use the system's MediaPlayer: with just a few lines of code you can play a video directly (see the minimal sketch after the list below). This is the simplest implementation for local video playback, but in reality we may have more complex requirements:

  • The videos to be played may not be local; many are network videos that need to play while downloading.
  • The video may be played as part of a video editing process that requires a live preview of the video effects during editing.
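
For reference, the simple local-playback case mentioned above looks roughly like this; path is assumed to point to a readable local file, and rendering into a view would additionally need setDisplay or setSurface.

import android.media.MediaPlayer

// Minimal sketch: play a local MP4 with the system MediaPlayer.
fun playLocal(path: String): MediaPlayer = MediaPlayer().apply {
    setDataSource(path)
    setOnPreparedListener { it.start() }   // start once buffering/parsing is done
    prepareAsync()
    // Remember to call release() when playback is no longer needed.
}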

For the second scenario, we can simply make the View that plays the video a GLSurfaceView. With an OpenGL environment, we can implement various effects and filters on this View. MediaPlayer also directly exposes interfaces for playback settings that are common in video editing, such as fast forward and reverse playback.

The first scenario is more common, for example a video feed where most of the videos are online. Although MediaPlayer can also play online video, there are two problems when using it:

  • A video downloaded by setting a URL on MediaPlayer is stored in a private location that the app cannot easily access directly. As a result we cannot preload videos, and the buffered content of a video that has already been played cannot be reused.
  • The same problem as with reading data directly via MediaExtractor in video clipping: MediaPlayer cannot seek accurately either; it can only seek to positions where there are key frames.

For the first problem, we can solve it with proxy downloading of the video URL: a local HTTP server proxies the download to a specified location. The open-source community already has mature implementations, such as AndroidVideoCache; a usage sketch follows.
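
A usage sketch following AndroidVideoCache's README (context, player, and videoUrl are assumed to exist; check the project for the exact API): the player is handed a local proxy URL, and the proxy server downloads and caches the real file.

import android.content.Context
import android.media.MediaPlayer
import com.danikula.videocache.HttpProxyCacheServer

// Sketch: route playback through a local caching proxy.
fun playThroughCache(context: Context, player: MediaPlayer, videoUrl: String) {
    val proxy = HttpProxyCacheServer(context)   // usually kept as an app-wide singleton
    val proxyUrl = proxy.getProxyUrl(videoUrl)  // local URL served by the proxy
    player.setDataSource(proxyUrl)
    player.setOnPreparedListener { it.start() }
    player.prepareAsync()
}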

As for the second problem, the inability to seek accurately is fatal for some apps, and the product side may not accept the experience. As with video editing, we would have to implement a player ourselves directly on top of MediaCodec, which is more complicated. Of course, you can also use Google's open-source ExoPlayer, which is easy and fast to integrate and also supports setting online video URLs.
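
A rough ExoPlayer sketch for online playback (API names vary slightly between ExoPlayer 2.x versions; context, playerView, and videoUrl are assumed to exist):

import android.content.Context
import com.google.android.exoplayer2.ExoPlayer
import com.google.android.exoplayer2.MediaItem
import com.google.android.exoplayer2.ui.PlayerView

// Sketch: play a network URL with ExoPlayer and attach it to a PlayerView.
fun playOnline(context: Context, playerView: PlayerView, videoUrl: String): ExoPlayer {
    val player = ExoPlayer.Builder(context).build()
    playerView.player = player
    player.setMediaItem(MediaItem.fromUri(videoUrl))
    player.prepare()
    player.play()
    return player   // call release() when the screen is destroyed
}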

It seems that every problem has a solution, so is everything fine now?

The most common online video format is MP4. However, for some videos uploaded directly to a server, whether you use MediaPlayer or ExoPlayer, it seems that you have to wait until the entire video is downloaded before playback starts. The reason lies in the MP4 format itself, specifically in the moov box of the MP4 file.

In the MP4 format, there is a box called moov that stores the metadata of the file, including the track formats, the video duration, the playback rate, the position offsets of the tracks' key frames, and other important information. When an MP4 file is played online, the information in the moov box is needed to decode the tracks.

The problem is that when the moov box is at the end of the MP4 file, the player does not have enough information to start decoding, so the whole video has to be downloaded before it can be decoded and played. Therefore, to play an MP4 file while it is downloading, the moov box needs to be placed at the head of the file. There are mature tools in the industry for this: both FFmpeg and MP4v2 can move the moov box of an MP4 file to the head. For example, with FFmpeg the command is:

ffmpeg -i input.mp4 -movflags faststart -acodec copy -vcodec copy output.mp4

With -movflags faststart, we can move the moov box to the front of the video file.

In addition, if you want to check whether the moov box of an MP4 file is at the front, you can use a tool such as AtomicParsley.

In real video playback, besides MP4 as the play-while-downloading format, many scenarios require other formats such as M3U8 and FLV. Common client-side implementations in the industry include ijkplayer and ExoPlayer; interested readers can refer to their implementations.

The learning path of audio and video development

Audio and video development covers a very wide range, and today I have only given a brief introduction to the basic architecture. If you want to go further in this field, then from my personal experience, besides the basics of Android development, a qualified developer also needs to master the following technology stack.

Languages

  • C/C++: Audio and video development often involves working with low-level code, so mastering C/C++ is a necessary skill. There is plenty of material on this and it is easy to find.
  • ARM NEON assembly: This is an advanced skill; NEON assembly is used to speed up video encoding and decoding as well as frame processing. For example, FFmpeg and libyuv use NEON assembly extensively to accelerate processing. It is not a required skill, but you can dig into it if you are interested. For details, refer to the ARM community tutorials.

Frameworks

  • FFmpeg: arguably the most famous audio and video processing framework in the industry, used in almost every audio and video development workflow; it can be considered a required skill.
  • libyuv: Google's open-source YUV frame processing library. Because the camera output and the codec input/output are all YUV-based, this library is often used to manipulate that data. It performs well because it is accelerated with NEON assembly.
  • libx264/libx265: the most widely used H.264/H.265 software codec libraries. Although hardware codecs can be used on mobile platforms, for compatibility or picture-quality reasons (on many low-end Android devices, software encoding gives better quality at low bit rates) you may eventually have to consider software codecs.
  • OpenGL ES: Today most video effects, beauty algorithms, and final rendering are implemented on top of GLES, so GLES is essential knowledge for going deep into audio and video development. Besides GLES, Vulkan is a higher-performance graphics API that has developed in recent years, but it is not yet widely used.
  • ExoPlayer/ijkplayer: A complete video app inevitably involves the playback experience. These two libraries are arguably the most commonly used players in the industry today, supporting a wide range of formats and protocols; they are almost required knowledge if you want to dig into video playback.

Based on the above technology stack, we can learn in depth from the following two paths.

1. Development of video-related special effects

At present there are more and more apps related to live streaming and short video, and the special effects in almost every such app are implemented with OpenGL. For some simple effects, you can use a technique like a color lookup table (LUT): by modifying the lookup texture and pairing it with a shader, colors can be remapped. If you want to learn more complex filters, I recommend Shadertoy, which has many shader examples.

Beauty and face-reshaping effects, especially face reshaping, require face detection to obtain the key points, triangulating the face texture, and then using a shader to scale and offset the texture coordinates of the corresponding key points. If you want to go deep into video effects development, I recommend studying OpenGL thoroughly, as many optimization points are involved.

2. Video coding compression algorithm

H.264/H.265 are very mature video coding standards. Using them to minimize video size and save bandwidth while guaranteeing quality requires a very deep understanding of the coding standards themselves. This is a direction with a relatively high threshold; I am still at the learning stage myself, and interested readers can read the relevant coding standard documents.

Green hills never change, green waters always flow. Thank you