Preface

Image, audio, and video software keeps growing in the market, and the last two years have also been a boom period for live streaming and short video. Images, audio, and video have always been the visual entry point of the Internet, so mastering image and audio/video technology has become an indispensable skill in the current Internet era, and it is a skill that accumulates over time.

At present, the learning materials on the market are uneven in quality. I think that if you want to set out on the road of audio and video learning, you must first understand the technical points involved in the overall process, and then break them down one by one. I am no expert in audio and video myself. Recently, the company has planned a video face-swapping application, which is currently in the technical research phase, so I am taking this opportunity to start my own audio and video learning.

Audio and video app knowledge points: Image, Audio, Video (the original overview figures for each part are omitted here).
The whole process

Taking mobile live broadcasting as an example, the overall process is as follows:

Data collection

1. Audio collection

Audio collection involves the following points:

  • Check whether the microphone can be used;
  • Check whether the phone supports the desired audio sampling rate;
  • In some cases, echo cancellation needs to be applied to the audio;
  • Set the correct buffer size for audio collection.

On Android, AudioRecord or MediaRecorder is generally used to collect audio. AudioRecord is the lower-level API: it delivers a sequence of frames of raw PCM data that can then be processed further. MediaRecorder is built on top of AudioRecord (it eventually creates an AudioRecord internally to interact with AudioFlinger); it captures audio, encodes it directly into the target format, and saves it to a file.
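
As a quick illustration, here is a minimal sketch of PCM capture with AudioRecord; the sample rate, channel configuration, and buffer handling are assumptions chosen for the example, not values from the article.

```java
import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

// Minimal PCM capture sketch; requires the RECORD_AUDIO permission.
public class PcmRecorder {
    private static final int SAMPLE_RATE = 44100; // assumed rate; check device support first
    private static final int CHANNEL = AudioFormat.CHANNEL_IN_MONO;
    private static final int ENCODING = AudioFormat.ENCODING_PCM_16BIT;

    private volatile boolean recording;

    public void start() {
        int minBuf = AudioRecord.getMinBufferSize(SAMPLE_RATE, CHANNEL, ENCODING);
        AudioRecord record = new AudioRecord(
                MediaRecorder.AudioSource.MIC, SAMPLE_RATE, CHANNEL, ENCODING, minBuf * 2);
        byte[] buffer = new byte[minBuf];
        record.startRecording();
        recording = true;
        while (recording) {
            int read = record.read(buffer, 0, buffer.length); // raw PCM frames
            if (read > 0) {
                // hand the PCM data to processing / encoding here
            }
        }
        record.stop();
        record.release();
    }

    public void stop() {
        recording = false;
    }
}
```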

2. Video capture

Video collection involves the following points:

  • Check whether the camera can be used;
  • The image captured by the camera sensor is landscape-oriented and needs to be rotated before it is displayed.
  • The camera offers a set of capture sizes to choose from; when the captured size does not match the phone screen, special handling is required.
  • The Android camera has a series of states, and each operation on the camera can only be performed in the correct state.
  • Many Android camera parameters have compatibility problems across devices, and these need to be handled properly.

There are two APIs for video capture on Android: Camera and Camera2. Camera is the older API and has been deprecated since Android 5.0 (API 21). As with audio, there are high-level and low-level approaches: the high-level approach combines Camera with MediaRecorder to get encoded video quickly, while the low-level approach uses Camera directly, applies pre-processing such as filters and noise reduction to the captured frames, encodes them in hardware with MediaCodec, and finally uses MediaMuxer to produce the final video file.
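
For reference, here is a minimal sketch of opening the (deprecated) Camera API, fixing the rotation, and directing preview frames to a SurfaceTexture; the 90-degree rotation and the preview-size choice are simplifying assumptions.

```java
import android.graphics.SurfaceTexture;
import android.hardware.Camera;
import java.io.IOException;

// Minimal Camera preview sketch using the old Camera API.
public class CameraCapture {
    private Camera camera;

    public void start(SurfaceTexture texture) throws IOException {
        camera = Camera.open();                 // back camera by default
        camera.setDisplayOrientation(90);       // sensor output is landscape; rotate for a portrait UI
        Camera.Parameters params = camera.getParameters();
        // In real code, pick a supported size that matches the screen's aspect ratio.
        Camera.Size size = params.getSupportedPreviewSizes().get(0);
        params.setPreviewSize(size.width, size.height);
        camera.setParameters(params);
        camera.setPreviewTexture(texture);      // frames go to an OpenGL-backed SurfaceTexture
        camera.startPreview();
    }

    public void stop() {
        if (camera != null) {
            camera.stopPreview();
            camera.release();
            camera = null;
        }
    }
}
```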

Data processing

1. Audio processing

The raw audio stream can be processed, for example with noise reduction, echo cancellation, and various filter effects.

2. Video processing

At present, apps such as Douyin and Meitu provide many video filters for shooting and video processing, as well as stickers, scenes, face recognition, special effects, and watermarks.

In fact, video beautification and special effects are processed through OpenGL. Android provides GLSurfaceView, which is similar to SurfaceView but is rendered through a Renderer. A texture can be created with OpenGL, a SurfaceTexture can be generated from that texture id, and the SurfaceTexture can then be handed to the Camera. In this way the texture connects the Camera preview image to OpenGL, and you can then do all kinds of processing with OpenGL.
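
A minimal sketch of that hookup, assuming it runs on the GL thread: create an OES texture, wrap it in a SurfaceTexture, and hand that SurfaceTexture to Camera.setPreviewTexture().

```java
import android.graphics.SurfaceTexture;
import android.opengl.GLES11Ext;
import android.opengl.GLES20;

// Creates an external OES texture and a SurfaceTexture bound to it,
// so camera preview frames can be consumed by OpenGL.
public class PreviewTextureFactory {
    public static SurfaceTexture create() {
        int[] tex = new int[1];
        GLES20.glGenTextures(1, tex, 0);
        GLES20.glBindTexture(GLES11Ext.GL_TEXTURE_EXTERNAL_OES, tex[0]);
        GLES20.glTexParameterf(GLES11Ext.GL_TEXTURE_EXTERNAL_OES,
                GLES20.GL_TEXTURE_MIN_FILTER, GLES20.GL_LINEAR);
        GLES20.glTexParameterf(GLES11Ext.GL_TEXTURE_EXTERNAL_OES,
                GLES20.GL_TEXTURE_MAG_FILTER, GLES20.GL_LINEAR);
        // The SurfaceTexture built from this texture id is what gets passed to
        // Camera.setPreviewTexture(); each new frame arrives as an external OES texture.
        return new SurfaceTexture(tex[0]);
    }
}
```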

The whole process is essentially: use the FBO (framebuffer object) technique in OpenGL to render the Camera preview into a new texture, and then draw that texture in the Renderer's onDrawFrame(). Adding a watermark means first converting an image into a texture and then drawing it with OpenGL. Adding dynamic sticker effects is more complex: the current preview frame is first analyzed to detect and locate the relevant parts of the face, and the corresponding images are then drawn at each detected position. Implementing this whole pipeline is not trivial; face recognition technologies currently include OpenCV, Dlib, MTCNN, and so on.
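
A skeletal Renderer sketch of that loop, with the FBO/filter work left as comments; the constructor argument is assumed to be the SurfaceTexture created above.

```java
import android.graphics.SurfaceTexture;
import android.opengl.GLSurfaceView;
import javax.microedition.khronos.egl.EGLConfig;
import javax.microedition.khronos.opengles.GL10;

// Each frame, pull the latest camera image into the OES texture and draw it.
public class PreviewRenderer implements GLSurfaceView.Renderer {
    private final SurfaceTexture previewTexture;

    public PreviewRenderer(SurfaceTexture previewTexture) {
        this.previewTexture = previewTexture;
    }

    @Override public void onSurfaceCreated(GL10 gl, EGLConfig config) {
        // compile shaders, set up the FBO and watermark textures here
    }

    @Override public void onSurfaceChanged(GL10 gl, int width, int height) {
        GLES20Helper.setViewport(width, height); // placeholder for GLES20.glViewport(0, 0, width, height)
    }

    @Override public void onDrawFrame(GL10 gl) {
        previewTexture.updateTexImage(); // latest camera frame into the OES texture
        // render the texture into the FBO, apply filters / watermark, then draw to screen
    }

    // Tiny helper only so the sketch stays self-contained.
    private static final class GLES20Helper {
        static void setViewport(int w, int h) { android.opengl.GLES20.glViewport(0, 0, w, h); }
    }
}
```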

Data encoding

1. Audio coding

On Android, AudioRecord can be used to record sound, and the recorded sound is raw PCM, described by three parameters: the number of channels, the sample size (bits per sample), and the sampling rate. Transmitting audio purely as PCM takes up a lot of bandwidth (for example, 44.1 kHz, 16-bit, stereo PCM is about 44100 × 16 × 2 ≈ 1.4 Mbps), so the audio needs to be encoded before transmission.

There are already some widely used sound formats, such as WAV, MIDI, MP3, WMA, AAC, Ogg, and so on. Compared with PCM, these formats compress the sound data and reduce the transmission bandwidth. Audio encoding can be divided into software encoding and hardware encoding. Software encoding means downloading the corresponding codec library, writing the corresponding JNI glue, and then passing in the data to be encoded. Hardware encoding uses MediaCodec, which Android itself provides.

The difference between the two is that a software encoder can be selected and adjusted at run time, while a hardware encoder is fixed by the device and cannot be changed in the same way.
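
A minimal sketch of configuring a hardware AAC encoder with MediaCodec; the sample rate, channel count, and bitrate below are illustrative values.

```java
import android.media.MediaCodec;
import android.media.MediaCodecInfo;
import android.media.MediaFormat;
import java.io.IOException;

// Configures and starts an AAC (LC profile) hardware encoder.
public class AacEncoderFactory {
    public static MediaCodec create() throws IOException {
        MediaFormat format = MediaFormat.createAudioFormat(
                MediaFormat.MIMETYPE_AUDIO_AAC, 44100, 1);
        format.setInteger(MediaFormat.KEY_AAC_PROFILE,
                MediaCodecInfo.CodecProfileLevel.AACObjectLC);
        format.setInteger(MediaFormat.KEY_BIT_RATE, 96_000);

        MediaCodec encoder = MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_AUDIO_AAC);
        encoder.configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE);
        encoder.start();
        // PCM buffers from AudioRecord are queued into the encoder's input buffers,
        // and encoded AAC frames are drained from its output buffers.
        return encoder;
    }
}
```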

2. Video coding

There are two ways to implement video encoding on the Android platform: software encoding and hardware encoding. Software encoding usually relies on the CPU, using its computing power to encode. For example, we can download the x264 library, write the related JNI interface, and then pass in the image data; after the x264 library processes it, the raw images are turned into H.264-format video.

For hardware encoding, MediaCodec provided by Android itself is used. MediaCodec needs to be fed the corresponding data, which can be YUV image data or a Surface. A Surface is generally recommended for higher efficiency: it uses native video buffers directly, without mapping or copying them into ByteBuffers, so this approach is more efficient. When using a Surface you usually cannot access the raw video data directly, but you can use the ImageReader class to access the decoded (raw) video frames. This may still be more efficient than using ByteBuffers, because some native buffers can be mapped into direct ByteBuffers. When using ByteBuffer mode, the raw video frames can be retrieved with the Image class and the getInput/OutputImage(int) methods.
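
A minimal sketch of a hardware H.264 encoder fed from an input Surface; the resolution, bitrate, frame rate, and key-frame interval are illustrative assumptions.

```java
import android.media.MediaCodec;
import android.media.MediaCodecInfo;
import android.media.MediaFormat;
import android.view.Surface;
import java.io.IOException;

// Configures an AVC encoder in Surface input mode; frames rendered onto the
// returned encoder's input Surface are encoded in hardware.
public class AvcEncoderFactory {
    public static MediaCodec create(int width, int height) throws IOException {
        MediaFormat format = MediaFormat.createVideoFormat(
                MediaFormat.MIMETYPE_VIDEO_AVC, width, height);
        format.setInteger(MediaFormat.KEY_COLOR_FORMAT,
                MediaCodecInfo.CodecCapabilities.COLOR_FormatSurface);
        format.setInteger(MediaFormat.KEY_BIT_RATE, 2_000_000);
        format.setInteger(MediaFormat.KEY_FRAME_RATE, 30);
        format.setInteger(MediaFormat.KEY_I_FRAME_INTERVAL, 1);

        MediaCodec encoder = MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_VIDEO_AVC);
        encoder.configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE);
        Surface inputSurface = encoder.createInputSurface(); // render camera/OpenGL frames onto this
        encoder.start();
        // Encoded H.264 buffers are then drained from the encoder and handed to MediaMuxer or a streamer.
        return encoder;
    }
}
```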

Audio and video muxing

Here is a diagram borrowed from elsewhere, since drawing one myself would take too long:

Take compositing MP4 videos as an example:

    1. Overall, in the synthesized MP4 file, the video part is H.264-encoded data and the audio part is AAC-encoded data.
    2. The H.264 and AAC data are written into the MP4 file through MediaMuxer’s writeSampleData() interface (see the sketch below).
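
A minimal sketch of that muxing step; the output path and the two MediaFormat objects are assumed to come from the encoders described above.

```java
import android.media.MediaCodec;
import android.media.MediaFormat;
import android.media.MediaMuxer;
import java.io.IOException;
import java.nio.ByteBuffer;

// Writes encoded H.264 and AAC samples into an MP4 container with MediaMuxer.
public class Mp4Writer {
    private final MediaMuxer muxer;
    private final int videoTrack;
    private final int audioTrack;

    public Mp4Writer(String outputPath, MediaFormat videoFormat, MediaFormat audioFormat)
            throws IOException {
        muxer = new MediaMuxer(outputPath, MediaMuxer.OutputFormat.MUXER_OUTPUT_MPEG_4);
        videoTrack = muxer.addTrack(videoFormat); // format from the H.264 encoder's output
        audioTrack = muxer.addTrack(audioFormat); // format from the AAC encoder's output
        muxer.start();
    }

    public void writeVideo(ByteBuffer buffer, MediaCodec.BufferInfo info) {
        muxer.writeSampleData(videoTrack, buffer, info);
    }

    public void writeAudio(ByteBuffer buffer, MediaCodec.BufferInfo info) {
        muxer.writeSampleData(audioTrack, buffer, info);
    }

    public void finish() {
        muxer.stop();
        muxer.release();
    }
}
```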

Data transmission

  • At present, the mainstream video streaming protocols include RTMP and RTSP.

Technologies Involved

The following technologies are involved, and I will list them in the order of image, audio and video:

  1. Camera, Camera2.
  2. SurfaceView, TextureView, SurfaceTexture, GLSurfaceView.
  3. OpenGL ES.
  4. OpenCV, Dlib.
  5. YUV, PCM, H.264, H.265, AAC.
  6. AudioRecord, AudioTrack.
  7. MediaRecorder.
  8. MediaCodec.
  9. MediaExtractor, MediaMuxer.
  10. FFmpeg, ijkplayer.
  11. RTMP, RTSP.

Later, I will write summaries of each of these audio and video technologies. If you find them useful, please like and follow.