I. Overview

This article surveys common problems encountered in audio and video playback, along with optimization methods for each.

II. Brief overview of abnormal conditions and their causes

  • Stuttering: the uplink or downlink bandwidth is insufficient, the device performance is insufficient, or the video stream has timestamp problems.
  • Corrupted picture (artifacts): the whole frame may blur or show mosaic blocks when the SPS/PPS parameters are set incorrectly, or part of the frame may corrupt when a P frame is lost or fails to decode.
  • Green screen: SPS/PPS retrieval fails or is wrong. Unrendered image buffers are filled with black on some platforms, with green on others, and with the previous frame on still others; if the video parameters change and the decoder's SPS/PPS information is not updated in time, the picture cannot be rendered normally and a green screen appears.
  • Skipping: a discontinuity of audio or video during playback, i.e. a timestamp discontinuity. Mainly caused by buffer management, audio-video synchronization, aggressive frame-dropping policies, or performance bottlenecks.
  • Invalid seek: the seek position is invalid or incorrectly calculated (progress-bar error, timestamp error, or wrong total video duration).
  • Seek error too large: the GOP is too large.
  • Black screen: sound plays but the picture is black.
  • Abnormal bitrate switching (seamless switching): generally shows up as failure to render, a corrupted picture, skipping, or a black screen. In essence, some state is set incorrectly or not reset when switching between streams of different bitrates.
  • Disconnection: beyond setting a disconnection threshold and reconnecting once it is exceeded, there is no obviously better approach, but the reconnection implementation has many details to get right.
  • Audio and video out of sync: the synchronization algorithm mishandles something, so the difference between the audio and video timestamps grows too large. There are also physical causes: the audio source is too far from the microphone, and sound travels much more slowly than light (keep the source close to the microphone); or the audio processing modules in the capture pipeline take too long.
  • Abnormal audio, such as pitch changes, noise, echo, or discontinuous sound. Besides improper handling in the noise suppression, echo cancellation, automatic gain control, or resampling modules, there may also be wrong codec, sampling-rate, or render parameters on the capture or playback side.
  • Playback failure, for many reasons. Service-related (third-party) errors include bad input parameters, errors returned by the playback interface, failure to create the playback container at the service layer, player creation errors, stream status errors, invalid playback URLs, and DNS resolution errors. Player-level errors that may cause playback failure include CDN node connection failure, packet-read errors, decoding errors, network interruption on the playback side, a stream server that is down or overloaded, and source stream interruption. In industrial practice, errors that recover within a threshold time are generally not counted as playback failures.
  • Long time to first frame: player initialization, back-end interface calls, DNS resolution, CDN node scheduling, metadata parsing, decoder initialization, and the buffer threshold at which the playback queue starts feeding the renderer all contribute.
  • Large delay: mainly the encoding time on the capture side, the uplink transmission time (TCP's ACK and retransmission mechanisms cost extra time), the transcoding time, the distribution time to the CDN node, and the downlink time.
  • Cumulative delay: uplink and downlink transport generally use TCP-based streaming, and network jitter plus TCP retransmission make the delay accumulate.
  • The phone gets hot: common on the capture side, and on the playback side at high bitrates. The root cause is excessive CPU usage, memory usage, or (in panorama and VR scenarios) GPU usage: for example, badly designed buffers, frequent data copies, algorithms with excessive time complexity, codecs that rely too heavily on CPU computation, too much rendering data, heavy transform computation during rendering, an encoding algorithm mismatched to the machine configuration, or unnecessarily high data precision.
  • Abnormal recorded video: the encoding parameters are set incorrectly, or the timestamps are abnormal, when recording.
  • Flash (flicker) screen: during playback, a stable picture and a composited picture switch frequently, giving the user a feeling of picture shaking.

III. Optimization methods

Address each abnormal situation or optimization point in turn:

1. Stuttering

  • Upstream optimization, to fix stuttering at its source: device performance and configuration tuning on the push side, performance optimization of the pusher, hardware encoding, and CDN edge node selection for the pusher.
  • Jitter buffering, balancing stutter against delay.
  • Download link optimization: select closer and better CDN edge nodes.
  • For users with serious playback delay, or when the CDN load is too high, the live backend pushes a message forcing the client to reduce the bitrate; the client itself can also force a bitrate reduction when it detects persistent lag.
  • The player uses hardware decoding to improve decoding performance.
  • The lower layer predicts network changes and re-establishes the connection to the CDN with seamless switching (link switching).
  • Select and combine TCP and UDP. TCP is reliable, but heavy retransmission under network jitter worsens congestion; UDP adapts better to jitter but may reorder or drop packets.
  • Variable bitrate: ideally the transcoding platform and player support true adaptive bitrate streams. Otherwise a compromise works: the client predicts network changes, the player's lower layer connects to the CDN's low-definition stream, downloads and decodes the low-definition data, and hands it to the upper rendering layer (a fake layered-coding implementation), then switches back to the source stream once the network recovers. This avoids major changes to the protocol, transcoding, and decoding, but may increase the CDN load, and it complicates the playback SDK's download, decode, and synchronization logic.
  • Layered (scalable) coding is hard to achieve in live streaming but still worth considering, and should be considered for on-demand playback; the main differences lie in the transcoding and decoding algorithm requirements.
  • Insert fake data, such as a small number of smooth audio frames, or extra video frames while the picture changes little. This is hard to implement: frames must be inserted promptly at the threshold, or the experience degrades badly or the insertion comes too late. You must account for the resulting timestamp changes and, through the A/V synchronization logic, decide at which points audio frames or video frames can be inserted, how to switch, and how to recalibrate the download link's timestamps after returning to normal. For reference, echo cancellation faces a similar problem when inserting frames into the far-end signal once it drifts out of sync with the near-end signal.
  • Timestamp errors (e.g. the video timestamp falling back): under a strict audio-video synchronization policy, some video frames would be discarded without rendering, and the image would pause until the timestamps returned to normal. You can enlarge the buffer and rewrite the timestamps to be monotonically increasing.
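The last point, rewriting a fallen-back timestamp sequence into a monotonically increasing one, can be sketched as follows. This is an illustrative sketch, not a real player's implementation; the 40 ms nominal frame interval and the millisecond PTS list are assumptions.

```python
def make_monotonic(pts_list, frame_interval_ms=40):
    """Rewrite a stream's timestamps so they increase monotonically.
    When a backward jump is detected, an offset is added to all later
    timestamps so playback advances by a nominal frame interval instead
    of pausing or dropping frames. Illustrative sketch only; a real
    player adjusts the audio and video clocks together."""
    fixed, offset, prev = [], 0, None
    for pts in pts_list:
        adjusted = pts + offset
        if prev is not None and adjusted <= prev:
            # Timestamp fell back: bump the running offset.
            offset += prev + frame_interval_ms - adjusted
            adjusted = prev + frame_interval_ms
        fixed.append(adjusted)
        prev = adjusted
    return fixed
```

After the rollback at 10 ms in `[0, 40, 80, 10, 50]`, the stream continues from 120 ms instead of pausing.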

2. Corrupted picture (artifacts)

  • Mosaic at low bitrate: apart from matching an appropriate bitrate to the current resolution, there is no other fix.
  • GPU performance bottlenecks: optimize the rendering module's code logic; for example, panorama and VR playback can update only the changed region.
  • Failure or error in obtaining the SPS/PPS decoding information. In scalable-coding or multi-stream scenarios, if the resolution or other parameters change but are not refreshed in time, the picture may corrupt (some models show a green screen). Updated decoding information mainly comes from the transcoding service: in FLV, for example, only the video header downloaded at the start of playback contains SPS/PPS information by default, and if the encoding information changes mid-stream the video header is not re-sent. This problem seems more likely when the server only remuxes rather than transcodes; if the server does transcode, it can insert the AVCDecoderConfigurationRecord frame carrying the new decoding information at the switch point in the source.
  • Some reference frames are lost. A frame-drop policy on the capture or player side may have discarded frames that are referenced; or the streaming server's socket receive buffer (SO_RCVBUF) is too small, so some frames received while recovering from network jitter are dropped and then referenced, corrupting part of the picture. Under UDP, unreliable transport can also lose packets.
  • Decoder initialization parameter errors or processing-logic errors cause decoding failures, and the failed frame is then referenced; encoder initialization parameter errors on the push side likewise cause encoding anomalies.
  • Some Android models have incompatible hardware encoders and decoders. When codec anomalies cannot be detected, the only option is a blacklist built from historical experience.
  • Parameters passed to the render module are not the actual parameters; this is simply a bug.
  • The decoder queue is not flushed when seeking or when resetting decoder parameters.

3. Green screen

  • Failure or error in obtaining SPS/PPS decoding information may result in a green screen (see the corrupted-picture section).
  • The iOS VideoToolbox hard decoder cannot parse frames split into multiple NALU units, because it considers such frames incomplete. Before submission, the player can copy the data of all the frame's NALUs (see the NALU format in H.264 for details) into a single AVPacket and then feed it to VideoToolbox. Note: when multi-slice is enabled on the capture side, some video frames in the source stream are sliced into multiple slices stored in different NALUs, which triggers this problem.
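A minimal sketch of the repacking just described: merging a frame's Annex-B NALUs into the length-prefixed (AVCC) layout that VideoToolbox expects. The function name and the assumption of well-formed input with no trailing zero padding are mine, not from the original.

```python
def annexb_to_avcc(frame: bytes) -> bytes:
    """Repack one Annex-B frame (00 00 00 01 / 00 00 01 start codes,
    possibly several slice NALUs) into AVCC layout: each NALU prefixed
    with its 4-byte big-endian length. Sketch only; assumes well-formed
    input without trailing zero padding after a NALU."""
    out = bytearray()
    i, n = 0, len(frame)
    while i < n:
        if frame[i:i + 4] == b"\x00\x00\x00\x01":
            start = i + 4
        elif frame[i:i + 3] == b"\x00\x00\x01":
            start = i + 3
        else:
            raise ValueError("expected an Annex-B start code")
        nxt = frame.find(b"\x00\x00\x01", start)
        if nxt == -1:
            end = n                  # last NALU runs to the end of the frame
        elif frame[nxt - 1] == 0:
            end = nxt - 1            # a 4-byte start code follows
        else:
            end = nxt                # a 3-byte start code follows
        nalu = frame[start:end]
        out += len(nalu).to_bytes(4, "big") + nalu
        i = end                      # `end` points at the next start code
    return bytes(out)
```

Real code would do this into the AVPacket buffer before handing the frame to VideoToolbox.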

4. Skipping

  • Server buffer management and performance bottlenecks: when the buffer is full and the streaming server cannot process the buffered data in time, it discards some data from the source stream, which the player experiences as skipping. Likewise, if the hard disk cannot keep up with the data the CPU produces while recording a video or live stream, some frames may be discarded, producing playback hops.
  • If resource usage on the playback side is too high (for example CPU usage or access frequency), a performance bottleneck may force some data to be discarded.
  • As a result of audio-video synchronization: if a timestamp rollback occurs, the player by default discards frames during the rollback; or the audio and video timestamps drift too far apart and some frames are discarded to resynchronize with the reference clock.
  • An active frame-drop policy, adopted to balance codec speed, network transmission, playback speed, and delay, may cause skipping: for example, discarding consecutive audio frames that carry significant sound, or discarding consecutive video frames. Dropping a whole GOP is the most noticeable case.
  • Unreliable UDP transport causes reordering and packet loss; if the frame corresponding to the external reference clock has already arrived, the intermediate frames are discarded. You can build TCP-like acknowledgement and validation mechanisms on top of UDP, or have clients actively send NACK retransmission requests, as SRT does.
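The reordering problem in the last point can be illustrated with a small sequence-number reorder buffer: packets are released in order, and a gap is abandoned (which is exactly what the user perceives as a jump) once enough later packets have arrived. This is an illustrative sketch, not a real transport implementation; `depth` and the class name are assumptions.

```python
import heapq

class ReorderBuffer:
    """Reorder buffer for sequence-numbered UDP packets: payloads are
    released in order, and a gap is given up on (surfacing as a jump)
    once more than `depth` later packets are waiting behind it."""
    def __init__(self, depth=4):
        self.depth = depth
        self.heap = []          # (seq, payload) min-heap
        self.next_seq = 0

    def push(self, seq, payload):
        heapq.heappush(self.heap, (seq, payload))
        released = []
        while self.heap:
            head_seq, data = self.heap[0]
            if head_seq == self.next_seq:
                heapq.heappop(self.heap)
                released.append(data)
                self.next_seq += 1
            elif len(self.heap) > self.depth:
                self.next_seq = head_seq   # give up on the gap: declare loss
            else:
                break                      # keep waiting for the missing packet
        return released
```

A NACK-based scheme would, instead of giving up, request retransmission of `next_seq` while the buffer waits.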

5. Seeking (dragging) doesn't work

For the implementation of seek in FFmpeg, refer to "how to seek in mp4/mkv/ts/flv" in the references at the end. [Note: in my experience, seeking in FLV without a keyframe list is very, very, very slow.] seekTime = (progress-bar position / progress-bar total length) × total video duration, which is then passed to the playback kernel. If seekTime is not within the actual total video duration, the player invalidates the seek; absent bugs this is usually not a problem. When seeking forward, the data is not yet buffered; if the download speed is too slow and the data has not been buffered within a certain period, the player's internal restart mechanism may kick in and cancel the seek.
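The mapping above can be sketched as follows; one defensive option is to clamp an out-of-range position instead of invalidating the seek. The function name and the millisecond units are assumptions.

```python
def compute_seek_time(progress_pos, progress_total, duration_ms):
    """Map a progress-bar position to a media timestamp (ms).
    Out-of-range positions are clamped to the valid range so that a
    progress-bar rounding error never invalidates the whole seek;
    a missing or zero length means we cannot seek at all."""
    if progress_total <= 0 or duration_ms <= 0:
        return None  # invalid lengths: refuse the seek
    ratio = min(max(progress_pos / progress_total, 0.0), 1.0)
    return int(ratio * duration_ms)
```

The result is what gets handed to the playback kernel (e.g. as the target of an `av_seek_frame`-style call).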

6. Seek error too large

The principle of seek is to find the I frame nearest the seekTime position. With GOP = n × FPS, the larger n is, the farther seekTime lies (on average) from the actual landing point. [Note: clear the decoder buffer queue after a seek.] If the total video length is long, precise seek is not urgent, as long as the GOP is not extremely large. If the video is short, or scenes change frequently (such as outdoor sports), precise seek may be needed: locate the I frame preceding seekTime, decode from that I frame onward, and wait until the frame whose timestamp matches seekTime appears before putting frames into the playback queue. (This may make seek feel slow, but it is generally fine, because the GOP in such scenarios does not need to be large.)
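The precise-seek logic above, reduced to index arithmetic over a decode-order frame list. The tuple representation is an assumption for illustration; real code would feed packets to a decoder and discard the decoded frames before the target timestamp.

```python
def precise_seek(frames, seek_time):
    """frames: list of (pts, is_keyframe) in decode order.
    Returns (decode_start, first_rendered): decoding begins at the
    nearest keyframe at or before seek_time, but frames before
    seek_time are decoded and discarded rather than rendered."""
    key = 0
    for i, (pts, is_key) in enumerate(frames):
        if pts > seek_time:
            break
        if is_key:
            key = i                 # last keyframe not after seek_time
    render = key
    for i in range(key, len(frames)):
        if frames[i][0] >= seek_time:
            render = i              # first frame worth rendering
            break
    return key, render
```

The gap between `decode_start` and `first_rendered` is exactly the extra decode work precise seek pays for accuracy.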

7. Black screen

  • The source stream has no image, possibly because image capture or encoding failed on the push side. Detection logic can be added to the encoding module.
  • Continuous decoding failures: the player may not support the specific format. Improve the callback mechanism to notify the upper layer of the decoding failure; the lower layer can try resetting the decoder itself, or re-establishing the connection.
  • Some push-side encoding or packaging is non-standard, which is also a source-stream problem.
  • The push side's service layer switches between audio-only and audio+video push. If the stream starts as audio-only and is later switched to audio+video, the player does not re-initialize the decoder by default, producing a black screen on the playback side. The player can detect whether a new frame format appears and add logic to re-initialize the decoder during playback; or the server can send a notification.

8. Abnormal bitrate switching

It generally shows up as failure to render, a corrupted picture, skipping, or a black screen. In essence, some state is set incorrectly or not reset when switching between streams of different bitrates.

9. Disconnection

A stream interruption inevitably hurts the playback experience; all you can do is recover as quickly as possible and ensure playback is normal after recovery:

  • Set the detection period for interruptions (source-stream interruption, the player-side download link, etc.)
  • Close the original stream connection, clean up unnecessary data, and close the AVFormatContext
  • Re-open the stream and obtain the audio and video decoding information
  • Rebuild the stream connection and start reading audio and video data
  • During this process, resetting the decoding parameters via find_stream_info may cause double-speed playback after recovery and similar issues; it is best to keep the original decoding parameters unchanged. The exact cause is unknown (TODO)
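The retry part of the steps above can be sketched as a simple backoff loop. `open_stream` is a caller-supplied callable standing in for the close/re-open sequence; the retry count and backoff values are assumptions, not values from the original.

```python
import time

def reconnect(open_stream, max_retries=3, backoff_s=1.0, sleep=time.sleep):
    """Retry opening the stream with linear backoff. `open_stream` is a
    caller-supplied callable that returns a connection or raises
    ConnectionError. A real player would also tear down the old
    AVFormatContext first and keep the original decoder parameters."""
    for attempt in range(1, max_retries + 1):
        try:
            return open_stream()
        except ConnectionError:
            if attempt == max_retries:
                raise                    # give up after the last attempt
            sleep(backoff_s * attempt)   # wait longer after each failure
```

The injectable `sleep` makes the loop testable without real waiting.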

10. Audio and video out of sync

Refer to the principles and implementation of audio-video synchronization.

11. Abnormal audio

  • For the audio preprocessing work, WebRTC's audio processing module can be used; it is much better than Speex, and the resulting increase in SDK size is acceptable. Reference topics: noise suppression, echo cancellation, mixing, silence detection
  • Wrong codec, sampling-rate, or render parameters during capture or playback. Pay attention to handling these problems in the engineering implementation.
  • For discontinuous sound caused by skipping, refer to the handling of skipping.
  • Sound discontinuity or pitch change caused by frame dropping mainly comes from a badly tuned frame-drop policy. For example, very frequent drops produce a halting experience; discarding many consecutive audio frames at once produces a skipping experience; and if video is dropped while audio is not, the sound may appear to change pitch.

12. Playback failure

  • Since playback can fail for many reasons, the most important thing is to improve log collection.
  • For errors related to upper-layer services (third parties), analyze the logs to help the relevant parties correct the errors.
  • If connecting to the CDN node fails: DNS resolution can return multiple IP addresses at once, and if retries on one IP address fail, the client can establish a connection using the others
  • Packet-read errors: if the error is not AVERROR_EOF, the packet is discarded directly. However, frequent or continuous packet-read errors seriously harm the playback experience and cause users to quit the room while playback is abnormal (which counts as a playback failure). Once a certain number of packet-read errors have accumulated, the connection can be re-established to restore normal playback; on AVERROR_EOF, reconnect directly.
  • Decoding errors generally show up as abnormal rendering; handle them the same way as packet-read errors
  • Network interruption on the player side partially overlaps with the packet-read errors above. Timeout reconnection still applies: increase the timeout, and switch the link reconnection strategy when the timeout count reaches the threshold
  • If the stream server is down or overloaded, log reporting only supports post-hoc analysis; the stability of the whole cluster is achieved by monitoring the stream service itself
  • Source stream interruption: the source stream was shut down abnormally or the uplink was interrupted. Some error handling can be applied to the source stream.
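The multi-IP failover for the CDN point above can be sketched as follows. `connect` is a caller-supplied callable (in practice it would wrap something like `socket.create_connection`), and the function name is an assumption.

```python
def connect_with_failover(ips, port, connect):
    """Try each DNS-resolved IP address in turn; `connect` is a
    caller-supplied callable (ip, port) -> connection object that
    raises OSError on failure. Sketch of the multi-IP retry above."""
    last_err = None
    for ip in ips:
        try:
            return connect(ip, port)
        except OSError as err:
            last_err = err   # remember the failure, try the next address
    if last_err is None:
        raise OSError("no addresses to try")
    raise last_err
```

A refinement used in practice is "happy eyeballs": race the candidate addresses concurrently instead of trying them serially.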

13. Long time to first frame

  • Preset the playback format: if the URL carries some audio and video decoding parameters, you can preset iformat and start decoding early
  • DNS pre-resolution, which the business layer can request from the scheduling server while the player is initializing
  • CDN scheduling optimization and download link optimization
  • Simplify FLV metadata parsing: FLV audio and video tags already contain a lot of information, and the upper layer can preset some information when initializing the player, so metadata can be parsed during read_frame rather than waited for.
  • Optimize the waits and synchronization during playback startup, for example by shortening the packet buffer queue and forcing a flush to the renderer without waiting for start-on-prepared
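The DNS pre-resolution point, overlapping the lookup with player initialization rather than running them back to back, can be sketched with two threads. `init_player` and `prefetch_dns` are hypothetical caller-supplied callables, not a real player API.

```python
import threading

def start_playback(init_player, prefetch_dns):
    """Overlap DNS pre-resolution with player initialization instead of
    running them serially, shaving the lookup off the first-frame time.
    Both arguments are hypothetical callables standing in for the real
    initialization and resolution steps."""
    result = {}
    resolver = threading.Thread(
        target=lambda: result.__setitem__("ip", prefetch_dns()))
    resolver.start()                  # DNS lookup runs in the background
    result["player"] = init_player()  # player init proceeds meanwhile
    resolver.join()                   # IP is usually ready by connect time
    return result
```

The same overlap applies to the other serial steps listed above (back-end interface calls, CDN scheduling): anything independent of the player object can start in parallel with it.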

14. Large delay (live streaming)

  • Network adaptation on the capture side, with frame dropping in extremely poor network conditions; the design requirements are high because the continuity of the source-stream data must be preserved. Frame dropping on the capture side and on the player side could each fill a separate article. If an I frame is discarded, the whole GOP must be discarded; if a P frame is discarded, the P frames after it in the GOP can be discarded directly. Note: the capture side of a webcast app generally does not encode B frames.
  • Capture-side encoding performance and code-logic optimization to reduce encoding delay; multi-slice can be enabled
  • The capture side adopts a real-time protocol, such as RTP or custom UDP
  • The capture side uses variable bitrate (VBR): reduce the bitrate when the picture changes little or network bandwidth is insufficient, and maintain or increase it at other times
  • Transcoding platform and CDN node scheduling optimization: keep the connection with the capture side short, and the network transmission efficient and stable
  • Transcoding performance optimization on the transcoding platform: reduce the delay caused by decoding and re-encoding, and split tasks with different real-time and compression-ratio requirements
  • Scheduling, performance, and load optimization of the download nodes after CDN distribution
  • Player-side network adaptation via frame dropping, similar to the capture side's policy. Generally, after discarding audio frames, compare the video and audio timestamps and choose how many P frames to discard on the premise of normal decoding and rendering. If the resolution and other information required for decoding is known, dropping frames before decoding can be considered (done improperly, pre-decode dropping causes skipping and corrupted or green frames).
  • When the playback side plays at higher speed to catch up, the audio changes pitch, so it needs time-stretching that preserves pitch
  • Player decoding performance optimization
  • Player-side jitter buffer (balanced against stuttering)
  • The player uses a protocol with better real-time performance
  • The player uses variable bitrate
  • The player uses a layered-coding stream
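The GOP-aware dropping rule above (drop an I frame and its whole GOP must go; drop a P frame and the later P frames of that GOP go with it) can be sketched as follows. The frame-tuple representation is an assumption; real code would operate on packets, and B frames are ignored since the text notes they are rarely encoded for live capture.

```python
def drop_frames(frames, drop_from_pts):
    """frames: list of (pts, kind) with kind in {"I", "P"}, decode order.
    Drop frames from drop_from_pts onward while honouring references:
    a dropped I frame takes its whole GOP with it, and a dropped P
    frame takes the later P frames of the same GOP."""
    kept = []
    dropping = False
    for pts, kind in frames:
        if kind == "I":
            dropping = pts >= drop_from_pts   # a new GOP resets the state
        elif pts >= drop_from_pts:
            dropping = True                   # later P frames depend on this one
        if not dropping:
            kept.append((pts, kind))
    return kept
```

Everything kept is still decodable: no surviving frame references a dropped one, which is exactly what prevents the corrupted/green frames mentioned above.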

15. Cumulative delay (live streaming)

  • Smooth frame loss and fast frame loss
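One way to read "smooth" versus "fast" frame dropping is as a buffer-drain policy: trim gradually while the accumulated delay is moderate, and catch up in one step once it passes a hard limit. The thresholds and the 40 ms frame duration below are assumptions, not values from the original.

```python
def drain_buffer(buffer_ms, target_ms, hard_limit_ms, frame_ms=40):
    """Decide how many frames to drop given the buffered duration.
    Below target: drop nothing. Between target and hard limit:
    'smooth' dropping, shed one frame at a time (barely noticeable).
    Above the hard limit: 'fast' dropping, shed everything down to
    the target at once. Illustrative policy only."""
    if buffer_ms <= target_ms:
        return 0
    if buffer_ms <= hard_limit_ms:
        return 1                                  # smooth: trim gradually
    return (buffer_ms - target_ms) // frame_ms    # fast: catch up now
```

Smooth dropping keeps the cumulative delay from growing during mild jitter; fast dropping is the escape hatch after a long TCP retransmission stall.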

16. The phone gets hot

Noted so far:

  • CPU usage too high, mainly from high computational complexity, excessive data precision, or algorithms that demand a higher machine configuration; very common with software encoding, and with matrix conversions and similar operations in panorama playback
  • CPU load concentrated on too few cores, with extremely frequent thread switching. On machines with fewer cores, single-threaded decoding can sometimes be more efficient than multi-threaded decoding.
  • High memory usage: poorly designed buffer queues, frequent data copies, etc.
  • GPU usage too high: ultra-HD, panorama, and VR playback can overload the GPU; the refresh rate of parts of the picture can be reduced

17. Abnormal recorded video

  • Calibrate the PTS when writing
  • Key decoding information must not be written incorrectly
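The write-time PTS calibration can be sketched as rebasing the recording to start at zero and nudging any backward or duplicate timestamp forward. The one-tick nudge is an assumption for illustration; muxers differ in how they handle non-monotonic input.

```python
def calibrate_pts(raw_pts_list):
    """Rebase recorded timestamps so the file starts at 0 and the PTS
    never goes backwards; a backward or duplicate value is nudged one
    tick past its predecessor. Sketch of write-time calibration."""
    if not raw_pts_list:
        return []
    base = raw_pts_list[0]
    out, last = [], None
    for pts in raw_pts_list:
        p = pts - base
        if last is not None and p <= last:
            p = last + 1    # keep the written stream strictly increasing
        out.append(p)
        last = p
    return out
```

Applied per stream before handing packets to the muxer, this avoids the recording errors described above.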

18. Flash (flicker) screen

I have not encountered this myself. Online sources attribute it to video image compositing, or to the player suddenly inserting some default pictures when stalled, which makes the picture appear to shake (hence "flash screen"). Whether compositing the picture, adding a logo, or inserting frames into the playback queue, take care to transition smoothly.

IV. Detection and monitoring

  • Logs of the playback end are reported
  • Per stream ID, implement full-link monitoring across push, transcoding, distribution, download, and playback

V. References

  1. Analysis and summary of live broadcasting problems
  2. FFmpeg: how to seek in mp4/mkv/ts/flv