This article uses a frame-drop problem in video playback as an example to walk through how a performance problem can be solved: the tools used, the optimizations tried, and some test results. It covers a fairly broad range of topics. I am still a novice in performance optimization, so discussion and feedback are welcome.

The problem

ExoPlayer is an open-source Java player released by Google and used by many video service providers, including YouTube. However, we found that when ExoPlayer plays 4K HLS video on some low-performance hardware platforms, a large number of frames are dropped and the picture also becomes “blocky”.

Hypothesis

Unlike most other applications, decoding 4K audio and video consumes a lot of CPU resources and has to keep up with real time; if it cannot, playback fluency suffers. The frame loss is therefore most likely caused by the CPU being fully loaded, which leaves audio and video decoding without enough resources to maintain smooth playback. Reference: Android Mobile Performance In Action, Chapter 4.

Validation tools

To test the hypothesis proposed above, three main tools are used here:

  1. DS5 & Streamline

This is a performance analysis tool commonly used on ARM platforms. A basic introduction and the installation process can be found online, for example under ARM DS-5 Streamline.

  2. Systrace & TraceView

Both of these performance analysis tools can be launched directly from the Android Device Monitor that ships with Android Studio. Although DS-5 & Streamline is powerful, it cannot analyze the performance of Java code, so Systrace and Traceview are used here mainly for that purpose. For how to use these two tools, refer to the articles Analyzing UI Performance with Systrace and Profiling with Traceview and dmtracedump.

  3. Bento4 tool set

With the Bento4 toolset you can create your own HLS or DASH test streams. Bento4 also supports the newer HLS-FMP4 format.

Verification results

Streamline was used to capture system information while playing 4K video. It shows that the Loader: HLS thread in the player process uses the CPU in bursts: its CPU usage (two cores) can reach 60% to 80%, as shown in the following figure.

Comparing this with logs captured at the same time shows that every burst is accompanied by a large number of dropped frames. We can therefore conclude that the Loader: HLS thread occupying a large share of the CPU is a major cause of stuttering in 4K video playback.

Investigation and analysis


The Loader: HLS thread is the background thread in the player that downloads the M3U8 playlists and TS segments. Given that, it is natural to suspect that the I/O operations performed during download are consuming the CPU. The key I/O operations during playback, that is, the operations that involve a buffer, can be summarized as: CDN –> buffer –> player –> decoder


A suspicious loop was found in the Loader: HLS thread:

// in TsChunk.java@load()
while (result == Extractor.RESULT_CONTINUE && !loadCanceled) {
  result = extractorWrapper.read(input);
  ...
}

The read method here calls TsExtractor.java@read, as follows

public int read(ExtractorInput input, PositionHolder seekPosition)
    throws IOException, InterruptedException {
  if (!input.readFully(tsPacketBuffer.data, 0, TS_PACKET_SIZE, true)) {
    return RESULT_END_OF_INPUT;
  }
  // parse a packet
  .....
  return RESULT_CONTINUE;
}

The readFully method above performs the download. As you can see, each call fetches one unit of TS_PACKET_SIZE = 188 bytes, and the read method then demuxes and parses the packet it has just fetched. TS download and the parser are therefore invoked in a tight, high-frequency loop. The number of iterations per segment can be calculated as bitrate * segDuration / (188 * 8). With a bitrate of about 8 Mbps and a segment length of 10 s, that is more than 50,000 iterations. In practice the bitrate of 4K video is not constant: it may be as low as 7 Mbps and as high as 16 Mbps, so the number of iterations is considerable.
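The iteration count formula above can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names (TsLoopEstimate, packetsPerSegment) are made up for illustration and are not part of ExoPlayer, and it assumes a constant bitrate over the whole segment.

```java
/** Back-of-the-envelope estimate of read-loop iterations per TS segment. */
final class TsLoopEstimate {
    static final int TS_PACKET_SIZE = 188; // bytes per MPEG-TS packet

    /** Approximate number of 188-byte reads needed to consume one segment. */
    static long packetsPerSegment(long bitrateBitsPerSec, long segmentDurationSec) {
        long segmentBits = bitrateBitsPerSec * segmentDurationSec;
        return segmentBits / (TS_PACKET_SIZE * 8L); // integer division rounds down
    }

    public static void main(String[] args) {
        long[] bitratesMbps = {7, 8, 16}; // the bitrates mentioned in the text
        for (long mbps : bitratesMbps) {
            long iterations = packetsPerSegment(mbps * 1_000_000L, 10);
            System.out.println(mbps + " Mbps, 10 s segment: " + iterations + " iterations");
        }
    }
}
```

For a 10 s segment this gives 46,542 iterations at 7 Mbps, 53,191 at 8 Mbps, and 106,382 at 16 Mbps.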

We can therefore draw a preliminary conclusion: because demuxing is also done in the loader thread, because demuxing TS is more complex than demuxing other formats such as FMP4, and because only 188 bytes are processed per iteration, playing high-bitrate sources consumes more resources. Some readers may ask why download and demuxing have to happen together, and whether they can be separated. There are two considerations. First, if demuxing started only after the download had finished, startup time would inevitably suffer. Second, audio and video are poorly interleaved in many TS streams, so demuxing has to be done ahead of time to achieve good audio/video synchronization.

Traceview was then used to determine whether download or parse consumes more CPU, as shown in the figure below.

In the figure, DefaultExtractorInput.read represents the download operation and PesReader.consume represents the parse operation. You can see that the download operation takes up more CPU.


As a comparison, HLS-TS and HLS-FMP4 sources with the same bitrate and segment length were produced with the Bento4 tools, and system information was captured with Streamline during playback. The result of playing the TS stream is shown in the figure below.

The result of playing the FMP4 stream is shown below

Comparing the number of download and parse loop calls shows that the TS stream makes about 1000 times more calls than the FMP4 stream. Conclusion: the frequent download I/O operations in the Loader: HLS thread are the main reason for the high CPU usage.

Optimization scheme and results

This section describes the optimizations we tried and their results. Testing covered both bandwidth-limited and unlimited scenarios. The main metrics observed were startup time, stall ratio, number of dropped frames, and CPU usage.


First, lower the priority of the Loader: HLS thread. This is simple and crude: directly lowering the loader thread's priority effectively raises the priority of decoding, so that decoding can complete its work smoothly and fewer frames are dropped. Add the following line to the loader thread:

Process.setThreadPriority(Process.THREAD_PRIORITY_LOWEST);

Testing showed that frame loss was reduced by half. Long-duration stress testing found no effect on startup time or stall ratio. Let's look at how CPU usage changes.

As you can see, the loader still bursts and the peak CPU usage is still above 80%, but some bursts are “broken up”: the peak duration is shorter, and testing found about one third fewer bursts.

Conclusion: lowering the loader thread's priority breaks up the CPU bursts and reduces the number of dropped frames, and no impact on startup time or rebuffering was found.
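The one-line change in scheme 1 uses the Android-specific android.os.Process API, which works on Linux nice values. As a rough, platform-neutral illustration of the same idea, the sketch below lowers the priority of a worker thread with the standard java.lang.Thread API instead. The class name, method name, and thread name are hypothetical, and Java thread priorities (1 to 10) are only a hint to the scheduler, so this is an analogy rather than an equivalent of the Android call.

```java
/** Sketch: run a (hypothetical) loader task on a thread with the lowest Java priority. */
final class LoaderPriorityDemo {

    /** Creates a thread for the given task and lowers its priority before it starts. */
    static Thread newLowPriorityLoader(Runnable loadTask) {
        Thread loader = new Thread(loadTask, "Loader:demo"); // illustrative thread name
        loader.setPriority(Thread.MIN_PRIORITY);              // scheduler hint only
        return loader;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread loader = newLowPriorityLoader(() -> {
            Thread self = Thread.currentThread();
            System.out.println(self.getName() + " running at priority " + self.getPriority());
        });
        loader.start();
        loader.join(); // wait for the demo task to finish
    }
}
```

On Android, the call shown in the article has to be made from inside the thread whose priority should change, which is why it is added to the loader thread itself.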


Second, force the Loader: HLS thread to sleep in the middle of its frequent calls. The idea is similar to lowering the thread priority in scheme 1: sleeping inside the while loop mentioned above frees up CPU resources, but it may lengthen the download time of TS segments. To address this, we can design a dynamic sleep mechanism based on bufferedDuration. The specific sleep logic is shown in the following code:

int i = 0;
try {
  int result = Extractor.RESULT_CONTINUE;
  while (result == Extractor.RESULT_CONTINUE && !loadCanceled) {
    i++;
    if (i > DEFAULT_EXTRACTOR_READ_INTERVAL && isHighBuffer) {
      i = 0;
      Thread.sleep(DEFAULT_SLEEP_TIME_MS);
    }
    result = extractorWrapper.read(input);
    ...
  }

DEFAULT_EXTRACTOR_READ_INTERVAL is the number of while-loop iterations between sleeps. DEFAULT_SLEEP_TIME_MS is the length of each sleep, set to 10 ms. isHighBuffer indicates whether the currently buffered duration is sufficient, based on a configured threshold. Testing showed that frame loss can also be reduced by more than half, but because of the forced sleep, rebuffering becomes more frequent; a well-chosen buffer threshold can mitigate this. Compared with scheme 1, CPU usage is not obviously improved, and bursts still occur. In fact, the thread priority reduction in scheme 1 could likewise be changed to adjust dynamically based on bufferedDuration.
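The dynamic sleep logic of scheme 2 can be isolated into a small helper so that its behavior is easy to reason about. The following is a minimal sketch, not the code used in the tests above: the class LoadThrottle, its constructor parameters, and the way isHighBuffer is derived from a buffered duration and a threshold are all assumptions made for illustration. The sleep action is injected so the example runs without actually sleeping.

```java
import java.util.function.LongConsumer;

/** Sketch: sleep every N reads, but only while enough media is already buffered. */
final class LoadThrottle {
    private final int readInterval;           // reads allowed between two sleeps
    private final long sleepMs;               // length of each sleep
    private final long highBufferThresholdMs; // assumed threshold for "high buffer"
    private final LongConsumer sleeper;       // injected so tests need not really sleep
    private int readsSinceSleep = 0;

    LoadThrottle(int readInterval, long sleepMs, long highBufferThresholdMs,
                 LongConsumer sleeper) {
        this.readInterval = readInterval;
        this.sleepMs = sleepMs;
        this.highBufferThresholdMs = highBufferThresholdMs;
        this.sleeper = sleeper;
    }

    /** Call once per loop iteration, before the read. Returns true if it slept. */
    boolean beforeRead(long bufferedDurationMs) {
        readsSinceSleep++;
        boolean isHighBuffer = bufferedDurationMs >= highBufferThresholdMs;
        if (readsSinceSleep > readInterval && isHighBuffer) {
            readsSinceSleep = 0;
            sleeper.accept(sleepMs);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        long[] sleptMs = {0};
        // Interval of 3 reads, 10 ms sleeps, "high buffer" means at least 5 s buffered.
        LoadThrottle throttle = new LoadThrottle(3, 10, 5_000, ms -> sleptMs[0] += ms);
        for (int i = 0; i < 10; i++) {
            throttle.beforeRead(8_000); // plenty buffered, so throttling is active
        }
        System.out.println("total simulated sleep: " + sleptMs[0] + " ms");
    }
}
```

In a real loader the injected action would wrap Thread.sleep and handle InterruptedException. When the buffered duration is below the threshold the helper never sleeps, which is what keeps the throttle from slowing downloads while the buffer is running low.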


Third, download multiple TS packets at a time to make I/O operations less frequent. Demuxing, of course, still has to handle one TS packet at a time. As an extreme test, we used a self-made HLS stream, downloaded 500 TS packets per read, and checked the system information with Streamline, as shown in the following figure.

Over a given period the CPU usage is reduced, but rebuffering becomes noticeably more frequent. This can be improved by choosing a more reasonable number of packets per download.
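The batching idea in scheme 3 can be sketched with plain java.io streams: fill a buffer that holds several TS packets with one read, then walk through it one 188-byte packet at a time. This is an illustration under stated assumptions rather than the patch that was tested. BatchedTsReader, drain, and syntheticStream are hypothetical names, the "parsing" only checks the TS sync byte 0x47, and an in-memory stream stands in for the network.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Sketch: fetch several TS packets per I/O call, then parse them one by one. */
final class BatchedTsReader {
    static final int TS_PACKET_SIZE = 188;
    static final int TS_SYNC_BYTE = 0x47; // first byte of every MPEG-TS packet

    /** Reads until len bytes have arrived or the stream ends; returns the bytes read. */
    private static int readUpTo(InputStream in, byte[] buf, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, total, len - total);
            if (n < 0) break; // end of stream
            total += n;
        }
        return total;
    }

    /** Consumes the stream; returns {number of batch reads, number of packets parsed}. */
    static long[] drain(InputStream in, int packetsPerRead) {
        byte[] buf = new byte[packetsPerRead * TS_PACKET_SIZE];
        long batchReads = 0;
        long packets = 0;
        try {
            int n;
            while ((n = readUpTo(in, buf, buf.length)) >= TS_PACKET_SIZE) {
                batchReads++;
                // Parsing still advances one packet at a time inside the batch.
                for (int off = 0; off + TS_PACKET_SIZE <= n; off += TS_PACKET_SIZE) {
                    if ((buf[off] & 0xFF) == TS_SYNC_BYTE) packets++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] {batchReads, packets};
    }

    /** Builds a fake stream of empty packets that only carry the sync byte. */
    static byte[] syntheticStream(int packetCount) {
        byte[] data = new byte[packetCount * TS_PACKET_SIZE];
        for (int i = 0; i < packetCount; i++) data[i * TS_PACKET_SIZE] = (byte) TS_SYNC_BYTE;
        return data;
    }

    public static void main(String[] args) {
        byte[] stream = syntheticStream(1000);
        for (int batch : new int[] {1, 500}) {
            long[] r = drain(new ByteArrayInputStream(stream), batch);
            System.out.println(batch + " packet(s) per read: " + r[0]
                + " batch reads, " + r[1] + " packets parsed");
        }
    }
}
```

With 1000 synthetic packets, reading one packet at a time needs 1000 batch reads, while reading 500 at a time needs 2, and both parse all 1000 packets. The trade-off described above still applies: a larger batch means parsed samples reach the player later, which is consistent with the increase in rebuffering observed in the extreme test.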

That is the whole process, from hypothesis and verification to proposing and testing solutions, for this frame-drop performance problem.


Update

We have discussed this issue with Google at github.com/google/ExoP… The explanation given there is reproduced here for reference:

Do think that at the root of these HLS cpu peaks there is a more generic choice in the way the default exoplayer data loading operates. It works in a peak pattern that tries to fetch data as quickly as possible with reltatively large intervals (15s). And thus you get these kind of cpu utilization peaks at given intervals for Ts based HLS.

The cpu utilization side effect of the default loader choice grows with the bitrate as as the loader works in the time domain only. So for a 30mbit stream by default it’s trying to load and parse 15 seconds of data (56mb) as quick as possible. Ironically this also means that the higher you’re connection speed to the server, the more peak utilization the player will put on one of the cpu’s. Which in turn can cause scheduling issues on the system the player runs on.

