**By John Blair, Netflix Partner Engineering**

LinkedIn / www.linkedin.com/in/x1jdb/

Original link /Netflixtechblog.com/life-of-a-n…

Netflix’s app runs on hundreds of smart TVs, streaming sticks and pay-TV set-top boxes. The role of Netflix’s partner engineers is to help device manufacturers launch the Netflix app on their devices. In this article, we discuss one particularly difficult issue that held up the launch of a device in Europe.

Mysterious beginnings

At the end of 2017, I was on a conference call to discuss a problem with the Netflix app on a new set-top box. The box was a new Android TV device with 4K playback capabilities, based on version 5.0 of the Android Open Source Project (AOSP), aka “Lollipop.” I had been at Netflix for a few years and had shipped many devices, but this was my first Android TV device.

All four companies involved with the device were on the call: the large European pay-TV company launching the device (the operator), the contractor integrating the set-top box firmware (the integrator), the system-on-a-chip supplier (the chip vendor), and me (Netflix).

The integrator and Netflix had already completed the rigorous Netflix certification process, but during the operator’s internal testing, a company executive reported a serious problem: Netflix playback on his device was stuttering. The video would play for a short time, then pause, then resume, then pause again. It didn’t happen every time, but it reliably started within a few days of powering up the box. They provided a demo video, and it looked bad.

The integrator had found a way to reproduce the problem: repeatedly start Netflix, start playback, then return to the device’s user interface. They provided a script to automate the process. It sometimes took as long as five minutes, but the script reproduced the bug reliably.

Meanwhile, a field engineer at the chip vendor had diagnosed the root cause: Netflix’s Android TV app, Ninja, wasn’t delivering audio data fast enough. The stutter was caused by the device’s audio pipeline running out of buffered data. Playback stopped while the decoder waited for Ninja to deliver more of the audio stream, then resumed once more data arrived. The integrator, the chip vendor and the operator all thought the problem had been identified, and their message to me was clear: Netflix, there’s a bug in your application and you need to fix it. I could hear the stress in their voices. Their device was late and over budget, and they were looking to me for a solution.

Investigation

I was skeptical. The same Ninja app runs on millions of Android TV devices, including smart TVs and other set-top boxes. If there was a bug in Ninja, why was it only showing up on this device?

I started by reproducing the problem myself using the script they provided, and I contacted my colleague at the chip vendor to ask if he’d seen anything like it before (he hadn’t). Next, I started reading Ninja’s source code to find the exact code that delivered the audio data. I understood a lot of it, but I started to get lost in the playback code, and I needed help.

I walked upstairs and found the engineer who wrote Ninja’s audio and video delivery code, and he walked me through it. Afterwards I spent some time studying the source code myself to understand its working parts, and added my own logging to confirm my understanding. The Netflix app is complex. In simple terms, it streams data from Netflix servers, buffers several seconds of video and audio data on the device, and then delivers video and audio frames to the device’s playback hardware, one at a time.

Figure 1: Device playback pipeline (simplified)

Let’s take a moment to discuss the audio/video pipeline in the Netflix application. Everything up to the “decoder buffer” is the same on every set-top box and smart TV, but moving the A/V data into the device’s decoder buffer is a device-specific routine that runs in its own thread. Its job is to keep the decoder buffer full by calling a Netflix-provided API that supplies the next frame of audio or video data. In Ninja, this job is performed by an Android thread. There is a simple state machine and some logic to handle the different playback states, but during normal playback, the thread copies one frame of data into the Android playback API and then tells the thread scheduler to wait 15 milliseconds and invoke the handler again. When you create an Android thread, you can ask for it to run repeatedly, as if in a loop, but it is Android’s thread scheduler that calls the handler, not your own application.
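A minimal sketch of that feed loop, written in Python rather than as Android code and driven by a simulated clock so it runs anywhere. The name `run_feeder` and the frame labels are made up for illustration and are not Ninja’s actual code; only the 15-millisecond delay comes from the description above.

```python
from collections import deque

HANDLER_INTERVAL_MS = 15  # delay requested from the scheduler, per the article


def run_feeder(frames, interval_ms=HANDLER_INTERVAL_MS):
    """Simulate the feeder: hand over one frame per call, then reschedule.

    Returns (timestamp_ms, frame) pairs showing when each frame would be
    copied into the playback API, assuming the scheduler wakes the handler
    exactly on time (a hypothetical, idealized scheduler).
    """
    pending = deque(frames)
    delivered = []
    now_ms = 0  # simulated clock, not wall time
    while pending:
        delivered.append((now_ms, pending.popleft()))  # copy one frame
        now_ms += interval_ms  # "call me again in interval_ms"
    return delivered


schedule = run_feeder(["frame0", "frame1", "frame2"])
```

The point of the sketch is that the handler never decides when it runs next; it only asks for a delay, so the real spacing between calls depends entirely on the scheduler honoring that request.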

Netflix plays video at a maximum of 60 frames per second, so the device must render a new frame every 16.66 milliseconds. Checking for a new sample every 15 milliseconds is therefore just fast enough to stay ahead of any video stream Netflix offers. Because the integrator had identified the audio stream as the problem, I focused on the specific thread handler that delivers audio samples to the Android audio service.
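The timing margin can be checked with a few lines of arithmetic. This is only a worked example of the numbers quoted above; the variable names are illustrative.

```python
MAX_FPS = 60              # highest frame rate quoted in the article
HANDLER_INTERVAL_MS = 15  # delay between handler calls

frame_period_ms = 1000 / MAX_FPS               # time budget per frame
calls_per_second = 1000 / HANDLER_INTERVAL_MS  # handler invocations per second
headroom_ms = frame_period_ms - HANDLER_INTERVAL_MS

print(f"frame period:     {frame_period_ms:.2f} ms")
print(f"handler calls/s:  {calls_per_second:.2f}")
print(f"headroom per call: {headroom_ms:.2f} ms")
```

The headroom is under 2 milliseconds per call, which is why any systematic delay added to the interval matters so much.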

I wanted to answer one question: where was the extra time going? I assumed the culprit was some function called by the handler, so I sprinkled log messages throughout the handler, expecting the offending code to become obvious. It soon became apparent that nothing unusual was happening inside the handler: even when playback was stuttering, the handler was still completing in a few milliseconds.

Insight

Finally, I focused on three numbers: the data transfer rate, the time when the handler was called, and the time when the handler returned control to Android. I wrote a script to parse the log output and made the chart below, which gives the answer.
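A rough sketch of what such a log-parsing script might look like. The log format here (`<timestamp_ms> audio_handler enter` and `<timestamp_ms> audio_handler exit bytes=<n>`) is invented for illustration, since the article does not show the real log lines, and the sample values are made up as well.

```python
import re

# Hypothetical log format, not the real Ninja output.
LINE_RE = re.compile(
    r"^(?P<ts>\d+(?:\.\d+)?) audio_handler "
    r"(?P<event>enter|exit)(?: bytes=(?P<bytes>\d+))?$"
)


def summarize(lines):
    """Return one row per handler call with the three numbers of interest."""
    rows = []
    prev_enter = None  # entry time of the previous call
    enter = None       # entry time of the call in progress
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore unrelated log lines
        ts = float(m["ts"])
        if m["event"] == "enter":
            enter = ts
        elif enter is not None:
            nbytes = int(m["bytes"] or 0)
            interval = None if prev_enter is None else enter - prev_enter
            rows.append({
                "enter_ms": enter,
                "handler_ms": ts - enter,  # time spent inside the handler
                "interval_ms": interval,   # time since the previous call
                "bytes_per_ms": nbytes / interval if interval else None,
            })
            prev_enter, enter = enter, None
    return rows


SAMPLE = [
    "0 audio_handler enter",
    "2 audio_handler exit bytes=675",
    "15 audio_handler enter",
    "17 audio_handler exit bytes=675",
    "70 audio_handler enter",
    "72 audio_handler exit bytes=675",
]
rows = summarize(SAMPLE)
```

Each row can then be plotted against `enter_ms`. Separating time inside the handler from time between calls is the design choice that matters, because the two can diverge.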

Figure 2: Visualizing audio throughput and thread processor time

The orange line is the rate at which data is moved from the streaming buffer to the Android audio system, in bytes per millisecond. In this chart, you can see three different behaviors:

The two tall, spiky sections, where the data rate reaches 500 bytes per millisecond. This is the buffering phase before playback starts, when the handler is copying data as fast as it can.

The middle area is the normal playback phase. Audio data is transmitted at about 45 bytes per millisecond.

The stuttering region is on the right, where audio data is moving at closer to 10 bytes per millisecond. That is not fast enough to maintain normal playback.

The inescapable conclusion was that the orange line confirmed what the chip supplier’s engineers had reported: Ninja wasn’t transferring audio data fast enough.

To understand why, let’s take a look at what the yellow and gray lines say.

The yellow line shows the time spent in the handler itself, calculated from timestamps recorded at the top and bottom of the handler. In both the normal playback and stuttering regions, the time spent in the handler was about the same: roughly 2 milliseconds. The spikes show moments when the handler ran slower because of time spent on other tasks on the device.

The real reason

The gray line, the time between calls to the handler, tells a different story. During normal playback, the handler is called about every 15 milliseconds. In the stuttering region on the right, the handler is called approximately every 55 milliseconds. With an extra 40 milliseconds between calls, there’s no way it can keep up with playback. But why?
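Before chasing the cause, a back-of-the-envelope check suggests the longer interval alone is enough to explain the drop in throughput. This assumes, as a simplification not stated in the article, that the handler moves roughly the same number of bytes per call in both regions.

```python
NORMAL_RATE_BYTES_PER_MS = 45  # normal playback rate read from the chart
NORMAL_INTERVAL_MS = 15        # time between calls during normal playback
STALLED_INTERVAL_MS = 55       # time between calls while stuttering

# Assumption: the per-call payload is the same in both regions.
bytes_per_call = NORMAL_RATE_BYTES_PER_MS * NORMAL_INTERVAL_MS
stalled_rate = bytes_per_call / STALLED_INTERVAL_MS

print(f"bytes per call: {bytes_per_call}")
print(f"rate at 55 ms:  {stalled_rate:.1f} bytes/ms")
```

Under that assumption the rate falls to roughly 12 bytes per millisecond, in the same range as the figure of close to 10 seen in the stuttering region and well short of the 45 needed.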

I shared my findings with the integrator and chip vendor (look, it’s the Android thread scheduler!), but they were not impressed. Why not just copy more data each time the handler is called? It was a fair question, but changing this behavior involved deeper changes than I was prepared to make, so I continued to look for the root cause. I dug into the Android source code and learned that an Android thread is a user-space construct, and that the thread scheduler uses the epoll() system call for timing. I knew that epoll() timing isn’t guaranteed, so I suspected that something was affecting epoll() in a systematic way.

That’s when another engineer from the chip vendor came to my rescue, having discovered a bug that was already fixed in the next version of Android, Marshmallow. The Android thread scheduler changes the behavior of threads depending on whether the application is running in the foreground or in the background. Threads in the background are assigned an additional 40 milliseconds (40,000,000 ns) of wait time.

A bug deep in Android itself meant that this extra timer value was retained when the thread moved to the foreground. Usually the audio handler thread was created while the application was in the foreground, but sometimes it was created while Ninja was still in the background. When that happened, playback stuttered.
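The effect can be illustrated with a toy model. This is not Android’s scheduler code; the class `ToyThread`, its methods and the `buggy` flag are hypothetical names used only to show how a stale 40-millisecond value turns a requested 15-millisecond delay into 55 milliseconds.

```python
BACKGROUND_EXTRA_WAIT_MS = 40  # extra wait given to background threads
REQUESTED_DELAY_MS = 15        # delay the audio handler asks for


class ToyThread:
    """Toy model of a thread whose extra wait is set at creation time."""

    def __init__(self, app_in_foreground):
        self.extra_wait_ms = 0 if app_in_foreground else BACKGROUND_EXTRA_WAIT_MS

    def move_to_foreground(self, buggy=True):
        # Modeled bug: the stale background value is kept.
        # Modeled fix: the extra wait is cleared on the transition.
        if not buggy:
            self.extra_wait_ms = 0

    def effective_interval_ms(self):
        return REQUESTED_DELAY_MS + self.extra_wait_ms


# Thread created while the app is still in the background, then foregrounded.
stale = ToyThread(app_in_foreground=False)
stale.move_to_foreground(buggy=True)

fixed = ToyThread(app_in_foreground=False)
fixed.move_to_foreground(buggy=False)

usual = ToyThread(app_in_foreground=True)

print(stale.effective_interval_ms(), fixed.effective_interval_ms(),
      usual.effective_interval_ms())
```

The model also shows why the problem was intermittent: only threads created before the app reached the foreground carry the extra wait.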

Lessons learned

This wasn’t the last bug we fixed on the platform, but it was one of the hardest to track down. It was outside the Netflix app, in a part of the system outside the playback pipeline, and yet all of the initial data pointed to a bug in the Netflix app itself.

This story illustrates one of the things I love about my job: I can’t predict all the problems our partners will throw at me, and to solve them I have to understand multiple systems, work with great colleagues, and constantly push myself to learn more. What I do has a direct impact on real people and their experience of the product. I know that when people enjoy Netflix in their living rooms, I’m part of the team that makes it happen.