
Author’s brief introduction

Zhen-dong Li is an associate director of operations at Tencent OMG. He has been responsible in turn for operating the Tencent portal, Tencent News, Tencent Video, and other businesses, and has studied smooth, massive-scale, instant-start, and low-latency live streaming in depth. He currently focuses on building live broadcast automation and monitoring systems and on optimizing live broadcast quality. His work on the Good Voice show, LiveMusic concerts, and NBA live broadcasts has earned a good reputation.

Preface

We improve our technology step by step by taking on important, challenging work. This article describes the massive-scale live broadcasting work I have done over the past year, using the live broadcast of 1,300 NBA games as the most representative example.

1. The boom in the live broadcasting business

From 2015 to 2016 the whole live broadcasting industry took off, and a large number of live broadcasting businesses emerged. Why has live broadcasting leapt forward in the past year or two? There are basically four reasons.

  • The first is technology: smart hardware, mobile phones, and better network bandwidth have made it more convenient for people to watch live broadcasts.

  • The second is content: content IP has emerged, including sports and concerts. Tencent also runs concerts, with hundreds of live concerts every year, and there are individual live anchors and voice chat rooms as well. All of this has greatly enriched the content system.

  • The third is commercial drive: there are now many ways to monetize, including membership systems, viewer rewards, and models based on content distribution. Profit injects vitality into the industry.

  • The fourth is public demand. Online live broadcasting enriches the forms of entertainment, and people, Chinese people especially, love gossip. If I tell you about something that happened a few days ago, or quietly tell you something that others already know, you will not feel satisfied. What satisfies is being on the scene and experiencing what a person is doing or how something is unfolding.

When we can get first-hand information and satisfy that appetite for gossip, we are very willing to watch. In fact, not only girls love gossip; boys love it too, just about different topics.

Taken together, these factors, with the support of the technology as a whole, have brought the live broadcasting industry to a new height, arguably its hottest ever. Everything can be broadcast live, and everything is being broadcast live now. Thanks to the live broadcasting teams; I know they work very hard.

2. Characteristics of an excellent live broadcast

What makes a good live stream? Everyone wants to do live broadcasting, but how do you do it well? I was under pressure at the time. For a big Internet company like Tencent, a live broadcast that works poorly and leaves users dissatisfied becomes a huge reputation problem.

I listed the technical points that matter. First, the picture must be clear. When you watch a Victoria's Secret show, for example, the picture has to be captivating. If the key moments are not clear, the bullet-screen comments (danmu) will be merciless, complaining that the 1080P picture looks worse than 720P.

Second, playback must be smooth. You are watching a ball game, a player is about to shoot, the picture freezes, and when it resumes the ball did not go in. I do not know what mood the viewers are in at that point, but they will call it a terrible live broadcast.

Third, staying in sync with the scene is also very important, for example in interactive segments. Suppose I say, "Online friends, please send in any questions you have," and the questions that come back are all about something I said earlier. Then someone messages me on WeChat to say that the signal is lagging and has not reached the audience yet. If I have to wait two minutes, the people in the room will probably go crazy.

Fourth, sound and picture must be synchronized. During one concert, the danmu suddenly filled with abuse accusing the singer of lip-synching. The singer explained that he was not lip-synching and was really singing this time. Viewers answered with screenshots and recorded video: the lyrics were clearly far ahead and the mouth shapes did not match. This is a typical case of sound and picture being out of sync.

Of course there are more situations, which I will describe in detail later. Live broadcasting faces so many technical requirements because it is a business with very demanding experience requirements. How can we do it well, reduce criticism, and reduce the probability of failure?

3. Technology selection in the live broadcast industry

Next I will walk through the typical scenarios in the NBA broadcasts. Faced with so many technical requirements and so much user pressure, how do we choose the technology? Live streaming is complicated because it involves so many steps.

3.1 Process of live video broadcasting

Live video passes through capture, transport, packaging, encoding, stream pushing, transcoding, distribution, decoding, playback, and many other links; we have 18 links in total. For example, the signals from camera positions 1, 2, and 3 go through simple signal processing, travel by satellite or by network to the program production center, and are packaged there. When packaging is complete, the program is distributed to users.

We also face a variety of user needs, such as different definitions. Users' networks are not all the same and their resolutions may differ, so we need multiple definitions. We also need to serve multiple terminals and formats, such as FLV and HLS. The content then goes through the content distribution network and finally reaches the user's terminal, which needs its own adaptation work. The whole process is very complicated.

3.2 Challenges in the broadcast experience

Here is a brief introduction to the technical problems we faced in NBA live broadcasting and how we solved them. I have listed four main points.

  • Transmission. NBA games are played in the United States, and the signal travels about 18,000 kilometers from a camera in the United States to a user in China. The production room is in New Jersey on the East Coast of the United States, and we are on the western shore of the Pacific. The signal crosses the entire North American continent and the Pacific Ocean, lands in Hong Kong at the southern tip of China, goes on to the studio in Beijing, and from there to users' homes. Transmission over roughly 18,000 kilometers, much of it across the ocean, comes with technical barriers.

  • Production. The signal arrives from the United States in English, and passing it through unchanged would be simplest. Some people do like the original commentary, but most Chinese viewers want Chinese, including player and event information. This has to be added during packaging to improve the audio-visual experience, so that Chinese viewers find the broadcast more comfortable, more acceptable, and easier to understand.

  • In addition, we need multiple angles and multiple definitions. Multiple angles serve different viewing preferences. For the NBA we provide three angles, such as under the basket, the left angle, and the right angle, so that people can watch the game from different viewpoints.

    Then there is clarity and playback. Playback is a key point: if it is not smooth, is unclear, or even shows mosaic artifacts, users will not accept it. Getting pictures to users quickly and without delay during playback is a technical problem we have to solve.

  • Monitoring. The last link is monitoring. With so many links, such long distances, so many active users, and so many terminals, the probability of failure is high. How do we keep the risk to a minimum? Risk can never be eliminated and failures will always occur, so the question becomes how to start the contingency plan quickly once a failure happens. That is why monitoring is so important.

Everyone has monitoring, and Tencent does too, but monitoring has now reached a new level. Besides the necessary monitoring of every link, there is the impact of big data. We receive almost 200 billion pieces of monitoring information every day. Collecting and analyzing that data so that we can locate a problem within a very short time is a huge challenge, and it involves big data processing.

Now I will start talking about how we solved these problems. It was the summer of 2015, and I was very happy when I received the task. The NBA was a big investment, with Tencent investing 100 million US dollars every year. The company entrusted us with such an important task, which was a chance to show our value. If you do a good job, you get recognition and a raise. If you do not, you may have to go to the finance department to collect your final paycheck. Risk and opportunity always coexist.

4. Facing the challenge of transmission

4.1 Challenge 1: Transmission is prone to corrupted pictures and stream interruption

On the first day of live broadcasting, however, I was dumbstruck: the picture was corrupted, and I felt deep, helpless pressure. My boss asked how I was going to solve the problem. We had never done this kind of live broadcast before, so we began to analyze it. The transmission used UDP in order to meet the low-latency, real-time requirements.

UDP, however, easily leads to stalls (caton). Video images are heavily compressed, so a single lost packet can affect not one pixel but a whole block. Over such a long distance, packet loss is inevitable, and once a packet is lost the picture stalls or breaks up. At the time, people said they would tackle the problem from the business operations side and with various technologies. There is a solution, which I will describe below.

4.2 Challenge 2: Long-distance transmission lines are unstable

We chose network transmission: from the production room in New Jersey, across the North American continent, across the Pacific, through the Hong Kong node, and on to Beijing. We measured the distance at 17,286.59 kilometers. Over such a distance, and with natural factors at sea such as a possible tsunami, it is very easy for the entire line to become unstable.

Some of you may ask, why not use satellite transmission? Satellite transmission is very simple: it takes just two hops, from the American satellite to the European satellite, and from the European satellite to the Chinese satellite.

Satellite transmission really is simple, but it is very expensive, roughly 50 times the price of network transmission. Network transmission is already expensive, so sending all 1,300 games by satellite is not realistic.

4.3 Transmission optimization solution

  • Fault-tolerant technology. Hard problems are where you can show your value. Packet loss can be worked around. We used error correction: the packets to be transmitted are arranged into a matrix, and a check packet is added to each row and each column. In a 10 × 10 matrix, for example, if one packet is lost in a row or a column, it can be reconstructed from the check packet.

    If you send 100 packets and add 20 check packets, then no matter which single packet is lost in a row or column, the picture will not break up. This improved the reliability of the original UDP transmission, and the error rate of our network line at that time came down to about 1/1000.

    The network itself is hard to improve, and there was little room for further optimization on that side. With error correction we could simply relax the packet loss requirement to 10%, but consecutive losses, two or three packets in a row, are still not acceptable. Overall, this brought the error rate to about five parts per thousand.

    Previously there was picture corruption in a game every day; now a very brief, barely noticeable glitch appears in roughly one game in ten. If you have problems with long-distance transmission, this error correction technology is worth a look. The smaller the matrix, down to even a 2 × 2 matrix, the stronger the error correction ability.

    It does bring a cost: adding a large number of check packets increases the bit rate. With about 20% more traffic, a stream that was 1 megabyte becomes about 1.2 megabytes. It is a typical trade of space for time.

    We also studied the approach used in US military wireless transmission, which I will not go into here. With that, the problem was basically solved.

  • Multi-link backup. I just mentioned a remaining probability of 1/1000. How do we deal with that? With network backup. We deployed three lines across the North American continent and the Pacific. Each line is marked red, green, yellow, or black; the colors simply distinguish the signals being carried. Not every signal is backed up the same way: some have multiple backups, while for others only the main signal is transmitted.

    Such a redundant network reduces the probability of packet loss under normal conditions, and also copes with severe weather and low-probability events such as construction work damaging a dedicated line.

    For important games we still keep satellite transmission as a backup plan. So in signal transmission we mainly rely on error correction and multi-link backup to reduce the impact of picture corruption or interruption.
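The row and column check packets described above can be sketched as follows. This is a minimal illustration, not the production implementation: the talk does not say how the check packets are computed, so XOR parity is an assumption here, and the sketch also assumes that the 20 check packets themselves arrive intact.

```python
from functools import reduce

ROWS, COLS = 10, 10  # 10 x 10 matrix of data packets, as in the talk


def xor_bytes(a, b):
    """XOR two equal-length packets byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(packets):
    """XOR of all packets in one row or one column (assumed check code)."""
    return reduce(xor_bytes, packets)


def encode(matrix):
    """Return one check packet per row and one per column (20 in total)."""
    row_par = [parity(row) for row in matrix]
    col_par = [parity([matrix[r][c] for r in range(ROWS)]) for c in range(COLS)]
    return row_par, col_par


def recover(matrix, row_par, col_par):
    """Rebuild lost packets (marked None) in place; return how many stay lost."""
    progress = True
    while progress:
        progress = False
        for r in range(ROWS):
            missing = [c for c in range(COLS) if matrix[r][c] is None]
            if len(missing) == 1:  # exactly one hole in this row: fixable
                others = [p for p in matrix[r] if p is not None]
                matrix[r][missing[0]] = parity(others + [row_par[r]])
                progress = True
        for c in range(COLS):
            missing = [r for r in range(ROWS) if matrix[r][c] is None]
            if len(missing) == 1:  # exactly one hole in this column: fixable
                others = [matrix[r][c] for r in range(ROWS) if r not in missing]
                matrix[missing[0]][c] = parity(others + [col_par[c]])
                progress = True
    return sum(p is None for row in matrix for p in row)


# Hypothetical payloads: 8-byte packets derived from their position.
data = [[bytes([r, c]) * 4 for c in range(COLS)] for r in range(ROWS)]
row_par, col_par = encode(data)

received = [row[:] for row in data]
received[3][4] = None                    # an isolated loss
received[7][1] = received[7][2] = None   # two neighbours lost in one row
remaining = recover(received, row_par, col_par)

overhead = (ROWS + COLS) / (ROWS * COLS)  # 20 check packets per 100 data packets
```

In this sketch the two neighbouring losses in row 7 cannot be repaired from the row check packet, but each sits alone in its column, so the column check packets repair them. Losses that form a rectangle (two rows sharing the same two columns) stay unrecoverable, which is one way to see why bursts of loss remain a problem and why a smaller matrix corrects more at a higher bandwidth cost.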

5. Facing the challenge of production technology

5.1 Visual Optimization – Subtitles

Once the signal has reached the production center intact, we package the program, for example by adding subtitles and using a subtitle machine to present player information in Chinese.

5.2 Visual Optimization – AR

We can also use AR technologies to add interactive elements or data analysis to the live broadcast.

5.3 Visual Optimization – Multiple angles

Even more important is multi-angle viewing, which makes the broadcast more appealing to watch, for example by adding the original English commentary and extra angles such as the low angle and the right-side rebound angle. At this point the program has been transmitted, produced, and packaged.

6. Facing the challenge of broadcasting

6.1 Problem 1: Playback fluency

Now to the key point: the program is ready and has to be delivered to users. Delivery has one specific requirement, namely fluency.

  • The first is the two-second rule. When a user opens a video, if it takes more than two seconds to start, the user becomes increasingly likely to leave: each additional second of startup time may raise the abandonment rate by 6%. We must get our picture in front of the user within two seconds. The user is God, and we cannot test their patience; past two seconds, they are gone.

  • The second is the impact of stalling, which also comes from data analysis. For every second of stalling, the user's viewing time drops by 1%, and the user is more likely to leave.

How do we solve the problem of fluency during playback?

6.2 Solution-CDN technology

The first is the most universal technology, CDN. We have deployed 500 CDN nodes across the country, including Xinjiang, Hong Kong, and very remote regions such as Yunnan and Guizhou.

CDN is a mature technology that pushes content to the place nearest the user. With 500 nodes, user access is faster. We also schedule users directly by IP without DNS resolution, which saves time during access. In addition, we collect real-time statistics on the overall situation.
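As a rough illustration of direct IP scheduling, the sketch below has a scheduler return a node IP that the player connects to directly, so no DNS lookup is needed. The talk does not describe the actual scheduler; the `Node` fields, region names, the 80% utilization limit, and the addresses (taken from documentation IP ranges) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    ip: str
    region: str
    capacity_gbps: float
    load_gbps: float  # would come from the real-time statistics

    @property
    def utilization(self):
        return self.load_gbps / self.capacity_gbps


def pick_node(nodes, client_region, max_utilization=0.8):
    """Return the node IP to hand to the player.

    Prefer nodes in the client's region that still have headroom; fall back
    to the least-utilized node elsewhere when the local ones are full.
    """
    healthy = [n for n in nodes if n.utilization < max_utilization]
    local = [n for n in healthy if n.region == client_region]
    candidates = local or healthy or nodes
    return min(candidates, key=lambda n: n.utilization).ip


nodes = [
    Node("192.0.2.10", "xinjiang", capacity_gbps=40, load_gbps=10),
    Node("198.51.100.20", "beijing", capacity_gbps=100, load_gbps=50),
    Node("203.0.113.30", "hongkong", capacity_gbps=80, load_gbps=76),
]

nearby = pick_node(nodes, "xinjiang")    # local node has headroom
fallback = pick_node(nodes, "hongkong")  # local node is above the limit
```

The design point is that the choice uses live load data rather than a cached DNS answer, so a scheduling decision can change within seconds.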

With excellent CDN technology and coverage, can we really meet the two-second requirement? Not yet, because one of the most important characteristics of live streaming is what happens at the moment it starts.

A live broadcast does not run 24 hours a day. When there is no signal, users have no reason to watch. But once the broadcast starts, for example at the kickoff of a football game, users pour in all at once. This is the entry effect.

6.3 Problem 2: Guaranteeing the playback experience for massive numbers of users

Tencent has channels such as WeChat and QQ. At the start of an NBA game our user count can reach its peak within one minute, with about 2 million users arriving per minute.

When more people arrive, things get crowded. That is not technical incompetence; there are simply too many users. Think of the scramble to grab train tickets on 12306. When everyone else was cursing 12306, I resolved not to, because the volume really is enormous. As for exactly how many people use it each day, 12306 publishes the figures.

With a massive user base, when everyone wants to get in at the same moment, it is really hard to hold up. What can we do? Life goes on, and we try to keep our jobs.

6.4 Solution – Scheduling Policies

When users pour in rapidly and in huge numbers, what impact does such a surge have? I have summarized two aspects.

The first is that a rapid influx causes congestion in a local part of the system. The other is that there are simply too many users for the whole system to support. What do we do then? For local congestion the answer is a pre-scheduling strategy: the faster users arrive, the faster my response mechanism has to be.

For the second, the answer is flexible degradation, a very important idea in massive-scale technology. It means serving users in a lossy way.

For example, what if 200 people show up for 100 seats? If things are disorderly and nothing is done, a fight could break out on the spot and cause even more chaos.

What do you do at that point, when your platform can no longer fully support so many users and your estimates were wrong? You need a flexible degradation strategy, which I will describe in more detail below.

As the saying goes, in martial arts only speed is unbeatable. A rapid influx inevitably puts heavy pressure on the local system. How do we dissipate that pressure quickly? There are two important methods.

  • Collecting traffic over SNMP to minimize data delay. The first method is to collect traffic directly from the switches using the Simple Network Management Protocol (SNMP) and then aggregate it. By the time ordinary statistics show that users have come in, the data may already be three or four seconds late, while thirty thousand people can enter every 30 seconds. Live broadcasting is a high-bandwidth service: tens of thousands of users can mean dozens of G of bandwidth, expanding to hundreds of G. So we do not collect statistics from network cards; we collect switch traffic, which keeps the delay in traffic data to a minimum.

  • The other method is prediction. Prediction means looking back after a fall to see how I fell, and analyzing the posture of the fall.

    Although users arrive quickly, they follow certain patterns. Each game gives us a curve of how users enter. We look at the curve: if so many thousand users enter within a minute, what is the probability that the machine room will be overwhelmed?

    Whatever the conditions, once the probability of overwhelming a machine room exceeds 60%, we divert traffic away from it in advance and send no more users there. Its actual load may not have reached 60% and may be only 30%, but the traffic growth curve shows a high probability of overload.
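To make the switch-level traffic collection in the first method concrete, here is a minimal sketch of turning two SNMP counter samples into a bandwidth figure. It assumes the switch ports are polled for a 64-bit octet counter such as `ifHCInOctets` or `ifHCOutOctets` from IF-MIB; the polling itself is not shown, and the port names, counter values, and 5-second interval are hypothetical.

```python
COUNTER64_MAX = 2 ** 64  # 64-bit SNMP counters wrap around at this value


def rate_bps(prev_octets, curr_octets, interval_s, counter_max=COUNTER64_MAX):
    """Convert two octet-counter samples into bits per second."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:  # the counter wrapped between the two samples
        delta += counter_max
    return delta * 8 / interval_s


def room_traffic_gbps(prev_samples, curr_samples, interval_s):
    """Sum the rates of all switch ports that serve one machine room."""
    total = sum(
        rate_bps(prev_samples[port], curr_samples[port], interval_s)
        for port in curr_samples
    )
    return total / 1e9


# Two polls taken 5 seconds apart (hypothetical values).
prev = {"port1": 1_000_000_000_000, "port2": 2 ** 64 - 1_250_000_000}
curr = {"port1": 1_006_250_000_000, "port2": 5_000_000_000}

traffic = room_traffic_gbps(prev, curr, interval_s=5)
```

Reading a handful of switch ports gives the traffic of a whole machine room in one poll, which is why it is faster than gathering statistics from every server's network card.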

We stumbled before because our data was delayed by only one minute, yet within that minute users had poured in and the machine room was already completely overwhelmed. Now we predict: as soon as the curve of the previous minute suggests that a machine room may be overwhelmed, we stop directing traffic to it. Local congestion can be solved by scheduling in advance, by being fast or even by forecasting.
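The prediction step can be sketched as below. The talk gives only the 60% threshold, not the model, so this sketch assumes a very simple estimator: look up the most similar entry rates from past games and take the fraction that ended in an overload. The history values and the choice of five neighbours are hypothetical.

```python
def overload_probability(history, entrants_per_min, k=5):
    """Fraction of the k most similar past ramps that overloaded the room.

    history is a list of (entrants_per_minute, overloaded) pairs from
    earlier games.
    """
    if not history:
        raise ValueError("need at least one past observation")
    nearest = sorted(history, key=lambda h: abs(h[0] - entrants_per_min))[:k]
    return sum(1 for _, overloaded in nearest if overloaded) / len(nearest)


def should_divert(history, entrants_per_min, threshold=0.6):
    """Divert new users away once the overload probability exceeds 60%."""
    return overload_probability(history, entrants_per_min) > threshold


# Hypothetical observations for one machine room.
history = [
    (5000, False), (8000, False), (12000, False), (15000, False),
    (20000, True), (22000, False), (25000, True), (28000, True),
    (30000, True), (35000, True),
]

calm = overload_probability(history, 9000)
steep = overload_probability(history, 27000)
```

The decision depends on the shape of the ramp, not on the current load, so a room at 30% utilization can already be closed to new users when its curve looks like past overloads.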

The flexible strategy is the other part of the scheduling policy, and it addresses the risk of global congestion. Of course, we have a rich system for predicting online users. Before each game there is a professional data analysis based on the number of fans of each team, the channels the game will be pushed to and draw traffic from, and less controllable factors. It might conclude that five or six million people will watch this game. Prediction is a very important part, but it is not a completely safe one. It is impossible to predict with complete accuracy. Take the September 3 military parade: everyone predicted how many people would watch, and in the end we were all surprised, because everyone was watching. So prediction is not absolutely reliable; it can only serve as a theoretical basis.

Method 1: Queueing. What if I plan for one table of guests and two tables show up? To avoid chaos on the scene there must be a flexible mechanism, and we have many methods.

The first method is queueing. Suppose we predicted five million users but 5.02 million arrive. We do not let the extra users squeeze straight in, because that leads to competition for resources. Live broadcasting consumes a lot of bandwidth, and once users compete for it, they cannot download data fast enough and playback stalls. So at that point we do not let the newcomer in; we let them wait a moment.

Method 2: Flexible degradation. Someone will ask, what if the user cannot wait? There is no helping it: if they get in, they affect the other five million, and that could lead to a fight. We can, however, offer richer options. If there are far too many users, for a concert say, and we cannot offer everyone a video stream, we offer an audio stream. For the Faye Wong concert we provided a dedicated audio stream, so that if there were too many users and bandwidth ran short, users could choose audio instead.

This is the key strategy of flexible degradation: do not let users beyond the expected number compete in a disorderly way for resources that are already serving existing users well. If that kind of competition arises, the whole service will collapse. So make sure you have a plan: an admission mechanism, or a rich set of degradation options. That way the experience of existing users is not affected, and you have a good answer for the people who want to come in.

So the scheduling strategy has two parts. If a rapid influx causes a local problem, the answer is speed: obtain machine room traffic faster so that the system can react faster. If we simply cannot carry the users who arrive, moving them from room A to room B does not solve anything, and flexible methods are needed, such as queueing or degrading to audio or a lower definition, to satisfy some users without affecting users globally.
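Queueing and flexible degradation can be combined in a small admission controller, sketched below under stated assumptions. The class, the bitrates (1000 kbps for video, 100 kbps for audio), and the capacity are hypothetical; the talk describes the policy, not an implementation.

```python
from collections import deque
from enum import Enum


class Decision(Enum):
    VIDEO = "video"
    AUDIO = "audio"    # degraded service
    QUEUED = "queued"  # asked to wait


class AdmissionController:
    def __init__(self, capacity_kbps, video_kbps=1000, audio_kbps=100):
        self.capacity = capacity_kbps
        self.video = video_kbps
        self.audio = audio_kbps
        self.used = 0
        self.waiting = deque()

    def admit(self, user_id, accepts_audio=False):
        """Admit with video if it fits, else audio if accepted, else queue."""
        if self.used + self.video <= self.capacity:
            self.used += self.video
            return Decision.VIDEO
        if accepts_audio and self.used + self.audio <= self.capacity:
            self.used += self.audio
            return Decision.AUDIO
        self.waiting.append(user_id)
        return Decision.QUEUED

    def leave(self, kind):
        """Free a viewer's bandwidth; admit the longest-waiting user if it fits."""
        self.used -= self.video if kind is Decision.VIDEO else self.audio
        if self.waiting and self.used + self.video <= self.capacity:
            self.used += self.video
            return self.waiting.popleft()
        return None


ctl = AdmissionController(capacity_kbps=2100)
decisions = [
    ctl.admit("u1"),
    ctl.admit("u2"),
    ctl.admit("u3", accepts_audio=True),  # no room for video, audio fits
    ctl.admit("u4"),                      # must wait
]
promoted = ctl.leave(Decision.VIDEO)      # a video viewer leaves
```

Existing viewers never lose bandwidth to newcomers: a new user either fits, accepts a cheaper stream, or waits until someone leaves.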

Once the two-second rule and stalling are handled with the techniques above for the various user scenarios, the fluency requirement is well met. But users want more. Two seconds is the limit of their patience, and they would like to see the picture even sooner. One important technique here is instant start ("second open"): getting the first picture in front of the user as quickly as possible. Nothing is absolute; we can only do our best.

6.5 Solution – Improve the speed at which users can see pictures

An I frame is compressed to remove spatial redundancy within the picture. It is intra-coded only, carries no temporal dependency, and can be decoded completely on its own. A P frame depends on the preceding I frame (or an earlier P frame): to decode the picture at that moment the decoder needs the earlier reference frame, because without it the background information and motion information are incomplete. A B frame is a bidirectional frame and cannot be decoded on its own either; it also depends on a following P frame, so it needs both reference frames before the compressed data can be decompressed. That is the basic logic of picture compression.

Previously, delivery started at an arbitrary point in the stream: the first frame you received might be an I frame, a B frame, or a P frame. If playback starts at a B frame, it cannot be decoded; the player has to receive the I frame first, then the P frame, before it can decode. This means downloading more data and waiting longer to see the picture, which is unbearable for people who pursue technical excellence.

We used a technique to let the user see the picture faster. We start from the I frame and modified the player accordingly, so that as soon as the I frame arrives the picture appears. This cut the user's wait by nearly 200 milliseconds, letting our God see the picture another 200 milliseconds sooner.
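The effect of starting at an I frame can be illustrated with a toy stream. The talk does not detail the server or player changes, so this sketch assumes one common approach: the server keeps the recent frames and starts a new viewer at the most recent I frame. The 12-frame group of pictures, the 25 fps frame rate, and the simplified frame ordering are assumptions.

```python
FRAME_MS = 40                 # 25 frames per second (assumed)
GOP = list("IBBPBBPBBPBB")    # one assumed group of pictures
stream = GOP * 3              # I frames at indexes 0, 12, and 24


def wait_without_cache(stream, join_index):
    """Milliseconds until the first I frame if delivery starts at join_index."""
    for i in range(join_index, len(stream)):
        if stream[i] == "I":
            return (i - join_index) * FRAME_MS
    return None  # no I frame yet; the viewer keeps waiting


def frames_with_cache(stream, join_index):
    """Frames sent at once when the server starts at the latest I frame."""
    for i in range(join_index, -1, -1):
        if stream[i] == "I":
            return stream[i:join_index + 1]
    return []


# A viewer joins while frame 17 (a B frame) is the newest frame.
naive_wait_ms = wait_without_cache(stream, 17)
burst = frames_with_cache(stream, 17)
```

With the cached start, the first frame the player receives is always decodable, so the picture can appear immediately; without it, the player discards frames until the next I frame arrives. The exact saving depends on where in the group of pictures the viewer joins.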

The advantage is even more obvious in live broadcasts of large-scale sports events, and especially for individual anchors. These techniques remind me of an interesting question. A colleague asked me, "Is it hard?" I said it was not particularly difficult; the concept is very clear, and the changes would take one or two weeks.

He asked, "If the technology is not difficult, why can other live broadcasters in the industry not do it?" My answer was that in massive-scale technology there are two points. The first is that solving a single point is not hard; the difficulty lies in applying a whole technical system to the business and solving the various problems met along the way.

We solved the CDN problem, the direct scheduling problem on the CDN, the problems of fluency and traffic surges, and the problem of starting the picture quickly. There are many points to solve. As they are worked through one after another, they slowly form a set of methods; one or two points alone will not do it.

So massive-scale technology is not easy, but do not give up along the way. Take each technical point to the extreme, an extreme that fits your own business.

7. Facing the challenge of massive monitoring problems

7.1 Monitoring Purposes

Finally, monitoring. The purpose of monitoring the entire process is to discover quality problems. At the bottom is hardware monitoring: CPU, memory, network cards, IO, and the network. Because these are Internet services, network monitoring is a must, for example point-to-point ping latency, UDP probing, segment-by-segment link probing, and slow-speed monitoring. Then there is playback, which belongs to the business layer and has to cover playback volume, startup time, stall duration, stall rate, failure rate, and bitrate.

Beyond these general properties, live broadcasting needs monitoring specific to the business, such as monitoring the live streams themselves. We can detect a black screen, where the user's picture has gone black; mosaic artifacts, which may be caused by slow speed or packet loss; and silence, that is, whether the user can actually hear sound with the picture, as well as jarring noise and transcoding problems. Together this forms a three-dimensional model that brings all these points together with the various data reports mentioned earlier, including background logs.
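Black-screen and silence checks on a live stream can be as simple as the sketch below. This is only an illustration of the idea: the talk does not describe the detection method, and the luma threshold, the 98% ratio, and the -60 dB floor are assumed values. The inputs are plain lists standing in for one decoded video frame and one block of audio samples.

```python
import math


def is_black_frame(luma, threshold=16, ratio=0.98):
    """True if almost every pixel's luma (0 to 255) is at or below threshold."""
    dark = sum(1 for y in luma if y <= threshold)
    return dark / len(luma) >= ratio


def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20 * math.log10(rms)


def is_silent(samples, floor_db=-60.0):
    """True if the audio block is quieter than the assumed silence floor."""
    return rms_dbfs(samples) < floor_db


# Hypothetical inputs: an all-black frame, a half-bright frame,
# 10 ms of digital silence, and 10 ms of a 440 Hz tone at 48 kHz.
black = [0] * 1000
mixed = [0] * 500 + [200] * 500
silence = [0.0] * 480
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 48000) for n in range(480)]
```

In practice a single dark frame or quiet moment is normal content, so an alert would usually require the condition to persist for several seconds.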

7.2 Monitoring Challenges – Log Analysis Efficiency

The logs total 200 billion entries a day and may exceed 500 billion in the future. With such a volume, if results arrive half a day or a day later, it is far too late. What we need is minute-level results.

The traditional way no longer meets our needs. We now face hundreds of billions of records per day, each with perhaps a hundred dimensions, more than 100 TB of data per day, and we also need second-level responsiveness: a 10-second response time and a 30-second delay. So we need to introduce new analysis-oriented and search-oriented technologies to meet the data volume challenges we face in monitoring.

7.3 Solution – Big Data Processing

This is our big data processing pipeline, and it is in fact a classic one. Data is reported from the various terminals, including Apple and Android devices, TVs, pads, and PC web. It is received by the log collection system, goes through simple cleaning, and is passed through Kafka to a Spark cluster. Once the statistics are computed, they feed our data products.

The Geordi log system is developed on top of ES (Elasticsearch). This is the big data experience I wanted to share: we mainly use real-time computation to monitor the playback process and CDN speed. The architecture basically handles 200 billion records and more than 100 TB of data per day, with a great many dimensions; a single log entry carries about a hundred or more fields.
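The kind of per-minute statistic such a pipeline computes can be sketched in plain Python. In the architecture described above this logic would run as a job on the Spark cluster fed by Kafka; the sketch shows only the aggregation itself, and the record fields (`ts`, `cdn_node`, `platform`, `startup_ms`, `stalled`) and the 5% stall-rate limit are hypothetical. The 2000 ms startup limit follows the two-second rule.

```python
from collections import defaultdict


def aggregate(records):
    """Group playback reports by (minute, cdn_node, platform)."""
    buckets = defaultdict(lambda: {"plays": 0, "stalls": 0, "startup_ms": 0})
    for r in records:
        key = (r["ts"] // 60 * 60, r["cdn_node"], r["platform"])
        b = buckets[key]
        b["plays"] += 1
        b["stalls"] += 1 if r["stalled"] else 0
        b["startup_ms"] += r["startup_ms"]
    return {
        key: {
            "plays": b["plays"],
            "stall_rate": b["stalls"] / b["plays"],
            "avg_startup_ms": b["startup_ms"] / b["plays"],
        }
        for key, b in buckets.items()
    }


def alerts(stats, max_stall_rate=0.05, max_startup_ms=2000):
    """Keys of the buckets that break either limit."""
    return sorted(
        key for key, s in stats.items()
        if s["stall_rate"] > max_stall_rate or s["avg_startup_ms"] > max_startup_ms
    )


records = [
    {"ts": 120, "cdn_node": "bj-01", "platform": "ios", "startup_ms": 800, "stalled": False},
    {"ts": 130, "cdn_node": "bj-01", "platform": "ios", "startup_ms": 1200, "stalled": False},
    {"ts": 150, "cdn_node": "xj-02", "platform": "android", "startup_ms": 2500, "stalled": True},
    {"ts": 185, "cdn_node": "xj-02", "platform": "android", "startup_ms": 900, "stalled": False},
]

stats = aggregate(records)
problem_buckets = alerts(stats)
```

Grouping by minute and by a few dimensions at a time is what makes it possible to say which node and which platform are affected while the problem is still happening.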

Once you have monitoring data and it is available quickly, you really can be one step ahead of the problem and capture any kind of problem quickly. The technology here covers many areas. Although it sounds simple when described, it spans the fundamentals of massive-scale operations, streaming media, and big data.

It covers how to extract your data and analyze it in real time, CDN network transmission, how to guarantee network quality, how to accelerate quickly on the CDN, and how to replace DNS resolution with direct IP scheduling. This involves many methods, and it is quite a big topic that cannot be explained fully all at once.

8. Summary

Massive-scale operations technology is a very big system. I hope that when you run into this kind of situation, you will bravely stand up and face the challenge. As long as we pursue excellence wholeheartedly and keep trying, most of us can do better. That is my experience.
