Editor’s note: With the rise of live streaming, live danmaku (real-time bullet comments) have become increasingly popular. Architecturally, high stability, high availability, and low latency are the three essential qualities of a live danmaku system. On July 31, at the Architect Practice Day hosted by Qiniu Cloud, Liu Ding, an architect from Bilibili, presented Bilibili’s best practices in live danmaku service architecture from these three perspectives. Below is a summary of his talk.



Liu Ding, Bilibili architect, joined Bilibili in 2015, responsible for back-end infrastructure services such as live danmaku, the main site’s danmaku, and the push platform, while also serving part-time as a DBA. After graduating in 2011, he joined Cheetah Mobile, working on C++ client development and building a remote maintenance platform. At the beginning of 2013 he moved to back-end development (mainly in Go), where he was responsible for the back-end APIs of Cheetah’s video products and the bubble system, as well as the GoPush-Cluster and GOIM open source projects.

High-concurrency real-time danmaku is an interactive experience. For interaction, the most important considerations are high stability, high availability, and low latency.



  • High stability: to keep the interaction real-time, the connection state must remain stable.
  • High availability: essentially a fallback plan. If one machine goes down during an interaction, clients must be able to reconnect to another machine, so that users’ connections are effectively uninterrupted.
  • Low latency: danmaku delay is kept within 1 second, so responses feel fast enough for real interaction.

Bilibili’s live danmaku service architecture (hereinafter GOIM) emerged to meet exactly this set of requirements, as described in detail below.

The emergence of GOIM, Bilibili’s live danmaku service architecture



Figure 1

Live danmaku is essentially a push system: when someone sends a message, the system pushes it out to everyone else. In a live room, users send messages continuously and each one is broadcast; with 100,000 people in a room, a single message turns into 100,000 pushes. Before GOIM there was an earlier project, GoPush, built purely for push. GOIM then optimized GoPush for these specific application scenarios, which is how it came about. GOIM consists of the following modules (Figure 1):

  • Kafka (third-party service)

    Message queue. Kafka is a distributed publish/subscribe messaging system that scales horizontally. Every message published to a Kafka cluster is labeled with a topic (logically, a queue), and Kafka acts as the distributed dispatcher of messages.

  • Router 

    Session storage. After Comet passes information to Logic, Logic registers the session and stores it in the Router. The Router holds each user’s registration information, so the system knows which machine a given user is connected to.

  • Logic

    Logical processing of messages. After a user establishes a connection, messages are forwarded to Logic, which can authenticate the account. Operations such as IP filtering and blacklisting can also be handled in Logic.

  • Comet

    Maintains long-lived client connections. It can carry business-specific requirements, such as the content a user is allowed to transmit or the user information to be passed along. Comet provides and maintains the connections between servers and clients, keeping them available mainly by exchanging protocol messages over the connection (e.g., TCP sockets).

  • Client

    The client, which establishes a connection to Comet.

  • Job 

    Message distribution. Multiple Job instances can be deployed on different machines; they consume the collected messages and distribute them to all Comet nodes, which then relay them to clients.

The above is the overall flow by which GOIM establishes client connections and forwards messages. The architecture was not perfect at first, and there were problems at the code level. To address them, Bilibili carried out a series of optimizations. For high stability, these fall into memory optimization, module optimization, and network optimization, introduced below.
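The fan-out path above can be summarized in a minimal Go sketch. It is illustrative only: an in-memory channel stands in for Kafka, a struct holding client channels stands in for a Comet node, and all type and field names are assumptions rather than GOIM’s actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a simplified danmaku message; the real protocol also
// carries operation codes, sequence numbers, and a binary body.
type Message struct {
	RoomID int32
	Body   []byte
}

// Comet stands in for one Comet node: it holds client connections
// for some rooms and pushes a message to all clients in a room.
type Comet struct {
	mu    sync.RWMutex
	rooms map[int32][]chan []byte // roomID -> client channels
}

func (c *Comet) PushRoom(m *Message) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	for _, cli := range c.rooms[m.RoomID] {
		select {
		case cli <- m.Body: // non-blocking push; slow clients are skipped
		default:
		}
	}
}

// Job consumes messages from a queue (Kafka in the real system) and
// broadcasts each one to every Comet node in parallel.
type Job struct {
	comets []*Comet
}

func (j *Job) Run(queue <-chan *Message) {
	for m := range queue {
		var wg sync.WaitGroup
		for _, c := range j.comets {
			wg.Add(1)
			go func(c *Comet) { // one goroutine per Comet, per message
				defer wg.Done()
				c.PushRoom(m)
			}(c)
		}
		wg.Wait()
	}
}

func main() {
	// One client in room 1000 on one Comet, one Job, an in-memory "Kafka".
	cli := make(chan []byte, 1)
	comet := &Comet{rooms: map[int32][]chan []byte{1000: {cli}}}
	queue := make(chan *Message, 16)
	go (&Job{comets: []*Comet{comet}}).Run(queue)

	queue <- &Message{RoomID: 1000, Body: []byte("hello danmaku")}
	fmt.Printf("%s\n", <-cli)
}
```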

GOIM system optimization path

Memory optimization

Memory optimization is mainly divided into the following three aspects:

1. A message must have only one copy in memory

Messages are aggregated in Job and passed to Comet by pointer reference.

2. Keep a user’s memory on the stack as much as possible

Memory is allocated inside the corresponding user’s goroutine (Go coroutine).

3. Memory must be self-managed

The main optimization for the Comet module: examine every place in the module where memory is allocated and use memory pools.
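To illustrate the “self-managed memory” point, here is a minimal sketch of a fixed-size buffer pool in Go. GOIM’s real pool is more elaborate (and `sync.Pool` is another common option); the sizes and names below are illustrative assumptions.

```go
package main

import "fmt"

// BufferPool hands out fixed-size byte buffers from a pre-allocated
// free list, so steady-state pushes allocate nothing on the heap.
type BufferPool struct {
	free chan []byte
	size int
}

func NewBufferPool(n, size int) *BufferPool {
	p := &BufferPool{free: make(chan []byte, n), size: size}
	for i := 0; i < n; i++ {
		p.free <- make([]byte, size)
	}
	return p
}

// Get returns a pooled buffer, or a fresh one if the pool is empty.
func (p *BufferPool) Get() []byte {
	select {
	case b := <-p.free:
		return b
	default:
		return make([]byte, p.size)
	}
}

// Put returns a buffer to the pool; surplus buffers are simply
// dropped and left to the garbage collector.
func (p *BufferPool) Put(b []byte) {
	select {
	case p.free <- b[:p.size]:
	default:
	}
}

func main() {
	pool := NewBufferPool(1024, 4096) // 1024 buffers of 4 KB each
	buf := pool.Get()
	copy(buf, "one copy of the message body")
	pool.Put(buf)
	fmt.Println("buffers in pool:", len(pool.free))
}
```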

Module optimization

Module optimization is also divided into the following three aspects:

1. Message distribution must be parallel and non-interfering

Ensure that each Comet communication channel is independent, so that message distribution is fully parallel and the channels do not interfere with each other.

2. Concurrency must be controllable

Every place that needs asynchronous processing must use a fixed, pre-created number of goroutines (Go coroutines). Without this control, the number of goroutines can explode at any time.

3. Global locks must be broken up

Socket connection pool management and online-user data management each use multiple locks. The number of shards usually depends on the CPU, and the cost of CPU context switching has to be considered, so more shards are not always better.

The main issue behind all three aspects of module optimization is the single-point problem that arises in a distributed system: if one user’s connection attempt fails, the connections established by other users must not be affected. A sketch combining sharded locks with a fixed number of push goroutines is shown below.
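As a rough illustration of points 2 and 3, the sketch below shards online sessions into buckets, each with its own lock and exactly one pre-created push goroutine. It is a simplified assumption for this article, not GOIM’s implementation; the `Bucket` and `Server` names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// Bucket owns one shard of the online sessions plus a single
// pre-created push goroutine, so there is no global lock and no
// unbounded goroutine growth.
type Bucket struct {
	mu       sync.RWMutex
	sessions map[int64]chan []byte // userID -> client channel
	jobs     chan []byte           // work queue for this bucket's pusher
}

func NewBucket() *Bucket {
	b := &Bucket{
		sessions: make(map[int64]chan []byte),
		jobs:     make(chan []byte, 1024),
	}
	go b.pushLoop() // exactly one pusher goroutine per bucket
	return b
}

func (b *Bucket) pushLoop() {
	for body := range b.jobs {
		b.mu.RLock()
		for _, ch := range b.sessions {
			select {
			case ch <- body:
			default: // never block the pusher on a slow client
			}
		}
		b.mu.RUnlock()
	}
}

func (b *Bucket) Put(uid int64, ch chan []byte) {
	b.mu.Lock()
	b.sessions[uid] = ch
	b.mu.Unlock()
}

// Server spreads users over a fixed number of buckets, typically a
// small multiple of the number of CPU cores.
type Server struct{ buckets []*Bucket }

func NewServer(n int) *Server {
	s := &Server{buckets: make([]*Bucket, n)}
	for i := range s.buckets {
		s.buckets[i] = NewBucket()
	}
	return s
}

func (s *Server) bucket(uid int64) *Bucket {
	return s.buckets[uint64(uid)%uint64(len(s.buckets))]
}

func main() {
	srv := NewServer(32)
	ch := make(chan []byte, 1)
	srv.bucket(42).Put(42, ch)
	srv.bucket(42).jobs <- []byte("broadcast to this bucket")
	fmt.Printf("%s\n", <-ch)
}
```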

Testing is an indispensable part of this process, and the test data is also the best reference for comparison.



Figure 2

Figure 2 shows load-test data from the end of 2015. At that time two physical machines were used, each holding about 250,000 online connections on average, and the push rate of each live room was kept within 20–50 messages per second; in general, about 40 messages fill one screen, which is enough for live viewing. The simulated push rate was 50 messages per second (the peak), and pushes delivered reached 24.4 million per second (consistent with roughly 250,000 connections × 50 messages/s on each of the two machines). Under this load the CPU was just saturated, memory usage was around 4 GB, and traffic was around 3 Gbps. The conclusion from this data was that the real bottleneck was the CPU, so the goal became saturating the CPU without overloading it.



Figure 3

After 2015 the system was optimized again: as much memory as possible was moved from the heap (where it is outside our control) onto the stack, and a single physical machine could then hold one million online connections. The effect of this optimization shows in the March 2016 load-test data (Figure 3), which was also the load-test target for the initial live launch.

As the data in Figure 3 show, the optimization multiplied the previous capacity. The goal was still to saturate the CPU, but in a real live environment the essential constraint turns out to be traffic, including the volume of danmaku text and of gifts. If danmaku carry extra attributes (font, user level, etc.) and gifts are numerous, a great deal of traffic is generated. So the ultimate bottleneck for live danmaku is bandwidth.

Before 2016, Bilibili’s optimization focused on the system itself, reducing memory and CPU usage, but the gains were limited because a single machine’s bottleneck was always bandwidth. After March 2016, the focus shifted to network optimization. The following are some of Bilibili’s network optimization measures.

Network optimization

In the beginning, Bilibili’s work was mainly development: to make the architecture scalable, the effort went into making the code as good as possible. In real operation, however, most problems show up in operations and maintenance, so Bilibili now pays more attention to the operations side.



Figure 4

Figure 4 shows Bilibili’s early deployment structure. At first the entire service was deployed in a single IDC (a single point). Over time, this deployment structure gradually showed its defects:

  • A single-carrier IDC has insufficient bandwidth
  • It is a single point of failure
  • Access speed is low for some users

    Such a deployment often causes problems such as high latency and laggy connections.

To address these three problems, Bilibili improved its deployment structure. Figure 5 shows the improved network deployment, which is explained in detail below.



Figure 5



To solve the problem of insufficient single-point IDC bandwidth, Bilibili adopted a multi-point IDC access scheme: if one machine room does not have enough bandwidth, spread the load across several machine rooms and see how that performs.

With multi-point IDC access, dedicated lines are very expensive, a heavy burden for a startup, so the multi-IDC problems are instead solved through development and architecture work. Multi-IDC deployment requires optimization in several areas; some of Bilibili’s existing optimization schemes are listed below:

1. Route each user to the optimal access node

The svrlist module (Figure 6.1) selects the most stable node closest to the user, adjusted by IP segment, and the client then connects to it.



Figure 6.1

2. IDC service-quality monitoring: disconnection rate

To judge whether a node is stable, a large amount of user connection information has to be collected continuously. Monitoring the disconnection rate, tuning repeatedly, and aggregating the results yields a nationwide topology, from which the optimal route between each city and each machine room can be determined.

3. Automatically disconnect failed servers

4. 100% message arrival rate (still being implemented)

A low message-loss rate is critical for danmaku. Suppose a message represents a gift worth thousands of yuan: if some messages are lost, then when the user sends gifts, the number of gifts shown on screen is far lower than the number the user actually paid for. That is a serious problem.

5. Flow control

For danmaku, once the number of users reaches a certain level, flow control has to be considered, partly for cost control: machine-room bandwidth is purchased and priced per gigabit, and whenever usage exceeds what was purchased, the excess traffic has to be cut, which is what the flow control function does (a minimal sketch follows this list).
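As a rough sketch of what such flow control can look like, the Go snippet below implements a simple token bucket measured in bytes per second. The rates and the per-room granularity are assumptions for illustration, not Bilibili’s actual parameters.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ByteLimiter is a token bucket measured in bytes per second; it is a
// sketch of per-room (or per-machine) outbound flow control.
type ByteLimiter struct {
	mu     sync.Mutex
	tokens float64   // bytes currently available
	rate   float64   // refill rate in bytes/second
	burst  float64   // bucket capacity
	last   time.Time // last refill time
}

func NewByteLimiter(rate, burst float64) *ByteLimiter {
	return &ByteLimiter{tokens: burst, rate: rate, burst: burst, last: time.Now()}
}

// Allow reports whether n bytes may be sent now; if not, the caller
// drops or degrades the message (e.g., strips gift animations).
func (l *ByteLimiter) Allow(n int) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	l.tokens += now.Sub(l.last).Seconds() * l.rate
	if l.tokens > l.burst {
		l.tokens = l.burst
	}
	l.last = now
	if l.tokens < float64(n) {
		return false
	}
	l.tokens -= float64(n)
	return true
}

func main() {
	// Cap one room's fan-out at ~100 MB/s with a 1 MB burst (illustrative).
	lim := NewByteLimiter(100<<20, 1<<20)
	msg := make([]byte, 64<<10) // a 64 KB batch of danmaku
	fmt.Println("send allowed:", lim.Allow(len(msg)))
}
```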

With multi-point IDC access in place, China Telecom users keep using Telecom lines, while modules can also be deployed in other machine rooms so that, for example, China Mobile users connect to a China Mobile machine room. This ensures optimal network selection across different carriers and regions.

Optimal network selection, however, brings the problem of cross-machine-room transmission. For example, when data is collected, Comet modules feed data back to Logic, and those messages cross machine rooms. Some companies connect their machine rooms with dedicated lines, which is very expensive; to save cost, only public-network traffic can be used, but then the stability and availability of the public network have to be considered. When traffic leaves a China Telecom machine room, passes through Telecom’s switches, crosses over to China Unicom’s switches, and finally reaches a Unicom machine room, cross-carrier problems such as high packet loss appear. Cross-carrier transmission is therefore a very serious problem.

To mitigate this risk, a Telecom line (with somewhat smaller bandwidth) can be brought into the Unicom machine room to serve the Telecom-facing modules, so that each carrier’s traffic stays on its own carrier’s line. After this change, the packet loss rate dropped, the basic stability requirements were met, and the cost was not high. The scheme is still not perfect, though: even within a single carrier, a switch between two cities can fail. For that case, Bilibili deploys two Telecom lines between IDC-1 and IDC-2 simultaneously (Figure 5). With this backup in place, connectivity and stability improved significantly.

Given the problems above, the stability of each line needs to be tested in advance. To do that, Comet can be placed in every machine room, and the communication paths between Comet nodes collected into a link pool (the pool can hold multiple lines from multiple carriers). A network link can thus be configured with several lines, and a module probes all Comet-to-Comet communication to check the stability of each path and make sure the link works. Since there are several lines in the pool, a line that is currently up must be chosen for transmission, which is also how one knows which lines are up. When traffic flows, there are therefore several lines to choose from, and at least one of the three carriers’ lines is always available.
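A minimal sketch of such a link pool follows, assuming each inter-IDC line is reachable at its own address and that a plain TCP dial is an acceptable probe (a real deployment would use an application-level ping); all addresses and names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sync"
	"time"
)

// Line is one inter-IDC path, e.g. a Telecom or Unicom route to the
// peer machine room's Comet/Logic endpoint.
type Line struct {
	Name string
	Addr string // host:port of the peer over this carrier's route
	up   bool
}

// LinkPool holds several lines and keeps track of which are alive.
type LinkPool struct {
	mu    sync.RWMutex
	lines []*Line
}

// probe dials each line with a short timeout and marks it up or down.
func (p *LinkPool) probe() {
	for _, l := range p.lines {
		conn, err := net.DialTimeout("tcp", l.Addr, 2*time.Second)
		p.mu.Lock()
		l.up = err == nil
		p.mu.Unlock()
		if conn != nil {
			conn.Close()
		}
	}
}

// Start re-probes all lines at the given interval in the background.
func (p *LinkPool) Start(interval time.Duration) {
	go func() {
		for {
			p.probe()
			time.Sleep(interval)
		}
	}()
}

// Pick returns the first line that is currently up.
func (p *LinkPool) Pick() (*Line, error) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	for _, l := range p.lines {
		if l.up {
			return l, nil
		}
	}
	return nil, errors.New("no inter-IDC line available")
}

func main() {
	pool := &LinkPool{lines: []*Line{
		{Name: "telecom", Addr: "10.0.1.1:7172"}, // illustrative addresses
		{Name: "unicom", Addr: "10.0.2.1:7172"},
	}}
	pool.probe()                 // initial synchronous check
	pool.Start(10 * time.Second) // then keep probing in the background
	if l, err := pool.Pick(); err == nil {
		fmt.Println("using line:", l.Name)
	}
}
```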

Based on these problems, Bilibili’s architecture was optimized once more. (This structure has only just been finished; it is not live yet and still needs to go through testing.)



Figure 6.2

The first part is the Comet connection. Previously CDN and intelligent DNS were used, but in practice some carriers’ base stations cache routing tables, so even after a machine is migrated, some users are not migrated with it; DNS resolution is not fully reliable, and once a problem occurs it takes a long time to fix, so the user experience is poor. The second part is the List service, deployed in the central machine room. The client accesses it over a web interface to learn which servers it can connect to. By deploying IP lists of Comet nodes across multiple machine rooms, the service feeds back to the client the data gathered from those rooms (for example, which lines are currently up) and lets the client choose which machine to connect to.

As shown in Figure 6.2, IP segments are divided by city; the users in a city are mapped to a group (GroupID), each group has one or more Comet nodes, and all physical machines belonging to the group are assigned to those Comet nodes.
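A minimal sketch of such a list service is shown below, assuming a hypothetical `/serverlist` HTTP endpoint and a hard-coded IP-prefix table standing in for the real city/carrier database and drop-rate topology; all names and addresses are illustrative.

```go
package main

import (
	"encoding/json"
	"log"
	"net"
	"net/http"
	"strings"
)

// Group is a set of Comet servers serving one region/carrier segment.
type Group struct {
	GroupID int      `json:"group_id"`
	Comets  []string `json:"comets"` // host:port list the client may dial
}

// groupForIP maps a client IP to a group. A real svrlist service would
// consult an IP-segment database plus the live drop-rate topology; here
// a single hard-coded prefix stands in for that lookup.
func groupForIP(ip net.IP) Group {
	if ip != nil && strings.HasPrefix(ip.String(), "123.") {
		return Group{GroupID: 2, Comets: []string{"comet-sh-1.example.com:7172"}}
	}
	return Group{GroupID: 1, Comets: []string{
		"comet-bj-1.example.com:7172",
		"comet-bj-2.example.com:7172",
	}}
}

func serverList(w http.ResponseWriter, r *http.Request) {
	host, _, _ := net.SplitHostPort(r.RemoteAddr)
	g := groupForIP(net.ParseIP(host))
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(g)
}

func main() {
	http.HandleFunc("/serverlist", serverList)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```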



Figure 7

In Figure 7 the structure is optimized once again: Comet stays in the IDC machine rooms, but messages are transferred to the central machine room (the source station) in pull mode rather than push mode; after some online processing, the source station pushes the data out uniformly. Note that the choice of the central machine room matters here, because its stability is critical. In addition, fault monitoring was improved during deployment to keep the service highly available. Fault monitoring includes the following:

1. Simulated clients that monitor the message arrival rate

2. pprof enabled online, so CPU profiles can be captured and analyzed at any time (a minimal sketch follows this list)

3. Whitelist: print server logs for specified users

Set up a whitelist, record the log information, and collect problem feedback

Mark key problems and resolve them promptly

Prevent the same problems from recurring

4. Server load monitoring with SMS alerts
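For item 2, enabling Go’s built-in profiler online is nearly a one-liner with `net/http/pprof`. The port below is an assumption; in production it should only be reachable from the internal network.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiler on an internal port so CPU/heap profiles can be
	// captured from a live Comet at any time, e.g.:
	//   go tool pprof http://<host>:6060/debug/pprof/profile?seconds=30
	go func() {
		log.Println(http.ListenAndServe(":6060", nil))
	}()

	select {} // stand-in for the server's real work
}
```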

Low cost and high efficiency have always been what Bilibili pursues. Bilibili will continue to optimize and improve the GOIM system to give users the best live danmaku experience.