1. Background
An instant messaging (IM) system is an important part of a live broadcast system. A stable, fault-tolerant, flexible messaging module that supports high concurrency is an important factor in the user experience of a live broadcast system, and the IM long-connection service plays an important role in it.
This article briefly describes the message model of our show live broadcast business and its architecture. It also covers the adjustments and evolution of that message model architecture made while dealing with various online business problems over the past year, which we have written up to share with everyone.
In the mainstream live broadcast business, push and pull streaming is the most basic technical element, while instant messaging is the key technology for realizing interaction between all viewers and the host. Through the IM module of the live broadcast system, we implement core live broadcast features such as public screen interaction, colored barrage, full-network gift broadcasts, DMS, and PK. As the information bridge for communication between users, and between users and hosts, how to ensure the stability and reliability of this "information bridge" in high-concurrency scenarios is an important topic in the evolution of a live broadcast system.
2. Live Broadcast Message Service
In the live broadcast business, there are several core concepts related to the message model. Let us briefly introduce them first, so as to get an overall understanding of the live broadcast message model.
2.1 Hosts and Viewers
Hosts and viewers are, for the IM system, ordinary users with a unique user ID, which is also the key identifier used by the IM system to deliver point-to-point messages.
2.2 Room ID
Each host corresponds to a room ID (RoomId). Before going live, a host is bound to a unique RoomId after identity verification. The RoomId is the key identifier used for instant message distribution in the IM system.
2.3 Message Type Division
According to the characteristics of the live broadcast business, IM messages can be divided in several ways, for example by the recipient dimension, by the business type of the live broadcast message, by message priority, or by storage mode.
Generally, we divide messages into the following types according to the recipient dimension:
- Point-to-point messages (unicast messages)
- Live broadcast room messages (group messages)
- Broadcast messages
According to specific service scenarios, there are the following types of messages:
- Gift messages
- Public screen messages
- PK messages
- Service notification messages
Messages need to be distributed accurately and in real time to the corresponding room or to a single user terminal. Of course, a good IM message model can also empower the business with some new capabilities, such as:
- Counting the number of online users in each live broadcast room
- Capturing events of users entering and leaving a live broadcast room
- Counting, in real time, how long each user stays in a live broadcast room
2.4 Message Priority
Live broadcast messages have priorities; this is very important. Unlike chat IM products such as WeChat and QQ, live broadcast messages must be distinguished by priority.
In chat products such as WeChat, whether in a private chat or a group chat, the messages sent by different people have essentially the same priority; there is no notion of one person's message being more or less important than another's, and all messages need to be delivered accurately and in real time to every client. In a live broadcast room, however, because the business scenarios are different, messages are not distributed with equal priority.
For example, if a live broadcast room can only render 15~20 messages per second but a hot room produces more than 20 messages per second, then distributing every message in real time without any priority control will cause the public screen rendering on the client to stutter and gift pop-ups to render too fast, and the viewing experience drops dramatically. Therefore we need to assign different priorities to messages of different business types.
For example, gift messages have higher priority than public screen messages; within the same business type, a large gift message has higher priority than a small gift message, and the public screen message of a high-level user has higher priority than that of a low-level or anonymous user. When distributing business messages, we need to distribute them selectively and accurately according to their actual priority, for instance with a simple type-to-priority mapping as sketched below.
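As a minimal illustration of such a type-to-priority mapping (the type names and numeric values below are hypothetical, not our actual protocol), a priority can simply be attached to each business message type and consulted at distribution time:

```java
// Hypothetical mapping from business message types to priorities; the real rules
// can be finer-grained (gift value, sender level, and so on).
public enum LiveMessageType {
    BIG_GIFT(100),
    SMALL_GIFT(80),
    PK(60),
    PUBLIC_SCREEN_HIGH_LEVEL_USER(40),
    PUBLIC_SCREEN_NORMAL_USER(20);

    private final int priority;

    LiveMessageType(int priority) {
        this.priority = priority;
    }

    public int priority() {
        return priority;
    }
}
```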
3. Message Technical Points
3.1 Message architecture model
3.2 Short Polling vs. Long Connections
3.2.1 Short polling
3.2.1.1 Business model for short polling
First, let us briefly describe the short polling process and its basic design idea:
- The client polls a server interface every 2s, carrying the roomId and a timestamp (the timestamp is 0 or null on the first request).
- According to the roomId and timestamp, the server queries the message events generated in that room after the given timestamp and returns a limited batch of messages (for example 10~15; the number of messages actually generated after that timestamp may be far greater than 15, but because the client's rendering capacity is limited, displaying too many messages would hurt the user experience, so the number returned is capped). At the same time, the timestamp of the last message in the batch is returned as the baseline timestamp for the client's next request.
- This repeats, so that every client can pull the latest messages of each live broadcast room roughly every 2 seconds.
The overall idea is shown in the figure above; the specific details can be refined further, and concrete explanations will be given later.
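As a minimal client-side sketch of this loop (the endpoint URL and parameter names are illustrative; only the 2s interval and the roomId/timestamp parameters come from the description above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative short-polling loop; the real client would parse the JSON response,
// render the messages, and take the returned baseline timestamp for the next poll.
public class ShortPollingClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        long roomId = 888888L;   // hypothetical room id
        long timestamp = 0L;     // 0 (or null) on the very first request

        while (true) {
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("https://example.com/live/poll?roomId=" + roomId
                            + "&timestamp=" + timestamp))   // hypothetical endpoint
                    .GET()
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());   // rendering and timestamp update omitted

            Thread.sleep(2000);   // poll roughly every 2 seconds
        }
    }
}
```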
3.2.1.2 Storage model for short polling
Short polling message storage differs from normal long-connection message storage in that there is no message fan-out problem. The message storage we build needs to achieve the following business goals:
- Message insertion should have relatively low time complexity;
- Message query should have relatively low time complexity;
- The structure used to store messages should be relatively small and should not occupy too much memory or disk space;
- Historical messages can be persisted to disk according to business needs.
Based on the above four technical requirements, after discussion within the team we decided to use the Redis SortedSet data structure for storage. The concrete idea: according to the business types of the live broadcast product, business messages are divided into four categories: gift, public screen, PK, and notification.
The messages of a live broadcast room are stored in four Redis SortedSet data structures. The SortedSet keys are "live::roomId::gift", "live::roomId::chat", "live::roomId::notify" and "live::roomId::pk"; the score is the timestamp at which the message was actually generated, and the value is the serialized JSON string, as shown below:
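A minimal write-side sketch, assuming the Jedis client (the JSON field names and the retention window are illustrative):

```java
import redis.clients.jedis.Jedis;

// Illustrative: store one public screen message in the per-room SortedSet,
// scored by its generation timestamp, and trim entries older than one minute.
public class ChatMessageWriter {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            long roomId = 888888L;                     // hypothetical room id
            long now = System.currentTimeMillis();     // score = message generation time
            String value = "{\"userId\":10086,\"content\":\"hello\",\"ts\":" + now + "}";

            jedis.zadd("live::" + roomId + "::chat", now, value);
            // Keep the structure small: drop messages older than 60s (illustrative retention)
            jedis.zremrangeByScore("live::" + roomId + "::chat", 0, now - 60_000);
        }
    }
}
```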
When the client polls, the server query logic is as follows:
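A minimal query sketch under the same assumptions: fetch at most a small batch of messages generated strictly after the client's timestamp, so already-delivered messages are not returned again.

```java
import redis.clients.jedis.Jedis;

public class ChatMessageReader {
    // Returns at most `limit` messages whose score is strictly greater than `lastTimestamp`.
    public static Iterable<String> poll(Jedis jedis, long roomId, long lastTimestamp, int limit) {
        String key = "live::" + roomId + "::chat";
        // "(" makes the lower bound exclusive; the offset/count pair caps the batch size.
        return jedis.zrangeByScore(key, "(" + lastTimestamp, "+inf", 0, limit);
    }
}
```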
Many readers will ask: why not use the Redis List data structure? The following figure explains this in detail:
Finally, we compare the time complexity of the two data structures, Redis SortedSet and Redis List, when storing live broadcast messages.
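For reference, the Redis-documented complexities behind this comparison (N is the number of stored messages, M the number of messages returned or removed) are roughly:

| Operation | SortedSet | List |
| --- | --- | --- |
| Insert a message | ZADD: O(log N) | RPUSH: O(1) |
| Fetch messages newer than a timestamp | ZRANGEBYSCORE: O(log N + M) | no score index, requires fetching and filtering: O(N) |
| Remove expired messages | ZREMRANGEBYSCORE: O(log N + M) | LTRIM by position only: O(M) |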
The above is some simple design thinking about using the Redis SortedSet data structure for message storage; the points that need attention when coding the short polling are mentioned at the end of this section.
3.2.1.3 Time control for short polling
The time control of short polling is extremely important: we need to find a good balance between the viewing experience (QoE) of live broadcast viewers and the pressure on the server.
If the polling interval is too long, the user experience deteriorates and the live broadcast feels choppy and stuttering. If the short polling frequency is too high, the server comes under too much pressure, and there is a lot of "empty polling". "Empty polling" means invalid polling: if, after the previous poll returned valid messages, no new messages are generated in the room during the next interval, that poll is wasted.
At present, Vivo live broadcast handles on the order of one billion polling requests per day. During the evening peak viewing hours, the CPU load of the servers and of Redis rises, and the thread pool of the Dubbo service provider stays at a high watermark. We scale out the servers horizontally and add nodes to the Redis cluster, and even route some super-hot live broadcast rooms to a dedicated Redis cluster, physically isolated and enjoying "VIP" service, to ensure that the messages of different live broadcast rooms do not affect one another.
The polling interval can also be configured according to the number of viewers in the room. For example, for small rooms with fewer than 100 viewers, a relatively high polling frequency can be set, around 1.5 seconds; for rooms with more than 300 but fewer than 1,000 viewers, around 2 seconds; and for rooms with tens of thousands of viewers, around 2.5 seconds. These configurations can be delivered in real time through the configuration center so that the client updates its polling interval in real time, and the interval can be tuned according to the actual viewing experience and the server load to find a relatively optimal value.
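A minimal sketch of such a configuration, assuming the thresholds and intervals given as examples above (in practice these values would be delivered dynamically from the configuration center rather than hard-coded):

```java
// Illustrative only: pick a polling interval from the online-user count of the room.
public final class PollIntervalConfig {

    public static long pollIntervalMillis(int onlineUsers) {
        if (onlineUsers < 100) {
            return 1500;   // small rooms can afford a higher polling frequency
        } else if (onlineUsers < 1000) {
            return 2000;
        } else {
            return 2500;   // very large rooms poll less frequently
        }
    }
}
```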
3.2.1.4 Precautions for short polling
1) The server must validate the timestamp passed by the client: this is very important. Just imagine: if a viewer sends the app to the background while watching a live broadcast, the client's polling pauses; when the user comes back to the live broadcast, the timestamp the client passes may be very old or even expired, and the resulting Redis query becomes a slow query. If the server accumulates a large number of such slow queries, the connections between the server and Redis cannot be released quickly, which also drags down the performance of the whole server; a large number of polling requests will time out at the same moment, and the service quality and QoE will drop a lot. (A small sketch covering this point and the next one follows this list.)
2) The client must deduplicate messages: in extreme cases the client may receive duplicate messages, for the following kind of reason. At a certain moment the client sends the request roomId=888888&timestamp=t1; because of network instability or a server GC, the request is processed slowly, taking more than 2s. Since the polling timer has already fired again, the client sends roomId=888888&timestamp=t1 once more, the server returns the same data, and the client renders the same messages twice, which also affects the user experience. So every client must deduplicate the messages it receives.
3) Massive amounts of data cannot be returned and rendered in real time: imagine a very hot live broadcast room that produces thousands or tens of thousands of messages every second. With the storage and query approach above there is a flaw: because of the various constraints, each poll returns only 10~20 messages, so it would take a long time to return all the data produced by this hot room in a single second, and the newest messages could not be returned first. Therefore the messages returned by polling can be selectively discarded according to message priority.
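A minimal sketch of precautions 1) and 2) above; the constants and the deduplication window are illustrative assumptions:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public final class PollSafeguards {

    private static final long MAX_LOOKBACK_MS = 30_000;   // server-side cap on how old a client timestamp may be

    // 1) Server side: never trust a very old (or zero) client timestamp,
    //    otherwise the Redis range query turns into a slow query.
    public static long sanitizeTimestamp(long clientTimestamp) {
        long earliestAllowed = System.currentTimeMillis() - MAX_LOOKBACK_MS;
        return Math.max(clientTimestamp, earliestAllowed);
    }

    // 2) Client side: remember a bounded window of recently seen message ids and
    //    drop duplicates caused by overlapping or retried polls.
    private final Map<String, Boolean> seen =
            Collections.synchronizedMap(new LinkedHashMap<String, Boolean>(1024, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > 500;   // keep only the most recent ids
                }
            });

    public boolean isDuplicate(String messageId) {
        return seen.put(messageId, Boolean.TRUE) != null;
    }
}
```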
The advantage of the client polling the live broadcast service server for room messages is obvious: message distribution is highly real-time and accurate, and there is no scenario where messages fail to arrive because of network jitter. But the disadvantage is just as obvious: the load on the server during business peaks is very high, and if all messages were distributed by polling, it would be difficult in the long run to achieve linear growth simply by scaling servers horizontally.
3.2.2 Long connections
3.2.2.1 Architecture model of long connection
In terms of process, as shown in the figure above, the overall flow of establishing a live broadcast long connection is:
The mobile client obtains the IP addresses of the TCP long-connection servers through an HTTP request; the long-connection server returns an optimal list of reachable IP addresses based on routing and load policies.
According to the IP address list returned by the long-connection server, the mobile client initiates the long-connection request; the long-connection server receives the request and establishes the connection.
The mobile client then sends authentication information to authenticate the communication and confirm its identity, and the long connection is finally established. The long-connection server needs to manage the connection, monitor the heartbeat, and re-establish the long connection when it drops.
The basic architecture of the long-connection server cluster is shown below. Services are deployed by region, and terminals in different regions connect to them as required.
3.2.2.2 Long Connection Establishment and Management
To ensure that messages can reach users in a timely, efficient and secure manner, the live broadcast client and IM system establish an encrypted full-duplex data channel, which is used for sending and receiving messages. When a large number of users are online, maintaining these connections and maintaining sessions requires a large amount of memory and CPU resources.
At the IM access layer, functionality should be kept simple, and business logic should be handled by downstream logic services; otherwise, every restart of the IM access process would force a large number of external devices to reconnect, affecting the user experience. The access layer therefore provides a hot-update publishing scheme: basic logic that rarely changes, such as connection maintenance and account management, is put into the main program, while business logic is embedded into the program as an SO plug-in. When the business logic is modified, only the plug-in needs to be reloaded, which guarantees that the long connections to devices are not affected.
3.2.2.3 Keeping Long Connections Alive
After a long connection is established, if the intermediate network is disconnected, neither the server nor the client can sense it, creating a "fake online" situation. So one of the key problems of maintaining the "long connection" is ensuring that both ends of the connection are notified quickly when an intermediate link fails, so that a new usable connection can be re-established and the "long connection" stays highly available. We enable keepalive detection on the server and an intelligent heartbeat on the client (a minimal sketch follows the list below).
- With keepalive detection, unexpected situations such as a client crash, the intermediate network being cut off, or an intermediate device dropping the connection entry on timeout can be detected, so the server can release half-open TCP connections when such accidents happen.
- With the intelligent heartbeat enabled, the client not only reports its liveness to the server and periodically refreshes the NAT mapping between intranet and extranet IP addresses, but can also reconnect automatically when the network changes, all with minimal power consumption and network traffic.
3.2.3 IM Message Distribution in the Live Broadcast Room
The overall flow chart of IM long connection message distribution
When integrating the client, the IM long-connection server module, and the live broadcast service server module, the overall message distribution logic follows these basic principles:
- For unicast, multicast, and broadcast messages, the live broadcast service server calls the interface of the IM long-connection server to distribute the messages to each live broadcast room.
- The service server handles the events generated in the live broadcast room according to their business type, for example deducting virtual currency when gifts are sent and running text health checks on public screen messages.
- The client accepts signaling control from the live broadcast service server; whether a message is distributed through the long-connection channel or through HTTP short polling is entirely decided by the live broadcast service server. The client is shielded from the details of the underlying channel: it receives messages in a unified data format at the top layer and processes them according to their business type.
3.2.3.1 Live broadcast room member management and message distribution
The member list of a live broadcast room is the most important basic metadata of the room. The number of users in a single room is in fact uncapped, and the distribution is very skewed: super-hot rooms have more than 300,000 users online at the same time, while medium and small rooms have tens of thousands or a few hundred. How to manage the members of a live broadcast room is one of the core functions of the live broadcast system architecture. There are two common approaches:
1. Fixed sharding: a live broadcast room is allocated a fixed number of shards; users are mapped to specific shards, and the users in each shard are stored relatively randomly.
The fixed-sharding algorithm is simple to implement, but for rooms with few users each shard may hold very few users, while for rooms with many users each shard holds a very large number of users; fixed sharding naturally scales poorly.
2. Dynamic sharding: the number of users per shard is specified, and when the number of users exceeds the threshold a new shard is added.
Dynamic sharding automatically creates shards according to the number of users in the live broadcast room: when the number of users in a shard reaches the threshold, a new shard is created, so small rooms have few shards and large rooms have many (a toy sketch follows below). However, the number of users in existing shards keeps changing as users enter and leave the room, so the maintenance complexity is relatively high.
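A toy, single-JVM sketch of dynamic sharding for one room (in the real system the shards would live in a distributed store, and the capacity threshold is illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RoomMemberShards {

    private static final int SHARD_CAPACITY = 1000;   // hypothetical per-shard threshold
    private final List<Set<Long>> shards = new ArrayList<>();

    public synchronized void join(long userId) {
        for (Set<Long> shard : shards) {
            if (shard.size() < SHARD_CAPACITY) {
                shard.add(userId);
                return;
            }
        }
        // All existing shards are full (or none exists yet): create a new shard.
        Set<Long> shard = new LinkedHashSet<>();
        shard.add(userId);
        shards.add(shard);
    }

    public synchronized void leave(long userId) {
        for (Set<Long> shard : shards) {
            if (shard.remove(userId)) {
                return;
            }
        }
    }

    // Message fan-out later walks the shards and pushes to the users of each shard.
    public synchronized List<Set<Long>> snapshot() {
        return new ArrayList<>(shards);
    }
}
```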
3.2.3.2 Message distribution in the broadcast room
In the live broadcast room there are various kinds of messages, such as enter/leave messages, text messages, gift messages, and public screen messages. Their importance differs, and a corresponding priority can be set for each kind of message.
Messages of different priorities are placed in different message queues, and messages with higher priority are delivered to the client first; when the number of queued messages exceeds the limit, the oldest low-priority messages are discarded. In addition, live broadcast messages are real-time messages, and there is little value for users in fetching historical or offline messages, so messages are stored and organized in a read-diffusion manner. When a live broadcast message is distributed, the corresponding message service is notified according to the member shards of the room, and the message is then pushed to every user in each shard. To deliver room messages to users in real time and efficiently, when a user has more than one pending message, the delivery service pushes them in batches, sending multiple messages at once. A minimal sketch of the priority queues appears below.
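A minimal sketch of per-priority queues with "drop the oldest, lowest-priority message first" behaviour; the priority levels, the cap, and the batch size are illustrative:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.EnumMap;
import java.util.List;

public class RoomMessageQueues {

    enum Priority { HIGH, MEDIUM, LOW }          // hypothetical priority levels

    private static final int MAX_PENDING = 200;  // illustrative cap per room
    private final EnumMap<Priority, Deque<String>> queues = new EnumMap<>(Priority.class);

    public RoomMessageQueues() {
        for (Priority p : Priority.values()) {
            queues.put(p, new ArrayDeque<>());
        }
    }

    public synchronized void offer(Priority priority, String message) {
        queues.get(priority).addLast(message);
        if (totalSize() > MAX_PENDING) {
            dropOldestLowest();
        }
    }

    // Drain in priority order so high-priority messages are delivered first.
    public synchronized List<String> drainBatch(int maxBatch) {
        List<String> batch = new ArrayList<>();
        for (Priority p : Priority.values()) {
            Deque<String> q = queues.get(p);
            while (batch.size() < maxBatch && !q.isEmpty()) {
                batch.add(q.pollFirst());
            }
        }
        return batch;
    }

    private void dropOldestLowest() {
        // Discard from LOW first, then MEDIUM, touching HIGH only as a last resort.
        for (Priority p : new Priority[]{Priority.LOW, Priority.MEDIUM, Priority.HIGH}) {
            Deque<String> q = queues.get(p);
            if (!q.isEmpty()) {
                q.pollFirst();
                return;
            }
        }
    }

    private int totalSize() {
        int n = 0;
        for (Deque<String> q : queues.values()) {
            n += q.size();
        }
        return n;
    }
}
```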
3.2.3.3 Message Compression over Long Connections
When distributing live broadcast messages over the TCP long connection, pay attention to the size of the message body. If a large number of messages have to be distributed at a certain moment, or if the same message is multicast to a large number of users, the egress bandwidth of the IM access layer becomes the bottleneck of message distribution. Therefore, how to effectively control and compress the size of each message is a problem we also need to think about. At present we optimize the message body structure in two ways:
- Use the Protobuf data exchange format
- Merge and send messages of the same type together
Our online tests show that using the Protobuf data exchange format saves 43% of the bytes per message on average, which greatly helps us save machine-room egress bandwidth.
3.2.3.4 Block messages
The so-called block message is a technique we learned from other live broadcast platforms: multiple messages are sent together. The live broadcast service server does not call the IM long-connection server cluster to distribute a message immediately after it is generated. The main idea is to distribute, per live broadcast room, the messages produced by the business system during the last time window at a steady pace every 1s or 2s.
For example, 10~20 messages are distributed per second; if the business server has accumulated more than 10 or 20 messages, the excess is discarded according to message priority, and if the 10~20 messages all have the same priority, the remaining messages are carried over to the next message block. Doing this has the following three benefits (see the sketch after this list):
Merged messages reduce redundant message headers: multiple messages are sent together, and with a custom TCP transport protocol the message header can be shared, which further reduces the number of message bytes.
It prevents message storms: the live broadcast service server can easily control the pace of message distribution and will not distribute messages to the live broadcast client without limit, which the client could not handle anyway.
Friendly user experience: messages in the live broadcast room flow at a normal rate and render at an even rhythm, which gives a good live viewing experience and keeps the whole broadcast smooth.
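A minimal sketch of block messages for a single room, under the assumptions above (1s flush, 10~20 messages per block); the call into the IM cluster is a placeholder:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Buffer business messages and flush them at a fixed pace instead of pushing one by one.
public class BlockMessageSender {

    private static final int MAX_PER_BLOCK = 20;   // cap taken from the 10~20 example above
    private final ConcurrentLinkedQueue<String> buffer = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Flush once per second; whether 1s or 2s is a business decision.
        scheduler.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
    }

    public void submit(String message) {
        buffer.offer(message);
    }

    private void flush() {
        List<String> block = new ArrayList<>(MAX_PER_BLOCK);
        String msg;
        while (block.size() < MAX_PER_BLOCK && (msg = buffer.poll()) != null) {
            block.add(msg);
        }
        if (!block.isEmpty()) {
            pushToImServer(block);
        }
        // Anything still buffered is sent in a later block or discarded by priority.
    }

    private void pushToImServer(List<String> block) {
        // Placeholder: the real system would call the IM long-connection server interface here.
        System.out.println("pushing a block of " + block.size() + " messages");
    }
}
```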
3.3 Message Discarding
Whether with HTTP short polling or long connections, messages have to be discarded in high-heat live broadcast rooms. For example, in an e-sports broadcast, when a particularly exciting moment happens, the number of comments on the public screen surges instantly, and at the same time a lot of low-value gift messages are produced to cheer for a player's nice play. The server would then have to distribute thousands or tens of thousands of messages per second over the IM long connection or HTTP short polling. Such a sudden surge of messages causes the following problems on the client:
The number of messages received over the long connection surges, and the downstream bandwidth pressure rises sharply, which may affect other services (for example, the SVGA resources for gifts cannot be downloaded and played in time);
The client cannot quickly render so many gifts and public screen messages, the CPU pressure increases, and audio and video processing is affected;
The user experience (QoE) metrics decrease because the message backlog may cause messages that are long out of date to be displayed.
Therefore, for these reasons, messages need to be discarded. For example, gift messages have higher priority than public screen messages, PK progress-bar messages have higher priority than full-network broadcast messages, and high-value gift messages have higher priority than low-value gift messages.
Based on these business rules, we can apply the following controls in the actual code:
Assign different priority levels to messages of different business types according to their business characteristics; when message distribution triggers flow control, selectively discard low-priority messages.
Add creation-time and send-time fields to the message structure; when actually calling the long-connection channel, check whether the gap between the current time and the message's creation time is too large, and discard the message if it is.
Gain (corrective) messages: in business development, design messages as gain messages wherever possible. A gain message means that a later message contains the information carried by the earlier ones. For example, at 9:10 the PK value of host A versus host B is 20:10; then the PK message distributed at 9:11 should carry the value 22:10, rather than the incremental 2:0 that relies on the client to accumulate the PK value itself (20+2 : 10+0). Some messages will inevitably be lost due to network jitter or to the discarding described above, so distributing gain messages (or periodic corrective messages) helps the business recover, as in the small sketch below.
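A minimal sketch of the difference, with hypothetical field names: the gain (absolute) message carries the full current score, while the incremental one depends on every previous message having arrived.

```java
public class PkMessages {

    // Incremental message: the client must add the delta to its local state,
    // so a single lost message leaves the displayed score wrong forever.
    static class PkDeltaMessage {
        long roomId;
        int deltaA;   // e.g. +2
        int deltaB;   // e.g. +0
    }

    // Gain message: carries the full current score (e.g. 22:10), so even if
    // earlier messages were dropped, the next one restores the correct state.
    static class PkGainMessage {
        long roomId;
        int totalA;   // e.g. 22
        int totalB;   // e.g. 10
    }
}
```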
4. Final Thoughts
For any live broadcast system, as the business develops and live broadcast rooms become more and more popular, the problems and challenges encountered by the message system will follow. Whether it is a message storm over long connections or massive numbers of HTTP short-polling requests, the pressure on the server can increase dramatically, and this is what we need to keep solving and optimizing. We should continuously upgrade the live broadcast message module according to the business characteristics of each period, making it an evolving module, so that the message distribution capability can support the sustainable development of the business.
The Vivo live broadcast message module has also evolved step by step. The main driving force of this evolution comes from the development of the business: as business forms diversify and the number of viewers grows, the functions of the system gradually increase and all kinds of performance bottlenecks are encountered. To solve these performance problems, we analyze the code and the performance bottlenecks of the interfaces one by one and then provide the corresponding solutions or decoupling schemes; the message module is no exception. We hope this article can give you some inspiration for the design of live broadcast message modules.
Authors: Vivo Internet Technology -LinDu, Li Guolin