Preface
The chat room is an important type of IM system. Unlike one-to-one chat and group chat, a chat room is a large-scale real-time message distribution system.
There are many ways to implement a chat room, and there are several open source implementations in the industry, each with its own characteristics and target scenarios. As a PaaS platform, NetEase Yunxin's chat room architecture has several notable features:
- Horizontal scalability: reflected in two aspects, the number of chat rooms and the number of people in a single chat room.
- Rich features: as a platform, the chat room provides the underlying communication capabilities together with a rich feature set that adapts to a wide variety of business scenarios; customers can pick the features they need according to their own business requirements.
- Global coverage: Yunxin currently provides a communication network covering the whole world; by connecting to the WE-CAN network developed in-house by Yunxin, global end-to-end latency is kept within 250 ms.
In this article, we introduce in detail the architecture of NetEase Yunxin's large-scale chat room system and some practical application cases.
NetEase Yunxin chat room system architecture
First, let's take a look at the current technical architecture of the NetEase Yunxin chat room and the work we have done while upgrading and optimizing it.
Overall Architecture
The following figure shows the technical architecture of the NetEase Yunxin chat room:
It mainly includes the following parts:
- ChatLink at the access layer
- WE-CAN and the WE-CAN Bridge at the network transport layer
- Dispatcher at the scheduling layer
- Callback, Queue, Presence, Tag, History and other services at the service layer
- CDN Manager, CDN Pusher and CDN Source at the CDN distribution layer
Below, we break down each layer in detail.
Access layer
The access layer is implemented differently depending on the client type: native clients (iOS, Android, Windows and Mac) use a private binary protocol, while the Web uses WebSocket. As the "last mile" to the client, the access layer's connection speed, quality and data security are crucial:
Access speed and quality
We have built edge nodes covering every province in China and every continent in the world to shorten this last mile, reduce uncertainty and improve service stability.
Data security
Key exchange and login are completed in 0-RTT between the client and the server, based on a combination of symmetric and asymmetric encryption; encryption algorithms such as RSA, AES, SM2 and SM4 are supported.
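As a rough illustration only (this is not Yunxin's actual wire format, and the packet fields below are made up), a 0-RTT style key exchange can be sketched as follows: the client generates a random AES session key, wraps it with the server's RSA public key, and sends it together with the already-encrypted login request, so no extra round trip is needed before business data flows.

```go
package auth

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
)

// LoginPacket is a hypothetical first packet: the RSA-wrapped session key
// travels together with the AES-GCM encrypted login payload, so the key
// exchange costs no extra round trip from the client's point of view.
type LoginPacket struct {
	WrappedKey []byte // session key encrypted with the server's RSA public key
	Nonce      []byte // AES-GCM nonce
	Payload    []byte // login request encrypted with the session key
}

// buildLoginPacket returns the packet plus the session key the client keeps
// for the rest of the connection.
func buildLoginPacket(serverPub *rsa.PublicKey, loginReq []byte) (*LoginPacket, []byte, error) {
	// 1. Generate a random 256-bit AES session key (symmetric part).
	sessionKey := make([]byte, 32)
	if _, err := rand.Read(sessionKey); err != nil {
		return nil, nil, err
	}

	// 2. Wrap the session key with the server's public key (asymmetric part).
	wrapped, err := rsa.EncryptOAEP(sha256.New(), rand.Reader, serverPub, sessionKey, nil)
	if err != nil {
		return nil, nil, err
	}

	// 3. Encrypt the login payload with the session key.
	block, err := aes.NewCipher(sessionKey)
	if err != nil {
		return nil, nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, nil, err
	}
	payload := gcm.Seal(nil, nonce, loginReq, nil)

	return &LoginPacket{WrappedKey: wrapped, Nonce: nonce, Payload: payload}, sessionKey, nil
}
```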
In addition to receiving requests from clients, the access layer is also responsible for unicasting and broadcasting messages. It therefore has to manage all the long connections on its node, including which chat room each connection belongs to and the tag attributes of each connection. The access layer also reports its load information to the back-end services so that the scheduling layer can make reasonable scheduling decisions.
When a traffic peak arrives, the access layer is usually under the greatest pressure because messages have to be broadcast. To keep the service stable, we apply a number of optimization strategies:
Adaptive flow control strategy
- Single-node flow control: the access layer service monitors the overall network bandwidth usage of the machine and sets two thresholds, T1 and T2. When usage exceeds T1, flow control is triggered; when it exceeds T2, flow control is applied more aggressively and its intensity is continuously adjusted. The ultimate goal is to stabilize bandwidth usage between T1 and T2 (see the sketch after this list).
- Single-connection flow control: in addition, the access layer service records the message delivery rate of each long connection and adjusts it at a finer granularity, so that coarse-grained node-level flow control does not cause an individual connection to receive too little or too much. This smooths message delivery, reducing spikes and fluctuations in bandwidth usage, and also improves the client-side experience.
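A minimal sketch of the single-node, two-threshold idea, assuming a hypothetical readBandwidthUsage() probe; the real ChatLink controller is more elaborate, but the shape is the same:

```go
package flowcontrol

import (
	"math"
	"time"
)

// Controller implements a simple two-threshold policy: below T1 it recovers
// the delivery rate, inside [T1, T2] it holds steady, above T2 it throttles
// aggressively, so bandwidth usage settles between T1 and T2.
type Controller struct {
	T1, T2 float64 // bandwidth usage thresholds, e.g. 0.6 and 0.8
	rate   float64 // current delivery rate multiplier, 1.0 = no throttling
}

func NewController(t1, t2 float64) *Controller {
	return &Controller{T1: t1, T2: t2, rate: 1.0}
}

// Rate returns the current delivery rate multiplier used by the broadcaster.
func (c *Controller) Rate() float64 { return c.rate }

// Adjust is called periodically with the measured bandwidth usage (0.0-1.0).
func (c *Controller) Adjust(usage float64) {
	switch {
	case usage > c.T2:
		c.rate *= 0.8 // above the hard threshold: throttle aggressively
	case usage > c.T1:
		// inside the target band [T1, T2]: hold the current rate
	default:
		c.rate = math.Min(1.0, c.rate*1.05) // below T1: recover gradually
	}
	if c.rate < 0.1 {
		c.rate = 0.1 // always keep a minimum delivery budget
	}
}

// Run wires the controller to a (hypothetical) bandwidth probe.
func (c *Controller) Run(readBandwidthUsage func() float64, stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			c.Adjust(readBandwidthUsage())
		}
	}
}
```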
Performance optimization
When ChatLink runs under heavy load, every link in the call chain, not just network bandwidth, can become a performance bottleneck. We significantly improved service performance by reducing the number of encode/decode passes (serialization, compression, etc.), using multithreaded concurrency, reducing memory copies, merging messages, and more.
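Of these, message merging is the easiest to illustrate: instead of encoding and writing every broadcast frame to every connection immediately, frames can be buffered briefly and flushed as one write, trading a few milliseconds of latency for far fewer syscalls and encode passes. A simplified sketch with hypothetical types, not the actual ChatLink code:

```go
package chatlink

import (
	"sync"
	"time"
)

// batcher merges broadcast messages for a single connection and flushes
// them as one write.
type batcher struct {
	mu      sync.Mutex
	pending [][]byte
	flush   func(batch [][]byte) // e.g. encode once and write to the socket
}

// Add queues a message for the next flush.
func (b *batcher) Add(msg []byte) {
	b.mu.Lock()
	b.pending = append(b.pending, msg)
	b.mu.Unlock()
}

// Start flushes the pending batch at a fixed interval until stop is closed.
func (b *batcher) Start(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			b.mu.Lock()
			batch := b.pending
			b.pending = nil
			b.mu.Unlock()
			if len(batch) > 0 {
				b.flush(batch)
			}
		}
	}
}
```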
Network transport layer
In the initial architecture of the NetEase Yunxin chat room system, the access layer and the back-end service layer were deployed in the same machine room. Most users connected directly to ChatLink in a BGP machine room, while remote regions and overseas users were accelerated through proxy nodes connected over leased lines. The obvious drawback of this scheme is that the upper limit of service capacity is bounded by the capacity of a single machine room; in addition, leased lines are not a small cost.
After connecting to the WE-CAN network, ChatLink at the access layer can serve clients from nearby nodes, improving service quality and reducing cost. The resulting multi-machine-room architecture also raises our service capacity by a large step.
To adapt to the WE-CAN network, we designed the WE-CAN Bridge layer, which bridges the access protocol and the chat room protocol and is responsible for protocol conversion, session management, and forwarding and receiving messages. With this layered architecture, the access layer and the back-end service layer need little or no modification, which reduces both the cost of the system transformation and the risk introduced by the architecture upgrade.
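Conceptually, the bridge is a protocol translator plus a session table. A hypothetical interface (the real protocols and APIs are internal to Yunxin) might look like this:

```go
package bridge

// AccessFrame and RoomMessage stand in for the access-layer protocol and the
// chat room protocol; the real wire formats are internal to Yunxin.
type AccessFrame struct {
	ConnID string
	Body   []byte
}

type RoomMessage struct {
	RoomID string
	Body   []byte
}

// Bridge converts between the two protocols and tracks which long
// connections (sessions) live behind this access point.
type Bridge interface {
	// Upstream: access-layer frame -> chat room protocol, forwarded over WE-CAN.
	ForwardUp(frame AccessFrame) (RoomMessage, error)
	// Downstream: chat room message -> access-layer frames for local sessions.
	ForwardDown(msg RoomMessage) ([]AccessFrame, error)
	// Session management: bind or unbind a long connection to a room.
	Bind(connID, roomID string) error
	Unbind(connID string) error
}
```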
Scheduling layer
The scheduling layer is the entry point for clients into the chat room system. Before logging into a chat room, the client needs to obtain an access address; the service that assigns it is what we call the Dispatcher.
The Dispatcher is centralized. It accepts heartbeat information from WE-CAN and ChatLink and selects the best access point based on those heartbeats. The key points in the design of the scheduling system are as follows:
Scheduling precision
The scheduling system determines the requester's region and carrier from its IP address, compares them with each edge node's region and carrier, factors in the node's own load (CPU, bandwidth, etc.) and the quality of the link from the edge node to the room's machine room (over WE-CAN), computes a composite score, and returns the top-ranked nodes as the scheduling result.
Scheduling performance
In high-concurrency scenarios, such as a large event where a huge number of users enter the chat room at the same moment right at the start, the scheduling system needs to respond quickly. To this end, we cache the scheduling rules and raw data described above locally. In addition, to prevent lagging heartbeat information from causing unreasonable allocation and node overload, load factors are dynamically adjusted as services are allocated. While guaranteeing scheduling performance, we try to keep the distribution of results as smooth as possible.
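A rough sketch of the scoring idea; the weights and field names below are purely illustrative and not Yunxin's real formula:

```go
package dispatcher

import "sort"

// EdgeNode is a candidate access point as reported via heartbeats.
type EdgeNode struct {
	Addr       string
	Region     string
	Carrier    string
	CPULoad    float64 // 0.0-1.0
	Bandwidth  float64 // 0.0-1.0, fraction of bandwidth in use
	LinkRTTms  float64 // RTT from this node to the room's machine room over WE-CAN
	LoadFactor float64 // adjusted dynamically as allocations are handed out
}

// score computes a composite score; higher is better. Weights are illustrative.
func score(n EdgeNode, clientRegion, clientCarrier string) float64 {
	s := 0.0
	if n.Region == clientRegion {
		s += 40
	}
	if n.Carrier == clientCarrier {
		s += 20
	}
	s += 20 * (1 - n.CPULoad)
	s += 10 * (1 - n.Bandwidth)
	s += 10 * (1 - minf(n.LinkRTTms, 200)/200)
	return s * (1 - n.LoadFactor) // penalise nodes we have recently over-allocated
}

// Dispatch returns the best k access points for a client.
func Dispatch(nodes []EdgeNode, clientRegion, clientCarrier string, k int) []EdgeNode {
	sort.Slice(nodes, func(i, j int) bool {
		return score(nodes[i], clientRegion, clientCarrier) > score(nodes[j], clientRegion, clientCarrier)
	})
	if k > len(nodes) {
		k = len(nodes)
	}
	return nodes[:k]
}

func minf(a, b float64) float64 {
	if a < b {
		return a
	}
	return b
}
```

Bumping LoadFactor each time a node is handed out, and decaying it over time, is one way to keep the result smooth when heartbeats lag behind reality.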
The service layer
The service layer implements the business functions, including presence, room management, cloud message history, third-party callbacks, chat room queues, chat room tags, and so on. The most basic of these are presence management and room management:
- Presence management: manages the login status of an account, including which chat rooms it has logged into and through which access points
- Room management: manages the status of each chat room, including which access points the room is distributed across and which members are in the room
The difficulty of presence and room management lies in managing a massive number of users and rooms effectively. As a PaaS platform, we can partition data by tenant to achieve horizontal scaling. In addition, because status data changes rapidly (it has a short TTL), when a core customer reports an upcoming large event, Yunxin can quickly split off and isolate the related resources in a short time.
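As a minimal sketch of tenant-based sharding for presence data (hypothetical structures; in production this would sit on a distributed cache rather than in-process maps):

```go
package presence

import "sync"

// Session records where one account of one tenant is logged in.
type Session struct {
	AccountID   string
	RoomID      string
	AccessPoint string // which ChatLink node holds the long connection
}

// Shard holds the presence data of a subset of tenants, so that a hot tenant
// (for example a customer running a huge event) can be split out and isolated
// onto dedicated shards quickly.
type Shard struct {
	mu       sync.RWMutex
	sessions map[string][]Session // accountID -> sessions
}

// Store routes each tenant to a shard; the routing function can be changed at
// runtime to split or isolate a tenant.
type Store struct {
	shards  []*Shard
	routeFn func(tenantID string) int // tenantID -> shard index
}

func NewStore(numShards int, route func(tenantID string) int) *Store {
	shards := make([]*Shard, numShards)
	for i := range shards {
		shards[i] = &Shard{sessions: make(map[string][]Session)}
	}
	return &Store{shards: shards, routeFn: route}
}

// Online records a new login for the given tenant.
func (s *Store) Online(tenantID string, sess Session) {
	sh := s.shards[s.routeFn(tenantID)]
	sh.mu.Lock()
	sh.sessions[sess.AccountID] = append(sh.sessions[sess.AccountID], sess)
	sh.mu.Unlock()
}
```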
Besides supporting massive customer access and horizontal scaling, the service layer has another very important responsibility: providing extensible functions that adapt to customers' various application scenarios. To this end, Yunxin provides a rich set of features, such as:
- Third-party callback: allows customers to intervene in core flows such as user login and message sending and to customize business logic. Because this involves service invocation across machine rooms or even across regions, we designed isolation, circuit-breaking and other mechanisms to prevent a third-party service failure from affecting the cloud service's key flows.
- Chat room queue: makes it easy for customers to implement business scenarios such as mic ordering and mic grabbing (taking turns to speak, or competing for the microphone);
- Chat room tag: the first such feature in the cloud messaging industry, supporting personalized message distribution. The principle is that the tag group set when a client logs in and the tag expression attached when a client sends a message together define the rules for message distribution and reception. Tag information is stored at both the service layer and the access layer, and part of the tag computation is pushed down to the access layer, saving bandwidth and computing resources in the central services (a simplified matching sketch follows).
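To make the tag mechanism concrete, here is a simplified matcher that supports only an OR-of-ANDs expression; the real expression syntax is richer than this:

```go
package tags

// TagSet is the set of tags a connection registers when it logs into the
// chat room, e.g. {"class:3", "role:student"}.
type TagSet map[string]bool

// Expr is a simplified tag expression: the message matches if all tags inside
// any one group are present (an OR of ANDs).
type Expr [][]string

// Match reports whether a connection's tag set satisfies the expression.
// Because tags are also cached on the access layer, this check can run on
// ChatLink itself, so non-matching connections never cost central bandwidth.
func Match(tags TagSet, expr Expr) bool {
	if len(expr) == 0 {
		return true // no expression means broadcast to everyone
	}
	for _, group := range expr {
		all := true
		for _, t := range group {
			if !tags[t] {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}
```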
CDN distribution layer
When evaluating a chat room system, one of the most frequently used words is "unlimited". But an architecture that supports horizontal scaling does not by itself mean there is no upper limit. Even if every component of a chat room system can, logically, scale horizontally, the physical machines, switches and machine room bandwidth each service depends on all have capacity ceilings. Being able to reasonably combine multiple machine rooms in multiple regions, and even external resources, is what really determines the capacity ceiling a chat room system can support.
In the NetEase Yunxin chat room system, user access points are scattered across machine rooms everywhere, which naturally pools resources from different locations, so the maximum supported capacity is naturally higher than with a single machine room or several machine rooms in one region.
Furthermore, when facing an even larger chat room, it makes sense to take advantage of general-purpose external capabilities. The fusion CDN bullet-screen (danmaku) scheme is such an implementation: it uses the edge nodes that the major CDN vendors have deployed everywhere, together with their general static acceleration capability, to support message distribution for super-large chat rooms.
The core of the fusion CDN distribution scheme is the distribution and management of bullet-screen messages. It is an optional module encapsulated inside Yunxin, which can be enabled or disabled according to the characteristics of each business without modifying any business code.
When the fusion CDN bullet-screen distribution scheme is enabled, all bullet-screen broadcasts are split into two links:
- Important messages that must be delivered in real time go to the client over the long connection
- The remaining mass of messages enters the CDN Pusher, where it is aggregated according to various strategies and then delivered to the CDN Source
The client SDK periodically pulls bullet-screen messages from CDN edge nodes according to certain policies. The SDK aggregates messages from the different sources, sorts them and calls back to the user; the app layer does not need to know where a message came from, only to process it according to its own business requirements.
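Conceptually, the merge step on the client could look like the following sketch (written in Go here for consistency, although the real SDKs are implemented in the client platforms' own languages, and the message fields are hypothetical):

```go
package cdnpull

import (
	"sort"
	"time"
)

// Msg is a hypothetical barrage message; ID is globally unique so that
// messages pulled from the CDN can be de-duplicated against messages already
// received over the long connection.
type Msg struct {
	ID   string
	Time time.Time
	Body []byte
}

// Merger combines CDN-pulled messages with long-connection messages,
// de-duplicates them and hands them to the app in timestamp order, so the
// app never needs to know which path a message travelled.
type Merger struct {
	seen   map[string]bool
	OnMsgs func(batch []Msg) // callback into the app layer
}

// Feed accepts a batch from either source and emits the new messages in order.
func (m *Merger) Feed(batch []Msg) {
	if m.seen == nil {
		m.seen = make(map[string]bool)
	}
	out := batch[:0]
	for _, msg := range batch {
		if !m.seen[msg.ID] {
			m.seen[msg.ID] = true
			out = append(out, msg)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Time.Before(out[j].Time) })
	if len(out) > 0 {
		m.OnMsgs(out)
	}
}
```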
The figure above shows the message flow of the CDN bullet-screen distribution link. The CDN Manager manages the allocation policies of the different CDN vendors (which can be adjusted dynamically at login over the long connection); it is also responsible for enabling and disabling the fusion CDN mode for each chat room on the platform, and for allocating and reclaiming the corresponding CDN Pusher resources. The CDN Pusher receives the chat room's messages, assembles them one by one into static resources according to message type and priority, and pushes them to the CDN Source, waiting for the CDN to pull them back to the source.
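A simplified sketch of the Pusher side: messages are buffered per room and periodically serialized into one static resource that is pushed to the origin, from which the CDN pulls back to the source. The endpoint URL, resource name and types below are made up, and concurrency control is omitted:

```go
package cdnpush

import (
	"bytes"
	"encoding/json"
	"net/http"
	"time"
)

// Pusher aggregates a room's barrage messages and periodically publishes them
// as one static JSON resource to the CDN origin, so the CDN's ordinary static
// acceleration can fan them out to its edge nodes.
type Pusher struct {
	RoomID    string
	OriginURL string // hypothetical CDN Source endpoint, e.g. "https://origin.example.com/rooms/"
	buf       []json.RawMessage
}

// Append buffers one message for the next publish cycle.
func (p *Pusher) Append(msg json.RawMessage) { p.buf = append(p.buf, msg) }

// Run publishes the buffered batch every interval until stop is closed.
func (p *Pusher) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if len(p.buf) == 0 {
				continue
			}
			body, _ := json.Marshal(p.buf)
			p.buf = nil
			// Publish the aggregated batch as a static resource; the CDN
			// pulls this URL back to source and caches it on edge nodes.
			if resp, err := http.Post(p.OriginURL+p.RoomID+"/latest.json", "application/json", bytes.NewReader(body)); err == nil {
				resp.Body.Close()
			}
		}
	}
}
```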
Practical application cases
Below, we introduce typical application scenarios of the NetEase Yunxin chat room system.
A large-scale application case
In August 2020, the NetEase Cloud Music TFBoys 7th-anniversary online concert was a typical large-scale application of the chat room. The event set a world record for a paid online concert with more than 780,000 viewers. NetEase Yunxin's fusion CDN bullet-screen distribution scheme was used to implement the bullet-screen interaction. In fact, during the preparatory phase, our chat room system reached the performance targets of going from 0 to 10 million online users within 20 minutes and 1 million TPS of upstream messages.
The figure above shows the architecture supporting the bullet-screen distribution for this event. Ordinary bullet-screen and gift messages reach the Yunxin servers through the client SDK and the server API respectively, and finally enter the bullet-screen broadcast service, where they are split between the long connection and the CDN and delivered to clients through a mix of push and pull.
Feature application case: chat room tags
In recent years, with the development of the Internet, online education has become more and more popular, and the "super small class" model has recently emerged. A super small class combines a large multi-student lecture with small-class interaction.
Text interaction is an important part of online live teaching and a typical application scenario for chat rooms. In the super-small-class model, however, a conventional chat room system runs into serious problems whether one creates multiple chat rooms or filters messages within a single chat room.
NetEase Yunxin's industry-first chat room tag feature supports this business scenario well. Within a single chat room, we can flexibly control how messages are sent and received, manage chat room permissions in a fine-grained way, and selectively query chat room members, enabling personalized group interaction in a large live class: for example, tagging students by group so they can be taught according to their needs, in-group discussion, and group-versus-group PK.
The picture above shows a super-small-class scene: one lecturer plus N small classes plus N teaching assistants. All students are divided into small classes, and the corresponding TA handles pre-class reminders, after-class Q&A, homework supervision and feedback on the students' learning, while every student receives the lecturer's video feed, achieving a small-class effect at large-class scale.
Conclusion
That is all for this article, which has mainly introduced the key technologies and architectural principles NetEase Yunxin uses to build a large-scale chat room system. No system is built overnight: Yunxin will continue to polish the underlying technology, such as introducing WE-CAN to improve network transmission, and will continue to enrich and improve its feature set, such as the industry-first chat room tag feature. NetEase Yunxin will keep working deeply in the IM field and provide users with top-quality services across scenarios and industries.
About the author
Jiajun Cao is a senior server development engineer at NetEase Yunxin. He joined NetEase after graduating from the Chinese Academy of Sciences and has been responsible for IM server development at NetEase Yunxin, with rich practical experience in building IM systems and developing related middleware.
For more technical content, follow the [NetEase Smart Enterprise Technology+] WeChat official account.