Evolution of multi-billion signaling services

I. From 0 to 1

On July 31, 2019, anyRTC’s signaling service 1.0 was officially released, about one and a half months after the project was launched. What did we get done in such a short time?

1. Message flow pattern

We position it as a global signaling service that is stable, reliable, low-latency, and highly concurrent. Unlike traditional instant messaging (IM) services, signaling services are built mainly for real-time applications and focus on high concurrency and low latency.

In the message flow model, when user A needs to send a message to user B, the service looks up where user B is, routes the message to user B’s region, and finally delivers it to user B.
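The three steps above (address, route, deliver) can be sketched as follows. This is a minimal illustration and not anyRTC's implementation: the `Router` and `Region` classes, the in-memory user directory, and the region names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """One regional data center holding the inboxes of locally connected users."""
    name: str
    inboxes: dict = field(default_factory=dict)  # user id -> list of (sender, text)

    def deliver(self, sender: str, recipient: str, text: str) -> None:
        self.inboxes.setdefault(recipient, []).append((sender, text))


class Router:
    """Hypothetical router: resolves a user's region, then forwards the message."""

    def __init__(self) -> None:
        self.regions = {}    # region name -> Region
        self.directory = {}  # user id -> region name where the user is connected

    def connect(self, user: str, region_name: str) -> None:
        self.regions.setdefault(region_name, Region(region_name))
        self.directory[user] = region_name

    def send(self, sender: str, recipient: str, text: str):
        region_name = self.directory.get(recipient)  # step 1: address user B
        if region_name is None:
            return None                              # recipient is not online
        region = self.regions[region_name]           # step 2: route to B's region
        region.deliver(sender, recipient, text)      # step 3: deliver to B
        return region_name
```

With user A connected in one region and user B in another, `send("A", "B", "hello")` returns the name of B's region and the message appears in B's inbox there.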

2. Data protocol

Because there is no need to store business data such as accounts and contacts, the protocol design can be greatly simplified. Even so, we took a detour at the start. Our early design borrowed from traditional IM: the client keeps a local backup of its data and, when it needs to synchronize, sends that backup to the server; the server computes the difference between its own data and the backup and sends the difference to the client; the client saves the difference to complete the synchronization. This approach has two problems. First, the backup grows as client data grows, which makes synchronization traffic expensive. Second, the client must compute differential data on every synchronization, which adds performance overhead and implementation complexity and hurts real-time delivery. We therefore designed a new protocol, which we call the RSync (Routing Synchronization) protocol.

The principle of the RSync protocol is that a signaling message has three states: 0 (not delivered), 1 (sent), and 2 (received). Of the three, only state 2 requires an Ack from the receiver, which ensures that no data is lost while keeping signaling message delivery fast. The services in between are responsible only for routing messages.
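A minimal sketch of the three-state message lifecycle might look like the following. The article does not publish RSync's wire format or API, so the `Outbox` class and its method names are hypothetical; only the three states and the rule that state 2 alone requires a receiver Ack come from the description above.

```python
from enum import IntEnum


class State(IntEnum):
    NOT_DELIVERED = 0
    SENT = 1
    RECEIVED = 2


class Outbox:
    """Tracks per-message state; only the move to RECEIVED needs a receiver Ack."""

    def __init__(self) -> None:
        self.states = {}  # message id -> State

    def enqueue(self, msg_id: int) -> None:
        self.states[msg_id] = State.NOT_DELIVERED

    def mark_sent(self, msg_id: int) -> None:
        # No acknowledgement is awaited here, so sending stays fast.
        if self.states[msg_id] == State.NOT_DELIVERED:
            self.states[msg_id] = State.SENT

    def on_ack(self, msg_id: int) -> None:
        # The receiver's Ack is the only proof of delivery.
        if self.states.get(msg_id) == State.SENT:
            self.states[msg_id] = State.RECEIVED

    def pending(self):
        """Messages that still need (re)delivery because no Ack has arrived."""
        return sorted(m for m, s in self.states.items() if s != State.RECEIVED)
```

Anything still listed by `pending()` can be resent later, which is one way the "data will not be lost" guarantee could be met without blocking the send path.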

3. Active-active + Multi-data center architecture

Active-active: Our active-active setup is a master-master architecture with no distinction between master and slave. Both sides serve business traffic at the same time, which provides disaster recovery and increases the overall throughput of the system. For example, we run two MSS management services, one in Hong Kong and one in London, to distribute data between data centers around the world. Two RSS services are also deployed in each data center, and the IMS access service is deployed as multiple instances.

Multiple data centers: Data centers have been deployed in China, Japan, Singapore, Mumbai, Frankfurt, Los Angeles, and other locations to better serve local apps. As our business in China and overseas grows, data centers will be deployed in more regions.

4. RPC-TXP component

anyRTC is built on microservice design principles, and each service is a relatively independent business module. Services exchange data through RPC components. At first we used the open source ZMQ component, which we chose for its streamlined design and excellent performance. After the service went online, however, background monitoring showed that cross-border communication often failed to meet our requirements in a globally deployed system. We therefore developed the TXP protocol component, which implements a fast, reliable protocol on top of UDP. Compared with TCP, the transmission rate increases by about 40%, and the advantage is even more significant in high-latency, high-packet-loss scenarios such as cross-border and cross-carrier links.
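The article does not describe TXP's internals, so the following is not TXP. It is only a simulated stop-and-wait sketch of the general idea behind reliability over an unreliable datagram link: number each packet and retransmit it until it is acknowledged. `LossyChannel` and `send_reliably` are hypothetical names, and for brevity the sketch assumes that an Ack is never lost once a packet arrives.

```python
class LossyChannel:
    """Simulated datagram link that drops the transmissions listed in `drops`."""

    def __init__(self, drops):
        self.drops = set(drops)  # 0-based indexes of transmissions to lose
        self.count = 0           # transmissions attempted so far

    def transmit(self, packet):
        index = self.count
        self.count += 1
        return None if index in self.drops else packet


def send_reliably(payloads, channel, max_tries=5):
    """Stop-and-wait: resend each numbered packet until it gets through."""
    received = []
    expected_seq = 0
    for seq, payload in enumerate(payloads):
        for _ in range(max_tries):
            arrived = channel.transmit((seq, payload))
            if arrived is None:
                continue               # timeout with no Ack: retransmit
            got_seq, data = arrived
            if got_seq == expected_seq:
                received.append(data)  # in-order delivery to the application
                expected_seq += 1
            break                      # Ack received: move to the next packet
        else:
            raise TimeoutError(f"packet {seq} lost {max_tries} times")
    return received
```

A production protocol would typically add sliding windows, Ack loss handling, and congestion control; stop-and-wait is used here only because it is the smallest scheme that shows sequence numbers and retransmission.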

II. Resolving core issues

1. C10K

Any high-concurrency service faces the C10K problem. Its core is how to keep data exchange between every user and the service real-time, at an economical cost in system resources, when handling a very large number of TCP connections.

Many open source projects now offer solutions to the C10K problem. The anyRTC backend instead uses its own model, tailored to the characteristics of its business. The anyRTC service is written in C++ and runs on Linux. It also supports UDP connections, which do not suffer from the C10K problem themselves; TCP connections are handled with epoll to scale to large numbers of connections.
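The readiness-based model that epoll provides can be illustrated with a short sketch. The anyRTC service itself is C++; Python's standard `selectors` module is used here only for brevity (its `DefaultSelector` picks the most efficient mechanism available, which is epoll on Linux). The `ready_payloads` helper and the connection names are assumptions made for the example.

```python
import selectors
import socket


def ready_payloads(connections, timeout=1.0):
    """Return {name: bytes} for every connection that has data waiting."""
    selector = selectors.DefaultSelector()  # epoll-backed on Linux
    try:
        for name, sock in connections.items():
            sock.setblocking(False)
            selector.register(sock, selectors.EVENT_READ, data=name)
        result = {}
        # One call reports all ready sockets; idle ones are not polled one by one.
        for key, _events in selector.select(timeout):
            result[key.data] = key.fileobj.recv(4096)
        return result
    finally:
        selector.close()
```

The point of the design is that a single thread waits on all connections at once and only touches the ones that actually have data, instead of dedicating a thread or a blocking read to each connection.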

2. Data consistency

Uncertainty of user location: An app has users all over the world, and any user may travel anywhere, so whether users are allowed to connect to the nearest access point has a great impact on the business flow. A signaling service differs from an instant messaging service: it cares more about how fast messages arrive, and the system does not record users’ business data. However, if nearest-point access is allowed, data consistency must be ensured without affecting services, which means that user data needs to be synchronized globally.

How to synchronize data centers: Global synchronization of user data is a large system overhead, so every business module must be broken down again and again to identify the ones that really must be synchronized globally. Group messages are one example: when someone sends a message to a group, it must reach the group’s members in every data center around the world.

anyRTC designed a data management service and divided global users into regions. Data centers in each region hand data to the management service, which distributes it globally, so data centers never exchange data with each other directly. Together with the offline storage service, this keeps the services in regional data centers lean while ensuring the data integrity of the system.
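The hub-style distribution described above can be sketched as follows. The classes, method names, and data center names are hypothetical; the only point illustrated is that each data center talks to the management service, never to its peers.

```python
class DataCenter:
    """A regional data center that knows only the management service."""

    def __init__(self, name, hub):
        self.name = name
        self.hub = hub
        self.delivered = []  # group messages delivered to local members
        hub.attach(self)

    def publish(self, group, text):
        self.delivered.append((group, text))         # local members first
        self.hub.distribute(self.name, group, text)  # the hub handles the rest

    def receive(self, group, text):
        self.delivered.append((group, text))


class ManagementService:
    """Hypothetical hub: the only component that talks to every data center."""

    def __init__(self):
        self.centers = {}

    def attach(self, center):
        self.centers[center.name] = center

    def distribute(self, origin, group, text):
        for name, center in self.centers.items():
            if name != origin:  # do not echo back to the sender's data center
                center.receive(group, text)
```

With N data centers, each one maintains a link to the hub rather than to the other N-1 data centers, which is the simplification the paragraph above describes.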

III. Iterating quickly

As the business grows, so do the requirements. Avoiding a flashy, over-engineered design often takes several rounds of thinking, discussion, and starting over.

1. Simplify requirements:

Focus on the core function points and strip away the distractions of specific business logic. Many users do not understand the difference between our signaling service and an instant messaging service, so when a requirement comes in we map it to function points, avoid over-design, and leave the actual business logic to the users.

2. Microservices:

We split the system into microservices, including the access service, area routing service, management service, log service, data service, statistics service, operation and maintenance service, and reporting service, to keep functional iteration fast.

3. Compatibility with old versions:

This is a problem that most business system upgrades face. Unlike an app, we provide a service capability and cannot require every client to upgrade the SDK module at any time. In the early design stage, we adopted a heavy-server, light-client approach. As a result, a system upgrade does not require every client to upgrade the SDK, which is friendlier to users of old versions.

IV. Multi-area disaster recovery

In the traditional data-center-level disaster recovery solution, two data centers are deployed in the same city and a remote disaster recovery center is built in another. Usually only one of the three data centers serves online traffic; when a fault occurs, services are switched to another data center. The main problem is that the disaster recovery data center carries no service traffic, so if the primary data center fails there is no guarantee that the switchover to the disaster recovery data center will succeed. Meanwhile, in normal times a large amount of backup capacity serves nothing and is wasted.

The core of multi-area disaster recovery is that multiple campuses provide services at the same time, so even if one campus fails, the traffic on each of the others only increases by a certain percentage. Each campus only needs to keep its load at or below (N-1)/N of its capacity, where N is the number of campuses, reserving the remaining 1/N as headroom, to provide lossless disaster recovery. In the traditional “two sites, three centers” setup, far more server resources sit idle. In addition, because all campuses serve external traffic in normal times, we can switch traffic to other campuses at any time when a failure occurs.
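The (N-1)/N figure can be checked with a few lines of arithmetic. The sketch below assumes that all campuses have equal capacity and that traffic from a failed campus is spread evenly over the survivors; the function name is made up for the example.

```python
from fractions import Fraction


def utilization_after_failure(campuses: int, failed: int = 1) -> Fraction:
    """Load per surviving campus, as a fraction of its capacity.

    Assumes equal capacity everywhere, each campus initially loaded to
    (N-1)/N, and traffic from failed campuses spread evenly over the rest.
    """
    if not 0 <= failed < campuses:
        raise ValueError("need at least one surviving campus")
    total_load = campuses * Fraction(campuses - 1, campuses)  # equals N - 1
    return total_load / (campuses - failed)
```

For four campuses, each runs at 3/4 of capacity in normal operation and exactly at capacity after one campus fails, so the 1/N reserve is just enough to absorb a single failure. A second simultaneous failure would push the survivors to 3/2 of capacity, which is the limit of this scheme.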

A system must be equipped with service monitoring so that faults are caught early.

There were many backend failures in the product's early days. Worse than the failures themselves, the lack of monitoring meant that we often did not notice a failure right away, which widened its impact.

1. Failure analysis

For every failure, large or small, the developers thoroughly review how it unfolded, agree on a solution, and write a detailed technical report. The report focuses on how to prevent similar faults from recurring, improve proactive fault detection, and shorten fault response and troubleshooting.

2. Monitoring and alarm system

The idea behind the monitoring system is very simple: business code sets or accumulates a value for a monitoring ID in shared memory. The reporting service on each machine periodically reports all ID-value pairs to the monitoring center. After the data is aggregated and stored in the database, the monitoring center plots the monitoring curves on a unified monitoring page and raises alarms based on pre-configured monitoring rules.
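The flow above (set or accumulate by ID, report periodically, alarm by rule) can be sketched in a few lines. This is an illustration only: a plain dictionary stands in for shared memory, the monitoring IDs and limits are made up, and the rule format (alarm when the aggregated value exceeds a limit) is an assumption.

```python
from collections import Counter


class Monitor:
    """Per-machine values keyed by monitoring ID (a dict stands in for shared memory)."""

    def __init__(self):
        self.values = Counter()

    def add(self, monitor_id, amount=1):
        self.values[monitor_id] += amount  # accumulate, e.g. an error count

    def set(self, monitor_id, value):
        self.values[monitor_id] = value    # overwrite, e.g. a gauge

    def report(self):
        """Snapshot all ID-value pairs and reset them for the next period."""
        snapshot = dict(self.values)
        self.values.clear()
        return snapshot


def check_alarms(reports, rules):
    """Sum reports from all machines and return the IDs whose total exceeds its rule."""
    totals = Counter()
    for report in reports:
        totals.update(report)
    return sorted(mid for mid, limit in rules.items() if totals[mid] > limit)
```

Because business code only touches a local counter, instrumentation stays cheap; the aggregation and rule evaluation happen in the monitoring center, away from the request path.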

V. Project dependencies

Signaling services need a lot of special handling, such as the data synchronization protocol, data consistency, and exception recovery, so anyRTC signaling services do not directly use any third-party services. However, many of our design ideas draw on mature projects, such as the cluster data consistency design in Redis and some of the distributed service scheduling concepts in ZooKeeper.

Reducing dependence on third-party services lowers the requirements on the system environment, which lets services go online quickly, reduces deployment complexity, and cuts operation and upgrade costs.

VI. Client capability

VII. Applicable scenarios

1. Voice chat room. Mic seat control: mic seat control, mic queue. Room management: room number, room list, room entry and exit notifications.

2. Video chat. Call invitation: sending and receiving call invitations. User management: user online status, user information.

3. Live chat room. Room management: room number, room list, room entry and exit notifications. Interactive control: sending and receiving questions, co-hosting (mic link) requests.

4. Online education. Whiteboard: brush tracks. Classroom management: student list, class announcements, hand raising. Signaling control: courseware control, hand raising, microphone muting. Class recording: historical messages are provided so that class chat and whiteboard content can be replayed at any time.

5. Internet of Things. Smart home: control signaling. Intelligent vehicles: remote control. Smart watches: sending and receiving messages. Virtual Reality (VR)/Augmented Reality (AR): real-time annotation.

6. IM. Signaling channel: message sending, message status synchronization, audio and video call maintenance.
