Tars is a high-performance RPC framework based on a name service and the Tars protocol. It also provides an integrated service-governance platform that helps individuals and enterprises quickly build stable, reliable distributed applications in a microservice style.
Tars is an open-source project built on years of production experience with TAF (Total Application Framework), a microservice framework used internally at Tencent. The name comes from the robot TARS in the movie Interstellar: TARS interacts in a very friendly way, and anyone meeting it for the first time can communicate with it easily. With a similar design philosophy, Tars aims for ease of use, high performance, and service governance, so that development is easier, engineers can focus on business logic, and operations stay efficient and under control.
At present, the framework is used by more than 100 businesses on more than 100,000 servers within Tencent.
Design philosophy
The design philosophy of Tars is to govern services with microservice principles while abstracting the whole system into layers that are decoupled, or at most loosely coupled, from one another, as shown in the following figure:
At the bottom, the protocol layer unifies network communication between services under a single protocol defined with an IDL (Interface Definition Language): it is cross-platform, extensible, and supports automatic code generation. During development, engineers only need to care about the fields of the communication protocol, not the details of its implementation, which removes most of the concerns about cross-platform use, compatibility, and future extension that otherwise arise when building services.
The common libraries, communication framework, and platform layer in the middle are designed to let business development focus on the business logic itself. From the user's point of view, they encapsulate the common library code and remote procedure calls that come up constantly in daily development, making them simple and convenient to use. From the framework's point of view, they pursue high stability, high availability, and high performance, so services can be operated with confidence. From the distributed platform's point of view, they solve the fault tolerance, load balancing, capacity management, nearby access, gray release, and other problems that arise while services run, making the platform more powerful.
The operations layer at the top is designed so that O&M staff only need to handle routine service deployment, release, configuration, monitoring, scheduling, and management.
Overall architecture
Structure topology
The topology of the overall architecture is mainly divided into two parts: service nodes and common framework nodes.
Service node:
A service node is a specific operating-system instance on which services actually run; it can be a physical machine, a virtual machine, or a cloud host. As the types and scale of services grow, the number of service nodes may reach tens or even hundreds of thousands. Each service node runs one Node agent process and N (N >= 0) business services. The Node agent centrally manages the services on its machine, provides start, stop, release, and monitoring functions, and receives heartbeat reports from them.
Common framework nodes:
Common framework nodes are all nodes other than service nodes.
The number of common framework nodes varies; for fault tolerance and disaster recovery (DR), they are deployed on multiple servers across multiple machine rooms, with the exact count determined by the scale of the service nodes.
The common framework nodes can be divided into the following parts:
Web management system: view real-time data about running services on the Web, and release, start, stop, and deploy services.
Registry (routing + management service): provides address query, release, start, stop, and management operations for service nodes, and manages heartbeat reports from services; service registration and discovery are implemented through it.
Patch (release management): provides the release function for services.
Config (configuration center): provides unified management of service configuration files.
Log (remote log): collects service logs on a remote server.
Stat (call statistics): collects call metrics reported by business services, such as total traffic, average latency, and timeout rate, so alarms can fire when a service misbehaves.
Property (property statistics): collects user-defined property data reported by services, such as memory usage, queue size, and cache hit ratio, so alarms can fire when a service misbehaves.
Notify (exception reporting): collects exception events reported by services, such as service state changes and DB access failures, so alarms can fire when a service misbehaves.
In principle, all nodes can communicate with each other; at minimum, every service node must be able to reach the common framework nodes.
Features
Tars protocol
The Tars protocol is defined with an Interface Definition Language (IDL). It is a binary, extensible, multi-platform protocol with automatic code generation, which lets objects running on different platforms, written in different languages, communicate with each other through RPC. It is mainly used as the network transport protocol between backend services and for object serialization and deserialization.
The protocol supports two kinds of types: basic types and complex types.
Basic types include void, bool, byte, short, int, long, float, double, string, unsigned byte, unsigned short, and unsigned int.
Complex types include enum, const, struct, vector, and map, as well as nesting of struct, vector, and map.
For example:
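A minimal, hypothetical Tars IDL file might look like this (the module, struct, and interface names are illustrative, not from the original text):

```tars
// hello.tars -- illustrative interface definition
module TestApp
{
    // A struct: each field has a numeric tag and is marked require or optional
    struct User
    {
        0 require string name;
        1 optional int age = 0;
    };

    // An interface: methods become RPC calls; "out" marks an output parameter
    interface Hello
    {
        int sayHello(User user, out string greeting);
    };
};
```

Running the IDL compiler over such a file generates the client proxy and server stub code in the target language.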
Invocation modes
The interface a service provides is defined in the IDL, and the communication code for both client and server is generated automatically. The server only needs to implement the service logic to serve requests; the client invokes the service through the generated code. Three invocation modes are supported:
Synchronous call: the client sends a request and waits for the result before continuing its logic.
Asynchronous call: the client sends a request and continues with other business logic; when the server returns the result, a callback handler processes it.
One-way call: the client finishes as soon as the request is sent, and the server returns no result.
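The three modes can be sketched with a toy proxy object. This is not the real Tars client API (the generated code differs per language); it only illustrates the control flow of each mode:

```python
import threading

class HelloProxy:
    """Hypothetical stand-in for a generated Tars client proxy (illustrative only)."""

    def _invoke(self, name):
        # Pretend this performs the network round trip to the server.
        return f"hello, {name}"

    def say_hello(self, name):
        # Synchronous: block until the server returns the result.
        return self._invoke(name)

    def async_say_hello(self, name, callback):
        # Asynchronous: return immediately; the callback handles the result later.
        t = threading.Thread(target=lambda: callback(self._invoke(name)))
        t.start()
        return t

    def oneway_say_hello(self, name):
        # One-way: fire the request and ignore any result.
        self._invoke(name)

proxy = HelloProxy()
print(proxy.say_hello("tars"))  # sync: waits for the reply

results = []
proxy.async_say_hello("tars", results.append).join()  # async: callback fires later
print(results[0])

proxy.oneway_say_hello("tars")  # one-way: no reply expected
```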
Load balancing
The framework implements service registration and discovery through the name service. The client obtains the address list of the called service from the name service, and then selects a suitable load-balancing strategy for its calls as needed.
Load balancing supports round-robin, hash, and weight-based strategies.
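The three strategies can be sketched as follows. The addresses and weights are made up for illustration; a real client would get the address list from the name service:

```python
import hashlib
import itertools
import random

nodes = ["10.0.0.1:9000", "10.0.0.2:9000", "10.0.0.3:9000"]  # example address list

# Round robin: cycle through the address list in order.
_rr = itertools.cycle(nodes)
def pick_round_robin():
    return next(_rr)

# Hash: the same key (e.g. a user id) always maps to the same node,
# useful when a node holds per-key state such as a cache.
def pick_hash(key):
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Weight: nodes with a larger weight receive proportionally more traffic.
weights = {"10.0.0.1:9000": 1, "10.0.0.2:9000": 2, "10.0.0.3:9000": 1}
def pick_weighted():
    return random.choices(list(weights), weights=list(weights.values()))[0]
```

Hash selection trades even distribution for stickiness; weighted selection is useful when machines have unequal capacity.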
Fault tolerance
Fault tolerance is implemented in two ways: exclusion through the name service, and active masking by the client.
Exclusion through the name service:
Business services proactively report heartbeats to the name service, so the name service knows the state of each node a service is deployed on. When a node fails, the name service stops returning the faulty node's IP address to clients, thereby excluding it. This exclusion involves two steps, the service heartbeat and the client's refresh of the address list, so it takes roughly one minute to take effect.
Client active masking:
To mask faulty nodes more promptly, the client also detects failures from the exceptions it sees when calling a service. When the timeout rate of calls to a particular server exceeds a threshold, the client masks that server and shifts its traffic to healthy nodes. Masked servers are retried periodically; if one has recovered, traffic is distributed to it normally again.
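Client-side masking is essentially a per-node circuit breaker. A minimal sketch, with made-up thresholds (the real framework's thresholds and retry policy differ):

```python
TIMEOUT_RATIO_LIMIT = 0.5   # mask a node when over half of recent calls time out (illustrative)
MIN_CALLS = 10              # don't judge a node on too few samples

class CircuitBreaker:
    """Toy client-side masking: track timeouts per node and skip unhealthy nodes."""

    def __init__(self):
        self._calls = {}  # address -> (total calls, timed-out calls)

    def record(self, address, timed_out):
        total, timeouts = self._calls.get(address, (0, 0))
        self._calls[address] = (total + 1, timeouts + (1 if timed_out else 0))

    def is_masked(self, address):
        total, timeouts = self._calls.get(address, (0, 0))
        return total >= MIN_CALLS and timeouts / total > TIMEOUT_RATIO_LIMIT

    def pick(self, addresses):
        # Route traffic only to unmasked nodes; a real implementation would
        # also periodically retry masked nodes to detect recovery.
        healthy = [a for a in addresses if not self.is_masked(a)]
        return healthy or addresses  # fall back if everything is masked
```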
Overload protection
To prevent a sudden traffic surge or a machine failure from making the whole system busy and all services unavailable, the framework includes several protective designs. Requests are queued, and service calls are made asynchronously in a non-blocking way, which raises the system's processing capacity. The queue length is monitored, and when it exceeds a threshold, new requests are rejected. Each request also carries a timeout: when a request is read from the queue, the framework checks whether it has already timed out and, if so, discards it without processing.
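The two queue-based protections, length-based rejection and timeout-based discarding, can be sketched as follows. The limits are illustrative, not the framework's defaults:

```python
import time
from collections import deque

MAX_QUEUE_LEN = 1000     # reject new requests beyond this backlog (illustrative)
REQUEST_TIMEOUT = 2.0    # drop requests that waited too long in the queue (illustrative)

class RequestQueue:
    """Toy request queue with overload protection."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, request, now=None):
        # Length check: shed load instead of letting the backlog grow unboundedly.
        if len(self._queue) >= MAX_QUEUE_LEN:
            return False
        self._queue.append((time.time() if now is None else now, request))
        return True

    def dequeue(self, now=None):
        # Timeout check: a request that already waited past its deadline is
        # discarded unprocessed, since the caller has given up anyway.
        now = time.time() if now is None else now
        while self._queue:
            enqueued_at, request = self._queue.popleft()
            if now - enqueued_at <= REQUEST_TIMEOUT:
                return request
        return None
```

Discarding expired requests matters as much as rejecting new ones: processing a request whose caller has already timed out wastes capacity exactly when the system can least afford it.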
Message dyeing
The framework can "dye" a specific request to a given service interface, and the dye mark then propagates to every service the request passes through. For dyed requests, each service automatically forwards its logs to a dedicated dyeing-log server, so a user can trace the request's full access path by analyzing that one server, which makes problems easy to track down and locate.
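The mechanism amounts to a flag that travels with the request and reroutes logging. A minimal sketch (the flag name and log sinks are invented for illustration):

```python
dye_log = []      # stands in for the remote dyeing-log server
normal_log = []   # stands in for each service's ordinary log

def log(request, message):
    # Dyed requests are additionally routed to the dyeing-log server, so the
    # whole call path of one request can be inspected in a single place.
    normal_log.append(message)
    if request.get("dyed"):
        dye_log.append(message)

def call_downstream(request, service):
    # The dye flag travels with the request to every downstream service.
    sub_request = {"dyed": request.get("dyed", False)}
    log(sub_request, f"{service}: handled request")
    return sub_request
```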
IDC grouping
To speed up access between services, reduce the network overhead of cross-region and cross-machine-room calls, and limit the impact of network failures, the framework provides cross-region and cross-machine-room grouping with nearby access.
Set grouping
To standardize service deployment and capacity management, the framework provides Set-based deployment. Sets do not call each other, do not interfere with each other, and are isolated from each other, which improves O&M efficiency and service availability.
Data monitoring
To better reflect and monitor the running quality of both individual service processes and whole businesses, the framework supports the following reporting functions:
It reports statistics on calls between service modules, so users can inspect a service's traffic, latency, timeouts, and exceptions.
It reports user-defined property data, so users can inspect particular dimensions or indicators of a service, such as memory usage, queue size, and cache hit ratio.
It reports service state changes and exception events, so users can see when a service was released, restarted, or went down, and when it hit fatal errors.
Centralized configuration
Service configurations are managed centrally and operated through the Web, which makes configuration changes convenient, promptly propagated, and safe. A history of configuration changes is kept, so a configuration can easily be rolled back to an earlier version. Configuration pulling is provided as a service: a service simply calls the configuration service's interface to obtain its configuration files.
To manage configuration files flexibly, they are divided into several levels: application, Set, service, and node.
An application configuration is the top-level configuration file; it holds the common items extracted from multiple service configurations, and service configurations reference it to use its content.
A Set configuration is common to all services within a specific Set group, adding items on top of the application configuration.
A service configuration is common to all nodes of a specific service and can reference the application configuration.
A node configuration is the personalized configuration of one service node; combined with the service configuration, it forms the final configuration of that node.
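The layering behaves like a cascading override, with later (more specific) levels winning. A sketch of the merge, with made-up keys and values:

```python
def merge_configs(*levels):
    """Later levels override earlier ones: application < Set < service < node."""
    merged = {}
    for level in levels:
        merged.update(level)
    return merged

app_cfg     = {"log_level": "info", "db_host": "db.internal"}  # application-wide defaults
set_cfg     = {"db_host": "db.sz.internal"}                    # Set-level override
service_cfg = {"threads": 8}                                   # common to all nodes of the service
node_cfg    = {"threads": 4}                                   # per-node personalization

config = merge_configs(app_cfg, set_cfg, service_cfg, node_cfg)
# -> {'log_level': 'info', 'db_host': 'db.sz.internal', 'threads': 4}
```

Factoring shared items into the application level keeps each lower-level file small: a node's file only states what differs from its service's defaults.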
Project address
Open source: gitee.com/TarsCloud/T…