preface

RPC, short for Remote Procedure Call, is a protocol for requesting services from Remote computer programs over a network without knowledge of the underlying network technology. For example, an application deployed on node A invokes an interface provided by an application deployed on node B. Node A sends the data to node B. Node B finds A specific interface based on the received data and returns the result to node A over the network.

RPC framework encapsulates common capabilities such as network transport, serialization, load balancing, and fault elimination, enabling A node to call remote interfaces as easily as local methods.

SCF is an RPC framework independently developed by 58. It is dedicated to providing high-performance, highly reliable and transparent RPC remote call scheme in distributed environment.

The service management platform is a service governance platform based on THE SCF framework. It has the characteristics of automatic registration and discovery of service nodes, load balancing, service authentication, all-round monitoring and perfect alarm.

The overall architecture

  • SCF server: an application that uses the SCF framework server capabilities to provide an interface that can be called remotely.

  • SCF caller: refers to the application that invokes the interface provided by the service provider using the SCF framework client capabilities.

  • Control center: the core is to maintain the calling relationship between SCF server and SCF caller, generate the service configuration information needed by the caller, and support the real-time push of new configuration information to the caller when the calling relationship is adjusted.

  • Monitoring center: Collects traffic data from service providers and callers in a unified manner and provides real-time alarm functions to improve service management and service stability.

  • Visual management platform: provides a service management page for viewing traffic monitoring data of service providers and callers, configuring call information of service providers and callers, and setting diversified alarms.

SCF server and SCF caller constitute the main components of SCF framework, which can realize basic RPC remote call.

The control center, monitoring center and visual management platform are service management platforms, which supplement the basic capabilities of SCF framework and provide effective means for service governance.

SCF framework

 

SCF call mode

The most basic capability of RPC framework is to provide remote call. SCF provides synchronous call and callback call modes.

A synchronous invocation

Synchronous invocation is the most commonly used method for business and is the framework’s default invocation method. When a caller invokes the service’s interface, the thread executing the call is blocked, waiting for the call to complete. If the server returns a result or waits longer than the set timeout, the thread wakes up to get the return result or catch the timeout exception.

Callback invocation

A callback callback is a call to a service interface, and the interface returns immediately. The thread calling the interface does not need to wait for the result from the server, so there is no blocking. If the server returns a result or waits longer than the set timeout, a separate callback thread in the framework handles the returned result or timeout exception. Therefore, the callback implementation class of the interface must be set before the invocation.

Timeout handling

In the actual production environment, the health status of the server is uncontrollable and the network is complex, and various exceptions may occur. Therefore, not all calls in the above synchronization or callback call are guaranteed to get the result returned by the server. In order to avoid unlimited waiting by the caller, the timeout of the call must be set. If no result is returned after the set time, the caller is notified by a timeout exception.

SCF uses the classic TimeWheel algorithm to implement expiration of call tasks.

Each grid represents a time interval, and each grid corresponds to a linked list of tasks. When adding expired tasks, the expiration time and the current time are used to calculate the number of cells the task should be in, and calculate the number of laps it should be to trigger the timeout.

Assuming that each cell in the diagram represents 100ms, then a circle represents 800ms and is currently on the second cell of the first circle. If the task times out after 500ms, (500+200) % 800=7, so put the task in the linked list corresponding to the 7th cell and mark the 1st lap time out. If the task times out after 1000ms, (1000+200) % 800=4, (1000+200) /800=1, so put the task in the corresponding linked list of the 4th cell and mark the 2nd lap time out.

There are two key points to note about the existence of the expiration algorithm above:

1. There is an error in the expiration time. The error range is the time represented by each grid.

2. The thread that expires the scan task should be independent of the thread that performs the expiration operation to prevent the expiration operation from affecting the expiration scan of subsequent tasks.

serialization

The data transmitted in the network can only be binary data composed of 0 and 1, and usually the data information we request is the object of specific class in the object oriented. Serialization is the process of converting the state information of the object into a form that can be stored or transmitted. Deserialization is the reverse process of serialization.

The SCF framework uses a custom serialization implementation. The following focuses on how serialization can be implemented in asymmetric serialization and generic serialization.

  • Asymmetric serialization

The Internet is a rapidly changing industry. After an interface is released, it is inevitable to adjust the interface transmission object with the development of business, so there is a need to add or delete the member variables in the class. If classes that cannot support both the service and the caller have asymmetric members, business upgrading can be cumbersome.

The idea of SCF serialization for asymmetric classes is to number the member variables of the class. In the process of writing data stream, the member variables are successively written into binary stream according to the number (ID)+ data length (length)+ data (value). In deserialization, id is first read from the stream to determine whether the class to be assigned a value has a member with that ID. If there is a continued reading of length and data, and if there is no such id, the data part corresponding to the id member in the binary stream will be skipped based on the length read, thus achieving the purpose of ignoring the non-existence of the member.

For both versions of entities, members numbered 1, 2, and 3 are on the left, and members numbered 1 and 4 are on the right. The serialization and deserialization processes are as follows:

The basic id + Length + value approach can be used to achieve asymmetric serialization, but all members need to write id and length two special identity, increasing the size of the binary data. For basic types, the length is actually known. Data types are divided according to the following types:

Only need 3 bit can be said to be the data type, so the tag = (id < < 3) | type, embedding type into the tag field, realize the basic types of data only need to write to the tag data, do not need to write the length field, effectively reduce the size of binary data.

  • Generic serialization

Generic serialization is the serialization of an Object that has a non-concrete type member variable (the base class Object in Java) in the class.

SCF generates a unique typeId for each class using the hash method of fully qualified class names. When writing generic members, the typeId of the class is written first, and then the value data is written. TypeId is read, the specific type is found, and value is read according to the type.

Service registration and discovery

The caller calls the server over the network and must know the IP list of the server node before making the call. The original method is specified by using the configuration file in the caller, but this method cannot dynamically detect the changes of the server node in practice, and is not flexible enough to automatically expand and shrink the time service.

Service registration and discovery automatically discover the node information of the service, and the caller can sense the change of the node of the service party in time, and automatically adjust the traffic to switch to a new node.

SCF uses an ETCD cluster to manage service nodes. Each service node corresponds to a key in the ETCD and sets a TTL expiration time for the key. The service node is kept online by refreshing the TTL in heartbeat mode. In order to isolate the ETCD cluster from the service deployment environment and avoid the problem that the number of ETCD cluster connections is too high due to the increase of service nodes, a layer of service management node is encapsulated as a proxy, which forwards the service heartbeat and maintains the status information of the service provider and the caller.

When the service node goes offline, the ETCD cluster notifies the key corresponding to the service management node. The service management node pushes the latest service node list information to the caller in real time, and the caller dynamically updates and switches traffic. At the same time, in order to accommodate the abnormal situation of push failure, the callers are added to check the pull policy periodically according to the timestamp to ensure the final consistency of service node information.

Monitor data collection and storage

Are services running properly in the production environment? What is the current service traffic? Are there invocation exceptions or timeouts? These are the kinds of issues that service managers need to focus on.

The data collection

For a server, a service has multiple methods, which are deployed on multiple nodes at the same time and are called by different callers. The same caller may also call different methods of multiple services at the same time. As a result, the overall collection dimension is a product of magnitude between the service and the caller, how can data be collected effectively?

Here is a collection scheme of call data for 58RPC framework.

As can be seen from the overall architecture diagram, in order to avoid the pressure of traffic data collection, the computing power of all layers should be fully utilized to share the pressure of unified summary.

1. Collect plug-in To make full use of the computing capacity of service nodes, perform local data aggregation first and report data in minutes.

2. The plug-in reports the hash based on the service name to ensure that the data of different nodes of the same service is sent to the same collection server as far as possible. The collection server aggregates the data again to further reduce the pressure of unified Cache counting.

Data is stored

Starting with the call information for the service, let’s look at the data that needs to be stored for a call.

For the monitoring data of the same dimension, only the timestamp, times and time consuming data in the above fields are related to the actual traffic. The service name + service node + function name + caller + type identifier is the same for the same dimension. Therefore, in order to reduce the storage of data, We define a mapping rule (S[demo]SN[10.0.0.1]SF[service.get ()]C[callerdemo] indicates that the service.get () method on the 10.0.0.1 machine of the server demo is called by the callerDemo), Map the five collection meta-information to unique dimension strings, and generate a unique CID for all dimension strings. In actual monitoring data, use CID to replace the five collection meta-information.

In the actual application, the first version of the metadata stored only call, at the time of display according to display the dimensions of data query polymerization leading to monitoring data show special slow, because need to go through a lot of data query and merger, in order to raise the monitoring data query speed, using the way of writing diffusion, for a call metadata, the spread of do as shown in the figure below:

As can be seen from the figure above, the data that often needs to be displayed in the future will be calculated and directly stored in the database during actual storage, and the results can be queried directly according to the CID of the dimension during display, which effectively improves the query speed.

conclusion

As the basic component of 58 distributed architecture, SCF framework supports the network invocation of ten thousand level nodes within 58 Group. This article focuses on basic calls and monitoring. There are still many modules, such as load balancing, network management, failed node removal, service authentication, and service traffic limiting, that are not implemented. SCF framework has gone through many iterations, from the simplest remote call at the beginning to the perfection of surrounding functions of service governance now, and will continue to be optimized in the future. Welcome interested students to communicate with us.