Recently, during my internship, I came into contact with JSF (nicknamed “Jeff”), JD.com’s self-developed service framework. The downstream interfaces called by some of the new features I wrote are provided through JSF. There are already many efficient service frameworks, such as Alibaba’s Dubbo, typically paired with Apache ZooKeeper as its registry, so why did JD develop its own JSF service framework? Looking back at the evolution of JD’s JSF, I have to admit that a good architecture is not built overnight; it evolves gradually.

1. Dubbo/Zookeeper combo

Dubbo is Alibaba’s open source high-performance service framework. It lets applications expose and consume services through high-performance RPC, and it integrates seamlessly with the Spring framework. We can take a look at the Dubbo architecture roadmap, which runs from single application architecture to vertical application architecture to distributed service architecture to flow computing architecture (see also another of my articles, “Service Unbundling Methodology”):

As can be seen from the figure above, Internet architecture has evolved through single application architecture, vertical application architecture, distributed service architecture, and flow computing architecture. Service granularity keeps getting finer, and more and more services (for example, servlets running in different Tomcat instances) call each other, so service governance becomes particularly important. Hence the Dubbo framework, which provides three core capabilities: interface-oriented remote method invocation, intelligent fault tolerance and load balancing, and automatic service registration and discovery.

The following is an architectural diagram of the Dubbo framework:

Node        Description
Provider    The service provider that exposes the service
Consumer    The service consumer that invokes the remote service
Registry    The registry for service registration and discovery
Monitor     The monitoring center that collects statistics on invocation counts and invocation time
Container   The container that the service runs in

The Container starts our service Provider (for example, in Docker). Dubbo does not strictly require a registry; providers and consumers can be connected directly. In most cases, however, Dubbo + Zookeeper is used. When Zookeeper is the registry, the service caller (Consumer) subscribes to the services it needs at the registry, and the registry asynchronously notifies the consumer with the list of addresses for those services. The consumer caches the entire address list locally and, when a business request comes in, picks an address from the list to call the service. As for the Monitor, consumers and providers regularly send it the number of calls and the call time. Use cases for Dubbo and Zookeeper will not be covered here; see the article “Dubbo: From Getting Started to Getting Started”.
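
The subscribe, notify, and local-cache flow described above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not Dubbo's actual API: the `Registry` and `Consumer` classes, the service name, and the addresses are all made up for the example, and the notification is delivered synchronously here, whereas a real registry pushes asynchronously.

```python
from collections import defaultdict


class Registry:
    """Toy in-memory registry: providers register, consumers subscribe."""

    def __init__(self):
        self._providers = defaultdict(list)    # service name -> addresses
        self._subscribers = defaultdict(list)  # service name -> callbacks

    def register(self, service, address):
        self._providers[service].append(address)
        # Notify every subscriber with the full, updated address list.
        for notify in self._subscribers[service]:
            notify(list(self._providers[service]))

    def subscribe(self, service, notify):
        self._subscribers[service].append(notify)
        # Deliver the current list right away.
        notify(list(self._providers[service]))


class Consumer:
    """Keeps a local copy of the address list and calls from that copy."""

    def __init__(self, registry, service):
        self.addresses = []  # local cache of provider addresses
        self._next = 0
        registry.subscribe(service, self._on_notify)

    def _on_notify(self, addresses):
        self.addresses = addresses

    def pick(self):
        """Round-robin over the cached list; no registry round trip."""
        if not self.addresses:
            raise RuntimeError("no provider available")
        address = self.addresses[self._next % len(self.addresses)]
        self._next += 1
        return address


registry = Registry()
registry.register("demo.HelloService", "10.0.0.1:20880")
consumer = Consumer(registry, "demo.HelloService")
registry.register("demo.HelloService", "10.0.0.2:20880")  # pushed to consumer
picked = [consumer.pick() for _ in range(3)]
```

Because `pick` only reads the local cache, the consumer can keep calling providers even if the registry is briefly unreachable, which is the point of caching the list.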

2. Is Zookeeper suitable as a registry?

When we ask “Is Zookeeper suitable as a registry?”, the first thing to consider is the specific usage scenario. My view is that at JD, Zookeeper really is not suitable!

The importance of the service registry hardly needs explaining: it acts as the guide between service providers and service callers and plays an extremely important role in service governance. The registry must, must, must be highly available and very stable. So how do you choose a service registry? Here are some basic considerations:

  • Service registration: how registration information is received
  • Service subscription: how subscription information is returned, by push or by pull
  • Status detection: how the status of service providers is checked

For example, Zookeeper is strongly consistent. It performs status detection through the ephemeral nodes that service providers register. If the network between the providers and Zookeeper is interrupted briefly, Zookeeper considers the providers dead and removes their nodes. In fact, the network directly between the clients and the providers may be perfectly fine, so all nodes can end up being removed, leaving callers with no available nodes.
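
To make the failure mode concrete, here is a minimal sketch of session-based status detection. It is a simplified stand-in for ephemeral nodes, not ZooKeeper's real client API: the `EphemeralRegistry` class, the 10-second timeout, and the addresses are assumptions chosen for the example, and time is passed in explicitly so the run is deterministic.

```python
class EphemeralRegistry:
    """Entries survive only while heartbeats keep arriving in time."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self._last_heartbeat = {}  # provider address -> last heartbeat time

    def heartbeat(self, address, now):
        self._last_heartbeat[address] = now

    def alive(self, now):
        # Drop every provider whose session has expired.
        self._last_heartbeat = {
            address: seen
            for address, seen in self._last_heartbeat.items()
            if now - seen <= self.session_timeout
        }
        return sorted(self._last_heartbeat)


reg = EphemeralRegistry(session_timeout=10)
for provider in ("10.0.0.1:20880", "10.0.0.2:20880"):
    reg.heartbeat(provider, now=0)

before = reg.alive(now=5)
# The link between the providers and the registry is cut, so no heartbeats
# arrive, although the providers themselves keep running.
after = reg.alive(now=20)
```

After the timeout, `after` is empty even though both providers are healthy, which is exactly the situation where callers are told there are no available nodes.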

If you choose from an open source framework, consider:

  • Maturity: including learning cost, community activity, and amount of documentation (blindly chasing the most popular option may not give the best fit)
  • Maintenance cost: Registry maintenance
  • Data structure: whether results can be quickly located and traversed
  • Performance and stability
  • CAP principle: CP (focus on consistency) or AP (focus on availability)

Here’s a comparison of two registries I know of:

                        ZooKeeper            Eureka
Consistency             Strong consistency   Weak consistency
Data structure          Tree                 K/V
Communication protocol  TCP                  HTTP
Client                  ZKClient             Eureka-client
CAP principle           CP                   AP

Selection summary of the registry:

  • At small scale, CP can be chosen, and the RPC framework can access the data source directly
  • At large scale, AP should be chosen, and the RPC framework should not access the data source directly
  • For cross-room or cross-region deployments, avoid a registry built on a strong consistency protocol as far as possible
  • The RPC framework must have a disaster recovery policy for when the registry is unavailable
  • Service status detection is very important
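
One common form of the disaster recovery policy mentioned above is for the RPC client to keep a snapshot of the last known address list on disk and fall back to it when the registry cannot be reached. The sketch below illustrates the idea only; `load_addresses`, `RegistryUnavailable`, the file name, and the addresses are hypothetical and not taken from Dubbo or JSF.

```python
import json
import os
import tempfile


class RegistryUnavailable(Exception):
    """Raised by a fetch function when the registry cannot be reached."""


def load_addresses(fetch, cache_path):
    """Ask the registry first; fall back to the on-disk snapshot."""
    try:
        addresses = fetch()
    except RegistryUnavailable:
        # Registry is down: reuse the last list we managed to save.
        with open(cache_path, encoding="utf-8") as snapshot:
            return json.load(snapshot)
    # Registry answered: refresh the snapshot for next time.
    with open(cache_path, "w", encoding="utf-8") as snapshot:
        json.dump(addresses, snapshot)
    return addresses


def healthy_fetch():
    return ["10.0.0.1:20880", "10.0.0.2:20880"]


def failing_fetch():
    raise RegistryUnavailable("registry down")


cache_path = os.path.join(tempfile.mkdtemp(), "providers.json")
first = load_addresses(healthy_fetch, cache_path)   # writes the snapshot
second = load_addresses(failing_fetch, cache_path)  # served from the snapshot
```

If no snapshot has ever been written, the fallback raises `FileNotFoundError`; a real client would need to decide how to handle that first-start case.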

Having covered how to choose a registry and what characteristics matter, let us analyze whether Zookeeper is suitable as JD’s service registry. The answer is no. Zookeeper is not suitable as a registry in high-traffic scenarios because, as a CP system, it is designed for consistency rather than for high availability.

Because cross-room disaster recovery (DR) is required, many systems need to be deployed across equipment rooms. For cost-effectiveness, we usually have multiple rooms serving traffic at the same time rather than building N-fold redundancy, which means a single room cannot handle the full traffic. A Zookeeper cluster can have only one master (leader) node. Therefore, if the connection between equipment rooms fails, the master can serve only one room, and service modules running in the other rooms have to stop because they have no master. All the traffic is then concentrated in the room with the master, and the system goes down. This was the root cause of the failure of JD’s registry on November 11, 2015. In addition, after a back-end container service restarts, its cached service address list is lost and the service cannot be invoked. Moreover, Zookeeper cannot be scaled out horizontally on the fly, so as a registry it became the bottleneck in the high-concurrency scenario of Double 11.
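
The reason only one room keeps a master is that a ZooKeeper ensemble needs a strict majority of its nodes to elect a leader and serve requests. The small sketch below shows the arithmetic for a hypothetical five-node ensemble split across two rooms; the room names and node counts are assumptions for illustration.

```python
def has_quorum(reachable_nodes, ensemble_size):
    """A partition can elect a leader only with a strict majority."""
    return reachable_nodes > ensemble_size // 2


# Hypothetical layout: 3 ensemble nodes in room A, 2 in room B.
ensemble = {"room_a": 3, "room_b": 2}
total = sum(ensemble.values())

# The link between the rooms fails, so each room only sees its own nodes.
status = {room: has_quorum(nodes, total) for room, nodes in ensemble.items()}
```

Room A still has a majority and keeps working, while room B cannot form a quorum, so everything there that depends on the registry stops and its traffic shifts to room A.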

Even within a single equipment room there are different network segments, and segment isolation can occasionally occur when switches in the room are adjusted. In practice, an equipment room sees a temporary network isolation or a subnet adjustment roughly every month, and Zookeeper becomes unavailable at that point. If the entire business system depends on Zookeeper (for example, if every business request must first go to Zookeeper to get the master address of the business system), the availability of the system is very fragile. Zookeeper is extremely sensitive to network isolation and reacts violently to any network disturbance. As a result, the Zookeeper service can be unavailable for a long time, and its unavailability makes the entire system unavailable.

3. What is the ideal service framework?

Interface documentation management

Provide an entry point for managing and querying interface documentation, whether that is a public wiki, a standalone system, or something else. Here you can define an interface’s documentation, including interface descriptions, method definitions, and field definitions. You can also define the interface’s SLA, such as the concurrency it supports and the recommended configuration. There should also be a way to look up who is responsible for an interface.

Configuration center

Provide a place for configuration management; by configuration I mean service-related configuration. This includes group configuration, routing policies, blacklists and whitelists, degradation switches, traffic limiting settings, timeouts, retry counts, and any other data that can be changed dynamically. This allows service providers and service callers to change configuration without restarting their applications. The configuration center can be independent of the registry or merged with it.
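
The key property described above is that a configuration change takes effect without a restart. A minimal way to picture this is a publish and watch model, sketched below; `ConfigCenter`, `ServiceClient`, the key naming scheme, and the default timeout are all hypothetical and do not describe JSF's real configuration API.

```python
class ConfigCenter:
    """Toy configuration center with change listeners."""

    def __init__(self):
        self._values = {}
        self._listeners = {}  # key -> list of callbacks

    def watch(self, key, callback):
        self._listeners.setdefault(key, []).append(callback)
        if key in self._values:
            callback(self._values[key])  # deliver the current value

    def publish(self, key, value):
        self._values[key] = value
        for callback in self._listeners.get(key, []):
            callback(value)


class ServiceClient:
    """An RPC client whose timeout can be changed while it is running."""

    def __init__(self, config, service):
        self.timeout_ms = 3000  # assumed default until a value is published
        config.watch(f"{service}.timeout_ms", self._set_timeout)

    def _set_timeout(self, value):
        self.timeout_ms = int(value)


config = ConfigCenter()
client = ServiceClient(config, "demo.HelloService")
before = client.timeout_ms
config.publish("demo.HelloService.timeout_ms", 500)  # no restart needed
after = client.timeout_ms
```

The same pattern would apply to the other dynamic settings listed above, such as retry counts or degradation switches.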

Monitoring center

Monitoring focuses on data at the interface dimension and the instance dimension (such as a JVM instance). The RPC framework can periodically report the number of calls, time spent, exceptions, and other information. The monitoring center can compute service quality statistics from this data and also raise alarms.
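
As a rough illustration of what the RPC framework might accumulate between two reports, the sketch below keeps per-interface counters and produces a report on flush. The `CallStats` class and its field names are made up for the example; the actual reporting format of JSF or Dubbo is not described in this article.

```python
from collections import defaultdict


class CallStats:
    """Per-interface counters that a periodic reporter would send."""

    def __init__(self):
        self._stats = defaultdict(
            lambda: {"calls": 0, "errors": 0, "total_ms": 0.0}
        )

    def record(self, interface, elapsed_ms, ok=True):
        entry = self._stats[interface]
        entry["calls"] += 1
        entry["total_ms"] += elapsed_ms
        if not ok:
            entry["errors"] += 1

    def flush(self):
        """Build the report for this period and reset the counters."""
        report = {
            name: {**entry, "avg_ms": entry["total_ms"] / entry["calls"]}
            for name, entry in self._stats.items()
        }
        self._stats.clear()
        return report


stats = CallStats()
stats.record("demo.HelloService", 12.0)
stats.record("demo.HelloService", 18.0, ok=False)
report = stats.flush()
```

In a real client, `flush` would be called on a timer and the report sent to the monitoring center, which aggregates it across instances.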

Distributed tracing

Unlike the monitoring center, distributed tracing analyzes services along the chain of calls. The RPC framework is a natural instrumentation point for a distributed tracing system and can output the data it needs.
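
The reason the RPC layer is a natural instrumentation point is that every call passes through it, so it can attach a trace ID and a parent span ID to each request. The sketch below shows this propagation in a simplified form; the `call` helper, the context dictionary, and the two toy services are assumptions for illustration only.

```python
import uuid

spans = []  # collected span records, in the order the calls started


def call(service, handler, context=None):
    """Invoke a handler, recording a span linked to the caller's span."""
    context = context or {}
    trace_id = context.get("trace_id") or uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:8]
    spans.append({
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_id": context.get("span_id"),  # None for the root call
        "service": service,
    })
    # Pass the trace context downstream so nested calls join the same trace.
    return handler({"trace_id": trace_id, "span_id": span_id})


def inventory(ctx):
    return "in stock"


def order(ctx):
    # The order service calls the inventory service with its own context.
    return call("inventory", inventory, ctx)


result = call("order", order)
```

Both spans share one trace ID, and the inventory span points back to the order span, which is enough to reconstruct the call chain.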

Service governance

I have listed common service governance functions, such as:

  • Service routing:

    • Weight: for example, machines with a high configuration get a high weight and machines with a low configuration get a low weight;
    • IP routing: for example, certain machines can only call certain machines;
    • Group routing: for example, a caller's group is adjusted automatically based on the configuration;
    • Parameter routing: for example, reads and writes are separated based on method names, or calls are routed to different nodes based on parameters;
    • Equipment room routing: for example, route only within the same equipment room, or prefer the same equipment room.
  • Call authorization:

    • Application authorization: only authorized applications can invoke these services;
    • Token: only callers that present a matching token can invoke this group of services;
    • Blacklist and whitelist: only callers permitted by the lists can invoke this group of services.
  • Dynamic grouping:

    • Server-side group switching: service providers can be dynamically reassigned to groups according to group conditions;
    • Client-side group switching: callers can be dynamically reassigned to groups.
  • Call limiting:

    • Server traffic limiting: The server implements traffic limiting based on the token bucket or leaky bucket model.
    • Client traffic limiting: Traffic limiting is performed based on the client id.
  • Grayscale deployment:

    • Grayscale launch: start the instance first, and only begin serving traffic after verification;
    • Pre-release identifier: indicates that the service is pre-released;
    • Interface test: makes it convenient to run automated functional tests against an interface.
  • Service degradation:

    • Mock: returns mock data in case of an exception or during testing;
    • Circuit breaking: triggered by client timeouts or server timeouts;
    • Denial of service: when the server is under heavy pressure, it automatically rejects requests to protect itself.
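
Two of the governance functions listed above, weight-based routing and token bucket traffic limiting, are easy to sketch. The code below is a generic illustration and not JSF's implementation: `weighted_pick`, `TokenBucket`, and the rate and capacity values are assumptions, and time is passed in explicitly to keep the example deterministic.

```python
import random


def weighted_pick(providers, rng=random):
    """Pick one address; `providers` maps address -> weight."""
    addresses = list(providers)
    weights = [providers[address] for address in addresses]
    return rng.choices(addresses, weights=weights, k=1)[0]


class TokenBucket:
    """Server-side limiter: tokens refill at `rate` per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = now

    def allow(self, now):
        # Refill for the elapsed time, never above the capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# A burst of three requests at t=0, then one more a second later.
bucket = TokenBucket(rate=1.0, capacity=2)
decisions = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 1.0)]

# A provider with weight 0 is never selected.
only_choice = weighted_pick({"10.0.0.1:20880": 1, "10.0.0.2:20880": 0})
```

The bucket lets the first two requests of the burst through, rejects the third, and admits the fourth once a token has been refilled.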

Gateway

In most scenarios services call each other directly, so when does an RPC framework need a gateway?

The gateway provides the following functions:

  • Unified authentication service;
  • Current limiting service;
  • Protocol conversion: external protocol to unified internal protocol;
  • Mock: service test, degrade, etc.
  • Other unified processing logic (e.g. request parsing, response wrapping).
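
To show how these functions fit together, here is a minimal sketch of a gateway handler that authenticates the caller, converts an external HTTP request into an internal call, and wraps the response. Everything in it is an assumption made for illustration: the URL layout `/<interface>/<method>`, the token table, and the `invoke` callback standing in for the internal RPC call.

```python
import json

# Hypothetical token table: token -> calling application.
TOKENS = {"secret-token": "order-app"}


def handle_http(path, headers, body, invoke):
    """Authenticate, convert HTTP to an internal call, wrap the result."""
    if headers.get("Authorization") not in TOKENS:
        return 401, json.dumps({"error": "unauthorized"})

    # Protocol conversion: /<interface>/<method> plus a JSON array of args.
    parts = path.strip("/").split("/")
    if len(parts) != 2:
        return 404, json.dumps({"error": "unknown path"})
    interface, method = parts
    args = json.loads(body or "[]")

    result = invoke(interface, method, args)
    return 200, json.dumps({"result": result})  # response wrapping


def fake_invoke(interface, method, args):
    """Stand-in for the internal RPC call."""
    return f"{interface}.{method}({', '.join(map(str, args))})"


ok = handle_http(
    "/demo.HelloService/sayHello",
    {"Authorization": "secret-token"},
    '["JD"]',
    fake_invoke,
)
denied = handle_http("/demo.HelloService/sayHello", {}, "[]", fake_invoke)
```

Traffic limiting and mock responses would slot into the same handler as additional steps before the `invoke` call.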

4. JSF iteration at JD

JSF prototype: SAF

In its early days JSF was called SAF, and the choices were as follows:

  • RPC framework: configuration extensions based on Dubbo 2.3.2, plus functional extensions including REST (RESTEasy), WebService (CXF), Kryo/Thrift serialization, call compression, etc.;
  • Registry: Zookeeper, with the RPC framework accessing the data source directly;
  • Monitoring center: monitoring service + HBase;
  • Management platform: reads Zookeeper and provides basic functions such as bringing services online and offline and managing blacklists and whitelists.

How JSF is built today

JSF today is almost entirely homegrown:

  • RPC framework: lightweight, better performance, compatible with older protocols;
  • Registry: uses a DB as the data source, fronted by an Index service; supports ten times the access capacity; part of the logic is placed in the registry to reduce the burden on clients;
  • Monitoring center: monitoring proxy service + InfluxDB (changed to ElasticSearch after 2015);
  • Management side: based on the DB and more powerful, providing complete service governance and management functions; connected to JD's application management platform to sort out application dependencies;
  • HTTP gateway: Netty based, supports cross-language invocation.

The current JSF relies on the DB for eventual consistency of data, which makes it an AP system. The main functions of the registry are registering services and pushing service lists to subscribers, fetching and delivering service configuration, and viewing service status in real time. Registry nodes are stateless and horizontally scalable, and all nodes in the registry cluster are peers.

JSF optimizations and features

  • Introducing the Index service: this is a very simple HTTP service whose job is to find a registry node (in the same equipment room, or under the least pressure, or chosen for some other specific scenario). It can be regarded as a service that never goes down. The RPC framework connects to this service first to obtain a registry address. The advantage is that when registry addresses change, the RPC framework does not need to change any settings;

  • The registry keeps a full cache of the service lists in memory, so reads are still served even when it cannot connect to the database;

  • The data structure of the database is better suited to display, filtering, and analysis along various dimensions, such as group, IP, application, and equipment room;

  • The registry is itself a JSF service: when monitoring shows pressure, it can be scaled out horizontally and dynamically, so an incident like the one on Double 11 in 2015 will not happen again;

  • Improved service list push logic: for example, with 100 existing providers and one node added, the previous SAF had to push the full list of 101 nodes, and the RPC framework then had to work out which node was new and establish a long connection to it. Now the registry sends an Add event telling the RPC framework which node was added, and the RPC framework establishes the long connection. This greatly reduces the amount of data pushed;

  • The registry and the RPC framework can interact in more ways: they are connected by long connections, and JSF supports Callback, so the registry can call the RPC framework for operations other than service list changes, for example viewing status, viewing configuration, and delivering configuration.
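
The difference between pushing the full list and pushing an Add event can be sketched as follows. This is an illustrative model only: the `ConnectionPool` class, the event names, and the addresses are made up, and the "connection" is just a string standing in for a long connection.

```python
class ConnectionPool:
    """Client-side view of providers, one long connection per address."""

    def __init__(self):
        self.connections = {}
        self.opened = 0  # how many connections were ever established

    def _connect(self, address):
        self.opened += 1
        return f"conn:{address}"  # stand-in for a real long connection

    def apply_full_list(self, addresses):
        """Old style: receive everything and diff it locally."""
        wanted = set(addresses)
        current = set(self.connections)
        for address in wanted - current:
            self.connections[address] = self._connect(address)
        for address in current - wanted:
            del self.connections[address]

    def apply_event(self, event, address):
        """New style: receive only what changed."""
        if event == "add" and address not in self.connections:
            self.connections[address] = self._connect(address)
        elif event == "remove":
            self.connections.pop(address, None)


providers = [f"10.0.0.{i}:20880" for i in range(1, 101)]  # 100 providers
pool = ConnectionPool()
pool.apply_full_list(providers)             # initial subscription
pool.apply_event("add", "10.0.1.1:20880")   # one address instead of 101
```

Both approaches end with the same 101 connections; the event-based push simply sends one address over the wire instead of the whole list and spares the client the diff.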