Preface
If you are building an e-commerce site, the back end will consist of many microservices, such as membership, products, and recommendations.
How do the app and the browser access these back-end services? If the business is simple, you can assign each service its own domain name (https://service.api.company.com), but this approach has several problems:
- Every service needs logic such as authentication, rate limiting, and permission checks. Having each service reinvent these wheels is painful; it is better to handle them in one unified place.
- Early on, while the business is simple, this approach causes no trouble. As the business grows more complex, however, opening a single page on a site such as Taobao or Amazon may involve hundreds of microservices. If every microservice has its own domain name, the client code becomes hard to maintain because it deals with hundreds of domain names, and the number of connections becomes a bottleneck. Imagine opening an app and seeing, through packet capture, hundreds of remote calls; that is very inefficient on mobile.
- Every time a new service is launched, the operations team has to get involved, for example to apply for a domain name and configure Nginx. The same is true whenever servers are brought online or taken offline. In addition, domain names do not lend themselves to environment isolation, so callers have to decide which environment they are talking to based on the domain name.
- Back-end microservices may be written in different languages and use different protocols, such as HTTP, Dubbo, and gRPC. You cannot ask the client to adapt to all of them; that is very challenging work and makes the project complex and hard to maintain.
- Refactoring microservices later also becomes troublesome because the client has to change along with you. Take the product service as an example: as the business grows, it may need to be split into several services, so the externally exposed service must be split too, and the client has to cooperate at the same time.
API Gateway
A better way is to introduce an API gateway that takes over all incoming traffic. Like Nginx, it forwards user requests to back-end servers, but it does more than simple forwarding: it also applies extensions to the traffic. Authentication, rate limiting, permission checks, circuit breaking, protocol conversion, unified error codes, caching, logging, monitoring, and alerting are all implemented once, in the gateway. The business side can then focus on business logic and iterate faster.
With an API gateway, the client only needs to interact with the gateway instead of communicating with each business interface separately. However, every additional component is one more potential point of failure, so building a high-performance, stable gateway involves many considerations.
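To make the idea concrete, here is a minimal sketch of a gateway that runs every request through a chain of shared filters before forwarding it to a backend. The names `Request`, `Response`, `Gateway`, and `auth_filter` are all hypothetical, the backends are plain in-process functions standing in for real services, and the authentication check is deliberately trivial.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Request:
    path: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str


# A filter returns a Response to stop processing early, or None to continue.
Filter = Callable[[Request], Optional[Response]]


def auth_filter(req: Request) -> Optional[Response]:
    # Hypothetical check: any non-empty Authorization header is accepted.
    if not req.headers.get("Authorization"):
        return Response(401, "unauthorized")
    return None


class Gateway:
    def __init__(self, filters: List[Filter]):
        self.filters = filters
        self.routes: Dict[str, Callable[[Request], Response]] = {}

    def register(self, prefix: str, backend: Callable[[Request], Response]) -> None:
        self.routes[prefix] = backend

    def handle(self, req: Request) -> Response:
        # Cross-cutting logic runs once here instead of in every service.
        for f in self.filters:
            early = f(req)
            if early is not None:
                return early
        # Forward to the backend with the longest matching path prefix.
        for prefix in sorted(self.routes, key=len, reverse=True):
            if req.path.startswith(prefix):
                return self.routes[prefix](req)
        return Response(404, "no route")


gateway = Gateway([auth_filter])
gateway.register("/products", lambda req: Response(200, "product list"))
```

Rate limiting, logging, and the other functions listed above would be added as further filters in the same list, which is why the business services no longer need to implement them.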
API registration
How does a business team connect its APIs to the gateway? In general, there are several ways.
- Scan with a plug-in. A plug-in scans the business side's API, for example Spring MVC annotations combined with Swagger annotations, which enables parameter validation, documentation and SDK generation, and other functions. After scanning, the results are reported to the gateway's storage service.
- Manual entry. The interface path, request parameters, response parameters, invocation method, and other information are typed in by hand. This is tedious; with many parameters, the initial entry is very time-consuming.
- Import a configuration file, for example a Swagger/OpenAPI definition, as supported by the Alibaba Cloud gateway.
Protocol conversion
Internal APIs may be implemented with many different protocols, such as HTTP, Dubbo, and gRPC. Many of these are unfriendly to external users or cannot be exposed at all, Dubbo services for example. The gateway layer therefore needs to perform protocol conversion, turning the user's HTTP request into the corresponding underlying protocol, such as HTTP -> Dubbo. Several issues need attention here, parameter types in particular: if a type is wrong and the conversion fails, and the logs are not detailed enough, the problem will be hard to locate.
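The parameter type problem can be illustrated with a small sketch. It assumes the gateway stores, for each API, a list of parameter names and target types (the hypothetical `SPEC` below), converts the string values of an HTTP request accordingly, and reports exactly which parameter failed. `get_product` is a stand-in for a backend RPC method, not a real Dubbo call.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical API metadata kept by the gateway: (parameter name, target type).
ParamSpec = List[Tuple[str, type]]


class ConversionError(ValueError):
    """Raised with enough detail to locate the offending parameter."""


def convert_params(raw: Dict[str, str], spec: ParamSpec) -> List[Any]:
    args = []
    for name, target in spec:
        if name not in raw:
            raise ConversionError(f"missing parameter '{name}'")
        try:
            args.append(target(raw[name]))
        except (TypeError, ValueError) as exc:
            raise ConversionError(
                f"parameter '{name}'={raw[name]!r} is not a valid {target.__name__}"
            ) from exc
    return args


def get_product(product_id: int, currency: str) -> dict:
    # Stand-in for a backend method reached over an internal protocol.
    return {"id": product_id, "currency": currency}


def invoke(method: Callable[..., Any], raw: Dict[str, str], spec: ParamSpec) -> Any:
    # HTTP query parameters arrive as strings; convert, then call positionally.
    return method(*convert_params(raw, spec))


SPEC: ParamSpec = [("product_id", int), ("currency", str)]
```

Calling `invoke(get_product, {"product_id": "abc", "currency": "USD"}, SPEC)` raises a `ConversionError` that names `product_id`, which is the kind of detail the logs need for troubleshooting.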
Service discovery
As the traffic entrance, the gateway is responsible for forwarding requests, but first it needs to know which nodes to forward to and how to address them. There are several ways to do this:
- Hard-code the addresses in code or a configuration file. This is crude but usable, for example when production still runs on physical machines and IPs rarely change. However, scaling out and in, and bringing applications online and offline, become very troublesome, and the gateway itself may even need to implement its own health check mechanism.
- Domain names. Domain names work well across all languages, but for internal services they are inefficient and not friendly to environment isolation. For example, pre-release and production usually share the same database, so the pre-release gateway is likely to read the same domain name and end up calling the production service.
- A registry. A registry has none of these problems. Even in container environments, where node IPs change frequently, the registry keeps the node list up to date in real time, transparently to the gateway. Normal online/offline changes and abnormal crashes are detected by the registry's health checks and fed back to the gateway in real time. A registry also adds no extra performance cost: with domain names, a request goes through DNS resolution, Nginx forwarding, and other intermediate hops, which hurts performance considerably, whereas with a registry the gateway talks to the business side point to point.
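The registry option can be sketched as follows. This is a toy in-memory registry, not the API of any real product: nodes send heartbeats, and a node whose last heartbeat is older than a TTL is treated as unhealthy and dropped from the list the gateway sees. The class name, the TTL value, and the injectable `clock` are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


class Registry:
    def __init__(self, ttl_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        # service name -> {node address -> time of last heartbeat}
        self._last_seen: Dict[str, Dict[str, float]] = {}

    def heartbeat(self, service: str, address: str) -> None:
        # Registers the node on first call and refreshes it afterwards.
        self._last_seen.setdefault(service, {})[address] = self.clock()

    def deregister(self, service: str, address: str) -> None:
        # Explicit removal, used when a node is taken offline on purpose.
        self._last_seen.get(service, {}).pop(address, None)

    def healthy_nodes(self, service: str) -> List[str]:
        # Nodes that stopped sending heartbeats disappear after the TTL.
        now = self.clock()
        nodes = self._last_seen.get(service, {})
        return sorted(a for a, seen in nodes.items() if now - seen <= self.ttl)
```

The gateway would call `healthy_nodes("product")` (or subscribe to changes) and then connect to one of the returned addresses directly, with no DNS or proxy hop in between.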
Service invocation
Because the gateway connects to many different protocols, it may need to implement many invocation methods, such as HTTP and Dubbo. For performance reasons it is best to invoke asynchronously; both HTTP and Dubbo support asynchronous calls. For HTTP, Apache provides an NIO-based asynchronous HTTP client.
Because the gateway makes many asynchronous calls (interceptors, HTTP clients, Dubbo, Redis, and so on), you need to consider how they are composed. With plain callbacks or futures, the code becomes deeply nested and hard to read. You can follow the approach of Zuul and Spring Cloud Gateway and restructure the code in a reactive style.
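The readability concern can be shown with a short sketch. It uses Python's `asyncio` rather than the reactive libraries of Zuul or Spring Cloud Gateway, so it illustrates the same goal (asynchronous steps composed without nested callbacks), not their actual APIs. `check_auth` and `call_backend` are hypothetical stand-ins for a Redis lookup and for HTTP or Dubbo calls.

```python
import asyncio


async def check_auth(token: str) -> bool:
    await asyncio.sleep(0.01)  # stands in for an asynchronous Redis/auth lookup
    return token == "valid"


async def call_backend(name: str) -> str:
    await asyncio.sleep(0.01)  # stands in for an asynchronous HTTP or RPC call
    return f"{name}-data"


async def handle(token: str) -> dict:
    # Each step reads top to bottom even though none of them blocks a thread.
    if not await check_auth(token):
        return {"status": 401}
    # Independent backend calls are issued concurrently.
    product, recommendation = await asyncio.gather(
        call_backend("product"), call_backend("recommendation")
    )
    return {"status": 200, "product": product, "recommendation": recommendation}
```

Running `asyncio.run(handle("valid"))` performs the auth check first and then both backend calls in parallel, while the code stays flat.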
Graceful shutdown
Graceful shutdown is another issue the gateway has to pay attention to. The gateway talks to many underlying protocols, such as HTTP and Dubbo, and HTTP can be further divided into domain-name-based and registry-based access. Some of these support graceful shutdown themselves. Nginx, for example, supports health checks: if it detects that a node is down, it removes that node from rotation. To take an application offline normally, first mark it offline logically, then return a failure (for example, 500) to subsequent Nginx health check requests, wait for a period of time (depending on the Nginx configuration), and only then stop the application. Registries are similar. They generally support manual deregistration, so you can call the registry's interface during the logical offline phase to remove the node. The same applies to the others, such as Dubbo.
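The shutdown sequence described above can be sketched as a small state machine. Everything here is hypothetical: `Instance` represents one application node, `deregister` is whatever callback removes the node from a registry, and the drain time would in practice be derived from the health check interval configured in Nginx.

```python
import time
from typing import Callable


class Instance:
    def __init__(self, deregister: Callable[[], None],
                 sleep: Callable[[float], None] = time.sleep):
        self.draining = False
        self.stopped = False
        self._deregister = deregister
        self._sleep = sleep

    def health_status(self) -> int:
        # Health checks start failing as soon as the node is logically offline.
        return 500 if self.draining else 200

    def shutdown(self, drain_seconds: float) -> None:
        self.draining = True        # 1. take the node offline logically
        self._deregister()          # 2. remove it from the registry
        self._sleep(drain_seconds)  # 3. wait until callers stop sending traffic
        self.stopped = True         # 4. only now stop the process
```

The important property is the ordering: the node keeps serving in-flight requests during step 3, while new traffic is already being routed elsewhere.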
Performance
As the entrance for all traffic, the gateway's performance is critical. Most early gateways, such as Zuul 1.x, were built on the synchronous blocking model, in which each request or connection occupies one thread, and threads are a heavy resource in the JVM. Tomcat, for example, has 200 worker threads by default. If isolation in the gateway is not done properly, then when upstream services are delayed by network latency, full GC, slow third-party services, and so on, the thread pool can easily fill up and new requests are rejected, even though the threads are merely blocked on I/O and system resources are not fully used. The blocking model is also vulnerable to network and disk I/O latency. Timeouts need to be set carefully, because without proper timeouts and service isolation, a single slow interface can easily drag down the gateway.
Asynchronous processing is an entirely different approach. Typically, one thread is started per CPU core to handle all requests and responses. The life cycle of a request is no longer bound to one thread; it is divided into stages handled by different thread pools, so system resources are used more fully. Because a thread is no longer dedicated to a connection, a connection consumes far fewer resources: just a file descriptor plus a few listeners. Upstream latency is also much less of a problem: in the blocking model a slow request monopolizes a thread, whereas in the asynchronous model the cost of a single connection is very low and the system can handle a large number of requests simultaneously.
On the JVM, Zuul 2 and Spring Cloud Gateway are good choices for an asynchronous gateway. You can also build your own on Netty, Spring Boot 2.x WebFlux, Vert.x, or Servlet 3.1.
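As a rough illustration of why the asynchronous model copes better with slow upstreams, the sketch below proxies many slow requests on a single event-loop thread and applies a per-request timeout. It uses Python's `asyncio` in place of Netty or WebFlux, `slow_upstream` is a simulated backend, and the numbers are arbitrary.

```python
import asyncio
import time
from typing import List, Tuple


async def slow_upstream(delay: float) -> str:
    await asyncio.sleep(delay)  # simulated upstream latency, no thread is blocked
    return "ok"


async def proxy(delay: float, timeout: float) -> str:
    # A timeout keeps one slow interface from holding resources indefinitely.
    try:
        return await asyncio.wait_for(slow_upstream(delay), timeout)
    except asyncio.TimeoutError:
        return "timeout"


async def serve_many(n: int, delay: float, timeout: float) -> Tuple[List[str], float]:
    # All n requests are in flight at once on one thread.
    start = time.perf_counter()
    results = await asyncio.gather(*(proxy(delay, timeout) for _ in range(n)))
    return list(results), time.perf_counter() - start
```

With `asyncio.run(serve_many(200, 0.05, 1.0))`, the 200 requests finish in roughly the time of one, whereas a blocking model would need 200 threads to achieve the same.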
Caching
For idempotent GET requests, the gateway can add a caching layer according to rules specified by the business side, storing responses in a second-level cache such as Redis. Repeated requests can then be answered directly at the gateway layer instead of hitting the business service, which reduces the load on the business side. In addition, if the business side's nodes go down, the gateway can still return its own cached data.
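A minimal sketch of this behavior follows. It keeps entries in an in-process dictionary as a stand-in for Redis, caches only GET requests, and serves a stale entry when the backend call fails. The class name, the TTL policy, and the `fetch` callback are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class GatewayCache:
    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        # url -> (time stored, response body); a dict stands in for Redis here.
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, method: str, url: str, fetch: Callable[[], str]) -> str:
        if method != "GET":
            return fetch()  # only idempotent GET requests are cached
        entry = self._store.get(url)
        now = self.clock()
        if entry and now - entry[0] < self.ttl:
            return entry[1]  # fresh hit: the backend is not called
        try:
            body = fetch()
        except Exception:
            if entry:
                return entry[1]  # backend is down: fall back to the stale copy
            raise
        self._store[url] = (now, body)
        return body
```

Whether serving stale data is acceptable depends on the interface, so in practice this fallback would be something the business side opts into.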
Rate limiting
Rate limiting is a must for every business component. Without it, a surge in requests can easily bring business services down. During big promotions such as Double 11 and Double 12, for example, interface traffic is several times the usual level; without proper capacity evaluation and rate limiting, the whole service can easily become unavailable. Rate limiting policies should therefore be set according to the processing capacity of each business interface. You have probably seen the degraded pages on Taobao or Baidu when grabbing red packets.
Rate limiting policies must therefore be implemented at the access layer. Non-core interfaces can be degraded outright to keep core services available. For core interfaces, limits should be set according to the capacity measured during load testing. Rate limiting comes in several types:
- Single machine. Single-machine limiting performs well: it involves no remote calls, only local counting, so its impact on interface response time (RT) is minimal. However, you need to decide whether the limit is configured per gateway node or for the whole gateway cluster. If it is for the whole cluster, the per-node limit has to be adjusted whenever the gateway cluster scales in or out.
- Distributed. Distributed limiting needs a storage node, such as Redis or Sentinel, to maintain the current call count of each interface. Because remote calls are involved, there is some performance cost. You also need to consider what happens when the store fails; one option is to simply degrade the rate limiting function itself.
There are also different strategies: simple counting, token buckets, and so on. Simple counting is sufficient in most scenarios; if you need to handle burst traffic, use a token bucket. You also need to decide which dimension the limit is based on, such as IP, interface, user, or some value in the request parameters. Expressions can be used here, which is quite flexible.
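The token bucket strategy can be sketched as a single-machine limiter keyed by an arbitrary dimension. In this sketch the key is just a string such as `"ip:1.2.3.4"` or `"user:42"`; how the key is derived from a request (for example by an expression) is left out. Class names and parameters are hypothetical, and the code is not thread-safe.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        # Refill according to elapsed time, capped at the bucket capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiter:
    """One bucket per key, where the key encodes the limiting dimension."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, key: str) -> bool:
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = self._buckets[key] = TokenBucket(self.rate, self.capacity, self.clock)
        return bucket.allow()
```

With `rate=1` and `capacity=2`, a key can burst two requests at once and then sustain one request per second. A distributed variant would keep the token state in shared storage instead of in the local dictionary.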
Stability
Stability is a very important aspect of the gateway. Monitoring and alerting need to be thorough: interface call volume, response time, exceptions, error codes, and success rate; thread pool metrics such as the number of active threads and the queue backlog; and system-level basics such as CPU, memory, and full GC.
Because the gateway is the entrance for all services, its stability requirements are higher than those of other services. Ideally it runs continuously and restarts as rarely as possible. But when a new feature is added or logs are needed to troubleshoot a problem, a release is inevitable. Here you can borrow Zuul's approach: all core functions are implemented as interceptors, and the interceptor code is written in Groovy, stored in a database, and dynamically loaded, compiled, and run. Problems can then be located and fixed right away, and new gateway features only require adding a new interceptor dynamically, with no need to redeploy.
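The dynamic loading idea can be sketched in Python as an analogue of the Groovy mechanism; this is not Zuul's actual API. Interceptor source code arrives as a string (in practice read from a database), is compiled at runtime, and replaces any previous version under the same name without restarting the process. The convention that each source defines a function called `intercept` is an assumption of this sketch.

```python
from typing import Callable, Dict


class InterceptorRegistry:
    def __init__(self) -> None:
        self._interceptors: Dict[str, Callable[[dict], dict]] = {}

    def load(self, name: str, source: str) -> None:
        # Compile the source at runtime; it must define intercept(request).
        namespace: dict = {}
        exec(compile(source, f"<interceptor:{name}>", "exec"), namespace)
        # Loading under an existing name hot-swaps the old version.
        self._interceptors[name] = namespace["intercept"]

    def run(self, request: dict) -> dict:
        # Interceptors run in name order, each receiving the previous result.
        for name in sorted(self._interceptors):
            request = self._interceptors[name](request)
        return request
```

For example, `registry.load("10-tag", "def intercept(request):\n    return {**request, 'tagged': True}\n")` adds a new processing step immediately. Because `exec` runs arbitrary code, the source must come from a trusted, access-controlled store.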
Circuit breaking and degradation
Circuit breakers are also very important. If a service fails or an interface times out badly, a single interface may bring down the whole gateway, so circuit breaking and degradation need to be added. When a specified exception occurs, the gateway directly returns a degraded response for that interface. This can be implemented with Hystrix or Resilience4j.
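To show the mechanism that libraries such as Hystrix and Resilience4j provide, here is a simplified, single-threaded circuit breaker written from scratch; it does not use or mirror either library's API. After a number of consecutive failures the breaker opens and returns the fallback without calling the backend; after a reset timeout it lets one trial call through. The thresholds and names are assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_timeout: float,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast with the degraded response
            # Timeout elapsed: half-open, allow the trial call below.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, the failing backend receives no traffic at all, which protects both the backend and the gateway's own resources.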
Logging
Because all requests pass through the gateway, logging needs to be thorough: elapsed time, requested interface, client IP, request parameters (with sensitive data masked), response parameters, and so on. Since a request may involve many microservices, you also need a unified traceId to tie all the logs together. You can put this traceId in the response header to make troubleshooting easier.
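A small sketch of these two points, the traceId and the masking of sensitive parameters, is shown below. The header name `X-Trace-Id`, the set of sensitive keys, and the log field names are all assumptions chosen for illustration.

```python
import json
import uuid
from typing import Dict

# Hypothetical list of parameter names that must never appear in logs.
SENSITIVE_KEYS = {"password", "phone", "id_card"}


def mask(params: Dict[str, str]) -> Dict[str, str]:
    return {k: "***" if k in SENSITIVE_KEYS else v for k, v in params.items()}


def ensure_trace_id(headers: Dict[str, str]) -> str:
    # Reuse an incoming id so logs from every service can be joined on it.
    return headers.get("X-Trace-Id") or uuid.uuid4().hex


def access_log(trace_id: str, path: str, client_ip: str,
               params: Dict[str, str], elapsed_ms: float) -> str:
    # One structured line per request, with parameters masked before logging.
    return json.dumps({
        "traceId": trace_id,
        "path": path,
        "clientIp": client_ip,
        "params": mask(params),
        "elapsedMs": elapsed_ms,
    }, sort_keys=True)
```

The gateway would pass the same id to backend calls and also set it on the response header, so that a user-reported id leads straight to the relevant log lines.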
Isolation
Isolation can be done at the application layer, for example with separate thread pools, HTTP connection pools, and Redis. It can also be done at deployment time according to the business scenario: core services use a dedicated gateway cluster, separated from non-core services.
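Application-layer isolation is often described as a bulkhead. The sketch below gives each backend service its own limit on in-flight calls, so a slow service can exhaust only its own quota. It uses `asyncio` counters instead of real thread pools or connection pools, and the service names and limits are made up.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


class Bulkhead:
    def __init__(self, limits: Dict[str, int]):
        self._limits = limits
        self._in_flight = {name: 0 for name in limits}

    async def call(self, service: str, func: Callable[[], Awaitable[str]]) -> str:
        # Reject immediately when this service has used up its own quota.
        if self._in_flight[service] >= self._limits[service]:
            return "rejected"
        self._in_flight[service] += 1
        try:
            return await func()
        finally:
            self._in_flight[service] -= 1


async def demo() -> List[str]:
    bulkhead = Bulkhead({"slow": 2, "core": 2})

    async def slow() -> str:
        await asyncio.sleep(0.05)
        return "slow-done"

    async def fast() -> str:
        return "core-done"

    # Three calls to the slow service exceed its limit of two,
    # but the core service is unaffected.
    return list(await asyncio.gather(
        *(bulkhead.call("slow", slow) for _ in range(3)),
        bulkhead.call("core", fast),
    ))
```

Deployment-level isolation follows the same principle at a larger scale: the quota becomes an entire gateway cluster reserved for core services.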
Gateway Management and Control Platform
The management platform is also very important. You need to consider the user experience across the whole workflow. The process of connecting an interface to the gateway, for example, should be as simple and intelligent as possible. For a Dubbo interface, the platform can fetch the source code from the Git repository and parse the corresponding classes and methods to fill in the form automatically, saving users as much work as possible. An interface also typically moves from testing to pre-release to production; filling out the form again at each stage is tedious, so this promotion should be automated. If the gateway is deployed across multiple availability zones, or even different countries, interface data also needs to be synchronized; otherwise users have to repeat the operation in every console.
My personal suggestion is to refer directly to the gateway services provided by Alibaba Cloud and AWS, which are very comprehensive.
Other
There are other points to consider, such as interface mocking, documentation generation, SDK code generation, unified error codes, and service governance, which are not covered here.
Conclusion
A gateway is still a centralized architecture: every request has to pass through it, so during big promotions or traffic surges it can become the performance bottleneck. Once a large number of interfaces are connected, traffic evaluation is not easy either. Before every big promotion you have to load test together with the business teams to estimate the rough capacity and scale out the gateway, and because the gateway handles all traffic, its capacity is hard to estimate accurately. An alternative is the currently popular service mesh, a decentralized solution in which the gateway logic is pushed down into a sidecar. The sidecar is deployed on the same node as the application and takes over the application's inbound and outbound traffic. Upgrades are also smoother. With a centralized gateway, even with a gray (canary) release, in principle all business traffic eventually flows through the new gateway version, so a bug affects every business. With the decentralized approach you can upgrade non-core businesses first, observe for a while, and then roll out fully. A service mesh is also friendlier to multi-language support.