One, preface

In recent years, Spring Cloud has become the mainstream technology stack for microservice development and is very popular in the Chinese developer community. Over the same period I have been practicing microservice architecture at front-line Internet companies (Ctrip, PPDAI, and others). Based on that hands-on experience and my own investigation of Spring Cloud, I believe some components of the Spring Cloud stack are still far from production grade. For example, Spring Cloud Config and Spring Cloud Sleuth were both developed in-house by Pivotal, have not yet been proven in large-scale enterprise production, and lack many enterprise-level features (described later in this article). The Spring Cloud stack also lacks some key microservice infrastructure components, such as metrics monitoring and health checks with alarms. Therefore, taking the Spring Cloud stack as a reference and combining my own experience with the open source practices of first-tier Internet companies at home and abroad (such as Netflix, Dianping, Ctrip, and Zalando), I propose a lightweight microservice reference technology stack that better fits the technology and culture of Chinese companies. I hope this reference stack helps front-line architects (and startups) avoid detours and implement a microservice architecture quickly.

The reference stack and overall architecture are shown below:



It mainly contains 11 core components, which are:

Core supporting components:

  1. Service gateway: Zuul

  2. Service registration and discovery: Eureka + Ribbon

  3. Service configuration center: Apollo

  4. Authentication and authorization center: Spring Security OAuth2

  5. Service framework: Spring MVC/Boot

Monitoring and feedback components:

  1. Data bus: Kafka

  2. Log monitoring: ELK

  3. Call chain monitoring: CAT

  4. Metrics monitoring: KairosDB

  5. Health check and alarm: ZMon

  6. Rate limiting, circuit breaking, and stream aggregation: Hystrix/Turbine

Two, core support components

2.1 Service gateway Zuul

InfoQ interviewed Adrian Cockcroft, former Director of Architecture at Netflix, around 2013 [Appendix 1]. Adrian was asked: “Of all of Netflix’s open source projects, which one do you think is the most indispensable?” He replied: “It’s easy to overlook one of Netflix’s most powerful infrastructure services among the open source projects: the Zuul gateway service. The Zuul gateway is primarily used for intelligent routing, but it also supports authentication, region- and content-aware routing, and aggregating multiple underlying services into a unified external API. One of the highlights of the Zuul gateway is that it is dynamically programmable and its configuration takes effect in seconds.” Adrian’s answer gives a sense of how important the Zuul gateway is to a microservice infrastructure.



Zuul is the name of a monster in English, and a Zuul also appears among the Zerg in StarCraft. Netflix named the gateway Zuul to evoke a beast that guards the gate.

The Zuul gateway was proven in production at Netflix and has seen many successful deployments in the community since it was integrated into Spring Cloud. It has been rolled out successfully at Ctrip (daily traffic exceeds 5 billion requests), PPDAI, and other companies, which makes it the first choice of gateway in a microservice infrastructure. Other open source products such as Kong or Nginx can also be adapted to serve as a gateway, but the complexity threshold is higher.

The Zuul gateway is not fully asynchronous, but its synchronous model keeps it simple, lightweight, and easy to program and extend. The synchronous model does need solid circuit breaking (in conjunction with the circuit breaker component Hystrix); otherwise it can lead to resource exhaustion and even avalanche failures.

2.2 Service Registration and Discovery Eureka + Ribbon

For microservice registration and discovery, Netflix Eureka is at present the only open source product in the community that has been verified under production-level high traffic. It has also been included in Spring Cloud and has many successful deployments in the community. For example, the Ctrip Apollo configuration center uses Eureka for software load balancing. Other products, such as Zookeeper, Etcd, and Consul, are more general-purpose and need further packaging and customization before they can be used in production for this purpose. Eureka supports high availability across data centers, but it is an eventually consistent AP system, not a strongly consistent one.



Eureka is the exclamation Archimedes made when he discovered the principle of buoyancy while taking a bath; in microservices the name alludes to discovery.

Ribbon is a client-side software load balancing library that integrates with Eureka. Together with Eureka, Ribbon supports flexible dynamic routing and load balancing policies. A Ribbon client can connect directly to internal microservices. Ribbon can also be deployed on the gateway, in which case the gateway acts as a super client with routing and software load balancing capabilities.
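The idea of client-side load balancing against a registry can be sketched in a few lines. This is a minimal illustration only: `ServiceRegistry` and `RoundRobinBalancer` are hypothetical names, the registry is an in-memory stand-in, and nothing here reflects the actual Eureka or Ribbon APIs.

```python
from typing import Dict, List


class ServiceRegistry:
    """In-memory stand-in for a service registry (hypothetical, not Eureka's API)."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[str]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, []).append(address)

    def deregister(self, service: str, address: str) -> None:
        self._instances.get(service, []).remove(address)

    def lookup(self, service: str) -> List[str]:
        # Return a copy so callers cannot mutate the registry's state.
        return list(self._instances.get(service, []))


class RoundRobinBalancer:
    """Client-side balancer: asks the registry, then rotates through instances."""

    def __init__(self, registry: ServiceRegistry) -> None:
        self._registry = registry
        self._counters: Dict[str, int] = {}

    def choose(self, service: str) -> str:
        instances = self._registry.lookup(service)
        if not instances:
            raise LookupError(f"no instances registered for {service!r}")
        index = self._counters.get(service, 0)
        self._counters[service] = index + 1
        return instances[index % len(instances)]
```

Because the client looks up the instance list on every call, an instance that deregisters simply stops being chosen. A real client would usually cache the list and refresh it periodically, which is where the eventual consistency mentioned above becomes visible.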



2.3 Service Configuration Center Apollo

Spring Cloud includes a Spring Cloud Config product, but its functionality is far from production grade. It is only suitable for small-scale scenarios and is not recommended for medium and large enterprises. Apollo is an open source configuration center from Ctrip that has also been adopted by a number of other Internet companies. It has been open source for more than two years and now has more than 4K stars on GitHub. It is very successful, and its complete documentation is another highlight. Apollo provides a comprehensive management interface, multiple environments, configuration changes that take effect in real time, and production-level features such as permissions and configuration audits. Apollo can be used both for general configuration, such as connection strings, and for advanced scenarios, such as publishing feature flags and service configurations.
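To illustrate what "configuration changes that take effect in real time" means for application code, here is a minimal in-process sketch. `ConfigCenter`, `add_listener`, and `publish` are hypothetical names chosen for this example and do not correspond to the Apollo client API; a real configuration center would push changes over the network.

```python
from typing import Callable, Dict, List

# A listener receives (key, old_value, new_value); this shape is an assumption.
ChangeListener = Callable[[str, str, str], None]


class ConfigCenter:
    """Toy configuration store that notifies listeners when a value changes."""

    def __init__(self, initial: Dict[str, str]) -> None:
        self._values = dict(initial)
        self._listeners: List[ChangeListener] = []

    def get(self, key: str, default: str = "") -> str:
        return self._values.get(key, default)

    def add_listener(self, listener: ChangeListener) -> None:
        self._listeners.append(listener)

    def publish(self, key: str, value: str) -> None:
        old = self._values.get(key, "")
        if old == value:
            return  # nothing changed, so nobody is notified
        self._values[key] = value
        for listener in self._listeners:
            listener(key, old, value)
```

An application would register a listener for a feature flag such as `feature.new_checkout` and switch behavior when the callback fires, without a restart.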



Apollo is the Greek god of the sun

2.4 Authentication and Authorization Center Spring Security OAuth2

At present, the open source community has no particularly mature security and authentication center product for microservices. The medium and large Internet companies I have worked for, such as Ctrip and Vipshop, have basically built their own customized solutions in this area. For ordinary enterprises, however, building a customized in-house product is still a high threshold. OAuth2 is a token-based authorization framework supported by many big companies (Google, Facebook, Twitter, Microsoft, and others). It can be considered the de facto security protocol standard for microservices, and it suits federated login on open platforms, modern microservice security (including single-page browser apps, native mobile apps, server-side web apps accessing microservices, and service-to-service calls), and authentication and authorization for internal enterprise applications (IAM/SSO), among other scenarios.

Spring Security OAuth2 is an extension of Spring Security that supports the four major OAuth2 flows, and it is basically the recommended product for a microservice authentication and authorization center. However, Spring Security OAuth2 is only a framework, not an end-to-end, out-of-the-box product. Enterprise applications still need to customize it before it is production grade, for example by providing a web management interface, connecting to the enterprise’s internal user authentication and login system, caching tokens, and integrating with the microservice gateway.



2.5 Service Framework Spring Boot

Spring is arguably one of the most successful web application and API development frameworks in history. It incorporates years of best practices from the Java community, and despite being nearly 15 years old, its community activity is still on the rise. Spring Boot is a further layer on top of Spring that provides convenient starter projects, standalone startup, automatic dependency management, code-based configuration, and other features that lower the barrier to entry. In addition, Spring Boot provides production-level monitoring features such as Actuator and supports a DevOps style of development, so it is recommended as the preferred microservice development framework.

Swagger, the REST contract specification, integrates well with Spring, so Spring also supports the contract-driven development model. For large enterprises with many business teams, I recommend contract-driven development to control interoperability and integration costs: define the Swagger contract first, generate the service interface and client from the contract, and then implement the server-side business logic. This model yields standardized interfaces and lowers integration costs between systems, which matters a great deal for collaborative, parallel development across teams.




Three, monitoring and feedback components

3.1 Data bus Kafka

Kafka, originally developed and widely used inside LinkedIn and later open-sourced through Apache, is the industry’s standard data bus and can be found in almost every Internet company. Kafka is one of the most successful open source projects in the world. Its founders left LinkedIn to form Confluent, an enterprise software company that provides complementary and value-added services around Kafka. On the monitoring side, Kafka can collect, store, and forward data such as logs and metrics, adding a large buffer in the middle to cope with high-volume log data. Beyond collecting log and monitoring data, Kafka is widely used in business big data analysis and IoT scenarios. With appropriate customization and enhancement, Kafka can also be used in traditional messaging middleware scenarios.

Kafka features large capacity, high throughput, high availability, replayable data consumption, horizontal scalability, and support for consumer groups. Kafka is especially suitable for big data logging scenarios that do not demand strict real-time delivery or zero data loss.
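Two of these features, consumer groups and replayable consumption, follow from one design idea: the broker keeps an append-only log, and each consumer group only tracks its own offset into it. The sketch below illustrates that idea with a single in-memory partition. `TopicLog` and its methods are hypothetical and are not the Kafka client API.

```python
from typing import Dict, List


class TopicLog:
    """Append-only log with one read offset per consumer group (single partition)."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}

    def append(self, record: str) -> int:
        """Store a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> List[str]:
        """Return the next batch for this group and advance only its offset."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group: str, offset: int) -> None:
        """Rewind or fast-forward a group; records are never deleted by reads."""
        if not 0 <= offset <= len(self._records):
            raise ValueError(f"offset {offset} out of range")
        self._offsets[group] = offset
```

In this sketch a log indexer and a metrics pipeline can read the same records independently, and either one can call `seek(group, 0)` to reprocess everything after a bug fix.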



The three Kafka founders, who left LinkedIn to start the Kafka-based company Confluent

3.2 Log Monitoring ELK

ELK (Elasticsearch/Logstash/Kibana) is the standard technology stack for log monitoring and can be found in almost every Internet company. Ctrip is reportedly the largest ELK user in China, with log data growing by 80 to 90 TB per day. ELK is very mature and works largely out of the box; the remaining work is mainly operations, governance, and tuning. ELK is usually used together with Kafka because tokenizing and indexing logs is time-consuming. Kafka acts as a front buffer that smooths out traffic peaks and absorbs the mismatch between peak log volume and the rate of consumption (tokenizing and indexing). Once the inverted index is built, log retrieval is very fast, so fast and flexible log retrieval is the greatest benefit of Elasticsearch. ELK also offers enterprise features such as large capacity, high throughput, high availability, and horizontal scaling.
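The trade-off described here, expensive indexing in exchange for fast retrieval, can be seen in a toy inverted index. This is a simplified sketch, not how Elasticsearch is implemented: the `tokenize` rule and the `InvertedIndex` class are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(line: str) -> List[str]:
    """Lowercase the line and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", line.lower())


class InvertedIndex:
    """Maps each term to the set of log lines (by id) that contain it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[int]] = defaultdict(set)
        self._docs: List[str] = []

    def add(self, line: str) -> int:
        # Indexing cost is paid here, once per log line.
        doc_id = len(self._docs)
        self._docs.append(line)
        for term in tokenize(line):
            self._postings[term].add(doc_id)
        return doc_id

    def search(self, query: str) -> List[str]:
        # A query only intersects posting sets; it never scans the raw logs.
        terms = tokenize(query)
        if not terms:
            return []
        ids = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return [self._docs[i] for i in sorted(ids)]
```

Searching for `error payment` returns only the lines that contain both terms, regardless of how many other lines were indexed.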

In the early days of a startup, given resource and time constraints, call chain monitoring and metrics monitoring may not be the first priority, but ELK must be set up from the start. Application log data must be collected and indexed, which covers most troubleshooting scenarios (business issues, performance, program bugs, and so on). The key to using ELK well is governance. You need to set rules (for example, only collect logs at Warn level or higher) and monitor the volume of log data each application produces. Otherwise, developers will abuse ELK and throw any garbage data into it, which wastes a large amount of space and may even cause performance and availability problems.
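A rule such as "only collect Warn level or higher" can be enforced at the point where logs leave the application. The sketch below uses Python's standard `logging` module; `ListHandler` is a hypothetical in-memory stand-in for whatever ships logs to Kafka or Logstash, and `build_logger` is a helper name chosen for this example.

```python
import logging
from typing import List


class ListHandler(logging.Handler):
    """Stand-in for a log shipper: keeps formatted records in memory."""

    def __init__(self, level: int = logging.NOTSET) -> None:
        super().__init__(level)
        self.shipped: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.shipped.append(self.format(record))


def build_logger(name: str, shipper: logging.Handler) -> logging.Logger:
    """Create a logger whose shipping handler drops everything below WARNING."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)      # the application may log at any level
    logger.propagate = False            # keep the example isolated from root handlers
    shipper.setLevel(logging.WARNING)   # governance rule: ship WARN and above only
    shipper.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    logger.addHandler(shipper)
    return logger
```

Putting the threshold on the handler rather than the logger means a separate local handler could still keep debug output on disk while only warnings and errors reach the central stack.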



3.3 Call Chain Monitoring CAT

Spring Cloud supports Zipkin-based call chain monitoring. Based on my practical experience, I do not consider Zipkin an enterprise-level call chain monitoring product; it is at best a semi-finished product that lacks many important enterprise features. Zipkin was first developed by Twitter after digesting Google’s Dapper paper and has been applied successfully inside Twitter. However, when it was open-sourced, many important statistical report functions were left out (because they relied on heavyweight big data analysis platforms), so what was released is only a semi-finished product. Call chains can easily be queried and visualized, but the fine-grained call performance reports were not open-sourced.

Google began developing its call chain monitoring system, Dapper, around 2007, but eBay had its own call chain monitoring system, CAL (Centralized Application Logging), much earlier (around 2002). The Google and eBay designs are similar in concept, with some differences. CAL has been widely used at eBay and is known as one of eBay’s four magic tools (DAL, Messaging, and SOA are the other three). Qimin Wu, the author of the open source call chain monitoring system CAT (I used to work with him and call him Lao Wu), worked at eBay for nearly a decade and absorbed CAL’s design. After 2011, Lao Wu left eBay for Dianping and spent three years building a new call chain monitoring product there, CAT (Centralized Application Tracking). CAT carries CAL’s genes and shadow, but it also incorporates Lao Wu’s own exploration, practice, and innovation.

CAT is a more complete enterprise-level call chain monitoring product, and it even approaches the category of Application Performance Management (APM) products. It supports not only querying and visualizing call chains but also fine-grained statistical reports on call performance data. This is the essential difference between CAT and other open source call chain monitoring products on the market. In fact, most of the time developers use CAT to look at performance statistics reports (mainly the CAT Transaction and Problem reports). These reports give developers a tool to measure their own applications and continuously improve performance. CAT also supports application error reporting, self-service alarms, and other functions that are very practical for enterprise monitoring.
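The kind of fine-grained performance report described here boils down to aggregating individual call records by type and name. The sketch below shows that aggregation in its simplest form. `TransactionReport` and `TransactionStats` are hypothetical names; the real CAT reports and client API are considerably richer and are not reproduced here.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TransactionStats:
    """Raw durations and failure count for one (type, name) pair."""
    durations_ms: List[float] = field(default_factory=list)
    failures: int = 0

    @property
    def count(self) -> int:
        return len(self.durations_ms)

    @property
    def avg_ms(self) -> float:
        return sum(self.durations_ms) / self.count if self.count else 0.0

    @property
    def max_ms(self) -> float:
        return max(self.durations_ms, default=0.0)

    @property
    def failure_rate(self) -> float:
        return self.failures / self.count if self.count else 0.0


class TransactionReport:
    """Groups recorded calls by (type, name), e.g. ("URL", "/orders")."""

    def __init__(self) -> None:
        self._stats: Dict[Tuple[str, str], TransactionStats] = defaultdict(TransactionStats)

    def record(self, kind: str, name: str, duration_ms: float, ok: bool = True) -> None:
        stats = self._stats[(kind, name)]
        stats.durations_ms.append(duration_ms)
        if not ok:
            stats.failures += 1

    def stats(self, kind: str, name: str) -> TransactionStats:
        return self._stats[(kind, name)]
```

A developer reading such a report sees, per endpoint or SQL statement, how often it was called, how slow it was, and how often it failed, which is the self-measurement loop the paragraph above describes.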

CAT has been rolled out successfully at Dianping, Ctrip, Lufax, PPDAI, and other companies. Because it is a domestic call chain monitoring product, its interface and functions fit local engineering culture better, and it is easier to roll out in domestic companies. I recommend CAT as the first choice for microservice call chain monitoring. As for the invasiveness of CAT that some people in the community raise, I think there are two sides to it: CAT couples to application code, but it delivers better performance in return. Moreover, once an enterprise adopts a call chain monitoring product, it generally does not change it, and developers get used to it, so invasiveness is not a big problem.



3.4 Metrics Monitoring KairosDB

In addition to logs and call chains, metrics are an important focus of application monitoring. Internet applications advocate Metrics Driven Development: developers should not only implement features and write good unit tests (TDD), but also instrument the business layer (registrations, logins, order counts, and so on) and the application layer (call counts, call latency, and so on). This also reflects DevOps, which requires developers to pay attention to operations requirements, and monitoring instrumentation is a production-level operations requirement.

Metrics monitoring products rely on an underlying time series database (TSDB). Some of the most popular open source products recently are Prometheus and InfluxDB, which have a good number of community users and positive feedback. However, these products are weak in distributed capability and have a high threshold for customization and extension, so they are generally recommended for small startups. Once the business and the team grow to a certain stage, I recommend considering a time series product with distributed capabilities, such as KairosDB or OpenTSDB. I have practical experience with both. KairosDB is based on Cassandra and is relatively lightweight. If your company already uses Hadoop/HBase, OpenTSDB is a good choice.

KairosDB is also commonly used with Kafka, which acts as a front buffer. When using KairosDB, tag values should not be too discrete (that is, avoid high-cardinality tags); otherwise there will be query performance problems. This is related to KairosDB’s underlying storage structure. Grafana is the standard tool for displaying metrics and integrates seamlessly with KairosDB.
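Why discrete tag values hurt can be shown with simple counting. The sketch below assumes that each distinct combination of metric name and tag values forms its own series, which is common in tagged time series databases; it does not reproduce KairosDB’s actual storage layout, and `series_key` and `count_series` are hypothetical helpers.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[str, Dict[str, str]]  # (metric name, tags); values omitted for brevity


def series_key(metric: str, tags: Dict[str, str]) -> Tuple[str, Tuple[Tuple[str, str], ...]]:
    """Identify a series by its metric name plus its sorted tag pairs."""
    return (metric, tuple(sorted(tags.items())))


def count_series(points: Iterable[Point]) -> int:
    """Count how many distinct series a stream of data points would create."""
    return len({series_key(metric, tags) for metric, tags in points})
```

With tags such as `host` and `status`, two hosts and two statuses yield at most four series no matter how many points arrive. Tagging every point with a unique value such as a request id creates one series per point, and a query over the metric then has to touch all of them.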



Grafana is the standard tool for displaying metrics and can be integrated with mainstream time series databases

3.5 Health Check and Alarm ZMon

In addition to the monitoring methods above, we still need a health check and alarm system as a supporting measure. ZMon is an open source health check and alarm platform developed by Zalando, a German e-commerce company. It provides powerful and flexible alarm monitoring capabilities. ZMon is essentially a distributed platform for scheduling monitoring tasks. It provides a wide range of check scripts (and you can write your own extensions) that run regular health checks and raise alarms against various hardware resources or target services, such as HTTP ports, Spring’s Actuator endpoints, metrics in KairosDB, and error logs in ELK. Its alarm logic and policies are implemented as Python scripts, so developers can set up alarms on a self-service basis. ZMon is applicable to monitoring and alarms at the system, application, service, and even end-user experience layers.
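The split between a check (collect a value from each target) and an alarm (a condition evaluated on that value) can be sketched as follows. This is an illustration of the concept only: `Alert` and `run_check` are hypothetical names and do not reflect ZMon’s actual check or alert definition format.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alert:
    """An alarm is a name plus a condition over the value a check returns."""
    name: str
    condition: Callable[[float], bool]


def run_check(
    check: Callable[[str], float],
    entities: List[str],
    alerts: List[Alert],
) -> List[str]:
    """Run one check against every entity and report which alarms fire."""
    fired: List[str] = []
    for entity in entities:
        value = check(entity)
        for alert in alerts:
            if alert.condition(value):
                fired.append(f"{alert.name} on {entity}: {value}")
    return fired
```

A scheduler would call `run_check` at a fixed interval. Because the condition is ordinary Python, a team can add or adjust its own alarms without changing the platform, which is what self-service means in this context.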



Architecture of the ZMon distributed monitoring and alarm system, whose bottom layer is based on the KairosDB time series database

3.6 Rate Limiting, Circuit Breaking, and Stream Aggregation Hystrix+Turbine

Around 2010, Netflix was suffering from cascading failures in its distributed microservice system and started a resilience engineering project to address the problem. Hystrix is one of the eventual products of that project. Once Hystrix was applied to the Netflix microservice system on a large scale, the avalanche effect was basically solved and the whole system became more resilient. Netflix later open-sourced Hystrix, and it quickly received a lot of positive feedback from the community. Hystrix currently has over 13,000 stars on GitHub, and it is said that the system supporting President Obama’s campaign also used Hystrix for circuit breaker protection [see Appendix 2]. Clearly, rate limiting and circuit breaking are a strong requirement for the stability of distributed systems, and Netflix grasped this requirement well and provided a solution verified in production. Hystrix has been incorporated into Spring Cloud and is the first choice for circuit breaking and rate limiting in the Java community (there is no better product in sight).
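The circuit breaker pattern that protects against cascading failure can be sketched in a few dozen lines. This is a simplified illustration, not Hystrix’s implementation: Hystrix trips on error rates within a rolling window, whereas this sketch trips after a number of consecutive failures. `CircuitBreaker`, `CircuitOpenError`, and the parameter names are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout_s:
            return "half-open"  # timeout elapsed: allow one trial call
        return "open"

    def call(self, fn: Callable[[], T], fallback: Optional[Callable[[], T]] = None) -> T:
        if self.state == "open":
            # Fail fast: do not touch the struggling dependency at all.
            if fallback is not None:
                return fallback()
            raise CircuitOpenError("circuit is open")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # Trip on reaching the threshold, or re-trip after a failed trial call.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately instead of tying up threads on a dependency that is already failing, which is how the pattern stops one slow service from exhausting its callers.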

Turbine is a stream aggregation service that works with Hystrix. It aggregates Hystrix monitoring data streams so that you can see the traffic and performance of the whole cluster on the Hystrix Dashboard.



Hystrix means “porcupine” in English, and porcupines protect themselves with their quills. Netflix named the circuit breaker Hystrix for its ability to protect microservice calls.

Four, conclusion

  1. No technology stack is good or bad in itself; what matters is fit. The stack recommended in this article is based on my own practice and summary, but it may not be appropriate for every scenario, because the context of each enterprise is different. As an architect, you can use the stack I recommend as a reference, but you cannot simply copy it. You have to apply it flexibly, based on a deep understanding of the principles of distributed systems and of your enterprise’s actual scenarios.

  2. The technology stack recommended in this article mainly covers microservice infrastructure. A complete Internet technology platform also involves other important topics, such as messaging, task scheduling, the data access layer, the release system, the container cloud platform, distributed transactions, distributed consistency, testing, and CI/CD.