1. Introduction
Service Mesh is at the core of Ant Financial's next-generation architecture. After two years of refinement, we explored a feasible solution, and it finally passed the test of Double Eleven. This article mainly shares our thinking and practice in product design at the current "crossroads", in the hope of bringing some inspiration to everyone.
2. Why do we need Service Mesh?
2.1 Microservice governance and business logic decoupling
Before Service Mesh, the usual way to build a microservice system was for the middleware team to provide an SDK that business applications integrate. The SDK bundles various service governance capabilities, such as service discovery, load balancing, circuit breaking, rate limiting, and service routing.
At runtime, the SDK and the business application code are mixed together in the same process with a very high degree of coupling, which brings a series of problems:
- High upgrade cost
  - Every upgrade requires the business application to bump the SDK version and release again.
  - When the business is racing ahead, teams are reluctant to stop and do work that is not directly relevant to their business goals.
- Serious version fragmentation
  - Because upgrades are costly while the middleware keeps evolving, the SDK versions online become inconsistent over time and their capabilities uneven, making unified governance difficult.
- Difficult middleware evolution
  - Because of the serious version fragmentation, the middleware has to stay compatible with all kinds of legacy logic as it evolves; wearing these "shackles", it cannot iterate quickly.
With Service Mesh, we can take most of the SDK's capabilities out of the application, split them into separate processes, and deploy them in Sidecar mode. By sinking service governance capabilities into the infrastructure, the business can focus more on business logic, the middleware team can focus more on common capabilities, and both can truly evolve independently, upgrade transparently, and improve overall efficiency.
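To make the decoupling concrete, here is a minimal sketch of the application's view after the migration, written in Go with hypothetical names and ports: instead of linking a fat governance SDK, the application simply sends its outbound requests to the Sidecar listening on localhost and lets the Sidecar handle discovery, load balancing, circuit breaking, and routing. The port 20882 and the `X-Target-Service` header are illustrative assumptions, not actual MOSN configuration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// callViaSidecar shows the application's view after the Mesh migration:
// instead of an in-process SDK that resolves addresses and balances load,
// the app sends every outbound request to the sidecar listening on
// localhost. The sidecar (e.g. MOSN) performs service discovery, load
// balancing, circuit breaking and routing on the app's behalf.
// The port and header name below are illustrative assumptions.
func callViaSidecar(service, path string) (string, error) {
	req, err := http.NewRequest("GET", "http://127.0.0.1:20882"+path, nil)
	if err != nil {
		return "", err
	}
	// Tell the sidecar which logical service this request targets.
	req.Header.Set("X-Target-Service", service)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	out, err := callViaSidecar("com.example.HelloService", "/sayHello")
	if err != nil {
		fmt.Println("call failed:", err)
		return
	}
	fmt.Println(out)
}
```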
2.2 Unified Management of Heterogeneous Systems
As new technologies emerge and staff turn over, applications and services written in different languages and frameworks often coexist in the same company. To manage these services in a unified way, the previous practice was to develop a full SDK for each language and framework, which was costly to maintain and also posed a big challenge to the staffing of the middleware team.
With Service Mesh, the main service governance capabilities sink into the infrastructure, so multi-language support becomes much easier: only a very lightweight SDK is needed, and in many cases no separate SDK at all, to meet the requirements of unified traffic control, monitoring, and management across languages and protocols.
2.3 Financial level network security
Many companies still build their microservices on the assumption of "intranet trust", but this assumption looks less and less appropriate in the current context of large-scale cloud adoption, especially in financial scenarios.
With Service Mesh, it becomes much easier to implement application identity and access control; combined with data encryption, we can achieve trust across the full call chain, so that services can run in a zero-trust network and the overall security level is raised.
3. Thinking at the current "crossroads"
3.1 Cloud Native Solution?
Because of these benefits, the community has paid more and more attention to Service Mesh over the past two years, and many excellent Service Mesh products have emerged, with Istio as the benchmark among them.
Istio, with its forward-looking design and cloud native concepts, is admirable and something to aspire to. However, a deeper look reveals that there are still some gaps to close before it can be put into production at this stage.
3.2 Greenfield vs Brownfield
Before we start our discussion, let’s take a look at a cartoon.
[Cartoon: a worker on a Greenfield and a worker on a Brownfield]
The cartoon above depicts a scene like this:
- There are two workers. One is working on a Greenfield, and the other is working on a Brownfield.
- The worker on the green grass says to the worker on the brown field: "If you hadn't dug yourself into such a deep hole, you could have built something wonderful and new, like me."
- The worker on the brown field replies: "Come down and try it!"
This is a very interesting cartoon. On the surface we could say the worker on the green grass finds it easy to talk, but the essential reason is that the two are in different environments.
Working on a piece of undeveloped land is really comfortable: there is plenty of space, there are no constraints around you, and you can apply all kinds of new technologies and ideas, much like the brand-new towns and districts built in our country over recent decades. Working on land that has already been developed is very different: the environment imposes all kinds of constraints. There may be all sorts of pipes and cables underground that are easy to break, and with buildings all around, a careless dig may undermine a neighboring structure. So you have to act very carefully, and your design is subject to all kinds of constraints; you cannot work freely.
For software engineering, the same is true: Greenfield is for new projects or new systems, and Brownfield is for mature projects or legacy systems.
I am sure most programmers, myself included, like to work on new projects: you can use new technologies and new frameworks, you can design the system the way you want, and you have a lot of freedom. Developing and maintaining a mature project is a different story. On the one hand, the project is already running stably and its logic is very complex, so it cannot easily be switched to new technologies or frameworks, and new features have to make many compromises with the existing architecture and code. On the other hand, predecessors may have unknowingly dug many pits, and you can fall into one without noticing, so you have to tread very carefully, especially when making big architectural changes.
3.3 Realistic Scenarios
3.3.1 Brownfield applications dominate
In reality, we find that most companies have not yet gone cloud native, or are just beginning to explore it, so a large number of applications actually run on non-K8s architectures, for example on virtual machines, or as microservices built on an independent service registry.
While a small number of Greenfield applications are indeed already built on cloud native, the reality is that the large number of Brownfield applications are the backbone of a company's business and carry the greater business value. How to bring them into the Service Mesh so that it delivers greater value is therefore the more pressing priority.
3.3.2 The cloud native solution is still some distance from production grade
On the other hand, Istio itself still has problems to solve, especially in overall performance (citing Xiao Jian's views in "Ant Financial Service Mesh In-Depth Practice"):
- Mixer
  - Mixer's performance has always been one of the most criticized aspects of Istio.
  - The out-of-process adapter introduced in Istio 1.1/1.2 made things worse.
  - From a deployment perspective, the extremely poor performance of Mixer v1 is already unacceptable for ordinary production-grade adoption, let alone large-scale adoption.
  - The Mixer v2 proposal gave the community hope: merging Mixer into the Sidecar and introducing WebAssembly for adapter extension was expected to be the right way to land Mixer, its future, its "poetry and distance". However, Mixer v2 was slow to start and remained "In Review" for a long time.
- Pilot
  - Pilot was a disaster area hidden in Mixer's shadow: performance attention focused on Mixer for a long time, and the poorly performing, obviously problematic Mixer drew all the fire. But once Mixer is dropped (typically via the configuration switch provided in newer Istio releases to turn Mixer off), Pilot's performance problems quickly surface.
  - In practice we found that Pilot currently has two major problems: 1) it cannot support massive amounts of data; 2) every change triggers a full push, resulting in poor performance.
3.4 Which way should we go at the current "crossroads"?
We all believe that cloud native is the future, our "poetry and distance", but the reality is that Brownfield applications dominate and the cloud native Service Mesh solution is still some distance from production grade. So what should we do at the current "crossroads"?
The answer is:
In fact, as mentioned earlier, we adopted Service Mesh in the first place because of the many benefits the architecture change brings, such as decoupling service governance from business logic, unified governance of heterogeneous languages, and financial-grade network security. We believe these benefits are needed by both Greenfield and Brownfield applications; indeed, at this stage the business value carried by Brownfield applications is far greater than that of Greenfield applications.
Therefore, from a "pragmatic" point of view, we should first explore a feasible solution for the current stage, one that supports not only Greenfield applications but also Brownfield applications, so that the Service Mesh can truly land and generate business value.
4. Product practice at Ant Financial
4.1 Development history and landing scale
The development of Service Mesh in Ant Financial has gone through the following stages:
- Technical pre-research phase: At the end of 2017, we investigated and explored Service Mesh technology and identified it as the future development direction.
- Technical exploration phase: In early 2018, we started developing the Sidecar MOSN in Golang, and in mid-2018 we open-sourced the Istio-based SOFAMesh.
- Small-scale implementation phase: Internal landing began in 2018; the first batch of scenarios replaced the client SDKs of languages other than Java, followed by a small-scale internal pilot.
- Scaled implementation phase: In the first half of 2019, as one of the main parts of Ant Financial's financial-grade cloud native architecture upgrade, it was gradually rolled out to Ant Financial's internal business applications and smoothly supported the 618 promotion.
- External output phase: In September 2019, the SOFAStack dual-mode microservice platform launched on Alibaba Cloud and began public beta, supporting SOFA, Dubbo, and Spring Cloud applications.
- Large-scale implementation phase: In the second half of 2019, it was fully rolled out to the core applications of Ant Financial's internal big promotion. The scale was very large, and it finally supported a "silky smooth" Double 11 promotion.
MOSN was injected into hundreds of thousands of containers; the QPS handled on the day reached tens of millions, with an average processing response time of less than 0.2 ms. We achieved our expectations, initially completed the first step of separating infrastructure from business, and witnessed the speed at which infrastructure can iterate after the Mesh.
4.2 SOFAStack Dual-mode microservice platform
Our Service Mesh product is named the SOFAStack dual-mode microservice platform, where "dual-mode microservices" refers to the combination of traditional microservices and Service Mesh, i.e. "SDK-based traditional microservices" and "Sidecar-based Service Mesh microservices" can achieve the following:
- Interconnection: applications in the two systems can call each other.
- Smooth migration: an application can be migrated between the two systems transparently to the other applications that call it.
- Heterogeneous evolution: with interconnection and smooth migration in place, we can flexibly evolve applications and the architecture according to the actual situation.
On the control plane, we introduced Pilot to deliver configuration (such as service routing rules), while keeping an independent SOFA service registry for service discovery.
On the data plane, we use our self-developed MOSN, which supports not only SOFA applications but also Dubbo and Spring Cloud applications. In terms of deployment modes, we support not only containers/K8s but also virtual machine scenarios.
4.3 Service Discovery in Large-scale Scenarios
The first thing Ant Financial has to consider is how to support a large-scale event like Double 11. As mentioned above, Pilot's own cluster capacity is limited and cannot hold massive amounts of data; moreover, every change triggers a full push, so it cannot cope with service discovery at this scale.
Therefore, our solution is to keep an independent SOFA service registry that supports tens of millions of service instances and second-level pushes, with business applications registering and discovering services by connecting directly to the Sidecar.
4.4 Traffic Hijacking
Another important topic in Service Mesh is how to implement traffic hijacking: making sure that the business application's inbound and outbound service requests are all processed by the Sidecar.
Unlike the community's iptables-based traffic hijacking solutions, our approach is relatively straightforward, as the following steps show:
- Suppose the server runs on 1.2.3.4 and listens on port 20880. First, the server sends a service registration request to its Sidecar, telling it which service to register and the IP + port (1.2.3.4:20880).
- The server-side Sidecar sends a service registration request to the SOFA service registry with the service to register and an IP + port. Note that it does not register the application's port (20880) but a port the Sidecar itself listens on (for example, 20881).
- The caller sends a service subscription request to its Sidecar, telling it which service it wants to subscribe to.
- The caller-side Sidecar pushes a service address to the caller. Note that the IP is the loopback address and the port is one the caller-side Sidecar listens on (for example, 20882).
- The caller-side Sidecar sends a service subscription request to the SOFA service registry with the service to subscribe to.
- The SOFA service registry pushes the service address (1.2.3.4:20881) to the caller-side Sidecar.
After the above service discovery process, traffic hijacking becomes very natural (a minimal sketch of the port substitution follows the list below):
- The caller gets the "server" address 127.0.0.1:20882, so it makes the service call to that address.
- After receiving the request, the caller-side Sidecar parses the request header to identify the target service, then uses the address returned by the service registry to make the real call (to 1.2.3.4:20881).
- After receiving the request, the server-side Sidecar forwards it to the server (127.0.0.1:20880).
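The following Go sketch illustrates the port substitution at the heart of this scheme. The types, function names, and constants are illustrative assumptions, not the real SOFA registry or MOSN APIs: it only shows how the server-side Sidecar rewrites the registered endpoint to its own port, and how the caller-side Sidecar hands the application a loopback address while keeping the real remote address for the actual call.

```go
package main

import "fmt"

// Endpoint is a service address as stored in the registry (illustrative type).
type Endpoint struct {
	Service string
	IP      string
	Port    int
}

// Port layout from the example above.
const (
	appPort           = 20880 // port the real server listens on
	serverSidecarPort = 20881 // inbound port of the server-side sidecar
	callerSidecarPort = 20882 // outbound port of the caller-side sidecar
)

// registerViaSidecar models step 2: the server-side sidecar registers the
// service with the SOFA registry, but substitutes its own inbound port for
// the application's port, so all inbound traffic flows through the sidecar.
func registerViaSidecar(appEp Endpoint) Endpoint {
	return Endpoint{Service: appEp.Service, IP: appEp.IP, Port: serverSidecarPort}
}

// pushToCaller models step 4: the caller-side sidecar keeps the real remote
// address (1.2.3.4:20881) for itself and pushes a loopback address to the
// application, so all outbound traffic also flows through the sidecar.
func pushToCaller(remote Endpoint) (Endpoint, Endpoint) {
	toApp := Endpoint{Service: remote.Service, IP: "127.0.0.1", Port: callerSidecarPort}
	return toApp, remote
}

func main() {
	app := Endpoint{Service: "com.example.HelloService", IP: "1.2.3.4", Port: appPort}

	registered := registerViaSidecar(app) // what the SOFA registry stores
	toApp, remoteAddr := pushToCaller(registered)

	fmt.Printf("registry stores: %s at %s:%d\n", registered.Service, registered.IP, registered.Port)
	fmt.Printf("caller app sees: %s:%d\n", toApp.IP, toApp.Port)
	fmt.Printf("sidecar dials:   %s:%d\n", remoteAddr.IP, remoteAddr.Port)
}
```

The point of the design is that the registry only ever sees Sidecar ports, so both inbound and outbound traffic naturally pass through the Sidecars without any iptables rules.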
One might ask: why not use the iptables solution? The main reason is that iptables performance degrades when too many rules are configured; more importantly, iptables is hard to manage and observe, which makes troubleshooting difficult.
4.5 Smooth Migration
Smooth migration is probably the most critical part of the whole scheme. As mentioned earlier, every company currently has a large number of Brownfield applications, which likely carry the company's most valuable business; even a slight slip can cause losses, and for some core applications a small jolt can cause an outage. Therefore, for an architecture transformation as large as Service Mesh, smooth migration is a must, and gray release and rollback must also be supported.
Thanks to a separate service registry, our smooth migration solution is also straightforward:
1. Initial status
Take a service as an example. Initially, there is a service provider and a service caller.
2. Transparently migrate callers
Our solution does not require either the caller or the service provider to migrate first. Suppose the caller wants to migrate to the Service Mesh first: as long as Sidecar injection is enabled on the caller, the service provider is unaware of whether the caller has migrated. The caller can therefore enable the Sidecar gradually, in gray release, and roll back if there is a problem.
3. Transparently migrate service providers
If the service provider wants to migrate to the Service Mesh first, the caller is unaware of the migration as long as Sidecar injection is enabled on the provider. The provider can likewise enable the Sidecar gradually, in gray release, and roll back if there is a problem.
4. The final state
4.6 Multi-protocol support
Considering the current usage scenarios of most users, in addition to SOFA applications we also support Dubbo and Spring Cloud applications connecting to the SOFAStack dual-mode microservice platform for unified service governance. Multi-protocol support is implemented with the generic X-Protocol, so more protocols can easily be supported in the future (a sketch of the idea follows).
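To illustrate the idea behind a generic multi-protocol layer, here is a hedged Go sketch of a pluggable codec abstraction. The interface and types are hypothetical and heavily simplified, not MOSN's actual X-Protocol API: the point is only that each protocol (SOFA RPC, Dubbo, and so on) has to explain how to decode a frame and expose the service name, while routing and governance stay protocol-agnostic.

```go
package main

import "fmt"

// Frame is one decoded request, reduced to what the mesh needs for routing.
type Frame struct {
	Service string            // logical service name used for routing
	Header  map[string]string // protocol headers, if any
	Payload []byte            // opaque body, passed through untouched
}

// Codec is a hypothetical, simplified pluggable-protocol interface: each
// wire protocol only has to explain how to decode a frame and expose the
// service name; routing, load balancing, and metrics stay protocol-agnostic.
// (MOSN's real X-Protocol interfaces are richer than this sketch.)
type Codec interface {
	Name() string
	Decode(raw []byte) (Frame, error)
}

// dubboCodec is a stand-in; a real implementation would parse the Dubbo
// binary framing instead of returning a fixed service name.
type dubboCodec struct{}

func (dubboCodec) Name() string { return "dubbo" }
func (dubboCodec) Decode(raw []byte) (Frame, error) {
	return Frame{Service: "com.example.DemoService", Payload: raw}, nil
}

// route shows the shared, protocol-agnostic part of the data plane.
func route(c Codec, raw []byte) {
	f, err := c.Decode(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("[%s] routing request for service %s (%d bytes)\n", c.Name(), f.Service, len(f.Payload))
}

func main() {
	route(dubboCodec{}, []byte("hello"))
}
```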
4.7 VM Support
In a cloud native architecture, Sidecar injection, upgrade, and other operational tasks can easily be handled through the K8s webhook/operator mechanism. However, a large number of systems do not run on K8s, so we manage the Sidecar process in agent mode, enabling the Service Mesh to help applications on the old architecture complete their transformation and supporting unified governance of services on both the new and old architectures.
4.8 Product Usability
We have also done a lot of work on product usability. For example, service routing rules and service rate limits can be configured directly in the UI, with no need to write YAML by hand.
The service topology and real-time monitoring can also be viewed in the UI.
4.9 Public beta on Alibaba Cloud
Finally, a small advertisement: the SOFAStack dual-mode microservice platform is now in public beta on Alibaba Cloud, and interested enterprises are welcome to try it.
5. Looking to the future
5.1 Embrace cloud native
At present, we can clearly see the whole industry moving from Cloud Hosted to Cloud Ready to Cloud Native. As mentioned above, we firmly believe that cloud native is the future, our "poetry and distance". Although there are still gaps in the landing process, we believe that as we continue to invest, the gap will become smaller and smaller.
It is also worth mentioning that the fundamental reasons we embrace cloud native are to reduce resource costs, improve development efficiency, and enjoy the ecosystem's dividends. Cloud native itself is not the goal but the means; we must not put the cart before the horse.
5.2 Continue to strengthen Pilot capabilities
In order to better embrace cloud native, we will continue to work with the Istio community to enhance Pilot capabilities.
Recently, drawing on more than a year of thinking and exploration, colleagues at Ant Financial and Alibaba Group jointly proposed a complete solution that integrates the control plane with the traditional service registry/configuration center. It keeps the protocols standardized while strengthening Pilot's capabilities, gradually moving it toward production readiness.
(For more details, see Xiao Jian's "Ant Financial Service Mesh In-Depth Practice"; we will not repeat them here.)
5.3 Transparent hijacking
As mentioned above, Ant Financial implements traffic hijacking based on the service registry. This solution is a good choice in terms of performance, control, and observability, but it is somewhat intrusive to the application (a lightweight registry SDK is required).
Considering that many users are less sensitive to performance and that many legacy systems want unified governance through the Service Mesh, we will also support transparent hijacking in the future and strengthen its manageability and observability.
6. Conclusion
Following the "pragmatic" philosophy, after two years of refinement at Ant Financial, Service Mesh arrived at a feasible solution for the current stage and finally passed the test of Double 11. Along the way we also experienced the benefits of Service Mesh first-hand: for example, MOSN completed dozens of upgrades without the business noticing, and we witnessed the speed at which infrastructure can iterate after the Mesh.
In our judgment, Service Mesh will become the standard solution for microservices in the cloud era, so we will continue to increase our investment in it. Ant Financial and Alibaba Group will participate deeply in the Istio community and work with the community to make Istio the de facto standard for Service Mesh.
Finally, we welcome like-minded partners to join us in building the exciting next generation cloud native architecture!
Financial Class Distributed Architecture (Antfin_SOFA)