Over the past 15 years, Ant Financial has reshaped payments and changed lives, providing services to more than 1.2 billion people around the world, all of it supported by technology. At the 2019 Hangzhou Computing Conference, Ant Financial will share with attendees the technology it has accumulated over those 15 years, as well as its future-oriented financial technology innovation. This is one of the best of those presentations, which we have compiled and will publish on the public account “Financial-Grade Distributed Architecture”.
With the rapid development of Internet technology, we are entering the cloud native era. How should the financial industry embrace cloud native? Over the past two years, Ant Financial has been implementing cloud native in the financial field and has accumulated some practical experience. In what follows, I would like to share how we think about cloud native in the course of Ant's evolution, what problems we encountered when implementing cloud native in the financial field, and how we solved them.
After years of booming cloud computing, getting onto the cloud is no longer the hard part; the next problem is how to use the cloud well and efficiently. The latest RightScale data from 2019 shows that public cloud accounts for 22%, while customers who use only private cloud account for just 3%; most customers use the cloud in a hybrid mode to balance data privacy, security, efficiency, and elasticity.
Looking at the global IT industry, public cloud accounts for only about 10% of the overall foundational IT market, so the market space is still large, and much of the remaining IT market belongs to traditional enterprise customers. One of the main reasons traditional industries fail to take full advantage of public cloud is that their IT systems took a long time to build and many operate their own data centers; others have relatively stable businesses and no strong demand for public cloud. They typically adopt hybrid cloud strategies, keeping core business on the private cloud and placing marginal or innovative business on the public cloud.
These characteristics are also evident in the financial industry, which has two additional characteristics of its own:
- Open, Internet-based business forms: with the development of the Internet and the digital economy, financial institutions must undergo digital transformation and achieve business agility and scenario-based services to cope with the impact of new business models;
- Regulatory and compliance requirements: the business characteristics of the financial industry demand strong isolation and supervision, so the resource-sharing model of the public cloud faces great regulatory challenges.
Therefore, a hybrid cloud strategy is a better fit for financial institutions, a conclusion supported by research: Nutanix reports that hybrid cloud adoption in the global financial industry is growing faster than in other industries, with deployment penetration currently at 21%, compared with a global average of 18.5%.
So what kind of hybrid cloud suits financial institutions? Take Ant's own evolution as an example.
Ant moved to a cloud platform architecture in its fourth-generation architecture, and to cope with the elastic resource demands of bursty businesses under the Internet business model, it evolved a flexible hybrid cloud architecture at the same stage. Ant has now evolved to its fifth-generation, cloud native architecture. How Ant turns its hybrid cloud into a financial-grade hybrid cloud under the cloud native architecture should, I think, offer you some inspiration. Running through this development is one main line: the standards and requirements of Ant's R&D at different stages, including autonomy, cost, security, stability, massive scale, and agility. These are also our requirements for a cloud native architecture in the era of online finance.
From distributed to cloud native: building a financial-grade transaction and payment system
The first step toward a financial-grade online transaction system is a financial-grade distributed architecture. Ant's representative technologies here are SOFAStack and OceanBase, both of which have been commercialized externally with abundant cases. SOFAStack represents how the application layer, or stateless services, can be made scalable and extensible; OceanBase represents how the storage layer, or stateful services such as the database, can be distributed within the architecture. They share four characteristics:
- High availability: a 99.99%+ availability guarantee ensures the system runs continuously without interruption;
- Consistency: data reaches eventual consistency under any abnormal condition, ensuring the safety of funds;
- Scalability: rapid expansion is supported at the application, database, data center, and region levels;
- High performance: storage uses a read/write separation architecture, with full-link performance optimization in the compute engine, delivering near in-memory database performance.
These four key features matter most to financial businesses and need to be implemented end to end, across both applications and storage. Take consistency: a single database can guarantee consistency on its own, but at large scale a single database inevitably becomes a bottleneck, so data is usually split vertically by business domain, such as transactions, payments, and accounts, and stored in different database clusters. To support massive data volumes, each cluster is further sharded and replicated. OceanBase is such a distributed database, and it must implement distributed transactions internally; only then can consistency be guaranteed across the whole distributed architecture.
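To make the consistency requirement concrete, here is a minimal, purely illustrative sketch of a funds transfer across two vertically split account shards. The shard and coordinator classes are hypothetical stand-ins, not OceanBase or SOFAStack APIs; a real system would use a full distributed transaction protocol with persistence and failure recovery.

```java
// Illustrative sketch only: a toy two-phase-style transfer across two account "shards",
// standing in for the distributed transactions a distributed database implements internally.
// All class and method names here are hypothetical.
import java.util.HashMap;
import java.util.Map;

public class TransferSketch {

    // A toy "shard": one account table held by one database cluster.
    static class AccountShard {
        private final Map<String, Long> balances = new HashMap<>();

        AccountShard(String user, long initial) { balances.put(user, initial); }

        // Phase 1: check the precondition; returning false vetoes the whole transaction.
        boolean prepareDebit(String user, long amount) {
            return balances.getOrDefault(user, 0L) >= amount;
        }

        // Phase 2: apply changes only after every participant voted yes.
        void commitDebit(String user, long amount)  { balances.merge(user, -amount, Long::sum); }
        void commitCredit(String user, long amount) { balances.merge(user, amount, Long::sum); }

        long balanceOf(String user) { return balances.getOrDefault(user, 0L); }
    }

    // Coordinator: funds move either on both shards or on neither.
    static boolean transfer(AccountShard from, String payer,
                            AccountShard to, String payee, long amount) {
        if (!from.prepareDebit(payer, amount)) {
            return false; // abort: nothing was applied anywhere
        }
        from.commitDebit(payer, amount);
        to.commitCredit(payee, amount);
        return true;
    }

    public static void main(String[] args) {
        AccountShard east = new AccountShard("alice", 100);
        AccountShard south = new AccountShard("bob", 10);
        System.out.println("transfer ok: " + transfer(east, "alice", south, "bob", 30));
        System.out.println("alice=" + east.balanceOf("alice") + ", bob=" + south.balanceOf("bob"));
    }
}
```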
In terms of scalability, some systems claim a distributed architecture but in fact only apply a microservice framework and service-oriented transformation at the application layer, while the database layer uses neither horizontal scaling techniques nor a distributed database, so the scalability of the whole system is limited by this shortcoming at the data layer.
Therefore, a truly distributed system needs end-to-end distribution to achieve unlimited scalability and high performance, while a truly financial-grade distributed system also needs end-to-end high availability and consistency.
In our opinion, the most critical goal of a high-availability architecture is to keep both data and business uninterrupted. On that basis we designed and implemented a multi-active architecture across three regions and five data centers. Its core advantages include city-level disaster recovery, low-cost transactions, unlimited scalability, RPO = 0, and RTO < 30s. As you may know, we demonstrated cutting a fiber-optic cable live at last year's Computing Conference, showing at the architectural level how cross-city active-active operation and disaster recovery work. On top of high availability, we have also done a lot of work on technical risk; in short, we must further guarantee the safety of funds, immunity to faulty changes, and quick recovery from failures.
In fact, security is the most frequently raised topic at the financial grade. In the cloud native era, what we need to address is full-link, end-to-end security risk, which can be divided into three levels:
- Cloud native network security, including policy-based efficient traffic control, full link encryption, traffic hijacking and analysis;
- Cloud native infrastructure security, including secure containers, non-shared kernels, and security sandboxes;
- Cloud native service security, including the SOFAEnclave confidential computing middleware and Occlum, a memory-safe, multi-tasking enclave LibOS.
My colleague covers this part in detail in his presentation “Cloud Native Security Architecture for Financial Services”. To summarize, financial-grade capability chiefly means achieving end-to-end, financial-grade high availability and end-to-end security at the same time. Next I would like to share some of the issues we encountered in moving toward cloud native.
From unitized to elastic architecture: coping with explosive Internet traffic pulses
First, let me explain what unitization means. You are probably familiar with sharding at the database layer, which solves the performance problem of centralized storage and compute. The core idea of unitization is to push sharding forward from the data to the incoming request: user requests are sharded at the data center's network access layer by some dimension, such as user ID, which effectively treats each data center as one shard of a huge stateful database. If your user ID ends in 007 or 008, then by the time your request reaches us through the mobile app or a web domain, the access layer already knows whether to route you to East China or South China. Once a request enters a data center in a region, most of its processing can be completed inside that data center. Occasionally a service makes a cross-data-center call, for example when a user in data center A transfers money to a user in data center B, and such cases require careful stateful design.
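As a toy illustration of this routing idea, the sketch below maps the last two digits of a user ID to a cell. The cell names and ID ranges are invented for illustration and are not Ant's actual routing rules.

```java
// Illustrative sketch of unitized routing at the access layer: the last digits of the
// user ID decide which cell (data center / zone) handles the request.
import java.util.Map;

public class CellRouter {

    // Hypothetical mapping: each cell owns a contiguous range of user-ID suffixes 00-99.
    private static final Map<String, int[]> CELLS = Map.of(
            "EastChina-Cell-1", new int[]{0, 49},
            "SouthChina-Cell-1", new int[]{50, 99});

    static String routeByUserId(String userId) {
        int suffix = Integer.parseInt(userId.substring(userId.length() - 2));
        return CELLS.entrySet().stream()
                .filter(e -> suffix >= e.getValue()[0] && suffix <= e.getValue()[1])
                .map(Map.Entry::getKey)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no cell owns suffix " + suffix));
    }

    public static void main(String[] args) {
        // A user whose ID ends in 07 and one whose ID ends in 88 land in different cells.
        System.out.println(routeByUserId("2088000000000007")); // EastChina-Cell-1
        System.out.println(routeByUserId("2088000000000088")); // SouthChina-Cell-1
    }
}
```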
Entering the cloud native era, we took Kubernetes as the foundation of the overall architecture. In the unitized architecture, we chose to deploy one Kubernetes cluster per cell. A Federated APIServer, which supports management and control-command delivery across multiple Kubernetes clusters, is logically deployed globally, and the control metadata is stored in an etcd cluster to keep global data consistent. However, as is well known, etcd can only handle disaster recovery across two data centers in the same city and cannot provide consistency across multiple cities and data centers, so we are moving etcd onto our OceanBase KV engine: the etcd storage format and semantics are preserved at the engine level, while the storage layer gains three-region, five-center high availability.
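The essence of that move is keeping the key-value semantics fixed while swapping the storage engine underneath. A minimal sketch of the idea, using hypothetical interfaces rather than the real etcd or OceanBase KV APIs:

```java
// Minimal sketch of "keep the KV semantics, swap the storage engine": the same key-value
// contract is served by interchangeable backends. These interfaces are hypothetical
// illustrations, not the real etcd or OceanBase KV APIs.
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListMap;

public class PluggableKvStore {

    // The contract the control plane (e.g. an APIServer) programs against.
    interface KvBackend {
        void put(String key, String value);
        Optional<String> get(String key);
    }

    // Stand-in for the original single-cluster backend.
    static class LocalBackend implements KvBackend {
        private final ConcurrentSkipListMap<String, String> data = new ConcurrentSkipListMap<>();
        public void put(String key, String value) { data.put(key, value); }
        public Optional<String> get(String key) { return Optional.ofNullable(data.get(key)); }
    }

    // Stand-in for a backend with a different replication/HA story (e.g. a distributed KV
    // engine). Only the implementation changes; callers keep the same semantics.
    static class ReplicatedBackend implements KvBackend {
        private final LocalBackend primary = new LocalBackend();   // toy: pretend these are replicas
        private final LocalBackend replica = new LocalBackend();
        public void put(String key, String value) { primary.put(key, value); replica.put(key, value); }
        public Optional<String> get(String key) { return primary.get(key); }
    }

    public static void main(String[] args) {
        KvBackend store = new ReplicatedBackend();   // swap backends without touching callers
        store.put("/registry/pods/cell-1/pod-a", "spec");
        System.out.println(store.get("/registry/pods/cell-1/pod-a").orElse("missing"));
    }
}
```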
Although this architecture suits Ant's own technology stack, we run into many new problems when we open the technology to external customers. For example, customers' data centers contain a great deal of heterogeneous infrastructure, so we need to adapt to multiple clouds through the Cloud Provider standard.
In addition, many financial institutions, including us, did not design their systems in a “cloud native” way: many legacy systems have stateful dependencies on the infrastructure, such as fixed IP addresses, which makes it difficult to adopt the immutable-infrastructure model wholesale. In some cases the operations model of native Kubernetes workloads is also hard to accept because of extremely high business-continuity requirements. For example, when a native Deployment performs a gray or canary release, it handles applications and traffic quite crudely, which can cause abnormal, interrupted business during operational changes. We therefore extended the native Deployment into CAFEDeployment, which better fits financial business requirements, so that large-scale cluster releases, gray releases, and rollbacks are more graceful and comply with our “technical risk principles”.
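What such a graceful, batched release means in practice can be illustrated, purely hypothetically, by a rollout loop that drains traffic before updating an instance, verifies health, and pauses for confirmation between batches. This is a sketch of the behavior described above, not the actual CAFEDeployment controller logic.

```java
// Toy illustration of a "graceful, batched" release: remove an instance from traffic before
// updating it, verify health, and pause between batches. Method behavior is simulated with
// print statements; none of this is real CAFEDeployment code.
import java.util.List;

public class BatchedReleaseSketch {

    static void release(List<String> instances, int batchSize) {
        for (int start = 0; start < instances.size(); start += batchSize) {
            List<String> batch = instances.subList(start, Math.min(start + batchSize, instances.size()));
            for (String instance : batch) {
                System.out.println("drain traffic from " + instance);   // pull out of the load balancer first
                System.out.println("update " + instance);               // replace the image / code package
                System.out.println("health-check " + instance);         // verify before re-admitting traffic
                System.out.println("restore traffic to " + instance);
            }
            // Pause between batches so operators (or automated risk checks) can confirm or roll back.
            System.out.println("batch done, waiting for confirmation before continuing...");
        }
    }

    public static void main(String[] args) {
        release(List.of("pod-1", "pod-2", "pod-3", "pod-4"), 2);
    }
}
```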
Therefore, a financial-grade “hybrid cloud” must first solve elasticity and heterogeneity while meeting the stability requirements of large-scale, financial-grade operations. Once these problems are solved, what the financial industry values most is how to innovate steadily: how to introduce new development and operations models while keeping the traditional models that continue to support existing business.
From core systems to innovative business: building an evolvable multi-mode infrastructure
Cloud native actually grew out of PaaS, so when applying the cloud native architecture we first encounter the problem of smooth evolution at the PaaS layer: how can customers keep their old habits while gradually trying out new delivery models? In the traditional model, people are used to delivering code packages and to VM-based operations; in the cloud native era, the container image is the delivery artifact and the running instance is a container instantiated from that image. We use containers to simulate the VM operation model and extend Deployment to simulate VM-style operations, while also supporting the native container model.
Further up, whether a PaaS is built on the traditional architecture or on Kubernetes, the main operations are the same: environment setup, release, restart, scale out/in, decommissioning, and so on. Implementing them twice wastes resources and increases maintenance costs, and to the user they are the same operations anyway. So we implement all the common parts on Kubernetes, with unified metadata, unified operations, and unified resource abstraction, and provide two interfaces at the product and operations level. On the delivery side, we support both the traditional application and technology-stack model and the image-based model, and of course functions beyond ordinary applications as well.
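The dual-mode idea at the PaaS layer can be sketched as one operations contract with two interchangeable implementations. The interface and class names below are hypothetical illustrations, not SOFAStack APIs.

```java
// Sketch of "one set of common operations, two product/operation modes": the same operations
// contract is implemented once but exposed for both VM-style and container-style workloads.
public class DualModePaasSketch {

    // The operations every PaaS must offer, regardless of the underlying mode.
    interface AppOperations {
        void deploy(String app, String artifact);
        void restart(String app);
        void scale(String app, int replicas);
    }

    // Traditional mode: the artifact is a code package, operated like a VM fleet.
    static class VmStyleOperations implements AppOperations {
        public void deploy(String app, String artifact) { System.out.println("push package " + artifact + " to VMs of " + app); }
        public void restart(String app) { System.out.println("restart processes of " + app + " in place"); }
        public void scale(String app, int replicas) { System.out.println("provision VMs until " + app + " has " + replicas); }
    }

    // Cloud native mode: the artifact is an image, operated as containers.
    static class ContainerStyleOperations implements AppOperations {
        public void deploy(String app, String artifact) { System.out.println("roll out image " + artifact + " for " + app); }
        public void restart(String app) { System.out.println("recreate pods of " + app); }
        public void scale(String app, int replicas) { System.out.println("set replicas of " + app + " to " + replicas); }
    }

    public static void main(String[] args) {
        AppOperations legacy = new VmStyleOperations();
        AppOperations cloudNative = new ContainerStyleOperations();
        legacy.deploy("trade-core", "trade-core-1.0.zip");
        cloudNative.deploy("trade-core", "registry.example.com/trade-core:1.0");
    }
}
```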
Further up is dual-mode microservices, which again face the coexistence of old and new systems. We have shared this before: it is solved with the Mesh approach. In the cloud native architecture, the Mesh runs as a Sidecar inside the Pod, but old systems often do not run on Kubernetes and therefore do not support the Pod-and-Sidecar operations model, so we use an Agent mode to manage the Mesh process instead. The Mesh helps applications on the old architecture complete servitization and allows services on the new and old architectures to be managed uniformly.
The data plane is dual-mode, and the control plane supports dual-mode as well. Traditional SDK-based microservices look up the old service registry, while the Mesh relies on the control plane. We connect the control plane to the old service registry and let the latter perform the actual service registration and discovery, so that services are visible and routable globally. Those familiar with Ant's service registry know how we designed it for high availability at very large scale and across many data centers; the community's control plane is unlikely to gain that capability in the short term, so we are gradually folding it into the new architecture. A dual-mode control plane is therefore well suited to the hybrid architecture and to a smooth transition to cloud native.
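As a rough sketch of the dual-mode control plane, the mesh control plane can delegate registration and discovery to the existing registry so that SDK-based and mesh-based services share one global service view. The interfaces below are hypothetical, not Ant's actual registry APIs.

```java
// Sketch of dual-mode service discovery: a mesh control plane delegates registration and
// discovery to the existing registry, so SDK-based and mesh-based services see one view.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class DualModeDiscoverySketch {

    interface ServiceRegistry {
        void register(String service, String address);
        List<String> lookup(String service);
    }

    // Stand-in for the existing, battle-tested registry that SDK-based apps already use.
    static class LegacyRegistry implements ServiceRegistry {
        private final Map<String, List<String>> entries = new ConcurrentHashMap<>();
        public void register(String service, String address) {
            entries.computeIfAbsent(service, s -> new CopyOnWriteArrayList<>()).add(address);
        }
        public List<String> lookup(String service) { return entries.getOrDefault(service, List.of()); }
    }

    // The mesh control plane facade: it keeps no separate truth, it delegates to the legacy
    // registry, so old and new architectures share one global service view.
    static class MeshControlPlane implements ServiceRegistry {
        private final ServiceRegistry delegate;
        MeshControlPlane(ServiceRegistry delegate) { this.delegate = delegate; }
        public void register(String service, String address) { delegate.register(service, address); }
        public List<String> lookup(String service) { return delegate.lookup(service); }
    }

    public static void main(String[] args) {
        LegacyRegistry legacy = new LegacyRegistry();
        MeshControlPlane mesh = new MeshControlPlane(legacy);
        legacy.register("account-service", "10.0.0.1:12200");  // registered by an SDK-based app
        mesh.register("account-service", "10.0.1.1:12200");    // registered via the mesh sidecar
        System.out.println(mesh.lookup("account-service"));    // both instances are visible
    }
}
```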
Finally, Serverless has become very popular recently. Its scenarios are rich, but Serverless also imposes high performance requirements: each application must start very quickly, or it cannot be used in production.
Our internal Node.js stack makes extensive use of the Serverless architecture, and we have optimized its startup time to around 4s on average, with ongoing work to bring it under 1s. Java applications are the pain point: a typical Java application takes about 30 seconds to a minute to start. Serverless is attractive, but this startup time makes it hard for Java applications to enjoy the architecture's benefits. We used Substrate VM (SVM) static compilation to compile applications ahead of time, bringing an application whose normal startup took about 60s down to roughly 4s. This, of course, sacrifices some dynamic features such as reflection. To keep applications as unchanged as possible, we also modified the SDKs of many middleware components to reduce the adaptation work required of applications. Once this “black technology” can reliably deliver sub-second startup, the whole Java ecosystem will migrate smoothly and the applicable scenarios will expand further. That will take a long time, however, and it needs more people in the community to help remove the dynamic behavior from open source libraries. So in the meantime we leverage the class isolation of our application container, allowing multiple modules, or different versions of the same module, to run in one Java runtime without interfering with each other, simulating the fast cold start and fast scaling of Serverless.
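Why static compilation trades away dynamic features such as reflection can be seen in a small, generic example (not Ant's actual SDK changes): the reflective lookup below resolves at runtime on a normal JVM, but a closed-world, ahead-of-time compiler cannot see it during reachability analysis, so the target class must be registered in advance.

```java
// A normal JVM resolves this reflective lookup at runtime; a closed-world, ahead-of-time
// compiled image must know about the target class in advance (e.g. via reflection metadata),
// otherwise the class is simply not included in the image. Generic illustration only.
public class ReflectionExample {

    public static void main(String[] args) throws Exception {
        // Dynamic lookup by name: invisible to static reachability analysis
        // unless the class is explicitly registered for reflection.
        Class<?> clazz = Class.forName("java.util.ArrayList");
        Object list = clazz.getDeclaredConstructor().newInstance();
        System.out.println("reflectively created: " + list.getClass().getName());
    }
}
```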
We call this isolated Java runtime, which can load and unload modules quickly, the SOFA Serverless Container, and the smallest runnable module a SOFA Function. These small code snippets are programmed against a set of Serverless APIs. During the transition phase we deploy SOFA Serverless Containers as a cluster and schedule multiple SOFA Functions onto them, so one SOFA Serverless Container hosts N SOFA Functions. In the future, if the static-compilation “black technology” solves the Java startup problem and SOFA Functions can transition smoothly to a Pod-based deployment model, each SOFA Function will run in its own SOFA Serverless Container.
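The class-isolation idea can be sketched with plain JDK class loaders: each module is loaded by its own loader, so different modules or versions coexist in one JVM and can be dropped without restarting the process. This only illustrates the mechanism under hypothetical module jars; it is not the SOFA Serverless Container implementation.

```java
// Sketch of class-isolation-based module loading: each "function" gets its own class loader,
// so modules (or different versions of the same module) can be loaded and unloaded inside one
// running JVM. The jar paths and class name below are placeholders.
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

public class IsolatedModuleRuntime {

    // Load a module class from its own jar with an isolated loader whose parent is the
    // runtime's loader, so module classes do not leak into other modules.
    static Class<?> loadModuleClass(Path moduleJar, String className) throws Exception {
        URLClassLoader moduleLoader = new URLClassLoader(
                new URL[]{moduleJar.toUri().toURL()},
                IsolatedModuleRuntime.class.getClassLoader());
        return Class.forName(className, true, moduleLoader);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical module jars (must exist on disk to actually run): two versions of the
        // same module coexist because each lives in its own loader. "Unloading" a module is
        // simply dropping every reference to its loader so it can be garbage-collected.
        Class<?> v1 = loadModuleClass(Path.of("greeter-1.0.jar"), "com.example.Greeter");
        Class<?> v2 = loadModuleClass(Path.of("greeter-2.0.jar"), "com.example.Greeter");
        System.out.println("same class name, different loaders: " + (v1 != v2));
    }
}
```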
To sum up, a financial-grade hybrid cloud must also solve the problem of smooth technical evolution; it has to be able to evolve and iterate. That is why we provide dual-mode “hybrid” capabilities at the PaaS, microservice, Serverless, and other layers.
Finally, we can see that the development of banks, and of the financial field as a whole, maps one to one onto the evolution of technical architecture, with different capabilities required at different stages. Many banks are probably now in the mobile phase of their digital transformation, but once the mobile transformation is complete and the Internet channel is fully open, they will face many of the problems Alipay encountered before them. I hope today's sharing has given you some inspiration. Thank you.
Recommended cloud native event: the Ant Financial and CNCF Service Mesh Meetup in Chengdu
The Service Mesh Meetup is a technology salon jointly produced by Ant Financial and CNCF and hosted by the ServiceMesher community. Its themes are Service Mesh, Kubernetes, and cloud native, and it is held in cities around the country.
This Meetup invites community leaders to share their insights on microservice architecture design under the service mesh, applications in the 5G era, building cloud native edge routing with the open source Traefik, and the evolution of Ant Financial's service mesh proxy.
Time: 13:00-17:00, Saturday, October 26, 2019
Venue: Ant C Space, Wuhou District, Chengdu
Financial-Grade Distributed Architecture (Antfin_SOFA)