Author: Idle fish technology — Cambodian chao
Background
In 2018, we proposed and put into practice an integrated cloud-and-client R&D solution based on Flutter + Dart FaaS. Relying on the Serverless qualities of being light (a focus on business logic), fast (one interface per function, so development and deployment are quick) and NoOps (operations and maintenance handled by the platform), the solution lowers the R&D threshold of the server-side business assembly layer. It also gives client-side developers the ability and the opportunity to take part in server-side business development, which reduces the cost of client-server collaboration and speeds up iteration for emerging businesses. Xianyu's traditional application architecture, however, already has a similar business assembly layer, called idleapi. Because vertical business boundaries and the architectural layering of applications were never clearly designed, almost all business iteration happens on idleapi. New services keep being added, old services keep being iterated, and expired services are not cleaned up in time, so the application keeps growing. According to our statistics, by Double 11 in 2020 idleapi exposed more than 1,200 gateway interfaces, of which more than 500 no longer carried any business traffic (the business had gone offline) but whose code was still running and had not been cleaned up. As a result, idleapi holds more than 700,000 lines of code, more than 2,000 service switches, and hundreds of service modules. Coupling this much business, code, and development activity into one application raises a number of isolation problems:
Online stability:
Hundreds of service modules run in one application process and interfere with each other, which easily leads to isolation failures. For example, if one service module misbehaves (it runs out of memory or exhausts the thread pool), the other service modules deployed on the same machine are left without resources and reject requests, so core services on that machine fail as well. Incidents of this kind happen every year.
Low R&D efficiency:
Dozens of developers develop and maintain hundreds of business modules, and each release merges more than a dozen branches. Every additional business branch adds a risk of code conflicts. The further a branch's baseline has drifted from that of the other branches, the more conflicts have to be resolved and the longer it takes. According to our statistics, a pre-release deployment of idleapi takes about 30 minutes, of which about 20 minutes are spent waiting for developers to resolve conflicts, so development efficiency is low.
Conflict with the vertical business organization:
To better grow the business and focus on business metrics, Xianyu reorganized its teams by business domain, but the application architecture has not kept up. A single business team can work autonomously and communicate efficiently on its own, yet as long as all services are coupled in one application, cross-team collaboration between services still takes a great deal of effort.
Governance – Split
Conway's law observes that the structures of large systems tend to disintegrate during development, qualitatively more so than with small systems, and that organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations. In other words, during development a large system tends to decompose and reorganize until its architecture mirrors the structure of the teams that build it. To address the problems with idleapi described above, we decided to split it. Several questions need to be answered in advance:
1. What is the target of the split: small traditional applications divided by service domain, or FaaS functions divided by service interface?
2. During the split, should business code be rewritten from scratch or reused? How should redundant business code be handled?
3. How are service configuration, monitoring, and alarms migrated?
4. How can the result be verified quickly?
5. How is a smooth gray release (gradual rollout) carried out? How is a rollback performed? How are new requirements handled while a service is being migrated?
6. Once the split is live, what measures prevent the application or FaaS function from bloating again?
These questions are the key points of the split and determine whether the plan can be carried out successfully. Let's analyze them one by one:
Traditional application vs. FaaS function
The first question of direction for the split is: what is the target of the split? There are two options: 1. divide by business domain into small traditional applications that are developed, deployed, and maintained independently; 2. divide by gateway interface into FaaS functions. After several years of exploration and comparison, we believe FaaS is a good fit for the problems idleapi has run into.
Debugging phase
First, the debugging phase. In a traditional application, many interfaces are developed in parallel in the same application, branches risk merge conflicts when they are released together, and one pre-release deployment takes about 30 minutes. With FaaS, one gateway interface corresponds to one FaaS function, and each function has its own Git repository and deployment environment. Functions are independent of and physically isolated from each other, so developers can change their own code and baseline version with confidence and can start remote debugging at any time without worrying about blocking someone else's debugging. Moreover, because each FaaS function serves only one gateway interface, its code size and the second-party (internal) services it depends on are far smaller than those of a traditional application. A pre-release deployment therefore takes only about 3 minutes, roughly 10 times faster than the traditional application.
Runtime
At runtime, each FaaS function runs on its own cluster, and this natural physical isolation keeps functions from interfering with one another. If the business logic in one FaaS function exhausts its thread pool or fills up the disk, functions deployed on other clusters are not affected (unless they depend on that business).
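The difference between the two deployment models can be illustrated with a toy model. The sketch below is not the FaaS platform's scheduler: `WorkerPool`, the pool sizes, and the module names `faulty` and `core` are assumptions made up for illustration. It only shows why a module that exhausts a shared pool starves its neighbors, while per-function pools contain the damage.

```python
class WorkerPool:
    """A fixed number of worker slots; a stuck request never releases its slot."""

    def __init__(self, size):
        self.size = size
        self.busy = 0

    def try_acquire(self):
        # Reject the request when every slot is already occupied.
        if self.busy >= self.size:
            return False
        self.busy += 1
        return True


def core_survives(pools, stuck_requests=8):
    """Flood the 'faulty' module with stuck requests, then try one 'core' request."""
    for _ in range(stuck_requests):
        pools["faulty"].try_acquire()  # these requests hang and hold their slots
    return pools["core"].try_acquire()


# Monolith: both modules share one pool inside one process (hypothetical size).
shared = WorkerPool(size=8)
monolith = {"faulty": shared, "core": shared}

# FaaS: each function has its own pool on its own cluster (hypothetical size).
isolated = {"faulty": WorkerPool(size=4), "core": WorkerPool(size=4)}

monolith_ok = core_survives(monolith)
isolated_ok = core_survives(isolated)
```

In this model the core request is rejected in the shared setup and accepted in the isolated one, which is the behavior the runtime comparison describes.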
Coding phase
Although FaaS functions have the advantage in the debugging, runtime, and operations phases, traditional monolithic applications have the advantage in the coding phase. Take code reuse: in a monolith, the code of many businesses lives in one repository, and low-level utility classes, Manager classes, and upper-layer business code can call each other directly, so reuse is simple and direct. In FaaS mode, different gateway interfaces live in different repositories, and reuse means either copying code or sinking common code into second-party packages or domain services, both of which create code maintenance problems. Take software upgrades: when Pandora or a second-party package must be upgraded, a traditional application only needs to bump the version it depends on and redeploy. In FaaS mode, if business developers have to modify and publish every function one by one, the repeated work is hundreds of times that of a traditional application, which seriously hurts development efficiency. We are trying to address this with platform tooling, layering, and similar measures.
Splitting tool
Once the plan is settled, idleapi will be split from one huge monolithic application into hundreds of FaaS functions, one per gateway interface. Re-implementing that much business logic is impractical, so the best approach is to reuse the business code from the monolith. After analyzing the code, we found that in idleapi the code of different businesses references each other, forming a huge and complex network. One business interface is tied to the code of five or even ten other business interfaces and pulls in nearly 1,000 source files, about a quarter of all idleapi source files, which defeats our goal of slimming down the business code. Besides the business gateway entry, there are also implicit function entries: for example, JSON serialization automatically calls a class's set methods, and beans have initialization functions. Splitting the business code by hand is therefore a significant challenge.
To deal with this, we designed and implemented a code-splitting tool. Starting from a business entry function, it analyzes the tangled code to find the classes, methods, and attributes the entry depends on and excludes those that are never called. The tool reduces the number of source files a single business entry depends on to about 100 (70% of which are interface data types). Combined with the FaaS business framework we designed and implemented, developers can split out the business code, create the FaaS function, and deploy it to the pre-release environment with one click when migrating; the whole process takes less than half an hour. For service switch configuration, we also provide a migration tool that batch-migrates the online or pre-release configuration to the new function with one click, eliminating the need to copy and approve each entry by hand.
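The article does not describe how the splitting tool works internally. One common way to find what an entry point depends on is a reachability search over a dependency graph, sketched below under the assumption that the graph has already been extracted from the source code. Every symbol name here (`ItemApi.detail`, `ItemDTO.setTitle`, and so on) is hypothetical. The sketch also shows why implicit entries matter: a setter that only the JSON library calls is unreachable from the gateway entry alone.

```python
from collections import deque


def reachable_symbols(dependencies, entry_points):
    """Return every symbol reachable from the entry points (breadth-first search)."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        symbol = queue.popleft()
        for dep in dependencies.get(symbol, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


# Hypothetical dependency graph: symbol -> symbols it references.
DEPENDENCIES = {
    "ItemApi.detail": ["ItemManager.load", "ItemDTO"],
    "ItemManager.load": ["ItemDTO", "CacheUtil.get"],
    "ItemDTO.setTitle": ["StringUtil.trim"],  # called implicitly by the JSON library
    "OrderApi.list": ["OrderManager.query", "CacheUtil.get"],
    "OrderManager.query": ["OrderDTO"],
}

ALL_SYMBOLS = set(DEPENDENCIES) | {d for deps in DEPENDENCIES.values() for d in deps}

explicit_entries = ["ItemApi.detail"]    # the gateway interface being split out
implicit_entries = ["ItemDTO.setTitle"]  # entries that no business code calls directly

kept = reachable_symbols(DEPENDENCIES, explicit_entries + implicit_entries)
dropped = ALL_SYMBOLS - kept
```

Here `dropped` contains only the order-related symbols, so they would stay out of the new function. Searching from `explicit_entries` alone would also drop `ItemDTO.setTitle` and `StringUtil.trim`, which is the kind of omission the implicit entries guard against.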
Automated regression testing
Testing is the last line of defense for the quality of the business code that has been split out. To reduce the extra work that splitting puts on business developers and testers, we worked with the FaaS platform and the automated regression testing platform to adapt regression features such as record and replay to the SideCar and Pod architecture of the FaaS platform. After the FaaS function is deployed, developers only need to record online traffic on the traditional application and then replay that traffic against the FaaS function under test to run an automatic regression. Through the integration with the automated testing platform, developers can complete business regression tests on their own. This lowers the risk of service migration, eases the load on testers, and makes migration more efficient.
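The record-and-replay idea can be sketched in a few lines. This is a minimal illustration, not the testing platform's API: the handlers, the `traceId` field, and the `ignore_keys` parameter are assumptions introduced here. Requests and responses are captured from the existing implementation, replayed against the new one, and compared after dropping fields that legitimately differ between runs.

```python
def record(handler, requests):
    """Capture request/response pairs from the reference implementation."""
    return [{"request": req, "response": handler(req)} for req in requests]


def _comparable(payload, ignore_keys):
    # Drop fields that legitimately differ between runs (IDs, timestamps).
    return {k: v for k, v in payload.items() if k not in ignore_keys}


def replay(recorded, candidate, ignore_keys=()):
    """Replay recorded requests against the candidate and collect mismatches."""
    mismatches = []
    for entry in recorded:
        actual = candidate(entry["request"])
        expected = entry["response"]
        if _comparable(expected, ignore_keys) != _comparable(actual, ignore_keys):
            mismatches.append(
                {"request": entry["request"], "expected": expected, "actual": actual}
            )
    return mismatches


# Hypothetical handlers standing in for the monolith and the split-out function.
def legacy_handler(req):
    return {"itemId": req["itemId"], "price": req["itemId"] * 10, "traceId": "legacy"}


def faas_handler(req):
    return {"itemId": req["itemId"], "price": req["itemId"] * 10, "traceId": "faas"}


def buggy_faas_handler(req):
    return {"itemId": req["itemId"], "price": 0, "traceId": "faas"}


traffic = record(legacy_handler, [{"itemId": 1}, {"itemId": 2}])
clean = replay(traffic, faas_handler, ignore_keys=("traceId",))
broken = replay(traffic, buggy_faas_handler, ignore_keys=("traceId",))
```

An empty mismatch list (`clean`) means the split-out function reproduced the recorded behavior, while `broken` lists each request whose response diverged. Real traffic would also need care around side effects and downstream calls, which this sketch leaves out.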
Operations
For the operation and maintenance of FaaS businesses, we keep developers' existing habits as far as possible: a split-out FaaS function keeps the log names, directory layout, and encoding used in the monolith, and developers can still log in to remote machines. We also connected the business-specific logs to the FaaS platform's web console log feature, so developers can view and search the logs of any machine from the console, which is far more convenient than logging in to machines one by one. In addition, the log-based alarm and monitoring system only needs the monitored service log path updated to complete the monitoring migration.
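Because log names and layout are preserved, migrating a log-based alarm amounts to pointing the rule at a new path. The sketch below assumes monitoring rules can be represented as plain dictionaries with a `log_path` field; the rule format, rule names, and directory paths are all hypothetical and not taken from the actual monitoring system.

```python
def migrate_rules(rules, old_root, new_root):
    """Return copies of the rules with log paths moved from old_root to new_root."""
    migrated = []
    for rule in rules:
        path = rule["log_path"]
        if path.startswith(old_root):
            # Keep the file name and sub-directories; swap only the root.
            path = new_root + path[len(old_root):]
        migrated.append({**rule, "log_path": path})
    return migrated


# Hypothetical alarm rules defined against the monolith's log directory.
RULES = [
    {
        "name": "item-detail-errors",
        "log_path": "/home/admin/idleapi/logs/item/error.log",
        "keyword": "ERROR",
    },
    {"name": "system-load", "log_path": "/var/log/system.log", "keyword": "WARN"},
]

NEW_RULES = migrate_rules(
    RULES,
    old_root="/home/admin/idleapi/logs",
    new_root="/home/admin/item-detail-fn/logs",
)
```

Rules outside the old root pass through unchanged, and the original list is left untouched so the old alarms can keep running during the gray release.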
Evolution of architecture
Once the application is split into fine-grained FaaS functions, there are two ways to handle the reuse of business code:
1. Govern first, then split: refactor the monolith, sink the code reused by each business (into common second-party packages or into the business domain service layer), and then split the monolith into FaaS functions. This approach has two problems: first, zombie code makes up about half of the codebase, so much of the refactoring would be wasted effort; second, refactoring inside the original application means that new business iterations and the A/B switches for the refactoring are developed and gray-released together, which is complex and risky.
2. Split first, then govern: split the businesses out of the monolith first and set the reuse question aside for the moment. After the functions are split out, business developers refactor for reuse during subsequent development according to actual business needs, either extracting reused business code into second-party packages or sinking it into domain services. Sorting out reuse among cleanly isolated function codebases is far less complex and risky than the first approach.
We therefore chose the second option.
Results
So far, more than 30 gateway interfaces have been split out of the monolith and handed over to business teams for development and maintenance, which further confirms that the approach is feasible for governing a monolith by splitting. Next, we will hand the splitting solution over to developers, who will split and migrate their own businesses. After the split, business teams keep their existing development and operations habits. At the same time, the rule that one service gateway interface corresponds to one function keeps each FaaS function focused on a single gateway interface, which solves the problem of traditional applications growing without bound under continuous business innovation. Because of this focus, a function's code is less than 3% of the traditional application's code (and most of it is data classes), and a business release takes only 5 minutes (Java).
Conclusion
Overall, with the automated splitting tool, developers can split out a business interface and deploy it to pre-release with one click within half an hour, with no manual intervention in between. The split-out function keeps the original development and operations habits, so the migration cost is low and acceptable to business developers. Moreover, because functions are business-focused, with one interface per function, each function is free from interference by other businesses during development, is highly testable, and deploys quickly. At runtime, each function runs on different physical machines, and this natural physical isolation greatly improves runtime stability and lowers the operations cost of services.
Outlook
The FaaS function platform is still evolving rapidly, and there is room for improvement:
1. Machine cost of low-traffic functions: under the group's strict production safety requirements, even a low-traffic function needs two machines in each data center, which is very wasteful. The platform is considering smaller machine specifications, overselling, and similar measures to improve machine utilization.
2. Elasticity: when a service's upstream and downstream chain is long, elasticity at a single point cannot solve every problem; this needs to be considered and solved end to end.
3. Cost of unified maintenance upgrades: when a group-wide release gate requires a fix, every function has to be repaired and redeployed, which is a huge amount of work. We are exploring solutions in practice.