One cloud migration can save 5 million in hardware costs every year. Would you do it?
Before you answer, you need to weigh the cost of the migration, the reliability of the new platform, and the security of the process. Enterprises use cloud computing to drive business innovation, manage IT in an agile and flexible way, and reduce IT costs, which strengthens their growth and competitiveness under the new normal. The cloud is no longer new to enterprises. However, as their businesses grow and cloud service offerings keep changing, more and more enterprises face the challenge of migrating between cloud platforms in order to get better services at lower cost.
How do you migrate between clouds safely and efficiently?
Xiaobiao learned that a large social e-commerce company completed a cloud migration at the end of 2020. I had the honor of talking in depth with the leader of its performance testing, Zhang Ge (a pseudonym), and have brought back first-hand information for you. To protect the company’s information, we refer to it below as Company A.
Finalize the migration plan: functional testing and verification
At the end of 2020, Company A made a major decision: to migrate its business from the Ali Cloud platform to the Huawei Cloud platform. The company’s existing business comprised hundreds of applications and tens of PB of data, so how to complete the migration safely and efficiently became the top concern. If the business failed to run on the new cloud after the migration, the damage would be incalculable, so the first task was to finalize the migration plan. Verifying the feasibility of a migration scheme takes more than a single test, and it certainly cannot be tried directly in the production environment. Company A’s test team therefore first built a minimal test environment on Huawei Cloud to test the feasibility of different schemes and to compare and verify Huawei Cloud middleware against Ali Cloud middleware.
Could the services be moved first and the data afterwards? That requires cross-cloud functional testing. The test team kept the database and Redis cache on Ali Cloud and migrated the application services to Huawei Cloud first. Business calls then had to cross clouds, and because of the high latency between the two clouds the response time was more than three times the normal value. The result made it clear that moving the services first and the data later was not feasible.
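A rough back-of-the-envelope model shows why cross-cloud calls can inflate response time so sharply. The sketch below is only an illustration: the function name and every number in it (compute time, number of data-layer calls, round-trip times) are hypothetical and are not Company A’s measurements.

```python
def response_time_ms(compute_ms, data_calls, per_call_ms, rtt_ms):
    """Estimate one request's response time when each data-layer call
    (database or Redis) runs sequentially and pays one network round trip."""
    return compute_ms + data_calls * (per_call_ms + rtt_ms)


# Hypothetical numbers, for illustration only.
same_cloud = response_time_ms(compute_ms=20, data_calls=10, per_call_ms=1, rtt_ms=0.5)
cross_cloud = response_time_ms(compute_ms=20, data_calls=10, per_call_ms=1, rtt_ms=8)

print(f"same cloud:  {same_cloud:.0f} ms")
print(f"cross cloud: {cross_cloud:.0f} ms ({cross_cloud / same_cloud:.1f}x)")
```

Under these assumed values, a request that makes ten sequential data-layer calls becomes roughly three times slower once each call has to cross clouds, which matches the order of magnitude the test team observed.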
So the team finally settled on a full migration with a service shutdown. The test team synchronized all services and data to the test environment and began functional testing of the entire business process. The tests covered whether new product campaigns could be created, whether orders could be placed normally, and whether the whole post-order flow (orders, funds, customer service, and after-sales) ran smoothly. Once the business process ran smoothly and no functional problems remained, the next step was to test system performance and verify that the system could achieve high availability, high stability, and the expected performance targets.
Select the best performance test solution
“We came to the conclusion that running the stress test with the same application code and the same data makes comparison easy and gives the most accurate performance results.” Before the official performance test, the test team discussed the candidate test schemes in detail. Here they met the core problem of the whole project: how to handle the stress test data? Let’s look at the differences between the three main schemes discussed at the time.
Scheme 1: Move the business and data from Ali Cloud 1:1 to Huawei Cloud, run the performance stress test directly on the existing data in the production environment, and afterwards clean up only the test data generated during the stress test.
Scheme 2: Also move the business and data from Ali Cloud 1:1 to Huawei Cloud and run the stress test directly on the existing data in the production environment. The difference is that the cleanup is simpler and cruder: everything in the data layer (databases, files, caches, message middleware, and so on) is wiped and a full synchronization is performed again.
Scheme 3: First move the business and data from Ali Cloud 1:1 to Huawei Cloud, then introduce full-link stress testing technology in the production environment to identify and isolate test data during the test, so that test data never pollutes business data.
Every scheme put forward for discussion has its advantages and disadvantages, and the project team made many comparisons and arguments before deciding. Let’s look at Scheme 1 first. The challenge with this scheme is that it is extremely complex and can easily pollute the production data. The details are as follows:
- The stress test touches many data tables, so it is easy to miss some data or clean the wrong data, and debugging the cleanup scripts and verifying the result afterwards is a large workload in itself;
- Besides the data tables, the message queues and caches also hold stress test data, which makes the cleanup far more complex;
- For tables with auto-increment primary keys, the test data also affects later incremental data synchronization;
- Even after the stress test data is cleared, reporting applications are still distorted by it unless extra work is done on those applications.
Now for Scheme 2. This scheme is feasible, but it has two problems. First, the environment produced by the second synchronization is itself untested, so risk remains and a second regression test is still needed. Second, the data migration cycle is long: migrating several PB of data, migrating files, and migrating and warming up caches often takes a few weeks. During this period, 1:1 machines must be rented in both cloud environments at the same time, and the hardware cost is no small expense. In addition, data migration needs the cooperation of every business line, and the duplicated effort and wasted manpower caused by a second migration cannot be ignored.
Finally, let’s look at Scheme 3. Introducing full-link stress testing in the production environment makes performance testing of the newly migrated cloud environment straightforward. Before the stress test, the call links of all businesses are mapped out. Without changing the business code, the probe agent from Shulie Technology isolates test data in the production environment, so the stress test can be run boldly in production without worrying about data pollution.
ForceCop’s production stress test data isolation now supports hundreds of middleware components, including cloud-native middleware, so that stress test data flows into pre-configured shadow caches, shadow message queues, and shadow databases and tables without affecting business data. Together with the stress traffic platform and the stress test console, it can simulate large numbers of users operating in the production environment and exercise every business call link, so the test results are realistic and reliable. After weighing everything, the team chose Scheme 3 for the performance test as the option with the lowest cost and the highest safety.
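The idea of routing marked test traffic to shadow storage can be illustrated with a toy sketch. This is not ForceCop’s implementation, which the article does not describe beyond saying that an agent does the isolation without business code changes. The `shadow_` prefix, the `RequestContext` class, and the in-memory `Store` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field

SHADOW_PREFIX = "shadow_"  # hypothetical naming convention for shadow tables


@dataclass
class RequestContext:
    # Assumed to be set once at the entry point (for example from a marker
    # on the request) and passed along the whole call link.
    is_stress_test: bool = False


@dataclass
class Store:
    """In-memory stand-in for a database with business and shadow tables."""
    tables: dict = field(default_factory=dict)

    def insert(self, ctx, table, row):
        # Stress test writes go to the shadow table, business writes do not.
        target = SHADOW_PREFIX + table if ctx.is_stress_test else table
        self.tables.setdefault(target, []).append(row)
        return target


store = Store()
store.insert(RequestContext(), "orders", {"id": 1})                     # real order
store.insert(RequestContext(is_stress_test=True), "orders", {"id": 2})  # test order

print(store.tables)
```

In this sketch the business table never sees the test row, so nothing has to be cleaned up afterwards. The same routing decision would apply to caches and message queues.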
One person’s practice of full-link stress testing in the production environment
Based on the discussion above and taking Company A’s staffing into account, the project team finally decided to hand the glorious task of performance stress testing to performance test engineer Zhang Ge.
How can one person carry a project that affects the business of 2 million merchants and the experience of 15 million users, with no prior experience to copy? Did he run into any particularly difficult problems? When talking about these questions, Zhang Ge looked relaxed.
The first step was to set reasonable test targets. To keep the business running smoothly, performance on Huawei Cloud had to be at least as good as on Ali Cloud. To this end, Zhang Ge took the performance targets of the 2020 sales promotion as the benchmark and ran the stress test with slowly increasing traffic. Using the performance data from earlier full-link tests as a reference, he compared the response times of Ali Cloud and Huawei Cloud under the same traffic, and on that basis identified areas that could be optimized, such as the gateway, connection pools, bandwidth, and the database.
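The comparison described above can be sketched as a small script that lines up response times from the two clouds at each traffic level and flags where the new platform is slower. The function name, the tolerance parameter, and all figures below are hypothetical placeholders, not data from Company A.

```python
def find_regressions(baseline_ms, candidate_ms, tolerance=1.0):
    """Return the traffic levels at which the candidate platform's response
    time exceeds the baseline's by more than the given tolerance factor."""
    return [
        level
        for level in sorted(baseline_ms)
        if level in candidate_ms
        and candidate_ms[level] > baseline_ms[level] * tolerance
    ]


# Hypothetical response times (ms) keyed by requests per second.
ali_cloud = {1000: 40, 2000: 45, 4000: 60, 8000: 90}
huawei_cloud = {1000: 38, 2000: 44, 4000: 75, 8000: 140}

slower_at = find_regressions(ali_cloud, huawei_cloud, tolerance=1.1)
print("investigate traffic levels:", slower_at)
```

Raising traffic step by step, as Zhang Ge did, makes it possible to see at which level a bottleneck first appears instead of only seeing that the peak target was missed.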
One person certainly cannot complete all the work alone, so the hardest problem was internal collaboration. To make sure performance could be tuned promptly based on the full-link stress test results, Zhang Ge deliberately scheduled the stress tests one to two hours after the technical team’s releases. At that time most of the technical staff were available and could help resolve performance issues right away.
The testing proceeded according to plan under Zhang Ge’s direction. It took more than ten days to complete several rounds of full-link stress testing in the production environment, and many problems were indeed found.
For an e-commerce enterprise, high concurrency is a prominent feature of the business. In addition, Company A relies heavily on Redis, and many Redis interfaces are called inside loops, so the demand on the number of Redis connections is relatively high. Halfway through the test, connections to Redis suddenly started failing. Investigation showed that the maximum number of connections configured on Huawei Cloud Redis was too low for Company A’s usage pattern.
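A quick capacity check of this kind can be done before a test run. The sketch below compares the worst-case number of client connections with the server-side limit (in open-source Redis this limit is the `maxclients` setting). The function names, the headroom percentage, and every number are assumptions for illustration; the article does not give Company A’s actual values.

```python
def required_connections(app_instances, pool_max_per_instance):
    """Worst-case client connections if every instance fills its pool."""
    return app_instances * pool_max_per_instance


def check_connection_budget(app_instances, pool_max_per_instance,
                            server_max_connections, headroom_percent=20):
    """Return (ok, needed, usable), keeping some headroom on the server
    for monitoring, replication, and operational connections."""
    needed = required_connections(app_instances, pool_max_per_instance)
    usable = server_max_connections * (100 - headroom_percent) // 100
    return needed <= usable, needed, usable


# Hypothetical deployment: 200 application instances, pool size 64,
# and a server-side limit of 10000 connections.
ok, needed, usable = check_connection_budget(200, 64, 10000)
print(f"needed={needed} usable={usable} ok={ok}")
```

With these assumed values the pools could open more connections than the server accepts, which is the kind of mismatch that only shows up under production-level concurrency.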
The most memorable issue, however, was the load balancing problem at the Huawei Cloud gateway, which showed up as a curve in the test results. Huawei Cloud had 3 sets of main ELBs and 8 sub-ELBs, but the load always landed on the first ELB and only spilled over to the second once the first was full. One server’s CPU ran at 65% while the others sat at 15%, a seriously unbalanced load.
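One simple way to quantify such an imbalance is the ratio of the busiest server’s utilization to the average. The sketch below uses the 65% and 15% figures from the article. The metric itself and the assumption of four servers are choices made for this illustration, not something the test team is reported to have used.

```python
from statistics import mean


def imbalance_ratio(cpu_percent):
    """Busiest server's CPU divided by the mean CPU across servers.
    A value of 1.0 means the load is spread perfectly evenly."""
    return max(cpu_percent) / mean(cpu_percent)


# One server at 65% and the rest at 15%, as in the article;
# the total of four servers is assumed.
observed = [65, 15, 15, 15]
ratio = imbalance_ratio(observed)

print(f"imbalance ratio: {ratio:.2f}")
```

A ratio well above 1.0 means the hot server will saturate long before the cluster’s total capacity is used, so the effective capacity is much lower than the hardware suggests.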
These two typical configuration and environment problems would be hard to find without full-link stress testing. Company A’s technical team worked with Huawei Cloud’s on-site staff to debug and fix them based on the stress test results, and in the end all performance optimizations were completed successfully.
With the earlier tests as validation, the team could carry out the formal cloud migration with confidence. Thanks to thorough preparation, the main migration itself took only 2 days. After launch, the business ran normally and zero performance failures occurred.
The trend of enterprise cloud migration is irreversible. Today enterprises need to consider not only whether to migrate but also how to do it safely. I think full-link stress testing in the production environment has provided its own answer.