About the author
Kang Meng is a senior technical support engineer at Ctrip's website operation center. He has years of hands-on experience in Internet system architecture design, back-end development, and performance testing, enjoys studying new technologies, and is good at turning research results into improvements in day-to-day work efficiency.
1. Background
As we all know, functional testing and performance testing are two indispensable links in product iteration. Capacity estimation before a product launch is inseparable from performance testing, and test case coverage is an important indicator of functional testing during iteration. Even a one-line code change that skips rigorous functional and performance testing may lead to a major production failure.
In recent years, Ctrip has steadily increased the number of application and migration projects in its production environment. The ability to quickly build test cases that are close to production, together with functional and performance test scenarios with adjustable pressure, has become more and more indispensable.
So how do we generate a large enough source of pressure?
You may already have several off-the-shelf solutions in mind. ApacheBench, perhaps, but isn't it painful to simulate production request ordering and context with ab? Maybe JMeter or LoadRunner? Even setting aside the complexity of the secondary development needed to encapsulate that context logic in JMeter or LoadRunner, the prospect of manually constructing a large number of test cases that come close to real production scenarios is daunting by itself.
With a production traffic playback system, a large number of real production scenarios can be used directly as test cases, which naturally solves the problem of constructing test cases at scale. The system supports recording several times the production traffic of a single server and replaying the recording under multiplied pressure, which solves the problem of building a large pressure source. In addition, the system natively supports layer 4 and layer 7 protocols, so it can quickly accommodate a variety of special application requirements. Finally, its cross-platform nature gives it good adaptability across scenarios.
2. The solution
In the traffic playback system, the original traffic still goes back to the real servers in the production environment, while a mirror copy is distributed to test servers outside the cluster. On those test servers you can test the functionality of different versions, or run performance tests at up to ten times the original pressure.
The traffic playback system works as follows:
A simple cluster consists of a dedicated load balancer (or a network device playing that role) and a group of back-end servers providing the actual service, such as Server1 and Server2 in the figure above.
First, the system adds a machine to the cluster, which we call the "collector". The collector's basic function is to forward the cluster's normal traffic: all traffic flowing through it is sent back to one or more back-end application servers in the original cluster, such as Server1. In effect, the collector can be made to observe several times the traffic of a single server.
At the same time, the collector makes a mirror copy of the traffic it observes. The copy can either be saved as an offline file, or be forwarded online, in real time, to any reachable server in the production or test environment. A saved offline file can be replayed at any time: the recorded production traffic is forwarded, directly or after modification, to any network-reachable server. On that target server (which we call the "playback machine") you can run different versions of the code under test and check whether the production application and the test application behave as expected.
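To make the collector's role concrete, here is a minimal sketch (not Ctrip's actual implementation) of the tee idea: a TCP proxy that forwards traffic to the real backend while appending a mirror copy of the inbound bytes to an offline file. The backend address, listen port, and file name are placeholders.

```go
// Minimal sketch of a "collector": forward traffic to the real backend
// and tee a copy of the inbound bytes into an offline file.
package main

import (
	"io"
	"log"
	"net"
	"os"
)

func main() {
	backend := "10.0.0.11:8080" // Server1 (placeholder address)

	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	// Offline mirror file. A real system would frame and serialize writes;
	// here concurrent connections may interleave bytes.
	mirror, err := os.OpenFile("mirror.log", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		log.Fatal(err)
	}

	for {
		client, err := ln.Accept()
		if err != nil {
			log.Print(err)
			continue
		}
		go func(c net.Conn) {
			defer c.Close()
			upstream, err := net.Dial("tcp", backend)
			if err != nil {
				log.Print(err)
				return
			}
			defer upstream.Close()
			// Copy responses straight back to the client.
			go io.Copy(c, upstream)
			// Forward requests to the backend and tee a copy offline.
			io.Copy(io.MultiWriter(upstream, mirror), c)
		}(client)
	}
}
```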
You may be wondering: doesn't this leave Server1 carrying twice as much traffic? Actually, no. By default, every server in our production cluster has a weight of 5, and the ratios between machine weights determine how the cluster's traffic is distributed. When a traffic-copying task starts, the system automatically adds the collector to the cluster with the default weight of 5 and at the same time lowers Server1's weight to 1. The load balancer therefore splits incoming traffic in the ratio Collector:Server1:Server2 = 5:1:5; since the collector forwards its entire share back to Server1, the traffic actually flowing through each machine is Collector:Server1:Server2 = 5:6:5. The collector records plenty of data, while the traffic on the production machines never stacks up enough to cause capacity problems.
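As a quick sanity check of the arithmetic above, this throwaway snippet derives the per-machine traffic from the adjusted weights; the weight values come from the paragraph above.

```go
// Back-of-the-envelope check of the weight scheme: the load balancer
// splits client traffic by weight, and the collector forwards its
// entire share back to Server1.
package main

import "fmt"

func main() {
	weights := map[string]float64{"collector": 5, "server1": 1, "server2": 5}
	total := weights["collector"] + weights["server1"] + weights["server2"]

	collector := weights["collector"] / total                      // fraction recorded by (and passing through) the collector
	server1 := (weights["server1"] + weights["collector"]) / total // direct share + forwarded share
	server2 := weights["server2"] / total

	// Traffic actually handled: collector:server1:server2 = 5:6:5.
	// server1 + server2 covers 100% of client traffic, since the
	// collector's share ends up on Server1.
	fmt.Printf("collector=%.1f%% server1=%.1f%% server2=%.1f%%\n",
		collector*100, server1*100, server2*100)
}
```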
Since the copying happens on live production traffic, emergency disaster recovery is critical, especially in scenarios such as a host machine failure or a Full GC causing health checks to fail. If the system could not detect that a back-end member was down and kept forwarding production traffic to the faulty machine, that portion of the requests would fail. A self-protection mechanism is therefore required: when a back-end application fails, the traffic diversion task must be identified and stopped automatically.
On the other hand, terminating the whole task immediately because a machine was kicked out of the cluster by an occasional Full GC or a bit of network jitter would be overly sensitive. So if Server1 fails health checks three times in a row while traffic is being diverted, the task is automatically suspended; if the failures persist beyond that, the service is judged unlikely to recover in the short term and the task is terminated immediately. This ensures that downtime on the target server cannot have a lasting impact on the production environment, while also avoiding over-sensitive task termination caused by occasional probe failures.
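A self-protection loop along these lines is sketched below. The three-failure pause threshold comes from the description above; the termination threshold, probe interval, and health endpoint are assumptions made for illustration.

```go
// Sketch of the self-protection logic: probe the diversion target,
// pause the task after 3 consecutive failures, and terminate it if
// the target still has not recovered after further probes.
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	const pauseAfter, killAfter = 3, 6 // killAfter is an assumed value
	failures := 0
	paused := false
	client := &http.Client{Timeout: 2 * time.Second}

	for range time.Tick(5 * time.Second) {
		resp, err := client.Get("http://10.0.0.11:8080/health") // placeholder endpoint
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			failures = 0
			if paused {
				paused = false
				log.Print("target recovered, resuming diversion task")
			}
			continue
		}
		if err == nil {
			resp.Body.Close()
		}
		failures++
		switch {
		case failures >= killAfter:
			log.Print("target unlikely to recover soon, terminating task")
			return
		case failures >= pauseAfter && !paused:
			paused = true
			log.Print("3 consecutive probe failures, pausing diversion task")
		}
	}
}
```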
3. The system
(1) The system consists of three modules: traffic diversion task settings, playback task settings, and task query.
(2) To set up a traffic diversion task, the user mainly provides the AppID, the PoolID, and the diversion target, i.e. which machine's traffic is to be copied. The system performs various association checks based on the App and Pool information, automatically retrieves the server details, and presents them for selection.
(3) Finally, set the start and end times to quickly launch the diversion task.
(4) Set up a playback task. On the task list page, click "Playback" in the last column of a task; the page jumps to the playback task settings with the parameters of the corresponding traffic-copying task filled in automatically. The user only needs to supply information specific to the playback itself, such as the HTTP Host header, the playback target IP, and the playback rate (see the sketch after this list).
(5) View monitoring data. Throughout a copy/playback task, real-time monitoring data across multiple dimensions is available. The system has integrated common monitoring metrics such as CPU, memory, and connection counts, and has also integrated ES (Elasticsearch) with common parameters preset, so you can view the request/response status codes and response times of the application during both traffic copying and playback.
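As a rough illustration of a playback task, the sketch below reads recorded requests from an offline file (simplified here to one URL path per line), overrides the HTTP Host header, and fires the requests at the target at a configurable rate. The file format, target address, Host value, and rate are all placeholders, not the system's actual interface.

```go
// Sketch of a playback client: paced replay of recorded requests
// with a Host header override.
package main

import (
	"bufio"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	target := "http://10.0.0.99:8080" // playback machine (placeholder)
	hostHeader := "www.example.com"   // Host header to present (placeholder)
	rate := 100                       // requests per second (the "playback rate")

	f, err := os.Open("mirror.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	ticker := time.NewTicker(time.Second / time.Duration(rate))
	defer ticker.Stop()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		path := scanner.Text()
		<-ticker.C // pace dispatch at the configured rate
		go func(p string) {
			req, err := http.NewRequest("GET", target+p, nil)
			if err != nil {
				return
			}
			req.Host = hostHeader // override the HTTP Host header
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				log.Print(err)
				return
			}
			resp.Body.Close()
			log.Printf("%s -> %d", p, resp.StatusCode)
		}(path)
	}
}
```

A real tool would replay full recorded requests (method, headers, body) and wait for in-flight requests before exiting; the point here is only the pacing and the Host override.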
4. Projects
Project 1: Traffic replication of a service on the intranet
In this project, the system expanded a service cluster on the intranet with a collector. The collector forwarded the normal traffic back to the servers in the original cluster that provide the service, and made two mirror copies of it: one captured as packets and saved as a pcap file, the other saved as request messages in an offline text file.
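For the pcap-format copy, a capture loop along these lines (sketched here with the gopacket library, which may or may not be what the system actually uses) writes every packet on the service port to a pcap file. The interface name, port, and file name are illustrative, and the program needs libpcap and packet-capture privileges.

```go
// Sketch of the packet-capture side: mirror traffic on a port into a
// pcap file using gopacket.
package main

import (
	"log"
	"os"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
	"github.com/google/gopacket/pcapgo"
)

func main() {
	handle, err := pcap.OpenLive("eth0", 65535, true, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()
	if err := handle.SetBPFFilter("tcp port 8080"); err != nil {
		log.Fatal(err)
	}

	f, err := os.Create("capture.pcap")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	w := pcapgo.NewWriter(f)
	if err := w.WriteFileHeader(65535, layers.LinkTypeEthernet); err != nil {
		log.Fatal(err)
	}

	// Write every captured packet, with its timing metadata, to the file.
	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for pkt := range src.Packets() {
		if err := w.WritePacket(pkt.Metadata().CaptureInfo, pkt.Data()); err != nil {
			log.Fatal(err)
		}
	}
}
```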
From the application server's perspective, all incoming traffic can be seen arriving via the collector:
Project 2: Traffic playback of a service on the intranet
In this project, the system converted previously recorded offline traffic files back into requests and played them back to a test server outside the cluster for multi-version comparison:
5. Summary and outlook
The traffic playback system makes it easy to obtain mirrored data of massive real user requests from the production environment. It preserves the original cluster's traffic without modification or loss, and it solves the problem that manually constructed test data cannot match real production scenarios. The system also has good cross-platform support: whether an application runs on Windows or Linux, its production data can be mirrored completely. It is also worth noting that the system supports layer 7 protocols, including customizing and modifying HTTP headers.
A technology platform stays alive only through constant polishing. Next, we are looking for ways to capture production traffic mirrors with less intrusion in service cluster scenarios, for example by co-locating the "collector" on the "playback machine" itself, in order to round out the direct-connection application scenarios.