Ctrip's operations automation platform: making changes across tens of thousands of servers easy

About the speaker

Junya Hu is a senior technical support engineer at Ctrip's Technical Support Center, responsible for managing the company's SaltStack, StackStorm, and other operations platforms, and for developing operations automation tools.

The topic of this talk is Ctrip's operations automation platform based on StackStorm.

In May this year, a ransomware outbreak swept across the globe, affecting government departments, medical institutions, public transport, schools, businesses and more, and causing huge losses around the world.

Someone with an investor's eye might react to this by thinking about buying bitcoin. An operations engineer only thinks about how to keep the virus from affecting the company's business. I believe many colleagues in operations took part in the battle against the ransomware.

Although this virus spread widely and looked powerful, there were many countermeasures. For example, you could disable port 445 to stop it from spreading, or set up the virus's kill-switch domain name on the intranet to stop it from running. Of course, these are only workarounds. The fundamental fix is keeping security patches up to date.

With only a few servers, or a few dozen, patching is simple: log in and click through the installer, or run a command. With thousands or tens of thousands of servers, doing this by hand is impossible. Sending one command to all servers at once is not appropriate either, because it could have a huge impact on the business.

So how do you automatically patch tens of thousands of servers?

Let’s take a look at how to patch a server.

The diagram above shows a relatively simple process. First, check whether the patch is already installed on the server; if so, the process ends. If not, pull the server out of production, install the patch, and restart the server so that the patch takes effect.

Before putting the server back into the cluster, you may also need to warm up the application, for example by letting it build its cache, so that it is back in its normal state before it receives production traffic. There are also complications: if some servers are pulled out of a cluster, the remaining servers may not be able to carry the load, so cluster availability has to be taken into account.
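The flow described above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the `Server` class and its methods are hypothetical in-memory stand-ins for real remote operations, and the `min_in_service` parameter is an assumed way of expressing the cluster availability check.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical in-memory stand-in for a real server."""
    name: str
    patched: bool = False
    in_service: bool = True
    log: list = field(default_factory=list)

    def pull_out(self):
        self.in_service = False
        self.log.append("pull_out")

    def install_patch(self):
        self.patched = True
        self.log.append("install_patch")

    def reboot(self):
        self.log.append("reboot")

    def warm_up(self):
        # For example, let the application rebuild its cache.
        self.log.append("warm_up")

    def pull_in(self):
        self.in_service = True
        self.log.append("pull_in")


def patch_server(server, cluster, min_in_service):
    """Patch one server; return 'skipped', 'deferred' or 'patched'."""
    if server.patched:
        return "skipped"  # patch already installed, the process ends
    # Availability check: how many other servers would stay in service?
    remaining = sum(1 for s in cluster if s.in_service and s is not server)
    if remaining < min_in_service:
        return "deferred"  # pulling this server out would leave too few
    server.pull_out()
    server.install_patch()
    server.reboot()
    server.warm_up()
    server.pull_in()
    return "patched"
```

In a real platform each method would be a remote call rather than a local state change, and a "deferred" server would be retried later.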

Automating this process of patching a server involves two tasks:

On the one hand, the whole workflow in the diagram has to be implemented.
On the other hand, logging in to every server one by one is not possible, so the steps have to be run remotely. This is the part shown in yellow in the diagram.

Once the patching of one server is automated, the next step is to scale from 1 to 1,000 and then 10,000. When patching tens of thousands of servers, one thing is essential: grayscale rollout, grayscale rollout, grayscale rollout. Important things are said three times.

No matter how skilled you are, how good your technology is, or how confident you are in the tools you have developed, large-scale operations in production call for great care. Rolling out a change in gradual batches is a good way to be careful: it greatly reduces the impact on production and improves the availability of the website.

Based on these requirements for automatically patching tens of thousands of servers, we built an operations automation platform consisting of three modules:

1. SaltStack, used for remote control;
2. StackStorm, used to implement the operation process;
3. Jobs, a tool we developed ourselves, used for batch grayscale rollout.

This system can do more than patching. It can cover most day-to-day operations automation requirements, which is why I am sharing it with you.

The three parts are described in detail below.

1. Remote control

SaltStack is an open source remote management platform that can manage servers running various operating systems. It has two main parts: the minion and the master. The minion is installed on each server to be managed. After the minion starts, it establishes a persistent connection to the master, and the master sends tasks to the minion.
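As a rough illustration of the master sending a task to minions, the sketch below is written against Salt's Python client interface. `salt.client.LocalClient` and its `cmd` method are part of Salt and are meant to be used on the master; the helper name `run_on_minions`, the target pattern, and the command are assumptions made for this example. The client is passed in as a parameter so that the helper can also be tried with a stand-in object.

```python
def run_on_minions(client, target, command):
    """Run a shell command on every minion matching `target`.

    `client` is expected to behave like salt.client.LocalClient.
    Returns (responded, results): the sorted IDs of the minions that
    answered, and the raw {minion_id: output} dictionary.
    """
    # "cmd.run" is Salt's execution module function for shell commands.
    results = client.cmd(target, "cmd.run", [command])
    return sorted(results), results


# On a Salt master this could be called as follows (not executed here):
#   import salt.client
#   responded, results = run_on_minions(
#       salt.client.LocalClient(), "web*", "uptime")
```

Minions that do not respond are simply missing from the returned dictionary, so comparing `responded` with the expected server list is one way to spot unreachable machines.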

Similar remote management tools include Ansible, Chef, and Puppet. You can choose among them based on your own scenario. At GOPS Beijing last year I shared some of Ctrip's experience with SaltStack. You can refer to that talk, so I will not repeat it here.

2. Operation process

Looking at how operations has developed, the first stage was traditional operations, which relied mainly on manual work. To bring a server online, for example, you would log in to it and follow the operations document step by step. A slightly more advanced approach was to put the configuration commands into scripts and run one or more scripts to complete the configuration.

What are the disadvantages? First, people repeat this kind of work every day. It is tiring, delivery is slow, and tired people make mistakes and forget certain configurations.

With scripts, it is easy to end up developing the same function repeatedly. Many scripts do not write proper logs, so it is hard to find past operations. When a fault occurs after an operation performed with scripts, there is no unified operations log, so you cannot quickly find out who did what.

Over time, operations has evolved into the more advanced DevOps era, which is the era we are in now. One of its defining features is the use of various open source tools, along with many self-developed tools. Tools improve efficiency and greatly speed up operations automation.

Having so many tools brings problems of its own, such as:

A complex change requires operating many different tools
The same operation is implemented repeatedly in different scripts or tools
You do not know the operation logic of scripts or tools developed by others
There is no unified operations log

To solve these problems, we turned to StackStorm, an open source, event-driven operations automation platform.

You have a variety of tools that provide APIs for various operations. You can implement these API calls as actions on top of StackStorm, and then combine those actions into complex workflows for different tasks.

StackStorm makes operations pluggable, makes operation logic visible, and unifies operations logs.

StackStorm provides a web interface as well as an API. Once the actions for your various tools are registered there, you select an action, fill in the parameters, and click run.
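Running an action through the API instead of the web interface might look like the sketch below, which only builds the HTTP request and does not send it. It assumes a standard StackStorm installation where executions are created with a POST to `/api/v1/executions` and authenticated with an `St2-Api-Key` header; the host name and the action name `ops.restart_service` are hypothetical.

```python
import json
import urllib.request


def build_execution_request(base_url, api_key, action, parameters):
    """Build (but do not send) a request asking StackStorm to run an action."""
    body = json.dumps({"action": action, "parameters": parameters})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/executions",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "St2-Api-Key": api_key,  # assumed authentication scheme
        },
    )


# Hypothetical host and action name, for illustration only.
request = build_execution_request(
    "https://st2.example.com", "my-api-key",
    "ops.restart_service", {"host": "web1"},
)
```

Passing `request` to `urllib.request.urlopen` would submit it. Building the request separately keeps the sketch runnable without a StackStorm server.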

What exactly can you do with StackStorm?

We carry out many different change operations every day, but we often do the same things over and over again, such as installing software, restarting services, and pulling servers in and out of clusters. If you break the different change operations down, you end up with these small atomic operations.

In turn, we can combine these atomic operations. Just as Lego bricks can be assembled into all kinds of models, atomic operations can be combined into all kinds of change processes. This way the same operation only needs to be implemented once and can then be reused, which avoids reinventing the wheel and greatly improves development efficiency.
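The idea of reusable atomic operations can be illustrated with a small Python sketch. This is not StackStorm's workflow syntax. The action names, the `ACTIONS` registry, and the `run_workflow` helper are all hypothetical, and each action only records what it would do.

```python
def pull_out(ctx):
    ctx["steps"].append("pull_out:" + ctx["host"])


def pull_in(ctx):
    ctx["steps"].append("pull_in:" + ctx["host"])


def restart_service(ctx):
    ctx["steps"].append("restart_service:" + ctx["host"])


def install_software(ctx):
    ctx["steps"].append("install_software:" + ctx["host"])


# Each atomic operation is implemented once and registered by name.
ACTIONS = {
    "pull_out": pull_out,
    "pull_in": pull_in,
    "restart_service": restart_service,
    "install_software": install_software,
}

# Two different change processes built from the same atoms.
RESTART_FLOW = ["pull_out", "restart_service", "pull_in"]
UPGRADE_FLOW = ["pull_out", "install_software", "restart_service", "pull_in"]


def run_workflow(step_names, ctx):
    """Run the named atomic operations in order against one context."""
    for name in step_names:
        ACTIONS[name](ctx)
    return ctx
```

Adding a new change process then means writing a new list of step names rather than a new script.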

In terms of troubleshooting, let's look at a typical on-call case.

Suppose that at 2 o'clock in the morning an alarm fires because orders have dropped. The NOC opens a conference call and phones the relevant engineers. A half-awake engineer picks up and asks what went wrong, so the NOC has to describe the situation again. The engineer then hurries to open a computer, logs in to the intranet over VPN, checks the relevant monitoring metrics, troubleshoots based on personal experience, spends a lot of time locating the fault, applies a fix, and finally the service recovers.

What are the problems with this troubleshooting process?

1. The repair takes a long time.

2. Troubleshooting in the middle of the night is error-prone and affects the engineer's work the next day.

3. As the business grows, the number of alarms increases and they cannot all be handled in time.

4. Website availability decreases.

What does the troubleshooting process look like with StackStorm? StackStorm can receive alarms through a webhook. When an alarm arrives, StackStorm can first analyze it, based on expert experience or machine learning, to decide whether it can be handled automatically. If it can, StackStorm runs the troubleshooting operations and recovers from the fault.

If the alarm cannot be handled automatically, the fault information and the preliminary analysis results are collected and sent to the responsible engineers. This saves the time spent on gathering information and on initial troubleshooting, so the engineers can fix the fault quickly. Common, frequent failures that have a fixed way of being handled can be handled by StackStorm automatically.
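The decision described above, between fixing automatically and escalating with collected information, can be sketched as follows. The alarm fields, the `KNOWN_FIXES` table, and the `collect_info` callback are assumptions for illustration. A simple lookup table stands in for the "expert experience" step and is not how StackStorm itself is configured.

```python
# Hypothetical table of alarm types that have a fixed, automatable fix.
KNOWN_FIXES = {
    "disk_full": "clean_logs",
    "service_down": "restart_service",
}


def handle_alarm(alarm, collect_info):
    """Decide whether an alarm is fixed automatically or escalated.

    `alarm` is a dict with "type" and "host" keys.
    `collect_info` gathers diagnostics for the engineer when no
    automatic fix is known.
    """
    fix = KNOWN_FIXES.get(alarm["type"])
    if fix is not None:
        # Common failure with a fixed remedy: handle it automatically.
        return {"decision": "auto_fix", "action": fix, "host": alarm["host"]}
    # Unknown failure: attach a preliminary analysis and hand it over.
    return {
        "decision": "escalate",
        "host": alarm["host"],
        "report": collect_info(alarm),
    }
```

The returned dictionary would then drive the next step, either running the named action or notifying the on-call engineer with the report attached.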

StackStorm can be combined with ChatOps for day-to-day operations. For example, while you are attending GOPS, StackStorm can send you an alarm together with its preliminary analysis, and you can send instructions to StackStorm from a chat room on your mobile phone to fix the fault quickly.

Now that you have seen some of StackStorm's features, let's look at its deployment architecture. The yellow part contains the main StackStorm modules, including authentication, the API, the rules engine, the workers, ChatOps, the web UI, and so on. Mistral serves as the workflow engine and PostgreSQL as the database. MongoDB stores action definitions and logs, and RabbitMQ is the message queue for all tasks. It is a highly available architecture, with the worker and Mistral running on every server.

This is StackStorm's data flow diagram. StackStorm maps a chat message to an action through the rules engine. The atomic operations mentioned above are combined into workflows. Mistral parses the workflow, and the workers execute each specific action.

StackStorm has three major benefits:

Automation is developed more efficiently
Operation logic is visible
All operations are recorded in detail

3. Batch grayscale rollout

StackStorm has many advantages, but you do not want to run an operation on tens of thousands of servers by typing each one into StackStorm and clicking run, and then having to read StackStorm's hard-to-read output and error stack when something goes wrong.

What you want is to create a task, assign a set of servers to it, have it run at a given time, and then get a statistical summary of the results. So we developed a tool called Jobs to meet the need for automated operations on a large number of servers.

Its main objectives are as follows:

The first is automatic batching based on the selected batching strategy, such as 1 percent, 5 percent, or 10 percent of the servers.
The second is that operations are pluggable. The code that performs an operation is not implemented in Jobs, so Jobs has to be combined with StackStorm: Jobs sends the command to StackStorm, and the specific operation logic is implemented in StackStorm.
The last is result statistics. The task details page clearly shows how many servers succeeded and how many failed.
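Automatic batching by percentage can be sketched as below. This is a minimal illustration, not the actual Jobs implementation: it assumes that a strategy of "N percent" means every batch holds N percent of the selected servers, rounded up, and the function name `make_batches` is hypothetical.

```python
import math


def make_batches(servers, percent):
    """Split `servers` into consecutive batches of `percent` percent each."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    # At least one server per batch, even for very small server lists.
    size = max(1, math.ceil(len(servers) * percent / 100))
    return [servers[i:i + size] for i in range(0, len(servers), size)]
```

With 200 servers, a 10 percent strategy gives 10 batches of 20 servers and a 1 percent strategy gives 100 batches of 2. The smaller the percentage, the smaller the number of servers exposed to a bad change at any one time.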

Above is the new task screen in Jobs, with options for the batching policy, server filtering, and more.

This is the Jobs task details page. The task information is on the left, and the batch details are on the right. Because tasks run in batches, faults can be detected and the task stopped in time, which limits the scope of the impact.
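Running the batches one after another, stopping when a batch goes wrong, and counting the results might look like the sketch below. The `operate` callback stands in for sending the command to StackStorm, and the `max_failure_rate` threshold is an assumed stop condition. The source only says that faults can be detected and the task stopped in time.

```python
def run_in_batches(batches, operate, max_failure_rate=0.2):
    """Run `operate(server)` batch by batch and tally the results.

    `operate` returns True on success and False on failure. If the
    failure rate of a batch exceeds `max_failure_rate`, the remaining
    batches are not run.
    """
    stats = {"success": 0, "failed": 0, "not_run": 0, "stopped": False}
    for batch in batches:
        if stats["stopped"]:
            stats["not_run"] += len(batch)
            continue
        failures = 0
        for server in batch:
            if operate(server):
                stats["success"] += 1
            else:
                stats["failed"] += 1
                failures += 1
        if batch and failures / len(batch) > max_failure_rate:
            stats["stopped"] = True  # limit the impact to this batch
    return stats
```

The returned counts correspond to what a task details page would need to display: successes, failures, and the servers that were never touched because the task was stopped.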

4. Summary

To build an operations automation platform, first deploy a remote management framework, which can be SaltStack or Ansible. Then implement your day-to-day atomic operations on StackStorm and combine them into workflows according to specific operational requirements. Finally, for operations tasks across a large number of servers, consider developing a system with batch grayscale rollout to complete the automation.
