The demand for an automated operation and maintenance (O&M) system grows with the business and with steadily rising requirements for O&M efficiency and quality.

Preface: In many startups and SMEs, O&M is still in the primitive “slash-and-burn” stage. The “knife” and “fire” are the remote clients that O&M personnel use, such as SecureCRT and Windows Remote Desktop.

In this mode, server installation, initialization, software deployment, service release, and monitoring are all done by hand. O&M personnel have to log in to the servers and manage them one by one. This serial, non-concurrent way of working is the biggest obstacle to efficiency.

In addition, manual operation can leave server configurations inconsistent: servers in the same group may end up configured differently.

Sometimes the difference is hard to detect directly, for example when a single server inside a load-balancing group behaves abnormally.
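One way to surface this kind of drift is to fingerprint each server's configuration and flag the group members that differ from the majority. The following is a minimal sketch, assuming the configuration text of every server has already been collected; the function name `find_config_outliers`, the host names, and the sample configuration are all hypothetical.

```python
import hashlib
from collections import Counter


def find_config_outliers(configs):
    """Return hosts whose config differs from the most common one in the group.

    `configs` maps a host name to the text of its configuration file.
    """
    # Fingerprint each configuration so that large files compare cheaply.
    digests = {
        host: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for host, text in configs.items()
    }
    if not digests:
        return []
    # Treat the most common fingerprint as the group's reference configuration.
    reference, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(host for host, digest in digests.items() if digest != reference)


# Hypothetical load-balancing group in which one server was edited by hand.
group = {
    "web-01": "worker_processes 4;\nkeepalive_timeout 65;\n",
    "web-02": "worker_processes 4;\nkeepalive_timeout 65;\n",
    "web-03": "worker_processes 2;\nkeepalive_timeout 65;\n",
}
print(find_config_outliers(group))
```

Majority voting is only a heuristic: if most servers in a group have drifted, the "reference" itself is wrong, so a known-good baseline is the safer comparison when one exists.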

As services grow and the number of servers increases, O&M personnel turn to scripts and batch management tools.

Scripts and batch management tools do improve efficiency and quality compared with the slash-and-burn way of working.

But there are still many problems with this approach.

  • The first is the lack of standardization. Scripts written by different O&M personnel vary greatly in programming language, coding style, and robustness, and version control of these scripts is also a challenge.
  • The second is script inheritance. When staff leave or hand over their work, scripts often cannot be passed on and reused, because the next person may not be able to understand or modify scripts written by the previous one.
  • The third is the choice of batch management tools. Different administrators choose different tools, which inevitably leads to management chaos and makes it hard for O&M personnel to back each other up.

Therefore, the need to build an automated O&M system becomes more and more urgent. An automated O&M system is the sound way to achieve standardization and improve engineering efficiency. So how do you build one?

This case study is divided into three broad areas:

  • The first part explains why to build an automated O&M system and what it should be, which answers the “Why” and “What” of the “3W” (Why, What, How).
  • The second part describes how each O&M subsystem in our company is designed, how it runs, and how we handled the problems that came up, which answers the “How”.
  • The third part reflects on some of the problems we met while automating O&M and draws conclusions.

I. Why build an automated operation and maintenance system

Let’s take a look at why we want to build an automated O&M system. First, look at some of the challenges O&M faces, as shown in the figure below.

Challenges of operation and maintenance

The first challenge comes from the games themselves, in three respects:

  • First, there are many games. Our company now operates nearly 100 of them.
  • Second, the game architectures are complex. One big difference between game companies and ordinary Internet companies is that games come from many sources: foreign and domestic, large studios and small ones. Each game may have a different architecture, some partitioned into zones and some centralized, each with its own requirements.
  • Third, the operating systems are varied, for much the same reason. Game developers have different backgrounds and programming preferences, so we run both Windows and Linux.

The second challenge is the hardware environment, mainly the number of servers and the variety of server models.

The company has been in business for more than ten years. Over that time, servers were purchased in batches and installments that span almost every major product line of the major OEMs, so the fleet is a mix of many different models.

Finally, there is the human factor, which is a relatively important consideration when building an automated O&M system.

If everyone were highly skilled, one person could often do all the work, and there might be no need for automation.

It is precisely because O&M personnel differ in ability, technical level, and even in their habits and tools that we must build a standard automated O&M system to improve efficiency.

II. Goals of building an automated operation and maintenance system

Next, look at the goals of building this system. In other words, what are our principles?

The author sums up the construction goals of an automated O&M system in four words.

  • The first is “complete”: the system should cover all O&M requirements.
  • The second is “simple”: easy to learn and easy to use. If the operating procedures, the interface, or the design ideas of the system are complicated, the learning cost for O&M personnel is high, the system is used less effectively, and its capability and efficiency suffer.
  • The third is “efficient”: especially for batch processing or specific tasks, we want the system to give the user feedback promptly.
  • The fourth is “secure”: an insecure system may quickly be taken over by attackers, so security is also an important factor.

III. Structure and operation mode of the automated operation and maintenance system

The following figure shows several subsystems of our current automated operation and maintenance system. Let’s take a look at how they work together.

A server is installed by the automatic installation system and then taken over by the automated O&M platform. The automated O&M platform provides the underlying support for the automated security check system, the automatic client update system, and the automatic server update system.

The automated data analysis system is tied to the automatic client update system: it reports on the results of client updates.

Automatic operation and maintenance architecture diagram

Let’s take a look at how each subsystem is designed and works.

3.1. Automatic installation system

Automatic installation is probably familiar to most readers. The challenge here is “two many, two few”: many hardware models and many operating systems, but few people and little available time.

As shown in the following figure, the entire process uses a common framework. The server boots from PXE, the type of operating system to install (Windows or Linux) is selected, and for Windows the required drivers are identified automatically. Before the server is delivered to the user, basic security settings are applied, such as configuring the firewall and turning off Windows sharing, which improves security to some extent and reduces manual work.

Automatic Installation Flowchart

3.2. Automatic operation and maintenance platform

After a server is installed by the automatic installation system, it is taken over by the automated O&M platform. This platform is the working console for O&M personnel, and it mainly solves the management problems caused by heterogeneous servers and operating systems. Because the operating systems are so varied, we considered the following factors when designing the system:

Make the user interface of the whole system browser-based. O&M engineers can log in to the management system anytime and anywhere to perform O&M operations, which is convenient. The Octopod server then issues the instructions to the machines being operated on.

Manage heterogeneous servers in a unified way. You may have disliked managing Windows in the past, but Windows can in fact be managed very well. We use open-source SSH to manage Windows, so that we can apply patches, manage passwords, and run operations in batches.

Take full advantage of existing protocols and tools. A distinguishing feature of this platform is that every system is managed over SSH instead of through an agent we developed ourselves, which also reflects our view of automated O&M.

Most of the time there is no need to reinvent the wheel. Even if we built our own agent, it would most likely not have been rigorously proven in a production environment.

The SSH protocol has existed for many years and has been used in our company for many years, so its problems surfaced long ago. Compared with a home-made wheel, SSH is more stable, better tested, and more convenient to use.
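The agentless idea can be sketched in a few lines: run the same command on many hosts concurrently through the stock OpenSSH client instead of logging in one by one. This is only an illustration, not the Octopod implementation. It assumes the `ssh` binary is installed and key-based login is already set up, and the function names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_ssh_command(host, remote_command, timeout_s=10):
    """Build the argument list for one non-interactive OpenSSH invocation."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt for a password
        "-o", f"ConnectTimeout={timeout_s}",  # give up on unreachable hosts
        host,
        remote_command,
    ]


def run_one(host, remote_command):
    """Run the command on one host and return (host, exit code, stdout)."""
    result = subprocess.run(
        build_ssh_command(host, remote_command),
        capture_output=True,
        text=True,
    )
    return host, result.returncode, result.stdout


def run_on_hosts(hosts, remote_command, run=run_one, max_workers=20):
    """Run the same command on many hosts concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda host: run(host, remote_command), hosts))
```

A call such as `run_on_hosts(["game-01", "game-02"], "uptime")` (host names are placeholders) returns one tuple per host. The `run` parameter exists so that the fan-out logic can be exercised without a network.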

3.3. Automated security check system

The next system is the automated security check system. We have many subsystems and many businesses, so how do we design a system that keeps them secure? It has two main parts: the automated security platform and the server side.

Let’s start with the automated security platform. One difference between game companies and ordinary Internet companies is that game companies ship a lot of client software (some of it very large) and patch files for players to download, update, and install.

If viruses or trojans are found in these files, the result is very bad for the business and for the company’s reputation. Before these files reach players’ computers, they must pass through a virus detection system to make sure no malicious code has been injected.

On the server side, security is ensured mainly through a security scanning architecture.

Security is not a one-time fix. If you do not constantly check, inspect, and probe your systems, a single misstep can expose them to the Internet and to malicious attackers.

This problem can be avoided to a large extent by proactively scanning all servers.

For example, last year we ran into a certain model of switch whose ACLs failed completely once the number of ACL entries reached a certain threshold.

Without a supporting mechanism to check and detect this, the servers, ports, or sensitive IPs that you believe are well protected may already be exposed. Proactive detection of this kind therefore eliminates many security problems caused by systems or by people.
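The core of such a proactive check is simple: probe each server from outside and report any port that answers but is not supposed to. Here is a minimal sketch, assuming a plain TCP connect probe and a per-host allow-list; both function names and the policy format are hypothetical, and a production scanner would do much more.

```python
import socket


def is_port_open(host, port, timeout_s=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def find_unexpected_exposure(allowed_ports, ports_to_probe, probe=is_port_open):
    """Return {host: [ports]} that are reachable but not in the allow-list.

    `allowed_ports` maps each host to the set of ports that may be reachable
    from the scanner's vantage point.
    """
    findings = {}
    for host, allowed in allowed_ports.items():
        exposed = [
            port for port in ports_to_probe
            if port not in allowed and probe(host, port)
        ]
        if exposed:
            findings[host] = exposed
    return findings
```

Because the check looks at what is actually reachable, it would catch a case like the failed switch ACLs, where the configuration looks right but the protection is not in effect.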

3.4. Automatic client update system

Game traffic is cyclical. On the day a game is released or a new version ships, players are very active and download volume is high, while at ordinary times the update and download bandwidth may be small. This is a distinctive feature of games.

This pattern makes building a distribution system a real challenge.

The first challenge is that the bandwidth generated by the game can reach hundreds of gigabytes at peak times.

Second, many small and medium-sized carriers run some kind of caching mechanism. If it misbehaves, it affects the business. This is the problem of illegal caching.

The third is DNS scheduling.

DNS scheduling resolves based on the player’s Local DNS server rather than the player’s own address, which can make scheduling inaccurate.

The fourth is DNS pollution, together with the DNS TTL mechanism, which makes scheduling less responsive and less accurate. We built the following two systems to address these problems.

The first is Autopatch, which handles large-file update downloads together with traffic scheduling across multiple CDN vendors. Its workflow is fairly simple: O&M personnel upload and verify the files, the files are synchronized to the CDNs, the CDNs distribute them to the relevant edge nodes, and finally the files are decompressed.

As mentioned, game traffic is cyclical: bandwidth is modest at ordinary times but large at certain moments or during major events.

Building our own CDN would not be cost-effective, so we brought in many of the large CDN vendors in China and schedule across their resources. Instead of pointing the domain name at one or more of them, we schedule through HTTP 302 redirects.

Scheduling by proportion is hard to do with CNAME records alone, especially when the bandwidth is more than a single CDN vendor can carry, or when one vendor has a regional failure and must be removed quickly.

The centralized scheduling system makes proportional scheduling possible. Every user request first reaches our scheduler, which generates no download bandwidth itself. Instead, an algorithm redirects the request to a third-party CDN vendor according to proportion and region, and the player then downloads the client from that vendor.
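Proportional scheduling of this kind boils down to a weighted random choice followed by a redirect. The sketch below shows the idea under simple assumptions: per-region weights are kept in a dictionary, and all vendor names, host names, and function names are hypothetical rather than taken from Autopatch.

```python
import random


def pick_cdn(weights, rng=random):
    """Pick a CDN vendor at random, in proportion to its weight."""
    vendors = list(weights)
    return rng.choices(vendors, weights=[weights[v] for v in vendors], k=1)[0]


def redirect_for(path, region, region_weights, cdn_hosts, rng=random):
    """Return the (status, Location) pair for one download request."""
    vendor = pick_cdn(region_weights[region], rng)
    return 302, f"http://{cdn_hosts[vendor]}{path}"


# Hypothetical setup: two vendors share northern traffic 7:3.
region_weights = {"north": {"cdn_a": 7, "cdn_b": 3}}
cdn_hosts = {"cdn_a": "cdn-a.example.com", "cdn_b": "cdn-b.example.com"}
print(redirect_for("/client/setup.exe", "north", region_weights, cdn_hosts))
```

Setting a vendor's weight to 0 takes it out of rotation for new requests, as long as at least one other vendor in the region keeps a positive weight. This is what makes removing a failing vendor quick compared with waiting for DNS changes to propagate.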

The second is the Dorado system. As mentioned, the illegal caching mechanisms of some small carriers can affect the service. For certain key files, a cached old version can cause serious problems. Take our zone list as an example: if we add a new zone on the server and it does not show up on the client, players cannot enter the new zone to play.

To solve these problems, we designed a system known internally by the code name Dorado. These key files are small and few in number, but they need to be served over HTTPS, because encryption keeps small carriers from caching them.

Therefore, we run our own nodes for these key files and support HTTPS on those nodes, which avoids the problems caused by carrier caches.

3.5. Automatic server update system

Our server-side update mode is also a traditional, CDN-like approach. Target servers download through a cache node, which in turn fetches from the central node, and the cache node controls what is cached. This reduces the amount of data transmitted between networks and improves efficiency.

When we designed this system, we also thought about using P2P to do it. P2P is cool and saves bandwidth, but there are several issues when it comes to large file distribution in production environments.

One is security control: it is hard to let these servers transfer data to each other while still keeping their ports protected.

The other is that flow control or rate limiting is difficult in P2P. So we ended up with a seemingly simple architecture.

3.6. Automated data analysis system

When it comes to client updates, we were often in the dark about how well an update went and whether players had successfully installed the client or entered the game. All we could do was read the logs.

But much of the information in the logs is incomplete. When players download the client, the HTTP log is full of 206 (Partial Content) status codes, so it is hard to work out how many players downloaded the client in full, or even whether a given player downloaded it at all and whether the verification result was correct. So we ended up designing an automated data analysis system. Its goal is to analyze how users convert from downloading the game to logging in.
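To see why 206 responses are awkward to count, consider what it takes to decide from logs alone whether one player has the whole file: the byte ranges of all partial responses must be merged and compared with the file size. The sketch below assumes the log lines have already been parsed into inclusive `(start, end)` byte ranges per player, which is itself an assumption; the function names are hypothetical.

```python
def covered_bytes(ranges):
    """Count the distinct bytes covered by inclusive (start, end) ranges."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(ranges):
        if current_end is None or start > current_end + 1:
            # Gap before this range: close the previous merged run.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent range: extend the current run.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def is_complete_download(ranges, file_size):
    """True if the merged ranges cover every byte of the file."""
    return covered_bytes(ranges) == file_size


# Two partial responses that together cover a 200-byte file.
print(is_complete_download([(0, 99), (100, 199)], 200))
```

Even with this merging, the logs say nothing about whether the file passed verification or was ever installed, which is the gap that client-side reporting fills.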

Ideally, the user downloads the client and then enters the game, but that is the ideal case.

In many cases the user fails to download because of a poor network, or cannot log in to the game because of an account problem.

So the data forms a funnel. Our goal is to bring the number of users who log in as close as possible to the number of users who started downloading the client.
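A funnel like this is usually read as stage-to-stage conversion rates, which show where users drop out. Here is a small sketch with made-up stage names and counts; none of the numbers come from our data.

```python
def funnel(stage_counts):
    """Return (stage, users, conversion from the previous stage) per stage."""
    rows = []
    previous = None
    for stage, users in stage_counts:
        # The first stage has no predecessor, so it has no conversion rate.
        rate = None if previous in (None, 0) else users / previous
        rows.append((stage, users, rate))
        previous = users
    return rows


# Hypothetical counts for one day.
stages = [
    ("download_started", 1000),
    ("download_finished", 800),
    ("installed", 600),
    ("logged_in", 450),
]
for stage, users, rate in funnel(stages):
    print(stage, users, "" if rate is None else f"{rate:.0%}")
```

The stage with the lowest conversion rate is where to look first, whether the cause turns out to be the network, the installer, or the account system.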

Let’s look at the architecture of the system. First, the downloader or installer on the player side, and the game client itself, integrate an SDK that reports data at every key point, such as a click on the “download” or “stop” button, without involving sensitive information. The reports are received by a Tomcat cluster, and after the cluster processes the data, it is written to MongoDB.

Take one game’s download guidance as an example of what can go wrong. The left column offers three files: one is 3 MB and the other two are more than 2 GB each. You can imagine what happens. Many players see the small file and download and install only that one, so their client is incomplete. This tells us that in operations and on the business side, user guidance has to be sensible in order to avoid such problems.

3.7. Automatic data backup system

The first version of our backup system was relatively simple in design and implementation: each equipment room had an FTP server, and the data in that room was written to the FTP server and then to tape. As a result, the tapes were scattered and there was no centralized place to store them. In addition, FTP-based uploads place demands on bandwidth and even on latency.

Then we designed a centralized backup system. It mainly solves two problems. The first is simplifying configuration. The configuration for every equipment room is just one load balancer IP. When a client needs to upload a file, it first obtains the actual upload address from the load balancer and then uploads the file. A receiving server (the second box from the left in the figure) accepts the file and verifies its MD5 value, and if the check passes, the file is written to a Hadoop HDFS cluster. The cluster currently holds tens of petabytes and takes in several terabytes of uploads per day.
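The MD5 check on the receiving side can be sketched as follows: hash the received file in chunks and accept it only if the digest matches the one the client declared. The function names are hypothetical, and the temporary file merely stands in for an uploaded backup.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large backups do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def accept_upload(path, expected_md5):
    """Accept the upload only if its MD5 matches the value the client declared."""
    return md5_of_file(path) == expected_md5.lower()


# Demo with a throwaway file standing in for an uploaded backup.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"backup payload")
    upload_path = tmp.name
declared_md5 = hashlib.md5(b"backup payload").hexdigest()
ok = accept_upload(upload_path, declared_md5)
corrupted = accept_upload(upload_path, "0" * 32)
os.remove(upload_path)
print(ok, corrupted)
```

MD5 is adequate here for detecting transfer corruption. It should not be relied on to detect deliberate tampering.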

The second is improving transmission efficiency and success rate. In China, the network environment is very complex, and the barriers between carriers lead to network instability, packet loss, and delay. How do you solve these problems? If large files are transferred over TCP, the bandwidth-delay product puts a theoretical limit on the throughput of a single connection.
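To make the limit concrete: a TCP sender can have at most one window of unacknowledged data in flight per round trip, so throughput on one connection is bounded by window size divided by round-trip time. The numbers below are illustrative assumptions, not measurements from the system described here. The 65,535-byte window is the maximum that TCP allows without the window scaling option.

```python
def max_tcp_throughput_bps(window_bytes, rtt_ms):
    """Upper bound on single-connection throughput: one window per round trip."""
    return window_bytes * 8 * 1000 / rtt_ms


def bandwidth_delay_product_bytes(bandwidth_bps, rtt_ms):
    """Bytes that must be in flight to keep a link of this bandwidth full."""
    return bandwidth_bps * rtt_ms / 8000


# A 65,535-byte window over a 100 ms round trip (assumed values).
print(max_tcp_throughput_bps(65535, 100) / 1e6, "Mbit/s")

# Window needed to fill an assumed 1 Gbit/s path with the same round trip.
print(bandwidth_delay_product_bytes(1_000_000_000, 100) / 1e6, "MB")
```

With these assumed figures, the unscaled window caps one connection at roughly 5 Mbit/s, while filling the 1 Gbit/s path would need about 12.5 MB in flight. Packet loss lowers the achievable rate further, because TCP shrinks its congestion window when it detects loss.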

Our innovation here is that the client uploads over UDP. UDP itself has no flow or congestion control, so the client can send file fragments as fast as it likes. At the end, the server checks which fragments it has received and tells the client to re-upload any that are missing. This avoids many of the problems caused by network jitter or high latency. Of course, flow control can also be done on the client side. Think a little more when you run into a problem, and you may find an unconventional solution.
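The send-everything-then-repair loop can be illustrated without real sockets. The sketch below simulates a lossy network with a random drop rate. It is an assumption-laden model of the idea rather than our actual protocol, and every name in it is hypothetical.

```python
import random


def split_into_fragments(data, fragment_size):
    """Map fragment index -> bytes for each fixed-size piece of the file."""
    return {
        index: data[offset:offset + fragment_size]
        for index, offset in enumerate(range(0, len(data), fragment_size))
    }


def missing_fragments(received, total_fragments):
    """Server-side check: which fragment indexes have not arrived yet?"""
    return [i for i in range(total_fragments) if i not in received]


def upload(data, fragment_size, loss_rate, rng, max_rounds=50):
    """Send all fragments, then resend only the missing ones until complete."""
    fragments = split_into_fragments(data, fragment_size)
    received = {}
    to_send = list(fragments)
    rounds = 0
    while to_send and rounds < max_rounds:
        rounds += 1
        for index in to_send:
            # Simulated lossy network: each datagram is dropped independently.
            if rng.random() >= loss_rate:
                received[index] = fragments[index]
        to_send = missing_fragments(received, len(fragments))
    if to_send:
        raise RuntimeError(f"{len(to_send)} fragments still missing")
    rebuilt = b"".join(received[i] for i in range(len(fragments)))
    return rebuilt, rounds


payload = bytes(range(256)) * 40
rebuilt, rounds = upload(payload, 100, loss_rate=0.3, rng=random.Random(42))
print(rebuilt == payload, rounds)
```

Each round only retransmits what is missing, so a lost fragment costs one extra fragment later instead of stalling the whole transfer the way a loss on a single TCP connection does.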

3.8. Automatic monitoring and alarm system

Finally, take a look at the game’s automated monitoring and alarm system (as shown below). A game’s architecture includes the client, the servers, and the network links, so a fairly complete system is needed for all-round, multi-layer monitoring, in order to give early warning before business problems occur, or to raise an alarm when they do.

For equipment-room links, we monitor IDC (Internet Data Center) network quality. For servers, network equipment, and hardware, we do server health checks, performance monitoring, and network equipment and traffic monitoring.

At the system level, we collect and analyze system logs. For the game’s server-side applications, there is server-side program monitoring. On the client side, the embedded SDK collects data on the results of downloads and updates, as well as crash data.

As O&M personnel, when we think about problems or design an architecture, we should not focus only on one technical aspect or on how cool a technology is. We should also think about the business the technology serves, and about whether business indicators can be used to monitor our O&M capability and O&M systems.

In games, a very important indicator is the number of online players. By monitoring this business indicator, you can tell whether the systems are working properly and whether the monitoring has missed something or raised false alarms, because a problem in any component will very often show up in the business data.

So we have a system that monitors the number of online players. Before each game goes live, it is connected to this system, which collects the online count in real time. If the count jitters abnormally, the system shows it, and you know there may be a problem.
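One simple way to detect abnormal jitter in an online-player series is to compare each sample with the average of the samples just before it. The window size, the 30% threshold, and the sample data below are assumptions chosen for illustration; the detection rule our system actually uses is not described here.

```python
from statistics import mean


def detect_jitter(samples, window=5, max_change=0.3):
    """Return indexes whose value deviates from the recent average too much.

    A sample is flagged when it differs from the mean of the previous
    `window` samples by more than `max_change` (as a fraction).
    """
    alerts = []
    for i in range(window, len(samples)):
        baseline = mean(samples[i - window:i])
        if baseline > 0 and abs(samples[i] - baseline) / baseline > max_change:
            alerts.append(i)
    return alerts


# Hypothetical online counts sampled once a minute, with one sudden drop.
online = [1000, 1010, 990, 1005, 995, 1000, 400, 1002]
print(detect_jitter(online))
```

A fixed threshold against a short moving average ignores daily cycles, so a real deployment would more likely compare against the same time on previous days.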

That is the overall framework. Now let’s look at the details of how server monitoring is done. First, O&M engineers configure monitoring policies on the monitoring policy platform. The platform formats the policy data as required and pushes it to the automated O&M platform.

The automated O&M platform determines whether a check is an external, remote probe, such as a network simulation, or local monitoring.

Depending on the type, the policy is pushed to a remote detection server or, as with monitoring of traffic, local processes, and local logs, to the game server itself, which then reports the data. After the data is reported, alarms are triggered based on the thresholds configured by O&M engineers, and the engineers are notified to handle them. There are many different games and many different operating systems, but some things can always be shared, such as monitoring templates or monitoring policies, and we have consolidated the server-side pieces as well.

As you can see, we have a lot of plug-ins. O&M personnel only need to choose the relevant plug-ins and configure thresholds and cycles, which saves time and learning cost and makes configuring policies more efficient. Once a policy is configured, it is bound directly to the servers to be monitored.
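The plug-in, threshold, cycle, and binding model described above can be expressed as a small data structure. This is a sketch under assumptions: the `Policy` fields, plug-in names, and server names are hypothetical, and real alarms would go to a notification channel instead of being returned as a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    plugin: str         # which metric plug-in to run, e.g. "cpu_percent"
    threshold: float    # alarm when the reading exceeds this value
    cycle_seconds: int  # how often the plug-in is run


def evaluate(policy_bindings, readings):
    """Return (server, plugin, value) for every reading above its threshold.

    `policy_bindings` maps a server to the policies bound to it;
    `readings` maps (server, plugin) to the latest reported value.
    """
    alarms = []
    for server, policies in policy_bindings.items():
        for policy in policies:
            value = readings.get((server, policy.plugin))
            if value is not None and value > policy.threshold:
                alarms.append((server, policy.plugin, value))
    return alarms


# One shared template bound to two servers.
template = [Policy("cpu_percent", 90.0, 60), Policy("disk_percent", 85.0, 300)]
bindings = {"game-01": template, "game-02": template}
readings = {
    ("game-01", "cpu_percent"): 97.5,
    ("game-01", "disk_percent"): 40.0,
    ("game-02", "cpu_percent"): 35.0,
}
print(evaluate(bindings, readings))
```

Binding one template to many servers is what keeps the configuration effort flat as the number of games and servers grows.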

In summary, we have been building our automated O&M system since the beginning of 2000. Looking back, I think three points are worth sharing for your reference.

The first is to proceed step by step. Small and medium-sized companies and startups in particular often do not need a “high-end” system. Focus on the current problem and solve it, and the later problems will be solved in turn. If you design a large, feature-rich system from the start, things can get out of control. For example, the system may eventually break down, development may become unmanageable because of coupling, or the project may run aground for lack of funding.

But if the initial goal is to solve some specific problems, it is much easier to move forward.

In building our own automated O&M system, we first built a basic platform for batch operations on servers and moved the most repetitive work onto it. We then enriched the platform’s functions according to O&M needs to improve efficiency, and finally connected it with the surrounding systems to form a complete automated O&M system.

Second, consider scalability. When designing a system, you may not think through every function or design detail, but you should consider whether the system will still be usable as the number of servers grows, from tens to hundreds or even thousands.

The third is practicality, which is also reflected in our system.

In many cases, there are already mature protocols and tools on the market. Evaluate them to see if they work in a production environment, and if they do, just use them. There is no need to make one yourself.

A home-made set of tools, much of it unproven, can introduce security problems.

Building on mature protocols and frameworks improves efficiency and helps ensure stability and security. As the section “Automatic operation and maintenance platform” shows, we did not develop an agent from scratch and install it on the managed servers. We used the open SSH protocol and the mature OpenSSH software.

This reflects the idea of preferring open-source solutions, with secondary development where needed, over reinventing the wheel.

Source: http://blog.sina.cn/dpool/blog/s/blog_941cfba00102xznb.html