Best practices in game operations and maintenance: Sohu Changyou's journey to automated O&M

Evolution of Changyou's O&M management system

After graduating in 2008, I joined Sohu Changyou as an intern. I grew up with the company and watched its entire O&M management system grow from small to large.

That system has evolved from the Stone Age (scripts) through the Bronze Age (semi-automation) and the Steam Age (DevOps), and is now in transition toward full automation and intelligence.

Evolution of O&M automation at Changyou

An analysis of our failures found that 40% of them were caused by inaccurate data. This happened either because each system maintained its own copy of the information or because information was exchanged between many systems.

So the first thing to do was to unify the data system: its accuracy, and the interfaces through which data is called and referenced. After that, we developed a series of platforms for data and document distribution and standardized the various platforms.

At the bottom layer, a hybrid cloud model is adopted. On top of it sit a number of support systems, such as Seal, centralized configuration management, and management and service systems, as well as the two most important ones: the Angel system and the monitoring and alerting system.

The main responsibility of the Angel system is permission management. Changyou's O&M engineers are each responsible for different games. Game versions are especially sensitive: once a version is leaked, it can have a great impact on the revenue of the whole game.

Therefore, each engineer's permissions are strictly managed. The monitoring and alerting system is important because it affects the experience of all players and the income of all games.
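
To make the idea of per-engineer, per-game permissions concrete, here is a minimal sketch. It is not the Angel system's actual design, which the article does not describe; the class PermissionRegistry, its methods, and the engineer, game, and action names are all hypothetical. The one design choice it illustrates is deny by default: anything not explicitly granted is refused.

```python
from dataclasses import dataclass, field


@dataclass
class PermissionRegistry:
    """Hypothetical model: engineer -> game -> set of allowed actions."""
    grants: dict = field(default_factory=dict)

    def grant(self, engineer: str, game: str, action: str) -> None:
        self.grants.setdefault(engineer, {}).setdefault(game, set()).add(action)

    def revoke(self, engineer: str, game: str, action: str) -> None:
        # Revoking something that was never granted is a no-op.
        self.grants.get(engineer, {}).get(game, set()).discard(action)

    def is_allowed(self, engineer: str, game: str, action: str) -> bool:
        # Deny by default: only explicitly granted actions are allowed.
        return action in self.grants.get(engineer, {}).get(game, set())


# Hypothetical engineer, game, and action names.
registry = PermissionRegistry()
registry.grant("alice", "game_a", "deploy")

allowed_own_game = registry.is_allowed("alice", "game_a", "deploy")
allowed_other_game = registry.is_allowed("alice", "game_b", "deploy")
```

In this sketch an engineer who may deploy one game gets no access to any other game unless it is granted separately, which is the property the leak concern above calls for.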

Characteristics and pain points of game O&M

Faced with such an O&M architecture, which parts has Changyou automated? Let's start by looking at the characteristics and pain points of game O&M.

Each game differs completely in architecture and application scenarios, and even in the database and development language it uses. Some games were developed in other countries, so operating systems, database environments, and versions also differ widely. As a result, operating and maintaining the platform and its environments as a whole is a great challenge.

There are many pain points of game operation and maintenance, such as:

O&M scripts and tools are scattered, numerous, and difficult to reuse.
High elasticity of resource demand.
Balance of cost, efficiency and availability.
High concurrency with heavy traffic.
Faults need to be handled in real time and rectified as soon as possible.
Multiple version management.

To overcome these pain points, Changyou's O&M team has done a great deal over the past four or five years, during which both the business and the number of engineers changed.

From 2014 to 2016, the business grew by 20% a year while the number of full-time engineers kept falling. This is because, since 2014, we have built a large number of automation tools; through the automation platform and resource consolidation, annual resource costs fell by 30%.

In 2016 the Seal system (CMDB) went live. It integrated all online resources and completed the construction of public clusters, which bring together the public services needed by individual games and by each group of games, reducing resource costs by 50%.

Interestingly, the number of human-caused failures was basically flat from 2014 to 2015, a side effect of introducing automation. In 2016 human-caused failures dropped by 30%, which is when automation really started to pay off.

The overall failure rate (network failures, hardware failures, and all other failures) fell by 20% from 2014 to 2015 and by 35% in 2016.

The underlying causes we identified are as follows:

40% of human-caused failures come from inaccurate information or operator error.
30% of human-caused failures come from process handoffs and communication barriers with R&D.
More than 50% of the cost comes from idle or failed resources and from underused server performance.

To address these causes, Changyou's O&M team has done many things. The rest of this article mainly shares how we unified and standardized information through the Seal system, and how we achieved automated DevOps delivery through the PaaS platform, Docker container technology, and a hybrid cloud architecture.

Technical and logical architecture of the game O&M automation platform

The platform is developed mainly in Python. Since 2015 we have found that if all development is done in Java, participation by O&M staff is very low.

The following figure shows the system architecture of automated O&M tasks:

The automated O&M task system is a basic operations platform that combines open source technology with the company's existing resources. In addition to basic O&M scenarios such as script execution and scheduled tasks, it provides a process development framework that lets O&M engineers build the service maintenance functions they need.
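
As an illustration of what a process development framework can look like, here is a minimal sketch in which an engineer registers ordinary functions as steps of a maintenance process. It is an assumption for illustration, not Changyou's framework: the Workflow class, the step decorator, and the server and step names are all hypothetical.

```python
from typing import Callable

Step = Callable[[dict], None]


class Workflow:
    """Hypothetical process framework: ordered steps sharing one context dict."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.steps: list = []

    def step(self, func: Step) -> Step:
        """Decorator that registers a function as the next step."""
        self.steps.append((func.__name__, func))
        return func

    def run(self, context: dict) -> list:
        """Run the steps in order, stop at the first failure, return a log."""
        log = []
        for step_name, func in self.steps:
            try:
                func(context)
                log.append(f"{step_name}: ok")
            except Exception as exc:
                log.append(f"{step_name}: failed ({exc})")
                break
        return log


maintenance = Workflow("restart_game_server")


@maintenance.step
def stop_service(ctx: dict) -> None:
    ctx["running"] = False


@maintenance.step
def start_service(ctx: dict) -> None:
    ctx["running"] = True


context = {"server": "game-01", "running": True}  # hypothetical server name
log = maintenance.run(context)
```

Stopping at the first failed step keeps a half-finished process from running further steps against a server in an unknown state; a real platform would also need retries, auditing, and permission checks.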

Seal System (CMDB)

The Seal system holds records of all information at Changyou's hardware, application, and network layers: equipment, configuration, and the associated permissions, topology, environments, and processes. On top of this information, it is organized around applications and driven by business scenarios.

The functional architecture is divided, from bottom to top, into the data source, the data layer, and the application layer. The data source manages the servers and related software and hardware asset information of the system center, and is the origin of all system asset information. The data layer queries, changes, and manages all assets, and presents their status through charts in the statistical report module.
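
The query, change, and report duties of the data layer can be sketched as follows. This is a simplified in-memory model under stated assumptions, not the Seal system's schema: the Asset fields, the AssetStore class, and the sample asset IDs and game names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Asset:
    asset_id: str
    kind: str            # e.g. "server" or "switch"
    game: str            # the game the asset is allocated to
    status: str = "idle"


class AssetStore:
    """Hypothetical data layer: one place to query, change, and count assets."""

    def __init__(self) -> None:
        self._assets = {}

    def add(self, asset: Asset) -> None:
        if asset.asset_id in self._assets:
            raise ValueError(f"duplicate asset id: {asset.asset_id}")
        self._assets[asset.asset_id] = asset

    def change(self, asset_id: str, **fields) -> Asset:
        asset = self._assets[asset_id]
        for name, value in fields.items():
            if not hasattr(asset, name):
                raise AttributeError(f"unknown asset field: {name}")
            setattr(asset, name, value)
        return asset

    def query(self, **filters) -> list:
        return [
            asset for asset in self._assets.values()
            if all(getattr(asset, key) == value for key, value in filters.items())
        ]

    def report(self, field_name: str) -> Counter:
        """Count assets by one field, as input for a statistics chart."""
        return Counter(getattr(asset, field_name) for asset in self._assets.values())


store = AssetStore()
store.add(Asset("srv-001", "server", "game_a", "in_use"))
store.add(Asset("srv-002", "server", "game_a"))
store.add(Asset("sw-001", "switch", "game_b", "in_use"))

store.change("srv-002", status="in_use")
game_a_servers = store.query(kind="server", game="game_a")
status_report = store.report("status")
```

Routing every change through one store is what lets other systems reference asset data instead of keeping their own copies, which is the inaccuracy problem described earlier.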

Here is the interface of the Seal system (CMDB):

All information about PC client games and mobile games is centralized in the Seal system, so asset management specialists can initialize and allocate all resources through this platform.

The PaaS platform

The main responsibilities of PaaS are as follows:

Provides a consistent environment.
Provides application multi-tenant isolation and resource multi-tenant isolation.
Provides service discovery, elastic scaling, status management, resource allocation, and dynamic scheduling capabilities.
Supports pre-publishing, one-click publishing, one-click rollback, and automated deployment.
Provides transparent monitoring and disaster recovery.
Provides multi-angle business scenarios for O&M, development, and testing.
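
One-click publishing and rollback, from the list above, come down to keeping an ordered release history. The sketch below shows only that bookkeeping, under the assumption that a separate deployment step applies whichever version is current; the ReleaseManager class and the version strings are hypothetical and not part of Changyou's PaaS.

```python
from typing import Optional


class ReleaseManager:
    """Hypothetical release bookkeeping for a single application."""

    def __init__(self) -> None:
        self._history = []

    @property
    def current(self) -> Optional[str]:
        return self._history[-1] if self._history else None

    def publish(self, version: str) -> str:
        """Record a new release and make it the current one."""
        self._history.append(version)
        return version

    def rollback(self) -> str:
        """Drop the current release and return to the previous one."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._history.pop()
        return self._history[-1]


releases = ReleaseManager()
releases.publish("1.0.0")
releases.publish("1.1.0")
restored = releases.rollback()
```

Because rollback only ever returns a version that was published before, an operator cannot roll back to a build that never went through the publishing path.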

The following figure shows the main technology selection of PaaS platform:

As the figure above shows, the PaaS platform also includes external components, among them Docker. Because most game companies keep their code in SVN, we manage ours in SVN as well.

Docker container technology

Docker is the core of the PaaS platform design. How is Docker set up at Sohu Changyou?

Since 2014 we have iterated through two versions, optimizing the sharing of container monitoring data, stability, and image management.

Because replication between Ceph replicas was unstable and cluster sharing was not supported, we switched to NFS + DRBD. Because the Consul cluster changed leader frequently and the load of synchronizing service data was high, we switched to etcd to keep data in sync.

Because cAdvisor cannot summarize or display historical data, we developed our own tool, Hunter. After the operating system kernel is upgraded from 2.6 to 3.18, system exceptions may occur because DeviceMapper information cannot be written after the system has been running for a long time.
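
The gap that Hunter fills, keeping and summarizing historical container metrics, can be illustrated with a small sketch. It does not reflect how Hunter is actually built, and it does not call cAdvisor: the MetricHistory class, the sample values, and the container name are hypothetical, and samples are assumed to arrive from some collector as (timestamp, CPU percent) pairs.

```python
from collections import defaultdict
from statistics import mean


class MetricHistory:
    """Hypothetical store of per-container CPU samples with window summaries."""

    def __init__(self) -> None:
        self._samples = defaultdict(list)  # container -> [(timestamp, cpu_percent)]

    def record(self, container: str, timestamp: int, cpu_percent: float) -> None:
        self._samples[container].append((timestamp, cpu_percent))

    def summarize(self, container: str, start: int, end: int):
        """Summarize samples with start <= timestamp <= end, or None if empty."""
        values = [
            value for timestamp, value in self._samples.get(container, [])
            if start <= timestamp <= end
        ]
        if not values:
            return None
        return {"count": len(values), "avg": mean(values), "max": max(values)}


history = MetricHistory()
history.record("game-a-1", 100, 10.0)  # hypothetical container name and values
history.record("game-a-1", 160, 30.0)
history.record("game-a-1", 900, 80.0)

summary = history.summarize("game-a-1", 0, 200)
```

A production system would persist samples and downsample old data rather than hold everything in memory; the sketch only shows the summarize-over-a-window operation that a live-only view lacks.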

Hybrid cloud architecture

The bottom layer of Changyou's O&M system uses a hybrid cloud architecture. At first we considered connecting directly to all the public clouds and linking them up over dedicated lines, but games need BGP (Border Gateway Protocol) connectivity.

Compared with Changyou's own IDC, the hybrid cloud reduces cost by about 20% and makes resources elastic: workloads can move onto and off the cloud, and scale out and in more quickly. In terms of reliability, it provides remote active-active deployment, along with protection against attacks and DNS hijacking, and redundancy.

As next steps for Changyou's O&M management system, we will explore layered capabilities for continuous delivery and the standardization of public services.

Layered capabilities for continuous delivery

The concepts and principles of continuous delivery will be used to automate Changyou's O&M. When developing tools, keep in mind that the more tools there are, the more problems arise in the calls between them.

Therefore, the tools must be consolidated into platforms and clustered services; otherwise costs will not fall and there will still be many failures.

Standardization of public services

Changyou's O&M team draws three lessons from the whole automation process:

Keep it simple and effective. Don't build anything fancy: what is most practical for the application is what is useful, and what is most effective and efficient for the application or the developer is best.
Stay close to the business; don't become detached from R&D and the applications.
Be efficient. The nature of games dictates speed: launch fast, take down fast, decide fast.
