The value that operation and maintenance can deliver does not lie in backfilling, plugging holes, and putting out fires; the ability to respond proactively to changes and risks is what it takes to do operation and maintenance well.
By building a continuous integration and cloud delivery platform, the Meizu operation and maintenance team has improved its ability to cope with change, achieved the goal of proactively responding to change and improving benefits, and given users and product teams an efficient delivery experience. I hope this account of how we got there brings some inspiration to everyone.
This sharing is divided into three parts:
- Automation construction process
- Continuous integration and cloud delivery
- Looking ahead to intelligent operation and maintenance
Automation construction process
The background of Meizu's continuous integration construction is shown in the figure above:
- 2003 to 2008, the Internet 1.0 era. Our Internet business was limited to the official website and BBS, and the servers ran PHP + MySQL.
- 2009 to 2011, the Internet 2.0 era. We began to have real server-side and operations work, including an LVS-based architecture and master-slave replication in the database design. But our businesses still ran in a single IDC.
- 2012 to 2013, the Internet 2.5 era. On the business side, we added the application center, multimedia, and O2O. On the architecture side, we split the master-slave databases across multiple databases, tables, and routes. On the caching side, we introduced Redis clusters and added MFS (MooseFS) distributed storage. At the same time, supporting services appeared, such as search engines and various MQs (message queues).
- In 2014, we entered the Internet 3.0 era. An important milestone of this era is that our Internet business became one of the company's main businesses.
Development challenges to operation and maintenance
In the evolution from Internet 1.0 to 3.0, as the business grew rapidly, our operation and maintenance faced various challenges, which we analyze from four aspects: quality, efficiency, cost, and security.
In terms of quality, the best way to measure it is through availability metrics, which we generally divide into direct and indirect categories. Direct metrics show the availability of the network, services, applications, and systems. Indirect metrics benchmark empirical parameters such as running speed, or business parameters such as the SMS delivery rate (a toy calculation follows).
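As a toy illustration of a direct availability metric (not our actual metric pipeline), availability can be read as the share of successful health-check probes over a window; the probe data and target below are invented for the example.

```python
# Toy example: compute a direct availability metric from probe results.
# The probe data and the 99.9% target are hypothetical, for illustration only.

def availability(probe_results):
    """Return the fraction of successful health-check probes."""
    if not probe_results:
        return 0.0
    ok = sum(1 for r in probe_results if r["status"] == "up")
    return ok / len(probe_results)

# One probe per minute for a service over one week (10080 minutes).
probes = [{"ts": i, "status": "up" if i % 500 else "down"} for i in range(10080)]
pct = availability(probes) * 100
print(f"weekly availability: {pct:.3f}% (target: 99.900%)")
```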
At the time, our business availability was very low and we did not have a good monitoring system. Monitoring itself was also rather chaotic: coverage was low, and false positives, missed alarms, and misreports were common, which made monitoring as a whole untrustworthy.
In terms of efficiency, which measures how capable the operation and maintenance platform actually is, the focus is on server delivery, online changes of all kinds, and our ability to detect faults in time. We delivered and changed frequently, but the processes were not integrated with automation, so overall efficiency was low.
In terms of cost, the focus is on overall business scheduling and on improving and optimizing delivery capacity. Because our processes were imperfect and our work was opaque, it was impossible to assess how much capacity a business actually needed. As a result, "filling holes", "fighting fires", and "carrying the blame" became our operations "routine".
Security is the lifeline of any Internet product, so early in product development we drew up some security norms and systems.
We then built a relatively complete security system that reflects the team's degree of control over security issues across the system, data, and application dimensions.
Status quo of operation and maintenance platform
We have built a series of value-oriented systems. By function, they mainly fall into the following categories:
- Resource management system: we built a cloud platform on KVM + Docker. On top of the cloud platform we set up virtual computing and network resource management, controlled through the CMDB (see the sketch after this list).
- Configuration management system: we have LVS, CDN, DNS, and other management systems, and we have opened up some APIs. The advantage is that we can refine the corresponding permissions so that all operations are controlled on our own systems.
- Automation systems: we have work orders, logs, releases, a self-developed operation and maintenance channel, and an automatic inspection system. All of these improve the efficiency of delivery and change.
- Monitoring and capacity systems: we have basic monitoring, custom monitoring, business monitoring, and capacity systems. The capacity systems not only help us assess how many resources a business really needs, but also enable cost control for that business.
- Security: all operations log in through a bastion host, which makes it easier to audit user operations. Through our own WAF and vulnerability management systems, we can discover attacks and vulnerabilities on our own; the vulnerability information is then imported into the vulnerability management platform for iteration, repair, and tracking.
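As a minimal sketch of the kind of control a CMDB "business tree" provides, the structure and helper below are hypothetical and not our actual CMDB schema; they only show how a business path can resolve to the servers it owns.

```python
# Hypothetical sketch of a CMDB "business tree": each node maps a business
# module to the servers it owns, per environment. Not the real CMDB schema.

business_tree = {
    "flyme": {
        "appcenter": {
            "gray":       ["appcenter-gray-01"],
            "production": ["appcenter-prod-01", "appcenter-prod-02"],
        },
        "multimedia": {
            "production": ["media-prod-01"],
        },
    },
}

def servers_of(tree, path, env):
    """Walk the tree by a dotted path (e.g. 'flyme.appcenter') and return its servers."""
    node = tree
    for part in path.split("."):
        node = node[part]
    return node.get(env, [])

print(servers_of(business_tree, "flyme.appcenter", "production"))
```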
Release platform evolution
Our release platform has gone through three release modes: weekly release, daily release, and self-service release. Since the business was relatively simple at the beginning, releases were done manually.
Later, as the business grew significantly, manual operations had to be replaced by automated tools; for example, we used automation tools to push commands, scripts, and tasks to servers.
Although this solved some problems, overall release efficiency was still relatively low, and the success rate was not high.
To solve this, we associated the CMDB "business tree" with the business modules on the release platform and developed relevant release specifications and indicators, which improved the success rate and fault tolerance of releases.
To make releases more flexible, we delegated release authority to each business unit, with reviews performed by the head of that unit. In this way, the entire release process no longer requires operations staff (a minimal authorization sketch follows).
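To illustrate the delegated review just described, here is a hypothetical authorization check tying a release request to its business unit and that unit's head; the data model and names are invented for illustration, not the actual release platform's logic.

```python
# Hypothetical self-service release authorization check.
# Business-unit heads and the module-to-unit mapping are invented examples.

unit_heads = {"appcenter": "alice", "multimedia": "bob"}
module_unit = {"flyme.appcenter.api": "appcenter", "flyme.media.transcode": "multimedia"}

def can_approve(module, reviewer):
    """A release of `module` may be approved only by the head of its business unit."""
    unit = module_unit.get(module)
    return unit is not None and unit_heads.get(unit) == reviewer

print(can_approve("flyme.appcenter.api", "alice"))  # True: unit head approves
print(can_approve("flyme.appcenter.api", "bob"))    # False: wrong unit
```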
Let's look at the current state of the release platform. It offers many release strategies, such as self-service release, one-click restart, and static file release.
It also supports many release types, such as Jetty, task, chef, PHP, and C++.
As the chart shows, our release success rate has stayed above 98%, and the self-service release rate keeps growing. During releases, more than 90% of our businesses require no operations involvement.
Delivery process
Our delivery process spans development, test, and production environments. Development means writing code locally, testing it, and then submitting it through the platform.
The code is packaged by Jenkins and then goes to WTS Redmine. Testing involves deploying to the test environment, followed by automatic or manual validation (a hedged sketch of triggering such a packaging job follows).
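As a hedged sketch of wiring the packaging step into a platform, the snippet below triggers a Jenkins job through its remote build API using the `requests` library. The Jenkins URL, job name, credentials, and parameters are placeholders, and a real Jenkins may additionally require a CSRF crumb depending on its security settings.

```python
# Hedged sketch: trigger a Jenkins packaging job from a delivery platform.
# URL, job name, credentials, and parameters are placeholders.
import requests

JENKINS = "https://jenkins.example.com"
AUTH = ("platform-bot", "api-token")  # Jenkins user + API token

def trigger_package(job, params):
    """Kick off a parameterized build and return the queue location header."""
    resp = requests.post(
        f"{JENKINS}/job/{job}/buildWithParameters",
        params=params,
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.headers.get("Location")  # points at the queue item

queue_url = trigger_package("appcenter-package", {"BRANCH": "release/1.2.3"})
print("queued at:", queue_url)
```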
For the production environment, operations prepares a basic environment that provides automatic deployment, collects various logs, handles monitoring and alarms, and allows rapid scale-out of applications.
There is a delicate balance here: it requires a fairly sound technical environment, and the people responsible for the in-house framework need to be as stable as possible.
This gives us good documentation and accumulated technical know-how. Otherwise, if the balance is broken, because a process is not followed, people leave, or the framework changes too quickly, delivery becomes impossible.
So what problems arose during delivery? We summarize them as follows:
- In terms of quality, we found that some code had no unit tests, and we needed to track coverage and bug counts accordingly.
- In terms of efficiency, automated deployment, automated testing, and automated build services were scattered across different functions; the "walls" between them had not been broken down, so we could not operate in a refined way.
- Communication costs were high and delivery became complicated.
- Whether our code was safe and whether it could pass security tests were questions that still needed to be addressed.
So what kind of value framework are we pursuing? At the bottom of the diagram is a development framework platform.
First of all, our cloud platform needs to automate the provisioning of environments, so that we can ensure the delivery environment is standardized.
The second is the overall development framework. Our technical committee continuously pushes the basic development framework and architecture forward, ensuring that we have a common technology stack and an automated environment process.
One of the core tenets of the delivery pipeline is to automate standardized processes. We have developed many processes and specifications to achieve a reliable and repeatable continuous delivery pipeline.
The pipeline covers many things, such as parallel development and compilation at commit time, unit testing during the build, and system testing and integration testing during verification.
Finally, there is production delivery in the release and operations phases, which includes rolling back a release and subsequent production monitoring. All of these steps run on this pipeline (a sketch of such a pipeline definition follows).
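To make the stages concrete, here is a minimal, hypothetical data-driven pipeline definition; the stage and step names mirror the phases described above, but the structure itself is invented for illustration rather than taken from the actual platform.

```python
# Hypothetical declarative pipeline: each stage lists the automated steps run
# on the delivery platform. Stage and step names are illustrative only.

PIPELINE = [
    {"stage": "commit",  "steps": ["compile", "static_scan", "unit_tests"]},
    {"stage": "verify",  "steps": ["deploy_test_env", "system_tests", "integration_tests"]},
    {"stage": "release", "steps": ["gray_release", "interface_checks", "production_release"]},
    {"stage": "operate", "steps": ["production_monitoring", "rollback_if_needed"]},
]

def run(pipeline, executor):
    """Run stages in order; stop the pipeline on the first failed step."""
    for stage in pipeline:
        for step in stage["steps"]:
            if not executor(stage["stage"], step):
                raise RuntimeError(f"{stage['stage']}/{step} failed, aborting pipeline")

# Example executor that just logs each step and reports success.
run(PIPELINE, lambda stage, step: print(f"[{stage}] {step}") or True)
```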
In addition, it is a multi-role platform: some people are in charge of development, others of operations and testing, and they coordinate on it, so the platform benefits our entire team.
Continuous integration and cloud delivery
Standardization construction
Our automation effort is divided into three stages: standardization, automation, and intelligence.
On the standardization side, we standardize hardware, components, the technology stack (for example, the types of protocols we use), and monitoring.
In terms of test automation, we cover a wide range of content, including unit testing, unit test coverage, and test entry conditions, such as whether bugs are allowed into the delivery process (a hedged sketch of such a gate follows).
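A minimal sketch of a test-entry gate, assuming thresholds like the ones below; the specific coverage figure and bug rule are invented, not the actual entry criteria.

```python
# Hypothetical delivery gate: block the pipeline unless unit-test coverage is
# high enough and there are no open blocking bugs. Thresholds are invented.

def passes_gate(metrics, min_coverage=0.80):
    """Return (ok, reasons) for a build's test-entry check."""
    reasons = []
    if metrics["unit_coverage"] < min_coverage:
        reasons.append(f"coverage {metrics['unit_coverage']:.0%} < {min_coverage:.0%}")
    if metrics["blocking_bugs"] > 0:
        reasons.append(f"{metrics['blocking_bugs']} blocking bug(s) still open")
    return (not reasons, reasons)

ok, why = passes_gate({"unit_coverage": 0.72, "blocking_bugs": 1})
print("gate passed" if ok else "gate blocked: " + "; ".join(why))
```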
There were two technical alternatives during construction:
- Go fully open source: use Docker for environment automation and standardization, and use ES (Elasticsearch) for the log system. However, this option would have had a big impact on our existing systems.
- Build on our existing platforms and systems: add the necessary specifications and procedures to the CMDB and the release platform.
In the end, we chose the second option. Of course, during implementation we encountered a lot of resistance, because many platforms had to be connected.
Since these platforms are scattered across different departments, such as the PMO, testing, and operations, we used different specifications during development to bridge these departments. For example:
- On the operations side, the release platform involves machine-room specifications: which servers are in which machine room, which services run on which server, which servers are in the grayscale environment, which are in production, and so on. These are all operated through the CMDB business tree.
- On the development side, developers may use fully open-source platforms such as Jenkins. Because it is completely open source and unmodified, it contains operational conventions and naming identifiers that do not correspond to our business tree. All of this increased the difficulty of the transformation.
Therefore, one thing we did when building the platform was to unify the entry point. Since Jenkins handles packaging, we call the Jenkins API and integrate packaging into our platform. At the same time, we synchronize requirement information to Redmine.
In addition, to record and track bugs, we also integrated bug entry into this platform.
This avoids a big impact on our existing operations and solves the problem of correlating requirements with bug numbers.
Finally, since it is a multi-user platform, we also need to enter and synchronize the information of the relevant people (development, testing, and operations leads, among others) into the system (a hedged sketch of such a Redmine sync follows).
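As a hedged sketch of synchronizing requirement or bug information into Redmine, the snippet below uses Redmine's REST API (`POST /issues.json` with an API-key header). The Redmine URL, API key, project identifier, and tracker id are placeholders; depending on the Redmine version, `project_id` may need to be the numeric id rather than the identifier string.

```python
# Hedged sketch: create a requirement/bug entry in Redmine via its REST API.
# The Redmine URL, API key, project identifier, and tracker id are placeholders.
import requests

REDMINE = "https://redmine.example.com"
HEADERS = {"X-Redmine-API-Key": "platform-api-key", "Content-Type": "application/json"}

def sync_issue(project, subject, description, tracker_id=1):
    """Create an issue so the platform's requirement/bug IDs stay correlated."""
    payload = {
        "issue": {
            "project_id": project,      # numeric id or identifier, version-dependent
            "subject": subject,
            "description": description,
            "tracker_id": tracker_id,   # e.g. 1 = Bug in a default Redmine install
        }
    }
    resp = requests.post(f"{REDMINE}/issues.json", json=payload, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()["issue"]["id"]

issue_id = sync_issue("appcenter", "Login API returns 500", "Found during gray testing")
print("synced as Redmine issue", issue_id)
```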
Automated construction
Let’s look at the continuous integration process:
- The first is the requirements phase. For example, one of our product operators enters a requirement into the system; the development lead then analyzes or reviews it and estimates a delivery date.
- Next comes the development phase, which includes writing code, committing code, and compiling and building. Static scanning and code coverage checks also run at build time.
- In the test phase, the system deploys the test environment and runs some automated tests, including various security and performance tests. Of course, we also do some manual validation to check that the test entry criteria are met. If there is a problem, the process is sent back to development, which must resubmit the code and go through the entry process again.
- If there are no problems at this stage, the development lead or business operations reviews the release and releases the code to a grayscale environment. In the grayscale environment we also run automated tests to check the security of the service; only when the interface pass rate is reached is the release finally pushed to production.
From project requirements to release, everything is operated on our own platform, and the whole delivery process has fine-grained progress management (a minimal sketch of the grayscale gate is shown below).
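To illustrate the grayscale gate described above, here is a minimal check of the interface pass rate against a threshold; the 99% figure and the result format are assumptions for illustration, not the platform's actual criteria.

```python
# Hypothetical grayscale gate: promote to production only if the automated
# interface tests reach the pass-rate threshold. The 99% threshold is invented.

def interface_pass_rate(results):
    """results: list of (interface_name, passed) tuples from gray testing."""
    if not results:
        return 0.0
    return sum(1 for _, passed in results if passed) / len(results)

def can_promote(results, threshold=0.99):
    return interface_pass_rate(results) >= threshold

gray_results = [("GET /api/user", True), ("POST /api/order", True), ("GET /api/pay", False)]
print("promote to production" if can_promote(gray_results) else "hold in grayscale")
```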
Let’s look at the release process:
- The first step is the environment check, which mainly verifies that the required user directories and related permissions exist on the server.
- At the same time, we pull the files from the packaging platform to the IDC.
- Then monitoring must be turned off. During deployment the service is temporarily unavailable, which would trigger monitoring alarms, so we disable monitoring on the corresponding servers.
- We take the web node offline so that no new traffic comes in.
- The service is then stopped to ensure that its files are not in use.
- We perform the file update.
- We start the service once the steps above are complete.
- After starting the service, we run monitoring checks, mainly to confirm that the updated service is available.
- Then the web node goes back online and we add the service back to the LVS cluster.
- Finally, we turn monitoring back on.
In the release process above, we release in parallel or serially depending on the characteristics of the business. In this way, while keeping the success rate up, we can further improve release efficiency (a minimal sketch of the step sequence follows).
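The sequence below is a minimal sketch of the release steps just listed, with each step as a stub; the function names, hostnames, and the serial loop are illustrative, not the actual release-platform code, and a real implementation would call the CMDB, monitoring, LVS, and deployment APIs.

```python
# Minimal sketch of the release flow described above; every step is a stub.
# Step names, hostnames, and the serial loop are illustrative only.

RELEASE_STEPS = [
    "check_environment",      # user directories and permissions exist
    "pull_package_to_idc",    # fetch files from the packaging platform
    "disable_monitoring",     # avoid alarms while the service is down
    "take_web_offline",       # drain new traffic
    "stop_service",           # make sure files are not in use
    "update_files",
    "start_service",
    "run_monitoring_checks",  # confirm the updated service is available
    "bring_web_online",       # add the node back to the LVS cluster
    "enable_monitoring",
]

def release(server, runner):
    """Run the steps serially on one server; abort on the first failure."""
    for step in RELEASE_STEPS:
        if not runner(server, step):
            raise RuntimeError(f"{server}: step '{step}' failed, rolling back")

# Serial rollout over a batch; a parallel rollout could map release() over servers.
for host in ["appcenter-prod-01", "appcenter-prod-02"]:
    release(host, lambda s, step: print(f"{s}: {step}") or True)
```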
With this continuous delivery platform in place, we can support the rapid-iteration model of product development that is common on the Internet.
We can do requirements planning before an iteration, keep development, testing, and release on track during the iteration, and hold reviews after it.
By collecting information and data, we can see whether code quality has serious problems and whether anything is blocked.
The state of bug fixes is also clearly visible, and we can obtain code coverage, test pass rates, and performance, security, and interface test data.
At the same time, we know not only the compile pass rate and the release success rate, but also other efficiency-related data.
This quality data drives and improves our technical capabilities and ensures system quality before going live. Of course, we can also use the data to further improve and optimize the delivery process and ensure its reliability.
Intelligent operation and maintenance
Looking back at the three stages of automation construction above, we can see that intelligent operation and maintenance is mainly about learning from collected data in order to analyze and predict.
For example, if the collected data shows that a disk's swap rate has been high recently, we can predict when that disk is likely to fail next.
Likewise, we can go further and predict failure points of key switches that could take down the entire data center (a toy prediction sketch follows).
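As a toy illustration of "learn from collected data, then predict" (not the actual model), the sketch below fits a simple linear trend to a hypothetical disk health counter and estimates when it will cross an invented failure threshold.

```python
# Toy prediction: fit a linear trend to a daily disk health metric and estimate
# when it crosses a failure threshold. Metric values and threshold are invented.

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    return a, mean_y - a * mean_x

days = list(range(10))
reallocated_sectors = [2, 3, 3, 5, 8, 9, 13, 15, 18, 22]   # hypothetical SMART-like counter
slope, intercept = fit_line(days, reallocated_sectors)

THRESHOLD = 100  # invented failure threshold
days_to_failure = (THRESHOLD - intercept) / slope - days[-1]
print(f"estimated days until the disk should be replaced: {days_to_failure:.1f}")
```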