1. Preface

The day-to-day work of operations and maintenance (O&M) usually concerns existing systems and projects: servers, various cloud products, projects in progress, monitoring, account and permission control, project launches, and so on. It is broad and tedious, and little of it is constructive.

So when we take over a new system, it is necessary to improve it and its surroundings. A few companies have a fairly complete O&M organization, with dedicated teams for desktop, network, security, R&D, database, system, and application O&M, while most companies have only one or two O&M engineers. All of the above jobs still need to be done, but this article focuses on application O&M.

When you take over a new environment, you inherit the pits left by your predecessor, which is worse than inheriting legacy code in development. A handover should cover more than account passwords, work processes, and points to watch; the most important item is the O&M documentation. Few systems have a simple environment, and even a simple one contains subtle logical relationships between programs. If these are not taken into account, problems can surface in production. Most systems now use a microservice architecture, which further increases maintenance complexity.

For example, after you take over, your lead asks you to deploy a Java service with Docker by copying it from the production environment to the test environment. Something goes wrong after startup. Maybe the startup parameters do not match the current environment, maybe the connection permissions have not been opened, or maybe the service connected to the production database after startup. If the program deletes or modifies historical data once it is running, the consequences are frightening to think about.
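One way to guard against the last case is a pre-flight check that runs before the copied service is started. The sketch below is only an illustration: the host names in `PRODUCTION_HOSTS` and the setting names in the usage example are hypothetical, and it only inspects URL-style values.

```python
from urllib.parse import urlparse

# Hypothetical production hosts; in practice take these from your own inventory.
PRODUCTION_HOSTS = {"db-prod.example.internal", "mq-prod.example.internal"}


def find_production_endpoints(settings, production_hosts=PRODUCTION_HOSTS):
    """Return the names of settings whose URL points at a production host."""
    offenders = []
    for name, value in settings.items():
        # JDBC URLs look like "jdbc:mysql://host:3306/db"; drop the "jdbc:"
        # prefix so that urlparse can extract the host name.
        url = value[5:] if value.startswith("jdbc:") else value
        host = urlparse(url).hostname  # None for values that are not URLs
        if host in production_hosts:
            offenders.append(name)
    return sorted(offenders)


if __name__ == "__main__":
    # Hypothetical settings copied from production into a test deployment.
    test_env = {
        "SPRING_DATASOURCE_URL": "jdbc:mysql://db-prod.example.internal:3306/shop",
        "REDIS_URL": "redis://cache-test.example.internal:6379/0",
    }
    print(find_production_endpoints(test_env))
```

If the returned list is not empty, the deployment script can refuse to start the container until the settings are corrected.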

This kind of problem is very common. In my experience, a lot of configuration information is vague and projects are tightly coupled. You cannot tell which systems a change will touch, and pulling one thread moves the whole body. Without a close understanding you do not dare to change anything, and as an O&M engineer you do not dare to touch a single project!

So making the system watertight is a creative process, and one that takes us deep into the project. Only with a deeper understanding of the project can we maintain it well.

Lay a good foundation for O&M:

  • Know your current environment, and everything in it, very clearly;
  • Have monitoring: practical, useful monitoring that can actually detect problems;
  • Have backups of everything that can be used for quick recovery, and run recovery drills.

Advanced:

  • Optimize the system;
  • Optimize the workflow.

That is the outline; the details follow. It is also the common route: standardize, turn the work into processes, and then automate.

2. Basics

2.1 Project Overview

After taking over the system, we should first make sure that we can perform routine maintenance, and then survey the whole system. The survey generally covers the following items:

  • Project introduction

  • Account password table

  • Project resource management configuration list

  • Architecture and flow diagrams

  • Deployment and maintenance document

  • Summary table of project monitoring policies

  • Project emergency operation manual

  1. Project introduction. Start from the business scope of the current project: what does the project do? Also record who the project lead and the related personnel are, so that coordination is easier.

  2. Account and password table. This table should record in detail the login user name and password of every server, platform, and system. There should also be a permission record table showing who has been granted access and under which login account. These are only examples.

  3. Project resource management configuration list. In my opinion it needs at least four sheets, and more as required: server list, project domain name information, project service information, and third-party service resource list. Each sheet should list the information for every environment of the project, such as production and test.

  4. Architecture and flow diagrams, such as the network topology diagram and system architecture diagram of the project. From the topology diagram we can see how many servers the architecture uses and how the overall logical flow works. For example, start from the domain name, then the DDoS protection attached to the domain name, then load balancing, then the services of each module, including which databases each service connects to and which services it interacts with.

  5. Deployment and maintenance document. This document contains detailed installation steps, and items such as the database and other addresses in the configuration files should be written out clearly. How do you verify that the document is useful? Hand it to another colleague: if they can migrate, restore, or replicate the project from it, it works.

  6. Summary table of project monitoring policies. It lists the current monitoring policies, divided into basic monitoring and business monitoring. Basic monitoring covers server resources such as CPU, memory, and network bandwidth. Business monitoring covers the service itself and exposes problems in the running project. The table can be used to agree on monitoring with the project lead, or later to check whether anything has been left unmonitored.

  7. Project emergency operation manual. It records the known problems of the project, how to handle frequently occurring problems, solutions, and so on.

Many readers may think that part of the above is business knowledge that O&M does not need to learn: development asks for a few more service instances, O&M deploys them, and since development owns the business, development can solve the business problems.

2.2 Be a good support

Well, that view leaves O&M with no value at all. I believe many of you have played LOL and were hooked on it at some point. I remember a saying that in a five-person LOL team, the shot-caller is usually the support. The same goes for O&M, because the support has more free time and can see the whole picture.

Development and testing are both busy with a screenful of tasks in ZenTao: endless requirements and test tasks. I believe many programmers have faced a stream of demands from product or operations staff, especially at Internet companies where projects iterate rapidly. Work is grouped into versions, each containing several requirements and lasting about two weeks: development takes 12 days, release testing in the pre-release environment takes 2 days, and during that time bugs are fixed while new requirements are completed.

Release requirements are already scheduled three months ahead. Finishing a feature does not mean developers can rest; it only means the later tasks start slightly ahead of schedule. The same is true of testing, which is extremely busy following development with functional verification.

When a problem occurs, the other roles pay little attention to performance and similar aspects; they focus on their current tasks and deadlines. O&M can be the shot-caller rather than an optional role. It is not appreciated today mainly because most O&M engineers do not realize how important this is.

Take a small team of 20 developers, 2 testers, and 1 O&M engineer. Each developer studies only their own modules, so it is hard to locate the specific point of failure whenever there is a problem. O&M needs to understand the overall system so that the system and the business can be managed and troubleshot better.

2.3 Learning the business

The best way to understand a specific business is to work through the existing material, ask developers about anything you do not understand, draw the corresponding architecture diagrams, and maintain the documentation. Your value in the company will then grow considerably. I believe it also lets your leaders see more of your concrete work and makes colleagues willing to trust your ability. Professionalism is more important than knowledge.

Of course, the goal is not only to maintain the system better but also to prepare for future optimization. You do not have to understand the code, only the whole flow.

We have this kind of problem ourselves: nobody knows which databases a module connects to. Sometimes we suddenly discover that a test server is connected to a production database, and over the public network at that (the environments sit in separate VPCs or are otherwise isolated for security). If this is not clear at the beginning and you do not take the initiative to find out, these problems and security risks will hurt you at some point later.

A simple example: there was an old e-commerce project consisting of a few application servers plus a MySQL database. We asked around, and everyone said it should no longer be in use because users had moved to the new system. It was deleted without saving a backup, and a few days later a user contacted us to say they were still using it…

Blind spots like this exist in large numbers across our nearly 300 servers. If O&M staff do not take the initiative to understand them, and developers have no time to look, the hidden danger is shelved indefinitely and, as said above, will bite you at some point.

Once we roughly understand the overall structure, let documentation drive the work: first create a blank document page. The most important thing is to understand the topology, the service configuration list, and at least the architecture of the project and which services it has.

I can only describe what to learn in general terms here; the specifics should be worked out according to your own project.

  • Understand where each domain name points and how requests flow;
  • Understand the interactions between services, for example which programs use RabbitMQ and what they use it for;
  • Understand how the module programs communicate with each other and which ones need to interact;
  • Understand the role of each service, so that when a problem occurs you can quickly and accurately determine which service is the cause.
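Once these relationships are written down, even a very small script can answer "what is affected if this component fails?". The sketch below assumes a hand-maintained dependency map; the service and database names are hypothetical.

```python
# Hypothetical dependency map: each service and the components it connects to.
DEPENDENCIES = {
    "web": ["order", "user"],
    "order": ["rabbitmq", "mysql-order"],
    "user": ["mysql-user"],
    "mailer": ["rabbitmq"],
}


def affected_by(component, dependencies=DEPENDENCIES):
    """Return every service that depends on `component`, directly or indirectly."""
    affected = set()
    frontier = [component]
    while frontier:
        current = frontier.pop()
        for service, targets in dependencies.items():
            # A service is affected if it connects to the failing component
            # or to another service that is already affected.
            if current in targets and service not in affected:
                affected.add(service)
                frontier.append(service)
    return sorted(affected)


if __name__ == "__main__":
    print(affected_by("rabbitmq"))  # ['mailer', 'order', 'web']
```

The same map, read in the other direction, tells you which databases and queues a service needs before you move or delete it.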

I currently use ProcessOn, an online diagramming tool. If the free document quota is not enough, you can pay for the team plan, which is not very expensive.

2.4 Standards and procedures

After the diagrams are drawn, we start organizing the system to establish standards and processes. Establish the following standards:

  • Service naming: Names can be assigned according to the project environment or purpose
  • Port specification: Can be classified according to project environment or purpose
  • IP address planning: The IP address can be divided according to the project environment and region
  • Service Deployment Directory
  • Log Output directory
  • Backup Directory
  • Tool Package Directory
  • Data storage directory
  • Fixed location of other common directories such as script storage directory
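Standards like these are easier to keep when a script can check them. The sketch below assumes one possible convention, `<env>-<project>-<service>` names and a separate port range per environment; both the pattern and the ranges are hypothetical and should be replaced with your own rules.

```python
import re

# Hypothetical naming convention: "<env>-<project>-<service>".
NAME_PATTERN = re.compile(r"^(prod|test|dev)-[a-z0-9]+-[a-z0-9]+$")

# Hypothetical port ranges per environment.
PORT_RANGES = {
    "prod": range(8000, 9000),
    "test": range(18000, 19000),
    "dev": range(28000, 29000),
}


def check_service(name, port):
    """Return a list of convention violations for one service (empty if it complies)."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return [f"name {name!r} does not match <env>-<project>-<service>"]
    env = match.group(1)
    allowed = PORT_RANGES[env]
    if port not in allowed:
        return [f"port {port} is outside the {env} range {allowed.start}-{allowed.stop - 1}"]
    return []


if __name__ == "__main__":
    print(check_service("test-shop-order", 18080))  # []
    print(check_service("test-shop-order", 8080))
```

Running a check like this over the service configuration list shows at a glance which existing services still need to be brought in line.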

Processes:

  • Budgeting process for each service resource
  • Server purchase and delivery process
  • Service deployment, launch, and maintenance process
  • Process for adding accounts and permissions
  • Drill process for periodically verifying backup restoration
  • Process for periodically checking project resource usage
  • Fault report management process

In short, all resources should be standardized, and every operation that is repeated many times should be written down as a process so that nothing is forgotten. These two things also lay the foundation for automation, which performs the repeated operations automatically.

Once these two pieces are complete, system maintenance is under control. Leaving new projects aside, at least the existing projects can be managed well.

The standardized categories can be drawn as a simple mind map; they include, but are not limited to, the items listed above.

2.5 Maintenance

Maintaining the current resources, without scaling or optimization, relies on monitoring and backups. Effective monitoring detects problems in time and avoids repeated manual inspection. With complete backups, any system problem (other than functional bugs) can be recovered from immediately: simple, convenient, fast, and effective.

The monitoring technology can be chosen according to the business. Common choices are Zabbix for general use, Prometheus for containers or Kubernetes, Grafana for graphical display of monitoring data, and Open-Falcon, Xiaomi’s O&M monitoring system.

If you use cloud services, backup is much simpler: scripts can periodically back up to OSS. On a physical machine, write a script that backs up everything stateful, such as databases, configuration files, routine scripts, and service logs, in other words everything that distinguishes the machine from a clean system.
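As a rough illustration of such a script, the sketch below archives a list of paths with the standard `tarfile` module and keeps only the newest archives. The paths in the usage example and the retention count are assumptions; database contents would still need a proper dump first (for example with `mysqldump`), and uploading the archive to OSS is left out.

```python
import tarfile
import time
from pathlib import Path


def back_up(paths, backup_dir, keep=7):
    """Archive `paths` into `backup_dir` and keep only the newest `keep` archives."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    archive = backup_dir / f"backup-{time.strftime('%Y%m%d-%H%M%S')}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in paths:
            # arcname keeps only the last path component inside the archive.
            tar.add(path, arcname=Path(path).name)
    # Timestamped names sort chronologically, so the oldest archives come first.
    archives = sorted(backup_dir.glob("backup-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()
    return archive


if __name__ == "__main__":
    # Hypothetical paths; replace them with whatever is stateful on your machine.
    print(back_up(["/etc/nginx", "/opt/scripts"], "/data/backup"))
```

A backup only counts once it has been restored successfully, so a script like this belongs together with the periodic restore drill.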

3. Advanced

3.1 System and Service optimization

Optimizing a system or service is common. It usually involves changing configuration files so that the service can handle more concurrency and more load. There is plenty of documentation on this, so I will not repeat it here and will only mention a few points to watch.

  • Before using a new server, perform initial system configuration.
  • Optimization is not just about changing the configuration. It requires a deep understanding of each parameter and of the entire system.
  • Be careful when changing configuration and do not take anything for granted. You might enable image compression to save bandwidth, only to find that something breaks and images are displayed incompletely in production;
  • Optimize in layers, starting from the simplest and cheapest measures. For example, simplifying the architecture and removing useless modules can free a lot of resources.
  • The same applies to security: a VPN plus a bastion host (such as JumpServer) solves 99% of everyday security problems, so there is no need to start with expensive, elaborate solutions.

3.2 Workflow optimization

As mentioned above, the other roles are very busy with their own work and have little energy left for anything else, including the systems and processes discussed here. When they run into workflow problems, even solvable ones, they find it hard to stop and fix them.

The users of O&M are internal staff and servers, so the work can be divided into visible and invisible work (from the point of view of other departments). In the time between maintaining devices and servers (which do not break down all the time), you can focus on DevOps, that is, on improving the way work gets done.

For example, when we need to take stock of our cloud service resources, collecting them by hand takes several days. The fix is simple: build a tool that collects the cloud resources automatically, and then we only need to check the result and tidy up the format.

For example, many of the Zabbix alarms at work are not very urgent. Do they flood our alerting channels, such as DingTalk or WeChat? Wouldn’t it be better to add phone-call alerts for cases where the business is unavailable, so that nobody has to keep staring at the DingTalk alerts?
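The routing idea can be sketched in a few lines. The severity names, the `business_unavailable` flag, and the channel names below are all hypothetical; a real setup would map them to whatever the monitoring system and the chat and phone gateways actually provide.

```python
# Hypothetical routing table: alert severity -> notification channels.
ROUTES = {
    "disaster": ["phone", "chat"],
    "high": ["chat"],
    "warning": ["daily_digest"],
}


def route_alert(alert, routes=ROUTES):
    """Return the channels an alert should be sent to."""
    # Anything that makes the business unavailable is escalated to a phone call.
    if alert.get("business_unavailable"):
        return ["phone", "chat"]
    # Unknown or missing severities fall back to the low-noise digest.
    return routes.get(alert.get("severity"), ["daily_digest"])


if __name__ == "__main__":
    print(route_alert({"severity": "warning"}))  # ['daily_digest']
    print(route_alert({"severity": "warning", "business_unavailable": True}))  # ['phone', 'chat']
```

Keeping low-priority alarms out of the chat channel is what makes the remaining messages worth reading.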

The difference may look small, and staring at the screen may seem harmless, but this is exactly where the value of O&M lies. A few more examples: if you need a separate account to log in to each of Zabbix, Jenkins, GitLab, ZenTao, OA, and other systems, wouldn’t unified LDAP authentication be better? Or build an internal navigation page of frequently used URLs; wouldn’t that be more convenient?

Executing SQL statements often requires a DBA to run them on the server. A SQL audit platform such as Archer lets developers submit SQL for review and execution at any time. Likewise, when developers query data in the production or test environment, isn’t a web-based database query platform with auditing safer, easier to manage, and more convenient than everyone using Navicat?

Would it be a good idea to add a notification after each Jenkins release? Isn’t it easier and more efficient to release updated projects automatically after developers commit code?

Things seem fine without any of these. But when you see that something is causing developers trouble and you fix it, that is a contribution to the team, like the support in LOL: it serves the big picture and drives DevOps forward.

The concept of DevOps has been around for years, and more and more people pursue this higher level of working. In practice it is driven more by operations than by development. It is also closely tied to development work, so O&M is best placed to play the supporting role!

Of course, if people are comfortable with SVN, they will not be keen on switching to GitLab; SVN has all kinds of problems, but it is familiar to them. If you do not get support, talk to your leaders and write a plan for the overall O&M system. If the idea is sound, your leader may well be interested. If the leader cannot be bothered to look at it, consider moving on, because the effort is not worth it and you would be miserable for a long time. If you lack strong support and attention (it is common for a company to have no operations director, only a CTO), break the plan into small pieces, make small changes, and tie each one to visible results.

For example, if you want to promote Kubernetes, you need to understand Kubernetes and Docker, and you need to find the pain points. Then relate them to what Kubernetes offers, such as resource scheduling, failover, more reliable operation of projects, and the resulting change in company efficiency. Or argue from the cost side: one test environment saves XXXX, so across dozens of projects the saving is XXX per month. Write it pragmatically and the effect is striking.

Doing any job well takes more than keeping your head down. Communicate with your leaders more, actively break the silence between levels, tell them your ideas, and listen to theirs. Small companies may not care about many of these things, but having done them and not having done them are two different things. If you bring in more ideas than your colleagues, you become the pioneer in the company; people will see you as capable, technically strong, and someone who keeps pushing things forward. Nobody is ignorant of DevOps: everyone has heard of it, and if you can push it forward, people will welcome it. A career path is full of both difficulties and opportunities.

3.3 Rules

When developers write code, they discuss how to write it, come up with a solution, draw a flow chart, and decide which techniques to use. Requirements added at short notice are treated the same way: at the very least, whoever receives one thinks the approach through first.

In O&M, by contrast, things often seem to be done casually. A security group rule is added the moment someone asks, the remark is written carelessly, and the rule is forgotten when the other party no longer needs it. “Rules” need to be established here too, so that people accept them. Otherwise anyone can ask you verbally to add or change something, and when a problem follows, the blame falls on you.

Of course, we cannot simply force rules on others; everyone hates having too many. Processes should be built around convenience. For example, a whitelist request can be made by sending an application email with a clear remark, and a Python script can periodically collect and check the security group rules, nipping problems in the bud.
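A periodic check of this kind might look like the sketch below. It assumes the rules have already been exported from the cloud provider into a list of dictionaries; the field names (`id`, `cidr`, `port`, `remark`) and the set of ports allowed to be public are hypothetical.

```python
def audit_rules(rules, allowed_open_ports=frozenset({80, 443})):
    """Return (rule_id, problem) pairs for rules that break the agreed rules."""
    findings = []
    for rule in rules:
        # Every rule must say who asked for it and why.
        if not (rule.get("remark") or "").strip():
            findings.append((rule["id"], "missing remark"))
        # Only a few ports may be reachable from anywhere.
        if rule.get("cidr") == "0.0.0.0/0" and rule.get("port") not in allowed_open_ports:
            findings.append((rule["id"], "open to the whole Internet"))
    return findings


if __name__ == "__main__":
    # Hypothetical export of security group rules.
    exported = [
        {"id": "r-1", "cidr": "0.0.0.0/0", "port": 443, "remark": "public HTTPS"},
        {"id": "r-2", "cidr": "0.0.0.0/0", "port": 3306, "remark": ""},
    ]
    for rule_id, problem in audit_rules(exported):
        print(rule_id, problem)
```

The findings can be mailed to the rule owners, which turns the "rules" into something that is checked rather than merely announced.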

3.4 O&M Management Platform

Building an O&M automation platform is essentially the process of realizing the O&M team’s service capability. It frees us from a large amount of repetitive, irregular manual work and lets us focus on improving the quality of the O&M service.

Space is limited, so I will not describe the full design of the automation platform and will only share some personal experience:

  1. First, proceed step by step. When building an automated O&M system, start with a basic platform for batch operations on servers and move the repetitive work onto it first. Then enrich the platform’s functions according to operational needs to improve efficiency, and finally integrate it with the surrounding systems to form a complete automated O&M system.

  2. Second, consider scalability. The initial functions and design do not need to cover everything, but consider whether the system will still cope when the number of servers grows substantially, for example from tens to hundreds or thousands.

  3. Third, be practical. This is also reflected in our own system. Consider borrowing from the more mature tools on the market when developing in-house. Why not just use them directly? At present, common open source O&M management platforms are not maintained by a dedicated team, so troubleshooting is very difficult when problems occur during use. Their value as a reference is undeniable, though. The suggestion is therefore to prefer open source solutions combined with some secondary development.