Co-authors: Wang Pengcheng (Director of Operations and Maintenance); Jia Xiaohui (Head of Cloud Company); Han Xiaoguang (Head of Operations and Maintenance)
Walking down a bustling street, a song comes to mind: like seaweed swaying with the waves, seaweed dancing in the spray. Yes, we are as small as seaweed, but each of us carries a passion that sustains us and drives us to dance out our own youth.
We are technologists. What can a passion for technology bring us? That question is discussed at the end of this article; operation and maintenance (O&M) automation comes first.
Grounded in real-world practice, this article gets to the essence of the subject and sorts out several classic O&M automation architectures from the industry, covering theory and architecture with diagrams, text and practice. It shows not only what each architecture looks like but also how to implement it, and it states frankly the characteristics and problems each system showed in real production. It also summarizes some general ideas and experience for O&M automation (and intelligent O&M). Finally, it discusses career development and the passion that gives work and life their meaning.
As the saying goes: in practicing intelligent O&M, we do things differently; at the crossroads of life, even seaweed has dreams.
Contents of this article:
○ Why O&M automation is hard to do well: an analysis of the causes
Technical people like practical, substantive material. You have read many articles on O&M automation, intelligent O&M and O&M architecture, and they always seem to make good sense, yet you cannot learn from them or reproduce them. Why?
There are three reasons:
1. Too much theory and too little practice: others explain their theories very well, but you end up with a full head and empty hands, unable to actually build an O&M system.
2. Practice without an overall solution: others do a great job of presenting some handy small tools, but you cannot see the overall O&M automation solution.
3. Solutions without enough theory: even when an architecture and practical cases are presented, space is limited or the explanation is only partial, so it never rises to the level of theory, and you still cannot adapt and implement it flexibly.
In the final analysis, this is because such material is distilled from other people's practice. Without deep, systematic thinking of our own, we cannot see the essence of things, so the material never turns into our own thinking and experience.
So this article offers a different perspective: thinking deeply about O&M automation with a passion for technology.
I. O&M Platform Case 1
1.1 Application Scenarios
This architecture is used by a media company and by many large enterprises.
1.2 Case Features
The architecture is clear and lightweight, with an emphasis on security control and flexible expansion. It suits O&M automation scenarios in large and medium-sized enterprises.
1.3 Architecture Diagram
1.4 Architecture Analysis
- Unified management and control: the master controller can control the Master and Login systems of multiple remote network nodes, and through them centrally control the Minion machines that belong to each Master node.
- Development tools: Python + SaltStack + Vue + Redis + InfluxDB
- Permission management:
  1. Login permission for production machines is requested by the user and approved by an administrator.
  2. An administrator can assign multiple O&M roles, covering daily batch-operation rights, deployment and collaborative management.
- The Master is used by back-end O&M administrators, and Login is used by everyday users to log in to a network node. Each Master controls its own Minions, and the Masters are isolated from one another.
- Permission management process: an employee registers and logs in to the master controller's web system, then activates their login permission in order to log in to service machines.
- The master controller can be network-isolated from all Master, Login and Minion intranets, or it can share an intranet with one group of resources.
- The master controller operates the relevant Masters directly and operates the Minions indirectly through those Masters.
- The Redis on Login accepts access only from the master controller and the node's own Master, and the master controller's Redis accepts writes only from the Masters. In a real enterprise deployment, each Login has a Redis for quick access to machine information within that node. The Masters all write into the master controller's Redis, which is shared.
- Each Master obtains information about its Minions by listening to events and stores the time-series data in its own InfluxDB. It also writes a copy in real time to the master controller's Redis for use as real-time monitoring data. The master controller reads real-time data locally, or calls the InfluxDB of each node's Master to display historical data.
The actual system screenshot is as follows:
1.5 Architecture Summary
The architecture is clear and simple, with an emphasis on security control and scalability.
The overall architecture is based on the SaltStack architecture and its interfaces. It is comparable to the overseas SaltStack enterprise edition, while adding some fine-grained permission control.
The system's main functions are unified control, permission management, a cloud bastion host, real-time monitoring, automated deployment and security auditing.
The architecture is easy to extend and supports unified management and control of nodes in remote locations and multiple network zones. Adding a new network only requires deploying one Master system and one Login system there.
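The monitoring data flow described in 1.4 (each Master keeps history locally and mirrors the latest value to the master controller) can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual code: in-memory classes stand in for InfluxDB and Redis, and every name here (`TimeSeriesStore`, `RealtimeStore`, `NodeMaster`, `MasterController`, the node and host names) is hypothetical.

```python
import time
from collections import defaultdict


class TimeSeriesStore:
    """In-memory stand-in for a node's InfluxDB (assumption, not the real client)."""

    def __init__(self):
        self._points = defaultdict(list)  # minion id -> [(timestamp, metrics)]

    def write(self, minion_id, timestamp, metrics):
        self._points[minion_id].append((timestamp, metrics))

    def history(self, minion_id):
        return list(self._points[minion_id])


class RealtimeStore:
    """In-memory stand-in for the master controller's Redis: latest value only."""

    def __init__(self):
        self._latest = {}

    def set_latest(self, node, minion_id, metrics):
        self._latest[(node, minion_id)] = metrics

    def get_latest(self, node, minion_id):
        return self._latest.get((node, minion_id))


class NodeMaster:
    """A node's Master: stores history locally and mirrors the latest point upward."""

    def __init__(self, node, realtime):
        self.node = node
        self.tsdb = TimeSeriesStore()
        self.realtime = realtime

    def on_minion_event(self, minion_id, metrics, timestamp=None):
        timestamp = time.time() if timestamp is None else timestamp
        self.tsdb.write(minion_id, timestamp, metrics)           # historical data
        self.realtime.set_latest(self.node, minion_id, metrics)  # real-time copy


class MasterController:
    """Reads real-time data locally and asks each node's Master for history."""

    def __init__(self, realtime):
        self.realtime = realtime
        self.masters = {}

    def register(self, master):
        self.masters[master.node] = master

    def realtime_view(self, node, minion_id):
        return self.realtime.get_latest(node, minion_id)

    def history_view(self, node, minion_id):
        return self.masters[node].tsdb.history(minion_id)


realtime = RealtimeStore()
controller = MasterController(realtime)
node_a = NodeMaster("node-a", realtime)
controller.register(node_a)

node_a.on_minion_event("web-01", {"cpu": 0.35}, timestamp=1)
node_a.on_minion_event("web-01", {"cpu": 0.80}, timestamp=2)

print(controller.realtime_view("node-a", "web-01"))  # latest point only
print(controller.history_view("node-a", "web-01"))   # every stored point
```

The split keeps the real-time store small (one value per machine) while history stays on the node that produced it, which is one plausible reading of why the design uses both stores.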
II. O&M Platform Case 2
2.1 Application Scenarios
The architectural idea originated at a search company, a digital company and a travel service company, and has evolved many times.
2.2 Case Features
The architecture looks simple but is not. It leans toward the monitoring field and can handle massive, highly concurrent control of more than 100,000 servers.
2.3 Architecture Diagram
2.4 Architecture Analysis
- This architecture uses the Agent client programming model: an Agent is deployed on each server and is responsible for data collection.
  1. In the usual server-side programming model, the server opens a listening port and waits for client connections to arrive.
  2. In the Agent client programming model, the Agent opens the listening port and the server initiates the connection to the Agent.
- Data is forwarded upward by Transfer modules into a distributed pipeline and can then be forwarded again. Like building blocks, these stages allow flexible data processing, a scalable architecture and clustered, distributed deployment.
- The collected data is divided into two parts:
  1. Database storage, mainly for displaying monitoring data and for later troubleshooting.
  2. Real-time monitoring, which drives a large number of monitoring alarm items.
- Each server has about 200 monitoring items, collected by default once every 5 seconds, which works out to about 40 data points per second per server (200 / 5).
- The system essentially cannot cache; everything has to be done in real time. Anyone who has worked on server monitoring knows that a delayed alarm is no better than no alarm. If there is a problem, it must be reported as quickly as possible.
- The control system itself holds no state; all state is stored in the database.
- To build mass execution, you need a deep understanding of Python's asynchronous and multithreading mechanisms and modules, and of the GIL. Learning about the epoll and select models is recommended.
- To build a highly concurrent task control system, you need to understand and make use of how Linux fork, daemons, and the multi-process and multi-thread mechanisms work, and where each applies. Learning about the network event libraries libevent and libev and the asynchronous DNS resolution library c-ares is recommended.
- The architecture is flexible, can scale arbitrarily and adapts to all kinds of firewall ACLs. When I was at a digital company, there were many machine rooms and many ACLs on the various firewalls, which caused all sorts of access problems. Adapting to such environments requires dedicated adapter modules.
- To build a stable, scalable O&M control system capable of mass execution, beyond the knowledge above you also need to know how to build a layered architecture, considering the following:
  1. How to collect and control commands centrally, so that commands are effectively collected, managed, released in bulk and routed.
  2. How to design layered scheduling and forwarding of massive numbers of commands, so that services are isolated from and do not affect one another, load is balanced, commands are relayed in bulk, and capacity can be expanded flexibly.
  3. How to receive and execute commands effectively, so that execution is correct and efficient and execution results are collected at scale.
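The three layering considerations above (central command intake, layered forwarding with isolation, execution with result collection) can be sketched as a toy, single-process model. Everything in it is an assumption made for illustration: `Controller`, `Relay` and `Agent` are hypothetical names, commands come from a lookup table rather than a real shell, and a thread pool stands in for the network fan-out.

```python
from concurrent.futures import ThreadPoolExecutor


class Agent:
    """Execution layer: runs a command on one host and returns the result."""

    HANDLERS = {"ping": lambda: "pong", "uptime": lambda: "up 1 day"}  # fake commands

    def __init__(self, host):
        self.host = host

    def execute(self, command):
        handler = self.HANDLERS.get(command)
        if handler is None:
            return {"host": self.host, "ok": False, "output": "unknown command"}
        return {"host": self.host, "ok": True, "output": handler()}


class Relay:
    """Forwarding layer: owns one isolated group of agents (e.g. one machine room)."""

    def __init__(self, zone, agents, workers=4):
        self.zone = zone
        self.agents = {a.host: a for a in agents}
        self.workers = workers

    def forward(self, command, hosts):
        targets = [self.agents[h] for h in hosts if h in self.agents]
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(lambda a: a.execute(command), targets))


class Controller:
    """Top layer: accepts commands centrally and routes them to the right relay."""

    def __init__(self, relays):
        self.relays = relays
        self.route = {h: r for r in relays for h in r.agents}  # host -> relay

    def dispatch(self, command, hosts):
        by_relay = {}
        for host in hosts:
            relay = self.route.get(host)
            if relay is not None:  # hosts with no known route are skipped
                by_relay.setdefault(relay.zone, (relay, []))[1].append(host)
        results = []
        for relay, zone_hosts in by_relay.values():
            results.extend(relay.forward(command, zone_hosts))
        return results


controller = Controller([
    Relay("zone-a", [Agent("a1"), Agent("a2")]),
    Relay("zone-b", [Agent("b1")]),
])
results = controller.dispatch("ping", ["a1", "b1", "a2", "unknown-host"])
print(sorted(r["host"] for r in results))
```

Because each `Relay` only knows its own agents, a problem in one zone cannot reach hosts in another, which is the isolation property the second consideration asks for.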
2.5 Conclusion
This architecture is mainly for monitoring. It can execute massive numbers of highly concurrent commands and monitor tens of thousands of devices. During development we found that the more complex the design, the more pitfalls there were, especially under very tight constraints. In the end we returned to simplicity and kept optimizing, so that each module is extremely simple. The result feels like a distributed pipeline, and you can see the shadow of the Linux system in it. Keep it simple: as anyone who writes code knows, some modules just move data from here to there. If the design is complicated, many things end up taking detours, and over time you may not be able to figure them out.
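The idea of a pipeline built from extremely simple modules resembles a Unix pipe. A minimal single-process sketch using Python generators follows. The stage names, the sample metrics and the alarm thresholds are invented for illustration, and in a real deployment each stage would presumably be a separate networked process rather than a function.

```python
def collect(samples):
    """Agent stage: emit one data point per monitored item."""
    for host, item, value in samples:
        yield {"host": host, "item": item, "value": value}


def transfer(points, zone):
    """Transfer stage: tag each point with its origin and pass it upward."""
    for point in points:
        yield {**point, "zone": zone}


def tee(points, store):
    """Storage branch: keep every point for display and later troubleshooting."""
    for point in points:
        store.append(point)
        yield point


def alarm(points, thresholds):
    """Real-time branch: yield an alarm as soon as a value crosses its threshold."""
    for point in points:
        limit = thresholds.get(point["item"])
        if limit is not None and point["value"] > limit:
            yield f'{point["zone"]}/{point["host"]}: {point["item"]}={point["value"]} > {limit}'


samples = [("web-01", "cpu", 0.42), ("web-01", "disk", 0.97), ("db-01", "cpu", 0.91)]
store = []
pipeline = alarm(
    tee(transfer(collect(samples), zone="idc-1"), store),
    {"cpu": 0.9, "disk": 0.95},
)
alarms = list(pipeline)
print(alarms)
```

Each stage does one thing and knows nothing about its neighbours, so stages can be added, removed or chained again without touching the others.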
III. O&M Platform Case 3
3.1 Application Scenarios
A comprehensive portal website.
3.2 Architecture Diagram
3.3 Architecture Analysis
The design concept of the integrated O&M automation management platform is to integrate and unify the existing O&M tool platforms as far as possible, monitor and manage system resources in one place, and associate and consolidate data effectively, creating an intelligent, integrated O&M monitoring and management platform. The platform is tailored to the organization's own needs and combines in-house development with externally sourced components.
3.3.1 Functional module design
This solution is built along three dimensions: the IT O&M process, IT monitoring platform integration, and IT O&M automation. Together they comprise the following main functional modules.
IT operation and maintenance process module: asset management, knowledge base management, security management, event management, daily event management.
IT monitoring platform integration module: monitoring alarm management, log management, performance management, report management.
Operation and maintenance automation module: application management, configuration management, program operation management.
The actual functional modules of the first phase of the system are shown in the figure below:
3.3.2 Development language and tools used in this solution
Back-end development is mainly done in Python, Shell and other languages.
Information is collected with syslog, Logstash, the Agent and SaltStack.
Data is written to MySQL, Redis and Elasticsearch (ES).
The front-end web display and its interaction with the back-end data and application layers are implemented with the Django framework.
The front end uses HTML, CSS and Bootstrap.
Charts are mainly rendered with ECharts and Kibana.
3.3.3 Event Flow in the Intelligent O&M Monitoring System
An O&M system should cover the whole life cycle from going online to going offline and monitor and manage the whole service chain, so that the service system is deployable, changeable, auditable, traceable, secure and controllable. O&M management should be process-driven, compliant, visual and intelligent.
When monitoring triggers an alarm, the alarm message is automatically written to the event work-order system, and the staff on duty handle the work order. For stable, frequently repeated, low-risk scenarios, handling can be switched to automatic handling by the system. Any action that changes the system triggers a CMDB update and, at the same time, an audit record. The experience gained from classic work orders can flow into the knowledge base.
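The event flow just described can be sketched in a few lines. This is a hedged illustration only: the `Platform` class, its fields, the alarm types and the handler are hypothetical, and plain Python lists and dicts stand in for the work-order system, the CMDB, the audit log and the knowledge base.

```python
from dataclasses import dataclass, field


@dataclass
class Platform:
    """Toy stand-ins for the work-order system, CMDB, audit log and knowledge base."""
    work_orders: list = field(default_factory=list)
    cmdb: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)
    knowledge_base: list = field(default_factory=list)
    # Alarm types judged stable, frequent and low-risk (hypothetical examples).
    auto_handlers: dict = field(default_factory=dict)

    def on_alarm(self, host, alarm_type):
        """Every alarm becomes a work order; low-risk ones are handled automatically."""
        order = {"id": len(self.work_orders) + 1, "host": host,
                 "type": alarm_type, "status": "open", "handler": None}
        self.work_orders.append(order)
        action = self.auto_handlers.get(alarm_type)
        if action is not None:
            self.resolve(order, handler="auto", change=action(host))
        return order

    def resolve(self, order, handler, change=None, lesson=None):
        """Close a work order; a system change updates the CMDB and is audited."""
        if change:
            self.cmdb.setdefault(order["host"], {}).update(change)
            self.audit_log.append({"order": order["id"], "by": handler, "change": change})
        if lesson:
            self.knowledge_base.append({"type": order["type"], "lesson": lesson})
        order.update(status="closed", handler=handler)


ops = Platform(auto_handlers={"disk_full": lambda host: {"log_cleanup": "done"}})
auto = ops.on_alarm("web-01", "disk_full")          # handled by the system
manual = ops.on_alarm("db-01", "replication_lag")   # waits for on-duty staff
ops.resolve(manual, handler="alice", change={"replica": "rebuilt"},
            lesson="Rebuild the replica when lag keeps growing.")
print(ops.audit_log)
```

Routing both manual and automatic handling through the same `resolve` step is a design choice of this sketch: it guarantees that the CMDB update and the audit record cannot be skipped on either path.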
A typical intelligent O&M event flow chart is designed as follows:
3.3.4 Visual Log Monitoring
Logstash collects and formats the website access logs and sends the formatted data to Redis. A processing program then consumes the data from Redis, processes the geographic origin of each request, the content accessed and similar information, and temporarily stores the statistical results in MySQL. The front end requests the statistics from MySQL through JavaScript and periodically refreshes the ECharts components on the web page, which gives an impressive visual display module.
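The processing program between Redis and MySQL can be sketched as below. This is an illustration under stated assumptions, not the production code: a `deque` stands in for the Redis list, an in-memory SQLite database stands in for MySQL, and the event fields (`region`, `path`) and table layout are hypothetical.

```python
import json
import sqlite3
from collections import Counter, deque

# Stand-in for the Redis list that Logstash pushes formatted events to.
queue = deque(json.dumps(e) for e in [
    {"region": "Beijing", "path": "/index.html"},
    {"region": "Shanghai", "path": "/news/1"},
    {"region": "Beijing", "path": "/news/1"},
])

# Stand-in for MySQL; the table layout is illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stats (dimension TEXT, name TEXT, hits INTEGER, "
           "PRIMARY KEY (dimension, name))")


def consume(queue, db, batch_size=100):
    """Pop up to batch_size events, aggregate them, and add the counts to the table."""
    by_region, by_path = Counter(), Counter()
    for _ in range(min(batch_size, len(queue))):
        event = json.loads(queue.popleft())
        by_region[event["region"]] += 1
        by_path[event["path"]] += 1
    rows = [("region", k, v) for k, v in by_region.items()]
    rows += [("path", k, v) for k, v in by_path.items()]
    for dimension, name, hits in rows:
        # Create the row on first sight, then increment it.
        db.execute("INSERT OR IGNORE INTO stats VALUES (?, ?, 0)", (dimension, name))
        db.execute("UPDATE stats SET hits = hits + ? WHERE dimension = ? AND name = ?",
                   (hits, dimension, name))
    db.commit()


def region_stats(db):
    """What the front end would fetch periodically to redraw the chart."""
    cur = db.execute("SELECT name, hits FROM stats WHERE dimension = 'region' "
                     "ORDER BY hits DESC, name")
    return cur.fetchall()


consume(queue, db)
print(region_stats(db))
```

Aggregating a batch in memory before writing keeps the number of database statements proportional to the number of distinct regions and paths, not to the number of log lines.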
In addition, much of our log visualization is actually done with ELK. Here is a screenshot of the system:
3.3.5 Asset Management Module: Implementation Ideas
Django's class-based view (CBV) development mode makes it quick to create, delete, update and query assets. The upside (if you are good with CBVs) is that functionality can be implemented elegantly and quickly using existing libraries and mechanisms. The downside (if you are not) is that custom requirements are heavily constrained by the CBV development model.
3.3.6 Automatic deployment of systems and applications
Agent initialization: for a blank operating system, first push the Salt minion and the system Agent to it by its specified IP address, which brings the system under management and control. The Agent collects system information, and the server side opens a listening port to receive the Agent data and aggregate it into the CMDB database. The Salt minion handles daily operation of the system and batch deployment of common applications, which are prepared in advance as Salt states.
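The Agent-to-CMDB reporting step can be sketched as follows. This is a minimal illustration with assumptions: the message layout, the `CmdbReceiver` class and the IP address are hypothetical, a dict stands in for the CMDB database, and the network transport between Agent and server is left out so that only the collect, serialize and merge steps are shown.

```python
import json
import os
import platform
import socket


def collect_system_info():
    """Agent side: gather a few basic facts using only the standard library."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "kernel": platform.release(),
        "arch": platform.machine(),
        "cpu_count": os.cpu_count(),
    }


def build_report(ip, info):
    """Serialize one report; the message layout is an assumption for illustration."""
    return json.dumps({"ip": ip, "info": info}, sort_keys=True)


class CmdbReceiver:
    """Server side: merge each incoming report into a dict keyed by IP address."""

    def __init__(self):
        self.assets = {}

    def handle(self, raw_report):
        report = json.loads(raw_report)
        record = self.assets.setdefault(report["ip"], {})
        record.update(report["info"])
        return record


receiver = CmdbReceiver()
receiver.handle(build_report("10.0.0.8", collect_system_info()))
receiver.handle(build_report("10.0.0.8", {"role": "web"}))  # a later, partial update
print(receiver.assets["10.0.0.8"])
```

Merging rather than replacing the record means a later partial report adds to what is already known about the machine instead of erasing it.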
Automatic deployment of KVM virtual machines: enter the host name, IP address, image and physical machine information on the user interface. In the background, the Salt master pushes the Salt state and the pre-specified system image to the minion, and the KVM virtual machines are deployed automatically in batches through the libvirt interface. The following figure shows the automatic system deployment module:
3.4 Architecture Summary
The solution was designed to be too large and all-encompassing. This proved to be a weakness in later development and implementation, and many modules saw little use. In the end, the more practical modules were asset management, system deployment, application deployment and system information query.
This scheme is suitable for integrating the various products used in O&M, but its integration of open source products is weak, and it is difficult to consolidate data and make it flow between systems.
In the account permission management module, secondary development integrates permission control for both CBVs and function-based views (FBVs), achieving precise control over URLs, operation interfaces and function modules.
The design of the audit management module is somewhat cumbersome, which makes it hard to format audit information properly. Later, the audit approach should be changed: each audited module emits JSON data, and the audit module stores, formats and statistically analyzes that JSON.
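The revised audit approach (audited modules emit JSON, the audit module stores, formats and analyzes it) might look like the sketch below. The field names, the `AuditStore` class and the sample events are assumptions for illustration; the source does not specify a schema.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def audit_event(module, user, action, target, when=None):
    """Called by an audited module: describe one action as a JSON string."""
    when = when or datetime.now(timezone.utc)
    return json.dumps({"time": when.isoformat(), "module": module,
                       "user": user, "action": action, "target": target})


class AuditStore:
    """The audit module: store parsed records, then format and summarize on demand."""

    REQUIRED = ("time", "module", "user", "action", "target")

    def __init__(self):
        self.records = []

    def ingest(self, raw):
        record = json.loads(raw)
        missing = [k for k in self.REQUIRED if k not in record]
        if missing:
            raise ValueError(f"audit record is missing fields: {missing}")
        self.records.append(record)

    def formatted(self):
        return ["{time} [{module}] {user} {action} {target}".format(**r)
                for r in self.records]

    def count_by(self, key):
        return Counter(r[key] for r in self.records)


store = AuditStore()
t = datetime(2020, 1, 1, tzinfo=timezone.utc)  # fixed timestamp for a repeatable demo
store.ingest(audit_event("deploy", "alice", "release", "web-01", when=t))
store.ingest(audit_event("cmdb", "bob", "update", "db-01", when=t))
store.ingest(audit_event("deploy", "alice", "rollback", "web-01", when=t))
print(store.formatted()[0])
print(store.count_by("module"))
```

With one shared record shape, formatting lives in a single place, so adding a new audited module does not require any change to the audit module.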
For chart analysis, trend analysis and back-end data storage, MySQL does not work well; consider using ES and InfluxDB for storage in the future.
IV. The Nature of Intelligent O&M
In fact, humanity has always been walking the road toward intelligence, and the O&M industry has been accompanied by it from the very beginning. However, as culture and technology develop, people gradually overlook earlier advances or no longer regard them as intelligent.
So what does the expected future of intelligent O&M look like? Answer: automation + big data + AI + process strategy.
The automation system is the platform object, big data is the means of production, AI (algorithms) is the productive force, and process strategy is the relations of production.
Realistically, achieving intelligent O&M today requires the following: rely on the O&M automation platform to collect massive O&M data and logs as the means of production; take the CMDB as the core underlying object; and use algorithms to analyze comprehensively the relationships among business systems, the O&M organization, process strategy and other relations of production, so as to achieve prediction, analysis, assessment, fault location and fault handling.
There is not much more to say about intelligence; everyone is still working it out. Human development is the pursuit of dreams, and intelligent O&M is one such dream. What if it comes true? 🙂
V. Passion endures and craftsmanship never ages: beyond technology, focus on practice
The Tao Te Ching says: the Tao gives birth to one, one gives birth to two, two gives birth to three, and three gives birth to all things. This article therefore concentrates on principles and lines of thought. It presents only three cases; no number of cases could be exhaustive, but three are enough to make the point. I hope it gives readers a direction and a methodology. The specifics will vary from person to person and from case to case. As long as you have your moment of insight and keep practicing toward your dream, a better tomorrow lies ahead. That is passion.
Passion is a dream, a pursuit, an expectation, a wish that stays with you along the way.
Passion brings faith, strength and hope. A life without it lacks interest, happiness and meaning.
Passion may seem useless, yet the heart needs it, and it runs bone-deep. Like seaweed drifting, you will move through a vast world and experience different things. Everyone's situation is different and each of us is different, but we all share the same passion. That is the nature of things: we are willing to pay the price to become better. Who says a worm cannot be a hero? Seaweed has dreams too, and seaweed has its youth.
Once you know your inner dream, you need to practice and explore it. As the proverb goes, without a diamond drill you should not take on porcelain work. So as technologists, we need a diamond drill of our own. How do we get one? First of all, three approaches are not advisable:
1. Belief without goals: if your original intention is unclear, you are unprepared, and you have neither goals nor effort, the probability of success will not be high.
2. Focusing on immediate details: attending to small details instead of the big picture can make you look busy and tired, but in reality you may be growing slowly.
3. Thinking only on paper: if you only indulge in empty ideas and aim too high, you will just drift.
So if you want a diamond drill, you need real skills. As society develops rapidly, many kinds of work will be eliminated by big data and artificial intelligence. You can drift like seaweed and vanish in the waves, or dance in the waves like seaweed and ride the wind. The authors believe that the top talent of the coming wave will need the traits of "cross-shaped" talent (shaped like the character 十):
Height: a high-level vision, strategic orientation and a sense of direction, plus the ability to execute and to lead a team down a path.
Depth: the ability to grasp the key point of a problem keenly, the perseverance and tenacity to see things through, and expertise in a field.
Breadth: a broad range of technical knowledge, an open and tolerant mind, and the ability to coordinate.
VI. Concluding Remarks
This article comes from three authors, each drawing on their own field. Written with passion, it peels back the layers to get at the essence and summarizes several cases in depth. What readers should do is work out what their own dream is, learn from the experience of others, and go on to create something wonderful of their own.