Brief introduction: Network diagnostic tool Srecli-Net
1. Background
The SRE operations team is committed to improving operations efficiency through automation, driving the iterative shift toward intelligent operations, and resolving the pain points of traditional operations. Traditional operations has a complete system in place, but operating modes differ from project to project, and routine operations are complicated and time-consuming. How to improve operations efficiency on hybrid cloud projects, and with it the added value of operations and customer satisfaction, remains a hard problem for us.
The main challenges are as follows:
- As customer business develops and evolves rapidly, the lag of traditional operations is magnified
As customer business grows and business models continue to evolve, the volume of business data increases year by year. This brings both opportunities and challenges to operations. How to keep data in the cloud, and business traffic between the inside and outside of the cloud, running stably, securely, and efficiently is a question operations personnel need to think about.
- Each system on the platform is complex to operate, and the learning cost for operations keeps rising
As cloud product versions on the cloud platform iterate rapidly, getting familiar with the platform becomes harder and harder. With each version change and each new feature, the learning cost for beginners rises, and so does the difficulty of mastering the platform’s various operations tasks. Learning alone, however, cannot fundamentally solve the problem of building operations capability quickly. All of this can trigger a cascade of “butterfly effects” that leads to high-risk projects or P-level failures, which directly affect the normal use of the customer’s cloud business.
- The skill level of operations personnel is uneven, and operations tasks are complex
At present, the operations model has several major problems: reliance on manual judgment from experience, many manual operations on the platform, low efficiency in handling problems, and long emergency response times for faults. Because the system is complex, technical staff spend a great deal of time guiding others through basic matters such as logging in to machines and using tools. After login, they face inconsistent commands for create, delete, update, and query operations. Worn down by this routine over time, on-site operations personnel become exhausted and cannot focus on online operations. In particular, inexperienced on-site staff or customers often cannot find the target machine or type the wrong command, which makes operations inefficient overall and creates frequent security risks.
The main task now is to improve operations efficiency and reduce the learning cost for operations personnel by addressing the three aspects above together: the customer, the platform, and the operations staff. Against this background the SRE-CLI tool was launched. It supports shell functions, command completion, problem diagnosis, fault mitigation (“hemostasis”, that is, stopping the bleeding), and other functions, and is gradually improving the current situation.
2. Basic introduction to SRE-CLI
The SRE Command Line Interface (SRE CLI) is an operations tool that lets you perform operations tasks on a hybrid cloud using commands in a command-line shell. With little or no configuration, you can run SRE CLI commands from the command prompt of a terminal program to carry out complex tasks in daily operations. SRE CLI distills the “old Chinese medicine doctor” experience that SRE has accumulated from daily work, problem solving, and fault emergencies, and packages it into the hybrid cloud as a command-line tool, so that complex operations in the daily operations process can be performed with simple commands.
The CLI interactive capability model consists of four parts: the access layer, the interaction layer, the back end, and the infrastructure. The end user logs in to SRECLI and enters the interaction layer, then selects the corresponding scenario instruction and is assisted in completing the specified action. The action calls the tool capabilities and data sources of the back end, the infrastructure layer performs the computation, and the diagnosis result is output directly to the CLI terminal (“black screen”) interface, which completes the interaction, as shown in the figure below.
Figure 1
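The layered flow described above can be sketched in a few lines. This is only an illustration of the call path (interaction layer, back end, infrastructure, terminal output), not the SRE CLI implementation: every name here (`Backend`, `Infrastructure`, `run_instruction`, `fake_power_check`) and the sample fault are hypothetical, and the access layer (login) is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical sketch of the layered flow; none of these names come from SRE CLI.


@dataclass
class Backend:
    """Back end: maps a scenario instruction to a tool capability."""
    tools: Dict[str, Callable[[str], List[str]]] = field(default_factory=dict)

    def call(self, instruction: str, target: str) -> List[str]:
        if instruction not in self.tools:
            raise KeyError(f"unknown instruction: {instruction}")
        return self.tools[instruction](target)


class Infrastructure:
    """Infrastructure layer: turns raw findings into a diagnosis summary."""

    @staticmethod
    def compute(findings: List[str]) -> str:
        if not findings:
            return "OK: no abnormal items"
        return "ABNORMAL: " + "; ".join(findings)


def run_instruction(backend: Backend, instruction: str, target: str) -> str:
    """Interaction layer: run the selected instruction and return terminal text."""
    findings = backend.call(instruction, target)
    return Infrastructure.compute(findings)


def fake_power_check(switch: str) -> List[str]:
    # Stand-in for a real tool; reports an invented fault for one switch name.
    return ["power module 2 offline"] if switch == "asw-demo" else []


backend = Backend(tools={"hardware power": fake_power_check})
print(run_instruction(backend, "hardware power", "asw-demo"))
print(run_instruction(backend, "hardware power", "dsw-demo"))
```

Keeping the tool registry separate from the result computation mirrors the split between back end and infrastructure in the model: new tools can be registered without touching how results are summarized.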
- Problem diagnosis (ali\_diag)
High-frequency operations are extracted from service tickets, work orders, and fault tickets, and the common operations and problem or fault-point tools are converted into atomic items. During daily operations, you query a product’s atomic items, problem points, and failure points, and quickly check key indicators to locate the problem.
Figure 2
- Scene diagnostics (ali\_scene)
Troubleshooting approaches are distilled from fault scenarios and output as a “three axe strokes” routine (a small set of standard moves) to locate the problem accurately. On this basis, fault points are assembled and the fault is located precisely.
Figure 3
- Emergency mitigation (ali\_cure)
Mitigation (“hemostasis”) and recovery methods proven on real failures and risks are distilled into the tool, so that once a solution is decided, recovery can be carried out quickly. Recovery actions include restart, downgrade, rate limiting, and switchover, which help the customer’s business recover quickly.
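One simple way to organize a fixed set of recovery actions like these is a dispatch table. The sketch below is an assumption about structure only: the handler names, the `cure` function, and the returned messages are hypothetical, and the handlers only describe the request instead of touching any platform.

```python
from typing import Callable, Dict

# Hypothetical handlers: a real tool would call platform APIs here.


def restart(service: str) -> str:
    return f"restart requested for {service}"


def downgrade(service: str) -> str:
    return f"downgrade requested for {service}"


def rate_limit(service: str) -> str:
    return f"rate limiting requested for {service}"


def switchover(service: str) -> str:
    return f"switchover requested for {service}"


# The four recovery actions named in the text, keyed by a hypothetical action name.
RECOVERY_ACTIONS: Dict[str, Callable[[str], str]] = {
    "restart": restart,
    "downgrade": downgrade,
    "rate_limit": rate_limit,
    "switchover": switchover,
}


def cure(action: str, service: str) -> str:
    """Look up the requested recovery action and apply it to one service."""
    handler = RECOVERY_ACTIONS.get(action)
    if handler is None:
        raise ValueError(
            f"unsupported recovery action {action!r}; "
            f"choose from {sorted(RECOVERY_ACTIONS)}"
        )
    return handler(service)


print(cure("rate_limit", "demo-gateway"))
```

Rejecting unknown actions up front matters in an emergency tool: a mistyped action should fail loudly instead of silently doing nothing.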
- Daily query (ali\_query)
Daily query covers routine lookups, display of associated data, and retrieval of commonly used information. In exact-match mode, you can query the product, route, capacity, policy, and other information associated with an IP address in the cloud. It currently covers IP-based queries of all kinds on the physical network.
- Smart traffic capture (ali\_trace)
To give the CLI the ability to capture packets at various points in the cloud platform, customized packet-capture command combinations can be deployed quickly at a capture point to capture network traffic in the specified inbound or outbound direction. Two network types are covered: classic network capture and VPC network capture.
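An IP-based lookup of this kind can be illustrated with a longest-prefix match over a small inventory. This is a minimal sketch using Python’s standard `ipaddress` module; the `INVENTORY` contents, the field names, and the `query_ip` function are invented for illustration and say nothing about how ali\_query stores or retrieves its data.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical inventory: CIDR block -> metadata. All values are invented.
INVENTORY: Dict[str, Dict[str, str]] = {
    "10.0.0.0/8": {"product": "classic-network", "route": "core"},
    "10.1.0.0/16": {"product": "vpc-gateway", "route": "border"},
    "192.0.2.0/24": {"product": "internet-access", "route": "isw"},
}


def query_ip(ip: str) -> Optional[Dict[str, str]]:
    """Return metadata for the most specific block containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for cidr, meta in INVENTORY.items():
        net = ipaddress.ip_network(cidr)
        # Keep the matching network with the longest prefix.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, meta)
    if best is None:
        return None
    return {"cidr": str(best[0]), **best[1]}


print(query_ip("10.1.2.3"))
print(query_ip("8.8.8.8"))
```

Here `10.1.2.3` falls in both `10.0.0.0/8` and `10.1.0.0/16`, and the lookup returns the more specific `/16` entry; an address outside every block returns `None`.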
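A direction-aware capture command can be assembled programmatically before it is sent to a capture point. The sketch below assumes `tcpdump` as the underlying capture tool, which the text does not state; `build_capture_command` and its parameters are hypothetical. The `-Q` direction option depends on the tcpdump/libpcap build and platform, so treat it as an assumption as well.

```python
import shlex
from typing import List, Optional


def build_capture_command(
    interface: str,
    direction: str,
    host: Optional[str] = None,
    count: int = 100,
    outfile: str = "capture.pcap",
) -> List[str]:
    """Build a tcpdump argument list for one capture point (assumed tool)."""
    if direction not in ("in", "out", "inout"):
        raise ValueError("direction must be 'in', 'out', or 'inout'")
    if count <= 0:
        raise ValueError("count must be positive")
    cmd = [
        "tcpdump",
        "-i", interface,    # interface to listen on
        "-Q", direction,    # capture only inbound, outbound, or both
        "-c", str(count),   # stop after this many packets
        "-w", outfile,      # write raw packets to a file
    ]
    if host:
        cmd += ["host", host]  # filter expression: traffic to or from host
    return cmd


cmd = build_capture_command("eth0", "in", host="192.0.2.10")
print(shlex.join(cmd))
```

Building the command as a list and joining it with `shlex.join` only for display avoids quoting mistakes if the list is later passed to `subprocess.run`.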
3. CLI-NET concept
CLI-Net is a branch function of the CLI system. It is responsible for troubleshooting and diagnosis of the physical network in the hybrid cloud, and runs diagnoses on specific aspects of the physical network environment through instructions in a unified format. CLI-Net covers four aspects of the hybrid cloud physical network: performance diagnosis of common network devices in the cloud, network status diagnosis at the cloud boundary, network status diagnosis inside the cloud, and network status diagnosis of physical machines. This spans the running state of the physical machines and switch networks of all products in the cloud, as well as troubleshooting and diagnosis of access from networks outside the cloud, such as the Internet and IDC networks, to the network inside the cloud. The specific diagnosis coverage is shown in the table below.
| CLI-Net diagnostic coverage | Common network device performance diagnosis | Cloud boundary network status diagnosis | Cloud network status diagnosis | Physical machine network status diagnosis |
| --- | --- | --- | --- | --- |
| ISW | ✓ | ✓ | | |
| DSW | ✓ | | ✓ | ✓ |
| CSW | ✓ | ✓ | | |
| LSW | ✓ | | ✓ | ✓ |
| ASW | ✓ | | ✓ | ✓ |
| | input | input | input | input |
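The coverage table above can also be treated as a small lookup structure, for example to answer which switch roles a given diagnosis touches. The sketch assumes that a marked cell means the switch role is included in that diagnosis; the short keys such as `cloud_boundary` are hypothetical labels for the four column headings, not CLI-Net identifiers.

```python
from typing import Dict, List, Set

# Switch role -> diagnoses that cover it, transcribed from the marked cells
# of the coverage table. Key names are hypothetical labels for the columns.
COVERAGE: Dict[str, Set[str]] = {
    "ISW": {"device_performance", "cloud_boundary"},
    "DSW": {"device_performance", "in_cloud", "physical_machine"},
    "CSW": {"device_performance", "cloud_boundary"},
    "LSW": {"device_performance", "in_cloud", "physical_machine"},
    "ASW": {"device_performance", "in_cloud", "physical_machine"},
}


def roles_covered_by(diagnosis: str) -> List[str]:
    """List the switch roles included in one diagnosis, in sorted order."""
    return sorted(role for role, kinds in COVERAGE.items() if diagnosis in kinds)


print(roles_covered_by("cloud_boundary"))
print(roles_covered_by("in_cloud"))
```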
| Diagnosis | CLI | Meaning |
| --- | --- | --- |
| | device_check | Checks the health of each switch itself, including hardware, interfaces, routing, and connectivity, and outputs the abnormal items found on the network device. |
| Core network direction diagnosis | core_network | Checks all physical server routing paths, interconnection links, and routing states in the cloud, either overall or for specified physical machines, and outputs the network exception items found. |
| | private_direction | Checks the overall condition of the physical network between the user IDC and the VPC network in the cloud (including all instance-level resources), and outputs the network exception items found. |
| | internet_direction | Checks the overall condition of the physical network between the Internet and the VPC network in the cloud (including all instance-level resources), and outputs the network exception items found. |
| Physical-virtual direction diagnosis | physics_virtual | Checks the overall condition of the physical network between the VPC network (including all instance-level resources) and the classic network (including all cloud service resources), and outputs the network exception items found. |
| Scenario | Instruction | Instruction result |
| --- | --- | --- |
| | ali_diag network ping project {product name} | Checks whether physical machine connectivity in each cluster in the cloud is normal |
| | ali_diag network ping switch {name} | |
| | ali_diag network hardware power {switch} | |
| | ali_diag network route BGP {switch} | Switch BGP routing protocol status check |
| | ali_scene network device_check | Switch hardware self-check |
| | ali_scene network internet_direction | |
| | ali_scene network private_direction | |
| base | ali_scene network core_network | Device network connectivity check |
| | ali_scene network physics_virtual | |
| physical machine failure on-line | ali_scene network core_network | Physical machine network check |
| | ali_diag network route BGP {switch} | |
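The instruction strings in the table above follow a fixed pattern with placeholders, so a wrapper script could render them from templates before running them. The sketch below only builds the strings and does not execute anything. The template keys, the `render` function, and the character whitelist are hypothetical; the placeholder written as `{product name}` in the table is renamed `product_name` here so it is a valid Python field name.

```python
import re
import string
from typing import Dict

# Instruction templates copied from the table; the dictionary keys are hypothetical.
TEMPLATES: Dict[str, str] = {
    "ping_project": "ali_diag network ping project {product_name}",
    "ping_switch": "ali_diag network ping switch {name}",
    "hardware_power": "ali_diag network hardware power {switch}",
    "route_bgp": "ali_diag network route BGP {switch}",
    "device_check": "ali_scene network device_check",
    "internet_direction": "ali_scene network internet_direction",
    "private_direction": "ali_scene network private_direction",
    "core_network": "ali_scene network core_network",
    "physics_virtual": "ali_scene network physics_virtual",
}

# Assumed whitelist for argument values: letters, digits, dot, underscore, hyphen.
_SAFE_VALUE = re.compile(r"^[A-Za-z0-9_.-]+$")


def render(key: str, **params: str) -> str:
    """Fill one template, rejecting missing or unsafe argument values."""
    template = TEMPLATES[key]
    needed = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    missing = needed - set(params)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    for name in needed:
        if not _SAFE_VALUE.match(params[name]):
            raise ValueError(f"unsafe value for {name}: {params[name]!r}")
    return template.format(**{name: params[name] for name in needed})


print(render("ping_switch", name="asw-01"))
print(render("device_check"))
```

Validating values before substitution keeps a stray space or shell metacharacter in a switch name from changing the meaning of the command.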