Automated Operations and Maintenance Through the Eyes of a Network Engineer

Written from a network engineer's perspective, this article discusses the tools network engineers can use to make enterprise network operations and maintenance more transparent and efficient.

Introduction

There are countless versions of the saying, “The network is like Wi-Fi: when nothing is wrong, nobody notices it exists.” For network engineers, that is about as visible as the job gets. Even in companies with thousands of employees, the network team often numbers in the single digits, and its work is not well known. “Is there something wrong with the network?” is practically every SRE's reflex when something breaks. If the network engineer stays silent at that moment, or cannot produce enough evidence, the blame almost certainly lands on the network. How to make the running state of the network more transparent, and how to clear the network of suspicion each time a service fails, is therefore not only a concern for the infrastructure services team. The network is also the black box that the whole technical team wants to understand.

1. Monitoring

1.1 Network Device Liveness Monitoring

SRE teams monitor whether programs are running normally, and host teams monitor whether server hardware is healthy. For the network, the first thing to care about is whether the network devices are reachable. When a single TOR (top-of-rack) switch is unreachable, a whole block of servers is usually unreachable too, and the business feels the pain acutely.

Network device monitoring is best decoupled from the service monitoring system as far as possible, because a network fault is likely to disrupt the service systems as well. If the service monitoring system itself is affected, network device alarms become unreliable: once people decide that “the monitoring is inaccurate”, it is hard to establish what failed, let alone who is responsible. This puts the network engineer on the back foot during troubleshooting and prolongs the outage.

Every network engineer leaves school with basic programming knowledge, and the number of switches is an order of magnitude smaller than the number of servers. So if you can read a little Python, a hundred or so lines of code are enough for a simple device liveness monitor. One example is NodePingManage, which can be found by searching GitHub; it can also be deployed in multiple locations to eliminate single points of failure. With a tool like this, reachability in every corner of the network finally becomes clear, and a ray of light falls on a previously dark network environment.
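As a rough illustration of how small such a monitor can be, here is a minimal sketch. It is not the code of NodePingManage; the function names, the retry count, and the device inventory format are assumptions made for this example, and the ping flags follow Linux iputils syntax.

```python
import platform
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ping_once(host, timeout_s=1):
    """Return True if one ICMP echo to `host` gets a reply (uses the system ping)."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Linux iputils syntax; the meaning of -W differs on macOS/BSD.
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=timeout_s + 2)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def find_unreachable(devices, probe=ping_once, retries=3, workers=32):
    """Return the sorted names of devices that failed all `retries` probes.

    `devices` maps a device name to its management address (assumed format).
    """
    def is_down(item):
        name, address = item
        # any() stops at the first successful probe, so healthy devices cost one ping.
        return name, not any(probe(address) for _ in range(retries))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(is_down, devices.items()))
    return sorted(name for name, down in results if down)
```

Requiring several consecutive failures before declaring a device down is a design choice to avoid alarming on a single lost packet. The result list would then be handed to whatever email or SMS alerting is already in place.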

1.2 Device Log Monitoring

Liveness alarms catch many faults with high accuracy. However, in a network with good redundancy, a successful ping does not mean there is no problem. In that case, a careful network engineer checks the device logs for more detail. A fleet of 10,000 servers needs only about a thousand network devices, but checking their logs one by one for anomalies is still a nightmare.

A log-and-alarm program therefore becomes a must-have for network engineers. All it takes is a Syslog server with a log monitoring program deployed on it: when specific keywords appear in the logs, email and SMS alarms are triggered. A tool of this size takes somewhat more code than the liveness monitor, on the order of 150 lines of Python. There are many similar solutions on GitHub; searching for LogScanWarning turns up one demonstration.
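The keyword-matching core of such a program might look like the sketch below. It is not taken from LogScanWarning; the categories and regular expressions are hypothetical placeholders, since the real keywords depend on each vendor's log format.

```python
import re

# Hypothetical keyword patterns; tune them to the log formats of your own devices.
ALERT_PATTERNS = {
    "fan": re.compile(r"fan.*(fail|abnormal)", re.IGNORECASE),
    "power": re.compile(r"power.*(fail|fault|absent)", re.IGNORECASE),
    "ospf": re.compile(r"ospf.*(down|neighbor)", re.IGNORECASE),
    "link": re.compile(r"(link|line protocol).*(down|flap)", re.IGNORECASE),
    "parity": re.compile(r"parity error", re.IGNORECASE),
}


def scan_lines(lines, patterns=ALERT_PATTERNS):
    """Return (category, line) for each log line matching an alert pattern.

    A line is reported once, under the first category that matches.
    """
    alerts = []
    for line in lines:
        for category, pattern in patterns.items():
            if pattern.search(line):
                alerts.append((category, line.strip()))
                break
    return alerts
```

In practice the function would be fed new lines from the Syslog server's log file at a regular interval, and each returned alert would be forwarded by email or SMS.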

In this way, you can detect network faults before the business notices them: abnormal fan speeds, power module faults, OSPF neighbor state jitter, port flapping, brute-force login attempts against a device, hardware parity errors, abnormal receive or transmit optical power on a module, kernel errors, and so on. A good network engineer can locate faults quickly when they occur; a top-notch network engineer eliminates hidden dangers before they turn into faults.

1.3 Traffic Monitoring

No matter how well a highway is paved, it is of little use if it is jammed. Keeping the network smooth, with good quality, no packet loss, and stable latency, is also the network engineer's responsibility, and this is where traffic monitoring becomes a necessity. Rapid business growth shows up at the network level as growing traffic: inside the data center (DC), across data center interconnects (DCI), on IDC egress links, and on dedicated lines. Traffic monitoring captures the peaks and troughs of each service accurately. When a line needs to be expanded, bandwidth usage is important data for management to refer to. As a rule of thumb, capacity expansion can be initiated when traffic on a line exceeds 50% of its bandwidth, because beyond that point the primary link becomes congested as soon as the backup link goes down.
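The utilization figure behind the 50% rule can be derived from two samples of an interface octet counter. The sketch below assumes the counters were already collected over SNMP (for example the 64-bit IF-MIB counters ifHCInOctets or ifHCOutOctets); the function names and the sampling interval are assumptions for illustration.

```python
def utilization_percent(octets_before, octets_after, interval_s, link_bps, counter_bits=64):
    """Return average link utilization in percent between two counter samples.

    The modulo handles a counter that wrapped around between the two samples.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    delta_octets = (octets_after - octets_before) % (2 ** counter_bits)
    return delta_octets * 8 / interval_s / link_bps * 100


def needs_expansion(peak_percent, threshold_percent=50.0):
    """Apply the rule of thumb: plan an expansion once peak usage exceeds the threshold."""
    return peak_percent > threshold_percent
```

For instance, 37.5 GB transferred in 60 seconds on a 10 Gbit/s link works out to 50% utilization, which sits exactly on the threshold rather than above it.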

1.4 Interface Error Monitoring

Like traffic data, interface error counters can be collected over SNMP, using the objects ifInErrors and ifOutErrors. Steadily increasing error counts directly affect service quality. SNMP can of course collect much more, such as CPU, memory, and temperature, or the session count on a firewall. This information tells you about the device's working condition, and these indicators are essential if you want to build an automatic inspection tool. There is plenty of network monitoring software on the market, for example Falcon, Zabbix, SolarWinds, Cacti, and Nagios. Some are open source and some are commercial, and their functions are similar, so they are not covered here.
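Since it is the increase in error counters that matters, not their absolute value, a collector typically compares two polling rounds. A minimal sketch, assuming each snapshot is a dictionary from interface name to an (ifInErrors, ifOutErrors) pair that some SNMP poller has already filled in:

```python
def error_increments(previous, current, counter_bits=32):
    """Return {interface: (in_delta, out_delta)} for interfaces whose error counters grew.

    `previous` and `current` map interface name -> (ifInErrors, ifOutErrors).
    Interfaces missing from `previous` are skipped because no delta can be computed.
    """
    modulus = 2 ** counter_bits  # ifInErrors and ifOutErrors are 32-bit counters
    flagged = {}
    for ifname, (in_now, out_now) in current.items():
        if ifname not in previous:
            continue
        in_prev, out_prev = previous[ifname]
        delta = ((in_now - in_prev) % modulus, (out_now - out_prev) % modulus)
        if any(delta):
            flagged[ifname] = delta
    return flagged
```

Note that a device reboot resets the counters and would show up here as a very large delta, so a real tool would also want to check the device uptime between polls.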

2. Building Automated Operations and Maintenance Tools

After the combination of measures in Section 1, there should be almost no “unexpected failures”: every anomaly is on record, and you will have an answer when SRE asks about the state of the network. However, a network engineer's work is not only firefighting. Day-to-day operations also involve online changes, machine room expansion, service troubleshooting, and so on. For a “lazy” network engineer, what can a program do to help?

2.1 User Device Tracker

The term is borrowed from a component of the SolarWinds suite. Network operations in small and medium enterprises frequently run into questions like these:

I know the IP address of the server. Which port is it connected to?
I know a port on the switch. What is the IP address of the server connected to it?
I know the MAC address of a server. Which port on which switch is it connected to?

Large Internet companies typically have a CMDB or a network management platform that records this information. But if you are a network administrator at a small or medium enterprise, with no operations development team behind you and still running a layer 2 environment (with the server gateways on the core devices), these questions are hard to answer. What you need is the PORT<->MAC<->IP mapping.

Here’s an example:

A switch has multiple physical interfaces. A physical interface can have multiple MAC addresses behind it. A MAC address can correspond to several IP addresses or to none at all. With this basic model, only two things are needed to find the correspondence between these three elements across the whole network. First, obtain the MAC address table (MAC<->PORT) from the switch directly connected to the server. Second, obtain the ARP table (IP<->MAC) from the server's gateway device. The two tables can then be joined using the MAC address as the unique key. GitHub has plenty of similar code for reference. With this mapping, you can quickly locate the information you want even without a CMDB. The average network administrator needs 5 minutes to find this information; you can find it in 5 seconds.
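The join described above can be sketched as follows. It assumes the MAC address table and the ARP table have already been fetched (over SNMP or the CLI) and parsed into tuples; the tuple layouts and function names are assumptions for this example.

```python
import re


def normalize_mac(mac):
    """Convert any common MAC notation (aa:bb.., aabb.ccdd.., aa-bb-..) to aa:bb:cc:dd:ee:ff."""
    hex_digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(hex_digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(hex_digits[i:i + 2] for i in range(0, 12, 2))


def build_port_map(mac_table, arp_table):
    """Join (switch, port, mac) rows with (ip, mac) rows on the MAC address.

    Returns one dict per MAC table row; "ips" is empty when the MAC has no ARP entry.
    """
    ips_by_mac = {}
    for ip, mac in arp_table:
        ips_by_mac.setdefault(normalize_mac(mac), []).append(ip)
    rows = []
    for switch, port, mac in mac_table:
        key = normalize_mac(mac)
        rows.append({"switch": switch, "port": port, "mac": key,
                     "ips": ips_by_mac.get(key, [])})
    return rows


def find_by_ip(rows, ip):
    """Answer "which port is this IP connected to?" from the joined rows."""
    return [(row["switch"], row["port"]) for row in rows if ip in row["ips"]]
```

Normalizing the MAC notation first matters because different vendors print the same address in different formats. Uplink ports learn many MAC addresses, so a real tool would restrict the MAC table to server-facing ports.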

2.2 Wrapping the Northbound Interfaces of Network Devices

Daily network operations include some simple, repetitive tasks, such as assigning an interface to a VLAN or adding a host route on a device. These operations have little technical content and consume engineers' valuable time. More importantly, a simple manual operation will eventually go wrong if it is repeated often enough; as the saying goes, “walk along the riverbank often enough and your shoes will get wet.” Yet a mistake is a blemish on a career. How can such work be done well?

Take “automatically assigning a switch port to a VLAN” as an example. Suppose there were a tool that needed only three parameters, the device IP, the port, and the VLAN number, and then logged in to the device automatically and assigned that interface to the specified VLAN. Wouldn't that be nice? What you need is a wrapper around the device's interfaces. Most network equipment vendors now provide their own APIs, whether NETCONF or RESTful, and once you have read the user manual you can easily change the device configuration from a program. You can even use a more down-to-earth method and have the program simulate a login to the device. This approach cannot match NETCONF or RESTful APIs in efficiency, but it is unbeatable in generality, since practically every vendor's devices support SSH or TELNET.
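A minimal sketch of such a three-parameter tool is shown below. The command strings follow Cisco IOS-style CLI syntax purely as an assumption; other vendors use different commands. The `send_commands` callable is hypothetical and stands for whatever transport you wrap (a simulated SSH login, NETCONF, or a REST call).

```python
import re


def build_access_vlan_commands(port, vlan_id):
    """Return the CLI lines that put `port` into access VLAN `vlan_id` (IOS-style, assumed)."""
    if isinstance(vlan_id, bool) or not isinstance(vlan_id, int) or not 1 <= vlan_id <= 4094:
        raise ValueError(f"invalid VLAN id: {vlan_id!r}")
    # Restrict the port name to safe characters so a caller cannot inject extra commands.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9/.\-]*", port):
        raise ValueError(f"invalid port name: {port!r}")
    return [
        "configure terminal",
        f"interface {port}",
        "switchport mode access",
        f"switchport access vlan {vlan_id}",
        "end",
    ]


def assign_vlan(device_ip, port, vlan_id, send_commands):
    """Build the commands and hand them to the transport; returns the transport's result."""
    return send_commands(device_ip, build_access_vlan_commands(port, vlan_id))
```

Keeping command generation separate from the transport means the same validation applies whether the change is pushed over a simulated login or a vendor API.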

On this basis, some simple network operations can be carried out through your own wrapped interfaces, and the permission to make those changes can even be handed to the business teams: as long as a request is legitimate, the change takes effect immediately. At this point, someone is going to panic. Is it really a good idea to hand access to network equipment over to the business? What if something gets broken? All of these doubts are legitimate, and all of them have solutions. You can restrict what the program is allowed to do. You can restrict the target switch to a TOR rather than a core switch (CSW). You can restrict the interface to Access mode rather than Trunk. You can use dynamic tokens to secure the interface. You can require the caller to supply the MAC address currently seen on the interface, to confirm that it is the right one. You can also whitelist the callers.
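Several of these restrictions can be expressed as simple checks that run before any change is pushed. The sketch below is one possible shape; the `ChangeRequest` fields and the role and mode strings are assumptions for this example, and the dynamic token check is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRequest:
    caller: str            # identity of the calling service
    device_role: str       # e.g. "TOR" or "CSW", looked up from your inventory
    port_mode: str         # current mode of the target port: "access" or "trunk"
    expected_mac: str      # MAC the caller claims is on the port
    observed_macs: tuple   # MACs actually learned on the port


def rejection_reasons(request, allowed_callers):
    """Return a list of rule violations; an empty list means the change may proceed."""
    reasons = []
    if request.caller not in allowed_callers:
        reasons.append("caller is not whitelisted")
    if request.device_role != "TOR":
        reasons.append("target device is not a TOR switch")
    if request.port_mode != "access":
        reasons.append("target port is not an access port")
    observed = {mac.lower() for mac in request.observed_macs}
    if request.expected_mac.lower() not in observed:
        reasons.append("expected MAC is not present on the port")
    return reasons
```

Returning every violated rule, rather than stopping at the first, gives the caller a complete explanation of why a request was refused.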

All of these considerations can be codified as rules, and a program is the most faithful executor of rules. An interface can provide service 24/7 all year round, whereas a person's energy is limited. Letting programs handle the simple, regular requests from the business saves engineers precious time to think about bigger things, and that is the right path for network engineers toward automated operations and maintenance.

Conclusion

The above summarizes some thoughts drawn from the author's own work experience. Writing code is indeed somewhat difficult for network engineers, but once you clear this hurdle you gain much more time to broaden your career path. I hope this article serves as a modest starting point that prompts better ideas and contributes to automated network operations and maintenance.

This article was first published on the public account “Mi Operation and Maintenance”.