Preface

In an environment where people "pray to Buddha that the server does not go down" and joke about "sacrificing a programmer to the heavens", programmers live in fear every day, and every phone call or message makes us shiver. For our own safety, discovering server problems promptly is not just a matter for operations and maintenance. This article summarizes the common server monitoring metrics; I hope every developer has a script running to protect their own life.

My articles are often scraped and reposted without a link to the original, and my updates and corrections cannot be synced to those copies, so here is the original address: http://www.cnblogs.com/zhenbianshu/p/7683496.html


Obtaining Server Information

When multiple machines need to be monitored at the same time, each machine runs its own monitoring program. We first need to obtain information about the server so that we can identify the machine and, when a problem occurs, judge how severe it is.

Get IP

Obtain the Intranet IP address:

Run the ifconfig command to obtain all network information, then filter out the loopback and IPv6 entries.

/sbin/ifconfig | grep inet | grep -v '127.0.0.1' | grep -v inet6 | awk '{print $2}' | sed 's/addr://'

Note that the absolute path of ifconfig is used here: when the monitoring script runs from crontab, it is executed without the usual environment (including your login PATH), so a bare ifconfig may not be found.
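The filtering step can also be wrapped in a small function, which makes it easier to test against sample output. The following is only a minimal sketch, not the script used in this article: the function name extract_ipv4 is hypothetical, and it assumes ifconfig prints either the older "inet addr:10.0.0.5" form or the newer "inet 10.0.0.5" form.

```shell
#!/bin/sh
# extract_ipv4 (hypothetical helper): reads ifconfig-style output on stdin
# and prints every non-loopback IPv4 address, one per line.
extract_ipv4() {
    awk '$1 == "inet" {
        ip = $2
        sub(/^addr:/, "", ip)         # older net-tools prints "inet addr:10.0.0.5"
        if (ip !~ /^127\./) print ip  # skip the loopback range
    }'
}

# Usage from a cron job, with the absolute path because cron has a minimal PATH:
#   /sbin/ifconfig | extract_ipv4
```

Matching on the first field being exactly "inet" drops the inet6 lines without a separate grep.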

Obtain the external IP address:

To find the external IP, we can ask a website to report the address our request came from. I am too lazy to build such a service myself, so I use an existing one, such as ipecho.net/plain or alwayscoding.net.

curl alwayscoding.net

Obtaining System Information

I recommend using lsb_release -a to obtain system information:

lsb_release -a
LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: CentOS
Description:    CentOS release 6.5 (Final)
Release:        6.5
Codename:       Final

The output is rich; you can extract just the part of the string you need.
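As an illustration of extracting one field, the sketch below pulls out only the Description value. The function name os_description is hypothetical, and the sketch assumes the value itself contains no colon.

```shell
#!/bin/sh
# os_description (hypothetical helper): reads `lsb_release -a` output on stdin
# and prints only the value of the "Description" field.
os_description() {
    # Split each line on the colon plus any spaces or tabs that follow it.
    awk -F':[ \t]*' '$1 == "Description" { print $2 }'
}

# Usage:
#   lsb_release -a 2>/dev/null | os_description
```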


CPU

CPU load is the ratio of the number of processes the CPU is handling to the maximum number it can handle over a period of time, so the maximum load of a single CPU is 1.0. At that point the CPU can just keep up with all of its processes. Beyond this limit the system is overloaded, and some processes have to wait for others to finish. We generally consider a CPU load below 0.6 to be healthy.

The top command is usually used to view the system load in a terminal. However, top is interactive and prints a large amount of unrelated data, which makes it awkward to use in monitoring scripts. Instead, uptime is generally used: its load average field gives the average load over the last 1, 5, and 15 minutes.

uptime
 16:03:30 up 130 days, 23:33,  1 user,  load average: 4.62, 4.97, 5.08

Here the average load of the system is about 5, yet the system is not overloaded and no errors are reported. This is because the number of CPU cores must also be taken into account: the number of processes a multi-core CPU can handle at the same time is proportional to its number of cores.

We use nproc to get the number of CPU cores on the system. The machine I am using has 16 cores, so its maximum load is 16. The per-core load is therefore about 5/16 ≈ 0.31, and the CPU is healthy.
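The division by the core count and the 0.6 threshold can be sketched as two small functions. The names load_per_core and cpu_is_healthy are hypothetical, and reading the 1-minute load from /proc/loadavg in the usage comment is an assumption that holds on Linux only.

```shell
#!/bin/sh
# load_per_core (hypothetical helper): divides a load average by the core count
# and prints the result with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# cpu_is_healthy (hypothetical helper): succeeds when the per-core load is
# below the 0.6 threshold used in this article.
cpu_is_healthy() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load / cores < 0.6) }'
}

# Usage on Linux, taking the 1-minute load from /proc/loadavg:
#   load=$(cut -d' ' -f1 /proc/loadavg)
#   cpu_is_healthy "$load" "$(nproc)" || echo "CPU load per core: $(load_per_core "$load" "$(nproc)")"
```

The arithmetic is done in awk because POSIX shell arithmetic handles integers only.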


Memory

Memory is another core metric to monitor. If memory usage is too high, processes will fail to allocate the memory they need to run.

We can also use the top command to check the memory usage, but the free command is more commonly used in monitoring:

free -m
             total       used       free     shared    buffers     cached
Mem:         32108      18262      13846          0        487      11544
-/+ buffers/cache:       6230      25878
Swap:            0          0          0

Let's first look at the Mem line: there is 32108M of memory in total, 18262M is used, and 13846M is free, so memory usage appears to be 18262/32108*100% = 56.88%. But what about shared, buffers, and cached?

In fact, Linux manages memory lazily. After a process has finished with a piece of memory, Linux does not clear it immediately but keeps its contents as cache, so that the data does not have to be loaded again if the process restarts. When available memory runs out, this cache is cleared and the memory is reused. The buffers and cached parts of used can therefore be reclaimed at any time and should not be counted as occupied. Shared is the memory occupied by shared memory between processes, which is rarely used. See the reference article at the end of this article for more on this.

The real figures are on the third line, "-/+ buffers/cache", which shows used and free memory with buffers and cache excluded. The true memory usage is therefore 6230/(6230+25878)*100% = 19.4%.
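This calculation can be sketched as a filter over the free output. The function name real_mem_usage is hypothetical, and the sketch assumes the older output format shown above. Newer versions of free no longer print the "-/+ buffers/cache" line and report an "available" column instead, so the sketch would need adapting there.

```shell
#!/bin/sh
# real_mem_usage (hypothetical helper): reads old-style `free -m` output on
# stdin and prints used / (used + free) * 100 from the "-/+ buffers/cache"
# line, with one decimal.
real_mem_usage() {
    awk '$2 == "buffers/cache:" { printf "%.1f\n", $3 / ($3 + $4) * 100 }'
}

# Usage:
#   free -m | real_mem_usage
```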

The fourth line shows swap, which is disk space used to temporarily hold memory contents moved out of physical memory. It helps under normal circumstances, but when physical memory is small it leads to frequent swap reads and writes, which increases the I/O pressure on the server. Whether to use swap depends on the situation.


Network

The network is also an important metric when Linux is used as a web server. There are many related commands, each with its own strengths. We generally monitor the following states:

Use netstat to view listening ports.

For example, the following command checks whether any process is listening on port 80:

netstat -an | grep LISTEN | grep tcp | grep 80
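One caveat with a plain grep 80 is that it also matches ports such as 8080 or 8000. A stricter check can be sketched as follows; the function name port_is_listening is hypothetical, and the sketch assumes the Linux netstat -an column layout (local address in the fourth column, state in the sixth).

```shell
#!/bin/sh
# port_is_listening (hypothetical helper): reads `netstat -an` output on stdin
# and succeeds only if a TCP socket is listening on exactly the given port.
port_is_listening() {
    awk -v port="$1" '
        # The local address must end in ":<port>", so 8080 does not match 80.
        $1 ~ /^tcp/ && $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# Usage:
#   netstat -an | port_is_listening 80 || echo "nothing is listening on port 80"
```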

Use ping to monitor network connections.

You can use the ping command to check whether the network is reachable. Use the -c option to control the number of requests and the -w option to set the overall deadline (in seconds), then use the short-circuit behavior of && to control the output:

ping -w 100 -c 1 weibo.com &>/dev/null && echo "connected"
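For a monitoring script it is usually useful to report the failure case as well. The sketch below is one possible way to do that; the function name check_host is hypothetical, and it assumes the Linux iputils ping, where -w is a deadline in seconds. It uses >/dev/null 2>&1 instead of &> so that it also works when cron runs the script with a plain sh.

```shell
#!/bin/sh
# check_host (hypothetical helper): sends one ping with a 1-second deadline
# and prints the result; it returns non-zero when the host is unreachable.
check_host() {
    if ping -w 1 -c 1 "$1" >/dev/null 2>&1; then
        echo "connected"
    else
        echo "unreachable"
        return 1
    fi
}

# Usage:
#   check_host weibo.com
```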


Hard disk

The hard disk is a less critical metric. However, when the disk is full, file writes fail, and this can affect process execution.

We use the df command to check disk usage; the -h option prints sizes in a human-readable format:

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        40G  6.0G   32G  16% /
tmpfs            16G     0   16G   0% /dev/shm
/dev/vdb1       296G   16G  265G   6% /data0

You can use grep to find the mount point you want to query, and awk to extract the field you need.
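The grep and awk steps can be combined into a single awk filter, sketched below. The function name disk_usage_percent and the 90 percent threshold in the usage comment are hypothetical. The usage comment also switches to df -P, which keeps each filesystem on one line, and the sketch assumes mount points without spaces.

```shell
#!/bin/sh
# disk_usage_percent (hypothetical helper): reads df output on stdin and
# prints the Use% value, without the percent sign, for the given mount point.
disk_usage_percent() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { sub(/%$/, "", $5); print $5 }'
}

# Usage (the 90 percent threshold is an arbitrary example):
#   usage=$(df -P | disk_usage_percent /data0)
#   [ "$usage" -ge 90 ] && echo "disk /data0 is ${usage}% full"
```

Comparing the last field for exact equality avoids the partial matches that grep / would produce.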

In addition, run du [-h] /path/to/dir [--max-depth=n] to view the size of a directory. Use --max-depth=n to control the traversal depth.


Processes/others

Other things to monitor mainly include process error logs, request counts, and process status. These can be checked with basic commands such as ps.

For more detail you need the process logs. Commands such as grep and awk can be used to analyze them.
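As a small illustration of log analysis with grep, the sketch below counts the lines in a log file that contain a given string. The function name count_matches, the log path, and the ERROR keyword are all hypothetical; real log formats vary by application.

```shell
#!/bin/sh
# count_matches (hypothetical helper): prints how many lines of a log file
# contain the given fixed string (no regular-expression interpretation).
count_matches() {
    grep -cF -- "$1" "$2"
}

# Usage (path and keyword are examples only):
#   errors=$(count_matches ERROR /path/to/app.log)
#   [ "$errors" -gt 0 ] && echo "found $errors error lines"
```

Note that grep exits with a non-zero status when the count is 0, although it still prints 0.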


Conclusion

The monitoring results can be collected in the usual "push" or "pull" way. I recommend having each machine push its results to one machine that handles statistics and alerting. You can also use rsync to pull the results from each server. The alert channel, such as enterprise WeChat, SMS, or email, can be configured as required.

Finally, system monitoring is an important and ongoing concern. Good luck with your servers.


References:

Understanding Linux system load, Ruan Yifeng

Can a Cache in Linux memory really be reclaimed?