A competent Linux operations and maintenance (O&M) engineer needs a clear troubleshooting routine so that, when a problem occurs, it can be located and solved quickly. A general approach looks like this:
- Pay attention to the error message: almost every failure produces an error message, and in most cases that message alone is enough to point at the problem. Ignore it and the problem will be much harder to solve.
- Check the log files: sometimes the error message only shows the symptom. To understand the problem, check the corresponding logs, which fall into system logs (under /var/log) and application logs. Combining the two is usually enough to locate the fault.
- Analyze and locate the problem: this is the hardest step. Based on the error message, the logs, and any other relevant clues, narrow things down until the root cause is found.
- Solve the problem: once the cause is known, fixing it is usually the easy part.
In other words, troubleshooting is mostly a process of analysis: once the cause has been identified, clearing the fault follows naturally.
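As a quick illustration of the log-checking step, a few commonly used commands (the paths are typical defaults and vary between distributions):
# tail -n 100 /var/log/messages View recent system messages (RHEL/CentOS)
# tail -n 100 /var/log/syslog Debian/Ubuntu equivalent
# dmesg | tail Recent kernel ring buffer entries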
With this approach in mind, here are six typical Linux O&M problems and how to analyze and solve them:
Problem 1: The system cannot start because the file system is damaged
Checking root filesystem
/dev/sda6 contains a file system with errors, check forced
An error occurred during the file system check
This error shows that the file system on the /dev/sda6 partition is damaged. It is a very common problem, usually caused by a sudden power loss that leaves the file system structure inconsistent. The usual fix is to force a repair with the fsck command:
# umount /dev/sda6
# fsck.ext3 -y /dev/sda6
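Since this particular failure happens while the system is booting, the repair is typically done from the maintenance shell the boot process drops into, or from rescue media. A hedged sketch, reusing the device name from the error above; the -n dry run is optional but lets you see the damage before forcing a repair:
# fsck.ext3 -n /dev/sda6 Dry run: report errors without changing anything
# fsck.ext3 -y /dev/sda6 Force repair, answering "yes" to every prompt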
Problem 2: Argument list too long error and solution
# crontab -e
When saving the edit and exiting, the error "No space left on device" is displayed.
This error suggests the disk is full, so check disk usage first:
# df -h
The output shows that the /var partition is 100% full, which explains the fault: crontab writes its file under the /var directory when saving, and with no space left on that partition the save fails with the error above.
Running du -sh * under /var to check the size of every file and directory shows that /var/spool/clientmqueue occupies about 90% of the partition, so the next step is to delete the files in /var/spool/clientmqueue:
# rm *
/bin/rm: argument list too long
The "argument list too long" error occurs when too many arguments are passed to a single command: the shell expands the * wildcard into one argument per file, and the combined length exceeds a kernel limit. This is a long-standing limitation in Linux, and the limit can be checked with the getconf ARG_MAX command:
# getconf ARG_MAX
# more /etc/issue Check version
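As a hedged illustration of why the wildcard fails, the length of the argument list the shell would build can be compared against that limit (the directory is the one from this case; echo is a shell builtin, so it is not itself subject to the limit):
# echo /var/spool/clientmqueue/* | wc -c Approximate size, in bytes, of the expanded file list that rm would receive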
Solutions:
1. Delete the files in batches with narrower wildcard patterns
# rm [a-n]* -rf
# rm [o-z]* -rf
2. Use the find command to delete (an xargs variant is sketched after this list)
# find /var/spool/clientmqueue -type f -print -exec rm -f {} \;
3. Use a shell script
#!/bin/bash
RM_DIR='/var/spool/clientmqueue'
cd $RM_DIR
for i in `ls`
do
  rm -f $i
done
4. Recompile the kernel
To manually increase the number of pages the kernel allocates for command-line arguments, open include/linux/binfmts.h in the kernel source and find the following line:
#define MAX_ARG_PAGES 32
Change 32 to a larger value, such as 64 or 128, and recompile the kernel
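A related approach, not listed in the original write-up but commonly used for this situation, is to let xargs batch the arguments so that each rm invocation stays under the limit. A minimal sketch against the same directory:
# find /var/spool/clientmqueue -type f -print0 | xargs -0 rm -f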
Problem 3: Application failure due to inode exhaustion
After an Oracle database is restarted, the Oracle listener cannot be started and reports the error "Linux Error: No space left on device".
The error output indicates that the listener cannot start because disk space is exhausted; Oracle needs to create the listener log file before the listener can start. So the first step is to check disk space usage:
# df -h
The output shows that every partition has plenty of free space; the path where the Oracle listener writes its log is on the /var partition, and /var itself has enough space too.
Solution:
On Linux, "disk space" usage actually has several dimensions: the physical data blocks, the inodes, and the space used for things such as semaphores. Since physical disk space is clearly not the problem here, run df -i to check whether the inodes are exhausted. In this case the output shows that the inodes are indeed used up, which is why no new file can be created.
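A hedged sketch of what that check looks like (the numbers below are illustrative, not taken from the original case):
# df -i /var
Filesystem      Inodes   IUsed  IFree IUse% Mounted on
/dev/sda3       131072  131072      0  100% /var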
To view the total number of inodes for a disk partition, run the following command
# dumpe2fs -h /dev/sda3 | grep 'Inode count'
Each inode has a number. The operating system uses the inode number to distinguish different files. You can run the ls -i command to view the inode number corresponding to the file name
To view more detailed inode information for this file, use the stat command
# stat install.log
To solve the problem, delete the mass of small files under /var/spool/clientmqueue that are consuming the inodes:
# find /var/spool/clientmqueue/ -name "*" -exec rm -rf {} \;
Problem 4: The file was deleted, but the space was not freed
Linux has no recycle bin, so on this production server files to be deleted are first moved into the system /tmp directory, and the data in /tmp is cleared periodically. The server has no separate /tmp partition, so whatever sits in /tmp actually consumes space in the root partition. When the root partition filled up, the obvious fix was to find and delete the largest files under /tmp:
# du -sh /tmp/* | sort -nr | head -3
The command reveals a 66GB file named access_log in /tmp. Judging by the name and size, it is an Apache access log that has not been rotated or cleaned for a long time. After confirming the file can be deleted, delete it and check the space again:
# rm /tmp/access_log
# df -h
The output shows that the root partition space has still not been freed.
Normally, deleting a file releases its space immediately, but there are exceptions, for example when the file is locked by a process or a process is still writing to it. To understand what is happening, it helps to know how Linux stores files.
A file is stored in a file system in two parts: the data blocks and a pointer part kept in the file system's metadata. Normally, when a file is deleted the pointer is removed from the metadata and the data blocks on disk become free to be reused. In this case, however, the access_log file was still held open by a process, so although the directory entry was removed, the pointer in the metadata was never released. The kernel therefore still considers the file to exist, and df keeps counting its blocks as used.
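A minimal sketch that reproduces this behaviour in a shell (the file name /tmp/demo.log and the 100MB size are purely illustrative):
# exec 3> /tmp/demo.log Open a file descriptor to the file and keep it open
# dd if=/dev/zero of=/tmp/demo.log bs=1M count=100 Write 100MB of data into the same file
# rm /tmp/demo.log Remove the directory entry
# df -h /tmp The 100MB is still reported as used
# lsof | grep delete The deleted-but-open file shows up here
# exec 3>&- Close the descriptor; the space is now freed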
Troubleshooting:
With that in mind, check whether any process is still writing to the access_log file using the lsof command, which can list deleted files that are still held open by applications:
# lsof | grep delete
The output shows that /tmp/access_log is held open by the httpd process, which is still writing log data to it. The "deleted" state in the last column means the file has been deleted while the process keeps writing to it, which is why the space is not freed.
Solution:
The simplest fix is to stop or restart the httpd process (restarting the operating system would also work). But the cleanest way to release the disk space is to truncate the file online, with the process still running:
# echo "" > /tmp/access_log
This releases the disk space immediately while letting the process keep writing to the same file, and it is a common way to clean up live log files produced by web services such as Apache, Tomcat, and Nginx.
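A couple of equivalent ways to truncate a file in place, not from the original write-up but worth knowing:
# : > /tmp/access_log Redirection alone truncates the file to zero bytes
# truncate -s 0 /tmp/access_log The coreutils truncate command does the same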
Problem 5: “too many open files” error and solution
Symptom: a Java-based web application reports that data cannot be added from its admin backend. Logging in to the server and checking the Tomcat log reveals a java.io exception, the "too many open files" error this section is about.
This error means the process has run out of available file descriptors. Since the Tomcat service on this system is started by the www user, log in as www and run ulimit -n to check the maximum number of file descriptors a process may open. The output is as follows:
$ ulimit -n
65535
The server's limit on open file descriptors is 65535, which should be more than enough, so why does the error still occur?
Answering that requires a closer look at how the ulimit command is applied.
ulimit can be applied in several ways:
1. Add it to the user's shell startup file
If the user's shell is bash, adding "ulimit -u 128" to .bashrc or .bash_profile in the user's home directory limits that user to 128 processes.
2. Add it to the application's startup script
If the application is Tomcat, adding "ulimit -n 65535" to its startup script startup.sh allows the Tomcat process to open up to 65535 file descriptors.
3. Run ulimit directly in a shell terminal
A limit set this way applies only to the terminal in which the command is run; it is lost once that terminal exits and does not affect other shells. (A consolidated sketch of all three methods follows below.)
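A hedged sketch of the three methods, with illustrative paths and values rather than ones taken from the original case:
# echo "ulimit -n 65535" >> /home/www/.bash_profile Method 1: per-user, takes effect at the next login
# vi /usr/local/tomcat/bin/startup.sh Method 2: add "ulimit -n 65535" near the top of the script
# ulimit -n 65535 Method 3: current shell only; lost when the terminal closes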
Solution:
Since the ulimit value itself looks fine, the suspicion is that the setting is simply not taking effect for the Tomcat process. Checking the www user's shell startup files shows no ulimit setting there, and the Tomcat startup script startup.sh has none either. The last place to look is /etc/security/limits.conf:
# cat /etc/security/limits.conf | grep www
www soft nofile 65535
www hard nofile 65535
So the resource limit was indeed added in limits.conf. The next question is when it was added relative to when Tomcat was started, so check the Tomcat start time:
# uptime
 up 283 days
# pgrep -f tomcat
4667
# ps -eo pid,lstart,etime | grep 4667
4667 Sat Jul  6 09:33:39 2013    77-05:26:02
The output shows the server has been up for 283 days without a restart, and Tomcat was started at around 9am on July 6, 2013, roughly 77 days earlier. Now check when limits.conf was last modified:
# stat /etc/security/limits.conf
The limits.conf file was last modified on July 12, 2013, after Tomcat was started, so the running Tomcat process never picked up the new limit. The solution is simple: restart Tomcat.
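A hedged way to verify the limit that actually applies to the running process after the restart (4667 is the hypothetical Tomcat PID from the ps output above; substitute the real one):
# cat /proc/4667/limits | grep "open files" Shows the soft and hard "Max open files" limits of that process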
Problem 6: Read-only File system error and resolution
Analysis: there are many possible causes. The problem may stem from inconsistent file system data blocks or from a failing disk. Mainstream journaling file systems such as ext3/ext4 have fairly strong self-healing mechanisms and can usually repair simple errors on their own; when a fatal error cannot be repaired, the file system protects data consistency and safety by temporarily blocking write operations, leaving it read-only. That is exactly the "read-only file system" phenomenon seen here.
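A hedged first check before attempting any repair (the mount point and device are the ones used later in this case):
# mount | grep /www/data The mount options will show "ro" if the kernel has remounted the partition read-only
# dmesg | tail Kernel messages usually record the error that triggered the remount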
File system errors are repaired manually with the fsck command, and the partition has to be unmounted before it can be repaired:
# umount /www/data
umount: /www/data: device is busy
The partition cannot be unmounted, which usually means some process is still using files on it. Check which one:
# fuser -m /dev/sdb1
/dev/sdb1: 8800
Then check which process PID 8800 belongs to:
# ps -ef | grep 8800
The output shows it is an Apache process that was never shut down. Stop Apache, then unmount, repair, and remount the partition:
# /usr/local/apache2/bin/apachectl stop
# umount /www/data
# fsck -V -a /dev/sdb1
# mount /dev/sdb1 /www/data
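After remounting, a quick hedged check that the partition is writable again:
# mount | grep /www/data The options should now include "rw"
# touch /www/data/.write_test && rm /www/data/.write_test A successful write confirms the fix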