The system version is CentOS 7.
vm.swappiness
Setting method:
- echo 1 > /proc/sys/vm/swappiness, or
- sysctl -w vm.swappiness=1, or
- Edit the /etc/sysctl.conf file and add vm.swappiness=1
Swap space is similar to virtual memory on Windows: when physical memory is insufficient, a swap partition on disk can be used as memory. However, because disk read/write rates are far lower than memory rates, heavy swapping increases system latency and can even make services unavailable for long periods, which is fatal for big-data clusters.
The vm.swappiness parameter controls how the kernel uses swap space. The default value is 60. The higher the value, the more swap space is used by the kernel.
For CDH clusters with large memory, we typically set this value to 0 or 1: 0 means swap space is used only when available physical memory falls below the minimum threshold vm.min_free_kbytes (described later), and 1 means swap space is used as little as possible.
Two explanations have been found for the specific mechanism of this configuration:
- When physical memory usage exceeds (100 - vm.swappiness)%, the swap partition is used.
- vm.swappiness controls whether memory reclaim prefers anonymous memory or file cache: a value of 100 means anonymous memory and file cache are reclaimed with equal priority, while the default of 60 means file cache is reclaimed first.
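Putting the setting methods above together, a minimal sketch for applying and verifying vm.swappiness=1 (must be run as root; the sed/grep idiom for updating /etc/sysctl.conf is one common approach, not the only one) might look like:

```shell
# Apply vm.swappiness=1 for the running kernel (requires root)
sysctl -w vm.swappiness=1

# Persist across reboots: update the entry if present, append otherwise
grep -q '^vm.swappiness' /etc/sysctl.conf \
  && sed -i 's/^vm.swappiness.*/vm.swappiness=1/' /etc/sysctl.conf \
  || echo 'vm.swappiness=1' >> /etc/sysctl.conf

# Verify the live value
cat /proc/sys/vm/swappiness
```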
vm.min_free_kbytes
Setting method:
- echo 4194304 > /proc/sys/vm/min_free_kbytes, or
- sysctl -w vm.min_free_kbytes=4194304, or
- Edit the /etc/sysctl.conf file and add vm.min_free_kbytes=4194304
Memory watermarks
In Linux, free memory has three watermarks, denoted from high to low as high, low, and min, related as follows:
min = vm.min_free_kbytes, low = min * 5/4, high = min * 3/2
When free memory falls below the high value, the system considers memory to be under some pressure. When it falls below the low value, the kswapd daemon begins reclaiming memory. When it drops further to the min value, direct reclaim is triggered, which blocks the running program and increases latency.
Therefore, vm.min_free_kbytes should be neither too small nor too large. If it is too small (for example, only tens of megabytes), the gap between low and min is tiny, making it easy to trigger direct reclaim and hurting efficiency. If it is too large, memory is wasted and kswapd spends more time reclaiming. The commands above set it to 4 GB; depending on physical memory size, values between 1 GB and 8 GB are typical.
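To make the ratios concrete, the watermarks implied by vm.min_free_kbytes=4194304 can be computed directly (a sketch of the arithmetic only; the kernel actually derives per-zone watermarks, so real values differ slightly):

```shell
# Derive the low and high watermarks from min using the ratios above:
# low = min * 5/4, high = min * 3/2 (all values in kB)
min_kb=4194304                       # 4 GB, the value set above
low_kb=$(( min_kb * 5 / 4 ))
high_kb=$(( min_kb * 3 / 2 ))
echo "min=${min_kb}kB low=${low_kb}kB high=${high_kb}kB"
```

With a 4 GB min, kswapd starts reclaiming at 5 GB free and direct reclaim is still a full gigabyte away, which is the cushion the text recommends.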
Transparent Huge Pages (THP)
Setting method:
- echo never > /sys/kernel/mm/transparent_hugepage/enabled
- echo never > /sys/kernel/mm/transparent_hugepage/defrag
Huge pages are an implementation of paged memory management. Computer memory is addressed through table mapping (the page table), and currently most systems use 4 KB pages as the smallest unit of memory addressing. As memory grows, so does the page table: on a machine with 256 GB of memory using 4 KB small pages, the page table alone is about 4 GB. The page table must reside in memory, and its hot entries must fit in the CPU's cache; if it is too large, many misses occur and memory-addressing performance deteriorates.
Huge pages are designed to solve this problem. They manage memory with 2 MB large pages instead of traditional small pages, keeping the page table small enough to stay cached and preventing misses. There are two implementations: Static Huge Pages (SHP) and Transparent Huge Pages (THP). As the names suggest, SHP is static while THP is dynamic. Because THP allocates and manages pages at runtime, it introduces a degree of latency that hurts memory-intensive applications, so it must be turned off.
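A sketch for checking the current THP state and disabling it persistently (the sysfs paths are as shipped on CentOS 7; re-running the echo commands from rc.local is one common way to survive reboots, since these sysfs values are not covered by sysctl.conf):

```shell
# Show current state; the active choice appears in [brackets],
# e.g. "always madvise [never]"
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# Disable for the running system (requires root)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Persist by re-applying at boot via rc.local
cat >> /etc/rc.d/rc.local <<'EOF'
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
EOF
chmod +x /etc/rc.d/rc.local
```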
vm.zone_reclaim_mode
Setting method:
- echo 0 > /proc/sys/vm/zone_reclaim_mode, or
- sysctl -w vm.zone_reclaim_mode=0, or
- Edit the /etc/sysctl.conf file and add vm.zone_reclaim_mode=0
This parameter is related to the Non-Uniform Memory Access (NUMA) feature.
NUMA gives each CPU its own dedicated memory area, as shown in the figure.
When a CPU accesses the memory directly attached to it, the response time is short (local access). But if it needs data in memory attached to another CPU, the access goes through the interconnect channel and the response time is slower (remote access).
NUMA can cause unbalanced memory usage across CPUs: the local memory of some CPUs runs short and is reclaimed frequently, which can trigger heavy swapping and severe jitter in system response latency, while local memory on other CPUs sits idle. This leads to a strange phenomenon: running the free command shows the system still has free physical memory, yet it keeps swapping and the performance of some applications drops sharply. Therefore, NUMA's memory-reclaim policy, vm.zone_reclaim_mode, must be adjusted.
This parameter is a bitmask. 0 means that when local memory is insufficient, memory can be allocated from other NUMA nodes instead of reclaiming locally; bit 1 enables local (zone) reclaim; bit 2 allows reclaim by writing out dirty pages; bit 4 allows reclaim of anonymous memory via swap. Setting it to 0 therefore reduces the probability of swapping.
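To see whether a machine actually has multiple NUMA nodes and how its memory is spread across them, a few inspection commands (the first two assume the numactl package is installed, which is not guaranteed on a minimal CentOS 7 install):

```shell
# Show NUMA nodes, the CPUs attached to each, and per-node free memory
numactl --hardware

# Per-node allocation statistics; a steadily growing numa_miss
# counter is a sign of the imbalance described above
numastat

# Current reclaim policy; 0 means "borrow from other nodes
# instead of reclaiming locally"
cat /proc/sys/vm/zone_reclaim_mode
```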
Tuned services
Starting with RHEL/CentOS 6.3, a new tuning tool, tuned, ships with a large number of tuning profiles, enabling different tuning policies for different workloads. Its purpose is to simplify tuning and maximize system resource and energy efficiency. The principle is that each profile corresponds to a set of sysctl parameters, and switching profiles changes sysctl to suit different services. For CDH clusters, however, the tuning above is already done and the workload is relatively fixed, so there is no need for the tuned service to complicate matters. The steps to disable it are as follows:
- tuned-adm off
- Execute tuned-adm list and confirm it returns "No current active profile"
- systemctl stop tuned
- systemctl disable tuned
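Taken together, the sysctl portion of the tuning in this article can be collected into one persistent fragment (a sketch: it blindly appends to /etc/sysctl.conf, so check for existing entries first, or place the lines in a separate file under /etc/sysctl.d/ instead):

```shell
# Persist the three sysctl settings from this article (requires root)
cat <<'EOF' >> /etc/sysctl.conf
# Big-data cluster tuning: minimize swapping and NUMA-local reclaim
vm.swappiness=1
vm.min_free_kbytes=4194304
vm.zone_reclaim_mode=0
EOF

# Apply without rebooting
sysctl -p
```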