I. Background introduction
Over the past year I have optimized the performance of both our ELK log system and the ES storage used by SkyWalking. This article summarizes that work, focusing on how to optimize ES as the log store in an ELK architecture.
ELK architecture as log storage solution
II. Status analysis
1. Version and hardware configuration
- JDK: jdk1.8_171-b11 (64-bit)
- ES cluster: deployed on three 16-core, 32 GB VMs, with 20 GB of heap memory allocated to each node
- ELK version: 6.3.0
- Garbage collector: ES default, CMS for the old generation + ParNew for the new generation
- OS: CentOS Linux release 7.4.1708 (Core)
2. Performance problems
As more and more applications were onboarded to ELK, about 230 new indexes and roughly 30 to 50 million documents were being added every day.
Log uploads peak every morning and afternoon. Checking the logs in Kibana revealed two problems:
(1) Logs arrive with a 5 to 40 minute delay
(2) Many logs are lost and cannot be found
3. Problem analysis
3.1 Log Delay
First, be clear about when written data becomes searchable.
Data is first stored in the ES in-memory buffer and is written to the OS cache by the refresh operation; only then can it be searched.
So the log delay is likely because our data is still sitting in the buffer and has not yet made it to the OS cache.
3.2 Log Loss
The ES logs show many write rejections.
They show that the ES write thread pool is full: the number of threads executing tasks has reached the maximum of 16, and the queue with a capacity of 200 has no room for new tasks.
A look at the thread pool also shows that the write thread pool has a backlog of write tasks:

```
GET /_cat/thread_pool?v&h=host,name,type,size,active,largest,rejected,completed,queue,queue_size
```
So we need to optimize the write performance of ES.
4. Solution
4.1 Analysis Scenario
The optimization of ES can be divided into many aspects. We should consider the requirements of ES according to the application scenarios.
According to my practical experience, I will list the characteristics of three different scenarios:
- SkyWalking: ES is used as a data store for trace data, metric data, and similar information.
- ELK: used to store system logs so that application problems can be analyzed, searched, and located.
- Full-text search services: ES is commonly used as a full-text search engine in business systems. For example, in a food delivery application, ES stores merchant and dish data, and users search for merchants and dishes from the client by keywords, location, and other conditions.
The characteristics of these three scenarios are as follows:
| | SkyWalking | ELK | Full-text search business |
|---|---|---|---|
| Write concurrency | High | High | Generally not high |
| Read concurrency | Low | Low | High |
| Real-time requirement | Within 5 minutes | Within 30 seconds | Within 1 minute |
| Data integrity | Losing a small amount of data is tolerable | Losing a small amount of data is tolerable | Try not to lose any data |
About the real-time requirement:
- In practice SkyWalking is not consulted very often. It is usually used to look at historical trace or metric data after a problem has been found in an application, so a delay of several minutes is acceptable.
- ELK is often used to locate application problems during development and testing. If logs cannot be queried promptly, the delay wastes a lot of time and greatly reduces efficiency. When logs are being checked to locate a production problem, it is even more urgent.
- For full-text search, it is generally acceptable for new data to appear within a minute, for example a newly listed product becoming searchable a minute later, but the closer to real time (a few seconds) the better.
4.2 Direction of optimization
Tuning can be done in three ways: JVM performance tuning, ES performance tuning, and controlling data sources
III. ES performance optimization
1. JVM tuning
The first step is JVM tuning.
Because ES runs on the JVM, improperly set JVM parameters waste resources and can even cause ES to crash with OOM errors.
1.1 Monitor JVM running status
(1) View the GC logs. Problem found: both Young GC and Full GC are very frequent; Young GC in particular fires very often and its cumulative time is large.
(2) Use jstat to sample GC statistics once per second (a sample command follows the field list below). The fields mean:
- S0: usage ratio of Survivor space 0
- S1: usage ratio of Survivor space 1
- E: usage ratio of the Eden space
- O: usage ratio of the old generation
- M: usage ratio of the Metaspace
- CCS: usage ratio of the compressed class space
- YGC: number of young-generation garbage collections
- YGCT: time spent in young-generation garbage collections
- FGC: number of old-generation (full) garbage collections
- FGCT: time spent in full garbage collections
- GCT: total time spent in garbage collection
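For reference, a minimal example of such a jstat invocation is shown below; the PID 12345 is a placeholder for the real ES process ID.

```bash
# Print GC utilization statistics for the ES process once per second
# (12345 is a placeholder PID; find the real one with `jps` or `ps -ef | grep elasticsearch`)
jstat -gcutil 12345 1000
```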
Problem: the jstat output also shows that Eden grows very quickly and soon fills up.
1.2 Locating the cause of frequent Young GC
1.2.1 Check whether the new generation is too small
Check the sizes of the new and old generations in any of the following ways:
(1) Use `jstat -gc <pid>` to view the Eden space and old-generation sizes.
(2) Use `jmap -heap <pid>` to view the Eden space and old-generation sizes.
(3) Read the GC details in the GC logs. There, 996800K is the usable space of the new generation, i.e. the Eden space plus one Survivor space, so the total new-generation memory is 996800K / 0.9, about 1081 MB.
By all of the above methods, the new generation totals about 1081 MB, roughly 1 GB, while the old generation totals 19864000K, about 19 GB. The ratio of new generation to old generation is therefore about 1:19, which is not what we expected.
1.2.2 Why isn't it the JDK's default 1:2 ratio?
This is an easy pitfall. If the new generation size is not explicitly set, the JVM tunes it automatically when the CMS collector is used. The calculated size may have nothing to do with the default NewRatio and instead depends on the ParallelGCThreads setting.
(See the reference "How big is the CMS GC new generation by default?" at the end of this article.)
It is best to configure -XX:NewSize and -XX:MaxNewSize (or -Xmn) explicitly to avoid surprising GC behavior.
1.3 Impact of the above phenomenon
The new generation is too small and the old generation is too large.
- If the new generation is too small: (1) The Eden space fills up quickly and triggers Young GC. Young GC causes an STW (Stop The World) pause: all worker threads stop and only the GC threads run, so ES pauses briefly. Frequent Young GCs add up and have a large impact on system performance. (2) Most objects are promoted to the old generation quickly, so the old generation fills up easily and triggers Full GC.
- If the old generation is too large: the STW pause of a Full GC is longer than that of a Young GC, ranging from tens of milliseconds to several seconds, which hurts ES performance. Clients calling ES may then get no response within their timeout, causing timeout exceptions and failed requests.
1.4 JVM optimization
1.4.1 Configure the heap memory size
Allocating 20 GB of heap on a 32 GB machine is not appropriate, so we adjusted it to 50% of total memory, i.e. 16 GB. Modify Elasticsearch's jvm.options:
```
-Xms16g
-Xmx16g
```
Setting requirements:
- Xms and Xmx should be the same size. If -Xms and -Xmx are set to different values, only the -Xms amount is reserved at startup; when the heap runs short it must be grown, which forces a GC and its STW pause, and when there is a lot of free space the heap may be shrunk again, only to be expanded later, and so on. These resize cycles hurt system performance. The Metaspace area has a similar problem.
- Keep the heap below 32 GB; above that the JVM disables compressed object pointers (compressed oops), which wastes memory.
- Xmx and Xms should not exceed 50% of physical RAM (see "Recommendations for official heap memory settings" at the end of this article). Elasticsearch requires memory for purposes other than the JVM heap, and it is important to leave space for this: it uses off-heap buffers for efficient network communication, relies on the operating system's file system cache for efficient file access, and the JVM itself needs some memory.
A quick way to verify the effective heap size after a restart is shown below.
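As a quick sanity check after restarting the nodes, the effective heap size can be read back through the _cat/nodes API; the column names below follow the _cat documentation.

```bash
# Show each node's configured maximum heap and current heap usage
curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.max,heap.current,heap.percent'
```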
1.4.2 Configure the new generation size
Because the new generation size was not specified, the JVM's automatic tuning allocated only about 1 GB to the new generation.
Add the following to Elasticsearch's jvm.options file:
```
-XX:NewSize=8G
-XX:MaxNewSize=8G
```
The old generation is then automatically allocated the remaining 16 GB - 8 GB = 8 GB, giving a new-to-old ratio of 1:1. After the change, Young GC fires less often and only a small amount of data is promoted to the old generation after each GC.
1.4.3 Use the G1 garbage collector (not practiced)
The G1 collector lets you set a target for how much garbage collection may impact the system. It splits the heap into a large number of small regions, tracks how much collectable garbage each region holds and estimates how long collecting it would take, and then, during collection, tries to keep the impact within the specified target while reclaiming as many garbage objects as possible in that time. G1 generally performs better with large heaps.
ES defaults to CMS for the old generation plus ParNew for the new generation; on JDK 9 it defaults to the G1 collector instead.
Because we are using JDK 1.8, we did not switch collectors. If further performance problems appear, we may switch to G1 and test whether it performs better; a sketch of the change is shown below.
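We did not actually make this switch, but as a rough sketch (not applied, and the 200 ms pause target is an assumption to tune rather than a measured value), the jvm.options change would look something like this:

```
## Sketch only: remove (or comment out) the default CMS/ParNew flags in jvm.options
## -XX:+UseConcMarkSweepGC
## -XX:CMSInitiatingOccupancyFraction=75
## -XX:+UseCMSInitiatingOccupancyOnly
## and enable G1 instead:
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
```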
1.5 Optimization effect
1.5.1 Slower growth of new-generation memory usage
Before optimization
GC data is sampled once per second. As you can see, the young generation grows quickly and fills up within a few seconds, so Young GC fires very frequently, every few seconds. Each Young GC has a high probability of promoting surviving objects to the old generation; when many objects are still live (see the old-generation data in the first red box in the figure above), the amount promoted is (51.44 - 51.08)/100 * 19000 MB ≈ 68 MB per collection. With that much promoted on every collection, frequent Young GC defeats the purpose of generational collection: the old generation effectively takes over the new generation's job of holding recently created objects. When the old generation fills up and Full GC is triggered, a large number of those objects are recent and still alive, so each Full GC reclaims little. This creates a vicious cycle: the old generation fills up quickly, Full GC runs, little is reclaimed, the remaining space fills up again soon, and Full GC becomes frequent.
After optimization
GC data is sampled once per second. The new generation now grows much more slowly and takes at least 60 seconds to fill up. As shown in the red box above, the amount promoted to the old generation is about (15.68 - 15.60)/100 * 10000 MB ≈ 8 MB per collection, which is very small, so it takes a long time to trigger a Full GC. Moreover, by the time a Full GC does run, many objects in the old generation have been there a long time and are no longer referenced, so a large part of the old generation is reclaimed, leaving a relatively clean space that can keep absorbing objects.
1.5.2 Lower GC frequency for both the new and old generations
The data below covers 14 hours of ES running after startup.
Before optimization
Each individual Young GC is short: from the monitoring data above, 1467.995 s / 27276 collections ≈ 0.05 s per Young GC. But how much time overall is actually spent on Young GC?
The calculation: 1467 s / (60 s × 60 min × 14 h) ≈ 0.029, i.e. roughly 2.9 seconds of every 100 seconds are spent paused in Young GC, which is still a significant performance drain. As for frequency: 27276 collections / (60 s × 60 min × 14 h) ≈ 0.54 collections per second, i.e. roughly one Young GC every 1.9 seconds. Young GC is clearly very frequent.
After optimization
The number of Young GCs is only about one tenth of the count before the change, and total Young GC time is about one eighth. Full GCs dropped to about one eighth as many, taking about one quarter of the time.
The impact of GC on the system has been greatly reduced and performance has been greatly improved.
2. ES tuning
We analyzed the characteristics of ES as a log store above: high write concurrency, low read concurrency, an acceptable delay of up to 30 seconds, and tolerance for losing a small amount of log data. Let's tune ES for these characteristics.
2.1 Optimize ES index settings
2.1.1 How ES writes data under the hood
refresh
ES receives write requests and first stores the data in its in-memory buffer. By default the buffer is written to the OS cache once per second; this process is called refresh. Once data reaches the OS cache it becomes searchable (which is why ES is called near real time: with the default 1 s refresh interval, written data can be searched about one second later).
translog
Every 5 seconds, or when a write request (index, delete, update, bulk, etc.) completes, the translog is fsynced from the cache to disk. This operation is relatively expensive; if strong data durability is not required, it is advisable to switch the index to asynchronous translog flushing.
flush
By default, every 30 minutes ES flushes the data in the OS cache to disk and clears the translog file.
merge
An ES index consists of multiple shards, and a shard is actually a Lucene index made up of multiple segments; Lucene continually merges small segments into larger ones. This process is called segment merge (see the link at the end of this article). ES merges small segments in the background to optimize query performance, but merging consumes a lot of disk I/O, which can itself affect query performance. A way to inspect the segments of an index is shown below.
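One way to observe how many segments an index currently has (and therefore how much merging is going on) is the _cat/segments API; the index name below is just the example name used later in this article.

```bash
# List the segments of one index: one row per segment with doc count and size
curl -XGET 'http://localhost:9200/_cat/segments/user-service-prod-2020.12.12?v&h=index,shard,segment,docs.count,size'
```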
2.1.2 Optimization directions
2.1.2.1 Optimize translog fsync
To ensure data is not lost, the translog is persisted as follows:
Since Elasticsearch 2.0, an fsync that flushes the translog to disk is triggered whenever a write request (index, delete, update, bulk, etc.) completes.
Alternatively, with asynchronous durability, the translog is fsynced to disk every 5 s by default.
Per-request fsync improves data safety at some cost to performance.
Frequent fsync operations can block and make some operations take a long time. If losing a small amount of data is acceptable, set the translog to flush asynchronously to improve efficiency, and relax the flush thresholds as follows:
"index.translog.durability": "async",
"index.translog.flush_threshold_size":"1024mb",
"index.translog.sync_interval": "120s"
Copy the code
2.1.2.2 Optimize refresh
Data written to Lucene is not searchable immediately; ES must first turn the in-memory data into a complete Lucene segment via refresh before it can be searched.
With the default of 1 second, written data becomes queryable quickly, but a large number of small segments are generated and search performance suffers, so a longer interval reduces system overhead. For log search the real-time requirement is not strict, so 5 or 10 seconds is enough; for SkyWalking the requirement is even looser, so 30 s works.
The setting is as follows:
"index.refresh_interval":"5s"
Copy the code
2.1.2.3 Optimize merge
index.merge.scheduler.max_thread_count controls the number of concurrent merge threads. If the storage is an SSD with good concurrent performance, use the default value max(1, min(4, availableProcessors / 2)), although on nodes with many CPU cores merging may consume enough resources to affect cluster performance. If ordinary spinning disks are used, concurrent merges cause disk I/O congestion, so set max_thread_count to 1.
The setting is as follows:
"index.merge.scheduler.max_thread_count":"1"
Copy the code
2.1.3 Apply the settings
2.1.3.1 Update settings on existing indexes

```bash
# close indexes
curl -XPOST 'http://localhost:9200/_all/_close'

# update settings
curl -XPUT -H "Content-Type: application/json" \
  'http://localhost:9200/_all/_settings?preserve_existing=true' -d '{
    "index.merge.scheduler.max_thread_count": "1",
    "index.refresh_interval": "10s",
    "index.translog.durability": "async",
    "index.translog.flush_threshold_size": "1024mb",
    "index.translog.sync_interval": "120s"
  }'

# open indexes
curl -XPOST 'http://localhost:9200/_all/_open'
```
This modifies indexes that already exist, but it does not take effect for indexes created afterwards. For those, create an ES index template so that new indexes are created from it.
2.1.3.2 Create an index template

```
PUT _template/business_log
{
  "index_patterns": ["*202*.*.*"],
  "settings": {
    "index.merge.scheduler.max_thread_count": "1",
    "index.refresh_interval": "5s",
    "index.translog.durability": "async",
    "index.translog.flush_threshold_size": "1024mb",
    "index.translog.sync_interval": "120s"
  }
}

# query whether the template was created successfully
GET _template/business_log
```

Our index names look like user-service-prod-2020.12.12, so the wildcard pattern `*202*.*.*` matches the service log indexes that will be created.
2.2 Optimize thread pool configuration
As mentioned earlier, the write thread pool was full, causing tasks to be rejected and some data to fail to be written.
After the above optimization, the number of rejection cases is much less, but there are still cases of task rejection.
So we also need to optimize the write thread pool.
To see how the ES thread pool is running, elasticsearch_exporter was installed to collect ES metrics into Prometheus, and the data is viewed through Grafana. The write thread pool can be seen in the Prometheus monitoring.
After all of the optimizations above, the amount of rejected data is much smaller, but rejections still occur, as shown in the figure below:
How to set the write thread pool:
See the Elasticsearch thread pool documentation (linked at the end of this article):
> write
> For single-document index/delete/update and bulk requests. Thread pool type is `fixed` with a size of `# of available processors`, queue_size of `200`. The maximum size for this pool is `1 + # of available processors`.
The write thread pool is of type fixed, so its core and maximum thread counts are the same. By default its size equals the number of CPU cores, and the largest size that can be configured is 1 + the number of cores, i.e. 17 on our 16-core machines.
Optimization plan:
- Change the number of threads to 17, i.e. the total number of CPU cores plus 1
- Increase the queue capacity. The queue's role here is peak shaving: a larger queue does not increase processing speed by itself, it only buffers bursts. It also should not be too large, otherwise a backlog of queued tasks will occupy too much heap memory.
Add the following to config/elasticsearch.yml:

```yaml
thread_pool:
  write:
    # defaults to the number of CPU cores, i.e. 16
    size: 17
    queue_size: 10000
```
Effect after optimization: as shown in the monitoring, there are no more rejections, which solves the log loss problem.
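To keep an eye on the write pool after the change, the same _cat API used earlier can be narrowed to the write pool; rejected should stay at 0 once the new size and queue take effect.

```bash
# Check the write thread pool configuration and rejection count per node
curl -XGET 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,size,queue_size,active,queue,rejected'
```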
2.3 Lock memory to prevent the JVM from using Swap
What is the swap partition?
When the system runs out of physical memory, it frees some physical memory for the programs that are currently running; the freed pages often belong to programs that have been idle for a long time. The freed data is temporarily saved to swap, and when those programs run again the saved data is restored from swap back into memory. In this way the system always swaps when physical memory is insufficient.
Elasticsearch recommends disabling swap (see "Swap memory is disabled for ElasticSearch" at the end of this article), because swapping is bad for node performance and stability: it can cause garbage collections to last minutes instead of milliseconds, and can make nodes respond slowly or even disconnect from the cluster.
There are three ways to keep ES from using swap:
2.3.1 Disable swap in Linux (temporary)
Execute the command
```bash
sudo swapoff -a
```

This disables swap immediately, but the change is lost after the operating system restarts.
2.3.2 Minimize the use of Swap in Linux (permanent)
Execute the following commands:

```bash
echo "vm.swappiness = 1" >> /etc/sysctl.conf
# reload sysctl settings so the change takes effect without a reboot
sysctl -p
```
With swappiness set to 1, swap is not used under normal conditions, only in emergencies.
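To confirm the value that is currently in effect:

```bash
# Should print "vm.swappiness = 1" after the change is applied
sysctl vm.swappiness
```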
2.3.3 Enable bootstrap.memory_lock
Add the following to config/elasticsearch.yml:

```yaml
bootstrap.memory_lock: true
```
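After restarting the nodes it is worth verifying that memory locking actually took effect; per the Elasticsearch documentation, mlockall should be true for every node (if it is false, the memlock ulimit usually still needs to be raised).

```bash
# mlockall should be true on every node if bootstrap.memory_lock took effect
curl -XGET 'http://localhost:9200/_nodes?filter_path=**.mlockall'
```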
2.4 Reduce the number of shards and replicas
shard
Index size depends on shards and segments. If shards are too small, segments tend to be small too, which increases overhead; a large number of shards can also lead to frequent merges, generating a lot of I/O and hurting write performance.
Since our index size was under 15GB and the default was 5 shards, there was no need for this many, so we changed it to 3.
"index.number_of_shards": "3"
Copy the code
The shard setting can also be configured in the index template.
replicas
Reduce the number of replica shards. Too many replicas cause write amplification inside ES. The default replica count is 1: if the node holding an index's primary shard goes down, another machine holding the replica still has a backup and the index remains usable. But writing to replicas costs write performance, so for log data one replica is enough, and for indexes with very large data volumes the replica count can even be set to 0 to minimize the impact on write performance.

```
"index.number_of_replicas": "1"
```

The replica setting can also be configured in the index template, as sketched below.
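As a sketch, the shard and replica counts can be added to the business_log template defined earlier; note that PUT _template replaces the whole template, so keep all of the settings together in one request.

```
PUT _template/business_log
{
  "index_patterns": ["*202*.*.*"],
  "settings": {
    "index.number_of_shards": "3",
    "index.number_of_replicas": "1",
    "index.merge.scheduler.max_thread_count": "1",
    "index.refresh_interval": "5s",
    "index.translog.durability": "async",
    "index.translog.flush_threshold_size": "1024mb",
    "index.translog.sync_interval": "120s"
  }
}
```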
3. Control your data sources
3.1 Make applications log according to the specification
Some applications generate 10 GB of logs a day, while a typical application generates only a few hundred MB to 1 GB. Generating 10 GB a day comes from improper logging: large volumes of logs that do not need to be shipped to the log server at all, such as full responses from list-query interfaces with large result sets, report data, and debug-level logs. This hurts both log storage performance and application performance.
4. Effect of ES performance optimization
In the two weeks after optimization, ELK performed well and no problems appeared in use:
- ES data is no longer lost
- Data delay is within 10 seconds, and data is usually searchable within 5 seconds
- The load on each ES node is stable, and CPU and memory usage are not too high, as shown in the figure below
5. Reference documents
- A summary of an ElasticSearch optimization
- ElasticSearch data write process and optimization
- Performance optimization of a ten-billion-level real-time computing system: the ElasticSearch part
- How big is the CMS GC new generation by default?
- Recommendations for official heap memory settings
- ElasticSearch thread pool
- Swap memory is disabled for ElasticSearch
- Segment merge
- stackoverflow.com/questions/1…