Envy not the mandarin ducks, envy not the immortals; a single line of code can take half a day to tune. Original: Taste of Little Sister (WeChat official account ID: XjjDog). Feel free to share; please keep the attribution.
ES is very widely used, especially in the ELKB stack; almost every logging system of any size is built on it.
Logs are written often but read rarely, so write speed matters most. One of our clusters, for example, held 100 TB of logs in a single cluster, with 100,000 log entries written per second.
An ES write is not a simple sequential write. To build the inverted index and guarantee data reliability and near-real-time visibility, it performs many costly merges and extra operations behind the scenes, so the pressure on disk I/O and CPU is heavy.
Using iotop, you can see the ES process consuming nearly 200 MB/s of SSD I/O bandwidth. Using top, you can see the 8-core CPU almost maxed out in user space. ES is very resource-hungry.
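For reference, a rough sketch of the commands behind those observations (the pgrep pattern is an assumption; adjust it to your own deployment):

```bash
# Show only processes that are actually doing I/O, with accumulated totals
iotop -o -a

# Watch per-thread CPU usage of the ES JVM; the process filter below is an assumption
top -H -p "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)"
```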
Since the system's bottlenecks are this easy to find, it is worth optimizing against them and squeezing ES's write speed to its limit.
The ES version used in this optimization example is 7.9.2. ES versions move up quickly, and we are well past the 5.x era.
1. Which operations occupy resources
To optimize, you need to first know the ES write process and know which steps are the most time-consuming.
First, there is the replica problem. To guarantee even minimal high availability, the replica count has to stay at 1; that cost cannot be avoided. Setting the replica count to 0 is therefore only appropriate during the initial bulk import of data, as shown in the sketch below.
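A minimal sketch of that pattern (the index name my-logs is purely illustrative):

```bash
# Before the initial bulk import: no replicas for the (hypothetical) index
curl -H "Content-Type: application/json" -XPUT 'http://localhost:9200/my-logs/_settings' \
  -d '{ "index.number_of_replicas": 0 }'

# After the import: restore the single replica required for minimal high availability
curl -H "Content-Type: application/json" -XPUT 'http://localhost:9200/my-logs/_settings' \
  -d '{ "index.number_of_replicas": 1 }'
```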
As the figure above shows, a piece of data goes through several steps before it finally lands on disk, and the process is even backed by a translog mechanism.
The underlying store of ES is Lucene, which holds a series of inverted indexes. Such an index is called a segment. Records are not written directly to a segment; they are written to a buffer first.
When the buffer is full, or its contents have waited long enough to hit the refresh interval (key point), the buffer's contents are written into a segment in one go.
This is why the refresh_interval setting has such a big impact on performance. If you don't need near-real-time visibility, make it larger.
By default the indexing buffer uses 10% of the heap, with a minimum of 48 MB, shared by all shards on the node. If you have many indexes and heavy writes, this memory footprint matters and can be increased appropriately.
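A minimal sketch of raising that buffer, assuming the config file lives at /etc/elasticsearch/elasticsearch.yml; this is a static node-level setting, so it only takes effect after a node restart:

```bash
# Raise the shared indexing buffer from the default 10% of heap (the path is an assumption)
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
indices.memory.index_buffer_size: 20%
EOF
```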
2. Start optimizing
Data writing operations include Flush, refresh, and merge. By adjusting their behavior, you can make a trade-off between performance and data reliability.
flush
As the introduction above shows, the translog records a full copy of the data, somewhat like the binlog in MySQL or the AOF in Redis, to keep data safe in case of an exception.
This is because after data is written, fsync must be called to actually flush it to disk; otherwise the data would be lost if the system lost power.
By default, ES flushes the translog on every request, but this is unnecessary for logging and can be made asynchronous with the following parameters:
curl -H "Content-Type: application/json" -XPUT 'http://localhost:9200/_all/_settings? preserve_existing=true' -d '{ "index.translog.durability" : "async", "index.translog.flush_threshold_size" : "512mb", "index.translog.sync_interval" : "60s" }'
This is arguably the most important optimization step, with the greatest impact on performance, although in extreme cases some data may be lost. For a logging system that is tolerable.
refresh
In addition to writing the translog, ES writes the data into a buffer. But watch out: the contents of the buffer are not searchable yet; they must first be written into a segment.
That is the refresh action, which defaults to 1 second. Data you write will, with high probability, become searchable within about one second.
Therefore, ES is not a real-time search system, but a near-realtime system.
The refresh interval can be modified with index.refresh_interval.
For a logging system it should of course be larger. Xjjdog sets it to 120s here, which reduces how often segments are generated and naturally speeds up writes.
curl -H "Content-Type: application/json" -XPUT 'http://localhost:9200/_all/_settings? preserve_existing=true' -d '{ "index.refresh_interval" : "120s" }'
merge
Merging is actually a Lucene mechanism. It combines small segments into larger ones to improve retrieval speed.
The reason is that the refresh process generates a large number of small segment files, and deletions also leave behind space fragments. So merging, in layman's terms, is like defragmentation. The vacuum process in PostgreSQL and other databases does a similar job.
Obviously, this tidying-up costs both I/O and CPU.
Interestingly, merge has three strategies.
- tiered, the default option, merges segments of similar size while respecting the maximum number of segments allowed per tier.
- log_byte_size uses the logarithm of the segment size in bytes, selecting several segments to merge into a new one.
- log_doc uses the number of documents in a segment as its unit of calculation, selecting several segments to merge into a new one.
Each of these policies has a very detailed, targeted configuration that I won’t belabor here.
Since there is no random deletion in the log system, we can keep the default.
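If you want to see what refresh and merge leave behind, the _cat/segments interface lists every segment per shard; a small sketch (the index name is illustrative):

```bash
# List segments, their sizes, and document counts for one (hypothetical) index
curl 'http://localhost:9200/_cat/segments/my-logs?v'
```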
3. Fine-tuning
Newer versions have streamlined thread pool configuration, so there is no longer any need to juggle a maze of search, bulk, and index pools. If needed, adjust thread_pool.search.size, thread_pool.write.size, thread_pool.listener.size, and thread_pool.analyze.size according to the data exposed by the _cat/thread_pool interface.
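Before touching any of those sizes, it helps to look at the live queue and rejection counters; a small sketch:

```bash
# Per-node, per-pool activity; growing rejections usually mean the pool or its queue is too small
curl 'http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'
```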
You can also configure multiple data disks to spread the I/O pressure, but beware that data hotspots may still concentrate on a single disk.
Lucene's index-building process is very CPU intensive, so reducing the number of inverted indexes reduces CPU consumption. The first optimization is to reduce the number of fields. The second is to reduce the number of indexed fields: set the index property of fields that never need to be searched to not_analyzed or no (in current versions, "index": false). As for _source and _all, they are not of much use in actual debugging, so they won't be covered again.
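A minimal mapping sketch of that second optimization (the index and field names are purely illustrative, and the 7.x "index": false switch replaces the older not_analyzed/no values):

```bash
# Create a (hypothetical) log index where raw_body is stored but never indexed
curl -H "Content-Type: application/json" -XPUT 'http://localhost:9200/my-logs' -d '{
  "mappings": {
    "properties": {
      "level":    { "type": "keyword" },
      "message":  { "type": "text" },
      "raw_body": { "type": "text", "index": false }
    }
  }
}'
```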
In addition, if logs are shipped through a component such as Filebeat or Logstash, batch mode is usually enabled. Batching improves performance, but the batch should not be too large; set it based on actual observation, and anywhere between 1,000 and 10,000 documents per batch is generally fine.
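The mechanism underneath is the _bulk interface; a tiny hand-rolled sketch (the index name and documents are made up), whereas real shippers batch hundreds to thousands of documents per request:

```bash
# Two documents in one bulk request; the body is newline-delimited JSON and must end with \n
curl -H "Content-Type: application/x-ndjson" -XPOST 'http://localhost:9200/my-logs/_bulk' \
  --data-binary $'{"index":{}}\n{"message":"log line 1","level":"INFO"}\n{"index":{}}\n{"message":"log line 2","level":"WARN"}\n'
```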
End
ES can be used in a wide variety of scenarios. Given its NoSQL nature, some even use it to replace traditional relational databases.
That is fine, but be aware of its latency. This article focuses on throughput-first log-writing scenarios, where data latency is especially noticeable. ES is not designed for this scenario by default, so the out-of-the-box configuration is inefficient.
We learned that an ES write goes through flush, refresh, and merge. The translog and merge actions have the greatest impact on I/O; index building and merging have the greatest impact on CPU. In your mapping design, minimize the number of fields as well as the number of indexed fields. That way both the CPU and I/O bottlenecks can be eased.
Xjjdog is a public account that keeps programmers from taking detours. It focuses on infrastructure and Linux. Ten years of architecture, tens of billions of daily requests, exploring the world of high concurrency with you and giving you a different taste.