Everything happens for a reason

Hello, I need to merge the segments of a read-only index and have a few questions:

  • 1. How many segments should I merge down to: max_num_segments=1?
  • 2. When merging, if I pass

POST /my_index/_forcemerge?max_num_segments=1

will it eat up all the machine's resources and make the service temporarily unavailable? (I have heard that max_num_segments=1 eats up all resources, but I can't find anything about _forcemerge exhausting resources in the official documentation.)

  • 3. Do the index.merge parameters need any special attention or tuning in ES 6.7 and above?

(I'm using all the defaults so far.)

(Question from the “Dead hit Elasticsearch” Knowledge Planet community: t.cn/RmwM3N9)

These are basic concepts. To keep the statements accurate, links to the official documentation and other sources are given below for anyone who wants a deeper understanding of segment merging.

1. What is a segment?

Photo credit: medium.com/@hansrajcho…

As you can see from the top down,

  • A cluster contains one or more nodes;
  • A node contains one or more indexes;
  • An index is similar to a database in MySQL;
  • Each index in turn consists of one or more shards;
  • Each shard is a Lucene index instance, which you can think of as a separate search engine that indexes a subset of data in the Elasticsearch cluster and handles related queries;
  • Each shard contains multiple segments, each of which is an inverted index.

At query time, every segment in the shard is searched, and the combined results are returned as the shard's final query result.
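A toy model may make this concrete. The sketch below is not Lucene code: it assumes a segment is simply a dict from term to a set of document IDs, and `search_segment` and `search_shard` are hypothetical helpers that query each segment in turn and union the hits.

```python
from typing import Dict, List, Set

# Toy model (assumption): a segment is an inverted index, term -> set of doc IDs.
Segment = Dict[str, Set[int]]


def search_segment(segment: Segment, term: str) -> Set[int]:
    """Look up one term in a single segment."""
    return segment.get(term, set())


def search_shard(segments: List[Segment], term: str) -> Set[int]:
    """Query every segment in turn and combine the hits into the shard result."""
    hits: Set[int] = set()
    for segment in segments:
        hits |= search_segment(segment, term)
    return hits


shard = [
    {"merge": {1, 2}, "segment": {2}},
    {"merge": {7}},
    {"shard": {9}},
]
print(sorted(search_shard(shard, "merge")))  # [1, 2, 7]
```

Every extra segment adds one more lookup to the loop, which is why the segment count matters for search speed.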

2. Why are segments immutable?

In Lucene, a segment-based storage architecture is used to achieve high indexing speed.

A batch of written data is stored in a segment, and each segment is stored as its own files on disk.

Because modifying files in place between writes is very expensive, a segment is made immutable once it is written, so that all subsequent writes go to a new segment.

3. What is segment merge?

The automatic refresh process creates a new segment every second by default (the interval is set by the dynamic configuration parameter refresh_interval), so the number of segments can explode in a short time.
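How quickly segments pile up can be sketched with a toy model. The `ToyShard` class and `simulate` function below are illustrative only and assume one document is written per second; they are not how Elasticsearch is implemented internally.

```python
from typing import List, Tuple


class ToyShard:
    """Toy model: writes are buffered, and each refresh freezes the buffer
    into a new immutable segment (represented here as a tuple)."""

    def __init__(self) -> None:
        self.buffer: List[str] = []
        self.segments: List[Tuple[str, ...]] = []

    def index(self, doc: str) -> None:
        self.buffer.append(doc)

    def refresh(self) -> None:
        # An empty buffer produces no new segment.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []


def simulate(seconds: int, refresh_every: int) -> int:
    """Write one doc per second and return the resulting segment count."""
    shard = ToyShard()
    for second in range(1, seconds + 1):
        shard.index(f"doc-{second}")
        if second % refresh_every == 0:
            shard.refresh()
    return len(shard.segments)


print(simulate(60, 1))   # 60 segments after one minute with a 1s refresh
print(simulate(60, 30))  # 2 segments after one minute with a 30s refresh
```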

Too many segments can cause major problems.

  • Consuming resources: Each segment consumes file handles, memory, and CPU cycles;
  • Slower searches: each search request must check every segment in turn, so the more segments there are, the slower the search.

Elasticsearch solves this problem by merging segments in the background.

Smaller segments are merged into larger segments, which are in turn merged into even larger segments.

4. What does segment merge do?

Segment merging purges old deleted documents from the file system.

Deleted documents (or older versions of updated documents) are not copied into the new, larger segment.

You don't have to do anything to start a segment merge. It happens automatically while you index and search.

  • When indexing, the refresh operation creates a new segment and opens the segment for search.
  • The merge process selects a small number of similar-sized segments and merges them into a larger segment behind the scenes. This does not break indexing and searching.

5. Why segment merge?

  • The more index segments there are, the lower the search performance and the higher the memory consumption.
  • Index segments are immutable, so you cannot physically remove information from them.

A document cannot be physically deleted from a segment; it is only marked as deleted.

  • When segments are merged, the documents marked for deletion are not copied into the new index segment, thus reducing the number of documents in the final index segment.
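The delete-marking behaviour can be sketched as follows. This is a toy model, not Lucene's implementation: `ToySegment` and `merge` are hypothetical names, and a segment is assumed to be a dict of documents plus a set of IDs marked as deleted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToySegment:
    docs: Dict[int, str]                             # doc ID -> content, never modified
    deleted: Set[int] = field(default_factory=set)   # IDs only *marked* as deleted


def merge(segments: List[ToySegment]) -> ToySegment:
    """Copy only the live documents into one new segment."""
    live: Dict[int, str] = {}
    for seg in segments:
        for doc_id, body in seg.docs.items():
            if doc_id not in seg.deleted:
                live[doc_id] = body
    return ToySegment(docs=live)


a = ToySegment({1: "x", 2: "y"}, deleted={2})
b = ToySegment({3: "z"})
merged = merge([a, b])
print(sorted(merged.docs))  # [1, 3]
```

Document 2 is still physically present in segment `a` after the merge; it only disappears from disk once the old segment itself is dropped.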

6. What are the benefits of segment merging?

  • Reduce the number of index segments and improve the speed of retrieval;
  • Reduce index size (number of documents)

Reason: Segment merging removes documents that are marked as deleted.

7. Possible problems caused by segment merging?

  • The cost of disk I/O operations
  • Segment merging can significantly affect performance in slow systems.

8. On the number of segments to merge down to (usually 1): for question 1

The documentation for earlier versions is as follows:

Optimize API (now deprecated, same principle)

The Optimize API can be thought of as a forced merge API. It forces a shard to be merged down to the number of segments specified by the max_num_segments parameter.

The intention is to reduce the number of segments (usually down to one) to improve search performance.
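For the current API, the equivalent call is the _forcemerge request from the question. Below is a minimal sketch that only builds that request with the Python standard library; the host `http://localhost:9200` and the helper name `build_forcemerge_request` are assumptions for illustration, and nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_forcemerge_request(base_url: str, index: str, max_num_segments: int = 1) -> Request:
    """Build (but do not send) POST /<index>/_forcemerge?max_num_segments=N."""
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{index}/_forcemerge?{query}"
    return Request(url, method="POST")


req = build_forcemerge_request("http://localhost:9200", "my_index")
print(req.get_method(), req.full_url)
# POST http://localhost:9200/my_index/_forcemerge?max_num_segments=1
```

Passing `req` to `urllib.request.urlopen` would actually trigger the merge, so only do that against a cluster where the I/O cost is acceptable.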

9. On segment merge resource consumption (for question 2)

The official explanation of resource consumption:

Force merge should only be called against an index after you have finished writing to it. Force merge can cause very large (>5GB) segments to be produced, and if you continue to write to such an index then the automatic merge policy will never consider these segments for future merges until they mostly consist of deleted documents. This can cause very large segments to remain in the index which can result in increased disk usage and worse search performance.

In short: it consumes disk I/O and can hurt search performance.

Force merge API

www.elastic.co/guide/en/el…

The following explanation comes from an older version of the documentation. The API it describes is outdated, but the principle is the same, so it is still a useful reference.

Segment merging

www.elastic.co/guide/cn/el…

Note that segment merges triggered through the Optimize API are not throttled at all.

They can consume all the I/O resources on your nodes, leaving nothing to spare for search requests and potentially making the cluster unresponsive.

If you plan to optimize an index, first use shard allocation (see “Migrating old indexes”) to move the index to a node where it is safe to run, and then execute it.

So yes, it is very resource-intensive, and it is best run outside busy business hours.

In my production environment, all segment merges run at 1 a.m. (driven by a script; nobody is using the system at night).
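A script-controlled off-peak merge could be guarded like this. It is only a sketch: the 1 a.m. start comes from my setup, while the 5 a.m. end of the window and the names `in_window` and `maybe_forcemerge` are assumptions for illustration. The actual merge call is injected as a function so the guard can be tried without a cluster.

```python
from datetime import datetime, time
from typing import Callable


def in_window(now: datetime, start: time = time(1, 0), end: time = time(5, 0)) -> bool:
    """True when `now` falls inside the off-peak window [start, end)."""
    return start <= now.time() < end


def maybe_forcemerge(now: datetime, run_merge: Callable[[], None]) -> bool:
    """Call run_merge only during the off-peak window; report whether it ran."""
    if not in_window(now):
        return False
    run_merge()
    return True


calls = []
ran_night = maybe_forcemerge(datetime(2024, 1, 1, 1, 30), lambda: calls.append("merge"))
ran_noon = maybe_forcemerge(datetime(2024, 1, 1, 12, 0), lambda: calls.append("merge"))
print(ran_night, ran_noon, calls)  # True False ['merge']
```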

10. Recommended parameters (for question 3)

  • refresh_interval: the default is 1s. To reduce how often new segments are generated, change the value to 30s if your timeliness requirements are not high.
  • index.merge.scheduler.max_thread_count: set it according to the number of CPU cores.

Recommendation:

www.elastic.co/guide/en/el…
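As a rough sketch of how these two settings could be prepared: the formula in `default_merge_threads` is, as I understand the official documentation, the default for max_thread_count, and `build_settings_body` is a hypothetical helper that only builds the JSON body for PUT /my_index/_settings without sending it.

```python
import json


def default_merge_threads(cpu_cores: int) -> int:
    """max(1, min(4, cores / 2)) with integer division, as I read the documented default."""
    return max(1, min(4, cpu_cores // 2))


def build_settings_body(refresh_interval: str, merge_threads: int) -> str:
    """JSON body for PUT /<index>/_settings (built here, not sent)."""
    return json.dumps({
        "index.refresh_interval": refresh_interval,
        "index.merge.scheduler.max_thread_count": merge_threads,
    }, indent=2)


for cores in (1, 2, 8, 32):
    print(cores, default_merge_threads(cores))  # 1 1, 2 1, 8 4, 32 4

print(build_settings_body("30s", default_merge_threads(8)))
```

Check the linked documentation for your own version before applying the values, since defaults and recommendations differ between releases.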

  • The parameter advice in the old-version documentation is of limited reference value, but it is still worth reading:

Index performance tips

www.elastic.co/guide/cn/el…

References

1. medium.com/@hansrajcho…

2. stackoverflow.com/questions/3…

3. www.elastic.co/cn/blog/fou…
