Introduction | Tencent Cloud Elasticsearch is widely used for real-time log analysis, structured data analysis, and full-text search. In this article, presented as a series of scenarios, I will walk through the typical problems we ran into while working with a Tencent Cloud customer, along with the ideas and methods we used to solve them. Article author: Bellen, Tencent Cloud big data R&D engineer.

1. Background

The game business of a medium-sized Internet company uses Tencent Cloud's Elasticsearch service, with an ELK architecture for storing business logs.

Because the game generates a huge volume of logs (write peaks of about 1 million QPS), we stumbled more than once during the months we spent serving this customer. After several rounds of optimization and adjustment, the customer's ES cluster finally became relatively stable, avoiding read/write failures at business peaks and reducing the customer's capital and operating costs.

2. First encounter with the customer

Solution architect A: Bellen, Company XX is about to launch a new game and has decided to use an ELK architecture for log storage. They are choosing between XX Cloud and us. Let's visit them first, talk things through, and try to win the deal!

Bellen: Ok, anytime!

So I went with the architect to the customer's office to talk with the head of the operations team responsible for the underlying components.

XX company operation and maintenance boss: don’t talk about your PPT, first tell me what you can bring to us!

Bellen: Well, we have plenty of advantages, such as flexible cluster scaling, smooth one-click version upgrades, and high availability with cross-data-center disaster recovery...

Ops lead at Company XX: Other vendors offer everything you just listed. I only have one question: we need to keep game logs for a full year and cannot delete any of the data. Do you have a solution that can store that much data while keeping our costs down?

Bellen: We offer clusters in a hot-warm mode: hot nodes use SSD cloud disks and warm nodes use SATA cloud disks. With ES's ILM (index lifecycle management) feature, old indexes are periodically migrated from the hot nodes to the warm nodes, which lowers the overall cost.

Alternatively, older indexes can be periodically backed up to COS object storage with the snapshot API and then deleted, which brings the cost down even further.

Ops lead at Company XX: Putting data in COS is cold storage. When we need to query data that is in COS, we have to restore it to ES first? That's no good, it's too slow, and the business can't wait that long. Our data can't be deleted, it can only stay in ES! Could you give us an API so that old index data can sit in COS but still be queried through that API, instead of being restored into ES first?

Bellen: Hmm, that can be done, but it will take time. Could you use a Hadoop-on-COS architecture instead, importing the old index data into COS with tooling and querying it through Hive? That way the cost would be very low and the data would still be queryable at any time.

Ops lead at Company XX: No, we just want to use the mature ELK architecture. Adding Hadoop on top is more than our team has the manpower for.

Bellen: Well, let’s run a cluster test and see how it performs. As for the problem that the stock data is placed in COS but also needs to be queried, we can formulate a plan first and implement it as soon as possible.

Ops lead at Company XX: Okay. We currently estimate about 10 TB of data per day. Let's start by buying a cluster that can hold three months of data. Can you suggest a configuration?

Bellen: At the moment a single node supports at most 6 TB of disk. For CPU and memory, 8-core 32 GB nodes are a good starting point; a single node of that size can handle 20,000 QPS of writes without any problem, and the cluster can be scaled up or out later.

Ops lead at Company XX: Okay, let's test it first.

3. The cluster can't withstand the pressure

A few days later, architect A pinged the WeChat group directly: Bellen, the customer reports that the ES cluster's performance is poor. They are using Logstash to consume log data from Kafka, it has been running for almost a day, and consumption still hasn't caught up. This is a production cluster, please take a look urgently.

I was baffled: when did this go into production? Weren't they still testing?

Ops engineer B at Company XX: We bought a 10-node cluster of 8-core 32 GB nodes with 6 TB of disk per node, and set the indexes to 10 shards and 1 replica. We are now using Logstash to consume the data in Kafka, but it still hasn't caught up and there is a large backlog left in Kafka.

I immediately checked the cluster's monitoring data and found that CPU usage and load were very high, average JVM heap usage had reached 90%, JVM GC was very frequent, and some nodes kept dropping out of and rejoining the cluster because they were responding too slowly.

After talking with the customer, we learned the pipeline is Filebeat + Kafka + Logstash + Elasticsearch. Ten days of log data had accumulated in Kafka, 20 Logstash instances had been started to consume it, and the Logstash batch size was set to 5000; the performance bottleneck was on the ES side.

In theory, the customer's 10-node cluster of 8-core 32 GB nodes can handle 100,000 QPS of writes. But while Logstash was working through the backlog it was pushing well over 100,000 QPS into ES, so ES could not keep up with the write pressure, and the only option was to expand the cluster.

To speed up consumption of the backlog, we first scaled each node up vertically to 32-core 64 GB, then added nodes horizontally so that the ES cluster could sustain writes of up to 1 million QPS (note that the number of index shards needs to be adjusted after nodes are added).

The lesson: when a new customer starts using ES, node configuration and cluster scale must be evaluated in advance, along the following lines (a rough sizing sketch follows the list):

  • Storage capacity: taking into account the number of replicas, data expansion after indexing, the extra disk space used by internal ES tasks (such as segment merges), and the space used by the operating system, and reserving about 50% free disk space, the cluster's total storage capacity should be about 4 times the size of the source data.

  • Compute resources: writes are the main consideration. A 2-core 8 GB node can support about 5,000 QPS of writes, and write capacity grows roughly linearly as nodes are added or enlarged.

  • Index and shard count: a single shard should generally hold 30-50 GB of data; use that to decide how many shards an index needs and whether to create indexes by day or by month. Also control the number of shards per node: roughly 20-30 shards per 1 GB of heap is recommended. Finally, control the cluster's total shard count, which should generally stay within a few tens of thousands.
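
To make these rules of thumb concrete, here is a rough, back-of-the-envelope sizing sketch with purely hypothetical inputs (1 TB/day of source data, 30-day retention, a 100,000 QPS write peak); these are illustrative numbers, not the customer's actual figures:

```python
# Rough sizing sketch applying the rules of thumb above to a hypothetical
# workload (1 TB/day of source data, 30-day retention, 100,000 QPS write peak).
# The inputs are illustrative only, not the customer's actual figures.

daily_source_tb = 1.0
retention_days = 30
write_peak_qps = 100_000

# Storage: replicas + indexing expansion + merge/OS overhead + ~50% headroom ~= 4x
total_storage_tb = daily_source_tb * retention_days * 4
print(f"cluster storage needed: ~{total_storage_tb:.0f} TB")                      # ~120 TB

# Compute: ~5,000 write QPS per 2-core 8 GB node, scaling roughly linearly
print(f"2-core 8 GB node equivalents for writes: ~{write_peak_qps / 5_000:.0f}")  # ~20

# Shards: aim for roughly 30-50 GB of data per shard
daily_index_gb = daily_source_tb * 1024 * 2        # primary + 1 replica
print(f"shard copies per daily index: ~{daily_index_gb / 40:.0f}")                # ~51

# Heap: roughly 20-30 shards per 1 GB of heap on each node
heap_gb = 16
print(f"max shards per node with {heap_gb} GB heap: ~{heap_gb * 25}")             # ~400
```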

4. Logstash Kafka consumption performance tuning

The problem above arose because cluster configuration and scale were not properly evaluated before the service went live, so the ES cluster fell over once it did.

With appropriate expansion the cluster eventually withstood the write pressure, but new problems appeared. Because of the large backlog in Kafka, the customer, consuming the Kafka data with Logstash, reported two issues:

  • Adding more Logstash instances to consume from Kafka did not increase consumption speed linearly;

  • Consumption rates were uneven across Kafka topics and across partitions within a topic.

After analyzing the customer's Logstash configuration files, we found the main causes:

  • Too few partitions per topic: even with many Logstash machines, the data could not be consumed in parallel enough to use the machines fully, so consumption was slow.

  • All Logstash configuration files were identical, with a single consumer group consuming all topics at once, leading to resource contention.

Based on that analysis, Kafka and Logstash were tuned as follows:

  • Increase the number of partitions in the Kafka topics;

  • Group the Logstash instances: give each high-volume topic a dedicated consumer group served by its own set of Logstash instances, while the low-volume topics share one consumer group and one set of Logstash instances;

  • Within each Logstash group, make the total consumer_threads equal to the total number of partitions the group consumes. For example, if three Logstash processes consume a topic with 24 partitions, set consumer_threads to 8 in each Logstash configuration file.

With these optimizations, the Logstash machines were finally fully utilized and the accumulated Kafka data was consumed quickly. Once consumption caught up with the production rate, Logstash kept pace with Kafka steadily and the backlog never returned.

In addition, the customer initially ran Logstash 5.6.4, a fairly old version, and hit an exception whenever a single message body was too long:

```
whose size is larger than the fetch size 4194304 and hence cannot be ever returned. Increase the fetch size on the client (using max.partition.fetch.bytes), or decrease the maximum message size the broker will allow (using message.max.bytes)
```

The problem was avoided by upgrading Logstash to the newer 6.8 release (the 6.x versions of Logstash fix this issue and no longer crash on it).

5. The disk is full: emergency capacity expansion

The customer's game had been live for a month. The original estimate was at most 10 TB of data per day, but in practice about 20 TB was being generated daily, and usage of the total data-disk capacity of 6 TB * 60 = 360 TB had reached 80%.

Given this, we suggested the customer adopt a hot-warm architecture: keep the original 60 hot nodes, add a batch of warm nodes to hold cold data, and use ILM (index lifecycle management) to periodically migrate indexes from the hot nodes to the warm nodes.

With the warm nodes added, the cluster's total disk capacity reached 780 TB, enough for roughly three months of data. But that still did not satisfy the customer.

Ops lead at Company XX: Please give us a solution that can store a full year of data. Constantly adding nodes to grow disk capacity is not a long-term answer; we have to watch this cluster every day, and the operations cost is very high! If we keep adding nodes like this, we might as well give up on ES, right?

Bellen: You can try our newly launched node types with local disks. Hot nodes support up to 7.2 TB of local SSD, and warm nodes support up to 48 TB of local SATA disks.

On the one hand, local disks give the hot nodes better performance than cloud disks; on the other, the warm nodes can carry much larger disks. With more capacity per node, far fewer nodes are needed, which avoids the pitfalls that come with running too many nodes.

Ops lead at Company XX: We're on cloud disks now. Can we switch them to local disks, and how does the switch work?

Bellen: The disks can't be swapped in place. New nodes with local disks are added to the cluster, the data is migrated from the old cloud-disk nodes to the new nodes, and the old nodes are then removed. Throughout the process the service stays online and reads and writes continue normally.

Ops lead at Company XX: Good, let's do it, as soon as possible!

The switch from cloud disks to local disks is performed automatically by calling the cloud service's backend API; once started, it triggers the migration of data from the old nodes to the new ones. About half an hour later, though, a problem appeared:

Ops engineer B at Company XX: Bellen, take a look quickly, ES writes have almost dropped to zero.

The cluster monitoring showed write QPS had plunged from 500,000 to 10,000 and the write rejection rate was climbing sharply. The cluster logs showed that writes were failing because the index for the current hour had not been created successfully.

Under time pressure, we went through the following steps to locate the cause and mitigate the problem:

1. GET _cluster/health

The cluster health was green, but there were about 6,500 relocating_shards, and number_of_pending_tasks had reached the tens of thousands.

2. GET _cat/pending_tasks?v

There were a large number of in-flight "shard-started" tasks at "URGENT" priority and a large number of "put mapping" tasks at "HIGH" priority. Since "URGENT" outranks "HIGH", the flood of shards migrating from the old nodes to the new nodes was blocking the index-creation tasks, which is why writes were failing.

3. GET _cluster/settings

Why were so many shards migrating at once? GET _cluster/settings showed that "cluster.routing.allocation.node_concurrent_recoveries" was set to 50. With 130 old nodes migrating shards to 130 new nodes and up to 50 concurrent recoveries per node, that gives 130 * 50 = 6,500 shards in flight.

And the default value of "cluster.routing.allocation.node_concurrent_recoveries" is 2; it must have been raised earlier to speed up shard migration during the vertical expansion. At that time the cluster had few nodes and few shards migrating concurrently, so index creation was never blocked.

4. PUT _cluster/settings

We then used PUT _cluster/settings to change "cluster.routing.allocation.node_concurrent_recoveries" back to 2. However, the "put settings" task runs at "HIGH" priority, lower than the "shard-started" tasks, so the update itself was blocked and ES returned a timeout error. After multiple retries, the parameter was finally set to 2.
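
For reference, a minimal sketch of that kind of retried settings update (the endpoint and timings are placeholders, not the exact commands we ran):

```python
# Minimal sketch (not the exact commands we ran): keep retrying the settings
# update until the master accepts it, since the "put settings" task can time
# out while queued behind higher-priority "shard-started" tasks.
# The endpoint is a placeholder.
import time
import requests

ES = "http://127.0.0.1:9200"  # placeholder endpoint

body = {"transient": {"cluster.routing.allocation.node_concurrent_recoveries": 2}}

for attempt in range(10):
    # master_timeout gives the pending task more time before ES gives up on it
    resp = requests.put(f"{ES}/_cluster/settings",
                        params={"master_timeout": "60s"},
                        json=body, timeout=90)
    if resp.ok and resp.json().get("acknowledged"):
        print("settings updated")
        break
    print(f"attempt {attempt + 1} failed ({resp.status_code}), retrying...")
    time.sleep(10)
```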

5. Clear the exclude configuration

The node replacement works by setting "cluster.routing.allocation.exclude._name" to the old nodes so that their shards are moved off them. To let the cluster catch its breath, we temporarily cleared this exclude configuration and paused the data migration:

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": ""
  }
}
```

6. Speed up shard migration

At the same time, we raised the maximum transfer rate for shard recovery (40 MB/s by default) to speed up the shard migrations already in flight:

```
PUT _cluster/settings
{
  "transient": {
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "200mb"
      }
    }
  }
}
```

7. Create indexes in advance

At this point the number of migrating shards was slowly falling, the new index was finally created, and writes recovered. At the top of the next hour, however, index creation was still slow because hundreds of shards were still migrating; creating the new index took about 5 minutes, and writes failed during those 5 minutes.

Once those few hundred shard migrations completed, new indexes were created quickly again and the write failures stopped.

The catch was that the switch from cloud-disk nodes to local-disk nodes was still in progress: data had to be migrated from the 130 old nodes to the 130 new nodes, and that migration could not simply be abandoned. So what to do?

Since index creation was slow, the indexes had to be created ahead of time to avoid write failures at the top of every hour. We wrote a Python script, run once a day, that pre-creates all of the next day's hourly indexes; once they are created, it sets "cluster.routing.allocation.exclude._name" back to the full list of old nodes so that the data migration keeps running normally.
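
A minimal sketch of such a pre-creation script (the index naming pattern, shard settings, endpoint, and node-name pattern are placeholders, not the customer's actual values):

```python
# Minimal sketch of the pre-creation script, run once a day. The index naming
# pattern, shard settings, endpoint, and node-name pattern are placeholders,
# not the customer's actual values.
from datetime import datetime, timedelta
import requests

ES = "http://127.0.0.1:9200"   # placeholder endpoint
OLD_NODES = "old-node-*"        # placeholder pattern matching the old cloud-disk nodes

tomorrow = datetime.now() + timedelta(days=1)
for hour in range(24):
    index = f"game-log-{tomorrow:%Y.%m.%d}-{hour:02d}"   # hypothetical hourly naming
    requests.put(f"{ES}/{index}", json={
        "settings": {"number_of_shards": 60, "number_of_replicas": 1}
    })

# Once all of tomorrow's indexes exist, point the exclude setting back at the
# old nodes so the migration to the new local-disk nodes keeps running.
requests.put(f"{ES}/_cluster/settings", json={
    "transient": {"cluster.routing.allocation.exclude._name": OLD_NODES}
})
```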

8. Results

In total about 400 TB of data was migrated over roughly 10 days, and thanks to the Python script that pre-created the indexes, there were no write failures during those 10 days.

From this expansion we took away the following lessons:

When too many shards migrate at the same time, index creation and other cluster-configuration updates get blocked.

So during large data migrations, make sure "cluster.routing.allocation.node_concurrent_recoveries" and "cluster.routing.allocation.cluster_concurrent_rebalance" are kept at small values.

And if a large migration is unavoidable, pre-create the indexes so that slow index creation does not turn into write failures.

6. 100,000 shards?

After a period of stable operation, the cluster ran into trouble again.

Ops engineer B at Company XX: Bellen, the cluster stopped taking writes after 1 a.m. last night, and a large backlog is piling up in Kafka again. Please check it as soon as possible.

The cluster was in yellow state, and the logs were full of errors like:

```
{"message":"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];: [cluster_block_exception] blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];","statusCode":503,"error":"Service Unavailable"}
```

Checking the cluster logs, we found "master not discovered yet..." entries: the cluster could not elect a master node.

The ES processes on the master nodes had died from OOM. The JVM heap-dump directory held a large number of dump files; after deleting some of them and restarting the ES processes, the processes came back up. But heap usage was still very high, GC was very frequent, the master node responded extremely slowly, and a large number of index-creation tasks had timed out and were stuck in the task queue; the cluster still could not recover.

The master nodes were 16-core 32 GB machines with only 16 GB allocated to the JVM. We had no choice but to raise the master nodes' memory to 64 GB (they are Tencent Cloud CVM virtual machines, so the specification can be adjusted, though a restart is required). After the machines came back, we edited the jvm.options file in the ES directory to enlarge the heap and restarted the ES processes.

All three master nodes then returned to normal, but the shards still had to recover. GET _cluster/health showed more than 100,000 shards in the cluster, and recovering them would take some time.

We raised "cluster.routing.allocation.node_concurrent_recoveries" to increase recovery concurrency. In practice the roughly 50,000 primary shards recovered fairly quickly, but the replica shards were much slower, because some replicas have to resynchronize data from their primaries. In this situation you can temporarily set the replica count of some old indexes to 0 so that the mass of replica recoveries finishes as soon as possible, new indexes can be created normally, and the cluster can take writes again.

The root cause of this outage: the cluster had far too many indexes and shards, so the cluster metadata occupied a large amount of heap memory, while the master nodes' JVM heap was only 16 GB (the data nodes had 32 GB). The masters fell into frequent full GC and failed, and the whole cluster failed with them.

So the real fix is to fundamentally reduce the number of shards in the cluster.

At that point log indexes were created hourly with 60 shards and 1 replica, i.e. 24 * 60 * 2 = 2,880 shards per day and roughly 86,400 shards per month. A shard count like that causes serious problems. There are several ways to bring it down:

  • Enable shrink in the warm phase of ILM to cut old indexes from 60 shards to 5, a 12-fold reduction;

  • Have the service create indexes every two hours, or at an even coarser granularity, choosing the interval based on keeping each shard under roughly 50 GB;

  • Set old indexes' replica count to 0 and keep only the primaries, halving the shard count again and also halving the storage used;

  • Periodically close the oldest indexes with POST {index}/_close.

After discussing these with the customer, they accepted the first two but rejected the last two: given the possibility of disk failures, one replica must be kept for data reliability, and all the data must stay searchable at all times, so indexes cannot be closed.

7. ILM is a bit of a "pit"

Although the 100,000-shard incident above was patched over by temporarily adding memory to the master nodes, that did not address the root cause.

The customer plans to keep data for a full year; without optimization, the cluster would certainly not survive hundreds of thousands of shards, so the overall shard count had to be brought down.

As mentioned above, the customer accepted enabling shrink and coarsening index creation to one index every two hours, which reduces the shard count to a degree and keeps the cluster stable for a while.

We helped the customer configure an ILM policy in Kibana along the following lines:

In the warm phase, indexes older than 360 hours are migrated from the hot nodes to the warm nodes, keeping the replica count at 1.

The threshold is expressed as 360 hours rather than 15 days because the customer's indexes are created hourly.

If the condition were 15 days, then every day at midnight all 24 indexes from 15 days earlier would become eligible at once and 24 * 120 = 2,880 shards would start migrating at the same time, which can easily reproduce the earlier problem of index creation being blocked by too many concurrent shard migrations.

With 360 hours, only one index becomes eligible per hour, so the 24 indexes' migrations are spread evenly across the day and other tasks are not blocked.

The warm phase also applies shrink to reduce each index to 5 shards and, since old indexes no longer receive writes, a force merge that compacts the segment files down to a single segment per shard for better query performance.

Finally, once the ILM policy exists, add the index.lifecycle.name setting to the index template so that every newly created index is associated with the policy and ILM can run for it.
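
For reference, a policy along these lines could be created through the ILM API as sketched below; the policy name, index pattern, and endpoint are placeholders, and the actual policy was configured in Kibana:

```python
# Sketch of an ILM policy along the lines described above, created through the
# ILM API (the real policy was configured in Kibana; policy name, index
# pattern, and endpoint are placeholders).
import requests

ES = "http://127.0.0.1:9200"  # placeholder endpoint

policy = {
    "policy": {
        "phases": {
            "warm": {
                "min_age": "360h",                           # created more than 360 hours ago
                "actions": {
                    "allocate": {
                        "require": {"temperature": "warm"},  # move shards to the warm nodes
                        "number_of_replicas": 1              # keep 1 replica
                    },
                    "shrink": {"number_of_shards": 5},       # 60 shards -> 5 shards
                    "forcemerge": {"max_num_segments": 1}    # one segment per shard
                }
            }
        }
    }
}
requests.put(f"{ES}/_ilm/policy/game-log-policy", json=policy)

# Associate newly created indexes with the policy via the index template
template = {
    "index_patterns": ["game-log-*"],                        # hypothetical pattern
    "settings": {
        "number_of_shards": 60,
        "number_of_replicas": 1,
        "index.lifecycle.name": "game-log-policy"
    }
}
requests.put(f"{ES}/_template/game-log-template", json=template)
```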

The customer runs ES 6.8.2, and a few issues surfaced while ILM was running:

A newly added policy only takes effect for newly created indexes. For existing indexes, you have to attach the policy by batch-updating the index setting index.lifecycle.name.

If a policy is modified, existing indexes, whether or not they have already executed part of it, will not pick up the modified version; the change applies only to indexes created after the modification.

For example, if shrink was not enabled in the original policy and you modify the policy to add it, only indexes created after the change will be shrunk once they pass 360 hours; the indexes that already exist will not. If you want to shrink the existing indexes, you can only do it in batches with a script.

Performing index migration and shrink in the same warm phase triggers an ES bug. With the policy described above, an index starts out with 60 shards and 1 replica on the hot nodes; 360 hours after creation it is migrated to the warm nodes and shrunk to 5 shards. After a while we found a large number of unassigned shards, and the allocation explain output pointed at the reason:

```
"deciders" : [
  {
    "decider" : "same_shard",
    "decision" : "NO",
    "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[x-2020.06.19-13][58], node[LKsSwrDsSrSPRZa-EPBJPg], [P], s[STARTED], a[id=iRiG6mZsQUm5Z_xLiEtKqg]]"
  },
  {
    "decider" : "awareness",
    "decision" : "NO",
    "explanation" : "there are too many copies of the shard allocated to nodes with attribute [ip], there are [2] total configured shard copies for this shard id and [130] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
  }
]
```

This happens because shrink first moves a complete copy of the index onto a single node, then builds the new index's shard metadata in memory and links the new shard data to the old shard's segment files. When ILM performs the shrink, it applies settings like the following to the index:

```
{
  "index.routing": {
    "allocation": {
      "require": {
        "temperature": "warm",
        "_id": "LKsSwrDsSrSPRZa-EPBJPg"
      }
    }
  }
}
```

The problem is that the index still has a replica, and a primary shard and its replica cannot live on the same node, so some shards (not all of them, just some) cannot be allocated.

This is a bug triggered in ILM 6.8; locating and fixing it requires digging into the source code, and that investigation is still under way. The current workaround is to periodically scan for indexes with unassigned shards and modify their settings:

```
{
  "index.routing": {
    "allocation": {
      "require": {
        "temperature": "warm",
        "_id": null
      }
    }
  }
}
```

This makes sure the shards first finish migrating from the hot nodes to the warm nodes so that the subsequent shrink can proceed. The shrink can still fail, though: with all 60 shards packed onto one node, rebalancing may kick in and move some shards away, breaking shrink's precondition and causing it to fail.
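
A sketch of that periodic scan, run from cron or a cloud function (the endpoint is a placeholder; it simply applies the settings change shown above to every index with unassigned shards):

```python
# Sketch of the workaround: find indexes with unassigned shards and clear the
# "_id" routing requirement added by ILM's shrink step, so that allocation can
# proceed. The endpoint is a placeholder; run this periodically (cron/SCF).
import requests

ES = "http://127.0.0.1:9200"  # placeholder endpoint

shards = requests.get(f"{ES}/_cat/shards",
                      params={"format": "json", "h": "index,state"}).json()
stuck = {s["index"] for s in shards if s["state"] == "UNASSIGNED"}

for index in stuck:
    requests.put(f"{ES}/{index}/_settings", json={
        "index.routing.allocation.require.temperature": "warm",
        "index.routing.allocation.require._id": None   # null clears the requirement
    })
    print(f"cleared _id allocation requirement on {index}")
```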

The thorough way to avoid this would be to have the ILM policy set the replica count of indexes older than 360 hours to 0, but the customer would not accept that.

8. Implementing SLM ourselves

The previous section covered the impact of 100,000 shards on the cluster and how shrink reduces the shard count, but two problems remained:

  • How do we keep the cluster's total shard count below 100,000 over a full year, stable at a low watermark?

  • When ILM performs shrink, some shards may fail to allocate and the shrink itself may fail.

A quick estimate: with hourly indexes at 60 shards and 1 replica, a year holds 24 * 120 * 365 = 1,051,200 shards. With shrink applied, the count becomes 24 * 120 * 15 + 24 * 10 * 350 = 127,200 shards (indexes newer than 15 days keep 60 shards and 1 replica for write performance and data reliability, while older indexes are shrunk to 5 shards with 1 replica for reliability), which is still well over 100,000.

Given the cluster's total storage over a year and how much data a single shard can carry, we wanted the total shard count to settle at around 60,000-80,000. How do we get there?

The approach we came up with is cold backup: back the old indexes up to other storage media, such as HDFS, S3, or Tencent Cloud's COS object storage.

The catch is that querying cold-backed-up data normally means restoring it to ES first, which is slow and unacceptable to the customer. But the old indexes still carry 1 replica, so we can cold-back-up an old index and then drop its replica count to 0. That brings several benefits:

  • The cluster's total shard count is halved;

  • The volume of stored data is also halved, so the cluster can hold more data;

  • Old indexes remain immediately searchable;

  • In the extreme case where a single-copy index is lost to a disk failure, the data can still be restored from the cold backup.

The customer accepted this plan and chose Tencent Cloud's COS object storage as the cold-backup target. The implementation has two steps:

  • Work through all existing old indexes in batches, backing them up to COS as soon as possible and then setting their replica counts to 0 in batches;

  • Back up newly created indexes on a daily schedule, and update the ILM policy so that ILM itself drops the replica count to 0 as indexes age (both the warm phase and the cold phase of ILM support setting the replica count).

The first step can be done with scripts; here we used Tencent Cloud SCF (Serverless Cloud Function), which is quick to set up and easy to monitor. The main implementation points are (a minimal sketch follows the list):

  • Snapshot at daily granularity, backing up one day's 24 indexes per snapshot. Monthly or coarser snapshots involve too much data: if the snapshot is interrupted partway it has to be restarted, which wastes time and effort. Hourly snapshots are not suitable either, because they create too many snapshots.

  • After a snapshot is started, poll its status and make sure the previous snapshot's state is SUCCESS before creating the next one. Since snapshots are daily, a snapshot can be named, say, snapshot-2020.06.01, meaning it backs up only the indexes of June 1. After snapshot-2020.06.01 succeeds, the next run needs to know which day to back up next, so the current position has to be recorded. There are two ways to do this: write the current date suffix, e.g. 2020.06.01, to a file that the scheduled script reads on each run, or create a small bookkeeping index and store "2020.06.01" in a doc that the script queries and updates.

  • When creating the snapshot, set include_global_state to false so that the global cluster state is not included in the backup.
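
A minimal sketch of such a daily snapshot job (the repository name, index naming pattern, and cursor file are placeholders, not the customer's actual setup):

```python
# Sketch of the daily snapshot job (run from SCF or cron). The repository name,
# index naming pattern, and cursor file are placeholders, not the actual setup;
# the COS snapshot repository is assumed to be registered already, and the
# cursor file seeded with the first backup date.
from datetime import datetime, timedelta
import requests

ES = "http://127.0.0.1:9200"        # placeholder endpoint
REPO = "cos_backup"                  # placeholder, pre-registered COS repository
CURSOR_FILE = "/data/last_snapshot_day.txt"

def read_cursor() -> datetime:
    with open(CURSOR_FILE) as f:
        return datetime.strptime(f.read().strip(), "%Y.%m.%d")

def snapshot_state(name: str) -> str:
    resp = requests.get(f"{ES}/_snapshot/{REPO}/{name}").json()
    return resp["snapshots"][0]["state"]

prev = read_cursor()
prev_name = f"snapshot-{prev:%Y.%m.%d}"

# Only move on once the previous day's snapshot has finished successfully
if snapshot_state(prev_name) != "SUCCESS":
    raise SystemExit(f"{prev_name} not finished yet, will retry on the next run")

day = prev + timedelta(days=1)
requests.put(f"{ES}/_snapshot/{REPO}/snapshot-{day:%Y.%m.%d}", json={
    "indices": f"game-log-{day:%Y.%m.%d}-*",   # that day's 24 hourly indexes
    "include_global_state": False
})

with open(CURSOR_FILE, "w") as f:              # record the new position
    f.write(f"{day:%Y.%m.%d}")
```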

Once the backlog of old indexes has been backed up, their replica counts can be set to 0 in batches, which frees a large amount of disk space at once and sharply reduces the cluster's total shard count.
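
The batch replica change itself is a single settings call per index pattern, roughly like this (pattern and endpoint are placeholders):

```python
# Sketch: drop replicas to 0 for already-backed-up indexes matching a
# placeholder date pattern; the wildcard is expanded by ES on the server side.
import requests

ES = "http://127.0.0.1:9200"  # placeholder endpoint
requests.put(f"{ES}/game-log-2020.05.*/_settings",
             json={"index": {"number_of_replicas": 0}})
```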

After that, the job runs a snapshot once a day to back up each newly aged day of indexes; the script can be scheduled with crontab or Tencent Cloud SCF.

Then the ILM policy is modified to enable the cold phase and set the replica count to 0 there:

The cold phase is set to trigger 20 days after index creation, which guarantees that the daily backup from step 2 has already finished before an index's replicas are dropped.

By cold-backing up old index data and cutting replicas, the cluster's total shard count can be held at a low watermark.

One problem remains: the shrink failures. As it happens, cold-backing up old data and reducing replicas also solves the shrink failures completely.

As explained earlier, shrink fails because the index still has 1 replica. If we back the data up and drop the replica earlier, the index already has 0 replicas by the time it enters ILM's warm phase, and the shrink then runs without trouble.

At the same time, with the replicas gone, the amount of data migrated from hot nodes to warm nodes is halved, which lowers cluster load: two birds with one stone.

So the ILM policy is adjusted once more: set the replica count to 0 in the warm phase and remove the cold phase.

Another optional optimization is to freeze old indexes. Freezing an index evicts its memory-resident data structures (such as FSTs and metadata) from the heap to reduce memory usage.

When a frozen index is queried, those data structures are rebuilt in memory temporarily and released once the query completes. Note that frozen indexes are not searched by default; you must explicitly add "ignore_throttled=false" to the query.
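
For reference, freezing and then querying a frozen index looks roughly like this (the index name and endpoint are placeholders):

```python
# Sketch: freeze an old index, then query it explicitly. The index name and
# endpoint are placeholders; the _freeze API is available in the ES versions
# discussed here (6.x/7.x).
import requests

ES = "http://127.0.0.1:9200"  # placeholder endpoint
index = "game-log-2020.05.01-00"

# Freeze the old index to evict its memory-resident structures
requests.post(f"{ES}/{index}/_freeze")

# Frozen indexes are skipped by default; opt in with ignore_throttled=false
resp = requests.post(f"{ES}/{index}/_search",
                     params={"ignore_throttled": "false"},
                     json={"query": {"match_all": {}}, "size": 1})
print(resp.json()["hits"]["total"])
```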

With the optimizations above, we finally solved both the excessive shard count and the shrink failures, at the cost of introducing extra scheduled scripts to automate snapshots. This capability is in fact built into ES from version 7.4 as SLM (snapshot lifecycle management) and can be used together with ILM. ILM also gained a "wait_for_snapshot" action, but it can only be used in the delete phase, which does not fit our scenario.

9. Customers will love Searchable Snapshots!

In the scenarios above, we spent a great deal of effort solving problems and optimizing usage to keep the ES cluster stable while supporting petabyte-scale storage.

Stepping back: if there were a solution where customers simply keep hot data on SSDs and cold data in COS/S3, yet the cold data stays searchable on demand, every problem described above would disappear. The obvious benefits:

  • A much smaller cluster plus very cheap COS/S3 object storage could hold petabytes of data at a very low capital cost for the customer;

  • The small cluster would only need to handle writes and queries on the hot indexes, so the total shard count would stay small and the cluster would stay stable.

This is exactly the Searchable Snapshots feature that the ES open source community is currently developing.

From the [Searchable Snapshots API](https://www.elastic.co/guide/en/elasticsearch/reference/master/searchable-snapshots-apis.html) documentation: you can create a new index that is mounted on a specified snapshot. The new index is queryable; queries may be slower than against a regular index, but in log scenarios the extra latency when searching older indexes is generally acceptable.
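
Based on that still-evolving API documentation, mounting an index from a snapshot would look roughly like the sketch below; the repository, snapshot, and index names are placeholders, and the exact parameters may change before the feature is released:

```python
# Sketch based on the (still in development) searchable snapshots mount API in
# the linked docs; repository, snapshot, and index names are placeholders and
# the exact parameters may change before the feature is released.
import requests

ES = "http://127.0.0.1:9200"  # placeholder endpoint

requests.post(
    f"{ES}/_snapshot/cos_backup/snapshot-2020.06.01/_mount",
    params={"wait_for_completion": "true"},
    json={
        "index": "game-log-2020.06.01-00",                  # index inside the snapshot
        "renamed_index": "mounted-game-log-2020.06.01-00"   # name of the mounted index
    }
)
```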

So I believe Searchable Snapshots will fix many pain points and bring new vitality to ES!

10. Conclusion

From the operations and optimization work described above, we distilled the following lessons to share:

First: evaluate cluster scale and node specifications before a new cluster goes live.

Second: keep the cluster's overall shard count from growing too large; adjust how indexes are used and keep optimizing with the tools ES itself provides, so that the shard count stays low and the cluster stays stable.

Third: Searchable Snapshots can bring new vitality to ES; it is worth following closely and studying how it is implemented.

From first contacting the customer and understanding their needs, to working through the ES cluster's problems one by one until it finally ran stably, this whole experience taught me the real meaning of "knowledge comes from practice": only through continuous practice can we respond quickly to incidents and to customers' optimization requests.
