Background

The game business of a medium-sized Internet company uses Tencent Cloud Elasticsearch (ES) with an ELK architecture to store its business logs.

Because the game generates a huge volume of log data (peaking at around 1,000,000 write QPS, written as "100W" in the notation used below), we hit quite a few pitfalls during the first several months of serving this customer. After several rounds of optimization and adjustment, the customer's ES cluster became much more stable, read/write outages during business peaks were avoided, and the customer's hardware and operating costs came down.

What follows is, in effect, a hands-on tutorial on tuning a highly available Elasticsearch cluster environment.

Scenario 1: First encounter with a customer

Solution architect A: Bellen, Company XX is about to launch a new game and has decided on an ELK architecture for log storage. They are choosing between XX Cloud and us. Let's go over to their office and talk with them first, and try to win the deal!

Bellen: Ok, anytime!

I went to the company with the architect and talked with the head of the operations team responsible for the underlying components.

XX Company ops lead: Skip your PPT. First tell me what you can actually bring us!

Bellen: Well, we have quite a few advantages, such as flexible cluster scaling, smooth one-click cluster version upgrades, and high availability with cross-datacenter disaster recovery…

XX Company ops lead: Other vendors offer all of that too. I just have one question for you. We need to store game logs for a year and we cannot delete any data. Do you have a solution that can hold that much data while keeping our costs down?

Bellen: We provide hot-warm clusters. Hot nodes use SSD cloud disks and warm nodes use SATA disks, and ES's ILM (index lifecycle management) feature periodically migrates old indexes from the hot nodes to the warm nodes, which lowers the overall cost. Alternatively, old indexes can be periodically backed up to COS object storage with snapshots and then deleted from the cluster, which is cheaper still.

XX Company ops lead: Storing data in COS is cold storage. When we need to query data that sits in COS, we would have to restore it into ES first? That won't work, it's far too slow and the business can't wait that long. Our data cannot be deleted, so it has to stay in ES! Could you provide an API so that old index data can live in COS but still be queried through that API, without having to restore it into ES first?

Bellen: That can be done, but it will take time. In the meantime, could you use a Hadoop-on-COS architecture, export the old index data to COS with a tool, and query it through Hive? The cost would be very low and the data would still be queryable at any time.

XX Company ops lead: No. We just want to run the mature ELK architecture; if we add Hadoop on top, we don't have the manpower to operate it!

Bellen: Understood. Let's set up a cluster and run a test first to see how it performs. As for querying the historical data that has been moved to COS, we'll draft a plan and implement it as soon as possible.

XX Company ops lead: OK. We currently estimate about 10TB of data per day. Let's start by buying a cluster that can hold three months of data. Can you suggest a cluster configuration?

Bellen: Currently a single node supports at most 6TB of disk. For CPU and memory, an 8-core 32GB node is a good fit; a single node like that can handle 20,000 (2W) write QPS without any problem. You can also scale up or scale out later.

XX Company ops lead: OK, let's test it first.

Scenario 2: The cluster is overwhelmed

N days later, architect A pinged me directly in the WeChat group: Bellen, the customer reports that ES cluster performance is poor. They are using Logstash to consume log data from Kafka, it has been running for almost a day, and consumption still hasn't caught up. This is a production cluster, please take a look urgently.

I was baffled. Since when was this in production? Weren't they still testing?

XX Company ops engineer B: We bought a 10-node cluster of 8-core 32GB nodes with 6TB of disk per node, and the indexes are configured with 10 shards and 1 replica. We are using Logstash to consume the data in Kafka, but it still hasn't caught up and there is a large backlog in Kafka.

I immediately checked the cluster monitoring and found that CPU usage and load were very high, average JVM heap utilization had reached 90%, JVM GC on the nodes was very frequent, and some nodes kept dropping out of and rejoining the cluster because they were too slow to respond.

After further communication we learned that the pipeline is Filebeat + Kafka + Logstash + Elasticsearch. Ten days of log data had already accumulated in Kafka, and 20 Logstash instances had been started to consume it, each with a batch size of 5000, so the performance bottleneck was on the ES side. In theory the customer's 10-node, 8-core 32GB cluster can sustain 100,000 (10W) write QPS, but while draining the backlog Logstash pushes far more than that at ES, so ES could not absorb the write pressure and the cluster had to be scaled. To speed up consumption of the backlog, each node was first scaled up to 32 cores and 64GB, and then nodes were added horizontally so that the cluster could sustain writes of up to 1,000,000 (100W) QPS (note that the number of shards per index also needs to be adjusted after nodes are added).
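As a rough illustration of that last point (the template name and index pattern here are hypothetical, and the exact values depend on the final node count; the customer eventually ran hourly indexes with 60 shards), the shard count of newly created indexes is raised by updating the index template once the cluster has been scaled out:

PUT _template/game-log
{
  "index_patterns": ["x-*"],
  "settings": {
    "index.number_of_shards": 60,
    "index.number_of_replicas": 1
  }
}

Templates only affect indexes created after the change, so the new shard count takes effect from the next hourly index onward.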

The lesson: when a new customer starts using ES, node configuration and cluster scale must be evaluated in advance. The evaluation can be done along the following lines:

  • Storage capacity: taking into account the number of replicas, data expansion during indexing, extra disk space used by internal ES tasks such as segment merges, and space reserved for the operating system, and keeping about 50% of disk space free, the cluster's total storage should be roughly 4 times the volume of the source data
  • Computing resources: writes are the main concern. A 2-core 8GB node can sustain about 5,000 write QPS, and write capacity grows roughly linearly as nodes are added or node specs are raised
  • Index and shard counts: a single shard should generally hold 30-50GB of data; use that to decide how many shards each index needs and whether indexes should be created daily or monthly. Also keep the shard count per node under control (1GB of heap supports roughly 20-30 shards) and cap the total number of shards in the cluster, generally at no more than a few tens of thousands (a worked example follows this list)
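As a worked example with round numbers (purely illustrative, not the customer's actual sizing), take 10TB of new source data per day with 1 replica:

  • Disk: 10TB × 4 ≈ 40TB of cluster storage per day of retention
  • Shards: one day's indexes hold about 10TB × 2 = 20TB including the replica; at 30-50GB per shard that is roughly 400-700 shards per day, i.e. on the order of 10 primary shards for each hourly index
  • Heap: at 20-30 shards per GB of heap, roughly 500 shards per day means each month of retention consumes on the order of 500-750GB of heap across the data nodes, which is why the total shard count has to be capped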

Scenario 3: Tuning Logstash consumption from Kafka

The problem above arose because cluster configuration and scale were not properly evaluated before the service went live, so the ES cluster was overwhelmed once it did. With appropriate capacity expansion the cluster eventually withstood the write pressure, but then new problems appeared.

Because Kafka had a large backlog of data, the customer, who consumes Kafka with Logstash, reported two problems:

  1. Adding more Logstash instances to consume from Kafka did not increase consumption speed linearly
  2. Consumption rates were uneven across Kafka topics and across the partitions within a topic

After analyzing the customer's Logstash configuration files, we found the main causes to be:

  1. The topics had too few partitions; even with many Logstash machines, the data could not be consumed in parallel, so machine resources were underused and consumption was slow
  2. All Logstash instances used identical configuration files and a single consumer group consuming all topics at once, which led to resource contention

Based on this analysis, Kafka and Logstash were optimized as follows:

  1. Increase the number of partitions of the Kafka topics
  2. Group the Logstash instances: give each high-volume topic its own consumer group served by a dedicated set of Logstash instances, and let the lower-volume topics share one consumer group and one set of Logstash instances
  3. Make the total consumer_threads of each Logstash group equal to the total number of partitions in that consumer group; for example, if three Logstash processes consume a topic with 24 partitions, set consumer_threads to 8 in each Logstash configuration file

With these optimizations, the Logstash machines were finally fully utilized and the Kafka backlog was quickly drained. Once consumption caught up with the production rate, Logstash consumption from Kafka ran steadily with no further backlog.

In addition, the customer initially used Logstash 5.6.4, a rather old version. In use, Logstash threw an exception whenever a single message was too large:

whose size is larger than the fetch size 4194304 and hence cannot be ever returned. Increase the fetch size on the client (using max.partition.fetch.bytes), or decrease the maximum message size the broker will allow (using message.max.bytes)

The problem was avoided by upgrading Logstash to the newer 6.8 release (the 6.x versions of Logstash handle this case and no longer crash).

Scenario 4: Disk capacity is about to run out. Emergency expansion?

The customer's game had been live for a month. The original estimate was at most 10TB of data per day, but in practice about 20TB was generated daily, and utilization of the total data-disk capacity (6TB × 60 nodes = 360TB) had already reached 80%. Given this, we suggested a hot-warm cluster architecture: on top of the existing 60 hot nodes, add a batch of warm nodes to hold cold data, and use ILM (index lifecycle management) to periodically migrate indexes from the hot nodes to the warm nodes.

With the warm nodes added, the customer's cluster reached 780TB of total disk, enough for at most three months of storage. But the customer's requirement was still not met:

XX Company ops lead: Please give us a solution that can store a full year of data. Constantly adding nodes to grow disk capacity is not a long-term answer; we have to watch this cluster every day and the operational cost is very high! If we have to keep adding nodes like this, we might as well give up on ES, right?

Bellen: You could try our newly launched instance types with local disks. Hot nodes support local SSDs of up to 7.2TB, and warm nodes support local SATA disks of up to 48TB. Hot nodes perform better than on cloud disks, and warm nodes can hold far more data. Since each node can now carry much more disk, the node count stays modest and we avoid the pitfalls that come with having too many nodes.

XX Company ops lead: We are on cloud disks now. Can we switch to local disks, and how would that work?

Bellen: They cannot be swapped in place. We add new nodes with local disks to the cluster, migrate the data from the old cloud-disk nodes to the new nodes, and remove the old nodes once the migration completes. The service is not interrupted and reads and writes continue as normal.

XX Company ops lead: Good, that works. Please do it as soon as possible!

The switch from cloud disks to local disks is carried out automatically by calling the cloud service's backend API. Once it started, data migration from the old nodes to the new nodes kicked off, but about half an hour later a problem appeared:

XX Company ops engineer B: Bellen, please take a look, ES writes are about to drop to 0.

The cluster monitoring showed that write QPS had dropped from 500,000 (50W) straight down to 10,000 (1W) and the write rejection rate was climbing sharply. The cluster logs showed that writes were failing because the index for the current hour had not been created successfully.

In the emergency, we went through the following steps to locate the cause:

  1. GET _cluster/health

The cluster health status was Green, but there were about 6,500 relocating_shards, and number_of_pending_tasks was in the tens of thousands.

  2. GET _cat/pending_tasks?v

There were a large number of shard-started tasks in progress at URGENT priority, along with a large number of put-mapping tasks at HIGH priority. URGENT is a higher priority than HIGH, so with so many shards migrating from the old nodes to the new nodes, the index-creation tasks were blocked, and that is why data writes were failing.

  3. GET _cluster/settings

Why were so many shards migrating at once? GET _cluster/settings showed that "cluster.routing.allocation.node_concurrent_recoveries" was set to 50, and with 130 old nodes shedding shards and 130 new nodes receiving them, that works out to 130 × 50 = 6,500 shards in migration. The default value of "cluster.routing.allocation.node_concurrent_recoveries" is 2; it had been raised manually before the earlier vertical and horizontal scaling in order to speed up shard migration (at that time the cluster had few nodes and few shards migrating concurrently, so index creation was not blocked).

  4. PUT _cluster/settings

We then tried to set "cluster.routing.allocation.node_concurrent_recoveries" back to 2 via PUT _cluster/settings. However, the put-settings task runs at HIGH priority, which is lower than the URGENT priority of the shard-started tasks, so the settings update was blocked as well: ES returned errors and the task timed out. After multiple retries, the change of "cluster.routing.allocation.node_concurrent_recoveries" to 2 finally went through.
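For completeness, the settings call described here is the standard cluster-settings update:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}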

  5. Cancel the exclude configuration

The allocation exclude setting, which had been configured to drive data off the old cloud-disk nodes, was still queueing up new shard-migration tasks, so we temporarily cleared it to stop additional shards from being scheduled for migration:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": ""
  }
}
  6. Speed up shard migration

At the same time, we raised the maximum bytes per second for shard recovery (default 40MB) to speed up the shard-migration tasks already in flight:

PUT _cluster/settings
{
  "transient": {
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "200mb"
      }
    }
  }
}
  7. Create indexes ahead of time

The number of migrating shards then slowly decreased, the new index was finally created successfully, and writes recovered. But at the top of the next hour, with hundreds of shards still migrating, creating the new index was again slow: it took about 5 minutes, and writes failed during those 5 minutes.

Once those few hundred migrating shards had finished, new indexes were created quickly and the write failures stopped. The remaining problem was that the switch from cloud-disk nodes to local-disk nodes was still in progress: data had to be migrated from the 130 old nodes to the 130 new nodes, and that migration could not simply be abandoned. What to do? Since creating new indexes was slow, we created them ahead of time to avoid write failures at the top of each hour. A Python script, run once a day, pre-creates the indexes for every hour of the next day, and once they are created it sets "cluster.routing.allocation.exclude._name" back to all of the old nodes so that the data-migration tasks keep running normally.
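A sketch of what that script issues against ES (the hourly index naming follows the pattern seen elsewhere in this article, and the node-name wildcard is hypothetical):

# pre-create one index per hour of the next day
PUT x-2020.06.21-00
PUT x-2020.06.21-01
# ... and so on for the remaining 22 hours

# then point the exclude setting back at all of the old cloud-disk nodes
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "old-cloud-node-*"
  }
}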

  8. Results

In total about 400TB of data was migrated over roughly 10 days, and with the Python script pre-creating indexes there were no write failures during those 10 days.

From this expansion exercise we took away the following lessons:

  1. If too many shards are migrating at the same time, index creation and other settings-update tasks can be blocked. During data migration, therefore, keep "cluster.routing.allocation.node_concurrent_recoveries" and "cluster.routing.allocation.cluster_concurrent_rebalance" at relatively small values.
  2. If data migration is unavoidable, create indexes ahead of time to prevent write failures caused by slow index creation in ES.

Scenario 5: 100,000 shards?

After a period of stable operation, the cluster ran into trouble again.

XX Company ops engineer B: Bellen, writes to the cluster stopped after 1 a.m. last night, and a large backlog has built up in Kafka again. Could you please check as soon as possible?

The cluster was in yellow status and was logging a large number of errors:

{"message":"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized]; : [cluster_block_exception] blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"."statusCode":503."error":"Service Unavailable"}
Copy the code

The cluster logs also contained "Master not discovered yet…" errors: the cluster was failing to elect a master node.

Logging in to the master nodes, we found that the ES process had exited due to OOM. The JVM heap-dump directory contained a large number of dump files; after deleting some of them and restarting the ES process, it came up normally. However, heap utilization remained very high, GC was very frequent, the master node responded slowly, and a large number of index-creation tasks had timed out and were stuck in the task queue, so the cluster still could not recover.

The JVM had been allocated only 16GB of heap, so we had to raise the master nodes' memory to 64GB (Tencent Cloud CVM virtual machines can have their specifications adjusted, which requires a restart). After each master machine restarted, we edited jvm.options in the ES directory to increase the heap size and restarted the ES process.

The three master nodes recovered, but the shards still had to recover. GET _cluster/health showed more than 100,000 (10W) shards in the cluster, and recovering them would take some time. We increased "cluster.routing.allocation.node_concurrent_recoveries" to raise the shard-recovery concurrency. The roughly 50,000 primary shards actually recovered fairly quickly, but the replica shards were much slower, because some replicas have to copy data from their primaries to recover. In this situation you can set the replica count of some old indexes to 0 so that the large backlog of replica recoveries finishes as quickly as possible; new indexes can then be created normally and the cluster can accept writes again.
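Dropping replicas for a batch of old indexes is a single settings call against an index pattern (the pattern here is hypothetical):

PUT x-2020.05.*/_settings
{
  "index.number_of_replicas": 0
}

The replica count can be raised back to 1 later once the cluster has recovered, if the extra copy is still wanted.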

The root cause of this failure: the cluster had far too many indexes and shards, so the cluster metadata occupied a large amount of heap memory, while the master nodes' JVM heap was only 16GB (the data nodes had 32GB). Frequent full GCs made the master nodes misbehave, which eventually took the whole cluster down. The real fix, therefore, is to address the excessive number of shards in the cluster.

At that point log indexes were being created hourly with 60 shards and 1 replica: 24 × 60 × 2 = 2,880 shards per day, or 86,400 shards per month, which is more than enough to cause serious problems. There are several ways to bring the shard count down:

  1. Enable shrink in the warm phase of ILM to cut the shard count of old indexes from 60 to 5, a 12x reduction.
  2. Have the business create indexes every two hours or even less frequently, choosing the interval so that each shard stays within roughly 50GB.
  3. Set the replica count of old indexes to 0 and keep only the primaries, which halves the shard count again and also halves storage.
  4. Periodically close the oldest indexes with {index}/_close.

After discussion, the customer accepted options 1 and 2 but not 3 and 4: given the possibility of disk failures, a replica must be kept for data reliability, and all data must remain searchable at all times, so indexes cannot be closed.

Scenario 6: ILM has its pitfalls

In the previous scenario, although temporarily adding memory to the master nodes kept the cluster alive under 100,000 shards, it did not fundamentally solve the problem. The customer's data is to be kept for a year, and without optimization the cluster would inevitably be unable to carry hundreds of thousands of shards, so the excessive shard count had to be tackled head-on. As mentioned earlier, enabling shrink and coarsening index creation (switching to one index every two hours) were acceptable to the customer; this reduces the shard count to some extent and can keep the cluster stable for a while.

We helped the customer configure an ILM policy in Kibana along the following lines.

In the warm phase, indexes older than 360 hours are migrated from the hot nodes to the warm nodes, keeping the replica count at 1. The condition is 360 hours rather than 15 days because the customer's indexes are created hourly: with a 15-day condition, the 24 indexes from 15 days earlier would all be triggered around midnight, and migrating 24 × 120 = 2,880 shards at once could easily reproduce the earlier problem of index creation being blocked by too many migrating shards. With 360 hours, only one index is migrated each hour, so the migrations of the 24 indexes are spread evenly and other tasks are not blocked.

The warm phase also includes a shrink action that reduces each index to 5 shards. Since old indexes no longer receive writes, a force merge is also performed to merge each shard's segment files into one for better query performance.
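Expressed as an API call rather than through the Kibana UI, the policy looks roughly like this (the policy name is hypothetical; ES 6.8 ILM syntax):

PUT _ilm/policy/game-log-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "360h",
        "actions": {
          "allocate": {
            "number_of_replicas": 1,
            "require": { "temperature": "warm" }
          },
          "shrink": { "number_of_shards": 5 },
          "forcemerge": { "max_num_segments": 1 }
        }
      }
    }
  }
}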

In addition, once the ILM policy is in place, add the index.lifecycle.name setting to the index template so that every newly created index is associated with the policy and ILM can run as expected.
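Extending the earlier template sketch, the association is a single setting (names are illustrative; the hot allocation attribute assumes the temperature-based hot-warm setup described above):

PUT _template/game-log
{
  "index_patterns": ["x-*"],
  "settings": {
    "index.lifecycle.name": "game-log-policy",
    "index.routing.allocation.require.temperature": "hot",
    "index.number_of_shards": 60,
    "index.number_of_replicas": 1
  }
}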

The customer runs ES 6.8.2, and a few problems surfaced while ILM was running:

  1. A newly added policy takes effect only for newly created indexes. For existing indexes, the policy has to be attached by updating their index settings in bulk.
  2. If a policy is modified, existing indexes, whether or not they have already executed part of the policy, do not pick up the change; the modified policy applies only to indexes created after the modification. For example, if the first version of the policy did not enable shrink and you later add a shrink action, only newly created indexes will be shrunk once they pass 360 hours; to shrink existing indexes you have to do it in bulk with a script.
  3. In the warm phase, performing index migration and shrink at the same time triggers an ES bug. With the ILM policy described above, an index has 60 shards and 1 replica, starts on the hot nodes, and is migrated 360 hours after creation. In practice, after a while a large number of unassigned shards appeared, and the allocation deciders gave the following reasons:
 "deciders": [{"decider" : "same_shard"."decision" : "NO"."explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[x-2020.06.19-13][58], node[LKsSwrDsSrSPRZa-EPBJPg], [P], s[STARTED], a[id=iRiG6mZsQUm5Z_xLiEtKqg]]"
         },
         {
           "decider" : "awareness"."decision" : "NO"."explanation" : "there are too many copies of the shard allocated to nodes with attribute [ip], there are [2] total configured shard copies for this shard id and [130] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
         }
Copy the code

This happens because the shrink operation first gathers all of the index's data onto a single node, then builds the new index's shard metadata and points the new shards at the old shard data through file links. When ILM performs the shrink, it applies settings like the following to the index:

{
  "index.routing" : {
    "allocation" : {
      "require" : {
        "temperature" : "warm",
        "_id" : "LKsSwrDsSrSPRZa-EPBJPg"
      }
    }
  }
}

The problem is that the index still has replicas, and a primary shard and its replica cannot live on the same node, so some shards (not all of them, only some) cannot be allocated. This is a bug triggered in ILM in version 6.8; locating and fixing it requires digging into the source code, which is still in progress. The current workaround is to periodically scan for indexes with unassigned shards and modify their settings:

{
  "index.routing" : {
    "allocation" : {
      "require" : {
        "temperature" : "warm",
        "_id" : null
      }
    }
  }
}

This makes the shards migrate from the hot nodes to the warm nodes first, so that the subsequent shrink can usually go through (it may still occasionally fail: once all 60 shards are gathered on a single node, rebalancing can kick in and move shards away again, causing the shrink to fail). To avoid the problem completely, the ILM policy would have to set the replica count of indexes older than 360 hours to 0, but the customer would not accept that.

Scenario 7: Implement SLM yourself

The previous sections covered the impact of 100,000 shards on the cluster and the use of shrink to reduce the shard count, but two main issues still had to be addressed:

  1. How do we keep the cluster's total shard count below 100,000 over a full year, stable at a low water mark?
  2. Shrink executed through ILM can fail because some shards cannot be allocated.

Without shrink, a year's worth of indexes amounts to 24 × 120 × 365 = 1,051,200 shards. With shrink, it is 24 × 10 × 350 + 24 × 120 × 15 = 127,200 shards (old indexes shrunk to 5 shards with 1 replica), still more than 100,000. Considering the cluster's total storage over a year and how much data a single shard can reasonably carry, we wanted the total shard count to stay at around 60,000 to 80,000. How do we get there?

The natural idea is cold backup: back the old indexes up to other storage media such as HDFS, S3, or Tencent Cloud's COS object storage. The catch is that querying that cold data would require restoring it into ES first, which is too slow for the customer to accept. But the old indexes still carry 1 replica, so we can take a cold backup of an old index and then drop its replica count to 0. That brings the following benefits:

  1. The cluster's total shard count is cut in half;
  2. The volume of stored data is also halved, so the cluster can hold more data;
  3. Old indexes remain immediately queryable;
  4. In the extreme case where a disk failure destroys an index that now has only a single copy, the data can be restored from the cold backup.

The customer accepted this plan and chose to cold-back the old indexes up to COS, Tencent Cloud's object storage. The implementation has two steps:

  1. Process the existing old indexes in batches: back them up to COS as quickly as possible, then set their replica count to 0 in bulk;
  2. Back up newly created indexes on a daily basis, and adjust the ILM policy so that the replica count is set to 0 during ILM execution (both the warm phase and the cold phase of ILM support setting the replica count).

Step one can be done with scripts; here we used Tencent Cloud SCF (Serverless Cloud Function), which is quick to set up and easy to monitor. The key implementation points are:

  1. Create snapshots by day, backing up the 24 indexes generated each day in one batch. Creating snapshots by month or at a coarser granularity puts too much data into each snapshot: if execution is interrupted partway, the whole snapshot has to be redone, which wastes time and effort. Creating snapshots by hour is not suitable either, since it produces too many snapshots.
  2. After starting each snapshot, poll its status and only begin the next one once the previous snapshot's state is SUCCESS. Since snapshots are created per day, the snapshot name can be, say, snapshot-2020.06.01, backing up only the indexes of June 1. Once snapshot-2020.06.01 has succeeded and the next snapshot is to be created, the script needs to know which day it has reached, which can be recorded in either of two ways: write the date suffix (2020.06.01) to a file that the scheduled task reads on every run, or create a temporary index, write "2020.06.01" into a doc in it, and query and update that doc. (A sketch of the snapshot calls follows this list.)
  3. When creating the snapshots, set include_global_state to false so that the global cluster state is not backed up.
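A sketch of the two calls the function loops over (the repository name and index pattern are hypothetical; the snapshot name follows the daily convention above):

PUT _snapshot/cos_backup/snapshot-2020.06.01?wait_for_completion=false
{
  "indices": "x-2020.06.01-*",
  "include_global_state": false
}

# poll until "state" is SUCCESS before starting the next day's snapshot
GET _snapshot/cos_backup/snapshot-2020.06.01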

After this first pass, the replica count of all backed-up indexes can be set to 0 in bulk, which frees a large amount of disk space at once and significantly reduces the cluster's total shard count.

From then on, a snapshot is taken every day to back up the indexes that have newly aged out; the script can be run periodically via crontab or Tencent Cloud SCF.

After that, the ILM policy is modified to enable the cold phase and set the replica count there to 0:

Here the cold phase triggers 20 days after index creation, which guarantees that the index has already been backed up in step 2 before it enters the cold phase.
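With the cold phase enabled, the policy from the previous scenario becomes roughly the following (same hypothetical names; 6.8 syntax):

PUT _ilm/policy/game-log-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "360h",
        "actions": {
          "allocate": {
            "number_of_replicas": 1,
            "require": { "temperature": "warm" }
          },
          "shrink": { "number_of_shards": 5 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "20d",
        "actions": {
          "allocate": { "number_of_replicas": 0 }
        }
      }
    }
  }
}

As explained below, this was later simplified further by dropping the replicas already in the warm phase and removing the cold phase.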

By cold-backing old index data and dropping replicas, we can keep the cluster's total shard count at a low water mark. That leaves the other problem, shrink failure, and as it happens, the same cold-backup-plus-replica-reduction approach solves it completely.

As described above, shrink fails because the index still has 1 replica. If we back the data up and drop the replica earlier, the old index already has 0 replicas by the time it enters the ILM warm phase. And with the replica gone, the amount of data migrated from hot nodes to warm nodes is also halved, which lowers cluster load: two birds with one stone.

So the ILM policy is modified once more to set the replica count to 0 in the warm phase and to remove the cold phase.

Another optional optimization is to freeze old indexes. Freezing an index removes the data structures it keeps permanently in memory (for example FST metadata), reducing memory usage; when a frozen index is queried, the necessary structures are built temporarily in memory and released again after the query. Note that frozen indexes are not searched by default: you must explicitly add ignore_throttled=false to the query.
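For reference, freezing an index and then querying it looks like this (the index name is hypothetical; the freeze API is available from ES 6.6 onward):

POST x-2020.01.01-00/_freeze

GET x-2020.01.01-00/_search?ignore_throttled=false
{
  "query": { "match_all": {} }
}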

With the optimizations above, we finally solved both the excessive shard count and the shrink failures, at the cost of introducing extra scheduled scripts to automate the snapshots. This capability is built into ES 7.4+ under the name SLM (snapshot lifecycle management) and can be used together with ILM; ILM also gained a "wait_for_snapshot" action, but it can only be used in the delete phase, which does not fit our scenario.

Scenario 8: Customers love Searchable Snapshots!

In the scenarios above we spent a lot of effort solving problems and optimizing usage to keep the ES cluster stable while supporting petabyte-scale storage. Going back to the beginning: if we could offer a solution where customers simply keep hot data on SSDs and cold data in COS/S3, yet can still query the cold data on demand, all of the earlier problems would disappear. The obvious benefits:

  1. A small cluster plus very cheap COS/S3 object storage can hold petabytes of data at a very low cost to the customer;
  2. The small cluster only needs to handle writes and queries for the hot indexes, so the total shard count stays modest and cluster instability is avoided.

This is exactly what the Searchable Snapshots feature provides: through the Searchable Snapshots API, a snapshot can be mounted as an index and queried directly. Queries may be slower, but in logging scenarios a higher latency is generally acceptable when querying older indexes.
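For illustration, in the ES versions where Searchable Snapshots became generally available (7.10+, subject to licensing), mounting an index from a snapshot is a single call; the repository, snapshot, and index names below are hypothetical, reusing the daily-snapshot naming from earlier:

POST _snapshot/cos_backup/snapshot-2020.06.01/_mount?wait_for_completion=true
{
  "index": "x-2020.06.01-13"
}

The mounted index can then be searched like any other index.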

So I believe Searchable Snapshots will fix a lot of pain points and bring new vitality to ES!

Conclusion

From the operation and optimization practice described above, we can summarize the following lessons:

  1. Before a new cluster goes live, evaluate cluster scale and node specifications in advance
  2. Keep the cluster's total shard count from growing too large; by adjusting usage patterns and leaning on ES's own features, keep it at a low water mark to ensure cluster stability
  3. Searchable Snapshots can give ES new vitality; it is worth following closely and studying how it is implemented

From the first contact with the customer, through understanding their needs, to solving the ES cluster's problems one by one until it finally stayed stable, this experience really brought home the saying that genuine knowledge comes from practice. Only through constant practice can we respond quickly to unexpected situations and turn customer requirements and feedback into optimizations promptly.

Original link: www.toutiao.com/i6859556881…