Original: https://mp.weixin.qq.com/s/FVbjGTvZP97u5CAydDx7dw


Preface

Elasticsearch (ES) is a popular open-source distributed search and analytics engine. It makes real-time log analysis, full-text search, and structured data analysis straightforward, and greatly reduces the cost of data mining.


ES is deployed at large scale both inside Tencent and among Tencent Cloud users, and these deployments continuously drive the optimization of native ES toward high availability, high performance, and low cost. This article introduces the challenges Tencent encountered when putting ES into production, along with its optimization ideas, results, and future directions, in the hope of providing a useful reference for developers.


I. Application scenarios of ES

1. Tencent's internal application scenarios



Within Tencent, ES is mainly used for real-time log analysis, search services, and time-series data analysis.




1. Real-time log analysis scenario

The typical scenario is as follows:

Operation logs: for example slow logs and exception logs, used to locate service problems;

Business logs: for example user click and access logs, used to analyze user behavior;

Audit logs: used for security analysis.

ES perfectly solves the need for real-time log analysis and has the following features:


Complete Elastic ecosystem: developers can build a full real-time log analysis system simply by deploying a few mature components.

High timeliness: logs become searchable about 10 seconds after they are generated. Compared with traditional big data solutions that take tens of minutes to several hours, this improves timeliness by a factor of hundreds.

Flexible search and analysis: thanks to its support for data structures such as the inverted index and columnar storage, ES offers very flexible search and analysis capabilities.

Low search latency: ES supports interactive analysis, returning search responses within seconds even over trillions of log entries.

ES has thrived in recent years largely because it supports the real-time log analysis scenario so well.




2. Search service scenario

The typical scenario is as follows:

Product search: product search on major e-commerce platforms;

App search: app search in app stores;

Site search: search functions in forums, online documentation, and similar sites.



Tencent uses ES to support a large number of search services, which have the following features:

High performance: a single service can reach 100,000+ QPS, with an average response time of about 20 ms and P95 latency below 100 ms.

Strong relevance: the search experience depends mainly on how well the results match the user's intent, which can be evaluated with metrics such as precision and recall.

High availability: search scenarios generally require four nines of availability and support for machine-room-level disaster recovery.



3. Time-series data analysis scenario


Typical time-series data includes:


Metrics: traditional server monitoring;

APM: application performance monitoring;

Sensor data: data generated by IoT devices, smart hardware, industrial IoT, and so on.

Tencent has worked with this kind of scenario for a long time and has accumulated deep experience. These scenarios have the following characteristics:


High-concurrency writes: the largest single online cluster has 600+ nodes, with a write throughput of 10 million documents per second.

High query performance: querying a single curve or time line should return in about 10 ms.

Multi-dimensional analysis: flexible, multi-dimensional statistical analysis is required, for example aggregating by dimensions such as region or service module.


2. Industry application scenarios

In the wider industry, ES is mainly used on e-commerce platforms, for example to search products, tags, and stores.


E-commerce website M, with an annual GMV of 20 billion, is a typical example. Its use of ES is also very broad, covering product search and recommendation, log analysis and monitoring, statistical analysis, and more. Its business teams run dozens of ES clusters, and the number keeps growing.


II. Challenges encountered



1. Tencent's challenges



Against this backdrop of large scale, high load, and diverse scenarios within Tencent, ES has run into many challenges, which can roughly be divided into two categories: search and time series.


1. Search scenarios

High availability: take e-commerce search, app search, and site search as examples. These services place great weight on availability, with SLAs above four nines, and must handle disaster recovery scenarios such as single-node failures and single-machine-room network failures.

High performance: the performance bar is very high, for example 200,000 QPS, 20 ms average response time, and P95 latency under 100 ms.

In short, the core challenges in search scenarios are high availability and high performance.



2. Time-series scenarios

Time-series scenarios include logging, metrics, APM, and so on. Whereas search services focus on high availability and high performance, time-series services care more about cost and performance.

For example, time-series users often require high write throughput, in some cases up to 10 million documents per second, with data retained for 30 days and storage reaching the PB level. If a business runs on 100 machines but needs another 50 machines just for monitoring and logging, most users will find that unacceptable, because monitoring and logging bring relatively little direct revenue. The main challenges in time-series services are therefore storage cost, computing cost, and the like.


Facing these formidable challenges, Tencent has carried out in-depth optimization of the ES kernel (these optimizations are also available to external users, including e-commerce website M, through Tencent Cloud).


2. Challenges in the industry



The wider industry faces similar challenges. Take e-commerce website M as an example: e-commerce sites run frequent promotions, so they must pay close attention to the stability of their ES clusters and to disaster recovery and backup.


In terms of cluster stability, the problems they often encounter are that a query over a very large range can cause a JVM OOM (out of memory) on an ES node, and that when the number of indices or shards grows too large, cluster metadata changes become slow and CPU load rises, both of which affect cluster stability.




In terms of disaster recovery and backup, the common questions are how to ensure service continuity when a single machine room fails, and how to recover data after a failure occurs.


III. ES optimization practice



First, to address the demand for high availability, Tencent's developers broke the problem down and tackled it piece by piece, dividing high availability into three dimensions:


1. System robustness: the robustness of the ES kernel itself, a problem common to all distributed systems, for example:

Fault tolerance of the cluster under abnormal queries and pressure overload;
Scalability of the cluster under high load;
Data balancing across nodes and across multiple disks when the cluster is scaled out or a node fails.

2. Disaster recovery plan: for example, using a management and control system to quickly restore service after a machine-room network fault, to prevent data loss in the event of a natural disaster, and to roll back quickly after misoperations.

3. System defects: problems that inevitably arise as any system evolves, such as master-node congestion, distributed deadlocks, and slow rolling restarts.


Tencent's solutions to these problems are as follows:


1. System robustness:


Service stability: traffic throttling is implemented to tolerate exceptions such as machine and network faults and abnormal queries.

Cluster scalability: by optimizing the cluster's metadata management logic, cluster scalability has been improved tenfold, supporting clusters with thousands of nodes and millions of shards.

Cluster balancing: shard balancing across nodes and across multiple disks has been optimized to even out the load in large-scale clusters.


2. Disaster recovery plan:


Data recoverability: by extending ES's plugin mechanism to support file backup, ES data can be backed up to cheap storage, guaranteeing that it can be recovered.


Fault tolerance: Tencent Cloud ES supports cross-availability-zone disaster recovery; users can deploy across multiple availability zones as needed to tolerate the failure of a single machine room.


In addition, Tencent Cloud ES supports backup to COS (Tencent Cloud Object Storage). By calling ES's API, users can back up the underlying data files directly to COS, achieving low-cost, convenient data backup. A rough sketch of the workflow is shown below.
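For illustration only, here is a minimal sketch of object-storage backup driven through ES's standard snapshot API with the low-level Java REST client. The repository type "cos" and its settings (bucket, region, base_path) are assumptions about the COS repository plugin rather than values from the article, and the repository and index names are made up.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class CosSnapshotSketch {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();

        // Register a snapshot repository backed by object storage. The "cos" type and its
        // settings are illustrative and require the corresponding repository plugin.
        Request repo = new Request("PUT", "/_snapshot/cos_backup");
        repo.setJsonEntity("{\"type\": \"cos\", \"settings\": {"
                + "\"bucket\": \"es-backup\", \"region\": \"ap-guangzhou\", \"base_path\": \"snapshots\"}}");
        client.performRequest(repo);

        // Snapshot one index into the repository; the underlying segment files are copied out
        // to cheap storage and can later be restored with the _restore API.
        Request snapshot = new Request("PUT", "/_snapshot/cos_backup/logs-2020.03.01?wait_for_completion=true");
        snapshot.setJsonEntity("{\"indices\": \"logs-2020.03.01\"}");
        client.performRequest(snapshot);

        client.close();
    }
}
```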


Recovery from accidents: a recycle-bin mechanism allows a cluster to be quickly restored after it is lost due to payment arrears or misoperation.


3. System defects:


Tencent fixed native ES bugs around rolling restarts, master blocking, distributed deadlocks, and so on. The rolling restart optimization speeds up node restarts by more than 5x; the master congestion issue is one that Tencent is working on together with Elastic for ES 6.x.


For service throttling, Tencent applies rate limiting at four levels:

Permission level: ES supports X-Pack and custom permission schemes to prevent attacks and misoperations;

Queue level: by optimizing task execution speed, deduplication, priority, and other details, common problems such as master task-queue buildup and task starvation are resolved;

Memory level: starting from ES 6.x, memory-based throttling is applied along the whole path (HTTP entry points, coordinating nodes, and data nodes), using JVM memory and gradient statistics for precise control;

Multi-tenant level: CVM/cgroups are used to guarantee resource isolation between tenants.

Take throttling in aggregation scenarios as an example. When ES is used for aggregation analysis, memory is often exhausted because too many aggregation buckets are created. ES 6.8 provides the max_buckets parameter to cap the number of buckets per aggregation, but this has clear limitations: whether memory is exhausted also depends on the size of each bucket (in some scenarios 200,000 buckets are fine, while in others 100,000 buckets already exhaust memory), so users cannot know what value to set. Tencent instead uses a gradient approach: JVM memory is checked once for every 1,000 buckets allocated, and when memory is insufficient the request is aborted in time, keeping the ES cluster highly available. A minimal sketch of this idea follows.
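A minimal sketch of the gradient check, assuming a callback per allocated bucket; the class and method names are invented for illustration, and the real ES kernel change would hook into its circuit-breaker framework rather than throw a plain exception.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class BucketMemoryGuard {
    private static final int CHECK_INTERVAL = 1000;      // re-check heap every 1,000 buckets
    private static final double HEAP_LIMIT_RATIO = 0.85; // abort above 85% heap usage

    private final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    private long bucketCount = 0;

    /** Called each time the aggregation allocates one more bucket. */
    public void onBucketAllocated() {
        bucketCount++;
        if (bucketCount % CHECK_INTERVAL != 0) {
            return; // cheap path: no memory check on most buckets
        }
        MemoryUsage heap = memory.getHeapMemoryUsage();
        double used = (double) heap.getUsed() / heap.getMax();
        if (used > HEAP_LIMIT_RATIO) {
            // In ES this would surface as a circuit-breaking error that fails only the
            // offending search request, keeping the node itself healthy.
            throw new IllegalStateException(
                    "too many aggregation buckets: heap usage " + (int) (used * 100) + "%");
        }
    }
}
```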

These throttling schemes address the stability problems of ES services under abnormal queries, pressure overload, single-node failures, and network partitions. A small number of scenarios are still not covered, so Tencent is now exploring chaos testing to cover more failure scenarios.


To solve the problem that "too many indices or shards make cluster metadata changes slow and drive up CPU load, affecting cluster stability", Tencent optimized the shard allocation algorithm in the kernel and sped up cluster metadata changes by caching the node routing table and updating metadata incrementally. A simplified sketch of the incremental-update idea follows.
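The sketch below illustrates only the general "cache plus incremental update" idea: instead of rebuilding the whole node-to-shard routing table on every cluster state change, the previous table is kept and only the diff is applied. The types are simplified stand-ins, not ES's real ClusterState or RoutingTable classes.

```java
import java.util.HashMap;
import java.util.Map;

public class IncrementalRoutingTable {
    /** shardId -> nodeId currently holding the shard */
    private final Map<String, String> shardToNode = new HashMap<>();

    /** Full rebuild: what a naive implementation does on every cluster state update. */
    public void rebuild(Map<String, String> allAssignments) {
        shardToNode.clear();
        shardToNode.putAll(allAssignments);
    }

    /** Incremental update: apply only the shards that actually changed or were removed. */
    public void applyDiff(Map<String, String> changed, Iterable<String> removed) {
        shardToNode.putAll(changed);
        for (String shardId : removed) {
            shardToNode.remove(shardId);
        }
    }

    public String nodeOf(String shardId) {
        return shardToNode.get(shardId);
    }
}
```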
With high availability covered, let us move on to cost optimization practices.


The cost challenge shows up mainly as machine resource consumption in time-series scenarios such as logging and monitoring. An analysis of typical online logging and time-series workloads shows that the cost ratio of disk, memory, and compute resources is close to 8:4:1; in other words, disk and memory are the main bottlenecks, followed by compute cost.
Time-series data has an obvious access pattern: recent data is hot, old data is cold. Accesses to data from the last seven days account for more than 95% of the total; historical data is rarely accessed, and when it is, it is usually only for statistics.

Based on this bottleneck analysis and access pattern, the cost optimization approach is as follows:

(1) Use a hot-warm (cold/hot separation) architecture with mixed storage to balance cost and performance (a configuration sketch follows this list);

(2) Since historical data is usually accessed only for statistics, use pre-computation (Rollup) to trade storage for query performance;

(3) If historical data is not used at all, back it up to a cheaper storage system;

(4) Given the access pattern of time-series data, optimize memory cost with a cache;

(5) Other optimizations, such as storage pruning and index lifecycle management.
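As referenced in item (1), here is a minimal hot-warm sketch using ES's standard shard-allocation filtering settings and the low-level Java REST client. The index name and the box_type node attribute are illustrative conventions (set per node in elasticsearch.yml), not values from the article.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class HotWarmSketch {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();

        // Create today's index on "hot" nodes (SSD-backed). The node attribute name "box_type"
        // is a common convention (node.attr.box_type: hot), not an ES default.
        Request create = new Request("PUT", "/logs-2020.03.01");
        create.setJsonEntity("{\"settings\": {\"index.routing.allocation.require.box_type\": \"hot\"}}");
        client.performRequest(create);

        // A few days later, relocate the index to cheaper "warm" nodes (large HDDs) by updating
        // the same allocation-filtering setting; ES moves the shards automatically.
        Request move = new Request("PUT", "/logs-2020.03.01/_settings");
        move.setJsonEntity("{\"index.routing.allocation.require.box_type\": \"warm\"}");
        client.performRequest(move);

        client.close();
    }
}
```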

Let's look at Rollup first. Rollup is similar to cubes and materialized views in the big data world: the core idea is to pre-compute statistics and then release the original fine-grained data, reducing storage cost and improving query performance. A simple example: in a machine monitoring scenario the raw data granularity is 10 seconds, but monitoring data from a month ago is usually viewed at one-hour granularity, which is exactly a Rollup use case. Rollup officially appeared in ES 6.x, while Tencent actually started building it on 5.x. A sketch of the pre-aggregation idea is shown below.
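To make the idea concrete, this sketch pre-aggregates 10-second raw metrics into hourly buckets with a standard date_histogram aggregation. It only illustrates the Rollup concept and is not Tencent's streaming multi-way-merge implementation; the index and field names are made up, and on 7.x+ the interval parameter is written as fixed_interval or calendar_interval.

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RollupIdeaSketch {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();

        // Aggregate the raw 10-second samples into hourly buckets with an average per bucket.
        Request search = new Request("POST", "/metrics-raw/_search");
        search.setJsonEntity("{"
                + "\"size\": 0,"
                + "\"aggs\": {\"per_hour\": {"
                + "  \"date_histogram\": {\"field\": \"@timestamp\", \"interval\": \"1h\"},"
                + "  \"aggs\": {\"avg_cpu\": {\"avg\": {\"field\": \"cpu\"}}}"
                + "}}}");
        Response response = client.performRequest(search);
        String buckets = EntityUtils.toString(response.getEntity());

        // In a real rollup job each (hour, avg_cpu) bucket would be indexed into a much smaller
        // summary index (e.g. metrics-rollup-1h), after which the raw index can be dropped.
        System.out.println(buckets);
        client.close();
    }
}
```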

In the big data field, traditional solutions rely on an external offline computing system that periodically reads the full data set and recomputes the statistics, which incurs high computing overhead and maintenance cost.


Mesa, Google's ads metrics system, uses a continuous-generation scheme: at write time the system produces one input record for each Rollup and keeps the data sorted, and the underlying Compact/Merge process then merges multiple Rollups, giving relatively low computation and maintenance cost.


ES supports index sorting starting from 6.x. Tencent generates Rollups with a streaming, multi-way merge query over the sorted data. The resulting computation cost is less than 10% of the CPU cost of writing the full data, with memory usage under 10 MB. Tencent has contributed this kernel optimization to the open source community to address the computing and memory bottlenecks of the open source Rollup.

As mentioned above, when users deploy machines with large disks, memory becomes the bottleneck first and the disks cannot be fully used. The root cause is that indexes occupy a large amount of memory. Since historical data is rarely accessed in time-series scenarios, and some fields are never used at all in some scenarios, Tencent introduced a cache to improve memory utilization.

In memory optimization, what is the industry’s solution?


Since 7.x, the ES community supports moving the index off-heap and loading it on demand, like doc values. The drawback is that the index and the data differ greatly in importance: a single large query can easily evict the index, severely degrading subsequent query performance.
HBase caches index and data blocks in a block cache to speed up hot-data access, and from HBase 2.0 onward it uses off-heap techniques to keep off-heap access performance close to on-heap. Drawing on this community experience, Tencent introduced an LFU cache into ES to improve memory utilization, placed the cache off-heap to relieve heap memory pressure, and reduced the copy overhead between on-heap and off-heap memory with techniques such as weak references. As a result, memory utilization improved by 80%, query performance loss stayed under 2%, and GC overhead dropped by 30%. A toy LFU sketch follows.
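This is a toy on-heap LFU cache that illustrates only the eviction policy: frequently read entries stay cached, so a single large scan cannot flush out the whole hot set the way a pure LRU policy can. The real optimization keeps cached bytes off-heap and uses weak references, which this sketch does not attempt; eviction here is O(n) for brevity.

```java
import java.util.HashMap;
import java.util.Map;

public class LfuCache<K, V> {
    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Long> frequency = new HashMap<>();

    public LfuCache(int capacity) {
        this.capacity = capacity;
    }

    public V get(K key) {
        V value = values.get(key);
        if (value != null) {
            frequency.merge(key, 1L, Long::sum); // bump access count on every hit
        }
        return value;
    }

    public void put(K key, V value) {
        if (!values.containsKey(key) && values.size() >= capacity) {
            evictLeastFrequent(); // make room by dropping the least-used entry
        }
        values.put(key, value);
        frequency.merge(key, 1L, Long::sum);
    }

    private void evictLeastFrequent() {
        K victim = null;
        long best = Long.MAX_VALUE;
        for (Map.Entry<K, Long> e : frequency.entrySet()) {
            if (e.getValue() < best) {
                best = e.getValue();
                victim = e.getKey();
            }
        }
        values.remove(victim);
        frequency.remove(victim);
    }
}
```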


Turning to performance optimization: time-series scenarios such as logging and monitoring demand very high write performance, with write concurrency up to 10 million documents per second; yet when writing with user-specified primary keys, ES performance drops by more than half, and in some stress tests the CPU cannot be fully utilized. Search-type scenarios demand high query performance, such as 200,000 QPS with 20 ms average response time, and must avoid latency spikes caused by GC or poor execution plans.



To solve the above problems, Tencent also has countermeasures:

(1) Write optimization: for the primary-key deduplication scenario, index pruning is used to speed up the deduplication check, improving write performance by 45%. Tencent is also exploring vectorized execution and reducing branch mispredictions and instruction cache misses, with a further improvement of about 1x expected.

(2) CPU utilization optimization: for the stress-test scenarios where the CPU cannot be fully used, optimizing resource contention when ES refreshes the translog improves performance by 20%.

(3) Query optimization: the merge policy is optimized to improve query performance; query pruning based on the min/max values recorded per segment improves query performance by 30% (see the sketch below); and a CBO strategy for caching query operations avoids latency spikes that would otherwise cost 10x the time. Exploiting new hardware (such as Intel's AEP, Optane, and QAT) is another promising direction.
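A simplified sketch of the per-segment min/max pruning mentioned in item (3): each segment records the min and max value of the timestamp field, and segments whose range does not overlap the query range are skipped entirely. The Segment type is a stand-in for Lucene's per-segment metadata, not a real ES class.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentPruning {
    static class Segment {
        final String name;
        final long minTimestamp; // smallest timestamp in the segment, recorded at write/merge time
        final long maxTimestamp; // largest timestamp in the segment
        Segment(String name, long minTimestamp, long maxTimestamp) {
            this.name = name;
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
        }
    }

    /** Return only the segments that can possibly contain documents in [from, to]. */
    static List<Segment> prune(List<Segment> segments, long from, long to) {
        List<Segment> candidates = new ArrayList<>();
        for (Segment s : segments) {
            boolean overlaps = s.maxTimestamp >= from && s.minTimestamp <= to;
            if (overlaps) {
                candidates.add(s);
            } // otherwise the whole segment is skipped without reading any documents
        }
        return candidates;
    }
}
```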

ES's native merge policy mainly considers size similarity (preferring to merge segments of similar size) and an upper size limit (merging segments up to about 5 GB). As a result, a single segment may contain data from an entire month, or even from January through March; when a user queries a particular hour of March 1, a large amount of irrelevant data must be scanned, causing serious performance loss.

To solve this, Tencent introduced a time-aware merge in ES: when selecting segments to merge, their time ranges are also considered, so segments covering similar time ranges are merged together. When a user queries data for March 1, only a few small segments need to be scanned; the rest can be pruned quickly.
ES also recommends that search users run a force merge after writing, combining all segments into one to improve search performance. In time-series scenarios, however, this raises cost and defeats pruning, because every query would then have to scan all the data.
For this reason, Tencent introduced automatic merging of cold data: for inactive indices, small segments are automatically merged up to roughly 5 GB, reducing the number of files while keeping pruning effective in time-series scenarios. For search scenarios, the user can adjust the target segment size so that all segments end up merged into one. This merge-policy optimization roughly doubles performance in search scenarios. A simplified sketch of the time-aware candidate selection follows.
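A simplified sketch of time-aware merge candidate selection under the ~5 GB size target; it only illustrates the selection idea and is not the actual Lucene/ES MergePolicy implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class TimeAwareMergeSketch {
    static class Segment {
        final String name;
        final long minTimestamp; // earliest document timestamp in the segment
        final long sizeBytes;
        Segment(String name, long minTimestamp, long sizeBytes) {
            this.name = name;
            this.minTimestamp = minTimestamp;
            this.sizeBytes = sizeBytes;
        }
    }

    // Roughly the 5 GB merged-segment target mentioned in the article.
    static final long MAX_MERGED_BYTES = 5L * 1024 * 1024 * 1024;

    /** Pick one group of time-adjacent segments whose combined size stays under the target. */
    static List<Segment> pickMergeCandidates(List<Segment> segments) {
        List<Segment> byTime = new ArrayList<>(segments);
        // Order candidates by time rather than by size, so merged segments stay time-local
        // and remain easy to prune with per-segment min/max timestamps at query time.
        byTime.sort(Comparator.comparingLong((Segment s) -> s.minTimestamp));

        List<Segment> group = new ArrayList<>();
        long total = 0;
        for (Segment s : byTime) {
            if (total + s.sizeBytes > MAX_MERGED_BYTES) {
                break; // stop before the merged segment would exceed the size target
            }
            group.add(s);
            total += s.sizeBytes;
        }
        return group.size() >= 2 ? group : Collections.emptyList(); // merging one segment is pointless
    }
}
```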


After migrating to Tencent Cloud ES, e-commerce website M saw clear improvements in its basic ES capabilities and in development and operations efficiency:


Reliability: thanks to Tencent's ES kernel optimizations, the reliability of M's ES clusters has improved markedly; the clusters have carried multiple traffic peaks and delivered clear economic benefit.

Security and disaster recovery: multiple availability zones provide disaster recovery, and X-Pack permission management provides security.

Operational efficiency: the cloud service provides efficient deployment and stable elastic scaling, while X-Pack's SQL support makes day-to-day use more convenient. The gain in operations efficiency has freed up considerable manpower.



Notably, the migration itself went very smoothly and stably:




M's original ES cluster was migrated with zero service downtime;




After the migration, the integration with the operations system M had built for its self-hosted ES continued to work;


During the migration, some of M's special kernel requirements were not supported by the community version of ES, but Tencent Cloud ES's dedicated kernel team responded quickly and delivered the capability.


IV. Future plans and open source contributions



At present, many domestic ES deployments involve very large query ranges and very large numbers of indices or shards, so domestic community developers pay particular attention to these areas. In both areas, Tencent Cloud's ES optimizations have been officially adopted by Elastic thanks to their completeness and code quality.


Over the past six months, Tencent has submitted 10+ PRs to the open source community, covering modules such as writing, querying, and cluster management (some of these optimizations were done jointly with Elastic's own developers). Tencent has also set up an ES open-source collaboration group so that all interested Tencent developers can take part in the Elastic ecosystem.


Tencent and Elastic have jointly launched a kernel-enhanced ES cloud service on Tencent Cloud, delivering trillion-scale search capabilities to external users. There are still many directions worth researching as ES evolves, which is why Tencent Cloud keeps iterating on the product and investing in the kernel after launch.


Looking ahead, Tencent plans long-term exploration along the following lines.

Using the big data landscape as a frame of reference, the whole big data field can be divided into three parts according to data volume and latency requirements:

(1) Data Engineering, including batch and stream computing;

(2) Data Discovery, including interactive analysis, search, and so on;

(3) Data Apps, mainly supporting online services.

Although ES is positioned as a search engine, it already supports online search, document services, and similar scenarios; meanwhile, a number of mature OLAP systems are also built on technology stacks such as inverted indexes and hybrid row/column storage. It is therefore quite feasible for ES to grow into these two fields, and Tencent will focus its future ES exploration on online services and OLAP analysis.
