Author: johngqjiang, r&d engineer of Tencent TEG cloud architecture platform department

Original: https://mp.weixin.qq.com/s/CNf75yT0A0QPki-Qhw3_8w



preface

Elasticsearch (ES) is the preferred open-source distributed search and analysis engine. It can easily meet users’ requirements for real-time log analysis, full-text search, and structured data analysis, greatly reducing the cost of data mining in the era of big data. Tencent uses ES on a large scale in various internal scenarios, and has partnered with Elastic to provide core enhanced ES cloud services on Tencent Cloud. These large-scale and diverse scenarios have enabled Tencent to continuously optimize native ES for high availability, high performance, and low cost. Decryption of Elasticsearch by Tencent Trillionth Tier

I. Application scenario of ES in Tencent

The main content of this sharing includes: Firstly, rich application scenarios of ES in Tencent and typical characteristics of various scenarios are introduced; Then it presents the challenges we encountered in large-scale, high-pressure, and diverse usage scenarios. In view of these challenges, we focus on the high availability, low cost and high performance optimization practices of ES kernel carried out by Tencent. Finally, we will briefly share our thoughts on ES future planning and open source contribution.

Let’s take a look at the application scenario of ES in Tencent. At first, WE used ES for real-time log analysis. Typical logs are as follows: Operation logs, such as slow logs and abnormal logs, are used to locate service problems. Business logs, such as user click and access logs, can be used to analyze user behavior. Audit logs can be used for security analysis. ES perfectly solves the need of real-time log analysis, and it has the following features:

  • Elastic Ecology offers a complete log analysis solution that can be easily deployed by any developer, o&M user using mature components.
  • In the Elastic ecosystem, logs are typically generated to be accessible within 10s. Compared with traditional big data solutions of dozens of minutes and hours, timeliness is very high.
  • With support for inverted indexes, column storage, and other data structures, ES provides very flexible search analysis capabilities.
  • Interactive analysis is supported, with ES search response times of seconds, even in the case of trillion-level logs.
Log is the most basic and extensive form of data in the Internet industry. ES perfectly solves the real-time analysis scenario of log, which is also an important reason for the rapid development of ES in recent years.

The second type of application scenario is search service, typical scenarios include: commodity search, similar to jingdong, Taobao, Pinduoduo commodity search; APP search, support APP search in the APP store; Site search, support forum, online documents and other search functions. We support a number of search services, which have the following features:

  • High performance: a maximum of 10w+ QPS for a single service, 20ms to 20ms flat ring, and a P95 delay of less than 100ms.
  • Strong correlation: The search experience mainly depends on whether the search results highly match the user’s intention, which needs to be evaluated by the accuracy rate, recall rate and other indicators.
  • High availability: In search scenarios, four nines are required to support disaster recovery (Dr) when a single server fails. Any e-commerce service, such as Taobao, JD.com and Pinduoduo, can make headlines after an hour’s outage.

The third type of usage scenario is temporal data analysis. Typical temporal data include: Metrics, traditional server monitoring; APM, application performance monitoring; Iot data, sensor data generated by intelligent hardware, industrial iot, etc. This kind of scene Tencent began to explore very early in this respect has accumulated very rich experience. This type of scenario has the following characteristics:

  • High concurrent write: the maximum write throughput of a single online cluster is 600+ nodes and 1000w/s.
  • High query performance: the query delay for a curve or time line is 10ms to 10ms.
  • Multi-dimensional analysis: Flexible and multi-dimensional statistical analysis capabilities are required. For example, we can conduct statistical analysis flexibly according to regions and business modules when viewing monitoring.

Ii. Challenges encountered

Previously, we introduced the extensive application of ES in Tencent. Under such a large scale, high pressure and rich application scenarios, we encountered many challenges, which can be generally divided into two categories: search and timing.

First, let’s take a look at the challenges of the search business. Represented by e-commerce search, APP search and in-site search, these businesses attach great importance to availability, with service SLA reaching more than 4 9, and need to tolerate single-machine failure and network failure of single-machine room, etc. At the same time, high performance, low burr, such as 20W QPS, 20ms flat noise, P95 delay 100ms. In a word, in search business scenarios, the core challenges lie in high availability and high performance.

The other class of business challenges, which we call sequential business challenges, includes logging, Metrics, APM, and other scenarios. Compared with search services that focus on high availability and high performance, sequential services pay more attention to cost and performance. For example, users of sequential scenes usually require high write throughput, and some scenes can reach 1000W /sWPS. In this write throughput, data is retained for 30 days, usually up to petabytes of storage. However, the reality is that the benefits of logging and monitoring scenarios are relatively low. It is likely that the number of machines used by users for online services is only 100, while monitoring and logging needs 50, which is basically unacceptable for most users. Therefore, the main challenges in sequential services are storage cost, computing cost and so on.

Previously, we introduced the challenges of high availability, low cost and high performance in search and timing business scenarios. In view of these challenges, we will focus on sharing Tencent’s in-depth practice in ES kernel.

3. ES optimization practice

First, let’s look at high availability optimization, which we divide into three dimensions:

  • System robustness: refers to the robustness of the ES kernel itself, which is also a common problem faced by distributed systems. For example, the fault tolerance of the cluster under abnormal query and pressure overload; Scalability of clusters in high stress scenarios; Data balancing between nodes and multiple disks when a cluster is expanded or a node is faulty.
  • Disaster recovery solution: The management and control system is used to quickly recover equipment room networks when faults occur, prevent data loss when natural disasters occur, and quickly recover equipment room networks when misoperations occur.
  • System bugs: This is a constant feature of any system, such as clogged Master nodes, distributed deadlocks, slow rolling restarts, etc.

To address the above problems, here are our solutions to high availability:

In terms of system robustness, we tolerate service instability caused by machine network faults and abnormal queries through service flow limiting, which will be introduced later. By optimizing the management and control logic of cluster metadata, the capacity of cluster expansion is increased by an order of magnitude, supporting clusters of thousands of nodes and millions of fragments, and solving the scalability problem of cluster. In terms of cluster balancing, the fragmentation balancing between nodes and multiple disks is optimized to ensure the pressure balancing of large-scale clusters.

In the aspect of disaster recovery scheme, we support backup and file back by extending the plug-in mechanism of ES, and backup and file ES data back to cheap storage to ensure data recovery. Cross-availability zone DISASTER recovery (Dr). You can deploy multiple availability zones as required to tolerate single-server failures. The garbage can mechanism ensures that the cluster can be quickly recovered in case of overpayment or misoperation.

On the system Bug side, we fixed rolling restarts, Master blocking, distributed deadlocks and a number of other bugs. The rolling restart optimization can accelerate the node restart speed by 5+ times. For details, please refer to PR ES-46520. Master congestion problem, we made optimization in ES 6.x version together with official.

Here we expand on the service traffic limiting part. We have done four levels of traffic limiting work: permissions level, we support XPack and self-developed permissions to prevent attacks and misoperations; Queue level, by optimizing task execution speed, repetition, priority and other problems, users often encounter the accumulation of Master task queue, task starvation and other problems; At the memory level, we start from ES 6.x, which supports memory flow limiting on HTTP entry, coordination node, data node and other full links, and uses JVM memory, gradient statistics and other methods for precise control. At the multi-tenant level, we use the CVM/Cgroups scheme to ensure resource isolation between multi-tenants.

This section describes traffic limiting in aggregation scenarios in detail. When using ES for aggregation analysis, users often encounter memory explosion due to too many aggregation buckets. The max_buckets parameter is provided in ES 6.8 to control the maximum number of buckets for aggregation, but this method is very limited. In some scenarios, 200,000 buckets may work, but in other scenarios, 100,000 buckets may be used up, depending on the size of each bucket, and the user may not know exactly what is appropriate. In the process of aggregation analysis, we used gradient algorithm for optimization, and checked THE JVM memory for every 1000 buckets allocated. When the memory was insufficient, the request was interrupted in time to ensure the high availability of ES cluster. For details, please refer to PR ES-46751/47806.

Our current traffic limiting scheme can greatly improve the stability of ES service in the case of abnormal query, pressure overload, single node failure, network partition and other scenarios. However, there are still a few scenes that are not completely covered, so we are also introducing chaos test to cover more abnormal scenes.

Now that we’ve introduced high availability solutions, let’s look at cost optimization practices. Cost challenges are mainly reflected in the consumption of machine resources in timing sequence scenarios represented by logs and monitoring. Based on the analysis of typical online log and timing services, the overall cost ratio of hard disk, memory and computing resources is close to 8:4:1. Hard disk and memory are the main conflicts, followed by computing costs.

By analyzing the sequence scenario, we can find that the sequence data has obvious access characteristics. First, hot and cold characteristics. Time series data access is more near and less far, and the number of data accesses in the last 7 days can reach more than 95%. Historical data is rarely accessed and is usually accessed for statistical information.

Based on these bottleneck analysis and data access features, we introduce a cost optimization solution.

In terms of hard disk cost, due to the obvious cold and hot characteristics of data, we first adopt the cold and hot separation architecture, and use the mixed storage scheme to balance the cost and performance. Second, since historical data is typically accessed with statistics, the trade-off for storage and performance is through predictive arithmetic, as described below. If historical data is not used at all, it can be backed up to a cheaper storage system. Other optimizations include storage tailoring, lifecycle management, and so on.

In terms of memory cost, many users will find that only 20% of the storage resources are used and the memory is insufficient when using large storage models. In fact, based on the access characteristics of sequential data, we can use Cache to optimize, which will be described later.

Let’s expand on the Rollup section. Rollup officially started with ES 6.x, and actually Tencent has already started this part of Rollup in 5.X. Rollup is similar to Cube and materialized view in big data scenarios. The core idea of Rollup is to generate statistics in advance through predictive computation to release original granularity data, thus reducing storage cost and improving query performance, and generally generating data-level benefits. To take a simple example, in a machine monitoring scenario, the original granularity of monitoring data is 10 seconds, whereas monitoring data from a month ago is generally viewed in an hour granularity, which is a Rollup application scenario.

In the field of big data, traditional solutions rely on external offline computing systems to periodically read full data for calculation, which has high computing overhead and maintenance costs. Mesa, Google’s AD metrics system, uses a continuous generation scheme. When data is written, the system generates one input data for each Rollup and sorts the data. The bottom layer merges multiple rollups in Compact/Merge, which has relatively low computation and maintenance costs. ES supports data sorting since 6.x. We perform multiway merging through streaming query to generate Rollup. The final calculation cost is less than 10% of the CPU cost when full data is written, and the memory usage is less than 10MB. We have feedback kernel optimization to the open source community to solve the calculation and memory bottleneck of open source Rollup. For details, please refer to PR ES-48399.

Next, let’s expand on the memory optimization section. As mentioned earlier, when many users use large storage models, memory becomes the bottleneck first and hard disks cannot be fully utilized. The main bottleneck is that indexes occupy a large amount of memory. However, we know that historical data is rarely accessed in sequence class scenarios, and some fields are basically not used in some scenarios. Therefore, we can improve memory utilization efficiency by introducing Cache.

What is the industry’s approach to memory optimization? The ES community since 7.x supports indexes to be placed out of the heap and loaded on demand like docValues. However, the disadvantage of this approach is that the importance of indexes and data is completely different. A large query can easily lead to the obsolescence of indexes and the degradation of subsequent query performance multiples. Hbase uses the Cache Cache to Cache indexes and data blocks to improve hot data access performance. Starting from Hbase 2.0, the Off Heap technology is introduced to ensure that the access performance of the memory outside the Heap is similar to that inside the Heap. Based on community experience, LFU Cache is introduced into ES to improve memory utilization efficiency, Cache is placed outside the heap to reduce the heap memory pressure, and Weak Reference and reducing in-and-out-of-heap copy technologies are used to reduce loss. The end result is an 80% increase in memory utilization, full utilization of large storage models, query performance loss of no more than 2%, and a 30% reduction in GC overhead.

We’ve covered usability, cost optimization solutions, and finally we’ll cover performance optimization practices. Sequential scenarios represented by logging and monitoring have high requirements on write performance, with write concurrency up to 1000W /s. However, we found that when writing with primary key, ES performance attenuated by 1+ times, and CPU could not be fully utilized in some pressure measurement scenarios. The scenario represented by the search service has very high requirements for query, requiring 20W QPS and 20ms flat ring, and avoiding the query burr caused by GC and poor execution plan as far as possible.

In view of the above problems, we introduce Tencent’s performance optimization practices:

In terms of write, for the primary key deduplication scenario, index clipping is used to accelerate the primary key deduplication process and improve write performance by 45%. For details, please refer to PR Lucene-8980. For insufficient CPU utilization in some pressure test scenarios, optimize resource preemption during Translog refresh in ES to improve performance by 20%. For details, see PR ES-45765/47790. We are trying to optimize write performance through vectorization, and expect to double write performance by reducing branch hops and instruction misses.

In terms of queries, we optimize the Merge strategy to improve query performance, which will be described later. Based on the min/ Max index of each Segment record, prune the query and improve query performance by 30%. This section describes how to use the CBO policy to avoid burr that takes 10+ times of the Cache query operation. For details, see Lucene-9002. In addition, we are also trying to optimize performance with some new hardware, such as Intel’s AEP, Optane, QAT, etc.

Next, let’s expand into the Merge strategy optimization section. The original Merge strategy of ES focuses on size similarity and maximum limit. Size similarity means that Segments of similar size are selected to Merge, and maximum limit means that Segments are pieced together to 5GB. In this case, a Segment may contain data from the whole month of January or March 1. When users query data from a certain hour of March 1, they must scan a large amount of useless data, resulting in serious performance loss.

In ES, sequential Merge is introduced. When selecting Segments to Merge, time is taken into account, so that Segments with similar time are merged together. When we query the data on March 1, we only need to scan a few small Segments, and the rest can be quickly cropped out.

In addition, ES recommends that users perform a Force Merge after writing data to search Segments to improve search performance. However, this increases the user’s cost, and is not conducive to clipping in sequential scenarios, requiring scanning of all data. In ES, we introduced automatic Merge of cold data. For inactive indexes, low-level Segments are automatically merged to close to 5GB, reducing the number of files and facilitating clipping of sequential scenarios. For search scenarios, you can increase the size of the target Segment so that all Segments end up merged into one. Our Merge strategy optimization can double the performance of the search scenario.

The previous introduction of our optimization practices in THE ES kernel is completed. Finally, we will simply share our thoughts on open source contribution and future planning.

Iv. Future planning and open source contribution

In the past half year, we submitted 10+PR to the open source community, involving various modules such as writing, query and cluster management. Part of the optimization was completed with the official developers. In the previous introduction, corresponding PR links have been provided for your reference. We’ve also formed an open source collaborative group within the company to help build the Elastic ecosystem.

In general, the benefits of open source outweigh the disadvantages, and we are reporting those benefits to encourage more students to contribute to Open source at Elastic Ecology: First, open source can reduce the maintenance cost of branches. With more and more self-developed functions, the cost of maintaining independent branches becomes higher and higher, which is mainly reflected in the synchronization with the open source version and the rapid introduction of new open source features. Secondly, open source can help developers to control the kernel more deeply and understand the latest technology trends, because in the process of open source feedback, it will involve continuous interaction with official developers. In addition, open source can help build your technical influence in the community and gain recognition from the open source community. Finally, the rapid development of The Elastic ecosystem is conducive to the development of business services and personal technology. We hope that everyone can join us to help the sustainable and rapid development of the Elastic ecosystem.

In terms of future planning, this sharing focuses on Tencent’s optimization practices in ES kernel, including high availability, low cost and high performance. In addition, we also provide a set of management and control platform to support automatic management and control, operation and maintenance of online clusters and provide ES services for Tencent cloud customers. However, from the analysis of a large number of online operation experience, we found that there are still very rich and high-value directions to follow up, and we will continue to strengthen the construction of products and cores.

In terms of long-term exploration, we introduce it in combination with big data atlas. The whole field of big Data can be divided into three parts according to the characteristics of Data volume and delay requirements. The first part is Data Engineering, including batch computing and streaming computing that we are familiar with. The second part is Data Discovery, including interactive analysis, search, etc. The third section is Data Apps, which are used to support online services.

Although we put ES in the search domain, there are many users who use ES to support online search, document services, etc. In addition, we know that there are many mature OLAP systems, which are also based on technology stacks such as inverted index and column and column blending. Therefore, we believe that it is very feasible for ES to develop into these two fields in the future, and we will focus on the exploration of OLAP analysis and online services in the future.

The last

Welcome everyone to pay attention to my public account [programmer Chase wind], 2019 many companies Java interview questions sorted out more than 1000 400 pages of PDF documents, articles will be updated in it, sorted information will also be placed in it.

If you like the article, remember to pay attention to me. Thank you for your support!