Author: johngqjiang, R&D engineer, Tencent TEG Cloud Architecture Platform Department
Original: https://mp.weixin.qq.com/s/CNf75yT0A0QPki-Qhw3_8w
Preface
Elasticsearch (ES) is a widely preferred open-source distributed search and analysis engine. It readily meets users' requirements for real-time log analysis, full-text search, and structured data analysis, and greatly reduces the cost of mining data in the era of big data. Tencent uses ES at large scale across many internal scenarios, and has partnered with Elastic to offer a kernel-enhanced ES cloud service on Tencent Cloud. These large, diverse workloads have pushed Tencent to keep optimizing native ES for high availability, high performance, and low cost. This article is a technical look inside Tencent's trillion-scale Elasticsearch.
I. Application scenarios of ES at Tencent
- The Elastic ecosystem offers a complete log analysis solution that any developer or O&M engineer can deploy easily using mature components.
- In the Elastic ecosystem, logs are typically searchable within 10 seconds of being generated. Compared with the tens of minutes or even hours that traditional big data solutions take, this is very timely.
- With support for inverted indexes, column storage, and other data structures, ES provides very flexible search and analysis capabilities.
- Interactive analysis is supported: ES returns search results within seconds, even over trillions of log entries.
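To make the log search case above concrete, here is a minimal sketch of a request body in the Elasticsearch Query DSL: a full-text match combined with a time-range filter, newest entries first. The field names `message` and `@timestamp` and the helper `build_log_search` are assumptions for illustration; they do not come from Tencent's setup.

```python
import json


def build_log_search(keyword, minutes=15, size=20):
    """Build a search body: full-text match on a log message field,
    restricted to the last `minutes` minutes, newest first.

    `message` and `@timestamp` are assumed field names.
    """
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Scored full-text match, served by the inverted index.
                "must": [{"match": {"message": keyword}}],
                # Non-scoring time filter using ES date math.
                "filter": [
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}}
                ],
            }
        },
    }


body = build_log_search("connection timeout")
print(json.dumps(body, indent=2))
```

The body would be sent to the `_search` endpoint of a log index. Keeping the time range in `filter` rather than `must` means it does not affect relevance scoring.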
- High performance: a single service handles up to 100,000+ QPS, with an average response time of about 20 ms and a P95 latency under 100 ms.
- Strong relevance: the search experience depends mainly on how closely the results match the user's intent, which is evaluated with metrics such as precision and recall.
- High availability: search scenarios typically require four nines of availability and disaster recovery (DR) when a single server fails. For any e-commerce service, such as Taobao, JD.com, or Pinduoduo, a one-hour outage is enough to make headlines.
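The relevance and availability targets above can be made concrete with a little arithmetic. The sketch below computes precision and recall for one query and the yearly downtime budget implied by four nines. The document IDs and function names are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Precision: share of returned results that are relevant.
    Recall: share of relevant documents that were returned."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def downtime_budget_minutes(availability, days=365):
    """Minutes of allowed downtime over `days` at a given availability."""
    return (1 - availability) * days * 24 * 60


# Hypothetical query: 4 results returned, 3 documents truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
budget = downtime_budget_minutes(0.9999)

print(f"precision={p:.2f} recall={r:.2f}")
print(f"four nines allows about {budget:.1f} minutes of downtime per year")
```

Four nines leaves roughly 52.6 minutes of downtime per year, so a single one-hour outage already exceeds the whole annual budget.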
- High-concurrency writes: the largest single online cluster has 600+ nodes and sustains a write throughput of about 10 million writes per second.
- High query performance: the query latency for a single curve or timeline is on the order of 10 ms.
- Multi-dimensional analysis: flexible, multi-dimensional statistical analysis is required. For example, when viewing monitoring data, we can break statistics down by region and by business module.
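As an illustration of the multi-dimensional analysis described above, the following sketch builds an Elasticsearch aggregation request that groups a metric by region, then by business module, then by time bucket. The field names `region`, `module`, `latency_ms`, and `@timestamp` are hypothetical, and `fixed_interval` assumes Elasticsearch 7.2 or later.

```python
def build_region_module_agg(metric_field="latency_ms", interval="1m"):
    """Nested aggregation: region -> module -> time bucket -> average.

    All field names are assumed; adapt them to the real index mapping.
    """
    return {
        "size": 0,  # return aggregation results only, no raw hits
        "aggs": {
            "by_region": {
                "terms": {"field": "region"},
                "aggs": {
                    "by_module": {
                        "terms": {"field": "module"},
                        "aggs": {
                            "over_time": {
                                "date_histogram": {
                                    "field": "@timestamp",
                                    "fixed_interval": interval,
                                },
                                "aggs": {
                                    "avg_metric": {
                                        "avg": {"field": metric_field}
                                    }
                                },
                            }
                        },
                    }
                },
            }
        },
    }


agg_body = build_region_module_agg()
print(sorted(agg_body["aggs"]))
```

Swapping the order of the two `terms` levels, or dropping one, gives a different view of the same data without reindexing, which is the flexibility that monitoring scenarios need.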
II. Challenges encountered
III. ES optimization practice
- System robustness: the robustness of the ES kernel itself, a problem common to all distributed systems. Examples include the cluster's fault tolerance under abnormal queries and overload, its scalability under high pressure, and data balancing across nodes and across multiple disks when the cluster is expanded or a node fails.
- Disaster recovery: a management and control system is used to restore service quickly when an equipment room network fails, to prevent data loss in a natural disaster, and to recover quickly after a misoperation.
- System bugs: these keep appearing in any system as it evolves, for example blocked master nodes, distributed deadlocks, and slow rolling restarts.
IV. Future planning and open source contribution