This talk was given by Zhao Hanqing, a senior engineer at Alibaba. It covers:

  • A log analysis system based on ELK + Kafka
  • Elasticsearch optimization experience
  • Elasticsearch operations and maintenance (O&M) practice

Introduction to Elasticsearch

Elasticsearch is a distributed, near-real-time analytics and search engine. Its advantages include:

  • Near-real-time queries
  • Low memory consumption and fast searches
  • Strong scalability
  • High availability

Data structures

  • FST(Finite State Transducer)

An FST is well suited to text queries. By sharing the common prefixes and suffixes of the terms in the dictionary, it typically achieves a compression ratio of 3x to 20x, with O(len(str)) query time complexity. For range and prefix searches it has a clear advantage over a traditional HashMap.
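
To make the prefix-search advantage concrete, here is a minimal Python trie sketch. This is not Lucene's actual FST (a real FST also shares suffixes and maps keys to outputs); it only illustrates O(len(key)) lookup and cheap prefix enumeration, which a HashMap cannot offer without scanning every key.

```python
# Minimal trie: illustrates the prefix-sharing idea behind Lucene's FST.
# NOT the real FST; a real FST also shares suffixes and carries outputs.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_term = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def contains(self, word):          # O(len(word)) lookup
        node = self._walk(word)
        return node is not None and node.is_term

    def with_prefix(self, prefix):     # enumerate keys sharing a prefix
        node = self._walk(prefix)
        if node is None:
            return []
        out, stack = [], [(node, prefix)]
        while stack:
            n, s = stack.pop()
            if n.is_term:
                out.append(s)
            for ch, child in n.children.items():
                stack.append((child, s + ch))
        return sorted(out)

    def _walk(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

t = Trie()
for w in ["search", "searching", "seat", "shard"]:
    t.insert(w)
print(t.contains("seat"))       # True
print(t.with_prefix("sea"))     # ['search', 'searching', 'seat']
```

Note how all four words share the stored prefix "s", and three share "sea"; that sharing is where the dictionary compression comes from.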

  • BKD Tree

A BKD tree is suited to numeric, geographic (geo), and other multidimensional data types. When K=1 it degenerates to a binary search tree, with O(log N) query complexity.

When K=2, each split determines a partitioning dimension and selects the median point along that dimension.
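
The split-at-the-median recursion can be sketched with a plain in-memory k-d tree. Lucene's BKD tree is a block-based, disk-friendly variant of this idea; the sketch below only shows the core recursion and a range query over a 2-D bounding box.

```python
# In-memory k-d tree sketch (K=2). Lucene's BKD tree is a block-based,
# disk-oriented variant; this only shows the median-split recursion.

def build_kdtree(points, depth=0):
    if not points:
        return None
    k = len(points[0])
    axis = depth % k                       # alternate the split dimension
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point on this axis
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def range_search(node, lo, hi, out=None):
    """Collect points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if out is None:
        out = []
    if node is None:
        return out
    p, axis = node["point"], node["axis"]
    if all(lo[d] <= p[d] <= hi[d] for d in range(len(p))):
        out.append(p)
    if lo[axis] <= p[axis]:                # left subtree may still qualify
        range_search(node["left"], lo, hi, out)
    if p[axis] <= hi[axis]:                # right subtree may still qualify
        range_search(node["right"], lo, hi, out)
    return out

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(sorted(range_search(tree, (3, 1), (8, 5))))   # [(5, 4), (7, 2), (8, 1)]
```

The range query prunes whole subtrees whose split value falls outside the box, which is why this beats a linear scan for multidimensional range filters.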

Scalability

Horizontal scaling of the cluster is achieved through the index sharding mechanism.
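
The sharding mechanism routes each document to a primary shard via hash(_routing) % number_of_primary_shards, where _routing defaults to the document id. Elasticsearch actually uses murmur3; the sketch below uses crc32 as a deterministic stand-in to show why the primary shard count is fixed at index creation.

```python
# Simplified document routing: shard = hash(_routing) % number_of_primary_shards.
# Elasticsearch uses murmur3; crc32 is a deterministic stand-in here.
import zlib

def route(doc_id, num_primary_shards, routing=None):
    key = routing if routing is not None else doc_id   # _routing defaults to _id
    return zlib.crc32(key.encode("utf-8")) % num_primary_shards

# The same id always lands on the same shard:
print(route("log-0001", 5) == route("log-0001", 5))    # True

# Changing the shard count would re-route most documents, which is why
# you plan shard counts up front (or reindex to scale further).
moved = sum(route(f"doc-{i}", 5) != route(f"doc-{i}", 6) for i in range(1000))
print(moved, "of 1000 documents would move")
```

This is also why horizontal scaling works: new nodes take over existing shards, while the routing function itself never changes.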

High availability

High availability is provided by shard replicas, deployment across availability zones, and data snapshots, which protect against node failures in the cluster and data corruption.

Elastic Stack components

  • Kibana: data visualization, interacting with Elasticsearch
  • Elasticsearch: storage, indexing, and search
  • Logstash: data collection, filtering, and transformation
  • Beats: lighter and more specialized than Logstash: Filebeat, Metricbeat, Packetbeat, Winlogbeat…

Log analysis system based on ELK and Kafka

Logstash advantages

  • date: parses a time field and uses it as the document's timestamp
  • metrics: aggregates events into metrics stored as Elasticsearch documents
  • codec multiline: combines multiple lines of data into a single record
  • fingerprint: prevents the insertion of duplicate data
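
These plugins can be combined in one pipeline. The sketch below is illustrative only; the field names, patterns, and hosts are assumptions, not from the talk.

```conf
# Illustrative Logstash pipeline; field names and patterns are examples only.
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["app-logs"]
    codec => multiline {              # stitch multi-line entries (e.g. stack
      pattern => "^\s"                # traces) into a single event
      what => "previous"
    }
  }
}
filter {
  date {                              # parse the log time into @timestamp
    match => ["log_time", "ISO8601"]
  }
  fingerprint {                      # stable id to avoid duplicate inserts
    source => ["message"]
    target => "[@metadata][fp]"
    method => "SHA1"
  }
}
output {
  elasticsearch {
    hosts => ["http://es:9200"]
    document_id => "%{[@metadata][fp]}"   # same content => same _id
  }
}
```

Using the fingerprint as the document id makes re-delivered Kafka messages overwrite themselves instead of creating duplicates.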

Disadvantages of Logstash: log collection is inefficient and resource-intensive. Filebeat compensates for this weakness, but has fewer plug-ins of its own.

Using Kafka for log transport

  • Kafka buffers data.
  • Kafka data can be consumed repeatedly.
  • Kafka itself is highly available and protects against data loss.
  • Kafka offers high throughput.
  • Kafka is widely used.

Practical lessons: create different topics for different services. Set the number of partitions per topic based on the service's log volume. Configure consumer_threads based on the number of Kafka partitions and the number of Logstash instances consuming the topic. Make the mapping between each Logstash instance and the topic(s) it consumes explicit, and minimize the use of wildcards when configuring consumed topics.
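
As a sketch of these lessons, a per-service pipeline might look like the following. The topic name, group id, and the partition count of 8 are assumptions for illustration.

```conf
# Illustrative: one Logstash pipeline per service topic, with consumer_threads
# matched to the topic's partition count (assumed to be 8 here).
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["order-service-logs"]   # explicit topic, no wildcard
    group_id => "logstash-order"
    consumer_threads => 8              # == number of partitions of the topic
  }
}
```

If consumer_threads (summed across all Logstash instances in the group) exceeds the partition count, the extra threads sit idle; if it is lower, some threads consume several partitions each.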

Basic issues in cluster planning:

  1. Total data size: how much data flows in each day, and how many days of data are retained.

Daily data increment = daily log volume × number of copies; if the _all field is enabled, double it again. For example, with 1 TB of new logs per day, one replica per shard, and _all enabled, the actual daily increment of the Elasticsearch cluster is about 4 TB. Storing 4 TB per day for 30 days requires at least 120 TB; a 20% buffer is usually added, so at least 144 TB of storage should be provisioned. Nodes can be divided into hot and warm tiers according to the characteristics of the log scenario: hot nodes usually use SSDs, while warm nodes use ordinary mechanical disks.
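
The arithmetic above can be captured in a small helper, reproducing the talk's 144 TB figure:

```python
# Storage estimate from the talk: 1 TB/day of raw logs, one replica,
# _all enabled (roughly doubles the stored size), 30 days retention, 20% buffer.

def required_storage_tb(raw_tb_per_day, replicas=1, all_field=True,
                        retention_days=30, buffer=0.20):
    daily = raw_tb_per_day * (1 + replicas)    # primary + replica copies
    if all_field:
        daily *= 2                             # _all roughly doubles the size
    return daily * retention_days * (1 + buffer)

print(required_storage_tb(1))                  # 144.0 TB, as in the talk
print(required_storage_tb(1, all_field=False)) # 72.0 TB with _all disabled
```

The second call also shows why disabling _all (when you do not need it) halves the storage bill.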

  2. Single-node configuration: how many indexes and shards per node, and how large each shard is allowed to grow. The total cluster size follows from the total data volume and the single-node configuration. For a single node, the rules of thumb are: a CPU-to-memory ratio of 1:4; a memory-to-disk ratio of 1:24; an Elasticsearch heap (Xmx) of no more than 32 GB; no more than 20 to 25 shards per GB of heap; and a maximum shard size of 50 GB.
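
A rough sizing sketch applying these rules of thumb (the "heap = half of RAM, capped at 32 GB" line is common Elasticsearch practice, added here as an assumption; the 64 GB node is illustrative):

```python
# Node-sizing sketch using the talk's rules of thumb:
#   CPU : memory = 1 : 4,  memory : disk = 1 : 24,
#   heap (Xmx) <= 32 GB,  <= 20-25 shards per GB of heap,  shard <= 50 GB.
import math

def plan_node(memory_gb):
    heap_gb = min(memory_gb // 2, 32)   # common practice: half RAM, capped at 32 GB
    return {
        "cpu_cores": memory_gb // 4,    # 1:4 CPU-to-memory
        "disk_gb": memory_gb * 24,      # 1:24 memory-to-disk
        "heap_gb": heap_gb,
        "max_shards": heap_gb * 20,     # conservative end of 20-25 per GB of heap
    }

def min_nodes(total_storage_tb, node_disk_gb):
    return math.ceil(total_storage_tb * 1024 / node_disk_gb)

node = plan_node(64)
print(node)                             # budget for one 64 GB node
print(min_nodes(144, node["disk_gb"]))  # nodes needed for the 144 TB estimate
```

This is only a first-cut estimate; real clusters also account for hot/warm tiering, replica placement, and indexing headroom.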

Practical case analysis

When a failover occurs in production, the log volume hitting the backup cluster surges, so the amount of data in Kafka surges, and the amount of data Logstash consumes from Kafka and writes into Elasticsearch per unit of time rises accordingly. If the primary shards being written are concentrated on a single node, that node's Elasticsearch service becomes overloaded by the volume of data to index and the many concurrent Logstash bulk requests.

If the overloaded node then fails to respond in time to requests from the master (such as cluster-health heartbeats), the master blocks waiting for its response; other nodes may conclude that the master has been lost, triggering a series of emergency reactions, such as re-electing the master.

If Logstash's requests cannot be answered in time, its connection to Elasticsearch times out; Logstash then considers Elasticsearch dead and stops consuming from Kafka. When Kafka notices that a consumer has disappeared from a consumer group, it triggers a rebalance of the entire group, which disturbs the other Logstash consumers and hurts the cluster's throughput.

This is typical herd behavior: the lead sheep's influence has to be removed first. Use GET /_cat/thread_pool/bulk?v&h=name,host,active,queue,rejected,completed to locate the busy node: its queue will be long and its rejected count growing. Then use GET /_cat/shards to find the active shards on that node. Finally, use the POST /_cluster/reroute API to move shards to lightly loaded nodes and relieve the pressure on the busy node.
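
The reroute request body for the final step looks like the following. The index and node names here are hypothetical; substitute the ones found via the _cat APIs.

```python
# Build the body for POST /_cluster/reroute that moves one shard off a hot
# node. Index and node names below are hypothetical examples.
import json

def move_shard_command(index, shard, from_node, to_node):
    return {
        "commands": [
            {"move": {
                "index": index,        # index owning the shard
                "shard": shard,        # shard number from GET /_cat/shards
                "from_node": from_node,
                "to_node": to_node,
            }}
        ]
    }

body = move_shard_command("app-logs-2019.06.01", 0, "hot-node-1", "idle-node-2")
print(json.dumps(body, indent=2))      # send as the POST /_cluster/reroute body
```

Moving shards one at a time and re-checking the bulk thread pool avoids simply shifting the hotspot to another node.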

Elasticsearch cluster O&M practice

Our main concerns are:

  1. Cluster health status
  2. Cluster indexing and search performance
  3. CPU, memory, and disk usage of nodes

A green cluster is normal. A yellow cluster has unassigned replica shards. A red cluster has unassigned primary shards.

Main cause: a node's disk usage exceeded the watermark (85% by default). Use GET /_cat/allocation to check each node's disk usage. Use GET /_cluster/settings to check whether cluster.routing.allocation.enable has been disabled. Use GET /_cluster/allocation/explain?pretty to see the specific reason a shard was not assigned to a node.
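
The watermark check itself is simple enough to express directly. The sketch below flags nodes above the default low watermark from disk figures such as those reported by GET /_cat/allocation; the sample numbers are made up.

```python
# Flag nodes above the default 85% low disk watermark
# (cluster.routing.allocation.disk.watermark.low). Sample data is invented.

DEFAULT_LOW_WATERMARK = 0.85   # Elasticsearch default: 85% disk used

def nodes_over_watermark(allocation, watermark=DEFAULT_LOW_WATERMARK):
    """allocation: {node_name: (disk_used_gb, disk_total_gb)}"""
    return sorted(
        node for node, (used, total) in allocation.items()
        if used / total > watermark
    )

sample = {
    "node-1": (900, 1000),    # 90% used: no new replicas allocated here
    "node-2": (400, 1000),    # 40% used: fine
    "node-3": (860, 1000),    # 86% used: just over the watermark
}
print(nodes_over_watermark(sample))   # ['node-1', 'node-3']
```

Above the low watermark Elasticsearch stops allocating new shards to the node; at the high watermark (90% by default) it starts relocating shards away.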

Recommended monitoring tool: cerebro (https://github.com/lmenezes/cerebro)

Elasticsearch optimization experience

Index optimization

  1. Create indexes ahead of time.
  2. Avoid sparse indexes: the document structure within an index should be kept consistent. If it cannot be, split the data into separate indexes, using indexes with a small number of shards to store documents with differing field formats.
  3. When loading a large amount of data, set refresh_interval = -1 and index.number_of_replicas = 0, then restore the settings afterwards.
  4. Under low load and I/O pressure, bulk requests are more efficient than individual PUT or DELETE operations.
  5. Adjust the index buffer (indices.memory.index_buffer_size).
  6. Disable norms for fields that do not need scoring; disable doc_values for fields that do not need sorting or aggregation.
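
The bulk-load settings from point 3 can be applied per index like this (the index name is illustrative); remember to restore the defaults once the load completes:

```
PUT /app-logs/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}

After the bulk load, restore refreshing and replication:

PUT /app-logs/_settings
{
  "index": {
    "refresh_interval": "1s",
    "number_of_replicas": 1
  }
}
```

Disabling refresh avoids building small segments during the load; adding the replica afterwards copies finished segments instead of indexing every document twice.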

Query optimization

Use routing to speed up queries along a given dimension. Avoid returning overly large result sets; apply limits. If heap pressure permits, the node query cache (indices.queries.cache.size) can be increased appropriately. Adding shard replicas can improve query concurrency, but watch the total number of shards per node. Merge segments regularly.
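
The routing and result-limiting advice can be sketched as follows (index name, routing key, and field are illustrative). Supplying the same routing value at index and search time means the search hits a single shard instead of fanning out to all of them:

```
PUT /app-logs/_doc/1?routing=user-42
{ "user": "user-42", "msg": "login ok" }

GET /app-logs/_search?routing=user-42
{
  "size": 100,
  "query": { "term": { "user": "user-42" } }
}
```

The "size": 100 cap bounds the result set, which protects both the coordinating node's memory and the client.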

Alibaba Cloud Elasticsearch service

The Elasticsearch service provided by Alibaba Cloud includes features such as monitoring, alerting, log visualization, and one-click scale-out.

