Introduction

Elasticsearch provides a number of metrics to help you detect signs of failure and take action when you encounter problems such as unreliable nodes, out-of-memory errors, and long garbage collection times. Several key areas to monitor are:

  • Cluster health and node availability
  • The host’s network and system
  • Query performance indicators
  • Index performance indicators
  • Memory usage and GC metrics
  • Resource saturation and errors

Scenario view

📎 ElasticSearch scenario view.json

Built-in view


📎 ElasticSearch Built-in view.json

Prerequisites

  • DataKit installed (see the DataKit installation documentation)

Configuration

Monitoring Indicator Collection

Go to the conf.d/db directory in the DataKit installation directory, copy elasticsearch.conf.sample and name it elasticsearch.conf. The following is an example:

[[inputs.elasticsearch]]
  ## specify a list of one or more Elasticsearch servers
  # you can add username and password to your url to use basic authentication:
  # servers = ["http://user:pass@localhost:9200"]
  servers = ["http://localhost:9200"]

  ## Timeout for HTTP requests to the elastic search server(s)
  http_timeout = "5s"

  ## When local is true (the default), the node will read only its own stats.
  ## Set local to false when you want to read the node stats from all nodes
  ## of the cluster.
  local = true

  ## Set cluster_health to true when you want to also obtain cluster health stats
  cluster_health = true

  ## Adjust cluster_health_level when you want to also obtain detailed health stats
  ## The options are
  ##  - indices (default)
  ##  - cluster
  # cluster_health_level = "cluster"

  ## Set cluster_stats to true when you want to also obtain cluster stats.
  cluster_stats = true

  ## Only gather cluster_stats from the master node. To work this require local = true
  cluster_stats_only_from_master = true

  ## Indices to collect; can be one or more indices names or _all
  indices_include = ["_all"]

  ## One of "shards", "cluster", "indices"
  indices_level = "shards"

  ## node_stats is a list of sub-stats that you want to have gathered. Valid options
  ## are "indices", "os", "process", "jvm", "thread_pool", "fs", "transport", "http",
  ## "breaker". Per default, all stats are gathered.
  node_stats = ["jvm", "http", "indices", "os", "process", "thread_pool", "fs", "transport"]

  ## HTTP Basic Authentication username and password.
  # username = ""
  # password = ""

  ## Optional TLS Config
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

Restart DataKit for the configuration to take effect:

systemctl restart datakit

Log collection

Go to the conf.d/log directory in the DataKit installation directory, copy tailf.conf.sample and name it tailf.conf. The following is an example:

[[inputs.tailf]]
  # glob logfiles
  # required
  logfiles = ["/var/log/elasticsearch/solution.log"]

  # glob filter
  ignore = [""]

  # required
  source = "es_clusterlog"

  # grok pipeline script path
  pipeline = "elasticsearch_cluster_log.p"

  # read file from beginning
  # if from_begin was false, off auto discovery file
  from_beginning = true

  ## characters are replaced using the unicode replacement character
  ## When set to the empty string the data is not decoded to text.
  ## ex: character_encoding = "utf-8"
  ##     character_encoding = "utf-16le"
  ##     character_encoding = "utf-16be"
  ##     character_encoding = "gbk"
  ##     character_encoding = "gb18030"
  ##     character_encoding = ""
  # character_encoding = ""

  ## The pattern should be a regexp
  ## Note the use of '''XXX'''
  # match = '''^\d{4}-\d{2}-\d{2}'''

  # Add tags for the Elasticsearch cluster
  [inputs.tailf.tags]
    cluster_name = "solution"

Grok pipeline script for parsing Elasticsearch cluster logs:

📎 elasticsearch_cluster_log.p

Restart DataKit for the configuration to take effect:

systemctl restart datakit

Monitoring Indicators

1. Cluster health and node availability

Indicator description | Metric name | Type
Cluster status (green, yellow, red) | cluster_health.status | Other
Number of data nodes | cluster_health.number_of_data_nodes | Availability
Number of initializing shards | cluster_health.initializing_shards | Availability
Number of unassigned shards | cluster_health.unassigned_shards | Availability
Number of active shards | cluster_health.active_shards | Availability
Number of relocating shards | cluster_health.relocating_shards | Availability

Highlights of cluster health and node availability

  • Cluster status: If the cluster status is yellow, at least one replica shard is unallocated or missing. Search results are still complete, but if more shards disappear you may lose data. If the cluster status is red, at least one primary shard is missing: the index is missing data, searches will return only partial results, and new documents cannot be indexed into that shard. Consider setting an alarm that triggers if the status has been yellow for more than 5 minutes, or if the most recent check found the status red. A minimal polling sketch follows this list.
  • Initializing and unassigned shards: When an index is first created or a node is restarted, its shards are temporarily in the "initializing" state before transitioning to "started" or "unassigned", as the master node tries to assign shards to data nodes. If a shard stays in the initializing or unassigned state for too long, it can be a warning sign that the cluster is unstable.
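
The alert described above can be implemented by polling the cluster health API. The following is a minimal sketch in Python, not part of DataKit; the URL, the 30-second polling interval, and the alert() helper are placeholders to adapt to your own alerting setup.

import time
import requests

ES_URL = "http://localhost:9200"          # adjust to your cluster address
YELLOW_TOLERANCE_SECONDS = 5 * 60         # alert after 5 minutes of yellow

def alert(message):
    # Placeholder: forward to your alerting channel (mail, webhook, etc.)
    print("ALERT:", message)

yellow_since = None
while True:
    health = requests.get(f"{ES_URL}/_cluster/health", timeout=5).json()
    status = health["status"]

    if status == "red":
        alert("cluster status is RED: at least one primary shard is missing")
        yellow_since = None
    elif status == "yellow":
        yellow_since = yellow_since or time.time()
        if time.time() - yellow_since > YELLOW_TOLERANCE_SECONDS:
            alert("cluster has been YELLOW for more than 5 minutes")
    else:  # green
        yellow_since = None

    time.sleep(30)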

2. Host network and system

Indicator description | Metric name | Type
Available disk space | disk.used, disk.total | Resource utilization
Memory usage | os.mem_used_percent | Resource utilization
CPU utilization | os.cpu_percent | Resource utilization
Network bytes sent/received | transport.tx_size_in_bytes, transport.rx_size_in_bytes | Resource utilization
Open file descriptors | clusterstats_nodes.process_open_file_descriptors_avg | Resource utilization
Open HTTP connections | http.current_open | Resource utilization

Host network and system essentials

  • Disk space: This metric becomes even more important if the Elasticsearch cluster is write-heavy. Once disk space runs out, no inserts or updates can be performed and the node goes offline, which should not be allowed to happen in production. If the available space on a node falls below 20%, use a tool such as Curator to free up space by deleting some of the larger indexes. If the business logic does not allow deleting indexes, the alternative is to add more nodes and let the master redistribute shards onto the new nodes (although this increases the load on the master node). Also keep in mind that documents containing fields that need to be analyzed take up considerably more disk space than documents whose fields are stored as exact (non-analyzed) values.
  • CPU utilization on nodes: It is helpful to use graphs to show CPU usage for the different node types. For example, you can create three separate graphs for the data, master, and client nodes in the cluster and compare them to see whether one type of node is overloaded. Rising CPU utilization is usually caused by heavy search or indexing load. Set up a notification so you know when a node's CPU usage keeps growing, and add more nodes to redistribute the load if needed.
  • Network bytes sent/received: Communication between nodes is a key indicator of cluster balance. The network needs to be monitored to ensure it is healthy and able to keep up with the demands on the cluster (for example, shard replication or rebalancing between nodes). Elasticsearch provides transport metrics for communication between cluster nodes, and you can also watch the rate of bytes sent and received to see how much traffic the network is handling.
  • Open file descriptors: File descriptors are used for communication between nodes, client connections, and file operations. If open file descriptors reach the system limit, new connections and file operations will fail until old ones are closed. If more than 80% of the available file descriptors are in use, you may need to increase the system's maximum file descriptor count. Most Linux systems limit the number of file descriptors per process to 1024. When using Elasticsearch in production, you should raise the operating system file descriptor limit to a much larger value, such as 64,000.
  • HTTP connections: Requests from any client other than the Java client communicate with Elasticsearch through the HTTP-based RESTful API. If the total number of open HTTP connections keeps increasing, it may indicate that your HTTP client is not establishing persistent connections properly. Re-establishing connections adds extra overhead to request response times, so make sure your client is configured correctly, or use one of the official Elasticsearch clients, which handle HTTP connections correctly. A small sketch that reads the file descriptor and HTTP connection metrics from the node stats API follows this list.
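
The file descriptor and HTTP connection figures above come from the node stats API. The following is a minimal sketch (assuming a locally reachable cluster and the Python requests library) that prints per-node usage and warns when more than 80% of file descriptors are in use; the threshold and URL are illustrative.

import requests

ES_URL = "http://localhost:9200"

stats = requests.get(f"{ES_URL}/_nodes/stats/process,http", timeout=5).json()
for node in stats["nodes"].values():
    open_fds = node["process"]["open_file_descriptors"]
    max_fds = node["process"]["max_file_descriptors"]
    http_open = node["http"]["current_open"]

    fd_usage = open_fds / max_fds * 100
    print(f'{node["name"]}: {open_fds}/{max_fds} file descriptors '
          f'({fd_usage:.1f}%), {http_open} open HTTP connections')
    if fd_usage > 80:
        print(f'WARNING: {node["name"]} is using more than 80% of its file descriptors')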

3. Query performance indicators

If you are primarily using Elasticsearch for queries, you should be concerned about query latency and take action when the threshold is exceeded. Monitoring the metrics associated with Query and Fetch can help you track the execution of queries over time. For example, you might want to track peak query curves and long-term query request growth trends so that configurations can be optimized for better performance and reliability.

Indicator description | Metric name | Type
Total number of cluster query operations | indices.search_query_total | Throughput
Total time spent on cluster query operations | indices.search_query_time_in_millis | Performance
Number of queries currently in progress in the cluster | indices.search_query_current | Throughput
Total number of cluster fetch operations | indices.search_fetch_total | Throughput
Total time spent on cluster fetch operations | indices.search_fetch_time_in_millis | Performance
Number of fetches currently in progress in the cluster | indices.search_fetch_current | Throughput

Key points of search performance indicators:

  • Query load: Monitoring the current number of Query concurrency gives you an idea of how many requests are being processed by the cluster at any given moment. A focus on unusual peaks and valleys may reveal some potential risks. You may also want to monitor the usage of the query thread pool queue.
  • Query latency: Although Elasticsearch does not expose this metric directly, you can estimate the average query latency by sampling the total query count and total query time at regular intervals and dividing the change in query time by the change in query count (a sampling sketch follows this list). If the latency exceeds your threshold, look for potential resource bottlenecks or optimize the query statements.
  • Fetch latency: The fetch phase, the second phase of the search process, usually takes much less time than the query phase. If you notice this metric steadily increasing, it could indicate slow disks, heavy document enrichment (such as highlighting), or requests that ask for too many results.
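
The latency calculation described above can be sketched as follows: sample the cluster-wide search stats twice and divide the change in query (or fetch) time by the change in operation count. The endpoint is the standard _stats API; the 60-second interval and URL are illustrative assumptions.

import time
import requests

ES_URL = "http://localhost:9200"

def search_stats():
    # Cluster-wide search totals from the standard _stats API
    totals = requests.get(f"{ES_URL}/_stats/search", timeout=5).json()
    return totals["_all"]["total"]["search"]

before = search_stats()
time.sleep(60)                      # sampling interval (illustrative)
after = search_stats()

queries = after["query_total"] - before["query_total"]
query_ms = after["query_time_in_millis"] - before["query_time_in_millis"]
fetches = after["fetch_total"] - before["fetch_total"]
fetch_ms = after["fetch_time_in_millis"] - before["fetch_time_in_millis"]

if queries:
    print(f"average query latency: {query_ms / queries:.1f} ms over {queries} queries")
if fetches:
    print(f"average fetch latency: {fetch_ms / fetches:.1f} ms over {fetches} fetches")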

4. Index performance indicators

Indexing requests are similar to writes in a traditional database. If your Elasticsearch cluster is write-heavy, it becomes important to monitor and analyze how efficiently indexes are updated. Before discussing these metrics, let's review how Elasticsearch updates an index: when the index changes (for example, new data is added, or existing data is updated or deleted), each affected shard of the index is updated through two processes: refresh and flush.

Indicator description | Metric name | Type
Total number of documents indexed | indices.indexing_index_total | Throughput
Total time spent indexing documents | indices.indexing_index_time_in_millis | Performance
Average index fetch latency | indices.search_fetch_time_in_millis | Performance
Average index query latency | indices.search_query_time_in_millis | Performance
Number of documents currently being indexed | indices.indexing_index_current | Throughput
Total number of index refreshes | indices.refresh_total | Throughput
Total time spent refreshing indexes | indices.refresh_total_time_in_millis | Performance
Total number of index flushes to disk | indices.flush_total | Throughput
Total time spent flushing indexes to disk | indices.flush_total_time_in_millis | Performance
Number of documents in current merges | indices.merges_current_docs | Throughput
Time spent on index merges | indices.merges_total_stopped_time_in_millis | Performance
Number of pending tasks | indices.number_of_pending_tasks | Throughput

Key points of indexing performance metrics:

  • Indexing latency: Elasticsearch does not expose this metric directly, but you can calculate the average indexing latency from index_total and index_time_in_millis. If you see this metric climbing, it may be because too many documents are being indexed in a single bulk request. Elasticsearch recommends keeping a single bulk request around 5-15 MB, which can be gradually increased if resources allow.
  • Flush latency: Since Elasticsearch persists data to disk through flush operations, it is useful to keep an eye on this metric so you can take action when necessary. For example, if this indicator increases steadily, it may indicate that disk I/O capacity is insufficient; if this continues, data can no longer be indexed. In that case, lower index.translog.flush_threshold_size to reduce the translog size that triggers a flush. At the same time, if your cluster is a typical write-heavy system, you should continuously monitor disk I/O with a tool such as iostat and consider upgrading the disk type if necessary. A sketch illustrating both points follows this list.
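
The following sketch illustrates both points: it derives the average indexing latency from two samples of the indexing stats, and then lowers index.translog.flush_threshold_size for a hypothetical index named "solution" (the index name, the 256mb value, and the 60-second interval are placeholders, not recommendations).

import time
import requests

ES_URL = "http://localhost:9200"

def indexing_stats():
    # Cluster-wide indexing totals from the standard _stats API
    stats = requests.get(f"{ES_URL}/_stats/indexing", timeout=5).json()
    return stats["_all"]["total"]["indexing"]

before = indexing_stats()
time.sleep(60)                      # sampling interval (illustrative)
after = indexing_stats()

docs = after["index_total"] - before["index_total"]
millis = after["index_time_in_millis"] - before["index_time_in_millis"]
if docs:
    print(f"average indexing latency: {millis / docs:.2f} ms per document")

# Optionally lower the translog size that triggers a flush on the hypothetical
# index "solution" (256mb is an example value, not a recommendation).
requests.put(
    f"{ES_URL}/solution/_settings",
    json={"index": {"translog": {"flush_threshold_size": "256mb"}}},
    timeout=5,
)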

5. Memory usage and GC metrics

Memory is one of the key resources to keep an eye on when Elasticsearch is running. Elasticsearch and Lucene make use of all available RAM on nodes in two ways: JVM heap and file system cache. Elasticsearch runs on the Java Virtual Machine (JVM), which means that the duration and frequency of JVM garbage collection will be another important monitoring area.

Indicator description | Metric name | Type
Young generation garbage collection count | jvm.gc_collectors_young_collection_count |
Old generation garbage collection count | jvm.gc_collectors_old_collection_count |
Total young generation garbage collection time | jvm.gc_collectors_young_collection_time_in_millis |
Total old generation garbage collection time | jvm.gc_collectors_old_collection_time_in_millis |
Current JVM heap memory usage percentage | jvm.mem_heap_used_percent | Resource utilization
Committed JVM heap memory size | jvm.mem_heap_committed_in_bytes | Resource utilization

Highlights of memory usage and GC metrics:

  • JVM heap memory usage: By default, Elasticsearch starts garbage collection when JVM heap usage reaches 75%. It is therefore useful to monitor node heap usage and set an alarm threshold to find nodes that are consistently above 85% heap usage, which indicates that garbage collection is not keeping up with garbage creation (a node-stats sketch follows this list). You can address this by increasing the heap size or by scaling out the cluster with more nodes.
  • JVM heap memory usage vs. committed JVM heap memory: It is more useful to monitor how much heap memory the JVM is actually using (used) than how much heap memory is committed. The curve of used heap memory is usually jagged: it rises as garbage accumulates and falls when garbage is collected. If the trend line tilts upward over time, garbage collection is not keeping pace with object creation, which leads to longer garbage collection pauses and ultimately to OutOfMemoryError.
  • Garbage collection duration and frequency: The JVM pauses all tasks in order to collect unwanted object information. This state is commonly referred to as “Stop the world” and can be experienced by both young and old garbage collectors. The Master node checks the status of other nodes every 30 seconds. If the garbage collection time of a node exceeds this time, the Master may consider the node as disconnected.
  • Memory usage: Elasticsearch makes good use of any RAM not already allocated to the JVM heap. Like Kafka, Elasticsearch is designed to rely on the operating system’s file system cache to handle requests quickly and reliably. If a segment was recently written to disk by Elasticsearch, it is already in the cache. However, if a node has been shut down and restarted, data must be read from disk the first time a segment is queried. So this is one of the most important reasons to ensure that the cluster remains stable and the nodes do not crash. In general, it is important to monitor memory usage on nodes and provide as much RAM as possible for Elasticsearch to maximize filesystem cache utilization without overrunning.
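
A minimal sketch of how these heap and GC figures can be read from the node stats API and checked against the 85% threshold mentioned above; the threshold and URL are illustrative assumptions.

import requests

ES_URL = "http://localhost:9200"

stats = requests.get(f"{ES_URL}/_nodes/stats/jvm", timeout=5).json()
for node in stats["nodes"].values():
    jvm = node["jvm"]
    heap_pct = jvm["mem"]["heap_used_percent"]
    young = jvm["gc"]["collectors"]["young"]
    old = jvm["gc"]["collectors"]["old"]

    print(f'{node["name"]}: heap {heap_pct}% used, '
          f'young GC {young["collection_count"]} runs / {young["collection_time_in_millis"]} ms, '
          f'old GC {old["collection_count"]} runs / {old["collection_time_in_millis"]} ms')
    if heap_pct > 85:
        print(f'WARNING: {node["name"]} heap usage is above 85%; GC may not be keeping up')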

6. Resource saturation and errors

Elasticsearch nodes use thread pools to manage how threads consume memory and CPU. Since thread pool settings are automatically configured based on the number of processor cores, it is usually pointless to tune them. However, it is a good idea to look at the request queues and rejection counts to confirm that your nodes can keep up; if not, you may need to add more nodes to handle all of the concurrent requests.

  • Thread pool queues and rejections: Each Elasticsearch node maintains many types of thread pools; which ones you should monitor depends on how you use Elasticsearch. In general, the most important thread pools are search, index, merge, and bulk. The size of each thread pool's queue represents how many requests are waiting to be served on that node. The purpose of the queue is to allow the node to track and eventually process these requests rather than simply discard them. However, thread pool queues are not unlimited (larger queues take up more memory), and once a thread pool reaches its maximum queue size (the default differs by thread pool type), further requests are rejected.
Indicator description | Metric name | Type
Number of queued threads in the thread pool | thread_pool.rollup_indexing_queue, thread_pool.search_queue, thread_pool.transform_indexing_queue, thread_pool.force_merge_queue | Saturation
Number of rejected threads in the thread pool | thread_pool.rollup_indexing_rejected, thread_pool.transform_indexing_rejected, thread_pool.search_rejected, thread_pool.force_merge_rejected | Error

Resource saturation and error points:

  • Thread pool queues: Larger thread pool queues are not necessarily better: a larger queue consumes more memory and increases the number of requests that can be lost if a node goes down. If you see the number of queued and rejected threads increasing, try slowing down the request rate, adding processors to the nodes, or adding nodes to the cluster (a _cat/thread_pool sketch follows this list).
  • Bulk request queuing and rejection: Bulk requests are a more efficient way to send many operations at once. In general, if you have many operations to perform (creating indexes, or adding, updating, or deleting documents), send them as one bulk request rather than many individual requests. Rejected bulk requests are usually caused by trying to index too many documents in a single bulk request. While the Elasticsearch documentation states that rejected bulk requests are not necessarily something to worry about, you should implement a back-off strategy to handle this situation gracefully.
  • Cache utilization metrics: Each query request is sent to every shard of the index and hits every segment of each shard. Elasticsearch caches queries on a per-segment basis to speed up response time. On the other hand, if your caches take up too much heap, they may slow things down rather than speed them up!
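
One simple way to keep an eye on thread pool queues and rejections outside of DataKit is the _cat/thread_pool API. The sketch below watches the search and write (bulk) pools; pool names differ between Elasticsearch versions, so the filter set is an assumption to adapt.

import requests

ES_URL = "http://localhost:9200"
WATCHED_POOLS = {"search", "write", "bulk"}   # "bulk" was renamed to "write" in newer versions

rows = requests.get(
    f"{ES_URL}/_cat/thread_pool",
    params={"format": "json", "h": "node_name,name,active,queue,rejected"},
    timeout=5,
).json()

for row in rows:
    if row["name"] not in WATCHED_POOLS:
        continue
    print(f'{row["node_name"]} {row["name"]}: active={row["active"]} '
          f'queue={row["queue"]} rejected={row["rejected"]}')
    if int(row["rejected"]) > 0:
        print(f'WARNING: {row["node_name"]} is rejecting {row["name"]} requests')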