Original address: haifeiWu's blog, www.hchstudio.cn. Reprints are welcome; please credit the author and source. Thank you!

The data analysis platform at the time looked like this: a single machine running a single-node Elasticsearch (ES) cluster (no other nodes), with 1000+ shards and roughly 200 GB of data.

Troubleshooting the problem

Check server memory and CPU status

Use top to check the server's CPU and memory usage, as shown below. (At the time, the ES process on the main server was using over 90% CPU, so something was clearly wrong.)

top
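If the full top view is too noisy, a one-liner along these lines narrows it down to the ES process and its CPU and memory share (a sketch: it assumes ES shows up as a process with "elasticsearch" somewhere in its command line):

# show the Elasticsearch process sorted by CPU usage
# (the [e] trick keeps grep itself out of the result)
ps -eo pid,user,%cpu,%mem,cmd --sort=-%cpu | grep -i "[e]lasticsearch"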

free showed that only about 150 MB of the server's 8 GB of memory was still available, which pointed squarely at ES.

free
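To see how much of that memory is held by ES itself, the node stats API reports JVM heap usage; a minimal check, assuming the default localhost:9200 endpoint, could be:

# JVM heap usage of the local ES node
curl -s "http://localhost:9200/_nodes/stats/jvm?pretty" | grep -E "heap_used_percent|heap_used_in_bytes|heap_max_in_bytes"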

ES Cluster Status

Checking the health of the ES cluster shows that the status is red, which means some primary shards are unassigned. In my case, historical data could still be queried, but new index data could no longer be written.

curl http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 663,
  "active_shards" : 663,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 6,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.10313901345292
}

Checking the state of each index showed that most of them were red and therefore unavailable. With so many indexes held open, ES consumed a large amount of CPU and memory, Logstash could no longer write to it, new index data could not be created, and data was being lost.

curl -XGET "http://localhost:9200/_cat/indices?v"
health status index         pri rep
red    open   jr-2016.12.20   3   0
red    open   jr-2016.12.21   3   0
red    open   jr-2016.12.22   3   0
red    open   jr-2016.12.23   3   0
red    open   jr-2016.12.24   3   0
red    open   jr-2016.12.25   3   0
red    open   jr-2016.12.26   3   0
red    open   jr-2016.12.27   3   0
(remaining columns omitted)
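With hundreds of indexes it also helps to list only the red ones; a small sketch using the column selection that the _cat APIs support:

# show only red indexes with their shard counts
curl -s "http://localhost:9200/_cat/indices?h=health,status,index,pri,rep" | grep "^red"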

Unassigned ES shards cause queries to fail

Exception thrown when querying ES:

[2018-08-06 18:27:24.553][DEBUG][action.search            ] [Godfrey Calthrop] All shards failed for phase: [query]
[jr-2018.08.06][[jr-2018.08.06][2]] NoShardAvailableActionException[null]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:129)
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:115)
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:47)
    at org.elasticsearch.action.support.TransportAction.doExecute(TransportAction.java:149)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:137)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:85)
    at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58)
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
    at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:52)
    at org.elasticsearch.rest.BaseRestHandler$HeadersAndContextCopyClient.doExecute(BaseRestHandler.java:83)
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
    at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:582)
    at org.elasticsearch.rest.action.search.RestSearchAction.handleRequest(RestSearchAction.java:85)
    at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:54)
    at org.elasticsearch.rest.RestController.executeHandler(RestController.java:205)
    at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:166)
    at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:128)
    at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:86)
    at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:449)
    at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:61)

Problem solving

From the investigation above, the cause is clear: too many historical indexes were left open, which drove ES's CPU and memory usage too high and made the node unusable.

Close indexes that are not needed to reduce memory usage
curl -XPOST "http://localhost:9200/index_name/_close"
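Closing indexes one at a time gets tedious with date-based names. The _close API also accepts wildcard expressions, so an entire month of old indexes can be closed in one call; a sketch based on the jr-YYYY.MM.DD naming seen above (and assuming wildcard use is permitted on this node):

# close every index from December 2016 at once
curl -XPOST "http://localhost:9200/jr-2016.12.*/_close"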

A side note

After the non-hot indexes were closed, the cluster health was still red. A single red index is enough to keep the whole cluster red, as the index-level health output below shows.

curl -XGET "http://10.252.148.85:9200/_cluster/health?level=indices"
{
	"cluster_name": "elasticsearch",
	"status": "red",
	"timed_out": false,
	"number_of_nodes": 1,
	"number_of_data_nodes": 1,
	"active_primary_shards": 660,
	"active_shards": 660,
	"relocating_shards": 0,
	"initializing_shards": 0,
	"unassigned_shards": 9,
	"delayed_unassigned_shards": 0,
	"number_of_pending_tasks": 0,
	"number_of_in_flight_fetch": 0,
	"task_max_waiting_in_queue_millis": 0,
	"active_shards_percent_as_number": 98.65470852017937,
	"indices": {
		"jr-2018.08.06": {
			"status": "red",
			"number_of_shards": 3,
			"number_of_replicas": 0,
			"active_primary_shards": 0,
			"active_shards": 0,
			"relocating_shards": 0,
			"initializing_shards": 0,
			"unassigned_shards": 3
		}
	}
}

Delete this index (in my case it contained only dirty data generated earlier while troubleshooting, so it could simply be deleted outright):

curl -XDELETE 'http://10.252.148.85:9200/jr-2018.08.06'
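After deleting (or restoring) the last red index, re-check the cluster health to confirm the status has recovered; with all primaries assigned it should go back to green, or yellow if some replicas are still unassigned:

# confirm the cluster status after the cleanup
curl -XGET "http://10.252.148.85:9200/_cluster/health?pretty"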

Summary

When running ES as a single node, keep an eye on the state of its indexes and on server-level monitoring, and clean up or close unneeded indexes in time so this situation never develops. I hope to walk the road of technical growth together with you.
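One way to make that cleanup routine rather than reactive is a small daily cron job. The sketch below closes date-suffixed indexes older than a retention window; the jr-YYYY.MM.DD naming, the localhost endpoint, and the 30-day window are assumptions carried over from the examples above, and a dedicated tool such as Elasticsearch Curator can do the same job more robustly.

#!/usr/bin/env bash
# Daily cleanup sketch: close indexes named jr-YYYY.MM.DD that are older than
# KEEP_DAYS. Assumes GNU date and an ES node reachable at ES_HOST.
ES_HOST="http://localhost:9200"
KEEP_DAYS=30

cutoff=$(date -d "-${KEEP_DAYS} days" +%Y.%m.%d)

# h=index prints only the index-name column; grep -o extracts matching names
for index in $(curl -s "${ES_HOST}/_cat/indices?h=index" | grep -oE 'jr-[0-9]{4}\.[0-9]{2}\.[0-9]{2}'); do
    index_date=${index#jr-}
    # zero-padded YYYY.MM.DD compares correctly as a plain string
    if [[ "${index_date}" < "${cutoff}" ]]; then
        echo "closing ${index}"
        curl -s -XPOST "${ES_HOST}/${index}/_close"
        echo
    fi
done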