ELK Tips is a collection of tips for using ELK, drawn from the Elastic Chinese community.
I. Logstash
1. Key parameters for Logstash performance tuning
- pipeline.workers: sets how many threads run the filter and output stages. Increase this value when input events are backing up and spare CPU is available.
- pipeline.batch.size: sets the maximum number of events a single worker thread collects before running filters and outputs. Larger batches are usually more efficient but add memory overhead. Output plugins treat each batch as one output unit; the ES output, for example, issues one Bulk request per batch, so adjusting pipeline.batch.size also adjusts the size of the Bulk requests sent to ES.
- pipeline.batch.delay: sets the pipeline batch latency, i.e. the maximum time (in milliseconds) a pipeline worker thread waits for a new event after receiving one in the current batch. In short, when pipeline.batch.size is not yet full, filters and outputs start once pipeline.batch.delay times out.
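All three parameters are set in logstash.yml (or via the -w, -b and -u command-line flags); the values below are illustrative starting points, not recommendations:

```yaml
# logstash.yml -- illustrative values; tune against your own workload
pipeline.workers: 8        # filter/output threads; raise when input backs up and CPU is idle
pipeline.batch.size: 250   # events per worker batch; also sizes the Bulk requests sent to ES
pipeline.batch.delay: 50   # max milliseconds to wait for a batch to fill before flushing
```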
2. Use the Ruby filter to compute a new field from existing fields
filter {
  ruby {
    code => "event.set('kpi', ((event.get('a') + event.get('b')) / (event.get('c') + event.get('d'))).round(2))"
  }
}
3. Check whether a field is empty or null in a Logstash filter
if ![updateTime]
4. Set multiple date formats in the Date filter
date {
  match => ["logtime", "yyyy-MM-dd HH:mm:ss.SSS", "yyyy-MM-dd HH:mm:ss,SSS"]
  target => "logtime_utc"
}
II. Elasticsearch
1. Efficient paging with Search After
Normally we use from and size to page through query results, but deep paging becomes prohibitively expensive (heap memory footprint and time cost grow with from + size), so ES imposes a limit (index.max_result_window, default 10000) to keep users from paging too deep.
The Scroll API is recommended for efficient deep scrolling, but a scroll context is expensive to keep alive, so don't use Scroll for real-time user requests. The search_after parameter instead provides a live cursor: it uses the sort values of the previous page to retrieve the next page.
GET twitter/_search
{
  "size": 10,
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  },
  "search_after": [1463538857, "654323"],
  "sort": [
    { "date": "asc" },
    { "tie_breaker_id": "asc" }
  ]
}
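The cursor idea behind search_after can be sketched as a toy in-memory simulation in Python (no ES client involved; search_page merely stands in for one /_search call):

```python
# Toy simulation of search_after-style cursor pagination (not an ES client).
# Each "hit" carries a sort tuple; the next page starts strictly after the
# last sort values of the previous page, so no page is ever re-scanned.

def search_page(docs, size, search_after=None):
    """Return up to `size` docs whose sort key is greater than `search_after`."""
    ordered = sorted(docs, key=lambda d: (d["date"], d["tie_breaker_id"]))
    if search_after is not None:
        ordered = [d for d in ordered
                   if (d["date"], d["tie_breaker_id"]) > search_after]
    return ordered[:size]

def paginate_all(docs, size):
    """Drain all pages by feeding the last page's sort values back in."""
    pages, cursor = [], None
    while True:
        page = search_page(docs, size, cursor)
        if not page:
            break
        pages.append(page)
        last = page[-1]
        cursor = (last["date"], last["tie_breaker_id"])
    return pages
```

A unique tie-breaker field in the sort (as in the DSL above) is what makes the cursor unambiguous when dates collide.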
2. BM25 document similarity parameter settings
By default, ES 2.x computed document similarity with the TF/IDF algorithm; since ES 5.x, BM25 has been the default similarity algorithm. A custom similarity is configured in the index settings and then referenced from a field mapping, as in the following example (which configures a DFR similarity):
PUT /index
{
  "settings": {
    "index": {
      "similarity": {
        "my_similarity": {
          "type": "DFR",
          "basic_model": "g",
          "after_effect": "l",
          "normalization": "h2",
          "normalization.h2.c": "3.0"
        }
      }
    }
  }
}
PUT /index/_mapping/_doc
{
  "properties": {
    "title": { "type": "text", "similarity": "my_similarity" }
  }
}
3. ES 2.x score calculation
Score calculation script:
double tf = Math.sqrt(doc.freq);
double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0;
double norm = 1 / Math.sqrt(doc.length);
return query.boost * tf * idf * norm;
- To ignore term frequency statistics and term positions, set the field's index_options to docs;
- To ignore field length, disable norms on the field: "norms": { "enabled": false }.
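The scoring script above can be reproduced in Python as a sketch (the formula as written, not Lucene's exact implementation, which among other things quantizes the norm):

```python
import math

def es2_score(freq, field_doc_count, term_doc_freq, doc_length, query_boost=1.0):
    """Sketch of the ES 2.x TF/IDF practical score from the script above."""
    tf = math.sqrt(freq)                                                    # term frequency
    idf = math.log((field_doc_count + 1.0) / (term_doc_freq + 1.0)) + 1.0   # inverse doc frequency
    norm = 1.0 / math.sqrt(doc_length)                                      # field-length norm
    return query_boost * tf * idf * norm
```

Setting index_options to docs effectively fixes freq at 1, and disabling norms fixes norm at 1, which is why those two switches flatten the score.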
4. CircuitBreakingException: [parent] Data too large
Error message:
[WARN ][r.suppressed ] path: /, params: {}
org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<http_request>] would be [1454565650/1.3gb], which is larger than the limit of [...], usages [request=0/0b, fielddata=568/568b, in_flight_requests=0/0b, accounting=1454565650/1.3gb]
The JVM heap cannot hold the data needed by the current query, so "Data too large" is reported and the request is tripped by the circuit breaker. indices.breaker.request.limit defaults to 60% of the JVM heap, so the problem can be solved by increasing the ES heap size.
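If increasing the heap is not immediately possible, the request breaker threshold can also be raised dynamically; the 70% below is an illustrative value, and loosening breakers trades safety margin for headroom:

```json
PUT /_cluster/settings
{
  "transient": {
    "indices.breaker.request.limit": "70%"
  }
}
```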
5. Recommended free automated operations tools for ES
- Ansible: github.com/elastic/ans…
- Puppet: github.com/elastic/pup…
- Cookbook: github.com/elastic/coo…
- Curator: www.elastic.co/guide/en/el…
6. Elasticsearch-HanLP word segmentation plugin
Core features:
- Multiple built-in segmentation modes, suitable for different scenarios;
- Built-in dictionaries, usable without any extra configuration;
- Support for external dictionaries, with user-definable segmentation algorithms, dictionary-based or model-based;
- Token-level custom dictionaries for multi-tenant scenarios;
- Remote dictionary hot-reloading (to be developed);
- Pinyin filter and Simplified/Traditional Chinese filter (to be developed);
- Word- or character-based N-gram segmentation (to be developed).
Github.com/AnyListen/e…
7. Delay index shard reallocation during node restarts
When a node leaves the cluster for a short time, the system as a whole generally keeps working. The following request delays shard reallocation:
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}
8. Queries after modifying ES data still return the pre-modification version
Changes become visible to search only after a refresh, and the default refresh interval is 1 second. If a write must be visible as soon as it completes, you can force a refresh by adding the refresh parameter to the write request, but this is strongly discouraged because frequent forced refreshes can drag down the whole cluster.
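For example (my_index and the document below are illustrative), a forced per-request refresh and a relaxed refresh interval look like this:

```json
PUT /my_index/_doc/1?refresh=true
{
  "message": "visible to search immediately"
}

PUT /my_index/_settings
{
  "index": { "refresh_interval": "30s" }
}
```

Lengthening refresh_interval is the usual way to trade search freshness for indexing throughput.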
9. Terms Query: load terms from another index
When a Terms query needs many terms, listing them by hand is cumbersome. The terms-lookup mechanism loads the matching terms from a document in another index:
PUT /users/_doc/2
{
  "followers": ["1", "3"]
}
PUT /tweets/_doc/1
{
  "user": "1"
}
GET /tweets/_search
{
  "query": {
    "terms": {
      "user": {
        "index": "users",
        "type": "_doc",
        "id": "2",
        "path": "followers"
      }
    }
  }
}
----------- equivalent to the following -----------
PUT /users/_doc/2
{
  "followers": [
    { "id": "1" },
    { "id": "2" }
  ]
}
10. Set the ES backup path
Error message:
doesn't match any of the locations specified by path.repo because this setting is empty
Solution: modify the ES configuration file:
# Add the following to elasticsearch.yml to set the backup repository path
path.repo: ["/home/test/backup/zty_logstash"]
11. Query cache and Filter cache
The Filter cache has been renamed Node Query Cache; in other words, today's Query cache is the former Filter cache. The Query Cache uses an LRU eviction policy (when the cache fills up, the least recently used entries are discarded), and it only caches queries used in a filter context.
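The LRU eviction behaviour itself can be illustrated with a minimal Python sketch (illustrative only, not ES's internal cache implementation):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: least recently used entries are evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)   # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```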
12. What factors determine an appropriate shard size?
- Lucene itself imposes no such size limit; the 20-40GB range is quite broad, and such rules of thumb do not always hold.
- Elasticsearch isolates and migrates data at shard granularity, so oversized shards raise migration costs.
- An ES shard is one Lucene index; a Lucene directory contains a number of segments, and each segment has an upper limit on document count, so the maximum number of documents it can represent is Integer.MAX_VALUE - 128 = 2,147,483,519, about 2.14 billion.
- Accordingly, unless you force-merge down to a single segment, the document count of a single shard can exceed that number.
- The larger a single Lucene index grows, the larger its files, the higher the query cost and IO pressure, and the worse the query experience.
- How much data one shard should hold still needs to be evaluated by testing against real business data and real queries.
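A toy Python helper tying the two numbers above together (suggest_shard_count is a hypothetical name; the 20-40GB sweet spot is the empirical range quoted above, not a hard rule):

```python
import math

# Per-segment document ceiling quoted above: Integer.MAX_VALUE - 128
MAX_DOCS_PER_SEGMENT = (2**31 - 1) - 128

def suggest_shard_count(total_gb, target_shard_gb=30):
    """Hypothetical helper: pick a primary shard count so each shard
    lands near the empirical 20-40GB sweet spot (~30GB by default)."""
    return max(1, math.ceil(total_gb / target_shard_gb))
```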
13. Restrict updates of specified fields via mapping during indexing
Dynamic mapping is Elasticsearch's default behaviour: new fields are automatically merged into the existing mapping. It can be configured via the dynamic setting:
- true (the default): new fields are added to the mapping dynamically;
- false: new fields are ignored (not indexed), but the document itself is still indexed successfully;
- strict: unknown fields are not accepted, and an error is thrown directly.
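A sketch of configuring this at index creation (my_index and the title field are illustrative; on 6.x and earlier the mapping type name is also required in the path):

```json
PUT /my_index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" }
    }
  }
}
```

With dynamic set to strict, indexing a document containing any field other than title is rejected; substitute true or false for the other two modes.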
14. Snapshot ES data to HDFS
Taking ES snapshots and exporting data with ES-Hadoop are two completely different approaches, and re-importing with ES-Hadoop later may not be cheap.
- If you want to restore data quickly, the fastest way is snapshot and restore; speed depends entirely on network and disk throughput.
- To save disk space, you can use the source_only snapshot mode supported since 6.5. Snapshots in this mode are much smaller than regular ones, but the index must be rebuilt at restore time, which is slow.
15. Segment memory
Segments are produced from the indexing buffer in three ways:
- When the indexing buffer fills up, its contents are written out as a segment file; the buffer defaults to 10% of the heap and is shared by all shards on the node.
- A refresh also writes the current indexing buffer contents out as a new segment file; by default a refresh happens every 1 second.
- Finally, ES periodically merges small segment files into new, larger segments automatically.
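Per-segment size and memory usage can be inspected with the _cat segments API (the columns shown are standard _cat/segments fields):

```
GET /_cat/segments?v&h=index,shard,segment,size,size.memory
```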
III. Selected community articles
- 2018 Elastic Advent Calendar Sharing Event
- Use ES-Hadoop to write Spark Streaming data to ES
- Elastic Stack 6.5 New features
- Make Elasticsearch fly! Practical performance optimization tips