Preface

A basic introduction to ELK

ELK is an acronym for three open-source software products: Elasticsearch, Logstash, and Kibana. They are usually used together, and since they all come from Elastic.co, they are collectively referred to as the ELK stack for short

In a typical setup, multiple servers are deployed, Logstash collects data from them and sends it to an ES cluster for storage, and Kibana then displays it in the browser. That's the basic routine

Elasticsearch is an open source distributed search engine. Its features include: works out of the box, distributed, zero configuration, automatic discovery, index sharding, index replicas, a RESTful interface, multiple data sources, automatic search load balancing, and more.

Logstash is a completely open source tool that collects, filters, and stores your logs for later use (e.g. ES search).

Kibana is also an open source, free tool that provides a friendly web interface for log analysis on top of Logstash and ElasticSearch, helping you aggregate, analyze, and search important log data.

ELK implementation introduction

Component, implementation language, and role:
ElasticSearch (Java): a real-time distributed search and analysis engine for full-text search, structured search, and analytics, built on Lucene. Similar to Solr.
Logstash (JRuby): a data collection engine with real-time pipeline capability, made up of input, filter, and output modules; log formatting and parsing are generally done in the filter module.
Kibana (JavaScript): an analysis and visualization web platform for ElasticSearch. It can query data from ElasticSearch indexes and generate charts across various dimensions.

References to ELK

ELK website: https://www.elastic.co/

ELK website document: https://www.elastic.co/guide/index.html

ELK Chinese manual: https://www.elastic.co/guide/cn/elasticsearch/guide/current/index.html

ELK Chinese community: https://elasticsearch.cn/

1. Deployment of Elasticsearch

1.1 Introduction

ElasticSearch, abbreviated ES, is an open source, highly scalable, distributed full-text search engine that can store and retrieve data in near real time. It scales well, out to hundreds of servers handling petabytes of data. ES is also developed in Java and uses Lucene as its core for all indexing and search capabilities, but its intent is to hide the complexity of Lucene behind a simple RESTful API and make full-text search easy.

Here are some examples of companies that use it as their search engine

GitHub:

In early 2013, GitHub abandoned Solr in favor of ElasticSearch. It uses ElasticSearch to search 20 terabytes of data, including 1.3 billion files and 130 billion lines of code.


Wikipedia: launched its core search architecture based on ElasticSearch

SoundCloud:

SoundCloud uses ElasticSearch to provide real-time and accurate music search services for 180 million users


Baidu:

Baidu currently uses ElasticSearch widely for text data analysis, collecting all kinds of metric data and user-defined data from every Baidu server. Through multi-dimensional analysis and display of this data, it helps locate and analyze anomalies at the instance or business level. It currently covers more than 20 internal business lines at Baidu (including Casio, cloud analytics, network alliance, prediction, Library, direct number, wallet, risk control, and more), with up to 100 machines and 200 ES nodes in a single cluster, importing 30 TB+ of data every day


Sina: 3.2 billion real-time logs are analyzed using ElasticSearch

Alibaba: built its own log collection and analysis system with ElasticSearch

1.2 Preparations

I have uploaded the installation packages I used to Baidu Cloud; grab them if you need them, to spare yourself the pain of slow downloads 🤣

Link: https://pan.baidu.com/s/17m0LmmRcffQbfhjSikIHhA

Extract code: L1lj


The first thing to note is that ES cannot be started as root; it has to be installed and run as a regular user. Here we use a newly created user

You can also refer directly to https://www.cnblogs.com/gcgc/p/10297563.html for solutions

First download the installation package and decompress it; the steps are simple, so I won't go into detail

Then we create two folders, both in the ES directory

mkdir -p /usr/local/elasticsearch-6.7.0/logs/

mkdir -p /usr/local/elasticsearch-6.7.0/datas


This creates a data folder and a log folder

Then modify the configuration file: vim elasticsearch.yml in the config folder

cluster.name: myes

node.name: node3

path.data: your installation path/elasticsearch-6.7.0/datas

path.logs: your installation path/elasticsearch-6.7.0/logs

network.host: the IP address of this node

http.port: 9200

discovery.zen.ping.unicast.hosts: ["node1", "node2", "node3"]

bootstrap.system_call_filter: false

bootstrap.memory_lock: false

http.cors.enabled: true

http.cors.allow-origin: "*"



cluster.name -- the name of the cluster, default is elasticsearch

node.name -- the node name

path.data -- data file storage path

path.logs -- log file storage path

http.port -- access port number

discovery.zen.ping.unicast.hosts -- the host list used for cluster auto-discovery

bootstrap.system_call_filter -- whether to enable the system call filter check (false here)

bootstrap.memory_lock -- whether to lock the process memory (false here)

http.cors.enabled -- whether to allow cross-origin access (needed by elasticsearch-head)

http.cors.allow-origin -- which origins are allowed; "*" means any


Here path.data and path.logs point to /usr/local/elasticsearch-6.7.0/datas and /usr/local/elasticsearch-6.7.0/logs, the two folders we created earlier.

We can also set the maximum and minimum heap size in the jvm.options configuration file, and then distribute the installation to the other nodes; of course, the other two nodes get the same configuration as above, with node.name and network.host adjusted for each node.
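For reference, a minimal jvm.options sketch before distributing with the scp commands below; the 1g values are only an assumption, so size the heap to your machine and keep -Xms and -Xmx equal:

-Xms1g
-Xmx1g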

cd /usr/local

scp -r elasticsearch-6.7.0/ node2:$PWD

scp -r elasticsearch-6.7.0/ node3:$PWD


Because my first two virtual machines crashed somehow and my own computer is short on memory, I'll just carry on with a single node 🤣

1.3 Modifying some system configurations

1.3.1 Maximum number of open files for regular users

The error message looks like this:

max file descriptors [4096] for elasticsearch process is likely too low, increase to at least [65536]


Solution: raise the limit on the maximum number of files a regular user can open, otherwise ES may fail to start

sudo vi /etc/security/limits.conf



* soft nofile 65536

* hard nofile 131072

* soft nproc 2048

* hard nproc 4096


Just paste these four lines

1.3.2 Kernel limits: vm.max_map_count and fs.file-max

sudo vi /etc/sysctl.conf



Add the following two lines

vm.max_map_count=655360

fs.file-max=655360


Run the sudo sysctl -p command to make the configuration take effect

Note: after making the above two changes, be sure to reconnect to Linux for them to take effect. Close SecureCRT or XShell and open a new connection to Linux

1.3.3 Actions after reconnecting the tool

Run the following four commands to verify the new limits

[hadoop@node01 ~]$ ulimit -Hn

131072

[hadoop@node01 ~]$ ulimit -Sn

65536

[hadoop@node01 ~]$ ulimit -Hu

4096

[hadoop@node01 ~]$ ulimit -Su

4096


1.3.4 Starting the ES Cluster

nohup /usr/local/elasticsearch-6.7.0/bin/elasticsearch 2>&1 &


After a successful start, jps shows the ES service process and the page can be accessed

http://node3:9200/?pretty


Note: If a machine fails to start the service, go to the machine’s logs to see the error log

When this page appears, the service is up. It is, admittedly, very ugly
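You can also check the cluster from the command line, for example (assuming node3 is reachable from where you run it):

curl -XGET http://node3:9200/_cluster/health?pretty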

1.3.5 The elasticsearch-head plugin

After the ES service is started, the interface it gives you is ugly. To view the information in the index library more conveniently, we can install the elasticsearch-head plugin, which provides a nicer ES management interface

First install Node.js

1.3.6 Node.js

Node.js is a JavaScript runtime environment based on Chrome's V8 engine.

Node.js is a JavaScript runtime environment developed by Ryan Dahl and released in May 2009. It is essentially a wrapper around the Chrome V8 engine. Node.js is not a JavaScript framework, unlike CakePHP, Django, and Rails; nor is it a browser-side library, so it can't be compared to jQuery or ExtJS. Node.js is a development platform that lets JavaScript run on the server side, making JavaScript a scripting language on a par with server-side languages like PHP, Python, Perl, and Ruby.

Installation steps reference: https://www.cnblogs.com/kevingrace/p/8990169.html

After downloading the installation package, we unzip it and do some configuration

sudo ln -s /usr/local/node-v8.1.0-linux-x64/lib/node_modules/npm/bin/npm-cli.js /usr/local/bin/npm

sudo ln -s /usr/local/node-v8.1.0-linux-x64/bin/node /usr/local/bin/node


Then modify the environment variables

sudo vim .bash_profile



export NODE_HOME=/usr/local/node-v8.1.0-linux-x64

export PATH=$PATH:$NODE_HOME/bin


Then source the file (source .bash_profile)

If node and npm report the versions shown above, you're good

1.3.7 ElasticSearch-head installation

Online installation is not recommended because of everyone's Internet speed (you know). Of course it's a different story if, like me, you prepared the installation package in advance

Now we unzip it and modify Gruntfile.js

cd /usr/local/elasticsearch-head

vim Gruntfile.js


In vim you can press shift+: to enter command mode and use /hostname to find the parameter; set hostname: 'your own node IP'

After that:

cd /usr/local/elasticsearch-head/_site

vim app.js


As above, look for the http:// keyword and change localhost to your node IP

Then we can start the service

cd /usr/local/elasticsearch-head/node_modules/grunt/bin/

nohup ./grunt server >/dev/null 2>&1 &


After that you can visit http://your-node-IP:9100/. Of course, that page isn't much to look at either. Also, my two other virtual machines are dead, so I had no choice but to start a single node

1.3.8 How to shut it down

Run the following command to find the elasticsearch-head plug-in process, and then run the kill -9 command to kill the process

sudo yum install net-tools     

netstat -nltp | grep 9100

kill -9 88297


1.3.9 Kibana installation

This one can basically be used straight after decompression

Prepare an installation package and unzip it

cd /usr/local/kibana-6.7.0-linux-x86_64/config/

vi kibana.yml





The configuration is as follows:

server.host: "node3"

elasticsearch.hosts: ["http://node3:9200"]


Then we can start it up

cd /usr/local/kibana-6.7.0-linux-x86_64

nohup bin/kibana >/dev/null 2>&1 &


All you need is to wait

Well, let’s wait a little longer 😂

Here it is telling you that there is no data in your cluster yet, so just click the first option

At this point the installation is complete. From now on, the way we interact with ES is through Dev Tools, the developer tools

2. ElasticSearch

2.1 Some concepts of ES

Let's use a traditional relational database as an analogy

Relational DB -> Databases -> Tables -> Rows -> Columns

Elasticsearch -> Indices   -> Types  -> Documents -> Fields


Take MySQL, which we usually work with: it has multiple databases, each database has multiple tables, each table has many rows, and each row has columns

In ES there are Indices; under an index there are Types, which are similar to tables; each piece of data is a Document, and Fields are the columns

2.2 Explanation of some proper nouns

2.2.1 Index

An index is a collection of documents with somewhat similar characteristics. For example, you can have an index for customer data, another index for catalog data, and an index for order data. An index is identified by a name (which must be all lowercase) and is used when indexing, searching, updating, and deleting documents corresponding to that index. You can define as many indexes as you want in a cluster.

2.2.2 Type

In an index, you can define one or more types. A type is a logical classification/partition of your index, the semantics of which are entirely up to us. Typically, a type is defined for documents that have a common set of fields. For example, let’s say you run a blogging platform and store all your data in an index. In this index, you can define one type for user data, another type for blog data, and, of course, another type for comment data.

2.2.3 Field

It is similar to the field of the data table, which classifies and identifies the document data according to different attributes

2.2.4 Mapping

It is somewhat like a JavaBean. A mapping defines how a field is processed and which rules constrain it: its default value, data type, analyzer, whether it is indexed, and so on; all of these can be set in the mapping. The rules ES applies internally when processing data are also configured there. Processing data according to optimal rules gives a big performance boost, which is why you need to build mappings and think about how to design them for better performance.

2.2.5 Document

A document is the basic unit of information that can be indexed. For example, you can have a document for a customer, a document for a product, and, of course, a document for an order. Documents are represented in the JavaScript Object Notation (JSON) format, a ubiquitous Internet data-interchange format. You can store as many documents as you want in an index/type. Note that although a document physically lives in an index, it must be assigned to an index and a type.

2.2.6 Near-real-time NRT

Elasticsearch is a near real-time search platform. This means there is a slight delay (usually within one second) between indexing a document and it becoming searchable
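That delay comes from the index refresh cycle; if you want to trade some freshness for indexing throughput, the refresh interval can be tuned per index. A minimal sketch, where the school index name and the 30s value are just placeholders:

PUT /school/_settings
{
  "index": { "refresh_interval": "30s" }
}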

2.2.7 Shards & Replicas

An index can store amounts of data that exceed the hardware limits of a single node. For example, an index with 1 billion documents taking up 1 terabyte of disk space may not fit on any single node's disk, or a single node may be too slow to serve search requests over it on its own. To solve this, Elasticsearch can divide the index into multiple pieces called shards. When you create an index, you can specify the number of shards you want. Each shard is itself a fully functional, independent "index" that can be placed on any node in the cluster. Sharding is important for two reasons:

Allows you to split/expand your content horizontally.

Allows you to do distributed, parallel operations on top of shards, improving performance/throughput.


How a shard is distributed and how its documents are aggregated back into search requests is completely managed by Elasticsearch and is transparent to you as the user.

In a network environment where failure can happen at any time, it is useful to have a failover mechanism for cases where a shard or node somehow goes offline or disappears. For this purpose Elasticsearch lets you create one or more copies of a shard, called replica shards, or replicas for short.

Replicas is important for two main reasons:

Provides high availability in the event of shard/node failure. For this reason, a replica shard is never placed on the same node as its original/primary shard.

Expands your search volume/throughput, because searches can run in parallel across all replicas.

In summary, each index can be split into multiple shards, and an index can also be replicated zero times (meaning no replicas) or more. Once replicated, each index has primary shards (the originals the replicas were copied from) and replica shards (copies of the primary shards). The number of shards and replicas can be specified when the index is created. You can dynamically change the number of replicas at any time after the index is created, but you cannot change the number of shards afterwards.

By default, each index in Elasticsearch gets five primary shards and one replica, which means that if you have at least two nodes in your cluster, your index will have five primary shards and five replica shards (one full copy), for a total of 10 shards per index.
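As a hedged sketch of what specifying this at creation time looks like — the index name blog02 and the numbers are just placeholders:

PUT /blog02
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}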

2.2.8 curl, a management tool

curl is an open source file transfer tool that works on the command line using URL syntax. With curl you can easily issue common GET/POST requests; just think of it as a tool for accessing URLs from the command line. curl is in CentOS's default repository; if you don't have it, install it with yum.

yum -y install curl

Common curl options:

-X specifies the HTTP request method: HEAD, GET, POST, PUT, DELETE

-d specifies the data to send

-H specifies an HTTP request header


2.3 Creating indexes using XPUT

2.3.1 Creating an index

At this point, the only index we have is the one Kibana creates by default

Execute the following statement in Kibana's Dev Tools

curl -XPUT http://node3:9200/blog01/?pretty


Kibana will automatically help us with some formatting

After successful execution, we can see the so-called nice page 🤣 again at this time


2.3.2 Inserting a piece of data

curl -XPUT http://node3:9200/blog01/article/1?pretty -d '{"id": "1", "title": "What is ELK"}'

Copy the code

This uses the PUT verb to add a document of type article to the blog01 index and assigns it the document ID 1. The URL path follows the pattern index/type/id (index / document type / ID).

You can see this in the data browsing module by clicking blog01

Content-Type header [application/x-www-form-urlencoded] is not supported

The reason is that ES added a security mechanism: strict content-type checking, which also serves as a layer of protection against cross-site request forgery attacks. See the official documentation on

http.content_type.required
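If you hit this error when calling ES with plain curl, declare the content type explicitly; a minimal sketch:

curl -XPUT -H "Content-Type: application/json" http://node3:9200/blog01/article/1?pretty -d '{"id": "1", "title": "What is ELK"}'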

2.3.3 Querying Data

curl -XGET http://node3:9200/blog01/article/1?pretty


This command can be executed either in Kibana or directly on a cluster node

2.3.4 Updating the document

The update operation looks almost the same as the insert: a PUT to an ID that already exists updates the document, while a PUT to a new ID inserts one, similar to how save operations behave in Java

curl -XPUT http://node3:9200/blog01/article/1?pretty -d '{"id": "1", "title": "What is elasticsearch"}'


I won't post a screenshot; just try it yourself
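As a side note, ES also supports partial updates through the _update endpoint, where only the fields inside doc are changed. A minimal sketch, the new title value being just a placeholder:

POST /blog01/article/1/_update
{
  "doc": { "title": "What is kibana" }
}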

2.3.5 Searching for documents

curl -XGET "http://node3:9200/blog01/article/_search? q=title:elasticsearch" 


2.3.6 Deleting documents and Indexes

Delete the document

curl -XDELETE "http://node3:9200/blog01/article/1? pretty"


Remove the index

curl -XDELETE http://node3:9200/blog01?pretty


Running it would delete the index, so I won't execute it here

2.4 Conditional queries in ES

So let’s just simulate some data

POST /school/student/_bulk
{ "index": { "_id": 1 }}
{ "name" : "tellYourDream", "age": 25, "sex": "boy", "birth": "1995-01-01", "about": "i like bigdata" }
{ "index": { "_id": 2 }}
{ "name" : "guanyu", "age": 21, "sex": "boy", "birth": "1995-01-02", "about": "i like diaocan" }
{ "index": { "_id": 3 }}
{ "name" : "zhangfei", "age": 18, "sex": "boy", "birth": "1998-01-02", "about": "i like travel" }
{ "index": { "_id": 4 }}
{ "name" : "diaocan", "age": 20, "sex": "girl", "birth": "1996-01-02", "about": "i like travel and sport" }
{ "index": { "_id": 5 }}
{ "name" : "panjinlian", "age": 25, "sex": "girl", "birth": "1991-01-02", "about": "i like travel and wusong" }
{ "index": { "_id": 6 }}
{ "name" : "caocao", "age": 30, "sex": "boy", "birth": "1988-01-02", "about": "i like xiaoqiao" }
{ "index": { "_id": 7 }}
{ "name" : "zhaoyun", "age": 31, "sex": "boy", "birth": "1997-01-02", "about": "i like travel and music" }
{ "index": { "_id": 8 }}
{ "name" : "xiaoqiao", "age": 18, "sex": "girl", "birth": "1998-01-02", "about": "i like caocao" }
{ "index": { "_id": 9 }}
{ "name" : "daqiao", "age": 20, "sex": "girl", "birth": "1996-01-02", "about": "i like travel and history" }


Just paste it into Kibana and execute it

You can also see it on the es-head side

2.4.1 Using match_all to query

GET /school/student/_search?pretty
{
    "query": {
        "match_all": {}
    }
}


Problem: match_all returns all the data, but real business needs are usually not "fetch everything"; we want to retrieve only the data we are interested in. Besides, retrieving all the data at once from an ES cluster can easily trigger heavy GC. So we need to learn how to retrieve data efficiently

2.4.2 Querying Information by Key Fields

GET /school/student/_search?pretty
{
    "query": {
        "match": { "about": "travel" }
    }
}


What if you now want people who like travel but are not boys? A single match query doesn't support multiple conditions under one match, so we need something more

2.4.3 Composite query of bool

When several queries need to be combined, they can be wrapped in a bool. A bool query can contain must, must_not, and should clauses; should means or

GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": { "match": { "about": "travel" }},
      "must_not": { "match": { "sex": "boy" }}
    }
  }
}


2.4.4 Should in the compound query of bool

should means optional: documents that match the should clause score higher, but those that don't are still returned. Example: query people who like to travel, and prefer (but don't require) that they are boys

GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": { "match": { "about": "travel" }},
      "should": { "match": { "sex": "boy" }}
    }
  }
}


2.4.5 Term matching

Use term for exact matching on values such as numbers, dates, booleans, or not_analyzed strings (unanalyzed text fields). The syntax:

"term": { "age"20 }}

"term": { "date""2018-04-01" }}

"term": { "sex"Boy: ""}}

"term": { "about""travel" }}


Example: query for people who like to travel

GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": { "term": { "about": "travel" }},
      "should": { "term": { "sex": "boy" }}
    }
  }
}


2.4.6 Using terms to match multiple values

GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": { "terms": { "about": ["travel", "history"] }}
    }
  }
}


2.4.7 Range filtering

The Range filter allows us to find data in a specified Range:

gt -- greater than

gte -- greater than or equal to

lt -- less than

lte -- less than or equal to


Example: Find students over 20 and under 25

GET /school/student/_search?pretty
{
  "query": {
    "range": {
      "age": { "gt": 20, "lte": 25 }
    }
  }
}


2.4.8 Exists and Missing filters

The exists and missing filters can be used to find documents that do or do not contain a given field

Example: Find a document that contains age in its field

GET /school/student/_search?pretty
{
  "query": {
    "exists": {
      "field": "age"
    }
  }
}


2.4.9 Multi-conditional filtering of bool

bool can also be used, just like with match, to combine multiple filter conditions:

must -- all conditions must match, equivalent to and

must_not -- none of the conditions may match, equivalent to not

should -- at least one condition should match, equivalent to or


Example: Filter out students whose about field contains travel and who are older than 20 but younger than 30

GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": [
        { "term": {
          "about": {
            "value": "travel"
          }
        }},
        { "range": {
          "age": {
            "gte": 20,
            "lte": 30
          }
        }}
      ]
    }
  }
}


2.4.10 Merge query and filter criteria

For complex queries we usually also combine the query with filter clauses, which can be cached; this is done with the filter statement

Example: query a document that likes to travel and is 20 years old

GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": { "match": { "about": "travel" }},
      "filter": [{ "term": { "age": 20 }}]
    }
  }
}


2.5 Mappings & Settings, two important concepts in ES

Mappings define the types of the fields in ES. Every field in ES has a type; if we don't define it ourselves, ES determines the field type automatically from the first value we insert

DELETE document

PUT document
{
  "mappings": {
    "article" : {
      "properties":
      {
        "title" : { "type": "text" },
        "author" : { "type": "text" },
        "titleScore" : { "type": "double" }
      }
    }
  }
}


Then you can check it with GET document/article/_mapping

Settings are used to define things such as the number of shards and replicas:

DELETE document

PUT document
{
  "mappings": {
    "article" : {
      "properties":
      {
        "title" : { "type": "text" },
        "author" : { "type": "text" },
        "titleScore" : { "type": "double" }
      }
    }
  }
}

GET /document/_settings

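Since the number of replicas can be changed at any time (unlike the number of shards), a hedged sketch of bumping it on this index would be:

PUT /document/_settings
{
  "index": { "number_of_replicas": 2 }
}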

Since I only have one node, I won't bother with a screenshot of the es-head view; it wouldn't show much 😂

3. ES paging solutions

Simulate some data and then copy it directly to Kibana

DELETE us

POST /_bulk
{ "create": { "_index": "us", "_type": "tweet", "_id": "1" }}
{ "email" : "[email protected]", "name" : "John Smith", "username" : "@john" }
{ "create": { "_index": "us", "_type": "tweet", "_id": "2" }}
{ "email" : "[email protected]", "name" : "Mary Jones", "username" : "@mary" }
{ "create": { "_index": "us", "_type": "tweet", "_id": "3" }}
{ "date" : "2014-09-13", "name" : "Mary Jones", "tweet" : "Elasticsearch means full text search has never been so easy", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "4" }}
{ "date" : "2014-09-14", "name" : "John Smith", "tweet" : "@mary it is not just text, it does everything", "user_id" : 1 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "5" }}
{ "date" : "2014-09-15", "name" : "Mary Jones", "tweet" : "However did I manage before Elasticsearch?", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "6" }}
{ "date" : "2014-09-16", "name" : "John Smith", "tweet" : "The Elasticsearch API is really easy to use", "user_id" : 1 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "7" }}
{ "date" : "2014-09-17", "name" : "Mary Jones", "tweet" : "The Query DSL is really powerful and flexible", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "8" }}
{ "date" : "2014-09-18", "name" : "John Smith", "user_id" : 1 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "9" }}
{ "date" : "2014-09-19", "name" : "Mary Jones", "tweet" : "Geo-location aggregations are really cool", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "10" }}
{ "date" : "2014-09-20", "name" : "John Smith", "tweet" : "Elasticsearch surely is one of the hottest new NoSQL products", "user_id" : 1 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "11" }}
{ "date" : "2014-09-21", "name" : "Mary Jones", "tweet" : "Elasticsearch is built for the cloud, easy to scale", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "12" }}
{ "date" : "2014-09-22", "name" : "John Smith", "tweet" : "Elasticsearch and I have left the honeymoon stage, and I still love her.", "user_id" : 1 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "13" }}
{ "date" : "2014-09-23", "name" : "Mary Jones", "tweet" : "So yes, I am an Elasticsearch fanboy", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "14" }}
{ "date" : "2014-09-24", "name" : "John Smith", "tweet" : "How many more cheesy tweets do I have to write?", "user_id" : 1 }


3.1 Shallow paging with from + size

In the normal query process, if I want to query the first 10 pieces of data:

  1. The client sends a request to a node
  2. The node forwards to each shard and queries the first 10 items on each shard
  3. The results are returned to the nodes, the data is consolidated, and the first 10 items are extracted
  4. Return to the requesting client

from defines the offset into the result set, and size defines the number of documents returned

Example: query the first five entries

GET /us/_search?pretty
{
  "from" : 0, "size" : 5
}


Starting from the fifth document, query five more:

GET /us/_search?pretty
{
  "from" : 5, "size" : 5
}


This is also how we will later query ES data through the Java API. But shallow paging is only suitable for small amounts of data: as from grows, the query takes longer and efficiency drops off sharply

Advantages: from + size is quite efficient as long as the amount of data is not large

Disadvantages: with very large data volumes, from + size paging loads all the records into memory, which is not only very slow but can also easily make ES run out of memory and die

3.2 Deep paging with scroll

With the shallow paging described above, when Elasticsearch responds to a request it must determine the order of the docs and sort the whole result set.

If the requested page is near the front (say 20 docs per page, first page), Elasticsearch has no problem; but if the page number is large (say page 20), Elasticsearch has to fetch all the docs from pages 1 through 20, discard those from pages 1 through 19, and only then return page 20.

The solution is to use scroll. Scroll maintains a cached snapshot of the current index segments (taken at the moment you issue the scroll query). It works in two steps: initialization and traversal. 1. During initialization, all search results matching the search conditions are cached, like a snapshot. 2. During traversal, data is read from that snapshot.

Initialize the



GET us/_search?scroll=3m
{
  "query": { "match_all": {} },
  "size": 3
}


Initialization looks like a normal search; scroll=3m means the results are cached for 3 minutes, and size: 3 means three documents are returned per batch

During traversal, take the _scroll_id from the previous response and issue another request with the scroll parameter; repeat until a request returns no data, which means the traversal is complete

GET /_search/scroll
{
  "scroll" : "1m",
  "scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAShGFnE5ZURldkZjVDFHMHVaODdBOThZVWcAAAAAAAEoSRZxOWVEZXZGY1QxRzB1Wjg3QTk4WVVnAAAAAAABKEcWcTllRGV2RmNUMUcwdVo4N0E5OFlVZwAAAAAAAShIFnE5ZURldkZjVDFHMHVaODdBOThZVWcAAAAAAAEoShZxOWVEZXZGY1QxRzB1Wjg3QTk4WVVn"
}


Note: you need to pass the scroll parameter with every request, and you do not need to specify the index or type. Do not set the cache time too long, as it occupies memory.
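If you finish early, the scroll context can also be released by hand rather than waiting for it to expire; a minimal sketch, where the scroll_id value is whatever your last response returned:

DELETE /_search/scroll
{
  "scroll_id" : "your scroll_id here"
}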

Contrast:

With shallow paging, each query goes to the index library (the local files) and fetches pageNum*pageSize documents, truncates the earlier ones, and keeps only the last page. This happens on every shard, and finally the data from all shards is merged, sorted again, and truncated as required.

With deep paging (scroll), all the data matching the query conditions is put into memory at once, and paging is then served from memory. Compared with shallow paging, this avoids reading the disk over and over.

4. Chinese word segmentation in ES with IK

By default, ES handles English text segmentation well, but, just like Lucene, if you want Chinese full-text retrieval you need a Chinese tokenizer, and as with Lucene you have to integrate the IK tokenizer before using Chinese full-text search. So next we install the IK tokenizer to get Chinese word segmentation

4.1 Download

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.7.0/elasticsearch-analysis-ik-6.7.0.zip


Then create a new directory in the ES directory to store the plug-in

mkdir -p /usr/local/elasticsearch-6.7.0/plugins/analysis-ik



unzip elasticsearch-analysis-ik-6.7.0.zip  -d /usr/local/elasticsearch-6.7.0/plugins/analysis-ik/


Then you can distribute it to other machines

cd /usr/local/elasticsearch-6.7.0/plugins

scp -r analysis-ik/ node2:$PWD 


Then you need to restart the ES service

ps -ef|grep elasticsearch | grep bootstrap | awk '{print $2}' |xargs kill -9

nohup /usr/local/elasticsearch-6.7.0/bin/elasticsearch 2>&1 &


4.2 Experience

Create the index library in Kibana and configure the IK word divider

DELETE iktest

PUT /iktest?pretty
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "ik" : {
                    "tokenizer" : "ik_max_word"
                }
            }
        }
    },
    "mappings" : {
        "article" : {
            "dynamic" : true,
            "properties" : {
                "subject" : {
                    "type" : "text",
                    "analyzer" : "ik_max_word"
                }
            }
        }
    }
}


When creating the index library we specified the tokenizer ik_max_word, which segments Chinese text at the finest granularity. Now let's look at the effect

Here you can see that my ID gets split into several tokens
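If you want to inspect the tokenization yourself, the _analyze API can be run against the index; a minimal sketch, where the text value is just a placeholder for whatever string you want to test:

GET /iktest/_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}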

You can also highlight your query using the following syntax, such as this effect

POST /iktest/article/_search?pretty
{
    "query" : { "match" : { "subject" : "Fight pneumonia" }},
    "highlight" : {
        "pre_tags" : ["<font color=red>"],
        "post_tags" : ["</font>"],
        "fields" : {
            "subject" : {}
        }
    }
}


4.3 Configuring hot Word Updates

You can see that my ID gets split into several pieces because, from the plugin's point of view, "say what you want" is not a recognized word; but we can tell it that it is 🤣

For example, right now "the master of shadow flow" and "Cai Xukun" also get split apart. So how do we configure this? We often need to be able to update hot words from the web in real time. We can solve this by using Tomcat to serve a remote dictionary.

4.4 Installing a Tomcat

Just install tomcat

Then go to its ROOT directory; mine is:

cd /usr/local/apache-tomcat-8.5.34/webapps/ROOT

Then create a new file:

vi hot.dic 



Then save and exit

Next, start Tomcat. If it starts successfully, you will see a Bootstrap process in jps

You can then access hot.dic in the browser

4.5 Modifying the tokenizer configuration

Next, modify the IK tokenizer's configuration file

Open it; there are some Chinese comments, and then you will see this line

Uncomment that line and set it to the path of the hot-word file served by Tomcat, which here is http://192.168.200.11:8080/hot.dic
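For reference, the relevant entry lives in the IK plugin's IKAnalyzer.cfg.xml; roughly like this (a sketch, so check the comments in your own copy of the file):

<!-- remote extension dictionary: point it at the hot-word file served by Tomcat -->
<entry key="remote_ext_dict">http://192.168.200.11:8080/hot.dic</entry>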

This configuration file is then distributed to the other two nodes

Then we just restart ES, and when we run the segmentation again, the hot words are kept as whole keywords

Finally

This article has mainly covered deployment; later articles will get into the APIs and day-to-day operations. No need to make this one any longer 😶