Preface
A basic introduction to ELK
ELK is an acronym for three software products: Elasticsearch, Logstash, and Kibana. All three are open source, are usually used together, and are maintained by the company Elastic (elastic.co), so together they are called the ELK stack for short.
In a typical setup, multiple servers are deployed and Logstash collects data from them; the collected data is sent to an ES cluster for storage, and Kibana then displays it in the browser. That is the basic routine.
Elasticsearch is an open source distributed search engine. Its features include: works out of the box, distributed, zero configuration, automatic discovery, index sharding, index replicas, a RESTful interface, multiple data sources, and automatic search load balancing.
Logstash is a completely open source tool that collects, filters, and stores your logs for later use (e.g. searching in ES).
Kibana is also an open source, free tool that provides a friendly web interface for log analysis on top of Logstash and ElasticSearch, helping you summarize, analyze, and search important log data.
ELK implementation introduction
Component | Implementation language | Introduction |
---|---|---|
ElasticSearch | Java | Real-time distributed search and analysis engine for full-text search, structured search and analysis, based on Lucene. Similar to Solr |
Logstash | JRuby | Data collection engine with real-time pipeline capability, consisting of input, filter, and output modules; log formatting and parsing are generally done in the filter module |
Kibana | JavaScript | Provides an analysis and visualization web platform for ElasticSearch. It can search and retrieve data from ElasticSearch indexes and generate tables and charts of various dimensions |
References to ELK
ELK website: https://www.elastic.co/
ELK website document: https://www.elastic.co/guide/index.html
ELK Chinese manual: https://www.elastic.co/guide/cn/elasticsearch/guide/current/index.html
ELK Chinese community: https://elasticsearch.cn/
1. Deployment of Elasticsearch
1.1 Introduction
ElasticSearch, ES for short, is an open source, highly scalable, distributed full-text search engine that can store and retrieve data in near real time. It scales well and can grow to hundreds of servers handling petabytes of data. ES is also developed in Java and uses Lucene as its core for all indexing and search capabilities, but its intent is to hide the complexity of Lucene behind a simple RESTful API to make full-text search easy.
Here are some examples of how this search engine is used in practice:
GitHub: in early 2013, GitHub abandoned Solr in favor of ElasticSearch. It uses ElasticSearch to search 20 TB of data, including 1.3 billion files and 130 billion lines of code.
Wikipedia: its core search architecture is built on ElasticSearch.
SoundCloud: uses ElasticSearch to provide real-time and accurate music search services for 180 million users.
Baidu:
Currently, Baidu uses ElasticSearch extensively for text data analysis, collecting all kinds of indicator data and user-defined data from all Baidu servers. Through multi-dimensional analysis and display of this data, it helps locate and analyze instance-level or business-level anomalies. It currently covers more than 20 internal business lines at Baidu (including Casio, cloud analysis, network alliance, prediction, Library, direct number, wallet, risk control, etc.), with up to 100 machines and 200 ES nodes in a single cluster, importing 30 TB+ of data every day.
Sina: uses ElasticSearch to analyze 3.2 billion real-time logs.
Alibaba: uses ElasticSearch to build its own log collection and analysis system.
1.2 Preparations
I have uploaded the installation packages I used to Baidu Cloud; take them if you need them and spare yourself the pain of slow downloads 🤣
Link: https://pan.baidu.com/s/17m0LmmRcffQbfhjSikIHhA
Extraction code: L1lj
The first thing to note is that ES cannot be started as root; it must be installed and run as a regular user, so here we use a newly created user.
For the details you can also refer directly to https://www.cnblogs.com/gcgc/p/10297563.html
First download the installation package and decompress it; that part is simple, so I will not spell it out.
Then we create two folders, both in the ES directory
mkdir -p /usr/local/elasticsearch-6.7.0/logs/
mkdir -p /usr/local/elasticsearch-6.7.0/datas
These are the data directory and the log directory.
Then edit the configuration file elasticsearch.yml in the config folder (vim elasticsearch.yml):
cluster.name: myes
node.name: node3
path.data: your installation path/elasticsearch-6.7.0/datas
path.logs: your installation path/elasticsearch-6.7.0/logs
network.host: the IP of this node
http.port: 9200
discovery.zen.ping.unicast.hosts: ["node1", "node2", "node3"]
bootstrap.system_call_filter: false
bootstrap.memory_lock: false
http.cors.enabled: true
http.cors.allow-origin: "*"
cluster.name -- the name of the cluster, ElasticSearch by default
node.name -- the name of this node
path.data -- the path where data files are stored
path.logs -- the path where log files are stored
http.port -- the HTTP access port
discovery.zen.ping.unicast.hosts -- the list of hosts used for automatic cluster discovery
bootstrap.system_call_filter: false -- disables the system call (seccomp) filter check, which fails on some older systems
bootstrap.memory_lock: false -- do not lock the process memory
http.cors.enabled: true and http.cors.allow-origin: "*" -- enable cross-origin access so that tools such as elasticsearch-head can connect
We can also set the maximum and minimum heap memory in the jvm.options configuration file (a minimal sketch follows after the distribution commands below). Then distribute the whole directory to the other nodes; apart from node.name and network.host, the other two nodes get the same configuration as above.
cd /usr/local
scp -r elasticsearch-6.7.0/ node2:$PWD
scp -r elasticsearch-6.7.0/ node3:$PWD
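As a quick aside on the heap settings mentioned above, here is a minimal sketch, assuming the install path used in this article; the 1g values are illustrative and should be sized to your machine (keep -Xms and -Xmx equal):

```bash
# Sketch: set min and max JVM heap to the same (illustrative) value in jvm.options
sed -i 's/^-Xms.*/-Xms1g/' /usr/local/elasticsearch-6.7.0/config/jvm.options
sed -i 's/^-Xmx.*/-Xmx1g/' /usr/local/elasticsearch-6.7.0/config/jvm.options
```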
Because my first two virtual machines crashed somehow and my own computer does not have enough memory, I will just carry on with a single node 🤣
1.3 Modifying some system configurations
1.3.1 Maximum number of Open files for common Users
Problem error message description:
max file descriptors [4096] for elasticsearch process likely too low, increase to at least [65536]
Solution: raise the maximum number of files an ordinary user may open, otherwise ES may report this error on startup.
sudo vi /etc/security/limits.conf
* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096
Just paste these four lines
1.3.2 Kernel parameters (vm.max_map_count and fs.file-max)
sudo vi /etc/sysctl.conf
Add the following two lines
vm.max_map_count=655360
fs.file-max=655360
Run the sudo sysctl -p command to make the configuration take effect
Note: after making the above two changes, be sure to reconnect to Linux for them to take effect: close SecureCRT or XShell and open a new connection to Linux.
1.3.3 Actions after reconnecting the tool
Execute the following four commands and you’ll be fine
[hadoop@node01 ~]$ ulimit -Hn
131072
[hadoop@node01 ~]$ ulimit -Sn
65536
[hadoop@node01 ~]$ ulimit -Hu
4096
[hadoop@node01 ~]$ ulimit -Su
4096
1.3.4 Starting the ES Cluster
nohup /usr/local/elasticsearch-6.7.0/bin/elasticsearch 2>&1 &
After a successful startup, jps shows the ES service process, and you can access the page:
http://node3:9200/?pretty
Note: If a machine fails to start the service, go to the machine’s logs to see the error log
When this page appears, ES is already started. Admittedly, the page is very ugly.
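You can also check the cluster from the command line with the cluster health API; a quick sketch (node3 is the node name used in this article):

```bash
# Returns the cluster name, status (green/yellow/red), node count and shard statistics
curl -XGET "http://node3:9200/_cluster/health?pretty"
```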
1.3.5 The elasticsearch-head plugin
After the ES service is started, the raw interface is ugly. To view the information in the index library more conveniently, we can install the elasticsearch-head plugin, which gives a much nicer ES management interface.
First install Node.js
1.3.6 Installing Node.js
Node.js is a JavaScript runtime environment based on Chrome's V8 engine.
Node.js is a Javascript Runtime environment developed by Ryan Dahl and released in May 2009. It is essentially a wrapper around the Chrome V8 engine. Node.js is not a JavaScript framework, unlike CakePHP, Django, and Rails. Node.js is not a browser-side library. It can’t be compared to jQuery or ExtJS. Node.js is a development platform that allows JavaScript to run on the server side, making JavaScript a scripting language on a par with server-side languages like PHP, Python, Perl, and Ruby.
Installation steps reference: https://www.cnblogs.com/kevingrace/p/8990169.html
After downloading the installation package, we unzip it and do some configuration
sudo ln -s /usr/local/node-v8.1.0-linux-x64/lib/node_modules/npm/bin/npm-cli.js /usr/local/bin/npm
sudo ln -s /usr/local/node-v8.1.0-linux-x64/bin/node /usr/local/bin/node
Then modify the environment variables
sudo vim .bash_profile
export NODE_HOME=/usr/local/node-v8.1.0-linux-x64
export PATH=$PATH:$NODE_HOME/bin
Then source it (source .bash_profile).
If node and npm report their versions as above, the installation is fine.
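A quick way to confirm the installation, assuming the symlinks and PATH set up above are in place:

```bash
# Both commands should print a version number
node -v
npm -v
```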
1.3.7 ElasticSearch-head installation
Online installation is not recommended because of everyone's download speed (you know how it is); as before, I prepared the installation package in advance.
At this point we unzip it and modify gruntfile.js
cd /usr/local/elasticsearch-head
vim Gruntfile.js
In vim, press shift+: to enter command mode, then use /hostname to find the parameter and set hostname: 'your own node IP'.
After that:
cd /usr/local/elasticsearch-head/_site
vim app.js
As above, look for the http:// keyword and change localhost to your node IP
Then we can start the service
cd /usr/local/elasticsearch-head/node_modules/grunt/bin/
nohup ./grunt server >/dev/null 2>&1 &
After that, visiting http://your node IP:9100/ brings up the page, which is also much nicer to look at. As mentioned, my other two virtual machines died, so I can only start a single node here.
1.3.8 Stopping the elasticsearch-head plugin
Run the following command to find the elasticsearch-head plug-in process, and then run the kill -9 command to kill the process
sudo yum install net-tools
netstat -nltp | grep 9100
kill -9 88297
1.3.9 Kibana installation
Kibana basically works out of the box once decompressed.
Prepare an installation package and unzip it
cd /usr/local/kibana-6.7.0-linux-x86_64/config/
vi kibana.yml
The configuration is as follows:
server.host: "node3"
elasticsearch.hosts: ["http://node3:9200"]
Then we can get started
cd /usr/local/kibana-6.7.0-linux-x86_64
nohup bin/kibana >/dev/null 2>&1 &
All you need to do now is wait. Well, let's wait a little longer 😂
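While waiting, you can check from the shell whether Kibana is answering yet; a small sketch (Kibana listens on port 5601 by default, node3 is my node):

```bash
# Once Kibana is up this returns an HTTP response instead of "connection refused"
curl -I http://node3:5601
```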
The page here just tells you that there is no data in your cluster yet, so click the first option.
At this point, the installation is complete. From now on, the way we interact with ES is mainly Dev Tools, the developer tools in Kibana.
2. ElasticSearch
2.1 Some concepts of ES
Let’s use a traditional relational database analogy
Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices -> Types -> Documents -> Fields
We are all familiar with MySQL: it has multiple databases, each database has multiple tables, each table has many rows, and each row has columns.
In ES, the counterpart of a database is an index (Indices); under an index there are types, which are similar to tables; each piece of data is a document, and its attributes are fields.
2.2 Explanation of some proper nouns
2.2.1 index
An index is a collection of documents with somewhat similar characteristics. For example, you can have an index for customer data, another index for catalog data, and an index for order data. An index is identified by a name (which must be all lowercase) and is used when indexing, searching, updating, and deleting documents corresponding to that index. You can define as many indexes as you want in a cluster.
2.2.2 Type
In an index, you can define one or more types. A type is a logical classification/partition of your index, the semantics of which are entirely up to us. Typically, a type is defined for documents that have a common set of fields. For example, let’s say you run a blogging platform and store all your data in an index. In this index, you can define one type for user data, another type for blog data, and, of course, another type for comment data.
2.2.3 Field
It is similar to a column of a database table; fields classify and identify the document's data by attribute.
2.2.4 Mapping
A mapping is a bit like a JavaBean definition: it describes how each field is processed and what constraints apply, such as the field's data type, default value, analyzer, and whether it is indexed. The rules ES uses internally to handle the data are also set in the mapping. Processing data according to the optimal rules gives a large performance improvement, which is why you need to design the mapping and think about how to build it for better performance.
2.2.5 Document
A document is the basic unit of information that can be indexed. For example, you can have a document for a customer, a document for a product and, of course, a document for an order. Documents are represented in JSON (JavaScript Object Notation), a ubiquitous Internet data interchange format. You can store as many documents as you want in an index/type. Note that although a document physically lives in an index, it must be assigned to an index and a type within that index.
2.2.6 Near-real-time NRT
Elasticsearch is a near real time search platform. This means that there is a slight delay (usually less than 1 second) from indexing a document until it can be searched
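If you need a freshly indexed document to be searchable immediately (for example in a test), you can request an explicit refresh on the write; a sketch, where the index and type names are made up for illustration:

```bash
# refresh=true forces a refresh so the document is searchable right away (costs performance if overused)
curl -XPUT "http://node3:9200/test_nrt/doc/1?refresh=true&pretty" \
  -H 'Content-Type: application/json' \
  -d '{"title": "near real time demo"}'
```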
2.2.7 Shards & Replicas
An index can store amounts of data beyond the hardware limits of a single node. For example, an index with one billion documents taking up 1 terabyte of disk space may not fit on the disk of any single node, or a single node may be too slow to serve the search requests alone. To solve this problem, Elasticsearch can divide the index into multiple pieces called shards. When you create an index, you can specify the number of shards you want. Each shard is itself a fully functional, independent "index" that can be placed on any node in the cluster. Sharding is important for two reasons:
Allows you to split/expand your content horizontally.
Allows you to do distributed, parallel operations on top of shards, improving performance/throughput.
How a shard is distributed and how its documents are aggregated back into search requests is completely managed by Elasticsearch and is transparent to you as the user.
In a network environment where failure can happen at any time, having a failover mechanism is useful in cases where a shard/node somehow becomes offline or disappears for any reason. For this purpose Elasticsearch allows you to create one or more copies of a shard called Replicas, or replicas in Chinese.
Replicas are important for two main reasons:
Provide high availability in the event of shard/node failure. For this reason, a replica shard is never placed on the same node as its primary shard.
Expand your search volume/throughput, because searches can run in parallel across all replicas.
In summary, each index can be divided into multiple shards, and an index can also be replicated zero times (meaning no replicas) or more. Once replicated, each index distinguishes primary shards (the originals that the replicas are copied from) from replica shards (copies of the primary shards). The numbers of shards and replicas can be specified when the index is created. After the index is created, you can dynamically change the number of replicas at any time, but you cannot change the number of shards afterwards.
By default, each index in Elasticsearch is allocated five primary shards and one replica, which means that if you have at least two nodes in your cluster, your index will have five primary shards and five replica shards (one complete copy), for a total of ten shards per index.
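You can inspect how an index's shards and replicas are laid out with the _cat APIs; a quick sketch:

```bash
# Lists every index with its primary shard and replica counts
curl -XGET "http://node3:9200/_cat/indices?v"
# Shows each individual shard, whether it is primary (p) or replica (r), and which node holds it
curl -XGET "http://node3:9200/_cat/shards?v"
```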
2.2.8 curl, a management tool
curl is an open source file transfer tool that uses URL syntax and works on the command line. With curl you can easily make common GET/POST requests; simply think of it as a tool for accessing URLs from the command line. curl is available in CentOS's default repository; if you do not have it, install it with yum:
yum -y install curl
curl
-X  specifies the HTTP request method: HEAD, GET, POST, PUT, DELETE
-d  specifies the data to send
-H  specifies an HTTP request header
2.3 Creating indexes with XPUT
2.3.1 Creating an index
At this point, the only index we have is the one Kibana created by default.
Execute the following statement in our Dev Tools for Kibana
curl -XPUT http://node3:9200/blog01/?pretty
Kibana will automatically help us with some formatting
After successful execution, we can see the so-called nice page 🤣 again at this time
2.3.2 Inserting a piece of data
curl -XPUT http://node3:9200/blog01/article/1?pretty -d '{"id": "1", "title": "What is ELK"}'
The PUT verb adds a document of type article with ID 1 to the blog01 index. The URL path follows the pattern index/type/id (index / document type / document ID).
You can see this in the data browsing module by clicking blog01
Content-Type header [application/x-www-form-urlencoded] is not supported
The reason is that ES added a stricter security mechanism: strict content-type checking, which also acts as a layer of protection against cross-site request forgery attacks. The official explanation refers to the http.content_type.required setting.
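The practical fix is simply to declare the JSON content type on the request; a sketch of the insert above with the header added:

```bash
curl -XPUT "http://node3:9200/blog01/article/1?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"id": "1", "title": "What is ELK"}'
```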
2.3.3 Querying Data
curl -XGET http://node3:9200/blog01/article/1?pretty
This command can be executed either in Kibana or on any node of the cluster.
2.3.4 Updating the document
The update operation is basically the same as the insert: a request with an existing ID updates the document, while one with a new ID inserts it, similar to a save-or-update operation in Java.
curl -XPUT http://node3:9200/blog01/article/1?pretty -d '{"id": "1", "title": "What is elasticsearch"}'
No screenshot here; just try it yourself.
2.3.5 Searching for documents
curl -XGET "http://node3:9200/blog01/article/_search? q=title:elasticsearch"
Copy the code
2.3.6 Deleting documents and Indexes
Delete the document
curl -XDELETE "http://node3:9200/blog01/article/1? pretty"
Copy the code
Remove the index
curl -XDELETE http://node3:9200/blog01?pretty
Running this really would delete the index, so I will not execute it here.
2.4 Conditional queries in ES
So let’s just simulate some data
POST /school/student/_bulk
{ "index": { "_id": 1 }}
{ "name" : "tellYourDream", "age": 25, "sex": "boy", "birth": "1995-01-01", "about": "i like bigdata" }
{ "index": { "_id": 2 }}
{ "name" : "guanyu", "age": 21, "sex": "boy", "birth": "1995-01-02", "about": "i like diaocan" }
{ "index": { "_id": 3 }}
{ "name" : "zhangfei", "age": 18, "sex": "boy", "birth": "1998-01-02", "about": "i like travel" }
{ "index": { "_id": 4 }}
{ "name" : "diaocan", "age": 20, "sex": "girl", "birth": "1996-01-02", "about": "i like travel and sport" }
{ "index": { "_id": 5 }}
{ "name" : "panjinlian", "age": 25, "sex": "girl", "birth": "1991-01-02", "about": "i like travel and wusong" }
{ "index": { "_id": 6 }}
{ "name" : "caocao", "age": 30, "sex": "boy", "birth": "1988-01-02", "about": "i like xiaoqiao" }
{ "index": { "_id": 7 }}
{ "name" : "zhaoyun", "age": 31, "sex": "boy", "birth": "1997-01-02", "about": "i like travel and music" }
{ "index": { "_id": 8 }}
{ "name" : "xiaoqiao", "age": 18, "sex": "girl", "birth": "1998-01-02", "about": "i like caocao" }
{ "index": { "_id": 9 }}
{ "name" : "daqiao", "age": 20, "sex": "girl", "birth": "1996-01-02", "about": "i like travel and history" }
Copy it directly into Kibana and execute it.
You can also see it on the es-head side
2.4.1 Querying with match_all
GET /school/student/_search?pretty
{
  "query": {
    "match_all": {}
  }
}
Problem: match_all returns all the data, but the real business requirement is usually not to fetch everything, only what you need; and for the ES cluster, retrieving everything at once easily causes GC pressure. So we need to learn how to retrieve data efficiently.
2.4.2 Querying Information by Key Fields
GET /school/student/_search?pretty
{
  "query": {
    "match": {"about": "travel"}
  }
}
What if you now want to query people who like travel but are not boys? A [match] query does not support multiple fields under one match.
2.4.3 Composite query of bool
When multiple query conditions are combined, they can be wrapped in a bool. A bool can contain must, must_not and should (should roughly means "or").
GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": { "match": {"about": "travel"}},
      "must_not": {"match": {"sex": "boy"}}
    }
  }
}
2.4.4 should in the compound query of bool
should means optional: when the bool also contains a must clause, documents do not have to match the should clause, but those that do are scored higher. Example: query for people who like to travel; boys among them are ranked first, but the others are still returned.
GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": { "match": {"about": "travel"}},
      "should": {"match": {"sex": "boy"}}
    }
  }
}
2.4.5 Term match
Use term for exact matching, e.g. numbers, dates, booleans, or not_analyzed strings (text data that has not been analyzed). Syntax:
{ "term": { "age": 20 }}
{ "term": { "date": "2018-04-01" }}
{ "term": { "sex"Boy: ""}}
{ "term": { "about": "travel" }}
Copy the code
Example: query for people who like to travel
GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": { "term": {"about": "travel"}},
      "should": {"term": {"sex": "boy"}}
    }
  }
}
2.4.6 Using terms to match multiple values
GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": { "terms": {"about": ["travel", "history"]}}
    }
  }
}
Copy the code
2.4.7 Range filtering
The Range filter allows us to find data in a specified Range:
gt  -- greater than
gte -- greater than or equal to
lt  -- less than
lte -- less than or equal to
Example: find students older than 20 and at most 25 years old
GET /school/student/_search?pretty
{
  "query": {
    "range": {
      "age": {"gt": 20, "lte": 25}
    }
  }
}
2.4.8 Exists and Missing filters
Exists and Missing filters can be used to find out whether a field is included or not in a document
Example: Find a document that contains age in its field
GET /school/student/_search?pretty
{
  "query": {
    "exists": {
      "field": "age"
    }
  }
}
2.4.9 Multi-conditional filtering of bool
bool can also be used for multi-condition filtering, just like with match:
must     -- all query conditions must match, equivalent to and
must_not -- none of the query conditions may match, equivalent to not
should   -- at least one query condition should match, equivalent to or
Example: Filter out students whose about field contains travel and who are older than 20 but younger than 30
GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": [
        {"term": {
          "about": {
            "value": "travel"
          }
        }},
        {"range": {
          "age": {
            "gte": 20,
            "lte": 30
          }
        }}
      ]
    }
  }
}
2.4.10 Merge query and filter criteria
Complex query statements usually also work together with filter clauses, which can be cached; this is done with the filter statement.
Example: query a document that likes to travel and is 20 years old
GET /school/student/_search?pretty
{
  "query": {
    "bool": {
      "must": {"match": {"about": "travel"}},
      "filter": [{"term": {"age": 20}}]
    }
  }
}
2.5 “Mappings & Settings” is an important concept in ES
Mappings are used to define the type of each field in ES. Every field in ES has a type; if we do not define one ourselves, ES determines the field's type automatically from the first value we insert.
DELETE document
PUT document
{
"mappings": {
"article" : {
"properties":
{
"title" : {"type": "text"},
"author" : {"type": "text"},
"titleScore" : {"type": "double"}
}
}
}
}
Then you can check the result with GET document/article/_mapping.
Settings are used to define the number of shards and replicas (a small sketch of setting them explicitly follows after the example below).
DELETE document
PUT document
{
"mappings": {
"article" : {
"properties":
{
"title" : {"type": "text"},
"author" : {"type": "text"},
"titleScore" : {"type": "double"}
}
}
}
}
GET /document/_settings
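For completeness, a minimal sketch of creating an index with explicit shard and replica counts (the index name document2 and the numbers are just for illustration):

```bash
curl -XPUT "http://node3:9200/document2?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }'
```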
Since I only have one node, I will not bother with an es-head screenshot here; it would not show much anyway 😂
3. ES paging solutions
Simulate some data and then copy it directly to Kibana
DELETE us
POST /_bulk
{ "create": { "_index": "us", "_type": "tweet", "_id": "1" }}
{ "email" : "[email protected]", "name" : "John Smith", "username" : "@john" }
{ "create": { "_index": "us", "_type": "tweet", "_id": "2" }}
{ "email" : "[email protected]", "name" : "Mary Jones", "username" : "@mary" }
{ "create": { "_index": "us", "_type": "tweet", "_id": "3" }}
{ "date" : "2014-09-13", "name" : "Mary Jones", "tweet" : "Elasticsearch means full text search has never been so easy", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "4" }}
{ "date" : "2014-09-14", "name" : "John Smith", "tweet" : "@mary it is not just text, it does everything", "user_id" : 1 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "5" }}
{ "date" : "2014-09-15", "name" : "Mary Jones", "tweet" : "However did I manage before Elasticsearch?", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "6" }}
{ "date" : "2014-09-16", "name" : "John Smith", "tweet" : "The Elasticsearch API is really easy to use", "user_id" : 1 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "7" }}
{ "date" : "2014-09-17", "name" : "Mary Jones", "tweet" : "The Query DSL is really powerful and flexible", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "8" }}
{ "date" : "2014-09-18", "name" : "John Smith", "user_id" : 1 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "9" }}
{ "date" : "2014-09-19", "name" : "Mary Jones", "tweet" : "Geo-location aggregations are really cool", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "10" }}
{ "date" : "2014-09-20", "name" : "John Smith", "tweet" : "Elasticsearch surely is one of the hottest new NoSQL products", "user_id" : 1 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "11" }}
{ "date" : "2014-09-21", "name" : "Mary Jones", "tweet" : "Elasticsearch is built for the cloud, easy to scale", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "12" }}
{ "date" : "2014-09-22", "name" : "John Smith", "tweet" : "Elasticsearch and I have left the honeymoon stage, and I still love her.", "user_id" : 1 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "13" }}
{ "date" : "2014-09-23", "name" : "Mary Jones", "tweet" : "So yes, I am an Elasticsearch fanboy", "user_id" : 2 }
{ "create": { "_index": "us", "_type": "tweet", "_id": "14" }}
{ "date" : "2014-09-24", "name" : "John Smith", "tweet" : "How many more cheesy tweets do I have to write?", "user_id" : 1 }
3.1 Shallow paging with from + size
In the normal query process, if I want to query the first 10 pieces of data:
- The client sends a request to a node
- The node forwards to each shard and queries the first 10 items on each shard
- The results are returned to the nodes, the data is consolidated, and the first 10 items are extracted
- Return to the requesting client
from defines the offset of the target data, and size defines the number of entries returned.
Example: query the first five entries
GET /us/_search?pretty
{
  "from" : 0, "size" : 5
}
Starting from an offset of 5, query the next 5 entries:
GET /us/_search?pretty
{
  "from" : 5, "size" : 5
}
This is also what we later do through the Java API to query ES data, but shallow paging is only suitable for small amounts of data: as from grows, the query takes longer and becomes drastically less efficient.
Advantages: from + size is quite efficient when the amount of data is not large.
Disadvantages: with very large data sets, from + size paging loads all the records up to the requested offset into memory, which is not only very slow but can also make ES run out of memory and die.
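Note that ES also caps from + size with the index.max_result_window setting (10,000 by default); a sketch of checking it and, if you really must, raising it for the us index (the 20000 value is illustrative):

```bash
# Show the current index settings
curl -XGET "http://node3:9200/us/_settings?pretty"
# Raise the from+size ceiling for this index (trades memory for deeper pages)
curl -XPUT "http://node3:9200/us/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.max_result_window": 20000}'
```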
3.2 Deep pagination scroll
For the shallow paging described above, when Elasticsearch responds to a request, it must determine the order of docs and order the responses.
If the requested page is near the beginning (say 20 docs per page, first few pages), Elasticsearch has no problem, but for a deep page (say page 20), Elasticsearch has to take the docs for pages 1 through 20 from every shard, discard pages 1 through 19, and keep only page 20.
The solution is to use scroll. A scroll keeps a snapshot of the current index segments in a cache (the snapshot is taken at the moment you issue the scroll query). Scrolling has two steps: initialization and traversal. 1. During initialization, all the search results that match the query are cached, like taking a snapshot. 2. During traversal, data is pulled from that snapshot batch by batch.
Initialization:
GET /us/_search?scroll=3m
{
  "query": {"match_all": {}},
  "size": 3
}
Initialization looks like a normal search; scroll=3m means the scroll context is kept for 3 minutes, and size: 3 means 3 documents are returned per batch.
During traversal, take the scroll_id returned by the previous step and query again with the scroll parameter; repeat until the returned data is empty, which means the traversal is complete.
GET /_search/scroll
{
  "scroll" : "1m",
  "scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAShGFnE5ZURldkZjVDFHMHVaODdBOThZVWcAAAAAAAEoSRZxOWVEZXZGY1QxRzB1Wjg3QTk4WVVnAAAAAAABKEcWcTllRGV2RmNUMUcwdVo4N0E5OFlVZwAAAAAAAShIFnE5ZURldkZjVDFHMHVaODdBOThZVWcAAAAAAAEoShZxOWVEZXZGY1QxRzB1Wjg3QTk4WVVn"
}
Note: the scroll parameter must be passed on every call, and you do not need to specify the index or type. Do not set the keep-alive time too long, since it occupies memory.
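When you are done iterating, it is also good practice to release the scroll context explicitly instead of waiting for it to time out; a sketch (the scroll_id is whatever the previous response returned):

```bash
curl -XDELETE "http://node3:9200/_search/scroll" \
  -H 'Content-Type: application/json' \
  -d '{"scroll_id": "<scroll_id from the previous response>"}'
```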
Contrast:
Shallow paging: each query goes to the index library (local files) on every shard, fetches pageNum * pageSize documents, then discards the earlier ones and keeps only the requested page. The results from multiple shards are then merged, sorted again, and trimmed as needed.
Deep paging (scroll): all the data matching the query is put into memory once, and subsequent pages are served from memory. Compared with shallow paging, this avoids reading the disk repeatedly.
4. ES Chinese word segmentation IK
By default, ES handles English text segmentation well, but, just as with Lucene, Chinese full-text retrieval requires a Chinese analyzer, and as with Lucene you need to integrate the IK analyzer before using Chinese full-text search. So next we install the IK analyzer to get Chinese word segmentation.
4.1 Download
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.7.0/elasticsearch-analysis-ik-6.7.0.zip
Then create a new directory in the ES directory to store the plug-in
mkdir -p /usr/local/elasticsearch-6.7.0/plugins/analysis-ik
unzip elasticsearch-analysis-ik-6.7.0.zip -d /usr/local/elasticsearch-6.7.0/plugins/analysis-ik/
Then you can distribute it to other machines
cd /usr/local/elasticsearch-6.7.0/plugins
scp -r analysis-ik/ node2:$PWD
Then you need to restart the ES service
ps -ef | grep elasticsearch | grep bootstrap | awk '{print $2}' | xargs kill -9
nohup /usr/local/elasticsearch-6.7.0/bin/elasticsearch 2>&1 &
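Once ES is back up, you can check that the plugin is actually loaded by running the analyzer directly through the _analyze API; a sketch (the sample text is arbitrary):

```bash
curl -XPOST "http://node3:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"analyzer": "ik_max_word", "text": "中华人民共和国"}'
```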
4.2 Experience
Create the index library in Kibana and configure the IK word divider
DELETE iktest

PUT /iktest?pretty
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "ik" : {
          "tokenizer" : "ik_max_word"
        }
      }
    }
  },
  "mappings" : {
    "article" : {
      "dynamic" : true,
      "properties" : {
        "subject" : {
          "type" : "text",
          "analyzer" : "ik_max_word"
        }
      }
    }
  }
}
When creating the index library, we specified ik_max_word as the analyzer, which segments Chinese text at the finest granularity. Now let's look at the effect.
Here you can see that my ID is split into several pieces
You can also highlight the matched terms in your query results with the following syntax, for an effect like this:
POST /iktest/article/_search?pretty
{
  "query" : { "match" : { "subject" : "Fight pneumonia" }},
  "highlight" : {
    "pre_tags" : ["<font color=red>"],
    "post_tags" : ["</font>"],
    "fields" : {
      "subject" : {}
    }
  }
}
4.3 Configuring hot Word Updates
You can see that my nickname gets split into several pieces, because from the plugin's point of view it is not a word in its dictionary; but we can tell it that it is a hot word 🤣
For example, names like "the master of shadow flow" or Cai Xukun are also split apart. So how do we configure this? We often need to update web hot words in real time, and we can solve this by serving a remote dictionary from Tomcat.
4.4 Installing a Tomcat
Just install tomcat
Then go into Tomcat's ROOT directory; mine is:
cd /usr/local/apache-tomcat-8.5.34/webapps/ROOT
Then create a new dictionary file there, with one hot word per line:
vi hot.dic
Then save and exit
Next start Tomcat; if it starts successfully, jps will show a Bootstrap process.
hot.dic can now be accessed over HTTP.
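A quick way to confirm Tomcat is serving the dictionary (using the IP that appears in the configuration below):

```bash
# Should print the contents of hot.dic
curl http://192.168.200.11:8080/hot.dic
```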
4.5 Modifying the analyzer configuration
Now modify the IK configuration file (IKAnalyzer.cfg.xml under the plugin directory).
Open it; there are some Chinese comments, and you will see the remote extension dictionary entry.
Uncomment that line and point it at the dictionary we just put on Tomcat, which is http://192.168.200.11:8080/hot.dic
This configuration file is then distributed to the other two nodes
Then just restart ES; when we run the segmentation again, our new keyword is recognized.
Finally
This article was mainly about deployment; later articles will cover the APIs and related operations, so as not to make this one too long 😶