An introduction to Elasticsearch
Elasticsearch is an open source, distributed search engine built on top of Apache Lucene. Lucene is an open source search engine library, written in Java, that lets you add search to Java programs. Elasticsearch takes full advantage of Lucene and extends it to make storing, indexing, and searching faster and easier. So how does Elasticsearch solve the search problem?
1. It provides fast queries. Elasticsearch uses Lucene as its base layer; Lucene is a high-performance search library that indexes all data by default.
2. It ensures relevant results. Several algorithms can be used to calculate a relevance score for each matching document, and results are then sorted by that score.
3. It goes beyond exact matching: it can handle misspellings, support word variants, use statistics, and give automatic suggestions.
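As a sketch of the misspelling case, a match query can be made fuzzy with the `fuzziness` parameter; the index and field names (get-together, title) and the misspelled query text below are illustrative:

```json
{
  "query": {
    "match": {
      "title": {
        "query": "elasticsaerch",
        "fuzziness": "AUTO"
      }
    }
  }
}
```

This body would be sent to a search endpoint such as `localhost:9200/get-together/_search`, and matching documents are still returned despite the typo in the query text.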
A typical Elasticsearch use case
1. Use ES as the primary back-end store for a new system.
2. Add ES to an existing system.
3. Use ES as the back end of a ready-made solution, alongside other components.
How does ES organize data
ES has a logical design and a physical design.
Logical design:
- Document: equivalent to a row in a MySQL table.
- Type: equivalent to a table in MySQL.
- Index: equivalent to a database.
Physical design — nodes and shards:
- A node is an instance of ES; each started ES process adds one instance, and a machine can run one or more nodes.
- A shard is the smallest unit ES works with. A shard is a Lucene index: a directory of files that contains an inverted index. A shard can be a primary shard or a replica shard; replicas serve searches, and a replica can be promoted to a new primary if the original primary shard is lost.
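One practical consequence of this design: the number of primary shards is fixed when an index is created, while the number of replicas can be changed later. A minimal, illustrative index-creation body (the values are arbitrary) might look like:

```json
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}
```

This would be sent, for example, with `curl -XPUT 'localhost:9200/get-together' -d '...'` when creating the index.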
3 kinds of field types in ES
- Core types: strings, numbers, dates, and booleans.
- Arrays and multi-fields: store multiple values of the same core type in a single field.
- Predefined fields: such as _ttl and _timestamp.
Basic operations on data in ES include indexing, updating, and deleting documents.
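A minimal mapping sketch using core types (the event type and its field names follow the get-together sample data used elsewhere in this article; the exact fields are illustrative):

```json
{
  "event": {
    "properties": {
      "host":      { "type": "string" },
      "date":      { "type": "date" },
      "attendees": { "type": "integer" }
    }
  }
}
```

Such a mapping would be supplied when creating the index, or via the put-mapping API.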
Search data
1. Scope of the search request
% curl 'localhost:9200/_search' -d '...'                                 // search the whole cluster
% curl 'localhost:9200/get-together/_search' -d '...'                    // search the get-together index
% curl 'localhost:9200/get-together/event/_search' -d '...'              // search the event type in get-together
% curl 'localhost:9200/_all/event/_search' -d '...'                      // search the event type in all indices
% curl 'localhost:9200/*/event/_search' -d '...'                         // same as the line above
% curl 'localhost:9200/get-together,other/event,group/_search' -d '...'  // search the event and group types in the get-together and other indices
% curl 'localhost:9200/+get-toge*,-get-together/_search' -d '...'        // search indices that start with get-toge, but not get-together
Basic building blocks of a search request
- query: the most important component of a search request. It configures which documents to return, ranked by relevance score, including which documents you do not want returned. It is configured with the query DSL and filter DSL.
- size: the number of documents to return.
- from: used together with size for pagination.
- _source: configures which fields to return.
- sort: by default, results are sorted by document score. If you don't care about scores, you can add sort criteria to control which documents come back first.
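The building blocks above can be combined in a single request body. A sketch (the field names title and date are illustrative):

```json
{
  "query":   { "match": { "title": "elasticsearch" } },
  "from":    10,
  "size":    10,
  "_source": ["title", "date"],
  "sort":    [{ "date": { "order": "desc" } }]
}
```

Here `from: 10, size: 10` returns the second page of ten results, only the title and date fields are returned, and results are ordered by date rather than score.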
The difference between match queries and term filters
Match query: determines whether each document matches; if it matches, the document's score is calculated and the document is returned, otherwise the search moves on to the next document.
Term filter: determines whether each document matches; if it matches, the result is cached, otherwise the document is skipped. The main difference between the two is whether there is a scoring step.
How Elasticsearch analyzes data
There are two ways to specify the analyzer a field uses: set it for a specific index when creating that index, or set a global analyzer in the ES configuration file. Analysis has four steps:
1. Character filtering — transforms the characters of the text.
2. Tokenization — breaks the text into one or more tokens.
3. Token filtering — transforms each token with token filters.
4. Token indexing — stores the resulting tokens in the index.
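The match-versus-filter distinction can be sketched in one request. Assuming a recent ES version where filters live inside a bool query (the field names title and tags are illustrative):

```json
{
  "query": {
    "bool": {
      "must":   [{ "match": { "title": "elasticsearch" } }],
      "filter": [{ "term":  { "tags": "search" } }]
    }
  }
}
```

The match clause in `must` contributes to the score, while the term clause in `filter` only includes or excludes documents, is cacheable, and never affects scoring.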
Scoring mechanism for Elasticsearch
Term frequency: the more times a term appears in a document, the more relevant that document is to the search.
Inverse document frequency: the more documents a term appears in across the index, the less important the term is to the score.
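To see how these two factors contribute to a particular document's score, a search request can ask for an explanation by setting `"explain": true` (the query itself is illustrative):

```json
{
  "query":   { "match": { "title": "elasticsearch" } },
  "explain": true
}
```

The response then includes, for each hit, a breakdown of the score into its term-frequency, inverse-document-frequency, and normalization components.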
Use aggregation to explore the data
In many cases, users don’t care about specific search results, they want a set of statistics, such as hot topics in the news, revenue trends for different products, number of visitors to the site, etc. The aggregation feature of ES solves this problem well. Aggregation includes metric aggregation and bucket aggregation. Metric aggregation refers to the statistical analysis of a set of documents, including maximum and minimum values, standard deviation, etc. Bucket aggregation splits matching documents into one or more buckets and tells you how many documents are in each bucket. With bucket aggregates, you can nest other aggregates, with sub-aggregates running on each bucket created by the upper aggregate.
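A sketch of a bucket aggregation with a nested metric sub-aggregation, which computes one metric per bucket (the field names tags and attendees are illustrative; `"size": 0` suppresses the search hits so only the statistics are returned):

```json
{
  "size": 0,
  "aggs": {
    "by_tag": {
      "terms": { "field": "tags" },
      "aggs": {
        "max_attendees": { "max": { "field": "attendees" } }
      }
    }
  }
}
```

The terms aggregation creates one bucket per tag with a document count, and the max sub-aggregation runs inside each bucket, just as described above.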
Relationships between documents
Some documents are inherently related: there are related entities in the data. The following approaches can be used to model relationships between ES documents.
- Object type: allows you to use an object, or an array of objects, as the value of a document field. The problem with object types is that all the data is stored in the same document, so a search can unexpectedly match across different sub-objects.
- Nested documents: keep the objects in separate Lucene documents, which avoids those unexpected cross-object matches.
- Parent-child relationships: use completely separate ES documents for the different kinds of data, but declare them as parent and child.
- Denormalization: duplicate data across documents instead of modeling the relationship.
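A nested mapping sketch for the case above (book-era syntax with a type name and string fields; the group type and member fields are illustrative):

```json
{
  "group": {
    "properties": {
      "name": { "type": "string" },
      "members": {
        "type": "nested",
        "properties": {
          "name":  { "type": "string" },
          "title": { "type": "string" }
        }
      }
    }
  }
}
```

With `"type": "nested"`, each member object is indexed as its own hidden Lucene document, so a query on member name and member title matches only within the same member, not across two different members.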
Ability to scale
Scalability is a very important factor for handling more indexing and search requests, or for handling them faster. Scaling mainly involves adding nodes, removing nodes, and electing a master node.
Improve performance
To improve the performance of ES, the following methods can be used:
1. Combine HTTP requests: batch index, update, and delete operations using the bulk APIs.
2. Optimize the handling of Lucene segments:
(1) Tune the frequency of refresh and flush. A flush writes indexed data from memory to disk.
(2) Tune merges. Data is stored in immutable sets of files, that is, in segments. The more you index, the more segments are created, and since searching through too many segments is slow, smaller segments are merged into larger ones in the background.
(3) Store and store throttling. ES limits the per-second write rate so that merges do not overwhelm the I/O system.
3. Make full use of caches, including the filter cache, the shard query cache, and the JVM heap and operating-system caches.
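A sketch of a bulk request body for point 1 (newline-delimited JSON; the index and type names follow the get-together sample data, and the documents are illustrative):

```json
{ "index": { "_index": "get-together", "_type": "group" } }
{ "name": "Denver Elasticsearch" }
{ "index": { "_index": "get-together", "_type": "group" } }
{ "name": "San Francisco Elasticsearch" }
```

Each action line is followed by its document on the next line, and the whole body (ending with a newline) is sent in one request to `localhost:9200/_bulk`, indexing both documents with a single round trip.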