Introduction to ElasticSearch

ElasticSearch is a search server based on Lucene. It provides a distributed, multi-user full-text search engine with a RESTful web interface. Developed in Java and released as open source under the Apache license, ElasticSearch is a popular enterprise search engine. It is designed for cloud computing and offers near real-time search that is stable, reliable, fast, and easy to install and use.

ElasticSearch characteristics

1. It can run as a large distributed cluster (hundreds of servers) that processes PB-level data for large companies, or on a single machine for small companies.
2. ElasticSearch combines full-text search, data analysis, and distributed technology.
3. For the user it works out of the box and is very simple to use.
4. Databases, which are built for transactions and online transaction processing, fall short in many areas that call for special features such as full-text indexing, synonym handling, relevance ranking, complex data analysis, and near real-time processing of massive data. ElasticSearch complements traditional databases and provides a lot of functionality that a database does not.
5. It is distributed by default. Each node always assumes that it is, or will be, part of a cluster and joins a cluster as soon as it starts. The peer-to-peer architecture avoids single points of failure. Nodes automatically connect to the other nodes in the cluster for data exchange and monitoring, which includes automatic replication of index shards.

Core concepts of ElasticSearch

Near real-time has two meanings: there is a small delay (about 1 second) between the time data is written and the time it can be searched, and search and analysis on ES complete within seconds.

Cluster: a collection of one or more nodes (servers) that together hold your entire data and provide joint indexing and search capabilities across all nodes. A cluster is identified by a unique name, which by default is “elasticsearch”. This name is important because a node can only be part of a cluster if it is set to join that cluster by name.

Node: a single server in a cluster. A node also has a name (randomly assigned by default), which matters for operations and maintenance. By default a node joins the cluster named “elasticsearch”, so several nodes started this way automatically form one cluster. A single node can also form a cluster on its own.

Index: a named collection of documents that have a similar structure, such as a customer index, a category index, or an order index. An index contains many documents and represents a class of similar or identical documents. For example, a product index might hold all of the product data, that is, all of the product documents.

Type: each index can have one or more types. A type is a logical category of data within an index, and the documents under one type have the same fields.

Full-text retrieval

In full-text retrieval, an indexing program scans every word in an article and builds an index entry for each word, recording how many times and where the word appears in the article. When the user submits a query, the retrieval program searches the index built in advance and returns the results to the user. The process is similar to looking up a word through the index table of a dictionary. Full-text search works on the data stored in the search engine's database.

Inverted index

Introduction

The inverted index stems from the practical need to find records by the value of an attribute. Each entry in such an index contains an attribute value and the addresses of the records that have that value. Because the attribute value determines the position of the record, rather than the record determining the attribute value, it is called an inverted index.

The inverted index contains two parts:

Term dictionary: records the terms of all documents, and the association from each term to its inverted list.

The term dictionary is generally large and can be implemented with a B+ tree or a hash table with chaining for high-performance inserts and queries.

Inverted list: records the set of documents that contain the term, and consists of inverted index entries.

Inverted index entries contain:

  • Document ID
  • Term frequency (TF): the number of occurrences of the term in the document, used for relevance scoring
  • Position: the position of the token in the document, used for phrase search
  • Offset: the start and end character positions of the term, used for highlighting

Example:
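A minimal sketch of the structure described above, assuming a simple regex tokenizer and lowercasing in place of a real analyzer. The function name build_inverted_index and the sample documents are made up for illustration and are not part of ElasticSearch or Lucene.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to an inverted list of entries (doc ID, TF, positions, offsets)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        per_term = defaultdict(lambda: {"positions": [], "offsets": []})
        # Position is the ordinal of the token; offset is its character span.
        for position, match in enumerate(re.finditer(r"\w+", text)):
            term = match.group().lower()
            per_term[term]["positions"].append(position)
            per_term[term]["offsets"].append((match.start(), match.end()))
        for term, info in per_term.items():
            index[term].append({
                "doc_id": doc_id,
                "tf": len(info["positions"]),
                "positions": info["positions"],
                "offsets": info["offsets"],
            })
    return dict(index)


# Hypothetical sample documents.
docs = {
    1: "Mastering Elasticsearch",
    2: "Elasticsearch Server",
    3: "Elasticsearch Essentials and Elasticsearch basics",
}
index = build_inverted_index(docs)
```

Looking up index["elasticsearch"] returns one entry per document that contains the term; the entry for document 3 has a term frequency of 2 and two positions, which is the information phrase search and highlighting rely on.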

Principles

Introduction to Lucene

Lucene is a high-performance, extensible information retrieval (IR) library. It is a mature, free, open source search library written in Java and released under the Apache license. Lucene is only a software library: to make use of it, you still have to develop an application that calls it.

Analyzer

Text analysis is performed by an analyzer.

An analyzer is a component that specializes in tokenization and consists of three parts:

  • Character Filters: process the raw text, for example by stripping HTML
  • Tokenizer: splits the text into tokens according to rules
  • Token Filters: process the resulting tokens, for example lowercase them, remove stopwords, add synonyms
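The three stages can be sketched in plain Python. This is only an illustration of the order in which they run, not how ElasticSearch implements them; the stopword list, the synonym table, and all function names are assumptions made for the example.

```python
import re
from html import unescape

STOPWORDS = {"a", "an", "the", "is", "of"}  # tiny illustrative list
SYNONYMS = {"quick": ["fast"]}              # hypothetical synonym table


def char_filter(text):
    """Character filter: strip HTML tags from the raw text."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def tokenizer(text):
    """Tokenizer: split the text into tokens on non-word characters."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Token filters: lowercase, drop stopwords, add synonyms."""
    out = []
    for token in tokens:
        token = token.lower()
        if token in STOPWORDS:
            continue
        out.append(token)
        out.extend(SYNONYMS.get(token, []))
    return out


def analyze(text):
    """Run the three stages in order: character filters, tokenizer, token filters."""
    return token_filters(tokenizer(char_filter(text)))


tokens = analyze("<p>The <b>Quick</b> brown fox</p>")
print(tokens)  # ['quick', 'fast', 'brown', 'fox']
```

The order matters: the character filter sees the raw string, the tokenizer sees cleaned text, and the token filters only ever see individual tokens.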

ElasticSearch has built-in analyzers

Distributed features of ElasticSearch

An ElasticSearch cluster is a distributed system with high availability and scalability.

Benefits of ElasticSearch’s distributed architecture

1. Storage capacity can be expanded horizontally.
2. System availability is improved.

ElasticSearch’s distributed architecture

The default cluster name, “elasticsearch”, can be changed in the configuration file or on the command line with -E cluster.name=geektime. A cluster can have one or more nodes.

What is a Mapping?

1. A Mapping is similar to a schema definition in a database. It:

  • defines the names of the fields in the index
  • defines the data type of each field, such as string, number, Boolean, and so on
  • defines the inverted index configuration of each field

2. A Mapping maps JSON documents to the flat format that Lucene needs.
3. A Mapping belongs to a Type of an index.

Each document belongs to a Type, and each Type has one Mapping definition. Since 7.0, there is no need to specify Type information in the Mapping definition.
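As a rough illustration, a 7.x-style mapping body (without a Type level) can be written as a Python dictionary, and the idea of mapping a nested JSON document to a flat format can be sketched with a small helper. The index fields, the flatten function, and the sample document are assumptions made for this example; the real conversion happens inside ElasticSearch.

```python
# Hypothetical mapping body for a product index: field names and data types.
product_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},        # analyzed, full-text searchable
            "sku": {"type": "keyword"},      # stored as a single exact term
            "price": {"type": "float"},
            "in_stock": {"type": "boolean"},
        }
    }
}


def flatten(doc, prefix=""):
    """Flatten a nested JSON object into dotted field names."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into inner objects, extending the dotted path.
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


flat_doc = flatten({"name": "pen", "maker": {"name": "Acme", "country": "DE"}})
print(flat_doc)  # {'name': 'pen', 'maker.name': 'Acme', 'maker.country': 'DE'}
```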

Can the field type in a Mapping be changed?

What is aggregation?

Elasticsearch provides four kinds of aggregation: Bucket, Metric, Pipeline, and Matrix.
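To make the first two kinds concrete, here is a small sketch: a request body that nests a Metric aggregation (avg) inside a Bucket aggregation (terms), followed by a plain-Python analogue of what such a request computes. The field names category and price, the aggregation names, and the helper terms_avg are assumptions for the example.

```python
from collections import defaultdict

# Request body: bucket documents by category, then average the price per bucket.
agg_request = {
    "size": 0,
    "aggs": {
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {"avg_price": {"avg": {"field": "price"}}},
        }
    },
}


def terms_avg(docs, bucket_field, metric_field):
    """Local analogue: group documents into buckets, then compute a metric per bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc[metric_field])
    return {
        key: {"doc_count": len(values), "avg": sum(values) / len(values)}
        for key, values in buckets.items()
    }


# Hypothetical sample data.
docs = [
    {"category": "pen", "price": 2.0},
    {"category": "pen", "price": 4.0},
    {"category": "ink", "price": 10.0},
]
result = terms_avg(docs, "category", "price")
```

Here result["pen"] holds a doc_count of 2 and an average of 3.0: the bucket decides which documents belong together, and the metric summarizes each bucket.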

ElasticSearch search

Term-based queries

The Importance of Term

A Term is the smallest unit that carries meaning. Both search and natural language processing with statistical language models have to work with Terms.

Characteristics

Term queries do not tokenize their input. If you want to match an entire value exactly, such as an ID, query a keyword field.

Compound query: use Constant Score to convert the query into a Filter

Full-text queries

Full-text search

Match Query / Match Phrase Query / Query String Query

Characteristics

Match Phrase Query

Match Query process

Conclusion

Term-based queries vs. full-text queries

Term queries do not tokenize their input; full-text queries do.

Whether a field is tokenized is controlled by its Mapping: “text” vs. “keyword”.

If you want exact-match queries on a field, set it to keyword in the Mapping.

Precision & Recall can be controlled through query parameters

Compound query: Constant Score query

Even Term queries on a keyword field are scored. Converting the query into a Filter removes the relevance scoring and improves performance.
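The difference can be simulated in a few lines of Python. This is a simplified model, assuming an analyzer that only splits on word boundaries and lowercases; the helper names are made up. The last two dictionaries show what the corresponding request bodies look like, assuming a hypothetical title field with a title.keyword sub-field.

```python
import re


def analyze(text):
    """Stand-in analyzer: split into words and lowercase them."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def index_field(value, field_type):
    """A text field is analyzed into terms; a keyword field keeps the value as one term."""
    return set(analyze(value)) if field_type == "text" else {value}


def term_query(indexed_terms, value):
    """Term query: the input is not analyzed and must equal an indexed term exactly."""
    return value in indexed_terms


def match_query(indexed_terms, text):
    """Match query: the input is analyzed; a hit on any resulting term matches."""
    return any(t in indexed_terms for t in analyze(text))


title = "Apple iPhone"
as_text = index_field(title, "text")        # {'apple', 'iphone'}
as_keyword = index_field(title, "keyword")  # {'Apple iPhone'}

print(term_query(as_text, "Apple iPhone"))     # False: no such single term in a text field
print(term_query(as_keyword, "Apple iPhone"))  # True: the keyword field holds the whole value
print(match_query(as_text, "iPhone case"))     # True: "iphone" matches after analysis

# Request bodies: a scored term query, and the same query wrapped in
# constant_score so that it runs as a filter without relevance scoring.
term_body = {"query": {"term": {"title.keyword": {"value": "Apple iPhone"}}}}
filter_body = {
    "query": {"constant_score": {"filter": {"term": {"title.keyword": "Apple iPhone"}}}}
}
```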

Structured search

Structured data

Structured search is search over structured data.

Dates, Booleans, and numbers are all structured.

Text can also be structured:

  • Colored pens can come in a discrete set of colors: red, green, blue
  • A blog may be tagged, for example with distributed and searchable
  • E-commerce sites have UPCs or other unique identifiers that must follow a strictly defined, structured format

Relevance and relevance score

Relevance

The relevance score of a search describes how well a document matches the query. ES calculates a _score for each result that matches the query.

The point of scoring is to rank the documents that best meet the user’s needs first. Before ES 5, the default relevance scoring algorithm was TF-IDF; it is now BM25.
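For reference, the textbook form of BM25 for a single term can be sketched as follows. The parameter defaults k1=1.2 and b=0.75 are the commonly used ones; the function name and the sample numbers are assumptions, and the exact formula inside a given Lucene version may differ in constant factors.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document."""
    # Rare terms (low doc_freq) get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates, and long documents are penalized through b.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm


# Hypothetical corpus statistics: 100 documents, the term occurs in 5 of them.
once = bm25_term_score(tf=1, doc_len=10, avg_doc_len=10, n_docs=100, doc_freq=5)
twice = bm25_term_score(tf=2, doc_len=10, avg_doc_len=10, n_docs=100, doc_freq=5)
long_doc = bm25_term_score(tf=1, doc_len=40, avg_doc_len=10, n_docs=100, doc_freq=5)
```

Compared with plain TF-IDF, the score grows more and more slowly as the term frequency increases, so a document that repeats a word many times does not dominate the ranking.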

ElasticSearch workflow

Startup process

When an ElasticSearch node starts, it uses multicast (it can also be configured to use unicast) to discover and connect to the other nodes in the same cluster. One node in the cluster is elected as the master node. This node manages the cluster state and, in response to changes in the cluster topology, assigns index shards to the appropriate nodes.

Fault detection

While the cluster is working normally, the master node monitors all available nodes to check that they are still working. If a node does not respond within the predefined timeout, it is considered disconnected and the error handling process starts.
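The timeout check itself is simple to sketch. This is a toy model of the idea, not ElasticSearch's actual fault detection code; the node names, timestamps, and the function find_disconnected are made up.

```python
def find_disconnected(last_seen, now, timeout):
    """Return the nodes whose last response is older than the timeout."""
    return sorted(node for node, seen in last_seen.items() if now - seen > timeout)


# Hypothetical timestamps (in seconds) of each node's last response.
last_seen = {"node-1": 100.0, "node-2": 95.0, "node-3": 60.0}
disconnected = find_disconnected(last_seen, now=101.0, timeout=30.0)
print(disconnected)  # ['node-3']
```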

Indexing data

ElasticSearch provides four ways to index data:

  • The index API, which lets the user send a document to a specific index.
  • The bulk API, which lets the user send multiple documents to the cluster at once over HTTP.
  • The UDP bulk API, which lets the user send multiple documents to the cluster at once over UDP.
  • A plugin called a river, which runs on an ElasticSearch node and fetches data from an external system.

The UDP bulk API and rivers are only available in old versions of ElasticSearch.
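The bulk API over HTTP expects a newline-delimited JSON body in which each document is preceded by an action line. A minimal sketch of building such a body, assuming a hypothetical products index and made-up documents; sending it (a POST to the _bulk endpoint with the application/x-ndjson content type) is left out.

```python
import json


def build_bulk_body(index_name, docs):
    """Build a bulk request body: one action line plus one source line per document."""
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body("products", [("1", {"name": "pen"}), ("2", {"name": "ink"})])
print(body)
```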

Note: indexing only happens on the primary shard, not on the replica. When an index request is sent to a node that has no corresponding primary shard, or only a replica, the request is forwarded to the node that holds the correct primary shard (as shown below).

Query data

Data is queried with the query DSL. A query generally runs in two phases:

  • Scatter phase: the query is distributed to the shards that contain relevant documents, and each shard executes it.
  • Gather phase: the results returned by the shards are collected, then merged, sorted, post-processed, and returned to the client.
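The two phases can be sketched with in-memory shards. Everything here is a simplification made up for illustration: a shard is a dictionary of documents, the score is a plain term count, and the function names are hypothetical.

```python
import heapq


def search_shard(shard_docs, term, size):
    """First phase: a shard scores its own documents and returns its local top hits."""
    hits = [(text.lower().split().count(term), doc_id) for doc_id, text in shard_docs.items()]
    hits = [(score, doc_id) for score, doc_id in hits if score > 0]
    return heapq.nlargest(size, hits)


def search(shards, term, size):
    """Second phase: collect the per-shard hits, merge and sort them, return the top results."""
    all_hits = []
    for shard_docs in shards:
        all_hits.extend(search_shard(shard_docs, term, size))
    ranked = sorted(all_hits, key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _score, doc_id in ranked[:size]]


# Two hypothetical shards holding two documents each.
shards = [
    {"a": "red pen", "b": "red red ink"},
    {"c": "blue pen", "d": "red red red paper"},
]
top = search(shards, "red", size=2)
print(top)  # ['d', 'b']
```

Each shard only has to return its own best hits, so the node that merges the results handles at most size entries per shard rather than every matching document.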

Apache Lucene default scoring formula explained

  • Document score is a parameter that describes how well a document matches a query.
  • When a document is returned by Lucene, it means that the document matches the query submitted by the user, in which case each returned document is given a score. The higher the score, the more relevant the document.
  • The same document scores differently for different queries, and it is meaningless to compare a document’s scores for different queries.
  • Apache Lucene default scoring mechanism: TF/IDF (term frequency/inverse document frequency)
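A simplified sketch of the TF/IDF idea, using a square-root term frequency and a logarithmic inverse document frequency in the spirit of Lucene's classic similarity. It leaves out the other factors of the full Lucene formula, such as field-length normalization and boosts; the function name and the tiny corpus are made up.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Simplified TF/IDF score of a term in one document of a corpus."""
    freq = doc_tokens.count(term)
    if freq == 0:
        return 0.0
    # Number of documents in the corpus that contain the term.
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    tf = math.sqrt(freq)                               # more occurrences, higher score
    idf = 1 + math.log(len(corpus) / (doc_freq + 1))   # rarer terms, higher score
    return tf * idf


# Hypothetical corpus of already tokenized documents.
corpus = [
    ["red", "pen"],
    ["red", "red", "ink"],
    ["blue", "pen"],
    ["green", "paper"],
]
red_once = tf_idf("red", corpus[0], corpus)
red_twice = tf_idf("red", corpus[1], corpus)
ink_once = tf_idf("ink", corpus[1], corpus)
```

With this corpus, red_twice is higher than red_once because the term occurs more often in the document, and ink_once is higher than red_once because ink appears in fewer documents.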

ElasticSearch, MongoDB, and MySQL

Introduction

ElasticSearch

ElasticSearch is a search server based on Lucene. It provides a distributed, multi-user full-text search engine with a RESTful web interface. Developed in Java and released as open source under the Apache license, ElasticSearch is a popular enterprise search engine. It is designed for cloud computing and offers near real-time search that is stable, reliable, fast, and easy to install and use.

MongoDB

MongoDB is an open source database system written in C++ and based on distributed file storage. Under high load, adding more nodes maintains server performance, which makes it a scalable, high-performance data storage solution for web applications. MongoDB stores data as documents whose structure consists of key=>value pairs. MongoDB documents are similar to JSON objects; field values can contain other documents, arrays, and arrays of documents.

MySQL

MySQL is a lightweight relational database management system. A relational database keeps data in separate tables rather than putting all data in one large store, which increases speed and flexibility.

Application scenarios

ElasticSearch

ElasticSearch is mainly used for full-text indexing, synonym processing, log management, application performance monitoring, ranking, complex data analysis, and massive data processing in near real time.

MongoDB

MongoDB is suitable for those scenarios where the table structure changes frequently, the logical structure of data is not so complex and does not require multiple table query operations, and the amount of data is relatively large.

MySQL

MySQL can be used as a data warehouse for storing data and in scenarios that require transactional processing.

Advantages and disadvantages

ElasticSearch

advantages

2. It automatically indexes all fields, which enables high-performance complex aggregation queries: as long as the data is stored in ES, even very complex aggregation queries perform well.

disadvantages

1. The field type cannot be modified

ES needs a Mapping for a field before the field is created. The Mapping contains the type of each field, and ES builds the appropriate index for the field based on it. Because of this, the type of a field in ES cannot be changed once the field has been created. For example, what if you want to change the type of a field in an index that has already been created and contains a lot of data? You can only create the index again with the new Mapping and reload all of the data. In terms of data structure, ES is therefore much more flexible than MySQL but far less flexible than MongoDB.

2. The write performance is low

The write performance of ES is also affected by automatic indexing and is significantly lower than that of MongoDB. For the same data, ES also occupies significantly more storage space than MongoDB.

3. High hardware resource consumption

4. Weak support for transactions and relationships

MongoDB

advantages

1. Table structure is flexible and variable

Each record in MongoDB is simply converted to JSON format and stored, so MongoDB has no concept of a table structure like the one in MySQL.

2. Field types can be modified at any time

Data of any structure can be put straight into the same table without regard for table-structure constraints, and without the trouble that changing a table structure causes in MySQL.

3. Horizontal scalability: it can support large and growing amounts of data

disadvantages

  • Weak support for transactions and relationships
  • MongoDB does not require a table structure to be defined, which makes changing the structure very convenient but also hinders advanced operations such as multi-table queries and complex transactions. If the logical structure of the data is very complex and complex multi-table queries or transactions are needed often, a relational database such as MySQL is clearly more suitable.

MySQL

advantages

  • Its core is fully multithreaded and supports multiple processors.
  • Full support for SQL GROUP BY and ORDER BY clauses.
  • Support for large databases: it can easily handle databases with tens of millions of records. As an open source database, it can be modified for different applications.
  • Supports transactions.

disadvantages

  • Hot backup is not supported
  • There is no Stored Procedure language

Differences

  • MongoDB is a document database based on the JSON data model, ElasticSearch is a search database based on the JSON data model, and MySQL is a relational database.
  • For CRUD, MongoDB uses MQL to manipulate data, ElasticSearch uses its query DSL, and MySQL uses SQL.
  • In terms of architecture, MongoDB implements high availability through replica sets, ElasticSearch through its cluster, and MySQL through cluster mode.
  • In terms of horizontal scaling, MongoDB supports it fully through native sharding, ElasticSearch also supports it through sharding (Primary Shard & Replica Shard), and MySQL relies on data partitioning or on sharding logic in the application.
  • Index support: MongoDB (B-tree, full-text index, geospatial index, multikey index, TTL index), ElasticSearch (full-text index, inverted index), MySQL (B-tree).
  • In terms of data capacity, MongoDB has no theoretical upper limit, while MySQL is on the order of tens of millions to hundreds of millions of records.
  • In terms of modeling relationships, MongoDB expresses the relationship between tables by embedding data and by reference fields, while MySQL expresses it through joins and primary and foreign keys.

Analysis

  • ES full-text search has powerful analyzers that can be combined flexibly for intelligent matching. MongoDB limits the number of fields in a full-text index. ES indexes all fields automatically, while in MongoDB indexes have to be created manually. ES is therefore better suited to search than MongoDB.
  • If your data volume is large, read performance requirements are high, the structure of the tables changes frequently, and you sometimes need aggregation queries, choose MongoDB.
  • If you need to build a search engine, or want to build a good-looking data visualization platform and your data has some analytical value, choose ElasticSearch.
  • If transaction requirements are high, choose MySQL.