E-commerce sites and search products typically sit on large databases and face a common problem: retrieving product information takes too long. This poor user experience can cost them potential customers. The lag arises because the product is built on a relational database, where the data is scattered across multiple tables and the database has to combine data from those tables to produce search results, which is slow. Companies are therefore looking for alternative data stores that support fast retrieval, and Elasticsearch (ES) is a good way to solve these problems.
What is Elasticsearch?
Elasticsearch is a Lucene-based search engine. It provides a distributed, multi-user-capable full-text search engine with a RESTful web interface.
In other words, Elasticsearch is an open source, standalone database server developed in Java. It is used mainly for full-text search and analysis. It takes data from a variety of sources and stores it in a sophisticated format that is highly optimized for search. As mentioned above, Elasticsearch uses Apache Lucene at the heart of its search. Lucene is only a library, so it can be difficult to use directly. Elasticsearch wraps all of the search engine operations and exposes them through a corresponding RESTful API. Elasticsearch is a fast and efficient way to store, search, and analyze large amounts of data, and it is especially useful when dealing with semi-structured data such as natural language.
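To make the RESTful API point concrete, here is a minimal sketch that builds (but does not send) a full-text search request using only the Python standard library. The host `http://localhost:9200`, the index name `song`, and the helper name `build_search_request` are assumptions made for illustration; `_search` and the `match` query are part of the Elasticsearch REST API.

```python
import json
from urllib.request import Request


def build_search_request(base_url: str, index: str, field: str, text: str) -> Request:
    """Build (without sending) a full-text search request against one field."""
    # A `match` query analyzes `text` and looks the resulting terms up in `field`.
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed local node and index name; adjust both for a real cluster.
    req = build_search_request("http://localhost:9200", "song", "songName", "cry")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it to a running node; the sketch stops short of that so it can be read and run without a cluster.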
What can Elasticsearch do?
When you search on GitHub, Elasticsearch not only helps you find individual code repositories but also powers code-level search and highlights the search terms. It can help make product recommendations when you are shopping online. When you hail a ride after work, it helps match nearby passengers and drivers. ELK (the Elastic Stack), which combines Elasticsearch with Kibana, Logstash and Beats, is widely used for big data and near real-time analysis, in fields such as log analysis, metrics monitoring, and information security. It can help you explore massive amounts of structured and unstructured data, create visual reports on demand, and set alarm thresholds for monitored data.
Elasticsearch feature history for versions 5, 6, and 7
V5.x
- Lucene 6.x
- Improved performance; the default scoring mechanism changed from TF-IDF to BM25
- Support for ingest nodes, the Completion Suggester, and the Java REST client
- Types are marked as deprecated; the keyword field type is supported
- Performance optimization
  - Indexing throughput has been greatly improved by reducing internal contention, preventing concurrent updates to the same document from competing, and reducing locking requirements when syncing the transaction log
  - Instant Aggregations, which cache aggregations at the shard level
  - New Profile API
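The move from TF-IDF to BM25 mentioned in the list above is easier to picture with numbers. The following is a minimal sketch of the classic BM25 formula for a single query term, not Lucene's actual implementation (which differs in details between versions). The function name and the sample statistics are assumptions for illustration; `k1 = 1.2` and `b = 0.75` are the usual defaults.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document with the classic BM25 formula."""
    # Rare terms (small doc_freq) get a larger inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Documents longer than average are penalized; b controls how strongly.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # The term-frequency part saturates: it approaches (k1 + 1) as tf grows.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


if __name__ == "__main__":
    # Hypothetical corpus statistics: 1000 documents, the term occurs in 10.
    for tf in (1, 2, 5, 50):
        print(tf, round(bm25_term_score(tf, 100, 100, 1000, 10), 3))
```

Unlike plain TF-IDF, where the score keeps growing with term frequency, the BM25 score levels off, so repeating a term many times in a document yields diminishing returns.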
V6.x
- Lucene 7.x
- Removal of types begins: as of 6.0, a new index can no longer be created with multiple types
- Cross-cluster search: the original indices can stay in a 5.x cluster while a single search covers both the 6.x and 5.x clusters
- Cross-cluster Replication (CCR)
- Friendlier upgrades and data migration: easier migration between major releases and a smoother upgrade experience
- Performance optimization
  - Improved handling of sparse fields to reduce storage cost
  - Index sorting can be used to speed up queries
V7.x
- Lucene 8.0
- Major improvement: support for multiple types under a single index is officially removed
- Starting with 7.1, the security features are free of charge
- ECK allows users to configure, manage, and operate Elasticsearch clusters on Kubernetes
- TransportClient is deprecated; Java code for ES 7 should use the REST client instead
- New features
  - New cluster coordination
  - A more fully featured REST client
  - Script Score Query, the next generation of scoring
- Performance optimization
  - The default number of primary shards changed from 5 to 1 to avoid over-sharding
  - Faster Top K retrieval
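As an illustration of the Script Score Query listed above, the sketch below builds a request body that wraps an ordinary `match` query and rescores each hit with a script. The helper name, the field names `songName` and `price`, and the script itself are assumptions for illustration; the body would be sent to the `_search` endpoint of a 7.x cluster.

```python
import json


def build_script_score_query(field, text, script_source):
    """Build a search body that rescores hits of a match query with a script."""
    return {
        "query": {
            "script_score": {
                # The inner query decides which documents match at all.
                "query": {"match": {field: text}},
                # The script computes the final score for every matching document.
                "script": {"source": script_source},
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical rule: favor cheaper songs by dividing the relevance score
    # by (1 + price). `price` is assumed to be a numeric field.
    body = build_script_score_query(
        "songName", "cry", "_score / (1 + doc['price'].value)"
    )
    print(json.dumps(body, indent=2))
```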
Basic concepts of Elasticsearch
Elasticsearch (Index, Document, Type)
Index
- An index is a container for documents: a collection of documents of a similar kind
- An index embodies the concept of logical space: each index has its own mapping, which defines the field names and field types of the documents it contains
- A shard embodies the concept of physical space: the data in an index is spread across its shards
- An index has a mapping and settings
  - The mapping defines the types of the document fields
  - The settings define how the data is distributed
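How data ends up spread across shards can be sketched as hashing a routing key (by default the document ID) and taking the result modulo the number of primary shards. This is a simplification under stated assumptions: Elasticsearch uses a Murmur3 hash and some extra routing parameters, whereas the sketch substitutes `zlib.crc32` from the standard library, and the function name `pick_shard` is hypothetical.

```python
import zlib


def pick_shard(routing_key: str, number_of_primary_shards: int) -> int:
    """Map a routing key (for example a document ID) to a primary shard number."""
    if number_of_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    # zlib.crc32 stands in for the Murmur3 hash that Elasticsearch really uses.
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards


if __name__ == "__main__":
    # Assumed index with 5 primary shards, as in pre-7.0 defaults.
    for doc_id in ("1", "2", "3", "song-42"):
        print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

Because the shard number depends on the shard count, the number of primary shards cannot simply be changed after the index is created without moving documents, which is one reason over-sharding is worth avoiding up front.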
Settings, which define how the data is distributed:
{
  "movies" : {
    "settings" : {
      "index" : {
        "creation_date" : "1570452552",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "pB0UsxjfQT2fW-s8Uy-Nsg",
        "version" : {
          "created" : "2030599"
        }
      }
    }
  }
}
Mapping, which defines the types of the document fields:
{
  "movie": {
    "mappings": {
      "doc": {
        "properties": {
          "songName": {
            "type": "text"
          },
          "singer": {
            "type": "text"
          },
          "price": {
            "type": "integer"
          }
        }
      }
    }
  }
}
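Settings and mappings like the ones above are usually supplied together when an index is created. The sketch below builds such a request body in the typeless form used from 7.x on (no `doc` level under `mappings`). The helper name and the chosen values are assumptions for illustration; the body would be sent with `PUT /<index-name>`.

```python
import json


def build_create_index_body(number_of_shards, number_of_replicas, field_types):
    """Build the JSON body for creating an index with settings and a mapping."""
    return {
        # Settings: how the data is distributed across shards and replicas.
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        },
        # Mapping: the type of every field in the documents.
        "mappings": {
            "properties": {
                name: {"type": field_type} for name, field_type in field_types.items()
            }
        },
    }


if __name__ == "__main__":
    body = build_create_index_body(
        1, 1, {"songName": "text", "singer": "text", "price": "integer"}
    )
    print(json.dumps(body, indent=2))
```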
The word "index" has different meanings depending on context. In ES, as a noun it refers to an index created in the cluster; as a verb it refers to the process of saving a document to ES, that is, building the inverted index. Elsewhere, "index" more commonly means a B-tree index or an inverted index.
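For readers unfamiliar with the term, an inverted index maps each term to the documents that contain it. The toy version below is only a sketch of the idea: it splits on whitespace and lowercases, whereas Lucene applies configurable analyzers and stores far more information. The function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every term to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of the documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        1: "Elasticsearch is a search engine",
        2: "Lucene is a search library",
        3: "Kibana draws charts",
    }
    index = build_inverted_index(docs)
    print(search(index, "search engine"))
    print(search(index, "search"))
```

A lookup touches only the entries for the query terms instead of scanning every document, which is what makes full-text search fast.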
Document
- Elasticsearch is document-oriented; a document is the smallest unit of searchable data
  - A log entry in a log file
  - The details of a movie
  - The details of a song
- A document is serialized to JSON and stored in Elasticsearch
  - A JSON object consists of fields
  - Each field has a corresponding field type (string, numeric, Boolean, date, binary, range)
- Each document has a unique ID
  - You can specify the ID yourself or let Elasticsearch generate it automatically
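The two ways of assigning an ID correspond to two request forms in the 7.x REST API: `PUT /<index>/_doc/<id>` when you choose the ID, and `POST /<index>/_doc` when Elasticsearch should generate one. The helper below is a hypothetical sketch that only returns the method and path; the index name `song` is an assumption.

```python
from urllib.parse import quote


def index_request_line(index, doc_id=None):
    """Return the HTTP method and path for indexing one document."""
    if doc_id is None:
        # No ID supplied: Elasticsearch generates a unique one.
        return "POST", f"/{index}/_doc"
    # ID supplied by the caller: percent-encode it so it is safe inside a URL path.
    return "PUT", f"/{index}/_doc/{quote(str(doc_id), safe='')}"


if __name__ == "__main__":
    print(index_request_line("song", 1))
    print(index_request_line("song"))
```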
Example
{"songName": "Say Good Don't Cry", "singer": "Jay Chou", "price": 3}
Metadata for the document
{
  "_index": "song",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "songName": "Say Good Don't Cry",
    "singer": "Jay Chou",
    "price": 3
  }
}
- Metadata annotates a document with information about it
  - _index: the name of the index the document belongs to
  - _type: the name of the type the document belongs to
  - _id: the unique ID of the document
  - _source: the original JSON data of the document
  - _all: consolidates the contents of all fields into one field (deprecated as of 6.0)
  - _version: the version of the document
  - _score: the relevance score
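In client code, the metadata fields can be told apart from the document itself by their leading underscore. The following sketch separates the two for a response shaped like the example above; the function name is hypothetical and the response is a hand-written dictionary, not output captured from a cluster.

```python
def split_metadata(response):
    """Separate the underscore-prefixed metadata from the stored document."""
    metadata = {
        key: value
        for key, value in response.items()
        if key.startswith("_") and key != "_source"
    }
    # _source holds the original JSON document exactly as it was indexed.
    return metadata, response.get("_source", {})


if __name__ == "__main__":
    response = {
        "_index": "song",
        "_type": "_doc",
        "_id": "1",
        "_version": 1,
        "found": True,
        "_source": {"songName": "Say Good Don't Cry", "singer": "Jay Chou", "price": 3},
    }
    metadata, document = split_metadata(response)
    print(metadata)
    print(document)
```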
Type
- Prior to 7.0, an index could have multiple types (in 6.x, only indices created in 5.x could still contain more than one)
- Since 6.0, types have been deprecated. Starting with 7.0, an index can have only one type, "_doc".
RDBMS VS Elasticsearch
The following is a rough analogy between an RDBMS and Elasticsearch. An Elasticsearch cluster can contain multiple indices (databases), and each index contains a type (table). Each type contains multiple documents (records), and each document contains multiple fields (columns). The query DSL is the counterpart of SQL in an RDBMS.
RDBMS | Elasticsearch |
---|---|
Schema | Mapping |
Table | Index (Type) |
Column | Field |
Row | Document |
SQL | DSL |
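To illustrate the SQL and DSL row of the table, the sketch below builds a DSL body that plays roughly the role of `SELECT * FROM song WHERE singer LIKE '%...%' AND price <= ...`. The correspondence is approximate, since a `match` query does analyzed full-text matching rather than substring matching. The function name and the `song` index with its `singer` and `price` fields are assumptions for illustration.

```python
import json


def songs_by_singer_under_price(singer, max_price):
    """Build a DSL body: full-text match on singer, filtered by a price ceiling."""
    return {
        "query": {
            "bool": {
                # `must` clauses contribute to the relevance score.
                "must": [{"match": {"singer": singer}}],
                # `filter` clauses only include or exclude documents.
                "filter": [{"range": {"price": {"lte": max_price}}}],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(songs_by_singer_under_price("Jay Chou", 5), indent=2))
```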
Summary
Elasticsearch can return search results in around 10 milliseconds for queries that may take a traditional SQL database management system more than 10 seconds. Because Elasticsearch has a distributed architecture, it can scale to thousands of servers and hold petabytes of data. We do not have to manage the complexity of the distributed design ourselves, because ES handles it automatically. There are many ways to index and query documents, but with ES we can easily run fast full-text searches over massive amounts of data and get the results we want.
Author: peak. Link: www.jianshu.com/p/1dc661517… Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.