One, Overview of ElasticSearch

1. What is ElasticSearch?

Elasticsearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine exposed through a RESTful web interface. It is written in Java and was originally released as open source under the Apache License. Elasticsearch is designed for cloud environments and aims to be stable, reliable, fast, and easy to install and use. Official clients are available for Java, .NET (C#), PHP, Python, Apache Groovy, Ruby, and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine, followed by Apache Solr, which is also based on Lucene.

Official website: www.elastic.co/cn/what-is/…

2. Application scenarios of ElasticSearch

  • Storage

ElasticSearch is distributed by design and can store massive amounts of data; its search and analysis features are built on top of that stored data. This makes it a convenient store for large data sets, especially fast-growing ones, for example data gathered by collection tools such as crawlers.

  • Search

By default, every field in an ElasticSearch document is indexed and can be searched. ElasticSearch also provides a rich search API (full-text search, highlighting, search suggestions, and so on) that can respond over massive amounts of data in near real time. Typical uses:

  1. Stack Overflow (a Q&A forum for programming problems): when a program throws an error, you paste the error message into the search box, and full-text search finds related questions and answers;
  2. GitHub (open source code hosting): searches hundreds of billions of lines of code;
  3. E-commerce sites: product search;
  4. Log data analysis: Logstash collects the logs and ElasticSearch performs complex analysis on them (the ELK stack: ElasticSearch + Logstash + Kibana).
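To make the search scenario more concrete, here is a minimal sketch of the JSON body a full-text search with highlighting might use. It only builds the request body and does not talk to a cluster. The field name and the error text are hypothetical; the body would typically be sent to the `_search` endpoint of an index.

```python
import json

def build_search_body(field, text, size=10):
    """Build a full-text search request body with highlighting.

    `match` runs an analyzed full-text query on the field;
    `highlight` asks for matching fragments of that field.
    """
    return {
        "size": size,
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }

# Hypothetical field name and error message, for illustration only.
body = build_search_body("body", "NullPointerException in thread main")
print(json.dumps(body, indent=2))
```

This mirrors the Stack Overflow case above: the pasted error message becomes the text of a `match` query.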

  • Data analysis

ElasticSearch also provides many data analysis APIs and rich aggregation capabilities for analyzing and processing massive data sets. Example scenario: a crawler collects data about a product from different e-commerce platforms, and ElasticSearch is used to analyze it (price history, purchasing power, and so on).
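As a rough illustration of the e-commerce scenario, the sketch below builds an aggregation request body that groups documents by platform and computes the average and minimum price per group. The field names `platform` and `price` are assumptions made up for this example, and the code only constructs the body without sending it anywhere.

```python
import json

def build_price_aggregation(group_field, price_field):
    """Group documents by one field and compute price statistics per group."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "aggs": {
            "by_group": {
                "terms": {"field": group_field},
                "aggs": {
                    "avg_price": {"avg": {"field": price_field}},
                    "min_price": {"min": {"field": price_field}},
                },
            }
        },
    }

# Hypothetical field names for crawled product data.
agg_body = build_price_aggregation("platform", "price")
print(json.dumps(agg_body, indent=2))
```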

Excerpt: www.cnblogs.com/cdchencw/p/…

3. Relationship between ElasticSearch and Lucene

Lucene can be considered the most advanced, best-performing, and most fully featured search engine library (framework) available to date. However, to use Lucene you must develop in Java and integrate it directly into your application, and it is complex to configure and use. You need to know a lot about information retrieval to understand how it works.

Lucene's drawbacks:

  • It can only be used in Java projects, where it is integrated directly as a JAR package.
  • It is complicated to use: creating and searching indexes takes a lot of code.
  • It has no cluster support: index data is not synchronized between servers, so it does not suit large projects.
  • The index and the application live on the same server and share the same disk, so a large index leaves little space for either.

All of the above shortcomings of the Lucene framework are addressed by ES.

4. Comparison between ElasticSearch and Solr

  • Solr is faster when simply searching existing data.

  • When indexes are being built in real time, Solr suffers from I/O blocking and its query performance drops. In this situation Elasticsearch has a clear advantage.

  • Some large Internet companies have reported that, in production tests, average query speed increased by about 50 times after switching from Solr to Elasticsearch.

Conclusion:

  • Solr uses Zookeeper for distributed management, while Elasticsearch has its own distributed coordination management function.
  • Solr supports more data formats, such as JSON, XML, and CSV, while Elasticsearch only supports JSON.
  • Solr performs better than Elasticsearch in traditional search applications, but is significantly less effective when dealing with real-time search applications.
  • Solr is a powerful solution for traditional search applications, but Elasticsearch is better suited for emerging real-time search applications.

5. Comparison between ElasticSearch and relational databases

| ElasticSearch       | Index    | Type  | Document | Field  |
|---------------------|----------|-------|----------|--------|
| Relational database | Database | Table | Row      | Column |

Two, Full-text retrieval

1. What is full-text search?

Full-text search is the process of first building an index and then searching that index.

Suppose you have a paragraph of text. You build an index over its words, recording for each word where it appears in the text and how many times. When a user runs a query, the lookup goes through this prebuilt index and returns the positions and occurrence counts of the matching words; because the exact positions are known, the corresponding content can then be read out.
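The idea can be sketched in a few lines of Python. This is a toy illustration of the description above, not how Lucene stores its index: words are simply lowercased and split on non-word characters, and positions are word offsets.

```python
import re
from collections import defaultdict

def index_text(text):
    """Map each lowercased word to the list of word positions where it occurs."""
    positions = defaultdict(list)
    for pos, word in enumerate(re.findall(r"\w+", text.lower())):
        positions[word].append(pos)
    return dict(positions)

def lookup(index, word):
    """Return (occurrence count, positions) for a word, or (0, []) if absent."""
    hits = index.get(word.lower(), [])
    return len(hits), hits

paragraph = "Search engines index text so that search is fast"
idx = index_text(paragraph)
print(lookup(idx, "search"))  # → (2, [0, 6])
print(lookup(idx, "slow"))    # → (0, [])
```

The query never scans the paragraph itself; it only consults the index built beforehand.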

2. What is an inverted index?

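An index is similar to a table of contents. The indexes we usually use are forward indexes: a record is located by its primary key, and its content is read from there. An inverted index works the other way around: it maps each term to the documents (and positions) in which that term appears, so a search starts from the word and arrives at the documents that contain it.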
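A minimal sketch of an inverted index over several documents follows. The three sample documents and the simple "all terms must match" search are assumptions for illustration; a real engine also applies analysis, relevance scoring, and compressed on-disk structures.

```python
import re
from collections import defaultdict

# Forward view: document ID (primary key) -> content. Sample data only.
docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Solr is also built on Lucene",
    3: "Kibana visualizes Elasticsearch data",
}

def build_inverted_index(documents):
    """Inverted view: term -> set of IDs of the documents containing it."""
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            inverted[term].add(doc_id)
    return inverted

def search(inverted, query):
    """Return sorted IDs of the documents that contain every query term."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    matches = set.intersection(*(inverted.get(t, set()) for t in terms))
    return sorted(matches)

inverted = build_inverted_index(docs)
print(search(inverted, "lucene"))                # → [1, 2]
print(search(inverted, "elasticsearch lucene"))  # → [1]
```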

Three, Core concepts of ElasticSearch

1. Index

An index can loosely be understood as corresponding to a database in a relational database (personally, I think of it as corresponding to a wide table). An index is identified by a name, which must be all lowercase, and that name is used when indexing, searching, updating, and deleting the documents in the index.
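The lowercase rule is one of several naming restrictions. Below is a partial, non-authoritative check based on the index naming rules as I understand them from the Elasticsearch documentation; the function name is made up for this sketch, and the exact rules should be verified against the documentation for your version.

```python
# Characters not allowed in index names: \ / * ? " < > | space , #
INVALID_CHARS = set('\\/*?"<>| ,#')

def is_valid_index_name(name):
    """Partial check of index naming rules (a sketch, not exhaustive)."""
    if not name or name in (".", ".."):
        return False
    if name != name.lower():  # must be all lowercase
        return False
    if name[0] in "-_+":      # must not start with -, _ or +
        return False
    if any(ch in INVALID_CHARS for ch in name):
        return False
    return len(name.encode("utf-8")) <= 255  # limit is measured in bytes

print(is_valid_index_name("products-2024"))  # → True
print(is_valid_index_name("Products"))       # → False
```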

2. Mapping

The mapping in ElasticSearch defines the structure of a document.

A mapping sets the rules and constraints for how data is processed: for example, a field's data type, its default value, its analyzer (tokenizer), and whether it is indexed can all be set in the mapping. Mappings can be static or dynamic.
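For illustration, here is what a small static mapping might look like as the body of an index creation request, written as a Python dict. The field names are hypothetical, and the layout assumes the typeless mapping format used by recent Elasticsearch versions; older versions nest `properties` under a type name.

```python
import json

index_body = {
    "mappings": {
        # "strict" rejects documents containing fields not declared here,
        # i.e. a static mapping; the default behaviour adds new fields dynamically.
        "dynamic": "strict",
        "properties": {
            "title": {"type": "text", "analyzer": "standard"},    # analyzed full text
            "sku": {"type": "keyword", "null_value": "UNKNOWN"},  # exact value, substitute for null
            "rating": {"type": "byte"},                           # small integer
            "internal_note": {"type": "text", "index": False},    # stored but not searchable
        },
    }
}

print(json.dumps(index_body, indent=2))
```

Each entry under `properties` covers one of the settings mentioned above: data type, a stand-in for missing values, the analyzer, and whether the field is indexed.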

3. Field

A field is equivalent to a column in a database table.

4. Type

Each field should have a corresponding data type, such as text, keyword, byte, and so on.

5. Document

A document corresponds to a row in a relational database table. It is the basic unit of information that can be indexed, and it is represented in JSON (JavaScript Object Notation) format.
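The row-to-document analogy can be shown with a short sketch that turns one table row into a JSON document. The column names and values are invented for the example.

```python
import json

# One row of a hypothetical "products" table, with its column names.
columns = ("sku", "title", "price", "tags")
row = ("1001", "Wireless mouse", 19.99, ["electronics", "accessories"])

# Each column becomes a field of the document.
document = dict(zip(columns, row))

# This JSON string is the form in which the document would be sent for indexing.
payload = json.dumps(document)
print(payload)
```

Unlike a flat row, a document field can hold nested structures such as the `tags` array here.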

6. Node

A node is a single server in a cluster. As part of the cluster, it stores data and participates in the cluster's indexing and search functions.

A node joins a specific cluster by being configured with that cluster's name. By default, each node is set to join a cluster named “elasticsearch”.

This means that if you start several nodes on a network and they can discover each other, they will automatically form and join a cluster named “elasticsearch”.

A cluster can contain as many nodes as you want. If no Elasticsearch node is currently running on the network, starting one creates and joins a new cluster, by default also named “elasticsearch”.

7. Cluster

A cluster is a group of one or more nodes that together hold all of the data and jointly provide indexing and search.

8. Shards

An index can hold more data than the hardware of a single node can handle. For example, an index of 1 billion documents occupying 1 TB of disk space may not fit on the disk of any single node, or a single node may be too slow to serve search requests against it. To solve this problem, Elasticsearch can divide an index into multiple pieces called shards: data that cannot be stored on one node is distributed across several nodes, each of which stores a portion. There are two main reasons for sharding:

  • It lets you horizontally split and scale your content volume.
  • It improves performance and throughput by allowing distributed, parallel operations across shards.

When creating an index, you can specify the number of shards you want, and each shard itself is a fully functional and independent “index” that can be placed on any node in the cluster.

How a shard is distributed and how its documents are aggregated back into search requests is completely managed by Elasticsearch and is transparent to the user.
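Although this is handled for you, the basic idea of routing a document to a shard can be sketched as "hash the routing value, then take it modulo the number of primary shards". The sketch below is a simplification under stated assumptions: it uses CRC32 as a stand-in hash (Elasticsearch uses its own hash function and additional routing logic) and treats the document ID as the routing value.

```python
import zlib

def pick_shard(routing_value, num_primary_shards):
    """Simplified routing: stable hash of the routing value, modulo shard count."""
    digest = zlib.crc32(routing_value.encode("utf-8"))  # stand-in hash function
    return digest % num_primary_shards

doc_ids = [f"doc-{i}" for i in range(100)]  # hypothetical document IDs

# Where each document lands with 5 primary shards.
placement_5 = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}

# With a different shard count, the same IDs generally map to different shards.
placement_6 = {doc_id: pick_shard(doc_id, 6) for doc_id in doc_ids}
moved = sum(1 for d in doc_ids if placement_5[d] != placement_6[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because the shard number depends on the shard count, changing the count would send lookups for existing documents to the wrong shard, which is one way to see why the number of primary shards is fixed when the index is created.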

9. Replicas

Replicas provide high availability when a shard or a node fails.

For this reason, a replica shard is never placed on the same node as its original (primary) shard.

Replicas also increase search capacity and throughput, because searches can run in parallel across all replicas.

Once replicas are configured, each index has primary shards and replica shards (copies of the primaries). The number of shards and replicas can be specified when the index is created. The number of replicas can be changed dynamically at any time afterwards, but the number of primary shards cannot.
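As a small worked example, the sketch below builds the settings portion of an index creation body and does the shard arithmetic. `number_of_shards` and `number_of_replicas` are the index settings for primaries and replicas per primary; the helper function names are made up for this sketch.

```python
def index_settings(primary_shards, replicas):
    """Settings body for index creation: primaries and replicas per primary."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }

def total_shard_copies(primary_shards, replicas):
    """Each primary shard exists once, plus once per replica."""
    return primary_shards * (1 + replicas)

def min_nodes_for_full_allocation(replicas):
    """Copies of the same shard never share a node, so each copy needs its own node."""
    return replicas + 1

settings = index_settings(5, 1)
print(settings)
print(total_shard_copies(5, 1))          # → 10
print(min_nodes_for_full_allocation(1))  # → 2
```

With 5 primaries and 1 replica, a single-node cluster can hold the 5 primaries but has nowhere to place the 5 replicas until a second node joins.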

Four, Summary

This article gives you a quick introduction to ElasticSearch and some of the concepts you need to know to understand how ES works. Stay tuned for the next installment, which will cover the cluster installation of ES.

You are welcome to follow the WeChat official account (MarkZoe) so we can learn from and exchange ideas with each other.