YuanWenXian in: mp.weixin.qq.com/s/0cTaqiYSJ…

Interesting story behind ElasticSearch

Many years ago, a newly married, unemployed developer named Shay Banon followed his wife to London, where she was studying to become a chef. While looking for a job, he started using an early version of Lucene to build a recipe search engine for her. Using Lucene directly was difficult, so Shay began building an abstraction layer that Java developers could use to add search to their applications easily. He released it as his first open source project, Compass. Shay later took a job working with in-memory data grids in high-performance, distributed environments. There, the need for a high-performance, real-time, distributed search engine was so obvious that he decided to rewrite Compass as a standalone service called Elasticsearch.

The first public version was released in February 2010, and since then Elasticsearch has become one of the most active projects on GitHub, with over 300 contributors (currently 736). A company now offers commercial services around Elasticsearch and develops new features, but Elasticsearch will always be open source and available to everyone.

Supposedly, Shay’s wife is still waiting for her recipe search engine…

ElasticSearch

Elasticsearch is an open source search engine built on top of Apache Lucene™, a full-text search engine library. Lucene is arguably the most advanced, high-performance, full-featured search engine library available today, open source or proprietary. But Lucene is just a library. To use it, you need to work in Java and integrate Lucene directly into your application. Worse, you may need a degree in information retrieval to understand how it works, because Lucene is very complex. Elasticsearch is also written in Java and uses Lucene internally for indexing and searching, but it aims to make full-text search easy by hiding the complexity of Lucene behind a simple and consistent RESTful API. Elasticsearch is more than Lucene, though, and more than just a full-text search engine. It can be accurately described as follows:

  1. A distributed real-time document store in which every field can be indexed and searched
  2. A distributed search engine with real-time analytics
  3. Capable of scaling out to hundreds of server nodes and handling petabytes of structured or unstructured data
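
To make the "simple and consistent RESTful API" a little more concrete, here is a minimal Python sketch that only builds the HTTP requests for storing and searching a document, without sending them. The node address `localhost:9200`, the index name `recipes`, the type name `_doc`, and the document fields are all assumptions for illustration, not something taken from the original article.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumed address of a local node


def build_index_request(index, doc_id, document):
    """Build (but do not send) a PUT request that stores one JSON document."""
    return Request(
        url=f"{ES_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_search_request(index, field, text):
    """Build (but do not send) a full-text match query on a single field."""
    query = {"query": {"match": {field: text}}}
    return Request(
        url=f"{ES_URL}/{index}/_search",
        data=json.dumps(query).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical index and document, used only to show the request shape.
index_req = build_index_request("recipes", 1, {"title": "Tomato soup"})
search_req = build_search_request("recipes", "title", "soup")
print(index_req.get_method(), index_req.full_url)
print(search_req.get_method(), search_req.full_url)
```

With a node actually running, passing either request to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be read and run without a cluster.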

2.1 What ElasticSearch can do

  • Distributed search engine

ES can be used as a distributed search engine, for example for Baidu-style search, product search on Taobao, or site search in an ordinary web system. ES is a good technology choice for all of these.

  • Data analysis engine

On top of search, ES provides rich APIs that support personalized search and data analysis. On an e-commerce website, for example, we can query the hottest items of the last few days.
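
As a rough illustration of the "hot items in recent days" idea, the sketch below builds a search request body that filters by date and counts documents per product with a `terms` aggregation. The field names `order_time` and `product_id` are hypothetical, and the sketch assumes an index of order documents where `product_id` is stored as a keyword.

```python
import json


def hot_items_query(days=3, top_n=10):
    """Request body: count orders per product over the last `days` days."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {
            # "now-3d/d" is date math: three days ago, rounded down to the day
            "range": {"order_time": {"gte": f"now-{days}d/d"}}
        },
        "aggs": {
            "hot_items": {
                # one bucket per product, largest document counts first
                "terms": {"field": "product_id", "size": top_n}
            }
        },
    }


body = hot_items_query(days=3, top_n=5)
print(json.dumps(body, indent=2))
```

The body would be sent to the `_search` endpoint of the index; the buckets that come back are ordered by document count, which serves here as a simple stand-in for "hotness".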

  • Processing massive data in near real time

ES is a distributed search engine. Through clustering and internal sharding, ES spreads massive data across multiple servers for storage and retrieval, which greatly improves scalability and disaster recovery. “Near real-time” is a relative notion: generally, if the response time is on the order of seconds, we call it near real-time. For ES it covers two aspects: first, newly written data becomes searchable after about 1s; second, search and analysis responses come back within seconds.

2.2 Features of ElasticSearch

  • Distributed

ES is a distributed search engine and provides distributed features such as disaster recovery, data migration, dynamic capacity expansion, and load balancing.

  • Huge amounts of data

ES can process petabytes of data. Because ES has a distributed architecture and supports dynamic expansion, processing and storing massive data is no longer a problem.

Basic concepts for ElasticSearch

1. Basic concepts of data in ES

  • index

An index is similar to a “database” in a relational database: it is the place where we store and index related data.

Tip:

In fact, our data is stored and indexed in shards; an index is just a logical space that groups one or more shards together. These are internal details, however, and our program does not care about shards at all. As far as our program is concerned, documents are stored in an index. Elasticsearch takes care of the rest.

  • type

A type is similar to a table in MySQL. Note that since Elasticsearch 6.0 an index can contain only one type, and types are removed entirely in later versions.

In applications, we use objects to represent “things”, such as a user, a blog post, a comment, or an email. Each object belongs to a class that defines the attributes or data associated with it. Objects of the User class may contain a name, gender, age, and email address. In relational databases, we usually store similar objects in one table because they share the same structure. Similarly, in Elasticsearch we use documents of the same type to represent the same kind of “thing” because they share the same data structure. Each type has its own mapping, or structure definition, much like the columns of a traditional database table. Documents of all types are stored under the same index, and the type’s mapping tells Elasticsearch how its documents should be indexed. Mappings can be defined and managed explicitly, but for now we’ll rely on Elasticsearch to detect data structures automatically.
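
As a sketch of what such a mapping might look like for the User example, the snippet below builds the body of an index-creation request in the 6.x format, where the type name sits between `mappings` and `properties`. The type name `_doc` and the chosen field types are assumptions for illustration; from version 7 onward the type level is dropped from this body.

```python
import json


def user_index_body():
    """Index-creation body with an explicit mapping for user documents (6.x style)."""
    return {
        "mappings": {
            "_doc": {  # type name; in 6.x an index holds a single type
                "properties": {
                    "name": {"type": "text"},       # analyzed for full-text search
                    "gender": {"type": "keyword"},  # exact-value field
                    "age": {"type": "integer"},
                    "email": {"type": "keyword"},
                }
            }
        }
    }


body = user_index_body()
print(json.dumps(body, indent=2))
```

Each entry under `properties` plays roughly the role of a column definition in a relational table.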

  • document

A document is the basic unit of indexing in ES. It is similar to a row in MySQL. Document data is in JSON format.

  • id

In MySQL we use a primary key to identify a record uniquely; in ES the id plays the same role. ES can also generate ids automatically. Automatically generated ids are URL-safe, Base64-encoded GUIDs, which ensures that ids do not conflict (they are globally unique) in a distributed environment. Of course, we can also specify the id ourselves.
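
To show what "URL-safe, Base64-encoded GUID" means in practice, here is a small Python sketch that produces ids of that general shape. It is only an illustration and is not the algorithm Elasticsearch itself uses; the function name `make_url_safe_id` is hypothetical.

```python
import base64
import uuid


def make_url_safe_id():
    """Return a random GUID encoded with the URL-safe Base64 alphabet."""
    raw = uuid.uuid4().bytes                 # 16 random bytes
    encoded = base64.urlsafe_b64encode(raw)  # 24 characters including '==' padding
    return encoded.rstrip(b"=").decode("ascii")


doc_id = make_url_safe_id()
print(doc_id, len(doc_id))
```

The URL-safe alphabet replaces `+` and `/` with `-` and `_`, so an id like this can be placed directly in a request path without escaping.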

2. Concepts of ES in a distributed setting

  • Cluster:

Anyone familiar with distributed systems will know the term cluster. Here, a cluster is an ES cluster: a distributed system made up of many ES instances.

  • Node:

A node is a single member of an ES cluster. Put simply, each running ES instance is one node in the cluster.

3. Two concepts of ES storage policy

  • Shard and replica:

To add data to Elasticsearch, we need an index — a place to store associated data. In practice, an index is just a “logical namespace” that points to one or more shards. A shard is a minimal “worker unit” that holds only a fraction of all the data in the index.

A shard is an instance of Lucene and a complete search engine in its own right. Our documents are stored and indexed in shards, but our application does not talk to the shards directly. Instead, it talks to the index. Shards are how Elasticsearch distributes data across the cluster. Think of shards as containers for data: documents are stored in shards, and shards are allocated to the nodes in your cluster. When your cluster grows or shrinks, Elasticsearch automatically migrates shards between nodes to keep the cluster balanced. A shard can be a primary shard or a replica shard.

Each document in your index belongs to a single primary shard, so the number of primary shards determines how much data the index can store. In theory there is no limit to how much data a primary shard can hold, but in practice there is. The maximum shard size depends entirely on your usage: the hardware you have, the size and complexity of your documents, how you index and query them, and the response time you expect.

A replica shard is just a copy of a primary shard. It protects against data loss caused by hardware failure and also serves read requests such as searching for or retrieving documents. The number of primary shards is fixed when the index is created, but the number of replica shards can be adjusted at any time. By default (in the 6.x version used in this article), an index is assigned 5 primary shards, and each primary shard has one replica shard.
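
One way to see why the number of primary shards is fixed at index creation: the Elasticsearch documentation describes document routing roughly as `shard = hash(routing) % number_of_primary_shards`, where the routing value defaults to the document id. The Python sketch below mimics that formula. The `zlib.crc32` hash is only a stand-in for the hash function Elasticsearch really uses, and `pick_shard` is a hypothetical helper name.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (by default the document id) to a primary shard number."""
    # crc32 is a stand-in; Elasticsearch uses a different hash internally.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


for doc_id in ["1", "2", "3", "abc"]:
    print(doc_id, "->", pick_shard(doc_id, 5))
```

Because the shard number depends on the modulus, changing the primary shard count would send the same id to a different shard, and previously indexed documents could no longer be found where routing expects them.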

Key points:

Shards come in two types:

  1. Primary shard
  2. Replica shard (also called a backup shard)

Note a common convention in the industry: the word “shard” on its own generally refers to a primary shard, while “replica” on its own refers to a replica shard.

Also note that the number of replicas is configured on the index but applies to each primary shard. If an index is configured with one replica, every primary shard has one replica shard; with five primary shards there are therefore five replica shards, each paired one-to-one with a primary shard.
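
The arithmetic behind this can be written out as a short Python sketch. The settings keys `number_of_shards` and `number_of_replicas` are the real index settings; the helper `shard_counts` is hypothetical.

```python
def shard_counts(number_of_shards, number_of_replicas):
    """Count shard copies for an index: replicas are configured per primary shard."""
    primaries = number_of_shards
    replicas = number_of_shards * number_of_replicas
    return {"primary": primaries, "replica": replicas, "total": primaries + replicas}


# The 6.x defaults described in this article: 5 primary shards, 1 replica each.
settings = {"settings": {"number_of_shards": 5, "number_of_replicas": 1}}
counts = shard_counts(**settings["settings"])
print(counts)  # {'primary': 5, 'replica': 5, 'total': 10}
```

Raising `number_of_replicas` to 2 would give ten replica shards for the same five primaries, which is why the replica count can be changed at any time without touching how documents are distributed.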

Important: a primary shard cannot be on the same node as its own replica shard. This is important enough to repeat:

A primary shard cannot be on the same node as its replica shard.

So the minimum high availability configuration for ES is two servers.
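
The two-server conclusion can be checked with a toy allocation sketch in Python. It is a deliberately simplified model and not Elasticsearch's real allocator: it places each copy of a shard on a different node and reports the copies that have nowhere to go. The node names and the `allocate` helper are hypothetical.

```python
def allocate(nodes, number_of_shards, number_of_replicas):
    """Place shard copies so that no node holds two copies of the same shard."""
    placement = {node: [] for node in nodes}
    unassigned = []
    for shard in range(number_of_shards):
        # Rotate the starting node so that copies spread across the cluster.
        start = shard % len(nodes)
        candidates = nodes[start:] + nodes[:start]
        copies = ["P"] + ["R"] * number_of_replicas  # one primary, then replicas
        for i, kind in enumerate(copies):
            if i < len(candidates):
                placement[candidates[i]].append(f"{kind}{shard}")
            else:
                # Every node already holds a copy of this shard.
                unassigned.append(f"{kind}{shard}")
    return placement, unassigned


one_node = allocate(["node-1"], 5, 1)
two_nodes = allocate(["node-1", "node-2"], 5, 1)
print("one node, unassigned:", one_node[1])
print("two nodes, unassigned:", two_nodes[1])
```

With a single node every replica stays unassigned, while two nodes are enough to place one replica of each primary.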

Installing ElasticSearch and the development tools

  • I installed ElasticSearch version 6.6.2

  • Development tool: Kibana 6.6.2 (note that the Kibana version must match the ElasticSearch version)

I also have elasticsearch-head installed locally.

As for installation, a quick search on Baidu turns up plenty of detailed step-by-step guides, so I won’t repeat them here.

How to run curl-style requests in Kibana

4. Cluster health status

The Elasticsearch cluster monitoring information contains many statistics. The most important one is cluster health, which is displayed as green, yellow, or red in the Status field.

In Kibana: GET /_cat/health?v

epoch      timestamp cluster        status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1568794410 08:13:30  my-application yellow          1         1     47  47    0    0       40             0                  -                 54.0%

We can see that the health status of my local cluster is yellow. That raises a question: how is the health status of a cluster determined?

  • Green (healthy): all primary and replica shards are working properly.
  • Yellow (sub-healthy): all primary shards are working properly, but not all replica shards are.
  • Red (unhealthy): at least one primary shard is not working properly.
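
These three rules can be expressed as a small Python function. This is a sketch of the decision logic only, under the assumption that each shard copy is described by its kind and its state; `cluster_health` is a hypothetical helper, not a client library call.

```python
def cluster_health(shards):
    """shards: list of (kind, state) pairs, where kind is 'primary' or 'replica'."""
    primaries_ok = all(state == "STARTED" for kind, state in shards if kind == "primary")
    replicas_ok = all(state == "STARTED" for kind, state in shards if kind == "replica")
    if not primaries_ok:
        return "red"     # at least one primary shard is not working
    if not replicas_ok:
        return "yellow"  # primaries are fine, some replica is not
    return "green"       # every primary and replica is working


single_node = [("primary", "STARTED"), ("replica", "UNASSIGNED")]
print(cluster_health(single_node))  # yellow
```

The check for primaries comes first because a missing primary means data is unavailable, which is more serious than a missing replica.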

Note:

I have only one Elasticsearch node configured locally. Because a primary shard and its replica shard cannot be allocated to the same node, the replica shards on my local Elasticsearch cannot be assigned, so the health status is yellow.
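
The 54.0% in the active_shards_percent column is consistent with the other numbers in the output. A minimal Python sketch, assuming the percentage is simply active shard copies divided by all shard copies (here active plus unassigned, since relo and init are both 0):

```python
def active_shards_percent(active, unassigned):
    """Share of shard copies that are active, as a percentage."""
    total = active + unassigned
    return 100.0 * active / total


# Values taken from the _cat/health output: shards=47, unassign=40.
percent = active_shards_percent(active=47, unassigned=40)
print(f"{percent:.1f}%")  # 54.0%
```

In other words, 47 of the 87 shard copies are active, and the 40 unassigned copies are the replicas that a single node cannot host.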


Original address: www.cnblogs.com/hello-shf/p…