1.1 Elasticsearch Installation and Deployment in CentOS 7

Elasticsearch

Course environment

  • CentOS 7.3 x64
  • JDK version: 1.8 (minimum required); recommended: JDK 1.8.0_121
  • Elasticsearch version: 5.2.0
  • Related software packages (Baidu Cloud download, password: 0yzd): pan.baidu.com/s/1qXQXZRm
  • Installation guide for Elasticsearch and Kibana on GitHub: github.com/judasn/Linu…
  • Install both Elasticsearch and Kibana. The rest of the tutorial is based on commands executed in Kibana's Dev Tools.

Introduction to Elasticsearch

  • The latest version (as of May 2017) is 5.4
  • Website: www.elastic.co/
  • GitHub repository: github.com/elastic/ela…
  • Elasticsearch 5.2 official documentation: www.elastic.co/guide/en/el…

Elasticsearch use cases

  • Anywhere a search function is needed
  • Log data analysis
  • BI and big data systems (reportedly there are PB-scale deployments in the industry)
  • Data analysis
  • NoSQL database

Impressions of Elasticsearch

  • Search is everywhere. For most programmers, it mainly means search inside their own business systems.
  • The search features of a traditional database perform poorly on large data volumes, so a dedicated search engine is needed to replace them.
  • The usual choice used to be Solr; now it is Elasticsearch. The latter is newer, more capable, and more in step with the times.
  • Both Solr and Elasticsearch are built on Lucene.
  • Lucene: full-text search, inverted index

Elasticsearch advantages

  • High availability with redundant copies
  • Distributed by design: it supports sharding across multiple machines, so it can handle very large data volumes
  • Encapsulates many advanced features that are convenient to call

Elasticsearch core concepts

  • Website: www.elastic.co/guide/en/el…
  • Near Realtime (NRT): there is a small delay, usually about 1 second, between indexing a document and it becoming searchable. This delay is usually unnoticeable.
  • Cluster: a cluster consists of one or more nodes that together store the data. The cluster has a name, which matters because each node's configuration uses it: a node joins a cluster by cluster name. In theory, a single node is the optimal solution, but it is only suitable for small data volumes. It is recommended to name the cluster per environment, for example YouMeek-dev, YouMeek-prod, and YouMeek-test.
  • Node: a single node belonging to a cluster. If the entire cluster has only one node, that node is the cluster itself. Each node also has a name (by default a random UUID is assigned at startup), and the name can be customized. Customizing the node name is recommended, because it matters for operations and maintenance.
  • Index: a collection of documents, similar to a database in a relational database system. The index name must be all lowercase, cannot start with an underscore, and cannot contain a comma (for example, youmeek-index).
  • Type: similar to a table in a database. An index can currently have more than one type, but Elasticsearch 6, now in development, plans to drop this feature so that an index can have only one type. For details, see: elasticsearch.cn/article/158
  • Document: the smallest unit of data in Elasticsearch, similar to a row in a database table. Like a row, a document can have multiple fields. A document is usually represented in JSON format.
  • Shard: short for primary shard (handles write operations). Elasticsearch splits the documents of an index across multiple shards and stores them on multiple servers. Each shard can hold at most 2147483519 (Integer.MAX_VALUE - 128) documents.
  • Replica: short for replica shard (read operations can be served by replicas). Replicas mainly provide high availability (failover) and data backup, and they increase search throughput by letting searches run in parallel.
  • Suppose we have two machines and use them as two nodes that form a cluster. If you create an index with the default settings, it gets 5 primary shards, and each primary shard gets one replica shard as a backup. The end result is 5 primary shards and 5 corresponding replica shards.
  • Generally, the minimum high-availability configuration is two servers. The usual recommendation is 5 machines, and preferably an odd number.
  • If there are five machines, you can plan them as follows: two nodes act as masters that coordinate cluster-level operations, with their data role disabled. The other three nodes form the data tier, with their master role disabled, so they can concentrate on storing and retrieving data. The advantage is a clear separation of responsibilities, which minimizes the chance of data-node work destabilizing the master. Otherwise, data-node tasks such as data replication, rebalancing, and routing directly affect the stability of the master, which can lead to split-brain problems.
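
The index naming rule and the shard arithmetic from the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical helpers, not part of any Elasticsearch API, and the name check covers only the three rules listed above (Elasticsearch itself enforces additional ones).

```python
# Illustrative sketch only: these helpers are hypothetical and are not
# part of Elasticsearch or any client library.

# Per-shard document limit quoted above: Integer.MAX_VALUE - 128 in Java terms.
JAVA_INT_MAX = 2**31 - 1
MAX_DOCS_PER_SHARD = JAVA_INT_MAX - 128


def total_shards(primaries: int, replicas_per_primary: int) -> int:
    """Total shard copies of one index: each primary plus its replicas."""
    return primaries * (1 + replicas_per_primary)


def max_docs_for_index(primaries: int) -> int:
    """Upper bound on documents in one index given the per-shard limit.

    Replicas hold copies of the same documents, so only primaries count.
    """
    return primaries * MAX_DOCS_PER_SHARD


def is_valid_index_name(name: str) -> bool:
    """Check only the three naming rules listed above:
    all lowercase, no leading underscore, no comma."""
    return (
        bool(name)
        and name == name.lower()
        and not name.startswith("_")
        and "," not in name
    )


if __name__ == "__main__":
    print(MAX_DOCS_PER_SHARD)                    # 2147483519
    print(total_shards(5, 1))                    # 10
    print(is_valid_index_name("youmeek-index"))  # True
    print(is_valid_index_name("YouMeek-Index"))  # False
```

With 5 primary shards and 1 replica each, the index has 10 shard copies in total, which matches the two-node example above: 5 primary shards and 5 replica shards.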

Further reading

  • Basics
    • To get started with Elasticsearch, read this article
    • Using ElasticSearch, the big data distributed search engine: from 0 to 1
    • Elasticsearch learning notes
    • Deploying Elasticsearch in a cluster
    • Elasticsearch Java API in depth
  • Intermediate
    • On the core technology and practice of Lucene-based search engines
    • Elasticsearch optimization in practice at the 100-million scale
    • ElasticSearch is the most popular search engine in the world
    • Elasticsearch: The Definitive Guide
    • A diagram of how relational/non-relational databases synchronize with Elasticsearch
    • Dissecting an Elasticsearch cluster, part one
    • Richaaaard Elasticsearch