Preface

With the growth of the Internet and the explosion of data, traditional relational databases increasingly struggle to meet the requirements of fast search. For example, when a user enters a keyword, the search often needs to tokenize the input and highlight the matches. A LIKE query in a traditional relational database falls short of this in both functionality and speed. Some relational databases, such as MySQL, do support full-text indexes, but this is often still not enough. We therefore increasingly rely on a dedicated search engine to handle search, so that data can be searched and analyzed quickly.

What is Elasticsearch?

Elasticsearch is a distributed search engine based on Apache Lucene. Its core functions can be summarized in two words: search and analysis.

Elasticsearch provides near real-time search and analysis capabilities for all data types. Elasticsearch stores data efficiently whether it’s structured text or unstructured text, numeric or geographic, and builds indexes in a way that makes searching fast. And because of its naturally distributed nature, deployments can scale up seamlessly as data volumes grow.

What about Apache Lucene?

Apache Lucene is a high-performance, full-featured text search engine library. It is a mature, free, open source toolkit for full-text indexing and search, and its goal is to make full-text search broadly available.

Lucene was originally written in Java by Doug Cutting, who donated it to Apache in 2001, by which time Lucene had gained a strong user base. Over time, Lucene has also been ported to C++, C#, Python, and other languages.

Apache Lucene has a very active community. Because it is open source and mature, other search servers build on it as well: Solr, like Elasticsearch, is a search implementation based on Apache Lucene. By analogy with MySQL, Lucene is the query engine layer, while Elasticsearch and Solr correspond to the MySQL service layer.

Common terms

If you are new to Elasticsearch, some of its terms may be unfamiliar. Let's look at some common Elasticsearch terms by comparing them to their relational database counterparts.

  • Index: corresponds to a database in a relational database.
  • Mapping type: equivalent to a table in a traditional database. Mapping types are deprecated in 7.x and no longer supported in 8.x.
  • Document: equivalent to a row of data in a relational database.
  • Field: equivalent to a column in a relational database.
  • Mapping: equivalent to a CREATE TABLE statement in a traditional database. It sets the data type and other properties of each field. If you do not define a mapping, one is created automatically.
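To make the comparison concrete, here is a minimal sketch in Python that models a mapping and a document as plain dictionaries. The index name `books` and all field names are hypothetical, and nothing here talks to a real cluster; it only illustrates how a mapping declares field types and how a document supplies values for those fields.

```python
import json

# Hypothetical index name; it plays the role the table name plays in SQL.
index_name = "books"

# Mapping: declares the data type of each field, much like the column
# definitions in a CREATE TABLE statement.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "price": {"type": "float"},
            "published": {"type": "date"},
        }
    }
}

# Document: a single JSON object, comparable to one row.
document = {"title": "Search Basics", "price": 19.5, "published": "2020-01-01"}

# Fields in the document that the mapping does not declare. In a real
# cluster such fields would get an automatically generated mapping.
declared = mapping["mappings"]["properties"]
undeclared = [name for name in document if name not in declared]

print(index_name)
print(json.dumps(mapping, indent=2))
print("undeclared fields:", undeclared)
```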

Why were mapping types removed?

There are two reasons for removing mapping types:

  • Comparing a mapping type to a database table is not accurate. In a traditional relational database, two columns with the same name in different tables are unrelated from a storage perspective, whatever their logical relationship. In Elasticsearch, however, two fields with the same name in different mapping types of the same index are backed by the same Lucene field. That is, fields of the same name in different mapping types of the same index must have the same mapping (data structure definition), which can lead to conflicts.
  • More importantly, storing different mapping types with few or no common fields in the same index leads to sparse data and hampers Lucene's ability to compress documents effectively.
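The first point can be illustrated with a toy model. The sketch below is not Elasticsearch code; it assumes a simplified rule in which every field name in an index maps to exactly one definition, and the type and field names (`user`, `tweet`, `name`, `likes`) are made up for illustration.

```python
def merge_type_mappings(types):
    """Merge the field definitions of several mapping types into a single
    index-wide field table, rejecting same-named fields that disagree."""
    merged = {}
    for type_name, fields in types.items():
        for field, definition in fields.items():
            if field in merged and merged[field] != definition:
                raise ValueError(
                    f"field '{field}' in type '{type_name}' conflicts with "
                    f"an existing definition: {merged[field]} vs {definition}"
                )
            merged[field] = definition
    return merged


# Two hypothetical types in one index that agree on the shared field.
compatible = {
    "user": {"name": {"type": "text"}},
    "tweet": {"name": {"type": "text"}, "likes": {"type": "integer"}},
}

# The same two types, but "name" is defined differently in each.
conflicting = {
    "user": {"name": {"type": "text"}},
    "tweet": {"name": {"type": "keyword"}},
}

print(merge_type_mappings(compatible))
try:
    merge_type_mappings(conflicting)
except ValueError as err:
    print("rejected:", err)
```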

Without the mapping type, we can define different data types as different indexes instead of using the mapping type to distinguish them. This approach also has two advantages:

  • The data is more likely to be dense and therefore can better take advantage of the compression techniques used in Lucene.
  • The statistics used for scoring in full-text search may be more accurate, because all documents in the same index represent the same kind of entity.
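The density argument can be made tangible with a little arithmetic. The following sketch is only an illustration under a simplifying assumption: it treats an index as a table with one column per distinct field and measures what fraction of the cells hold a value. The sample documents are hypothetical, and the number says nothing about Lucene's actual on-disk format.

```python
def fill_ratio(documents):
    """Fraction of (document, field) cells that actually hold a value."""
    all_fields = set()
    for doc in documents:
        all_fields.update(doc)
    if not documents or not all_fields:
        return 1.0
    filled = sum(len(doc) for doc in documents)
    return filled / (len(documents) * len(all_fields))


# Hypothetical data: two kinds of entity with no fields in common.
users = [
    {"name": "a", "email": "a@example.com"},
    {"name": "b", "email": "b@example.com"},
]
orders = [
    {"sku": "x1", "qty": 2},
    {"sku": "y2", "qty": 1},
]

print("mixed in one index:", fill_ratio(users + orders))
print("users only:", fill_ratio(users))
print("orders only:", fill_ratio(orders))
```

With both kinds of entity in one index, half of the cells are empty; with one index per entity, every cell is filled.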

Install and configure Elasticsearch

Elasticsearch is easy to install: download the version you need from the official Elasticsearch download page. Note that Elasticsearch requires Java version 1.8 or later and cannot be started as root, so you must create a separate user to run it.

After downloading and extracting Elasticsearch, the directory it was extracted to is referred to as ES_HOME.

If the installation is in tar.gz or zip format, the default configuration file is in the $ES_HOME/config directory. In this case, you can change the default path of the configuration file by setting the environment variable ES_PATH_CONF. The config directory contains the following configuration files:

  • elasticsearch.yml: Elasticsearch configuration.
  • jvm.options: JVM-related configuration.
  • log4j2.properties: logging configuration.
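The lookup rule for the configuration directory can be sketched in a few lines of Python. This is a hypothetical helper for a script or deployment check, not Elasticsearch's own startup logic; the install path `/opt/elasticsearch` is an assumption made for the example.

```python
import os
from pathlib import Path

# The three configuration files listed above.
CONFIG_FILES = ("elasticsearch.yml", "jvm.options", "log4j2.properties")


def resolve_config_dir(es_home, env=None):
    """Return ES_PATH_CONF if it is set, otherwise <es_home>/config."""
    env = os.environ if env is None else env
    override = env.get("ES_PATH_CONF")
    return Path(override) if override else Path(es_home) / "config"


def expected_config_files(es_home, env=None):
    """List the paths where the configuration files are expected."""
    config_dir = resolve_config_dir(es_home, env)
    return [config_dir / name for name in CONFIG_FILES]


# Hypothetical install location, with no ES_PATH_CONF override.
for path in expected_config_files("/opt/elasticsearch", env={}):
    print(path)
```

Passing `env` explicitly keeps the helper easy to test; leaving it out makes the function read the real environment.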

Configuration you must know

Elasticsearch has a few key settings that you need to know how to configure. The most important ones are described below.

Configure data and log directories

To configure the data or log directory, use the following settings (if they are not set, the directories default to locations under $ES_HOME):

path:
    data: /var/lib/elasticsearch
    logs: /var/log/elasticsearch

The data directory supports multiple directories. To configure multiple directories, you can use the following configuration:

path:
  data:
    - /mnt/elasticsearch_1
    - /mnt/elasticsearch_2
    - /mnt/elasticsearch_3

Configuring the cluster name

If you are running an Elasticsearch cluster, you must specify the cluster name (the default is elasticsearch). Make sure the name is unique, so that nodes do not join the wrong cluster:

cluster.name: logging-prod

Configuring the node name

The node name is important because it is included in the responses of many API calls. If the node name is not configured, the host name of the server is used as the node name at startup:

node.name: prod-data-2

Configuring the Network IP Address

As with most middleware, if the network IP is not configured, only the local machine can access it by default. If you want other machines to access it, you need to do the following:

network.host: 192.168.1.10

To make the node reachable from all machines, you can simply set network.host to 0.0.0.0.

Elasticsearch switches from development mode to production mode if network.host is set to 0.0.0.0 or any other non-loopback address. In development mode, a failed bootstrap check is only logged as a warning and the node still starts. In production mode, the same failure is an error that prevents startup, for example:

bootstrap check failure [1] of [1]: the default discovery settings are unsuitable for production use; at least one of [di

Local loopback address

A non-loopback address is any address that is not a local loopback address. A local loopback address is an IP address starting with 127 (the whole 127.0.0.0/8 range). In most cases the loopback address in use is 127.0.0.1 (the IPv6 equivalent is ::1), and localhost resolves to it, which is why many people treat 127.0.0.1 and localhost as interchangeable. Strictly speaking, the loopback address is not necessarily 127.0.0.1, and localhost does not necessarily resolve to 127.0.0.1; both depend on the configuration.
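Whether a given address counts as loopback can be checked with Python's standard `ipaddress` module. The sketch below applies the rule described above (loopback means development mode, anything else means production mode); the helper name `expected_mode` is hypothetical, and the function accepts IP literals only, not host names such as localhost.

```python
import ipaddress


def is_loopback(address):
    """True if the IP literal is a loopback address (127.0.0.0/8 or ::1)."""
    return ipaddress.ip_address(address).is_loopback


def expected_mode(network_host):
    """Mode implied by the rule above for a given network.host value."""
    return "development" if is_loopback(network_host) else "production"


for host in ("127.0.0.1", "127.10.20.30", "::1", "192.168.1.10", "0.0.0.0"):
    print(f"{host:15} loopback={is_loopback(host)!s:5} mode={expected_mode(host)}")
```

Note that 0.0.0.0 is not a loopback address, which is why configuring it triggers production mode.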

Cluster discovery configuration

As mentioned above, when network.host is set to a non-loopback address, Elasticsearch switches to production mode. In production mode, some cluster settings must be configured.

Configure discovery.seed_hosts

By default, Elasticsearch works out of the box without any network configuration: it scans ports 9300 to 9305 on the local machine and tries to connect to any nodes it finds there to form a cluster. This mode is not secure, so in production mode we use static configuration to discover cluster hosts, listing their IP addresses or domain names in discovery.seed_hosts. If the port is omitted, 9300 is used by default; IPv6 addresses must be enclosed in square brackets.

discovery.seed_hosts:
  - 192.168.1.10:9300
  - 192.168.1.11
  - seeds.mydomain.com
  - [0:0:0:0:0:ffff:c0a8:10c]:9301

Note that if you use a domain name that resolves to multiple IP addresses, Elasticsearch attempts to connect to all of them.
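The entry format (optional port, bracketed IPv6) can be illustrated with a small parser. This is a simplified sketch written for this article, not the parser Elasticsearch uses; it assumes the default port of 9300 mentioned above and does no validation of the host part.

```python
DEFAULT_TRANSPORT_PORT = 9300


def parse_seed_host(entry, default_port=DEFAULT_TRANSPORT_PORT):
    """Split a seed host entry into (host, port).

    Accepts 'host', 'host:port', '[ipv6]' and '[ipv6]:port'.
    """
    entry = entry.strip()
    if entry.startswith("["):
        # Bracketed IPv6 literal, optionally followed by ':port'.
        end = entry.find("]")
        if end == -1:
            raise ValueError(f"missing closing bracket in {entry!r}")
        host = entry[1:end]
        rest = entry[end + 1:]
        if not rest:
            return host, default_port
        if not rest.startswith(":"):
            raise ValueError(f"unexpected text after bracket in {entry!r}")
        return host, int(rest[1:])
    # IPv4 address or domain name, optionally followed by ':port'.
    host, sep, port = entry.rpartition(":")
    if not sep:
        return entry, default_port
    return host, int(port)


for item in (
    "192.168.1.10:9300",
    "192.168.1.11",
    "seeds.mydomain.com",
    "[0:0:0:0:0:ffff:c0a8:10c]:9301",
):
    print(item, "->", parse_seed_host(item))
```

The brackets are what make the IPv6 case unambiguous: without them, the parser could not tell the last address group from a port number.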

Configure discovery.seed_providers

The discovery.seed_hosts setting above lists the master-eligible nodes of the cluster statically (whether a node is master-eligible is controlled by the node.master setting). If these nodes do not have fixed names or addresses, you can use discovery.seed_providers to find their addresses dynamically. There are two ways of doing this, which I will not expand on here and will cover later when I introduce clusters.

Configure cluster.initial_master_nodes

When an Elasticsearch cluster starts for the first time, a bootstrapping step determines the set of master-eligible nodes whose votes are counted in the first election. In development mode, with no discovery settings configured, Elasticsearch performs this step automatically. This automatic mode is inherently insecure, so when a new cluster is started in production mode, you must explicitly list all master-eligible nodes whose votes should be counted in the first election.

Note: the names listed in cluster.initial_master_nodes must exactly match the node.name of the corresponding nodes.
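Because the names are compared exactly, a typo or a difference in letter case is enough to break the match. A minimal sketch of such a pre-flight check, with hypothetical node names and a hypothetical helper name:

```python
def unmatched_master_nodes(initial_master_nodes, node_names):
    """Return the configured names that match no node.name exactly."""
    known = set(node_names)
    return [name for name in initial_master_nodes if name not in known]


# Hypothetical node.name values of the nodes in the cluster.
nodes = ["node-1", "node-2", "node-3"]

# All names match exactly, so nothing is reported.
print(unmatched_master_nodes(["node-1", "node-2", "node-3"], nodes))

# "Node-2" differs only in case, but it still does not match.
print(unmatched_master_nodes(["node-1", "Node-2"], nodes))
```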

Starting Elasticsearch

With the settings above explained, a single machine only needs the following configuration changes:

network.host: 0.0.0.0
node.name: node-1
cluster.initial_master_nodes: ["node-1"]

After modifying the configuration, go to the bin directory and run ./elasticsearch. To run Elasticsearch in the background, run ./elasticsearch -d instead.

After startup, the default HTTP port is 9200. If accessing http://ip:9200 returns a JSON document containing the node name, the cluster name, and version information, the startup was successful.
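The same check can be scripted with Python's standard library. This is a minimal sketch that assumes a node listening on http://localhost:9200 without authentication; the helper names are hypothetical. If nothing is listening, it reports that instead of failing.

```python
import json
import urllib.request


def fetch_root(base_url, timeout=3.0):
    """GET the root endpoint and decode the JSON body."""
    with urllib.request.urlopen(base_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(info):
    """Build a one-line summary from the root endpoint's response."""
    version = info.get("version", {}).get("number", "unknown")
    return (
        f"node={info.get('name')} "
        f"cluster={info.get('cluster_name')} "
        f"version={version}"
    )


if __name__ == "__main__":
    try:
        # Assumed address; replace with the IP of your own node.
        print(summarize(fetch_root("http://localhost:9200")))
    except (OSError, ValueError) as err:
        # OSError covers refused connections, timeouts and HTTP errors;
        # ValueError covers a response body that is not valid JSON.
        print("Elasticsearch is not reachable:", err)
```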

Other errors during startup

If max_map_count is not modified during startup, the following error will occur:

bootstrap check failure [1] of [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [

This is because Elasticsearch uses an mmapfs directory by default to store its indices. The operating system's default limit on mmap counts is likely to be too low, which may result in out-of-memory exceptions. (If you install from a package such as RPM, this setting is changed automatically.)

On Linux, you can run sysctl -w vm.max_map_count=262144 to change the value, and then run sysctl vm.max_map_count to check that it took effect. A value set this way does not survive a reboot. To make it permanent, edit /etc/sysctl.conf (for example with vim /etc/sysctl.conf) to set vm.max_map_count=262144, and then run sysctl -p for the settings to take effect.
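If you prefer to check the value from a script before starting the node, the current setting can be read from /proc on Linux. The sketch below is a hypothetical pre-flight check; the threshold of 262144 is the value used in the sysctl command above.

```python
from pathlib import Path

# Threshold taken from the sysctl command above.
REQUIRED_MAX_MAP_COUNT = 262144


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the current vm.max_map_count value (Linux only)."""
    return int(Path(path).read_text().strip())


def is_sufficient(value, required=REQUIRED_MAX_MAP_COUNT):
    """True if the value is at least the required threshold."""
    return value >= required


if __name__ == "__main__":
    try:
        current = read_max_map_count()
    except OSError:
        print("vm.max_map_count is not readable on this system")
    else:
        status = "ok" if is_sufficient(current) else "too low"
        print(f"vm.max_map_count={current} ({status})")
```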

Conclusion

This article introduced Elasticsearch and Apache Lucene, the search library it is built on, along with common terms, the key configuration settings, and how to start a single node. The next installment will cover how to use Elasticsearch.