Preface
With the growth of the Internet and the explosion of data, traditional relational databases are increasingly unable to meet the demands of fast search. When a user enters a keyword, for example, we often need word segmentation and result highlighting, and a LIKE query in a relational database falls short in both functionality and speed. Some relational databases, such as MySQL, do support full-text indexes, but this is still not enough. We therefore increasingly need a dedicated search engine to handle search, so that data can be searched and analyzed quickly.
What is Elasticsearch
Elasticsearch is a distributed search engine based on Apache Lucene. Its core functions can be summarized in two words: search and analysis.
Elasticsearch provides near real-time search and analysis capabilities for all data types. Elasticsearch stores data efficiently whether it’s structured text or unstructured text, numeric or geographic, and builds indexes in a way that makes searching fast. And because of its naturally distributed nature, deployments can be scaled up seamlessly as data volumes grow.
About Apache Lucene
Apache Lucene is a high-performance, full-featured text search engine library. It is a mature, free, open-source toolkit for full-text indexing and search. Its goal is to bring full-text search to any application that embeds it.
Lucene was originally written in Java by Doug Cutting, who donated it to Apache in 2001 once it had gained a strong user base. Over time, ports of Lucene to C++, C#, Python and other languages have also appeared.
Apache Lucene has a very active community, and because it is open source and mature, other search products are built on it: Solr, like Elasticsearch, is a search server based on Apache Lucene. (By analogy with MySQL, Lucene is the underlying engine layer, while Elasticsearch and Solr correspond to the MySQL server layer.)
Common terms
If you are new to Elasticsearch, some of its terminology may be unfamiliar. Let’s go through the common terms by comparing them to their counterparts in a relational database.
- Index: corresponds to a database in a relational database.
- Mapping type: corresponds to a table in a relational database. Mapping types are deprecated in 7.x and no longer supported in 8.x.
- Document: corresponds to a row in a relational database.
- Field: corresponds to a column in a relational database.
- Mapping: corresponds to a CREATE TABLE statement in a relational database. It sets the data type and other properties of each field. If no mapping is defined explicitly, one is created by default.
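To make the comparison with a CREATE TABLE statement more concrete, here is a minimal sketch of what an explicit mapping body might look like, written as a Python dictionary. The index name `articles` and its fields are hypothetical and chosen only for illustration; `text`, `keyword` and `date` are standard Elasticsearch field types.

```python
import json

# Hypothetical index name and fields, used only for illustration.
INDEX_NAME = "articles"

# Rough analogue of a CREATE TABLE statement: each entry under
# "properties" names a field (column) and declares its data type.
mapping_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},         # analyzed for full-text search
            "author": {"type": "keyword"},     # stored as an exact value
            "published_at": {"type": "date"},
        }
    }
}


def field_types(body):
    """Return a {field name: data type} view of a mapping body."""
    props = body["mappings"]["properties"]
    return {name: spec["type"] for name, spec in props.items()}


if __name__ == "__main__":
    # The JSON printed here is the kind of body sent when an index
    # is created with an explicit mapping.
    print(f"PUT /{INDEX_NAME}")
    print(json.dumps(mapping_body, indent=2))
```

If the mapping is left out, Elasticsearch infers field types from the first documents it sees, which is the "created by default" behaviour mentioned in the list.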
Why mapping types were removed
There are two reasons for removing mapping types:
- Comparing a mapping type to a table in a database is not accurate. In a relational database, two columns with the same name in different tables are unrelated from a storage perspective (whatever their logical relationship). In Elasticsearch, however, fields with the same name in different mapping types of the same index are backed by the same Lucene field. Fields with the same name in different mapping types of the same index must therefore have the same mapping definition, which can lead to conflicts in some situations.
- Storing entities that have few or no fields in common in different mapping types of the same index leads to sparse data, which impairs Lucene’s ability to compress documents efficiently. This is the more important of the two reasons.
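The first reason can be illustrated with a small model. The following Python sketch is not Elasticsearch code; it only simulates the rule that, within one index, a field name resolves to a single underlying definition regardless of which mapping type declares it. The names `merge_type_mappings` and `MappingConflict` are hypothetical.

```python
class MappingConflict(Exception):
    """Raised when one field name is declared with two different data types."""


def merge_type_mappings(types):
    """Merge per-type field definitions into one index-wide field table.

    `types` maps a mapping type name to a {field name: data type} dict.
    Because all types in an index share the same underlying fields,
    a field name may only ever have one data type.
    """
    merged = {}
    for type_name, fields in types.items():
        for field, data_type in fields.items():
            existing = merged.get(field)
            if existing is not None and existing != data_type:
                raise MappingConflict(
                    f"field {field!r} is declared as both {existing!r} and "
                    f"{data_type!r} (second declaration in type {type_name!r})"
                )
            merged[field] = data_type
    return merged
```

In this model, two types that both declare `name` as `text` merge cleanly, while a `deleted` field declared as `date` in one type and `boolean` in another raises `MappingConflict`, which mirrors the conflict situation described above.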
Without mapping types, different kinds of data can be stored in separate indexes instead of being distinguished by mapping type. This approach has two advantages:
- Data is more likely to be dense, so it benefits more from the compression techniques used in Lucene.
- The term statistics used for scoring in full-text search are more likely to be accurate, because all documents in the same index represent a single entity.
Install and configure Elasticsearch
Elasticsearch is easy to install: download the version you need from the official Elasticsearch download page. Note that Elasticsearch requires Java 1.8 or later and cannot be started as root, so you must create a separate user to run it.
After downloading and unpacking Elasticsearch, the directory it was extracted to is its home directory, referred to below as $ES_HOME.
If you installed from the tar.gz or zip archive, the configuration files are in the $ES_HOME/config directory by default. You can change the location of the configuration directory by setting the environment variable ES_PATH_CONF. The config directory contains the following configuration files:
- elasticsearch.yml: configuration for Elasticsearch itself.
- jvm.options: JVM-related configuration.
- log4j2.properties: logging configuration.
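The lookup rule for the configuration directory can be sketched in a few lines of Python. This is an illustrative helper, not part of Elasticsearch; the function names are hypothetical, and only the ES_PATH_CONF variable, the config directory and the three file names come from the text above.

```python
import os
from pathlib import Path

# The three configuration files listed above.
CONFIG_FILES = ("elasticsearch.yml", "jvm.options", "log4j2.properties")


def resolve_config_dir(es_home, env=None):
    """Return the config directory: ES_PATH_CONF if set, else <es_home>/config."""
    env = os.environ if env is None else env
    override = env.get("ES_PATH_CONF")
    return Path(override) if override else Path(es_home) / "config"


def config_file_paths(es_home, env=None):
    """Return the expected paths of the three configuration files."""
    config_dir = resolve_config_dir(es_home, env)
    return [config_dir / name for name in CONFIG_FILES]
```

Passing `env` explicitly keeps the helper easy to test; when it is omitted, the real process environment is consulted.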
Configuration you must know
Elasticsearch has a few key settings that you need to know how to configure. The most important ones are described below.
Configure data and log paths
To set the data and log directories, use the following configuration (if they are not set, both default to subdirectories of $ES_HOME):
path:
  data: /var/lib/elasticsearch
  logs: /var/log/elasticsearch
The data directory supports multiple directories. To configure multiple directories, you can use the following configuration:
path:
  data:
    - /mnt/elasticsearch_1
    - /mnt/elasticsearch_2
    - /mnt/elasticsearch_3
Configuring the cluster name (cluster.name)
If you run an Elasticsearch cluster, you must specify the cluster name (the default is elasticsearch). A node can only join a cluster whose name it shares, so make sure each cluster’s name is unique:
cluster.name: logging-prod
Configuring the node name (node.name)
The node name is important because it appears in the responses of many API calls. If it is not configured, the host name of the server is used as the node name at startup:
node.name: prod-data-2
Configuring the Network IP Address
As with most middleware, if the network IP is not configured, only the local machine can access it by default. If you want other machines to access it, you need to do the following:
network.host: 192.168.1.10
To make the node reachable from all machines, you can simply set network.host to 0.0.0.0.
When network.host is set to 0.0.0.0 or any other non-loopback address, Elasticsearch switches from development mode to production mode. In development mode, a failed bootstrap check only produces a warning and the node still starts. In production mode, the same failure prevents startup with an error such as:
bootstrap check failure [1] of [1]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
Local loopback address
A non-loopback address is any address other than a local loopback address. A local loopback address is an IP address starting with 127. In most cases the loopback address used is 127.0.0.1 (the IPv6 equivalent is ::1); other 127.x.x.x addresses are rarely used. Loopback traffic passes through the transport layer (TCP and so on) and the network layer (IP) but never reaches hardware devices such as network cards. Loopback addresses are therefore commonly used to test the local network configuration, for software testing, and for local services, since they require no hardware network interface.
localhost is actually a host name. It is usually defined in the hosts file and by default points to the loopback address 127.0.0.1. This is why 127.0.0.1 and localhost are treated as equivalent, but the equivalence can be changed by editing that mapping.
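Whether a given address counts as loopback can be checked with Python’s standard `ipaddress` module. The sketch below handles literal IP addresses only; host names such as localhost would first have to be resolved, which is deliberately left out. The function name `is_loopback` is a hypothetical helper.

```python
import ipaddress


def is_loopback(address):
    """Return True if a literal IP address is a loopback address.

    Covers the whole 127.0.0.0/8 range for IPv4 and ::1 for IPv6.
    Raises ValueError if `address` is not a literal IP address.
    """
    return ipaddress.ip_address(address).is_loopback
```

Note that 0.0.0.0 is the unspecified address, not a loopback address, which is consistent with it being treated as a non-loopback setting for network.host.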
Cluster discovery configuration
As mentioned above, when network.host is set to a non-loopback address, Elasticsearch switches to production mode. In production mode, some cluster discovery settings must be configured.
Configure discovery.seed_hosts
By default, Elasticsearch needs no network configuration and works out of the box: it scans local ports 9300 to 9305 for other nodes and tries to connect to them to form a cluster. This mode is not secure, so in production mode the cluster hosts must be listed statically. The discovery.seed_hosts setting specifies the IP addresses or domain names of the cluster nodes (when the port is omitted it defaults to 9300; an IPv6 address must be enclosed in []):
discovery.seed_hosts:
  - 192.168.1.10:9300
  - 192.168.1.11
  - seeds.mydomain.com
  - [0:0:0:0:0:ffff:c0a8:10c]:9301
Note that if a domain name resolves to multiple IP addresses, Elasticsearch attempts to connect to all of them.
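The entry formats accepted by discovery.seed_hosts (optional port, bracketed IPv6) can be sketched as a small parser. This is an illustrative Python sketch and not Elasticsearch’s own parsing code; `parse_seed_host` and `DEFAULT_TRANSPORT_PORT` are hypothetical names, and the default of 9300 is the one stated in this article.

```python
DEFAULT_TRANSPORT_PORT = 9300  # default port as stated in the text


def parse_seed_host(entry):
    """Split one seed host entry into a (host, port) tuple.

    Accepted forms: "host", "host:port", "[ipv6]" and "[ipv6]:port".
    """
    entry = entry.strip()
    if entry.startswith("["):
        # Bracketed IPv6 literal, optionally followed by ":port".
        host, bracket, rest = entry[1:].partition("]")
        if not bracket:
            raise ValueError(f"missing closing bracket in {entry!r}")
        if rest == "":
            return host, DEFAULT_TRANSPORT_PORT
        if not rest.startswith(":"):
            raise ValueError(f"unexpected text after bracket in {entry!r}")
        return host, int(rest[1:])
    host, colon, port = entry.rpartition(":")
    if not colon:
        # No colon at all: the whole entry is the host name or IPv4 address.
        return entry, DEFAULT_TRANSPORT_PORT
    return host, int(port)
```

The brackets are what make IPv6 entries unambiguous: without them, the colons inside the address could not be told apart from the colon that introduces the port.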
Configure discovery.seed_providers
In addition to the static discovery.seed_hosts list above, if the master-eligible nodes of the cluster (whether a node is master-eligible is controlled by the node.master setting) do not have fixed names or addresses, you can use discovery.seed_providers to find their addresses dynamically. There are two ways of doing this, which are not covered here and will be explained later when clusters are introduced.
Configure cluster.initial_master_nodes
When an Elasticsearch cluster is started for the first time, it must determine the set of master-eligible nodes whose votes are counted in the first election. In development mode, with no discovery settings configured, Elasticsearch performs this step automatically. Because this automatic mode is inherently unsafe, when a new cluster is started in production mode you must explicitly list the master-eligible nodes whose votes should be counted in the first election.
PS: The node names listed here must match the nodes’ node.name values.
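The relationship between network.host and the three discovery settings can be summarized in a simplified model. The Python sketch below is not how Elasticsearch implements its bootstrap checks; it assumes settings are given as a flat dictionary, handles only literal IP addresses for network.host, and uses the hypothetical names `is_production_mode` and `discovery_check`.

```python
import ipaddress

# The three settings named in the bootstrap check error message.
DISCOVERY_SETTINGS = (
    "discovery.seed_hosts",
    "discovery.seed_providers",
    "cluster.initial_master_nodes",
)


def is_production_mode(settings):
    """Model: a non-loopback network.host switches the node to production mode."""
    host = settings.get("network.host", "127.0.0.1")  # unset means local only
    return not ipaddress.ip_address(host).is_loopback


def discovery_check(settings):
    """Return a list of problems; an empty list means the check passes."""
    problems = []
    configured = any(settings.get(key) for key in DISCOVERY_SETTINGS)
    if is_production_mode(settings) and not configured:
        problems.append(
            "at least one of [" + ", ".join(DISCOVERY_SETTINGS) + "] must be configured"
        )
    return problems
```

Under this model, a node bound to 127.0.0.1 passes without any discovery settings, while a node bound to 0.0.0.0 passes only once at least one of the three settings is present.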
Start Elasticsearch
With the explanation above in mind, a single-machine setup only needs the following configuration changes:
network.host: 0.0.0.0
node.name: node-1
cluster.initial_master_nodes: ["node-1"]
After modifying the configuration, go to the bin directory and run ./elasticsearch to start Elasticsearch. To run it in the background, use ./elasticsearch -d.
After startup, the default HTTP port is 9200. If accessing http://ip:9200 returns a JSON document describing the node (its name, cluster name and version), the startup was successful.
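The same check can be scripted. The following Python sketch uses only the standard library; `fetch_root` and `summarize_root` are hypothetical helper names, and the code assumes the root endpoint returns a JSON object with `name`, `cluster_name` and `version.number` fields and that no authentication is required.

```python
import json
from urllib.request import urlopen


def fetch_root(base_url, timeout=5.0):
    """GET the root endpoint of a node and return the decoded JSON body."""
    with urlopen(base_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_root(payload):
    """Build a one-line summary from the root endpoint's JSON body."""
    version = payload.get("version", {}).get("number", "unknown")
    return (
        f"node={payload.get('name')} "
        f"cluster={payload.get('cluster_name')} "
        f"version={version}"
    )
```

Against a running node, `print(summarize_root(fetch_root("http://localhost:9200")))` would print the node name, cluster name and version; a connection error from `fetch_root` indicates that the node is not reachable on that address.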
Other errors during startup
If vm.max_map_count has not been raised before startup, the following error occurs:
bootstrap check failure [1] of [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
This is because Elasticsearch stores its indexes in an mmapfs directory by default, and the operating system’s default limit on mmap counts is too low, which may result in out-of-memory exceptions (when Elasticsearch is installed from a package such as RPM, the limit is adjusted automatically).
On Linux, you can run sysctl -w vm.max_map_count=262144 to change the value, and then run sysctl vm.max_map_count to check that it took effect. This change does not survive a reboot. To make it permanent, edit /etc/sysctl.conf (for example with vim /etc/sysctl.conf) to set vm.max_map_count, and then run sysctl -p to apply the setting.
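The current value can also be checked from a script before starting Elasticsearch. This Python sketch reads the same kernel parameter from /proc, which is Linux only; the helper names are hypothetical, and the threshold of 262144 is the one named in the bootstrap error.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144  # minimum named in the bootstrap error
PROC_FILE = Path("/proc/sys/vm/max_map_count")  # Linux only


def parse_max_map_count(text):
    """Parse the contents of the proc file into an integer."""
    return int(text.strip())


def is_sufficient(value, required=REQUIRED_MAX_MAP_COUNT):
    """Return True if the limit meets the required minimum."""
    return value >= required


def current_max_map_count():
    """Read the live value from the kernel (raises OSError off Linux)."""
    return parse_max_map_count(PROC_FILE.read_text())
```

Calling `is_sufficient(current_max_map_count())` on the target machine tells you whether the sysctl change is still needed.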
Conclusion
This article introduced Elasticsearch and Apache Lucene, the search library underneath it, and walked through installing and configuring Elasticsearch. The next installment will cover how to use Elasticsearch.