1 Introduction to Elasticsearch
Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data.
Elasticsearch has three major capabilities: search, analysis, and storage.
Key Elasticsearch terms
- Node: a single instance of Elasticsearch is called a node
- Cluster: a group of nodes working together forms a cluster
- Shard: when an index holds more data than a single node's memory and disk can handle, its data is split into shards and distributed across nodes
- Replica: a copy of a primary shard, providing data backup and high availability (shard and replica counts are set per index, as sketched after this list)
- Analyzer: Elasticsearch's default standard analyzer consists of the standard tokenizer and three token filters: the standard token filter, the lowercase token filter, and the stop token filter
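To make shards and replicas concrete, here is a minimal sketch that creates an index with explicit shard and replica settings over the REST API; the cluster URL (http://localhost:9200) and the index name are assumptions for illustration.

```python
import requests

# Create an index with explicit shard/replica settings.
# Assumes a local cluster at http://localhost:9200 and a
# hypothetical index name "products".
resp = requests.put(
    "http://localhost:9200/products",
    json={
        "settings": {
            "number_of_shards": 3,    # split the index across 3 primary shards
            "number_of_replicas": 1,  # one backup copy of each primary shard
        }
    },
)
print(resp.json())  # {"acknowledged": True, ...} on success
```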
2 Core of Elasticsearch
2.1 Architecture Design
2.2 Storage
Elasticsearch is a document-oriented database; it stores data as JSON documents.
Elasticsearch only has concepts such as index, type, document, and field.
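As a sketch of this document model, the snippet below indexes one JSON document and reads it back; the index name, document id, and field values are hypothetical.

```python
import requests

# Index a JSON document into a hypothetical "products" index with id 1.
doc = {"name": "ThinkPad X1", "price": 5999, "tags": ["laptop", "portable"]}
resp = requests.put("http://localhost:9200/products/_doc/1", json=doc)
print(resp.json()["result"])  # "created" on first write, "updated" afterwards

# Fetch it back by id; the stored _source is the same JSON we sent.
print(requests.get("http://localhost:9200/products/_doc/1").json()["_source"])
```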
2.3 Search
Elasticsearch search is built on inverted indexes combined with powerful tokenizers.
Elasticsearch is designed to serve searches from memory as much as possible, minimizing disk access.
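For example, a full-text match query is analyzed into terms that are then looked up in the inverted index; this minimal sketch reuses the hypothetical products index from above.

```python
import requests

# A "match" query analyzes the query string into terms and looks each
# term up in the inverted index. Index and field names are hypothetical.
query = {"query": {"match": {"name": "thinkpad"}}}
resp = requests.post("http://localhost:9200/products/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```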
Points to note:
- Explicitly define fields that do not need indexing; by default an index is created automatically for every field, so disabling unneeded ones saves memory
- String fields are analyzed by default; if a field does not need analysis, define it explicitly (both points are shown in the mapping sketch after this list)
- Use regular (ordered) document ids; overly random ids (such as UUIDs) are bad for query performance
- Choosing the right tokenizer makes searches more accurate
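The mapping sketch below illustrates the first two points: indexing is disabled for a field that is never searched, and an exact-match string uses the keyword type so that it skips analysis. The index and field names are hypothetical.

```python
import requests

mapping = {
    "mappings": {
        "properties": {
            # Never searched: disable indexing to save memory.
            "internal_note": {"type": "text", "index": False},
            # Exact-match string: keyword type, no analysis.
            "sku": {"type": "keyword"},
            # Full-text field: analyzed (the default for text).
            "description": {"type": "text"},
        }
    }
}
resp = requests.put("http://localhost:9200/catalog", json=mapping)
print(resp.json())
```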
2.4 Analysis
Analysis consists of the following processes:
- First, the text is divided into separate terms suitable for inverted indexing;
- Then, the terms are normalized into a standard form to improve their “searchability”, i.e. recall.
The analyzer is responsible for this work. An analyzer bundles three functions into a single package:
- Character filters: first, the string passes through each character filter in order. A character filter tidies up the string before tokenization; it can strip out specific symbols or convert characters, for example turning & into and.
- Tokenizer: next, the processed string is split into individual terms by the tokenizer.
- Token filters: finally, the terms pass through each token filter in order. This step may change terms (for example, harmonizing case), remove terms (for example, stopwords such as a, and, the), or add terms (for example, synonyms to improve searchability). The _analyze sketch after this list shows the combined effect.
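The _analyze API makes this pipeline visible. The sketch below runs a sentence through the default standard analyzer; the local cluster URL is an assumption.

```python
import requests

# Inspect what the standard analyzer produces: the standard tokenizer
# splits on word boundaries (dropping punctuation such as "&"), and the
# lowercase token filter normalizes case.
body = {"analyzer": "standard", "text": "The Quick & Brown Foxes"}
resp = requests.post("http://localhost:9200/_analyze", json=body)
print([t["token"] for t in resp.json()["tokens"]])
# -> ['the', 'quick', 'brown', 'foxes']
```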
3 The pros and cons of Elasticsearch
Advantages:
- Distributed by design, supporting horizontal scaling;
- Provides a rich RESTful API, lowering the barrier to entry; it can be called from any programming language;
- Highly available, with data backup via replicas; multiple replicas can be configured;
Disadvantages:
- Data consistency issues; prone to split-brain;
- No permission management;
4 Data sources for Elasticsearch
Where does Elasticsearch's data come from? In practice we collect it in various ways: typically we pull data from third-party stores with Logstash, or use Elastic Agent or Beats to collect metrics, logs, traces, and events from applications and infrastructure.
4.1 Elastic Agent
With Elastic Agent, you can collect all forms of data from anywhere with a single unified agent per host, making installation, configuration, and scaling simpler.
4.2 Beats
Beats is a platform of lightweight data shippers that send data from edge machines to Logstash or Elasticsearch. Written in Go, Beats run fast with a small footprint.
4.3 Logstash
Logstash is a server-side data processing pipeline that captures data from multiple sources, transforms it, and sends it to different repositories.
Features:
- Parses and transforms data in real time;
- Reliability and security: persistent queues let Logstash guarantee at-least-once delivery of in-flight events, and data transmission can be encrypted;
- Highly extensible, with a rich plugin ecosystem;
4.4 Alibaba Canal (MySQL incremental data synchronization)
Canal is middleware for MySQL incremental data synchronization. It disguises itself as a MySQL replica and sends dump protocol requests to the MySQL primary; on receiving the dump request, MySQL pushes its binary log (binlog) to Canal, which parses the log entries and replays the changes downstream (for example, into Elasticsearch) to keep the stores in sync.
4.5 CRUD on ES from business code
Performing create, read, update, and delete (CRUD) operations on ES through AOP aspects, or by integrating the calls directly into business code, can also meet the requirements, but it is highly intrusive and not recommended; a sketch of such direct calls follows.
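For completeness, this is roughly what such direct integration looks like: a minimal CRUD sketch against the REST API, with a hypothetical index name and document values.

```python
import requests

ES = "http://localhost:9200"  # hypothetical local cluster
INDEX = "orders"              # hypothetical index name

# Create (or overwrite) a document with id 42.
requests.put(f"{ES}/{INDEX}/_doc/42", json={"status": "paid", "amount": 199})

# Read it back.
print(requests.get(f"{ES}/{INDEX}/_doc/42").json()["_source"])

# Partial update via the _update endpoint.
requests.post(f"{ES}/{INDEX}/_update/42", json={"doc": {"status": "shipped"}})

# Delete the document.
requests.delete(f"{ES}/{INDEX}/_doc/42")
```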
5 Application scenarios of Elasticsearch
- Search engines (full text search, highlighting, search recommendations)
- Massive data storage (not suitable for frequent updates; no transaction support); works especially well combined with crawlers or log-collection systems
- Data analysis (ELK log analysis)
6 ELK centralized logging solution
ELK is short for Elasticsearch, Logstash, and Kibana. It provides a unified solution for centralized log management and log analysis.
Features of a centralized log management system:
- Collect – collects log data from multiple sources
- Transfer – reliably transfers log data to a centralized storage system
- Store – stores large-scale log data and supports convenient, fast queries
- Analyze – supports visual analysis reports
- Alert – provides error reporting and monitoring mechanisms