Elasticsearch

Elasticsearch overview

  • Elasticsearch is a distributed, multi-tenant, full-text search engine built on the Lucene library.
  • Elasticsearch is written in Java.
  • Elasticsearch exposes a REST interface and works with JSON documents, so it can be called from any programming language.
  • Elasticsearch can search and analyze any type of data in near real time, whether structured, semi-structured, or unstructured. It stores data efficiently and searches it quickly.
  • It offers high performance, high availability (of both data and services), horizontal scalability, and ease of use.
  • It supports different node types.

Application scenarios of Elasticsearch

  • Add search capabilities to your application or website

  • Store and analyze logs, metrics, and security event data

  • Use machine learning to automatically model the behavior of data in real time

  • Automate business workflows using Elasticsearch as the storage engine

  • Manage, integrate, and analyze spatial information using Elasticsearch as a geographic information system (GIS)

  • Store and process genetic data using Elasticsearch as a bioinformatics research tool

Elasticsearch family

Elasticsearch ecosystem

  • Logstash and Beats: data collection
  • Elasticsearch: data storage, analysis, and computation
  • Kibana: data visualization
  • X-Pack: mainly commercial features such as security, monitoring, alerting, graph queries, and machine learning

Logstash

Introduction
  • An open source, server-side data processing pipeline that collects data from different sources, transforms it, and sends it to different destinations.
  • Originally used for log collection and processing.
Features
  • Parses and transforms data in real time
  • Extensible
    • More than 200 plug-ins (logs, databases, ArcSight, NetFlow)
  • Reliable and secure
    • Persistent queues ensure that in-flight events are delivered at least once
    • Encrypted data transmission
  • Monitoring

Kibana

Introduction
  • The name combines Kiwi fruit + Banana
  • A data visualization tool that helps users answer questions about their data
  • Started as a tool built on Logstash

The Beats layer collects data and either stores it directly in Elasticsearch or sends it to Logstash for parsing and filtering before it is stored in Elasticsearch. Elasticsearch is the storage engine and provides APIs for searching and analyzing the data. Kibana works with Elasticsearch to provide visualization and interaction.

Basic concepts of Elasticsearch

Document

  • Elasticsearch is document oriented; a document is the basic unit of information that can be indexed
  • Documents are serialized to JSON and stored in Elasticsearch
    • A JSON object consists of fields
    • Each field has a field type, such as string, numeric, Boolean, date, binary, or range
    • JSON documents are flexible; the format does not need to be defined in advance
    • The field type can be specified explicitly or inferred automatically by Elasticsearch
    • Arrays and nested JSON objects are supported
  • Each document has a unique ID
    • You can specify your own ID
    • Or let Elasticsearch generate one automatically
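
The two ID options above correspond to two request shapes in the document API. The following is a minimal sketch that only builds the requests with Python's standard library and does not send them. The address http://localhost:9200 and the helper name build_index_request are assumptions made for illustration.

```python
import json
import urllib.request

# Assumed address of a local Elasticsearch node; adjust for your cluster.
BASE_URL = "http://localhost:9200"


def build_index_request(index, doc, doc_id=None):
    """Build (but do not send) a request that indexes one JSON document.

    With doc_id:    PUT  /<index>/_doc/<id>  (the caller chooses the ID)
    Without doc_id: POST /<index>/_doc       (Elasticsearch generates the ID)
    """
    if doc_id is None:
        url, method = f"{BASE_URL}/{index}/_doc", "POST"
    else:
        url, method = f"{BASE_URL}/{index}/_doc/{doc_id}", "PUT"
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


movie = {"id": "37475", "title": "Unfinished Life, An", "year": 2005, "genre": ["Drama"]}
with_id = build_index_request("movies", movie, doc_id="37475")
auto_id = build_index_request("movies", movie)
print(with_id.get_method(), with_id.full_url)
print(auto_id.get_method(), auto_id.full_url)
```

Passing either request to urllib.request.urlopen would send it, which requires a reachable cluster.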
Document metadata
{
  "_index" : "movies",
  "_type" : "_doc",
  "_id" : "37475",
  "_score" : 1.0,
  "_source" : {
    "genre" : [
      "Drama"
    ],
    "id" : "37475",
    "year" : 2005,
    "title" : "Unfinished Life, An",
    "@version" : "1"
  }
}

Document metadata describes the document:

  • _index: the index the document belongs to
  • _type: the type the document belongs to (_doc)
  • _id: the document's unique ID
  • _source: the original JSON content of the document
  • @version: a version field; in this example it is stored inside _source (Elasticsearch's own version metadata is named _version)
  • _score: the relevance score
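
To make the split between metadata and content concrete, here is a small sketch that separates the underscore-prefixed metadata keys from the _source body of the sample document. The helper name split_hit is hypothetical and not part of any client library.

```python
import json

# A hit shaped like the sample document above.
hit = json.loads("""
{
  "_index": "movies", "_type": "_doc", "_id": "37475", "_score": 1.0,
  "_source": {"genre": ["Drama"], "id": "37475", "year": 2005,
              "title": "Unfinished Life, An", "@version": "1"}
}
""")


def split_hit(hit):
    """Separate metadata (keys starting with '_') from the document body."""
    metadata = {k: v for k, v in hit.items() if k.startswith("_") and k != "_source"}
    return metadata, hit["_source"]


metadata, source = split_hit(hit)
print(metadata)
print(source["title"])
```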

Index

  • An index is a container for documents, a collection of documents of the same kind
  • Each index has its own mapping, which defines the field names and field types of the documents it contains
  • The data in an index is spread across shards; the index settings define how the data is distributed
// index settings
{
    "settings":
    {
        "index":
        {
            "creation_date": "1624690171977",
            "number_of_shards": "1",
            "number_of_replicas": "1",
            "uuid": "HqFyAwvOQ8Ctfwy7Cbwz-A",
            "version":
            {
                "created": "7090299"
            },
            "provided_name": "movies"
        }
    }
}
// index mapping
{
  "mappings": {
    "_doc": {
      "properties": {
        "@version": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "genre": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "id": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "title": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "year": {
          "type": "long"
        }
      }
    }
  }
}
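
The settings and mapping above can also be written down explicitly when an index is created. The sketch below builds such a request body as a plain Python dict. It assumes the typeless 7.x form, which omits the _doc level shown in the mapping above; the helper names text_with_keyword and build_create_index_body are hypothetical.

```python
import json


def text_with_keyword(ignore_above=256):
    """A text field with a keyword sub-field, as in the mapping above."""
    return {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": ignore_above}},
    }


def build_create_index_body(shards=1, replicas=1):
    """Request body for creating a movies-like index (assumed typeless 7.x form)."""
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {
            "properties": {
                "id": text_with_keyword(),
                "title": text_with_keyword(),
                "genre": text_with_keyword(),
                "@version": text_with_keyword(),
                "year": {"type": "long"},
            }
        },
    }


body = build_create_index_body()
print(json.dumps(body, indent=2))
```

A body like this would be sent with PUT /movies. Declaring the mapping up front is optional, since Elasticsearch can infer field types from the first documents it sees.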

Type

In an index, you can define one or more types. A type is a logical classification/partition of your index, and its semantics are entirely up to you. Typically, a type is defined for documents that have the same set of fields. For example, let’s say you run a blogging platform and store all your data in an index. In this index, you can define one type for user data, another type for blog data, and, of course, another type for comment data.

Note that:

  1. Before 6.0, an index could contain multiple types
  2. Since 6.0, types have been deprecated and a newly created index can hold only one type. As of 7.0, that single type is _doc
  3. In 8.0, types are removed completely

Cluster

  • An Elasticsearch cluster is a group of one or more nodes that together hold all of your data and provide indexing and search functions.

  • A cluster is identified by a unique name, which defaults to "elasticsearch". Different clusters are distinguished by their names. You can change the name in the configuration file or pass -E cluster.name=es_demo on the startup command line.

  • A distributed cluster provides high availability

    • High availability of services: the cluster keeps serving requests when some nodes stop
    • High availability of data: no data is lost even if some nodes are lost
  • Scalability

    • Handles growing request and data volumes by distributing data across all nodes
    • Scales horizontally: a new node joins a cluster by specifying the cluster's name
  • Cluster status:

    • Green: all primary shards and replicas are allocated
    • Yellow: all primary shards are allocated, but at least one replica is not
    • Red: at least one primary shard is not allocated, for example when a new index is created while the server's disk usage exceeds 85%
# View cluster status
curl -i http://192.168.0.41:9200/_cluster/health
{
    "cluster_name": "es_demo",
    "status": "yellow",
    "timed_out": false,
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 7,
    "active_shards": 7,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 1,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 87.5
}
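
The status colors and the percentage in this response can be reproduced with a few lines of Python. This is a sketch of the rules as described above, not Elasticsearch's own code. The function names are hypothetical, and the percentage formula is an assumption that happens to reproduce the 87.5 in the sample response.

```python
def health_status(unassigned_primaries, unassigned_replicas):
    """Map shard allocation to the status colors described above."""
    if unassigned_primaries > 0:
        return "red"
    if unassigned_replicas > 0:
        return "yellow"
    return "green"


def active_percent(health):
    """Share of shards that are active, computed from a health response (assumed formula)."""
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    return 100.0 * health["active_shards"] / total if total else 100.0


# Shard counters copied from the sample response above.
health = {
    "active_primary_shards": 7,
    "active_shards": 7,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 1,
}
print(health_status(unassigned_primaries=0, unassigned_replicas=1))  # yellow
print(active_percent(health))  # 87.5
```

The sample cluster has a single node, so a replica has no second node to be placed on. That one unassigned replica is what keeps the status at yellow.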

Node

  • A node is an instance of Elasticsearch, which is essentially a Java process. You can run multiple Elasticsearch processes on one server by changing the port, but in production you are advised to run only one instance per machine.
  • Each node has a name, set either in the configuration file or at startup with -E node.name=node1.
  • After a node starts, it is assigned a unique ID, which is stored in the data directory.
  • There are several types of node, each with different functions.
    • Master node & master-eligible node
    • Data node & coordinating node
    • Hot & warm node
    • Machine learning node
    • Tribe node
  • A single node can play multiple roles in a development environment, but in production each node should be dedicated to a single role.
Node type            Configuration parameter    Default value
master-eligible      node.master                true
data                 node.data                  true
ingest               node.ingest                true
coordinating only    none                       Every node is a coordinating node by default; set all other roles to false
machine learning     node.ml                    true (requires X-Pack)
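
The defaults in the table can be read as a simple rule: a role is on unless its setting is switched off, and a node with every role switched off is left as coordinating only. The sketch below encodes that rule. The function name effective_roles is hypothetical, and node.ml is left out because it depends on X-Pack being enabled.

```python
# Setting name -> role label, taken from the table above.
ROLE_SETTINGS = {
    "node.master": "master-eligible",
    "node.data": "data",
    "node.ingest": "ingest",
}


def effective_roles(settings):
    """Return the roles a node plays, treating any missing setting as true."""
    roles = [label for key, label in ROLE_SETTINGS.items() if settings.get(key, True)]
    # Every node coordinates requests; with all other roles off that is all it does.
    return roles or ["coordinating only"]


print(effective_roles({}))
print(effective_roles({"node.data": False, "node.ingest": False}))
print(effective_roles({"node.master": False, "node.data": False, "node.ingest": False}))
```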
Master Node & Master-eligible Node
  • Every node is master-eligible by default when it starts. Set node.master: false in the configuration file to disable this.
  • A master-eligible node can take part in the master election and become the master node.
  • When the first node starts, it elects itself as the master node.
  • The cluster state is stored on every node, but only the master node can change it
    • The cluster state holds the essential information about the cluster
      • Information about all nodes
      • All indexes with their mappings and settings
      • Shard routing information
    • If any node could modify the cluster state, the copies would become inconsistent
Data Node & Coordinating Node
  • A node that can store data is called a data node. It holds shard data and plays a crucial role in scaling data capacity.
  • A coordinating node accepts client requests, distributes them to the appropriate nodes, and gathers the results into a final response.
  • Every node acts as a coordinating node by default.
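
The gathering step of a coordinating node can be pictured as merging several ranked lists into one. The following is a simplified model of that idea, not Elasticsearch's actual implementation. The (score, doc_id) pairs and the function name merge_shard_hits are made up for illustration.

```python
import heapq
from itertools import islice


def merge_shard_hits(per_shard_hits, size):
    """Merge per-shard lists (each sorted by descending score) into a global top `size`."""
    merged = heapq.merge(*per_shard_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, size))


# Hypothetical (score, doc_id) pairs returned by three shards.
shard_results = [
    [(3.2, "a"), (1.1, "b")],
    [(2.7, "c"), (2.5, "d")],
    [(0.9, "e")],
]
print(merge_shard_hits(shard_results, size=3))
```

Because each shard returns its hits already sorted, the merge only has to look at the head of each list, so the coordinating node never needs to re-sort the full result set.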
Hot & Warm Node
  • Data nodes with different hardware configurations can be used to implement the Hot & Warm architecture, reducing the cost of cluster deployment.
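Machine Learning Node
  • Runs machine learning jobs for anomaly detection.
Tribe Node
  • A tribe node connects to several Elasticsearch clusters and lets them be treated as a single cluster. Tribe nodes were deprecated in 5.4 and removed in 7.0 in favor of cross-cluster search.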

Shard & Replica

  • Primary shards solve the problem of scaling data horizontally. Sharding lets you distribute data across all nodes in a cluster.
    • A shard is a running instance of Lucene
    • The number of primary shards is set with number_of_shards when the index is created and cannot be changed afterwards, except by reindexing
  • Replicas solve the problem of data high availability. A replica is a copy of a primary shard.
    • The number of replicas is set with number_of_replicas and can be adjusted dynamically
    • Adding replicas can also improve service availability to some extent
  • Shard capacity must be planned for production environments
    • If the number of shards is too small:
      • Adding nodes cannot achieve horizontal scaling
      • A single shard holds too much data, so redistributing data takes a long time
    • If the number of shards is too large (since 7.0 the default number of primary shards is 1, which addresses over-sharding):
      • Relevance scoring of search results and the accuracy of aggregations are affected
      • Too many shards on a single node waste resources and hurt performance
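
One way to see why number_of_shards is fixed at creation time: a document is routed to a shard by hashing its routing value (the document ID by default) modulo the number of primary shards. The sketch below uses a simplified form of that rule. zlib.crc32 is a stand-in hash chosen for this example, not the hash Elasticsearch uses, and the function name shard_for is hypothetical.

```python
import zlib


def shard_for(routing, number_of_shards):
    """Pick a primary shard as hash(routing) % number_of_shards.

    zlib.crc32 is a stand-in hash for this sketch; Elasticsearch uses its
    own hash function internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards


doc_ids = [str(i) for i in range(1000)]
before = {d: shard_for(d, 3) for d in doc_ids}  # index created with 3 shards
after = {d: shard_for(d, 4) for d in doc_ids}   # same documents with 4 shards
moved = sum(1 for d in doc_ids if before[d] != after[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Changing the shard count changes the result of the modulo for most documents, so existing documents would no longer be found on the shard where they were stored. Reindexing avoids this by writing every document again under the new count.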

Elasticsearch vs RDBMS

RDBMS     Elasticsearch
Table     Index
Row       Document
Column    Field
Schema    Mapping
SQL       DSL

Elasticsearch provides high-performance full-text search, but it has no transaction support and no JOIN support.
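
To illustrate the SQL to DSL row of the comparison, the sketch below builds two Query DSL bodies as Python dicts. The first is roughly what a simple SQL equality filter becomes. The second is a full-text match, which has no direct SQL counterpart. The helper names are hypothetical, and the field names come from the movies example earlier.

```python
import json


def term_query(field, value, size=10):
    """Roughly: SELECT * FROM <index> WHERE <field> = <value> LIMIT <size>."""
    return {"query": {"term": {field: value}}, "size": size}


def match_query(field, text):
    """Full-text search on an analyzed text field."""
    return {"query": {"match": {field: text}}}


print(json.dumps(term_query("year", 2005)))
print(json.dumps(match_query("title", "unfinished life")))
```

Either body would be sent to the search endpoint of an index, for example POST /movies/_search.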