Elasticsearch has many advantages over relational databases: high performance, scalability, near-real-time search, and support for analytics over large volumes of data. It is not a panacea, however. Unlike relational databases, which use normal forms to regulate data, Elasticsearch offers no first-class solution for relationships between indexed entities. Good data modeling in Elasticsearch is therefore especially important.

Management of relational data

Consider a classic one-to-many relationship: a movie has multiple actors. How should this relationship be modeled in Elasticsearch? Suppose the data model is:

Movie has two properties: title, actors
Actor has two properties: first_name, last_name
Ordinary objects
PUT my_movies
{
  "mappings" : {
    "properties" : {
      "actors" : {
        "properties" : {
          "first_name" : {"type" : "keyword"},
          "last_name" : {"type" : "keyword"}
        }
      },
      "title" : {
        "type" : "text",
        "fields" : {
          "keyword" : {"type" : "keyword", "ignore_above" : 256}
        }
      }
    }
  }
}
  • When actor information changes, every movie containing that actor must be updated as well, so performance is poor for frequently updated data
  • Field correlation is lost: a query for actors.first_name = a AND actors.last_name = b also matches movies where one actor has first_name = a and a different actor has last_name = b (see the query sketch after this list)
  • Data is redundant: the same actor is stored repeatedly across different movies
  • Read performance is optimal, since no join-style query is needed
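A minimal sketch of the cross-matching problem described above, using the same placeholder names a and b: with the plain object mapping, actors.first_name and actors.last_name are flattened into two independent arrays at index time, so this query also matches movies where the two values come from different actors.

POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"actors.first_name": "a"}},
        {"match": {"actors.last_name": "b"}}
      ]
    }
  }
}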
Nested object
PUT my_movies
{
  "mappings" : {
    "properties" : {
      "actors" : {
        "type": "nested",   # declare actors as a nested object; the default type is "object"
        "properties" : {
          "first_name" : {"type" : "keyword"},
          "last_name" : {"type" : "keyword"}
        }
      },
      "title" : {
        "type" : "text",
        "fields" : {
          "keyword" : {"type" : "keyword", "ignore_above" : 256}
        }
      }
    }
  }
}
  • To define a nested object, simply set the field's type to "nested"
  • Each actor in a nested field is stored as a separate hidden Lucene document, which is joined with the root document at query time
  • Because each nested object is indexed independently, the correlation between its fields is preserved in queries:
# Query movies that have an actor with first_name = a AND last_name = b
POST my_movies/_search
{
  "query": {
    "nested": {           # nested objects live in separate hidden documents and must be queried with a nested query
      "path": "actors",   # path is required because an index may contain multiple nested fields
      "query": {
        "bool": {
          "must": [
            {"match": {"actors.first_name": "a"}},
            {"match": {"actors.last_name": "b"}}
          ]
        }
      }
    }
  }
}
  • Nested objects also support nested sorting and nested aggregations
  • Adding, deleting, or modifying a nested object still reindexes the entire root document, so performance is poor for frequently updated data
  • A nested query returns the entire root document rather than just the matching nested objects (see the inner_hits sketch after the parameters below)
  • Related parameters:
index.mapping.nested_fields.limit: the maximum number of distinct nested fields in an index, 50 by default
index.mapping.nested_objects.limit: the maximum number of nested objects a single document may contain, 10000 by default
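As noted above, a nested query returns only the root document by itself. A minimal sketch of inner_hits, an option on the nested query that additionally reports which nested objects matched:

POST my_movies/_search
{
  "query": {
    "nested": {
      "path": "actors",
      "query": {
        "match": {"actors.first_name": "a"}
      },
      "inner_hits": {}   # include the matching nested actor documents in the response
    }
  }
}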
Parent/Child objects
PUT my_movies
{
  "mappings": {
    "properties": {
      "movie_comments_relation": {   # a join-type field on my_movies that declares the parent/child relationship
        "type": "join",              # the join type
        "relations": {               # declare the parent/child relation
          "movie": "actor"           # movie is the parent name, actor is the child name
        }
      },
      "content": {"type": "text"},
      "title": {"type": "keyword"}
    }
  }
}
  • A parent document and a child document are two independent documents stored in the same index; one index contains both parents and children
  • The join field links the two documents together, modeling a one-to-many relationship between them
  • Parent and child documents must be stored on the same shard, so routing must be provided when a child document is created, updated, deleted, or read. The parent-child mapping is maintained in doc values
  • An index may contain only one join field, i.e. only one parent/child relationship definition
  • The main advantages of parent/child:
- Updating a parent document does not require reindexing its child documents
- Creating, modifying, or deleting a child document affects neither the parent nor the other children, which suits scenarios with many or frequently updated child documents
  • Index parent and child documents
# Index the parent document with ID = movie1
PUT my_movies/_doc/movie1
{
  "title": "The Matrix",
  "movie_comments_relation": {   # for a parent this can be abbreviated to "movie_comments_relation": "movie"
    "name": "movie"              # declares that this document is a parent (movie)
  }
}

# Index a child document. Routing (set to the parent ID) is required so that the
# parent and child land on the same shard and join queries perform well
PUT my_movies/_doc/actor1?routing=movie1
{
  "first_name": "Jack",
  "last_name": "Moble",
  "movie_comments_relation": {
    "name": "actor",     # declares that this document is a child (actor)
    "parent": "movie1"   # the ID of its parent is movie1
  }
}
  • Parent/Child queries
# Reading a child document by ID requires routing = parent ID
GET my_movies/_doc/actor1?routing=movie1

# The parent_id query returns the child documents of a parent, given the parent ID
POST my_movies/_search
{
  "query": {
    "parent_id": {
      "type": "actor",   # the name of the child relation
      "id": "movie1"     # the parent document ID
    }
  }
}

# A has_child query returns parent documents based on conditions on their child documents
POST my_movies/_search
{
  "query": {
    "has_child": {
      "type": "actor"."query": {// query the subdocument first_name equals"Jack"All parent documents of the"match": {"first_name" : "Jack"}
       }
    }
  }
}

# A has_parent query returns child documents based on conditions on their parent document
POST my_movies/_search
{
  "query": {
    "has_parent": {
      "parent_type": "movie"."query" : {
        "match": {"title" : "Learning Hadoop"}
       }
    }
  }
}
Application-side association
  • Relationships can be simulated in application-side business logic, much as in a relational database
  • Store movies and actors in two indexes, and add a field to each actor that holds the ID of its parent movie
  • Application-side association may require two queries, which costs some performance, but the query handling is simple and easy to implement (see the sketch after this list)
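A minimal sketch of the two-step lookup, assuming separate my_actors and my_movies indexes and a movie_id field on each actor (all names here are hypothetical):

# Step 1: find the actor; the application reads movie_id from the hit
POST my_actors/_search
{
  "query": {
    "match": {"first_name": "Jack"}
  }
}

# Step 2: fetch the movie by the movie_id obtained in step 1
GET my_movies/_doc/movie1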

Index reconstruction and update

The extensibility and stability of a model are very important. If the model is poorly defined, frequent index rebuilds may be needed later as requirements change. So when does an index need to be rebuilt?

- Mapping changes: a field's type, analyzer, or dictionary is updated
- Setting changes: the number of primary shards is changed
- Data is migrated within or between clusters

Elasticsearch provides two ways to update and rebuild an index: Update By Query and Reindex.

Update By Query
  • Update By Query rebuilds documents in an existing index in place, and suits scenarios such as adding a new field
# Set dynamic to false: the mapping no longer changes dynamically; new fields are
# stored in _source but not indexed
PUT test
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "text": {"type": "text"}
    }
  }
}

# The newly added flag field is not indexed, so it cannot be queried
POST test/_doc?refresh
{
  "text": "words words",
  "flag": "bar"
}

# First add flag to the mapping, then run _update_by_query so that existing
# documents are reindexed and flag becomes searchable
PUT test/_mapping
{
  "properties": {
    "text": {"type": "text"},
    "flag": {"type": "text", "analyzer": "keyword"}
  }
}

POST test/_update_by_query?refresh&conflicts=proceed
  • Update By Query Version conflict
- Update By Query first takes a snapshot and records document versions. If a document changes while the update runs, a version conflict occurs
- By default a version conflict aborts the whole update, though documents already updated are not rolled back
- Setting the conflicts parameter to proceed lets the update continue past version conflicts instead of aborting
  • You can Update By Query multiple indexes at the same time
POST twitter,blog/_update_by_query
  • Update the index of the specified shard with the routing parameter
POST twitter/_update_by_query?routing=1
  • Update By Query uses scroll internally, processing 1000 documents per batch by default; the batch size can be changed with scroll_size
POST twitter/_update_by_query?scroll_size=100
  • Update By Query documents can be preprocessed using pipelines
PUT _ingest/pipeline/set-foo
{
  "description" : "sets foo",
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "bar"
      }
    }
  ]
}

POST twitter/_update_by_query?pipeline=set-foo
  • Task API: index updates can be time-consuming, so ES can run them asynchronously and expose progress through the Task API
# Run the update asynchronously with wait_for_completion=false; a task ID is returned
POST twitter/_update_by_query?wait_for_completion=false

# Progress can then be queried by task ID
GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619
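A long-running update can also be stopped through the same Task API; a minimal sketch using the task ID from above:

# Cancel the running update-by-query task
POST _tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel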
ReIndex API

ES does not allow changing the field type of existing data in a Mapping. The only option is to create a new index with the correct field types and import the data again. This is where the Reindex API comes in.

As with Update By Query, progress can be obtained asynchronously with the parameter wait_for_completion=false
# As with Update By Query, conflicts=proceed controls behavior on version conflicts
POST _reindex?wait_for_completion=false&conflicts=proceed
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fix",
    "op_type": "create"   # if dest already contains documents, version conflicts may occur;
                          # op_type=create writes only documents that do not yet exist
  }
}
  • When is the ReIndex API used
- Changing the number of primary shards of an index
- Changing the type of a Mapping field
- Migrating data within a cluster or across clusters
  • The _source field must be enabled for the Reindex API to work
  • The ReIndex API also supports rebuilding indexes across clusters, enabling data migration
# The remote host must be whitelisted on the destination cluster:
# reindex.remote.whitelist: "otherhost:9200"
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",   # the remote cluster address
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {           # reindex only documents whose test field matches "data"
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}
  • max_docs: limits the total number of documents processed by a Reindex
POST _reindex
{
  "max_docs": 1,
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}
  • Multiple source indexes can be reindexed into a single target index
POST _reindex
{
  "source": {
    "index": ["twitter", "blog"]
  },
  "dest": {
    "index": "all_together"
  }
}
  • You can select only some fields for index reconstruction
POST _reindex
{
  "source": {
    "index": "twitter",
    "_source": ["user", "_doc"]   # reindex only the user and _doc fields of each document
  },
  "dest": {
    "index": "new_twitter"
  }
}
  • Reindex can modify document metadata through a script
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}

Ingest Pipeline

The Ingest Node gives Elasticsearch a built-in data preprocessing capability through Ingest Pipelines.

  • An Ingest Node preprocesses data: it intercepts Index or Bulk API requests, transforms the documents, and hands them back to the Index or Bulk API
  • By default every node is an Ingest Node; the role can be disabled with node.ingest: false
  • In some cases an Ingest Pipeline lets us preprocess data without deploying Logstash
  • An Ingest Pipeline processes incoming data as a chain of Processors, applied in order
  • Each Processor is an abstract encapsulation of one processing step. ES provides many built-in processors and allows custom processors via plugins
  • Built-in processors are as follows:
- Remove/Rename Processor: remove or rename a field
- Append: add entries to a field
- Convert: convert a field's data type
- Date/JSON: parse dates / parse JSON strings
- Date Index Name Processor: route documents to indexes named by a date pattern
- Fail Processor: raise an exception
- Foreach Processor: apply a processor to each element of an array
- Grok Processor: extract structured fields from log lines
- Gsub/Join/Split: string replacement, array-to-string join, string-to-array split
- Lowercase/Uppercase: case conversion
  • How to use Ingest Pipeline
# Use the _ingest/pipeline/_simulate endpoint to verify that processors behave as expected
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [        # multiple processors can be defined
      {
        "split": {         # use the split processor
          "field": "tags",       # the field to preprocess
          "separator": ","       # split on commas
        }
      }
    ]
  },
  "docs": [                # the documents to process
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You konw, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You konw, for cloud"
      }
    }
  ]
}

# Create a new pipeline named blog_pipeline
PUT _ingest/pipeline/blog_pipeline
{
  "description": "a blog pipeline",
  "processors": [          # a pipeline can contain multiple processors
    {
      "split": {
        "field": "tags",
        "separator": ","
      }
    },
    {
      "set": {
        "field": "views",
        "value": 0
      }
    }
  ]
}

# Test whether the pipeline works as expected
POST _ingest/pipeline/blog_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You konw, for cloud"
      }
    }
  ]
}

# Index a document through the pipeline
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
  "title": "Introducing cloud computering",
  "tags": "openstack,k8s",
  "content": "You konw, for cloud"
}

# Use blog_pipeline with _update_by_query, updating only documents that do not yet have a views field
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "views"
        }
      }
    }
  }
}
  • Ingest Node VS Logstash
- Data input and output: Logstash supports reading from and writing to many different data sources; an Ingest Node can only receive data through the ES REST API and write back to ES
- Data buffering: Logstash implements a simple data queue and supports retry; an Ingest Node provides no buffering
- Data processing: Logstash offers a large number of plugins and supports custom development; an Ingest Node supports the built-in processors plus custom development
- Configuration and use: Logstash must be deployed independently, which adds some architectural complexity; an Ingest Node needs no additional deployment

Painless Script

  • Painless is a scripting language designed specifically for Elasticsearch and is the default scripting language for Elasticsearch
  • Painless can be used directly as an inline script in Elasticsearch or stored for subsequent use by multiple queries
  • Painless Script is several times faster than other scripts in terms of performance
  • It extends Java syntax, supporting all Java data types and a subset of the Java APIs
  • Painless scripts are secure and support both explicitly declared and dynamically defined types
  • Painless is mainly used for the following purposes:
- Updating, deleting, and aggregating documents
- Calculating returned fields (see the script_fields sketch below)
- Adjusting document relevance scores
- Executing scripts in an Ingest Pipeline
- Processing data in the Reindex API and Update By Query
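A minimal sketch of the "calculating returned fields" use case via script_fields, assuming a tech_blogs index with a numeric views field (the index and field names are illustrative):

# script_fields computes a derived value per hit using an inline Painless script
GET tech_blogs/_search
{
  "script_fields": {
    "doubled_views": {
      "script": {
        "lang": "painless",
        "source": "doc['views'].value * 2"
      }
    }
  }
}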
  • Stored scripts:
# Save the script in the cluster state under the ID update_views
POST _scripts/update_views
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.views += params.new_views"
  }
}
  • Script caching: Scripts are expensive to compile, so ES caches the compiled scripts in the Cache
- Both inline scripts and stored scripts are cached
- 100 scripts are cached by default
- script.cache.max_size sets the maximum number of cached scripts
- script.cache.expire sets the cache expiry time
- script.max_compilations_rate limits compilation frequency, 75 compilations per 5 minutes by default

For further information about Painless scripts, please refer to the official documentation.

store field VS _source

Data stored in ES serves two main purposes: "search" and "retrieve":

  • Search: full-text search; we do not know the specific document ID and query the inverted index by keywords
  • Retrieve: Retrieve stored raw data based on ID

Of the two, "search" is served by the inverted index, which enables full-text retrieval, while "retrieve" must go through store fields or _source.

What is the _source
  • When we index the document, ES also stores the original JSON data of the document in the _source field
  • The _source field itself is not indexed and therefore cannot be searched, mainly to return the raw JSON data when searching other fields
  • If you do not want to store the _source field, set _source.enabled = false (a minimal sketch follows the list below); the following features are then lost:
- The update, update_by_query, and reindex APIs
- Highlighting
- Retrieving the original JSON of a document at search time
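A minimal sketch of disabling _source at index-creation time (the index name is illustrative), with the trade-offs listed above:

# Disable _source storage for the whole index; the raw JSON is no longer kept
PUT my_logs
{
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}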
  • If you want to store only some of the fields in the original JSON, you can include or exclude
PUT logs
{
  "mappings": {
    "_source": {
      "includes": [
        "*.count",
        "meta.*"
      ],
      "excludes": [
        "meta.description",
        "meta.other.*"
      ]
    }
  }
}
  • When searching and querying, if you only want to retrieve some of the original fields, you can use the _source field
# The whole _source is parsed first; the requested fields are then extracted and returned
GET /_search
{
  "_source": [ "obj1.*", "obj2.*" ],
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
What is a field store
  • You can store raw data separately for a field by setting the Store property for that field
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "store": true
      },
      "date": {
        "type": "date",
        "store": true
      },
      "content": {
        "type": "text"
      }
    }
  }
}
  • The Store property is disabled by default
  • Obtain the required raw data through stored_fields during query
GET my_index/_search
{
  "stored_fields": [ "title", "date" ]
}
How to store raw data correctly
  • For a particularly large field that is only used at retrieve time, consider excluding it from _source to save disk space and avoid the cost of parsing it out of the JSON on every retrieve
  • For a particularly large field that is retrieved frequently, set its store property to true so it can be read on its own without parsing the other fields
  • Within one index, avoid setting the store property on many fields: each stored field costs a separate IO to read, whereas _source is fetched in a single IO
  • In most cases, setting the store property is not recommended: _source already covers most requirements with good performance
  • Disabling _source loses a lot of functionality, so choose carefully

How to better model

Data modeling is a tool for abstractly describing the real world. This section collects some suggestions for modeling in Elasticsearch.

Field type selection
  • Text: for full-text fields. The content is analyzed into terms; use it where tokenized search is needed. Not recommended for aggregation or sorting
  • Keyword: for fields that should not be analyzed, such as IDs and enums. Suited to exact matching, and supports sorting and aggregation by default
  • Multi-fields: if a field needs both tokenized search and exact matching, add a keyword subfield to the text type
PUT /employees/
{
  "mappings" : {
    "properties" : {
      "age" : {
        "type" : "integer"
      },
      "job" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",   # a keyword subfield on the text field
            "ignore_above": 50
          }
        }
      }
    }
  }
}

# The text field can be aggregated through the keyword subfield
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  }
}

# The text field can be exact-matched through the keyword subfield
POST /employees/_search
{
  "query": {
    "term": {
      "job.keyword": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}
  • Numeric types: a type larger than necessary can hurt performance, while one too small may overflow as the business grows; pick the smallest type that comfortably fits the data
  • Dates/Booleans: Dates and booleans generally do not require much consideration and are easily selected
Field property Settings
  • Whether indexing, sorting, or aggregation is needed at all: if a field only needs to be stored, set enabled = false
  • Whether sorting and aggregation are needed: choose between doc_values and fielddata according to the scenario
  • Whether search is needed: the field's index property controls whether it is indexed; with indexing disabled the field cannot be searched, but it can still be aggregated and sorted, and its value remains in _source
  • eager_global_ordinals: for keyword fields that are frequently updated and aggregated, setting eager_global_ordinals = true builds the global ordinals mapping eagerly and improves aggregation performance
  • How to store the original data: see "store field VS _source" above
  • What to store in the inverted index: setting index_options appropriately can improve inverted-index performance
  • Whether relevance scoring is needed: norms can be disabled when scoring is not required; when enabled, norms store many scoring factors, which takes considerable space (a combined mapping sketch of these properties follows this list)
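A minimal sketch combining several of these properties in one mapping; the index and field names are hypothetical:

PUT my_products
{
  "mappings": {
    "properties": {
      "internal_payload": {
        "type": "object",
        "enabled": false                # stored in _source only: not indexed, not searchable
      },
      "category": {
        "type": "keyword",
        "eager_global_ordinals": true   # build global ordinals eagerly for frequent aggregations
      },
      "price": {
        "type": "integer",
        "index": false                  # not searchable, but still sortable/aggregatable via doc values
      },
      "description": {
        "type": "text",
        "norms": false,                 # no scoring factors stored; saves space
        "index_options": "freqs"        # keep only doc IDs and term frequencies in the inverted index
      }
    }
  }
}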
Other modeling optimization suggestions
  • Index Alias: decouple the names the application uses from physical index names, so indexes can be swapped without renaming or downtime, enabling seamless Reindex (see the alias sketch at the end of this section)
  • Index Template: standardize the index-creation process by defining index templates
  • Avoid too many fields; the maximum number of fields can be capped with index.mapping.total_fields.limit. Too many fields cause the problems below:
1) The index becomes difficult to maintain
2) Mapping information is stored in the cluster state, so an oversized mapping can affect cluster performance
3) Removing or changing a field requires a Reindex
  • Do not enable the dynamic property in production; define field mappings in advance, because dynamically added fields are hard to maintain
  • Avoid regex, wildcard, and fuzzy-match queries where possible, because their query performance is poor
  • Guard against null values skewing aggregation results: set the field's null_value property, or use the missing parameter in the aggregation query (see the null_value sketch at the end of this section)
  • Add meta information to the Mapping to ease version management, and keep the Mapping file under Git
PUT softwares/
{
  "mappings": {
    "_meta": {
      "software_version_mapping": "1.0"
    }
  }
}
  • Kibana currently does not support the nested and parent/child types well, so weigh this trade-off when modeling associated objects
