Elasticsearch has many advantages over relational databases: high performance, scalability, near-real-time search, and support for analytics over large volumes of data. It is not a panacea, however. Unlike relational databases, which use normal forms to regulate data, Elasticsearch offers no first-class solution for relationships between indexed entities. Good data modeling in Elasticsearch is therefore especially important.

Management of relational data

Consider a classic one-to-many relationship: a movie has multiple actors. How should this relationship be modeled in Elasticsearch? Suppose the data model is:

Movie has two properties: title, actors
Actor has two properties: first_name, last_name
Ordinary objects
PUT my_movies
{
  "mappings" : {
    "properties" : {
      "actors" : {
        "properties" : {
          "first_name" : {"type" : "keyword"},
          "last_name" : {"type" : "keyword"}
        }
      },
      "title" : {
        "type" : "text",
        "fields" : {
          "keyword" : {"type" : "keyword", "ignore_above" : 256}
        }
      }
    }
  }
}
  • When actor information changes, every movie containing that actor must be updated as well, so performance is poor for frequently updated data
  • Field correlation is lost: a query for actors.first_name = a AND actors.last_name = b also matches movies where one actor has first_name = a and a different actor has last_name = b (see the query sketch after this list)
  • Data is redundant: the same actor is stored repeatedly across different movies
  • Read performance is optimal, since no join-style query is needed
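A minimal sketch of the cross-matching problem described above, using the same placeholder names a and b: with the plain object mapping, actors.first_name and actors.last_name are flattened into two independent arrays at index time, so this query also matches movies where the two values come from different actors.

POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"actors.first_name": "a"}},
        {"match": {"actors.last_name": "b"}}
      ]
    }
  }
}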
Nested object
PUT my_movies
{
  "mappings" : {
    "properties" : {
      "actors" : {
        "type": "nested",   # declare actors as a nested object; the default type is "object"
        "properties" : {
          "first_name" : {"type" : "keyword"},
          "last_name" : {"type" : "keyword"}
        }
      },
      "title" : {
        "type" : "text",
        "fields" : {
          "keyword" : {"type" : "keyword", "ignore_above" : 256}
        }
      }
    }
  }
}
  • To define a nested object, simply set the field's type to "nested"
  • Each actor in a nested field is stored as a separate hidden Lucene document, which is joined with the root document at query time
  • Because each nested object is indexed independently, the correlation between its fields is preserved in queries:
# Query movies that have an actor with first_name = a AND last_name = b
POST my_movies/_search
{
  "query": {
    "nested": {           # nested objects live in separate hidden documents and must be queried with a nested query
      "path": "actors",   # path is required because an index may contain multiple nested fields
      "query": {
        "bool": {
          "must": [
            {"match": {"actors.first_name": "a"}},
            {"match": {"actors.last_name": "b"}}
          ]
        }
      }
    }
  }
}
  • Nested objects also support nested sorting and nested aggregations
  • Adding, deleting, or modifying a nested object still reindexes the entire root document, so performance is poor for frequently updated data
  • A nested query returns the entire root document rather than just the matching nested objects (see the inner_hits sketch after the parameters below)
  • Related parameters:
index.mapping.nested_fields.limit: the maximum number of distinct nested fields in an index, 50 by default
index.mapping.nested_objects.limit: the maximum number of nested objects a single document may contain, 10000 by default
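As noted above, a nested query returns only the root document by itself. A minimal sketch of inner_hits, an option on the nested query that additionally reports which nested objects matched:

POST my_movies/_search
{
  "query": {
    "nested": {
      "path": "actors",
      "query": {
        "match": {"actors.first_name": "a"}
      },
      "inner_hits": {}   # include the matching nested actor documents in the response
    }
  }
}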
Parent/Child objects
PUT my_movies
{
  "mappings": {
    "properties": {
      "movie_comments_relation": {   # a join-type field on my_movies that declares the parent/child relationship
        "type": "join",              # the join type
        "relations": {               # declare the parent/child relation
          "movie": "actor"           # movie is the parent name, actor is the child name
        }
      },
      "content": {"type": "text"},
      "title": {"type": "keyword"}
    }
  }
}
  • A parent document and a child document are two independent documents stored in the same index; one index contains both parents and children
  • The join field links the two documents together, modeling a one-to-many relationship between them
  • Parent and child documents must be stored on the same shard, so routing must be provided when a child document is created, updated, deleted, or read. The parent-child mapping is maintained in doc values
  • An index may contain only one join field, i.e. only one parent/child relationship definition
  • The main advantages of parent/child:
- Updating a parent document does not require reindexing its child documents
- Creating, modifying, or deleting a child document affects neither the parent nor the other children, which suits scenarios with many or frequently updated child documents
  • Index parent and child documents
# Index the parent document with ID = movie1
PUT my_movies/_doc/movie1
{
  "title": "The Matrix",
  "movie_comments_relation": {   # for a parent this can be abbreviated to "movie_comments_relation": "movie"
    "name": "movie"              # declares that this document is a parent (movie)
  }
}

# Index a child document. Routing (set to the parent ID) is required so that the
# parent and child land on the same shard and join queries perform well
PUT my_movies/_doc/actor1?routing=movie1
{
  "first_name": "Jack",
  "last_name": "Moble",
  "movie_comments_relation": {
    "name": "actor",     # declares that this document is a child (actor)
    "parent": "movie1"   # the ID of its parent is movie1
  }
}
  • Parent/Child queries
# Reading a child document by ID requires routing = parent ID
GET my_movies/_doc/actor1?routing=movie1

# The parent_id query returns the child documents of a parent, given the parent ID
POST my_movies/_search
{
  "query": {
    "parent_id": {
      "type": "actor",   # the name of the child relation
      "id": "movie1"     # the parent document ID
    }
  }
}

# A has_child query returns parent documents based on conditions on their child documents
POST my_movies/_search
{
  "query": {
    "has_child": {
      "type": "actor"."query": {// query the subdocument first_name equals"Jack"All parent documents of the"match": {"first_name" : "Jack"}
       }
    }
  }
}

# A has_parent query returns child documents based on conditions on their parent document
POST my_movies/_search
{
  "query": {
    "has_parent": {
      "parent_type": "movie"."query" : {
        "match": {"title" : "Learning Hadoop"}
       }
    }
  }
}
Application-side association
  • Relationships can be simulated in application-side business logic, much as in a relational database
  • Store movies and actors in two indexes, and add a field to each actor that holds the ID of its parent movie
  • Application-side association may require two queries, which costs some performance, but the query handling is simple and easy to implement (see the sketch after this list)
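A minimal sketch of the two-step lookup, assuming separate my_actors and my_movies indexes and a movie_id field on each actor (all names here are hypothetical):

# Step 1: find the actor; the application reads movie_id from the hit
POST my_actors/_search
{
  "query": {
    "match": {"first_name": "Jack"}
  }
}

# Step 2: fetch the movie by the movie_id obtained in step 1
GET my_movies/_doc/movie1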

Index reconstruction and update

The extensibility and stability of a model are very important. If the model is poorly defined, frequent index rebuilds may be needed later as requirements change. So when does an index need to be rebuilt?

- Mapping changes: a field's type, analyzer, or dictionary is updated
- Setting changes: the number of primary shards is changed
- Data is migrated within or between clusters

Elasticsearch provides two ways to update and rebuild an index: Update By Query and Reindex.

Update By Query
  • Update By Query rebuilds documents in an existing index in place, and suits scenarios such as adding a new field
# Set dynamic to false: the mapping no longer changes dynamically; new fields are
# stored in _source but not indexed
PUT test
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "text": {"type": "text"}
    }
  }
}

# The newly added flag field is not indexed, so it cannot be queried
POST test/_doc?refresh
{
  "text": "words words",
  "flag": "bar"
}

# First add flag to the mapping, then run _update_by_query so that existing
# documents are reindexed and flag becomes searchable
PUT test/_mapping
{
  "properties": {
    "text": {"type": "text"},
    "flag": {"type": "text", "analyzer": "keyword"}
  }
}

POST test/_update_by_query?refresh&conflicts=proceed
  • Update By Query Version conflict
- Update By Query first takes a snapshot and records document versions. If a document changes while the update runs, a version conflict occurs
- By default a version conflict aborts the whole update, though documents already updated are not rolled back
- Setting the conflicts parameter to proceed lets the update continue past version conflicts instead of aborting
  • You can Update By Query multiple indexes at the same time
POST twitter,blog/_update_by_query
  • Update the index of the specified shard with the routing parameter
POST twitter/_update_by_query?routing=1
  • Update By Query uses scroll internally, processing 1000 documents per batch by default; the batch size can be changed with scroll_size
POST twitter/_update_by_query?scroll_size=100
  • Update By Query documents can be preprocessed using pipelines
PUT _ingest/pipeline/set-foo
{
  "description" : "sets foo",
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "bar"
      }
    }
  ]
}

POST twitter/_update_by_query?pipeline=set-foo
  • Task API: index updates can be time-consuming, so ES can run them asynchronously and expose progress through the Task API
# Run the update asynchronously with wait_for_completion=false; a task ID is returned
POST twitter/_update_by_query?wait_for_completion=false

# Progress can then be queried by task ID
GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619
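A long-running update can also be stopped through the same Task API; a minimal sketch using the task ID from above:

# Cancel the running update-by-query task
POST _tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel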
ReIndex API

ES does not allow changing the field type of existing data in a Mapping. The only option is to create a new index with the correct field types and import the data again. This is where the Reindex API comes in.

As with Update By Query, progress can be obtained asynchronously with the parameter wait_for_completion=false
# As with Update By Query, conflicts=proceed controls behavior on version conflicts
POST _reindex?wait_for_completion=false&conflicts=proceed
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fix",
    "op_type": "create"   # if dest already contains documents, version conflicts may occur;
                          # op_type=create writes only documents that do not yet exist
  }
}
  • When is the ReIndex API used
- Changing the number of primary shards of an index
- Changing the type of a Mapping field
- Migrating data within a cluster or across clusters
  • The _source field must be enabled for the Reindex API to work
  • The ReIndex API also supports rebuilding indexes across clusters, enabling data migration
# The remote host must be whitelisted on the destination cluster:
# reindex.remote.whitelist: "otherhost:9200"
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",   # the remote cluster address
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {           # reindex only documents whose test field matches "data"
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}
  • max_docs: limits the total number of documents processed by a Reindex
POST _reindex
{
  "max_docs": 1,
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}
  • Multiple source indexes can be reindexed into a single target index
POST _reindex
{
  "source": {
    "index": ["twitter", "blog"]
  },
  "dest": {
    "index": "all_together"
  }
}
  • You can select only some fields for index reconstruction
POST _reindex
{
  "source": {
    "index": "twitter",
    "_source": ["user", "_doc"]   # reindex only the user and _doc fields of each document
  },
  "dest": {
    "index": "new_twitter"
  }
}
  • Reindex can modify document metadata through a script
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}

Ingest Pipeline

The Ingest Node gives Elasticsearch a built-in data preprocessing capability through Ingest Pipelines.

  • An Ingest Node preprocesses data: it intercepts Index or Bulk API requests, transforms the documents, and hands them back to the Index or Bulk API
  • By default every node is an Ingest Node; the role can be disabled with node.ingest: false
  • In some cases an Ingest Pipeline lets us preprocess data without deploying Logstash
  • An Ingest Pipeline processes incoming data as a chain of Processors, applied in order
  • Each Processor is an abstract encapsulation of one processing step. ES provides many built-in processors and allows custom processors via plugins
  • Built-in processors are as follows:
- Remove/Rename Processor: remove or rename a field
- Append: add entries to a field
- Convert: convert a field's data type
- Date/JSON: parse dates / parse JSON strings
- Date Index Name Processor: route documents to indexes named by a date pattern
- Fail Processor: raise an exception
- Foreach Processor: apply a processor to each element of an array
- Grok Processor: extract structured fields from log lines
- Gsub/Join/Split: string replacement, array-to-string join, string-to-array split
- Lowercase/Uppercase: case conversion
  • How to use Ingest Pipeline
# Use the _ingest/pipeline/_simulate endpoint to verify that processors behave as expected
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [        # multiple processors can be defined
      {
        "split": {         # use the split processor
          "field": "tags",       # the field to preprocess
          "separator": ","       # split on commas
        }
      }
    ]
  },
  "docs": [                # the documents to process
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You konw, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You konw, for cloud"
      }
    }
  ]
}

# Create a new pipeline named blog_pipeline
PUT _ingest/pipeline/blog_pipeline
{
  "description": "a blog pipeline",
  "processors": [          # a pipeline can contain multiple processors
    {
      "split": {
        "field": "tags",
        "separator": ","
      }
    },
    {
      "set": {
        "field": "views",
        "value": 0
      }
    }
  ]
}

# Test whether the pipeline works as expected
POST _ingest/pipeline/blog_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You konw, for cloud"
      }
    }
  ]
}

# Index a document through the pipeline
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
  "title": "Introducing cloud computering",
  "tags": "openstack,k8s",
  "content": "You konw, for cloud"
}

# Use blog_pipeline with _update_by_query, updating only documents that do not yet have a views field
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "views"
        }
      }
    }
  }
}
  • Ingest Node VS Logstash
- Data input and output: Logstash supports reading from and writing to many different data sources; an Ingest Node can only receive data through the ES REST API and write back to ES
- Data buffering: Logstash implements a simple data queue and supports retry; an Ingest Node provides no buffering
- Data processing: Logstash offers a large number of plugins and supports custom development; an Ingest Node supports the built-in processors plus custom development
- Configuration and use: Logstash must be deployed independently, which adds some architectural complexity; an Ingest Node needs no additional deployment

Painless Script

  • Painless is a scripting language designed specifically for Elasticsearch and is the default scripting language for Elasticsearch
  • Painless can be used directly as an inline script in Elasticsearch or stored for subsequent use by multiple queries
  • Painless Script is several times faster than other scripts in terms of performance
  • It extends Java syntax, supporting all Java data types and a subset of the Java APIs
  • Painless scripts are secure and support both explicitly declared and dynamically defined types
  • Painless is mainly used for the following purposes:
- Updating, deleting, and aggregating documents
- Calculating returned fields (see the script_fields sketch below)
- Adjusting document relevance scores
- Executing scripts in an Ingest Pipeline
- Processing data in the Reindex API and Update By Query
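A minimal sketch of the "calculating returned fields" use case via script_fields, assuming a tech_blogs index with a numeric views field (the index and field names are illustrative):

# script_fields computes a derived value per hit using an inline Painless script
GET tech_blogs/_search
{
  "script_fields": {
    "doubled_views": {
      "script": {
        "lang": "painless",
        "source": "doc['views'].value * 2"
      }
    }
  }
}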
  • Stored scripts:
# Save the script in the cluster state under the ID update_views
POST _scripts/update_views
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.views += params.new_views"
  }
}
  • Script caching: Scripts are expensive to compile, so ES caches the compiled scripts in the Cache
- Both inline scripts and stored scripts are cached
- 100 scripts are cached by default
- script.cache.max_size sets the maximum number of cached scripts
- script.cache.expire sets the cache expiry time
- script.max_compilations_rate limits compilation frequency, 75 compilations per 5 minutes by default

For further information about Painless scripts, please refer to the official documentation.

store field VS _source

Data stored in ES serves two main purposes: "search" and "retrieve":

  • Search: full-text search; we do not know the specific document ID and query the inverted index by keywords
  • Retrieve: Retrieve stored raw data based on ID

Of the two, "search" is served by the inverted index, which enables full-text retrieval, while "retrieve" must go through store fields or _source.

What is the _source
  • When we index the document, ES also stores the original JSON data of the document in the _source field
  • The _source field itself is not indexed and therefore cannot be searched, mainly to return the raw JSON data when searching other fields
  • If you do not want to store the _source field, set _source.enabled = false (a minimal sketch follows the list below); the following features are then lost:
- The update, update_by_query, and reindex APIs
- Highlighting
- Retrieving the original JSON of a document at search time
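A minimal sketch of disabling _source at index-creation time (the index name is illustrative), with the trade-offs listed above:

# Disable _source storage for the whole index; the raw JSON is no longer kept
PUT my_logs
{
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}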
  • If you want to store only some of the fields in the original JSON, you can include or exclude
PUT logs
{
  "mappings": {
    "_source": {
      "includes": [
        "*.count",
        "meta.*"
      ],
      "excludes": [
        "meta.description",
        "meta.other.*"
      ]
    }
  }
}
  • When searching and querying, if you only want to retrieve some of the original fields, you can use the _source field
# The whole _source is parsed first; the requested fields are then extracted and returned
GET /_search
{
  "_source": [ "obj1.*", "obj2.*" ],
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
What is a field store
  • You can store raw data separately for a field by setting the Store property for that field
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "store": true
      },
      "date": {
        "type": "date",
        "store": true
      },
      "content": {
        "type": "text"
      }
    }
  }
}
  • The Store property is disabled by default
  • Obtain the required raw data through stored_fields during query
GET my_index/_search
{
  "stored_fields": [ "title", "date" ]
}
How to store raw data correctly
  • For a particularly large field that is only used at retrieve time, consider excluding it from _source to save disk space and avoid the cost of parsing it out of the JSON on every retrieve
  • For a particularly large field that is retrieved frequently, set its store property to true so it can be read on its own without parsing the other fields
  • Within one index, avoid setting the store property on many fields: each stored field costs a separate IO to read, whereas _source is fetched in a single IO
  • In most cases, setting the store property is not recommended: _source already covers most requirements with good performance
  • Disabling _source loses a lot of functionality, so choose carefully

How to better model

Data modeling is a tool for abstractly describing the real world. This section collects some suggestions for modeling in Elasticsearch.

Field type selection
  • Text: for full-text fields. The content is analyzed into terms; use it where tokenized search is needed. Not recommended for aggregation or sorting
  • Keyword: for fields that should not be analyzed, such as IDs and enums. Suited to exact matching, and supports sorting and aggregation by default
  • Multi-fields: if a field needs both tokenized search and exact matching, add a keyword subfield to the text type
PUT /employees/
{
  "mappings" : {
    "properties" : {
      "age" : {
        "type" : "integer"
      },
      "job" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",   # a keyword subfield on the text field
            "ignore_above": 50
          }
        }
      }
    }
  }
}

# The text field can be aggregated through the keyword subfield
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  }
}

# The text field can be exact-matched through the keyword subfield
POST /employees/_search
{
  "query": {
    "term": {
      "job.keyword": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}
  • Numeric types: a type larger than necessary can hurt performance, while one too small may overflow as the business grows; pick the smallest type that comfortably fits the data
  • Dates/Booleans: Dates and booleans generally do not require much consideration and are easily selected
Field property Settings
  • Whether indexing, sorting, or aggregation is needed at all: if a field only needs to be stored, set enabled = false
  • Whether sorting and aggregation are needed: choose between doc_values and fielddata according to the scenario
  • Whether search is needed: the field's index property controls whether it is indexed; with indexing disabled the field cannot be searched, but it can still be aggregated and sorted, and its value remains in _source
  • eager_global_ordinals: for keyword fields that are frequently updated and aggregated, setting eager_global_ordinals = true builds the global ordinals mapping eagerly and improves aggregation performance
  • How to store the original data: see "store field VS _source" above
  • What to store in the inverted index: setting index_options appropriately can improve inverted-index performance
  • Whether relevance scoring is needed: norms can be disabled when scoring is not required; when enabled, norms store many scoring factors, which takes considerable space (a combined mapping sketch of these properties follows this list)
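A minimal sketch combining several of these properties in one mapping; the index and field names are hypothetical:

PUT my_products
{
  "mappings": {
    "properties": {
      "internal_payload": {
        "type": "object",
        "enabled": false                # stored in _source only: not indexed, not searchable
      },
      "category": {
        "type": "keyword",
        "eager_global_ordinals": true   # build global ordinals eagerly for frequent aggregations
      },
      "price": {
        "type": "integer",
        "index": false                  # not searchable, but still sortable/aggregatable via doc values
      },
      "description": {
        "type": "text",
        "norms": false,                 # no scoring factors stored; saves space
        "index_options": "freqs"        # keep only doc IDs and term frequencies in the inverted index
      }
    }
  }
}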
Other modeling optimization suggestions
  • Index Alias: decouple the names the application uses from physical index names, so indexes can be swapped without renaming or downtime, enabling seamless Reindex (see the alias sketch at the end of this section)
  • Index Template: standardize the index-creation process by defining index templates
  • Avoid too many fields; the maximum number of fields can be capped with index.mapping.total_fields.limit. Too many fields cause the problems below:
1) The index becomes difficult to maintain
2) Mapping information is stored in the cluster state, so an oversized mapping can affect cluster performance
3) Removing or changing a field requires a Reindex
  • Do not enable the dynamic property in production; define field mappings in advance, because dynamically added fields are hard to maintain
  • Avoid regex, wildcard, and fuzzy-match queries where possible, because their query performance is poor
  • Guard against null values skewing aggregation results: set the field's null_value property, or use the missing parameter in the aggregation query (see the null_value sketch at the end of this section)
  • Add meta information to the Mapping to ease version management, and keep the Mapping file under Git
PUT softwares/
{
  "mappings": {
    "_meta": {
      "software_version_mapping": "1.0"
    }
  }
}
  • Kibana currently does not support the nested and parent/child types well, so weigh this trade-off when modeling associated objects
