7. Data modeling

1. The object and Nested object

In this section we will look at objects and nested objects in ElasticSearch

1.1 Normalized design in relational databases

  • The main goal of normalized design (Normalization) is to "reduce redundant updates"
  • Side effect: a fully normalized database often suffers from "slow queries"
    • The more normalized the database, the more tables you need to Join
  • Normalization saves storage, but storage keeps getting cheaper
  • Normalization simplifies updates, but data "read" operations can be slower

1.2 Denormalized design (Denormalization)

  • Denormalized design
    • Flatten the data: instead of using associations, keep redundant copies of the data in each document
  • Advantages: no Join operations are needed, so data read performance is good
    • ElasticSearch compresses the _source field to reduce the disk space overhead
  • Disadvantages: not suitable for scenarios where data is frequently modified
    • A change to one piece of data (e.g. a user name) can require many document updates

1.3 Handling association relationships in ElasticSearch

  • Relational databases generally Normalize data; in ElasticSearch we often Denormalize data
    • Benefits of denormalization: fast reads / no table joins / no row locks
  • ElasticSearch is not very good at handling relationships. We generally use the following four methods to deal with associations:
    • Object type
    • Nested objects (Nested Object)
    • Parent-child relation (Parent/Child)
    • Application-side association (see the sketch below)
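
For the application-side association, the join happens in application code with two round trips. A minimal sketch, assuming a separate users index exists alongside the blog index from Case 1 below (the index and field names are illustrative):

# 1. Resolve the author in the (assumed) users index
GET users/_search
{
  "query": { "match": { "username": "rickyin" } }
}

# 2. The application reads the returned userid (say it is 1)
#    and issues a second query against the blogs
GET blog/_search
{
  "query": { "term": { "user.userid": 1 } }
}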

1.4 Case 1: Information on blogs and their authors

  • Object type
    • Keep the author's information in each blog document
    • If the author information changes, the relevant blog document needs to be modified
# Set the Mapping for the blog index (the "username" type was truncated in
# the source notes; "keyword" is assumed here)
PUT /blog
{
  "mappings": {
    "properties": {
      "content": { "type": "text" },
      "time": { "type": "date" },
      "user": {
        "properties": {
          "city": { "type": "text" },
          "userid": { "type": "long" },
          "username": { "type": "keyword" }
        }
      }
    }
  }
}

# Index a blog document
PUT blog/_doc/1
{
  "content": "I like ElasticSearch",
  "time": "2021-03-07T00:00:00",
  "user": {
    "userid": 1,
    "username": "rickyin",
    "city": "XiAn"
  }
}

# Query by blog content and author name
GET blog/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "ElasticSearch" } },
        { "match": { "user.username": "rickyin" } }
      ]
    }
  }
}

1.5 Case 2: Document containing an array of objects

PUT my_movies
{
  "mappings": {
    "properties": {
      "actors": {
        "properties": {
          "first_name": { "type": "keyword" },
          "last_name": { "type": "keyword" }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}

POST my_movies/_doc/1
{
  "title": "Speed",
  "actors": [
    { "first_name": "Keanu", "last_name": "Reeves" },
    { "first_name": "Dennis", "last_name": "Hopper" }
  ]
}

# Query for first_name=Keanu and last_name=Hopper
POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "actors.first_name": "Keanu" } },
        { "match": { "actors.last_name": "Hopper" } }
      ]
    }
  }
}

This query returns the movie Speed, even though no single actor has first_name=Keanu together with last_name=Hopper.

1.5.1 Why do we get unexpected results?

  • When the document is stored, the boundaries of the inner objects are not preserved; the JSON is flattened into key-value pairs (effectively actors.first_name: ["Keanu", "Dennis"] and actors.last_name: ["Reeves", "Hopper"])
  • This leads to unexpected results when searching across multiple fields
  • The Nested Data Type solves this problem

1.5.2 What is the Nested Data Type

ElasticSearch also has a data type called nested, in which objects in an array of objects are indexed independently

  • The Nested data type allows each object in an array of objects to be indexed independently
  • Using the nested and properties keywords, each object in actors is indexed as a separate hidden document
  • Internally, nested documents are saved as separate Lucene documents and joined at query time
PUT my_movies
{
  "mappings": {
    "properties": {
      "actors": {
        "type": "nested",
        "properties": {
          "first_name": { "type": "keyword" },
          "last_name": { "type": "keyword" }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}

POST my_movies/_doc/1
{
  "title": "Speed",
  "actors": [
    { "first_name": "Keanu", "last_name": "Reeves" },
    { "first_name": "Dennis", "last_name": "Hopper" }
  ]
}

# Nested query on the nested object: Keanu + Hopper no longer matches
POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Speed" } },
        {
          "nested": {
            "path": "actors",
            "query": {
              "bool": {
                "must": [
                  { "match": { "actors.first_name": "Keanu" } },
                  { "match": { "actors.last_name": "Hopper" } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

# Nested query for Keanu + Reeves, which does match
POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Speed" } },
        {
          "nested": {
            "path": "actors",
            "query": {
              "bool": {
                "must": [
                  { "match": { "actors.first_name": "Keanu" } },
                  { "match": { "actors.last_name": "Reeves" } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

1.5.3 Nested Query

  • Internally, nested documents are saved as separate Lucene documents and joined at query time

1.5.4 How do I Perform Aggregation Analysis on Nested Objects

A normal aggregation directly on a nested field does not work; the dedicated nested aggregation is required, as sketched below.
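
A minimal sketch of aggregating on the nested actors field of the my_movies index above (the aggregation names are illustrative): the nested aggregation scopes the bucketing to the nested documents.

POST my_movies/_search
{
  "size": 0,
  "aggs": {
    "actors": {
      "nested": { "path": "actors" },
      "aggs": {
        "actor_name": {
          "terms": { "field": "actors.first_name", "size": 10 }
        }
      }
    }
  }
}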

2. Parent-child relationship of documents

2.1 Parent/Child

  • Limitations of objects and Nested objects
    • With each update, the entire object (both root and nested) needs to be reindexed
  • ES provides an implementation similar to the Join in a relational database. Using the Join data type, two objects can be kept separate while maintaining a Parent/Child relationship
    • A parent document and a child document are two independent documents
    • Updating the parent document does not require reindexing the child documents. Child documents can be added, updated, or deleted without affecting the parent or other child documents

2.2 Defining the parent-child relationship

  • Steps to define a parent-child relationship
    • Set the index Mapping
    • Index the parent documents
    • Index the child documents
    • Query the documents as needed

2.2.1 Setting the Mapping

# Set the Parent/Child Mapping
PUT my_blogs
{
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "properties": {
      "blog_comments_relation": {
        "type": "join",
        "relations": { "blog": "comment" }
      },
      "content": { "type": "text" },
      "title": { "type": "keyword" }
    }
  }
}

2.2.2 Index the parent document

PUT my_blogs/_doc/blog1
{
  "title": "Learning ElasticSearch",
  "content": "learing ELK @ rickyin",
  "blog_comments_relation": { "name": "blog" }
}

PUT my_blogs/_doc/blog2
{
  "title": "Learning Hadoop",
  "content": "learing Hadoop",
  "blog_comments_relation": { "name": "blog" }
}

2.2.3 Indexing the child documents

  • Parent and child documents must live on the same shard
    • This ensures the performance of join queries
  • When indexing a child document, you must specify the Id of its parent document
    • Use the routing parameter to ensure that the child is routed to the same shard as its parent
# Note: the join field here is spelled "blog_comments_realtion", which does
# not match "blog_comments_relation" in the Mapping above — see 2.3.2
PUT my_blogs/_doc/comment1?routing=blog1
{
  "comment": "I am learing ELK",
  "username": "Rick",
  "blog_comments_realtion": {
    "name": "comment",
    "parent": "blog1"
  }
}

PUT my_blogs/_doc/comment2?routing=blog2
{
  "comment": "I like Hadoop!!!",
  "username": "Bob",
  "blog_comments_realtion": {
    "name": "comment",
    "parent": "blog2"
  }
}

2.3 Querying parent and child Documents

2.3.1 Querying All Documents

  • The query
GET my_blogs/_search
{
  "query": { "match_all": {} }
}
  • The query results
{ "took": 4, "timed_out": false, "_shards": { "total": 2, "successful": 2, "skipped": 0, "failed": 0 }, "hits": {" total ": {" value" : 4, "base" : "eq"}, "max_score" : 1.0, "hits" : [{" _index ":" my_blogs ", "_type" : "_doc", "_id" : "blog1", "_score" : 1.0, "_source" : {" title ":" Learning ElasticSearch ", "content" : "learing ELK @ rickyin", "blog_comments_relation": { "name": "blog" } } }, { "_index": "my_blogs", "_type": "_doc", "_id" : "comment1", "_score" : 1.0, "_routing" : "blog1", "_source" : {" comment ":" I am learing ELK ", "username" : "Rick", "blog_comments_realtion": { "name": "comment", "parent": "blog1" } } }, { "_index": "my_blogs", "_type": "_doc", "_id" : "blog2", "_score" : 1.0, "_source" : {" title ":" Learning Hadoop ", "content" : "learing Hadoop", "blog_comments_relation": { "name": "blog" } } }, { "_index": "my_blogs", "_type": "_doc", "_id": "Comment2", "_score" : 1.0, "_routing" : "blog2", "_source" : {" comment ":" I like Hadoop!!! ", "username" : "Bob", "blog_comments_realtion": { "name": "comment", "parent": "blog2" } } } ] } }Copy the code

We can determine whether the current document is a parent or child document by the value of _source in the result of the query

2.3.2 View information based on the parent document Id

  • The query
GET my_blogs/_doc/blog2
  • The query results
{
    "_index": "my_blogs",
    "_type": "_doc",
    "_id": "blog2",
    "_version": 1,
    "_seq_no": 2,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "title": "Learning Hadoop",
        "content": "learing Hadoop",
        "blog_comments_relation": {
            "name": "blog"
        }
    }
}

Notice that fetching a document directly by Id does not return any information about its child documents. How, then, can we query for a parent's child documents by Id?

  • Query by Parent Id. (The notes report that this query returned nothing; the likely cause is that the child documents above were written with the misspelled field blog_comments_realtion, which does not match the join field blog_comments_relation defined in the Mapping, so the join relation was never established. A corrected sketch follows the query.)
# Query child documents by parent id
POST my_blogs/_search
{
  "query": {
    "parent_id": {
      "type": "comment",
      "id": "blog1"
    }
  }
}
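
Under that assumption, re-indexing the child documents with the join field spelled exactly as in the Mapping should make the parent_id query return the comments; a sketch for comment1:

PUT my_blogs/_doc/comment1?routing=blog1
{
  "comment": "I am learing ELK",
  "username": "Rick",
  "blog_comments_relation": {
    "name": "comment",
    "parent": "blog1"
  }
}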

2.3.3 Querying parent and child documents (Has Child / Has Parent)

# Has Child query: return parent documents that have matching children
POST my_blogs/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": { "match": { "username": "Rick" } }
    }
  }
}

# Has Parent query: return child documents whose parent matches
POST my_blogs/_search
{
  "query": {
    "has_parent": {
      "parent_type": "blog",
      "query": { "match": { "title": "Learing Hadoop" } }
    }
  }
}

2.3.4 Accessing child documents by Id

  • The query
GET my_blogs/_doc/comment2
  • The query results
{ "_index": "my_blogs", "_type": "_doc", "_id": "comment2", "_version": 2, "_seq_no": 4, "_primary_term": 1, "_routing": "blog2", "found": true, "_source": { "comment": "I like Hadoop!!!" , "username": "Bob", "blog_comments_realtion": { "name": "comment", "parent": "blog2" } } }Copy the code

2.3.5 Accessing child documents by Id and routing

  • The query
GET my_blogs/_doc/comment3?routing=blog2
  • The query results
{ "_index": "my_blogs", "_type": "_doc", "_id": "comment3", "_version": 1, "_seq_no": 5, "_primary_term": 1, "_routing": "blog2", "found": true, "_source": { "comment": "I like Hadoop!!!" , "username": "Bob2", "blog_comments_realtion": { "name": "comment", "parent": "blog2" } } }Copy the code

2.3.6 Updating child documents

  • Update statement
PUT my_blogs/_doc/comment3?routing=blog2
{
  "comment": "Hello Hadoop??",
  "blog_comments_realtion": {
    "name": "comment",
    "parent": "blog2"
  }
}
  • After the update
{ "_index": "my_blogs", "_type": "_doc", "_id": "comment3", "_version": 2, "_seq_no": 6, "_primary_term": 1, "_routing": "blog2", "found": true, "_source": { "comment": "Hello Hadoop??" , "blog_comments_realtion": { "name": "comment", "parent": "blog2" } } }Copy the code

2.4 Nested objects VS Parent/Child documents

  • Advantages
    • Nested Object: documents are stored together, so read performance is high
    • Parent/Child: parent and child documents can be updated independently
  • Disadvantages
    • Nested Object: updating a nested child document requires reindexing the entire document
    • Parent/Child: extra memory is needed to maintain the relation, and read performance is relatively poor
  • Applicable scenarios
    • Nested Object: child documents are updated occasionally and mostly queried
    • Parent/Child: child documents are updated frequently

3. Update By Query & Reindex API

In ES we sometimes need to rebuild an index. Under what circumstances is an index rebuilt?

3.1 Application Scenarios

  • In general, we need to rebuild an index in the following situations
    • The index's Mappings change: a field type changes, or the analyzer or its dictionary is updated
    • The index's Settings change: the number of primary shards changes
    • Data needs to be migrated within a cluster or between clusters
  • ElasticSearch provides two built-in APIs
    • Update By Query: rebuilds documents in place on the existing index
    • Reindex: rebuilds the data into another index

3.2 Case 1: Adding a subfield to an index

When indexing blog documents, we sometimes need to add a subfield to the content field and give it an English analyzer in order to improve recall

  • Change the Mapping: add a subfield that uses the English analyzer
  • Then try to query on the subfield
  • Although data already exists, no results are returned (newly written data is found, but the existing data is not; this is when we need to rebuild the index)

3.2.1 Writing data

PUT blogs/_doc/1
{
  "content": "Hadoop is cool",
  "keyword": "hadoop"
}

3.2.2 Viewing the Result

  • The statement
GET blogs/_search
{
  "query": {
    "match_all": {}
  }
}
  • The results
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "blogs",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "content" : "Hadoop is cool",
          "keyword" : "hadoop"
        }
      }
    ]
  }
}

3.2.3 Viewing the Mapping

  • The statement
GET blogs/_mapping
  • The results
{
    "blogs": {
        "mappings": {
            "properties": {
                "content": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "keyword": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        }
    }
}

3.2.4 Modifying the Mapping: add a subfield that uses the English analyzer

PUT blogs/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "fields": {
        "english": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}

3.2.5 Writing a New Document

PUT blogs/_doc/2
{
  "content": "Elasticsearch rocks",
  "keyword": "elasticsearch"
}

3.2.6 Querying the newly written document

  • The query
POST blogs/_search
{
  "query": {
    "match": { "content.english": "Elasticsearch" }
  }
}
  • The query results
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 1, the "base" : "eq"}, "max_score" : 0.2876821, "hits" : [{" _index ":" blogs, "" _type" : "_doc", "_id" : "2", "_score" : 0.2876821, "_source" : {"content" : "Elasticsearch rocks", "keyword" : "elasticsearch" } } ] } }Copy the code

3.2.7 Querying a document written before the Mapping change

  • The query
POST blogs/_search
{
  "query": {
    "match": { "content.english": "Hadoop" }
  }
}
  • The query results
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}


Here we find that documents written before the Mapping update cannot be found through the newly added subfield. Why? Because the index has not been rebuilt: once an inverted index is created, it cannot be changed. To make the old documents searchable on the new subfield, we need to rebuild the index

3.2.8 Updating all documents (_update_by_query)

POST blogs/_update_by_query
{}

_update_by_query: does a reindex operation on the original index

3.2.9 Querying the previously written document again

  • The query
POST blogs/_search
{
  "query": {
    "match": { "content.english": "Hadoop" }
  }
}
  • The query results
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 1, the "base" : "eq"}, "max_score" : 0.9808292, "hits" : [{" _index ":" blogs, "" _type" : "_doc", "_id" : "1", "_score" : 0.9808292, "_source" : {" content ":" Hadoop is cool, "" keyword" : "Hadoop"}}}}]Copy the code

3.3 Case 2: Changing the type of an existing field in the Mappings

  • ES does not allow modifying the type of a field in an existing Mapping
  • You can only create a new index, set the correct field types, and re-import the data

3.3.1 Viewing the original index's Mapping and trying to modify it

  • View the original index's Mapping
GET blogs/_mapping
  • The original Mapping definition
{
  "blogs" : {
    "mappings" : {
      "properties" : {
        "content" : {
          "type" : "text",
          "fields" : {
            "english" : {
              "type" : "text",
              "analyzer" : "english"
            },
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "keyword" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

  • Try to modify the Mapping
PUT blogs/_mapping
{
  "properties": {
    "content": "text",
    "field":{
      "english": {
        "type": "text",
        "analyzer": "english"
      }
    }
  },
  "keyword": {
    "type": "keyword"
  }
}
  • The result of the attempted modification
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "No type specified for field [field]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "No type specified for field [field]"
  },
  "status": 400
}

3.3.2 Creating an Index

# Create a new index with the correct Mapping
PUT blogs_fix
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "english"
          }
        }
      },
      "keyword": {
        "type": "keyword"
      }
    }
  }
}

3.3.3 Using the ReIndex API

# ReIndex API
POST _reindex
{
    "source": {
        "index": "blogs"
    },
    "dest": {
        "index": "blogs_fix"
    }
}

Use the ReIndex API to write data from the source to the new index in dest

3.3.4 Performing operations on the new index

  • The query
GET blogs_fix/_doc/1
  • The query results
{
  "_index" : "blogs_fix",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "content" : "Hadoop is cool",
    "keyword" : "hadoop"
  }
}
  • Test a Terms Aggregation (we changed the keyword field's type from Text to Keyword; only a Keyword field can be used directly in a Terms Aggregation)
POST blogs_fix/_search
{
  "size": 0,
  "aggs": {
    "blog_keyword": {
      "terms": {
        "field": "keyword",
        "size": 10
      }
    }
  }
}

  • The test results
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "blog_keyword" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "elasticsearch",
          "doc_count" : 1
        },
        {
          "key" : "hadoop",
          "doc_count" : 1
        }
      ]
    }
  }
}


3.4 ReIndex API

  • The ReIndex API supports copying documents from one index to another
  • Some scenarios for using the ReIndex API
    • Changing the number of primary shards of an index
    • Changing the type of a field in the Mapping
    • Migrating data within a cluster or across clusters

3.4.1 Precautions for using the ReIndex API

  1. The source index must have the _source field enabled (it is enabled by default)
  2. Before using the ReIndex API, create the destination index in advance and set its Mapping

3.4.2 OP Type

  1. When we run the ReIndex API, what should we do if the target index already contains data?
  2. We can specify "op_type": "create", so a document is written only if it does not yet exist in the target index (see the sketch after this list)
  3. _reindex will then only create documents that do not exist
  4. Documents that already exist cause version conflicts
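
A sketch of op_type on the reindex from Case 2 above (blogs to blogs_fix):

POST _reindex
{
  "source": { "index": "blogs" },
  "dest": {
    "index": "blogs_fix",
    "op_type": "create"
  }
}

# Documents whose _id already exists in blogs_fix are reported as version
# conflicts; adding "conflicts": "proceed" at the top level makes the
# reindex continue past them instead of aborting.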

3.4.3 Cross-cluster ReIndex
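
A minimal sketch of a cross-cluster ReIndex (the host and credentials are placeholders): the remote cluster goes in source.remote, and the remote host must first be whitelisted on the cluster running the reindex.

# In elasticsearch.yml on the cluster executing the reindex:
# reindex.remote.whitelist: "otherhost:9200"

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "blogs"
  },
  "dest": {
    "index": "blogs"
  }
}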

3.4.4 Viewing the Task API
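
A sketch of running a reindex asynchronously and inspecting it with the Task API (the task id shown is illustrative):

# wait_for_completion=false returns a task id instead of blocking
POST _reindex?wait_for_completion=false
{
  "source": { "index": "blogs" },
  "dest": { "index": "blogs_fix" }
}

# List running reindex tasks
GET _tasks?detailed=true&actions=*reindex

# Inspect a single task with the id returned by the first request
GET _tasks/r1A2WoRbTwKZ516z6NEs5A:36619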

4. Ingest Pipeline & Painless Script

4.1 Ingest Pipeline

Let’s look at one such requirement

4.1.1 Requirements: Repair and enhance written data

  • We need to run Aggregation statistics on the values of the Tags field; the current string format cannot meet that requirement
  • When writing the Tags field, the comma-separated text should become an array, not a string

4.1.2 Ingest Node

  • A new node type introduced after ElasticSearch 5.0. By default, every node is an Ingest Node
    • It can preprocess data by intercepting Index or Bulk API requests
    • It transforms the data and then passes it back to the Index or Bulk API
  • Data can be preprocessed without Logstash, for example:
    • Setting a default value for a field; renaming a field; performing a Split operation on a field value
    • Painless scripts can be configured for more complex data processing

4.1.3 Pipeline & Processor

In Ingest Node, we can define a Pipeline

  • Pipeline: a pipeline that processes the data (documents) passing through it
  • Processor: ElasticSearch's abstraction over a single processing step
    • ElasticSearch has many built-in Processors, and custom Processors can be implemented via plugins

4.1.4 Using a Pipeline to split a string

4.1.4.1 Inserting Data
PUT tech_blogs/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "content": "You know,for big data"
}
4.1.4.2 Splitting the Tags field
  • The split statement
# Test splitting tags
POST _ingest/pipeline/_simulate
{
    "pipeline": {
        "description": "to split blog tags",
        "processors": [
            {
                "split": {
                    "field": "tags",
                    "separator": ","
                }
            }
        ]
    },
    "docs": [
        {
            "_index": "index",
            "_id": "id",
            "_source": {
                "title": "Introducing big data......",
                "tags": "hadoop,elasticsearch,spark",
                "content": "You know,for big data"
            }
        },
        {
            "_index": "index",
            "_id": "idxx",
            "_source": {
                "title": "Introducing cloud computering",
                "tags": "openstack,k8s",
                "content": "You know,for cloud"
            }
        }
    ]
}
  • The results
{
    "docs": [
        {
            "doc": {
                "_index": "index",
                "_type": "_doc",
                "_id": "id",
                "_source": {
                    "title": "Introducing big data......",
                    "content": "You know,for big data",
                    "tags": [
                        "hadoop",
                        "elasticsearch",
                        "spark"
                    ]
                },
                "_ingest": {
                    "timestamp": "2021-03-21T09:12:11.233145Z"
                }
            }
        },
        {
            "doc": {
                "_index": "index",
                "_type": "_doc",
                "_id": "idxx",
                "_source": {
                    "title": "Introducing cloud computering",
                    "content": "You know,for cloud",
                    "tags": [
                        "openstack",
                        "k8s"
                    ]
                },
                "_ingest": {
                    "timestamp": "2021-03-21T09:12:11.233155Z"
                }
            }
        }
    ]
}

4.1.4.3 A Pipeline can have more than one Processor; here we also add a field to each document
  • The statement
POST _ingest/pipeline/_simulate
{
    "pipeline": {
        "description": "to split blog tags",
        "processors": [
            {
                "split": {
                    "field": "tags",
                    "separator": ","
                }
            },
            {
                "set": {
                    "field": "views",
                    "value": 0
                }
            }
        ]
    },
    "docs": [
        {
            "_index": "index",
            "_id": "id",
            "_source": {
                "title": "Introducing big data......",
                "tags": "hadoop,elasticsearch,spark",
                "content": "You know,for big data"
            }
        },
        {
            "_index": "index",
            "_id": "idxx",
            "_source": {
                "title": "Introducing cloud computering",
                "tags": "openstack,k8s",
                "content": "You know,for cloud"
            }
        }
    ]
}
  • The results
{
    "docs": [
        {
            "doc": {
                "_index": "index",
                "_type": "_doc",
                "_id": "id",
                "_source": {
                    "title": "Introducing big data......",
                    "content": "You know,for big data",
                    "views": 0,
                    "tags": [
                        "hadoop",
                        "elasticsearch",
                        "spark"
                    ]
                },
                "_ingest": {
                    "timestamp": "2021-03-21T09:25:23.439640Z"
                }
            }
        },
        {
            "doc": {
                "_index": "index",
                "_type": "_doc",
                "_id": "idxx",
                "_source": {
                    "title": "Introducing cloud computering",
                    "content": "You know,for cloud",
                    "views": 0,
                    "tags": [
                        "openstack",
                        "k8s"
                    ]
                },
                "_ingest": {
                    "timestamp": "2021-03-21T09:25:23.439645Z"
                }
            }
        }
    ]
}

4.1.5 Adding a Pipeline to ES

So far we have only simulated our Pipeline. Once testing shows that the Pipeline meets our needs, we register it in ES

  • Add the statement
# Add the Pipeline to ES
PUT _ingest/pipeline/blog_pipeline
{
  "description": "a blog pipeline",
  "processors": [
    {
      "split": {
        "field": "tags",
        "separator": ","
      }
    },
    {
      "set": {
        "field": "views",
        "value": 0
      }
    }
  ]
}
  • The statement to view the Pipeline
# View the Pipeline
GET _ingest/pipeline/blog_pipeline

4.1.6 Testing the Pipeline

# Simulate with the registered pipeline
POST _ingest/pipeline/blog_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You know,for cloud"
      }
    }
  ]
}

Once the Pipeline has been added to ES, we only need to provide the docs when testing

4.1.7 Using the Pipeline to update data

4.1.7.1 Preparing Data
PUT tech_blogs/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "content": "You know,for big data"
}

# Use the Pipeline to process the data
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
  "title": "Introducing cloud computering",
  "tags": "openstack,k8s",
  "content": "You know,for cloud"
}
4.1.7.2 Viewing the two documents
  • The statement
POST tech_blogs/_search
{}
  • The results
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "tech_blogs",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "title": "Introducing big data......",
          "tags": "hadoop,elasticsearch,spark",
          "content": "You know,for big data"
        }
      },
      {
        "_index": "tech_blogs",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.0,
        "_source": {
          "title": "Introducing cloud computering",
          "content": "You know,for cloud",
          "views": 0,
          "tags": [
            "openstack",
            "k8s"
          ]
        }
      }
    ]
  }
}

4.1.7.3 update_by_query can lead to errors
  • The statement
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{}
  • The results
{
  "took": 58,
  "timed_out": false,
  "total": 2,
  "updated": 1,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": "tech_blogs",
      "type": "_doc",
      "id": "2",
      "cause": {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [tags] of type [java.util.ArrayList] cannot be cast to [java.lang.String]",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "java.lang.IllegalArgumentException: field [tags] of type [java.util.ArrayList] cannot be cast to [java.lang.String]",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "field [tags] of type [java.util.ArrayList] cannot be cast to [java.lang.String]"
          }
        },
        "header": {
          "processor_type": "split"
        }
      },
      "status": 500
    }
  ]
}

The exception says an array cannot be cast to a string. Our index holds two documents: one was never processed by the Pipeline, while the other already was. For the processed document, the tags field is already an array, so running the split processor on it again raises the exception

4.1.7.4 Adding a condition to update_by_query
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
  "query": {
    "bool": {
      "must_not": {
        "exists": { "field": "views" }
      }
    }
  }
}

Pipeline processing is applied only to documents without a views field, because the views field is exactly what Pipeline processing adds

4.1.8 Some built-in Processors

  • Split Processor (example: split a given field value into an array)
  • Remove / Rename Processor (example: remove or rename a field)
  • Append (example: add a new tag to a product)
  • Convert (example: convert a product price from a string to a float)
  • Date / JSON (example: date format conversion, parsing a string into a JSON object)
  • Date Index Name Processor (example: route documents passing through the processor to an index named by a given time format)
  • Fail Processor (if an exception occurs, the Pipeline can return a specified error message to the user)
  • Foreach Processor (for array fields, apply the same processor to each element)
  • Grok Processor (cut log lines into structured fields by pattern)
  • Gsub / Join / Split (string substitution / array to string / string to array)
  • Lowercase / Uppercase (case conversion; a combined sketch follows this list)
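
A combined _simulate sketch using a few of these built-ins (the field names and values are illustrative):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "demo of rename, convert and uppercase",
    "processors": [
      { "rename": { "field": "prc", "target_field": "price" } },
      { "convert": { "field": "price", "type": "float" } },
      { "uppercase": { "field": "code" } }
    ]
  },
  "docs": [
    {
      "_source": { "prc": "3.99", "code": "abc" }
    }
  ]
}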

4.1.9 Ingest Node VS Logstash

4.2 Painless

4.2.1 Introduction to Painless

  • Introduced in ElasticSearch 5.x, designed specifically for ElasticSearch, and extends the syntax of Java
  • Since 6.0, ES only supports Painless; Groovy, JavaScript, and Python are no longer supported
  • Painless supports all Java data types and a subset of the Java API
  • Painless Script has the following features
    • High performance / secure
    • Supports explicitly declared types as well as dynamically defined types

4.2.2 The purposes of Painless

  • Processing document fields
    • Updating or deleting fields, and processing data in aggregations
    • Script Fields: computing fields to return on the fly (see the sketch after this list)
    • Function Score: adjusting the scoring of documents
  • Executing scripts in an Ingest Pipeline
  • Processing data in the Reindex API and Update By Query
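
A minimal Script Field sketch against the tech_blogs index used in the cases below (the returned field name is illustrative):

POST tech_blogs/_search
{
  "script_fields": {
    "views_plus_one": {
      "script": {
        "lang": "painless",
        "source": "doc['views'].value + 1"
      }
    }
  },
  "query": { "match_all": {} }
}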

4.2.3 Accessing fields from a Painless script

Painless scripts have different syntax for accessing fields in different contexts

  • Ingestion: ctx.field_name
  • Update: ctx._source.field_name
  • Search & Aggregation: doc["field_name"]

4.2.4 Painless Script Case 1

  • The statement
# Add a Script Processor
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "script": {
          "source": """
            if (ctx.containsKey("content")) {
              ctx.content_length = ctx.content.length();
            } else {
              ctx.content_length = 0;
            }
          """
        }
      },
      {
        "set": {
          "field": "views",
          "value": 0
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You know,for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You know,for cloud"
      }
    }
  ]
}
  • The results
{
  "docs" : [
    {
      "doc" : {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "id",
        "_source" : {
          "title" : "Introducing big data......",
          "content" : "You know,for big data",
          "content_length" : 21,
          "views" : 0,
          "tags" : [
            "hadoop",
            "elasticsearch",
            "spark"
          ]
        },
        "_ingest" : {
          "timestamp" : "2021-03-21T11:16:54.233967Z"
        }
      }
    },
    {
      "doc" : {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "idxx",
        "_source" : {
          "title" : "Introducing cloud computering",
          "content" : "You know,for cloud",
          "content_length" : 18,
          "views" : 0,
          "tags" : [
            "openstack",
            "k8s"
          ]
        },
        "_ingest" : {
          "timestamp" : "2021-03-21T11:16:54.233974Z"
        }
      }
    }
  ]
}

4.2.5 Painless Script Case 2

  • Data preparation
DELETE tech_blogs

PUT tech_blogs/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "content": "You know,for big data",
  "views": 0
}
  • Execute the script when updating
POST tech_blogs/_update/1
{
  "script": {
    "source": "ctx._source.views += params.new_views",
    "params": {
      "new_views": 100
    }
  }
}
  • Viewing views count
POST tech_blogs/_search
{}
  • The results
{ "took" : 0, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 1, the "base" : "eq"}, "max_score" : 1.0, "hits" : [{" _index ":" tech_blogs ", "_type" : "_doc", "_id" : "1", "_score" : 1.0, "_source" : {" title ":" Introducing big data... ", "tags" : "hadoop,elasticsearch,spark", "content" : "You know,for big data", "views" : 100 } } ] } }Copy the code
  • Save the script in Cluster State
# Save the script in Cluster State
POST _scripts/update_views
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.views += params.new_views"
  }
}
  • Execute the stored script by specifying its id in an update
POST tech_blogs/_update/1
{
    "script": {
        "id": "update_views",
        "params": {
            "new_views": 1000
        }
    }
}

4.2.6 Script Caching

  • Compiling a script is relatively expensive (see the sketch after this list)
  • ElasticSearch caches compiled scripts in a Cache
    • Both Inline scripts and Stored Scripts are cached
    • By default, 100 scripts are cached
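
Because of this cache, pass changing values through params instead of hard-coding them into the script source: the source then stays identical across calls and is compiled only once. A sketch reusing the update script from Case 2:

# Compiled once; reusable for any value of new_views
POST tech_blogs/_update/1
{
  "script": {
    "source": "ctx._source.views += params.new_views",
    "params": { "new_views": 10 }
  }
}

# Hard-coding the value ("ctx._source.views += 10", "... += 20", ...)
# would make each variant a new script that must be compiled and cached
# separately.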