7. Data modeling

1. The object and Nested object

In this section we will look at objects and nested objects in ElasticSearch

1.1 Normalized design in relational databases

  • The main goal of normalized design (Normalization) is to "reduce redundant updates"
  • Side effect: a fully normalized database often suffers from "slow queries"
    • The more normalized the database, the more tables you need to Join
  • Normalization saves storage, but storage keeps getting cheaper
  • Normalization simplifies updates, but data "read" operations can be slower

1.2 Denormalized design (Denormalization)

  • Denormalized design
    • Flatten the data: instead of using associations, keep redundant copies of the data in each document
  • Advantages: no Join operations are needed, so data read performance is good
    • ElasticSearch compresses the _source field to reduce the disk space overhead
  • Disadvantages: not suitable for scenarios where data is frequently modified
    • A change to one piece of data (e.g. a user name) can require many document updates

1.3 Handling association relationships in ElasticSearch

  • Relational databases generally Normalize data; in ElasticSearch we often Denormalize data
    • Benefits of denormalization: fast reads / no table joins / no row locks
  • ElasticSearch is not very good at handling relationships. We generally use the following four methods to deal with associations:
    • Object type
    • Nested objects (Nested Object)
    • Parent-child relation (Parent/Child)
    • Application-side association (see the sketch below)
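
For the application-side association, the join happens in application code with two round trips. A minimal sketch, assuming a separate users index exists alongside the blog index from Case 1 below (the index and field names are illustrative):

# 1. Resolve the author in the (assumed) users index
GET users/_search
{
  "query": { "match": { "username": "rickyin" } }
}

# 2. The application reads the returned userid (say it is 1)
#    and issues a second query against the blogs
GET blog/_search
{
  "query": { "term": { "user.userid": 1 } }
}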

1.4 Case 1: Information on blogs and their authors

  • Object type
    • Keep the author's information in each blog document
    • If the author information changes, the relevant blog document needs to be modified
# Set the Mapping for the blog index (the "username" type was truncated in
# the source notes; "keyword" is assumed here)
PUT /blog
{
  "mappings": {
    "properties": {
      "content": { "type": "text" },
      "time": { "type": "date" },
      "user": {
        "properties": {
          "city": { "type": "text" },
          "userid": { "type": "long" },
          "username": { "type": "keyword" }
        }
      }
    }
  }
}

# Index a blog document
PUT blog/_doc/1
{
  "content": "I like ElasticSearch",
  "time": "2021-03-07T00:00:00",
  "user": {
    "userid": 1,
    "username": "rickyin",
    "city": "XiAn"
  }
}

# Query by blog content and author name
GET blog/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "ElasticSearch" } },
        { "match": { "user.username": "rickyin" } }
      ]
    }
  }
}

1.5 Case 2: Document containing an array of objects

PUT my_movies
{
  "mappings": {
    "properties": {
      "actors": {
        "properties": {
          "first_name": { "type": "keyword" },
          "last_name": { "type": "keyword" }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}

POST my_movies/_doc/1
{
  "title": "Speed",
  "actors": [
    { "first_name": "Keanu", "last_name": "Reeves" },
    { "first_name": "Dennis", "last_name": "Hopper" }
  ]
}

# Query for first_name=Keanu and last_name=Hopper
POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "actors.first_name": "Keanu" } },
        { "match": { "actors.last_name": "Hopper" } }
      ]
    }
  }
}

This query returns the movie Speed, even though no single actor has first_name=Keanu together with last_name=Hopper.

1.5.1 Why do we get unexpected results?

  • When the document is stored, the boundaries of the inner objects are not preserved; the JSON is flattened into key-value pairs (effectively actors.first_name: ["Keanu", "Dennis"] and actors.last_name: ["Reeves", "Hopper"])
  • This leads to unexpected results when searching across multiple fields
  • The Nested Data Type solves this problem

1.5.2 What is the Nested Data Type

ElasticSearch also has a data type called nested, in which objects in an array of objects are indexed independently

  • The Nested data type allows each object in an array of objects to be indexed independently
  • Using the nested and properties keywords, each object in actors is indexed as a separate hidden document
  • Internally, nested documents are saved as separate Lucene documents and joined at query time
PUT my_movies
{
  "mappings": {
    "properties": {
      "actors": {
        "type": "nested",
        "properties": {
          "first_name": { "type": "keyword" },
          "last_name": { "type": "keyword" }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}

POST my_movies/_doc/1
{
  "title": "Speed",
  "actors": [
    { "first_name": "Keanu", "last_name": "Reeves" },
    { "first_name": "Dennis", "last_name": "Hopper" }
  ]
}

# Nested query on the nested object: Keanu + Hopper no longer matches
POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Speed" } },
        {
          "nested": {
            "path": "actors",
            "query": {
              "bool": {
                "must": [
                  { "match": { "actors.first_name": "Keanu" } },
                  { "match": { "actors.last_name": "Hopper" } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

# Nested query for Keanu + Reeves, which does match
POST my_movies/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Speed" } },
        {
          "nested": {
            "path": "actors",
            "query": {
              "bool": {
                "must": [
                  { "match": { "actors.first_name": "Keanu" } },
                  { "match": { "actors.last_name": "Reeves" } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

1.5.3 Nested Query

  • Internally, nested documents are saved as separate Lucene documents and joined at query time

1.5.4 How do I Perform Aggregation Analysis on Nested Objects

A normal aggregation directly on a nested field does not work; the dedicated nested aggregation is required, as sketched below.
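
A minimal sketch of aggregating on the nested actors field of the my_movies index above (the aggregation names are illustrative): the nested aggregation scopes the bucketing to the nested documents.

POST my_movies/_search
{
  "size": 0,
  "aggs": {
    "actors": {
      "nested": { "path": "actors" },
      "aggs": {
        "actor_name": {
          "terms": { "field": "actors.first_name", "size": 10 }
        }
      }
    }
  }
}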

2. Parent-child relationship of documents

2.1 Parent/Child

  • Limitations of objects and Nested objects
    • With each update, the entire object (both root and nested) needs to be reindexed
  • ES provides an implementation similar to the Join in a relational database. Using the Join data type, two objects can be kept separate while maintaining a Parent/Child relationship
    • A parent document and a child document are two independent documents
    • Updating the parent document does not require reindexing the child documents. Child documents can be added, updated, or deleted without affecting the parent or other child documents

2.2 Defining the parent-child relationship

  • Steps to define a parent-child relationship
    • Set the index Mapping
    • Index the parent documents
    • Index the child documents
    • Query the documents as needed

2.2.1 Setting the Mapping

# Set the Parent/Child Mapping
PUT my_blogs
{
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "properties": {
      "blog_comments_relation": {
        "type": "join",
        "relations": { "blog": "comment" }
      },
      "content": { "type": "text" },
      "title": { "type": "keyword" }
    }
  }
}

2.2.2 Index the parent document

PUT my_blogs/_doc/blog1
{
  "title": "Learning ElasticSearch",
  "content": "learing ELK @ rickyin",
  "blog_comments_relation": { "name": "blog" }
}

PUT my_blogs/_doc/blog2
{
  "title": "Learning Hadoop",
  "content": "learing Hadoop",
  "blog_comments_relation": { "name": "blog" }
}

2.2.3 Indexing the child documents

  • Parent and child documents must live on the same shard
    • This ensures the performance of join queries
  • When indexing a child document, you must specify the Id of its parent document
    • Use the routing parameter to ensure that the child is routed to the same shard as its parent
# Note: the join field here is spelled "blog_comments_realtion", which does
# not match "blog_comments_relation" in the Mapping above — see 2.3.2
PUT my_blogs/_doc/comment1?routing=blog1
{
  "comment": "I am learing ELK",
  "username": "Rick",
  "blog_comments_realtion": {
    "name": "comment",
    "parent": "blog1"
  }
}

PUT my_blogs/_doc/comment2?routing=blog2
{
  "comment": "I like Hadoop!!!",
  "username": "Bob",
  "blog_comments_realtion": {
    "name": "comment",
    "parent": "blog2"
  }
}

2.3 Querying parent and child Documents

2.3.1 Querying All Documents

  • The query
GET my_blogs/_search
{
  "query": { "match_all": {} }
}
  • The query results
{ "took": 4, "timed_out": false, "_shards": { "total": 2, "successful": 2, "skipped": 0, "failed": 0 }, "hits": {" total ": {" value" : 4, "base" : "eq"}, "max_score" : 1.0, "hits" : [{" _index ":" my_blogs ", "_type" : "_doc", "_id" : "blog1", "_score" : 1.0, "_source" : {" title ":" Learning ElasticSearch ", "content" : "learing ELK @ rickyin", "blog_comments_relation": { "name": "blog" } } }, { "_index": "my_blogs", "_type": "_doc", "_id" : "comment1", "_score" : 1.0, "_routing" : "blog1", "_source" : {" comment ":" I am learing ELK ", "username" : "Rick", "blog_comments_realtion": { "name": "comment", "parent": "blog1" } } }, { "_index": "my_blogs", "_type": "_doc", "_id" : "blog2", "_score" : 1.0, "_source" : {" title ":" Learning Hadoop ", "content" : "learing Hadoop", "blog_comments_relation": { "name": "blog" } } }, { "_index": "my_blogs", "_type": "_doc", "_id": "Comment2", "_score" : 1.0, "_routing" : "blog2", "_source" : {" comment ":" I like Hadoop!!! ", "username" : "Bob", "blog_comments_realtion": { "name": "comment", "parent": "blog2" } } } ] } }Copy the code

We can determine whether the current document is a parent or child document by the value of _source in the result of the query

2.3.2 View information based on the parent document Id

  • The query
GET my_blogs/_doc/blog2
  • The query results
{
    "_index": "my_blogs",
    "_type": "_doc",
    "_id": "blog2",
    "_version": 1,
    "_seq_no": 2,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "title": "Learning Hadoop",
        "content": "learing Hadoop",
        "blog_comments_relation": {
            "name": "blog"
        }
    }
}

Notice that fetching a document directly by Id does not return any information about its child documents. How, then, can we query for a parent's child documents by Id?

  • Query by Parent Id. (The notes report that this query returned nothing; the likely cause is that the child documents above were written with the misspelled field blog_comments_realtion, which does not match the join field blog_comments_relation defined in the Mapping, so the join relation was never established. A corrected sketch follows the query.)
# Query child documents by parent id
POST my_blogs/_search
{
  "query": {
    "parent_id": {
      "type": "comment",
      "id": "blog1"
    }
  }
}
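
Under that assumption, re-indexing the child documents with the join field spelled exactly as in the Mapping should make the parent_id query return the comments; a sketch for comment1:

PUT my_blogs/_doc/comment1?routing=blog1
{
  "comment": "I am learing ELK",
  "username": "Rick",
  "blog_comments_relation": {
    "name": "comment",
    "parent": "blog1"
  }
}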

2.3.3 Querying parent and child documents (Has Child / Has Parent)

# Has Child query: return parent documents that have matching children
POST my_blogs/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": { "match": { "username": "Rick" } }
    }
  }
}

# Has Parent query: return child documents whose parent matches
POST my_blogs/_search
{
  "query": {
    "has_parent": {
      "parent_type": "blog",
      "query": { "match": { "title": "Learing Hadoop" } }
    }
  }
}

2.3.4 Accessing child documents by Id

  • The query
GET my_blogs/_doc/comment2
  • The query results
{ "_index": "my_blogs", "_type": "_doc", "_id": "comment2", "_version": 2, "_seq_no": 4, "_primary_term": 1, "_routing": "blog2", "found": true, "_source": { "comment": "I like Hadoop!!!" , "username": "Bob", "blog_comments_realtion": { "name": "comment", "parent": "blog2" } } }Copy the code

2.3.5 Accessing child documents by Id and routing

  • The query
GET my_blogs/_doc/comment3?routing=blog2
  • The query results
{ "_index": "my_blogs", "_type": "_doc", "_id": "comment3", "_version": 1, "_seq_no": 5, "_primary_term": 1, "_routing": "blog2", "found": true, "_source": { "comment": "I like Hadoop!!!" , "username": "Bob2", "blog_comments_realtion": { "name": "comment", "parent": "blog2" } } }Copy the code

2.3.6 Updating child documents

  • Update statement
PUT my_blogs/_doc/comment3?routing=blog2
{
  "comment": "Hello Hadoop??",
  "blog_comments_realtion": {
    "name": "comment",
    "parent": "blog2"
  }
}
  • After the update
{ "_index": "my_blogs", "_type": "_doc", "_id": "comment3", "_version": 2, "_seq_no": 6, "_primary_term": 1, "_routing": "blog2", "found": true, "_source": { "comment": "Hello Hadoop??" , "blog_comments_realtion": { "name": "comment", "parent": "blog2" } } }Copy the code

2.4 Nested objects VS Parent/Child documents

  • Advantages
    • Nested Object: documents are stored together, so read performance is high
    • Parent/Child: parent and child documents can be updated independently
  • Disadvantages
    • Nested Object: updating a nested child document requires reindexing the entire document
    • Parent/Child: extra memory is needed to maintain the relation, and read performance is relatively poor
  • Applicable scenarios
    • Nested Object: child documents are updated occasionally and mostly queried
    • Parent/Child: child documents are updated frequently

3. Update By Query & Reindex API

In ES we sometimes need to rebuild an index. Under what circumstances is an index rebuilt?

3.1 Application Scenarios

  • In general, we need to rebuild an index in the following situations
    • The index's Mappings change: a field type changes, or the analyzer or its dictionary is updated
    • The index's Settings change: the number of primary shards changes
    • Data needs to be migrated within a cluster or between clusters
  • ElasticSearch provides two built-in APIs
    • Update By Query: rebuilds documents in place on the existing index
    • Reindex: rebuilds the data into another index

3.2 Case 1: Adding a subfield to an index

When indexing blog documents, we sometimes need to add a subfield to the content field and give it an English analyzer in order to improve recall

  • Change the Mapping: add a subfield that uses the English analyzer
  • Then try to query on the subfield
  • Although data already exists, no results are returned (newly written data is found, but the existing data is not; this is when we need to rebuild the index)

3.2.1 Writing data

PUT blogs/_doc/1
{
  "content": "Hadoop is cool",
  "keyword": "hadoop"
}

3.2.2 Viewing the Result

  • The statement
GET blogs/_search
{
  "query": {
    "match_all": {}
  }
}
  • The results
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "blogs",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "content" : "Hadoop is cool",
          "keyword" : "hadoop"
        }
      }
    ]
  }
}

3.2.3 Viewing the Mapping

  • The statement
GET blogs/_mapping
  • The results
{
    "blogs": {
        "mappings": {
            "properties": {
                "content": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                },
                "keyword": {
                    "type": "text",
                    "fields": {
                        "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                        }
                    }
                }
            }
        }
    }
}

3.2.4 Modifying the Mapping: add a subfield that uses the English analyzer

PUT blogs/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "fields": {
        "english": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}

3.2.5 Writing a New Document

PUT blogs/_doc/2
{
  "content": "Elasticsearch rocks",
  "keyword": "elasticsearch"
}

3.2.6 Querying the newly written document

  • The query
POST blogs/_search
{
  "query": {
    "match": { "content.english": "Elasticsearch" }
  }
}
  • The query results
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 1, the "base" : "eq"}, "max_score" : 0.2876821, "hits" : [{" _index ":" blogs, "" _type" : "_doc", "_id" : "2", "_score" : 0.2876821, "_source" : {"content" : "Elasticsearch rocks", "keyword" : "elasticsearch" } } ] } }Copy the code

3.2.7 Querying a document written before the Mapping change

  • The query
POST blogs/_search
{
  "query": {
    "match": { "content.english": "Hadoop" }
  }
}
  • The query results
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}


Here we find that documents written before the Mapping update cannot be found through the newly added subfield. Why? Because the index has not been rebuilt: once an inverted index is created, it cannot be changed. To make the old documents searchable on the new subfield, we need to rebuild the index

3.2.8 Updating all documents (_update_by_query)

POST blogs/_update_by_query
{}

_update_by_query: does a reindex operation on the original index

3.2.9 Querying the previously written document again

  • The query
POST blogs/_search
{
  "query": {
    "match": { "content.english": "Hadoop" }
  }
}
  • The query results
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 1, the "base" : "eq"}, "max_score" : 0.9808292, "hits" : [{" _index ":" blogs, "" _type" : "_doc", "_id" : "1", "_score" : 0.9808292, "_source" : {" content ":" Hadoop is cool, "" keyword" : "Hadoop"}}}}]Copy the code

3.3 Case 2: Changing the type of an existing field in the Mappings

  • ES does not allow modifying the type of a field in an existing Mapping
  • You can only create a new index, set the correct field types, and re-import the data

3.3.1 Viewing the original index's Mapping and trying to modify it

  • View the original index's Mapping
GET blogs/_mapping
  • The original Mapping definition
{
  "blogs" : {
    "mappings" : {
      "properties" : {
        "content" : {
          "type" : "text",
          "fields" : {
            "english" : {
              "type" : "text",
              "analyzer" : "english"
            },
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "keyword" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

  • Try to modify the Mapping
PUT blogs/_mapping
{
  "properties": {
    "content": "text",
    "field":{
      "english": {
        "type": "text",
        "analyzer": "english"
      }
    }
  },
  "keyword": {
    "type": "keyword"
  }
}
  • The result of the attempted modification
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "No type specified for field [field]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "No type specified for field [field]"
  },
  "status": 400
}

3.3.2 Creating an Index

# Create a new index with the correct Mapping
PUT blogs_fix
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "english"
          }
        }
      },
      "keyword": {
        "type": "keyword"
      }
    }
  }
}

3.3.3 Using the ReIndex API

# ReIndex API
POST _reindex
{
    "source": {
        "index": "blogs"
    },
    "dest": {
        "index": "blogs_fix"
    }
}

Use the ReIndex API to write data from the source to the new index in dest

3.3.4 Performing operations on the new index

  • The query
GET blogs_fix/_doc/1
  • The query results
{
  "_index" : "blogs_fix",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "content" : "Hadoop is cool",
    "keyword" : "hadoop"
  }
}
  • Test a Terms Aggregation (we changed the keyword field's type from Text to Keyword; only a Keyword field can be used directly in a Terms Aggregation)
POST blogs_fix/_search
{
  "size": 0,
  "aggs": {
    "blog_keyword": {
      "terms": {
        "field": "keyword",
        "size": 10
      }
    }
  }
}

  • The test results
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "blog_keyword" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "elasticsearch",
          "doc_count" : 1
        },
        {
          "key" : "hadoop",
          "doc_count" : 1
        }
      ]
    }
  }
}


3.4 ReIndex API

  • The ReIndex API supports copying documents from one index to another
  • Some scenarios for using the ReIndex API
    • Changing the number of primary shards of an index
    • Changing the type of a field in the Mapping
    • Migrating data within a cluster or across clusters

3.4.1 Precautions for using the ReIndex API

  1. The source index must have the _source field enabled (it is enabled by default)
  2. Before using the ReIndex API, create the destination index in advance and set its Mapping

3.4.2 OP Type

  1. When we run the ReIndex API, what should we do if the target index already contains data?
  2. We can specify "op_type": "create", so a document is written only if it does not yet exist in the target index (see the sketch after this list)
  3. _reindex will then only create documents that do not exist
  4. Documents that already exist cause version conflicts
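
A sketch of op_type on the reindex from Case 2 above (blogs to blogs_fix):

POST _reindex
{
  "source": { "index": "blogs" },
  "dest": {
    "index": "blogs_fix",
    "op_type": "create"
  }
}

# Documents whose _id already exists in blogs_fix are reported as version
# conflicts; adding "conflicts": "proceed" at the top level makes the
# reindex continue past them instead of aborting.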

3.4.3 Cross-cluster ReIndex
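
A minimal sketch of a cross-cluster ReIndex (the host and credentials are placeholders): the remote cluster goes in source.remote, and the remote host must first be whitelisted on the cluster running the reindex.

# In elasticsearch.yml on the cluster executing the reindex:
# reindex.remote.whitelist: "otherhost:9200"

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "blogs"
  },
  "dest": {
    "index": "blogs"
  }
}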

3.4.4 Viewing the Task API
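
A sketch of running a reindex asynchronously and inspecting it with the Task API (the task id shown is illustrative):

# wait_for_completion=false returns a task id instead of blocking
POST _reindex?wait_for_completion=false
{
  "source": { "index": "blogs" },
  "dest": { "index": "blogs_fix" }
}

# List running reindex tasks
GET _tasks?detailed=true&actions=*reindex

# Inspect a single task with the id returned by the first request
GET _tasks/r1A2WoRbTwKZ516z6NEs5A:36619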

4. Ingest Pipeline & Painless Script

4.1 Ingest Pipeline

Let’s look at one such requirement

4.1.1 Requirements: Repair and enhance written data

  • We need to run Aggregation statistics on the values of the Tags field; the current string format cannot meet that requirement
  • When writing the Tags field, the comma-separated text should become an array, not a string

4.1.2 Ingest Node

  • A new node type introduced after ElasticSearch 5.0. By default, every node is an Ingest Node
    • It can preprocess data by intercepting Index or Bulk API requests
    • It transforms the data and then passes it back to the Index or Bulk API
  • Data can be preprocessed without Logstash, for example:
    • Setting a default value for a field; renaming a field; performing a Split operation on a field value
    • Painless scripts can be configured for more complex data processing

4.1.3 Pipeline & Processor

In Ingest Node, we can define a Pipeline

  • Pipeline: a pipeline that processes the data (documents) passing through it
  • Processor: ElasticSearch's abstraction over a single processing step
    • ElasticSearch has many built-in Processors, and custom Processors can be implemented via plugins

4.1.4 Using a Pipeline to split a string

4.1.4.1 Inserting Data
PUT tech_blogs/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "content": "You know,for big data"
}
4.1.4.2 Splitting the Tags field
  • The split statement
# Test splitting tags
POST _ingest/pipeline/_simulate
{
    "pipeline": {
        "description": "to split blog tags",
        "processors": [
            {
                "split": {
                    "field": "tags",
                    "separator": ","
                }
            }
        ]
    },
    "docs": [
        {
            "_index": "index",
            "_id": "id",
            "_source": {
                "title": "Introducing big data......",
                "tags": "hadoop,elasticsearch,spark",
                "content": "You know,for big data"
            }
        },
        {
            "_index": "index",
            "_id": "idxx",
            "_source": {
                "title": "Introducing cloud computering",
                "tags": "openstack,k8s",
                "content": "You know,for cloud"
            }
        }
    ]
}
  • The results
{
    "docs": [
        {
            "doc": {
                "_index": "index",
                "_type": "_doc",
                "_id": "id",
                "_source": {
                    "title": "Introducing big data......",
                    "content": "You know,for big data",
                    "tags": [
                        "hadoop",
                        "elasticsearch",
                        "spark"
                    ]
                },
                "_ingest": {
                    "timestamp": "2021-03-21T09:12:11.233145Z"
                }
            }
        },
        {
            "doc": {
                "_index": "index",
                "_type": "_doc",
                "_id": "idxx",
                "_source": {
                    "title": "Introducing cloud computering",
                    "content": "You know,for cloud",
                    "tags": [
                        "openstack",
                        "k8s"
                    ]
                },
                "_ingest": {
                    "timestamp": "2021-03-21T09:12:11.233155Z"
                }
            }
        }
    ]
}

4.1.4.3 A Pipeline can have more than one Processor; here we also add a field to each document
  • The statement
POST _ingest/pipeline/_simulate
{
    "pipeline": {
        "description": "to split blog tags",
        "processors": [
            {
                "split": {
                    "field": "tags",
                    "separator": ","
                }
            },
            {
                "set": {
                    "field": "views",
                    "value": 0
                }
            }
        ]
    },
    "docs": [
        {
            "_index": "index",
            "_id": "id",
            "_source": {
                "title": "Introducing big data......",
                "tags": "hadoop,elasticsearch,spark",
                "content": "You know,for big data"
            }
        },
        {
            "_index": "index",
            "_id": "idxx",
            "_source": {
                "title": "Introducing cloud computering",
                "tags": "openstack,k8s",
                "content": "You know,for cloud"
            }
        }
    ]
}
  • The results
{
    "docs": [
        {
            "doc": {
                "_index": "index",
                "_type": "_doc",
                "_id": "id",
                "_source": {
                    "title": "Introducing big data......",
                    "content": "You know,for big data",
                    "views": 0,
                    "tags": [
                        "hadoop",
                        "elasticsearch",
                        "spark"
                    ]
                },
                "_ingest": {
                    "timestamp": "2021-03-21T09:25:23.439640Z"
                }
            }
        },
        {
            "doc": {
                "_index": "index",
                "_type": "_doc",
                "_id": "idxx",
                "_source": {
                    "title": "Introducing cloud computering",
                    "content": "You know,for cloud",
                    "views": 0,
                    "tags": [
                        "openstack",
                        "k8s"
                    ]
                },
                "_ingest": {
                    "timestamp": "2021-03-21T09:25:23.439645Z"
                }
            }
        }
    ]
}

4.1.5 Adding a Pipeline to ES

So far we have only simulated our Pipeline. Once testing shows that the Pipeline meets our needs, we register it in ES

  • Add the statement
# Add the Pipeline to ES
PUT _ingest/pipeline/blog_pipeline
{
  "description": "a blog pipeline",
  "processors": [
    {
      "split": {
        "field": "tags",
        "separator": ","
      }
    },
    {
      "set": {
        "field": "views",
        "value": 0
      }
    }
  ]
}
  • The statement to view the Pipeline
# View the Pipeline
GET _ingest/pipeline/blog_pipeline

4.1.6 Testing the Pipeline

# Simulate with the registered pipeline
POST _ingest/pipeline/blog_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You know,for cloud"
      }
    }
  ]
}

Once the Pipeline has been added to ES, we only need to provide the docs when testing

4.1.7 Using the Pipeline to update data

4.1.7.1 Preparing Data
PUT tech_blogs/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "content": "You know,for big data"
}

# Use the Pipeline to process the data
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
  "title": "Introducing cloud computering",
  "tags": "openstack,k8s",
  "content": "You know,for cloud"
}
4.1.7.2 Viewing the two documents
  • The statement
POST tech_blogs/_search
{}
  • The results
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "tech_blogs",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "title": "Introducing big data......",
          "tags": "hadoop,elasticsearch,spark",
          "content": "You know,for big data"
        }
      },
      {
        "_index": "tech_blogs",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.0,
        "_source": {
          "title": "Introducing cloud computering",
          "content": "You know,for cloud",
          "views": 0,
          "tags": [
            "openstack",
            "k8s"
          ]
        }
      }
    ]
  }
}

4.1.7.3 update_by_query can lead to errors
  • The statement
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{}
  • The results
{
  "took": 58,
  "timed_out": false,
  "total": 2,
  "updated": 1,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": "tech_blogs",
      "type": "_doc",
      "id": "2",
      "cause": {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [tags] of type [java.util.ArrayList] cannot be cast to [java.lang.String]",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "java.lang.IllegalArgumentException: field [tags] of type [java.util.ArrayList] cannot be cast to [java.lang.String]",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "field [tags] of type [java.util.ArrayList] cannot be cast to [java.lang.String]"
          }
        },
        "header": {
          "processor_type": "split"
        }
      },
      "status": 500
    }
  ]
}

The exception says an array cannot be cast to a string. Our index holds two documents: one was never processed by the Pipeline, while the other already was. For the processed document, the tags field is already an array, so running the split processor on it again raises the exception

4.1.7.4 Adding a condition to update_by_query
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
  "query": {
    "bool": {
      "must_not": {
        "exists": { "field": "views" }
      }
    }
  }
}

Pipeline processing is applied only to documents without a views field, because the views field is exactly what Pipeline processing adds

4.1.8 Some built-in Processors

  • Split Processor (example: split a given field value into an array)
  • Remove / Rename Processor (example: remove or rename a field)
  • Append (example: add a new tag to a product)
  • Convert (example: convert a product price from a string to a float)
  • Date / JSON (example: date format conversion, parsing a string into a JSON object)
  • Date Index Name Processor (example: route documents passing through the processor to an index named by a given time format)
  • Fail Processor (if an exception occurs, the Pipeline can return a specified error message to the user)
  • Foreach Processor (for array fields, apply the same processor to each element)
  • Grok Processor (cut log lines into structured fields by pattern)
  • Gsub / Join / Split (string substitution / array to string / string to array)
  • Lowercase / Uppercase (case conversion; a combined sketch follows this list)
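
A combined _simulate sketch using a few of these built-ins (the field names and values are illustrative):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "demo of rename, convert and uppercase",
    "processors": [
      { "rename": { "field": "prc", "target_field": "price" } },
      { "convert": { "field": "price", "type": "float" } },
      { "uppercase": { "field": "code" } }
    ]
  },
  "docs": [
    {
      "_source": { "prc": "3.99", "code": "abc" }
    }
  ]
}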

4.1.9 Ingest Node VS Logstash

4.2 Painless

4.2.1 Introduction to Painless

  • Introduced in ElasticSearch 5.x, designed specifically for ElasticSearch, and extends the syntax of Java
  • Since 6.0, ES only supports Painless; Groovy, JavaScript, and Python are no longer supported
  • Painless supports all Java data types and a subset of the Java API
  • Painless Script has the following features
    • High performance / secure
    • Supports explicitly declared types as well as dynamically defined types

4.2.2 The purposes of Painless

  • Processing document fields
    • Updating or deleting fields, and processing data in aggregations
    • Script Fields: computing fields to return on the fly (see the sketch after this list)
    • Function Score: adjusting the scoring of documents
  • Executing scripts in an Ingest Pipeline
  • Processing data in the Reindex API and Update By Query
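
A minimal Script Field sketch against the tech_blogs index used in the cases below (the returned field name is illustrative):

POST tech_blogs/_search
{
  "script_fields": {
    "views_plus_one": {
      "script": {
        "lang": "painless",
        "source": "doc['views'].value + 1"
      }
    }
  },
  "query": { "match_all": {} }
}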

4.2.3 Accessing fields from a Painless script

Painless scripts have different syntax for accessing fields in different contexts

  • Ingestion: ctx.field_name
  • Update: ctx._source.field_name
  • Search & Aggregation: doc["field_name"]

4.2.4 Painless Script Case 1

  • The statement
# Add a Script Processor
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "script": {
          "source": """
            if (ctx.containsKey("content")) {
              ctx.content_length = ctx.content.length();
            } else {
              ctx.content_length = 0;
            }
          """
        }
      },
      {
        "set": {
          "field": "views",
          "value": 0
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You know,for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You know,for cloud"
      }
    }
  ]
}
  • The results
{
  "docs" : [
    {
      "doc" : {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "id",
        "_source" : {
          "title" : "Introducing big data......",
          "content" : "You know,for big data",
          "content_length" : 21,
          "views" : 0,
          "tags" : [
            "hadoop",
            "elasticsearch",
            "spark"
          ]
        },
        "_ingest" : {
          "timestamp" : "2021-03-21T11:16:54.233967Z"
        }
      }
    },
    {
      "doc" : {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "idxx",
        "_source" : {
          "title" : "Introducing cloud computering",
          "content" : "You know,for cloud",
          "content_length" : 18,
          "views" : 0,
          "tags" : [
            "openstack",
            "k8s"
          ]
        },
        "_ingest" : {
          "timestamp" : "2021-03-21T11:16:54.233974Z"
        }
      }
    }
  ]
}

4.2.5 Painless Script Case 2

  • Data preparation
DELETE tech_blogs

PUT tech_blogs/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "content": "You know,for big data",
  "views": 0
}
  • Execute the script when updating
POST tech_blogs/_update/1
{
  "script": {
    "source": "ctx._source.views += params.new_views",
    "params": {
      "new_views": 100
    }
  }
}
  • Viewing views count
POST tech_blogs/_search
{}
  • The results
{ "took" : 0, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 1, the "base" : "eq"}, "max_score" : 1.0, "hits" : [{" _index ":" tech_blogs ", "_type" : "_doc", "_id" : "1", "_score" : 1.0, "_source" : {" title ":" Introducing big data... ", "tags" : "hadoop,elasticsearch,spark", "content" : "You know,for big data", "views" : 100 } } ] } }Copy the code
  • Save the script in Cluster State
# Save the script in Cluster State
POST _scripts/update_views
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.views += params.new_views"
  }
}
  • Execute the stored script by specifying its id in an update
POST tech_blogs/_update/1
{
    "script": {
        "id": "update_views",
        "params": {
            "new_views": 1000
        }
    }
}

4.2.6 Script Caching

  • Compiling a script is relatively expensive (see the sketch after this list)
  • ElasticSearch caches compiled scripts in a Cache
    • Both Inline scripts and Stored Scripts are cached
    • By default, 100 scripts are cached
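
Because of this cache, pass changing values through params instead of hard-coding them into the script source: the source then stays identical across calls and is compiled only once. A sketch reusing the update script from Case 2:

# Compiled once; reusable for any value of new_views
POST tech_blogs/_update/1
{
  "script": {
    "source": "ctx._source.views += params.new_views",
    "params": { "new_views": 10 }
  }
}

# Hard-coding the value ("ctx._source.views += 10", "... += 20", ...)
# would make each variant a new script that must be compiled and cached
# separately.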