In many cases, we want to update all of our documents:

  • Add a new field or make a field a multi-field
  • Update all documents with a single value, or update only the documents that match a compound query condition

In today’s article, we’ll look at these uses of _update_by_query.

 

Prepare the data

Let’s create an index called twitter:

PUT twitter
{
  "mappings": {
    "properties": {
      "DOB": {
        "type": "date"
      },
      "address": {
        "type": "keyword"
      },
      "city": {
        "type": "text"
      },
      "country": {
        "type": "keyword"
      },
      "uid": {
        "type": "long"
      },
      "user": {
        "type": "keyword"
      },
      "province": {
        "type": "keyword"
      },
      "message": {
        "type": "text"
      },
      "location": {
        "type": "geo_point"
      }
    }
  }
}

We use the following _bulk request to import the data. Each action line and each document must stay on a single line:

POST _bulk
{ "index" : { "_index" : "twitter", "_id": 1 } }
{ "user": "Zhang San", "message": "Nice weather today, going out for a walk", "uid": 2, "city": "Beijing", "province": "Beijing", "country": "China", "address": "Haidian District, Beijing, China", "location": { "lat": "39.970718", "lon": "116.325747" }, "DOB": "1980-12-01" }
{ "index" : { "_index" : "twitter", "_id": 2 } }
{ "user": "Liu", "message": "Next stop, Yunnan!", "uid": 3, "city": "Beijing", "province": "Beijing", "country": "China", "address": "No. 3 Taijichang Santiao, Dongcheng District, Beijing, China", "location": { "lat": "39.904313", "lon": "116.412754" }, "DOB": "1981-12-01" }
{ "index" : { "_index" : "twitter", "_id": 3 } }
{ "user": "Bill", "message": "happy birthday!", "uid": 4, "city": "Beijing", "province": "Beijing", "country": "China", "address": "Dongcheng District, Beijing, China", "location": { "lat": "39.893801", "lon": "116.408986" }, "DOB": "1982-12-01" }
{ "index" : { "_index" : "twitter", "_id": 4 } }
{ "user": "Lao Jia", "message": "123, gogogo", "uid": 5, "city": "Beijing", "province": "Beijing", "country": "China", "address": "Jianguomen, Chaoyang District, Beijing, China", "location": { "lat": "39.718256", "lon": "116.367910" }, "DOB": "1983-12-01" }
{ "index" : { "_index" : "twitter", "_id": 5 } }
{ "user": "Lao Wang", "message": "Happy BirthDay My Friend!", "uid": 6, "city": "Beijing", "province": "Beijing", "country": "China", "address": "Guomao, Chaoyang District, Beijing, China", "location": { "lat": "39.918256", "lon": "116.467910" }, "DOB": "1984-12-01" }
{ "index" : { "_index" : "twitter", "_id": 6 } }
{ "user": "Lao Wu", "message": "Today is my birthday, friends, come on over. A simple birthday happy will do!", "uid": 7, "city": "Shanghai", "province": "Shanghai", "country": "China", "address": "Minhang District, Shanghai, China", "location": { "lat": "31.175927", "lon": "121.383328" }, "DOB": "1985-12-01" }

Turn a field into a multi-field

Above, we deliberately mapped the city field as text, although in practice city is usually a keyword field. Suppose we now want to aggregate on city. How do we correct this mistake? Do we have to delete the index and rebuild it with a new mapping? In practice that is often unrealistic: the index may hold a very large amount of data, and such a change can cause many problems. So how do we solve this?

One way is to make city a multi-field without dropping the existing index, so that it is available both as text and as keyword. To do this, let’s modify the twitter mapping:

PUT twitter/_mapping
{
  "properties": {
    "DOB": {
      "type": "date"
    },
    "address": {
      "type": "keyword"
    },
    "city": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "country": {
      "type": "keyword"
    },
    "uid": {
      "type": "long"
    },
    "user": {
      "type": "keyword"
    },
    "province": {
      "type": "keyword"
    },
    "message": {
      "type": "text"
    },
    "location": {
      "type": "geo_point"
    }
  }
}

Notice that we changed the city field into a multi-field. However, modifying the mapping does not change documents that are already indexed: the existing documents have not yet been indexed into the new city.keyword sub-field. To fix that, we run the following:

POST twitter/_update_by_query

After running the command above, the documents are re-indexed and the city.keyword sub-field becomes searchable:

GET twitter/_search
{
  "query": {
    "match": {
      "city.keyword": "Beijing"
    }
  }
}

The query above returns:

  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 0.21357408,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.21357408,
        "_source" : {
          "user" : "Zhang San",
          "message" : "Nice weather today, going out for a walk",
          "uid" : 2,
          "city" : "Beijing",
          "province" : "Beijing",
          "country" : "China",
          "address" : "Haidian District, Beijing, China",
          "location" : {
            "lat" : "39.970718",
            "lon" : "116.325747"
          },
          "DOB" : "1980-12-01"
        }
      },
      ...
    ]
  }

Since city is now a multi-field that contains city.keyword, we can also run an aggregation on it:

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "city_distribution": {
      "terms": {
        "field": "city.keyword",
        "size": 5
      }
    }
  }
}

The aggregation above counts the documents per city. The result is:

  "aggregations" : {
    "city_distribution" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Beijing",
          "doc_count" : 5
        },
        {
          "key" : "Shanghai",
          "doc_count" : 1
        }
      ]
    }
  }

If we had not turned city into a multi-field, we could not aggregate on it this way: a text field does not support a terms aggregation unless fielddata is enabled on it.
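
To see why the keyword sub-field matters for counting, here is a rough Python sketch of what bucketing looks like on exact values versus analyzed tokens. The analyze function is only a crude stand-in for a text analyzer (lowercase, split on whitespace), and the city names "New York" and "New Delhi" are hypothetical values that do not appear in our index:

```python
from collections import Counter

# The six city values from the twitter index.
cities = ["Beijing"] * 5 + ["Shanghai"]

# keyword: every value is kept as one unmodified term, so buckets are whole cities.
keyword_buckets = Counter(cities)


def analyze(value):
    """Crude stand-in for a text analyzer: lowercase and split on whitespace."""
    return value.lower().split()


# text: values are broken into tokens, so multi-word cities fall apart.
# "New York" and "New Delhi" are hypothetical examples.
text_buckets = Counter(
    token for city in ["New York", "New Delhi"] for token in analyze(city)
)

print(keyword_buckets.most_common())  # [('Beijing', 5), ('Shanghai', 1)]
print(text_buckets.most_common())     # [('new', 2), ('york', 1), ('delhi', 1)]
```

The first result mirrors the buckets returned by the city.keyword aggregation; the second shows the kind of per-token buckets that make aggregating on analyzed text unhelpful.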

 

Add a new field

We can also add a new field to the documents in our twitter index with a script, for example:

POST twitter/_update_by_query
{
  "script": {
    "source": "ctx._source['contact'] = \"139111111111\""
  }
}

This adds a new contact field with the same value to every document. We can verify it with a search:

GET twitter/_search
{
  "query": {
    "match_all": {}
  }
}

The command above returns:

  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "uid" : 2,
          "country" : "China",
          "address" : "Haidian District, Beijing, China",
          "province" : "Beijing",
          "city" : "Beijing",
          "DOB" : "1980-12-01",
          "contact" : "139111111111",
          "location" : {
            "lon" : "116.325747",
            "lat" : "39.970718"
          },
          "message" : "Nice weather today, going out for a walk",
          "user" : "Zhang San"
        }
      },
      ...
    ]
  }

As we can see from the above, we have added a new field contact.

Modify an existing field

If we want to increment the uid of every document whose city is Beijing by 1, we can do the following:

POST twitter/_update_by_query
{
  "query": {
    "match": {
      "city.keyword": "Beijing"
    }
  },
  "script": {
    "source": "ctx._source['uid'] += params['one']",
    "params": {
      "one": 1
    }
  }
}

After executing the command above, we run the query again:

GET twitter/_search
{
  "query": {
    "match": {
      "city.keyword": "Beijing"
    }
  }
}

The result:

  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 0.24116206,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.24116206,
        "_source" : {
          "uid" : 3,
          "country" : "China",
          "address" : "Haidian District, Beijing, China",
          "province" : "Beijing",
          "city" : "Beijing",
          "DOB" : "1980-12-01",
          "contact" : "139111111111",
          "location" : {
            "lon" : "116.325747",
            "lat" : "39.970718"
          },
          "message" : "Nice weather today, going out for a walk",
          "user" : "Zhang San"
        }
      },
      ...
    ]
  }

The uid of every document whose city is Beijing has been increased by 1. For the document with _id 1 shown above, uid was 2; it is now 3.
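
The same request can be issued from a program instead of the console. The following is a minimal Python sketch using only the standard library; the cluster address http://localhost:9200 (no authentication) and the helper name increment_uid_request are assumptions made for illustration. It only builds the request and leaves sending it to the caller:

```python
import json
import urllib.request

# Assumption: a local cluster without authentication.
ES_URL = "http://localhost:9200"


def increment_uid_request(index, city, amount):
    """Build (but do not send) an _update_by_query request.

    Hypothetical helper: it matches documents on city.keyword and adds
    `amount` to their uid, using the same script as the console example.
    """
    body = {
        "query": {"match": {"city.keyword": city}},
        "script": {
            # The script text never changes; only the params do.
            "source": "ctx._source['uid'] += params['one']",
            "params": {"one": amount},
        },
    }
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_update_by_query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = increment_uid_request("twitter", "Beijing", 1)
print(req.get_method(), req.full_url)
print(json.loads(req.data)["script"]["params"])

# To run it against a live cluster:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Passing the increment through params rather than writing it into the script text keeps the script source identical between calls, which allows Elasticsearch to reuse the compiled script.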

Reindexing an index that was created without dynamic mapping

Suppose you create an index with dynamic mapping disabled, populate it with data, and then add a mapping to pick up more fields from that data:

PUT test
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "text": {"type": "text"}
    }
  }
}

POST test/_doc?refresh
{
  "text": "words words",
  "flag": "bar"
}

POST test/_doc?refresh
{
  "text": "words words",
  "flag": "foo"
}

PUT test/_mapping
{
  "properties": {
    "text": {"type": "text"},
    "flag": {"type": "text", "analyzer": "keyword"}
  }
}

Above we create an index called test. Dynamic mapping is disabled first, which means that fields not defined in the mapping are not mapped automatically at index time: they exist only in _source, and we cannot search on them. To correct this, the last step above modifies the mapping to add the flag field. So under the new mapping, can the documents we imported earlier be searched? Let’s try the following command:

POST test/_search?filter_path=hits.total
{
  "query": {
    "match": {
      "flag": "foo"
    }
  }
}

We tried to search for all documents with foo in flag, but it returned:

{
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    }
  }
}

So what’s the problem? After we modified the mapping, we did not update the documents we had already imported. We need _update_by_query to do something similar to a reindex. We use the following command:

POST test/_update_by_query?refresh&conflicts=proceed

Let’s search our documents again:

POST test/_search?filter_path=hits.total
{
  "query": {
    "match": {
      "flag": "foo"
    }
  }
}

The above query shows the result:

{
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    }
  }
}

Obviously, after running _update_by_query, we can find our document.
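
The behavior above can be pictured with a toy model in Python. This is only a sketch of the idea that fields are indexed according to the mapping in force at index time; the ToyIndex class and its methods are hypothetical and say nothing about how Elasticsearch is actually implemented:

```python
class ToyIndex:
    """Toy model of index-time mapping; not Elasticsearch internals."""

    def __init__(self, mapped_fields):
        self.mapped_fields = set(mapped_fields)
        self.sources = {}    # doc id -> stored _source
        self.inverted = {}   # (field, term) -> set of doc ids

    def index(self, doc_id, source):
        """Store _source and index only the fields that are mapped right now."""
        self.sources[doc_id] = source
        # Drop postings left over from an earlier version of this document.
        for ids in self.inverted.values():
            ids.discard(doc_id)
        for field in self.mapped_fields:
            if field in source:
                # The whole value is one term, like the keyword analyzer.
                self.inverted.setdefault((field, source[field]), set()).add(doc_id)

    def put_mapping(self, field):
        """Adding a mapping only affects documents indexed afterwards."""
        self.mapped_fields.add(field)

    def update_by_query(self):
        """Re-index every document from its stored _source."""
        for doc_id, source in list(self.sources.items()):
            self.index(doc_id, source)

    def search(self, field, term):
        return sorted(self.inverted.get((field, term), set()))


idx = ToyIndex(mapped_fields=["text"])
idx.index(1, {"text": "words words", "flag": "bar"})
idx.index(2, {"text": "words words", "flag": "foo"})

idx.put_mapping("flag")
before = idx.search("flag", "foo")   # flag was not mapped when doc 2 was indexed

idx.update_by_query()
after = idx.search("flag", "foo")    # doc 2 has now been re-indexed with flag

print(before, after)  # [] [2]
```

The flag values were in the stored source all along; what was missing was the index entry, and re-indexing each document from its source is what creates it.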

Reindex for large amounts of data

All of the _update_by_query calls above work well for small amounts of data. In real applications, however, we may have a very large amount of data. If something goes wrong partway through the process, do we have to start from scratch and go through the data we have already processed again? A common solution is to define a field in our mapping, such as flag, and use it to track progress. Note that the range query below only matches documents in which flag already exists with a value below 1:

POST blogs_fixed/_update_by_query
{
  "query": {
    "range": {
      "flag": {
        "lt": 1
      }
    }
  },
  "script": {
    "source": "ctx._source['flag']=1"
  }
}

Even if the process fails partway through, running the _update_by_query above again will not touch the documents that were already processed.
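
The resume behavior can be simulated in a few lines of Python. This is an in-memory sketch, not a call to Elasticsearch: the update_by_flag function, the views field and the simulated failure are all hypothetical, and every document is assumed to start with flag set to 0:

```python
def update_by_flag(docs, transform, fail_after=None):
    """Process documents whose flag is below 1 and mark each one as done.

    fail_after simulates a crash once that many documents were processed.
    Returns the number of documents processed in this run.
    """
    processed = 0
    for doc in docs:
        if doc["flag"] >= 1:      # same filter as the range query: flag < 1
            continue
        if fail_after is not None and processed == fail_after:
            raise RuntimeError("simulated failure")
        transform(doc)
        doc["flag"] = 1           # same effect as ctx._source['flag']=1
        processed += 1
    return processed


def bump(doc):
    """Hypothetical per-document change."""
    doc["views"] += 1


docs = [{"id": i, "flag": 0, "views": 0} for i in range(5)]

try:
    update_by_flag(docs, bump, fail_after=2)   # first run dies after 2 documents
except RuntimeError:
    pass

resumed = update_by_flag(docs, bump)           # second run handles only the rest
print(resumed, [d["views"] for d in docs])     # 3 [1, 1, 1, 1, 1]
```

Because the marker is written in the same update as the change itself, the second run skips the two documents handled before the failure, and no document is changed twice.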

In addition to the usages above, _update_by_query can also run our indexed data through an ingest pipeline. See my previous article “COVID-19 Data Analysis and Visualization using Elastic Stack” for detailed usage.

Read more about Elasticsearch: Reindex Interface.

 

Reference:

【 1 】 www.elastic.co/guide/en/el…