Elasticsearch tutorial live replay

1. Start with two practical problems

Problem 1: In elasticsearch-head, the document counts shown for an index are inconsistent.

One shows 3429, the other 5921. What does that mean?

Problem 2: When data was bulk-written to ES, a large number of documents ended up in deleted status. What is the cause?

Data is read from a database and bulk-inserted into ES, with the document id set to the database primary key value. The bulk insert reports no errors, but in cerebro a large number of documents show a deleted status. The primary key values definitely do not repeat, so why does this happen?

Source of the problems: elasticsearch.cn/question/11…

Both of the above problems involve deleting and updating documents. Let's first explain these two concepts; the problems then become much easier to break down and analyze.

2. Document version _version

When you insert a record in MySQL, what you see is a row. Elasticsearch is a document-oriented search engine, so what you see is a JSON document, as shown below:

  • _id is the document's unique ID.
  • _version is the document's version number.
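For instance, here is a minimal sketch of what such a document looks like when fetched (the test index and field values are illustrative placeholders that match the demo below, not the original screenshot):

GET test/_doc/1

{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "counter" : 2,
    "tags" : ["blue"]
  }
}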

The natural question at this point: how does the version number change after existing data is updated or deleted?

Watch a demo to find out.

DELETE test

# Run once
PUT test/_doc/1
{
  "counter" : 2,
  "tags" : ["blue"]
}
# Response: "_version" : 1

GET test/_doc/1

GET test/_stats
# "count" : 1, "deleted" : 0

# Run again (update operation)
# Response: "_version" : 2 (version number +1)
PUT test/_doc/1
{
  "counter" : 3,
  "tags" : ["blue", "green"]
}

Writing the document again is equivalent to updating the original document. The _version changes from 1 to 2.

At this point, the _stats API shows deleted as 1:

"count" : 1."deleted" : 1
GET test/_stats
Copy the code

After a delete operation is performed, check the result: _version is incremented by 1 again.

DELETE test/_doc/1

{
  "_index" : "test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 3,
  "result" : "deleted",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 2,
  "_primary_term" : 1
}

The _stats API shows that deleted is 3.

Therefore, a preliminary conclusion is drawn:

  • Update and delete operations both build on the original document's version number: each operation increments _version by 1.
  • At the same time, earlier versions of the document are marked as deleted. This explains Problem 2: documents that are merely written repeatedly with the same id are also marked as deleted.

3. What is the nature of document deletion, index deletion, and document update?

3.1 The essence of document deletion

  • The essence of document deletion: logical deletion, not physical deletion.

After you delete a document, it is not removed from disk immediately; it is only marked as deleted (_version +1, result: "deleted"). The most obvious consequence is the frequently asked question: "Why doesn't deleting documents reduce disk usage?"

As more data is indexed, Elasticsearch will clean up documents marked as deleted in the background.

To actually remove them from disk, you need a segment merge, for example:

POST test/_forcemerge?only_expunge_deletes=true

With the parameter only_expunge_deletes, only documents that are marked as deleted are expunged.

"count" : 0."deleted" : 0
GET test/_stats
Copy the code

This inevitably raises a question: as deleted documents pile up, is there a faster way to remove historical cold data in bulk or in full?

Yes: delete all data under an index by dropping the index itself.

3.2 The nature of index deletion

Unlike deleting a document, deleting an index means deleting its shards, mappings, and data.

The essence of index deletion: physical deletion of data.

Compared with document deletion, index deletion is more direct, faster, and more brute-force. After an index is deleted, all data associated with it is removed from disk directly.

Index deletion involves two steps:

  • Update the cluster state;
  • Delete the shards from disk.

It is important to note that a dropped index cannot be recovered if there is no index snapshot or other data backup (this question has been asked at least 10+ times).
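For completeness, here is a minimal sketch of such a snapshot backup (the repository name my_backup and the location are assumptions for illustration; the location must also be whitelisted via path.repo in elasticsearch.yml):

# Register a filesystem snapshot repository
PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/es_backups/my_backup"
  }
}

# Take a snapshot before running risky deletes
PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true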

Delete an index as follows:

DELETE test

3.3 The essence of document update

The essence of document update: delete + add.

In Lucene, the core engine of Elasticsearch, inserting or updating a document has the same cost: in Lucene and Elasticsearch, to update means to replace.

Elasticsearch marks the old document as deleted and adds a brand new document. As with deleting documents, old documents cannot be accessed, but they are not physically deleted immediately unless a segment merge operation is performed manually or periodically.
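For example, a partial update through the _update API shows the same behavior (a sketch reusing the test index from the earlier demo, assuming a document with id 1 still exists):

POST test/_update/1
{
  "doc": {
    "counter": 4
  }
}
# The response carries the incremented "_version", and GET test/_stats
# shows "deleted" grow by 1 until a segment merge runs.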

4. Back to the two opening problems

4.1 Why are the docs counts inconsistent?

Let's reproduce the problem and combine it with the principles from the previous two sections, using the Kibana e-commerce sample data (kibana_sample_data_ecommerce) as the base data.

  • Step 1: The original data size: 4675 documents in total (a quick way to verify this is shown below).
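The 4675 comes from a screenshot; to check it yourself, a simple count request is enough, for example:

GET kibana_sample_data_ecommerce/_count
# Expected response roughly: { "count" : 4675, ... }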

  • Step 2: Delete data in batches: delete all documents with order_id > 584670.

POST kibana_sample_data_ecommerce/_delete_by_query
{
  "query": {
    "range": {
      "order_id": {
        "gt": 584670
      }
    }
  }
}

Number of returned results:

{
  "took" : 100,
  "timed_out" : false,
  "total" : 1246,
  "deleted" : 1246,
  "batches" : 2,
  ...
}

In other words, 1,246 records were deleted.

  • Step 3: Check the results again:

The docs values change (the original value, from the first screenshot, was 4675):

  • 4675 - 1246 = 3429: the exact number of live documents.
  • 4675 + 1246 = 5921: the original document count plus the deleted document count.

Let's see what _stats reports:

GET kibana_sample_data_ecommerce/_stats

Return result:

  "_all" : {
    "primaries" : {
      "docs" : {
        "count" : 3429."deleted" : 2492
      },
Copy the code
  • The first value (3429) is count;
  • The second value (5921) is count + deleted.

2492 is exactly twice 1246. My understanding (discussion welcome) is:

  • Documents removed by the delete-by-query: 1246
  • Documents marked as deleted, with _version incremented by 1: 1246

During the actual test validation, the deleted value changed from 2492 to 1246 and finally to 0.

Of course, you can also use _forcemerge to force a segment merge.
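Applied to this sample index, that would look like the following (a sketch only; force merge is I/O-intensive, so run it only when you actually need to reclaim the space):

POST kibana_sample_data_ecommerce/_forcemerge?only_expunge_deletes=true
# Afterwards, GET kibana_sample_data_ecommerce/_stats should report "deleted" : 0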

4.2 Why are a large number of documents in deleted status?

My guess: during synchronization, documents with the same id were written more than once. When the same id is written twice or more, Elasticsearch overwrites it (which is essentially an update).

As mentioned earlier, the essence of an update is to mark the original document as deleted and then insert another document.
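To see this with a minimal reproduction (a sketch only; the index name sync_test and the ids are made-up placeholders, not the asker's data), write the same ids twice via _bulk:

POST sync_test/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "doc one" }
{ "index" : { "_id" : "2" } }
{ "name" : "doc two" }

# Run the exact same _bulk request a second time: it still succeeds without errors,
# but GET sync_test/_stats now shows "deleted" : 2 until segments are merged.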

So, try manually performing a force merge and the deleted documents will disappear; or simply wait for the natural background segment merge, after which no deleted documents will remain either.

5. Summary

A thought and summary triggered by a small question.

