At the suggestion of a friend who shares my surname, the articles from here on carry subtitles for easy reference.

Today’s essay will be short and you can read it quickly.

According to the plan, after finishing the word-segmentation layer we should move on to the index layer. But since this series is about a distributed search engine, the index layer and the sharding layer are closely intertwined, and I'm afraid I'd explain them badly if I jumped straight in. So before discussing the index layer and sharding proper, let's first build an intuitive picture of the index itself, and get familiar with some of the special data structures and common algorithms of the index layer. That way you'll have an overall understanding of the search engine before we talk about the index layer and the sharding layer in detail.

This article only touches on adding, deleting, and modifying data. Search itself is a big topic with many moving parts and may need a separate article or even two. OK, enough preamble; let's get to today's topic.

In the previous articles we introduced the basic underlying data structures of a search engine. At its core, a search engine is like a database: for it to run, it needs the four basic operations of add, delete, modify, and query. Let's go through them one by one.

Before that, we need some preparation. As before, everything will be derived from a running example. Suppose we have some Weibo data, where each record is a post by a Weibo user with three fields: the user's nickname, the post content, and the time. A record looks like this:

{"nickname": "YH solid wu", "content": "today is a good day", "datetime": "2016-05-05 22:12:00"}

So first, we need to create an index structure like this:

Field name   Field type                           Index structure
nickname     String (whole-term match)            inverted + forward
content      String (tokenized match)             inverted + forward
datetime     Datetime (time type)                 forward
_id          Primary key (auto-generated)         inverted + forward

If you submit the schema in the table above to the search engine, it will create these fields and build the inverted and forward indexes for each of them, just as we described earlier.
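As a sketch, the schema above could be expressed as a plain dictionary. The type names and the "index" values here are invented for illustration; every real engine has its own schema syntax:

```python
# Hypothetical schema for the Weibo example. Type names and "index"
# values are illustrative, not any real engine's API.
weibo_schema = {
    "_id":      {"type": "primary_key",      "index": ["inverted", "forward"]},
    "nickname": {"type": "string_whole",     "index": ["inverted", "forward"]},  # whole-term match
    "content":  {"type": "string_tokenized", "index": ["inverted", "forward"]},  # tokenized match
    "datetime": {"type": "datetime",         "index": ["forward"]},              # forward only
}

doc = {"nickname": "YH solid wu",
       "content": "today is a good day",
       "datetime": "2016-05-05 22:12:00"}

# Every field in an incoming document must be declared in the schema.
assert all(field in weibo_schema for field in doc)
```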

add

Adding data is exactly what we have been discussing all along: full index building and incremental index building, so I won't go into detail again here.

In both the full and the incremental case, adding means feeding raw records like the one above into the search engine one at a time. Each new record first accumulates in memory; when a certain condition is reached, a batch of data is written out to a segment file, and when another condition is reached, segments are merged together.

The only thing worth calling out is the primary-key field. Since the raw data has no _id of its own, the search engine generates an _id internally for each record as it is added.

There are many ways to generate an _id, such as random numbers or GUIDs; any method works as long as uniqueness is guaranteed. But since what we ultimately want is a distributed search engine, some special care is needed: a plain random number can easily produce duplicates across machines, whereas combining a random number with a timestamp and the local IP (or MAC) address makes the probability of two identical _ids practically zero. There are other ways to generate unique _ids in a distributed system as well, which we will cover later.
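As a minimal sketch of the timestamp-plus-machine-identity-plus-counter idea, here is a toy generator. The bit layout (41-bit millisecond timestamp, 10-bit machine id, 12-bit counter) is my own choice for illustration, similar in spirit to Twitter's Snowflake, not something from the original article:

```python
import itertools
import time

class IdGenerator:
    """Toy distributed-friendly _id generator (illustrative bit layout)."""

    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024       # must fit in 10 bits
        self.machine_id = machine_id
        self.counter = itertools.count()    # per-process sequence number

    def next_id(self):
        millis = int(time.time() * 1000)    # 41-bit-ish millisecond timestamp
        seq = next(self.counter) & 0xFFF    # 12-bit counter, wraps around
        # timestamp | machine id | sequence, packed into one integer
        return (millis << 22) | (self.machine_id << 12) | seq

gen = IdGenerator(machine_id=7)
a, b = gen.next_id(), gen.next_id()
assert a != b                       # consecutive ids never collide
assert (a >> 12) & 0x3FF == 7       # machine id is recoverable from the id
```

Two generators with different machine ids can run concurrently without ever producing the same _id, which is exactly the property a distributed engine needs.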

For the primary-key field, the value stored in the B+ tree is the docid corresponding to that key. Since the primary key is unique, each key maps to exactly one value, so no inverted (posting) list is needed.

If the data already comes with a unique id, it's even simpler: just use it directly.

Most search engines also support indexing data without a primary key. Such an index only supports add and query operations; since there is no primary key, delete and modify are not available.

delete

Deletion is a special operation in a search engine. Think about the structure of the inverted file: to delete a document the naive way, you would need to read the document's contents, find every inverted (posting) list that contains the document, remove the document from each of those lists, rewrite the lists back to the inverted file, and on top of that read the dictionary data, at least six or seven I/O operations in total.

It's a hassle, isn't it? Of course we're not going to do that. Instead we'll use a new data structure, the bitmap, to solve the problem in no time.

What is a bitmap? A bitmap is a very efficient and common data structure that stores information in an array of bits. Since a bit is the smallest unit of data, a bitmap is extremely storage-efficient, though it can only hold true/false information (extensions exist that store more). For example, the following 32-bit bitmap indicates which of the numbers 0 to 31 are valid (bit = 0) and which are invalid (bit = 1). Reading from the rightmost bit as number 0, the numbers 0, 4, and 14 are invalid here. The structure is easy to understand.

0000 0000 0000 0000 0100 0000 0001 0001

In a search engine, docids are contiguous integers, and whether a docid is valid or deleted maps naturally onto false/true, so a bitmap is a perfect way to record which documents have been deleted.

The basic operations on a bitmap are bit operations, so they are very fast. And since each document costs only one bit, the storage overhead is tiny: a 16 MB bitmap can record the deleted/valid status of 134,217,728 documents.
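A minimal deletion bitmap over a bytearray might look like this (a sketch only; a real engine packs this more carefully and persists it to disk):

```python
class DeletionBitmap:
    """One bit per docid: 0 = valid, 1 = deleted."""

    def __init__(self, num_docs):
        # Round up to whole bytes: 8 docids per byte.
        self.bits = bytearray((num_docs + 7) // 8)

    def mark_deleted(self, docid):
        self.bits[docid >> 3] |= 1 << (docid & 7)   # set the docid's bit

    def is_deleted(self, docid):
        return bool(self.bits[docid >> 3] & (1 << (docid & 7)))

bm = DeletionBitmap(32)
for docid in (0, 4, 14):        # the three "invalid" numbers from the example
    bm.mark_deleted(docid)
assert bm.is_deleted(4) and not bm.is_deleted(5)

# 16 MB of bits really does cover 134,217,728 documents:
assert 16 * 1024 * 1024 * 8 == 134_217_728
```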

So, to remove a record from the search engine, the steps are:

  • Find the document's docid via the primary key
  • Set that docid's bit in the bitmap to 1
  • Done
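The steps above are easy to sketch. Here a plain dict stands in for the B+ tree that maps primary keys to docids, and a bytearray stands in for the bitmap; all the names are illustrative:

```python
# Stand-ins: a dict for the primary-key B+ tree, a bytearray for the bitmap.
pk_to_docid = {"id-001": 0, "id-002": 1, "id-003": 2}
deleted = bytearray(1)   # 8 bits, enough for the 3 docs in this toy example

def delete(primary_key):
    docid = pk_to_docid[primary_key]          # 1. primary key -> docid
    deleted[docid >> 3] |= 1 << (docid & 7)   # 2. set that docid's bit to 1
    # 3. done -- the inverted file itself is never touched

delete("id-002")
assert deleted[0] == 0b010   # only docid 1 is flagged as deleted
```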

So the "delete" doesn't actually delete anything; the data is still there. A small trick at query time makes the document appear deleted; I'll explain that trick when we get to retrieval.

modify

With add and delete in place, modify is simple: delete followed by add equals modify.

There is one small optimization for modify. To minimize data movement, if only forward-index (row) fields are modified and the field length does not change, we can overwrite the original record in place instead of deleting and re-adding it. Of course, this means comparing all the fields on every modification to decide which update mode to use, which is less straightforward than delete-then-add, and in a distributed environment it also complicates data synchronization.
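A sketch of that decision, using the field names from the Weibo example. The forward-only field set and the mode names are my own invention for illustration:

```python
# Decide between in-place overwrite and delete-then-add.
# In the Weibo schema, datetime is the only forward-only field.
FORWARD_ONLY = {"datetime"}

def choose_update_mode(old_doc, new_doc):
    changed = {k for k in new_doc if new_doc[k] != old_doc.get(k)}
    same_len = all(len(new_doc[k]) == len(old_doc[k]) for k in changed)
    if changed <= FORWARD_ONLY and same_len:
        return "overwrite-in-place"       # cheap: no index rebuild needed
    return "delete-then-add"              # the general path

old = {"nickname": "YH solid wu", "content": "today is a good day",
       "datetime": "2016-05-05 22:12:00"}

new = dict(old, datetime="2016-05-06 09:00:00")   # forward-only, same length
assert choose_update_mode(old, new) == "overwrite-in-place"

new2 = dict(old, content="rainy day")             # inverted field changed
assert choose_update_mode(old, new2) == "delete-then-add"
```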

We've now covered add, delete, and modify, all quite simple. But notice that if the search engine keeps modifying and deleting without ever physically removing data, the data files only ever grow. That's why a search engine periodically rebuilds the full index from scratch to truly discard the useless data; otherwise the redundant data piles up and query efficiency suffers.

The next article will go into more detail and will involve a few more algorithms than this one did.


If you'd like to read my other articles, feel free to follow my official account, where I ramble about search engines, recommendation and advertising algorithms, and other such nonsense 🙂