This article is an entry in the "Performance Optimization in Practice" essay activity.

Recently, some query endpoints have been loading slowly because their filter conditions use regex searches, even though the fields involved are indexed in the usual way. Plain (non-regex) searches are unaffected, so the intuition is that regex searches are not using the index effectively and the gap needs to be understood.

What is an index

The core function of a database is to store data and provide the operations on it: create, read, update, and delete. As the data volume grows, say the number of users reaches millions, tens of millions, or even hundreds of millions, finding the records that match a compound condition becomes hard. Of course you could check them one by one, but nobody would, because it would be far too slow. So what do we do?

It's the same as the dictionary analogy used when first learning about databases: we usually look up data by certain criteria. With a dictionary, for example, we look characters up by pinyin, rather than by a question like "which entries contain a certain word".

So what does the dictionary do? It puts pinyin, the most common lookup criterion, in a table at the front, so that when looking up a word you only need to flip through the first few pages, find the entry you want, and jump to the corresponding page. That is exactly what an index is.

An index takes the values of selected fields and maintains them separately in a data structure optimized for fast lookup.

The index principle

The fast lookup structure is essentially a tree (conceptually a bit like nested JSON), optimized specifically for querying data. The concrete structure is the B+ tree, which is worth reading up on. Querying it works like the binary-search strategy in a number-guessing game: each comparison discards a large chunk of the data, so a record can be located very quickly even among hundreds of millions of rows.
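As a rough illustration (a plain JavaScript sketch of binary search, not an actual B+ tree implementation), looking up a key in a sorted set of index entries takes only a handful of comparisons:

// Each comparison discards half of the remaining keys, so even hundreds of
// millions of entries need only a few dozen steps.
function findKey(sortedKeys, target) {
  let lo = 0, hi = sortedKeys.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (sortedKeys[mid] === target) return mid;   // found: jump to the document
    if (sortedKeys[mid] < target) lo = mid + 1;   // discard the lower half
    else hi = mid - 1;                            // discard the upper half
  }
  return -1;                                      // key is not in the index
}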

So why is a regex search still slow when the field is indexed?

How a regex search proceeds

Think about it for a moment and it's fairly simple. The idea of an index is to take a field's values, separate them out, and store them in a format that is fast to search; an index lookup is then a direct comparison against those stored values. A regex, however, is not a simple size comparison: it has to be evaluated against a value before the result is known. So for a regex match on an indexed field (similar to fuzzy matching), every index entry has to be run through the regex one by one until all results are found, which is naturally slow.
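By contrast, here is a minimal sketch (plain JavaScript with made-up key values, not MongoDB internals) of what a regex match over index entries has to do:

// The sorted order doesn't help a regex: every key must be tested one by one.
const indexKeys = ["Angry Birds", "Spicy Hero", "Tetris"];   // hypothetical index entries
const pattern = /Hero/;
const matches = indexKeys.filter(key => pattern.test(key));  // linear scan, O(n) regex runs
console.log(matches);                                        // ["Spicy Hero"]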

That's the theory, so let's use MongoDB to see what happens in practice. Query indexes in MongoDB and SQL databases alike are generally B+ tree (or B-tree) structures, so the conclusions carry over.

Mongo data testing

Check the total number of test documents:
db.game.count();
# 50407 documents

To see the query execution details, append explain to the query:
db.game.find({"name": "Spicy Hero"}).explain("allPlansExecution");

The test document structure

{
  "_id": "5590b3f9bac548696b8b45cf",
  "des": "\"Spicy Hero\" is a historical brawl, semi-real-time RPG mobile game. It features high-precision recreations of the Great Wall, the royal palace and other scenes, 100 types of cute chibi generals, thousands of hero combinations, original lead-role skills, plus giant BOSS battles, siege battles and other signature gameplay, so the mobile game is never boring.",
  "name": "Spicy Hero",
  "download": 1
  // There are many other fields
}

To keep the output short, most of the explain output has been trimmed, leaving only a few key fields. The query commands are as follows:

  1. Exact query: db.game.find({"name": "Spicy Hero"}).explain("allPlansExecution");
  2. Regex query: db.game.find({"name": /Spicy Hero/}).explain("allPlansExecution");

Test results for various conditions

No index + exact query

[{"executionStats": {
      "executionSuccess": true."nReturned": 1."executionTimeMillis": 20."totalKeysExamined": 0."totalDocsExamined": 50407,}}]Copy the code

No index + regex query

[{"executionStats": {
      "executionSuccess": true."nReturned": 1."executionTimeMillis": 31."totalKeysExamined": 0."totalDocsExamined": 50407,}}]Copy the code

Index added + exact query

[{"executionStats": {
      "executionSuccess": true."nReturned": 1."executionTimeMillis": 0."totalKeysExamined": 1."totalDocsExamined": 1,}}]Copy the code

Index added + regex query

[{"executionStats": {
      "executionSuccess": true."nReturned": 1."executionTimeMillis": 40."totalKeysExamined": 50407."totalDocsExamined": 1,}}]Copy the code

Summary

executionSuccess and nReturned show that one document was returned successfully in every case. The key metrics are the other two fields, totalKeysExamined and totalDocsExamined: the number of index keys and documents examined while running the query.

Case                    | Index keys examined | Documents examined | Execution time (ms)
No index + exact query  | 0                   | 50407              | 20
No index + regex query  | 0                   | 50407              | 30
Index + exact query     | 1                   | 1                  | 0
Index + regex query     | 50407               | 1                  | 40
  1. Without an index the entire collection is scanned, reading the raw documents one by one. Documents generally live on disk, and disk reads are far slower than memory. It doesn't look especially slow here only because the data set is small enough to sit in memory.
  2. Once the index is added, the document scan count drops to 1: the matching happens inside the separate index structure, which greatly reduces disk reads and writes.
  3. With the index, however, the regex still has to examine every key; the only gain is that the values come from the index structure instead of being read from disk. A regex only knows whether a value matches after evaluating it, so the query is still slow. There is no black magic underneath, just plain code.
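Besides these counters, the explain output also records which plan stages ran, which is another quick way to see whether the index was involved at all (using MongoDB's standard explain fields):

// The winning plan contains an IXSCAN stage when an index is used,
// or COLLSCAN for a full collection scan. Note that the indexed regex
// query still shows IXSCAN even though it examines every index key.
db.game.find({ "name": "Spicy Hero" }).explain().queryPlanner.winningPlan;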

Index size

The test data is 127 MB in total. The index on des (the description text) is 42.8 MB, while the index on download (a numeric field) is only 221.2 KB.
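Those sizes can be read from the collection statistics; for example:

// indexSizes lists the on-disk size of each index on the collection.
db.game.stats().indexSizes;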

The sizes also show that an index is a separate copy of the data, so think carefully before indexing text fields. Indexing large text has a couple of problems:

  1. The index itself may not be very useful, since large text fields are generally queried with fuzzy matching anyway
  2. The index may be too large to fit entirely in memory, which slows things down even more

What did we learn

There is no black magic in an index itself. Simply put, it stores a redundant copy of the queried fields in a structure built for fast lookup, which solves most query performance problems on large data sets.

For regex queries and fuzzy matching there is no such shortcut: just like ordinary code, the database only knows the result once the expression has actually been evaluated, so use them with care, especially in application-facing endpoints. So what about search, then?

Fuzzy search and relevance-ranked results generally fall under full-text search. A full-text search engine uses a completely different scheme to handle fuzzy matching; the basic logic is:

  1. All content is segmented into tokens, and each token is associated with the values it appears in. For example, "I love Beijing" can be segmented into "I", "love", "Beijing", "I love" and so on (see the sketch after this list)
  2. A search first segments the query text the same way, producing a set of candidate tokens, and then matches those tokens against the stored associations. The work is just segmentation plus direct lookups, so it is very fast
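A minimal sketch of that idea (plain JavaScript with naive whitespace tokenization, nothing like a production tokenizer):

// Build an inverted index: token -> set of ids of the documents containing it.
const documents = ["I love Beijing", "I love coding"];
const invertedIndex = {};
documents.forEach((text, id) => {
  for (const token of text.toLowerCase().split(/\s+/)) {
    (invertedIndex[token] = invertedIndex[token] || new Set()).add(id);
  }
});
// A search segments the query the same way and looks each token up directly.
console.log(invertedIndex["beijing"]);  // Set { 0 }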

Of course, some databases also provide full-text indexes (for example, MongoDB has a text index type that can be searched, although at the time of writing it does not support Chinese).
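For reference, creating and using MongoDB's text index on this test collection would look roughly like this (the des field name is taken from the sample document above):

// Create a text index on the description field, then run a $text search.
db.game.createIndex({ "des": "text" });
db.game.find({ $text: { $search: "hero" } });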

Different requirements call for different technical solutions. Don't be obsessed with any one database, language, or framework; sensible use and trade-offs are what matter.

About MongoDB's free online trial

If you want to try MongoDB, go to the official website; it will redirect you to its cloud service, where you can create a small replica-set cluster for free.