Introduction
Over the past two weeks my work has mainly been processing tens of millions of rows of text data. Since I had never dealt with data at this scale before, this post summarizes the problems I ran into along the way and how I solved them.
Clear goals
Before starting, we have to be clear about the goal. First, the data: it lives in two tables, one listing the articles and the other holding the article content, and each article mentions a number of people.
The task is a bit like building a search engine: we type in keywords and get the relevant documents back. The simplest implementation is a SQL LIKE query, but that has two big problems:
- Inaccurate results. If we search for "Zhang Hua", people named "Zhang Huashuo" may also come out.
- Searches take too long. Full-text LIKE scans over tens of millions of rows are very slow.
We want our queries to be accurate, and we want them to come back in milliseconds. So we tried using ElasticSearch as our "database", abandoning its default word segmentation in favor of "manual segmentation" to get queries that are both accurate and fast.
So our goal is simply to "stuff" the data from MySQL into ElasticSearch and then "fetch" it back out again.
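As a rough picture of that pipeline, here is a minimal sketch, assuming the Python `elasticsearch` and `pymysql` clients and a hypothetical `article` table/index; this is not the actual production code:

```python
from elasticsearch import Elasticsearch, helpers
import pymysql

# Hypothetical connection details; adjust to your environment.
es = Elasticsearch("http://localhost:9200")
db = pymysql.connect(host="localhost", user="root",
                     password="...", database="articles_db")

def fetch_rows():
    """Yield article rows from MySQL as ElasticSearch bulk actions."""
    with db.cursor(pymysql.cursors.DictCursor) as cur:
        # Just the first batch, for illustration.
        cur.execute("SELECT id, title, content FROM article LIMIT 1000")
        for row in cur.fetchall():
            yield {"_index": "article", "_id": row["id"], "_source": row}

# "Stuff" the rows into ElasticSearch in one bulk request.
helpers.bulk(es, fetch_rows())
```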
The first obstacle: MySQL
The first hurdle I ran into was the database's response speed. To get all of the data out of the database (new data was still being generated), I pulled the ids out of MySQL in small batches.
At the beginning the program ran happily and the speed was stable, but after roughly 100,000 rows it suddenly slowed down. At first I thought something was wrong with my parsing code.
Let’s analyze why this SQL is so slow
```sql
select * from a order by id limit 10 offset 100000
```
To return those 10 rows, MySQL first has to scan past 100,000 rows. Scanning along the primary key is relatively fast, but a hundred thousand rows is still a lot: even at 0.01 ms per row, skipping 100,000 rows already costs about a second, all spent on data we then throw away.
Almost every time we deal with a slow database, the first thing we look at is the index, so let's dwell on this a little: why is an index fast?
A database is a big pile of data plus the tools to quickly get at what we want. To use a library as an analogy: the database is the library, and the books are neatly categorized, so whatever book you want, you find it by its category. If the library holds ten million books and you are looking for "Hans Christian Andersen's Fairy Tales", all you have to do is follow the index. If the books were not categorized and you didn't know where yours sat, you would have to walk the whole library to find it, and that is what happens without an index.
The same is true in the world of programs: if you want to find a record quickly and there is no index, you have to traverse the data. If you're lucky you find it right away; if you're unlucky you scan everything. In MySQL, behind the index sits a B+ tree, which reduces the worst-case lookup to the log2(N) level. If you want to know why trees are so fast, check out my previous blog post.
To illustrate what the index buys us: with 10 million rows, the worst case is about 24 lookups. 24 versus 10 million is a staggering difference of roughly 410,000 times, and the bigger the data, the bigger the gap. That is the power of an index.
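A quick back-of-the-envelope check of those numbers (just a sketch, not part of the pipeline):

```python
import math

rows = 10_000_000
# Worst-case lookups down a balanced B+-tree-like structure ~ log2(N).
lookups = math.ceil(math.log2(rows))   # 24
speedup = rows / lookups               # ~416,000x fewer comparisons than a full scan

print(lookups, round(speedup))
```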
Back to the problem at hand: to use the index, we have to work with the id (the primary key). My first thought was to partition by id, but when I looked at the database carefully, the ids were not all contiguous (probably because of deleted rows). If I fetched data by fixed id ranges, some ranges might come back empty. It would work, but it wasn't elegant, and I would have to add code to handle the empty batches.
Then it occurred to me: if we remember the last id of each batch we fetch, we can use it as the starting point when we go get the next batch. That way we keep using the index without changing much code.
Our SQL then changes to the following statement:
```sql
select * from a where id > 1111 order by id limit 10
```
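Put together, the "remember the last id" idea looks roughly like this (a minimal sketch with pymysql against the hypothetical table `a`; batch size and column names are assumptions):

```python
import pymysql

db = pymysql.connect(host="localhost", user="root",
                     password="...", database="articles_db")

last_id = 0          # start before the smallest id
batch_size = 1000

while True:
    with db.cursor() as cur:
        # Keyset pagination: seek by primary key instead of using OFFSET.
        cur.execute(
            "SELECT id, content FROM a WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size),
        )
        rows = cur.fetchall()
    if not rows:
        break                      # no more data
    # ... hand this batch to ElasticSearch (or any other consumer) ...
    last_id = rows[-1][0]          # remember the last id for the next batch
```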
In a simple test, data that used to take minutes to "pull out" now comes out in milliseconds, a speedup of tens of thousands of times. That was the end of our first obstacle.
ElasticSearch, by the way, has a similar facility: the scroll API. Its scroll_id parameter lets you "turn pages" through a large result set efficiently, much in the same spirit as paging by indexed ids.
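For reference, here is a minimal sketch of the scroll API with the Python elasticsearch client (7.x-style calls; the index name and query are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Open a scroll context that stays alive for 2 minutes per round trip.
resp = es.search(index="article", scroll="2m", size=1000,
                 body={"query": {"match_all": {}}})

while True:
    hits = resp["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        pass  # ... process each document ...
    # Fetch the next "page" using the scroll_id instead of an offset.
    resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="2m")
```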
ElasticSearch
Storage and sorting
Next I'll show how I optimized storage and built the query DSL dynamically to get the behavior we want.
Before getting into the details, let me briefly explain my understanding of ElasticSearch.
A brief introduction to ElasticSearch
I had heard about ElasticSearch on Zhihu long before I actually used it. What really shocked me was that after importing tens of millions of records, it answered queries in milliseconds, whereas the same query in MySQL took me tens of minutes.
You can think of ElasticSearch as a database that, compared with MySQL, goes all-in on query performance. I wanted to explain in detail how it achieves this, but others have already summarized it very well (see the reference material). In short, the reason it can be so fast comes down to: indexes + memory + caches.
ElasticSearch uses inverted indexes to bring query time down to the log N level, relies heavily on memory to push query speed toward the physical limit, and caches filters so that both complex and simple queries stay fast and stable.
ElasticSearch also supports full-text search over large bodies of text. By default it segments (tokenizes) text automatically and scores matching documents with mature relevance algorithms (TF/IDF, BM25). In that sense ElasticSearch really is a "search engine": a full-text index plus dedicated scoring algorithms.
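To make the "inverted index" idea concrete, here is a toy sketch (purely illustrative, nothing like ElasticSearch's real implementation):

```python
from collections import defaultdict

docs = {
    1: "zhang hua likes fairy tales",
    2: "li si reads fairy tales too",
}

# Inverted index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Lookup is a dictionary access instead of a scan over every document.
print(index["fairy"])   # {1, 2}
print(index["zhang"])   # {1}
```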
ElasticSearch is a bit different from commercial search engines like Google, Baidu and Bing. ElasticSearch works on plain text, so it can only use algorithms like TF/IDF to compute how relevant each document is to a given keyword. Commercial engines take web pages as input, so Google, for example, can additionally use PageRank to re-score document relevance. Of course, modern commercial engines don't rely on PageRank alone; they factor in other signals as well (e.g. Baidu's paid ranking, Google's detection of attempts to manipulate ranking). Essentially, though, ElasticSearch and the commercial engines are doing the same thing: scoring matches. That is what sets them apart from MySQL's full-text search (MySQL doesn't rank by relevance; it only has ORDER BY).
Design ideas
At first I was going to let ElasticSearch index all the people in the articles with its default segmentation, but consider this sentence: "Liu Er can eat two bowls of rice." If we search for "Liu Er", an article that actually mentions a "Liu Erneng" can also be retrieved. Our goal is to return only the most likely results and leave out the impossible ones.
So we can't let ElasticSearch do the word segmentation for us, but we still want its scoring mechanism to push the most likely matches to the front.
Each attribute of a person (name, date of birth, ethnicity, etc.) is therefore stored as a keyword field, so ElasticSearch will not split the field and a query has to match the whole value. Since one article can mention several people, I store them in the document as a list.
By default, however, ElasticSearch flattens a list of objects inside a document.
Let's borrow the example from the official documentation. Suppose we store the following document:
{" group ":" fans ", "user" : [{" first ":" zhang ", "last", "China"}, {" first ":" li ", "last" : "4"}]}Copy the code
Internally, ElasticSearch will convert it to:
{" group ":" fans ", "user. First:"/" zhang ", "li", "user. The last:"/" China ", "four"}Copy the code
So how do we stop each user.first from being pulled apart from its user.last? The answer is simple: declare user as a nested object, so that ElasticSearch does not flatten the list but instead stores each user as its own hidden document (two in this example). (Some people worry this hurts speed; on the contrary, ElasticSearch often uses similar techniques to speed things up. See the blog post below for more details.)
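A minimal mapping sketch of this idea (index and field names are assumptions, shown with the Python client in 7.x style):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "nested" keeps each user object intact; "keyword" disables analysis,
# so a query must match the stored value exactly.
es.indices.create(index="article", body={
    "mappings": {
        "properties": {
            "group": {"type": "keyword"},
            "user": {
                "type": "nested",
                "properties": {
                    "first": {"type": "keyword"},
                    "last":  {"type": "keyword"},
                },
            },
        }
    }
})
```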
Using boost
As mentioned above, we parse several attributes of a person out of each document (name, age, gender, place of residence), but some documents may be missing part of that information. When we query, we also have a list of the attributes we know about the person (name, age, gender, and so on), so we use boost to give extra points to each field that hits. The more attributes a document matches, the higher it ranks; documents with fewer hits simply land further down the list or get filtered out.
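A sketch of what such a query could look like (field names, boost values and the nested path are assumptions, not the actual production DSL):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The name is required; every other known attribute is a "should" clause
# whose boost adds to the score, so documents with more hits float higher.
resp = es.search(index="article", body={
    "query": {
        "nested": {
            "path": "user",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"user.name": {"value": "Zhang Hua"}}}
                    ],
                    "should": [
                        {"term": {"user.gender":    {"value": "male", "boost": 2.0}}},
                        {"term": {"user.birthdate": {"value": "1990-01-01", "boost": 3.0}}},
                    ],
                }
            },
        }
    }
})
print(resp["hits"]["total"])
```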
Conclusion
Facing tens of millions of records for the first time taught me a lot. At the start the goal was just to search out the best-matching data, but in practice, by repeatedly questioning the results, we ended up with a satisfactory product that gradually took shape and stabilized through continuous optimization. What helped me most was writing design documents, feeding results back while the product was taking shape, and slowly iterating toward the best version.