== A Knowledge Planet reader asked ==
Hello, Mr. Shen, I'd like to ask a question about ID card information retrieval.
The company has a service handling (suppose) 50,000 concurrent queries per second, looking up ID information by the MD5 of the ID card number. There are 100 billion records, stored as plain text. I saw your LevelDB article a few days ago; can this service use LevelDB as its in-memory database for storage? Are there any other optimizations?
_Voiceover: LevelDB, an in-memory KV cache/database._
== Problem description ==
The previous Planet question was about paginated back-office queries over 3.6 billion log records; this one is about MD5 lookups over 100 billion text records. This business needs to solve at least:
(1) the query problem;
(2) the performance problem;
(3) the storage problem;
First, the query problem
Searching and filtering plain text is very inefficient. The first problem to solve is to turn text filtering into a structured query.
Since the retrieval condition is MD5, it can be structured as:
(MD5, data)
This can be KV query, or database index query.
Note that MD5 is usually represented as a string, and a string index performs worse than an integer index. The string MD5 can therefore be converted into two uint64_t values for storage, improving indexing efficiency.
(md5_high, md5_low, data)
The two long integers form a joint index, or a joint key in the KV store.
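As a sketch (the function name and the sample ID number are illustrative, not from the original), splitting the 128-bit MD5 digest into two 64-bit integers looks like this:

```python
import hashlib

def md5_to_uint64_pair(id_number: str) -> tuple:
    """Split the 128-bit MD5 of an ID number into two 64-bit ints.

    (md5_high, md5_low) can then be stored as two BIGINT columns
    forming a joint index, or concatenated into a KV key; integer
    comparison is cheaper than comparing 32-char hex strings.
    """
    digest = hashlib.md5(id_number.encode("utf-8")).digest()  # 16 bytes
    md5_high = int.from_bytes(digest[:8], "big")  # first 8 bytes
    md5_low = int.from_bytes(digest[8:], "big")   # last 8 bytes
    return md5_high, md5_low
```

Both values fit in a uint64_t (or an unsigned BIGINT), and the pair losslessly reconstructs the original digest.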
This service has a strong characteristic: every query is a single-row lookup by primary key. Regardless of data volume, even without a cache, traditional relational database storage can sustain at least 10K such queries per second on a single machine.
_Voiceover: but the data cannot fit on a single machine; more on that later._
Second, the performance problem
At 50K queries per second, throughput is very high. The second thing to solve is performance.
The ID card query business has two strong characteristics:
(1) the queried data is fixed;
(2) there are only query requests, no modification requests;
It is easy to see that a cache fits this scenario extremely well. Better still, the data can be loaded into memory ahead of time, avoiding cache warm-up.
_Voiceover: design according to the characteristics of the business; any architecture design divorced from the business is hooliganism._
If memory is large enough and the data is loaded in advance, the cache hit ratio can reach 100%. Even without preloading, each record can miss the cache at most once: once it is in the cache, it will never be invalidated, because there are no write requests.
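A minimal sketch of that read-only cache behavior (the class and loader names are illustrative, not from the original):

```python
class ReadOnlyCache:
    """Preloaded, read-only cache: there are no writes, so an entry,
    once loaded, is never invalidated."""

    def __init__(self, loader):
        self._store = {}       # in-memory KV: (md5_high, md5_low) -> data
        self._loader = loader  # falls back to backing storage on a miss

    def preload(self, items):
        # Warm the cache ahead of time to avoid cold-start misses.
        self._store.update(items)

    def get(self, key):
        if key not in self._store:
            # At most one miss per key: after the first load it stays
            # cached for good, since no write can evict it.
            self._store[key] = self._loader(key)
        return self._store[key]
```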
But is memory really large enough?
Assuming each ID card record is about 0.5KB, and there are 100 billion records:
100 billion * 0.5KB = 50,000 GB = 50 TB
_Voiceover: is that calculation right?_
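The back-of-the-envelope arithmetic can be checked directly (the 0.5KB record size is the assumption above):

```python
records = 100 * 10**9   # 100 billion ID records
kb_per_record = 0.5     # assumed average record size

total_gb = records * kb_per_record / 10**6  # KB -> GB
total_tb = total_gb / 10**3                 # GB -> TB
print(f"{total_gb:,.0f} GB = {total_tb:.0f} TB")  # 50,000 GB = 50 TB
```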
So unless you are extremely rich, the cache cannot hold all the data, only the hot data.
Is 50K/s of throughput a bottleneck?
There are many ways to scale capacity linearly:
(1) replicate the site and service layers 10+ ways;
(2) horizontally shard the storage (single-row primary-key queries) 10+ ways;
As you can see, 50K concurrent queries is not a problem.
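A sketch of the routing for (2), assuming a 16-way split (the shard count is arbitrary): because MD5 output is uniformly distributed, a simple modulo on md5_high spreads both data and query load evenly.

```python
NUM_SHARDS = 16  # assumption: a 16-way horizontal split

def shard_of(md5_high: int, num_shards: int = NUM_SHARDS) -> int:
    """Route a primary-key lookup to its shard.

    MD5 is uniformly distributed, so md5_high % num_shards gives an
    even split with no hot shard; adding replicas within each shard
    then scales read throughput further.
    """
    return md5_high % num_shards
```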
Third, the storage problem
As analyzed above, 100 billion ID records amount to 50 TB of data. With that volume, a traditional relational database or a single-machine LevelDB in-memory database is not a good fit: manual horizontal sharding would produce far too many instances to maintain.
Instead, use a storage technology suited to massive data volumes, such as HBase.
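For example (a hypothetical row-key design, not from the original), an HBase table could key each record by the 16 raw MD5 bytes; since MD5 is uniform, rows spread evenly across regions:

```python
def hbase_row_key(md5_high: int, md5_low: int) -> bytes:
    """Hypothetical HBase row key: the 16 raw MD5 digest bytes.

    Uniformly distributed keys spread rows evenly across
    pre-split regions, avoiding read/write hot spots.
    """
    return md5_high.to_bytes(8, "big") + md5_low.to_bytes(8, "big")
```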
Finally, the suggestions drawn from this example:
(1) do not retrieve by text; the data must be structured;
(2) for single-row, read-only queries, cache + redundancy + horizontal sharding can greatly improve throughput;
(3) use technologies suited to massive data storage;
My experience is limited; more and better solutions are welcome.
Thinking is more important than conclusion.
Homework:
With 100 billion records, different ID numbers may produce duplicate MD5 values. What should be done?