First, the preface
You may or may not have heard of short URLs. A short URL turns a long URL into a much shorter one, so that visiting the short URL takes the user to the original page. They are used all over the place, such as SMS messages, shared links, and so on… Below I will talk about how to design a high-performance short URL service. Are you ready 🐒
Second, the body
How does a short link get you to the original URL?
Assume the short URL is mtw.so/6kk03s, which we'll call A, and the corresponding long link is tech.meituan.com/2021/10/20/… which we'll call B (this pair was generated by an online URL conversion tool, 😂). When we visit A, the request goes through DNS resolution for mtw.so and ends up hitting one of the HTTP interfaces under that domain (I won't go into the resolution details here). The interface takes the parameter 6kk03s from the URL, the back end uses it to locate the long link, i.e. the original link, and finally redirects to it (301 or 302 as needed), and that's it.
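To make the flow concrete, here is a minimal sketch of the redirect endpoint, assuming a Spring MVC setup; the ShortUrlService interface and its findLongUrl method are hypothetical placeholders, not something from the original article.

```java
import java.net.URI;

import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class RedirectController {

    // hypothetical lookup service: short code -> original long URL
    interface ShortUrlService {
        String findLongUrl(String code);
    }

    private final ShortUrlService shortUrlService;

    public RedirectController(ShortUrlService shortUrlService) {
        this.shortUrlService = shortUrlService;
    }

    @GetMapping("/{code}")
    public ResponseEntity<Void> redirect(@PathVariable String code) {
        String longUrl = shortUrlService.findLongUrl(code); // e.g. "6kk03s"
        if (longUrl == null) {
            return ResponseEntity.notFound().build();
        }
        // 302 keeps click statistics on our side; a 301 lets browsers cache the redirect
        return ResponseEntity.status(HttpStatus.FOUND)
                .location(URI.create(longUrl))
                .build();
    }
}
```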
How to turn a long link into a short link 🤨
Wrong way ❌
The first thing you'll think of is encryption: run the long URL through some algorithm that compresses its length, let users access the encrypted URL, and decrypt it on the back end. Ohhhh!! I am a genius, this is too easy! (If such an algorithm existed you would have worked a miracle: a handful of characters cannot losslessly encode every possible long URL and be decoded back…) So what should we do? 😱 If we store the mapping between the long link and the short link, then when a user visits the short link we just read the long link out of the database. Doesn't that solve it perfectly?
How to locate a long link from a short link
So the question is, how do you design this storage? How do you locate a long link from a short link? With an ID. If we had a globally unique ID that never repeats, wouldn't that solve the problem? When a long link applies to become a short link, we generate an ID for it and save the mapping in the database. Suppose id=996: the ID 996 now corresponds to our long link, and we hand 996 straight to the user. Suppose the generated short link is baidu.com/996; the rest of the process is simple:
- The user visits baidu.com/996
- The server receives the request and extracts the parameter 996
- The backend server uses the ID 996 to query the database for the long link (see the sketch after this list)
- Redirect to the long link
- The user sees the original long-link page
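A minimal sketch of that database lookup, assuming a table short_url(id BIGINT PRIMARY KEY, long_url VARCHAR); the schema and class name are made up for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ShortUrlDao {

    // returns the long link for an ID such as 996, or null if it does not exist
    public String findLongUrl(Connection conn, long id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT long_url FROM short_url WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("long_url") : null;
            }
        }
    }
}
```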
Perfect solution!! Do you think that's settled? If it were really that simple, what would be the point of this article? ❓
Question 1: How do we generate the global ID?
Question 2: What if the ID is still too long?
Question 3: What if this short URL service generates hundreds of thousands, millions, or tens of millions of short URLs per day? How do we store them?
Question 4: What if there is a huge number of requests? Can the service hold up?
(The interviewer's one-two punch)
Don't panic, let's take them one by one 😴 (the solutions below assume a distributed environment)
How is the global ID generated
There are several common ways to generate a global ID. For high-concurrency scenarios we usually also introduce an ID generator, which I'll explain later.
Redis INCR ⭐️⭐️⭐️⭐️
Relies entirely on Redis. Advantages: simple and brainless. Disadvantages: possible data loss; even with RDB and AOF, Redis cannot guarantee that nothing is lost.
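A minimal sketch of Redis-based ID generation, assuming the Jedis client; the key name short_url:next_id is made up for this example.

```java
import redis.clients.jedis.Jedis;

public class RedisIdGenerator {

    private final Jedis jedis;

    public RedisIdGenerator(Jedis jedis) {
        this.jedis = jedis;
    }

    public long nextId() {
        // INCR is atomic on the Redis server, so concurrent callers get distinct values
        return jedis.incr("short_url:next_id");
    }
}
```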
Database auto-increment ID ⭐️⭐️
Advantages: simple and brainless. Disadvantages: you need concurrency control, e.g. SELECT ... FOR UPDATE at the SQL level or a ReentrantLock at the code level, to ensure thread safety; also note that the ID has an upper limit (depending on the column's data type), and once the limit is reached no new IDs can be generated.
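As a rough illustration of the SELECT ... FOR UPDATE approach, here is a sketch of a database-backed allocator using a single-row counter table seq(next_id BIGINT); the table and column names are assumptions, not from the article.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DbIdGenerator {

    public long nextId(Connection conn) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try {
            long next;
            // the row lock taken by FOR UPDATE serializes concurrent allocators
            try (PreparedStatement select = conn.prepareStatement(
                    "SELECT next_id FROM seq FOR UPDATE");
                 ResultSet rs = select.executeQuery()) {
                rs.next();
                next = rs.getLong(1);
            }
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE seq SET next_id = next_id + 1")) {
                update.executeUpdate();
            }
            conn.commit();
            return next;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```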
UUID (not recommended) ⭐️
Advantages: simple and brainless. Disadvantages: the generated IDs are not sequential and are strings mixing digits and letters; when such IDs are used as MySQL primary keys, inserts cause page splits and hurt MySQL storage performance.
Snowflake ⭐️⭐️⭐️⭐️⭐️
Advantages: each instance can generate unique IDs on its own, with no external dependency. Disadvantages: it is time-based, so a clock rollback can cause duplicate IDs (who would change the server time??)
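Here is a simplified Snowflake-style sketch (41-bit timestamp, 10-bit worker ID, 12-bit sequence); the epoch and bit layout are illustrative choices, not prescribed by the article.

```java
public class Snowflake {
    private static final long EPOCH = 1609459200000L; // 2021-01-01, arbitrary
    private static final long WORKER_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_SEQUENCE = ~(-1L << SEQUENCE_BITS); // 4095

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public Snowflake(long workerId) {
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // clock moved backwards: refuse to generate rather than risk duplicates
            throw new IllegalStateException("Clock moved backwards");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // sequence exhausted for this millisecond: wait for the next one
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```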
What if the generated ID is too long?
What if our ID is too long, say 8573834749584939, which does not meet the short-link requirement? Base conversion: we can encode our IDs in base 62 (digits + lowercase + uppercase); a small encoder sketch follows the table below.
| Decimal | Base 62 |
| --- | --- |
| 996 | g4 |
| 1024102410241024 | 4GNTCX7B6 |
| 996996996996996996996 | j9TiP3ZLxcIA |
Isn't that a lot shorter? (So short, yet so capable 🤔)
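A minimal base-62 encode/decode sketch, using the alphabet 0-9, a-z, A-Z (the ordering that maps 996 to g4 above); a different alphabet order would produce different strings.

```java
public class Base62 {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    public static String encode(long id) {
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString();
    }

    public static long decode(String code) {
        long id = 0;
        for (char c : code.toCharArray()) {
            id = id * 62 + ALPHABET.indexOf(c);
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(encode(996L)); // prints g4
    }
}
```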
How to store a large amount of data
A large amount of data? How large is large? Hundreds of thousands? Millions? Tens of millions?!!! According to the "Alibaba Java Development Manual", table sharding is recommended only when a single table exceeds 5 million rows or its data exceeds 2 GB. How do we choose a sharding strategy? Here are two common schemes:
- Shard by date: split tables by month/quarter/year
- Shard into a fixed number of tables
Sharding by date
Applicable scenario: data grows quickly every day, e.g. hundreds of thousands or millions of new rows per day. Implementation: for example, today is 2021-10-28; if we shard by month, all data generated in October goes into t_id_test_202110. Advantages: MySQL events can create the new tables automatically, and the size of each table stays under control. Disadvantages: because the data is split by time, statistics across tables are inconvenient.
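A tiny sketch of the month-based routing, reusing the t_id_test_ prefix from the example above:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class MonthlyTableRouter {
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyyMM");

    public static String tableFor(LocalDate date) {
        // 2021-10-28 -> t_id_test_202110
        return "t_id_test_" + date.format(MONTH);
    }
}
```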
Sharding into a fixed number of tables
Sharding into a fixed number of tables means estimating future data growth yourself and splitting the data into a fixed number of tables, such as 8, 16, 32, 64, or 128 (a power of 2 is recommended). Applicable scenario: the data volume can be estimated in advance and you do not want new tables created continuously. Implementation: create the tables up front, e.g. t_id_test_0 through t_id_test_7 for 8 tables. When an ID is generated (say 996996), take it modulo the total number of tables to pick the table: 996996 % 8 = 4, so the row lives in t_id_test_4 and queries go to that table (see the sketch below). Disadvantages: expanding the number of tables is complicated; you need to suspend the service and re-hash the existing data.
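A minimal sketch of the modulo routing, assuming 8 tables and the t_id_test_ prefix from the example above:

```java
public class ModuloTableRouter {
    private static final int TABLE_COUNT = 8; // a power of two makes future splits easier

    public static String tableFor(long id) {
        // 996996 % 8 = 4 -> t_id_test_4
        return "t_id_test_" + (id % TABLE_COUNT);
    }
}
```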
How to withstand high concurrency (10,000+ QPS)
How do we deal with high concurrency? Let’s first summarize the short link process
- Generate a short link: a. obtain a global ID; b. save the long link and ID to the database; c. convert the ID to base 62; d. concatenate it into a short link
- Access a short link: a. the back end receives the request; b. parses the request parameter; c. converts the parameter back to base 10; d. fetches the long link for that ID from the database; e. redirects
Let's think for a moment: which of these steps matter in a high-concurrency scenario? 🤔🤔 Look at the generation flow first. Steps 1.b, 1.c and 1.d need no special treatment: no matter how high the concurrency, they are just calculation plus a single DB write. Step 1.a is the problem: an I/O bottleneck. If you use Redis or MySQL auto-increment as the global ID, every request costs an extra round of I/O, so how do we fix that? In the access flow, steps 2.a, 2.b, 2.c and 2.e have no high-concurrency issue, so we focus on 2.d. We know that in a high-concurrency environment, I/O hurts performance the most; if a large number of requests hits MySQL directly and the MySQL instance is not powerful, it can easily be brought down. How do we solve that? With caching.
Global ID generator
The so-called ID generator works like this: one component is responsible for generating IDs (the I/O part), and the workers are responsible for using them (from memory): single responsibility, less I/O! Implementation: a dedicated thread continuously generates IDs, guaranteeing uniqueness and order, and puts them into a queue (FIFO, which preserves the order). A worker thread that needs an ID simply takes one from the head of the queue; taking an ID is no longer an I/O operation but a direct memory read, so speed improves greatly. Of course, we can also set a rule, such as only generating new IDs when the queue length falls below a certain threshold.
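A minimal sketch of such a generator, assuming a BlockingQueue buffer with any of the ID sources above plugged in as a Supplier; the capacity is an arbitrary choice.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

public class BufferedIdGenerator {
    private final BlockingQueue<Long> queue = new ArrayBlockingQueue<>(10_000);

    public BufferedIdGenerator(Supplier<Long> idSource) {
        // producer thread does the I/O (Redis/DB/Snowflake) ahead of time
        Thread producer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    queue.put(idSource.get()); // blocks when the buffer is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "id-producer");
        producer.setDaemon(true);
        producer.start();
    }

    public long nextId() throws InterruptedException {
        return queue.take(); // pure memory read for callers; FIFO keeps IDs ordered
    }
}
```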
Caching
We need caching in two places: first, data filtering; second, data caching.

Data filtering: why filter? Because we want to reject invalid requests first and only let valid ones continue to the later steps. A common choice is a BloomFilter. When a short link is generated, put its ID into the BloomFilter; every subsequent request must pass this first-layer filter: if the filter says it may exist, continue; if it says it does not exist, reject. (A BloomFilter has a false-positive rate: "may exist" can be wrong, but "does not exist" is guaranteed correct.)

Data caching: when a short link is generated, we can write it into Redis first; reads go to Redis first and only fall back to MySQL on a miss, then write the result back to Redis, which prevents a flood of requests from hitting MySQL directly and crushing it. One thing to consider: the number of short links is huge, so we must be selective about what we cache. Hot data should stay cached for a long time, until it goes cold; cold data may not need caching at all. If concurrency is still high and response time is still slow, we can add a local cache, i.e. a JVM-level cache: keep link information temporarily in an in-process map, use an LRU policy so it won't cause OOM, read the local cache first, then Redis, and finally fall back to MySQL.
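A rough sketch of that read path (BloomFilter, then Redis, then MySQL), assuming Guava's BloomFilter and the Jedis client; loadFromMysql, the key prefix, and the TTL are hypothetical placeholders.

```java
import java.nio.charset.StandardCharsets;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import redis.clients.jedis.Jedis;

public class ShortUrlReader {
    private final BloomFilter<String> filter =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);
    private final Jedis jedis;

    public ShortUrlReader(Jedis jedis) {
        this.jedis = jedis;
    }

    // called when a short link is created
    public void register(String code, String longUrl) {
        filter.put(code);
        jedis.setex("url:" + code, 3600, longUrl);
    }

    public String resolve(String code) {
        // "definitely not present" is reliable; "might be present" can be a false positive
        if (!filter.mightContain(code)) {
            return null;
        }
        String cached = jedis.get("url:" + code);
        if (cached != null) {
            return cached;
        }
        String longUrl = loadFromMysql(code); // hypothetical database lookup
        if (longUrl != null) {
            jedis.setex("url:" + code, 3600, longUrl);
        }
        return longUrl;
    }

    private String loadFromMysql(String code) {
        return null; // placeholder for the real DAO call
    }
}
```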
Third, summary
This article mainly discussed how to design a high-performance short-link service; if there are mistakes, I hope you will help me correct them. This is a newbie's first article, so feel free to follow me, there will be plenty more good stuff ~ ~ ~ from a Java programmer drifting in Shenzhen and working hard to get stronger