Short URL length

How long should a short URL be? The total number of web pages on the Internet is currently about 4.5 billion (see www.worldwidewebsize.com), which exceeds $2^{32} = 4294967296$. A 32-bit integer is therefore too small, but a 64-bit integer is sufficient.

How can a 64-bit integer be converted to a string? Encode it in base 62, using digits plus uppercase and lowercase letters. Since $\log_{62}(2^{64}-1) \approx 10.7$, at most 11 characters are needed.

In actual production, it can be shorter. For example, Sina Weibo uses a length of 7, because $62^7 = 3521614606208$, which is far more than the total number of URLs on the Internet and is definitely enough.

Most modern web servers (e.g. Apache, Nginx) treat URL paths as case-sensitive, so it is fine to use both uppercase and lowercase letters to distinguish different URLs.

Therefore, the correct answer is a string of up to 7 characters drawn from a 62-character alphabet: uppercase letters, lowercase letters, and digits.
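As a rough illustration of the base-62 conversion described above, here is a minimal Python sketch. The alphabet order (digits, then lowercase, then uppercase) and the function names are assumptions made for this example; any fixed ordering of the 62 characters works, as long as encoding and decoding agree.

```python
import string

# Assumed alphabet order: digits, lowercase, uppercase (62 characters total).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62


def base62_encode(n: int) -> str:
    """Convert a non-negative integer to its base-62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def base62_decode(s: str) -> int:
    """Convert a base-62 string back to the integer it encodes."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Any ID below $62^7$ encodes to at most 7 characters, while the largest 64-bit value needs 11, which matches the estimates above.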

One-to-one or one-to-many mapping?

Should one long URL map to exactly one short URL, or may it correspond to several short URLs? This is also an important design choice.

Generally speaking, the same long URL should produce different short URLs when it is submitted from different places, by different users, or under other differing circumstances, so that the back-end database can support better data analysis. If a long URL maps to exactly one short URL, the database holds only one row for it, so different sources cannot be told apart and no analysis is possible.

The 7-character short URL serves as the unique ID. Various pieces of information can be attached to this ID, such as the name of the user who generated the URL, the site where it appears, and the User Agent from the HTTP header. Only after collecting this information is it possible to run big data analysis and mine the value of the data. This data is a major source of revenue for short URL providers.

Correct answer: one-to-many

How to calculate the short URL

Now that we have fixed the short URL as a string of length 7, how do we calculate it?

The most obvious approach is to hash the long URL to a 64-bit integer, convert that to base 62, and keep the first 7 characters. But hash functions can collide, and handling collisions is a problem in its own right. This approach only moves the difficulty elsewhere instead of solving it, so we discard it.

MySQL offers auto-increment IDs. Each time a long URL arrives, it is assigned the next number in an ever-increasing sequence. Compared with hashing, this approach has no collisions, so there is nothing to resolve. How do you build a single ID generator? You can do it with a MySQL server (use REPLACE INTO so that the table does not have to store every ID) or a Redis server (use INCR) without writing a line of code. You can also wrap it in a RESTful API; the code is very simple, so I won't go into details.
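The REPLACE INTO trick can be tried locally with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption made only so the example runs without a server; the table name `tickets`, the column `stub`, and the function names are hypothetical.

```python
import sqlite3


def make_ticket_db() -> sqlite3.Connection:
    """Create an in-memory ticket table (SQLite stands in for MySQL here)."""
    conn = sqlite3.connect(":memory:")
    # AUTOINCREMENT makes SQLite remember the largest id ever issued,
    # so ids keep growing even though old rows are deleted by REPLACE.
    conn.execute(
        "CREATE TABLE tickets ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " stub TEXT NOT NULL UNIQUE)"
    )
    return conn


def next_id(conn: sqlite3.Connection) -> int:
    """Issue the next ID while keeping only a single row in the table."""
    with conn:  # commits on success, rolls back on error
        # The UNIQUE constraint on stub means REPLACE deletes the old row
        # and inserts a new one, which receives a fresh, larger id.
        cur = conn.execute("REPLACE INTO tickets (stub) VALUES ('a')")
        return cur.lastrowid
```

Each call to `next_id` returns a larger number than the previous one, and the table never grows beyond one row, which is the point of using REPLACE INTO instead of plain INSERT.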

What are the disadvantages of a single ID generator? It is a Single Point Of Failure (SPOF) and a performance bottleneck (in fact, if your QPS is high enough to crush MySQL, your short URL service is a success and ready to go public :D), so it suits small and medium businesses. Still, we need a better solution for very large companies (and for impressing interviewers), so read on.

Let's look at how to build a distributed ID generator composed of multiple machines.

  1. Use the UUID algorithm or the ObjectID generated by MongoDB, which is a similar UUID-like identifier. Each machine can work independently, so the algorithm is naturally distributed. However, the IDs generated this way are long (a UUID is 128 bits, an ObjectID is 96 bits), far more than a 7-character short URL can hold. So this option does not work.

  2. Multiple MySQL servers. Suppose 8 servers work together: the first starts at 1 and increases by 8 each time, the second starts at 2 and increases by 8 each time, and so on. A round-robin load balancer in front sends each request to one of the 8 MySQL servers, which returns an ID. Flickr uses this scheme with only two MySQL servers. The only disadvantage is that the IDs are sequential, which makes them easy to crawl: a crawler needs almost no code and can simply request one ID after another.

  3. Distributed ID generator. Generate unique IDs in a distributed way. Twitter has a mature open source project that does exactly this, Twitter Snowflake. Snowflake's core algorithm is as follows:

A Snowflake ID is a 64-bit integer. The highest bit is unused and is always 0. It is followed by a 41-bit timestamp in milliseconds, a 10-bit worker (machine) ID, and a 12-bit sequence number. The widths of these three groups can be adjusted to suit specific business requirements. By default, the 41-bit timestamp covers about 69 years from the chosen epoch, the 10-bit worker ID supports up to 1024 machines (IDs 0 to 1023), and the 12-bit sequence number supports 4096 IDs (0 to 4095) per millisecond on each machine.
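The bit layout can be sketched in a few lines of Python. This is a simplified illustration of the idea, not Twitter's actual implementation: the epoch value is an arbitrary assumption, the class and method names are hypothetical, and a backwards clock simply raises an error.

```python
import threading
import time

EPOCH_MS = 1_500_000_000_000  # arbitrary custom epoch, assumed for this sketch

WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER_ID = (1 << WORKER_BITS) - 1    # 1023
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1   # 4095


class SnowflakeGenerator:
    """Snowflake-style IDs: 41-bit timestamp | 10-bit worker | 12-bit sequence."""

    def __init__(self, worker_id: int, epoch_ms: int = EPOCH_MS):
        if not 0 <= worker_id <= MAX_WORKER_ID:
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    @staticmethod
    def _now_ms() -> int:
        return int(time.time() * 1000)

    def next_id(self) -> int:
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the sequence number.
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # All 4096 values used; wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - self.epoch_ms
            return (
                (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

Because the timestamp occupies the high bits, IDs from one worker increase over time, and the worker ID in the middle bits keeps different machines from ever producing the same value.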

Instagram uses a similar scheme: 41 bits for the timestamp, 13 bits for the shard ID (one shard ID corresponds to one PostgreSQL machine), and the lowest 10 bits for an auto-increment ID. This is very close to Snowflake's design, but it runs on a PostgreSQL cluster instead of a dedicated Twitter Snowflake cluster. The advantage is that it reuses existing PostgreSQL infrastructure, which is easy to understand and maintain.

So, the right answer: a distributed ID generator. The Flickr, Twitter Snowflake, and Instagram schemes are all good options.
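For the multi-server MySQL scheme (option 2 in the list above), the start-and-step idea can be simulated without any database. In this sketch each "server" is just a Python counter; the function names are hypothetical, and a strict round-robin order is assumed for the load balancer.

```python
from itertools import count, cycle


def make_servers(n: int):
    """Server k (1-based) yields k, k + n, k + 2n, ... so ranges never overlap."""
    return [count(start=k, step=n) for k in range(1, n + 1)]


def round_robin_ids(n_servers: int):
    """Yield IDs by asking each simulated server in turn."""
    servers = cycle(make_servers(n_servers))
    while True:
        yield next(next(servers))
```

Even if the load balancer favours some servers over others, no two servers can ever return the same ID, because each one only produces numbers with its own remainder modulo the number of servers.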

How to store

How do we store the mapping between short URLs and long URLs? Use the short URL as the primary key and the long URL as the value. This can be stored in a traditional relational database, such as MySQL or PostgreSQL, or in any KV store, such as Redis or LevelDB.
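A minimal sketch of this key-value layout, again using Python's built-in sqlite3 as a stand-in for MySQL or PostgreSQL. The table name `urls`, the column names, and the helper functions are assumptions made for the example.

```python
import sqlite3
from typing import Optional


def open_store() -> sqlite3.Connection:
    """Create an in-memory table keyed by the short code."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls ("
        " short_code TEXT PRIMARY KEY,"
        " long_url TEXT NOT NULL)"
    )
    return conn


def save(conn: sqlite3.Connection, short_code: str, long_url: str) -> None:
    """Store one short code; the primary key rejects duplicate codes."""
    with conn:
        conn.execute(
            "INSERT INTO urls (short_code, long_url) VALUES (?, ?)",
            (short_code, long_url),
        )


def lookup(conn: sqlite3.Connection, short_code: str) -> Optional[str]:
    """Return the long URL for a short code, or None if it is unknown."""
    row = conn.execute(
        "SELECT long_url FROM urls WHERE short_code = ?", (short_code,)
    ).fetchone()
    return row[0] if row else None
```

Since only the short code is unique, several short codes can point at the same long URL, which fits the one-to-many mapping chosen earlier.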

Designing the storage layer by hand is a separate topic: you would have to reinvent a complete KV storage engine. Popular KV storage engines today include LevelDB and RocksDB, and their source code is worth reading.

301 or 302 redirect

This is also an interesting question. It mainly tests your understanding of 301 and 302, as well as of the browser cache mechanism.

301 is a permanent redirect and 302 is a temporary redirect. A short URL does not change once generated, so 301 matches HTTP semantics. However, with 301, search engines such as Google and Baidu display the real address directly in search results, and browsers may cache the redirect, so we can no longer count clicks on the short URL or collect the user's cookies, User Agent, and other information. That information supports a lot of interesting big data analysis and is the main source of profit for short URL service providers.

So, the correct answer is 302 redirect.
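The reason 302 helps with analytics is that every click passes through the short URL server. A minimal sketch of that request handling, written as a plain function rather than tied to any web framework; the function name, the dictionary-based store, and the shape of the click log are all assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple


def handle_redirect(
    short_code: str,
    store: Dict[str, str],
    click_log: List[dict],
    user_agent: Optional[str] = None,
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a short URL request and record the click."""
    long_url = store.get(short_code)
    if long_url is None:
        return 404, {}
    # With a temporary redirect the client comes back on every visit,
    # so each click can be recorded here.
    click_log.append({"code": short_code, "user_agent": user_agent})
    return 302, {"Location": long_url}
```

A real service would plug this logic into its HTTP server and write the click records to durable storage instead of a list.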

You can capture the traffic to see how Sina Weibo handles short URLs. Open Chrome and visit t.cn/RX2VxjI, a short URL that Weibo generated automatically for one of my earlier posts, then inspect the request and look at the response.

You can see that Sina Weibo is using 302 temporary redirection.

Preventing attacks

What if a malicious attacker sends a large number of requests to the TinyURL server in a short period of time and quickly exhausts the IDs?

First, limit the total number of requests per IP address per day. If an IP exceeds the threshold, deny it service.
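A per-IP daily limit can be sketched with an in-memory counter. The class name and the choice to reset all counters when the date changes are assumptions for this example; a production service would more likely keep the counters in a shared store such as Redis.

```python
import datetime
from collections import defaultdict
from typing import Optional


class DailyIpLimiter:
    """Allow at most `limit` requests per IP address per calendar day."""

    def __init__(self, limit: int):
        self.limit = limit
        self._day: Optional[datetime.date] = None
        self._counts = defaultdict(int)

    def allow(self, ip: str, today: Optional[datetime.date] = None) -> bool:
        today = today or datetime.date.today()
        if today != self._day:
            # A new day has started: forget yesterday's counters.
            self._day = today
            self._counts.clear()
        if self._counts[ip] >= self.limit:
            return False
        self._counts[ip] += 1
        return True
```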

Limiting requests per IP is not enough on its own, because attackers typically control huge numbers of compromised machines with a large pool of IP addresses, so per-IP limits are not effective against them.

You can use Redis as a cache server that stores not ID -> long URL but long URL -> ID, keeps only the last day's data, and evicts entries with an LRU policy. Then, if an attacker submits the same long URL many times, the short URL is returned straight from the cache server and no new IDs are used up.
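The cache logic can be illustrated with an in-process stand-in for Redis. The class `RecentUrlCache`, the `shorten` helper, and the `allocate` callback are hypothetical names; the sketch only shows the long URL -> short code lookup with a one-day lifetime and LRU eviction.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class RecentUrlCache:
    """LRU cache of long URL -> short code with a time-to-live per entry."""

    def __init__(self, capacity: int, ttl_seconds: float = 86400.0):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._items = OrderedDict()  # long_url -> (short_code, stored_at)

    def get(self, long_url: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._items.get(long_url)
        if entry is None:
            return None
        short_code, stored_at = entry
        if now - stored_at > self.ttl:
            del self._items[long_url]  # older than one day: drop it
            return None
        self._items.move_to_end(long_url)  # mark as most recently used
        return short_code

    def put(self, long_url: str, short_code: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._items[long_url] = (short_code, now)
        self._items.move_to_end(long_url)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def shorten(long_url: str, cache: RecentUrlCache, allocate: Callable[[str], str]) -> str:
    """Reuse a cached short code if present; otherwise allocate a new one."""
    cached = cache.get(long_url)
    if cached is not None:
        return cached
    short_code = allocate(long_url)
    cache.put(long_url, short_code)
    return short_code
```

With this in front of the ID generator, a flood of requests for the same long URL consumes a single ID instead of one per request.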

References

Reprinted from cn.soulmachine.me/2017-04-10-…
