In some cases where there is a word limit (such as the 140-character limit on Twitter and the split of long and short messages), filling in a long link will take up the available text length, and you need to convert the long link to a short one.

I. Functional analysis

A short link service only needs to expose two functions externally

  1. Short link generation: submit the url and return the generated short link
  2. Access redirection: Access the generated short link to redirect to the original url

In addition, because the short link generation needs to be available for a long time, you need to use database persistent storage.

The two features to be implemented are analyzed in more detail below

Function 1: Access short chain redirection

This feature is relatively simple, just need to be able to query according to the short link to the long link, can be many-to-one or one-to-one relationship.

That is, given a short chain, only one long chain is determined, and the short chain cannot be repeated


Function 2: Long chains produce short chains

Long chains generate short chains in two ways: by a number generator and by using a hash function

No. 1. The hair

The originator, as its name suggests, assigns a number to each long link to be added as a short link.

The simplest way to do this is to use the database’s increment field, converted to base 62 for the short chain. This is really quite simple to implement, but there are some problems:

  1. Data redundancy

For each newly added long link, a new number is assigned. If repeated long links occur, multiple different short links will be generated, resulting in data redundancy.

This seems to be an easy problem to solve: when adding, determine if the long link already exists, return the existing link if it does, or let the sender assign a new one. But the actual implementation is much trickier.

First, to determine whether the long link already exists, you need to check the long link first, which is necessary to index the column of the long link. Long links, since they are user input, can be quite long.

As described in the MySQL documentation, the maximum index length in different row formats is 767 or 3072 bytes, which may not accommodate long links. Even if, to say the least, we artificially limit the allowed link length to below the maximum index length, too long indexed columns can lead to inefficient searches.

In this case, it is not feasible to build indexes directly on columns of long links. Is there any other way to check the existence of long links?

You can create a separate column to store the hash value of the long link and index it. The query encounters the same hash value and then compares the value of the long chain to confirm whether it is duplicate. If no record is repeated, a new record is added.

  1. Increased security risks

It is possible for an increment to be seen as increasing even if it is converted to base 62. The increment of this time period can be calculated by generating short chain periodically, and then the usage of the whole system can be counted

A feasible optimization approach is to use a trend – increasing and non-continuous transmitter. Refer to Leaf — Meituan-Dianping distributed ID generation system

But the sender itself is more difficult to implement than the short link system.


2. Use hash functions

Long chains generating short chains actually does the same thing as a hash table, which is to map an infinite number of possible strings (long links) to a finite number of character combinations (short links).

So a simple way to do this is to use a hash(longUrl) as a short chain.

However, it is also inevitable to encounter the hash conflict inherent in hash functions, that is, multiple long links may generate the same short chain, which is contrary to our requirement that a short chain only jumps to a fixed long chain.

How should the problem of the Hashish conflict be resolved? Let’s think about where else it shows up, yes, hash tables, which is a very typical hash conflict scenario.

Let’s see if we can take a page from the hash table implementation to resolve hash conflicts:

In the implementation of hash table, there are usually two ways to solve hash conflict: linked list method and open address method

  1. The linked list method

Let’s look at the way linked lists resolve conflicts.

In a hash table, multiple keys hash to the same location in the array. Attaching the keys to that location in the array resolves hash collisions by using a linked list.

Let’s see if linked lists work in short linked systems.

As an analogy, in a short link system, multiple longurls hash to the same short chain, and then store multiple long chains on the same row in the database…

Wait, is there something wrong? If a short link corresponds to multiple long links, how do I know to redirect to that long link?

Summary: Linked list method PASS

  1. Open address method

For an open-address hash table, if you run into a hash collision, the hole in the array is already occupied. It uses a rule (linear probe, square probe, etc.) to explore the next pit, and if it is occupied, it continues to explore until it finds an empty “pit” and fills it.

Similarly, corresponding to the short link system, encounter hash conflict, you can generate a new short link value according to certain rules, again try to insert the database. That seems to work.

Looking at the performance of the open address method, why do hash tables in reality use linked list method rather than open address method?

This is often because the hash table capacity is limited, when the load factor rises to a certain extent, the open address method will appear a large number of “pits” occupied, and then repeated detection phenomenon, bringing high overhead.

In the case of a short-link system, if the system is considered a hash table, the capacity is astronomical (depending on the length of the short chain), the load factor is not very high, and occasional re-probes are perfectly acceptable.

Conclusion: Open address method is feasible


Second, the design

After comprehensive consideration of implementation complexity, security, data redundancy and other factors, I choose to use hash function to implement the short link system.

API design

  1. Generating short links

    • path:/api/create
    • method:POST
    • Request parameters
    {
        "longUrl":"https://github.com"
    }
    Copy the code
    • The response
    {
        "code":0."data": {"url":"http://127.0.0.1:7001/Sf5oy"}}Copy the code
  2. Access short link redirection

    • path:/ Short chain content
    • method:GET
    • Response: 302 redirect

Data table design

Here, only basic functions are realized. Only one table is needed to save the mapping between short links and long links. The structure is as follows

See ${approot}/app/model/url.js for the model definition

CREATE TABLE `url` (
  `shortUrl` varchar(255) NOT NULL COMMENT 'short chain',
  `longUrl` varchar(2100) NOT NULL COMMENT 'Original link',
  `visit` int NOT NULL DEFAULT '0' COMMENT 'Number of visits',
  `createdAt` datetime DEFAULT NULL COMMENT 'Creation time',
  `updatedAt` datetime DEFAULT NULL COMMENT 'Update Time'.PRIMARY KEY (`shortUrl`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='URL mapping table';
Copy the code
  • Why not use autoincrement primary key?

Given that the short chain must be unique, and that it is also a query condition, a unique index needs to be built on that column.

Further consideration, the short link system is a typical scenario of more read and less write, you can directly use the short link as the primary key, to avoid it as a secondary index also need to return to the table to look up the long link.

Todo :(primary key index leaf nodes hold more information, is it more likely to split pages than secondary index? If you don’t know, check again.

Three, code implementation

The full implementation code is here: github.com/hhgfy/demos…

Access short chain redirection

This function is very simple. The service queries the long link corresponding to the short link and then redirects it in the Controller

  //${approot}/app/service/url.js
  /** * select * from short chain@param {*} shortUrl
   * @return {string|null}* /
  async find(shortUrl) {
    const { ctx } = this;
    const res = await ctx.model.Url.findOne({
      attributes: [ 'shortUrl'.'longUrl'].where: {
        shortUrl,
      },
    });
    return res ? res.longUrl : null;
  }
Copy the code

Long chains make short chains

This feature is a little more complicated, using _gen(longUrl, shortUrl) to generate short links

When the unique key conflict occurs in the short chain, it judges whether the long chain is the same or not, and then it directly returns the existing short chain.

Otherwise, the hash algorithm is reused to generate a new short chain value, and _gen() is recursively executed. Since the probability of hash conflict is very low, multi-level recursion is unlikely to occur.

PS: Handling logic with exception of unique key conflict is not very well implemented

  // ${approot}/app/service/url.js
  /** * generates *@param {*} longUrl
   * @return {string} The resulting short chain */
  async create(longUrl) {
    const shortUrl = this._hash(longUrl);
    return await this._gen(longUrl, shortUrl);
  }

  /** * Insert data *@param {*} longUrl
   * @param {*} shortUrl* /
  async _gen(longUrl, shortUrl) {
    const { ctx } = this;
    try {
      await ctx.model.Url.create({
        shortUrl,
        longUrl,
      });
      ctx.logger.info('Generates short link \n${longUrl}\n${shortUrl}`);
    } catch (error) { // TODO: no exceptions for logical processing
      if (error.name === 'SequelizeUniqueConstraintError') {
        const item = await ctx.model.Url.findOne({
          attributes: [ 'shortUrl'].where: {
            shortUrl,
            longUrl,
          },
        });
        if(! item) {// A hash conflict occurred
          this.logger.warn('the hash conflict');
          // Assign a new short chain, similar to open address linear probe
          const newShortUrl = this._hash(shortUrl);
          return this._gen(longUrl, newShortUrl);
        }
        // The long chain has generated too short chain, returns the existing shortUrl
        ctx.logger.info('Repeatedly generate short link \n${longUrl}\n${shortUrl}`);
      } else {
        ctx.logger.error(error);
        ctx.throw('Generate exception', error); }}return shortUrl;
  }
Copy the code

The effect

Extend and optimize

In addition to the above basic functions, you can also do some extension functions

  • Clean up unused urls by last access time (short links are only valid for a certain period of time)
  • Access statistics
  • Interface the brush

reference

  • The principle and realization of short URL system
  • How to design high performance short link system?