This is the fourth day of my participation in Gwen Challenge

It is important to choose a good data type for the ID column, which is often used for comparison purposes (such as the criteria for a federated query) and for looking up other columns. Also, ids are often used for foreign keys. Therefore, the data type of an ID column relates not only to its own table, but also to other tables associated with it. Therefore, the data type of the ID is very important.

When choosing the data type of id, you need to consider not only the data store type, but also how MySQL calculates and compares this type. For example, MySQL stores ENUM and SET types internally as integers, but compares them as strings in string scenarios. Once you have selected the datatype of the ID, you need to ensure that the datatype of the related datatype that references the ID is the same, and exactly the same, including attributes such as length and whether there are signs or not! If mixing different data types can cause performance problems, and even if there are no performance problems, implicit data conversion during comparison can cause subtle errors. If you forget about different data types during actual development, unexpected problems can crop up.

When choosing the length, you also need to keep the field length as small as possible and leave some room for future growth. For example, if it is used to store provinces, and we only have a few dozen values, TINYINT is better, if the related table also has this ID, then the efficiency difference will be large.

Here are some typical types that apply to ids:

  • Int: Integers are usually the best choice because they are fast to evaluate and compare, and you can also set the AUTO_INCREMENT attribute to increment automatically.
  • ENUM and SET: Enumerations and sets are not usually selected as ids, but are appropriate for columns that contain types such as “type”, “state”, and “gender”. For example, when we need a table to store a drop-down menu, we usually have a value and a name, and we can use an enumeration as the primary key for the value.
  • Strings: Avoid strings as ids whenever possible because they take up more space and are generally slower than integers. When choosing a string as an ID, pay special attention to functions such as MD5, SHA1, and UUID. Each value is a large range of random values, with no order, which results in slower inserts and queries:
    • When inserting, it slows down because indexes are built at random locations (resulting in paging, random disk access, and aggregation of index fragments).
    • When querying, contiguous rows of data may be spread across disk or memory, which can also make it slower.

If you do use the UUID value, either remove the “-” character or use the UNHEX function to convert it to a 16-byte number and store it in BINARY(16). It can then be retrieved in hexadecimal format using the HEX function. There are many ways to generate UUID, some randomly distributed, some ordered, but even ordered is not as good as integer.

A distributed ids

For individual applications, the use of id increment or the use of the program to generate the ID is not a problem, but for distributed applications, this can lead to primary key conflict errors. For distributed ID, there are many algorithms at present, such as the famous Snowflakce algorithm, as well as some distributed ID open source algorithms of domestic big factories, such as:

  • Baidu’s UidGenerator.
  • Meituan’s Leaf.

The basic principle of distributed ids is to use machine code, timestamp and sequence number. Through a certain algorithm to achieve the uniqueness and order of distributed ID, specific interested can see the implementation of the snowflake algorithm.