Today’s motto: Go back to where you started.
This article explains MySQL primary keys, starting from why we need them at all, then covering the related internals, and finally expanding to primary key generation schemes, so that when asked about MySQL you are no longer limited to CRUD.
Why do we need primary keys
- Data records must be unique (first normal form)
- Data needs to be associated via joins
- The database's underlying index uses the key to retrieve the needed data
What follows is background musing; feel free to skip to the next section.
“Information is what eliminates random uncertainty” (Shannon). By obtaining and identifying information, people distinguish different things, and thereby understand and transform the world. Data is a record of the attributes of objective things and a concrete form of information: data becomes information after processing, and information must be digitized into data before it can be stored and transmitted. Databases store data records, and such records are (relatively) deterministic information whose determinism rests on uniqueness. Hence the first reason:
1. Data records must be unique
The world is made up of objective things and the relations between them, and data digitizes and models both. Beyond its descriptive value, data derives value from its associations, and accurate association requires that each record be identifiable from the outside. So in data storage, the second role of the primary key is:
2. Data needs to be associated
Data describes objective reality and is meaningless on its own; it becomes meaningful only once it is organized and retrieved according to our needs. So data must be retrievable, which gives the primary key its third role:
3. The underlying index of the database is used to retrieve data
Why primary keys should not be too long
The crux of this question is length: what advantage does short have over long? Short takes up less space. But that bit of disk space is tiny compared to the data itself, and we rarely read the primary key column alone, so the real reason must be speed, and it has little to do with the raw data. That points to indexes, and to index reads in particular. So why does a long primary key hurt index performance?
Above is InnoDB's index data structure. On the left is the clustered index, which locates data records by primary key. On the right is a secondary index, which maps column values to primary keys. A query through a secondary index proceeds as shown in the figure: first the primary key is found in the secondary index tree, then the data row is looked up by that primary key in the clustered index. The leaf nodes of the secondary index store the primary key value itself, not a pointer to it. So if the primary key is too long, each secondary index page holds fewer entries, and within a limited index buffer more reads must go to disk, reducing performance.
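As a rough back-of-the-envelope sketch (assuming InnoDB's default 16 KB page and ignoring page headers and per-record overhead), we can estimate how many secondary-index entries fit on one page for different primary key sizes:

```python
PAGE_SIZE = 16 * 1024  # InnoDB's default page size, in bytes

def entries_per_page(col_len: int, pk_len: int) -> int:
    """Rough fanout estimate: each secondary-index leaf entry stores
    the indexed column value plus a full copy of the primary key."""
    return PAGE_SIZE // (col_len + pk_len)

# 8-byte BIGINT primary key vs. a 64-byte business key,
# for a secondary index on a 16-byte column:
print(entries_per_page(16, 8))   # 682 entries per page
print(entries_per_page(16, 64))  # 204 entries per page
```

With the long key, each page holds roughly a third as many entries, so the tree is deeper and the buffer pool covers a smaller fraction of the index.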
Why auto-increment IDs are recommended
InnoDB uses a clustered index: as shown in the figure above, the data records themselves are stored in the leaf nodes of the primary index (a B+Tree). This requires that records within a leaf node (the size of a memory page or disk page) be stored in primary key order, so on every insert MySQL places the new record into the appropriate node and position according to its primary key. If a page reaches its load factor (InnoDB's default is 15/16), a new page (node) is allocated.
If the table uses an auto-increment primary key, each new record is appended after the current last index entry, and when a page fills up a new page is allocated automatically. The result is a compact, nearly sequentially filled index structure; since existing data never has to move on insert, inserts are efficient and index maintenance is cheap, as shown on the left of the figure below. Otherwise, because each new primary key value is effectively random, every insert lands somewhere in the middle of an existing index page, and MySQL must move data to fit the new record into place, as shown on the right of the figure below. This adds overhead: MySQL may have to flush buffers frequently to maintain the index, increasing disk I/O, and must frequently reorganize the index structure.
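A toy illustration of this effect (a sorted Python list standing in for an index page, not InnoDB's actual page management): appending ordered keys never shifts existing entries, while random keys usually land mid-list and force shifts.

```python
import bisect
import random

random.seed(42)  # make the "random" workload reproducible

def shifts_for_inserts(keys):
    """Count how many existing entries must move to keep the list sorted,
    a stand-in for the record movement needed inside index pages."""
    page, moved = [], 0
    for k in keys:
        pos = bisect.bisect(page, k)
        moved += len(page) - pos  # entries shifted right to make room
        page.insert(pos, k)
    return moved

sequential = list(range(1000))
scattered = random.sample(range(1_000_000), 1000)

print(shifts_for_inserts(sequential))  # 0: every insert is an append
print(shifts_for_inserts(scattered))   # hundreds of thousands of shifts
```

The sequential workload does no shifting at all, which is exactly why auto-increment keys keep clustered-index maintenance cheap.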
Business keys vs. logical keys
A business key is an ID with business meaning used as the key, for example an order serial number used as the primary key of an order table. A logical key is unrelated to the business and is generated by some rule, for example an auto-increment key.
Advantages of business keys
- The key has business meaning and can be used directly as a search condition in queries
- No extra column or index space is required
- Some join operations can be avoided
Disadvantages of business keys
- When the business changes, the primary key sometimes has to change too
- Composite (multi-column) keys are awkward to work with
- Business keys tend to be long, occupying more space and causing more disk I/O
- Data cannot be persisted until the key is determined; sometimes we want to insert a record first and fill in the business key later
- It is hard to design a key generation scheme that is both easy to use and performant
Advantages of logical keys
- The key never has to change because of business changes
- The operation is simple and easy to manage
- Logical keys tend to be smaller and have better performance
- Logical keys are easier to ensure uniqueness
- Easier to optimize
Disadvantages of logical keys
- The primary key column and its index require additional disk space
- Inserts and updates require extra I/O
- More joins may be needed
- Without a uniqueness constraint in the generation policy, duplicate keys can occur
- Keys differ between the test and production environments, which hinders troubleshooting
- The key value is unrelated to the data, which does not conform to the three normal forms
- It cannot be used as a search keyword
- It depends on the specific implementation of the database system, which makes swapping out the underlying database harder
Primary key generation
In general we use MySQL's auto-increment ID as the table's primary key, which is simple and performs best. But once data is sharded across databases and tables, auto-increment IDs no longer suffice. Let's look at how different databases generate IDs, and at some distributed ID generation schemes; this helps us reason about, and even implement, our own distributed ID generation service.
Database implementations
MySQL
MySQL maintains the auto-increment counter in memory. Each time InnoDB accesses the counter it takes a lock called the AUTO-INC lock, held until the end of the statement. The AUTO-INC lock is a special table-level lock designed to keep inserts into tables with AUTO_INCREMENT columns both correct and concurrent.
In a distributed setting, a standalone service backed by its own database can generate IDs, still relying on MySQL's table auto-increment capability to hand out IDs uniformly to third-party services. For performance, different businesses can use different tables.
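A minimal sketch of such an ID service, using in-memory SQLite in place of a dedicated MySQL instance so the example is runnable (the table name `ticket_order` is illustrative; Flickr's well-known ticket-server scheme works similarly on MySQL with `REPLACE INTO` plus `LAST_INSERT_ID()`):

```python
import sqlite3

# In-memory SQLite stands in for the dedicated MySQL database;
# one ticket table per business keeps hot tables apart.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ticket_order "
    "(id INTEGER PRIMARY KEY AUTOINCREMENT, stub TEXT)"
)

def next_id(conn, table="ticket_order"):
    """Hand out one new ID by inserting a row and reading back its rowid."""
    cur = conn.execute(f"INSERT INTO {table} (stub) VALUES ('a')")
    conn.commit()
    return cur.lastrowid

print(next_id(conn))  # 1
print(next_id(conn))  # 2
```

Callers never see the table; they just ask the service for the next ID, and the database's auto-increment guarantees uniqueness.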
MongoDB ObjectId
To prevent primary key conflicts, MongoDB designed the ObjectId as its primary key. It is a 12-byte value, usually displayed as 24 hexadecimal characters, made up of the following parts:
- Time: a 4-byte timestamp with second-level precision.
- Machine: a 3-byte machine identifier, a hash that is the same on one host and different across hosts, ensuring no conflicts in a distributed deployment.
- PID: a 2-byte process ID. Machine keeps ObjectIds generated on different machines from conflicting; PID keeps ObjectIds generated by different mongod processes on the same machine from conflicting.
- INC: a 3-byte auto-increment counter. The first nine bytes guarantee that ObjectIds generated in the same second by different processes on different machines do not conflict; the counter guarantees that ObjectIds generated in the same second by the same process do not conflict, allowing 256³ = 16,777,216 unique records per process per second.
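The classic layout above can be sketched in a few lines of Python (an illustration of the byte layout, not MongoDB's driver code; newer drivers replace the machine/PID bytes with random values):

```python
import hashlib
import os
import socket
import time
from itertools import count

# Module-level counter stands in for the per-process auto-increment,
# seeded randomly as the real implementation does.
_counter = count(int.from_bytes(os.urandom(3), "big"))

def object_id() -> str:
    """Build a 12-byte ObjectId-style value: 4-byte time + 3-byte machine
    hash + 2-byte PID + 3-byte counter, rendered as 24 hex characters."""
    ts = int(time.time()).to_bytes(4, "big")
    machine = hashlib.md5(socket.gethostname().encode()).digest()[:3]
    pid = (os.getpid() % 0xFFFF).to_bytes(2, "big")
    inc = (next(_counter) % 0x1000000).to_bytes(3, "big")
    return (ts + machine + pid + inc).hex()

oid = object_id()
print(len(oid))  # 24
```

Because the timestamp comes first, ObjectIds sort roughly by creation time, which also keeps clustered-index inserts close to sequential.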
Cassandra TimeUUID
Cassandra generates unique TimeUUIDs (version 1 UUIDs) from: timestamp + MAC address + sequence.
Distributed ID generation schemes
- ZooKeeper: implemented with ZooKeeper's sequential-node mechanism.
- Redis INCR: implemented with Redis's atomic increment.
- UUID: use a UUID string as the key.
- Snowflake algorithm: similar in spirit to MongoDB's ObjectId: 1 sign bit + 41-bit timestamp (millisecond level) + 10-bit machine ID + 12-bit sequence within the millisecond.
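The Snowflake bit layout described above can be sketched as follows (a minimal single-process illustration; the custom epoch is Twitter's published value, and clock-rollback handling is omitted):

```python
import threading
import time

class Snowflake:
    """Minimal Snowflake sketch: 41-bit millisecond timestamp,
    10-bit machine ID, 12-bit per-millisecond sequence."""
    EPOCH = 1288834974657  # Twitter's custom epoch (2010-11-04)

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
                if self.sequence == 0:           # exhausted this millisecond
                    while now <= self.last_ms:   # spin until the next one
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) \
                | (self.machine_id << 12) \
                | self.sequence

gen = Snowflake(machine_id=1)
ids = [gen.next_id() for _ in range(5)]
print(ids == sorted(ids), len(set(ids)) == 5)  # True True
```

IDs are unique per machine and increase over time, so, like auto-increment keys, they insert nearly sequentially into a clustered index.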
Open-source implementations
- Baidu UidGenerator: based on the Snowflake algorithm.
- Meituan Leaf: implements mechanisms based on both MySQL auto-increment (with optimizations) and the Snowflake algorithm.
Recommended series
The Apache Druid database is a database with large tables.
For more knowledge about data storage, please follow my official account.