• UUID or GUID as Primary Keys? Be Careful!
  • Originally written by Tom Harrison Jr
  • The Nuggets translation Project
  • Translator: zaraguo
  • Proofreader: Canonxu Yifili09

UUID or GUID as primary key? Watch out!

Nothing says “user friendly” like GUID!

Recently, I was reading an article about how to scale a database that caught my attention – the author suggested using UUIDs (similar to GUIDs) as primary keys for database tables.

The advantages of UUIDs

Here are some reasons why using a UUID as a primary key is better than using an incremented integer:

  1. When scaling a database, when you have multiple databases that contain the same piece of data, such as a customer set, using a UUID means that the ID is uniquely identified across all databases, not just this one. This ensures the security of migrating data across databases. For example, I once merged multiple database shards into a Hadoop cluster in a project, and there was no key conflict.
  2. You can know the value of the primary key before inserting data, which avoids a round of data look-up and simplifies transaction logic, since you need to know the value of the primary key before inserting the child record as a foreign key.
  3. UUIDs do not reveal information about data and are more secure than auto-increment integers when used in urls. For example, if I am customer number 12345678, then people will assume the existence of customer numbers 12345677 and 12345679, which provides an attack vector. (But we’ll see a better alternative later.)

The disadvantage of UUIDs

Don’t be so naive

A basic UUID would look something like this: 70e2e8DE-500E-4630-b3CB-166131d35c21, which would be treated as a string, such as vARCHar (36) – don’t do that!

You say, “Well, no one would do that.”

I thought about it a little bit – in the two large enterprise databases I’ve worked on, that’s exactly what they do. In addition to 9 times the excess overhead (the integer type takes up only 4 bytes compared to 36 bytes), strings are not as fast at sorting as numbers because they depend on collation.

There was one company where something really bad happened, and they used Latin-1 character sets in the beginning. When we tried to convert to UTF-8, several union indexes were too large to hold. Oh!

UUIDs regression of

Don’t underestimate how annoying it can be to deal with values that are too big to store and express.

Plan for actual expansion

If our goal is to expand, I mean really expand. So first let’s realize that ints are not large enough in many cases. It overflows at about 2 billion (4 bytes). Yet we have well over 2 billion pieces of data in each database.

Therefore, bigint is what we really need at some point, and it takes 8 bytes. There are several other strategies to choose from. Databases such as PostgreSQL and SQL Server have 16-byte native types.

Who cares if it’s twice the size of Bigint or four times the size of int? This is just a little bit of a byte, right?

Primary keys are ubiquitous in well-regulated databases

If your database is well regulated, as is the case at my current company, you evaluate each time a key is used as a foreign key.

Not only on disk, these keys need to be loaded into memory during join and sort. Memory is getting cheaper, but neither disk nor memory is finite, and neither is free.

Our database uses a large number of relational tables to store foreign keys, especially in one-to-many relationships. The account table contains multiple card numbers, addresses, phone numbers, user names, and so on. For any column in a set of tables with billions of accounts, the space overhead for foreign keys grows rapidly.

Sorting random numbers is very difficult

Another problem is fragmentation – because UUIDs are random, they have no natural generation order and therefore cannot be used in clusters. This is why SQL Server implements a newSequentialId () method for the use of clustered indexes, which is probably the right way to turn on UUIDs as the primary key. Other databases may have similar solutions, PostgreSQL, MySQL certainly does, and others probably do.

Primary keys should never be exposed, even UUIDs

Because the primary key is unique in its scope, it can obviously be used as a user number or in a URL to mark a unique page or record.

Don’t!

I’ll illustrate the point that exposing primary keys in an open environment is a bad idea.

As I said above, the basic problem with simple increments is that they are easy to guess. The botnet can use this to keep guessing until it finds the real value. (Of course, if you use UUIDs, they can do brute force cracking, but the odds are very low).

In theory, trying to guess a UUID can be stupid, but Microsoft cautions against using NewSequentialId () because it’s actually easier to guess in order to reduce clustering problems.

I thought my keys would never change (until they did)

There is another compelling reason not to use a primary key in an open environment: once you change the key, all external references become unavailable. Imagine a 404 page not found.

When do you need to change the key? It just so happens that we are doing data migration this week, because who would have thought when we started as a company in 2003 that we would now need 13 huge SQL Server databases and continue to grow rapidly?

Never say “never”. I worked on that migration project, and this has happened to me many times. Prevention is simpler. Migration becomes more difficult when you’re in the midst of trillions of data.

In fact, my current company scenario is the best example of why UUIDs are needed, why UUIDs are expensive, and why exposing primary keys in an open environment is a problem.

My internal system is external

The Hadoop infrastructure I manage receives data from all of our databases on a nightly basis. The Hadoop system connects to our SQL Server database, which is fine because the two are owned by the same company.

Also, to avoid serialization key collisions between multiple databases, we generate a fake primary key by associating two values with unique customer numbers (primary keys) across databases, plus their serial numbers in the table.

By doing this we create a tight and effective permanent link between years of historical user data. If these primary keys change in a relational database management system, our corresponding keys must also change, or there will be frightening inconsistencies.

How to have it both ways? Use an integer for an internal reference and UUIDs for an external reference

There is a solution that works in many different scenarios, and the short answer is to do both. (Please note: this is not a good approach – see my transcript of Chris’s response to the original post below)

Internally, let the database manage data relationships with small, efficient, numeric sequence keys, either int or BigInt.

Then add a column to hold the UUID (which can be programmed into an insert preprocessing operation). Within the scope of a database itself, relationships can be managed using plain primary and foreign keys.

When a reference to data needs to be exposed externally, even if the “externally” is another internal system, they must rely on UUID.

This way, if you need to change the internal primary key, you can also ensure that it is within a database. (Note: as Chris comments, this is obviously wrong)

We used this strategy on another company’s customer data precisely to avoid the “easy to guess” problem with primary keys. (Note: Avoiding is not the same as preventing, see below).

On the other hand, I’ll generate a “piece” of text (such as a blog post like this one) for the URL to make it more user-friendly. If there is a conflict, simply append a hash value.

Even as “secondary primary keys”, simply using UUIDs as strings is wrong: I recommend using the built-in database mechanism to generate 8-byte integer values.

Integers are used because they are efficient. UUIDs implemented by databases can also be used for irregular external references to avoid brute force cracking.

Chris Russell’s response to this section of the original post correctly points out two important logical caveats or errors. First, even exposing UUIDs instead of real primary keys can actually reveal a lot of information, especially when using NewSequentialIDS – don’t try to use UUIDs for security. Second, if the given schema relationship is internally managed by integer keys, you will still have key conflicts when merging two databases, unless all keys are allowed to have two records… If this is the case, use the UUID. So, in reality, the right solution might be: you can use UUIDs as keys, but never expose them. How things are handled internally or externally is best left to modules like URL-friendly handling, and (as Medium does) appends a hash value to the tail. Thank you, Chris!

Postscript and thanks

Thanks to Ruby Weekly (which I read all the time, even though I’m using Scala now), Starr Horne from Honeybadger for his excellent article on this point, Jeff Atwood, who writes humorous and intelligent articles on Coding Horror, co-founder of Stack Overflow, Naturally, there’s a great question from Starkoverflow at dba.stackexchange.com. There’s also a great article from MySqlserverTeam, another from theBuild.com and MSDN that I linked to earlier.

Postscript: Why am I writing this article

I learned a lot from writing this article.

It all started on a Sunday afternoon when I was reading my mail.

Then I came across an interesting article by Starr, and it got me thinking that his advice might have some unexpected effects. So I started googling, which broadened my understanding of UUIDs and changed my basic understanding and attitude about how to use them.

As I was writing, I emailed the team leader of the company to ask if our database design had taken into account some of the points I had discussed above. Hopefully we’ve done a good job, but I think we’ve avoided at least one unexpected surprise in the code we’re planning to release this week.

Writing this article is purely selfish 🙂

I hope you like it too!

photo


The Nuggets Translation Project is a community that translates quality Internet technical articles from English sharing articles on nuggets. Android, iOS, React, front end, back end, product, design, etc. Keep an eye on the Nuggets Translation project for more quality translations.