Error of the day: I tried to store a UTF-8 string in a MariaDB “UTF8” encoded database and raised a strange error:

Incorrect String Value: '\xF0\x9F\x98\x83 <... 'forColumn 'summary' at row 1Copy the code
This is the UTF-8 client and UTF-8 server, located in the UTF-8 database, with UTF-8 encoding rules. The string “๐Ÿ˜ƒ” is a valid UTF-8.

But here’s the thing: MySQL’s “UTF8” is not UTF-8.

The “UTF8” encoding supports only three bytes per character. True UTF-8 encoding – used by everyone, including you – requires up to four bytes per character.

MySQL developers have never fixed this bug. They released a solution in 2010: a new character set called “UTF8MB4.”

Of course, they never published this (probably because the bug was so embarrassing). The guide on the Web now advises users to use “UTF8”. All of these guidelines are wrong.

In short:

ยท MySQL utF8MB4 stands for UTF-8.

ยท MySQL’s “UTF8” means “special character encoding”. This encoding cannot encode many Unicode characters.

I’ll make a thorough statement here: all MySQL and MariaDB users currently using “UTF8” should actually use “UTf8MB4”. No one should use “UTF8”.

What is coding? What is UTF-8?

Joel on Software wrote my favorite introduction (www.joelonsoftware.com/2003/10/08/…

). I’ll cut it down.

Computer stores text as 1s and 0s. The first letter in this paragraph is stored as “01000011” and your computer displays as “C”. Your computer selects “C” in two steps:

1. Your computer reads “01000011” and determines that it is the number 67. This is because 67 is encoded as “01000011”.

2. Your computer looks for the character number 67 in the Unicode character set and finds that 67 means “C”.

When I type “C”, the same thing happens to my result:

1. My computer maps “C” in the Unicode character set to 67.

2. My computer code is 67 and I send “01000011” to this Web server.

Character sets are a solved problem. Almost every program on the Internet uses the Unicode character set, because there is no incentive to use another.

But coding is more of a judgment call. Unicode has slots of over a million characters. (C and ๐Ÿ’ฉ are two characters)

The simplest encoding (UTF-32) occupies 32 bits per character. This is easy, because computers have been treating 32-bit groups as numbers for years, and they’re really good at it. But it doesn’t work: it’s a waste of space.

Utf-8 saves space. In UTF-8, common characters like “C” take up 8 bits, while characters like “other” require 16 or 24 bits. Blog posts like this take up four times less space in UTF-8 than in UTF-32. So it loads four times faster.

You may not realize it, but behind the scenes our computer agreed to UTF-8. If they don’t, then when I type

“๐Ÿ’ฉ”, you’ll see a bunch of random data.

MySQL’s “UTF8” character set is inconsistent with other programs. When they say “๐Ÿ’ฉ,” it hesitates.

A little bit of MySQL history

Why did MySQL developers invalidate “UTf8”? We can guess by looking at the commit log.

MySQL supports UTF-8 as of version 4.1. That was 2003 – before today’s UTF-8 standard, RFC 3629.

The previous UTF-8 standard, RFC 2279, supported up to six bytes per character. MySQL developers wrote RFC 2279 in the first pre-release of MySQL 4.1 on March 28, 2002.

Then in September made a mysterious, one-byte tweak to the MySQL source code: “UTF8 can now only handle three-byte sequences.”

Who submitted this? Why is that? I can’t say. The MySQL code base seems to have lost the old author name when Git was adopted. (MySQL used to use BitKeeper, just like the Linux kernel.) There is nothing on the mailing list circa September 2003 to explain this change.

But I can guess.

Back in 2002, MySQL gave users a speed boost if they could guarantee that every row in a table had the same number of bytes. To do this, the user declares the text column as “CHAR”. The CHAR column always has the same number of characters. If you type too few characters, it adds Spaces at the end; If you type too many characters, it will truncate the last character.

When MySQL developers first tried utF-8, they might have balked at the last six bytes of each character: a CHAR (1) column takes six bytes; The CHAR (2) column takes 12 bytes; And so on.

Let’s be clear: the unpublished initial behavior is correct. It is well documented and widely adopted, and anyone who understands UTF-8 will agree that this is true.

But apparently, MySQL developers (or merchants) are worried that one or two users will do two things:

1. Select the CHAR column. CHAR format is now a relic. At the time, MySQL was faster with CHAR columns. Until 2005, it wasn’t.)

2. Select “UTF8” for the CHAR columns.

My guess is that MySQL developers broke their “UTF8” encoding to help these users: 1) users trying to optimize space and speed; 2) Failure to optimize speed and space.

No one wins. Users who want speed and space will still be wrong to use the “UTF8” CHAR columns because those columns are still larger and slower than they were. Developers who want correctness use “UTf8” incorrectly because it cannot be stored

“๐Ÿ’ฉ”

Once MySQL publishes this invalid character set, it will never be able to fix it: it will force every user to rebuild every database. MySQL finally released UTF-8 support in 2010 under a different name: “UTF8MB4”.

Why is it so depressing

Obviously, I’m upset this week. My bugs are hard to find because I’m confused by the name “UTF8”. And I’m not the only one – almost all the articles I’ve found online refer to “UTF8” as UTF-8.

The name “UTf8” is always wrong. This is a proprietary character set. It creates new problems and does not solve the ones it is meant to solve.

This is false advertising.

My take-away lessons

1.Database systems have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding database systems.

If you need a database, don’t use MySQL or mariadb.use PostgreSQL.

3.If you need to use MySQL or MariaDB, never use โ€œutf8โ€. Always use โ€œutf8mb4โ€ when you want UTF-8. Convert your database now to avoid headaches later.

Click on the link to the original English text

More articles are welcome to visit http://www.apexyun.com

Public id: Galaxy 1

Contact email: [email protected]

(Please do not reprint without permission)