This article takes about 3.6 minutes to read.


Translation: http://suo.im/4zBuvs

From: http://ju.outofmemory.cn

I recently ran into a bug where I tried to save a UTF-8 string in MariaDB encoded with “UTF8” via Rails and got a bizarre error:

Incorrect string value: '\ xF0 \ x9F \ x98 \ x83 <... 'forColumn 'summary' at row 1Copy the code

I’m using a UTF-8 client, the server is UTF-8, the database is UTF-8, and even the string “<… It’s also a legitimate UTF-8.

The problem is that MySQL’s “UTF8” is not really UTF-8.

“Utf8” only supports a maximum of three bytes per character, while true UTF-8 supports a maximum of four bytes per character.

MySQL never fixed the bug and released a character set called “UTf8MB4” in 2010 to circumvent the problem.

Of course, they didn’t advertise the new character set (probably because they were embarrassed by the bug), so developers are still being advised to use “UTF8” on the web, but that advice is wrong.

Briefly summarized as follows:

1.MySQL utF8MB4 is true UTF-8.

2.MySQL’s “UTF8” is a “proprietary encoding” that can encode only a few Unicode characters.

I want to be clear here: all MySQL and MariaDB users using “UTF8” should switch to “UTf8MB4” and never use “UTF8” again.

So what is coding? What is UTF-8?

As we all know, computers use zeros and ones to store text. For example, if the character “C” is saved as “01000011”, then the computer needs to go through two steps to display this character:

1. The computer reads “01000011” and gets the number 67 because 67 is encoded as “01000011”.

2. The computer looks for 67 in the Unicode character set and finds “C”.

The same:

1. My computer maps “C” to 67 in the Unicode character set.

2. My computer codes 67 as “01000011” and sends it to the Web server.

Almost all web applications use the Unicode character set because there is no reason to use any other character set.

The Unicode character set contains millions of characters. The simplest encoding is UTF-32, which uses 32 bits per character. This is the easiest way to do it, because computers have always treated 32 bits as numbers, and that’s what computers are good at. The problem is, it’s a waste of space.

Utf-8 saves space. In UTF-8, the character “C” requires only 8 bits, and some infrequently used characters, such as “”, require 32 bits. Other characters may use 16 or 24 bits. An article like this one, if encoded in UTF-8, would take up about a quarter of the space of UTF-32.

MySQL’s “UTF8” character set is incompatible with other programs.

MySQL, brief

Why did MySQL developers disable “UTF8”? The answer may be found in the commit log.

MySQL has supported UTF-8 since version 4.1, in 2003, and the UTF-8 standard used today (RFC 3629) came later.

Older versions of the UTF-8 standard (RFC 2279) supported up to 6 bytes per character. On March 28, 2002, MySQL developers used RFC 2279 in the first MySQL 4.1 preview release.

In September of that year, they made a tweak to the MySQL source code: “UTF8 now only supports 3-byte sequences at most.”

Who submitted the code? Why would he do that? That question is unknown. After moving to Git (MySQL originally used BitKeeper), many of the submitter names in the MySQL code base were lost. There is no clue on the September 2003 mailing list to explain the change.

But I can try to guess.

In 2002, MySQL made the decision that if users could ensure that every row of a data table used the same number of bytes, MySQL would get a big performance boost. To do this, the user needs to define text columns as “CHAR”, each of which always has the same number of characters. If the number of characters inserted is less than the defined number, MySQL will fill in the following space. If the number of characters inserted is more than the defined number, the following portion will be truncated.

MySQL developers initially experimented with UTF-8 using 6 bytes per character, CHAR(1) using 6 bytes, CHAR(2) using 12 bytes, and so on.

It should be said that they did the right thing in the first place, but that version was never released. But that’s what it says in the documentation, and it’s widely circulated, and everyone who knows UTF-8 agrees with what it says in the documentation.

But obviously, MySQL developers or vendors are worried about users doing two things:

1. Use CHAR to define columns (CHAR is ancient by now, but back then it was faster to use it in MySQL, though not since 2005).

2. Set the CHAR column encoding to UTF8.

My guess is that the MySQL developers were trying to help users who wanted a win-win in space and speed, but they screwed up the “UTF8” coding.

So the result is no winner. Users who want to win both space and speed actually use more space and slower than expected when they use the CHAR column “UTF8”. Users who want correctness can’t save characters like “” when they use” UTF8 “encoding.

After the invalid character set was published, MySQL could not fix it because it would require all users to rebuild their databases. Finally, MySQL rereleased “UTF8MB4” in 2010 to support true UTF-8.

Why is this so maddening

I’ve been freaking out all week because of this problem. I was fooled by “UTF8” and it took me a long time to find this bug. But I can’t be the only one. Almost all the articles on the web refer to “UTF8” as the real UTF-8.

“Utf8” is only a proprietary character set, and it presents us with new problems that have never been solved.

conclusion

If you’re using MySQL or MariaDB, don’t use “UTf8” encoding, use “UTf8MB4” instead. Here (https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4) provides a guide to an existing database character encoding from “utf8” into “utf8mb4”.


, END,

The growth path of programmers

Though the road is long, the journey is sure to come

This article was originally posted on the wechat public account of the same name “The Growth of programmers”, reply to “1024” you know, give a thumbs up.

Reply [520] to receive the best learning method for programmers

Reply to [256] for Java programmer growth plans


Highlights from the past

Here are 11 popular Java projects on GitHub

Separate database and separate table? How to never migrate data and avoid hot spots?

The 150 most common commands in Linux are summarized

Work crazy: Mybatis $and # do not use it!

Share some handy Chrome extensions

My dad had a ransomware virus on his computer…

P7 hacker is how to find his girlfriend cheating, painful experience!