I read an article today that I think is quite good.

Adam Hooper…

Translation: www.jianshu.com/p/ab9aa8d4d…

I recently ran into a bug: using Rails, I tried to save a UTF-8 string into a MariaDB column encoded as “utf8”, and a bizarre error occurred:

Incorrect string value: '\xF0\x9F\x98\x83 <…' for column 'summary' at row 1

I was using UTF-8 for the client, UTF-8 for the server, UTF-8 for the database, and even the string “😃 <…” itself is legal UTF-8.

The crux of the problem is that MySQL’s “utf8” is not really UTF-8.

“utf8” supports a maximum of three bytes per character, whereas real UTF-8 supports a maximum of four bytes per character.
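To make the failure concrete, here is a minimal sketch (the table and column names are made up, and it assumes a client connection that itself uses utf8mb4, with strict SQL mode enabled):

```sql
-- Two hypothetical tables, identical except for their character set.
CREATE TABLE demo_utf8    (summary VARCHAR(80)) CHARACTER SET utf8;
CREATE TABLE demo_utf8mb4 (summary VARCHAR(80)) CHARACTER SET utf8mb4;

-- "😃" encodes to four bytes in UTF-8: F0 9F 98 83.
INSERT INTO demo_utf8    VALUES ('😃');  -- fails: Incorrect string value
INSERT INTO demo_utf8mb4 VALUES ('😃');  -- succeeds
```

The first insert fails with exactly the error from the beginning of this article.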

MySQL never fixed this bug; instead, in 2010 it released a character set called “utf8mb4” that sidesteps the problem.

Of course, they didn’t tell anyone about the new character set (probably because they were embarrassed by the bug), and advice all over the web still tells developers to use “utf8”, but that advice is wrong.

A brief summary is as follows:

1. MySQL’s “utf8mb4” is the real UTF-8.

2. MySQL’s “utf8” is a proprietary encoding that can only encode a subset of Unicode characters.

Let me be clear: every MySQL and MariaDB user who is currently using “utf8” should switch to “utf8mb4” and never use “utf8” again.
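Converting is mostly one statement per table. Here is a sketch with a hypothetical database and table name; back up first, and see the guide linked in the conclusion for details such as index length limits:

```sql
-- Make utf8mb4 the default for new tables in this database.
ALTER DATABASE my_app CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Re-encode the text columns of an existing table in place.
ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```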

So what is an encoding? And what is UTF-8?

As we all know, computers use zeros and ones to store text. For example, if the character “C” is stored as “01000011”, the computer goes through two steps to display it:

1. The computer reads “01000011” and gets the number 67, because 67 is encoded as “01000011”.

2. The computer looks up 67 in the Unicode character set and finds “C”.

Similarly, in the other direction:

1. My computer maps “C” to 67 in the Unicode character set.

2. My computer encodes 67 as “01000011” and sends it to the web server.
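You can watch both halves of this round trip from inside MySQL itself. A small sketch (any MySQL or MariaDB server should do):

```sql
SELECT ORD('C');                         -- 67: 'C' maps to the number 67
SELECT LPAD(BIN(67), 8, '0');            -- '01000011': 67 written as 8 bits
SELECT HEX(CONVERT('C' USING utf8mb4));  -- '43': the stored byte, 0x43 = 67
SELECT CHAR(67 USING utf8mb4);           -- 'C': decoding 67 back into a character
```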

Almost all web applications use the Unicode character set because there is no reason to use any other character set.

The Unicode character set contains over a million possible characters. The simplest encoding is UTF-32, which uses 32 bits per character. Doing it that way is easy, because computers have traditionally treated 32-bit values as numbers, and crunching numbers is what computers do best. The problem is that it wastes space.

UTF-8 saves space. In UTF-8, the character “C” needs only 8 bits, while some rarer characters, such as “😃”, need 32 bits. Other characters take 16 or 24 bits. An article like this one, encoded in UTF-8, takes about a quarter of the space it would in UTF-32.
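Since MySQL’s LENGTH() counts bytes, it can report these sizes directly. A sketch (it assumes a utf8mb4 client connection so the emoji literal survives the trip to the server):

```sql
SELECT LENGTH(CONVERT('C'  USING utf8mb4));  -- 1 byte  (8 bits)
SELECT LENGTH(CONVERT('é'  USING utf8mb4));  -- 2 bytes (16 bits)
SELECT LENGTH(CONVERT('中' USING utf8mb4));  -- 3 bytes (24 bits)
SELECT LENGTH(CONVERT('😃' USING utf8mb4));  -- 4 bytes (32 bits)
SELECT LENGTH(CONVERT('C'  USING utf32));    -- 4 bytes: UTF-32 always uses 32 bits
```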

MySQL’s “utf8” character set doesn’t match what other programs mean by UTF-8, and what it calls “utf8” may really be a pile of…

A brief history of MySQL

Why did MySQL’s developers cripple “utf8”? We might be able to find out from the commit log. MySQL has supported UTF-8 since version 4.1 in 2003, and the UTF-8 standard in use today (RFC 3629) came later.

The old UTF-8 standard (RFC 2279) allowed a maximum of six bytes per character. On March 28, 2002, MySQL developers shipped RFC 2279 support in the first preview release of MySQL 4.1.

In September of that year, they made a tweak to the MySQL source code: “UTF8 now supports sequences of up to 3 bytes.” Who committed this code? Why did he do it? Nobody knows. After the migration to Git (MySQL originally used BitKeeper), many committer names were lost from the MySQL code base, and the September 2002 mailing-list archives hold no clue that explains the change.

But I can try to guess.

In 2002, MySQL made a decision: if users could guarantee that every row of a table used the same number of bytes, MySQL could gain a big performance boost. To do that, users had to define their text columns as “CHAR”; a CHAR column always holds the same number of characters. Insert fewer characters than defined, and MySQL pads the value with spaces; insert more, and the excess is truncated.
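In SQL terms, the trade-off looks like this sketch (hypothetical table; note that strict SQL mode rejects the over-long value instead of truncating it):

```sql
-- Every value in a CHAR(3) column occupies exactly 3 characters.
CREATE TABLE fixed_demo (code CHAR(3));

INSERT INTO fixed_demo VALUES ('ab');    -- stored space-padded to fill 3 characters
INSERT INTO fixed_demo VALUES ('abcd');  -- too long: truncated to 'abc', or rejected in strict mode
```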

The MySQL developers initially experimented with UTF-8 at up to six bytes per character: a CHAR(1) column would reserve 6 bytes, a CHAR(2) column 12 bytes, and so on.

It’s fair to say they did the right thing at first, but that version was never released. The behavior was in the documentation, though, and the documentation circulated widely; everyone who knew about UTF-8 agreed with what it said.

Obviously, MySQL’s developers (or somebody on the team) worried that users would do these two things:

1. Define columns with CHAR (CHAR is a bit of a relic now, but at the time CHAR columns made MySQL faster, although that has not been the case since 2005).

2. Set the encoding of those CHAR columns to “utf8”.

My guess is that the MySQL developers wanted to help users who hoped to win on both space and speed, and in the process they crippled the “utf8” character set.

The result is that there are no winners. Users who wanted both space and speed actually got CHAR columns in “utf8” that were bigger and slower than expected. And users who wanted correctness couldn’t save characters like “😃” at all when they used “utf8”.

Once this invalid character set was released, MySQL couldn’t fix it, because doing so would force every user to rebuild their databases. So in 2010 MySQL finally released “utf8mb4” to support real UTF-8.

Why is this so frustrating?

I’ve been freaking out about this for a whole week. I was fooled by “utf8”, and it took me a long time to find the bug. But I’m certainly not the only one: almost every article on the web treats “utf8” as the real UTF-8. “utf8” is only a proprietary character set; it gave us a new problem, and that problem has never been solved.

Conclusion

If you’re using MySQL or MariaDB, don’t use “utf8”; use “utf8mb4” instead. Here is a guide for converting the character set of existing databases: mathiasbynens.be/notes/mysql…
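Before converting anything, it is worth checking what your server and tables actually use today. A sketch (in recent MySQL versions the legacy character set may be reported as “utf8mb3”, which the second query also matches):

```sql
-- Character sets in effect for this connection, server, and database.
SHOW VARIABLES LIKE 'character_set%';

-- Tables still declared with the legacy 3-byte character set.
SELECT table_schema, table_name, table_collation
FROM information_schema.tables
WHERE table_collation LIKE 'utf8\_%' OR table_collation LIKE 'utf8mb3\_%';
```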


Personal summary

I don’t fully agree with the author’s advice to never use “utf8”; you can switch to “utf8mb4” when you actually need it (after all, “utf8mb4” takes more space), for example for the emoji used in WeChat Mini Programs. This encoding pitfall is one most of us have stumbled into, but apart from it I haven’t run into other problems with “utf8”. So once you understand this issue, you can decide which encoding to use as the need arises.