Recently I encountered a magic problem, when INSERTING chat emoticons into the database, I mysteriously reported an error. My client, server and database are all UTF-8 encoding, so there should be no problem. It turns out that the root cause is that MySQL’s “UTF8” is not really UTF-8.
Mysql utF-8 error
MySQL’s “UTF8” is not really UTF-8. “Utf8” only supports a maximum of three bytes per character, while true UTF-8 supports a maximum of four bytes per character. MySQL never fixed the bug and released a character set called “UTf8MB4” in 2010 to circumvent the problem. Of course, they didn’t advertise the new character set (probably because they were embarrassed by the bug), so developers are still being advised to use “UTF8” on the web, but that advice is wrong. MySQL ‘UTF8MB4’ is true ‘UTF-8’. MySQL’s “UTF8” is a “proprietary encoding” that can encode only a few Unicode characters. I want to be clear here: all MySQL and MariaDB users using “UTF8” should switch to “UTf8MB4” and never use “UTF8” again.
Historical causes of the problem
Why did MySQL developers disable “UTF8”? The answer may be found in the commit log. MySQL has supported UTF-8 since version 4.1, in 2003, and the UTF-8 standard used today (RFC 3629) came later. Older versions of the UTF-8 standard (RFC 2279) supported up to 6 bytes per character. On March 28, 2002, MySQL developers used RFC 2279 in the first MySQL 4.1 preview release. In September of that year, they made a tweak to the MySQL source code: “UTF8 now only supports 3-byte sequences at most.” Who submitted the code? Why would he do that? That question is unknown. After moving to Git (MySQL originally used BitKeeper), many of the submitter names in the MySQL code base were lost. There is no clue on the September 2003 mailing list to explain the change. But I can try to guess. In 2002, MySQL made the decision that if users could ensure that every row of a data table used the same number of bytes, MySQL would get a big performance boost. To do this, the user needs to define text columns as “CHAR”, each of which always has the same number of characters. If the number of characters inserted is less than the defined number, MySQL will fill in the following space. If the number of characters inserted is more than the defined number, the following portion will be truncated. MySQL developers initially experimented with UTF-8 using 6 bytes per character, CHAR(1) using 6 bytes, CHAR(2) using 12 bytes, and so on. It should be said that they did the right thing in the first place, but that version was never released. But that’s what it says in the documentation, and it’s widely circulated, and everyone who knows UTF-8 agrees with what it says in the documentation. Obviously, though, MySQL developers or vendors are worried about users doing two things: (1) using CHAR to define columns. (CHAR may seem old-fashioned now, but it was faster to use it in MySQL at that time, but it hasn’t been since 2005.) (2) Set CHAR encoding to “UTf8”. My guess is that the MySQL developers were trying to help users who wanted a win-win in space and speed, but they screwed up the “UTF8” coding. So the result is no winner. Users who want to win both space and speed actually use more space and slower than expected when they use the CHAR column “UTF8”. Users who want correctness can’t save characters like “” when they use” UTF8 “encoding. After the invalid character set was published, MySQL could not fix it because it would require all users to rebuild their databases. Finally, MySQL rereleased “UTF8MB4” in 2010 to support true UTF-8.