In the production environment, the system hit an error when accessing the address of an external WebService. The cause was that the address file had been saved as UTF-8 with BOM, so the invisible characters %EF%BB%BF had been appended to the address. The resulting address-resolution failure, plus my poor knowledge of character encodings, cost me a whole day to sort out. Wang Xiaobo said: all human pain is, in essence, anger at one's own incompetence! (Wang Xiaobo: Nonsense, when did I ever say that?) So let's talk about Unicode, UTF-8, UTF-8 with BOM, and the rest.

Let’s get started with two questions about character encoding:

Question 1: What are ANSI, Unicode, UCS, UTF-8, and GBK, which we see so often, and how are they related?

Question 2: Are UTF-8 and UTF-8 with BOM the same thing? Which one should we use?

ANSI

As programmers, we are all familiar with ASCII. ASCII is a character encoding developed in the United States that maps English characters to binary. It defines codes for 128 characters, so each character fits in a single byte: only the last 7 bits are used, and the first bit is always 0.

But here’s the thing. ASCII was invented by the Americans, who thought a single byte was more than enough to represent all the characters, numbers, and symbols in the English world.

Later, the French and the Germans objected: why should there be only English but no French (e.g. é) or German (e.g. ä, ö)? ASCII uses only the first 128 codes of a byte, so Europeans used the rest (128 to 255) for their own characters.

When we Chinese began to use computers, the problem was obvious: our vast Chinese culture has far too many characters, more than 90,000 if you count all the rare ones, so the State Bureau of Standards stepped in and GB 2312 was born. GB 2312 uses two bytes per character and covers only the few thousand commonly used simplified Chinese characters. Our compatriots in Taiwan found that it had no traditional characters, so they created their own encoding for traditional Chinese, BIG-5. Later, GBK came along, taking both GB 2312 and BIG-5 under its umbrella while still encoding each character in two bytes.

Gradually, each country was encoding its own language independently, and ANSI was born out of Microsoft's attempt to accommodate every market. ANSI does not unify national encodings at all; instead, the same "ANSI" setting maps to a different national encoding on computers in different countries: GBK in China, EUC-KR in South Korea, plain ASCII in the United States. It is a workaround, not a cure. And the problem is garbled text: if a Chinese user sends an ANSI document to an American, the American sees nothing but gibberish. Without a unified character encoding, it is as if God had scattered the people of the Tower of Babel across the world and confused their languages, so that they could no longer understand one another, let alone finish the tower.

Unicode and UCS

Clearly, ANSI is an encoding scheme that does not address the root of the problem; to truly eliminate garbled text, a universal character code was still needed. In the 1980s and 1990s, two groups were working on such a unified character code:

  • Unicode Consortium: founded in the late 1980s, it created the Unicode character set and is dedicated to having Unicode replace existing character encodings.
  • International Organization for Standardization (ISO): the ISO/IEC JTC1/SC2/WG2 working group was set up in 1984 to develop a Universal Character Set (UCS), which eventually became the ISO 10646 standard.

The world did not need two competing character sets, and the people in both organizations realized this, so they began to cooperate. The two bodies still exist independently and publish their standards independently, but they strictly ensure that every character occupies the same code point and carries the same name in both standards.

UTF-8

Note that UCS and Unicode are only character sets: they specify a code for each symbol but say nothing about how that code is stored. The storage formats are called UTF (Unicode Transformation Format), and they include UTF-8, UTF-16, and UTF-32.

For example, we just said that Unicode is a character set that assigns a unique number to each character (or code point, as it is technically called). The Unicode code point of the Chinese character 掘 ("dig") is 25496 in decimal, written as U+6398, and its UTF-8 encoding is E6 8E 98.
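
As a quick sanity check, here is a minimal Java sketch (the class and variable names are my own) that prints the code point and the UTF-8 bytes of 掘 using the standard library:

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    public static void main(String[] args) {
        String s = "掘";

        // Code point: 25496 in decimal, i.e. U+6398
        int cp = s.codePointAt(0);
        System.out.printf("code point: %d (U+%04X)%n", cp, cp);

        // UTF-8 encoding: prints E6 8E 98
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```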

The first proposals were UTF-32 and UTF-16. As the names imply, UTF-32 uses four bytes for every Unicode code point, and UTF-16 uses two. But two bytes are not enough for all of Unicode, so UTF-16 encodes the less common characters with four bytes instead, which makes UTF-16 a variable-length encoding.
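
To see that variable length in practice, here is a small sketch (my own example, using an emoji as a character outside the two-byte range); Java strings are UTF-16 internally, so String.length() counts two-byte code units:

```java
public class Utf16LengthDemo {
    public static void main(String[] args) {
        String han = "掘";    // U+6398: one code point, one two-byte unit in UTF-16
        String emoji = "😀";  // U+1F600: one code point, but a four-byte surrogate pair in UTF-16

        System.out.println(han.length());    // 1
        System.out.println(emoji.length());  // 2 (two UTF-16 code units = 4 bytes)
    }
}
```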

But think about it: UTF-32 and UTF-16 encode plain ASCII text with a pile of leading zero bytes, which is a huge waste of bandwidth in the Internet world. UTF-8 solves this problem. UTF-8 is also a variable-length encoding, and its rules are very simple:

  • Single-byte characters (the ASCII range) are encoded exactly as in ASCII.
  • For a multi-byte character of N bytes, the first byte starts with N ones followed by a 0, each of the remaining N-1 bytes starts with 10, and all the remaining bits store the Unicode code point value, as the table below shows.
Bytes   Unicode range         UTF-8 byte pattern
1       U+0000 ~ U+007F       0xxxxxxx
2       U+0080 ~ U+07FF       110xxxxx 10xxxxxx
3       U+0800 ~ U+FFFF       1110xxxx 10xxxxxx 10xxxxxx
4       U+10000 ~ U+10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

In Unicode, common Chinese characters have code points in the range U+0800 ~ U+FFFF, that is, code points that fit in two bytes. Now you know why a Chinese character takes up three bytes in UTF-8.
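
To make the three-byte pattern concrete, here is a hand-rolled sketch (for illustration only; Java's charset classes already do this for you) that encodes a code point in the U+0800 ~ U+FFFF range into three UTF-8 bytes exactly as the table above describes:

```java
public class Utf8ThreeByteDemo {
    // Encode a code point in the range U+0800..U+FFFF into three UTF-8 bytes
    // following the pattern 1110xxxx 10xxxxxx 10xxxxxx.
    static byte[] encode3(int cp) {
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),          // 1110xxxx: top 4 bits of the code point
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (cp & 0x3F))          // 10xxxxxx: lowest 6 bits
        };
    }

    public static void main(String[] args) {
        for (byte b : encode3(0x6398)) {         // 掘 -> prints E6 8E 98
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```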

Little Endian and Big Endian

Little Endian and Big Endian literally mean "little end" and "big end". According to Wikipedia, the two terms come from Gulliver's Travels by the Irish author Jonathan Swift, but nowadays they are mostly used to refer to byte order.

  • Big Endian means the low address end stores the high-order byte.
  • Little Endian means the low address end stores the low-order byte.

Again, take the Chinese character 掘 as an example. Its Unicode code point is U+6398, which needs two bytes of storage. If 63 is stored first and 98 second (63 at the low address end), that is Big Endian; if 98 comes first and 63 second (98 at the low address end), that is Little Endian.
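
This is easy to observe in Java (a short sketch; UTF_16BE and UTF_16LE are standard charsets):

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {
    public static void main(String[] args) {
        String s = "掘"; // U+6398

        printHex(s.getBytes(StandardCharsets.UTF_16BE)); // 63 98  (Big Endian)
        printHex(s.getBytes(StandardCharsets.UTF_16LE)); // 98 63  (Little Endian)
    }

    static void printHex(byte[] bytes) {
        for (byte b : bytes) System.out.printf("%02X ", b);
        System.out.println();
    }
}
```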

UTF-8 has no Little Endian versus Big Endian problem, but UTF-16 and UTF-32 do.

So how does the computer know which byte order to use?

According to the Unicode specification, a file can be prefixed with a character that indicates the byte order. That character is the zero width no-break space, U+FEFF. If the first two bytes of a UTF-16 file are FE FF, it is Big Endian; if they are FF FE, it is Little Endian. That marker is the BOM.

What is a BOM?

A BOM (Byte Order Mark) is a special marker inserted at the beginning of a Unicode file encoded in UTF-8, UTF-16, or UTF-32 to identify the file's encoding and byte order.

Encoding                 BOM
UTF-8                    EF BB BF
UTF-16 (Big Endian)      FE FF
UTF-16 (Little Endian)   FF FE
UTF-32 (Big Endian)      00 00 FE FF
UTF-32 (Little Endian)   FF FE 00 00
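
Here is a minimal sketch of how a program might sniff these marks from the first few bytes of a file (the class and method names are my own, and the order of the checks matters, since the UTF-32 Little Endian mark starts with the same FF FE as UTF-16 Little Endian):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomSniffer {
    // Best-guess description based on the BOM table above; "no BOM" if none matches.
    static String sniff(byte[] b) {
        if (b.length >= 4 && b[0] == 0x00 && b[1] == 0x00
                && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) return "UTF-32 (Big Endian)";
        if (b.length >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
                && b[2] == 0x00 && b[3] == 0x00)                   return "UTF-32 (Little Endian)";
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF)                          return "UTF-8 with BOM";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) return "UTF-16 (Big Endian)";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) return "UTF-16 (Little Endian)";
        return "no BOM";
    }

    public static void main(String[] args) throws IOException {
        // Reading the whole file keeps the sketch short; real code would read only a few bytes.
        System.out.println(sniff(Files.readAllBytes(Path.of(args[0]))));
    }
}
```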

With or without BOM?

Why does UTF-8 appear in the table at all? Because UTF-8 comes in two flavors: with BOM and without BOM.

Microsoft recommends that all Unicode files carry a BOM, so it prefixes its UTF-8 text files with EF BB BF; that is how the Notepad program on Windows decides whether a text file is ASCII or UTF-8.

But this is only a Windows convention, and other systems do not follow it. Linux/UNIX does not use the BOM, because the BOM is merely a marker, an invisible character that gets in the way, which goes against the UNIX design philosophy.

Whether to use a BOM depends on how you work, since the Unicode specification allows both. If you only target Linux, UTF-8 without BOM is fine; if you only work on Windows, you are better off with a BOM. For cross-platform work, judge case by case; when Windows tooling is involved, a BOM is generally the safer choice.
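
And if you have to consume UTF-8 files that may or may not start with a BOM (the situation that bit me), a defensive sketch like the following strips it before the text is used, for example as a WebService address (the class and method names are my own):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomStripper {
    private static final String BOM = "\uFEFF"; // what EF BB BF decodes to in UTF-8

    // Read a UTF-8 file and drop a leading BOM if present, so the invisible
    // characters never end up glued to a URL or other configuration value.
    static String readWithoutBom(Path file) throws IOException {
        String text = Files.readString(file, StandardCharsets.UTF_8);
        return text.startsWith(BOM) ? text.substring(1) : text;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readWithoutBom(Path.of(args[0])).trim());
    }
}
```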

Of course, as a Java programmer I develop with IDEA on Windows while the production environment is deployed on Linux virtual machines. I have learned my lesson the hard way: next time I will remember to remove the BOM!!
