3.5.2. Unicode
Once upon a time, the world was simpler, or at least the computing world dealt with a single character set: ASCII, the American Standard Code for Information Interchange. ASCII uses seven bits to represent 128 characters: the upper- and lower-case letters of English, digits, and a variety of punctuation marks and device-control characters. That was enough for early computer programs, but it left users in much of the rest of the world unable to write directly in their own writing systems. With the growth of the Internet, data in a multitude of languages has become commonplace. How can this rich and diverse textual data be handled effectively?
The answer is to use Unicode (unicode.org), a collection of all the world's symbol systems, including accents and other diacritical marks, control codes such as tab and carriage return, and a host of esoteric symbols, each assigned a standard number called a Unicode code point. A Unicode code point corresponds to the rune integer type in Go.
In version 8, the Unicode standard covers more than 120,000 characters from well over 100 languages and scripts. How does this manifest itself in computer programs and data? The natural data type to hold a single Unicode code point is int32, which is what Go uses; that is why rune is a synonym for int32.
We could represent a sequence of runes as a sequence of int32 values. In this encoding, called UTF-32 or UCS-4, every Unicode code point is represented with the same size, 32 bits. This approach is simple and uniform, but it wastes a lot of storage, because most computer-readable text is ASCII, which needs only 8 bits, or 1 byte, per character. Even the characters in common use number far fewer than 65,536, which would fit in a 16-bit encoding. Is there a better encoding?
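
To make the storage cost concrete, here is a small sketch (an illustration using an arbitrary sample string, not a program from the text) comparing the number of bytes Go actually stores for a string with what a fixed 32-bit representation of the same runes would need:

package main

import "fmt"

func main() {
    s := "Hello, 世界" // 7 ASCII runes followed by 2 non-ASCII runes

    fmt.Println(len(s))             // "13": bytes in the UTF-8 encoding of s
    fmt.Println(len([]rune(s)))     // "9": Unicode code points (runes) in s
    fmt.Println(4 * len([]rune(s))) // "36": bytes a fixed 32-bit (UTF-32) representation would need
}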
3.5.3. UTF-8
UTF-8 is a variable-length encoding of Unicode code points as sequences of bytes. It was invented by Ken Thompson and Rob Pike, two of the creators of Go, and is now a Unicode standard. UTF-8 uses between 1 and 4 bytes for each Unicode code point: only 1 byte for ASCII characters, and 2 or 3 bytes for most runes in common use. The high-order bits of the first byte of a rune's encoding indicate how many bytes follow. A high-order 0 indicates 7-bit ASCII, where each rune still takes one byte, so the encoding is compatible with traditional ASCII. A high-order 110 indicates that the rune takes 2 bytes; each subsequent byte begins with 10. Larger Unicode code points are handled analogously:
0xxxxxxx                             runes 0-127     (ASCII)
110xxxxx 10xxxxxx                    128-2047        (values <128 unused)
1110xxxx 10xxxxxx 10xxxxxx           2048-65535      (values <2048 unused)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  65536-0x10ffff  (other values unused)
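
The following short program, offered as a sketch built on the standard unicode/utf8 package (the sample runes are arbitrary), prints how many bytes UTF-8 uses for one rune from each of the ranges above:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    // One sample rune for each encoding length: ASCII, 2-byte, 3-byte, 4-byte.
    for _, r := range []rune{'A', 'é', '世', '🙂'} {
        fmt.Printf("%q\tU+%04X\t%d bytes\n", r, r, utf8.RuneLen(r))
    }
}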
A variable-length encoding precludes accessing the nth character of a string directly by index, but UTF-8 gains many compensating advantages. The encoding is compact, fully compatible with ASCII, and self-synchronizing: the starting byte of the current character's encoding can be found by backing up no more than three bytes. It is also a prefix code, so it can be decoded from left to right without ambiguity or lookahead. No rune's encoding is a substring of any other's, nor of any sequence of encodings, so you can search for a rune just by searching for its byte sequence, without worrying that the surrounding context will produce spurious matches. At the same time, UTF-8 preserves the order of Unicode code points, so UTF-8-encoded strings can be sorted directly. And because it contains no embedded NUL (zero) bytes, it is also convenient for programming languages that use NUL to terminate strings.
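
To illustrate the decoding side, here is a small sketch using the standard unicode/utf8 package with an arbitrary sample string: indexing a string yields individual bytes, while decoding steps through it rune by rune, with each decoded rune reporting its own byte width.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "héllo"

    fmt.Println(len(s)) // "6": s[1] and s[2] are the two bytes that encode 'é'

    // Decode the string one rune at a time; size tells us how far to advance.
    for i := 0; i < len(s); {
        r, size := utf8.DecodeRuneInString(s[i:])
        fmt.Printf("byte offset %d:\t%q\t(%d byte(s))\n", i, r, size)
        i += size
    }
}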