Go's implementation of UTF-8

When computers were first developed, every character a computer needed could be represented in ASCII. An ASCII character is 7 bits long, so ASCII can represent 128 characters. That is enough for English-speaking countries such as the United States, but not for the rest of the world. East Asian scripts in particular are not built from a small alphabet: Chinese alone has tens of thousands of characters. ASCII is simply not enough.

To a computer, a character is essentially a numeric value, so the solution was to expand the range of values. Unicode does this by covering the characters of all the world's writing systems and assigning each one a number, called a Unicode code point.

But Unicode is not without drawbacks. Because the range is so large, storing every code point at a fixed width takes 4 bytes. A character that needed only 1 byte in ASCII would then need 4, which wastes a lot of storage.

UTF-8 solves this problem with a variable-length encoding: each character uses only as many bytes as it needs, from 1 to 4. Each length has its own bit layout:

1 byte: 0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The leading bits of the first byte indicate how many bytes the character occupies, and every continuation byte starts with 10.

Each Unicode character corresponds to a code point, which can be written in a string as an escape: \uhhhh for a 16-bit code point and \Uhhhhhhhh for a 32-bit code point, where each h is a hexadecimal digit.

One special case: a character literal whose code point is less than 256 can be written with a single two-digit hexadecimal escape. For example, 'A' can be written as '\x41'. For larger code points, a \u or \U escape must be used.

Go has good support for UTF-8. Interestingly, two of Go's authors, Ken Thompson and Rob Pike, are also the inventors of UTF-8, so Go had a head start here.

Go source files are always encoded in UTF-8, and UTF-8 is the preferred encoding for strings. The Unicode escapes described above are therefore handled directly by Go. For example, the following three strings are equivalent in Go:

"World" "\u4e16\ u754C ""\U00004e16\U0000u754c"Copy the code

A Go string is a read-only sequence of bytes, so string values are immutable, which makes them safer and more efficient to share:

s := "left root"
t := s
s += ", right root"

fmt.Println(s) // left root, right root
fmt.Println(t) // left root

In the example above, s now refers to a new string, while t still holds the old one. Because a string is backed by a sequence of bytes, slicing it to take a substring is efficient, but slicing has some pitfalls.

The key point is that Go stores a string as read-only bytes under the hood, so a Go string is essentially a sequence of bytes, not a sequence of characters.

str := "hello world"
fmt.Println(str[:2]) // he
str = "你好世界"
fmt.Println(str[:2]) // garbled: 2 bytes are only part of a 3-byte character

Non-ASCII characters usually occupy more than one byte. Slicing at an arbitrary byte offset can cut a character in half, which produces garbled output. In the example above each Chinese character is 3 bytes long, so the slice displays correctly only if it ends on a character boundary:

str = "你好世界"
fmt.Println(str[:3]) // 你
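One way to find safe cut points without counting bytes by hand is to let a range loop report where each character starts. The sketch below assumes a hypothetical helper named runeOffsets. It relies on the fact that ranging over a string advances one character at a time and yields the byte offset of each character.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper that returns the byte offset at
// which each character of s begins.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s { // i is a byte offset, advanced one character at a time
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	str := "你好世界"
	fmt.Println(runeOffsets(str)) // [0 3 6 9]
	fmt.Println(str[:6])          // 你好
}
```

Any offset returned by the helper is a valid place to cut the string.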

Counting bytes by hand is tedious. A simpler approach is to convert the string to []rune and slice that instead:

str = "你好世界"
runeStr := []rune(str)
fmt.Println(string(runeStr[:1])) // 你

Converting a string to []rune decodes its UTF-8 bytes into a sequence of Unicode code points instead of raw bytes. rune is an alias for int32.
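The conversion can be wrapped in a small function. The sketch below assumes a hypothetical helper named firstN that returns the first n characters of a string. Note that []rune(s) decodes and copies the whole string, so this favors simplicity over speed.

```go
package main

import "fmt"

// firstN is a hypothetical helper that returns the first n characters of s,
// or all of s when it has fewer than n characters.
func firstN(s string, n int) string {
	runes := []rune(s) // decode the UTF-8 bytes into code points
	if n < 0 {
		n = 0
	}
	if n > len(runes) {
		n = len(runes)
	}
	return string(runes[:n])
}

func main() {
	fmt.Println(firstN("你好世界", 2)) // 你好
	fmt.Println(firstN("Go语言", 3))  // Go语
	fmt.Printf("%T\n", 'a')        // int32
}
```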

Go has a dedicated unicode/utf8 package for working with UTF-8. Since characters can occupy different numbers of bytes, the number of characters in a string and its size in bytes are two different things:

s := "Hello, 世界" // the comma is a half-width (ASCII) comma
fmt.Println(len(s)) // 13
fmt.Println(utf8.RuneCountInString(s)) // 9

To get the total number of bytes in a string, use len. To count its characters, use utf8.RuneCountInString.

This package also provides other common functions:

// Check whether the input is valid UTF-8
func Valid(p []byte) bool
func ValidRune(r rune) bool
func ValidString(s string) bool
// Number of bytes needed to encode the rune
func RuneLen(r rune) int
// Number of runes
func RuneCount(p []byte) int
func RuneCountInString(s string) int
// Encoding and decoding of runes
func EncodeRune(p []byte, r rune) int
func DecodeRune(p []byte) (r rune, size int)
func DecodeRuneInString(s string) (r rune, size int)
func DecodeLastRune(p []byte) (r rune, size int)
func DecodeLastRuneInString(s string) (r rune, size int)

In addition to the utf8 package, the unicode package provides a series of IsXxx functions for checking runes:

func Is(rangeTab *RangeTable, r rune) bool // whether r is in the given range table
func In(r rune, ranges ...*RangeTable) bool // whether r is in any of the ranges
func IsControl(r rune) bool // whether r is a control character
func IsDigit(r rune) bool // whether r is a decimal digit
func IsGraphic(r rune) bool // whether r is a graphic character
func IsLetter(r rune) bool // whether r is a letter
func IsLower(r rune) bool // whether r is a lowercase letter
func IsMark(r rune) bool // whether r is a mark character
func IsNumber(r rune) bool // whether r is a number character
func IsOneOf(ranges []*RangeTable, r rune) bool // whether r is in one of the ranges
func IsPrint(r rune) bool // whether r is printable
func IsPunct(r rune) bool // whether r is a punctuation mark
func IsSpace(r rune) bool // whether r is a white space character
func IsSymbol(r rune) bool // whether r is a symbol character
func IsTitle(r rune) bool // whether r is a title case letter
func IsUpper(r rune) bool // whether r is an uppercase letter

A RangeTable describes a class of Unicode characters. For example, to check whether a character is a Chinese character:

r := '世' // sample value, assumed here because the original does not show r
result := unicode.Is(unicode.Han, r)
fmt.Println(result) // true

unicode.Han is a *RangeTable that covers the characters of the Han script, that is, Chinese characters.

Text by Rayjun