Basic concept

ASCII is the American Standard Code for Information Interchange, a single-byte character encoding scheme developed by the American National Standards Institute (ANSI). It defines 128 characters, each of which fits in the binary value of a single byte.

The Unicode encoding specification assigns a unique code point to every character it covers, with the goal of covering all of the world's writing systems. It is compatible with ASCII (the first 128 code points are the ASCII characters) and removes ASCII's limitation to unaccented Latin letters, digits, and a few symbols. Code points are usually written in hexadecimal, and the specification provides three encoding formats: UTF-8, UTF-16, and UTF-32.

UTF-8 uses eight bits (one byte) as its encoding unit. It is a variable-width encoding scheme that represents a character as one to four bytes. An English letter takes only one byte, while a common Chinese character takes three bytes.
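To make the variable width concrete, here is a minimal sketch using the standard unicode/utf8 package. The helper name encodedBytes and the sample characters are chosen only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedBytes returns the UTF-8 encoding of a single rune.
// (Hypothetical helper, named here for illustration only.)
func encodedBytes(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest possible encoding
	n := utf8.EncodeRune(buf, r)     // n is the number of bytes actually written
	return buf[:n]
}

func main() {
	// One sample character for each possible width, from one to four bytes.
	for _, r := range []rune{'G', 'é', '爱', '😀'} {
		fmt.Printf("%c: %d byte(s) [% x]\n", r, utf8.RuneLen(r), encodedBytes(r))
	}
}
```

Running it should show widths of 1, 2, 3, and 4 bytes respectively, with '爱' encoded as e7 88 b1.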

rune is a basic data type in Go (an alias for int32) whose value represents a single Unicode code point, such as '爱' or 'M'. A value of type rune occupies four bytes, which is always enough to hold any Unicode code point and matches the maximum length of a UTF-8 encoded value.
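The following sketch checks these properties directly: a rune literal has the underlying type int32, holds a code point, and occupies four bytes. The helper name describeRune is an assumption made for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// describeRune reports the type, the code point, and the in-memory size
// (in bytes) of a rune value. (Hypothetical helper for illustration.)
func describeRune(r rune) string {
	return fmt.Sprintf("%T %U %d", r, r, unsafe.Sizeof(r))
}

func main() {
	fmt.Println(describeRune('M'))  // int32 U+004D 4
	fmt.Println(describeRune('爱')) // int32 U+7231 4
}
```

Note that %T prints int32 rather than rune, because rune is only an alias and not a separate type.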

String encoding

When a rune is stored in a string, it is held at the bottom level as a UTF-8 encoded value. The rune is the external representation (convenient for us humans), and the UTF-8 encoded value is the internal representation (convenient for computer systems), as shown in the following code:

str := "Go lovers"

fmt.Printf("The string: %q\n", str)

fmt.Printf("runes(char): %q\n"And []rune(str))   //['G' 'o']

fmt.Printf("runes(hex): %x\n"And []rune(str))    //[47 6f 7231 597d 8005]

fmt.Printf("bytes(hex): [% x]\n"And []byte(str)) //[47 6f e7 88 b1 e5 a5 bd e8 80 85]


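The runes(char) output shows each character of the string as a rune and needs no further explanation. The runes(hex) output shows the same runes as hexadecimal code points; a code point such as 7231 is what UTF-8 encodes into three bytes. The bytes(hex) output expands every character into its UTF-8 byte sequence, which is why the three Chinese characters account for nine of the eleven bytes.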


To sum up: A string value is, at bottom, a sequence of bytes that can represent several UTF-8 encoded values.

Traversing a string

The following code traverses the same string with for range and with an index-based for loop:

str := "Go lovers"

fmt.Printf("The range traversal: \ n")

for i, c := range str {

 fmt.Printf("%d: %q [% x]\n", i, c, []byte(string(c)))

}

fmt.Printf("For traversal: \ n")

for i :=0; i < len(str); i++ {

 fmt.Printf("%d: [%c] [%x]\n", i, str[i], str[i])

}


The output is as follows:

The range traversal:
0: 'G' [47]
1: 'o' [6f]
2: '爱' [e7 88 b1]
5: '好' [e5 a5 bd]
8: '者' [e8 80 85]
For traversal:
0: [G] [47]
1: [o] [6f]
2: [ç] [e7]
3: [ˆ] [88]
4: [±] [b1]
5: [å] [e5]
6: [¥] [a5]
7: [½] [bd]
8: [è] [e8]
9: [€] [80]
10: [] [85]


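As the output shows, for range traverses the string rune by rune, so the index values of adjacent characters are not necessarily consecutive: each index is the byte offset at which the character starts. The index-based for loop traverses the string byte by byte, so a multi-byte character is split into bytes that are not meaningful characters on their own, and how a terminal displays them varies.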

Type conversion

Strings cannot be modified in place. To change one, convert it to a mutable type ([]rune or []byte), modify that, and convert it back. In general, each conversion allocates new memory and copies the data.

str := "hello, world!"
bs := []byte(str)  // string to []byte
str2 := string(bs) // []byte to string
rs := []rune(str)  // string to []rune
str3 := string(rs) // []rune to string

string, []rune, and []byte are distinct types, but they are closely related: all three can hold the same text and can be converted to one another.
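As a sketch of the convert, modify, convert back pattern, the hypothetical helper replaceRune below replaces one character by position. It goes through []rune so that positions are counted in characters rather than bytes; the original string is never changed because each conversion works on a copy.

```go
package main

import "fmt"

// replaceRune returns a copy of s with the rune at position pos (counted in
// runes, not bytes) replaced by r. If pos is out of range, s is returned as is.
// (Hypothetical helper for illustration.)
func replaceRune(s string, pos int, r rune) string {
	rs := []rune(s) // copies the characters into a new, mutable slice
	if pos < 0 || pos >= len(rs) {
		return s
	}
	rs[pos] = r
	return string(rs) // copies again into a new string
}

func main() {
	s := "hello, world!"
	t := replaceRune(s, 0, 'H')
	fmt.Println(s) // hello, world!
	fmt.Println(t) // Hello, world!
}
```

Using []byte instead would be enough for ASCII-only text, but overwriting a single byte inside a multi-byte character would produce invalid UTF-8.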

Conclusion

Go source code is made up of Unicode characters and must be encoded and stored in UTF-8, one of the encoding formats of the Unicode encoding specification, which defines the conversion between characters and sequences of bytes. UTF-8 is a variable-width encoding scheme that represents a character as one to four bytes.

A string value in Go consists of several Unicode characters, each of which can be carried by a value of type rune. At the bottom level these characters are converted to UTF-8 encoded values, which are expressed and stored as sequences of bytes. Thus, a string value is, at bottom, a sequence of bytes representing several UTF-8 encoded values.

When a string is traversed with for range, the string is treated as a sequence of bytes, and the loop tries to find each UTF-8 encoded value, that is, each Unicode character, contained in that sequence. The index values of adjacent Unicode characters are not necessarily consecutive; that depends on whether the preceding character occupies a single byte. This is no longer confusing once we understand the underlying mechanism.

The Unicode encoding specification and UTF-8 encoding format are fundamental to the Go language, and we should understand their importance to the Go language. This will help us understand the relevant data types in the Go language and write related programs in the future.
