How do computers represent characters?
Computers work in binary, so every character is ultimately converted to binary before it is stored. A character set assigns a numeric value to each character. Unicode is a character set: it specifies a number (a code point) for each character, but it does not specify how that number is stored in binary. UTF-8 specifies the binary storage of Unicode values.
UTF-8 is a variable-length character encoding. Different characters occupy different numbers of bytes, for example 1 byte for “a” (Unicode 97) and 3 bytes for “中” (Unicode 20013). The Unicode value of a character determines how many bytes are needed to represent it.
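As a rough illustration of the variable-length rule, the sketch below asks the standard unicode/utf8 package how many bytes a few code points need. The helper name encodedLen and the extra sample characters are only illustrative assumptions, not part of the original article.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen (hypothetical helper) reports how many bytes UTF-8
// needs to store the code point r.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// 'a' is U+0061 (decimal 97), '中' is U+4E2D (decimal 20013).
	for _, r := range []rune{'a', 'é', '中', '😀'} {
		fmt.Printf("U+%04X needs %d byte(s)\n", r, encodedLen(r))
	}
}
```

Larger code points need more bytes, up to a maximum of 4 in UTF-8.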
What is a string?
In Go, a string is a read-only slice of bytes that normally holds UTF-8 encoded text, so len returns the number of bytes, not the number of characters. Indexing a string inside a for loop likewise yields one byte at a time.
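a := "Randal"
for i := 0; i < len(a); i++ {
	fmt.Printf("%x ", a[i]) // hex value of each byte
}
// Output: 52 61 6e 64 61 6c
for i := 0; i < len(a); i++ {
	fmt.Printf("%c ", a[i]) // each byte printed as a character
}
// Output: R a n d a l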
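a := "中国"
fmt.Println(len(a)) // 6: each of the two characters takes 3 bytes
for i := 0; i < len(a); i++ {
	fmt.Printf("%x ", a[i])
}
// Output: e4 b8 ad e5 9b bd
for i := 0; i < len(a); i++ {
	fmt.Printf("%c ", a[i])
}
// Output: one garbled character per byte, for example ä for 0xe4 and ½ for 0xbd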
The fmt.Printf function produces formatted output from a list of expressions. Its first argument is a format string that specifies how the remaining arguments are formatted. The %c verb prints the character for a Unicode code point, that is, for the Unicode value of a character. Go source code is UTF-8, and “中” is encoded in UTF-8 as E4 B8 AD (which represents the Unicode value U+4E2D). Printing the first byte on its own with %c treats 0xE4 as the code point U+00E4, which is the character ä (https://unicode-table.com/en/00E4/).
As the example above shows, if the UTF-8 encoding of a character is longer than 1 byte, printing it one byte at a time produces garbled output. To solve the garbling problem, use rune.
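The difference between a byte and a code point can be sketched with the standard unicode/utf8 package. This is a minimal example, assuming the sample string “中国”; the helper name firstRune is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune (hypothetical helper) decodes the first UTF-8 sequence in s
// and returns the code point together with its width in bytes.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	s := "中国"

	// Indexing yields a single byte (0xE4), which %c prints as U+00E4.
	fmt.Printf("%c\n", s[0]) // ä

	// Decoding consumes all three bytes of the first character.
	r, size := firstRune(s)
	fmt.Printf("%c %d\n", r, size) // 中 3

	// len counts bytes; RuneCountInString counts code points.
	fmt.Println(len(s), utf8.RuneCountInString(s)) // 6 2
}
```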
What is a rune?
rune is an alias for int32 and represents the Unicode code point of a character. It is stored in 4 bytes. Converting a string to []rune stores the Unicode value of every character in 4 bytes, so each step of a traversal yields a whole Unicode value instead of a single byte. This solves the garbling problem.
var s string
s = "中国"
r := []rune(s)
for i := 0; i < len(r); i++ {
	fmt.Printf("%x ", r[i]) // code point of each character in hex
}
// Output: 4e2d 56fd
for i := 0; i < len(r); i++ {
	fmt.Printf("%c", r[i])
}
// Output: 中国
A for range loop over a string decodes it one code point at a time, so each value it yields is of type rune. The following approach therefore also avoids garbled characters.
var s string
s = "中国"
for _, item := range s {
	fmt.Printf("%c", item)
}
// Output: 中国
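One detail worth illustrating is that the index produced by for range is a byte offset, not a character position. The sketch below assumes a mixed ASCII and Chinese sample string; the helper name runeOffsets is hypothetical.

```go
package main

import "fmt"

// runeOffsets (hypothetical helper) returns the byte offset at which
// each character of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	// 'a' takes 1 byte, '中' and '国' take 3 bytes each.
	for i, r := range "a中国" {
		fmt.Printf("byte offset %d: %c (U+%04X)\n", i, r, r)
	}
	fmt.Println(runeOffsets("a中国")) // [0 1 4]
}
```

The offsets jump from 1 to 4 because the second character occupies three bytes.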
What is bytes?
The bytes package operates on byte slices. Unlike a string, which is immutable, a byte slice is mutable. Building a string incrementally with string concatenation therefore causes repeated memory allocations and copies, whereas appending to a bytes.Buffer is usually more efficient.
package main

import (
	"bytes"
	"fmt"
)

func main() {
	var s string
	s = "China"
	var b bytes.Buffer
	b.WriteString("China")
	for i := 0; i < 10; i++ {
		s += "a"
		b.WriteString("a")
	}
	fmt.Println(s)
	fmt.Println(b.String())
}
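To make the mutable versus immutable distinction concrete, here is a minimal sketch. The helper repeatWithBuffer and the sample strings are assumptions made for illustration; no performance measurement is implied.

```go
package main

import (
	"bytes"
	"fmt"
)

// repeatWithBuffer (hypothetical helper) builds n copies of piece by
// appending to a bytes.Buffer instead of concatenating strings.
func repeatWithBuffer(piece string, n int) string {
	var b bytes.Buffer
	for i := 0; i < n; i++ {
		b.WriteString(piece)
	}
	return b.String()
}

func main() {
	s := "hello"
	// s[0] = 'H' would not compile: strings are immutable.

	b := []byte(s) // the conversion copies the bytes into a mutable slice
	b[0] = 'H'
	fmt.Println(s, string(b)) // hello Hello

	fmt.Println(repeatWithBuffer("ab", 3)) // ababab
}
```

The original string is unchanged because the conversion to []byte works on a copy.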
Resources and tools
UTF-8
String Data Type in Go
byte vs String in Go