string
To explain the rune type, we first need to understand the string type. A string is essentially a sequence of 8-bit bytes, so it can be converted directly to the []byte type:
s := "abc"
fmt.Println([]byte(s))
// output
// [97 98 99]
The code above prints the number of each character in the ASCII table. You can print not only the number but also the corresponding character:
s := "abc"
for _, v := range []byte(s) {
	fmt.Printf("%c, %v\n", v, v)
}
// output
// a, 97
// b, 98
// c, 99
We can say that a string is essentially a read-only sequence of bytes ([]byte).
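As a small sketch of what "a string is bytes" means in practice, the example below indexes a string and converts it to []byte. The helper name bytesOf is made up for this illustration and is not part of any library.

```go
package main

import "fmt"

// bytesOf is a hypothetical helper that returns a copy of the bytes in s.
func bytesOf(s string) []byte {
	return []byte(s)
}

func main() {
	s := "abc"
	b := bytesOf(s)

	// Indexing a string yields a single byte (uint8), not a character.
	fmt.Println(s[0], b[0]) // 97 97

	// The conversion copies the data: changing b leaves s untouched.
	b[0] = 'z'
	fmt.Println(s, string(b)) // abc zbc
}
```

The string itself cannot be modified in place, which is why the conversion to []byte has to make a copy.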
Go source code itself is UTF-8 encoded (source files that are not valid UTF-8 do not compile). So in addition to ASCII, Go can also print Unicode characters and their code points.
for _, v := range "汉字" {
	fmt.Printf("character %c,unicode %U \n", v, v)
}
// output:
// character 汉,unicode U+6C49
// character 字,unicode U+5B57
Unicode and UTF-8
To understand rune, we also need to understand Unicode and UTF-8.
Unicode and UTF-8:
- Unicode is a character set: a table that assigns a number to every character. As of v13.0 it contains 143,859 characters.
- UTF-8 is one encoding of Unicode.
- UTF-8 encodes each character using 1 to 4 bytes.
This diagram shows the encoding rules for UTF-8
- A code point is the number assigned to a character in Unicode.
- Code points encoded in one byte range from 0 to 7F (0 to 127 in decimal). This range is fully compatible with the ASCII table.
- Code points encoded in two bytes range from 80 to 7FF (128 to 2047 in decimal). This range covers extended Latin letters and several other alphabets.
- Chinese characters occupy code points 4E00 to 9FFF, 20,992 in total. They fall under the third encoding rule and take 3 bytes each.
Those interested can check out the Unicode encoding table at unicode-table.com/en/blocks/
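The byte-count rules above can be sketched in code. The function encodeUTF8 below is a hypothetical helper written for this article, not how the Go runtime does it; it assumes its argument is a valid code point and skips the validation that the standard library performs. The sketch compares its result against utf8.EncodeRune from the unicode/utf8 package.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// encodeUTF8 is a hypothetical helper that applies the UTF-8 rules by hand.
// Assumption: r is a valid Unicode code point (no surrogate or range checks).
func encodeUTF8(r rune) []byte {
	switch {
	case r <= 0x7F: // 1 byte: 0xxxxxxx
		return []byte{byte(r)}
	case r <= 0x7FF: // 2 bytes: 110xxxxx 10xxxxxx
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r&0x3F)}
	case r <= 0xFFFF: // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	default: // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | byte((r>>12)&0x3F),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	}
}

func main() {
	for _, r := range []rune{'a', 0xE9, 0x6C49, 0x1F600} {
		// Encode the same code point with the standard library for comparison.
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r)

		got := encodeUTF8(r)
		fmt.Printf("%U -> % x (matches unicode/utf8: %t)\n", r, got, bytes.Equal(got, buf[:n]))
	}
}
```

In real code, utf8.EncodeRune or a plain string(r) conversion is the better choice; the hand-written version is only meant to make the bit layout visible.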
rune
With strings, Unicode, and UTF-8 covered, we can now discuss what the rune type is and why it exists. Let’s start with some code:
chinese := "汉字"
fmt.Printf("length: %d\n", len(chinese))
What do you think the result is? The answer is 6:
length: 6
Why is the length 6?
As mentioned earlier in this article, a string is essentially a []byte, and each character is represented by one or more of those bytes.
len(string) returns the length of the string in bytes, not in characters.
Under the UTF-8 encoding rules, a Chinese character takes three bytes, so the two characters together take six.
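To see how the byte length adds up, the sketch below lists how many bytes each character contributes, using utf8.RuneLen from the standard library. The helper byteWidths is a made-up name for this illustration, and it assumes the input is valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteWidths is a hypothetical helper that returns the number of UTF-8
// bytes used by each character in s. Assumption: s is valid UTF-8.
func byteWidths(s string) []int {
	var widths []int
	for _, r := range s {
		widths = append(widths, utf8.RuneLen(r))
	}
	return widths
}

func main() {
	s := "a汉字"
	fmt.Println(byteWidths(s)) // [1 3 3]
	fmt.Println(len(s))        // 7
}
```

The widths sum to the value that len reports, which is why an ASCII letter adds 1 to the length while each Chinese character adds 3.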
So how do we count the number of Chinese characters correctly? Try converting the string to []rune:
chinese := []rune("汉字")
fmt.Printf("length: %d\n", len(chinese))
// output
// length: 2
This time the result is 2, which gives the correct character length.
We now know that a rune value is a code point, the number that identifies a character in Unicode. When we convert a string to []rune, the UTF-8 bytes are decoded into code points.
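If only the count is needed, the standard library's utf8.RuneCountInString gives the same number without building a []rune slice. The sketch below compares the three ways of measuring a string; charCount is a hypothetical wrapper name used only here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// charCount is a hypothetical wrapper that counts the code points in s
// without allocating a []rune.
func charCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	s := "汉字"
	fmt.Println(len(s))         // 6 (bytes)
	fmt.Println(len([]rune(s))) // 2 (code points, via conversion)
	fmt.Println(charCount(s))   // 2 (code points, no conversion)
}
```

Note that both approaches count code points. For text with combining marks, that number can still differ from what a reader would perceive as one character.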
range string
Take another look at the range example above, and this time let’s print the index as well.
for i, v := range "汉字" {
	fmt.Printf("character %c,value %d,unicode %U position %d\n", v, v, v, i)
}
// output:
// character 汉,value 27721,unicode U+6C49 position 0
// character 字,value 23383,unicode U+5B57 position 3
Ranging over a string is a special case. The index is not the ordinal position of the character in the string; it is the byte offset at which that character starts.
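One way to see where those offsets come from is to decode the string by hand. The sketch below uses utf8.DecodeRuneInString from the standard library and advances by the size of each decoded character; runeOffsets is a hypothetical helper name, and the loop is an approximation of what range does, not the compiler's actual implementation.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets is a hypothetical helper that returns the byte offset at
// which each character of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := 0; i < len(s); {
		// size is the number of bytes the character at offset i occupies.
		_, size := utf8.DecodeRuneInString(s[i:])
		offsets = append(offsets, i)
		i += size
	}
	return offsets
}

func main() {
	fmt.Println(runeOffsets("汉字")) // [0 3]
}
```

Because each Chinese character occupies three bytes, the second one starts at offset 3 rather than 1.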
conclusion
rune is a data type in Go. It is an alias for int32 and is used to store a Unicode code point.
type rune = int32
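Because rune is an alias rather than a distinct type, rune and int32 values can be used interchangeably without a conversion. The short sketch below illustrates this; codePoint is a made-up helper name.

```go
package main

import "fmt"

// codePoint is a hypothetical helper. It returns r as an int32 with no
// explicit conversion, which compiles only because rune is an alias.
func codePoint(r rune) int32 {
	return r
}

func main() {
	var r rune = '汉'
	var n int32 = r // no conversion needed

	fmt.Printf("%T %d %c\n", r, n, n) // int32 27721 汉
	fmt.Println(codePoint('a'))       // 97
}
```

The alias exists mainly for readability: writing rune signals that a value is meant to be a code point rather than an arbitrary 32-bit integer.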
Ref:
- blog.golang.org/strings
- golang.org/pkg/fmt/
- blog.golang.org/normalizati…
- golang.org/ref/spec#Ru…