string

To explain the rune type, we first need to understand the string type. A string is essentially a sequence of 8-bit bytes, and it can be converted directly to the []byte type:

s := "abc"
fmt.Println([]byte(s))

// output
// [97 98 99]

The code above prints each character's number in the ASCII table. You can print not only the number but also the corresponding character:

s := "abc"
for _, v := range []byte(s) {
    fmt.Printf("%c, %v\n", v, v)
}
// output
// a, 97
// b, 98
// c, 99

We can say that a string is essentially a read-only sequence of bytes ([]byte).
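The byte-level view can be checked directly. The sketch below is illustrative only: `firstByte` and `upperFirst` are hypothetical helpers, and `upperFirst` assumes its argument starts with a lowercase ASCII letter.

```go
package main

import "fmt"

// firstByte returns the byte at index 0 of s.
// Indexing a string yields a byte, not a character.
func firstByte(s string) byte {
	return s[0]
}

// upperFirst shows that a string cannot be changed in place: the bytes are
// copied into a []byte, the copy is modified, and a new string is built.
// Assumption: s starts with a lowercase ASCII letter.
func upperFirst(s string) string {
	b := []byte(s) // the conversion copies the bytes
	b[0] = b[0] - 'a' + 'A'
	return string(b)
}

func main() {
	s := "abc"
	fmt.Println(len(s))        // 3
	fmt.Println(firstByte(s))  // 97
	fmt.Println(upperFirst(s)) // Abc
	fmt.Println(s)             // abc (the original string is unchanged)
}
```

Because the conversion to []byte makes a copy, modifying the copy never affects the original string.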

Go source code is itself UTF-8 encoded (source that is not valid UTF-8 does not compile). Beyond ASCII, Go can also print Unicode characters and their code points:

for _, v := range "汉字" {
    fmt.Printf("character %c,unicode %U \n", v, v)
}
// output:
// character 汉,unicode U+6C49
// character 字,unicode U+5B57

Unicode and UTF-8

To understand rune, we also need to understand Unicode and UTF-8.

Unicode and UTF-8:

  • Unicode is a character table that assigns a number to every character; version 13.0 contains 143,859 characters.
  • UTF-8 is one encoding of Unicode.
  • UTF-8 encodes a character using 1 to 4 bytes.

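The encoding rules for UTF-8 are:

  Code point range (hex)   Byte 1     Byte 2     Byte 3     Byte 4
  0000 - 007F              0xxxxxxx
  0080 - 07FF              110xxxxx   10xxxxxx
  0800 - FFFF              1110xxxx   10xxxxxx   10xxxxxx
  10000 - 10FFFF           11110xxx   10xxxxxx   10xxxxxx   10xxxxxx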

  • A code point is the number assigned to a character in Unicode.

  • Code points encoded in one byte range from 0 to 7F (0 to 127 in decimal). This range is fully compatible with the ASCII table.

  • Code points encoded in two bytes range from 80 to 7FF (128 to 2047 in decimal). This range covers extended Latin letters as well as scripts such as Greek, Cyrillic, Hebrew, and Arabic.

  • Common Chinese characters occupy code points 4E00 to 9FFF, 20,992 code points in total. They fall under the third encoding rule and are encoded in 3 bytes.
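The rules in the list above can be checked with the standard `unicode/utf8` package. The following is a minimal sketch: `encodeThreeByte` is a hypothetical helper written for illustration that applies the three-byte rule by hand, and it assumes its argument lies in the range 0800 to FFFF and is not a surrogate.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeThreeByte applies the three-byte rule (1110xxxx 10xxxxxx 10xxxxxx)
// by hand. Assumption: r is in 0x0800-0xFFFF and is not a surrogate.
func encodeThreeByte(r rune) []byte {
	return []byte{
		0xE0 | byte(r>>12),         // top 4 bits
		0x80 | (byte(r>>6) & 0x3F), // middle 6 bits
		0x80 | (byte(r) & 0x3F),    // low 6 bits
	}
}

func main() {
	// One sample character for each of the four encoded lengths.
	for _, r := range []rune{'a', 'é', '汉', '😀'} {
		fmt.Printf("%c %U -> %d byte(s)\n", r, r, utf8.RuneLen(r))
	}

	// The hand-built bytes decode back to the same character.
	fmt.Println(string(encodeThreeByte('汉'))) // 汉
}
```

In real code, `utf8.EncodeRune` or a plain `string(r)` conversion does this work; the manual version only makes the bit layout visible.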

Those interested can check out the Unicode encoding table at unicode-table.com/en/blocks/

rune

With strings, Unicode, and UTF-8 covered, we can now discuss what the rune type is and why it exists. Let's start with some code:

chinese := "汉字"
fmt.Printf("length: %d\n", len(chinese))

What do you think the result is? The answer is 6:

length: 6

Why is the length 6?

As mentioned earlier, a string is essentially a []byte, and those bytes encode the characters.

len(string) returns the length of the string in bytes.

Under the UTF-8 encoding rules, each Chinese character here takes three bytes, so the two characters have a total length of six.
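To see where the six comes from, the bytes can be printed in hexadecimal. This is a small sketch; `hexBytes` is a hypothetical helper name, and the string 汉字 stands for the two Chinese characters discussed here.

```go
package main

import "fmt"

// hexBytes formats the bytes of s as space-separated hex pairs.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	s := "汉字"
	fmt.Println(len(s))      // 6
	fmt.Println(hexBytes(s)) // e6 b1 89 e5 ad 97
}
```

The first three bytes encode 汉 (U+6C49) and the last three encode 字 (U+5B57).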

So how do we count the number of Chinese characters correctly? Try converting the string to []rune:

chinese := []rune("汉字")
fmt.Printf("length: %d\n", len(chinese))
// output
// length: 2

This time the result is 2, which is the correct number of characters.

We now know that a rune value represents a character's code point, that is, the number assigned to the character in Unicode. Converting a string to []rune decodes its UTF-8 bytes into code points.
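When only the count is needed, the standard library offers `utf8.RuneCountInString`, which counts code points without building a []rune. A minimal sketch, with `charCount` as a hypothetical wrapper name:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// charCount returns the number of code points in s
// without allocating a []rune.
func charCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	s := "汉字"
	fmt.Println(len(s))         // 6 (bytes)
	fmt.Println(len([]rune(s))) // 2 (code points, via conversion)
	fmt.Println(charCount(s))   // 2 (code points, no conversion)
}
```

Both approaches count code points. A character composed of a base letter plus combining marks still counts as several code points.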

range string

Let's revisit the earlier range example, this time printing the index as well.

for i, v := range "汉字" {
    fmt.Printf("character %c,value %d,unicode %U position %d\n", v, v, v, i)
}
// output:
// character 汉,value 27721,unicode U+6C49 position 0
// character 字,value 23383,unicode U+5B57 position 3

Ranging over a string is a little special. The index is not the character's position counted in characters, but the byte offset at which the character starts within the string.
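The byte offsets can be reproduced by decoding the string by hand, which is roughly what a range loop does. In this sketch, `runeOffsets` is a hypothetical helper; `utf8.DecodeRuneInString` is the standard library call that returns a rune and its encoded size.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets returns the byte offset at which each character of s starts,
// stepping through the string one decoded rune at a time.
func runeOffsets(s string) []int {
	var offsets []int
	for i := 0; i < len(s); {
		_, size := utf8.DecodeRuneInString(s[i:])
		offsets = append(offsets, i)
		i += size // advance by the encoded width, 1 to 4 bytes
	}
	return offsets
}

func main() {
	fmt.Println(runeOffsets("a汉字")) // [0 1 4]

	// To index by character instead of by byte, range over a []rune.
	for i, r := range []rune("汉字") {
		fmt.Printf("%d %c\n", i, r) // 0 汉, then 1 字
	}
}
```

Ranging over a []rune costs an extra allocation, so it is worth doing only when character positions are actually needed.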

conclusion

rune is a data type in Go. It is an alias for int32 and is used to store a Unicode code point.

type rune = int32
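Because rune is an alias rather than a distinct type, a rune can be used wherever an int32 is expected, with no conversion. A short sketch, where `next` is a hypothetical helper:

```go
package main

import "fmt"

// next returns the code point that follows r.
// Runes support ordinary integer arithmetic.
func next(r rune) rune {
	return r + 1
}

func main() {
	var r rune = '汉'
	var n int32 = r // no conversion needed: rune and int32 are the same type

	fmt.Println(n)                // 27721
	fmt.Printf("%T\n", r)         // int32
	fmt.Printf("%c\n", next('a')) // b
}
```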

Ref:

  • blog.golang.org/strings
  • golang.org/pkg/fmt/
  • blog.golang.org/normalizati…
  • golang.org/ref/spec#Ru…