Character representation and string traversal in Go

Unlike many other languages, Go has no dedicated character type; a character is just a special use of an integer.

Why? Because byte and rune, the two types Go uses to represent characters, are aliases for integer types. The Go source declares them as follows:

// byte is an alias for uint8 and is equivalent to uint8 in all ways. It is
// used, by convention, to distinguish byte values from 8-bit unsigned
// integer values.
type byte = uint8

// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32

byte is an alias for uint8: it occupies 1 byte and can hold an ASCII character
rune is an alias for int32: it occupies 4 bytes and can hold any Unicode code point

Tips: Unicode starts at 0 and assigns each symbol a number, called a “code point.”
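
As a minimal sketch of the two aliases and of code points, the snippet below assigns a byte to a uint8 and a rune to an int32 without any conversion, then prints the code point of two characters. The helper name codePoint is made up for this illustration; it is not part of the standard library.

```go
package main

import "fmt"

// codePoint formats a rune as its Unicode code point, e.g. U+006A.
// Hypothetical helper, used only for this illustration.
func codePoint(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	var b byte = 'j'
	var u uint8 = b // no conversion needed: byte and uint8 are the same type

	var r rune = '世'
	var i int32 = r // likewise, rune and int32 are the same type

	fmt.Printf("%T %T\n", b, r)                  // prints "uint8 int32"
	fmt.Println(u, i)                            // prints "106 19990"
	fmt.Println(codePoint('j'), codePoint('世')) // prints "U+006A U+4E16"
}
```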

Representation of characters

So how do you represent characters in the Go language?

Go uses single quotation marks to denote a character, for example 'j'.

byte

To store a character as a byte, declare the variable with the type byte (a predeclared type name, not a keyword):

var byteC byte = 'j'

Since byte is just uint8, a byte character can be used directly as an integer. In format strings, %c prints the character and %d prints its integer value:

// Declare byte characters
var byteC byte = 'j'
fmt.Printf("The integer corresponding to the character %c is %d\n", byteC, byteC)
// Output: The integer corresponding to the character j is 106
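
Because a byte is an ordinary integer, character arithmetic works as well. The following sketch converts a lowercase ASCII letter to uppercase by subtracting the distance between 'a' and 'A'. The function name toUpperASCII is an assumption made for this example, and it deliberately handles ASCII letters only.

```go
package main

import "fmt"

// toUpperASCII converts a lowercase ASCII letter to uppercase using plain
// integer arithmetic; any other byte is returned unchanged.
// Hypothetical helper, used only for this illustration.
func toUpperASCII(b byte) byte {
	if b >= 'a' && b <= 'z' {
		return b - ('a' - 'A')
	}
	return b
}

func main() {
	var c byte = 'j'
	fmt.Printf("%c %d\n", c, c)         // prints "j 106"
	fmt.Printf("%c\n", toUpperASCII(c)) // prints "J"
	fmt.Println(c - 'a')                // prints "9", the offset of 'j' from 'a'
}
```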

rune

As with byte, to store a character as a rune, declare the variable with the type rune:

var runeC rune = 'J'

But if you declare a character variable without specifying the type, Go defaults it to rune:

runeC := 'J'
fmt.Printf("Character %c is of type %T\n", runeC, runeC)
// Output: Character J is of type int32

Why two types?

Now, you might ask, why do you need two types when they’re both used to represent characters?

As we know, a byte occupies one byte, so it can represent an ASCII character. UTF-8, however, is a variable-length encoding in which a character takes from 1 to 4 bytes. A single byte clearly cannot hold such a character, and even if you work with several bytes, you cannot tell in advance how many bytes each character occupies.

Therefore, if you slice a Chinese string at an arbitrary byte position, the output is garbled:
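
To make the variable length concrete, here is a small sketch that uses utf8.RuneLen from the standard unicode/utf8 package to report how many bytes each character of a string needs. The helper name byteWidths and the sample string are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteWidths returns the UTF-8 encoded length of every character in s.
// Hypothetical helper, used only for this illustration.
func byteWidths(s string) []int {
	var widths []int
	for _, r := range s {
		widths = append(widths, utf8.RuneLen(r))
	}
	return widths
}

func main() {
	// One character each of 1, 2, 3 and 4 bytes.
	fmt.Println(byteWidths("jπ世😀")) // prints "[1 2 3 4]"
}
```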

testString := "你好，世界"
fmt.Println(testString[:2]) // Garbled output: only the first two of the three bytes of "你" were sliced
fmt.Println(testString[:3]) // Output: 你 (one Chinese character, encoded in three bytes)

This is where rune helps. Converting the string with []rune() yields a slice of Unicode code points, which you can slice by character without worrying about how many bytes each character takes in UTF-8:

testString := "你好，世界"
fmt.Println(string([]rune(testString)[:2])) // Output: 你好
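
The same idea can be wrapped in a small helper that never cuts a character in half and never slices out of range. This is only a sketch: the name firstRunes and its clamping behaviour are assumptions, not something the standard library provides.

```go
package main

import "fmt"

// firstRunes returns the first n characters (runes) of s.
// n is clamped to the range [0, number of runes in s].
// Hypothetical helper, used only for this illustration.
func firstRunes(s string, n int) string {
	r := []rune(s)
	if n < 0 {
		n = 0
	}
	if n > len(r) {
		n = len(r)
	}
	return string(r[:n])
}

func main() {
	fmt.Println(firstRunes("你好，世界", 2)) // prints "你好"
	fmt.Println(firstRunes("Go", 5))        // prints "Go"
}
```

Note that []rune(s) copies the whole string, so for very long strings a range loop that stops early may be the cheaper choice.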

Tips: Unicode, like ASCII, is a character set, and UTF-8 is an encoding.

String traversal

There are two ways to traverse a string: by subscript, or with range.

Subscript traversal

Go stores strings as UTF-8 encoded bytes, so len() returns the length of the string in bytes, not in characters, and indexing the string with a subscript yields a single byte. Therefore, if the string contains multi-byte (non-ASCII) characters, subscript traversal prints garbled characters:

testString := "Hello世界"

for i := 0; i < len(testString); i++ {
	c := testString[i] // c is a single byte (uint8)
	fmt.Printf("The type of %c is %s\n", c, reflect.TypeOf(c))
}

/* Output (ASCII characters print normally; the six bytes of 世界 are
   printed one by one and come out garbled, some of them unprintable):
The type of H is uint8
The type of e is uint8
The type of l is uint8
The type of l is uint8
The type of o is uint8
The type of ä is uint8
The type of ¸ is uint8
The type of  is uint8
The type of ç is uint8
The type of  is uint8
The type of  is uint8
*/
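
To see the difference between bytes and characters directly, the sketch below compares len() with utf8.RuneCountInString and dumps the raw bytes in hexadecimal. The helper name counts and the sample string are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the length of s in bytes and in characters (runes).
// Hypothetical helper, used only for this illustration.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "Hello世界"
	b, r := counts(s)
	fmt.Println(b, r)      // prints "11 7"
	fmt.Printf("% x\n", s) // prints "48 65 6c 6c 6f e4 b8 96 e7 95 8c"
}
```

The last six bytes are the UTF-8 encodings of 世 and 界, three bytes each, which is why reading them one at a time does not produce meaningful characters.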

range

Traversing with range decodes the string as UTF-8 and yields each character as a rune:

testString := "Hello世界"

for _, c := range testString {
	// c is a rune (int32) holding one whole character
	fmt.Printf("The type of %c is %s\n", c, reflect.TypeOf(c))
}

/* Output:
The type of H is int32
The type of e is int32
The type of l is int32
The type of l is int32
The type of o is int32
The type of 世 is int32
The type of 界 is int32
*/
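
The first value returned by range, which the loop above discards with _, is the byte index at which each character starts. A minimal sketch that collects those indexes follows; the helper name runeOffsets is made up for this example.

```go
package main

import "fmt"

// runeOffsets returns the byte index at which each character of s starts.
// Hypothetical helper, used only for this illustration.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	// The index jumps from 5 to 8 because 世 occupies three bytes.
	fmt.Println(runeOffsets("Hello世界")) // prints "[0 1 2 3 4 5 8]"
}
```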

Conclusion

Go has no dedicated character type. A character is stored as one or more bytes: a single byte for ASCII characters, and two to four bytes for other Unicode characters encoded in UTF-8.
byte is an alias for uint8: it occupies 1 byte and can hold an ASCII character
rune is an alias for int32: it occupies 4 bytes and can hold any Unicode code point
Slicing a string works in bytes
Indexing a string with a subscript yields a byte
To traverse a string character by character (as runes), use range

