Background

Hello, everyone, I’m Asong. Today we’re going to look at the rune data type in the Go language. Let’s start with an interview question. Can you answer it quickly?

func main(a)  {
	str := "Golang Dream Factory"
	fmt.Println(len(str))
	fmt.Println(len([]rune(str)))
}

Does it print 15 and 15, or 15 and 9? Think about it for a moment; the answer comes later.

In fact, this is not an interview question but a problem I ran into in day-to-day development. The scenario: the back end has to validate a string sent by the front end, and the product requirement limits it to 200 characters. For the back-end check I simply used len(str) > 200, and that produced a bug: the front-end check confirmed the input did not exceed 200 characters, yet calls to the back-end interface kept failing. Switching to len([]rune(str)) > 200 fixed it. This article explains why.

Unicode and character encoding

Before introducing the rune type, let’s start with some basics: Unicode and character encoding.

  • What is Unicode?

As we all know, computers can only process numbers, so text has to be converted to numbers before it can be processed. Early computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255. One byte is obviously not enough to represent Chinese; at least two bytes are needed, and the encoding must not conflict with ASCII. For this reason China developed the GB2312 encoding for Chinese. But the world has many languages, and if each one defines its own encoding, conflicts are inevitable. Unicode was designed to solve this pain point by unifying all languages into a single character set. To sum up: “Unicode is a character set, which can be understood as a character-to-number mapping: each character is represented by one number, its code point.”

  • What is character encoding?

Although Unicode unifies all languages into a single character set, it does not specify how those numbers are stored as bytes. Take the Chinese character 汉 (hàn) as an example. Its Unicode code point is 0x6C49, which is 110110001001001 in binary. That is 15 bits, so it needs at least 2 bytes. Characters further along in the Unicode table need more bits still, and therefore 3 or more bytes.

This leads to a problem: how does the computer know that two bytes represent one character rather than two separate characters? One option is a fixed width: if the largest Unicode character fits in 4 bytes, store every character in 4 bytes and pad the shorter ones with leading zeros. This does solve the encoding problem, but it wastes a great deal of space. An English document, which needs only 1 byte per character, would become four times as large, which is obviously unacceptable.

Therefore, to encode Unicode more efficiently, two encodings became popular: UTF-8 and UTF-16. UTF-8 is the most widely used Unicode encoding on the Internet. Its key feature is that it is variable-length: it uses 1 to 4 bytes per character, depending on the code point. In UTF-8, an English letter takes one byte and a Chinese character typically takes three bytes.
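To make the variable-length idea concrete, here is a minimal sketch that encodes two code points with the standard unicode/utf8 package and prints the resulting bytes. The helper name utf8Bytes is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf8Bytes (hypothetical helper) returns the UTF-8 encoding of one code point.
func utf8Bytes(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest possible encoding
	n := utf8.EncodeRune(buf, r)     // n is the number of bytes actually written
	return buf[:n]
}

func main() {
	for _, r := range []rune{'A', '汉'} {
		fmt.Printf("%c %U -> % x (%d bytes)\n", r, r, utf8Bytes(r), utf8.RuneLen(r))
	}
	// Output:
	// A U+0041 -> 41 (1 bytes)
	// 汉 U+6C49 -> e6 b1 89 (3 bytes)
}
```

The 15-bit code point 0x6C49 ends up as three bytes because UTF-8 spends some bits of every byte on markers that tell a decoder where each character starts and how long it is.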

Strings in the Go language

Basic concepts

Let’s take a look at the official definition of string:

// string is the set of all strings of 8-bit bytes, conventionally but not
// necessarily representing UTF-8-encoded text. A string may be empty, but
// not nil. Values of string type are immutable.
type string string

In other words:

A string is a sequence of 8-bit bytes, usually but not necessarily representing UTF-8 encoded text. A string can be empty, but not nil. The value of a string cannot be changed.

In layman’s terms, a string is really a read-only slice of bytes: underneath it is a byte array, but that array is read-only and cannot be modified.

To verify this, write an example:

func main() {
	byte1 := []byte("Hl Asong!")
	byte1[1] = 'i' // fine: a byte slice is mutable

	str1 := "Hl Asong!"
	str1[1] = 'i' // compile error: strings are immutable
}

The assignment to the byte slice is accepted, but the assignment to the string fails to compile:

cannot assign to str1[1]

So a string cannot be modified in place; it can only be replaced by a new string.

From the analysis above we can also see that the characters in a string are stored as bytes, so what is ultimately stored is numbers.

String encoding in the Go language

Now that we’ve introduced the basic concepts of strings, let’s look at how strings are encoded in the Go language.

Go source code is UTF-8 encoded, so string literals in the source are UTF-8 text. Strings in Go are therefore normally UTF-8 encoded, although a string value can hold arbitrary bytes.
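A small sketch of that distinction: a literal typed into the source is valid UTF-8, while a string built from byte escapes may not be. The wrapper name isUTF8 is made up for this illustration; utf8.ValidString is the standard library call doing the work.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isUTF8 (hypothetical helper) reports whether s holds well-formed UTF-8 text.
func isUTF8(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	literal := "梦"    // source files are UTF-8, so this literal is stored as 3 bytes
	raw := "\xe6\xa2" // byte escapes: the same encoding with its last byte cut off

	fmt.Println(len(literal), isUTF8(literal)) // 3 true
	fmt.Println(len(raw), isUTF8(raw))         // 2 false
}
```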

Looping over strings in Go

Go can loop over a string either with range or with an index-based loop. Let’s write an example to see the difference between the two:

func main(a)  {
	str := "Golang Dream Factory"
	for k,v := range str{
		fmt.Printf("v type: %T index,val: %v,%v \n",v,k,v)
	}
	for i:=0 ; i< len(str) ; i++{
		fmt.Printf("v type: %T index,val:%v,%v \n",str[i],i,str[i])
	}
}

Running results:

v type: int32 index,val: 0,71
v type: int32 index,val: 1,111
v type: int32 index,val: 2,108
v type: int32 index,val: 3,97
v type: int32 index,val: 4,110
v type: int32 index,val: 5,103
v type: int32 index,val: 6,26790
v type: int32 index,val: 9,24037
v type: int32 index,val: 12,21378
v type: uint8 index,val:0,71
v type: uint8 index,val:1,111
v type: uint8 index,val:2,108
v type: uint8 index,val:3,97
v type: uint8 index,val:4,110
v type: uint8 index,val:5,103
v type: uint8 index,val:6,230
v type: uint8 index,val:7,162
v type: uint8 index,val:8,166
v type: uint8 index,val:9,229
v type: uint8 index,val:10,183
v type: uint8 index,val:11,165
v type: uint8 index,val:12,229
v type: uint8 index,val:13,142
v type: uint8 index,val:14,130

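From the output we can draw the following conclusion:

Index-based traversal retrieves individual bytes (uint8), which correspond to characters only for ASCII text. Range traversal decodes the UTF-8 and retrieves Unicode characters (int32), and its index is the byte position where each character starts, which is why it jumps from 6 to 9 to 12.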

What is the rune data type?

The official definition of rune is as follows:

// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32

In other words:

rune is an alias for int32, equivalent to int32 in every way, and by convention it is used to distinguish character values from integer values.

In plain English, a rune value represents one Unicode character (code point). Since UTF-8 uses 1 to 4 bytes to encode a character, the 4-byte int32 type is large enough to hold any of them.
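A short sketch of what "alias" means in practice: a rune can be assigned to an int32 variable without any conversion, and the same value can be printed as a character, a code point, or a plain number. The helper name describe is made up for this illustration.

```go
package main

import "fmt"

// describe (hypothetical helper) formats a rune as its character,
// its code point, and its integer value.
func describe(r rune) string {
	return fmt.Sprintf("%c %U %d", r, r, r)
}

func main() {
	var r rune = '梦'
	var i int32 = r // no conversion needed: rune and int32 are the same type

	fmt.Printf("%T %T\n", r, i) // int32 int32
	fmt.Println(describe(r))    // 梦 U+68A6 26790
	fmt.Println(string(r))      // 梦
}
```

The value 26790 is the same number the range loop printed for the seventh character in the earlier output.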

The answer

Now let’s wrap up by returning to the question we started with. For convenience, here it is again:

func main(a)  {
	str := "Golang Dream Factory"
	fmt.Println(len(str))
	fmt.Println(len([]rune(str)))
}

The correct answers to this question are 15 and 9.

Specific reasons:

len() returns the length of a string in bytes, whereas a rune represents one Unicode character, so the length of the rune slice is the number of characters. In UTF-8, each English letter takes 1 byte and each of these Chinese characters takes 3 bytes, so the results are 15 and 9.

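The original post illustrated this with a picture. In short, “Golang” contributes 6 bytes and 6 characters, and “梦工厂” contributes 9 bytes and 3 characters.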

The unicode/utf8 library

If you would rather not convert to a rune slice, you can use the Go standard library package unicode/utf8, which provides a variety of functions for working with runes. In this example, utf8.RuneCountInString returns the number of characters. The package’s other functions are left for you to explore; this article will not cover them in detail.
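Applied to the 200-character check from the beginning of the article, a minimal sketch might look like this. The names maxChars and withinLimit are made up for this illustration, and rejecting invalid UTF-8 is an extra assumption that the original check did not include. Note that utf8.RuneCountInString counts code points.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// maxChars mirrors the 200-character product limit described earlier.
const maxChars = 200

// withinLimit (hypothetical helper) reports whether s is valid UTF-8 and
// contains at most maxChars characters, counted as code points.
func withinLimit(s string) bool {
	return utf8.ValidString(s) && utf8.RuneCountInString(s) <= maxChars
}

func main() {
	s := strings.Repeat("梦", 100) // 100 characters, 300 bytes

	fmt.Println(len(s) > maxChars) // true: the byte-based check wrongly rejects it
	fmt.Println(withinLimit(s))    // true: the character-based check accepts it
}
```

Unlike len([]rune(s)), utf8.RuneCountInString does not need to allocate a rune slice just to count.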

Conclusion

To summarize the article:

  • Go source code is always UTF-8.
  • A Go string can contain arbitrary bytes; underneath, it is a read-only byte array.
  • Go strings can be looped over: an index-based loop retrieves bytes, which correspond to characters only for ASCII text, while a range loop retrieves Unicode characters.
  • Go provides the rune type to distinguish character values from integer values; one rune value represents one Unicode character.
  • In Go, len() returns the length of a string in bytes. To get the number of characters, use utf8.RuneCountInString or convert the string to a rune slice and take its length; both give the desired result.

Well, that’s all for this article. Your shares, likes, and reads are what motivate me to keep creating quality content!

I am Asong, an ordinary programmer. Let’s get stronger together. Thanks for following, and see you next time.