【Go】 Some thoughts on efficient substring extraction

The original link: blog.thinkeridea.com/201910/go/e…

I recently came across the thread "[SOLVED] String size of 20 character" on the Go Forum. “hollowaykeanho” gave the answer there, and from it I learned that the usual way of taking a substring is not the best one. So I ran a series of experiments and arrived at some efficient ways to take substrings. This article walks through what I did, step by step.

Byte slicing

This is exactly the first solution “hollowaykeanho” offered, and I suspect it is the first one most people think of: use Go's built-in slice syntax to take a substring:

s := "abcdef"
fmt.Println(s[1:4])

We quickly find that this slices by byte. For ASCII-only text there is no better approach. Chinese characters, however, usually take multiple bytes (3 bytes each in UTF-8), so the following program produces garbled output:

s := "Go 语言"
fmt.Println(s[1:4])

The killer move – []rune type conversion

The second solution given by “Hollowaykeanho” is to convert the string to []rune, slice it, and convert the result to a string.

s := "Go 语言"
rs := []rune(s)
fmt.Println(string(rs[1:4]))

First of all, we now get the correct result, which is the biggest step forward. However, I have always been wary of type conversions and worried about their performance, so I looked for a better answer in search engines and on various forums. This was the approach I found most often; it seemed to be the only one.

I wrote a benchmark to evaluate its performance:

package benchmark

import (
	"testing"
)

var benchmarkSubString = Go is a statically strongly typed, compiled, synced, garbage collector programming language developed by Google. It is sometimes called Golang for ease of search and identification.
var benchmarkSubStringLength = 20

func SubStrRunes(s string, length int) string {
	if utf8.RuneCountInString(s) > length {
		rs := []rune(s)
		return string(rs[:length])
	}

	return s
}

func BenchmarkSubStrRunes(b *testing.B) {
	for i := 0; i < b.N; i++ {
		SubStrRunes(benchmarkSubString, benchmarkSubStringLength)
	}
}

I got something that surprised me:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRunes-8   	  872253	      1363 ns/op	     336 B/op	       2 allocs/op
PASS
ok  	github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark	2.120s

It took about 1.3 microseconds to take the first 20 characters of a 69-character string, far more than I expected. The cause is the type conversion: it allocates memory (2 allocations and 336 bytes per operation in the result above), it produces a new string, and the conversion itself takes a lot of computation.

A lifeline – utf8.DecodeRuneInString

I wanted to get rid of the extra computation and memory allocation caused by the type conversion. I combed through the strings package and found nothing suitable. Then I thought of the utf8 package, which provides tools for working with multi-byte characters. I read through all of its documentation and found that utf8.DecodeRuneInString decodes a single character and also returns the number of bytes that character occupies. I tried something like this:

package benchmark

import (
	"testing"
	"unicode/utf8"
)

var benchmarkSubString = Go is a statically strongly typed, compiled, synced, garbage collector programming language developed by Google. It is sometimes called Golang for ease of search and identification.
var benchmarkSubStringLength = 20

func SubStrDecodeRuneInString(s string, length int) string {
	var size, n int
	for i := 0; i < length && n < len(s); i++ {
		_, size = utf8.DecodeRuneInString(s[n:])
		n += size
	}

	return s[:n]
}

func BenchmarkSubStrDecodeRuneInString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		SubStrDecodeRuneInString(benchmarkSubString, benchmarkSubStringLength)
	}
}

I ran it and got something that surprised me:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrDecodeRuneInString-8   	10774401	       105 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark	1.250s

That is 13 times faster than the []rune conversion, with no memory allocation at all. I was thrilled, and could not wait to reply to “hollowaykeanho” to tell him I had found a better method, along with the benchmarks to back it up.

Still a little excited, I went on browsing all sorts of interesting questions on the forum. While reading the answers to one of them (I forget which -_-||), I was surprised to discover yet another approach.

Good medicine doesn’t have to be bitter – range string iteration

Many people seem to forget that range iterates over characters, not bytes. I immediately tried to take advantage of this feature by writing the following use case:

Package Benchmark Import ("testing") var benchmarkSubString = "Go "is a Google programming language that is statically strongly typed, compiled, and combined with garbage collection. It is sometimes called Golang for ease of search and identification. var benchmarkSubStringLength = 20 func SubStrRange(s string, length int) string { var n, i int for i = range s { if n == length { break } n++ } return s[:i] } func BenchmarkSubStrRange(b *testing.B) { for i := 0; i < b.N; i++ { SubStrRange(benchmarkSubString, benchmarkSubStringLength) } }Copy the code

I ran it. It felt almost magical, and the result did not disappoint me.

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRange-8   	12354991	        91.3 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark	1.233s

It is only a 13% improvement, but the code is simple and easy to understand; it seemed to be exactly the medicine I was looking for.

If you think this is the end, no, this is just the beginning for me.

The ultimate moment – build your own wheel

After swallowing range's sweet bowl of medicine, I calmed down a little. I wanted to build a wheel of my own, one that was easier to use and more efficient.

So I took a careful look at the two optimized versions. Both boil down to finding the byte index at which a given number of characters ends. If I could provide such a function, could users take a substring with something as simple as s[:strIndex(20)]? I spent two days puzzling over how to offer an easy-to-use interface.

I then created exutf8.RuneIndexInString and exutf8.RuneIndex, which compute the index at which the specified number of characters ends, in a string and in a byte slice respectively.

I implemented a substring benchmark with exutf8.RuneIndexInString:

package benchmark

import (
	"testing"
	"unicode/utf8"

	"github.com/thinkeridea/go-extend/exunicode/exutf8"
)

var benchmarkSubString = Go is a statically strongly typed, compiled, synced, garbage collector programming language developed by Google. It is sometimes called Golang for ease of search and identification.
var benchmarkSubStringLength = 20

func SubStrRuneIndexInString(s string, length int) string {
	n, _ := exutf8.RuneIndexInString(s, length)
	return s[:n]
}

func BenchmarkSubStrRuneIndexInString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		SubStrRuneIndexInString(benchmarkSubString, benchmarkSubStringLength)
	}
}

I tried to run it and was pleased with the results:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRuneIndexInString-8   	13546849	        82.4 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark	1.213s

That is about 10% faster than the range version. I was glad to squeeze out another improvement, which showed that the approach works.

It is efficient enough, but not convenient enough: I need two lines of code to truncate a string, and four lines if I want characters 10 through 20. That is not an easy-to-use interface. Looking at the sub_string functions of other languages, I decided I should offer users a similar interface.

exutf8.RuneSubString and exutf8.RuneSub are the functions I wrote after careful thought:

func RuneSubString(s string, start, length int) string

It takes three arguments:

s: the input string.
start: the start position. If start is non-negative, the returned string starts at position start in s, counting from 0. For example, in the string “abcdef”, the character at position 0 is “a”, the character at position 2 is “c”, and so on. If start is negative, the returned string starts at the start-th character counting back from the end of s. If s is shorter than start, an empty string is returned.
length: the length to take. If length is positive, the returned string contains at most length characters starting from start (depending on the length of s). If length is negative, that many characters at the end of s are omitted (when start is negative, the start position is counted from the end of s first). If start falls outside s, an empty string is returned. If length is 0, the returned substring runs from start to the end of s.

I created exstrings.SubString and exbytes.Sub as easily searchable alias methods.

Finally, I needed one more benchmark to make sure it performs well:

package benchmark

import (
	"testing"

	"github.com/thinkeridea/go-extend/exunicode/exutf8"
)

var benchmarkSubString = Go is a statically strongly typed, compiled, synced, garbage collector programming language developed by Google. It is sometimes called Golang for ease of search and identification.
var benchmarkSubStringLength = 20

func SubStrRuneSubString(s string, length int) string {
	return exutf8.RuneSubString(s, 0, length)
}

func BenchmarkSubStrRuneSubString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		SubStrRuneSubString(benchmarkSubString, benchmarkSubStringLength)
	}
}

I ran it, and it did not let me down:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRuneSubString-8   	13309082	        83.9 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark	1.215s

exutf8.RuneSubString is slightly slower than exutf8.RuneIndexInString, but it offers an interface that is easy to use, and I think it is the most practical solution. If you are after the last bit of speed, you can still use exutf8.RuneIndexInString, which remains the fastest option.

Conclusion

When you come across code that looks questionable, even very simple code, it is still worth digging into. The exploration is not tedious; it is rewarding.

Going from []rune conversion to my own wheel gave me a 16-fold performance improvement. Along the way I also got to know the utf8 package, gained a better understanding of how range iterates over strings, and added several practical, efficient functions to the go-extend repository so that more go-extend users can benefit from them.

Go-extend is a repository of practical and efficient methods. If you have good functions and common and efficient solutions, please send me Pull Requests. You can also use this repository to speed up functionality and improve performance.

Reprint:

Author: Qi Yin (thinkeridea)

Link to this article: blog.thinkeridea.com/201910/go/e…

Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0 CN. Please credit the source when reprinting!
