Recently I came across the thread “[SOLVED] String size of 20 character” on the Go Forum. “Hollowaykeanho” gave the answer there, and I realized that the ways of truncating a string it proposed were not the best solution. So I ran a series of experiments and arrived at an efficient way to take substrings. This article walks you through what I did, step by step.

Slicing by bytes

This is the first solution “Hollowaykeanho” proposed, and I think it is the first one most people think of: using Go’s built-in slice syntax to take a substring:

s := "abcdef"
fmt.Println(s[1:4])

We soon notice that this slices byte by byte. For single-byte ASCII strings nothing beats it, but Chinese characters take several bytes each (3 bytes in UTF-8), so slicing them this way produces garbled data, as follows:

s := "Go 语言" // each Chinese character takes 3 bytes in UTF-8
fmt.Println(s[1:4]) // cuts the first Chinese character in the middle
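To make the byte-versus-character difference concrete, here is a minimal sketch. The helper name describe and the sample string are illustrative assumptions, not part of the forum thread; the sketch only uses len, utf8.RuneCountInString and utf8.ValidString from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the byte length of s, its character (rune) count, and
// whether the byte slice s[from:to] is still valid UTF-8.
// It is a hypothetical helper written only for this illustration.
func describe(s string, from, to int) (bytes, runes int, valid bool) {
	return len(s), utf8.RuneCountInString(s), utf8.ValidString(s[from:to])
}

func main() {
	s := "Go 语言" // 3 ASCII bytes followed by two 3-byte characters

	b, r, ok := describe(s, 1, 4)
	fmt.Println(b, r, ok) // 9 5 false

	// %q makes the stray byte visible instead of printing garbage.
	fmt.Printf("%q\n", s[1:4]) // "o \xe8"
}
```

The slice s[1:4] ends one byte into the first Chinese character, which is why the result is no longer valid UTF-8.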

The killer move – []rune type conversion

The second solution given by “Hollowaykeanho” is to convert the string to []rune, slice it, and turn the result into a string.

s := "Go 语言"
rs := []rune(s) // one element per character instead of per byte
fmt.Println(string(rs[1:4]))

First of all, we get the correct result, which is the biggest improvement. However, I have always been wary of type conversions because I worry about their performance. I searched search engines and various forums for an alternative, but this was the answer I found most often; it seemed to be the only solution.

I tried to write a performance test to measure its performance:

package benchmark

import (
	"testing"
	"unicode/utf8"
)

var benchmarkSubString = "Go is a statically strongly typed, compiled, parallel, and garbage-collecting programming language developed by Google. It is sometimes referred to as a Golang for easy search and identification."
var benchmarkSubStringLength = 20

// SubStrRunes returns the first length characters of s by converting it to []rune.
func SubStrRunes(s string, length int) string {
	if utf8.RuneCountInString(s) > length {
		rs := []rune(s)
		return string(rs[:length])
	}

	return s
}

func BenchmarkSubStrRunes(b *testing.B) {
	for i := 0; i < b.N; i++ {
		SubStrRunes(benchmarkSubString, benchmarkSubStringLength)
	}
}

I got something that surprised me a little:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRunes-8    872253    1363 ns/op    336 B/op    2 allocs/op
PASS
ok    github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark    2.120s

It took about 1.3 microseconds to take the first 20 characters of a 69-character string, far more than I expected. The cause is the type conversion: converting to []rune allocates memory and has to decode the whole string, and converting the slice back produces a new string, which is another allocation.
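The allocations can also be observed outside of a benchmark. The sketch below is an illustration under assumptions: subStrRunes mirrors the function above, while sample, sink and allocsPerCall are hypothetical names introduced here. It relies on testing.AllocsPerRun from the standard library; the exact counts may vary between Go versions, so none are stated.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
	"unicode/utf8"
)

// sample is illustrative data: 80 multi-byte characters.
var sample = strings.Repeat("语言", 40)

// sink keeps the result alive so the compiler cannot optimize the call away.
var sink string

// subStrRunes mirrors the []rune based truncation shown above.
func subStrRunes(s string, length int) string {
	if utf8.RuneCountInString(s) > length {
		rs := []rune(s)
		return string(rs[:length])
	}
	return s
}

// allocsPerCall reports the average number of heap allocations made by
// one call to subStrRunes(s, length).
func allocsPerCall(s string, length int) float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = subStrRunes(s, length)
	})
}

func main() {
	// A string that must be truncated pays for the conversions.
	fmt.Println("truncated:", allocsPerCall(sample, 20))
	// A string that already fits is returned as is.
	fmt.Println("unchanged:", allocsPerCall("short", 20))
}
```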

A lifeline – utf8.DecodeRuneInString

I wanted to get rid of the extra computation and memory allocation caused by the type conversion. I went carefully through the strings package and found no suitable tool. Then I thought of the utf8 package, which provides tools for working with multi-byte characters. I read through all of its documentation and found that utf8.DecodeRuneInString decodes a single character and returns the number of bytes that character occupies. I tried the following experiment:

package benchmark

import (
	"testing"
	"unicode/utf8"
)

var benchmarkSubString = "Go is a statically strongly typed, compiled, parallel, and garbage-collecting programming language developed by Google. It is sometimes referred to as a Golang for easy search and identification."
var benchmarkSubStringLength = 20

// SubStrDecodeRuneInString returns the first length characters of s.
// It decodes one character at a time and accumulates the byte offset in n.
func SubStrDecodeRuneInString(s string, length int) string {
	var size, n int
	for i := 0; i < length && n < len(s); i++ {
		_, size = utf8.DecodeRuneInString(s[n:])
		n += size
	}

	return s[:n]
}

func BenchmarkSubStrDecodeRuneInString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		SubStrDecodeRuneInString(benchmarkSubString, benchmarkSubStringLength)
	}
}

I was pleasantly surprised when I ran it:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrDecodeRuneInString-8    10774401    105 ns/op    0 B/op    0 allocs/op
PASS
ok    github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark    1.250s

It is 13 times faster than the []rune conversion and eliminates the memory allocation, which was very exciting. I replied to “Hollowaykeanho”, told him I had found a better method, and provided the performance tests.

Still a little excited, I went on browsing all kinds of interesting questions on the forum. While reading one of them (I forget which one -_-||), I was surprised to discover yet another approach.

Good medicine doesn’t have to be bitter – range string iteration

Many people seem to forget that range iterates over a string by character, not by byte. Each iteration yields the byte index at which the character starts and the character itself. I immediately wrote the following test case using this feature:
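The index behavior is easy to check with a short sketch. The helper runeOffsets and the sample string are illustrative assumptions introduced for this example.

```go
package main

import "fmt"

// runeOffsets returns the byte index at which each character of s starts,
// exactly as reported by a range loop over the string.
// It is a hypothetical helper for illustration.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "Go 语言"
	for i, r := range s {
		fmt.Printf("byte index %d: %c\n", i, r)
	}
	fmt.Println(runeOffsets(s)) // [0 1 2 3 6]
}
```

The indexes jump from 3 to 6 because the first Chinese character occupies three bytes, so the index of the (N+1)-th character is exactly where a string of N characters ends.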

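package benchmark

import (
	"testing"
)

var benchmarkSubString = "Go is a statically strongly typed, compiled, parallel, and garbage-collecting programming language developed by Google. It is sometimes referred to as a Golang for easy search and identification."
var benchmarkSubStringLength = 20

// SubStrRange returns the first length characters of s.
// range yields the starting byte index of every character, so the index
// seen after counting length characters is where the result ends.
func SubStrRange(s string, length int) string {
	var n int
	for i := range s {
		if n == length {
			return s[:i]
		}

		n++
	}

	// s has at most length characters; return it whole so that the
	// last character is not dropped.
	return s
}

func BenchmarkSubStrRange(b *testing.B) {
	for i := 0; i < b.N; i++ {
		SubStrRange(benchmarkSubString, benchmarkSubStringLength)
	}
}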

I tried to run it, and it seemed magical, and it didn’t disappoint me.

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRange-8    12354991    91.3 ns/op    0 B/op    0 allocs/op
PASS
ok    github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark    1.233s

It was only 13 percent better, but it was simple enough and easy to understand, and it seemed like the medicine I was looking for.

If you think this is the end, no, this is just the beginning for me.

The ultimate moment – Build your own wheel

After drinking range’s sweet medicine, I calmed down a little. I needed to build a wheel of my own, one that was easier to use and more efficient.

So I looked at the two optimizations above. Both essentially look for the byte index at which a given number of characters ends. If I could provide a function that does this, users could truncate with something as simple as s[:strIndex(20)]. Once the idea had sprouted I couldn’t shake it off, and I agonized for two days over how to provide an easy-to-use interface.

I then created the exutf8.RuneIndexInString and exutf8.RuneIndex methods, which calculate the index position at the end of a specified number of characters in a string and in a byte slice, respectively.
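The idea of "find the byte index where N characters end" can be sketched as follows. This is not the exutf8 implementation: runeIndex is a hypothetical helper built on range, and its handling of n <= 0 and of strings that are too short are assumptions made for this illustration.

```go
package main

import "fmt"

// runeIndex returns the byte offset just past the first n characters of s.
// The boolean is false when s holds fewer than n characters (the offset is
// then len(s)) or when n is negative.
// Hypothetical sketch, not the actual exutf8.RuneIndexInString.
func runeIndex(s string, n int) (int, bool) {
	if n <= 0 {
		return 0, n == 0
	}
	count := 0
	for i := range s {
		if count == n {
			return i, true
		}
		count++
	}
	return len(s), count == n
}

func main() {
	s := "Go 语言"

	// First 4 characters.
	end, _ := runeIndex(s, 4)
	fmt.Println(s[:end]) // Go 语

	// Characters 2 to 4 need two lookups, one for each boundary.
	start, _ := runeIndex(s, 2)
	fmt.Printf("%q\n", s[start:end]) // " 语"
}
```

Returning an index rather than a substring leaves the slicing to the caller, which keeps the helper allocation-free.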

I implemented a string interception test using exutf8.RuneIndexInString:

package benchmark

import (
	"testing"

	"github.com/thinkeridea/go-extend/exunicode/exutf8"
)

var benchmarkSubString = "Go is a statically strongly typed, compiled, parallel, and garbage-collecting programming language developed by Google. It is sometimes referred to as a Golang for easy search and identification."
var benchmarkSubStringLength = 20

// SubStrRuneIndexInString returns the first length characters of s, using
// exutf8.RuneIndexInString to find the byte index where they end.
func SubStrRuneIndexInString(s string, length int) string {
	n, _ := exutf8.RuneIndexInString(s, length)
	return s[:n]
}

func BenchmarkSubStrRuneIndexInString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		SubStrRuneIndexInString(benchmarkSubString, benchmarkSubStringLength)
	}
}

I tried it, and I was very pleased with the results:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRuneIndexInString-8    13546849    82.4 ns/op    0 B/op    0 allocs/op
PASS
ok    github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark    1.213s

Performance improved by about 10% over range. I was pleased to get another improvement, which showed that the approach was effective.

It is efficient enough, but not easy enough to use. I need two lines of code to truncate a string, and if I want the characters between positions 10 and 20, I need four lines. That is not an interface users will find easy. Looking at the sub_string method of other languages, I decided I should design a similar interface for users.

exutf8.RuneSubString and exutf8.RuneSub are the methods I wrote after careful thought:

func RuneSubString(s string, start, length int) string

It takes three arguments:

  • s: A string to be entered
  • start: If start is non-negative, the returned string starts at position start of the string, counting from 0. For example, in the string “abcdef”, the character at position 0 is “a”, the character at position 2 is “c”, and so on. If start is negative, the returned string starts at the start-th character counting back from the end of the string. If the string is shorter than start, an empty string is returned.
  • length: The length of the substring. If a positive length is given, the returned string contains at most length characters beginning at start (depending on the length of the string). If a negative length is given, that many characters are omitted from the end of the string (when start is negative, start is first resolved from the end of the string). If start falls outside the remaining text, an empty string is returned. If length is 0, the returned substring runs from start to the end of the string.
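The parameter rules above can be sketched with a small []rune based helper. This is only an illustration under assumptions: runeSub is a hypothetical name, it trades speed for readability, and clamping an out-of-range negative start to 0 is an assumption rather than confirmed behavior of exutf8.RuneSubString.

```go
package main

import "fmt"

// runeSub returns a substring of s measured in characters.
// start < 0 counts back from the end; length > 0 limits the result,
// length < 0 drops that many characters from the end, and length == 0
// means "until the end of the string".
// Hypothetical sketch of the described semantics, not the library code.
func runeSub(s string, start, length int) string {
	rs := []rune(s)
	n := len(rs)

	if start < 0 {
		start += n
		if start < 0 {
			start = 0 // assumption: clamp instead of returning ""
		}
	}
	if start > n {
		return ""
	}

	end := n
	switch {
	case length > 0 && length < n-start:
		end = start + length
	case length < 0:
		end = n + length
	}
	if end <= start {
		return ""
	}
	return string(rs[start:end])
}

func main() {
	fmt.Println(runeSub("abcdef", 2, 3))  // cde
	fmt.Println(runeSub("abcdef", -2, 0)) // ef
	fmt.Println(runeSub("abcdef", 1, -1)) // bcde
	fmt.Println(runeSub("Go 语言", 3, 1))   // 语
}
```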

I also provided aliases for them. Out of habit, people tend to look in a strings package for solutions to this kind of problem, so I created exstrings.SubString and exbytes.Sub as alias methods that are easier to find.

Finally I need to do one more performance test to make sure it works:

package benchmark

import (
	"testing"

	"github.com/thinkeridea/go-extend/exunicode/exutf8"
)

var benchmarkSubString = "Go is a statically strongly typed, compiled, parallel, and garbage-collecting programming language developed by Google. It is sometimes referred to as a Golang for easy search and identification."
var benchmarkSubStringLength = 20

// SubStrRuneSubString returns the first length characters of s using
// exutf8.RuneSubString with a start position of 0.
func SubStrRuneSubString(s string, length int) string {
	return exutf8.RuneSubString(s, 0, length)
}

func BenchmarkSubStrRuneSubString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		SubStrRuneSubString(benchmarkSubString, benchmarkSubStringLength)
	}
}

I ran it, and it did not disappoint me:

goos: darwin
goarch: amd64
pkg: github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark
BenchmarkSubStrRuneSubString-8    13309082    83.9 ns/op    0 B/op    0 allocs/op
PASS
ok    github.com/thinkeridea/go-extend/exunicode/exutf8/benchmark    1.215s

exutf8.RuneSubString is slightly slower than exutf8.RuneIndexInString, but it offers an interface that is easy to use. I think it is probably the most practical solution, and if you want to squeeze out every last bit of performance you can still use the lower-level functions in exutf8.

Conclusion

When you see questionable code, even very simple code, it is still worth digging into and exploring. Far from being boring, it is very rewarding.

Going from the []rune conversion to building my own wheel not only gave me a 16x performance boost. I also learned the utf8 package, deepened my understanding of how range traverses strings, and added a number of practical and efficient solutions to the go-extend repository, so that more go-extend users can benefit.

go-extend is a repository of practical and efficient methods. If you have good functions or common, efficient solutions, I hope you will send me pull requests. You can also use this repository to implement features faster and improve performance.


Author: Qi Yin (thinkeridea)

Article link: blog.thinkeridea.com/201910/go/e… Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0 CN. Please credit the source when reprinting!