string and []byte are among the most commonly used data types in Go programs. This article explores the ways of converting between the two and analyzes their internal relationship to clear up the confusion.
Two conversion modes
-
Standard conversion
When converting between string and []byte in Go, every Gopher thinks of the following conversion, which we call the standard conversion.
// string to []byte
s1 := "hello"
b := []byte(s1)
// []byte to string
s2 := string(b)
-
Strong conversion
Another kind of conversion can be implemented with the unsafe and reflect packages. We call it the strong conversion (it is often referred to as black magic).
func String2Bytes(s string) []byte {
sh := (*reflect.StringHeader)(unsafe.Pointer(&s))
bh := reflect.SliceHeader{
Data: sh.Data,
Len: sh.Len,
Cap: sh.Len,
}
return *(*[]byte)(unsafe.Pointer(&bh))
}
func Bytes2String(b []byte) string {
return *(*string)(unsafe.Pointer(&b))
}
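For comparison, here is a minimal sketch of the same zero-copy idea written with the helpers that the unsafe package gained in Go 1.20 (unsafe.String, unsafe.StringData, unsafe.Slice, unsafe.SliceData). Newer Go releases mark reflect.StringHeader and reflect.SliceHeader as deprecated in favor of these helpers. The function names StringToBytes and BytesToString are hypothetical and chosen only for this illustration; the block assumes a Go 1.20 or later toolchain.

```go
package main

import (
	"fmt"
	"unsafe"
)

// StringToBytes (hypothetical name) returns a []byte that shares memory
// with s instead of copying it. The result must never be written to.
func StringToBytes(s string) []byte {
	if len(s) == 0 {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

// BytesToString (hypothetical name) returns a string that shares memory
// with b instead of copying it. b must not be modified afterwards.
func BytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	b := StringToBytes("hello")
	fmt.Println(len(b), cap(b), BytesToString(b)) // 5 5 hello
}
```

The same caveats as for the reflect-based version apply: the two values alias one block of memory, so the sketch is only safe for read-only use.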
-
Performance comparison
Since there are two kinds of conversion, it is worth comparing their performance.
// Test the strong conversion function
func TestBytes2String(t *testing.T) {
x := []byte("Hello Gopher!")
y := Bytes2String(x)
z := string(x)
if y != z {
t.Fail()
}
}
// Test the strong conversion function
func TestString2Bytes(t *testing.T) {
x := "Hello Gopher!"
y := String2Bytes(x)
z := []byte(x)
if !bytes.Equal(y, z) {
t.Fail()
}
}
// Test the performance of the standard conversion string()
func Benchmark_NormalBytes2String(b *testing.B) {
x := []byte("Hello Gopher! Hello Gopher! Hello Gopher!")
for i := 0; i < b.N; i++ {
_ = string(x)
}
}
// Test the performance of strong conversion []byte to string
func Benchmark_Byte2String(b *testing.B) {
x := []byte("Hello Gopher! Hello Gopher! Hello Gopher!")
for i := 0; i < b.N; i++ {
_ = Bytes2String(x)
}
}
// Test the performance of standard conversion []byte
func Benchmark_NormalString2Bytes(b *testing.B) {
x := "Hello Gopher! Hello Gopher! Hello Gopher!"
for i := 0; i < b.N; i++ {
_ = []byte(x)
}
}
// Test the performance of strong conversion string to []byte
func Benchmark_String2Bytes(b *testing.B) {
x := "Hello Gopher! Hello Gopher! Hello Gopher!"
for i := 0; i < b.N; i++ {
_ = String2Bytes(x)
}
}
The test results are as follows
$ go test -bench="." -benchmem
goos: darwin
goarch: amd64
pkg: workspace/example/stringBytes
Benchmark_NormalBytes2String-8     38363413    27.9 ns/op     48 B/op    1 allocs/op
Benchmark_Byte2String-8          1000000000    0.265 ns/op     0 B/op    0 allocs/op
Benchmark_NormalString2Bytes-8     32577080    34.8 ns/op     48 B/op    1 allocs/op
Benchmark_String2Bytes-8         1000000000    0.532 ns/op     0 B/op    0 allocs/op
PASS
ok workspace/example/stringBytes 3.170s
Note that -benchmem can provide the number of times memory is allocated per operation, as well as the number of bytes allocated per operation.
When x is shortened to "Hello Gopher!" in every benchmark, the test results are as follows
$ go test -bench="." -benchmem
goos: darwin
goarch: amd64
pkg: workspace/example/stringBytes
Benchmark_NormalBytes2String-8    245907674    4.86 ns/op      0 B/op    0 allocs/op
Benchmark_Byte2String-8          1000000000    0.266 ns/op     0 B/op    0 allocs/op
Benchmark_NormalString2Bytes-8    202329386    5.92 ns/op      0 B/op    0 allocs/op
Benchmark_String2Bytes-8         1000000000    0.532 ns/op     0 B/op    0 allocs/op
PASS
ok workspace/example/stringBytes 4.383s
The performance of strong conversions is significantly better than that of standard conversions.
The reader can consider the following questions
1. Why are strong conversions better than standard conversions?
2. In the above test, why does the standard conversion allocate memory and slow down when x is large, while the strong conversion is unaffected?
3. If strong conversions perform so well, why does the Go language provide us with standard conversions?
Analysis of the principle
To answer these three questions, you first need to understand what string and []byte are in Go.
-
[]byte
byte is an alias for uint8 in Go, as documented in the builtin package of the standard library:
// byte is an alias for uint8 and is equivalent to uint8 in all ways. It is
// used, by convention, to distinguish byte values from 8-bit unsigned
// integer values.
type byte = uint8
In the Go source file src/runtime/slice.go, slice is defined as follows:
type slice struct {
array unsafe.Pointer
len int
cap int
}
array is a pointer to the underlying array, len is the length, and cap is the capacity. For []byte, array points to an array of bytes.
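To make the three fields concrete, here is a small sketch showing that re-slicing creates a new header (a different len) over the same underlying array. The helper name firstTwo is hypothetical and exists only for this illustration.

```go
package main

import "fmt"

// firstTwo (hypothetical helper) returns a new slice header that points at
// the same underlying array as b but has a length of 2.
func firstTwo(b []byte) []byte {
	return b[:2]
}

func main() {
	backing := make([]byte, 4, 8) // an 8-byte array, of which 4 bytes are visible
	copy(backing, "abcd")

	head := firstTwo(backing)
	fmt.Println(len(backing), cap(backing)) // 4 8
	fmt.Println(len(head), cap(head))       // 2 8

	head[0] = 'X'                // writes through the shared array
	fmt.Println(string(backing)) // Xbcd
}
```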
-
string
The builtin package of the Go standard library documents string as follows:
// string is the set of all strings of 8-bit bytes, conventionally but not
// necessarily representing UTF-8-encoded text. A string may be empty, but
// not nil. Values of string type are immutable.
type string string
A string is a collection of 8-bit bytes that usually, but not necessarily, represent UTF-8 encoded text. String can be empty, but it can’t be nil. The value of string cannot be changed.
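A short sketch of what this description means in practice: len counts bytes rather than characters, a string may hold bytes that are not valid UTF-8, and the zero value is the empty string. The helper name byteAndRuneLen is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen (hypothetical helper) reports the length of s in bytes
// and in Unicode code points.
func byteAndRuneLen(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "\u00e9" is a single character that UTF-8 encodes in two bytes.
	fmt.Println(byteAndRuneLen("h\u00e9llo")) // 6 5

	// A string is just bytes, so it may hold invalid UTF-8.
	raw := string([]byte{0xff, 0xfe})
	fmt.Println(len(raw), utf8.ValidString(raw)) // 2 false

	// The zero value of string is "", never nil.
	var empty string
	fmt.Println(empty == "", len(empty)) // true 0
}
```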
In the Go source file src/runtime/string.go, string is defined as follows:
type stringStruct struct {
str unsafe.Pointer
len int
}
So a stringStruct is what a string looks like at runtime: the str pointer holds the starting address of some array, and len is the length of that array. So what is this array? We can find out by looking at how a stringStruct is built.
//go:nosplit
func gostringnocopy(str *byte) string {
ss := stringStruct{str: unsafe.Pointer(str), len: findnull(str)}
s := *(*string)(unsafe.Pointer(&ss))
return s
}
As you can see, the incoming str pointer is a pointer to byte, so we can be sure that the underlying data structure of a string is an array of bytes.
In summary, string and []byte are very similar in their underlying structure: the latter only has one extra cap field, so their memory layouts line up. This is why the built-in function copy has the special case copy(dst []byte, src string) int.
// The copy built-in function copies elements from a source slice into a
// destination slice. (As a special case, it also will copy bytes from a
// string to a slice of bytes.) The source and destination may overlap. Copy
// returns the number of elements copied, which will be the minimum of
// len(src) and len(dst).
func copy(dst, src []Type) int
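As a small illustration of this special case, the sketch below copies directly from a string into a byte slice without writing a []byte(...) conversion first. The helper name prefixBytes is hypothetical.

```go
package main

import "fmt"

// prefixBytes (hypothetical helper) copies at most n bytes of s into a new
// byte slice, using the string form of the built-in copy.
func prefixBytes(s string, n int) []byte {
	dst := make([]byte, n)
	copied := copy(dst, s) // the source is a string, not a []byte
	return dst[:copied]
}

func main() {
	fmt.Println(string(prefixBytes("Hello Gopher!", 5))) // Hello
	fmt.Println(len(prefixBytes("Hi", 5)))               // 2
}
```

copy returns the minimum of len(src) and len(dst), which is why the second call yields only two bytes.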
-
The difference between the two
The biggest difference between []byte and string is that the value of a string cannot be changed. What does that mean in practice? Here are two examples.
For []byte, the following operations are possible:
b := []byte("Hello Gopher!")
b[1] = 'T'
For string, the same modification is forbidden (it does not compile):
s := "Hello Gopher!"
s[1] = 'T'
A string variable can, however, be assigned a new value:
s := "Hello Gopher!"
s = "Tello Gopher!"
The value of a string cannot be changed, but the variable can be given a different string. In the string's structure, stringStruct{str: str_point, len: str_len}, the str pointer points to the address of a string constant. The contents at that address cannot be changed because the memory is read-only, but the str pointer can be made to point to a different address.
Consequently, the following operations mean different things:
s := "S1" // the str pointer in s's structure points to the memory holding "S1"
s = "S2" // the str pointer in s's structure now points to different memory, holding "S2"
b := []byte{1} // allocate memory for the array {1}, to which b's array pointer points
b[0] = 2 // change the array contents to 2 in place; the array pointer is unchanged
The illustration below
Because the bytes a string points to cannot be changed, every modification of a string has to allocate new memory, and the previously allocated space is left for the GC to reclaim. This is the root cause of string operations being less efficient than []byte operations.
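The following sketch contrasts the two approaches when building a long value piece by piece. The function names joinWithString and joinWithBytes are hypothetical. testing.AllocsPerRun from the standard library is used here only to count allocations; the exact numbers depend on the Go version and platform, so none are claimed.

```go
package main

import (
	"fmt"
	"testing"
)

// joinWithString (hypothetical) appends n copies of part using string
// concatenation. Each += produces a new string value.
func joinWithString(part string, n int) string {
	s := ""
	for i := 0; i < n; i++ {
		s += part
	}
	return s
}

// joinWithBytes (hypothetical) appends n copies of part into one
// pre-sized []byte and converts to string once at the end.
func joinWithBytes(part string, n int) string {
	b := make([]byte, 0, len(part)*n)
	for i := 0; i < n; i++ {
		b = append(b, part...)
	}
	return string(b)
}

func main() {
	fmt.Println(joinWithString("ab", 3) == joinWithBytes("ab", 3)) // true

	viaString := testing.AllocsPerRun(100, func() { _ = joinWithString("Hello Gopher!", 64) })
	viaBytes := testing.AllocsPerRun(100, func() { _ = joinWithBytes("Hello Gopher!", 64) })
	fmt.Println("allocations per run:", viaString, viaBytes)
}
```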
-
Implementation details of the standard conversion
Implementation of []byte(string) (source in src/runtime/string.go)
// The constant is known to the compiler.
// There is no fundamental theory behind this number.
const tmpStringBufSize = 32
type tmpBuf [tmpStringBufSize]byte
func stringtoslicebyte(buf *tmpBuf, s string) []byte {
var b []byte
if buf != nil && len(s) <= len(buf) {
*buf = tmpBuf{}
b = buf[:len(s)]
} else {
b = rawbyteslice(len(s))
}
copy(b, s)
return b
}
// rawbyteslice allocates a new byte slice. The byte slice is not zeroed.
func rawbyteslice(size int) (b []byte) {
cap := roundupsize(uintptr(size))
p := mallocgc(cap, nil, false)
if cap != uintptr(size) {
memclrNoHeapPointers(add(p, uintptr(size)), cap-uintptr(size))
}
*(*slice)(unsafe.Pointer(&b)) = slice{p, size, int(cap)}
return
}
There are two cases, depending on whether the length of s exceeds 32. When it does (or when no temporary buffer is available because the result escapes), Go calls mallocgc to allocate a new block of memory whose size is determined by s. This answers question 2 above: when x is large, the standard conversion performs one memory allocation.
Finally, the string is copied into the []byte by the copy function, which is implemented by slicestringcopy in src/runtime/slice.go.
func slicestringcopy(to []byte, fm string) int {
if len(fm) == 0 || len(to) == 0 {
return 0
}
// The length of copy depends on the minimum length of string and []byte
n := len(fm)
if len(to) < n {
n = len(to)
}
// If race detection (-race) is enabled
if raceenabled {
callerpc := getcallerpc()
pc := funcPC(slicestringcopy)
racewriterangepc(unsafe.Pointer(&to[0]), uintptr(n), callerpc, pc)
}
// If the memory sanitizer (-msan) is enabled
if msanenabled {
msanwrite(unsafe.Pointer(&to[0]), uintptr(n))
}
// This method copies n bytes of the underlying array of string from the header to the underlying array of []byte. (This is the core method of copy implementation, implemented in assembly level source file memmove_*.s)
memmove(unsafe.Pointer(&to[0]), stringStructOf(&fm).str, uintptr(n))
return n
}
The copy implementation process is illustrated below
Implementation of string([]byte) (source also in src/runtime/string.go)
// Buf is a fixed-size buffer for the result,
// it is not nil if the result does not escape.
func slicebytetostring(buf *tmpBuf, b []byte) (str string) {
l := len(b)
if l == 0 {
// Turns out to be a relatively common case.
// Consider that you want to parse out data between parens in "foo()bar",
// you find the indices and convert the subslice to string.
return ""
}
// If race detection (-race) is enabled
if raceenabled {
racereadrangepc(unsafe.Pointer(&b[0]),
uintptr(l),
getcallerpc(),
funcPC(slicebytetostring))
}
// If the memory sanitizer (-msan) is enabled
if msanenabled {
msanread(unsafe.Pointer(&b[0]), uintptr(l))
}
if l == 1 {
stringStructOf(&str).str = unsafe.Pointer(&staticbytes[b[0]])
stringStructOf(&str).len = 1
return
}
var p unsafe.Pointer
if buf != nil && len(b) <= len(buf) {
p = unsafe.Pointer(buf)
} else {
p = mallocgc(uintptr(len(b)), nil, false)
}
stringStructOf(&str).str = p
stringStructOf(&str).len = len(b)
// Copy the byte array to the string
memmove(p, (*(*slice)(unsafe.Pointer(&b))).array, uintptr(len(b)))
return
}
// stringStructOf reinterprets a *string as a *stringStruct
func stringStructOf(sp *string) *stringStruct {
return (*stringStruct)(unsafe.Pointer(sp))
}
As you can see, mallocgc is also called to allocate a new block of memory when the slice is longer than 32 bytes. Finally, the copy is performed by memmove.
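Because both directions copy the bytes, the source and the result of a standard conversion are independent of each other. A minimal sketch follows; the helper name mutatedCopy is hypothetical.

```go
package main

import "fmt"

// mutatedCopy (hypothetical helper) converts s to []byte, overwrites the
// first byte with c, and returns the original string next to the modified copy.
func mutatedCopy(s string, c byte) (original, changed string) {
	b := []byte(s) // copies the bytes of s
	if len(b) > 0 {
		b[0] = c // only the copy is modified
	}
	return s, string(b) // string(b) copies once more
}

func main() {
	original, changed := mutatedCopy("hello", 'H')
	fmt.Println(original, changed) // hello Hello
}
```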
-
Implementation details of the strong conversion
- The all-purpose unsafe.Pointer
In Go, a pointer *T of any type can be converted to unsafe.Pointer, which can hold the address of any variable. An unsafe.Pointer can also be converted back to an ordinary pointer, and that pointer does not have to be of the original *T type. In addition, unsafe.Pointer can be converted to uintptr, which holds the numeric value of the address and allows arithmetic on it. These rules are the basis of the strong conversion.
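The two rules described above can be sketched as follows: reinterpreting a *float64 as a *uint64, and computing the address of the second array element through uintptr. The function names float64Bits and second are hypothetical; the standard library's math.Float64bits is used only to cross-check the first result.

```go
package main

import (
	"fmt"
	"math"
	"unsafe"
)

// float64Bits (hypothetical) reinterprets the memory of f as a uint64 by
// converting *float64 -> unsafe.Pointer -> *uint64.
func float64Bits(f float64) uint64 {
	return *(*uint64)(unsafe.Pointer(&f))
}

// second (hypothetical) reads arr[1] by adding the element size to the
// address of arr[0]. The uintptr arithmetic stays inside one expression.
func second(arr *[2]int32) int32 {
	p := unsafe.Pointer(uintptr(unsafe.Pointer(&arr[0])) + unsafe.Sizeof(arr[0]))
	return *(*int32)(p)
}

func main() {
	fmt.Printf("%#x\n", float64Bits(1.0))                  // 0x3ff0000000000000
	fmt.Println(float64Bits(1.0) == math.Float64bits(1.0)) // true
	fmt.Println(second(&[2]int32{7, 9}))                   // 9
}
```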
In the reflect package, the structs corresponding to string and slice are reflect.StringHeader and reflect.SliceHeader, which are the runtime representations of string and slice.
type StringHeader struct {
Data uintptr
Len int
}
type SliceHeader struct {
Data uintptr
Len int
Cap int
}
- Memory layout
From the runtime representations of string and slice, we can see that the Data and Len fields are the same, and that SliceHeader only has an additional Cap field of type int. Their memory layouts therefore line up, which means the conversion can be done directly through unsafe.Pointer.
[]byte to string diagram
string to []byte diagram
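The layout claim can be checked with unsafe.Sizeof. The sketch below assumes the standard gc toolchain, where a string header occupies two machine words and a slice header three. The helper names headerWords and reinterpret are hypothetical.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerWords (hypothetical) reports the size of a string header and of a
// slice header, measured in machine words.
func headerWords() (str, slice uintptr) {
	word := unsafe.Sizeof(uintptr(0))
	var s string
	var b []byte
	return unsafe.Sizeof(s) / word, unsafe.Sizeof(b) / word
}

// reinterpret (hypothetical) reads the first two words of the slice header
// (pointer and length) as a string header. The capacity word is ignored.
func reinterpret(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

func main() {
	fmt.Println(headerWords())                 // 2 3
	fmt.Println(reinterpret([]byte("Gopher"))) // Gopher
}
```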
-
Q&A
Q1. Why do strong conversions perform better than standard conversions?
In a standard conversion, going from []byte to string or from string to []byte both involve copying the underlying array. A strong conversion simply reuses the pointer, so the string and the []byte point to the same underlying array. Naturally, the latter performs better.
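The difference between copying and sharing can be observed directly. The sketch below reuses the article's Bytes2String and wraps the experiment in a hypothetical helper named aliasDemo. Writing to the slice after the strong conversion is done here purely for demonstration; it is exactly the kind of mutation that makes the technique risky.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Bytes2String is the strong conversion from earlier in the article.
func Bytes2String(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

// aliasDemo (hypothetical helper) converts the same slice both ways and
// then modifies the slice.
func aliasDemo() (strong, standard string) {
	b := []byte("hello")
	strong = Bytes2String(b) // shares b's underlying array
	standard = string(b)     // owns a private copy
	b[0] = 'H'               // demonstration only: changes what strong sees
	return strong, standard
}

func main() {
	fmt.Println(aliasDemo()) // Hello hello
}
```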
Q2. In the above test, why does the standard conversion allocate memory and slow down when x is large, while the strong conversion is unaffected?
In the standard conversion, when the data is longer than 32 bytes, new memory must first be obtained through mallocgc before the data is copied. A strong conversion only changes where a pointer points. Therefore, the larger the data being converted, the more obvious the performance gap becomes.
Q3. If strong conversions perform so well, why does the Go language provide us with standard conversions?
First, Go is designed as a type-safe language, and safety is paid for with some performance. That cost is relative, though, and on today's machines it is minimal. More importantly, the strong conversion introduces serious safety risks into a program.
Consider the following example:
a := "hello"
b := String2Bytes(a)
b[0] = 'H'
a is a string, and as we said earlier, its value cannot be modified. The strong conversion hands a's underlying array to b, and b is a []byte whose contents can be changed. Writing to that underlying array therefore causes a fatal error (one that even defer+recover cannot catch).
unexpected fault address 0x10b6139
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x10b6139 pc=0x1088f2c]
Q4. Why are strings designed to be unmodifiable?
This question is worth thinking about. A string cannot be modified, which means it is read-only. The advantage is that in concurrent scenarios the same string can be shared and read many times without any locking, so sharing stays efficient without safety concerns.
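A minimal sketch of that property: several goroutines read one string at the same time with no mutex, and each writes its result to its own slot. The function name countConcurrently is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// countConcurrently (hypothetical) lets `workers` goroutines read the same
// string simultaneously. No lock guards s because nothing can modify it.
func countConcurrently(s, sub string, workers int) []int {
	results := make([]int, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = strings.Count(s, sub) // each goroutine owns results[i]
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(countConcurrently("Hello Gopher! Hello Gopher!", "Gopher", 4)) // [2 2 2 2]
}
```

The same pattern over a shared []byte would only be safe if no goroutine wrote to it, which the type system does not enforce.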
-
Choosing between the two
- When you are not sure about the safety risks, convert data the standard way.
- Strong conversion can be used when the program has strict performance requirements, the data is only ever read, and conversions are frequent (for example, in message forwarding).
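As an illustration of the second case, the sketch below parses a number out of a byte buffer. The converted string is only read and does not outlive the call, which matches the read-only condition. The function name parseNumber is hypothetical, and Bytes2String is the strong conversion from earlier in the article.

```go
package main

import (
	"fmt"
	"strconv"
	"unsafe"
)

// Bytes2String is the strong conversion from earlier in the article.
func Bytes2String(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

// parseNumber (hypothetical) parses a decimal integer from buf without
// copying it. The temporary string is read only inside strconv.Atoi.
func parseNumber(buf []byte) (int, error) {
	return strconv.Atoi(Bytes2String(buf))
}

func main() {
	n, err := parseNumber([]byte("8080"))
	fmt.Println(n, err) // 8080 <nil>
}
```

If the caller might modify or reuse buf while the string is still referenced elsewhere, the standard conversion remains the safer choice.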