Two days ago I wrote a website crawler as a practice project in Go, but the content it fetched came out garbled. It turned out the source site was encoded in GBK, while Go's default encoding is UTF-8, so any non-UTF-8 content ends up garbled.

So I went looking for a Go transcoding library. Three are in common use: Mahonia, iconv-go, and the official golang.org/x/text.

I have tried all three and wasn't fully satisfied with any of them. Let's look at GBK-to-UTF-8 conversion with each library.
- mahonia
```go
package main

import (
	"fmt"

	"github.com/axgle/mahonia"
)

func main() {
	testBytes := []byte{0xC4, 0xE3, 0xBA, 0xC3, 0xA3, 0xAC, 0xCA, 0xC0, 0xBD, 0xE7, 0xA3, 0xA1}
	testStr := string(testBytes)
	dec := mahonia.NewDecoder("gbk")
	res := dec.ConvertString(testStr)
	fmt.Println(res) // 你好，世界！("Hello, world!")
}
```
- iconv-go
```go
package main

import (
	"fmt"

	iconv "github.com/djimenez/iconv-go"
)

func main() {
	testBytes := []byte{0xC4, 0xE3, 0xBA, 0xC3, 0xA3, 0xAC, 0xCA, 0xC0, 0xBD, 0xE7, 0xA3, 0xA1}
	// Convert writes into a pre-allocated output buffer and
	// reports how many bytes were read and written.
	res := make([]byte, len(testBytes)*2)
	_, n, _ := iconv.Convert(testBytes, res, "GBK", "UTF-8")
	fmt.Println(string(res[:n])) // 你好，世界！("Hello, world!")
}
```
- golang.org/x/text
```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"

	"golang.org/x/text/encoding/simplifiedchinese"
	"golang.org/x/text/transform"
)

func main() {
	testBytes := []byte{0xC4, 0xE3, 0xBA, 0xC3, 0xA3, 0xAC, 0xCA, 0xC0, 0xBD, 0xE7, 0xA3, 0xA1}
	decoder := simplifiedchinese.GBK.NewDecoder()
	reader := transform.NewReader(bytes.NewReader(testBytes), decoder)
	res, _ := ioutil.ReadAll(reader)
	fmt.Println(string(res)) // 你好，世界！("Hello, world!")
}
```
That is the basic usage of the three libraries, and each has its drawbacks:

- Mahonia has the simplest API, but it only accepts and returns `string`. In Go we often work with `[]byte` or `io.Reader` data, so this is rather limiting.
- iconv-go can read `string`, `[]byte`, and `io.Reader` data, but underneath it wraps the C iconv library, which causes problems in various environments: compile errors are hard to diagnose, and I have failed to install it several times :(.
- golang.org/x/text is the official library, but its API is too cumbersome. Pass.
transcode
I gave it some thought and decided that chained calls would be a good solution, so I built a wheel called transcode. Here's how it's used:
- GBK to UTF-8
```go
package main

import (
	"fmt"

	"github.com/piex/transcode"
)

func main() {
	testBytes := []byte{0xC4, 0xE3, 0xBA, 0xC3, 0xA3, 0xAC, 0xCA, 0xC0, 0xBD, 0xE7, 0xA3, 0xA1}
	res := transcode.FromByteArray(testBytes).Decode("GBK").ToString()
	fmt.Println(res) // 你好，世界！("Hello, world!")
}
```
- UTF-8 to GBK
```go
package main

import (
	"bytes"
	"fmt"

	"github.com/piex/transcode"
)

func main() {
	testBytes := []byte{0xC4, 0xE3, 0xBA, 0xC3, 0xA3, 0xAC, 0xCA, 0xC0, 0xBD, 0xE7, 0xA3, 0xA1}
	testStr := "你好，世界！"
	res := transcode.FromString(testStr).Encode("GBK").ToByteArray()
	fmt.Println(bytes.Equal(res, testBytes)) // true
}
```
Under the hood the library wraps the golang.org/x/text transcoding API; it exists only because that API is so awkward to use directly. It currently supports `string` and `[]byte` as input and output data types.

Chained calls are used so that each step returns the wrapper struct itself; the final ToString and ToByteArray methods then extract the result in whichever type you need.
Repository: github.com/piex/transc… The source code is quite simple; feel free to take a look. Support for the io.Reader type will be added later, and anyone interested is welcome to open a PR.