Introduction

Package utf8 implements functions and constants to support text encoded in UTF-8. It includes functions to translate between runes and UTF-8 byte sequences.

Unicode itself assigns code points rather than byte widths. In UTF-16, most common Chinese characters take two bytes, while in UTF-8 they take three. Go source code and string literals are UTF-8, so a Chinese character normally occupies three bytes. Underneath, a Go string is just a read-only sequence of bytes.
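The difference between byte length and rune count can be checked directly. The following is a minimal sketch; the helper name byteAndRuneCount is made up for illustration and is not part of the utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount (hypothetical helper) returns the length of s
// in bytes and in runes.
func byteAndRuneCount(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "钢"
	b, r := byteAndRuneCount(s)
	fmt.Println(b, r)      // 3 1
	fmt.Println([]byte(s)) // [233 146 162]
}
```

len counts bytes, so a single Chinese character reports 3, while the rune count is 1.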

Read this article patiently and you should come away knowing how to use the package.

Constants defined

RuneSelf = 0x80: its decimal value is 128. A byte below this value is a plain ASCII character that represents itself. hicb = 0xBF: its decimal value is 191, and it is the highest valid continuation byte.

// The conditions RuneError==unicode.ReplacementChar and
// MaxRune==unicode.MaxRune are verified in the tests.
// Defining them locally avoids this package depending on package unicode.

// Numbers fundamental to the encoding.
const (
    RuneError = '\uFFFD'     // the "error" Rune or "Unicode replacement character"
    RuneSelf  = 0x80         // characters below RuneSelf are represented as themselves in a single byte.
    MaxRune   = '\U0010FFFF' // Maximum valid Unicode code point.
    UTFMax    = 4            // maximum number of bytes of a UTF-8 encoded Unicode character.
)

// Code points in the surrogate range are not valid for UTF-8.
const (
    surrogateMin = 0xD800
    surrogateMax = 0xDFFF
)

const (
    t1 = 0x00 // 0000 0000
    tx = 0x80 // 1000 0000
    t2 = 0xC0 // 1100 0000
    t3 = 0xE0 // 1110 0000
    t4 = 0xF0 // 1111 0000
    t5 = 0xF8 // 1111 1000

    maskx = 0x3F // 0011 1111
    mask2 = 0x1F // 0001 1111
    mask3 = 0x0F // 0000 1111
    mask4 = 0x07 // 0000 0111

    rune1Max = 1<<7 - 1
    rune2Max = 1<<11 - 1
    rune3Max = 1<<16 - 1

    // The default lowest and highest continuation byte.
    locb = 0x80 // 1000 0000
    hicb = 0xBF // 1011 1111

    // The names of these constants are chosen to give nice alignment in the
    // table below. The first nibble is an index into acceptRanges or F for
    // special one-byte cases. The second nibble is the rune length or the
    // status for the special one-byte case.
    xx = 0xF1 // invalid: size 1
    as = 0xF0 // ASCII: size 1
    s1 = 0x02 // accept 0, size 2
    s2 = 0x13 // accept 1, size 3
    s3 = 0x03 // accept 0, size 3
    s4 = 0x23 // accept 2, size 3
    s5 = 0x34 // accept 3, size 4
    s6 = 0x04 // accept 0, size 4
    s7 = 0x44 // accept 4, size 4
)

// first is information about the first byte in a UTF-8 sequence.
var first = [256]uint8{
    //   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x00-0x0F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x10-0x1F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x20-0x2F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x30-0x3F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x40-0x4F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x50-0x5F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x60-0x6F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x70-0x7F
    //   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
    xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0x80-0x8F
    xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0x90-0x9F
    xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0xA0-0xAF
    xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0xB0-0xBF
    xx, xx, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, // 0xC0-0xCF
    s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, // 0xD0-0xDF
    s2, s3, s3, s3, s3, s3, s3, s3, s3, s3, s3, s3, s3, s4, s3, s3, // 0xE0-0xEF
    s5, s6, s6, s6, s7, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0xF0-0xFF
}

type acceptRange struct {
    lo uint8 // lowest value for second byte.
    hi uint8 // highest value for second byte.
}

var acceptRanges = [...]acceptRange{
    0: {locb, hicb},
    1: {0xA0, hicb},
    2: {locb, 0x9F},
    3: {0x90, hicb},
    4: {locb, 0x8F},
}

DecodeRune

DecodeRune unpacks the first UTF-8 encoding in p and returns the rune and its width in bytes. If p is empty, it returns (RuneError, 0). Otherwise, if the encoding is invalid, it returns (RuneError, 1). Both are impossible results for correct, non-empty UTF-8. An encoding is invalid if it is incorrect UTF-8, encodes a rune that is out of range, or is not the shortest possible UTF-8 encoding for the value. No other validation is performed.

func DecodeRune(p []byte) (r rune, size int) {
    n := len(p)
    if n < 1 {
        return RuneError, 0
    }
    p0 := p[0]
    x := first[p0]
    if x >= as {
        // The following code simulates an additional check for x == xx and
        // handles the ASCII and invalid cases accordingly. This mask-and-or
        // approach prevents an additional branch.
        mask := rune(x) << 31 >> 31 // Create 0x0000 or 0xFFFF.
        return rune(p[0])&^mask | RuneError&mask, 1
    }
    sz := x & 7
    accept := acceptRanges[x>>4]
    if n < int(sz) {
        return RuneError, 1
    }
    b1 := p[1]
    if b1 < accept.lo || accept.hi < b1 {
        // The second byte is outside the accepted range, so the encoding is invalid.
        return RuneError, 1
    }
    if sz == 2 {
        // Two-byte encoding: combine the payload bits of both bytes.
        return rune(p0&mask2)<<6 | rune(b1&maskx), 2
    }
    b2 := p[2]
    if b2 < locb || hicb < b2 {
        return RuneError, 1
    }
    if sz == 3 {
        return rune(p0&mask3)<<12 | rune(b1&maskx)<<6 | rune(b2&maskx), 3
    }
    b3 := p[3]
    if b3 < locb || hicb < b3 {
        return RuneError, 1
    }
    return rune(p0&mask4)<<18 | rune(b1&maskx)<<12 | rune(b2&maskx)<<6 | rune(b3&maskx), 4
}
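To see the return values described above, it may help to feed DecodeRune a few small inputs. This is only an illustrative sketch: the wrapper name decode and the chosen inputs are assumptions, not taken from the original article.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decode (hypothetical wrapper) returns the first rune in p and its width.
func decode(p []byte) (rune, int) {
	return utf8.DecodeRune(p)
}

func main() {
	inputs := [][]byte{
		{},           // empty input: (RuneError, 0)
		{0xFF},       // invalid lead byte: (RuneError, 1)
		{0xE9, 0x92}, // truncated three-byte sequence: (RuneError, 1)
		{0xC0, 0x80}, // overlong encoding, rejected: (RuneError, 1)
		[]byte("钢"),  // valid three-byte sequence: (U+94A2, 3)
	}
	for _, p := range inputs {
		r, size := decode(p)
		fmt.Printf("[% x] -> %U, size %d\n", p, r, size)
	}
}
```

Note that only the empty input yields a size of 0, which is why a decoding loop over non-empty input always makes progress.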

Example: to convert a byte slice to a rune slice, we can decode the bytes one rune at a time:

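func str2runes(s []byte) []rune {
    var p []rune
    for len(s) > 0 {
        fmt.Println(s)
        // Decode the first rune in s and learn how many bytes it occupies.
        r, size := utf8.DecodeRune(s)
        fmt.Println(r, size)
        p = append(p, r)
        // Advance past the bytes that were just decoded.
        s = s[size:]
    }
    return p
}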

However, because the underlying data structures differ, this kind of conversion inevitably allocates new memory.

DecodeRuneInString

DecodeRuneInString is like DecodeRune, except that its input is a string.
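As a rough sketch of how DecodeRuneInString is typically used, the loop below walks a string rune by rune without first converting it to a byte slice. The helper name runesOf is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesOf (hypothetical helper) decodes s one rune at a time.
func runesOf(s string) []rune {
	var out []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		out = append(out, r)
		s = s[size:] // size is at least 1 for non-empty input
	}
	return out
}

func main() {
	fmt.Println(runesOf("Hello, 钢铁侠"))
	// [72 101 108 108 111 44 32 38050 38081 20384]
}
```

For everyday code, a for range loop over the string or a []rune(s) conversion gives the same runes; the explicit loop is shown only to make the decoding step visible.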

EncodeRune

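EncodeRune writes the UTF-8 encoding of the rune into p, which must be large enough, and returns the number of bytes written. If the rune is out of range or falls in the surrogate range, it writes the encoding of RuneError instead.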

func EncodeRune(p []byte, r rune) int {
	// Negative values are erroneous. Making it unsigned addresses the problem.
	switch i := uint32(r); {
	case i <= rune1Max: // rune1Max = 111 1111(127)
		p[0] = byte(r)
		return 1
	case i <= rune2Max: // rune2Max = 111 1111 1111 (2047)
		_ = p[1] // eliminate bounds checks
		p[0] = t2 | byte(r>>6)  // t2= 0xC0
		p[1] = tx | byte(r)&maskx // tx= 0x80
		return 2
	case i > MaxRune, surrogateMin <= i && i <= surrogateMax:
		r = RuneError
		fallthrough
	case i <= rune3Max: // rune3Max = 1111 1111 1111 1111 (65535)
		_ = p[2] // eliminate bounds checks
		p[0] = t3 | byte(r>>12) // t3 = 0xE0
		p[1] = tx | byte(r>>6)&maskx
		p[2] = tx | byte(r)&maskx
		return 3
	default:
		_ = p[3] // eliminate bounds checks
		p[0] = t4 | byte(r>>18)
		p[1] = tx | byte(r>>12)&maskx
		p[2] = tx | byte(r>>6)&maskx
		p[3] = tx | byte(r)&maskx
		return 4
	}
}
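A short usage sketch may make the four cases of the switch easier to follow. It assumes a buffer of utf8.UTFMax bytes, which is always large enough; the helper name encode is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode (hypothetical helper) returns the UTF-8 bytes for r.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // 4 bytes is enough for any rune
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	// One rune per encoded length, plus a surrogate that is replaced by RuneError.
	for _, r := range []rune{'H', 'é', '钢', '😀', 0xD800} {
		fmt.Printf("%U -> % x\n", r, encode(r))
	}
}
```

The surrogate value 0xD800 is not a valid code point for UTF-8, so it comes out as ef bf bd, the encoding of U+FFFD.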

RuneCountInString

RuneCountInString counts the number of runes in the string.

If a byte is less than RuneSelf (128), it is ASCII: the index advances by one byte, the loop continues, and the rune count is incremented.

If the table entry for the byte is 0xF1 (xx), the byte is an invalid lead byte: the index advances by one byte, the loop continues, and the rune count is incremented. In other words, an invalid byte is treated as a rune of width 1.

Otherwise, the entry looked up in the first table is ANDed with 7 to obtain the rune's length in bytes. It works as follows:

Take 钢 from the string "Hello, 钢铁侠" used in the examples. len("钢") returns 3 and s[0] is 233 (0xE9). first[233] is s3 = 0x03, and 0x03 & 7 is 3, so the character is 3 bytes long.

The index into acceptRanges is x shifted right by 4 bits, which is 0 here, so the value fetched is {locb, hicb}, that is, {0x80, 0xBF}. With c = s[1] = 146, accept.lo = 128 and accept.hi = 191, the byte is within range and the following condition is not satisfied:

if c := s[i+1]; c < accept.lo || accept.hi < c {
			size = 1
		}

The check continues with c = s[2] = 162, which is also within range, so this condition is not satisfied either:

} else if c := s[i+2]; c < locb || hicb < c {
			size = 1
		} 

Since size == 3, we know how many bytes to skip, and i += size advances directly to the next rune.
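The nibble packing used by the first table can be tried out in isolation. In the sketch below the entry values are copied from the constants listed earlier, while the function name unpack is an assumption made for illustration; the real package performs these two operations inline.

```go
package main

import "fmt"

// Entry values copied from the constants shown in this article.
const (
	s1 = 0x02 // accept 0, size 2
	s2 = 0x13 // accept 1, size 3
	s3 = 0x03 // accept 0, size 3
	s7 = 0x44 // accept 4, size 4
)

// unpack (hypothetical helper) splits a first-table entry into the
// acceptRanges index (high nibble) and the encoded length (low three bits).
func unpack(x uint8) (acceptIndex, size int) {
	return int(x >> 4), int(x & 7)
}

func main() {
	for _, x := range []uint8{s1, s2, s3, s7} {
		idx, size := unpack(x)
		fmt.Printf("entry 0x%02X: acceptRanges[%d], size %d\n", x, idx, size)
	}
}
```

For the lead byte 233 of 钢 the entry is s3, so the lookup yields acceptRanges[0] and a size of 3, matching the walkthrough above.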

The other functions work in a similar way, so they are not covered in as much detail.

// RuneCountInString is like RuneCount but its input is a string.
func RuneCountInString(s string) (n int) {
    ns := len(s)
    fmt.Println(ns)
    for i := 0; i < ns; n++ {
        c := s[i]
        if c < RuneSelf {
            // ASCII fast path
            i++
            continue
        }
        fmt.Println("c=", c)
        x := first[c]
        fmt.Println("x=", x)
        if x == xx {
            i++ // invalid.
            continue
        }
        size := int(x & 7)
        fmt.Println("size=", size)
        if i+size > ns {
            i++ // Short or invalid.
            continue
        }
        accept := acceptRanges[x>>4]
        fmt.Println("accept: ", accept)
        if c := s[i+1]; c < accept.lo || accept.hi < c {
            size = 1
        } else if size == 2 {
        } else if c := s[i+2]; c < locb || hicb < c {
            size = 1
        } else if size == 3 {
        } else if c := s[i+3]; c < locb || hicb < c {
            size = 1
        }
        i += size
    }
    return n
}

Example:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, 钢铁侠"
    fmt.Println(utf8.RuneCountInString(str)) // 10
}

ValidString

ValidString reports whether the argument string consists entirely of valid UTF-8-encoded runes.

// ValidString reports whether s consists entirely of valid UTF-8-encoded runes.
func ValidString(s string) bool {
	n := len(s)
	for i := 0; i < n; {
		si := s[i]
		if si < RuneSelf {
			i++
			continue
		}
		x := first[si]
		if x == xx {
			return false // Illegal starter byte.
		}
		size := int(x & 7)
		if i+size > n {
			return false // Short or invalid.
		}
		accept := acceptRanges[x>>4]
		if c := s[i+1]; c < accept.lo || accept.hi < c {
			return false
		} else if size == 2 {
		} else if c := s[i+2]; c < locb || hicb < c {
			return false
		} else if size == 3 {
		} else if c := s[i+3]; c < locb || hicb < c {
			return false
		}
		i += size
	}
	return true
}
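A few inputs show which strings ValidString rejects. This is a small sketch under the assumption that the byte sequences below are representative; the helper name isValid is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValid (hypothetical wrapper) reports whether s is valid UTF-8.
func isValid(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	fmt.Println(isValid("Hello, 钢铁侠"))   // true
	fmt.Println(isValid("a\xffb"))       // false: 0xFF is never a valid lead byte
	fmt.Println(isValid("\xe9\x92"))     // false: truncated three-byte sequence
	fmt.Println(isValid("\xed\xa0\x80")) // false: encodes a surrogate (U+D800)
}
```

The last case is where acceptRanges matters: after the lead byte 0xED, the second byte must lie in 0x80 to 0x9F, so 0xA0 is rejected.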

RuneCount

RuneCount returns the number of runes in the byte slice argument. If utf8.RuneCountInString(str) in the first example is replaced with utf8.RuneCount([]byte(str)), the result is the same. Erroneous and short encodings are treated as single runes of width 1 byte. A single character such as H is a one-byte rune.

// RuneCount returns the number of runes in p. Erroneous and short
// encodings are treated as single runes of width 1 byte.
func RuneCount(p []byte) int {
	np := len(p)
	var n int
	for i := 0; i < np; {
		n++
		c := p[i]
		if c < RuneSelf {
			// ASCII fast path
			i++
			continue
		}
		x := first[c]
		if x == xx {
			i++ // invalid.
			continue
		}
		size := int(x & 7)
		if i+size > np {
			i++ // Short or invalid.
			continue
		}
		accept := acceptRanges[x>>4]
		if c := p[i+1]; c < accept.lo || accept.hi < c {
			size = 1
		} else if size == 2 {
		} else if c := p[i+2]; c < locb || hicb < c {
			size = 1
		} else if size == 3 {
		} else if c := p[i+3]; c < locb || hicb < c {
			size = 1
		}
		i += size
	}
	return n
}
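The rule that erroneous and short encodings count as one rune per byte is easy to confirm with a few byte slices. The inputs and the helper name count are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// count (hypothetical wrapper) returns the number of runes in p.
func count(p []byte) int {
	return utf8.RuneCount(p)
}

func main() {
	fmt.Println(count([]byte("Hello, 钢铁侠"))) // 10
	fmt.Println(count([]byte("钢")))         // 1: one complete three-byte rune
	fmt.Println(count([]byte{0xE9, 0x92}))  // 2: truncated, each byte counts as one rune
	fmt.Println(count([]byte{0xFF, 'H'}))   // 2: one invalid byte plus one ASCII byte
}
```

In the truncated case the lead byte 0xE9 is counted alone because the sequence is too short, and the leftover 0x92 is then counted as an invalid byte.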

FullRune

FullRune reports whether the bytes in p begin with a full UTF-8 encoding of a rune. If p begins with an ASCII byte (0 to 127), first[p[0]] is as = 0xF0 (decimal 240), and 240 & 7 is 0, so n >= 0 holds and the function returns true. There is one special case: an invalid encoding is also considered a full rune, because it will be converted to a width-1 error rune.

func FullRune(p []byte) bool {
    n := len(p)
    if n == 0 {
        return false
    }
    x := first[p[0]]
    if n >= int(x&7) {
        return true // ASCII, invalid or valid.
    }
    // Must be short or invalid.
    accept := acceptRanges[x>>4]
    if n > 1 && (p[1] < accept.lo || accept.hi < p[1]) {
        return true
    } else if n > 2 && (p[2] < locb || hicb < p[2]) {
        return true
    }
    return false
}
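FullRune is mostly useful when data arrives in chunks and a rune may be split across reads. The sketch below, with a hypothetical wrapper named isFull and inputs picked for illustration, shows which prefixes count as full.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isFull (hypothetical wrapper) reports whether p begins with a full rune.
func isFull(p []byte) bool {
	return utf8.FullRune(p)
}

func main() {
	fmt.Println(isFull([]byte("钢")))        // true: all three bytes present
	fmt.Println(isFull([]byte{0xE9}))       // false: two more bytes could still arrive
	fmt.Println(isFull([]byte{0xE9, 0x92})) // false: one more byte could still arrive
	fmt.Println(isFull([]byte{0xE9, 0x41})) // true: already invalid, decodes as a width-1 error rune
	fmt.Println(isFull([]byte{0xFF}))       // true: invalid lead byte
	fmt.Println(isFull(nil))                // false: nothing to decode
}
```

A false result means the prefix might still become a valid rune once more bytes arrive, so a reader would normally wait for more input before decoding.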

FullRuneInString

FullRuneInString is like FullRune, except that its input is a string.

// FullRuneInString is like FullRune but its input is a string.
func FullRuneInString(s string) bool {
    n := len(s)
    if n == 0 {
        return false
    }
    x := first[s[0]]
    if n >= int(x&7) {
        fmt.Println("--------")
        return true // ASCII, invalid, or valid.
    }
    // Must be short or invalid.
    accept := acceptRanges[x>>4]
    if n > 1 && (s[1] < accept.lo || accept.hi < s[1]) {
        fmt.Println("xxxxxx")
        return true
    } else if n > 2 && (s[2] < locb || hicb < s[2]) {
        fmt.Println("eeeee")
        return true
    }
    return false
}

A comprehensive example:

package main

import (
    "fmt"
    "reflect"
    "unicode/utf8"
)

// Numbers fundamental to the encoding.
const (
    RuneError = '\uFFFD'     // the "error" Rune or "Unicode replacement character"
    RuneSelf  = 0x80         // characters below Runeself are represented as themselves in a single byte.
    MaxRune   = '\U0010FFFF' // Maximum valid Unicode code point.
    UTFMax    = 4            // maximum number of bytes of a UTF-8 encoded Unicode character.
)

const (
    t1 = 0x00 // 0000 0000
    tx = 0x80 // 1000 0000
    t2 = 0xC0 // 1100 0000
    t3 = 0xE0 // 1110 0000
    t4 = 0xF0 // 1111 0000
    t5 = 0xF8 // 1111 1000

    maskx = 0x3F // 0011 1111
    mask2 = 0x1F // 0001 1111
    mask3 = 0x0F // 0000 1111
    mask4 = 0x07 // 0000 0111

    rune1Max = 1<<7 - 1
    rune2Max = 1<<11 - 1
    rune3Max = 1<<16 - 1

    // The default lowest and highest continuation byte.
    locb = 0x80 // 1000 0000
    hicb = 0xBF // 1011 1111

    // These names of these constants are chosen to give nice alignment in the
    // table below. The first nibble is an index into acceptRanges or F for
    // special one-byte cases. The second nibble is the Rune length or the
    // Status for the special one-byte case.
    xx = 0xF1 // invalid: size 1
    as = 0xF0 // ASCII: size 1
    s1 = 0x02 // accept 0, size 2
    s2 = 0x13 // accept 1, size 3
    s3 = 0x03 // accept 0, size 3
    s4 = 0x23 // accept 2, size 3
    s5 = 0x34 // accept 3, size 4
    s6 = 0x04 // accept 0, size 4
    s7 = 0x44 // accept 4, size 4
)

type acceptRange struct {
    lo uint8 // lowest value for second byte.
    hi uint8 // highest value for second byte.
}

var acceptRanges = [...]acceptRange{
    0: {locb, hicb},
    1: {0xA0, hicb},
    2: {locb, 0x9F},
    3: {0x90, hicb},
    4: {locb, 0x8F},
}

// first is information about the first byte in a UTF-8 sequence.
var first = [256]uint8{
    //   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x00-0x0F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x10-0x1F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x20-0x2F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x30-0x3F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x40-0x4F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x50-0x5F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x60-0x6F
    as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, as, // 0x70-0x7F
    //   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
    xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0x80-0x8F
    xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0x90-0x9F
    xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0xA0-0xAF
    xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0xB0-0xBF
    xx, xx, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, // 0xC0-0xCF
    s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, // 0xD0-0xDF
    s2, s3, s3, s3, s3, s3, s3, s3, s3, s3, s3, s3, s3, s4, s3, s3, // 0xE0-0xEF
    s5, s6, s6, s6, s7, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, xx, // 0xF0-0xFF
}


// RuneCountInString is like RuneCount but its input is a string.
func RuneCountInString(s string) (n int) {
    ns := len(s) 
    fmt.Println(ns)
    for i := 0; i < ns; n++ {
        c := s[i]
        if c < RuneSelf {
            // ASCII fast path
            i++
            continue
        }
        fmt.Println("c=", c)
        x := first[c]
        fmt.Println("x=", x)
        if x == xx {
            i++ // invalid.
            continue
        }
        size := int(x & 7)
        fmt.Println("size=", size)
        if i+size > ns {
            i++ // Short or invalid.
            continue
        }
        accept := acceptRanges[x>>4]
        fmt.Println("accept: ", accept)
        if c := s[i+1]; c < accept.lo || accept.hi < c {
            size = 1
        } else if size == 2 {
        } else if c := s[i+2]; c < locb || hicb < c {
            size = 1
        } else if size == 3 {
        } else if c := s[i+3]; c < locb || hicb < c {
            size = 1
        }
        i += size
    }
    return n
}


func FullRune(p []byte) bool {
    n := len(p)
    if n == 0 {
        return false
    }
    fmt.Println("po=", p[0])
    x := first[p[0]]
    if n >= int(x&7) {
        return true // ASCII, invalid or valid.
    }
    // Must be short or invalid.
    accept := acceptRanges[x>>4]
    if n > 1 && (p[1] < accept.lo || accept.hi < p[1]) {
        return true
    } else if n > 2 && (p[2] < locb || hicb < p[2]) {
        return true
    }
    return false
}


// FullRuneInString is like FullRune but its input is a string.
func FullRuneInString(s string) bool {
    n := len(s)
    if n == 0 {
        return false
    }
    x := first[s[0]]
    fmt.Println("xxx= ", x)
    fmt.Println("x&7= ", x&7)
    if n >= int(x&7) {
        fmt.Println("--------")
        return true // ASCII, invalid, or valid.
    }
    // Must be short or invalid.
    accept := acceptRanges[x>>4]
    if n > 1 && (s[1] < accept.lo || accept.hi < s[1]) {
        fmt.Println("xxxxxx")
        return true
    } else if n > 2 && (s[2] < locb || hicb < s[2]) {
        fmt.Println("eeeee")
        return true
    }
    return false
}

func main(){
    fmt.Println(reflect.TypeOf(acceptRanges))
    str := "Hello, 钢铁侠"
    fmt.Println(FullRuneInString(`\ubbbbbbb`))
    fmt.Println(FullRune([]byte(str)))
    fmt.Println(utf8.RuneCount([]byte(str)))
    fmt.Println(str)
    for i:=0;i<len(str);i++ {
        fmt.Println(str[i])
    }
    fmt.Println([]byte(str))
    for _, s := range str {
        fmt.Println(s)
    }
    fmt.Println(reflect.TypeOf([]rune(str)[4]))
    fmt.Println([]rune(str))
    fmt.Println([]int32(str))
    fmt.Println(utf8.RuneCountInString(str))
    //fmt.Println(first[uint8(str[6])])
    //accept := acceptRanges[4]
    fmt.Println(RuneCountInString(str))
    fmt.Println(utf8.ValidString(str))
}


Output:

[5]main.acceptRange
xxx=  240
x&7=  0
--------
true
po= 72
true
10
Hello, 钢铁侠
72
101
108
108
111
44
32
233
146
162
233
147
129
228
190
160
[72 101 108 108 111 44 32 233 146 162 233 147 129 228 190 160]
72
101
108
108
111
44
32
38050
38081
20384
int32
[72 101 108 108 111 44 32 38050 38081 20384]
[72 101 108 108 111 44 32 38050 38081 20384]
10
16
c= 233
x= 3
size= 3
accept:  {128 191}
c= 233
x= 3
size= 3
accept:  {128 191}
c= 228
x= 3
size= 3
accept:  {128 191}
10
true


Golang Unicode/UTF8 source code analysis | Niu Fengtian New Year essay
