This article implements UTF-8 encoding and Base64 encoding in plain JavaScript. After reading it, you should understand how Unicode code points are converted to and from UTF-8, and why Base64 encoding makes data longer.
Outline:
- Unicode basics
- UTF-8 encoding
- Base64 encoding
- Conclusion
Character sets such as Unicode, ASCII, and GB2312 work like dictionaries. A character is like a word, and the character's code is like the page and line on which that word appears. When different systems look up the same code in the same dictionary, they get the same character.
1. Unicode basics
From Wikipedia:
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems.
Before Unicode was created, different languages used different coded character sets; ASCII and GB2312 are examples from that period. These sets conflict with each other, which made it hard for systems in different languages to exchange text, because the same code can stand for completely different characters in different encoding systems. Unicode assigns a unique number to every character, regardless of platform, program, or language, which is why it is so widely used.
JavaScript programs are written using the Unicode character set, and the characters in a string normally come from it.
The Unicode character set is like a dictionary, and characters are like words. The Unicode code point of a character is like the page and line of a word in that dictionary.
2. UTF-8 encoding
2.1 Why does the Unicode character set need an encoding for transmission?
Once a Unicode code point is converted to binary, it is just a string of zeros and ones. When that string is sent to another party, the receiver needs a rule for splitting it back into individual characters before decoding them.
Those "split rules for transmission" are the UTF-n encodings.
Note: 8 bits = 1 byte.
UTF (Unicode Transformation Format) does not change the code assigned to each character in the character set. It defines an encoding method that maps character codes to the codes that are actually transmitted. Its main goal is to save bandwidth and disk space while remaining fully compatible with the Unicode character set.
Unicode is a set of symbols that specifies a numeric code for each symbol, but it does not specify how that code should be stored (that is, how many bytes it occupies), so several storage schemes have emerged:
- UTF-32: every character is represented by four bytes.
- UTF-16: a character is represented by two or four bytes.
- UTF-8: a variable-length encoding that uses 1 to 4 bytes per character as needed. Because it spends only as many bytes as a character requires, it saves bandwidth and disk space, which is why UTF-8 is so widely used.
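To make the size difference concrete, the sketch below counts how many bytes a few sample characters would occupy in each of the three formats. It assumes a runtime that provides the standard TextEncoder (modern browsers and Node.js); the helper name byteSizes is made up for this illustration.

```javascript
// Hypothetical helper: byte counts of a string in the three UTF formats.
function byteSizes(str) {
  const utf8 = new TextEncoder().encode(str).length // TextEncoder always produces UTF-8
  const utf16 = str.length * 2                      // a JS string is a sequence of 16-bit code units
  const utf32 = [...str].length * 4                 // spreading a string iterates by code point
  return { utf8, utf16, utf32 }
}

for (const ch of ['A', '\u00e9', '中', '𠮷']) {
  console.log(ch, byteSizes(ch))
}
```

For ASCII text UTF-8 is the most compact of the three, while for a character such as 中 it needs one byte more than UTF-16.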
2.2 UTF-8 encoding rules
- For single-byte symbols, the first bit of the byte is set to 0 and the remaining 7 bits hold the Unicode code of the symbol.
- For n-byte symbols (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of every following byte are set to 10. The remaining bits hold the Unicode code of the symbol.
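Read from the decoder's side, these rules mean the first byte alone tells you how long the sequence is. A minimal sketch, with sequenceLength as a hypothetical helper name that does not appear in the article's own code:

```javascript
// Hypothetical helper: length of a UTF-8 sequence, judged from its first byte.
function sequenceLength(firstByte) {
  if ((firstByte & 0x80) === 0x00) return 1 // 0xxxxxxx
  if ((firstByte & 0xe0) === 0xc0) return 2 // 110xxxxx
  if ((firstByte & 0xf0) === 0xe0) return 3 // 1110xxxx
  if ((firstByte & 0xf8) === 0xf0) return 4 // 11110xxx
  return 0 // anything else (such as a 10xxxxxx continuation byte) cannot start a character
}

console.log(sequenceLength(0x41)) // 1
console.log(sequenceLength(0xe4)) // 3
console.log(sequenceLength(0xa5)) // 0
```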
Example: UTF-8 encoding of a character
Get the character's code point with codePointAt, work out how many bytes it needs, and then fill its bits into the byte template according to the rules above.
Encoding process:
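As an illustration of that process, the sketch below encodes the character 严 (U+4E25) by hand, using bit strings so each step stays visible. The character is chosen only as an example; it falls in the range that needs three bytes.

```javascript
const ch = '严'
const code = ch.codePointAt(0)                   // 0x4e25
const bits = code.toString(2).padStart(16, '0')  // a 3-byte sequence carries 16 payload bits

// Template for a 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
const filled = [
  '1110' + bits.slice(0, 4),
  '10' + bits.slice(4, 10),
  '10' + bits.slice(10, 16),
]
const hex = filled.map((b) => parseInt(b, 2).toString(16))

console.log(bits)   // 0100111000100101
console.log(filled) // [ '11100100', '10111000', '10100101' ]
console.log(hex)    // [ 'e4', 'b8', 'a5' ]
```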
Supplementary information:
ES6 provides the **codePointAt()** method, which correctly handles characters stored in four bytes and returns the character's code point (its Unicode code).
ES6 provides the **String.fromCodePoint()** method, which takes a code point (Unicode code) and returns the corresponding character.
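A short sketch of how these two methods behave on a character outside the Basic Multilingual Plane, compared with the older charCodeAt. The sample character 𠮷 (U+20BB7) is the same one used in the test string later in the article.

```javascript
const s = '𠮷' // U+20BB7, stored as two UTF-16 code units

console.log(s.length)                            // 2
console.log(s.charCodeAt(0).toString(16))        // d842 (high surrogate only)
console.log(s.codePointAt(0).toString(16))       // 20bb7 (the full code point)
console.log(String.fromCodePoint(0x20bb7) === s) // true
console.log([...s].length)                       // 1 (iteration is by code point)
```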
2.3 A simple implementation of UTF-8 encoding and decoding
// Encode a string into an array of UTF-8 byte values.
function encodeUtf8(str) {
  const bytes = []
  // for...of iterates by code point, so characters stored as surrogate pairs stay whole.
  for (const ch of str) {
    const code = ch.codePointAt(0)
    if (code >= 0x10000 && code <= 0x10ffff) { // 4 bytes
      bytes.push((code >> 18) | 0xf0)
      bytes.push(((code >> 12) & 0x3f) | 0x80)
      bytes.push(((code >> 6) & 0x3f) | 0x80)
      bytes.push((code & 0x3f) | 0x80)
    } else if (code >= 0x800 && code <= 0xffff) { // 3 bytes
      bytes.push((code >> 12) | 0xe0)
      bytes.push(((code >> 6) & 0x3f) | 0x80)
      bytes.push((code & 0x3f) | 0x80)
    } else if (code >= 0x80 && code <= 0x7ff) { // 2 bytes
      bytes.push((code >> 6) | 0xc0)
      bytes.push((code & 0x3f) | 0x80)
    } else { // 1 byte (ASCII)
      bytes.push(code)
    }
  }
  return bytes
}

// Left-pad str with prefix up to len characters.
// The native str.padStart(len, prefix) does the same job.
function padStart(str, len, prefix) {
  return (new Array(len + 1).join(prefix) + str).slice(-len)
}

// Decode a hex string of UTF-8 bytes back into a string.
function decodeUtf8(hexStr) {
  let strValue = ''
  // Each hex digit expands to 4 bits; regroup the bits into 8-bit bytes.
  const bytes = ([...hexStr]
    .map((ch) => padStart(parseInt(ch, 16).toString(2), 4, '0'))
    .join('')
    .match(/\d{8}/g) || [])
    .map((item) => parseInt(item, 2))
  // Payload of a continuation byte (10xxxxxx) as a 6-character bit string.
  const tail = (byte) => padStart((byte & 0x3f).toString(2), 6, '0')
  for (let i = 0; i < bytes.length; ) {
    const code = bytes[i]
    let bits, size
    if ((code & 0xf0) === 0xf0) { // 11110xxx: 4 bytes, 3 payload bits in the lead byte
      bits = (code & 0x07).toString(2) + tail(bytes[i + 1]) + tail(bytes[i + 2]) + tail(bytes[i + 3])
      size = 4
    } else if ((code & 0xe0) === 0xe0) { // 1110xxxx: 3 bytes, 4 payload bits
      bits = (code & 0x0f).toString(2) + tail(bytes[i + 1]) + tail(bytes[i + 2])
      size = 3
    } else if ((code & 0xc0) === 0xc0) { // 110xxxxx: 2 bytes, 5 payload bits
      bits = (code & 0x1f).toString(2) + tail(bytes[i + 1])
      size = 2
    } else { // 0xxxxxxx: 1 byte
      bits = code.toString(2)
      size = 1
    }
    strValue += String.fromCodePoint(parseInt(bits, 2))
    i += size
  }
  return strValue
}

// Convert a byte array to a hex string, two digits per byte,
// so that values below 0x10 keep their leading zero.
function transferHex(bytes) {
  return bytes.map((b) => padStart(b.toString(16), 2, '0')).join('')
}

const text = '𠮷 SSDF 34534 ASD'
const strHex = transferHex(encodeUtf8(text))
console.log(strHex)
const str = decodeUtf8(strHex)
console.log(str)
console.log('test ok?', text === str)
3. Base64 encoding
3.1 Base64 encoding rules
Rule: Base64 encoding converts every three 8-bit bytes into four 6-bit groups, and then prefixes each 6-bit group with two high zero bits so that it fills a full 8-bit byte.
If the number of bytes to encode is not a multiple of three, the last 6-bit group is padded with zero bits at the end, and one or two '=' characters are appended to the encoded output.
Each 6-bit group is encoded as: CHARTS[parseInt(sixBits, 2)]
CHARTS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
Encoding process:
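The steps can be sketched on raw bytes as follows. The helper name encodeBytes and the constant name CHARS are made up for this illustration, and the sample bytes are the ASCII codes of "Man", "Ma", and "M", so all three padding cases appear.

```javascript
const CHARS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

// Hypothetical helper: Base64-encode an array of byte values, step by step.
function encodeBytes(bytes) {
  // 1. Join the bytes into one long bit string.
  let bits = bytes.map((b) => b.toString(2).padStart(8, '0')).join('')
  // 2. Pad with zero bits until the length is a multiple of 6.
  while (bits.length % 6 !== 0) bits += '0'
  // 3. Map every 6-bit group to a character of the alphabet.
  let out = (bits.match(/.{6}/g) || []).map((g) => CHARS[parseInt(g, 2)]).join('')
  // 4. Append '=' until the output length is a multiple of 4.
  while (out.length % 4 !== 0) out += '='
  return out
}

console.log(encodeBytes([0x4d, 0x61, 0x6e])) // TWFu
console.log(encodeBytes([0x4d, 0x61]))       // TWE=
console.log(encodeBytes([0x4d]))             // TQ==
```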
3.2 A simple implementation of Base64 encoding and decoding
// Reuses encodeUtf8, decodeUtf8 and text from section 2.3.
const CHARTS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

// Right-pad str with fill up to len characters.
function padEnd(str, len, fill) {
  return (str + new Array(len + 1).join(fill)).slice(0, len)
}

// Left-pad str with prefix up to len characters.
function padStart(str, len, prefix) {
  return (new Array(len + 1).join(prefix) + str).slice(-len)
}

function encodeBase64(str) {
  let bitStr = ''
  for (const byte of encodeUtf8(str)) {
    bitStr += padStart(byte.toString(2), 8, '0')
  }
  // A run of whole bytes leaves 0, 2 or 4 bits over when cut into 6-bit groups.
  const rest = bitStr.length % 6
  const zeroBits = rest === 0 ? 0 : 6 - rest // zero bits needed to complete the last group
  const padding = rest === 2 ? '==' : rest === 4 ? '=' : ''
  bitStr = padEnd(bitStr, bitStr.length + zeroBits, '0')
  const groups = bitStr.match(/\d{6}/g) || []
  return groups.map((val) => CHARTS[parseInt(val, 2)]).join('') + padding
}

function decodeBase64(str) {
  // Drop the '=' padding, expand each character to 6 bits, regroup into bytes,
  // and hand the bytes to decodeUtf8 as a two-digits-per-byte hex string.
  const hexStr = ([...str.replace(/=/g, '')]
    .map((ch) => padStart(CHARTS.indexOf(ch).toString(2), 6, '0'))
    .join('')
    .match(/\d{8}/g) || [])
    .map((item) => padStart(parseInt(item, 2).toString(16), 2, '0'))
    .join('')
  return decodeUtf8(hexStr)
}

const base64Str = encodeBase64(text)
console.log(base64Str)
const decoded = decodeBase64(base64Str)
console.log(decoded)
console.log('test ok?', text === decoded)
Because Base64 converts every three 8-bit bytes into four 6-bit groups, each stored as one character, the encoded output is about one third larger than the input: four output characters for every three input bytes, plus any '=' padding.
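The growth can be checked with a little arithmetic. The sketch below uses a hypothetical helper, base64Length, which returns the number of output characters for n input bytes with padding included.

```javascript
// Hypothetical helper: Base64 output length for n input bytes, padding included.
function base64Length(n) {
  return 4 * Math.ceil(n / 3)
}

for (const n of [1, 2, 3, 300, 3000]) {
  const ratio = (base64Length(n) / n).toFixed(2)
  console.log(n, '->', base64Length(n), 'ratio', ratio)
}
```

For very short inputs the padding dominates, but the ratio settles at roughly 1.33 as the input grows.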
4. Conclusion
An encoding is simply a way of representing a character set. This article implemented UTF-8 and Base64 encoding in JavaScript in a simple way. The code is rough and my understanding may be inaccurate in places, so corrections are welcome. Feel free to discuss and learn together.