About the author: Nekron, Ant Financial · Data Experience Technology team

Background

Chinese is broad and profound, and early file encodings were never unified, so a file you open today may be encoded in GB2312, GBK, GB18030, UTF-8, BIG5, and so on. Because encoding and decoding are fairly niche topics, my understanding of these encodings had always been superficial: I often wondered whether an encoding name should be uppercase or lowercase, whether a “-” was needed between the letters and the digits, who decided these rules, and so on.

My rough understanding at the time was as follows:

  • GB2312 — the earliest simplified Chinese encoding; it also has an overseas variant, HZ-GB-2312.
  • BIG5 — traditional Chinese encoding, used mainly in Taiwan. The garbled text in some traditional Chinese games is actually caused by mixing up BIG5 and GB2312.
  • GBK — simplified + traditional; I think of it as GB2312 + BIG5. It is not a national standard, but it is followed almost everywhere in Chinese environments. I later learned that the K is the first letter of “kuozhan” (extension) in Pinyin, which is very Chinese…
  • GB18030 — the newest member of the GB family, backward compatible with the earlier ones, and the current national standard. Chinese software is now expected to support this encoding, making it the first choice when decoding files.
  • UTF-8 — needs no explanation: the international standard, and the standard encoding for HTML today.

Sorting out the concepts

After stepping into these pitfalls for a long time, I finally gained a certain understanding of the area. The important concepts are reorganized below.

First of all, to digest character encoding and decoding, two concepts must be kept apart — character set and character encoding.

Character set

As the name implies, a character set is a collection of characters. The most obvious difference between character sets is the number of characters they contain. Common character sets include the ASCII, GB2312, BIG5, GB18030 and Unicode character sets.

Character encoding

A character encoding determines how a character set is mapped to actual binary bytes. Each character encoding has its own design rules, such as whether it is fixed-length or variable-length, which are not expanded on here.

When GB2312, BIG5, UTF-8, etc. are mentioned without further qualification, they generally refer to character encodings rather than character sets.

Character sets and character encodings have a one-to-many relationship: the same character set can have multiple character encodings. For example, the Unicode character set has UTF-8, UTF-16, and so on.
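
As a minimal sketch of this relationship (using Node's Buffer, which also appears later in this article), the same character from the Unicode character set turns into different bytes depending on the encoding chosen:

const ch = '中'

// UTF-8 and UTF-16LE encode the same Unicode character differently
console.log(Buffer.from(ch, 'utf8'))    // <Buffer e4 b8 ad>
console.log(Buffer.from(ch, 'utf16le')) // <Buffer 2d 4e>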

BOM (Byte Order Mark)

When saving a file with Windows Notepad, the selectable encodings are ANSI (a GB-family encoding determined by the locale), Unicode, and UTF-8.

To avoid confusion later, note that the encoding called Unicode here is actually UTF-16LE.

With so many encodings, how does Windows decide which one to use when a file is opened?

The answer: Windows (the simplified Chinese edition, for example) writes a few bytes at the head of the file to indicate the encoding. Three bytes (0xEF, 0xBB, 0xBF) mean UTF-8; two bytes (0xFF, 0xFE or 0xFE, 0xFF) mean UTF-16 (Unicode); no marker means a GB-family encoding.

It is worth noting that the BOM carries no textual meaning, so it should be discarded when parsing the file content; otherwise the parsed content will have extra characters at the head.

LE (little-endian) and BE (big-endian)

This is a byte-order topic that is not the focus of this article, but it is explained here in passing. LE and BE describe byte order: whether a value is stored starting from its least significant byte or its most significant byte.

The CPUs we commonly use are little-endian, so Windows defaults to LE when a “Unicode” file does not specify its byte order.

The Node Buffer API has two families of functions for handling LE and BE:

const buf = Buffer.from([0, 5]);

// Prints: 5
console.log(buf.readInt16BE());

// Prints: 1280
console.log(buf.readInt16LE());

Decoding in Node

I first ran into this problem when using Node, and the available options were:

  • node-iconv (a wrapper around the system iconv)
  • iconv-lite (pure JS)

Because node-iconv requires node-gyp builds, and I develop on Windows, preparing the node-gyp environment and the string of installations and builds that follow would drive a web developer like me crazy, so I finally chose iconv-lite.

Decoding then looks roughly like this:

const fs = require('fs')
const iconv = require('iconv-lite')

const buf = fs.readFileSync('/path/to/file')

// Inspect the first few bytes to check for a BOM
buf.slice(0, 3).equals(Buffer.from([0xef, 0xbb, 0xbf])) // UTF-8
buf.slice(0, 2).equals(Buffer.from([0xff, 0xfe])) // UTF-16LE

const str = iconv.decode(buf, 'gbk')

// How to judge whether decoding succeeded depends on the business scenario:
// take the first few characters and check whether they contain Chinese characters,
// or, conversely, check whether they contain garbled characters.
// The \u** escapes in the regular expression are Unicode code points; this range
// covers common Chinese characters and can be widened for specific scenarios.
/[\u4e00-\u9fa5]/.test(str.slice(0, 3))

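To put the pieces together, here is a rough sketch (not taken from the original workflow) that combines BOM detection with iconv-lite decoding; readTextFile is a hypothetical helper, and falling back to GBK when no BOM is found is an assumption borrowed from the Windows behaviour described earlier:

const fs = require('fs')
const iconv = require('iconv-lite')

// Hypothetical helper: pick the decoding by BOM, fall back to GBK (assumption)
function readTextFile (path) {
  const buf = fs.readFileSync(path)
  if (buf.slice(0, 3).equals(Buffer.from([0xef, 0xbb, 0xbf]))) {
    return iconv.decode(buf.slice(3), 'utf-8')
  }
  if (buf.slice(0, 2).equals(Buffer.from([0xff, 0xfe]))) {
    return iconv.decode(buf.slice(2), 'utf-16le')
  }
  return iconv.decode(buf, 'gbk')
}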

Decoding in the front end

As browser implementations of ES2015 [1] become more widespread, encoding and decoding in the front end becomes feasible. The old flow of uploading a file through a form so the back end can parse its content can now be handled entirely by the front end, which means fewer network round trips and, thanks to immediate interface feedback, a more intuitive user experience.

The general scenario is as follows:

const file = document.querySelector('.input-file').files[0]
const reader = new FileReader()

reader.onload = () => {
	const content = reader.result
}
reader.onprogress = evt => {
	// Reading progress
}
reader.readAsText(file, 'utf-8') // The encoding can be changed

A list of the encodings supported by FileReader is available here.

One interesting detail: if the file contains a BOM, for example a UTF-8 BOM, the specified encoding is ignored, and the BOM is stripped from the output, which makes the result easier to use.

If you need finer control over the encoding, you can read the content as an ArrayBuffer and wrap it in a TypedArray:

reader.onload = () => {
	const buf = new Uint8Array(reader.result)
	// ... finer-grained handling here
}
reader.readAsArrayBuffer(file)

Once we have the raw bytes of the text content, we can call TextDecoder to decode them; note, however, that the TypedArray obtained this way still contains the BOM:

const decoder = new TextDecoder('gbk') 
const content = decoder.decode(buf)
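
Since the bytes may still carry a BOM, a rough sketch of handling it (reusing the buf Uint8Array from the onload handler above, checking only for the UTF-8 BOM, and falling back to GBK as an assumption for this scenario) could look like this:

// If a UTF-8 BOM is present, decode as UTF-8 and skip the marker;
// otherwise fall back to GBK (an assumption for this scenario)
const hasUtf8Bom = buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf
const content = hasUtf8Bom
	? new TextDecoder('utf-8').decode(buf.subarray(3))
	: new TextDecoder('gbk').decode(buf)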

If the file is large, Blob's slice method can be used to read it in chunks:

const file = document.querySelector('.input-file').files[0]
const blob = file.slice(0, 1024)
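
For example, to inspect only the head of a large file before deciding how to decode the rest, one rough approach (the 1024-byte size is arbitrary) is:

const head = file.slice(0, 1024)
const reader = new FileReader()

reader.onload = () => {
	const bytes = new Uint8Array(reader.result)
	// check the first bytes (e.g. for a BOM) before reading the whole file
}
reader.readAsArrayBuffer(head)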

Newline characters vary with the operating system. If line-by-line parsing is required, they need to be handled according to the scenario:

  • Linux: \n
  • Windows: \r\n
  • Mac OS: \r

Note: these are the default rules of each system's built-in text editor. Other software, such as the commonly used Sublime, VS Code, or Excel, can configure its own newline character, which is usually \n or \r\n.
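
For line-by-line parsing, one simple approach that tolerates all three conventions is to split on a regular expression covering every variant (assuming content holds the decoded text from the FileReader example above):

// \r\n must come first so Windows newlines are not split into two breaks
const lines = content.split(/\r\n|\r|\n/)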

Encoding in the front end

TextEncoder can be used to convert string content into a TypedArray:

const encoder = new TextEncoder() 
encoder.encode('text to encode')

It is worth noting that, starting with Chrome 53, TextEncoder only supports UTF-8 [2]; the official reason is that the other encodings were used too rarely. There is a polyfill library that restores the removed encodings.

Generating files in the front end

After encoding in the front end, the usual next step is to generate a file. Example code:

const a = document.createElement('a')
const buf = new TextEncoder()
const blob = new Blob([buf.encode('I am text')], {
	type: 'text/plain'
})

a.download = 'file'
a.href = URL.createObjectURL(blob)
a.click()

This generates a file named file, with the extension determined by type. If you want to export CSV, you only need to change the corresponding MIME type:

const blob = new Blob([buf.encode('First line,1\r\nSecond line,2')], {
	type: 'text/csv'
})

CSV files are generally opened with Excel by default, and you will find that the content in the first column is garbled. This is because Excel follows the Windows encoding logic mentioned above: when no BOM is found, it decodes with GB18030, and the UTF-8 content comes out garbled.

In this case, you only need to add a BOM to indicate the encoding format:

const blob = new Blob([new Uint8Array([0xef, 0xbb, 0xbf]), buf.encode('First line,1\r\nSecond line,2')], {
	type: 'text/csv'
})

// or

const blob = new Blob([buf.encode('\ufeffFirst line,1\r\nSecond line,2')], {
	type: 'text/csv'
})

A small clarification here: UTF-8 and UTF-16LE are both encodings of the Unicode character set; they just implement it differently. So, following fixed rules, the two can be converted into each other, and the UTF-16LE BOM character (\ufeff) converted to UTF-8 is exactly the UTF-8 BOM bytes.
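
A quick way to verify this with the TextEncoder used above: encoding the BOM code point U+FEFF as UTF-8 yields exactly the three bytes written by hand in the first variant.

// U+FEFF encoded as UTF-8 is the UTF-8 BOM
new TextEncoder().encode('\ufeff') // Uint8Array [239, 187, 191], i.e. 0xef 0xbb 0xbf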

References:

  1. TypedArray
  2. TextEncoder


This article introduced character encoding and decoding. If you are interested, you can follow the column or send your resume to 'tao.qit####alibaba-inc.com'.replace('####', '@').

Original address: github.com/ProtoTeam/b…