Introduction

Character encoding, character length errors, truncated character errors, UTF-8, Unicode

Under the hood, computers store nothing but zeros and ones, so how do those bits turn into strings? An everyday analogy is looking up the character “wo” in the Xinhua Dictionary. In the same way, a computer uses a combination of 0s and 1s as an index to find the corresponding character in a dictionary. So what does that dictionary contain?

Origins

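Computers were born in the United States, and most early users wrote in English. The American National Standards Institute developed the first such dictionary, ASCII, which contains a total of 128 characters, including 26 uppercase letters, 26 lowercase letters, and 10 Arabic numerals. Each character fits in a single byte, and the 128 byte values that ASCII leaves unused were later claimed by various extended character sets.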

Chaos

A single byte only ranges from 0000 0000 to 1111 1111, which is far too small to hold Chinese characters. So what do we do? As the old saying goes, every era produces its own talents: the GB family of national-standard encodings appeared, the most familiar of which is GB2312.

Here is the problem: there are roughly 2,500 to 3,500 languages in the world, about 930 of which have a written form. Imagine reading text in several languages and having to switch dictionaries constantly. Whenever a character cannot be found in the current dictionary, garbled text appears.

Unification

One script for all writing, one gauge for all cart tracks, one code for all conduct.

The sentence above praises the achievements of Qin Shi Huang, but unifying the world's languages is impossible in reality. So can we approach the problem differently? Consider the question “How many steps does it take to put an elephant in a refrigerator?” The answer, as everyone knows, is “Open the door, put it in, close the door.” Unicode takes the same direct approach: create a dictionary big enough to hold every character.

Unicode

Unicode is, as its name suggests, a universal code. It is maintained by the Unicode Consortium and assigns every character in the world's writing systems a unique number called a code point. The corresponding encoding forms are UTF-8, UTF-16, and UTF-32.

Variable-length character encoding

UTF-8, UTF-16, and UTF-32 differ in how many bytes they use to store each character.
UTF-8 is the most widely used Unicode encoding on the Internet.

The defining feature of UTF-8 and UTF-16 is that they are variable-length (UTF-32, by contrast, always uses 4 bytes). For example, UTF-8 uses 1 to 4 bytes to record a code point. Why is it designed this way? The one-byte 0000 0000 and the two-byte 0000 0000 0000 0000 can both represent 0, so using the shorter form takes up less space and transfers faster.
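As a rough illustration of the 1-to-4-byte range, the sketch below counts the UTF-8 bytes of a few sample characters. It assumes an environment where the standard TextEncoder API is available as a global (modern browsers and recent Node.js); the helper name utf8ByteLength is made up for this example.

```javascript
// Hypothetical helper: number of bytes a string occupies when encoded as UTF-8.
function utf8ByteLength(text) {
    // TextEncoder always encodes to UTF-8 and returns a Uint8Array.
    return new TextEncoder().encode(text).length;
}

console.log(utf8ByteLength("A"));       // 1 (ASCII letter)
console.log(utf8ByteLength("\u00E9"));  // 2 (é)
console.log(utf8ByteLength("\u4E2D"));  // 3 (中)
console.log(utf8ByteLength("😊"));      // 4 (emoji outside the 16-bit range)
```

Characters that English text uses most cost a single byte, which is where the space saving comes from.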

Interlude

There was also an encoding called UCS-2 that always used exactly two bytes per character, unlike the variable-length UTF encodings. During the unification process it was absorbed into UTF-16.

JavaScript character processing

Now that you understand the fundamentals and the history, which encoding does JavaScript use? That’s right, it is the UCS-2 mentioned above, because UTF-16 did not exist yet when JavaScript was born.

But today everyone uses variable-length encodings. UTF-16 uses two or four bytes per character, whereas UCS-2 uses only two. As a result, there are 4-byte characters that UTF-16 can represent but UCS-2 does not recognize. JavaScript naively processes them two bytes at a time, as UCS-2 would. Neither half is a character in the dictionary, so the output is garbled.
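To make the two-versus-four-byte difference concrete, here is a minimal sketch of how UTF-16 splits a code point above U+FFFF into two 16-bit units (a surrogate pair). The function name toSurrogatePair is hypothetical; the arithmetic follows the UTF-16 definition.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
    const offset = codePoint - 0x10000;         // 20 bits remain after removing the base
    const high = 0xD800 + (offset >> 10);       // top 10 bits go into the high surrogate
    const low = 0xDC00 + (offset & 0x3FF);      // bottom 10 bits go into the low surrogate
    return [high, low];
}

const [high, low] = toSurrogatePair(0x1F60A);   // U+1F60A is 😊
console.log(high.toString(16), low.toString(16));        // d83d de0a
console.log(String.fromCharCode(high, low) === "😊");    // true
```

Each half falls in a range (D800 to DFFF) that is reserved and maps to no character on its own, which is why code that handles the halves separately produces garbage.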

Emoji are popular, and most of them fall outside the UCS-2 dictionary, so front-end code needs care wherever emoji may appear:

String length

Bug warning

Most commonly used emoji are encoded in 4 bytes. Because UCS-2 is fixed at two bytes, an emoji counts as two UCS-2 characters when the length is computed, so the result is twice what you would expect.

let emoji = "😊";

// outputs 2
console.log(emoji.length);

Fixing the bug

Use Array.from or the spread operator (both ES6):

let emoji = "😊";

// outputs 1
console.log(Array.from(emoji).length);

// outputs 1
console.log([...emoji].length);

If Array.from is not supported, we can use a regular expression to replace each 4-byte character (a surrogate pair) with _ and then take the length:

let emoji = "😊";

function countSymbols(string) {
    var regexAstralSymbols = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
    return string
        .replace(regexAstralSymbols, '_')
        .length;
}

// outputs 1
console.log(countSymbols(emoji));

Other string operations, such as concatenation or substitution, can also be implemented using arrays.
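As one possible example of an array-based operation, the sketch below truncates a string by character instead of by 16-bit unit. The helper name truncateByCodePoint and the sample string are made up for illustration.

```javascript
// Hypothetical helper: keep at most `maxChars` characters, counting by code point.
function truncateByCodePoint(text, maxChars) {
    // Array.from splits the string by code point, so surrogate pairs stay together.
    return Array.from(text).slice(0, maxChars).join("");
}

const greeting = "hi😊!";

// String.prototype.slice counts 16-bit units and cuts the emoji in half.
console.log(greeting.slice(0, 3) === "hi\uD83D");   // true: a lone surrogate is left behind

// The array-based version keeps the emoji whole.
console.log(truncateByCodePoint(greeting, 3));      // "hi😊"
```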

Reversing a string

As mentioned above, an emoji is treated as two UCS-2 characters, so reversing the string swaps the two halves of its 4-byte sequence and breaks the character. The Esrever library can be used to solve this.

let emoji = "😊";

// outputs two garbled characters
console.log(emoji.split('').reverse().join(''));
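For cases where pulling in a library is not an option, a lightweight alternative is to reverse by code point with Array.from. This is only a sketch: the helper name reverseByCodePoint is made up, and it keeps surrogate pairs intact but makes no attempt to handle combining marks or multi-code-point emoji sequences.

```javascript
// Hypothetical helper: reverse a string without splitting surrogate pairs.
function reverseByCodePoint(text) {
    return Array.from(text).reverse().join("");
}

console.log(reverseByCodePoint("ab😊"));   // "😊ba"
```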

Character code conversion

String.prototype.charCodeAt and String.fromCharCode can also misbehave with these characters. ES6 adds two methods that can replace them: String.prototype.codePointAt and String.fromCodePoint.
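The difference between the old and new methods can be seen in a short comparison. The variable names are only for illustration.

```javascript
const smiley = "😊";   // U+1F60A

// charCodeAt only sees the first 16-bit unit (the high surrogate).
console.log(smiley.charCodeAt(0).toString(16));    // d83d

// codePointAt reads the whole surrogate pair.
console.log(smiley.codePointAt(0).toString(16));   // 1f60a

// fromCodePoint accepts code points above 0xFFFF.
console.log(String.fromCodePoint(0x1F60A) === smiley);   // true

// fromCharCode keeps only the low 16 bits, so it yields a different character.
console.log(String.fromCharCode(0x1F60A) === smiley);    // false
```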

Regular expression matching

In a regular expression, . matches a single character, but a 4-byte UTF-16 character is treated as two characters, so the match goes wrong. ES6 adds the u flag to fix this: /^.$/u.test('😊') returns true, while the same pattern without u returns false. Remember to add the flag when writing regular expressions.
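A quick way to see the effect of the u flag is to count how many times . matches inside a single emoji. The variable name is illustrative.

```javascript
const face = "😊";

// Without the u flag, . matches each 16-bit half separately.
console.log(face.match(/./g).length);    // 2
console.log(/^.$/.test(face));           // false

// With the u flag, . matches the whole code point.
console.log(face.match(/./gu).length);   // 1
console.log(/^.$/u.test(face));          // true
```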

String traversal

To traverse a string character by character, use a for...of statement, which iterates by code point.
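The sketch below contrasts an index-based loop with for...of on a sample string (the string and variable names are made up for illustration).

```javascript
const sample = "a😊b";

// An index-based loop visits 16-bit units, so the emoji is seen as two pieces.
const byIndex = [];
for (let i = 0; i < sample.length; i++) {
    byIndex.push(sample[i]);
}
console.log(byIndex.length);        // 4

// for...of uses the string iterator, which yields whole code points.
const byCodePoint = [];
for (const ch of sample) {
    byCodePoint.push(ch);
}
console.log(byCodePoint.length);    // 3
console.log(byCodePoint[1]);        // "😊"
```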

Scenario

If the back-end database allows emoji in user names, the front end must watch out for counting errors caused by 4-byte UTF-16 characters when limiting the user name length. The same applies to other similar scenarios.
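A front-end length check along these lines might look like the sketch below. The limit of 8 and the names MAX_USER_NAME_LENGTH and isUserNameLengthValid are assumptions for illustration, and the back end may count differently (for example in bytes), so the two sides should agree on one rule.

```javascript
// Hypothetical limit; a real application would choose its own.
const MAX_USER_NAME_LENGTH = 8;

// Hypothetical validator: counts characters by code point rather than by 16-bit unit.
function isUserNameLengthValid(userName) {
    const length = Array.from(userName).length;
    return length > 0 && length <= MAX_USER_NAME_LENGTH;
}

const userName = "😊😊😊😊😊";   // 5 emoji

console.log(isUserNameLengthValid(userName));            // true: 5 characters
console.log(userName.length <= MAX_USER_NAME_LENGTH);    // false: .length reports 10
```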

Tip: When developing for WeChat Official Accounts, user names and user input may contain emoji and similar characters, so the database character set must be configured to support them.

Don’t ask me why I know, because there are often tears in my eyes.

This article was originally written by PushMeTop.