- How JavaScript handles Unicode encoding correctly!
- Author: Front-end xiaozhi
Reproduced by Fundebug with authorization; copyright belongs to the original author.
The way JavaScript handles Unicode is surprising, to say the least. This article explains the pain points associated with handling Unicode in JavaScript, provides solutions to common problems, and explains how the ECMAScript 6 standard can improve the situation.
Unicode Basics
Before diving into JavaScript, let’s explain some of the basics of Unicode so that we all know at least a little about Unicode.
Unicode is the character set used by most programs today. The idea is simple: it maps every character to a number called a code point. Code point values range from U+0000 to U+10FFFF, which leaves room for more than 1.1 million code points. Here are some characters and their code points.
- A’s code point U+0041
- a’s code point U+0061
- © code point U+00A9
- ☃ code point U+2603
- 💩 code point U+1F4A9
Code points are usually written as hexadecimal numbers, zero-padded to at least four digits, with a U+ prefix.
The first 65,536 code points form the Basic Multilingual Plane (BMP, or Plane 0), which ranges from U+0000 to U+FFFF. It contains the most commonly used characters and was the first plane that Unicode defined and published.
The remaining code points, from U+010000 to U+10FFFF, belong to the 16 supplementary planes, also called astral planes.
A code point in a supplementary plane is easy to recognize: if it needs more than four hexadecimal digits, it is an astral code point.
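The notation and the "more than four hex digits" rule can be sketched in a few lines. The helper names formatCodePoint and isAstral are made up for this illustration and are not part of any library:

```javascript
// Format a code point in the usual notation: "U+" followed by
// at least four uppercase hexadecimal digits.
function formatCodePoint(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// A code point is astral when it lies above the BMP (U+FFFF)
// and within the Unicode range (up to U+10FFFF).
function isAstral(codePoint) {
  return codePoint > 0xFFFF && codePoint <= 0x10FFFF;
}

console.log(formatCodePoint(0xA9));    // U+00A9
console.log(formatCodePoint(0x1F4A9)); // U+1F4A9
console.log(isAstral(0x2603));         // false
console.log(isAstral(0x1F4A9));        // true
```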
Now that you have a basic understanding of Unicode, let’s look at how it applies to JavaScript strings.
Escape sequences
Enter the following in the Chrome DevTools console:
>> '\x41\x42\x43'
'ABC'
>> '\x61\x62\x63'
'abc'
These are called hexadecimal escape sequences. Each consists of two hexadecimal digits that refer to the matching code point. For example, \x41 represents U+0041, the capital letter A. These escape sequences can be used for code points in the range U+0000 to U+00FF.
Also common are the following types of escapes:
>> '\u0041\u0042\u0043'
'ABC'
>> 'I \u2661 JavaScript!'
'I ♡ JavaScript!'
These are called Unicode escape sequences. Each consists of exactly four hexadecimal digits that represent a code point. For example, \u2661 represents U+2661, a heart. These escape sequences can be used for code points in the range U+0000 to U+FFFF, that is, the entire BMP.
But what about the supplementary planes? Their code points need more than four hexadecimal digits, so how do we escape them?
In ECMAScript 6 this is easy, because it introduces a new kind of escape sequence: the Unicode code point escape. For example:
>> '\u{41}\u{42}\u{43}'
'ABC'
>> '\u{1F4A9}'
'💩' // U+1F4A9 PILE OF POO
You can use up to six hexadecimal digits between braces, which is enough to represent all Unicode code points. Therefore, by using this type of escape sequence, you can easily escape any Unicode code point based on its code point.
For backward compatibility with ECMAScript 5 and older environments, the unfortunate solution is to use surrogate pairs:
>> '\uD83D\uDCA9'
'💩' // U+1F4A9 PILE OF POO
In this case, each escape represents the code point of one surrogate half. Two surrogate halves together form a single astral symbol.
Note that the code points of the surrogate halves look nothing like the original code point. There are formulas to calculate the surrogates for a given astral code point, and vice versa, to calculate the original astral code point from its surrogate pair.
In UTF-16, a code point in a supplementary plane is encoded as two 16-bit code units (32 bits, 4 bytes) called a surrogate pair:
- Subtract 0x10000 from the code point. The result is a 20-bit value in the range 0..0xFFFFF.
- Add the highest 10 bits (a value in the range 0..0x3FF) to 0xD800 to get the first code unit, the high surrogate, in the range 0xD800..0xDBFF.
- Add the lowest 10 bits (also a value in the range 0..0x3FF) to 0xDC00 to get the second code unit, the low surrogate, in the range 0xDC00..0xDFFF.
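The three steps above, and their inverse, can be sketched as follows. The function names toSurrogatePair and fromSurrogatePair are invented for this example:

```javascript
// Split an astral code point into its UTF-16 surrogate halves.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Not an astral code point: ' + codePoint);
  }
  const offset = codePoint - 0x10000;     // 20-bit value, 0..0xFFFFF
  const high = 0xD800 + (offset >> 10);   // highest 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // lowest 10 bits
  return [high, low];
}

// Recombine a high and a low surrogate into the astral code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1F4A9);
console.log(high.toString(16), low.toString(16));      // d83d dca9
console.log(fromSurrogatePair(high, low).toString(16)); // 1f4a9
```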
With surrogate pairs, every code point in the supplementary planes (U+010000 to U+10FFFF) can be represented, but the whole idea of using one escape for a BMP code point and two escapes for an astral code point is confusing and has many annoying consequences.
Use the JavaScript string method to calculate the character length
For example, suppose you want to count the number of characters in a given string. What would you do?
The first thing that comes to mind is probably the length property.
>> 'A'.length // U+0041 LATIN CAPITAL LETTER A
1
>> 'A' == '\u0041'
true
>> 'B'.length // U+0042 LATIN CAPITAL LETTER B
1
>> 'B' == '\u0042'
true
In these cases, the length property of the string matches the number of characters exactly. This makes sense: if we write the characters as escape sequences, each one obviously needs just a single escape. But that’s not always the case! Here’s a slightly different example:
>> '𝐀'.length // U+1D400 MATHEMATICAL BOLD CAPITAL A
2
>> '𝐀' == '\uD835\uDC00'
true
>> '𝐁'.length // U+1D401 MATHEMATICAL BOLD CAPITAL B
2
>> '𝐁' == '\uD835\uDC01'
true
>> '💩'.length // U+1F4A9 PILE OF POO
2
>> '💩' == '\uD83D\uDCA9'
true
Internally, JavaScript represents astral symbols as surrogate pairs, and it exposes the individual surrogate halves as separate “characters”. If you write the symbols using only ECMAScript 5-compatible escape sequences, you will see that every astral symbol needs two escapes. This is confusing, because people generally think in terms of Unicode symbols or graphemes rather than code units.
Accounting for astral symbols
Back to the question: how do you accurately count the number of symbols in a JavaScript string? The trick is to recognize surrogate pairs and count each pair as a single character. You could use something like this:
var regexAstralSymbols = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
function countSymbols(string) {
  return string
    // Replace every surrogate pair with a BMP symbol...
    .replace(regexAstralSymbols, '_')
    // ...and *then* get the length.
    .length;
}
Or, if you use Punycode.js, take advantage of its utility methods for converting between JavaScript strings and Unicode code points. The punycode.ucs2.decode method takes a string and returns an array of Unicode code points, one item per symbol.
function countSymbols(string) {
return punycode.ucs2.decode(string).length;
}
In ES6, you can do something similar with Array.from, which uses the string’s iterator to split it into an array of strings that each contain a single symbol:
function countSymbols(string) {
return Array.from(string).length;
}
Or, using the spread operator (...):
function countSymbols(string) {
return [...string].length;
}
With any of these implementations we count code points rather than code units, which gives more accurate results:
>> countSymbols('A') // Code point: U+0041 represents A
1
>> countSymbols('𝐀') // U+1D400 MATHEMATICAL BOLD CAPITAL A
1
>> countSymbols('💩') // U+1F4A9 PILE OF POO
1
Accounting for lookalikes
Consider this example:
>> 'mañana' == 'mañana'
false
JavaScript tells us that these strings are different, but visually, there’s no way to tell! What’s going on here?
Escaping the strings with a JavaScript escaper tool reveals why:
>> 'ma\xF1ana' == 'man\u0303ana'
false
>> 'ma\xF1ana'.length
6
>> 'man\u0303ana'.length
7
The first string contains U+00F1 LATIN SMALL LETTER N WITH TILDE, while the second string uses two separate code points (U+006E LATIN SMALL LETTER N and U+0303 COMBINING TILDE) to create the same glyph. That explains why the strings are not equal and why their lengths differ.
However, if we want to count the number of characters in these strings the way we’re used to, we want both strings to be 6 in length, because that’s the number of visually distinguishable characters in each string. How do you do that?
In ECMAScript 6, the solution is fairly simple:
function countSymbolsPedantically(string) {
// Unicode Normalization, NFC form, to account for lookalikes:
var normalized = string.normalize('NFC');
// Account for astral symbols / surrogates, just like we did before:
return punycode.ucs2.decode(normalized).length;
}
The normalize method on String.prototype performs Unicode normalization, which accounts for these differences. If a single code point represents the same glyph as another code point followed by a combining mark, NFC normalizes it to the single code point form.
>> countSymbolsPedantically('mañana') // U+00F1
6
>> countSymbolsPedantically('mañana') // U+006E + U+0303
6
For backward compatibility with ECMAScript 5 and older environments, you can use a String.prototype.normalize polyfill.
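To see normalization at work on the two spellings of mañana, here is a small sketch. The helper name equalAfterNormalization is hypothetical, and choosing NFC for the comparison is an assumption; NFD would work equally well as long as both sides use the same form:

```javascript
// Compare two strings after bringing both into the same normalization form.
function equalAfterNormalization(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

const precomposed = 'ma\xF1ana';   // ñ as the single code point U+00F1
const decomposed = 'man\u0303ana'; // n (U+006E) followed by U+0303

console.log(precomposed === decomposed);                      // false
console.log(equalAfterNormalization(precomposed, decomposed)); // true
console.log(decomposed.normalize('NFC').length);              // 6
console.log(precomposed.normalize('NFD').length);             // 7
```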
Accounting for other combining marks
However, the solution above is still not perfect. Code points with several combining marks applied to them always render as a single visual glyph, but they may not have a precomposed form, in which case normalization doesn’t help. For example:
>> 'q\u0307\u0323'.normalize('NFC') // `q̣̇`
'q\u0323\u0307' // reordered into canonical order, but still three code points
>> countSymbolsPedantically('q\u0307\u0323')
3 // not 1
>> countSymbolsPedantically('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞')
74 // not 6
If you need a more accurate solution, you can use a regular expression to remove all combining marks from the input string.
// Note: replace the following regular expression with its transpiled equivalent to make it work in older environments.
var regexSymbolWithCombiningMarks = /(\P{Mark})(\p{Mark}+)/gu;
function countSymbolsIgnoringCombiningMarks(string) {
// Remove any combined characters, leaving only the characters they belong to:
var stripped = string.replace(regexSymbolWithCombiningMarks, function($0, symbol, combiningMarks) {
return symbol;
});
return punycode.ucs2.decode(stripped).length;
}
This function strips all combining marks, leaving only the symbols they belong to. Unmatched combining marks (at the start of the string) are left intact. Once the regular expression is transpiled, this solution works even in ECMAScript 3 environments, and it gives the most accurate results so far:
>> countSymbolsIgnoringCombiningMarks('q\u0307\u0323')
1
>> countSymbolsIgnoringCombiningMarks('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞')
6
Accounting for other types of grapheme clusters
The algorithm above is still a simplification. It does not count grapheme clusters such as these correctly: நி, Hangul syllables made of conjoining Jamo such as 깍, emoji sequences such as 👨‍👩‍👧 (👨 + U+200D ZERO WIDTH JOINER + 👩 + U+200D ZERO WIDTH JOINER + 👧), or other similar symbols.
Unicode Standard Annex #29 on Unicode text segmentation describes an algorithm for determining grapheme cluster boundaries. For a completely accurate solution that works for all Unicode scripts, implement this algorithm in JavaScript and count each grapheme cluster as a single symbol. There is a proposal to add Intl.Segmenter, a text segmentation API, to ECMAScript.
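Where Intl.Segmenter is available, counting grapheme clusters might look like the sketch below. It assumes a runtime that ships the API with grapheme support (recent browsers and Node.js releases do); the function name countGraphemes is made up for this example:

```javascript
// Count user-perceived characters (grapheme clusters) in a string.
// Assumes the runtime provides Intl.Segmenter.
function countGraphemes(string) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // segment() returns an iterable with one entry per grapheme cluster.
  return [...segmenter.segment(string)].length;
}

console.log(countGraphemes('man\u0303ana'));   // 6
console.log(countGraphemes('q\u0307\u0323'));  // 1
// Man + ZWJ + woman + ZWJ + girl, written with escapes to keep the source ASCII.
console.log(countGraphemes('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}')); // 1
```

The exact boundaries depend on the Unicode data bundled with the runtime, so results for very new emoji sequences may vary between environments.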
Reversing strings in JavaScript
Here is an example of a similar problem: reversing a string in JavaScript. How hard can it be, right? A common and very simple solution is:
function reverse(string) {
  return string.split('').reverse().join('');
}
It seems to work in a lot of situations:
>> reverse('abc')
'cba'
>> reverse('mañana') // U+00F1
'anañam'
However, it completely messes up strings that contain combining marks or astral symbols.
>> reverse('mañana') // U+006E + U+0303
'anãnam' // note: the `~` is now applied to the `a` instead of the `n`
>> reverse('💩') // U+1F4A9
'� �' // '\uDCA9\uD83D', the surrogate pair for '💩' in the wrong order
To reverse astral symbols correctly in ES6, you can combine the string iterator with Array.from:
function reverse(string) {
  return Array.from(string).reverse().join('');
}
However, this still does not solve the problem with combining marks.
Luckily, a brilliant computer scientist named Missy Elliott came up with a bulletproof algorithm that accounts for these issues. It goes something like this:
I put my thang down, flip it, and reverse it.
Indeed: by swapping the position of any combining marks with the symbol they belong to, and by reversing any surrogate pairs, before processing the string further, you can avoid these problems.
// Use the library Esrever (https://mths.be/esrever)
>> esrever.reverse('mañana') // U+006E + U+0303
'anañam'
>> esrever.reverse('💩') // U+1F4A9
'💩' // U+1F4A9
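For illustration, the swap-then-reverse idea can be approximated in a few lines. This is a simplified sketch and not Esrever's actual implementation: the function name reverseSymbols is invented, it relies on Unicode property escapes (ES2018), and it reverses by code point rather than swapping surrogate halves explicitly:

```javascript
// Matches a non-mark symbol followed by one or more combining marks.
const regexSymbolWithMarks = /(\P{Mark})(\p{Mark}+)/gu;

function reverseSymbols(string) {
  // Step 1: move each run of combining marks in front of its base symbol,
  // reversing the run so the marks end up in their original order later.
  const swapped = string.replace(regexSymbolWithMarks, (match, base, marks) =>
    [...marks].reverse().join('') + base
  );
  // Step 2: reverse by code point, which keeps surrogate pairs intact.
  return [...swapped].reverse().join('');
}

console.log(reverseSymbols('man\u0303ana') === 'anan\u0303am'); // true
console.log(reverseSymbols('a\u{1F4A9}b') === 'b\u{1F4A9}a');   // true
```

Like the combining-mark counter earlier, this does not handle every grapheme cluster (emoji ZWJ sequences, for example, would still be taken apart).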
Unicode problems in string methods
This behavior affects other string methods as well.
Turning code points into symbols
String.fromCharCode turns a code point into a character, but it only works correctly for code points in the BMP range (U+0000 to U+FFFF). If you pass it an astral code point, you get an unexpected result.
>> String.fromCharCode(0x0041) // U+0041
'A' // U+0041
>> String.fromCharCode(0x1F4A9) // U+1F4A9
'' // U+F4A9, not U+1F4A9
The only workaround is to calculate the code points of the surrogate halves yourself and pass them as separate arguments.
>> String.fromCharCode(0xD83D, 0xDCA9)
'💩' // U+1F4A9
If you don’t want to calculate the surrogate halves yourself, you can use the Punycode.js utility methods:
>> punycode.ucs2.encode([ 0x1F4A9 ])
'💩' // U+1F4A9
Fortunately, ECMAScript 6 introduces String.fromCodePoint(codePoint), which handles astral symbols correctly. It can be used for any Unicode code point, from U+000000 to U+10FFFF.
>> String.fromCodePoint(0x1F4A9)
'💩' // U+1F4A9
For backward compatibility with ECMAScript 5 and older environments, use a String.fromCodePoint() polyfill.
Getting a symbol out of a string
If you use String.prototype.charAt(position) to retrieve the first symbol of a string containing the pile of poo character, you get only the first surrogate half and not the whole symbol.
>> '💩'.charAt(0) // U+1F4A9
'\uD83D' // U+D83D, i.e. the first surrogate half for U+1F4A9
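There was a proposal to introduce String.prototype.at(position) in ECMAScript 7. It would behave like charAt, except that it deals with whole symbols instead of surrogate halves whenever possible. That proposal was not standardized in this form: the at() method that later shipped in ES2022 indexes UTF-16 code units, so the output below reflects only the proposal (or its polyfill).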
>> '💩'.at(0) // U+1F4A9
'💩' // U+1F4A9
For backward compatibility with ECMAScript 5 and older environments, use String.prototype.at() polyfill/prollyfill.
Getting a code point out of a string
Similarly, if you use String.prototype.charCodeAt(position) to retrieve the code point of the first symbol in the string, you get the code point of the first surrogate half rather than the code point of the pile of poo character.
>> '💩'.charCodeAt(0)
0xD83D
Fortunately, ECMAScript 6 introduces String.prototype.codePointAt(position), which is like charCodeAt except that it deals with whole symbols instead of surrogate halves whenever possible.
>> '💩'.codePointAt(0)
0x1F4A9
For backward compatibility with ECMAScript 5 and older environments, use a String.prototype.codePointAt() polyfill.
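Combined with the string iterator, codePointAt gives a compact way to list every code point in a string, and String.fromCodePoint turns the list back into a string. The helper name toCodePoints is hypothetical:

```javascript
// Array.from iterates the string by symbol, so each `symbol` is either a
// single BMP character or a complete surrogate pair.
function toCodePoints(string) {
  return Array.from(string, (symbol) => symbol.codePointAt(0));
}

const codePoints = toCodePoints('A\u{1F4A9}');
console.log(codePoints.map((cp) => cp.toString(16))); // [ '41', '1f4a9' ]

// Round trip back to the original string.
console.log(String.fromCodePoint(...codePoints) === 'A\u{1F4A9}'); // true
```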
Iterate over all characters in a string
Imagine looping over each character in a string and performing some operations on each individual character.
In ECMAScript 5, you have to write a lot of boilerplate code to account for surrogate pairs:
function getSymbols(string) {
  var index = 0;
  var length = string.length;
  var output = [];
  for (; index < length; ++index) {
    var charCode = string.charCodeAt(index);
    // A high surrogate followed by a low surrogate forms one symbol.
    if (charCode >= 0xD800 && charCode <= 0xDBFF) {
      // At the end of the string charCodeAt returns NaN, which fails the check below.
      charCode = string.charCodeAt(index + 1);
      if (charCode >= 0xDC00 && charCode <= 0xDFFF) {
        output.push(string.slice(index, index + 2));
        ++index;
        continue;
      }
    }
    // A BMP symbol or a lone surrogate.
    output.push(string.charAt(index));
  }
  return output;
}
var symbols = getSymbols('💩');
symbols.forEach(function(symbol) {
console.log(symbol == '💩');
});
Or you can use a regular expression such as var regexCodePoint = /[^\uD800-\uDFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g; and iterate over its matches.
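A minimal sketch of the regular expression approach, using String.prototype.match to collect the matches. The function name getSymbolsWithRegex is made up for this example:

```javascript
// Alternatives, tried left to right: a non-surrogate code unit,
// a complete surrogate pair, or a lone surrogate.
var regexCodePoint = /[^\uD800-\uDFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g;

function getSymbolsWithRegex(string) {
  // With the g flag, match() returns every match, or null if there are none.
  return string.match(regexCodePoint) || [];
}

var symbols = getSymbolsWithRegex('a\uD83D\uDCA9b');
console.log(symbols.length);                  // 3
console.log(symbols[1] === '\uD83D\uDCA9');   // true
```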
In ECMAScript 6, you can simply use for...of. The string iterator deals with whole symbols instead of surrogate halves.
for (const symbol of '💩') {
console.log(symbol == '💩');
}
Unfortunately, there is no way to polyfill this, because for...of is a grammar-level construct.
Other problems
This behavior affects almost all string methods, including those not explicitly mentioned here (such as String.prototype.substring, String.prototype.slice, and so on), so be careful when using them.
Unicode problems in regular expressions
Matching code points and Unicode scalar values
The dot operator (.) in regular expressions matches only a single “character”, but because JavaScript exposes surrogate halves as separate “characters”, it never matches an astral symbol.
>> /foo.bar/.test('foo💩bar')
false
Let’s think about this for a moment. What regular expression could we use to match any Unicode symbol? Any ideas? As demonstrated, the dot operator is not enough, because it does not match line breaks or whole astral symbols.
>> /^.$/.test('💩')
false
To match line breaks as well, we can use [\s\S] instead, but that still does not match whole astral symbols.
>> /^[\s\S]$/.test('💩')
false
As it turns out, a regular expression that matches any Unicode code point is anything but simple:
>> /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/.test('💩') // wtf
true
Of course, you wouldn’t want to write regular expressions like this by hand, let alone debug them. The one above was generated with a library called Regenerate, which makes it easy to create regular expressions from lists of code points or symbols:
>> regenerate().addRange(0x0, 0x10FFFF).toString()
'[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]'
From left to right, this regular expression matches BMP symbols, surrogate pairs, or lone surrogates.
Although lone surrogates are technically allowed in JavaScript strings, they do not map to any symbol by themselves and should be avoided. The term Unicode scalar value refers to all code points except surrogate code points. Here is a regular expression that matches any Unicode scalar value:
>> regenerate()
  .addRange(0x0, 0x10FFFF) // all Unicode code points
  .removeRange(0xD800, 0xDBFF) // minus high surrogates
  .removeRange(0xDC00, 0xDFFF) // minus low surrogates
  .toRegExp()
/[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/
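Since lone surrogates are best avoided, it can be handy to detect them. The sketch below reuses the two lone-surrogate alternatives from the generated pattern; the names regexLoneSurrogate and hasLoneSurrogate are invented for this illustration:

```javascript
// A high surrogate not followed by a low surrogate, or a low surrogate
// not preceded by a high surrogate.
var regexLoneSurrogate = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/;

// True when the string contains at least one unpaired surrogate, which
// means it is not a sequence of Unicode scalar values.
function hasLoneSurrogate(string) {
  return regexLoneSurrogate.test(string);
}

console.log(hasLoneSurrogate('\uD83D\uDCA9')); // false (a proper pair)
console.log(hasLoneSurrogate('\uD83D'));       // true  (high half only)
console.log(hasLoneSurrogate('\uDCA9\uD83D')); // true  (halves in the wrong order)
```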
Regenerate is meant to be used as part of a build script, to create complex regular expressions while keeping the script that generates them readable and easy to maintain.
ECMAScript 6 introduces a u flag for regular expressions that makes the . operator match whole code points instead of surrogate halves.
>> /foo.bar/.test('foo💩bar')
false
>> /foo.bar/u.test('foo💩bar')
true
Note that the . operator still does not match line breaks. When the u flag is set, the . operator is equivalent to the following backward-compatible regular expression pattern:
>> regenerate()
  .addRange(0x0, 0x10FFFF) // all Unicode code points
  .remove( // minus `LineTerminator`s (https://ecma-international.org/ecma-262/5.1/#sec-7.3):
    0x000A, // Line Feed <LF>
    0x000D, // Carriage Return <CR>
    0x2028, // Line Separator <LS>
    0x2029 // Paragraph Separator <PS>
  )
  .toString();
'[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]'
>> /foo(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])bar/.test('foo💩bar')
true
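If you do want a pattern that matches any single code point including line breaks, two options come to mind. The first combines [\s\S] with the u flag; the second relies on the s (dotAll) flag, which is an ES2018 addition and therefore goes beyond what this article covers. The variable names are made up for this sketch:

```javascript
// ES6: [\s\S] matches anything, and the u flag makes it work per code point.
const anyCodePoint = /^[\s\S]$/u;
// ES2018: the s (dotAll) flag lets . match line terminators as well.
const anyCodePointDotAll = /^.$/su;

console.log(anyCodePoint.test('\u{1F4A9}'));       // true
console.log(anyCodePoint.test('\n'));              // true
console.log(/^.$/u.test('\n'));                    // false
console.log(anyCodePointDotAll.test('\u2028'));    // true
console.log(anyCodePointDotAll.test('\u{1F4A9}')); // true
```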
Astral ranges in character classes
Considering that /[a-c]/ matches any symbol from U+0061 LATIN SMALL LETTER A to U+0063 LATIN SMALL LETTER C, it might seem that /[💩-💫]/ would match any symbol from U+1F4A9 to U+1F4AB. This is not the case, however:
>> /[💩-💫]/
SyntaxError: Invalid regular expression: Range out of order in character class
This happens because that regular expression is equivalent to:
>> /[\uD83D\uDCA9-\uD83D\uDCAB]/
SyntaxError: Invalid regular expression: Range out of order in character class
Instead of matching U+1F4A9 through U+1F4AB as we intended, the regular expression is interpreted as matching:
- U+D83D (a high surrogate)
- the range from U+DCA9 to U+D83D (which is invalid, because its starting code point is greater than its end)
- U+DCAB (a low surrogate)
With the ECMAScript 6 u flag, the same character classes work on whole code points and behave as expected:
>> /[\uD83D\uDCA9-\uD83D\uDCAB]/u.test('\uD83D\uDCA9') // match U+1F4A9
true
>> /[\u{1F4A9}-\u{1F4AB}]/u.test('\u{1F4A9}') // match U+1F4A9
true
>> /[💩-💫]/u.test('💩') // match U+1F4A9
true
>> /[\uD83D\uDCA9-\uD83D\uDCAB]/u.test('\uD83D\uDCAA') // match U+1F4AA
true
>> /[\u{1F4A9}-\u{1F4AB}]/u.test('\u{1F4AA}') // match U+1F4AA
true
>> /[💩-💫]/u.test('💪') // match U+1F4AA
true
>> /[\uD83D\uDCA9-\uD83D\uDCAB]/u.test('\uD83D\uDCAB') // match U+1F4AB
true
>> /[\u{1F4A9}-\u{1F4AB}]/u.test('\u{1F4AB}') // match U+1F4AB
true
>> /[💩-💫]/u.test('💫') // match U+1F4AB
true
Unfortunately, this solution is not backward compatible with ECMAScript 5 and older environments. If that is a concern, use Regenerate to generate ES5-compatible regular expressions that deal with astral ranges:
>> regenerate().addRange('💩', '💫')
'\uD83D[\uDCA9-\uDCAB]'
>> /^\uD83D[\uDCA9-\uDCAB]$/.test('💩') // match U+1F4A9
true
>> /^\uD83D[\uDCA9-\uDCAB]$/.test('💪') // match U+1F4AA
true
>> /^\uD83D[\uDCA9-\uDCAB]$/.test('💫') // match U+1F4AB
true
Bugs in the field and how to avoid them
This behavior leads to many issues. For example, Twitter allows 140 characters per tweet, and its back end doesn’t mind what kind of symbol it is, astral or not. But because the JavaScript counter on its website at one point simply read out the string’s length without accounting for surrogate pairs, it was impossible to enter more than 70 astral symbols. (This bug has since been fixed.)
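The arithmetic behind the 70-symbol limit is easy to reproduce. The snippet below is only an illustration of the two ways of counting, not Twitter's actual code, and the function names are hypothetical:

```javascript
// Counts UTF-16 code units, which is what the length property reports.
function countCodeUnits(text) {
  return text.length;
}

// Counts code points by iterating the string symbol by symbol.
function countCodePoints(text) {
  return [...text].length;
}

const tweet = '\u{1F4A9}'.repeat(70); // 70 astral symbols
console.log(countCodeUnits(tweet));   // 140, so a length-based counter is already "full"
console.log(countCodePoints(tweet));  // 70
```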
Many JavaScript libraries that deal with strings fail to account for astral symbols properly.
For example, Countable.js does not count astral symbols correctly.
Underscore.string has a reverse implementation that does not handle combining marks or astral symbols. (Use Missy Elliott’s algorithm instead.)
It also incorrectly decodes HTML numeric entities for astral symbols, such as &#x1F4A9;. Lots of other HTML entity conversion libraries have similar problems. (Until these bugs are fixed, consider using he for all your HTML encoding and decoding needs.)
The original:
Firebase.google.com/docs/cloud-…