In day-to-day development I occasionally ran into problems with character encodings, Unicode, and emoji, and realized I had never fully grasped the basics. So after some reading and study, I have put together a few easy-to-understand articles to share.

There are several articles in the [Core Foundations] series; interested readers can check them out:

  • Core Foundations: Binary, Part 2: Bit Operations
  • Core Foundations: Binary, Part 1: 0.1 + 0.2 !== 0.3 and the IEEE-754 standard
  • Very hot, very hot, very hot
  • ASCII, Unicode, and UTF-8

I wonder if you have ever run into this kind of confusion: when validating the length of a form field, you find that different characters can report different sizes. For example, the character "𠮷" in the title has a length of 2 (note 📢: it is not the ordinary character 吉!).

'吉'.length
// -> 1

'𠮷'.length
// -> 2

'❤'.length
// -> 1

'💩'.length
// -> 2

To explain this problem, we need to start with UTF-16 encoding.

UTF-16

As you can see from the ECMAScript® 2015 specification, ECMAScript strings use UTF-16 encoding.

Fixed or variable length: UTF-16's minimum code unit is two bytes, and this is used even when the high byte is zero. Characters of the Basic Multilingual Plane (BMP), covering the range U+0000 to U+FFFF, need only two bytes, while characters of the supplementary planes, U+010000 to U+10FFFF, need four bytes.

In our last article, we covered the details of UTF-8 encoding and learned that UTF-8 encoding takes anywhere from 1 to 4 bytes, while UTF-16 requires 2 or 4 bytes. Let’s see how UTF-16 is encoded.

UTF-16 encoding logic

UTF-16 encoding is simple. For a given Unicode code point cp (a code point is the unique number assigned to a character in Unicode):

  1. If the code point is less than or equal to U+FFFF (that is, every character of the basic plane), no processing is needed; it is stored directly as a single code unit.
  2. Otherwise, it is split into two code units: ((cp - 65536) / 1024) + 0xD800 (the high surrogate, using integer division) and ((cp - 65536) % 1024) + 0xDC00 (the low surrogate); see the sketch below.

The Unicode standard guarantees that the values U+D800 through U+DFFF do not correspond to any character, so they can safely be used as these markers (the so-called surrogates).
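
As a quick sanity check, here is a minimal sketch of that formula in JavaScript (the helper name toUTF16Units is mine, purely for illustration):

// Convert a Unicode code point to its UTF-16 code unit(s).
function toUTF16Units(cp) {
  if (cp <= 0xFFFF) {
    // BMP character: the code point is used as-is.
    return [cp];
  }
  var offset = cp - 0x10000;                      // cp - 65536
  var high = Math.floor(offset / 1024) + 0xD800;  // high surrogate
  var low = (offset % 1024) + 0xDC00;             // low surrogate
  return [high, low];
}

toUTF16Units(0x1F4A9).map(u => u.toString(16))
// -> ['d83d', 'dca9']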

As a concrete example, the code point of the character A is U+0041, which can be represented directly by a single code unit.

'\u0041'
// -> 'A'

'A' === '\u0041'
// -> true

In JavaScript, \u introduces a Unicode escape sequence and is followed by four hexadecimal digits.

The code point of the character 💩 is U+1F4A9, a supplementary-plane character. Plugging it into the formula above 👆 gives the two code units 55357 and 56489, which are D83D and DCA9 in hexadecimal; together the two code units form a surrogate pair.

'\ud83d\udca9'
// -> '💩'

'💩' === '\ud83d\udca9'
// -> true

Because JavaScript strings are encoded as UTF-16, the surrogate pair \ud83d\udca9 is decoded correctly back to the code point U+1F4A9.
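
Decoding is simply the inverse of the encoding formula above; a minimal sketch (the helper name fromSurrogatePair is mine):

// Recover the code point from a high/low surrogate pair.
function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 1024 + (low - 0xDC00) + 0x10000;
}

fromSurrogatePair(0xD83D, 0xDCA9).toString(16)
// -> '1f4a9'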

You can also use the \u{} form, writing the code point directly inside the braces. The two notations look different, but they produce the same result.

'\u0041' === '\u{41}'
// -> true

'\ud83d\udca9' === '\u{1f4a9}'
// -> true

You can open the Dev Tool console panel and run the code to verify the results.

So why is the length judgment problematic?

To answer this question, we go back to the specification, which states:

Where ECMAScript operations interpret String values, each element is interpreted as a single UTF-16 code unit.

So a character like 💩 actually occupies two UTF-16 code units, i.e. two elements, which is why its length property is 2. (This is a legacy of JS originally using the UCS-2 encoding, under the assumption that 65536 characters would be enough for everything.)
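
You can see the two code units of '𠮷' directly in the console (charCodeAt is covered in more detail further below):

'𠮷'.length
// -> 2

'𠮷'.charCodeAt(0).toString(16)
// -> 'd842'

'𠮷'.charCodeAt(1).toString(16)
// -> 'dfb7'

'𠮷' === '\ud842\udfb7'
// -> true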

For an ordinary user this is completely baffling: why does the program claim the input is two characters long when only '𠮷' was typed?

In the async-validator package used by the antd Form, I found the following code:

const spRegexp = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

if (str) {
  val = value.replace(spRegexp, '_').length;
}

When a string length rule needs to be checked, it first replaces every supplementary-plane character (i.e. every surrogate pair) with an underscore, so that the computed length matches what is actually displayed!
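
A quick illustration of the effect (the sample string below is just an example of mine, not from async-validator):

const spRegexp = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

'𠮷野家💩'.length
// -> 6

'𠮷野家💩'.replace(spRegexp, '_').length
// -> 4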

ES6 support for Unicode

The problem with the length property ultimately comes from the original design of the JS language, which did not anticipate that there would be so many characters and assumed two bytes would always be enough. So it is not only length: some other common string operations also misbehave with Unicode.

The following sections describe some of the affected APIs and how to handle them correctly with ES6.

for vs for...of

For example, if you print a string character by character with a plain for loop, you traverse every "element" as JS understands it, so a supplementary-plane character is split into two "elements" and shows up as "garbled" output.

var str = '👻yo𠮷'
for (var i = 0; i < str.length; i ++) {
  console.log(str[i])
}

// -> �
// -> �
// -> y
// -> o
// -> �
// -> �

Using ES6's for...of syntax does not have this problem.

var str = '👻yo𠮷'
for (const char of str) {
  console.log(char)
}

// -> 👻
// -> y
// -> o
// -> 𠮷

Spread syntax

We mentioned above the regular-expression trick of substituting supplementary-plane characters in order to count characters correctly. The same effect can be achieved with spread syntax.

[...'💩'].length
// -> 1

The same problem applies to slice, split, substr, and so on.
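
As an example, a code-point-safe "take the first n characters" helper can be sketched with spread syntax (the name firstChars is mine, just for illustration):

function firstChars(str, n) {
  // Spread the string into an array of code points, then slice.
  return [...str].slice(0, n).join('');
}

'💩abc'.slice(0, 1)
// -> '\ud83d' (a lone high surrogate, displayed as �)

firstChars('💩abc', 1)
// -> '💩'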

The regular expression u flag

ES6 also added the u flag to regular expressions, which enables Unicode-aware matching by code point.

/^.$/.test('👻')
// -> false

/^.$/u.test('👻')
// -> true
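
The u flag also enables \u{} code point escapes inside regular expressions, for example:

/^\u{1f47b}$/u.test('👻')
// -> true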

charCodeAt/codePointAt

For strings, charCodeAt used to be the way to obtain a character's numeric value. It works for BMP characters, but for a supplementary-plane character charCodeAt returns only the value of the first code unit of its surrogate pair.

'羽'.charCodeAt(0)
// -> 32701
'羽'.codePointAt(0)
// -> 32701

'😸'.charCodeAt(0)
// -> 55357
'😸'.codePointAt(0)
// -> 128568

With codePointAt, characters are correctly identified and the correct code points are returned.
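
Combining for...of with codePointAt gives a simple way to list every code point in a string, supplementary-plane characters included:

for (const char of '👻𠮷') {
  console.log(char.codePointAt(0).toString(16))
}
// -> 1f47b
// -> 20bb7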

String.prototype.normalize()

Since JS treats a string as a sequence of 16-bit code units, equality is determined by comparing those sequences. As a result, two strings can look exactly the same and yet compare as unequal.

'cafe\u0301' === 'café'
// -> false

In the code above, the first café is composed of cafe followed by the combining acute accent \u0301, while the second café is composed of caf plus the precomposed character é. They look identical, but their code points differ, so JS considers them unequal.

'cafe\u0301'
// -> 'café'

'cafe\u0301'.length
// -> 5

'café'.length
// -> 4

To treat strings that differ in code points but are semantically (and visually) the same as equal, ES6 added the String.prototype.normalize() method.

'cafe\u0301'.normalize() === 'café'.normalize()
// -> true

'cafe\u0301'.normalize().length
// -> 4
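
By default normalize() produces the composed (NFC) form; the target form can also be passed explicitly:

'cafe\u0301'.normalize('NFC') === 'caf\u00e9'
// -> true

'caf\u00e9'.normalize('NFD') === 'cafe\u0301'
// -> true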

Conclusion

This article is essentially my study notes from recently revisiting character encodings. Given limited time and ability, there are bound to be inaccurate or even incorrect statements; please point them out if you spot any. ❤️

References

  • ES6 Strings (and Unicode, ❤) in Depth
  • Unicode and JavaScript in detail
  • JavaScript has a Unicode problem
  • What every JavaScript developer should know about Unicode
  • Unicode in JavaScript