Strings are all too familiar. The classic first task when learning to program is to output a string: Hello, world. But do you know how JavaScript strings are actually represented on a computer?

The simplest and most intuitive, though least accurate, way to think about a string is as a sequence of English letters, digits, and punctuation marks. For example, the following string consists of five letters and an exclamation mark:

const message = 'Hello!';

You can also see that the string is 6 characters long:

const message = 'Hello!';
message.length; // => 6

This works fine as long as the string is made up of those visible characters (i.e. printable ASCII characters). However, when you run into less common symbols, such as the emoji 😀, 😁, 😈, you may get unexpected results:

const smile = '😀';
smile.length; // => 2

Isn't that weird? How can the length be two when there is only one character? This is because a JavaScript string is actually made up of code units, not a sequence of visible characters.

The ECMA-262 specification describes JavaScript strings as follows:

A String is an ordered sequence of zero or more 16-bit unsigned integer values. The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.

Simply put, a JavaScript string is a sequence of UTF-16 code units: under the hood it is just a string of numbers.

A code unit is a number between 0x0000 and 0xFFFF, and each code unit maps to a character. For example, the code unit 0x0048 corresponds to the actual character H:

const letter = '\u0048';
letter === 'H'; // => true
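You can also read the code units back out as numbers with the standard charCodeAt method; a quick sketch (formatting the result as hex with toString(16) is only for illustration):

const letter = 'H';
letter.charCodeAt(0); // => 72, which is 0x48 in hexadecimal
letter.charCodeAt(0).toString(16); // => '48'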

Written entirely in code units, the string 'Hello!' looks like this:

const message = '\u0048\u0065\u006C\u006C\u006F\u0021';
message === 'Hello!'; // => true
message.length; // => 6

As you can see, the string has six code units, one for each character. Any character in the Basic Multilingual Plane (BMP) can be represented by a single UTF-16 code unit. Characters outside this range, however, require two UTF-16 code units. For example, the smiley face from earlier is encoded as \uD83D\uDE00:

const smile = '\uD83D\uDE00';
smile === '😀'; // => true
smile.length; // => 2
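The two code units \uD83D and \uDE00 jointly encode a single code point, U+1F600. A small sketch with the standard codePointAt and String.fromCodePoint methods makes the relationship visible:

const smile = '\uD83D\uDE00';
smile.codePointAt(0); // => 128512, i.e. 0x1F600, the whole emoji
smile.codePointAt(1); // => 56832, i.e. 0xDE00, just the second code unit
String.fromCodePoint(0x1F600); // => '😀'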

These two code units always appear as a pair (a surrogate pair) and together represent a character whose code point lies beyond 0xFFFF. They cannot be split apart, otherwise you end up with unreadable garbage. Notice, too, that .length is 2: this property actually counts code units, not characters. So if you need the number of characters, keep in mind that .length is not accurate. What is the solution, then? You can do it this way:

const message = 'Hello!';
const smile = '😀';
[...message].length; // => 6
[...smile].length; // => 1
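The spread operator works here because it relies on the string iterator, which walks the string code point by code point rather than code unit by code unit. The same iterator drives for...of and Array.from, so they give the same character-level view; a short sketch:

const smile = '😀';
for (const ch of smile) {
  console.log(ch); // runs only once and logs '😀'
}
Array.from(smile).length; // => 1, same as [...smile].length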
