Introduction: Where do character sets come from? Why do character encodings matter? How does garbled text appear? With these questions in mind, let's listen to the story of characters.
A few days ago a bug came up in testing: "in a text area with a length limit, entering emoji produces garbled characters". I couldn't help asking myself a few questions: what are these emoji? Why do they get garbled? Which encoding does JS use? After consulting the relevant literature I finally found the answers, so today I'll tell the story of character encoding in detail.
1. Bits, characters, and bytes
Before we talk about encoding, a few basic concepts need to be clarified:
- Bit: the smallest unit of storage in a computer, with a value of 0 or 1; it is also the basic unit of network data transmission.
- Byte: eight bits make up one byte.
- Character: an abstract entity that can be represented by a number of different character schemes or code pages.
We know that inside a computer, all information is ultimately represented as a binary sequence. Each bit has two states, 0 and 1, so eight bits can represent 256 states, and eight bits form a byte. In other words, a single byte can represent 256 different symbols. If we build a lookup table in which each 8-bit binary sequence maps to a unique symbol, then each state corresponds to one symbol, giving 256 symbols from 00000000 to 11111111.
2. The story of the East
In the 1960s, the United States developed a character code that fixed the relationship between English characters and binary bit patterns. It is called ASCII and is still in use today. Later, computers arrived in China, but ASCII could not express Chinese characters. So the hard-working and intelligent Chinese combined ASCII's digits, punctuation marks and letters with numeric symbols, Roman and Greek letters, and more than 7,000 simplified Chinese characters: this became GB2312, a Chinese extension of ASCII. Later it turned out that Chinese culture is so extensive and profound, with so many Chinese characters, that the original code was not enough, so the scheme was extended again into GBK. GBK includes everything in GB2312 and adds about 20,000 new Chinese characters (including traditional characters) and symbols. Then ethnic minorities started using computers, and the code had to be extended once more: GB18030 was born. From then on, the splendid culture of Chinese characters could be carried on inside the computer.
3. A unified character set
3.1 Rule makers
The story above comes from the East, but many countries around the world had their own codes, and every country's code was different. One can't help but think: "why not create a universal character set?"
There were two groups doing this in the 1980s and 1990s:
- International Organization for Standardization (ISO):
  ISO is an independent non-governmental organization. It is the world's largest voluntary developer of international standards and promotes world trade by providing common standards between countries. ISO set out to develop the Universal Coded Character Set (UCS).
- Unicode Consortium:
  Its main purpose is to maintain and publish the Unicode standard, which was developed to replace existing character encoding schemes that were limited in size and scope and incompatible with multilingual environments. Unicode's success as a unified character set has led to its widespread use in internationalized and localized software.
The participants in both organizations realized that the world did not need two incompatible character sets, so they began to collaborate. Starting with Unicode 2.0, Unicode uses the same character repertoire and code points as ISO 10646-1; ISO in turn promises that ISO 10646 will not assign values beyond U+10FFFF to UCS-4 code points (Unicode code points are written with the U+ prefix), so that the two stay consistent. The two projects remain independent and publish their standards separately, but Unicode is more widely used because the name is easier to remember.
3.2 Code points and planes
The character set assigns each symbol a number, called a code point, starting at 0. The latest Unicode version (at the time of writing) contains 109,449 symbols. These symbols are partitioned into regions called planes: there are 17 planes in total, and each plane can hold 65,536 (2^16) characters, so the whole code space can hold 17 × 65,536 = 1,114,112 code points.
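As a quick illustration (a minimal sketch using standard JavaScript string methods; the helper name planeOf is mine), the plane a character belongs to can be computed directly from its code point:

```js
// Which plane does a character live in? plane = floor(codePoint / 0x10000).
function planeOf(char) {
  const codePoint = char.codePointAt(0); // full Unicode code point of the first character
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf('A'));  // 0 -> basic plane (BMP)
console.log(planeOf('好')); // 0 -> also the BMP
console.log(planeOf('😂')); // 1 -> a supplementary plane
```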
To sum up, Unicode defines a set of characters; how each character is actually represented is the job of an encoding method. We have all heard of encodings such as UTF-32, UTF-16, UTF-8 and UCS-2. What are they, and how are they related?
4. UTF-32 and UCS-4
The UCS-4 encoding was created before Unicode merged with UCS. UCS-4 uses 32 bits (4 bytes) for every code point, and its code space ranges from 0 to 0x7FFFFFFF. In practice, however, no code point above 0x10FFFF is used, and UTF-32 was created to be compatible with the Unicode standard: its code values are the same as UCS-4's, but its code space is limited to 0 through 0x10FFFF, so UTF-32 can be considered a subset of UCS-4.
The problem with UCS-4/UTF-32 is obvious from the above: because every character takes four bytes, the same English text occupies four times as much space as it does in ASCII. For this reason, the HTML5 standard explicitly forbids the UTF-32 encoding.
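For a rough feel of the waste (a sketch that assumes Node.js for Buffer; the UTF-32 size is simply computed as four bytes per code point):

```js
// Storage cost of plain English text: UTF-32 spends 4 bytes per code point,
// while UTF-8 spends 1 byte per ASCII character.
const text = 'hello world';
const utf8Bytes  = Buffer.byteLength(text, 'utf8'); // 11 bytes
const utf32Bytes = [...text].length * 4;            // 44 bytes, 4x the space
console.log(utf8Bytes, utf32Bytes);                 // 11 44
```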
5. UTF-16 and UCS-2
5.1 UTF-16
UTF-16 sits between UTF-32 and UTF-8 and combines the characteristics of fixed-length and variable-length encodings. As mentioned above, Unicode has 17 planes: one basic plane (the BMP) and 16 supplementary planes. UTF-16's rule is simple: characters in the basic plane take two bytes, and characters in the supplementary planes take four bytes. In other words, a UTF-16 encoding is either two bytes (U+0000 to U+FFFF) or four bytes (U+010000 to U+10FFFF).
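A quick way to see this rule from JavaScript (a sketch assuming Node.js, whose 'utf16le' encoding uses the same code units described here):

```js
// A basic-plane character occupies one 16-bit code unit (2 bytes); a
// supplementary-plane character occupies a surrogate pair (4 bytes).
console.log(Buffer.byteLength('好', 'utf16le')); // 2 -> U+597D, basic plane
console.log(Buffer.byteLength('😂', 'utf16le')); // 4 -> U+1F602, supplementary plane
```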
5.2 Supplementary-plane character representation
So the question is: given two bytes, how do we know whether they stand alone as one character or combine with the next two bytes to form one character? It turns out that in the basic plane, U+D800 to U+DFFF is an empty segment that does not correspond to any code point, and this empty segment is used to map the characters of the supplementary planes. How does the mapping work? The supplementary planes contain 16 × 2^16 = 2^20 characters, so 20 bits are needed to address them. Those 20 bits are split into the first 10 and the last 10: the first 10 bits are mapped into U+D800 to U+DBFF and called the high surrogate (H), and the last 10 bits are mapped into U+DC00 to U+DFFF and called the low surrogate (L). In other words, a supplementary-plane character is split into a pair of basic-plane code units.
5.3 Converting Unicode code points to UTF-16
How do we convert a Unicode code point to UTF-16?
Basic-plane characters: the code point itself is used directly as the 16-bit code unit, e.g. U+597D => 0x597D.
Supplementary-plane characters: subtract 0x10000 and split the result into a high and a low surrogate:
```js
H = Math.floor((c - 0x10000) / 0x400) + 0xD800; // high surrogate
L = (c - 0x10000) % 0x400 + 0xDC00;             // low surrogate
```
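Wrapping the two formulas in a small function (a sketch; the name codePointToSurrogatePair is mine) and checking it against an emoji:

```js
// Convert a supplementary-plane code point (U+10000..U+10FFFF) into its
// UTF-16 surrogate pair, using the two formulas above.
function codePointToSurrogatePair(c) {
  const offset = c - 0x10000;
  const H = Math.floor(offset / 0x400) + 0xD800; // high surrogate
  const L = (offset % 0x400) + 0xDC00;           // low surrogate
  return [H, L];
}

const [h, l] = codePointToSurrogatePair(0x1F602); // U+1F602 is 😂
console.log(h.toString(16), l.toString(16));      // d83d de02
console.log(String.fromCharCode(h, l));           // 😂
```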
5.4 UCS-2
What is UCS-2, and what does it have to do with UTF-16?
JavaScript uses the Unicode character set, but it supports only one encoding of it. The encoding JS first adopted was neither UTF-16, UTF-32 nor UTF-8, but UCS-2. UTF-16 is explicitly declared to be a superset of UCS-2: it encodes basic-plane characters exactly as UCS-2 does and adds a four-byte representation for supplementary-plane characters, so UCS-2 was later subsumed into UTF-16. Because JavaScript was born before UTF-16 existed (UCS-2 was published in 1990, UTF-16 in 1996), JS could only adopt what is now the obsolete UCS-2.
The problem is obvious: since JS only handles UCS-2, every character in the language is two bytes, and a four-byte character is treated as two separate two-byte characters.
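This is easy to observe in the console with nothing but standard string methods:

```js
const s = '😂';                              // U+1F602, supplementary plane
console.log(s.length);                       // 2 -> two 16-bit code units
console.log(s.charCodeAt(0).toString(16));   // d83d (high surrogate)
console.log(s.charCodeAt(1).toString(16));   // de02 (low surrogate)
console.log(s.codePointAt(0).toString(16));  // 1f602 (ES6 can read the full code point)
```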
6. UTF-8
If you followed the discussion of UTF-32 and UTF-16 above, you can see a problem: for English-speaking countries, every character could be encoded in a single byte, so either of the two encodings above is a huge waste of space and bandwidth. Thus UTF-8 was born!
UTF-8 encoding rules:
- For symbols in the ASCII range, single-byte encoding is used, with the same value as the ASCII code. ASCII values range from 0 to 0x7F, and the first bit of every such byte is 0 (this is exactly what distinguishes single-byte from multi-byte encodings).
- Other characters are encoded in multiple bytes (say N bytes, N > 1). The first N bits of the first byte are all 1, the (N+1)-th bit is 0, and each of the remaining N−1 bytes starts with the bits 10. All the remaining bits are used to store the character's Unicode code point.
| Bytes | Unicode code point range (hex) | UTF-8 encoding (binary) |
|---|---|---|
| 1 | 000000-00007F | 0xxxxxxx |
| 2 | 000080-0007FF | 110xxxxx 10xxxxxx |
| 3 | 000800-00FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 4 | 010000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
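To make the rules concrete, here is what a hand-rolled encoder could look like (a minimal sketch that only handles the four ranges in the table and ignores invalid input; the name encodeUtf8 is mine):

```js
// Encode a single Unicode code point into UTF-8 bytes, following the table above.
function encodeUtf8(cp) {
  if (cp <= 0x7F)   return [cp];                                   // 0xxxxxxx
  if (cp <= 0x7FF)  return [0xC0 | (cp >> 6),                      // 110xxxxx
                            0x80 | (cp & 0x3F)];                   // 10xxxxxx
  if (cp <= 0xFFFF) return [0xE0 | (cp >> 12),                     // 1110xxxx
                            0x80 | ((cp >> 6) & 0x3F),
                            0x80 | (cp & 0x3F)];
  return [0xF0 | (cp >> 18),                                       // 11110xxx
          0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F),
          0x80 | (cp & 0x3F)];
}

console.log(encodeUtf8(0x597D).map(b => b.toString(16)));  // ['e5', 'a5', 'bd'] -> 好
console.log(encodeUtf8(0x1F602).map(b => b.toString(16))); // ['f0', '9f', '98', '82'] -> 😂
```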
7. The initial questions
At this point, the questions from the beginning can be answered:
- What are these emoji?
  An emoji is a character whose code point lies in a supplementary plane, so under UCS-2/UTF-16 it needs four bytes, i.e. two code units. That is also why the length value of an emoji in JS is 2.
- Why is it garbled?
  In JS an emoji is represented by two code units. Say the text box limits the length to 15: if an emoji (two code units) starts at the 15th position, the input field cuts it in half when truncating, and the lone surrogate that remains cannot be displayed correctly.
- Which encoding does JS use?
  JS used UCS-2 in its early days and UTF-16 later; UTF-16 is a superset of UCS-2.
- How do I detect supplementary-plane characters (emoji, for example)?
  Based on the surrogate ranges described above, a regular-expression check looks like this: `var patt = /[\ud800-\udbff][\udc00-\udfff]/g;`
- Problem solution
  Instead of relying on the maxLength attribute of the native textarea or input element for length validation, validate in JS. When the input exceeds the specified length, check whether the code unit at the limit position and the one right after it form a surrogate pair, i.e. an emoji (using the regex above); if they do, truncate just before the emoji rather than through the middle of it. A minimal sketch is shown below.
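Below is a minimal sketch of that idea (the names safeTruncate and maxLen are mine; a real component would also need to handle paste and composition events):

```js
// A high surrogate followed by a low surrogate is one supplementary-plane
// character, e.g. an emoji (the same check as the regex above).
const SURROGATE_PAIR = /[\ud800-\udbff][\udc00-\udfff]/;

// Truncate a string to at most maxLen code units without splitting an emoji.
function safeTruncate(str, maxLen) {
  if (str.length <= maxLen) return str;
  // If the code units at positions maxLen-1 and maxLen form a surrogate pair,
  // cutting at maxLen would split it, so cut just before the pair instead.
  const boundary = str.slice(maxLen - 1, maxLen + 1);
  const cut = SURROGATE_PAIR.test(boundary) ? maxLen - 1 : maxLen;
  return str.slice(0, cut);
}

console.log(safeTruncate('0123456789012345', 15));  // '012345678901234' (plain cut)
console.log(safeTruncate('01234567890123😂', 15));  // '01234567890123'  (emoji dropped instead of split in half)
```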
8. Summary
- Unicode is a character set, not an encoding; UTF-8, UTF-16, UTF-32, UCS-2 and UCS-4 are encodings of it.
- Encodings fall into byte-length "factions": UCS-4/UTF-32 are fixed four-byte and UCS-2 is fixed two-byte, while UTF-8 and UTF-16 are variable-length.
- Comparing the three: UTF-8 is the most compact for ASCII-heavy text (1 to 4 bytes per character), UTF-16 uses two or four bytes per character, and UTF-32 always uses four.
9. Extended reading
[1] Wikipedia – Unicode: https://en.wikipedia.org/wiki/Unicode
[2] Wikipedia – International Organization for Standardization (ISO): https://en.wikipedia.org/wiki/International_Organization_for_Standardization
[3] Wikipedia – Unicode Consortium: https://en.wikipedia.org/wiki/Unicode_Consortium
[4] Wikipedia – Universal Coded Character Set (UCS): https://en.wikipedia.org/wiki/Universal_Coded_Character_Set
[5] Wikipedia – UCS-4: https://hu.wikipedia.org/wiki/UTF-32/UCS-4
[6] Wikipedia – UTF-32: https://en.wikipedia.org/wiki/UTF-32
[7] Wikipedia – UTF-16: https://en.wikipedia.org/wiki/UTF-16
[8] Wikipedia – UTF-8: https://en.wikipedia.org/wiki/UTF-8
[9] Unicode and JavaScript: http://www.ruanyifeng.com/blog/2014/12/unicode.html
[10] Handling emoji from mobile input methods on the mobile front end: https://blog.csdn.net/binjly/article/details/47321043
[11] A review of emoji and garbled-character fixes: https://blog.csdn.net/Mr_LXming/article/details/77967964