Preface

Inside a computer, a picture, a video, or an article is ultimately represented as a mass of zeros and ones. When a computer reads a piece of text from the network, it receives signals on the network port that alternate between high and low levels according to specific rules, and it is these signals that represent the text. Some of us may have wondered: when WeChat on a mobile phone receives a text message from a friend, how are the "0"s and "1"s formed by these high and low levels recognized by the computer and then displayed on our screen? Character encodings play the major role here.

In addition, at the author's university there are few computer science courses that cover character encoding; professors in most courses assume students already know what it is. But when you open an assignment file a professor provided in VSCode, only to find that the Chinese comments display as gibberish, you have to learn some character encoding. At the same time, in the front-end field the author focuses on, Emoji are used more and more frequently in web pages, and there are compatibility issues with Emoji across systems that deserve attention, so knowing something about Unicode character encoding is becoming more and more important.

So, in roughly chronological order, this article presents the character encoding essentials that a programmer, especially a front-end programmer, should know in 2021. It covers topics like ASCII, GBK, UTF-8, and Emoji, as well as a few other interesting points worth mentioning. This article mainly refers to and quotes from Wikipedia and Emojipedia. If you have any questions about the author's interpretation, you can consult Wikipedia and Emojipedia in addition to posting comments.

Concepts

Before we begin our formal introduction to character encodings, we need to clear up some concepts about character encodings.

What is a character set

A Charset is a collection of characters. A Character is the smallest unit of semantic value in a piece of text. In Chinese, a character is a Chinese character, such as "我" ("I"); in English, a character is a letter, such as "A".

What is a code point

A Code point represents the position of a character in a character set. Simply put, it is a number that answers the question "which position does this character occupy in the character set?" What a character encoding essentially does is represent these code points in zeros and ones in an appropriate way, and thereby represent the characters. One might ask: isn't it enough to just convert the numbers to binary? This is true up to a point, and many encoding specifications use exactly this idea, which is why people often conflate character sets with encoding specifications for convenience. However, this approach breaks down in more complex situations, as discussed later.
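JavaScript exposes code points directly, which makes the idea easy to poke at. A quick sketch (the naive "just convert the number to binary" view described above):

```javascript
// The code point of "A" in Unicode, and its naive binary form.
const a = 'A'.codePointAt(0); // 65 — "A" is character number 65
const bin = a.toString(2);    // "1000001" — the code point written in binary
console.log(a, bin);
```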

What is character encoding

Character encoding refers to the set of rules used to associate a series of numbers represented by 0s and 1s with particular characters. For example, we can use the number 0 for the English letter "A" and the number 1 for the Chinese character "我", and stipulate that each number occupies one bit: binary 0 for the number 0 and binary 1 for the number 1. One bit then encodes one character, and when a computer receives 0001, it reads "AAA我".
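This toy one-bit code can be sketched in a few lines (purely illustrative, not a real encoding):

```javascript
// Toy code from the text: bit 0 ↔ "A", bit 1 ↔ "我".
const decodeBit = (bit) => (bit === 0 ? 'A' : '我');
const decode = (bits) => bits.map(decodeBit).join('');

console.log(decode([0, 0, 0, 1])); // "AAA我"
```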

The above is a toy character encoding. Of course, encoding only two characters is not really useful; we often need encodings that can cover tens of thousands of Chinese characters so that people can carry out daily communication. It should also be emphasized that a character set is not a character encoding; the two are related, but they are not the same concept. We will reiterate this point in the discussion below.

ASCII

Introduction

For programmers of all disciplines, ASCII was probably introduced in the first semester of freshman year. It is a character encoding introduced in the United States in the 1960s. It contains the upper- and lower-case forms of the 26 letters of the English alphabet, as well as some punctuation marks and special control codes, as shown in the figure below:

Image from Lookuptables.com

It uses seven bits to encode these characters; for example, binary 1100001 is decimal 97 and represents the lowercase letter "a". A computer using ASCII that reads a file whose contents are the bits "1101000 1100101 1101100 1101100 1101111" will decode them as "hello".
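The decoding step is mechanical enough to sketch: chop the stream into 7-bit groups and look each one up.

```javascript
// Decode the 35-bit ASCII stream from the text: 5 characters × 7 bits each.
const bits = '11010001100101110110011011001101111';
let out = '';
for (let i = 0; i < bits.length; i += 7) {
  // Each 7-bit group is one ASCII code point.
  out += String.fromCharCode(parseInt(bits.slice(i, i + 7), 2));
}
console.log(out); // "hello"
```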

After ASCII was introduced, it almost became the universal character encoding for computers at the time. However, ASCII could only encode English letters and punctuation marks, which was a big problem for non-English-speaking countries such as Germany. As computers developed, it became common to use eight bits (one byte) to store a character. Simple math shows that if you use eight bits to store the ASCII encoding of a character, one bit is actually idle, usually the highest bit, with a value of 0. If we make use of this bit, we can encode up to 128 additional characters while still using 8 bits.

Based on this idea, other countries chose to add new characters to ASCII to form their own national character codes. The ISO/IEC 8859 code set described below is an example.

Control characters

In the 1960s, the main output method of computers was to print words on paper through a teletypewriter; only with the rise of the monitor did output gradually move to the screen. In UNIX, "TTY" originally referred to a teletypewriter. Later, when monitors came into use, software simulated teletypewriter operations so that existing software written for teletypewriters kept working. The control characters described here are closely tied to the teletypewriter.

Control codes are characters defined in ASCII that have special uses. When the computer encountered these characters, it would move the teletypewriter's print head in a special way. Take the familiar newline character "\n": when the computer encounters it, it moves the print head down one line. It is combined with the carriage-return character "\r", which moves the print head to the beginning of the line, so "\r\n" performs a complete newline operation and the rest of the output is printed from the new line. In ASCII, "\r" and "\n" are abbreviated as CR and LF, respectively.

Using “\r” alone also works wonders, as in the following C code:

printf("Hello\rWorld");

This outputs "World" on the console; the earlier "Hello" is overwritten by the later "World" because of the carriage return. On a teleprinter we would get two words printed on top of each other, but on today's terminals, where the teleprinter is simulated in software, the original text is simply replaced.

ISO/IEC 8859

ISO/IEC 8859 is a series of character encoding specifications that European countries extended from ASCII. The series has more than ten parts, covering the various characters used across Europe. For example, ISO/IEC 8859-1 mainly contains the characters used by Western European countries (Germany, France, etc.), as shown in the following figure:

Image from Wikipedia

GB/T 2312-1980

Introduction

The ISO/IEC 8859 series mentioned above worked because the extra 128 characters were enough to hold the characters used in those European countries. However, for a country like China, where tens of thousands of Chinese characters are used to convey meaning, the ISO/IEC 8859 approach is not feasible. In 1980, China introduced the GB 2312 specification ("GB" is the pinyin abbreviation of "national standard", Guójiā Biāozhǔn), which covers thousands of commonly used simplified Chinese characters and punctuation marks, as well as all the characters in ASCII.

Coding scheme

GB 2312 is mainly used with the EUC-CN encoding scheme. EUC-CN encodes ASCII characters (English letters, etc.) as 1 byte, consistent with ASCII, and encodes Chinese characters as 2 bytes. For example, the lowercase letter "a" is encoded as 1100001, just like in ASCII, while a Chinese character is encoded as a two-byte sequence such as 0xC0 0xA4 (hexadecimal notation). Moreover, each of the two bytes of an encoded Chinese character must fall in the range 0xA1 to 0xFE. Thus, when a computer encounters a byte smaller than 0xA1, it knows the byte represents an ASCII character; when it encounters a byte of 0xA1 or above, it knows this byte and the following byte together represent a Chinese character.
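The lead-byte rule is simple enough to sketch. Here is an illustrative walk over a GB 2312 (EUC-CN) byte stream that splits it into 1-byte and 2-byte units (a sketch of the rule from the text, not a full decoder):

```javascript
// Split an EUC-CN byte stream into character units:
// bytes below 0xA1 are ASCII; bytes 0xA1–0xFE lead a 2-byte Chinese character.
function splitGb2312(bytes) {
  const units = [];
  for (let i = 0; i < bytes.length; ) {
    if (bytes[i] < 0xA1) {
      units.push([bytes[i]]); // one ASCII byte
      i += 1;
    } else {
      units.push([bytes[i], bytes[i + 1]]); // lead byte + trail byte
      i += 2;
    }
  }
  return units;
}

// "a" (0x61) followed by one two-byte Chinese character:
console.log(splitGb2312([0x61, 0xC0, 0xA4])); // [ [ 0x61 ], [ 0xC0, 0xA4 ] ]
```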

Thus, on a computer using GB 2312, documents written in ASCII open just as well as those written in GB 2312. As mentioned at the beginning of this article, directly encoding code points into their binary equivalents sometimes does not work as an encoding scheme, and this is a good example: since thousands of Chinese characters had to be encoded while remaining compatible with ASCII, this more flexible approach was the only way.

History

GB 2312 is a classic encoding specification for simplified Chinese, covering more than 99 percent of the characters in daily use, but excluding obscure characters that appear in classical Chinese and in many people's names. It has therefore gradually been superseded by GBK and GB 18030. In 2017, China reclassified GB 2312 as a recommended standard, renaming it GB/T 2312 ("T" is the first letter of the pinyin for "recommended", tuījiàn); it is no longer mandatory as before.

GBK

Introduction

GBK is probably the most widely known Chinese character encoding in China. GBK extends the GB 2312 character set; that is, GBK includes all the characters in GB 2312 and additionally covers traditional Chinese characters as well as simplified characters introduced after GB 2312 was released.

GBK was released by Microsoft in 1993 and has been the default character encoding of the simplified Chinese version of Windows ever since. GBK is used in the Command Prompt (cmd), Notepad, file names in Explorer, and other parts of the system. Its encoding scheme is similar to the EUC-CN scheme used with GB 2312, also using one or two bytes per character. Details are omitted here for space.

From ISO/IEC 8859 to GBK, we find that almost every country had its own character encoding standard, which caused serious problems in international communication. For example: if a product specification written in Chinese and English in GBK is sent to a French computer using ISO 8859-1, the recipient will likely see a pile of gibberish, because ISO 8859-1 does not understand the byte sequences of GBK.

To solve this problem, at the end of the 1980s and the beginning of the 1990s, some people thought of creating one unusually large character set containing the characters of all the world's languages, and then providing unified encoding schemes for it, thereby solving the problem of international communication. Unicode was born in this context.

Windows with GBK

Why does the simplified Chinese version of Windows 10 still use GBK? Mainly for historical reasons, I think. Since GBK was born in the 1990s, people have written countless documents in it; suddenly switching to UTF-8 or another Unicode encoding would cause great trouble.

In fact, the character encoding used in the Japanese version of Windows is not UTF-8 either. So if you have ever downloaded some older Japanese games, you may have noticed that the text is all garbled when you open them. In such cases you often need a transcoding tool to make the game display Japanese correctly.

烫烫烫 and 屯屯屯

Maybe you’ve seen a joke like this, or come across some magical gibberish like this:

How does this amazing garble come about? From what the author has found, the VC++ compiler on Windows fills uninitialized stack memory with the value 0xCC and uninitialized heap memory with the value 0xCD in debug mode. Meanwhile, the Command Prompt (cmd) in the simplified Chinese version of Windows uses GBK as its default character encoding. So, when the system encounters a run of 0xCC bytes, it interprets each 0xCCCC pair as "烫" (tàng, "hot") according to GBK; when it encounters a run of 0xCD bytes, it interprets each 0xCDCD pair as "屯" (tún).
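You can reproduce the garble in Node. This sketch assumes a Node build with full ICU, so that `TextDecoder` accepts the "gbk" label:

```javascript
// Decode the debug-fill bytes the way a GBK console would.
const gbk = new TextDecoder('gbk');

console.log(gbk.decode(new Uint8Array([0xCC, 0xCC, 0xCC, 0xCC]))); // "烫烫"
console.log(gbk.decode(new Uint8Array([0xCD, 0xCD, 0xCD, 0xCD]))); // "屯屯"
```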

There is also a similarly famous garble, "锟斤拷" (kun jin kao), which will be introduced below.

Unicode

Introduction

Unicode is a universal character set maintained by the Unicode Consortium that covers most of the world's writing systems. Version 1.0.0 was released in 1991, and the latest version, 13.0, in 2020; it now includes more than 140,000 characters from 154 scripts in current and historical use around the world, as well as Emoji, which have become increasingly popular in recent years. For compatibility with ASCII, its first 128 code points are defined to be exactly the 128 characters of ASCII.

Unicode plane

Unicode consists of 17 planes, each of which can contain up to 2^16 (65,536) characters. The first plane is Plane 0, also known as the BMP (Basic Multilingual Plane). The first characters added to Unicode went into the BMP; later CJK characters were added to Plane 2 and Plane 3, known respectively as the Supplementary Ideographic Plane and the Tertiary Ideographic Plane. It is worth mentioning that the inclusion of Chinese characters in Unicode is undertaken by organizations from China, South Korea, Japan, Vietnam, and other countries and regions. Many Chinese characters are added to Unicode each year, including characters from ancient books and rare characters from people's names.
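Since each plane holds 2^16 code points, the plane of a character is just its code point divided by 0x10000. A quick check in JavaScript:

```javascript
// plane = floor(codePoint / 0x10000)
const plane = (ch) => Math.floor(ch.codePointAt(0) / 0x10000);

console.log(plane('A'));  // 0 — ASCII lives in the BMP
console.log(plane('我')); // 0 — common Chinese characters are in the BMP
console.log(plane('😎')); // 1 — most emoji live in Plane 1
```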

A common misconception

Note also that Unicode itself is only a character set, not an encoding specification. Many people misunderstand Unicode as an encoding specification, but it can only answer questions like "which characters are in Unicode?" and "what number does this character have?", not "how is this character represented in zeros and ones?", which is exactly what an encoding specification does. Unicode officially adopts several encoding specifications, such as UTF-8, UTF-16, and UTF-32.

Unicode Zalgo Text

An additional interesting concept deserves a mention here: Unicode Zalgo text. Let's start with an example: "é" and "é". Can you tell the difference between these two characters? They look exactly the same, but if you copy them separately and compare them in a programming language, you get something like this:

> 'e\u0301' === '\u00E9'
false
In fact, the Unicode code point of the latter is the single U+00E9, while the former is U+0065 followed by U+0301. The latter is a single character, but the former is the English lowercase letter "e" followed by a special combining character. This combining character, U+0301 (the combining acute accent), draws an accent mark above the preceding character, so "e" plus it displays as "é". Unicode has many similar combining characters that attach marks above or below the preceding character; they are used in text of Latin-script languages, Greek, and others.
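The two forms can be compared, and folded together, with Unicode normalization. A minimal sketch in JavaScript:

```javascript
// The two visually identical strings, written with escapes so the difference is explicit.
const precomposed = '\u00E9';  // "é" as a single code point
const decomposed  = 'e\u0301'; // "e" + combining acute accent

console.log(precomposed === decomposed);                  // false
console.log(precomposed === decomposed.normalize('NFC')); // true — NFC folds them together
```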

So, someone came up with the idea of stacking dozens or even hundreds of these combining characters on ordinary characters to get some truly weird text, such as "H ̴ ͓ ̪ ̩ ̟ ̳ ̺ ̠ ̫ ̤ ͉ ̭ ̗ ̊ ̏ ͑ e ̷ ̛ ̮ ̠ ͖ ͕ ̝ ͍ ̤ l ̸ ̛ ̮ ͈ ͍ ̬ ̇ ̈ ͌ ́ ̌ ̑ ̎ ̿ ͌ ͝ l ̶ ͔ ͕ ̀ ̒ ̋ ͐ ȍ ̷ ̡ ̧ ̦ ̘ ̹ ̺ ̫ ̟ ̭ ̣ ͂ ̌ ͗ ̍ ̽ ͒ ̈ ͑ ̋ ̈ ́ ͂ ͑ ̄ ͠ ̴ ̨ ̲ ͓ ̙ ̺ ̰ ͖ ̞ ̯ ̼ ̝ ͚ ̦ ̐ ̈ ́ ͊ ̂ ̊ ́ ͊ ̚ ͅ w ̷ ̝ ͍ ͑ ̋ ͐ ̂ ͘ o ̸ ̱ ̽ ̈ ̓ ̌ ̅ ̽ ̑ ́ ̔ ͒ ̃ ̏ ̅ ͗ r ̶ ͙ ͖ ̬ ͙ ̚ ͝ ͅ l ̴ ̨ ͇ ͙ ͇ ̯ ̬ ͔ ̙ ̞ ͛ ̑ ̈ ̀ ͑ ̓ ̉ ̈ ́ ͒ ̾ ͠ ͝ d ̷ ̠ ͙ ̲ ̗ ̲ ̍ ̐ ̅". Some stack hundreds of combining marks on a single character, creating a monster glyph whose height spills far beyond its own line, overshadowing neighboring lines. A few years ago, the author often saw such text forwarded on platforms like Baidu Tieba and QQ.

UTF-8

Introduction

UTF-8 is a variable-width encoding specification for Unicode. "Variable-width" means that, depending on the character, it may be encoded as one, two, three, or four bytes. In particular, UTF-8 encodes ASCII characters as one byte each, identically to ASCII, which makes UTF-8 backward compatible with ASCII.

UTF-8 is probably the character encoding most familiar to programmers. On the Web, over 97% of pages use UTF-8. Notably, its authors are Rob Pike and Ken Thompson, the latter a co-creator of UNIX; both later also co-created the Go language.

Left: Rob Pike as a young man; right: Ken Thompson (seated) while working at Bell Labs

Encoding rules

Since it is so widely used, it is worth taking a closer look at UTF-8, whose encoding rules are shown below:

The table is from Wikipedia

For Unicode characters in different code point ranges, UTF-8 encodes them as byte sequences of different lengths. As you may notice, UTF-8 is theoretically capable of encoding 2^21 code points, far more than the total number of characters Unicode currently contains. However, for safety, the maximum code point UTF-8 may represent is limited to U+10FFFF. Therefore, UTF-8 can encode Unicode characters in the range U+0000 to U+10FFFF.
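The rules in the table can be transcribed almost literally. Here is a sketch of a UTF-8 encoder for a single code point, checked against the built-in `TextEncoder` (the character "中" is just an example):

```javascript
// Encode one code point per the UTF-8 table: 1–4 bytes depending on range.
function utf8Encode(cp) {
  if (cp < 0x80)    return [cp];                                  // 0xxxxxxx
  if (cp < 0x800)   return [0xC0 | (cp >> 6),                     // 110xxxxx 10xxxxxx
                            0x80 | (cp & 0x3F)];
  if (cp < 0x10000) return [0xE0 | (cp >> 12),                    // 1110xxxx 10xxxxxx 10xxxxxx
                            0x80 | ((cp >> 6) & 0x3F),
                            0x80 | (cp & 0x3F)];
  return [0xF0 | (cp >> 18),                                      // 11110xxx + 3 continuation bytes
          0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F),
          0x80 | (cp & 0x3F)];
}

const cp = '中'.codePointAt(0); // U+4E2D
console.log(utf8Encode(cp).map(b => b.toString(16)));                      // [ 'e4', 'b8', 'ad' ]
console.log([...new TextEncoder().encode('中')].map(b => b.toString(16))); // same bytes
```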

UTF-8 and C

Many beginners in C might write code like char c = '帅' ("handsome"), thinking that a char can hold any character: since it can hold an English letter like 'a', surely it can hold a Chinese character too. However, such code triggers compiler warnings; for example, GCC outputs:

main.c: In function 'main':
main.c:4:14: warning: multi-character character constant [-Wmultichar]
    ...
main.c:4:14: warning: overflow in implicit constant conversion [-Woverflow]

This is because the source file uses UTF-8, which encodes the character "帅" as three bytes; the compiler therefore reads three bytes of data inside the single quotes, while a char can hold only one byte.

To verify this, we can write code like this:

#include <stdio.h>
#include <string.h>

int main(void) {
    printf("The length is %zu\n", strlen("帅"));
    return 0;
}

In this code we wrap "帅" in double quotation marks, telling C that it is a char array (a string). The output is "The length is 3", as expected.
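The same experiment can be run in Node, where string length and byte length are reported separately:

```javascript
// "帅" is one character, but three bytes in UTF-8.
console.log('帅'.length);                     // 1 — UTF-16 code units (it is a BMP character)
console.log(Buffer.byteLength('帅', 'utf8')); // 3 — matching the C strlen output
```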

锟斤拷 (kun jin kao)

When introducing GBK, we explained the origin of two famous garbles. Here is how another famous garble, "锟斤拷" (kun jin kao), comes about. Unicode has a special character "�" (U+FFFD, the replacement character). The specification requires Unicode processors to substitute this character for any code point they cannot process, for example because the byte sequence is invalid. This can happen, among other cases, when GBK-encoded text is decoded as UTF-8.

The replacement character "�" is encoded in UTF-8 as 0xEF 0xBF 0xBD. When a run of these byte triplets is in turn decoded as GBK, the byte pairs come out as "锟" (0xEFBF), "斤" (0xBDEF), and "拷" (0xBFBD), and that is how "锟斤拷" is born.
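The round trip is easy to reproduce in Node. This sketch assumes a Node build with full ICU, so that `TextDecoder` accepts the "gbk" label:

```javascript
// Two U+FFFD replacement characters become six UTF-8 bytes: EF BF BD EF BF BD.
const bytes = Buffer.from('\uFFFD\uFFFD', 'utf8');

// Decoding those bytes as GBK pairs them up as EFBF, BDEF, BFBD:
console.log(new TextDecoder('gbk').decode(bytes)); // "锟斤拷"
```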

UTF-16

Introduction

UTF-16 is another encoding specification officially adopted by Unicode. It has the same encoding range as UTF-8 and is also variable-width. However, its rules are different: characters in the BMP are encoded as two bytes, while characters outside the BMP (U+10000 to U+10FFFF) are encoded as four bytes (a surrogate pair).

Because of this rule, UTF-16 is not compatible with ASCII. Nevertheless, the Windows kernel, Java, and JavaScript/ECMAScript all use UTF-16 internally.

UTF-8 vs. UTF-16

In fact, the Chinese characters in the BMP have code points at U+2E80 and above, so under the UTF-8 rules described earlier they are encoded as three bytes each, whereas UTF-16 encodes them as only two. Note that the BMP already contains the vast majority of Chinese characters, including most used in names and many rare ones. It has therefore been argued that UTF-16 is better suited to encoding Chinese text, while UTF-8 is better suited to text such as English.
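The size difference is easy to measure in Node (the sample strings are just examples):

```javascript
// Byte counts for the same text under the two encodings.
console.log(Buffer.byteLength('中文', 'utf8'));     // 6 — 3 bytes per BMP Chinese character
console.log(Buffer.byteLength('中文', 'utf16le'));  // 4 — 2 bytes per BMP character
console.log(Buffer.byteLength('hello', 'utf8'));    // 5 — 1 byte per ASCII character
console.log(Buffer.byteLength('hello', 'utf16le')); // 10 — UTF-16 doubles ASCII text
```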

UTF-32

UTF-32 is also an encoding specification officially adopted by Unicode. Its main difference from UTF-8 and UTF-16 is that it is fixed-width: every character is encoded as 32 bits (4 bytes). The rune type in Go is a 32-bit code point value, which corresponds to UTF-32.

Emoji 😎

Introduction

There is no need to say much about what Emoji are. Influenced by Japanese mobile networks of the 1990s, Emoji is a unified standard formed jointly by various parties, covering all kinds of expressions and symbols. Like Unicode, the Emoji standard is continuously updated, with new Emoji added over time.

What does the Emoji character set contain

Emoji is a character set included in Unicode. However, besides introducing new characters, the Emoji standard also pulls some existing Unicode characters into the Emoji set. For example, Unicode has a block called "Mahjong Tiles" containing the characters for mahjong tiles, of which the Emoji standard lists only the red dragon tile 🀄 (U+1F004) as Emoji. As a result, you will find that among the mahjong characters typed with an input method, only the red dragon looks like an Emoji, while the others look like ordinary characters (black glyphs on a plain background).

Emoji font

A confusing point should be emphasized here: the Emoji standard is just a character set that defines "which characters are Emoji" and "what each character is called (what it depicts)", but not "what each Emoji looks like". The appearance of every Emoji you see is determined by an Emoji font. Many vendors have developed their own: Apple ships the Apple Color Emoji font on iOS, macOS, and its other systems, while Microsoft ships the Segoe UI Emoji font on Windows 10. Emoji fonts vary between vendors; for example, the following image shows the "red envelope" Emoji 🧧 (U+1F9E7) in different Emoji fonts:

Image from Emojipedia

At the same time, some vendors' fonts throw in extras. For example, Microsoft's Segoe UI Emoji font provides Emoji-style glyphs for the other mahjong characters not included in the Emoji specification. Therefore, on Windows, if a program uses the Segoe UI Emoji font, the mahjong characters you see will likely all look like Emoji, even though none of them except the red dragon actually are. For example, here is one of the other mahjong tiles in Segoe UI Emoji:

Image from Emojipedia

However, on Ubuntu 20.04, which the author uses, the same tile looks like this:

Emoji with Unicode Zero-width Joiner

Unicode has a special character called the zero-width joiner (ZWJ, code point U+200D). It is used for typesetting in complex scripts such as Arabic, for example to force two adjacent characters to be rendered in their joined (cursive) form. Used alone, the ZWJ displays nothing.

Emoji also makes use of the ZWJ, expressing some complex Emoji as several simple Emoji joined by ZWJs. For example, the family Emoji 👨‍👩‍👦 is actually 👨, ZWJ, 👩, ZWJ, and 👦 connected together. If you look at its UTF-8 encoding, you will see it is the concatenation of the UTF-8 encodings of those five characters. If you run the JavaScript [...'👨‍👩‍👦'], you get an array of five elements: the three emoji plus two invisible ZWJs. Note that JavaScript requires special handling for Unicode characters outside the BMP, such as Emoji.
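The composition can be done by hand. A minimal sketch that builds the family from its parts:

```javascript
// Join the three member emoji with ZWJ (U+200D):
const family = ['👨', '👩', '👦'].join('\u200D');

console.log(family);             // renders as one family glyph where the font supports it
console.log([...family].length); // 5 — three emoji plus two invisible ZWJs
```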

ZWJ can also be used with plain text, making some text look identical but not equal, as in this example running on Node:

> const str1 = [...'abc'].join(String.fromCodePoint(0x200D));
undefined
> const str2 = 'abc';
undefined
> str1 === str2
false
By inserting a ZWJ between each pair of characters, we create a string that looks identical to the original but compares unequal. This trick has many uses: some people use the invisible ZWJ to hide extra information inside ordinary text, and it is also used to slip sensitive keywords past automated filters, as in "posts and replies containing XXX will be harmonized".

Emoji Modifier Fitzpatrick Type

The Emoji specification also defines a number of special-purpose characters, such as the five Emoji modifiers Fitzpatrick Type-1-2 through Type-6 (code points U+1F3FB to U+1F3FF). These characters attach skin-tone information to skin-related Emoji, as shown in the following example:
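Applying a modifier is plain string concatenation. A quick sketch (U+1F3FB is Type-1-2, the lightest tone):

```javascript
// Attach a skin-tone modifier to a base emoji.
const wave = '👋';
const lightWave = wave + '\u{1F3FB}';

console.log(lightWave);             // renders as a light-skinned waving hand where supported
console.log([...lightWave].length); // 2 — the base emoji plus the modifier code point
```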

The table is from Wikipedia

Fitzpatrick is the name of the dermatologist who devised this classification of human skin color.

Conclusion

The above is the author's collection and summary of knowledge about some common character encodings. The main references are Wikipedia and Emojipedia. Some of the text may be inaccurate or incomplete; feel free to correct or supplement it in the comments.