During development I occasionally ran into problems involving character encoding, Unicode, and emoji, and realized that I had never fully grasped the basics. So I spent some time studying and organized what I learned into a few easy-to-follow articles to share.
Other articles in the [Core Foundations] series, for anyone interested:
- Core Foundations: Binary (II), bit operations
- Core Foundations: Binary (I), 0.1 + 0.2 != 0.3 and the IEEE 754 standard
- 烫烫烫 (hot, hot, hot)
Intro
We all know that computers can only store bits, each either 0 or 1, and that before computers most of the information humanity had accumulated existed as text: words, books, and so on. How should this knowledge be stored in a computer?
With zeros and ones, of course. Imagine designing a system for storing uppercase and lowercase letters (52 characters) and digits (10 characters), 62 characters in total. Six bits are enough (2^6 = 64), with each code mapped to exactly one character.
For example, it could be designed like this:
000000 a
000001 b
...
111100 8
111101 9
Note: this encoding is only a demonstration and is not practical!
With a one-to-one mapping, text can be stored on a computer. For example, in our "system" the lowercase word hello is stored as 000111 000100 001011 001011 001110.
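The toy system above can be sketched in a few lines of JavaScript. This is only an illustration: the names `ALPHABET`, `encode6`, and `decode6` are made up for this sketch, and the ordering of the uppercase letters (placed between the lowercase letters and the digits) is an assumption, since the example only fixes where `a`, `b`, `8`, and `9` sit.

```javascript
// Toy 6-bit character set: a-z first, then A-Z (assumed), then 0-9.
const ALPHABET =
  'abcdefghijklmnopqrstuvwxyz' +
  'ABCDEFGHIJKLMNOPQRSTUVWXYZ' +
  '0123456789';

// Encode text as space-separated 6-bit groups.
function encode6(text) {
  return [...text]
    .map((ch) => {
      const index = ALPHABET.indexOf(ch);
      if (index === -1) {
        throw new RangeError(`unsupported character: ${JSON.stringify(ch)}`);
      }
      return index.toString(2).padStart(6, '0');
    })
    .join(' ');
}

// Decode space-separated 6-bit groups back into text.
function decode6(bits) {
  return bits
    .split(' ')
    .map((group) => ALPHABET[parseInt(group, 2)])
    .join('');
}

console.log(encode6('hello')); // 000111 000100 001011 001011 001110
console.log(decode6('000111 000100 001011 001011 001110')); // hello
```

Calling `encode6('hello world')` throws a `RangeError`, because the 62-character set has no code for a space.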
When designing an encoding system, different presentation styles of a character are not treated as different characters. The same paragraph of text, whether bold, italic, underlined, or set with different spacing, is not regarded as a different paragraph by the reader, nor by the computer.
The final rendering of a character also depends on the font, which can be a problem if the current font has no glyph for that character.
This "system" is what we call a character encoding, and the first character encoding most of us met when we started programming was ASCII.
ASCII
Our own "system" has major flaws. It cannot store Hello World, for example, because it has no code for a space. To meet even the most basic everyday needs, a more complete system is required, and the most basic encoding in computing is ASCII.
ASCII stands for American Standard Code for Information Interchange. The complete ASCII table is shown below. The first 32 characters (codes 0 to 31) are control characters, originally designed for teleprinters and many of them now obsolete. Printable characters start at code 32 and include the commonly used uppercase and lowercase letters, digits, punctuation marks, and so on.
Why are there only 128 ASCII codes?
As you can see from the figure above, ASCII defines only 128 characters (2^7), so each code fits in 7 bits. You may wonder why ASCII does not use all 8 bits of a byte, which would leave room for more characters.
The reason is that ASCII was designed in the 1960s (first published in 1963, with a major revision in 1967) for teleprinters. At that time the 8-bit byte was not yet a universal standard, storage and transmission were expensive, and the code needed to be as compact as possible.
Vendors such as IBM later extended ASCII to 8 bits, adding more characters. These extensions are known as extended ASCII (EASCII).
In modern computers a byte is eight bits, so even though an ASCII code uses only seven bits, it is stored in at least eight.
Carriage return (CR), line feed (LF), and EOL
Of the first 32 characters, there are two that developers can easily confuse: carriage return and line feed.
Property | Carriage return | Line feed |
---|---|---|
charCode | 13 | 10 |
hex | 0D | 0A |
Escape sequence (strings/regex) | \r | \n |
Name | Carriage Return (CR) | Line Feed (LF) |
Most operating systems other than Windows use \n (LF) as the newline character. Windows uses the two-character sequence CR LF (\r\n) as its end-of-line (EOL) marker, and until 2018 Notepad, the Windows text editor, recognized only that sequence.
So I used to run into files created by others on Linux/macOS that, when opened in Notepad, showed all the code on a single line. Fortunately, this has since been fixed.
In the lower right corner of VS Code, you can choose which line ending the current file uses.
// Use LF to save a newline
hexdump index.txt
0000000 0a
// Use CRLF to save a newline
hexdump index.txt
0000000 0d 0a
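The difference between the two conventions is easy to inspect from JavaScript. The sketch below is only an illustration; `detectEol` and `toLf` are hypothetical helper names, not part of any library.

```javascript
// Report which end-of-line convention a string uses.
function detectEol(text) {
  const crlf = (text.match(/\r\n/g) || []).length;
  const lf = (text.match(/\n/g) || []).length - crlf; // count bare LF only
  if (crlf === 0 && lf === 0) return 'none';
  if (crlf > 0 && lf > 0) return 'mixed';
  return crlf > 0 ? 'CRLF' : 'LF';
}

// Convert every CRLF pair into a single LF.
function toLf(text) {
  return text.replace(/\r\n/g, '\n');
}

console.log('\r'.charCodeAt(0), '\n'.charCodeAt(0)); // 13 10
console.log(detectEol('a\r\nb\r\n')); // CRLF
console.log(detectEol('a\nb\n')); // LF
console.log(JSON.stringify(toLf('a\r\nb'))); // "a\nb"
```

Normalizing to LF before processing text is a common choice, because a lone `\n` is then the only line terminator the rest of the code has to handle.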
Uppercase and lowercase encoding
Looking at the letters in the ASCII table in isolation, you might wonder how their order was decided, and why a does not follow Z immediately. (What a boring question.) Viewed in binary, there are some clues.
A is at position 64 + 1 and a is at position 64 + 32 + 1, so the layout looks intentional: the two cases of a letter differ only in the bit with value 32. Switching case therefore takes a single bit operation, an XOR with 0100000.
A 1000001    a 1100001
B 1000010    b 1100010
...
Z 1011010    z 1111010

  1100001  a
^ 0100000
---------
  1000001  A
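The single-bit trick can be tried directly in JavaScript. A minimal sketch, assuming only ASCII letters should be affected; `CASE_BIT` and `toggleCase` are names invented for this example.

```javascript
// The bit that separates 'A' (1000001) from 'a' (1100001).
const CASE_BIT = 0b0100000; // 32

// Flip the case of a single ASCII letter; leave anything else untouched.
function toggleCase(letter) {
  const code = letter.charCodeAt(0);
  const isUpper = code >= 65 && code <= 90; // A-Z
  const isLower = code >= 97 && code <= 122; // a-z
  return isUpper || isLower ? String.fromCharCode(code ^ CASE_BIT) : letter;
}

console.log('a'.charCodeAt(0).toString(2)); // 1100001
console.log(toggleCase('a')); // A
console.log(toggleCase('Z')); // z
console.log(toggleCase('5')); // 5
```

The range check matters: XOR with 32 on a non-letter such as `5` (0110101) would produce an unrelated control character.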
Limitations
The biggest limitation of ASCII is its size: it defines only 128 characters, and even the 8-bit extensions that fill a whole byte top out at 256. That is enough for American English, but not even fully for the UK (character 36, $, is the dollar sign, and there is no pound sign), let alone for the tens of thousands of characters used in East Asian scripts.
To use their own languages on computers and the Internet, countries and regions devised their own encodings early on: GB2312 for Chinese, Shift_JIS for Japanese, EUC-KR for Korean, and so on. These encodings differ widely from one another, which makes it hard to show several languages in a single article.
The same code value corresponds to different characters in different encodings
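The effect is easy to reproduce. The sketch below assumes Node.js, because it relies on Node's `Buffer`. It uses Latin-1 as the "wrong" encoding instead of GB2312 or Shift_JIS only because `Buffer` supports Latin-1 out of the box. The three bytes are the UTF-8 encoding of the Chinese character 烫.

```javascript
// Node.js only: Buffer is a Node.js global and is not available in browsers.
const bytes = Buffer.from([0xe7, 0x83, 0xab]);

// Decoded as UTF-8, the three bytes form a single character.
const asUtf8 = bytes.toString('utf8');

// Decoded as Latin-1, every byte becomes its own character: garbled text.
const asLatin1 = bytes.toString('latin1');

console.log(asUtf8, asUtf8.length); // 烫 1
console.log(asLatin1.length); // 3
console.log([...asLatin1].map((ch) => ch.charCodeAt(0).toString(16)).join(' ')); // e7 83 ab
```

The bytes never change; only the table used to interpret them does, which is exactly why text looks garbled when sender and receiver disagree about the encoding.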
But the Internet is universal by nature. If the two sides of a conversation use different encodings, the same bytes mean different things on different machines, and nothing scales. To end this confusion, several well-known computer companies began collaborating in 1988 on a new encoding system, Unicode.
Unicode
To solve the problem of garbled text caused by mixing encodings, Unicode assigns every character in the world a unique number, which we call a code point. Its first 128 code points are identical to ASCII, and the code space extends to 21 bits, covering U+0000 through U+10FFFF, or 1,114,112 code points in total. From version 1.0 in 1991 to version 14.0 in September 2021, 144,697 characters have been assigned, so a large number of code points remain unused.
Code Point
As mentioned earlier, Unicode assigns a unique number to every character in the world, which solves the problem of displaying characters from many different languages in a single article. Code points range from U+0000 to U+10FFFF, where the U+ prefix marks the value as a Unicode code point and is followed by a hexadecimal number.
In HTML, you can write any character as a numeric character reference: &#x followed by the code point in hexadecimal and a semicolon, or &# followed by the code point in decimal and a semicolon.
&#x70EB; => 烫
&#28907; => 烫
&#x609F; => 悟
&#24735; => 悟
A code point is independent of language, system, and program, so a character written as an HTML numeric character reference is not affected by the charset declared in Content-Type.
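JavaScript can go back and forth between characters and code points with the built-in `codePointAt` and `String.fromCodePoint`. The sketch below uses them to print the U+ notation and both HTML reference forms; `toUPlus`, `toHexEntity`, and `toDecEntity` are hypothetical helper names for this example.

```javascript
// Format a character's code point in U+XXXX notation (at least 4 hex digits).
function toUPlus(ch) {
  return 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
}

// Hexadecimal HTML numeric character reference, e.g. &#x70EB;
function toHexEntity(ch) {
  return `&#x${ch.codePointAt(0).toString(16).toUpperCase()};`;
}

// Decimal HTML numeric character reference, e.g. &#28907;
function toDecEntity(ch) {
  return `&#${ch.codePointAt(0)};`;
}

console.log(toUPlus('A')); // U+0041
console.log(toUPlus('烫')); // U+70EB
console.log(toHexEntity('烫')); // &#x70EB;
console.log(toDecEntity('烫')); // &#28907;
console.log(toDecEntity('悟')); // &#24735;
console.log(String.fromCodePoint(0x70eb)); // 烫
```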
Unicode planes
Unicode code points are currently arranged into 17 groups called planes, each of which has 65,536 (2^16) code points. So far, however, only a few planes are in use.
The first plane, plane 0, is the Basic Multilingual Plane, BMP for short. This abbreviation appears frequently in English-language articles on the subject. Most commonly used characters are in this plane. The other planes are called supplementary planes and range from U+10000 to U+10FFFF. For example, the following characters are in supplementary planes:
Character | Code point |
---|---|
𝐁 | U+1D401 |
🀵 | U+1F035 |
💩 | U+1F4A9 |
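Because every plane holds 2^16 code points, the plane number is simply the code point shifted right by 16 bits. A small sketch, with `planeOf` as a made-up helper name:

```javascript
// Return the Unicode plane (0-16) that a character's code point belongs to.
function planeOf(ch) {
  return ch.codePointAt(0) >> 16;
}

console.log(planeOf('A')); // 0 (BMP)
console.log(planeOf('烫')); // 0 (BMP)
console.log(planeOf(String.fromCodePoint(0x1d401))); // 1
console.log(planeOf(String.fromCodePoint(0x1f4a9))); // 1
console.log(planeOf(String.fromCodePoint(0x10ffff))); // 16
```

The last call uses the highest possible code point, U+10FFFF, which sits in plane 16, the last of the 17 planes.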
As we learned in the previous section, a Unicode code point needs up to 21 bits, which takes at least three bytes to store or transmit. Does every character therefore have to occupy three bytes?
Code Unit
Of course not. If every character were stored in three bytes, plain English text would suffer badly: the first two bytes of every character would be all zeros, wasting a great deal of space.
So why did ASCII never need a more efficient encoding? Because its codes take up only seven bits, each one fits in a single byte anyway.
An encoding form turns each code point into a sequence of one or more fixed-size units, and each of those units is called a code unit. The common encoding forms are UTF-8, UTF-16, and UTF-32, whose code units are 8, 16, and 32 bits wide respectively.
The one we use most is UTF-8, a variable-length encoding: a character encoded in UTF-8 takes from one to four bytes.
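To get a feel for the trade-off, the sketch below compares how many bytes the same text needs in each encoding form. It uses the standard `TextEncoder` (available in browsers and as a global in recent Node.js versions), which always produces UTF-8. The UTF-16 figure relies on the fact that a JavaScript string's `length` counts 16-bit code units. `sizes` is a hypothetical helper name.

```javascript
const utf8Encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

// Byte counts for the same text in the three encoding forms.
function sizes(text) {
  const codePoints = [...text].length; // string iteration walks code points
  return {
    utf8Bytes: utf8Encoder.encode(text).length,
    utf16Bytes: text.length * 2, // length counts 16-bit code units
    utf32Bytes: codePoints * 4, // one 32-bit code unit per code point
  };
}

console.log(sizes('A')); // { utf8Bytes: 1, utf16Bytes: 2, utf32Bytes: 4 }
console.log(sizes('烫')); // { utf8Bytes: 3, utf16Bytes: 2, utf32Bytes: 4 }
console.log(sizes('\u{1F4A9}')); // { utf8Bytes: 4, utf16Bytes: 4, utf32Bytes: 4 }
```

For plain English text UTF-8 is the most compact of the three, which is one reason it dominates on the web.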
UTF-8 encoding details
The UTF-8 design has the following properties:
- For single-byte characters, the highest bit is 0, and the result is identical to ASCII.
- In a multibyte sequence, the number of consecutive 1s at the top of the first byte gives the length of the sequence: 110 means the character takes two bytes, 1110 means three bytes, and so on. The top two bits of every byte after the first are fixed at 10.
The description may not be entirely clear, so look at the table below.
Code point bits | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
21 | U+10000 | U+10FFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
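The table translates almost line by line into code. The following is a minimal sketch of an encoder for a single code point, written for illustration rather than production use (real code should use `TextEncoder`); `encodeUtf8` and `hex` are names invented for this example.

```javascript
// Encode one Unicode code point into an array of UTF-8 bytes.
function encodeUtf8(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError('code point out of range');
  }
  if (codePoint >= 0xd800 && codePoint <= 0xdfff) {
    throw new RangeError('surrogates cannot be encoded in UTF-8');
  }
  if (codePoint <= 0x7f) {
    return [codePoint]; // 0xxxxxxx
  }
  if (codePoint <= 0x7ff) {
    return [
      0xc0 | (codePoint >> 6), // 110xxxxx
      0x80 | (codePoint & 0x3f), // 10xxxxxx
    ];
  }
  if (codePoint <= 0xffff) {
    return [
      0xe0 | (codePoint >> 12), // 1110xxxx
      0x80 | ((codePoint >> 6) & 0x3f), // 10xxxxxx
      0x80 | (codePoint & 0x3f), // 10xxxxxx
    ];
  }
  return [
    0xf0 | (codePoint >> 18), // 11110xxx
    0x80 | ((codePoint >> 12) & 0x3f), // 10xxxxxx
    0x80 | ((codePoint >> 6) & 0x3f), // 10xxxxxx
    0x80 | (codePoint & 0x3f), // 10xxxxxx
  ];
}

// Render bytes the way hexdump does.
const hex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, '0')).join(' ');

console.log(hex(encodeUtf8(0x41))); // 41
console.log(hex(encodeUtf8(0x70eb))); // e7 83 ab
console.log(hex(encodeUtf8(0x1f4a9))); // f0 9f 92 a9
```

Each branch corresponds to one row of the table: the leading byte carries the length marker plus the highest bits, and every following byte carries the next six bits behind the fixed `10` prefix.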
To test this locally, create an index.html file encoded as UTF-8 and run the following commands.
$ echo -n A > index.html
$ ls -lh index.html
> 1B
$ hexdump index.html
> 41
Explanation: the -n argument tells echo not to print a trailing newline, so the file contains only the character A. The -l argument makes ls print a detailed listing, and -h makes the sizes human-readable (I have hidden the irrelevant output).
You can see that with UTF-8 the character A takes only one byte, and the result is 0x41, exactly as in ASCII.
Now try echoing a Chinese character.
$ echo -n 烫 > index.html
$ ls -lh index.html
> 3B
$ hexdump index.html
> e7 83 ab
With UTF-8, the character 烫 takes three bytes, and the encoded bytes differ from its Unicode code point. Following the table above, convert its code point U+70EB to binary, 01110000 11101011, then fill those bits into the x positions from right to left, padding any unused high-order positions with 0, to get
11100111 10000011 10101011
which is E7 83 AB in hexadecimal!
Instead of converting between bases by hand, you can use parseInt and toString, as covered in the binary articles of this series.
parseInt('10000011', 2).toString(16)
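The same two built-ins can check the whole three-byte result at once. A small sketch (the variable names are arbitrary):

```javascript
// The three bytes obtained by filling U+70EB into the 3-byte UTF-8 pattern.
const binary = '11100111 10000011 10101011';

// Binary -> hexadecimal, one byte at a time.
const hexBytes = binary
  .split(' ')
  .map((bits) => parseInt(bits, 2).toString(16).padStart(2, '0'));

console.log(hexBytes.join(' ')); // e7 83 ab

// Hexadecimal -> binary, to confirm the round trip.
const roundTrip = hexBytes
  .map((h) => parseInt(h, 16).toString(2).padStart(8, '0'))
  .join(' ');

console.log(roundTrip === binary); // true
```

`padStart` is needed because `toString` drops leading zeros, which would otherwise turn a byte such as 00001010 into just `1010`.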
The same holds for characters in the supplementary planes: echo a 💩 and you can see that it takes four bytes. The encoding logic is the same as above, and you can verify the result yourself.
$ echo -n 💩 > index.html
$ ls -lh index.html
> 4B
$ hexdump index.html
> f0 9f 92 a9
Conclusion
In this article we reviewed the details of ASCII, Unicode, and UTF-8 to help readers understand text encodings.
In the next article we will continue with the various tricky problems involving JavaScript and encodings, and deal with them all in one go.
Finally, given my limited time and ability, the article is bound to contain inaccurate descriptions and even outright mistakes. Please point out anything you find.
Further reading
- Wiki Code point
- Unicode character plane mapping
- Character encoding notes: ASCII, Unicode and UTF-8
- What every JavaScript developer should know about Unicode