Note: this article is about how characters are encoded; it should be distinguished from binary streams, which are about how files are processed as raw bytes.
1. ASCII code
We know that, inside a computer, all information is ultimately stored as binary values. Each bit has two states, 0 and 1, so eight bits can form 2^8 = 256 combinations; a group of eight bits is called a byte. That is, a byte can represent 256 different states, each corresponding to one symbol: 256 symbols in total, ranging from 00000000 to 11111111.
In the 1960s, the United States developed a character encoding that established a uniform mapping between English characters and binary values. This is the ASCII code, and it is still in use today.
ASCII specifies 128 characters in total; for example, SPACE is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 non-printable control characters) occupy only the lower seven bits of a byte, with the highest bit uniformly set to 0.
For example, the ASCII code of the uppercase letter A is 65, i.e. binary 01000001.
8 bits = 1 byte, and one byte holds one ASCII character.
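As a quick check of this mapping, here is a minimal JavaScript sketch (runnable in a browser console or Node.js); the helper name toAsciiBits is just an illustrative choice, not a standard API.

```javascript
// Print the ASCII code of a character in decimal and in 8-bit binary.
function toAsciiBits(ch) {
  const code = ch.charCodeAt(0);            // e.g. "A" -> 65
  return code.toString(2).padStart(8, "0"); // e.g. 65 -> "01000001"
}

console.log("A".charCodeAt(0));       // 65
console.log(toAsciiBits("A"));        // "01000001"
console.log(String.fromCharCode(65)); // "A"
```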
For single-byte characters, UTF-8 encoding is identical to ASCII encoding.
As a side note, the browser's Base64 function window.btoa encodes only ASCII and other single-byte characters (code points up to 255); anything larger throws an error.
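A minimal sketch of that limitation, assuming a browser (or Node.js 16+, where btoa and TextEncoder are also available as globals); the utf8ToBase64 helper is an illustrative workaround, not part of any standard API.

```javascript
console.log(btoa("ABC")); // "QUJD" — plain ASCII works fine

// btoa("严") throws, because U+4E25 is not a single-byte character.
// A common workaround: convert to UTF-8 bytes first, then Base64 those bytes.
function utf8ToBase64(str) {
  const bytes = new TextEncoder().encode(str); // UTF-8 bytes
  return btoa(String.fromCharCode(...bytes));  // each byte is <= 255
}

console.log(utf8ToBase64("严")); // "5Lil"
```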
2. Non-ASCII encodings
128 symbols are enough to encode English, but not enough to represent other languages. In French, for example, letters with diacritical marks cannot be represented in ASCII. So some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). As a result, the encoding systems used in these European countries can represent up to 256 symbols.
But this created a new problem. Different countries have different letters, so even though they all use 256-symbol encodings, the same code does not stand for the same letter. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In any case, in all of these encodings the symbols from 0 to 127 are identical; only the range 128 to 255 differs.
As for the writing systems of Asia, there are even more symbols; Chinese alone has roughly 100,000 characters. A single byte, which can represent only 256 symbols, is certainly not enough, so a symbol must be represented by more than one byte. For example, the common encoding for simplified Chinese, GB2312, uses two bytes per Chinese character, so it can theoretically represent up to 256 × 256 = 65,536 symbols.
The problem of Chinese encoding deserves a separate article and is not covered in these notes. It is only pointed out here that although a symbol may be represented by multiple bytes, the GB family of Chinese-character encodings is unrelated to the Unicode and UTF-8 discussed below.
3. Unicode
As mentioned in the previous section, there are many different encodings, and the same binary value can be interpreted as different symbols. Therefore, to open a text file you must know how it is encoded; if it is interpreted with the wrong encoding, you get garbled text. Why do e-mails often arrive garbled? Because the sender and the receiver use different encodings.
You can imagine an encoding that included every symbol in the world, with each symbol assigned a unique code; the garbling problem would then disappear. This is Unicode, which, as its name suggests, is an encoding for all symbols.
Unicode is, of course, a very large set, currently holding more than a million code points. For example, U+0639 stands for the Arabic letter Ain, U+0041 for the English capital letter A, and U+4E25 for the Chinese character 严 (yán). For the full symbol tables, see unicode.org, or a dedicated table for Chinese characters.
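A quick JavaScript illustration of those code points (the \u escapes and codePointAt are standard; the characters shown are simply the ones mentioned above):

```javascript
console.log("A".codePointAt(0).toString(16));  // "41"   -> U+0041
console.log("严".codePointAt(0).toString(16)); // "4e25" -> U+4E25
console.log("\u0639");                         // "ع" — the Arabic letter Ain
console.log(String.fromCodePoint(0x4e25));     // "严"
```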
4. Unicode problems
It is important to note that Unicode is only a set of symbols: it specifies the binary code point of each symbol, but not how that code point should be stored.
For example, the Unicode code point of the Chinese character 严 is the hexadecimal number 4E25, which is a full 15 bits long in binary (100111000100101), meaning that representing this symbol requires at least two bytes. Symbols with larger code points may require three or four bytes, or even more.
Two serious problems arise here. The first: how do you distinguish Unicode from ASCII? How does the computer know that three bytes represent a single symbol rather than three separate symbols? The second: we already know that a single byte is enough for English letters. If Unicode uniformly required three or four bytes per symbol, every English letter would have to be padded with two or three bytes of zeros. That is an enormous waste of storage: text files would become two or three times larger, which is unacceptable.
The result was: 1) the emergence of multiple storage schemes for Unicode, meaning there are many different binary formats that can represent Unicode code points; 2) Unicode could not be popularized for a long time, until the advent of the Internet.
5. UTF-8
The popularization of the Internet made a unified encoding scheme a pressing need. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), though they are rarely used on the Internet. To repeat the relationship here: UTF-8 is one implementation of Unicode.
One of the biggest features of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, with the number of bytes varying according to the symbol.
UTF-8's encoding rules are very simple; there are only two:
- For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, UTF-8 is therefore identical to ASCII.
- For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, the (n + 1)-th bit is set to 0, and the first two bits of each of the following bytes are set to 10. The remaining bits, not mentioned above, hold the symbol's Unicode code point.
The following table summarizes the encoding rules, with the letter x marking the bits available for the code point.
| Unicode range (hexadecimal) | UTF-8 encoding (binary) |
| --- | --- |
| 0000 0000 - 0000 007F | 0xxxxxxx |
| 0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx |
| 0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
According to the table above, interpreting UTF-8 is very simple. If the first bit of a byte is 0, that byte by itself is a single character. If the first bit is 1, the number of consecutive leading 1s indicates how many bytes the current character occupies.
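That first-byte rule is easy to express in code. A minimal JavaScript sketch (the function name utf8Length is just an illustrative choice):

```javascript
// How many bytes does a UTF-8 character occupy, judging only by its first byte?
function utf8Length(firstByte) {
  if ((firstByte & 0b10000000) === 0) return 1;          // 0xxxxxxx
  if ((firstByte & 0b11100000) === 0b11000000) return 2; // 110xxxxx
  if ((firstByte & 0b11110000) === 0b11100000) return 3; // 1110xxxx
  if ((firstByte & 0b11111000) === 0b11110000) return 4; // 11110xxx
  throw new Error("not a valid UTF-8 leading byte");
}

console.log(utf8Length(0x41)); // 1 — "A"
console.log(utf8Length(0xe4)); // 3 — first byte of 严's UTF-8 encoding
```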
The following uses the Chinese character 严 as an example to show how its UTF-8 encoding is derived.
The Unicode code point of 严 is 4E25 (binary 100111000100101). According to the table above, 4E25 falls within the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary digit of 严's code point, the x positions are filled in from back to front, and any remaining x positions are padded with 0. The result is that the UTF-8 encoding of 严 is 11100100 10111000 10100101, which in hexadecimal is E4B8A5.
That is, the UTF-8 encoding of 严 is E4B8A5.
6. Converting between Unicode and UTF-8
From the example in the previous section you can see that the Unicode code point of 严 is 4E25, while its UTF-8 encoding is E4B8A5; the two are different. Converting between them can be done programmatically.
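In JavaScript, for instance, the conversion can be done with the built-in TextEncoder and TextDecoder APIs (standard in browsers and modern Node.js), which work in UTF-8 directly; a minimal sketch:

```javascript
// Unicode code point -> UTF-8 bytes
const utf8 = new TextEncoder().encode("严");
console.log([...utf8].map(b => b.toString(16)).join(" ")); // "e4 b8 a5"

// UTF-8 bytes -> character (and its Unicode code point)
const ch = new TextDecoder("utf-8").decode(new Uint8Array([0xe4, 0xb8, 0xa5]));
console.log(ch, ch.codePointAt(0).toString(16)); // "严 4e25"
```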
On the Windows platform, one of the simplest ways to convert is to use the built-in Notepad program, notepad.exe. After opening a file, choose “Save As” from the File menu; a dialog box pops up with an “Encoding” drop-down at the very bottom.
There are four options: ANSI, Unicode, Unicode Big Endian and UTF-8.
1) ANSI is the default encoding. It uses ASCII for English files and GB2312 for simplified Chinese files (on the simplified Chinese version of Windows only; the traditional Chinese version uses Big5).
2) Unicode here refers to the UCS-2 encoding used by notepad.exe, that is, the Unicode code point is stored directly in two bytes. This option uses the little-endian format.
3) Unicode big endian corresponds to the previous option. What little endian and big endian mean is explained in the next section.
4) UTF-8 encoding, which is the encoding method discussed in the previous section.
After selecting an encoding and clicking the “Save” button, the file is immediately converted to that encoding.
7. Byte order
Computer hardware stores data in two ways: big endian and little endian.
For example, the hexadecimal value 0x2211 is stored in two bytes: the high byte is 0x22 and the low byte is 0x11.
- Big-endian: the order in which humans read and write numbers, with the most significant byte first and the least significant byte last; 0x2211 is stored as the byte sequence 0x22 0x11.
- Little-endian: the least significant byte first and the most significant byte last; 0x2211 is stored as the byte sequence 0x11 0x22.
Similarly, the big-endian and little-endian layouts of the 32-bit value 0x01234567 can be written out byte by byte, as the sketch below shows. (Why can hexadecimal be used here in place of binary? See the appendix.) Strictly speaking, the hexadecimal value should first be converted to binary and then arranged in byte order; for brevity, hexadecimal notation is used directly.
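A minimal JavaScript sketch of these two layouts, using the standard DataView API (the third argument of setUint32 selects little-endian when true):

```javascript
// Store the 32-bit value 0x01234567 in big-endian and little-endian order
// and inspect the resulting byte sequences.
function bytesOf(value, littleEndian) {
  const buf = new ArrayBuffer(4);
  new DataView(buf).setUint32(0, value, littleEndian);
  return [...new Uint8Array(buf)]
    .map(b => b.toString(16).padStart(2, "0"))
    .join(" ");
}

console.log(bytesOf(0x01234567, false)); // "01 23 45 67"  (big-endian)
console.log(bytesOf(0x01234567, true));  // "67 45 23 01"  (little-endian)
```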
I never understood why byte order exists. Having to distinguish it on every read and write seemed like so much trouble; wouldn't a uniform big-endian order be more convenient?
Last week I read an article that answered all of those questions. Moreover, I found that my original understanding was wrong: byte order is actually very simple.
First of all, why is there little endian order?
The answer is that it is more efficient for computer circuits to process the least significant byte first, since calculations start from the lowest bits. Therefore, a computer's internal processing is generally little-endian.
However, humans are still used to reading and writing in big-endian order. So, apart from a computer's internal processing, almost all other situations are big-endian, such as network transmission and file storage.
When a computer processes a sequence of bytes, it does not know which is the high-order byte and which is the low-order byte. All it knows is to read the bytes in order: first byte, then second byte.
If the data is big-endian, the high-order byte is read first and the low-order byte afterwards; little-endian is exactly the opposite.
Understanding this will help you understand how a computer handles byte order.
Byte-order handling comes down to a single sentence:
“Byte order has to be distinguished only when reading data; nothing else needs to care about it.”
When the processor reads external data, it must know the data's byte order and convert it into the correct value. From then on, that value is used normally, without any further concern for byte order.
Even when writing data to an external device, there is no need to worry about byte order; just write the value. The external device decides for itself how to order the bytes.
For example, suppose the processor reads a 16-bit integer. If the data is big-endian, it is converted to a value as follows.
x = buf[offset] * 256 + buf[offset+1];
In the code above, buf is the starting address of the whole data block in memory, and offset is the position currently being read. The first byte multiplied by 256, plus the second byte, gives the value of the big-endian data. The same thing can be written with bitwise operators.
x = buf[offset]<<8 | buf[offset+1];
In the code above, the first byte is shifted left by 8 bits (that is, 8 zeros are appended after it) and then ORed with the second byte.
If the data is little-endian, it is converted to a value with the formula below.
x = buf[offset+1] * 256 + buf[offset];
The same is true for 32-bit integers.
/* big-endian */
i = (data[3] << 0) | (data[2] << 8) | (data[1] << 16) | (data[0] << 24);

/* little-endian */
i = (data[0] << 0) | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
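For completeness, the same reads can be done in JavaScript with the standard DataView API, whose getUint16/getUint32 methods take the byte order as their second argument (true for little-endian); a small sketch:

```javascript
// The byte sequence 0x22 0x11 read as a 16-bit integer in both byte orders.
const view = new DataView(new Uint8Array([0x22, 0x11]).buffer);

console.log(view.getUint16(0, false).toString(16)); // "2211" (big-endian)
console.log(view.getUint16(0, true).toString(16));  // "1122" (little-endian)
```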
Appendix: Why can hexadecimal notation be used instead of binary notation?
- Computer hardware works in binary (0s and 1s); hexadecimal, whose base is a power of two, makes it more convenient to write down an instruction or a piece of data. Binary is simply too long to read: the larger the base, the shorter the representation, and hexadecimal is short because every four binary digits can be replaced by one hexadecimal digit (1111 is exactly F).
- So why hexadecimal in particular? Perhaps because 2, 8 and 16 are 2^1, 2^3 and 2^4 respectively, which makes conversion between these bases straightforward.
- Characters are stored in 8-bit units (ASCII itself uses 7 bits; later extensions use all 8), and 8 bits can be written directly as two hexadecimal digits, which is easier to read and store than other bases.
- CPU word sizes followed the same 8-bit unit, growing to 16, 32, 64 bits and so on, so hexadecimal is convenient for writing down and exchanging data, but the computer ultimately still computes in binary.
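A tiny JavaScript illustration of that four-bits-per-hex-digit relationship:

```javascript
console.log((0b1111).toString(16));                // "f"        — four 1-bits are one hex digit
console.log((0xff).toString(2));                   // "11111111" — one byte is two hex digits
console.log(parseInt("11100100", 2).toString(16)); // "e4"       — the first UTF-8 byte of 严
```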
References
- Character encoding notes: ASCII, Unicode and UTF-8
- Understand byte order
- Computer memory address and why use hexadecimal
- ASCII, Unicode, UTF-8 and Base64
- 1 byte = 8 bits; hexadecimal 2211 in binary is 0010 0010 0001 0001, i.e. two bytes. ↩