All information in a computer is ultimately stored as binary strings. The role of a character encoding is to establish a mapping between human-readable characters and binary bits, so that information can be exchanged between people and computers, and between people.

ASCII:

ASCII stands for American Standard Code for Information Interchange.

In the 1960s, the United States developed a set of character codes, which unified the relationship between English characters and binary bits. This is called ASCII code and is still used today.

Coding rules:

ASCII specifies a total of 128 characters. For example, the space character is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 non-printable control symbols) occupy only the last seven bits of a byte; the first bit is always 0.

A byte can represent 256 states, but ASCII uses only 128 of them.
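
The two values quoted above can be checked with a short Python sketch. Python is used here only for illustration, and the helper name `ascii_bits` is made up for this example.

```python
def ascii_bits(ch: str) -> str:
    """Return the ASCII code of ch as an 8-bit binary string."""
    code = ord(ch)  # numeric code of the character
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")  # padded to a full byte; the first bit is always 0


print(ord(" "), ascii_bits(" "))  # 32 00100000
print(ord("A"), ascii_bits("A"))  # 65 01000001
```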

Each country’s own encoding:

Because ASCII specifies only 128 characters and does not cover the characters of other languages, such as Chinese, different countries extended ASCII to create their own encodings.

For example, Chinese character encoding:

The designers unceremoniously discarded the odd symbols above 127 and decided: a byte no greater than 127 keeps its original ASCII meaning, but two consecutive bytes greater than 127 together represent one Chinese character. The first byte (the high byte) ranges from 0xA1 to 0xF7 and the second byte (the low byte) from 0xA1 to 0xFE, which leaves room for about 7,000 simplified Chinese characters. These codes also cover mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation marks and letters that already exist in ASCII were encoded again as two-byte codes. These are the so-called “full-width” characters, while the originals below 128 are called “half-width” characters. People found the scheme good and named it “GB2312”. GB2312 is a Chinese extension of ASCII.
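
As a rough illustration of the byte ranges described above, the following sketch uses Python's built-in `gb2312` codec. The helper name `gb2312_bytes` and the three sample characters are choices made for this example, not part of the standard.

```python
def gb2312_bytes(ch: str) -> bytes:
    """Encode a single character with Python's built-in GB2312 codec."""
    return ch.encode("gb2312")


samples = [
    "A",       # half-width letter: stays a single ASCII byte
    "\uff21",  # full-width letter A: re-encoded as two bytes
    "\u4e25",  # a simplified Chinese character (yan, "strict")
]
for ch in samples:
    encoded = gb2312_bytes(ch)
    print(len(encoded), encoded.hex())
```

For the two-byte results, both bytes should fall inside the 0xA1 to 0xF7 and 0xA1 to 0xFE ranges given in the text.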

Later, this turned out not to be enough:

So the code points that GB2312 had left unused were put to use as well. When that still wasn’t enough, the requirement that the low byte must be greater than 127 was dropped: any first byte greater than 127 marks the start of a Chinese character, whether or not the byte that follows is in the extended range. The resulting extended scheme is called the GBK standard. It includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols. Later, to support the scripts of ethnic minorities, the scheme was extended again with several thousand more characters, and GBK became GB18030.

These encodings are known as DBCS (Double Byte Character Set), which is why it is often said that a Chinese character takes twice as much space as an English character.
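
The relationship between the three standards can be probed with Python's built-in codecs. This is a minimal sketch; the helper `can_encode` and the choice of the traditional character 漢 as a test case are assumptions made for this example.

```python
def can_encode(text: str, codec: str) -> bool:
    """Report whether text is representable in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


ch = "\u6f22"  # the traditional character 漢, which GB2312 does not cover

for codec in ("gb2312", "gbk", "gb18030"):
    print(codec, can_encode(ch, codec))

# One byte for an English letter, two bytes for a Chinese character.
print(len("a".encode("gbk")), len(ch.encode("gbk")))
```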

Unicode character set:

Each country has its own encoding, which hinders international communication. ISO (the International Organization for Standardization) therefore launched the Universal Multiple-Octet Coded Character Set, which is meant to contain every character on earth. It is called UCS for short and is commonly known as Unicode.

Unicode only specifies which binary code each character has; it does not specify how that code is stored. Some characters need only one byte, while others need as many as four. If every character were stored in a fixed three or four bytes, each English letter would have to be preceded by two or three zero bytes. That is a great waste of storage: text files would become several times larger, which is unacceptable.
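
The waste is easy to see in a small sketch. Python's `utf-32-be` codec is used here as a stand-in for "four bytes per character" (the name UTF-32 does not appear in the text above), and `fixed_width_size` is a hypothetical helper.

```python
def fixed_width_size(text: str) -> int:
    """Bytes needed when every character is stored in four bytes."""
    return len(text.encode("utf-32-be"))


text = "hello"
print(text.encode("utf-32-be")[:4].hex())  # 00000068: three zero bytes, then 'h'
print(len(text.encode("ascii")), fixed_width_size(text))  # 5 20
```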

Unicode encoding rules:

The term “Unicode encoding” usually refers to UCS-2, which stores the Unicode code of each character directly in two bytes. (As a consequence, a character whose Unicode code needs more than two bytes cannot be stored this way.)

UTF-8:

The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. UTF (UCS Transformation Format) is a family of encoding rules for the Unicode character set.

How UTF-8 differs from Unicode:

  • Unicode is a “character set”
  • UTF-8 is an “encoding rule”

Character set: assigns a unique ID (a code point) to each character. Encoding rule: the rule for converting code points into byte sequences (encoding and decoding are loosely analogous to encryption and decryption).
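
The distinction can be made concrete with a short sketch: the code point is a single number, while the bytes depend on the encoding rule chosen. The sample character and the helper name `encodings_of` are choices made for this example.

```python
def encodings_of(ch: str, codecs_to_try=("utf-8", "utf-16-be")) -> dict:
    """Map each codec name to the hex bytes it produces for ch."""
    return {codec: ch.encode(codec).hex() for codec in codecs_to_try}


ch = "\u4e25"        # one character ...
print(hex(ord(ch)))  # ... with one code point: 0x4e25

for codec, hex_bytes in encodings_of(ch).items():
    print(codec, hex_bytes)  # utf-8 e4b8a5, then utf-16-be 4e25
```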

UTF-8 encoding rules:

UTF-8 is a variable-length encoding whose smallest code unit is one byte. The leading bits of each byte (between one and five of them) are a marker that describes the byte’s role, and the remaining bits carry the actual code value.

  1. For a single-byte character, the first bit is 0 and the remaining 7 bits hold the character’s Unicode code. For English letters, UTF-8 is therefore identical to ASCII.
  2. For an n-byte character (n > 1), the first n bits of the first byte are 1, bit n+1 is 0, and the first two bits of every following byte are 10. All remaining bits hold the character’s Unicode code.

| Unicode range (hexadecimal) | UTF-8 encoding (binary) |
| --- | --- |
| 0000 0000-0000 007F | 0xxxxxxx |
| 0000 0080-0000 07FF | 110xxxxx 10xxxxxx |
| 0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |

To sum up: if the first bit of a byte is 0, that byte is a complete character on its own; if the first bit is 1, the number of consecutive leading 1s tells how many bytes the current character occupies.
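
This rule can be sketched as a small function that inspects only the first byte of a character. The name `utf8_sequence_length` is illustrative, and the sketch does not validate the bytes that follow.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judged from its first byte."""
    if (first_byte & 0b10000000) == 0b00000000:  # 0xxxxxxx
        return 1
    if (first_byte & 0b11100000) == 0b11000000:  # 110xxxxx
        return 2
    if (first_byte & 0b11110000) == 0b11100000:  # 1110xxxx
        return 3
    if (first_byte & 0b11111000) == 0b11110000:  # 11110xxx
        return 4
    # 10xxxxxx is a following byte, not the start of a character.
    raise ValueError("not a valid first byte")


print(utf8_sequence_length("A".encode("utf-8")[0]))       # 1
print(utf8_sequence_length("\u4e25".encode("utf-8")[0]))  # 3
```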

The Unicode code of the Chinese character 严 (yán) is 4E25 (binary 100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 takes three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary bit of 严, the x positions in the format are filled in from back to front, and any leftover x positions are filled with 0. The result is that the UTF-8 encoding of 严 is 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
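
The fill-in procedure can be written out by hand and compared with Python's built-in encoder. This is a minimal sketch: `utf8_encode` is a name chosen for this example, and it skips checks a production encoder would make (for instance, rejecting surrogate code points).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the range table."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:  # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")


encoded = utf8_encode(0x4E25)
print(" ".join(format(b, "08b") for b in encoded))  # 11100100 10111000 10100101
print(encoded.hex().upper())                        # E4B8A5
```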

Little Endian and Big Endian:

As mentioned earlier, Unicode text is commonly stored as UCS-2, which comes in two byte orders: Little Endian and Big Endian.

Take the Chinese character 严 (yán) again. Its Unicode code is 4E25, which is stored in two bytes: one byte is 4E and the other is 25. When stored:

  • 4E first and 25 second is Big Endian;
  • 25 first and 4E second is Little Endian.

How does a computer know which byte order a file uses? The Unicode specification defines a character that is placed at the very start of the file to indicate the byte order. This character is called ZERO WIDTH NO-BREAK SPACE and has the code FEFF. That is exactly two bytes, and FF is one greater than FE. If the first two bytes of a text file are FE FF, the file is big-endian; if they are FF FE, the file is little-endian.
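
A byte-order check along these lines might look like the sketch below. The function name `detect_byte_order` is hypothetical, and the sketch only considers the two-byte marks discussed here. Python's `utf-16-be` and `utf-16-le` codecs stand in for the two UCS-2 byte orders, which give the same bytes for this character.

```python
import codecs


def detect_byte_order(data: bytes) -> str:
    """Guess the byte order of two-byte Unicode text from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little-endian"
    return "unknown"


ch = "\u4e25"
big = codecs.BOM_UTF16_BE + ch.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")

print(big.hex(), detect_byte_order(big))        # feff4e25 big-endian
print(little.hex(), detect_byte_order(little))  # fffe254e little-endian
```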

Base64:

Base64 is a method of representing binary data using 64 printable characters. These are the letters A-Z and a-z and the digits 0-9, 62 characters in total, plus two printable symbols that vary from system to system.

In the MIME format, the other two symbols are plus (+) and slash (/), and the equals sign (=) is used as a padding suffix.

Conversion procedure:

  1. Take the input three bytes at a time, 24 binary bits in total.
  2. Divide the 24 bits into four groups of six bits each.
  3. Add two 0 bits to the front of each group, expanding it to 32 bits, that is, four bytes.
  4. Look up the value of each expanded byte in the table below to get the corresponding symbol; this is the Base64 encoding.

| Index | Character | Index | Character | Index | Character | Index | Character |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | A | 17 | R | 34 | i | 51 | z |
| 1 | B | 18 | S | 35 | j | 52 | 0 |
| 2 | C | 19 | T | 36 | k | 53 | 1 |
| 3 | D | 20 | U | 37 | l | 54 | 2 |
| 4 | E | 21 | V | 38 | m | 55 | 3 |
| 5 | F | 22 | W | 39 | n | 56 | 4 |
| 6 | G | 23 | X | 40 | o | 57 | 5 |
| 7 | H | 24 | Y | 41 | p | 58 | 6 |
| 8 | I | 25 | Z | 42 | q | 59 | 7 |
| 9 | J | 26 | a | 43 | r | 60 | 8 |
| 10 | K | 27 | b | 44 | s | 61 | 9 |
| 11 | L | 28 | c | 45 | t | 62 | + |
| 12 | M | 29 | d | 46 | u | 63 | / |
| 13 | N | 30 | e | 47 | v | | |
| 14 | O | 31 | f | 48 | w | | |
| 15 | P | 32 | g | 49 | x | | |
| 16 | Q | 33 | h | 50 | y | | |
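
The four steps can be sketched for a single three-byte group as follows. The names `ALPHABET` and `encode_group` are made up for this example, and the result is compared with Python's standard `base64` module.

```python
import base64

# The 64 symbols of the table, in index order (MIME variant).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_group(three_bytes: bytes) -> str:
    """Encode exactly three bytes as four Base64 symbols."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly three bytes")
    bits = int.from_bytes(three_bytes, "big")  # step 1: 24 bits
    # Steps 2 and 3: four 6-bit groups, each held in a byte with two leading zeros.
    indices = [(bits >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    # Step 4: look each index up in the table.
    return "".join(ALPHABET[i] for i in indices)


print(encode_group(b"Man"))                # TWFu
print(base64.b64encode(b"Man").decode())   # TWFu
```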

Each Base64 character could be represented with 6 bits, but with two zeros added in front it takes 8 bits. As a result, Base64-encoded text is about one third larger than the original.

Handling fewer than three bytes:

  1. With two bytes left over, there are 16 bits in total. They are split into groups of six bits as above, which leaves the last group with only four bits, so two 0s are appended to complete it. After the two leading 0s are added, “Ma” becomes the three bytes 00010011, 00010110 and 00000100, whose Base64 values are T, W and E. One “=” is then appended, so the Base64 encoding of “Ma” is “TWE=”.

  2. With one byte left over, there are 8 bits in total. They are split into groups of six bits as above, which leaves the second group with only two bits, so four 0s are appended to complete it. After the two leading 0s are added, “M” becomes the two bytes 00010011 and 00010000, whose Base64 values are T and Q. Two “=” are then appended, so the Base64 encoding of “M” is “TQ==”.
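
Putting the grouping and padding rules together gives a complete encoder. This is a minimal sketch under the MIME alphabet, with `base64_encode` as an illustrative name; in practice the standard `base64` module would be used, and it serves here as the reference.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def base64_encode(data: bytes) -> str:
    """Encode bytes as Base64, padding the final group with '='."""
    pieces = []
    for start in range(0, len(data), 3):
        chunk = data[start:start + 3]
        missing = 3 - len(chunk)  # 0, 1 or 2 bytes short of a full group
        # Fill the missing bytes with zeros so the group is always 24 bits.
        bits = int.from_bytes(chunk + b"\x00" * missing, "big")
        symbols = [ALPHABET[(bits >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Symbols built purely from filler are replaced by '='.
        pieces.append("".join(symbols[:4 - missing]) + "=" * missing)
    return "".join(pieces)


print(base64_encode(b"Ma"))  # TWE=
print(base64_encode(b"M"))   # TQ==
```

The output length is always four characters for every started group of three bytes, which is where the size increase of about one third comes from.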

Notes:

  • While most encodings turn strings into binary, Base64 turns binary into a string.
  • Base64 is mainly used to transmit, store and represent binary data. It is not encryption; it only keeps the plaintext from being read at a glance.
  • Chinese text can be encoded in several ways (UTF-8, GB2312, GBK and so on), and the same text gives a different Base64 result depending on which encoding is used.
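
The last point can be seen with a short sketch that encodes the same character under different character encodings before applying Base64. The helper name `b64_of` and the sample character are choices made for this example.

```python
import base64


def b64_of(text: str, codec: str) -> str:
    """Encode text to bytes with the given codec, then to Base64."""
    return base64.b64encode(text.encode(codec)).decode("ascii")


text = "\u4e25"  # a single Chinese character
for codec in ("utf-8", "gb2312", "gbk"):
    print(codec, b64_of(text, codec))
```

UTF-8 yields three bytes for this character and GB2312 or GBK yield two, so the Base64 strings differ. GBK contains all of GB2312, so those two agree with each other.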

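To close, a well-known programmers’ rhyme about garbled text, in which each run of Chinese characters is a classic product of an encoding mix-up:

Two 锟斤拷 in hand,

烫烫烫 cried aloud,

a thousand 屯屯屯 underfoot,

smiling at a world of 锘锘锘.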
