Computers can only process numbers, so text must first be converted to numbers before it can be processed.

ASCII

The earliest computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255 (11111111 in binary). ASCII itself is a 7-bit code: the values 0-127 identify upper- and lower-case English letters, digits, and symbols. This code table is called ASCII.
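
As a small illustration of the character-to-number mapping described above, the following sketch uses Python's built-in ord and chr. The sample strings and the variable names codes and text are chosen only for this example.

```python
# Print each character with its numeric code in decimal and 8-bit binary.
for ch in "Az0!":
    print(ch, ord(ch), format(ord(ch), "08b"))

# Convert text to numbers and back again.
codes = [ord(ch) for ch in "Hi"]
text = "".join(chr(c) for c in codes)

# Every character used here fits in the 7-bit ASCII range.
assert all(c < 128 for c in codes)
```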

Unicode

To represent Chinese characters (there are roughly 100,000 of them), one byte is clearly not enough; at least two bytes are needed. The encoding also must not conflict with the ASCII table, so China defined the GB2312 standard for encoding Chinese. Other languages faced the same problem, and Unicode was created to unify the encoding of all characters.

Early Unicode encodings (UCS-2) used two bytes to represent every character, so English characters grew from a single byte to two bytes, with the high byte filled with zeros.
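
The zero-filled high byte can be observed directly. This is only a sketch: Python has no UCS-2 codec, so the UTF-16 big-endian codec stands in for it here, which gives the same bytes for characters up to U+FFFF. The variable names are chosen for this example.

```python
one_byte = "A".encode("ascii")       # single byte 0x41
two_bytes = "A".encode("utf-16-be")  # two bytes: 0x00 0x41

print(one_byte.hex())   # 41
print(two_bytes.hex())  # 0041
```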

Currently, the Unicode code space runs from 0x0000 to 0x10FFFF and is divided into 17 groups, each called a plane. Each plane has 65,536 code points, for a total of 1,114,112. However, only a few planes are currently in use.
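
The plane arithmetic can be checked with a few lines of Python. The helper name plane_of and the constant names are hypothetical, introduced only for this sketch.

```python
PLANE_SIZE = 0x10000   # 65,536 code points per plane
PLANE_COUNT = 17

def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # same as code_point // PLANE_SIZE

total = PLANE_COUNT * PLANE_SIZE
print(total)                 # 1114112
print(plane_of(ord("A")))    # 0
print(plane_of(0x10FFFF))    # 16
```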

UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode.

How UTF-8 differs from Unicode

  • Unicode is a character set
  • UTF-8 is an encoding rule

Character set: assigns a unique ID (a code point) to each character.
Encoding rule: defines how code points are converted to and from sequences of bytes (encoding and decoding). The encode/decode pair is similar in shape to encrypt/decrypt, although no secrecy is involved.
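
The difference shows up when one code point is run through more than one encoding rule. A minimal sketch, using the character U+4E2D (中) as an arbitrary sample and Python's built-in codecs:

```python
ch = "\u4e2d"  # sample character, chosen only for this example

# Character set: one character, one code point.
code_point = ord(ch)
print(hex(code_point))  # 0x4e2d

# Encoding rules: the same code point becomes different byte sequences.
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")
print(as_utf8.hex())    # e4b8ad
print(as_utf16.hex())   # 4e2d

# Decoding with the matching rule recovers the original character.
assert as_utf8.decode("utf-8") == ch
assert as_utf16.decode("utf-16-be") == ch
```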

UTF-8 encoding rules

UTF-8 is a variable-length encoding whose smallest code unit is one byte. The leading bits of each byte (between 1 and 5 of them) form a descriptive prefix, and the bits after the prefix carry the actual code point value.

  1. A single-byte character occupies one byte. The first bit is 0, and the remaining 7 bits hold the Unicode code point. So UTF-8 encoding is identical to ASCII for English letters.
  2. For an n-byte character (n > 1), the first n bits of the first byte are 1, bit n+1 is 0, and the first two bits of each following byte are 10. The remaining bits hold the Unicode code point of the character.

Unicode range (hexadecimal)    UTF-8 encoding (binary)
0000 0000-0000 007F            0xxxxxxx
0000 0080-0000 07FF            110xxxxx 10xxxxxx
0000 0800-0000 FFFF            1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF            11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

To sum up: if a byte begins with 0, that byte is a complete single-byte character. If the first byte of a character begins with 1, the number of consecutive leading 1s tells how many bytes the character occupies. A byte beginning with 10 is a continuation byte.
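
The summary rule can be sketched as a function that inspects the first byte of a character. The name utf8_sequence_length is hypothetical, and the sketch only classifies the lead byte; it does not validate the bytes that follow.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, given its first byte."""
    if first_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte.
    raise ValueError("not a valid UTF-8 lead byte")

print(utf8_sequence_length(0x41))  # 1
print(utf8_sequence_length(0xE7))  # 3
```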

For example:

The Unicode code point of the character 秃 (bald) is 79C3, or 0111 1001 1100 0011 in binary. According to the table above it falls in the range 0000 0800-0000 FFFF, so UTF-8 needs three bytes, in the form 1110xxxx 10xxxxxx 10xxxxxx. Fill the x positions with the bits of the code point, working from the last bit forward, and pad any leftover x positions with 0. The resulting UTF-8 encoding is 11100111 10100111 10000011, which is E7A783 in hexadecimal.
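
The bit-filling procedure for the 79C3 example can be written out by hand and compared with Python's built-in encoder. This is a minimal sketch: the function name utf8_encode is hypothetical, and it skips checks a real encoder performs (for instance, rejecting surrogate code points).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point following the range table (simplified sketch)."""
    if code_point < 0x80:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                 # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("outside the Unicode code space")

encoded = utf8_encode(0x79C3)
print(encoded.hex())                              # e7a783
print(" ".join(format(b, "08b") for b in encoded))  # 11100111 10100111 10000011

# Cross-check against the built-in codec.
assert encoded == chr(0x79C3).encode("utf-8")
```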

Little endian & Big endian

In the example above, the code point 79C3 needs two bytes of storage, one for 79 and one for C3. Storing 79 first and C3 second is big endian; storing C3 first and 79 second is little endian.
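
Both byte orders can be produced with int.to_bytes, and the UTF-16 codecs show the same ordering for this code point. The variable names are chosen only for this sketch.

```python
code_point = 0x79C3

big = code_point.to_bytes(2, "big")        # 79 first, then C3
little = code_point.to_bytes(2, "little")  # C3 first, then 79
print(big.hex())     # 79c3
print(little.hex())  # c379

# The UTF-16 codecs store this code point in the same two orders.
assert chr(code_point).encode("utf-16-be") == big
assert chr(code_point).encode("utf-16-le") == little
```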

Base64

Base64 is a way of representing binary data using 64 printable characters. These are the letters A-Z and a-z and the digits 0-9, 62 characters in total, plus two printable symbols that vary from system to system.

In the MIME format, the other two symbols are plus (+) and slash (/), and the equals sign (=) is used as a padding suffix.

Encoding procedure

  1. Take the input three bytes at a time, 24 bits in total.
  2. Divide the 24 bits into four groups of six bits each.
  3. Prepend two 0 bits to each group, expanding it to 8 bits; the four groups then occupy 32 bits, or four bytes.
  4. Look up the value of each expanded byte in the Base64 index table (0-25 map to A-Z, 26-51 to a-z, 52-61 to 0-9, 62 to +, and 63 to /) to get the corresponding Base64 character.

To sum up, it can be concluded that:

  • Each Base64 character carries only 6 bits of information but is stored as 8 bits (two zeros are added in front), so Base64-encoded text is about one third larger than the original data.
  • Why use groups of 3 bytes? The least common multiple of 6 and 8 is 24, so three bytes hold exactly 24 bits, which divide evenly into four groups of six bits.
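
The size relationship follows from every group of up to three input bytes becoming four output characters. A short sketch, with the hypothetical helper name base64_length, checked against Python's base64 module:

```python
import base64

def base64_length(n_bytes: int) -> int:
    """Length of padded Base64 output for n_bytes of input."""
    groups = (n_bytes + 2) // 3  # number of 3-byte groups, rounded up
    return 4 * groups

print(base64_length(300))  # 400

# Cross-check against the standard library for a range of sizes.
for n in range(50):
    assert base64_length(n) == len(base64.b64encode(bytes(n)))
```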

Examples are as follows:

  1. Take “Man” as the input. The ASCII values of “M”, “a”, and “n” are 77, 97, and 110 respectively. The corresponding binary values are 01001101, 01100001, and 01101110. Concatenate them into the 24-bit binary string 010011010110000101101110.
  2. Divide the 24-bit binary string into four groups: 010011, 010110, 000101, 101110.
  3. Add 00 to the front of each group to expand to 32 bits, namely 00010011, 00010110, 00000101, 00101110.
  4. These values are 19, 22, 5, and 46, which map to the Base64 characters T, W, F, and u. The Base64 encoding of “Man” is therefore “TWFu”.
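
The four steps for “Man” can be reproduced with integer shifts and masks. This is a sketch of the arithmetic only; the names ALPHABET, bits, indices, and encoded are chosen for this example, and the alphabet assumed is the standard MIME one.

```python
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, then + and /.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

data = b"Man"

# Steps 1-2: treat the three bytes as one 24-bit number, then cut it
# into four 6-bit groups from the most significant end.
bits = int.from_bytes(data, "big")
indices = [(bits >> shift) & 0x3F for shift in (18, 12, 6, 0)]
print(format(bits, "024b"))  # 010011010110000101101110
print(indices)               # [19, 22, 5, 46]

# Steps 3-4: each 6-bit value is an index into the alphabet.
encoded = "".join(ALPHABET[i] for i in indices)
print(encoded)               # TWFu
```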

Handling fewer than three bytes

  1. Two bytes remaining: two bytes hold 16 bits, which are grouped six bits at a time as above. The third group is 2 bits short and is filled with 0s. For example, “Ma” becomes three groups: 00010011, 00010110, and 00000100. The corresponding Base64 characters are T, W, and E, and one “=” is appended, so the Base64 encoding of “Ma” is “TWE=”.
  2. One byte remaining: one byte holds 8 bits, grouped as above; the second group is 4 bits short and is filled with 0s. For example, “M” becomes 00010011 and 00010000, whose Base64 characters are T and Q. Two “=” are appended, so the Base64 encoding of “M” is “TQ==”.
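
The padding rules can be folded into a small encoder and compared with Python's base64 module. This is a teaching sketch, not a replacement for the library; the function name b64encode_sketch is hypothetical and the standard MIME alphabet is assumed.

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def b64encode_sketch(data: bytes) -> str:
    """Encode bytes as padded Base64, three input bytes at a time."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)  # 0, 1, or 2 missing bytes in the last group
        # Fill the missing bytes with zero bits so the group is 24 bits wide.
        bits = int.from_bytes(chunk + b"\x00" * pad, "big")
        chars = [ALPHABET[(bits >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Groups made up entirely of filler bits are written as "=".
        chars = chars[:4 - pad] + ["="] * pad
        out.extend(chars)
    return "".join(out)

print(b64encode_sketch(b"Ma"))  # TWE=
print(b64encode_sketch(b"M"))   # TQ==

# Cross-check against the standard library.
for sample in (b"", b"M", b"Ma", b"Man", b"Many hands"):
    assert b64encode_sketch(sample) == base64.b64encode(sample).decode("ascii")
```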

Notes

  • Most character encodings convert strings to binary data, whereas Base64 converts binary data to a string.
  • Base64 is mainly used to transmit, store, and represent binary data as text. It is not encryption; it merely keeps the plaintext from being read at a glance.
  • Chinese text can be encoded in several ways (UTF-8, GB2312, GBK, and so on), and each encoding yields a different Base64 result for the same text.
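
The last point can be seen by encoding the same character with two different character encodings before applying Base64. A minimal sketch, using U+4E2D (中) as an arbitrary sample and the utf-8 and gbk codecs that ship with Python:

```python
import base64

text = "\u4e2d"  # sample character, chosen only for this example

# Base64 works on bytes, so the text must be encoded to bytes first.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

b64_utf8 = base64.b64encode(utf8_bytes).decode("ascii")
b64_gbk = base64.b64encode(gbk_bytes).decode("ascii")
print(b64_utf8)  # 5Lit
print(b64_gbk)   # 1tA=

# Decoding only recovers the text if the matching character encoding is used.
assert base64.b64decode(b64_utf8).decode("utf-8") == text
assert base64.b64decode(b64_gbk).decode("gbk") == text
```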