Character encoding has always felt a bit magical to me. I kept wanting to understand it properly but never found the time to dig in, which was a little frustrating 😣. So I used some free time to catch up. Luckily, a Google search turned up Ruan Yifeng's blog post on the relationship between ASCII, Unicode, and UTF-8. It was inspiring and, as you would expect from such an expert, very thorough ~

1. ASCII: the past life

First, a byte consists of 8 binary bits, so one byte can represent 256 different values, from 00000000 to 11111111.

Later, the United States defined ASCII, a set of 128 characters. Each one uses only the lower seven bits of a byte; the highest bit is always 0.

Later still, various schemes extended this to 256 characters by using the eighth bit as well. Note, however, that the additional 128 codes are not part of ASCII; they can only be called extensions.

As more languages needed to be supported, one byte was no longer enough; Chinese alone has a very large number of characters. Unicode, commonly known in Chinese as the "universal code", came into being to meet this need: every character and symbol is assigned its own Unicode code value, achieving a "unified encoding".
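As a small illustration of the "highest bit is 0" rule, here is a minimal Python sketch. The helper name is_ascii_byte is made up for this example and is not taken from any library.

```python
# ASCII covers the values 0..127, so every ASCII character fits in 7 bits.
def is_ascii_byte(value: int) -> bool:
    """Return True when the highest of the 8 bits is 0 (value < 128)."""
    return (value & 0b10000000) == 0

print(format(ord("A"), "08b"))   # 01000001 (highest bit is 0)
print(is_ascii_byte(ord("A")))   # True
print(is_ascii_byte(0xE9))       # False: 0xE9 lies in the "extended" upper half
```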

2. Unicode: the present life

Unicode contains the characters and symbols of many countries, a true code of all nations. Sites for looking up Unicode values include unicode.org, which hosts the symbol correspondence tables, and the Chinese character correspondence table for looking up Chinese characters.

However, Unicode only defines the code value of each character; it does not define how that value is stored, that is, which byte layout represents it. Just as the same number can be written in base 2, 8, 10, or 16 without changing its meaning, the same Unicode value can be stored in different ways. A common encoding scheme is therefore needed so that characters can be encoded and decoded consistently.

For example, the Unicode value of 春 (the "spring" in Spring Festival) is \u6625, which in binary is 01100110 00100101. That fits in two bytes, so it could be stored in two bytes, or in three or more with the unused high bits padded with 0. But then a letter that needs only one byte becomes wasteful, because it too must be padded with zeros to two or more bytes. So what is the general approach?

UTF-8 is one of the most common encodings on the Internet. It is worth noting that UTF-8 is just one implementation of Unicode (it defines how to store Unicode values), while Unicode itself only defines character numbers (that is, an ID for each character).
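The split between "ID number" and "storage format" can be seen directly in Python, where ord gives the Unicode value and str.encode produces the stored bytes. This is only a small sketch using the 春 example from the text.

```python
ch = "春"

# Unicode side: the character's number (its ID).
code_point = ord(ch)
print(hex(code_point))           # 0x6625

# UTF-8 side: how that number is actually stored as bytes.
utf8_bytes = ch.encode("utf-8")
print(utf8_bytes.hex())          # e698a5

# Decoding the bytes gives back the same character.
print(utf8_bytes.decode("utf-8") == ch)   # True
```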

3. UTF-8 today

UTF-8 is today's main character. It is currently the most widely used encoding. I may not be able to finish writing about it today.


UTF-8, UTF-16, UTF-32, and so on are all ways of storing Unicode values. What is the difference?

  • UTF-8: a variable-length scheme that uses 1 to 4 bytes per character (the original design allowed up to 6 bytes, but the current standard stops at 4). ASCII characters take a single byte, so storage is efficient for mostly-ASCII text.
  • UTF-16: a scheme between UTF-8 and UTF-32 that uses two or four bytes per character.
  • UTF-32: a fixed 4-byte scheme. The mapping is one-to-one and simple, but storage is inefficient.
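To make the size difference concrete, here is a rough Python sketch that counts the bytes each scheme needs for a few characters. The helper encoded_lengths is a made-up name; the "-be" codec variants are used so that Python does not prepend a byte order mark to the result.

```python
def encoded_lengths(text: str) -> dict:
    """Return the number of bytes `text` needs in each encoding."""
    codecs = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: len(text.encode(name)) for name in codecs}

print(encoded_lengths("A"))    # {'utf-8': 1, 'utf-16-be': 2, 'utf-32-be': 4}
print(encoded_lengths("春"))   # {'utf-8': 3, 'utf-16-be': 2, 'utf-32-be': 4}
print(encoded_lengths("😣"))   # {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 4}
```

Note that UTF-8 is not always the smallest: 春 takes three bytes in UTF-8 but only two in UTF-16.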

3.1 UTF-8 encoding rules

This is a very important encoding, so make sure you work through it by hand. The rules depend on how many bytes a character occupies:

  • If a character occupies one byte, the highest bit is 0 and the remaining seven bits hold the Unicode value unchanged. For example, the Unicode value of the letter A is \u0041 (decimal 65), so its UTF-8 value is 01000001.

  • If a character occupies n bytes (n > 1; see the table 👇 for how n is determined), the first n bits of the first byte are 1 and bit n+1 is 0. Every remaining byte starts with 10. The Unicode binary value is then filled into the free positions (the x's) from back to front, starting at the last bit of the last byte; any free positions left over at the front are filled with 0.

Unicode symbol range (hexadecimal) | UTF-8 encoding (binary)
-----------------------------------+------------------------------------------------
0000 0000-0000 007F | 0xxxxxxx                               # 1 byte
0000 0080-0000 07FF | 110xxxxx 10xxxxxx                      # 2 bytes
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx             # 3 bytes
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    # 4 bytes
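The range table can be read as a simple lookup from Unicode value to byte count. The following Python sketch expresses it that way; utf8_length is a hypothetical helper name, not a standard library function.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a Unicode value, per the range table."""
    if code_point < 0:
        raise ValueError("Unicode values cannot be negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("value is outside the Unicode range")

print(utf8_length(ord("A")))    # 1
print(utf8_length(0x6625))      # 3
```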

Here's an example, using 春, the "spring" in Spring Festival:

Character 春            | Value
Unicode (hexadecimal)  | \u6625
Unicode (binary)       | 01100110 00100101
UTF-8 (hexadecimal)    | e6–98–a5
UTF-8 (binary)         | 11100110–10011000–10100101

(Note: the "–" and the spaces in the table are added for readability only; they are not part of the actual values.)

The Unicode value is \u6625 (hexadecimal), which falls in the range 0000 0800-0000 FFFF, so the character takes 3 bytes.

According to the corresponding pattern, the first byte is 1110xxxx, the second is 10xxxxxx, and the third is 10xxxxxx. The binary digits of the Unicode value of 春 are then filled into the x positions from back to front, and any positions left over are filled with 0. The result is 11100110 10011000 10100101.
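The fill-from-the-back procedure above can be sketched by hand in Python. This is an illustrative implementation only: encode_utf8 is a made-up name, it does not reject the surrogate range (D800 to DFFF) the way a strict encoder would, and in real code you would simply call str.encode("utf-8").

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode value as UTF-8 by filling bits from back to front."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("value is outside the Unicode range")
    if code_point <= 0x7F:
        return bytes([code_point])        # 0xxxxxxx: one byte, value unchanged

    # Pick the byte count and the leading-byte prefix from the range table.
    if code_point <= 0x7FF:
        n, lead = 2, 0b11000000           # 110xxxxx
    elif code_point <= 0xFFFF:
        n, lead = 3, 0b11100000           # 1110xxxx
    else:
        n, lead = 4, 0b11110000           # 11110xxx

    # Fill the trailing 10xxxxxx bytes starting from the lowest 6 bits.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # Whatever bits remain go into the x positions of the first byte.
    first = lead | code_point
    return bytes([first] + tail[::-1])

print(encode_utf8(0x6625).hex())                      # e698a5
print(encode_utf8(0x6625) == "春".encode("utf-8"))    # True
```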

4. Encoding example

In an ordinary Baidu search, Chinese characters in the query are encoded as UTF-8 in the URL. For example, searching for the character 春 gives:

www.baidu.com/s?wd=%E6%98…

You can see that the UTF-8 encoding of 春 is indeed E698A5.
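The same percent-encoded form can be reproduced with Python's standard urllib.parse module, which uses UTF-8 by default. This sketch only shows the encoding of the character itself; it does not rebuild the full Baidu URL.

```python
from urllib.parse import quote, unquote

# Each UTF-8 byte of the character becomes one %XX group.
encoded = quote("春")
print(encoded)             # %E6%98%A5

# Decoding reverses the process.
print(unquote(encoded))    # 春
```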

Conclusion:

Finally, I understand the relationship between ASCII, Unicode, and UTF-8, and I know where to start when I run into encoding problems. If anything here is unclear, go and read Ruan Yifeng's article; the link is below 👇. The basics still need to be understood and tied together, which helps a lot in building knowledge and ability, so keep working hard toward the level of the experts ~

Reference article:

www.ruanyifeng.com/blog/2007/1…

Blog.csdn.net/guxiaonuan/…

Chinese Unicode website: www.chi2ko.com/tool/CJK.ht…

Online encoding tool: tool.oschina.net/encode