encoding
First of all, what is an encoding?
Computers only understand zeros and ones, so representing characters requires a uniform set of rules mapping bit patterns to characters, such as 0100 0001 -> A. Such a set of rules is called an encoding.
As computers spread, the number of characters that needed to be represented kept growing, and so did the number of encodings.
ASCII
The most basic encoding, defined in the United States, uses 1 byte (8 bits) to represent all the characters Americans needed.
Since English has only 26 letters, 256 (2^8) possible values are more than enough to cover them all.
In fact, ASCII uses only the last seven bits (the first bit is always 0, giving the form 0xxx xxxx) and defines 128 characters. Using A as an example:
A -> 65 -> 0100 0001
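A quick way to verify this mapping, assuming a Python interpreter is at hand, is to use the built-in `ord` and `format`:

```python
# Look up the ASCII code of 'A' and print it in 8-bit binary form.
ch = "A"
code = ord(ch)                    # 65
print(code, format(code, "08b"))  # 65 01000001
```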
Unicode
As countries began defining encodings for their own characters, two problems with single-byte encodings became clear:
- A language may need far more than 256 characters
- The same value can mean different things in different encodings. For example, 65 means A in the United States but may map to a completely different character elsewhere, which makes exchanging text between systems error-prone and forces constant conversions
This is where Unicode comes in. Unicode is one enormous dictionary (addressing the first problem) in which every character has a unique value (addressing the second).
But Unicode is only a dictionary; it does not fix how that value is stored in bytes:
A -> 65 -> ?
Only the mapping from each character to a number (its code point) is specified; the byte representation is left to a concrete encoding.
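For example (a small Python sketch; `str.encode` exposes several common Unicode encodings), the same code point 65 is laid out differently in bytes depending on which encoding is chosen:

```python
# Unicode only assigns the number; the byte layout depends on the encoding.
ch = "A"
print(ord(ch))                 # 65, i.e. code point U+0041
print(ch.encode("utf-8"))      # b'A'                 (1 byte)
print(ch.encode("utf-16-le"))  # b'A\x00'             (2 bytes)
print(ch.encode("utf-32-le"))  # b'A\x00\x00\x00'     (4 bytes)
```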
UTF-8
- UTF-8 is one implementation of Unicode
- Compatible with existing ASCII
- Each character takes 1 to 4 bytes
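A short Python sketch (the sample characters are arbitrary) shows both the ASCII compatibility and the variable length:

```python
# ASCII characters keep their single ASCII byte; other characters use 2 to 4 bytes.
for ch in ("A", "中", "😀"):
    data = ch.encode("utf-8")
    print(ch, len(data), "byte(s):", data.hex(" "))
```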
Encoding rules:
- When a character takes 1 byte, the first bit of the byte is 0 and the remaining 7 bits hold the character's Unicode code point. This is exactly the ASCII scheme, so UTF-8 is fully compatible with ASCII.
- When a character takes N bytes (N > 1), the first N bits of the first byte are 1, the (N+1)th bit is 0, and the first two bits of each of the remaining N-1 bytes are 10. All the remaining bits together hold the character's Unicode code point.
Rule 2 is a bit more complicated, but it’s easier to understand if you look at the table:
| Unicode code point range (hex) | UTF-8 binary |
| --- | --- |
| 0000 0000 ~ 0000 007F | 0xxxxxxx |
| 0000 0080 ~ 0000 07FF | 110xxxxx 10xxxxxx |
| 0000 0800 ~ 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 0001 0000 ~ 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
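To tie the table to something runnable, here is a minimal sketch (assuming Python; the helper name `utf8_encode` and the sample characters are just illustrative) that encodes a single code point by hand and checks the result against Python's built-in UTF-8 codec:

```python
# Hand-rolled UTF-8 encoding of one code point, following the table above.
def utf8_encode(cp: int) -> bytes:
    if cp <= 0x7F:                        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

for ch in ("A", "é", "中", "😀"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with the built-in codec
    print(ch, hex(ord(ch)), encoded.hex(" "))
```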
conclusion
- ASCII: the basic encoding, 1 byte per character, actually defines only 128 characters
- Unicode: a mapping that assigns a unique code point to every character but does not specify how it is stored
- UTF-8: a concrete implementation of Unicode, 1 to 4 bytes per character, compatible with ASCII