Background
When you have just graduated, the problem you fear most is probably garbled text, right? At least it was for me. Later it became clear that garbled text is an encoding problem, and to avoid it everyone settled on UTF-8 and gradually forgot about the issue. When junior colleagues asked us about it, we simply told them to switch to UTF-8.
But that is a kind of escape. Character encoding has puzzled me for many years and, to be honest, I never really understood it. Colleagues used to ask each other, “How many bytes does a Chinese character take?” Have you come across that question? Do you have an answer?
A few useful lookup tables:
- ASCII: tool.oschina.net/commons?typ…
- GB2312 simplified Chinese character code table: tools.jb51.net/table/gb231…
- Unicode: tool.chinaz.com/Tools/Unico…
Common encodings
ASCII
ASCII is one of the earliest and most widely used single-byte encoding systems and is equivalent to the international standard ISO/IEC 646. In it, an English letter, whether upper or lower case, occupies one byte.
Aside: a byte is a group of contiguous binary digits, usually 8 bits, such as 00001111, which is 15 in decimal. Read as a signed 8-bit value (as Java's byte type does), its range is:
Minimum value: -128
Maximum value: 127
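As a quick check of the numbers above, here is a minimal Java sketch. The class name `ByteRangeSketch` and the helper `parseBits` are made up for illustration; `Byte.MIN_VALUE`, `Byte.MAX_VALUE`, and `Byte.toUnsignedInt` are standard JDK members.

```java
class ByteRangeSketch {
    // Parses a bit string such as "00001111" as an unsigned binary number.
    static int parseBits(String bits) {
        return Integer.parseInt(bits, 2);
    }

    public static void main(String[] args) {
        System.out.println(parseBits("00001111"));  // 15
        System.out.println(Byte.MIN_VALUE);         // -128
        System.out.println(Byte.MAX_VALUE);         // 127

        // The same 8 bits read as a signed and as an unsigned value.
        byte allOnes = (byte) 0b1111_1111;
        System.out.println(allOnes);                     // -1
        System.out.println(Byte.toUnsignedInt(allOnes)); // 255
    }
}
```

The -128 to 127 range is a property of reading the byte as signed; the same 8 bits read as unsigned cover 0 to 255.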
Standard ASCII, also known as basic ASCII, uses seven binary digits (the remaining high bit is zero) to represent all upper- and lower-case letters, the digits 0 through 9, punctuation marks, and the special control characters used in American English. Among them:
- 1. Codes 0 to 31 and 127 (33 in total) are control characters or special characters for communication; the rest are printable. Control characters include LF (line feed), CR (carriage return), FF (form feed), DEL (delete), BS (backspace), and BEL (bell). Communication characters include SOH (start of heading), EOT (end of transmission), and ACK (acknowledge). Codes 8, 9, 10, and 13 are backspace, tab, line feed, and carriage return, respectively. They have no graphical form of their own but affect how text is displayed, depending on the application.
- 2. Codes 32 to 126 (95 in total) are printable characters. 32 is the space, and 48 to 57 are the digits 0 to 9.
- 3. Codes 65 to 90 are the 26 upper-case letters and 97 to 122 are the 26 lower-case letters; the rest are punctuation marks and operator symbols.
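The ranges listed above can be written down as a small classifier. This is only an illustrative sketch: the class `AsciiClassSketch` and the category labels are invented here, not part of any library.

```java
class AsciiClassSketch {
    // Maps a code to the category described in the list above.
    static String classify(int code) {
        if (code < 0 || code > 127) return "not ASCII";
        if (code <= 31 || code == 127) return "control";
        if (code == 32) return "space";
        if (code >= 48 && code <= 57) return "digit";
        if (code >= 65 && code <= 90) return "uppercase";
        if (code >= 97 && code <= 122) return "lowercase";
        return "punctuation/symbol";
    }

    // Counts the codes in 0..127 that are not control characters.
    static int countPrintable() {
        int count = 0;
        for (int code = 0; code <= 127; code++) {
            if (!classify(code).equals("control")) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(classify(10));     // control (LF)
        System.out.println(classify('A'));    // uppercase
        System.out.println(classify('7'));    // digit
        System.out.println(countPrintable()); // 95
    }
}
```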
Also note that although standard ASCII needs only 7 bits, it is usually carried in an 8-bit byte, and the highest bit (b7) can be used as a parity bit. A parity check is a way to detect errors introduced during transmission, and it comes in two forms: odd parity and even parity. Odd parity requires every correct byte to contain an odd number of 1s; if the 7 data bits contain an even number of 1s, b7 is set to 1. Even parity requires an even number of 1s; if the 7 data bits contain an odd number of 1s, b7 is set to 1.
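To make the parity rule concrete, the sketch below computes the 8-bit value for a 7-bit code under each scheme. The class and method names are hypothetical; only `Integer.bitCount` comes from the JDK.

```java
class ParitySketch {
    // Sets bit 7 if needed so that the whole byte has an even number of 1 bits.
    static int withEvenParity(int code7) {
        int data = code7 & 0x7F;
        return (Integer.bitCount(data) % 2 == 0) ? data : (data | 0x80);
    }

    // Sets bit 7 if needed so that the whole byte has an odd number of 1 bits.
    static int withOddParity(int code7) {
        int data = code7 & 0x7F;
        return (Integer.bitCount(data) % 2 == 1) ? data : (data | 0x80);
    }

    public static void main(String[] args) {
        // 'A' is 1000001 (two 1 bits), 'C' is 1000011 (three 1 bits).
        System.out.println(Integer.toHexString(withEvenParity('A'))); // 41
        System.out.println(Integer.toHexString(withOddParity('A')));  // c1
        System.out.println(Integer.toHexString(withEvenParity('C'))); // c3
        System.out.println(Integer.toHexString(withOddParity('C')));  // 43
    }
}
```

A receiver would recount the 1 bits of each byte and reject any byte whose count has the wrong parity. A single flipped bit is detected this way; two flipped bits are not.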
UTF-8
UTF-8 is the most widely used Unicode encoding on the Internet. Its defining feature is variable length: it uses 1 to 4 bytes per character, depending on the character's code point. The encoding rules are as follows:
- 1. For a single-byte character, the first bit is 0 and the remaining 7 bits hold the character's Unicode code point. For code points 0 to 127, UTF-8 is therefore identical to ASCII, which means documents from the ASCII era open without problems as UTF-8.
- 2. For a character that needs N bytes (N > 1), the first N bits of the first byte are set to 1 and bit N + 1 is set to 0. The first two bits of each of the remaining N - 1 bytes are set to 10. All remaining bits are filled with the bits of the character's Unicode code point.
Bytes | Unicode code point range (hex) | UTF-8 byte pattern (binary) |
---|---|---|
1 | 0000 0000 – 0000 007F | 0xxxxxxx |
2 | 0000 0080 – 0000 07FF | 110xxxxx 10xxxxxx |
3 | 0000 0800 – 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
4 | 0001 0000 – 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
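
One consequence of the patterns in the table is that a decoder can tell from the first byte alone how long the sequence is. A rough Java sketch follows; the class `Utf8LengthSketch` is invented for illustration, and it deliberately does not reject lead bytes that are never valid in real UTF-8, such as 0xC0, 0xC1, or 0xF5 and above.

```java
class Utf8LengthSketch {
    // Returns the sequence length announced by a leading byte,
    // or -1 for a continuation byte (10xxxxxx) or an unknown pattern.
    static int sequenceLength(int leadByte) {
        int b = leadByte & 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(0x41)); // 1
        System.out.println(sequenceLength(0xE6)); // 3
        System.out.println(sequenceLength(0xF0)); // 4
        System.out.println(sequenceLength(0xB1)); // -1 (continuation byte)
    }
}
```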
The following uses the Chinese character “汉” (han) as an example to show concretely how UTF-8 encoding works.
The Unicode code point of “汉” is 0x6C49 (0110 1100 0100 1001 in binary). According to the table above, 0x6C49 falls in the range of the third row (0000 0800 – 0000 FFFF), so its format is 1110xxxx 10xxxxxx 10xxxxxx. Starting from the last binary digit of the code point, fill in the x positions from back to front and pad any leftover x with 0. The resulting UTF-8 encoding of “汉” is 11100110 10110001 10001001, which is 0xE6 0xB1 0x89 in hexadecimal.
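The bit-filling procedure can be sketched in Java and checked against the JDK's own encoder. The class `Utf8Sketch` and its methods are hypothetical names for this illustration; `String.getBytes(StandardCharsets.UTF_8)` is the standard call used for comparison.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one Unicode code point by filling the x positions of the matching pattern.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x6C49);
        byte[] jdk = "\u6C49".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual));                // E6 B1 89
        System.out.println(Arrays.equals(manual, jdk)); // true
    }
}
```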
UTF-16
Before looking at UTF-16, let's introduce another concept: the plane.
Unicode is like a very thick dictionary that defines all the world's characters in one set. These characters are not defined all at once but in partitions. Each partition holds 65536 ($2^{16}$) code points and is called a plane. There are currently 17 planes, so the plane number fits in 5 bits and every code point fits in 21 bits; the whole code space has $17 \times 2^{16} = 1114112$ code points.
The first 65536 code points form the basic plane (the Basic Multilingual Plane, BMP for short), with code points from 0 to $2^{16}-1$, written in hexadecimal as U+0000 to U+FFFF. The most common characters are all on this plane, which was the first plane Unicode defined and published. The remaining characters are placed in the supplementary planes (the first of which is the Supplementary Multilingual Plane, SMP), with code points from U+010000 to U+10FFFF.
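Since each plane holds 65536 code points, the plane number is simply the code point shifted right by 16 bits. A small sketch, where the class `PlaneSketch` is made up for illustration and the `Character` methods are standard JDK:

```java
class PlaneSketch {
    // Returns the plane (0..16) that a code point belongs to.
    static int plane(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            throw new IllegalArgumentException("not a code point: " + cp);
        }
        return cp >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(plane(0x6C49));   // 0 (basic plane)
        System.out.println(plane(0x1F600));  // 1
        System.out.println(plane(0x10FFFF)); // 16 (last plane)
        System.out.println(Character.isBmpCodePoint(0x6C49));            // true
        System.out.println(Character.isSupplementaryCodePoint(0x1F600)); // true
        System.out.println(Character.MAX_CODE_POINT + 1 == 17 * 65536);  // true
    }
}
```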
With a basic understanding of planes, let's return to UTF-16. UTF-16 sits between UTF-32 and UTF-8 and combines characteristics of fixed-length and variable-length encodings. Its rule is simple: characters in the basic plane take two bytes, and characters in the supplementary planes take four bytes. That is, a UTF-16 encoding is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+010000 to U+10FFFF). So the question is: when we encounter two bytes, do we treat them as a character on their own, or combine them with the next two bytes?
There is a neat trick here. In the basic plane, U+D800 to U+DFFF is an empty segment, that is, these code points are not assigned to any character. This empty segment can therefore be used to map the characters of the supplementary planes.
The supplementary planes contain $2^{20}$ code points, so 20 binary bits are needed to represent them. UTF-16 first subtracts 0x10000 from the code point to get a 20-bit value, then splits it in half: the upper 10 bits are mapped into U+D800 to U+DBFF, called the high surrogate (H), and the lower 10 bits are mapped into U+DC00 to U+DFFF, called the low surrogate (L). In other words, one supplementary-plane character is represented by two basic-plane code units.
Therefore, when we encounter two bytes whose value lies between U+D800 and U+DBFF, we can conclude that the next two bytes should lie between U+DC00 and U+DFFF, and that the four bytes must be read together.
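The high/low split can be sketched as follows. The class `SurrogateSketch` and its two methods are hypothetical names; the result is compared with the JDK's `Character.toChars`, and U+1F600 is chosen here only as a sample supplementary code point.

```java
import java.util.Arrays;

class SurrogateSketch {
    // Splits a supplementary code point into {high, low} UTF-16 code units.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + cp);
        }
        int v = cp - 0x10000;                     // 20-bit value
        char high = (char) (0xD800 + (v >> 10));  // upper 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF)); // lower 10 bits
        return new char[] {high, low};
    }

    // Recombines a high and a low surrogate into the original code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);    // D83D DE00
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F600))); // true
        System.out.printf("%X%n", fromSurrogates(pair[0], pair[1]));       // 1F600
    }
}
```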
Unicode Emoji
Around 1999, a young Japanese man named Shigetaka Kurita found, like many men, that the text messages he sent his girlfriend were often misunderstood. For example, a plain “Got it” would be read as angry or impatient and set off a cold war. So Kurita thought: “If I could insert emoticons into the text to express my feelings, I think people would need them.” That is how the original emoji was born.
Emoji characters are part of the Unicode character set. Each emoji corresponds to a specific Unicode code point (or a sequence of code points). The ranges of common emoji and their byte representations can be looked up in the Emoji Unicode Tables (apps.timwhitlock.info/emoji/table…).
For UGC (user-generated content) websites this matters, because more and more apps, not just those on iPhones, use emoji to express feelings vividly. One option is to transcode emoji and store them in the database as plain text. Another is to upgrade the database and change its character set to one that can hold 4-byte UTF-8 sequences (in MySQL, for example, utf8mb4).
A write-up of the related error: blog.csdn.net/asahinokawa…
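As one possible shape for the "store as text" option, the sketch below replaces every supplementary code point (the ones that need 4 bytes in UTF-8) with a placeholder such as `[U+1F600]` and restores it on the way out. The placeholder format and the class `EmojiEscapeSketch` are assumptions made up for this example, not an established convention; text that already contains such a placeholder literally would be altered by `unescape`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiEscapeSketch {
    // Hypothetical placeholder format: [U+XXXXX] with 5 or 6 upper-case hex digits.
    private static final Pattern ESCAPED = Pattern.compile("\\[U\\+([0-9A-F]{5,6})\\]");

    // Replaces each supplementary code point with its placeholder.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append(String.format("[U+%X]", cp));
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    // Turns placeholders back into the original code points.
    static String unescape(String s) {
        Matcher m = ESCAPED.matcher(s);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(s, last, m.start());
            int cp = Integer.parseInt(m.group(1), 16);
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(m.group()); // out of range: keep the text unchanged
            }
            last = m.end();
        }
        sb.append(s, last, s.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "hi \uD83D\uDE00"; // "hi " followed by U+1F600
        String stored = escape(original);
        System.out.println(stored);                             // hi [U+1F600]
        System.out.println(unescape(stored).equals(original));  // true
    }
}
```

Emoji that live in the basic plane (U+263A, for instance) pass through unchanged, since they fit in 3 bytes of UTF-8 anyway.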
How many bytes does a Chinese character take?
If by “character” you mean char in Java, then it is 16 bits, that is, 2 bytes.
If by “character” you mean the abstract characters we see with our eyes, then asking how many bytes one takes is meaningless on its own: it makes no sense to talk about a character's size apart from a specific encoding. It is like asking how many bytes the abstract integer 42 takes. That depends on whether you store it in a byte, short, int, or long: a byte is one byte, a short is two bytes, an int is usually four bytes, and a long is usually eight bytes. Of course, a byte has a limited number of bits, so some numbers do not fit. For example, 256 cannot be stored in a byte.
The same goes for characters. To talk about “how many bytes”, you must name the encoding first. The same character may take a different number of bytes under different encodings. Take the character “字” as an example: it is 2 bytes in GBK, 2 bytes in UTF-16, 3 bytes in UTF-8, and 4 bytes in UTF-32. Different characters may also take a different number of bytes in the same encoding: “字” takes 3 bytes in UTF-8, while “A” takes 1 byte, because UTF-8 is a variable-length encoding. A char in Java is essentially a UTF-16 code unit, and UTF-16 is itself a variable-length encoding (2 bytes or 4 bytes).
If an abstract character takes 4 bytes in UTF-16, it clearly cannot fit in a single char. In other words, a char can hold only those characters whose UTF-16 encoding is 2 bytes long.
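These byte counts are easy to check in Java. The sketch below uses U+5B57 (a Chinese character in the basic plane) and U+1F600 (a supplementary character) as sample inputs; the class `ByteCountSketch` is a made-up name. Two assumptions to note: GBK and UTF-32BE are looked up by name and may be missing from a trimmed-down runtime, so the code checks `Charset.isSupported` first, and the big-endian variants are used because Java's plain "UTF-16" encoder also writes a 2-byte byte-order mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteCountSketch {
    // Number of bytes a string occupies in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // Prints the byte length if the runtime provides the named charset.
    static void report(String label, String s, String charsetName) {
        if (Charset.isSupported(charsetName)) {
            int n = byteLength(s, Charset.forName(charsetName));
            System.out.println(label + " in " + charsetName + ": " + n + " byte(s)");
        } else {
            System.out.println(charsetName + " is not available in this runtime");
        }
    }

    public static void main(String[] args) {
        String zi = "\u5B57";         // U+5B57
        String grin = "\uD83D\uDE00"; // U+1F600, outside the basic plane

        for (String name : new String[] {"GBK", "UTF-16BE", "UTF-8", "UTF-32BE"}) {
            report("U+5B57", zi, name);
        }
        report("A", "A", "UTF-8");

        // One abstract character, but two Java chars (a surrogate pair).
        System.out.println(grin.length());                         // 2
        System.out.println(grin.codePointCount(0, grin.length())); // 1
        System.out.println(byteLength(grin, StandardCharsets.UTF_8)); // 4
    }
}
```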