Preface:

Many of you have heard of ASCII, Unicode, and UTF-8, but may not know exactly what they mean. I only had a rough understanding of them myself at first, so I decided to study the topic further and summarize it as follows.

My knowledge here is shallow, so advice and corrections are very welcome!

I. Historical origin

For the historical origins, an earlier writer's summary explains things very clearly, so I have quoted its main part here with some modifications. The text is as follows:

Once upon a time, a group of people decided to represent everything in the world with eight transistors that could be switched between different on/off states. They saw that the states of these eight switches were good, so they called the group a "byte." Then they built machines that could process these bytes; the machines ran on and on, producing ever more states out of those bytes, and the states kept changing. They saw that it was good, so they called the machine a computer. This is where computers come from: each transistor has only two states, zero and one. In modern computers, 8 binary bits are referred to as a byte, which is the smallest unit of storage in a computer.

The computer was invented by Americans. Eight binary bits can be combined into a total of 256 (2^8) different states. They specified special uses for the 32 states numbered from 0: whenever a terminal or printer encountered one of these agreed bytes, it had to perform some agreed action. On encountering 0x0A the terminal would start a new line; on encountering 0x07 it would beep at people; and on encountering 0x1B the printer would print inverted text, or the terminal would display letters in color. They saw that this was fine, so they called these byte states below 0x20 control codes. They then encoded all the spaces, punctuation marks, digits, and upper- and lower-case letters into consecutive byte states up to number 127, so that computers could store English text in bytes. Everyone felt so good about this that they called the scheme ASCII (American Standard Code for Information Interchange), and all the computers in the world used the same ASCII scheme to store English characters.

Computers then spread all over the world, but many countries do not use English, and many of their letters are not in ASCII. In order to store their own languages on the computer, they decided to use the states after 127 to represent their new letters and symbols, and also added many shapes needed for drawing tables, such as horizontal lines, vertical lines, and crosses. They kept numbering states until they reached the last one, 255. The character set from 128 to 255 is called the extended character set.

What happened when Chinese people started using computers and found there was no Chinese? The Chinese solution was this: keep the characters smaller than 127 as they were, and use two bytes larger than 127 to represent one Chinese character, with the first byte (called the high byte) running from 0xA1 to 0xF7 and the second byte (the low byte) running from 0xA1 to 0xFE. This made it possible to encode about 7,000 simplified Chinese characters. Mathematical symbols, Roman and Greek letters, and Japanese kana were also included, and even the digits, punctuation marks, and letters already in ASCII were re-encoded as two-byte codes. These are known as full-width characters, while the original characters below 127 are half-width characters. The Chinese people saw that this was good, so they called the character scheme "GB2312". GB2312 is a Chinese extension of ASCII.
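To make these byte ranges concrete, here is a minimal Python sketch (assuming CPython, whose standard library ships a gb2312 codec; the sample characters are arbitrary):

```python
# GB2312: an ASCII character stays a single byte below 0x80, while a
# Chinese character becomes two bytes in the high ranges described above.
data = "中a".encode("gb2312")
print([hex(b) for b in data])   # ['0xd6', '0xd0', '0x61']
# 0xD6 falls in the high-byte range 0xA1-0xF7, 0xD0 in the low-byte
# range 0xA1-0xFE; 0x61 is just the ASCII code for "a".
```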

But this was still not enough for Chinese: rare characters, traditional characters, and so on remained unrepresentable. So the encoding was extended into the GBK standard, which includes all of GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols. Ethnic-minority compatriots also needed to use computers, so support for minority scripts was added and GBK was expanded into GB18030; since then, the culture of the Chinese nation has been able to carry on into the computer age. Chinese programmers saw that this family of standards for encoding Chinese characters was good, so they called it DBCS (Double Byte Character Set).

But every country had its own character encoding, so each could only read its own text and not anyone else's. This does not fit the open culture of the web.

It was then that ISO (the International Organization for Standardization) decided to tackle the problem. Their solution was simple: scrap all the regional encoding schemes and create a new code covering every culture and every letter and symbol on the planet! They called it the "Universal Multiple-Octet Coded Character Set", or UCS for short, commonly known as Unicode. The standard defines both 2-byte and 4-byte forms for representing character codes (UCS-2 and UCS-4, described below).

When Unicode was developed, the memory capacity of computers had grown so much that space was no longer an issue. As a result, ISO simply stipulated that all characters must be represented in two bytes, that is, 16 bits. For the ASCII "half-width" characters, Unicode keeps the original code values unchanged but expands their length from 8 bits to 16 bits, while characters from other cultures and languages are recoded entirely. Since these "half-width" English symbols only need the lower 8 bits, the upper 8 bits are always 0, so this grand scheme wastes twice as much space when storing English text.

Unicode is a character encoding scheme developed by this international organization that can contain all the characters and symbols in the world. The current Unicode standard is divided into 17 planes; each plane holds 65536 code points (0x0000 to 0xFFFF within the plane), for a total of 1,114,112 (65536 × 17).

But there are two problems with Unicode:

  • How does a computer know that a given two bytes represent a single character, rather than two separate one-byte characters?

  • If English characters are also represented in two bytes, the high byte is always zero. That is a luxury and a waste of space, because most of the content on computers is still in English.

II. Charset and Encoding

Having gone through this history, you should now have a feel for character sets. Let's talk about character sets and character encodings properly.

A character set is how we assign numbers (in decimal, say) to the characters of the world.

Character encoding is the rule for converting those numbers into binary codes that the computer recognizes.

  • Charset (character set): a collection of abstract characters, covering all kinds of scripts and symbols in the world.

  • Encoding (character encoding): establishes the correspondence between a character set and the computer system. In simple terms, it is the rule that converts characters into binary codes that the computer can recognize (see the sketch below).
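To illustrate the distinction, here is a small Python sketch (the code points are standard Unicode values; only the choice of sample characters is mine):

```python
# Character set: assigns each abstract character a number (code point).
print(ord("A"))                  # 65     -- the number the character set assigns
print(ord("中"))                 # 20013  -- i.e. U+4E2D

# Character encoding: turns that number into concrete bytes.
print("中".encode("utf-8"))      # b'\xe4\xb8\xad' -- three bytes under UTF-8
print("中".encode("utf-16-le"))  # b'-N'           -- two bytes under UTF-16
```

The same character-set entry (U+4E2D) yields different bytes under different encodings, which is exactly the charset/encoding split described above.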

III. ASCII

The computer uses binary to store instructions and data, and the byte is the basic unit for computing and processing data, giving 256 (2^8) possible states. People used these states to mark instructions and text. Initially, the first 32 (0x20) states were used to represent special actions of terminals, printers, and so on: for example, a newline when the terminal encounters byte 0x0A (LF), carriage return (CR, 0x0D), the bell (BEL, 0x07), and so on. These first 32 states are also known as control codes. Beyond these special-purpose states there were 256 − 32 = 224 states left over; how could such a waste of resources be tolerated, when apart from control instructions our text had still not been represented? So the states that followed were used to represent English letters, digits, and punctuation marks, so that computers could display and record text. This was the origin of ASCII (American Standard Code for Information Interchange).
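A quick Python sketch of this layout (the values are standard ASCII; nothing here is specific to any particular setup):

```python
# Control codes occupy the states below 0x20; printable text sits above.
print(hex(ord("\n")))                # 0x0a -- line feed, a control code
print(hex(ord("\a")))                # 0x07 -- BEL, the terminal beep
print(ord(" "), ord("A"), ord("a"))  # 32 65 97 -- printable characters
print(chr(65))                       # 'A' -- mapping a state number back to its character
```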

IV. Unicode

Unicode is an industry standard in computing that includes a character set, encoding schemes, and more. Unicode was created to overcome the limitations of traditional character encoding schemes: it provides a unified and unique binary encoding for each character in every language, to meet the requirements of cross-language and cross-platform text exchange and processing.

A Unicode character is usually written as "U+" followed by hexadecimal digits. Characters in the Basic Multilingual Plane (BMP, also called "plane zero" or plane 0) are written with four hexadecimal digits (for example, U+4AE0); the BMP supports more than 60,000 characters in total. Characters outside the BMP require five or six hexadecimal digits. Unicode characters are currently divided into 17 planes; each plane holds 65536 code points (0x0000 to 0xFFFF within the plane), for a total of 1,114,112, although only a few planes are in use so far. UTF-8, UTF-16, and UTF-32 are all encoding schemes that convert these code-point numbers into the data a program actually stores.
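These notions are easy to poke at in Python (a sketch; the two sample characters are arbitrary):

```python
# A code point is just a number; "U+" notation is its hexadecimal form.
cp = ord("汉")
print(f"U+{cp:04X}")           # U+6C49 -- four hex digits, inside the BMP
emoji = "\U0001F600"           # a character beyond the BMP
print(f"U+{ord(emoji):04X}")   # U+1F600 -- five hex digits
print(ord(emoji) // 0x10000)   # 1 -- it lives on plane 1, not plane 0
```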

The Universal Character Set (UCS) is a standard character set defined by the ISO 10646 (also known as ISO/IEC 10646) standard. UCS-2 encodes characters in two bytes and UCS-4 in four bytes.

UCS-4 is divided into 2^7 = 128 groups based on the highest byte (whose highest bit is always 0). Each group is then divided into 256 planes based on the second-highest byte; each plane is divided into 256 rows based on the third byte, and each row has 256 code points. Plane 0 of group 0 is called the Basic Multilingual Plane (BMP). In other words, if the first two bytes of a UCS-4 value are both zero, the character lies on the BMP, and removing those two zero bytes gives its UCS-2 form. Each plane has 2^16 = 65536 code points, and the Unicode project uses 17 planes, for a total of 17 × 65536 = 1,114,112 code points.
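As a sketch of this byte-per-level structure (my own illustration of the scheme above, using U+1F600 as an arbitrary example):

```python
# Split a UCS-4 value into group / plane / row / cell, one byte each.
cp = 0x0001F600
group = (cp >> 24) & 0x7F   # highest byte, top bit always 0
plane = (cp >> 16) & 0xFF   # second-highest byte
row   = (cp >> 8) & 0xFF    # third byte
cell  = cp & 0xFF           # lowest byte
print(group, plane, row, cell)   # 0 1 246 0 -- group 0, plane 1: outside the BMP
```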

Unicode Transformation Format (UTF) refers to the encoding forms of the Unicode character set, including UTF-8, UTF-16, and UTF-32. Since UTF-32 simply uses a fixed-length four-byte encoding, it is not covered further here; the following sections introduce UTF-8 first and then UTF-16.

UTF-8

UTF-8 is a variable-length encoding that uses 1 to 4 bytes to represent a character. Its key feature is that different ranges of characters get different encoding lengths.

A single byte represents the characters at Unicode code points 0x0000-0x007F: the first bit is 0 and the remaining 7 bits are significant, so 2^7 = 128 characters can be represented. This part is fully compatible with the ASCII encoding rules; the Unicode characters 0x0000-0x007F are exactly the ASCII characters.

Two bytes represent the characters at Unicode code points 0x0080-0x07FF. The first byte starts with 110, the following byte starts with 10, and there are 11 significant bits, so 2048 characters can be represented. This shows the rule behind UTF-8's variable-length encoding: so that the computer knows where to break the byte stream (that is, how many bytes make up one character), the first byte of a sequence begins with as many 1s as there are bytes in that sequence. For example, the first byte of a two-byte character begins with 110, and the first byte of a three-byte character begins with 1110 (a small sketch of this follows below).
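Here is a hand-rolled sketch of the two-byte template 110xxxxx 10xxxxxx (my own illustration, on Python 3.8+ for bytes.hex with a separator; "é", U+00E9, is an arbitrary character from the 0x0080-0x07FF range):

```python
cp = ord("é")                        # 0x00E9, within 0x0080-0x07FF
b1 = 0b11000000 | (cp >> 6)          # high 5 significant bits behind the 110 prefix
b2 = 0b10000000 | (cp & 0b00111111)  # low 6 bits behind the 10 prefix
print(hex(b1), hex(b2))              # 0xc3 0xa9
print("é".encode("utf-8").hex(" "))  # c3 a9 -- matches Python's built-in codec
```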

Three bytes represent the characters at Unicode code points 0x0800-0xFFFF. The 16 significant bits can hold 65536 characters, and the commonly used Chinese characters fall in this three-byte range, so UTF-8 expresses common Chinese characters in three bytes. Of course, there are actually nearly 100,000 Chinese characters in existence; 65536 code points cannot accommodate such a huge system, so some rarely used Chinese characters can only be expressed with four bytes. As we know, Unicode puts all the commonly used characters on the BMP (U+0000-U+FFFF), and up to this point UTF-8 represents the entire BMP in 1 to 3 bytes.

Example 1: the Unicode code point for "汉" (Han) is 0x6C49. Since 0x6C49 lies between 0x0800 and 0xFFFF, it uses the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx. Writing 0x6C49 in binary gives 0110 1100 0100 1001; filling the x's of the template with this bit stream in turn yields 11100110 10110001 10001001, i.e. E6 B1 89.
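We can check this hand computation directly (a Python 3.8+ sketch; encode applies exactly the template rules above):

```python
# Verify the worked example: U+6C49 -> E6 B1 89 under UTF-8.
print("汉".encode("utf-8").hex(" "))   # e6 b1 89
# And the single-byte, ASCII-compatible case from the 0x00-0x7F range:
print("A".encode("utf-8"))             # b'A' -- one byte, identical to ASCII
```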

UTF-16

UTF-16 is a variable-length encoding that uses two or four bytes per character. Unicode characters on the BMP (U+0000 to U+FFFF) are encoded in two bytes, and the remaining characters are encoded in four bytes (as surrogate pairs). For the two-byte part, the code point value itself is used directly as the encoding; the conversion rules for the four-byte part are not detailed here. We just need to remember that common Chinese and English characters are all represented by two bytes in UTF-16.
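A short Python sketch of the two cases (using the little-endian variant utf-16-le so that no byte-order mark is prepended; the sample characters are arbitrary):

```python
# BMP characters take two bytes in UTF-16; characters beyond the BMP take four.
print("汉".encode("utf-16-le").hex(" "))       # 49 6c -- the code point itself, little-endian
print(len("\U0001F600".encode("utf-16-le")))   # 4 -- encoded as a surrogate pair
```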

A note on garbled text

Incompatible encoding methods lead to garbled characters. Suppose encodings A and B encode characters differently: when a file written in encoding A is decoded on a device that only understands encoding B, the result is necessarily garbled, because the bytes cannot be interpreted correctly. It is like a spy's code book: each message must be decrypted with the right key, and if you use the wrong one, the translated content is of course a jumble that cannot be read smoothly.
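The effect is easy to reproduce (a Python sketch, with GBK and Latin-1 standing in for "code A" and "code B"):

```python
# Bytes written under one encoding, decoded under another: garbled text.
data = "中文".encode("gbk")    # "code A" produces b'\xd6\xd0\xce\xc4'
print(data.decode("latin-1"))  # ÖÐÎÄ -- "code B" misreads every byte
print(data.decode("gbk"))      # 中文 -- the right encoding recovers the text
```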

My writing and learning are limited; if there are mistakes in this article, I hope you will let me know.
