I’m sure you’ve all heard about character encoding; you can more or less name names like UTF-8, ASCII, and Unicode. While sorting out my notes these past couple of days, I found some on character encoding, so I’d like to review the topic with you.

Encoding and decoding

Concept

First of all, before looking at the various character encoding formats and standards, let’s understand the concepts of encoding and decoding. Encoding is the process of converting information from one form or format into another according to certain rules, while decoding is the reverse process. Encoding and decoding form a predetermined scheme, and both processes must follow its rules. For example, suppose A and B agree that 1 represents “I”, 2 represents “like”, 3 represents “dislike”, and 4 represents “you”. One day A sends 124 to B; B understands A’s intention and sends 134 back. Here the information in transit is represented by the digits 1 to 4. The process from “I like you” to “124” can be called encoding, and when B receives “124”, converting it back into “I like you” is called decoding. After this back-and-forth of encoding and decoding, a sad story is born.
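To make A and B’s scheme concrete, here is a minimal JavaScript sketch; the codebook object and the encode/decode helpers are invented purely for this illustration.

const codebook = { '1': 'I', '2': 'like', '3': 'dislike', '4': 'you' }

// Build the reverse table once: word -> digit
const reverse = Object.fromEntries(
  Object.entries(codebook).map(([num, word]) => [word, num])
)

// Encode: turn a list of words into a digit string
const encode = words => words.map(word => reverse[word]).join('')

// Decode: turn a digit string back into words
const decode = message => message.split('').map(num => codebook[num]).join(' ')

console.log(encode(['I', 'like', 'you']))  // '124'
console.log(decode('134'))                 // 'I dislike you'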

In our usual development work, encoding and decoding are likewise unavoidable topics. They come up in many places, and different encodings and decodings mean different things in different scenarios, for example character encoding/decoding and URL encoding/decoding.

Why encode and decode?

So you might ask: why can’t we just store things as they are, without encoding and decoding? We can’t, because a computer cannot store characters directly; it can only store zeros and ones. Whatever the character is, it must be converted into some sequence of zeros and ones before it can be stored. To put it bluntly, a computer computes in binary, and binary has only 0 and 1. There are many reasons computers adopted binary. Technically it is easy to build: a bistable circuit can represent 1 and 0, with high voltage as 1 and low voltage as 0. And because there are only two states, mistakes during transmission and processing are less likely. In addition, binary arithmetic rules are relatively simple. Weighing all these factors, computers ended up adopting binary. When programmers write in a high-level language such as C, it is encoded into machine language the computer can recognize; this makes code convenient for programmers to write while letting the system run stably and fast.
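You can peek at this from JavaScript: behind every character is a number, and behind that number, bits. A quick sketch:

let code = 'A'.charCodeAt(0)   // the number a computer stores for 'A'
console.log(code)              // 65
console.log(code.toString(2))  // '1000001', that same number as bits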

Garbled characters

Garbled characters are characters that cannot be read correctly because a different character set was used. For example, suppose I tell you that what I want to say to you is on line xx of page xx of the Xinhua Dictionary; if you pick up an Oxford English Dictionary instead, of course you cannot correctly understand what I wanted to say.

ASCII

ASCII, the American Standard Code for Information Interchange, is a computer coding system based on the Latin alphabet, used to display modern English and other Western European languages. It was published over 50 years ago and has been updated since, but the most recent edition is over 30 years old. It uses 7 binary digits to represent a character, defining 128 characters in total, from 0000000 to 1111111. Why 128 characters? At the time the infrastructure was immature and hardware was limited, so to save space it was agreed to use one byte to hold a character’s number. A byte is 8 bits; after removing the highest bit as a sign bit, the remaining 7 bits run from 0000000 to 1111111, 128 values in total. Below is the ASCII table I pulled from Baidu; it divides the characters into control characters and printable characters.


let str1 = '0'
console.log(str1.charCodeAt(0))  // 48
let str2 = 'A'
console.log(str2.charCodeAt(0))  // 65
let str3 = 'a'
console.log(str3.charCodeAt(0))  // 97

The charCodeAt method returns the Unicode code of the character at the specified position, as an integer between 0 and 65535. For ASCII characters the Unicode value is the same as the ASCII value, so the results here match what we saw in the table above. In practice, we can use this method when we need to determine, for example, whether a character is a lowercase letter.

function fn(char) {
	let charCode = char.charCodeAt(0)
	if (charCode >= 97 && charCode <= 122) {
		return true
	} else {
		return false
	}
}
console.log(fn('d'))  // true
console.log(fn('D'))  // false

The same check can also be done with the regular expression /^[a-z]$/.
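For instance, as a one-liner using the standard test method:

const isLowerCase = char => /^[a-z]$/.test(char)
console.log(isLowerCase('d'))  // true
console.log(isLowerCase('D'))  // false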

GB2312

As mentioned above, the Americans invented the ASCII table, but as computer systems became more and more popular, people all over the world began to use them, and many countries’ languages are not based on the Latin alphabet. Take our extensive and profound Chinese characters: legend traces them back to Cangjie, and they carry a deep cultural heritage and a long history of influence. ASCII alone could not represent Chinese, so China created its own code. One byte can only represent 2^7 values, so we Chinese used 2 bytes; setting aside the sign bits, that gives 2^15 = 32768 values, fully 256 times the range ASCII can represent. The earliest standard China defined is called GB2312, released by the State Administration of Standards in the 1980s. Although there are tens of thousands of Chinese characters, GB2312 did not include that many characters and punctuation marks, only the commonly used ones, which is why some agencies previously could not print certain rare characters. And since ASCII was invented first in America and GB2312 came later in China, GB2312 is compatible with the ASCII table to a certain extent: positions 0 to 127 are preserved.

GBK

The GBK standard is an upgraded version of GB2312. As mentioned, GB2312 covered only a few thousand common Chinese characters and symbols and could not print the rarer ones. To solve this, the new GBK standard was released in the 1990s, adding more than 20,000 Chinese characters over its predecessor. Below is a random excerpt I pulled from the GBK code table; see how many of the characters you don’t recognize 😂.

GB18030

After GBK, several further versions were released, such as GB18030-2000 and GB18030-2005, collectively known as GB18030, whose Chinese name translates to “Information Technology — Chinese Coded Character Set”. It is fully backward compatible with GB 2312-1980, basically backward compatible with GBK, and supports all code points of Unicode (GB 13000), covering 70,244 Chinese characters in total.

Unicode

With the popularization of computers, more and more countries began using them, and since everyone’s language family is different, each developed its own encoding format to represent its own language: you had one set, I had another. In the early standalone era this caused no problems; everyone played in their own sandbox and never needed to exchange data with anyone else. But with the rise of the Internet, communication between people and between computers became necessary, and people awkwardly discovered that because everyone had gone their own way, collecting only the characters they themselves used, there was now a plethora of standards and none of them worked universally; ASCII certainly cannot represent Chinese characters. Thus came Unicode, which includes a character set, encoding schemes, and so on. It assigns a unified and unique code to each character in every language, so as to meet the requirements of cross-language and cross-platform text conversion and processing. Work on it began in 1990 and it was officially released in 1994. Note that Unicode, as originally conceived, encodes characters in two bytes, meaning English letters, Chinese characters, and punctuation marks all occupy two bytes.

The charCodeAt method, mentioned earlier, returns the Unicode value. We tested a digit and letters in the previous code; now let’s test a Chinese character:

console.log('哈'.charCodeAt(0))  // 21704

So 21704 is the Unicode value of the character 哈.
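The reverse direction works too: String.fromCharCode turns a Unicode value back into its character, which is decoding in miniature.

console.log(String.fromCharCode(21704))  // '哈'
console.log('哈'.charCodeAt(0))          // 21704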

UTF-8

With Unicode out of the way, we can finally move on to UTF-8. Their relationship, simply put: Unicode is a character set that only specifies the numeric code of each character, not how that number is stored, while UTF-8 is an encoding rule for storing it. To put it more plainly, the former is a standard and the latter is a scheme, a concrete implementation. In fact there is not only UTF-8 but also UTF-16 and UTF-32, but the latter two see far less use because of their obvious drawbacks: UTF-16 uses 2 or 4 bytes per character, and UTF-32 always uses 4 bytes (32 bits) to represent a Unicode code. Neither is compatible with one-byte ASCII, and both take up more space, so they never became widespread.

UTF-8 is therefore the most widely used encoding for Unicode. Its distinguishing feature is that it uses a varying number of bytes to represent a character: characters in the original ASCII range, formerly one byte, are still one byte, so it stays compatible; a Chinese character, which used to take two bytes, takes three bytes in UTF-8.

All in all, UTF-8 is the dominant character encoding scheme today. Often, when writing code, we declare that the file’s character encoding is UTF-8: for example, <meta charset="UTF-8"> in the head of an HTML file, or # coding=utf-8 at the top of a .py file.

Encoding rules

  1. For single-byte symbols, the first bit is set to 0 and the remaining 7 bits are the symbol’s Unicode code. So for English letters, the UTF-8 encoding is identical to the ASCII encoding.
  2. For n-byte symbols (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each following byte are set to 10. All the remaining, unmentioned bits are filled with the symbol’s Unicode code. (A small sketch implementing both rules follows.)
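Here is that sketch: a toy encoder that applies the two rules for code points up to 0xFFFF. It ignores UTF-8’s 4-byte form and surrogate pairs, so treat it as an illustration rather than a production encoder.

// A toy UTF-8 encoder following the two rules above.
// Only handles code points up to 0xFFFF (1-, 2- and 3-byte forms).
function toUtf8Bytes(char) {
  const cp = char.charCodeAt(0)
  if (cp < 0x80) {
    // Rule 1: one byte, high bit 0, the other 7 bits are the code point
    return [cp]
  } else if (cp < 0x800) {
    // Rule 2, n = 2: 110xxxxx 10xxxxxx
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
  } else {
    // Rule 2, n = 3: 1110xxxx 10xxxxxx 10xxxxxx
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  }
}

console.log(toUtf8Bytes('A').map(b => b.toString(16)))   // ['41']
console.log(toUtf8Bytes('张').map(b => b.toString(16)))  // ['e5', 'bc', 'a0']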

Let’s convert the character 张 to UTF-8 format.

console.log('张'.charCodeAt(0))  // 24352

The charCodeAt method gives 24352 as the Unicode value of 张; converting it to hexadecimal yields 0x5f20. Entering the character on an online encoding/decoding website gives \u5f20, which matches our result.
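The conversions are easy to reproduce in JS with toString:

let code = '张'.charCodeAt(0)
console.log(code)                                // 24352
console.log(code.toString(16))                   // '5f20'
console.log(code.toString(2).padStart(16, '0'))  // '0101111100100000'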

Continuing the conversion: converting each hex digit of 5F20 to binary gives 0101 1111 0010 0000. Applying encoding rule 2 above, a common Chinese character takes 3 bytes, so n is 3: the first 3 bits of the first byte are set to 1 and the fourth bit to 0, giving the template 1110xxxx 10xxxxxx 10xxxxxx, where the x’s are the remaining unmentioned bits, to be filled in with the Unicode code (i.e. 0101 1111 0010 0000).

0101 1111 0010 0000  # Fill in from right to left
1110xxxx 10xxxxxx 10xxxxxx  # The last x takes the last bit above (0), the second-to-last x takes the second-to-last bit, and so on
11100101 10111100 10100000  # The filled-in binary
e5 bc a0  # Converted from binary to hexadecimal: e5 bc a0. Note this result; we will meet it again later

Note: anyone who can’t do the base conversions by hand can use an online converter.
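You can also check the whole result with the built-in TextEncoder, which produces UTF-8 bytes directly:

const bytes = new TextEncoder().encode('张')
console.log(bytes)                                // Uint8Array(3) [229, 188, 160]
console.log([...bytes].map(b => b.toString(16)))  // ['e5', 'bc', 'a0']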

URL encoding and decoding

I wonder if you’ve ever noticed what happens when we append a query string to a URL in the Chrome address bar, as shown below.

I typed name=张三. Then I copied the URL, opened a new tab, and pasted the address I had just copied.

The original Chinese characters 张三 became the unreadable string %E5%BC%A0%E4%B8%89. That’s because the browser has already encoded the string. (Paste that long string into an online URL encoder/decoder and you’ll see it decode back.) Most browsers today treat URL characters as UTF-8, and in UTF-8 one Chinese character is three bytes, i.e. three %xx groups: %xx%xx%xx. This should look a little familiar: the URL-encoded 张 is %E5%BC%A0, exactly the hexadecimal result we computed above.
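JS exposes the same conversion via encodeURIComponent and decodeURIComponent:

console.log(encodeURIComponent('张三'))                // '%E5%BC%A0%E4%B8%89'
console.log(decodeURIComponent('%E5%BC%A0%E4%B8%89'))  // '张三'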

Conclusion

There is a lot more to character encodings; I’ve only touched on the simplest concepts here. If you’re interested, dig deeper on your own. By the way, let me recommend a book I read a while back (though it seems I never finished it 😂): the English title is Code and the Chinese title is 《编码》. Interested readers should check it out.

Reference links:

  • baike.baidu.com/item/ASCII/…
  • baike.baidu.com/item/gb1803…
  • baike.baidu.com/item/Unicod…
  • blog.csdn.net/yingshukun/…
  • blog.csdn.net/zhusongziye…