A look at character sets and character encodings

This is day 25 of my participation in the November Writing Challenge. Check out the event details: The Last Writing Challenge of 2021

Today's programming world has many popular languages: JavaScript, Java, Python, C++, and more. One data type exists in all of them, the string, which shows how important it is and why it is worth studying. Without further ado, let's go!

What is a character set

A character set is a collection of characters, defined as an ordered set in which every character has a number. It assigns a number to each character the computer uses, forming a mapping (number => character).
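
As a minimal sketch of this mapping, Python's built-in ord and chr convert between a character and its number. The helper names char_to_number and number_to_char are hypothetical and exist only for illustration.

```python
def char_to_number(ch: str) -> int:
    """Return the number the character set assigns to one character."""
    return ord(ch)


def number_to_char(number: int) -> str:
    """Return the character that a number maps to."""
    return chr(number)


# The mapping is reversible: number -> character -> number.
for ch in ["A", "a", "0", "\n"]:
    number = char_to_number(ch)
    assert number_to_char(number) == ch
    print(repr(ch), "<->", number)
```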

There are many character sets, and they cover different numbers of characters. Common ones are as follows:

ASCII character set (American Standard Code for Information Interchange): contains 128 characters numbered from 0 to 127. Note that numbers 0 to 31 and 127 are non-printing control characters with special meanings. For example, number 10 is the newline (line feed) character.

ISO 8859 series: these are essentially extensions of the ASCII character set. Each keeps the 128 ASCII characters and adds its own characters above 127, so the members of the series differ from one another only in those added characters.
GB2312 character set: includes Chinese characters as well as Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters. It contains 6763 Chinese characters and 682 other characters. Note that the GB2312 character set itself does not include the ASCII characters.
GBK and GB18030 character sets: GBK extends GB2312 and contains 21886 Chinese characters and graphic symbols in total. GB18030 extends GBK further and includes 70244 Chinese characters.
Unicode character set: aims to include all the characters in the world in one unified character set. It has 17 planes (numbered 0 to 16), and each plane holds 65536 numbers (0 to 65535). Plane 0 contains the most commonly used characters in the world, so it is also called the Basic Multilingual Plane (BMP); its numbers range from 0000 to FFFF. A Unicode number is written as U+ followed by the hexadecimal number, for example U+0041 for the letter A.
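
The plane layout can be checked with a little arithmetic: since every plane holds 65536 numbers, dividing a character's number by 65536 gives its plane. The sketch below is only an illustration; plane_of and format_code_point are hypothetical helper names, and the sample characters are my own choice.

```python
PLANE_SIZE = 0x10000  # 65536 numbers per plane
PLANE_COUNT = 17      # planes 0 to 16


def plane_of(ch: str) -> int:
    """Return the index of the Unicode plane that holds the character."""
    return ord(ch) // PLANE_SIZE


def format_code_point(ch: str) -> str:
    """Format a character's number in the usual U+ hexadecimal notation."""
    return f"U+{ord(ch):04X}"


# "\u4e2d" is a common Chinese character; "\U0001F600" is an emoji outside the BMP.
for ch in ["A", "\u4e2d", "\U0001F600"]:
    print(format_code_point(ch), "is in plane", plane_of(ch))

print("total numbers:", PLANE_SIZE * PLANE_COUNT)
```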

With the concept of character sets covered, we need to look at another important concept: encoding.

What is encoding

Encoding defines how the numbers in a character set are mapped to binary and how binary is mapped back to the corresponding numbers. Different encoding methods produce different results. Before introducing them, we need one more important concept: the code element.

In a computer, the smallest addressable unit of data is the byte. Different encoding methods read a different number of bytes at a time. A code element (more commonly called a code unit) is the fixed-size group of bytes that an encoding reads at a time.

Now that we understand what a code element is, let's look at some common encoding methods:
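
To make the idea concrete, the sketch below counts how many code elements the same character needs under three encodings that are described later in this article. It assumes the code element sizes of 1, 2, and 4 bytes for UTF-8, UTF-16, and UTF-32, and uses Python's little-endian codec names so that no byte order mark is added; count_code_elements is a hypothetical helper.

```python
# Bytes per code element for each encoding (codec names are Python's).
CODE_ELEMENT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def count_code_elements(text: str, encoding: str) -> int:
    """Encode the text and return how many code elements it occupies."""
    data = text.encode(encoding)
    return len(data) // CODE_ELEMENT_SIZE[encoding]


for ch in ["A", "\u4e2d", "\U0001F600"]:
    counts = {name: count_code_elements(ch, name) for name in CODE_ELEMENT_SIZE}
    print(f"U+{ord(ch):04X}", counts)
```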

ASCII encoding: uses the ASCII character set, with a one-byte code element; every character in the set takes exactly 1 byte.
UTF-8 encoding: uses the Unicode character set, with a one-byte code element. Since one byte can represent at most 256 values, most characters need more than one byte. How does the computer tell whether a character takes a single byte or several? Here are the rules:
- If the first byte starts with 0, the character is a single-byte encoding (one code element)
- If the first byte starts with 110, the character is a two-byte encoding (two code elements)
- If the first byte starts with 1110, the character is a three-byte encoding (three code elements)
- And so on: a first byte starting with 11110 means a four-byte encoding (four code elements)

Note also that when a character is encoded in multiple bytes, every byte after the first must start with 10.
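
The rules above can be sketched as a small check on the leading bits of each byte. This is an illustration rather than a full UTF-8 validator; utf8_length_from_first_byte and is_continuation_byte are hypothetical helper names, and the results are compared against Python's own UTF-8 encoder.

```python
def utf8_length_from_first_byte(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judged by its first byte."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid first byte")


def is_continuation_byte(byte: int) -> bool:
    """Return True if the byte starts with 10, i.e. it is not a first byte."""
    return byte >> 6 == 0b10


for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    data = ch.encode("utf-8")
    assert utf8_length_from_first_byte(data[0]) == len(data)
    assert all(is_continuation_byte(b) for b in data[1:])
    print(f"U+{ord(ch):04X}", [f"{b:08b}" for b in data])
```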

UTF-16 encoding: uses the Unicode character set, with a two-byte code element. Since two bytes can represent at most 65536 values, characters in every plane other than the Basic Multilingual Plane are represented with two code elements (a surrogate pair). JavaScript strings use this encoding.
UTF-32 encoding: uses the Unicode character set, with a four-byte code element. Since 4 bytes can represent 4294967296 values and the Unicode character set has 65536 x 17 = 1114112 numbers, one code element is enough for any character. Every Unicode character therefore takes exactly four bytes, which wastes a great deal of space.
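
As a rough sketch of how UTF-16 splits a character outside the Basic Multilingual Plane into two code elements, the hypothetical helper below follows the standard surrogate pair arithmetic and checks the result against Python's own UTF-16 encoder. It also shows that UTF-32 always uses four bytes. It does not handle the reserved range D800 to DFFF.

```python
from typing import List


def to_utf16_code_elements(ch: str) -> List[int]:
    """Return the UTF-16 code elements (16-bit values) for one character."""
    code_point = ord(ch)
    if code_point < 0x10000:
        return [code_point]               # BMP: one code element
    offset = code_point - 0x10000         # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return [high, low]                    # surrogate pair


for ch in ["A", "\u4e2d", "\U0001F600"]:
    elements = to_utf16_code_elements(ch)
    as_bytes = b"".join(e.to_bytes(2, "big") for e in elements)
    assert as_bytes == ch.encode("utf-16-be")
    assert len(ch.encode("utf-32-be")) == 4  # UTF-32 is always 4 bytes
    print(f"U+{ord(ch):04X}", [f"{e:04X}" for e in elements])
```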

Conclusion

I used to confuse character sets with encoding methods and thought they were the same thing. After studying them recently, I realized how naive that was. I believe I will be able to handle character encoding problems more easily in future development, and that is what I gained from this.
