Preface

Many programmers do not understand character encoding very well. They may know terms and concepts such as ASCII, UTF-8, GBK, and Unicode, yet they still run into all kinds of strange encoding problems while writing code. In Java, garbled text is the most common symptom, while in Python development, encoding errors such as UnicodeDecodeError and UnicodeEncodeError are problems that almost every Python developer has encountered. This article starts from the origin of character encoding and explains how to deal with encoding problems in programming, so that you can locate, analyze, and solve character encoding problems with ease.

Before talking about character encoding, we first need to understand what encoding is and why it exists.

What is encoding

As anyone who has studied computers knows, a computer can only process binary data made of 0s and 1s. Any information that humans see or hear through a computer, including text, video, audio, and pictures, is stored and processed inside the computer in binary form. Computers are good at handling binary data, but humans are not. To reduce the cost of communication between people and computers, people decided to assign a number to each character. For example, the letter A is numbered 65, which in binary is 01000001. When A is stored in the computer, it is replaced by 01000001; when it is loaded into a file or a web page for display, it is converted back into the character A. This process involves converting data between different formats.

Encoding is the process of converting data from one form into another, and it is defined by a set of rules, in other words an algorithm. For example, converting the character A into 01000001 is an encoding process, and decoding is the reverse. What we discuss here is character encoding: the algorithms for converting between characters and binary data. Encryption and decryption in cryptography are sometimes also called encoding and decoding, but they are beyond the scope of this article.
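As a quick illustration, here is the round trip from character to number to binary and back, sketched in a Python interactive session:

>>> ord('A')         # the number assigned to the character
65
>>> bin(ord('A'))    # that number in binary
'0b1000001'
>>> chr(65)          # and back from the number to the character
'A'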

What is a character set

A character set is the collection of all abstract characters supported by a system; it is the general name for all kinds of characters and symbols. Common character sets include the ASCII character set, the GBK character set, and the Unicode character set. Different character sets define different ranges of characters: the ASCII character set contains only Latin characters, GBK adds Chinese characters, and Unicode covers all the characters in the world.

One cannot help asking: what, then, is the relationship between a character set and a character encoding? Don't worry, read on.

ASCII: Origin of character sets and character encodings

The first computer in the world was designed and developed in 1945 by two professors at the University of Pennsylvania, John Mauchly and J. Presper Eckert. Americans drafted the first character set and encoding standard for computers, called ASCII (American Standard Code for Information Interchange), which defines 128 characters and their binary representations. The 128 characters consist of the 26 displayable letters (in upper and lower case), the 10 digits, punctuation marks, and special control characters, in other words the common characters of English and other Western European languages. These 128 characters fit in a single byte: a byte can represent 256 values, but only seven bits were used at the time, and the highest bit was reserved for parity checking. Thus lowercase a corresponds to 01100001 and uppercase A to 01000001.

The ASCII character set consists of those 128 characters: letters, digits, punctuation marks, and control characters (carriage return, line feed, backspace, and so on). ASCII character encoding is the set of rules (the algorithm) that converts these 128 characters into binary data that the computer can recognize. To answer the earlier question: generally speaking, a character set defines a set of character encoding rules with the same name; ASCII, for example, defines both a character set and a character encoding. But this is not absolute. Unicode, for instance, defines only a character set, and the corresponding character encodings are UTF-8, UTF-16, and so on.

ASCII was developed by the American National Standards Institute and finalized in 1967. Originally a national standard of the United States, it was later adopted by the International Organization for Standardization (ISO) as an international standard, known as ISO 646, which applies to all Latin letters.

EASCII: extended ASCII

As computers spread, they came into use in Western European countries, whose languages contain many characters that are not in the ASCII character set. This greatly restricted their use of computers, much as if people in China could only communicate with each other in English. So they worked out how to extend the ASCII character set: ASCII uses only the first seven bits of a byte, and if the eighth bit is used as well, the number of representable characters becomes 256. This extension became known as Extended ASCII (EASCII), which added table-drawing symbols, calculation symbols, Greek letters, and special Latin symbols on top of ASCII.

However, EASCII never became a unified standard; each country had its own small scheme and wanted to put the high bit to its own use. MS-DOS and the IBM PC, for example, used coded character sets of their own definition. To end the confusion, the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) jointly developed a series of 8-bit character set standards called ISO 8859 (formally ISO/IEC 8859). It extends ASCII and is therefore fully compatible with it: the extended encodings use only the range 0xA0~0xFF (160~255 in decimal). In fact, ISO 8859 is the general name for a family of character sets, fifteen in all, ISO 8859-1 through ISO 8859-15. ISO 8859-1, also known as Latin-1, covers Western European languages; the others cover Central, Southern, and Northern European character sets, among others.

GB2312: Character set to meet Chinese needs

Later, computers became popular in China, and one of the problems faced was characters. Chinese characters are vast and profound; there are 3,500 commonly used Chinese characters alone, far beyond what the ASCII character set can represent, and even EASCII is not enough. In 1981 the Standardization Administration of the People's Republic of China issued a character set called GB2312, in which each Chinese character is encoded with two bytes. In theory two bytes can represent 65,536 characters, but GB2312 contains only 7,445: 6,763 Chinese characters and 682 other symbols. It is also compatible with ASCII: characters defined in ASCII still take up only one byte.

The Chinese characters in GB2312 cover 99.75% of everyday usage in mainland China, but rare characters, traditional characters, and the characters of many ethnic minorities still could not be handled, so GBK was later created on the basis of GB2312. GBK not only includes 27,484 Chinese characters, but also Tibetan, Mongolian, Uyghur, and other major minority scripts. GBK expands into the unused code space of GB2312, so it is fully compatible with GB2312 and ASCII. GB18030 is the most recent character set; it is compatible with GB2312-1980 and GBK and contains 70,244 Chinese characters in total. It uses a multi-byte encoding in which each character may occupy 1, 2, or 4 bytes, so in a sense it has room for about 1.61 million characters, covering traditional Chinese characters as well as Japanese and Korean characters. Its single-byte range is compatible with ASCII, and its double-byte range with GBK.
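A quick check of these sizes from a Python3 session (the bytes type makes the byte counts easy to see):

>>> 'a'.encode('gbk')        # ASCII characters still take one byte in GBK
b'a'
>>> '禅'.encode('gbk')       # Chinese characters take two bytes
b'\xec\xf8'
>>> '禅'.encode('gb18030')   # GB18030 is backward-compatible with GBK here
b'\xec\xf8'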

Unicode: The universal character set

Although China had its own character set and GBK encoding, many other countries and regions also had their own languages and scripts, such as Japan's JIS and Taiwan's BIG5. Communication between countries was very difficult because there was no unified encoding standard: the same character might be stored as different byte sequences in different places. In 1991 the International Organization for Standardization (ISO) and the Unicode Consortium launched the ISO/IEC 10646 (UCS) project and the Unicode project respectively, both aiming to unify all the characters in the world in a single character set. The two sides soon realized that the world did not need two incompatible character sets, so they met amicably and decided to merge their work: the projects would remain separate and each publish its own standard, but the standards would stay compatible. Because the name Unicode is easy to remember, it became the more widely used of the two and the de facto unified encoding standard.

Unicode is a character set that contains every character in the world, and each character has a unique code point value. It is not a character encoding; it is only a character set. Unicode characters can be encoded with UTF-8, UTF-16, or even GBK. For example:

>>> a = u"好"
>>> a
u'\u597d'
>>> b = a.encode("utf-8")
>>> b
'\xe5\xa5\xbd'
>>> b = a.encode("gbk")
>>> b
'\xba\xc3'

Unicode itself does not specify whether a character should be represented with one, two, three, or four bytes. It only specifies that each character corresponds to a unique code point, ranging from 0000 to 10FFFF, 1,114,112 values in all. How many bytes a character actually occupies depends on the encoding format: the character "A" takes 1 byte in UTF-8, 2 bytes in UTF-16, and 4 bytes in UTF-32.
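You can verify this in a Python3 session; the -le codec variants are used here to exclude the byte order mark that the bare 'utf-16' and 'utf-32' codecs prepend (byte order is discussed below):

>>> hex(ord('A'))    # the code point of 'A'
'0x41'
>>> len('A'.encode('utf-8'))
1
>>> len('A'.encode('utf-16-le'))
2
>>> len('A'.encode('utf-32-le'))
4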

UTF-8: Unicode encoding

Unicode Transformation Format (UTF) encodings and Universal Coded Character Set (UCS) encodings are the encoding methods of the Unicode and ISO/IEC 10646 systems respectively. UCS comes in two forms, UCS-2 and UCS-4, while the common UTF forms are UTF-8, UTF-16, and UTF-32. Because Unicode and UCS are kept compatible, there are equivalences between these encodings.

UCS-2 uses a fixed length of two bytes per character. UTF-16 also uses two bytes, but it is variable-length: when two bytes are not enough, it represents a character with four bytes, so UTF-16 can be seen as an extension of UCS-2. UTF-32 is exactly equivalent to UCS-4; it always uses 4 bytes per character, which wastes a lot of space.

UTF-8's defining feature is that its coding unit is a single byte and it represents a character with 1 to 4 bytes. From the first byte alone you can tell how many bytes a character's UTF-8 encoding occupies: if the first byte begins with 0, the character is encoded in a single byte; if it begins with 110, in two bytes; if with 1110, in three bytes, and so on. Apart from the single-byte case, every subsequent byte of a multi-byte UTF-8 sequence begins with 10.

The 1- to 4-byte UTF-8 encodings look like this (a short verification in Python follows the list of ranges below):

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  • Single-byte encodings cover the Unicode range \u0000~\u007F (0~127)

  • Two-byte encodings cover the Unicode range \u0080~\u07FF (128~2047)

  • Three-byte encodings cover the Unicode range \u0800~\uFFFF (2048~65535)

  • Four-byte encodings cover the Unicode range \u10000~\u1FFFFF (65536~2097151)
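You can verify the bit patterns from a Python3 session. The character 禅 (code point U+7985) falls in the three-byte range, so its UTF-8 bytes should match the 1110xxxx 10xxxxxx 10xxxxxx template:

>>> '禅'.encode('utf-8')
b'\xe7\xa6\x85'
>>> [format(b, '08b') for b in '禅'.encode('utf-8')]
['11100111', '10100110', '10000101']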

UTF-8 is compatible with ASCII, saves space in data transmission and storage, and has no byte order problem; both points are weaknesses of UTF-16. For Chinese characters, however, UTF-8 takes 3 bytes where UTF-16 takes only 2, and UTF-16 has the advantage of very fast string-length calculation and indexing. Java uses UTF-16 as its internal encoding scheme, Python3 defaults to UTF-8, and UTF-8 is the more widely used encoding on the Internet.

When saving a file on Windows (in Notepad, for example) you can choose which encoding format the system uses to store it: ANSI, Unicode, Unicode big endian, or UTF-8. ANSI here means the platform's default code page, a superset of ISO 8859-1 on Western systems (and GBK on Chinese systems), while what Windows calls "Unicode" is actually a UTF-16 encoding, more precisely UTF-16 little endian. So what are big endian and little endian?

Big endian and little endian

Byte order describes how the bytes of a multi-byte value are arranged in memory. In big-endian mode the most significant byte comes first and is stored at the lowest memory address, which matches the way humans read and write numbers; in little-endian mode it is the reverse, and the least significant byte comes first. For example, the hexadecimal value 0x12345678 is laid out in the two byte orders as follows:

big endian:    12 34 56 78
little endian: 78 56 34 12

Why do big endian and little endian exist at all? For 16-bit or 32-bit processors, the register width is greater than one byte, so the question arises of how to arrange multiple bytes, and different architectures order them differently. x86 and the common operating systems that run on it (Windows, FreeBSD, Linux) use little-endian mode, while classic Mac OS on PowerPC, for example, was big-endian. Hence big-endian and little-endian storage both exist, and neither is superior to the other.
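You can observe both layouts from Python3 with the standard struct module, which lets you choose the byte order explicitly ('>' means big endian, '<' little endian):

>>> import struct
>>> struct.pack('>I', 0x12345678).hex()   # big endian: high byte first
'12345678'
>>> struct.pack('<I', 0x12345678).hex()   # little endian: low byte first
'78563412'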

Why doesn't UTF-8 need to consider the byte order problem?

UTF-8's coding unit is a single byte, so byte order is not an issue. UTF-16 encodes Unicode characters in units of two bytes, so the order of the bytes within a unit must be agreed upon: it has to be determined which byte is the most significant and which the least.
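A short Python3 sketch of the difference; the code point of 禅 is U+7985, and the bare 'utf-16' codec prepends a byte order mark (BOM) so the reader can tell the order (output shown for a little-endian machine):

>>> '禅'.encode('utf-16-be')   # big endian: 0x79 then 0x85
b'y\x85'
>>> '禅'.encode('utf-16-le')   # little endian: 0x85 then 0x79
b'\x85y'
>>> '禅'.encode('utf-16')      # native order, with a BOM (\xff\xfe means little endian)
b'\xff\xfe\x85y'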

Character encoding in Python2

With the theory out of the way, let's turn to Python: character encoding there is one of the biggest and most common headaches for any Python developer. Python was created before the Unicode standard was published, and that legacy continued all the way up to Python2.7: the default encoding in Python2 is ASCII.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

Therefore, to include Chinese characters in Python2 source code properly, you must declare the file's encoding, for example UTF-8 or GBK, at the top of the file:

# coding=utf-8

or, equivalently:

#!/usr/bin/python
# -*- coding: utf-8 -*-

str and unicode

Having introduced characters, it is worth restating the difference between characters and bytes. A character is a symbol: a Chinese character, a letter, a digit, or a punctuation mark can each be called a character. A byte sequence is the result of encoding characters into binary; one byte is eight bits. For example, the character "p" is stored on disk as the binary sequence 01110000, occupying one byte. Bytes are convenient for storage and network transmission, while characters are for display and reading.

In Python2 the representation of characters and bytes is subtle, and the line between the two is blurred. Python2 divides strings into two types, unicode and str. Essentially a str is a binary byte sequence, while a unicode string is made of characters. In the following example (run in a GBK terminal, so the character 禅, "Zen", is stored as its GBK bytes), printing the str shows the hex bytes \xec\xf8, which correspond to the binary sequence 11101100 11111000.

>>> s = '禅'
>>> s
'\xec\xf8'
>>> type(s)
<type 'str'>

The unicode representation of 禅 is u'\u7985':

>>> u = u"禅"
>>> u
u'\u7985'
>>> type(u)
<type 'unicode'>

To save unicode characters to a file or send them over the network, we must first encode them into the binary str type. Python strings therefore provide an encode method to convert unicode to str, and a decode method for the reverse direction.

encode

>>> u = u"禅"
>>> u
u'\u7985'
>>> u.encode("utf-8")
'\xe7\xa6\x85'

decode

>>> s = "禅"
>>> s.decode("utf-8")
u'\u7985'

As long as you remember that str is essentially a string of binary data, that unicode is characters (symbols), and that encoding is the process of turning characters into binary data, it is easy to remember that converting unicode to str uses encode and the reverse uses decode.

As the rule is often stated: "encoding always takes a unicode string and returns a bytes sequence, and decoding always takes a bytes sequence and returns a unicode string."

Now that the conversion relationship between str and unicode is clear, let's look at when UnicodeEncodeError and UnicodeDecodeError occur.

UnicodeEncodeError

A UnicodeEncodeError occurs when a unicode string is converted into a str byte sequence. Consider this example, which saves a unicode string to a file:

# -*- coding: utf-8 -*-

def main():
    name = u"Python之禅"
    f = open("output.txt", "w")
    f.write(name)

The error log:

UnicodeEncodeError: ‘ascii’ codec can’t encode characters in position 6-7: ordinal not in range(128)

Why does a UnicodeEncodeError occur?

Because when you call the write method, the program must encode the characters into a binary byte sequence, so there is an implicit encoding step from unicode to str. The program first checks the type of the string: if it is str, it writes it to the file directly without encoding, because a str is itself a binary byte sequence. If the string is of type unicode, write first calls encode to convert it into a binary str and then saves it to the file, and in Python2, encode uses the ASCII codec by default.

This is equivalent to:

>>> u"Python之禅".encode("ascii")

However, we know that the ASCII character set contains only 128 Latin characters and no Chinese characters, hence the 'ascii' codec can't encode characters error. To use encode correctly, you must specify a character encoding whose character set contains Chinese characters, such as UTF-8 or GBK:

>>> u"Python之禅".encode("utf-8")
'Python\xe4\xb9\x8b\xe7\xa6\x85'
>>> u"Python之禅".encode("gbk")
'Python\xd6\xae\xec\xf8'

So to write a unicode string to a file correctly, convert it to UTF-8 or GBK first:

def main():
    name = u"Python之禅"
    name = name.encode('utf-8')
    with open("output.txt", "w") as f:
        f.write(name)

Or simply write a string of type str:

def main():
    name = 'Python之禅'
    with open("output.txt", "w") as f:
        f.write(name)

Of course, this is not the only way to write unicode strings to a file correctly, but the principle is always the same; it applies equally to writing strings to a database or sending them over the network, so I won't repeat it here.

UnicodeDecodeError

A UnicodeDecodeError occurs when a byte sequence of type str is decoded into a unicode string:

>>> a = u"禅"
>>> a
u'\u7985'
>>> b = a.encode("utf-8")
>>> b
'\xe7\xa6\x85'
>>> b.decode("gbk")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0x85 in position 2: incomplete multibyte sequence

The UnicodeDecodeError occurs because the UTF-8-encoded byte sequence '\xe7\xa6\x85' is being decoded as GBK: in GBK the character 禅 takes only two bytes, while in UTF-8 it takes three, so after GBK consumes two bytes there is one byte left over that it cannot parse. The key to avoiding UnicodeDecodeError is to decode with the same encoding that was used to encode.
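Continuing the same session, decoding with the matching codec succeeds:

>>> b.decode("utf-8")
u'\u7985'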

This also answers the question raised at the beginning of the article: the character 禅 may take 3 bytes or 2 bytes when saved to a file; which one depends on the encoding format specified when encode is executed.

Here is another example of a UnicodeDecodeError:

>>> x = u"Python"
>>> y = "之禅"
>>> x + y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

When a str is concatenated with a unicode string using +, Python implicitly converts (decodes) the str byte sequence into the same unicode type as x. But Python uses the default ASCII codec for this conversion, and the ASCII character set does not contain Chinese characters, so an error is raised. The concatenation is equivalent to:

>>> y.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

The correct approach is to decode y to unicode explicitly, with a character encoding that actually contains Chinese characters, such as UTF-8 or GBK:

>>> x = u"Python"
>>> y = "之禅"
>>> y = y.decode("utf-8")
>>> x + y
u'Python\u4e4b\u7985'

Strings and byte sequences in Python3

Python3 reworked strings and character encoding so thoroughly that it is incompatible with Python2, which has caused trouble for many projects migrating to Python3. Python3 sets the system default encoding to UTF-8 and makes the distinction between characters and binary byte sequences much clearer: they are represented by str and bytes respectively. Text is always of type str, which can represent every character in the Unicode character set, while binary data is represented by the new bytes type. (Python2 also had a bytes type, but it was merely an alias for str.)

str

>>> a = "a"
>>> a
'a'
>>> type(a)
<class 'str'>
>>> b = "禅"
>>> b
'禅'
>>> type(b)
<class 'str'>

bytes

Python3 marks a bytes object by putting a b prefix before the quoted literal; a bytes object is in fact a sequence of binary bytes. A bytes literal may contain ASCII characters and hexadecimal escapes, but not non-ASCII characters such as Chinese characters.

>>> c = b'a'
>>> c
b'a'
>>> type(c)
<class 'bytes'>
>>> d = b'\xe7\xa6\x85'
>>> d
b'\xe7\xa6\x85'
>>> type(d)
<class 'bytes'>
>>> e = b'禅'
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.

The bytes type supports the same operations as str, including slicing, indexing, and basic numeric operations. But str and bytes cannot be concatenated with the + operator, even though Python2 allows it.

>>> b"a"+b"c"
b'ac'
>>> b"a"*2
b'aa'
>>> b"abcdef\xd6"[1:]
b'bcdef\xd6'
>>> b"abcdef\xd6"[-1]
214

>>> b"a" + "b"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str

Python2 versus Python3: bytes and characters

python2    python3    represents    conversion    purpose
str        bytes      bytes         encode        storage
unicode    str        characters    decode        display
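A minimal Python3 round trip illustrating the table: encode turns characters into bytes for storage, and decode turns bytes back into characters for display.

>>> "禅".encode('utf-8')              # str -> bytes (storage)
b'\xe7\xa6\x85'
>>> b'\xe7\xa6\x85'.decode('utf-8')   # bytes -> str (display)
'禅'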

Conclusion

  1. Character encoding is essentially the process of converting characters into bytes

  2. The evolution of character sets: ASCII, EASCII, ISO 8859-X, GB2312, ..., Unicode

  3. Unicode is a character set; its corresponding encoding formats include UTF-8 and UTF-16

  4. Multi-byte sequences may be stored in big-endian or little-endian byte order

  5. In Python2, characters and bytes are represented by the unicode and str types respectively

  6. In Python3, characters and bytes are represented by the str and bytes types respectively


This article was first published on GitChat and may not be reproduced without authorization; contact GitChat for reprint permission.


Transcript: Liu Zhijun, "Analyzing the Past and Present of Character Encoding"