Character encoding is an unavoidable topic in programming, whether you use Python 2, Python 3, C++, Java, or any other language, so it is well worth getting the concepts straight. This article covers the following parts:

  • Basic concepts
  • Introduction to common character encodings
  • The default encoding for Python
  • String types in Python 2
  • The root cause of UnicodeEncodeError and UnicodeDecodeError

Basic concepts

  • Character

In computing and telecommunications, a character is a unit of information. It is a general term for letters, digits, punctuation marks, graphic symbols, and the characters of various scripts. For example, a Chinese character, an English letter, and a punctuation mark are each one character.

  • Character Set

A character set is a collection of characters. There are many character sets, each containing a different number of characters. Common examples include the ASCII character set, the GB2312 character set, and the Unicode character set. The ASCII character set has 128 characters, covering printable characters (such as upper and lower case English letters and Arabic digits) and control characters (such as space and carriage return). The GB2312 character set is the Chinese national standard for simplified Chinese, containing simplified Chinese characters, common symbols, digits, and so on. The Unicode character set contains the characters used by all of the world's writing systems.

  • Character Encoding

Character encoding refers to mapping the characters of a character set to specific binary numbers so that computers can process them. Common character encodings include ASCII, UTF-8, and GBK. In everyday usage, character set and character encoding are often treated as synonyms; ASCII, for example, names both the set of characters and the corresponding encoding.

Let’s use a table to summarize:

| Concept            | Description                                                             | Examples                                                          |
|--------------------|-------------------------------------------------------------------------|-------------------------------------------------------------------|
| Character          | A unit of information; a general term for letters, digits, and symbols | '中', 'a', '1', '$', '¥', ...                                      |
| Character set      | A collection of characters                                              | ASCII character set, GB2312 character set, Unicode character set   |
| Character encoding | Maps the characters of a character set to specific binary numbers      | ASCII encoding, GB2312 encoding, Unicode encoding                  |
| Byte               | The unit of data storage in a computer, an 8-bit binary number         | 0x01, 0x45, ...                                                    |
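To make the distinction concrete, here is a minimal Python 2 sketch (assuming a UTF-8 terminal) showing one character, its code point in the Unicode character set, and the bytes produced by a character encoding:

>>> u'中'                                    # a character, held as a Unicode code point
u'\u4e2d'
>>> u'中'.encode('utf-8')                    # the UTF-8 character encoding turns it into bytes
'\xe4\xb8\xad'
>>> len(u'中'), len(u'中'.encode('utf-8'))   # one character, three bytes
(1, 3)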

Introduction to common character encodings

Common character encodings are ASCII, GBK, Unicode, UTF-8 and so on. Here, we focus on ASCII, Unicode, and UTF-8.

ASCII

Computers were born in the United States, in an English-speaking environment where the only characters that mattered were English letters, digits, and a handful of common symbols.

In the 1960s, the United States developed a character encoding scheme that specified how English letters, digits, and some common symbols map to binary, known as ASCII (American Standard Code for Information Interchange).

For example, the binary representation of the uppercase letter A is 01000001 (decimal 65), the binary representation of the lowercase letter a is 01100001 (decimal 97), and the binary representation of the space character is 00100000 (decimal 32).
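These values are easy to verify in a Python session (the snippet below works the same in Python 2 and Python 3):

>>> ord('A'), ord('a'), ord(' ')    # character -> ASCII code
(65, 97, 32)
>>> chr(65), bin(65)                # code -> character, and its binary form
('A', '0b1000001')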

Unicode

ASCII defines only 128 characters, which is enough in the United States. But as computers spread to Europe, Asia, and the rest of the world, whose languages differ greatly from English, ASCII was far from enough to represent them, so different countries and regions developed their own encoding schemes, such as GB2312 and GBK in mainland China and Shift_JIS in Japan.

Each country and region can define its own encoding scheme, but when data is exchanged between computers that assume different encodings, the result is all kinds of mojibake (garbled text), which is undoubtedly a disaster.
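A quick way to see the problem in Python 2 (a sketch): the same character maps to different bytes under GBK and UTF-8, so bytes written with one encoding cannot be read back with the other:

>>> u'中'.encode('gbk')          # "中" under GBK: two bytes
'\xd6\xd0'
>>> u'中'.encode('utf-8')        # the same character under UTF-8: three bytes
'\xe4\xb8\xad'
>>> '\xd6\xd0'.decode('utf-8')   # interpreting GBK bytes as UTF-8 fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte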

What can be done? The idea is to unify all of the world's languages under a single scheme, called Unicode, which assigns a unique code to every character in every language, so that text processing works across languages and platforms. Isn't that great?

Unicode 1.0 was released in October 1991, and the standard continues to be revised, with more characters added in each new release. As of this writing, the latest version is 9.0.0, released on June 21, 2016.

The Unicode standard writes code points as hexadecimal numbers prefixed with U+, such as U+0041 for the uppercase letter "A" and U+4E25 for the Chinese character "严". For the full mapping tables, see unicode.org or a dedicated Chinese character mapping table.
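In Python 2 you can move between characters and code points with \u escapes, ord(), and unichr() (a small sketch, assuming a UTF-8-capable terminal):

>>> u'\u0041'              # U+0041 is the uppercase letter A
u'A'
>>> print u'\u4e25'        # U+4E25 is the character 严
严
>>> hex(ord(u'严'))        # character -> code point
'0x4e25'
>>> unichr(0x4e25)         # code point -> character
u'\u4e25'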

UTF-8

Unicode looks perfect and unified. However, it comes with one big problem: wasted storage.

Why? To represent all of the world's characters, Unicode originally used two bytes per character, and later four when two bytes were no longer enough. For example, the code point of the Chinese character "严" is the hexadecimal number 4E25, which is fifteen bits in binary (100111000100101), so it needs at least two bytes, while other characters may need three or four bytes, or even more.

The question then becomes: wouldn't it be wasteful to store plain ASCII characters this way? The binary code of the uppercase letter "A", for example, is 01000001, which needs only one byte; if every character were stored in three or four bytes, the leading bytes of "A" would all be zeros, wasting storage space.

To solve this problem, UTF-16, UTF-32, and UTF-8 were created on top of Unicode. Here we focus on UTF-8.

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode that uses one to four bytes per character: ASCII characters keep their one-byte encoding, Arabic, Greek, and similar scripts use two bytes, common Chinese characters use three bytes, and so on.

Therefore, we say that UTF-8 is one of the implementations of Unicode. Other implementations include UTF-16 (characters are represented in two or four bytes) and UTF-32 (characters are represented in four bytes).
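The byte counts are easy to check by encoding an ASCII letter and a Chinese character under each scheme (a Python 2 sketch; the -le variants are used so no byte order mark is included in the count):

>>> for enc in ('utf-8', 'utf-16-le', 'utf-32-le'):
...     print enc, len(u'A'.encode(enc)), len(u'\u4e25'.encode(enc))
...
utf-8 1 3
utf-16-le 2 2
utf-32-le 4 4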

The default encoding for Python

The default encoding for Python 2 is ASCII, and the default encoding for Python 3 is UTF-8; you can check it as follows:

  • Python2
Python 2.7.11 (default, Feb 24 2016, 10:48:05)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
  • Python3
Python 3.5.2 (default, Jun 29 2016, 13:43:58)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

String types in Python 2

Python 2 has two string-related types, str and unicode, whose common parent class is basestring. A str holds bytes in some concrete encoding (ASCII by default, or GBK, UTF-8, and so on), while unicode strings are written with a u'...' prefix.

Conversion between the two types is summarized below (see the round-trip sketch after the list):

  • To convert a UTF-8 encoded str 'xxx' to a unicode string u'xxx', use the decode('utf-8') method:
>>> '中'.decode('utf-8')
u'\u4e2d'
  • To convert a unicode string u'xxx' to a UTF-8 encoded str 'xxx', use the encode('utf-8') method:
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
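Putting the two directions together, here is a round trip between the two types (a sketch that spells out the UTF-8 bytes so it works regardless of the terminal encoding):

>>> s = '\xe4\xb8\xad\xe6\x96\x87'      # str: the UTF-8 bytes of "中文"
>>> u = s.decode('utf-8')               # str -> unicode
>>> u
u'\u4e2d\u6587'
>>> u.encode('utf-8') == s              # unicode -> str gives back the same bytes
True
>>> isinstance(s, str), isinstance(u, unicode), isinstance(s, basestring)
(True, True, True)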

The root cause of UnicodeEncodeError and UnicodeDecodeError

Python 2 programs often run into UnicodeEncodeError and UnicodeDecodeError, which are caused by mixing str and unicode strings in the same code. These errors typically occur when Python implicitly encodes a unicode string or decodes a str using the default ASCII codec.

Here are two common scenarios that we’d better keep in mind:

  • UnicodeDecodeError typically occurs in string operations that mix str and unicode: Python 2 implicitly decodes the str into unicode using ASCII, which fails for non-ASCII bytes.

Let’s look at an example:

>>> s = '你好'          # str type, UTF-8 encoded
>>> u = u'世界'         # unicode type
>>> s + u               # implicit conversion, equivalent to s.decode('ascii') + u
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

To avoid the error, we need to decode explicitly with 'utf-8', as follows:

>>> s = '你好'                      # str type, UTF-8 encoded
>>> u = u'世界'                     # unicode type
>>> s.decode('utf-8') + u           # explicitly decode with 'utf-8'
u'\u4f60\u597d\u4e16\u754c'         # the result is a unicode string
  • UnicodeEncodeError typically occurs when a function, class, or other object expects a str but is passed a unicode string: Python 2 implicitly encodes it to str using ASCII, which fails for non-ASCII characters.

Let’s look at an example:

>>> u_str = u'你好'
>>> str(u_str)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

In the code above, u_str is a unicode string. Since str() needs to produce a byte string (str), Python 2 tries to encode u_str with the default ASCII codec, i.e.:

u_str.encode('ascii')    # u_str is a unicode string

Since ASCII cannot represent Chinese characters, the encoding fails with a UnicodeEncodeError.
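To avoid this, do the conversion explicitly with an encoding that actually covers the characters, instead of relying on the implicit ASCII default (a minimal sketch):

>>> u_str = u'你好'
>>> u_str.encode('utf-8')             # explicit UTF-8 encoding succeeds
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> u_str.encode('ascii', 'replace')  # or degrade to '?' if a pure ASCII str is required
'??'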

Let's look at another example using raw_input. Note that raw_input expects its prompt argument to be a str:

>>> name = raw_input('input your name: ')
input your name: Ethan
>>> name
'Ethan'
>>> name = raw_input('输入你的名字: ')
输入你的名字: 小明
>>> name
'\xe5\xb0\x8f\xe6\x98\x8e'
>>> type(name)
<type 'str'>
>>> name = raw_input(u'输入你的名字: ')    # Python 2 tries u'输入你的名字: '.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
>>> name = raw_input(u'输入你的名字: '.encode('utf-8'))   # encode the prompt explicitly
输入你的名字: 小明
>>> name
'\xe5\xb0\x8f\xe6\x98\x8e'
>>> type(name)
<type 'str'>
>>> name = raw_input(u'输入你的名字: '.encode('utf-8')).decode('utf-8')   # recommended
输入你的名字: 小明
>>> name
u'\u5c0f\u660e'
>>> type(name)
<type 'unicode'>

Let’s look at another example of redirection:

# -*- coding: utf-8 -*-
# hello.py
hello = u'你好'
print hello

Running python hello.py in a terminal prints fine, but redirecting the output to a file with python hello.py > result raises a UnicodeEncodeError.

This is because print uses the console's default encoding when writing to the console, but when the output is redirected to a file, print does not know what encoding to use, so it falls back to the default ASCII encoding, causing the error.

It should read:

# -*- coding: utf-8 -*-
# hello.py
hello = u'你好'
print hello.encode('utf-8')

Now python hello.py > result runs without problems.

Summary

  • UTF-8 is a variable-length character encoding for Unicode and one of its implementations.
  • The Unicode character set has several encoding forms, such as UTF-8, UTF-7, and UTF-16.
  • When a string operation mixes str and unicode, Python 2 implicitly decodes the str into unicode (using ASCII by default) before performing the operation.
  • If a function, class, or other object expects a str but is passed a unicode string, Python 2 implicitly encodes it to str using ASCII by default.
