UnicodeEncodeError, UnicodeDecodeError, UnicodeEncodeError, UnicodeDecodeError, UnicodeEncodeError, UnicodeDecodeError, UnicodeEncodeError, UnicodeDecodeError, UnicodeEncodeError, UnicodeDecodeError, UnicodeEncodeError, UnicodeDecodeError. It’s hard to remember whether decode or encode is the method that converts str to unicode. What is going on?

To get to the bottom of this, I decided to dig into the details of how Python strings are composed and how character encoding works.

Bytes and characters

All the data stored in a computer, whether text, pictures, video, audio, or software, is ultimately a sequence of bytes made of 0s and 1s. One byte equals eight bits.

A character is a symbol, such as a Chinese character, an English letter, a number, or a punctuation mark.

Bytes are convenient for storage and network transfer, while characters are meant for display and reading. The character “P”, for instance, is saved to disk as the binary string 01010000, which occupies one byte.

Encoding and decoding

The text we open in an editor, the characters we see one after another, is ultimately saved on disk as a sequence of binary bytes. The conversion from characters to bytes is called encoding, and the reverse is called decoding; the two are inverse operations. Encoding is for storage and transmission; decoding is for display and reading.

For example, the character “P” is saved to disk as the binary string 01010000, one byte long. The character “禅” (zen), on the other hand, may be stored as “11100111 10100110 10000101”, three bytes long. Why? I’ll come back to that later.
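The bit patterns of encoded characters are easy to inspect directly. The following is a minimal sketch, assuming ASCII for “P” and UTF-8 for “禅” (written as the escape \u7985); the helper name bits is made up for illustration, and the code is written to run on either Python 2 or Python 3.

```python
# -*- coding: utf-8 -*-

def bits(data):
    """Render a byte sequence as space-separated groups of 8 bits."""
    # bytearray yields integers on both Python 2 and Python 3
    return " ".join(format(b, "08b") for b in bytearray(data))

p_bits = bits(u"P".encode("ascii"))          # one byte
zen_bits = bits(u"\u7985".encode("utf-8"))   # u"\u7985" is the character 禅

print(p_bits)
print(zen_bits)
```

The first value is a single 8-bit group and the second is three groups, one per byte of the UTF-8 encoding.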

Why is encoding in Python so painful? Of course, you can’t blame the developers.

This is because Python2 uses ASCII as its default encoding, and ASCII cannot represent Chinese. So why not use UTF-8? For one thing, Guido wrote his first line of Python code in the winter of 1989, and the first version was officially released as open source in February 1991, while Unicode was only released in October 1991. UTF-8 simply did not exist yet when Python was created.

For another, Python2 has two string types, unicode and str, which leaves developers confused. Python3 reworks strings completely and keeps only one type, but that’s another story.

str and unicode

Python2 divides strings into unicode and str. A str is essentially a sequence of binary bytes. The following example shows that “禅” as a str is printed as the hex escape \xec\xf8 (its GBK bytes in this session), which corresponds to the binary byte sequence ‘11101100 11111000’.

>>> s = '禅'
>>> s
'\xec\xf8'
>>> type(s)
<type 'str'>

The unicode string u'禅' corresponds to the Unicode code point u'\u7985'.

>>> u = u"禅"
>>> u
u'\u7985'
>>> type(u)
<type 'unicode'>

To save Unicode characters to a file or send them over the network, we need to encode them to str. Python therefore provides the encode method to convert unicode to str, and the decode method to go the other way.

encode

>>> u = u"禅"
>>> u
u'\u7985'
>>> u.encode("utf-8")
'\xe7\xa6\x85'

decode

>>> s = "禅"
>>> s.decode("utf-8")
u'\u7985'

The trick is to remember that str is essentially binary data and unicode is characters (symbols). Encoding is the process of converting characters into binary data, so unicode to str uses encode, and str to unicode uses decode.

“Encoding always takes a Unicode string and returns a bytes sequence, and decoding always takes a bytes sequence and returns a Unicode string.”
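The rule can be captured in a pair of small helpers. This is only a sketch: the names to_bytes, to_text, text_type and bytes_type are hypothetical, UTF-8 as the default is an assumption, and the version check exists so the same code runs on Python 2 (unicode/str) and Python 3 (str/bytes).

```python
# -*- coding: utf-8 -*-
import sys

# On Python 2 the text type is `unicode` and the byte type is `str`;
# on Python 3 they are `str` and `bytes`.
if sys.version_info[0] == 2:
    text_type = unicode  # noqa: F821 (name only exists on Python 2)
    bytes_type = str
else:
    text_type = str
    bytes_type = bytes


def to_bytes(value, encoding="utf-8"):
    """Encode text to bytes; pass byte strings through unchanged."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


def to_text(value, encoding="utf-8"):
    """Decode bytes to text; pass text through unchanged."""
    if isinstance(value, bytes_type):
        return value.decode(encoding)
    return value


encoded = to_bytes(u"\u7985")   # text -> bytes (encode)
decoded = to_text(encoded)      # bytes -> text (decode)
```

Checking the type before converting avoids calling encode on bytes or decode on text, which is exactly where Python2's implicit ASCII conversions tend to bite.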

Now that the conversion relationship between str and unicode is clear, let’s see when UnicodeEncodeError and UnicodeDecodeError occur.

UnicodeEncodeError

UnicodeEncodeError occurs when a Unicode string is converted to a STR byte sequence. Consider an example of saving a string of Unicode strings to a file

# -* -coding :utf-8 -*-def main(): name = u'Python 'f = open("output.txt", "w") f.write(name)Copy the code

The error log

UnicodeEncodeError: ‘ascii’ codec can’t encode characters in position 6-7: ordinal not in range(128)

Why does a UnicodeEncodeError occur?

This is because Python determines the type of the string when calling the write method. If it is STR, it is written directly to the file, without encoding, because STR is itself a binary sequence of bytes.

If the string is of Unicode type, it saves the unicode string to a file by calling encode, which uses Python’s default ASCII code, to convert it to a binary STR type

Is equivalent to:

>>> u"Python zen ". Encode (" ASCII ")Copy the code

However, we know that the ASCII character set contains only 128 Latin characters, excluding Chinese characters, so the error of “ASCII” Codec can’t encode characters appears. To use encode correctly, you must specify a character set that contains Chinese characters, such as UTF-8 and GBK.

> > > u "zen Python" encode (" utf-8 ") 'Python \ xe4 \ xb9 \ x8b \ xe7 \ xa6 \ x85' > > > u "zen Python". Encode (" GBK ") 'Python \ xd6 \ xae \ xec \ xf8'Copy the code

To write Unicode strings to files correctly, you should pre-convert the strings to UTF-8 or GBK encoding.

Def main(): name = name. Encode ('utf-8') with open("output.txt", "w") as f: name.Copy the code

Of course, there is more than one way to write Unicode strings to files correctly, but the principle is the same, and I won’t cover it here, for writing strings to databases and transferring them to networks

UnicodeDecodeError

UnicodeDecodeError occurs when a str byte sequence is decoded into a unicode string.

>>> a = u"禅"
>>> a
u'\u7985'
>>> b = a.encode("utf-8")
>>> b
'\xe7\xa6\x85'
>>> b.decode("gbk")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0x85 in position 2: incomplete multibyte sequence

The UnicodeDecodeError occurs because the UTF-8 encoded byte sequence '\xe7\xa6\x85' is being decoded as GBK. GBK represents a Chinese character in two bytes, whereas UTF-8 uses three for this one, so the GBK decoder consumes the first two bytes as one character and is left with a single extra byte that it cannot parse. The key to avoiding UnicodeDecodeError is to decode with the same encoding that was used to encode.

This also answers the question from the beginning of the article: the character “禅” may take three bytes or two bytes when saved to a file, depending on which encoding is specified when encode is called.
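The size difference and the mismatch can be checked side by side. This is a minimal sketch with illustrative variable names; it uses the escape \u7985 for “禅” and runs on Python 2 or Python 3.

```python
# -*- coding: utf-8 -*-
zen = u"\u7985"  # the character 禅

utf8_bytes = zen.encode("utf-8")
gbk_bytes = zen.encode("gbk")
sizes = (len(utf8_bytes), len(gbk_bytes))  # bytes used by each encoding

try:
    utf8_bytes.decode("gbk")  # wrong codec for these bytes
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc.reason

# Decoding with the codec that produced the bytes round-trips cleanly.
round_trip_ok = (
    utf8_bytes.decode("utf-8") == zen and gbk_bytes.decode("gbk") == zen
)

print(sizes, mismatch_error, round_trip_ok)
```

Note that a mismatched codec does not always raise; for other byte sequences it can silently produce the wrong characters, which is another reason to record which encoding was used.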

Here is another example of UnicodeDecodeError:

>>> x = u"Python"
>>> y = "之禅"
>>> x + y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

When a str is concatenated with a unicode string using +, Python implicitly converts (decodes) the str byte sequence to unicode, the same type as x. It uses the default ASCII encoding for this conversion, and since ASCII does not contain Chinese characters, an error is raised. The implicit step is equivalent to:

>>> y.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

The correct way is to decode y explicitly with the encoding its bytes were written in (UTF-8 here; GBK if the bytes were GBK).

>>> x = u"Python"
>>> y = "之禅"
>>> y = y.decode("utf-8")
>>> x + y
u'Python\u4e4b\u7985'
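The same failure and fix can be written as a script. In this sketch y is spelled as an explicit bytes literal holding the UTF-8 bytes of “之禅”, an assumption made so the code behaves identically on Python 2 and Python 3; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
x = u"Python"
y = b"\xe4\xb9\x8b\xe7\xa6\x85"  # UTF-8 bytes of u"之禅"

# This is the conversion Python2 attempts implicitly for x + y.
try:
    y.decode("ascii")
    ascii_failure = None
except UnicodeDecodeError as exc:
    # Record where decoding stopped and which byte caused it.
    ascii_failure = (exc.start, hex(bytearray(y)[exc.start]))

# Decoding explicitly with the right codec makes both operands text.
combined = x + y.decode("utf-8")

print(ascii_failure, repr(combined))
```

Decoding bytes as soon as they enter the program, and encoding only when they leave, keeps mixed-type expressions like x + y from arising in the first place.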

All of this is based on Python2. There will be a separate article about Python3 characters and encodings, so stay tuned.

