Article source: UnicodeEncodeError
Encoding and decoding in Python 2 is the conversion between unicode and str. Encoding is unicode -> str, whereas decoding is str -> unicode. All that remains is to determine when to encode or decode.
First, there is the "coding declaration" at the beginning of the file, which is the # -*- coding: -*- statement. Python 2 scripts are treated as ASCII by default, so use the coding declaration when the file contains characters outside the ASCII range.
As for the default encoding reported by sys.getdefaultencoding(), it is used when a decode happens without an explicitly specified codec. For example, I have the following code:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'  # Note that s is of type str, not unicode
s.encode('gb18030')
This code re-encodes s into GB18030, which is a unicode -> str conversion.
Since s is itself a str, Python first decodes s to unicode automatically and then encodes the result as GB18030. Because the decoding is done implicitly, we never specified a codec, so Python uses the one reported by sys.getdefaultencoding(). In Python 2 this is usually ASCII, and an error occurs if s contains bytes outside the ASCII range:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
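The implicit step can also be reproduced explicitly. The following is a minimal sketch, assuming the source file is saved as UTF-8; the variable names are illustrative and not from the original script. It decodes the UTF-8 bytes of '中文' with the ASCII codec, which is what Python 2 does behind the scenes, and it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: the names below are not from the original script.

# The raw bytes that the literal '中文' occupies in a UTF-8 source file.
raw = u'中文'.encode('utf-8')

try:
    # Equivalent to the implicit decode Python 2 performs before encoding.
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    # The ASCII codec rejects the first byte, which is outside range(128).
    message = str(exc)

print(message)
```

The first byte of the UTF-8 form of '中' is 0xe4, which is why the error points at position 0.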
In this case, there are two ways to correct the error. The first is to explicitly specify the encoding of s:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('gb18030')
The second is to change the default encoding to match the file's encoding:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)  # Python 2.5 deletes sys.setdefaultencoding after initialization, so we reload sys to restore it
sys.setdefaultencoding('utf-8')
s = '中文'  # named s rather than str, to avoid shadowing the built-in type
s.encode('gb18030')
After that, print "Addr:", form["addr"].value.decode('gb2312').encode('utf-8') succeeded.
To summarize why I wrote this:
1. When the data you retrieve does not match the encoding declared in your current script, convert the encoding.
2. When converting, first decode the data to unicode using its own encoding, then encode that unicode as UTF-8.
3. Why did my browser send GB2312-encoded data to the server? It is probably related to the system encoding of the client.
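The conversion described in the summary can be sketched as a small helper. This is only an illustration under stated assumptions: transcode is a hypothetical name, not a library function, and the GB2312 input is simulated here rather than taken from a real form submission.

```python
# -*- coding: utf-8 -*-

def transcode(data, source_encoding, target_encoding='utf-8'):
    """Decode bytes with their own encoding, then re-encode them.

    `transcode` is a hypothetical helper name, not part of any library.
    """
    text = data.decode(source_encoding)   # bytes -> unicode
    return text.encode(target_encoding)   # unicode -> bytes


# Simulate data that arrived from a browser in GB2312.
gb_bytes = u'中文'.encode('gb2312')
utf8_bytes = transcode(gb_bytes, 'gb2312')
print(repr(utf8_bytes))
```

Naming the source encoding at the call site keeps the decode step explicit, so the result does not depend on the interpreter's default encoding.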
The error from my crawler:
Traceback (most recent call last):
  File "E:/workspace/webCrawler/day04/01��ȡС˵.py", line 56, in <module>
    getText(url)
  File "E:/workspace/webCrawler/day04/01��ȡС˵.py", line 41, in getText
    fileName = i.decode('utf-8')
  File "G:\tools\python2.7.12\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-8: ordinal not in range(128)
Adding the following code fixes it:
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )
After that, it works fine.
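The traceback shows an encode error raised from a decode call. In Python 2 this happens when the value is already unicode: Python first encodes it with the default ASCII codec so that there are bytes to decode. A possible alternative to changing the default encoding, sketched here under the assumption that i may be either bytes or unicode, is to decode only when the value is actually bytes. The helper name to_text and the sample string are hypothetical.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding='utf-8'):
    """Return `value` as unicode text, decoding only when it is bytes.

    `to_text` is a hypothetical helper name used for illustration.
    """
    if isinstance(value, bytes):   # `bytes` is an alias of str on Python 2
        return value.decode(encoding)
    return value                   # already text, so no decode is needed


already_text = u'第一章'                    # e.g. text from a parser
from_bytes = u'第一章'.encode('utf-8')      # e.g. raw bytes from a response

print(to_text(already_text) == to_text(from_bytes))
```

With a check like this, the line in getText could call the helper instead of i.decode('utf-8'), and the interpreter's default encoding would not need to be touched.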