Common encoding problem: UnicodeEncodeError

Article source: UnicodeEncodeError

In Python 2, encoding and decoding are conversions between the two string types, unicode and str. Encoding is unicode -> str, whereas decoding is str -> unicode. All that remains is to determine when to encode and when to decode. Start with the "encoding declaration" at the top of the file, that is, the # -*- coding: ... -*- line. Python 2 assumes a script is ASCII by default; the encoding declaration tells the interpreter how to read characters in the file that fall outside the ASCII range. As for the default encoding (the value returned by sys.getdefaultencoding()), it is used whenever a decode happens without an explicitly specified codec. For example, take the following code:

 #! /usr/bin/env python
 # -*- coding: utf-8 -*-
 s = '中文'  # Note that s is of type str, not unicode
 s.encode('gb18030')

This code re-encodes s into GB18030, which is a unicode -> str conversion.

Since s is itself a str, Python automatically decodes s to unicode first and then encodes it to GB18030. Because that decoding is done implicitly, we never specified a codec, so Python uses the default encoding reported by sys.getdefaultencoding(). In Python 2 this is normally ASCII, and an error occurs if s contains non-ASCII bytes: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
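The implicit step can be made visible by performing it by hand. The sketch below is an illustration, not code from the original script: it builds the UTF-8 bytes of 中文 and decodes them with the ascii codec, which is in effect what Python 2 does behind the scenes when encode is called on a byte string. The names raw and message are made up for this example, and the explicit u'' and byte strings are there so that it should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes of the two characters 中文, i.e. what a Python 2 str literal
# holds when the source file is saved as UTF-8.
raw = u'\u4e2d\u6587'.encode('utf-8')

try:
    # In effect, Python 2 runs this decode before s.encode('gb18030').
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The first byte of 中 in UTF-8 is 0xe4, which is why the error points at position 0.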

In this case there are two ways to correct the error. The first is to state the encoding of s explicitly:

#! /usr/bin/env python 
# -*- coding: utf-8 -*- 

s = '中文'
s.decode('utf-8').encode('gb18030') 
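As a rough check of what the explicit version produces, the two steps can be separated and the result inspected. This is only a sketch: utf8_bytes stands in for the str value of s, and the variable names are invented here.

```python
# -*- coding: utf-8 -*-
utf8_bytes = u'\u4e2d\u6587'.encode('utf-8')  # stand-in for s = '中文'

# Step 1: bytes -> Unicode, naming the real source encoding.
text = utf8_bytes.decode('utf-8')
# Step 2: Unicode -> bytes in the target encoding.
gb_bytes = text.encode('gb18030')

print(repr(gb_bytes))
# Decoding with the target codec gives back the same text.
print(gb_bytes.decode('gb18030') == text)
```

Because the source encoding is named in step 1, the default encoding is never consulted.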

The second is to change the default encoding to match the file's encoding:

#! /usr/bin/env python 
# -*- coding: utf-8 -*- 

import sys 
reload(sys)  # Python 2 deletes sys.setdefaultencoding after initialization, so we need to reload sys to restore it
sys.setdefaultencoding('utf-8') 

s = '中文'  # named s rather than str, to avoid shadowing the built-in type
s.encode('gb18030')

After that, print "addr: ", form["addr"].value.decode('gb2312').encode('utf-8') ran successfully.

Let me summarize the reasons why I wrote this:

1. When the data you retrieve does not match the encoding declared in your current script, convert its encoding.

2. During the conversion, first decode the data to Unicode using its own encoding, then encode the Unicode as UTF-8.

3. Why does my browser send gb2312-encoded data to the server? It is probably related to the client's system encoding.
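Point 2 can be wrapped in a small helper. The following is a minimal sketch under stated assumptions: transcode and form_value are hypothetical names, and the sample value (the characters for "Beijing" encoded as gb2312) merely stands in for data received from a form.

```python
def transcode(data, source_encoding, target_encoding='utf-8'):
    """Re-encode a byte string from source_encoding to target_encoding."""
    text = data.decode(source_encoding)   # bytes -> Unicode
    return text.encode(target_encoding)   # Unicode -> bytes


# Hypothetical form value: two Chinese characters as gb2312 bytes.
form_value = u'\u5317\u4eac'.encode('gb2312')
utf8_value = transcode(form_value, 'gb2312')

# The UTF-8 result decodes back to the same two characters.
print(utf8_value.decode('utf-8') == u'\u5317\u4eac')
```

Passing the source encoding as an argument keeps the conversion independent of the interpreter's default encoding.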

The error from my crawler:

Traceback (most recent call last):
  File "E:/workspace/webCrawler/day04/01��ȡС˵.py", line 56, in <module>
    getText(url)
  File "E:/workspace/webCrawler/day04/01��ȡС˵.py", line 41, in getText
    fileName = i.decode('utf-8')
  File "G:\tools\python2.7.12\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-8: ordinal not in range(128)
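The traceback is easier to read with one extra fact about Python 2: calling .decode('utf-8') on a value that is already unicode in effect first encodes it to bytes with the default codec, and that hidden encode is what raises UnicodeEncodeError. The sketch below spells the hidden step out. The names title and ensure_text are hypothetical, the sample text is invented, and the guard is shown as one possible alternative to changing the default encoding, not as the fix the original script used.

```python
title = u'1\u7b2c\u4e00\u7ae0'  # already Unicode text, not bytes

try:
    # In effect what Python 2 does inside title.decode('utf-8').
    title.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = str(exc)
print(error)


def ensure_text(value, encoding='utf-8'):
    """Decode only when value is a byte string; pass Unicode text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(ensure_text(title) == title)
print(ensure_text(title.encode('utf-8')) == title)
```

With such a guard, values that are already decoded are left alone and only genuine byte strings go through decode.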

Add the following code at the top of the script:

import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )

And then it works fine
