The ultimate solution to Python's garbled-text problem (TL;DR)

It is rather funny that when we are struggling with Python encoding problems, almost every article we find starts with a rant and then, in desperation, writes up a solution (which turns out not to be a solution). This is not really a novice problem; even Python veterans with a decade of experience regularly get headaches from it. It is the same in China and abroad. Look at the talk of more than half an hour that one Python expert gave at PyCon to explain garbled text: as I understood it, he was himself so tangled up in encodings, encoding, and decoding that when people in the audience raised their hands he declined to answer. You can imagine how complex the question is.

Almost every Pythonista, I think, spends part of their life fighting encodings. If you do not get to the bottom of the problem, it will keep tripping you up. Here are some of the articles I have written about it:

  • All-night complaints about Python 2.x
  • Python: str(), repr(), print
  • Understanding of Chinese encoding in Python: Unicode, UTF-8, GBK

Whining over. Here is another IPython notebook that took me a whole day of testing to write. The source file in .ipynb format is here, though of course the link may break. If you like IPython's live coding notebooks and want to test encodings with this one, please contact me.

In a nutshell, remember that in Python 2 there are only two camps of strings:

unicode and bytes

If type(string) returns str, the value is bytes: raw binary data. The names we usually throw around, such as UTF-8 and GB2312, are just different encodings, that is, different ways of writing the same Unicode text out as bytes. Don't overcomplicate things here; just keep these two camps in mind.

encoding and decoding

One thing to remember: converting from unicode to bytes is called encoding. Converting from bytes to unicode is called decoding.
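A minimal sketch of the two directions, assuming the text is a two-character Chinese greeting written with escape sequences and an explicit u'' literal so that the snippet behaves the same on Python 2.7 and Python 3 (the variable names are illustrative):

```python
# unicode -> bytes is ENCODING
text = u'\u4f60\u597d'              # two Chinese characters ("hello")
utf8_bytes = text.encode('utf-8')   # 6 bytes
gbk_bytes = text.encode('gbk')      # 4 bytes: same text, different encoding

# bytes -> unicode is DECODING, with the encoding the bytes were written in
assert utf8_bytes.decode('utf-8') == text
assert gbk_bytes.decode('gbk') == text

print(repr(utf8_bytes))
print(repr(gbk_bytes))
```

The same text yields different bytes under different encodings, which is why the encoding name has to be known before the bytes can be decoded.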

Repeat this to yourself until it sticks before moving on to the next step!

And then let’s look at an example.
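A small sketch of such a comparison, assuming a two-character Chinese string as sample data. It uses an explicit u'' literal and the bytes name (an alias of str on Python 2) so that it runs on both Python 2.7 and Python 3:

```python
text = u'\u4f60\u597d'          # unicode: a sequence of characters
data = text.encode('utf-8')     # bytes: what Python 2 calls str

unicode_type = type(u'')        # unicode on Python 2, str on Python 3
print(type(text) is unicode_type)
print(type(data) is bytes)

# Length is counted in characters for unicode and in bytes for bytes.
print('%d characters, %d bytes' % (len(text), len(data)))

# repr() shows how differently the two are stored.
print(repr(text))
print(repr(data))
```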

By comparing the two formats above, we can see the various differences between str and unicode.

So, since there are two different formats in variables, what happens if we operate on strings of both formats together?

As follows:
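A sketch of that failure. Python 2 performs an implicit decode with the ASCII codec when a unicode object meets a byte string; the hypothetical helper concat_like_python2 below spells that hidden step out, so the same error can also be reproduced on Python 3 (where mixing the two types raises TypeError instead):

```python
text = u'\u4f60\u597d'            # unicode
data = text.encode('utf-8')       # bytes (str in Python 2)

def concat_like_python2(u, b):
    """Hypothetical helper: spell out what Python 2 does for `u + b`."""
    return u + b.decode('ascii')  # the implicit step that blows up

try:
    concat_like_python2(text, data)
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)

print(error)
```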

Look! The famous encoding error, UnicodeDecodeError: 'ascii' codec can't decode, has shown up!

That was a comparison of the two string types using explicit string literals.

However, most of the Python encoding problems we deal with do not involve explicit literals like these. The strings are either crawled from the web or read from local files, which means the data is large and its encoding is hard to guess. So let's break the problem down into two parts: local files and network resources.

Local file encoding test

First, create a local Chinese text file in UTF-8 format (it doesn't matter whether the extension is .txt, .md, etc.; the content is the same). It contains just one greeting: "Hello" in Chinese.

Then let’s read:
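A sketch of this step, assuming a hypothetical file name hello.txt placed in a temporary directory. The file is created and read in binary mode so that the result is the same on Python 2.7 and Python 3:

```python
import os
import tempfile

# Create the UTF-8 test file (hello.txt is a hypothetical name).
path = os.path.join(tempfile.mkdtemp(), 'hello.txt')
with open(path, 'wb') as f:
    f.write(u'\u4f60\u597d'.encode('utf-8'))

# Read it back without decoding. Python 2's open(path, 'r') also returns
# a byte string; 'rb' gives the same result on Python 3.
with open(path, 'rb') as f:
    raw = f.read()

print(type(raw) is bytes)
print(repr(raw))
```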

As you can see above, what is read from the file is bytes.


So to convert bytes to unicode, we have to decode.

This is the part that confuses people most and causes the most errors later: not being clear about whether to encode or to decode.

So, as mentioned above, it is important to remember which direction is which.


So now what if I have it backwards? The following error occurs again:
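A sketch of getting it backwards. On Python 2, calling .encode() on a byte string first decodes it implicitly with the ASCII codec; the hypothetical helper below writes that hidden step out so the failure is reproducible on Python 3 as well (where bytes simply has no .encode method):

```python
raw = u'\u4f60\u597d'.encode('utf-8')   # bytes, as read from the file

def encode_like_python2(b, encoding):
    """Hypothetical helper: what Python 2 does for `b.encode(encoding)`."""
    return b.decode('ascii').encode(encoding)

try:
    encode_like_python2(raw, 'utf-8')
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

# The right direction for bytes is decode.
text = raw.decode('utf-8')

print(error_name)
print(len(text))
```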

So, how do we unify them?

To avoid mixing strings of both formats, we have to unify them. But which one should we unify on, unicode or bytes?

Now let's look at what type a string has in some common contexts:

That makes sense: almost everything except what r.text returns is of type str, that is, bytes. So the only thing we need to convert is the content that comes from requests!

In fact, the response returned by requests offers the same content not only as response.text but also as response.content, in bytes.

That is exactly what we are after. Instead of converting strings everywhere, we only have to watch this one place.
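A network-free sketch of the idea. The byte string content below stands in for response.content, and text for what response.text might look like if the encoding had been guessed as ISO-8859-1; both values, and the header string, are assumptions for illustration rather than output from a real request:

```python
# Stand-in for response.content: the raw UTF-8 bytes of a Chinese page.
content = u'<p>\u4f60\u597d</p>'.encode('utf-8')
# Stand-in for response.text after a wrong ISO-8859-1 guess.
text = content.decode('iso-8859-1')

# The bytes camp: content joins other byte strings with no implicit decoding.
header = u'<h1>\u6807\u9898</h1>'.encode('utf-8')
page = header + content

print(type(content) is bytes)    # True
print(type(text) is type(u''))   # True: the odd one out
print(page.decode('utf-8') == u'<h1>\u6807\u9898</h1><p>\u4f60\u597d</p>')
```

Because content never leaves the bytes camp, nothing has to guess an encoding along the way.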

Why can’t we unify all string variables to Unicode?

Converting to unicode is called decoding. Don't mix that up.

Because something like response.text can come back decoded as ISO-8859-1, or as some other encoding that could not be guessed or detected correctly. The probability is low, but when it happens it is very troublesome.

There are two ways to decode:

unicode(b'hello')
b'hello'.decode('utf-8')

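Because the source encoding is not known, decoding has to go through unicode() rather than .decode('utf-8'): you obviously cannot just make up the name of the encoding, and if the source is actually ISO-8859-1 or the like (quite possible), decoding with the wrong name will certainly produce garbled characters or crash the program outright. Remember that! Keep in mind, though, that unicode() called without an encoding argument falls back to Python 2's default encoding, normally ASCII, so it cannot cope with non-ASCII bytes either.

So only unicode() can be used here. Consider the following cases: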
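A sketch of what can go wrong when the encoding name is guessed. The GBK and ISO-8859-1 cases are assumptions chosen for illustration, and try_decode is a hypothetical helper. unicode() exists only on Python 2, so its no-argument behaviour (decoding with the default encoding, normally ASCII) is written here as an explicit .decode('ascii') to keep the snippet runnable on Python 3 too:

```python
original = u'\u4f60\u597d'
utf8_bytes = original.encode('utf-8')
gbk_bytes = original.encode('gbk')

def try_decode(data, encoding):
    """Return the decoded text, or the exception class name on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__

right = try_decode(utf8_bytes, 'utf-8')          # correct guess
crash = try_decode(gbk_bytes, 'utf-8')           # wrong guess: raises
garbled = try_decode(utf8_bytes, 'iso-8859-1')   # wrong guess: silent garbage
default = try_decode(utf8_bytes, 'ascii')        # what bare unicode() attempts

print(right == original)
print(crash)
print('%d characters, equal to original: %s' % (len(garbled), garbled == original))
print(default)
```

A wrong guess either fails loudly or, worse, succeeds quietly with the wrong characters, which is the risk of unifying on unicode without knowing the source encoding.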

Summary of this stage: be sure to remember to use str strings uniformly throughout your code.

Just keep an eye on requests and other network operations: as long as you make sure that strings coming from outside sources are str, everything else is fine!

Here is a complete test, from fetching a network resource (a web page in Chinese that requests believed was encoded as ISO-8859-1) to processing it locally and storing it in a local file.

import requests

r = requests.get('http://pycoders-weekly-chinese.readthedocs.io/en/latest/issue5/unipain.html')

# Write the web page to a local file. r.content is raw bytes, so use binary mode.
with open('test.html', 'wb') as f:
    f.write(r.content)

# Read it back from the local HTML file, again as bytes (str in Python 2).
with open('test.html', 'rb') as f:
    ss = f.read()

And you’re done! The effect is as follows:

No more messing around, checking every variable, and writing a bunch of nested conversion methods!