This time we ran into a Python encoding problem, but not the Unicode/bytes issue covered in PyTips 0x07~0x09. The scenario is as follows: a crawler scrapes web data and POSTs it to a server via the Requests module, and whitespace (including '\r\n') has to be removed from the data.
The catch is that the data is sent in JSON format:
# Client side
import requests as req
import json

title = '你好，\n世界'
req.post(API, data=json.dumps({'title': title}))

# Server side
import re

data = self.requests.body.decode()
data = re.sub(r'\s', '', data)
save_data(json.loads(data))
Although HTTP transmits binary data (bytes), the call to self.requests.body.decode() keeps to the Unicode-bytes-HTTP-bytes-Unicode principle, so we can conclude that the problem is not caused by Unicode encoding. Ignoring the intermediate transfer, the code above can be simplified to:
import json
import re

title = '你好，\n世界'
data = json.dumps({'title': title})
data = re.sub(r'\s', '', data)
print(json.loads(data))

{'title': '你好，\n世界'}
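That the transfer step can be left out is easy to check. The sketch below stands in for the HTTP transfer with a plain encode()/decode() pair (the variable names are illustrative and not from the original code). It suggests that the text arriving on the server side is exactly what the client serialized, so the encoding step changes nothing:

```python
import json

title = '你好，\n世界'

# Client side: serialize to text, then encode to bytes for the wire.
sent_text = json.dumps({'title': title})
wire_bytes = sent_text.encode('utf-8')

# Server side: decode the received bytes back to text.
received_text = wire_bytes.decode('utf-8')

print(type(wire_bytes).__name__, type(received_text).__name__)
print(received_text == sent_text)
```

Whatever goes wrong must therefore already be present in the text that json.dumps() returns.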
The problem is that re.sub(r'\s', '', data) does not remove the newline, even though the same call works fine on an ordinary string:
print(re.sub(r'\s', '', "{'title': '你好，\n世界'}"))

{'title':'你好，世界'}
So the problem lies in json.dumps, because code that keeps to the Unicode-bytes-Unicode sandwich does not normally run into encoding problems (in Python 3).
print(json.dumps({'title': title}))

{"title": "\u4f60\u597d\uff0c\n\u4e16\u754c"}
As a rule of thumb, seeing a raw Unicode escape such as "\u4f60" in Python 3 output probably means you are not getting what you want. Normally we only expect to see Unicode characters or bytes displayed:
print("\u4f60")
print("\u4f60".encode())

你
b'\xe4\xbd\xa0'
Json.dumps () changes the values in the original dictionary type to ASCII encoding, not encode(), but ASCII () :
help(ascii)

Help on built-in function ascii in module builtins:

ascii(obj, /)
    Return an ASCII-only representation of an object.

    As repr(), return a string containing a printable representation of an
    object, but escape the non-ASCII characters in the string returned by
    repr() using \\x, \\u or \\U escapes. This generates a string similar
    to that returned by repr() in Python 2.
The difference can be illustrated by the following example:
def print_code_and_size(s):
    print(s, type(s), len(s))

yu = '雨'
print_code_and_size(yu)
print_code_and_size(yu.encode())
print_code_and_size(ascii(yu))
print_code_and_size(json.dumps(yu))

雨 <class 'str'> 1
b'\xe9\x9b\xa8' <class 'bytes'> 3
'\u96e8' <class 'str'> 8
"\u96e8" <class 'str'> 8
That is, json.dumps() breaks each non-ASCII character into an escape sequence made of several separate ASCII characters, instead of encoding it the way encode() does. The function does take a parameter, ensure_ascii=False, that avoids this splitting:
print_code_and_size(json.dumps(yu, ensure_ascii=False))

"雨" <class 'str'> 3
Although the principle is now clearer, this unfortunately does not solve the current problem, because the newline character is itself ASCII and is not affected by the ensure_ascii argument:
r = '\n'
print_code_and_size(json.dumps(r, ensure_ascii=False))
print(list(json.dumps(r, ensure_ascii=False)))

"\n" <class 'str'> 4
['"', '\\', 'n', '"']
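The same point can be checked against the regular expression directly. In this sketch (the variable name is illustrative), \s has nothing to match in the serialized newline, because the text holds a backslash and the letter n rather than a newline character:

```python
import json
import re

serialized = json.dumps('\n')

print('\n' in serialized)             # is there a real newline character?
print('\\n' in serialized)            # is there a backslash followed by 'n'?
print(re.search(r'\s', serialized))   # does \s match anything at all?
```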
In the string returned by json.dumps() the newline is still broken up into the separate characters '\' and 'n', so \s no longer matches it and the whitespace cannot be removed. The correct approach is therefore to remove the whitespace before calling json.dumps():
import json
import re

title = '你好，\n世界'
title = re.sub(r'\s', '', title)
data = json.dumps({'title': title})
print(json.loads(data))

{'title': '你好，世界'}
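When the payload has more than one field, the same idea can be applied to every string before serialization. The helper below is a hypothetical sketch, not part of the original code: strip_whitespace is an invented name, the sample payload is made up, and it assumes the payload is built only from dicts, lists and scalar values.

```python
import json
import re

def strip_whitespace(value):
    """Recursively remove whitespace from every string in a JSON-like value.

    Hypothetical helper for illustration; dict keys are left untouched.
    """
    if isinstance(value, str):
        return re.sub(r'\s', '', value)
    if isinstance(value, dict):
        return {key: strip_whitespace(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_whitespace(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged

payload = {'title': '你好，\n世界', 'tags': ['a b', 'c\r\nd'], 'count': 2}
cleaned = strip_whitespace(payload)
data = json.dumps(cleaned, ensure_ascii=False)
print(json.loads(data))
```

Cleaning the Python objects first means the regular expression only ever sees real whitespace characters, never their JSON escape sequences.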
Conclusion
This problem should not have taken so much time. Because it looked so much like an encoding problem, my thinking went off track from the very start. There are two takeaways:
- The Unicode-bytes-[[===]]-bytes-Unicode pattern solves most encoding problems;
- The "encoders" json.dumps and ascii() correspond to the "decoders" json.loads and eval, respectively.
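Both pairings can be checked with a round trip. The sketch below uses ast.literal_eval in place of eval. That substitution is made here, not in the original, because literal_eval accepts only literals and is the safer way to read back a repr-style string. The sample strings are illustrative.

```python
import ast
import json

samples = ['雨', '你好，\n世界', 'plain ascii']

# json.dumps() is undone by json.loads()
json_ok = all(json.loads(json.dumps(s)) == s for s in samples)

# ascii() is undone by evaluating the resulting Python string literal
ascii_ok = all(ast.literal_eval(ascii(s)) == s for s in samples)

print(json_ok, ascii_ok)
```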