Last time we sorted out the relationship between ASCII, Unicode and UTF-8. Then I ran into encoding in Python, which left me confused. It is a problem left over from the previous post that I had never managed to understand. Today I read a lot of tech blogs by experts and it finally clicked, so I am writing it down to share with anyone who is equally confused. Everything below is based on Python 3.

1. Source file encoding

Many of you have heard that Python 3's default encoding is UTF-8, which sounds rather fancy. But what does the default encoding of Python actually mean?

When we run a .py file (say test.py), the Python interpreter first reads the file from disk, decodes its bytes as UTF-8 by default, and only then compiles and runs the program.

As we know, encoding and decoding come in pairs and must use the same scheme; otherwise the decoded data will differ from the original. It is much like encryption and decryption with AES, where both sides must agree on the method.

Now imagine that test.py was written by a text editor in GBK and is then decoded as UTF-8 by default: the result is garbled characters, or an outright decoding error.
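A minimal sketch of that mismatch, using in-memory bytes rather than a real file. The sample string 小甲 and the variable names are assumptions chosen for illustration:

```python
# Simulate a file saved by an editor in GBK, then read back as UTF-8.
text = "小甲"
gbk_bytes = text.encode("gbk")  # what a GBK editor would write to disk

try:
    decoded = gbk_bytes.decode("utf-8")  # strict decoding with the wrong codec
except UnicodeDecodeError as exc:
    # Invalid UTF-8 sequences raise in strict mode. Replacing them with
    # U+FFFD shows roughly what a lenient viewer would display instead.
    print(f"strict decode failed: {exc.reason}")
    decoded = gbk_bytes.decode("utf-8", errors="replace")

print(repr(decoded))
print(decoded == text)                  # False: the round trip is broken
print(gbk_bytes.decode("gbk") == text)  # True: the matching codec restores it
```

Either way the original text is lost unless the decoder uses the same codec as the encoder.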

If you don't believe me, here is the code:

# Test environment:
OS: macOS 10.15
IDE: PyCharm
Python: 3.8
Author: Xiyuan Childe

1.1 Case 1:

The test script test.py is saved in UTF-8 encoding.

Content:

# coding=gbk
# Author: zwjjiaozhu
# Date: 2021/1/8
# IDE: VsCode

import sys

print(sys.getdefaultencoding())
name = '小甲'
print(f"name: {name}\n name_type: {type(name)}\n :{repr(name)}")

# Write the same str to two files, once per encoding
with open('utf.txt', 'w', encoding='utf8') as f:
    f.write(name)

with open('gbk.txt', 'w', encoding='gbk') as f:
    f.write(name)

Results:

utf-8
name: 灏忕敳
 name_type: <class 'str'>
 :'灏忕敳'

# Contents of the written files
utf.txt: 灏忕敳
gbk.txt: 小甲

At this point you are probably confused. What is all this? Why does print show 灏忕敳? And why does the file written as GBK display the expected 小甲, while the file written as UTF-8 displays 灏忕敳?

Let me explain it:

The first line of the code above, # coding=gbk, tells the interpreter to decode the bytes of test.py with GBK into Unicode code points before running the code.
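One way to check which encoding Python would pick for a source file is the standard library's tokenize.detect_encoding, which reads the coding declaration and falls back to UTF-8 when there is none. A small sketch follows; the helper name declared_encoding and the sample sources are illustrative, not part of the original script:

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the encoding tokenize detects for these source bytes."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


with_declaration = b"# coding=gbk\nname = 'x'\n"
without_declaration = b"name = 'x'\n"

print(declared_encoding(with_declaration))     # gbk
print(declared_encoding(without_declaration))  # utf-8
```

The declaration only says how to decode the file. It does not change how the editor saved it, which is exactly where Case 1 goes wrong.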

Because the file was actually saved as UTF-8, the Chinese literal in name = '小甲' is decoded with the wrong codec: its UTF-8 bytes are read as GBK and map to different Unicode code points, so printing it shows garbled text (the values no longer correspond to the intended characters). The rest of the code is all ASCII letters, whose bytes are identical in UTF-8 and GBK, so it displays normally either way. After all, computers were an American invention 😂

Then write to the file:

  • encoding='utf8': the value of name as decoded by the interpreter (灏忕敳) is written to utf.txt in UTF-8. A text editor that opens utf.txt as UTF-8 by default therefore displays 灏忕敳. Encoding and decoding are paired, and both use UTF-8.
  • encoding='gbk': likewise, the Unicode value of name after decoding (the code points for 灏忕敳) is written to gbk.txt in GBK. Opened as UTF-8, the file displays 小甲 correctly; opened as GBK, sorry, it still displays 灏忕敳. In effect, the interpreter's GBK decoding of test.py and the GBK encoding on write cancel out, leaving the original UTF-8 bytes from test.py.
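The whole of Case 1 can be replayed on bytes in memory. This is only a sketch, assuming the name in test.py is the two-character string 小甲; the variable names utf_txt and gbk_txt stand in for the two files:

```python
original = "小甲"
source_bytes = original.encode("utf-8")  # test.py as saved on disk (UTF-8)

# "# coding=gbk" makes the interpreter decode those bytes as GBK.
name = source_bytes.decode("gbk")
print(name)  # the garbled three-character string

utf_txt = name.encode("utf-8")  # bytes that end up in utf.txt
gbk_txt = name.encode("gbk")    # bytes that end up in gbk.txt

print(utf_txt.decode("utf-8") == name)      # True: utf.txt shows the garbled text
print(gbk_txt == source_bytes)              # True: GBK decode and encode cancel out
print(gbk_txt.decode("utf-8") == original)  # True: gbk.txt read as UTF-8 looks right
```

So gbk.txt is not really a GBK file of 小甲 at all: it holds the same bytes as the UTF-8 literal in test.py.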

If this is still unclear, I highly recommend sketching it by hand. I drew a flow chart here to help it sink in ~

Phew, that was exhausting and my brain hurts, but it is finally explained.

2. String encoding

Python has two common types for text data, str and bytes. A str is usually converted to bytes for network transmission and when writing to files. For example, "你好" corresponds to b"\xe4\xbd\xa0\xe5\xa5\xbd" (UTF-8 encoding).

2.1 Converting between str and bytes

In a bytes literal, each non-ASCII byte is shown as \x followed by two hexadecimal digits.
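A short sketch of what that looks like in practice, assuming the sample string 你好 (variable names are illustrative):

```python
text = "你好"
data = text.encode("utf-8")

print(data)                    # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(len(text), len(data))    # 2 6: two characters, six bytes
print([hex(b) for b in data])  # each byte as a hexadecimal number
print("abc".encode("utf-8"))   # b'abc': ASCII bytes are shown as plain letters
```

Each Chinese character here takes three bytes in UTF-8, which is why the bytes object is three times as long as the str.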

Method 1: Convert directly with encode and decode

name = "你好"
name_bytes = name.encode("utf8")   # Encode as UTF-8; GBK would also work, but ASCII cannot represent Chinese
name2 = name_bytes.decode("utf8")  # Decode with the same codec used for encoding, otherwise the text is garbled
print(f"name:{name},type:{type(name)}")
print(f"name_bytes:{name_bytes},type:{type(name_bytes)}")
print(f"name2:{name2},type:{type(name2)}")

# Results:
# name:你好,type:<class 'str'>
# name_bytes:b'\xe4\xbd\xa0\xe5\xa5\xbd',type:<class 'bytes'>
# name2:你好,type:<class 'str'>

Method 2: Convert with the str and bytes constructors

age = '12'
age_bytes = bytes(age, encoding="utf8")    # str -> bytes
age_str = str(age_bytes, encoding="utf8")  # bytes -> str
print(f"age:{age},type:{type(age)}")
print(f"age_bytes:{age_bytes},type:{type(age_bytes)}")
print(f"age_str:{age_str},type:{type(age_str)}")

# Results:
# age:12,type:<class 'str'>
# age_bytes:b'12',type:<class 'bytes'>
# age_str:12,type:<class 'str'>

Note: Python strings are Unicode internally, so in memory Unicode serves as the intermediate form when transcoding: first decode the bytes of some other encoding into a Unicode str, then encode that str into the target encoding.

For example, str1.decode('gb2312') decodes the GB2312-encoded bytes str1 into a Python str.

encode does the reverse and converts a str into bytes: str2.encode('gb2312') encodes str2 as GB2312 bytes. So before transcoding, you must first know which encoding the data in str1 uses, and decode it with that same encoding.
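The decode-then-encode route can be sketched as a tiny helper. The function name transcode and the sample data are assumptions made for illustration:

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another via a Unicode str."""
    text = data.decode(source)  # bytes in the source encoding -> str
    return text.encode(target)  # str -> bytes in the target encoding


gb_bytes = "你好".encode("gb2312")  # pretend this arrived as GB2312 data
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")

print(len(gb_bytes), len(utf8_bytes))  # 4 6
print(utf8_bytes)                      # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(utf8_bytes.decode("utf-8"))      # 你好
```

There is no direct bytes-to-bytes conversion here: the str in the middle is what makes the two encodings comparable.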

Bottom line: in Python you are usually converting directly between bytes and str, and crawlers are where encoding problems show up most. Many web pages are encoded in UTF-8, so the downloaded data should be decoded with decode('utf-8') to display correctly. Some pages, however, are encoded in GBK and must be decoded with GBK; get this wrong and the data comes back garbled!
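For a crawler, one rough approach is to try a short list of candidate encodings in strict mode. The sketch below is a hypothetical helper (decode_page and its candidates parameter are not from any library), and the page bytes are built in memory instead of being downloaded:

```python
def decode_page(raw: bytes, candidates=("utf-8", "gbk")) -> str:
    """Hypothetical helper: return the first strict decode that succeeds."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    # Nothing matched: fall back to the first candidate and mark bad bytes.
    return raw.decode(candidates[0], errors="replace")


html = "<title>你好</title>"
utf8_page = html.encode("utf-8")  # stands in for a UTF-8 web page
gbk_page = html.encode("gbk")     # stands in for a GBK web page

print(decode_page(utf8_page))  # <title>你好</title>
print(decode_page(gbk_page))   # <title>你好</title>
```

UTF-8 goes first on purpose: as Case 1 showed, GBK can decode UTF-8 bytes without raising and silently produce garbage. When the page declares its charset in the response headers or a meta tag, that declaration is a better guide than guessing.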

Summary: finally done writing. After thinking it through several times, I now understand file encoding much more thoroughly. Next time I will write about the various Python operations on JSON files, the differences between dictionaries and JSON, and the Base64 encoding algorithm.

Reference article:

zhuanlan.zhihu.com/p/40834093

www.liaoxuefeng.com/wiki/101695…