Python Programming Time

My blog: python.iswbm.com/en/latest/c…

My Github: github.com/iswbm/Pytho…


Encoding problems have always been a nightmare for many Python developers. Even if you've been working in Python for years, you're sure to keep running into annoying encoding problems that take a long time to figure out.

After a while, you forget it all and start looking through blogs and posts again to figure out what encoding is. What is Unicode? How does it differ from ASCII? Why do decode and encode keep raising errors? Python 2 and Python 3 have different string types; how do they correspond? How do you detect an encoding?

Over and over again, this process is really painful.

Today I've collected some of the encoding problems you might encounter in Python, so you can bookmark this article and save yourself the Googling.

1. str and bytes in Python 3

In Python 3, there are two string types, str and bytes.

Here’s the difference:

  • Unicode string (str type): stored as Unicode code points; the form humans read
  • Byte string (bytes type): stored as bytes; the form machines read
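
To make the two views concrete, here is a minimal sketch in Python 3. The sample text "你好" and the variable names are only illustrative choices, not anything required by the language.

```python
# Python 3: one piece of text, two views of it.
text = "你好"

# str view: a sequence of Unicode code points.
code_points = [hex(ord(ch)) for ch in text]

# bytes view: the same text after encoding it as UTF-8.
data = text.encode("utf-8")
byte_values = list(data)  # each item is an integer in range(256)

print(len(text), code_points)   # 2 code points
print(len(data), byte_values)   # 6 bytes
```

The same two characters take six bytes in UTF-8, which is why len() gives different answers for the str and for its encoded bytes.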

Every string literal you define in Python 3 is a Unicode string (str), which you can check with type and isinstance

# python3
>>> str_obj = "hello"
>>> 
>>> type(str_obj)
<class 'str'>
>>> 
>>> isinstance("hello", str)
True
>>> 
>>> isinstance("hello", bytes)
False
>>> 

bytes is a binary sequence type. To define a bytes object, just prefix the string literal with b.

# python3
>>> byte_obj = b"Hello World!"
>>> type(byte_obj)
<class 'bytes'>
>>> 
>>> isinstance(byte_obj, str)
False
>>> 
>>> isinstance(byte_obj, bytes)
True
>>> 

But a Chinese string cannot simply be prefixed with b, because a bytes literal may only contain ASCII characters. Use encode instead.

>>> byte_obj = b"你好"
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
>>> 
>>> str_obj = "你好"
>>> 
>>> str_obj.encode("utf-8")
b'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> 
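
As a small follow-up sketch (Python 3, with illustrative variable names): non-ASCII characters cannot appear in a bytes literal, but their encoded bytes can be written with \x escapes, and decode turns them back into text.

```python
# The UTF-8 bytes of "你好", written as escapes inside a bytes literal.
raw = b"\xe4\xbd\xa0\xe5\xa5\xbd"

# encode() produces the same bytes from the str ...
encoded = "你好".encode("utf-8")
assert raw == encoded

# ... and decode() turns them back into the original str.
decoded = raw.decode("utf-8")
print(decoded)
```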

2. str and unicode in Python 2

In Python 2, the string types differ from Python 3 and need to be carefully distinguished.

Python 2 also has only two string types, unicode and str.

In other words, there are only unicode objects and non-unicode objects (which should be called str objects):

  • Unicode string (unicode type): stored as Unicode code points; the form humans read
  • Byte string (str type): stored as bytes; the form machines read

When we define a string directly in double or single quotes, we get a str object, as in this case

# python2

>>> str_obj = "你好"
>>>
>>> type(str_obj)
<type 'str'>
>>>
>>> str_obj
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>>
>>> isinstance(str_obj, bytes)
True
>>> isinstance(str_obj, str)
True
>>> isinstance(str_obj, unicode)
False
>>>
>>> str is bytes
True

When we add a u before the opening quote, we are defining a unicode string object, as in this case

# python2

>>> unicode_obj = u"你好"
>>>
>>> unicode_obj
u'\u4f60\u597d'
>>>
>>> type(unicode_obj)
<type 'unicode'>
>>>
>>> isinstance(unicode_obj, bytes)
False
>>> isinstance(unicode_obj, str)
False
>>>
>>> isinstance(unicode_obj, unicode)
True
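
To answer "how do they correspond?", here is a rough sketch from the Python 3 side. It assumes Python 3.3 or later, where the u prefix is accepted again for compatibility; the variable names are only illustrative.

```python
# Mapping between the two versions:
#   Python 2 unicode  ->  Python 3 str
#   Python 2 str      ->  Python 3 bytes
text = u"你好"      # the u prefix is optional in Python 3
data = b"hello"     # what Python 2 called str

same_type = type(text) is str
same_value = text == "你好"
bytes_is_not_str = not isinstance(data, str)
print(same_type, same_value, bytes_is_not_str)
```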

3. How to detect an object's encoding

Every character has a corresponding code point in the Unicode character set.

These code points are saved as bytes according to certain rules, and those rules are what we call an encoding. Common ones include UTF-8 and GB2312.

That is, whenever we persist a string from memory to disk, we need to specify an encoding. Conversely, when we read it back, we need to specify the same encoding (a process called decoding), otherwise we get garbled characters.
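
Here is a minimal sketch of that round trip in Python 3. It writes to a throwaway temporary file, and the sample text and variable names are my own choices for illustration.

```python
import os
import tempfile

text = "中文"

# Create a throwaway file to write to.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Persist the text to disk as GBK bytes.
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # Reading with the same encoding recovers the text.
    with open(path, encoding="gbk") as f:
        round_trip = f.read()

    # Reading with the wrong encoding fails here; with other codec
    # pairs it can silently produce garbled characters instead.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        wrong_codec_failed = False
    except UnicodeDecodeError:
        wrong_codec_failed = True
finally:
    os.remove(path)

print(round_trip, wrong_codec_failed)
```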

So here is the problem: if we know the encoding, we can decode normally, but we don't always know it. What then?

This brings us to a Python library, chardet, which you need to install before using

python3 -m pip install chardet

chardet has a detect function that guesses the encoding of a bytes object

>>> import chardet
>>> chardet.detect('微信公众号：Python编程时光'.encode('gbk'))
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}

If you look at the output above, you can see a confidence field, which indicates how confident chardet is in its guess.

However, with only a few characters it may "misdiagnose". In the following example there are only two Chinese characters. We encode them as GBK, but chardet identifies the bytes as KOI8-R.

>>> str_obj = "中文"
>>> byte_obj = bytes(str_obj, encoding='gbk')  # get GBK-encoded bytes
>>>
>>> chardet.detect(byte_obj)
{'encoding': 'KOI8-R', 'confidence': 0.682639754276994, 'language': 'Russian'}
>>> 
>>> str_obj2 = str(byte_obj, encoding='KOI8-R')
>>> str_obj2
'жпнд'

Therefore, for an accurate detection, give chardet as many characters as possible.
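
One way to act on the confidence value is to trust the guess only above some threshold. The sketch below is just an illustration: decode_with_detection, min_confidence and fallback are hypothetical names of mine, not part of chardet, and the 0.8 threshold is an arbitrary assumption.

```python
try:
    import chardet  # third-party; installed with pip as shown above
except ImportError:  # keep the sketch usable without it
    chardet = None


def decode_with_detection(data, min_confidence=0.8, fallback="utf-8"):
    """Decode bytes using chardet's guess when it is confident enough.

    Hypothetical helper for illustration; not part of chardet itself.
    """
    encoding = None
    if chardet is not None:
        guess = chardet.detect(data)
        # 'encoding' can be None when chardet has no guess at all.
        if guess["encoding"] and guess["confidence"] >= min_confidence:
            encoding = guess["encoding"]
    # Fall back to a caller-chosen encoding for low-confidence guesses.
    return data.decode(encoding or fallback, errors="replace")


print(decode_with_detection(b"hello world"))
```

With the two-character GBK example above, the KOI8-R guess has a confidence of about 0.68, so a threshold like this would reject it and use the fallback instead.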

chardet supports many languages; the official documentation lists them (chardet.readthedocs.io/en/latest/s…)

4. The difference between encoding and decoding

Encoding and decoding are simply conversions between str and bytes. (Python 2 is fading into the past, so the examples from here on use Python 3.)

  • Encode: the encode method converts a str object into a sequence of bytes

  • Decode: the decode method converts a sequence of bytes into a str object

So if we do know the encoding, how do we convert the bytes to Unicode?

There are two ways to do this

The first is to use the decode method directly

>>> byte_obj.decode('gbk')
'中文'
>>> 

The second is to convert with the str class

>>> str_obj = str(byte_obj, encoding='gbk')
>>> str_obj
'中文'
>>> 
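
As for why encode and decode keep raising errors: it usually means the codec cannot represent the characters, or the bytes are not valid in that codec. Here is a minimal Python 3 sketch using the standard errors parameter; the variable names are only illustrative.

```python
text = "中文"

# Encoding with a codec that lacks these characters raises UnicodeEncodeError ...
try:
    text.encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# ... unless an error handler is given.
replaced = text.encode("ascii", errors="replace")  # each character becomes ?
ignored = text.encode("ascii", errors="ignore")    # characters are dropped

# Decoding with the wrong codec raises UnicodeDecodeError in the same way.
gbk_bytes = text.encode("gbk")
try:
    gbk_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# errors="replace" substitutes U+FFFD for the bytes it cannot decode.
lossy = gbk_bytes.decode("utf-8", errors="replace")
print(encode_failed, replaced, ignored, decode_failed, lossy)
```

The error handlers avoid the exception but lose information, so they are better suited to logging and display than to data you need to keep intact.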

5. How to set the file encoding

Python 2 reads source files as ASCII by default. Therefore, if a Python 2 source file contains Chinese, we get an error.

SyntaxError: Non-ASCII character '\xe4' in file demo.py

The reason is that the ASCII table is too small to represent Chinese.

Python 3 reads source files as UTF-8 by default, which saves a lot of work.

There are two common solutions to this problem:

The first way

In Python 2, you can declare the encoding in a header comment

You can write it this way, which looks nice

# -*- coding: utf-8 -*- 

But it's a hassle to type, so I usually write it in one of these two shorter ways

# coding:utf-8
# coding=utf-8 
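
Python 3 still recognizes the same header declarations. As a rough check, the standard library's tokenize.detect_encoding reports which encoding would be used for a given source file. declared_encoding below is a hypothetical helper name of mine, and the sample header lines are only for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3 would use for these bytes.

    Hypothetical helper; it only wraps tokenize.detect_encoding.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


headers = [
    b"# -*- coding: utf-8 -*-\n",
    b"# coding:utf-8\n",
    b"# coding=utf-8\n",
    b"# coding=gbk\n",
    b"print('no header')\n",  # no declaration: the Python 3 default applies
]
results = [declared_encoding(h) for h in headers]
print(results)
```

All three spellings from above are recognized, and a file without any declaration falls back to UTF-8.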

The second way

import sys 

reload(sys) 
sys.setdefaultencoding('utf-8') 

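Here, reload(sys) is called before sys.setdefaultencoding('utf-8'), which sets the default encoding. This is necessary because Python removes the sys.setdefaultencoding function at startup, after sys has been loaded, so we need to reload sys to get it back. Note that this changes the default encoding used for implicit conversions between str and unicode at run time. It does not change how the source file itself is read, so a file containing Chinese still needs the header declaration.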
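
None of this is needed in Python 3. A quick sketch to confirm it on your own interpreter (the variable names are only illustrative):

```python
import sys

# Python 3: the default encoding is already UTF-8,
# and the setter function no longer exists.
default_encoding = sys.getdefaultencoding()
has_setter = hasattr(sys, "setdefaultencoding")
print(default_encoding, has_setter)
```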