How does Python3 solve the tricky character encoding problem?

This article is from the official WeChat account Python Zen (VTtalk), by Liu Zhijun.

One of the most important improvements in Python3 is that it fixes the pitfalls around strings and character encodings left over from Python2. An earlier article, "Why is Python encoding so painful?", described some flaws in the Python2 string design:

- ASCII is the default encoding, which makes handling Chinese text awkward.
- Strings are split into two types, unicode and str, which misleads developers.

Of course, these are not bugs, and a careful developer can work around them. But Python3 solves both problems cleanly.

First, Python3 sets the system's default encoding to UTF-8:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

Second, text is now clearly distinguished from binary data: the two are represented by str and bytes respectively. All text is represented by str, which can hold any character in the Unicode character set, while binary data is represented by a new data type, bytes.

str

>>> a = "a"
>>> a
'a'
>>> type(a)
<class 'str'>
>>> b = "禅"
>>> b
'禅'
>>> type(b)
<class 'str'>

bytes

In Python3, putting b in front of the opening quote marks the literal as a bytes object, which is simply a sequence of binary bytes. A bytes literal may contain ASCII characters and hexadecimal escapes such as \xe7, but it cannot contain non-ASCII characters such as Chinese characters.

>>> c = b'a'
>>> c
b'a'
>>> type(c)
<class 'bytes'>
>>> d = b'\xe7\xa6\x85'
>>> d
b'\xe7\xa6\x85'
>>> type(d)
<class 'bytes'>
>>>
>>> e = b'禅'
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
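A minimal sketch of the rule above, assuming the non-ASCII character is 禅, whose UTF-8 encoding is the b'\xe7\xa6\x85' value shown in the session. The variable names are illustrative only. It shows two ways to obtain those bytes without placing the character inside a bytes literal:

```python
# Assumption: the text to store is the single character 禅 (U+7985).
text = "禅"

# A bytes literal cannot hold the character directly, so encode the str instead.
encoded = text.encode("utf-8")

# The same three bytes written out with hexadecimal escapes.
literal = b"\xe7\xa6\x85"

print(encoded == literal)        # True
print(len(text), len(encoded))   # 1 3
```

One character of text becomes three bytes here, which is why the lengths of a str and of its encoded bytes generally differ.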

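The bytes type supports much the same operations as str, including slicing, indexing, and basic operators such as + and *. Note that indexing a bytes object returns an integer rather than a one-byte bytes object. However, str and bytes cannot be combined with +, although this was possible in Python2.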

>>> b"a"+b"c"
b'ac'
>>> b"a"*2
b'aa'
>>> b"abcdef\xd6"[1:]
b'bcdef\xd6'
>>> b"abcdef\xd6"[-1]
214

>>> b"a" + "b"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
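When the two types do need to be combined, one of them has to be converted explicitly first. The following is a small sketch of that idea, not part of the original session. The variable names are made up for illustration, and UTF-8 is assumed as the encoding:

```python
prefix = b"a"
suffix = "b"

# Mixing bytes and str raises TypeError, as in the session above.
try:
    prefix + suffix
except TypeError as exc:
    print("TypeError:", exc)

# Convert explicitly so that both operands have the same type.
as_bytes = prefix + suffix.encode("utf-8")   # b'ab'
as_text = prefix.decode("utf-8") + suffix    # 'ab'

# Indexing bytes yields an int; slicing yields bytes.
data = b"abcdef\xd6"
first = data[0]     # 97, the code of "a"
tail = data[-1:]    # b'\xd6'
```

Using a one-element slice such as data[-1:] is a common way to get a bytes object back instead of an integer.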

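Correspondence between byte and character types in Python2 and Python3: the Python2 str type corresponds to bytes in Python3, and the Python2 unicode type corresponds to str in Python3.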

Encode and decode

STR to bytes can be converted using encode and from decode methods.

Encode is responsible for character-to-byte encoding conversion. The default encoding is UTF-8.

> > > s = "zen Python" > > > s.e ncode () b 'Python \ xe4 \ xb9 \ x8b \ xe7 \ xa6 \ x85' > > > s.e ncode (" GBK ") b 'Python \ xd6 \ xae \ xec \ xf8'Copy the code

Decode is responsible for byte to character decoding and conversion, generally using UTF-8 encoding format for conversion.

>>> b 'python \xe4\xb9\x8b\xe7\xa6\x85'. Decode () 'Python zen '>>> B 'python \xd6\xae\xec\xf8'. Decode (" GBK ") 'Python Zen'Copy the code
