This is the second day of my participation in Gwen Challenge
1. What is a string
In Python, a string is a sequence of characters enclosed in single quotes ('...'), double quotes ("..."), or triple quotes ('''...''' or """...""", used less often). Python provides the built-in type str to describe strings.
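Take the string 'JUEJING' as an example: each of its characters has a position (index), counted from 0, so 'J' is at index 0 and 'G' is at index 6.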
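The index layout can be checked directly in the interpreter. This is only a minimal sketch, and the variable name `s` is an illustrative choice, not something from the original post:

```python
s = 'JUEJING'

print(type(s))   # <class 'str'>
print(len(s))    # 7
print(s[0])      # J  (first character, index 0)
print(s[3])      # J  (fourth character, index 3)
print(s[-1])     # G  (negative indices count from the end)
print(s[0:3])    # JUE (slice from index 0 up to, not including, 3)
```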
2. String encoding
The evolution of encodings, from the earliest ASCII encoding to the more familiar Unicode and UTF-8 encodings, is shown below (timeline from top to bottom)
As you can see from the evolution above, ASCII covers English letters and digits and GB2312 covers Chinese, but other languages such as Arabic and Spanish also need to be represented, so the Unicode encoding was created to cover them all.
Why do we still need UTF-8 when we already have Unicode?
UTF-8 is a space-saving, variable-length way of storing Unicode. Instead of spending at least two bytes on every character, it sorts characters and symbols into groups and stores each group with a different number of bytes:
- The contents of the ASCII code are stored in 1 byte
- European characters are stored in 2 bytes
- East Asian characters are stored in 3 bytes
Therefore, UTF-8 is by far the most common and recommended character encoding.
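The byte counts in the list above can be verified in Python 3 by encoding one character from each group and measuring the result. This is a small sketch; the three sample characters and the `samples` and `byte_lengths` names are illustrative choices:

```python
# One sample character per group from the list above (choices are illustrative).
samples = {
    'ASCII': 'A',
    'European': '\u00e9',    # e with an acute accent
    'East Asian': '\u4e2d',  # the Chinese character "zhong"
}

# Encode each character as UTF-8 and count the bytes it occupies.
byte_lengths = {group: len(ch.encode('utf-8')) for group, ch in samples.items()}
print(byte_lengths)  # {'ASCII': 1, 'European': 2, 'East Asian': 3}
```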
All right, that is roughly how encodings evolved. There is plenty of material on the web about it, so a quick search will fill in the details.
Anyway, what is the default encoding for Python2 and Python3?
Python 2's default encoding is ASCII, so it does not recognize Chinese characters unless an encoding is declared explicitly. Python 3's default encoding is UTF-8 and its strings are Unicode, so Chinese characters are recognized.
The Python 2 and Python 3 interpreters use different default encodings, which can be checked with sys.getdefaultencoding():
>>> # Python2
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> # Python3
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
As a result, we often run into encoding errors in practice when a file's encoding does not match the encoding the Python interpreter expects.
For Python 2, the interpreter reports the following error when it reads a source file that contains Chinese characters but has no encoding declaration:
SyntaxError: Non-ASCII character '\xc4' in file xxx.py on line 11, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
For Python 3, the default encoding is UTF-8, but that does not make every problem with Chinese text go away. For example, when the Python 3 interpreter on Windows runs a source file saved as GBK, it still decodes the file as UTF-8, the decoding fails, and the following error is reported:
SyntaxError: Non-UTF-8 code starting with '\xc4' in file xxx.py on line 11, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
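The same mismatch can be reproduced at runtime with a bytes object. Note that this sketch raises a UnicodeDecodeError rather than the SyntaxError above, because it decodes data instead of a source file; the sample text and variable names are illustrative:

```python
# "ni hao" (hello), written with escapes so this file itself stays ASCII-only.
gbk_bytes = '\u4f60\u597d'.encode('gbk')
print(gbk_bytes)  # b'\xc4\xe3\xba\xc3'

utf8_failed = False
try:
    # GBK bytes are not valid UTF-8, so this decode is expected to fail.
    gbk_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    utf8_failed = True
    print('decoding as UTF-8 failed:', exc.reason)

# Decoding with the encoding that was actually used works.
text = gbk_bytes.decode('gbk')
print(text == '\u4f60\u597d')  # True
```

The first byte, \xc4, is the same byte the SyntaxError message complains about.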
Solutions:
- When you create a file, check that its encoding is set to UTF-8
- To stay compatible with both Python 2 and Python 3, declare the character encoding at the top of the source file: # -*- coding: utf-8 -*-
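To see which encoding Python 3 will pick for a source file, the standard library's tokenize.detect_encoding can be used. This is a sketch that works on in-memory bytes instead of a real file; the helper name `declared_encoding` and the two sample sources are hypothetical:

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to read this source code."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return encoding


# A source file with an explicit declaration, and one without.
with_declaration = b"# -*- coding: gbk -*-\nprint('hi')\n"
without_declaration = b"print('hi')\n"

print(declared_encoding(with_declaration))     # gbk
print(declared_encoding(without_declaration))  # utf-8
```

For a real file, opening it in binary mode and passing its readline method should work the same way.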
3. String encoding and encoding mode
First, a little background.
The Unicode character encoding is also a mapping between characters and numbers, but the number here is called a code point, and it is usually written in hexadecimal.
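In Python 3 the built-in functions ord() and chr() convert between a character and its code point, which makes the mapping easy to inspect. A minimal sketch with an illustrative sample character:

```python
ch = '\u4e2d'  # the Chinese character "zhong"

code_point = ord(ch)        # character -> code point (an int)
print(code_point)           # 20013
print(hex(code_point))      # 0x4e2d
print(chr(0x4e2d) == ch)    # True: code point -> character

# ASCII characters have code points too.
print(ord('A'), hex(ord('A')))  # 65 0x41
```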
The official Python documentation describes the relationship between Unicode strings, bytes, and encodings:
A Unicode string is a sequence of code points ranging from 0 to 0x10FFFF (the decimal equivalent is 1114111). This sequence of code points needs to be represented as a set of bytes (values between 0 and 255) in storage (both memory and physical disk), and the rules for converting Unicode strings into sequences of bytes are called encodings.
Here, "encoding" does not mean a character set. It means the encoding process itself, together with the rules used in that process to map the code points of Unicode characters to bytes.
For example, Unicode characters are converted to ASCII
For example, Unicode characters are converted to UTF-8
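The two conversions can be tried in Python 3 with str.encode. This sketch assumes one ASCII letter and one Chinese character as samples; the variable names are illustrative:

```python
# An ASCII letter encodes to the same single byte in ASCII and in UTF-8.
print('A'.encode('ascii'))  # b'A'
print('A'.encode('utf-8'))  # b'A'

zhong = '\u4e2d'  # the Chinese character "zhong"

# UTF-8 can represent it, using three bytes.
utf8_bytes = zhong.encode('utf-8')
print(utf8_bytes)  # b'\xe4\xb8\xad'

# ASCII has no byte for it, so encoding fails.
ascii_failed = False
try:
    zhong.encode('ascii')
except UnicodeEncodeError:
    ascii_failed = True
    print('this character cannot be encoded as ASCII')
```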
Summary:
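Encode: the process, and the rules, for converting a Unicode string (a sequence of code points) into the corresponding bytes of a particular character encoding.
Decode: the reverse process, converting the bytes of a particular character encoding back into a Unicode string.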
4. String encoding conversion
Can bytes of different character encodings be converted to each other through Unicode? The answer is yes.
In Python 2, a string is converted from one encoding to another as follows:
byte string --> decode('original character encoding') --> Unicode string --> encode('new character encoding') --> byte string
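#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2: a plain string literal is a byte string in the file's encoding (UTF-8 here)
utf_8_a = 'China'
# byte string -> Unicode string -> GBK byte string
gbk_a = utf_8_a.decode('utf-8').encode('gbk')
print(gbk_a.decode('gbk'))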
Output result:
China
Python3 defines strings as Unicode by default, so they can be directly encoded into new character encodings without first decoding:
string --> encode('new character encoding') --> bytes
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 3: str is already Unicode, so it can be encoded directly
utf_8_a = 'China'
gbk_a = utf_8_a.encode('gbk')
print(gbk_a.decode('gbk'))
Output result:
China
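When the starting point in Python 3 is bytes rather than a str (for example, data read from a file or a socket), the decode-then-encode route from section 4 still applies. A sketch of that route, where the helper name `transcode` is hypothetical and the sample text is the Chinese word for China:

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one character encoding to another via Unicode."""
    return data.decode(source_encoding).encode(target_encoding)


# "Zhongguo" (China), written with escapes so this file stays ASCII-only.
utf8_bytes = '\u4e2d\u56fd'.encode('utf-8')
gbk_bytes = transcode(utf8_bytes, 'utf-8', 'gbk')

print(utf8_bytes)  # b'\xe4\xb8\xad\xe5\x9b\xbd'
print(gbk_bytes)   # b'\xd6\xd0\xb9\xfa'

# Both byte strings decode back to the same Unicode text.
print(gbk_bytes.decode('gbk') == utf8_bytes.decode('utf-8'))  # True
```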
Conclusion
String encoding problems come up often in practical Python work. Once you understand how characters are encoded and decoded, many of them become easy to solve.
Well, I'm still learning Python. See you next time.