Unicode encoding and UTF-8 encoding

The following code is demonstrated in Python 2, first on Windows and then on Linux, to deepen your understanding of string encoding.
1. The first demonstration uses Python 2 on Windows. Encoding problems are common in Python 2, and they are resolved through encoding (encode) and decoding (decode). Open a CMD command-line window, start Python, and enter two strings, 'ABC' and u'ABC', as shown in the figure below. Note that the two strings have different types: the former is a byte string (str) and the latter is a Unicode string (unicode). Next, encode each of them as UTF-8; both display normally without error.
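The ASCII case above can be sketched roughly as follows. This is written for Python 3 rather than Python 2, on the assumption that `bytes` stands in for the Python 2 `str` type and `str` stands in for `unicode`; it is an illustration, not the original console session.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode (assumed mapping).
byte_string = b"ABC"      # plays the role of 'ABC' in Python 2
unicode_string = "ABC"    # plays the role of u'ABC' in Python 2

# Python 2 would quietly decode the byte string as ASCII before encoding it;
# in Python 3 that step has to be written out explicitly.
from_bytes = byte_string.decode("ascii").encode("utf-8")
from_unicode = unicode_string.encode("utf-8")

# Both routes produce the same UTF-8 bytes because 'ABC' is pure ASCII.
print(from_bytes, from_unicode)
```

Pure ASCII text survives either route, which is why no error appears at this stage.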
When the strings are changed to Chinese and the encoding is repeated, as shown in the figure below, the former raises an error while the latter does not. This error occurs frequently in Python 2, so it is important to remember that Python represents text in memory as Unicode. str1 is a byte string that is already encoded (GB2312 on Windows, UTF-8 on Linux), not a Unicode string, and encode() works directly only when the string being converted is Unicode. That is why str1 reports an error and str2 does not. To convert str1 to UTF-8 successfully, first decode str1 into Unicode, then encode it. The result is then consistent with the str2 conversion.
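A minimal sketch of the decode-then-encode step, again in Python 3 with `bytes` standing in for the Python 2 `str` type. The sample text (the two characters U+4E2D U+6587) is chosen only for illustration, since the original strings are visible only in the missing figure, and the GB2312 bytes are produced in code to imitate what a Windows console would supply.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode (assumed mapping).
text = "\u4e2d\u6587"            # sample Chinese text, chosen for illustration

str1 = text.encode("gb2312")     # bytes as a GB2312 console would supply them
str2 = text                      # the Unicode string, like u'...' in Python 2

# Decode str1 to Unicode with its real encoding first, then encode to UTF-8.
str1_utf8 = str1.decode("gb2312").encode("utf-8")
str2_utf8 = str2.encode("utf-8")

# The two conversions agree once str1 has been decoded correctly.
print(str1_utf8 == str2_utf8)
```

The key design point is that the codec passed to decode() must match the encoding the bytes actually use; only then do both paths produce the same UTF-8 output.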
2. Now repeat the demonstration in Python 2 on Linux with the same strings. The final result is the same, but the process is slightly different, as shown below.
On Windows the string is encoded as GB2312; on Linux it is encoded as UTF-8. On Linux, directly encoding a byte string that contains Chinese characters as UTF-8 reports an error, and decoding it with GB2312 also reports an error. Only decoding with UTF-8 and then encoding with UTF-8 outputs the correct result. Why does str1.encode('utf-8') report an error? The main reason is that str1 has not actually been decoded to Unicode. In fact, str1.encode('utf-8') performs an implicit decoding step first, but that decode() call uses the default encoding, and the default encoding is ASCII, as shown in the figure below.
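The implicit step can be imitated with a short Python 3 sketch. The helper `py2_style_encode` is hypothetical and exists only for this illustration; it is not a real Python API. It assumes `bytes` stands in for the Python 2 `str` type and that the default encoding is ASCII, as described above.

```python
# Python 3 sketch: bytes ~ Python 2 str. py2_style_encode is a hypothetical
# helper that imitates the implicit decode step; it is not a real Python API.
def py2_style_encode(byte_string, target, default_encoding="ascii"):
    # Step 1 (implicit in Python 2): decode with the default encoding.
    text = byte_string.decode(default_encoding)
    # Step 2: encode to the requested target; never reached if step 1 fails.
    return text.encode(target)

# Sample Chinese text as UTF-8 bytes, imitating the Linux case.
str1 = "\u4e2d\u6587".encode("utf-8")

try:
    py2_style_encode(str1, "utf-8")
    implicit_failed = False
except UnicodeDecodeError:
    # The ASCII decode rejects the first non-ASCII byte.
    implicit_failed = True

# Decoding explicitly with the real encoding avoids the ASCII step entirely.
explicit = str1.decode("utf-8").encode("utf-8")

print(implicit_failed, explicit == str1)
```

Passing `b"ABC"` to the same helper succeeds, which matches the earlier observation that pure ASCII byte strings encode without error.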
When the Chinese string is decoded using ASCII, an error is reported, so encode('utf-8') is never executed at all. That covers string encoding problems in Python 2; the next article will introduce string encoding in Python 3.