Hi, I’m a Python advancer. Following up on the earlier post about solving UnicodeEncodeError: 'gbk' codec can't encode character, here are three ways to handle garbled Chinese characters (mojibake) when writing web crawlers. I hope they help you in your learning.
Preface
A few days ago, a follower asked a question about garbled Chinese characters encountered while using a Python web crawler, as shown in the picture below.
It looks intimidating, and for a crawler novice this garbled output can feel like a formidable obstacle. But don’t panic: this post sorts out three methods specifically for garbled Chinese, so the next time you run into the problem you can find inspiration here!
1. Approach
The key to solving the problem is handling the garbled part, and there are two main angles of attack. One is to set the encoding for the whole web page up front; the other is to re-encode just the specific piece of Chinese text that comes out garbled. Three methods are given below; there are definitely others, so feel free to share yours in the comments section.
2. Analysis
Garbled Chinese can take many forms, but the two most common look like this:
1. When the page is encoded in GBK and the fetched content is printed to the console, the output looks similar to:
AAA ® (including cAO by A quarter of uAI » u · ¿ ¿ E ° ® Ð ¡ 1/2 level a 1/2 level a4k + UO 1/2 level
2. Also when the page is encoded in GBK, the printed content may instead look like:
�װŮ�� Ů ˮ с Ϫ ψ �
Note that in both cases the program runs without errors; the console even reports:
Process finished with exit code 0
But the Chinese text it prints is not something an ordinary person can read.
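To make the symptom concrete, here is a minimal sketch (not from the original post; the URL is a placeholder) showing how this kind of mojibake typically arises with the requests library:

```python
import requests

# Hypothetical GBK-encoded page, used only to illustrate the symptom.
url = "http://example.com/gbk-page.html"

response = requests.get(url)
# When the server does not declare a charset, requests falls back to
# ISO-8859-1, so the GBK bytes are decoded with the wrong codec and the
# Chinese text prints as mojibake.
print(response.encoding)    # often 'ISO-8859-1' for such pages
print(response.text[:60])   # gibberish instead of Chinese
```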
In either case, the three methods given below will solve the problem every time!
3. Concrete implementation
1) Method one: change requests.get().text to requests.get().content
We can see that the page source obtained through the .text attribute is indeed garbled when printed, as shown in the figure below.
Change the request to .content and you will get readable content.
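A minimal sketch of method one, again with a placeholder URL: .text decodes with the encoding requests guessed, while .content returns the raw bytes, which you can decode yourself with the page's real encoding.

```python
import requests

url = "http://example.com/gbk-page.html"  # placeholder URL

# .text decodes with the guessed encoding and may print mojibake.
garbled = requests.get(url).text

# .content returns raw bytes; decode them with the page's real
# encoding (GBK here) to get readable Chinese.
readable = requests.get(url).content.decode('gbk')
print(readable[:60])
```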
2) Method two: manually specify the page encoding
response.encoding = response.apparent_encoding
This method is slightly more involved, but it is easy to understand and easy for beginners to accept.
If you find the line above hard to remember, you can also simply specify GBK directly, as shown in the picture below:
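A sketch of method two under the same assumptions: either let requests detect the real encoding through apparent_encoding, or assign 'gbk' directly when you already know it.

```python
import requests

url = "http://example.com/gbk-page.html"  # placeholder URL
response = requests.get(url)

# Option A: let requests detect the real encoding from the page content.
response.encoding = response.apparent_encoding

# Option B: specify the encoding directly if you already know it.
# response.encoding = 'gbk'

print(response.text[:60])  # now decoded correctly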
The two methods above work on the encoding of the whole web page, and the effect is remarkable. The third method instead applies a general encode/decode trick only to the specific piece of Chinese text that is garbled.
3) Method three: use the general encode/decode method
img_name.encode('iso-8859-1').decode('gbk')
Apply this general encoding trick to whatever piece of Chinese text comes out garbled: encode img_name back to bytes with the codec it was wrongly decoded with, then decode those bytes with the correct codec, as shown in the figure below.
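A sketch of method three, assuming img_name is a string (for example an image title extracted with an HTML parser) that came from a GBK page but was decoded as ISO-8859-1: re-encoding recovers the original bytes, and decoding with GBK yields readable Chinese.

```python
# img_name currently holds mojibake because the GBK bytes were decoded
# as ISO-8859-1; reverse that step, then decode with the right codec.
fixed_name = img_name.encode('iso-8859-1').decode('gbk')
print(fixed_name)
```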
In this way, the problem of Chinese garbled characters is solved.
4. Summary
I’m a Python advancer. Prompted by a follower's question, this article has presented three solutions to the garbled Chinese problem in Python web crawlers, which helped the follower solve the issue smoothly. Three methods are given here, but I’m sure there are others out there, and I welcome your suggestions in the comments section.
Friends, go try it out in practice! If you run into any problems while learning, add me as a friend and I will invite you to join the Python learning exchange group so we can discuss and learn together.