When you scrape data with a crawler, you will often run into garbled text. What should you do when that happens?
When you see garbled text, your first instinct may be that the crawler fetched the wrong content. Usually it did not; it is simply an encoding problem.
In general, a crawler deals with encodings in two places. One is decoding the content returned after a request is sent. The other is choosing the encoding when saving the file. Let's go through them in turn.
1. Sending the request and fetching the page content
Most sites are encoded in UTF-8. If the encoding your program decodes with is also UTF-8, that is, it matches the target site's encoding, nothing goes wrong even if you never specify an encoding.
If the two differ, you get garbled characters. This is also why people often ask why the same code works fine on one computer and produces garbage on another. The fix is to set the encoding in the code: r.encoding = r.apparent_encoding lets requests guess the encoding from the response body, so you do not have to set it yourself. In rare cases the guess is wrong and the text is still garbled; then check the page's actual encoding manually and assign it to r.encoding.
import requests

def fetchURL(url):
    # Browser-like request headers
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    r = requests.get(url, headers=headers)
    # Set the encoding here: let requests guess it from the response body
    r.encoding = r.apparent_encoding
    return r.text
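To make the mismatch concrete, here is a minimal offline sketch that uses plain bytes instead of a live request. It assumes a page encoded in GBK; the helper decode_with_fallback and its candidate list are hypothetical names for illustration and are not part of requests.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gb18030")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway and mark undecodable bytes with U+FFFD.
    return raw.decode(candidates[0], errors="replace"), candidates[0]


# Assumption: these bytes stand in for r.content from a GBK-encoded page.
page_bytes = "编码测试".encode("gbk")

# Decoding with the wrong codec is what produces garbled text.
wrong = page_bytes.decode("utf-8", errors="replace")

# Trying the candidates in order recovers the original text.
text, used = decode_with_fallback(page_bytes)
print(repr(wrong))
print(text, used)
```

With requests, the equivalent manual override would be to assign the encoding you found to r.encoding before reading r.text. Trying codecs in order is only a heuristic, since bytes written in one encoding can occasionally be valid in another, so checking the page's declared charset is still worthwhile.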
2. Garbled text when saving the file
This is a problem readers report often: the crawl itself works fine, but the saved CSV file shows garbled characters when opened in Excel (while the same file looks fine in Notepad). The cause is a mismatch between the encoding the file was written with and the encoding Excel uses to decode it.
The fix is to set encoding='utf_8_sig' when saving, that is, to pass encoding='utf_8_sig' to dataframe.to_csv().
import pandas as pd

def writePage(urating):
    '''Function: write the scraped content to a local file'''
    dataframe = pd.DataFrame(urating)
    # utf_8_sig is UTF-8 with a leading BOM; mode='a' appends to the file
    dataframe.to_csv('filename.csv', encoding='utf_8_sig', mode='a', index=False, sep=',', header=False)
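To see what utf_8_sig actually changes, the following sketch compares the bytes produced by the two codecs using only the standard library. The sample row is made up for illustration. The only difference is the three-byte byte order mark (BOM) at the start, which Excel is generally understood to use as the signal that a CSV file is UTF-8.

```python
import codecs

row = "姓名,评分\n"                      # made-up sample row

plain = row.encode("utf-8")             # no BOM
with_bom = row.encode("utf_8_sig")      # BOM + the same UTF-8 bytes

has_bom = with_bom.startswith(codecs.BOM_UTF8)
same_payload = with_bom[len(codecs.BOM_UTF8):] == plain

# Decoding with utf_8_sig strips the BOM again;
# decoding with plain utf-8 keeps it as a leading U+FEFF character.
decoded = with_bom.decode("utf_8_sig")
kept = with_bom.decode("utf-8")

print(codecs.BOM_UTF8, has_bom, same_payload)
```

This also explains why Notepad shows the file correctly either way: the text bytes are identical, and only the marker in front differs.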
If you already have a CSV file that shows up garbled, open it in Notepad, click Save As, choose an encoding (ANSI, Unicode, or UTF-8), and save. Opening the file in Excel again then works normally.
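If there are many files, the Notepad step can be approximated in code. The sketch below is one possible way to do it, assuming the existing file is UTF-8 (with or without a BOM); the function name resave_with_bom and the file names are hypothetical.

```python
import os
import tempfile


def resave_with_bom(src, dst, src_encoding="utf-8-sig"):
    """Rewrite a CSV file as UTF-8 with a BOM so Excel can detect the encoding.

    Reading with utf-8-sig accepts input with or without a BOM, which avoids
    writing a second BOM. newline="" leaves the line endings untouched.
    """
    with open(src, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)


# Small self-contained demo in a temporary directory.
content = "名称,评分\r\n示例,5\r\n"
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "ratings.csv")
    dst = os.path.join(tmp, "ratings_excel.csv")
    with open(src, "w", encoding="utf-8", newline="") as f:
        f.write(content)
    resave_with_bom(src, dst)
    with open(dst, "rb") as f:
        result = f.read()

print(result[:3])
```

If the source file is in another encoding such as GBK, pass it as src_encoding; guessing wrong here raises a UnicodeDecodeError or silently produces garbled text, so check the file first.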
3. Common garbled characters
Here are several common kinds of garbled text for reference.
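As a rough way to recognize such patterns, you can reproduce them yourself by encoding text with one codec and decoding it with another. The sketch below is an illustration under that assumption, not an exhaustive list; the helper name garble is hypothetical.

```python
def garble(text, actual, assumed):
    """Encode with `actual`, decode with `assumed`; bad bytes become U+FFFD."""
    return text.encode(actual).decode(assumed, errors="replace")


# UTF-8 bytes read as Latin-1: each non-ASCII character turns into
# two or more accented Latin characters.
utf8_as_latin1 = garble("é", "utf-8", "latin-1")

# A UTF-8 file with a BOM read as plain UTF-8: an invisible U+FEFF
# character is left at the very start of the text.
bom_kept = garble("id", "utf_8_sig", "utf-8")

# This kind of damage is reversible as long as no bytes were lost:
# re-encode with the wrong codec, then decode with the right one.
repaired = utf8_as_latin1.encode("latin-1").decode("utf-8")

print(repr(utf8_as_latin1), repr(bom_kept), repr(repaired))
```

The Latin-1 case is worth knowing for crawlers in particular: as far as I know, requests falls back to ISO-8859-1 for text responses whose headers declare no charset, which is one reason a UTF-8 page can come back garbled until r.encoding is set.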