The day before yesterday, I shared how to use a Python web crawler to scrape WeChat Moments data. Today, I will walk you through the code implementation (hands-on practice) and then take it a step further.
I. Code implementation
1. Modify the items.py file in the Scrapy project. The data we need to capture are the Moments content and the publication date, so we define the date and dynamic fields here, as shown in the figure below.
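Since the original figure is not reproduced here, below is a minimal sketch of what items.py might look like; the class name WeixinMomentItem and the field names date and dynamic are taken from this walkthrough.

```python
# items.py -- a minimal sketch, assuming the two fields are named "date" and "dynamic"
import scrapy


class WeixinMomentItem(scrapy.Item):
    date = scrapy.Field()     # publication date of the Moments post
    dynamic = scrapy.Field()  # the Moments content itself
```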
2. Modify the main spider file that implements the crawling logic. The first step is to import the required modules, especially the WeixinMomentItem class from items.py. Then modify the start_requests method, as shown in the following figure.
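For reference, here is a hedged sketch of that spider file. The package name weixin_moment, the spider name moment, and the URL are assumptions or placeholders, not the real captured interface.

```python
# Spider sketch -- assumes the project package is named "weixin_moment" and the
# spider is named "moment"; the URL below is only a placeholder.
import scrapy

from weixin_moment.items import WeixinMomentItem  # the item class defined in items.py


class MomentSpider(scrapy.Spider):
    name = "moment"

    def start_requests(self):
        # Placeholder URL: replace it with the Moments interface address
        # captured during your own packet analysis.
        url = "http://example.com/moments/index"
        yield scrapy.Request(url=url, callback=self.parse)
```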
3. Modify the parse method to parse the captured response. The code here is slightly more involved, as shown in the figure below; a code sketch follows the notes below.
- Note that the response obtained from the web page is of type bytes. It must be decoded to str before parsing; otherwise an error will be raised.
- Because this is a POST request, the parameters must be constructed by hand. It is important that year, month, and index in the parameters are all strings; otherwise the server returns a 400 status code (invalid request parameters) and the program errors out at runtime.
- The request also needs request headers; in particular the Referer (anti-hotlinking) header must be included, otherwise the redirected page cannot be found and an error is raised.
- The construction above is not the only way to write this code; other approaches work as well.
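Putting those notes together, a rough sketch of what the parse method might look like is shown below. The URLs, parameter values, and key names are placeholders for illustration only, not the real Moments interface.

```python
# Continuing the MomentSpider sketch above -- a hedged version of parse.
# All URLs and parameter values here are placeholders.
import scrapy


class MomentSpider(scrapy.Spider):  # abbreviated; see the earlier sketch
    name = "moment"

    def parse(self, response):
        # The raw body is bytes; decode it to str before any string parsing,
        # otherwise string operations will raise errors.
        text = response.body.decode("utf-8")  # parsing of this text is omitted here

        # Build the POST parameters. year, month, and index must all be strings,
        # or the server answers with a 400 (bad request) status code.
        data = {
            "year": "2019",   # placeholder values for illustration
            "month": "04",
            "index": "1",
        }

        # The Referer (anti-hotlinking) header is required; without it the
        # redirected page cannot be found and the request fails.
        headers = {
            "Referer": "http://example.com/moments/index",  # placeholder URL
        }

        yield scrapy.FormRequest(
            url="http://example.com/moments/data",  # placeholder API address
            formdata=data,
            headers=headers,
            callback=self.parse_moment,
        )
```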
4. Define the parse_moment function to extract the Moments data. The returned data is in JSON format, so it is parsed with the json module.
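Below is a hedged sketch of parse_moment. The JSON key names ("data", "datetime", "content") are assumptions about the response structure and should be replaced with the real field names returned by the captured interface.

```python
# Continuing the spider sketch -- a hedged parse_moment.
import json

import scrapy

from weixin_moment.items import WeixinMomentItem  # assumed package name


class MomentSpider(scrapy.Spider):  # abbreviated; see the earlier sketch
    name = "moment"

    def parse_moment(self, response):
        # The response body is JSON text, so deserialize it first.
        result = json.loads(response.text)

        # The key names below are assumptions for illustration.
        for entry in result.get("data", []):
            item = WeixinMomentItem()
            item["date"] = entry.get("datetime")    # publication date
            item["dynamic"] = entry.get("content")  # Moments content
            yield item
```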
5. Uncomment ITEM_PIPELINES in settings.py so that the data is processed through the pipeline.
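The relevant settings.py snippet might look like the following; the pipeline class name follows Scrapy's default project template and is an assumption.

```python
# settings.py -- uncomment (or add) ITEM_PIPELINES so items flow through the pipeline.
ITEM_PIPELINES = {
    "weixin_moment.pipelines.WeixinMomentPipeline": 300,  # assumed class name
}
```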
6. Run scrapy crawl moment -o moment.json on the command line.
7. This produces a moment.json file that stores our Moments data, as shown below.
8. Yes, you read that right: the output looks garbled, but it is not gibberish; it is an encoding issue. To fix it, delete the original moment.json file and re-run the command with the export encoding specified: scrapy crawl moment -o moment.json -s FEED_EXPORT_ENCODING=utf-8
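As an alternative to passing the -s flag on every run, the export encoding can be fixed once in settings.py using Scrapy's standard FEED_EXPORT_ENCODING setting:

```python
# settings.py -- write feed exports (e.g. moment.json) as UTF-8 instead of escaped ASCII.
FEED_EXPORT_ENCODING = "utf-8"
```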
In the next article, I will show you how to take the captured Moments data and visualize it. Stay tuned!