While developing a personal mini program, I wanted some poetry content to display in it, so I decided to use Python to crawl poetry data from the web, and found this poetry website.
The Python crawler code is at github.com/yueyue10/.. … . Mini program QR code:
A screenshot from the mini program is shown below:
Crawling process
Inspect the structure of the poetry list on the website. In the code, lxml.etree locates the parent tag of the list; we then iterate over the child tags, extract their attributes into an entity class, convert the data to JSON with the json.dumps method, and finally save the JSON data to a file. Part of the code is as follows:
def get_info_from_cate4(self, url):
    html = self.get_html_text(url)
    com_html = etree.HTML(html)
    # Parent tag of the category-card list
    cate_card_div = com_html.xpath('//*[@id="main_left"]/div[@class="cate_card"]/div')
    for cate in cate_card_div:
        # Only process child divs that carry a class attribute
        if len(cate.xpath('@class')) > 0:
            title = cate.xpath('.//h3/a[1]/text()')[0]
            href = cate.xpath('.//h3/a[1]/@href')[0]
            image = cate.xpath('.//a/img/@src')[0]
            gradepoetry = self.get_detail_from_cate(title, image, href)
            # Serialize the entity object via its attribute dictionary
            gradepoetry_json = json.dumps(gradepoetry, default=lambda obj: obj.__dict__,
                                          sort_keys=True, indent=4)
            self.save_json_in_json(gradepoetry.grade, gradepoetry_json)
            print(gradepoetry_json)
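The key trick in the snippet above is `default=lambda obj: obj.__dict__`, which lets json.dumps serialize plain Python objects by falling back to their attribute dictionaries. A minimal self-contained sketch of that technique (the Poem and GradePoetry classes here are assumptions for illustration, not the project's actual entity definitions):

```python
import json

class Poem:
    """Hypothetical entity class mirroring the fields extracted by XPath."""
    def __init__(self, title, href, image):
        self.title = title
        self.href = href
        self.image = image

class GradePoetry:
    """Hypothetical container: one grade and its list of poems."""
    def __init__(self, grade, poems):
        self.grade = grade
        self.poems = poems

gp = GradePoetry("grade1", [Poem("静夜思", "/shiwenv_1.aspx", "/img/1.png")])

# default is called for any object json.dumps cannot serialize natively;
# returning obj.__dict__ turns nested entity objects into plain dicts
gp_json = json.dumps(gp, default=lambda obj: obj.__dict__,
                     sort_keys=True, indent=4, ensure_ascii=False)
print(gp_json)
```

`ensure_ascii=False` keeps the Chinese text readable in the output instead of escaping it to `\uXXXX` sequences.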
Quick start
1. Clone the PythonPro project to your local machine
2. Open the pandas_data directory with IDEA
3. Run the spider.py program under each category in the Poetry directory to generate the corresponding JSON file