This article uses the XPath syntax of the lxml library to crawl the Python crawler tag pages of Jobbole (Bole Online) and store the scraped data in MongoDB. The focus is on XPath syntax and on writing to a MongoDB database.
For a full summary of XPath usage, see this article.
```python
import requests
from lxml import etree
from pymongo import MongoClient


def start_requests(url):
    r = requests.get(url)
    return r.content


def parse(text):
    html = etree.HTML(text)
    divs = html.xpath('//div[@id="archive"]//div[@class="post-meta"]')
    for div in divs:
        # intermediate variable: some posts have no comment link
        comment = div.xpath('./p/a[3]/text()')
        # use a generator to return one dictionary per post
        yield {
            'title': div.xpath('./p/a[1]/text()')[0],
            'url': div.xpath('./p/a[1]/@href')[0],
            'time': div.xpath('./p//text()')[2].strip(' \r\n'),
            'type': div.xpath('./p/a[2]/text()')[0],
            'typeurl': div.xpath('./p/a[2]/@href')[0],
            'comment': comment[0] if comment else None,
            'excerpt': div.xpath('./span[@class="excerpt"]/p/text()')[0],
        }


def get_all():
    for i in range(1, 6):  # loop over and crawl all pages
        # the tag is the URL-encoded form of "爬虫" (crawler)
        url = 'http://python.jobbole.com/tag/%E7%88%AC%E8%99%AB/page/{}/'.format(i)
        text = start_requests(url)
        yield from parse(text)


def main():
    client = MongoClient()       # connect to MongoDB (default localhost:27017)
    db = client.bole             # create/use the database
    boledb = db.bole             # create/use the collection ("table")
    for item in get_all():
        boledb.insert_one(item)  # insert one document per post
    client.close()               # disconnect


if __name__ == '__main__':
    main()
```
The results saved to the database are as follows
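To verify the inserted documents from Python rather than from a GUI client, here is a minimal sketch using pymongo; it assumes the `bole` database and `bole` collection names from `main()` above:

```python
from pymongo import MongoClient

client = MongoClient()             # same default connection as in main()
boledb = client.bole.bole          # the "bole" database and collection used above

print(boledb.count_documents({}))  # how many posts were saved
for doc in boledb.find().limit(3):
    print(doc['title'], doc['url'])  # spot-check a few documents

client.close()
```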
Reading the code above requires the following background:
1. Basic principles of crawlers
2. Installation and use of MongoDB; see these two articles:
- MongoDB installation, configuration, and introduction
- Connecting to MongoDB from Python
3. XPath syntax (a minimal worked example follows this list)
- XPath overview
4. Using generators in crawlers (a second sketch follows the XPath example below)
- Crawler Code Improvement (III)
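For point 3, here is a minimal, self-contained illustration of the relative XPath expressions used in `parse()`. The HTML snippet is invented for demonstration and only mimics the structure the crawler expects:

```python
from lxml import etree

# a tiny stand-in for one post block on the real page (hypothetical markup)
snippet = '''
<div id="archive">
  <div class="post-meta">
    <p><a href="/post/1">Title</a> · <a href="/tag/crawler">crawler</a></p>
    <span class="excerpt"><p>Short summary...</p></span>
  </div>
</div>
'''

html = etree.HTML(snippet)
# '//' searches the whole tree; each matched node becomes a new context
div = html.xpath('//div[@id="archive"]//div[@class="post-meta"]')[0]
# './' makes the expression relative to the current div
print(div.xpath('./p/a[1]/text()')[0])  # Title
print(div.xpath('./p/a[1]/@href')[0])   # /post/1
print(div.xpath('./p/a[2]/text()')[0])  # crawler
```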
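For point 4, a small sketch of why generators suit crawlers: pages are fetched and parsed lazily, so items stream to the database one at a time instead of being accumulated in memory first. The `fetch()` function and the page list here are placeholders, not the real site:

```python
def fetch(url):
    # placeholder for start_requests(); returns fake "page content"
    return ['item-a@' + url, 'item-b@' + url]

def parse(content):
    for item in content:
        yield {'value': item}          # one dict at a time, not a full list

def get_all(page_urls):
    for url in page_urls:
        yield from parse(fetch(url))   # flatten pages into one lazy stream

# the consumer pulls items one by one; nothing is buffered up front
for record in get_all(['page/1/', 'page/2/']):
    print(record)
```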
Column information
- Column home: Programming in Python
- Table of contents
- Crawler directory
- Version notes: software and package version descriptions