This article uses the XPath syntax of the lxml library to crawl the Python crawler tag pages of Jobbole (Bole Online) and store the scraped data in MongoDB. The focus is on XPath syntax and on writing to a MongoDB database.
For a full summary of XPath usage, see this article.
```python
import requests
from lxml import etree
from pymongo import MongoClient


def start_requests(url):
    r = requests.get(url)
    return r.content


def parse(text):
    html = etree.HTML(text)
    divs = html.xpath('//div[@id="archive"]//div[@class="post-meta"]')
    for div in divs:
        # intermediate variable: some posts have no comment link
        comment = div.xpath('./p/a[3]/text()')
        # use a generator to return one dictionary per post
        yield {
            'title': div.xpath('./p/a[1]/text()')[0],
            'url': div.xpath('./p/a[1]/@href')[0],
            'time': div.xpath('./p//text()')[2].strip(' \r\n'),
            'type': div.xpath('./p/a[2]/text()')[0],
            'typeurl': div.xpath('./p/a[2]/@href')[0],
            'comment': comment[0] if comment else None,
            'excerpt': div.xpath('./span[@class="excerpt"]/p/text()')[0],
        }


def get_all():
    for i in range(1, 6):  # loop over and crawl all pages
        # the tag is the URL-encoded form of "爬虫" (crawler)
        url = 'http://python.jobbole.com/tag/%E7%88%AC%E8%99%AB/page/{}/'.format(i)
        text = start_requests(url)
        yield from parse(text)


def main():
    client = MongoClient()       # connect to MongoDB (default localhost:27017)
    db = client.bole             # create/use the database
    boledb = db.bole             # create/use the collection ("table")
    for item in get_all():
        boledb.insert_one(item)  # insert one document per post
    client.close()               # disconnect


if __name__ == '__main__':
    main()
```
The results saved to the database are as follows
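To verify the inserted documents from Python rather than from a GUI client, here is a minimal sketch using pymongo; it assumes the `bole` database and `bole` collection names from `main()` above:

```python
from pymongo import MongoClient

client = MongoClient()             # same default connection as in main()
boledb = client.bole.bole          # the "bole" database and collection used above

print(boledb.count_documents({}))  # how many posts were saved
for doc in boledb.find().limit(3):
    print(doc['title'], doc['url'])  # spot-check a few documents

client.close()
```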
Reading the code above requires the following background:
1. Basic principles of crawlers
2. Installation and use of MongoDB; see these two articles:
- MongoDB installation, configuration, and introduction
- Connecting to MongoDB from Python
3. XPath syntax (a minimal worked example follows this list)
- XPath overview
4. Using generators in crawlers (a second sketch follows the XPath example below)
- Crawler Code Improvement (III)
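For point 3, here is a minimal, self-contained illustration of the relative XPath expressions used in `parse()`. The HTML snippet is invented for demonstration and only mimics the structure the crawler expects:

```python
from lxml import etree

# a tiny stand-in for one post block on the real page (hypothetical markup)
snippet = '''
<div id="archive">
  <div class="post-meta">
    <p><a href="/post/1">Title</a> · <a href="/tag/crawler">crawler</a></p>
    <span class="excerpt"><p>Short summary...</p></span>
  </div>
</div>
'''

html = etree.HTML(snippet)
# '//' searches the whole tree; each matched node becomes a new context
div = html.xpath('//div[@id="archive"]//div[@class="post-meta"]')[0]
# './' makes the expression relative to the current div
print(div.xpath('./p/a[1]/text()')[0])  # Title
print(div.xpath('./p/a[1]/@href')[0])   # /post/1
print(div.xpath('./p/a[2]/text()')[0])  # crawler
```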
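For point 4, a small sketch of why generators suit crawlers: pages are fetched and parsed lazily, so items stream to the database one at a time instead of being accumulated in memory first. The `fetch()` function and the page list here are placeholders, not the real site:

```python
def fetch(url):
    # placeholder for start_requests(); returns fake "page content"
    return ['item-a@' + url, 'item-b@' + url]

def parse(content):
    for item in content:
        yield {'value': item}          # one dict at a time, not a full list

def get_all(page_urls):
    for url in page_urls:
        yield from parse(fetch(url))   # flatten pages into one lazy stream

# the consumer pulls items one by one; nothing is buffered up front
for record in get_all(['page/1/', 'page/2/']):
    print(record)
```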
Column information
- Column home: Programming in Python
- Table of contents
- Crawler directory
- Version notes: software and package version descriptions