One of the first problems in machine learning is preparing the data. Data can be collected by a company itself, purchased or exchanged, taken from publicly available datasets published by government agencies and businesses, or scraped from the web by crawlers. This article describes how to write a crawler to collect public data from the web.

Crawlers can be written in many languages, but with varying degrees of difficulty. Python, an interpreted "glue" language, is easy to pick up, ships with a complete standard library, and has a wide range of open-source libraries ("Life is short, you need Python!"). It is widely used in web development, scientific computing, data mining/analysis, artificial intelligence, and many other fields.

The environment used here is Python 3.5.2; install Scrapy with pip3 install scrapy.

Create a project

First create a Scrapy project named kiwi with the command scrapy startproject kiwi, which generates the project folders and file templates.
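For reference, the generated template looks roughly like this (the exact files vary slightly between Scrapy versions):

kiwi/
    scrapy.cfg            # deployment configuration
    kiwi/
        __init__.py
        items.py          # data structure definitions (used below)
        pipelines.py      # item pipelines (not used in this article)
        settings.py       # project settings
        spiders/
            __init__.py   # spider modules go in this package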

Defining data structures

settings.py holds the project settings, and items.py (under the kiwi package) is where the parsed data is stored. Define the data structures in that file; example code:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://www.smpeizi.com/en/latest/topics/items.html

import scrapy


class AuthorInfo(scrapy.Item):
    authorName = scrapy.Field()  # The author's nickname
    authorUrl = scrapy.Field()   # The author's URL


class ReplyItem(scrapy.Item):
    content = scrapy.Field()  # Reply content
    time = scrapy.Field()     # Publication time
    author = scrapy.Field()   # Replier (AuthorInfo)


class TopicItem(scrapy.Item):
    title = scrapy.Field()       # Post title
    url = scrapy.Field()         # Post page URL
    content = scrapy.Field()     # Post content
    time = scrapy.Field()        # Publication time
    author = scrapy.Field()      # Poster (AuthorInfo)
    reply = scrapy.Field()       # List of ReplyItem
    replyCount = scrapy.Field()  # Number of replies



AuthorInfo and the list of ReplyItem are nested inside TopicItem above, but every field is still declared as scrapy.Field(). Note that all three classes must inherit from scrapy.Item.
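As a quick illustration of this point, here is a minimal sketch (with made-up values, not taken from Douban) of how the nested fields end up being assigned:

author = AuthorInfo(authorName='someone', authorUrl='https://www.douban.com/people/xxx/')

topic = TopicItem()
topic['author'] = dict(author)    # nested item wrapped with dict(), as noted above
topic['reply'] = [dict(ReplyItem(content='a reply', time='2016-08-01 10:00',
                                 author=dict(author)))]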

Creating a crawler spider

The spider code lives in kiwi_spider.py under the project's spiders directory; write the crawler logic in this file. The example below crawls the posts and replies of a Douban group.

# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from kiwi.items import TopicItem, AuthorInfo, ReplyItem


class KiwiSpider(CrawlSpider):
    name = 'kiwi'
    allowed_domains = ['douban.com']

    anchorTitleXPath = 'a/text()'
    anchorHrefXPath = 'a/@href'

    start_urls = [
        'https://www.pzzs168.com/group/topic/90895393/?start=0',
    ]
    rules = (
        Rule(
            LinkExtractor(allow=(r'/group/[^/]+/discussion\?start=\d+')),  # Topic list pages
            callback='parse_topic_list',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(r'/group/topic/\d+/$')),  # Post content page
            callback='parse_topic_content',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(r'/group/topic/\d+/\?start=\d+')),  # Post content paging
            callback='parse_topic_content',
            follow=True
        ),
    )

    # Post detail page
    def parse_topic_content(self, response):
        # XPath for the title
        titleXPath = '//html/head/title/text()'
        # XPath for the post content
        contentXPath = '//div[@class="topic-content"]/p/text()'
        # XPath for the post time
        timeXPath = '//div[@class="topic-doc"]/h3/span[@class="color-green"]/text()'
        # XPath for the poster
        authorXPath = '//div[@class="topic-doc"]/h3/span[@class="from"]'

        item = TopicItem()
        # Current page URL
        item['url'] = response.url
        # Title
        titleFragment = Selector(response).xpath(titleXPath)
        item['title'] = str(titleFragment.extract()[0]).strip()

        # Post content
        contentFragment = Selector(response).xpath(contentXPath)
        strs = [line.extract().strip() for line in contentFragment]
        item['content'] = '\n'.join(strs)
        # Post time
        timeFragment = Selector(response).xpath(timeXPath)
        if timeFragment:
            item['time'] = timeFragment[0].extract()

        # Poster information
        authorInfo = AuthorInfo()
        authorFragment = Selector(response).xpath(authorXPath)
        if authorFragment:
            authorInfo['authorName'] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]
            authorInfo['authorUrl'] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

        item['author'] = dict(authorInfo)

        # XPath for the reply list
        replyRootXPath = r'//div[@class="reply-doc content"]'
        # XPath for the reply time
        replyTimeXPath = r'div[@class="bg-img-green"]/h4/span[@class="pubtime"]/text()'
        # XPath for the replier
        replyAuthorXPath = r'div[@class="bg-img-green"]/h4'

        replies = []
        itemsFragment = Selector(response).xpath(replyRootXPath)
        for replyItemXPath in itemsFragment:
            replyItem = ReplyItem()
            # Reply content
            contents = replyItemXPath.xpath('p/text()')
            strs = [line.extract().strip() for line in contents]
            replyItem['content'] = '\n'.join(strs)
            # Reply time
            timeFragment = replyItemXPath.xpath(replyTimeXPath)
            if timeFragment:
                replyItem['time'] = timeFragment[0].extract()
            # Replier
            replyAuthorInfo = AuthorInfo()
            authorFragment = replyItemXPath.xpath(replyAuthorXPath)
            if authorFragment:
                replyAuthorInfo['authorName'] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]
                replyAuthorInfo['authorUrl'] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

            replyItem['author'] = dict(replyAuthorInfo)
            # Add to the reply list
            replies.append(dict(replyItem))

        item['reply'] = replies
        yield item

    # Post list page
    def parse_topic_list(self, response):
        # XPath for the post list (skip the header row)
        topicRootXPath = r'//table[@class="olt"]/tr[position()>1]'
        # XPath for a single post entry
        titleXPath = r'td[@class="title"]'
        # XPath for the poster
        authorXPath = r'td[2]'
        # XPath for the reply count
        replyCountXPath = r'td[3]/text()'
        # XPath for the post time
        timeXPath = r'td[@class="time"]/text()'

        topicsPath = Selector(response).xpath(topicRootXPath)
        for topicItemPath in topicsPath:
            item = TopicItem()
            titlePath = topicItemPath.xpath(titleXPath)
            item['title'] = titlePath.xpath(self.anchorTitleXPath).extract()[0]
            item['url'] = titlePath.xpath(self.anchorHrefXPath).extract()[0]
            # Post time
            timePath = topicItemPath.xpath(timeXPath)
            if timePath:
                item['time'] = timePath[0].extract()
            # Poster
            authorPath = topicItemPath.xpath(authorXPath)
            authInfo = AuthorInfo()
            authInfo['authorName'] = authorPath[0].xpath(self.anchorTitleXPath).extract()[0]
            authInfo['authorUrl'] = authorPath[0].xpath(self.anchorHrefXPath).extract()[0]
            item['author'] = dict(authInfo)
            # Reply count
            replyCountPath = topicItemPath.xpath(replyCountXPath)
            item['replyCount'] = replyCountPath[0].extract()

            item['content'] = ''
            yield item

    parse_start_url = parse_topic_content






Pay special attention to the following points:

1. KiwiSpider inherits from CrawlSpider so that it can follow links and crawl pages according to the rules.

2. parse_start_url is the callback CrawlSpider uses for the responses of start_urls; you can point it at one of your own parsing functions, as the last line of the code above does.

3. start_urls holds the entry URLs; you can list more than one.

4. rules defines which of the URLs found on a crawled page should be followed and which callback handles each of them; the patterns are written as regular expressions. The example above keeps following the topic list pages and the post detail pages (including their pagination).

5. Notice the dict() wrapping in the code. Although the author field defined in items.py is meant to hold AuthorInfo data, assigning the item object directly (item['author'] = authInfo) raises an error, so the value has to be wrapped as dict(authInfo).

When extracting content, use XPath to pull out what you need; see an XPath tutorial for the details of the syntax. During development you can use browser tools to work out the XPath expressions, for example the FireBug and FirePath add-ons in Firefox. For the page https://www.aiidol.com/group/python/discussion?start=0, the XPath rule //td[@class="title"] selects the list of post titles.

Typing an XPath rule into such a tool lets you check whether it matches what you expect. Newer versions of Firefox can install the Try XPath add-on to inspect XPath, and Chrome can install the XPath Helper extension.
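If you prefer to check an XPath rule without a browser plug-in, you can also test it directly against an HTML fragment with Scrapy's Selector. The snippet below is a sketch; the HTML is a made-up fragment shaped like the Douban topic list table, not a real page:

from scrapy.selector import Selector

html = '''
<table class="olt">
  <tr><td class="title"><a href="https://www.douban.com/group/topic/1/">First post</a></td></tr>
  <tr><td class="title"><a href="https://www.douban.com/group/topic/2/">Second post</a></td></tr>
</table>
'''

# The same rule used in the spider's parse_topic_list()
titles = Selector(text=html).xpath('//td[@class="title"]/a/text()').extract()
print(titles)  # ['First post', 'Second post']

Another option is scrapy shell <url>, which opens an interactive session where response.xpath(...) can be tried against the live page.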

Use a random UserAgent

To make the crawler's requests look more like normal browser traffic, you can write a downloader middleware that supplies a random User-Agent. Add the file useragentmiddleware.py under the kiwi package (so it can be referenced as kiwi.useragentmiddleware in settings.py); sample code:

# -*- coding: utf-8 -*-

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # for more user agent strings, see http://www.idiancai.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]






Modify settings.py and add the following setting:

DOWNLOADER_MIDDLEWARES = {
    'kiwi.useragentmiddleware.RotateUserAgentMiddleware': 1,
}


Also disable cookies in settings.py: COOKIES_ENABLED = False.
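Putting the two changes together, the relevant fragment of settings.py looks roughly like this (the number after the middleware class is its priority, i.e. its order in the downloader middleware chain):

# settings.py (relevant fragment)
DOWNLOADER_MIDDLEWARES = {
    'kiwi.useragentmiddleware.RotateUserAgentMiddleware': 1,
}

COOKIES_ENABLED = False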

Run the crawler

Switch to the project root directory and run scrapy crawl kiwi. The console window prints the scraped data; alternatively, run scrapy crawl kiwi -o result.json -t json to save the results to a file.
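Each item exported to result.json has roughly the shape below. The values are placeholders for illustration, not real scraped data; items produced by parse_topic_list() carry only the list-page fields (title, url, time, author, replyCount and an empty content):

{
  "title": "Example post title",
  "url": "https://www.douban.com/group/topic/90895393/",
  "content": "Post body text ...",
  "time": "2016-08-01 10:00:00",
  "author": {"authorName": "someone", "authorUrl": "https://www.douban.com/people/xxx/"},
  "reply": [
    {
      "content": "A reply ...",
      "time": "2016-08-01 11:00:00",
      "author": {"authorName": "someone-else", "authorUrl": "https://www.douban.com/people/yyy/"}
    }
  ]
}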


In the fund net-value example (fund_spider.py, whose code is not shown here), the attribute values of the FundEquity class are exposed through getter/setter functions that check the values when they are set. The __str__(self) method plays the same role as toString() in other languages.
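The original FundEquity code is not included in this excerpt, but a property-based class along the following lines (a sketch with assumed field names date and value, not the article's code) illustrates the getter/setter pattern being described:

class FundEquity(object):
    def __init__(self, date, value):
        self.date = date      # goes through the setters below
        self.value = value

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, v):
        # validate on assignment: the net value must be a positive number
        v = float(v)
        if v <= 0:
            raise ValueError("net value must be positive")
        self._value = v

    @property
    def date(self):
        return self._date

    @date.setter
    def date(self, d):
        self._date = str(d).strip()

    def __str__(self):
        # similar to toString() in other languages
        return "%s %.4f" % (self._date, self._value)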

Run fund_spider.py from the command line and the console window prints the net values.


Summary

As the example shows, only a small amount of code is needed to crawl, parse, and store the posts and replies of a Douban group, which illustrates how concise and efficient Python is.

The example code is fairly simple; most of the effort goes into tuning the XPath rules, and browser plug-in tools speed that up considerably.

More complex topics such as Item Pipelines and (beyond the User-Agent example) Middleware are not covered. The example also ignores the fact that overly frequent requests can get the crawler's IP blocked (which can be worked around by rotating HTTP proxies), and the case where data can only be fetched after logging in (which requires simulating the login in code).

In a real project, the parts that change frequently, such as the XPath rules and regular expressions used to extract content, should not be hard-coded; page fetching, content parsing, and storage of the results should run as independent components in a distributed architecture. In short, a crawler system running in a production environment has many more issues to consider, and there are open-source web crawler systems on GitHub worth studying for reference.
