One of the first problems in machine learning is preparing the data. Data can be collected by a company itself, purchased or exchanged, taken from publicly available datasets published by government agencies and businesses, or scraped from the web by crawlers. This article describes how to write a crawler to collect public data from the web.

Crawlers can be written in many languages, but with varying degrees of difficulty. Python, an interpreted "glue" language, is easy to pick up, ships with a complete standard library, and has a wide range of open-source libraries ("Life is short, you need Python!"). It is widely used in web development, scientific computing, data mining/analysis, artificial intelligence, and many other fields.

The environment used here is Python 3.5.2; install Scrapy with pip3 install scrapy.

Create a project

First create a Scrapy project named kiwi with the command scrapy startproject kiwi, which generates the project folders and file templates.
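For reference, the generated template looks roughly like this (the exact files vary slightly between Scrapy versions):

kiwi/
    scrapy.cfg            # deployment configuration
    kiwi/
        __init__.py
        items.py          # data structure definitions (used below)
        pipelines.py      # item pipelines (not used in this article)
        settings.py       # project settings
        spiders/
            __init__.py   # spider modules go in this package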

Defining data structures

settings.py holds the project settings, and items.py (under the kiwi package) is where the parsed data is stored. Define the data structures in that file; example code:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://www.smpeizi.com/en/latest/topics/items.html

import scrapy


class AuthorInfo(scrapy.Item):
    authorName = scrapy.Field()  # The author's nickname
    authorUrl = scrapy.Field()   # The author's URL


class ReplyItem(scrapy.Item):
    content = scrapy.Field()  # Reply content
    time = scrapy.Field()     # Publication time
    author = scrapy.Field()   # Replier (AuthorInfo)


class TopicItem(scrapy.Item):
    title = scrapy.Field()       # Post title
    url = scrapy.Field()         # Post page URL
    content = scrapy.Field()     # Post content
    time = scrapy.Field()        # Publication time
    author = scrapy.Field()      # Poster (AuthorInfo)
    reply = scrapy.Field()       # List of ReplyItem
    replyCount = scrapy.Field()  # Number of replies



AuthorInfo and the list of ReplyItem are nested inside TopicItem above, but every field is still declared as scrapy.Field(). Note that all three classes must inherit from scrapy.Item.
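As a quick illustration of this point, here is a minimal sketch (with made-up values, not taken from Douban) of how the nested fields end up being assigned:

author = AuthorInfo(authorName='someone', authorUrl='https://www.douban.com/people/xxx/')

topic = TopicItem()
topic['author'] = dict(author)    # nested item wrapped with dict(), as noted above
topic['reply'] = [dict(ReplyItem(content='a reply', time='2016-08-01 10:00',
                                 author=dict(author)))]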

Creating a crawler spider

The spider code lives in kiwi_spider.py under the project's spiders directory; write the crawler logic in this file. The example below crawls the posts and replies of a Douban group.

# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from kiwi.items import TopicItem, AuthorInfo, ReplyItem


class KiwiSpider(CrawlSpider):
    name = 'kiwi'
    allowed_domains = ['douban.com']

    anchorTitleXPath = 'a/text()'
    anchorHrefXPath = 'a/@href'

    start_urls = [
        'https://www.pzzs168.com/group/topic/90895393/?start=0',
    ]
    rules = (
        Rule(
            LinkExtractor(allow=(r'/group/[^/]+/discussion\?start=\d+')),  # Topic list pages
            callback='parse_topic_list',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(r'/group/topic/\d+/$')),  # Post content page
            callback='parse_topic_content',
            follow=True
        ),
        Rule(
            LinkExtractor(allow=(r'/group/topic/\d+/\?start=\d+')),  # Post content paging
            callback='parse_topic_content',
            follow=True
        ),
    )

    # Post detail page
    def parse_topic_content(self, response):
        # XPath for the title
        titleXPath = '//html/head/title/text()'
        # XPath for the post content
        contentXPath = '//div[@class="topic-content"]/p/text()'
        # XPath for the post time
        timeXPath = '//div[@class="topic-doc"]/h3/span[@class="color-green"]/text()'
        # XPath for the poster
        authorXPath = '//div[@class="topic-doc"]/h3/span[@class="from"]'

        item = TopicItem()
        # Current page URL
        item['url'] = response.url
        # Title
        titleFragment = Selector(response).xpath(titleXPath)
        item['title'] = str(titleFragment.extract()[0]).strip()

        # Post content
        contentFragment = Selector(response).xpath(contentXPath)
        strs = [line.extract().strip() for line in contentFragment]
        item['content'] = '\n'.join(strs)
        # Post time
        timeFragment = Selector(response).xpath(timeXPath)
        if timeFragment:
            item['time'] = timeFragment[0].extract()

        # Poster information
        authorInfo = AuthorInfo()
        authorFragment = Selector(response).xpath(authorXPath)
        if authorFragment:
            authorInfo['authorName'] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]
            authorInfo['authorUrl'] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

        item['author'] = dict(authorInfo)

        # XPath for the reply list
        replyRootXPath = r'//div[@class="reply-doc content"]'
        # XPath for the reply time
        replyTimeXPath = r'div[@class="bg-img-green"]/h4/span[@class="pubtime"]/text()'
        # XPath for the replier
        replyAuthorXPath = r'div[@class="bg-img-green"]/h4'

        replies = []
        itemsFragment = Selector(response).xpath(replyRootXPath)
        for replyItemXPath in itemsFragment:
            replyItem = ReplyItem()
            # Reply content
            contents = replyItemXPath.xpath('p/text()')
            strs = [line.extract().strip() for line in contents]
            replyItem['content'] = '\n'.join(strs)
            # Reply time
            timeFragment = replyItemXPath.xpath(replyTimeXPath)
            if timeFragment:
                replyItem['time'] = timeFragment[0].extract()
            # Replier
            replyAuthorInfo = AuthorInfo()
            authorFragment = replyItemXPath.xpath(replyAuthorXPath)
            if authorFragment:
                replyAuthorInfo['authorName'] = authorFragment[0].xpath(self.anchorTitleXPath).extract()[0]
                replyAuthorInfo['authorUrl'] = authorFragment[0].xpath(self.anchorHrefXPath).extract()[0]

            replyItem['author'] = dict(replyAuthorInfo)
            # Add to the reply list
            replies.append(dict(replyItem))

        item['reply'] = replies
        yield item

    # Post list page
    def parse_topic_list(self, response):
        # XPath for the post list (skip the header row)
        topicRootXPath = r'//table[@class="olt"]/tr[position()>1]'
        # XPath for a single post entry
        titleXPath = r'td[@class="title"]'
        # XPath for the poster
        authorXPath = r'td[2]'
        # XPath for the reply count
        replyCountXPath = r'td[3]/text()'
        # XPath for the post time
        timeXPath = r'td[@class="time"]/text()'

        topicsPath = Selector(response).xpath(topicRootXPath)
        for topicItemPath in topicsPath:
            item = TopicItem()
            titlePath = topicItemPath.xpath(titleXPath)
            item['title'] = titlePath.xpath(self.anchorTitleXPath).extract()[0]
            item['url'] = titlePath.xpath(self.anchorHrefXPath).extract()[0]
            # Post time
            timePath = topicItemPath.xpath(timeXPath)
            if timePath:
                item['time'] = timePath[0].extract()
            # Poster
            authorPath = topicItemPath.xpath(authorXPath)
            authInfo = AuthorInfo()
            authInfo['authorName'] = authorPath[0].xpath(self.anchorTitleXPath).extract()[0]
            authInfo['authorUrl'] = authorPath[0].xpath(self.anchorHrefXPath).extract()[0]
            item['author'] = dict(authInfo)
            # Reply count
            replyCountPath = topicItemPath.xpath(replyCountXPath)
            item['replyCount'] = replyCountPath[0].extract()

            item['content'] = ''
            yield item

    parse_start_url = parse_topic_content






Pay special attention to the following points:

1. KiwiSpider inherits from CrawlSpider so that it can follow links and crawl pages according to the rules.

2. parse_start_url is the callback CrawlSpider uses for the responses of start_urls; you can point it at one of your own parsing functions, as the last line of the code above does.

3. start_urls holds the entry URLs; you can list more than one.

4. rules defines which of the URLs found on a crawled page should be followed and which callback handles each of them; the patterns are written as regular expressions. The example above keeps following the topic list pages and the post detail pages (including their pagination).

5. Notice the dict() wrapping in the code. Although the author field defined in items.py is meant to hold AuthorInfo data, assigning the item object directly (item['author'] = authInfo) raises an error, so the value has to be wrapped as dict(authInfo).

When extracting content, use XPath to pull out what you need; see an XPath tutorial for the details of the syntax. During development you can use browser tools to work out the XPath expressions, for example the FireBug and FirePath add-ons in Firefox. For the page https://www.aiidol.com/group/python/discussion?start=0, the XPath rule //td[@class="title"] selects the list of post titles.

Typing an XPath rule into such a tool lets you check whether it matches what you expect. Newer versions of Firefox can install the Try XPath add-on to inspect XPath, and Chrome can install the XPath Helper extension.
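If you prefer to check an XPath rule without a browser plug-in, you can also test it directly against an HTML fragment with Scrapy's Selector. The snippet below is a sketch; the HTML is a made-up fragment shaped like the Douban topic list table, not a real page:

from scrapy.selector import Selector

html = '''
<table class="olt">
  <tr><td class="title"><a href="https://www.douban.com/group/topic/1/">First post</a></td></tr>
  <tr><td class="title"><a href="https://www.douban.com/group/topic/2/">Second post</a></td></tr>
</table>
'''

# The same rule used in the spider's parse_topic_list()
titles = Selector(text=html).xpath('//td[@class="title"]/a/text()').extract()
print(titles)  # ['First post', 'Second post']

Another option is scrapy shell <url>, which opens an interactive session where response.xpath(...) can be tried against the live page.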

Use a random UserAgent

To make the crawler's requests look more like normal browser traffic, you can write a downloader middleware that supplies a random User-Agent. Add the file useragentmiddleware.py under the kiwi package (so it can be referenced as kiwi.useragentmiddleware in settings.py); sample code:

# -*- coding: utf-8 -*-

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # for more user agent strings, see http://www.idiancai.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]






Modify settings.py and add the following setting:

DOWNLOADER_MIDDLEWARES = {
    'kiwi.useragentmiddleware.RotateUserAgentMiddleware': 1,
}


Also disable cookies in settings.py: COOKIES_ENABLED = False.
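Putting the two changes together, the relevant fragment of settings.py looks roughly like this (the number after the middleware class is its priority, i.e. its order in the downloader middleware chain):

# settings.py (relevant fragment)
DOWNLOADER_MIDDLEWARES = {
    'kiwi.useragentmiddleware.RotateUserAgentMiddleware': 1,
}

COOKIES_ENABLED = False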

Run the crawler

Switch to the project root directory and run scrapy crawl kiwi. The console window prints the scraped data; alternatively, run scrapy crawl kiwi -o result.json -t json to save the results to a file.
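Each item exported to result.json has roughly the shape below. The values are placeholders for illustration, not real scraped data; items produced by parse_topic_list() carry only the list-page fields (title, url, time, author, replyCount and an empty content):

{
  "title": "Example post title",
  "url": "https://www.douban.com/group/topic/90895393/",
  "content": "Post body text ...",
  "time": "2016-08-01 10:00:00",
  "author": {"authorName": "someone", "authorUrl": "https://www.douban.com/people/xxx/"},
  "reply": [
    {
      "content": "A reply ...",
      "time": "2016-08-01 11:00:00",
      "author": {"authorName": "someone-else", "authorUrl": "https://www.douban.com/people/yyy/"}
    }
  ]
}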


In the fund net-value example (fund_spider.py, whose code is not shown here), the attribute values of the FundEquity class are exposed through getter/setter functions that check the values when they are set. The __str__(self) method plays the same role as toString() in other languages.
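The original FundEquity code is not included in this excerpt, but a property-based class along the following lines (a sketch with assumed field names date and value, not the article's code) illustrates the getter/setter pattern being described:

class FundEquity(object):
    def __init__(self, date, value):
        self.date = date      # goes through the setters below
        self.value = value

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, v):
        # validate on assignment: the net value must be a positive number
        v = float(v)
        if v <= 0:
            raise ValueError("net value must be positive")
        self._value = v

    @property
    def date(self):
        return self._date

    @date.setter
    def date(self, d):
        self._date = str(d).strip()

    def __str__(self):
        # similar to toString() in other languages
        return "%s %.4f" % (self._date, self._value)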

Run fund_spider.py from the command line and the console window prints the net values.


Summary

As the example shows, only a small amount of code is needed to crawl, parse, and store the posts and replies of a Douban group, which illustrates how concise and efficient Python is.

The example code is fairly simple; most of the effort goes into tuning the XPath rules, and browser plug-in tools speed that up considerably.

More complex topics such as Item Pipelines and (beyond the User-Agent example) Middleware are not covered. The example also ignores the fact that overly frequent requests can get the crawler's IP blocked (which can be worked around by rotating HTTP proxies), and the case where data can only be fetched after logging in (which requires simulating the login in code).

In a real project, the parts that change frequently, such as the XPath rules and regular expressions used to extract content, should not be hard-coded; page fetching, content parsing, and storage of the results should run as independent components in a distributed architecture. In short, a crawler system running in a production environment has many more issues to consider, and there are open-source web crawler systems on GitHub worth studying for reference.
