A crawler's main job is to filter out the useless information on a web page and capture the useful information.


The general crawler architecture is:






Before writing a Python crawler, you need a basic understanding of how web pages are structured, such as the HTML tags on a page and the language the page is written in.

You will also need a few tools before you start crawling:


1. First of all, a Python development environment: here I choose Python 2.7. For ease of installation and debugging, I develop in Visual Studio using the Python plug-in for VS2013 (debugging a Python program there feels much like debugging a C program);


2. A tool for viewing web page source: although any browser can show a page's source, I recommend Firefox with Firebug (both are must-have tools for web developers);


The Firebug plug-in can be installed from the browser's add-ons menu;

Next, try looking at the source code of a web page. Here I use the basketball data we want to crawl as an example:


For example, if I want to crawl the contents of the Team Comparison table in the web page:






Right-click on the 32-49 score that I want to crawl and select "Inspect Element with Firebug". (Another advantage of Firebug is that it displays the source code alongside the page.) A panel pops up at the bottom of the page showing the page source, with the location and source of the 32-49 score as shown below:






You can see that the source code for 32-49 is as follows:



<td class="sdi-datacell" align="center">32-49</td>


Here td is the tag name, class is the class name, align is the alignment, and 32-49 is the tag's content, which is exactly the content we want to crawl.

But there are many similar tags with the same class name on the same page, and those two elements alone are not enough to pull out the data we need. So we look at the tag's parent tag, or the tag one level further up, to gather more features of the data we want and to filter out the data we do not want. For example, we choose the tag that encloses this table as our second filtering feature (see the sketch after the snippet below):



<div class="sdi-so">
<h3>Team Comparison</h3>
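
To see why this second feature matters, here is a minimal sketch of scoping the search by that parent tag. Reading the page from a locally saved copy named g5_preview_12.html is only an illustrative assumption; the full crawl over the live URL comes later.

from bs4 import BeautifulSoup

# Sketch: parse a locally saved copy of the page (file name is just an example)
html = open('g5_preview_12.html').read()
soup = BeautifulSoup(html, 'html.parser')

# First narrow the search to the parent <div class="sdi-so"> block,
# then collect only the <td class="sdi-datacell"> cells inside it.
box = soup.find('div', class_='sdi-so')
if box is not None:
    for cell in box.find_all('td', class_='sdi-datacell'):
        print cell.text        # e.g. 32-49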


Let’s analyze the URL of the web page:

If the URL of the page we want to crawl is:



http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/nba/matchups/g5_preview_12.html


Because I have some experience building websites, I can make a guess here: www.covers.com is the domain name, and /pageLoader/pageLoader.aspx?page=/data/nba/matchups/g5_preview_12.html is probably the address of this page under /pageLoader/pageLoader.aspx?page=/data/nba/matchups/ in the site's root directory on the server.

For ease of management, pages of the same type are usually placed in the same folder and named in a similar way. For example, this page is named g5_preview_12.html, so similar pages will differ only in the 5 of g5 or the 12 of _12; by changing those two numbers, we found that similar pages can indeed be reached.
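
Based on that naming pattern, similar pages could be generated by substituting the two numbers, roughly as in the sketch below. The ranges used are made up for illustration; which pages actually exist is decided by the site.

# Sketch: build candidate matchup URLs by varying the two numbers in the file name.
base = 'http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/nba/matchups/g%d_preview_%d.html'

urls = []
for g in range(1, 8):            # the number after "g" (example range only)
    for n in range(10, 20):      # the number after "_preview_" (example range only)
        urls.append(base % (g, n))

print urls[0]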

Now let's learn more about crawlers:

The Python crawler mainly uses two libraries:

urllib2

BeautifulSoup

Detailed documentation for the BeautifulSoup library can be found at the following website:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

When crawling a web page:

You open the page, call BeautifulSoup to parse it, use .find (or .find_all) to locate the tag with the features you identified, and then use .text to retrieve the content of the tag you want to crawl.

For example, consider the following code:



response = urllib2.urlopen(url)
print response.getcode()                  # check whether the page was opened
soup = BeautifulSoup(
    response,
    'html.parser',
    from_encoding='utf-8'
)
links2 = soup.find_all('div', class_="sdi-so", limit=2)
cishu = 0
for i in links2:
    if(cishu == 1):                       # the second sdi-so block is the Team Comparison table
        two = i.find_all('td', class_="sdi-datacell")
        for q in two:
            print q.text
            table.write(row, col, q.text)
            col = (col + 1) % 9           # 9 columns per row in the Excel sheet
            if(col == 0):
                row = row + 1
        row = row + 1
        file.save('NBA.xls')
    cishu = cishu + 1


urllib2.urlopen(url) opens the web page;

print response.getcode() tests whether the page was opened successfully;
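
In practice that check can also guard against pages that fail to load; a small sketch of one way to do it (the error handling shown here is an assumption, not part of the original script):

import urllib2

try:
    response = urllib2.urlopen(url)          # url is the page address built above
except urllib2.URLError as e:
    print 'failed to open page:', e
else:
    if response.getcode() == 200:            # 200 means the page was opened successfully
        html = response.read()
    else:
        print 'unexpected status code:', response.getcode()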

soup = BeautifulSoup(
    response,
    'html.parser',
    from_encoding='utf-8'
)

uses BeautifulSoup to parse the web page;

links2 = soup.find_all('div', class_="sdi-so", limit=2) queries for tags that match the features 'div', class_="sdi-so" and returns at most limit=2 matches;



for i in links2:
    if(cishu == 1):
        two = i.find_all('td', class_="sdi-datacell")
        for q in two:
            print q.text
            table.write(row, col, q.text)
            col = (col + 1) % 9
            if(col == 0):
                row = row + 1
        row = row + 1


finds the 'div', class_="sdi-so" blocks and then the corresponding 'td', class_="sdi-datacell" tags inside them;

q.text returns the data we want;

The row=row+1 statements (together with col=(col+1)%9) handle the layout when we write the data to the Excel file;
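
Because find_all returns a list, the cishu counter is just one way of picking the second sdi-so block; indexing the list directly would work as well. A small equivalent sketch, assuming the page really contains two such blocks:

# Equivalent to the cishu==1 branch: take the second matched <div> directly.
links2 = soup.find_all('div', class_='sdi-so', limit=2)
if len(links2) > 1:
    for q in links2[1].find_all('td', class_='sdi-datacell'):
        print q.text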

Next comes saving the captured data:

Here we save the data to an Excel file, using these packages:

xlwt, xdrlib, sys

The functions used:

file = xlwt.Workbook()
table = file.add_sheet('shuju', cell_overwrite_ok=True)
table.write(0, 0, 'team')
table.write(0, 1, 'W/L')
table.write(row, col, q.text)
file.save('NBA.xls')

These are the basic Excel-writing functions, which I won't go into here;
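
Putting those calls together, a minimal self-contained sketch of the xlwt side (the sheet name 'shuju' and the two header cells follow the snippet above; the sample row values are only illustrative):

# -*- coding: utf-8 -*-
import xlwt

file = xlwt.Workbook()                                   # create a new workbook
table = file.add_sheet('shuju', cell_overwrite_ok=True)  # one sheet, overwriting allowed

table.write(0, 0, 'team')      # header row
table.write(0, 1, 'W/L')

row, col = 1, 0
for value in ['Thunder', '32-49']:   # example cell values
    table.write(row, col, value)
    col = (col + 1) % 9              # wrap to a new row after 9 columns
    if col == 0:
        row = row + 1

file.save('NBA.xls')                 # write the workbook to disk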

Finally, this is the format of the data we crawled and saved:

