Now that we have regular expressions, we can filter the source code we crawl. Below we will try to crawl the Neihan Duanzi (implicit jokes) site: www.neihan8.com/articl… Open it and you will see plenty of jokes. As you turn the pages, pay attention to how the URL changes:

Page 1: http://www.neihan8.com/article/list_5_1.html
Page 2: http://www.neihan8.com/article/list_5_2.html
Page 3: http://www.neihan8.com/article/list_5_3.html
Page 4: http://www.neihan8.com/article/list_5_4.html

Here we can see the URL pattern: to crawl every page of jokes, we only need to change one parameter, the page number (illustrated in the small sketch below). Let's crawl all the jokes step by step.
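A minimal sketch of how the page number plugs into the URL, assuming the list_5_ pattern holds for every page; the real request code comes in Step 1:

```python
# Only the page number at the end of the URL changes from page to page.
for page in range(1, 5):
    url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
    print url
```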

Step 1: Get the data

1. Following our previous approach, we first need to write a method that loads a page.

  • Here we define a unified class that handles the URL request as a member method

  • We create a file called duanzi_spider.py

  • Then define a Spider class and add a member method that loads the page

```python
import urllib2

class Spider:
    """Neihan Duanzi spider class"""

    def loadPage(self, page):
        """
        @brief Define a method that requests a page by URL
        @param page The page number to request
        @returns The HTML of the returned page
        """
        url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
        # User-Agent header
        user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
        headers = {'User-Agent': user_agent}

        req = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(req)
        html = response.read()
        print html
        # return html
```

To define a member method of a Python class, you must add an extra first parameter, self.

  • So page in loadPage(self, page) is the page number we want to request
  • Finally, we print the html to the screen
  • Then we write a main function to test the loadPage method

Write the main function to test the loadPage method:

```python
if __name__ == '__main__':
    """
    ======================================
        Neihan Duanzi crawler
    ======================================
    """
    print 'Please press Enter to begin'
    raw_input()

    # Define a Spider object
    mySpider = Spider()
    mySpider.loadPage(1)
```

If the program works correctly, the entire HTML of the first joke page is printed to the screen. However, we find that the Chinese parts of the HTML may be displayed as garbled characters.

So we need to do some simple processing on the page source we just fetched.

```python
def loadPage(self, page):
    """
    @brief Define a method that requests a page by URL
    @param page The page number to request
    @returns The HTML of the returned page
    """
    url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
    # User-Agent header
    user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
    headers = {'User-Agent': user_agent}

    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read()

    # The site is GBK-encoded, so convert the page to UTF-8
    gbk_html = html.decode('gbk').encode('utf-8')

    # print gbk_html
    return gbk_html
```

Note: every website uses its own encoding for Chinese text, so html.decode('gbk') is not a universal recipe; it has to be adapted to the encoding of the site being crawled.
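As a small illustration (not part of the original code), if you do not know a site's encoding in advance, one option is to try a few common Chinese encodings in turn; the candidate list below is an assumption:

```python
def decode_html(html, encodings=('utf-8', 'gbk', 'gb2312')):
    """Try a few common encodings and return the page as UTF-8 text."""
    for enc in encodings:
        try:
            return html.decode(enc).encode('utf-8')
        except UnicodeDecodeError:
            continue
    # Fall back to replacing any bytes that cannot be decoded
    return html.decode('utf-8', 'replace').encode('utf-8')
```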

If we execute duanzi_spider.py again, the Chinese text is now displayed correctly instead of as garbled characters.

Step 2: Filter the data

Now we have the data for the entire page. However, there are a lot of things we don’t care about, so next we need to filter. To filter, use the regular expressions described in the previous section.

First, import re, and then look for matches in the gbk_html we obtained.

We need a matching rule: open a page of jokes in the browser, right-click and choose "View source", and you will find that each joke we want sits inside a <div> tag, and that each of those divs has the attribute class="f18 mb20".

So we just need to match everything on the page from

```html
<div class="f18 mb20">
```

to

```html
</div>
```

and the data in between is what we want.

From this, we can work out the following regular expression:

```python
<div.*?class="f18 mb20">(.*?)</div>
```

This expression matches the content of every div whose class is "f18 mb20". Applying this regular expression to our code, we get the following:

```python
def loadPage(self, page):
    """
    @brief Define a method that requests a page by URL
    @param page The page number to request
    @returns A list of the jokes extracted from the page
    """
    url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
    # User-Agent header
    user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
    headers = {'User-Agent': user_agent}

    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read()
    gbk_html = html.decode('gbk').encode('utf-8')

    # Find every piece of content between <div class="f18 mb20"> and </div>.
    # re.S lets "." match newlines too, so jokes spanning several lines are matched as a whole.
    pattern = re.compile(r'<div.*?class="f18 mb20">(.*?)</div>', re.S)
    item_list = pattern.findall(gbk_html)

    return item_list

def printOnePage(self, item_list, page):
    """
    @brief Print the jokes of one page
    @param item_list The list of jokes
    @param page The page number
    """
    print "******* Page %d complete... *******" % page
    for item in item_list:
        print "================"
        print item
```

One thing to note here is re.S, a flag passed to the regular-expression match. Without re.S, the string is matched line by line: if one line produces no match, the next line is tried from scratch, because "." does not match newline characters. With re.S, the whole string is matched as one piece, so "." also matches newlines. findall() then packs all matched results into a list. We also wrote a method printOnePage() that iterates over item_list and prints each item. OK, let's run it again.
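A small standalone sketch (with made-up sample HTML) showing the difference the re.S flag makes when the content spans several lines:

```python
# -*- coding: utf-8 -*-
import re

sample = '<div class="f18 mb20">line one\nline two</div>'

without_s = re.findall(r'<div.*?class="f18 mb20">(.*?)</div>', sample)
with_s = re.findall(r'<div.*?class="f18 mb20">(.*?)</div>', sample, re.S)

print without_s   # [] -- "." does not cross the newline, so nothing matches
print with_s      # ['line one\nline two'] -- re.S lets "." match newlines too
```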

```
Power@PowerMac ~$ python duanzi_spider.py
```

All the jokes on the first page, and nothing else, are printed out. But you will notice that the paragraphs contain a lot of <p> and <br /> tags, which is unpleasant to read. These are actually HTML paragraph and line-break tags: they are invisible in a browser, but they do show up when the text is printed directly.

So we just need to strip out the content we don't want. printOnePage() can be modified simply, as follows:

```python
def printOnePage(self, item_list, page):
    """
    @brief Print the jokes of one page
    @param item_list The list of jokes
    @param page The page number
    """
    print "******* Page %d complete... *******" % page
    for item in item_list:
        print "================"
        # Strip the HTML paragraph and line-break tags
        item = item.replace("<p>", "").replace("</p>", "").replace("<br />", "")
        print item
```
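If more tags show up later, chaining replace() calls becomes tedious. A possible alternative (not part of the original code) is to strip every HTML tag with a regular expression:

```python
import re

def strip_tags(text):
    # Remove every HTML tag, e.g. <p>, </p>, <br />, leaving only the text.
    return re.sub(r'<[^>]+>', '', text)

# inside the loop: item = strip_tags(item)
```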

The next step is saving and displaying the data.
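As a rough sketch of that next step (the method name and output file name here are assumptions, not from the original code), one way to save each page of jokes to a local file looks like this:

```python
def writeToFile(self, item_list, page):
    """
    @brief Append the jokes of one page to a local file
    @param item_list The list of jokes
    @param page The page number
    """
    # "duanzi.txt" is an assumed output file name
    with open("duanzi.txt", "a") as f:
        for item in item_list:
            item = item.replace("<p>", "").replace("</p>", "").replace("<br />", "")
            f.write(item + "\n")
    print "******* Page %d saved *******" % page
```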