This article takes about 12 minutes to read.

In this section, we will use some simple Python code to implement the three components of the crawler architecture: the URL manager, the web page downloader, and the web page parser.

As we mentioned in the previous article, the URL manager keeps track of the URLs to be crawled and the URLs that have already been crawled. A smart crawler should of course skip URLs it has already visited, both to avoid fetching the same page repeatedly and to avoid circular crawling: pages that link to each other would otherwise send the crawler into an infinite loop.

The URL manager is designed to solve these problems, so that our crawler is smarter and avoids repeated and circular crawling. These are the responsibilities of the URL manager, and from them we can work out a general idea of how to implement it.

We need two containers, A and B. A stores the URLs to be crawled, and B stores the URLs that have already been crawled. The manager takes URLs from A and hands them to the web page downloader for processing; if A is empty, it waits. New URLs are added to A and queued for crawling. Once a URL has been crawled, it is moved to B. While crawling, any URL that already exists in A or B is skipped. That is the whole flow.

The functions above are the minimum set a URL manager needs to implement; more complex situations of course require more. Let's look at a simple in-memory implementation of the URL manager.

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        # record the URL as crawled
        self.old_urls.add(new_url)
        return new_url

The code above is very simple. We use Python's set as the container for URLs because it de-duplicates automatically and lookups are very fast, which is very convenient. To retrieve a URL to be crawled, we call the pop method, which returns an element and removes it from the set, giving us queue-like behavior.
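A minimal sketch of how this manager might be driven, assuming the UrlManager class above (the seed URL is just a placeholder):

manager = UrlManager()
manager.add_new_url('http://www.baidu.com')  # seed URL (placeholder)

while manager.has_new_url():
    url = manager.get_new_url()
    # hand the URL to the downloader here; any links found on the page
    # would be fed back in with manager.add_new_url(...)
    print('crawling:', url)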

This is our simple implementation of the URL manager. Although it is very basic, it already contains the essential ideas.

The web page downloader is the tool that downloads the web page behind a URL to the local machine. Once we get a URL to crawl from the URL manager, we must download the corresponding page locally before any further data processing can happen, so the downloader is a very important, core component of the crawler architecture.

The way the downloader works is very simple: it downloads the page behind the URL as HTML and stores it either as a local file or as a string in memory. Essentially, you are downloading a static file whose content is made up of HTML tags.

There are many ready-made and powerful libraries to choose from for implementing a web page downloader in Python. urllib is the official Python standard-library module, and Requests is a powerful third-party module. I will use urllib from Python 3 for the demonstration; note that its syntax differs quite a bit from urllib/urllib2 in Python 2.

# encoding: UTF-8
import urllib.request

url = "http://www.baidu.com"
data = urllib.request.urlopen(url).read()
data = data.decode('UTF-8')
print(data)

This is the simplest way to use urllib. We open a URL with the urlopen method and call read to get the HTML string in memory that we just talked about; printing it shows a large string of page markup.

The urlopen function returns an HTTPResponse object, which represents the response to the crawl request. From it we can check the status of the request and other information about the response; for example, getcode() returning 200 means the network request succeeded.

>>> a = urllib.request.urlopen(full_url)
>>> a.geturl()
'http://www.baidu.com/s?word=Jecvay'
>>> type(a)
<class 'http.client.HTTPResponse'>
>>> a.info()
<http.client.HTTPMessage object at 0x03272250>
>>> a.getcode()
200

Of course, real network requests are a bit more complex: sometimes you need to add data to the request or change the request headers. urllib's Request object lets you do both.

import urllib.request

request = urllib.request.Request("http://www.baidu.com")
# add a request header (the User-Agent value here is just an example)
request.add_header('User-Agent', 'Mozilla/5.0')
# add data to the request body
request.data = b"I am a data"
response = urllib.request.urlopen(request)

There are also some more special scenarios: some pages require cookie handling, some must be accessed through a web proxy, some require account and password authentication, and some are served over HTTPS. For these more complex scenarios, urllib provides a powerful tool called a Handler.

Different scenarios use different handlers: for example, HTTPCookieProcessor handles cookies and ProxyHandler handles network proxies. To use one, we build an opener from the Handler with build_opener, then install the opener with install_opener, so that when a request is processed the installed Handler takes care of the special scenario. Take the account and password scenario as an example:

import urllib.request

# build a Handler for HTTP basic authentication
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
# build an opener from the Handler
opener = urllib.request.build_opener(auth_handler)
# install the opener
urllib.request.install_opener(opener)
urllib.request.urlopen('http://www.example.com/login.html')

Here we built a Handler carrying HTTP basic authentication information, added the relevant account and password to it, built an opener from it, and installed the opener. When an address that requires authentication is requested, urllib fills in the credentials we supplied in the Handler.
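The cookie scenario mentioned above works the same way; a minimal sketch, assuming the standard-library http.cookiejar module and HTTPCookieProcessor:

import urllib.request
import http.cookiejar

# a CookieJar to hold cookies across requests
cj = http.cookiejar.CookieJar()
# build an opener with a cookie-aware Handler
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
# cookies set by the server are now stored in cj and sent back automatically
response = urllib.request.urlopen('http://www.baidu.com')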

For the urllib API, refer to the official Python 3 documentation, which is very clear and comes with official code examples. Having read through it, I find the official Python documentation thorough and a pleasure to read. For a beginner, urllib is basically all you need.

After downloading a page locally, we need the web page parser to extract the valuable information from the downloaded file or in-memory string. For a targeted crawler, we need to extract two kinds of data from each page: the value data we actually want, and the list of URLs the page links to, which we feed back into the URL manager. There are several ways to implement a web page parser in Python.

One is to use regular expressions, which is the most intuitive way: we extract the value data we need from the page string with fuzzy pattern matching. Although this method is intuitive, it becomes troublesome when the page is complex. My advice is still to learn regular expressions; it is a versatile technique and well worth understanding.
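A minimal sketch of the regular-expression approach, assuming we only want the href values of links in a page string (the HTML here is just an illustration):

import re

html = "<a href='123.html' class='article_link'>python</a>"
# fuzzy-match every href attribute in the page string
links = re.findall(r"href='(.*?)'", html)
print(links)   # ['123.html']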

I also recommend another query language, XPath. It is an efficient way to locate data, and its syntax is clearer and simpler than regular expressions; once you have a good command of it, it can basically replace regex for page parsing. It is well worth searching for and learning.
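As a quick sketch of what XPath looks like, using the third-party lxml library (the HTML string is again just an illustration):

from lxml import etree

html = "<html><body><a href='123.html' class='article_link'>python</a></body></html>"
tree = etree.HTML(html)
# select the href attribute and the text of every <a> node with class 'article_link'
hrefs = tree.xpath("//a[@class='article_link']/@href")
texts = tree.xpath("//a[@class='article_link']/text()")
print(hrefs, texts)   # ['123.html'] ['python']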

Python can also parse pages with html.parser, lxml, or the third-party library BeautifulSoup. BeautifulSoup is powerful and can use either html.parser or lxml as its underlying parser; it parses HTML in a structured way, building a DOM tree that lets you traverse and access elements level by level.

BeautifulSoup is easy to install; the simplest way is the command pip install beautifulsoup4.

Here I will only briefly introduce how BeautifulSoup is used; for more detail, refer to the official documentation, which friendly Chinese developers have also translated, and it is very good.

The process of using BeautifulSoup is to first create a BeautifulSoup object, passing in the page string and specifying the parser (html.parser or lxml), then search for nodes with find_all or find, and finally read the name, attributes or text of the nodes you found to get the information you want.

For example, suppose we now have a web page string like this:

<a href='123.html' class='article_link'> python </a>

In this string, the name of the node is a, the attributes of the node are href=’123.html’ and class=’article_link’, and the content of the node is Python.

With these three kinds of node information (name, attributes, content), we are ready to start coding.
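For the snippets below, assume html_doc holds a small page string of this kind (a made-up example, not from any real site):

html_doc = """
<html><body>
<a href='123.html' class='article_link'> python </a>
<a href='/view/123.htm'>view page</a>
<div class='abc'>Python</div>
</body></html>
"""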

from bs4 import BeautifulSoup
import re

# create a BeautifulSoup object
soup = BeautifulSoup(html_doc,              # HTML document string
                     'html.parser',         # HTML parser
                     from_encoding='utf8')  # HTML document encoding
# find all nodes with tag a
soup.find_all('a')
# find all a nodes whose href is '/view/123.htm'
soup.find_all('a', href='/view/123.htm')
# find all div nodes with class 'abc' and text 'Python'
soup.find_all('div', class_='abc', string='Python')
# match href with a regular expression
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))

find is used in the same way as find_all, except that find_all returns a list of all matching nodes while find returns only the first match. Note that these methods can use regular expressions for fuzzy matching, which is where they are powerful. Once we have a node, we can easily get its information.

# suppose node is a node we found, such as <a href='1.html'>Python</a>
# get the node's tag name
node.name
# get the node's href attribute
node['href']
# get the node's text
node.get_text()
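Putting the pieces together, a sketch of a page parser that returns both the value data and the list of new URLs mentioned earlier might look like this (the parse_page name and the selection rules are my own assumptions, not a fixed interface):

import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parse_page(page_url, html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    # collect new URLs to hand back to the URL manager
    new_urls = set()
    for link in soup.find_all('a', href=re.compile(r'.+')):
        new_urls.add(urljoin(page_url, link['href']))
    # collect the value data we care about (here: the page title, as an example)
    data = {'url': page_url,
            'title': soup.title.get_text() if soup.title else None}
    return new_urls, data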

Recommended reading

Getting started with A Python crawler


Interpret life with passion, highlight personality with code