WeChat official account: Python Data Science

Zhihu: zhuanlan.zhihu.com/py…

Hi, everyone. I believe that those of you who clicked here are very interested in crawlers, and so is the blogger. The blogger was attracted to crawlers when first getting to know them, because they felt SO COOL. Don't you feel a sense of accomplishment when you type a few lines of code and watch a stream of data scroll across the screen? Even better, crawler techniques apply to many real-life scenarios: voting automatically, batch-downloading articles, novels, and videos you are interested in, building a WeChat bot, or crawling important data for analysis. The code really feels like it is written for yourself and works for you, and it can help others too. So, life is short, I choose the crawler.

To be honest, the blogger is also a nine-to-five office worker who learns crawlers in spare time, but it was enthusiasm for crawlers that started this journey; as the saying goes, interest is the best teacher. The blogger is still a beginner, and the original intention of opening this official account is to share some experience and skills gained while learning crawlers. Of course, there are also all kinds of crawler tutorials online for your reference, and later on the blogger will share some of the resources used when starting out. All right, let's get down to business.

1. What is a crawler?

First of all, we should understand what a crawler is and why we crawl. The blogger searched Baidu, and the explanation is as follows:

A web crawler (also known as a web spider, a web bot, and more commonly as a web chaser in the FOAF community) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other less commonly used names include ant, autoindex, simulator or worm.

Basically, a crawler can mimic the behavior of a browser and do what you want: customize what you search for and download, and automate the whole process. For example, a browser can download a novel, but sometimes it cannot download in bulk, and that is where crawlers come in handy.

There are many programming languages that can implement crawler technology: Java, Python, C++ and so on can all be used. But the blogger chose Python, and I'm sure many people do, because Python really suits crawlers: its rich third-party libraries are powerful, you can do what you want with a few lines of code, and most importantly, Python is also great for data mining and analysis. It feels great to be able to both crawl and analyze data in Python.

2. Crawler learning route

Now that you know what a crawler is, let me tell you about the basic learning route the blogger has summarized. It is only for reference, because everyone has their own method; this is just to provide some ideas.

The general steps for learning a Python crawler are as follows:

  • Learn basic Python syntax first
  • Learn the important built-in libraries commonly used for Python crawlers, such as urllib and http, which are used to download web pages
  • Learn web page parsing tools such as regular expressions (re), BeautifulSoup (bs4), and XPath (lxml)
  • Start crawling some simple sites (the blogger started from Baidu, haha) to get familiar with the crawling process
  • Understand some anti-crawler mechanisms: headers, robots.txt, time intervals, proxy IPs, hidden fields, etc.
  • Learn to crawl some special websites, solving problems such as login, cookies, and dynamic web pages
  • Understand how crawlers work together with databases and how to store crawled data
  • Learn to apply Python multi-threading and multi-processing to crawling to improve crawler efficiency
  • Learn crawler frameworks such as Scrapy, PySpider, etc.
  • Learn distributed crawlers (for large-scale data requirements)

The above is a general overview of the learning path. There is a lot of content the blogger still needs to keep learning, and the details of each step mentioned will be shared gradually in later posts with practical examples. Of course, there will also be some interesting content about crawlers along the way.

3. Start with the first crawler

I think the first crawler should start from urllib. When the blogger began to learn, it was the urllib library that was used to write a few lines of code and achieve a simple data-crawling function, and I think most of you came the same way. I was like, wow, that's amazing, you can do a seemingly complicated task in just a few lines of code. How is it done in a few lines of code, and how do you do more sophisticated crawls? With these questions in mind, I began the study of urllib.

First of all, I have to mention the data-crawling process, so that we understand exactly what it is; learning urllib will then be easier.

Crawler process

In fact, the crawler process is the same as the process of a browser browsing a web page. We should all understand the idea: after we type in a URL and click search, the request first goes through a DNS server, which resolves the domain name in the URL and finds the real server. Then we send a GET or POST request to that server over the HTTP protocol. If the request succeeds, we get the web page we want to see, generally built with front-end technologies such as HTML, CSS, and JS; if the request fails, the server returns a failure status code, commonly 503, 403, etc.

The crawler process is the same: make a request to the server to get the HTML page, then parse the downloaded page to get the content we want. Of course, this is just an overview of the crawler process; there are a lot of details to deal with, which will be shared later.
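As a minimal sketch of these two steps (request, then parse), using only the standard library and the python.org page that appears later in this post; extracting the title with plain string methods is just a stand-in for a real parser such as BeautifulSoup or lxml:

import urllib.request

# Step 1: request the page from the server and download the HTML
response = urllib.request.urlopen('http://python.org/')
html = response.read().decode('utf-8')

# Step 2: parse the downloaded page to get the content we want
# (here, just the text of the <title> tag)
start = html.find('<title>') + len('<title>')
end = html.find('</title>')
print(html[start:end])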

After understanding the basic crawler process, we can start our real crawler journey.

The urllib library

Python has a built-in urllib library, which is an important part of the crawling process. Using this built-in library, we can send requests to the server and obtain web pages, so it is also the first step in learning crawlers.

The blogger uses Python 3.x, where the urllib library structure is a bit different from that of Python 2.x. The urllib2 and urllib libraries used in Python 2.x have been merged into a single urllib library in Python 3.x.
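For reference, a rough mapping of the common Python 2.x names to their Python 3.x equivalents (shown only as orientation, not an exhaustive list):

import urllib.request, urllib.parse, urllib.error

# Python 2.x                       ->  Python 3.x
# urllib2.urlopen(url)             ->  urllib.request.urlopen(url)
# urllib2.Request(url)             ->  urllib.request.Request(url)
# urllib.urlencode(data)           ->  urllib.parse.urlencode(data)
# urllib2.URLError / HTTPError     ->  urllib.error.URLError / urllib.error.HTTPError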

First, let's take a look at what the urllib library in Python 3.x contains.

The blogger's IDE is PyCharm, which is very convenient for editing and debugging. Enter the following code in the console:

>>> import urllib
>>> dir(urllib)

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'error', 'parse', 'request', 'response']

As you can see, in addition to the built-in attributes that start and end with a double underscore, urllib has four important submodules: error, parse, request, and response.

The docstrings of these submodules in Python's urllib library briefly describe them as follows:

  • error: “Exception classes raised by urllib.”
  • parse: “Parse (absolute and relative) URLs.”
  • request: “An extensible library for opening URLs using a variety of protocols.”
  • response: “Response classes used by urllib.”

Among these four submodules, the most important one is request, which completes most of the work of a crawler. Let's take a look at how request is used.

The use of request

The simplest operation in request is the urlopen method, which is used as follows:

import urllib.request
response = urllib.request.urlopen('http://python.org/')
result = response.read()
print(result)

The running results are as follows:

b'<!doctype html>\n<!--[if lt IE 7]>... </body>\n</html>\n'

The result looks garbled!! Don't worry, this is just an encoding problem. We only need to decode the content read from the requested file-like object.

Modify the code as follows:

import urllib.request
response = urllib.request.urlopen('http://python.org/')
result = response.read().decode('utf-8')
print(result)    

The running results are as follows:

<!doctype html> <!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8">... <!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9">... <!--[if IE 8]> <html class="no-js ie8 lt-ie9"> <![endif]--> <!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr"> <head> <meta charset="utf-8"> ...

This is the HTML page we want. How about that? Simple.

Let's take a look at the urlopen method and the parameters it accepts.

The urlopen method

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None,
            cadefault=False, context=None):

urlopen is one of the methods in request. It opens a URL, which can be either a string (as in the example above) or a Request object (described later).

  • url: the URL we pass in (for example: www.xxxx.com/);
  • data: the additional information we send to the server with the request (for example, the user information we need to fill in to log in to a web page). If a data parameter is supplied, the request is a POST request; if there is no data parameter, it is a GET request.

    • In general, the data parameter is only meaningful for requests made over the HTTP protocol
    • The data argument must be supplied as a bytes object
    • The data argument should use the standard form-encoded structure, which requires urllib.parse.urlencode() to convert it; how to use data is shown in the sketch after this list and in later posts
  • timeout: an optional timeout in seconds, used to prevent a request from taking too long. If this parameter is not specified, the default timeout is used;
  • cafile: points to a single file containing a bundle of CA certificates (rarely used; the default is fine);
  • capath: points to a directory of certificates, also used for CA authentication (rarely used; the default is fine);
  • cadefault: can be ignored;
  • context: used to set up SSL-encrypted transmission (rarely used; the default is fine).
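As a minimal sketch of how data and timeout might be used together (the form fields and the httpbin.org test URL below are made up for illustration, not from the original post):

import urllib.parse
import urllib.request

# urlencode the form fields and convert them to bytes;
# passing data makes urlopen send a POST request instead of a GET
data = urllib.parse.urlencode({'user': 'test', 'password': '123456'}).encode('utf-8')

# timeout is in seconds; the request is aborted if the server takes too long
response = urllib.request.urlopen('http://httpbin.org/post', data=data, timeout=10)
print(response.read().decode('utf-8'))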

It returns a file-like object, and various operations can be performed on this object (like the read operation above, which reads the entire HTML of the page). Other common methods include:

  • geturl(): returns the URL of the retrieved page, which can be used to check whether a redirect occurred.

    result = response.geturl()

    Result: https://www.python.org/

  • info(): returns meta information about the page, such as the HTTP headers.

    result = response.info()

    x-xss-protection: 1; mode=block
    X-Clacks-Overhead: GNU Terry Pratchett
    ...
    Vary: Cookie
    Strict-Transport-Security: max-age=63072000; includeSubDomains

  • getcode(): returns the HTTP status code, for example 200 for success or 503 for failure; it can be used, among other things, to check whether a proxy IP is still usable.

    result = response.getcode()

    Result: 200
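Putting these three methods together, a short sketch continuing the earlier urlopen example:

import urllib.request

response = urllib.request.urlopen('http://python.org/')

print(response.geturl())   # final URL, e.g. https://www.python.org/ after the redirect
print(response.getcode())  # HTTP status code, e.g. 200
print(response.info())     # the HTTP response headers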

The Request class

class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):

As defined above, Request is a class. Initializing it takes the following parameters:

  • url and data, as described for urlopen above;
  • headers: the HTTP request headers, such as the User-Agent field, which can make the crawler masquerade as a browser so that it is not detected by the server;
  • origin_req_host, unverifiable, method, etc.

headers is very useful. Some websites have an anti-crawler mechanism, and a request without headers will cause an error.

So how do you find your browser's headers?

You can press F12 to view the headers of a request. In Chrome, for example, press F12 -> Network, click a request, and view its Headers; you can copy this browser header information and use it.

Here’s how Request works:

import urllib.request

headers = {'User-Agent': ''}  # fill in your browser's User-Agent string here
request = urllib.request.Request('http://python.org/', headers=headers)
html = urllib.request.urlopen(request)
result = html.read().decode('utf-8')
print(result)

The result is the same as with urlopen, which accepts a Request object with the specified parameters as well as a plain URL string. Fill in your own browser information in headers.

There are many other features in urllib's request submodule, such as proxies, timeouts, authentication, and HTTP POST requests, which will be shared next time. This time, we focus on the basic functions.

Now let's talk about exceptions: urllib's error submodule.

The use of error

The error submodule contains two important exception classes: URLError and HTTPError.

1. URLError class

def __init__(self, reason, filename=None):
    self.args = reason,
    self.reason = reason
    if filename is not None:
        self.filename = filename
  • The URLError class is a subclass of OSError; it inherits from OSError and does not add any behavior of its own, but it is used as the base class for the other error types.
  • The URLError class's initialization defines a reason argument, which means that when a URLError object is caught, the reason for the error can be inspected.
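A minimal sketch of catching URLError and inspecting reason (the unreachable domain below is made up purely to trigger the error):

import urllib.request
import urllib.error

try:
    # a non-existent domain makes name resolution fail and raises URLError
    urllib.request.urlopen('http://www.a-domain-that-does-not-exist-12345.com/')
except urllib.error.URLError as e:
    print('Request failed, reason:', e.reason)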

2. HTTPError class

def __init__(self, url, code, msg, hdrs, fp):
    self.code = code
    self.msg = msg
    self.hdrs = hdrs
    self.fp = fp
    self.filename = url
  • HTTPError is a subclass of URLError and is raised when an HTTP request fails.
  • An HTTPError instance is also a valid HTTP response, since an HTTP protocol error is still a valid response with a status code, headers, and body. That is why these response attributes are defined when HTTPError is initialized.
  • When using an HTTPError object, you can therefore view the status code, headers, and so on.
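A minimal sketch of catching HTTPError and reading the status code and headers (the non-existent path below is made up purely to trigger a 404):

import urllib.request
import urllib.error

try:
    # requesting a page that does not exist raises HTTPError (here a 404)
    urllib.request.urlopen('http://python.org/a-page-that-does-not-exist')
except urllib.error.HTTPError as e:
    print('Status code:', e.code)
    print('Headers:\n', e.headers)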

Let's use an example to see how to use these two exception classes together.

import urllib.request
import urllib.error

try:
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'}
    request = urllib.request.Request('http://python.org/', headers=headers)
    html = urllib.request.urlopen(request)
    result = html.read().decode('utf-8')
except urllib.error.HTTPError as e:
    if hasattr(e, 'code'):
        print('The error status is ' + str(e.code))
except urllib.error.URLError as e:
    if hasattr(e, 'reason'):
        print('The reason for the error is ' + str(e.reason))
else:
    print('The request passed successfully.')

The above code uses a try...except structure to implement a simple web page crawl: it prints reason when an exception such as URLError occurs, or code when an HTTPError occurs. Adding exception handling enriches the crawling structure and makes it more robust.

Why does this make it more robust?

Don't underestimate these exceptions; they are very useful and critical. Think about it: when you write code that has to crawl and parse automatically over and over again, you don't want your program interrupted in the middle. If these exception handlers are not set up, it is very likely that an error will pop up and the program will terminate; but with full exception handling in place, the error-handling code is executed when an error is encountered (such as printing the error code as above), without interrupting the program.

These interruptions can take many forms, especially if you are using a proxy IP pool, where many different errors can occur; that is where exceptions come in handy.
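As a rough sketch of this idea (the retry count and sleep interval below are made up for illustration), a crawl can log the error and retry or move on instead of crashing:

import time
import urllib.request
import urllib.error

MAX_RETRIES = 3  # hypothetical retry limit, just for illustration

def fetch(url):
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
        except urllib.error.HTTPError as e:
            print('Attempt %d: HTTP error %d' % (attempt, e.code))
        except urllib.error.URLError as e:
            print('Attempt %d: request failed, reason: %s' % (attempt, e.reason))
        time.sleep(2)  # wait a bit before retrying instead of terminating
    return None  # give up on this URL; the rest of the crawl keeps running

html = fetch('http://python.org/')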

4. To summarize

  • Introduced the definition of a crawler and a learning route
  • Introduced the crawler process
  • Introduced using the urllib library to start learning crawlers, including the following:

    • request: urlopen, Request
    • error: the exception classes

If you want to learn Python crawlers, you can follow the WeChat official account Python Data Science. The blogger will keep updating it with highlights and more practical explanations, taking you into the world of crawlers.