Hi, everyone. I believe those of you who clicked in here are very interested in crawlers, and so is the blogger. The blogger was hooked the first time he met crawlers, because they just felt SO COOL. Don't you feel a sense of accomplishment when a stream of data scrolls across the screen after a few lines of code? Even better, the technique applies to many everyday scenarios: voting automatically, batch-downloading articles, novels and videos you are interested in, building a WeChat bot, or crawling important data for analysis. The code really feels like it was written for you and works for you, and it can help others as well. Life is short, so the blogger chooses the crawler.
1. What is a crawler?
First of all, we should be clear about what a crawler is and why we crawl. The blogger searched Baidu, and the explanation goes like this:
A web crawler (also known as a web spider or web robot, and more commonly called a web chaser in the FOAF community) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other less commonly used names include ant, autoindexer, emulator or worm.
Basically, a crawler can mimic the behavior of a browser and do what you want: customize what you search for and download, and automate the whole process. For example, a browser can download a novel, but it usually cannot download novels in bulk, and that is where a crawler comes in handy.
There are many programming environments for implementing crawlers. Java, Python, C++ and so on can all be used. But the blogger chose Python, and I'm sure many people do, because Python is really well suited to crawlers: its rich third-party libraries are powerful, you can do what you want in a few lines of code, and most importantly, Python is also great for data mining and analysis. It feels great to be able to both crawl and analyze data in Python.
2. Crawler learning route
Now that you know what a crawler is, let me share the basic learning route the blogger has summarized. It is only for reference, because everyone has their own method; this is just to provide some ideas.
The general steps for learning a Python crawler are as follows:
- Learn the basics of Python first: basic syntax and grammar
- Learn the important built-in libraries that Python crawlers use to download web pages: urllib, http, etc.
- Learn the web page parsing tools: regular expressions (re), BeautifulSoup (bs4), XPath (lxml)
- Start crawling some simple sites (the blogger started from Baidu, haha) and get a feel for the crawling process
- Learn about sites' anti-crawling mechanisms and how to deal with them: headers, robots rules, time intervals, proxy IPs, hidden fields, etc.
- Learn to crawl some special sites, solving problems such as login, cookies, and simulating dynamic JS-rendered pages
- Learn selenium and other automation tools for asynchronously loaded pages
- Understand how crawlers work with databases and how to store crawled data: MySQL, MongoDB
- Learn to use Python multithreading and asynchronous I/O to improve crawler efficiency
- Learn crawler frameworks: Scrapy, PySpider, etc.
- Learn Redis-based distributed crawlers (for large-scale data requirements)
- Learn incremental crawling
The above is a general overview of the learning path, and there is a lot of content the blogger still needs to keep learning as well. The details of each step mentioned above will be shared gradually in later posts with practical examples, and of course there will also be some interesting crawler-related content.
3. Start with the first crawler
I think the first crawler should start with urllib. When the blogger began learning, the first thing was to use the urllib library to write a few lines of code that implemented a simple data-crawling function, and I think most of you came the same way. I was like, wow, that's amazing: you can do a seemingly complicated task in just a few lines of code. So how do these few lines actually work, and how can we do more sophisticated crawls? With these questions in mind, I began studying urllib.
First of all, I have to mention the process of data crawling. Understanding exactly what it is will make urllib much easier to learn.
Crawler process
In fact, the crawler process is the same as the process of a browser browsing a web page. You probably all understand the idea: when we type a URL and hit search, the request first goes through a DNS server, which resolves the URL's domain name and finds the real server. Then we send a GET or POST request to that server over the HTTP protocol. If the request succeeds, we get back the web page we want to see, generally built with front-end technologies such as HTML, CSS and JS; if the request fails, the server returns a failure status code, commonly 503, 403 and so on.
The crawler process is the same: make a request to the server to get the HTML page, then parse the downloaded page to extract the content we want. Of course, this is only an overview of the process; there are a lot of details to deal with, which will be shared later.
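To make this concrete, here is a minimal sketch of the fetch-then-parse cycle using only the standard library. The URL is just an example, the tag we extract is arbitrary, and a real crawler would add headers, delays and error handling:

import urllib.request
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    # A tiny parser that collects the text inside the <title> tag.
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Step 1: request the page and download the HTML.
html = urllib.request.urlopen('http://python.org/').read().decode('utf-8')
# Step 2: parse the downloaded page to extract the content we want.
parser = TitleParser()
parser.feed(html)
print(parser.title)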
Once we understand the basic crawler process, we can start our real crawler journey.
The urllib library
Python has a built-in urllib library, which is an important part of the crawler process. This built-in library can send requests to the server and fetch web pages, so it is also the first step in learning crawlers.
The blogger uses Python 3.x, where the urllib library structure is a bit different from Python 2.x: the urllib2 and urllib libraries used in Python 2.x have been merged into a single urllib library.
First, let's take a look at what the Python 3.x urllib library contains.
The blogger's IDE is PyCharm, which is very convenient for editing and debugging. Enter the following code in the console:
>>> import urllib
>>> dir(urllib)
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'error', 'parse', 'request', 'response']
As you can see, in addition to the built-in attributes that start and end with double underscores, urllib has four important attributes: error, parse, request, and response.
The docstrings at the beginning of each submodule in Python's urllib library briefly describe them as follows:
- error: "Exception classes raised by urllib." The exceptions raised by urllib.
- parse: "Parse (absolute and relative) URLs." Parses absolute and relative URLs.
- request: "An extensible library for opening URLs using a variety of protocols." An extensible library for opening URLs with a variety of protocols.
- response: "Response classes used by urllib." The response classes used by urllib.
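Of these, parse is worth a quick look right away, since it comes back later when we build request data. A minimal sketch (the URL and the dictionary values here are made up for illustration):

from urllib import parse

# Split a URL into its components (scheme, host, path, query, ...).
parts = parse.urlparse('https://www.python.org/search/?q=crawler&page=2')
print(parts.netloc)   # www.python.org
print(parts.query)    # q=crawler&page=2

# Encode a dict into a query string / POST body, as used with urlopen's data argument later.
print(parse.urlencode({'q': 'crawler', 'page': 2}))   # q=crawler&page=2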
Among these four attributes, the most important one is request, which does most of a crawler's work. Let's take a look at how request is used.
The use of the request
The simplest operation in request is the urlopen method, which is used like this:
import urllib.request
response = urllib.request.urlopen('http://python.org/')
result = response.read()
print(result)
The running results are as follows:
b'<!doctype html>\n<!--[if lt IE 7]>... </body>\n</html>\n'
The result looks garbled!! Don't worry, this is just an encoding issue: urlopen returns the page as a bytes object, so we only need to read the response and decode it.
Modify the code as follows:
import urllib.request
response = urllib.request.urlopen('http://python.org/')
result = response.read().decode('utf-8')
print(result)
The running results are as follows:
<!doctype html> <!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8">... <!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9">... <!--[if IE 8]> <html class="no-js ie8 lt-ie9"> <![endif]--> <!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr"> <head> <meta charset="utf-8"> ...
This is the HTML page we want. How about that? Simple.
Let's take a closer look at the urlopen method and the parameters it accepts.
The urlopen method
def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None,
            cadefault=False, context=None):
urlopen is one of the methods in request. It opens a URL, which can be either a string (as in the example above) or a Request object (described later).
- url: the URL we request (for example: www.xxxx.com/);
- data: the extra information sent to the server with the request (such as the user information that has to be filled in to log in to a web page). If a data parameter is supplied, the request is a POST request; if there is no data parameter, it is a GET request.
  - In general, the data parameter is only meaningful for requests under the HTTP protocol.
  - The data argument must be passed as a bytes object.
  - The data argument should follow the standard form structure, which requires urllib.parse.urlencode() to convert it; a sketch of how to use data follows right after this list.
- timeout: an optional timeout in seconds, used to prevent a request from taking too long; if it is not specified, the default timeout is used.
- cafile: points to a single file containing a bundle of CA certificates (rarely used, the default is fine).
- capath: points to a certificate directory and is also used for CA authentication (rarely used, the default is fine).
- cadefault: can be ignored.
- context: sets up SSL-encrypted transmission (rarely used, the default is fine).
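As promised above, here is a hedged sketch of using the data argument for a POST request, together with timeout. The target URL is httpbin.org, a public testing service that simply echoes back what it receives; it stands in for whatever form you actually need to submit, and the field names are made up:

import urllib.parse
import urllib.request

# The form fields are placeholders; a real site defines its own field names.
form = urllib.parse.urlencode({'name': 'test', 'page': 1}).encode('utf-8')  # must be bytes
response = urllib.request.urlopen('http://httpbin.org/post', data=form, timeout=10)
print(response.read().decode('utf-8'))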
urlopen returns a file-like object, and various operations can be performed on this object (like the read operation above, which reads the entire HTML). Other common methods include:
- geturl(): returns the final URL, which can be used to check whether a redirect occurred.

result = response.geturl()

Result: https://www.python.org/

- info(): returns meta information about the response, such as the HTTP headers.

result = response.info()

Result:

x-xss-protection: 1; mode=block X-Clacks-Overhead: GNU Terry Pratchett … Vary: Cookie Strict-Transport-Security: max-age=63072000; includeSubDomains

- getcode(): returns the HTTP status code of the reply, 200 on success and, for example, 503 on failure; it can be used to check the availability of a proxy IP.

result = response.getcode()

Result: 200
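Putting the three methods together on one response object (the exact output will of course depend on the site at the time you run it):

import urllib.request

response = urllib.request.urlopen('http://python.org/')
print(response.geturl())    # the final URL after any redirect, e.g. https://www.python.org/
print(response.getcode())   # the HTTP status code, 200 on success
print(response.info())      # the response headers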
The Request class
class Request:
def __init__(self, url, data=None, headers={},
origin_req_host=None, unverifiable=False,
method=None):
As defined above, Request is a class whose initializer takes the parameters the request needs: url and data are the same as in urlopen above; headers is the HTTP request header information, such as the User-Agent parameter, which lets the crawler disguise itself as a browser so the server does not know you are using a crawler; origin_req_host, unverifiable and method are not often used.
headers is very useful. Some websites have anti-crawler mechanisms, and a request without headers will trigger an error.
So how do you find headers for your browser?
You can press F12 to view a request's headers. In Chrome, for example, press F12 -> Network, select a request and look at its headers; you can copy your browser's header information and use it.
Here’s how Request works:
import urllib.request
headers = {'User-Agent': ''}  # fill in your browser's User-Agent string here
request = urllib.request.Request('http://python.org/', headers=headers)
html = urllib.request.urlopen(request)
result = html.read().decode('utf-8')
print(result)
The result is the same as with urlopen; urlopen accepts a Request-class object as well as the specified parameters. Just fill in your own browser information in headers.
There are many other things in urllib's request attribute, such as proxies, timeouts, authentication, and HTTP POST requests, which will be shared next time. This time we focus on the basic functions.
Finally, let's talk about exceptions: urllib's error attribute.
The use of the error
The error attribute contains two important exception classes, URLError and HTTPError.
1. URLError class
def __init__(self, reason, filename=None):
self.args = reason,
self.reason = reason
if filename is not None:
self.filename = filename
The URLError class is a subclass of OSError. It inherits from OSError, has no behavior of its own, and serves as the base class for all the other error types. The URLError class is initialized with a reason argument, which means that when you catch a URLError object you can view the reason for the error, as in the quick sketch below.
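A minimal, hypothetical example: the hostname here is deliberately unresolvable, so urlopen raises a URLError whose reason explains the failure.

import urllib.error
import urllib.request

try:
    urllib.request.urlopen('http://a-domain-that-does-not-exist.example/')
except urllib.error.URLError as e:
    print(e.reason)   # e.g. a socket.gaierror describing the failed name lookup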
2. The HTTPError class
def __init__(self, url, code, msg, hdrs, fp):
self.code = code
self.msg = msg
self.hdrs = hdrs
self.fp = fp
self.filename = url
HTTPError is a subclass of URLError and is raised when an HTTP error occurs. An HTTPError instance is also a valid HTTP response, because an HTTP protocol error is still a valid response, complete with a status code, headers and body. That is why its initializer defines parameters for these response fields: when you catch an HTTPError you can inspect the status code, the headers and so on.
Let's use an example to see how these two exception classes are used.
import urllib.request
import urllib.error

try:
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'}
    request = urllib.request.Request('http://python.org/', headers=headers)
    html = urllib.request.urlopen(request)
    result = html.read().decode('utf-8')
except urllib.error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it has to be caught first.
    if hasattr(e, 'code'):
        print('The error status code is ' + str(e.code))
except urllib.error.URLError as e:
    if hasattr(e, 'reason'):
        print('The error reason is ' + str(e.reason))
else:
    print('The request succeeded.')
The above code uses a try..except structure to implement a simple web page crawl, printing reason when an exception such as URLError occurs, or code when an HTTPError occurs. Adding exceptions enriches the crawl structure and makes it more robust.
Why more robust?
Don't underestimate these exceptions; they are very useful and critical. Think about it: when you write code that has to run a crawl-and-parse loop automatically over and over again, you don't want the program interrupted in the middle. If these exception handlers are not set up, it is very likely an error will pop up and the program will terminate; but with full exception handling in place, the code keeps running when it hits an error (for example, printing the error code as above) instead of being interrupted.
These interruptions can come in many forms, especially if you are using a proxy IP pool, where many different errors can occur, and that is where exceptions come in handy.
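As a rough illustration of the idea (not the blogger's exact code), here is a sketch of a crawl loop that retries on URLError instead of dying; the URL list and the retry count are made up:

import time
import urllib.error
import urllib.request

urls = ['http://python.org/', 'http://example.com/does-not-exist']

for url in urls:
    for attempt in range(3):
        try:
            html = urllib.request.urlopen(url, timeout=10).read()
            print(url, 'ok,', len(html), 'bytes')
            break
        except urllib.error.HTTPError as e:
            print(url, 'HTTP error', e.code)
            break                      # a 403/404 will not fix itself, move on to the next URL
        except urllib.error.URLError as e:
            print(url, 'failed:', e.reason, '- retrying')
            time.sleep(2)              # back off a little before the next attempt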
4. To summarize
- Introduced the definition of a crawler and the learning route
- Introduced the crawler process
- Introduced using the urllib library to start learning crawlers, including the following:
  - request: urlopen, Request
  - error: the exception classes
That will do for today's sharing. If you found it useful, please give the article a thumbs up and bookmark it. People learning Python often give up because they have no materials or no one to guide them, so the blogger has set up a Python learning and communication group where you can get PDF books, tutorials and so on for free and learn together. Everyone is welcome.