1. Preface
I'm starting on the Python crawler section today, and of course I'll write notes along the way to consolidate what I learn.
Python is popular largely because of its huge ecosystem of third-party libraries. The first stop in learning to crawl: the Requests library.
2. Installation of the Requests library
2.1. Introduction to the official website
The Requests library has an official website for those who are interested. It's entirely in English, though, and rather verbose.
2.2. Installation of the Requests library
To install (taking Windows as an example), open the command line (cmd) as administrator and run:
pip install Requests
2.3. Test whether the installation is successful
Once it's installed, you must be curious to open the door to Python crawling. Let's run a little test:
>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
The status code is 200, which means we're on our way.
Try printing Baidu's home page:
>>> r.encoding = "utf-8"
>>> r.text
It's a little messy, but you can make out some fields; a small sense of achievement already.
2.4. The seven main methods of the Requests library
Method | Description |
---|---|
requests.request() | Constructs a request; the base method underlying each of the methods below. |
requests.get() | The main method for fetching an HTML page, corresponding to HTTP GET. |
requests.head() | Gets the header information of a page, corresponding to HTTP HEAD. |
requests.post() | Submits a POST request to an HTML page, corresponding to HTTP POST. |
requests.put() | Submits a PUT request to an HTML page, corresponding to HTTP PUT. |
requests.patch() | Submits a local modification request to an HTML page, corresponding to HTTP PATCH. |
requests.delete() | Submits a DELETE request to an HTML page, corresponding to HTTP DELETE. |
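Since every convenience method is built on requests.request(), the two calls below should behave identically. A minimal sketch (httpbin.org is a public echo service I'm using for illustration; it is not part of the original notes):

import requests

# each convenience method is a thin wrapper over requests.request(),
# so these two calls do the same thing
r1 = requests.get("https://httpbin.org/get")
r2 = requests.request("GET", "https://httpbin.org/get")
print(r1.status_code, r2.status_code)  # expect: 200 200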
It looks like a lot at first glance. Let's take a closer look; there must be less to it than it seems.
3. The Requests library's get() method
3.1. Basic usage
The correct format is:
r = requests.get(url)
- What is a URL?

URL stands for Uniform Resource Locator; put simply, a web address. Resources on a server are located and accessed through URLs. I'll just think of it as a web address.
- What does this line of code do?

The get() method and the URL construct a Request object, which requests resources from the server. The Request object is generated internally and automatically by the Requests library. Whatever the server returns is assigned to r, so r is a Response object, containing all the resources returned by the server.
Let me clean this up: there are two objects

- Request object: requests resources from the server
- Response object: contains all the resources returned by the server

One goes out, and one comes back.
The above covers requests.get(url), but that's only part of the full signature.
3.2. Full usage
r = requests.get(url,params=None,**kwargs)
Now you can see that there are three parameters. Note that ** marks a variable set of optional keyword arguments.
Parameter introduction:
- url: the URL of the page to fetch (in plain English, a web address)
- params: extra parameters appended to the URL, in dictionary or byte-stream format; optional
- **kwargs: 12 access-control parameters
You can see there are a lot of them. For a quick taste of params, see the sketch below.
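A minimal sketch of params in action (the search path /s and the wd key are just an illustration; any site's query parameters work the same way):

import requests

kv = {"wd": "python"}  # hypothetical query parameter
r = requests.get("http://www.baidu.com/s", params=kv)
print(r.url)  # http://www.baidu.com/s?wd=python -- params appended to the URL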
3.3. A closer look at the Response object
We said that the Response object contains all the resources returned by the server.
Let’s take a look at some code
>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
r.status_code returns 200, so the request succeeded. Then type(r) shows that r is an instance of a class, and you can see its class name.
r.headers returns the header information of the response.
Besides everything the server returned, the Response also carries information about the Request we sent to the server, as the sketch below shows.
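To see that for yourself, note that a Response keeps a reference to the PreparedRequest that produced it. A minimal sketch:

import requests

r = requests.get("http://www.baidu.com")
print(type(r.request))    # <class 'requests.models.PreparedRequest'>
print(r.request.headers)  # the headers our Request carried to the server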
3.4. Properties of the Response object
Five attributes are the most common and are essential for accessing a web page.
Attribute | Description |
---|---|
r.status_code | The return status of the HTTP request; 200 means the connection succeeded, 404 means failure (anything other than 200 is a failure). |
r.text | The response content as a string, i.e. the page content at the URL. |
r.encoding | The encoding of the response content, guessed from the HTTP header. |
r.apparent_encoding | The encoding of the response content, analyzed from the content itself (a fallback). |
r.content | The response content in binary form. |
Five, no more, no less. Let's see them in action.
Let’s take Baidu as an example.
Let’s start coding
- r.status_code
>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
The return value is 200; the request succeeded.
- r.text
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> ... (the rest of the page is similarly garbled and omitted here) ... </body> </html>\r\n'
It's probably all gibberish; I can't read it.

- r.encoding

Now let's look at the encoding:
>>> r.encoding
'ISO-8859-1'
So that's the culprit.

- r.apparent_encoding

It might have another encoding:
>>> r.apparent_encoding
'utf-8'
utf-8; now we're getting somewhere.
In that case, let's switch the encoding above to UTF-8 and print again:
>>> r.encoding = "utf-8"
>>> r.text
Now we can see some familiar Chinese characters; a sudden feeling of enlightenment.
So what’s the difference between these two properties, you say?
Attribute | Description |
---|---|
r.encoding | The encoding of the response content, guessed from the HTTP header |
r.apparent_encoding | The encoding of the response content, analyzed from the content itself (a fallback) |

What you can see is that one is a fallback for the other.
Resources on the web have their own encodings; without knowing the encoding, we can't read them or parse them in any useful way. That's where the concept of encoding comes in.
r.encoding is obtained from the charset field in the HTTP header. If a charset field is present, the server has declared the encoding of the resource, and that value is saved in r.encoding. If there is no charset field, r.encoding defaults to ISO-8859-1.
But that encoding can't represent Chinese. Hence the fallback, r.apparent_encoding.
r.apparent_encoding is the most likely encoding of the text, inferred by analyzing the body of the HTTP response.
Since r.apparent_encoding is analyzed from the content while r.encoding is only a guess from the header, r.apparent_encoding is more accurate and is the one commonly used. The common pattern is sketched below.
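Putting the pieces together, the common pattern looks like this. A minimal sketch (the printed values are the ones Baidu returned in the session above):

import requests

r = requests.get("http://www.baidu.com")
print(r.encoding)                 # 'ISO-8859-1' -- guessed from the header
print(r.apparent_encoding)        # 'utf-8' -- analyzed from the body
r.encoding = r.apparent_encoding  # adopt the more accurate guess
print(r.text[:100])               # the Chinese characters are now readable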
So the steps to crawl a web page are:

graph TD
    A[r = requests.get] --> B{r.status_code}
    B --> |200| D[r.text r.encoding r.apparent_encoding r.content]
    B --> |not 200| E[some error produced an exception]

Here, first check that the status code is 200, and only then parse, print, or otherwise work with the page.
4. General code framework for crawling web pages
Speaking of frameworks: a framework here is just a reusable chunk of code, accurate and convenient.
4.1. Understanding Requests exceptions

Exception | Description |
---|---|
requests.ConnectionError | Network connection error, such as DNS lookup failure or a refused connection |
requests.HTTPError | HTTP error |
requests.URLRequired | A valid URL is missing |
requests.TooManyRedirects | The maximum number of redirects was exceeded |
requests.ConnectTimeout | Connecting to the remote server timed out |
requests.Timeout | The request timed out |
r.raise_for_status() | Not an exception but a method: if the status code is not 200, it raises requests.HTTPError |
So, what does this code framework look like?
4.2. Using the framework

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # if the status is not 200, raise HTTPError
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "raise exception"
What does this mean?
Seeing def, I'm defining a getHTMLText function. Seeing try–except, think exception handling (which I had forgotten). try holds the code; except handles the exception. Inside try, we request the URL and call r.raise_for_status() on the returned Response object to check whether the response is healthy. If it is, the content is decoded properly and the page is returned; if not, an HTTPError is raised and the exception message is returned.
That makes it clear. With this framework, you can quickly surface the problems you run into when you first start crawling.
Let’s test this code
if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))
Now modify the code to use a URL with no scheme:

if __name__ == "__main__":
    url = "www.baidu.com"
    print(getHTMLText(url))

>>> raise exception
4.3. Summary
As you can see, try–except catches and handles exceptions in the code, which saves a lot of time finding and fixing problems in practice.
5. HTTP protocol and Requests library methods
We already know the seven main methods of the Requests library, but that’s not enough. We should also know the HTTP protocol.
5.1. The HTTP protocol
- HTTP: HyperText Transfer Protocol
- HTTP is a stateless application layer protocol based on the request and response pattern.
- HTTP uses URLs as identifiers to locate network resources.
In these three definitions, there are more or less unfamiliar terms, and we will tackle them one by one.
“Request and response” mode: the user initiates a request and the server responds.
Stateless: The first request is not related to the second request.
Application layer: this protocol works on top of the TCP protocol.
URL format: http://host[:port][path]
- Host: a legal Internet host domain name or IP address
- Port: the port number; if omitted, the default is 80
- Path: the path of the requested resource

A URL is an Internet path for accessing resources through HTTP; one URL corresponds to one data resource. The sketch below pulls a URL apart into these pieces.
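To make host, port, and path concrete, here is a minimal sketch using Python's standard urllib.parse (not part of Requests; the URL is a made-up example):

from urllib.parse import urlparse

u = urlparse("http://www.example.com:8080/path/to/resource")
print(u.hostname)  # 'www.example.com'
print(u.port)      # 8080 (None when omitted; HTTP then defaults to 80)
print(u.path)      # '/path/to/resource'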
5.2. Operations on resources in the HTTP protocol
Maybe we’ve seen the following tables before, but we should look at them again.
Method | Description |
---|---|
GET | Requests the resource at the URL location. |
HEAD | Requests a response report for the resource at the URL location, i.e. gets that resource's header information. |
POST | Requests appending new data to the resource at the URL location. |
PUT | Requests storing a resource at the URL location, overwriting the resource originally there. |
PATCH | Requests a partial update of the resource at the URL location, i.e. changes part of that resource's content. |
DELETE | Requests deleting the resource stored at the URL location. |
Does it look similar to the table in 2.4? It should.
How should we understand this?
Think of the Internet as the cloud, which we access through URLs. To fetch resources, use GET and HEAD; to store or modify them, use POST, PUT, PATCH, or DELETE. Each operation is independent.
The biggest difference between PUT and PATCH is that the latter saves network bandwidth, as the sketch below shows.
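A minimal sketch of that difference, assuming a resource with fields name, age, and city (httpbin.org is a public echo service used only for illustration):

import requests

# PATCH: send only the field that changed; the rest stays as it was
requests.patch("https://httpbin.org/patch", data={"city": "Beijing"})

# PUT: resend the complete resource; any field left out would be lost
requests.put("https://httpbin.org/put",
             data={"name": "Tom", "age": 20, "city": "Beijing"})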
6. Requests library: the main methods in detail

Method | Description |
---|---|
requests.request() | Constructs a request; the base method underlying each of the methods below. |
requests.get() | The main method for fetching an HTML page, corresponding to HTTP GET. |
requests.head() | Gets the header information of a page, corresponding to HTTP HEAD. |
requests.post() | Submits a POST request to an HTML page, corresponding to HTTP POST. |
requests.put() | Submits a PUT request to an HTML page, corresponding to HTTP PUT. |
requests.patch() | Submits a local modification request to an HTML page, corresponding to HTTP PATCH. |
requests.delete() | Submits a DELETE request to an HTML page, corresponding to HTTP DELETE. |
You may be tired of seeing this table by now, but having said all this, I suspect I'll only ever use the first three.
6.1. requests.request()
This is the base method that underlies all the others.
requests.request(method,url,**kwargs)
- method: the request method

requests.request('GET', url, **kwargs)

The request method corresponds to the HTTP method; each HTTP method has a matching value here.
- **kwargs: optional parameters that control access
- params: a dictionary or byte sequence, added to the URL as parameters
import requests

kv = {"key1": "value1", "key2": "value2"}
r = requests.request('GET', 'http://python123.io/ws', params=kv)
print(r.url)

>>> https://python123.io/ws?key1=value1&key2=value2

We create a dictionary and supply it as the params argument to a GET request; since it's an optional parameter, it has to be passed by name. Then we print the resulting URL.
You can see that there are more key-value pairs after the link.
In other words, the key-value pairs are appended to the URL, so the request doesn't just access the resource; it carries parameters with it. The server can use these parameters to filter what it sends back.
- data: a dictionary, byte sequence, or file object, used as the body of the Request (for providing or submitting resources to the server)

import requests

kv = {"key1": "value1", "key2": "value2"}
r = requests.request('POST', 'http://python123.io/ws', data=kv)
There are two key-value pairs that are submitted as part of data using the POST method.
The key-value pairs we submit are not appended to the URL; they are stored as data at the location the URL points to. A plain string is also allowed.
- json: data in JSON format, used as the body of the Request

import requests

kv = {"key1": "value1"}
r = requests.request('POST', 'http://python123.io/ws', json=kv)
This data format is very common in HTTP, even if I hadn't paid attention to it before. It too can be submitted to the server as part of the body.
We create a key-value pair in a dictionary; via json=, it is submitted to the server as the JSON body.
- headers: a dictionary of custom HTTP headers

import requests

hd = {'user-agent': 'Chrome/10'}
r = requests.request('POST', 'http://python123.io/ws', headers=hd)
These are the HTTP header fields sent when accessing a URL; with this parameter you can customize the HTTP headers of a particular request.
Here we create a header that sets the user-agent field to Chrome/10; when the link is accessed, the server sees user-agent: Chrome/10 (i.e. the 10th version of the Chrome browser).
So you can use this to simulate different browsers when accessing a server; a quick way to check is sketched below.
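One way to verify that the server really sees the custom header (my own check, not from the original notes) is httpbin.org/get, which echoes the request headers back as JSON:

import requests

hd = {"user-agent": "Chrome/10"}
r = requests.get("https://httpbin.org/get", headers=hd)
print(r.json()["headers"]["User-Agent"])  # expect: Chrome/10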
- cookies: a dictionary or CookieJar; the cookies of the Request
- auth: a tuple; supports HTTP authentication
- files: dictionary type; for transferring files

import requests

fs = {"file": open("1.txt", "rb")}
r = requests.request('POST', 'http://python123.io/ws', files=fs)
Files can be transferred to the server.
- timeout: the timeout period, in seconds

import requests

r = requests.request('GET', 'http://www.baidu.com', timeout=10)
If the content is not returned within the timeout period, an exception is raised. Catching it ties back to the exception table in 4.1; see the sketch below.
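A minimal sketch of catching the timeout (the timeout value is deliberately tiny to force the exception):

import requests

try:
    r = requests.get("http://www.baidu.com", timeout=0.001)
except requests.Timeout:
    print("the request timed out")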
- proxies: dictionary type; sets proxy servers for the requests, and login authentication can be included

Using proxies helps keep the crawler from being traced back; a sketch with placeholder addresses follows.
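A minimal sketch (the proxy addresses and credentials are placeholders; substitute your own):

import requests

# hypothetical proxies -- one for http, one for https, with optional login
pxs = {
    "http": "http://user:password@10.10.10.1:1234",
    "https": "https://10.10.10.1:4321",
}
r = requests.request("GET", "http://www.baidu.com", proxies=pxs)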
- allow_redirects: True/False, default True; redirect switch
- stream: True/False; switch for downloading content immediately (in the library the default is False, meaning the body is downloaded right away; stream=True defers it)
- verify: True/False, default True; whether to verify the SSL certificate
- cert: path to a local SSL certificate
6.2. requests.get()
requests.get(url,params=None,**kwargs)
- url: the URL of the page to fetch
- params: extra URL parameters, in dictionary or byte-stream format; optional
- **kwargs: 12 access-control parameters

The 12 control parameters are the same as in requests.request(), minus params, which is promoted to a named parameter here.
6.3. requests.head()
requests.head(url,**kwargs)
The url goes without saying.
**kwargs: 13 access-control parameters, exactly the same as in requests.request().
6.4. requests.post()

requests.post(url,data=None,json=None,**kwargs)
url, data, and json you're already familiar with; and of course there are 11 control parameters, the ones left over once data and json are promoted to named parameters.
6.5. requests.put()
requests.put(url,data=None,**kwargs)
By this point you may be wondering whether they all follow the same pattern; they do.
6.6. requests.patch()

requests.patch(url,data=None,**kwargs)
6.7. requests.delete()

requests.delete(url,**kwargs)
Of course, it's all the same pattern.
7. Summary
From the macro point of view, there are many methods and a lot of code. From the micro point of view, there are basically those few methods, which appear repeatedly, and 13 control parameters.
Of course, we should also remember the general code framework, which saves a lot of time.
requests.get() is the most used method.
So that’s it, my notes.
Thank you for reading. If there are mistakes in the article, corrections are welcome; it's my pleasure if I've been of any help to you.