Life is short. I use Python
Previous portal:
Learning Python crawler (1): The beginning
Learning Python crawler (2): Pre-preparation (1) Basic library installation
Learning Python crawler (3): Pre-preparation (2) Linux basics
Learning Python crawler (4): Pre-preparation (3) Docker basics
Learning Python crawler (5): Pre-preparation (4) Database basics
Learning Python crawler (6): Pre-preparation (5) Crawler framework installation
Learning Python crawler (7): HTTP basics
Learning Python crawler (8): Web basics
Learning Python crawler (9): Crawler basics
Learning Python crawler (10): Session and Cookies
Learning Python crawler (11): Urllib
Introduction
In the last article we covered the basic usage of urlopen(), but those few simple parameters are not enough to build a complete request. For more complex requests, such as adding request headers, we need to use the more powerful Request class.
Request
The official document: https://docs.python.org/zh-cn/3.7/library/urllib.request.html
Let’s look at the syntax for using Request:
```python
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
```
- url: the URL to request. This is the only mandatory parameter; all the others are optional.
- data: the request body. If this parameter is supplied, it must be of type bytes (see the sketch right after this list).
- headers: the request headers, a dictionary. They can be passed in when constructing the Request or added afterwards by calling add_header() (a sketch of add_header() follows the first example below).
- origin_req_host: the host name or IP address of the requesting party.
- unverifiable: whether the request is "unverifiable"; the default is False. It means the user has no opportunity to approve the request. For example, when requesting an image embedded in an HTML document, the user has no chance to approve the automatic fetching of that image, so unverifiable is True.
- method: the request method, such as GET, POST, PUT, or DELETE.
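For example, urllib.parse.urlencode() gives a quick way to produce such bytes from an ordinary dict. A minimal sketch (the form fields below are just placeholders):

```python
import urllib.parse

# urlencode() flattens the dict into 'name=geekdigging&hello=world',
# and encode() turns that string into the bytes that data requires.
form = {'name': 'geekdigging', 'hello': 'world'}
data = urllib.parse.urlencode(form).encode('utf-8')
print(data)  # b'name=geekdigging&hello=world'
```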
Let’s start with a simple example of using Request to crawl a blog site:
```python
import urllib.request

request = urllib.request.Request('https://www.geekdigging.com/')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
```
As you can see, urlopen() is still used to send the request, but instead of a URL plus data, timeout, and so on, the parameter is now a Request object.
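Headers can also be attached to an existing Request object with add_header(), as mentioned in the parameter list above. A minimal sketch, using httpbin's test endpoint and the same User-Agent as the later examples:

```python
import urllib.request

req = urllib.request.Request('https://httpbin.org/get')
# Add a request header after the Request has been constructed.
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36')
response = urllib.request.urlopen(req)
print(response.status)
```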
Let’s build a slightly more complex request.
```python
import urllib.request, urllib.parse
import json

url = 'https://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Content-Type': 'application/json; encoding=utf-8',
    'Host': 'geekdigging.com'
}
data = {
    'name': 'geekdigging',
    'hello': 'world'
}
data = bytes(json.dumps(data), encoding='utf8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
resp = urllib.request.urlopen(req)
print(resp.read().decode('utf-8'))
```
The results are as follows:
{ "args": {}, "data": "{\"name\": \"geekdigging\", \"hello\": \"world\"}", "files": {}, "form": {}, "headers": { "Accept-Encoding": "identity", "Content-Length": "41", "Content-Type": "application/json; Encoding = UTF-8 ", "Host": "geekdigging.com", "user-agent ": "Mozilla/5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"}, "json": {"hello": "world", "name": "Geekdigging}" and "origin", "116.234.254.11 116.234.254.11", "url" : "https://geekdigging.com/post"}Copy the code
Here we build a Request object with four parameters.
The url specifies the link to access, again the test link mentioned in the previous article.
User-Agent, Content-Type, and Host are specified in headers.
In data, json.dumps() converts the dict to a JSON string, and bytes() then converts that string to a byte stream.
Finally, the request method is specified as POST.
The final result shows that all of our settings took effect.
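json.dumps() handles the request side; going the other way, json.loads() turns the JSON that httpbin echoes back into a Python dict. A minimal, self-contained sketch that repeats a smaller version of the request above and then parses the response:

```python
import json
import urllib.request

# Rebuild a small POST request to httpbin.
payload = bytes(json.dumps({'name': 'geekdigging', 'hello': 'world'}), encoding='utf8')
req = urllib.request.Request('https://httpbin.org/post', data=payload,
                             headers={'Content-Type': 'application/json; encoding=utf-8'},
                             method='POST')

# Parse the echoed JSON body into a dict instead of printing raw text.
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read().decode('utf-8'))

print(body['json'])             # {'hello': 'world', 'name': 'geekdigging'}
print(body['headers']['Host'])  # httpbin.org
```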
Advanced operation
Previously we used Request to add request headers, but if we want to handle Cookies or access pages through a proxy, we need the more powerful Handler. A Handler can be thought of as a functional processor: there is one for almost everything an HTTP request may need.
urllib.request provides the BaseHandler class, which is the parent of all other Handlers. It provides the following methods and attribute for direct use:
- add_parent(director): registers a director (an OpenerDirector) as this handler's parent.
- close(): removes any parents.
- parent: the parent OpenerDirector, which can be used to open a URL with a different protocol or to handle errors.
- default_open(req): not defined in BaseHandler itself, but subclasses define it if they want to catch all URLs; it is called before any protocol-specific open method.
Next, there are various Handler subclasses that inherit from BaseHandler:
- HTTPDefaultErrorHandler: handles HTTP response errors, all of which raise exceptions of the HTTPError class.
- HTTPRedirectHandler: handles redirects.
- ProxyHandler: sets a proxy for requests; the default proxy is empty (a proxy sketch appears after the Opener introduction below).
- HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.
- AbstractBasicAuthHandler: handles authentication by retrying the request with a user/password pair.
- HTTPBasicAuthHandler: retries requests with Basic authentication information (see the sketch near the end of this article).
- HTTPCookieProcessor: handles Cookies.
urllib provides many more BaseHandler subclasses, which are not listed here; you can look them up in the official documentation.
The official documentation address: https://docs.python.org/zh-cn/3.7/library/urllib.request.html#basehandler-objects
Before I show you how to use Handler, I’ll introduce an advanced class: OpenerDirector.
OpenerDirector is a higher-level class that opens URLs in three stages. The order in which these methods are called within each stage is determined by sorting the handler instances:
1. Every handler with a method named like protocol_request() has that method called to pre-process the request.
2. Handlers with a method named like protocol_open() are called to handle the request.
3. Every handler with a method named like protocol_response() has that method called to post-process the response.
We can simply call an OpenerDirector instance an Opener. The urlopen() method we used before is actually an Opener provided by urllib.
An Opener's main methods include:
- add_handler(handler): adds a handler to the chain.
- open(url, data=None[, timeout]): opens the given URL, just like the urlopen() method.
- error(proto, *args): handles an error for the given protocol.
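Before moving on to Cookies, here is a minimal proxy sketch built on the same pattern: a ProxyHandler is passed to build_opener(), and the resulting Opener sends requests through the proxy. The address 127.0.0.1:8888 is only a placeholder and assumes an HTTP proxy is running locally:

```python
import urllib.request

# Map each scheme to the proxy that should carry it; the address
# below is a placeholder for a locally running HTTP proxy.
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('https://httpbin.org/get')
print(response.read().decode('utf-8'))

# Optional: install the Opener globally so that plain urlopen()
# calls also go through the proxy from now on.
urllib.request.install_opener(opener)
```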
Let’s demonstrate how to get Cookies from a website:
```python
import http.cookiejar, urllib.request

# instantiate a CookieJar object
cookie = http.cookiejar.CookieJar()
# build a handler with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# build the Opener
opener = urllib.request.build_opener(handler)
# send the request through the Opener
response = opener.open('https://www.baidu.com/')
print(cookie)
for item in cookie:
    print(item.name + " = " + item.value)
```
The meaning of the code is explained by the comments, so I won't go through it again. The final printed result is as follows:
<CookieJar[<Cookie BAIDUID=48EA1A60922D7A30F711A420D3C5BA22:FG=1 for .baidu.com/>, <Cookie BIDUPSID=48EA1A60922D7A30DA2E4CBE7B81D738 for .baidu.com/>, <Cookie PSTM=1575167484 for .baidu.com/>, <Cookie BD_NOT_HTTPS=1 for www.baidu.com/>]>
BAIDUID = 48EA1A60922D7A30F711A420D3C5BA22:FG=1
BIDUPSID = 48EA1A60922D7A30DA2E4CBE7B81D738
PSTM = 1575167484
BD_NOT_HTTPS = 1Copy the code
This raises a question: since Cookies can be printed, can we save them to a file?
The answer is of course yes — after all, Cookies themselves are commonly stored in files.
```python
# example of saving Cookies in Mozilla format
filename = 'cookies_mozilla.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
print('cookies_mozilla saved successfully')
```
Here we change CookieJar to MozillaCookieJar, which is used when generating the file. It is a subclass of CookieJar that handles Cookies and file-related events, such as reading and saving Cookies, and it saves Cookies in the format used by the Mozilla browser.
After running, we can see that a cookies_mozilla.txt file has been generated in the program's working directory, with the following contents:
```
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file! Do not edit.
.baidu.com TRUE / FALSE 1606703804 BAIDUID 0A7A76A3705A730B35A559B601425953:FG=1
.baidu.com TRUE / FALSE 3722651451 BIDUPSID 0A7A76A3705A730BE64A1F6D826869B5
.baidu.com TRUE / FALSE  H_PS_PSSID 1461_21102_30211_30125_26350_30239
.baidu.com TRUE / FALSE 3722651451 PSTM 1575167805
.baidu.com TRUE / FALSE  delPer 0
www.baidu.com FALSE / FALSE  BDSVRTM 0
www.baidu.com FALSE / FALSE  BD_HOME 0
```
I'm being a bit lazy here: no screenshot, just the pasted results.
Of course, besides the Mozilla browser format, we can also save Cookies in libwww-perl (LWP) format.
To save Cookies in LWP format, change the declaration to LWPCookieJar:
```python
# example of saving Cookies in LWP format
filename = 'cookies_lwp.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
print('cookies_lwp saved successfully')
```
The result is as follows:
```
#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="D634D45523004545C6E23691E7CE3894:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2020-11-30 02:45:24Z"; comment=bd; version=0
Set-Cookie3: BIDUPSID=D634D455230045458E6056651566B7E3; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-12-19 05:59:31Z"; version=0
Set-Cookie3: H_PS_PSSID=1427_21095_30210_18560_30125; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1575168325; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-12-19 05:59:31Z"; version=0
Set-Cookie3: delPer=0; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
```
As you can see, the cookie file formats produced by the two types are quite different.
Now that the Cookies file has been generated, the next step is to read the Cookies from the file and add them to a request, as shown in the following example:
```python
# example of using the Mozilla-format Cookies file in a request
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookies_mozilla.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
```
Here we use the load() method to read the local Cookies file and obtain its contents.
The prerequisite is that we generated the Mozilla-format Cookies file in advance; after loading the Cookies, we build the Handler and the Opener in the same way as before.
When the request succeeds, it prints the source code of the corresponding Baidu homepage; I won't paste the result here because it really is a bit long.
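As a last example, the authentication handlers listed earlier follow the same build_opener() pattern. A minimal sketch of HTTP Basic Auth, where the URL and credentials are placeholders for a site of your own protected by Basic authentication:

```python
import urllib.request
from urllib.error import URLError

# Placeholders: a locally running site protected by HTTP Basic Auth.
url = 'http://localhost:5000/'
username = 'admin'
password = 'admin'

# The password manager stores the credentials for the URL, and the
# auth handler retries the request with them when a 401 comes back.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, username, password)
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(auth_handler)

try:
    response = opener.open(url)
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
```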
That's the end of this article. I hope you remember to write the code yourself!
Sample code
All of the code in this series will be available on Github and Gitee.
Example code -Github
Example code -Gitee
References
https://www.cnblogs.com/zhangxinqi/p/9170312.html
https://cuiqingcai.com/5500.html