This article is the first in a series of crawler articles and covers the use of the urllib library in Python 3. urllib is the Python standard library for network requests. It has four modules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser. The urllib.request and urllib.error modules are the ones most frequently used in crawlers, so let's get straight to the point and explain how to use these two modules.
1 Initiating a Request
To simulate an HTTP request from a browser, we need the urllib.request module. urllib.request does more than just send a request; it also returns the result of the request. The urlopen() function alone is enough to send a request. Let's take a look at the urlopen() API:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- url: the first argument is the address to request, given either as a string or as a urllib.request.Request object.
- data: bytes content, which can be produced with the bytes() function. It is optional. When data is supplied, the request method changes to POST and the data is submitted like a form, in the standard application/x-www-form-urlencoded format.
- timeout: sets the request timeout, in seconds.
- cafile and capath: the path to a CA certificate file and to a directory of CA certificates. They are used for HTTPS requests.
- context: must be of type ssl.SSLContext; it specifies the SSL settings.
- cadefault: deprecated and can be ignored.
- The function returns an http.client.HTTPResponse object.
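As a minimal sketch of how these parameters fit together, the helper below forwards an explicit timeout and an ssl.SSLContext to urlopen(). The name fetch_html and the 10-second default are assumptions made for illustration and are not part of urllib; ssl.create_default_context() is the standard-library function that returns a context with default certificate verification.

```python
import ssl
import urllib.request


def fetch_html(url, timeout=10, context=None):
    """Fetch a URL and return its body decoded as UTF-8.

    fetch_html is a hypothetical helper; it only forwards its
    arguments to urllib.request.urlopen().
    """
    if context is None:
        # Default context: verifies server certificates for HTTPS.
        context = ssl.create_default_context()
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read().decode('utf-8')
```

Calling fetch_html("https://tieba.baidu.com") would then return the page source as a string.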
1.1 Simply Fetching Web pages
We use urllib.request.urlopen() to request Baidu Tieba and get the source code of its page.
import urllib.request
url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url)
html = response.read()  # read the response body as bytes
print(html.decode('utf-8'))  # decode the bytes as UTF-8
1.2 Setting a Request Timeout
Some requests may not be answered because of network problems. Therefore, we can manually set the timeout. When a request times out, we can take further steps, such as choosing to discard the request directly or request it again.
import urllib.request
url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url, timeout=1)
print(response.read().decode('utf-8'))
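One way to "request it again" after a timeout could look like the sketch below. The function fetch_with_retry and its parameters are hypothetical names chosen for this example; the opener argument is assumed to be any callable with the same signature as urlopen(), which also makes the function easy to try out without a network.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_retry(url, timeout=1, retries=3, opener=urllib.request.urlopen):
    """Return the response body, retrying when the request times out.

    fetch_with_retry is a hypothetical helper; `opener` is any callable
    that accepts (url, timeout=...) like urlopen() does.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.URLError as e:
            # A timeout while connecting is wrapped in URLError;
            # any other URLError (including HTTPError) is re-raised.
            if not isinstance(e.reason, socket.timeout):
                raise
            last_error = e
        except socket.timeout as e:
            # A timeout while waiting for or reading the response
            # is raised directly.
            last_error = e
        print('attempt %d timed out' % attempt)
    # Every attempt timed out: give up and re-raise the last timeout.
    raise last_error
```

Discarding the request instead of retrying simply means catching the final exception in the caller and moving on.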
1.3 Submitting Data with the data Parameter
When requesting certain web pages, we need to send some data along with the request. This is what the data parameter is for.
import urllib.parse
import urllib.request
url = "http://127.0.0.1:8000/book"
params = {'name': 'Six Records of a Floating Life', 'author': 'Shen Fu'}
data = bytes(urllib.parse.urlencode(params), encoding='utf8')
response = urllib.request.urlopen(url, data=data)
print(response.read().decode('utf-8'))
params has to be converted into a byte stream. Because params is a dictionary, we first use urllib.parse.urlencode() to convert the dictionary to a string, and then use bytes() to convert the string to a byte stream. Finally, urlopen() sends the request, which simulates a POST submission of form data.
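The conversion described above can be checked step by step without sending anything. In this sketch the dictionary values are placeholders chosen for illustration; Request.get_method() is used only to show that supplying data switches the method to POST.

```python
import urllib.parse
import urllib.request

# Placeholder form fields, assumed for this example.
params = {'name': 'floating life', 'author': 'Shen Fu'}

# Step 1: dictionary -> URL-encoded string (spaces become '+').
query = urllib.parse.urlencode(params)

# Step 2: string -> bytes, which is what the data parameter expects.
data = query.encode('utf-8')

# Building a Request (without sending it) shows that supplying data
# switches the method from GET to POST.
request = urllib.request.Request("http://127.0.0.1:8000/book", data=data)

print(query)                 # name=floating+life&author=Shen+Fu
print(request.get_method())  # POST
```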
1.4 Using Request
We know from the above that the urlopen() function can send a simple request. However, its few simple parameters are not enough to build a complete request. If you need to add headers, specify the request method, and so on, you can use the more powerful Request class to build the request. As usual, let's look at its constructor first:
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
- url: the request link. This parameter is mandatory; the other parameters are optional.
- data: same as the data argument of urlopen().
- headers: the header information of the HTTP request, given as a dictionary. Besides passing headers to the constructor, request headers can also be added by calling the add_header() method of the Request instance.
- origin_req_host: the host name or IP address of the requester.
- unverifiable: indicates whether the request is unverifiable; the default value is False. An unverifiable request is one whose URL the user had no option to approve. For example, if the request is for an image in an HTML document and the user had no option to approve the automatic fetching of the image, unverifiable should be True.
- method: the HTTP method of the request, such as GET, POST, DELETE, or PUT.
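To make the parameters concrete, the following sketch builds a Request without sending it and inspects the result. The URL, form data, and header values are placeholders assumed for the example. Note that urllib stores header names in capitalized form, so 'User-Agent' is looked up as 'User-agent'.

```python
import urllib.request

request = urllib.request.Request(
    url="http://127.0.0.1:8000/book",       # placeholder URL
    data=b'name=test',                      # placeholder form data
    headers={'User-Agent': 'Mozilla/5.0'},  # headers passed to the constructor
    method='PUT',                           # overrides the default GET/POST choice
)

# Headers can also be added after construction.
request.add_header('Accept', 'text/html')

print(request.get_method())              # PUT
print(request.get_header('User-agent'))  # Mozilla/5.0
print(request.has_header('Accept'))      # True
```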
1.4.1 Simply Using Request
Here we use Request to send an HTTP request disguised as a browser. If User-Agent is not set in headers, the default User-Agent is Python-urllib/3.5 (the suffix is the Python version). Some sites may block such requests, so we need to disguise the request as coming from a browser. The User-Agent I use is Chrome's.
import urllib.request
url = "http://tieba.baidu.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
1.4.2 Advanced Usage of Request
If we need to add a proxy to the request or handle Cookies, we need to use Handler and OpenerDirector.
1) Handler: a Handler takes care of one aspect of a request (HTTP, HTTPS, FTP, and so on). Each one is a concrete subclass of urllib.request.BaseHandler, the base class for all handlers, which defines the basic handler interface such as default_open() and protocol_request(). Many classes inherit from BaseHandler; I'll just list a few of the more common classes used with handlers:
- ProxyHandler: sets up a proxy for the request.
- HTTPCookieProcessor: processes Cookies in HTTP requests.
- HTTPDefaultErrorHandler: handles HTTP response errors.
- HTTPRedirectHandler: handles HTTP redirection.
- HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords. (It is a helper class rather than a BaseHandler subclass.)
- HTTPBasicAuthHandler: used for login authentication, generally in combination with HTTPPasswordMgr.
2) OpenerDirector: we can simply call an OpenerDirector an opener. The urlopen() function we used earlier is in fact an opener provided by urllib. What does an opener have to do with a Handler? An opener object is created by the build_opener(handler) function. To make a custom opener the default, use the install_opener(opener) function. Note that install_opener() installs the opener as the global default OpenerDirector, which urlopen() uses from then on.
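To illustrate how a Handler and an opener relate, here is a small sketch of a custom handler. DefaultHeaderHandler is a hypothetical class written for this example, not part of urllib; it relies on the documented protocol_request() hook (here http_request and https_request), which an opener calls to pre-process a request before it is sent.

```python
import urllib.request


class DefaultHeaderHandler(urllib.request.BaseHandler):
    """Hypothetical handler that adds default headers to HTTP(S) requests."""

    def __init__(self, headers):
        self.default_headers = headers

    def http_request(self, request):
        # protocol_request() hook for HTTP: runs before the request is sent.
        for name, value in self.default_headers.items():
            # urllib stores header names capitalized, e.g. 'User-agent'.
            if not request.has_header(name.capitalize()):
                request.add_header(name, value)
        return request

    # Reuse the same pre-processing for HTTPS requests.
    https_request = http_request


handler = DefaultHeaderHandler({'User-Agent': 'Mozilla/5.0'})
# build_opener() chains our handler with urllib's default handlers.
opener = urllib.request.build_opener(handler)
```

Requests sent with opener.open(url) would then carry the default User-Agent unless the request already sets one, and no global state is changed because install_opener() is not called.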
1.5 Using a Proxy
We've already introduced opener and Handler, so let's look at some concrete examples. The first example sets up a proxy for HTTP requests. Some sites limit the request frequency: if we request the site too often, the site blocks our IP address. In that case we need a proxy to get around the restriction.
import urllib.request
url = "http://tieba.baidu.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
proxy_handler = urllib.request.ProxyHandler({
    'http': 'web-proxy.oa.com:8080',
    'https': 'web-proxy.oa.com:8080'
})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # urlopen() now goes through the proxy
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
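Because install_opener() changes the behaviour of every later urlopen() call, it can be preferable to keep the proxy local to one opener. The sketch below shows one way to do that; build_proxy_opener is a hypothetical helper name, and the proxy address is the one from the example above.

```python
import urllib.request


def build_proxy_opener(proxy_address):
    """Return an opener that routes HTTP and HTTPS traffic through one proxy.

    build_proxy_opener is a hypothetical helper name.
    """
    proxy_handler = urllib.request.ProxyHandler({
        'http': proxy_address,
        'https': proxy_address,
    })
    return urllib.request.build_opener(proxy_handler)


opener = build_proxy_opener('web-proxy.oa.com:8080')
# opener.open(url) uses the proxy; urllib.request.urlopen() is unaffected
# because install_opener() was not called.
```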
1.6 Authentication Login
Some websites require you to log in with an account and password before you can continue browsing. For such a site, we need to authenticate. First, instantiate an HTTPPasswordMgrWithDefaultRealm() object to manage the account and password; then use its add_password() method to add the account and password; then pass it to HTTPBasicAuthHandler() to get a handler; then use build_opener() to get an opener object; finally, use the opener's open() method to send the request.
The second example is a request to log in to Baidu Tieba with an account number and password. The code is as follows:
import urllib.request
url = "http://tieba.baidu.com/"
user = 'user'
password = 'password'
pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
pwdmgr.add_password(None, url, user, password)
auth_handler = urllib.request.HTTPBasicAuthHandler(pwdmgr)
opener = urllib.request.build_opener(auth_handler)
response = opener.open(url)
print(response.read().decode('utf-8'))
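The role of the password manager can be seen in isolation, without sending a request. This sketch uses the same placeholder credentials as above and calls find_user_password(), the lookup method the password manager exposes, to show which URLs the stored credentials apply to.

```python
import urllib.request

url = "http://tieba.baidu.com/"
pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# realm=None registers the credentials as the default for any realm under url.
pwdmgr.add_password(None, url, 'user', 'password')

# The lookup succeeds for any realm name and any path below the registered URL.
print(pwdmgr.find_user_password('some realm', url + 'f/index'))
# ('user', 'password')

# A different host has no credentials registered.
print(pwdmgr.find_user_password('some realm', 'http://example.com/'))
# (None, None)
```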
1.7 Setting Cookies
If the requested page requires authentication every time, we can use Cookies to log in automatically and avoid repeating the login. To get Cookies, first instantiate a cookie object with http.cookiejar.CookieJar(). Then use urllib.request.HTTPCookieProcessor to construct a handler object. Finally, use the opener's open() method.
The third example obtains the Cookies returned by Baidu Tieba and saves them to a file. The code is as follows:
import http.cookiejar
import urllib.request
url = "http://tieba.baidu.com/"
fileName = 'cookie.txt'
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)
# the with statement closes the file even if writing fails
with open(fileName, 'a') as f:
    for item in cookie:
        f.write(item.name + " = " + item.value + '\n')
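Writing name = value lines by hand saves the Cookies but does not let us load them back. As an alternative sketch, http.cookiejar also provides MozillaCookieJar, a CookieJar subclass that can save to and load from a cookies.txt style file. The helper name build_cookie_opener is an assumption made for this example.

```python
import http.cookiejar
import os
import urllib.request


def build_cookie_opener(path):
    """Return (opener, jar); the jar reloads cookies saved earlier at path.

    build_cookie_opener is a hypothetical helper name.
    """
    jar = http.cookiejar.MozillaCookieJar(path)
    if os.path.exists(path):
        # Also restore session cookies and cookies that have expired.
        jar.load(ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(jar)
    opener = urllib.request.build_opener(handler)
    return opener, jar
```

After calling opener.open(url), jar.save(ignore_discard=True, ignore_expires=True) writes the collected Cookies to the file so that a later run can pick them up again.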
1.8 HTTPResponse
As the examples above show, both urllib.request.urlopen() and opener.open(url) return an http.client.HTTPResponse object. It has attributes such as msg, version, status, reason, debuglevel, and closed, and methods such as read(), readinto(), getheader(name), getheaders(), and fileno().
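As a short sketch of how some of these attributes and methods are typically used together, the function below collects a few of them into a dictionary. The name summarize and the opener parameter are assumptions for illustration; opener is any callable that, like urlopen(), returns a response usable in a with statement.

```python
import urllib.request


def summarize(url, opener=urllib.request.urlopen):
    """Return the status, reason, content type, and body of a response.

    summarize is a hypothetical helper; by default it calls urlopen().
    """
    with opener(url) as response:
        return {
            'status': response.status,                           # e.g. 200
            'reason': response.reason,                           # e.g. 'OK'
            'content_type': response.getheader('Content-Type'),  # one header
            'body': response.read(),                             # bytes
        }
```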
2 Error Handling
All kinds of exceptions are bound to occur when sending requests. We need to handle them, which makes the program more robust. urllib.error.URLError and urllib.error.HTTPError are the two classes used for exception handling.
URLError
URLError is the base class of the exceptions in urllib.error and can catch the exceptions raised by urllib.request.
It has one attribute, reason, which holds the cause of the error.
Example code to catch URL exceptions:
import urllib.request
import urllib.error
url = "http://www.google.com"
try:
    response = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    print(e.reason)
HTTPError
HTTPError is a subclass of URLError that specifically handles errors in HTTP and HTTPS requests. It has three attributes.
1) code: the status code returned by the HTTP request.
2) reason: as in the parent class, the reason for the error.
3) headers: the response headers returned by the HTTP request.
Example code that catches an HTTP exception and prints the error status code, the error reason, and the server's response headers:
import urllib.request
import urllib.error
url = "http://www.google.com"
try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    # e.code is an int and e.headers is a message object, so convert them to str
    print('code: ' + str(e.code) + '\n')
    print('reason: ' + str(e.reason) + '\n')
    print('headers: ' + str(e.headers) + '\n')
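Since HTTPError is a subclass of URLError, the two are often handled together, with the more specific class first. The sketch below is one possible arrangement; try_fetch, the 5-second timeout, and the opener parameter (any callable with urlopen()'s signature) are assumptions made for this example.

```python
import urllib.error
import urllib.request


def try_fetch(url, opener=urllib.request.urlopen):
    """Return (body, error_message); exactly one of the two is None.

    try_fetch is a hypothetical helper name.
    """
    try:
        with opener(url, timeout=5) as response:
            return response.read(), None
    except urllib.error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return None, 'HTTP %d: %s' % (e.code, e.reason)
    except urllib.error.URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return None, 'URL error: %s' % e.reason
```

If the two except clauses were swapped, the URLError branch would also catch every HTTPError and the status code would never be reported.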
This article was originally published on the WeChat public account “Geek Monkey”. Follow it to get new original articles as soon as they are published.