A Roundup of Python 3 urllib

This article is the first in a series of crawler articles and covers the urllib library in Python 3. urllib is the Python standard library for network requests. It has four modules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser. The urllib.request and urllib.error modules are the ones used most often in crawlers, so let’s get straight to the point and look at how to use these two.

1 Initiating a Request

To simulate an HTTP request from a browser, we need the urllib.request module. urllib.request does more than send a request; it also returns the result. The urlopen() function alone is enough to send a request. Let’s take a look at the urlopen() API:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

The url parameter is the address to request, as a string.
The data parameter is bytes content; a string can be converted to a byte stream with the bytes() function. It is optional. When data is given, the request method changes to POST, as in a form submission. The standard format is application/x-www-form-urlencoded.
The timeout parameter sets the request timeout, in seconds.
The cafile and capath parameters are the path of a CA certificate file and the path of a directory of CA certificates. They are used with HTTPS requests.
The context parameter must be of type ssl.SSLContext and specifies the SSL settings.
The cadefault parameter is deprecated and can be ignored.
A urllib.request.Request object can also be passed in place of the string address.
The function returns an http.client.HTTPResponse object.

1.1 Simply Fetching a Web Page

We use urllib.request.urlopen() to request Baidu Tieba and get the source code of its page.

import urllib.request

url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url)
html = response.read()  # raw bytes of the page source
print(html.decode('utf-8'))  # decode the bytes as UTF-8

1.2 Setting a Request Timeout

Some requests may never get a response because of network problems, so we can set a timeout manually. When a request times out, we can decide what to do next, such as discarding the request or sending it again.

import urllib.request

url = "http://tieba.baidu.com"
response = urllib.request.urlopen(url, timeout=1)
print(response.read().decode('utf-8'))
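The "request it again" option mentioned above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function name fetch_with_retry and its retries, delay, and open_func parameters are assumptions made for illustration (open_func exists only so the function can be exercised without a network). It relies on the fact that a timeout surfaces either as socket.timeout or as a URLError whose reason is a socket.timeout.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, timeout=1.0, retries=3, delay=0.0,
                     open_func=urllib.request.urlopen):
    """Return the body as bytes, or None if every attempt timed out.

    fetch_with_retry, retries, delay and open_func are hypothetical
    names used for this sketch only.
    """
    for attempt in range(1, retries + 1):
        try:
            with open_func(url, timeout=timeout) as response:
                return response.read()
        except socket.timeout:
            # The timeout was raised directly (for example while reading).
            pass
        except urllib.error.URLError as e:
            # A timeout while connecting is wrapped in URLError;
            # anything else is not a timeout, so re-raise it.
            if not isinstance(e.reason, socket.timeout):
                raise
        if attempt < retries:
            time.sleep(delay)  # wait before the next attempt
    return None
```

Calling fetch_with_retry("http://tieba.baidu.com", timeout=1) would try up to three times and return None if all three attempts time out, which leaves the decision to discard the request to the caller.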

1.3 Submitting Data with the data Parameter

Some web pages expect the request to carry data. In that case we use the data parameter.

import urllib.parse
import urllib.request

url = "http://127.0.0.1:8000/book"
params = {'name': 'six floating', 'author': 'author'}

# urlencode() turns the dict into a string; bytes() turns the string into bytes
data = bytes(urllib.parse.urlencode(params), encoding='utf8')
response = urllib.request.urlopen(url, data=data)
print(response.read().decode('utf-8'))

params must be converted into a byte stream. Because params is a dictionary, we first use urllib.parse.urlencode() to convert it to a string, and then bytes() to convert the string to a byte stream. Finally, urlopen() sends the request, which simulates submitting form data with POST.
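To see what each conversion step produces without sending anything, the intermediate values can be printed. This is a small sketch; the field values are placeholders chosen for illustration, and the Request object is only built, never sent.

```python
import urllib.parse
import urllib.request

# Placeholder form fields, chosen for illustration only
params = {'name': 'demo book', 'author': 'demo author'}

query = urllib.parse.urlencode(params)  # dict -> str
data = query.encode('utf-8')            # str -> bytes, same result as bytes(query, encoding='utf8')

# Building a Request does not send it; it only lets us inspect the method
request = urllib.request.Request("http://127.0.0.1:8000/book", data=data)

print(query)                 # name=demo+book&author=demo+author
print(type(data).__name__)   # bytes
print(request.get_method())  # POST, because data was supplied
```

Spaces are encoded as + and the pairs are joined with &, which is the application/x-www-form-urlencoded format.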

1.4 Using Request

As shown above, the urlopen() function can send a simple request. However, its few parameters are not enough to build a complete request. If you need to add headers, specify the request method, and so on, you can use the more powerful Request class to build the request. As usual, let’s look at the Request constructor first:

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

The url parameter is the request link. It is mandatory; the other parameters are optional.
The data parameter is the same as the data argument of urlopen().
The headers parameter specifies the header information of the HTTP request. It is a dictionary. Besides passing headers to the Request constructor, a request header can also be added by calling the add_header() method of the Request instance.
The origin_req_host parameter is the host name or IP address of the requester.
The unverifiable parameter indicates whether the request is unverifiable. The default is False. An unverifiable request is one whose URL the user had no chance to approve. For example, if the request is for an image embedded in an HTML document and the user had no option to approve automatically fetching the image, unverifiable should be True.
The method parameter is the HTTP request method, such as GET, POST, DELETE, or PUT.
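The parameters above can be tried out without any network access, because constructing a Request does not send it. In this sketch the endpoint, the User-Agent value, and the payload are placeholders chosen for illustration.

```python
import urllib.request

request = urllib.request.Request(
    url="http://127.0.0.1:8000/book",           # placeholder endpoint
    data=b"name=demo",                          # placeholder payload
    headers={"User-Agent": "demo-client/0.1"},  # placeholder User-Agent
    method="PUT",
)
# Headers can also be added after construction
request.add_header("Accept", "application/json")

print(request.get_method())    # PUT: method overrides the POST implied by data
print(request.full_url)
print(request.header_items())  # all headers as (name, value) pairs
```

Request stores header names in capitalized form, so the User-Agent header is looked up as request.get_header("User-agent").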

1.4.1 Simply Using Request

Here we use Request to make an HTTP request that masquerades as a browser. If User-Agent is not set in headers, the default User-Agent is Python-urllib/3.5. Some sites may block such requests, so the request needs to be disguised as coming from a browser. The User-Agent I use is Chrome’s.

import urllib.request

url = "http://tieba.baidu.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}

request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

1.4.2 Advanced Usage of Request

If we need to add a proxy to a request or handle its Cookies, we need to use Handler and OpenerDirector.

1) Handler. A Handler takes care of one concern of a request (HTTP, HTTPS, FTP, and so on). Every handler is an implementation of the class urllib.request.BaseHandler, which is the base class of all handlers and provides the basic handler methods such as default_open(), protocol_request(), and so on. Many classes work as or with handlers, but I’ll just list a few of the more common ones:

ProxyHandler: Sets up the proxy for the request
HTTPCookieProcessor: Processes Cookies in HTTP requests
HTTPDefaultErrorHandler: Handles HTTP error responses.
HTTPRedirectHandler: Handles HTTP redirection.
HTTPPasswordMgr: Manages passwords; it maintains a table of user names and passwords. (It is a helper class used by the authentication handlers rather than a subclass of BaseHandler.)
HTTPBasicAuthHandler: Used for login authentication, generally in combination with HTTPPasswordMgr.

2) OpenerDirector. We can call an OpenerDirector an opener for short. The urlopen() function we used earlier is actually an opener provided by urllib. What does an opener have to do with a Handler? An opener object is created by the build_opener(handler) function. To make urlopen() use a custom opener, call install_opener(opener). Note that install_opener sets a global OpenerDirector object, so it affects every later urlopen() call.
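The relationship between the two can be illustrated with a small custom handler. This is a sketch under stated assumptions: the class name DefaultHeaderHandler and the X-Demo header are made up for illustration. The method name http_request follows the protocol_request pattern mentioned above, where the protocol part is replaced by http or https; the opener calls it to pre-process each outgoing request.

```python
import urllib.request


class DefaultHeaderHandler(urllib.request.BaseHandler):
    """Hypothetical handler that adds one header to every HTTP(S) request."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def http_request(self, request):
        # Called by the opener before an http:// request is sent.
        # Request stores header names in capitalized form.
        if not request.has_header(self.name.capitalize()):
            request.add_header(self.name, self.value)
        return request

    # Apply the same pre-processing to https:// requests
    https_request = http_request


handler = DefaultHeaderHandler("X-Demo", "1")

# build_opener() chains our handler with the default handlers
opener = urllib.request.build_opener(handler)

# Option 1: use the opener directly, for example opener.open(url)
# Option 2: install it globally so that urllib.request.urlopen() uses it:
#     urllib.request.install_opener(opener)
```

Using opener.open() keeps the customization local to that opener, while install_opener() changes the behavior of urlopen() for the whole process.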

1.5 Using a Proxy

We’ve already looked at openers and handlers, so let’s go through some examples. The first example sets up a proxy for HTTP requests. Some sites limit request frequency. If we request such a site too often, it will block our IP address. So we need to use a proxy to get around the restriction.

import urllib.request

url = "http://tieba.baidu.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}

proxy_handler = urllib.request.ProxyHandler({
    'http': 'web-proxy.oa.com:8080',
    'https': 'web-proxy.oa.com:8080'
})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # urlopen() now goes through the proxy

request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

1.6 Authentication Login

Some websites require you to log in with an account and password before you can continue browsing. For such sites we need to authenticate. First, instantiate a password manager with HTTPPasswordMgrWithDefaultRealm(); then use the add_password() function to add the account and password; then use HTTPBasicAuthHandler() to get the handler; then use build_opener() to get the opener object; finally, use the opener’s open() function to send the request.

The second example is a request that logs in to Baidu Tieba with an account and password. The code is as follows:

import urllib.request

url = "http://tieba.baidu.com/"
user = 'user'
password = 'password'

pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
pwdmgr.add_password(None, url, user, password)  # None means any realm

auth_handler = urllib.request.HTTPBasicAuthHandler(pwdmgr)
opener = urllib.request.build_opener(auth_handler)
response = opener.open(url)
print(response.read().decode('utf-8'))
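How the password manager decides which credentials apply to a URL can be checked offline with its find_user_password() method. This is only an illustrative sketch; the example.invalid host and the credentials are placeholders.

```python
import urllib.request

pwdmgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Realm None registers the credentials as the default for any realm
pwdmgr.add_password(None, "http://example.invalid/", "user", "password")

# A URL below the registered one matches, whatever realm the server reports
print(pwdmgr.find_user_password("Some Realm", "http://example.invalid/page"))

# A different host does not match
print(pwdmgr.find_user_password("Some Realm", "http://other.invalid/"))
```

The first lookup returns the stored user name and password as a tuple, and the second returns (None, None), so the handler would send no credentials to the other host.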

1.7 Setting Cookies

If the requested page requires authentication every time, we can use Cookies to log in automatically and avoid repeated login authentication. To get Cookies, first instantiate a cookie object with http.cookiejar.CookieJar(), then construct a handler object with urllib.request.HTTPCookieProcessor, and finally use the opener’s open() function.

The third example obtains the Cookies returned by Baidu Tieba and saves them in a file. The code is as follows:

import http.cookiejar
import urllib.request

url = "http://tieba.baidu.com/"
fileName = 'cookie.txt'

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open(url)

f = open(fileName,'a')
for item in cookie:
    f.write(item.name+" = "+item.value+'\n')
f.close()
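The example above writes the name/value pairs by hand. A variation, swapped in here purely as an illustration, is to use http.cookiejar.MozillaCookieJar, a CookieJar subclass that can save itself to a file and load the file again later. The helper name build_cookie_opener is an assumption made for this sketch.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener(filename):
    """Return (jar, opener); build_cookie_opener is a hypothetical helper."""
    jar = http.cookiejar.MozillaCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(jar)
    opener = urllib.request.build_opener(handler)
    return jar, opener


# Typical use (needs network access, so it is left commented out):
# jar, opener = build_cookie_opener('cookie.txt')
# opener.open("http://tieba.baidu.com/")
# jar.save(ignore_discard=True, ignore_expires=True)   # write cookies to the file
#
# Later, to reuse the saved cookies:
# jar, opener = build_cookie_opener('cookie.txt')
# jar.load(ignore_discard=True, ignore_expires=True)
```

Passing ignore_discard=True and ignore_expires=True keeps session cookies and expired cookies in the file, which is usually what a crawler wants when replaying a login.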

1.8 HTTPResponse

As the examples above show, both urllib.request.urlopen() and opener.open(url) return an http.client.HTTPResponse object. It has properties such as msg, version, status, reason, debuglevel, and closed, and methods such as read(), readinto(), getheader(name), getheaders(), and fileno().
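A few of these properties and methods can be collected in one place as follows. This is a minimal sketch; the function name summarize and the choice of fields are assumptions made for illustration.

```python
import urllib.request


def summarize(response):
    """Collect a few commonly used fields from an HTTPResponse object.

    summarize is a hypothetical helper; it reads the whole body.
    """
    return {
        "status": response.status,                           # e.g. 200
        "reason": response.reason,                           # e.g. "OK"
        "content_type": response.getheader("Content-Type"),  # one header
        "body_length": len(response.read()),                 # size in bytes
    }


# Typical use (needs network access, so it is left commented out):
# with urllib.request.urlopen("http://tieba.baidu.com/") as response:
#     print(summarize(response))
```

Because read() consumes the body, the response cannot be read again after summarize() returns.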

2 Error Handling

It is inevitable that all kinds of exceptions will occur when sending requests. We need to handle them, which makes the program more user-friendly. urllib.error.URLError and urllib.error.HTTPError are the two classes used for exception handling.

URLError

URLError is the base class of the exceptions in urllib.error and can catch the exceptions raised by urllib.request.

It has one property, reason, which gives the cause of the error.

Example code to catch URL exceptions:

import urllib.request
import urllib.error

url = "http://www.google.com"
try:
    response = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    print(e.reason)

HTTPError

HTTPError is a subclass of URLError that specializes in handling errors in HTTP and HTTPS requests. It has three properties.

1) code: the status code returned by the HTTP request.

2) reason: as in the superclass, the reason for the error.

3) headers: the response headers returned by the HTTP request.

Example code that catches an HTTP exception and prints the error status code, the error reason, and the server’s response headers:

import urllib.request
import urllib.error

url = "http://www.google.com"
try:
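    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    # code is an int and headers is a message object, so convert with str()
    print('code: ' + str(e.code) + '\n')
    print('reason: ' + str(e.reason) + '\n')
    print('headers: ' + str(e.headers) + '\n')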
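Since HTTPError is a subclass of URLError, the two are often caught together, with the more specific one first. The sketch below shows one way to do that; the function name try_open and its open_func parameter are assumptions made for illustration (open_func exists only so the function can be exercised without a network).

```python
import urllib.error
import urllib.request


def try_open(url, open_func=urllib.request.urlopen):
    """Return a short description of what happened when opening url.

    try_open and open_func are hypothetical names for this sketch.
    """
    try:
        with open_func(url) as response:
            return "ok: status %d" % response.status
    except urllib.error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        return "http error: %d %s" % (e.code, e.reason)
    except urllib.error.URLError as e:
        # No HTTP response at all, e.g. DNS failure or refused connection
        return "url error: %s" % e.reason
```

If the two except clauses were swapped, the URLError clause would catch HTTP errors too and the status code would never be reported.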

This article was originally published on the WeChat official account “Geek Monkey”. Follow it to get new original posts as soon as they are published.