Preface
For more, please visit my personal blog.
Python's HTTP libraries include urllib, urllib2, urllib3, httplib, httplib2, and requests.
- The built-in urllib module
  - Advantages: built-in module, no third-party library to download
  - Disadvantages: cumbersome to use, lacks advanced features
- The third-party requests library
  - Advantages: handling URL resources is particularly convenient
  - Disadvantages: needs to be downloaded and installed as a third-party library
The built-in urllib module
Making a GET request
The urlopen() method is used to initiate the request, as follows:
from urllib import request
resp = request.urlopen('http://www.baidu.com')
print(resp.read().decode())
The result is an http.client.HTTPResponse object whose read() method returns the data from the web page. Note, however, that the returned data is in binary bytes format, so decode() is needed to convert it to a string.
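Besides read(), the response object also exposes the status code and headers. A minimal sketch, reusing the Baidu URL above:
from urllib import request
resp = request.urlopen('http://www.baidu.com')
print(resp.status)        # HTTP status code, e.g. 200
print(resp.getheaders())  # list of (name, value) header tuples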
Making a POST request
urlopen() makes a GET request by default; when the data argument is passed, it makes a POST request instead. Note: the data passed must be in bytes format.
The timeout parameter sets a timeout in seconds; if the request takes longer than that, an exception is raised. As follows:
from urllib import request
resp = request.urlopen('http://www.baidu.com', data=b'word=hello', timeout=10)
print(resp.read().decode())
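If the timeout is exceeded, the exception can be caught. A minimal sketch, deliberately using a very short timeout to trigger it:
import socket
from urllib import request, error

try:
    # A deliberately tiny timeout so the request fails quickly
    resp = request.urlopen('http://www.baidu.com', timeout=0.01)
except error.URLError as e:
    print('request failed:', e.reason)
except socket.timeout:
    print('request timed out')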
Add Headers
The default header sent by urllib is 'User-Agent': 'Python-urllib/3.6', which reveals that the request was made by urllib. For sites that validate the User-Agent, we therefore need to customize the headers, which requires the Request object in urllib.request.
from urllib import request
url = 'http://httpbin.org/get'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}
# Generate a Request object from the url and headers, then pass it to the urlopen method
req = request.Request(url, headers=headers)
resp = request.urlopen(req)
print(resp.read().decode())
The Request object
As shown above, urlopen() accepts not only a URL string but also a Request object, which extends its functionality. The Request constructor looks like this:
class urllib.request.Request(url, data=None, headers={},
origin_req_host=None,
unverifiable=False,
method=None)
To construct a Request object, the url parameter is required; data and headers are optional.
Finally, the Request constructor also accepts a method argument to select the request method, such as PUT, DELETE, and so on. The default is GET.
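For example, a minimal sketch of a PUT request using the method argument (httpbin.org is used here only as an assumed test endpoint):
from urllib import request

# Build a Request with an explicit method and bytes data
req = request.Request('http://httpbin.org/put', data=b'word=hello', method='PUT')
resp = request.urlopen(req)
print(resp.read().decode())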
Add a Cookie
In order to carry cookie information with the request, we need to construct an opener.
We use the request.build_opener method to construct the opener, attach the cookie handler to it, and then use the opener's open method to make the request. As follows:
from http import cookiejar
from urllib import request
url = 'https://www.baidu.com'
# Create a CookieJar object
cookie = cookiejar.CookieJar()
# Create a cookie handler using HTTPCookieProcessor
cookies = request.HTTPCookieProcessor(cookie)
# Create the opener object with the handler as an argument
opener = request.build_opener(cookies)
# Use this opener to make the request
resp = opener.open(url)
# Inspect the cookie object above to see the cookies obtained from Baidu
for i in cookie:
    print(i)
Alternatively, the generated opener can be set as the global opener with the install_opener method.
The cookie will then be attached to every subsequent request made with the urlopen method.
# Set this opener as the global opener
request.install_opener(opener)
resp = request.urlopen(url)
Setting a Proxy
When crawling data, we often need to use a proxy to hide our real IP address. As follows:
from urllib import request
url = 'http://www.baidu.com'
proxy = {'http': '222.222.222.222:80', 'https': '222.222.222.222:80'}
# Create a proxy handler
proxies = request.ProxyHandler(proxy)
# Create the opener object
opener = request.build_opener(proxies)
resp = opener.open(url)
print(resp.read().decode())
Download data locally
When making network requests, we often need to save data such as images or audio locally. One way to do this is to use Python's file operations to write the data returned by read() to a file.
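A minimal sketch of that manual approach, writing the response bytes to a local file:
from urllib import request

resp = request.urlopen('http://python.org/')
# read() returns bytes, so open the file in binary write mode
with open('python.html', 'wb') as f:
    f.write(resp.read())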
urllib also provides an urlretrieve() method that saves the requested data directly to a file. As follows:
from urllib import request
url = 'http://python.org/'
request.urlretrieve(url, 'python.html')
The second argument to urlretrieve() is the path, including the file name, where the data is saved.
Note: the urlretrieve() method was ported directly from Python 2 and may be deprecated in a future version.
The third-party requests library
Installation
Since Requests is a third-party library, install it first, as follows:
pip install requests
Making a GET request
Use the get method directly as follows:
import requests
r = requests.get('http://www.baidu.com/')
print(r.status_code) # status code
print(r.text) # content
For URLs with parameters, pass a dict as the params argument, as follows:
import requests
r = requests.get('http://www.baidu.com/', params={'q': 'python', 'cat': '1001'})
print(r.url)  # The actual URL requested
print(r.text)
An additional convenience of requests is that specific types of responses, such as JSON, can be parsed directly, as follows:
r = requests.get('https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%20%3D%202151330&format=json')
r.json()
# {'query': {'count': 1, 'created': '2017-11-17T07:14:12Z', ...
Add Headers
When we need to pass HTTP headers, we pass a dict as the headers argument, as follows:
r = requests.get('https://www.baidu.com/', headers={'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'})
Get the response headers as follows:
r.headers
# {'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Content-Encoding': 'gzip', ... }
r.headers['Content-Type']
# 'text/html; charset=utf-8'
Making a POST request
To send a POST request, simply change get() to post() and pass in the data argument as the POST data, as follows:
r = requests.post('https://accounts.baidu.com/login', data={'form_email': '[email protected]', 'form_password': '123456'})
requests POSTs data as application/x-www-form-urlencoded by default. If you want to send JSON data, pass the json argument instead, as follows:
params = {'key': 'value'}
r = requests.post(url, json=params) # Internal automatic serialization to JSON
Upload a file
Uploading files requires a more complex encoding format, but requests simplifies it into the files argument, as follows:
upload_files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=upload_files)
When opening the file, be sure to use 'rb' (binary mode) so that the bytes read match the actual length of the file.
Replace the post() method with put(), delete(), and so on, to send PUT or DELETE requests.
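For example, a minimal sketch using httpbin.org as an assumed test endpoint:
import requests

r = requests.put('http://httpbin.org/put', data={'key': 'value'})
r = requests.delete('http://httpbin.org/delete')
print(r.status_code)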
Add a Cookie
To pass cookies in a request, simply prepare a dict and pass it as the cookies argument, as follows:
cs = {'token': '12345', 'status': 'working'}
r = requests.get(url, cookies=cs)
requests handles cookies so that we can easily retrieve a specific cookie from the response without parsing it ourselves, as follows:
r.cookies['token']
# 12345
Specify the timeout
To specify a timeout, pass the timeout argument in seconds. The timeout can be split into a connect timeout and a read timeout, as follows:
try:
    # Connect timeout of 3.1 seconds, read timeout of 27 seconds
    r = requests.get(url, timeout=(3.1, 27))
except requests.exceptions.RequestException as e:
    print(e)
Timeout reconnection
def gethtml(url):
    # Retry up to three times if the request times out or otherwise fails
    i = 0
    while i < 3:
        try:
            html = requests.get(url, timeout=5).text
            return html
        except requests.exceptions.RequestException:
            i += 1
Add a proxy
As with headers, the proxy is passed as a dict, this time via the proxies argument, as follows:
heads = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'
}
proxy = {
    'http': 'http://120.25.253.234:812',
    'https': 'https://163.125.222.244:8123'
}
r = requests.get('https://www.baidu.com/', headers=heads, proxies=proxy)
For more programming tutorials, please follow the public account: Pan Gao accompanies you in learning programming.