Life is short, I use Python.
Previous articles in this series:
Learning Python Crawlers (1): The Beginning
Learning Python Crawlers (2): Pre-preparation (1) Basic Library Installation
Learning Python Crawlers (3): Pre-preparation (2) Linux Basics
Learning Python Crawlers (4): Pre-preparation (3) Docker Basics
Learning Python Crawlers (5): Pre-preparation (4) Database Basics
Learning Python Crawlers (6): Pre-preparation (5) Crawler Framework Installation
Learning Python Crawlers (7): HTTP Basics
Learning Python Crawlers (8): Web Basics
Learning Python Crawlers (9): Crawler Basics
Learning Python Crawlers (10): Sessions and Cookies
Learning Python Crawlers (11): Urllib
Learning Python Crawlers (12): Urllib
Learning Python Crawlers (13): Urllib
Learning Python Crawlers (14): Urllib
Learning Python Crawlers (15): Urllib
Learning Python Crawlers (16): Urllib
Introduction
In the earlier preparation articles we installed quite a few third-party libraries, such as Requests, aiohttp, and so on. If that doesn't ring a bell, you can look back over the previous articles.
We also got a rough feel for the basic usage of urllib. It does have quite a few inconvenient spots: handling Cookies or accessing a site through a proxy, for example, requires working with Opener and Handler objects.
This is where the more powerful Requests library comes in. With Requests, these higher-level operations become much easier.
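To give you a quick taste: in Requests, a proxy is just a keyword argument, and a Session object carries Cookies between requests automatically. Below is a minimal sketch against httpbin (the proxy address is only a placeholder, so that line is commented out; substitute a real proxy if you want to try it):
import requests

# A Session keeps Cookies across requests automatically, no CookieJar plumbing needed
session = requests.Session()
session.get('https://httpbin.org/cookies/set/number/123456')
r = session.get('https://httpbin.org/cookies')
print(r.text)  # the cookie set above is sent back automatically

# A proxy is just a dict passed as an argument, no Handler or Opener required
proxies = {'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}  # placeholder address
# r = requests.get('https://httpbin.org/get', proxies=proxies)
This is only to show how much of urllib's boilerplate disappears.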
Introduction to Requests
First, the official links:
- GitHub: https://github.com/requests/requests
- Official documentation: http://www.python-requests.org
- Chinese documentation: http://docs.python-requests.org/zh_CN/latest
The reason for listing the official links here is the hope that you will get into the habit of consulting official documentation. After all, the author is only human and makes mistakes; by comparison, the error rate of official documentation is very low, and sometimes even thorny problems can be solved through it.
We've already covered the basics with urllib, so let's skip the chatter and get straight to the real business: writing code.
Here we use the same test address as previously mentioned: https://httpbin.org/.
A GET request
GET is the request type we use most often, so let's take a look at how to send a GET request using Requests. The code is as follows:
import requests
r = requests.get('https://httpbin.org/get')
print(r.text)
The results are as follows:
{ "args": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "Python - requests / 2.22.0"}, "origin" : "116.234.254.11, 116.234.254.11", "url" : "https://httpbin.org/get"}Copy the code
There is nothing new to explain here; it is the same result we got with urllib.
If we want to add request parameters to a GET request, how do we add them?
import requests
params = {
    'name': 'geekdigging',
    'age': '18'
}
r1 = requests.get('https://httpbin.org/get', params=params)
print(r1.text)
The results are as follows:
{ "args": { "age": "18", "name": "geekdigging" }, "headers": { "Accept": "*/*", "Accept-Encoding": "Gzip, deflate", "Host": "httpbin.org"," user-agent ": "python-requests/2.22.0"}, "origin": "116.234.254.11, 116.234.254.11", "url": "https://httpbin.org/get?name=geekdigging&age=18"}Copy the code
As you can see, the requested link is automatically constructed as: https://httpbin.org/get?name=geekdigging&age=18.
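If you want to double-check the constructed link, the response's url property holds it (same request as above):
print(r1.url)
# https://httpbin.org/get?name=geekdigging&age=18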
It is important to note that r1.text returns data of type str, but the content is actually JSON. If you want to parse that JSON into a dictionary we can use directly, the following methods work:
print(type(r1.text))
print(r1.json())
print(type(r1.json()))
The results are as follows:
<class 'str'>
{'args': {'age': '18', 'name': 'geekdigging'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0'}, 'origin': '116.234.254.11, 116.234.254.11', 'url': 'https://httpbin.org/get?name=geekdigging&age=18'}
<class 'dict'>
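One caveat: if the response body is not actually valid JSON, calling json() raises an exception (a ValueError in this version of Requests). A small defensive sketch, using an httpbin endpoint that returns HTML instead of JSON:
import requests

r = requests.get('https://httpbin.org/html')  # this endpoint returns HTML, not JSON
try:
    data = r.json()
except ValueError:
    data = None
    print('The response body is not JSON')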
Add a request header:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'referer': 'https://www.geekdigging.com/'
}
r2 = requests.get('https://httpbin.org/get', headers=headers)
print(r2.text)
The results are as follows:
{ "args": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "Referer": "Https://www.geekdigging.com/", "the user-agent: Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"}, "Origin ": "116.234.254.11 116.234.254.11,", "url" : "https://httpbin.org/get"}Copy the code
As with urllib.request, we pass the headers argument.
What if we want to grab an image, a video, or some other binary file?
These files are essentially binary data; it is only because each has a specific storage format and a corresponding way of being parsed that we can view them as multimedia. So if we want to grab them, we have to fetch their binary content.
For example, let's grab the Baidu logo, whose address is: https://www.baidu.com/img/superlogo_c4d7df0a003d3db9b65e9ef0fe6da1ec.png
import requests
r3 = requests.get("https://www.baidu.com/img/superlogo_c4d7df0a003d3db9b65e9ef0fe6da1ec.png")
with open('baidu_logo.png', 'wb') as f:
    f.write(r3.content)
I won't show the result here; the image downloads as expected.
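A side note: for a small logo, reading the whole body through content is fine, but for large files such as videos this loads everything into memory at once. Requests also supports streaming downloads; a minimal sketch (the output file name is arbitrary):
import requests

r3 = requests.get(
    'https://www.baidu.com/img/superlogo_c4d7df0a003d3db9b65e9ef0fe6da1ec.png',
    stream=True)  # the body is not read until we iterate over it
with open('baidu_logo_stream.png', 'wb') as f:
    for chunk in r3.iter_content(chunk_size=8192):  # read 8 KB at a time
        f.write(chunk)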
A POST request
Let’s move on to a very common POST request. As with the GET request above, we still test using: https://httpbin.org/post. Example code is as follows:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'referer': 'https://www.geekdigging.com/'
}
params = {
    'name': 'geekdigging',
    'age': '18'
}
r = requests.post('https://httpbin.org/post', data=params, headers=headers)
print(r.text)
The results are as follows:
{ "args": {}, "data": "", "files": {}, "form": { "age": "18", "name": "geekdigging" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "23", "Content-Type": "application/x-www-form-urlencoded", "Host": "httpbin.org", "Referer": "https://www.geekdigging.com/", "User-Agent": "Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"}, "json": null, "origin": "116.234.254.11 116.234.254.11,", "url" : "https://httpbin.org/post"}Copy the code
We added the request header and parameters to the POST request.
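Besides form data, Requests can also POST a JSON body directly through the json parameter, which serializes the dict and sets the Content-Type to application/json for us. A quick sketch against the same test address:
import requests

r = requests.post('https://httpbin.org/post', json={'name': 'geekdigging', 'age': '18'})
print(r.json()['json'])  # httpbin echoes the parsed JSON body back under the "json" key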
The Response
Above we used the text property and the json() method to read the response body, but a Response has many other properties and methods for getting at other information.
Let's visit the Baidu homepage to demonstrate:
import requests
r = requests.get('https://www.baidu.com')
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)
The results are as follows:
<class 'int'> 200
<class 'requests.structures.CaseInsensitiveDict'> {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Thu, 05 Dec 2019 13:24:11 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:23:55 GMT', 'Pragma': 'no-cache', 'Server': 'BFE/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
<class 'str'> https://www.baidu.com/
<class 'list'> []
Here status_code gives the status code, headers the response headers, cookies the Cookies, url the URL, and history the request history.
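In a real crawler, the status code is usually the first thing to check. Requests ships a built-in status code lookup object and a raise_for_status() helper; a small sketch:
import requests

r = requests.get('https://www.baidu.com')
if r.status_code == requests.codes.ok:  # equivalent to r.status_code == 200
    print('Request succeeded')
r.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses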
Sample code
All of the code in this series is available on GitHub and Gitee.
Example code -Github
Example code -Gitee