Now that you know about crawlers and network requests, it's time to take a formal look at Python's crawler-related modules.
A lot of crawler books start with the urllib module, and once you have worked through it, you are told that it is a bit complicated and rarely used in practice. Indeed, urllib is a relatively old module, and the crawler methods it encapsulates are relatively complex, so you can start directly with the requests module.
The requests module mimics a browser sending requests. It is a network request module built on Python's native networking, and it is not only powerful but also quite easy to use!
Module installation
Install it directly with pip:
pip install requests
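Once installed, a quick way to confirm the module is importable (the printed version is simply whatever happens to be installed on your machine):
import requests
print(requests.__version__)  # e.g. 2.22.0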
Using it is also very simple, in three steps:
- Specify the URL, that is, the address of the website you want to crawl
- Send the request, which can be GET, POST, PUT, DELETE, and so on
- Get response data
Doesn't that look easy? Without further ado, let's take a simple example: crawling Baidu's home page.
A GET request
# import the module
import requests

url = 'https://www.baidu.com'
# send a GET request
response = requests.get(url)
# get the source code of the web page
baidu_html = response.text
# persist (save) the retrieved data
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(baidu_html)
This produces a baidu.html file in the same directory.
This is a very basic example of using requests to send a GET request. You can send other requests such as POST, PUT, HEAD, and so on in the same way:
requests.post(url)
requests.delete(url)
requests.put(url)
...
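As a small sketch of one of these other methods: a HEAD request returns only the response headers, with no body, which makes it a cheap way to check a resource. The httpbin.org URL here is just a test endpoint assumed for illustration:
import requests

# HEAD returns headers only, no response body
response = requests.head('http://httpbin.org/get')
print(response.status_code)              # 200
print(response.headers['Content-Type'])  # the headers are still available
print(repr(response.text))               # '' - HEAD responses carry no body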
Most of the time, a request carries some parameters. For example, when we do a search, the GET request we send carries the search content.
For example, when I search for Python on Baidu, the URL is https://www.baidu.com/s?wd=python
To make the crawler more flexible, the search content needs to be separated out. You can build a dictionary of parameters and send it with the GET request:
import requests

# enter your search keyword
wd = input('Please enter what to search for: \n')
url = 'https://www.baidu.com/s?'
# build the search parameters for the GET request
url_param = {'wd': wd}
# add a UA request header to keep the crawler from being intercepted
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
}
# send the GET request
response = requests.get(url, params=url_param, headers=header)
# get the source code of the web page
wd_html = response.text
# write to a file (persistence)
with open('wd.html', 'w', encoding='utf-8') as f:
    f.write(wd_html)
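A nice property of passing params as a dictionary is that requests URL-encodes the values for you, so even non-ASCII keywords are safe in the query string. Assuming the response object from the code above, you can inspect the final URL that was actually requested:
# the wd value is percent-encoded into the query string automatically
print(response.url)   # e.g. https://www.baidu.com/s?wd=python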
The full signature of get is as follows:
# url: the address to access
# params: the parameters to carry
# **kwargs: other parameters, such as request headers, cookies, proxies, etc.
def get(url, params=None, **kwargs):
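The **kwargs slot is where the optional extras go. Below is a minimal sketch combining a few of them; the httpbin.org/get test endpoint and the UA string are assumptions for illustration, not part of the original example:
import requests

response = requests.get(
    'http://httpbin.org/get',
    params={'q': 'python'},
    headers={'user-agent': 'my-crawler/0.1'},  # an illustrative UA string
    cookies={'session': 'test'},               # cookies sent with the request
    timeout=5,                                 # seconds to wait before giving up
)
print(response.status_code)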
A POST request
We use the httpbin web site for testing. Httpbin is a web site for testing HTTP requests and responses, and it supports the various request methods.
1. To submit the parameters as a form, build a dictionary of request parameters and pass it to the data parameter:
import requests

url = 'http://httpbin.org/post'
params = {'name': 'Official account: Python Geek Column', 'language': 'python'}
response = requests.post(url, data=params)
print(response.json())
Execution Result:
{
    'args': {},
    'data': '',
    'files': {},
    'form': {'language': 'python', 'name': 'Official account: Python Geek Column'},
    'headers': {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate',
        'Content-Length': '99',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Host': 'httpbin.org',
        'User-Agent': 'python-requests/2.22.0',
        'X-Amzn-Trace-Id': 'Root=1-5fef5112-65ad78d4706c58475905fef2'
    },
    'json': None,
    'origin': '',
    'url': 'http://httpbin.org/post'
}
2. To submit the parameters as a JSON string, use json.dumps to convert the dictionary to a string:
import requests
import json

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
}
url = 'http://httpbin.org/post'
params = {'name': 'Tom', 'hobby': ['music', 'game']}
# format the dictionary as a JSON string with json.dumps
response = requests.post(url, json=json.dumps(params), headers=header)
print(response.json())
Execution Result:
{
    'args': {},
    'data': '"{\\"name\\": \\"Tom\\", \\"hobby\\": [\\"music\\", \\"game\\"]}"',
    'files': {},
    'form': {},
    'headers': {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate',
        'Content-Length': '55',
        'Content-Type': 'application/json',
        'Host': 'httpbin.org',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
        'X-Amzn-Trace-Id': 'Root=1-5fef583e-5224201d08e2ff396416e822'
    },
    'json': '{"name": "Tom", "hobby": ["music", "game"]}',
    'origin': '',
    'url': 'http://httpbin.org/post'
}
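Note that because the dictionary is dumped to a string first and then passed to the json parameter, requests JSON-encodes it a second time, which is why 'data' above holds an escaped string rather than an object. A minimal sketch of the more common alternative, passing the dictionary directly (same httpbin test URL as above):
import requests

url = 'http://httpbin.org/post'
params = {'name': 'Tom', 'hobby': ['music', 'game']}
# passing a dict to json= lets requests serialize it once and set Content-Type: application/json
response = requests.post(url, json=params)
print(response.json()['json'])   # {'name': 'Tom', 'hobby': ['music', 'game']}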
3. Submit a file using POST (multipart)
import requests

url = 'http://httpbin.org/post'
# the baidu.html file saved earlier; 'rb': read in binary form
files = {'file': open('baidu.html', 'rb')}
# pass the file via POST
response = requests.post(url, files=files)
print(response.json())
Execution Result:
{
    'args': {},
    'data': '',
    'files': {'file': '<!DOCTYPE html> ... HTML content omitted here ...'},
    'form': {},
    'headers': {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate',
        'Content-Length': '2732',
        'Content-Type': 'multipart/form-data; boundary=1ba2b1406c4a4fe89c1846dc6398dae5',
        'Host': 'httpbin.org',
        'User-Agent': 'python-requests/2.22.0',
        'X-Amzn-Trace-Id': 'Root=1-5fef58f8-68f9fb2246eb190f06092ffb'
    },
    'json': None,
    'origin': '',
    'url': 'http://httpbin.org/post'
}
Processing the response
A GET or POST request gets a response from the server, which is the response object in the examples above. How do you get more information out of it?
import requests

headers = {
    'referer': 'https://www.baidu.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
}
url = 'http://httpbin.org/post'
# the baidu.html file saved earlier; 'rb': read in binary form
files = {'file': open('baidu.html', 'rb')}
# pass the file via POST
response = requests.post(url, files=files, headers=headers, cookies={'test': 'abcdefg'})

# response processing
# specify the encoding
response.encoding = 'utf-8'
print(response.url)                          # the requested URL
print(response.encoding)                     # the response encoding
print(response.status_code)                  # the status code
print(response.content)                      # the response content in binary form (for saving files, images, audio, etc.)
print(response.text)                         # the response content as text
print(response.json())                       # the response content as JSON
print(response.headers)                      # the response headers
print(response.request.headers)              # the request headers
print(response.request.headers['referer'])   # the value of a single request header
print(response.cookies)                      # the cookies returned, as a cookie object
print(response.cookies.items())              # the cookies as a list of (name, value) tuples
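Since response.content holds the raw bytes, it is what you use to save binary resources such as images. A small sketch using httpbin's sample image endpoint (an assumed test URL, not part of the example above):
import requests

# httpbin serves a small sample PNG at this endpoint
response = requests.get('http://httpbin.org/image/png')
# 'wb' because the content is binary, not text
with open('sample.png', 'wb') as f:
    f.write(response.content)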
Hello, welcome to my Python Geek official account, where we share different practical Python content every day!