Now that you know about crawlers and network requests, it's time to take a formal look at Python's crawler-related modules.

A lot of crawler books start with the urllib module, only to tell you at the end that it is a bit complicated and rarely used in practice.

Indeed, urllib is a relatively old module with a comparatively clumsy API for crawling, so you can skip straight to the requests module.

The requests module mimics a browser sending requests. It is built on Python's native networking support, and it is not only powerful but also easy to use!

Module installation

Install it directly with pip:

pip install requests

Using it is also very simple, in three steps:

  • Specify the URL, that is, the address of the site you want to crawl
  • Send the request: GET, POST, PUT, DELETE, and so on
  • Get the response data

Doesn't that look easy? Without further ado, let's take a simple example: crawling Baidu's home page.

A GET request

# import the module
import requests

url = 'https://www.baidu.com'
# send a GET request
response = requests.get(url)
# get the source code of the web page
baidu_html = response.text
# persist (save) the retrieved data
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(baidu_html)

This will produce a baidu.html file in the same directory.

This is a very basic example of using requests to send a GET request. Other request methods, such as POST, PUT, and HEAD, work the same way:

requests.post(url)
requests.delete(url)
requests.put(url)
...
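These helpers are all thin wrappers around requests.request(method, url, ...), so the HTTP verb can also be chosen at runtime. A minimal sketch (nothing is actually sent here; the request is only prepared so you can inspect what would go over the wire; the httpbin URL is just an example):

```python
import requests

# each helper (get/post/delete/...) delegates to requests.request(),
# so the HTTP method can be picked dynamically; preparing a Request
# shows what would be sent without touching the network
req = requests.Request('delete', 'http://httpbin.org/anything').prepare()
print(req.method)  # DELETE (prepare() upper-cases the verb)
print(req.url)
```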

Most of the time a request is sent with some parameters. For example, when we do a search, we send a GET request that carries the search content.

For example, when I search for Python on Baidu, the URL is https://www.baidu.com/s?wd=python

To make the crawler more flexible, separate out the search content and build it as a dictionary of parameters for the GET request:

import requests

# enter your search keywords
wd = input('Please enter what to search for: \n')
url = 'https://www.baidu.com/s?'
# build the search parameters for the GET request
url_param = {'wd': wd}
# add a UA to the request to keep the crawler from being blocked
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
}
# send a GET request
response = requests.get(url, params=url_param, headers=header)
# get the source code of the web page
wd_html = response.text
# write to a file (persist the data)
with open('wd.html', 'w', encoding='utf-8') as f:
    f.write(wd_html)


The signature of requests.get is as follows:

# url: the address to access
# params: the parameters to carry
# **kwargs: other parameters, such as request headers, cookies, proxies, etc.
def get(url, params=None, **kwargs):
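The params dictionary is percent-encoded and appended to the URL as a query string. You can verify this offline by preparing the request instead of sending it (the parameter values here are arbitrary examples):

```python
import requests

# params are percent-encoded and joined onto the URL as a query string;
# preparing the request shows the final URL without sending anything
req = requests.Request('GET', 'https://www.baidu.com/s',
                       params={'wd': 'python', 'pn': 10}).prepare()
print(req.url)  # https://www.baidu.com/s?wd=python&pn=10
```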

A POST request

We use the httpbin site for testing.

Httpbin is a web site for testing HTTP requests and responses, and it supports all the common request methods.

1. To submit parameters as a form, construct a dictionary of the request parameters and pass it to the data parameter.

import requests

url = 'http://httpbin.org/post'
params = {'name': 'Official account: Python Geek Column', 'language': 'python'}
response = requests.post(url, data=params)
print(response.json())

Execution Result:

{
  'args': {},
  'data': '',
  'files': {},
  'form': {'language': 'python', 'name': 'Official account: Python Geek Column'},
  'headers': {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Content-Length': '99',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'httpbin.org',
    'User-Agent': 'python-requests/2.22.0',
    'X-Amzn-Trace-Id': 'Root=1-5fef5112-65ad78d4706c58475905fef2'
  },
  'json': None,
  'origin': '',
  'url': 'http://httpbin.org/post'
}
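The form field in httpbin's reply simply echoes the request body back. You can inspect that body offline by preparing the request (nothing is sent here; the values are example data):

```python
import requests

# with data=, requests form-encodes the dict and labels the body
# application/x-www-form-urlencoded, which is what the 'form' field
# in httpbin's reply echoes back
req = requests.Request('POST', 'http://httpbin.org/post',
                       data={'name': 'Tom', 'language': 'python'}).prepare()
print(req.body)                     # name=Tom&language=python
print(req.headers['Content-Type'])  # application/x-www-form-urlencoded
```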

2. To submit the parameters as a JSON string, use json.dumps to convert the dictionary to a string.

import requests
import json

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
}
url = 'http://httpbin.org/post'
params = {'name': 'Tom', 'hobby': ['music', 'game']}
# format the dictionary as a JSON string with json.dumps
response = requests.post(url, json=json.dumps(params), headers=header)
print(response.json())

Execution Result:

{
  'args': {},
  'data': '"{\\"name\\": \\"Tom\\", \\"hobby\\": [\\"music\\", \\"game\\"]}"',
  'files': {},
  'form': {},
  'headers': {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Content-Length': '55',
    'Content-Type': 'application/json',
    'Host': 'httpbin.org',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'X-Amzn-Trace-Id': 'Root=1-5fef583e-5224201d08e2ff396416e822'
  },
  'json': '{"name": "Tom", "hobby": ["music", "game"]}',
  'origin': '',
  'url': 'http://httpbin.org/post'
}
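Note the data field in that output: it is a JSON string wrapped in another layer of quotes. The json= keyword already calls json.dumps internally, so passing a pre-dumped string encodes it twice; passing the dictionary directly is the simpler form. An offline comparison using prepared requests (nothing is sent):

```python
import json
import requests

params = {'name': 'Tom', 'hobby': ['music', 'game']}

# json= serializes its argument with json.dumps, so a pre-dumped string
# gets encoded a second time; the dict itself needs no manual dumps
once = requests.Request('POST', 'http://httpbin.org/post', json=params).prepare()
twice = requests.Request('POST', 'http://httpbin.org/post',
                         json=json.dumps(params)).prepare()
print(once.body)   # the JSON object itself
print(twice.body)  # a JSON string containing escaped JSON
```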

3. Submit a file using POST (multipart)

import requests

url = 'http://httpbin.org/post'
# the baidu.html file; 'rb': read in binary form
files = {'file': open('baidu.html', 'rb')}
# pass the file via the files parameter
response = requests.post(url, files=files)
print(response.json())

Execution Result:

{
  'args': {},
  'data': '',
  'files': {'file': '<!DOCTYPE html> ... HTML content omitted here ...'},
  'form': {},
  'headers': {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Content-Length': '2732',
    'Content-Type': 'multipart/form-data; boundary=1ba2b1406c4a4fe89c1846dc6398dae5',
    'Host': 'httpbin.org',
    'User-Agent': 'python-requests/2.22.0',
    'X-Amzn-Trace-Id': 'Root=1-5fef58f8-68f9fb2246eb190f06092ffb'
  },
  'json': None,
  'origin': '',
  'url': 'http://httpbin.org/post'
}
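One caveat with the snippet above: the file handle from open() is never closed. With a real file, wrap the open() in a with-block; the sketch below uses an in-memory buffer instead so it is self-contained (the file name and content are made up for illustration), and prepares the request to show the multipart body without sending anything:

```python
import io
import requests

# an in-memory buffer stands in for open('baidu.html', 'rb');
# the (filename, fileobj, content_type) tuple sets the upload metadata
payload = io.BytesIO(b'<!DOCTYPE html><html></html>')
req = requests.Request('POST', 'http://httpbin.org/post',
                       files={'file': ('baidu.html', payload, 'text/html')}).prepare()
print(req.headers['Content-Type'])  # multipart/form-data; boundary=...
```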

Processing the response

A GET or POST request returns a response from the server, which is the response object in the examples above. How do you get more information out of it?

import requests

headers = {
    'referer': 'https://www.baidu.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
}
url = 'http://httpbin.org/post'
# the baidu.html file; 'rb': read in binary form
files = {'file': open('baidu.html', 'rb')}
# pass the file via the files parameter
response = requests.post(url, files=files, headers=headers, cookies={'test': 'abcdefg'})
# response processing
# specify the encoding
response.encoding = 'utf-8'
print(response.url)                 # the requested URL
print(response.encoding)            # the response encoding
print(response.status_code)         # the status code
print(response.content)             # response body as bytes (for saving files, images, audio, etc.)
print(response.text)                # response body as text
print(response.json())              # response body parsed as JSON
print(response.headers)             # response headers
print(response.request.headers)     # request headers
print(response.request.headers['referer'])  # the value of a single request header
print(response.cookies)             # the returned cookie object
print(response.cookies.items())

The following figure shows the output
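Before trusting any of these fields, it is worth checking that the request actually succeeded. A small sketch (a Response object is built by hand here so nothing is sent; normally you would use the object a real request returns):

```python
import requests

# a hand-built Response with a 404 status, standing in for a real reply
resp = requests.Response()
resp.status_code = 404

# compare against requests' named status codes...
print(resp.status_code == requests.codes.not_found)  # True

# ...or let raise_for_status() raise requests.HTTPError on 4xx/5xx
failed = False
try:
    resp.raise_for_status()
except requests.HTTPError:
    failed = True
print(failed)  # True
```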

Hello and welcome to my Python Geek account, where we share practical Python content every day!