This article is also published on my WeChat official account. You can follow it by scanning the QR code at the bottom of the article or by searching for Geek Navigation on WeChat. Articles are updated every weekday.

I. Overview

Following on from the previous article: if you thought the Requests network library only did GET and POST requests, you'd be wrong. It has a few other features that crawlers use all the time. Let's look at them.

II. Usage

  • Auth validation

Some admin backends (a router's management page, for example) pop up the browser's Basic Auth dialog and ask for a username and password. The principle is simple: the username:password pair is Base64-encoded, placed in the HTTP request header, and sent to the backend for verification. Requests supports this too, although you won't need it often.

import requests

# Username and password for HTTP Basic Auth
auth = ('admin', 'admin')

response = requests.get(
    'http://192.168.1.1',
    auth=auth
)
print(response.text)
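Under the hood there is no magic: Requests simply Base64-encodes username:password and puts it in the Authorization header. A minimal sketch doing the same thing by hand, using the same placeholder address and credentials as above:

import base64
import requests

# Build the Authorization header ourselves; this is essentially
# what auth=('admin', 'admin') does for us
credentials = base64.b64encode(b'admin:admin').decode('ascii')
headers = {'Authorization': 'Basic ' + credentials}

response = requests.get('http://192.168.1.1', headers=headers)
print(response.status_code)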
  • Proxies

Using HTTP or HTTPS proxies is probably one of the more important techniques for getting around anti-crawling measures. What is a proxy? It is a server that sits between you and the target site and forwards your requests, so the target sees the proxy's IP address instead of your own:

import requests

# Select a different proxy according to the protocol type
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "http://12.34.56.79:9527",
}

response = requests.get(
    "http://www.baidu.com",
    proxies=proxies
)
print(response.text)

Private proxy:

import requests

# If the proxy requires HTTP Basic Auth, use this format
# (username, password, and proxy-host are placeholders):
proxy = {
    "http": "http://username:password@proxy-host:11163"
}

response = requests.get(
    "http://www.baidu.com",
    proxies=proxy
)

print(response.text)

There are many free and paid proxy services on the market, such as Xici Proxy and Kuaidaili.
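Free proxies in particular tend to be unstable, so in practice crawlers usually keep a small pool and retry through a different proxy when one fails. A minimal sketch of that pattern, with placeholder proxy addresses:

import random
import requests

# Hypothetical proxy pool; in practice, load these from your provider
proxy_pool = [
    "http://12.34.56.79:9527",
    "http://12.34.56.80:9527",
]

def fetch(url, retries=3):
    # Try random proxies from the pool until one succeeds
    for _ in range(retries):
        proxy = random.choice(proxy_pool)
        try:
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=5,
            )
        except requests.exceptions.RequestException:
            continue  # this proxy failed, try another one
    raise RuntimeError("all proxies failed")

response = fetch("http://www.baidu.com")
print(response.status_code)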

  • Cookies

When developing crawlers with Python's Requests library, you often need to send the Set-Cookie values returned by a previous request as the cookies of the next one. For example, the sessionId returned after a simulated login must be passed as a cookie parameter on subsequent requests.

import requests

response = requests.get("http://www.baidu.com/")

# response.cookies returns a CookieJar object
cookie_jar = response.cookies

# Print the CookieJar
print(cookie_jar)

# Send the previous cookies along with the next visit
response = requests.get("http://www.baidu.com/", cookies=cookie_jar)

# Print the response content
print(response.text)
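If you prefer to work with cookies as a plain dict (easier to print, store, or edit), Requests ships two helpers, dict_from_cookiejar and cookiejar_from_dict:

import requests

response = requests.get("http://www.baidu.com/")

# Convert the CookieJar into an ordinary dict
cookie_dict = requests.utils.dict_from_cookiejar(response.cookies)
print(cookie_dict)

# And back again: build a CookieJar from a dict for the next request
cookie_jar = requests.utils.cookiejar_from_dict(cookie_dict)
response = requests.get("http://www.baidu.com/", cookies=cookie_jar)
print(response.status_code)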
  • Session

In Requests, Session is a powerful object that represents a user session: everything from the moment the client connects to the server until it disconnects. A Session lets us persist certain parameters across requests; for example, all requests made through the same Session instance share the same cookies.

import requests

# Create a Session object that can store cookie values
session = requests.Session()

# Add a request header
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
}

# POST parameters
data = {
    "email": "xxxx",
    "password": "xxxx"
}

# Send a request with the username and password; the cookies returned
# after login are saved in the session
session.post(
    "http://www.jikedaohang.com/login",
    data=data,
    headers=headers
)

# The session now holds the logged-in user's cookies, so it can directly
# access pages that require login, such as the personal profile page
response = session.get(
    "http://www.jikedaohang.com/1562336754/profile"
)

# Print the response content
print(response.text)
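Besides cookies, a Session can carry default parameters so you don't repeat them on every call. For example, headers set once on the session are merged into every request it makes:

import requests

session = requests.Session()

# These headers are sent with every request made through this session
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
})

# Both requests reuse the header and share one cookie jar
session.get("http://www.baidu.com/")
response = session.get("http://www.baidu.com/")

# Cookies accumulated across the session's requests
print(session.cookies)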
  • Processing HTTPS requests (SSL certificate verification)

To handle HTTPS requests properly, we need to know a few things:

  • SSL: Secure Sockets Layer. It fixes HTTP's plaintext problem, protecting data in transit from being stolen, tampered with, or hijacked.
  • TLS: Transport Layer Security. TLS is the standardized successor to SSL, hence the common name SSL/TLS.
  • During data transmission, HTTPS establishes a TCP connection first, then performs the TLS handshake on top of it (see the sketch after this list).
  • Requests can verify SSL certificates for HTTPS requests, just like a web browser; certificate verification is enabled by default.
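To make the "TCP first, then TLS" point concrete, here is a minimal sketch using Python's standard ssl module (www.baidu.com stands in for any HTTPS host):

import socket
import ssl

# Open a plain TCP connection first, then wrap it in TLS;
# the default context verifies the certificate, like a browser
context = ssl.create_default_context()
with socket.create_connection(("www.baidu.com", 443)) as tcp_sock:
    with context.wrap_socket(tcp_sock, server_hostname="www.baidu.com") as tls_sock:
        print(tls_sock.version())  # e.g. "TLSv1.3"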

If certificate verification fails, the request raises an SSLError:

SSLError: ("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)".)Copy the code

If you want to skip SSL certificate verification, set verify to False:

import requests
response = requests.get("https://www.12306.cn/mormhweb/", verify=False)
print(response.text)
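Note that verify=False makes the underlying urllib3 library print an InsecureRequestWarning on every request. If you have consciously accepted the risk, the warning can be silenced:

import requests
import urllib3

# Silence the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get("https://www.12306.cn/mormhweb/", verify=False)
print(response.status_code)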

If you do want verification, the verify parameter can also take the path to a CA_BUNDLE file, or to a directory containing trusted CA certificates:

import requests
response = requests.get("https://www.12306.cn/mormhweb/", verify='./certfile')
print(response.text)

To sum up: 1. An HTTPS request succeeds only when SSL verification passes or is explicitly ignored; the ignore switch is verify=False.

2. SSL certificates are issued by certificate authorities (CAs) and usually cost money.

III. Summary

That's it for the Requests network library for now. Next we'll move on to parsing libraries.

Feel free to follow my official account so we can learn together.