Introduction to the Requests Library

Requests is a simple and elegant HTTP library designed for humans. It is easier to use than the urllib3 library: it sends HTTP/1.1 requests without the need to manually add query strings to URLs or form-encode POST data, and, in contrast to urllib3, it fully automates keep-alive and HTTP connection pooling. The Requests library offers the following features:
❖ Keep-Alive & Connection Pooling
❖ International Domains and URLs
❖ Sessions with Cookie Persistence
❖ Browser-style SSL Verification
❖ Automatic Content Decoding
❖ Basic/Digest Authentication
❖ Elegant Key/Value Cookies
❖ Automatic Decompression
❖ Unicode Response Bodies
❖ HTTP(S) Proxy Support
❖ Multipart File Uploads
❖ Streaming Downloads
❖ Connection Timeouts
❖ Chunked Requests
❖ .netrc Support

1.1 Basic use of Requests
Code 1-1: send a GET request and view the returned result.

import requests

url = 'www.tipdm.com/tipdm/index…'
rqg = requests.get(url)

# View the result type
print(type(rqg))

# View the status code
print('status_code:', rqg.status_code)

# View the encoding
print('encoding:', rqg.encoding)

# View the response headers
print(rqg.headers)

# View the web page content
print(rqg.text)

{'Date': 'Mon, 18 Nov 2019 04:45:49 GMT', 'Server': 'Apache-Coyote/1.1', 'Accept-Ranges': 'bytes', 'ETag': 'W/"15693-1562553126764"', 'Last-Modified': 'Mon, 08 Jul 2019 02:32:06 GMT', 'Content-Type': 'text/html', 'Content-Length': '15693', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'keep-alive'}

You can send all kinds of HTTP requests through the Requests library:

requests.get("httpbin.org/get")        # GET request
requests.post("httpbin.org/post")      # POST request
requests.put("httpbin.org/put")        # PUT request
requests.delete("httpbin.org/delete")  # DELETE request
requests.head("httpbin.org/get")       # HEAD request
requests.options("httpbin.org/get")    # OPTIONS request

One of the most common requests in HTTP is the GET request. Let's take a closer look at how to build a GET request with Requests.

GET parameter description: get(url, params=None, **kwargs):
❖ url: the site to be requested
❖ params: (optional) dictionary, list of tuples, or bytes to send as the query string of the request
❖ **kwargs: variable-length keyword arguments

First, build a simple GET request. The request link is httpbin.org/get; the site will determine that the client initiated a GET request and return the corresponding request information.

import requests

r = requests.get('httpbin.org/get')
print(r.text)

{"args": {}, "headers": {"Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "python-requests/2.24.0", "X-Amzn-Trace-Id": "Root=1-5fb5b166-571d31047bda880d1ec6c311"}, "origin": "36.44.144.134", "url": "httpbin.org/get"}

It can be seen that we successfully initiated a GET request, and the returned result contains the request headers, URL, IP address, and other information. So, for a GET request, if you want to attach additional information, how do you typically do that?
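The usual answer is the params argument, which is covered in detail in section 2.2 below; as a quick preview, here is a minimal sketch (a full httpbin URL with scheme is assumed here):

import requests

# Pass the query string as a dictionary; requests appends it to the URL
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)
print(r.url)  # https://httpbin.org/get?key1=value1&key2=value2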

If we request Zhihu's discovery page directly, without adding any request headers:

import requests

response = requests.get('www.zhihu.com/explore')
print(f"{response.status_code}")
print(response.text)

400 Bad Request


import requests

headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
response = requests.get('www.zhihu.com/explore', headers=headers)
print(f"{response.status_code}")

print(response.text)


The response status code of the request is now 200. Here we added the headers information, which contains the User-Agent field, i.e. the browser identification information. Clearly our disguise worked! Disguising the crawler as a browser in this way is one of the simplest counter-measures against anti-crawling.

Usage: requests.get(url, headers=headers)

  • The headers parameter receives the request headers in the form of a dictionary
  • The request header field name is the key, and the corresponding value is the value

Exercise: request the Baidu home page www.baidu.com, carrying the request headers, and print the request header information.
Solution:

import requests

url = 'www.baidu.com'
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# Put the User-Agent in the request headers to simulate a browser sending the request

response = requests.get(url, headers=headers)
print(response.content)

# Print the request header information

print(response.request.headers)

2.2 Send a request with parameters
When we use Baidu to search, the URL often contains a '?'; the part after the question mark is the request parameters, also known as the query string. In general, we don't just visit basic web pages; especially when crawling dynamic pages, we need to pass different parameters to get different content. GET passes parameters in two ways: carry them directly in the URL, or pass them through params.

2.2.1 Carry the parameters directly in the URL

import requests

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
url = 'www.baidu.com/s?wd=python…'
response = requests.get(url, headers=headers)

2.2.2 Carry the parameter dictionary through params

  1. Build the request parameter dictionary
  2. Send the request to the interface with the parameter dictionary, passing it as params

import requests

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# This is the destination URL
# url = 'www.baidu.com/s?wd=python…'
# It doesn't matter whether there is a question mark at the end
url = "www.baidu.com/s?"

# The request parameter is a dictionary, i.e. wd=python
kw = {"wd": "python"}

# Initiate the request with the request parameters and get the response
response = requests.get(url, headers=headers, params=kw)
print(response.content)

httpbin.org/get?key2=va…
In addition, the type of the returned web content is actually str, but it is in JSON format. So, if you want to parse the returned result directly and get a dictionary, you can call the json() method. The following is an example:

import requests

r = requests.get("httpbin.org/get")
print(type(r.text))
print(r.json())
print(type(r.json()))

<class 'str'>
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.24.0', 'X-Amzn-Trace-Id': 'Root=1-5fb5b3f9-13f7c2192936ec541bf97841'}, 'origin': '36.44.144.134', 'url': 'httpbin.org/get'}
<class 'dict'>

As you can see, calling the json() method converts the returned JSON-format string into a dictionary. But note that if the returned result is not in JSON format, parsing will fail and a json.decoder.JSONDecodeError exception will be raised.
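If the response body might not be JSON, you can guard the json() call and fall back to the raw text; a minimal sketch, using httpbin's /html endpoint (which returns HTML, not JSON) so the exception reliably triggers:

import json
import requests

r = requests.get('https://httpbin.org/html')  # returns an HTML page, not JSON
try:
    data = r.json()
except json.decoder.JSONDecodeError:
    # Fall back to the raw text when the body is not valid JSON
    data = r.text
print(type(data))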

In addition, a dictionary of request parameters is automatically encoded and appended to the URL when the request is sent, as follows:

import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
wd = '…'  # the search keyword
pn = 1
response = requests.get('www.baidu.com/s', params={'wd': wd, 'pn': pn}, headers=headers)
print(response.url)

The output is: www.baidu.com/s?wd=%E9%9B…C%E5%AD%A6&pn=1

You can see that the URL has been automatically encoded.

The above code is equivalent to the following code; the params encoding conversion essentially uses urlencode:

from urllib.parse import urlencode
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
wd = '…'  # the search keyword
encode_res = urlencode({'k': wd}, encoding='utf-8')
keyword = encode_res.split('=')[1]
print(keyword)

# Then concatenate it into the URL

url = 'www.baidu.com/s?wd=%s&pn=…' % keyword
response = requests.get(url, headers=headers)
print(response.url)

The output is: www.baidu.com/s?wd=%E9%9B…%90%8C%E5%AD%A6&pn=1

2.3 Use a GET request to crawl a web page
The request links above return JSON-format strings; if we request an ordinary web page, we can obtain the corresponding page content.

import requests
import re

headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
response = requests.get('www.zhihu.com/explore', headers=headers)
result = re.findall("(ExploreSpecialCard-contentTitle|ExploreRoundtableCard-questionTitle).*?>(.*?)<", response.text)
print([i[1] for i in result])

['What delicious food is there in Xi'an's Huimin Street?', 'What treasure shops are worth visiting in Xi'an?', 'Which business districts in Xi'an carry your youth?', 'What good driving habits can you share?', 'What driving skills do only experienced drivers know?', 'Welcome to the landing! Zhihu Cosmic Member recruitment notice', 'Planet landing question: given ten yuan to travel to the future, how would you make a living?', 'Planet landing question: what kind of super power in the universe would you most like to have, and how would you use it?', 'Norwegian salmon, place of origin matters', 'What are the most interesting places in Norway?', 'What is it like to live in Norway?', 'What do you think of BOE's mass production of AMOLED flexible screens? What are its future prospects?', 'Will flexible screens revolutionize the mobile phone industry?', 'What are ultra-thin flexible batteries? Will they have a major impact on the battery life of smartphones?', 'How can you learn art well and score high in the art exam?', 'Is the Tsinghua Academy of Fine Arts looked down on?', 'Are art students really that bad?', 'How should one live one's life?', 'What should one pursue in life?', 'Will humans go mad when they learn the ultimate truth of the world?', 'Is anxiety due to one's own incompetence?', 'What does social phobia feel like?', 'Is there any truth to the saying that being busy means you have no time to be depressed?']

Here we added the headers information, which contains the User-Agent field, i.e. the browser identification information. If this is not added, Zhihu will refuse the fetch.

Fetching binary data
In the example above we fetched a Zhihu page, which actually returned an HTML document. What if you want to grab images, audio, or video? These files are essentially composed of binary data; we can see the various multimedia forms only because of their specific save formats and corresponding parsing methods. So, if you want to grab them, you have to get their binary content. Take the GitHub site icon as an example:

import requests

response = requests.get("github.com/favicon.ico")
with open('github.ico', 'wb') as f:
    f.write(response.content)

The response object has two useful properties here: text and content. The former is the string text; the latter is the bytes data, which is what you want for images, audio, and video files.

Web sites often use the cookie field in the request header to maintain the user's access state, so we can add a cookie to the headers parameter to simulate the requests of an ordinary user.

2.4.1 Obtaining cookies
To obtain a page that requires login through the crawler, or to bypass anti-crawling measures via cookies, we need to use cookies. First, let's see how to obtain them:

import requests

url = 'www.baidu.com'
req = requests.get(url)
print(req.cookies)

# The cookies of the response

for key, value in req.cookies.items():
    print(f"{key} = {value}")

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315

Here we successfully obtained the cookies through the cookies attribute; you can see that they are of the RequestsCookieJar type. The items() method is then used to convert them into a list of tuples, and the name and value of each cookie are iterated over, traversing and parsing the cookies.

2.4.2 Logging in with cookies and sessions
Advantage of carrying cookies and sessions: you can request pages that are only available after login.
Disadvantage: a set of cookies and sessions often corresponds to one user, and requesting too many times too quickly is easily identified by the server as a crawler.
We try not to use cookies when we do not need them. However, in order to obtain a page behind a login, we must send requests with cookies. We can use cookies directly to maintain the login state. First log in to Zhihu and copy the cookie from the browser into the headers request parameter dictionary; the value corresponding to the Cookie key in the headers dictionary must be a string.

import requests
import re

# Construct the request header dictionary
headers = {
    # User-Agent copied from the browser
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    # Cookie copied from the browser
    "Cookie": 'xxx here is the copied cookie string'
}

# The request header parameter dictionary carries the cookie string
response = requests.get('www.zhihu.com/creator', headers=headers)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)<', response.text)
print(response.status_code)
print(data)

When we request without cookies:

import requests
import re

headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
response = requests.get('www.zhihu.com/creator', headers=headers)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)<', response.text)
print(response.status_code)
print(data)

200
[]

The printed output is empty. Comparing the two, the headers parameter was successfully used to carry the cookie and obtain a page that can only be accessed after login.

In the previous example we carried the cookie in the headers parameter; we can also use the dedicated cookies parameter.
❖ 1. The form of the cookies parameter: a dictionary built from the cookie string in the request header, with each key/value pair separated by a semicolon and a space.
❖ 2. How the cookies parameter is used: response = requests.get(url, cookies=cookies_dict)
❖ 3. The dictionary needed to convert a cookie string into the cookies parameter: cookies_dict = {cookie.split('=')[0]: cookie.split('=')[-1] for cookie in cookies_str.split('; ')}
❖ Note: cookies usually have an expiration date; once they expire, they need to be obtained again.

import requests
import re

url = 'www.zhihu.com/creator'
cookies_str = '…'  # the cookie string copied from the browser
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
cookies_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[-1] for cookie in cookies_str.split('; ')}

# The cookies parameter carries the cookies dictionary
resp = requests.get(url, headers=headers, cookies=cookies_dict)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)<', resp.text)
print(resp.status_code)
print(data)

[…, 'My parents don't have the money to buy me a computer, what should I do?', 'Describe your current state of life in one sentence…']

We can also set cookies by constructing a RequestsCookieJar object; example code is as follows:

import requests
import re

url = 'www.zhihu.com/creator'
cookies_str = '…'  # the cookie string copied from the browser
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
jar = requests.cookies.RequestsCookieJar()
for cookie in cookies_str.split('; '):
    key, value = cookie.split('=', 1)
    jar.set(key, value)

# Carry the cookies through the RequestsCookieJar object
resp = requests.get(url, headers=headers, cookies=jar)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)<', resp.text)
print(resp.status_code)
print(data)

[…, 'My parents don't have the money to buy me a computer, what should I do?', 'Describe your current state of life in one sentence…']

Here we first create a RequestsCookieJar object, then split the copied cookie string with the split() method, and set the key and value of each cookie with the set() method. After that we can call requests' get() method and pass the cookies argument. Of course, due to Zhihu's own limitations, the headers parameter is still needed, but there is no need to set the cookie field inside the original headers parameter. Testing shows that we can log in to Zhihu normally this way.

Converting a cookieJar object into a cookies dictionary: the Response object obtained with Requests has a cookies attribute whose value is a cookieJar containing the cookies set locally by the server. How do we turn this into a dictionary of cookies?
❖ 1. Conversion method: cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)
❖ 2. response.cookies returns a cookieJar object
❖ 3. requests.utils.dict_from_cookiejar returns a cookies dictionary
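A minimal sketch of that conversion, reusing the Baidu request from section 2.4.1 (a full URL with scheme is assumed; the exact cookie names returned will vary):

import requests

resp = requests.get('https://www.baidu.com')
cookies_dict = requests.utils.dict_from_cookiejar(resp.cookies)
print(type(resp.cookies))  # <class 'requests.cookies.RequestsCookieJar'>
print(cookies_dict)        # e.g. {'BDORZ': '27315'}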

import requests
import re

url = 'www.zhihu.com/creator'
cookies_str = '…'  # the cookie string copied from the browser
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
cookie_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[-1] for cookie in cookies_str.split('; ')}

# The cookies parameter carries the cookies dictionary
resp = requests.get(url, headers=headers, cookies=cookie_dict)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)<', resp.text)
print(resp.status_code)
print(data)

# A dictionary can also be converted into a requests.cookies.RequestsCookieJar object
cookiejar = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
type(cookiejar)     # requests.cookies.RequestsCookieJar
type(resp.cookies)  # requests.cookies.RequestsCookieJar
# A RequestsCookieJar built with set() is also of type requests.cookies.RequestsCookieJar
requests.utils.dict_from_cookiejar(cookiejar)  # convert the cookieJar back into a dictionary

2.5 Timeout setting
A request that waits a long time may still end with no result; in a crawler, a request that gets no result for a long time makes the whole project very inefficient. At that point we need to force the request to return a result within a specific time, or an error is reported. Usage: response = requests.get(url, timeout=3), meaning the response must come back within 3 seconds, otherwise an exception is raised.

url = 'www.tipdm.com/tipdm/index…'
print(requests.get(url, timeout=2))

# An error will be reported if the timeout is too short
requests.get(url, timeout=0.1)
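If you would rather handle the timeout than let it crash the crawler, you can catch requests.exceptions.Timeout; a minimal sketch with an arbitrary three-attempt retry, using httpbin's delay endpoint so the timeout reliably triggers:

import requests

url = 'https://httpbin.org/delay/3'  # this endpoint waits 3 seconds before responding
for attempt in range(3):
    try:
        response = requests.get(url, timeout=1)
        print(response.status_code)
        break
    except requests.exceptions.Timeout:
        print(f'attempt {attempt + 1} timed out, retrying')
else:
    print('all attempts timed out')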

Where do we use POST requests?
  1. Login and registration (in the view of web engineers, POST is more secure than GET, and the URL will not expose the user's account, password, or other information)
  2. When large text content needs to be transferred (POST requests have no requirement on data length)

Therefore, our crawler also needs to simulate the browser to send a POST request in these two situations. In fact, sending a POST request is very similar to GET, except that we pass the parameters in data.

POST parameter description: post(url, data=None, json=None, **kwargs):
❖ url: the site to be requested
❖ data: (optional) dictionary, list of tuples, bytes, or file-like object to send in the body of the request
❖ json: (optional) JSON data to send in the body of the request
❖ **kwargs: variable-length keyword arguments

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
req = requests.post("httpbin.org/post", data=payload)
print(req.text)

3.1 POST sends JSON data
Many times the data you want to send is not form-encoded; this comes up especially when crawling Java-based URLs. If you pass a string instead of a dict, the data is posted as-is. We can use json.dumps() to convert a dict to str format; besides encoding the dict yourself, you can also pass the json argument directly and it will be encoded automatically.

import json
import requests

url = 'httpbin.org/post'
payload = {'some': 'data'}
req1 = requests.post(url, data=json.dumps(payload))
req2 = requests.post(url, json=payload)
print(req1.text)
print(req2.text)

Note that the requests module sends requests with the data, json, and params parameters: params is used in GET requests, while data and json are used in POST requests. data can receive dictionary, string, bytes, and file-object parameters.
❖ Using the json parameter, whether the message is of type str or dict, if the Content-Type in headers is not specified, it defaults to application/json.
❖ Using the data parameter with a dict message, if the Content-Type in headers is not specified, it defaults to application/x-www-form-urlencoded, the same as a form submission; the form data is converted into key/value pairs, which the server can retrieve (e.g. Django's request.POST), and the content of request.body is a key/value string such as a=1&b=2.
❖ Using the data parameter with a str message, if the Content-Type in headers is not specified, application/json is used by default; the content of request.body is the string itself, such as '{"a": 1, "b": 2}'.
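To see the first two cases concretely, you can inspect the Content-Type header of the prepared request; a minimal sketch against httpbin (full URLs with scheme assumed here):

import requests

payload = {'a': 1, 'b': 2}
r_form = requests.post('https://httpbin.org/post', data=payload)
r_json = requests.post('https://httpbin.org/post', json=payload)

# The headers requests actually sent, taken from the prepared request
print(r_form.request.headers.get('Content-Type'))  # application/x-www-form-urlencoded
print(r_json.request.headers.get('Content-Type'))  # application/json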

If we want to upload a file with the crawler, we can use the files parameter:

url = 'httpbin.org/post'
files = {'file': open('test.xlsx', 'rb')}
req = requests.post(url, files=files)
req.text

If you need to send a very large file as a multipart/form-data request, you may want to stream the request. Requests does not support this by default, but the requests-toolbelt library does.
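A minimal sketch of such a streaming upload with requests-toolbelt (assuming the package is installed via pip install requests-toolbelt and that test.xlsx exists locally; the content type is only illustrative):

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

m = MultipartEncoder(fields={
    'file': ('test.xlsx', open('test.xlsx', 'rb'),
             'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
})
# Passing the encoder as data streams the body instead of loading it all into memory
req = requests.post('https://httpbin.org/post', data=m,
                    headers={'Content-Type': m.content_type})
print(req.status_code)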

3.3 Use POST to crawl a web page
The main point is to find the page to be parsed and the form data it expects; here the target is Baidu Translate's suggestion interface.

import requests

# Prepare the data to be translated
kw = '…'  # the word to translate
ps = {"kw": kw}

# Prepare the forged request headers
headers = {
    # User-Agent: the identity information of the request. Generally, the browser's identity
    # information is used directly to forge the crawler request,
    # making the server think the request was made by a browser [hides the crawler info]
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36 Edg/85.0.564.41"
}

# Send a POST request with the form data to be translated, passed as a dictionary
response = requests.post("fanyi.baidu.com/sug", headers=headers, data=ps)

# Print the returned data
print(response.content)

print(response.content.decode("unicode_escape"))

Sessions and proxy IPs in Requests
It is possible to simulate web requests directly using get() or post(), but each call is essentially a different session, meaning you are opening two different pages in two different browsers. Imagine a scenario where the first request uses post() to log in to a website, and the second request uses get() to fetch your personal information after successfully logging in. In effect this is the equivalent of opening two browsers: two completely unrelated sessions. Can you successfully retrieve the personal information? Of course not. Some of you may say: why don't I just set the same cookies for both requests? You could, but it is too cumbersome. There is a simpler solution.

The key is to maintain the same session, which is equivalent to opening a new browser tab instead of a new browser. But I don't want to set cookies every time, so what do I do? Here comes a new tool: the Session object. With it we can easily maintain a session without worrying about cookies, which it takes care of automatically. The Session class in the requests module can automatically handle the cookies generated while sending requests and receiving responses, in order to preserve state.

4.1 What requests.Session can do: automatically process cookies, i.e. the cookies produced during requests and responses, so that state is preserved.

4.2 How requests.Session is used: after a session instance requests a web site, the cookies set locally by the remote server are stored in the session, and the next time that session is used to request the same server, the previous cookies are carried along.

session = requests.session()  # instantiate the session object
response = session.get(url, headers=headers)
response = session.post(url, data=data)

The parameters of the session object's get and post requests are exactly the same as those of the requests module.

Task: use requests.Session to log in to GitHub and access a page that can only be accessed after login.
❖ Determine the URL, request method, and required request parameters of the login request

  • Some request parameters can be obtained with the re module from the response content of other URLs

❖ Determine the URL and request method of the page to be accessed after login, and complete the code using requests.Session:

import requests
import re

# Construct the request header dictionary
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}

# Instantiate the session object
session = requests.session()

# Visit the login page to obtain the parameters required for the login request
response = session.get('github.com/login', headers=headers)
authenticity_token = re.search('name="authenticity_token" value="(.*?)" />', response.text).group(1)

# Construct the dictionary of login request parameters
data = {
    'commit': 'Sign in',  # fixed value
    'utf8': '✓',  # fixed value
    'authenticity_token': authenticity_token,  # this parameter comes from the response content of the login page
    'login': input('Enter your GitHub account: '),
    'password': input('Enter your GitHub password: ')
}

# Send the login request (no need to pay attention to the response to this request)
session.post('github.com/session', headers=headers, data=data)

# Print a page that requires login to access
response = session.get('github.com/settings/pr…', headers=headers)
print(response.text)

For some sites, sending a few requests during testing succeeds in retrieving content. However, once large-scale crawling starts, with large-scale and frequent requests, the website may pop up a captcha, jump to a login authentication page, or even block the client's IP address entirely, making it inaccessible for a period of time. To prevent this, we can set proxies using the proxies parameter. By specifying a proxy IP address, the request we send is forwarded through a proxy server. Let's first learn about proxy IPs and the process of using a proxy server.

5.1 Using a proxy

  1. A proxy IP address is an IP address that points to a proxy server
  2. The proxy server forwards requests to the target server for us

5.2 Forward proxy and reverse proxy
The proxy IP specified by the proxies parameter normally points to a forward proxy server; correspondingly, there are also reverse proxy servers. The two are distinguished by whom the proxy forwards requests for:
❖ A forward proxy forwards requests on behalf of the browser or client (the party that sends the request)

  • The browser knows the real IP address of the server that will ultimately process the request, such as a VPN

❖ A reverse proxy forwards requests not for the browser or client (the sender of the request) but on behalf of the server that ultimately processes the request

  • The browser does not know the real address of the server, such as nginx

According to the degree of anonymity, proxy IPs can be divided into three categories:
❖ 1. Transparent proxy: although a transparent proxy can "hide" your IP address directly, the target server can still find out who you are. The request headers received by the target server are as follows:
REMOTE_ADDR = proxy IP
HTTP_VIA = proxy IP
HTTP_X_FORWARDED_FOR = your IP
❖ 2. Anonymous proxy: with an anonymous proxy, others can only know that you are using a proxy, not who you are. The request headers received by the target server are as follows:
REMOTE_ADDR = proxy IP
HTTP_VIA = proxy IP
HTTP_X_FORWARDED_FOR = proxy IP
❖ 3. High-anonymity (elite) proxy: a high-anonymity proxy makes it impossible for others to see that you are using a proxy at all, so it is the best option. There is no doubt that a high-anonymity proxy works best. The request headers received by the target server are as follows:
REMOTE_ADDR = proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined

❖ Depending on the protocol used by the website, a proxy of the corresponding protocol is also needed, for example: an HTTP proxy for target URLs using the HTTP protocol, an HTTPS proxy for target URLs using the HTTPS protocol, or a socks tunnel proxy (such as a socks5 proxy).
✾ 1. A socks proxy simply delivers data packets and does not care about the application protocol (FTP, HTTP, HTTPS, etc.).
✾ 2. A socks proxy takes less time than an HTTP or HTTPS proxy.
✾ 3. A socks proxy can forward both HTTP and HTTPS requests.

5.4 The proxies parameter
To make the target server think it is not the same client sending the requests, and to prevent our IP from being blocked because of frequent requests to one domain, we need to use proxy IPs. Usage: response = requests.get(url, proxies=proxies), where proxies takes the form of a dictionary:
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}
Note: if the proxies dictionary contains multiple key/value pairs, the entry matching the protocol of the requested URL is used when sending the request.

import requests

proxies = {"http": "http://124.236.111.11:80", "https": "https://183.220.145.3:8080"}
req = requests.get('www.baidu.com', proxies=proxies)
req.status_code
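For the socks tunnel proxy mentioned above, requests also accepts socks5:// URLs in the proxies dictionary, provided the socks extra is installed (pip install requests[socks]); the proxy address below is only a placeholder:

import requests

proxies = {
    'http': 'socks5://127.0.0.1:1080',   # placeholder proxy address
    'https': 'socks5://127.0.0.1:1080',
}
req = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=5)
print(req.text)  # shows the IP address that the target server sees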

SSL certificate verification
In addition, Requests provides certificate verification. When sending an HTTPS request, it checks the SSL certificate, and we can use the verify parameter to control whether the certificate is checked (if not specified, verify defaults to True). Now let's test it:

import requests

url = 'cas.xijing.edu.cn/xjtyrz/logi…'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers)

SSLError: HTTPSConnectionPool(host='cas.xijing.edu.cn', port=443): Max retries exceeded with url: /xjtyrz/login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))

An SSLError is raised, indicating a certificate verification error. So how can you avoid this error when you request an HTTPS site whose certificate fails verification? Simply set the verify parameter to False.

import requests

url = 'www.jci.edu.cn/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code

I could not find a web page that actually requires SSL authentication, how annoying! But we do get a warning suggesting that we supply a certificate. We can silence this warning by disabling urllib3's warnings:

import requests
from requests.packages import urllib3

urllib3.disable_warnings()
url = 'www.jci.edu.cn/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code

200

Alternatively, ignore the warning by capturing warnings into the logging system:

import logging
import requests

logging.captureWarnings(True)
url = 'www.jci.edu.cn/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code

200

Of course, we can also specify a local certificate as the client certificate; it can be a single file (containing the key and the certificate) or a tuple of two file paths:

import requests

response = requests.get('www.12306.cn', cert='/path/server…')
print(response.status_code)

200

Of course, the code above is only a demonstration; we need to have the .crt and .key files and specify their paths. Note that the key of the local private certificate must be in the decrypted state; encrypted keys are not supported. There are very few such web sites now!

When you send a request, you get a response. In the examples above we obtained the content of the response with text and content; in addition, there are many other properties and methods for retrieving information such as the status code, response headers, cookies, and so on. The following is an example:

import requests

url = 'www.baidu.com'
req = requests.get(url)
print(req.status_code)  # the response status code
print(req.text)         # the text content of the response
print(req.content)      # the binary content of the response
print(req.cookies)      # the cookies of the response
print(req.encoding)     # the encoding of the response
print(req.headers)      # the headers of the response
print(req.url)          # the URL of the response
print(req.history)      # the request history of the response
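For example, requests follows redirects by default, and history records the intermediate responses; a minimal sketch (the exact hops depend on the site):

import requests

r = requests.get('http://github.com')
print(r.history)      # e.g. [<Response [301]>], the redirect from http to https
print(r.url)          # the final URL after redirects, e.g. https://github.com/
print(r.status_code)  # 200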

7.2 Viewing the status code and encoding
The status code returned by the server can be viewed with rqg.status_code, and the encoding that the Requests library guessed from the HTTP headers returned by the server can be viewed with rqg.encoding. Note that when Requests guesses the encoding incorrectly, the encoding needs to be specified manually. Code 1-2: send a GET request and manually specify the encoding.

url = 'www.tipdm.com/tipdm/index…'
rqg = requests.get(url)
print('status code:', rqg.status_code)
print('encoding:', rqg.encoding)
rqg.encoding = 'utf-8'  # manually specify the encoding
print('modified encoding:', rqg.encoding)

print(rqg.text)

status code 200
encoding ISO-8859-1
modified encoding utf-8

Note that manually specifying the encoding is not flexible: it cannot adapt to the different encodings of different pages during crawling, while using the chardet library is simple and flexible. The detect method of the chardet library can detect the encoding of a given byte string, with the following syntax:
chardet.detect(byte_str)
Common parameter of the detect method: byte_str receives bytes and represents the string whose encoding needs to be detected; it has no default value.

7.5 Using the detect method to detect the encoding and specify it
Code 1-3: use the detect method to detect the encoding and specify it.

import requests
import chardet

url = 'www.tipdm.com/tipdm/index…'
rqg = requests.get(url)
print(rqg.encoding)
print(chardet.detect(rqg.content))
rqg.encoding = chardet.detect(rqg.content)['encoding']

# Access the dictionary element

print(rqg.encoding)

ISO-8859-1
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
utf-8

7.6 A comprehensive test of the requests library
Send a complete GET request to the website 'www.tipdm.com/tipdm/index…', including the link, request headers, response headers, timeout, and status code, with the encoding set correctly. Code 1-6: generate a complete HTTP request.

# Import the relevant libraries
import requests
import chardet

# Set the URL
url = 'www.tipdm.com/tipdm/index…'

# Set the request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}

# Generate a GET request and set the timeout to 2 seconds
rqg = requests.get(url, headers=headers, timeout=2)

# View the status code
print('status code:', rqg.status_code)

# View the encoding
print('encoding:', rqg.encoding)

# Correct the encoding using the detect method of the chardet library
rqg.encoding = chardet.detect(rqg.content)['encoding']

# View the corrected encoding
print('corrected encoding:', rqg.encoding)

# View the response headers
print('response headers:', rqg.headers)

# View the web page content
print(rqg.text)

{'Date': 'Mon, 18 Nov 2019 06:28:56 GMT', 'Server': 'Apache-Coyote/1.1', 'Accept-Ranges': 'bytes', 'ETag': 'W/"15693-1562553126764"', 'Last-Modified': 'Mon, 08 Jul 2019 02:32:06 GMT', 'Content-Type': 'text/html', 'Content-Length': '15693', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'keep-alive'}