Requests Library Introduction:

Requests is the only Non-GMO HTTP library for Python, safe for human consumption. This tagline directly and defiantly declares that Requests is Python's best HTTP library.

Simple usage of Requests

The seven main methods of the Requests library
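For reference, here is a minimal sketch of all seven methods, using httpbin.org/anything as a stand-in test URL (an assumption; any test endpoint works):

import requests

url = 'https://httpbin.org/anything'  # stand-in test endpoint (assumption)
r = requests.request('GET', url)  # requests.request: the base method that the six below wrap
r = requests.get(url)             # requests.get: fetch a resource (HTTP GET)
r = requests.head(url)            # requests.head: fetch only the response headers (HTTP HEAD)
r = requests.post(url)            # requests.post: submit data to a resource (HTTP POST)
r = requests.put(url)             # requests.put: store or replace a resource (HTTP PUT)
r = requests.patch(url)           # requests.patch: partially modify a resource (HTTP PATCH)
r = requests.delete(url)          # requests.delete: delete a resource (HTTP DELETE)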


1. requests.get()

import requests  # import the Requests library
r = requests.get(url)  # send the request with the get method; the returned Response object, which contains the page data, is stored in r

Properties of the Response object:

  • r.status_code: the HTTP status code of the request; 200 indicates a successful connection
  • r.text: the text content of the response
  • r.content: the response body in binary (bytes) form
  • r.encoding: the response encoding guessed from the HTTP headers
  • r.apparent_encoding: the response encoding inferred from the content itself (an alternative encoding)

Take Zhihu as an example to demonstrate the properties above:

>>> import requests
>>> r = requests.get('https://www.zhihu.com/')
>>> r.status_code
500
>>> r.text   # output omitted
>>> r.content   # output omitted
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'ascii'
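The 500 above is most likely Zhihu rejecting the default Requests User-Agent rather than a server fault. A minimal sketch of retrying with a browser-like User-Agent header (the header value below is just an example):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # example browser-like UA
r = requests.get('https://www.zhihu.com/', headers=headers, timeout=20)
print(r.status_code)  # typically 200 once a browser-like User-Agent is sent
r.encoding = r.apparent_encoding  # switch to the encoding inferred from the content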

Hands-on practice

Analyzing the Douban short-comments page

First, use the browser's developer tools to analyze how the page loads. Only synchronously loaded data can be viewed directly in the page source; asynchronously loaded data cannot.



In the browser settings, change JavaScript from "allow" to "block" and refresh the page. If the page still loads normally, it is loaded synchronously; if it does not, it is loaded asynchronously.
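Alongside the browser method, the same distinction can be checked programmatically: if a piece of text visible on the rendered page also appears in the raw HTML fetched by Requests, that part of the page is loaded synchronously. A minimal sketch, assuming a marker string copied from a visible comment (the marker below is hypothetical):

import requests

url = 'https://book.douban.com/subject/27147922/'  # the Douban page analyzed below
marker = 'some text copied from a visible comment'  # hypothetical marker string
r = requests.get(url, timeout=20)
r.encoding = r.apparent_encoding
print(marker in r.text)  # True suggests synchronous loading; False suggests asynchronous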

Steps to download data using Requests

  • Import the Requests library
  • Specify the URL
  • Send the request with the get method
  • Print the returned text
  • Raise an exception on failure
import requests  # import the Requests library

url = 'https://book.douban.com/subject/27147922/?icn=index-editionrecommend'  # specify the URL
r = requests.get(url, timeout=20)  # send the request with the get method
# print(r.text)  # print the returned text
print(r.raise_for_status())  # raise an exception on failure; prints None on success
None


A general framework for crawling web pages

  • Define a function
  • Set a timeout
  • Handle exceptions
  • Call the function
import requests

# define the function
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=20)  # set a timeout
        r.raise_for_status()  # raise an HTTPError for bad status codes
        r.encoding = r.apparent_encoding  # use the encoding inferred from the content
        return r.text
    except:  # exception handling
        return "An exception occurred"

if __name__ == '__main__':
    url = ""
    print(getHTMLText(url))  # call the function

The crawler protocol

What is the crawler protocol?

The crawler protocol, also known as robots, is designed to tell web crawlers which pages may be crawled and which may not.

How to view a site's crawler protocol

Append robots.txt to the site's domain name. For example, to view Baidu's crawler protocol: www.baidu.com/robots.txt
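Python's standard library can also parse a robots.txt file and answer whether a given URL may be fetched. A minimal sketch using urllib.robotparser against Baidu's file:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()  # download and parse the robots.txt file
print(rp.can_fetch('*', 'https://www.baidu.com/s'))  # may a generic bot fetch this URL?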

Crawler protocol directives

Block all robots:
User-agent: *
Disallow: /

Allow all robots:
User-agent: *
Disallow:

Crawler advice

  • Crawl only publicly available data
  • Keep your request rate low (see the sketch after this list)
  • Follow the robots protocol as much as possible
  • Do not use crawled data for commercial purposes
  • Do not publish your crawler code or the data it collects
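As a concrete reading of the speed advice above, a minimal sketch of a rate-limited crawl, assuming a hypothetical list of page URLs:

import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']  # hypothetical URLs
for url in urls:
    r = requests.get(url, timeout=20)
    print(url, r.status_code)
    time.sleep(2)  # pause between requests to keep the crawl rate low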