“This is the 10th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

🌹 Eraser 🌹 This column is a series of crawler lessons. Every case in the column uses the Requests library; by working through 9 cases you will gain a deep understanding of it. That is the core objective of this series.

This series assumes basic Python syntax. Data matching uses Python's re module, so a basic grasp of regular expressions is also required. The crawling environment needs Python 3.5 or above with the requests module installed.

Pre-crawler analysis

Category page analysis

The site we want to crawl in this lesson is the Q&A module of the parenting website (ask.ci123.com/), and we want to collect the information in the red box.

There are many types of questions on this site; the specific categories can be found in the menu on the left of the page linked above, as shown in the following figure:

A little analysis is needed here to find the pattern in the category page addresses. If there were no pattern, the first step would be to collect every category address one by one. Clicking through the links shows that the category list pages look like this:

http://ask.ci123.com/categories/show/2
http://ask.ci123.com/categories/show/3
http://ask.ci123.com/categories/show/4
http://ask.ci123.com/categories/show/{category ID}

Don't jump to the conclusion that the IDs increase sequentially. If a crawler assumes a rule too early, it can easily lose data, so keep checking.

You can also inspect the page source directly and look at all the addresses; of course, we can crawl them after reading the page as well. In the end every address has the form http://ask.ci123.com/categories/show/{category ID}, only the category IDs are not continuous. That completes the analysis of the question categories.

Question list page analysis

Now we need to find the pattern of the list pages. After clicking any category, the page data looks like the picture below:

The first thing to do is find the paging pattern: locate the paging area, click the pages one by one, and note the different page addresses.

The link pattern finally found is as follows:

http://ask.ci123.com/categories/show/4/all?p={page number}

The page-number pattern alone is not enough; we also need to find the last page. A simple search in the page source locates the total page number.

The analysis before the crawler is finished. Now we start coding the crawler logic, that is, sorting out our ideas.

Logical encoding (pseudocode)

The parenting-site crawler is divided into the following steps:

  1. From the ask.ci123.com/ home page, get all the category page addresses
  2. Loop through all category page addresses
  3. Get the list page and total page count for each category
  4. Loop from the first page to the total page count
  5. Crawl the required data from each page in that loop

With the idea sorted out, coding is just a straightforward implementation.
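Before the real code, here is a minimal Python skeleton of that flow; the function names (get_category, get_list, get_detail) match the ones implemented later in this post, and the bodies are only placeholders.

# Skeleton of the crawler flow described above; filled in step by step below
def get_category():
    # Steps 1-2: fetch the home page, collect all category IDs,
    # then loop over them and call get_list() for each one
    pass


def get_list(cate):
    # Steps 3-5: fetch the category's list page, read the total page count,
    # loop from page 1 to that count and hand each page to get_detail()
    pass


def get_detail(text):
    # Parse the fields we need from one list page, then store them
    pass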

Formal crawler coding

Requests library get() method description

The Requests library is easy to import and apply, so let's look at the basic usage by grabbing the source code of the category page.

import requests

url = "http://ask.ci123.com/"

# Grab the category page
def get_category():
    res = requests.get("http://ask.ci123.com/")
    print(res.text)


if __name__ == "__main__":
    get_category()

Get () is the request module’s request module that gets the source code for the web site. The parameters in this method are as follows:

The url parameter (mandatory)

requests.get(url="http://ask.ci123.com/")

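As a quick aside, the Response object returned by get() exposes several attributes that this series uses; a minimal sketch with the URL from this case:

import requests

res = requests.get(url="http://ask.ci123.com/")
print(res.status_code)   # HTTP status code, e.g. 200
print(res.encoding)      # encoding requests guessed for the page
print(res.url)           # the final URL that was requested
print(res.text[:200])    # first 200 characters of the page source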

Passing URL parameters

Using this parameter, you can construct URLs such as https://www.baidu.com/s?wd=hello&rsv_spt=1&rsv_iqid=0x8dd347e100002e04. The usage is as follows:

import requests
payload = {'key1': 'value1', 'key2': 'value2'}
res = requests.get(url="http://ask.ci123.com/", params=payload)
print(res.url)

Here key1 is a parameter name and value1 is its value.
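With the payload above, res.url should print something like http://ask.ci123.com/?key1=value1&key2=value2, that is, the dictionary is serialized into the query string for you.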

Custom request headers

During crawling we try our best to make the crawler look like a real user visiting the site through a browser, so we often need to customize the request headers. The format is as follows:

import requests
payload = {'key1': 'value1', 'key2': 'value2'}
headers = {
    'user-agent': 'Baiduspider-image+(+http://www.baidu.com/search/spider.htm)'
}
res = requests.get(url="http://ask.ci123.com/",
                   params=payload, headers=headers)
print(res.url)


Much more can be configured in headers, but this post will not expand on it; just remember the headers parameter for now.
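For reference, additional fields go into the same dictionary in exactly the same way; the values below are purely illustrative, not values the site requires:

import requests

headers = {
    'user-agent': 'Baiduspider-image+(+http://www.baidu.com/search/spider.htm)',
    'referer': 'http://ask.ci123.com/',     # page we claim to have come from
    'accept-language': 'zh-CN,zh;q=0.9',    # preferred response language
}
res = requests.get(url="http://ask.ci123.com/", headers=headers)
print(res.request.headers)   # the headers that were actually sent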

Cookie

Cookies are essential in many crawlers; they sometimes store encrypted information and sometimes user information. The format is as follows:

import requests
payload = {'key1': 'value1', 'key2': 'value2'}
headers = {
    'user-agent': 'Baiduspider-image+(+http://www.baidu.com/search/spider.htm)'
}
cookies = dict(my_cookies='nodream')
res = requests.get(url="http://ask.ci123.com/",
                   params=payload, headers=headers, cookies=cookies)
print(res.text)
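Cookies set by the server can also be read back from the response, and a Session object carries them between requests automatically; a small sketch:

import requests

res = requests.get("http://ask.ci123.com/")
print(res.cookies.get_dict())   # cookies the server set on this response

# A Session keeps cookies (and other settings) across requests automatically
s = requests.Session()
s.get("http://ask.ci123.com/")
print(s.cookies.get_dict())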

Disable redirection processing

Some sites respond with redirects; when crawling, you sometimes need to disable the automatic jump. The code is as follows:

r = requests.get('http://github.com', allow_redirects=False)
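With redirects disabled, the response you get is the redirect itself, so you can inspect the status code and the Location header; assuming http://github.com still answers with a redirect, a sketch like this shows it:

import requests

r = requests.get('http://github.com', allow_redirects=False)
print(r.status_code)              # typically 301 for this URL
print(r.headers.get('Location'))  # where the server wants to send us
print(r.history)                  # empty, because we did not follow the jump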

Timeout

For a network request, sometimes no response ever arrives. This is explained in the advanced section of the official manual; beginners can skip the advanced usage of timeouts for now.

Most requests sent to external servers should take a timeout parameter in case the server fails to respond in a timely manner. By default, requests will not be automatically timed out unless a timeout value is explicitly specified. Without a timeout, your code may hang for several minutes or more.

The general code is as follows:

import requests
payload = {'key1': 'value1', 'key2': 'value2'}
headers = {
    'user-agent': 'Baiduspider-image+(+http://www.baidu.com/search/spider.htm)'
}
cookies = dict(my_cookies='nodream')
res = requests.get(url="http://ask.ci123.com/",
                   params=payload, headers=headers, cookies=cookies, timeout=3)
print(res.text)

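If the server does not answer within the timeout, requests raises requests.exceptions.Timeout, which you can catch and handle; a minimal sketch:

import requests

try:
    res = requests.get("http://ask.ci123.com/", timeout=3)
    print(res.status_code)
except requests.exceptions.Timeout:
    print("The request timed out after 3 seconds")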

Advanced parameters

There are also some parameters for the GET method that we may use in future blogs, such as:

  • SSL certificate verification (verify)
  • Client certificates (cert)
  • Event hooks (hooks)
  • Custom authentication (auth)
  • Streaming requests (stream)
  • Proxies (proxies)

All of these arguments can be passed to the get() method, which makes the Requests library very powerful.
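As a small preview, all of these options are passed as extra keyword arguments to get(); the proxy address and account below are placeholders only, not working values:

import requests

proxies = {'http': 'http://127.0.0.1:8888',     # placeholder proxy address
           'https': 'http://127.0.0.1:8888'}

res = requests.get(
    "http://ask.ci123.com/",
    proxies=proxies,           # route the request through a proxy
    verify=True,               # verify the SSL certificate (the default)
    auth=('user', 'passwd'),   # HTTP basic authentication, placeholder account
    stream=False,              # download the response body immediately
    timeout=3,
)
print(res.status_code)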

Get all category page addresses

With that groundwork, using the Requests library to fetch page content is fairly easy. You also need some Python basics: the re module and regular expressions. The specific crawl code is as follows:

import requests
import re

url = "http://ask.ci123.com/"
headers = {
    'user-agent': 'Baiduspider-image+(+http://www.baidu.com/search/spider.htm)'
}
# Grab the category page
def get_category():
    res = requests.get("http://ask.ci123.com/", headers=headers)
    pattern = re.compile(
        r'<li><a href="/categories/show/(\d+)">', re.S)
    categories_ids = pattern.findall(res.text)
    print("The obtained category IDs are:", categories_ids)

if __name__ == "__main__":
    get_category()


Loop through all category page addresses

The code above uses the findall method of the re module to obtain all the category IDs, which are then used to build the list pages to crawl. Once you have the IDs, you can fetch all the list pages in a loop as follows:

# Grab the category page
def get_category():
    res = requests.get("http://ask.ci123.com/", headers=headers)
    pattern = re.compile(
        r'<li><a href="/categories/show/(\d+)">', re.S)
    categories_ids = pattern.findall(res.text)
    print("The obtained category IDs are:", categories_ids)
    for cate in categories_ids:
        # Call get_list() for each category
        get_list(cate)
        time.sleep(1)

The code above adds a delay, time.sleep(), to reduce the chance of triggering the site's anti-crawling measures (remember to import time).

Get the list page and total page count for each category

Open a list page; the first goal is to get the total page count. In this case it is fairly simple: the value appears directly in the list page source, so we grab it straight from there.

def get_list(cate):
    # Get the total page count, then loop through all pages
    res = requests.get(
        f"http://ask.ci123.com/categories/show/{cate}", headers=headers)

    # The exact regex was lost from the original post; it matches the total
    # page count in the pager section of the page source and must be restored
    # from that source before this runs.
    pattern = re.compile(
        r'', re.S)
    totle = pattern.search(res.text).group(1)
    for page in range(1, int(totle)):
        print(f"http://ask.ci123.com/categories/show/{cate}/all?p={page}")
        time.sleep(0.2)


Loop from 1 to the total page count

This part is relatively easy and has already been implemented in the code above; running it prints the list-page addresses one by one.

The end of this case

The remaining work is easy: parse the data on each page and store it. The storage code is not written below, but the crawling code is complete. It contains one fairly large regular expression that you can study on your own; since the crawled data does not need to be matched very strictly, it makes heavy use of common metacharacters such as .* and \s.

import requests
import re
import time

url = "http://ask.ci123.com/"
headers = {
    'user-agent': 'Baiduspider-image+(+http://www.baidu.com/search/spider.htm)'
}


def get_detail(text):
    # Parse the page data, then store it
    # The large regex below was partially lost from the original post; only
    # the metacharacter skeleton of its capture groups survives.
    pattern = re.compile(
        r'[.\s]*.*?\s*(\d+).*?\s*(.*?)\s*(.*?)\s*'
    )
    data = pattern.findall(text)
    print(data)
    # Data storage code is not written


def get_list(cate):
    # Get the total page count, then loop through all pages
    res = requests.get(
        f"http://ask.ci123.com/categories/show/{cate}", headers=headers)
    # The exact pattern was lost from the original post; it matches the total
    # page count in the page source.
    pattern = re.compile(r'', re.S)
    totle = pattern.search(res.text).group(1)
    for page in range(1, int(totle)):
        print(f"http://ask.ci123.com/categories/show/{cate}/all?p={page}")
        res = requests.get(
            f"http://ask.ci123.com/categories/show/{cate}/all?p={page}",
            headers=headers)
        time.sleep(0.2)
        # Call the list-page data extraction function
        get_detail(res.text)


# Grab the category page
def get_category():
    res = requests.get("http://ask.ci123.com/", headers=headers)
    pattern = re.compile(
        r'<li><a href="/categories/show/(\d+)">', re.S)
    categories_ids = pattern.findall(res.text)
    print("The obtained category IDs are:", categories_ids)
    for cate in categories_ids:
        get_list(cate)
        time.sleep(1)


if __name__ == "__main__":
    get_category()