High-performance asynchronous crawler

  • Synchronous crawler
  • Asynchronous crawler
    • Thread pool principle
    • Hands-on example

Synchronous crawler

Example: the URLs are crawled one after another, blocking; the next image is not fetched until the previous one has finished. This is single-threaded.

import requests

header = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}

urls = [
    'http://pic.netbian.com/uploads/allimg/210122/195550-1611316550d711.jpg',
    'http://pic.netbian.com/uploads/allimg/180803/084010-15332568107994.jpg',
    'http://pic.netbian.com/uploads/allimg/190415/214606-15553359663cd8.jpg'
]

# Wrapper function that fetches the content at a URL
def get_content(url):
    print('Crawling', url)
    response = requests.get(url=url, headers=header)
    if response.status_code == 200:
        return response.content

def parse_content(content):
    print('The length of the response data is:', len(content))

for url in urls:
    content = get_content(url)
    parse_content(content)

Asynchronous crawler

Approaches:

  • Multi-threading, multi-processing (not recommended):

    • Benefits: a separate thread or process can be started for each blocking operation, so blocking operations run asynchronously
    • Drawbacks: you cannot open an unlimited number of threads or processes
  • Thread pool, process pool (use when appropriate):

    • Benefits: reduces the cost of creating and destroying processes or threads, and so lowers system overhead
    • Drawbacks: there is an upper limit on the number of threads or processes in the pool
  • Single thread + asynchronous coroutines (recommended):

    • event_loop: an infinite loop; functions can be registered on it and are executed when certain conditions are met
    • coroutine: a coroutine object that can be registered with an event loop and is called by the event loop. A method defined with the async keyword is not executed immediately when called; instead it returns a coroutine object
    • task: a further encapsulation of a coroutine object that tracks the task's state
    • future: represents a task that will be executed or has not yet been executed; essentially the same as a task
    • async: defines a coroutine
    • await: suspends execution at a blocking call
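
A minimal sketch of how these pieces fit together (the `fetch` coroutine and its URL are illustrative placeholders; `asyncio.run` requires Python 3.7+):

```python
import asyncio

async def fetch(url):
    # await suspends this coroutine; the event loop is free to run other work
    await asyncio.sleep(0.1)
    return 'data from ' + url

async def main():
    # wrap the coroutine object in a Task so the event loop can schedule it
    task = asyncio.ensure_future(fetch('www.example.com'))
    print('task finished before await?', task.done())
    result = await task
    print('task finished after await?', task.done())
    return result

result = asyncio.run(main())
print(result)
```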

Note

In coroutines you must use the aiohttp module instead of requests for network requests, because requests is synchronous (blocking) code:

import aiohttp
# Use the ClientSession class from this module
async def get_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url=url) as response:
            # text() returns the response data as a string
            # read() returns the response data in binary form
            # json() returns a JSON object
            # Note: we must suspend with await before reading the response data
            page_text = await response.text()
            print(page_text)

Instantiating a thread pool object in Python:

from multiprocessing.dummy import Pool
# Instantiate a thread pool object
pool = Pool(4)

Coroutines:

import asyncio

async def request(url):
    print('the request url:', url)
# A function defined with async returns a coroutine object when called
c = request('www.wzc.com')
# Create an event loop object
loop = asyncio.get_event_loop()
# Register the coroutine object with the event loop and start the loop
loop.run_until_complete(c)
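
The single-coroutine example scales to many coroutines by wrapping each one in a task and awaiting them together. A minimal sketch, assuming Python 3.7+ and using `asyncio.sleep` as a stand-in for a real non-blocking request (the URLs are placeholders):

```python
import asyncio
import time

async def request(url):
    # asyncio.sleep simulates a 1-second non-blocking network request
    await asyncio.sleep(1)
    return 'done: ' + url

async def main():
    # wrap each coroutine object in a Task and run them concurrently
    tasks = [asyncio.ensure_future(request(u)) for u in ['url1', 'url2', 'url3']]
    return await asyncio.gather(*tasks)

start = time.time()
results = asyncio.run(main())
print(results)
print('elapsed: %.1f s' % (time.time() - start))  # about 1 s, not 3 s
```

Because all three coroutines suspend on the same event loop, the total time is roughly one sleep, not three.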

Thread pool principle

A thread pool is used to handle blocking, time-consuming operations.
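
A minimal sketch of the idea, with a 1-second `time.sleep` standing in for a blocking download:

```python
import time
from multiprocessing.dummy import Pool  # a thread pool, despite the module name

def blocking_task(name):
    # simulate a blocking, time-consuming operation (e.g. a download)
    time.sleep(1)
    return name

start = time.time()
pool = Pool(4)
# map() distributes the blocking calls across the pool's 4 worker threads
results = pool.map(blocking_task, ['a', 'b', 'c', 'd'])
pool.close()
pool.join()
print(results)
print('elapsed: %.1f s' % (time.time() - start))  # roughly 1 s instead of 4 s
```

With 4 worker threads and 4 tasks, all sleeps overlap, so the wall-clock time is about one task's duration.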

Hands-on example

Crawl source website: www.pearvideo.com/category_8

Note: the steps before reaching an individual video page are quite basic. On the video page itself the video is loaded dynamically, so the data must be transferred and fetched via an Ajax request.



Parameters that must be sent when requesting the Ajax data; the Referer must also be added to the headers:

post_url = 'https://www.pearvideo.com/videoStatus.jsp'
data = {
    'contId': id_,
    'mrd': str(random.random())
}
ajax_headers = {
    'User-Agent': random.choice(user_agent_list),
    'Referer': 'https://www.pearvideo.com/video_' + id_
}
response = requests.post(post_url, data, headers=ajax_headers)

Code:

import requests
from lxml import etree
import random
import os
import time
from multiprocessing.dummy import Pool

user_agent_list = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    'Opera/8.0 (Windows NT 5.1; U; en)',
    'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]

# Decrypt the URL and return the correct video path
def videoUrlDeal(video_url, id_):
    # Get the real URL by doing string processing
    video_true_url = ''
    s_list = str(video_url).split('/')
    for i in range(0, len(s_list)):
        if i < len(s_list) - 1:
            video_true_url += s_list[i] + '/'
        else:
            ss_list = s_list[i].split('-')
            for j in range(0, len(ss_list)):
                if j == 0:
                    video_true_url += 'cont-' + id_ + '-'
                elif j == len(ss_list) - 1:
                    video_true_url += ss_list[j]
                else:
                    video_true_url += ss_list[j] + '-'
    return video_true_url
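
# Example (hypothetical srcUrl; the random leading number of the file name
# is replaced by 'cont-' + id_):
#   videoUrlDeal('https://video.pearvideo.com/mp4/third/20210101/1609999999999-12345678-hd.mp4', '1715665')
#   returns 'https://video.pearvideo.com/mp4/third/20210101/cont-1715665-12345678-hd.mp4'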

def testPost(id_):
    post_url = 'https://www.pearvideo.com/videoStatus.jsp'
    data = {
        'contId': id_,
        'mrd': str(random.random()),

    }
    ajax_headers = {
        'User-Agent': random.choice(user_agent_list),
        'Referer':'https://www.pearvideo.com/video_' + id_
    }
    response = requests.post(post_url, data, headers=ajax_headers)
    page_json = response.json()
    # print(page_json['videoInfo']['videos']['srcUrl'])
    return videoUrlDeal(page_json['videoInfo']['videos']['srcUrl'], id_)

# Store the video locally
def saveVideo(data):
    true_url = data[0]
    videoTitle = data[1]
    content = requests.get(url=true_url, headers=header).content
    with open('./video/' + videoTitle + '.mp4', 'wb') as fp:
        fp.write(content)
        print(true_url, videoTitle, 'Saved successfully')

if __name__ == '__main__':
    # Create a folder to save all the videos
    if not os.path.exists('./video'):
        os.mkdir('./video')
    header = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
    url = 'https://www.pearvideo.com/category_8'

    # Request the URL and parse out each video's detail-page URL and title
    response = requests.get(url=url, headers=header)
    tree = etree.HTML(response.text)
    li_list = tree.xpath('//ul[@class="category-list clearfix"]/li')
    true_url_list = []
    for li in li_list:
        videoTitle = li.xpath('./div[@class="vervideo-bd"]/a/div[@class="vervideo-title"]/text()')[0]
        videoHref = 'https://www.pearvideo.com/' + li.xpath('./div[@class="vervideo-bd"]/a/@href')[0]

        # request the details page URL (this page has since changed)
        # videoText = requests.get(url=videoHref, headers=header).text
        # Resolve the video address from the details page via the id in the URL
        id = videoHref.split('_')[1]
        true_url_list.append((testPost(id),videoTitle))
    # print(true_url_list)

    # Instantiate a thread pool object for multithreaded saving
    pool = Pool(5)
    pool.map(saveVideo, true_url_list)
    pool.close()
    pool.join()

Crawl results: