High-performance asynchronous crawler
- Synchronous crawler
- Asynchronous crawler
- Thread pool principle
- Hands-on practice
Synchronous crawler
Example: the URLs are crawled in a blocking fashion; the next image is not fetched until the previous one has finished downloading. The whole process is single-threaded.
import requests

header = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
urls = [
    'http://pic.netbian.com/uploads/allimg/210122/195550-1611316550d711.jpg',
    'http://pic.netbian.com/uploads/allimg/180803/084010-15332568107994.jpg',
    'http://pic.netbian.com/uploads/allimg/190415/214606-15553359663cd8.jpg'
]

# Wrapper method that fetches the content of a URL
def get_content(url):
    print('Crawling', url)
    response = requests.get(url=url, headers=header)
    if response.status_code == 200:
        return response.content

def parse_content(content):
    print('The length of the response data is:', len(content))

for url in urls:
    content = get_content(url)
    parse_content(content)
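To make the blocking behavior visible, you can time the loop. A minimal sketch (the timing wrapper is an illustration added here, assuming the get_content/parse_content definitions above):

import time

start = time.time()
for url in urls:
    content = get_content(url)
    parse_content(content)
# Since each image is fetched only after the previous one finishes,
# the total is roughly the sum of the individual download times.
print('Total time:', time.time() - start)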
Asynchronous crawler
Approaches:

- Multi-threading / multi-processing (not recommended):
  - Benefits: a separate thread or process can be started for each blocking operation, so blocking operations execute asynchronously
  - Disadvantages: you cannot open an unlimited number of threads or processes
- Thread pool / process pool (use when appropriate):
  - Benefits: reduces the cost of creating and destroying processes or threads, thereby lowering system overhead
  - Disadvantages: there is an upper limit on the number of threads or processes in the pool
- Single thread + asynchronous coroutines (recommended):
  - event_loop: an infinite loop; functions can be registered on it and are executed when certain conditions are met
  - coroutine: a coroutine object that can be registered with the event loop and is called by it. A method defined with the async keyword is not executed immediately when called; instead it returns a coroutine object
  - task: a further encapsulation of a coroutine object, which tracks each state of the task
  - future: represents a task that will be executed or has not yet been executed; essentially the same as a task
  - async: defines a coroutine
  - await: suspends execution while waiting on a blocking method
Note
Inside coroutines, the aiohttp module must be used instead of requests for network requests, because requests.get() is synchronous, blocking code:
import aiohttp

# Use the ClientSession class from this module
async def get_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url=url) as response:
            # text() returns the response data as a string
            # read() returns the response data in binary form
            # json() returns the response data as a JSON object
            # Note: we must suspend with await before reading the response data
            page_text = await response.text()
            print(page_text)
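To actually issue several of these requests concurrently, register them all on one event loop. A minimal sketch reusing the get_page() coroutine defined above (the URLs here are placeholders):

import asyncio

urls = [
    'https://example.com/',
    'https://example.org/',
]

# gather() wraps each coroutine object in a task and runs them concurrently
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*[get_page(url) for url in urls]))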
Instantiating a thread pool object in Python:
from multiprocessing.dummy import Pool

# Instantiate a thread pool object
pool = Pool(4)
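A quick sketch of how the pool is then used: map() applies a blocking function to every element of a list, spreading the calls over the worker threads (the sleep()-based download stub is illustrative, not part of the original):

import time
from multiprocessing.dummy import Pool

def download(url):
    time.sleep(2)  # stand-in for a blocking network request
    return url

pool = Pool(4)
# The four simulated downloads run in parallel on the 4 worker threads,
# so this takes about 2 seconds instead of 8
results = pool.map(download, ['url1', 'url2', 'url3', 'url4'])
pool.close()
pool.join()
print(results)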
Coroutines:
import asyncio

async def request(url):
    print('request url:', url)

# A function defined with async returns a coroutine object when called
c = request('www.wzc.com')

# Create an event loop object
loop = asyncio.get_event_loop()

# Register the coroutine object with the event loop and start the loop
loop.run_until_complete(c)
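The task and future objects described earlier are thin wrappers around a coroutine. A minimal sketch extending the example above (the printed states, pending before the loop runs and finished afterwards, are what asyncio reports):

import asyncio

async def request(url):
    print('request url:', url)

loop = asyncio.get_event_loop()

# task: wrap a coroutine object in a Task
task = loop.create_task(request('www.wzc.com'))
print(task)  # <Task pending ...>

# future: ensure_future() produces essentially the same wrapper
future = asyncio.ensure_future(request('www.wzc.com'))

loop.run_until_complete(asyncio.gather(task, future))
print(task)  # <Task finished ...>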
Thread pool principle
A thread pool is used to handle blocking, time-consuming operations: the worker threads execute them concurrently instead of the main flow waiting on each one in turn (see the pool.map() sketch above; the hands-on example below applies the same idea to video downloads).
Hands-on practice
Crawl source website: www.pearvideo.com/category_8
Note: the steps before reaching an individual video page are fairly basic. On the detail page the video is loaded dynamically, so the data has to be fetched via an Ajax request.
The Ajax request needs the following parameters, and a Referer must be added to the headers:
post_url = 'https://www.pearvideo.com/videoStatus.jsp'
data = {
    'contId': id_,
    'mrd': str(random.random()),
}
ajax_headers = {
    'User-Agent': random.choice(user_agent_list),
    'Referer': 'https://www.pearvideo.com/video_' + id_
}
response = requests.post(post_url, data, headers=ajax_headers)
Code:
import requests
from lxml import etree
import random
import os
import time
from multiprocessing.dummy import Pool
user_agent_list = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    'Opera/8.0 (Windows NT 5.1; U; en)',
    'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]
# "Decrypt" the fake URL and return the correct video path
def videoUrlDeal(video_url, id_):
    # Build the real URL through string processing
    video_true_url = ''
    s_list = str(video_url).split('/')
    for i in range(0, len(s_list)):
        if i < len(s_list) - 1:
            video_true_url += s_list[i] + '/'
        else:
            ss_list = s_list[i].split('-')
            for j in range(0, len(ss_list)):
                if j == 0:
                    video_true_url += 'cont-' + id_ + '-'
                elif j == len(ss_list) - 1:
                    video_true_url += ss_list[j]
                else:
                    video_true_url += ss_list[j] + '-'
    return video_true_url
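# Example with hypothetical values: the Ajax response carries a fake srcUrl such as
#   https://video.pearvideo.com/mp4/third/20210122/1611316550000-11111111-hd.mp4
# The first '-'-separated piece of the file name must be replaced by
# 'cont-' + contId, yielding the playable address:
#   https://video.pearvideo.com/mp4/third/20210122/cont-1720119-11111111-hd.mp4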
def testPost(id_):
    post_url = 'https://www.pearvideo.com/videoStatus.jsp'
    data = {
        'contId': id_,
        'mrd': str(random.random()),
    }
    ajax_headers = {
        'User-Agent': random.choice(user_agent_list),
        'Referer': 'https://www.pearvideo.com/video_' + id_
    }
    response = requests.post(post_url, data, headers=ajax_headers)
    page_json = response.json()
    # print(page_json['videoInfo']['videos']['srcUrl'])
    return videoUrlDeal(page_json['videoInfo']['videos']['srcUrl'], id_)
# Store the video locally
def saveVideo(data):
    true_url = data[0]
    videoTitle = data[1]
    content = requests.get(url=true_url, headers=header).content
    with open('./video/' + videoTitle + '.mp4', 'wb') as fp:
        fp.write(content)
    print(true_url, videoTitle, 'saved successfully')
if __name__ == '__main__':
    # Create a folder to save all the videos
    if not os.path.exists('./video'):
        os.mkdir('./video')
    header = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)'}
    url = 'https://www.pearvideo.com/category_8'
    # Request the URL, then parse out the video detail-page URL and video name
    response = requests.get(url=url, headers=header)
    tree = etree.HTML(response.text)
    li_list = tree.xpath('//ul[@class="category-list clearfix"]/li')
    true_url_list = []
    for li in li_list:
        videoTitle = li.xpath('./div[@class="vervideo-bd"]/a/div[@class="vervideo-title"]/text()')[0]
        videoHref = 'https://www.pearvideo.com/' + li.xpath('./div[@class="vervideo-bd"]/a/@href')[0]
        # Request the detail-page URL (this page has since changed)
        # videoText = requests.get(url=videoHref, headers=header).text
        # Resolve the address of the video from the detail page by its id (taken from the URL)
        id_ = videoHref.split('_')[1]
        true_url_list.append((testPost(id_), videoTitle))
    # print(true_url_list)
    # Instantiate a thread pool object for multi-threaded saving
    pool = Pool(5)
    pool.map(saveVideo, true_url_list)
    pool.close()
    pool.join()
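One possible refinement, sketched below under the same variable names as the code above: only the downloads go through the pool, while the testPost() calls that resolve the real video URLs still run serially in the for loop. Since those requests are independent, they could be mapped over the pool as well:

# Collect (id, title) pairs first, then resolve the real URLs in parallel
id_title_list = []
for li in li_list:
    videoTitle = li.xpath('./div[@class="vervideo-bd"]/a/div[@class="vervideo-title"]/text()')[0]
    videoHref = 'https://www.pearvideo.com/' + li.xpath('./div[@class="vervideo-bd"]/a/@href')[0]
    id_title_list.append((videoHref.split('_')[1], videoTitle))

pool = Pool(5)
# multiprocessing.dummy uses threads, so a lambda is fine here
true_url_list = pool.map(lambda t: (testPost(t[0]), t[1]), id_title_list)
pool.map(saveVideo, true_url_list)
pool.close()
pool.join()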
Crawl results: