Python multithreaded crawler: quickly batch-download images

1. Modules that need to be imported

urllib, random, queue, threading, time, os, and json

Installing third-party modules

Press Win+R, type cmd, and open a command window.

Note, though, that every module listed above ships with the Python 3 standard library, so nothing here actually needs installing. (The separate third-party urllib3 package, should you want it, is installed with `pip install urllib3`, but this script only uses the standard-library urllib.)
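As a quick sanity check, the following imports should all succeed on a stock Python 3 install, with no pip step:

```python
# Every module this crawler needs ships with Python 3 -- no pip install required.
import urllib.parse
import urllib.request
import random, queue, threading, time, json, os

print('all modules available')
```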

 

2. How to crawl images with multiple threads

First, we need to go to this site:

www.quanjing.com/

Then enter a keyword to reach the search results page; for example, I entered "landscape".

Inspecting the page source shows that the download links for these images are not written directly into the HTML. Right-click the page, choose Inspect, open the Network tab, select XHR, and press F5 to refresh; the download links for the images can then be found under XHR.

So all we need is this URL to get the download links for all the images on the page:

www.quanjing.com/Handler/Sea…

Analyzing several such URLs shows that the parameter after 't=' is a four-digit random number, the parameter after 'q=' is the keyword you typed (here, landscape) in URL-encoded form, and the parameter after 'pagenum=' is the page number. 'pagesize=100' indicates that each page holds 100 images, and the total number of pages is shown in the response.
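As a small illustration of how 'q=' is encoded (this uses the same standard-library urlencode call the script relies on later):

```python
from urllib.parse import urlencode

# The keyword is percent-encoded into the q= parameter;
# non-ASCII input (e.g. Chinese) becomes %XX escapes.
print(urlencode({'q': 'landscape'}))  # q=landscape
print(urlencode({'q': '风景'}))        # q=%E9%A3%8E%E6%99%AF
```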

 

The value of the last parameter ('_=') is a timestamp in milliseconds, obtained with a little string processing.
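For reference, here is a minimal sketch of what that value is. The article's code builds it by string-splicing time.time(); an equivalent one-liner, shown here only for clarity, is:

```python
import time

# 13-digit millisecond timestamp, e.g. '1700000000000'
timestamp = str(int(time.time() * 1000))
print(timestamp)
```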

With this, we can build the URL for every page we want to crawl. When actually requesting it, though, we find that even with the correct URL we get no data back. The fix lies in the request headers: after a good deal of experimenting, it turns out we only need to add a Referer header.

To make the server think we are visiting from a browser, we also add a User-Agent header.

The code is as follows:

```python
def get_time():
    # Splice time.time() into a 13-digit millisecond timestamp:
    # the integer seconds plus the first three decimal digits.
    str_time = str(time.time())
    str_time = str_time[:str_time.find('.')] + str_time[str_time.find('.') + 1:str_time.find('.') + 4]
    time.sleep(1.25)  # pause so consecutive calls never return the same timestamp
    return str_time


def get_url():
    keyword = input('Enter the type of image to download: ')
    key_word = parse.urlencode({'q': keyword})
    num = int(input('Enter the number of pages to crawl: '))
    headers = {
        "Referer": "https://www.quanjing.com/search.aspx?%s" % key_word,
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400",
    }
    url = ('https://www.quanjing.com/Handler/SearchUrl.ashx?'
           't=%s&callback=searchresult&%s&stype=1&pagesize=100&pagenum=%s'
           '&imageType=2&imageColor=&brand=&imageSType=&fr=1&sortFlag=1'
           '&imageUType=&btype=&authid=&_=%s')
    list_url = []
    for i in range(1, num + 1):
        str_1 = str(random.random())
        random_1 = str_1[str_1.find('.') + 1:str_1.find('.') + 5]  # four random digits for the t= parameter
        time_1 = get_time()
        url_1 = url % (random_1, key_word, i, time_1)
        list_url.append(url_1)
    return list_url, headers, keyword
```

 

This gives us the download links for the images, and all that remains is to download them across multiple threads. During the multithreaded download, however, the number of files actually saved turns out to be far lower than the number of images requested. The cause is duplicate names: images sharing a caption overwrite each other, so we add a random number in front of each file name. (A sturdier alternative is sketched below.)
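The script below handles collisions with that random-number prefix. Purely as an alternative sketch (uuid and the unique_name helper are my additions, not part of the original script), a short uuid prefix makes name collisions practically impossible:

```python
import uuid

# unique_name is a hypothetical helper: an 8-character uuid prefix
# keeps files with identical captions from overwriting each other.
def unique_name(caption: str) -> str:
    return '{}-{}.png'.format(uuid.uuid4().hex[:8], caption)

print(unique_name('landscape'))  # e.g. '3fa85f64-landscape.png'
```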

Run:

After you finish entering the inputs, the program may end on its own before reaching the number of images you asked for; if so, it is worth trying a few more times.

When it finishes, you can check that a new folder has appeared under the current directory containing the images you downloaded.

To know the number of images we have downloaded, we can do this:

```python
import os

list_1 = os.listdir(r'E:\Pycharm_1\reptile')  # path to the folder holding the downloads
for i in range(len(list_1)):
    print(i + 1, list_1[i])
```
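If only the total matters, len() on the same listing is enough:

```python
print(len(list_1))  # total number of downloaded files, e.g. 400
```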

 

Running results:

So there are 400 images.

3. The complete code is as follows

```python
import urllib.parse as parse
from urllib import request
import random
from queue import Queue
import threading
import time
import json
import os


def get_time():
    # Splice time.time() into a 13-digit millisecond timestamp.
    str_time = str(time.time())
    str_time = str_time[:str_time.find('.')] + str_time[str_time.find('.') + 1:str_time.find('.') + 4]
    time.sleep(1.25)  # pause so consecutive calls never return the same timestamp
    return str_time


def get_url():
    keyword = input('Enter the type of image to download: ')
    key_word = parse.urlencode({'q': keyword})
    num = int(input('Enter the number of pages to crawl: '))
    headers = {
        "Referer": "https://www.quanjing.com/search.aspx?%s" % key_word,
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400",
    }
    url = ('https://www.quanjing.com/Handler/SearchUrl.ashx?'
           't=%s&callback=searchresult&%s&stype=1&pagesize=100&pagenum=%s'
           '&imageType=2&imageColor=&brand=&imageSType=&fr=1&sortFlag=1'
           '&imageUType=&btype=&authid=&_=%s')
    list_url = []
    for i in range(1, num + 1):
        str_1 = str(random.random())
        random_1 = str_1[str_1.find('.') + 1:str_1.find('.') + 5]  # four random digits for t=
        time_1 = get_time()
        url_1 = url % (random_1, key_word, i, time_1)
        list_url.append(url_1)
    return list_url, headers, keyword


tuple_1 = get_url()
list_url, headers, keyword = tuple_1[0], tuple_1[1], tuple_1[2]
queue_url = Queue(len(list_url) * 100 + 5)
queue_img = Queue(len(list_url) * 100 + 5)

try:
    num = 1
    for i in range(len(list_url)):
        request_1 = request.Request(url=list_url[i], headers=headers)
        content = request.urlopen(request_1)
        str_1 = content.read().decode('utf-8')               # JSONP response as a string
        str_1 = str_1[str_1.find('(') + 1:str_1.rfind(')')]  # strip the searchresult(...) wrapper
        dict_1 = json.loads(str_1)
        images_list = dict_1['imglist']
        for j in range(len(images_list)):
            print('[{}] - {}'.format(num, images_list[j]['caption']))
            queue_url.put(images_list[j]['imgurl'])
            queue_img.put(images_list[j]['caption'])
            num += 1

    def download(queue_url: Queue, queue_img: Queue):
        path_1 = './' + keyword
        try:
            os.mkdir(path_1)
        except FileExistsError:
            pass
        finally:
            while True:
                if queue_url.empty():
                    break
                image_name = queue_img.get()
                # prefix a random number so images with identical captions do not overwrite each other
                request.urlretrieve(url=queue_url.get(),
                                    filename=path_1 + '/{}{}.png'.format(random.random() * 1000, image_name))
                print('Thread {} is downloading [{}]'.format(threading.current_thread().getName(), image_name))
                time.sleep(0.25)

    threading_list = []
    print('Starting download!')
    time.sleep(5)
    for i in range(len(list_url) * 5):  # five download threads per page
        threading_1 = threading.Thread(target=download, args=(queue_url, queue_img,))
        threading_1.start()
        threading_list.append(threading_1)
    for i in threading_list:
        i.join()
    print('------------------------ Download complete! ------------------------')
except Exception as e:
    print(e, 'Download failed!')
```
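For comparison, the same five-threads-per-page fan-out can be expressed with the standard library's concurrent.futures instead of hand-managed Queue objects. This is a minimal sketch under stated assumptions, not the article's code: download_one and the jobs list are stand-ins for the queue contents above.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib import request

def download_one(job):
    # job is a (url, filename) pair standing in for the queue entries above
    url, filename = job
    request.urlretrieve(url=url, filename=filename)
    return filename

# Hypothetical work items; in the real script these come from the JSON response.
jobs = [('https://www.quanjing.com/some-image.jpg', './landscape/0001.jpg')]

with ThreadPoolExecutor(max_workers=5) as pool:
    for done in pool.map(download_one, jobs):
        print('finished', done)
```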