Quickly batch-download images with a multithreaded Python crawler
1. Modules that need to be imported
urllib, random, queue, threading, time, os, and json
Installing third-party modules
Press Win+R, type cmd, and open a command window. If you ever need a third-party package, this is where you would run `pip install <package>`. Note, however, that all of the modules listed above ship with the Python standard library, so nothing actually needs installing here; in particular, `pip install urllib3` installs the separate third-party urllib3 library, which this crawler does not use.
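Since everything used in this article is standard library, a quick way to confirm your environment is ready is simply to try the imports; if the following runs without error, nothing needs to be installed:

```python
# Quick check that every module used in this article is available.
# All of these ship with the Python standard library.
import urllib.parse
import urllib.request
import random
import queue
import threading
import time
import os
import json

print('All modules available')
```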
2. How to crawl images with multiple threads
First, we need to go to this site:
www.quanjing.com/
Then enter a keyword to reach the search results page; for example, I entered landscape.
Checking the page source, we find that the download links for these images are not written directly into the HTML. Right-click the page, click Inspect, open the Network tab, select XHR, and press F5 to refresh; the download links for the images then appear under XHR.
So all we need is this URL to get the download links for every image on the page:
www.quanjing.com/Handler/Sea…
By analyzing several such URLs, we find that the parameter after 't=' is a four-digit random number, the parameter after 'q=' is the image keyword you entered (here, landscape), only URL-encoded, and the parameter after 'pagenum=' is the page number. The parameter **'pagesize=100'** indicates that each page holds 100 images, and the total number of pages is shown on the results page.
The value of the last parameter ('_=') is a timestamp, obtained by processing the current time.
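To make that parameter analysis concrete, here is a minimal sketch of producing each piece. It uses `random.randint` and `time.time()` directly; the article's own code below derives the same digits by slicing string representations instead, which is equivalent:

```python
from urllib import parse
import random
import time

# 'q=' is just the URL-encoded keyword
print(parse.urlencode({'q': 'landscape'}))  # q=landscape (non-ASCII keywords become %XX escapes)

# 't=' is a four-digit random number
print(random.randint(1000, 9999))

# the trailing '_=' parameter is a millisecond timestamp
print(int(time.time() * 1000))
```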
With these URLs we can crawl multiple pages of images. However, it turns out that even with the correct URL we cannot get the data directly. Looking at the request headers this site expects, many experiments show that we only need to add a `Referer:` header to the request.
And to make the server believe we are visiting from a browser, we also add a `User-Agent` header to the request.
The code is as follows:
```python
def get_time():
    # build a millisecond timestamp: the seconds plus the first three decimal digits
    str_time = str(time.time())
    str_time = str_time[:str_time.find('.')] + str_time[str_time.find('.') + 1:str_time.find('.') + 4]
    time.sleep(1.25)  # pause so two consecutive calls never produce the same timestamp
    return str_time
```
```python
def get_url():
    keyword = input('Enter the type of image to download: ')
    key_word = parse.urlencode({'q': keyword})
    num = int(input('Enter the number of pages to crawl: '))
    headers = {
        "Referer": "https://www.quanjing.com/search.aspx?%s" % (key_word),
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
    }
    url = 'https://www.quanjing.com/Handler/SearchUrl.ashx?t=%s&callback=searchresult&%s&stype=1&pagesize=100' \
          '&pagenum=%s&imageType=2&imageColor=&brand=&imageSType=&fr=1&sortFlag=1&imageUType=&btype=&authid=&_=%s'
    list_url = []
    for i in range(1, num + 1):
        str_1 = str(random.random())
        random_1 = str_1[str_1.find('.') + 1:str_1.find('.') + 5]  # four random digits for 't='
        time_1 = get_time()
        url_1 = url % (random_1, key_word, i, time_1)
        list_url.append(url_1)
    return list_url, headers, keyword
```
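Requesting one of these URLs returns a JSONP payload, i.e. JSON wrapped in a `searchresult(...)` call. Here is a minimal sketch of unwrapping one response, using the `imglist`, `imgurl`, and `caption` fields that appear in the complete code below:

```python
import json
from urllib import request


def fetch_image_links(url_1, headers):
    req = request.Request(url=url_1, headers=headers)
    raw = request.urlopen(req).read().decode('utf-8')
    # strip the JSONP wrapper: searchresult({...}) -> {...}
    payload = raw[raw.find('(') + 1:raw.rfind(')')]
    data = json.loads(payload)
    # each entry holds a download link and a caption
    return [(img['imgurl'], img['caption']) for img in data['imglist']]
```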
This gives us the download links for the images, and all that remains is to download them with multiple threads. However, during the multithreaded download we found that the number of downloaded images was far lower than the number requested. This is because many images share the same name, so later downloads overwrite earlier ones; the fix is to prefix each image name with a random number, as sketched below.
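A minimal sketch of that renaming idea; prefixing the caption with a random number keeps two images with the same caption from overwriting each other (a larger range, or the `uuid` module, would make collisions even less likely):

```python
import random


def unique_name(caption):
    # images often share a caption, so a random prefix keeps
    # later downloads from overwriting earlier ones
    return '{}-{}.png'.format(int(random.random() * 1000), caption)
```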
Run:
Sometimes the program ends on its own before it has downloaded as many images as you requested; if that happens, it is recommended to run it a few more times.
Once it finishes, you will find a new folder under the current directory containing the downloaded images.
To know the number of images we have downloaded, we can do this:
```python
import os

list_1 = os.listdir(r'E:\Pycharm_1\crawler')
for i in range(len(list_1)):
    print(i + 1, list_1[i])
```
Running results:
So there are 400 images.
3. The complete code is as follows
```python
import urllib.parse as parse
from urllib import request
import random
from queue import Queue
import threading
import time
import json
import os


def get_time():
    # build a millisecond timestamp: the seconds plus the first three decimal digits
    str_time = str(time.time())
    str_time = str_time[:str_time.find('.')] + str_time[str_time.find('.') + 1:str_time.find('.') + 4]
    time.sleep(1.25)  # pause so two consecutive calls never produce the same timestamp
    return str_time


def get_url():
    keyword = input('Enter the type of image to download: ')
    key_word = parse.urlencode({'q': keyword})
    num = int(input('Enter the number of pages to crawl: '))
    headers = {
        "Referer": "https://www.quanjing.com/search.aspx?%s" % (key_word),
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400"
    }
    url = 'https://www.quanjing.com/Handler/SearchUrl.ashx?t=%s&callback=searchresult&%s&stype=1&pagesize=100' \
          '&pagenum=%s&imageType=2&imageColor=&brand=&imageSType=&fr=1&sortFlag=1&imageUType=&btype=&authid=&_=%s'
    list_url = []
    for i in range(1, num + 1):
        str_1 = str(random.random())
        random_1 = str_1[str_1.find('.') + 1:str_1.find('.') + 5]  # four random digits for 't='
        time_1 = get_time()
        url_1 = url % (random_1, key_word, i, time_1)
        list_url.append(url_1)
    return list_url, headers, keyword


tuple_1 = get_url()
list_url, headers, keyword = tuple_1[0], tuple_1[1], tuple_1[2]
queue_url = Queue(len(list_url) * 100 + 5)   # image download links
queue_img = Queue(len(list_url) * 100 + 5)   # image captions (used as file names)

try:
    num = 1
    for i in range(len(list_url)):
        request_1 = request.Request(url=list_url[i], headers=headers)
        content = request.urlopen(request_1)
        str_1 = content.read().decode('utf-8')  # the response body is a JSONP string
        str_1 = str_1[str_1.find('(') + 1:str_1.rfind(')')]  # strip the searchresult(...) wrapper
        dict_1 = json.loads(str_1)
        images_list = dict_1['imglist']
        for j in range(len(images_list)):
            print('Link [{}] - {}'.format(num, images_list[j]['caption']))
            queue_url.put(images_list[j]['imgurl'])
            queue_img.put(images_list[j]['caption'])
            num += 1

    def Downlad(queue_url: Queue, queue_img: Queue):
        path_1 = './' + keyword
        try:
            os.mkdir(path_1)
        except:
            pass  # the folder already exists
        finally:
            while True:
                if queue_url.empty():
                    break
                image_name = queue_img.get()
                # prefix a random number so images with the same caption do not overwrite each other
                request.urlretrieve(url=queue_url.get(),
                                    filename=path_1 + '/{}-{}.png'.format(random.random() * 1000, image_name))
                print('Thread {} is downloading [{}]'.format(threading.current_thread().getName(), image_name))
                time.sleep(0.25)

    threading_list = []
    print('Starting the download!')
    time.sleep(5)
    for i in range(len(list_url) * 5):  # five download threads per page
        threading_1 = threading.Thread(target=Downlad, args=(queue_url, queue_img,))
        threading_1.start()
        threading_list.append(threading_1)
    for i in threading_list:
        i.join()
    print('------------------------ Download finished! Thread: {}'.format(threading.current_thread().getName()))
except Exception as e:
    print(e, 'Please run the program again!')
```
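As a design note, `queue.Queue` is used above because its `put()` and `get()` are thread-safe, so many threads can drain one queue without an explicit lock. Here is that worker pattern in isolation; this sketch uses `get_nowait()` rather than checking `empty()` first, which avoids the small race where another thread empties the queue between the check and the `get()`:

```python
import threading
from queue import Queue, Empty

tasks = Queue()
for n in range(10):
    tasks.put(n)


def worker():
    while True:
        try:
            item = tasks.get_nowait()  # raises Empty once the queue is drained
        except Empty:
            break
        print(threading.current_thread().name, 'processing', item)


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```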