I. Background introduction
Hello, I’m Pipi. Different kinds of data call for different capture methods: pictures, video, audio, and text are all handled differently. Because this website has a huge amount of image material, today we will use a multi-threaded approach to collect a site’s 4K HD wallpapers.
II. Page analysis
Target Website:
http://www.bizhi88.com/3840x2160/
As shown in the figure, there are 278 pages in total. Here we crawl the first 100 pages of wallpaper images and save them locally.
Parsing the page
As shown in the figure, all of the wallpaper images on a page sit inside one large container box.
Inside it, each div tag holds one wallpaper and the information we need: 1. the link; 2. the name. Here is the xpath parsing:
imgLink = each.xpath("./a[1]/img/@data-original")[0]
name = each.xpath("./a[1]/img/@alt")[0]
One caveat:
The img tag has both a src attribute and a data-original attribute, and both point to the image’s URL. We generally use the latter: data-original is a custom attribute that holds the actual image address, whereas src is only filled in once the page has finished loading (lazy loading), so reading it earlier will not give the correct address.
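As an illustration, here is a minimal sketch of reading that attribute with lxml; the fallback to src is an assumption for images that have already been loaded, not something the article relies on:

# Minimal sketch (assumption: fall back to src when data-original is absent)
def pick_img_url(img_element):
    # img_element is an lxml <img> element, e.g. each.xpath("./a[1]/img")[0]
    return img_element.get("data-original") or img_element.get("src")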
III. Crawling approach
As mentioned above, there are far too many images to download one by one in a plain for loop, so we use multithreading or multiprocessing and hand the whole URL queue to a thread pool or process pool to process. In Python, multiprocessing.Pool and multiprocessing.dummy provide very convenient pools.
- multiprocessing.dummy module: a thread pool (multithreading);
- multiprocessing module: a process pool (multiprocessing);
The dummy module exposes the same API as the multiprocessing module, so switching the code between threads and processes is very flexible, as the sketch below shows.
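Because the two pools expose the same interface, the switch is essentially a one-line change of import. A minimal sketch (the work function and pool size here are illustrative, not from the article):

from multiprocessing.dummy import Pool as ThreadPool   # thread pool
# from multiprocessing import Pool as ThreadPool       # process pool: swap only the import

def work(n):
    return n * n    # placeholder task

if __name__ == '__main__':
    pool = ThreadPool(4)
    results = pool.map(work, range(10))   # same map() API in both cases
    pool.close()
    pool.join()
    print(results)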
Page URL rules:
'http://www.bizhi88.com/s/470/1.html'  # page 1
'http://www.bizhi88.com/s/470/2.html'  # page 2
'http://www.bizhi88.com/s/470/3.html'  # page 3
Build the url:
page = 'http://www.bizhi88.com/s/470/{}.html'.format(i)
So we define two functions: one to crawl and parse the page, and one to download the data. We then open a thread pool, use a for loop to build the 100 page URLs, store them in a list as a URL queue, and hand that queue to **pool.map()**:
def map(self, fn, *iterables, timeout=None, chunksize=1):
    """Returns an iterator equivalent to map(fn, iter)."""

# usage: pool.map(spider, page)   # page: the URL queue
Function: takes each element of the list as an argument to the function and dispatches it to a worker in the pool (see the sketch after this list);
Parameter 1: the function to execute;
Parameter 2: an iterable, whose items are passed one by one as arguments to the function;
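In other words, pool.map(spider, page) is simply the concurrent counterpart of a plain serial loop; a sketch of the comparison (spider, page, and pool are the function, URL list, and pool defined in the sections below):

# Serial version: one request at a time, slow
for url in page:
    spider(url)

# Concurrent version: the pool dispatches each url to a worker thread
result = pool.map(spider, page)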
IV. Data collection
Import related third-party libraries
from lxml import etree                                  # parsing
import requests                                         # requests
from multiprocessing.dummy import Pool as ThreadPool    # concurrency
import time                                             # timing the run
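Note that spider() below uses a headers dict that the original snippet does not show; a minimal assumed definition (User-Agent only) would be:

# Assumed request headers; the original article does not show this definition
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}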
Page data parsing
def spider(url):
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # every wallpaper sits in a div inside the big container box
    contents = selector.xpath("//div[@class='flex-img auto mt']/div")
    item = {}
    for each in contents:
        imgLink = each.xpath("./a[1]/img/@data-original")[0]
        name = each.xpath("./a[1]/img/@alt")[0]
        item['Link'] = imgLink
        item['name'] = name
        download_pic(item)   # hand the parsed item to the download function below
Download the images
def download_pic(contdict):
    name = contdict['name']
    link = contdict['Link']
    with open('img/' + name + '.jpg', 'wb') as f:
        data = requests.get(link)
        cont = data.content
        f.write(cont)                       # write the image bytes to disk
        print('Image ' + name + ' downloaded successfully')
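One practical detail: open('img/...') fails if the img directory does not exist, so it helps to create it before the crawl starts. A small sketch (this step is an addition, not part of the original code):

import os

os.makedirs('img', exist_ok=True)   # create the output directory if it is missing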
The main() function
def main():
    pool = ThreadPool(6)                     # thread pool with six worker threads
    page = []
    for i in range(1, 101):                  # build the 100 page URLs
        newpage = 'http://www.bizhi88.com/s/470/{}.html'.format(i)
        page.append(newpage)
    result = pool.map(spider, page)          # dispatch every URL to the pool
    pool.close()
    pool.join()
Description:
- In the main function, we first create a thread pool with six threads;
- The for loop dynamically builds the 100 page URLs;
- map() hands each URL in the queue to the thread pool, which parses the pages and stores the results;
- close() does not terminate the thread pool immediately; it only changes its state so that no new tasks can be submitted, and join() then waits for all submitted tasks to finish.
V. Running the program
if __name__ == '__main__':
    start = time.time()       # start timing
    main()                    # run the crawl
    end = time.time()
    print(end - start)        # total run time in seconds
The results are as follows:
Of course, this shows only part of the captured images; more than 2,000 images were downloaded in total.
VI. Summary
This time we used multithreading to crawl the HD images of a wallpaper website. With plain synchronous requests the downloads would be noticeably slow, so downloading the images through a thread pool improved the crawl efficiency considerably.