I. Background introduction

Hello, I’m Pipi. Different kinds of data (pictures, video, audio, text) call for different capture methods. Because this website hosts a very large amount of image material, today we will use a multi-threaded approach to collect 4K HD wallpapers from a wallpaper site.

II. Page analysis

Target Website:

http://www.bizhi88.com/3840x2160/

As shown in the figure, there are 278 pages in total. Here we crawl the first 100 pages of wallpaper images and save them locally.

Parsing the page

As shown in the screenshot, all the wallpaper images on a page sit inside one large container div.

Each child div inside that container holds the data for one wallpaper, from which we extract two things: 1. the image link; 2. the image name. Here is the XPath parsing:

imgLink = each.xpath("./a[1]/img/@data-original")[0]
name = each.xpath("./a[1]/img/@alt")[0]

One caveat:

The img tag has both a src attribute and a data-original attribute, and both hold an image URL. We generally use the latter: data-original is a custom attribute that carries the real image address, while src is only filled in once the page has fully loaded in a browser (the site lazy-loads its images), so a plain HTTP request cannot get the correct address from it.
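To see the difference in practice, here is a minimal sketch that prints both attributes for the first image on a listing page (it reuses the selector introduced later in this article; whether the request succeeds without extra headers depends on the site):

import requests
from lxml import etree

html = requests.get('http://www.bizhi88.com/s/470/1.html')
selector = etree.HTML(html.text)
first_img = selector.xpath("//div[@class='flex-img auto mt']/div/a[1]/img")[0]
print(first_img.get('src'))            # usually a lazy-load placeholder
print(first_img.get('data-original'))  # the real image address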

III. Crawl approach

As mentioned above, there is far too much data to download the images one by one in a simple for loop, so we use multithreading or multiprocessing and hand the whole URL queue to a thread pool or process pool. In Python, multiprocessing.Pool and multiprocessing.dummy.Pool are very convenient pools for this.

  • multiprocessing.dummy module: a pool backed by threads (multithreading);
  • multiprocessing module: a pool backed by processes (multiprocessing);

The dummy module exposes exactly the same API as the multiprocessing module, so switching between threads and processes only requires changing the import line, which keeps the code flexible.
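For example, a minimal sketch of the switch (the work() function here is just a placeholder, not part of this article's crawler):

from multiprocessing.dummy import Pool as ThreadPool   # thread pool (multithreading)
# from multiprocessing import Pool as ThreadPool       # process pool: same API, just swap the import

def work(x):
    # placeholder task; in this article the real task is spider()
    return x * x

if __name__ == '__main__':                 # needed if you switch to the process pool on Windows
    pool = ThreadPool(4)
    print(pool.map(work, [1, 2, 3, 4]))    # [1, 4, 9, 16]
    pool.close()
    pool.join()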

Page URL rules:

'http://www.bizhi88.com/s/470/1.html'   # page 1
'http://www.bizhi88.com/s/470/2.html'   # page 2
'http://www.bizhi88.com/s/470/3.html'   # page 3

Build the url:

page = 'http://www.bizhi88.com/s/470/{}.html'.format(i)

So we define two functions: one to crawl and parse a page, and one to download the image data. We open a thread pool, use a for loop to build the 100 page URLs, store them in a list as the URL queue, and hand that queue to **pool.map()**:

def map(self, func, iterable, chunksize=None):
    # Apply func to each element in iterable, collecting the results
    # in a returned list (equivalent to the built-in map()).

# usage here: pool.map(spider, page)   # page: the URL queue

What it does: takes each element of the list as an argument to the function and dispatches it to a worker thread/process in the pool;

Parameter 1: the function to execute;

Parameter 2: an iterable; each item in it is passed to the function as its argument;

IV. Data collection

Import related third-party libraries

from lxml import etree                                  # parsing
import requests                                         # HTTP requests
from multiprocessing.dummy import Pool as ThreadPool    # concurrency (thread pool)
import time                                             # timing the run
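The spider() function below uses a headers dict that is not shown in the article's code; a minimal sketch, assuming a plain User-Agent header is enough for this site:

# Assumed request headers (not included in the original code); a basic
# User-Agent usually suffices, adjust if the site requires more.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}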

Page data parsing

def spider(url):
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # each wallpaper sits in a child div of the big container
    contents = selector.xpath("//div[@class='flex-img auto mt']/div")
    for each in contents:
        imgLink = each.xpath("./a[1]/img/@data-original")[0]
        name = each.xpath("./a[1]/img/@alt")[0]
        item = {}
        item['Link'] = imgLink
        item['name'] = name
        download_pic(item)   # hand the parsed record to the download function

Downloading the images

def download_pic(contdict):
    name = contdict['name']
    link = contdict['Link']
    with open('img/' + name + '.jpg', 'wb') as f:
        data = requests.get(link)
        cont = data.content
        f.write(cont)
        print('Downloaded: ' + name)
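One practical note: download_pic() writes into an img/ folder, so that directory has to exist before the crawl starts; a one-line sketch to create it:

import os
os.makedirs('img', exist_ok=True)   # create the output folder if it is missing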

The main() function

def main():
    pool = ThreadPool(6)
    page = []
    for i in range(1, 101):
        newpage = 'http://www.bizhi88.com/s/470/{}.html'.format(i)
        page.append(newpage)
    result = pool.map(spider, page)
    pool.close()
    pool.join()

Description:

  1. In the main function we first create a thread pool with six worker threads;

  2. Build the 100 page URLs dynamically with a for loop;

  3. pool.map() hands each URL to the thread pool, where spider() parses the page and saves the images;

  4. pool.close() does not terminate the pool; it only stops the pool from accepting new tasks, and pool.join() then waits for all submitted tasks to finish (see the short sketch below).
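A toy sketch of the close()/join() semantics (not part of the crawler itself):

from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(2)
pool.map(print, ['a', 'b', 'c'])     # submit and run tasks
pool.close()                         # no new tasks may be submitted from now on
pool.join()                          # block until every submitted task has finished
# pool.apply_async(print, ('d',))    # would raise ValueError: Pool not running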

V. Program operation

if __name__ == '__main__':
    start = time.time()
    main()
    end = time.time()
    print(end - start)   # total running time

The results are as follows:

Of course, this shows only part of the captured images; more than 2,000 images were downloaded in total.

VI. Summary

This time we used multithreading to crawl the HD images of a wallpaper website. Plain requests calls are synchronous, so downloading the data page by page would be noticeably slow; downloading with a thread pool instead clearly improved the crawl efficiency.
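For comparison, a minimal sketch of the single-threaded equivalent (reusing the spider() function above), which makes the speed difference easy to measure:

# Synchronous baseline: same 100 pages, no thread pool.
import time

start = time.time()
for i in range(1, 101):
    spider('http://www.bizhi88.com/s/470/{}.html'.format(i))
print('sequential run took', time.time() - start, 'seconds')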