Written on 2018.01.24

Project address: github.com/jrainlau/wa…

Preface

It’s been a long time since my last post, as I’ve been adapting to my new position and learning Python in my spare time. This article is a summary of my recent Python learning: I developed a crawler to batch-download HD wallpapers from a wallpaper site.

Note: this article is for Python learning purposes only; any other use is strictly prohibited!

Initialize the project

The project uses virtualenv to create a virtual environment, so the global environment is not polluted. Install it directly with pip3:

pip3 install virtualenv

Then create a new wallpaper-downloader directory in a suitable location, and use virtualenv to create a virtual environment named venv:

virtualenv venv

. venv/bin/activate

Next, create a dependency file:

printf 'bs4\nlxml\nrequests\n' > requirements.txt

Finally, you can download and install the dependencies:

pip3 install -r requirements.txt

Analyze the working steps of crawler

For simplicity, let’s go straight to the wallpaper list page classified as “Aero” : wallpaperswide.com/aero-deskto…

As you can see, there are 10 wallpapers available for download on this page. However, the thumbnails shown here are far from sharp enough to use as wallpapers, so we need to go to the wallpaper details page to find the HD download link. Click on the first wallpaper and you will see a new page:

Since my machine has a Retina display, I’m going to download the largest one directly to keep it HD (the file size circled in red).

Once the steps are clear, what remains is to find the corresponding DOM nodes with the developer tools and extract the corresponding URLs. This process won’t be expanded here; readers can try it themselves. Now let’s move on to the coding part.

To access the page

Create a new download.py file and import two libraries:

from bs4 import BeautifulSoup
import requests

Next, write a function that specifically accesses the URL and returns the page HTML:

def visit_page(url):
    headers = {
      'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
    }
    r = requests.get(url, headers = headers)
    r.encoding = 'utf-8'
    return BeautifulSoup(r.text, 'lxml')

To avoid being blocked by anti-crawling measures, we disguise the crawler as a normal browser by adding a UA string to the headers, then specify UTF-8 encoding, and finally return the parsed page as a BeautifulSoup object.

Extract the link

After obtaining the HTML for the page, we need to extract the URL for the page wallpaper list:

def get_paper_link(page):
    links = page.select('#content > div > ul > li > div > div a')
    return [link.get('href') for link in links]

This function extracts the URL of all wallpaper details from the list page.
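To see what this selector does, here is a minimal offline sketch against a hand-written HTML fragment that mimics the list structure (the fragment is illustrative, not the site’s actual markup; html.parser is used here so the sketch runs without lxml):

```python
from bs4 import BeautifulSoup

# A hand-written stand-in for the wallpaper list page (illustrative markup).
html = '''
<div id="content"><div><ul>
  <li><div><div><a href="/aero_one-wallpapers.html">one</a></div></div></li>
  <li><div><div><a href="/aero_two-wallpapers.html">two</a></div></div></li>
</ul></div></div>
'''

page = BeautifulSoup(html, 'html.parser')
# Same selector as get_paper_link(): descend through the list items to the anchors.
links = page.select('#content > div > ul > li > div > div a')
print([link.get('href') for link in links])
# ['/aero_one-wallpapers.html', '/aero_two-wallpapers.html']
```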

Download wallpaper

Once we have the address of the details page, we can go in and pick the appropriate size. After analyzing the DOM structure of the page, we can see that each size corresponds to a link:

So the first step is to extract the links corresponding to size:

wallpaper_source = visit_page(link)
wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
size_list = [{
    'size': eval(link.get_text().replace('x', '*')),
    'name': link.get('href').replace('/download/', ''),
    'url': link.get('href')
} for link in wallpaper_size_links]

size_list is a collection of these links. In order to select the wallpaper with the highest resolution (the largest file), I use eval on the size text, so that a string like 5120x3200 is evaluated directly as 5120*3200 to give a numeric size.
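The eval trick can be checked in isolation. Since eval on scraped text is risky in general, a safer equivalent using split and int is also shown for comparison:

```python
# The article's approach: '5120x3200' -> eval('5120*3200') -> a numeric size.
size_text = '5120x3200'
size = eval(size_text.replace('x', '*'))
print(size)  # 16384000

# A safer equivalent that avoids eval on scraped text:
width, height = (int(n) for n in size_text.split('x'))
assert width * height == size
```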

Once we have the whole collection, we can use the max() function to pick out the highest-resolution item:

biggest_one = max(size_list, key = lambda item: item['size'])

The url field of biggest_one is the link we want; we can then download the file with the requests library:

result = requests.get(PAGE_DOMAIN + biggest_one['url'])

if result.status_code == 200:
    open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)

Note that you first need to create a wallpapers directory in the project root, otherwise the script will raise an error at runtime.
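Alternatively, the script can create the directory itself when it is missing; os.makedirs with exist_ok=True (available in Python 3) is one way to do it:

```python
import os

# Create the output directory if it does not exist yet;
# exist_ok=True makes this a no-op when it is already there.
os.makedirs('wallpapers', exist_ok=True)
```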

To rearrange, the full download_wallpaper function looks like this:

def download_wallpaper(link):
    wallpaper_source = visit_page(PAGE_DOMAIN + link)
    wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
    size_list = [{
        'size': eval(link.get_text().replace('x', '*')),
        'name': link.get('href').replace('/download/', ''),
        'url': link.get('href')
    } for link in wallpaper_size_links]

    biggest_one = max(size_list, key = lambda item: item['size'])
    result = requests.get(PAGE_DOMAIN + biggest_one['url'])

    if result.status_code == 200:
        open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)

The batch operation

The above steps only download a single wallpaper from the first list page. If we want to download all the wallpapers across multiple list pages, we need to run these methods in a loop. First, let’s define a few constants:

import sys

if len(sys.argv) != 4:
    print('3 arguments were required but only found ' + str(len(sys.argv) - 1) + '!')
    exit()

category = sys.argv[1]

try:
    page_start = [int(sys.argv[2])]
    page_end = int(sys.argv[3])
except ValueError:
    print('The second and third arguments must be numbers, not strings!')
    exit()

Here we specify three constants, category, page_start and page_end, which correspond to the wallpaper category, the start page number, and the end page number.
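Note that page_start is wrapped in a one-element list rather than stored as a plain integer: rebinding an integer inside the start() function would just create a local variable, while mutating a list element is visible outside the function. A minimal illustration of the difference (counter and advance are made-up names):

```python
counter = [1]

def advance():
    # Mutating the element works; `counter = counter + 1` would not,
    # because assignment would create a new local name instead.
    counter[0] = counter[0] + 1

advance()
print(counter[0])  # 2
```

Since page_start lives at module level, declaring `global page_start` inside start() and using a plain integer would work as well; the one-element list is just another common way to share mutable state.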

For convenience, we define two more URL-related constants:

PAGE_DOMAIN = 'http://wallpaperswide.com'
PAGE_URL = 'http://wallpaperswide.com/' + category + '-desktop-wallpapers/page/'

Now we can happily batch-download. Let’s define a start() function:

def start():
    if page_start[0] <= page_end:
        print('Preparing to download the ' + str(page_start[0])  + ' page of all the "' + category + '" wallpapers... ')
        PAGE_SOURCE = visit_page(PAGE_URL + str(page_start[0]))
        WALLPAPER_LINKS = get_paper_link(PAGE_SOURCE)
        page_start[0] = page_start[0] + 1

        for index, link in enumerate(WALLPAPER_LINKS):
            download_wallpaper(link, index, len(WALLPAPER_LINKS), start)

Then rewrite the download_wallpaper function to take the extra parameters:

def download_wallpaper(link, index, total, callback):
    wallpaper_source = visit_page(PAGE_DOMAIN + link)
    wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
    size_list = [{
        'size': eval(link.get_text().replace('x', '*')),
        'name': link.get('href').replace('/download/', ''),
        'url': link.get('href')
    } for link in wallpaper_size_links]

    biggest_one = max(size_list, key = lambda item: item['size'])
    print('Downloading the ' + str(index + 1) + '/' + str(total) + ' wallpaper: ' + biggest_one['name'])
    result = requests.get(PAGE_DOMAIN + biggest_one['url'])

    if result.status_code == 200:
        open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)

    if index + 1 == total:
        print('Download completed!\n\n')
        callback()
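The page-to-page flow works by recursion: when the last wallpaper of a page finishes, download_wallpaper invokes the callback (start), which moves on to the next page. The same flow can also be written as a plain loop, which avoids growing the call stack when many pages are processed. A sketch of that alternative, with page fetching and downloading replaced by stubs (run, links_per_page and the link strings are made up for illustration):

```python
def run(page_start, page_end, links_per_page=10):
    downloaded = []
    # Iterate over the pages directly instead of recursing via a callback.
    for page in range(page_start, page_end + 1):
        # Stub for get_paper_link(): pretend each page yields N links.
        links = ['/wallpaper-%d-%d' % (page, i) for i in range(links_per_page)]
        for link in links:
            downloaded.append(link)  # stub for download_wallpaper()
    return downloaded

print(len(run(1, 2)))  # 20
```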

Finally, specify the startup rule:

if __name__ == '__main__':
     start()


Run the project

Enter the following code on the command line to start the test:

python3 download.py aero 1 2

You can then see the following output:

Capture the traffic with Charles and you can see that the script is running smoothly:

At this point, the download script is complete, and you no longer have to worry about a wallpaper shortage!