Python crawler 120 examples, DIY – 001 – grab desktop wallpapers


The target site to crawl: www.netbian.com/fengjing/in…

Reference blog post:

blog.csdn.net/hihell/arti…

The original post uses regular expressions and the re module, and it only downloads the preview images. I rewrote it on that basis so it can fetch the high-definition images. This article is for reference only and is meant to offer a line of thought.

Tool materials:


  • python3

  • the Requests library

  • the Parsel library
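
Both libraries are third-party packages from PyPI; a standard installation (assuming pip is available) is:

    pip install requests parsel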

Steps:


  1. Analyze the page. The preview images on each page all sit under the .list collection, and each image corresponds to an abbreviated (relative) address, as sketched below.
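
    A minimal sketch of that structure (hypothetical markup, not the site's actual HTML) and of how a CSS selector pulls out the relative addresses:

    import parsel

    # Hypothetical markup mirroring the described .list structure
    html = '''
    <div class="list">
      <ul>
        <li><a href="/desk/23791.htm"><img src="s/23791.jpg"></a></li>
        <li><a href="/desk/23792.htm"><img src="s/23792.jpg"></a></li>
      </ul>
    </div>
    '''
    sel = parsel.Selector(text=html)
    print(sel.css('.list li a::attr(href)').extract())
    # ['/desk/23791.htm', '/desk/23792.htm']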

  2. Use the Parsel library to parse out the initial addresses, which still include unwanted URLs:

    import requests
    import parsel

    url = 'http://www.netbian.com/fengjing/'
    response = requests.get(url)
    # The site is GBK-encoded, so decode the raw bytes explicitly
    sel = parsel.Selector(response.content.decode('gbk'))
    lists = sel.css('.list li a::attr(href)').extract()  # Get the initial addresses

  3. Filter out the unwanted addresses

    Inspecting a link shows the address from which the high-definition image page is obtained: www.netbian.com/desk/23791-…

    # Clean and assemble the high-definition page addresses
    # (wurl = 'http://www.netbian.com', the site root defined in the full source)
    def clearurl(lists):
        nurls = []
        for i in lists:
            if i.startswith('/desk/'):
                # Strip '.htm' and append the 1920x1080 suffix
                i = wurl + i[:-4] + '-1920x1080.htm'
                nurls.append(i)
        return nurls

    This finally yields the list of high-definition page addresses; an illustrative run follows.
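
    A quick check of clearurl (the /desk/ entry echoes the sample link above; the index_2 entry is a made-up example of what gets filtered out):

    lists = ['/desk/23791.htm', '/fengjing/index_2.htm']
    print(clearurl(lists))
    # ['http://www.netbian.com/desk/23791-1920x1080.htm']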

  4. Parse the page hosting the hd picture, get the final image address, and download it

    Extract the final image address from the page source, as shown below:

gqurls = clearurl(lists)

response = requests.get(gqurls[0])  # take the first address as a test
sel = parsel.Selector(response.content.decode('gbk'))
gpic = sel.css('td a::attr(href)').extract_first()

image = requests.get(gpic).content
with open('../eg001/' + '1.jpg', 'wb') as f:
    f.write(image)
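
A note in passing: the ../eg001/ output directory must already exist, or open() raises FileNotFoundError. A one-line guard (not in the original code) sidesteps that:

import os

os.makedirs('../eg001', exist_ok=True)  # create the output directory if missing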

At this point, the flow from analysis to downloading the first picture is basically complete.

  5. Parse the paging to get more images to download

    Analyzing how the page address changes, the addresses are:

    www.netbian.com/fengjing/in…

    www.netbian.com/fengjing/in…

    www.netbian.com/fengjing/in…

    …

    www.netbian.com/fengjing/in…

    Except for the first page, whose address carries no serial number, the rest follow a regular pattern, so the URLs can be refactored into list form:

    def urls():
        url_list = ['http://www.netbian.com/fengjing/index_{}.htm'.format(i) for i in range(2, 206)]
        url_list.insert(0, 'http://www.netbian.com/fengjing/')
        return url_list
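
    A quick sanity check of the first few entries (the output follows directly from the function above):

    print(urls()[:3])
    # ['http://www.netbian.com/fengjing/',
    #  'http://www.netbian.com/fengjing/index_2.htm',
    #  'http://www.netbian.com/fengjing/index_3.htm']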
  6. Encapsulate the sel object

    Because a Selector object is created over and over to parse pages, wrap its creation in a function:

    # Get a sel object for a URL
    def t_sel(url):
        response = requests.get(url)
        sel = parsel.Selector(response.content.decode('gbk'))
        return sel
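
    An equivalent sketch (an alternative I am suggesting, not the original author's approach) lets Requests do the decoding by setting the response encoding explicitly:

    # Alternative: set the encoding and let response.text decode
    def t_sel(url):
        response = requests.get(url)
        response.encoding = 'gbk'
        return parsel.Selector(text=response.text)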

Full source code:

#!/usr/bin/env python
# coding=utf-8

"' wallpaper crawl, 001 http://www.netbian.com/fengjing/ '

import requests
import parsel

url = 'http://www.netbian.com/fengjing/'
wurl = 'http://www.netbian.com'


# Get a list of paging URLs
def urls():
    # Only up to index_2 here; use range(2, 206) to crawl all pages, as in the snippet above
    url_list = ['http://www.netbian.com/fengjing/index_{}.htm'.format(i) for i in range(2, 3)]
    url_list.insert(0, url)
    return url_list


# Build a sel object for a URL
def t_sel(url):
    response = requests.get(url)
    sel = parsel.Selector(response.content.decode('gbk'))
    return sel


# Clean and assemble the high-definition page addresses
def clearurl(lists):
    nurls = []
    for i in lists:
        # print(i)
        if i.startswith('/desk/'):
            i = wurl + i[:-4] + '-1920x1080.htm'
            nurls.append(i)
    return nurls


def savepic(gqurls):
    for g_url in gqurls:
        sel = t_sel(g_url)
        gpic = sel.css('td a::attr(href)').extract_first()
        image = requests.get(gpic).content
        # g_url[28:-4] strips the site prefix and '.htm', leaving e.g. '23791-1920x1080'
        with open('../eg001/' + str(g_url[28:-4]) + '.jpg', 'wb') as f:
            f.write(image)


if __name__ == '__main__':
    ulist = urls()
    for url in ulist:
        sel = t_sel(url)
        lists = sel.css('.list li a::attr(href)').extract()  # Get the initial addresses
        gqurls = clearurl(lists)
        savepic(gqurls)


The source code has been uploaded and archived on Gitee (码云).