Preface:

This article is beginner-friendly: even with zero background you can pick it up quickly. If you have any questions, leave a comment and I will reply as soon as I can. This article is also hosted on GitHub.

First, analyze the emoticon website

“The Latest Emoji”

Page 1: www.doutula.com/photo/list/…
Page 3: www.doutula.com/photo/list/…
Page 4: www.doutula.com/photo/list/…

As you can see, the `page` query parameter matches the page number you click, so we know the URL pattern to crawl.
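Based on that pattern, the list-page URLs can be generated from a one-line template (a minimal sketch; only the `page` parameter varies between pages):

```python
# URL template: only the page number changes from page to page
BASE = 'https://www.doutula.com/photo/list/?page=%d'

def page_url(n):
    """Build the list-page URL for page n."""
    return BASE % n

print(page_url(3))  # https://www.doutula.com/photo/list/?page=3
```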

Next, open the browser's element inspector and you can see the HTML source:

xpath

Second, the practice

    imgs = html.xpath("//div[@class='page-content text-center']//img[@class!='gif']")
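To see what this XPath selects, here is a self-contained check against a tiny hand-written fragment that mimics the page structure (the sample HTML and its attribute values are made up for illustration):

```python
from lxml import etree

# Minimal stand-in for the real page: one normal image, one GIF placeholder
sample = """
<div class="page-content text-center">
  <img class="img-responsive lazy" data-original="http://example.com/a.jpg" alt="smile"/>
  <img class="gif" data-original="http://example.com/b.gif" alt="anim"/>
</div>
"""
html = etree.HTML(sample)
# @class!='gif' skips the GIF placeholder images
imgs = html.xpath("//div[@class='page-content text-center']//img[@class!='gif']")
for img in imgs:
    print(img.get('data-original'), img.get('alt'))  # http://example.com/a.jpg smile
```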

B. Next, get the URL of each image; the code is as follows:

    for img in imgs:
        # print(etree.tostring(img))
        img_url = img.get('data-original')  # not sure why the URL carries an extra "!dta"; strip it
        img_url = img_url.replace('!dta', '')

C. Split off the suffix, build the file name, and save:

        alt = img.get('alt')  # get the image name
        # alt may contain characters that are illegal in file names and need cleaning
        suffix = os.path.splitext(img_url)[1]  # split the URL and keep the extension
        filename = alt + suffix
        request.urlretrieve(img_url, 'images/' + filename)  # save the image
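As the comment above notes, `alt` can contain characters that are illegal in file names. One way to clean it (a sketch; the exact character set to strip is my assumption, not part of the original code):

```python
import re

def safe_filename(alt):
    # Replace characters that Windows/Unix forbid in file names with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', alt).strip()

print(safe_filename('really?/ok'))  # really__ok
```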

With this, you can quickly save all the emoticons you need; when it comes to emoji battles, who could beat you?

The full code is as follows:

import os
from urllib import request

import requests
from lxml import etree


def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    # print(response.text)  # print the HTML source
    html = etree.HTML(response.text)
    imgs = html.xpath("//div[@class='page-content text-center']//img[@class!='gif']")
    for img in imgs:
        # print(etree.tostring(img))
        img_url = img.get('data-original')  # not sure why the URL carries an extra "!dta"; strip it
        img_url = img_url.replace('!dta', '')
        # print(img_url)
        alt = img.get('alt')  # the image name
        # alt may contain characters that are illegal in file names
        print(alt)
        suffix = os.path.splitext(img_url)[1]  # split the URL and keep the extension
        filename = alt + suffix
        print(filename)
        request.urlretrieve(img_url, 'images/' + filename)  # save the image

def main():
    for x in range(1, 3):  # range(1, 3) covers pages 1 and 2
        url = 'https://www.doutula.com/photo/list/?page=%d' % x
        parse_page(url)
        break  # remove this break to crawl every page


if __name__ == '__main__':
    main()

Final result:

Just 20 lines of code, and you can stockpile emoticons for any emoji battle. Keep going!

Of course, you can go further: with multi-threading and asynchronous crawling, you can download thousands of emoticons in a few seconds! I have put the related code on GitHub as well; anyone who needs it can take a look.
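I won't reproduce the GitHub version here, but the multi-threaded idea can be sketched with the standard library's `ThreadPoolExecutor`. The `fetch` callable is a placeholder for whatever performs one download (e.g. a wrapper around `request.urlretrieve`); here a dummy is passed in so the sketch runs without touching the network:

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(img_urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map keeps the result order matching img_urls
        return list(pool.map(fetch, img_urls))

# Dummy fetch in place of a real download function
results = download_all(['u1', 'u2', 'u3'], lambda u: u + '.jpg')
print(results)  # ['u1.jpg', 'u2.jpg', 'u3.jpg']
```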




For more exciting content, follow the public account "BigDeveloper": the programmer tycoon show.