GitHub: github.com/nnngu/Learn…


How to make a crawler

Making a crawler generally consists of the following steps:

  • Analyze requirements
  • Analyze web source code and coordinate with developer tools
  • Write regular expressions or XPath expressions
  • Write the Python crawler code
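As a rough sketch (not the final code of this article), the steps above map onto a minimal skeleton: fetch a page, then pull out every match of an expression:

```python
import re

import requests


def crawl(url, pattern):
    """Fetch a page and return every match of the given regular expression."""
    html = requests.get(url, timeout=10).text  # download the page source
    return re.findall(pattern, html, re.S)     # extract the target resources
```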

Preview of the results

Here is the crawler in action:

The folder where the pictures are saved:

Requirements analysis

Our crawler should provide at least two functions: searching for pictures and downloading them automatically.

Searching for images: the most obvious approach is to crawl the results of Baidu Images, so let's take a look at Baidu Images first:

If you search for a few keywords, you can see that plenty of images are found:

Analyzing the web page

Right-click and view the page source:

Opening it, we are faced with a wall of code in which it is hard to locate the resource we want.

This is where the developer tools come in! Go back to the previous page and open the developer tools; we need the element picker in the upper left corner (mouse-follow mode).

Then click the part of the page you want to inspect, and the corresponding area of code below is highlighted automatically, as shown in the diagram below:

We copy an image address, search for it in the source code, and find where it lives. But now we are confused: each image has several addresses. Which one should we use? We see thumbURL, middleURL, hoverURL, and objURL.

From this we can tell that the first two are thumbnail versions and hoverURL is the version displayed on mouse-over, so objURL should be what we need. Opening a few of these URLs confirms that objURL points to the largest and clearest image.
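As an illustration, here is a hypothetical entry modeled on the page's embedded data (the URLs are made-up placeholders, not real addresses), showing how each of the four fields could be pulled out:

```python
import re

# One made-up image entry, shaped like the JSON in Baidu's result page.
entry = ('"thumbURL":"http://t.example.com/small.jpg",'
         '"middleURL":"http://t.example.com/middle.jpg",'
         '"hoverURL":"http://t.example.com/hover.jpg",'
         '"objURL":"http://img.example.com/original.jpg",')

for field in ('thumbURL', 'middleURL', 'hoverURL', 'objURL'):
    # Non-greedy capture of everything between the quotes for this field.
    value = re.findall('"' + field + '":"(.*?)",', entry)
    print(field, '->', value[0])
```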

Having found the image address, let's analyze the source code to see whether every objURL is an image. It turns out they all end in the .jpg format.

Writing regular expressions

pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
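A quick sanity check of this pattern against a toy fragment (placeholder URLs, not real ones):

```python
import re

# Two fake objURL entries in the same shape as the real page source.
html = '"objURL":"http://img.example.com/a.jpg","objURL":"http://img.example.com/b.jpg",'
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
print(pic_url)  # ['http://img.example.com/a.jpg', 'http://img.example.com/b.jpg']
```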

Writing the crawler code

Here we use two packages: re and requests.

#-*- coding:utf-8 -*-
import re
import requests

Copy the Baidu image search link, fetch it with requests, and apply the regular expression:

url = 'https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=%E6%A0%97%E5%B1%B1%E6%9C%AA%E6%9D%A5%E5%A4%B4%E5%83%8F&ct=201326592&ic=0&lm=-1&width=&height=&v=index'

html = requests.get(url).text
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)


Because there are many images, we loop over them. We print each URL and then fetch it with requests; since some image URLs might not open, we add a 10-second timeout.

pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
i = 1
for each in pic_url:
    print(each)
    try:
        pic = requests.get(each, timeout=10)
    except requests.exceptions.RequestException:
        # covers both connection errors and the 10-second timeout
        print('[error] current image cannot be downloaded')
        continue


Then we save the images: we create an images directory, store the images there, and name them with sequential numbers.

        path = '../images/' + keyword + '_' + str(i) + '.jpg'
        fp = open(path, 'wb')
        fp.write(pic.content)
        fp.close()
        i += 1
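One detail worth noting: open() fails if the target folder does not exist, so it is safest to create the images directory up front (a small sketch; the path matches the snippet above):

```python
import os

# Create the folder if it is missing; exist_ok avoids an error when it
# already exists.
os.makedirs('../images', exist_ok=True)
```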

Complete code

# -*- coding:utf-8 -*-
import os
import re
import requests


def downloadPic(html, keyword):
    pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
    i = 1
    print('Found pictures for keyword "' + keyword + '", starting to download...')
    os.makedirs('../images', exist_ok=True)  # make sure the target folder exists
    for each in pic_url:
        print('Downloading picture No. ' + str(i) + ', address: ' + str(each))
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.RequestException:
            # covers both connection errors and the 10-second timeout
            print('[error] current image cannot be downloaded')
            continue

        path = '../images/' + keyword + '_' + str(i) + '.jpg'
        with open(path, 'wb') as fp:
            fp.write(pic.content)
        i += 1


if __name__ == '__main__':
    word = input('Input key word: ')
    url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word + '&ct=201326592&v=flip'
    result = requests.get(url)
    downloadPic(result.text, word)


Some pictures are not displayed; opening their URLs shows that they are indeed gone.

We can still see them on Baidu because some pictures are cached on Baidu's servers, but their actual links are dead.
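To skip such dead links automatically, one rough approach is to check the HTTP status and the Content-Type header before saving (a sketch, not part of the original article's code):

```python
import requests


def fetch_image(url, timeout=10):
    """Return the image bytes, or None if the link is dead or not an image."""
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException:
        return None  # connection error, timeout, malformed URL, ...
    if resp.status_code != 200:
        return None  # e.g. 404: the image has been removed
    if not resp.headers.get('Content-Type', '').startswith('image/'):
        return None  # the server returned something other than an image
    return resp.content
```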

Conclusion

Enjoy our first image-download crawler! Of course, it can download more than Baidu images; following the same pattern, you should now be able to do many things, such as crawling profile pictures, Taobao listings, and so on.

The full code has been posted on GitHub: github.com/nnngu/Baidu…