GitHub: github.com/nnngu/Learn…
How to make a crawler
Making a crawler generally consists of the following steps:
- Analyze the requirements
- Analyze the web page source code, with help from the developer tools
- Write the regular expression or XPath expression
- Write the Python crawler code
Preview of the results
The running output looks like this:
The folder where the downloaded images are stored:
Requirements analysis
Our crawler should provide at least two functions: searching for images, and downloading them automatically.
For image search, the most obvious approach is to crawl the results of Baidu Images, so let's take a look at Baidu Images:
Searching for a keyword brings up plenty of images:
Analyzing the web page
Right-click and view the page source:
Opening the source, we find a wall of code in which it is hard to locate the resource we want.
This is where the developer tools come in! Go back to the page, bring up the developer tools, and use the element picker in the upper left corner (the mouse-follow tool).
Then click the part of the page you want to inspect, and the corresponding region of code is highlighted automatically, as shown below:
We copy an image address and search for it in the source code. We find it, but now we are puzzled: each image has several addresses, so which one should we use? We can see thumbURL, middleURL, hoverURL and objURL.
From the analysis we can tell that the first two are thumbnail versions, hoverURL is the version shown when the mouse hovers over the image, and objURL should be the one we need. Opening a few of these URLs confirms that objURL points to the largest and clearest image.
Having found the image address, let's analyze the source code to check whether every objURL is an image. It turns out they are images ending in the .jpg format.
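To make that structure concrete, here is a made-up record in the shape of the data embedded in the page source (every URL below is a placeholder for illustration, not a real result):

# Hypothetical record mirroring the four URL fields discussed above;
# all values are placeholders, not real Baidu data.
record = ('"thumbURL":"https://img.example.com/thumb.jpg",'
          '"middleURL":"https://img.example.com/middle.jpg",'
          '"hoverURL":"https://img.example.com/hover.jpg",'
          '"objURL":"https://img.example.com/full.jpg",')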
Writing the regular expression
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
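As a quick sanity check, the pattern can be tried against a small made-up fragment like the record above (the URL is a placeholder):

import re

# Made-up fragment shaped like the page source; placeholder URL only.
html = '"objURL":"https://img.example.com/full.jpg","fromPageTitle":"demo",'
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
print(pic_url)  # ['https://img.example.com/full.jpg']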
Writing the crawler code
Here we use two packages: re and requests.
#-*- coding:utf-8 -*-
import re
import requests
Copy the Baidu image search link, fetch it with requests, and apply the regular expression:
url = 'https://image.baidu.com/search/index?tn=baiduimage&ie=utf-8&word=%E6%A0%97%E5%B1%B1%E6%9C%AA%E6%9D%A5%E5%A4%B4%E5%83%8F&ct=201326592&ic=0&lm=-1&width=&height=&v=index'
html = requests.get(url).text
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
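A side note: if the plain request ever comes back with an unexpected page, a common workaround (an assumption on my part; the request above does not require it) is to send a browser-like User-Agent header:

# Optional: identify as a browser. The header value is an assumption,
# only worth trying if the plain request does not return the expected page.
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get(url, headers=headers).text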
Because there are many images, we loop over them. We print each result for inspection and then fetch the URL with requests; since some image URLs may fail to open, we add a 10-second timeout.
pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
i = 1
for each in pic_url:
    print(each)
    try:
        pic = requests.get(each, timeout=10)
    except requests.exceptions.ConnectionError:
        print('[error] current image cannot be downloaded')
        continue
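One caveat: requests.exceptions.ConnectionError does not cover read timeouts, so a slow server could still crash the loop. A slightly broader variant of the same try block catches the RequestException base class instead, which covers connection errors, timeouts and other transport failures:

    try:
        pic = requests.get(each, timeout=10)
    except requests.exceptions.RequestException:
        # RequestException is the base class for ConnectionError,
        # Timeout and the other transport errors in requests.
        print('[error] current image cannot be downloaded')
        continue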
Then we save the images: we create an images directory, put the images in there, and name them sequentially by number.
dir = '../images/' + keyword + '_' + str(i) + '.jpg'
fp = open(dir, 'wb')
fp.write(pic.content)
fp.close()
i += 1
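Note that the save path assumes the ../images directory already exists. A small addition (not in the original code) creates it on the first run:

import os

# Create the images directory if it is not there yet (Python 3.2+).
os.makedirs('../images', exist_ok=True)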
Complete code
# -*- coding:utf-8 -*-
import re
import requests


def dowmloadPic(html, keyword):
    pic_url = re.findall('"objURL":"(.*?)",', html, re.S)
    i = 1
    print('Found keyword: ' + keyword + ', starting to download images...')
    for each in pic_url:
        print('Downloading image No.' + str(i) + ', URL: ' + str(each))
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.ConnectionError:
            print('[error] current image cannot be downloaded')
            continue
        dir = '../images/' + keyword + '_' + str(i) + '.jpg'
        fp = open(dir, 'wb')
        fp.write(pic.content)
        fp.close()
        i += 1


if __name__ == '__main__':
    word = input("Input key word: ")
    url = 'http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=' + word + '&ct=201326592&v=flip'
    result = requests.get(url)
    dowmloadPic(result.text, word)
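To run it, save the script under any name (baidu_crawler.py is just an example), make sure the images directory exists one level above it, start it with Python 3, and type a keyword at the prompt; the downloaded files will appear as keyword_1.jpg, keyword_2.jpg and so on.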
We notice that some images are not displayed; opening their URLs shows they are indeed gone.
They still show up on Baidu because some images are cached on Baidu's servers, but their actual links have gone dead.
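If you would rather skip such dead links than save error pages, one possible refinement (a sketch, not part of the original code) is to check the HTTP status code inside the download loop before writing the file:

pic = requests.get(each, timeout=10)
if pic.status_code != 200:
    # The server answered, but not with the image itself; skip it.
    print('[error] image link is dead, skipping')
    continue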
Conclusion
Enjoy our first image-download crawler! Of course, it can do more than download Baidu images; following the same pattern, you should now be able to do many things, such as crawling profile pictures, Taobao product listings, and so on.
The full code has been posted on GitHub: github.com/nnngu/Baidu…