These are my notes from learning Python web crawling. I had long wanted to pick up another language to broaden my horizons, and after looking at various languages I finally settled on Python. My first impression of Python is that its syntax is concise, its formatting is distinctive, and its library support is rich, which gives the language a lot of personality. Before getting into crawlers I of course went through the syntax first. Below is a small crawler I wrote: it lets you search Baidu Images by keyword and download the results.
Tools:
- Don’t think too much about the IDE; just pick one. I used PyCharm (yes, it was free for me this year).
- Open PyCharm’s Settings (Preferences on macOS), go to Project > Project Interpreter, click the plus sign in the bottom-left corner, type requests in the search box, and install it. There are plenty of other ways to install it, for example running `pip install requests` in a terminal, but the IDE route is the lazy one.
- I used the latest Python 3; 2.7 should work as well. The crawler does not use the Scrapy framework or lxml, only the re module for parsing and requests for network requests.
re and requests usage

re (regular expressions)

re is Python's regular-expression module and is mainly used here to parse data: once we have the page source, regular-expression matching is one way to pull out the pieces we want. I won't cover regular expressions themselves; see the "Learn regular expressions in 30 minutes" tutorial, and the common methods of the re module are covered in the article "Common methods of Python's re (regular expression) module for crawlers". Here we mainly use re.findall("regular expression", "data to match", flags), where flags are optional matching options such as ignoring case.
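A minimal sketch of that call on a made-up string, just to show what re.findall returns:

```python
import re

text = "id=12, id=7, ID=30"

# findall returns a list with the contents of the capture group for every match;
# the third argument takes flags such as re.IGNORECASE
print(re.findall(r'id=(\d+)', text, re.IGNORECASE))  # ['12', '7', '30']
```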
requests (network requests)

requests.get(url, timeout=5) sends a simple GET request, which is all we need here. For more details, see the Requests quickstart guide.
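A minimal sketch of fetching a page's source with requests (the URL is only a placeholder):

```python
import requests

# Fetch a page and read its text; the timeout keeps a bad connection from hanging forever
response = requests.get("http://example.com", timeout=5)
print(response.status_code)   # 200 on success
print(response.text[:200])    # first 200 characters of the page source
```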
Specific steps
The first step is to figure out what you want to do and what data you want (no goal, no motivation). Here we want to get image links and image content through Baidu Images: I want to search by keyword, specify how much data to fetch, and choose whether to save the images and where to save them.
With the requirements settled, the next step is to analyse the structure of the page to be crawled and see where our data lives; we want to scrape the images from Baidu Images.
First open Baidu Images. The page keeps loading more results as you scroll down, which means it is a dynamic page. We can take an easier route: click the traditional page-turning version linked at the top of the page.
That brings up the familiar paged interface, where you click through page numbers to get more pictures.
Right-click and view the page source; it looks roughly like this. This is the data we will be downloading, and somewhere in it are the link for each image and the link to the next page. At first glance it is a bit overwhelming: with this much markup, where do we look?
Don't worry, we can use the browser's developer tools to inspect the page elements. I use Chrome: open the developer tools to view the page structure, and as you move the mouse over the element tree, the corresponding region of the page is highlighted in real time. This lets us quickly locate the part of the markup that corresponds to an image:
Once we have found the path of an image and the path of the next page, we can search for them in the source to confirm their location, and work out how to write the regular expressions that extract the information:
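As a sketch, assuming the image entries in the source look like the `"objURL":"..."` pairs used by Baidu's flip-page version and the next-page link is an anchor with class `n` (the HTML fragment below is illustrative, not a real page):

```python
import re

# An illustrative fragment of what the flip-page source roughly looks like
html = '''
"objURL":"http://img.example.com/cat1.jpg","fromURL":"http://example.com/page1",
"objURL":"http://img.example.com/cat2.jpg","fromURL":"http://example.com/page2",
<a href="/search/flip?tn=baiduimage&word=cat&pn=20" class="n">next page</a>
'''

image_urls = re.findall('"objURL":"(.*?)"', html, re.S)            # every image link on the page
next_page = re.findall('<a href="(.*?)" class="n">', html, re.S)   # relative link to the next page

print(image_urls)  # ['http://img.example.com/cat1.jpg', 'http://img.example.com/cat2.jpg']
print(next_page)   # ['/search/flip?tn=baiduimage&word=cat&pn=20']
```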
Now that all the data has been analysed, it is time to actually write our crawler. Quite a long read so far, and still no code:
```python
# First, import the libraries we need
import requests  # network requests
import re        # regular expressions for parsing
```
Then set the default configuration:
```python
MaxSearchPage = 20  # Number of pages to crawl
CurrentPage = 0     # Page currently being crawled
DefaultPath = "/Users/caishilin/Desktop/pictures"  # Default save location
NeedSave = 0        # Whether images should be saved to disk
```
The regular expressions for the image links and the next-page link:
```python
def imageFiler(content):  # Get the image addresses on the current page with a regex
    return re.findall('"objURL":"(.*?)"', content, re.S)

def nextSource(content):  # Get the url of the next page with a regex
    # The original pattern was lost when the post's HTML was stripped; this reconstruction
    # targets the "next page" anchor (class "n") in the flip-page source.
    next = re.findall('<a href="(.*?)" class="n">', content, re.S)[0]
    print("---------" + "http://image.baidu.com" + next)
    return next
```
The crawler body:
```python
def spidler(source):
    content = requests.get(source).text  # Fetch the page content from the link
    imageArr = imageFiler(content)       # Get the array of image urls on this page
    global CurrentPage
    print("Current page:" + str(CurrentPage) + "**********************************")
    for imageUrl in imageArr:
        print(imageUrl)
        global NeedSave
        if NeedSave:  # Save the image if requested
            global DefaultPath
            try:
                # Download the image with a timeout, so a wrong image address
                # does not keep us waiting forever
                picture = requests.get(imageUrl, timeout=10)
            except:
                print("Download image error! errorUrl:" + imageUrl)
                continue
            # Build the save path; slashes in the url are replaced so it can be used as a file name
            pictureSavePath = DefaultPath + imageUrl.replace('/', ' ')
            fp = open(pictureSavePath, 'wb')  # Open the file and write the image bytes
            fp.write(picture.content)
            fp.close()
    else:  # for-else: runs once the loop over this page's images has finished
        global MaxSearchPage
        if CurrentPage <= MaxSearchPage:
            if nextSource(content):
                CurrentPage += 1
                # Recurse into the next page
                spidler("http://image.baidu.com" + nextSource(content))
```
The entry point of the crawler:
```python
def beginSearch(page=1, save=0, savePath="/users/caishilin/Desktop/pictures/"):
    # page: number of pages to crawl, save: whether to save, savePath: default save path
    global MaxSearchPage, NeedSave, DefaultPath
    MaxSearchPage = page
    NeedSave = save
    DefaultPath = savePath
    key = input("Please input what you want to search: ")
    StartSource = "http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=" + str(key) + "&ct=201326592&v=flip"
    spidler(StartSource)
```
Call the entry point to search for images by keyword:
```python
beginSearch(page=1, save=0)
```
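To actually save the downloaded images, you would pass save=1 and a path on your machine (the path below is just a placeholder, and the directory has to exist already, since the code does not create it):

```python
# Crawl 3 pages and write the images into an existing local directory
beginSearch(page=3, save=1, savePath="/tmp/baidu_pictures/")
```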
Summary
**Because my understanding of Python is not particularly deep yet, the code is a bit clumsy. Compared with the crawler framework Scrapy, using requests and re is not especially elegant, but it is a good way to learn and understand how crawlers work. Next I will write about the process of learning the crawler framework Scrapy. If there are mistakes, please correct me~**