A simple Python crawler for scraping the Douban Books Top 250
This is for testing technical feasibility only; please do not crawl other people's websites.
On a boring afternoon spent thinking about life, I wondered what simple content I could crawl, and suddenly found this page: (…"Douban Books"), which looked pretty good. So let's start ~
Here are the modules that will be used:
import requests
from bs4 import BeautifulSoup
import pandas
bs4 is used to extract the contents of tags, and pandas gives us a simple way to store the retrieved data in Excel.
If you have not used bs4 and pandas before, you will need to install them with pip first.
res = requests.get('')
soup = BeautifulSoup(res.text , 'html.parser')
BeautifulSoup parses the response text as HTML so that we can extract the contents of tags later. Next, open the Douban Books page, press F12, and inspect the elements. You will see that most of the information on this page lives in tr elements whose class is item.
for news in soup.select('.item'):
This loops over every element in soup whose class is item, i.e. over each of those tr tags and everything inside it. Here is one of them, pulled out as an example:
<tr class="item">
  <td valign="top" width="100">
    <a class="nbg" href="" onclick="moreurl(this,{i:'0'})"> <img src="" width="90"/> </a>
  </td>
  <td valign="top">
    <div class="pl2">
      <a href="" onclick="moreurl(this,{i:'0'})" title="The Kite Runner"> The Kite Runner </a>
      <img alt="Preview available" src="" title="Preview available"/> <br/>
      <span style="font-size:12px;">The Kite Runner</span>
    </div>
    <p class="pl"> </p>
    <div class="star clearfix">
      <span class="allstar45"></span>
      <span class="rating_nums">8.9</span>
      <span class="pl"> </span>
    </div>
    <p class="quote" style="margin: 10px 0; color: #666">
      <span class="inq">For you, a thousand times over</span>
    </p>
  </td>
</tr>
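To make the row selection concrete, here is a minimal sketch that runs offline. The markup is a hypothetical, heavily trimmed stand-in for the row above (the second book is invented for illustration), not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed markup in the same shape as the example row.
SAMPLE_HTML = """
<table>
  <tr class="item">
    <td><a class="nbg" href=""></a></td>
    <td>
      <div class="pl2"><a href="" title="The Kite Runner">The Kite Runner</a></div>
      <p class="pl"></p>
      <span class="rating_nums">8.9</span>
    </td>
  </tr>
  <tr class="item">
    <td><a class="nbg" href=""></a></td>
    <td>
      <div class="pl2"><a href="" title="Sample Book">Sample Book</a></div>
      <p class="pl"></p>
      <span class="rating_nums">9.0</span>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# ".item" is a CSS class selector: it matches every element with class "item".
rows = soup.select(".item")

# Each row is itself searchable, so later lookups stay scoped to one book.
ratings = [row.select_one(".rating_nums").text for row in rows]
print(len(rows), ratings)
```

Because each row is searched on its own, a lookup such as the second a tag always refers to the current book rather than to the whole page.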
Now let’s pull out the information we need bit by bit. First we want to know the name of the book:
title = news.select('a')[1].text.replace(' ','')
This takes the text of the second a tag in the row (indices start at 0) and uses replace(' ', '') to remove the extra spaces so that the title is easier to read.
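One thing worth knowing about replace(' ', '') is that it deletes every space, including the ones inside an English title, and leaves newlines alone. The sketch below compares it with a whitespace-collapsing alternative. The raw string is a hypothetical example of tag text, and both helper names are made up for illustration.

```python
def strip_all_spaces(text: str) -> str:
    """The approach used above: delete every space character."""
    return text.replace(" ", "")


def collapse_whitespace(text: str) -> str:
    """Alternative: trim both ends and squeeze whitespace runs to one space."""
    return " ".join(text.split())


# Hypothetical raw text of an <a> tag, padded with indentation and newlines.
raw = "\n        The Kite Runner\n      "

print(repr(strip_all_spaces(raw)))
print(repr(collapse_whitespace(raw)))
```

For titles written in Chinese characters the two behave much the same, which is probably why the simpler replace works well enough here.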
Because the author, publisher, publication date, price, and so on all sit in a single p tag, we have to figure out how to separate them:
a = news.select('p')[0].text
price = a.split('/')[-1]
time = a.split('/')[-2]
store = a.split('/')[-3]
This grabs the text of the row's first p tag. The problem in separating the fields is that there may be one or two names at the front, so we cannot count positions from the left. The trick I came up with is to split on '/' and index from the end: the last item is the price, the second to last is the publication date, and the third to last is the publisher.
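The split-from-the-right idea can be sketched as a small helper. The function name and the sample strings are hypothetical; the dictionary keys mirror the variable names used in this post.

```python
def split_book_info(info: str) -> dict:
    """Split 'name / [name /] publisher / date / price', indexing from the right."""
    parts = [part.strip() for part in info.split("/")]
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(parts)}: {info!r}")
    return {
        "names": parts[:-3],   # one or two names, whatever precedes the publisher
        "store": parts[-3],    # publisher
        "time": parts[-2],     # publication date
        "price": parts[-1],
    }


# Hypothetical inputs: one with a single author, one with author plus translator.
one_author = split_book_info("Author A / Some Press / 2006-5 / 29.00")
with_translator = split_book_info(
    "Author A / Translator B / Some Press / 2006-5 / 29.00"
)
print(one_author["names"], with_translator["names"])
```

This only holds up while no field contains a '/' of its own; a publisher or price written with a slash would shift the positions.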
name1 = news.select('p')[0].text.split('/')[:-3][0]
name2 = ""
try:
    name2 = "," + news.select('p')[0].text.split('/')[:-3][1]
except IndexError: pass  # no second name for this book
Next we need the author's name. Some books list a single author; others list a foreign author and a Chinese translator. To handle both, we split on '/' and take [:-3], i.e. everything before the publisher. The first element, which is the first name, goes into name1. The second name may not exist, so name2 defaults to the empty string "", and the lookup of the second name is wrapped in a try/except statement: if there is no second name, we simply pass. Without the try, the index would be out of range and the program would fail.
Then we take out the book's one-line introduction. Its class name is inq, but running the code raised an error saying the index was out of range. After running it several times, I found that it stopped at the same place every time: an entry that has no inq element.
Wrapping the lookup of the introduction in a try handles this:
jianjie = ""
try:
    jianjie = news.find_all(class_='inq')[0].text
except IndexError: pass  # this book has no introduction
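As an alternative to try/except, the missing introduction can be handled with select_one, which returns None when nothing matches. This is only a sketch of that option, not the approach used in this post; the helper name and the two sample rows are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_blurb(row) -> str:
    """Return the text of the first element with class "inq", or "" if absent."""
    tag = row.select_one(".inq")  # None when the row has no such element
    return tag.text if tag is not None else ""


# Hypothetical rows: the second one has no introduction.
html = """
<table>
  <tr class="item"><td><span class="inq">For you, a thousand times over</span></td></tr>
  <tr class="item"><td></td></tr>
</table>
"""

rows = BeautifulSoup(html, "html.parser").select(".item")
blurbs = [extract_blurb(row) for row in rows]
print(blurbs)
```

The explicit None check makes the "no introduction" case visible in the code, whereas a broad except can also hide unrelated mistakes.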
For the sake of world peace, I also sent an e-mail to Douban to point out the mistake :) The last thing we need to take out is the number of people who rated the book:
person = news.find_all(class_='pl')[1].text.replace(' ','')
I'm using the same kind of statement here, again removing the extra spaces. So where do we put the extracted content? The answer is a list of dictionaries ~ which you can create right below the imports at the top of the code:
newsary = []
Then, back inside the loop where we were working, add each book to the list with append:
newsary.append({'title': title , 'name': name1 + name2 , 'person':person , 'jianjie': jianjie , 'price' : price , 'time' : time , 'store' : store })
I think you can all read this. The thing to notice is that the value of name is name1 + name2. So far we are only fetching one page, so how do we get the next one? After testing, we can see that the link to the next page is: … and the one after that is: … Got it? The number at the end of the link grows by 25 with each page, so we wrap everything in an outer loop:
for i in range(10):
    res = requests.get('' + str(i*25))
All we need is one loop like this and multiple requests to Douban to get the full 250 pieces of data!
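The offsets produced by that loop can be checked without touching the network. In this sketch BASE_URL is a placeholder, since the real address is not given in this post, and the query parameter name in it is likewise an assumption; only the step of 25 and the ten pages come from the loop above.

```python
# Placeholder only: neither this host nor the parameter name comes from the post.
BASE_URL = "https://example.com/top250?start="
PAGE_SIZE = 25   # books per page, matching i*25 in the loop
TOTAL = 250      # size of the whole list


def page_urls(base_url: str = BASE_URL) -> list:
    """Build one URL per page by appending offsets 0, 25, ..., 225."""
    return [base_url + str(offset) for offset in range(0, TOTAL, PAGE_SIZE)]


urls = page_urls()
print(len(urls), urls[0], urls[-1])
```

When actually sending the requests, pausing briefly between pages (for example with time.sleep) is a courteous habit that keeps the load on the site low.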
Once the loops have finished, outside of them:
newsdf = pandas.DataFrame(newsary)
newsdf.to_excel('doubanbook1.xlsx')
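As a sketch of what the DataFrame step does with a list of dictionaries, here is a tiny offline example. The records are invented and use only a few of the keys from newsary; the save helper is hypothetical and is defined but not called here.

```python
import pandas

# Hypothetical records in the same shape as the dictionaries appended to newsary.
records = [
    {"title": "Book A", "name": "Author A,Translator B", "price": "29.00"},
    {"title": "Book B", "name": "Author C", "price": "35.00"},
]

# Dictionary keys become the columns; each dictionary becomes one row.
df = pandas.DataFrame(records)
print(df.shape)
print(list(df.columns))


def save(frame: pandas.DataFrame, path: str = "doubanbook1.xlsx") -> None:
    """Write the table to an Excel file (requires an engine such as openpyxl)."""
    # index=False leaves the 0..n-1 row index out of the spreadsheet.
    frame.to_excel(path, index=False)
```

Calling save(df) would write the two rows to doubanbook1.xlsx in the current working directory.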
pandas' DataFrame turns newsary into a table, and to_excel stores the contents of newsdf in doubanbook1.xlsx. The file is created in the directory the script is run from (replacing any existing file of that name), so it does not have to exist beforehand; writing .xlsx files also needs an Excel engine such as openpyxl to be installed. Then run the .py file, wait for it to complete, open doubanbook1.xlsx, and you will see the scraped data. That's it for this simple crawler ~ The full source code is attached below for your reference:
import requests
from bs4 import BeautifulSoup
import pandas

newsary = []
for i in range(10):
    res = requests.get('' + str(i*25))
    soup = BeautifulSoup(res.text , 'html.parser')
    for news in soup.select('.item'):  # locate each book row
        person = news.find_all(class_='pl')[1].text.replace(' ','')
        title = news.select('a')[1].text.replace(' ','')
        jianjie = ""
        try:
            jianjie = news.find_all(class_='inq')[0].text
        except IndexError:
            pass  # this book has no introduction
        name = news.select('p')[0].text
        a = news.select('p')[0].text
        price = a.split('/')[-1]
        time = a.split('/')[-2]
        store = a.split('/')[-3]
        # author = a.split('/')[-4]
        # name = a.split('/')[:-4]
        name1 = news.select('p')[0].text.split('/')[:-3][0]
        name2 = ""
        try:
            name2 = "," + news.select('p')[0].text.split('/')[:-3][1]
        except IndexError:
            pass  # no second name for this book
        newsary.append({'title': title , 'name': name1 + name2 , 'person':person , 'jianjie': jianjie , 'price' : price , 'time' : time , 'store' : store })

newsdf = pandas.DataFrame(newsary)
newsdf.to_excel('doubanbook1.xlsx')