This article is participating in Python Theme Month; see the event link for more details. Recently I was looking for an apartment on Douban, but browsing the rental groups post by post proved inefficient and time-consuming. So I used Python to crawl the Douban group pages and built a simple tool that extracts the key information, which is much more efficient.
The basic idea
- Use requests to fetch the relevant web pages
- Use BeautifulSoup to extract the content of the relevant DOM nodes
- Use re to filter out the key information with regular expressions
- Use pandas to generate an Excel file
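Before walking through each step in detail, here is a minimal sketch of the whole flow. It is only an illustration: the Cookie header that the article adds later is omitted, the keyword is just one of the article's examples, and the output file name is a placeholder.
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder list-page URL; the Cookie header added later in the article is omitted here
url = "https://www.douban.com/group/szsh/discussion?start=0"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text       # 1. fetch the page with requests
soup = BeautifulSoup(html, "html.parser")                                  # 2. parse the DOM with BeautifulSoup
rows = soup.find("table", class_="olt").find_all("tr", class_="")          # 3. locate the post rows
titles = [row.find("td", class_="title").get_text().strip() for row in rows]
hits = [t for t in titles if re.search("科技园", t)]                        # 4. keep titles matching a keyword (科技园 = Science Park)
pd.DataFrame({"title": hits}).to_excel("sketch.xlsx")                      # 5. export to Excel with pandas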
Crawl the HTML of all the list pages
For example, crawl the first 10 pages of data from two Shenzhen rental groups: the "Rent in Shenzhen" group and the "Shenzhen Rental Group".
Note: you need to obtain the Cookie yourself by logging in to Douban in your browser, pressing F12, and inspecting a request in the Network tab. Crawling frequently without a Cookie will get your access banned by Douban.
import requests

# Start index, end index, and number of entries per page
page_indexs = range(0, 250, 25)

# Rental group links
baseUrls = ['https://www.douban.com/group/szsh/discussion',   # Rent in Shenzhen
            'https://www.douban.com/group/106955/discussion'  # Shenzhen Rental Group
           ]

# Cookie: paste your own here
cookie = 'Find your cookie after logging into Douban in your browser'

def download_all_htmls():
    htmls = []
    for baseUrl in baseUrls:
        for idx in page_indexs:
            UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212'
            url = f"{baseUrl}?start={idx}"
            print("download_all_htmls crawl html:", url)
            r = requests.get(url,
                             headers={"User-Agent": UA, "Cookie": cookie})
            if r.status_code != 200:
                print('download_all_htmls, r.status_code', r.status_code)
                # raise Exception("error")
            htmls.append(r.text)
    return htmls

htmls = download_all_htmls()
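The complete script at the end of the article also imports random, which suggests pacing the requests. As an extra precaution against being blocked, a randomized delay can be added after each request; a minimal sketch (the 1 to 3 second range is an assumption, not from the original):
import time
import random

# Inside the inner loop of download_all_htmls(), right after requests.get():
# pause for a random 1 to 3 seconds (an arbitrary choice) so the requests
# don't hammer Douban and trigger a ban
time.sleep(random.uniform(1, 3))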
Get the content of the relevant DOM nodes
On the Douban page, press F12 to inspect the relevant page elements; the ones we care about are the title, the link (href), and the time.
def parse_single_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    # The rows of the list, one per post
    article_items = (
        soup.find("table", class_="olt")
            .find_all("tr", class_="")
    )
    for article_item in article_items:
        # Post title
        title = article_item.find("td", class_="title").get_text().strip()
        # Post link
        link = article_item.find("a")["href"]
        # Post time
        time = article_item.find("td", class_="time").get_text()
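To make the selectors concrete, here is a small self-contained check against a hand-written HTML fragment that mimics the assumed structure of the group list page. The fragment and its link are hypothetical illustrations, not Douban's actual markup.
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the assumed list-page structure
sample_html = """
<table class="olt">
  <tr class="">
    <td class="title"><a href="https://www.douban.com/group/topic/placeholder/">Sample post title</a></td>
    <td class="time">05-20 10:00</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# On the real page the code filters rows with class_=""; one row is enough here
row = soup.find("table", class_="olt").find("tr")
print(row.find("td", class_="title").get_text().strip())  # Sample post title
print(row.find("a")["href"])                               # the post link
print(row.find("td", class_="time").get_text())            # 05-20 10:00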
Filter key information with re
You can use re (regular expressions) to filter the information in the title.
# Post title
title = article_item.find("td", class_="title").get_text().strip()
# Match the location keywords: Science Park (科技园), Zhuzilin (竹子林), Chegongmiao (车公庙)
res1 = re.search("科技园|竹子林|车公庙", title)
# Match one-room listings (单间 = single room, 一室 / 一房 / 1室 / 1房 = one room)
res2 = re.search("单间|一室|一房|1室|1房", title)
if res1 is not None and res2 is not None:
    print(title, link, time)
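As a quick illustration of how the alternation (|) works, here is a tiny standalone example; the sample titles are made up.
import re

# Hypothetical title meaning "single room for rent near Science Park"
sample_title = "科技园附近单间出租"

print(re.search("科技园|竹子林|车公庙", sample_title))    # a Match object: location keyword found
print(re.search("单间|一室|一房|1室|1房", sample_title))   # a Match object: room keyword found
print(re.search("科技园|竹子林|车公庙", "福田两房一厅"))    # None: no location keyword in this title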
Generate Excel
Finally, write the data collected above into an Excel file, and this simple tool for filtering Douban posts is complete. The complete code is as follows:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import random  # presumably intended for randomized request delays (see the sketch above)

# Start index, end index, and number of entries per page
page_indexs = range(0, 250, 25)

# Rental group links
baseUrls = ['https://www.douban.com/group/szsh/discussion',   # Rent in Shenzhen
            'https://www.douban.com/group/106955/discussion'  # Shenzhen Rental Group
           ]

# Cookie: paste your own here
cookie = 'Find your cookie after logging into Douban in your browser'

# Download every list page
def download_all_htmls():
    htmls = []
    for baseUrl in baseUrls:
        for idx in page_indexs:
            UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212'
            url = f"{baseUrl}?start={idx}"
            print("download_all_htmls crawl html:", url)
            r = requests.get(url,
                             headers={"User-Agent": UA, "Cookie": cookie})
            if r.status_code != 200:
                print('download_all_htmls, r.status_code', r.status_code)
            htmls.append(r.text)
    return htmls

htmls = download_all_htmls()

# Save each seen title for later de-duplication
datasKey = []

# Parse a single HTML page and extract the data
def parse_single_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    article_items = (
        soup.find("table", class_="olt")
            .find_all("tr", class_="")
    )
    datas = []
    for article_item in article_items:
        # Post title
        title = article_item.find("td", class_="title").get_text().strip()
        # Post link
        link = article_item.find("a")["href"]
        # Post time
        time = article_item.find("td", class_="time").get_text()
        # Match the location keywords: Science Park (科技园), Zhuzilin (竹子林), Chegongmiao (车公庙)
        res1 = re.search("科技园|竹子林|车公庙", title)
        # Match one-room listings (单间 = single room, 一室 / 一房 / 1室 / 1房 = one room)
        res2 = re.search("单间|一室|一房|1室|1房", title)
        # Keep posts that match both a location and a room type and have not been seen before
        if res1 is not None and res2 is not None and title not in datasKey:
            print(title, link, time)
            datasKey.append(title)
            datas.append({
                "title": title,
                "link": link,
                "time": time
            })
    return datas

all_datas = []
# Iterate over all the HTML pages and parse them
for html in htmls:
    all_datas.extend(parse_single_html(html))

df = pd.DataFrame(all_datas)
# Write the data to Excel
df.to_excel("test.xlsx")
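One practical note: pandas writes .xlsx files through an engine such as openpyxl, so that package needs to be installed. You can also sanity-check the exported file by reading it back (the file name matches the script above):
# .xlsx output needs an engine such as openpyxl: pip install openpyxl
import pandas as pd

df = pd.read_excel("test.xlsx", index_col=0)
print(df.head())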
Here are the results:
We can also run the code in a Jupyter Notebook; that way we don't even have to export to Excel, we can see the result directly:
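For example, as the last line of a notebook cell (a minimal illustration; df is the DataFrame built above):
# In a Jupyter Notebook the last expression of a cell is rendered as a table,
# so the filtered posts can be inspected without exporting to Excel
df.head(10)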
Finally, if the article was helpful or interesting to you, please give it a like