This article is participating in Python Theme Month; see the event page for more details. Recently, I was looking for an apartment on Douban, but found the process a little inefficient: combing through posts in the Douban groups one by one wastes a lot of time. So I used Python to crawl the Douban pages and built a simple tool to extract the key information, which is much more efficient.

The basic idea

  1. Use requests to fetch the relevant pages
  2. Use BeautifulSoup to extract the content of the relevant DOM nodes
  3. Use re to match the content and filter out the key information
  4. Use Pandas to generate an Excel file

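Putting these four steps together, a minimal sketch looks roughly like this (it borrows the group URL, selectors, and a sample keyword from the sections below; the real tool adds pagination, a proper Cookie, full keyword filtering, and deduplication):

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get(
    "https://www.douban.com/group/szsh/discussion?start=0",
    headers={"User-Agent": "Mozilla/5.0", "Cookie": "<your Douban cookie>"},
).text                                        # 1. requests fetches the page
soup = BeautifulSoup(html, "html.parser")     # 2. BeautifulSoup parses the DOM
rows = soup.find("table", class_="olt").find_all("tr", class_="")
titles = [r.find("td", class_="title").get_text().strip() for r in rows]
wanted = [t for t in titles if re.search("科技园", t)]     # 3. re filters by keyword
pd.DataFrame({"title": wanted}).to_excel("preview.xlsx")  # 4. Pandas writes Excel
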
Crawl the HTML of all the list pages

For example: crawl the first 10 pages of posts from the two rental groups, "Rent in Shenzhen" and "Shenzhen Rental Group".

Note: you need to obtain the Cookie yourself: log in to Douban in the browser, press F12, and copy it from a request on the Network tab. Crawling frequently without a Cookie will get your access blocked by Douban.
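
To avoid pasting the Cookie straight into the script, one option (not part of the original code) is to read it from an environment variable, for example a hypothetical DOUBAN_COOKIE:

import os

# Hypothetical: run `export DOUBAN_COOKIE='...'` in your shell first,
# then read the value here instead of hard-coding it
cookie = os.environ.get("DOUBAN_COOKIE", "")
if not cookie:
    raise RuntimeError("Please set the DOUBAN_COOKIE environment variable")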

import requests

# Start index, end index, items per page
page_indexs = range(0, 250, 25)

# Rental group links
baseUrls = ['https://www.douban.com/group/szsh/discussion',    # Rent in Shenzhen
            'https://www.douban.com/group/106955/discussion',  # Shenzhen Rental Group
            ]

# Cookie: note that this must be filled in
cookie = 'Find your cookie after logging into Douban in your browser'

def download_all_htmls():

    htmls = []
    for baseUrl in baseUrls:
        for idx in page_indexs:
            UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212'
            url = f"{baseUrl}?start={idx}"
            print("download_all_htmls craw html:", url)
            r = requests.get(url,
                            headers={"User-Agent": UA, "Cookie": cookie})
            if r.status_code != 200:
                print('download_all_htmls, r.status_code', r.status_code)
                # raise Exception("error")
            htmls.append(r.text)
    return htmls

htmls = download_all_htmls()
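
The full script below imports random but never calls it; a plausible use, and a good habit in any case, is a short random delay between page requests. A minimal sketch (my own addition, adjust the interval as needed) that could go inside the inner loop, just before requests.get:

import time
import random

# Sleep for 1-3 seconds between page requests to avoid hammering Douban
time.sleep(1 + 2 * random.random())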

Get the contents of the relevant DOM nodes

On the Douban page, press F12 to inspect the relevant elements: mainly the title, the link href, and the post time.

def parse_single_html(html):

    soup = BeautifulSoup(html, 'html.parser')

    # Each post row in the list table
    article_items = (
        soup.find("table", class_="olt")
            .find_all("tr", class_="")
    )

    datas = []
    for article_item in article_items:

        # Post title
        title = article_item.find("td", class_="title").get_text().strip()
        # Post link
        link = article_item.find("a")["href"]
        # Post time
        time = article_item.find("td", class_="time").get_text()
        # Collect the extracted fields
        datas.append({"title": title, "link": link, "time": time})

    return datas

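
As a quick sanity check of the selectors, you could run the function on the first downloaded page and print a few rows (assuming htmls from the previous step is non-empty):

# Print the first few extracted rows from the first page
for row in parse_single_html(htmls[0])[:5]:
    print(row)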

Filter key information with re

You can use re to filter the titles by keyword:

# Article title
title = article_item.find("td", class_="title").get_text().strip()

# Match the location keywords: Science Park, Zhuzilin, Chegongmiao
res1 = re.search("科技园|竹子林|车公庙", title)
# Filter for one-room listings
res2 = re.search("单间|一房|1房|一室|1室", title)

if res1 is not None and res2 is not None:
    print(title,link,time)
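
Since the same two patterns are applied to every title, they could also be compiled once up front; a small optional tweak (not in the original code), reusing the keywords above:

# Compile the keyword patterns once and reuse them for every title
loc_pattern = re.compile("科技园|竹子林|车公庙")     # Science Park / Zhuzilin / Chegongmiao
room_pattern = re.compile("单间|一房|1房|一室|1室")  # one-room keywords

if loc_pattern.search(title) and room_pattern.search(title):
    print(title, link, time)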

Generate Excel

Finally, the data obtained above is written to an Excel file, and this simple tool for filtering Douban rental posts is complete. The complete code is as follows:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import random

# Start index, end index, items per page
page_indexs = range(0, 250, 25)

# Rental group links
baseUrls = ['https://www.douban.com/group/szsh/discussion',    # Rent in Shenzhen
            'https://www.douban.com/group/106955/discussion',  # Shenzhen Rental Group
            ]

# Cookie: note that this must be filled in
cookie = 'Find your cookie after logging into Douban in your browser'

# Download every page
def download_all_htmls():
    htmls = []
    for baseUrl in baseUrls:
        for idx in page_indexs:

            UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212'
            url = f"{baseUrl}?start={idx}"
            print("download_all_htmls craw html:", url)
            r = requests.get(url,
                            headers={"User-Agent": UA, "Cookie": cookie})
            if r.status_code != 200:
                print('download_all_htmls, r.status_code', r.status_code)
            htmls.append(r.text)
    return htmls

htmls = download_all_htmls()

# Store seen titles for later deduplication
datasKey = []

# Parse a single HTML page and extract the data
def parse_single_html(html):

    soup = BeautifulSoup(html, 'html.parser')

    article_items = (
        soup.find("table", class_="olt")
            .find_all("tr", class_="")
    )

    datas = []

    for article_item in article_items:

        # Post title
        title = article_item.find("td", class_="title").get_text().strip()
        # Post link
        link = article_item.find("a")["href"]
        # Post time
        time = article_item.find("td", class_="time").get_text()

        # Match the location keywords: Science Park, Zhuzilin, Chegongmiao
        res1 = re.search("科技园|竹子林|车公庙", title)
        # Filter for one-room listings
        res2 = re.search("单间|一房|1房|一室|1室", title)

        # Keep rows whose title matches a location and a room type
        # and has not been seen before
        if res1 is not None and res2 is not None and title not in datasKey:
            print(title, link, time)
            datasKey.append(title)
            datas.append({
                "title": title,
                "link": link,
                "time": time
            })
    return datas

all_datas = []

# Iterate over all the HTML pages and parse them
for html in htmls:
    all_datas.extend(parse_single_html(html))

df = pd.DataFrame(all_datas)
# Write the data to Excel
df.to_excel("test.xlsx")
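
One practical note: to_excel with an .xlsx path relies on an Excel engine such as openpyxl being installed, so run pip install openpyxl first if pandas complains about a missing module.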

Here are the results:

We can also run the code in a Jupyter Notebook; then we don't even need to export Excel, we can see the result directly:
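
For example, the last cell of the notebook could simply display the DataFrame instead of writing a file (a minimal illustration, reusing all_datas from the script above):

# In a Jupyter cell, the DataFrame renders as a table on its own
df = pd.DataFrame(all_datas)
df.head(20)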

Finally, if this article was helpful or interesting to you, please give it a like.