Some friends wanted to do data analysis on Lianjia's second-hand housing listings and asked me to help scrape some data. I had never done this before, so I searched around online and gave it a try. It turned out not to be difficult, and I've summarized the process below.

Tools

  • python3
  • Python third-party libraries:
    • BeautifulSoup (for parsing the HTML)
    • pandas (for processing the data and writing it to Excel)
    • Requests (for sending HTTP requests)

The simplest way to install these libraries is with pip:

pip install pandas
pip install requests
pip install beautifulsoup4

Approach

The basic idea of scraping is to simulate a user's request in code, parse the returned page content, and pick out the information you need. I took a quick look at Lianjia's page structure, and it is fairly tidy. These are the paginated list URLs for second-hand housing in Shenzhen:

https://sz.lianjia.com/ershoufang/pg1
https://sz.lianjia.com/ershoufang/pg2
...
https://sz.lianjia.com/ershoufang/pg99
https://sz.lianjia.com/ershoufang/pg100

Request each of these list pages and parse the response to grab the link to every house's detail page. I use a regular expression for the parsing (see catchHouseList in the source code; a short sketch also follows the example links below). The extracted links look something like this:

https://sz.lianjia.com/ershoufang/105101151981.html
https://sz.lianjia.com/ershoufang/105101102328.html
https://sz.lianjia.com/ershoufang/105100779210.html
https://sz.lianjia.com/ershoufang/105101254525.html
https://sz.lianjia.com/ershoufang/105101201989.html
https://sz.lianjia.com/ershoufang/105101262457.html
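As a minimal sketch of that step (assuming, as the links above suggest, that every detail page URL is a numeric listing ID ending in .html), the extraction boils down to a single regular expression. The helper name list_detail_links is just for illustration; the real function in the source below is catchHouseList:

import re
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # any browser-like User-Agent; the full string is in the source below

def list_detail_links(list_url):
    # pull out links like https://sz.lianjia.com/ershoufang/105101151981.html
    html = requests.get(list_url, headers=headers).text
    links = re.findall(r'href="(https://sz\.lianjia\.com/ershoufang/\d+\.html)"', html)
    # the same link can appear more than once per listing card, so de-duplicate while keeping order
    return list(dict.fromkeys(links))

print(list_detail_links('https://sz.lianjia.com/ershoufang/pg1'))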

After getting the detail links, request each one and let BeautifulSoup parse the detail page to pull out the fields you want. Finally, write the data to Excel via pandas (see appendToXlsx), in append mode.
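Since to_excel does not append rows to an existing file, the append here means reading the old file, concatenating the new row, and writing everything back. A minimal sketch of that pattern (append_row and the file name are illustrative, not from the original):

import os
import pandas as pd

def append_row(info, file_name='./listings.xlsx'):
    df_new = pd.DataFrame([info])              # one listing dict -> a one-row DataFrame
    if os.path.exists(file_name):
        df_old = pd.read_excel(file_name)      # everything written so far
        df_new = pd.concat([df_old, df_new])   # stack the new row underneath
    df_new.to_excel(file_name, index=False)    # rewrite the file, dropping the numeric index column

append_row({'Title': 'example listing', 'Total price': '500'})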

Note that most sites restrict access: if you request pages too frequently, the server may treat the requests as abnormal and stop returning correct results. So after each page request the script waits a little while before requesting the next one, to avoid being rejected by the server.

# I set this to 3 seconds
time.sleep(3)

The source code

Here is my source code. Install the third-party libraries listed above (pandas also needs an Excel engine such as openpyxl to read and write .xlsx files), then run the following in a Python 3 environment:


import requests
from bs4 import BeautifulSoup
import os
import time
import pandas as pd
import re



headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.7 Safari/537.36'
}


def catchHouseList(url):
    resp = requests.get(url, headers=headers, stream=True)
    if resp.status_code == 200:
        # match links to detail pages such as https://sz.lianjia.com/ershoufang/105101151981.html
        reg = re.compile(r'href="(https://sz\.lianjia\.com/ershoufang/\d+\.html)"')
        urls = re.findall(reg, resp.text)
        # the same link can appear more than once on a list page, so de-duplicate while keeping order
        return list(dict.fromkeys(urls))
    return []

def catchHouseDetail(url):
    resp = requests.get(url, headers=headers)
    print(url)
    if resp.status_code == 200:
        info = {}
        soup = BeautifulSoup(resp.text, 'html.parser')
        info['Title'] = soup.select('.main')[0].text
        info['Total price'] = soup.select('.total')[0].text
        info['Total price unit'] = soup.select('.unit')[0].text
        info['Price per square metre'] = soup.select('.unitPriceValue')[0].text
        # p = soup.select('.tax')
        # info['Tax'] = soup.select('.tax')[0]
        info['Year built'] = soup.select('.subInfo')[2].text
        info['Community name'] = soup.select('.info')[0].text
        info['Location'] = soup.select('.info a')[0].text + ':' + soup.select('.info a')[1].text
        info['Lianjia listing ID'] = str(url)[34:].rsplit('.html')[0]
        info['House type'] = str(soup.select('.content')[2].select('.label')[0].next_sibling)
        info['Floor'] = soup.select('.content')[2].select('.label')[1].next_sibling
        info['Floor area'] = soup.select('.content')[2].select('.label')[2].next_sibling
        info['House structure'] = soup.select('.content')[2].select('.label')[3].next_sibling
        info['Interior area'] = soup.select('.content')[2].select('.label')[4].next_sibling
        info['Building type'] = soup.select('.content')[2].select('.label')[5].next_sibling
        info['Orientation'] = soup.select('.content')[2].select('.label')[6].next_sibling
        info['Building structure'] = soup.select('.content')[2].select('.label')[7].next_sibling
        info['Decoration'] = soup.select('.content')[2].select('.label')[8].next_sibling
        info['Elevator-household ratio'] = soup.select('.content')[2].select('.label')[9].next_sibling
        info['Heating'] = soup.select('.content')[2].select('.label')[10].next_sibling
        info['Elevator'] = soup.select('.content')[2].select('.label')[11].next_sibling
        # info['Property rights term'] = str(soup.select('.content')[2].select('.label')[12].next_sibling)
        return info
    pass

def appendToXlsx(info):
    # append one listing (a dict) as a new row of the Excel file
    fileName = './Lianjia second-hand housing.xlsx'
    dfNew = pd.DataFrame([info])
    if os.path.exists(fileName):
        dfOld = pd.read_excel(fileName)
        df = pd.concat([dfOld, dfNew])
        df.to_excel(fileName, index=False)
    else:
        dfNew.to_excel(fileName, index=False)


def catch():
    pages = ['https://sz.lianjia.com/ershoufang/pg{}/'.format(x) for x in range(1, 1001)]
    for page in pages:
        print(page)
        houseListURLs = catchHouseList(page)
        for houseDetailUrl in houseListURLs:
            try:
                info = catchHouseDetail(houseDetailUrl)
                appendToXlsx(info)
            except:
                # skip listings that fail to download or parse
                pass
            time.sleep(3)

    pass

if __name__ == '__main__':
    catch()
    

Closing thoughts

  • There is nothing technically fancy here, it only involves a few third-party tools, but I still ran into problems; pandas in particular took me a while to get right. A lot of things that look very simple turn out to have unexpected problems once you actually sit down and do them.
  • As for the scraping itself, it is nothing unusual technically (of course, some sites are genuinely hard to scrape), but my friends needed it, so perhaps the value of a piece of work shouldn't be judged from the technical angle alone.

References

I read through a pile of results from Google and Baidu; I can't tell which were the original sources, so I won't list them here.