Crawling 100,000 Lianjia rental listings

1. Background

  • My girlfriend chose Python data analysis for her engineering practicum and needed to analyze the current rental market in Beijing, Shanghai, Guangzhou and Shenzhen, so I dusted off my old Python code to see whether I could crawl tens of thousands of listings for her to analyze.
  • I work on the front end, so I parse the page data with pyquery, a library whose syntax feels a bit like jQuery (a tiny example follows this list).
  • The project may also become a graduation project, so I deliberately mix several approaches: the district data is stored in CSV files while the rental details go into MySQL via pymysql, and the district URLs are fetched with selenium while the detail pages are fetched with requests.
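For anyone who hasn't used pyquery, here is a tiny self-contained example (not part of the crawler) of its jQuery-like selector syntax:

from pyquery import PyQuery as pq

# Parse an HTML fragment and query it with CSS selectors, jQuery-style
doc = pq('<ul id="filter"><li class="on">不限</li><li>福田</li></ul>')
for li in doc('#filter li').items():
    print(li.text())        # 不限, then 福田
print(doc('li.on').text())  # 不限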

2. Analysis

  • Taking Shenzhen as an example, Lianjia's rental listings live at https://sz.lianjia.com/zufang/; checking Beijing, Shanghai, Guangzhou and Shenzhen one by one gives the city array ['sz','sh','bj','gz'].
  • Looking at Shenzhen's paging URL, https://sz.lianjia.com/zufang/pg100/#contentList: once the page number exceeds 100 you get the same data as page 100, so you cannot get all the listings just by incrementing the number after pg!
  • The idea, then: a city is divided into districts, and a district into streets. Each street is also capped at 100 pages, but even if one or two streets occasionally exceed that, crawling street by street yields far more data than the 100 pages available for the whole city (see the sketch below for how the page count is read).
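As a quick sanity check of this, the page count a listing URL reports can be read directly. A minimal sketch, relying on the same .content__pg / data-totalpage markup that the crawler below uses:

import requests
from pyquery import PyQuery as pq

def total_pages(base_url):
    # Read the page count Lianjia reports for a listing URL
    res = requests.get(base_url)
    doc = pq(res.text, parser='html')
    return int(doc('.content__pg').attr('data-totalpage'))

# e.g. total_pages('https://sz.lianjia.com/zufang/') is capped at 100,
# which is why the crawler descends to district and street URLs instead.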

3. Problems

  • A few problems came up during the actual coding.
  • The scraped list contains some ads: on inspection, ad items have no element for the house's address, so that is used to decide whether an item is an ad; if it is, it is skipped and the next item is crawled.
  • Lianjia's anti-crawling strategy: it is relatively friendly; even if you get caught, a couple of human-machine verifications get you going again, so here I only use time.sleep() pauses as a crude countermeasure. If you have the budget you can buy proxies on Taobao to build an IP pool and crawl without worry; the free proxy sites are basically useless.
  • If the anti-crawl kicks in partway through, my fix is fairly low-tech: manually delete the already-crawled URLs from the CSV file, then run again to fetch the rest (a sketch of a less manual alternative follows this list).
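A less manual alternative would be to keep a small progress file of street URLs that have already been crawled and skip them on the next run. This is only a sketch of that idea; the done.txt file name and the load_done/mark_done helpers are my own, not part of the original code:

import os

DONE_FILE = 'done.txt'   # hypothetical progress file, one finished URL per line

def load_done():
    if not os.path.exists(DONE_FILE):
        return set()
    with open(DONE_FILE, 'r', encoding='utf-8') as f:
        return set(line.strip() for line in f)

def mark_done(url):
    with open(DONE_FILE, 'a', encoding='utf-8') as f:
        f.write(url + '\n')

# In run(): skip rows whose URL is in load_done(), and call mark_done(row[0])
# once a street has been fully crawled.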

4. Implementation

  1. Start by creating the database and tables
  • The database is named lianjia, with one table per city; the shanghai table is shown below (a pymysql sketch that creates all four tables follows the SQL).
create table shanghai(
  id int(5) PRIMARY KEY NOT NULL auto_increment,
  city VARCHAR(200),
  hName VARCHAR(200),
  way VARCHAR(200),
  address VARCHAR(200) ,
  area VARCHAR(200) ,
  position VARCHAR(200),
  type VARCHAR(200),
  price VARCHAR(200),
  time VARCHAR(200),
  url VARCHAR(200)
)
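Since every city needs the same table, the tables can also be created programmatically. A minimal pymysql sketch, assuming the same MySQL credentials the crawler uses later:

import pymysql

# One table per city, same schema as the shanghai table above
CITIES = ['shenzhen', 'shanghai', 'beijing', 'guangzhou']
COLUMNS = ("id int(5) PRIMARY KEY NOT NULL auto_increment, "
           "city VARCHAR(200), hName VARCHAR(200), way VARCHAR(200), "
           "address VARCHAR(200), area VARCHAR(200), position VARCHAR(200), "
           "type VARCHAR(200), price VARCHAR(200), time VARCHAR(200), "
           "url VARCHAR(200)")

db = pymysql.connect(host='localhost', user='root', password='123456', db='lianjia')
cursor = db.cursor()
for name in CITIES:
    cursor.execute("create table if not exists " + name + " (" + COLUMNS + ")")
db.commit()
db.close()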
  2. Get the district data for each city
  • All dependencies are imported at the top of the file:
from selenium import webdriver
import time
import csv
from pyquery import PyQuery as pq
import requests
import pymysql
import random
  • The following code fetches the district URLs:
def getArea():
    brow = webdriver.Chrome()
    cityArr = ['sz', 'sh', 'bj', 'gz']
    # Open the file in append mode, ready to write district URLs
    file = open('area.csv', 'a', encoding='utf-8', newline='')
    writer = csv.writer(file)
    for city in cityArr:
        url = 'https://' + city + '.lianjia.com/zufang/'
        brow.get(url)
        doc = pq(brow.page_source, parser='html')
        ul = doc('#filter ul').items()
        # Find the filter list that holds the district links
        for item in ul:
            tem = item.attr('data-target')
            if tem == 'area':
                for li in item.items('li'):
                    if li.text() != '不限':
                        str = url.split('/zufang')[0] + li.children('a').attr('href')
                        writer.writerow(str.split(','))
        time.sleep(10)
    # Clean up
    file.close()
    brow.quit()
  3. Get the street-level URLs within each district
def getDetail():
    # Read the district URLs collected by getArea()
    arr = []
    with open('area.csv', 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            arr.append(row[0])
    # Write the street-level URLs
    file_detail = open('detail.csv', 'a', encoding='utf-8', newline='')
    writer_detail = csv.writer(file_detail)
    brow = webdriver.Chrome()
    for val in arr:
        brow.get(val)
        doc = pq(brow.page_source, parser='html')
        ul = doc('#filter ul').items()
        for i, item in enumerate(ul):
            # The fourth filter list holds the street links
            if i == 3:
                for li in item.items('li'):
                    if li.text() != '不限' and li.children('a'):
                        str = val.split('/zufang')[0] + li.children('a').attr('href')
                        writer_detail.writerow(str.split(','))
                        print(str)
        time.sleep(3)
    file_detail.close()
    brow.quit()
  4. Crawl the detailed rental information
def run():
    with open('detail.csv', 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            time.sleep(random.randint(20, 100))
            pgRes = requests.get(row[0])
            pgDoc = pq(pgRes.text, parser='html')
            # Total number of pages for this street
            pgSum = int(pgDoc('.content__pg').attr('data-totalpage'))
            pg = 0
            # Crawl every page of the street
            while pg < pgSum:
                pg += 1
                url = row[0] + 'pg%d' % pg
                print(url)
                res = requests.get(url)
                doc = pq(res.text, parser='html')
                city = doc('.content__title a')[0]
                str = ''
                time.sleep(random.randint(2, 20))
                # The page title carries the (Chinese) city name; map it to a table name
                if city.text == '深圳':
                    str = 'shenzhen'
                elif city.text == '广州':
                    str = 'guangzhou'
                elif city.text == '上海':
                    str = 'shanghai'
                elif city.text == '北京':
                    str = 'beijing'
                else:
                    raise Exception('City name error')

                list = doc('.content__list .content__list--item').items()
                for li in list:
                    # The lianjia database and the city tables must exist already
                    db = pymysql.connect(host='localhost', user='root', password='123456', db='lianjia')
                    tem = li.find('.content__list--item--des')
                    arr = tem.text().split('/')
                    way = li.find('.content__list--item--title a').text().split(',')
                    house_data = (
                        city.text,
                        tem.children('a')[2].text if tem.children('a').length > 1 else 'advertising',
                        way[0] if way[0] else '',
                        arr[0] if len(arr) > 0 else '',
                        arr[1] if len(arr) > 1 else '',
                        arr[2] if len(arr) > 2 else '',
                        arr[3] if len(arr) > 3 else '',
                        li.find('.content__list--item-price em').text(),
                        li.find('.content__list--item--time').text(),
                        ('https://sz.lianjia.com' + li.find('.twoline').attr('href')) if li.find('.twoline').attr('href') else ''
                    )
                    # Ad items have no address element; skip them
                    if house_data[1] == 'advertising':
                        continue
                    # Declare a cursor and insert the row
                    cursor = db.cursor()
                    sql = "insert into " + str + "(city,hName,way,address,area,position,type,price,time,url)" \
                          " values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                    cursor.execute(sql, house_data)
                    db.commit()
                    db.close()
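The detail pages above are fetched with bare requests.get calls. If Lianjia starts rejecting them, a slightly more defensive fetch helper could be swapped in. This is only a sketch; the User-Agent header and retry counts are my own choices, not from the original code:

import time, random, requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}   # assumed browser-like header

def fetch(url, retries=3):
    # GET a page with a browser-like User-Agent and a few retries
    for attempt in range(retries):
        try:
            res = requests.get(url, headers=HEADERS, timeout=10)
            if res.status_code == 200:
                return res
        except requests.RequestException:
            pass
        time.sleep(random.randint(5, 15))  # back off before retrying
    raise Exception('failed to fetch ' + url)

# In run(), requests.get(url) could be replaced with fetch(url).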

5. Complete code

  • In the current version, already-crawled URLs still have to be removed from the CSV file by hand after an anti-crawl interruption, and there may be duplicate or missing data.
  • On cnblogs I saw someone decompile the mobile client to find an API and crawl through that interface. I haven't tried it, but it looks worth a try if you're interested: https://www.cnblogs.com/mengyu/p/9115832.html
  • Because Juejin has a word limit, only the final entry-point code is posted here.
if __name__ == '__main__':
    getArea()
    getDetail()
    run()
