After a few days of learning how to write a crawler in Python, let's write a small one to check the results. This article is aimed at complete beginners; enjoy~

This article is for learning and communication only. Do not use it for illegal purposes!!

I. Web page analysis

We'll crawl second-hand housing listings in Shijiazhuang. First, open the link https://sjz.ke.com/ershoufang/. Without any filter criteria, a total of 42,817 listings are shown. Click to the second page and the link becomes https://sjz.ke.com/ershoufang/pg2/, so pg{i} encodes the page number, with i being the page. By the math Latiao-jun's PE teacher taught: 30 listings per page, at most 100 pages, so we can crawl at most 3,000 listings, which is still far from the 40,000-plus shown above. So let's manually change the i in pg{i} and hit Enter:

https://sjz.ke.com/ershoufang/pg200/

https://sjz.ke.com/ershoufang/pg300/

Both requests return the same data, namely the listings from page 100, which leads to a conclusion: on Beike you can view at most 3,000 listings under any given condition. Sigh, we can only buy 3,000 at most; the feeling of money that can't be spent is really uncomfortable~ (runs away :)~~
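A quick sanity check of that claim (just a sketch; it borrows the .sellListContent and .title selectors that the parsing section below relies on):

import requests
from pyquery import PyQuery as pq

def first_title(page):
    html = requests.get(f'https://sjz.ke.com/ershoufang/pg{page}/').text
    # Take the title of the first listing on the page
    return pq(html)('.sellListContent .title a').eq(0).text()

# Pages past 100 should all echo page 100's listings
print(first_title(100) == first_title(200) == first_title(300))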

So let's add some filter conditions, for example "owned for five years and the seller's only home" plus "two bedrooms". After sending the request, the link becomes https://sjz.ke.com/ershoufang/pg2mw1l2/, so mw1l2 must encode the filter conditions. Now only 2,399 listings match. OK, that's what we'll crawl.
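So the URL pattern is ershoufang/pg{page}{filters}/. A hypothetical little helper to compose them (listing_url is my own name, not anything from the site):

def listing_url(page, filters='mw1l2'):
    # Compose page number and filter code into a Beike listing URL
    return f'https://sjz.ke.com/ershoufang/pg{page}{filters}/'

print(listing_url(2))  # https://sjz.ke.com/ershoufang/pg2mw1l2/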

II. Roll up your sleeves and write code

Small as a sparrow is, it has all the vital organs. This crawler has three parts: crawling, parsing, and storage.

Crawling

Crawling uses the requests library, which is much easier to use than urllib, Python's built-in library.

import requests

def get_a_page(url):
    # Request one listing page and print the raw HTML
    result = requests.get(url)
    print(result.text)

if __name__ == '__main__':
    for i in range(1, 101):
        get_a_page(f'https://sjz.ke.com/ershoufang/pg{i}mw1l2/')

Run the for loop and print the returned data; no problems found. In fact, I only looped up to 81, since as we know there are fewer than 2,400 listings.
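The page arithmetic, for reference:

import math

# 2,399 matching listings at 30 per page:
print(math.ceil(2399 / 30))  # 80, so range(1, 81) already covers every page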

Parsing

Parsing is done with PyQuery, a library whose API is similar to jQuery. The complete API reference is at https://pythonhosted.org/pyquery/api.html. There is also the bs4 (BeautifulSoup) parsing library, which I'll try next time.

As shown in the figure, each listing is a div inside a ul; reading those divs gets us the data we want.

import requests
from pyquery import PyQuery as pq
import json

def get_a_page(url):
    result = requests.get(url)
    doc = pq(result.text)
    # Each page keeps its listings in <ul class="sellListContent">
    ul = doc('.sellListContent')
    divs = ul.children('.clear .info.clear').items()
    count = 0
    for div in divs:
        count += 1
        title = div.children('.title a').text()
        place = div.children('.address .flood .positionInfo a').text()
        msg = div.children('.address .houseInfo').text()
        price = div.children('.address .priceInfo .totalPrice span').text()
        # data-price holds the bare number; the tag text also contains "元/平米"
        per_meter = div.children('.address .priceInfo .unitPrice').attr('data-price')
        info = {
            'title': title,
            'place': place,
            'msg': msg,
            'price': price,
            'per_meter': per_meter
        }
        print(str(count) + ':' + json.dumps(info, ensure_ascii=False))

As shown above, PyQuery's children method searches only direct child tags, while find searches all descendants; since everything we need is just one generation down, children is enough. text() extracts the text contained in a tag. attr fetches an attribute value; per_meter is easier to take from the data-price attribute, because the text inside the tag also contains "元/平米" (yuan per square meter).
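A minimal demonstration of the children/find difference, using made-up HTML:

from pyquery import PyQuery as pq

doc = pq('<ul><li class="a"><span class="b">hi</span></li></ul>')
print(doc('ul').children('.b').text())  # '' : .b is not a direct child of ul
print(doc('ul').find('.b').text())      # 'hi': find searches all descendants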

Storage

This time we'll save the data straight to CSV, a file format that Excel can open, using the pandas library.
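The core of it is DataFrame.to_csv in append mode; a minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'title': ['some house'], 'price': ['120']})
# mode='a' appends, so each crawled page adds its rows to the same file
df.to_csv('sjz.csv', mode='a', index=False, header=False)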

The complete code is as follows:

import requests
from pyquery import PyQuery as pq
import json
import pandas as pd

columns = ['title', 'place', 'msg', 'price', 'per_meter']

# Crawl one listing page and append its rows to the CSV
def get_a_page(url):
    result = requests.get(url)
    doc = pq(result.text)
    ul = doc('.sellListContent')
    divs = ul.children('.clear .info.clear').items()
    count = 0
    titles = []
    places = []
    msgs = []
    prices = []
    per_meters = []
    for div in divs:
        count += 1
        title = div.children('.title a').text()
        place = div.children('.address .flood .positionInfo a').text()
        msg = div.children('.address .houseInfo').text()
        price = div.children('.address .priceInfo .totalPrice span').text()
        per_meter = div.children('.address .priceInfo .unitPrice').attr('data-price')
        info = {
            'title': title,
            'place': place,
            'msg': msg,
            'price': price,
            'per_meter': per_meter
        }
        titles.append(title)
        places.append(place)
        msgs.append(msg)
        prices.append(price)
        per_meters.append(per_meter)
        print(str(count) + ':' + json.dumps(info, ensure_ascii=False))
    datas = {
        'title': titles,
        'place': places,
        'msg': msgs,
        'price': prices,
        'per_meter': per_meters
    }
    df = pd.DataFrame(data=datas, columns=columns)
    df.to_csv('sjz.csv', mode='a', index=False, header=False)

if __name__ == '__main__':
    for i in range(1, 101):
        get_a_page(f'https://sjz.ke.com/ershoufang/pg{i}mw1l2/')
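One thing to note: with header=False the CSV never gets a header row. A sketch of one way to handle it, writing the header once before crawling starts:

import pandas as pd

# Write only the header row; get_a_page then appends data rows beneath it
pd.DataFrame(columns=['title', 'place', 'msg', 'price', 'per_meter']) \
    .to_csv('sjz.csv', mode='w', index=False)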

Multiple processes

Because get_a_page has to run 100 times, which is a little slow, let's use multiple processes to speed things up. Feel free to copy this part of the code directly.

Change the main block to the following:

from multiprocessing.pool import Pool

if __name__ == '__main__':
    pool = Pool(5)  # 5 worker processes
    group = [f'https://sjz.ke.com/ershoufang/pg{x}mw1l2/' for x in range(1, 101)]
    pool.map(get_a_page, group)
    pool.close()
    pool.join()
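One caveat: five processes all appending to sjz.csv can in principle interleave their writes. A sketch of an alternative, assuming get_a_page is changed to return its DataFrame instead of writing it:

from multiprocessing.pool import Pool
import pandas as pd

if __name__ == '__main__':
    urls = [f'https://sjz.ke.com/ershoufang/pg{x}mw1l2/' for x in range(1, 101)]
    with Pool(5) as pool:
        frames = pool.map(get_a_page, urls)  # each call returns a DataFrame
    pd.concat(frames).to_csv('sjz.csv', index=False)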

III. The end

Let's check the result:

The result looks fine. Some people may ask: why not split the msg field and store its parts separately, such as floor, rooms, building age, and so on? I did that at first, but it turned out the items in msg are not mandatory; some owners did not fill in the building date or floor, so I simply stored the whole string.

That's the end of Latiao-jun's first crawler. Simple as it is, finishing it still gives a little satisfaction. I'll keep learning about crawlers and write more posts. Folks, give it a thumbs up before you go~