

A few days ago, a friend of mine wanted to check rental prices near Wangjing. He ran into some problems and asked me to help analyze them.

1 Analysis

I thought: I've done this before, how hard could it be? So I opened a random listing page.

Uh oh... the price turned out to be rendered as an image. I had expected a separate Ajax request returning the price information.

Although the price is made up of four <i> tags, they all share the same background image. The width (20px) and height (30px) are fixed; each tag only changes the background-position offset into that image.
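In other words, each tag is a fixed-size window onto a shared digit sprite, and the offset decides which digit shows through. A minimal sketch of the arithmetic, assuming the 31.24px per-digit step observed later in section 2.3 (the style values here are illustrative, not copied from the site):

# One <i> tag carries something like:
#   width: 20px; height: 30px; background-position: -62.48px center;
STEP = 31.24                  # horizontal distance between digits in the sprite
offset = -62.48               # taken from background-position
slot = round(-offset / STEP)  # -> 2: this tag shows the third digit drawn in the sprite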

But that doesn't stop me. The plan:

  • Request the listing pages
  • Extract the price image and each digit's offset
  • Slice the image up for digit recognition (sketched below)
  • Assemble the price data

I also happened to be studying CNN-based image recognition recently. With digits this clean, a little training should easily push the recognition rate to 100%.
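For the slicing step in the plan above, a minimal sketch might look like this (Pillow and the crop box are my assumptions, based on the fixed 20x30 window described earlier; this is not code from the original workflow):

from PIL import Image


def crop_digit(sprite_path: str, offset: float) -> Image.Image:
    # background-position is negative, so the visible window starts at -offset
    left = int(-offset)
    sprite = Image.open(sprite_path)
    # return the 20x30 region that a single <i> tag displays
    return sprite.crop((left, 0, left + 20, 30))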

2 Hands-on

Just do it: find an entry point, then grab a batch of pages.

2.1 Obtaining the original web pages

Filter by metro directly: pick Wangjing East on Line 15, fetch the room list, then handle pagination.

Sample code:

# -*- coding: UTF-8 -*-

import os
import time
import random

import requests
from lxml.etree import HTML

__author__ = 'lpe234'


index_url = 'https://www.ziroom.com/z/s100006-t201081/?isOpen=0'
visited_index_urls = set()


def get_pages(start_url: str):
    """
    Crawl a listing page, then recurse through its pagination.
    :param start_url:
    :return:
    """
    # deduplicate: skip already-visited listing pages
    if start_url in visited_index_urls:
        return
    visited_index_urls.add(start_url)

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    resp = requests.get(start_url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    root = HTML(resp_content)
    # parse the listing links on the current page
    hrefs = root.xpath('//div[@class="Z_list-box"]/div/div[@class="pic-box"]/a/@href')
    for href in hrefs:
        if not href.startswith('http'):
            href = 'http:' + href.strip()
        print(href)
        parse_detail(href)
    # parse pagination links and recurse
    pages = root.xpath('//div[@class="Z_pages"]/a/@href')
    for page in pages:
        if not page.startswith('http'):
            page = 'http:' + page
            get_pages(page)


def parse_detail(detail_url: str):
    """
    Fetch a detail page and save it to disk.
    :param detail_url:
    :return:
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    filename = 'pages/' + detail_url.split('/')[-1]
    if os.path.exists(filename):
        return
    # random pause of 1-5 seconds
    time.sleep(random.randint(1, 5))
    resp = requests.get(detail_url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    with open(filename, 'wb+') as page:
        page.write(resp_content.encode())


if __name__ == '__main__':
    os.makedirs('pages', exist_ok=True)  # make sure the output directory exists
    get_pages(start_url=index_url)


A quick crawl captured all the nearby listings, about 600 in total.

2.2 Parsing web pages to obtain pictures

The logic is simple: walk through all the pages fetched earlier, parse out the price images, and save them.

Sample code:

# -*- coding: UTF-8 -*-

import os
import re
from urllib.request import urlretrieve

from lxml.etree import HTML

__author__ = 'lpe234'


poss = list()


def walk_pages():
    """
    Traverse all saved pages under pages/.
    :return:
    """
    for dirpath, dirnames, filenames in os.walk('pages'):
        for page in filenames:
            page = os.path.join('pages', page)
            print(page)
            parse_page(page)


def parse_page(page_path: str):
    """
    Parse one saved page: extract the price image and digit offsets.
    :param page_path:
    :return:
    """
    with open(page_path, 'rb') as page:
        page_content = ' '.join([_.decode('utf-8') for _ in page.readlines()])
        root = HTML(page_content)
        styles = root.xpath('//div[@class="Z_price"]/i/@style')
        pos_re = re.compile(r'background-position:(.*?)px;')
        img_re = re.compile(r'url\((.*?)\);')
        for style in styles:
            style = style.strip()
            print(style)
            pos = pos_re.findall(style)[0]
            img = img_re.findall(style)[0]
            if img.endswith('red.png'):
                continue
            if not img.startswith('http'):
                img = 'http:' + img
            print(f'pos: {pos}, img: {img}')
            save_img(img)
            poss.append(pos)


def save_img(img_url: str):
    img_name = img_url.split('/')[-1]
    img_path = os.path.join('imgs', img_name)
    if os.path.exists(img_path):
        return
    os.makedirs('imgs', exist_ok=True)  # make sure the output directory exists
    urlretrieve(img_url, img_path)


if __name__ == '__main__':
    walk_pages()
    print(sorted([float(_) for _ in poss]))
    print(sorted(set([float(_) for _ in poss])))

This finally yields the price image data.

There are 21 images in total: 20 orange ones for regular prices, plus one red one for special-offer listings.

It looks like image recognition isn't needed after all; an image-name + offset mapping is enough.

2.3 Price parsing

I was going to call this step "recognition", but that doesn't feel right. It isn't recognition; it's just an image-name + offset lookup.

Sample code:

# -*- coding: UTF-8 -*-

import re

from lxml.etree import HTML
import requests

__author__ = 'lpe234'


PRICE_IMG = {
    '1b68fa980af5e85b0f545fccfe2f8af1.png': [8, 9, 1, 6, 7, 0, 2, 4, 5, 3],
    '4eb5ebda7cc7c3214aebde816b10d204.png': [9, 5, 7, 0, 8, 6, 3, 1, 2, 4],
    '5c6750e29a7aae17288dcadadb5e33b1.png': [4, 5, 9, 3, 1, 6, 2, 8, 7, 0],
    '6f8787069ac0a69b36c8cf13aacb016b.png': [6, 1, 9, 7, 4, 5, 0, 8, 3, 2],
    '7ce54f64c5c0a425872683e3d1df36f4.png': [5, 1, 3, 7, 6, 8, 9, 4, 0, 2],
    '8e7a6d05db4a1eb58ff3c26619f40041.png': [3, 8, 7, 1, 2, 9, 0, 6, 4, 5],
    '73ac03bb4d5857539790bde4d9301946.png': [7, 1, 9, 0, 8, 6, 4, 5, 2, 3],
    '234a22e00c646d0a2c20eccde1bbb779.png': [1, 2, 0, 5, 8, 3, 7, 6, 4, 9],
    '486ff52ed774dbecf6f24855851e3704.png': [4, 7, 8, 0, 1, 6, 9, 2, 5, 3],
    '19003aac664523e53cc502b54a50d2b6.png': [4, 9, 2, 8, 7, 3, 0, 6, 5, 1],
    '93959ce492a74b6617ba8d4e5e195a1d.png': [5, 4, 3, 0, 8, 7, 9, 6, 2, 1],
    '7995074a73302d345088229b960929e9.png': [0, 7, 4, 2, 1, 3, 8, 6, 5, 9],
    '939205287b8e01882b89273e789a77c5.png': [8, 0, 1, 5, 7, 3, 9, 6, 2, 4],
    '477571844175c1058ece4cee45f5c4b3.png': [2, 1, 5, 8, 0, 9, 7, 4, 3, 6],
    'a822d494f1e8421a2fb2ec5e6450a650.png': [3, 1, 6, 5, 8, 4, 9, 7, 2, 0],
    'a68621a4bca79938c464d8d728644642.png': [7, 0, 3, 4, 6, 1, 5, 9, 8, 2],
    'b2451cc91e265db2a572ae750e8c15bd.png': [9, 1, 6, 2, 8, 5, 3, 4, 7, 0],
    'bdf89da0338b19fbf594c599b177721c.png': [3, 1, 6, 4, 7, 9, 5, 2, 8, 0],
    'de345d4e39fa7325898a8fd858addbb8.png': [7, 2, 6, 3, 8, 4, 0, 1, 9, 5],
    'eb0d3275f3c698d1ac304af838d8bbf0.png': [3, 6, 5, 0, 4, 8, 9, 2, 1, 7],
    'img_pricenumber_detail_red.png': [6, 1, 9, 7, 4, 5, 0, 8, 3, 2]
}

POS_IDX = [-0.0, -31.24, -62.48, -93.72, -124.96, -156.2, -187.44, -218.68, -249.92, -281.16]
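# (The index of an offset in POS_IDX is the digit's slot in the sprite image;
#  consecutive digits sit 31.24px apart.)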


def parse_price(img: str, pos_list: list):
    price_list = PRICE_IMG.get(img)
    if not price_list:
        raise Exception('img not found: %s' % img)
    step = 1
    price = 0
    # walk the offsets from the last digit (ones place) backwards
    _pos_list = reversed(pos_list)
    for pos in _pos_list:
        price += price_list[POS_IDX.index(float(pos))] * step
        step *= 10
    return price
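# Worked example (offsets hypothetical; digits come from PRICE_IMG above):
#   parse_price('4eb5ebda7cc7c3214aebde816b10d204.png',
#               ['-0.0', '-31.24', '-62.48', '-93.72'])
# reversed offsets map to slots 3, 2, 1, 0 -> digits 0, 7, 5, 9 -> 9570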


def parse_page(content: str):
    root = HTML(content)
    styles = root.xpath('//div[@class="Z_price"]/i/@style')
    pos_re = re.compile(r'background-position:(.*?)px;')
    pos_img = re.findall(r'price/(.*?)\);', styles[0])[0]
    poss = list()
    for style in styles:
        style = style.strip()
        pos = pos_re.findall(style)[0]
        poss.append(pos)
    print(pos_img)
    print(poss)
    return parse_price(pos_img, poss)


def request_page(url: str):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    resp = requests.get(url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    return resp_content
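

if __name__ == '__main__':
    # Hypothetical end-to-end check (the detail URL is illustrative):
    page_content = request_page('https://www.ziroom.com/x/123456.html')
    print(parse_page(page_content))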

For convenience, I've wrapped this up as an API service. Test endpoint: http://lemon.lpe234.xyz/common/ziru/

3 Summary

I thought I'd get to show my friend some new skills, but I miscalculated. Then it occurred to me: if the site can do this, what's stopping it from generating more images and rotating them to make this much harder?

Still, to impress my friend properly, the CNN route is worth doing anyway; otherwise I'll have learned nothing from this.