A few days ago a friend wanted to crawl Ziroom rental prices near Wangjing, ran into a small problem, and asked me to help analyze it.
1 Analysis
I thought: I have done this kind of thing before, how hard can it be? So I opened a random rental listing page.
Uh... the price should have come back from a separate Ajax request, but instead it is rendered as an image.
Although the price is made up of four `<i>` tags, they all use the same background image. The size is fixed (width: 20px; height: 30px); only the background-position offset differs from digit to digit.
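To make that concrete, here is a rough illustration of the kind of style attribute each digit carries and how the offset can be pulled out. This is my own sketch: the URL is a placeholder and the offset value is illustrative, not copied from a real listing.

```python
import re

# Hypothetical style attribute of one <i> digit (placeholder URL, illustrative offset).
style = ("width: 20px; height: 30px; "
         "background-image: url(//example.invalid/price/sprite.png); "
         "background-position: -62.48px;")

# Only the background-position changes between digits; everything else is fixed.
offset = re.search(r'background-position:\s*(-?[\d.]+)px', style).group(1)
print(offset)  # -> -62.48
```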
Still, that doesn’t bother me. The plan is straightforward:
- Request the listing pages
- Get the price image and each digit's offset
- Crop the image and recognize each digit
- Assemble the price
I also happened to be studying CNN-based image recognition recently, and digits this regular should reach essentially 100% accuracy with just a little training.
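For the cropping step, here is a minimal sketch (mine, not from the original write-up), assuming Pillow is installed, the sprite has been downloaded into imgs/, and each digit window is the fixed 20x30 px region selected by the background offset:

```python
from PIL import Image

CELL_W, CELL_H = 20, 30  # fixed size of each <i> tag


def crop_digit(sprite_path: str, offset_px: float) -> Image.Image:
    """Cut out the digit shown at a given background-position offset."""
    sprite = Image.open(sprite_path)
    # A negative background-position shifts the sprite left, so the visible
    # 20px window starts at abs(offset) within the sprite.
    x = int(round(abs(offset_px)))
    return sprite.crop((x, 0, x + CELL_W, CELL_H))


# e.g. the four tiles of one price, given offsets parsed from its <i> tags:
# tiles = [crop_digit('imgs/sprite.png', o) for o in (-0.0, -31.24, -62.48, -93.72)]
```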
2 In practice
No sooner said than done: first find an entry point, then pull down a batch of pages.
2.1 Get the original page
Search by subway station: Line 15, Wangjing East. Fetch the room list, then handle pagination.
Sample code:
```python
# -*- coding: UTF-8 -*-
import os
import random
import time

import requests
from lxml.etree import HTML

__author__ = 'lpe234'

index_url = 'https://www.ziroom.com/z/s100006-t201081/?isOpen=0'

# index pages that have already been crawled (dedupe)
visited_index_urls = set()


def get_pages(start_url: str):
    """
    Crawl one listing page, visit its detail links, then recurse into pagination.
    :param start_url:
    :return:
    """
    if start_url in visited_index_urls:
        return
    visited_index_urls.add(start_url)

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    resp = requests.get(start_url, headers=headers)
    root = HTML(resp.content.decode('utf-8'))

    # room detail links
    hrefs = root.xpath('//div[@class="Z_list-box"]/div/div[@class="pic-box"]/a/@href')
    for href in hrefs:
        if not href.startswith('http'):
            href = 'https:' + href.strip()
        print(href)
        parse_detail(href)

    # pagination links
    pages = root.xpath('//div[@class="Z_pages"]/a/@href')
    for page in pages:
        if not page.startswith('http'):
            page = 'https:' + page
        get_pages(page)


def parse_detail(detail_url: str):
    """
    Download a detail page and save it under pages/ (the directory must exist).
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    filename = 'pages/' + detail_url.split('/')[-1]
    if os.path.exists(filename):
        return
    time.sleep(random.randint(1, 5))
    resp = requests.get(detail_url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    with open(filename, 'wb+') as page:
        page.write(resp_content.encode())


if __name__ == '__main__':
    get_pages(start_url=index_url)
```
This simple crawl collected about 600 nearby listings.
2.2 Parse the pages to get the price images
The logic is simple: walk through all the pages saved in the previous step, parse out the price image and offsets, and save the images.
Sample code:
```python
# -*- coding: UTF-8 -*-
import os
import re
from urllib.request import urlretrieve

from lxml.etree import HTML

__author__ = 'lpe234'

# all background-position offsets seen across the pages
poss = list()


def walk_pages():
    """
    Walk every saved page under pages/ and parse it.
    :return:
    """
    for dirpath, dirnames, filenames in os.walk('pages'):
        for page in filenames:
            page = os.path.join('pages', page)
            print(page)
            parse_page(page)


def parse_page(page_path: str):
    """
    Extract the price image URL and digit offsets from one saved page.
    """
    with open(page_path, 'rb') as page:
        page_content = ''.join([_.decode('utf-8') for _ in page.readlines()])
    root = HTML(page_content)
    styles = root.xpath('//div[@class="Z_price"]/i/@style')
    pos_re = re.compile(r'background-position:(.*?)px;')
    img_re = re.compile(r'url\((.*?)\);')
    for style in styles:
        style = style.strip()
        print(style)
        pos = pos_re.findall(style)[0]
        img = img_re.findall(style)[0]
        if img.endswith('red.png'):
            continue
        if not img.startswith('http'):
            img = 'http:' + img
        print(f'pos: {pos}, img: {img}')
        save_img(img)
        poss.append(pos)


def save_img(img_url: str):
    # save into imgs/ (the directory must exist), skipping images we already have
    img_name = img_url.split('/')[-1]
    img_path = os.path.join('imgs', img_name)
    if os.path.exists(img_path):
        return
    urlretrieve(img_url, img_path)


if __name__ == '__main__':
    walk_pages()
    print(sorted([float(_) for _ in poss]))
    print(sorted(set([float(_) for _ in poss])))
```
This yields all the price-related image data.
There are 21 images in total: 20 orange ones for regular prices and 1 red one for special prices.
It looks like we don’t need image recognition anymore. We can just map by image name and offset.
2.3 Price analysis
I had meant to write a recognizer, but it didn't feel right. What is there to recognize? It's just a lookup keyed by image name and offset. For example, if an image's digit sequence starts [8, 9, 1, 6, ...], then an offset of -62.48px (the third entry in POS_IDX) maps to the third digit in that sequence, a 1.
Sample code:
```python
# -*- coding: UTF-8 -*-
import re

import requests
from lxml.etree import HTML

__author__ = 'lpe234'

# image name -> the digit drawn in each sprite cell, left to right
PRICE_IMG = {
    '1b68fa980af5e85b0f545fccfe2f8af1.png': [8, 9, 1, 6, 7, 0, 2, 4, 5, 3],
    '4eb5ebda7cc7c3214aebde816b10d204.png': [9, 5, 7, 0, 8, 6, 3, 1, 2, 4],
    '5c6750e29a7aae17288dcadadb5e33b1.png': [4, 5, 9, 3, 1, 6, 2, 8, 7, 0],
    '6f8787069ac0a69b36c8cf13aacb016b.png': [6, 1, 9, 7, 4, 5, 0, 8, 3, 2],
    '7ce54f64c5c0a425872683e3d1df36f4.png': [5, 1, 3, 7, 6, 8, 9, 4, 0, 2],
    '8e7a6d05db4a1eb58ff3c26619f40041.png': [3, 8, 7, 1, 2, 9, 0, 6, 4, 5],
    '73ac03bb4d5857539790bde4d9301946.png': [7, 1, 9, 0, 8, 6, 4, 5, 2, 3],
    '234a22e00c646d0a2c20eccde1bbb779.png': [1, 2, 0, 5, 8, 3, 7, 6, 4, 9],
    '486ff52ed774dbecf6f24855851e3704.png': [4, 7, 8, 0, 1, 6, 9, 2, 5, 3],
    '19003aac664523e53cc502b54a50d2b6.png': [4, 9, 2, 8, 7, 3, 0, 6, 5, 1],
    '93959ce492a74b6617ba8d4e5e195a1d.png': [5, 4, 3, 0, 8, 7, 9, 6, 2, 1],
    '7995074a73302d345088229b960929e9.png': [0, 7, 4, 2, 1, 3, 8, 6, 5, 9],
    '939205287b8e01882b89273e789a77c5.png': [8, 0, 1, 5, 7, 3, 9, 6, 2, 4],
    '477571844175c1058ece4cee45f5c4b3.png': [2, 1, 5, 8, 0, 9, 7, 4, 3, 6],
    'a822d494f1e8421a2fb2ec5e6450a650.png': [3, 1, 6, 5, 8, 4, 9, 7, 2, 0],
    'a68621a4bca79938c464d8d728644642.png': [7, 0, 3, 4, 6, 1, 5, 9, 8, 2],
    'b2451cc91e265db2a572ae750e8c15bd.png': [9, 1, 6, 2, 8, 5, 3, 4, 7, 0],
    'bdf89da0338b19fbf594c599b177721c.png': [3, 1, 6, 4, 7, 9, 5, 2, 8, 0],
    'de345d4e39fa7325898a8fd858addbb8.png': [7, 2, 6, 3, 8, 4, 0, 1, 9, 5],
    'eb0d3275f3c698d1ac304af838d8bbf0.png': [3, 6, 5, 0, 4, 8, 9, 2, 1, 7],
    'img_pricenumber_detail_red.png': [6, 1, 9, 7, 4, 5, 0, 8, 3, 2],
}
# background-position offsets (px) of the 10 sprite cells
POS_IDX = [0.0, -31.24, -62.48, -93.72, -124.96, -156.2, -187.44, -218.68, -249.92, -281.16]


def parse_price(img: str, pos_list: list):
    price_list = PRICE_IMG.get(img)
    if not price_list:
        raise Exception('img not found. %s' % img)
    step = 1
    price = 0
    _pos_list = reversed(pos_list)
    for pos in _pos_list:
        price += price_list[POS_IDX.index(float(pos))] * step
        step *= 10
    return price


def parse_page(content: str):
    root = HTML(content)
    styles = root.xpath('//div[@class="Z_price"]/i/@style')
    pos_re = re.compile(r'background-position:(.*?)px;')
    pos_img = re.findall(r'price/(.*?)\);', styles[0])[0]
    poss = list()
    for style in styles:
        style = style.strip()
        pos = pos_re.findall(style)[0]
        poss.append(pos)
    print(pos_img)
    print(poss)
    return parse_price(pos_img, poss)


def request_page(url: str):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    resp = requests.get(url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    return resp_content
```
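A hypothetical usage of the two helpers above; the detail URL below is a placeholder, not a real listing:

```python
if __name__ == '__main__':
    # Placeholder listing URL; substitute a real Ziroom detail page.
    detail_url = 'https://www.ziroom.com/x/000000.html'
    print(parse_page(request_page(detail_url)))
```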
For convenience, I wrapped this up as a small web service. Test endpoint: https://lemon.lpe234.xyz/common/ziru/
3 Summary
I had thought I could show off my newly learned skills in front of my friend, but in the end there was no use for them. Then I wondered: since the site has already gone to the trouble of this scheme, why not throw in a few more images and actually raise the difficulty?
Still, if I really want to show off in front of my friend, I will have to bring out the CNN after all.
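If it ever comes to that, here is a minimal sketch (mine, not part of the solution above) of the kind of network that could classify the 20x30 digit tiles, assuming TensorFlow/Keras is installed and a small labeled set of cropped tiles is available:

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_digit_cnn() -> tf.keras.Model:
    """Tiny CNN mapping a 30x20 grayscale digit tile to one of 10 classes."""
    model = models.Sequential([
        layers.Input(shape=(30, 20, 1)),          # height x width x channels
        layers.Conv2D(16, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax'),   # digits 0-9
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model


# model = build_digit_cnn()
# model.fit(x_train, y_train, epochs=5)  # x_train: (N, 30, 20, 1), y_train: digit labels
```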