A few days ago, a friend wanted to look up rental prices near Wangjing. He ran into some problems and asked me to help analyze them.
1 Analysis
I thought: I've done this kind of thing before, how hard could it be? So I opened a random rental listing page.
Uh... the price has been turned into an image. I had expected a separate Ajax request to return the price data.
Looking closer: the price is made up of four <i> tags that share the same background sprite. Each tag is a fixed 20px × 30px window; only the background-position offset of the image changes.
But that doesn't stop me. The plan:
- Request the page
- Extract the image URL and the price offsets
- Slice the image and recognize each digit
- Assemble the price
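The second step boils down to pulling the sprite URL and x-offset out of each <i> tag's inline style with two regexes. A minimal sketch; the style string here is a made-up example in the shape the site uses:

```python
import re

# Hypothetical inline style in the shape the site emits on its price <i> tags
style = ('width:20px;height:30px;'
         'background-image:url(//static8.ziroom.com/price/abc123.png);'
         'background-position:-62.48px center;')

img_re = re.compile(r'url\((.*?)\)')                  # sprite URL
pos_re = re.compile(r'background-position:(-?[\d.]+)px')  # x-offset in px

img = img_re.findall(style)[0]
pos = float(pos_re.findall(style)[0])
print(img, pos)  # //static8.ziroom.com/price/abc123.png -62.48
```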
I had just been studying CNN-based image recognition, and with digits this clean, a little training should push the recognition rate to 100%.
2 Hands-on
Just do it: pick an entry point and grab a batch of pages.
2.1 Obtaining the original Web page
Click Metro directly, pick Line 15 Wangjing East to get the room list, then handle pagination.
Sample code:
# -*- coding: UTF-8 -*-
import os
import time
import random

import requests
from lxml.etree import HTML

__author__ = 'lpe234'

index_url = 'https://www.ziroom.com/z/s100006-t201081/?isOpen=0'
visited_index_urls = set()


def get_pages(start_url: str):
    """
    Crawl a listing page, then recurse into its pagination links.
    :param start_url:
    :return:
    """
    # deduplicate: skip pages we have already visited
    if start_url in visited_index_urls:
        return
    visited_index_urls.add(start_url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    resp = requests.get(start_url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    root = HTML(resp_content)
    # parse the room links on the current page
    hrefs = root.xpath('//div[@class="Z_list-box"]/div/div[@class="pic-box"]/a/@href')
    for href in hrefs:
        if not href.startswith('http'):
            href = 'http:' + href.strip()
        print(href)
        parse_detail(href)
    # parse pagination links and recurse
    pages = root.xpath('//div[@class="Z_pages"]/a/@href')
    for page in pages:
        if not page.startswith('http'):
            page = 'http:' + page
        get_pages(page)


def parse_detail(detail_url: str):
    """
    Fetch a detail page and save it under pages/.
    :param detail_url:
    :return:
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    filename = 'pages/' + detail_url.split('/')[-1]
    if os.path.exists(filename):
        return
    # random pause of 1-5 seconds
    time.sleep(random.randint(1, 5))
    resp = requests.get(detail_url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    with open(filename, 'wb+') as page:
        page.write(resp_content.encode())


if __name__ == '__main__':
    get_pages(start_url=index_url)
This simple crawl collected the nearby listings, about 600 in total.
2.2 Parsing web pages to obtain pictures
The logic is simple: go through all the previously acquired web pages, parse the price images and store them.
Sample code:
# -*- coding: UTF-8 -*-
import os
import re
from urllib.request import urlretrieve

from lxml.etree import HTML

__author__ = 'lpe234'

poss = list()


def walk_pages():
    """
    Traverse all saved pages under pages/.
    :return:
    """
    for dirpath, dirnames, filenames in os.walk('pages'):
        for page in filenames:
            page = os.path.join('pages', page)
            print(page)
            parse_page(page)


def parse_page(page_path: str):
    """
    Parse one saved page: extract the price sprite and its offsets.
    :param page_path:
    :return:
    """
    with open(page_path, 'rb') as page:
        page_content = ''.join([_.decode('utf-8') for _ in page.readlines()])
    root = HTML(page_content)
    styles = root.xpath('//div[@class="Z_price"]/i/@style')
    pos_re = re.compile(r'background-position:(.*?)px;')
    img_re = re.compile(r'url\((.*?)\);')
    for style in styles:
        style = style.strip()
        print(style)
        pos = pos_re.findall(style)[0]
        img = img_re.findall(style)[0]
        if img.endswith('red.png'):
            continue
        if not img.startswith('http'):
            img = 'http:' + img
        print(f'pos: {pos}, img: {img}')
        save_img(img)
        poss.append(pos)


def save_img(img_url: str):
    img_name = img_url.split('/')[-1]
    img_path = os.path.join('imgs', img_name)
    if os.path.exists(img_path):
        return
    urlretrieve(img_url, img_path)


if __name__ == '__main__':
    walk_pages()
    print(sorted([float(_) for _ in poss]))
    print(sorted(set([float(_) for _ in poss])))
This finally yields the price image data.
There are 21 images in total: 20 orange ones for ordinary prices, plus one red image for special-offer listings.
It looks like image recognition isn't needed after all; a mapping of image name plus offset is enough.
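In other words, each sprite is a strip of ten digits in a shuffled order, and the offset selects a slot. A toy demonstration (the digit order is taken from one real sprite in section 2.3; the 31.24px slot width is what the collected offsets imply):

```python
# One sprite's shuffled digit order and the slot width the offsets imply.
digit_order = [8, 9, 1, 6, 7, 0, 2, 4, 5, 3]
slot_width = 31.24  # px per digit slot


def digit_at(offset: float) -> int:
    """Map a background-position x-offset to the digit shown in that window."""
    return digit_order[round(-offset / slot_width)]


# offsets for a three-digit price, left to right
offsets = [-31.24, -0.0, -62.48]
price = 0
for off in offsets:
    price = price * 10 + digit_at(off)
print(price)  # 981
```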
2.3 Price analysis
I was going to call this "recognition", but that isn't right. It's not recognition at all, just an image-name + offset lookup.
Sample code:
# -*- coding: UTF-8 -*-
import re

import requests
from lxml.etree import HTML

__author__ = 'lpe234'

# image name -> the shuffled digit order in that sprite (left to right)
PRICE_IMG = {
    '1b68fa980af5e85b0f545fccfe2f8af1.png': [8, 9, 1, 6, 7, 0, 2, 4, 5, 3],
    '4eb5ebda7cc7c3214aebde816b10d204.png': [9, 5, 7, 0, 8, 6, 3, 1, 2, 4],
    '5c6750e29a7aae17288dcadadb5e33b1.png': [4, 5, 9, 3, 1, 6, 2, 8, 7, 0],
    '6f8787069ac0a69b36c8cf13aacb016b.png': [6, 1, 9, 7, 4, 5, 0, 8, 3, 2],
    '7ce54f64c5c0a425872683e3d1df36f4.png': [5, 1, 3, 7, 6, 8, 9, 4, 0, 2],
    '8e7a6d05db4a1eb58ff3c26619f40041.png': [3, 8, 7, 1, 2, 9, 0, 6, 4, 5],
    '73ac03bb4d5857539790bde4d9301946.png': [7, 1, 9, 0, 8, 6, 4, 5, 2, 3],
    '234a22e00c646d0a2c20eccde1bbb779.png': [1, 2, 0, 5, 8, 3, 7, 6, 4, 9],
    '486ff52ed774dbecf6f24855851e3704.png': [4, 7, 8, 0, 1, 6, 9, 2, 5, 3],
    '19003aac664523e53cc502b54a50d2b6.png': [4, 9, 2, 8, 7, 3, 0, 6, 5, 1],
    '93959ce492a74b6617ba8d4e5e195a1d.png': [5, 4, 3, 0, 8, 7, 9, 6, 2, 1],
    '7995074a73302d345088229b960929e9.png': [0, 7, 4, 2, 1, 3, 8, 6, 5, 9],
    '939205287b8e01882b89273e789a77c5.png': [8, 0, 1, 5, 7, 3, 9, 6, 2, 4],
    '477571844175c1058ece4cee45f5c4b3.png': [2, 1, 5, 8, 0, 9, 7, 4, 3, 6],
    'a822d494f1e8421a2fb2ec5e6450a650.png': [3, 1, 6, 5, 8, 4, 9, 7, 2, 0],
    'a68621a4bca79938c464d8d728644642.png': [7, 0, 3, 4, 6, 1, 5, 9, 8, 2],
    'b2451cc91e265db2a572ae750e8c15bd.png': [9, 1, 6, 2, 8, 5, 3, 4, 7, 0],
    'bdf89da0338b19fbf594c599b177721c.png': [3, 1, 6, 4, 7, 9, 5, 2, 8, 0],
    'de345d4e39fa7325898a8fd858addbb8.png': [7, 2, 6, 3, 8, 4, 0, 1, 9, 5],
    'eb0d3275f3c698d1ac304af838d8bbf0.png': [3, 6, 5, 0, 4, 8, 9, 2, 1, 7],
    'img_pricenumber_detail_red.png': [6, 1, 9, 7, 4, 5, 0, 8, 3, 2],
}
# x-offsets of the ten digit slots, in px
POS_IDX = [-0.0, -31.24, -62.48, -93.72, -124.96, -156.2, -187.44, -218.68, -249.92, -281.16]


def parse_price(img: str, pos_list: list):
    price_list = PRICE_IMG.get(img)
    if not price_list:
        raise Exception(f'img not found: {img}')
    step = 1
    price = 0
    # walk the digits right to left, multiplying the place value by 10 each step
    for pos in reversed(pos_list):
        price += price_list[POS_IDX.index(float(pos))] * step
        step *= 10
    return price


def parse_page(content: str):
    root = HTML(content)
    styles = root.xpath('//div[@class="Z_price"]/i/@style')
    pos_re = re.compile(r'background-position:(.*?)px;')
    pos_img = re.findall(r'price/(.*?)\);', styles[0])[0]
    poss = list()
    for style in styles:
        style = style.strip()
        pos = pos_re.findall(style)[0]
        poss.append(pos)
    print(pos_img)
    print(poss)
    return parse_price(pos_img, poss)


def request_page(url: str):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    resp = requests.get(url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    return resp_content
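The full pipeline needs live pages, but it can be sanity-checked offline by feeding a hand-built HTML fragment in the site's shape through the same XPath + regex + lookup steps. The sprite name and digit order below come from the PRICE_IMG table above; the markup itself is fabricated for the check:

```python
import re
from lxml.etree import HTML

# digit order for sprite '1b68fa980af5e85b0f545fccfe2f8af1.png' (from PRICE_IMG)
digit_order = [8, 9, 1, 6, 7, 0, 2, 4, 5, 3]
POS_IDX = [-0.0, -31.24, -62.48, -93.72, -124.96, -156.2, -187.44, -218.68, -249.92, -281.16]

# fabricated markup in the shape the rental site renders
html = '''
<div class="Z_price">
  <i style="background-position:-31.24px;background-image:url(//static8.ziroom.com/price/1b68fa980af5e85b0f545fccfe2f8af1.png);"></i>
  <i style="background-position:-0.0px;background-image:url(//static8.ziroom.com/price/1b68fa980af5e85b0f545fccfe2f8af1.png);"></i>
</div>
'''
root = HTML(html)
styles = root.xpath('//div[@class="Z_price"]/i/@style')
pos_re = re.compile(r'background-position:(-?[\d.]+)px')
price = 0
for style in styles:
    pos = float(pos_re.findall(style)[0])
    price = price * 10 + digit_order[POS_IDX.index(pos)]
print(price)  # 98
```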
For convenience, I've wrapped it as a service. Test endpoint ==> http://lemon.lpe234.xyz/common/ziru/
3 Summary
I thought I'd get to show my friend some new skills, but it fizzled. Then it occurred to me: if the scheme is this easy to break, why doesn't the site rotate more images to make it harder?
Still, to really impress friends, the CNN version is worth building anyway; otherwise nothing new gets learned.