Overview
- Preface
- Statistical results
- Crawler analysis
- Crawler code implementation
- Data analysis implementation
- Afterword
Preface
Rents in second-tier cities have risen recently, but by how much overall? We don't know, so to find out, Zone used Python to crawl Shenzhen rental data from Fang x. Here is the sample data for this round:
Statistical results
Let's look at the statistical results first and then at the technical analysis. Housing distribution in Shenzhen (by district): Futian and Nanshan have the most listings, but rents in these two districts are also quite high.
Unit rent (per square meter per month, average): this is the price of 1 square meter per month. In the chart, the bigger the rectangle, the higher the price.
Futian and Nanshan rank first, at 114.874 and 113.483 respectively, several times the figure for other districts. If you rent a 20-square-meter room in Futian:
114.874 x 20 = 2297.48
Add 200 yuan for water, electricity, and property fees:
2297.48 + 200 = 2497.48
Let's be frugal: 10 yuan for breakfast, 25 yuan for lunch, and 25 yuan for dinner, which comes to 60 yuan a day:
2497.48 + 60 x 30 = 4297.48
Yes, 4,297.48 yuan just to survive. Add the occasional restaurant meal, some new clothes every month, transportation, and dating and shopping with a girlfriend, and another 3,500 is easily gone:
4297.48 + 3500 = 7797.48
One thousand each for Mom and Dad:
7797.48 + 2000 = 9797.48
Ten thousand a month is easily gone, and you join the "moonlight clan", the people who spend their whole paycheck every month.
Unit rent (per square meter per day, average)
This is the price of 1 square meter per day.
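The rent arithmetic above is easy to reproduce. The sketch below uses the Futian figure quoted earlier; the function names are illustrative, and the 30-day month is an assumption that matches the per-day code later in the article.

```python
def monthly_rent(unit_price, area_m2):
    """Monthly rent in yuan, given yuan per square meter per month."""
    return round(unit_price * area_m2, 2)


def daily_unit_price(unit_price, days_per_month=30):
    """Convert a monthly unit price to a per-day unit price."""
    return round(unit_price / days_per_month, 3)


futian = 114.874  # yuan per square meter per month, from the statistics above
print(monthly_rent(futian, 20))   # rent for a 20-square-meter room
print(daily_unit_price(futian))   # yuan per square meter per day
```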
Apartment layout
The most common layouts are three bedrooms with two living rooms and two bedrooms with two living rooms. Renting a place together with friends is the best choice; otherwise, all sorts of uncomfortable things can happen when you share with strangers. In the chart, the larger the font, the more listings there are of that layout.
Rental area statistics
Listings of 30-90 square meters account for the majority. These days, the only option is for a few friends to rent a place together and huddle for warmth.
Word cloud of rental descriptions
This is a word cloud of the rental descriptions: the larger the font, the more often the term appears. [Fine decoration] takes up a large share, which suggests that long-term rental apartments also hold a large part of the market.
Crawler approach
First, crawl the listings for each district of Shenzhen on Fang x, then store them in a MongoDB database, and finally analyze the data.
Part of the database data:
/* 1 */
{
    "_id" : ObjectId("5b827d5e8a4c184e63fb1325"),
    "traffic" : "It is about 567 meters away from shajing Electronic City bus station.", // traffic description
    "address" : "Bao 'an - Shajing - Ming Holi City", // address
    "price" : 3100, // price
    "area" : 110, // area
    "direction" : "Face south \r\n", // orientation
    "title" : "Shajing hao Li Cheng hardcover three room furniture neat bag live high-rise south at any time to see the house.", // title
    "rooms" : "Three rooms, two rooms", // layout
    "region" : "Baoan" // district
}
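Before it reaches MongoDB, a record like this is just a Python dictionary. The sketch below builds one with the same field names. `build_rent_doc` is a hypothetical helper (the article's own `getRentMsg` is not shown), the values are placeholders, and the actual database write is left as a comment because it needs a running MongoDB instance.

```python
def build_rent_doc(title, rooms, area, price, address, traffic, region, direction):
    """Assemble one listing as a dict using the field names of the sample record."""
    return {
        "title": title,
        "rooms": rooms,
        "area": area,            # square meters
        "price": price,          # yuan per month
        "address": address,
        "traffic": traffic,
        "region": region,
        "direction": direction,
    }


# Placeholder values for illustration only.
doc = build_rent_doc(
    title="Example listing",
    rooms="3 rooms, 2 halls",
    area=110,
    price=3100,
    address="Bao'an - Shajing",
    traffic="About 567 meters from a bus station",
    region="Baoan",
    direction="South",
)
print(doc["price"] / doc["area"])  # unit rent for this single listing

# With pymongo, a collection object would store it via collection.insert_one(doc).
```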
Crawler analysis
- Request library: requests
- HTML parsing: BeautifulSoup
- Word cloud: wordcloud
- Data visualization: pyecharts
- Database: MongoDB
- Database connection: pymongo
Crawler code implementation
First, right-click the page and view the page source to find the elements we want to crawl.
from bs4 import BeautifulSoup

def getOnePageData(self, pageUrl, reginon="不限"):
    rent = self.getCollection(self.region)
    self.session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36'})
    res = self.session.get(
        pageUrl
    )
    soup = BeautifulSoup(res.text, "html.parser")
    divs = soup.find_all("dd", attrs={"class": "info rel"})  # each matching dd holds one listing
    for div in divs:
        ps = div.find_all("p")
        try:  # incomplete listings and inserted ads lack some tags, so the indexing below can fail
            for index, p in enumerate(ps):  # each p tag carries one piece of the information we want
                text = p.text.strip()
                print(text)  # print to check that this is the information we want
            print("=" * 35)
            # Parse the fields and save them into MongoDB
            roomMsg = ps[1].text.split("|")
            area = roomMsg[2].strip()[:len(roomMsg[2]) - 2]  # drop the trailing unit
            priceText = ps[-1].text.strip()
            rentMsg = self.getRentMsg(
                ps[0].text.strip(),
                roomMsg[1].strip(),
                int(float(area)),
                int(priceText[:-3]),  # drop the 3-character unit suffix
                ps[2].text.strip(),
                ps[3].text.strip(),
                ps[2].text.strip()[:2],
                roomMsg[3],
            )
            rent.insert_one(rentMsg)  # Collection.insert() was removed in PyMongo 4
        except Exception:
            continue  # skip listings that cannot be parsed
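Slicing a fixed number of characters off the area and price strings is fragile if the page text changes. One alternative, sketched below, pulls the first number out with a regular expression. The helper names and the sample strings are assumptions for illustration; the real page text may differ.

```python
import re

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


def parse_room_line(line):
    """Split a 'type|rooms|area|direction' line into cleaned fields."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) < 4:
        return None  # incomplete listing or advertisement
    area = first_number(parts[2])
    if area is None:
        return None
    return {"rooms": parts[1], "area": int(area), "direction": parts[3]}


# Assumed sample strings, not copied from the site.
print(parse_room_line("整租|3室2厅|110㎡|朝南"))
print(first_number("3100元/月"))
print(parse_room_line("广告"))  # too few fields, so the result is None
```

Returning None for malformed lines lets the caller skip a listing explicitly instead of relying on a broad exception handler.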
Data analysis implementation
Data analysis:
# Get the unit rent for a district (yuan per square meter)
def getAvgPrice(self, region):
    areaPinYin = self.getPinyin(region=region)
    collection = self.zfdb[areaPinYin]
    totalPrice = collection.aggregate([{'$group': {'_id': '$region', 'total_price': {'$sum': '$price'}}}])
    totalArea = collection.aggregate([{'$group': {'_id': '$region', 'total_area': {'$sum': '$area'}}}])
    totalPrice2 = list(totalPrice)[0]["total_price"]
    totalArea2 = list(totalArea)[0]["total_area"]
    return totalPrice2 / totalArea2
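`getAvgPrice` divides the summed rent by the summed area, so large apartments weigh more than they would in a simple mean of per-listing ratios. Here is a minimal in-memory sketch of the same calculation, using made-up listings rather than crawled data; `avg_unit_price` is an illustrative name.

```python
def avg_unit_price(listings):
    """Total rent divided by total area, mirroring the two $sum aggregations."""
    total_price = sum(item["price"] for item in listings)
    total_area = sum(item["area"] for item in listings)
    if total_area == 0:
        raise ValueError("no area to divide by")
    return total_price / total_area


# Made-up listings for illustration only.
sample = [
    {"price": 3000, "area": 100},
    {"price": 2000, "area": 20},
]
print(avg_unit_price(sample))  # 5000 / 120

# For comparison: the unweighted mean of the per-listing unit prices.
mean_of_ratios = sum(i["price"] / i["area"] for i in sample) / len(sample)
print(mean_of_ratios)  # (30 + 100) / 2
```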
# Get how much each district costs per square meter per month
def getTotalAvgPrice(self):
    totalAvgPriceList = []
    totalAvgPriceDirList = []
    for index, region in enumerate(self.getAreaList()):
        avgPrice = self.getAvgPrice(region)
        totalAvgPriceList.append(round(avgPrice, 3))
        totalAvgPriceDirList.append({"value": round(avgPrice, 3), "name": region + "" + str(round(avgPrice, 3))})
    return totalAvgPriceDirList
# Get how much each district costs per square meter per day
def getTotalAvgPricePerDay(self):
    totalAvgPriceList = []
    for index, region in enumerate(self.getAreaList()):
        avgPrice = self.getAvgPrice(region)
        totalAvgPriceList.append(round(avgPrice / 30, 3))
    return (self.getAreaList(), totalAvgPriceList)
# Get the number of sampled listings in each district
def getAnalycisNum(self):
    analycisList = []
    for index, region in enumerate(self.getAreaList()):
        collection = self.zfdb[self.pinyinDir[region]]
        print(region)
        totalNum = collection.aggregate([{'$group': {'_id': ' ', 'total_num': {'$sum': 1}}}])
        totalNum2 = list(totalNum)[0]["total_num"]
        analycisList.append(totalNum2)
    return (self.getAreaList(), analycisList)
# Get the number of listings in each district
def getAreaWeight(self):
    result = self.zfdb.rent.aggregate([{'$group': {'_id': '$region', 'weight': {'$sum': 1}}}])
    areaName = []
    areaWeight = []
    for item in result:
        if item["_id"] in self.getAreaList():
            areaWeight.append(item["weight"])
            areaName.append(item["_id"])
            print(item["_id"])
            print(item["weight"])
            # print(type(item))
    return (areaName, areaWeight)
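The `$group` stage with `{'$sum': 1}` is a count per key. For readers without MongoDB at hand, the same idea can be sketched in plain Python with `collections.Counter`. The records below are invented, and `count_by` is an illustrative name.

```python
from collections import Counter


def count_by(records, key):
    """Count records per value of key, like $group with {'$sum': 1}."""
    return Counter(record[key] for record in records if key in record)


# Invented records for illustration only.
records = [
    {"region": "Futian", "rooms": "3 rooms 2 halls"},
    {"region": "Futian", "rooms": "2 rooms 2 halls"},
    {"region": "Nanshan", "rooms": "3 rooms 2 halls"},
]
weights = count_by(records, "region")
area_names = list(weights.keys())
area_weights = list(weights.values())
print(area_names, area_weights)
```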
# Get the title data to build the word cloud
def getTitle(self):
    collection = self.zfdb["rent"]
    queryArgs = {}
    projectionFields = {'_id': False, 'title': True}  # use a dictionary to specify the required fields
    searchRes = collection.find(queryArgs, projection=projectionFields).limit(1000)
    content = ' '
    for result in searchRes:
        print(result["title"])
        content += result["title"]
    return content
# Get apartment layout data (e.g., 3 bedrooms and 2 halls)
def getRooms(self):
    results = self.zfdb.rent.aggregate([{'$group': {'_id': '$rooms', 'weight': {'$sum': 1}}}])
    roomList = []
    weightList = []
    for result in results:
        roomList.append(result["_id"])
        weightList.append(result["weight"])
    # print(list(result))
    return (roomList, weightList)
# Get the number of listings in each rental-area range
def getAcreage(self):
    # (lower, upper] bounds in square meters; the last range starts at 400, not 300,
    # so listings of 300-400 square meters are not counted twice
    bounds = [(0, 30), (30, 60), (60, 90), (90, 120), (120, 200), (200, 300), (300, 400), (400, 10000)]
    attr = ["0-30 square meters", "30-60 square meters", "60-90 square meters", "90-120 square meters",
            "120-200 square meters", "200-300 square meters", "300-400 square meters", "400+ square meters"]
    value = []
    for low, high in bounds:
        results = self.zfdb.rent.aggregate([
            {'$match': {'area': {'$gt': low, '$lte': high}}},
            {'$group': {'_id': ' ', 'count': {'$sum': 1}}}
        ])
        value.append(list(results)[0]["count"])
    return (attr, value)
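If the areas are already in memory, the same ranges can be counted in one pass with `bisect`. This is a sketch on made-up numbers, not the article's method. Each range is (lower, upper] as in the `$gt`/`$lte` conditions, the last range is open-ended rather than capped at 10000, and an empty range yields 0 instead of raising an IndexError the way `list(...)[0]` does on an empty aggregation result.

```python
from bisect import bisect_left

UPPER_BOUNDS = [30, 60, 90, 120, 200, 300, 400]  # inclusive upper edges
LABELS = ["0-30", "30-60", "60-90", "90-120", "120-200", "200-300", "300-400", "400+"]


def bucket_counts(areas):
    """Count areas per range; each range is (lower, upper], the last is open-ended."""
    counts = [0] * len(LABELS)
    for area in areas:
        if area <= 0:
            continue  # mirrors the '$gt': 0 condition
        counts[bisect_left(UPPER_BOUNDS, area)] += 1
    return counts


areas = [25, 30, 31, 88, 110, 400, 450]  # made-up sample
for label, count in zip(LABELS, bucket_counts(areas)):
    print(label, count)
```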
Data Display:
# Show pie chart
def showPie(self, title, attr, value):
    from pyecharts import Pie  # pyecharts 0.5.x import path
    pie = Pie(title)
    pie.add("aa", attr, value, is_label_show=True)
    pie.render()
# Show treemap
def showTreeMap(self, title, data):
    from pyecharts import TreeMap
    treemap = TreeMap(title, width=1200, height=600)
    treemap.add("Shenzhen", data, is_label_show=True, label_pos='inside', label_text_size=19)
    treemap.render()
# Show bar chart
def showLine(self, title, attr, value):
    from pyecharts import Bar
    bar = Bar(title)
    bar.add("Shenzhen", attr, value, is_convert=False, is_label_show=True, label_text_size=18, is_random=True,
            # xaxis_interval=0, xaxis_label_textsize=9,
            legend_text_size=18, label_text_color=["#000"])
    bar.render()
from os import path

import jieba.analyse
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
# imread is not imported in the original excerpt; older SciPy releases provided it as scipy.misc.imread

# Show word cloud
def showWorkCloud(self, content, image_filename, font_filename, out_filename):
    d = path.dirname(__file__)  # directory of this script
    # content = open(path.join(d, filename), 'rb').read()
    # Keyword extraction based on the TF-IDF algorithm: topK is the number of top-weighted
    # keywords to return (default 20); withWeight controls whether the weights are returned too
    tags = jieba.analyse.extract_tags(content, topK=100, withWeight=False)
    text = " ".join(tags)
    # Background image that gives the word cloud its shape
    img = imread(path.join(d, image_filename))
    # Specify a Chinese font, otherwise the text will be garbled
    wc = WordCloud(font_path=font_filename,
                   background_color='black',
                   # Word cloud shape
                   mask=img,
                   # Maximum number of words
                   max_words=400,
                   # Maximum font size, or the image height if not specified
                   max_font_size=100,
                   # Canvas width and height have no effect if mask is set
                   # width=600,
                   # height=400,
                   margin=2,
                   # Share of words placed horizontally, 0.9 by default (so 0.1 vertically)
                   prefer_horizontal=0.9)
    wc.generate(text)
    img_color = ImageColorGenerator(img)
    plt.imshow(wc.recolor(color_func=img_color))
    plt.axis("off")
    plt.show()
    wc.to_file(path.join(d, out_filename))
# Show pyecharts word cloud
def showPyechartsWordCloud(self, attr, value):
    from pyecharts import WordCloud
    wordcloud = WordCloud(width=1300, height=620)
    wordcloud.add("", attr, value, word_size_range=[20, 100])
    wordcloud.render()
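The article does not show how the `attr` and `value` lists passed to `showPyechartsWordCloud` are built. One plausible way, sketched here as an assumption, is to count already-tokenized terms and keep the most frequent ones. `top_terms` and the sample tokens are hypothetical.

```python
from collections import Counter


def top_terms(tokens, limit=100):
    """Return (attr, value): the most frequent tokens and their counts."""
    common = Counter(tokens).most_common(limit)
    attr = [term for term, _ in common]
    value = [count for _, count in common]
    return attr, value


# Made-up, already-tokenized titles; the article uses jieba for Chinese text.
tokens = ["furnished", "subway", "furnished", "south", "subway", "furnished"]
attr, value = top_terms(tokens)
print(attr, value)
```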
Afterword.
A lot has been happening lately. The surge in rents is really the force of capital entering the rental market. Long-term rental apartment operators such as Freely and Eggshell bid up rents against each other and have tenants sign third-party loan agreements. Early expansion may burn some money, but once the market is monopolized, housing is a rigid need, so the money will come back. Finally, whatever the external conditions, we should build up our own hard skills to improve our ability to survive.
This article was first published on the public account [Zone7]. Follow the account to get the latest posts, and reply [Shenzhen rent] in the account's message box to get the source code.