This project was contributed by Miss_candy, a student in the 6th session of Shiyanlou’s “Lou+ Data Analysis and Mining in Practice” course. The course is designed around the skills expected of a junior data analysis or data mining engineer. It contains 35 experiments, 20 challenges, 5 comprehensive projects and 1 major project, and serves as a six-week introduction to data analysis and mining.
Reading the data
The data was collected on August 27 and covers hotel prices for stays from 08-28 to 08-29. Hotel prices fluctuate with the tourist season. At the time of collection Dalian was in the transition between seasons, so prices were becoming more reasonable but were still above the normal level.
import pandas as pd
import jieba
from tqdm import tqdm_notebook
from wordcloud import WordCloud
import numpy as np
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('https://s3.huhuhang.com/temporary/b1vzDs.csv')
df.shape
Output:
(2475, 7)
Data cleaning
# The scraped data contains duplicates, so keep only the first row for each hotel name
df = df.drop_duplicates(['HotelName'])
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2219 entries, 0 to 2474
Data columns (total 7 columns):
Unnamed: 0 2219 non-null int64
index 2219 non-null int64
HotelName 2219 non-null object
HotelLocation 2219 non-null object
HotelCommentValue 2219 non-null float64
HotelCommentAmount 2219 non-null int64
HotelPrice 2219 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 138.7+ KB
After removing duplicates, 2219 hotels remain. Five of the columns carry useful information:
- “HotelName”: the name of the hotel
- “HotelLocation”: the district or area where the hotel is located
- “HotelCommentValue”: the hotel’s rating score
- “HotelCommentAmount”: the number of reviews the hotel has received
- “HotelPrice”: the lowest price of the hotel
Some hotels have no rating yet because they opened recently (or for other reasons). During data collection these hotels were assigned a score of 0. We set this part of the data aside as the new_hotel data set for later analysis and prediction.
df_new_hotel = df[df["HotelCommentValue"]==0].drop(['Unnamed: 0'], axis=1).set_index(['index'])
df_new_hotel.head()
Output:
Hotels that already have a score are separated from the original data set for analysis and modeling.
df_in_ana = df[df["HotelCommentValue"] != 0].drop(["Unnamed: 0", "index"], axis=1)
df_in_ana.shape
Output:
(1669, 5)
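The deduplication and the rated/unrated split above can be tried on a small made-up table. This is only a sketch: the rows are invented for illustration, and only the column names come from the data set.

```python
import pandas as pd

# Toy rows (hypothetical values); the column names follow the data set above.
toy = pd.DataFrame({
    "HotelName": ["A Inn", "A Inn", "B Hotel", "C Hostel"],
    "HotelCommentValue": [4.5, 4.5, 0.0, 3.8],
    "HotelPrice": [120.0, 120.0, 560.0, 88.0],
})

# Keep the first row for each hotel name.
toy = toy.drop_duplicates(["HotelName"])

# A score of 0 marks a hotel that has not been rated yet.
unrated = toy[toy["HotelCommentValue"] == 0]
rated = toy[toy["HotelCommentValue"] != 0]

print(len(toy), len(unrated), len(rated))
```

Every remaining row lands in exactly one of the two subsets, which is why the 2219 hotels in the article split into 1669 rated and 550 unrated ones.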
Data analysis
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei'] # used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False # is used to display the minus sign normally
sns.distplot(df_in_ana['HotelPrice'].values)
Output:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7353b9c240>
The price distribution shows that most hotels cost less than 500 yuan per night, with prices most concentrated between 200 and 300 yuan per night. Relatively few hotels cost more than 500 yuan per night. Based on the price distribution and actual price levels, the hotels are divided into the following grades:
- “Cheap” hotels: 100 yuan per night or less
- “Economy” hotels: above 100 and up to 300 yuan per night
- “Comfortable” hotels: above 300 and up to 500 yuan per night
- “High-end” hotels: above 500 and up to 1000 yuan per night
- “Luxury” hotels: more than 1000 yuan per night
df_in_ana['HotelLabel'] = df_in_ana["HotelPrice"].apply(lambda x: 'luxury' if x > 1000 \
    else ('high-end' if x > 500 \
    else ('comfortable' if x > 300 \
    else ('economic' if x > 100 \
    else 'cheap'))))
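The same grading can also be written with `pd.cut`, which some readers may find easier to check than a nested conditional. This is an alternative sketch rather than the article's code; the sample prices are made up, and the bin edges are chosen to match the thresholds used in the lambda.

```python
import numpy as np
import pandas as pd

# Made-up prices, including the boundary values 100, 300, 500 and 1000.
prices = pd.Series([80, 100, 101, 300, 450, 500, 999, 1000, 1200], name="HotelPrice")

# Right-closed bins: (-inf, 100], (100, 300], (300, 500], (500, 1000], (1000, inf)
bins = [-np.inf, 100, 300, 500, 1000, np.inf]
names = ["cheap", "economic", "comfortable", "high-end", "luxury"]
labels = pd.cut(prices, bins=bins, labels=names, right=True)

def nested(x):
    """The same rule written as the chained comparison used in the article."""
    if x > 1000:
        return "luxury"
    if x > 500:
        return "high-end"
    if x > 300:
        return "comfortable"
    if x > 100:
        return "economic"
    return "cheap"

# Both approaches agree on every sample price.
print(labels.astype(str).tolist() == [nested(p) for p in prices])
```

With right-closed bins a price of exactly 100 yuan is graded “cheap” and exactly 1000 yuan is graded “high-end”, the same as the `x > ...` comparisons.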
With the grades assigned, first take a rough look at the share of each hotel type:
hotel_label = df_in_ana.groupby('HotelLabel')['HotelName'].count()
plt.pie(hotel_label.values, labels=hotel_label.index, autopct='%.1f%%', explode=[0, 0.1, 0.1, 0.1, 0.1], shadow=True)
Output:
[Pie chart of hotel grades: luxury 0.9%, cheap 21.9%, economic 62.0%, comfortable 11.0%, high-end 4.1%]
The pie chart shows that 62.0% of the hotels are economy hotels and 21.9% are cheap hotels, while high-end and luxury hotels make up a relatively small share, which matches the general positioning of a tourist city.
Next, let’s look at the geographical distribution of the hotels:
from pyecharts import Map
map_hotel = Map("Dalian Hotel Area Distribution Map", width=1000, height=600)
hotel_distribution = df_in_ana.groupby('HotelLocation') ['HotelName'].count().sort_values(ascending=False)
hotel_distribution = hotel_distribution[:8]
h_values = list(hotel_distribution.values)
district = list(hotel_distribution.index)
map_hotel.add("", district, h_values, maptype='dalian', is_visualmap=True,
visual_range=([min(h_values), max(h_values)]),
visual_text_color="#fff", symbol_size=20, is_label_show=True)
map_hotel.render('dalian_hotel.html')
The location information comes from the website, and the part filled in by the hotels themselves is not standardized, so the values are fairly fragmented. These varied entries are hard to consolidate and account for only a small share of the data, so after sorting the locations by hotel count we keep only the top eight areas. Among the hotels collected, most are located in Shahekou District and Jinzhou District, which is directly related to the distribution of the main attractions: the well-known Xinghai Square and the bridge are in Shahekou District, while Golden Pebble Beach and Discovery Kingdom are in Jinzhou District. (The high-tech park has no corresponding region on the map, because it is not an administrative district but a technology development zone where Ganjingzi District and Shahekou District meet. Its share does not affect the figures for Shahekou District or Ganjingzi District, so it does not hinder the analysis.)
Dalian hotel area distribution map
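Before discarding the less common location values, it can be useful to check how much of the data the top areas actually cover. The sketch below uses invented location strings and a top-3 cut-off purely for illustration; the article keeps the top eight.

```python
import pandas as pd

# Made-up location values: three districts plus two free-form entries.
locations = pd.Series(
    ["Shahekou"] * 5 + ["Jinzhou"] * 4 + ["Zhongshan"] * 3
    + ["near the station", "seaside"],
    name="HotelLocation",
)

counts = locations.value_counts()      # sorted from most to least frequent
top = counts.head(3)                   # the article uses the first 8 instead
coverage = top.sum() / counts.sum()    # share of hotels inside the kept areas

print(top.index.tolist())
print(f"coverage: {coverage:.1%}")
```

If the coverage were low, dropping the remaining entries would remove a large part of the data, and cleaning the location strings would be the better option.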
Dalian is a tourist city, and the positioning and grade of hotels are likely to differ between administrative districts. It is therefore interesting to see how hotels of each grade are distributed across districts:
hotel_distribution = df_in_ana.groupby('HotelLocation')['HotelName'].count().sort_values(ascending=False)
hotel_distribution = hotel_distribution[:8]
hotel_label_distr = df_in_ana.groupby(['HotelLocation', 'HotelLabel'])['HotelName'].count().sort_values(ascending=False).reset_index()
in_use_district = list(hotel_distribution.index)
# Keep only the eight largest areas selected above
hotel_label_distr = hotel_label_distr[hotel_label_distr['HotelLocation'].isin(in_use_district)]
fig, axes = plt.subplots(1, 5, figsize=(17, 8))
hotel_label_list = ['high-end', 'comfortable', 'economic', 'luxury', 'cheap']
for i in range(len(hotel_label_list)):
    current_df = hotel_label_distr[hotel_label_distr['HotelLabel'] == hotel_label_list[i]]
    axes[i].set_title('Regional distribution of {} hotels'.format(hotel_label_list[i]))
    axes[i].pie(current_df.HotelName, labels=current_df.HotelLocation, autopct='%.1f%%', shadow=True)
The charts show that Shahekou, Jinzhou and Zhongshan districts hold the largest shares of every hotel grade. More interestingly, Lüshunkou District has no luxury hotels at all. Apart from Shahekou District, luxury hotels are concentrated mainly in Zhongshan District, which has a lot to do with history and geography. Dalian locals often call Zhongshan District the legendary “rich area”, and many business travelers choose to stay there, which also drives investment in high-end and luxury hotels in the district.
Besides price (grade), we also consider a hotel’s reviews when booking: the higher the score and the more reviews it has, the more inclined we are to book. So, using the data set with scores, let’s look at how these Dalian hotels are rated.
First, the hotels are labeled by score, following how consumers generally read ratings:
- A score above 4.6 is “great”
- A score above 4.0 and up to 4.6 is “not bad”
- A score above 3.0 and up to 4.0 is “so-so”
- A score of 3.0 or below is “bad”
df_in_ana['HotelCommentLevel'] = df_in_ana["HotelCommentValue"].apply(lambda x: 'great' if x > 4.6 \
    else ('Not bad' if x > 4.0 \
    else ('So-so' if x > 3.0 \
    else 'bad')))
We then group the data by rating level and hotel grade and visualize it.
hotel_label_level = df_in_ana.groupby(['HotelCommentLevel', 'HotelLabel'])['HotelName'].count().sort_values(ascending=False).reset_index()
fig, axes = plt.subplots(1, 5, figsize=(17, 8))
for i in range(len(hotel_label_list)):
    current_df = hotel_label_level[hotel_label_level['HotelLabel'] == hotel_label_list[i]]
    axes[i].set_title('{} type hotel rating'.format(hotel_label_list[i]))
    axes[i].pie(current_df.HotelName, labels=current_df.HotelCommentLevel, autopct='%.1f%%', shadow=True)
Looking at the rating distribution for each hotel grade, bad reviews appear mainly among cheap and economy hotels, with cheap hotels hit hardest. Comfortable, high-end and luxury hotels, whose lowest price is above 300 yuan per night, have essentially no bad reviews, which confirms the common belief that you get what you pay for. Luxury hotels have the highest share of “great” ratings. However, the share of “great” does not rise steadily with hotel grade: for high-end hotels it is actually lower than for the cheaper comfortable hotels. This may be because the service expectations created by the price exceed the level of service the hotel actually delivers. On the one hand, this reminds consumers not to assume blindly that expensive means good. On the other hand, it reminds hotels to set prices that match the service they can actually provide, because inflated prices are not advisable.
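The comparison of rating shares across grades can also be read from a single table instead of five pie charts. The sketch below uses `pd.crosstab` with row normalization on a few invented rows; it is an alternative view, not the method used in the article.

```python
import pandas as pd

# Invented rows; the label values follow the ones assigned in the article.
toy = pd.DataFrame({
    "HotelLabel": ["cheap", "cheap", "cheap", "cheap", "luxury", "luxury"],
    "HotelCommentLevel": ["great", "bad", "So-so", "Not bad", "great", "great"],
})

# normalize="index" turns each row (one hotel grade) into shares that sum to 1.
share = pd.crosstab(toy["HotelLabel"], toy["HotelCommentLevel"], normalize="index") * 100

print(share.round(1))
```

Each row then shows the percentage of “great”, “Not bad”, “So-so” and “bad” hotels within one grade, so grades can be compared line by line.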
Hotel lists
Based on the analysis so far, we can compile a “recommendation list” (literally a “grass list”) and an “avoid list” (a “lightning protection list”):
The recommendation list collects hotels with a high rating, a large number of reviews (tested by many guests) and a reasonable price for their grade, for friends with different travel needs to choose from. The avoid list collects badly reviewed hotels, as a reminder not to take chances or learn by trial and error.
Recommendation list
# Cheap hotels
df_pos_cheap = df_in_ana[(df_in_ana['HotelLabel'] == 'cheap') \
    & (df_in_ana['HotelCommentValue'] > 4.6) \
    & (df_in_ana['HotelCommentAmount'] > 500)].sort_values(by=['HotelPrice'], ascending=False)
df_pos_cheap
Output:
# Economy hotels
df_pos_economy = df_in_ana[(df_in_ana['HotelLabel'] == 'economic') \
    & (df_in_ana['HotelCommentValue'] > 4.6) \
    & (df_in_ana['HotelCommentAmount'] > 2000)].sort_values(by=['HotelPrice'])
df_pos_economy
Output:
# Comfortable hotels
df_pos_comfortable = df_in_ana[(df_in_ana['HotelLabel'] == 'comfortable') \
    & (df_in_ana['HotelCommentValue'] > 4.6) \
    & (df_in_ana['HotelCommentAmount'] > 1000)].sort_values(by=['HotelPrice'])
df_pos_comfortable
Output:
# High-end hotels
df_pos_hs = df_in_ana[(df_in_ana['HotelLabel'] == 'high-end') \
    & (df_in_ana['HotelCommentValue'] > 4.6) \
    & (df_in_ana['HotelCommentAmount'] > 1000)].sort_values(by=['HotelPrice'])
df_pos_hs
Output:
# Luxury hotels
df_pos_luxury = df_in_ana[(df_in_ana['HotelLabel'] == 'luxury') \
    & (df_in_ana['HotelCommentValue'] > 4.6) \
    & (df_in_ana['HotelCommentAmount'] > 500)].sort_values(by=['HotelPrice'])
df_pos_luxury
Output:
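The five filters above differ only in the grade and the minimum number of reviews, so they could be folded into one helper. The function `recommend` below is a hypothetical name introduced for illustration, and the rows are invented; it is a sketch of the pattern, not part of the original notebook.

```python
import pandas as pd

def recommend(df, label, min_comments, min_score=4.6):
    """Return hotels of one grade with a high score and enough reviews, cheapest first."""
    mask = (
        (df["HotelLabel"] == label)
        & (df["HotelCommentValue"] > min_score)
        & (df["HotelCommentAmount"] > min_comments)
    )
    return df[mask].sort_values(by="HotelPrice")

# Invented rows using the column names from the data set.
toy = pd.DataFrame({
    "HotelName": ["X", "Y", "Z", "W"],
    "HotelLabel": ["economic", "economic", "economic", "cheap"],
    "HotelCommentValue": [4.8, 4.7, 4.5, 4.9],
    "HotelCommentAmount": [2500, 3000, 5000, 600],
    "HotelPrice": [260.0, 180.0, 150.0, 90.0],
})

picked = recommend(toy, "economic", min_comments=2000)
print(picked["HotelName"].tolist())
```

Hotel Z is excluded because its score is not above 4.6, even though it has the most reviews, and W is excluded because it belongs to a different grade.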
Avoid list
df_neg = df_in_ana[(df_in_ana['HotelCommentValue'] < 3.0) \
& (df_in_ana['HotelCommentAmount'] > 50)].sort_values(by=['HotelPrice'], ascending=False)
df_neg
Output:
The science of hotel names
Hotels at the extremes tend to have distinctive names. Very expensive high-end hotels generally aim for an elegant business style, and their names sound “expensive”. Cheaper hotels that rely on low prices to attract guests, mainly students or travelers on a tight budget, either go for a fresh, youthful style or are plain and direct, so the name itself sounds like good value. We use word clouds to check whether this idea holds for hotels in the Dalian area.
!wget -nc "http://labfile.oss.aliyuncs.com/courses/1176/fonts.zip"
!unzip -o fonts.zip
from wordcloud import WordCloud
def get_word_map(hotel_name_list):
    # Count how often each word appears across all hotel names
    word_dict = {}
    for hotel_name in tqdm_notebook(hotel_name_list):
        hotel_name = hotel_name.replace('(', '')
        hotel_name = hotel_name.replace(')', '')
        word_list = list(jieba.cut(hotel_name, cut_all=False))
        for word in word_list:
            # Skip the city name and single-character tokens
            if word == 'dalian' or len(word) < 2:
                continue
            if word not in word_dict:
                word_dict[word] = 0
            word_dict[word] += 1
    font_path = 'fonts/SourceHanSerifK-Light.otf'
    wc = WordCloud(font_path=font_path, background_color='white', max_words=1000,
                   max_font_size=120, random_state=42, width=800, height=600, margin=2)
    wc.generate_from_frequencies(word_dict)
    return wc
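The counting step inside `get_word_map` can be isolated and written with `collections.Counter`. In the sketch below, the helper name `count_tokens` is hypothetical and the English tokens are placeholders standing in for the output of `jieba.cut`, so the example runs without jieba or the font file.

```python
from collections import Counter

def count_tokens(token_lists, stop_words=("dalian",), min_len=2):
    """Count tokens across names, skipping stop words and very short tokens."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(t for t in tokens if len(t) >= min_len and t not in stop_words)
    return dict(counts)

# Placeholder tokens standing in for segmented hotel names.
tokenized = [
    ["dalian", "xinghai", "hotel"],
    ["xinghai", "apartment", "a"],
    ["dalian", "youth", "hostel"],
]

freq = count_tokens(tokenized)
print(freq)
```

The resulting dictionary has the same shape as `word_dict`, so it could be passed to `WordCloud.generate_from_frequencies` in the same way.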
To make sure there is enough data to draw the word clouds, we do not select by the hotel grades defined earlier. Instead we take hotels priced at 150 yuan or less and hotels priced above 500 yuan as two relatively extreme groups, and see whether their names differ in typical ways.
part1 = df_in_ana[df_in_ana['HotelPrice'] <= 150]['HotelName'].values
part2 = df_in_ana[df_in_ana['HotelPrice'] > 500]['HotelName'].values
fig, axes = plt.subplots(1, 2, figsize=(15, 8))
axes[0].set_title('The word cloud of cheaper hotels')
axes[0].imshow(get_word_map(part1), interpolation='bilinear')
axes[1].set_title('The word cloud of higher-priced hotels')
axes[1].imshow(get_word_map(part2), interpolation='bilinear')
Output:
<matplotlib.image.AxesImage at 0x7f73515c1908>
The results show clear differences between the word clouds of the two groups. In the names of low-priced hotels, words such as “guest house”, “theme”, “youth”, “express hotel” and “hotel” appear frequently, which fits our idea of this kind of hotel. In the names of high-priced hotels, “Xinghai”, “Seascape”, “Hot Spring” and “Square” are relatively frequent. Because Xinghai Square in Shahekou District is Dalian’s best-known landmark, nearby hotels (especially high-end ones) like to put “Xinghai” in their names. Besides highlighting the location, the word seems to add some style to the hotel. In addition, high-end hotels seem less inclined to use the plainer term for “hotel” in their names, preferring a more upmarket term or “serviced apartment”. Surprisingly, both the cheaper and the more expensive hotels love the word “apartment”, which seems to be a trend in the hotel industry.
Know a hotel by its name
A name is the symbol of a person or thing, and the first impression it creates matters. We have just seen that hotels at the price extremes have characteristic names, so a name can to some extent indicate a hotel’s grade. As the saying goes, you can see the adult in the three-year-old child. For newly opened hotels that have no score yet, we can predict the price class from the name and check whether the actual pricing fits the hotel’s positioning. Combined with the rating characteristics of each grade analyzed earlier, this gives a rough idea of whether a new hotel is overpriced, or whether it is worth being the guinea pig and trying it out. There is a catch, however: because of changes in environment and era, new hotels follow different naming strategies from older ones, and this difference has a significant effect on modeling and prediction. So this is only an interesting experiment carried out as a learning exercise. The result will not be accurate, but the process is fun :)
df_in_ana['HotelPrice'].median()
Output:
156.0
Based on the word cloud analysis above and the median price of the rated hotels, we set the threshold at 150 yuan: hotels priced at 150 yuan per night or less are labeled 1, and hotels above this price are labeled 0. This split leaves the two parts of the data roughly balanced and also partly reflects the differences in hotel names.
df_in_ana['PriceLabel'] = df_in_ana['HotelPrice'].apply(lambda x: 1 if x <= 150 else 0)
df_new_hotel['PriceLabel'] = df_new_hotel['HotelPrice'].apply(lambda x: 1 if x <= 150 else 0)
# Tokenizer
def word_cut(x):
    x = x.replace('(', '')  # remove () from name
    x = x.replace(')', '')
    return jieba.lcut(x)
# Set training set and test set
x_train = df_in_ana['HotelName'].apply(word_cut).values
y_train = df_in_ana['PriceLabel'].values
x_test = df_new_hotel['HotelName'].apply(word_cut).values
y_test = df_new_hotel['PriceLabel'].values
The training set contains 1669 hotels, 790 of which are labeled 1. The test set contains 550 hotels, 195 of which are labeled 1.
# Train a Word2Vec model (a shallow neural network) on the tokenized names, then sum the word vectors of each hotel name
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')
w2v_model = Word2Vec(size=200, min_count=10)  # gensim < 4.0; the parameter is named vector_size in gensim 4+
w2v_model.build_vocab(x_train)
w2v_model.train(x_train, total_examples=w2v_model.corpus_count, epochs=5)
def sum_vec(text):
    vec = np.zeros(200).reshape((1, 200))
    for word in text:
        try:
            vec += w2v_model.wv[word].reshape((1, 200))
        except KeyError:
            # Words below min_count are not in the vocabulary
            continue
    return vec
train_vec = np.concatenate([sum_vec(text) for text in tqdm_notebook(x_train)])
# Build a neural network classifier and train it on the training data
import joblib  # formerly sklearn.externals.joblib, which newer scikit-learn releases no longer provide
from sklearn.neural_network import MLPClassifier
from IPython import display
model = MLPClassifier(hidden_layer_sizes=(100, 50, 20), learning_rate='adaptive')
model.fit(train_vec, y_train)
# Draw the loss curve to monitor how the loss function changes
display.clear_output(wait=True)
plt.plot(model.loss_curve_)
Output:
[<matplotlib.lines.Line2D at 0x7f73400b8198>]
P.S. Because the data set is small and the names themselves carry limited information, the training result here is not very good.
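The pooling step in `sum_vec` can be illustrated without a trained model. In the sketch below, the embedding table is a hypothetical stand-in for the Word2Vec vectors, the dimension is reduced from 200 to 4 for readability, and the mean variant is an alternative that the article does not use.

```python
import numpy as np

DIM = 4  # the article uses 200; 4 keeps the example readable

# Hypothetical embedding table standing in for the trained Word2Vec vectors.
vectors = {
    "xinghai": np.array([1.0, 0.0, 0.0, 0.0]),
    "hotel": np.array([0.0, 1.0, 0.0, 0.0]),
    "apartment": np.array([0.0, 0.0, 1.0, 0.0]),
}

def pool(tokens, mode="sum"):
    """Combine the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(DIM)
    stacked = np.vstack(known)
    return stacked.sum(axis=0) if mode == "sum" else stacked.mean(axis=0)

name = ["xinghai", "seaview", "hotel"]  # "seaview" is not in the table
print(pool(name, "sum"))
print(pool(name, "mean"))
```

Summing makes the length of the result grow with the number of known words in a name, while averaging removes that effect. Which one works better for this task would have to be checked on the real data.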
# Sum the word vectors for the test set in the same way
test_vec = np.concatenate([sum_vec(text) for text in tqdm_notebook(x_test)])
# Predict with the trained model and write the results into the test table
y_pred = model.predict(test_vec)
# Assign the array directly: pd.Series(y_pred) would be aligned on a fresh 0..n-1 index, which does not match the index of df_new_hotel
df_new_hotel['PredLabel'] = y_pred
# Accuracy of the predictions
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
Output:
0.6163636363636363
The prediction accuracy is only about 60%, which is a rather unsatisfactory result. Let’s look at the misclassified hotels to find the main reason:
new_hotel_questionable = df_new_hotel[(df_new_hotel['PriceLabel'] ==0) & (df_new_hotel['PredLabel']==1)]
new_hotel_questionable = new_hotel_questionable.sort_values(by='HotelPrice', ascending=False)
new_hotel_questionable
Output:
The clearly misclassified results show that many newly opened hotels, especially the expensive ones, are “villa”-type resort hotels. This type is not clearly represented in the rated data set, so the trained classifier is not sensitive to it and the chance of misclassification rises sharply.
plt.figure(figsize=(15, 7))
plt.imshow(get_word_map(new_hotel_questionable['HotelName'].values), interpolation='bilinear')
Output:
<matplotlib.image.AxesImage at 0x7f7333b06d68>
Compared with the data set used for modeling, the names of newly opened hotels contain additional words such as “number shop”, “branch store” and “villa”, which lowers the prediction accuracy.
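One quick sanity check on the accuracy reported above is to compare it with a trivial baseline that always predicts the majority label. The check is not part of the original notebook; it only uses the counts and the accuracy value quoted in the text.

```python
# Figures quoted in the article.
n_test = 550                          # unrated hotels in the test set
n_label_1 = 195                       # test hotels priced at 150 yuan or less
model_accuracy = 0.6163636363636363   # accuracy_score output shown above

# A classifier that always predicts the majority label (0) reaches this accuracy.
baseline_accuracy = (n_test - n_label_1) / n_test

# Number of test hotels the model classified correctly.
n_correct = round(model_accuracy * n_test)

print(f"majority-class baseline: {baseline_accuracy:.3f}")
print(f"model accuracy:          {model_accuracy:.3f}")
print(f"model correct on {n_correct} of {n_test} hotels")
```

Since 355 of the 550 test hotels carry label 0, the baseline comes out slightly above the model's score, which is consistent with the article's own assessment that the prediction result is unsatisfactory.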
Get to know the new hotel
Beyond the names, we can also look at the geographical distribution of the new hotels and how their average prices have changed.
new_hotel_distri = df_new_hotel.groupby('HotelLocation') ['HotelName'].count().sort_values(ascending=False)[:7]
plt.pie(new_hotel_distri.values, labels=new_hotel_distri.index, autopct='%.1f%%', shadow=True)
Output:
[Pie chart of new hotels by area: Jinzhou District 35.1%, Ganjingzi District 20.1%, Shahekou District 16.8%, Lüshunkou District 9.0%, Zhongshan District 7.8%, Xigang District 6.6%, Pulandian District 4.5%]
The pie chart shows that more than 30% of the new hotels chose Jinzhou District, whereas Shahekou District, the traditional hotel cluster, attracted only about 16% of new openings.
df_new_hotel['HotelLabel'] = df_new_hotel["HotelPrice"].apply(lambda x: 'luxury' if x > 1000 \
    else ('high-end' if x > 500 \
    else ('comfortable' if x > 300 \
    else ('economic' if x > 100 \
    else 'cheap'))))
new_hotel_label = df_new_hotel.groupby('HotelLabel')['HotelName'].count()
plt.pie(new_hotel_label.values, labels=new_hotel_label.index, autopct='%.1f%%', explode=[0, 0.1, 0.1, 0.1, 0.1], shadow=True)
Output:
[Pie chart of new hotel grades: luxury 5.1%, cheap 22.7%, economic 44.0%, comfortable 18.9%, high-end 9.3%]
Although most travelers still choose economical, low-priced hotels, the share of high-end and luxury hotels among the new openings has risen markedly (high-end from 4.1% to 9.3%, luxury from 0.9% to 5.1%). Combined with the word cloud analysis of new hotels above, this suggests that more and more operators are building high-end hotels, mainly villa-type resort hotels, which reflects travelers’ pursuit of quality and a more comfortable travel experience.
On price, there are some interesting results:
df2 = df_new_hotel.groupby('HotelLabel')['HotelPrice'].mean().reset_index()
df1 = df_in_ana.groupby('HotelLabel')['HotelPrice'].mean().reset_index()
price_change_percent = (df2['HotelPrice'] - df1['HotelPrice']) / df1['HotelPrice'] * 100
plt.title('Change of average price of newly opened hotels')
plt.bar(df1['HotelLabel'], price_change_percent, width=0.35)
plt.ylim(-18, 18)
# Put each percentage above a positive bar and below a negative bar
for x, y in enumerate(price_change_percent):
    if y < 0:
        plt.text(x, y, '{:.1f}%'.format(y), ha='center', fontsize=12, va='top')
    else:
        plt.text(x, y, '{:.1f}%'.format(y), ha='center', fontsize=12, va='bottom')
Compared with the older hotels that already have reviews, the average prices of newly opened hotels changed as follows:
- The average price of luxury and cheap hotels fell
- The average price of the intermediate grades (economy, comfortable and high-end hotels) rose
Hotels at the two extremes are lowering prices to attract attention and raise occupancy in order to grow quickly. Hotels in the middle grades are instead changing their business philosophy and following current trends to justify higher prices, but how well this works in the end still depends on whether travelers accept it.
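The percentage change can also be computed on Series indexed by grade, so that pandas matches the old and new averages by label rather than by row position. The averages below are invented for illustration and do not come from the data set.

```python
import pandas as pd

# Hypothetical average prices per grade; note the different ordering.
old_mean = pd.Series({"cheap": 80.0, "economic": 200.0, "luxury": 1500.0})
new_mean = pd.Series({"luxury": 1350.0, "cheap": 76.0, "economic": 210.0})

# Arithmetic between Series aligns on the index labels, not on position.
change = (new_mean - old_mean) / old_mean * 100

print(change.round(1))
```

Matching by label avoids a silent mismatch if one of the two data sets happens to lack a grade or lists the grades in a different order.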
Conclusion
This experiment analyzed hotel data for the Dalian area, mining information such as prices and regional distribution. It produced a “recommendation list” and an “avoid list” of rated hotels (no more worrying about where friends should stay when they visit Dalian!), analyzed hotel names with word clouds to explore the relationship between name and grade, and built a classification model to predict from the name whether the pricing of new, unrated hotels is appropriate. It also examined the regional and grade distribution of newly opened hotels and compared their average prices with those of rated hotels, which gives an indirect view of how tourism in Dalian is developing. Because the data set is small and hotel naming is strongly tied to region and era, the model’s predictions are not good. Still, it is interesting to learn these techniques and apply them in different ways to deepen our understanding of them.
Also published on the author’s Zhihu column: zhuanlan.zhihu.com/p/85909205