Confucius once said that the appetite for food and sex is human nature. (Some have pointed out that it was actually Mencius who said it…) Today we won't talk about sex, just food. As the saying goes, food is heaven to the people: eat your fill and work never feels tiring. But restaurants are as countless as the hairs on an ox, and good and bad are mixed together, so you often pay a high price only to find the food isn't good. The trouble is that you only know whether a meal is good once it's in your mouth, and once you've eaten you can't refuse to pay; the frustration goes without saying. So I want to look at what is really going on with these thousands upon thousands of restaurants, what patterns they follow, and whether there is a way to judge if a place is worth visiting.
Just then, I happened to see that @passer-by had shared a Dianping dataset with roughly 580,000 scraped records. It was like being handed a pillow just as I was getting sleepy: no need to crawl the data myself. I looked at the data and it's pretty tidy, so… let's just do it.
This article is divided into four parts: data processing, data analysis, machine learning, and summary and reflections.
I. Data processing
1. Data deduplication
First, let's take a look at the data. Aha, 585,915 records with 10 fields:
- City indicates the city; there are 49 popular cities.
- Cuisine indicates the cuisine type; there are 72 different cuisines.
- Name indicates the restaurant name; there are 231,877 different names.
- Star indicates the star rating. The value is 0, 20, 30, 35, 40, 45, or 50.
- Comments indicates the number of comments.
- PCC indicates per capita consumption.
- Taste indicates the taste score. The value ranges from 0 to 10.
- Environment indicates the environment score. The value ranges from 0 to 10.
- Service indicates the service score. The value ranges from 0 to 10.
- Addr indicates the address.
However, I found that many records are duplicated. Restaurants may have branches, so a repeated name is understandable, but if two records share the same address I can only treat them as one store. After this step the number of stores dropped from 585,915 to 516,674, a reduction of nearly 70,000 records (69,241 to be exact).
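As a minimal sketch of the address-based deduplication described above, one could keep the first record seen for each address. The sample rows and the helper name `dedupe_by_address` are made up for illustration; only the `Name` and `Addr` field names come from the dataset.

```python
def dedupe_by_address(records):
    """Return records with duplicate addresses removed, keeping the first seen."""
    seen_addresses = set()
    unique = []
    for record in records:
        addr = record["Addr"]
        if addr in seen_addresses:
            continue  # same address as an earlier record: treat as the same store
        seen_addresses.add(addr)
        unique.append(record)
    return unique


# Hypothetical sample rows, not taken from the real data.
sample = [
    {"Name": "Noodle House", "Addr": "1 Example Road"},
    {"Name": "Noodle House", "Addr": "1 Example Road"},    # duplicate address, dropped
    {"Name": "Noodle House", "Addr": "99 Sample Street"},  # a branch elsewhere, kept
]
deduped = dedupe_by_address(sample)
print(len(sample), "->", len(deduped))
```

If the data lives in a pandas DataFrame, `DataFrame.drop_duplicates(subset="Addr")` does the same job in one call.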
2. Handling missing values
Now let's look at the missing values.
There are 11 missing values in the "Name" field. Since the name is an essential element of a restaurant, these 11 nameless restaurants are deleted as punishment. The same goes for restaurants without an address: I can't find you without one. For records with missing values in "Taste", "Environment", "Service", "Comments" and "PCC", I originally intended to fill in 0, but on second thought, I am looking for excellent restaurants, and these "three noes" restaurants do not help the overall analysis. They are redundant values, and filling in zeros would also distort averages and other indicators, so I decided to delete them.
Of course, some "three noes" restaurants may simply be newly opened and have not had time to accumulate ratings, so there will be some collateral damage. But going to a new restaurant is risky, and I just want a safe analysis. They can be selected again once they have enough ratings; I believe real gold will always shine.
After deleting the missing values, the data shrank by almost half, which means that in this dataset at least half of the stores on Dianping lack key information and cannot give us good guidance. Dianping, you need to step it up.
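A rough sketch of this cleaning step follows, assuming a missing value is represented as `None` (the real encoding in the dataset is not stated). The sample rows and helper names are hypothetical.

```python
# Fields that must be present for a record to be kept (names from the dataset).
REQUIRED_FIELDS = ("Name", "Addr", "Taste", "Environment", "Service", "Comments", "PCC")


def is_complete(record):
    """True if every required field is present and not None (assumed missing marker)."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


def drop_incomplete(records):
    """Delete records with any missing required field instead of backfilling 0."""
    return [record for record in records if is_complete(record)]


# Hypothetical sample rows, not taken from the real data.
sample = [
    {"Name": "A", "Addr": "1 Example Road", "Taste": 7.5, "Environment": 7.2,
     "Service": 7.4, "Comments": 120, "PCC": 67},
    {"Name": "B", "Addr": "2 Example Road", "Taste": None, "Environment": None,
     "Service": None, "Comments": None, "PCC": None},   # a "three noes" restaurant
    {"Name": None, "Addr": "3 Example Road", "Taste": 8.0, "Environment": 8.1,
     "Service": 7.9, "Comments": 15, "PCC": 40},        # no name
]
cleaned = drop_incomplete(sample)
print(len(sample), "->", len(cleaned))
```

Deleting rather than filling with 0 keeps later averages honest: a zero-filled score would drag every mean down.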
Well, this data is clean ~~~~
3. Feature construction
To make the score analysis easier, I constructed a new feature, "Overall", which is the average of the 'Taste', 'Environment' and 'Service' scores, rounded to one decimal place.
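The "Overall" feature can be sketched in a couple of lines. The function name is illustrative; the example scores are arbitrary.

```python
def overall_score(taste, environment, service):
    """Mean of the three sub-scores, rounded to one decimal place."""
    return round((taste + environment + service) / 3, 1)


record = {"Taste": 8.0, "Environment": 7.5, "Service": 7.0}
record["Overall"] = overall_score(record["Taste"], record["Environment"], record["Service"])
print(record["Overall"])
```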
Finally, let's get an intuitive feel for the processed data and see what each field contains.
II. Data analysis (EDA)
1. Overall analysis
Number of restaurants
Let's take a look across the country:
It can be seen that the four first-tier cities of Beijing, Shanghai, Guangzhou and Shenzhen have the most restaurants. Beijing, the "Imperial Capital", leads with 12,522 restaurants (how lucky the people of the capital are ~), while Lanzhou has the fewest (uh, the home base of Lanzhou Ramen, the number-one shop in the universe, unexpectedly comes last). Guangzhou and Shenzhen have roughly the same number, worthy of Guangdong, the province that, as the joke goes, eats Fujian people (just kidding ~).
To some extent, this reflects the degree of urban development. After all, people live on food, and large cities with a net inflow of population naturally have more basic amenities such as restaurants. Besides the four first-tier cities, second-tier cities such as Nanjing, Tianjin, Chengdu and Hangzhou also stand out, with restaurant counts above the average, while less developed regions have fewer restaurants, consistent with the analysis above. This is not hard to understand: across the vast northwest, apart from the tourist attractions, you hardly see anyone, so who would open a shop there? Haha, how do I know? Because I just came back from a trip to the northwest. The scenery is really beautiful, strongly recommended ~
Variety of cuisines
Let's take a look at which cities offer the widest variety of cuisines:
First-tier cities remain at the top of the list, as expected. The capital takes first place, but I was surprised that Tianjin came second, trailing by a margin of just one. In my view, the cities with the most diverse cuisines should be those with a large influx of migrants, because such cities gather people from all over the country, and different tastes naturally bring different cuisines to the market. In a city with a relatively long history and no particularly large population flows, local people will have developed their own food culture over time, so the variety should not be especially large. That is why Tianjin's high ranking surprised me (I've recently been watching "The River God" and have become quite interested in old Tianjin's unique culture ~). Still, this also shows that Tianjin, as one of the municipalities directly under the Central Government, is developing better and better and has great potential. The same goes for popular second-tier cities such as Chengdu, Nanjing and Hangzhou.
The variety of cuisines also reflects how inclusive a city is and how strong its local culture is. A wide variety suggests an inclusive city, while a narrow variety suggests that the local food culture is very strong, squeezing out or even assimilating other cuisines. Think of how many hearty northerners have fallen in love with delicate Cantonese morning tea, and how many people who can't handle spice have eaten their way through Sichuan restaurants in snot and tears ~ But this line of reasoning has problems: does it mean Beijing's indigenous culture is not strong enough? Is Changsha the least inclusive? I don't think so. I haven't figured out what causes these discrepancies yet, so I'll leave that for further reflection. I suspect one reason may be that this is not the full dataset.
The overall picture
Let's take a look at the whole picture:
It can be seen that among the 49 cities sampled this time, most restaurants are concentrated in the southeast coastal areas, especially the Yangtze River Delta and the Pearl River Delta. In addition, although the Beijing-Tianjin-Hebei region is not a coastal area, it also gathers a large number of restaurants thanks to the pull of the capital. This reflects the general pattern of China's regional economy: the southeast coast is developed, the northwest is underdeveloped, and the situation has not improved much. The development of the motherland still depends on us ~ Again, the data analyzed here is not complete, so it may reflect only part of the picture, and the real situation may differ ~
2. Shenzhen
Next, let's focus on a single city. Since I live in Shenzhen, I'll look at Shenzhen first ~~ Start by filtering the data down to Shenzhen.
Restaurant stars
Let’s start with the stars
The star distribution is roughly normal and is mainly concentrated around 3.5 stars. There are few 2-star and 5-star stores (1-star stores were removed during data cleaning). The number of 3.5-star stores is roughly equal to all the other star levels combined, and the count drops sharply from 3.5 to 4 stars, indicating that 4 stars is a major bottleneck that most stores find hard to break through. Most stores reach an average level (3.5 stars) and stay there; it's hard to do better. The jump in count from 3 to 3.5 stars also suggests that 3 stars is a relatively easy level to reach. So, those 2-star and 3-star stores ~ go reflect on yourselves, hahaha
Per capita consumption
Now let's look at per capita consumption
Per capita consumption at Shenzhen restaurants ranges from a low of 4 yuan to a high of 1,806 yuan, with an average of 67 yuan. Well, an average of 67 yuan seems about right to me. Almost every time I eat out with friends, we spend 70 or 80 to 100 yuan per person, slightly above the average. Alas, my income doesn't beat the average, but my food spending does… No wonder my Engel coefficient has skyrocketed.
Distribution
Now look at the distribution, dividing per capita consumption into several ranges
A very obvious right-skewed distribution. Per capita consumption is mostly below 200 yuan and generally concentrated below 100 yuan (seeing this, I feel much better). However, there are still a few places where per capita spending exceeds 1,000 yuan. To the people eating there, I just want to say: rich friends, please let me cling to your coattails… Looks like I've got another small goal: a meal that costs 1,000 yuan per person.
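Dividing per capita consumption into ranges could look something like the sketch below. The bin edges are an assumption chosen for illustration (the ranges actually used in the chart are not given), and the sample values are made up apart from the minimum (4), average (67) and maximum (1,806) quoted above.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges in yuan; the original ranges are not stated.
EDGES = [50, 100, 200, 500, 1000]
LABELS = ["<50", "50-99", "100-199", "200-499", "500-999", ">=1000"]


def pcc_bucket(pcc):
    """Map a per capita consumption value to its range label."""
    return LABELS[bisect_right(EDGES, pcc)]


# Made-up sample values, except 4, 67 and 1806, which appear in the text.
sample_pcc = [4, 35, 67, 80, 150, 320, 1806]
distribution = Counter(pcc_bucket(value) for value in sample_pcc)
for label in LABELS:
    print(f"{label:>8}: {distribution[label]}")
```

A long tail of a few very large values, as in this toy sample, is what makes the histogram right-skewed.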
This also suggests that most people in Shenzhen are ordinary working-class people with relatively low spending power, while a small number of wealthy people have far more spending power than everyone else. The gap between rich and poor is large and the polarization is serious, in line with the 80-20 principle.
Of course, Shenzhen is a young city, a city for chasing dreams. Although the gap between rich and poor is large, there are always indomitable dream chasers who create legends, break out of their own class in this increasingly stratified society, and make the leap to another class, blah blah blah… (Drink up this bowl of chicken soup for the soul ~)
Finally, out of curiosity, I looked at the most expensive restaurant.
Er, "Old Liu in Xi'an"? 1,806 yuan? I've never heard of this place… So I went to Dianping to have a look. Well, this is what it looks like.
This tells us that sometimes the data can't be trusted. This figure is way off. How did such a value end up in the data?
Popularity
Now let's look at popularity. Here I use the number of comments as the indicator of popularity: a popular restaurant gets reviewed by more people, while few people bother to review a place nobody visits ~ This measure is not very accurate, but there should at least be a positive correlation, so until I think of a better approach (I'm lazy), I'll make do with it ~ (Of course, a bad store may attract lots of comments mocking it, but negative popularity is still popularity.)
The store with the most reviews has 20,094, the store with the fewest has just one, and the average is 294.
As can be seen from Figure 1, only one store has more than 20,000 comments, and almost all the rest have fewer than 10,000. I looked up the store with 20,000 comments: it is a branch of "Happy West Cake Birthday Cake (Buxin Store)".
Such a high number of comments doesn't look organic. It is probably an outlier caused by some interference, possibly promotions run by the owner such as cashback for reviews.
Usually, the more reviews, the larger the store or brand, because only a large store has the reputation to attract more diners. Small stores can't compete on this point. Of course, there are decades-old small shops whose reputation has spread by word of mouth, but such cases are too rare to consider here.
Looking at the distribution, it is right-skewed just like per capita consumption, with most stores concentrated below 100 comments. In other words, the vast majority of stores are small; very few really grow big and become chain brands. It seems success always belongs to a small minority ~
Rating scores
Look at the scores for taste, environment, service and overall
The distributions of taste, environment, service and the previously constructed overall score are plotted together. Each indicator is roughly normally distributed, mostly concentrated between 7.2 and 7.6, with very little mass at the two ends.
Then I found an interesting phenomenon: at 2 stars, the environment score is higher than the other scores, suggesting that people are more tolerant of the environment when eating at low-star restaurants. This may be because they go in with low expectations of the environment. As the star level rises, the indicators become more and more balanced, which suggests that a restaurant aiming for a higher star level must develop in a balanced way and cannot neglect any one aspect. I also noticed that the service score of 5-star restaurants is slightly higher than the other scores, suggesting that the higher the star level, the more important service becomes. Put another way, does this mean that many restaurants these days, especially high-end ones, care more about form and service than about the taste of the food itself? After all, it is hard for high-end restaurants to set themselves apart on taste alone, so at that level value-added soft power is what wins.
Overall level
It seems that Shenzhen's indicators are quite balanced
OK ~ I know you must be curious about which stores score highest on each of these indicators ~~
Restaurants listed above, please contact me to pay the advertising fee ~~ thank you ~
So which cuisine is the most expensive? And which is the most popular?
The most expensive? Take a look: seafood ~~~ And would you have guessed that the most popular cuisine is Jiangsu and Zhejiang cuisine? To be honest, I thought it would be morning tea ~ after all, this is Guangdong, and who doesn't have morning tea here?
Finally, let's look at the relationship between stars and ratings
Checking the relationship between star rating and score shows a positive correlation: the higher the star rating, the higher the overall score. Pretty much what I expected.
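One simple way to check this relationship, sketched here with made-up numbers, is to average the overall score within each star level and see whether the averages rise with the stars. The helper name and sample rows are hypothetical; only the `Star` and `Overall` field names come from the article.

```python
from collections import defaultdict


def mean_overall_by_star(records):
    """Average the Overall score within each Star level, ordered by Star."""
    groups = defaultdict(list)
    for record in records:
        groups[record["Star"]].append(record["Overall"])
    return {star: round(sum(scores) / len(scores), 2)
            for star, scores in sorted(groups.items())}


# Made-up sample rows for illustration only.
sample = [
    {"Star": 30, "Overall": 6.8}, {"Star": 30, "Overall": 7.0},
    {"Star": 35, "Overall": 7.3}, {"Star": 35, "Overall": 7.5},
    {"Star": 40, "Overall": 7.9}, {"Star": 40, "Overall": 8.1},
    {"Star": 50, "Overall": 9.0},
]
means = mean_overall_by_star(sample)
for star, mean in means.items():
    print(star, mean)

# True when the per-star averages never decrease as the star level rises.
is_increasing = list(means.values()) == sorted(means.values())
print(is_increasing)
```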
III. Machine learning
Having done a preliminary exploration of the data, I next wanted to predict the star rating from the other indicators, which is a multi-class classification problem. With limited time and energy, I decided to turn it into a binary classification problem: judging whether a restaurant is a good one. I binarized 'Star' with a threshold of 39: restaurants with 4 stars or more (a Star value of 40 and above) were labeled as good and marked 1, while those below were labeled as needing more effort and marked 0. Then the cuisine was converted from text labels to numeric labels. This should have been followed by dummy-variable (one-hot) encoding, because numeric labels introduce a spurious ordering. But for reasons I don't understand, after dummy-variable encoding there was a problem in the later feature-importance screening, and with limited time I gave up on it. So there is a bit of error here.
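The labeling and encoding steps might be sketched as follows. The threshold of 39 is from the text; the helper names and sample values are illustrative, and the one-hot function shows the dummy-variable alternative that was ultimately not used.

```python
def binarize_star(star, threshold=39):
    """1 for a 'good' restaurant (Star above the threshold, i.e. 40+), else 0."""
    return 1 if star > threshold else 0


def label_encode(values):
    """Map each distinct text label to an integer code (implies a false order)."""
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[value] for value in values], classes


def one_hot(values):
    """Dummy-variable encoding: one 0/1 column per class, no implied order."""
    classes = sorted(set(values))
    return [[1 if value == label else 0 for label in classes] for value in values], classes


stars = [20, 35, 40, 45, 50]
labels = [binarize_star(star) for star in stars]

# Made-up cuisine values for illustration.
cuisines = ["Seafood", "Cantonese", "Sichuan", "Cantonese"]
codes, classes = label_encode(cuisines)
dummies, _ = one_hot(cuisines)
print(labels)
print(classes, codes)
print(dummies[0])
```

With label encoding, a model may read "Sichuan" (code 2) as somehow greater than "Cantonese" (code 0), which is the spurious ordering mentioned above; tree-based models are usually less sensitive to this than linear ones.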
Take a look at the processed data
Next, split the data into features and label
Use Star as the label, then select 'Cuisine', 'Comments', 'PCC', 'Taste', 'Environment' and 'Service' as the features. Why not City? Because every row is Shenzhen, so it carries no information. Why not Name? Simple: I'm no fortune-teller, and I don't judge a restaurant by its name. Addr is left out for the same reason: I don't read feng shui either. Overall is left out because it is just the mean of 'Taste', 'Environment' and 'Service', so it is highly correlated with them and can be omitted ~
Then compute the feature importances and keep the top 80% of features as training features: Comments, Taste, Environment and Service.
Then comes feature scaling. The number of comments and the rating scores differ hugely in magnitude, so scaling prevents a feature with very large values from drowning out one with very small values.
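The article does not say which scaling method was used. As one common option, here is a sketch of standardization (z-scores), which rescales each feature to mean 0 and standard deviation 1. The sample values are made up apart from the comment counts 1, 294 and 20,094 quoted earlier.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale a feature column to mean 0 and population standard deviation 1."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(value - mean) / std for value in column]


# Comment counts span several orders of magnitude; scores sit in 0..10.
comments = [1, 50, 294, 1200, 20094]
taste = [6.5, 7.2, 7.4, 7.6, 9.2]

comments_scaled = standardize(comments)
taste_scaled = standardize(taste)
print([round(value, 2) for value in comments_scaled])
print([round(value, 2) for value in taste_scaled])
```

After scaling, both columns live on a comparable scale, so distance-based or margin-based models such as SVMs no longer see the comment count as dominating.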
Then the model was scored with cross-validation, using 10 folds (cv=10) ~
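To make the idea of 10-fold cross-validation concrete, here is a dependency-free sketch: the data is split into 10 folds, each fold is held out once for testing while the model is fitted on the other nine, and the 10 accuracies are averaged. The threshold "model" and the toy data are stand-ins invented for illustration, not the models used in the article, and the folds are contiguous without shuffling to keep the sketch short.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_accuracy(fit, predict, X, y, k=10):
    """Mean accuracy over k folds; fit and predict are user-supplied callables."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


def fit_threshold(xs, ys):
    """Toy model: midpoint between the mean feature value of each class."""
    mean0 = sum(x for x, label in zip(xs, ys) if label == 0) / ys.count(0)
    mean1 = sum(x for x, label in zip(xs, ys) if label == 1) / ys.count(1)
    return (mean0 + mean1) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Toy data: a single taste-like feature that separates the classes perfectly.
X = [6.0, 9.0] * 10
y = [0, 1] * 10
score = cross_val_accuracy(fit_threshold, predict_threshold, X, y, k=10)
print(score)
```

In scikit-learn, `cross_val_score(estimator, X, y, cv=10)` performs the same procedure for any estimator.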
I tried LinearSVC, SVM, Naive Bayes, random forest and XGBoost models. XGBoost performed best, so the next step is to tune the XGBoost model.
Ah, the helpless and painful tuning phase has finally arrived ~
After tuning the hyperparameters, the score increased from 0.9489 to 0.9497.
So if there's a mystery store with 100 reviews, a taste score of 9.2, an environment score of 9.3 and a service score of 8.2, is it worth going to?
Yes, it's classified as 1, so it seems worth a visit ~ hahaha
IV. Summary and reflections
For this dataset only, the following conclusions can be drawn from the analysis above:
- Polarization. 52.1% of merchants have not received any comments on the platform, which indicates that a large number of merchants get no traffic from the platform and cannot rely on it to attract customers. At the same time, popular merchants attract a great many reviews, which further drives their popularity. This widens the gap between the two.
- First-tier cities are ahead of second-tier and third-tier cities in both the number of restaurants and the variety of cuisines. Their level of development is currently unmatched by other cities, and living in a first-tier city still brings a lot of convenience.
- The key second-tier cities are developing fast. They are closing in on first-tier cities in the number of restaurants and the variety of cuisines, and cities like Tianjin have even surpassed some first-tier cities in variety. If you want to build a career in a second-tier city, give priority to key ones such as Hangzhou, Nanjing, Chengdu and Tianjin.
- Regional concentration. Most restaurants are concentrated in the Yangtze River Delta, the Pearl River Delta and the area around Beijing, with few in western China. This reflects the fact that China's southeast is still developed while the west remains underdeveloped. In addition, the two super city clusters of the Yangtze River Delta and the Pearl River Delta remain highly competitive.
- Star ratings are roughly normally distributed. The step from 3.5 to 4 stars is a threshold that is very hard to cross: 73.8% of stores cannot cross it, but the prospects are good for those that do.
- Per capita consumption is still relatively low. Most people are working class, and 86.4% of per capita consumption is below 100 yuan. But some wealthy diners spend far more than ordinary people, and the gap in wealth remains wide.
- Taste, environment and service scores are all roughly normally distributed. In addition, the higher the star level, the more balanced the indicators; a restaurant that wants a higher star rating must develop in a balanced way.
- It is possible to classify restaurants with simple machine learning.
Because the data in this analysis is incomplete, the conclusions inevitably differ from the real situation. The analysis is also relatively simple, many cases are not considered, the conclusions are somewhat immature, and in some places I may have simply piled up data without analyzing it in depth. This is something I will try to avoid and improve on in the future. The machine learning part is also very crude: there should be a better way to select features, and the definition of the label is too simple. I originally wanted to do multi-class classification, predicting the star rating from the indicators, but for lack of time I had to switch to binary classification. This will need to be refined in future iterations.
Finally, why am I sitting around writing this report? Because I lost my job and am unemployed at home!!!!! Dear experts, please introduce me to a job, please give me a referral ~~~~~ (this paragraph is the key point, take note ~ and now I run away ~)