Xu Lin currently works at the Vipshop Product Technology Center in Shanghai. A self-described “data dog” with a statistics background from Columbia University, he works in data mining and analysis and likes to explore all kinds of data with R and Python.

Personal official account: Shujusenlin (ID: Shujusenlin). He also writes a Zhihu column under the same name.

Preface:

Looking at the domestic film market in recent years, Mahua FunAge seems to have become a box office guarantee. From Goodbye Mr. Loser and Shy Iron Fist to its latest release, The Richest Man in Xihong City (English title: Hello Mr. Billionaire; nicknamed “Tomato” below, because “Xihong City” sounds like the Chinese word for tomato), its films have all been box office hits. In this installment, we use tens of thousands of comments collected from Maoyan to judge whether The Richest Man in Xihong City is worth watching.

Data crawling

For the crawl, we followed the approach that earlier articles used for Maoyan data: call its comment interface, fetch a batch of comments on each request, and remove the duplicates. This eventually gave us tens of thousands of comments. The code is as follows:

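import json
import random
import time

import pandas as pd
import requests

# DataFrame.append was removed in pandas 2.0, so rows are collected in a list
rows = []

for i in range(0, 1000):
    j = random.randint(1, 1000)  # random offset into the comment list
    print(str(i) + ' ' + str(j))

    try:
        time.sleep(2)  # pause between requests
        url = 'http://m.maoyan.com/mmdb/comments/movie/1212592.json?_v_=yes&offset=' + str(j)
        html = requests.get(url=url).content
        data = json.loads(html.decode('utf-8'))['cmts']

        for item in data:
            rows.append({'date': item['time'].split(' ')[0], 'city': item['cityName'], 'score': item['score'], 'comment': item['content'], 'nick': item['nick']})

        # Rewrite the file after every request so partial results are kept
        tomato = pd.DataFrame(rows, columns=['date', 'score', 'city', 'comment', 'nick'])
        tomato.to_csv('The richest man in Xihong city', index=False)

    except Exception:
        # Skip offsets that fail (network error, missing field) and carry on
        continue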

Data analysis

Let’s take a look at the data:

The data includes each user’s nickname, which makes it easy to remove duplicate comments later. The rest of this section focuses on ratings, cities and comment text.
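The deduplication step mentioned above is not shown in the article's code. A minimal sketch of one way to do it with pandas follows; the sample rows are made up, and the choice of `nick`, `date` and `comment` as the key that identifies a duplicate is an assumption, not the author's stated rule.

```python
import pandas as pd

# Toy rows standing in for crawled comments (hypothetical data, not from Maoyan).
raw = pd.DataFrame(
    [
        {"date": "2018-07-28", "score": 5.0, "city": "Harbin", "comment": "So funny", "nick": "user_a"},
        # The same comment fetched a second time because random offsets overlap.
        {"date": "2018-07-28", "score": 5.0, "city": "Harbin", "comment": "So funny", "nick": "user_a"},
        {"date": "2018-07-29", "score": 4.0, "city": "Wuhan", "comment": "Worth watching", "nick": "user_b"},
    ]
)

# Assumption: two rows are the same comment when nick, date and text all match.
deduped = raw.drop_duplicates(subset=["nick", "date", "comment"]).reset_index(drop=True)

print(len(raw), "->", len(deduped))
```

Because the crawl draws random offsets, overlapping batches are expected, so a step like this matters before any counts are computed.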

First, a heat map of where the comments come from:

Next, the number of comments and the average rating in major cities:

The highest average score, 4.77, comes from Harbin, the capital of Shen Teng’s home province (Shen Teng was born in Qiqihar, Heilongjiang Province). It seems that Shen Teng is still well regarded by his fellow Heilongjiang natives. The lowest and second-lowest scores come from Hefei and Zhengzhou. Mahua FunAge might consider stepping up its promotion in central China in the future.

We rank cities from highest to lowest:

Among the 20 cities with the most comments, four of the top seven by score are in northeast China, while Wuhan, Hefei and Zhengzhou, which score relatively low, are all in central China. Audiences in different regions clearly rate the film differently.
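The ranking described here (take the most-commented cities, then order them by average score) can be sketched as below. The data is invented for illustration, and `top_n` is set to 2 only to keep the toy example small; the article uses 20.

```python
import pandas as pd

# Hypothetical per-comment records; cities and scores are made up.
comments = pd.DataFrame(
    {
        "city": ["Harbin", "Harbin", "Wuhan", "Wuhan", "Wuhan", "Hefei"],
        "score": [5.0, 4.5, 4.0, 4.5, 3.5, 4.0],
    }
)

# Mean score and number of comments per city.
city_stats = comments.groupby("city")["score"].agg(["mean", "count"]).reset_index()

# Keep the N most-commented cities, then rank those by mean score.
top_n = 2  # the article uses 20
most_commented = city_stats.sort_values("count", ascending=False).head(top_n)
ranked = most_commented.sort_values("mean", ascending=False).reset_index(drop=True)

print(ranked)
```

Filtering by comment count first keeps cities with only a handful of ratings from dominating the score ranking.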

We then projected the city scores onto a map (red means high, blue means low):

Going further, we split the cities into higher-scoring and lower-scoring groups.

Higher-scoring areas:

Lower-scoring areas:

Audiences in the north and the south rate “Tomato” somewhat differently, which seems consistent with how Spring Festival Gala ratings vary by region. Shen Teng is himself a regular at the Spring Festival Gala and naturally brings some “Spring Festival Gala flavor” to the film, which seems to explain our results to some extent.

After the ratings, let’s look at the word cloud generated from the comments. Below are the original image and the word cloud drawn in its shape:

I don’t know about you, but seeing a cloud full of words like “funny”, “worthwhile”, “happy”, “good” and even “haha” makes me want to watch the movie. Shen Teng’s name also comes up again and again, which suggests that he gives a very good performance in the film and that his presence is part of what draws people to see it.
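A word cloud only shows relative size, so it can help to list the most frequent words directly. The sketch below uses a hypothetical, already-segmented token list; in the article the tokens come from jieba, which is left out here to keep the example dependency-free.

```python
from collections import Counter

# Hypothetical tokens standing in for segmented comment text.
tokens = ["funny", "Shen Teng", "funny", "haha", "worthwhile", "Shen Teng", "funny", "a"]

# Drop single-character tokens, as the article's code does, then count.
tokens = [t for t in tokens if len(t) > 1]
top_words = Counter(tokens).most_common(3)

print(top_words)
```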

Selected code

Heat map:

import pandas as pd
from pyecharts import Geo  # pyecharts 0.5.x API

tomato_com = pd.read_excel('The richest man in Xihong.xlsx')
grouped = tomato_com.groupby(['city'])
grouped_pct = grouped['score']  # the score column of each city group
city_com = grouped_pct.agg(['mean', 'count'])  # average score and comment count per city
city_com.reset_index(inplace=True)
city_com['mean'] = round(city_com['mean'], 2)

# (city, comment count) pairs for the heat map
data = [(city_com['city'][i], city_com['count'][i]) for i in range(0, city_com.shape[0])]

geo = Geo('"The Richest Man in Xihong City" National Heat Map', title_color="#fff", title_pos="center", width=1200, height=600, background_color='#404a59')

attr, value = geo.cast(data)

geo.add("", attr, value, type="heatmap", visual_range=[0, 200], visual_text_color="#fff", symbol_size=10, is_visualmap=True, is_roam=False)

geo.render('National Heat Map of The richest man in Xihong city.html')

Line chart + bar chart combination:

from pyecharts import Bar, Line, Overlap  # pyecharts 0.5.x API

# The 20 cities with the most comments
city_main = city_com.sort_values('count', ascending=False)[0:20]

attr = city_main['city']

v1 = city_main['count']

v2 = city_main['mean']

line = Line("Major City Rating")

line.add("City", attr, v2, is_stack=True, xaxis_rotate=30, yaxis_min=4.2, mark_point=['min', 'max'], xaxis_interval=0, line_color='lightblue', line_width=4, mark_point_textcolor='black', mark_point_color='lightblue', is_splitline_show=False)

bar = Bar("Number of comments in major Cities")

bar.add("City", attr, v1, is_stack=True, xaxis_rotate=30, yaxis_min=4.2, xaxis_interval=0, is_splitline_show=False)

# Overlay the score line (second y-axis) on the comment-count bars
overlap = Overlap()
overlap.add(bar)

overlap.add(line, yaxis_index=1, is_add_yaxis=True)

overlap.render('Number of comments in major cities _ average score.html')

The word cloud:

from collections import Counter

import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

# tomato_com is the DataFrame loaded in the heat map code
tomato_str = ' '.join(tomato_com['comment'].astype(str))

words_list = []

# Segment the comment text with jieba's search-engine mode
word_generator = jieba.cut_for_search(tomato_str)

for word in word_generator:
    words_list.append(word)

words_list = [k for k in words_list if len(k) > 1]  # drop single-character tokens

# Load the mask picture as an array (scipy.misc.imread is no longer available)
back_color = np.array(Image.open('Tomato.jpg'))

wc = WordCloud(background_color='white',  # Background color
               max_words=200,  # Maximum number of words
               mask=back_color,  # If a mask is given, width and height are ignored
               max_font_size=300,  # Maximum font size
               font_path="C:/Windows/Fonts/STFANGSO.ttf",  # A font that can render Chinese
               random_state=42,  # Fixed seed so the layout is reproducible
               # width=1000,  # width of image
               # height=860
               )

tomato_count = Counter(words_list)

wc.generate_from_frequencies(tomato_count)

# Generate the word colors from the colors of the picture

image_colors = ImageColorGenerator(back_color)

# Draw the word cloud

plt.figure()

plt.imshow(wc.recolor(color_func=image_colors))

plt.axis('off')

plt.show()

Box office forecast

Finally, let’s make a bold estimate of the box office of The Richest Man in Xihong City. In our daily work, we pick a benchmark when we need to estimate something that has not happened yet. The benchmark we chose here is Shy Iron Fist:

We chose Shy Iron Fist as our benchmark for the following reasons:

  • Both are Mahua FunAge productions with similar themes
  • The casts are closely matched
  • Douban users rate them similarly (both score 6.9, around the median for comedies)
  • Maoyan users rate them similarly (Iron Fist 9.1, Tomato 9.3)

Let’s take a look at the box office of both films over their first three days:

The first three days were fairly similar for the two films, so, based on past experience, we can make a tentative (and fairly casual) forecast of Tomato’s final box office. With figures in hundreds of millions of yuan:

Predicted Tomato box office ≈ Iron Fist total box office / Iron Fist first-three-day box office × Tomato first-three-day box office = 22.13 / 5.25 × 8.62 ≈ 36

Since Iron Fist was released during the National Day holiday, the estimate for Tomato needs to be lowered accordingly.
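The scaling above is simple enough to check in a few lines. This is only a restatement of the article's arithmetic with the three figures quoted in the text; the function name `scale_by_benchmark` is made up for illustration.

```python
def scale_by_benchmark(benchmark_total, benchmark_first3, target_first3):
    """Scale the target's opening by the benchmark's total-to-opening ratio."""
    return benchmark_total / benchmark_first3 * target_first3


# Figures quoted in the text: Iron Fist total, Iron Fist first three days,
# Tomato first three days.
estimate = scale_by_benchmark(22.13, 5.25, 8.62)

print(round(estimate, 2))
```

The method assumes both films keep the same ratio of total to opening box office, which is why the holiday release of the benchmark calls for a downward adjustment.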

All told, we’re looking at about 3 billion yuan. Let’s wait and see whether this turns out to be a Paul the Octopus prediction or a Pelé-style jinx that gets us slapped in the face. If you agree with our official account’s rigorous (that is, casual) forecast, feel free to leave a message in the comment area.

The data set can be obtained at github.com/shujusenlin…