This article is an entry in the 2021 year-end summary essay contest.
Preface
This post uses Python to visualize data on Douyin (TikTok) big V accounts. Without further ado, let's get started.
Development tools
Python version: 3.6.4
Related modules:
pyecharts module;
matplotlib module;
wordcloud module;
PIL module;
pandas module;
numpy module;
jieba module;
and some modules that come with Python.
Environment setup
Install Python and add it to your PATH environment variable, then use pip to install the required modules.
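Before running the scripts, it can help to confirm that the third-party modules listed above are importable. The following is a minimal sketch using only the standard library; the helper name `find_missing` is made up for illustration, and note that the `PIL` import name is provided by the Pillow package.

```python
import importlib.util

# Import names of the third-party modules used in this article.
REQUIRED = ["pyecharts", "matplotlib", "wordcloud", "PIL", "pandas", "numpy", "jieba"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

Anything reported as missing can then be installed with pip.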
In this installment, we analyze the data to find out which kinds of videos are most popular on Douyin.
Getting the data
The data comes from a third party and covers more than 5,000 Douyin big V accounts in total.
Each record mainly contains the account's nickname, gender, location, category, number of likes, number of fans, number of videos, number of comments, number of shares, number of accounts followed, school of graduation, verification status, profile, and other information.
Data visualization
Code implementation
from pyecharts.charts import Pie, Bar, TreeMap, Map, Geo
from wordcloud import WordCloud, ImageColorGenerator
from pyecharts import options as opts
import matplotlib.pyplot as plt
from PIL import Image
import pandas as pd
import numpy as np
import jieba
df = pd.read_csv('douyin.csv', header=0, encoding='utf-8-sig')
print(df)
Gender distribution
As the chart shows, the proportions of men and women differ only slightly.
Visualization code ①
def create_gender(df):
    df = df.copy()
    # Cast to str so the comparisons below also work if pandas parsed the codes as integers
    df['gender'] = df['gender'].astype(str)
    # Replace the numeric codes with readable labels
    df.loc[df.gender == '0', 'gender'] = 'unknown'
    df.loc[df.gender == '1', 'gender'] = 'male'
    df.loc[df.gender == '2', 'gender'] = 'female'
    # Group by gender
    gender_message = df.groupby(['gender'])
    # Count the grouped results
    gender_com = gender_message['gender'].agg(['count'])
    gender_com.reset_index(inplace=True)
    # Pie chart data, converted to native Python types because pyecharts serializes to JSON
    attr = gender_com['gender'].tolist()
    v1 = gender_com['count'].tolist()
    # Initial configuration
    pie = Pie(init_opts=opts.InitOpts(width="800px", height="400px"))
    # Add data and set the inner and outer radius
    pie.add("", [list(z) for z in zip(attr, v1)], radius=["40%", "75%"])
    # Global options: title, legend, toolbox
    pie.set_global_opts(
        title_opts=opts.TitleOpts(title="Gender distribution of Douyin big V", pos_left="center", pos_top="top"),
        legend_opts=opts.LegendOpts(orient="vertical", pos_left="left"),
        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
    )
    # Series options: label style (name and percentage)
    pie.set_series_opts(label_opts=opts.LabelOpts(is_show=True, formatter="{b}:{d}%"))
    pie.render("Gender distribution of Douyin big V.html")
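The percentages that the pie chart displays can also be checked directly in pandas. The sketch below uses toy data and assumes, as the code above suggests, that gender is stored as the codes 0, 1, and 2; the label dictionary is only illustrative.

```python
import pandas as pd

# Toy data standing in for the real gender column.
df = pd.DataFrame({"gender": [1, 2, 2, 0, 1, 2, 1, 1]})

# Map codes to labels in one step. Casting to str first makes the mapping
# work whether pandas parsed the column as integers or as strings.
labels = {"0": "unknown", "1": "male", "2": "female"}
gender = df["gender"].astype(str).map(labels)

# Share of each label in percent, rounded to one decimal place.
share = (gender.value_counts(normalize=True) * 100).round(1)
print(share)
```

Using `map` with a dictionary replaces the three separate `df.loc` assignments, and `value_counts(normalize=True)` returns proportions instead of raw counts.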
Likes
Apart from “xiaotuantuan” and “poison tongue,” every account in the likes TOP10 is a big V in the news media category.
I still remember that “Sichuan Watch” was jokingly called “everywhere watch” in the comment section, meaning that it publishes news very quickly.
More than 500 big Vs have over 100 million likes, and the largest group of big Vs has between 10 million and 50 million likes.
Visualization code ②
def create_likes(df):
    # Sort by likes, descending
    df = df.sort_values('likes', ascending=False)
    # Take the TOP10 rows; likes are converted to units of 100 million
    attr = df['name'][0:10]
    v1 = [float('%.1f' % (float(i) / 100000000)) for i in df['likes'][0:10]]
    # Initial configuration
    bar = Bar(init_opts=opts.InitOpts(width="800px", height="400px"))
    # X-axis data (reversed so the largest bar ends up on top once the axes are flipped)
    bar.add_xaxis(list(reversed(attr.tolist())))
    # Y-axis data
    bar.add_yaxis("", list(reversed(v1)))
    # Global options: title, toolbox (save as image), split lines
    bar.set_global_opts(
        title_opts=opts.TitleOpts(title="TOP10 likes for tiktok big V", pos_left="center", pos_top="18"),
        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
        xaxis_opts=opts.AxisOpts(splitline_opts=opts.SplitLineOpts(is_show=True)),
    )
    # Series options: label style
    bar.set_series_opts(label_opts=opts.LabelOpts(is_show=True, position="right", color="black"))
    # Swap the x and y axes to get a horizontal bar chart
    bar.reversal_axis()
    bar.render("TOP10 likes for tiktok big V.html")

def create_cut_likes(df):
    # Segment the data; the labels are in units of ten thousand likes
    Bins = [0, 1000000, 5000000, 10000000, 25000000, 50000000, 100000000, 5000000000]
    Labels = ['0-100', '100-500', '500-1000', '1000-2500', '2500-5000', '5000-10000', '10000+']
    len_stage = pd.cut(df['likes'], bins=Bins, labels=Labels).value_counts().sort_index()
    # Fetch the bin labels and counts
    attr = len_stage.index.tolist()
    v1 = len_stage.values.tolist()
    # Generate a bar chart
    bar = Bar(init_opts=opts.InitOpts(width="800px", height="400px"))
    bar.add_xaxis(attr)
    bar.add_yaxis("", v1)
    bar.set_global_opts(
        title_opts=opts.TitleOpts(title="Distribution of Big V likes of Douyin (ten thousand)", pos_left="center", pos_top="18"),
        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
        yaxis_opts=opts.AxisOpts(splitline_opts=opts.SplitLineOpts(is_show=True)),
    )
    bar.set_series_opts(label_opts=opts.LabelOpts(is_show=True, position="top", color="black"))
    bar.render("Tiktok big V Likes distribution (ten thousand).html")
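One detail of the binning step is worth seeing on a small example: `pd.cut` uses right-closed intervals by default, so a value exactly equal to the lowest edge (0 likes) falls outside the first bin. The sketch below uses the same bin edges with toy like counts; the last label is written as "10000+" here for illustration.

```python
import pandas as pd

bins = [0, 1_000_000, 5_000_000, 10_000_000, 25_000_000,
        50_000_000, 100_000_000, 5_000_000_000]
labels = ["0-100", "100-500", "500-1000", "1000-2500",
          "2500-5000", "5000-10000", "10000+"]

# Toy like counts, including the edge cases 0 and exactly 1,000,000.
likes = pd.Series([0, 1_000_000, 3_000_000, 200_000_000])

# Default behavior: the first interval is (0, 1000000], so 0 is not binned.
default_cut = pd.cut(likes, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(likes, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

If the dataset contains accounts with zero likes, passing `include_lowest=True` keeps them in the first bar instead of silently dropping them from the count.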
Fans
“People’s Daily” and “CCTV News” both have more than 100 million fans.
Livestream selling was hot this year, so it is no surprise that Li Jiaqi ranked in the top ten; after all, he is the top livestream seller.
The distribution of fans of big Vs
56 accounts have more than 50 million fans, and all of them are heavyweights.
Comments TOP10
In general, media videos attract more comments.
Shares TOP10
Judging from the chart, people still like to share news and food videos.
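The fans, comments, and shares rankings are shown here without code. A minimal sketch of how such a TOP10 list could be extracted is given below; the helper `top_n` and the column names `fans` and `shares` are assumptions for illustration, since the article does not show the actual column names for these fields.

```python
import pandas as pd


def top_n(df, column, n=10, name_col="name"):
    """Return the n rows with the largest values in `column` as (name, value) pairs."""
    top = df.nlargest(n, column)
    return list(zip(top[name_col], top[column]))


# Toy data; "fans" and "shares" are assumed column names.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "fans": [300, 100, 400, 200],
    "shares": [5, 9, 1, 7],
})

print(top_n(df, "fans", n=2))
print(top_n(df, "shares", n=2))
```

The resulting pairs can be split into names and values and passed to a bar chart in the same way as the likes TOP10.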
Total likes and fans by category
Visualization code ③
def create_type_likes(df):
    # Group by category and sum the likes
    likes_type_message = df.groupby(['category'])
    likes_type_com = likes_type_message['likes'].agg(['sum'])
    likes_type_com.reset_index(inplace=True)
    # Build the list of {"name": ..., "value": ...} dicts that TreeMap expects,
    # using native Python types because pyecharts serializes to JSON
    dom = []
    for name, num in zip(likes_type_com['category'].tolist(), likes_type_com['sum'].tolist()):
        data = {}
        data['name'] = name
        data['value'] = num
        dom.append(data)
    print(dom)
    # Initial configuration
    treemap = TreeMap(init_opts=opts.InitOpts(width="800px", height="400px"))
    # Add data
    treemap.add('', dom)
    # Global options: title, toolbox, hidden legend
    treemap.set_global_opts(
        title_opts=opts.TitleOpts(title="Total Number of Big V likes for All Types of Douyin", pos_left="center", pos_top="5"),
        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
        legend_opts=opts.LegendOpts(is_show=False),
    )
    treemap.render("All types of Douyin big V likes summary chart.html")
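The loop that builds the list of name/value dictionaries can also be expressed with pandas alone. This is a sketch on toy data, not the author's method; the category names are invented.

```python
import pandas as pd

# Toy data standing in for the real category and likes columns.
df = pd.DataFrame({
    "category": ["news", "food", "news", "music"],
    "likes": [100, 30, 50, 20],
})

# Sum likes per category, keeping "category" as a regular column.
totals = df.groupby("category", as_index=False)["likes"].sum()

# Rename the columns to the keys the treemap expects and emit one dict per row.
data = (
    totals.rename(columns={"category": "name", "likes": "value"})
    .to_dict("records")
)
print(data)
```

The resulting list has the same shape as `dom` in the function above.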
TOP10 in average likes/fans of videos
As the top traffic star of 2019, Li Xian ranked first.
Visualization code ④
def create_avg_likes(df):
    # Keep only accounts with at least one video (avoids division by zero);
    # copy so that the new column is not written to a slice of the original
    df = df[df['videos'] > 0].copy()
    # Average likes per video, in units of ten thousand
    df.eval('result = likes/(videos*10000)', inplace=True)
    df['result'] = df['result'].round(decimals=1)
    df = df.sort_values('result', ascending=False)
    # Take the top 10
    attr = df['name'][0:10]
    v1 = ['%.1f' % (float(i)) for i in df['result'][0:10]]
    # Initial configuration
    bar = Bar(init_opts=opts.InitOpts(width="800px", height="400px"))
    # Add data
    bar.add_xaxis(list(reversed(attr.tolist())))
    bar.add_yaxis("", list(reversed(v1)))
    # Global options: title, toolbox (save as image), split lines
    bar.set_global_opts(
        title_opts=opts.TitleOpts(title="TOP10 average likes of douyin big V videos (ten thousand)", pos_left="center", pos_top="18"),
        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
        xaxis_opts=opts.AxisOpts(splitline_opts=opts.SplitLineOpts(is_show=True)),
    )
    # Series options: label style
    bar.set_series_opts(label_opts=opts.LabelOpts(is_show=True, position="right", color="black"))
    # Swap the x and y axes
    bar.reversal_axis()
    bar.render("Douyin big V average video likes TOP10(ten thousand).html")
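The average-likes calculation can be tried on a few rows to see the effect of the `videos > 0` filter and the unit conversion. The sketch below uses toy numbers and plain column arithmetic in place of `df.eval`; both give the same result column.

```python
import pandas as pd

# Toy data: account "b" has no videos and must be excluded.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "likes": [2_000_000, 900_000, 500_000],
    "videos": [10, 0, 2],
})

# Drop accounts with no videos to avoid dividing by zero, and copy so the
# new column is added to an independent frame.
active = df[df["videos"] > 0].copy()

# Average likes per video, expressed in units of ten thousand.
active["result"] = (active["likes"] / (active["videos"] * 10_000)).round(1)
active = active.sort_values("result", ascending=False)
print(active[["name", "result"]])
```

Account "c" ranks first here even though it has fewer total likes, which is the kind of reordering this metric produces compared with the raw likes ranking.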
Province distribution of Douyin big Vs
Guangdong, Zhejiang and Sichuan rank in the top three.
Visualization code ⑤
def create_province_map(df):
    # Keep only accounts located in China (the dataset is assumed to store place names in Chinese)
    df = df[df["country"] == "中国"]
    df1 = df.copy()
    # Strip administrative suffixes so the names match the short province names used by the map
    df1["province"] = df1["province"].str.replace("省", "").str.replace("壮族自治区", "").str.replace("维吾尔自治区", "").str.replace("自治区", "")
    # Count accounts per province
    df_num = df1.groupby("province")["province"].agg(count="count")
    df_province = df_num.index.values.tolist()
    df_count = df_num["count"].values.tolist()
    # Initial configuration (named province_map to avoid shadowing the built-in map)
    province_map = Map(init_opts=opts.InitOpts(width="800px", height="400px"))
    # Map of China
    province_map.add("", [list(z) for z in zip(df_province, df_count)], "china")
    # Global options: title, toolbox (save as image), visual map
    province_map.set_global_opts(
        title_opts=opts.TitleOpts(title="Distribution of Douyin big V Provinces", pos_left="center", pos_top="0"),
        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
        # Value range 0 to 600; is_piecewise=False keeps the color scale continuous
        visualmap_opts=opts.VisualMapOpts(max_=600, is_piecewise=False),
    )
    province_map.render("Tik Tok V province distribution.html")
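The province clean-up step can be tested in isolation. The sketch below assumes the dataset stores full Chinese province names; the helper `short_province` is hypothetical, and the suffixes "回族自治区" and "市" are additions for illustration that do not appear in the replacements above.

```python
import re

# Administrative suffixes to strip from the end of a province name.
_SUFFIXES = ["壮族自治区", "回族自治区", "维吾尔自治区", "自治区", "省", "市"]

# Anchoring at the end of the string means only a trailing suffix is removed.
_SUFFIX_RE = re.compile("(" + "|".join(_SUFFIXES) + ")$")


def short_province(name):
    """Strip the administrative suffix from a full province name."""
    return _SUFFIX_RE.sub("", name.strip())


for full in ["广东省", "广西壮族自治区", "新疆维吾尔自治区", "内蒙古自治区", "宁夏回族自治区"]:
    print(full, "->", short_province(full))
```

With chained `str.replace` calls for only the Zhuang, Uygur, and generic suffixes, a name such as "宁夏回族自治区" would be shortened to "宁夏回族" and might not match the map, which is why the extra suffix is handled here. With pandas, the helper could be applied through `df1["province"].map(short_province)`.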
Top 10 cities
Beijing is far ahead of the pack as the hub of big Vs.
Visualization code ⑥
def create_city(df):
    # Keep only accounts located in China (the dataset is assumed to store place names in Chinese)
    df1 = df[df["country"] == "中国"]
    df1 = df1.copy()
    # Strip the "市" (city) suffix from city names
    df1["city"] = df1["city"].str.replace("市", "")
    # Count accounts per city and sort in descending order
    df_num = df1.groupby("city")["city"].agg(count="count").reset_index().sort_values(by="count", ascending=False)
    df_city = df_num[:10]["city"].values.tolist()
    df_count = df_num[:10]["count"].values.tolist()
    bar = Bar(init_opts=opts.InitOpts(width="800px", height="400px"))
    bar.add_xaxis(df_city)
    bar.add_yaxis("", df_count)
    bar.set_global_opts(
        title_opts=opts.TitleOpts(title="Douyin big V city distribution TOP10", pos_left="center", pos_top="18"),
        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
        yaxis_opts=opts.AxisOpts(splitline_opts=opts.SplitLineOpts(is_show=True)),
    )
    bar.set_series_opts(label_opts=opts.LabelOpts(is_show=True, position="top", color="black"))
    bar.render("Douyin BIG V City Distribution Top10.html")
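The group, count, and sort chain used for the city ranking has a shorter equivalent in `value_counts()`, which counts and sorts in descending order in one call. The comparison below is a sketch on toy data with invented city values.

```python
import pandas as pd

# Toy data standing in for the real city column.
df = pd.DataFrame({"city": ["Beijing", "Shanghai", "Beijing", "Hangzhou", "Beijing", "Shanghai"]})

# groupby + count + sort, the same chain as in the function above.
grouped = (
    df.groupby("city")["city"].agg(count="count")
    .reset_index()
    .sort_values(by="count", ascending=False)
)

# value_counts() returns the counts already sorted from most to least frequent.
counted = df["city"].value_counts().head(10)

print(grouped)
print(counted)
```

Either form yields the same names and counts; `counted.index.tolist()` and `counted.tolist()` give the two lists the bar chart needs.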
Top 10 abroad
The United States ranks first; many Chinese people living there share aspects of their life in the United States.
TOP10 schools that Douyin big Vs graduated from
Beijing Film Academy, Communication University of China, Communication University of Zhejiang, Central Academy of Drama, Shanghai Theatre Academy, Central Academy of Fine Arts: all of them feed the show business, so no surprises here.
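Code to query the verification status
# Drop accounts whose verification field is empty or unknown
# ("未知" means "unknown"; the dataset values are assumed to be in Chinese)
df1 = df[(df["custom_verify"] != "") & (df["custom_verify"] != "未知")]
df1 = df1.copy()
# Count accounts per verification text and sort in descending order
df_num = df1.groupby("custom_verify")["custom_verify"].agg(count="count").reset_index().sort_values(by="count", ascending=False)
print(df_num[:20])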
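One thing to keep in mind with this filter: `pd.read_csv` parses empty cells as NaN by default, so comparing against an empty string does not remove them. The sketch below shows this on a tiny in-memory CSV; the placeholder value in `UNKNOWN` and the sample rows are made up and should be adjusted to match the real data.

```python
import io
import pandas as pd

UNKNOWN = "Unknown"  # placeholder used for unverified accounts in this toy data

csv_text = "name,custom_verify\na,News outlet\nb,\nc,Unknown\nd,News outlet\n"
df = pd.read_csv(io.StringIO(csv_text))

# The empty cell of account "b" was parsed as NaN, not as "".
print(df["custom_verify"].isna().sum())

# notna() removes missing values explicitly; the second test removes the placeholder.
verified = df[df["custom_verify"].notna() & (df["custom_verify"] != UNKNOWN)]
counts = verified["custom_verify"].value_counts()
print(counts)
```

In the groupby version NaN keys are dropped from the result by default, so the counts come out the same, but filtering with `notna()` makes the intent explicit.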
The results
Word cloud of Douyin big V profiles
Most big Vs have left a business cooperation contact in their profile, which is good for content creators and makes for a win-win situation.
Visualization code ⑦
def create_wordcloud(df, picture):
    # Load the stopword list, one word per line
    words = pd.read_csv('chineseStopWords.txt', encoding='gbk', sep='\t', names=['stopword'])
    # Tokenize every non-empty signature with jieba
    text = ''
    df1 = df[df["signature"] != ""]
    df1 = df1.copy()
    for line in df1['signature'].dropna():
        # The trailing space keeps tokens of consecutive signatures from running together
        text += ' '.join(jieba.cut(str(line).replace(" ", ""), cut_all=False)) + ' '
    # Stopwords
    stopwords = set(words['stopword'])
    # Load the picture once as a uint8 array; it serves as both the mask and the color source
    mask_image = np.array(Image.open('douyin.png'))
    # Use the colors of the Douyin picture
    image_colors = ImageColorGenerator(mask_image)
    wc = WordCloud(
        background_color='white',
        mask=mask_image,
        # Path to a Chinese font file (FangZheng LanTing Hei); adjust to a font available locally
        font_path='方正兰亭黑.TTF',
        max_words=2000,
        max_font_size=70,
        min_font_size=1,
        prefer_horizontal=1,
        color_func=image_colors,
        random_state=50,
        stopwords=stopwords,
        margin=5
    )
    wc.generate_from_text(text)
    # Look at the word frequencies
    process_word = wc.process_text(text)
    word_freq = sorted(process_word.items(), key=lambda e: e[1], reverse=True)
    print(word_freq[:50])
    plt.imshow(wc)
    plt.axis('off')
    wc.to_file(picture)
    print('Generated word cloud successfully!')
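The word-frequency step can be illustrated without jieba or wordcloud. The sketch below counts pre-tokenized words with the standard library and skips stopwords; the helper `top_words` and the sample tokens are invented for illustration, and in the article the tokens would come from `jieba.cut()`.

```python
from collections import Counter


def top_words(tokens, stopwords, n=50):
    """Count tokens, skipping stopwords and blank strings, and return the n most common."""
    counts = Counter(t for t in tokens if t.strip() and t not in stopwords)
    return counts.most_common(n)


# Pre-tokenized toy input standing in for the segmented signatures.
tokens = ["business", "cooperation", "the", "business", "wechat",
          " ", "the", "cooperation", "business"]
stopwords = {"the"}

print(top_words(tokens, stopwords, n=2))
```

This mirrors what the sorted `process_text` output shows: the most frequent remaining words are the ones drawn largest in the cloud.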