
Preface

This article uses Python to visualize data on Douyin big V accounts. Without further ado ~

Let’s get started.

Development tools

Python version: 3.6.4

Related modules:

pyecharts module;

matplotlib module;

wordcloud module;

PIL (Pillow) module;

pandas module;

numpy module;

jieba module;

and a few modules that ship with Python.

Environment setup

Install Python and add it to the PATH environment variable, then use pip to install the required modules.
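
Before running the analysis, it can help to confirm that every module is actually importable. The sketch below uses only the standard library; the helper name `find_missing` is made up for illustration, and the list assumes the usual import names (Pillow is installed as `Pillow` but imported as `PIL`).

```python
import importlib.util

# Import names of the third-party modules this article uses.
REQUIRED = ["pyecharts", "matplotlib", "wordcloud", "PIL", "pandas", "numpy", "jieba"]


def find_missing(modules):
    """Return the names in `modules` that the import system cannot find."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing modules, install them with pip:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Any name the script reports can then be installed with pip before continuing.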

In this installment we analyze the data to find out what kinds of videos are most popular on Douyin.

Getting the data

The data comes from a third party and covers more than 5,000 Douyin big V accounts in total.

Each record mainly contains the account's nickname, gender, location, category, number of likes, number of fans, number of videos, number of comments, number of shares, number of accounts followed, school of graduation, verification status, profile text, and other information.
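
Since the plotting functions in this article look columns up by name, a quick header check can catch a mismatched file early. This is a minimal sketch: the column set is collected from the code in this article, the helper `missing_columns` is hypothetical, and the real file may well contain further columns (fans, comments, shares) whose names are not shown here.

```python
import csv
import io

# Column names that the functions in this article read from the DataFrame.
EXPECTED_COLUMNS = {"name", "gender", "category", "likes", "videos",
                    "country", "province", "city", "custom_verify", "signature"}


def missing_columns(csv_file):
    """Read only the header row and return the expected columns that are absent, sorted."""
    header = next(csv.reader(csv_file), [])
    return sorted(EXPECTED_COLUMNS - set(header))


# Small in-memory stand-in for douyin.csv (made-up header, no real data).
sample = io.StringIO("name,gender,likes,videos\n")
print(missing_columns(sample))
```

For the real file, the same function can be called on `open('douyin.csv', encoding='utf-8-sig', newline='')`.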

Data visualization

Code implementation

from pyecharts.charts import Pie, Bar, TreeMap, Map, Geo
from wordcloud import WordCloud, ImageColorGenerator
from pyecharts import options as opts
import matplotlib.pyplot as plt
from PIL import Image
import pandas as pd
import numpy as np
import jieba

df = pd.read_csv('douyin.csv', header=0, encoding='utf-8-sig')
print(df)

Gender distribution

As the chart shows, the proportions of men and women differ only slightly.

Visualization code ①

def create_gender(df):
    df = df.copy()
    # Replace the numeric gender codes with readable labels
    df.loc[df.gender == '0', 'gender'] = 'unknown'
    df.loc[df.gender == '1', 'gender'] = 'male'
    df.loc[df.gender == '2', 'gender'] = 'female'
    # Group by gender
    gender_message = df.groupby(['gender'])
    # Count the rows in each group
    gender_com = gender_message['gender'].agg(['count'])
    gender_com.reset_index(inplace=True)

    # Pie chart data
    attr = gender_com['gender']
    v1 = gender_com['count']

    # Initial configuration
    pie = Pie(init_opts=opts.InitOpts(width="800px", height="400px"))
    # Add the data and set the inner and outer radius
    pie.add("", [list(z) for z in zip(attr, v1)], radius=["40%", "75%"])
    # Global options: title, legend, toolbox
    pie.set_global_opts(title_opts=opts.TitleOpts(title="Gender distribution of Douyin big V", pos_left="center", pos_top="top"),
                        legend_opts=opts.LegendOpts(orient="vertical", pos_left="left"),
                        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}))
    # Series options: label style
    pie.set_series_opts(label_opts=opts.LabelOpts(is_show=True, formatter="{b}:{d}%"))
    pie.render("Gender distribution of Douyin big V.html")

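
The relabel-then-count step can also be written more compactly with `map` and `value_counts`. This is a sketch on made-up data, not the article's own code; it assumes the gender codes are 0, 1, and 2 and casts them to strings so the lookup works whether pandas parsed the column as integers or text.

```python
import pandas as pd

# Made-up sample; the real column comes from douyin.csv.
sample = pd.DataFrame({"gender": [1, 2, 2, 0, 1, 2]})

labels = {"0": "unknown", "1": "male", "2": "female"}
# Cast to str first so the lookup works for both int and str columns.
counts = sample["gender"].astype(str).map(labels).value_counts()

# A pie chart series is fed a list of [name, value] pairs.
pairs = [[name, int(n)] for name, n in counts.items()]
print(pairs)
```

Converting each count with `int()` yields plain Python integers rather than NumPy scalars, which keeps the pairs easy to serialize.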

Likes

In the top 10 by likes, every account except “xiaotuantuan” and “poison tongue” is a big V in the news media category.

I still remember commenters joking that “Sichuan Watch” should be called “Everywhere Watch”, a nod to how quickly it publishes news.

More than 500 big Vs have over 100 million likes, and the largest group of big Vs has between 10 million and 50 million likes.

Visualization code ②

def create_likes(df):
    # Sort by likes, descending
    df = df.sort_values('likes', ascending=False)
    # Take the top 10; likes are converted to units of 100 million, one decimal place
    attr = df['name'][0:10]
    v1 = [float('%.1f' % (float(i) / 100000000)) for i in df['likes'][0:10]]

    # Initial configuration
    bar = Bar(init_opts=opts.InitOpts(width="800px", height="400px"))
    # X-axis data (reversed so the largest bar ends up on top after flipping)
    bar.add_xaxis(list(reversed(attr.tolist())))
    # Y-axis data
    bar.add_yaxis("", list(reversed(v1)))
    # Global options: title, toolbox (image download), x-axis split lines
    bar.set_global_opts(title_opts=opts.TitleOpts(title="TOP10 likes for Douyin big V", pos_left="center", pos_top="18"),
                        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
                        xaxis_opts=opts.AxisOpts(splitline_opts=opts.SplitLineOpts(is_show=True)))
    # Series options: label style
    bar.set_series_opts(label_opts=opts.LabelOpts(is_show=True, position="right", color="black"))
    # Flip the x and y axes to get a horizontal bar chart
    bar.reversal_axis()
    bar.render("TOP10 likes for Douyin big V.html")


def create_cut_likes(df):
    # Segment the data into ranges (labels are in units of ten thousand likes)
    Bins = [0, 1000000, 5000000, 10000000, 25000000, 50000000, 100000000, 5000000000]
    Labels = ['0-100', '100-500', '500-1000', '1000-2500', '2500-5000', '5000-10000', '10000+']
    len_stage = pd.cut(df['likes'], bins=Bins, labels=Labels).value_counts().sort_index()
    # Fetch the labels and counts
    attr = len_stage.index.tolist()
    v1 = len_stage.values.tolist()

    # Generate a bar chart
    bar = Bar(init_opts=opts.InitOpts(width="800px", height="400px"))
    bar.add_xaxis(attr)
    bar.add_yaxis("", v1)
    bar.set_global_opts(title_opts=opts.TitleOpts(title="Distribution of Douyin big V likes (ten thousand)", pos_left="center", pos_top="18"),
                        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
                        yaxis_opts=opts.AxisOpts(splitline_opts=opts.SplitLineOpts(is_show=True)))
    bar.set_series_opts(label_opts=opts.LabelOpts(is_show=True, position="top", color="black"))
    bar.render("Distribution of Douyin big V likes (ten thousand).html")
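
To see how `pd.cut` assigns values to the ranges, here is a small sketch with made-up like counts and only the first three bins. By default the intervals are right-inclusive, so a value sitting exactly on a boundary falls into the lower range.

```python
import pandas as pd

bins = [0, 1_000_000, 5_000_000, 10_000_000]
labels = ["0-100", "100-500", "500-1000"]  # units of ten thousand likes

# Made-up like counts, including one exactly on a bin edge.
likes = pd.Series([500_000, 1_000_000, 1_000_001, 7_500_000])
# Intervals are (0, 1000000], (1000000, 5000000], (5000000, 10000000].
stage = pd.cut(likes, bins=bins, labels=labels)
counts = stage.value_counts().sort_index()
print(counts.tolist())
```

One thing to keep in mind is that a value of exactly 0 lies outside the first interval and becomes NaN unless `include_lowest=True` is passed.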

Fans

“People’s Daily” and “CCTV News” both have more than 100 million followers.

Live-stream selling took off this year, so it is no surprise that Li Jiaqi ranks in the top ten; he is, after all, the top live-stream salesman.

Distribution of big V fan counts

There are 56 accounts with more than 50 million fans, all of them heavyweights ~

Comment counts

In general, videos from media accounts receive more comments.

Top 10 by shares

The chart shows that people still like to share news and food videos.

Total likes and fans by category

Visualization code ③

def create_type_likes(df):
    # Group by category and sum the likes
    likes_type_message = df.groupby(['category'])
    likes_type_com = likes_type_message['likes'].agg(['sum'])
    likes_type_com.reset_index(inplace=True)
    # Build the list of {name, value} dicts that the treemap expects
    dom = []
    for name, num in zip(likes_type_com['category'], likes_type_com['sum']):
        data = {}
        data['name'] = name
        data['value'] = num
        dom.append(data)
    print(dom)

    # Initial configuration
    treemap = TreeMap(init_opts=opts.InitOpts(width="800px", height="400px"))
    # Add the data
    treemap.add('', dom)
    # Global options: title, toolbox, hidden legend
    treemap.set_global_opts(title_opts=opts.TitleOpts(title="Total likes of Douyin big Vs by category", pos_left="center", pos_top="5"),
                            toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
                            legend_opts=opts.LegendOpts(is_show=False))

    treemap.render("Total likes of Douyin big Vs by category.html")
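
The loop that builds the name/value dictionaries can also be expressed with pandas alone. The following is a sketch on made-up data, offered as an alternative rather than the article's method; it relies on `groupby(..., as_index=False)` keeping the category as a regular column.

```python
import pandas as pd

# Made-up sample with two of the columns used above.
sample = pd.DataFrame({
    "category": ["news", "food", "news", "food", "music"],
    "likes": [10, 5, 20, 25, 7],
})

# Sum likes per category, keeping "category" as a column.
totals = sample.groupby("category", as_index=False)["likes"].sum()
# Rename to the keys a treemap series uses, then emit one dict per row.
dom = (totals.rename(columns={"category": "name", "likes": "value"})
             .to_dict(orient="records"))
print(dom)
```

The groups come out sorted by category name, the same order the loop version produces.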

Top 10 by average likes and fans per video

Li Xian, one of the hottest celebrities of 2019, ranks first.

Visualization code ④

def create_avg_likes(df):
    # Keep only accounts with at least one video; copy so the frame can be modified safely
    df = df[df['videos'] > 0].copy()
    # Average likes per video, in units of ten thousand
    df.eval('result = likes/(videos*10000)', inplace=True)
    df['result'] = df['result'].round(decimals=1)
    df = df.sort_values('result', ascending=False)

    # Take the top 10
    attr = df['name'][0:10]
    v1 = [round(float(i), 1) for i in df['result'][0:10]]

    # Initial configuration
    bar = Bar(init_opts=opts.InitOpts(width="800px", height="400px"))
    # Add the data (reversed so the largest bar ends up on top after flipping)
    bar.add_xaxis(list(reversed(attr.tolist())))
    bar.add_yaxis("", list(reversed(v1)))
    # Global options: title, toolbox (image download), x-axis split lines
    bar.set_global_opts(title_opts=opts.TitleOpts(title="TOP10 average likes of Douyin big V videos (ten thousand)", pos_left="center", pos_top="18"),
                        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
                        xaxis_opts=opts.AxisOpts(splitline_opts=opts.SplitLineOpts(is_show=True)))
    # Series options: label style
    bar.set_series_opts(label_opts=opts.LabelOpts(is_show=True, position="right", color="black"))
    # Flip the x and y axes
    bar.reversal_axis()
    bar.render("Douyin big V average video likes TOP10 (ten thousand).html")
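
The expression `likes/(videos*10000)` is easier to check on a single made-up account. The helper name below is hypothetical and only mirrors the arithmetic: total likes divided by the number of videos, expressed in units of ten thousand.

```python
def avg_likes_per_video(likes, videos):
    """Average likes per video, in units of ten thousand, rounded to one decimal."""
    if videos <= 0:
        raise ValueError("videos must be positive")
    return round(likes / (videos * 10000), 1)


# Made-up account: 250 million likes across 500 videos.
print(avg_likes_per_video(250_000_000, 500))  # 50.0
```

So an account with 250 million likes on 500 videos averages 500,000 likes per video, which the chart would display as 50.0.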

Geographic distribution of Douyin big Vs

Guangdong, Zhejiang, and Sichuan make up the top three.

Visualization code ⑤

def create_province_map(df):
    # Keep only accounts in China (the literals below assume the dataset stores Chinese-language values)
    df = df[df["country"] == "中国"]
    df1 = df.copy()
    # Strip administrative suffixes so names match the "china" map, e.g. 广东省 becomes 广东
    df1["province"] = df1["province"].str.replace("省", "").str.replace("壮族自治区", "").str.replace("维吾尔自治区", "").str.replace("自治区", "")
    # Group and count
    df_num = df1.groupby("province")["province"].agg(count="count")
    df_province = df_num.index.values.tolist()
    df_count = df_num["count"].values.tolist()

    # Initial configuration (named map_chart to avoid shadowing the built-in map)
    map_chart = Map(init_opts=opts.InitOpts(width="800px", height="400px"))
    # Map of China
    map_chart.add("", [list(z) for z in zip(df_province, df_count)], "china")
    # Global options: title, toolbox (image download), color legend
    map_chart.set_global_opts(title_opts=opts.TitleOpts(title="Distribution of Douyin big V by province", pos_left="center", pos_top="0"),
                              toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
                              # Value range 0 to 600; is_piecewise=False gives a continuous color scale
                              visualmap_opts=opts.VisualMapOpts(max_=600, is_piecewise=False))
    map_chart.render("Douyin big V province distribution.html")
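
The suffix stripping in `create_province_map` can be illustrated on a few names. This sketch assumes the province column stores full Chinese names such as 广东省; the helper `short_province` and the suffix list are illustrative, not part of the article's code.

```python
# Suffixes to strip, longest first so "壮族自治区" is removed before "自治区".
SUFFIXES = ["维吾尔自治区", "壮族自治区", "自治区", "省"]


def short_province(name):
    """Strip one administrative suffix from a full Chinese province name."""
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


for full in ["广东省", "广西壮族自治区", "新疆维吾尔自治区", "内蒙古自治区", "北京"]:
    print(full, "->", short_province(full))
```

Names outside this list are only partly shortened (宁夏回族自治区 becomes 宁夏回族), so it is worth checking the grouped output for leftovers before plotting.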

Top 10 cities

Beijing is far ahead of the pack; it is the hub for big Vs.

Visualization code ⑥

def create_city(df):
    # Keep only accounts in China (the literals below assume the dataset stores Chinese-language values)
    df1 = df[df["country"] == "中国"]
    df1 = df1.copy()
    # Strip the "市" (city) suffix
    df1["city"] = df1["city"].str.replace("市", "")

    # Count accounts per city and sort in descending order
    df_num = df1.groupby("city")["city"].agg(count="count").reset_index().sort_values(by="count", ascending=False)
    df_city = df_num[:10]["city"].values.tolist()
    df_count = df_num[:10]["count"].values.tolist()

    bar = Bar(init_opts=opts.InitOpts(width="800px", height="400px"))
    bar.add_xaxis(df_city)
    bar.add_yaxis("", df_count)
    bar.set_global_opts(title_opts=opts.TitleOpts(title="Douyin big V city distribution TOP10", pos_left="center", pos_top="18"),
                        toolbox_opts=opts.ToolboxOpts(is_show=True, feature={"saveAsImage": {}}),
                        yaxis_opts=opts.AxisOpts(splitline_opts=opts.SplitLineOpts(is_show=True)))
    bar.set_series_opts(label_opts=opts.LabelOpts(is_show=True, position="top", color="black"))
    bar.render("Douyin big V city distribution TOP10.html")
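
The group, count, sort, and slice steps amount to a "top N most frequent values" query. As a rough illustration without pandas, the same idea can be sketched with `collections.Counter`; the helper `top_n` and the sample city names are made up.

```python
from collections import Counter


def top_n(values, n=10):
    """Return the n most frequent values as (value, count) pairs, most frequent first."""
    return Counter(values).most_common(n)


# Made-up sample of city names.
cities = ["Beijing", "Shanghai", "Beijing", "Chengdu", "Beijing", "Shanghai"]
print(top_n(cities, n=2))
```

Splitting the pairs into two lists gives the x-axis labels and y-axis values for a bar chart.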

Top 10 abroad

The United States ranks first; many Chinese people living there share glimpses of their life in the United States.

Top 10 schools Douyin big Vs graduated from

Beijing Film Academy, Communication University of China, Communication University of Zhejiang, the Central Academy of Drama, Shanghai Theatre Academy, and the Central Academy of Fine Arts: unmistakably the entertainment industry.

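Code for querying verification status

# Drop empty and unknown entries ("未知" means "unknown"; assumes the dataset stores Chinese-language values)
df1 = df[(df["custom_verify"] != "") & (df["custom_verify"] != "未知")]
df1 = df1.copy()
# Count accounts per verification label and sort in descending order
df_num = df1.groupby("custom_verify")["custom_verify"].agg(count="count").reset_index().sort_values(by="count", ascending=False)
print(df_num[:20])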

Results

Word cloud of Douyin big V profiles

Most big Vs include business-cooperation contact details in their profiles, which is good for content creators and makes for a win-win.

Visualization code ⑦

def create_wordcloud(df, picture):
    words = pd.read_csv('chineseStopWords.txt', encoding='gbk', sep='\t', names=['stopword'])
    # Word segmentation
    text = ''
    df1 = df[df["signature"] != ""]
    df1 = df1.copy()
    for line in df1['signature']:
        # Remove spaces, segment with jieba, and keep a space between consecutive signatures
        text += ' '.join(jieba.cut(str(line).replace(" ", ""), cut_all=False)) + ' '
    # Stop words
    stopwords = set()
    stopwords.update(words['stopword'])
    # Use the Douyin logo as both the mask and the color source (uint8 array, as WordCloud expects for masks)
    mask_image = np.array(Image.open('douyin.png'))
    image_colors = ImageColorGenerator(mask_image)
    wc = WordCloud(
        background_color='white',
        mask=mask_image,
        # Founder Lanting Hei font file; point this at any Chinese-capable font on your system
        font_path='方正兰亭黑.TTF',
        max_words=2000,
        max_font_size=70,
        min_font_size=1,
        prefer_horizontal=1,
        color_func=image_colors,
        random_state=50,
        stopwords=stopwords,
        margin=5
    )
    wc.generate_from_text(text)
    # Look at the word frequencies
    process_word = wc.process_text(text)
    word_freq = sorted(process_word.items(), key=lambda e: e[1], reverse=True)
    print(word_freq[:50])
    plt.imshow(wc)
    plt.axis('off')
    wc.to_file(picture)
    print('Word cloud generated successfully!')
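
The word-frequency step boils down to counting tokens while skipping stop words. The sketch below shows that idea with the standard library only; it assumes the text has already been segmented (in the article `jieba.cut` does this), and the helper name and sample tokens are made up.

```python
from collections import Counter


def word_frequencies(tokens, stopwords):
    """Count tokens, skipping stop words and empty strings."""
    return Counter(t for t in tokens if t and t not in stopwords)


# Made-up, already segmented tokens.
tokens = ["business", "cooperation", "the", "business", "contact", "the"]
freq = word_frequencies(tokens, stopwords={"the"})
print(freq.most_common(1))
```

This is only an approximation of what `WordCloud.process_text` does internally, but it makes the effect of the stop-word list easy to inspect.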