“This is the 20th day of my participation in the First Challenge 2022. For details: First Challenge 2022.”

Preface

Years ago, I wrote an article about crawling Bilibili bullet comments (danmaku) and said I would analyze the hot topics in them when I had time. I finally have time, so let's fill that gap. This article analyzes and visualizes the bullet comments from the first episode of the anime “Jinshu Hui Zhan” on Bilibili, mainly using Python: pandas to preprocess the data, jieba for word segmentation, and pyecharts for visualization. The dataset is simple and contains 60,000 records. Enough talk, let's get started.


Data preprocessing

First, inspect the dataset's basic information to get a clear picture of its structure.

import pandas as pd

df = pd.read_csv('bilibili_clean.csv')
df.info()

Print the first five rows.

df.head()

As the output above shows, the other_data column packs many fields into one string, and not all of them are useful for the later analysis. We split the column and keep only the fields we need. Take the first row as an example:

'351.52700,1,25,16777215,1601686748,0,b083a745,39082825228484615'

Required fields and their meaning:

  • 351.52700: the position in the video at which the bullet comment appears, in seconds from the start;
  • 16777215: the comment color as a decimal value; 16777215 corresponds to 0xFFFFFF (white);
  • b083a745: the user ID of the sender.
# Split the other_data column and add the required fields to the DataFrame
split_df = df['other_data'].str.split(',', expand=True)

# (new column name, position of the field in the split string)
column_dict = [('video_time', 0), ('color', 3), ('user_id', 6)]
for col_name, index in column_dict:
    df[col_name] = split_df[index]
# Delete the other_data column
df.drop('other_data', axis=1, inplace=True)
df.head()

This gives us a dataset with a clearer structure. Before drawing each chart, we process the data a little further to make plotting easier.
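As a quick sanity check of the field positions, a single record can be parsed by hand. This is only a sketch: `parse_other_data` is a hypothetical helper, and it assumes the raw value is a plain comma-separated string with eight fields, as in the sample row.

```python
# Hypothetical helper: parse one other_data string into the three fields
# the article keeps (positions 0, 3 and 6 of the comma-separated value).
def parse_other_data(raw: str) -> dict:
    fields = [part.strip() for part in raw.split(",")]
    return {
        "video_time": float(fields[0]),  # seconds from the start of the video
        "color": int(fields[3]),         # decimal RGB value
        "user_id": fields[6],            # sender ID as stored in the dataset
    }


sample = "351.52700,1,25,16777215,1601686748,0,b083a745,39082825228484615"
record = parse_other_data(sample)
print(record)
```

If the printed values line up with the first row of `df.head()`, the column mapping is probably correct.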

Bar chart of bullet comment length distribution

Add a column called comment_length to record the length of each comment, then count how often each length occurs. Use the pyecharts library to draw the bar chart.

# Add a column comment_length to record the length of each comment
df['comment_length'] = df['comment'].map(len)

length_series = df['comment_length'].value_counts()
length_series.sort_index(ascending=True, inplace=True)
# Comment length list (ascending)
length_list = length_series.index.astype(int).tolist()
# Each length corresponds to the list of occurrences
count_list = length_series.values.astype(int).tolist()

# Draw the histogram
from pyecharts import options as opts
from pyecharts.charts import Bar

chart = Bar()
(
    chart.add_xaxis(length_list)
    .add_yaxis("Episode One", count_list, color='#DF0101')
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Bullet comment length distribution"),
        datazoom_opts=[opts.DataZoomOpts(), opts.DataZoomOpts(type_="inside")],
    )
    .render("Projectile length distribution.html")
)
chart.render_notebook()

The chart shows that the number of bullet comments decreases steadily as length increases, and most comments are shorter than 10 characters. This matches how people usually post: I typically send comments of only a few words myself.
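The claim that most comments are shorter than 10 characters can be checked numerically rather than by eye. The sketch below uses plain Python and a made-up list of comments (not the article's data) to show the same length-counting idea and the share of short comments.

```python
from collections import Counter

# Made-up sample comments, used only to illustrate the calculation
comments = ["666", "233", "awsl", "this opening is great", "ok"]

# Count how often each comment length occurs, sorted by length
length_counts = Counter(len(c) for c in comments)
sorted_lengths = sorted(length_counts.items())

# Share of comments shorter than 10 characters
short_share = sum(n for length, n in length_counts.items() if length < 10) / len(comments)

print(sorted_lengths)
print(short_share)
```

On the real dataset, the same ratio could be computed from `df['comment_length']`.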


Pie chart of bullet comment color distribution

First convert the decimal color code to hexadecimal. Then, when drawing the pie chart, each sector can be filled with the color its code represents.

import time  # used later to format seconds as "minute:second"

# Convert the color column from str to int, then from decimal to a hexadecimal string
df['color'] = df['color'].astype(int).map(hex)
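One detail worth keeping in mind: Python's built-in `hex()` drops leading zeros, so a color such as pure green would come out with fewer than six digits. A minimal sketch of a zero-padded conversion follows; `to_hex_color` is a hypothetical helper name, not part of the original code.

```python
def to_hex_color(value: int) -> str:
    """Format a decimal RGB value as a 6-digit '#RRGGBB' string."""
    return f"#{value:06X}"


white = to_hex_color(16777215)  # the default comment color
green = to_hex_color(65280)     # a value whose hex form has leading zeros

print(white, green, hex(65280))
```

The padded form is a valid CSS-style color, which matters when the codes are later passed to the chart as sector colors.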

Count the white comments and the colored comments, then draw a pie chart.

# Visualize white vs. colored bullet comments
from pyecharts.charts import Pie

color_series = df['color'].value_counts()
color_list = color_series.index.tolist()
count_list = color_series.values.astype(int).tolist()

# value_counts() sorts by frequency, so the first entry is assumed to be white (0xffffff)
white_other = ['White', 'Colored']
white_other_count = [count_list[0], sum(count_list[1:])]

chart = (
    Pie()
    .add(
        "",
        [list(z) for z in zip(white_other, white_other_count)],
        radius=["40%", "75%"],
    )
    .set_colors(['#0101DF', '#FE2E2E'])
    .set_global_opts(
        title_opts=opts.TitleOpts(title="White vs. colored bullet comments"),
        legend_opts=opts.LegendOpts(orient="vertical", pos_top="15%", pos_left="2%"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    # .render("pie_radius.html")
)
chart.render_notebook()

Most bullet comments use the default color, white. Although level-3 users on Bilibili can send colored comments, it seems that not many do. Let's take a closer look at which colors the colored comments use. Because there are so many colors, only those that occur more than 10 times are counted here.

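# Ignore colors that occur 10 times or fewer.
# count_list is sorted in descending order, so stop at the first count <= 10.
index = len(count_list)  # fallback if every color occurs more than 10 times
for i, count in enumerate(count_list):
    if count <= 10:
        index = i
        break
# Skip position 0, which is white
new_count_list = count_list[1: index]
new_color_list = color_list[1: index]
# Change the 0xffffff color format to #ffffff, restoring leading zeros dropped by hex()
new_color_list = ['#' + color[2:].zfill(6) for color in new_color_list]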


chart = (
    Pie()
    .add(
        "",
        [list(z) for z in zip(new_color_list, new_count_list)],
        radius=["40%", "75%"],
    )
    .set_colors(new_color_list)
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Color distribution of colored bullet comments"),
        legend_opts=opts.LegendOpts(orient="vertical", pos_top="8%", pos_left="0%"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    # .render("pie_radius.html")
)
chart.render_notebook()
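The cutoff loop relies on the counts being sorted in descending order. A direct filter expresses the "more than 10 occurrences" rule without that assumption. The sketch below uses invented counts purely for illustration; `MIN_COUNT` and the color table are hypothetical.

```python
# Invented color counts, for illustration only
color_counts = {
    "#FFFFFF": 500,
    "#FE0302": 40,
    "#FFFF00": 35,
    "#00CD00": 10,
    "#89D5FF": 3,
}

MIN_COUNT = 10  # keep colors that occur more than this many times

# Drop white and keep only the sufficiently frequent colors
frequent = [
    (color, count)
    for color, count in color_counts.items()
    if color != "#FFFFFF" and count > MIN_COUNT
]

print(frequent)
```

The resulting pairs can be passed to the pie chart in the same way as `zip(new_color_list, new_count_list)`.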

The chart is very colorful, but on closer inspection red and yellow account for more than half of the colored comments. That does not necessarily mean so many people prefer red and yellow. Comparing with the official palettes offered by the mobile and web versions, the web palette includes #FFFF00 and #FE0302; the mobile palette cannot be identified for now. Two other colors are also common: #FEF102 and #E70012. These colors are probably used partly because users like them and partly because they may sit in prominent positions in the official palette.

(Figures: the color palette in the mobile version and in the web version.)


Line chart of bullet comment volume over video time

First convert the video_time column to float, then divide the video (1435 seconds in total, 23:55) into 10-second intervals: 0-10, 10-20, 20-30, 30-40, …, labeled by their start times 0, 10, 20, 30, …. Next, convert the labels from seconds to “minute:second” format, and finally count the bullet comments in each interval.

import numpy as np

# Convert the video_time column to float
df['video_time'] = df['video_time'].astype('float')
# Create a temporary DataFrame
temp_df = pd.DataFrame({})
temp_df['video_time'] = df['video_time']
# Partition the video_time column into 10-second intervals, labeled by interval start
bin_starts = list(range(0, 1435, 10))
temp_df = temp_df.apply(lambda x: pd.cut(x, bin_starts + [np.inf], labels=bin_starts))

count_series = temp_df['video_time'].value_counts()
count_series.sort_index(ascending=True, inplace=True)

# change time data format from "second" to "minute: second"
count_series.index = count_series.index.map(lambda x: time.strftime('%M:%S', time.gmtime(x)))
time_list = count_series.index.tolist()
count_list = count_series.values.astype('int').tolist()
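The binning and label formatting can be illustrated without pandas. This sketch uses made-up timestamps and floor division, so each interval is closed on the left, `[start, start + 10)`, whereas `pd.cut` by default closes intervals on the right; treat it as an approximation of the step above rather than an exact equivalent. `BIN_SECONDS` and `bin_start` are hypothetical names.

```python
import time
from collections import Counter

BIN_SECONDS = 10

# Made-up comment timestamps in seconds, for illustration only
video_times = [3.2, 9.9, 10.5, 61.0, 65.4, 1434.0]


def bin_start(seconds: float) -> int:
    """Return the start (in seconds) of the 10-second interval containing `seconds`."""
    return int(seconds // BIN_SECONDS) * BIN_SECONDS


# Number of comments per interval, keyed by interval start
counts = Counter(bin_start(t) for t in video_times)

# Format each interval start as "minute:second"
labels = {start: time.strftime("%M:%S", time.gmtime(start)) for start in sorted(counts)}

print(counts)
print(labels)
```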

Use time_list and count_list to draw the line chart.

# Draw a line chart
from pyecharts.charts import Line

chart = (
    Line()
    .add_xaxis(time_list)
    .add_yaxis("Episode One", count_list, is_smooth=True)
    .set_series_opts(
        areastyle_opts=opts.AreaStyleOpts(opacity=0.5),
        label_opts=opts.LabelOpts(is_show=False),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Bullet comment volume over video time"),
        datazoom_opts=[opts.DataZoomOpts(), opts.DataZoomOpts(type_="inside")],
        xaxis_opts=opts.AxisOpts(
            axistick_opts=opts.AxisTickOpts(is_align_with_label=True),
            is_scale=False,
            boundary_gap=False,
        ),
    )
    # .render("line_areastyle_boundary_gap.html")
)
chart.render_notebook()

The number of bullet comments sent in each interval gives a rough indication of the video's high-energy moments, similar to “highlight moments”. The peaks in the chart appear mainly at the beginning, middle, and end, which also matches when viewers typically send comments.
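Peaks can also be picked out programmatically instead of read off the chart. A minimal sketch, assuming a mapping from interval label to comment count; the numbers below are invented and do not come from the dataset.

```python
import heapq

# Invented per-interval comment counts, for illustration only
counts_by_interval = {
    "00:00": 420,
    "00:10": 310,
    "11:50": 505,
    "12:00": 280,
    "23:40": 390,
}

# The three busiest intervals, largest first
top3 = heapq.nlargest(3, counts_by_interval.items(), key=lambda kv: kv[1])

print(top3)
```

With the article's variables, the same call could be applied to `zip(time_list, count_list)`.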


Pie chart of the number of bullet comments sent per user

Count how many bullet comments each user sent, group the counts into four categories (1, 2, 3, and more than 3), and draw a pie chart of the result.

# Series of user ID (index) and number of comments sent (values)
series_user = df['user_id'].value_counts()
# Series of number of comments sent (index) and number of users (values)
series_comment = series_user.value_counts()
# Sort the index in ascending order
series_comment.sort_index(ascending=True, inplace=True)
# List of comment counts
comment_count_list = series_comment.index
# List of user counts
user_count_list = series_comment.values.tolist()
# Group the comment counts into 4 categories: 1, 2, 3, and more than 3
comment_count_list = [f'{count} time' if count == 1 else f'{count} times'
                      for count in comment_count_list[:3]] + ['> 3 times']
user_count_list = user_count_list[:3] + [sum(user_count_list[3:])]

chart = (
    Pie()
    .add(
        "",
        [list(z) for z in zip(comment_count_list, user_count_list)],
        center=["35%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Number of bullet comments sent per user"),
        legend_opts=opts.LegendOpts(pos_left="80%", orient="vertical"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    # .render("pie_position.html")
)
chart.render_notebook()
chart.render_notebook()
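The two-step counting (comments per user, then users per comment count) can be sketched in plain Python. The user IDs below are invented, and `bucket` is a hypothetical helper that assigns each user to one of the four categories.

```python
from collections import Counter

# Invented sender IDs, one entry per comment
user_ids = ["a", "b", "a", "c", "d", "d", "d", "e", "e", "e", "e", "f"]

# Step 1: number of comments per user
per_user = Counter(user_ids)


def bucket(n: int) -> str:
    """Map a comment count to one of the four categories."""
    if n == 1:
        return "1 time"
    if n <= 3:
        return f"{n} times"
    return "> 3 times"


# Step 2: number of users in each category
bucket_counts = Counter(bucket(n) for n in per_user.values())

print(bucket_counts)
```

Assigning each user to a category explicitly avoids relying on the first three index entries being exactly 1, 2 and 3.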

As the chart shows, most users sent only one bullet comment; 5222 users sent two, 1681 sent three, and 1873 sent more than three. Next, let's look at the top 10 comment counts.

Bar chart of the top 10 bullet comment counts per user

# The 10 largest comment counts (series_comment is indexed by count, ascending)
top10 = series_comment.iloc[-10:].index.tolist()
top10.reverse()

chart = (
    Bar()
    .add_xaxis(list(range(1, 11)))
    .add_yaxis('Episode One', top10, color='#F781D8')
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Top 10 bullet comment counts per user"),
        datazoom_opts=opts.DataZoomOpts(type_="inside"),
    )
    # .render("bar_datazoom_inside.html")
)
chart.render_notebook()

Impressive! The top user posted 76 bullet comments in about 23 minutes, roughly one every 20 seconds on average. It is hard to imagine following the video at that pace. They may, however, have watched the episode once and then timed their comments on a second viewing.
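The "about 20 seconds" figure follows directly from the numbers already given (a 23:55 video and 76 comments). A short check of the arithmetic:

```python
# Video length stated in the article: 23 minutes 55 seconds
video_seconds = 23 * 60 + 55

# Comments sent by the most active user, as stated in the article
top_user_comments = 76

# Average gap between two comments, in seconds
avg_interval = video_seconds / top_user_comments

print(round(avg_interval, 1))
```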


Word cloud of hot words in bullet comments

First load a local stop-word list, and manually add some stop words based on the words that appear in the bullet comments so that the word cloud looks better.

def load_stopwords(read_path):
    """
    Read each line of the file and save it to a list.
    :param read_path: the path of the file to read
    :return: a list containing each line of the file
    """
    result = []
    with open(read_path, "r", encoding='utf-8') as f:
        for line in f:
            line = line.strip('\n')  # Remove the trailing line break
            result.append(line)
    return result

# Load the Chinese stop words
stopwords = load_stopwords('wordcloud_stopwords.txt')

Now clean the bullet comments: remove spaces, comments made of a single repeated character ('111', 'aaa'), and timestamps (check-in style comments), then delete the resulting empty strings.

# Remove spaces from the comments
df['comment'] = df['comment'].str.replace(' ', '', regex=False)
# Replace comments made of one repeated character ('111', 'aaa', '....', etc.) with an empty string
df['comment'] = df['comment'].str.replace(r'^(.)\1*$', '', regex=True)
# Replace timestamps such as '2020/11/20 20:00:00' with an empty string
# (\s* instead of a literal space, because spaces were already removed above)
df['comment'] = df['comment'].str.replace(r'\d+/\d+/\d+\s*\d+:\d+:\d+', '', regex=True)

# Change empty strings to np.nan so these comments can be dropped in the next step
df['comment'] = df['comment'].replace(to_replace=r'^\s*$', value=np.nan, regex=True)
# Drop rows whose comment is null and reset the index
df = df.dropna(subset=['comment'])
df.reset_index(drop=True, inplace=True)
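The three cleaning rules are easy to try on a few strings with the standard `re` module. This is a sketch on invented samples; `clean_comment` is a hypothetical helper, and the timestamp pattern uses `\s*` on the assumption that spaces have already been stripped by the first step.

```python
import re


def clean_comment(text: str):
    """Apply the three cleaning rules; return None if nothing is left."""
    text = text.replace(" ", "")                               # remove spaces
    text = re.sub(r"^(.)\1*$", "", text)                       # one repeated character
    text = re.sub(r"\d+/\d+/\d+\s*\d+:\d+:\d+", "", text)      # timestamps
    return text or None


# Invented sample comments
samples = ["111", "aaa", "2020/11/20 20:00:00", "so good", "...."]
cleaned = [clean_comment(s) for s in samples]

print(cleaned)
```

Only the comment with real content survives; the others become `None`, which plays the role of `np.nan` in the DataFrame version.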

After cleaning, jieba can be used to segment the bullet comments. Before segmenting, it is better to import a local custom dictionary with load_userdict() so that domain-specific words are kept whole instead of being split. Stop words are removed after segmentation.

import jieba

# Add a custom dictionary
jieba.load_userdict("Custom dictionary.txt")
token_list = []
# Perform word segmentation for the content of bullet screen, and save the word segmentation result in the list
for comment in df['comment']:
    tokens = jieba.lcut(comment, cut_all=False)
    token_list += [token for token in tokens if token not in stopwords]
len(token_list)
119752

Then select the 100 most frequent words to draw the word cloud.

from pyecharts.charts import WordCloud
from collections import Counter

token_count_list = Counter(token_list).most_common(100)
# Build the (word, value) pairs passed to the word cloud
new_token_list = []
for token, count in token_count_list:
    new_token_list.append((token, str(count)))

chart = (
    WordCloud()
    .add(series_name="Hot words", data_pair=new_token_list, word_size_range=[12, 88])
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Bullet comment hot-word cloud", title_textstyle_opts=opts.TextStyleOpts(font_size=23)
        ),
        tooltip_opts=opts.TooltipOpts(is_show=True),
    )
    # .render("basic_wordcloud.html")
)
chart.render_notebook()

The first word cloud may not look very good, and some words need to be added manually to the stop-word list or the custom dictionary. After that, it looks pretty good.
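The effect of manually adding a stop word can be seen on a tiny example. The tokens and stop words below are invented; the point is only that each addition changes which words reach the top of the ranking.

```python
from collections import Counter

# Invented tokens standing in for the segmented comments
tokens = ["awesome", "haha", "awesome", "the", "opening", "haha", "awesome", "the"]

# A set makes the membership test fast for large token lists
stopwords = {"the"}

counts = Counter(t for t in tokens if t not in stopwords)
top2_before = counts.most_common(2)

# After inspecting the result, add a filler word to the stop words and recount
stopwords.add("haha")
counts = Counter(t for t in tokens if t not in stopwords)
top2_after = counts.most_common(2)

print(top2_before)
print(top2_after)
```

Repeating this inspect-and-extend loop a few times is usually what makes the final word cloud readable.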

