HeoiJin: Determined to see the world through data as a data product planner, with a focus on crawlers, data analysis, and product planning. Everything can be marketed | Capital never sleeps | CSDN: me.csdn.net/weixin_4067 | data constant truth…
I. Foreword
This article is the next part of a series that uses Python to analyze how Station B (Bilibili) has changed. The previous part covered data collection, so this part focuses on comparative analysis and visualization of the collected data.
II. Project Features
- pandas is used to group and aggregate the data
- pyecharts and FineBI are used to visualize the data
- The changes at Station B from 2019 to 2020 are analyzed with reference to DT Finance data
III. Project Preparation
- Language: Python 3.7
- IDE: PyCharm
- Browser: Chrome
- Plug-in: ChromeDriver
- Libraries: pandas, pyecharts, snapshot_selenium
- Others: FineBI
IV. Problem Definition
4.1 Definition of Keywords
Before the analysis begins, we need to settle what "two-dimensional" (二次元) and "three-dimensional" (三次元) mean and which criteria are used to classify content as one or the other.
The term "two-dimensional" comes from the Japanese word nijigen (にじげん), which originally means "two-dimensional" and describes characters in animation, games, and other works displayed on paper or on a screen. Correspondingly, "three-dimensional" is used to refer to real-life people. (Moegirlpedia)
Two-dimensional: Animation (the A of ACG), Comics (the C of ACG), and Games (the G of ACG). Three-dimensional: the real world. (Wikipedia)
That is, among all the partitions to be crawled, those that can be clearly classified as two- or three-dimensional are:
Two-dimensional: animation, guochuang (Chinese original animation), games
Three-dimensional: technology, digital, life, fashion, entertainment
The remaining partitions (kichiku, dance, music, film and television) have both two- and three-dimensional attributes, so they are defined as 2.5-dimensional.
4.2 Setting goals
With the partitions classified, we can set the research objectives:
- Among the top 100 videos by comprehensive score on Station B, which partition has the largest share, and how do users behave in different partitions?
- Analyze each partition of Station B to find out its play counts and user behavior
- Analyze the changes in popular tags
- Understand the behavior and psychology behind the changes at Station B
V. Data Analysis in Practice
5.1 Data Precleaning
Before the formal analysis, use df.info() to see what the data looks like.
From the printed data above, it can be seen that there are 14 columns and 1300 rows, and no missing values.
However, note that the site-wide list is excluded first to avoid double counting; all subsequent analysis is based on the resulting data, df_without_all.
# The tilde (~) inverts the mask, so the matching rows are excluded
df_without_all = df[~df['rank_tab'].isin(['total'])]
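To make the filtering step concrete, here is a minimal sketch on a toy DataFrame. Only the column name rank_tab and the value 'total' come from the article; the title column and all rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the crawled ranking data (made-up rows).
df = pd.DataFrame({
    "rank_tab": ["total", "animation", "life", "total", "music"],
    "title": ["a", "b", "c", "d", "e"],
})

# isin() builds a boolean mask marking the rows of the site-wide list;
# the tilde (~) inverts that mask so those rows are dropped.
is_total = df["rank_tab"].isin(["total"])
df_without_all = df[~is_total]

print(df_without_all["rank_tab"].tolist())
```

Keeping the mask in its own variable makes it easy to check how many rows are about to be removed before actually dropping them.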
5.2 Site-wide Top 100 by Comprehensive Score
5.2.1 Visualization of the proportion of each partition
Data processing ideas and core code:
- Sort df_without_all in descending order of comprehensive score and take the first 100 items
- Get the partition names and count how many times each partition appears
This gives a Series with partition names as the index and frequencies as the values.
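The two steps above can be sketched roughly as follows. The column name score is an assumption (the article does not show the real column name for the comprehensive score), and the toy data is made up.

```python
import pandas as pd

def top_partition_counts(df: pd.DataFrame, n: int = 100) -> pd.Series:
    """Count how often each partition appears among the n highest-scoring rows."""
    # "score" is a hypothetical column name for the comprehensive score.
    top_n = df.sort_values("score", ascending=False).head(n)
    # value_counts() returns a Series: partition name -> frequency.
    return top_n["rank_tab"].value_counts()

# Made-up rows, only to show the shape of the result.
toy = pd.DataFrame({
    "rank_tab": ["life", "animation", "life", "music", "life"],
    "score": [50, 40, 30, 20, 10],
})
counts = top_partition_counts(toy, n=3)
print(counts)
```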
Next, a pyecharts rose chart is used for visualization.
Compared with Excel or FineBI, rose charts in pyecharts are easy to make and look good.
Core code:
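A minimal sketch of such a rose chart (not the author's original listing), assuming pyecharts 1.x is installed and that the frequencies are already in a pandas Series. The helper names, title, and numbers are hypothetical.

```python
import pandas as pd

def to_data_pairs(counts: pd.Series) -> list:
    """Turn a name -> frequency Series into [name, value] pairs for pyecharts."""
    return [[str(name), int(value)] for name, value in counts.items()]

def build_rose_chart(data_pairs: list, title: str = "Top 100 by partition"):
    # Imported here so that to_data_pairs() can be used without pyecharts.
    from pyecharts import options as opts
    from pyecharts.charts import Pie

    return (
        Pie()
        # rosetype="radius": both the angle and the radius of each sector
        # follow its value, which gives the rose-chart look.
        .add("", data_pairs, radius=["20%", "75%"], rosetype="radius")
        .set_global_opts(title_opts=opts.TitleOpts(title=title))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
    )

counts = pd.Series({"life": 3, "animation": 2, "music": 1})  # made-up numbers
pairs = to_data_pairs(counts)
# build_rose_chart(pairs).render("rose.html")  # writes an HTML file
```

Converting the values to plain int first avoids passing numpy integer types into the chart options.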
Since no official play-count data for 2019 is available, the comparison is made for now against the 2018 financial report.
Compared with that report, life and animation have risen to first and second place respectively, so animation is still an important part of Station B.
However, entertainment, games, and technology dropped out of the list, while fashion, kichiku, and music became the new entrants. Videos belonging entirely to the two-dimensional category made up a relatively small share, only 27%.
5.2.2 Processing the averages for each partition
Data processing ideas and core code:
- Sort df_without_all in descending order by comprehensive score and take the top 100 items
- Group the DataFrame by partition name (which becomes the row index) and compute the mean
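As a rough sketch of these two steps: the column names score, view, and coin are assumptions standing in for the real columns, and the rows are made up.

```python
import pandas as pd

def partition_means(df: pd.DataFrame, n: int = 100,
                    metrics=("view", "coin")) -> pd.DataFrame:
    """Average the given metrics per partition within the top-n rows by score."""
    top_n = df.sort_values("score", ascending=False).head(n)
    # groupby() makes the partition name the row index of the result.
    return top_n.groupby("rank_tab")[list(metrics)].mean()

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "rank_tab": ["life", "life", "animation", "music"],
    "score": [4, 3, 2, 1],
    "view": [100, 300, 500, 700],
    "coin": [10, 30, 50, 70],
})
means = partition_means(toy, n=3)
print(means)
```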
After the data is processed, it is visualized in three parts:
- Average play count analysis
- Visualization and analysis of the average "triple" (coins, likes, favorites)
- Visualization and analysis of average comments, bullet comments, and shares
5.2.3 Visualization and analysis of average play count
Simply use the partition name as the dimension and the average play count as the metric.
For a single dimension and a single metric, bar charts, line charts, area charts, and other options are available. Here I choose a bar chart.
Ideas and core code:
- Get the data and build lists of partition names and average play counts
- Create a bar chart and add a JavaScript statement to create a gradient
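One way to sketch these two steps, assuming pyecharts 1.x. The gradient is passed as a JsCode snippet that ECharts evaluates in the browser; the colors, series name, and sample values are arbitrary choices for illustration.

```python
import pandas as pd

def split_for_bar(avg_plays: pd.Series):
    """Return partition names and average plays as two plain lists."""
    ordered = avg_plays.sort_values(ascending=False)
    return ordered.index.tolist(), [float(v) for v in ordered]

def build_gradient_bar(names, values):
    # Imported here so that split_for_bar() can be used without pyecharts.
    from pyecharts import options as opts
    from pyecharts.charts import Bar
    from pyecharts.commons.utils import JsCode

    # Vertical gradient from the top (offset 0) to the bottom (offset 1) of a bar.
    gradient = JsCode(
        "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
        "[{offset: 0, color: '#fb7299'}, {offset: 1, color: '#23ade5'}])"
    )
    return (
        Bar()
        .add_xaxis(names)
        .add_yaxis("Average plays", values,
                   itemstyle_opts=opts.ItemStyleOpts(color=gradient))
        .set_global_opts(title_opts=opts.TitleOpts(title="Average plays by partition"))
    )

names, values = split_for_bar(pd.Series({"life": 120, "animation": 150}))  # made up
# build_gradient_bar(names, values).render("bar.html")
```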
The animation partition narrowly surpassed the fashion partition to take first place in average play count. Does that mean the two-dimensional world is still Station B's home turf?
Not really. Looking back at the detailed data for the animation partition, the top video, "[Bilibili 2020 Happy New Year Festival]", has 5.74 times the plays of the second.
In the fashion partition, the top video has only 1.6 times the plays of the second. In other words, the animation partition's average is pulled up by a single video.
5.2.4 Visualization and analysis of the average triple (coins, likes, favorites)
Coins, likes, and favorites are counted per user, so they reflect user preferences more accurately than play counts do. A pyecharts radar chart is used here for visualization.
Core code:
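A minimal sketch of a radar chart for the three metrics (not the author's original listing), assuming pyecharts 1.x. The column names coin, like, and favorite and all numbers are hypothetical.

```python
import pandas as pd

METRICS = ("coin", "like", "favorite")  # hypothetical column names

def radar_inputs(means: pd.DataFrame, metrics=METRICS):
    """Return the per-axis maximum and one data row per partition."""
    axis_max = {m: float(means[m].max()) for m in metrics}
    rows = {
        str(name): [[float(row[m]) for m in metrics]]
        for name, row in means.iterrows()
    }
    return axis_max, rows

def build_radar(axis_max, rows):
    # Imported here so that radar_inputs() can be used without pyecharts.
    from pyecharts import options as opts
    from pyecharts.charts import Radar

    chart = Radar().add_schema(
        schema=[opts.RadarIndicatorItem(name=m, max_=mx)
                for m, mx in axis_max.items()]
    )
    for name, data in rows.items():
        chart.add(name, data)  # one polygon per partition
    return chart

means = pd.DataFrame(  # made-up averages
    {"coin": [50, 30], "like": [80, 90], "favorite": [20, 10]},
    index=["life", "animation"],
)
axis_max, rows = radar_inputs(means)
# build_radar(axis_max, rows).render("radar.html")
```

Each axis gets its own maximum because the three metrics can differ by orders of magnitude; a shared scale would flatten the smaller ones.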
Despite the dark-horse video in the animation partition, the average coins and likes in the life partition were still higher than those in the animation partition.
5.2.5 Visualization and analysis of average comments, bullet comments, and shares
Combination charts in pyecharts are complex to implement in code and much less cost-effective than FineBI. FineBI is therefore used for the visualization in this part, and the result is shown directly as a figure without further detail.
The average number of bullet comments in the animation partition was good, but comments and shares were mediocre. High barriers to topicality and to word-of-mouth spread keep two-dimensional content from growing as explosively as three-dimensional content.
When the growth rate of a part is lower than the overall growth rate, it is inevitable that the two-dimensional attribute of Station B gets diluted.
5.3 Top 100 Series by Partition
The above is only a preliminary analysis of the top 100 by comprehensive score. To avoid survivorship bias, the top 100 of every partition is analyzed further below and compared against DT Finance data.
5.3.1 Data preprocessing
Processing approach:
- Group df_without_all by partition name
- Calculate the mean of each metric for each partition
- Save the result to a CSV file
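These three steps might look like the sketch below. The metric columns view and coin are assumptions, and the rows are made up. Calling to_csv() without a path returns the CSV text, which keeps the example free of side effects; pass a file name to write to disk instead.

```python
import pandas as pd

def partition_averages(df: pd.DataFrame) -> pd.DataFrame:
    """Mean of every numeric column per partition."""
    # numeric_only=True skips text columns such as titles.
    return df.groupby("rank_tab").mean(numeric_only=True)

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "rank_tab": ["life", "life", "animation"],
    "title": ["a", "b", "c"],
    "view": [100, 300, 500],
    "coin": [10, 30, 50],
})
averages = partition_averages(toy)
csv_text = averages.to_csv()  # averages.to_csv("partition_means.csv") writes a file
print(csv_text)
```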
5.3.2 Average play counts
Ideas and core code:
- Read the partition names and play data
- Scale the play data
- Draw a line chart
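A rough sketch of these steps, assuming pyecharts 1.x. The article does not say how the play data is scaled; dividing by 10,000 is an assumption made here so that the axis labels stay short. The column name view and the sample numbers are also made up.

```python
import io
import pandas as pd

def load_scaled_plays(csv_source, unit: int = 10_000):
    """Read per-partition averages and express plays in multiples of `unit`."""
    means = pd.read_csv(csv_source, index_col="rank_tab")
    names = means.index.tolist()
    plays = (means["view"] / unit).round(2).tolist()
    return names, plays

def build_line_chart(names, plays):
    # Imported here so that load_scaled_plays() can be used without pyecharts.
    from pyecharts import options as opts
    from pyecharts.charts import Line

    return (
        Line()
        .add_xaxis(names)
        .add_yaxis("Average plays (x10,000)", plays)
        .set_global_opts(title_opts=opts.TitleOpts(title="Average plays per partition"))
    )

# An in-memory CSV stands in for the file saved in the preprocessing step.
sample_csv = io.StringIO("rank_tab,view\nlife,2500000\nanimation,1800000\n")
names, plays = load_scaled_plays(sample_csv)
# build_line_chart(names, plays).render("line.html")
```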
Compared with DT Finance's 2019 data, and leaving aside the screening hall, TV drama, and advertising partitions, which we did not collect, the life partition is still the giant of Station B in terms of play count.
Thanks to the popular New Year Festival, the animation partition rose from third to second. Notably, the average play count of the top 100 in every partition has increased significantly compared with the 2019 data, and the average play count of popular videos in the life partition has quadrupled.
5.3.3 Comparison of average user behavior data
How to make a pyecharts line chart was covered above, so the visualization in this part is done with BI software without further elaboration.
Continuing the comparison with DT Finance data: excluding the TV drama, screening hall, and advertising partitions, the animation partition led almost every indicator in the DT Finance data on its own.
By 2020, a hundred flowers are blooming. The two-dimensional lead across the indicators has been carved up by the three-dimensional partitions, and the life partition has even taken first place in most indicators.
5.4 Hot Tags
Again, before working with the data, let's look at its structure.
Each row contains N tags. The tag column therefore first needs to be flattened into a Series without nesting, and then the occurrences of each unique tag are counted.
Core code:
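A minimal sketch of the flattening and counting (not the author's original listing). It assumes each row stores its tags as one comma-separated string; the storage format is not shown in the article, and the sample tags are made up.

```python
import pandas as pd

def tag_counts(tags: pd.Series, sep: str = ",") -> pd.Series:
    """Flatten a column of delimited tag strings and count each unique tag."""
    # split() turns each string into a list; explode() gives one row per tag.
    flat = tags.str.split(sep).explode().str.strip()
    flat = flat[flat != ""]  # drop empty fragments such as trailing commas
    return flat.value_counts()

toy_tags = pd.Series(["funny,life", "funny,music", "life, funny"])  # made up
counts = tag_counts(toy_tags)
print(counts)
```

The resulting Series can be turned into (tag, frequency) pairs with list(counts.items()), which is the shape a word cloud usually expects.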
Compared with DT Finance's tag frequencies for popular videos from March to April 2019, "funny" is still the tag that appears most often in popular videos on Station B, and kichiku is still visible.
This year's word cloud contains many tags closely tied to daily life: the fight against the pneumonia epidemic that we are going through, as well as the fitness, body-shaping, and weight-loss goals that we set every time but always lose to eating.
VI. Project Summary
Ten years after its founding, Station B has grown into a large platform with 33 million DAU on its app alone. In moving from a two-dimensional community to a comprehensive video community, its original two-dimensional attribute is bound to be diluted.
Back to the original questions:
1. How diluted is the two-dimensional attribute? Two-dimensional content is still part of the core of Station B. However, judging from Station B's commercial layout, the livestreaming, variety show, Vlog, and other lines of content that are closer to everyday life will dilute the two-dimensional attribute further.
2. Which partition leads Station B? The life partition, with its wider audience, is gradually becoming the mainstream of Station B, and this trend will become more pronounced. Mass communication theory describes this with the spiral of silence: the majority view grows ever larger and the minority ever smaller, a Matthew effect in communication.
3. What tags do mainstream users of Station B like? "Funny" is still the favorite tag of Station B users. After all, humor is a scarce resource in a harsh social environment.
4. What thoughts does this analysis bring? Station B has successfully transformed from a two-dimensional site into a comprehensive one, attracted investment from giants such as Tencent and Alibaba, and gone public. In our own winter, we too must keep upgrading and adding to our own value in order to welcome a warm spring.
Finally, I hope Station B keeps getting better and never forgets its original intention!
Source address: github.com/heoijin/Bil…
Solemn declaration: this project and all related articles are for technical exchange only. Applying the techniques to improper uses is forbidden, and I bear no responsibility for any risk arising from their abuse.
Reference:
1. Data interpretation | We studied Station B and found that it is not so "two-dimensional" - DT Finance: mp.weixin.qq.com/s/EObWtXz1y…
2. 2020 China Mobile Internet "Epidemic" Special Report - QuestMobile: https://www.questmobile.com.cn/research/report-new/81
3. Station B product analysis report | From a two-dimensional community to a comprehensive video community - FMR: www.woshipm.com/evaluating/…