The structure of this paper is as follows:
Part 1. Understanding the Data
1.1 The Dataset
www.kaggle.com/datasnaek/y…
The dataset contains months (and growing) of data on YouTube's daily trending videos, listing up to 200 trending videos per day. It originally covered the US, GB, DE, CA and FR regions (United States, Great Britain, Germany, Canada and France respectively).
It now also includes the RU, MX, KR, JP and IN regions (Russia, Mexico, South Korea, Japan and India respectively) over the same period. Each region's data is kept in a separate file. The data includes video titles, channel titles, publish times, tags, views, likes and dislikes, descriptions and comment counts. There is also a category_id field, which varies from region to region; to look up the category for a particular video, use the region's associated JSON file. Each region in the dataset has one such file.
1.2 Data Description
Part 2. Asking Questions
2.1 Common Streaming Media Indicators
For reference, see the Zhihu article on short-video operations indicators:
zhuanlan.zhihu.com/p/53091630
2.2 Asking Questions
Part 3. Cleaning the Data
3.1 Encoding Conversion
Opening the dataset shows many garbled characters where non-English text should be.
Solution:
Open the file in Notepad and re-save it in UTF-8 format.
3.2 JSON Processing
Code adapted from the Zhihu author Meiyu.
Reference:
zhuanlan.zhihu.com/p/66702457
In the JSON file, each category id corresponds to a title, so we extract the id-to-title mapping in Python:
# Code adapted from the Zhihu author Meiyu
import json
import csv

# Load the category mapping that ships with the dataset
with open("US_category_id.json", "r") as jsonFile:
    categoryId = json.load(jsonFile)

# Write an id -> category CSV for later import into MySQL
with open("US_category_id.csv", "w", newline="") as f:
    csvWriter = csv.writer(f)
    csvWriter.writerow(["id", "category"])
    for i in categoryId["items"]:
        csvWriter.writerow([i["id"], i["snippet"]["title"]])
3.3 Importing into MySQL
There are two pitfalls. First, create the database with the utf8mb4 character set, otherwise 4-byte characters (such as emoji in titles and descriptions) fail to import. Second, long fields trigger the error 1406 – Data too long for column…, which is fixed by widening the affected columns.
After the import, the table is ready for cleaning.
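A sketch of both fixes; the database name here is my own, and which columns overflow depends on your import (description and tags are typical):
-- Pitfall 1: utf8mb4 lets 4-byte characters (e.g. emoji) import cleanly
CREATE DATABASE youtube_trending CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- Pitfall 2: widen the columns that raise error 1406
ALTER TABLE youtube MODIFY COLUMN description TEXT;
ALTER TABLE youtube MODIFY COLUMN tags TEXT;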
3.4 Selecting a Subset
Use Navicat's filter function to select the subset and hide the columns irrelevant to the research questions.
3.5 Renaming Columns
The field names are already clear English, so there is no need to rename them.
3.6 Adding a Primary Key
ALTER TABLE youtube
ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY;
3.7 Deleting Duplicate Values
SELECT MIN(id), video_id, trending_date, COUNT(video_id) AS repeat_count
FROM youtube
GROUP BY video_id, trending_date
HAVING COUNT(video_id) > 1
ORDER BY repeat_count DESC;
We group by video_id and trending_date because the same video charting on different dates is valuable information, not a duplicate; only rows with the same video_id on the same trending_date are true duplicates. For each duplicate pair we keep the row with the smallest id. There are 50 duplicates in total.
SELECT *
FROM youtube a
INNER JOIN
(
    SELECT MIN(id) AS unique_id, video_id, trending_date, COUNT(video_id) AS repeat_count
    FROM youtube
    GROUP BY video_id, trending_date
    HAVING COUNT(video_id) > 1
) b
ON a.trending_date = b.trending_date
AND a.video_id = b.video_id
AND a.id <> b.unique_id;
This multi-table query surfaces the rows we want to remove. We stage their ids in a temporary table, because MySQL cannot delete from a table that the same statement also queries.
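The temporary table's definition is not shown in the original; a minimal version needs only the id column. Note that a TEMPORARY table disappears when the session ends, so the insert and delete below must run in the same connection.
CREATE TEMPORARY TABLE temp (id INT);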
INSERT INTO temp (id)
SELECT a.id
FROM youtube a
INNER JOIN
(
    SELECT MIN(id) AS unique_id, video_id, trending_date, COUNT(video_id) AS repeat_count
    FROM youtube
    GROUP BY video_id, trending_date
    HAVING COUNT(video_id) > 1
) b
ON a.trending_date = b.trending_date
AND a.video_id = b.video_id
AND a.id <> b.unique_id;
Then delete them:
DELETE FROM youtube
WHERE id IN
(SELECT id FROM temp);
This affects 50 rows, matching the duplicate count exactly.
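As a sanity check (an extra step, not in the original), re-running the duplicate query should now return no rows:
SELECT video_id, trending_date, COUNT(video_id) AS repeat_count
FROM youtube
GROUP BY video_id, trending_date
HAVING COUNT(video_id) > 1;
-- expected: empty result after the delete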
3.8 Handling Missing Values
SELECT COUNT(id), COUNT(video_id), COUNT(title), COUNT(channel_title),
       COUNT(category_id), COUNT(publish_time), COUNT(tags), COUNT(views),
       COUNT(likes), COUNT(dislikes), COUNT(comment_count), COUNT(description)
FROM youtube;
Comparing the counts shows that video_id and description have missing values to varying degrees. Since both are important fields, we delete the affected rows entirely.
DELETE FROM youtube
WHERE video_id IS NULL
   OR description IS NULL;
3.9 Consistency Processing
3.9.1 Number of Tags (tags_num)
Tags are separated by the | character, so the number of tags equals the number of | characters plus one.
ALTER TABLE youtube ADD COLUMN tags_num INT;
UPDATE youtube SET tags_num =
    LENGTH(tags) - LENGTH(REPLACE(tags, '|', '')) + 1;
3.9.2 Description Length
ALTER TABLE youtube ADD COLUMN description_len INT;
-- LENGTH counts bytes; use CHAR_LENGTH instead to count characters
UPDATE youtube SET description_len = LENGTH(description);
3.9.3 Trending Date Conversion
-- trending_date arrives as 'yy.dd.mm' (e.g. '17.14.11'); rewrite it as 'YYYY/MM/DD'
UPDATE youtube
SET trending_date = CONCAT('20', LEFT(trending_date, 2), '/',
                           RIGHT(trending_date, 2), '/',
                           MID(trending_date, 4, 2));
3.9.4 Splitting Publish Time into Date and Time
Split out the date:
-- publish_time arrives as an ISO timestamp, e.g. '2017-11-13T17:13:01.000Z'
ALTER TABLE youtube ADD COLUMN publish_date DATE;
UPDATE youtube
SET publish_date = CONCAT(LEFT(publish_time, 4), '-',
                          MID(publish_time, 6, 2), '-',
                          MID(publish_time, 9, 2));
Split out the time:
UPDATE youtube SET publish_time = MID(publish_time, 12, 8);
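Optionally, as an extra step not in the original walkthrough, the rewritten string columns can be converted to native types so that later calls such as DATEDIFF and HOUR work on real dates and times instead of relying on implicit string conversion:
-- assumes trending_date now holds 'YYYY/MM/DD' and publish_time 'HH:MM:SS'
ALTER TABLE youtube MODIFY COLUMN trending_date DATE;
ALTER TABLE youtube MODIFY COLUMN publish_time TIME;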
3.9.5 Trending Status
Before running the statements below, add an index to speed up the lookups:
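A minimal version (the statement is not shown in the original and the index name is my own; video_id and trending_date are the columns the queries below group and join on):
CREATE INDEX idx_video_trending ON youtube (video_id, trending_date);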
A video charted on consecutive days if its number of chart appearances equals its date gap plus one. The idea: for each video, find the maximum and minimum trending_date and subtract them to get the gap, count the video's chart appearances, and compare the two with IF.
SELECT e.video_id,
       e.times_on_chart,
       (f.gap + 1) AS span,
       IF(f.gap + 1 = e.times_on_chart, 'consecutive', 'non-consecutive') AS trend_status
FROM
(
    SELECT video_id, COUNT(*) AS times_on_chart
    FROM youtube
    GROUP BY video_id
) e
INNER JOIN
(
    SELECT video_id,
           DATEDIFF(MAX(trending_date), MIN(trending_date)) AS gap
    FROM youtube
    GROUP BY video_id
) f
ON e.video_id = f.video_id;
SELECT trend_status, COUNT(*) AS videos
FROM
(
    SELECT e.video_id,
           IF(f.gap + 1 = e.times_on_chart, 'consecutive', 'non-consecutive') AS trend_status
    FROM
    (
        SELECT video_id, COUNT(*) AS times_on_chart
        FROM youtube
        GROUP BY video_id
    ) e
    INNER JOIN
    (
        SELECT video_id,
               DATEDIFF(MAX(trending_date), MIN(trending_date)) AS gap
        FROM youtube
        GROUP BY video_id
    ) f
    ON e.video_id = f.video_id
) g
GROUP BY trend_status;
We find that 5,995 videos charted on consecutive days, about 97% of the total.
Part 4. Data Analysis and Visualization
4.1 What is the correlation between views and likes?
Since views, likes, dislikes and comments all broadly reflect how a video is doing, we run a correlation analysis across these metrics.
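The original does not include code for this step; a minimal in-database sketch for one pair of metrics, assuming the cleaned youtube table from Part 3, could be:
-- Pearson correlation between views and likes
-- (1.0 * forces decimal arithmetic to avoid integer overflow)
SELECT (AVG(1.0 * views * likes) - AVG(views) * AVG(likes))
       / (STDDEV_POP(views) * STDDEV_POP(likes)) AS r_views_likes
FROM youtube;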
Conclusion: there is a strong correlation between likes and views (and between likes and comments), suggesting that people engage more with videos they like. Dislikes also correlate with comments, reflecting people's urge to speak up when a video displeases them.
4.2 What kinds of videos chart most often?
SELECT b.category, COUNT(*) AS times_on_chart
FROM youtube a
INNER JOIN youtube_category b
ON a.category_id = b.id
GROUP BY b.category
ORDER BY times_on_chart;
Conclusion: Howto & Style, Music and Entertainment chart most often and deserve a closer look; the following sections drill further into per-video metrics.
4.3 How do like and comment ratios differ across categories?
Build the metrics:
SELECT b.category,
       ROUND(SUM(likes) / SUM(dislikes), 2) AS like_dislike_ratio,
       ROUND(SUM(comment_count) / SUM(views), 6) AS comment_rate
FROM youtube a
INNER JOIN youtube_category b
ON a.category_id = b.id
GROUP BY b.category;
Hypothesis: categories that chart often also have high like and comment rates.
Gather evidence:
Conclusion:
- The environment-related category charts often, but its like and comment rates are very low, suggesting these videos are not reaching a well-targeted audience and that creation in this category is poorly guided.
- Music is easy to understand and has a low barrier to discussion, so engagement and like rates are relatively high; people are tolerant of this kind of popular video.
- For how-to (tutorial) videos, the like rate and engagement are also worth noting.
4.4 Are more tags better? Are they related to trending?
The impact of tag counts and description length on views:
SELECT b.category,
       ROUND(AVG(a.tags_num), 0) AS avg_tags,
       ROUND(AVG(a.description_len), 0) AS avg_description_len,
       ROUND(AVG(a.views), 0) AS avg_views
FROM youtube a
INNER JOIN youtube_category b
ON a.category_id = b.id
GROUP BY b.category
ORDER BY avg_views;
Hypothesis: the more tags a video has, the more views it gets.
Gather evidence:
Conclusion: the hypothesis does not hold.
- Although tag counts fluctuate a little at low view counts, they tend overall to be high where views are low and to decline as views rise.
- We also find that Shows carry very many tags, because the category spans many formats: talk shows, variety shows, reality shows and so on. Nonprofit videos carry relatively few tags, reflecting that these videos stay plain and do not pile on flashy tags.
Hypothesis: the longer the description, the more views.
Gather evidence:
Conclusion: the hypothesis does not hold; a longer description does not win the audience over. We also find that tutorial and education videos have very long descriptions, with a clear peak: although tutorial videos are not especially popular, the nature of the content calls for long descriptions.
4.5 What factors relate to charting?
Hypothesis: a category's number of distinct videos is related to how often it charts.
SELECT c.category, c.video_count, d.times_on_chart
FROM
(
    SELECT b.category, COUNT(DISTINCT video_id) AS video_count
    FROM youtube a
    INNER JOIN youtube_category b
    ON a.category_id = b.id
    GROUP BY b.category
) c
INNER JOIN
(
    SELECT b.category, COUNT(trending_date) AS times_on_chart
    FROM youtube a
    INNER JOIN youtube_category b
    ON a.category_id = b.id
    GROUP BY b.category
) d
ON c.category = d.category;
Gather evidence:
Conclusion: the hypothesis holds. The more videos a category has, the more chances it has to chart.
4.6 Is there a relationship between the gap from publication to first charting and the number of chart appearances?
SELECT e.video_id, e.times_on_chart, d.gap
FROM
(
    SELECT DISTINCT a.video_id,
           TIMESTAMPDIFF(DAY, a.publish_date, b.first_trending_date) AS gap
    FROM youtube a
    INNER JOIN
    (
        SELECT video_id, MIN(trending_date) AS first_trending_date
        FROM youtube
        GROUP BY video_id
    ) b
    ON a.video_id = b.video_id
) d
INNER JOIN
(
    SELECT video_id, COUNT(*) AS times_on_chart
    FROM youtube
    GROUP BY video_id
) e
ON e.video_id = d.video_id;
Conclusion:
- Most videos that chart repeatedly first reach the chart 0 to 2 days after release.
- One video charted 4,215 days after release, likely a past video revived by current news.
4.7 Is there a golden time to publish a video?
4.7.1 Day of the Week
-- DAYOFWEEK returns 1 for Sunday through 7 for Saturday
SELECT DAYOFWEEK(publish_date) AS day_of_week,
       COUNT(trending_date) AS times_on_chart
FROM youtube
GROUP BY DAYOFWEEK(publish_date);
Gather evidence:
Conclusion:
The chart shows that videos published on Sunday have no advantage, while weekday publishes chart more often and fairly evenly. Since this is nationwide US data, the result runs against the intuition that people have more free time to browse on weekends. Because the dataset covers only charting videos and lacks overall video data for support, the conclusion may be somewhat one-sided.
Speculation 1: rather than browsing and watching videos, many Americans may spend their weekends outdoors or offline.
Speculation 2: YouTube may promote videos to Trending in a more concentrated way on weekends.
Speculation 3: video producers may make and publish their videos mainly on weekdays.
4.7.2 Hour of the Day
SELECT HOUR(publish_time) AS publish_hour,
       COUNT(trending_date) AS times_on_chart
FROM youtube
GROUP BY HOUR(publish_time);
Gather evidence:
Conclusion: publishes between 12:00 and 17:00 account for a relatively high share of trending videos, peaking at 17:00, after which the share declines to a low point around 7 a.m.
Speculation: this may relate to local working hours. People get off work at 5 or 6 p.m., and creators may want to publish before the end of the workday, but this still lacks supporting data, such as user-profile data.
Part 5. Summary & Suggestions
Suggestions:
Finally:
The analytical approach in this article draws on the Zhihu author Meiyu, as acknowledged here.
Both articles use the same Kaggle dataset; you can read that article as well and try the analysis with different tools.
This article is for learning and exchange only, not for commercial use; commercial reproduction is declined.