Danmu are the commentary subtitles that scroll across a video as it plays. I don't know whether you turn danmu on when watching videos, but for me they are a great complement to the content: a well-organized stream of comments. By analyzing danmu, we can quickly get a sense of the audience's views on a video.
Brother J drew the word cloud below from the danmu data of a video about "The Eight Hundred", and the result looks decent.
Massive danmu data can do more than feed a word cloud; it can also be sent to Baidu AI for sentiment analysis. So how do we get the danmu data in the first place? This article uses Python to crawl the danmu of videos on Bilibili, Tencent Video, Mango TV, and iQiyi, so that you can easily obtain danmu data from the mainstream video sites.
1. Bilibili danmu
1. Web page analysis
This section takes the video "Do you know how deceptive milk tea franchises really are?", released by the uploader Hardcore Half-Buddha Immortal, as an example; the real URL storing the danmu is found through the following steps.
A simple look at the URL parameters shows that the date parameter is the date the danmu were sent, while the other parameters stay the same. So we only need to vary the date parameter, then parse the returned danmu data with BeautifulSoup.
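The historical danmu API returns XML in which each comment sits in a `<d>` tag, which is why the crawler below calls `soup.find_all("d")`. A minimal sketch of that parsing step, using a made-up XML snippet in the same shape (the `p` attribute values here are illustrative, not real metadata):

```python
from bs4 import BeautifulSoup

# A made-up sample in the same shape as the Bilibili danmu XML:
# each <d> element holds one comment's text.
sample = ('<i><d p="12.3,1,25,16777215,1600916458,0,abc,123">first comment</d>'
          '<d p="45.6,1,25,16777215,1600916500,0,def,456">second comment</d></i>')

soup = BeautifulSoup(sample, "html.parser")
danmu = [d.text for d in soup.find_all("d")]
print(danmu)  # ['first comment', 'second comment']
```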
2. Crawler in action
import requests  # request the page data
from bs4 import BeautifulSoup  # parse the page
import pandas as pd
import time
from tqdm import trange  # progress bar

def get_bilibili_url(start, end):
    url_list = []
    date_list = [i for i in pd.date_range(start, end).strftime('%Y-%m-%d')]
    for date in date_list:
        url = f"api.bilibili.com/x/v2/dm/his… {date}"
        url_list.append(url)
    return url_list
def get_bilibili_danmu(url_list):
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
        "cookie": "copy your own"  # replace with your own cookie
    }
    file = open("bilibili_danmu.txt", 'w')
    for i in trange(len(url_list)):
        url = url_list[i]
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text)
        data = soup.find_all("d")  # each <d> tag holds one danmu
        danmu = [data[i].text for i in range(len(data))]
        for items in danmu:
            file.write(items)
            file.write("\n")
        time.sleep(3)
    file.close()
if __name__ == "__main__":
    start = '9/24/2020'  # start date of the danmu crawl
    end = '9/26/2020'    # end date of the danmu crawl
    url_list = get_bilibili_url(start, end)
    get_bilibili_danmu(url_list)
    print("Danmu crawl completed")
3. Preview data
2. Tencent Video danmu
1. Web page analysis
This section takes the danmu of the last episode of Talk Show Convention Season 3 as an example; the real URL storing the danmu is found through the following steps.
Stripping the parameters one by one, we find that only the timestamp parameter affects which danmu are returned, and that it forms an arithmetic sequence with first term 15 and common difference 30. It's a safe guess that Tencent Video serves a new page of danmu every 30 seconds; this video is 12,399 seconds long. The data comes back as standard JSON, so json.loads can parse it directly.
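Under this reading, the page timestamps can be generated with a plain range; a quick sketch (the 12,399-second duration is the one from the video above):

```python
# Tencent Video serves one danmu page every 30 seconds; the timestamp
# parameter runs 15, 45, 75, ... so one range covers the whole video.
duration = 12399  # video length in seconds
timestamps = list(range(15, duration, 30))

print(timestamps[:3])   # [15, 45, 75]
print(len(timestamps))  # 413 pages to request
```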
2. Crawler in action
import requests
import json
import time
import pandas as pd
df = pd.DataFrame()
for page in range(15, 12399, 30):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    url = 'mfm.video.qq.com/danmu?otype… '.format(page)
    print('crawling ' + str(page))
    html = requests.get(url, headers=headers)
    bs = json.loads(html.text, strict=False)  # strict=False tolerates control characters in the danmu text
    time.sleep(1)
    # loop through the comments and pull out the target fields
    for i in bs['comments']:
        content = i['content']             # danmu text
        upcount = i['upcount']             # like count
        user_degree = i['uservip_degree']  # membership level
        timepoint = i['timepoint']         # time the danmu appears
        comment_id = i['commentid']        # danmu id
        cache = pd.DataFrame({'danmu': [content], 'member level': [user_degree],
                              'release time': [timepoint], 'likes': [upcount], 'danmu id': [comment_id]})
        df = pd.concat([df, cache])
df.to_csv('tengxun_danmu.csv', encoding='utf-8')
print(df.shape)
3. Preview data
3. Mango TV danmu
1. Web page analysis
This section takes the danmu of the last episode of Sisters Riding the Waves as an example; the real URL storing the danmu is found through the following steps.
Analyzing the parameters, we find that Mango TV generates a sequence of JSON danmu files numbered 0, 1, 2, ..., and each JSON file stores all the danmu sent within one minute of the video. Since the data is plain JSON, parsing it is straightforward.
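In other words, the n-th JSON file holds the danmu for minute n of the video, so the number of files to request is just the duration rounded up to whole minutes. A tiny sketch, assuming a duration given in seconds:

```python
import math

# Mango TV exposes one JSON danmu file per minute of video,
# indexed 0, 1, 2, ..., so a video needs ceil(duration / 60) files.
duration = 5400  # e.g. a 90-minute episode, in seconds
num_files = math.ceil(duration / 60)
print(num_files)  # 90

file_indices = range(num_files)  # 0 .. 89, one request per file
```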
2. Crawler in action
import requests
import json
import pandas as pd
def get_mangguo_danmu(num1, num2, page):
    try:
        url = 'bullet-ws.hitv.com/bullet/2020… '
        print('crawling minute ' + str(page))
        danmuurl = url.format(num1, num2, page)
        res = requests.get(danmuurl)
        res.encoding = 'utf-8'
        # print(res.text)
        data = json.loads(res.text)
    except:
        print('unable to connect')
    details = []
    # loop over every danmu item in this minute's JSON file
    for i in range(len(data['data']['items'])):
        result = {}
        result['stype'] = num2  # video id
        result['id'] = data['data']['items'][i]['id']  # danmu id
        try:  # try to get the username
            result['uname'] = data['data']['items'][i]['uname']
        except:
            result['uname'] = ''
        result['content'] = data['data']['items'][i]['content']  # danmu text
        result['time'] = data['data']['items'][i]['time']  # time the danmu appears
        try:  # try to get the danmu like count
            result['v2_up_count'] = data['data']['items'][i]['v2_up_count']
        except:
            result['v2_up_count'] = ''
        details.append(result)
    return details
# enter the key information
def count_danmu():
    danmu_total = []
    num1 = input('first number: ')
    num2 = input('second number: ')
    page = int(input('video duration in minutes: '))
    for i in range(page):
        danmu_total.extend(get_mangguo_danmu(num1, num2, i))
    return danmu_total

def main():
    df = pd.DataFrame(count_danmu())
    df.to_csv('mangguo_danmu.csv')

if __name__ == '__main__':
    main()
3. Preview data
4. iQiyi danmu
1. Web page analysis
This section takes the danmu of episode 13 of Summer of Bands Season 2 as an example; the real URL storing the danmu is found through the following steps.
Analyzing the real danmu URL, we find that 5981449914376200 is the video's tvid; 62 is the first two of the tvid's last four digits, and 00 is its last two digits; the number before .z is the segment index, whose maximum equals the total video length divided by 300 seconds, rounded up. Comparing two adjacent danmu file packages confirms that iQiyi serves a new danmu file every 5 minutes.
Because the danmu file fetched directly by the crawler contains garbled characters, the binary content has to be decompressed and decoded to obtain the final danmu data.
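Those path pieces can be derived mechanically from the tvid. A sketch, assuming a 57-minute episode; the helper name is illustrative, and the values match the example in the analysis above:

```python
import math

def iqiyi_path_parts(tvid: str, duration: int):
    """Derive the danmu URL path pieces described above:
    the first two of the tvid's last four digits, its last two
    digits, and the number of 300-second danmu segments."""
    last_four = tvid[-4:]
    part1 = last_four[:2]                 # e.g. '62'
    part2 = last_four[2:]                 # e.g. '00'
    segments = math.ceil(duration / 300)  # one .z file per 5 minutes
    return part1, part2, segments

print(iqiyi_path_parts('5981449914376200', 57 * 60))  # ('62', '00', 12)
```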
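The decompression trick can be checked in isolation: passing wbits=15+32 tells zlib to auto-detect a zlib or gzip header when decompressing. A minimal round-trip sketch:

```python
import zlib

# Compress some text, then decompress it with wbits=15+32, which
# auto-detects zlib or gzip framing -- the same call the danmu
# decoder below applies to the downloaded .z files.
raw = "a danmu line".encode("utf-8")
packed = zlib.compress(raw)
text = zlib.decompress(bytearray(packed), 15 + 32).decode("utf-8")
print(text)  # a danmu line
```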
2. Crawler in action
import zlib
import requests
# 1. Crawl the xml files
def download_xml(url):
    bulletold = requests.get(url).content  # fetch the binary content
    return zipdecode(bulletold)

def zipdecode(bulletold):
    '''decode zlib-compressed binary content into text'''
    decode = zlib.decompress(bytearray(bulletold), 15 + 32).decode('utf-8')
    return decode

for x in range(1, 12):
    # the episode lasts 57 minutes and iQiyi loads a new danmu file
    # every 5 minutes: 57 divided by 5, rounded up
    url = 'cmts.iqiyi.com/bullet/62/0… _' + str(x) + '.z'
    xml = download_xml(url)
    # write the decoded content into xml files for later parsing
    with open('./aiqiyi/iqiyi' + str(x) + '.xml', 'a+', encoding='utf-8') as f:
        f.write(xml)
# 2. Read the danmu data from the xml files
from xml.dom.minidom import parse
import xml.dom.minidom

def xml_parse(file_name):
    DOMTree = xml.dom.minidom.parse(file_name)
    collection = DOMTree.documentElement
    # get all entry elements in the collection
    entrys = collection.getElementsByTagName("entry")
    print(entrys)
    result = []
    for entry in entrys:
        content = entry.getElementsByTagName('content')[0]
        print(content.childNodes[0].data)
        i = content.childNodes[0].data
        result.append(i)
    return result

with open("aiyiqi_danmu.txt", mode="w", encoding="utf-8") as f:
    for x in range(1, 12):
        l = xml_parse("./aiqiyi/iqiyi" + str(x) + ".xml")
        for line in l:
            f.write(line)
            f.write("\n")
3. Preview data
Getting the complete code