Danmu are the commentary subtitles that scroll across a video as it plays. I don't know whether you turn danmu on when watching videos, but for me they are a great complement to the content: a well-organized stream of comments. By analyzing danmu, we can quickly get a sense of the audience's views on a video.
Brother J drew the word cloud below from the danmu data of a video about "The Eight Hundred", and the result looks decent.
Massive danmu data can do more than feed a word cloud; it can also be sent to Baidu AI for sentiment analysis. So how do we get the danmu data in the first place? This article uses Python to crawl the danmu of videos on Bilibili, Tencent Video, Mango TV, and iQiyi, so that you can easily obtain danmu data from the mainstream video sites.
1. Bilibili danmu
1. Web page analysis
This section takes the video "Do you know how deceptive milk tea franchises really are?", released by the uploader Hardcore Half-Buddha Immortal, as an example; the real URL storing the danmu is found through the following steps.
A simple look at the URL parameters shows that the date parameter is the date the danmu were sent, while the other parameters stay the same. So we only need to vary the date parameter, then parse the returned danmu data with BeautifulSoup.
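The historical danmu API returns XML in which each comment sits in a `<d>` tag, which is why the crawler below calls `soup.find_all("d")`. A minimal sketch of that parsing step, using a made-up XML snippet in the same shape (the `p` attribute values here are illustrative, not real metadata):

```python
from bs4 import BeautifulSoup

# A made-up sample in the same shape as the Bilibili danmu XML:
# each <d> element holds one comment's text.
sample = ('<i><d p="12.3,1,25,16777215,1600916458,0,abc,123">first comment</d>'
          '<d p="45.6,1,25,16777215,1600916500,0,def,456">second comment</d></i>')

soup = BeautifulSoup(sample, "html.parser")
danmu = [d.text for d in soup.find_all("d")]
print(danmu)  # ['first comment', 'second comment']
```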
2. Crawler in action
import requests  # request the page data
from bs4 import BeautifulSoup  # parse the page
import pandas as pd
import time
from tqdm import trange  # progress bar

def get_bilibili_url(start, end):
    url_list = []
    date_list = [i for i in pd.date_range(start, end).strftime('%Y-%m-%d')]
    for date in date_list:
        url = f"api.bilibili.com/x/v2/dm/his… {date}"
        url_list.append(url)
    return url_list
def get_bilibili_danmu(url_list):
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
        "cookie": "copy your own"  # replace with your own cookie
    }
    file = open("bilibili_danmu.txt", 'w')
    for i in trange(len(url_list)):
        url = url_list[i]
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text)
        data = soup.find_all("d")  # each <d> tag holds one danmu
        danmu = [data[i].text for i in range(len(data))]
        for items in danmu:
            file.write(items)
            file.write("\n")
        time.sleep(3)
    file.close()
if __name__ == "__main__":
    start = '9/24/2020'  # start date of the danmu crawl
    end = '9/26/2020'    # end date of the danmu crawl
    url_list = get_bilibili_url(start, end)
    get_bilibili_danmu(url_list)
    print("Danmu crawl completed")
3. Preview data
2. Tencent Video danmu
1. Web page analysis
This section takes the danmu of the last episode of Talk Show Convention Season 3 as an example; the real URL storing the danmu is found through the following steps.
Stripping the parameters one by one, we find that only the timestamp parameter affects which danmu are returned, and that it forms an arithmetic sequence with first term 15 and common difference 30. It's a safe guess that Tencent Video serves a new page of danmu every 30 seconds; this video is 12,399 seconds long. The data comes back as standard JSON, so json.loads can parse it directly.
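Under this reading, the page timestamps can be generated with a plain range; a quick sketch (the 12,399-second duration is the one from the video above):

```python
# Tencent Video serves one danmu page every 30 seconds; the timestamp
# parameter runs 15, 45, 75, ... so one range covers the whole video.
duration = 12399  # video length in seconds
timestamps = list(range(15, duration, 30))

print(timestamps[:3])   # [15, 45, 75]
print(len(timestamps))  # 413 pages to request
```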
2. Crawler in action
import requests
import json
import time
import pandas as pd
df = pd.DataFrame()
for page in range(15, 12399, 30):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    url = 'mfm.video.qq.com/danmu?otype… '.format(page)
    print('crawling ' + str(page))
    html = requests.get(url, headers=headers)
    bs = json.loads(html.text, strict=False)  # strict=False tolerates control characters in the danmu text
    time.sleep(1)
    # loop through the comments and pull out the target fields
    for i in bs['comments']:
        content = i['content']             # danmu text
        upcount = i['upcount']             # like count
        user_degree = i['uservip_degree']  # membership level
        timepoint = i['timepoint']         # time the danmu appears
        comment_id = i['commentid']        # danmu id
        cache = pd.DataFrame({'danmu': [content], 'member level': [user_degree],
                              'release time': [timepoint], 'likes': [upcount], 'danmu id': [comment_id]})
        df = pd.concat([df, cache])
df.to_csv('tengxun_danmu.csv', encoding='utf-8')
print(df.shape)
3. Preview data
3. Mango TV danmu
1. Web page analysis
This section takes the danmu of the last episode of Sisters Riding the Waves as an example; the real URL storing the danmu is found through the following steps.
Analyzing the parameters, we find that Mango TV generates a sequence of JSON danmu files numbered 0, 1, 2, ..., and each JSON file stores all the danmu sent within one minute of the video. Since the data is plain JSON, parsing it is straightforward.
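In other words, the n-th JSON file holds the danmu for minute n of the video, so the number of files to request is just the duration rounded up to whole minutes. A tiny sketch, assuming a duration given in seconds:

```python
import math

# Mango TV exposes one JSON danmu file per minute of video,
# indexed 0, 1, 2, ..., so a video needs ceil(duration / 60) files.
duration = 5400  # e.g. a 90-minute episode, in seconds
num_files = math.ceil(duration / 60)
print(num_files)  # 90

file_indices = range(num_files)  # 0 .. 89, one request per file
```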
2. Crawler in action
import requests
import json
import pandas as pd
def get_mangguo_danmu(num1, num2, page):
    try:
        url = 'bullet-ws.hitv.com/bullet/2020… '
        print('crawling minute ' + str(page))
        danmuurl = url.format(num1, num2, page)
        res = requests.get(danmuurl)
        res.encoding = 'utf-8'
        # print(res.text)
        data = json.loads(res.text)
    except:
        print('unable to connect')
    details = []
    # loop over every danmu item in this minute's JSON file
    for i in range(len(data['data']['items'])):
        result = {}
        result['stype'] = num2  # video id
        result['id'] = data['data']['items'][i]['id']  # danmu id
        try:  # try to get the username
            result['uname'] = data['data']['items'][i]['uname']
        except:
            result['uname'] = ''
        result['content'] = data['data']['items'][i]['content']  # danmu text
        result['time'] = data['data']['items'][i]['time']  # time the danmu appears
        try:  # try to get the danmu like count
            result['v2_up_count'] = data['data']['items'][i]['v2_up_count']
        except:
            result['v2_up_count'] = ''
        details.append(result)
    return details
# enter the key information
def count_danmu():
    danmu_total = []
    num1 = input('first number: ')
    num2 = input('second number: ')
    page = int(input('video duration in minutes: '))
    for i in range(page):
        danmu_total.extend(get_mangguo_danmu(num1, num2, i))
    return danmu_total

def main():
    df = pd.DataFrame(count_danmu())
    df.to_csv('mangguo_danmu.csv')

if __name__ == '__main__':
    main()
3. Preview data
4. iQiyi danmu
1. Web page analysis
This section takes the danmu of episode 13 of Summer of Bands Season 2 as an example; the real URL storing the danmu is found through the following steps.
Analyzing the real danmu URL, we find that 5981449914376200 is the video's tvid; 62 is the first two of the tvid's last four digits, and 00 is its last two digits; the number before .z is the segment index, whose maximum equals the total video length divided by 300 seconds, rounded up. Comparing two adjacent danmu file packages confirms that iQiyi serves a new danmu file every 5 minutes.
Because the danmu file fetched directly by the crawler contains garbled characters, the binary content has to be decompressed and decoded to obtain the final danmu data.
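Those path pieces can be derived mechanically from the tvid. A sketch, assuming a 57-minute episode; the helper name is illustrative, and the values match the example in the analysis above:

```python
import math

def iqiyi_path_parts(tvid: str, duration: int):
    """Derive the danmu URL path pieces described above:
    the first two of the tvid's last four digits, its last two
    digits, and the number of 300-second danmu segments."""
    last_four = tvid[-4:]
    part1 = last_four[:2]                 # e.g. '62'
    part2 = last_four[2:]                 # e.g. '00'
    segments = math.ceil(duration / 300)  # one .z file per 5 minutes
    return part1, part2, segments

print(iqiyi_path_parts('5981449914376200', 57 * 60))  # ('62', '00', 12)
```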
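The decompression trick can be checked in isolation: passing wbits=15+32 tells zlib to auto-detect a zlib or gzip header when decompressing. A minimal round-trip sketch:

```python
import zlib

# Compress some text, then decompress it with wbits=15+32, which
# auto-detects zlib or gzip framing -- the same call the danmu
# decoder below applies to the downloaded .z files.
raw = "a danmu line".encode("utf-8")
packed = zlib.compress(raw)
text = zlib.decompress(bytearray(packed), 15 + 32).decode("utf-8")
print(text)  # a danmu line
```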
2. Crawler in action
import zlib
import requests
# 1. Crawl the xml files
def download_xml(url):
    bulletold = requests.get(url).content  # fetch the binary content
    return zipdecode(bulletold)

def zipdecode(bulletold):
    '''decode zlib-compressed binary content into text'''
    decode = zlib.decompress(bytearray(bulletold), 15 + 32).decode('utf-8')
    return decode

for x in range(1, 12):
    # the episode lasts 57 minutes and iQiyi loads a new danmu file
    # every 5 minutes: 57 divided by 5, rounded up
    url = 'cmts.iqiyi.com/bullet/62/0… _' + str(x) + '.z'
    xml = download_xml(url)
    # write the decoded content into xml files for later parsing
    with open('./aiqiyi/iqiyi' + str(x) + '.xml', 'a+', encoding='utf-8') as f:
        f.write(xml)
# 2. Read the danmu data from the xml files
from xml.dom.minidom import parse
import xml.dom.minidom

def xml_parse(file_name):
    DOMTree = xml.dom.minidom.parse(file_name)
    collection = DOMTree.documentElement
    # get all entry elements in the collection
    entrys = collection.getElementsByTagName("entry")
    print(entrys)
    result = []
    for entry in entrys:
        content = entry.getElementsByTagName('content')[0]
        print(content.childNodes[0].data)
        i = content.childNodes[0].data
        result.append(i)
    return result

with open("aiyiqi_danmu.txt", mode="w", encoding="utf-8") as f:
    for x in range(1, 12):
        l = xml_parse("./aiqiyi/iqiyi" + str(x) + ".xml")
        for line in l:
            f.write(line)
            f.write("\n")
3. Preview data
Getting the complete code