MOJITO

Article technology overview:

  1. Crawl the Bilibili bullet screen
  2. Draw a word cloud
  3. Intelligent sentiment analysis

I haven't written anything lately; every time I try to chase a trending topic I miss it, and I've been a little down on myself for it.

Jay Chou's new song "MOJITO" dropped at midnight on June 12, roughly a week ago. I opened the MV on Bilibili: the play count will break ten million with no problem, and the bullet comments have already passed one hundred thousand. The popularity really is something.

To be honest, the name "MOJITO" threw me at first; I had no idea what it meant when I first saw it.

But that was no big problem: there is nothing Baidu cannot solve, and if there is, add a Zhihu search on top.

It turns out MOJITO is simply the cocktail mojito (莫吉托 in Chinese).

The mojito is one of the most famous rum cocktails. It originated in Cuba. Traditionally, a mojito is made with five ingredients: light rum, sugar (traditionally from sugarcane juice), lime juice, soda water and mint. The original Cuban recipe uses spearmint or lemon mint, common on the island. The refreshing flavors of lime and mint complement the strength of the rum and make this clear, colorless concoction one of the hottest drinks of summer. The blend has a relatively low alcohol content (about 10%).

If the alcohol content is only around 10%, it can almost pass for a regular drink.

Of course, if you are driving, you should not treat a mojito as just a drink.

I have watched the whole MV several times, and apart from the title and the lyrics, the word "MOJITO" never actually appears in the video.

Crawl the Bilibili bullet screen

Crawling the bullet-comment data is fairly simple, so I will not walk through capturing the requests step by step. Just pay attention to the following request URLs.

Addresses for requesting the bullet comments:

https://api.bilibili.com/x/v1/dm/list.so?oid=XXX

https://comment.bilibili.com/XXX.xml

The first address can no longer be found in Chrome's Network panel because Bilibili's web page has since changed, but it still works. I found this address a while ago.

The second address came from a Baidu search. I have no idea who dug it up; it is listed here for reference.

Both of the bullet-comment addresses above require a parameter called oid, which can be obtained as follows.

First, there is a page-list interface:

https://api.bilibili.com/x/player/pagelist?bvid=XXX&jsonp=jsonp

This interface also comes from Chrome's Network panel. Its bvid parameter comes from the video address: for this "MOJITO" MV the address is https://www.bilibili.com/video/BV1PK4y1b7dt, so the bvid is the trailing BV1PK4y1b7dt.
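As a quick sanity check, the bvid can be pulled out of a video URL with plain string handling (a minimal sketch; the `extract_bvid` helper is my own name, not part of the original code):

```python
def extract_bvid(url: str) -> str:
    # The bvid is the last path segment of a Bilibili video URL
    return url.rstrip("/").split("/")[-1].split("?")[0]

print(extract_bvid("https://www.bilibili.com/video/BV1PK4y1b7dt"))
# BV1PK4y1b7dt
```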

Next, open https://api.bilibili.com/x/player/pagelist?bvid=BV1PK4y1b7dt&jsonp=jsonp in a browser and you can see the returned JSON, as follows:

{
    "code": 0,
    "message": "0",
    "ttl": 1,
    "data": [
        {
            "cid": 201056987,
            "page": 1,
            "from": "vupload",
            "part": "Jay-mojito_ Full MV(Updated)",
            "duration": 189,
            "vid": "",
            "weblink": "",
            "dimension": {
                "width": 1920,
                "height": 1080,
                "rotate": 0
            }
        }
    ]
}

Note: since this MV is a single full video, there is only one cid. If a video is posted in multiple parts, there will be multiple cids, each representing a different part.

This cid is exactly the oid we are looking for. Splice it into the earlier link to get https://api.bilibili.com/x/v1/dm/list.so?oid=201056987, enter that into the browser, and you can see the returned data: a text document in XML format.
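Each comment in that XML sits inside a `<d>` tag, so a regex is enough to pull out the text. A tiny demo on a made-up fragment (the `p` attribute values and comment texts below are invented for illustration):

```python
import re

# A made-up two-comment fragment in the same shape as the real XML response
sample = '<d p="1.2,1,25,16777215,0">前奏一响</d><d p="3.4,1,25,16777215,0">哎哟不错哦</d>'

# Non-greedy match: capture only the text between each <d ...> and </d>
pattern = re.compile('<d p=".*?">(.*?)</d>')
print(pattern.findall(sample))
# ['前奏一响', '哎哟不错哦']
```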

Source code is as follows:

import requests
import re

# Get the cid
res = requests.get("https://api.bilibili.com/x/player/pagelist?bvid=BV1PK4y1b7dt&jsonp=jsonp")
cid = res.json()['data'][0]['cid']

# Fetch the barrage XML and extract each comment with a regex
danmu_url = f"https://api.bilibili.com/x/v1/dm/list.so?oid={cid}"
result = requests.get(danmu_url).content.decode('utf-8')
pattern = re.compile('<d p=".*?">(.*?)</d>')
danmu_list = pattern.findall(result)

# Save the list of bullet comments to a TXT file
with open("dan_mu.txt", mode="w", encoding="utf-8") as f:
    for item in danmu_list:
        f.write(item)
        f.write("\n")

Here, I saved the captured bullet comments in dan_mu.txt for later analysis.

Draw a word cloud

The first step is to read dan_mu.txt and put its contents into a list:

# Read the bullet-comment TXT file
with open("dan_mu.txt", encoding="utf-8") as f:
    txt = f.read()
danmu_list = txt.split("\n")

The word segmentation tool I use here is jieba, the best-known Python Chinese word segmentation library. If you have not installed jieba, install it with the following command:

pip install jieba

Then use jieba to segment the danmu_list we just obtained:

import jieba

# Segment each comment with jieba
danmu_cut = [jieba.lcut(item) for item in danmu_list]

This gives us danmu_cut, the segmented result, which is also a list (one token list per comment).

Next, we remove the stop words from danmu_cut:

import pandas as pd

# Read the stop words
with open("baidu_stopwords.txt", encoding="utf-8") as f:
    stop = f.read()
stop_words = stop.split()

# Drop the stop words from each segmented comment
s_data_cut = pd.Series(danmu_cut)
all_words_after = s_data_cut.apply(lambda x: [i for i in x if i not in stop_words])

Here I introduced baidu_stopwords.txt, which is the Baidu stop word list. I found several commonly used Chinese stop word lists; source: github.com/goto456/sto… .

File name               Stop word list
baidu_stopwords.txt     Baidu stop word list
hit_stopwords.txt       Harbin Institute of Technology stop word list
scu_stopwords.txt       Sichuan University Machine Intelligence Laboratory stop word list
cn_stopwords.txt        Chinese stop word list

Here I use the Baidu stop word list. You can pick whichever list fits your needs, or merge several of them before use; the main purpose is to strip out words we do not care about. I will put these stop word lists in the code repository for anyone who needs them.
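The filtering step itself is just a membership test per token. A minimal stdlib-only sketch (the stop words and comments below are invented, standing in for baidu_stopwords.txt and the real segmented data):

```python
# A toy stop list standing in for baidu_stopwords.txt (entries invented)
stop_words = {'的', '了', '是'}

# Segmented comments, one token list per comment (invented examples)
danmu_cut = [['周杰伦', '的', '新歌'], ['真', '的', '是', '爷青回']]

# Keep only tokens that are not stop words
all_words_after = [[w for w in item if w not in stop_words] for item in danmu_cut]
print(all_words_after)
# [['周杰伦', '新歌'], ['真', '爷青回']]
```

Using a set for the stop words makes each lookup O(1), which matters once the list has thousands of entries.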

Then we count the word frequencies after removing the stop words:

# Word frequency statistics
all_words = []
for i in all_words_after:
    all_words.extend(i)
word_count = pd.Series(all_words).value_counts()
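The same frequency count can also be done without pandas using the standard library's `collections.Counter` (a minimal sketch on invented tokens):

```python
from collections import Counter

# Flattened token list after stop-word removal (tokens invented for illustration)
all_words = ['Mojito', '爷青回', 'Mojito', '爷青回', 'Mojito']

# Counter tallies occurrences; most_common() sorts by frequency, descending
word_count = Counter(all_words)
print(word_count.most_common())
# [('Mojito', 3), ('爷青回', 2)]
```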

The last step is to generate our final result, the word cloud:

import wordcloud

# Generate the word cloud from the frequency counts
wordcloud.WordCloud(
    font_path='msyh.ttc',
    background_color="#fff",
    max_words=1000,
    max_font_size=200,
    random_state=42,
    width=900,
    height=1600
).fit_words(word_count).to_file("wordcloud.png")

The end result is this:

As you can see from the word cloud above, fans have true love for "MOJITO", with "ah-ah-ah" and "Ai-fan" appearing most frequently.

Of course, the "fan" here could also refer to the sexy pink vintage car in the music video.

Another high-frequency phrase is "ye qing hui", meaning "my youth is back". Indeed, Jay has accompanied people of my age all the way. As a "senior" born in 1979 and now 41, looking back on those years makes one sigh.

Back then, "Nunchucks" was a hit all over China, and the video stores on the street looped these songs all day. For my generation of students, basically everyone could hum a couple of lines; "quick, use the nunchucks, hng hng ha-hey" became our generation's shared memory.

Intelligent sentiment analysis

We can also run a sentiment analysis on the bullet comments. Here, I use the sentiment analysis interface of the Baidu AI Open Platform.

Baidu AI Open Platform document address: ai.baidu.com/ai-doc/NLP/…

The first step is to connect to the Baidu AI Open Platform as described in the documentation and obtain an access_token. The code is as follows:

# Get the Baidu API access_token
# grant_type, client_id and client_secret come from your registered application
access_token_url = f'https://aip.baidubce.com/oauth/2.0/token?grant_type={grant_type}&client_id={client_id}&client_secret={client_secret}'

res = requests.post(access_token_url)

access_token = res.json()['access_token']

# General sentiment interface
# sentiment_url = f'https://aip.baidubce.com/rpc/2.0/nlp/v1/sentiment_classify?charset=UTF-8&access_token={access_token}'
# Custom sentiment interface
sentiment_url = f'https://aip.baidubce.com/rpc/2.0/nlp/v1/sentiment_classify_custom?charset=UTF-8&access_token={access_token}'

The Baidu AI Open Platform offers two sentiment analysis interfaces: a general one and a custom one. I used a custom interface I had already trained; if you have no custom model, the general interface works fine.

The grant_type, client_id and client_secret parameters used above are obtained by registering an application. These interfaces have call quotas on the Baidu AI Open Platform, but they are more than enough for personal use.
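With placeholder credentials filled in (the values below are obviously not real keys; grant_type is fixed to "client_credentials" per Baidu's OAuth documentation), the token URL from the snippet above assembles like this:

```python
grant_type = "client_credentials"       # fixed value for this OAuth flow
client_id = "your_api_key_here"         # placeholder: your application's API Key
client_secret = "your_secret_key_here"  # placeholder: your application's Secret Key

access_token_url = (
    "https://aip.baidubce.com/oauth/2.0/token"
    f"?grant_type={grant_type}&client_id={client_id}&client_secret={client_secret}"
)
print(access_token_url)
```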

Then read the barrage text we just saved:

with open("dan_mu.txt", encoding="utf-8") as f:
    txt = f.read()
danmu_list = txt.split("\n")

Before calling the interface, we need one more pass over the bullet comments: some of them contain emojis, and sending emojis directly to the Baidu interface returns an error. Here, I use another package to handle the emojis.

Install the emoji kit first:

pip install emoji

It is very simple to use. We run the bullet-comment data through emoji once:

import emoji

with open("dan_mu.txt", encoding="utf-8") as f:
    txt = f.read()
danmu_list = txt.split("\n")

for item in danmu_list:
    print(emoji.demojize(item))

There are emojis like this in our barrage data:

# Before processing:
❤ ❤ ❤ ❤ ❤ ❤ ❤
# After processing:
:red_heart::red_heart::red_heart::red_heart::red_heart::red_heart::red_heart:

Then we can call Baidu's sentiment analysis interface to analyze our bullet-comment data:

import time
import emoji
import requests
from pyecharts.charts import Pie
from pyecharts import options as opts

# Sentiment counters
optimistic = 0
neutral = 0
pessimistic = 0

for danmu in danmu_list:
    # Due to the QPS limit, wait 0.5s between calls
    time.sleep(0.5)
    req_data = {
        'text': emoji.demojize(danmu)
    }
    # Call the sentiment analysis interface
    if len(danmu) > 0:
        r = requests.post(sentiment_url, json=req_data)
        print(r.json())
        for item in r.json()['items']:
            if item['sentiment'] == 2:
                # Positive sentiment
                optimistic += 1
            if item['sentiment'] == 1:
                # Neutral sentiment
                neutral += 1
            if item['sentiment'] == 0:
                # Negative sentiment
                pessimistic += 1

print('Positive sentiment:', optimistic)
print('Neutral sentiment:', neutral)
print('Negative sentiment:', pessimistic)

attr = ['Positive sentiment', 'Neutral sentiment', 'Negative sentiment']
value = [optimistic, neutral, pessimistic]

c = (
    Pie()
    .add("", [list(z) for z in zip(attr, value)])
    .set_global_opts(title_opts=opts.TitleOpts(title="'MOJITO' Barrage Sentiment Analysis"))
    .render("pie_base.html")
)
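The counting logic above can be exercised offline by faking a few interface responses (the responses below are invented and carry only the `sentiment` field; the real payload has more fields, such as confidence scores):

```python
# Faked responses in the same shape as the interface's 'items' payload (invented)
fake_responses = [
    {'items': [{'sentiment': 2}]},  # positive
    {'items': [{'sentiment': 2}]},  # positive
    {'items': [{'sentiment': 1}]},  # neutral
    {'items': [{'sentiment': 0}]},  # negative
]

# Tally each sentiment label
counts = {2: 0, 1: 0, 0: 0}
for resp in fake_responses:
    for item in resp['items']:
        counts[item['sentiment']] += 1

optimistic, neutral, pessimistic = counts[2], counts[1], counts[0]
print(optimistic, neutral, pessimistic)
# 2 1 1
```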

The resulting graph looks like this:

From the final result, positive sentiment accounts for about two-thirds, while negative sentiment is under a quarter. It seems most people are still full of excitement when they see a new Jay Chou song.

That said, this data is not very accurate and serves as a rough reference at most.

The source code

Students who need source code can reply “MOJITO” in the background of the public account to obtain.