This is my sixth day of the August Challenge
Crawling web-drama bullet comments with Python, with simple visualization (source code attached)
Today's goal: crawl 201,865 bullet comments from a web drama.
Tools
Development environment:
Windows 10, Python 3.6
Development tools:
PyCharm
Related modules:
requests, stylecloud
Approach
1. Fetching the data with a crawler
iQiyi serves its bullet comments as compressed files with a .z extension. First obtain the list of tvids, then download the compressed bullet-comment files for each tvid, and finally decompress and store them.
import zlib

import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_data(tv_name, tv_id):
    # tv_id is expected to be a string; its last four digits form two path segments
    url = 'https://cmts.iqiyi.com/bullet/{}/{}/{}_300_{}.z'
    columns = ['uid', 'contentsId', 'contents', 'likeCount']
    datas = pd.DataFrame(columns=columns)
    for i in range(1, 20):
        myUrl = url.format(tv_id[-4:-2], tv_id[-2:], tv_id, i)
        print(myUrl)
        res = requests.get(myUrl)
        if res.status_code != 200:
            # Stop at the first part that cannot be downloaded
            break
        # Decompress the .z payload into XML text
        xml = zlib.decompress(bytearray(res.content)).decode('utf-8')
        bs = BeautifulSoup(xml, "xml")
        data = pd.DataFrame(columns=columns)
        data['uid'] = [tag.text for tag in bs.findAll('uid')]
        data['contentsId'] = [tag.text for tag in bs.findAll('contentId')]
        data['contents'] = [tag.text for tag in bs.findAll('content')]
        data['likeCount'] = [tag.text for tag in bs.findAll('likeCount')]
        datas = pd.concat([datas, data], ignore_index=True)
    # Tag every row with the episode it came from
    datas['tv_name'] = str(tv_name)
    return datas
In total, 201,865 bullet comments were collected.
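To make the URL scheme and the decompression step easier to follow, here is a minimal offline sketch. The helper names build_bullet_url and parse_bullets are hypothetical, the sample payload is synthetic, and the standard-library XML parser stands in for BeautifulSoup; a real iQiyi response may be structured differently and contain more fields.

```python
import zlib
import xml.etree.ElementTree as ET

URL_TEMPLATE = 'https://cmts.iqiyi.com/bullet/{}/{}/{}_300_{}.z'


def build_bullet_url(tv_id, part):
    """Fill the URL template the same way get_data does (tv_id is a string)."""
    return URL_TEMPLATE.format(tv_id[-4:-2], tv_id[-2:], tv_id, part)


def parse_bullets(compressed):
    """Decompress a .z payload and collect the text of each field by tag name."""
    xml_text = zlib.decompress(compressed).decode('utf-8')
    root = ET.fromstring(xml_text)
    tags = ('uid', 'contentId', 'content', 'likeCount')
    return {tag: [el.text for el in root.iter(tag)] for tag in tags}


# Synthetic payload (an assumption, not a captured response)
sample_xml = (
    '<danmu><bulletInfo><contentId>1</contentId><content>hello</content>'
    '<likeCount>3</likeCount><uid>42</uid></bulletInfo></danmu>'
)
payload = zlib.compress(sample_xml.encode('utf-8'))

print(build_bullet_url('1234567800', 1))
print(parse_bullets(payload))
```

Testing against a synthetic payload like this lets you check the parsing logic without sending any requests.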
2. Finding the “barrage launcher”
Group the data by user id and count the bullet-comment ids to get the total number of bullet comments sent by each user.
# Number of bullet comments sent by each user
danmu_counts = df.groupby('uid')['contentsId'].count().sort_values(ascending=False).reset_index()
danmu_counts.columns = ['user id', 'Total number of bullet comments sent']
danmu_counts.head()
The top user posted 2,561 bullet comments, and this is only a 12-episode web drama.
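The group-and-count step can be tried on a tiny hand-made table. This is only a sketch: the user ids and comment ids below are invented, and the column label 'comments sent' is my own choice.

```python
import pandas as pd

# Invented sample data: six comments from three users
toy = pd.DataFrame({
    'uid': ['u1', 'u2', 'u1', 'u3', 'u1', 'u2'],
    'contentsId': ['c1', 'c2', 'c3', 'c4', 'c5', 'c6'],
})

# Count comment ids per user, most active user first
counts = (
    toy.groupby('uid')['contentsId']
       .count()
       .sort_values(ascending=False)
       .reset_index()
)
counts.columns = ['user id', 'comments sent']
print(counts)
```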
# This comparison assumes uid is stored as an integer in df
df_top1 = df[df['uid'] == 1810351987].sort_values(by="likeCount", ascending=False).reset_index()
df_top1.head(10)
Each bullet comment reflects the viewer's feelings in the moment; perhaps this user simply fires off comments while watching the show.
How many bullet comments did this “barrage launcher” send in each episode?
Does the picture above suggest that some episodes have more dramatic conflict and are therefore more likely to provoke commentary?
Comrade “barrage launcher”, please step up your output for episodes 11 and 12!
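The per-episode counts for a single user could be computed along these lines. This is a sketch on invented data; the helper name comments_per_episode is hypothetical, and the column names follow the DataFrame built by get_data.

```python
import pandas as pd

# Invented sample data: two users commenting across three episodes
toy = pd.DataFrame({
    'uid': ['u1', 'u1', 'u2', 'u1', 'u2', 'u1'],
    'tv_name': ['1', '1', '1', '2', '2', '3'],
    'contentsId': ['c1', 'c2', 'c3', 'c4', 'c5', 'c6'],
})


def comments_per_episode(frame, uid):
    """Count one user's bullet comments in each episode."""
    mine = frame[frame['uid'] == uid]
    return mine.groupby('tv_name')['contentsId'].count()


per_episode = comments_per_episode(toy, 'u1')
print(per_episode)
```

If matplotlib is installed, calling per_episode.plot(kind='bar') is one way to turn the result into a bar chart.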
Setting the “barrage launcher” aside, let's keep exploring the bullet comments of the different episodes, starting with the ones everyone agrees with.
For each episode, which bullet comment received the most likes?
# Keep the row ranked first by likeCount within each episode
df_like = df[df.groupby(['tv_name'])['likeCount'].rank(method="first", ascending=False) == 1].reset_index()[['tv_name', 'contents', 'likeCount']]
df_like.columns = ['episode', 'bullet comment', 'likes']
df_like
The most-liked bullet comment of each episode is a condensed version of that episode; these are the lines the audience voted for.
If you have to, I’ll treat you to a mountain climb
The rising sun rises east
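The "most-liked comment per episode" selection can be checked on a small table. This sketch uses invented data and assumes the like counts arrive as text, as they do when scraped from XML; in that case they should be converted to numbers before ranking.

```python
import pandas as pd

# Invented sample data; scraped like counts arrive as text
toy = pd.DataFrame({
    'tv_name': ['1', '1', '1', '2', '2'],
    'contents': ['a', 'b', 'c', 'd', 'e'],
    'likeCount': ['9', '10', '2', '5', '30'],
})

# Convert to numbers first; compared as text, '9' is greater than '10'
toy['likeCount'] = pd.to_numeric(toy['likeCount'])

# Rank within each episode and keep the row ranked first
is_top = toy.groupby('tv_name')['likeCount'].rank(method='first', ascending=False) == 1
top_liked = toy.loc[is_top, ['tv_name', 'contents', 'likeCount']].reset_index(drop=True)
print(top_liked)
```

With method='first', ties are broken by row order, so exactly one comment is kept per episode.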
In addition to the script and music, the acting of both the veteran actors and the young actors has been praised by netizens.
Despite its short 12-episode run, the show's storyline isn't just about one or two people. Everyone has a backstory, and a series of coincidences keeps the audience discussing.
Let's count how often each character appears in the bullet comments to see which characters from the show are mentioned the most.
# Each value is a regex of alternative names or nicknames for the character
a = {'Zhang Dongsheng': 'Dongsheng|Qin Hao|Teacher Zhang', 'Zhu Chaoyang': 'Chaoyang', 'Yan Liang': 'Yan Liang', 'Pupu': 'Pupu',
     'Zhu Yongping': 'Zhu Yongping', 'Zhou Chunhong': 'Chunhong|big lady', 'Wang Yao': 'Wang Yao', 'Xu Jing': 'Xu Jing|Huang Miyi',
     'Chen Guansheng': 'Wang Jingchun|Chen|Chen Guansheng', 'Ye Jun': 'Ye Jun|pickup truck', 'Director Ma': 'Director Ma|Old Ma',
     'Zhu Jingjing': 'Jingjing', 'Ye Chimin': 'Ye Chimin'}
for key, value in a.items():
    df[key] = df['contents'].str.contains(value)
staff_count = pd.Series({key: df.loc[df[key], 'contentsId'].count() for key in a.keys()}).sort_values()
Among the three children, I was surprised that Zhu Chaoyang was mentioned so rarely; his count should be roughly the same as the other two.
After reviewing the source data again, there really are very few direct mentions of Zhu Chaoyang, because most comments refer to him as “top student”, “son” and so on.
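The effect of nicknames on the count can be illustrated with a small sketch. The comments below are invented, the helper name count_mentions is hypothetical, and the aliases are only examples of the kind of terms described above.

```python
import pandas as pd

# Invented sample comments
toy = pd.DataFrame({'contents': [
    'Chaoyang is so calm',
    'the top student knows everything',
    'poor son',
    'Yan Liang is loyal',
]})


def count_mentions(frame, pattern):
    """Count comments whose text matches a regex of alternative names."""
    return int(frame['contents'].str.contains(pattern).sum())


name_only = count_mentions(toy, 'Chaoyang')
# Short aliases such as 'son' can also match inside longer words, so use them with care
with_aliases = count_mentions(toy, 'Chaoyang|top student|son')
print(name_only, with_aliases)
```

Adding aliases to a character's pattern raises the count, which would explain why a pattern containing only the name understates how often the character is discussed.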
4. Word cloud
As everyone knows, no data-analysis article is complete without a word cloud.
I try to make each article's word cloud look different from the last one; this time I'm using stylecloud, a more polished package built on top of wordcloud.
import stylecloud
from IPython.display import Image

# text1 is the list of words extracted from the bullet comments
stylecloud.gen_stylecloud(text=' '.join(text1), collocations=False,
                          font_path=r'C:\Windows\Fonts\msyh.ttc',
                          icon_name='fas fa-play-circle', size=400,
                          output_name='Hidden Corners - Word clouds.png')
Image(filename='Hidden Corners - Word clouds.png')
Besides the protagonists' names, in a drama whose theme is “children”, discussion of the children's thoughts and behavior takes up an important share. In addition, everyone in the cast, from the veteran actors to the young children, delivered outstanding performances, so praise for the acting has become high-frequency vocabulary.
And the “mountain climbing” meme, the one that spread furthest beyond the show's own audience, is mentioned frequently.
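Claims about high-frequency words can be checked directly by counting tokens. This is a minimal sketch on invented English comments; splitting on whitespace is an assumption that only suits space-delimited text, and Chinese comments would need a word segmenter before counting.

```python
from collections import Counter

# Invented sample comments
comments = [
    'great acting from the kids',
    'the acting is great',
    'mountain climbing again',
]

# Flatten all comments into one token list, then count each token
tokens = [word for line in comments for word in line.split()]
freq = Counter(tokens)
print(freq.most_common(3))
```

A token list like this one plays the same role as the word list that is joined with spaces and passed to the word cloud generator.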
From “The Sin of the Unlicensed” to “Hidden Corners”, suspense crime dramas have shown that the genre does have a market today. Winning both popularity and a strong reputation takes more than promotion and marketing: as more and more teams focus on polishing high-quality series, audiences become willing to pay for the shows, and memes like “mountain climbing” spread step by step beyond their original audience.
That's the end of this article; thank you for reading. The next article in this Python data analysis series will cover crawling the “classic quotations” of Mr. Lu Xun with Python.