Source: Tencent Cloud. Author: user 7760819
Python crawler: a word cloud for The Eight Hundred
Overview
Crawl the Douban short comments on the film The Eight Hundred.
Approach
Fetch the comment pages with requests, extract the comment text with re, segment it with jieba, and draw the word cloud with wordcloud.
Code
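The parsing step can be tried offline before touching the network. The sketch below applies the same pattern the article uses to a small, made-up HTML fragment; SAMPLE_HTML and extract_comments are hypothetical names introduced only for illustration, and the re.S flag is an addition (not in the original code) so that comments containing line breaks are also matched.

```python
import re

# Hypothetical fragment imitating the markup the article's pattern targets.
SAMPLE_HTML = '''
<p class="comment-content"><span class="short">First comment</span></p>
<p class="comment-content"><span class="short">Second comment,
spanning two lines</span></p>
'''

# re.S lets "." match newlines, so multi-line comments are captured too.
SHORT_COMMENT = re.compile(r'<span class="short">(.*?)</span>', re.S)


def extract_comments(html):
    """Return the text of every <span class="short"> element in html."""
    return [c.strip() for c in SHORT_COMMENT.findall(html)]


comments = extract_comments(SAMPLE_HTML)
print(comments)
```

A regular expression is enough for a single, stable tag like this one; for anything more nested, an HTML parser would probably be the safer choice.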
# Import packages
import csv
import re

import jieba
import requests
import wordcloud

# Implement a multi-page crawler with a loop.
# Observe the page link pattern:
# https://movie.douban.com/subject/26754233/comments?start=0&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=20&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=40&limit=20&sort=new_score&status=P
# https://movie.douban.com/subject/26754233/comments?start=60&limit=20&sort=new_score&status=P
# Each page holds 20 comments and start begins at 0, so step by 20.
page = []
for i in range(0, 80, 20):
    page.append(i)

# The file name at the end of this path is lost in the source; substitute your own.
csv_path = r'd:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\...'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'
}
# Compile the pattern once; it captures the text of each short comment.
res = re.compile('<span class="short">(.*?)</span>')

with open(csv_path, 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for i in page:
        url = 'https://movie.douban.com/subject/26754233/comments?start=' + str(i) + '&limit=20&sort=new_score&status=P'
        resp = requests.get(url, headers=headers)
        html = resp.text
        duanpin = re.findall(res, html)
        # Save data: one comment per CSV row
        for duan in duanpin:
            writer.writerow([duan])

# Read the saved comments back and segment them into words
with open(csv_path, encoding='utf-8') as f:
    txt = f.read()
txt_list = jieba.lcut(txt)
string = ' '.join(txt_list)

# Draw the word cloud (msyh.ttc is a font that can render Chinese text)
w = wordcloud.WordCloud(
    width=1000,
    height=700,
    background_color='white',
    font_path="msyh.ttc",
    scale=15,
    stopwords={" "},
    contour_width=5,
    contour_color='red'
)
w.generate(string)
w.to_file(r'd:\360MoveData\Users\cmusunqi\Documents\GitHub\R_and_python\python\doupe200.png')
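The link pattern noted in the code comments (start grows by 20 per page while limit, sort and status stay fixed) can also be expressed with urllib.parse.urlencode instead of string concatenation. This is only a sketch: page_urls and its parameters are hypothetical names, and the query values are copied from the URLs listed in the article.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/subject/26754233/comments'


def page_urls(total=80, per_page=20):
    """Build one comment-page URL per offset 0, per_page, 2*per_page, ..."""
    urls = []
    for start in range(0, total, per_page):
        query = urlencode({
            'start': start,
            'limit': per_page,
            'sort': 'new_score',
            'status': 'P',
        })
        urls.append(f'{BASE_URL}?{query}')
    return urls


urls = page_urls()
for url in urls:
    print(url)
```

Building the query from a dict keeps the fixed parameters in one place, so changing the page size or the number of pages only touches the function arguments.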
Results
The page source contains only a few comments, which left me scratching my head. Something seems to be wrong; it may be necessary to request the mobile version of the page instead.
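One way to narrow down where the comments go missing is to count the matches per page rather than only looking at the final file. The sketch below assumes the HTML of each page has already been collected into a dict keyed by the start offset; summarize_pages and the two sample responses are hypothetical and only illustrate the idea, they are not real Douban output.

```python
import re

SHORT_COMMENT = re.compile(r'<span class="short">(.*?)</span>', re.S)


def summarize_pages(pages):
    """Map each start offset to the number of comments found in its HTML.

    pages: dict of {start_offset: html_text}, e.g. collected from resp.text.
    """
    return {start: len(SHORT_COMMENT.findall(html))
            for start, html in pages.items()}


# Hypothetical responses: one page with comments, one with no comment markup.
pages = {
    0: '<span class="short">a</span><span class="short">b</span>',
    20: '<html><body>no comments here</body></html>',
}
summary = summarize_pages(pages)
print(summary)
```

If some offsets report zero matches, it is worth printing resp.status_code and the start of resp.text for those pages to see what the server actually returned.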
Judging from the word cloud, The Eight Hundred is still being promoted in the name of history. Don't watch historically nihilistic movies like this; Guan Hu's stance is wrong.