Preface
In one of the courses I took, the teacher gave each group the task of introducing a tool and completing a small module with it. My group happened to land on Python web crawlers.
I thought: that's easy enough. Since time and energy might be limited, I decided to go with something familiar and crawl Douban film reviews (short comments).
I've written a crawler for Nezha's reviews before, but this article will be much more detailed. What it implements is crawling and visually analyzing the popular short reviews of any film: you provide the link and some basic information, and the script does the rest.
Analysis
What do we need to consider for a Douban crawler, and how do we analyze the site? Start from the Douban movie home page.
First of all, just try it: open any movie page, taking Jiang Ziya as an example. You will find it is not a dynamically rendered page, i.e. it uses traditional server-side rendering, so you can get the data simply by requesting the URL. However, as you flip through the comment pages, you will find that users who are not logged in can only access the first few pages, while logged-in users can access the later pages as well.
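If you want to convince yourself that the page really is server rendered, a minimal check like the one below is enough. This is only a sketch under my own assumptions (the subject id is the one used later in this article; exactly how many pages are visible without logging in is up to Douban):

import requests

# Request the first comment page without any cookies: the comment markup is already in the HTML,
# so there is no client-side rendering to work around.
url = 'https://movie.douban.com/subject/25907124/comments?start=0&limit=20&status=P&sort=new_score'
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
resp = requests.get(url, headers=header)
print(resp.status_code)               # 200 for the first pages even when anonymous
print('comment-item' in resp.text)    # True: each comment is a .comment-item node in the returned HTML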
So the process should be: login -> crawl -> store -> visual analysis.
The environment is Python 3; the code runs on both Linux and Windows. If you hit garbled-font problems on macOS or Linux, feel free to message me. The pip packages used are listed below; install them through the Tsinghua mirror, otherwise the download is painfully slow.
pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install numpy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlrd -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlwt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install bs4 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install wordcloud -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple
Login
The login address of Douban
After opening it, there is a password login tab. We need to analyze what happens during login: open the F12 console, and if that is not enough, use Fiddler to capture packets as well.
Open the F12 console and click Login. After a few attempts you will see that the login interface is quite simple:
Check the request parameters and you will find it is a normal request without any encryption. Of course, you can also capture the packets with Fiddler. Here I simply tested with a wrong password. If login fails for you, try logging in and out manually once in the browser, then run the program again.
Code the login module like this:
import requests
import urllib.parse

url = 'https://accounts.douban.com/j/mobile/login/basic'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
    'Origin': 'https://accounts.douban.com',
    'content-Type': 'application/x-www-form-urlencoded',
    'x-requested-with': 'XMLHttpRequest',
    'accept': 'application/json',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'connection': 'keep-alive',
    'Host': 'accounts.douban.com'
}
# Login form fields
data = {
    'ck': '',
    'name': '',
    'password': '',
    'remember': 'false',
    'ticket': ''
}


def login(username, password):
    global data
    data['name'] = username
    data['password'] = password
    data = urllib.parse.urlencode(data)   # form-encode the login fields
    print(data)
    req = requests.post(url, headers=header, data=data, verify=False)
    cookies = requests.utils.dict_from_cookiejar(req.cookies)   # keep the session cookies for later requests
    print(cookies)
    return cookies
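A minimal usage sketch of the module above, with hypothetical credentials, just to show how the returned cookie dict is carried on later requests (the comment URL and start offset are only illustrative):

# Hypothetical account; login() returns a plain dict of cookies.
cookies = login('your_username', 'your_password')
# Reuse the cookies on a later comment page that anonymous users cannot normally reach.
ua = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
resp = requests.get('https://movie.douban.com/subject/25907124/comments?start=240&limit=20&status=P&sort=new_score',
                    headers=ua, cookies=cookies, verify=False)
print(resp.status_code)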
With this done, the entire execution flow looks roughly like this:
Crawl
After logging in successfully, we can carry the login cookies with every request and crawl whatever we like. The page itself uses traditional server-side rendering, but you will notice an Ajax request each time you switch comment pages.
This Ajax interface returns the comment section directly, so we do not need to request the whole page and extract that part ourselves. Its URL follows the same pattern as in the earlier analysis: only the start parameter, which indicates the offset of the current comment, changes from page to page.
In other words, we just piece the URLs together, page after page, until the interface stops responding normally (see the small sketch after the example URLs below).
https://movie.douban.com/subject/25907124/comments?percent_type=&start=0&...   (other parameters omitted)
https://movie.douban.com/subject/25907124/comments?percent_type=&start=20&...  (other parameters omitted)
https://movie.douban.com/subject/25907124/comments?percent_type=&start=40&...  (other parameters omitted)
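As mentioned above, a small sketch of how these URLs can be pieced together in a loop (the subject id and the cut-off of 10 pages are only illustrative; the full code later simply keeps going until a request fails):

base = 'https://movie.douban.com/subject/25907124/comments'
for page in range(10):          # illustrative upper bound
    start = page * 20           # each page holds 20 comments
    url = base + '?percent_type=&start=' + str(start) + '&limit=20&status=P&sort=new_score&comments_only=1'
    print(url)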
How do we extract the information after each URL is fetched? We filter the data with CSS selectors, because every comment uses the same styling, much like items of a list in HTML.
The data returned by the Ajax interface is the part highlighted in red below, so we can split it into groups by searching for the class.
In the implementation, we use requests to fetch the results and BeautifulSoup to parse the returned HTML, from which the data we need is easy to pick out.
The code is as follows:
import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/subject/25907124/comments?percent_type=&start=0&limit=20&status=P&sort=new_score&comments_only=1&ck=C7di'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
}
req = requests.get(url, headers=header, verify=False)
res = req.json()          # the interface returns JSON
res = res['html']         # the comment HTML sits under the 'html' key
soup = BeautifulSoup(res, 'lxml')
node = soup.select('.comment-item')   # one node per comment, 20 per page
for va in node:
    name = va.a.get('title')                                                      # reviewer name
    star = va.select_one('.comment-info').select('span')[1].get('class')[0][-2]   # star rating digit
    comment = va.select_one('.short').text                                        # comment text
    votes = va.select_one('.votes').text                                          # upvote count
    print(name, star, votes, comment)
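The least obvious line above is the star extraction. A tiny sketch of why the [-2] index works, assuming Douban keeps its allstarXX class naming for the rating span:

# The rating span carries a class like 'allstar40' (4 stars) or 'allstar50' (5 stars);
# index -2 picks out the tens digit, i.e. the number of stars.
cls = 'allstar40'   # hypothetical class value taken from a comment node
print(cls[-2])      # -> '4'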
The result of this test is:
Storage
After the data is crawled, we need to store it somewhere; here we save it to an Excel file.
We write data to Excel files with xlwt. A basic xlwt example:
import xlwt

workbook = xlwt.Workbook(encoding='utf-8')   # create a workbook object
worksheet = workbook.add_sheet('sheet1')     # create a worksheet
worksheet.write(0, 0, 'content')             # write a value into row 0, column 0
workbook.save('test.xls')                    # save the file (xlwt writes the old .xls format)
We read Excel files with xlrd. A basic xlrd example:
import xlrd

workbook = xlrd.open_workbook('test.xls')   # open the workbook
table = workbook.sheets()[0]                # get the first sheet
nrows = table.nrows                         # number of rows
for i in range(nrows):
    print(table.row_values(i))              # print each row as a list
With this, we can save the data locally. Combining the login module + crawl module + storage module, the integrated code is:
import requests
from bs4 import BeautifulSoup
import urllib.parse
import xlwt


# Login module: log in and return the cookies
def login(username, password):
    url = 'https://accounts.douban.com/j/mobile/login/basic'
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
        'Origin': 'https://accounts.douban.com',
        'content-Type': 'application/x-www-form-urlencoded',
        'x-requested-with': 'XMLHttpRequest',
        'accept': 'application/json',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'connection': 'keep-alive',
        'Host': 'accounts.douban.com'
    }
    # Login form fields
    data = {
        'ck': '',
        'name': '',
        'password': '',
        'remember': 'false',
        'ticket': ''
    }
    data['name'] = username
    data['password'] = password
    data = urllib.parse.urlencode(data)
    print(data)
    req = requests.post(url, headers=header, data=data, verify=False)
    cookies = requests.utils.dict_from_cookiejar(req.cookies)
    print(cookies)
    return cookies


# Crawl + storage module
def getcomment(cookies, mvid):
    # cookies: cookies from a successful login (the backend identifies the user through them)
    # mvid: the movie id that appears in the comment URL
    start = 0
    w = xlwt.Workbook(encoding='ascii')   # create a writable workbook object
    ws = w.add_sheet('sheet1')            # create a sheet inside it
    index = 1                             # current row in the sheet
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    }
    while True:
        try:
            # Piece together the URL; start grows by 20 for every page
            url = 'https://movie.douban.com/subject/' + str(mvid) + '/comments?start=' + str(start) \
                  + '&limit=20&sort=new_score&status=P&comments_only=1'
            start += 20
            req = requests.get(url, headers=header, cookies=cookies, verify=False)
            res = req.json()                      # the interface returns JSON
            res = res['html']                     # the comment HTML is under the 'html' key
            soup = BeautifulSoup(res, 'lxml')     # build a BeautifulSoup object to extract information
            node = soup.select('.comment-item')   # each comment has class comment-item (20 per page)
            if not node:                          # no more comments: stop
                break
            for va in node:                       # iterate over the comments
                name = va.a.get('title')                                                      # reviewer name
                star = va.select_one('.comment-info').select('span')[1].get('class')[0][-2]   # star rating digit
                votes = va.select_one('.votes').text                                          # upvote count
                comment = va.select_one('.short').text                                        # comment text
                print(name, star, votes, comment)
                ws.write(index, 0, index)      # column 0: row index
                ws.write(index, 1, name)       # column 1: reviewer name
                ws.write(index, 2, star)       # column 2: stars
                ws.write(index, 3, votes)      # column 3: votes
                ws.write(index, 4, comment)    # column 4: comment text
                index += 1
        except Exception as e:
            print(e)
            break
    w.save('test.xls')   # save as test.xls


if __name__ == '__main__':
    username = input('enter username: ')
    password = input('enter password: ')
    cookies = login(username, password)
    mvid = input('enter movie id: ')
    getcomment(cookies, mvid)
After execution, the data is stored successfully:
Visual analysis
We want rating statistics and word-frequency statistics, plus word clouds for display. The corresponding libraries are matplotlib and wordcloud, with jieba for word segmentation.
The logic: read the xls file, segment the comments with jieba and count word frequencies, turn the most frequent words into a bar chart and word clouds, and show the distribution of star ratings 🌟 as a pie chart. The main code is annotated; the full code is:
import matplotlib
import matplotlib.pyplot as plt
import jieba
import xlrd
import numpy as np
from collections import Counter
from wordcloud import WordCloud

# Set the font; some Linux systems have trouble displaying Chinese otherwise
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False


# comment is a list of rows: [['1', 'name', 'stars', 'votes', 'comment text'], ['2', ...], ...]
def anylasescore(comment):
    score = [0, 0, 0, 0, 0, 0]   # score[1..5] counts 1- to 5-star ratings
    count = 0                    # number of valid ratings
    for va in comment:           # ['index', 'name', 'stars', 'votes', 'comment text']
        try:
            score[int(va[2])] += 1
            count += 1
        except Exception:
            continue
    print(score)
    label = '1 star', '2 stars', '3 stars', '4 stars', '5 stars'
    color = 'blue', 'orange', 'yellow', 'green', 'red'
    size = [0, 0, 0, 0, 0]       # percentage of each rating
    explode = [0, 0, 0, 0, 0.1]  # pull the 5-star slice out slightly
    for i in range(1, 6):
        size[i - 1] = score[i] * 100 / count
    pie = plt.pie(size, colors=color, explode=explode, labels=label, shadow=True, autopct='%1.1f%%')
    for font in pie[1]:
        font.set_size(8)         # label font size
    for digit in pie[2]:
        digit.set_size(8)        # percentage font size
    plt.axis('equal')            # keep the pie chart round
    plt.legend(loc=0, bbox_to_anchor=(0.82, 1))   # legend position
    leg = plt.gca().get_legend()                  # shrink the legend font
    ltext = leg.get_texts()
    plt.setp(ltext, fontsize=6)
    plt.savefig("score.png")
    plt.show()


# Bar chart of the 15 most frequent words
def getZhifang(map):
    x = []
    y = []
    for k, v in map.most_common(15):
        x.append(k)
        y.append(v)
    Xi = np.array(x)
    Yi = np.array(y)
    width = 0.6
    plt.figure(figsize=(8, 6))
    plt.bar(Xi, Yi, width, color='blue', label='word frequency', alpha=0.8)
    plt.xlabel("word")
    plt.ylabel("frequency")
    plt.savefig('zhifang.png')
    plt.show()
    return


# Word clouds built from the 300 most common words
def getCiyun_most(map):
    x = []
    y = []
    for k, v in map.most_common(300):
        x.append(k)
        y.append(v)
    xi = ' '.join(x[:150])     # the 150 most frequent words
    print(xi)
    # backgroud_Image = plt.imread("")   # optional mask image
    wc = WordCloud(background_color="white",
                   width=1500, height=1200,
                   min_font_size=40,
                   # mask=backgroud_Image,
                   font_path="simhei.ttf",   # a Chinese font is needed to render Chinese words
                   max_font_size=150,        # maximum font size
                   random_state=50)          # number of random colour schemes
    my_wordcloud = wc.generate(xi)           # generate the word cloud
    plt.imshow(my_wordcloud)                 # display it
    my_wordcloud.to_file("img.jpg")          # save it
    xi = ' '.join(x[150:300])                # the next 150 words go into a second cloud
    my_wordcloud = wc.generate(xi)
    my_wordcloud.to_file("img2.jpg")
    plt.axis("off")


def anylaseword(comment):
    # Stop words: meaningless words that need to be filtered out
    # (the original list is in Chinese; roughly 'this', 'a', 'no', 'is', 'not', 'or', 'story', 'character', etc.)
    stopwords = ['this', 'a', 'no', 'is', 'not', 'the', 'or', 'story', 'that', 'character']
    print(stopwords)
    commnetstr = ''
    c = Counter()
    index = 0
    for va in comment:
        seg_list = jieba.cut(va[4], cut_all=False)   # segment the comment text with jieba
        index += 1
        for x in seg_list:
            if len(x) > 1 and x != '\r\n':           # skip single characters and line breaks
                try:
                    c[x] += 1
                except Exception:
                    continue
        commnetstr += va[4]
    for (k, v) in c.most_common():                   # drop rare words and stop words
        if v < 5 or k in stopwords:
            c.pop(k)
            continue
        # print(k, v)
    print(len(c), c)
    getZhifang(c)
    getCiyun_most(c)
    # print(commnetstr)


def anylase():
    data = xlrd.open_workbook('test.xls')   # open the xls file
    table = data.sheets()[0]                # first sheet
    nrows = table.nrows                     # number of rows
    comment = []
    for i in range(nrows):
        comment.append(table.row_values(i))
    # print(comment)
    anylasescore(comment)
    anylaseword(comment)


if __name__ == '__main__':
    anylase()
Let's take a look at the results:
Here I selected some data for Jiang Ziya and Spirited Away. The rating distributions of the two films are as follows:
Judging from the ratings, Spirited Away is clearly the better received of the two, with most people willing to give it five stars; it is widely regarded as one of the finest anime films. Next, look at the word-frequency bar charts:
The director of Spirited Away is obviously more famous and influential, so much so that people keep mentioning him in the comments. Now take a look at the word clouds:
Hayao Miyazaki, Haku the white dragon, Yubaba... it really is full of memories. That's all for now; if there is anything you want to say, you are welcome to discuss!