This article mainly covers the following topics: a Sina Weibo crawler, simple database reads and writes from Python, simple list deduplication, and simple natural language processing (the SnowNLP module, machine learning). It is aimed at readers with some programming background and a basic understanding of Python.
The tech community has been joking about one thing recently: a post at noon on October 8 caused such an uproar among Sina Weibo users, especially female users, that it brought Sina Weibo down.
The author of the post was Lu Han, the world's most popular idol.
Programmers immediately started writing posts like:
How Lu Han blew up the Weibo servers
How a Weibo engineer worked overtime in the middle of his own wedding
How Taobao programmers forgave Lu Han
At that moment, the whole world knew that Lu Han was in love.
And all over the world, women were brokenhearted.
So how do Lu Han's fans feel? Let's pull the comments from the Weibo post where he announced his relationship and analyze his fans' mood. Below I'll show you how. (If you just want the analysis results, skip to section 5.)
1. Sina Weibo API
After the grief of a few crawler bans (which really hurt), I picked up the "good" habit of checking for an API before crawling. How could a big company like Sina not offer a Weibo API? Sina has its own open platform for developers. Here is how to call the API from Python; the code below logs in with an App Key and App Secret to access the Weibo API. Note that the weibo module has some problems under Python 3.
```python
from weibo import APIClient
import webbrowser
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

APP_KEY = 'your App Key'        # the App Key you obtained
APP_SECRET = 'your App Secret'  # the App Secret you obtained
CALLBACK_URL = 'https://api.weibo.com/oauth2/default.html'

client = APIClient(app_key=APP_KEY, app_secret=APP_SECRET, redirect_uri=CALLBACK_URL)
url = client.get_authorize_url()
webbrowser.open_new(url)  # open the default browser to get the code parameter

print 'Enter the part after "code=" in the URL and press Enter: '
code = raw_input()
r = client.request_access_token(code)
access_token = r.access_token
expires_in = r.expires_in
client.set_access_token(access_token, expires_in)
```
Now that we know how to log in to the API, how do we call it to crawl the comments on a single Weibo post? One line of code:
```python
r = client.comments.show.get(id=4160547165300149, count=200, page=1)
```
All of the comment information for the post is now in r.comments. You need to read this together with the Weibo API documentation. The documentation says that calling the comments API requires the user's authorization, but in practice, as long as you know the id of a single post, you can call it. How to get that id is explained below (keep your voice down, don't let Weibo hear).
The APIs are called in the form client.&lt;interface name&gt;.get(&lt;request parameters&gt;). The parameters of each interface can be found on its documentation page, which also gives an example of the returned result and the JSON field names of the key data.
If we want the text of each comment, we just read the text field:
```python
for st in r.comments:
    text = st.text
```
2. Weibo crawler
By calling the Sina Weibo API we can easily get the comments on a single post, but only a limited amount, because this data is expensive! Did you think a big V's Weibo would hand out free API calls? Non-certified application developers can only make a few thousand API requests per day, which, for a big V like Lu Han whose single post has hundreds of thousands of comments, is far too few. (TT)
So, we still have to write a Weibo crawler.
As the saying goes, know yourself and your enemy and you will never be in danger. As a big company, Sina is battle-hardened and must have fought countless wars between crawlers and anti-crawlers, so it surely has a solid anti-crawler strategy. Faced with a strong enemy, take a detour; as a wise person once put it: when crawling a website, crawl the mobile site first: https://m.weibo.cn/
After logging in to Weibo, open the post in which Lu Han announced his relationship. _(:зゝ∠)_ It has over 2,000,000 comments; you can feel the uneasy hearts of the normally quiet fans…
The mobile Weibo pages look extremely simple, unlike the complex and tangled logic of the PC site: https://m.weibo.cn/status/4160547165300149. Look at a few more posts and you will realize that the number after /status/ is the id of that single post.
The comments come in two flavors, hot comments and the latest comments, but in either case the URL stays the same as you scroll down. Right-click → Inspect in Chrome and watch the Network panel.
From the XHR requests in the Network panel, you can work out the URL pattern for hot comments:
```python
'https://m.weibo.cn/single/rcList?format=cards&id=' + single_weibo_id + '&type=comment&hot=1&page=' + page_number
```
And the pattern for the latest comments:
```python
'https://m.weibo.cn/api/comments/show?id=' + single_weibo_id + '&page=' + page_number
```
Open https://m.weibo.cn/single/rcList?format=cards&id=4154417035431509&type=comment&hot=1&page=1 and you can see the JSON file of hot comments.
The next steps are to fake a browser header, read the JSON, and loop over each page… but that's not the point! Here is the Python 3 code:
```python
import re, time, requests, urllib.request

weibo_id = input('Enter a single weibo id: ')
# url = 'https://m.weibo.cn/single/rcList?format=cards&id=' + weibo_id + '&type=comment&hot=1&page={}'  # crawl hot comments
url = 'https://m.weibo.cn/api/comments/show?id=' + weibo_id + '&page={}'  # crawl comments sorted by time

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
    'Host': 'm.weibo.cn',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://m.weibo.cn/status/' + weibo_id,
    'Cookie': 'your login cookie info',
    'DNT': '1',
    'Connection': 'keep-alive',
}

i = 0
comment_num = 1
while True:
    # if i == 1:  # branch used for the hot-comment interface
    #     r = requests.get(url=url.format(i), headers=headers)
    #     comment_page = r.json()[1]['card_group']
    # else:
    #     r = requests.get(url=url.format(i), headers=headers)
    #     comment_page = r.json()[0]['card_group']
    r = requests.get(url=url.format(i), headers=headers)
    comment_page = r.json()['data']
    if r.status_code == 200:
        try:
            print('Reading comments on page %s:' % i)
            for j in range(0, len(comment_page)):
                print('Comment %s' % comment_num)
                user = comment_page[j]
                comment_id = user['user']['id']
                print(comment_id)
                user_name = user['user']['screen_name']
                print(user_name)
                created_at = user['created_at']
                print(created_at)
                # strip HTML, reply prefixes and emoji from the comment text
                text = re.sub('<.*?>|回复<.*?>:|[\U00010000-\U0010ffff]|[\uD800-\uDBFF][\uDC00-\uDFFF]', '', user['text'])
                print(text)
                likenum = user['like_counts']
                print(likenum)
                source = re.sub('[\U00010000-\U0010ffff]|[\uD800-\uDBFF][\uDC00-\uDFFF]', '', user['source'])
                print(source + '\r\n')
                comment_num += 1
            i += 1
            time.sleep(3)  # crawl interval to avoid being banned
        except:
            i += 1  # the original had a no-op `i+1` here; advance to the next page and continue
            pass
    else:
        break
```
A few notes:
1. Setting a crawl interval greatly reduces the chance of the crawler being banned (especially at night).
2. The number of JSON records Sina returns is random each time, so there are duplicates after turning pages; we deduplicate the data later, as explained below.
3. Code was added to strip emoji (which took ages and nearly broke me) from both the comment text and the source field, as well as the HTML mixed into reply comments; see the sketch right after this list.
4. I only show how to read the data, not how to save it, because we are going to use a database! (That's the point! Knocks on the blackboard)
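To make note 3 concrete, here is the cleaning step pulled out on its own. This is a minimal sketch, assuming the raw comment text comes from the user['text'] field as in the crawler above; the regex is the same one used there, removing HTML tags, "回复<...>:"-style reply prefixes, and characters outside the Basic Multilingual Plane (where Weibo emoji live). The example input is made up for illustration.

```python
import re

# Strip HTML tags, reply prefixes, and emoji (characters outside the BMP,
# plus stray surrogate pairs) -- the same pattern used in the crawler above.
CLEAN_PATTERN = re.compile(
    '<.*?>|回复<.*?>:|[\U00010000-\U0010ffff]|[\uD800-\uDBFF][\uDC00-\uDFFF]'
)

def clean_comment(raw_text):
    """Return the comment text with HTML, reply prefixes and emoji removed."""
    return CLEAN_PATTERN.sub('', raw_text)

# Hypothetical example input, for illustration only:
print(clean_comment('回复<a href="/u/123">@someone</a>:加油！💪'))  # -> 加油！
```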
3. Database reads and writes in Python
Although the crawler greatly increases the amount of data we can collect, precisely because it is a crawler it is easy for Sina to ban. The loop above ends when the HTTP status code is not 200, but when Weibo decides you are a crawler it returns a page with no real content; the program then throws an error, and everything crawled before that moment is lost.
You could crawl for a while and save the data in one go, but the bigger the data gets, the harder that failure hurts… cold hard data slapping you in the face… my poor heart… This is where we need a database.
A database, as the name implies, is a warehouse for storing data. As a kind of management system with more than 60 years of development behind it, databases have a huge range of applications and complex features… OK, I can't keep a straight face any longer.
In this article, the database is basically a smarter Excel spreadsheet (● — ●). During crawling, each record is saved as soon as it is fetched, so even if the crawler program is interrupted, everything crawled before the interruption is already stored in the database.
Plenty of databases work with Python; the ones Mijiang knows are MySQL, SQLite, MongoDB and Redis. Here I use MySQL, managed with the Navicat GUI on a Mac; how to install MySQL and Navicat is left to your favorite search engine.
Building on the code above, create the database, the table and its fields (and their types) in Navicat. A rough sketch of the assumed table layout is shown below, followed by the code added to the Python program.
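This is only a sketch of an equivalent setup done from Python instead of Navicat. The database name nlp, the table name love_lu, and the column types are assumptions inferred from the insert statement used below; adjust them to whatever you actually created.

```python
import pymysql

# Minimal sketch: create the assumed database and table from Python.
conn = pymysql.connect(host='127.0.0.1', user='root', password='your password', charset='utf8')
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS nlp DEFAULT CHARACTER SET utf8")
cur.execute("""
    CREATE TABLE IF NOT EXISTS nlp.love_lu (
        id         INT AUTO_INCREMENT PRIMARY KEY,  -- surrogate key, dropped during deduplication later
        comment_id BIGINT,
        user_name  VARCHAR(100),
        created_at VARCHAR(50),
        text       TEXT,
        likenum    INT,
        source     VARCHAR(100)
    )
""")
conn.commit()
conn.close()
```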
```python
import pymysql

conn = pymysql.connect(host='server IP (default 127.0.0.1)', user='user name (default root)',
                       password='server password', charset='utf8', use_unicode=True)
cur = conn.cursor()

# nlp.love_lu is database.table; the parentheses list the fields
sql = "insert into nlp.love_lu (comment_id, user_name, created_at, text, likenum, source) values (%s,%s,%s,%s,%s,%s)"
param = (comment_id, user_name, created_at, text, likenum, source)
try:
    cur.execute(sql, param)
    conn.commit()
except Exception as e:
    print(e)
    conn.rollback()
```
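To tie this back to the crawler: one way to save each record as soon as it is fetched is to wrap the insert in a small helper and call it once per parsed comment inside the crawl loop. This is a minimal sketch reusing the connection and field variables from the snippets above; the helper name save_comment is mine, not from the original code.

```python
def save_comment(cur, conn, comment_id, user_name, created_at, text, likenum, source):
    """Insert one comment row; roll back on failure so one bad row does not kill the crawl."""
    sql = ("insert into nlp.love_lu (comment_id, user_name, created_at, text, likenum, source) "
           "values (%s,%s,%s,%s,%s,%s)")
    try:
        cur.execute(sql, (comment_id, user_name, created_at, text, likenum, source))
        conn.commit()
    except Exception as e:
        print(e)
        conn.rollback()

# Inside the crawler's for-loop, right after the fields are parsed:
# save_comment(cur, conn, comment_id, user_name, created_at, text, likenum, source)
```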
Running the program, I crawled roughly 10,000 of the latest comments. Before moving on to the next step, we need to read them back out of the database, which is also very simple in Python.
```python
conn = pymysql.connect(host='server IP', user='user', password='password', charset='utf8')
cur = conn.cursor()
cur.execute("SELECT * FROM nlp.love_lu WHERE id < '%d'" % 10000)
rows = cur.fetchall()
```
This reads back everything crawled so far. But, as mentioned above, the number of records Weibo returns per page is random, so there are duplicates. After reading, we deduplicate with an if … not in check:
```python
commentlist = []
for row in rows:
    row = list(row)
    del row[0]  # drop the auto-increment id so identical comments compare equal
    if row not in commentlist:
        commentlist.append([row[0], row[1], row[2], row[3], row[4], row[5]])
```
4. Natural language processing (NLP)
NLP is a field of artificial intelligence concerned with designing algorithms that let machines understand human language. Natural language is one of the harder parts of AI, and a language as deep and unpredictable as Chinese is a major difficulty in NLP. Python has many NLP-related modules; interested readers can explore NLP by implementing simple text sentiment analysis in Python.
I tried some ready-made sentiment analysis algorithms on the crawled comments, and the error rate was high _(:зゝ∠)_. What to do? Design my own algorithm? Mijiang seemed to have run into the first big problem in her life caused by not studying Chinese hard enough…
Of course, a lazy girl like Mijiang quickly found one of Python's better-known Chinese NLP libraries: SnowNLP. Calling SnowNLP is fairly simple; its source code explains in detail how to call it and how the results are produced.
```python
from snownlp import SnowNLP

def snowanalysis(textlist):
    sentimentslist = []
    for li in textlist:
        s = SnowNLP(li)
        print(li)
        print(s.sentiments)
        sentimentslist.append(s.sentiments)
    return sentimentslist
```
This code takes the list of comment texts read from the database and analyzes the sentiment value of each comment in turn. SnowNLP assigns a given sentence a value between 0 and 1: when the value is above 0.5 the sentence's emotional polarity leans positive, and when it is below 0.5 it leans negative.
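For a quick feel for these values, here is a minimal sketch run on two made-up comments, one supportive and one hostile. The exact scores depend on SnowNLP's bundled model, so treat the printed numbers as indicative only.

```python
from snownlp import SnowNLP

# Two invented example comments: one offering blessings, one hoping for a breakup
for comment in ['祝福你们，一定要幸福啊', '不配，还是快点分手吧']:
    s = SnowNLP(comment)
    # s.sentiments is a score in (0, 1); above 0.5 leans positive, below 0.5 leans negative
    print(comment, round(s.sentiments, 3))
```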
Looking at the comment texts and their scores, comments offering blessings or expressing positive emotions mostly score above 0.5, while comments hoping for a breakup or expressing negative emotions mostly score below 0.5. There is still some error in the results, and the model could be improved by further training, but Mijiang's Chinese really isn't that bad, stop teasing… (runs away)
5. Analyze the results
Let’s look at the results of this analysis (● — ●).
```python
import numpy as np
import matplotlib.pyplot as plt

# sentimentslist is the list of sentiment values produced in the previous section
plt.hist(sentimentslist, bins=np.arange(0, 1, 0.02))
plt.show()
```
This counts the sentiment values produced in the previous section and plots their distribution. The data in the figures below was collected at 19:00 on October 9, about 10,000 comments in total.
Distribution of sentiment values for comments on Lu Han's relationship-announcement Weibo
Now take a look at Guan Xiaotong's corresponding Weibo post.
Distribution of sentiment values for comments on Guan Xiaotong's corresponding Weibo
The two histograms show high frequencies near 0, near 1, and around 0.5, which suggests that fans' feelings about the news are clearly polarized, both positive and negative. Even so, positive emotions outweigh negative ones.
I also counted the Weibo emoticons appearing in the comments; a sketch of one way to do this follows.
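This is a minimal sketch of one way to count emoticons, assuming the raw (uncleaned) comment text represents Weibo emoticons as bracketed tags such as [加油]; the actual representation depends on which field you crawled, so check your data first.

```python
import re
from collections import Counter

def count_emoticons(raw_texts):
    """Count bracketed Weibo emoticon tags such as [加油] across all comments."""
    counter = Counter()
    for text in raw_texts:
        counter.update(re.findall(r'\[([^\[\]]{1,8})\]', text))  # short tags inside square brackets
    return counter

# Hypothetical usage: raw_texts is the list of uncleaned comment strings
# print(count_emoticons(raw_texts).most_common(10))
```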
There were nearly three times as many emoticons in the comments on Lu Han's post as on Guan Xiaotong's, and the most common one was "go for it". As you can see, most fans still support Lu Han's relationship, though some want to complain about the couple, and others are sad and just want to be left alone to calm down.
Next, let's run a word cloud analysis on the comment content; a sketch of how such a cloud can be generated is shown below.
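The article does not show its word cloud code, so this is only a sketch of one common approach, using the jieba tokenizer and the wordcloud package; the font path is an assumption you must replace with a real Chinese font file, or the characters will render as boxes.

```python
import jieba
from wordcloud import WordCloud

def make_wordcloud(texts, out_path='cloud.png'):
    """Tokenize the comments with jieba and render them as a word cloud image."""
    segmented = ' '.join(jieba.cut(' '.join(texts)))
    wc = WordCloud(
        font_path='/path/to/a/chinese/font.ttf',  # assumption: point this at a Chinese font
        background_color='white',
        width=800,
        height=600,
    ).generate(segmented)
    wc.to_file(out_path)

# Hypothetical usage: texts is the list of cleaned comment strings
# make_wordcloud(texts, 'love_lu_cloud.png')
```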
Word cloud of comments on Lu Han's relationship-announcement Weibo
Word cloud of comments on Guan Xiaotong's corresponding Weibo
Lu Han's post is full of words like "blessing", "support" and "together", alongside "why", "don't deserve" and "break up". The same words appear on Guan Xiaotong's post, but there are also many mentions of Reba and Li Yifeng, the stars the two had previously been romantically linked with in gossip.
Can you guess what image was used as the word cloud background? Mijiang won't say; feel it out for yourselves.
The copyright of this article belongs to Daji Dali Millet Sauce @ Jianshu and House Rice @ Zhihu. All rights reserved; reproduction without authorization is prohibited.