Our target for this week: video comment data from bilibili (www.bilibili.com).
Bilibili, affectionately known as “Station B”, hosts a huge number of videos, many of them hailed as “treasures of the site”, with staggering numbers of comments and bullet comments (danmaku). So our goal this time is to crawl the comment data of one Bilibili video and analyze why it is so popular.
First of all, let’s find out which video on Bilibili has the most comments. Fortunately, someone has already done the counting for us, so let’s take a look.
【Bilibili big data visualization】Which video has the most comments on Bilibili? From <www.bilibili.com/video/av349…>
Huh? Interesting: the first and last episodes of “The Full-time Master” are the second and first most-commented videos respectively, far ahead of many other popular shows. All right, let’s take this one and see just how strong it is.
Without further ado, let’s head over to Bilibili and see just how good this show is: www.bilibili.com/bangumi/pla…
Hmm, it turns out you need a premium (“big member”) membership to watch it…
I’m not going to pay for that, but the good news is that while the video itself isn’t viewable, the comments are.
Can you feel how intimidating that is? Over 636,000 comments, spread across more than 9,000 pages! That is extraordinary.
Next, let’s write a crawler to collect this data.
A crawler generally breaks down into four stages: analyze the target page, obtain the page content, extract the key information, and save the output.
1. Analyze the target page
- First, observe the structure of the comment area. The comments are paged, and pages are turned by mouse click; there are 9,399 pages in total, with 20 comments per page. Each comment displays the user name, the comment text, the comment floor, the date and time, the number of likes, and other information.
- Next, press F12 to open the developer tools and switch to the Network tab. Then click to turn the comment page, observe what happens during this process, and use that to work out our crawling strategy.
- It is not hard to see that the URL in the address bar does not change throughout the process, which means the page turning in the comment area is not controlled by the URL. Instead, on every page turn, the page sends a request to the server (see the Request URL in the Network panel).
- Clicking the Preview tab switches to the preview view, where you can see what the request returns. Here it is a JSON response whose replies field contains the comment data for the current page. This JSON actually carries far more information than what is displayed on the page, quite a treasure trove, but we don’t need most of it here, so we will ignore the rest and pick out only the fields we care about.
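To make the structure concrete, here is a trimmed sketch of one element of data.replies, keeping only the fields we will parse later (the field names match the code below; the values are made up for illustration, and a real response contains many more fields):

# Illustrative sketch of one comment object inside data.replies.
# Only the fields used later are shown; the values are invented.
reply_example = {
    "floor": 204187,              # comment floor number
    "ctime": 1541413020,          # Unix timestamp of the comment
    "like": 2,                    # number of likes
    "rcount": 0,                  # number of replies to this comment
    "member": {
        "uname": "some_user",     # user name
        "sex": "secret",          # gender: male / female / secret
    },
    "content": {
        "message": "comment text goes here",
    },
}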
2. Get web content
With the analysis of the target page done, we can now start writing the code.
import requests

def fetchURL(url):
    '''
    Function: visit the page at url and return its response content
    Parameter:
        url: the url of the target page
    Return: the text content of the response
    '''
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        print(r.url)
        return r.text
    except requests.HTTPError as e:
        print(e)
        print("HTTPError")
    except requests.RequestException as e:
        print(e)
    except:
        print("Unknown Error !")

if __name__ == '__main__':
    url = 'https://api.bilibili.com/x/v2/reply?callback=jQuery172020326544171595695_1541502273311&jsonp=jsonp&pn=2&type=1&oid=11357166&sort=0&_=1541502312050'
    html = fetchURL(url)
    print(html)
However, after running this, you will find that you get a 403 error: the server has denied us access.
403 Client Error: Forbidden for url: https://api.bilibili.com/x/v2/reply?callback=jQuery172020326544171595695_1541502273311&jsonp=jsonp&pn=2&type=1&oid=11357166&sort=0&_=1541502312050
HTTPError
None
Likewise, if you paste this request URL into the browser address bar and open it directly, you also get a 403 and nothing comes back.
This is the first pitfall we hit with this crawler: the browser can get a normal response from the page, but when the request link is opened directly, the server rejects it. (My first thought was cookies, so I copied the browser’s cookie into the crawler’s request headers and tried again, but it made no difference.) Perhaps this is a small anti-crawler mechanism.
After looking around online, I found a solution (though I do not fully understand why it works). The URL parameters of the original request are as follows:
callback = jQuery1720913511919053787_1541340948898
jsonp = jsonp
pn = 2
type = 1
oid = 11357166
sort = 0
_ = 1541341035236
Only three of these parameters are genuinely useful: pn (the page number), type (fixed at 1), and oid (the video id). After deleting the other, unnecessary parameters, the newly assembled URL can be requested successfully and returns the comment data.
https://api.bilibili.com/x/v2/reply?type=1&oid=11357166&pn=2
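Equivalently, rather than concatenating the query string by hand, you can let requests build it from a params dict. A minimal sketch (the oid and pn values here are just examples):

import requests

def fetch_comment_page(oid, pn):
    # Minimal sketch: requests assembles the query string from the params dict.
    # type=1 means a video comment area, oid is the video id, pn is the page number.
    api = 'https://api.bilibili.com/x/v2/reply'
    params = {'type': 1, 'oid': oid, 'pn': pn}
    r = requests.get(api, params=params)
    r.raise_for_status()
    return r.json()              # the comments sit under data -> replies

# Example: fetch page 2 of the comments for oid 11357166
# page = fetch_comment_page(11357166, 2)
# print(len(page['data']['replies']))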
Then, in the main function, we fetch the comment data page by page with a for loop that changes the value of pn.
if __name__ == '__main__':
    for page in range(0, 9400):
        url = 'https://api.bilibili.com/x/v2/reply?type=1&oid=11357166&pn=' + str(page)
        html = fetchURL(url)
3. Extract key information
The response is parsed with the json library, and then we extract the fields we need: floor, user name, gender, time, comment text, number of likes, and number of replies.
import json
import time

def parserHtml(html):
    '''
    Function: parse the response text given by the parameter html
              and extract the content we need
    Parameter:
        html: the response text (a JSON string)
    '''
    s = json.loads(html)

    for i in range(20):
        comment = s['data']['replies'][i]

        # Floor, username, gender, time, comment text, likes, replies
        floor = comment['floor']
        username = comment['member']['uname']
        sex = comment['member']['sex']
        ctime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(comment['ctime']))
        content = comment['content']['message']
        likes = comment['like']
        rcounts = comment['rcount']

        print('--' + str(floor) + ':' + username + '(' + sex + ') ' + ':' + ctime)
        print(content)
        print('like : ' + str(likes) + ' ' + 'replies : ' + str(rcounts))
        print(' ')
Part of the output is as follows:

--204187: day cocoa bell (secret):2018-11-05 18:16:22 Wife again out of this, this is really wood money (´;ω;`) replies: 0
--204186: changye Weiyang 233 (female) 0
--204185: it's a piece of scumbag (male) 1
--204183: Day cacao bell (secret): Ready to go to school, all his mid-term exam like :2 replies: 0
--204182: pick up the autumn leaves (secret):2018-11-05 12:04:19 November 5 clock ( ̄▽ ̄) You really missed a billion dollars this time! = you really missed a billion dollars this time! = You really missed a billion dollars this time
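One caveat about the loop above: the hard-coded range(20) assumes every page contains exactly 20 comments, so the last page (or an empty response) will raise an IndexError. A slightly more defensive version of the inner loop, offered only as a sketch, would iterate over whatever the replies field actually contains:

# Defensive variant of the inner loop (a sketch, not the original code):
# iterate over the replies that are actually present instead of assuming 20.
replies = (s.get('data') or {}).get('replies') or []
for comment in replies:
    floor = comment['floor']
    username = comment['member']['uname']
    # ... extract the remaining fields exactly as above ...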
4. Save the output
We save the data locally as a CSV file, which completes all the tasks of this crawler. The complete code is attached below.
import requests
import json
import time


def fetchURL(url):
    '''
    Function: visit the page at url and return its response content
    Parameter:
        url: the url of the target page
    Return: the text content of the response
    '''
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        print(r.url)
        return r.text
    except requests.HTTPError as e:
        print(e)
        print("HTTPError")
    except requests.RequestException as e:
        print(e)
    except:
        print("Unknown Error !")


def parserHtml(html):
    '''
    Function: parse the response text given by the parameter html
              and extract the content we need
    Parameter:
        html: the response text (a JSON string)
    '''
    try:
        s = json.loads(html)
    except:
        print('error')
        return

    commentlist = []
    hlist = []

    hlist.append("Serial number")
    hlist.append("Name")
    hlist.append("Gender")
    hlist.append("Time")
    hlist.append("Comment")
    hlist.append("Number of likes")
    hlist.append("Number of replies")

    #commentlist.append(hlist)

    # Floor, username, gender, time, comment text, likes, replies
    for i in range(20):
        comment = s['data']['replies'][i]
        blist = []

        floor = comment['floor']
        username = comment['member']['uname']
        sex = comment['member']['sex']
        ctime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(comment['ctime']))
        content = comment['content']['message']
        likes = comment['like']
        rcounts = comment['rcount']

        blist.append(floor)
        blist.append(username)
        blist.append(sex)
        blist.append(ctime)
        blist.append(content)
        blist.append(likes)
        blist.append(rcounts)

        commentlist.append(blist)

    writePage(commentlist)
    print('-' * 20)


def writePage(urating):
    '''
    Function: append the comment records of one page to a local CSV file
    Parameter:
        urating: the list of comment records for one page
    '''
    import pandas as pd
    dataframe = pd.DataFrame(urating)
    dataframe.to_csv('Bilibili_comment5-1000条.csv', mode='a', index=False, sep=',', header=False)


if __name__ == '__main__':
    for page in range(0, 9400):
        url = 'https://api.bilibili.com/x/v2/reply?type=1&oid=11357166&pn=' + str(page)
        html = fetchURL(url)
        parserHtml(html)

        # To reduce the risk of the IP being blocked, rest for 5 seconds every 20 pages.
        if page % 20 == 0:
            time.sleep(5)
Final words
During the crawl I still ran into quite a few small pitfalls.
- The captured request URL cannot be used directly; it only works after the unnecessary parameters are filtered out.
- The crawl itself was not entirely smooth: if a user posts a comment while the crawl is running, the response to a request may come back empty and the program will crash. So during the actual crawl, record the current crawl position so that you can resume from it after an error (a minimal sketch of this idea is shown after this list). Also, crawling late at night, when fewer people are commenting, greatly reduces the probability of errors.
- There are a few inconsistencies in the crawled data. These are not really pitfalls, but I mention them here to avoid confusion.
A. The floor numbers in the comment area only go a little past 200,000, while the total comment count is over 630,000. The discrepancy is mainly because Bilibili comments can themselves be replied to, and those replies are counted in the total. Here we only crawl the top-level comments (the floors); the replies themselves are ignored and only their count is recorded.
B. The floor numbers reach around 200,000, yet we ended up with only about 180,000 records. After repeatedly checking the crawler against the original site, this turned out to be normal: when a comment is deleted, the floors behind it are not renumbered, so the deleted floor simply leaves a gap. This is why the floor numbers and the number of crawled comments do not match.
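As mentioned in the second point above, a simple way to survive mid-crawl failures is to persist the last successfully crawled page and resume from it on restart. Here is a minimal sketch of that idea; the checkpoint file name and helper functions are my own invention, and it reuses the fetchURL and parserHtml functions from the full script above:

import os
import time

PROGRESS_FILE = 'crawl_progress.txt'       # hypothetical checkpoint file

def load_last_page():
    # Return the last page that was fully crawled (0 if starting fresh).
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return int(f.read().strip() or 0)
    return 0

def save_last_page(page):
    with open(PROGRESS_FILE, 'w') as f:
        f.write(str(page))

if __name__ == '__main__':
    start = load_last_page() + 1
    for page in range(start, 9400):
        url = 'https://api.bilibili.com/x/v2/reply?type=1&oid=11357166&pn=' + str(page)
        html = fetchURL(url)
        if html:                           # skip pages where the request failed
            parserHtml(html)
            save_last_page(page)           # checkpoint after each successful page
        if page % 20 == 0:
            time.sleep(5)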
If anything in this article is unclear or wrong, please point it out in the comments, or scan the QR code below and add me on WeChat, so that we can learn from each other and improve together.