Preface
Goal: obtain all video data, comment data, bullet screen (danmaku) data and the videos themselves from a Bilibili (station B) uploader's (UP主) profile page.
Tip: what follows is the body of the article; the cases below are for reference.
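All of the snippets below share a few imports and module-level containers that the post never shows explicitly. Here is a minimal sketch of what they assume (the container names urls, oids and data_list come from the snippets themselves):

# Minimal sketch of the imports and globals the snippets below assume.
import csv
import os
from time import sleep

import requests
import pandas as pd
from lxml import etree
from selenium import webdriver

urls = []       # video page urls, filled by get_url()
oids = []       # video aids, filled by get_url(), needed later for the comment urls
data_list = []  # per-video data dicts, filled by parser_data()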
1. Obtain video data
Objective: to obtain data such as likes, comments, and comment page counts for all of the uploader's videos. To do that, we need to find the pattern behind Bilibili video URLs so we can visit every video page and pull the data for analysis.
url of one video: https://www.bilibili.com/video/BV1jK4y1D7Ft
url pattern:      https://www.bilibili.com/video/ + some id

So a video URL is just https://www.bilibili.com/video/ plus some ID. Now let's go to the uploader's profile page.
A quick try with requests showed that the profile page's HTML does not contain the video list we see in the browser, so we capture packets to find the real URL and the real response.
That turns out to be easy: grab the request, and we can see the response is JSON; note the bvid field:
bvid: "BV1jK4y1D7Ft"
video url: https://www.bilibili.com/video/BV1jK4y1D7Ft

So the profile page's data packet contains bvid, and bvid is exactly the piece the video URL needs. By requesting this interface we can get the URLs of all the videos. Looking more closely at the structure: each video's data sits in vlist, vlist sits in list, and list sits in data, i.e. ["data"]["list"]["vlist"] and then each item's ["bvid"]. The code:
def get_url():
    head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
    js = requests.get("https://api.bilibili.com/x/space/arc/search?mid=2026561407&ps=30&tid=0&pn=1&keyword=&order=pubdate&jsonp=jsonp", headers=head).json()
    for i in js["data"]["list"]["vlist"]:
        urls.append("https://www.bilibili.com/video/" + i["bvid"])
        oids.append(i["aid"])  # aid will be used later to fetch the comment data

aid is very useful: it is an important piece for getting the comment data later.
Now that we have the URL of every video, we can visit each one to grab the data we need. This part is simple, so straight to the code:
def parser_data(url):
    dic = {}
    bro = webdriver.Chrome()
    bro.get(url)
    bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')  # scroll down one screen
    sleep(4)
    bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')  # scroll down one screen
    sleep(4)
    bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')  # scroll down one screen
    sleep(4)
    html = etree.HTML(bro.page_source)
    try:
        dic["title"] = html.xpath('//div[@id="viewbox_report"]/h1/@title')[0]
    except:
        dic["title"] = ""
    try:
        dic["view"] = html.xpath('//span[@class="view"]/text()')[0]
    except:
        dic["view"] = ""
    try:
        dic["dm"] = html.xpath('//span[@class="dm"]/text()')[0]
    except:
        dic["dm"] = ""
    try:
        dic["page"] = html.xpath('//*[@id="comment"]/div/div[2]/div/div[4]/a[last()-1]/text()')[0]
    except:
        dic["page"] = "1"  # no pagination widget means only one page of comments
    data_list.append(dic)
    bro.quit()

Some people may ask why Selenium is used here instead of requests. The main reason is to get the number of comment pages for each video, which matters later when fetching the comment data.
Summary of video data
To get all the video data, we worked out how video URLs are composed, found bvid in the profile page data, spliced the URLs from it, and grabbed the data we need from each page. Along the way we also collected the two fields needed for the comment analysis: aid and page.
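One small gap between the two steps: urls() in the next section reads the collected video data back from home.csv with oid and page columns, so the crawl results have to be written out first. A minimal sketch of that glue step (my assumption, not shown in the original; it presumes parser_data() was called on the urls in the same order as get_url() filled oids):

def save_home():
    # Assumed glue step: persist the per-video data plus the aid (oid)
    # so that urls() can later read home.csv and find the oid and page columns.
    df = pd.DataFrame(data_list)   # title, view, dm, page per video
    df["oid"] = oids               # the aid collected in get_url(), same order as urls
    df.to_csv("home.csv", index=False)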
2. Obtain comment data
Let's first look at the comment data of a single video. It is obviously loaded dynamically. For dynamic loading I have two options: 1. find the URL of the interface; 2. use Selenium and wait, which is very slow. So let's try to find the interface first.
We've captured the packet carrying the comment data; let's look at its URL:
https://api.bilibili.com/x/v2/reply?callback=jQuery17201533269097888037_1615856026814&jsonp=jsonp&pn=1&type=1&oid=929491224&sort=2&_=1615856028782
    pn  ==> page number
    oid ==> video id

If we simply request this URL it fails, even though the packet really does carry the comment data. So we guess Bilibili added some anti-crawling trick to the URL. Let's check which parameters are actually needed and which are not: after many tries, the culprit turns out to be the callback parameter; with it you cannot get valid data, so we just drop it.
Now it looks like we don't need Selenium at all. We just need to analyse the URL, find the pattern, and fetch all the data.

https://api.bilibili.com/x/v2/reply?jsonp=jsonp&pn=1&type=1&oid=929491224&sort=2
    pn  ==> page number
    oid ==> video id
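A quick sanity check that dropping callback is enough (a sketch using requests; the "code" field is Bilibili's generic status field, 0 meaning success):

# Sanity check: the cleaned url should now return plain JSON.
head = {'User-Agent': 'Mozilla/5.0'}
url = "https://api.bilibili.com/x/v2/reply?jsonp=jsonp&pn=1&type=1&oid=929491224&sort=2"
data = requests.get(url, headers=head).json()
print(data["code"])                   # 0 means the request succeeded
print(len(data["data"]["replies"]))   # number of comments on the first page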
Looking at this URL, two parameters matter: pn and oid. Remember that when we collected the video data we saved a page count and an aid. pn is clearly the page number, so what is the relationship between oid and aid? Let's put the two side by side:

aid: 929491224
oid: &oid=929491224

By comparison, the oid in the comment URL is exactly the aid we collected with the video data. Now we can splice the comment URLs from the page and aid values we already have:
def urls():
    urls = []
    df = pd.read_csv("home.csv")
    for a, b in zip(df.oid, df.page):
        for i in range(b):
            url = "https://api.bilibili.com/x/v2/reply?jsonp=jsonp&pn=%s&type=1&oid=%s&sort=2" % (i + 1, a)
            urls.append(url)
    print("*" * 30 + " spliced " + str(len(urls)) + " urls in total " + "*" * 30)
    return urls

With the URLs we can fetch the data with requests. Looking at the structure, each person's comment sits in ["data"]["replies"], so:
def parser_comment(url):
    sleep(1)
    head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
    data = requests.get(url, headers=head).json()
    if data['data']['replies']:
        for i in data['data']['replies']:
            dic = {}
            dic["name"] = i["member"]["uname"]
            dic["sex"] = i["member"]["sex"]
            dic["level"] = i["member"]["level_info"]["current_level"]
            dic["content"] = i["content"]["message"]
            dic["oid"] = url.split("oid=")[1].replace("&sort=2", "")  # keep the video id so comments can be split per video later
            with open('B station comments.csv', 'a', encoding="utf-8") as f:
                w = csv.writer(f)
                w.writerow(dic.values())

This parses and saves the comment data of every video. If you want to split one video's comments into its own CSV, that is easy too: when parsing the comments I also stored the oid, so:
def split_comments(url):
    print("*" * 30 + " data collection finished, start splitting " + "*" * 30)
    oid = url.split("oid=")[1].replace("&sort=2", "")
    names = ["name", "sex", "level", "content", "oid"]  # the csv is written without a header row
    df = pd.read_csv("B station comments.csv", names=names)
    data = df[df.oid == int(oid)].reset_index(drop=True)
    pd.DataFrame(data).to_csv('./comment/%s.csv' % (oid))
Summary of review data
Fetching the comment data relies on the page and aid fields from the video data, so the video data must be collected first.
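Tying the three comment functions together, an assumed driver (not in the original post) could look like this; it presumes home.csv already exists from step one and creates the comment directory that split_comments() writes into:

def crawl_comments():
    url_list = urls()                    # every comment-api url, built from home.csv
    for url in url_list:
        parser_comment(url)
    if not os.path.exists("comment"):    # split_comments() writes into ./comment/
        os.mkdir("comment")
    seen = set()
    for url in url_list:                 # split once per video (one url per oid is enough)
        oid = url.split("oid=")[1].replace("&sort=2", "")
        if oid not in seen:
            seen.add(oid)
            split_comments(url)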
3. Obtain bullet screen data
The bullet screen data is also loaded dynamically, so again there are two options: Selenium, or finding the interface. Let's try to find the interface first and start capturing packets.
Packet capture turned up 5 packets in total, and the only one that looked like the bullet screen packet is the one highlighted in red, but it wasn't quite it. So I looked up the bullet screen interface online: https://comment.bilibili.com/305163630.xml
So how is this bullet screen interface assembled? The only variable part is the string of digits in the middle. Where does that number come from? Let's put the URL of the packet we captured next to the interface:
https://api.bilibili.com/x/v2/dm/web/seg.so?type=1&oid=305163630&pid=929491224&segment_index=2
https://comment.bilibili.com/305163630.xml
Compare the two: the oid in the captured URL is exactly the string of digits the bullet screen interface needs. Now it's simple: once we have that number for each video we can splice the interface URL and fetch the data. Type them in one by one by hand? No, too low-tech. My approach is to grab the number with mitmdump, and for mitmdump to see the request we have to click the bullet screen list button on each video page, so Selenium comes in again: it visits every video page and clicks the button, and mitmdump intercepts the URL and extracts the number. The mitmdump script:
def response(flow):
    try:
        if "https://api.bilibili.com/x/v2/dm/web/seg.so?type=1" in flow.request.url:
            print("*_*" * 100)
            dic = {}
            # the oid in this url is the number the xml interface needs (the cid)
            dic["cid"] = flow.request.url.split("&oid=")[-1].split("&pid=")[0]
            # the pid is the video's aid, kept so the bullet screens can be matched back to a video
            dic["oid"] = flow.request.url.split("&pid=")[-1].split("&")[0]
            with open('cid.csv', 'a') as f:
                w = csv.writer(f)
                w.writerow(dic.values())
    except:
        pass
Simulate clicking the bullet screen list button:

def get_pageurl():
    urls = []
    head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
    js = requests.get("https://api.bilibili.com/x/space/arc/search?mid=2026561407&ps=30&tid=0&pn=1&keyword=&order=pubdate&jsonp=jsonp", headers=head).json()
    for i in js["data"]["list"]["vlist"]:
        urls.append("https://www.bilibili.com/video/" + i["bvid"])
    print("*" * 30 + " page url collection finished " + "*" * 30)
    return urls

def bro_chrome(url):
    bro.get(url)
    bro.maximize_window()  # maximize the browser window
    sleep(12)
    bro.find_element_by_xpath('//div[@class="bui-collapse-arrow"]/span[1]').click()  # expand the bullet screen list so the seg.so request fires
    sleep(4)
We run mitmdump with mitmdump -s <script file> -p <port number> to filter out the bullet screen URL, extract the number, and save it.
Now we have this number for every video and can splice the interface URLs:
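One step the post does not show: for the response() hook above to see the seg.so request, Chrome has to send its traffic through the mitmproxy port. A minimal sketch under that assumption (the script name dm_capture.py and port 8080 are just examples):

# In a separate terminal (the script name is only an example):
#   mitmdump -s dm_capture.py -p 8080
option = webdriver.ChromeOptions()
option.add_argument('--proxy-server=http://127.0.0.1:8080')  # route requests through mitmdump
option.add_argument('--ignore-certificate-errors')           # or install the mitmproxy CA certificate instead
bro = webdriver.Chrome(options=option)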
def get_dmurl():
    urls = get_pageurl()
    for url in urls:
        bro_chrome(url)          # visit every video page so mitmdump can record the cid
    bro.quit()
    names = ["cid", "oid"]
    df = pd.read_csv("cid.csv", names=names)
    df = df.drop_duplicates()
    df = df.reset_index(drop=True)
    pd.DataFrame(df).to_csv('.\\cid.csv')
    urls = ["https://comment.bilibili.com/%s.xml" % (i) for i in df.cid]
    print("*" * 15 + " bullet screen url splicing finished " + "*" * 15)
    return urls

With the URLs, getting the data is easy: the response is XML and parsing it is trivial, so straight to the code:
def get_dm(url):
    print("*" * 30 + " start fetching bullet screen data " + "*" * 30)
    if not os.path.exists("dm"):
        os.mkdir("dm")
    head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
    r = requests.get(url, headers=head)
    r.encoding = 'utf8'
    html = etree.HTML(r.text.encode('utf-8'))
    d_list = html.xpath("//d")  # each <d> element is one bullet screen comment
    for d in d_list:
        dm = d.xpath("./text()")[0]
        name = url.split("com/")[-1].replace(".xml", "")
        with open("./dm/%s.txt" % (name), "a", encoding="utf-8") as f:
            f.write(dm)
            f.write('\n')

Summary of bullet screen data
To get the bullet screen data we had to find the interface and the string of digits it needs, extract that number with mitmdump and Selenium, then splice the URLs and fetch the data. That is my method; I'm sure there are better ones.
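Tying this part together, with mitmdump already running in another terminal and the browser routed through it as sketched above, an assumed driver (not in the original post) is roughly:

def crawl_dm():
    # get_dmurl() drives the browser through every video page (mitmdump fills cid.csv)
    # and returns the spliced comment.bilibili.com urls; get_dm() downloads each xml.
    for url in get_dmurl():
        get_dm(url)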
4. Download videos
I couldn't work out the interface URL for the video files themselves, so I used another method: Chrome has an extension called Bilibili Helper. With it we visit each video, simulate the clicks, and download. (The extension is quite good; it has a lot of features and can also download bullet screens.) Installing the extension may require a VPN.
I had thought it would be easy: open the uploader's profile page, simulate a click on each video, click Bilibili Helper, then click download, one by one. First I loop over the homepage and click each video to open its page (we already collected all the video URLs earlier, so this step could just as well be done by visiting each URL directly):
def Home_Video():
    list = bro.find_elements_by_xpath('//ul[@class="clearfix cube-list"]/li')
    for i in range(len(list)):
        list[i].click()
        ws = bro.window_handles          # the video opens in a new tab
        bro.switch_to.window(ws[1])      # switch to the new tab
        sleep(6)
        Video_list()                     # handle single videos and playlists
        bro.close()                      # (not in the original snippet) close the video tab
        bro.switch_to.window(ws[0])      # and return to the profile page
        list = bro.find_elements_by_xpath('//ul[@class="clearfix cube-list"]/li')  # re-locate to avoid stale elements
Then click download
def Download_video():
    try:
        sleep(2)
        bro.find_element_by_id('bilibiliHelper2HandleButton').click()  # click the Bilibili Helper button
    except Exception as e:
        print(end="\n")
        print(e)
        pass
    try:
        sleep(2)
        bro.find_element_by_xpath('//div[@class="sc-juenpm cCSKON"]/div[1]').click()  # click the download entry in the pop-up
    except Exception as e:
        print(end="\n")
        print(e)
        pass

One thing to consider here: a video may be part of a playlist. If there is a playlist we need to download every part; if it is a single video, one download is enough. In other words, download as many times as there are parts in the playlist, and just once otherwise.
def Video_list():
    try:
        # if the page has a playlist panel, download every part in it
        if bro.find_element_by_xpath('//*[@id="multi_page"]/div[1]/div[1]/h3').text == "video collection":  # the playlist heading text (Chinese on the real page)
            li_list = bro.find_elements_by_xpath('//ul[@class="list-box"]/li')
            for i in range(len(li_list)):
                li_list[i].click()
                bro.refresh()
                sleep(3)
                Download_video()
                li_list = bro.find_elements_by_xpath('//ul[@class="list-box"]/li')  # re-locate after the refresh
    except:
        Download_video()  # single video: download it once
        pass

One more thing to note: the Chrome instance that Selenium starts by default does not load our extensions, so we have to make Chrome start with the extension enabled:
option = webdriver.ChromeOptions()
option.add_argument("--user-data-dir=" + r"C:/Users/13772/AppData/Local/Google/Chrome/User Data/")  # start Chrome with the existing user profile so the extension is loaded
bro = webdriver.Chrome(options=option)

The path after --user-data-dir is your own Chrome user data directory; you can type chrome://version/ into Chrome's address bar to see this and other browser information.
Summary of video download
The method depends too heavily on the extension: everything hinges on automatically clicking the plug-in to download.

Overall summary
Through these several scripts we can get all the videos and data of a Bilibili uploader, and overall I think it works well.
Bullet screen data: the method above only gets part of it; a more complete way is to download the bullet screens automatically along with the videos.
Comment data: it is best to sleep for a while between requests or crawl in batches; a proxy pool is not really necessary.