Hi, I'm Python Advancer.
Preface
A few days ago, Snowball shared some Python code in the Python exchange group for obtaining the video selection (the list of parts) of a Bilibili video. I thought it was very nice, so I have organized it into a short article to share with everyone.
You are probably already familiar with Snowball, who has previously written these Python articles:
Captcha annotation and recognition: online deployment of the project (Vue front end, Java back end, and Python back end)
CNN neural network model: training, testing, and deployment
Captcha annotation and recognition (front end + back end for efficient data annotation)
Data acquisition, preprocessing, and character image cutting
Common captcha annotation and recognition (requirements analysis and implementation ideas)
The Python web crawler is used to retrieve 100,000 bytes of data from the Python library.
First, background introduction
When it comes to Bilibili (B station), the first thing that comes to mind is video, and I believe many readers have thought about using web crawler techniques to fetch Bilibili videos. In practice, though, Bilibili videos are not that easy to get. I previously introduced fetching Bilibili videos with the you-get library; interested readers can see this article: "You-get is so powerful!".
That aside, those of you who often study on Bilibili have probably come across series of dozens or even hundreds of videos published by some uploaders, especially serialized tutorials on programming languages, courses, and tool usage, as shown in the figure below.
The fields of these selections are plainly visible on the page, but extracting them with a program may not be as easy as you would expect. The goal of this article is to obtain a video's selection list using Python web crawler techniques, based on the Selenium library.
Second, concrete implementation
In this article we use Selenium, a library that drives a real browser and can simulate user actions such as logging in. It may seem slow, but it is widely used in web crawling and is a reliable way to simulate logins and retrieve data. The complete code for collecting the video selection is below; you are welcome to try it yourself.
# coding: utf-8
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


class Item:
    """One entry of the video selection: page number, part title, duration."""

    def __init__(self, page_num, part, duration):
        self.page_num = page_num
        self.part = part
        self.duration = duration

    def get_second(self):
        """Convert a duration such as "12:34" or "1:02:03" into seconds."""
        str_list = self.duration.split(":")
        total = 0
        for i, item in enumerate(str_list):
            total += pow(60, len(str_list) - i - 1) * int(item)
        return total


def get_bilili_page_items(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')  # run Chrome without a visible window
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    # options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2,
    #                                           "profile.managed_default_content_settings.flash": ...})
    browser = webdriver.Chrome(options=options)
    # browser = webdriver.PhantomJS()
    print("Requesting page...")
    browser.get(url)
    print("Waiting for webpage response...")
    # Wait (up to 10 seconds) until the selection list has been rendered.
    wait = WebDriverWait(browser, 10)
    wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@class="list-box"]/li/a')))
    print("Web data obtained...")
    # find_elements(By.XPATH, ...) is used instead of find_elements_by_xpath,
    # which was removed in Selenium 4.
    li_elements = browser.find_elements(By.XPATH, '//*[@class="list-box"]/li')
    item_list = []
    second_sum = 0
    # Loop over each entry of the selection list.
    for t in li_elements:
        element = t.find_element(By.TAG_NAME, 'a')
        # The link text has three lines: page number, part title, duration.
        arr = element.text.split('\n')
        print(" ".join(arr))
        item = Item(arr[0], arr[1], arr[2])
        second_sum += item.get_second()
        item_list.append(item)
    print("Number of videos:", len(item_list))
    # browser.page_source
    print("Total duration (minutes):", round(second_sum / 60, 2))
    print("Total duration (hours):", round(second_sum / 3600.0, 2))
    browser.quit()  # close the browser and end the driver process
    return item_list


get_bilili_page_items("https://www.bilibili.com/video/BV1Eb411u7Fw")
The selector used here is XPath. The example video is the selection for the complete "Advanced Mathematics" (Tongji edition) lecture series by teacher Song Hao on Bilibili. To grab the selection of another video, just change the URL on the last line of the code above.
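The duration arithmetic can be tried without launching a browser. Below is a minimal sketch, assuming each entry's text has the three-line shape the script splits on (page number, part title, duration). The helper names `duration_to_seconds` and `total_minutes` and the sample strings are made up for illustration and are not part of the original script.

```python
def duration_to_seconds(duration):
    """Convert "SS", "MM:SS" or "HH:MM:SS" into a number of seconds."""
    seconds = 0
    for field in duration.split(":"):
        # Shift what we have so far up by one base-60 place, then add the field.
        seconds = seconds * 60 + int(field)
    return seconds


def total_minutes(entries):
    """Sum the durations of entries shaped like 'P1', title, '05:30' on three lines."""
    total = 0
    for text in entries:
        page_num, part, duration = text.split("\n")
        total += duration_to_seconds(duration)
    return round(total / 60, 2)


# Hypothetical sample entries, not real data from the page.
sample = ["P1\nIntroduction\n05:30", "P2\nLimits\n1:02:03"]
print(duration_to_seconds("05:30"))  # 330
print(total_minutes(sample))         # 67.55
```

This gives the same result as the power-of-60 loop in `get_second`; it is only a different way of writing the same sum.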
Third, common problems
You may often run into the following problem at runtime, as shown in the figure below.
This is caused by a ChromeDriver version mismatch. Just follow the prompt and download the matching driver version. Driver download link:
https://chromedriver.storage.googleapis.com/index.html
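The mismatch behind this error is typically between the major version of the installed Chrome and that of ChromeDriver. Here is a small sketch of that comparison, assuming you have already read the two version strings yourself (for example from the `chrome://version` page and from `chromedriver --version`). The function names and the sample version numbers are illustrative only.

```python
def major_version(version):
    """Return the leading number of a dotted version string such as "114.0.5735.90"."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version, driver_version):
    """True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Illustrative version strings, not taken from a real machine.
print(versions_match("114.0.5735.199", "114.0.5735.90"))  # True
print(versions_match("115.0.5790.102", "114.0.5735.90"))  # False
```

If the check returns False, download the driver whose major version equals that of your Chrome.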
Fourth, summary
I'm Python Advancer. This article introduced a way to obtain the contents of a Bilibili video selection with a web crawler, using the Selenium library and XPath selectors, and covered a common problem you may hit along the way. Give it a try! If you run into any problems while learning, you are welcome to add me as a friend and I will invite you to the Python learning exchange group to discuss and learn together.