This is day 2 of my participation in the August Update Challenge. For details, see: August Update Challenge
Background
At noon I was scrolling on my phone while eating and came across the news that Douyin had just launched a web version. I opened it for a look and, wow, isn't this a copy of... er, no, "inspired by" YouTube. Since the site had only just launched, I figured it could not have a strong anti-crawling mechanism yet, so I wrote a crawler for fun.
The target site
The Douyin PC web version
Code repository
Open-source address: Douyin crawler
Implementation approach
The implementation is mainly in Python 3
- requests: fetches the HTML of web pages, downloads the videos, and so on
- re: uses regular expressions to match links and similar data in the HTML
- Selenium: the video page contains dynamically loaded data, so Selenium is used to obtain the data after the page has loaded. It has to be used together with a browser driver
The implementation process
1. Request the list page, get its HTML, match all detail-page links in the page, and build a link pool.
2. Originally I wanted to take each detail-page address from the link pool, request its HTML, match the video resource address in it, and put that into a video pool. Testing showed, however, that the video resource link is loaded dynamically, so it cannot be obtained directly with requests. Instead, Selenium opens a browser and loads the page; once the page has loaded, an XPath expression locates the video resource and the resource link is extracted into the video pool.
3. Take the resource addresses from the video pool, download them, and save them locally.
4. At this point the basic functionality works. When running the code, though, I found that the link pool only ever contained 10 links, so only 10 videos could be downloaded.
5. There is a fix for this: scroll the page automatically until the end so that the whole list loads, then grab the link pool. Here I used Selenium to inject JS that scrolls the page. Running the code again, I still could not get the full link pool, only the last page or two. My guess is that the site optimizes the list by removing items that have scrolled out of view.
6. This can be solved as well: grab the links after every scroll, so that the complete set has been collected by the time scrolling finishes. That creates a new problem, duplicate links. Once scrolling is done, converting the data type (list to set) removes the duplicates.
7. After that, you can happily crawl the videos.
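Steps 5 and 6 can be sketched in isolation, without a browser. This is a minimal illustration, assuming each scroll yields a snapshot of the page source as a string. The helper name `collect_links` and the sample snapshots are hypothetical; the pattern is adapted from the regular expression in the full listing.

```python
import re

# Detail-page link pattern, adapted from the regex used in the full listing.
VIDEO_LINK = re.compile(r'https://www\.douyin\.com/video/\d+\?previous_page=')


def collect_links(snapshots):
    """Gather links from successive page-source snapshots.

    Duplicates are dropped and first-seen order is kept
    (dict keys preserve insertion order in Python 3.7+).
    """
    seen = {}
    for html in snapshots:
        for link in VIDEO_LINK.findall(html):
            seen.setdefault(link, None)
    return list(seen)


# Hypothetical snapshots: the second overlaps the first because the
# page drops old items and shows new ones as it scrolls.
snapshots = [
    '<a href="https://www.douyin.com/video/111?previous_page=x"></a>'
    '<a href="https://www.douyin.com/video/222?previous_page=x"></a>',
    '<a href="https://www.douyin.com/video/222?previous_page=x"></a>'
    '<a href="https://www.douyin.com/video/333?previous_page=x"></a>',
]
links = collect_links(snapshots)
print(len(links))  # 3 unique links from 4 matches
```

A plain `set`, as used in the full listing, deduplicates just as well; the dict variant only matters if you want the videos downloaded in the order they appear on the page.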
That may seem like a lot, but it only takes about 50 lines of code to do it
The specific code is as follows
import os
import re
import time
from contextlib import closing

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By


def get_home_urls(url):
    """Scroll through a user's home page and collect the detail-page links."""
    browser = webdriver.Chrome()
    browser.get(url)
    # Read the total number of videos shown on the home page
    num = browser.find_element(
        By.XPATH,
        '//*[@id="root"]/div/div[2]/div/div[4]/div[1]/div[1]/div[1]/span').text
    # Number of scroll rounds, counting 15 videos per round
    num = int(num) // 15 + 1
    urls = []
    for i in range(num):
        # Grab the links currently in the page, then scroll to load more
        new_list = re.findall(
            r'https://www\.douyin\.com/video/\d*\?previous_page=',
            browser.page_source)
        urls.extend(new_list)
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(2)
    print(urls)
    # Converting the list to a set removes the duplicate links
    main_list = set(urls)
    print(main_list)
    browser.quit()
    return main_list


def get_video(url):
    """Open a detail page in the browser and return the video resource URL."""
    browser = webdriver.Chrome()
    browser.get(url)
    source = browser.find_element(
        By.XPATH,
        '//*[@id="root"]/div/div[2]/div[1]/div[1]/div[1]/div[1]/div/div[1]/div[2]/video/source[1]')
    video_url = source.get_attribute('src')
    browser.quit()
    return video_url


def download(url, name):
    """Stream the video to video/<name>.mp4 in 1 KB chunks."""
    os.makedirs('video', exist_ok=True)
    # verify=False skips TLS certificate verification, as in the original
    with closing(requests.get(url=url, verify=False, stream=True)) as res:
        with open('video/{}.mp4'.format(name), 'wb') as fd:
            for chunk in res.iter_content(chunk_size=1024):
                if chunk:
                    fd.write(chunk)


if __name__ == '__main__':
    url = input('Please enter the home page address: ')
    print('Getting video addresses')
    urls = get_home_urls(url)
    print(urls)
    print('Successfully obtained {}, start downloading'.format(len(urls)))
    index = 1
    for url in urls:
        video_url = get_video(url)
        print('Downloading {}/{}'.format(index, len(urls)))
        download(video_url, index)
        print('Download complete')
        index += 1
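The number of scroll rounds in the listing above is derived from the total video count by dividing by 15 and adding one. The following is a minimal sketch of the same arithmetic using ceiling division, which saves one round when the count is an exact multiple of the batch size. The batch size of 15 is taken from the listing; `scroll_rounds` is a hypothetical helper name, and whether the extra round in the listing is a deliberate safety margin is not stated.

```python
def scroll_rounds(total_videos, per_batch=15):
    """Return how many batches are needed to cover total_videos items."""
    if per_batch <= 0:
        raise ValueError('per_batch must be positive')
    # Integer ceiling division: rounds up without going through floats
    return (total_videos + per_batch - 1) // per_batch


for total in (14, 15, 16, 30):
    print(total, scroll_rounds(total))
```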
Video demo
www.bilibili.com/video/BV1Af…
Conclusion
Crawling this site posed no major obstacles and there was no complex anti-crawling mechanism, so there is not much more to say... By the way, how many years would this get me...
Supplement
This program no longer works with the current version of Douyin. The site has added a new anti-crawling mechanism that pops up a slider CAPTCHA, so this article is for reference only.