Preface
The text and images in this article come from the internet and are for learning and exchange only, not for any commercial purpose. Copyright belongs to the original author; if you have any questions, please contact us.
By IvanFX, Revival Computer Society
PS: If you need Python learning materials, click the link below to obtain them:
Note.youdao.com/noteshare?i…
Basic steps and preparations
Debugging environment: PyCharm + Python 3
Required libraries:
- urllib.request
- re
(http.cookiejar is a library used in later crawler tutorials; this project does not involve advanced anti-crawling measures, so it is not needed here.)
If any of the above libraries fails to import, you can add it via File → Settings → Project Interpreter, then click + on the right (if you are using Anaconda or a plain Python install, you can also run pip install from the command line). Note that urllib and re are part of the Python standard library, so they normally do not need to be installed separately.
2. In this article we use Python to crawl, download, and store online short videos. The basic steps are as follows (writing them down as notes helps clarify your thinking):
(1) Analyze the page URL and video file URL characteristics
(2) Obtain the page source HTML and work around the anti-crawling mechanism
(3) Batch download and store the videos
Analyze the page URL and file URL features
1. Analyze the web page URLs
Through the website: www.budejie.com/video/1, we can…
2. Analyze the file URLs
By examining the MP4 file names in the page source, we find that the file URLs appear in plain text, so they can be matched with a regular expression via the re module.
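As a sketch of this idea, assuming a simplified page fragment with a data-mp4 attribute (the snippet and URL below are made up for illustration):

```python
import re

# Hypothetical fragment of page source containing a data-mp4 attribute
html = '<div class="video" data-mp4="http://example.com/uploads/abc123.mp4"></div>'

# The non-greedy group (.*?) captures just the URL between the quotes
reg = r'data-mp4="(.*?)"'
urls = re.findall(reg, html)
print(urls)  # ['http://example.com/uploads/abc123.mp4']
```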
Get urls in batches and extract the urls of videos from them
import urllib.request
import re

for page in range(1, 20):
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    html = urllib.request.urlopen(req).read()
    html = html.decode('UTF-8')
    print(html)
1. Crawl web page URLs in batches
Here the page variable represents the page number; for now we crawl pages 1 through 19 (range(1, 20) stops before 20).
(1) req builds the request for the page; (2) html receives the page source via urlopen; (3) decoding the source as UTF-8 restores the Chinese text for display.
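The UTF-8 decoding step can be illustrated in isolation; the bytes here are hard-coded to stand in for what urlopen().read() returns:

```python
# urlopen().read() returns bytes; decoding as UTF-8 restores the text
raw = '百思不得姐'.encode('UTF-8')  # stand-in for the raw response body
text = raw.decode('UTF-8')
print(text)  # 百思不得姐
```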
However, running the code above produces HTTP Error 403, because the page's anti-crawling mechanism rejects requests that do not look like they come from a browser.
2. Add a request header
We open the page in Chrome, press F12, switch to the Network tab, refresh, and observe the requests. Pick a request from the list (baisibudejie.js is chosen here), copy its User-Agent header, and add it to the code as follows; the page can then be crawled normally.
for page in range(1, 20):
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
    html = urllib.request.urlopen(req).read()
    html = html.decode('UTF-8')
    print(html)
Download videos in batches and set up file name storage
1. Build a batch naming loop structure
i.split("/")[-1] splits i with '/' as the separator and keeps the last segment, i.e. the MP4 file name.
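A quick illustration with a made-up URL:

```python
url = "http://example.com/video/abc123.mp4"  # illustrative URL
filename = url.split("/")[-1]  # last segment after the final '/'
print(filename)  # abc123.mp4
```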
2. Batch download
As before, we add a print statement to show progress, which keeps the program interactive: it reports each video as it is being downloaded, then saves it as an MP4 file into the mp4 folder.
for i in re.findall(reg, html):
    filename = i.split("/")[-1]  # split on '/' and keep the last segment, the MP4 file name
    print('Downloading %s video' % filename)
    urllib.request.urlretrieve(i, "mp4/%s" % filename)
1. Establish a complete program
As a qualified programmer, you should tidy up the program and add comments so it is easy to understand and to modify later.
import urllib.request
import re

def getVideo(page):
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
    html = urllib.request.urlopen(req).read()
    html = html.decode('UTF-8')
    reg = r'data-mp4="(.*?)"'  # capture the MP4 URL from the data-mp4 attribute
    for i in re.findall(reg, html):
        filename = i.split("/")[-1]  # split on '/' and keep the last segment, the MP4 file name
        print('Downloading %s video' % filename)
        urllib.request.urlretrieve(i, "mp4/%s" % filename)

for i in range(1, 20):
    getVideo(i)
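One caveat: urlretrieve cannot create directories, so the download will fail if the mp4 folder does not already exist. A minimal sketch of creating it beforehand (the folder name is taken from the code above):

```python
import os

# Create the target folder before downloading; exist_ok avoids an
# error if it is already there.
os.makedirs("mp4", exist_ok=True)
print(os.path.isdir("mp4"))  # True
```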