Preface

The text and images in this article were collected from the Internet and are shared for learning and exchange only, with no commercial purpose. Copyright belongs to the original authors; if you have any concerns, please contact us and we will address them.

By IvanFX revival computer society


Basic steps and preparations

Debugging environment:

PyCharm + Python 3

Required libraries:

  • urllib.request
  • re

(http.cookiejar is a cookie-handling library that later crawler tutorials will use. This project does not deal with anti-crawling measures that need it, so it can be left out.)

Both urllib.request and re ship with Python's standard library, so they normally import without any installation. If you later need a third-party library, you can add it in PyCharm by clicking + on the right of File → Settings → Project Interpreter, or install it from the command line with pip install (this also works if you use Anaconda or a plain Python installation).

2. This article uses Python to crawl, download, and store short online videos. The basic steps are as follows (writing them down as notes helps clarify the approach):

(1) Analyze the characteristics of the page URLs and the video file URLs
(2) Fetch the page's HTML source and get past the anti-crawling mechanism
(3) Download the videos in batches and store them

Analyze the page URL and file URL features

1. Analyze the page URLs

Through the website: www.budejie.com/video/1, we can…

2. Analyze the video file URLs

Inspecting the MP4 file names in the page source shows that each file's URL appears in plain text, so it can be matched with a regular expression using the re module.
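As a minimal sketch of that matching step, the snippet below runs a regular expression over a made-up HTML fragment. The data-mp4 attribute name is the one targeted by the pattern in the complete program of this article; the sample markup and the example.com URLs are invented for illustration and are not copied from the real page.

```python
import re

# Hypothetical HTML fragment; the real page's markup may differ.
sample_html = '''
<div data-mp4="http://example.com/videos/2018/0101/clip_a.mp4"></div>
<div data-mp4="http://example.com/videos/2018/0102/clip_b.mp4"></div>
'''

# The non-greedy group captures everything between the quotes after data-mp4=
VIDEO_URL_PATTERN = r'data-mp4="(.*?)"'

video_urls = re.findall(VIDEO_URL_PATTERN, sample_html)
for url in video_urls:
    print(url)
```

The non-greedy `.*?` matters here: a greedy `.*` would run on to the last quote on the line and could swallow other attributes.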

Fetch pages in batches and extract the video URLs from them

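import urllib.request
import re

# range(1, 20) yields page numbers 1 through 19
for page in range(1, 20):
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    html = urllib.request.urlopen(req).read()  # raw bytes of the page source
    html = html.decode('UTF-8')  # decode to text so Chinese displays correctly
    print(html)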

1. Crawl pages in batches

Here the page variable holds the page number. For now we crawl only the first pages of the listing; note that range(1, 20) stops before 20, so it covers pages 1 through 19.

(1) req is the request object built for the page URL
(2) html first holds the raw source of the page, as returned by urlopen(...).read()
(3) decoding that source as UTF-8 restores the Chinese text for display

However, running the code above produces HTTP Error 403: the site's anti-crawling mechanism rejects the request, so the page cannot be retrieved.
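A 403 response surfaces in urllib as a urllib.error.HTTPError exception. The sketch below shows one way to catch it so that a single rejected page does not stop the whole loop. The fetch_html helper and its opener parameter are hypothetical names introduced here for illustration (the parameter exists only so the function can be exercised without network access); they are not part of the original program.

```python
import urllib.error
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen):
    """Return the decoded page, or None if the server answers with an HTTP error."""
    try:
        with opener(urllib.request.Request(url)) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # err.code is the numeric status (for example 403), err.reason the message
        print("Request for %s failed: HTTP %s %s" % (url, err.code, err.reason))
        return None
```

Catching the error only reports the problem; it does not make the server accept the request.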

2. Add request headers

Open the page in Google Chrome, press F12, switch to the Network tab, and refresh the page to observe the requests. Pick one of the listed requests, view its request headers, and copy the User-Agent into the code (the baisibudejie.js request is used here). With the code modified as follows, the page can be crawled normally.

for page in range(1, 20):
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    # Send a browser User-Agent so the server does not reject the request
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
    html = urllib.request.urlopen(req).read()
    html = html.decode('UTF-8')
    print(html)

Download videos in batches and set the stored file names

1. Build a batch naming loop structure

i.split("/")[-1] splits i on '/' and keeps the last segment, which is the MP4 file name.
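To make the split concrete, here is a small sketch on a made-up URL (the example.com address is hypothetical). It also shows a variant using urllib.parse.urlparse, which is an optional extra not used in the original program and only matters if a URL carries a query string.

```python
from urllib.parse import urlparse

# Hypothetical video URL used only to demonstrate the string handling.
video_url = "http://example.com/videos/2018/0101/clip_a.mp4"

parts = video_url.split("/")
filename = parts[-1]  # last segment after the final '/'
print(parts)
print(filename)

# If the URL had a query string, splitting the raw URL would keep it in the name.
# Parsing first and splitting only the path avoids that.
url_with_query = video_url + "?token=abc"
clean_filename = urlparse(url_with_query).path.split("/")[-1]
print(clean_filename)
```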

2. Batch download

You also need an output statement to show progress, which suits an interactive program: print which video is being downloaded, then save it as an MP4 file in a folder.

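# reg is the regular expression matching the video URLs (defined in the complete program);
# html is the decoded page source
for i in re.findall(reg, html):
    filename = i.split("/")[-1]  # split the URL on '/' and keep the last segment, the MP4 file name
    print('Downloading %s video' % filename)
    urllib.request.urlretrieve(i, "mp4/%s" % filename)  # the mp4 folder must already exist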
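The print statement reports which file is being downloaded. For progress within a single file, urllib.request.urlretrieve also accepts a reporthook callback that it calls with the block count, the block size, and the total size. The sketch below is one possible use of that parameter; format_progress and show_progress are hypothetical helper names, and the commented usage line assumes a video_url variable and an existing mp4 folder.

```python
import urllib.request


def format_progress(block_num, block_size, total_size):
    """Turn urlretrieve's reporthook arguments into a short progress string."""
    if total_size <= 0:
        # The server did not report a size, so only the byte count is known.
        return "%d bytes" % (block_num * block_size)
    # The last block may overshoot the total, so cap the value at 100.
    percent = min(100, block_num * block_size * 100 // total_size)
    return "%d%%" % percent


def show_progress(block_num, block_size, total_size):
    # Overwrite the same console line on every callback.
    print("\r" + format_progress(block_num, block_size, total_size), end="")


# Usage (needs network access and an existing "mp4" folder):
# urllib.request.urlretrieve(video_url, "mp4/clip_a.mp4", reporthook=show_progress)
```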

3. Assemble the complete program

As a competent programmer, you should tidy up the program and add comments so that it is easy to understand and modify later.

import os
import re
import urllib.request


def getVideo(page):
    # Request one listing page, sending a browser User-Agent to avoid HTTP Error 403
    req = urllib.request.Request("http://www.budejie.com/video/%s" % page)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
    html = urllib.request.urlopen(req).read()
    html = html.decode('UTF-8')
    # Video URLs appear in plain text in the data-mp4 attribute
    reg = r'data-mp4="(.*?)"'
    for i in re.findall(reg, html):
        filename = i.split("/")[-1]  # split the URL on '/' and keep the last segment, the MP4 file name
        print('Downloading %s video' % filename)
        urllib.request.urlretrieve(i, "mp4/%s" % filename)


os.makedirs("mp4", exist_ok=True)  # create the target folder if it does not exist yet
for i in range(1, 20):  # pages 1 through 19
    getVideo(i)