Python crawler: crawl videos on Zhihu and save the download links to a Markdown file

This article is participating in Python Theme Month. See Python Theme Month for details

1. Required Python modules

Mainly the `requests` module, used to fetch web page data. The installation command is: `pip install requests`
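As a quick sanity check that `requests` works, here is a minimal sketch of fetching a page with a browser-like User-Agent header (the helper name and the UA string are just illustrative choices, not from the original code):

```python
import requests

# a browser-like User-Agent so the site does not reject the request outright
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def fetch_page(url: str) -> str:
    """Fetch a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text
```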

2. Specific implementation process

Take the videos on the Zhihu account of "King of Glory" as an example.

We first go to the official Zhihu page of King of Glory and click on Videos. The link at this point is: www.zhihu.com/org/wang-zh…

Scroll down to find the page numbers at the bottom. Click to go to the second page; the link is now: www.zhihu.com/org/wang-zh…

So the link on the first page should be:www.zhihu.com/org/wang-zh…

With this url open, press F12 to enter developer mode and click XHR under the Network tab. As you can see, the page's data is in this JSON response, whose url is: www.zhihu.com/api/v4/memb…

By switching pages repeatedly, we find that the url for the data on page 2 is: www.zhihu.com/api/v4/memb…

`limit` should be the number of videos per page, and `offset` the starting position of the page: switching to another page only changes the offset, in steps of 20.
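In other words, each request fetches up to 20 videos and only the offset changes between pages. A small sketch of how the per-page offsets can be derived (the total of 95 videos below is just an assumed example):

```python
def page_offsets(total_videos: int, limit: int = 20):
    """Return the offset value to request for each page of results."""
    pages = total_videos // limit + (1 if total_videos % limit else 0)
    return [i * limit for i in range(pages)]

# an assumed total of 95 videos needs 5 pages
print(page_offsets(95))  # → [0, 20, 40, 60, 80]
```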

Analyzing the JSON data further, I found the video download links inside it:
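The exact JSON layout is truncated above, but judging from the fields the code reads later, each entry in the `data` list carries a title, a description, and a `video.playlist` dict keyed by quality. A hypothetical response might look like this (all field values here are invented for illustration):

```python
import json

# hypothetical example of one entry in the "data" list of the ajax response
sample = json.loads('''
{
  "data": [
    {
      "title": "Example video",
      "description": "An example description",
      "video": {
        "playlist": {
          "hd": {"width": 1280, "height": 720, "play_url": "https://example.com/hd.mp4"}
        }
      }
    }
  ]
}
''')

# walk the structure the same way the crawler does
for entry in sample['data']:
    for quality, info in entry['video']['playlist'].items():
        print(quality, info['play_url'])  # → hd https://example.com/hd.mp4
```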

How do we get the JSON url above? It is really just string concatenation! The url we entered was: www.zhihu.com/org/wang-zh…

The JSON url is:www.zhihu.com/api/v4/memb…

You just need to concatenate strings.
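A sketch of that concatenation, using a made-up member slug in place of the truncated one above:

```python
def build_ajax_url(page_url: str) -> str:
    """Turn a member's video-page url into the ajax url template."""
    # reuse everything between '/org/' and the '?' (inclusive) as-is
    member_part = page_url[page_url.find('/org/') + 5:page_url.rfind('?') + 1]
    return 'https://www.zhihu.com/api/v4/members/' + member_part + 'offset={}&limit=20'

# assumed example url (the real slug is truncated in the article)
url = 'https://www.zhihu.com/org/example-slug/zvideos?page=1'
print(build_ajax_url(url).format(0))
# → https://www.zhihu.com/api/v4/members/example-slug/zvideos?offset=0&limit=20
```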

3. Reference code and running results

Reference code:

import requests
from crawlers.userAgent import useragent  # the author's custom user-agent module (see the link at the end of the article)
import re
import json
import time


video_url = input("Please enter the url of the video page: ")
userAgent = useragent()
headers = {'user-agent': userAgent.getUserAgent()}
response = requests.get(url=video_url, headers=headers)
content = response.text  # HTML of the page
videos = int(re.findall(r'(\d+)', content)[0])  # total number of videos
print('Total number of videos:', videos)
page = 0  # number of pages
if videos % 20 == 0:
    page = videos // 20
else:
    page = videos // 20 + 1

def formatStr(string: str):
    # wrap the description to 35 characters per line for console output
    string = string.replace('\n', ' ')
    strLength = len(string.strip())
    if strLength == 0:
        return 'empty'
    num = strLength // 35 + 1
    str1 = ''
    for i in range(num):
        str1 += string[i * 35:(i + 1) * 35] + '\n'
    return str1

print('Total pages are:', page)
# example: ajax_url='https://www.zhihu.com/api/v4/members/wang-zhe-rong-yao-74-54/zvideos?offset=0&limit=20'
ajaxUrl = 'https://www.zhihu.com/api/v4/members/' + video_url[video_url.find('/org/') + 5:video_url.rfind('?') + 1] + 'offset={}&limit=20'
# the constructed ajax url template; the offset is filled in per page

file = input('Please enter the file name you want to create: ')
filePath = './{}.md'.format(file)

f = open(file=filePath, mode='a', encoding='utf-8')
for i in range(page):
    print('======= page {} ======='.format(i + 1))
    pageUrl = ajaxUrl.format(i * 20)  # fill in the offset without overwriting the template
    response1 = requests.get(url=pageUrl, headers=headers)
    dict1 = json.loads(response1.text)
    list1 = dict1['data']
    for list2 in list1:
        title = list2['title']
        descript1 = list2['description']
        print('------- [{}] --------'.format(title))
        print('--> %s' % formatStr(descript1))
        f.write('## {}\n'.format(title))
        f.write('{}\n'.format(descript1.replace('\n', ' ')))
        downloadUrl = list2['video']['playlist']
        print('------- video sizes are: ')
        for key in downloadUrl.keys():
            downUrl = downloadUrl[key]['play_url']
            print('{0}--width:{1},height:{2}'.format(key, downloadUrl[key]['width'], downloadUrl[key]['height']))
            f.write('<font face="HuaWenXinWei" size=4 color=blue>{}, download link is: {}</font><br>'.format(key, downUrl))
        f.write('============================================\n')
        print('-=' * 40)
    time.sleep(2)  # pause between pages to avoid burdening the server
f.close()


Results: readers can watch the author's Bilibili video for a demonstration; the video is titled "Python crawler: download videos on Zhihu".

When the run finishes, one more .md file appears in the same folder as the script, and the video download links are stored in it. Note also that the code imports a custom module written by the author; readers who don't know it can read the author's earlier article: Python crawler: make a belongs to own IP agent module
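The `crawlers.userAgent` module imported above is the author's own; readers who don't have it can substitute a minimal stand-in with the same interface. The sketch below is such a stand-in (the class and method names match the article's code, but the UA strings are just common browser values chosen for illustration):

```python
import random

class useragent:
    """Minimal stand-in for the author's custom crawlers.userAgent module."""
    _agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    ]

    def getUserAgent(self) -> str:
        # return a random browser user-agent string
        return random.choice(self._agents)
```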

4. To summarize

Because some of the videos are very large, the author did not download them and only wrote out the download links of the Zhihu videos. Also, some sites host far too many videos to download them all; as a civilized crawler, we should not put a heavy burden on the target servers. We're participating in Python Theme Month, so if you think this article is a good one, please give it a thumbs up, and if you have any questions or suggestions, feel free to leave them in the comments!