It's the weekend again and she's not at home, so today we'll do something fun: write a crawler to grab short videos of cute girls, to enjoy quietly in the evening. If you like this kind of post, feel free to give it a like, favorite, and share.
Page analysis
This time we're crawling a page of the Six Rooms site. Official address: v.6.cn/minivideo/
Implementation idea:
- Capture packets to find the URL path we want
- Send a request to that URL and get the response
- Parse the response and weed out the data we don't need
- Save the data locally
So once we have the URL, all that's left is to send the request, and here we use the Requests library.
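As a minimal sketch of what such a request looks like, we can build (but not send) one with Requests' `Request`/`prepare`, so nothing actually goes over the network; the endpoint is the list API used later in this article:

```python
import requests

# Build a GET request to the mini-video list endpoint without sending it
req = requests.Request(
    "GET",
    "https://v.6.cn/minivideo/getMiniVideoList.php",
    params={"act": "recommend", "page": 1, "pagesize": 50},
    headers={"User-Agent": "Mozilla/5.0"},
)
prepared = req.prepare()

# The query parameters are appended to the URL for us
print(prepared.url)
print(prepared.headers["User-Agent"])
```

In real use you would call `requests.get(...)` directly; `prepare()` just makes it easy to inspect what Requests will send.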
So what makes up a request?
1. Request method: the common methods are GET and POST.
2. Request URL: a URL identifies a unique resource on the Internet, such as an image, a file, or a video.
3. Request headers. Note: crawlers generally need to add request headers. The parameters to watch are:
- User-Agent: identifies the visiting browser. If no User-Agent is set in the request headers, the server may treat you as an invalid user (i.e., flag you as a crawler), so it should be added to simulate a browser.
- Referer: tells the server where the visit came from. Some large sites use the Referer for hotlink protection, so crawlers should take care to simulate it too.
- Cookie: used to carry login information; include it in the request headers with care.
4. Request body: a GET request carries no body (its parameters go in the URL); a POST request carries form data in the body.
PS: 1. Login forms, file uploads, and so on attach their information to the request body. 2. If you log in with a wrong username and password and submit, you can see the POST in the developer tools; after a correct login the page usually redirects, so the POST is hard to capture.
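The GET/POST difference is easy to see by building (but not sending) both kinds of request with Requests; the URLs here are placeholders:

```python
import requests

# GET: parameters go into the URL, the body stays empty
get_req = requests.Request(
    "GET", "https://example.com/search", params={"q": "python"}
).prepare()
print(get_req.url)   # query string appended to the URL
print(get_req.body)  # None: a GET carries no body

# POST: form data goes into the request body, not the URL
post_req = requests.Request(
    "POST", "https://example.com/login", data={"user": "demo", "pwd": "secret"}
).prepare()
print(post_req.url)   # URL unchanged
print(post_req.body)  # form-encoded data: user=demo&pwd=secret
```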
3. Basic introduction to Response
1. Response status codes
200: success
301: permanent redirect
404: the resource does not exist
403: access forbidden
502: bad gateway (a server-side error)
2. Response headers
For example `Set-Cookie: BDSVRTM=0; Path=/` (there may be several) tells the browser to save a cookie.
3. Response body
What the Preview tab shows is the body of the response. It can be:
- JSON data
- a web page (HTML) or an image
- binary data (e.g. video), etc.
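To illustrate these fields without hitting the network, here is a hand-built `Response` object; this is only for demonstration, since normally you would get one back from `requests.get(url)`:

```python
import json
import requests

# Fake a response just to show its attributes (normally: resp = requests.get(url))
resp = requests.models.Response()
resp.status_code = 200
resp.headers["Content-Type"] = "application/json"
resp._content = json.dumps({"content": {"list": [{"alias": "demo"}]}}).encode("utf-8")

print(resp.status_code)              # 200: success
print(resp.ok)                       # True for status codes below 400
print(resp.headers["Content-Type"])  # a response header
data = resp.json()                   # parse the JSON body
print(data["content"]["list"][0]["alias"])
```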
Writing the code
4.1 Importing dependent libraries
import requests
import re
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@project: python-crawler-set
@file:    start.py
@ide:     IntelliJ IDEA
@author:  Big data
@date:    2020/12/19 15:21
"""
# Import the required dependencies
import requests
import re
# Filter out special characters
def match(title):
    # Compile a pattern matching characters that are illegal in file names
    pattern = re.compile(r'[\\\/:\*\?\"><\|]')
    # Replace each of them with an underscore
    return re.sub(pattern, "_", title)
# Set request headers and other parameters to avoid anti-crawler blocking
headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}
def main(num):
    url = "https://v.6.cn/minivideo/getMiniVideoList.php?act=recommend&page=%s&pagesize=50" % (num)
    print("Start downloading page %s" % (num))
    # Send the request
    data = requests.get(url, headers=headers)
    # Parse the response body as JSON
    json = data.json()
    # Pull out the part of the data we want
    datalist = json["content"]["list"]
    # Loop over the video entries
    for data in datalist:
        # Build the title
        title = data["alias"] + '.mp4'
        newTitle = match(title)
        # Get the video URL
        playurl = data["playurl"]
        # Send one more request to fetch the video data
        video = requests.get(playurl, headers=headers)
        # NB: the "video" folder must exist beforehand
        with open("video\\" + newTitle, 'ab') as output:
            # Write to a local file in binary mode
            output.write(video.content)
            print("Download successful:", newTitle)
if __name__ == '__main__':
    # Download pages 1 through 9
    for i in range(1, 10):
        main(i)
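A quick sanity check of the filename-cleaning idea used by `match()` above, rewritten here as a standalone helper so it can be run on its own:

```python
import re

# Characters that are illegal in Windows file names
ILLEGAL = re.compile(r'[\\/:*?"><|]')

def sanitize(title):
    # Replace every illegal character with an underscore
    return ILLEGAL.sub("_", title)

print(sanitize('a/b:c?.mp4'))   # a_b_c_.mp4
print(sanitize('plain.mp4'))    # unchanged
```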
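One caveat about the script above: `video.content` buffers the whole file in memory before writing it out. For large videos, a streamed variant (a sketch of my own, not from the original script) writes chunks to disk as they arrive:

```python
import os
import requests

def download_stream(url, path, headers=None, chunk_size=64 * 1024):
    """Download url to path in chunks instead of buffering it all in memory."""
    # Create the target folder if it does not exist yet
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
        # Fail loudly on 4xx/5xx instead of saving an error page as a video
        resp.raise_for_status()
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return path
```

In `main()` you could then call `download_stream(playurl, "video\\" + newTitle, headers=headers)` instead of the `requests.get(...).content` pair.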
Conclusion
Well, that's the end of today's crawler share. If you enjoyed it, you can follow my WeChat public account to read new posts first; I've also set up a GitHub repo collecting these projects that you're welcome to star. Follow the WeChat public account [big data elder brother] and reply "python" to get the source code.
Previously recommended:
- Python Bilibili video
- Learn how to convert PDF to Word using Python