There is no greater pleasure than looking at black silk.
I. Technical route
requests: sends HTTP requests to fetch the web pages
BeautifulSoup: parses the HTML pages
re: regular expressions for extracting information from the HTML
os: saves the downloaded files
```python
import re
import requests
import os
from bs4 import BeautifulSoup
```
II. Obtaining web page information
getHtml() is a fixed-format function for fetching a web page; it returns the page content as a string. The headers parameter simulates a real browser so the site does not detect (and block) the crawler.
```python
def getHtml(url):
    # simulate a browser so the site does not block the request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print('network status error')
```
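As a quick sanity check (a small sketch, assuming the site is reachable from your network), you can fetch the list page used later in this article and preview the returned string:

```python
html = getHtml('http://www.netbian.com/meinv/')
if html:                # getHtml() returns None when the request fails
    print(html[:200])   # preview the first 200 characters of the page source
```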
III. Web page crawl analysis
Right-click on the image area and choose Inspect Element to see the link of the picture on the current page. I happily copied that link, opened it and saved the image, only to find the file was a mere 60 KB: a thumbnail, far from clear, so that approach was abandoned...
There was no alternative but to open each image's detail page and crawl it separately.
Right-click a blank area and view the page source, then search for the thumbnail link copied a moment ago to locate it quickly. Analysis shows that all of the image detail-page links sit inside a div tag with class='list', which is unique on the page, so BeautifulSoup can be used to extract that tag. The detail-page link itself follows href= (note that some invalid links also appear inside this div; comparing them, the invalid ones contain the word 'https', so they can be excluded in code on that basis, which corresponds to the function in Section IV). Extract the link, prepend the home page address, and open it; right-click the picture, choose Inspect Element again, and the link copied there downloads an image of close to 1 MB, i.e. the HD picture. From this point on, all that remains is to call the download-and-save function.
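Before writing the full function, the idea can be verified interactively. The sketch below is an illustration only; it assumes the page structure described above (a single div with class='list', detail links in href attributes, invalid links containing 'https') and reuses the imports and getHtml() defined earlier:

```python
demo = getHtml('http://www.netbian.com/meinv/')
soup = BeautifulSoup(demo, 'html.parser')
sp = soup.find_all('div', class_='list')        # the unique div that wraps the thumbnails
links = re.findall(r'a href="(.*?)"', str(sp))  # raw href value of every <a> inside it
for link in links:
    if 'https' in link:                         # invalid links contain 'https', skip them
        continue
    print('http://www.netbian.com' + link)      # detail-page link of one image
```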
IV. Obtaining the detail page links
The first goal is to crawl the detail-page link of every image on every list page, in preparation for the subsequent HD image crawl; this is done with the function def getUrlList(url).
```python
def getUrlList(url):
    url_list = []
    demo = getHtml(url)
    soup = BeautifulSoup(demo, 'html.parser')
    # class='list' is unique on the page, so it anchors the one div we need;
    # the page source says class, but BeautifulSoup uses class_ to avoid the Python keyword
    sp = soup.find_all('div', class_="list")
    nls = re.findall(r'a href="(.*?)"', str(sp))  # href value of every <a> tag inside the div
    for i in nls:
        if 'https' in i:  # invalid links contain 'https', skip them
            continue
        url_list.append('http://www.netbian.com' + i)  # prepend the site root to form a complete, valid link
    return url_list
```
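A quick way to try the function (a sketch using the page-1 link from this article):

```python
detail_links = getUrlList('http://www.netbian.com/meinv/')
print(len(detail_links), 'detail pages found')
print(detail_links[:3])   # preview the first few links
```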
V. Saving the picture from its image link
Having obtained the detail-page link of each image above, open it, right-click the image and inspect the element, copy the link that locates the HD image, and then save the picture.
```python
def fillPic(url, page):
    pic_url = getUrlList(url)  # detail-page links of every image on this list page
    path = './beauty/'  # save path
    for p in range(len(pic_url)):
        pic = getHtml(pic_url[p])
        soup = BeautifulSoup(pic, 'html.parser')
        # class_="pic" is unique on the detail page, so it anchors the div holding the HD image;
        # again the page source says class, but BeautifulSoup uses class_
        psoup = soup.find('div', class_="pic")
        picUrl = re.findall(r'src="(.*?)"', str(psoup))[0]  # the first match is the HD image link
        pic = requests.get(picUrl).content  # fetch the image as binary content (works for pictures, audio, video)
        image_name = 'beauty_page{}_'.format(page) + str(p + 1) + '.jpg'  # file name built from page number and image index
        image_path = path + '/' + image_name
        with open(image_path, 'wb') as f:
            f.write(pic)
            print(image_name, 'download complete!!')
```
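A one-page test run might look like this (a sketch; the save directory used in fillPic() has to exist before any image is written):

```python
os.makedirs('./beauty/', exist_ok=True)      # make sure the save path exists
fillPic('http://www.netbian.com/meinv/', 1)  # download every HD image found on page 1
```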
VI. The main() function
With the main framework above in place, all that is left is the up-front setup for the whole program, so here is the code directly.
The link of page 1 is www.netbian.com/meinv/
The link of page 2 is www.netbian.com/meinv/index…
Every later page follows the same pattern as page 2 and only changes the final number, so the code has to distinguish the link of the first page from those of the following pages and handle them separately. The main() function also lets the user choose how many pages to crawl, as shown in the code.
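The pattern can be checked with a couple of lines before writing main() (a sketch for, say, 4 pages):

```python
n = 4
page_urls = ['http://www.netbian.com/meinv/']   # page 1 has its own link
page_urls += ['http://www.netbian.com/meinv/index_{}.htm'.format(i) for i in range(2, n + 1)]
for u in page_urls:
    print(u)
```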
```python
def main():
    n = input('Please enter the number of pages to crawl: ')
    # link of page 1 of this category; change the directory to crawl a different category
    url = 'http://www.netbian.com/meinv/'
    if not os.path.exists('./beauty/'):  # create the save directory if it does not exist yet
        os.mkdir('./beauty/')
    page = 1
    fillPic(url, page)  # crawl page 1
    if int(n) >= 2:  # crawl page 2 and the pages after it
        ls = list(range(2, 1 + int(n)))
        url = 'http://www.netbian.com/meinv/'
        for i in ls:  # visit each requested page in turn
            page = str(i)
            url_page = 'http://www.netbian.com/meinv/'
            url_page += 'index_' + page + '.htm'  # build the link of page i
            fillPic(url_page, page)  # call fillPic() to crawl and save this page
```
VII. Complete code
Finally, call main() and enter the number of pages to crawl to start the crawl. The complete code is as follows.
```python
import re
import requests
import os
from bs4 import BeautifulSoup


def getHtml(url):
    # simulate a browser so the site does not block the request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print('network status error')


def getUrlList(url):  # get the detail-page links of all images on one list page
    url_list = []
    demo = getHtml(url)
    soup = BeautifulSoup(demo, 'html.parser')
    # class='list' is unique on the page, so it anchors the one div we need;
    # the page source says class, but BeautifulSoup uses class_ to avoid the Python keyword
    sp = soup.find_all('div', class_="list")
    nls = re.findall(r'a href="(.*?)"', str(sp))  # href value of every <a> tag inside the div
    for i in nls:
        if 'https' in i:  # invalid links contain 'https', skip them
            continue
        url_list.append('http://www.netbian.com' + i)  # prepend the site root to form a complete, valid link
    return url_list


def fillPic(url, page):  # save the HD image behind every detail-page link
    pic_url = getUrlList(url)  # detail-page links of every image on this list page
    path = './beauty/'  # save path
    for p in range(len(pic_url)):
        pic = getHtml(pic_url[p])
        soup = BeautifulSoup(pic, 'html.parser')
        # class_="pic" is unique on the detail page, so it anchors the div holding the HD image
        psoup = soup.find('div', class_="pic")
        picUrl = re.findall(r'src="(.*?)"', str(psoup))[0]  # the first match is the HD image link
        pic = requests.get(picUrl).content  # fetch the image as binary content
        image_name = 'beauty_page{}_'.format(page) + str(p + 1) + '.jpg'  # file name built from page number and image index
        image_path = path + '/' + image_name
        with open(image_path, 'wb') as f:
            f.write(pic)
            print(image_name, 'download complete!!')


def main():
    n = input('Please enter the number of pages to crawl: ')
    # link of page 1 of this category; change the directory to crawl a different category
    url = 'http://www.netbian.com/meinv/'
    if not os.path.exists('./beauty/'):  # create the save directory if it does not exist yet
        os.mkdir('./beauty/')
    page = 1
    fillPic(url, page)  # crawl page 1
    if int(n) >= 2:  # crawl page 2 and the pages after it
        ls = list(range(2, 1 + int(n)))
        url = 'http://www.netbian.com/meinv/'
        for i in ls:  # visit each requested page in turn
            page = str(i)
            url_page = 'http://www.netbian.com/meinv/'
            url_page += 'index_' + page + '.htm'  # build the link of page i
            fillPic(url_page, page)  # call fillPic() to crawl and save this page
```
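To actually start the crawl, add a call to main() at the bottom of the script; a minimal sketch:

```python
if __name__ == '__main__':
    main()    # prompts for the number of pages and starts downloading
```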
This article's introduction to crawling HD black-silk beauty pictures from the site ends here. Thank you for reading; I hope it gives friends who want to learn something useful, and more content can be found on the author's home page.