There is no greater pleasure than looking at black silk.

I. Technical route

Requests for fetching the web pages, BeautifulSoup for parsing the HTML, re (regular expressions) for extracting information from the pages, and os for saving the files.

import re
import requests
import os
from bs4 import BeautifulSoup

II. Getting the web page content

Fetching a page follows a fixed pattern: the function returns the page content as a string, and the headers parameter simulates a browser so the site does not 'discover' that the request comes from a crawler.

def getHtml(url):  # fetch a page and return its content as a string
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}  # simulate a browser
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print('Network status error')
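A quick sanity check for getHtml() (a minimal usage sketch; the URL is the listing page crawled later in this article):

html = getHtml('http://www.netbian.com/meinv/')
if html:                 # getHtml() returns None if the request failed
    print(html[:200])    # show the first 200 characters of the page source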

III. Analyzing the web page

Right-click the image area and choose 'Inspect element' to see the link of the current image. I happily copied the link, opened it and saved the picture, only to find it was a mere 60 KB thumbnail, far from clear, so that approach was promptly abandoned...

The only alternative is to follow each image's link to its detail page and crawl the full-size picture separately.

Right-click a blank area and view the page source. Searching the source for the thumbnail link copied earlier locates the relevant markup quickly. All of the image detail-page links sit inside a div tag with class='list', which is unique on the page, so BeautifulSoup can extract that tag. Each detail-page link appears after href=. Some invalid links also live inside that div tag; comparing them with the valid ones shows that the invalid links contain the string 'https', so they can be filtered out in the code on that basis (this corresponds to the function in section IV). Extract a link, prepend the homepage URL, open it, then right-click the image and choose 'Inspect element': the copied link now downloads a picture of nearly 1 MB, which is the HD image. From here, all that remains is a download-and-save function to store the pictures.
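To make the filtering idea concrete, here is a minimal sketch on hand-written sample HTML (the markup is invented for illustration and only mimics the structure described above: a div with class='list', one valid relative link and one invalid link containing 'https'):

import re
from bs4 import BeautifulSoup

sample = '''
<div class="list">
  <a href="/desk/12345.htm">valid detail page</a>
  <a href="https://example.com/ad">invalid external link</a>
</div>
'''

sp = BeautifulSoup(sample, 'html.parser').find_all('div', class_='list')
for link in re.findall(r'a href="(.*?)"', str(sp)):
    if 'https' in link:          # the invalid links all contain 'https'
        continue
    print('http://www.netbian.com' + link)  # prepend the homepage to form the full detail-page URL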

IV. Getting the image detail-page links

The first goal is to crawl the detail-page link of every image on every page, in preparation for the later HD image crawl. This is handled by the function getUrlList(url).

def getUrlList(url):  # get the detail-page links of all images on one listing page
    url_list = []
    demo = getHtml(url)
    soup = BeautifulSoup(demo, 'html.parser')
    sp = soup.find_all('div', class_="list")  # class='list' is unique on the page, so it anchors the one div tag we need; the page source says class, but BeautifulSoup uses class_ to avoid clashing with Python's class keyword
    nls = re.findall(r'a href="(.*?)"', str(sp))  # extract the href values with a regular expression
    for i in nls:
        if 'https' in i:  # invalid links contain 'https', so skip them (see the analysis in section III)
            continue
        url_list.append('http://www.netbian.com' + i)  # prepend the homepage to each relative link to form a complete, valid link
    return url_list
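A quick way to check the function against the listing page used in this article (a minimal usage sketch; it only prints how many links were found and the first three):

links = getUrlList('http://www.netbian.com/meinv/')
print(len(links), 'detail-page links found')
for link in links[:3]:
    print(link)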

V. Saving the images from their links

With the detail-page link of each image obtained above, open it, right-click the image, choose 'Inspect element' and copy its link to locate the picture quickly, then save it.

def fillPic(url, page):  # download and save every HD image linked from one listing page
    pic_url = getUrlList(url)  # detail-page links of all images on this page
    path = './beauty'  # save path
    for p in range(len(pic_url)):
        pic = getHtml(pic_url[p])
        soup = BeautifulSoup(pic, 'html.parser')
        psoup = soup.find('div', class_="pic")  # class_="pic" anchors the unique div tag on the detail page; again the page source says class, but BeautifulSoup uses class_
        picUrl = re.findall(r'src="(.*?)"', str(psoup))[0]  # the regex returns a list, so take its first (and only) element
        pic = requests.get(picUrl).content  # fetch the image in binary form (images, audio and video must be handled as binary)
        image_name = 'beauty_page{}_'.format(page) + str(p + 1) + '.jpg'  # name the image by page and index
        image_path = path + '/' + image_name
        with open(image_path, 'wb') as f:  # save the image
            f.write(pic)
            print(image_name, 'download complete!!!')
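For a quick test, fillPic() can be called on page 1 alone, provided the save directory exists (a minimal usage sketch; './beauty' matches the save path assumed in fillPic() above):

import os

if not os.path.exists('./beauty'):
    os.mkdir('./beauty')  # create the save directory once
fillPic('http://www.netbian.com/meinv/', 1)  # download every HD image linked from page 1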

VI. The main() function

With the framework above in place, main() ties the whole program together; the code follows directly.

The link of page 1 is www.netbian.com/meinv/

The link of page 2 is www.netbian.com/meinv/index…

Every later page only changes the final number of the page-2 link, so the code must distinguish the first page from the later pages and handle them separately. The main() function also lets the user choose how many pages to crawl, as shown in the code.
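To make the two URL patterns concrete, here is a minimal sketch that builds the list of page URLs for the first n pages (n = 3 is only an example value; the 'index_<page>.htm' suffix comes from the code below):

n = 3  # example: crawl pages 1 to 3
base = 'http://www.netbian.com/meinv/'
page_urls = [base] + [base + 'index_' + str(i) + '.htm' for i in range(2, n + 1)]
print(page_urls)
# ['http://www.netbian.com/meinv/',
#  'http://www.netbian.com/meinv/index_2.htm',
#  'http://www.netbian.com/meinv/index_3.htm']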

def main():
    n = input('Please enter the number of pages to crawl: ')
    url = 'http://www.netbian.com/meinv/'  # home page of the resources; change this path to crawl a different category if you wish
    if not os.path.exists('./beauty'):  # create the save directory if it does not exist
        os.mkdir('./beauty/')
    page = 1
    fillPic(url, page)  # crawl page 1
    if int(n) >= 2:  # crawl page 2 and later pages
        ls = list(range(2, 1 + int(n)))
        url = 'http://www.netbian.com/meinv/'
        for i in ls:  # crawl each requested page in turn
            page = str(i)
            url_page = 'http://www.netbian.com/meinv/'
            url_page += 'index_' + page + '.htm'  # build the link of page i
            fillPic(url_page, page)  # call fillPic() for this page
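If the script is saved to a file, a common way to start it is to guard the call to main() so it only runs when the file is executed directly (this guard is a small convention added here, not part of the original code):

if __name__ == '__main__':
    main()  # prompts for the number of pages, then starts crawling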

VII. Complete code

Finally, call main() and enter the number of pages to crawl to start crawling. The complete code is as follows.

import re
import requests
import os
from bs4 import BeautifulSoup

def getHtml(url):  # fetch a page and return its content as a string
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}  # simulate a browser
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print('Network status error')

def getUrlList(url):  # get the detail-page links of all images on one listing page
    url_list = []
    demo = getHtml(url)
    soup = BeautifulSoup(demo, 'html.parser')
    sp = soup.find_all('div', class_="list")  # class='list' is unique on the page, so it anchors the one div tag we need
    nls = re.findall(r'a href="(.*?)"', str(sp))  # extract the href values
    for i in nls:
        if 'https' in i:  # invalid links contain 'https', so skip them
            continue
        url_list.append('http://www.netbian.com' + i)  # prepend the homepage to form a complete link
    return url_list

def fillPic(url, page):  # download and save every HD image linked from one listing page
    pic_url = getUrlList(url)  # detail-page links on this page
    path = './beauty'  # save path
    for p in range(len(pic_url)):
        pic = getHtml(pic_url[p])
        soup = BeautifulSoup(pic, 'html.parser')
        psoup = soup.find('div', class_="pic")  # class_="pic" anchors the unique div tag on the detail page
        picUrl = re.findall(r'src="(.*?)"', str(psoup))[0]  # the regex returns a list; take its first (and only) element
        pic = requests.get(picUrl).content  # fetch the image in binary form
        image_name = 'beauty_page{}_'.format(page) + str(p + 1) + '.jpg'  # name the image by page and index
        image_path = path + '/' + image_name
        with open(image_path, 'wb') as f:  # save the image
            f.write(pic)
            print(image_name, 'download complete!!!')

def main():
    n = input('Please enter the number of pages to crawl: ')
    url = 'http://www.netbian.com/meinv/'  # home page of the resources; change this path to crawl a different category
    if not os.path.exists('./beauty'):  # create the save directory if it does not exist
        os.mkdir('./beauty/')
    page = 1
    fillPic(url, page)  # crawl page 1
    if int(n) >= 2:  # crawl page 2 and later pages
        ls = list(range(2, 1 + int(n)))
        url = 'http://www.netbian.com/meinv/'
        for i in ls:  # crawl each requested page in turn
            page = str(i)
            url_page = 'http://www.netbian.com/meinv/'
            url_page += 'index_' + page + '.htm'  # build the link of page i
            fillPic(url_page, page)  # call fillPic() for this page

main()

① Part-time exchange, industry consultation, online professional answers

②Python development environment installation tutorial

③Python400 self-study video

④ Common vocabulary of software development

⑤Python learning roadmap

⑥ Over 3000 Python ebooks

Take whatever you need, and please bookmark this post.

That concludes this introduction to crawling HD black-silk beauty pictures from the site. Thanks for reading; I hope it helps anyone who wants to learn, and more content can be found on my homepage.