Python captures 500 “beautiful” wallpaper images at once.
1. Crawl a page of pictures
Extracting image data with regular expressions
The screenshot of the source code is as follows:
The site is GBK-encoded, so set the response encoding to GBK to fix garbled text.
Code implementation:
```python
import requests
import re

# Save path
path = 'd:\\test\\picture_1\\'
# Target URL
url = "http://pic.netbian.com/4kmeinv/index.html"
# Disguise the request headers to avoid being blocked
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Referer": "http://pic.netbian.com/4kmeinv/index.html"
}
# Send the request and get the response
response = requests.get(url, headers=headers)
# The site is GBK-encoded; reset the encoding to avoid garbled text
response.encoding = 'GBK'
# Regex match: extract each image's src and alt (name)
img_info = re.findall('img src="(.*?)" alt="(.*?)" /', response.text)
for src, name in img_info:
    # Build the full image URL from the relative src
    img_url = 'http://pic.netbian.com' + src
    img_content = requests.get(img_url, headers=headers).content
    img_name = name + '.jpg'
    with open(path + img_name, 'wb') as f:
        print(f"Downloading: {img_name}")
        f.write(img_content)
```
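Before running the regex against the live site, it can be sanity-checked on a static HTML fragment. This is a minimal sketch; the image paths and alt texts here are made up for illustration:

```python
import re

# A made-up fragment mimicking the page's <img> markup (hypothetical paths)
sample = ('<img src="/uploads/a.jpg" alt="girl one" />'
          '<img src="/uploads/b.jpg" alt="girl two" />')

# Same pattern as the crawler; findall returns (src, alt) tuples
img_info = re.findall('img src="(.*?)" alt="(.*?)" /', sample)
print(img_info)  # [('/uploads/a.jpg', 'girl one'), ('/uploads/b.jpg', 'girl two')]
```

Note that the non-greedy `(.*?)` groups stop at the first closing quote, which is what keeps each match confined to a single `<img>` tag.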
Extracting image data with XPath
Code implementation:
```python
import requests
from lxml import etree

# Save path
path = 'd:\\test\\picture_1\\'
# Target URL
url = "http://pic.netbian.com/4kmeinv/index.html"
# Disguise the request headers to avoid being blocked
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Referer": "http://pic.netbian.com/4kmeinv/index.html"
}
# Send the request and get the response
response = requests.get(url, headers=headers)
response.encoding = 'GBK'
html = etree.HTML(response.text)
# XPath extracts the image links and names
img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
# List comprehension builds the actual image URLs
img_src = ['http://pic.netbian.com' + x for x in img_src]
img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')
for src, name in zip(img_src, img_alt):
    img_content = requests.get(src, headers=headers).content
    img_name = name + '.jpg'
    with open(path + img_name, 'wb') as f:
        print(f"Downloading: {img_name}")
        f.write(img_content)
```
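String concatenation works here because every `src` returned by the XPath starts with `/`. A more robust alternative, sketched below with hypothetical paths, is `urllib.parse.urljoin`, which resolves relative paths against the page URL correctly in all cases:

```python
from urllib.parse import urljoin

base = 'http://pic.netbian.com/4kmeinv/index.html'
# Relative paths as the @src XPath would return them (hypothetical values)
srcs = ['/uploads/allimg/a.jpg', '/uploads/allimg/b.jpg']

# urljoin resolves each path against the base URL
full = [urljoin(base, s) for s in srcs]
print(full[0])  # http://pic.netbian.com/uploads/allimg/a.jpg
```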
2. Multi-page crawling for batch downloads
Single-threaded version
```python
import requests
from lxml import etree
import datetime
import time

path = 'd:\\test\\picture_1\\'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Referer": "http://pic.netbian.com/4kmeinv/index.html"
}
start = datetime.datetime.now()

def get_img(urls):
    for url in urls:
        response = requests.get(url, headers=headers)
        response.encoding = 'GBK'
        html = etree.HTML(response.text)
        # XPath extracts the image links and names
        img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
        # List comprehension builds the actual image URLs
        img_src = ['http://pic.netbian.com' + x for x in img_src]
        img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')
        for src, name in zip(img_src, img_alt):
            img_content = requests.get(src, headers=headers).content
            img_name = name + '.jpg'
            with open(path + img_name, 'wb') as f:
                print(f"Downloading: {img_name}")
                f.write(img_content)
        # Pause between pages to be polite to the server
        time.sleep(1)

def main():
    # Build the list of page URLs to request
    url_list = ['http://pic.netbian.com/4kmeinv/index.html'] + \
               [f'http://pic.netbian.com/4kmeinv/index_{i}.html' for i in range(2, 11)]
    get_img(url_list)
    delta = (datetime.datetime.now() - start).total_seconds()
    print(f"Took: {delta}s")

if __name__ == '__main__':
    main()
```
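The pagination scheme (page 1 has no suffix, later pages use `index_{i}.html`) can be verified in isolation:

```python
# First page has no numeric suffix; pages 2-10 follow the index_{i}.html pattern
url_list = ['http://pic.netbian.com/4kmeinv/index.html'] + \
           [f'http://pic.netbian.com/4kmeinv/index_{i}.html' for i in range(2, 11)]

print(len(url_list))  # 10
print(url_list[1])    # http://pic.netbian.com/4kmeinv/index_2.html
```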
The program ran successfully, capturing 210 images across 10 pages in 63.68s.
Multi-threaded version
```python
import requests
from lxml import etree
import datetime
import time
import random
from concurrent.futures import ThreadPoolExecutor

path = 'd:\\test\\picture_1\\'
# Pool of User-Agent strings to rotate through on each request
user_agent = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
]
start = datetime.datetime.now()

def get_img(url):
    # Each request picks a random User-Agent from the pool
    headers = {
        "User-Agent": random.choice(user_agent),
        "Referer": "http://pic.netbian.com/4kmeinv/index.html"
    }
    # Send the request and get the response
    response = requests.get(url, headers=headers)
    response.encoding = 'GBK'
    html = etree.HTML(response.text)
    # XPath extracts the image links and names
    img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
    # List comprehension builds the actual image URLs
    img_src = ['http://pic.netbian.com' + x for x in img_src]
    img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')
    for src, name in zip(img_src, img_alt):
        img_content = requests.get(src, headers=headers).content
        img_name = name + '.jpg'
        with open(path + img_name, 'wb') as f:
            print(f"Downloading: {img_name}")
            f.write(img_content)
    # Random pause between pages to be polite to the server
    time.sleep(random.randint(1, 2))

def main():
    # Build the list of page URLs to request
    url_list = ['http://pic.netbian.com/4kmeinv/index.html'] + \
               [f'http://pic.netbian.com/4kmeinv/index_{i}.html' for i in range(2, 51)]
    # Six worker threads fetch pages concurrently
    with ThreadPoolExecutor(max_workers=6) as executor:
        executor.map(get_img, url_list)
    delta = (datetime.datetime.now() - start).total_seconds()
    print(f"Took: {delta}s")

if __name__ == '__main__':
    main()
```
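The speedup comes from overlapping the network waits across threads. The effect can be seen with dummy tasks that just sleep, a sketch independent of the crawler itself:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    time.sleep(0.2)  # stand-in for network latency
    return url

urls = [f'page_{i}' for i in range(6)]

start = time.perf_counter()
# Six workers run all six "fetches" at the same time
with ThreadPoolExecutor(max_workers=6) as executor:
    results = list(executor.map(fake_fetch, urls))
elapsed = time.perf_counter() - start

print(len(results))       # 6
print(elapsed < 6 * 0.2)  # True: far below the 1.2s a serial loop would need
```

With six workers the six 0.2s waits overlap, so the whole batch finishes in roughly 0.2s instead of 1.2s.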
The program ran successfully, capturing 1047 images across 50 pages in 56.72s. Enabling multithreading greatly improves the efficiency of the crawl.
The final results are as follows:
Writing original content is not easy. If you found this article useful, please like, comment, and share it so more friends can see it; that is the strongest motivation for me to keep producing quality articles. Thank you!