This article describes how to download web images in Python: downloading directly from image URLs, parsing HTML with re/BeautifulSoup to extract image URLs, and handling dynamically loaded pages.

Download individually/in batches via pic_url

Given image URLs such as http://xyz.com/series-(1,2,...,N).jpg, N images in total with a fixed link pattern, a simple loop is enough: each image can be written to disk in binary form via 'f.write(requests.get(url).content)'.

import os
import requests

def download(file_path, picture_url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE",
        }
    r = requests.get(picture_url, headers=headers)
    with open(file_path, 'wb') as f:
        f.write(r.content)

def main():
    os.makedirs('./pic/', exist_ok=True)  # Output directory

    prefix_url = 'http://xyz.com/series-'  # Image URL prefix shared by the category
    n = 6  # Total number of images in this category

    tmp = prefix_url.split('/')[-1]
    for i in range(1, n + 1):
        file_path = './pic/' + tmp + str(i) + '.jpg'
        picture_url = prefix_url + str(i) + '.jpg'
        download(file_path, picture_url)


if __name__ == '__main__':
    main()

Parse HTML with re to get pic_url, then download

In practice, image URLs are rarely numbered in order. In most cases the user only knows the page URL and needs to parse the HTML content of that page to extract the image URLs contained in the source code. Common approaches are regular-expression matching and parsing with the BeautifulSoup library.

Use requests.get(url).text to obtain the HTML source of the current page, then match URLs ending in .jpg with re.compile(r'[a-zA-Z]+://[^\s]*\.jpg'). Other sites may serve images ending in .png or .webp, or require a different pattern altogether; "A 30-minute introduction to regular expressions" is highly recommended. Finally, add the image URLs obtained in this step to a list and download them.

import os
import re
import requests

def get_html(url):
    headers = {
        "User-Agent": "Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        }
    html = requests.get(url, headers=headers).text

    return html

def parse_html(html_text):
    picre = re.compile(r'[a-zA-Z]+://[^\s]*\.jpg')  # This regex matches URLs ending in .jpg
    pic_list = re.findall(picre, html_text)

    return pic_list

def download(file_path, pic_url):
    headers = {
        "User-Agent": "Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        }
    r = requests.get(pic_url, headers=headers)
    with open(file_path, 'wb') as f:
        f.write(r.content)

def main():
    # Change the url when using
    url = 'http://xyz.com/series'
    html_text = get_html(url)
    pic_list = parse_html(html_text)

    os.makedirs('./pic/', exist_ok=True)  # Output directory
    for pic_url in pic_list:
        file_name = pic_url.split('/')[-1]
        file_path = './pic/' + file_name

        download(file_path, pic_url)


if __name__ == '__main__':
    main()

Get pic_url via bs4

The idea is the same as with regular matching, except that BeautifulSoup parses the HTML to build the list of image URLs, after which the images are downloaded in turn. Because every website's HTML structure is different, users need to adapt the parsing code accordingly. The following code downloads images from Douban.

import os
import time
import requests
from bs4 import BeautifulSoup

def get_html(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    html = requests.get(url, headers=headers).text

    return html

def parse_html(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    li = soup.find_all('div', attrs={'class':'cover'})

    pic_list = []
    for link in li:
        pic_url = link.find('img').get('src')
        pic_url = pic_url.replace('/m/', '/l/')  # Swap the medium-size path for the large-size one
        pic_list.append(pic_url)

    return pic_list

def download(file_path, pic_url):
    headers = {
        "User-Agent": "Mozilla / 5.0 (Windows NT 10.0; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        }
    r = requests.get(pic_url, headers=headers)
    with open(file_path, 'wb') as f:
        f.write(r.content)

def main():
    # Download Rimi Ishihara images from Douban; each page contains 30 images, and the start parameter in the URL increases by 30, as shown below
    pic_list = []
    for i in range(10):
        url = 'https://movie.douban.com/celebrity/1016930/photos/?type=C&start=' + str(30*i) + '&sortby=like&size=a&subtype=a'
        html_text = get_html(url)
        pic_list += parse_html(html_text)
        
    os.makedirs('./pic/', exist_ok=True)  # Output directory

    for i, pic_url in enumerate(pic_list):
        if i % 30 == 0:
            print('Downloading page %d' % (i // 30 + 1))
        file_name = pic_url.split('/')[-1].split('.')[0] + '.jpg'
        file_path = './pic/' + file_name

        download(file_path, pic_url)


if __name__ == '__main__':
    main()

When downloading the images, it turns out that the thumbnail URL of each image can be accessed and downloaded directly. However, due to Douban's anti-crawling strategy, direct access to the original image URL is rejected by the server. See the next section for solutions.

Possible problems

  • Website anti-crawler mechanism

    1. User-Agent: simulates browser access. Once added, the server treats the request as an ordinary browser request; it should generally be added to any web-related access.
    2. Referer: the server uses this to determine which page you came from. In the Douban example above, entering the original image URL directly is rejected, but clicking through the site step by step reaches the same address successfully, because step-by-step access carries the address of the preceding page, which can be found in the request headers via "F12". If you can't find it, try the root address "movie.douban.com/", or the previous address "…"; see the 'adv_bs4url.py' file in the GitHub repository. A minimal sketch is shown after this list.
    3. IP mask: Build an IP address pool.
    4. Cookie disguise: the server uses cookies to identify your current state, and they are updated with each request to the server.
  • Common regular matches

    • A 30-minute introduction to regular expressions is highly recommended
  • The page data is loaded asynchronously: for pages rendered by JS or data loaded via Ajax, the complete page source cannot be obtained from a plain request.

    • One solution is known as a dynamic crawler: use third-party tools such as Selenium or PhantomJS to simulate browser behavior and load the data. There are many introductory articles online; it is a bit of extra trouble, so I have not written my own and will do so when the need arises, since the other methods have been sufficient. A minimal sketch is included after this list.

    • Alternatively, analyze the page to find the interface that actually loads the data. The core of this is to trace the page's interactions and JS-triggered calls, identify the valuable core call (generally an HTTP request issued by JS), and then access that link directly from Python to get the data. Analyzing with "F12", for Huaban for example, yields a link of the form "huaban.com/search/?q=stone…", which can then be read with urllib.request.urlopen(url).read().

  • Other questions…
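
As a minimal sketch of the Referer idea from item 2 above, the download helper from the Douban example can be extended with a Referer header. The image URL and Referer value below are placeholder assumptions; copy the real Referer from the request headers shown in the browser's "F12" network panel.

import requests

def download_with_referer(file_path, pic_url, referer):
    # Same download helper as above, plus a Referer header
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        "Referer": referer,  # Placeholder: use the value observed via F12
    }
    r = requests.get(pic_url, headers=headers)
    with open(file_path, 'wb') as f:
        f.write(r.content)

# Hypothetical usage: the root address often works when the exact preceding page is hard to find
download_with_referer('./pic/example.jpg',
                      'https://example.com/photo/l/public/example.jpg',
                      referer='https://movie.douban.com/')

Cookies and proxy IPs (items 3 and 4) can be passed in the same call, via the cookies= and proxies= parameters of requests.get.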
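
For the dynamic-crawler route, the following is a minimal Selenium sketch, assuming the selenium package is installed, a matching ChromeDriver is on the PATH, and the URL is a placeholder. After the browser executes the page's JS, the rendered source can be fed to the same re or BeautifulSoup parsing used earlier.

import time
from selenium import webdriver

driver = webdriver.Chrome()  # Requires ChromeDriver on the PATH
driver.get('http://xyz.com/series')  # Placeholder URL
time.sleep(3)  # Crude wait for JS/Ajax content to finish loading

html_text = driver.page_source  # The fully rendered HTML
driver.quit()

pic_list = parse_html(html_text)  # Reuse parse_html() from the earlier sections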

pyautogui: simulating mouse clicks, the "foolproof" approach

This method is only suitable for repetitive work and is inefficient, but it runs no risk of being blocked by anti-crawler strategies. The idea is similar to "macros" in Word: tell the computer what the mouse does in one iteration and let it repeat automatically. The code is straightforward.

import pyautogui
import time

pyautogui.FAILSAFE = True

def get_mouse_position():
    time.sleep(3)  # Move the mouse to the initial position
    x1, y1 = pyautogui.position()
    print(x1, y1)
    pyautogui.click(x=x1, y=y1, button='right')  # Simulate a right mouse click to call up the context menu
    time.sleep(5)  # Move the mouse to the center of the "Save image as..." option
    x2, y2 = pyautogui.position()
    print(x2, y2)
    pyautogui.click(x=x2, y=y2)  # Click "Save image as..."
    time.sleep(10)  # A save-file dialog pops up; choose the save location and move the mouse to the center of the "Save (S)" button
    x3, y3 = pyautogui.position()
    pyautogui.click(x=x3, y=y3)
    print(x3, y3)


def click_download(N):
    for i in range(N):  # Number of images to download
        pyautogui.click(x=517, y=557, duration=0.25, button='right')  # Set x/y to x1/y1
        time.sleep(1)
        pyautogui.click(x=664, y=773, duration=0.25)  # Set x/y to x2/y2
        time.sleep(1)
        pyautogui.click(x=745, y=559, duration=0.25)  # Set x/y to x3/y3
        time.sleep(1)
        pyautogui.click(x=517, y=557, duration=0.25)  # Go to the next image
        time.sleep(2)  # Depends on network loading speed
    
 
 
if __name__ == "__main__":
    # get_mouse_position()  # Run this first to record the three click positions
    click_download(10)

All the code, detailed comments, and updates for this article can be found in the GitHub repository.