This article describes how to download web images with Python: downloading directly from image URLs, parsing HTML with re/BeautifulSoup to extract image URLs, and handling dynamically loaded pages.
Download individually or in batches via pic_url
Given image URLs of the form http://xyz.com/series-i.jpg (i = 1, 2, ..., N), N images in total, the link format is fixed, so after a simple loop each image can be written in binary form directly via f.write(requests.get(url).content).
import os
import requests

def download(file_path, picture_url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE",
    }
    r = requests.get(picture_url, headers=headers)
    with open(file_path, 'wb') as f:
        f.write(r.content)

def main():
    os.makedirs('./pic/', exist_ok=True)  # output directory
    prefix_url = 'http://xyz.com/series-'  # image URL prefix shared by the whole series
    n = 6  # total number of images in this series
    tmp = prefix_url.split('/')[-1]
    for i in range(1, n + 1):
        file_path = './pic/' + tmp + str(i) + '.jpg'
        picture_url = prefix_url + str(i) + '.jpg'
        download(file_path, picture_url)

if __name__ == '__main__':
    main()
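One caveat: requests.get will happily write an HTML error page to disk when the server answers 4xx/5xx. Below is a minimal hardened sketch of download(), assuming the same layout as above; raise_for_status() and chunked streaming are standard requests idioms, not anything specific to this site.

import requests

def download_safe(file_path, picture_url, timeout=10):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }
    r = requests.get(picture_url, headers=headers, timeout=timeout, stream=True)
    r.raise_for_status()  # abort on 4xx/5xx instead of writing an error page to disk
    with open(file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):  # stream large files in chunks
            f.write(chunk)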
Parse HTML with re to get pic_url, then download
In practice, image URLs are rarely numbered in sequence. In most cases only the page URL is known, and the HTML content of that page must be parsed to extract the image URLs embedded in its source. The common approaches are regular-expression matching and parsing with the BeautifulSoup library.
requests.get(url).text fetches the HTML source of the current page. re.compile(r'[a-zA-Z]+://[^\s]*\.jpg') yields the URLs ending in .jpg, but other sites may end in .png or .webp, or require a different regex altogether; a 30-minute introduction to regular expressions is highly recommended. The image URLs obtained this way are appended to a list for downloading. A broader pattern covering several formats is sketched below, followed by the full script.
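As a hedged sketch of such a broader pattern (the extension list here is illustrative, not exhaustive):

import re

# Matches http/https image links ending in common extensions.
# Extend the alternation for whatever the target site serves.
pic_re = re.compile(r'https?://[^\s"\'<>]+\.(?:jpg|jpeg|png|webp|gif)', re.IGNORECASE)

html = '<img src="http://xyz.com/a.PNG"> <img src="https://xyz.com/b.jpg">'
print(pic_re.findall(html))  # ['http://xyz.com/a.PNG', 'https://xyz.com/b.jpg']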
import os
import re
import requests

def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    }
    html = requests.get(url, headers=headers).text
    return html

def parse_html(html_text):
    picre = re.compile(r'[a-zA-Z]+://[^\s]*\.jpg')  # this regex yields the URLs ending in .jpg
    pic_list = re.findall(picre, html_text)
    return pic_list

def download(file_path, pic_url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    }
    r = requests.get(pic_url, headers=headers)
    with open(file_path, 'wb') as f:
        f.write(r.content)

def main():
    # change the url when using
    url = 'http://xyz.com/series'
    html_text = get_html(url)
    pic_list = parse_html(html_text)
    os.makedirs('./pic/', exist_ok=True)  # output directory
    for pic_url in pic_list:
        file_name = pic_url.split('/')[-1]
        file_path = './pic/' + file_name
        download(file_path, pic_url)

if __name__ == '__main__':
    main()
Get pic_url via bs4
The idea is the same as with regex matching, except that BeautifulSoup parses the HTML to obtain the list of image URLs before downloading them in turn. Because the HTML structure of each website differs, users need to adapt the parsing accordingly. A generic starting point is sketched below, followed by the Douban-specific download code.
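As a site-agnostic first pass (a minimal sketch; the selector almost always needs narrowing for a real site), collecting the src of every <img> tag is often enough to get started:

from bs4 import BeautifulSoup

def parse_all_imgs(html_text):
    # Generic first pass: grab the src of every <img> tag.
    # Real sites usually need a narrower selector (class, container div, etc.).
    soup = BeautifulSoup(html_text, 'html.parser')
    return [img.get('src') for img in soup.find_all('img') if img.get('src')]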
import os
import time
import requests
from bs4 import BeautifulSoup

def get_html(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    html = requests.get(url, headers=headers).text
    return html

def parse_html(html_text):
    soup = BeautifulSoup(html_text, 'html.parser')
    li = soup.find_all('div', attrs={'class': 'cover'})
    pic_list = []
    for link in li:
        pic_url = link.find('img').get('src')
        pic_url = pic_url.replace('/m/', '/l/')  # swap the thumbnail path for the large version
        pic_list.append(pic_url)
    return pic_list

def download(file_path, pic_url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    }
    r = requests.get(pic_url, headers=headers)
    with open(file_path, 'wb') as f:
        f.write(r.content)

def main():
    # Download Satomi Ishihara's photos from Douban: each page holds 30 images,
    # and the start parameter in the URL increases by 30 per page.
    pic_list = []
    for i in range(10):
        url = 'https://movie.douban.com/celebrity/1016930/photos/?type=C&start=' + str(30 * i) + '&sortby=like&size=a&subtype=a'
        html_text = get_html(url)
        pic_list += parse_html(html_text)
    os.makedirs('./pic/', exist_ok=True)  # output directory
    for i, pic_url in enumerate(pic_list):
        if i % 30 == 0:
            print('Downloading page %d' % (i // 30 + 1))
        file_name = pic_url.split('/')[-1].split('.')[0] + '.jpg'
        file_path = './pic/' + file_name
        download(file_path, pic_url)

if __name__ == '__main__':
    main()
When downloading, it turns out that the thumbnail URL of each image can be accessed directly, but because of Douban's anti-crawling strategy, direct access to the original image URL is rejected by the server. See the next section for solutions.
Possible problems
- Website anti-crawler mechanisms
  - User-Agent: simulates browser access. Once added, the server treats the request as a normal browser request; it is generally added to any web-related request.
  - Referer: the header a browser uses to tell the server which page you arrived from. In the Douban example above, entering the original-image URL directly is rejected, yet clicking through the site step by step reaches the same content at the same address, because the step-by-step visit carries a prior jump-off address. It can be read from the request headers via "F12"; if it cannot be found, try the root address "movie.douban.com/", or the previous address "... (see the 'adv_bs4url.py' file in the GitHub repository). A sketch combining these headers follows this list.
  - IP blocking: build a pool of proxy IP addresses.
  - Cookie disguise: the server uses cookies to identify your current state, and they are updated with every request to the server.
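A minimal sketch combining the disguises above. Every value here is a placeholder to be replaced with what "F12" (Network tab) shows for the actual site; nothing below is confirmed to unlock any particular server:

import requests

# All values are placeholders: capture real ones from your own browser session.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Referer': 'https://movie.douban.com/',  # assumed jump-off page
}
cookies = {'bid': 'placeholder'}             # session cookie copied from the browser
proxies = {'http': 'http://127.0.0.1:8080'}  # one entry of a proxy IP pool

pic_url = 'https://example.com/some-original.jpg'  # hypothetical target
r = requests.get(pic_url, headers=headers, cookies=cookies)  # pass proxies=proxies to rotate IPs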
- Common regular matches
  - A 30-minute introduction to regular expressions is highly recommended.
- The page's data is loaded asynchronously: for pages rendered by JS, or data loaded via Ajax, the complete page source cannot be obtained with a plain GET.
One solution is known as dynamic crawling: use a third-party tool to simulate browser behavior and load the data, such as Selenium or PhantomJS. There are many introductions online; it is a little involved, so I have not written one myself, and the other methods have been sufficient so far. A minimal sketch is shown below.
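A minimal Selenium sketch, assuming Selenium 4 and a Chrome driver on PATH (PhantomJS is dated, so headless Chrome stands in here); the URL is a placeholder carried over from the earlier examples:

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')   # run without opening a browser window
driver = webdriver.Chrome(options=options)

driver.get('http://xyz.com/series')  # placeholder URL
driver.implicitly_wait(10)           # give JS-rendered content time to appear

# After rendering, the <img> tags exist in the DOM even if they were
# absent from the raw HTML source.
pic_list = [img.get_attribute('src')
            for img in driver.find_elements(By.TAG_NAME, 'img')]
driver.quit()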
In addition, you can analyze the page to find the interface that actually loads the data. The core is to trace the page's interactions and the JS calls they trigger, pick out the valuable, meaningful core call (generally an HTTP request issued by JS), and then access that reverse-engineered link directly from Python to get the data. Analyze it via "F12"; for huaban.com, for example, the link obtained has the form "huaban.com/search/?q= stone...", and the response can be read with urllib.request.urlopen(url).read(). A hedged sketch of that last step follows.
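A minimal sketch of reading such a reverse-engineered link; the endpoint and parameters are hypothetical stand-ins for whatever the Network tab reveals:

import json
import urllib.request

# Hypothetical endpoint discovered in the Network tab; real sites differ
# in path, parameters, and response format.
api_url = 'https://example.com/api/search?q=keyword&page=1'
req = urllib.request.Request(api_url, headers={'User-Agent': 'Mozilla/5.0'})
raw = urllib.request.urlopen(req).read()

data = json.loads(raw)  # many such endpoints return JSON
print(data)             # inspect the structure, then pull out the image URLs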
- Other questions…
pyautogui: a "foolproof" flow of simulated mouse clicks
This method suits only repetitive work and is inefficient, but it runs no risk of being blocked by an anti-crawler strategy. The idea resembles "macros" in Word: record what the mouse does once, then let the computer repeat it in a loop. The code is straightforward.
import pyautogui
import time

pyautogui.FAILSAFE = True  # slam the mouse into a screen corner to abort

def get_mouse_positon():
    time.sleep(3)  # move the mouse onto the image in the meantime
    x1, y1 = pyautogui.position()
    print(x1, y1)
    pyautogui.click(x=x1, y=y1, button='right')  # simulate a right click to open the context menu
    time.sleep(5)  # move the mouse over the center of the "Save image as..." option
    x2, y2 = pyautogui.position()
    print(x2, y2)
    pyautogui.click(x=x2, y=y2)  # click "Save image as..."
    time.sleep(10)  # in the save dialog, pick the location, then hover over the "Save" button
    x3, y3 = pyautogui.position()
    pyautogui.click(x=x3, y=y3)
    print(x3, y3)

def click_download(N):
    for i in range(N):  # number of images to download
        pyautogui.click(x=517, y=557, duration=0.25, button='right')  # set x/y to the recorded x1/y1
        time.sleep(1)
        pyautogui.click(x=664, y=773, duration=0.25)  # x/y is the recorded x2/y2
        time.sleep(1)
        pyautogui.click(x=745, y=559, duration=0.25)  # x/y is the recorded x3/y3
        time.sleep(1)
        pyautogui.click(x=517, y=557, duration=0.25)  # advance to the next image
        time.sleep(2)  # adjust to the network loading speed

if __name__ == "__main__":
    # get_mouse_positon()  # run once first to record the three click positions
    click_download(10)
All the code, detailed comments, and updates for this article are available in the GitHub repository.