This post is about crawling a page with requests + BeautifulSoup + urllib and saving the images locally. The next post will be about crawling the site's detail pages and saving their images locally.

Crawling data from the web is actually quite simple, as long as you master the basic logic:

  1. Find the target website;
  2. Analyze the HTML nodes that contain the data you need;
  3. Download the data locally or store it in a database (a minimal sketch of these steps follows below).
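Here is a minimal sketch of those three steps; the URL and the CSS selector are hypothetical, purely for illustration:

import requests
from bs4 import BeautifulSoup

# 1. Find the website (hypothetical URL)
resp = requests.get("http://example.com/list.html")
resp.encoding = "utf-8"

# 2. Analyze the HTML nodes that hold the data (hypothetical selector)
soup = BeautifulSoup(resp.text, "lxml")
titles = [node.get_text(strip=True) for node in soup.select(".item-title")]

# 3. Download the data locally
with open("titles.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))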

Well, without further ado, let’s get started!


Preparation

  • Development environment: Windows, PyCharm, requests, BeautifulSoup, urllib
  • Some basic knowledge of Python crawlers and HTML is required

Getting started

  • The site to crawl this time is Shuaia.com. We will download the pictures of all the items on the site's first page to the local disk

  • Writing the crawler

    1. Because the encoding of the fetched HTML is detected incorrectly, explicitly set UTF-8;
    2. Get the image tag of each item on the page;
    3. Loop over the tags and read each one's image link (src) and image name (alt);
    4. Download each image locally.
from bs4 import BeautifulSoup
import requests
import os
import urllib.request
import time

headers = {
    "Cookie": "UM_distinctid=16685e0279d3e0-06f34603dfa898-36664c08-1fa400-16685e0279e133; bdshare_firstime=1539844405694; gsScrollPos-1702681410=; CNZZDATA1254092508=1744643453-1539842703-%7C1539929860; _d_id=0ba0365838c8f6569af46a1e638d05"."User-Agent": "Mozilla / 5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
}
path = "D://images/"

def get_links(url):
    wb_data = requests.get(url, headers=headers)  # add headers so the site's anti-crawling mechanism does not flag us as a bot
    wb_data.encoding = "utf-8"
    soup = BeautifulSoup(wb_data.text, 'lxml')

    links = soup.select(".item-img img")
    if not os.path.exists(path):  # check if this folder exists, create it if it does not
        os.mkdir(path)
    for link in links:
        time.sleep(1)  # pause for one second between downloads so the anti-crawling mechanism is not triggered
        img = link.get("src")
        img_name = link.get("alt")
        urllib.request.urlretrieve(img, path + img_name + ".jpg")
        print("-------- downloading ---------")

    print("------ download done -------")

if __name__ == "__main__":
    get_links("http://www.shuaia.net/index.html")
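One caveat: urllib.request.urlretrieve does not send the custom headers defined above, so the image requests go out with Python's default User-Agent and may be blocked. A hedged alternative sketch (the download_image helper is hypothetical, not part of the original post) that reuses requests for the download step:

import requests

def download_image(img_url, file_path, headers):
    # fetch the image bytes with the same headers as the page request,
    # which urllib.request.urlretrieve would not send
    resp = requests.get(img_url, headers=headers, timeout=10)
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page
    with open(file_path, "wb") as f:
        f.write(resp.content)

# inside the loop, instead of urllib.request.urlretrieve:
# download_image(img, path + img_name + ".jpg", headers)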
  • Start crawling