This post is about crawling a page with requests + BeautifulSoup + urllib and saving the images locally. The next post will be about crawling the site's detail pages and saving their images locally.

Crawling data from the web is actually quite simple, as long as you master the basic logic:

  1. Find the target website;
  2. Analyze the HTML nodes that contain the data you need;
  3. Download the data locally or store it in a database (a minimal sketch of these steps follows below).
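Here is a minimal sketch of those three steps; the URL and the CSS selector are hypothetical, purely for illustration:

import requests
from bs4 import BeautifulSoup

# 1. Find the website (hypothetical URL)
resp = requests.get("http://example.com/list.html")
resp.encoding = "utf-8"

# 2. Analyze the HTML nodes that hold the data (hypothetical selector)
soup = BeautifulSoup(resp.text, "lxml")
titles = [node.get_text(strip=True) for node in soup.select(".item-title")]

# 3. Download the data locally
with open("titles.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))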

Well, without further ado, let’s get started!


Preparation

  • Development environment: Windows, PyCharm, requests, BeautifulSoup, urllib
  • Some basic knowledge of Python crawlers and HTML is required

Getting started

  • The site to crawl this time is Shuaia.com. We will download the pictures of all the items on the site's first page to the local disk

  • Writing the crawler

    1. Because the encoding of the fetched HTML is detected incorrectly, explicitly set UTF-8;
    2. Get the image tag of each item on the page;
    3. Loop over the tags and read each one's image link (src) and image name (alt);
    4. Download each image locally.
from bs4 import BeautifulSoup
import requests
import os
import urllib.request
import time

headers = {
    "Cookie": "UM_distinctid=16685e0279d3e0-06f34603dfa898-36664c08-1fa400-16685e0279e133; bdshare_firstime=1539844405694; gsScrollPos-1702681410=; CNZZDATA1254092508=1744643453-1539842703-%7C1539929860; _d_id=0ba0365838c8f6569af46a1e638d05"."User-Agent": "Mozilla / 5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
}
path = "D://images/"

def get_links(url):
    wb_data = requests.get(url, headers=headers)  # add headers so the site's anti-crawling mechanism does not flag us as a bot
    wb_data.encoding = "utf-8"
    soup = BeautifulSoup(wb_data.text, 'lxml')

    links = soup.select(".item-img img")
    if not os.path.exists(path):  # check if this folder exists, create it if it does not
        os.mkdir(path)
    for link in links:
        time.sleep(1)  # pause for one second between downloads so the anti-crawling mechanism is not triggered
        img = link.get("src")
        img_name = link.get("alt")
        urllib.request.urlretrieve(img, path + img_name + ".jpg")
        print("-------- downloading ---------")

    print("------ download done -------")

if __name__ == "__main__":
    get_links("http://www.shuaia.net/index.html")
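One caveat: urllib.request.urlretrieve does not send the custom headers defined above, so the image requests go out with Python's default User-Agent and may be blocked. A hedged alternative sketch (the download_image helper is hypothetical, not part of the original post) that reuses requests for the download step:

import requests

def download_image(img_url, file_path, headers):
    # fetch the image bytes with the same headers as the page request,
    # which urllib.request.urlretrieve would not send
    resp = requests.get(img_url, headers=headers, timeout=10)
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page
    with open(file_path, "wb") as f:
        f.write(resp.content)

# inside the loop, instead of urllib.request.urlretrieve:
# download_image(img, path + img_name + ".jpg", headers)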
  • Start crawling