“This is the 20th day of my participation in the November Gwen Challenge. Check out the event details: The Last Gwen Challenge in 2021.”

Hanfu is really beautiful

Read this blog and you will get:

  • An improvement in your Python skills;
  • 40,000+ hanfu photos, or even more.

Hanfu photo collection technique

Target data source analysis

The target data to be captured this time is shown in the following figure. The target site is https://www.hanfuhui.com/, a vertical community centered on hanfu (traditional Han Chinese clothing).

This blog covers the following knowledge points:

  1. Reading JSON data with requests;
  2. Parsing JSON-formatted data;
  3. Storing data in a CSV file;
  4. Reading a file and saving pictures.

Data source analysis

  • The target data captured this time is transmitted asynchronously, that is, returned through a server interface. The interface structure can be found with the browser developer tools, as follows:

  • The data interface is https://api5.hanfugou.com/Trend/GetTrendListForHot?maxid=3396754&objecttype=album&page=3&count=20. The important parameters are page and count, i.e., the page number and the amount of data per page. Testing showed that the count value can be changed arbitrarily; when the value exceeds 100, the interface returns data slowly. In what follows, the value is set to 500.

  • The data format of the interface response is JSON, as shown in the figure below. The request status is in the Status field, and the core data of the interface is in the Data field.
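As a minimal sketch of what parsing such a response looks like, the stub below models the Status / Data / ImageSrcs fields described in this section; the sample values and the assumed success value of 1 are illustrative, not real interface output:

```python
import json

# Hypothetical response body modeled on the fields described above
raw = '''
{
  "Status": 1,
  "Data": [
    {"ImageSrcs": ["https://pic.hanfugou.com/a.jpg", "https://pic.hanfugou.com/b.jpg"]},
    {"ImageSrcs": ["https://pic.hanfugou.com/c.jpg"]}
  ]
}
'''

payload = json.loads(raw)      # requests' res.json() does this step for you
if payload["Status"] == 1:     # assumed success value
    img_srcs = []
    for item in payload["Data"]:
        img_srcs.extend(item["ImageSrcs"])
    print(img_srcs)
```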

  • To learn more about JSON data, you can look up relevant materials to expand your learning, or start with cases and master it gradually, for example by following the Python Crawler 120 column and learning through practice.

Sorting out the requirements

Based on the above analysis, the following requirements are sorted out:

  • Batch-generate the interface addresses used to extract image addresses;
  • To ensure efficiency, store the extracted image addresses in a CSV file in batches;
  • Read the image addresses from the CSV file and splice them into downloadable links;
  • Download the pictures.
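The first requirement, batch-generating interface addresses, boils down to filling in the page parameter; a small sketch, keeping maxid and count fixed as in the analysis above:

```python
# Template for the interface address; only the page number varies
BASE = ("https://api5.hanfugou.com/Trend/GetTrendListForHot"
        "?maxid=3396754&objecttype=album&page={page}&count=500")

# Generate the addresses for pages 1 through 9
urls = [BASE.format(page=i) for i in range(1, 10)]
print(urls[0])
```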

Case coding

First, use requests for basic data fetching. This step is relatively simple, so just the code is shown.

import requests
import time
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    # Collect other user agents yourself
]


def run(url):
    try:
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        res = requests.get(url=url, headers=headers)
        # Get JSON data directly via res.json()
        json_text = res.json()
        # The interface data is not validated here. If necessary,
        # verify that the request succeeded via the Status attribute.
        data = json_text["Data"]
        img_srcs = []
        for item in data:
            img_srcs.extend(item["ImageSrcs"])
        long_str = "\n".join(img_srcs)
        # Save data
        save(long_str)

    except Exception as e:
        print(e)


def save(long_str):
    try:
        with open("./imgs.csv", "a+") as f:
            f.write("\n" + long_str)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    for i in range(1, 10):
        print(f"Crawling batch {i} of data")
        run(f"https://api5.hanfugou.com/Trend/GetTrendListForHot?maxid=3396754&objecttype=album&page={i}&count=500")

    print("All done.")

The above code lacks interface status validation logic; a comment marks the spot, and it can be extended if necessary.
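If you want to add that missing validation, one hedged sketch follows; it assumes a successful request reports Status == 1, so check the actual interface response for the real success value:

```python
def extract_img_srcs(json_text):
    # Bail out early when the interface reports a failure
    if json_text.get("Status") != 1:  # assumed success value
        return []
    img_srcs = []
    for item in json_text.get("Data", []):
        img_srcs.extend(item.get("ImageSrcs", []))
    return img_srcs

# Quick check with stubbed responses
print(extract_img_srcs({"Status": 0}))
print(extract_img_srcs({"Status": 1, "Data": [{"ImageSrcs": ["a.jpg"]}]}))
```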

The data saved in the first step amounts to 40,000+ addresses, which meets the requirements for subsequent use. The next step is to write the logic for fetching the pictures.

When the image links obtained from the CSV file are requested directly, what comes back is an image named hanfuhui-pi-404, i.e., the request fails.

In subsequent re-testing it was found that some image links were accessible while others returned the 404 image; the behavior differs, but no special handling is done for this here.

By inspecting the image addresses on the detail pages, the following rule was found.

When the image address is https://pic.hanfugou.com/android/2020/2/30/3b2c6bc54cfa4656a81b4f9b4167e2c3.jpg, the download fails. Appending a size suffix to it, i.e., https://pic.hanfugou.com/android/2020/2/30/3b2c6bc54cfa4656a81b4f9b4167e2c3.jpg_700x.jpg, retrieves the picture correctly, only at reduced resolution.
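The splicing rule above is a plain string append; a small helper (our own name, not from the original code) makes it explicit:

```python
def to_700x(img_url: str) -> str:
    """Append the size suffix that makes the image downloadable."""
    return f"{img_url}_700x.jpg"

src = "https://pic.hanfugou.com/android/2020/2/30/3b2c6bc54cfa4656a81b4f9b4167e2c3.jpg"
print(to_700x(src))
```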

The image grab code is as follows.

To run the following code, you need to create a hanfu directory next to the code file in advance to store the images.
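If you would rather not create the directory by hand, the script can do it itself; this is a small convenience, not part of the original code:

```python
import os

# exist_ok=True makes this safe to run repeatedly
os.makedirs("./hanfu", exist_ok=True)
```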

Since the image addresses use the HTTPS protocol, we add the verify parameter to the requests.get call and set it to False; this parameter means the site's CA certificate is not verified when requesting data.

def save_img(img_src):
    try:
        url = img_src
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        # Note that verify is set to False: the website certificate is not validated
        res = requests.get(url=url, headers=headers, verify=False)
        data = res.content

        with open(f"./hanfu/{int(time.time())}.jpg", "wb+") as f:
            f.write(data)
    except Exception as e:
        print(e)


if __name__ == '__main__':

    with open("./imgs.csv", "r") as f:
        while True:
            img_url = f.readline().strip()
            # readline() returns an empty string at end of file
            if not img_url:
                break
            real_url = f"{img_url}_700x.jpg"
            save_img(real_url)
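One caveat in the code above: naming files with int(time.time()) means two images saved within the same second overwrite each other. A sketch that derives the name from the URL itself avoids that; the sample URL is from this article, and the helper name is our own:

```python
from urllib.parse import urlparse
import os

def filename_from_url(img_url: str) -> str:
    """Use the unique hash in the image path as the local file name."""
    path = urlparse(img_url).path  # e.g. /android/2020/2/30/<hash>.jpg_700x.jpg
    return os.path.basename(path)

url = "https://pic.hanfugou.com/android/2020/2/30/3b2c6bc54cfa4656a81b4f9b4167e2c3.jpg_700x.jpg"
print(filename_from_url(url))
```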

The code is finished; the rest is just waiting for the program to fetch the pictures for us. Time to go out for a cup of tea.