20 lines of code to become an image-hoarding geek
Use Python to crawl 100 GB of Coser images
Objective of this blog
Crawl target
- Target data source: www.cosplay8.com/pic/chinaco… , yet another Cos website. Sites like this can vanish from the Internet at any time, so to preserve the data, we archive it to disk.
Python modules used
- requests, re, os
Key learning content
- Today’s focus is crawling detail pages, a skill not covered in the previous blogs; keep an eye on it while working through the code.
List page and detail page analysis
Through the developer tools, you can easily identify the tags that contain the target data.
Click any picture to enter its detail page; there the target pictures are displayed one per page, i.e., one picture per page.
<a href="javascript:dPlayNext();" id="infoss">
<img
src="/uploads/allimg/210601/112879-210601143204.jpg"
id="bigimg"
width="800"
alt=""
border="0"
/></a>
Copy the code
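From this markup, the picture address can be extracted with a pattern keyed on id="bigimg", the same pattern the crawler uses later. A minimal check, assuming the attributes sit on one line in the served HTML:

```python
import re

# One-line version of the <img> tag shown above (assumed served form)
html = '<img src="/uploads/allimg/210601/112879-210601143204.jpg" id="bigimg" width="800" alt="" border="0" />'
first_img_pattern = re.compile('<img src="(.*?)" id="bigimg"')
print(first_img_pattern.search(html).group(1))
# /uploads/allimg/210601/112879-210601143204.jpg
```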
The URL generation rules for the list pages and detail pages are as follows:
List pages
- www.cosplay8.com/pic/chinaco…
- www.cosplay8.com/pic/chinaco…
- www.cosplay8.com/pic/chinaco…
Detail pages
- www.cosplay8.com/pic/chinaco…
- www.cosplay8.com/pic/chinaco…
- www.cosplay8.com/pic/chinaco…
Note that the first detail page does not carry a page number in its URL, so while crawling it for the total page count, you also need to save the picture on that first page.
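Concretely, page n (for n ≥ 2) is derived by inserting _n in front of the .html suffix of the first page's URL. A minimal sketch of the rule, using a made-up detail-page ID:

```python
# Hypothetical first detail page; the numeric ID is an assumption
first_page = "http://www.cosplay8.com/pic/chinacos/12345.html"
# Pages 2 onward share the stem, with an _n suffix before .html
stem = first_page[:first_page.rindex(".")]
print([f"{stem}_{i}.html" for i in range(2, 5)])
# ['.../12345_2.html', '.../12345_3.html', '.../12345_4.html']
```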
Coding time
The target website groups its pictures into categories: domestic COS, overseas COS, Hanfu circle, and Lolita. The crawler therefore takes the category as dynamic input, i.e., the user defines the crawl target at run time.
```python
import os
import re

import requests

# A basic request header; the UA string helps avoid naive bot filtering
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}


def run(category, start, end):
    # Generate the list pages to crawl
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
        for i in range(int(start), int(end) + 1)
    ]
    print(wait_url)
    url_list = []
    for item in wait_url:
        # The get_list function is defined below
        ret = get_list(item)
        print(f"Captured: {len(ret)} items")
        url_list.extend(ret)


if __name__ == "__main__":
    # e.g. http://www.cosplay8.com/pic/chinacos/list_22_2.html
    category = input("Please enter the category number: ")
    start = input("Please enter the start page: ")
    end = input("Please enter the end page: ")
    run(category, start, end)
```
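For example, entering category 22 with start page 1 and end page 2 makes print(wait_url) emit a list like this (the URLs follow directly from the f-string above):

```text
['http://www.cosplay8.com/pic/chinacos/list_22_1.html',
 'http://www.cosplay8.com/pic/chinacos/list_22_2.html']
```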
The code above first generates the target URLs from the user's input, then passes each one to the get_list function, which looks like this:
```python
def get_list(url):
    """Get all detail-page links from one list page."""
    all_list = []
    res = requests.get(url, headers=headers)
    html = res.text
    # Match the relative detail-page links inside the list items;
    # adjust this pattern if the actual list markup differs
    pattern = re.compile('<li><a href="(.*?)"')
    all_list = pattern.findall(html)
    return all_list
```
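As a quick sanity check, run the pattern against a hand-written fragment of list markup (the fragment is an assumption about the page structure, not copied from the site):

```python
import re

# Hypothetical list-page fragment for testing the extraction pattern
sample = '<li><a href="/pic/chinacos/12345.html" title="Example Coser">...</a></li>'
pattern = re.compile('<li><a href="(.*?)"')
print(pattern.findall(sample))  # ['/pic/chinacos/12345.html']
```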
With the link extraction in place, extend the run function to request each detail page, pull out the picture material, and save the captured images:
```python
def run(category, start, end):
    # Generate the list pages to crawl
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
        for i in range(int(start), int(end) + 1)
    ]
    print(wait_url)
    url_list = []
    for item in wait_url:
        ret = get_list(item)
        print(f"Captured: {len(ret)} items")
        url_list.extend(ret)
    print(url_list)
    # print(len(url_list))
    for url in url_list:
        get_detail(f"http://www.cosplay8.com{url}")
```
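The hrefs captured by get_list are relative paths, so the loop above glues them onto the site root with an f-string. The standard library's urllib.parse.urljoin is an equivalent, slightly more forgiving alternative, since it also copes with hrefs that are already absolute; a minimal sketch:

```python
from urllib.parse import urljoin

base = "http://www.cosplay8.com"
# Relative paths are joined onto the site root (hypothetical path)
print(urljoin(base, "/pic/chinacos/12345.html"))
# Already-absolute links pass through unchanged
print(urljoin(base, "http://www.cosplay8.com/pic/chinacos/12345.html"))
```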
Either way, each detail page now has a complete address. The get_detail function looks like this:
```python
def get_detail(url):
    # Request the detail-page data
    res = requests.get(url=url, headers=headers)
    # Set the encoding
    res.encoding = "utf-8"
    # Get the page source
    html = res.text
    # Unpack the page count so the remaining pages can be requested;
    # adjust this pattern if the pager markup differs
    size_pattern = re.compile(r'共(\d+)页')
    # Get the title; publishing differences surfaced later,
    # so the regular expression was broadened (see below)
    # title_pattern = re.compile('<title>(.*?)-中国Cosplay</title>')
    title_pattern = re.compile('<title>(.*?)-(?:中国Cosplay|Cosplay8)</title>')
    # Pattern for the big image on the page
    first_img_pattern = re.compile('<img src="(.*?)" id="bigimg"')
    try:
        # Try to match the page count
        page_size = size_pattern.search(html).group(1)
        # Try to match the title
        title = title_pattern.search(html).group(1)
        # Try to match the image address
        first_img = first_img_pattern.search(html).group(1)
        print(f"The URL holds {page_size} pages", title, first_img)
        # Build the save path
        path = f'images/{title}'
        # Create the folder if it does not exist yet
        if not os.path.exists(path):
            os.makedirs(path)
        # Save the first page's image
        save_img(path, title, first_img, 1)
        # Request the remaining pages (page 2 onward)
        urls = [f"{url[0:url.rindex('.')]}_{i}.html"
                for i in range(2, int(page_size) + 1)]
        for index, child_url in enumerate(urls):
            try:
                res = requests.get(url=child_url, headers=headers)
                html = res.text
                first_img = first_img_pattern.search(html).group(1)
                # enumerate starts at 0 and page 1 is already saved,
                # so shift the index by 2
                save_img(path, title, first_img, index + 2)
            except Exception as e:
                print("Grab child pages", e)
    except Exception as e:
        print(url, e)
```
The core logic of the code above is documented in its comments; the part worth a closer look is the title regular expression. The initial pattern was:

```
<title>(.*?)-中国Cosplay</title>
```

When some pages failed to match (their titles end in a different suffix), the pattern was broadened to:

```
<title>(.*?)-(?:中国Cosplay|Cosplay8)</title>
```
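To see why the broadened pattern helps, test it against two hypothetical titles (the sample titles below are assumptions, not scraped data):

```python
import re

title_pattern = re.compile('<title>(.*?)-(?:中国Cosplay|Cosplay8)</title>')
samples = [
    "<title>Some Coser Set-中国Cosplay</title>",   # old-style suffix
    "<title>Another Coser Set-Cosplay8</title>",  # new-style suffix
]
for s in samples:
    m = title_pattern.search(s)
    print(m.group(1) if m else "no match")
# Some Coser Set
# Another Coser Set
```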
The save_img function referenced above is as follows:
```python
def save_img(path, title, first_img, index):
    try:
        # Request the image itself
        img_res = requests.get(f"http://www.cosplay8.com{first_img}", headers=headers)
        img_data = img_res.content
        # Write the binary data to disk
        with open(f"{path}/{title}_{index}.png", "wb+") as f:
            f.write(img_data)
    except Exception as e:
        print(e)
```
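One design note: the file is always written with a .png extension even though the source images are .jpg (see the HTML snippet earlier). The bytes are stored unmodified, so image viewers still open them, but if you prefer the extension to match the source, a small variant (an assumption-level tweak, reusing the headers defined earlier) derives it from the URL:

```python
import os

import requests


def save_img(path, title, first_img, index):
    """Variant of save_img that keeps the source file's own extension."""
    try:
        img_res = requests.get(f"http://www.cosplay8.com{first_img}", headers=headers)
        # Take the extension from the image URL, e.g. '.jpg'; fall back to '.jpg'
        ext = os.path.splitext(first_img)[1] or ".jpg"
        with open(f"{path}/{title}_{index}{ext}", "wb") as f:
            f.write(img_res.content)
    except Exception as e:
        print(e)
```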