
20 lines of code to become a Cos geek

Use Python to crawl 100G of Coser images

Objective of this blog

Crawl target

  • Target data source: www.cosplay8.com/pic/chinaco… , another Cos website. Sites like this tend to disappear from the Internet, so to preserve the data, we archive it.

Python modules used

  • requests, re, os

Key learning content

  • Today's focus is crawling detail pages, a skill not covered in previous blogs; keep an eye on it while writing the code.

List page and detail page analysis

With the developer tools, you can easily identify the tags that hold the target data.

Click any picture to enter the detail page; the target picture is displayed one per page, that is, one picture per detail page.

<a href="javascript:dPlayNext();" id="infoss">
  <img
    src="/uploads/allimg/210601/112879-210601143204.jpg"
    id="bigimg"
    width="800"
    alt=""
    border="0"
/></a>

The URL generation rules for the list pages and detail pages are as follows:

List pages

  • www.cosplay8.com/pic/chinaco…
  • www.cosplay8.com/pic/chinaco…
  • www.cosplay8.com/pic/chinaco…

Detail pages

  • www.cosplay8.com/pic/chinaco…
  • www.cosplay8.com/pic/chinaco…
  • www.cosplay8.com/pic/chinaco…

Note that the first detail page carries no page-number suffix, so while crawling you need to extract the total page count and also save the image on that first page.
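The pagination rule above can be sketched in Python: page 1 is the bare URL, and pages 2 through N insert a `_N` suffix before the extension (a minimal sketch; `build_detail_urls` is a hypothetical helper name, and `page_size` is assumed to have been scraped from the page):

```python
# Build the URLs for pages 2..N of a detail page. Page 1 has no
# suffix, so its image must be saved separately.
def build_detail_urls(first_url: str, page_size: int) -> list:
    base = first_url[:first_url.rindex(".")]  # strip the ".html" extension
    return [f"{base}_{i}.html" for i in range(2, page_size + 1)]

print(build_detail_urls("http://www.cosplay8.com/pic/chinacos/12345.html", 3))
# ['http://www.cosplay8.com/pic/chinacos/12345_2.html',
#  'http://www.cosplay8.com/pic/chinacos/12345_3.html']
```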

Coding time

The target website groups the pictures into categories: domestic COS, overseas COS, Hanfu, and Lolita. The category can therefore be entered dynamically when the crawl starts, i.e. the crawl target is user-defined.
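Before the functions below, the script needs its imports and a request header; the exact `User-Agent` string here is an assumption, not the one from the original post:

```python
import os
import re

import requests  # third-party: pip install requests

# A browser-like User-Agent (assumed value) so the site serves normal pages
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
```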


def run(category, start, end):
    # Generate the list pages to be crawled
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
        for i in range(int(start), int(end) + 1)]
    print(wait_url)

    url_list = []
    for item in wait_url:
        # the get_list function is provided below
        ret = get_list(item)

        print(f"Captured: {len(ret)} items")
        url_list.extend(ret)


if __name__ == "__main__":

    # e.g. http://www.cosplay8.com/pic/chinacos/list_22_2.html
    category = input("Please enter the category number: ")
    start = input("Please enter the start page: ")
    end = input("Please enter the end page: ")
    run(category, start, end)

The above code first generates the target URLs based on the user's input, then passes each of them in turn to the get_list function, which looks like this:

def get_list(url):
    """ Get all detail page links from a list page """
    all_list = []

    res = requests.get(url, headers=headers)
    html = res.text
    # The original pattern was garbled in formatting; this assumed pattern
    # matches detail-page links such as /pic/chinacos/12345.html
    pattern = re.compile(r'<a href="(/pic/chinacos/\d+\.html)"')
    all_list = pattern.findall(html)
    return all_list

    The regular expression matches all detail-page addresses in the list page and returns them as a list. Next, extend the code in the run function to fetch the detail-page image material and save the pictures.
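As a quick illustration of the extraction step, `re.findall` returns every captured group as a list (the HTML fragment and the link pattern below are illustrative assumptions):

```python
import re

html = '''
<li><a href="/pic/chinacos/12345.html">Coser A</a></li>
<li><a href="/pic/chinacos/67890.html">Coser B</a></li>
'''

# capture the relative detail-page address inside each link
pattern = re.compile(r'<a href="(/pic/chinacos/\d+\.html)"')
links = pattern.findall(html)
print(links)  # ['/pic/chinacos/12345.html', '/pic/chinacos/67890.html']
```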

    def run(category, start, end):
        # Generate the list pages to be crawled
        wait_url = [
            f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
            for i in range(int(start), int(end) + 1)]
        print(wait_url)

        url_list = []
        for item in wait_url:
            ret = get_list(item)

            print(f"Captured: {len(ret)} items")
            url_list.extend(ret)

        print(url_list)
        # print(len(url_list))
        for url in url_list:
            get_detail(f"http://www.cosplay8.com{url}")

    Because the matched detail-page addresses are relative, the code formats them into complete URLs. The get_detail function looks like this:

    def get_detail(url):
        # Request detail page data
        res = requests.get(url=url, headers=headers)
        # set the encoding
        res.encoding = "utf-8"
        # get the source code of the page
        html = res.text

        # Extract the page count, then save the first image; the original
        # pattern was garbled in formatting, so an assumed pattern matching
        # a page counter like "共N页" is used here
        size_pattern = re.compile(r'共(\d+)页')
        # get the title; differences between pages were found later,
        # so the regular expression was modified
        # title_pattern = re.compile('<title>(.*?) - China Cosplay</title>')
        title_pattern = re.compile('<title>(.*?) - Cosplay(China|8)</title>')
        # set the image regular expression (matches the #bigimg tag shown above)
        first_img_pattern = re.compile('<img src="(.*?)" id="bigimg"')
        try:
            # try to match the page count
            page_size = size_pattern.search(html).group(1)
            # try to match the title
            title = title_pattern.search(html).group(1)
            # try to match the image address
            first_img = first_img_pattern.search(html).group(1)

            print(f"Data for {url}: {page_size} pages,", title, first_img)
            # generate the path
            path = f'images/{title}'
            # create the directory if it does not exist
            if not os.path.exists(path):
                os.makedirs(path)

            # save the first image
            save_img(path, title, first_img, 1)

            # Request the remaining pages, e.g. xxx_2.html ... xxx_N.html
            urls = [f"{url[0:url.rindex('.')]}_{i}.html"
                    for i in range(2, int(page_size) + 1)]

            # start=2 so the index does not collide with the first image
            for index, child_url in enumerate(urls, start=2):
                try:
                    res = requests.get(url=child_url, headers=headers)

                    html = res.text
                    first_img_pattern = re.compile('<img src="(.*?)" id="bigimg"')
                    first_img = first_img_pattern.search(html).group(1)

                    save_img(path, title, first_img, index)
                except Exception as e:
                    print("Error grabbing child page", e)

        except Exception as e:
            print(url, e)

    The core logic of the above code is explained in the comments; pay particular attention to the title-matching part. The regular expression was initially written as:

    <title>(.*?) - China Cosplay</title>

    Because some page titles failed to match, it was modified as follows:

    <title>(.*?) - Cosplay(China|8)</title>
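A quick check of the two expressions against made-up title tags (the title strings below are illustrative, not taken from the site):

```python
import re

old = re.compile(r'<title>(.*?) - China Cosplay</title>')
new = re.compile(r'<title>(.*?) - Cosplay(China|8)</title>')

t1 = "<title>Some Coser - CosplayChina</title>"
t2 = "<title>Some Coser - Cosplay8</title>"

# the old pattern misses this suffix variant entirely
print(old.search(t1))            # None
# the revised pattern captures the title from either variant
print(new.search(t1).group(1))   # Some Coser
print(new.search(t2).group(1))   # Some Coser
```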

    The missing save_img function code is as follows:

    def save_img(path, title, first_img, index):
        try:
            # request the image
            img_res = requests.get(f"http://www.cosplay8.com{first_img}", headers=headers)
            img_data = img_res.content

            # write the binary data to disk
            with open(f"{path}/{title}_{index}.png", "wb+") as f:
                f.write(img_data)
        except Exception as e:
            print(e)
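One small improvement worth noting: save_img hardcodes a `.png` extension even though the source files are `.jpg`. The extension can instead be taken from the image URL (a sketch using only the standard library; `img_filename` is a hypothetical helper name):

```python
import os

def img_filename(path: str, title: str, img_url: str, index: int) -> str:
    # keep whatever extension the remote file actually has,
    # falling back to .jpg when the URL carries none
    ext = os.path.splitext(img_url)[1] or ".jpg"
    return f"{path}/{title}_{index}{ext}"

print(img_filename("images/demo", "demo",
                   "/uploads/allimg/210601/112879-210601143204.jpg", 1))
# images/demo/demo_1.jpg
```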