20 lines of code to become an image-hoarding geek
Use Python to crawl 100 GB of Coser images
Objective of this blog
Crawl target
- Target data source: www.cosplay8.com/pic/chinaco… , yet another Cos website. Sites like this can vanish from the Internet at any time, so to preserve the data, we archive it to disk.
Python modules used
- requests, re, os
Key learning content
- Today’s focus is crawling detail pages, a skill not covered in the previous blogs; keep an eye on it while working through the code.
List page and detail page analysis
Through the developer tools, you can easily identify the tags that contain the target data.
Click any picture to enter its detail page; there the target pictures are displayed one per page, i.e., one picture per page.
<a href="javascript:dPlayNext();" id="infoss">
<img
src="/uploads/allimg/210601/112879-210601143204.jpg"
id="bigimg"
width="800"
alt=""
border="0"
/></a>
Copy the code
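From this markup, the picture address can be extracted with a pattern keyed on id="bigimg", the same pattern the crawler uses later. A minimal check, assuming the attributes sit on one line in the served HTML:

```python
import re

# One-line version of the <img> tag shown above (assumed served form)
html = '<img src="/uploads/allimg/210601/112879-210601143204.jpg" id="bigimg" width="800" alt="" border="0" />'
first_img_pattern = re.compile('<img src="(.*?)" id="bigimg"')
print(first_img_pattern.search(html).group(1))
# /uploads/allimg/210601/112879-210601143204.jpg
```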
The URL generation rules for the list pages and detail pages are as follows:
List pages
- www.cosplay8.com/pic/chinaco…
- www.cosplay8.com/pic/chinaco…
- www.cosplay8.com/pic/chinaco…
Detail pages
- www.cosplay8.com/pic/chinaco…
- www.cosplay8.com/pic/chinaco…
- www.cosplay8.com/pic/chinaco…
Note that the first detail page does not carry a page number in its URL, so while crawling it for the total page count, you also need to save the picture on that first page.
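Concretely, page n (for n ≥ 2) is derived by inserting _n in front of the .html suffix of the first page's URL. A minimal sketch of the rule, using a made-up detail-page ID:

```python
# Hypothetical first detail page; the numeric ID is an assumption
first_page = "http://www.cosplay8.com/pic/chinacos/12345.html"
# Pages 2 onward share the stem, with an _n suffix before .html
stem = first_page[:first_page.rindex(".")]
print([f"{stem}_{i}.html" for i in range(2, 5)])
# ['.../12345_2.html', '.../12345_3.html', '.../12345_4.html']
```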
Coding time
The target website groups its pictures into categories: domestic COS, overseas COS, Hanfu circle, and Lolita. The crawler therefore takes the category as dynamic input, i.e., the user defines the crawl target at run time.
```python
import os
import re

import requests

# A basic request header; the UA string helps avoid naive bot filtering
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}


def run(category, start, end):
    # Generate the list pages to crawl
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
        for i in range(int(start), int(end) + 1)
    ]
    print(wait_url)
    url_list = []
    for item in wait_url:
        # The get_list function is defined below
        ret = get_list(item)
        print(f"Captured: {len(ret)} items")
        url_list.extend(ret)


if __name__ == "__main__":
    # e.g. http://www.cosplay8.com/pic/chinacos/list_22_2.html
    category = input("Please enter the category number: ")
    start = input("Please enter the start page: ")
    end = input("Please enter the end page: ")
    run(category, start, end)
```
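For example, entering category 22 with start page 1 and end page 2 makes print(wait_url) emit a list like this (the URLs follow directly from the f-string above):

```text
['http://www.cosplay8.com/pic/chinacos/list_22_1.html',
 'http://www.cosplay8.com/pic/chinacos/list_22_2.html']
```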
The code above first generates the target URLs from the user's input, then passes each one to the get_list function, which looks like this:
```python
def get_list(url):
    """Get all detail-page links from one list page."""
    all_list = []
    res = requests.get(url, headers=headers)
    html = res.text
    # Match the relative detail-page links inside the list items;
    # adjust this pattern if the actual list markup differs
    pattern = re.compile('<li><a href="(.*?)"')
    all_list = pattern.findall(html)
    return all_list
```
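As a quick sanity check, run the pattern against a hand-written fragment of list markup (the fragment is an assumption about the page structure, not copied from the site):

```python
import re

# Hypothetical list-page fragment for testing the extraction pattern
sample = '<li><a href="/pic/chinacos/12345.html" title="Example Coser">...</a></li>'
pattern = re.compile('<li><a href="(.*?)"')
print(pattern.findall(sample))  # ['/pic/chinacos/12345.html']
```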
With the link extraction in place, extend the run function to request each detail page, pull out the picture material, and save the captured images:
```python
def run(category, start, end):
    # Generate the list pages to crawl
    wait_url = [
        f"http://www.cosplay8.com/pic/chinacos/list_{category}_{i}.html"
        for i in range(int(start), int(end) + 1)
    ]
    print(wait_url)
    url_list = []
    for item in wait_url:
        ret = get_list(item)
        print(f"Captured: {len(ret)} items")
        url_list.extend(ret)
    print(url_list)
    # print(len(url_list))
    for url in url_list:
        get_detail(f"http://www.cosplay8.com{url}")
```
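The hrefs captured by get_list are relative paths, so the loop above glues them onto the site root with an f-string. The standard library's urllib.parse.urljoin is an equivalent, slightly more forgiving alternative, since it also copes with hrefs that are already absolute; a minimal sketch:

```python
from urllib.parse import urljoin

base = "http://www.cosplay8.com"
# Relative paths are joined onto the site root (hypothetical path)
print(urljoin(base, "/pic/chinacos/12345.html"))
# Already-absolute links pass through unchanged
print(urljoin(base, "http://www.cosplay8.com/pic/chinacos/12345.html"))
```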
Either way, each detail page now has a complete address. The get_detail function looks like this:
```python
def get_detail(url):
    # Request the detail-page data
    res = requests.get(url=url, headers=headers)
    # Set the encoding
    res.encoding = "utf-8"
    # Get the page source
    html = res.text
    # Unpack the page count so the remaining pages can be requested;
    # adjust this pattern if the pager markup differs
    size_pattern = re.compile(r'共(\d+)页')
    # Get the title; publishing differences surfaced later,
    # so the regular expression was broadened (see below)
    # title_pattern = re.compile('<title>(.*?)-中国Cosplay</title>')
    title_pattern = re.compile('<title>(.*?)-(?:中国Cosplay|Cosplay8)</title>')
    # Pattern for the big image on the page
    first_img_pattern = re.compile('<img src="(.*?)" id="bigimg"')
    try:
        # Try to match the page count
        page_size = size_pattern.search(html).group(1)
        # Try to match the title
        title = title_pattern.search(html).group(1)
        # Try to match the image address
        first_img = first_img_pattern.search(html).group(1)
        print(f"The URL holds {page_size} pages", title, first_img)
        # Build the save path
        path = f'images/{title}'
        # Create the folder if it does not exist yet
        if not os.path.exists(path):
            os.makedirs(path)
        # Save the first page's image
        save_img(path, title, first_img, 1)
        # Request the remaining pages (page 2 onward)
        urls = [f"{url[0:url.rindex('.')]}_{i}.html"
                for i in range(2, int(page_size) + 1)]
        for index, child_url in enumerate(urls):
            try:
                res = requests.get(url=child_url, headers=headers)
                html = res.text
                first_img = first_img_pattern.search(html).group(1)
                # enumerate starts at 0 and page 1 is already saved,
                # so shift the index by 2
                save_img(path, title, first_img, index + 2)
            except Exception as e:
                print("Grab child pages", e)
    except Exception as e:
        print(url, e)
```
The core logic of the code above is documented in its comments; the part worth a closer look is the title regular expression. The initial pattern was:

```
<title>(.*?)-中国Cosplay</title>
```

When some pages failed to match (their titles end in a different suffix), the pattern was broadened to:

```
<title>(.*?)-(?:中国Cosplay|Cosplay8)</title>
```
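To see why the broadened pattern helps, test it against two hypothetical titles (the sample titles below are assumptions, not scraped data):

```python
import re

title_pattern = re.compile('<title>(.*?)-(?:中国Cosplay|Cosplay8)</title>')
samples = [
    "<title>Some Coser Set-中国Cosplay</title>",   # old-style suffix
    "<title>Another Coser Set-Cosplay8</title>",  # new-style suffix
]
for s in samples:
    m = title_pattern.search(s)
    print(m.group(1) if m else "no match")
# Some Coser Set
# Another Coser Set
```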
The save_img function referenced above is as follows:
```python
def save_img(path, title, first_img, index):
    try:
        # Request the image itself
        img_res = requests.get(f"http://www.cosplay8.com{first_img}", headers=headers)
        img_data = img_res.content
        # Write the binary data to disk
        with open(f"{path}/{title}_{index}.png", "wb+") as f:
            f.write(img_data)
    except Exception as e:
        print(e)
```
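One design note: the file is always written with a .png extension even though the source images are .jpg (see the HTML snippet earlier). The bytes are stored unmodified, so image viewers still open them, but if you prefer the extension to match the source, a small variant (an assumption-level tweak, reusing the headers defined earlier) derives it from the URL:

```python
import os

import requests


def save_img(path, title, first_img, index):
    """Variant of save_img that keeps the source file's own extension."""
    try:
        img_res = requests.get(f"http://www.cosplay8.com{first_img}", headers=headers)
        # Take the extension from the image URL, e.g. '.jpg'; fall back to '.jpg'
        ext = os.path.splitext(first_img)[1] or ".jpg"
        with open(f"{path}/{title}_{index}{ext}", "wb") as f:
            f.write(img_res.content)
    except Exception as e:
        print(e)
```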