“This is the 20th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”
Hanfu is really beautiful
By reading this blog, you will gain:
- Improved Python skills
- 40,000+ hanfu photos, or more
Hanfu photo collection technique
Target data source analysis
The target data for this crawl is shown in the figure below. The target site is https://www.hanfuhui.com/, a vertical community centered on hanfu.
This blog covers the following topics:
- Reading JSON data with requests;
- Parsing JSON-format data;
- CSV file storage;
- Reading files and saving pictures.
Data source analysis
- The target data is loaded asynchronously, that is, it is returned by a server interface. The interface can be found with the browser developer tools, as follows:
- The data interface is: https://api5.hanfugou.com/Trend/GetTrendListForHot?maxid=3396754&objecttype=album&page=3&count=20. Its important parameters are page and count, that is, the page number and the amount of data per page. Testing shows that the count value can be changed arbitrarily. When the value exceeds 100, the interface returns data slowly; the code later in this post sets it to 500.
- The response format of the interface is JSON, as shown in the figure below. Whether the request succeeded is reported in Status, and the core data of the interface is in Data.
- To learn more about JSON data, you can look up related material, or start from cases and master it gradually, for example by following the Python Crawler 120 column and learning in practice.
Requirements
Based on the above analysis, the requirements are as follows:
- Batch-generate interface addresses, which are used to extract the image addresses;
- To ensure efficiency, store the extracted image addresses in batches in a csv file;
- Read the image addresses from the csv file and splice them into downloadable links;
- Download the pictures.
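The first requirement, batch generation of interface addresses, can be sketched as follows. This is only a minimal illustration, assuming the interface keeps accepting the maxid, objecttype, page, and count query parameters seen in the address above. The helper name build_urls is hypothetical and is not part of the original code.

```python
from urllib.parse import urlencode

# Base address of the interface, taken from the analysis above.
BASE_URL = "https://api5.hanfugou.com/Trend/GetTrendListForHot"


def build_urls(first_page, last_page, count=500, maxid=3396754):
    """Return one interface address per page (hypothetical helper)."""
    urls = []
    for page in range(first_page, last_page + 1):
        # urlencode keeps the dict's insertion order and escapes values if needed.
        query = urlencode({
            "maxid": maxid,
            "objecttype": "album",
            "page": page,
            "count": count,
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls


if __name__ == "__main__":
    for url in build_urls(1, 3):
        print(url)
```

Building the query string with urlencode rather than by hand keeps the parameters in one place, so changing count or the page range does not require editing the address itself.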
Case coding
First, use requests for the basic data fetching. This step is relatively simple, so here is the code directly.
import random
import time  # used by the image download step later in this post

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    # Collect other User-Agent strings yourself
]


def run(url):
    try:
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        res = requests.get(url=url, headers=headers)
        # Get the JSON data directly from res.json()
        json_text = res.json()
        # The interface data is not validated here. If necessary, verify the
        # request via the Status attribute of the response.
        data = json_text["Data"]
        img_srcs = []
        for item in data:
            img_srcs.extend(item["ImageSrcs"])
        long_str = "\n".join(img_srcs)
        # Save data
        save(long_str)
    except Exception as e:
        print(e)


def save(long_str):
    try:
        # Append mode, so every batch is added to the same file
        with open("./imgs.csv", "a+") as f:
            f.write("\n" + long_str)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    for i in range(1, 10):
        print(f"Crawling batch {i}")
        run(f"https://api5.hanfugou.com/Trend/GetTrendListForHot?maxid=3396754&objecttype=album&page={i}&count=500")
    print("All batches have been crawled.")
The above code lacks interface status validation logic. A comment marks the relevant location, and the code can be extended there if necessary.
The first step saves 40,000+ image addresses, which is enough for subsequent use. The next step is to write the code that fetches the pictures.
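One possible way to add the missing validation is sketched below. It rests on assumptions: this post only establishes that success is reported in Status and that the payload is in Data, so the value that signals success is not hard-coded and is instead decided by a function the caller passes in. The helper name extract_image_srcs and the sample payload are hypothetical.

```python
def extract_image_srcs(payload, is_success):
    """Return all image addresses from a decoded response (hypothetical helper).

    payload    -- the dict produced by res.json()
    is_success -- caller-supplied function that decides whether a Status
                  value means success; the real success value is an
                  assumption the caller has to check against the interface
    """
    if not isinstance(payload, dict):
        raise ValueError("response is not a JSON object")
    status = payload.get("Status")
    if not is_success(status):
        raise ValueError(f"interface reported failure: Status={status!r}")
    data = payload.get("Data")
    if not isinstance(data, list):
        raise ValueError("Data is missing or is not a list")

    img_srcs = []
    for item in data:
        # Items without ImageSrcs are skipped instead of raising KeyError.
        img_srcs.extend(item.get("ImageSrcs") or [])
    return img_srcs


if __name__ == "__main__":
    # Made-up payload for illustration only; the Status value is a placeholder.
    sample = {"Status": True, "Data": [{"ImageSrcs": ["a.jpg", "b.jpg"]}, {}]}
    print(extract_image_srcs(sample, is_success=bool))
```

Inside run, a helper like this would replace the direct json_text["Data"] lookup, and the existing except branch would then print the validation error for a failed batch.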
Requesting an image link read from the csv file directly returns a picture named hanfuhui-pi-404, which means there is a problem with the request.
Later retesting showed that some image links were accessible while others returned the 404 picture. This inconsistency is not investigated further here.
Inspecting the image addresses on the detail pages reveals the following rule.
When the image address is https://pic.hanfugou.com/android/2020/2/30/3b2c6bc54cfa4656a81b4f9b4167e2c3.jpg, the download fails. Adding a picture size limit to it, which gives https://pic.hanfugou.com/android/2020/2/30/3b2c6bc54cfa4656a81b4f9b4167e2c3.jpg_700x.jpg, returns the correct picture, only at reduced resolution.
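The rule above amounts to appending a suffix to the stored address. Here is a minimal sketch, assuming the suffix is always _700x.jpg as in the example. The helper name to_sized_url is hypothetical, and other sizes are not tested here.

```python
# Suffix observed in the example above; other sizes are not verified.
SIZE_SUFFIX = "_700x.jpg"


def to_sized_url(img_src):
    """Append the size suffix unless it is already present (hypothetical helper)."""
    img_src = img_src.strip()
    if img_src.endswith(SIZE_SUFFIX):
        return img_src
    return img_src + SIZE_SUFFIX


if __name__ == "__main__":
    raw = "https://pic.hanfugou.com/android/2020/2/30/3b2c6bc54cfa4656a81b4f9b4167e2c3.jpg"
    print(to_sized_url(raw))
```

Checking for the suffix first makes the function safe to call twice on the same address, which helps if the csv file is ever rewritten with the sized links.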
The image download code is as follows.
To use the following code, create a hanfu directory in the same directory as the code file in advance, to store the images.
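If you would rather not create the directory by hand, the standard library can do it when the script starts. This is a small sketch, assuming the same ./hanfu path that the download code writes to. The helper name ensure_dir is hypothetical.

```python
import os


def ensure_dir(path="./hanfu"):
    """Create the image directory if it does not exist yet (hypothetical helper)."""
    # exist_ok=True makes repeated runs safe: no error if the directory is already there.
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)
```

Calling ensure_dir() once before the download loop is enough.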
Because the image addresses use HTTPS, the code adds the verify parameter to the requests call and sets it to False, which tells requests not to verify the site's CA certificate when requesting data. This is only needed when certificate verification fails, and an InsecureRequestWarning is printed for such unverified requests.
def save_img(img_src):
    try:
        url = img_src
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        # Note: verify is set to False, so the website certificate is not validated
        res = requests.get(url=url, headers=headers, verify=False)
        data = res.content
        # time.time_ns() keeps file names distinct when several images
        # are saved within the same second
        with open(f"./hanfu/{time.time_ns()}.jpg", "wb+") as f:
            f.write(data)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    with open("./imgs.csv", "r") as f:
        # Iterating over the file stops at the end of the file
        for line in f:
            img_url = line.strip()
            # Skip blank lines, such as the leading newline written by save()
            if not img_url:
                continue
            real_url = f"{img_url}_700x.jpg"
            save_img(real_url)
The code is finished. All that remains is to wait while the program collects the pictures for us, so go out and have some tea.