This is the Nth day of my participation in the August Text Challenge (juejin.cn/post/698796…).

I've been writing crawlers lately, and I usually leave a few bonus picture links at the end of my posts. A reader recently asked for more hanfu pictures, so today, with a bit of spare time, I searched Baidu, found a hanfu website, and figured I could write a crawler to collect some pictures. Without further ado, let's begin.

1. Target website analysis

Target website: www.hanfuhui.com/

As you can see, the home page is laid out as a waterfall feed, and scrolling down loads new data.

Open a sub-page to see the corresponding link information for each image.

You can see the link information for this image.

pic.hanfugou.com/ios/2019/12…

It looks simple enough: once you've located an image's link, you only need to save the image.

That gives us a first solution: open the home page, go into each sub-page to collect all the image links, and save the images one by one.
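A rough sketch of what that would look like (purely illustrative: we never actually inspect the sub-page markup in this post, so grabbing every `<img>` tag is a hypothetical placeholder for the real selectors):

```python
import requests
from bs4 import BeautifulSoup

def get_subpage_images(subpage_url):
    # Hypothetical sketch of solution one: fetch a sub-page and pull out image links.
    # Grabbing all <img> tags is a placeholder; the real page structure wasn't analyzed here.
    html = requests.get(subpage_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.find_all("img") if img.get("src")]
```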

2. Slide analysis

Keep scrolling on the home page, press F12 to open DevTools, switch to the Network tab, clear all requests, and watch how the page is paged while scrolling.

Two requests are sent while dragging, and a bunch of JSON data is received.

API address:

api5.hanfugou.com/Trend/GetTr…

Analyze the parameters:

- page: which page to fetch
- count: how many items per page
- objecttype: the type of object, here an album

Since this is only a guess, let's change the parameters and send a request ourselves: open a new browser tab and ask for 30 items per page.

It looks like we were right.
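The same check can be done from Python (a minimal sketch; the URL and parameter names come from the request captured in DevTools, and the "Data" field name matches the script below):

```python
import requests

# Quick sanity check: ask the paging API for 30 items on one page and count them.
url = "https://api5.hanfugou.com/trend/GetTrendListForMain"
params = {"banobjecttype": "trend", "maxtime": 0, "page": 1, "count": 30}
resp = requests.get(url, params=params, timeout=10)
resp.raise_for_status()
print(len(resp.json()["Data"]))  # expect 30 if our guess about count is right
```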

3. The code

The analysis is done; let's go straight to the code.

I chose the second solution, requesting the paging API directly, because it's simple and can be done with a single API.

```python
import requests

isQuick = False

# author: Coriander

def get_content(url):
    try:
        user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
        response = requests.get(url, headers={'User-Agent': user_agent})
        response.raise_for_status()  # if the status code is not 200, raise an exception
    except Exception as e:
        print("Crawl failed:", e)
    else:
        print("Crawl succeeded!")
        return response.json()


def save_img(img_src):
    if img_src is None:
        return
    imgName = img_src.split('/')[-1]
    try:
        print(img_src)
        url = img_src
        if isQuick:
            url = url + "_700x.jpg"
        headers = {"User-Agent": 'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'}
        # note: verify=False skips certificate validation, so silence the warning
        requests.packages.urllib3.disable_warnings()
        res = requests.get(url=url, headers=headers, verify=False)
        data = res.content
        filePath = 'hanfu/' + imgName
        with open(filePath, "wb+") as f:
            f.write(data)
    except Exception as e:
        print(e)


def readAllMsg(jsonData):
    allImgList = []
    dataList = jsonData['Data']
    for dataItem in dataList:
        imgList = dataItem['Images']
        allImgList.extend(imgList)

    return allImgList


if __name__ == '__main__':
    # url = "https://api5.hanfugou.com/Trend/GetTrendListForHot?maxid=2892897&objecttype=album&page=3&count=20"
    url = "https://api5.hanfugou.com/trend/GetTrendListForMain?banobjecttype=trend&maxtime=0&page=2&count=10"
    jsonData = get_content(url)
    if jsonData:  # get_content returns None on failure
        imgList = readAllMsg(jsonData)
        for imgDict in imgList:
            save_img(imgDict['ImgSrc'])
```
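The script only fetches page 2. To grab more than one page, a small loop over the page parameter works (a sketch reusing the functions above; go easy on the request rate):

```python
import time

# Sketch: replace the main block above with a loop over pages.
if __name__ == '__main__':
    base = "https://api5.hanfugou.com/trend/GetTrendListForMain?banobjecttype=trend&maxtime=0&page={}&count=10"
    for page in range(1, 4):  # pages 1 through 3; adjust as needed
        jsonData = get_content(base.format(page))
        if not jsonData:
            continue
        for imgDict in readAllMsg(jsonData):
            save_img(imgDict['ImgSrc'])
        time.sleep(1)  # pause between pages to be polite to the server
```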

Note: the folder hanfu must be created under the current code path before running.
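If you'd rather create it from code, one line with the standard library does it (run before the download loop):

```python
import os

os.makedirs("hanfu", exist_ok=True)  # create the output folder if it doesn't already exist
```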

Take a look at our results:

1. Although the process of finding the paging API is simple, it still took some time; it really shouldn't have.

2. During the analysis I found that the images on the sub-pages are scaled down: "_700x.jpg" is appended to the original URL, so everything comes back 700 px wide; after switching to a mobile device in DevTools, a 200 suffix showed up as well (see the sketch after this list).

3. While hunting for the original image, I found a bigpic function when reading the site's JS, but I couldn't find the corresponding button on the page, so I'm not sure what it does.

4. The way to download the original image was also found only after some messy analysis; otherwise the download might just come back as a 404. Try it for yourself.

5. I added a switch to the code: isQuick = False. If you want faster downloads, set it to True and the 700 px wide copies will be downloaded instead.

6. If you don't actually need these pictures, please don't run this crawler; it only burdens the website and wastes bandwidth.
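For reference, here is the size-suffix trick from note 2 as a small helper (a sketch: "_700x.jpg" was observed on desktop sub-pages, and I'm assuming the mobile variant follows the same "_200x.jpg" pattern):

```python
def sized_url(img_src, width=700):
    # Append the scaling suffix the site uses for reduced copies.
    # _700x.jpg was observed on desktop; _200x.jpg is assumed for mobile.
    return "{}_{}x.jpg".format(img_src, width)

# usage: sized_url(imgDict['ImgSrc'])       -> 700 px wide copy
#        sized_url(imgDict['ImgSrc'], 200)  -> 200 px wide copy (assumed)
```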

The usual bonus