Environment Preparation:

Install the Requests library the same way you installed requests-html in the previous article.

Request via HTML

Example: crawl image links from a web page and download the images. Background: download images of the type, number, and size you want from Baidu Images (image.baidu.com).

Import the Requests and JSON libraries

import requests
import json

– Find the request that returns the image information: open the web page (image.baidu.com) and search for "wallpaper"

  1. Right-click anywhere on the page and choose Inspect

  2. Switch to the Network tab

  3. Filter by HTML and XHR (we want the JSON data returned by the Ajax request, which contains the image information)

  4. Scroll down the image page and a request that fetches more images will appear. Click it and you will see a GET request that returns a large JSON payload containing the information for 30 images, as shown below:

  5. Copy the URL of that GET request

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/95494843ab6946659b95e9bf020024a3)
  • Analyze the link and the content it returns

    Shortened link: image.baidu.com/search/acjs…

    tn=resultjson_com (fixed)
    ipn=rj (fixed)
    width= image width (leave empty to get images of any width)
    height= image height (same as width)
    word= keyword for the type of image to search for
    pn= offset: if all matching images on the server were placed in an array, pn would be the index, and the request returns the next 30 images starting from pn

    Removing width, height, and pn does not break the request, but it is better to keep them.
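
The parameters described above can be assembled into a request URL programmatically. The following is a minimal sketch, not the article's own code: the helper name build_search_url is hypothetical, and it uses the standard-library urlencode so that non-ASCII keywords are percent-encoded.

```python
from urllib.parse import urlencode

# Endpoint taken from the link analyzed above.
BASE_URL = "https://image.baidu.com/search/acjson"


def build_search_url(word, pn=0, width="", height=""):
    """Assemble the request URL (hypothetical helper, for illustration only)."""
    params = {
        "tn": "resultjson_com",  # fixed
        "ipn": "rj",             # fixed
        "width": width,          # empty means any width
        "height": height,        # empty means any height
        "word": word,            # search keyword
        "pn": pn,                # offset of the first image to return
    }
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("wallpaper", pn=30))
```

Building the query from a dictionary keeps each parameter visible in one place and avoids hand-written string concatenation.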

Copy the entire response from the web page, search for an online JSON parser in the browser, paste the text in to format it, and continue the analysis…
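
As an alternative to an online formatter, the copied text can be formatted locally with the json module. This is only a sketch: the raw string below is a hypothetical, heavily shortened stand-in for the real response, which contains many more fields.

```python
import json

# Hypothetical stand-in for the text copied from the browser.
raw = '{"data": [{"thumbURL": "https://example.com/a.jpg"}, {"thumbURL": "https://example.com/b.jpg"}]}'

parsed = json.loads(raw)
# indent=2 adds line breaks and indentation; ensure_ascii=False keeps
# non-ASCII text (such as Chinese titles) readable instead of escaped.
pretty = json.dumps(parsed, indent=2, ensure_ascii=False)
print(pretty)
```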

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/dd33cbd8ac984100989732888adbbe9f)
  • The results of this analysis are used in the next step.
  1. Get n image links

      for i in range(30):
          imgUrl = jsonData['data'][i]['thumbURL']
          imgList.append(imgUrl)
          if len(imgList) >= num:
              break
  • Each request returns 30 images, so the loop runs 30 times; if more than 30 images are wanted, the request has to be repeated (so one more loop is added around this one)

  • The loop body is what collects the images: in the JSON analyzed above, data is a list of image entries, and the thumbURL of each entry is the image link.

  • Before crawling the images, disguise the request headers so that the request looks like it comes from a browser

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0"}
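
The loop in step 1 assumes that every response holds exactly 30 entries and that each entry has a thumbURL. A more defensive variant might look like the sketch below; the function name extract_thumb_urls and the sample data are hypothetical, and the idea that some entries may be empty is an assumption rather than something the article states.

```python
def extract_thumb_urls(json_data, limit):
    """Return up to `limit` thumbURL values, skipping entries without one."""
    urls = []
    for item in json_data.get("data", []):
        if len(urls) >= limit:
            break
        # Tolerate entries that are not dicts or that lack a thumbURL.
        url = item.get("thumbURL") if isinstance(item, dict) else None
        if url:
            urls.append(url)
    return urls


# Hypothetical sample with one empty entry in the middle.
sample = {
    "data": [
        {"thumbURL": "https://example.com/1.jpg"},
        {},
        {"thumbURL": "https://example.com/2.jpg"},
    ]
}
print(extract_thumb_urls(sample, 5))
```

Iterating over the list itself, instead of over range(30), avoids an IndexError when a response is shorter than expected.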


![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/4ad7da84069e4784a834cc122d095990)

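def GetImgUrl(title, num, width='', height=''):
    """Return a list of up to num image links for the keyword title."""
    global headers  # uses the module-level headers defined above
    imgList = list()
    index = 0  # value of pn: offset of the first image in the current request
    while True:
        url = "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&width=" + str(width) + \
              "&height=" + str(height) + "&word=" + title + "&pn=" + str(index)
        data = requests.get(url, headers=headers)
        jsonData = json.loads(data.content.decode('utf-8'))
        # Each response carries the information for 30 images
        for i in range(30):
            imgUrl = jsonData['data'][i]['thumbURL']
            imgList.append(imgUrl)
            if len(imgList) >= num:
                break
        # Stop once enough links are collected; otherwise request the next 30
        if len(imgList) >= num:
            break
        index = index + 30
    return imgList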

Then fill in the code step by step, following this line of thought:

  • Send a GET request to the url with the disguised headers
  • Decode the response as UTF-8 and load it as JSON
  • imgUrl holds the link of one image and is appended to the list imgList, so the list has to be created beforehand
  • If the list length reaches the number of images we want, exit the loop (this covers requests for fewer than 30 images). If more than 30 images are wanted, another request is needed, so a while loop is added around the for loop
  • The whole procedure is packaged as a separate function; global headers indicates that the function refers to the global variable headers.
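
The paging logic in the bullets above can be tried without touching the network by passing the page-fetching step in as a function. This is a sketch under assumptions: collect_image_urls and fake_fetch_page are hypothetical names, and fake_fetch_page only simulates a server holding 70 images. Unlike the function above, the loop also stops when a page comes back empty.

```python
def collect_image_urls(fetch_page, num, page_size=30):
    """Collect up to `num` links by calling fetch_page(pn) for pn = 0, 30, 60, ..."""
    urls = []
    pn = 0
    while len(urls) < num:
        page = fetch_page(pn)  # list of links starting at offset pn
        if not page:           # the server has nothing more: stop instead of looping forever
            break
        urls.extend(page[: num - len(urls)])  # take only as many as still needed
        pn += page_size
    return urls


def fake_fetch_page(pn, total=70, page_size=30):
    """Stand-in for the HTTP request: pretends the server holds `total` images."""
    return [f"https://example.com/{i}.jpg" for i in range(pn, min(pn + page_size, total))]


print(len(collect_image_urls(fake_fetch_page, 88)))  # 70, because only 70 exist
```

In real use, fetch_page would wrap the requests.get call and the thumbURL extraction shown earlier.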
  2. Once you have the image links, you can download the images

    def DownloadImg(urlList):
        index = 0
        for i in urlList:
            with open(str(index) + i[-4:], 'wb') as img:
                print(str(index) + i[-4:])
                f = requests.get(i, headers=headers)
                img.write(f.content)
            index = index + 1

  • Define a download function

  • Iterate over the image links

  • Open a file named index plus the last 4 characters of the link (assumed to be the extension, such as .jpg), with img as its alias

  • Print the file name

  • Request the current link

  • Write the response content to img (this is the image download)

  • The with statement closes the file when the block ends, so no explicit close is needed

  • Increase index by 1; if the file name did not change, each image would overwrite the previous one.

  • Add the libraries being used at the top

    import requests
    import json
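
The download step names each file with the last four characters of its link, which only works when the link ends in an extension such as .jpg. If a link carries a query string or no extension at all, a helper along these lines could be used instead. It is a sketch only: filename_for, the list of accepted extensions, and the .jpg fallback are all assumptions, not part of the original code.

```python
import os
from urllib.parse import urlparse

# Assumed set of acceptable image extensions.
KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"}


def filename_for(index, url, default_ext=".jpg"):
    """Build '<index><ext>' from the URL path, ignoring any query string."""
    path = urlparse(url).path               # drops '?key=value' parts
    ext = os.path.splitext(path)[1].lower()
    if ext not in KNOWN_EXTENSIONS:
        ext = default_ext                   # fall back when no usable extension is found
    return f"{index}{ext}"


print(filename_for(0, "https://example.com/a/b.PNG?x=1"))
print(filename_for(3, "https://example.com/img?id=42"))
```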

  3. Finally, call the functions

    if __name__ == "__main__":
        urlList = GetImgUrl("小新", 88)
        DownloadImg(urlList)


The complete code

"""
Shortened link:
https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&width=&height=&word=king glory&pn=60

tn=resultjson_com  fixed
ipn=rj             fixed
width=             image width (may be empty)
height=            same as above
word=              keyword for the type of image to search for
pn=                if all images on the server were in an array, pn would be the index
"""
import requests
import json

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0"}


def DownloadImg(urlList):
    index = 0
    for i in urlList:
        with open(str(index) + i[-4:], 'wb') as img:
            print(str(index) + i[-4:])
            f = requests.get(i, headers=headers)
            img.write(f.content)
        index = index + 1


def GetImgUrl(title, num, width='', height=''):
    global headers
    imgList = list()
    index = 0
    while True:
        url = "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&width=" + str(width) + \
              "&height=" + str(height) + "&word=" + title + "&pn=" + str(index)
        data = requests.get(url, headers=headers)
        jsonData = json.loads(data.content.decode('utf-8'))
        for i in range(30):
            imgUrl = jsonData['data'][i]['thumbURL']
            imgList.append(imgUrl)
            if len(imgList) >= num:
                break
        if len(imgList) >= num:
            break
        index = index + 30
    return imgList


if __name__ == "__main__":
    urlList = GetImgUrl("小新", 88)
    DownloadImg(urlList)
