Environment Preparation:

Install the Requests library the same way you installed requests-html in the previous article.

Request via HTML

Example task: crawl image links from a web page and download the images. Background: download images of the type, number, and size you want from Baidu Images (image.baidu.com).

Import the requests and json libraries

import requests
import json

– Find the request link that returns the image information: open the web page (image.baidu.com) and search for “wallpaper”

Right-click anywhere on the page and choose Inspect
Open the Network tab
Click the HTML and XHR filters (we want the JSON data from the Ajax request, which contains the image information)
Scroll down the image page and a request for more images will appear. Click it and you will see that it is a GET request returning a large amount of JSON data, which contains the information for 30 images, as shown below:
Copy the link of that GET request

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/95494843ab6946659b95e9bf020024a3)

Analyze the link and the content it returns

The shortened link: image.baidu.com/search/acjs…
tn=resultjson_com: fixed
ipn=rj: fixed
width=: image width (leave empty to get images of any width)
height=: image height (same as above)
word=: the keyword for the type of image to search for
pn=: if all matching images on the server were placed in an array, pn would be the index into it; the request returns the 30 images starting at pn

Removing width, height, and pn from the link does not break the request, but it is better to keep them.
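
The parameters above can be assembled with the standard library instead of by string concatenation, which also percent-encodes a non-ASCII keyword. This is a minimal sketch, assuming the endpoint and parameter names from the analysis above; `build_search_url` and `page_offsets` are hypothetical helper names that do not appear in the original code.

```python
from urllib.parse import urlencode

BASE_URL = "https://image.baidu.com/search/acjson"
PAGE_SIZE = 30  # each response described above carries 30 images


def build_search_url(word, pn=0, width="", height=""):
    """Return the acjson URL for one page of results (hypothetical helper)."""
    params = {
        "tn": "resultjson_com",  # fixed
        "ipn": "rj",             # fixed
        "width": width,          # empty means any width
        "height": height,        # empty means any height
        "word": word,            # search keyword
        "pn": pn,                # offset of the first image on this page
    }
    return BASE_URL + "?" + urlencode(params)


def page_offsets(num, page_size=PAGE_SIZE):
    """Return the pn values needed to cover at least `num` images."""
    return list(range(0, num, page_size))


if __name__ == "__main__":
    # Print one URL per page needed for 88 images; nothing is requested.
    for pn in page_offsets(88):
        print(build_search_url("wallpaper", pn=pn))
```

Keeping the page arithmetic in its own function makes it easy to see how many requests a given image count will need before any request is sent.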

Copy the entire response, search for an online JSON parser in your browser, paste the content in to format it, and continue the analysis.
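
An online formatter is not strictly required; the same inspection can be done locally with the `json` module. The sketch below uses a made-up, heavily shortened sample in place of the real response. Only the `data` list and the `thumbURL` key come from the analysis in this article; the example.com links and the trailing empty entry are placeholders chosen for illustration.

```python
import json

# Hypothetical, shortened stand-in for the text returned by the GET request.
raw = (
    '{"data": [{"thumbURL": "https://example.com/a.jpg"}, '
    '{"thumbURL": "https://example.com/b.jpg"}, {}]}'
)

parsed = json.loads(raw)

# indent=2 gives a tree view similar to an online formatter;
# ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
pretty = json.dumps(parsed, indent=2, ensure_ascii=False)
print(pretty)

# Quick look at the structure: top-level keys and the number of entries.
print(list(parsed.keys()))
print(len(parsed["data"]))
```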

![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/dd33cbd8ac984100989732888adbbe9f)

The result of this analysis is used in the next step.

Get (n) image links

for i in range(30):
    imgUrl = jsonData['data'][i]['thumbURL']
    imgList.append(imgUrl)
    if len(imgList) >= num:
        break

Each request returns 30 images, so the loop runs 30 times; if more than 30 images are needed, the request has to be repeated (which is why one more loop is added around it later).
The loop body is where the image link is collected: as the analysis just showed, the thumbURL field of entry i in the data list is the image link.
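
Indexing `jsonData['data'][i]` for a fixed 30 iterations raises an error whenever a response holds fewer entries or an entry has no `thumbURL`. A more defensive variant is sketched here, assuming the response is a dictionary with a `data` list as analyzed above; `extract_thumb_urls` is a hypothetical helper name, and the sample data is made up.

```python
def extract_thumb_urls(json_data, limit=None):
    """Collect thumbURL values from one parsed response (hypothetical helper)."""
    urls = []
    for entry in json_data.get("data") or []:
        # Stop as soon as enough links have been collected.
        if limit is not None and len(urls) >= limit:
            break
        # Skip entries that are not dicts or that carry no thumbURL.
        url = entry.get("thumbURL") if isinstance(entry, dict) else None
        if url:
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = {
        "data": [
            {"thumbURL": "https://example.com/a.jpg"},
            {},  # an entry without a link is skipped
            {"thumbURL": "https://example.com/b.jpg"},
        ]
    }
    print(extract_thumb_urls(sample))
    print(extract_thumb_urls(sample, limit=1))
```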
Before crawling the images, disguise the request headers so that the request looks like it comes from a browser:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0"}


![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/4ad7da84069e4784a834cc122d095990)

def GetImgUrl(title, num, width='', height=''):
    global headers  # use the browser-like headers defined at module level
    imgList = list()  # collected image links
    index = 0  # pn offset: position of the first image in each request
    while True:
        url = "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&width=" + str(width) + \
              "&height=" + str(height) + "&word=" + title + "&pn=" + str(index)
        data = requests.get(url, headers=headers)
        jsonData = json.loads(data.content.decode('utf-8'))
        for i in range(30):  # each response is expected to hold 30 images
            imgUrl = jsonData['data'][i]['thumbURL']
            imgList.append(imgUrl)
            if len(imgList) >= num:
                break
        if len(imgList) >= num:
            break
        index = index + 30  # move on to the next 30 images
    return imgList

The code above follows this line of reasoning:

Send a GET request to the URL with the disguised headers
Decode the response as UTF-8 and load the JSON data
imgUrl holds the link to one image and is appended to the list imgList, so an empty list is created at the top to store the links
When the list length reaches the number of images we want, exit the loop (this covers the case of 30 or fewer). If more than 30 are needed, another request is required, so a while loop is added around the outside
The link-collecting code is packaged separately as a function; global headers indicates that it refers to the global headers variable.
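
The `while True` loop never ends if the server runs out of results before `num` links are collected. The sketch below shows one way to guard against that, assuming the same endpoint and response layout as above. `get_img_urls` is a hypothetical name, and the `session` argument is a design choice of this sketch rather than something from the original: it stands for any object with a requests-style `get()` method, such as `requests.Session()`.

```python
PAGE_SIZE = 30
SEARCH_URL = "https://image.baidu.com/search/acjson"


def get_img_urls(session, title, num, width="", height="", headers=None):
    """Collect up to `num` thumbnail links, requesting 30 at a time.

    `session` is anything with a requests-style get() method,
    for example requests.Session().
    """
    img_list = []
    index = 0  # pn offset of the next request
    while len(img_list) < num:
        params = {
            "tn": "resultjson_com", "ipn": "rj",
            "width": width, "height": height,
            "word": title, "pn": index,
        }
        response = session.get(SEARCH_URL, params=params,
                               headers=headers, timeout=10)
        response.raise_for_status()
        entries = response.json().get("data") or []
        links = [e["thumbURL"] for e in entries
                 if isinstance(e, dict) and e.get("thumbURL")]
        if not links:
            break  # no more results: stop instead of looping forever
        img_list.extend(links)
        index += PAGE_SIZE
    return img_list[:num]

# Possible usage with the requests library (performs real network requests):
#   import requests
#   urls = get_img_urls(requests.Session(), "wallpaper", 88, headers=headers)
```

Passing the session in also makes the function easy to try out with a stand-in object, without sending any request.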

Once you have the image links, you can download the images

def DownloadImg(urlList):
    index = 0
    for i in urlList:
        with open(str(index) + i[-4:], 'wb') as img:
            print(str(index) + i[-4:])
            f = requests.get(i, headers=headers)
            img.write(f.content)
            img.close()
        index = index + 1

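Define a custom download function
Iterate over the image links
Open a file named with the index plus the last 4 characters of each link (the extension, such as .jpg), using img as the alias
Print the file name
Request the current link
Write the response content to img (this is the image download)
Close the file (the with block would also close it automatically)
Increment index; if the file name did not change, each image would overwrite the previous one.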
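
Taking the last 4 characters of the link as the extension breaks for links ending in `.jpeg` or carrying a query string. The sketch below derives the extension from the URL path instead and skips failed downloads. `filename_for` and `download_images` are hypothetical names, the `.jpg` fallback is an assumption, and `session` again stands for any object with a requests-style `get()` method, such as `requests.Session()`.

```python
import os
from urllib.parse import urlparse


def filename_for(index, url, default_ext=".jpg"):
    """Build '<index><ext>' from the URL path, ignoring any query string."""
    ext = os.path.splitext(urlparse(url).path)[1]
    return str(index) + (ext or default_ext)


def download_images(session, url_list, out_dir=".", headers=None):
    """Save each link into out_dir and return the list of written paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(url_list):
        response = session.get(url, headers=headers, timeout=10)
        if response.status_code != 200:
            continue  # skip failed downloads instead of saving an error page
        path = os.path.join(out_dir, filename_for(index, url))
        with open(path, "wb") as img:  # the with block closes the file
            img.write(response.content)
        saved.append(path)
    return saved

# Possible usage with the requests library (performs real network requests):
#   import requests
#   download_images(requests.Session(), urlList, out_dir="images", headers=headers)
```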
Remember to import the libraries being used:

import requests
import json

Finally, call the two functions:

if __name__ == "__main__":
    urlList = GetImgUrl("小新", 88)
    DownloadImg(urlList)


The complete code

"""
Shortened link: https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&width=&height=&word=king glory&pn=60
tn=resultjson_com  fixed
ipn=rj  fixed
width=  image width (empty for any width)
height=  same as above
word=  keyword of the image type to search for
pn=  if all images on the server were in an array, pn would be the index into it
"""
import requests
import json

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0"}


def DownloadImg(urlList):
    index = 0
    for i in urlList:
        with open(str(index) + i[-4:], 'wb') as img:
            print(str(index) + i[-4:])
            f = requests.get(i, headers=headers)
            img.write(f.content)
            img.close()
        index = index + 1


def GetImgUrl(title, num, width='', height=''):
    global headers
    imgList = list()
    index = 0
    while True:
        url = "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&width=" + str(width) + \
              "&height=" + str(height) + "&word=" + title + "&pn=" + str(index)
        data = requests.get(url, headers=headers)
        jsonData = json.loads(data.content.decode('utf-8'))
        for i in range(30):
            imgUrl = jsonData['data'][i]['thumbURL']
            imgList.append(imgUrl)
            if len(imgList) >= num:
                break
        if len(imgList) >= num:
            break
        index = index + 30
    return imgList


if __name__ == "__main__":
    urlList = GetImgUrl("小新", 88)
    DownloadImg(urlList)


Python web crawler quick start 2 (parsing JSON data)