Requests: Crawling doutula meme images without regular expressions

Title: Python - Crawling images

This is the image crawler from my earlier hexo deployment. www.cnblogs.com/thloveyl/p/…

Crawling the images with Requests

import time
import requests
import urllib.request


class getRequests:

    # Once the response text has been fetched, a regular expression could filter out each image's URL and name.
    # Splitting the text by hand, as done here, arguably goes against the simplicity of Python syntax.

    def getImagePath(self):
        # the part of the list-page URL that never changes
        url = 'https://www.doutula.com/photo/list/?page='

        # count of downloaded images
        sum = 0
        # Get all images from page 1 to page 2327
        for i in range(1, 2328):
            # build the page URL by appending the page number
            print(url+str(i))
            # Send the request and get the text of the response
            reponse = requests.get(url+str(i))
            retext = reponse.text
            # Split the response text
            reponseList = retext.split('data-original="')
            # print(reponseList)
            # fetch the url
            imageUrl = reponseList[1:len(reponseList)-1]
            print('Current page: %d' % i)
            for x in imageUrl:
                # image url
                iurl = x.split('" alt="')
                imageUrl = iurl[0]
                # print(' imageUrl: %s'%imageUrl)
                # image name
                imageName = iurl[1].split('"')[0]
                # Keep the image name free of characters that are not allowed in file names.
                noContanins = ['?', '*', ':', '"', '<', '>', '\\', '/', '|']
                for noC in noContanins:
                    if noC in imageName:
                        # replace the illegal character
                        print(noC)
                        imageName = imageName.replace(noC, ' ')

                # Image suffix: drop anything after '!'
                imageType = imageUrl.split('!')[0]
                # Split on '.'
                imageType = imageType.rsplit('.')
                # take the letters after the last '.' as the format
                imageType = imageType[len(imageType)-1]
                # Replace a 'null' format with 'jpg'
                if imageType == 'null':
                    imageType = 'jpg'
                # if an error is encountered, skip this image and move on to the next one
                try:
                    # Set the save path - the folder needs to already exist
                    path = 'd:\\doutula\\' + imageName + '.' + imageType
                    urllib.request.urlretrieve(imageUrl, path)
                    sum += 1
                except Exception:
                    continue
                print('Downloaded to %s || files downloaded: %d' % (path, sum))
        print('Total %d images' % sum)


if __name__ == '__main__':
    # Print the current time before and after the crawl
    print("Begin:", time.strftime('%Y.%m.%d %H:%M:%S ', time.localtime(time.time())))
    getRequests().getImagePath()
    print("Done:", time.strftime('%Y.%m.%d %H:%M:%S ', time.localtime(time.time())))
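The file-name cleanup and suffix handling in the script could also be pulled into two small helpers. The following is only a sketch: `clean_name`, `image_suffix`, `ILLEGAL_CHARS` and the example.com URLs are hypothetical names and values introduced for illustration, and the rules (replace illegal characters with a space, drop everything after `!`, fall back to `jpg` when the suffix is `null`) are assumed to match what the script does.

```python
# The same characters the script treats as not allowed in file names.
ILLEGAL_CHARS = '?*:"<>\\/|'
_TABLE = str.maketrans({ch: ' ' for ch in ILLEGAL_CHARS})


def clean_name(name):
    """Replace every illegal character with a space, as the script does."""
    return name.translate(_TABLE)


def image_suffix(url, default='jpg'):
    """Return the text after the last '.', ignoring anything after '!'."""
    base = url.split('!')[0]
    suffix = base.rsplit('.', 1)[-1]
    # Assumption carried over from the script: 'null' means "unknown format".
    return default if suffix == 'null' else suffix


# Made-up inputs, for illustration only.
print(clean_name('what?*'))
print(image_suffix('https://example.com/x/1.jpg!dta'))
```

Using `str.translate` handles all nine characters in one pass instead of looping over the list and calling `replace` for each one.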

Regular expressions are not used here, which goes against Python's aim of concise code. I plan to refactor this once I have learned regular expressions. PS: a newer version that uses a regular expression in place of the redundant link-extraction code: www.cnblogs.com/thloveyl/p/…
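To give a rough idea of what such a refactor might look like, here is a minimal sketch. It is not the code from the linked post. `extract_images`, `IMAGE_PATTERN` and the sample markup are hypothetical, and the pattern assumes the page lays out its attributes as `data-original="<url>" alt="<name>"`, which is the same layout the split markers in the script rely on.

```python
import re

# Assumed attribute layout, taken from the split markers in the script:
# data-original="<url>" alt="<name>"
IMAGE_PATTERN = re.compile(r'data-original="([^"]+)" alt="([^"]*)"')


def extract_images(html):
    """Return a list of (url, name) tuples found in the page text."""
    return IMAGE_PATTERN.findall(html)


# Made-up sample markup, for illustration only.
sample = (
    '<img class="lazy" data-original="https://example.com/a/1.jpg!dta" alt="cat">'
    '<img class="lazy" data-original="https://example.com/a/2.gif" alt="dog?">'
)
for image_url, image_name in extract_images(sample):
    print(image_url, image_name)
```

A single `findall` call replaces the two nested `split` steps, and a page with no matching images simply yields an empty list instead of raising an index error.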

This article is only a personal record. If you find an error, please leave a comment to correct it!
