Preface

Following the previous article on how to use XPath to extract data, this article shows how to apply XPath to pull information from a real web page.

Today's example is downloading Bing wallpapers.

Preparation

To do a good job, you must first sharpen your tools. The same goes for crawlers.

First, install two libraries: lxml and requests.

pip install lxml
pip install requests

Requirements analysis

URL to crawl: https://bing.ioliu.cn/

Packet capture analysis

First, open the browser's developer tools, click any picture to open its HD version, switch to the Network tab to capture requests, and then click the picture's download button.

After you click the download button, you will see that the browser requests an image URL. Opening that request shows that it is the URL of the HD image.

So our first requirement is to get links to all the images.

Getting the image links

Why get the image links?

Otherwise you would have to click the download button on every image to save it locally. If you don't know how to write a crawler, there is no way around that. But this is exactly the kind of work crawlers are for.

Since the browser requests the corresponding HD image every time the download button is clicked, we can collect all the image links and then use Python to simulate the browser, request those links, and download the images.

Here’s the link:

https://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_1920x1080.jpg?imageslim

Pagination

When you open the site, you'll find at least 100 pages of images. So how do you get the images from all of those pages?

It's simple. Again, compare the URL of each page to see what changes.

# Page 2
https://bing.ioliu.cn/?p=2
# Page 3
https://bing.ioliu.cn/?p=3

Having seen how the URL changes, you have probably spotted the pattern: only the value of the p parameter changes from page to page.
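To check that pattern rather than eyeball it, here is a minimal sketch using only the standard library. The helper names page_number and same_apart_from_page are made up for this illustration and are not part of the original script.

```python
from urllib.parse import urlsplit, parse_qs


def page_number(url):
    """Return the `p` query parameter as an int, or None if it is absent."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; use the first `p` if present.
    values = query.get("p")
    return int(values[0]) if values else None


def same_apart_from_page(url_a, url_b):
    """True when two URLs share scheme, host and path (the query is ignored)."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)


second = "https://bing.ioliu.cn/?p=2"
third = "https://bing.ioliu.cn/?p=3"
print(page_number(second), page_number(third))
print(same_apart_from_page(second, third))
```

If the two URLs agree on everything except p, a page link can be built by substituting the page number alone.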

Implementation

Constructing the link for each page

This is just a simple paging function.

The code is as follows:

def get_page_url():
    page_url = []
    # Pages 1 through 147; range() excludes the upper bound.
    for i in range(1, 148):
        url = f'https://bing.ioliu.cn/?p={i}'
        page_url.append(url)
    return page_url

The code above constructs the link for each page and saves the links in page_url.

Getting the image links on each page

Inspecting the page source shows that each image link is stored in data-progressive, an attribute of the img tag. Not so difficult, is it?
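To try the attribute lookup offline, the sketch below runs the same kind of query against a small, hypothetical snippet. It deliberately uses the standard library's xml.etree.ElementTree and its limited XPath subset instead of lxml, so it only works on well-formed markup. The real page is HTML and still needs lxml's etree.HTML. The markup and the extract_links helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page has more nesting and attributes.
SNIPPET = """
<body>
  <div class="item">
    <a href="#"><img data-progressive="http://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_640x480.jpg?imageslim"/></a>
  </div>
  <div class="footer"><img data-progressive="not-a-wallpaper.jpg"/></div>
</body>
"""


def extract_links(markup):
    """Collect data-progressive values from img tags inside div.item."""
    root = ET.fromstring(markup.strip())
    # ElementTree cannot select attributes directly (no /@attr),
    # so select the img elements and read the attribute from each one.
    imgs = root.findall(".//div[@class='item']//img")
    return [img.get("data-progressive") for img in imgs if img.get("data-progressive")]


links = extract_links(SNIPPET)
print(links)
```

Only the img inside the div with class "item" is picked up, which is the same filtering the lxml expression `//div[@class="item"]//img/@data-progressive` performs in a single step.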

But careful readers will notice that this link is not the same as the one we captured at the start. So where exactly do they differ?

Let's compare:

https://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_1920x1080.jpg?imageslim
http://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_640x480.jpg?imageslim

Do you see it? The resolution is different (the thumbnail link also uses http rather than https). Everything else is the same, so we only need to replace the resolution.
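The replacement can be tried on its own before wiring it into the crawler. The sketch below is one possible variant, not the article's code: it uses a regular expression so that any WIDTHxHEIGHT suffix is swapped, not only 640x480. The name to_resolution is hypothetical, and whether every image actually exists at 1920x1080 is an assumption carried over from the article.

```python
import re

# Matches a "_<width>x<height>" suffix that sits right before the file extension.
RESOLUTION = re.compile(r"_(\d+)x(\d+)(?=\.\w+)")


def to_resolution(url, width=1920, height=1080):
    """Return the URL with its resolution suffix replaced; unchanged if none is found."""
    return RESOLUTION.sub(f"_{width}x{height}", url, count=1)


thumb = "http://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_640x480.jpg?imageslim"
print(to_resolution(thumb))
```

A plain str.replace('640x480', '1920x1080') does the same job as long as every thumbnail uses exactly that size.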

The specific code is as follows:

import time

import requests
from lxml import etree


def get_img_url(urls):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
    }
    img_urls = []
    count = 1
    # Only the first three pages are crawled here; drop the slice to crawl them all.
    for url in urls[:3]:
        # Pause between requests to avoid hammering the server.
        time.sleep(5)
        text = requests.get(url, headers=headers).content.decode('utf-8')
        html = etree.HTML(text)
        # The thumbnail link lives in the data-progressive attribute of each img tag.
        img_url = html.xpath('//div[@class="item"]//img/@data-progressive')
        # Swap the thumbnail resolution for the HD one.
        img_url = [i.replace('640x480', '1920x1080') for i in img_url]
        print(f'Fetching links from page {count}')

        img_urls.extend(img_url)
        count += 1
    return img_urls

The code above collects the image links from each page and stores them in img_urls.

Saving the pictures

import os

import requests


def save_img(img_urls):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
    }
    # Make sure the target folder exists before writing to it.
    os.makedirs('../image', exist_ok=True)
    count = 1

    for img_url in img_urls:
        content = requests.get(img_url, headers=headers).content
        print(f'Downloading image {count}')
        # Files are named by their position in the list: 1.jpg, 2.jpg, ...
        with open(f'../image/{count}.jpg', 'wb') as f:
            f.write(content)
        count += 1

The code for saving the pictures is fairly simple: pass in all the collected image links as a parameter and request them one by one.
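The function above names files by a running counter. As an optional variation, the sketch below derives the file name from the link itself, so a saved file can be traced back to its wallpaper. The helpers filename_from_url and target_path are hypothetical names introduced here, and the ../image folder is taken from the code above.

```python
import os
from urllib.parse import urlsplit


def filename_from_url(img_url):
    """Return the last path segment of the URL, without the query string."""
    # urlsplit separates the path from the "?imageslim" query.
    path = urlsplit(img_url).path
    return os.path.basename(path)


def target_path(img_url, folder="../image"):
    """Build the local path where the image would be written."""
    return os.path.join(folder, filename_from_url(img_url))


link = "https://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_1920x1080.jpg?imageslim"
print(filename_from_url(link))
print(target_path(link))
```

Using the original name also means that rerunning the download overwrites the same file instead of shifting the numbering when the list of links changes.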

Finally

That's the end of this post. If you've read this far, I hope it has inspired you, which is why I wanted to share it.

The way ahead is so long without ending, yet high and low I’ll search with my will unbending.

I am book-learning, a person who concentrates on learning. The more you know, the more you don’t know. See you next time for more exciting content!