Preface
Following the previous article on how to use XPath to extract data, this article will show you how to use XPath to scrape information from a real web page.
Today's target: downloading Bing wallpapers.
Preparation
To do good work, one must first sharpen one's tools, and the same goes for crawlers.
First, install two libraries: lxml and requests.
pip install lxml
pip install requests
Requirement analysis
URL to crawl: https://bing.ioliu.cn/
Packet-capture analysis
First, open the developer tools, click any picture to open its full-size HD version, switch to the Network tab to capture requests, and then click the picture's download button.
After you click the download button, you will see the browser request the URL of the image; open that URL and you will find it is the address of the HD image.
So our first requirement is to get links to all the images.
Get image link
Why get the image links?
Surely you don't want to click the download button on every single image to save it locally? Without a crawler you would have no choice, but doing this kind of work is exactly what crawlers are for.
Since every click of the download button makes the browser request the corresponding HD image, it follows that if we collect all the image links, we can use Python to simulate the browser, request those links, and download the images.
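As a warm-up for the downloading step, here is a minimal sketch. The `filename_from_url` helper is a hypothetical illustration (the script later in this article names files by a counter instead); it just derives a local filename from an image URL:

```python
import os
from urllib.parse import urlparse


def filename_from_url(img_url):
    # Strip the query string (?imageslim) and keep only the last
    # path component of the URL as the local filename.
    return os.path.basename(urlparse(img_url).path)


print(filename_from_url(
    'https://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_1920x1080.jpg?imageslim'
))
# → LoonyDook_ZH-CN1879420705_1920x1080.jpg
```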
Here’s the link:
https://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_1920x1080.jpg?imageslim
Paging
When you open the site, you will find there are at least 100 pages of images. So how do you get all of those pages?
It's very simple: compare the URL of each page and look for a pattern.
# page 2: https://bing.ioliu.cn/?p=2
# page 3: https://bing.ioliu.cn/?p=3
After seeing the URLs above, the pattern should be clear: the p query parameter is the page number.
Function implementation
Construct links for each page
In fact, it is a simple page-turning function.
Specific code examples are as follows:
def get_page_url():
    page_url = []
    for i in range(1, 148):
        url = f'https://bing.ioliu.cn/?p={i}'
        page_url.append(url)
    return page_url
The above function constructs the link for every page and stores the links in page_url.
Get the image links on each page
As the screenshot above shows, the link to each image is hidden in data-progressive, an attribute of the img tag. Does that look difficult?
Careful readers will notice, however, that this link is not the same as the one we captured earlier. So where exactly do they differ?
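To see how XPath pulls the link out of the data-progressive attribute, here is a minimal offline sketch; the HTML snippet is a made-up stand-in for the real page markup:

```python
from lxml import etree

# A made-up fragment mimicking the structure of the real page.
sample = '''
<div class="item">
    <img data-progressive="http://h2.ioliu.cn/bing/Example_640x480.jpg?imageslim"/>
</div>
'''

html = etree.HTML(sample)
# Select the data-progressive attribute of every img inside div.item.
links = html.xpath('//div[@class="item"]//img/@data-progressive')
print(links)  # → ['http://h2.ioliu.cn/bing/Example_640x480.jpg?imageslim']
```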
Let's compare the two:
https://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_1920x1080.jpg?imageslim
http://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_640x480.jpg?imageslim
Do you see it? Only the resolution differs; everything else is the same, so all we need to do is replace the resolution.
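In code, swapping the thumbnail resolution for the HD one is a single str.replace call:

```python
low = 'http://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_640x480.jpg?imageslim'
# Replace the thumbnail resolution with the HD one.
hd = low.replace('640x480', '1920x1080')
print(hd)
# → http://h2.ioliu.cn/bing/LoonyDook_ZH-CN1879420705_1920x1080.jpg?imageslim
```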
The specific code is as follows:
import time

import requests
from lxml import etree


def get_img_url(urls):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
    }
    img_urls = []
    count = 1
    for url in urls[:3]:  # only the first 3 pages here; drop the slice to crawl them all
        time.sleep(5)  # be polite: pause between requests
        text = requests.get(url, headers=headers).content.decode('utf-8')
        html = etree.HTML(text)
        img_url = html.xpath('//div[@class="item"]//img/@data-progressive')
        # swap the thumbnail resolution for the HD one
        img_url = [i.replace('640x480', '1920x1080') for i in img_url]
        print(f'Getting links from page {count}')
        img_urls.extend(img_url)
        count += 1
    return img_urls
The above code collects the image links on each page and stores them in img_urls.
Save the picture
def save_img(img_urls):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
    }
    count = 1
    for img_url in img_urls:
        content = requests.get(img_url, headers=headers).content
        print(f'Downloading image {count}')
        with open(f'../image/{count}.jpg', 'wb') as f:
            f.write(content)
        count += 1
The saving code is quite simple: pass in the list of image links, request them one by one, and write each response to disk.
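One detail worth noting: open(f'../image/{count}.jpg', 'wb') fails if the ../image directory does not exist, so it pays to create it first. A small sketch, assuming the same output path as above (adjust to taste):

```python
from pathlib import Path


def ensure_dir(path='../image'):
    # Create the output directory (and any parents) if missing,
    # so that writing the downloaded images cannot fail on it.
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    return out
```

Call ensure_dir() once before save_img(img_urls).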
Final words
That's the end of this post. If you've read this far and it has inspired you, then sharing it was worthwhile.
The way ahead is so long without ending, yet high and low I’ll search with my will unbending.
I am Book-learning, someone who concentrates on learning. The more you know, the more you realize you don't know. See you next time for more great content!