Today I'm sharing a practical crawler project: batch-collecting 4K wallpapers, so you never have to worry about wallpapers again, every one a gem. Don't forget to give it a like before you go. (grin)
First, go to pic.netbian.com.
The website requires login to download pictures, so obtaining cookies is the first problem we face.
Getting the cookie
We use Selenium to get cookies automatically.
Import the required modules first.
from selenium import webdriver
from time import sleep
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Open the homepage of the website.
driver = webdriver.Chrome()
driver.get('https://pic.netbian.com/')
Press F12 to open the developer tools and locate the login link.
wait = WebDriverWait(driver, 10, 0.5)
wait.until(EC.presence_of_element_located(
    (By.XPATH, '/html/body/div[1]/div/div[2]/a[2]')),
    message='Location timeout').click()
Selenium can be slow to load the page here, so we use an explicit wait that checks for the element every 0.5 seconds and throws an exception after 10 seconds.
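If the click right after the presence check ever misfires, a slightly stricter variant (my own tweak, not part of the original tutorial) waits until the element is actually clickable:

# Variation: wait until the link is clickable, not merely present in the DOM.
# poll_frequency=0.5 is Selenium's default, written out here for clarity.
wait = WebDriverWait(driver, 10, poll_frequency=0.5)
wait.until(EC.element_to_be_clickable(
    (By.XPATH, '/html/body/div[1]/div/div[2]/a[2]')),
    message='Location timeout').click()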
Locate the QQ login entry, switch to account/password login, then enter the account and password and click Login. The code for these steps follows.
# Locate and click the QQ login entry
driver.find_element(By.XPATH, '/html/body/div[1]/div/div[2]/div[1]/div[3]/ul/li[1]/a/em').click()
# Locate an element on the login page to confirm the redirect succeeded
wait.until(EC.presence_of_element_located(
    (By.XPATH, '//*[@id="combine_page"]/div[1]/div')),
    message='Location timeout')
# Switch to the login iframe
driver.switch_to.frame('ptlogin_iframe')
# Switch to account/password login
driver.find_element(By.ID, 'switcher_plogin').click()
# Enter the account
username = driver.find_element(By.ID, 'u')
username.send_keys('account')
# Enter the password
password = driver.find_element(By.ID, 'p')
password.send_keys('password')
# Click login
driver.find_element(By.ID, 'login_button').click()
Next, we check whether the page redirected successfully; if so, we grab the page's cookies directly.
wait.until(EC.presence_of_element_located(
    (By.XPATH, '/html/body/div[1]/div/ul/li[2]/a')),
    message='Location timeout')
# Collect the cookies into a name -> value dict for requests
cookies = {}
for item in driver.get_cookies():
    cookies[item['name']] = item['value']
print(cookies)
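Since driving a browser on every run is slow, here is a minimal sketch (my own addition, assuming the cookies stay valid between runs) that saves the cookies to disk and reloads them later:

import json

# After a successful login, persist the cookies
with open('cookies.json', 'w') as f:
    json.dump(cookies, f)

# On a later run, load them back instead of launching Selenium again
with open('cookies.json') as f:
    cookies = json.load(f)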
Image acquisition
Now that we have the cookies, we can start collecting images. The site has many categories; this demo collects the anime category.
Start by importing the required modules.
import requests
import os
import re
from lxml import etree
Create a folder under the current path to save the images.
# Create the folder if it doesn't exist yet
if not os.path.exists('./4ktupian'):
    os.makedirs('./4ktupian')
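As a side note, on Python 3.2+ the check-then-create dance collapses into a single call:

os.makedirs('./4ktupian', exist_ok=True)  # no error if the folder already exists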
Let’s manually download an image and look at the download link.
https://pic.netbian.com/downpic.php?id=25487&classid=66
Comparing a few downloads shows that the part before the ? is identical for every image: id is the image's ID and classid is the category ID (66 for the anime category). The image ID is the only variable, and it can be extracted from the image's detail-page link. OK, the basic logic is clear.
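To make the pattern explicit, here is a tiny hypothetical helper (build_download_url is my own name, nothing the site defines) that assembles a download link from the two IDs:

def build_download_url(img_id, class_id=66):
    # id comes from the detail-page link; classid is the category (66 = anime)
    return f'https://pic.netbian.com/downpic.php?id={img_id}&classid={class_id}'

print(build_download_url(25487))
# -> https://pic.netbian.com/downpic.php?id=25487&classid=66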
The collection code follows; here we crawl 15 pages, 300 images in total.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}
# Pagination template: page 2 is index_2.html, page 3 is index_3.html, ...
url = 'http://pic.netbian.com/4kdongman/index_%d.html'
for page in range(1, 16):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kdongman/'
    else:
        new_url = url % page
    response = requests.get(url=new_url, headers=headers, cookies=cookies)
    # Set the encoding so the response text decodes correctly
    response.encoding = 'gbk'
    page_text = response.text
    # Data parsing
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//ul[@class="clearfix"]/li')
    img_id_list = []
    img_name_list = []
    for li in li_list:
        # The detail-page link contains the image ID
        img_href = li.xpath('./a/@href')[0]
        img_id = re.findall(r'\d+', img_href)[0]
        img_id_list.append(img_id)
        img_name_list.append(li.xpath('./a/img/@alt')[0])
    # Build the full download URL for each image
    img_url_list = []
    for img_id in img_id_list:
        img_url_list.append(f'https://pic.netbian.com/downpic.php?id={img_id}&classid=66')
    # Download the image data and save it
    for i in range(len(img_url_list)):
        img_data = requests.get(url=img_url_list[i], headers=headers, cookies=cookies).content
        filePath = './4ktupian/' + img_name_list[i] + '.jpg'
        with open(filePath, 'wb') as fp:
            fp.write(img_data)
        print('%s downloaded successfully' % img_name_list[i])
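One caveat: the alt text used for the filename can contain characters that are illegal in Windows filenames. A small helper (sanitize is my own addition, not in the original code) strips them before saving:

import re

def sanitize(name):
    # Replace characters Windows forbids in filenames: \ / : * ? " < > |
    return re.sub(r'[\\/:*?"<>|]', '_', name)

filePath = './4ktupian/' + sanitize(img_name_list[i]) + '.jpg'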
Each picture is about 2-4 MB in size and extremely sharp.
The site has a membership system: an annual membership allows 200 original-image downloads per day, while ordinary users get one free original download per day. So although the code above attempts 300 downloads, only about 240 come back as usable originals; the remaining files cannot be opened.
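If you want to skip the unusable files automatically, one option (an assumption about how the quota failure shows up, not something I verified against the site) is to check that the response really is a JPEG before writing it:

resp = requests.get(url=img_url_list[i], headers=headers, cookies=cookies)
filePath = './4ktupian/' + img_name_list[i] + '.jpg'
# A real JPEG starts with the bytes FF D8; anything else is probably an
# error page served once the daily download quota runs out.
if resp.content[:2] == b'\xff\xd8':
    with open(filePath, 'wb') as fp:
        fp.write(resp.content)
else:
    print('%s skipped: not a valid image (quota exhausted?)' % img_name_list[i])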
Running a website isn't easy, so I sponsored the site by buying an annual membership (200 downloads a day). But downloading 200 images by hand isn't realistic, hence this article. Haha, everyone, keep learning your tech.
The code has been fully organized and the collected images packaged; contact me if you need them!