Meme battles never stop on QQ and WeChat, but after one simple crawl of the Doutula meme site, I have the site's entire stock of images. Who can out-meme me now?
The choice of site needs no introduction: Doutula. Let's start with a quick look at how the site is structured.
(Screenshot: page structure of the site)
As the screenshot shows, a single list page contains many image sets, so we need a way to store each set separately (explained in detail below).
A little analysis shows that all the information is present in the page source, so there is no asynchronous loading to worry about. What we do need to handle is pagination, and clicking through a few pages makes the pagination rule obvious.
The pagination URL is easy to construct, and the image links sit right in the page source, so I won't belabor the details; once you know those two things, you can write the crawler.
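As a quick sanity check of that rule, here is a minimal sketch; the URL pattern is the same one the full spider below uses:
# Generate the 30 paginated list-page URLs used in the test crawl below
base = 'https://www.doutula.com/article/list/?page={}'
list_urls = [base.format(page) for page in range(1, 31)]
print(list_urls[0])   # https://www.doutula.com/article/list/?page=1
print(list_urls[-1])  # https://www.doutula.com/article/list/?page=30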
Approach to saving the images
Each set of images goes into its own folder (created with the os module). I name the folder after the last segment of the set's URL, and each file after the last segment of its image URL; the sketch below illustrates the scheme.
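A minimal sketch of that naming scheme; the two URLs here are made-up placeholders, since the real ones only come out of the crawl:
import os

# Hypothetical URLs, for illustration only
set_url = 'https://www.doutula.com/article/detail/1234567'
img_url = 'http://img.doutula.com/production/uploads/image/abc.jpg'

folder = set_url.split('/')[-1]    # '1234567' -> one folder per image set
filename = img_url.split('/')[-1]  # 'abc.jpg' -> filename inside that folder
img_path = os.path.join('J:\\train\\image', folder, filename)
print(img_path)  # J:\train\image\1234567\abc.jpg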
With all that understood, the rest is code (feel free to borrow my parsing approach; I only crawl 30 pages as a test). The full source:
# -*- coding:utf-8 -*-
import os

import requests
from bs4 import BeautifulSoup


class doutuSpider(object):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"}

    def get_url(self, url):
        # Fetch one list page and walk every image set linked on it
        data = requests.get(url, headers=self.headers)
        soup = BeautifulSoup(data.content, 'lxml')
        totals = soup.find_all("a", {"class": "list-group-item"})
        for one in totals:
            sub_url = one.get('href')
            # Folder named after the last segment of the set's URL
            path = os.path.join('J:\\train\\image', sub_url.split('/')[-1])
            os.makedirs(path, exist_ok=True)
            try:
                self.get_img_url(sub_url, path)
            except requests.RequestException:
                pass  # skip a set if its page fails to load

    def get_img_url(self, url, path):
        # Fetch one set's page and download every image in it
        data = requests.get(url, headers=self.headers)
        soup = BeautifulSoup(data.content, 'lxml')
        totals = soup.find_all('div', {'class': 'artile_des'})
        for one in totals:
            img = one.find('img')
            if img is None or not img.get('src'):
                continue
            # src in the page source is protocol-relative, so prepend the scheme
            try:
                self.get_img('http:' + img.get('src'), path)
            except requests.RequestException:
                pass  # skip an image if the download fails

    def get_img(self, url, path):
        # File named after the last segment of the image's URL
        filename = url.split('/')[-1]
        img_path = os.path.join(path, filename)
        img = requests.get(url, headers=self.headers)
        with open(img_path, 'wb') as f:
            f.write(img.content)

    def create(self):
        # Crawl the first 30 list pages as a test
        for count in range(1, 31):
            url = 'https://www.doutula.com/article/list/?page={}'.format(count)
            print('Start downloading page {}'.format(count))
            self.get_url(url)


if __name__ == '__main__':
    doutu = doutuSpider()
    doutu.create()
Results
(Screenshot: the downloaded folders and images)
Conclusion
Overall, the structure of this site is not particularly complex. Feel free to use this as a reference and go crawl some other interesting sites.
Original author: Loading_ Miracle. Original link: https://www.jianshu.com/p/88098728aafd
Welcome to follow my WeChat public account "Code Farming Breakthrough", where I share Python, Java, big data, machine learning, artificial intelligence and other technologies, with a focus on developers' skill growth, career breakthroughs, and shifts in thinking. It is a first stop for 200,000+ developers to recharge and grow, and it will accompany you as you chase your dreams.