Meme battles ("doutu") on QQ and WeChat never end. Rather than collecting memes one by one, just scrape a meme site — with the whole site's images on hand, no one can out-meme you.

The choice of site goes without saying — a doutu (meme) image site. Let's first take a quick look at its structure.

Page information



From the screenshot above, we can see that a single page contains many image sets. So we need to work out how to store each set in its own folder (explained in detail later).

After some analysis, all the information turns out to be right in the page source, so we don't need to worry about asynchronous loading — only about pagination, and the pagination rule is easy to spot by clicking through a few different pages.



The pagination URL is easy to construct, and the image links sit right in the page source, so I won't go into detail. Once you know both, you can write the code to grab the images.
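The pagination pattern described above can be sketched as follows (the `?page=N` query parameter matches the listing URL used later in the code; the page range here is illustrative):

```python
# Sketch: listing pages follow a simple ?page=N pattern.
BASE = 'https://www.doutula.com/article/list/?page={}'

# Build the URLs for the first three listing pages.
page_urls = [BASE.format(n) for n in range(1, 4)]
print(page_urls[0])  # https://www.doutula.com/article/list/?page=1
```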

How the images are stored

Each set of images goes into its own folder (created with the os module). I name the folder after the last segment of the set's URL, and each file after the last segment of its own image URL. See the screenshot below for details.
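The naming scheme above can be sketched like this (the set URL and image URL here are illustrative, not real links from the site):

```python
import os

# A set's folder is named after the last path segment of the set URL;
# each image file keeps the last segment of its own URL.
set_url = 'https://www.doutula.com/article/detail/123456'
img_url = 'http://img.doutula.com/production/uploads/image/abc.jpg'

folder = set_url.split('/')[-1]    # '123456'
filename = img_url.split('/')[-1]  # 'abc.jpg'

# os.path.join avoids hand-building Windows paths with backslashes.
img_path = os.path.join('J:\\train\\image', folder, filename)
```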



With all that understood, the next step is the code (you can follow my parsing approach; I only grab 30 pages as a test). The full source:

# -*- coding:utf-8 -*-
import os

import requests
from bs4 import BeautifulSoup


class DoutuSpider(object):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"}

    def get_url(self, url):
        """Parse one listing page and handle every image set on it."""
        data = requests.get(url, headers=self.headers)
        soup = BeautifulSoup(data.content, 'lxml')
        totals = soup.find_all("a", {"class": "list-group-item"})
        for one in totals:
            sub_url = one.get('href')
            # Name the folder after the last segment of the set URL.
            path = os.path.join('J:\\train\\image', sub_url.split('/')[-1])
            os.makedirs(path, exist_ok=True)
            try:
                self.get_img_url(sub_url, path)
            except Exception:
                pass  # skip a set that fails to parse

    def get_img_url(self, url, path):
        """Collect every image URL inside one set and download it."""
        data = requests.get(url, headers=self.headers)
        soup = BeautifulSoup(data.content, 'lxml')
        totals = soup.find_all('div', {'class': 'artile_des'})
        for one in totals:
            img = one.find('img')
            src = img.get('src') if img else None
            if not src:
                continue
            try:
                self.get_img('http:' + src, path)
            except Exception:
                pass  # skip an image that fails to download

    def get_img(self, url, path):
        """Save one image, named after the last segment of its URL."""
        filename = url.split('/')[-1]
        img_path = os.path.join(path, filename)
        img = requests.get(url, headers=self.headers)
        with open(img_path, 'wb') as f:
            f.write(img.content)

    def create(self):
        for count in range(1, 31):  # only the first 30 pages, as a test
            url = 'https://www.doutula.com/article/list/?page={}'.format(count)
            print('Start downloading page {}'.format(count))
            self.get_url(url)


if __name__ == '__main__':
    doutu = DoutuSpider()
    doutu.create()

Results







Conclusion

Overall, this site's structure is not very complex. You can use it as a reference and go scrape some other interesting sites.

Original author: Loading_Miracle. Original link:

https://www.jianshu.com/p/88098728aafd



Welcome to follow my WeChat public account "Code Farming Breakthrough", where I share Python, Java, big data, machine learning, artificial intelligence, and other technical topics, with a focus on developers' skill growth, career breakthroughs, and mindset shifts — a first stop for 200,000+ developers, growing together toward their dreams.