A word is a fight! What if there’s no chart in the heat of the fight? What if I can’t find the graph I want? Go straight to the Crawl Doutu Internet cafe.
Start with a site that has a lot of memes. How do I climb the graph? With analytics, all information is available on the page, so the most important thing to consider is pagination. By clicking on different pages, you can easily see the pagination rules.
Picture links are in the source code, understand the structure of paging URL, you can go to write code to catch the picture! The source code is as follows:
# -*- coding:utf-8 -*- import requests from bs4 import BeautifulSoup import os class doutuSpider(object): Headers = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit Chrome/52.0.2743.116 "} def get_url(self,url): data = requests.get(url, headers=self.headers) soup = BeautifulSoup(data.content,'lxml') totals = soup.findAll("a", {"class": "list-group-item"}) for one in totals: sub_url = one.get('href') global path path = 'J:\\train\\image'+'\\'+sub_url.split('/')[-1] os.mkdir(path) try: self.get_img_url(sub_url) except: pass def get_img_url(self,url): data = requests.get(url,headers = self.headers) soup = BeautifulSoup(data.content, 'lxml') totals = soup.find_all('div',{'class':'artile_des'}) for one in totals: img = one.find('img') try: sub_url = img.get('src') except: pass finally: urls = 'http:' + sub_url try: self.get_img(urls) except: pass def get_img(self,url): filename = url.split('/')[-1] global path img_path = path+'\\'+filename img = requests.get(url,headers=self.headers) try: with open(img_path,'wb') as f: f.write(img.content) except: pass def create(self): for count in range(1, 31): Url = '{}' https://www.doutula.com/article/list/?page=. The format (count) print 'start download page first {}'. The format (count) self. Get_url (url) of the if __name__ == '__main__': doutu = doutuSpider() doutu.create()Copy the code
When you’re done, you’ll have a folder where you can get the emojis you just saw on the web. Go ahead and compete with your friends.