Today I’ve compiled 32 Python crawler projects for you. I put this list together because crawlers are quick and simple to get started with, which makes them great for newcomers building confidence. All links point to GitHub. Have fun!
WechatSogou [1]- WeChat official account crawler. A crawler interface for WeChat official accounts based on Sogou WeChat Search, extensible to a general crawler based on Sogou Search. Results come back as a list, where each item is a dictionary of one account’s details.
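For a sense of that interface, here is a minimal usage sketch, assuming the project’s wechatsogou package exposes a WechatSogouAPI class with a search_gzh() method as its README describes; the dictionary keys are illustrative and worth checking against the repo:

```python
import wechatsogou

# Assumes the wechatsogou package exposes WechatSogouAPI with a
# search_gzh() method, per the project's README.
ws_api = wechatsogou.WechatSogouAPI()

# Search official accounts by keyword; each result is a dict of
# account details (name, WeChat ID, profile URL, etc.).
for account in ws_api.search_gzh('python'):
    print(account.get('wechat_name'), account.get('wechat_id'))
```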
DouBanSpider [2]- Douban Reading crawler. It can crawl every book under a Douban Reading tag and store them in Excel ranked by rating, which makes it easy to filter and search, for example for highly rated books with more than 1,000 ratings; books can also be written to separate Excel sheets by topic. It disguises itself as a browser via the User-Agent header and adds random delays to better imitate browser behavior and avoid being blocked, a trick sketched below.
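The User-Agent-plus-random-delay trick is easy to show; a minimal sketch with requests, where the agent strings and URL are purely illustrative:

```python
import random
import time

import requests

# Illustrative desktop browser User-Agent strings to rotate through.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.0 Safari/605.1.15',
]

def polite_get(url):
    """Fetch a page while posing as a browser, after a random delay."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    time.sleep(random.uniform(1, 3))  # random delay between requests
    return requests.get(url, headers=headers, timeout=10)

resp = polite_get('https://book.douban.com/')  # illustrative URL
print(resp.status_code)
```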
Zhihu_spider [3]- Zhihu crawler. This project crawls Zhihu user profiles and the follow-relationship topology; it is built on the scrapy framework and stores data in MongoDB, a pattern sketched below.
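Storing scrapy output in MongoDB usually comes down to an item pipeline; a minimal sketch assuming pymongo, with illustrative database and collection names (it would be enabled via ITEM_PIPELINES in settings.py):

```python
import pymongo

class MongoPipeline:
    """A minimal scrapy item pipeline that writes items to MongoDB.

    Database and collection names are illustrative, not the project's.
    """

    def open_spider(self, spider):
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['zhihu']['users']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Persist each scraped item as one MongoDB document.
        self.collection.insert_one(dict(item))
        return item
```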
Bilibili-user [4]- Bilibili user crawler. Total records: 20,119,918. Crawled fields: user ID, nickname, gender, avatar, level, experience points, follower count, birthday, location, registration time, signature, etc. A Bilibili user data report is generated after the crawl.
SinaSpider [5]- Sina Weibo crawler. It mainly crawls Weibo users’ profiles, posts, followers, and followees. The code logs in with Sina Weibo cookies and rotates multiple accounts to sidestep Sina’s anti-crawler measures (sketched below). Built mainly on the scrapy framework.
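Rotating multiple logged-in accounts can be sketched outside scrapy too; a minimal requests-based version with a hypothetical cookie pool (SinaSpider wires the same idea into scrapy requests):

```python
import random

import requests

# Hypothetical pool of cookies captured from several logged-in accounts;
# the cookie names and values are placeholders.
COOKIE_POOL = [
    {'SUB': 'cookie-from-account-1'},
    {'SUB': 'cookie-from-account-2'},
]

def fetch_with_random_account(url):
    """Send each request with cookies from a randomly chosen account."""
    session = requests.Session()
    session.cookies.update(random.choice(COOKIE_POOL))
    return session.get(url, timeout=10)
```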
Distribute_crawler [6]- Distributed novel-download crawler. A distributed web crawler built with scrapy, Redis, MongoDB, and Graphite: a MongoDB cluster for underlying storage, Redis for distributing work, and Graphite for displaying crawler status. It targets one novel site.
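The Redis side of such a setup is essentially a shared URL queue that many workers pop from; a minimal worker sketch, where the queue key and addresses are illustrative:

```python
import redis
import requests

# Shared Redis instance coordinating all crawler workers; the address
# and the 'tasks' queue key are illustrative.
r = redis.Redis(host='localhost', port=6379)

def worker():
    """Pop URLs from the shared queue so many machines crawl in parallel."""
    while True:
        url = r.lpop('tasks')
        if url is None:
            break  # queue drained
        html = requests.get(url.decode(), timeout=10).text
        # ... parse the chapter, store it to MongoDB, and push any
        # newly discovered URLs back: r.rpush('tasks', next_url)
```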
CnkiSpider [7]- CNKI crawler. After setting the search conditions, run src/CnkiSpider.py to capture the data; results are stored under /data, and the first line of each data file holds the field names.
LianJiaSpider [8]- Lianjia crawler. Crawls second-hand housing transaction records for the Beijing area from Lianjia. Covers all the code from the author’s Lianjia crawler articles, including the Lianjia simulated-login code.
Scrapy_jingdong [9]- JD.com crawler. A scrapy-based crawler for the JD.com website; results are saved in CSV format.
Qq-groups-spider [10]- QQ Groups spider. Batch-captures QQ group information, including group name, group number, member count, group owner, and group description, and finally generates XLS(X)/CSV result files.
Wooyun_public [11]- WooYun crawler. A crawler and search tool for WooYun’s public vulnerability reports and knowledge-base articles. The full list of public vulnerabilities and the text of each report are stored in MongoDB, about 2 GB in all; crawling the whole site with all text and images for offline querying takes about 10 GB of space and 2 hours (on 10 Mbps telecom bandwidth). The complete knowledge base takes about 500 MB. Vulnerability search uses Flask as the web server and Bootstrap for the front end.
Spider [12]- Hao123 web crawler. Using hao123 as the entry page, it follows outbound links, collecting URLs and recording each site’s internal and external link counts along with information such as page titles. Tested on 32-bit Windows 7; it currently collects roughly 100,000 records every 24 hours. A simplified sketch of the link bookkeeping follows below.
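The internal/external link bookkeeping is straightforward with BeautifulSoup; a simplified single-page sketch of the idea (the project’s own code may differ):

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def classify_links(page_url):
    """Return (title, internal_count, external_count) for one page."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    host = urlparse(page_url).netloc
    internal = external = 0
    for a in soup.find_all('a', href=True):
        # Resolve relative links, then compare hostnames.
        link_host = urlparse(urljoin(page_url, a['href'])).netloc
        if link_host == host:
            internal += 1
        else:
            external += 1
    title = soup.title.string if soup.title else ''
    return title, internal, external

print(classify_links('https://www.hao123.com'))
```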
Findtrip [13]- Flight ticket crawler (Qunar and Ctrip). Findtrip is a scrapy-based flight crawler that currently integrates China’s two major flight websites (Qunar + Ctrip).
163Spider [14]- NetEase client content crawler based on requests, MySQLdb, and torndb.
Doubanspiders [15]- A collection of Douban crawlers covering films, books, groups, albums, “things”, etc.
QQSpider [16]- QQ Zone crawler, covering journals, shuoshuo posts, personal information, etc.; it can capture 4 million records a day.
Baidu-music-spider [17]- Baidu MP3 full-site crawler, using Redis to support resuming interrupted crawls (sketched below).
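Redis-backed resuming usually means persisting the set of visited URLs, so a restarted crawler skips what it already fetched; a minimal sketch with an illustrative key name:

```python
import redis

r = redis.Redis()

def should_crawl(url):
    """Skip URLs already seen, so an interrupted crawl can resume.

    'seen_urls' is an illustrative Redis set key; because the set
    survives process restarts, the crawler picks up where it left off.
    """
    return r.sadd('seen_urls', url) == 1  # 1 means the URL was new
```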
Tbcrawler [18]- Taobao and Tmall crawler. It captures page information by search keyword or item ID; data is stored in MongoDB.
Stockholm [19]- A stock data crawler and stock-picking strategy testing framework. Captures all Shanghai and Shenzhen stock market data for a chosen date range. Supports defining stock selection strategies with expressions (see the toy sketch below). Supports multithreading. Saves data to JSON and CSV files.
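Expression-driven stock picking boils down to filtering records with a predicate; a toy sketch with made-up fields and thresholds (Stockholm’s actual expression syntax is documented in its README):

```python
# Hypothetical stock records; field names are illustrative.
stocks = [
    {'code': '600000', 'pe': 8.2, 'volume': 1_200_000},
    {'code': '000001', 'pe': 35.0, 'volume': 300_000},
]

def strategy(s):
    """A made-up selection rule: low P/E and decent volume."""
    return s['pe'] < 15 and s['volume'] > 500_000

picked = [s['code'] for s in stocks if strategy(s)]
print(picked)  # ['600000']
```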
BaiduyunSpider [20]- Baidu Cloud disk crawler.
Spider [21]- Social data crawler. Supports Weibo, Zhihu, and Douban.
Proxy pool [22]- Python proxy IP pool.
Music-163 [23]- Crawls the comments on every song in NetEase Cloud Music.
Jandan_spider [24]- Crawls girl pictures from Jandan (jandan.net).
CnblogsSpider [25]- Cnblogs list-page crawler.
Spider_smooc [26]- Crawls videos from imooc.com.
CnkiSpider [27]- CNKI crawler.
KnowsecSpider2 [28]- Knownsec crawler challenge.
Aiss-spider [29]- Image crawler for the Aiss app.
SinaSpider [30]- Uses dynamic IPs to defeat Sina’s anti-crawler mechanism and capture content quickly.
CSDN-Spider [31]- Crawls blog posts on CSDN.
ProxySpider [32]- Crawls proxy IPs from Xici (xicidaili) and verifies proxy availability; a minimal check is sketched below.
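Verifying a proxy is typically just a timed test request sent through it; a minimal sketch, using httpbin.org as a convenient echo endpoint:

```python
import requests

def is_alive(proxy, timeout=5):
    """Check a 'host:port' proxy by fetching a known page through it."""
    proxies = {'http': f'http://{proxy}', 'https': f'http://{proxy}'}
    try:
        resp = requests.get('http://httpbin.org/ip',
                            proxies=proxies, timeout=timeout)
        return resp.ok
    except requests.RequestException:
        return False  # timeout, refused connection, etc.

print(is_alive('127.0.0.1:8080'))  # illustrative proxy address
```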
Source: NoBB Development Circle