Today we look at 6 open source crawler projects that can help you crawl just about anything: Weibo, Bilibili, Zhihu, and other sites.
Do not use these projects for illegal or commercial activities; they are for research only!
01
Weibo crawler
This open source program can continuously crawl the data of one or more Sina Weibo users (celebrity accounts, for example) and write the results to a file or database. The output covers almost all of a user's Weibo data, including both profile information and post information.
Address: https://github.com/dataabc/weiboSpider
Crawl results can be written to files and databases. The supported output targets are:
- TXT file
- CSV file
- JSON file
- MySQL database
- MongoDB database
- SQLite database
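To make the idea of several output targets concrete, here is a minimal sketch that writes the same records to a CSV file, a JSON file, and a SQLite database using only the Python standard library. It is not the project's own code: the record fields (`id`, `text`, `reposts`), the table name `posts`, and the function names are all assumptions made for illustration.

```python
import csv
import json
import sqlite3

# Hypothetical field names; the real project defines its own schema.
FIELDS = ["id", "text", "reposts"]


def write_csv(records, path):
    """Write records to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)


def write_json(records, path):
    """Write records to a JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def write_sqlite(records, path):
    """Insert records into a SQLite table, replacing rows with the same id."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS posts "
            "(id TEXT PRIMARY KEY, text TEXT, reposts INTEGER)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO posts (id, text, reposts) "
            "VALUES (:id, :text, :reposts)",
            records,
        )
        conn.commit()
    finally:
        conn.close()
```

Calling all three writers with the same list of dictionaries gives you the "file plus database" behaviour described above; using the post id as the primary key means that re-running a crawl updates rows instead of duplicating them.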
The program can also download the images and videos attached to Weibo posts. The downloadable files are:
- Original images in original posts
- Original images in reposts
- Videos in original posts
- Videos in reposts
- Videos from Live Photos in original posts
- Videos from Live Photos in reposts
First, edit the config.json file, then run the crawler. The program automatically creates a weibo folder, and all the posts we crawl are stored there.
Inside that folder, the program creates a subfolder named after the Weibo user, which holds all the crawl results for that user. The subfolder contains a CSV file, a TXT file, a JSON file, an img folder, and a video folder. The img folder stores downloaded images, and the video folder stores downloaded videos. If you enable the database option, the same information is also written to the database, as described in the project's database setup section.
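Once a crawl has finished, a quick way to check what landed on disk is to walk the per-user folder. The sketch below assumes the layout described above (CSV, TXT, and JSON result files next to `img` and `video` subfolders); the lowercase folder names and the function name `summarize_user_folder` are assumptions for illustration, not part of the project.

```python
from pathlib import Path

RESULT_SUFFIXES = {".csv", ".txt", ".json"}


def summarize_user_folder(user_dir):
    """Return the result files plus image and video counts for one user folder."""
    user_dir = Path(user_dir)
    summary = {"result_files": [], "images": 0, "videos": 0}

    # Result files sit directly inside the user folder.
    for path in sorted(user_dir.iterdir()):
        if path.is_file() and path.suffix.lower() in RESULT_SUFFIXES:
            summary["result_files"].append(path.name)

    # Media files are counted recursively; a missing folder counts as zero.
    for folder_name, key in (("img", "images"), ("video", "videos")):
        folder = user_dir / folder_name
        if folder.is_dir():
            summary[key] = sum(1 for p in folder.rglob("*") if p.is_file())

    return summary
```

Pass it the path of one user's subfolder inside the weibo folder and it returns a small dictionary you can print or log.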
02
Python crawler tutorial
A Python crawler tutorial series that takes you from 0 to 1. It covers packet capture in the browser and in mobile apps with tools such as Fiddler and mitmproxy; the modules crawlers commonly rely on, such as Requests, BeautifulSoup, Selenium, Appium, and Scrapy; CAPTCHA recognition; using MySQL and MongoDB from Python; multi-threaded and multi-process crawlers; reverse engineering CSS-based anti-crawler obfuscation and JavaScript; distributed crawlers; and hands-on crawler projects.
Address: https://github.com/wistbean/learn_python3_spider
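As a taste of the multi-threaded crawling topic, here is a minimal sketch built on the standard library's `concurrent.futures` and `urllib` rather than on Requests or Scrapy. It is not taken from the tutorial; the function names `fetch` and `crawl` are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(urls, fetch_fn=fetch, max_workers=4):
    """Fetch all URLs concurrently and return {url: body or exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be labelled.
        futures = {pool.submit(fetch_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep the error so one bad page does not abort the whole crawl.
                results[url] = exc
    return results
```

Passing the fetch function as a parameter keeps the thread pool logic separate from the download logic, so you can swap in a Requests-based fetcher or a stub for testing.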
03
Crawler collection
This open source project collects all kinds of crawlers, covering Bilibili, Cnblogs, Baidu Baike, the BUPT forum, Baidu Netdisk, Boss Zhipin, Beike, Douban, CSDN, Douyin, GitHub, JD.com, Zhihu, Lagou, Lianjia, WeChat official accounts, NetEase Cloud Music, and more. Almost any site you can think of, in China or abroad, is covered. Before writing a crawler of your own, check here first for an existing open source one.
Address: https://github.com/facert/awesome-spider
04
Intelligent crawler platform
This open source project is a highly flexible, configurable crawler platform that lets you define crawlers as flow charts. You can configure all kinds of crawlers on it without writing them by hand.
Address: https://gitee.com/ssssssss-team/spider-flow
Next, configure the variables and parameters of each node in the flow chart, then click run to crawl the data you want.
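The idea of defining a crawler as a flow chart can be sketched in a few lines of Python: each node names an action, its parameters, and the next node, and a small runner walks the chain while passing a shared context along. This is only a conceptual illustration, not the platform's actual format or API; `run_flow`, `set_variable`, `extract`, and the node layout are all hypothetical.

```python
import re


def run_flow(nodes, start, context):
    """Walk the flow from `start`, letting each node update the context."""
    current = start
    visited = set()
    while current is not None:
        if current in visited:
            raise ValueError(f"cycle detected at node {current!r}")
        visited.add(current)
        node = nodes[current]
        context = node["action"](context, **node.get("params", {}))
        current = node.get("next")
    return context


def set_variable(ctx, name, value):
    """Node action: store a value in the context under `name`."""
    return {**ctx, name: value}


def extract(ctx, source, pattern, target):
    """Node action: run a regular expression over ctx[source], save matches."""
    return {**ctx, target: re.findall(pattern, ctx[source])}


# A two-node flow: set a variable, then extract data from it.
flow = {
    "start": {
        "action": set_variable,
        "params": {"name": "html", "value": "<b>a</b><b>b</b>"},
        "next": "titles",
    },
    "titles": {
        "action": extract,
        "params": {"source": "html", "pattern": r"<b>(.*?)</b>", "target": "items"},
        "next": None,
    },
}
```

Running `run_flow(flow, "start", {})` returns a context whose `items` entry holds the extracted strings. A visual editor, in this picture, is just a friendlier way to build the `flow` dictionary.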
05
Java crawler
Spiderman is an open source web data extraction tool written in Java. It fetches the web pages you specify and extracts the useful data from them.
Spiderman mainly uses techniques such as XPath and regular expressions to extract the actual data.
Address: https://gitee.com/l-weiwei/spiderman
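To show how XPath and regular expressions work together during extraction, here is a minimal sketch in Python (Spiderman itself is written in Java, so this illustrates the technique, not the tool). It uses the limited XPath subset supported by the standard library's `xml.etree.ElementTree`, which only parses well-formed markup; the sample page and the function name `extract_items` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<div class="item"><a href="/p/1">Widget</a><span class="price">Price: 19.90 USD</span></div>
<div class="item"><a href="/p/2">Gadget</a><span class="price">Price: 5 USD</span></div>
</body></html>"""

# First integer or decimal number in a string.
PRICE_RE = re.compile(r"(\d+(?:\.\d+)?)")


def extract_items(page):
    """Select item nodes with XPath, then clean the price text with a regex."""
    root = ET.fromstring(page)
    items = []
    for div in root.findall(".//div[@class='item']"):
        link = div.find("./a")
        price_text = div.findtext("./span[@class='price']", default="")
        match = PRICE_RE.search(price_text)
        items.append({
            "name": link.text,
            "url": link.get("href"),
            "price": float(match.group(1)) if match else None,
        })
    return items
```

XPath does the structural work of locating the right nodes, and the regular expression handles the final cleanup inside a text value. Real-world HTML is often not well formed, so a production crawler would typically use a more tolerant parser.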
06
Crawler compendium
This open source project bundles crawlers for a wide range of websites and e-commerce data. It includes: Taobao products, WeChat official accounts, Dianping, recruitment websites, Xianyu, Ali tasks, Cnblogs (with Scrapy), Weibo, Baidu Tieba, Douban Movies, Ibaotu, Quanjing, Douban Music, a provincial food and drug administration, Sohu News, text collection for machine learning, FOFA asset collection, a car site, the National Bureau of Statistics, Baidu keyword indexing counts, a generic spider directory, and Douban film reviews.
Address: https://gitee.com/AJay13/ECommerceCrawlers