Six outstanding open source crawler projects

Today we look at six open source crawler projects. They can help you crawl Weibo, Bilibili (B station), Zhihu, and many other sites.

Do not use these projects for illegal commercial activities. They are for research purposes only!

Weibo crawler

This open source program continuously crawls the data of one or more Sina Weibo users, such as Li Wendi and Wudui Fan, and writes the results to a file or database. The output covers almost all of a user's Weibo data, including both user information and post information.

Address: https://github.com/dataabc/weiboSpider

Crawl results can be written to files and databases. The supported output targets are:

TXT file
CSV file
JSON file
MySQL database
MongoDB database
SQLite database

The program can also download the images and videos attached to Weibo posts. The downloadable files are:

Original images from original posts
Original images from reposted posts
Videos from original posts
Videos from reposted posts
Live Photo videos from original posts
Live Photo videos from reposted posts

First, modify the config.json file, then run the crawler. The program automatically creates a weibo folder, and all the posts we crawl are stored there.

Inside that folder, the program creates a subfolder named after the Weibo account, which holds all crawl results for that account. The subfolder contains a CSV file, a TXT file, a JSON file, an img folder, and a video folder. The img folder stores downloaded images, and the video folder stores downloaded videos. If you enable the database option, the same information is also written to the database, as described in the project's Setup Database section.
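The folder layout described above can be checked with a short script. This is only a sketch and not part of weiboSpider: the function name summarize_crawl_output is hypothetical, and the folder names weibo, img, and video are assumptions taken from the description above, so adjust them to match what the program actually produces on your machine.

```python
from pathlib import Path


def summarize_crawl_output(root: Path) -> dict:
    """Return, per account subfolder, the result files and media counts found."""
    summary = {}
    if not root.is_dir():
        return summary
    for account_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Result files written directly into the account folder (CSV, TXT, JSON).
        result_files = sorted(
            f.name
            for f in account_dir.iterdir()
            if f.is_file() and f.suffix.lower() in {".csv", ".txt", ".json"}
        )
        # Count downloaded media; a missing subfolder simply counts as zero.
        media = {}
        for sub in ("img", "video"):  # assumed folder names
            media_dir = account_dir / sub
            if media_dir.is_dir():
                media[sub] = sum(1 for f in media_dir.rglob("*") if f.is_file())
            else:
                media[sub] = 0
        summary[account_dir.name] = {"results": result_files, **media}
    return summary


if __name__ == "__main__":
    # "weibo" is the assumed name of the output folder.
    for account, info in summarize_crawl_output(Path("weibo")).items():
        print(account, info)
```

Running it from the directory that contains the output folder prints one line per crawled account, which is a quick way to confirm that images and videos were actually downloaded.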

Python crawler tutorial

A Python crawler tutorial series that takes you from zero to one. It covers packet capture for browsers and mobile apps with tools such as Fiddler and mitmproxy; the modules crawlers commonly rely on, such as Requests, BeautifulSoup, Selenium, Appium, and Scrapy; CAPTCHA recognition; using MySQL and MongoDB from Python; multi-threaded and multi-process crawlers; reverse engineering CSS-based and JS-based anti-crawler measures; distributed crawlers; and hands-on crawler projects.
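To give a feel for the most basic step such a tutorial starts from, here is a minimal sketch of extracting links from a page. It is not taken from the tutorial: the tutorial works with Requests and BeautifulSoup, while this sketch uses only the Python standard library so that it runs without installing anything. The names LinkCollector and extract_links and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list:
    """Parse an HTML string and return the links it contains, in order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real crawler would download this first.
sample = '<p><a href="/page/2">next</a> <a href="https://example.org/">out</a></p>'
print(extract_links(sample, "https://example.com/page/1"))
```

A real crawler adds the download step, politeness delays, and error handling on top of this, which is what the libraries listed above are for.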

Address: https://github.com/wistbean/learn_python3_spider

Crawler collection

This open source project collects all kinds of crawlers, including ones for Bilibili, Cnblogs (Blog Park), Baidu Baike, Beiyou, Baidu Netdisk, Boss Zhipin, Beike (Shell), Douban, CSDN, Douyin, GitHub, JD.com (Jingdong), Zhihu, Lagou, Lianjia, WeChat official accounts, NetEase Cloud Music, and more. It covers almost every site you can think of, in China and abroad. Before writing a crawler of your own, check here first for an existing open source one.

Address: https://github.com/facert/awesome-spider

Intelligent crawler platform

This open source project is a highly flexible, configurable crawler platform that defines crawlers as flow charts. You can configure a wide range of crawlers on it.

Address: https://gitee.com/ssssssss-team/spider-flow

You configure variables and parameters on the flow chart, then click to run the crawl and fetch the data you want.

Java crawler

Spiderman is an open source web data extraction tool written in Java. It fetches specified web pages and extracts useful data from them.

Spiderman mainly uses techniques such as XPath and regular expressions to extract the actual data.
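The general idea of combining XPath and regular expressions can be sketched as follows. This is not Spiderman's API or code; it is a Python illustration of the technique only, and the sample page, the function extract_items, and the pattern PRICE_RE are all made up. Note that xml.etree.ElementTree supports only a subset of XPath and needs well-formed markup, so a real HTML page would usually call for a more tolerant parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this illustration.
PAGE = """
<html><body>
  <div class="item"><h2>Widget A</h2><span class="price">Price: 19.90 USD</span></div>
  <div class="item"><h2>Widget B</h2><span class="price">Price: 5 USD</span></div>
</body></html>
"""

# Matches an integer or decimal number inside free-form text.
PRICE_RE = re.compile(r"(\d+(?:\.\d+)?)")


def extract_items(page: str) -> list:
    """Return (name, price) tuples; price is None when no number is found."""
    root = ET.fromstring(page)
    items = []
    # XPath step: select every <div class="item"> anywhere in the tree.
    for node in root.findall(".//div[@class='item']"):
        name = node.findtext("h2")
        price_text = node.findtext("span[@class='price']") or ""
        # Regex step: pull the numeric part out of the price text.
        match = PRICE_RE.search(price_text)
        price = float(match.group(1)) if match else None
        items.append((name, price))
    return items


print(extract_items(PAGE))
```

The division of labor is the point: XPath locates the element that holds the value, and the regular expression cleans up the text inside it.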

Address: https://gitee.com/l-weiwei/spiderman

Crawler compendium

This open source project contains data crawlers for a variety of websites and e-commerce platforms. It includes: Taobao products, WeChat official accounts, Dianping, job sites, Xianyu, Ali tasks, Scrapy blog crawlers, Weibo, Baidu Tieba, Douban Movies, Ibaotu, Quanjing, Douban Music, a provincial food and drug administration, Sohu News, text collection for machine learning, FOFA asset collection, car data, the National Bureau of Statistics, Baidu keyword index counts, spider generic directories, and Douban film reviews.

Address: https://gitee.com/AJay13/ECommerceCrawlers
