Here are 19 common Python crawlers I just found on GitHub.
1. WeChat Official Account Crawler
GitHub: github.com/Chyroc/Wech… A crawler interface for WeChat official accounts based on Sogou WeChat Search; it can be extended into a general crawler for Sogou Search. Results are returned as a list, where each item is a dictionary of details about one account.
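The repo's code isn't shown here, but the return shape it describes is easy to picture. Below is a minimal sketch of the same idea with requests and BeautifulSoup; the Sogou URL parameters and CSS selectors are assumptions for illustration, not the project's actual code.

```python
import requests
from bs4 import BeautifulSoup

def search_official_accounts(keyword):
    # Sogou's WeChat search; type=1 searches accounts rather than articles.
    # The URL pattern and selectors below are illustrative assumptions.
    resp = requests.get(
        "https://weixin.sogou.com/weixin",
        params={"type": 1, "query": keyword},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    for box in soup.select("ul.news-list2 > li"):  # placeholder selector
        name = box.select_one("p.tit")
        wxid = box.select_one("p.info label")
        if name:
            results.append({
                "name": name.get_text(strip=True),
                "wechat_id": wxid.get_text(strip=True) if wxid else None,
            })
    return results  # a list of dicts, one per account
```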
2. Douban Reading Crawler
GitHub: github.com/lanbing510/… Crawls every book under a Douban Reading tag and stores them in Excel ranked by rating, which makes filtering and searching easy, e.g. for highly rated books with more than 1,000 ratings; books under different tags can be stored in separate Excel sheets. It sets a User-Agent header to pose as a browser and adds random delays to better imitate browser behavior and avoid getting blocked (a sketch of these two tricks follows).
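A minimal sketch of the two anti-blocking measures mentioned above, assuming plain requests; the header string and delay range are arbitrary choices:

```python
import random
import time
import requests

# A browser-like User-Agent so the server doesn't see the default
# "python-requests/x.y" identifier.
HEADERS = {"User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/120.0 Safari/537.36")}

def fetch_pages(urls):
    for url in urls:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        yield resp.text
        # Random delay between requests to mimic a human reader.
        time.sleep(random.uniform(1, 3))
```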
3. Zhihu Crawler
GitHub: github.com/LiuRoy/zhih… This project crawls Zhihu user information and the follow-relationship topology between users, using the Scrapy crawler framework and MongoDB for storage.
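A minimal sketch of how a Scrapy-to-MongoDB pipeline typically looks; the database, collection, and field names here are illustrative, not taken from the repo:

```python
import pymongo

class MongoPipeline:
    """Scrapy item pipeline that upserts crawled users into MongoDB."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.users = self.client["zhihu"]["users"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert keyed on the user's token so re-crawls update
        # existing records instead of duplicating them.
        self.users.update_one(
            {"url_token": item["url_token"]},
            {"$set": dict(item)},
            upsert=True,
        )
        return item
```

The pipeline would be switched on through Scrapy's standard ITEM_PIPELINES setting.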
4. Bilibili User Crawler
GitHub: github.com/airingursb/… Total records: 20,119,918. Crawled fields: user ID, nickname, gender, avatar, level, experience points, number of fans, birthday, address, registration time, signature, etc. A Bilibili user data report is generated after the crawl.
5. Sina Weibo Crawler
GitHub: github.com/LiuXingMing… Mainly crawls Sina Weibo users' personal information, posts, fans, and followees. The code logs in by obtaining Sina Weibo cookies and rotates multiple accounts to evade Sina's anti-crawler measures. It mainly uses the Scrapy crawler framework.
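A minimal sketch of the multi-account idea, assuming each account's login cookies have already been obtained; the cookie names and values are placeholders:

```python
import itertools
import requests

# One cookie dict per logged-in account (placeholder values).
COOKIE_POOL = [
    {"SUB": "cookie-for-account-1"},
    {"SUB": "cookie-for-account-2"},
    {"SUB": "cookie-for-account-3"},
]
_rotation = itertools.cycle(COOKIE_POOL)

def fetch(url):
    # Round-robin across accounts so no single one hits rate limits.
    cookies = next(_rotation)
    return requests.get(url, cookies=cookies, timeout=10)
```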
6. Distributed Novel-Download Crawler
GitHub: github.com/gnemoug/dis… A distributed web crawler implemented with Scrapy, Redis, MongoDB, and Graphite: storage is backed by a MongoDB cluster, distribution is handled through Redis, and crawler status is displayed via Graphite. It targets a single novel site.
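The Redis part of such a design usually boils down to a shared work queue. A minimal sketch, assuming redis-py and an illustrative queue name:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue(urls):
    # Producer side: push chapter/page URLs onto a shared list.
    for url in urls:
        r.lpush("novel:todo", url)

def worker(handle):
    # Each worker process (on any machine) blocks until a URL
    # is available, so scaling out is just starting more workers.
    while True:
        _key, url = r.brpop("novel:todo")
        handle(url.decode())
```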
7. CNKI Crawler
GitHub: github.com/yanzhou/Cnk… After setting the search conditions, run src/CnkiSpider.py to capture data. The crawled data is stored under /data, and the first field of each data file is the item's name.
8. Lianjia (Homelink) Crawler
GitHub: github.com/lanbing510/… Crawls second-hand housing transaction records for the Beijing area from Lianjia. Includes all the code from the accompanying Lianjia crawler article, including the simulated-login code.
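Simulated login with requests usually means posting credentials through a Session that then carries the cookies on every later request. A minimal sketch; the endpoint and form fields are placeholders, not Lianjia's actual login flow:

```python
import requests

def login(username, password):
    session = requests.Session()
    # POST credentials once; the session stores the returned cookies.
    session.post(
        "https://example.com/login",  # placeholder endpoint
        data={"user": username, "password": password},
    )
    return session

session = login("me", "secret")
# Later requests through the same session are authenticated.
page = session.get("https://example.com/chengjiao/")
```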
9. JD Crawler
GitHub: github.com/taizilongxu… A Scrapy-based crawler for the JD.com site; results are saved in CSV format.
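Scrapy can export CSV without any custom code through its feed exports. A minimal sketch assuming Scrapy >= 2.1 for the FEEDS setting; the start URL and selectors are illustrative:

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "jd_products"
    start_urls = ["https://example.com/list"]  # placeholder listing page
    custom_settings = {
        # Feed export: write every yielded item to a CSV file.
        "FEEDS": {"products.csv": {"format": "csv", "encoding": "utf8"}},
    }

    def parse(self, response):
        for row in response.css("li.gl-item"):  # placeholder selectors
            yield {
                "title": row.css("div.p-name a::attr(title)").get(),
                "price": row.css("div.p-price i::text").get(),
            }
```

The same export can also be triggered from the command line with `scrapy crawl jd_products -o products.csv`.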
10. QQ Group Crawler
GitHub: github.com/caspartse/Q… Captures QQ group information in batches, including group name, group number, member count, group owner, group description, etc., and finally generates XLS(X)/CSV result files.
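Generating the XLS(X)/CSV result files is the mechanical part. A minimal sketch using openpyxl and the standard csv module, with illustrative field names:

```python
import csv
from openpyxl import Workbook

FIELDS = ["group_name", "group_number", "member_count", "owner", "description"]

def save_results(groups, basename="qq_groups"):
    # .xlsx via openpyxl: header row, then one row per group dict.
    wb = Workbook()
    ws = wb.active
    ws.append(FIELDS)
    for g in groups:
        ws.append([g.get(f) for f in FIELDS])
    wb.save(f"{basename}.xlsx")

    # .csv via the standard library.
    with open(f"{basename}.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(groups)
```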
11. WooYun Crawler
GitHub: github.com/hanc00l/woo… A crawler and search tool for WooYun's publicly disclosed vulnerabilities and knowledge base. The full list of public vulnerabilities and the text of each one are stored in MongoDB, about 2 GB in total. Crawling the whole site, including all text and images for offline queries, takes about 10 GB of space and 2 hours (on 10M telecom bandwidth); crawling the whole knowledge base takes about 500 MB. Flask serves as the web server and Bootstrap as the front end for vulnerability search.
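The search side of such a stack can be very small. A minimal sketch of a Flask route doing a regex search against the MongoDB collection; database and field names are illustrative:

```python
from flask import Flask, jsonify, request
import pymongo

app = Flask(__name__)
bugs = pymongo.MongoClient("mongodb://localhost:27017")["wooyun"]["bugs"]

@app.route("/search")
def search():
    keyword = request.args.get("q", "")
    # Case-insensitive substring match on the vulnerability title.
    hits = bugs.find({"title": {"$regex": keyword, "$options": "i"}},
                     {"_id": 0}).limit(50)
    return jsonify(list(hits))

if __name__ == "__main__":
    app.run()
```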
12. hao123 Web Crawler
GitHub: github.com/buckyrobert… Uses hao123 as the entry page and crawls outward along external links, collecting URLs and recording each site's internal-link and external-link counts along with information such as the page title. Tested on 32-bit Windows 7; it currently collects roughly 100,000 records every 24 hours.
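The internal/external bookkeeping reduces to comparing each link's domain against the page's own. A minimal sketch with requests and BeautifulSoup:

```python
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def count_links(page_url):
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    own_domain = urlparse(page_url).netloc
    internal, external = [], []
    for a in soup.find_all("a", href=True):
        link = urljoin(page_url, a["href"])  # resolve relative URLs
        bucket = internal if urlparse(link).netloc == own_domain else external
        bucket.append(link)
    return {
        "title": soup.title.string if soup.title else None,
        "internal": len(internal),
        "external": len(external),
        "outlinks": external,  # frontier for crawling outward
    }
```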
13. Flight Ticket Crawler (Qunar and Ctrip)
GitHub: github.com/fankcoder/f… Findtrip is a Scrapy-based flight ticket crawler that currently integrates China's two major flight-booking websites (Qunar and Ctrip).
14. NetEase Client Content Crawler, based on requests, MySQLdb, and torndb
GitHub:github.com/leyle/163sp…
15. A Collection of Douban Crawlers: Movies, Books, Groups, Albums, Things, etc.
GitHub:github.com/fanpei91/do…
16. Qzone (QQ Space) Crawler
GitHub: github.com/LiuXingMing… Crawls journals, "shuoshuo" status posts, personal information, etc.; it can capture 4 million records a day.
17. Baidu MP3 Full-Site Crawler
GitHub: github.com/Shu-Ji/baid… Uses Redis to support resuming an interrupted crawl.
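Redis-backed resumability is typically just a persistent "done" set. A minimal sketch, with an illustrative key name:

```python
import redis

r = redis.Redis()

def crawl(urls, fetch):
    for url in urls:
        if r.sismember("baidu_mp3:done", url):
            continue  # already handled before the interruption
        fetch(url)
        # Checkpoint lives in Redis, so it survives crawler restarts.
        r.sadd("baidu_mp3:done", url)
```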
18. Taobao and Tmall Crawler
GitHub: github.com/pakoo/tbcra… Captures page information by search keyword or item ID; data is stored in MongoDB.
19. Stock Data Crawler and Stock-Selection Strategy Testing Framework
GitHub: github.com/benitoro/st… Captures all Shanghai and Shenzhen stock market data for a chosen date range. Supports defining stock-selection strategies with expressions, supports multithreading, and saves data to JSON and CSV files.
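A minimal sketch of the two moving parts this entry names: a thread pool for the downloads and an expression applied per record for selection. The fetch function and field names are placeholders:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def select(stocks, expression):
    # e.g. expression = "pe < 20 and volume > 1e6"; each stock is a dict
    # whose keys act as variables. eval() is tolerable here only because
    # the expression is written by the user herself.
    return [s for s in stocks if eval(expression, {}, s)]

def run(codes, fetch_quote, expression):
    # fetch_quote is a placeholder: stock code -> dict of quote fields.
    with ThreadPoolExecutor(max_workers=8) as pool:
        stocks = list(pool.map(fetch_quote, codes))  # parallel downloads
    picked = select(stocks, expression)
    with open("picked.json", "w", encoding="utf-8") as fh:
        json.dump(picked, fh, ensure_ascii=False, indent=2)
    return picked
```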