What is a crawler?
If we think of the Internet as a big spider's web, with data stored at each of its nodes, then a crawler is a small spider that travels along the web, collecting the data it wants.
A crawler is a program that sends requests to a website and, after obtaining the resources, analyzes them and extracts the useful data.
At the technical level, it means using a program to simulate a browser requesting a site, pulling the HTML code / JSON data / binary data (images, videos) the site returns down to the local machine, then extracting the data you need and storing it for later use.
Basic Environment Configuration
Version: Python3
System: Windows
IDE: PyCharm
Tools needed for crawling:
Request libraries: requests, selenium (selenium can drive a real browser to parse and render CSS and JavaScript, but at a performance cost, since every page resource is loaded whether you need it or not)
Parsing libraries: re, BeautifulSoup, PyQuery
Storage: files, MySQL, MongoDB, Redis
Basic flow of a Python crawler
The basic version:
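The original code for this step is not reproduced here, so the snippet below is a minimal sketch of what a basic version could look like: send a request, read the HTML, pull out the links you want, and save the files locally. The URL https://example.com/videos and the .mp4 regex are placeholder assumptions, not taken from the article.

```python
# Minimal "basic version" sketch: request -> parse -> store.
# The URL and the regex are hypothetical placeholders.
import re
import requests

response = requests.get('https://example.com/videos')    # 1. send the request
html = response.text                                      # 2. get the HTML back

video_urls = re.findall(r'href="(.*?\.mp4)"', html)       # 3. parse out the data we need

for url in video_urls:                                    # 4. store each video locally
    content = requests.get(url).content
    filename = url.rsplit('/', 1)[-1]
    with open(filename, 'wb') as f:
        f.write(content)
```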
The function-wrapped version:
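Here the same flow is wrapped into functions so each step (request, parse, store) can be reused and tested on its own. This is again a sketch under the same placeholder assumptions; the function names get_page, parse_page, and save_video are illustrative, not from the original post.

```python
# Same request -> parse -> store flow, split into reusable functions.
import re
import requests

def get_page(url):
    """Send the request and return the raw HTML."""
    return requests.get(url).text

def parse_page(html):
    """Extract the video URLs we care about (hypothetical regex)."""
    return re.findall(r'href="(.*?\.mp4)"', html)

def save_video(url):
    """Download one video and write it to a local file."""
    content = requests.get(url).content
    filename = url.rsplit('/', 1)[-1]
    with open(filename, 'wb') as f:
        f.write(content)

def main():
    html = get_page('https://example.com/videos')
    for video_url in parse_page(html):
        save_video(video_url)

if __name__ == '__main__':
    main()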
The concurrent version:
(If there are 30 videos to crawl and 30 threads doing the downloading, the total time is roughly that of the slowest single download.)
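Below is one way to sketch the concurrent version with a thread pool: each download runs in its own worker thread, so the overall time is roughly that of the slowest download. The pool size of 30 mirrors the "30 videos, 30 threads" example; the URLs, regex, and helper names remain hypothetical.

```python
# Concurrent sketch: download each video in its own worker thread.
import re
import requests
from concurrent.futures import ThreadPoolExecutor

def get_page(url):
    return requests.get(url).text

def parse_page(html):
    # Hypothetical regex for video links.
    return re.findall(r'href="(.*?\.mp4)"', html)

def save_video(url):
    content = requests.get(url).content
    filename = url.rsplit('/', 1)[-1]
    with open(filename, 'wb') as f:
        f.write(content)

if __name__ == '__main__':
    html = get_page('https://example.com/videos')
    video_urls = parse_page(html)
    # One worker per video, up to 30, matching the example above.
    with ThreadPoolExecutor(max_workers=30) as pool:
        pool.map(save_video, video_urls)
```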
Now that you understand the basic flow of a Python crawler, doesn't the code look much simpler when you read it alongside that flow?