Reading time: about 4 minutes.
Most people have heard of Python crawlers at some point, and I’ve always been curious about them, so I spent an afternoon getting started with a lightweight crawler. Why lightweight? Some pages are complicated to access: they require logins, CAPTCHAs, or certificates. To understand the concepts and architecture of a crawler, we only need some simple crawling work, for example crawling Baidu Baike, a site of purely informational pages that are static and require no login. And no matter how complex a crawler or crawler framework becomes, it is still built on this same basic architecture.
A crawler is a program that automatically captures information on the Internet. Every web page has a URL. Starting from an entry page, the jumps between URLs form a web of mutual references, and that mesh structure is the Internet. In theory, starting from the entrance of a huge web project, you can always reach any page in the system through some chain of jumps. When we gather information from web pages by hand, we can only follow those steps ourselves, clicking from page to page until we finally reach the information we want.
A typical example: I want to adopt a cat. Yesterday I went to a local classifieds website, found the pet category, then the cat category, and selected a few filters, such as for adoption rather than for sale, under six months old, a lihua (tabby) cat, and so on. I clicked search, the site gave me a detailed list of entries, and I got the information I wanted by hand. It is accurate, but it wastes a lot of human time.
A crawler is exactly this kind of automated program. We set the topic and targets we need, such as the tags “cat” and “6 months”, and the crawler starts from a specific URL, automatically visits the URLs associated with it, and extracts the data we need. In short, a crawler is a program that automatically accesses the Internet and extracts valuable data.
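To make that concrete, here is a minimal sketch of the single action a crawler repeats over and over: fetch one page and collect the URLs it links to. It uses only the Python standard library, and https://example.com is just a placeholder target, not a real crawl entrance.

```python
# Minimal sketch: download one page and collect the URLs it points to.
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


if __name__ == "__main__":
    page = urlopen("https://example.com").read().decode("utf-8")
    collector = LinkCollector()
    collector.feed(page)
    print(collector.links)  # the URLs this page links to
```

A real crawler just keeps repeating this fetch-and-extract step, following the collected URLs outward from the entrance.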
This is where the value of a crawler lies: it can acquire and make use of the huge amount of data on the Internet. With that data we can do research and analysis, or build products on top of it.
For example, crawl the GitHub projects with the highest daily page views and star growth. With this data, you could build a GitHub open source project recommendation service.
Or take music: songs on the major sites are now under copyright protection, so downloading them is not very convenient. By searching on a song’s name, you could crawl all the free download links on the Internet and easily build an aggregated song search-and-download tool.
You could say that as long as there is data, there is nothing you cannot do, only things you have not thought of. The data sits on the Internet, and through crawlers we can make it play a greater role and deliver more value. In the era of big data, crawlers are undoubtedly a front-line technology.
Let’s start with a simple crawler architecture diagram.
First of all, we need a crawler scheduler to start and stop the crawler and to monitor its state; it also provides the interface through which applications consume the crawled data. This part does not belong to the crawler itself.
The shaded box in the diagram is the crawler itself. Because many pages have multiple entrances, the same page can be reached through different URL paths, so an intelligent crawler should skip URLs it has already crawled rather than crawl them again. The URL manager is the component that keeps track of the URLs that have been crawled and the URLs that are waiting to be crawled.
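One simple way to implement a URL manager is with two in-memory sets: one for URLs waiting to be crawled and one for URLs already crawled. The class and method names below are my own sketch, not a fixed API.

```python
# A possible URL manager, sketched as two sets.
class UrlManager:
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        """Add a URL only if we have never seen it before."""
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        """Hand out one URL to crawl and mark it as crawled."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

For a toy crawler, plain sets are enough; a larger crawler would persist this state in a database or cache instead.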
The scheduler selects a URL from the URL manager and hands it to the web page downloader, which downloads the page as a string and passes it to the web page parser. The parser does two things: it extracts the valuable information we want and returns it to the scheduler, and whenever it encounters a new URL on the page, it passes that URL back to the URL manager. The three modules loop like this until all the URLs associated with the page have been crawled.
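Putting it together, the scheduler loop might look roughly like the sketch below. It reuses the UrlManager sketched above; `download` and `parse` are stand-ins for the downloader and parser modules that the next article covers, so the parser here is only a stub.

```python
# A sketch of the scheduler loop tying the three modules together.
from urllib.request import urlopen


def download(url):
    """Downloader: fetch the page and return its content as a string."""
    return urlopen(url).read().decode("utf-8")


def parse(page_url, html):
    """Parser stub: return (new_urls, data) extracted from the page."""
    return set(), {"url": page_url}


def crawl(root_url, limit=10):
    manager = UrlManager()        # the URL manager sketched above
    manager.add_new_url(root_url)
    results = []
    while manager.has_new_url() and len(results) < limit:
        url = manager.get_new_url()          # pick a URL to crawl
        html = download(url)                 # download the page as a string
        new_urls, data = parse(url, html)    # parse out data and new URLs
        manager.add_new_urls(new_urls)       # feed newly found URLs back in
        results.append(data)                 # collect the extracted data
    return results
```

The `limit` parameter is just a safety valve for this sketch, so the loop stops after a handful of pages instead of wandering across the whole site.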
The dynamic running process can be shown more clearly with a sequence diagram; you can compare it with the steps above to understand it.
As an aside, sequence diagrams are one of my favorite tools for sorting out logic. They are well worth learning and will help you a lot in your work and study.
The next article will cover URL managers, downloaders, and parsers in more detail.
Reference:
MOOC course “Developing a Simple Python Crawler” — Crazyant