In this age when the Internet and artificial intelligence are everywhere, how well do you really understand them? Have you ever considered that the web pages you browse every day are also being visited by millions of web crawlers? You have probably heard the term before, but do you really know what it means?
What exactly is a web crawler? Today, the Big Bad Wolf will explain it in simple terms.
A web crawler is an automated program that can extract information from a large number of web pages on the Internet. It also goes by a very vivid, down-to-earth name: the "web spider".
A web spider finds web pages by their link addresses. It starts from one page of a site (usually the home page), reads its content, finds the other links on that page, then follows those links to the next pages, and so on until every page of the site has been crawled.
If you picture the whole Internet as one giant spider web, each website is a node of that web and each link is a strand connecting the nodes. By moving from node to node along the strands, the spider can reach, and grab, every web page on the Internet.
Seen this way, a web spider really is just a crawler: a program that crawls web pages.
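To make the idea concrete, here is a minimal sketch of that link-following loop in Python. The start URL, the page limit, and the regex-based link extraction are purely illustrative assumptions, not part of any real site or project.

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests

# A minimal sketch of the "follow the links" idea described above.
# start_url and max_pages are illustrative values only.
start_url = "https://example.com/"
max_pages = 50

seen = {start_url}          # pages we have already queued
queue = deque([start_url])  # pages waiting to be fetched

while queue and len(seen) <= max_pages:
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue  # skip pages that fail to download

    # Crude link extraction with a regex; a real crawler would use an HTML parser.
    for href in re.findall(r'href="([^"#]+)"', html):
        link = urljoin(url, href)
        if link.startswith("https://example.com/") and link not in seen:
            seen.add(link)
            queue.append(link)

print(f"Discovered {len(seen)} pages")
```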
The technologies involved in building a web crawler are shown in the figure below.
Because a crawler system extracts content from web pages, we first need to master the basics of how web pages work, such as HTML and a little JavaScript. The browser communicates with the web server over the HTTP protocol, so basic HTTP knowledge is just as essential. These fundamentals matter a great deal for both developing and operating a crawler system.
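If you have never looked at the protocol directly, the sketch below sends one raw HTTP GET request over a socket, just to show the plain-text request and response lines that sit underneath every browser visit. The host example.com is only a stand-in.

```python
import socket

# A minimal raw HTTP/1.1 GET request: this is roughly what a browser
# (or a crawler) sends to the web server. example.com is a stand-in host.
request = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection(("example.com", 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        response += chunk

# The status line and headers come back as plain text before the HTML body.
print(response.split(b"\r\n\r\n", 1)[0].decode("ascii", errors="replace"))
```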
A crawler also needs to simulate a browser and send requests to the web server to retrieve page content. For this you can use Python's standard library urllib or, better yet, the third-party library Requests.
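For example, a minimal request with the Requests library might look like this; the URL and the User-Agent string are placeholders, not values from this article.

```python
import requests

# Simulate a browser request by sending a User-Agent header.
headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler/0.1)"}
response = requests.get("https://example.com/", headers=headers, timeout=10)

print(response.status_code)   # HTTP status code, e.g. 200
print(response.text[:200])    # first 200 characters of the HTML
```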
After fetching a page, you then need to parse it to extract the required information as well as the other links it contains. The Python third-party libraries Beautiful Soup or PyQuery are good choices here.
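A small, self-contained Beautiful Soup example might look like the following; the HTML snippet is made up purely to show how text and links are extracted.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet, used only to demonstrate parsing.
html = """
<html><body>
  <h1>Hello, crawler</h1>
  <a href="/page1">Page 1</a>
  <a href="/page2">Page 2</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.get_text())                        # -> Hello, crawler
print([a["href"] for a in soup.find_all("a")])   # -> ['/page1', '/page2']
```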
If the amount of extracted data is small, it can simply be saved to a file, but the more general and more professional approach is to store it in a database. You can choose the relational database MySQL or the non-relational database MongoDB to hold the information extracted from the web pages.
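As an illustration, here is a sketch of saving extracted items into MongoDB with pymongo; the connection URI, database name, and collection name are all assumptions made for the example.

```python
from pymongo import MongoClient

# Connect to a local MongoDB server (placeholder URI) and pick a
# database/collection for the crawled items.
client = MongoClient("mongodb://localhost:27017/")
collection = client["crawler_demo"]["pages"]

item = {"url": "https://example.com/page1", "title": "Page 1"}
collection.insert_one(item)

print(collection.count_documents({}))  # how many items are stored
```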
The fetching and parsing steps above can also be handled directly by a dedicated crawler framework such as Scrapy. This is the more engineering-oriented approach, and it makes the development of a crawler system more convenient and more standardized.
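As a taste of what Scrapy code looks like, here is a minimal spider sketch; the domain and CSS selectors are placeholders and would need to match your real target site.

```python
import scrapy

class DemoSpider(scrapy.Spider):
    # A minimal spider: fetch a start page, yield one item per page,
    # and follow every link it finds.
    name = "demo"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Extract one field from the current page
        yield {"title": response.css("title::text").get()}

        # Follow in-page links and parse them with the same callback
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

You could run such a standalone spider with `scrapy runspider demo_spider.py -o items.json`, where the file name is simply whatever you saved the spider as.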
As we know, a web crawler visits and crawls almost every link on a site, so the same pages inevitably get fetched multiple times, which drags down the efficiency of the whole system.
So when the number of pages to crawl is very large, we have to pay special attention to de-duplication: there must be an efficient way to check whether a page has already been crawled. This is where the Bloom filter comes in.
Redis supports a Bloom filter extension that solves exactly this problem and gives the crawler good filtering ability over the pages it has already seen.
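To show the idea rather than a production setup, here is a toy pure-Python Bloom filter for URL de-duplication; in practice you would use the RedisBloom module or a well-tested library, and the size and hash-count values below are arbitrary.

```python
import hashlib

class BloomFilter:
    """A toy Bloom filter: fast membership checks with a small chance of
    false positives, but never false negatives."""

    def __init__(self, size=1_000_000, num_hashes=5):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from the item via salted hashes.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("https://example.com/page1")
print("https://example.com/page1" in seen)  # True
print("https://example.com/page2" in seen)  # almost certainly False
```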
The Internet, however, is a huge network. When the number of pages to crawl grows even larger, a single crawler becomes far too slow, so we need to consider building a distributed crawler, that is, running crawl tasks on multiple machines at the same time.
We then need a distributed queue to manage and schedule all the crawl tasks, and scrapy-redis can do exactly that.
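A sketch of the scrapy-redis settings that turn an ordinary Scrapy project into a distributed one might look like this; the setting names follow the scrapy-redis documentation (verify them against the version you install), and the Redis URL is a placeholder for your own server.

```python
# settings.py additions for scrapy-redis (sketch)

# Use Redis as the shared scheduler queue for all crawler machines
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# De-duplicate requests across machines through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so crawls can be paused and resumed
SCHEDULER_PERSIST = True

# Where the shared Redis server lives (placeholder address)
REDIS_URL = "redis://localhost:6379"
```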
The Big Bad Wolf has put together the knowledge points and free courses mentioned above; just click the links to jump to them.
Introductory resources:
To learn the basic architecture and workflow of a crawler: Python Development Simple Crawler.
For the basic use of Scrapy: Python's Most Popular Crawler Framework, Scrapy.
For details on Scrapy, see the introduction to Scrapy.
For an introduction to the MySQL database, see the free course Close Contact with MySQL.
To get started with the MongoDB database, see W3Cschool’s online MongoDB tutorial.
For a comprehensive treatment of all the technologies above, you can read the free online version of the book Python 3 Web Crawler Development in Action. It covers basically all crawler-related knowledge and tools, and the Big Bad Wolf thinks it is well worth reading.
In the future, the Big Bad Wolf will also share material on Python for machine learning, automated operations, and testing.
You can also follow my WeChat account "Grey Wolf Cave Owner" for more Python tips and Internet news.
If you found this useful, remember to like and follow. The Big Bad Wolf looks forward to making progress together with you!