Background

Web pages are connected to one another by links. If we picture each page as a node, a crawler (or spider) can move along a link from one node to the next; that is, it uses one page to discover and fetch the pages that follow. In this way the spider can reach all the linked nodes of a website and capture the site's data.
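The link-following idea can be sketched as a breadth-first traversal. This is only a minimal sketch, not a full crawler: the names `crawl`, `get_links`, and `max_pages` are hypothetical, and `get_links` stands in for whatever code fetches a page and returns the URLs it links to.

```python
from collections import deque


def crawl(start, get_links, max_pages=100):
    """Visit pages breadth-first, starting from `start`.

    `get_links(url)` is an assumed callable that returns the URLs
    linked from `url`. Returns the URLs in the order they were visited.
    """
    seen = {start}            # URLs already queued, to avoid revisiting
    queue = deque([start])    # URLs waiting to be visited
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The `seen` set matters because pages often link back to each other; without it the crawler would loop forever on the first cycle it meets.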

One-sentence description

An automated program that retrieves web pages, then extracts and saves information.

Access to web pages

The crawler's first job is to retrieve the source code of a web page. In Python this can be done with libraries such as urllib and Requests, which send HTTP requests for us and represent both the request and the response as data structures. Once we have the response, we only need to read its body to obtain the page source, so the whole retrieval step can be carried out by a program.
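As a minimal sketch of this step, the following uses only the standard-library urllib. The function name `fetch_html`, the `User-Agent` value, and the 10-second timeout are illustrative assumptions, not requirements.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the body of `url` decoded to text."""
    # The User-Agent string below is a made-up example value.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-crawler/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()  # raw bytes of the response body
        # Use the charset declared in Content-Type, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return body.decode(charset, errors="replace")
```

A call such as `fetch_html("https://example.com/")` would return the page source as a string. With Requests, the equivalent is roughly `requests.get(url, timeout=10).text`.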

Extracting information

After obtaining the page source, the next step is to analyze it and extract the information we want. Regular expressions are the general-purpose option, but they are difficult to write and error-prone. Because web pages have a regular structure, libraries such as Beautiful Soup, pyquery, and lxml can extract information such as node attributes and text values quickly and efficiently.
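To illustrate structure-based extraction without extra dependencies, the sketch below uses the standard-library `html.parser` as a stand-in for the libraries named above; Beautiful Soup or lxml would usually be shorter. The names `LinkExtractor` and `extract_links` are hypothetical. It collects one node attribute (`href`) and the text value of each link.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```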

Save the data

We typically store the extracted data somewhere for later use. For example, it can be saved as a TXT or JSON file, stored in a database such as MySQL or MongoDB, or uploaded to a remote server over SFTP.
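The simplest of these options, saving to a JSON file, might look like the sketch below. The function names and the shape of the records are assumptions made for illustration.

```python
import json
from pathlib import Path


def save_json(records, path):
    """Write `records` (e.g. a list of dicts) to `path` as UTF-8 JSON."""
    # ensure_ascii=False keeps non-ASCII text readable in the file.
    text = json.dumps(records, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")


def load_json(path):
    """Read back data previously written by save_json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

For example, `save_json([{"title": "Home", "url": "/"}], "pages.json")` would write a single record to a hypothetical `pages.json` file.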

Automated program

A crawler replaces the manual collection of information. It can also handle exceptions and retry failed requests along the way, so that crawling continues efficiently without interruption.
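Exception handling with retries can be sketched as a small wrapper. This is one possible approach, assuming exponential backoff between attempts; the name `with_retries` and its parameters are hypothetical.

```python
import time


def with_retries(func, *args, attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Call func(*args), retrying when it raises one of `exceptions`.

    Waits base_delay, then 2 * base_delay, 4 * base_delay, ... between
    attempts, and re-raises the last error once `attempts` calls have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(base_delay * 2 ** (attempt - 1))
```

In a real crawler, `exceptions` would normally be narrowed to network errors such as `urllib.error.URLError`, so that programming mistakes are not silently retried.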

What kind of data can be captured

• Web page code: HTML is the most common.
• JSON data: the most convenient to work with.
• Binary data, such as images, audio, and video, which can be saved to a file with the corresponding name.
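For binary data, "the corresponding file name" can often be taken from the last part of the URL path. The sketch below assumes exactly that; `filename_from_url`, `save_binary`, and the fallback name `download.bin` are hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of `url`, or `default` if there is none."""
    name = PurePosixPath(unquote(urlsplit(url).path)).name
    # An empty name or ".." is not a usable file name.
    if name in ("", ".."):
        return default
    return name


def save_binary(data, url, directory="."):
    """Write the bytes `data` into `directory`, named after `url`."""
    target = Path(directory) / filename_from_url(url)
    target.write_bytes(data)
    return target
```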

JS-generated content

Content that is generated by JavaScript, or loaded asynchronously through Ajax, is not captured. As mentioned earlier, the crawler captures the page source returned by the server and does not execute JavaScript. If you want crawlers to reach this content, the site needs server-side rendering (for example, if its front end and back end were originally separated).
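A small demonstration of why this happens: in the made-up page below, the paragraph text only exists after a browser runs the script. A parser that reads the raw source, here the standard-library `html.parser` with a hypothetical `visible_text` helper, finds an empty element.

```python
from html.parser import HTMLParser

# Hypothetical page source as a crawler would receive it from the server.
RAW_PAGE = """<html><body>
<p id="message"></p>
<script>
  document.getElementById("message").textContent = "Loaded by JS";
</script>
</body></html>"""


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Here `visible_text(RAW_PAGE)` returns an empty list, because the script is never executed; the same helper applied to the rendered markup `<p id="message">Loaded by JS</p>` would find the text.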

Note: Such pages are not completely uncrawlable. Tools such as Selenium and Splash can be used to simulate a browser and render the JavaScript.