We can compare the Internet to a large web, and a crawler to a spider crawling along it. Each node of the web corresponds to a web page, and crawling to a node is equivalent to visiting that page and obtaining its information. The threads between nodes correspond to the links between pages: after reaching one node, the spider can follow a thread to the next node, that is, it can follow links from one page to subsequent pages. In this way the spider can reach every node of the web, and the data of every page can be fetched.

1. Overview of crawlers

To put it simply, a crawler is an automated program that obtains web pages, then extracts and saves their information.

(1) Get web pages

The first task of a crawler is to obtain the web page, that is, to obtain its source code. The source code contains some of the page's useful information, so as long as we get the source code, we can extract the information we want.

Earlier, I talked about requests and responses. We send a request to the website's server, and the body of the response it returns is the source code of the web page. So the key part is to construct a request, send it to the server, and then receive and parse the response. But how do we implement this process? Surely we can't grab the source code by hand?

Don’t worry. Python provides a number of libraries to help us do this, such as urllib and requests. We can use these libraries to perform HTTP requests: both the request and the response can be represented by the data structures these libraries provide. After receiving a response, we only need to parse its body to get the source code of the web page. In this way, we can implement the process of obtaining a web page with a program.
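
For example, here is a minimal sketch using the requests library (the URL is just a placeholder for illustration):

    import requests

    # Send an HTTP GET request to the target page (placeholder URL)
    response = requests.get('https://example.com')

    # The response body is the source code of the web page
    html = response.text
    print(html[:200])  # print the first 200 characters of the source code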

(2) Extracting information

After obtaining the source code of the web page, the next step is to analyze it and extract the data we want. The most common method is to use regular expressions. This is a universal approach, but constructing regular expressions is relatively complex and error-prone.
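
As a rough sketch, a regular expression can pull, say, all link targets out of the source code (the pattern below is only illustrative and assumes simple, double-quoted href attributes):

    import re

    # A small HTML snippet standing in for the downloaded source code
    html = '<a href="https://example.com/page1">Page 1</a> <a href="https://example.com/page2">Page 2</a>'

    # Extract the value of every href attribute (illustrative pattern only)
    links = re.findall(r'href="(.*?)"', html)
    print(links)  # ['https://example.com/page1', 'https://example.com/page2']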

In addition, because web pages are structured according to certain rules, there are libraries that extract information based on node attributes, CSS selectors, or XPath, such as Beautiful Soup, pyquery, and lxml. With these libraries, we can quickly and efficiently extract page information such as node attributes and text values.
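
A minimal sketch with Beautiful Soup, again on a small stand-in snippet (the lxml parser is used here; the built-in 'html.parser' also works):

    from bs4 import BeautifulSoup

    # A small HTML snippet standing in for the downloaded source code
    html = '<div id="container"><a href="https://example.com/page1">Page 1</a></div>'

    soup = BeautifulSoup(html, 'lxml')
    a = soup.find('a')    # locate the first a node
    print(a['href'])      # node attribute: https://example.com/page1
    print(a.get_text())   # text value: Page 1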

Extracting information is a very important part of a crawler: it turns disordered data into something well organized, which makes our subsequent processing and analysis of the data much easier.

(3) Save data

After extracting information, we typically store the extracted data somewhere for later use. For example, it can be saved as TXT or JSON text, written to a database such as MySQL or MongoDB, or uploaded to a remote server, for example over SFTP.
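
For instance, a minimal sketch that saves extracted items as a JSON file (the file name and data here are only placeholders):

    import json

    # Placeholder data standing in for the extracted results
    data = [{'title': 'Page 1', 'url': 'https://example.com/page1'}]

    # Save the results as a JSON text file
    with open('result.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)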

(4) Automated program

By automation, I mean that a crawler can do these things for us instead of a human. We could certainly extract the information by hand, but if the volume is very large, or we want to obtain a lot of data quickly, we have to rely on a program. A crawler is an automated program that does the crawling work on our behalf. It can handle various exceptions and retry on errors during crawling, ensuring that the crawl keeps running continuously and efficiently.
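
A minimal sketch of such error handling, assuming a hypothetical fetch of a placeholder URL with a fixed number of retries:

    import time
    import requests

    def fetch(url, retries=3):
        """Fetch a page, retrying a few times on network or HTTP errors."""
        for attempt in range(retries):
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()  # raise on HTTP error status codes
                return response.text
            except requests.RequestException as e:
                print(f'Attempt {attempt + 1} failed: {e}')
                time.sleep(1)  # wait a moment before retrying
        return None

    html = fetch('https://example.com')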

2. What kind of data can be captured

We can see all kinds of information in web pages. The most common are regular web pages, which correspond to HTML code, and what we most often grab is this HTML source code.

In addition, some pages return not HTML code but a JSON string (most API interfaces take this form). Data in this format is easy to transmit and parse, so it can also be grabbed, and the information is even easier to extract.
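
A minimal sketch of grabbing such an interface with requests (the URL is a placeholder; the structure of the returned data depends on the actual API):

    import requests

    # Request a JSON API (placeholder URL)
    response = requests.get('https://api.example.com/items')

    # Parse the JSON string in the response body into Python objects
    data = response.json()
    print(data)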

In addition, we can also see all kinds of binary data, such as pictures, videos, and audio. With a crawler, we can grab this binary data and save it to the corresponding file.
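
For example, a minimal sketch that downloads an image (the URL and file name are placeholders):

    import requests

    # Download an image (placeholder URL)
    response = requests.get('https://example.com/favicon.ico')

    # The response content is raw bytes, so write it in binary mode
    with open('favicon.ico', 'wb') as f:
        f.write(response.content)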

In addition, you can also see files with various extensions, such as CSS, JavaScript, and configuration files. These are also just ordinary files: as long as they are accessible in the browser, they can be grabbed.

In fact, all of the above content corresponds to its own URL, based on the HTTP or HTTPS protocol. As long as data can be accessed this way, a crawler can grab it.

3. JavaScript renders the page

Sometimes when we crawl a web page with urllib or requests, the source code we get is actually different from what we see in the browser.

This is a very common problem. Nowadays more and more web pages are built with Ajax and front-end modular tools, and the entire page may be rendered by JavaScript; that is, the original HTML code is just an empty shell, for example:

    <!DOCTYPE html>
    <html>
        <head>
            <meta charset="UTF-8">
            <title>This is a Demo</title>
        </head>
        <body>
            <div id="container">
            </div>
        </body>
        <script src="app.js"></script>
    </html>

The body node contains only one node with the id container, but note that app.js is included after the body node, and it is responsible for rendering the entire page.

When you open the page in a browser, it first loads the HTML content, then notices that an app.js file is referenced and requests that file as well. When it gets the file, it executes the JavaScript code in it, which modifies the nodes in the HTML, adds content to them, and finally produces the complete page.

But when we request this page with a library such as urllib or requests, all we get is the HTML code above; the library will not go on to load the JavaScript file for us, so we cannot see what the browser shows.

This also explains why sometimes the source code we get is different from what we see in the browser.

As a result, the source code obtained with a basic HTTP request library may be quite different from the page source you see in the browser. In such cases, we can analyze the back-end Ajax interface, or use a library such as Selenium or Splash to simulate JavaScript rendering.
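
As a rough sketch of the Selenium approach (it assumes a local Chrome browser with a matching driver is available; the URL is a placeholder):

    from selenium import webdriver

    # Launch a Chrome browser controlled by Selenium
    # (assumes Chrome and a matching chromedriver are installed)
    driver = webdriver.Chrome()
    driver.get('https://example.com')

    # page_source is the HTML after JavaScript has been executed
    html = driver.page_source
    print(html[:200])

    driver.quit()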

Later, we’ll go into detail on how to capture JavaScript rendered web pages.

This section introduced some of the basic principles of crawlers, which should make us more comfortable when writing crawlers later.


This article was first published on Cui Qingcai's personal blog Jingmi: Python 3 web crawler development in action tutorial.

For more crawler information, please follow my personal WeChat official account: Attack Coder
