Life is short. I use Python

Portal to previous articles:

Little White learning Python crawler (1): The beginning

Little White learning Python crawler (2): Pre-preparation (1) installing the basic libraries

Little White learning Python crawler (3): Pre-preparation (2) Linux basics

Little White learning Python crawler (4): Pre-preparation (3) Docker basics

Little White learning Python crawler (5): Pre-preparation (4) database basics

Little White learning Python crawler (6): Pre-preparation (5) crawler framework installation

Little White learning Python crawler (7): HTTP basics

Little White learning Python crawler (8): Web basics

The core of a crawler

What is a crawler? It is easy to understand: a crawler is a program that fetches web pages, extracts information from them according to certain rules, and repeats this process automatically.

The first job of a crawler is to fetch the web page, which here mainly means the page's source code. The source code contains the information we need, and our task is to extract that information from it.

When it comes to requesting web pages, Python provides us with a number of libraries, such as urllib (in the standard library) and requests and aiohttp (third-party).

We can use these libraries to send HTTP requests and receive the responses. Once the response arrives, we only need to parse its body to obtain the page's source code.
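As a minimal sketch (the URL below is just a placeholder), fetching a page's source code with the requests library might look like this:

```python
import requests

# Request a page and read its HTML source (example.com is only a placeholder URL)
response = requests.get('https://example.com')
response.raise_for_status()                      # fail loudly on non-2xx status codes
response.encoding = response.apparent_encoding   # guess a sensible text encoding
html = response.text                             # the response body, i.e. the page source
print(html[:200])                                # preview the first 200 characters
```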

Once we have the source code, our next job is to parse it and extract the data we need from it.

The most basic and common way to extract data is with regular expressions. This approach is relatively complex and error-prone, but it has to be said that if you write regular expressions well, you do not need any of the parsing libraries below at all; it is a universal method.
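For instance, here is a rough sketch of pulling links out of a page with a regular expression (the pattern is deliberately simplified and will miss plenty of edge cases):

```python
import re

html = '<a href="https://example.com/page1">Page 1</a><a href="https://example.com/page2">Page 2</a>'

# Simplified pattern: capture whatever sits inside href="..." of an <a> tag
links = re.findall(r'<a\s+href="([^"]+)"', html)
print(links)  # ['https://example.com/page1', 'https://example.com/page2']
```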

(Whispering: if your regular expressions are not quite up to it, just use the third-party libraries below.)

Libraries for extracting data include Beautiful Soup, pyquery, lxml, and so on. With these libraries we can quickly and efficiently extract information from HTML, such as node attributes and text values.
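A small sketch using Beautiful Soup (assuming it is installed, e.g. via pip install beautifulsoup4) to read a node's text and attributes:

```python
from bs4 import BeautifulSoup

html = '<div class="post"><a href="/article/1" title="Hello">Hello Crawler</a></div>'

soup = BeautifulSoup(html, 'html.parser')  # parse with the built-in parser
link = soup.find('a')                      # locate the first <a> node
print(link.get_text())                     # text value: 'Hello Crawler'
print(link['href'], link['title'])         # node attributes: '/article/1' 'Hello'
```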

After extracting the data from the source code, we need to save it. This can take many forms: we can save it directly to TXT, JSON, or Excel files, or write it to a database such as MySQL, Oracle, SQL Server, or MongoDB.
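As a simple illustration (the data here is made up), saving extracted items to a JSON file could look like this:

```python
import json

# Hypothetical extracted data
items = [
    {'title': 'Hello Crawler', 'url': '/article/1'},
    {'title': 'Another Post', 'url': '/article/2'},
]

# ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable in the file
with open('result.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False, indent=2)
```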

What kinds of data can we crawl

Generally speaking, what we crawl is HTML source code, which corresponds to the regular, intuitive pages we see in the browser.

But some information is not returned as HTML in the page itself; it comes from various APIs. Most of these now return data in JSON format, some in XML, and a few odd interfaces return custom string formats. API data interfaces need to be analyzed on a case-by-case basis.
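For a JSON API the workflow is usually simpler, since the response parses straight into Python objects; a sketch with a hypothetical endpoint:

```python
import requests

# Hypothetical JSON API endpoint and parameters
resp = requests.get('https://api.example.com/v1/articles', params={'page': 1})
resp.raise_for_status()
data = resp.json()                    # parse the JSON body into dicts/lists
for item in data.get('items', []):    # field names depend on the specific API
    print(item.get('title'))
```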

In addition, on the big image and video sites (such as Douyin and Bilibili), the information we want to crawl is images or videos, which exist as binary data. We need to crawl this binary data and then write it to disk.
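Downloading binary data such as an image mostly comes down to writing the raw bytes to disk; a minimal sketch (the image URL is a placeholder):

```python
import requests

url = 'https://example.com/static/picture.jpg'  # placeholder image URL
resp = requests.get(url)
resp.raise_for_status()

# resp.content holds the raw bytes; write them out in binary mode
with open('picture.jpg', 'wb') as f:
    f.write(resp.content)
```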

We can also grab resource files such as CSS and JavaScript, and sometimes font files such as WOFF. These are all part of what makes up a web page, and as long as the browser can access them, we can crawl them.

Crawling modern front-end pages

Here comes today's core content!

Many times, when we use an HTTP request library to fetch a page's source code, what we get looks nothing like the page we see in the browser: just a few short lines.

This is because front-end technology has advanced rapidly in recent years, and a large number of pages are now built with modular front-end frameworks and tooling, such as Vue and React.

As a result, what we get is only an empty shell of a web page, like this:

```html
<!DOCTYPE html>
<html lang="en" style="background-color: #26282A; height: 100%">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, user-scalable=no">
    <link rel="icon" href="<%= BASE_URL %>favicon.ico">
    <title>Demo project</title>
    <style>
      #app { height: 100% }
    </style>
  </head>
  <body>
    <noscript>
      <strong>We're sorry but xxxxxx doesn't work properly without JavaScript enabled. Please enable it to continue.</strong>
    </noscript>
    <div id="app"></div>
    <!-- built files will be auto injected -->
    <script src=/js/chunk-vendors.84ee7bec.js></script>
    <script src=/js/app.4170317d.js></script>
  </body>
</html>
```

This code comes from a small project the author tinkers with from time to time; a large amount of the imported JavaScript has been omitted here.

Inside the body node there is only a single div node with the id app, but note that JavaScript files are included at the end of the body node, and it is these scripts that are responsible for rendering the entire page.

When the browser opens the page, it first loads the HTML content and then discovers the JavaScript script files referenced in it. After fetching the script files, it executes the code inside them, and that JavaScript modifies the page's HTML, adding nodes to it, to complete the rendering of the whole page.

But when we request this page with an HTTP request library, we only get the current HTML content; the library does not fetch the JavaScript files for us, let alone execute them to render the DOM nodes, so naturally we cannot see the content shown in the browser.

This also explains why sometimes the source code we get is different from what we see in the browser.

Of course, don't panic: there are libraries such as Selenium and Splash that can simulate a browser to execute the JavaScript and render the page.
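As a rough sketch (Selenium needs a real browser and a matching driver such as ChromeDriver installed; the details are left for later articles), getting the rendered source could look like this:

```python
from selenium import webdriver

driver = webdriver.Chrome()        # requires Chrome and a matching ChromeDriver
driver.get('https://example.com')  # the browser loads the page and runs its JavaScript
html = driver.page_source          # the fully rendered HTML, including JS-generated nodes
print(html[:200])
driver.quit()
```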

We will cover these topics step by step later on. This article is mainly meant to give students a basic understanding of crawlers, to make the subsequent learning easier.

Reference:

https://cuiqingcai.com/5484.html