These notes follow my own learning process and honestly record the problems and difficulties I ran into along the way. One goal is to consolidate what I have learned; the other is that they may be helpful to others later.

Chapter 1 Development Environment Configuration

I will not say much here. Configuring the environment is a basic skill every programming learner should master; search (e.g., Baidu) when you run into problems, or, if you have the book, follow its instructions, since the book covers this in detail.

Chapter 2 Crawler Fundamentals

This chapter is the focus of these notes and covers the following:

See another blog post on computer networking for HTTP content.

2.1 Basic HTTP Principles

2.1.1 URI and URL

URI: Uniform Resource Identifier

URL: Uniform Resource Locator

URN: Uniform Resource Name

URLs are a subset of URIs: URIs include both URLs and URNs, and URLs and URNs overlap. For a simple intuition, a link to a website or a link to an image on a web page is a URL.

2.1.2 Hypertext

Hypertext: known in English as hypertext; the HTML source code of a web page, for example, is hypertext.

Open any page in Chrome, right-click a blank area, select “Inspect” (or just press F12) to open the browser’s developer tools. In the Elements tab, you’ll see the source code of the current page, which is hypertext.

2.1.3 HTTP and HTTPS

URLs start with schemes such as HTTP, HTTPS, FTP, SFTP, and SMB; these are all protocols. HTTP is a protocol for transferring hypertext data from the network to the local browser, and it ensures efficient and accurate transmission of hypertext documents. HTTP/1.1 is currently in wide use.

HTTPS is the SSL (Secure Sockets Layer) version of HTTP, that is, HTTP with an SSL layer added.

Some websites use the HTTPS protocol but are still shown as not secure. For example, if you open 12306 in Chrome and visit https://www.12306.cn/, the browser displays a message…

Reason: the CA certificate of 12306 was issued by the Ministry of Railways of China and is not trusted by CA organizations, so certificate verification fails here and the site is flagged as not secure. In fact, its data transmission is still encrypted by SSL. To crawl such a site, you need to set an option to ignore the certificate, otherwise an SSL connection error will be raised.
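
As a rough sketch (not from the book): with the requests library introduced later, certificate verification can be skipped through its verify parameter; suppressing the resulting warning is optional.

import requests
import urllib3

# Silence the InsecureRequestWarning that is raised when verification is disabled.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# verify=False tells requests to ignore the certificate check.
response = requests.get('https://www.12306.cn/', verify=False)
print(response.status_code)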

2.1.4 HTTP Request Process

The client sends a request to the server. The server receives the request, processes and parses it, and returns the corresponding response, which is sent back to the browser.

The response contains the source code and other content of the page, which is then parsed by the browser to render the page.
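
As a small illustration (assuming the requests library and an arbitrary example URL), we can look at both sides of this cycle: the request that was sent out and the response that came back.

import requests

# requests builds the HTTP request, sends it, and wraps the server's response.
response = requests.get('https://www.python.org')

print(response.request.method)   # the method of the request that was sent, e.g. GET
print(response.request.headers)  # the headers that were sent with the request
print(response.status_code)      # the status code of the response that came back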

2.1.5 Request

1. Request Method

The most common request methods are GET and POST.

  • The parameters of a GET request are contained in the URL, so the data is visible in the URL, while the URL of a POST request does not contain the data; the data is transmitted through a form and carried in the request body.
  • A GET request can submit at most 1024 bytes of data, while POST has no such limit.
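
A small sketch of the difference, assuming the requests library and httpbin.org (a public echo service not mentioned in the notes) as a test target; the parameter values are made up.

import requests

# GET: the parameters become part of the URL itself.
r_get = requests.get('https://httpbin.org/get', params={'name': 'alice', 'age': '25'})
print(r_get.url)              # ...?name=alice&age=25 is visible in the URL

# POST: the data travels in the request body, not in the URL.
r_post = requests.post('https://httpbin.org/post', data={'name': 'alice', 'age': '25'})
print(r_post.url)             # no parameters are appended to the URL
print(r_post.json()['form'])  # the form data carried in the request body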

2. Request URL

The request URL uniquely identifies the resource we want to request.

3. Request Headers

The more important ones include Cookie, Referer, and User-Agent.

When writing crawlers, request headers are required in most cases.
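
A hedged sketch of attaching request headers with requests; the header values below are placeholders that a real crawler would copy from the browser's developer tools.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # pretend to be a browser
    'Referer': 'https://www.example.com/',                      # claimed source page
    'Cookie': 'sessionid=xxxxxx',                               # placeholder cookie string
}
response = requests.get('https://www.example.com/', headers=headers)
print(response.status_code)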

4. Request Body

The request body typically carries the form data of a POST request, whereas for a GET request the request body is empty.

Pay attention to the relationship between Content-Type and how POST data is submitted. When constructing a POST request in a crawler, you need to use the correct Content-Type and understand which Content-Type each request library uses for each way of setting parameters; otherwise the POST submission may fail to get a normal response.
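
A sketch of this relationship, again assuming requests and httpbin.org: the way the data is passed determines which Content-Type is sent.

import requests

# data= submits form data, so requests sets Content-Type: application/x-www-form-urlencoded
r_form = requests.post('https://httpbin.org/post', data={'key': 'value'})
print(r_form.json()['headers']['Content-Type'])

# json= submits a JSON body, so requests sets Content-Type: application/json
r_json = requests.post('https://httpbin.org/post', json={'key': 'value'})
print(r_json.json()['headers']['Content-Type'])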

2.1.6 Response

1. Response Status Code

Indicates the server's response status. For example, 200 means the server responded normally, 404 means the page was not found, and 500 means an internal server error occurred.

2. Response Headers

Contains the server's response information, such as Content-Type, Server, Set-Cookie, and so on.

3. Response Body

The response body is the body of the response data, such as the HTML code of the page when a web page is requested. After a crawler requests a web page, the content to parse is the response body. Clicking Preview in the browser developer tools shows the contents of the response body (the web page source), which is the target of parsing.

When writing a crawler, we mainly obtain the web page's source code or JSON data through the response body, and then extract the content we need from it.
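
A minimal sketch of reading the three parts of a response with requests; the URL is only an example.

import requests

response = requests.get('https://www.python.org')

print(response.status_code)                  # 1. response status code, e.g. 200
print(response.headers.get('Content-Type'))  # 2. a response header
print(response.text[:200])                   # 3. response body: the page source code

# For an interface that returns JSON, the body can be decoded directly:
# data = response.json()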

For more information on HTTP, see the book “Illustrated HTTP” (图解HTTP), which you can walk through quickly. (It can be bought online or found as a PDF.)

2.2 Web Basics

2.2.1 Composition of web pages

  1. HTML

HTML stands for Hypertext Markup Language.

In developer mode, you can see the source code of the web page in the Elements TAB. This code is HTML, and the entire web page is made up of nested tags. In short, HTML defines the content and structure of a web page.

  2. CSS

Cascading Style Sheets

CSS is currently the only web page layout standard. The style rules of an entire web page are generally defined in CSS files; in HTML, a single link tag is enough to bring in a written CSS file, and the whole page becomes beautiful and elegant. In short, CSS describes the layout of a web page.

  3. JavaScript

JavaScript, JS for short, is a scripting language. HTML and CSS together only present static information to users and lack interactivity. The interactive and animated effects we see on web pages, such as download progress bars, prompt boxes, and scrolling carousels, are usually the work of JavaScript. In short, JavaScript defines the behavior of a web page.

2.2.2 Structure of web pages

<!-- First HTML example -->
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>This is a title</title>
</head>
<body>
<div id="container">
<div class="wrapper">
<h2 class="title">Hello World</h2>
<p class="text">Hello, this is a paragraph.</p>
</div>
</div>
</body>
</html>

This example shows the general structure of a web page. The standard form is an html tag containing nested head and body tags: the head defines the page's configuration and references, and the body defines the body content of the page.

2.2.3 Node tree and relationship between nodes

In HTML, all tag definitions are nodes that form an HTML DOM tree.

The DOM is a W3C (World Wide Web Consortium) standard; its full name is Document Object Model. It defines a standard for accessing HTML and XML documents, and under this standard everything in an HTML document is a node.

2.2.4 Selectors

In CSS, we use CSS selectors to locate nodes.

There are three common selection methods:

  • By ID: in the example above, the id of the div node is container, so it can be written as #container, where # indicates selecting by ID, followed by the ID's name.
  • By class: a selector starting with a dot (.) selects by class, followed by the class name, for example .wrapper.
  • By tag name: for example, to select the second-level headings, just use h2.
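
As a sketch, Beautiful Soup (mentioned later in these notes) accepts exactly these CSS selectors through its select() method; the HTML is taken from the example above.

from bs4 import BeautifulSoup

html = '''
<div id="container">
  <div class="wrapper">
    <h2 class="title">Hello World</h2>
    <p class="text">Hello, this is a paragraph.</p>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('#container'))  # select by ID
print(soup.select('.title'))      # select by class
print(soup.select('h2'))          # select by tag name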

If you want to learn more about HTML, CSS and JavaScript, please refer to the tutorials in W3school.

2.3 Basic principles of crawlers

2.3.1 Overview of crawlers

Simply put, a crawler is an automated program that obtains web pages and then extracts and saves information from them.

1. Get web pages (urllib, requests)

2. Extract information (Beautiful Soup, pyquery, lxml)

3. Save data (TXT files, JSON files, MySQL, MongoDB, a remote server)

4. Automate the process
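
A minimal sketch tying the first three steps together with requests and Beautiful Soup; the URL, the extracted field, and the output file name are illustrative only.

import requests
from bs4 import BeautifulSoup

# 1. Get the web page
response = requests.get('https://www.python.org')

# 2. Extract information (here just the page title)
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.select_one('title').get_text()

# 3. Save the data to a TXT file
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(title + '\n')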

2.3.2 What data can be captured

  • HTML source code
  • JSON string
  • Binary data (pictures, video, audio)
  • CSS, JavaScript, configuration files
  • In short, anything that can be accessed in a browser (over HTTP or HTTPS) can be crawled

2.3.3 JavaScript renders the page

<!-- JavaScript rendering page example -->
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>This is a title</title>
</head>
<body>
<div id="container">
</div>
</body>
<script src="app.js"></script>
</html>

There is only an empty container node inside the body node, but note that app.js is included after the body node, and it is responsible for rendering the whole page.

In such cases, we can analyze the back-end Ajax interface, or we can use libraries like Selenium and Splash to simulate JavaScript rendering.
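
A minimal sketch of the Selenium approach; it assumes Chrome with a matching driver is installed, and the URL is only an example.

from selenium import webdriver

driver = webdriver.Chrome()             # start a real Chrome browser
driver.get('https://www.example.com/')  # the browser loads the page and runs its JavaScript
html = driver.page_source               # the source code AFTER JavaScript has rendered
print(html[:200])
driver.quit()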

2.4 Sessions and Cookies

While browsing websites, we often encounter pages that require logging in: some pages can only be accessed after login, and once logged in we can visit them many times, although sometimes we need to log in again after a while. There are also sites that log you in automatically when you open the browser and stay logged in for a long time. Sessions and cookies are what make this possible.

2.4.1 Static and Dynamic Web pages

Static web pages: load quickly and are easy to write, but have major drawbacks, such as poor maintainability and the inability to display content flexibly according to URL parameters.

Dynamic web pages: can dynamically parse changes in URL parameters, connect to a database, and present different page content, which makes them very flexible.

2.4.2 Stateless HTTP

The HTTP protocol has no memory of transaction processing; in other words, HTTP itself is stateless.

1. Sessions (stored on the server; used to hold user session information)

2. Cookies (stored on the client; on the next visit the browser automatically attaches the cookies and sends them to the server, which identifies the user from them, determines whether the user is logged in, and returns the corresponding response)
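
A small sketch of how this looks from the client side, using requests.Session, which keeps cookies across requests much like a browser does; httpbin.org is an assumed test service.

import requests

session = requests.Session()

# The server sets a cookie on the first request...
session.get('https://httpbin.org/cookies/set/sessionid/123456')

# ...and the Session object automatically attaches it to later requests.
response = session.get('https://httpbin.org/cookies')
print(response.json())  # {'cookies': {'sessionid': '123456'}}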

2.4.3 Common misunderstandings

“Just close the browser and the session disappears.” (Wrong.)

A session is not deleted just because the browser is closed, since the browser does not notify the server before closing. Because of this, the server sets an expiration time for the session: when the time since the client last used the session exceeds this limit, the server assumes the client has stopped and deletes the session.

2.5 Basic Principles of Proxy

2.5.1 Basic Principles

Instead of sending requests directly to the web server, the local machine sends them to a proxy server; the proxy server forwards them to the web server and then passes the web server's response back to the local machine.

2.5.2 Functions of a Proxy

  • Access some sites that cannot be reached normally.
  • Access intranet resources.
  • Increase access speed.
  • Hide the real IP address.

2.5.3 Crawler Proxy

When a single IP address sends requests too frequently it runs into problems (for example, being blocked), so a proxy is used to hide the real IP and make the target server think the proxy server is the one making the requests.
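
A sketch of setting a proxy for requests; the proxy address below is a placeholder for a real proxy server.

import requests

proxies = {
    'http': 'http://127.0.0.1:8888',   # placeholder proxy address
    'https': 'http://127.0.0.1:8888',
}
response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.json())  # shows the IP address the target server sees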

2.5.4 Proxy Classification

1. By protocol

  • FTP proxy server (ports 21, 2121)
  • HTTP proxy server (ports 8080, 3128)
  • SSL/TLS proxy server (up to 128-bit encryption strength, port 443)
  • RTSP proxy (port 554)
  • Telnet proxy (mainly for Telnet remote control, port 23)
  • POP3/SMTP proxy (ports 110, 25)
  • SOCKS proxy

2. By degree of anonymity

  • Highly anonymous proxy (forwards packets unchanged, so to the server it looks like an ordinary client is accessing it)
  • Ordinary anonymous proxy (makes some changes to the packets, so the server may detect that a proxy is being used, and there is some chance the real IP can be traced)
  • Transparent proxy (generally used only to speed up browsing)
  • Spy proxy

2.5.5 Common Proxy Settings

  • Free proxies found online (few of them are usable; it is better to use high-anonymity ones)
  • Paid proxies (much better quality than free ones)
  • ADSL dial-up (highly stable, and also a fairly effective solution)