A crawler is a program that simulates a user’s actions in a browser or an application and automates them.

What happens in the background when we type a URL into the browser and press Enter? Let’s say you type www.sina.com.cn/.

In short, this process takes place in the following four steps:

  • Find the IP address corresponding to the domain name.
  • Send a request to the server at that IP address.
  • The server responds to the request and sends back the web content.
  • The browser parses the web content.
![](https://p26-tt.byteimg.com/large/pgc-image/4cf56f4a02fa432183836ee050061aa6)

The nature of a web crawler

A crawler essentially issues the same HTTP requests that a browser does.

Browsers and crawlers are two different types of web clients, but they fetch web pages in the same way.

In a nutshell, a web crawler reproduces the browser’s fetching function in code. Given a URL, it returns the required data directly to the user, with no need to operate the browser manually step by step.
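As a rough illustration of "specify a URL, get the data back", here is a minimal sketch using Python’s standard library. The function name `fetch_page` and the User-Agent string are made up for this example; a real crawler would also need error handling, retries, and respect for the target site’s rules.

```python
from urllib.request import Request, urlopen


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL much as a browser would and return the body as text."""
    # The User-Agent header identifies the client; this value is a made-up example.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_page("http://www.sina.com.cn/")` would return the HTML of that page as a string, provided the machine has network access.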

How does the browser send and receive this data?

HTTP overview

The HyperText Transfer Protocol (HTTP) was designed to provide a way to publish and receive HTML (HyperText Markup Language) pages.

HTTP in the protocol stack (background)

HTTP runs on top of TCP. The following figure shows the protocols at each layer of the TCP/IP reference model; HTTP is an application-layer protocol. The default port is 80 for HTTP and 443 for HTTPS.

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/5901410e76eb41aca49fef6e86132e94)

HTTP workflow

An HTTP operation is called a transaction, and it works as follows:

1) Address resolution

Suppose the client browser requests this page: http://localhost.com:8080/index.htm

  • Protocol name: HTTP
  • Host name: localhost.com
  • Port: 8080
  • Object path: /index.htm
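The same breakdown can be reproduced in code. This small sketch uses `urllib.parse.urlsplit` from the Python standard library on the example URL above; the variable name `parts` is just for illustration.

```python
from urllib.parse import urlsplit

# Split the example URL into the pieces listed above.
parts = urlsplit("http://localhost.com:8080/index.htm")

print("Protocol name:", parts.scheme)    # scheme of the URL
print("Host name:", parts.hostname)      # host without the port
print("Port:", parts.port)               # port as an integer
print("Object path:", parts.path)        # path of the requested object
```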

In this step, DNS resolves the domain name localhost.com to obtain the IP address of the host.
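A program can ask the operating system’s resolver for the same lookup. The helper below is a sketch; the name `resolve` is hypothetical, and the result for a real domain depends on the DNS configuration of the machine that runs it.

```python
import socket


def resolve(host: str, port: int = 80) -> list[str]:
    """Return the distinct IP addresses the system resolver reports for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]  # first element of the socket address is the IP
        if ip not in addresses:
            addresses.append(ip)
    return addresses
```

For example, `resolve("localhost")` usually reports a loopback address such as 127.0.0.1, while `resolve("localhost.com", 8080)` would need a working DNS server.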

2) Encapsulate HTTP request packets

The information above is combined with the local machine’s own information and encapsulated into an HTTP request packet.

3) Encapsulate a TCP packet and establish a TCP connection (TCP three-way handshake)

Before any HTTP exchange begins, the client (the web browser) must first establish a connection to the server over the network. That connection is made with TCP. TCP and IP together form the basis of the Internet as the well-known TCP/IP protocol family, which is why the Internet is also called a TCP/IP network.

HTTP is an application-layer protocol that sits above TCP. A higher-layer protocol can exchange data only after the lower-layer connection exists, so the TCP connection must be established first. In this example it is made to port 8080.
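The following sketch shows a TCP connection being established before any HTTP data is sent. To stay runnable without a real web server, it opens a listening socket on the loopback interface as a stand-in for the server and lets the operating system pick a free port instead of 8080; both choices are assumptions made for this example.

```python
import socket

# Stand-in for the web server: listen on the loopback interface.
# Port 0 asks the OS for any free port (the article's example uses 8080).
listener = socket.create_server(("127.0.0.1", 0))
server_port = listener.getsockname()[1]

# create_connection() returns once the TCP three-way handshake has completed.
client = socket.create_connection(("127.0.0.1", server_port), timeout=5)
server_side, client_addr = listener.accept()

local = client.getsockname()   # (ip, port) on the client side
peer = client.getpeername()    # (ip, port) on the server side
print("connected", local, "->", peer)

# No HTTP has been exchanged yet; only now could a request be sent.
client.close()
server_side.close()
listener.close()
```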

4) The client sends the request command

After the connection is established, the client sends a request to the server. The request consists of a request line (the method, the Uniform Resource Identifier (URI), and the protocol version number), followed by MIME-like information that includes request modifiers, client information, and possibly body content.
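To make the request format concrete, here is a sketch that assembles the text of a simple GET request by hand. The function name `build_request` and the header values are illustrative assumptions; real browsers send many more headers.

```python
def build_request(host: str, port: int, path: str) -> bytes:
    """Assemble a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.1",             # request line: method, URI, version
        f"Host: {host}:{port}",             # required header in HTTP/1.1
        "User-Agent: example-crawler/0.1",  # client information (made-up value)
        "Accept: text/html",                # request modifier
        "",                                 # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


print(build_request("localhost.com", 8080, "/index.htm").decode("ascii"))
```

Each line ends with a carriage return and line feed, and the empty line tells the server that the headers are complete.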

5) Server response

After receiving the request, the server sends back a response. It starts with a status line containing the protocol version number and a success or error code, followed by MIME-like information that includes server information, entity information, and possibly body content.

Entity data: after the server has sent the response headers, it sends a blank line to mark the end of the header section, and then sends the actual data the user requested, in the format described by the Content-Type response header.
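The response layout described above (status line, headers, blank line, data) can be taken apart with a few lines of code. This is a simplified sketch: `parse_response` is a hypothetical helper, the sample bytes are invented for the example, and it assumes the whole response is already in memory and that the status line contains a reason phrase.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into status line parts, headers, and body."""
    # The first blank line separates the header section from the data.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


# Invented sample response, used only to exercise the parser.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>hello</p>\n"
)
version, code, reason, headers, body = parse_response(sample)
print(version, code, reason, headers["content-type"], body)
```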

6) The server closes the TCP connection

Normally, once the web server has sent the requested data to the browser, it closes the TCP connection. However, if the browser or the server includes this line in its headers:

Connection: keep-alive

the TCP connection remains open after the response is sent, so the browser can continue sending requests over the same connection. Keeping the connection open saves the time needed to establish a new connection for each request and saves network bandwidth.
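Connection reuse can be observed with a small local experiment. In this sketch a throwaway server on the loopback interface replies with the client’s source port; if two requests report the same port, they travelled over the same TCP connection. The handler name `EchoPortHandler` and the whole setup are assumptions made for the demonstration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoPortHandler(BaseHTTPRequestHandler):
    # With HTTP/1.1 the standard-library server keeps connections open.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        # The client's source port identifies the TCP connection in use.
        body = str(self.client_address[1]).encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


server = HTTPServer(("127.0.0.1", 0), EchoPortHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
ports = []
for _ in range(2):
    conn.request("GET", "/", headers={"Connection": "keep-alive"})
    response = conn.getresponse()
    # The body must be read completely before the connection is reused.
    ports.append(response.read().decode("ascii"))

conn.close()          # closing the client lets the server finish its handler
server.shutdown()
server.server_close()
print(ports)          # two identical port numbers mean one shared connection
```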

HTTPS

HTTPS (Hypertext Transfer Protocol over Secure Socket Layer) is a secure HTTP channel. In short, it is the secure version of HTTP: HTTP runs on top of an SSL layer, and the Secure Sockets Layer (SSL) is the basis of HTTPS security. The port number used is 443.

SSL: the Secure Sockets Layer is a secure transport protocol designed by Netscape, primarily for use on the Web, where it is widely deployed. It uses certificate authentication to ensure that the data exchanged between the client and the web server is encrypted and secure.
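In Python, certificate checking is configured through the standard `ssl` module, which today negotiates TLS, the successor to SSL. The sketch below only prepares a client-side context and defines a helper; `open_tls` is a hypothetical name, and calling it requires network access to a real HTTPS server.

```python
import socket
import ssl

# The default context verifies the server certificate against the system's
# trusted certificate authorities and checks that it matches the host name.
context = ssl.create_default_context()


def open_tls(host: str, port: int = 443, timeout: float = 10.0) -> ssl.SSLSocket:
    """Open a TCP connection and wrap it in an encrypted, verified channel."""
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        return context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()  # do not leak the plain socket if the handshake fails
        raise


print(context.verify_mode, context.check_hostname)
```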

There are two basic types of encryption and decryption algorithms:

1) Symmetric encryption: there is only one key, and the same key is used to encrypt and decrypt. Encryption and decryption are fast. Typical symmetric encryption algorithms include DES, AES, RC5, and 3DES.

The main problem with symmetric encryption is sharing the secret key. Unless your computer (the client) knows the secret key held by the other computer (the server), the two cannot encrypt and decrypt the traffic between them. The solution to this problem is asymmetric encryption.
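The defining property of symmetric encryption, one key for both directions, can be shown with a toy cipher. The code below is NOT one of the algorithms named above and is not secure; it is a hand-rolled XOR stream built from SHA-256, chosen only because it needs nothing outside the standard library. The names `keystream` and `xor_cipher` are invented for this sketch.

```python
import hashlib
import secrets


def keystream(key: bytes, length: int) -> bytes:
    """Derive a pseudo-random byte stream from the key (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR with the same keystream undoes itself."""
    stream = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


shared_key = secrets.token_bytes(16)   # both sides must already hold this key
ciphertext = xor_cipher(shared_key, b"GET /index.htm")
plaintext = xor_cipher(shared_key, ciphertext)
print(ciphertext.hex(), plaintext)
```

The same function and the same key are used in both directions, which is exactly why the key has to be shared safely beforehand.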

2) Asymmetric encryption: two keys are used, a public key and a private key. The private key is kept secret by one party (usually the server), while the public key can be obtained by anyone.

The keys come in pairs, and the private key cannot feasibly be derived from the public key. Encryption and decryption use different keys: data encrypted with the public key can only be decrypted with the private key, and data encrypted with the private key can only be decrypted with the public key. Asymmetric encryption is slow.
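The key-pair behaviour can be illustrated with textbook RSA on very small numbers. This is a teaching sketch only: the primes, the exponent, and the helper name `apply_key` are chosen for the example, and keys this small offer no security. It needs Python 3.8 or later for the modular inverse form of `pow`.

```python
# Toy RSA key generation with tiny primes (for illustration only).
p, q = 61, 53
n = p * q                    # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                       # public exponent, coprime with phi
d = pow(e, -1, phi)          # private exponent: modular inverse of e

public_key = (e, n)          # can be handed to anyone
private_key = (d, n)         # stays with its owner, usually the server


def apply_key(key, number):
    """Raise number to the key's exponent modulo the key's modulus."""
    exponent, modulus = key
    return pow(number, exponent, modulus)


message = 65
ciphertext = apply_key(public_key, message)      # encrypt with the public key
recovered = apply_key(private_key, ciphertext)   # decrypt with the private key

signature = apply_key(private_key, message)      # "encrypt" with the private key
verified = apply_key(public_key, signature)      # undo it with the public key
print(ciphertext, recovered, verified)
```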

Advantages of HTTPS communication:

  • The key generated by the client can only be obtained by the client and the server.
  • Only the client and the server can recover the plaintext of the encrypted data.
  • Client-to-server communication is secure.
