Preface

HTTP is probably the protocol we use most in daily life and work. When we open a web page in a browser, the data travels over HTTP, and most clients talk to servers over HTTP as well. For those of us in the data collection business, it is everyday fare: Requests and Scrapy are both libraries that wrap HTTP and let us customize how requests are made.

Last year the Internet Engineering Task Force (IETF) proposed renaming HTTP-over-QUIC to HTTP/3. We work in technology, so we need to stay alert to changes like this. Once the HTTP/3 standard is finalized and the major vendors support it, how will it affect us? To answer that, we first need to review the history of HTTP.

HTTP/0.9

Let’s go back to 1991, when Tim Berners-Lee wrote down the first version of the protocol, later known as HTTP/0.9; the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF) only took up HTTP standardization afterwards. HTTP/0.9 supported only GET requests, because the Internet was not yet widely used and both speed and bandwidth were low.

HTTP/1.0

In May 1996, HTTP/1.0 was released, adding a great deal to the protocol. First, request methods diversified: alongside GET, there were now POST and HEAD. Second, content of any format could be sent. Together, these two additions let the Internet carry not only text but also images, video, and binary files, and they enriched the ways browsers and servers could interact. This laid the foundation for the Internet’s rapid growth.

Next, the format of HTTP requests and responses changed. In addition to the data section, every message must now include headers (HTTP headers) that describe its metadata. Other new features include status codes, support for multiple character sets, multipart types, authorization, caching, and content encoding.
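To make the message layout concrete, here is a minimal sketch of how an HTTP/1.0 request is serialized and how a response is split into status line, headers, and body. The function names `build_request` and `parse_response`, the `demo/0.1` user agent, and the sample response are made up for illustration; real clients such as Requests handle many more cases.

```python
def build_request(method, path, headers=None, body=b""):
    """Serialize an HTTP/1.0 request: request line, headers, blank line, body."""
    lines = [f"{method} {path} HTTP/1.0"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


def parse_response(raw):
    """Split a raw response into (version, status code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


# Illustrative messages (not captured from a real server).
request = build_request("HEAD", "/index.html", {"User-Agent": "demo/0.1"})
response = parse_response(
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<html></html>"
)
```

The status code and the `Content-Type` header are exactly the kind of metadata that HTTP/0.9 had no place for.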

However, HTTP/1.0 has drawbacks:

The first is that connections cannot be reused. HTTP/1.0 specifies that the browser and the server keep only a short-lived connection: every browser request has to establish a new TCP connection, and the server closes that connection as soon as it has finished processing the request. The server does not track clients or record past requests. If additional resources are requested, a new connection must be created each time.

The second is head-of-line blocking (HOLB), in which a whole queue of packets is held up because the first one is blocked. When a page requests many resources and the maximum number of concurrent requests is reached, the remaining requests must wait for earlier ones to complete. Bandwidth is underutilized, and subsequent, perfectly healthy requests are blocked.
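A rough back-of-the-envelope model shows what this costs. It assumes one round trip for the TCP handshake and one for each request/response pair, and it ignores connection teardown, TCP slow start, and TLS; `round_trips` is a hypothetical helper, not part of any library.

```python
def round_trips(num_resources, reuse_connection):
    """Estimate round trips needed to fetch resources one after another.

    Simplifying assumptions: 1 round trip per TCP handshake,
    1 round trip per request/response.
    """
    handshakes = 1 if reuse_connection else num_resources
    return handshakes + num_resources


without_reuse = round_trips(10, reuse_connection=False)  # 20
with_reuse = round_trips(10, reuse_connection=True)      # 11
```

Under these assumptions, fetching ten resources over fresh connections spends almost half of its round trips on handshakes alone.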
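The queueing effect can be sketched with a small simulation. It assumes each connection serves one request at a time and that requests are dispatched in page order to whichever connection frees up first; the function name `schedule` and the durations (in milliseconds) are invented for the example.

```python
import heapq


def schedule(durations_ms, max_connections):
    """Return (start, finish) times in ms for requests served in order
    over a fixed pool of connections, one request per connection at a time."""
    free_at = [0] * max_connections  # time at which each connection is free
    heapq.heapify(free_at)
    timeline = []
    for duration in durations_ms:
        start = heapq.heappop(free_at)  # earliest available connection
        finish = start + duration
        heapq.heappush(free_at, finish)
        timeline.append((start, finish))
    return timeline


# Two slow resources occupy both connections; two tiny ones queue behind them.
timeline = schedule([5000, 5000, 100, 100], max_connections=2)
# timeline == [(0, 5000), (0, 5000), (5000, 5100), (5000, 5100)]
```

The two 100 ms requests are perfectly healthy, yet they cannot even start until the 5-second mark.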

HTTP/1.1

To address the problems left by HTTP/1.0, HTTP/1.1 was published as RFC 2068 in January 1997, only about eight months after the 1.0 release. It further refined the protocol, and more than 20 years later it is still in use and still the most popular version. The specific improvements are:

Cache handling. HTTP/1.0 mainly used If-Modified-Since and Expires in the headers to decide whether a cached copy was still valid. HTTP/1.1 introduced more cache control strategies, such as entity tags (ETag), If-Unmodified-Since, If-Match, and If-None-Match.
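As an illustration of entity-tag validation, the sketch below shows a server-side check of If-None-Match. Deriving the tag from a SHA-256 hash of the body is just one possible scheme, `make_etag` and `respond` are hypothetical names, and weak validators (the `W/` prefix) are not handled.

```python
import hashlib


def make_etag(body):
    """Derive a validator from the response body (one possible scheme)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(request_headers, body):
    """Return (status, headers, body), answering 304 when the client's
    cached copy is still current according to If-None-Match."""
    etag = make_etag(body)
    raw = request_headers.get("If-None-Match", "")
    candidates = [tag.strip() for tag in raw.split(",")]
    if etag in candidates or "*" in candidates:
        return 304, {"ETag": etag}, b""  # Not Modified: no body is sent
    return 200, {"ETag": etag}, body


first = respond({}, b"hello")                                   # full response
second = respond({"If-None-Match": first[1]["ETag"]}, b"hello")  # revalidation
changed = respond({"If-None-Match": first[1]["ETag"]}, b"hello, world")
```

The second call returns 304 with an empty body, so the client reuses its cached copy; once the content changes, the tag no longer matches and a full 200 response is sent again.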

Bandwidth optimization and better use of network connections. To cut the overhead of transferring whole resources, HTTP/1.1 introduced the Range request header, which lets a client ask for only part of a resource; the server then answers with status code 206 (Partial Content). This lets developers make fuller use of bandwidth and connections.
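A simplified server-side sketch of range handling follows. It covers only a single `bytes=first-last` or `bytes=first-` range and falls back to a full 200 response for anything else (suffix ranges, multiple ranges, malformed values); `serve_range` is a hypothetical helper, not a library function.

```python
def serve_range(resource, range_header=None):
    """Return (status, headers, body) for an optional single byte range."""
    total = len(resource)
    full = (200, {"Content-Length": str(total)}, resource)
    if not range_header:
        return full
    unit, _, spec = range_header.partition("=")
    first, dash, last = spec.strip().partition("-")
    # Unsupported or malformed forms are ignored and the whole resource is sent.
    if (unit.strip() != "bytes" or not dash or not first.isdigit()
            or (last and not last.isdigit())):
        return full
    start = int(first)
    if last and int(last) < start:
        return full  # invalid range: ignore the header
    if start >= total:
        # Range Not Satisfiable: report the real size so the client can retry.
        return 416, {"Content-Range": f"bytes */{total}"}, b""
    end = min(int(last), total - 1) if last else total - 1
    body = resource[start:end + 1]
    headers = {
        "Content-Range": f"bytes {start}-{end}/{total}",
        "Content-Length": str(len(body)),
    }
    return 206, headers, body


data = bytes(range(100))
status, headers, body = serve_range(data, "bytes=10-19")
# status == 206, headers["Content-Range"] == "bytes 10-19/100", len(body) == 10
```

On the client side, a downloader can resume an interrupted transfer by sending a `Range` header that starts at the number of bytes it already has.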

Error notification management. HTTP/1.1 added 24 new error status codes.

Persistent connections. HTTP/1.1 supports Connection: keep-alive and keeps connections open by default, so a connection can be reused: multiple HTTP requests and responses can be sent over one TCP connection, which reduces the cost and latency of repeatedly opening and closing connections.
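The rule a peer applies to decide whether to keep a connection open can be sketched as follows: an HTTP/1.1 connection stays open unless `Connection: close` is sent, while an HTTP/1.0 peer has to opt in with `Connection: keep-alive`. The function name `is_persistent` is made up, and header lookup is simplified to an exact `Connection` key.

```python
def is_persistent(version, headers):
    """Decide whether the connection stays open after this message."""
    raw = headers.get("Connection", "")
    tokens = {token.strip().lower() for token in raw.split(",")}
    if "close" in tokens:
        return False
    if version == "HTTP/1.1":
        return True  # persistent by default
    if version == "HTTP/1.0":
        return "keep-alive" in tokens  # opt-in extension
    return False


is_persistent("HTTP/1.1", {})                            # True
is_persistent("HTTP/1.1", {"Connection": "close"})       # False
is_persistent("HTTP/1.0", {})                            # False
is_persistent("HTTP/1.0", {"Connection": "Keep-Alive"})  # True
```

In Requests, reusing a single `requests.Session` object for several calls to the same host is what lets the underlying connection pool take advantage of this.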

As the web evolved, HTTP/1.1 began to show limitations of its own.

Although keep-alive lets some connections be reused, techniques such as domain sharding still require multiple connections, which consumes resources and puts performance pressure on the server.

Pipelining only partially solves HOLB. HTTP/1.1 tries to tackle the problem with pipelining, which lets a browser issue several requests at once (to the same domain, over the same TCP connection). But pipelined responses must be returned in order, so if the first request is slow (for example, one that processes a large image), every later response has to wait until the first has been handled.

Protocol overhead is high. With HTTP/1.x, headers carry a lot of content, which raises the cost of every transfer. Moreover, headers change little from one request to the next, so the same bytes are sent over and over, which is especially costly for mobile users’ data traffic.
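The in-order constraint is easy to model. The sketch below assumes, generously, that the server processes all pipelined requests in parallel; even then a response can only be delivered once every earlier response has gone out. `pipelined_delivery` and the timings (in milliseconds) are illustrative.

```python
def pipelined_delivery(processing_ms):
    """Return the time at which each pipelined response can be delivered,
    given that responses must leave in request order."""
    delivered = []
    latest = 0
    for ready_at in processing_ms:
        latest = max(latest, ready_at)  # cannot overtake earlier responses
        delivered.append(latest)
    return delivered


slow_first = pipelined_delivery([900, 20, 30])  # [900, 900, 900]
slow_last = pipelined_delivery([20, 30, 900])   # [20, 30, 900]
```

With the slow request first, the two fast responses are ready after 20 ms and 30 ms but are held back until 900 ms; with the slow request last, nothing is delayed.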

HTTP/2

The development of the Internet is still limited by network speed. As the old quip goes, never underestimate the bandwidth of a truck full of tapes speeding down the highway. When the amount of data is large enough, physical transport can be faster, safer, and more convenient than network transmission.

Google was among the first companies to put forward the concept of cloud computing. Its corporate culture also stresses open and transparent internal information, so everyone can see everyone else’s current work plans, code, and so on. To solve slow data transfer in its internal systems, Google developed its own protocol, SPDY, to minimize network latency, improve speed, and overcome the inefficiency of HTTP/1.1.

SPDY runs on top of TCP, and HTTP/2 was built on it. HTTP/2 transmits data in a binary format, which is more efficient to parse than the text format of HTTP/1.x. It also supports header compression to reduce the size of headers.
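To show what "binary format" means in practice, here is a sketch of the fixed 9-byte header that starts every HTTP/2 frame: a 24-bit payload length, an 8-bit type, 8 bits of flags, and a reserved bit followed by a 31-bit stream identifier. The helper names `pack_frame` and `unpack_frame_header` are made up, and the payload is left uninterpreted (no HPACK compression or flow control).

```python
import struct

FRAME_TYPE_DATA = 0x0
FRAME_TYPE_HEADERS = 0x1
FLAG_END_STREAM = 0x1
FLAG_END_HEADERS = 0x4


def pack_frame(frame_type, flags, stream_id, payload):
    """Serialize one frame: 9-byte header followed by the payload."""
    if len(payload) >= 1 << 24:
        raise ValueError("payload does not fit in the 24-bit length field")
    header = len(payload).to_bytes(3, "big")  # 24-bit length
    # 8-bit type, 8-bit flags, then reserved bit (0) + 31-bit stream id.
    header += struct.pack(">BBI", frame_type, flags, stream_id & 0x7FFFFFFF)
    return header + payload


def unpack_frame_header(data):
    """Parse the first 9 bytes into (length, type, flags, stream_id)."""
    length = int.from_bytes(data[0:3], "big")
    frame_type, flags, stream_id = struct.unpack(">BBI", data[3:9])
    return length, frame_type, flags, stream_id & 0x7FFFFFFF


frame = pack_frame(FRAME_TYPE_DATA, FLAG_END_STREAM, 1, b"hello")
# frame == b"\x00\x00\x05\x00\x01\x00\x00\x00\x01hello"
```

A parser reads fixed offsets instead of scanning text for line breaks, and the stream identifier is what allows frames from different requests to be interleaved on one connection.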

HTTPS Encryption Process

  1. The client sends a request for baidu.com to the server, connecting to the server’s port 443. The message mainly contains random value 1 and the encryption algorithms supported by the client.
  2. After receiving it, the server responds with its handshake message, which contains random value 2 and the negotiated encryption algorithm. That algorithm must be one of those the client offered.
  3. The server then sends a second response packet containing its digital certificate. The server must have a certificate, which can be either self-signed or obtained from a certificate authority. The difference is that a self-signed certificate has to be explicitly accepted on the client before the user can continue, while a certificate issued by a trusted authority does not trigger a warning page. Behind the certificate is a public/private key pair: the private key stays on the server, and the certificate carries the public key along with other information, such as the issuer, the expiration time, the signature of the third-party certificate authority (CA), and the server’s domain name.
  4. The client parses the certificate; this is handled by the client’s TLS implementation. It first verifies that the certificate is valid, checking for example the issuing authority and the expiration time. If something is wrong, a warning box tells the user there is a problem with the certificate. If the certificate is fine, the client generates another random value, the pre-master key.
  5. Once the certificate has been verified, the client assembles the session key from random value 1, random value 2, and the pre-master key. It then encrypts the pre-master key with the certificate’s public key.
  6. The encrypted information is transmitted. What is sent is the pre-master key encrypted with the certificate’s public key, so that only the server, using its private key, can recover it. Random values 1 and 2 were already exchanged in the earlier handshake messages.
  7. The server decrypts the pre-master key with its private key and, together with random value 1 and random value 2, assembles the session key, which is identical to the client’s session key.
  8. The client encrypts a message with the session key and sends it to the server to verify that the server can decrypt and accept it.
  9. The server likewise encrypts a message with the session key and sends it back. If the client can decrypt and accept it, the SSL connection is established.
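The key assembly in the steps above can be sketched with the standard library. The example follows the TLS 1.2 pseudorandom function (HMAC-SHA256 expansion with the label "master secret") and shows that both sides arrive at the same 48-byte secret from the two random values and the pre-master key. It is a simplification: the public-key encryption of the pre-master key is omitted, the pre-master key is a random stand-in, and real TLS expands this master secret further into the actual encryption and MAC keys. The function names are hypothetical.

```python
import hashlib
import hmac
import os


def p_sha256(secret, seed, length):
    """TLS 1.2 style P_hash expansion built on HMAC-SHA256."""
    output = b""
    a = seed  # A(0)
    while len(output) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()  # A(i)
        output += hmac.new(secret, a + seed, hashlib.sha256).digest()
    return output[:length]


def derive_master_secret(pre_master, client_random, server_random):
    """Combine the pre-master key with both random values."""
    seed = b"master secret" + client_random + server_random
    return p_sha256(pre_master, seed, 48)


# Stand-in values; in a real handshake these come from steps 1, 2 and 4.
client_random = os.urandom(32)  # random value 1
server_random = os.urandom(32)  # random value 2
pre_master = os.urandom(48)     # pre-master key, sent encrypted in step 6

client_side = derive_master_secret(pre_master, client_random, server_random)
server_side = derive_master_secret(pre_master, client_random, server_random)
```

Because the derivation is deterministic, the key itself never has to cross the network; an eavesdropper sees both random values but cannot reproduce the result without the pre-master key.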