Preface

I was once asked in an interview: "How much do you know about HTTP?" The question threw me. I thought of HTTP caching, of HTTP response status codes, of the differences between HTTP and HTTPS, and so on. But there is no denying that all of this was scattered knowledge with no systematic framework behind it. So I recently studied the subject systematically and am recording it here. The next time you are asked how much you know about HTTP, you will have a clear answer.

This is one of a series of study notes. Since the content is long, each installment comes with a corresponding mind map to make the structure clearer.

The birth of HTTP

In 1989, Tim Berners-Lee, who worked at CERN, the European Organization for Nuclear Research, published a paper outlining the idea of a hyperlinked document system on the Internet. In this paper he identified three key technologies:

  1. URI: Uniform Resource Identifier, the unique identity of a resource on the Internet;
  2. HTML: HyperText Markup Language, used to describe hypertext documents;
  3. HTTP: HyperText Transfer Protocol, used to transfer hypertext.

These three technologies seem commonplace to us today, but they were great inventions at the time. With them, the hypertext system Tim called the "World Wide Web," or what we now know as the Web, could work perfectly on the Internet, allowing people everywhere to freely share information.

So it was in that year that our hero, HTTP, was born and began its great journey.

History of HTTP

1. HTTP/0.9

The world of the Internet in the early 1990s was very humble: processing power was low, storage capacity was small, and network speeds were slow. Most of the resources on the Web were plain text, and many communication protocols used plain text too, so the design of HTTP was inevitably constrained by the times.

HTTP at this stage was a simple plain-text protocol with only a GET command; it could only fetch text resources, and the server closed the connection after sending the content.
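
To make that concrete, here is a minimal sketch of what an HTTP/0.9 exchange looked like on the wire, written in Python. Most modern servers no longer speak HTTP/0.9, so the host below is purely illustrative.

    # HTTP/0.9 on the wire: one GET line, no version, no headers.
    # Assumption: "example.com" is only a placeholder; few servers still
    # accept 0.9-style requests.
    import socket

    with socket.create_connection(("example.com", 80)) as sock:
        # The entire request is a single line: the command and a path.
        sock.sendall(b"GET /index.html\r\n")
        # The server replies with the raw document body (no status line,
        # no headers) and then closes the connection.
        response = b""
        while chunk := sock.recv(4096):
            response += chunk

    print(response.decode(errors="replace"))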

2. HTTP/1.0

As the Internet developed, the Mosaic browser and the Apache server software appeared, and multimedia technologies such as JPEG and MP3 matured, so user needs made it imperative for HTTP to evolve.

HTTP/1.0 was released in 1996. It enhanced version 0.9 in many ways, making it not much different from the HTTP we use today.

  1. New methods such as HEAD and POST were added;
  2. Status codes were added to indicate the possible causes of errors;
  3. The concept of the HTTP header was introduced, making HTTP more flexible in handling requests and responses;
  4. Data transmission was no longer limited to text (a wire-format sketch follows this list).
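
As a sketch of those additions, the request below carries a version number and headers, and the response begins with a status line. The host is illustrative.

    # An HTTP/1.0-style exchange: the request names its version and carries
    # headers; the response opens with a status line such as "HTTP/1.0 200 OK".
    # Assumption: "example.com" is only a placeholder host.
    import socket

    request = (
        "GET / HTTP/1.0\r\n"       # the version number is new in 1.0
        "Host: example.com\r\n"    # headers are new in 1.0
        "Accept: text/html\r\n"
        "\r\n"                     # a blank line ends the header block
    )

    with socket.create_connection(("example.com", 80)) as sock:
        sock.sendall(request.encode())
        response = b""
        while chunk := sock.recv(4096):
            response += chunk

    # Print just the status line.
    print(response.split(b"\r\n")[0].decode())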

HTTP/1.0 established most of the technologies still in use today, but it was not a formal standard and had no actual binding force. So the release of HTTP/1.0 did not mean much in practice for the booming Internet of the time, and the various players went on competing in the market as they saw fit.

3. HTTP/1.1

After the browser wars died down, HTTP/1.1 was published as an RFC in 1999 (RFC 2616), marking it as a formal standard rather than an optional reference document. This meant that all browsers, servers, gateways, and other parties on the Internet that used HTTP had to adhere to this standard.

The main changes to HTTP/1.1 are:

  1. New methods such as PUT and DELETE were added;
  2. Cache management and control were added;
  3. Connection management was made explicit, allowing persistent connections (see the sketch after this list);
  4. Chunked transfer was allowed, making large file transfers easier.
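
Here is a small sketch of persistent connections using Python's standard library, which speaks HTTP/1.1: two requests reuse one TCP connection, avoiding a second handshake. The host is illustrative.

    # Persistent connections in HTTP/1.1: http.client keeps the TCP
    # connection open, so a second request can reuse it.
    # Assumption: "example.com" is only a placeholder host.
    import http.client

    conn = http.client.HTTPConnection("example.com")

    conn.request("GET", "/")
    resp = conn.getresponse()
    resp.read()                    # drain the body so the socket can be reused
    print(resp.status, resp.reason)

    # Second request over the same TCP connection — no new handshake.
    conn.request("GET", "/")
    resp = conn.getresponse()
    print(resp.status, resp.reason)

    conn.close()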

The introduction of HTTP/1.1 came right on time. Escorted by it, the Internet took a big step forward onto a broad road, opening up the subsequent "Web 1.0" and "Web 2.0" eras. Many well-known websites were founded around this point in time, such as Google, Sina, Sohu, NetEase, and Tencent.

4. HTTP/2

Plain HTTP connections were slow and could not keep up with the rapid development of the Internet, so people had no choice but to optimize performance in all sorts of ways, such as the familiar sprite images, combined JS resources, and so on.

Finally, Google could not stand it any longer and decided to shake things up. It first developed its own browser, Chrome, then launched the new SPDY protocol and deployed it in Chrome and on its own servers, "forcing" a revolution of the HTTP protocol from the user side.

Chrome now holds more than 60% of the global browser market, which pushed SPDY to de facto standard status. The Internet standards body, the IETF, began to define a new version of HTTP based on SPDY and finally released HTTP/2 in 2015.

The design of HTTP/2 takes full account of the current state of the Internet: broadband, mobile, insecurity, and so on. While remaining highly compatible with HTTP/1.1, it goes to great lengths to improve performance. Its main features are listed below, followed by a small request sketch:

  1. All data is transmitted in binary;
  2. Multiple requests can be sent within the same connection, no longer bound to a strict order (multiplexing);
  3. A dedicated algorithm compresses the headers to reduce the amount of data transmitted;
  4. The server is allowed to actively push data to the client;
  5. Security is enhanced; in practice, encrypted communication is required.
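
A sketch of making an HTTP/2 request with the third-party httpx library (installed with pip install "httpx[http2]"); the URL is illustrative, and the server may still answer over HTTP/1.1 if it does not support HTTP/2.

    # An HTTP/2 request via httpx. The protocol is negotiated over TLS (ALPN),
    # so http_version reports what was actually agreed on.
    # Assumption: httpx is installed with its http2 extra.
    import httpx

    with httpx.Client(http2=True) as client:
        resp = client.get("https://nginx.org/en/download.html")
        print(resp.http_version, resp.status_code)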

5. HTTP/3

While HTTP/2 was still in draft form, Google invented a new protocol called QUIC and, relying on its massive user base and traffic volume, kept pushing QUIC toward becoming a "fait accompli" on the Internet.

In 2018, the Internet standards body, the IETF, approved renaming "HTTP over QUIC" to "HTTP/3". HTTP/3 is now formally in the standardization stage and will probably be released within two or three years, at which point we may well skip HTTP/2 and move straight to HTTP/3.

What is HTTP

With the history of HTTP laid out in detail, it may be easier to understand what HTTP is. To put it simply:

HTTP is the HyperText Transfer Protocol.

But that is too simple; let's dig a little deeper. There is a Chinese idiom, "ren ru qi ming" (a person lives up to their name), meaning that a person's character and traits match their name. Let's take a look at HTTP's name: "HyperText Transfer Protocol". It can be broken down into three parts: "hypertext", "transfer", and "protocol". Once we understand these three words, we know what HTTP is.

1. "Protocol"

HTTP is a protocol, and a protocol is a behavioral convention and specification for participants.

HTTP is a protocol used in the computer world. It uses a language that computers can understand to establish a specification for communication between computers, as well as related controls and error handling.

2. "Transfer"

A transfer is a movement of data from point A to point B, or from point B to point A, and this contains two important pieces of information:

  1. HTTP is a "two-way protocol";
  2. Although data is transferred between A and B, nothing restricts the exchange to A and B alone; "relays" or intermediaries are allowed in between.

In this way, the transfer changes from "A<===>B" to "A<=>X<=>Y<=>Z<=>B". Any number of "middlemen" can take part in the transfer from A to B, and as long as these middlemen also comply with the HTTP protocol and do not disturb the basic data transfer, they can add any extra functions, such as security authentication, data compression, and encoding conversion, optimizing the whole transfer process. To sum up:

HTTP is a convention and specification for transferring data between two points in the computer world.

3. “Hypertext”

By "text", HTTP does not mean the opaque datagrams shuttled around by lower-level protocols such as TCP/UDP, but complete, meaningful data that can be processed by upper-level applications such as browsers and servers.

In the early days of the Internet, "text" was just simple character text, but today the meaning of "text" has been greatly expanded: pictures, audio, video, and more can all count as "text" in the eyes of HTTP.

Hypertext is “more than plain text”. It is a mixture of text, pictures, audio, video, etc. It can also contain “hyperlinks” to jump from one “hypertext” to another.


To summarize what HTTP is:

HTTP is a convention and specification for the transfer of hypertext data, such as text, pictures, audio, and video, between two points in the computer world.

Some concepts related to HTTP

Today’s Internet is a powerful and complex network, and it’s important to understand some of the concepts associated with HTTP.

1. CDN

The client and the server are the two endpoints of the HTTP protocol. The client usually does not connect to the server directly; traffic passes through "numerous checkpoints" along the way, and one important role among them is the CDN.

CDN stands for "Content Delivery Network". It applies the caching and proxy technologies of the HTTP protocol to answer clients' requests on behalf of the origin site. So what are the benefits of a CDN?

In simple terms, it caches data from the origin site, letting the browser fetch it "halfway along the road" rather than travelling all the way to the origin server. If the CDN's scheduling algorithm is good, it can find the node nearest to the user and greatly shorten the response time.
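
One way to see a CDN (or any intermediate cache) at work is to look at the response headers it adds. Which headers appear depends entirely on what sits in front of the site; the URL below is illustrative.

    # Inspect headers that caches and CDNs commonly add, such as "Age"
    # (seconds an object sat in a cache) or "Via" (intermediaries traversed).
    # Assumption: the URL is a placeholder; headers vary by provider.
    import urllib.request

    with urllib.request.urlopen("https://nginx.org/en/download.html") as resp:
        for name in ("Age", "Via", "X-Cache", "Cache-Control"):
            value = resp.headers.get(name)
            if value:
                print(f"{name}: {value}")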

2. Crawlers

A browser is a user agent that accesses the Internet on our behalf. But the HTTP protocol does not require the user agent to be driven by a "real human"; it can also be a "robot". These robots are formally known as "crawlers" and are essentially applications that automatically access Web resources.

The name "crawler" fits well: like tireless ants, they crawl across the endless Web, moving from site to site and collecting all kinds of information. It is estimated that crawlers generate at least 50 percent of all traffic on the Internet.

Where do crawlers come from?

The vast majority are released by the major search engines. They crawl web pages into huge databases and then build keyword indexes, so that we can quickly search for content from every industry on the Internet through those search engines.

Crawlers also have a bad side. They can consume excessive network resources, occupy servers and bandwidth, distort the analysis of real traffic, and even leak sensitive information. Hence "anti-crawler" techniques that restrict crawlers through various means. One of them is robots.txt, a "gentleman's agreement" about what may and may not be crawled, as sketched below.
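
A well-behaved crawler can honor that agreement with Python's standard library; the site and user-agent name here are illustrative.

    # Check robots.txt before fetching a page. RobotFileParser downloads
    # and parses the file, then answers per-path questions.
    # Assumption: the URL and the "demo-crawler" agent name are placeholders.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://nginx.org/robots.txt")
    rp.read()

    allowed = rp.can_fetch("demo-crawler", "https://nginx.org/en/download.html")
    print("allowed" if allowed else "disallowed")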

3. DNS

The full name of DNS is "Domain Name System".

In the TCP/IP protocol suite, IP addresses are used to identify computers. Numeric addresses are convenient for computers but hard for humans to remember and use. Then came the Domain Name System, which essentially uses meaningful names as equivalent substitutes for IP addresses.

In DNS, domain names, also known as host names, are designed as a hierarchical structure to better mark hosts in different countries and organizations and to make names easier to remember.

A domain name is separated into multiple words by periods (.). The level rises from left to right, and the rightmost part is called the top-level domain. You can probably name a few top-level domains offhand, such as "com" for commercial companies, "edu" for educational institutions, and "cn" and "uk" for countries. Remember the domain name for buying train tickets in China? It is "www.12306.cn".

However, to communicate over TCP/IP you still need an IP address, so the domain name must be translated, that is, mapped to its real IP address. This is called domain name resolution.

To use the analogy of making a phone call: you want to call Xiao Ming but do not know his number, so you search the contact book on your phone until you find Xiao Ming's entry and look up the number. Here "Xiao Ming" is the domain name, the phone number is the IP address, and the lookup process is domain name resolution.
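
In code, resolution is usually a single call to the system resolver, which consults DNS for Internet names. A minimal sketch with the standard library; the host name is illustrative and the addresses returned will vary.

    # Map a host name to its IP addresses via the system resolver.
    # Assumption: "nginx.org" is just an example host.
    import socket

    for family, _, _, _, sockaddr in socket.getaddrinfo("nginx.org", 80):
        print(family.name, sockaddr[0])   # address family and IP address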

The HTTP protocol does not explicitly require the use of DNS, but in practice, to make Web servers on the Internet easy to reach, host names are usually located through DNS, which indirectly binds DNS and HTTP together.

4. URL

With TCP/IP and DNS, do we have access to everything on the web?

Not yet. DNS and IP addresses only mark hosts on the Internet, but a host holds countless texts, pictures, and pages. Which one do you want?

Hence the Uniform Resource Locator, or URL, which pins down a specific resource on a host.

I will take the Nginx site as an example to see what a URL looks like:


    http://nginx.org/en/download.html

As you can see, a URL consists of three basic parts:

  1. Protocol name: the protocol to be used to access the resource, in this case "http";
  2. Host name: the mark of a host on the Internet, which can be a domain name or an IP address, in this case "nginx.org";
  3. Path: the location of the resource on the host, with slashes (/) separating directory levels, in this case "/en/download.html" (a parsing sketch follows this list).
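
These parts can be pulled out programmatically; a small sketch with Python's standard library:

    # Split a URL into its basic parts.
    from urllib.parse import urlparse

    parts = urlparse("http://nginx.org/en/download.html")
    print(parts.scheme)   # "http"              — the protocol name
    print(parts.netloc)   # "nginx.org"         — the host name
    print(parts.path)     # "/en/download.html" — the path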

Back to the phone-book analogy: I find Xiao Ming through the contact book and ask him to courier over the publicity copy I finished yesterday. In this process I have completed a URL-style resource access: "Xiao Ming" is the "host name", "yesterday's publicity copy" is the "path", and "courier" is the "protocol name" used to access the resource.

5. Proxy

A proxy is a link between the requester and the responder in the HTTP protocol. It acts as a "relay station", forwarding requests from clients and responses from servers.

There are many kinds of proxies. Common ones include:

  1. Anonymous proxy: completely "hides" the proxied machine; the outside world sees only the proxy server;
  2. Transparent proxy: as the name implies, it is "transparent and open" during transmission; the outside world knows about both the proxy and the client;
  3. Forward proxy: sits close to the client and sends requests to the server on its behalf;
  4. Reverse proxy: sits close to the server and responds to client requests on its behalf.

The CDN mentioned earlier is actually a kind of proxy: it responds to client requests on behalf of the origin server, usually acting as a transparent proxy and a reverse proxy.

Since a proxy inserts an "intermediate layer" into the transfer, many interesting things can be done at this layer (a client-side sketch follows the list), such as:

  1. Load balancing: evenly distribute access requests across multiple machines to form an access cluster;
  2. Content caching: temporarily store upstream and downstream data to reduce pressure on the back end;
  3. Security protection: hide IP addresses and use tools such as a WAF to fend off network attacks and protect the proxied machines;
  4. Data processing: provide additional functions such as compression and encryption.
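
From the client side, routing traffic through a forward proxy is a short exercise with the standard library. The proxy address below is a placeholder assumption; substitute one you actually run.

    # Send a request through a forward proxy instead of connecting directly.
    # Assumption: a proxy is listening at 127.0.0.1:8080 (placeholder).
    import urllib.request

    handler = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8080"})
    opener = urllib.request.build_opener(handler)

    with opener.open("http://nginx.org/en/download.html") as resp:
        # Many proxies record themselves in the "Via" response header.
        print(resp.status, resp.headers.get("Via", "no Via header"))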