1. History of HTTP

Before learning about the web, knowing its history helps me understand why it developed into what it is today, and makes me more interested in exploring it. The following picture shows the development of the Internet since its birth.

2. What is HTTP?

HTTP stands for HyperText Transfer Protocol. The name breaks down into three parts:

  1. Hypertext: a mixture of text, images, video, audio, and so on, linked together. The most familiar example is HTML.
  2. Transfer: HTTP is a “two-way protocol” that transfers data between a requester and a responder. There is no restriction on who plays either role, and any number of “middlemen” may sit between them during transfer.
  3. Protocol: a protocol is an agreement and specification for communication between two or more participants. HTTP can therefore be understood as a specification for communication between computers, written in a form computers can understand, together with the related control and error-handling methods.

To sum up, HTTP is a convention and specification for transferring hypertext data, such as text, pictures, audio, and video, between two points in the computer world.

3. Some concepts related to HTTP

Web Browser: The browser is essentially the requester in HTTP, using the HTTP protocol to obtain resources on the network. In HTTP, the browser’s role is called the “User Agent”, meaning that it acts as the visitor’s agent when initiating HTTP requests. Below are some of the major browsers and their engines.

Web Server: As hardware, a server is a machine in physical or “cloud” form. As software, a web server is an application that provides web services, usually running on a hardware server. It uses that hardware’s capacity to respond to large volumes of client HTTP requests and return dynamic information. Common web servers include Apache and Nginx.

Content Delivery Network (CDN): A CDN is a network service created to solve the problem of slow access over long distances. Its core principle is “nearest access”: using the proxy and caching mechanisms of HTTP, users do not visit the origin website directly but the nearest CDN node, which saves time on each access. CDNs commonly also provide load balancing, security protection, and edge computing.

Crawler: A user agent in the form of a “robot”, an application that automatically accesses Web resources.

HTML (HyperText Markup Language): A markup language for describing hypertext pages. It uses tags to define images, text, and layout, which the browser then renders.

Web Service: A specification for developing application services, defined by the W3C. It uses a client-server (master-slave) architecture and is a service architecture technology built on the Web (HTTP).

WAF: A Web Application Firewall sits in front of the web server, inspects HTTP traffic, and protects web applications. It can block attacks such as SQL injection and cross-site scripting, and can be integrated into Apache or Nginx.

TCP/IP: A family of network communication protocols, the core of which are TCP and IP. Other members include UDP, ICMP, and ARP; together they form a complex but layered protocol stack. The Internet Protocol (IP) mainly handles addressing and routing, and carries data packets between two points. The Transmission Control Protocol (TCP) is built on top of IP. It provides reliable byte-stream communication and is the basis of HTTP. HTTP on the Internet runs on TCP/IP, so it is more accurately called “HTTP over TCP/IP”.

DNS: The Domain Name System uses meaningful names as equivalent substitutes for IP addresses. In DNS, a domain name is also called a host name. A domain name is a series of words separated by “.”, and the rightmost word is called the top-level domain. However, TCP/IP communication still needs IP addresses, so the domain name has to be translated, or “mapped”, to its real IP address. This is called “domain name resolution”.

URI/URL: URI stands for Uniform Resource Identifier. DNS names and IP addresses only identify hosts on the Internet, whereas URIs uniquely identify resources on the Internet. The most common form of URI is the Uniform Resource Locator (URL). A URL is actually a subset of URI, but the two are usually not strictly distinguished.

A URI consists of three basic parts:

  • Protocol name: Indicates the protocol used to access the resource
  • Host name: The mark of a host on the Internet. It can be a domain name or an IP address
  • Path: indicates the location of the resource on the host. Use slashes (/) to separate multi-level directories

HTTPS: HTTP over SSL/TLS, that is, HTTP running on top of the SSL/TLS protocol. SSL/TLS is a security protocol responsible for encrypting communication. It is built on TCP/IP, so it is also a reliable transport and can serve as the layer underneath HTTP. HTTPS is therefore equivalent to “HTTP + SSL/TLS + TCP/IP”.

Proxy: An intermediate link between the requester and the responder in the HTTP protocol. Acting as a relay station, it can forward requests from clients as well as responses from servers.

There are many kinds of proxies. The common ones are:

  • Anonymous proxy: completely “hides” the proxied machine, so the outside world sees only the proxy server.
  • Transparent proxy: as the name implies, it is “transparent and open” during transfer, so the outside world knows about both the proxy and the client.
  • Forward proxy: sits close to the client and sends requests to the server on behalf of the client.
  • Reverse proxy: sits close to the server and responds to client requests on behalf of the server.

4. Hierarchical model of network

The layers of a network model are numbered from bottom to top. The model we meet most often is the four-layer TCP/IP model, which is also a relatively early layered model.

  • The first layer is the link layer, which is responsible for sending raw packets over the underlying network. It works at the network interface card (NIC) level and uses MAC addresses to identify devices on the network, so it is sometimes called the MAC layer. It corresponds to the “data link layer” of the OSI model.
  • The second layer is the internet layer, where the IP protocol lives. Because IP defines the concept of an “IP address”, IP addresses can be used in place of MAC addresses on top of the link layer, and an IP address is translated into a MAC address only when a device has to be found on the network. It corresponds to the “network layer” of the OSI model.
  • The third layer is the transport layer, which is responsible for transferring data between two points identified by IP addresses. TCP, which provides reliable transfer, and UDP both work at this layer. It corresponds to the “transport layer” of the OSI model.
  • The fourth layer is the application layer. Because the three layers below lay the groundwork so well, this layer is “in full bloom” with all kinds of application-specific protocols, such as Telnet, SSH, FTP, SMTP, and of course HTTP. It corresponds to the “session layer”, “presentation layer”, and “application layer” of the OSI model.

When the TCP/IP protocol family is used for network communication, data passes through the layers in order: on the sending side it travels down from the application layer to the link layer, and on the receiving side it travels up from the link layer to the application layer.
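The down-then-up journey through the layers can be pictured with a toy model. The sketch below is only an illustration: the bracketed labels stand in for real protocol headers, and the function names `send` and `receive` are made up for this example.

```python
# Toy model of layering: each layer wraps the payload with a label that
# stands in for its real header. Labels and function names are hypothetical.
LAYERS = ["HTTP", "TCP", "IP", "MAC"]  # application -> transport -> internet -> link


def send(payload: str) -> str:
    """Walk down the stack: each layer prepends its own header."""
    for layer in LAYERS:
        payload = f"[{layer}]{payload}"
    return payload


def receive(frame: str) -> str:
    """Walk up the stack: each layer strips its own header."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]"
        if not frame.startswith(header):
            raise ValueError(f"expected {header} header")
        frame = frame[len(header):]
    return frame


frame = send("GET / HTTP/1.1")
print(frame)           # [MAC][IP][TCP][HTTP]GET / HTTP/1.1
print(receive(frame))  # GET / HTTP/1.1
```

The outermost header belongs to the lowest layer, which is why the receiver has to peel the headers off from the link layer upward.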

5. The domain name

A domain name has a hierarchical structure: it is a string of words separated by “.”. The rightmost word is the top-level domain, followed by the second-level domain, with the level descending toward the left. The host name on the far left usually indicates the purpose of the host, such as “www” for web services and “mail” for mail services, although this is not absolute.

The following example illustrates the hierarchical relationship between the protocol, the host name, and the domain name. Domain names are like human names: the point of a name is to be easy for us to remember. In addition to identifying a host, a domain name can also stand in for its IP address.
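As a small sketch of the right-to-left hierarchy, the snippet below splits a domain name into its levels. The helper name `domain_levels` and the name `www.example.com` are placeholders chosen for illustration.

```python
def domain_levels(name: str) -> list[str]:
    """Return the labels of a domain name, top-level domain first."""
    labels = name.rstrip(".").split(".")
    return list(reversed(labels))


levels = domain_levels("www.example.com")
print(levels)  # ['com', 'example', 'www']
```

Reading the result from left to right gives the top-level domain, the second-level domain, and finally the host name.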

6. DNS

We usually visit websites by domain name, but resources on the network are actually located by IP address. A domain name must be resolved to an IP address before the resource can be fetched. DNS is the protocol that turns domain names into IP addresses.

The core of DNS is a three-layer, tree-structured distributed service, which basically mirrors the structure of a domain name:

  • Root DNS server: manages the top-level DNS servers and returns the IP addresses of the top-level DNS servers for domains such as “com”, “net”, and “cn”.
  • Top-level DNS server: manages the authoritative DNS servers for the domain names under it. For example, the cn top-level DNS server returns the IP address of the DNS server for 123.cn.
  • Authoritative DNS server: manages the IP addresses of the hosts under its own domain name. For example, the 123.cn authoritative DNS server returns the IP address of www.123.cn.

Although the DNS service is spread all over the world and has great capacity, every Internet user in the world uses it, which puts great pressure on the servers. In addition to the core DNS system, there are two ways to reduce the load of domain name resolution and get results faster. The basic idea behind both is “caching”.

First, resolution results can be cached on non-core DNS servers, such as the DNS servers that large companies run themselves. Second, they can be cached by the operating system, or fixed in the hosts file. Much of the resolution work then never reaches the core DNS servers and is handled locally or on a nearby machine, which is convenient for users, reduces the pressure on DNS servers at every level, and greatly improves efficiency.
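The caching idea can be sketched as a thin wrapper around whatever actually performs the lookup. Everything in this sketch is an assumption made for illustration: the class name `CachingResolver`, the 60-second lifetime, and the fake upstream function that returns a documentation-only address. Real resolvers take the lifetime from the DNS record itself.

```python
import time


class CachingResolver:
    """Remember answers from an upstream resolver for a limited time."""

    def __init__(self, upstream, ttl_seconds=60.0, clock=time.monotonic):
        self._upstream = upstream   # callable: name -> IP address string
        self._ttl = ttl_seconds     # assumed fixed lifetime for every entry
        self._clock = clock
        self._cache = {}            # name -> (ip, expires_at)

    def resolve(self, name):
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]         # cache hit: no upstream query needed
        ip = self._upstream(name)   # cache miss: ask the upstream server
        self._cache[name] = (ip, now + self._ttl)
        return ip


calls = []


def fake_upstream(name):
    """Stand-in for a real DNS query; records each call it receives."""
    calls.append(name)
    return "192.0.2.10"             # address reserved for documentation


resolver = CachingResolver(fake_upstream)
first = resolver.resolve("www.example.com")
second = resolver.resolve("www.example.com")
print(first, second, len(calls))    # 192.0.2.10 192.0.2.10 1
```

The second lookup is answered from the cache, so the upstream server is queried only once.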

Based on the domain name and DNS server, we can implement redirection. The domain name replaces the IP address, so the external domain name can be unchanged, but the host IP address can be arbitrarily changed. If a host needs to be offline or migrated, you can change the DNS record to point the domain name to another host.

We have all heard of load balancing. DNS can perform load balancing during the domain name resolution phase.

  • First, domain name resolution can return multiple IP addresses, so that one domain name corresponds to multiple hosts. After receiving the addresses, a client can use a round-robin algorithm to send requests to the servers in turn.
  • Second, the DNS server can be configured with an internal policy so that resolution returns the host closest to the client, or the host with the best service quality. In this way the DNS server itself distributes requests across different servers.
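The first approach, client-side round-robin over several returned addresses, might look like the following minimal sketch. The three addresses are documentation-only placeholders, not real servers.

```python
from itertools import cycle

# Hypothetical result of resolving one domain name to several hosts.
addresses = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]

# cycle() yields the addresses in order and starts over after the last one.
next_address = cycle(addresses)

picked = [next(next_address) for _ in range(5)]
print(picked)
# ['192.0.2.1', '192.0.2.2', '192.0.2.3', '192.0.2.1', '192.0.2.2']
```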

7. HTTP

HTTP is the HyperText Transfer Protocol, a convention and specification for transferring hypertext data such as text, pictures, audio, and video between two points in the computer world. Having learned the layered network model, we know that HTTP is an application-layer protocol. From here on we formally dig into the world of HTTP (based on HTTP/1.1).

The HTTP message

HTTP request messages and response messages have basically the same structure, which consists of three parts:

  • Start line: describes basic information about the request or response.
  • Header fields: describe the message in more detail in key-value format.
  • Entity (body): the data actually being transferred. It need not be plain text and can be binary data such as pictures and videos.

The request line

The request line describes how the client wants to operate on the server’s resource. It consists of three parts: the request method, the request target, and the protocol version. The parts are separated by single spaces, and the line ends with a CRLF line break.

The status line

The status line describes the result of the server’s handling of the client’s request. It consists of three parts: the protocol version, the status code, and a short reason phrase.
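Both kinds of start line can be split on spaces. The helpers below are a minimal sketch with made-up names (`parse_request_line`, `parse_status_line`) and do no validation beyond the split itself.

```python
def parse_request_line(line: str):
    """Split 'METHOD target VERSION' into its three parts."""
    method, target, version = line.rstrip("\r\n").split(" ", 2)
    return method, target, version


def parse_status_line(line: str):
    """Split 'VERSION code reason'; the reason phrase may contain spaces."""
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    return version, int(code), reason


print(parse_request_line("GET /index.html HTTP/1.1\r\n"))
# ('GET', '/index.html', 'HTTP/1.1')
print(parse_status_line("HTTP/1.1 404 Not Found\r\n"))
# ('HTTP/1.1', 404, 'Not Found')
```

Limiting the split to two separators keeps a multi-word reason phrase such as “Not Found” in one piece.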

Header fields

The request line or status line, plus the set of header fields, makes up the complete request or response header of an HTTP message. Apart from the start line, request headers and response headers have essentially the same structure. HTTP header fields are very flexible: you can use the headers already defined in the standard, such as Host and Connection, and you can also add any custom header. However, there are a few things to note when using header fields:

  • Field names are case-insensitive. For example, Host can also be written as host, but capitalizing the first letter is more readable.
  • Spaces are not allowed in field names. Hyphens (-) are fine, but underscores (_) are best avoided. For example, “test-name” is a valid field name and “test name” is not. “test_name” is syntactically legal, but some servers ignore or reject field names that contain underscores.
  • The field name must be followed immediately by a colon (:), with no space between them. The field value after the colon may be preceded by spaces.
  • The order of fields generally carries no meaning, so fields can be arranged in any order without affecting the semantics.
  • A field should not appear more than once unless its own semantics allow it, as with Set-Cookie.
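Several of these rules show up directly in a simple header parser. The sketch below is an illustration under assumptions, not a complete implementation: the function name `parse_headers` is made up, names are lower-cased to make lookups case-insensitive, and repeated fields are collected into a list.

```python
def parse_headers(block: str) -> dict[str, list[str]]:
    """Parse CRLF-separated 'Name: value' lines into a dict of lists."""
    headers: dict[str, list[str]] = {}
    for line in block.split("\r\n"):
        if not line:
            continue  # skip the empty piece after the final CRLF
        name, sep, value = line.partition(":")
        # Reject lines with no colon, an empty name, or whitespace in the name.
        if not sep or not name or any(ch.isspace() for ch in name):
            raise ValueError(f"malformed header line: {line!r}")
        # Lower-case the name so lookups ignore case; keep repeated fields.
        headers.setdefault(name.lower(), []).append(value.strip())
    return headers


block = "Host: www.example.com\r\nSet-Cookie: a=1\r\nset-cookie: b=2\r\n"
headers = parse_headers(block)
print(headers["host"])        # ['www.example.com']
print(headers["set-cookie"])  # ['a=1', 'b=2']
```

A line such as `Host : www.example.com`, with a space before the colon, makes this parser raise `ValueError`.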

HTTP request methods

HTTP/1.1 currently defines eight methods, and the method names must be written in uppercase. Here is a look at them.

The following are the methods we use most often, so they are worth learning properly.

  • GET and HEAD

    1. GET requests a resource from the server and generally carries its data in the URL.
    2. HEAD is like a simplified GET. When a server receives a HEAD request, it returns only the response header, which is exactly the same as the one GET would return, without the body.
  • POST and PUT

    1. POST sends data to the server, carrying it in the body. It usually means “create”.
    2. PUT is similar to POST and also submits data to the server. It usually means “update”.
  • GET and POST

A question that comes up particularly often here is the difference between GET and POST, so I want to write about it in some detail. The following is based on my personal understanding.

  1. Size: GET usually carries its data in the URL, whereas POST puts it in the body. (That is the semantics described in the RFC. Syntactically, GET can also carry data in a body, and POST can also put parameters in the URL.) Because browsers limit URL length, the data carried by a GET request is generally smaller than 2 KB. Chrome has raised its URL length limit to 2 MB, but for compatibility the URL length should stay within the smallest of the maximum limits (IE’s limit is 2 KB). Besides browser limits, server limits should also be taken into account.
  2. Safety: “Safe” here means that the request method does not change resources on the server. Because GET is read-only, the data on the server is safe as long as the server does not “misinterpret” the client’s request. POST is not safe, because it can add, delete, and modify data on the server.
  3. Idempotency: An operation is idempotent if repeating it many times has the same effect as doing it once. Because GET only reads resources on the server, it is idempotent. POST is defined in the RFC as “add or submit data”, and submitting data several times creates several resources, so it is not idempotent. (PUT, by contrast, is “replace or update data”. Updating a resource several times leaves it in the same state, so PUT is idempotent.)
  4. Caching: This is about whether the response to a method can be cached. Most browser implementations cache only GET. Because GET only reads, the data it returns can be cached. POST is not idempotent, so repeating it may have a different effect each time, and its responses are generally not cached.
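The idempotency point can be seen with a tiny in-memory stand-in for a server. The `Store` class and its method names are hypothetical and only mirror the semantics described above: POST creates a new resource each time, while PUT writes to a fixed identifier.

```python
class Store:
    """In-memory stand-in for server-side resources (illustration only)."""

    def __init__(self):
        self.items = {}
        self._next_id = 1

    def get(self, item_id):
        """Read only: never changes the stored data."""
        return self.items.get(item_id)

    def post(self, data):
        """Create: every call adds a new resource under a fresh id."""
        item_id = self._next_id
        self._next_id += 1
        self.items[item_id] = data
        return item_id

    def put(self, item_id, data):
        """Replace: repeating the call leaves the same final state."""
        self.items[item_id] = data


store = Store()
store.post("a")
store.post("a")          # same request twice -> two resources
print(len(store.items))  # 2

store.put(10, "b")
store.put(10, "b")       # same request twice -> still one resource at id 10
print(len(store.items))  # 3
```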

Here are a few less commonly used methods that are also worth a look:

  • DELETE: deletes a resource.
  • CONNECT: establishes a special connection tunnel.
  • OPTIONS: lists the methods that can be applied to the resource.
  • TRACE: traces the transmission path of the request and response.

What is the URI

URI stands for Uniform Resource Identifier. Because it appears so often in the browser’s address bar, it is colloquially called a “web address”. Strictly speaking, a URI is not the same thing as a web address: it covers two kinds of identifier, the URL and the URN. The web address used in the HTTP world is actually a URL, a Uniform Resource Locator. But because URLs are so ubiquitous, the two terms are often treated as equivalent.

A URI is essentially a string that uniquely identifies the location or name of a resource. The image above is a complete URI, but let’s break down its structure in detail.

  • scheme: specifies the protocol used to access the resource. The most common is of course “http”, which means the HTTP protocol is used. “https” means the encrypted, secure HTTPS protocol is used. There are other, less common schemes, such as ftp, ldap, file, and news.
  • ://: in an HTTP URL the scheme must be followed by the three characters “://”, which separate it from the rest of the URI. They have no meaning of their own.
  • user:passwd@: the user name and password used to log in to the host. This form is not recommended, because it exposes sensitive information in plaintext and can be a serious security risk.
  • host:port: the host that holds the resource, usually written as host:port, that is, the host name followed by the port number.
  • path: the path of the resource on the host, similar to a file system directory. It usually starts with “/”.
  • query: the query parameters, which express additional requirements on the resource. The query starts after a “?”, which is not itself part of the query. It consists of multiple “key=value” strings joined by the ampersand character (&), which both browsers and servers can parse into a dictionary or associative array.
  • #fragment: the fragment identifier, an “anchor” inside the resource located by the URI, which the browser can jump to after retrieving the resource. Fragment identifiers are used only by clients such as browsers and are not sent to the server.
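To see these components pulled apart in practice, here is a short sketch using Python’s standard `urllib.parse` module. The URL is a made-up example chosen only so that every component is present.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL that exercises every component described above.
url = "http://user:passwd@www.example.com:8080/docs/index.html?lang=en&page=2#intro"

parts = urlsplit(url)
print(parts.scheme)    # http
print(parts.username)  # user
print(parts.password)  # passwd
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /docs/index.html
print(parts.query)     # lang=en&page=2
print(parts.fragment)  # intro

# The query is a series of key=value pairs joined by "&".
params = parse_qs(parts.query)
print(params)          # {'lang': ['en'], 'page': ['2']}
```

`parse_qs` returns a list for each key because the same key may appear more than once in a query.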

Only ASCII characters can be used in URIs. Non-ASCII characters and special characters must be converted into a form that does not conflict with URI syntax. The RFC calls this “escaping” and “unescaping”, and it is commonly known as percent-encoding. The rule for URI escaping is rather “crude”: each byte of a non-ASCII or special character is written as a two-digit hexadecimal value preceded by “%”.
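A quick way to watch escaping happen is the standard library’s `quote` and `unquote` functions. The input string below is an arbitrary example, and the sketch assumes the usual UTF-8 encoding, which is what `quote` uses by default.

```python
from urllib.parse import quote, unquote

original = "café & tea"

# Each unsafe byte becomes "%" followed by two hexadecimal digits.
# "é" is two bytes in UTF-8 (C3 A9), so it becomes %C3%A9.
encoded = quote(original)
print(encoded)           # caf%C3%A9%20%26%20tea

# Unescaping reverses the process.
print(unquote(encoded))  # café & tea
```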

Status code

In the HTTP message section we talked about the HTTP status line. In this section we will look at the status code in the status line.

A status code is a three-digit decimal number. The RFC standard divides status codes into five classes, with the first digit indicating the class. The values 0 to 99 are not used, so the actual range of status codes is 100 to 599. The five classes have the following meanings:

  • 1xx: informational. Protocol processing is in an intermediate state and further action is required.
  • 2xx: success. The message was received and processed correctly.
  • 3xx: redirection. The location of the resource has changed and the client needs to resend the request.
  • 4xx: client error. The request message is incorrect and the server cannot process it.
  • 5xx: server error. An internal error occurred while the server was processing the request.
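Since the class is determined by the first digit, it can be computed with integer division. The function name `status_class` and the short labels are choices made for this sketch.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Map a status code to its class using the first digit."""
    if not 100 <= code <= 599:
        raise ValueError(f"status code out of range: {code}")
    return STATUS_CLASSES[code // 100]


print(status_class(200))  # success
print(status_class(404))  # client error
print(status_class(503))  # server error
```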

1xx

The 1xx status codes are informational messages that mark an intermediate state in protocol processing. They are rarely used.

“100 Continue” is used when the client asks the server whether it will accept a large body, for example a large file in a POST request. The client first sends the request header with the field “Expect: 100-continue”, and sends the body only after the server answers “100 Continue”. This process is where the saying “POST sends two TCP packets to the server” comes from. The client does not have to wait indefinitely, however: if no negative response arrives within a certain time, it goes ahead and sends the body.

2xx

The 2xx status codes indicate that the server received and successfully processed the client’s request. They are the status codes the client most likes to see.

“200 OK” is the most common success status code, indicating that everything is fine and that the server has returned the processing result as expected by the client. For non-HEAD requests, there is usually body data after the response header.

“204 No Content” is another very common success status code that means basically the same as “200 OK,” but without the body data after the response header. So it is necessary for a Web server to correctly distinguish between 200 and 204.

“206 Partial Content” is the basis for HTTP partial download and resumable download. It is returned when the client sends a range request asking for part of a resource. Like 200, it means the server processed the request successfully, but the body holds only part of the resource, not all of it. The 206 status code is usually accompanied by the Content-Range header field, which gives the exact range of the body data in the response so that the client can confirm it. For example, “Content-Range: bytes 0-99/2000” means that the first 100 bytes of a 2000-byte resource were returned.
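The arithmetic behind that example can be checked with a small parser. This is a sketch only: the name `parse_content_range` is made up, and it handles just the fully numeric form shown above, not variants with an unknown total length.

```python
import re


def parse_content_range(value: str):
    """Parse 'bytes first-last/total' into three integers."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range value: {value!r}")
    first, last, total = (int(group) for group in match.groups())
    return first, last, total


first, last, total = parse_content_range("bytes 0-99/2000")
# Both ends of the range are inclusive, so the length is last - first + 1.
print(last - first + 1, "of", total)  # 100 of 2000
```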

3xx

The 3xx status codes indicate that the resource requested by the client has moved. The client must resend the request with a new URI to obtain the resource. This is commonly called “redirection” and includes the well-known 301 and 302 redirects.

“301 Moved Permanently”, commonly known as a “permanent redirect”, means that the requested resource no longer exists at this URI and must be accessed through a new one.

“302 Found”, once described by the phrase “Moved Temporarily” and commonly known as a “temporary redirect”, means that the requested resource still exists but must temporarily be accessed through another URI.

“304 Not Modified” is an interesting status code. It is used with conditional requests such as If-Modified-Since to indicate that the resource has not been modified, and it serves cache control. It does not carry the usual “jump” meaning, but can be understood as a “redirect to the cached file” (a “cache redirect”).

4xx

The 4xx status codes indicate that the request message sent by the client is incorrect and the server cannot process it. These are genuine “error codes”.

“400 Bad Request” is a general-purpose error code indicating that something is wrong with the request message. It is used when no more specific status code applies.

“403 Forbidden” does not actually mean that the client’s request is wrong. It indicates that the server forbids access to the resource.

“404 Not Found” originally means that the requested resource could not be found on the server. But it is now so overused that a server may return 404 whenever it is “unhappy”, and there is no way to know whether the resource really was not found or something else went wrong. In some ways it is even more annoying than 403.

5xx

The 5xx status codes indicate that the client’s request message was correct, but the server could not return the response data because of an internal error during processing.

“500 Internal Server Error” is a general-purpose error code, similar to 400: we do not know what error actually occurred on the server. For the server this is arguably a good thing, since it is generally undesirable to reveal internal details such as the call stack of the failing function. That does not help debugging, but it does stop attackers from snooping and analyzing.

“501 Not Implemented” means that the functionality requested by the client is not yet supported. This error code is “softer” than 500 and reads rather like “coming soon, stay tuned”, though it is not clear when.

“502 Bad Gateway” is usually returned when the server acts as a gateway or proxy. It indicates that the server itself is working properly, but an error occurred when it accessed the back-end server. The specific cause of the error is unknown.

“503 Service Unavailable” indicates that the server is currently busy and temporarily unable to respond. The message “The network service is busy. Please try again later” that we sometimes see online corresponds to status code 503. 503 is a “temporary” state, and the server may well be less busy in a few seconds, so a 503 response usually carries a Retry-After field telling the client how long to wait before trying the request again.

The characteristics of HTTP

  1. Flexible and extensible: When HTTP was born, it specified only the basic format of messages, such as separating words with spaces, separating fields with line breaks, and the “header + body” layout. There were no strict syntactic or semantic restrictions on the individual parts of a message, and developers could customize them freely. The RFC documents can be understood as the “recognition and standardization” of extensions that already existed, forming a virtuous circle of “from practice, back to practice”.
  2. Reliable transport: Because HTTP is based on TCP/IP, and TCP itself is a “reliable” transport protocol, HTTP naturally inherits this feature and can “reliably” transfer data between requester and responder.
  3. Application-layer protocol: With a message structure that can carry arbitrary header fields and entity data, plus convenient features such as connection control and caching proxies, HTTP can carry almost anything and meet all kinds of needs as long as performance requirements are not too strict. It can fairly be called a “universal” protocol.
  4. Request-response: The request-response pattern is the most basic communication model of the HTTP protocol, and it also makes the roles of the two parties clear. The requester always initiates the connection and the request first, so it is the active side. The responder can only reply after receiving a request, so it is the passive side and does nothing if no request arrives.
  5. Stateless: “State” is data or flags stored on the client or the server that record changes during communication. The HTTP protocol itself does not define any “state”. But don’t forget that HTTP is “flexible and extensible”: although the standard has no “state”, the feature can be “patched in” within the framework of the protocol, which is what cookies do.
  6. Plaintext transmission: “Plaintext” means that the messages of the protocol (specifically, the header part) do not use binary data but simple, readable text.
  7. Insecure: Security has many aspects. Plaintext is only a weakness in the area of “confidentiality”, and HTTP also falls short in “authentication” and “integrity verification”.