The Web is an application we programmers know well, and HTTP is the protocol that underpins it, but do you really understand them? Today we will follow in the footsteps of the masters and look at the Web and HTTP from the perspective of computer networking. I think you will come away with something new. ✌️
Previously in this series
I am systematically studying computer fundamentals and writing them up as blog posts. The textbook is “Computer Networking: A Top-Down Approach (7th Edition)”, together with the complete “Computer Networks” course by Quan Zheng (郑烇) and Jian Yang of USTC.
- Computer Networking: Fundamentals of the Internet (Part 1)
- Computer Networking: Basic Principles of the Internet (Part 2)
- Computer Networking: Application Layer Protocol Principles
This post covers Chapter 2: the Web and HTTP at the application layer.
Some terms
- Web pages: consist of objects
  - What is an object? An object can be an HTML file, a JPEG image, a Java applet, an audio clip, or any other file.
  - A web page consists of a base HTML file, which in turn contains references to several objects.
- Each object is referenced by a URL
  - Any object can be accessed through its URL. For example, when we put an image on a web page, we do not embed the image data directly in the page; we include a link to the image object instead.
- URL format: access protocol, user name, password, host name, port, path name, and so on. Readers familiar with the Web may wonder why this looks different from the URLs we usually see. That’s okay. Let’s break it down.
  - Protocol: tells us which protocol to use to access the object, such as HTTP, FTP, or email. It is followed by the separator ://
  - User name and password: the credentials used to access the object. They are usually omitted, because putting them in a URL is insecure and not recommended.
  - Host name and path name: the host (domain name) on which the object lives and the path of the object on that host, which often shows the object's file type.
  - Port: the endpoint through which a process on the host communicates with the outside world. The default port is 80 for HTTP and 443 for HTTPS.
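To make the URL parts above concrete, here is a minimal sketch using Python's standard `urllib.parse` module. The example URL is made up, and `DEFAULT_PORTS` and `describe_url` are hypothetical helpers introduced only for this illustration.

```python
from urllib.parse import urlsplit

# Hypothetical helper table: the default ports mentioned in the text.
DEFAULT_PORTS = {"http": 80, "https": 443}

def describe_url(url):
    """Split a URL into the parts discussed above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        # Fall back to the scheme's default port when the URL gives none.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
    }

# Made-up URL that spells out every component explicitly.
info = describe_url("http://alice:secret@www.example.com:8080/images/logo.jpg")
print(info["host"], info["port"], info["path"])
```

Most everyday URLs leave out the user name, password, and port, which is why they look shorter than the full format.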
Every object on the Internet can be pointed to in this way. Web objects therefore form a structure like a spider's web, in which each object links to other information.
The number of web pages on the Internet today is staggering, and together they form a very complex information space. If we want to find something in this web, how hard is it?
That’s where search engines come in. You give a search engine some keywords and it returns the matching objects, with the most popular and most relevant ones ranked first. Web applications are the killer apps of the Internet and attract a huge number of users, from businesses to individuals. They publish their information on the Web, and anyone who installs a browser client can access any website in the world.
HTTP basics
HTTP: Hypertext Transfer Protocol
- The application-layer protocol of the Web
- Client/server model
  - Client: the browser, which requests, receives, and displays Web objects
  - Server: the Web server, which sends objects in response to requests
Let’s look at the HTTP protocol that supports Web applications, Hypertext Transfer Protocol.
- Hypertext: text whose parts can point to other text in arbitrary ways. Isn’t that exactly what Web objects do? A Web object contains many links that point to other objects, which may in turn point back. These relationships form a huge, complex information space.
- Transfer protocol: the protocol the browser and the server use to communicate. The browser establishes a TCP connection and sends HTTP requests over it. After receiving a request, the Web server wraps the requested object in an HTTP response message and returns it to the browser. Desktop and mobile clients both work this way.
Using TCP
So how exactly does it work?
- The server listens on port 80 at a fixed IP address, waiting for clients to set up TCP connections.
- The server accepts the TCP connection from the client
  - When the client asks to establish a connection, the server accepts the request. The Web server then holds a socket that identifies this session. (If you are not familiar with sockets, see Computer Networking: Application Layer Protocol Principles.)
- The browser (HTTP client) and the Web server (HTTP server) exchange HTTP messages (application-layer protocol messages)
  - Once the connection is established, the browser can send an HTTP request to the server.
  - The server handles the request, wraps the requested object in a response message, and sends it back.
  - The browser receives the requested object and processes it accordingly.
- Finally, the TCP connection is closed
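The steps above can be sketched with Python's standard library. This is only an illustration under some assumptions: the server is a throwaway `http.server` instance bound to a free local port rather than port 80, and `HelloHandler`, `fetch`, and `demo` are names invented for this example.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Tiny stand-in Web server: answers every GET with the body b'hello'."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def fetch(host, port, path="/"):
    """Open a TCP connection, send one HTTP request, read until the server closes."""
    with socket.create_connection((host, port)) as sock:  # TCP connection setup
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))             # HTTP request message
        chunks = []
        while True:
            data = sock.recv(4096)                        # HTTP response message
            if not data:                                  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

def demo():
    # Port 0 asks the OS for any free port; a real Web server would use 80.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch("127.0.0.1", server.server_port)
    finally:
        server.shutdown()
        server.server_close()

if __name__ == "__main__":
    print(demo().decode("ascii"))
```

The client side uses a raw socket on purpose, so that the TCP connection and the HTTP text sent over it are both visible.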
HTTP is stateless
HTTP was designed to be stateless. That is, the server does not maintain any information about its clients.
The server only accepts the connection and returns the response message. It does not know whether a client has communicated with it before, or whether the client will communicate with it again in the future.
So why design it this way? In other words, what’s the benefit of this design?
- Protocols that maintain state are complex.
  - If the server has to maintain history (state), then whenever a client or the server crashes, their views of that state are likely to become inconsistent.
  - The server then needs extra mechanisms to bring the two sides back in sync.
- A stateless server can support more clients.
  - On the same network and with the same server resources, a stateless server can support far more users than a stateful one. Because the server does not keep per-browser state, those resources are freed.
HTTP non-persistent vs. persistent
Let’s look at non-persistent and persistent connections, which are the defaults in HTTP/1.0 and HTTP/1.1 respectively.
HTTP/1.0: non-persistent connections.
In the original HTTP/1.0, the TCP connection was closed after every HTTP exchange. That is, at most one object can be sent over a TCP connection, and downloading multiple objects requires multiple TCP connections.
In the early days of the Internet this worked fine, because pages were mostly small amounts of text. But once a single page triggers dozens of HTTP requests, it is easy to hit the browser's limit on concurrent requests, and setting up a new TCP connection for every request (a three-way handshake to open it and a four-way handshake to close it, each time) adds significant overhead.
So HTTP/1.1 uses persistent connections by default, also known as HTTP keep-alive or HTTP connection reuse. Multiple objects can be transferred over one TCP connection between a client and a server. As long as neither end asks to disconnect, the TCP connection stays open and can be reused by later requests.
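As a rough illustration of connection reuse, the sketch below sends several requests through one `http.client.HTTPConnection` to a throwaway local server. `KeepAliveHandler` and `fetch_many` are names made up for this example; the client-side socket address is recorded for each request to show that the same TCP connection carries all of them.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class KeepAliveHandler(BaseHTTPRequestHandler):
    # Speaking HTTP/1.1 lets http.server keep the connection open between requests.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = self.path.encode("ascii")  # echo the requested path back
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def fetch_many(paths):
    """Fetch every path over a single persistent connection."""
    server = HTTPServer(("127.0.0.1", 0), KeepAliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    results = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            body = response.read()
            # Local (ip, port) of the client socket; it stays the same while
            # the TCP connection is being reused.
            results.append((response.status, body, conn.sock.getsockname()))
    finally:
        conn.close()          # closing the client lets the handler finish
        server.shutdown()
        server.server_close()
    return results

if __name__ == "__main__":
    for status, body, local_addr in fetch_many(["/a", "/b", "/c"]):
        print(status, body, local_addr)
```

With a non-persistent connection, each of these requests would need its own TCP handshake and teardown.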
Non-persistent HTTP connection
What happens when a user enters a URL for a page that contains text and references to 10 JPEG images? 👇
Response time model
In the figure, the round-trip time (RTT) is the time it takes a small packet to travel from the client to the server and back. The transmission time of that packet is ignored.
Why can the transmission time be ignored? Because the packet is small: its size is tiny relative to the link bandwidth, so the time needed to push it onto the link is negligible.
With non-persistent HTTP, the response time for one object is 2 RTT plus the file transmission time. One RTT is spent setting up the TCP connection, and another RTT is spent sending the HTTP request and waiting for the response to start arriving; the time to transmit the file comes on top of that.
From this we can see that non-persistent connections are slow, which is why persistent connections were introduced.
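The response-time model above can be written down as a small calculation. This is a simplified sketch: it assumes the objects are fetched strictly one after another (no parallel connections), and the function name and the numbers are made up for illustration.

```python
def non_persistent_time(rtt, transmit_times):
    """Total time when every object needs its own TCP connection.

    Each object costs 2 RTT (one for TCP setup, one for request/response)
    plus its own transmission time. Units are whatever the caller uses.
    """
    return sum(2 * rtt + t for t in transmit_times)

# Base HTML file plus 10 JPEG images = 11 objects.
# Assumed values: RTT = 100 ms, 10 ms of transmission time per object.
print(non_persistent_time(100, [10] * 11))  # 2310 (ms)
```

Even with small files, almost all of the time goes into the 2 RTT paid per object.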
Persistent HTTP
Persistent connections come in two kinds: non-pipelined and pipelined.
Pipelined vs. non-pipelined
Non-pipelined
Only one request is outstanding at a time: the next object can be requested only after the previous one has come back. For example, if a browser needs 10 images, it requests the first image and requests the second only when the first has arrived. When the second comes back it requests the third, and so on.
Pipelined
The other way is pipelining. With ten requests to make, I send the first one and, without waiting for its response, send the second, the third, and so on up to the tenth. The requests flow out one after another, like items on an assembly line.
It is like building cars: without an assembly line, one car is built at a time, whereas an assembly line might have 10,000 cars in production at once.
Back to persistent connections: they make it possible to send multiple requests in a pipelined manner. Previously, the client had to wait for a response before sending the next request. With pipelining, the next request can be sent right away without waiting.
In HTTP/1.1, persistent connections with pipelining are the default mode, and all the referenced objects together may cost as little as one RTT. Without pipelining, each referenced object costs about one RTT.
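To compare the two persistent modes, here is a back-of-the-envelope sketch. It ignores transmission times and assumes one RTT for TCP setup plus one RTT for the base HTML file; `persistent_time` and the numbers are assumptions made for this example.

```python
def persistent_time(rtt, n_referenced, pipelined):
    """Rough time to load a page over one persistent connection.

    rtt: round-trip time; n_referenced: number of objects the base HTML
    refers to; pipelined: whether requests are sent back to back.
    Transmission times are ignored.
    """
    setup_and_base = 2 * rtt  # TCP handshake + request/response for the base HTML
    if n_referenced == 0:
        return setup_and_base
    if pipelined:
        # All referenced objects are requested without waiting: about one RTT.
        return setup_and_base + rtt
    # One request at a time: one RTT per referenced object.
    return setup_and_base + n_referenced * rtt

# Assumed values: RTT = 100 ms, 10 referenced images.
print(persistent_time(100, 10, pipelined=False))  # 1200 (ms)
print(persistent_time(100, 10, pipelined=True))   # 300 (ms)
```

Under the same assumptions, fetching the 11 objects over 11 separate non-persistent connections would take 11 × 2 RTT = 2200 ms.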
The HTTP message
The information exchanged in HTTP is called an HTTP message. Messages sent by the requesting side (the client) are called request messages, and messages sent by the responding side (the server) are called response messages.
A common format
The request message
The response message
The start line and headers of request and response messages consist of the following:
- Request line: contains the request method, the request URI, and the HTTP version
- Status line: contains the HTTP version, the status code, and the reason phrase that describe the result of the response
- Header fields: a variety of headers that express conditions and attributes of the request and response
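The structure described above can be seen by taking a request message apart by hand. The parser below is a minimal sketch for illustration only (it ignores edge cases such as repeated headers), and the sample request is made up.

```python
def parse_request(raw):
    """Split a raw HTTP request into request line, header fields, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")  # a blank line ends the headers
    lines = head.decode("iso-8859-1").split("\r\n")
    # Request line: method, request URI, HTTP version.
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return method, target, version, headers, body

sample = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: www.example.com\r\n"
    b"Connection: keep-alive\r\n"
    b"\r\n"
)
method, target, version, headers, body = parse_request(sample)
print(method, target, version)  # GET /index.html HTTP/1.1
```

A response message has the same shape, except that its first line is a status line such as `HTTP/1.1 200 OK`.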
Method type
HTTP/1.0
- GET: retrieves a resource
- POST: transfers an entity body
- HEAD: retrieves only the message headers; no message body is returned
Added in HTTP/1.1
- PUT: transfers a file
- DELETE: deletes a file
- OPTIONS: asks which methods the resource at the request URI supports
Status code
Recommended status code learning site: http.cat/
100
- 1** Informational: the server has received the request and the requester should continue the operation
200
- 2** Success: the request was successfully received and processed
  - 204 No Content: the request was processed successfully, but there is no content to return.
  - 206 Partial Content: the client made a range request and the server successfully served that part of the GET request. The response contains the portion of the entity specified by Content-Range.
300
- 3** Redirection: further action is needed to complete the request
  - 301 Moved Permanently: permanent redirect. The requested resource has been assigned a new URL, and future requests should use the URL the resource now lives at.
  - 302 Found: temporary redirect.
  - 304 Not Modified: the requested resource has not been modified, so the server returns this status code without the resource body. Clients typically cache resources they have fetched and send a header (If-Modified-Since) saying that they only want the resource if it was modified after a given date.
400
- 4** Client error: the request contains a syntax error or cannot be fulfilled
  - 400 Bad Request: the syntax of the client's request is incorrect and the server cannot understand it
  - 401 Unauthorized: the request requires user authentication
  - 403 Forbidden: the server understands the client's request but refuses to carry it out
  - 404 Not Found: the server could not find the resource (web page) the client requested. Site designers can use this code to serve a custom “the resource you requested could not be found” page
  - 405 Method Not Allowed: the method used in the client's request is not allowed
500
- 5** Server error: an error occurred while the server was processing the request
  - 502 Bad Gateway: the server, acting as a gateway or proxy, received an invalid response from the upstream server while trying to fulfil the request
  - 504 Gateway Timeout: the server, acting as a gateway or proxy, did not receive a response from the upstream server in time
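The five classes above follow directly from the first digit of the code. As a small sketch, Python's standard `http.HTTPStatus` enum can supply the reason phrase, while `CLASSES` and `describe` are helpers invented for this example.

```python
from http import HTTPStatus

# Hypothetical lookup table: first digit of the code -> class name.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe(code):
    """Return e.g. '404 Not Found (client error)' for a known status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    return f"{code} {status.phrase} ({CLASSES[code // 100]})"

for code in (204, 304, 404, 502):
    print(describe(code))
```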
Web cache — proxy server
So let’s look at another topic, which is caching on the Web.
What is Web caching?
A cache is a copy of a resource kept on a proxy server or on the client's local disk. Using a cache reduces the number of requests to the origin server and so saves traffic and communication time. A cache server is one kind of proxy server, namely a caching proxy: when the proxy forwards a response returned by the origin server, it keeps a copy of the resource.
That is, we have two ways to access the server.
- The first: Access the Web server directly and get the object directly from the source server.
- The other is to set up a Web proxy server and obtain objects through the proxy.
Via: a header carrying information about the proxies on the path. Each proxy a message passes through appends its own entry, and the entries are separated by commas.
If the proxy server does not have the object, it requests it from the origin server and stores a copy in its own file system.
The next time a user requests the same object, the proxy server reads it from its local file system and sends the response to the user. We say the object was hit in the proxy cache.
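The hit-or-miss behaviour just described can be sketched in a few lines. This is a deliberately simplified model: it keeps objects in a dictionary instead of a file system and never expires them, and `CachingProxy` and `origin` are names made up for the example.

```python
class CachingProxy:
    """Toy caching proxy: serves from its store on a hit, asks the origin on a miss."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # callable standing in for the origin server
        self._store = {}                 # url -> cached object
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self._store:
            self.hits += 1               # hit: no request to the origin server
            return self._store[url]
        self.misses += 1
        body = self._fetch(url)          # miss: fetch from the origin server
        self._store[url] = body          # keep a copy for later requests
        return body

origin_calls = []

def origin(url):
    """Stand-in origin server that records how often it is asked."""
    origin_calls.append(url)
    return b"body of " + url.encode("ascii")

proxy = CachingProxy(origin)
proxy.get("/logo.jpg")   # first user: miss, goes to the origin
proxy.get("/logo.jpg")   # second user: hit, served by the proxy
print(proxy.hits, proxy.misses, len(origin_calls))  # 1 1 1
```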
Why use Web caching?
In a word, quick.
- For users, it reduces the response time of client requests
- For ISPs (Internet service providers, such as mobile carriers), it greatly reduces the traffic on the access link between an institution's internal network and the Internet
- For ICPs (Internet content providers, like Google), caching is used heavily across the Internet: it lets even ICPs with modest resources deliver content effectively
Moreover, access to Internet content roughly follows the 80-20 rule: about 80 percent of users access the same 20 percent of the content.
Therefore, even a fairly small cache can satisfy a large share of users’ requests, which reduces the load on both the server and the network.
So to the client the cache acts as a server, and to the origin server it acts as a client. The cache is usually installed by an ISP (a university, a company, or a residential ISP).
The proxy server caches the objects it has fetched and returns them directly to users. Does this approach to caching have any weaknesses?
Of course it does. If the object changes on the origin server, the proxy server may return a stale copy to the user. HTTP therefore provides a mechanism to deal with this.
Conditional GET method
To handle this, HTTP provides the conditional GET. Its goal is that the origin server does not send the object if the copy in the cache is still up to date.
What are the types of HTTP caches?
There are several mechanisms; let’s go through them one by one.
HTTP/1.0
To ease the load on servers, HTTP/1.0 provided a caching mechanism.
- Expires: strong caching. The expiration time of the cached copy.
  - Specifies the time at which the resource expires
  - Problem: the check depends on the local clock. If the user changes the local time, the expiration check becomes unreliable
- Last-Modified: negotiated caching.
  - The time the resource was last modified on the server.
  - Problem: when a resource's modification time changes but its content does not, the entire entity is still sent to the client (even though the client already has an identical copy in its cache).
The client stores this value alongside the cached resource. On the next request for the resource, it sends the value back to the server in a request header (If-Modified-Since) so the server can check it. If the time sent matches the resource's last modification time on the server, the resource has not been modified, and the server returns status code 304 with an empty body, which saves transferring the data. If the two times differ, the server returns the resource with status code 200, just like the first request. This ensures that a resource is not sent to the client repeatedly, while the client still gets the latest version when the resource changes on the server. A 304 response is usually much smaller than the static resource itself, so it saves network bandwidth.
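The server-side check described above might look roughly like this. It is a sketch under assumptions: `conditional_get` is a made-up function, modification times are timezone-aware UTC datetimes without microseconds, and "not modified" is taken to mean "not newer than the time the client sent".

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_get(resource_mtime, body, if_modified_since=None):
    """Decide between 200 (send the resource) and 304 (client copy is current).

    resource_mtime: aware UTC datetime of the last modification on the server.
    if_modified_since: value of the If-Modified-Since request header, or None.
    Returns (status, headers, body).
    """
    headers = {"Last-Modified": format_datetime(resource_mtime, usegmt=True)}
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        if resource_mtime <= since:
            return 304, headers, b""   # not modified: empty body
    return 200, headers, body          # first request, or resource has changed

mtime = datetime(2024, 1, 1, tzinfo=timezone.utc)
status, headers, _ = conditional_get(mtime, b"<html>...</html>")
# The client echoes Last-Modified back as If-Modified-Since on its next request.
status_again, _, body_again = conditional_get(
    mtime, b"<html>...</html>", headers["Last-Modified"]
)
print(status, status_again, body_again)  # 200 304 b''
```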
HTTP/1.1
As technology developed, HTTP/1.0 soon proved insufficient.
- Cache-Control: strong caching
  - To address the problem with Expires described above, HTTP/1.1 adds Cache-Control, which defines how long the cache stays valid, for example max-age=3600.
  - Note: if both Expires and Cache-Control are present in a message, Cache-Control takes precedence.
- ETag: negotiated caching. A unique identifier calculated from the content.
  - To address the possible inaccuracy of Last-Modified described above, HTTP/1.1 also introduces the ETag entity header field.
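Two of the ideas above, the precedence of Cache-Control over Expires and a content-based ETag, can be sketched as follows. Both functions are hypothetical helpers: real caches follow more detailed rules, and servers are free to compute ETags however they like (the SHA-256 prefix here is just one possible choice).

```python
import hashlib
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def freshness_lifetime(headers, response_time):
    """Seconds a cached response stays fresh; Cache-Control max-age beats Expires."""
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    if "Expires" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
        return max(0, int((expires - response_time).total_seconds()))
    return 0  # no explicit freshness information

def make_etag(body):
    """One possible content-based validator: a quoted SHA-256 digest prefix."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

received = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
both = {"Cache-Control": "max-age=3600", "Expires": "Mon, 01 Jan 2024 00:10:00 GMT"}
only_expires = {"Expires": "Mon, 01 Jan 2024 00:10:00 GMT"}
print(freshness_lifetime(both, received))          # 3600: Cache-Control wins
print(freshness_lifetime(only_expires, received))  # 600: falls back to Expires
```

Because the ETag here depends only on the content, a file that is re-saved without changes keeps the same ETag, which is exactly the case Last-Modified handles poorly.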
Strong caching versus negotiated caching
- When the browser loads a resource, it first uses the Expires and Cache-Control headers stored with the cached response to check whether the strong cache is hit. If it is, the resource is read directly from the cache and no request is sent to the server.
- If the strong cache is not hit, the browser must send a request to the server, which uses Last-Modified and ETag to verify whether the negotiated cache is hit. If it is, the server returns a response (304) without the resource data, and the resource is still read from the cache.
- If neither is hit, the resource is loaded directly from the server.
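The three cases above can be walked through with a small simulation. Everything here is an illustrative assumption: times are plain seconds, the cache entry is a dictionary, `make_server` fakes the network round trip using an ETag comparison, and the cache entry is not updated after a 200.

```python
def load_resource(entry, now, revalidate):
    """Return (body, how) following the strong-then-negotiated flow.

    entry: dict with 'body', 'etag', 'stored_at', 'max_age', or None if not cached.
    revalidate(etag) -> (status, body, etag): stands in for a request to the server.
    """
    # 1. Strong cache: still fresh, so no request is sent at all.
    if entry is not None and now - entry["stored_at"] < entry["max_age"]:
        return entry["body"], "strong cache hit (no request sent)"
    # 2. Negotiated cache: ask the server whether our copy is still valid.
    etag = entry["etag"] if entry is not None else None
    status, body, _new_etag = revalidate(etag)
    if status == 304:
        return entry["body"], "negotiated cache hit (304, body reused)"
    # 3. Neither hit: the server sent the full resource.
    return body, "loaded from server (200)"

def make_server(current_body, current_etag):
    """Fake server: answers 304 when the client's ETag matches, else 200."""
    def revalidate(if_none_match):
        if if_none_match == current_etag:
            return 304, b"", current_etag
        return 200, current_body, current_etag
    return revalidate

entry = {"body": b"v1", "etag": '"v1"', "stored_at": 0, "max_age": 60}
unchanged = make_server(b"v1", '"v1"')
changed = make_server(b"v2", '"v2"')

print(load_resource(entry, 30, unchanged)[1])   # strong cache hit (no request sent)
print(load_resource(entry, 120, unchanged)[1])  # negotiated cache hit (304, body reused)
print(load_resource(entry, 120, changed)[1])    # loaded from server (200)
```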