HTTP Protocol Overview
HTTP is short for HyperText Transfer Protocol. It is used to transfer hypertext from a World Wide Web server to the local browser.
HTTP is a TCP/IP-based communication protocol for transferring data (HTML files, image files, query results, etc.).
HTTP is an object-oriented protocol belonging to the application layer. Because it is simple and fast, it is well suited to distributed hypermedia information systems. It was proposed in 1990 and has been improved and extended continuously since then. HTTP/1.0 was followed by HTTP/1.1 and later by HTTP/2.0, all of which are compared later in this article.
HTTP works on a client-server architecture. As the HTTP client, the browser sends its requests to the HTTP server (the Web server) identified by a URL. The Web server then sends a response back to the client based on the request it received.
Five features of HTTP
- Client/server mode is supported.
- Simple and fast: when a client requests a service from the server, it only needs to send the request method and path. The commonly used request methods are GET, HEAD and POST. Each method specifies a different type of interaction between the client and the server. Because the protocol is simple, HTTP server programs are small and communication is fast.
- Flexible: HTTP allows any type of data object to be transferred. The type being transferred is marked by the Content-Type header.
- Connectionless: each connection handles only one request. The server closes the connection once it has processed the client's request and the client has received the response, which saves transmission time. The original motivation was to use fewer resources and respond faster. Later, Connection: keep-alive was introduced to support persistent (long) connections.
- Stateless: HTTP is a stateless protocol. Stateless means that the protocol has no memory of earlier transactions. If information from a previous request is needed for later processing, it must be retransmitted, which can increase the amount of data transferred per connection. On the other hand, the server responds faster when it does not need the previous information.
URLs in HTTP
HTTP uses Uniform Resource Identifiers (URIs) to transfer data and establish connections. A URL is a special type of URI that contains enough information to locate a resource.
A URL is an address used to identify a resource on the Internet. The following example is used to describe the components of a common URL:
www.xxx.com:8080/news/1.html…
As you can see from the above URL, a complete URL consists of the following parts:
- Protocol part: The protocol part of the URL is “http:”, which indicates that the web page uses HTTP. Many protocols can be used on the Internet, such as HTTP and FTP. In this example, HTTP is used. The “//” after “http:” is the delimiter
- Domain name: The domain name of the URL is www.aspxfans.com. In a URL, an IP address can also be used in place of a domain name
- Port: The domain name is followed by the port. The domain name and port are separated by a colon (:). The port is not a required part of a URL, and the default port is used if it is omitted
- Virtual directory part: the virtual directory part begins with the first slash after the domain name and ends with the last slash. The virtual directory is also not a required part of a URL. The virtual directory in this case is “/news/”
- File name: the file name part runs from the last slash after the domain name to the “?”. If there is no “?”, it runs to the “#”; if there is neither “?” nor “#”, it runs to the end of the URL. In this case, the file name is index.asp. The file name is also not a required part of a URL; if it is omitted, the default file name is used
- Anchor part: From the “#” to the end is the anchor part. The anchor part in this case is “name”. The anchor part is also not a required part of a URL
- Parameter part: The part between the “?” and the “#” is the parameter part, also known as the search part or the query part. The parameter part in this example is “boardID=5&ID=24618&page=1”. Multiple parameters are separated by ampersands (&)
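The breakdown above can be reproduced with Python's standard `urllib.parse` module. This is a minimal sketch: the URL is assembled from the component values named in the list, and the host `www.example.com` is a placeholder assumption rather than the article's own address.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL built from the component values listed above;
# the host name is a placeholder, not taken from the article.
url = "http://www.example.com:8080/news/index.asp?boardID=5&ID=24618&page=1#name"

parts = urlsplit(url)
# Split the path into the virtual directory and the file name.
directory, filename = posixpath.split(parts.path)

print(parts.scheme)            # protocol part
print(parts.hostname)          # domain name
print(parts.port)              # port
print(directory + "/")         # virtual directory part
print(filename)                # file name
print(parse_qs(parts.query))   # parameter part, as a dict of lists
print(parts.fragment)          # anchor part
```

Note that the anchor is handled by the browser and is normally not sent to the server as part of the request.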
The difference between URLs and URIs
A Uniform Resource Identifier (URI) uniquely identifies a resource.
Each resource available on the Web, such as HTML documents, images, video clips, and programs, is located by a URI. A URI generally consists of three parts: (1) the naming mechanism for accessing the resource, (2) the host name where the resource is stored, and (3) the name of the resource itself, represented by the path. The emphasis is on the resource.
A URL is a Uniform Resource Locator. A URL is a specific kind of URI that both identifies a resource and specifies how to locate it.
A URL is a string of characters used to describe information resources on the Internet. It is used in various WWW client and server programs, notably the famous Mosaic program. URLs provide a unified format for describing various information resources, including files, server addresses and directories. A URL consists of three parts: (1) the protocol (or service mode), (2) the address of the host where the resource resides (sometimes including a port number), and (3) the specific address of the resource on the host, such as the directory and file name.
A URN (Uniform Resource Name) identifies a resource by name, for example mailto:[email protected].
URI is an abstract, high-level concept that defines a uniform resource identifier, while URLs and URNs are concrete ways of identifying resources. URLs and URNs are both URIs. Broadly speaking, every URL is a URI, but not every URI is a URL, because URIs also include a subclass, the Uniform Resource Name (URN), which names a resource but does not specify how to locate it. The mailto URI above, along with news and ISBN URIs, is given as an example of a URN.
In Java, a URI instance can be either absolute or relative, as long as it follows the URI syntax rules. The URL class, on the other hand, not only conforms to the syntax but also contains the information needed to locate the resource, so it cannot be relative. In the Java class library, the URI class has no methods for accessing the resource; its only function is parsing. In contrast, the URL class can open a stream to the resource.
The HTTP request
As can be seen from the figure above, an HTTP request consists of three parts: request line, message header and request body.
HTTP request line
The request line consists of the request method, the request URL, and the HTTP version. In other words, it defines the request method, the address being requested, and the HTTP protocol version. For example:
GET /example.html HTTP/1.1 (CRLF)
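As a rough illustration of the three fields, the following sketch splits a request line on single spaces. The function name `parse_request_line` is hypothetical, and the check on the version field is a simplification rather than full validation.

```python
def parse_request_line(line: str) -> tuple:
    """Split an HTTP request line into (method, target, version)."""
    # The three fields are separated by single spaces; CRLF ends the line.
    method, target, version = line.rstrip("\r\n").split(" ")
    if not version.startswith("HTTP/"):
        raise ValueError(f"not an HTTP version: {version!r}")
    return method, target, version

print(parse_request_line("GET /example.html HTTP/1.1\r\n"))
```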
HTTP protocol methods include:
- GET: Requests the resource identified by the Request-URI
- POST: Appends new data to the resource identified by the Request-URI
- HEAD: Requests the response message headers for the resource identified by the Request-URI
- PUT: Asks the server to store or modify the resource identified by the Request-URI
- DELETE: Asks the server to delete the resource identified by the Request-URI
- TRACE: Asks the server to send back the request information it received, mainly for testing or diagnosis
- CONNECT: Reserved for future use
- OPTIONS: Queries the capabilities of the server, or the options and requirements related to a resource
The HTTP request header
The message header consists of a series of key-value pairs that allow the client to send additional information to the server, or information about the client itself, including:
Header | Description | Example |
---|---|---|
Accept | Specifies the type of content that the client can receive | Accept: text/plain, text/html |
Accept-Charset | A set of character encodings acceptable to the browser | Accept-Charset: iso-8859-5,utf-8 |
Accept-Encoding | Content encodings (compression types) from the web server that the browser supports | Accept-Encoding: compress, gzip |
Accept-Language | Languages acceptable to the browser | Accept-Language: en,zh |
Accept-Ranges | Indicates that one or more sub-ranges of an entity can be requested (in practice this field is sent by servers in responses) | Accept-Ranges: bytes |
Authorization | Credentials for HTTP authentication | Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ== |
Cache-Control | Specify the caching mechanism that requests and responses follow | Cache-Control: no-cache |
Connection | Indicates whether persistent connections are required (HTTP 1.1 does this by default) | Connection: close |
Cookie | When an HTTP request is sent, all cookie values stored under the domain name of the request are sent to the Web server | Cookie: $Version=1; Skin=new; |
Content-Length | The content length of the request | Content-Length: 348 |
Content-Type | MIME type of the request body | Content-Type: application/x-www-form-urlencoded |
Date | The date and time the request was sent | Date: Tue, 15 Nov 2010 08:12:31 GMT |
Expect | The specific server behavior requested | Expect: 100-continue |
From | Email address of the user who made the request | From: [email protected] |
Host | Specifies the domain name and port number of the requested server | Host: www.zcmhi.com |
If-Match | The request is processed only if the entity matches the given ETag | If-Match: "737060cd8c284d8af7ad3082f209582d" |
If-Modified-Since | If the requested resource has been modified after the specified time, the request succeeds; if it has not been modified, a 304 code is returned | If-Modified-Since: Sat, 29 Oct 2010 19:43:31 GMT |
If-None-Match | If the content has not changed, a 304 code is returned. The value is an ETag previously sent by the server, which the server compares with the current ETag to determine whether the content has changed | If-None-Match: "737060cd8c284d8af7ad3082f209582d" |
If-Range | If the entity has not changed, the server sends only the part the client is missing; otherwise it sends the whole entity. The parameter is also an ETag | If-Range: "737060cd8c284d8af7ad3082f209582d" |
If-Unmodified-Since | The request succeeds only if the entity has not been modified after the specified time | If-Unmodified-Since: Sat, 29 Oct 2010 19:43:31 GMT |
Max-Forwards | Limits the number of times the message can be forwarded through proxies and gateways | Max-Forwards: 10 |
Pragma | Carries implementation-specific directives | Pragma: no-cache |
Proxy-Authorization | Credentials for authenticating with a proxy | Proxy-Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ== |
Range | Requests only a portion of the entity, specifying the range | Range: bytes=500-999 |
Referer | The address of the previous web page, from which the currently requested page was reached | Referer: www.zcmhi.com/archives/71… |
TE | Transfer encodings the client is willing to accept, and whether it accepts trailer fields in chunked transfers | TE: trailers,deflate;q=0.5 |
Upgrade | Asks the server to switch to another protocol (if supported) | Upgrade: HTTP/2.0, SHTTP/1.3, IRC/6.9, RTA/x11 |
User-Agent | Information about the client software that sends the request | User-Agent: Mozilla/5.0 (Linux; X11) |
Via | Lists the intermediate gateways or proxies and the protocols they used | Via: 1.0 fred, 1.1 nowhere.com (Apache/1.1) |
Warning | Warning information about the message entity | Warning: 199 Miscellaneous warning |
HTTP Request body
A GET request normally has no request body; a request body is typically sent with methods such as POST.
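To make the three parts of a request concrete, here is a minimal sketch that serializes a request line, headers, a blank line, and an optional body. The helper `build_request`, the host `www.example.com`, and the `/login` path are all hypothetical.

```python
def build_request(method: str, target: str, headers: dict, body: bytes = b"") -> bytes:
    """Serialize request line, headers, blank line, and optional body."""
    headers = dict(headers)
    if body:
        # A body needs Content-Length so the server knows where it ends.
        headers["Content-Length"] = str(len(body))
    lines = [f"{method} {target} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Header lines end with CRLF; an empty line separates headers from body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body

# GET: no body, so the message ends at the blank line.
get_req = build_request("GET", "/example.html", {"Host": "www.example.com"})

# POST: the form data travels in the request body.
post_req = build_request(
    "POST",
    "/login",
    {"Host": "www.example.com", "Content-Type": "application/x-www-form-urlencoded"},
    b"user=alice&lang=en",
)

print(get_req.decode("ascii"))
print(post_req.decode("ascii"))
```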
The HTTP response
Similar to HTTP requests, here is the first diagram:
The HTTP response also consists of three parts, including the status line, the message header, and the response body.
HTTP response status line
The status line also consists of three parts: the HTTP protocol version, the status code, and the text description of the status code. For example:
HTTP/1.1 200 OK (CRLF)
HTTP response status code
The status code consists of three digits. The first number defines the category of the response and has five possible values:
- 1xx: Informational. The request has been received and processing continues
- 2xx: Success. The request was successfully received, understood, and accepted
- 3xx: Redirection. Further action must be taken to complete the request
- 4xx: Client error. The request has a syntax error or cannot be fulfilled
- 5xx: Server error. The server failed to fulfill a valid request
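The first-digit rule can be sketched in a few lines. The reason phrases come from Python's standard `http.HTTPStatus` enum; the `CATEGORIES` table and the `describe` helper are illustrative names, not part of any library.

```python
from http import HTTPStatus

# The first digit of the status code selects the response category.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe(code: int) -> str:
    """Return 'code reason (category)' for a three-digit status code."""
    category = CATEGORIES[code // 100]
    try:
        reason = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know still have a category.
        reason = "unknown"
    return f"{code} {reason} ({category})"

for code in (200, 301, 404, 503):
    print(describe(code))
```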
Common status codes, with reason phrases and descriptions:
- 200 OK: The client request succeeded
- 400 Bad Request: The client request has a syntax error and cannot be understood by the server
- 401 Unauthorized: The request is not authorized; this status code must be used together with the WWW-Authenticate header field
- 403 Forbidden: The server received the request but refuses to provide the service
- 404 Not Found: The requested resource does not exist, e.g. an incorrect URL was entered
- 500 Internal Server Error: An unexpected error occurred on the server
- 503 Service Unavailable: The server cannot currently process the client's request, but it may recover after a period of time
HTTP response status code description
Status code | Reason phrase | Description |
---|---|---|
100 | Continue | Continue. The client should continue with its request |
101 | Switching Protocols | Switching protocols. The server switches protocols at the client's request. It can only switch to a more advanced protocol, for example, a newer version of HTTP |
200 | OK | The request succeeded. Typically used for GET and POST requests |
201 | Created | Created. The request succeeded and a new resource was created |
202 | Accepted | Accepted. The request has been accepted, but processing is not complete |
203 | Non-Authoritative Information | Non-authoritative information. The request succeeded, but the returned meta information comes from a copy rather than from the origin server |
204 | No Content | No content. The server processed the request successfully but returned no content. The browser keeps displaying the current document without updating the page |
205 | Reset Content | Reset content. The server processed the request successfully, and the user agent (for example, a browser) should reset the document view. This code can be used to clear a form in the browser |
206 | Partial Content | Partial content. The server successfully processed a partial GET request |
300 | Multiple Choices | Multiple choices. The requested resource is available at multiple locations, and a list of resource characteristics and addresses can be returned for the user agent (for example, a browser) to choose from |
301 | Moved Permanently | Moved permanently. The requested resource has been permanently moved to a new URI. The response includes the new URI, and the browser automatically redirects to it. All future requests should use the new URI |
302 | Found | Temporarily moved. Similar to 301, but the resource is moved only temporarily. The client should continue to use the original URI |
303 | See Other | See another address. Similar to 301. The client should retrieve the other URI with a GET request |
304 | Not Modified | Not modified. The requested resource has not changed, so the server returns no resource body with this status code. Clients typically cache resources they have accessed and send a header indicating that they only want the resource returned if it has been modified after a specified date |
305 | Use Proxy | Use a proxy. The requested resource must be accessed through a proxy |
306 | Unused | A status code that is no longer used |
307 | Temporary Redirect | Temporary redirect. Similar to 302, but the client must repeat the request with the same method |
400 | Bad Request | Client request syntax error, server cannot understand |
401 | Unauthorized | The request requires user authentication |
402 | Payment Required | Reserved for future use |
403 | Forbidden | The server understands the request from the requesting client, but refuses to execute the request |
404 | Not Found | The server could not find the resource (web page) requested by the client. With this code, a web designer can set up a custom page that says “the resource you requested could not be found” |
405 | Method Not Allowed | The method in the client request is not allowed |
406 | Not Acceptable | The server cannot produce a response that matches the content characteristics requested by the client |
407 | Proxy Authentication Required | The request requires authentication with the proxy. Similar to 401, but the requester must authorize with the proxy |
408 | Request Time-out | The server waited too long for the client to send the request and timed out |
409 | Conflict | The server may return this code when handling a PUT request from the client. A conflict occurred while the server was processing the request |
410 | Gone | The resource requested by the client no longer exists. 410 differs from 404 in that the resource existed previously and has now been permanently removed. The site designer can use a 301 code to specify a new location for the resource |
411 | Length Required | The server cannot process the request because the client did not send a Content-Length header |
412 | Precondition Failed | A precondition in the client's request failed |
413 | Request Entity Too Large | The request was rejected because the request entity was too large for the server to process. To prevent continued requests from the client, the server may close the connection. If the server is only temporarily unable to process it, a Retry-After response header is included |
414 | Request-URI Too Long | The request URI (usually a URL) is too long for the server to process |
415 | Unsupported Media Type | The server could not process the media format attached to the request |
416 | Requested Range Not Satisfiable | The range requested by the client is invalid |
417 | Expectation Failed | The server cannot satisfy the Expect request header |
500 | Internal Server Error | The server had an internal error and could not complete the request |
501 | Not Implemented | The server did not support the requested functionality and could not complete the request |
502 | Bad Gateway | A server acting as a gateway or proxy received an invalid response from the upstream server |
503 | Service Unavailable | The server is temporarily unable to process client requests due to overloading or system maintenance. The length of the delay can be included in the server's Retry-After header |
504 | Gateway Time-out | The server acting as a gateway or proxy did not get a response from the upstream server in time |
505 | HTTP Version Not Supported | The server does not support the HTTP version used in the request and cannot complete the processing |
HTTP response packet
HTTP and HTTPS
Shortcomings of HTTP
- Communication uses clear text (no encryption), so the content can be eavesdropped on
- The identity of the communicating party is not verified, so impersonation is possible
- The integrity of a message cannot be proved, so it may have been tampered with
Introducing HTTPS
HTTP has no encryption mechanism of its own, but it can be combined with Secure Sockets Layer (SSL) or Transport Layer Security (TLS) to encrypt HTTP traffic. This is communication encryption, that is, the entire communication line is encrypted.
HTTP + Encryption + Authentication + Integrity Protection = HTTP Secure (HTTPS)
HTTPS uses a hybrid encryption mechanism that combines shared key (symmetric) encryption and public key (asymmetric) encryption. If keys could always be exchanged securely, public key encryption alone could be used for communication. However, public key encryption is slower than shared key encryption.
Therefore, the two are combined to make full use of their respective advantages: public key encryption is used while exchanging the key, and shared key encryption is used for the messages exchanged once communication is established.
The HTTPS handshake process is described as follows:
- The browser sends the set of encryption rules it supports to the site.
The server learns which encryption rules the browser supports.
- The site selects a set of encryption and HASH algorithms and sends its identity back to the browser in the form of a certificate. The certificate contains information such as the website address, the encryption public key, and the certificate authority.
The browser gets the server's public key.
- After obtaining the site's certificate, the browser does the following:
(a) Verify the validity of the certificate (whether the issuing authority is legitimate, whether the website address in the certificate matches the address being accessed, etc.). If the certificate is trusted, a small lock is displayed in the browser bar; otherwise a warning is shown that the certificate is not trusted.
(b) If the certificate is trusted, or if the user accepts an untrusted certificate, the browser generates a random password (the key for subsequent communication) and encrypts it with the public key provided in the certificate (public key encryption).
(c) Compute a hash of the handshake message with the agreed HASH algorithm, encrypt the message with the generated random password, and finally send all of the information generated so far to the site.
Browser verifies the certificate -> generates a random password -> encrypts it with the server's public key -> sends the communication key to the server
- After the site receives the data from the browser, it does the following:
(a) Use its own private key to decrypt the information and retrieve the password. Use the password to decrypt the handshake message sent by the browser and verify that the HASH matches the one sent by the browser.
(b) Encrypt a handshake message with the password and send it to the browser.
The server decrypts the random password with its own private key -> decrypts the handshake message with the password (shared key communication) -> verifies that the HASH matches the one sent by the browser (verifies the browser)
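In application code the handshake is normally carried out by a TLS library rather than by hand. As a hedged sketch of the certificate check in step (a), Python's standard `ssl` module enables certificate and host name verification by default; the helper `fetch_peer_certificate` is a hypothetical name and is only defined here, not called, because it needs network access.

```python
import socket
import ssl

# create_default_context() returns a client-side context with
# certificate verification and host name checking switched on.
context = ssl.create_default_context()

print(context.verify_mode == ssl.CERT_REQUIRED)  # the certificate chain must validate
print(context.check_hostname)                    # the certificate must match the host name

def fetch_peer_certificate(host: str, port: int = 443) -> dict:
    """Run a TLS handshake with host and return the verified certificate."""
    with socket.create_connection((host, port), timeout=5) as raw:
        # wrap_socket performs the handshake; it raises an SSLError
        # if the certificate cannot be verified.
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()
```

The library negotiates the key exchange itself, so the exact steps on the wire may differ from the simplified description above.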
Shortcomings of HTTPS
- The encryption and decryption process is complex, which makes access slower
- Encryption requires a certificate, which site owners typically have to purchase from a certificate authority
- Every request on the page has to use HTTPS
Features and differences of HTTP/1.0, HTTP/1.1, and HTTP/2.0
Whenever an interview touches on HTTP, this is usually one of the first things the interviewer asks about.
HTTP/1.0 features
- Stateless: The server does not track the state of requests
- Connectionless: The browser establishes a new TCP connection for each request
Stateless
To compensate for statelessness, the cookie/session mechanism can be used for identity authentication and state tracking
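A minimal sketch of the cookie/session idea, using the standard `http.cookies` module: the server hands out an identifier in a Set-Cookie header, and the browser sends it back on later requests so the server can look up the stored state. The cookie name `session_id`, its value, and the in-memory `sessions` dictionary are hypothetical.

```python
from http.cookies import SimpleCookie

# Server side, first response: issue a session identifier.
outgoing = SimpleCookie()
outgoing["session_id"] = "abc123"
outgoing["session_id"]["path"] = "/"
outgoing["session_id"]["httponly"] = True
set_cookie_header = outgoing.output(header="Set-Cookie:")
print(set_cookie_header)

# Server-side state, keyed by the identifier handed to the browser.
sessions = {"abc123": {"user": "alice"}}

# Later request: the browser returns the value in its Cookie header,
# which lets the otherwise stateless server recognize the user.
incoming = SimpleCookie()
incoming.load("session_id=abc123; Skin=new")
print(sessions[incoming["session_id"].value])
```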
Connectionless
Being connectionless causes two performance problems
- Connections cannot be reused
Every request needs its own TCP connection to be set up and torn down (that is, the three-way handshake and the four-way close), which makes network utilization very low
- Head-of-line blocking
HTTP/1.0 requires that the next request cannot be sent until the response to the previous request arrives. If the previous request is blocked, the following requests are blocked too. This is called head-of-line blocking
HTTP/1.1 features
To address the performance shortcomings of HTTP/1.0, HTTP/1.1 introduced the following:
- Persistent connections: the Connection field was added, and setting it to keep-alive keeps the connection open
- Pipelining: building on persistent connections, pipelining allows subsequent requests to be sent without waiting for the first response, but the responses are returned in the order requested. That is, multiple requests can be sent, but the responses are processed sequentially.
- Cache handling: the Cache-Control field was added
- Resumable transfer (breakpoint resume)
Persistent connections
HTTP/1.1 keeps connections persistent by default. After data is transferred, the TCP connection stays open and further data is sent over the same channel
Pipelining
Request and response over a persistent connection:
The TCP connection is not closed, and the same channel is reused
request1 -> response1 -> request2 -> response2 -> request3 -> response3
Pipelined request and response:
request1 -> request2 -> request3 -> response1 -> response2 -> response3
Even if the server has response 2 ready first, the responses are still returned in the order requested, so response 1 goes first
Although pipelining allows multiple requests to be sent at once, the responses are still returned in order, so it does not solve head-of-line blocking.
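The effect of in-order responses can be shown with a small simulation. It assumes, for illustration only, that the server processes all pipelined requests in parallel and that the processing times are made-up numbers; `delivery_times` is a hypothetical helper.

```python
def delivery_times(processing: list) -> list:
    """Time at which each pipelined response can be delivered.

    Assumes the server works on all requests in parallel but must
    send the responses back in request order.
    """
    delivered = []
    previous = 0.0
    for ready_at in processing:
        # A response cannot leave before the one ahead of it has left.
        previous = max(previous, ready_at)
        delivered.append(previous)
    return delivered

# Hypothetical processing times in seconds: the first request is slow,
# so the two fast responses behind it are held back as well.
print(delivery_times([3.0, 0.5, 1.0]))
```

With a slow first request, every later response is delayed until the first one is done, which is exactly the head-of-line blocking described above.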
Cache handling
When a browser requests a resource, it checks whether there is a cached resource. If there is a cached resource, the browser directly obtains the cached resource and does not send another request. If there is no cached resource, the browser sends a request
This behavior is controlled by setting the Cache-Control field
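A simplified sketch of how a client might interpret Cache-Control before deciding whether to reuse a cached copy. The helpers `parse_cache_control` and `is_fresh` are hypothetical, and only the `max-age`, `no-cache`, and `no-store` directives are considered; real caches apply many more rules.

```python
def parse_cache_control(value: str) -> dict:
    """Parse a Cache-Control header value into a directive dictionary."""
    directives = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, arg = item.partition("=")
        # Directives without an argument (such as no-cache) map to True.
        directives[name.lower()] = arg.strip('"') if arg else True
    return directives

def is_fresh(age_seconds: int, cache_control: str) -> bool:
    """Decide whether a cached copy may be reused without a new request."""
    directives = parse_cache_control(cache_control)
    if "no-store" in directives or "no-cache" in directives:
        return False
    max_age = directives.get("max-age")
    if not isinstance(max_age, str):
        return False  # no usable max-age: treat the copy as stale
    return age_seconds < int(max_age)

print(parse_cache_control("public, max-age=3600"))
print(is_fresh(120, "max-age=3600"))
print(is_fresh(4000, "max-age=3600"))
```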
Resumable transfer
When uploading or downloading a resource, the resource can be divided into multiple parts that are transferred separately. If a network fault occurs, the transfer can continue from the point already reached instead of starting over, which improves efficiency
This is implemented with two header fields: Range, which the client sends in the request, and Content-Range, which the server returns in the response
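The Range and Content-Range exchange can be sketched as a function that a server might run for a single byte range. `serve_range` is a hypothetical helper that handles only the `bytes=start-end` and `bytes=start-` forms, and the `resource` bytes stand in for a file.

```python
def serve_range(data: bytes, range_header: str):
    """Return (status, headers, body) for a single byte range request."""
    unit, _, spec = range_header.partition("=")
    start_text, _, end_text = spec.partition("-")
    if unit != "bytes" or not start_text:
        raise ValueError("only 'bytes=start-end' and 'bytes=start-' are handled")
    start = int(start_text)
    # An omitted end means "to the end of the resource".
    end = int(end_text) if end_text else len(data) - 1
    end = min(end, len(data) - 1)
    if start > end:
        # The requested range lies outside the resource.
        return 416, {"Content-Range": f"bytes */{len(data)}"}, b""
    body = data[start:end + 1]  # both ends of a byte range are inclusive
    headers = {
        "Content-Range": f"bytes {start}-{end}/{len(data)}",
        "Content-Length": str(len(body)),
    }
    return 206, headers, body

resource = bytes(range(256)) * 8  # 2048 bytes of placeholder content
status, headers, body = serve_range(resource, "bytes=500-999")
print(status, headers["Content-Range"], len(body))
```

A client that was interrupted after receiving 1000 bytes would resume by sending `Range: bytes=1000-` and appending the returned body to what it already has.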
HTTP/2.0 features
- Binary framing
- Multiplexing: requests and responses are sent concurrently over a shared TCP connection
- Header compression
- Server push: The server can push additional resources to the client without an explicit request from the client
Binary framing
All transmitted information is divided into smaller messages and frames, which are encoded in binary format
Multiplexing
Building on binary framing, all requests to the same domain name go through a single TCP connection. HTTP messages are broken up into independent frames that can be sent out of order, and the receiving end reassembles the messages based on the stream identifiers in the frame headers
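Framing and multiplexing can be illustrated with the 9-byte HTTP/2 frame header: a 24-bit payload length, an 8-bit type, 8-bit flags, and a 31-bit stream identifier. The sketch below is deliberately simplified: the payloads are plain bytes rather than compressed headers, and `pack_frame` and `unpack_frames` are hypothetical helper names.

```python
import struct

DATA = 0x0        # frame type code for DATA frames
END_STREAM = 0x1  # flag marking the last frame of a stream

def pack_frame(frame_type: int, flags: int, stream_id: int, payload: bytes) -> bytes:
    """Prefix a payload with the 9-byte HTTP/2 frame header."""
    length = len(payload)
    # The 24-bit length is packed as one byte plus one 16-bit integer.
    header = struct.pack(">BHBBI", length >> 16, length & 0xFFFF,
                         frame_type, flags, stream_id & 0x7FFFFFFF)
    return header + payload

def unpack_frames(buffer: bytes):
    """Yield (type, flags, stream_id, payload) for each frame in the buffer."""
    offset = 0
    while offset < len(buffer):
        high, low, frame_type, flags, stream_id = struct.unpack_from(">BHBBI", buffer, offset)
        length = (high << 16) | low
        start = offset + 9
        yield frame_type, flags, stream_id & 0x7FFFFFFF, buffer[start:start + length]
        offset = start + length

# Two streams share one connection; their frames are interleaved on the wire.
wire = b"".join([
    pack_frame(DATA, 0, 1, b"hello "),
    pack_frame(DATA, 0, 3, b"foo"),
    pack_frame(DATA, END_STREAM, 1, b"world"),
    pack_frame(DATA, END_STREAM, 3, b"bar"),
])

# The receiver reassembles each message by stream identifier.
streams = {}
for _type, _flags, stream_id, payload in unpack_frames(wire):
    streams[stream_id] = streams.get(stream_id, b"") + payload
print(streams)
```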
Differences
- The main difference between HTTP/1.0 and HTTP/1.1 is the move from connectionless operation to persistent connections
- The main difference between HTTP/2.0 and 1.x is multiplexing
Interview questions
Question 1: What happens after a URL is entered in the browser?
1. The client connects to the Web server
An HTTP client, typically a browser, establishes a TCP socket connection to the HTTP port of the Web server (80 by default, or 443 for HTTPS). For example, www.baidu.com.
2. Send an HTTP request
Through the TCP socket, the client sends a text request message to the Web server. A request message consists of the request line, the request headers, a blank line, and the request data.
3. The server accepts the request and returns an HTTP response
The Web server parses the request and locates the requested resource. The server writes a copy of the resource to the TCP socket, where it is read by the client. A response consists of a status line, response headers, a blank line, and the response data.
4. Release the TCP connection
If the connection mode is close, the server actively closes the TCP connection and the client closes passively, releasing the TCP connection. If the connection mode is keep-alive, the connection is kept open for a period of time, during which further requests can be received.
5. The client browser parses the HTML content
The client browser first parses the status line to see the status code indicating whether the request succeeded. It then parses each response header; the headers tell the browser how many bytes of HTML follow and which character set the document uses. The browser reads the HTML in the response data, formats it according to the HTML syntax, and displays it in the browser window.
For example, when you enter a URL in the browser address bar and press Enter, the following happens:
1. The browser asks the DNS server to resolve the IP address corresponding to the domain name in the URL.
2. After the IP address is resolved, the browser establishes a TCP connection with the server using that IP address and the default port 80.
3. The browser sends an HTTP request to read the file (the file following the domain name in the URL). The request can be sent to the server together with the third packet of the TCP three-way handshake.
4. The server responds to the browser's request and sends the corresponding HTML text to the browser.
5. The TCP connection is released.
6. The browser displays the HTML text.
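Steps 2 to 5 above can be tried out against a throwaway server on the local machine. This is a sketch under stated assumptions: it connects to 127.0.0.1 on a port chosen by the operating system instead of port 80, so the DNS step is skipped, and `HelloHandler` and `fetch` are hypothetical names.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET with a small HTML page."""

    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet

def fetch(host: str, port: int, path: str) -> tuple:
    """Connect, send a GET request, and return (status_line, headers, body)."""
    with socket.create_connection((host, port), timeout=5) as sock:  # TCP connection
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))                         # HTTP request
        chunks = []
        while True:                                                   # read until the server closes
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    # Status line and headers end at the first blank line.
    head, _, body = b"".join(chunks).partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return status_line, headers, body

server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: the OS picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
status_line, headers, body = fetch("127.0.0.1", server.server_port, "/index.html")
server.shutdown()
server.server_close()

print(status_line)
print(headers["Content-Type"])
print(body.decode("utf-8"))
```

A browser would go on to parse and render the returned HTML, which is the subject of the next question.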
Question 2: since browser rendering came up, let’s talk about the principle and process by which a browser renders a web page
The principle
To understand how browser rendering works, it is enough to understand the critical rendering path
The critical rendering path is the entire process in which the browser receives the requested HTML, CSS, JavaScript and other resources, then parses them, builds the trees, performs layout, paints, and finally presents the interface to the user
Take a look at the WebKit flow:
To summarize the process:
- The browser parses the retrieved HTML document into a DOM tree
- The CSS is processed to form the CSS object model, CSSOM
- The DOM and CSSOM are combined into a render tree, which represents the set of objects to be rendered
- The browser computes the position and size of each element in the render tree, a step called layout. The browser uses a flow-based approach, so the elements can usually be laid out in a single pass
- The nodes of the render tree are drawn onto the screen, a step called painting
- The content is displayed on the web page
The questions summarized above come up a lot, but every interviewer asks them differently, so it is best to prepare well for the interview.
Here are a few more questions; if you are interested, you can work out the answers yourself:
- What is the difference between HTTP and HTTPS? (Mentioned in the article)
- Why is HTTPS safe? (Mentioned in the article)
- Do you understand how symmetric and asymmetric encryption algorithms perform encryption operations? (Check by yourself)
- Describe the HTTPS handshake process.
- Man-in-the-middle attack on HTTPS
The HTTP family is a very large area of knowledge. If you want to learn more, the book Illustrated HTTP is recommended. Keep at it!