Notes on HTTP: The Definitive Guide

1. An overview of HTTP

Web resources: A Web server hosts Web resources. Web resources are the sources of Web content.

Media type: HTTP carefully labels each object transferred over the Web with a data format label called a MIME type. A MIME type is a textual marker consisting of a primary object type and a specific subtype, separated by a slash.

  • Text documents in HTML format are marked with the text/html type.
  • Plain ASCII text documents are marked with the text/plain type.
  • JPEG images are image/jpeg.
  • Images in GIF format are image/gif.

The Multipurpose Internet Mail Extensions (MIME) standard was originally designed to solve the problem of moving messages between different email systems.
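For illustration, here is a small sketch using Python's standard mimetypes module, which maps filename extensions to MIME types in the same primary-type/subtype form (the filenames are made up; results may vary slightly with the platform's type table):

```python
import mimetypes

# Map filenames to MIME types of the form <primary type>/<subtype>.
for name in ["index.html", "notes.txt", "photo.jpeg", "logo.gif"]:
    mime_type, encoding = mimetypes.guess_type(name)
    print(f"{name}: {mime_type}")
# index.html: text/html
# notes.txt: text/plain
# photo.jpeg: image/jpeg
# logo.gif: image/gif
```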

URI: Uniform Resource Identifier

URL: Uniform Resource Locator

2. URLs and resources

URLs give users and their browsers everything they need to find a resource. A URL defines the specific resource a user wants, where it is located, and how to get it.

URL syntax

The URL syntax for most URL schemes is based on this common nine-part format:

```
<scheme>://<user>:<password>@<host>:<port>/<path>;<params>?<query>#<frag>
```

The three most important parts of a URL are the scheme, the host, and the path. The components break down as follows (see the parsing sketch after the list):

  • Scheme: which protocol to use when accessing the server for the resource
  • User: some schemes require a username to access a resource
  • Password: the user's password, if any, separated from the username by a colon (:)
  • Host: the hostname or dotted IP address of the resource server
  • Port: the port number on which the resource server is listening; many schemes have default port numbers
  • Path: the local name of the resource on the server, separated from the previous components by a slash (/)
  • Params: some schemes need input parameters to function properly; a URL can contain multiple parameter fields, separated from each other and from the rest of the path by semicolons (;)
  • Query: some schemes use this component to pass parameters that narrow down the requested resource; it is separated from the rest of the URL by a question mark (?)
  • Fragment: the name of a piece or part of the resource; the fragment is used by the client only and is not sent to the server
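As a quick illustration of these components, Python's standard urllib.parse module splits a URL into most of these parts (the URL below is made up for the example):

```python
from urllib.parse import urlparse

# A made-up URL exercising most of the nine components.
url = "http://bob:secret@www.example.com:8080/tools/drill.html;sale=0?item=12731#price"
parts = urlparse(url)

print(parts.scheme)    # http
print(parts.username)  # bob
print(parts.password)  # secret
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /tools/drill.html
print(parts.params)    # sale=0
print(parts.query)     # item=12731
print(parts.fragment)  # price
```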

URL shortcuts

Relative URL

An absolute URL contains all the information needed to access a resource. A relative URL must be resolved against another URL, called its base URL, to recover that information.

Where base URLs come from:

  • Explicitly provided in the resource: an HTML document can include a <base> tag that defines a base URL for resolving all relative URLs within that document.
  • The URL of the encapsulating resource: if no base URL is given explicitly, the URL of the resource in which the relative URL is embedded serves as the base.

Automatic URL extension

Hostname expansion: using simple heuristics, the browser can expand the hostname you type into a full URL.

History expansion: the browser matches what you are typing against the prefixes of previously visited URLs and offers complete URLs to choose from.

URL encoding

URLs are restricted to characters from the US-ASCII character set, and some of those characters are ambiguous inside a URL, so an escaping mechanism is needed.

The principle of URL encoding is to use safe characters (printable characters with no special purpose or meaning) to represent unsafe ones.

When writing URLs, use printable characters from the US-ASCII character set. To place any other character in a URL, encode it with a special escape sequence.

Besides characters that cannot be displayed, reserved and unsafe characters in URLs also need to be encoded.

Reserved characters are those that have a specific meaning within the URL. Unsafe characters have no special meaning in the URL itself but may take on special meaning in the context in which the URL appears, for example the double quote (").

URL encoding, also known as percent encoding, represents a character as a percent sign (%) followed by the two hexadecimal digits of the character's ASCII code.
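Python's urllib.parse shows the mechanics (a small sketch; the exact set of characters left unescaped depends on the safe parameter):

```python
from urllib.parse import quote, unquote

# A space becomes %20: a percent sign plus the two hex digits of ASCII 0x20.
print(quote("hello world.html"))      # hello%20world.html

# Reserved characters can be escaped too, when they must not keep their special meaning.
print(quote("a&b=c", safe=""))        # a%26b%3Dc

# Decoding reverses the transformation.
print(unquote("Hello%20World%21"))    # Hello World!
```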

Some reserved and unsafe characters and their URL encodings:

| Character | Description | Usage | Encoding |
|---|---|---|---|
| ; | Semicolon | Reserved | %3B |
| / | Slash | Reserved | %2F |
| ? | Question mark | Reserved | %3F |
| : | Colon | Reserved | %3A |
| @ | At sign | Reserved | %40 |
| = | Equals sign | Reserved | %3D |
| & | Ampersand | Reserved | %26 |
| < | Less-than sign | Unsafe | %3C |
| > | Greater-than sign | Unsafe | %3E |
| " | Double quote | Unsafe | %22 |
| # | Hash sign | Unsafe | %23 |
| % | Percent sign | Unsafe | %25 |
| { | Left brace | Unsafe | %7B |
| } | Right brace | Unsafe | %7D |
| \| | Vertical bar | Unsafe | %7C |
| \ | Backslash | Unsafe | %5C |
| ^ | Caret | Unsafe | %5E |
| ~ | Tilde | Unsafe | %7E |
| [ | Left bracket | Unsafe | %5B |
| ] | Right bracket | Unsafe | %5D |
| ` | Backtick | Unsafe | %60 |
| (space) | Space | Unsafe | %20 |

3. HTTP messages

HTTP messages are simple, formatted blocks of data. Each message contains either a request from a client or a response from a server. Messages consist of three parts: a start line describing the message, headers containing attributes, and an optional body containing data.

HTTP messages fall into two types: request messages and response messages.

Format of a request message:

```
<method> <request-URL> <version>
<headers>

<entity-body>
```

Format of a response message:

```
<version> <status> <reason-phrase>
<headers>

<entity-body>
```

Request and response messages differ only in the syntax of the start line.
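To make the structure concrete, here is a minimal sketch using Python's standard http.client module (example.com and the path are placeholders). It sends a GET request and then prints the pieces of the response start line, the headers, and the body length:

```python
import http.client

# Open a connection and send a request; example.com is a placeholder host.
conn = http.client.HTTPConnection("example.com", 80)
conn.request("GET", "/index.html", headers={"User-Agent": "notes-demo/1.0"})

resp = conn.getresponse()

# Start line of the response: <version> <status> <reason-phrase>
print(resp.version, resp.status, resp.reason)   # e.g. 11 200 OK (11 means HTTP/1.1)

# Headers: name/value pairs, each originally terminated by CRLF.
for name, value in resp.getheaders():
    print(f"{name}: {value}")

# Entity body: the data block following the blank line.
body = resp.read()
print(len(body), "body bytes")

conn.close()
```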

  • Method

    The action that the client wants the server to perform on the resource; a single word such as GET, HEAD, or POST.

  • Request URL (request-URL)

    A complete URL, or the path component of a URL, naming the requested resource.

  • Version

    The HTTP version the message uses, in the form HTTP/<major>.<minor>.

  • Status code (status-code)

    The response status code sent back by the server. These three digits describe what happened during the request; the first digit gives the general class of the status ("success", "error", etc.).

  • Reason phrase (reason-phrase)

    A textual description of the status code.

  • Headers

    There can be zero or more headers, each consisting of a name, a colon (:), optional whitespace, a value, and a terminating CRLF (carriage return and line feed). The header list is terminated by a blank line (a bare CRLF), marking the end of the headers and the beginning of the entity body.

  • Entity body (entity-body)

    The entity body contains a block of arbitrary data. Not all messages carry an entity body; sometimes a message simply ends with the terminating CRLF.

Methods

The opening line of the request starts with a method that tells the server what to do.

Common HTTP methods:

| Method | Description | Has body? |
|---|---|---|
| GET | Requests the resource identified by the request URL | No |
| POST | Appends new data to the resource identified by the request URL | Yes |
| HEAD | Requests just the headers of the resource identified by the request URL, without the entity body | No |
| PUT | Stores the body of the request on the server, under the name given by the request URL | Yes |
| DELETE | Asks the server to remove the resource identified by the request URL | No |
| TRACE | Asks the server to echo back the request message it received, mainly for testing and diagnostics | No |
| OPTIONS | Asks the Web server which capabilities it supports, either in general or for a particular resource | No |

Status code

The status code consists of three digits. The first digit defines the class of the response and has five possible values:

  • 1xx: Informational – the request was received and processing continues
  • 2xx: Success – the request was successfully received, understood, and accepted
  • 3xx: Redirection – further action must be taken to complete the request
  • 4xx: Client error – the request has a syntax error or cannot be fulfilled
  • 5xx: Server error – the server failed to fulfill a valid request

Common status codes:

| Status code | Reason phrase | Description |
|---|---|---|
| 200 | OK | The client request succeeded |
| 400 | Bad Request | The client request has a syntax error and cannot be understood by the server |
| 401 | Unauthorized | The request is unauthorized; this status code must be used together with the WWW-Authenticate header field |
| 403 | Forbidden | The server received the request but refuses to serve it |
| 404 | Not Found | The requested resource does not exist |
| 500 | Internal Server Error | An unexpected error occurred on the server |
| 503 | Service Unavailable | The server cannot currently handle the request, but may recover after some time |

Headers

Headers and methods work together to determine what the client and server can do.

Headers can be used to provide information in both request and response messages. Some headers are specific to certain messages, while others are more general. Headers can be divided into four types:

General headers

These are generic headers that both clients and servers can use. They provide very useful general-purpose functionality between clients, servers, and other applications, and convey the most basic information about a message.

| Header | Description |
|---|---|
| Connection | Allows clients and servers to specify options for the request/response connection, such as keeping it open, or the "close" option telling the peer to drop the connection when the response completes |
| Date | Gives the date and time at which the message was created |
| Cache-Control | Carries caching directives, which are one-way (a directive in a response need not appear in the request) and independent (a directive on one message does not affect the caching of another) |
| Pragma | Another way to ship directives along with a message, not restricted to caching |
| Transfer-Encoding | Tells the receiver what encoding was applied to the message so it could be transported reliably |

Request headers

Request headers appear only in request messages. They give the server extra information, such as what kinds of data the client is willing to receive, who or what is sending the request, where the request came from, and the client's preferences and capabilities. The server can use this information to try to give the client a better response.

Request informational headers:

| Header | Description |
|---|---|
| Client-IP | Gives the IP address of the machine on which the client runs |
| From | Gives the email address of the client's user |
| Host | Gives the hostname and port of the server receiving the request |
| User-Agent | Tells the server the name of the application making the request |
Accept headers

Accept headers give the client a way to tell the server its preferences and capabilities: what it wants, what it can use, and, most importantly, what it does not want. The server can then make more informed decisions about what to send.

| Header | Description |
|---|---|
| Accept | Tells the server which media types it may send |
| Accept-Charset | Tells the server which character sets it may send |
| Accept-Encoding | Tells the server which encodings it may send |
| Accept-Language | Tells the server which languages it may send |

Security request headers:

| Header | Description |
|---|---|
| Authorization | Contains the data the client supplies to authenticate itself to the server |
| Cookie | Used by the client to send a token to the server; not a true security header, but it does imply security functionality |

Response headers

Response headers appear only in response messages. They give the client extra information, such as who is sending the response and the responder's capabilities, and may even carry special instructions about the response. These headers help the client handle the response and make better requests in the future.

| Header | Description |
|---|---|
| Age | How old the response is (time since it was originally created) |
| Location | Directs the client to a resource located at a URI different from the requested one; returned when the location of the requested resource has changed |
| Server | The name and version of the server's application software |
| WWW-Authenticate | Must accompany a 401 response; when the client receives the 401, it resends the request with an Authorization header carrying its credentials |

Entity headers

Entity headers describe the entity and its content. Both request and response messages may carry an entity, so these headers can appear in either type of message. In short, entity headers tell the receiver of the message what it is dealing with.

| Header | Description |
|---|---|
| Content-Base | The base URL for resolving relative URLs within the body |
| Content-Encoding | Any encoding that was performed on the body |
| Content-Language | The natural language of the body's content |
| Content-Length | The length, in bytes, of the body |
| Content-Location | The actual location of the resource |
| Content-Type | The media type of the entity body being sent to the receiver |
| Content-Range | The range of bytes this entity represents within the complete resource |
| Expires | The date and time at which the entity is no longer valid and must be fetched again from the original source |
| Last-Modified | The date and time at which the entity was last modified |

4. Connection management

Programming with TCP sockets

| Socket API call | Description |
|---|---|
| s = socket(<parameters>) | Creates a new, unnamed, unconnected socket |
| bind(s, <local IP:port>) | Assigns a local port number and interface to the socket |
| connect(s, <remote IP:port>) | Establishes a connection between the local socket and a remote host and port |
| listen(s, ...) | Marks a local socket as legally able to accept incoming connections |
| s2 = accept(s) | Waits for someone to establish a connection to the local port |
| n = read(s, buffer, n) | Tries to read n bytes from the socket into the buffer |
| n = write(s, buffer, n) | Tries to write n bytes from the buffer to the socket |
| close(s) | Closes the TCP connection completely |
| shutdown(s, <side>) | Closes only the input or the output side of the TCP connection |

The socket API lets programs create TCP endpoint data structures, connect those endpoints to remote TCP endpoints, and read and write data streams. The TCP API hides the handshaking details of the underlying network protocols, as well as the segmentation and reassembly between TCP data streams and IP packets.
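A minimal sketch of those calls using Python's socket module, fetching a page over a raw TCP connection (example.com is a placeholder host):

```python
import socket

# socket() + connect(): create an endpoint and connect it to the remote host/port.
s = socket.create_connection(("example.com", 80))

# write(): send an HTTP request as raw bytes; CRLF line endings, blank line ends headers.
request = (
    "GET /index.html HTTP/1.0\r\n"
    "Host: example.com\r\n"
    "\r\n"
)
s.sendall(request.encode("ascii"))

# read(): collect response bytes until the server closes the connection.
chunks = []
while True:
    data = s.recv(4096)
    if not data:
        break
    chunks.append(data)

# close(): shut the connection down completely.
s.close()

print(b"".join(chunks)[:200])  # first bytes: status line and headers
```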

Parallel connections

HTTP allows clients to open multiple connections and perform multiple HTTP transactions in parallel.

  • Parallel connections may make pages load faster by overlapping delays

  • Parallel connections are not always faster: when bandwidth is scarce, multiple connections compete for it and each one runs more slowly.

    Browsers do use parallel connections, but they limit the total number to a small value (typically four).

  • Parallel connections may feel faster, because the user can watch multiple parts of the page load at once

Persistent connections

A TCP connection that remains open after the transaction ends is called a persistent connection. Persistent connections remain open between transactions until the client or server decides to close them.

By reusing an idle persistent connection that is already open to the target server, you avoid the slow connection-establishment phase. In addition, an already open connection avoids the slow-start congestion-adaptation phase, allowing data to be transferred faster.

Persistent and parallel connections

Parallel connections can speed up the transfer of composite pages. But parallel connections also have some disadvantages:

  • Each transaction opens/closes a new connection, consuming time and bandwidth.
  • Due to the slow start nature of TCP, the performance of each new connection is degraded.
  • The number of parallel connections that can be opened is actually limited.

Persistent connections have some advantages over parallel connections. Persistent connections reduce latency and connection establishment overhead, keep connections tuned, and reduce the potential number of open connections. However, be careful when managing persistent connections, or you can accumulate a large number of idle connections that consume resources on both local and remote clients and servers.

Using persistent connections in conjunction with parallel connections is probably the most efficient approach. Today, many Web applications open a small number of parallel connections, each of which is a persistent connection.

There are two types of persistent connections:

  • HTTP/1.0+ keep-alive connections
  • HTTP/1.1 persistent connections

Keep-alive operation

Clients implementing HTTP/1.0 keep-alive connections can request that a connection be kept open by including a Connection: Keep-Alive header in the request.

If the server is willing to keep the connection open for the next request, it includes the same header in the response. If the response carries no Connection: Keep-Alive header, the client assumes the server does not support keep-alive and will close the connection after sending the response.

Keep-alive options

The Keep-Alive header merely requests that the connection stay active. Even after a keep-alive request, the client and server are not obliged to agree to a keep-alive session; they can close idle keep-alive connections at any time and can limit the number of transactions handled on a keep-alive connection.

Keep-alive behavior can be tuned with comma-separated options in the Keep-Alive header, as in the example below:

  • timeout: for how long the server intends to keep the connection active.
  • max: for how many more transactions the server intends to keep the connection active.
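A hypothetical keep-alive exchange might look like this (the timeout, max, and length values are invented for the example):

```
GET /index.html HTTP/1.0
Connection: Keep-Alive
Host: www.joes-hardware.com

HTTP/1.0 200 OK
Connection: Keep-Alive
Keep-Alive: timeout=120, max=5
Content-Length: 3104
```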

Restrictions and rules for keep-alive connections

  • Keep-alive is not on by default in HTTP/1.0; the client must send a Connection: Keep-Alive request header to activate it.
  • The Connection: Keep-Alive header must accompany every message that wants to keep the connection alive; if the client omits it, the server closes the connection after that request.
  • The client can tell whether the server will close the connection after the response by checking whether the response contains a Connection: Keep-Alive header.
  • The connection can be kept open only if the length of the message body can be determined without sensing connection close; that is, the body must have a correct Content-Length, use a multipart media type, or be encoded with chunked transfer encoding. Sending the wrong Content-Length back on a keep-alive channel is bad, because the other end can no longer detect where one message ends and the next begins.
  • Proxies and gateways must honor the Connection header: before forwarding or caching the message, they must remove every header field named in the Connection header, along with the Connection header itself.
  • Strictly speaking, a client should not establish keep-alive connections with a proxy unless it can determine that the proxy supports the Connection header, to avoid the dumb-proxy problem described below.

Keep-alive and dumb proxies

The problem lies with proxies, in particular proxies that do not understand the Connection header and do not know that it must be removed before being forwarded down the chain. Many older or simple proxies are blind relays that just forward bytes from one connection to another without any special handling of the Connection header.

Here’s what’s happening:

  1. The Web client sends a message to the proxy that includes the Connection: Keep-Alive header, requesting a keep-alive connection if possible. The client then waits for the response to learn whether the other side has granted its keep-alive request.
  2. The dumb proxy receives the HTTP request but does not understand the Connection header (it treats it as just another extension header). The proxy has no idea what keep-alive means, so it forwards the message verbatim to the server. But the Connection header is a hop-by-hop header: it applies to a single transport link and must not be forwarded down the chain. Now things start to go wrong.
  3. The forwarded HTTP request reaches the Web server. When the server sees the Connection: Keep-Alive header passed on by the proxy, it mistakenly concludes that the proxy wants a keep-alive conversation. That is fine with the server: it agrees and sends back a Connection: Keep-Alive response header. At this point the server believes it is in a keep-alive conversation with the proxy and will follow the keep-alive rules, but the proxy knows nothing about keep-alive.
  4. The dumb proxy relays the server's response back to the client, including the Connection: Keep-Alive header. Seeing this header, the client assumes the proxy has agreed to a keep-alive conversation. So now both the client and the server believe they are in a keep-alive conversation, while the proxy between them knows nothing about it.
  5. Because the proxy knows nothing about keep-alive, it relays all the data it receives to the client and then waits for the origin server to close the connection. But the origin server believes the proxy asked it to keep the connection open, so it will not close it. The proxy therefore hangs, waiting for a close that never comes.
  6. When the client receives the response, it immediately moves on and sends its next request to the proxy over what it believes is a keep-alive connection. The proxy, not expecting another request on the same connection, ignores it, and the browser just spins with no progress.
  7. This miscommunication leaves the browser hanging until the client or server times out the connection and closes it.

To avoid this kind of proxy miscommunication, modern proxies must never forward the Connection header or any header whose name appears in the Connection value. In addition, a few hop-by-hop headers, among them Keep-Alive, Proxy-Authenticate, Proxy-Connection, Transfer-Encoding, and Upgrade, should not be forwarded either.

HTTP/1.1 persistent connections

HTTP/1.1 persistent connections are active by default: unless indicated otherwise, HTTP/1.1 assumes all connections are persistent. To close a connection after a transaction, an HTTP/1.1 application must explicitly add a Connection: close header to the message. This is an important difference from earlier versions of HTTP, where keep-alive connections were optional or unsupported.

An HTTP/1.1 client assumes the connection will stay open after a response unless the response contains a Connection: close header. However, clients and servers can still close idle connections at any time; not sending Connection: close does not mean the server promises to keep the connection open forever.

Restrictions and rules for persistent connections

  • After sending a Connection: close request header, the client cannot send more requests on that connection.
  • If a client does not want to send another request on the connection, it should send a Connection: close header in the final request.
  • A connection can stay persistent only if every message on it has a correct, self-defined message length; that is, the entity body length matches the Content-Length, or the body uses chunked transfer encoding.
  • HTTP/1.1 proxies must manage persistent connections to the client and to the server separately; each persistent connection applies to a single hop.

Pipelined connections

HTTP/1.1 permits optional request pipelining over persistent connections. This is a further performance optimization beyond keep-alive connections: multiple requests can be enqueued before the responses arrive. While the first request is streaming across the network to a server on the other side of the globe, the second and third requests can already be on their way. This reduces round trips and improves performance on high-latency networks.
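A bare-bones sketch of pipelining over a raw socket in Python (example.com is a placeholder, and the sketch assumes the server keeps the HTTP/1.1 connection open); both requests are written before any response is read:

```python
import socket

s = socket.create_connection(("example.com", 80))

# Enqueue two requests on the same persistent connection
# before reading any response.
for path in ("/a.html", "/b.html"):
    request = f"GET {path} HTTP/1.1\r\nHost: example.com\r\n\r\n"
    s.sendall(request.encode("ascii"))

# Responses must come back in the same order the requests were sent,
# since HTTP messages carry no sequence numbers.
reply = s.recv(65536)
print(reply.decode("ascii", "replace")[:200])

s.close()
```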

Several limitations on piped connections:

  • HTTP clients should not pipeline until they can verify that the connection is persistent.
  • HTTP responses must be returned in the same order as the requests. HTTP messages carry no sequence numbers, so if responses arrived out of order there would be no way to match them to their requests.
  • HTTP clients must be prepared for the connection to close at any time, and to resend any pipelined requests that were still outstanding.

Closing connections

Arbitrary disconnection: any HTTP client, server, or proxy can close a TCP transport connection at any time.

Retrying after connection close

HTTP applications must be prepared to handle unexpected closes properly. If a transport connection closes while the client is performing a transaction, the client should reopen the connection and retry, unless the transaction has side effects. The situation is worse for pipelined connections: the client may enqueue a large number of requests, but the origin server can close the connection, leaving numerous unprocessed requests that must be rescheduled.

Side effects are an important issue. If the connection closes after some request data has been sent but before the response comes back, the client cannot be 100% sure how many transactions the server actually carried out. Some transactions, such as GETting a static HTML page, can be repeated without changing anything. Others, such as POSTing an order to an online bookstore, must not be repeated, or you risk placing multiple orders.

A transaction is idempotent if executing it once or many times yields the same result. GET, HEAD, PUT, DELETE, TRACE, and OPTIONS are all idempotent.

The POST method is nonidempotent. Nonidempotent requests should not be pipelined; before sending one, wait for the response status of the previous request.

Graceful close

Full close: the close() socket call closes both the input and the output channels of the TCP connection.

Half close: the shutdown() socket call closes the input or the output channel individually.

Reset error:

It is always safe to close the output channel of a connection. Closing the input channel is dangerous unless you know the other side is not going to send any more data.

Say you have sent ten pipelined requests on a persistent connection, and the responses have arrived and are sitting in your operating system's buffer, not yet read by your application. Now you send an eleventh request, but the server decides you have used the connection long enough and closes it. Your eleventh request arrives at a closed connection, and the server sends back a reset. The reset empties your input buffer.

When you finally try to read, you get a "connection reset by peer" error, and the buffered, unread response data is lost, even though most of it safely reached your machine.

Graceful close in practice:

An application that wants to close gracefully should first close its own output channel and then wait for the peer on the other end of the connection to close its output channel. When each side has told the other it will send no more data (by closing its output channel), the connection can be closed fully with no risk of a reset.

In practice, an application implementing a graceful close should half-close its output channel and then periodically check the state of its input channel (looking for data or for end-of-stream). If the peer does not close its channel within some timeout, the application may force the connection closed to save resources.
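A sketch of this half-close pattern with Python sockets (the timeout value is arbitrary):

```python
import socket

def graceful_close(s: socket.socket) -> None:
    # Half-close: stop sending, but keep reading what the peer still has in flight.
    s.shutdown(socket.SHUT_WR)

    # Drain the input channel until the peer closes its side (recv returns b"").
    s.settimeout(5.0)
    try:
        while s.recv(4096):
            pass
    except socket.timeout:
        pass  # peer never closed; give up and force the close below

    # Now a full close cannot trigger a reset that destroys unread data.
    s.close()
```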

5. Web servers

A Web server implements HTTP and the associated TCP connection handling. It is also responsible for managing the resources it serves and for the configuration, control, and administration of the server itself.

1. Establish a connection

Handling new connections

When a client requests a TCP connection to the Web server, the server establishes the connection and determines which client is on the other end, extracting the IP address from the TCP connection. Once the new connection is established and accepted, the server adds it to its list of open connections and prepares to watch for data on that connection.

Client hostname identification

Most Web servers can be configured to convert client IP addresses into client hostnames using reverse DNS. The server can use the hostname for detailed access control and logging. Be aware, though, that hostname lookups can take a long time and slow down Web transactions; many high-volume Web servers disable hostname resolution or enable it only for particular content.

Identifying the client user with ident

Some Web servers also support the IETF ident protocol, which lets a server find out the username that initiated an HTTP connection. If the client runs an ident daemon, it listens for ident requests on TCP port 113. After the client opens an HTTP connection, the server opens its own connection to the client's ident port (113), sends a simple request identifying the new connection (by client and server port numbers), and parses the username out of the client's response.

Ident works well within an organization, but does not work well over the public Internet for a number of reasons, including:

  • Many client PCs do not run an ident daemon;
  • The ident protocol adds significant delay to HTTP transactions;
  • Many firewalls do not allow ident traffic;
  • The ident protocol is insecure and easy to spoof;
  • The ident protocol does not work well with virtual IP addresses;
  • Exposing the client's username raises privacy concerns.

2. Receive the request message

Parsing the request message:

  • Parse the request line, finding the request method, the resource identifier (URI), and the version number, separated by single spaces and ending with a carriage-return line-feed (CRLF) sequence;
  • Read the message headers, each ending with CRLF;
  • Detect the blank line, consisting of a bare CRLF, that marks the end of the headers (if any);
  • Read the request body, if present, whose length is given by the Content-Length header.

As it parses the request, the Web server receives input data from the network intermittently; the connection can stall at any time. The server reads data from the network and buffers partial message data in memory until it has enough to parse and make sense of.

Because requests can arrive at any moment, the Web server constantly watches for new requests. Different Web server architectures service requests in different ways:

  • Single-threaded Web servers
  • Multiprocess and multithreaded Web servers
  • Multiplexed I/O servers
  • Multiplexed multithreaded Web servers

3. Process requests

Once the Web server has received a request, it can process it using the method, resource, headers, and optional body. Some methods, such as POST, require entity body data in the request. Others, such as OPTIONS, allow a body but do not require one. A few methods, such as GET, forbid entity body data in request messages.

4. Mapping and accessing resources

Docroot

Web servers support several kinds of resource mapping, but the simplest form uses the request URI as the name of a file in the Web server's filesystem. Typically the server reserves a special folder for Web content, called the document root, or docroot. The server takes the URI from the request message and appends it to the document root.
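A sketch of that mapping in Python, including the guard a real server needs so that ".." segments in a URI cannot escape the document root (the docroot path and URI are illustrative):

```python
from pathlib import Path

DOCROOT = Path("/usr/local/httpd/files")  # hypothetical document root

def map_uri_to_file(uri: str) -> Path:
    # Append the request URI to the docroot...
    candidate = (DOCROOT / uri.lstrip("/")).resolve()
    # ...but refuse paths that resolve outside it (e.g. /../etc/passwd).
    if not candidate.is_relative_to(DOCROOT):
        raise PermissionError(f"URI escapes docroot: {uri}")
    return candidate

print(map_uri_to_file("/specials/saw-blade.gif"))
# /usr/local/httpd/files/specials/saw-blade.gif
```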

Directory listing

A Web server can receive a request for a directory URL whose path can be resolved to a directory rather than a file. Most Web servers can be configured to act differently when a client requests a directory URL.

  • Return an error.
  • Return a special default "index file" instead of the directory.
  • Scan the directory and return an HTML page listing its contents.

Most Web servers will look for a file in the directory called index.html to represent the directory. If the user requests a URL for a directory that contains a file called index.html, the server returns the contents of that file.

Dynamic content resource mapping

Web servers can also map URIs to dynamic resources, that is, to programs that generate content on demand. In fact, a whole class of servers, called application servers, exists to connect Web servers to complicated backend applications. The Web server must be able to tell when a resource is dynamic, where the dynamic content generator lives, and how to run it.

Access control

The Web server can also control access to specific resources. When a request arrives to access a controlled resource, the Web server can perform access control based on the IP address of the client or request a password to access the resource.

5. Build the response

Response entities

Once the server has identified the resource, it performs the action described by the request method and returns a response message. If there is a response body, the response message usually includes:

  • The HTTP version, a status code, and a reason phrase;

  • A Content-Type header giving the MIME type of the response body;

  • A Content-Length header giving the length of the response body;

  • The actual message body content.

MIME typing

  • Extension-based typing: the server uses the file's extension to determine its MIME type
  • Magic typing: the server scans the contents of each resource and matches them against a table of known patterns (called the magic file) to determine the MIME type
  • Explicit typing: the server can be configured to force particular files or directory contents to a given MIME type, regardless of extension or contents
  • Type negotiation: the server negotiates with the user which format (and associated MIME type) is best to use

Redirection

Web servers sometimes return a redirect response instead of a success message: the server can redirect the browser elsewhere to complete the request. The Location response header carries the URI of the new or preferred location of the content. Redirects are useful for:

  • Permanently moved resources
  • Temporarily moved resources
  • URL augmentation
  • Load balancing
  • Server affinity
  • Canonicalizing directory names

6. Send a response

7. Record a log

6. Proxies

A Web proxy server is an intermediary on the network. A proxy sits between a client and a server and acts as a middleman, shuttling HTTP messages back and forth between the endpoints.

An HTTP proxy server is both a Web server and a Web client. HTTP clients send request messages to the proxy, and the proxy must handle the requests and connections correctly, just like a Web server, and return responses. At the same time, the proxy itself sends requests to servers, so it must also behave like a correct HTTP client, sending requests and receiving responses.

Private and shared proxies

Public proxies: most proxies are public, shared proxies. Centralized proxies are more cost-effective and easier to administer. Some proxy applications, such as caching proxy servers, exploit the requests users have in common, so the more users funnel through the same proxy server, the more useful it becomes.

Private proxies: dedicated private proxies are not common, but they do exist, especially when run directly on the client computer.

Proxy versus gateway

A proxy connects to two or more applications using the same protocol, while a gateway connects to two or more endpoints using different protocols.

Why use proxies

Proxy servers can do all sorts of snazzy and useful things. They can improve security, improve performance, and save money. The proxy server can see and touch all the HTTP traffic that passes, so the proxy can monitor and modify it to implement many useful value-added Web services.

  • Adult content filtering
  • Document access control
  • Security firewalls
  • Web caching
  • Reverse proxies (surrogates)
  • Content routers
  • Transcoders
  • Anonymizers

Client proxy settings

  • Manual configuration: the user explicitly specifies the proxy to use.
  • PAC files: a PAC file is a small JavaScript program that computes proxy settings on the fly and therefore provides a more dynamic proxy configuration. As each document is accessed, a JavaScript function selects the appropriate proxy server.
  • WPAD: the WPAD protocol's algorithm automatically discovers the right PAC file for the browser, using an escalating series of discovery mechanisms.

Clients implementing the WPAD protocol need to:

  • Use WPAD to find the PAC URI;
  • Fetch the PAC file from that URI;
  • Execute the PAC file to determine the proxy server;
  • Send requests through that proxy server.

Issues related to proxy requests

The proxy URI is different from the server URI

When a client sends a request to a Web server, the request line carries only a partial URI, with no scheme, host, or port, as in this example:

```
GET /index.html HTTP/1.0
User-Agent: SuperBrowser v1.3
```

But when a client sends a request to a proxy, the request line contains the full URI. For example:

```
GET http://www.marys-antiques.com/index.html HTTP/1.0
User-Agent: SuperBrowser v1.3
```

A proxy can handle both proxy and server requests

Because traffic can be redirected to a proxy server in different ways, a general-purpose proxy should support both full and partial URIs in request messages. The proxy should use the full URI for an explicit proxy request, and the partial URI plus the virtual Host header for a Web server request.

The rules for using full and partial URIs are shown below.

  • If a full URI is provided, the proxy should use it.
  • If a partial URI is provided and a Host header is present, the Host header should be used to determine the origin server's name and port.
  • If a partial URI is provided and there is no Host header, the origin server must be determined some other way:
    • If the proxy is a surrogate standing in for the origin server, it can be configured with the real server's address and port.
    • If the traffic was intercepted and the interceptor makes the original IP address and port available, the proxy can use them.
    • If all else fails, the proxy does not have enough information to identify the origin server and must return an error message (usually telling the user to upgrade to a modern browser that supports Host headers).

Tracing messages

The Via header

The Via header field lists information about each intermediate node (proxy or gateway) a message passes through. Each time the message goes through another node, that node must be appended to the end of the Via list.

The Via header field is used to track message forwarding, diagnose message loops, and identify the protocol capabilities of all the senders along the request/response chain.

Via syntax

The Via header field contains a comma-separated list of waypoints. Each waypoint represents an individual proxy server or gateway and holds information about the protocol and address of that intermediate node.

Each Via waypoint contains up to four components: an optional protocol name (HTTP by default), a required protocol version, a required node name, and an optional descriptive comment:

```
Via: 1.1 cache.joes-hardware.com, 1.1 proxy.irenes-isp.net
```
Via and gateways

Some proxies provide gateway functionality to servers that speak non-HTTP protocols. The Via header records these protocol conversions, so HTTP applications can learn about the protocol capabilities and conversions made at each point along the proxy chain:

```
Via: FTP/1.0 proxy.irenes-isp.net (Traffic-Server/5.0.1-17882 [cMs f])
```

The TRACE method

Proxy servers can modify messages as they forward them: headers may be added, changed, or removed, and bodies may be converted to different formats. To diagnose proxy networks, we need a convenient way to observe how messages change as they hop through the HTTP proxy chain.

The HTTP/1.1 TRACE method lets you trace a request message through the proxy chain, see which proxies the message passes through, and see how each proxy modifies the request.

When a TRACE request reaches the destination server, the server echoes the entire request message back to the sender, wrapped in the body of an HTTP response. When the TRACE response arrives, the client can examine the exact message the server received and the list of proxies it traveled through (in the Via header).
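A hypothetical TRACE transaction through a single proxy might look like this (hostnames and the length are invented; the proxy's modification shows up as the Via header added to the echoed message):

```
TRACE /index.html HTTP/1.1
Host: www.joes-hardware.com

HTTP/1.1 200 OK
Content-Type: message/http
Content-Length: 87
Via: 1.1 proxy.example.com

TRACE /index.html HTTP/1.1
Host: www.joes-hardware.com
Via: 1.1 proxy.example.com
```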

Proxy authentication

Proxies can serve as access-control devices. HTTP defines a mechanism called proxy authentication that blocks requests for content until the user supplies valid access credentials to the proxy.

  • When a request for restricted content arrives at a proxy server, the proxy can return a 407 Proxy Authentication Required status code demanding credentials, together with a Proxy-Authenticate header field describing how to supply them.
  • When the client receives the 407 response, it tries to gather the required credentials, from a local database or by prompting the user.
  • Once it has the credentials, the client resends the request, supplying them in a Proxy-Authorization header field.
  • If the credentials are valid, the proxy passes the original request along the chain (see the sketch after this list); otherwise it sends another 407 reply.
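Sketched in wire format (the realm is made up, and the Proxy-Authorization value is the Base64 of a hypothetical "brian:hello" credential pair):

```
GET http://server.example.com/secret.html HTTP/1.0

HTTP/1.0 407 Proxy Authentication Required
Proxy-Authenticate: Basic realm="Secure Stuff"

GET http://server.example.com/secret.html HTTP/1.0
Proxy-Authorization: Basic YnJpYW46aGVsbG8=

HTTP/1.0 200 OK
```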

7. Caching

A Web cache is an HTTP device that automatically keeps copies of popular documents. When a Web request arrives at a cache that has a local "cached" copy, the document is served from local storage instead of from the origin server. Caching has the following advantages:

  • Caching reduces redundant data transfers, saving network costs.
  • Caching relieves network bottlenecks: pages load faster without more bandwidth.
  • Caching reduces demand on origin servers: servers respond faster and avoid overload.
  • Caching reduces distance delays, because pages load more slowly from farther away.

Hits and misses

No cache can hold a copy of every document in the world. When a copy of the requested document is found in the cache, it is called a cache hit. When no copy is available and the request is forwarded to the origin server, it is called a cache miss.

Cache hit rate: the fraction of requests served from the cache.

Byte hit rate: the fraction of all bytes transferred that were served from the cache.

Revalidation

The content on the origin server may change, so caches have to check now and then whether the copies they keep are still the most recent versions on the server. These freshness checks are called HTTP revalidation.

A cache can revalidate its copies whenever and however often it likes. But because caches often hold millions of documents, and because network bandwidth is scarce, most caches revalidate a copy only when a client requests it and the copy is old enough to warrant a check.

When a cache revalidates a cached copy, it sends a small revalidation request to the origin server. If the content has not changed, the server replies with a small 304 Not Modified response. Knowing the copy is still valid, the cache marks it temporarily fresh again and serves it to the client; this is called a revalidate hit, or a slow hit. It is slower than a pure cache hit because it must check with the origin server, but faster than a cache miss because no object data is fetched from the server.

HTTP provides several tools for revalidating cached objects, the most common being the If-Modified-Since header. Adding this header to a GET request tells the server to send the object only if it has been modified since the time the cached copy was made.

Here is what happens when a server receives a GET If-Modified-Since request:

  • Revalidate hit: if the object has not been modified, the server sends the client a small HTTP 304 Not Modified response.
  • Revalidate miss: if the object has been modified, the server sends the client an ordinary HTTP 200 OK response with the full content.
  • Object deleted: if the object has been deleted, the server sends back a 404 Not Found response, and the cache removes its copy.

Cache topologies

Private caches: private caches do not need much horsepower or storage, so they can be small and cheap. Web browsers have private caches built in; most browsers cache popular documents on the PC's disk and in memory, and let the user configure the cache size and settings.

Public proxy caches: a public cache is a special shared proxy server called a caching proxy server. Because a public cache serves requests from many users, it is better at reducing redundant traffic.

Cache hierarchies

How caches process requests

  1. Receiving: the cache reads the arriving request message from the network.
  2. Parsing: the cache parses the message, extracting the URL and the headers.
  3. Lookup: the cache checks whether a local copy is available; if not, it fetches one (and stores it locally).
  4. Freshness check: the cache checks whether the cached copy is fresh enough; if not, it asks the server about any updates.
  5. Response creation: the cache builds a response message from the new headers and the cached body.
  6. Sending: the cache sends the response back to the client over the network.
  7. Logging: optionally, the cache creates a log entry describing the transaction.

Keeping copies fresh

Document expiration

Servers specify an expiration date using the HTTP/1.0+ Expires header or the HTTP/1.1 Cache-Control: max-age response header, which accompanies the response body. The Expires and Cache-Control: max-age headers do essentially the same thing, but the newer Cache-Control header is preferred because it uses a relative age rather than an absolute date; absolute dates depend on computer clocks being set correctly.

The max-age value defines the maximum age of a document: the number of seconds, counted from when the document was first generated, during which the document can still be considered fresh.

```
Cache-Control: max-age=484200
```

Expires specifies an absolute expiration date; once that date has passed, the document is no longer fresh.

```
Expires: Fri, 05 Jul 2002 05:00:00 GMT
```

Server revalidation

The fact that a cached document has expired does not mean it actually differs from the document currently live on the origin server; it means only that it is time to check. This check is called server revalidation: the cache asks the origin server whether the document has changed.

  • If revalidation shows the content has changed, the cache fetches a new copy, stores it in place of the old one, and sends the document to the client.
  • If revalidation shows the content has not changed, the cache only needs new headers, including a fresh expiration date, and updates the headers it has stored.

Revalidation with conditional methods

HTTP lets a cache send a "conditional GET" to the origin server: the server sends back the object body only if the document differs from the copy the cache already holds. Freshness checking and object fetching are thereby combined into a single conditional GET. A conditional GET is initiated by adding special conditional headers to the GET request; the server returns the object only if the condition is true.

If-Modified-Since: date revalidation

  • If the document was modified after the specified date, the If-Modified-Since condition is true, and the GET succeeds normally. The new document is returned to the cache with new headers, including a new expiration date.
  • If the document was not modified since the specified date, the condition is false, and a small 304 Not Modified response is returned to the client, usually with a new expiration date.

The If-Modified-Since header works together with the Last-Modified server response header. The origin server attaches the last modification date to the documents it serves. When a cache revalidates a cached document, it includes an If-Modified-Since header carrying the date the cached copy was last modified:

```
If-Modified-Since: <cached last-modified date>
```

If the content was modified in the meantime, the last-modified date will differ and the origin server sends back the new document. Otherwise, the server notices that the cache's last-modified date matches the document's current last-modified date and returns a 304 Not Modified response.

If-None-Match: entity tag revalidation

Entity tags are arbitrary labels (quoted strings) attached to a document. They might contain a document serial number or version name, a checksum of the document's content, or some other fingerprint.

When a publisher changes a document, it can change the document's entity tag to denote the new version. If the entity tag has changed, the cache's If-None-Match conditional header fails, and the server returns a fresh copy of the document.

When to use entity tags versus last-modified dates

If a server sends back an entity tag, HTTP/1.1 clients must use the entity tag validator. If the server returns only a Last-Modified value, clients can use If-Modified-Since validation. If both an entity tag and a last-modified date are available, clients should use both revalidation schemes, so that both HTTP/1.0 and HTTP/1.1 caches respond correctly.
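A sketch of a cache's revalidation request carrying both validators, using Python's http.client (the host, path, date, and ETag are placeholder values remembered from an earlier response):

```python
import http.client

# Validators remembered from the cached copy (hypothetical values).
cached_last_modified = "Fri, 05 Jul 2002 05:00:00 GMT"
cached_etag = '"v2.6"'

conn = http.client.HTTPConnection("www.example.com")
conn.request("GET", "/document.html", headers={
    "If-Modified-Since": cached_last_modified,  # date revalidation
    "If-None-Match": cached_etag,               # entity-tag revalidation
})
resp = conn.getresponse()

if resp.status == 304:
    # Revalidate hit: refresh the stored headers; keep serving the cached body.
    print("Not Modified; copy is still fresh:", resp.getheader("Date"))
else:
    # Revalidate miss: replace the cached copy with the new body and headers.
    body = resp.read()
    print("Modified; got", len(body), "bytes, new ETag:", resp.getheader("ETag"))

conn.close()
```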

Controlling caching

no-store: a response marked no-store forbids the cache from making a copy of the response. A cache typically forwards a no-store response to the client, like a non-caching proxy would, and then deletes the object.

no-cache: a response marked no-cache can actually be stored in the local cache; the cache just may not serve it to a client until its freshness has been revalidated with the origin server.

max-age: the maximum age, in seconds, for which the document may be considered fresh.

Expires: an absolute expiration date.

must-revalidate: this response header tells the cache that it may not serve a stale copy of the object without first revalidating with the origin server. If the origin server is unavailable when the cache attempts the must-revalidate freshness check, the cache must return a 504 Gateway Timeout error.
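For example, a response carrying the following headers (values invented) could be served from cache for an hour, after which caches would have to revalidate it with the origin server before serving it:

```
HTTP/1.1 200 OK
Date: Fri, 05 Jul 2002 05:00:00 GMT
Content-Type: text/html
Cache-Control: max-age=3600, must-revalidate
```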

8. Gateways, tunnels, and relays

Gateways

Gateways act as the glue between resources and applications. A gateway can connect two or more applications that speak different protocols.

Server-side gateway: speaks HTTP with the client and some other protocol with the server (HTTP/*).
Client-side gateway: speaks some other protocol with the client and HTTP with the server (*/HTTP).

Protocol gateways

Traffic is directed to a gateway the same ways it is directed to a proxy: browsers can be explicitly configured to use a gateway, traffic can be intercepted transparently, or the gateway can be deployed as a surrogate (reverse proxy).

Normal HTTP traffic is unaffected and keeps flowing to the origin server, but a request for an FTP URL is sent, inside an HTTP request, to the gateway gw1.joes-hardware.com. The gateway performs the FTP transaction on the client's behalf and sends the result back to the client over HTTP.

Common gateway types:

  • HTTP/* server-side Web gateways: as requests flow toward the origin server, a server-side Web gateway converts client-side HTTP requests into some other protocol
  • HTTP/HTTPS server-side security gateways: an organization can encrypt all incoming Web requests at a gateway for extra privacy and security; clients browse with plain HTTP, but the gateway automatically encrypts the sessions
  • HTTPS/HTTP client-side security accelerator gateways: the gateway receives secure HTTPS traffic, decrypts it, and sends ordinary HTTP requests to the Web server; such gateways often contain dedicated decryption hardware that decrypts secure traffic far more efficiently than the origin server could, removing load from it

Resource gateways

Besides gateways that join clients and servers across a network, the most common form of gateway, the application server, combines the destination server and the gateway in a single server. An application server is a server-side gateway that speaks HTTP with the client and connects to application programs on the server side.

Clients connect to the application server over HTTP. But instead of sending back a file, the application server forwards the request, through a gateway application programming interface (API), to an application running on the server. The application handles the request and returns its result to the gateway; the gateway returns the response or response data to the server, which in turn forwards it to the client. The server and the gateway are independent of each other.

The first popular application gateway API was the Common Gateway Interface (CGI). CGI is a standardized set of interfaces a Web server can use to launch a program in response to an HTTP request for a particular URL, collect the program's output, and send it back in an HTTP response.

CGI
  • CGI was the first and is probably still the most widely used server extension.
  • CGI applications are independent of the server, so they can be written in almost any language, including Perl, Tcl, C, and various shells.
  • CGI is simple enough that almost every HTTP server supports it.
  • CGI processing is invisible to the user.
  • CGI also does a good job of shielding the server from buggy extensions.
  • Spawning a new process for every CGI request is expensive: it limits the performance of CGI-based servers and taxes the server machine's resources. To solve this problem, a new form of CGI was developed, aptly named FastCGI. It emulates CGI but runs as a persistent daemon, eliminating the cost of setting up and tearing down a process per request. A minimal CGI program is sketched after this list.
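A minimal CGI program in Python (a sketch: a server configured to run this script for a mapped URL would pass request details in environment variables). It writes headers, a blank line, and then the body to stdout, which the server relays in its HTTP response:

```python
#!/usr/bin/env python3
# Minimal CGI program: the Web server launches one process per request.
import os
import sys

# Request details arrive in environment variables set by the server.
method = os.environ.get("REQUEST_METHOD", "GET")
query = os.environ.get("QUERY_STRING", "")

# CGI output: headers, a blank line, then the entity body.
sys.stdout.write("Content-Type: text/plain\r\n")
sys.stdout.write("\r\n")
sys.stdout.write(f"Hello from CGI! method={method} query={query}\n")
```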
Server extension APIs

The CGI protocol provides a clean way for external interpreters to interface with stock HTTP servers, but what if you want to alter the behavior of the server itself, or simply squeeze every drop of performance out of it? For these needs, server developers provide server extension APIs: powerful interfaces that let Web developers hook their own modules directly into an HTTP server. Extension APIs let programmers graft their own code onto the server, or even replace an entire server component with their own code.

Web services

As Web application services have multiplied, HTTP has increasingly been used as a foundation for connecting applications together. One of the trickier problems in connecting applications is negotiating the protocol interface between them so they can exchange data; applications working together need to exchange information far more complex than HTTP headers can express.

As a result, Internet standards bodies developed a set of standards and protocols that allow Web applications to talk to one another.

What are Web services?

  1. Web-based services: a server provides resources for clients to access
  2. A cross-language, cross-platform set of standards and protocol specifications
  3. Cross-language, cross-platform communication and integration between applications

Web services can exchange information using XML over SOAP. XML (Extensible Markup Language) provides a way to create and interpret custom information about data objects. SOAP (Simple Object Access Protocol) is a standard way of adding XML information to HTTP messages.
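
As a rough illustration, the sketch below POSTs a SOAP 1.1 envelope over HTTP using Python’s standard http.client. The host, path, SOAPAction value, and message contents are all hypothetical, invented for the example:

import http.client

# Hypothetical SOAP request body: an XML envelope wrapping the call data.
envelope = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetPrice xmlns="http://example.com/catalog"><item>hammer</item></GetPrice>
  </soap:Body>
</soap:Envelope>"""

conn = http.client.HTTPConnection("www.example.com")
conn.request("POST", "/soap", body=envelope, headers={
    "Content-Type": "text/xml; charset=utf-8",   # SOAP 1.1 rides on text/xml
    "SOAPAction": '"http://example.com/GetPrice"',
})
print(conn.getresponse().status)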

Tunnels

Web tunnels

Web tunnels let applications send non-HTTP traffic over HTTP connections, piggybacking data from other protocols on HTTP. The most common reason to use Web tunnels is to embed non-HTTP traffic inside an HTTP connection so it can pass through firewalls that allow only Web traffic.

Web tunnels are established using HTTP’s CONNECT method. The CONNECT method asks a gateway to open a TCP connection to any port on the destination server. Once the tunnel is established, the gateway blindly forwards data between the client and the server.

When the client sends an HTTP CONNECT request to the gateway, it asks the gateway to establish a TCP connection to the target server. Once the TCP connection is set up, the gateway sends a reply to the client. This response reports the status of the TCP connection between the gateway and the target server; it is not a response to any request data. From then on, all communication between the client and the target server travels over the previously established TCP connection, and the gateway simply forwards bytes without caring what data it is forwarding.

The CONNECT request syntax is as follows:

CONNECT home.netscape.com:443 HTTP/1.0

This looks like an ordinary HTTP request line, except that the URL is replaced by a hostname, a colon, and a port number.

The response to CONNECT is like a normal HTTP response: a 200 status code is returned on success, except that the reason phrase is conventionally “Connection Established”.

HTTP/1.0 200 Connection Established
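
Python’s standard http.client can perform the CONNECT handshake through a proxy via set_tunnel; a minimal sketch, where the proxy address is hypothetical:

import http.client

# Connect to the proxy, then ask it to open a tunnel to the target host.
conn = http.client.HTTPSConnection("proxy.example.com", 8080)
conn.set_tunnel("home.netscape.com", 443)   # sends "CONNECT home.netscape.com:443"
conn.request("GET", "/")                    # travels through the established tunnel
print(conn.getresponse().status)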


SSL tunnels

SSL traffic is encrypted, so although a client can usually open an SSL connection directly to port 443, encrypted SSL traffic cannot be forwarded by a traditional HTTP firewall proxy. In that case, a tunnel can carry the SSL traffic over an HTTP connection, allowing it to pass through an HTTP-only firewall on port 80.

Comparison between SSL tunnels and HTTP/HTTPS gateways

As mentioned earlier, a client can hold an SSL session with a server either through an HTTP/HTTPS gateway or through a tunnel. What advantages do SSL tunnels have over HTTP/HTTPS gateways?

Disadvantages of HTTP/HTTPS gateways:

  • The connection between the client and the gateway is an ordinary HTTP connection, so that leg of the transport is not secure

  • The client cannot perform SSL client authentication with the remote server, because the gateway is the authenticated principal

  • The gateway needs to support a full SSL implementation

  • The gateway has the opportunity to snoop on the communication between the client and the target server, and to tamper with the data

Advantages of SSL tunnels:

An SSL session is established directly between the client and the server; the proxy in the middle merely tunnels the encrypted data. It therefore does not need to implement SSL at all, since it only forwards bytes.

Tunnel authentication

Proxy authentication can be combined with tunnels to validate a client’s right to use the tunnel.

Tunnel security considerations

In general, the tunnel gateway cannot verify that the protocol currently in use is the one it was intended to tunnel over. So, for example, a rogue user might pass Internet game traffic over a company’s firewall through a tunnel intended for SSL, while a malicious user might use the tunnel to open a Telnet session, or use the tunnel to bypass a company’s E-mail scanner to send E-mail. To reduce tunnel abuse, gateways should only open tunnels for certain well-known ports, such as HTTPS port 443.

Relays

An HTTP relay is a simple HTTP proxy that does not fully adhere to the HTTP specification. A relay handles just enough HTTP to establish a connection, and then blindly forwards bytes.

The advantage of relays is their simplicity: they are worth considering when you just need a simple proxy for filtering, diagnostics, or content transformation. However, because of their blind-forwarding nature they share a common problem: they can hang keep-alive connections, because they cannot handle the Connection header properly.

9. Web robots

Crawlers and crawling

Web crawlers are robots that recursively traverse various informational Web sites, fetching the first Web page, then all the Web pages that that page points to, then all the Web pages that those pages point to, and so on. Robots that recursively follow these Web links “crawl” along the Web created by HTML hyperlinks, so they are called crawlers or spiders.

The root set

The initial set of URLs that a crawler starts from is called the root set. When choosing a root set, you should pick URLs from enough different sites that crawling all the links will eventually reach most of the Web pages you are interested in. In general, a good root set includes a few large popular Web sites, a list of newly created pages, and a list of obscure pages that do not often get linked to.

The loop

As a robot crawls the Web, it must take special care not to get caught in a loop, or cycle. Robots must know where they have been to avoid loops, which create robot traps that stall or slow a robot’s progress. Because of the huge number of URLs, the data structure a crawler uses to record visited links must be efficient in both access speed and memory use, so it can quickly determine whether a link has already been visited. Here are some techniques large-scale Web crawlers use to manage the addresses they visit:

  • Trees and hash tables
  • Lossy presence bitmaps
  • Checkpoints: save the list of visited URLs to disk, in case the crawler crashes
  • Partitioning: some large Web robots use robot “clusters”, in which each robot is a separate computer and they work in concert; each robot is assigned a particular “slice” of URLs that it is responsible for crawling

Loops can arise in several ways:

  • URL aliases can cause loops
  • Filesystem link cycles (for example, symbolic links)
  • Dynamic virtual Web spaces

Avoiding loops

  • Canonicalize URLs (a minimal normalization sketch follows this list):

    • If no port is specified, add :80 to the hostname.
    • Convert all %xx escape characters into their character equivalents.
    • Remove the # fragment.
  • Breadth-first crawling

  • Throttling: limit the number of pages a robot can fetch from a Web site in a period of time. If the robot falls into a loop and keeps trying to access a site’s aliases, throttling can also cap the total number of duplicate pages fetched and the total accesses to the server.

  • Limit URL size: a robot may refuse to crawl URLs beyond a certain length (typically 1 KB).

  • URL/site blacklists

  • Pattern detection

  • Content fingerprinting

  • Human monitoring
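
As promised above, here is a minimal URL canonicalization sketch in Python; the exact rules a real crawler applies vary, and the helper name normalize_url is ours:

from urllib.parse import urlsplit, urlunsplit, unquote

def normalize_url(url: str) -> str:
    # Apply the rules above: default port, %xx decoding, fragment removal.
    parts = urlsplit(url)
    host = parts.hostname or ""
    port = parts.port or 80                 # add :80 if no port is given
    path = unquote(parts.path) or "/"       # convert %xx escapes to characters
    return urlunsplit((parts.scheme, f"{host}:{port}", path, parts.query, ""))

print(normalize_url("HTTP://www.Joes-Hardware.com/%7Efred#tools"))
# -> http://www.joes-hardware.com:80/~fred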

HTTP for Web robots

Request headers

  • User-Agent: tells the server the name of the robot making the request

  • From: provides the E-mail address of the robot’s user/administrator

  • Accept: tells the server which media types the robot can accept, helping ensure that the robot receives only content it is interested in

  • Referer: provides the URL of the document that contained the currently requested URL

  • Host: specifies the hostname the robot is visiting. With virtual hosting widespread, requests that omit the Host header can cause a robot to associate the wrong content with a particular URL

Processing of responses

  • A robot should be able to handle at least the common status codes
  • Besides the information embedded in the HTTP headers, a robot should also look for information in the entity body

Misbehaving robots

  • Runaway robots: robots make HTTP requests much faster than human Web surfers, and they often run on fast computers with fast Internet links. If a robot has a programming logic error, or gets caught in a loop, it can throw intense load at a Web server, possibly overloading it and denying service to everyone else. All robot writers must take extra care to design protections against runaway robots.
  • Stale URLs: some robots visit lists of URLs, and those lists may be old. If a Web site has changed a lot of its content, robots may request large numbers of URLs that no longer exist. This irritates the administrators of some Web sites, who do not like their error logs filling with requests for nonexistent documents, or the overhead of serving error pages reducing their Web servers’ capacity.
  • Long, incorrect URLs: because of loops and programming errors, robots may request large, meaningless URLs from Web sites. If the URLs are long enough, they can degrade Web server performance, clutter Web server access logs, and even crash some of the more fragile Web servers.
  • Robots that access sensitive data
  • Dynamic gateway access

Denying robots access

The idea behind robots.txt is simple: any Web server can provide an optional file named robots.txt in its document root, describing which parts of the server robots may access. A robot that follows this voluntary standard will request the robots.txt file from a Web site before visiting any other resource on that site.

Fetching the robots.txt file

A robot fetches the robots.txt resource using HTTP’s GET method, just like any other resource on the Web server. If a robots.txt file exists, the server returns it in a text/plain body. If the server responds with a 404 Not Found HTTP status code, the robot can assume the server has no robot access restrictions, and it may request any file.

Robots should send identifying information in the From and User-Agent headers, to help site administrators track robot visits and to provide contact information in case the administrator needs to inquire or complain about the robot.

GET /robots.txt HTTP/1.0
Host: www.joes-hardware.com
User-Agent: Slurp/2.0

Many Web sites do not have a robots.txt resource, but a robot does not know that in advance; it must attempt to fetch robots.txt from every site. The robot then takes different actions depending on the result of the robots.txt retrieval:

  • If the server responds with success (HTTP status 2xx), the robot must parse the content and apply its exclusion rules when fetching from that site.
  • If the server responds that the resource does not exist (HTTP status 404), the robot can assume the server has no exclusion rules and that access to the site is not restricted by robots.txt.
  • If the server responds with an access restriction (HTTP status 401 or 403), the robot should assume that access to the site is completely restricted.
  • If the request fails temporarily (HTTP status 503), the robot should postpone visiting the site until the resource can be retrieved.
  • If the server responds with a redirect (HTTP status 3xx), the robot should follow the redirect until the resource is found.

robots.txt file format

The robots.txt file uses a very simple, line-oriented syntax. There are three kinds of lines in a robots.txt file: blank lines, comment lines, and rule lines. Rule lines look like HTTP headers and are used for pattern matching.

# this robots.txt file allows Slurp & Webcrawler to crawl
# the public parts of our site, but no other robots...

User-Agent: slurp
User-Agent: webcrawler
Disallow: /private

User-Agent: *
Disallow:

The lines in a robots.txt file are logically separated into “records”. Each record describes a set of exclusion rules for a particular set of robots, so different robots can be given different exclusion rules. Each record consists of a set of rule lines, terminated by a blank line or end-of-file. A record starts with one or more User-Agent lines, specifying which robots are affected by the record, followed by Disallow and Allow lines saying which URLs those robots may access.

  1. The User-Agent lines

Each robot record begins with one or more User-Agent lines, of the form User-Agent: <robot-name> or User-Agent: *. When a robot processes a robots.txt file, it must obey the record whose:

  • <robot-name> is a case-insensitive substring of the robot’s name; or
  • <robot-name> is the wildcard *

If the robot cannot find a matching record, access is unrestricted.

  2. The Disallow and Allow lines

Disallow and Allow lines come immediately after a record’s User-Agent lines. They describe which URL paths are explicitly forbidden or explicitly allowed for the specified robots.

A robot must match every URL it intends to visit against all the Disallow and Allow rules in the applicable exclusion record, in order, and use the first match found. If no match is found, the URL is allowed.

  3. Disallow/Allow prefix matching
  • Disallow and Allow rules use case-sensitive prefix matching.
  • Before comparison, all “escaped” characters (%xx) in the rule path or the URL path are decoded back to bytes (except for the forward slash, %2F, which must match exactly).
  • If the rule path is the empty string, it matches everything.
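
Python’s standard library includes a parser implementing these rules. A minimal sketch, fed the example file shown earlier:

from urllib.robotparser import RobotFileParser

# Parse the example robots.txt, then test URLs against its rules.
rp = RobotFileParser()
rp.parse([
    "User-Agent: slurp",
    "User-Agent: webcrawler",
    "Disallow: /private",
    "",
    "User-Agent: *",
    "Disallow:",
])
print(rp.can_fetch("slurp", "http://www.joes-hardware.com/private/pricing"))  # False
print(rp.can_fetch("slurp", "http://www.joes-hardware.com/index.html"))       # True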

Caching and robots.txt expiration

If a robot re-fetched robots.txt before every file access, it would double the load on Web servers and make the robot less efficient. Instead, robots fetch the robots.txt file periodically and cache the result, using the cached copy until it expires. Both the origin server and the robot use standard HTTP cache-control mechanisms to control the caching of the robots.txt file.

HTML robot-control META tags

The robots.txt file lets site administrators exclude robots from some or all of a Web site’s content. One drawback of the robots.txt file is that it belongs to the site administrator, not to the authors of individual pages.

HTML page authors have a more direct way to restrict robots’ access to individual pages: they can add robot-control tags directly to HTML documents. Robots that honor the robot-control HTML tags can still retrieve a document, but if a robot exclusion tag is present they will disregard it.

Robot exclusion tags are implemented in the form of the following, using HTML META tags:

<META NAME="ROBOTS" CONTENT="NOINDEX">

Common robot-control META directives:

  • NOINDEX: tells the robot not to process the page’s content and to disregard the document
  • NOFOLLOW: tells the robot not to crawl any of the page’s outgoing links
  • INDEX: tells the robot the page’s content may be indexed
  • FOLLOW: tells the robot it may crawl the page’s outgoing links
  • NOARCHIVE: tells the robot it should not cache a local copy of the page
  • ALL: equivalent to INDEX, FOLLOW
  • NONE: equivalent to NOINDEX, NOFOLLOW

Other META tag directives:

  • DESCRIPTION: lets authors define a short text summary describing their Web page.
  • KEYWORDS: a comma-separated list of keywords describing the Web page, useful for keyword searches.

10. HTTP-NG

11. Client identification and cookies

HTTP began as an anonymous, stateless request/response protocol: a server processes a request from a client and sends back a response. A Web server has little information with which to determine which user sent a request, and no way to record a sequence of requests from a visiting user. Modern Web sites, however, want to provide a personalized touch: they want to know more about the users at the other end of the connection and to track them as they browse.

HTTP headers that carry user-related information

Header            Type                  Description
From              Request               User’s E-mail address
User-Agent        Request               User’s browser software
Referer           Request               Page the user followed a link from
Authorization     Request               Username and password
Client-IP         Extension (request)   Client’s IP address
X-Forwarded-For   Extension (request)   Client’s IP address
Cookie            Extension (request)   Server-generated ID label

The From header contains the user’s E-mail address. Since every user has a distinct E-mail address, it could, ideally, serve as a viable source of user identification. But few browsers send From headers, for fear of unscrupulous servers collecting E-mail addresses for spam. In practice, From headers are sent mostly by automated robots and spiders, so webmasters have somewhere to send angry complaints when something goes wrong.

The User-Agent header tells the server about the browser the user is using, including the program’s name and version, and often information about the operating system. It is useful for tailoring content to interoperate well with particular browsers and their attributes, but it does little to help identify a particular user.

The Referer header provides the URL of the page the user came from. The Referer header alone cannot identify a user, but it does reveal which page the user previously visited, giving a better sense of the user’s browsing behavior and interests.

From, User-Agent, and Referer headers are not sufficient for reliable identification.

IP address of the client

Early Web pioneers experimented with client IP addresses as a form of identity. However, using the client IP address to identify users has many disadvantages that limit its effectiveness as a user identification technology.

  • The client IP address describes the machine being used, not the user. If multiple users share the same computer, it is impossible to distinguish between them.
  • Many Internet service providers dynamically assign IP addresses to users when they log on. Each time a user logs in, he or she gets a different address, so the Web server cannot assume that an IP address identifies the user between login sessions.
  • To improve security and manage scarce address resources, many users browse through Network Address Translation (NAT) firewalls. These NAT devices hide the IP addresses of the real clients behind the firewall, translating the real client IP address into a shared firewall IP address (with a distinct port number).
  • HTTP proxies and gateways typically open new TCP connections to the origin server, so the Web server sees the proxy server’s IP address instead of the client’s. Some proxies work around this by adding a special Client-IP or X-Forwarded-For header carrying the original IP address, but not all proxies support this behavior.

User login

Instead of passively guessing a user’s identity based on his IP address, the Web server can explicitly ask who the user is by asking for authentication (login) with a username and password.

To make logging in to Web sites easier, HTTP includes a built-in mechanism for passing user identity to Web sites, using the WWW-Authenticate and Authorization headers.

However, logging into multiple Web sites can be tedious. When browsing from one site to another, you need to log in at each site.

Cookie

Cookies are currently the best way to identify users and implement persistent sessions.

The cookie contains one or more key-value pairs like name=value.

Session cookies: temporary cookies that record a user’s settings and preferences while visiting a site. Session cookies are deleted when the user closes the browser.

Persistent cookies: live longer; they are stored on disk and survive browser exits and computer restarts.

The only thing that distinguishes session cookies from persistent cookies is their expiration time.

How do cookies work?

The first time a user visits a Web site, the server knows nothing about the user. Hoping to recognize the user on a return visit, the server assigns the user a unique cookie and returns it via the Set-Cookie response header. The browser remembers the contents of the Set-Cookie header and stores the cookie in its cookie database. When the user visits the site again in the future, the browser retrieves the cookie from its database and sends it back to the server in a Cookie request header.

Cookies can contain any information, but they usually contain only a unique identifier generated by a server for tracking purposes.

Cookies version 0 (Netscape)

The original cookie specification defined the Set-Cookie response header, the Cookie request header, and the fields used to control cookies. Version 0 cookies look like this:

Set-Cookie: name=value [; expires=date] [; path=path] [; domain=domain] [; secure]

Cookie: name1=value1 [; name2=value2] ...

The Set-Cookie header

Set-Cookie starts with a mandatory cookie name and cookie value, optionally followed by cookie attributes, separated by semicolons.

Attribute    Description
NAME=VALUE   Mandatory. NAME and VALUE are sequences of characters excluding semicolons, commas, equals signs, and spaces, unless enclosed in double quotes.
Expires      Optional. Specifies a date string defining the cookie’s lifetime; past that date, the cookie is no longer stored or sent.
Domain       Optional. The browser sends the cookie only to server hostnames in the specified domain.
Path         Optional. Lets you attach cookies to particular documents on a server; a cookie is attached if the Path attribute is a prefix of the URL path.
Secure       Optional. If present, the cookie is sent only when HTTP uses an SSL secure connection.
The Cookie header

When a client sends a request, it sends all unexpired cookies matching the site’s domain, path, and secure filters to the site.
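
A minimal sketch of producing and parsing version 0 cookies with Python’s standard http.cookies module; the cookie names and values are invented for illustration:

from http.cookies import SimpleCookie

# Server side: build a Set-Cookie header for the response.
c = SimpleCookie()
c["id"] = "34294"
c["id"]["domain"] = "joes-hardware.com"
c["id"]["path"] = "/"
print(c.output())    # emits a "Set-Cookie: id=34294; ..." header line

# Client side: parse the Cookie header sent back on a later request.
incoming = SimpleCookie("id=34294; cart=hammer")
print(incoming["cart"].value)    # -> hammer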

Version 1 cookies (RFC 2965)

RFC 2965 defines an extended version of cookies. This version 1 standard introduces the Set-Cookie2 and Cookie2 headers, and it interoperates with version 0 systems. Its major changes include the following:

  • Associating descriptive text with each cookie, explaining its purpose.
  • Allowing cookies to be forcibly destroyed when the browser exits, regardless of expiration time.
  • Expressing cookie lifetime with a relative Max-Age in seconds, instead of an absolute date.
  • Controlling cookies by URL port number, not just by domain and path.
  • Echoing back the domain, port, and path filters (if any) via the Cookie header.
  • A version number for interoperability.
  • A $ prefix in the Cookie header to distinguish additional keywords from cookie names.
Attributes added by the Set-Cookie2 header
Attribute    Description
Version      Mandatory. An integer corresponding to the version of the cookie specification; RFC 2965 is version 1.
Comment      Optional. Describes how the server intends to use the cookie; the user can inspect this policy to decide whether to permit sessions with this cookie. The value must be UTF-8 encoded.
CommentURL   Optional. A URL pointing to a document that details the cookie’s purpose and policy; the user can consult it to decide whether to permit sessions with this cookie.
Discard      Optional. If present, instructs the client to discard the cookie when the client program terminates.
Max-Age      The cookie’s lifetime, in seconds. Clients should compute the cookie’s age using the HTTP/1.1 age-calculation rules; when a cookie’s age exceeds Max-Age, the client should discard it. A value of zero means the cookie should be discarded immediately.
Port         Optional. May stand alone as a keyword or carry a comma-separated list of ports to which the cookie may be applied. With a port list, the cookie is served only to servers whose port matches one in the list; the keyword Port alone restricts the cookie to the port of the currently responding server.
The Cookie2 header

Version 1 cookies carry back extra information about each transmitted cookie, describing the filters each cookie passed. Each matching cookie must include any Domain, Port, or Path attribute from the corresponding Set-Cookie2 header.

Set-Cookie2: ID="29046"; Domain=".joes-hardware.com"
Set-Cookie2: color=blue
Set-Cookie2: support-pref="L2"; Domain="customer-care.joes-hardware.com"
Set-Cookie2: Coupon="hammer027"; Version="1"; Path="/tools"
Set-Cookie2: Coupon="handvac103"; Version="1"; Path="/tools/cordless

Cookie2: $Version="1";
        ID="29046"; $Domain=".joes-hardware.com";
        color="blue";
        Coupon="hammer027"; $Path="/tools";
        Coupon="handvac103"; $Path="/tools/cordless
Copy the code
The header of version 1 Cookie2 is negotiated with version 1

The Cookie2 request header negotiates interoperability between clients and servers that understand different versions of the cookie specification. The Cookie2 header tells the server that the user agent understands new-style cookies and gives the version of the cookie standard it supports (Cookie-Version would have been a more appropriate name):

Cookie2: $Version="1"

If the server understands new-style cookies, it recognizes the Cookie2 header and sends Set-Cookie2 (rather than Set-Cookie) response headers. If a client gets both a Set-Cookie and a Set-Cookie2 header in the same response, it ignores the old Set-Cookie header.

If a client supports both version 0 and version 1 cookies but receives a version 0 Set-Cookie header from the server, it sends cookies with the version 0 Cookie header. However, the client should also send Cookie2: $Version="1" to tell the server it can upgrade.

Cookies and session tracking

Cookies, security, and privacy

Cookies themselves are not believed to be a huge security threat: much of the same tracking could be accomplished through log analysis or other means anyway.

But there is always potential for abuse, so it is best to be careful where privacy and user tracking are concerned. The biggest misuse is third-party Web sites using persistent cookies to track users. Combined with IP addresses and Referer headers, these cookies let marketing companies build fairly accurate profiles of users and their browsing patterns.

12. Basic authentication mechanism

Authentication

HTTP challenge/response authentication framework

When a Web application receives an HTTP request, the server, instead of acting on the request, can respond with an “authentication challenge”, asking the user to prove his identity by supplying some secret information.

When the user repeats the request, he attaches his secret credentials (username and password). If the credentials do not match, the server can challenge the client again or generate an error. If the credentials match, the request completes normally.

Authentication protocol and headers

Step           Header                Method/Status      Description
Request        (none)                GET                The first request carries no authentication information.
Challenge      WWW-Authenticate      401 Unauthorized   The server rejects the request with a 401 status, indicating that the user must supply a username and password. The server may partition resources into security realms, each with its own password, so it names the realm to be accessed and the authentication algorithm in the WWW-Authenticate header.
Authorization  Authorization         GET                The client reissues the request, this time with an Authorization header specifying the authentication algorithm, username, and password.
Success        Authentication-Info   200 OK             If the credentials are correct, the server returns the document. Some authentication algorithms return additional information about the authenticated session in the optional Authentication-Info header.

Basic authentication

A basic authentication example

Base-64 encoding

HTTP basic authentication packs the username and password together, separated by a colon, and encodes them using Base-64.

Base-64 encoding can take data represented by binary strings, text, and international characters and temporarily convert them into a portable alphabet for transmission. The original string can then be decoded remotely without worrying about transmission errors.

Base-64 encoding is useful when usernames or passwords contain international characters or other characters that are illegal in HTTP headers (such as quotation marks, colons, and carriage returns/linefeeds). Also, Base-64 encoding scrambles usernames and passwords just enough to keep administrators from accidentally seeing them while managing servers and networks.
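
A quick sketch of the encoding with Python’s standard base64 module; the credentials are invented for illustration:

import base64

# Join username and password with a colon, then Base-64 encode the pair.
credentials = base64.b64encode(b"brian-totty:Ow!").decode("ascii")
print("Authorization: Basic " + credentials)
# -> Authorization: Basic YnJpYW4tdG90dHk6T3ch

# Decoding is trivial, which is why basic auth is not real security.
print(base64.b64decode(credentials))   # -> b'brian-totty:Ow!'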

Proxy authentication

Intermediate proxy servers can also implement authentication. Some organizations use proxy servers to authenticate users before letting them access servers, LANs, or wireless networks. Access policies can be managed centrally on a proxy server, making it a convenient place to provide unified access control over an organization’s internal resources. The first step in that process is proxy authentication.

Web server             Proxy server
401 Unauthorized       407 Proxy Authentication Required
WWW-Authenticate       Proxy-Authenticate
Authorization          Proxy-Authorization
Authentication-Info    Proxy-Authentication-Info

Defects in basic authentication

  1. Basic authentication sends the username and password across the network in a form that is trivially decoded. In effect, the password is sent in the clear, readable and capturable by anyone. Base-64 encoding does hide the password from friendly eyes glancing at network traffic, but the encoding is easily reversed, even by hand with pen and paper in seconds, so a Base-64-encoded password is effectively plaintext. If there is any chance that motivated third parties will intercept the username and password sent by basic authentication, send all HTTP transactions over an SSL-encrypted channel, or use a more secure protocol such as digest authentication.
  2. Even if the password were encrypted in some harder-to-decode way, a third party could still capture the garbled username and password and replay them to the origin server over and over to gain access. Basic authentication does nothing to prevent these replay attacks.
  3. Even when basic authentication guards relatively unimportant things, such as access to a corporate intranet or personalized content, bad habits make it dangerous. Many users, frustrated by the number of password-protected services, reuse the same username and password across them. A crafty villain could capture a plaintext username and password from a free Internet mail site, and then discover that the same username and password open an important online banking site!
  4. Basic authentication offers no protection against proxies or other intermediaries acting as middlemen: ones that leave the authentication headers intact but modify the rest of the message, drastically changing the nature of the transaction.
  5. Basic authentication is easily spoofed by a fake server. If a user can be convinced that he is connecting to a legitimate host protected by basic authentication, when in fact he is connecting to a hostile server or gateway, the attacker can request the password, store it for later use, and then fabricate an error message.

Basic authentication is made more secure by combining it with encrypted data transmission, such as SSL, to hide user names and passwords from malicious users. This is a common technique.

13. Digest authentication

Basic authentication is convenient and flexible but utterly insecure: usernames and passwords travel in the clear, and nothing prevents message tampering. The only way to use basic authentication safely is in conjunction with SSL. Digest authentication is compatible with basic authentication, but far more secure.

Improvements of digest authentication

Digest authentication is another HTTP authentication protocol, one that tries to fix the serious flaws of basic authentication. In particular, digest authentication improves matters as follows.

  • Passwords are never sent over the network in clear text.
  • Prevents malicious users from capturing and replaying the authentication handshake process.
  • You can selectively prevent packet content tampering.
  • Defend against several other common attacks.

Digest authentication is not the most secure protocol, however. It does not satisfy many of the requirements of secure HTTP transactions; for those, Transport Layer Security (TLS) and HTTPS are more appropriate protocols.

Protect passwords with digests

The motto of digest authentication is “never send the password across the network”. Instead of sending the password, the client sends a “fingerprint” or “digest” of the password, an irreversible scrambling of it. The client and the server both know the password, so the server can verify that the digest matches the password. Given only the digest, there is no way to discover the password it came from, short of trying every possible password!

One-way digests

A digest is a “condensation of a body of information”. Digests act as one-way functions, turning an infinite space of possible inputs into a finite range of condensed outputs. A popular digest function, MD5, converts an arbitrarily long sequence of bytes into a 128-bit digest.

What matters most about these digests is that, without knowing the password, it is very hard to guess the correct digest to send to the server. Likewise, given a digest, it is very hard to determine which of the countless possible inputs produced it.

The 128-bit digest output by MD5 is usually written as 32 hexadecimal characters, each representing four bits.

Digest functions are sometimes referred to as encrypted checksums, one-way hash functions, or fingerprint functions.
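
A quick sketch with Python’s standard hashlib; the input string is invented for illustration:

import hashlib

# MD5 condenses any byte sequence into a fixed 128-bit digest,
# conventionally written as 32 hexadecimal characters (4 bits each).
print(hashlib.md5(b"Ow!").hexdigest())
print(len(hashlib.md5(b"Ow!").hexdigest()))   # -> 32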

Using nonces to prevent replay attacks

Simply hiding the password is not enough: even without knowing the password, a malicious party could intercept the digest and replay it to the server over and over. A digest is as good as the password itself.

To prevent such replay attacks, the server can hand the client a special token called a nonce, which changes frequently (perhaps every millisecond, or with every authentication). The client mixes this nonce token into the password before computing the digest.

The word nonce roughly means “for this occasion” or “temporary”. In computer security, a nonce captures a particular point in time and folds it into the security calculation.

Mixing a nonce into the password makes the digest change each time the nonce changes. A recorded password digest is valid only for one particular nonce value, so without the password an attacker cannot compute the correct digest, which defeats replay attacks.

Digest authentication requires the use of nonces, because this small replay weakness would otherwise make nonce-less digest authentication effectively as weak as basic authentication. Nonces are passed from server to client in the WWW-Authenticate challenge.

The digest authentication handshake

A simple digest authentication transaction

Parameter overview

  • WWW-Authenticate: defines which scheme (Basic, Digest, Bearer, and so on) is used to authenticate access to the protected resource
  • realm: names the security realm of the protected documents on the Web server (for example, a corporate financial realm versus an employee-information realm), telling the user which realm’s username and password are required
  • qop: quality of protection, either auth (the default, authentication only) or auth-int (authentication plus message integrity checking). It may be omitted, but omitting it is not recommended
  • nonce: a random token the server sends to the client in the challenge. It changes frequently and is folded into the client’s password digest, so that the same user’s password digest differs from one authentication to the next, preventing replay attacks
  • nc: the nonce count, a hexadecimal value giving the number of requests the client has sent with the same nonce. For example, the first request in a session sends nc=00000001. The server keeps its own copy of this counter to detect replayed requests
  • cnonce: the client nonce, an opaque string value supplied by the client and used by both client and server to avoid sending anything in plaintext. It lets each party verify the other’s identity and gives some protection for message integrity
  • response: a string computed by the user agent software, proving that the user knows the password
  • Authentication-Info: returns some additional information about the authenticated session
  • nextnonce: the next server nonce, letting the client send the correct digest preemptively
  • rspauth: a response digest the client uses to authenticate the server
  • stale: when the nonce used in a password digest has expired, the server can return a 401 response carrying a fresh nonce and stale=true, telling the client to retry with the new nonce rather than re-prompting the user for username and password

Digest calculations

The heart of digest authentication is a one-way digest of a mix of public information, secret information, and a time-limited nonce.

The digest is calculated from the following three components:

  • A pair of functions: a one-way hash function H(d) and a digest function KD(s, d), where s stands for secret and d for data.
  • A chunk of data containing security information, called A1, including the username, password, protection realm, and nonces. A1 involves only security information; it has nothing to do with the underlying message.
  • A chunk of data containing non-secret attributes of the request message, called A2, such as the URL, the request method, and the message entity body. A2 protects against tampering with the method, the resource, or the message.

H and KD process the two pieces of data, A1 and A2, to produce a digest.

(A note on the function KD(s, d): in practice, s and d have no independent meaning. The book says s stands for the secret and d for the data, but in the actual calculation they are simply joined with a colon and digested. Both H and KD are digest functions, usually MD5.)

RFC 2069 digest calculation

H(A1) = MD5(A1) = MD5(username:realm:password)

H(A2) = MD5(A2) = MD5(method:uri)

Response = KD(H(A1):<nonce>:H(A2)) = MD5(MD5(A1):<nonce>:MD5(A2))

RFC 2617 digest calculation

H(A1) = MD5(A1) = MD5(username:realm:password)

If the value of qop is auth or not specified then:

H(A2) = MD5(A2) = MD5(method:uri)

If qop is auth-int, then:

H(A2) = MD5(A2) = MD5(method:uri:MD5(entityBody))

If the value of qop is auth or auth-int then:

Response = MD5(MD5(A1):<nonce>:<nc>:<cnonce>:<qop>:MD5(A2))

If the value of qop is not specified then:

Response = KD(H(A1):<nonce>:H(A2)) = MD5(MD5(A1):<nonce>:MD5(A2))
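
A minimal sketch of the RFC 2617 calculation for qop="auth" using Python’s hashlib; all parameter values here are invented for illustration:

import hashlib

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()

# Invented example parameters.
username, realm, password = "bob", "joes-hardware", "Ow!"
method, uri = "GET", "/tools/index.html"
nonce, nc, cnonce, qop = "41beef23", "00000001", "0a4f113b", "auth"

HA1 = md5_hex(f"{username}:{realm}:{password}")    # H(A1)
HA2 = md5_hex(f"{method}:{uri}")                   # H(A2) when qop="auth"
response = md5_hex(f"{HA1}:{nonce}:{nc}:{cnonce}:{qop}:{HA2}")
print(response)   # value for the response="" field of the Authorization header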

Preemptive authorization

The server can tell the client the next nonce in advance, so the client can directly generate a correct Authorization header, avoiding repeated request/challenge cycles. There are three common approaches:

  • The server sends the next nonce (nextnonce) ahead of time in the Authentication-Info success header. Although this mechanism speeds up transactions, it breaks the ability to pipeline multiple requests to the same server, which can be a high price to pay.
  • The server allows the same nonce to be reused for a short window: either for a limited period of time or for a limited number of uses, returning stale=true once it expires. This does reduce security, but the lifetime of a reused nonce is controllable, and a balance must be struck between security and performance.
  • The client and server use synchronized, predictable nonce-generation algorithms.

Nonce selection

RFC 2617 suggests this hypothetical nonce formula:

BASE64(timestamp MD5(timestamp ":" ETag ":" private-key))

Here, timestamp is the time (or another non-repeating value) at which the server generated the nonce, ETag is the value of the HTTP ETag header for the requested entity, and private-key is a secret known only to the server.
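
A sketch of that formula in Python; the ETag and private key here are invented, and a real server would substitute its own values:

import base64, hashlib, time

timestamp = str(int(time.time()))    # or any other non-repeating value
etag = '"v1.27"'                     # hypothetical ETag of the requested entity
private_key = "server-secret"        # hypothetical key known only to the server

digest = hashlib.md5(f"{timestamp}:{etag}:{private_key}".encode()).hexdigest()
nonce = base64.b64encode(f"{timestamp} {digest}".encode()).decode()
print(nonce)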

Practical issues that should be considered

Multiple challenges

A server can issue multiple challenges for a resource. For example, if a server does not know a client’s capabilities, it may offer both basic and digest authentication challenges. Faced with multiple challenges, the client must respond with the strongest challenge mechanism it supports.

Error handling

In digest authentication, if a directive or its value is improper, or if a required directive is missing, the proper response is 400 Bad Request.

If the request digest does not match, a failed login should be logged; a string of consecutive failures from a client may indicate an attacker guessing passwords.

The authenticating server must ensure that the resource named by the uri directive is the same as the resource named in the request line; if they differ, the server should return a 400 Bad Request error. (Since this may be a symptom of an attack, server designers may want to consider logging such errors.) The uri field duplicates the contents of the request URL in order to cope with intermediate proxies, which may alter the client’s request in transit; that altered (but semantically equivalent) request could yield a different digest than the one computed by the client.

Protection spaces

The realm value, combined with the canonical root URL of the server being accessed, defines the protection space.

The realm lets a server’s protected resources be partitioned into a set of protection spaces, each with its own authentication mechanism and/or authorization database. The realm value is a string, generally assigned by the origin server, that may carry extra semantics specific to the authentication scheme. Note that there may be multiple challenges with the same authentication scheme but different realms.

The protection space determines the area over which credentials can be applied automatically. If a previous request was authorized, the same credentials may be reused for all other requests within that protection space for a period of time, as determined by the authentication scheme, parameters, and/or user preference. Unless the authentication scheme says otherwise, a single protection space does not extend beyond the scope of its server.

How the protection space is determined depends on the authentication mechanism.

  • In basic authentication, the client assumes that all paths at or below the request URI lie within the same protection space as the current challenge. A client can preemptively supply credentials for resources in this space without waiting for another challenge from the server.
  • In digest authentication, the challenge’s WWW-Authenticate: domain field gives a more precise definition of the protection space. The domain field is a quoted, space-separated list of URIs. All the URIs in the domain list, and all URIs logically beneath those prefixes, are assumed to lie in the same protection space. If the domain field is absent or empty, all URIs on the challenging server are in the protection space.

Rewriting URIs

A proxy can rewrite a URI, changing its syntax without changing the resource it describes. For example:

  • hostnames can be normalized or replaced with IP addresses;
  • embedded characters can be replaced with “%” escapes;
  • attributes that do not affect retrieval of the resource from a particular origin server can be appended to or inserted into the URI.

Because proxies can modify URIs, and because digest authentication checks the integrity of the URI value, any of these modifications breaks digest authentication.

Caches

When a shared cache receives a request containing an Authorization header and a response produced from that request, it must not return that response to any other request unless the response carried one of the following two Cache-Control directives:

  • If the original response contained the Cache-Control directive must-revalidate, the cache may use the entity body of that response in replying to subsequent requests, but it must first revalidate with the origin server, using the new request’s headers, so the origin server can authenticate the new request.
  • If the original response contained the Cache-Control directive public, its entity body may be returned in the response to any subsequent request.

Security considerations

  • Header tampering

  • Replay attacks

  • Multiple authentication mechanisms

  • Dictionary attacks

  • Malicious proxies and man-in-the-middle attacks

  • Chosen-plaintext attacks

14. Secure HTTP

With HTTPS, all HTTP request and response data is encrypted before being sent onto the network. HTTPS provides a transport-level cryptographic security layer beneath HTTP, implemented with SSL or its successor, TLS.

Establishing a secure transport

In HTTPS, the client first opens a connection to port 443 on the Web server. Once the TCP connection is established, the client and server initialize the SSL layer, negotiating cryptographic parameters and exchanging keys. When the handshake completes, SSL initialization is done, and the client can send request messages to the security layer, which encrypts them before handing them to TCP.

The SSL handshake

Before sending encrypted HTTP messages, the client and server perform an SSL handshake, during which they:

  • exchange protocol version numbers;
  • select a cipher that both endpoints know;
  • authenticate the identity of each end;
  • generate temporary session keys to encrypt the channel.
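
Python’s standard ssl module performs this handshake; a minimal sketch, where the hostname is a placeholder:

import socket
import ssl

# Open a TCP connection to port 443, then let the ssl module run the
# TLS handshake and verify the server certificate against trusted CAs.
context = ssl.create_default_context()
with socket.create_connection(("www.example.com", 443)) as tcp:
    with context.wrap_socket(tcp, server_hostname="www.example.com") as tls:
        print(tls.version())    # negotiated protocol version
        print(tls.cipher())     # negotiated cipher suite
        tls.sendall(b"GET / HTTP/1.0\r\nHost: www.example.com\r\n\r\n")
        print(tls.recv(200))    # start of the response, decrypted by the SSL layer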

Server certificate

SSL supports mutual authentication: the server certificate can be carried to the client, and a client certificate can be carried back to the server. Today, client certificates are rarely used for Web browsing, but secure HTTPS transactions always require a server certificate. A server certificate is an X.509 v3 derived certificate showing the organization’s name and address, the server’s DNS domain name, and other information.

Site certificate validity

SSL itself does not require the user to examine the Web server certificate, but most modern browsers perform a simple sanity check on certificates and give users the means to probe further. One Web server certificate validity algorithm, devised by Netscape, forms the basis of most browsers’ validation techniques. The steps are:

  • Date check
  • Signer trust check
  • Signature check
  • Site identity check

15. Entities and encodings

Content-Length

The Content-Length header gives the size, in bytes, of the entity body in the message. The size reflects any content encoding: if a text file has been gzip-compressed, Content-Length is the compressed size, not the original size.

Unless chunked encoding is in use, the Content-Length header is required for messages with entity bodies. Content-Length is used to detect message truncation caused by server crashes and to correctly divide up messages that share a persistent connection.

Entity digests

A server can use the Content-MD5 header to send the result of running the MD5 algorithm over the entity body. Only the origin server that generates the response may compute and send Content-MD5; intermediate proxies and caches must not modify or add the header, or they would defeat the whole point of end-to-end integrity checking. Content-MD5 is computed after the content encoding has been applied, but before any transfer encoding. To verify message integrity, a client first decodes the transfer encoding and then computes the MD5 of the un-transfer-encoded entity body.

Content encoding

The content-encoding process

  • The Web site server generates the original response, with an original Content-Type and Content-Length.
  • A content-encoding server (possibly the origin server or a downstream proxy) creates the encoded message. The encoded message has the same Content-Type but may have a different Content-Length (for example, if the body is compressed). The content-encoding server adds a Content-Encoding header to the encoded message so the receiving application can decode it.
  • The receiving program decodes the encoded message and obtains the original.
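
A sketch of the compression step with Python’s standard gzip module; the body text is invented:

import gzip

body = b"Hello, world! " * 100            # hypothetical original entity body
encoded = gzip.compress(body)             # what a gzip content-encoder sends

print("Content-Length before:", len(body))      # original size
print("Content-Length after: ", len(encoded))   # size actually sent
print("Content-Encoding: gzip")
assert gzip.decompress(encoded) == body   # receiver decodes to the original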

Content-encoding types

Content-Encoding   Description
gzip               The entity was encoded with GNU zip
compress           The entity was encoded with the Unix file compression program
deflate            The entity was compressed in zlib format
identity           The default; no encoding was performed

Accept-Encoding

The Accept-Encoding header lists the content encodings the client supports, as a comma-separated list.

Accept-Encoding: compress, gzip
Accept-Encoding:
Accept-Encoding: *
Accept-Encoding: compress;q=0.5, gzip;q=1.0
Accept-Encoding: gzip;q=1.0, identity;q=0.5, *;q=0

The client can attach a q (quality) value parameter to each encoding to indicate its preference. q ranges from 0.0 to 1.0: 0.0 means the client does not want the named encoding, and 1.0 means the encoding is most preferred.

Transfer encoding and chunked encoding

  • Transfer-Encoding: tells the receiver which transfer encodings have been applied to the message so it can be reliably transported.
  • TE: used in request headers to tell the server which transfer-encoding extensions the client can accept.

Chunked encoding

Chunked encoding breaks a message into chunks of known size, sent one after another, so the sender does not need to know the size of the entire message before sending it. Note that chunked encoding is a form of transfer encoding and is therefore an attribute of the message, not of the body.
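
A sketch of the chunking itself in Python (the chunk size is chosen arbitrarily): each chunk is a hexadecimal length line, CRLF, the data, CRLF, with a zero-length chunk marking the end.

def chunk_encode(body: bytes, chunk_size: int = 16) -> bytes:
    """Encode a body as HTTP/1.1 chunks: hex size, CRLF, data, CRLF; 0 ends it."""
    out = bytearray()
    for i in range(0, len(body), chunk_size):
        chunk = body[i:i + chunk_size]
        out += b"%x\r\n%s\r\n" % (len(chunk), chunk)
    out += b"0\r\n\r\n"   # last chunk: size 0, blank line ends the (empty) trailer section
    return bytes(out)

print(chunk_encode(b"Hello, chunked world!").decode())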

Trailers in chunked messages

If the client’s TE header indicates that it accepts trailers, a trailer can be appended to a chunked message. The server that generated the original response can also add a trailer to the chunked message. Trailer contents are optional metadata that the client does not necessarily need to understand or use; the client may ignore and discard them.

The trailer can carry additional header fields whose values may not have been known at the start of the message (for example, values that must be computed from the contents of the body). The Content-MD5 header is one that can be sent in a trailer, because it is hard to compute a document’s MD5 until the document has been generated.

The message headers contain a Trailer header listing the headers that will follow the chunked body; those headers appear immediately after the last chunk.

Any HTTP header can be sent as a trailer, except for Transfer-Encoding, Trailer, and Content-Length.

16. Internationalization

17. Content negotiation and transcoding

18. Web hosting

19. Publishing systems

20. Redirection and load balancing

21. Logging