The Web server

The implementation of Web servers

A Web server implements HTTP and the associated TCP connection handling. It is also responsible for managing the resources it serves, and for the configuration, control, and extension of the Web server itself.

The Web server logic implements the HTTP protocol, manages Web resources, and provides Web server management functions. The Web server logic shares responsibility for managing TCP connections with the operating system. The underlying operating system manages the hardware of the computer system and provides TCP/IP network support, a file system for loading Web resources, and process management functions to control current computing activities.

What does a Web server do

  • Establish a connection – accept the client connection, or close it if the connection is unwanted.
  • Receive a request – read the HTTP request message from the network.
  • Process the request – interpret the request message and take action.
  • Access the resource – access the resource specified in the message.
  • Build a response – create the HTTP response message with the correct headers.
  • Send the response – send the response back to the client.
  • Log the transaction – record details of the completed transaction in a log file.
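As a rough sketch only, the seven steps can be compressed into a single transaction handler. The resource table, helper function, and log format below are invented for illustration; a real server reads from sockets and the file system.

```python
# Hypothetical sketch of one Web server transaction: parse the request,
# access the mapped resource, build a response, and log the result.

RESOURCES = {"/index.html": b"<html>Welcome!</html>"}  # stand-in file system
ACCESS_LOG = []

def handle_transaction(raw_request: bytes) -> bytes:
    # (2) Receive/parse the request line: METHOD URI VERSION, ending in CRLF
    request_line = raw_request.split(b"\r\n", 1)[0].decode()
    method, uri, version = request_line.split(" ")

    # (3)+(4) Process the request and access the mapped resource
    body = RESOURCES.get(uri)
    status = "200 OK" if method == "GET" and body is not None else "404 Not Found"
    if body is None:
        body = b"<html>Not Found</html>"

    # (5) Build the response with correct headers (Content-Length matters
    # for persistent connections)
    headers = (f"HTTP/1.0 {status}\r\n"
               f"Content-Type: text/html\r\n"
               f"Content-Length: {len(body)}\r\n\r\n").encode()

    # (7) Record the completed transaction in the log
    ACCESS_LOG.append(f"{method} {uri} -> {status}")

    # (6) The response is returned for sending back over the connection
    return headers + body

response = handle_transaction(b"GET /index.html HTTP/1.0\r\n\r\n")
```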

Accepting client connections

When a client requests a TCP connection to the Web server, the Web server establishes the connection and determines which client is on the other end, extracting the client's IP address from the TCP connection.

Once the new connection is established and accepted, the server adds the new connection to its list of existing Web server connections, ready to monitor the connection data transfer.

Receiving request messages

When data arrives on the connection, the Web server reads the data from the network connection and parses the contents of the request message. When parsing request packets, the Web server:

  • Parses the request line, looking for the request method, the specified resource identifier (URI), and the version number, each separated by a single space, with the line ending in a carriage-return line-feed (CRLF) sequence.
  • Reads the message headers, each ending in CRLF.
  • Detects the blank line, ending in CRLF, that marks the end of the headers, if any.
  • Reads the request body, if any (its length is specified by the Content-Length header).

The Web server reads data from the network and stores part of the packet data in memory until it receives enough data to parse and understand its meaning.
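The framing rules above can be sketched as a small parser. This is an illustrative toy, not a complete HTTP parser: it ignores folded headers, chunked bodies, and error handling, and the message used in the example is invented.

```python
# Illustrative parser for the framing rules above: a request line,
# CRLF-terminated headers, a blank line, then an optional Content-Length body.

def parse_request(data: bytes):
    head, _, rest = data.partition(b"\r\n\r\n")   # blank line ends the headers
    lines = head.split(b"\r\n")

    # Request line: method, URI, and version separated by single spaces
    method, uri, version = lines[0].decode().split(" ")

    # Each header line is "Name: value", terminated by CRLF
    headers = {}
    for line in lines[1:]:
        name, _, value = line.decode().partition(":")
        headers[name.strip().lower()] = value.strip()

    # The body is read only if Content-Length says one is present
    length = int(headers.get("content-length", 0))
    body = rest[:length]
    return method, uri, version, headers, body

msg = (b"POST /order HTTP/1.1\r\n"
       b"Host: www.joes-hardware.com\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"item1")
method, uri, version, headers, body = parse_request(msg)
```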

Handle the request

Once the Web server receives the request, it can process the request based on methods, resources, headers, and optional body parts.

Mapping and access to resources

A Web server is a resource server. It is responsible for delivering pre-created content, such as HTML pages or JPEG images, as well as dynamic content generated by resource generators running on the server.

Build the response

Once the Web server identifies the resource, it performs the action described in the request method and returns a response message. The response message contains a response status code, response headers, and a response body, if one was generated.

Send a response

Web servers face the same problems when sending data over a connection as they do when receiving data. The server may have many connections to various clients, some idle, some sending data to the server, and some sending response data back to the client.

  • For non-persistent connections, the server is expected to close its end of the connection after sending the entire response.
  • For persistent connections, the connection may stay open, in which case the server must be careful to compute the Content-Length header correctly, or the client will have no way of knowing where the response ends.
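The Content-Length concern can be sketched as follows, with hypothetical helper names; the point is that the declared length must match the body's byte count exactly, since on a persistent connection the client uses it to find the end of the response.

```python
# Sketch: on a persistent connection the server must compute Content-Length
# correctly, or the client cannot tell where the response body ends.

def build_response(body: bytes, keep_alive: bool = True) -> bytes:
    headers = ["HTTP/1.1 200 OK",
               "Content-Type: text/html",
               f"Content-Length: {len(body)}"]   # exact byte count of the body
    # On a non-persistent connection the server instead signals the end of
    # the response by closing the connection.
    headers.append("Connection: keep-alive" if keep_alive else "Connection: close")
    return ("\r\n".join(headers) + "\r\n\r\n").encode() + body

resp = build_response(b"<html>hi</html>")
```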

Logging

Finally, when the transaction ends, the Web server adds an entry to the log file describing the transaction that was executed. Most Web servers provide several logging configuration formats.

The proxy

A Web proxy server is an intermediate entity on a network. A proxy sits between a client and a server and acts as a middleman, sending HTTP packets back and forth between endpoints.

An intermediate entity for the Web

Private and shared proxies

  • A proxy dedicated to a single client is called a private proxy.
  • A proxy shared by many clients is called a public proxy. Most proxies are public, shared proxies. Centralized proxies are more cost-effective and easier to manage.

Proxy versus gateway

Strictly speaking, a proxy connects two or more applications that speak the same protocol, while a gateway joins two or more parties that speak different protocols. A gateway acts as a "protocol converter", allowing a client to complete a transaction with a server even when the client and server speak different protocols.

Why use proxies

Proxy servers can do all sorts of snazzy and useful things.

  • Proxies can improve security, enhance performance, and save money.
  • Because a proxy can see and touch all of the HTTP traffic that flows through it, it can monitor and modify the traffic to implement many useful value-added Web services.

Several functions implemented by proxy:

  • Child filter
  • Document access control
  • Security firewall
  • Web caching
  • Reverse proxy (surrogate)
  • Content router
  • Transcoder
  • Anonymizer

Where do proxies go

Proxy server deployment

Proxies can be placed anywhere, depending on their intended purpose.

  • Egress proxy (exit proxy)

    A proxy can be anchored at the exit point of a local network to control the traffic flowing between the local network and the greater Internet. An egress proxy on a corporate network can provide firewall protection against malicious hackers outside the company, or reduce bandwidth charges and improve the performance of Internet traffic.

  • Ingress proxy (access proxy)

    Proxies are often placed at ISP access points to handle the aggregate requests of their customers. ISPs use caching proxies to store copies of frequently used documents, improving download speed for their users (especially those with high-speed connections) and reducing Internet bandwidth consumption.

  • The reverse proxy

    Proxies are often deployed at the edge of the network, in front of Web servers, where they act as surrogates (often called reverse proxies). Surrogates can add security features to Web servers, or place fast Web server caches in front of slower Web servers to improve performance.

    A reverse proxy usually assumes the name and IP address of the Web server directly, so that all requests go to the proxy instead of the server.

Proxy hierarchies

Proxies can be cascaded in a proxy hierarchy. In a proxy hierarchy, messages are passed from proxy to proxy until they finally reach the origin server (and then travel back to the client through the proxies).

The proxy hierarchy in the figure is static: proxy 1 always forwards messages to proxy 2, and proxy 2 always forwards messages to proxy 3. However, hierarchies do not have to be static. A proxy server can forward messages to a varied and changing set of proxy servers and origin servers, depending on many factors.

The proxy hierarchy can be dynamic, depending on the request

  • Load balancing

    A child proxy may pick a parent proxy based on the current level of workload on the parents, to balance the load.

  • Geographic proximity routing

    A child proxy may select the parent proxy responsible for the physical region of the origin server.

  • Protocol/type routing

    A child proxy may route messages to different parent proxies and origin servers based on the URI. Requests for certain types of URIs may be forwarded through special proxy servers for special protocol handling.

  • Subscription-based routing

    If publishers have paid extra for high-performance service, their URIs may be routed to large caching or compression engines to improve performance.

How does the proxy capture traffic

How do requests reach the proxy

In the figure below, there are four common ways in which client traffic finds its way to a proxy:

  • (a) Modify the client

    Many Web clients, including Netscape and Microsoft browsers, support manual and automatic proxy configuration.

  • (b) Modify the network

    There are several techniques by which the network infrastructure can intercept network traffic and shunt it into a proxy, without the client's knowledge or participation. This interception typically relies on switching and routing devices that watch for HTTP traffic, intercept it without the client's knowledge, and direct it to a proxy; such a proxy is called an intercepting proxy.

  • (c) Modify the DNS namespace

    Proxies placed in front of a Web server can assume the name and IP address of the Web server itself, so that all requests go to the proxies instead of the server. This can be arranged by manually editing the DNS name list, or by using a special dynamic DNS server that determines the appropriate proxy or server on demand.

  • (d) Modify the Web server

    Some Web servers can also be configured to send an HTTP redirect (response code 305) to the client, redirecting the client's request to a proxy. On receiving the redirect, the client talks to the proxy.

Proxy settings on the client

All modern Web browsers allow users to configure the use of proxies. In fact, many browsers offer a variety of ways to configure proxies, including manual configuration, pre-configured browsers, automatic configuration of proxies, and WPAD proxy discovery.

Manual configuration

Explicitly set the proxy to use.

Client proxy configuration: PAC file

Manual proxy configuration is simple but rigid. Only one proxy server can be specified for all content, and failover is not supported. Manual proxy configuration can also cause administrative problems for large organizations. If the configured browser base is large, it can be difficult, if not impossible, to reconfigure each browser when it comes time to make changes.

PAC files are small JavaScript programs that compute proxy settings on the fly and, therefore, are a more dynamic proxy configuration solution. When accessing each document, the JavaScript function selects the appropriate proxy server.

To use a PAC file, configure the browser with the URI of the JavaScript PAC file (similar to manual configuration, but with a URI supplied in the automatic-configuration box). The browser fetches the PAC file from this URI and uses the JavaScript logic to compute the proper proxy server for each access. PAC files usually have a .pac suffix, and the MIME type is usually application/x-ns-proxy-autoconfig.

Client proxy configuration: WPAD protocol

Another browser configuration mechanism is the WPAD protocol. The WPAD algorithm automatically finds the appropriate PAC file for the browser, using an escalating series of discovery mechanisms. A client implementing the WPAD protocol needs to:

  • Use WPAD to find the PAC URI;
  • Fetch the PAC file from that URI;
  • Execute the PAC file to determine the proxy server;
  • Use that proxy server for requests.

Thorny issues related to proxy requests

The proxy URI is different from the server URI

The syntax of messages sent to a proxy is the same as that of messages sent to a Web server, with one exception: the URI in the HTTP request message differs depending on whether the client sends the request to a server or to a proxy.

  • When a client sends a request to a Web server, the request line contains only part of the URI (no scheme, host, or port), as shown in the following example:

    GET /index.html HTTP/1.0 
    User-Agent: SuperBrowser v1.3
  • But when the client sends a request to the proxy, the request line contains the full URI. Such as:

    GET http://www.marys-antiques.com/index.html HTTP/1.0
    User-Agent: SuperBrowser v1.3

In the original HTTP design, clients talked directly to a single server. There were no virtual hosts, and no provision was made for proxies. A single server knows its own hostname and port, so, to avoid sending redundant information, clients could send just the partial URI, omitting the scheme and host (and port).

Once proxies appeared, partial URIs became a problem. Proxies need to know the name of the destination server so they can establish their own connections to the server. Proxy-based gateways need the URI scheme to connect to FTP resources and resources using other schemes. HTTP/1.0 solved the problem by requiring the full URI for proxy requests, but it retained the partial-URI form for server requests (quite a few servers have since been changed to support full URIs as well).

Intercepting proxies receive partial URIs

The client is not always aware that it is talking to a proxy, because some proxies are invisible to the client. Even if the client is not configured to use a proxy, its traffic may still pass through a surrogate or an intercepting proxy. In both cases, the client believes it is talking to a Web server and will not send the full URI.

A proxy can handle both proxy and server requests

Because of the different ways in which traffic is redirected to proxy servers, a generic proxy server should support both full and partial URIs in the request message.

  • If it is an explicit proxy request (full URI), the proxy should use the full URI;
  • If it is a Web server request (partial URI), the proxy should use the partial URI and the virtual hosting rules (the Host header).

Changes to URIs during forwarding

Proxy servers need to be careful about modifying the request URI as they forward messages. Minor changes to URIs, even ones that seem harmless, can create interoperability problems for downstream servers.

In particular, some proxies are known to "normalize" URIs into a standard form before forwarding them to the next hop. Seemingly innocuous transformations, such as replacing the default HTTP port with an explicit ":80", or "correcting" URIs by replacing illegal reserved characters with their properly escaped equivalents, can cause interoperability problems.

URI resolution without proxy

Without a proxy, the browser takes the URI you type and tries to find the corresponding IP addresses. If the hostname resolves, the browser tries the addresses until it gets a successful connection.

URI resolution with explicit proxy

With an explicit proxy, the user's URI is sent directly to the proxy, so the browser no longer performs all of these convenient hostname expansions.

URI resolution when there is an intercepting proxy

Hostname resolution is slightly different with an invisible intercepting proxy, because as far as the client is concerned, there is no proxy! The behavior is much like talking directly to a server: the browser auto-expands the hostname until DNS resolution succeeds.

Tracing messages

It is now common for Web requests to pass through two or more proxies on the path from client to server. As proxies become more popular, it is as important to be able to trace the flow of packets passing through them to detect problems as it is to track IP packet flows passing through different switches and routers.

The Via header

The Via header field lists information about each intermediate node (proxy or gateway) through which the message passes. Each time a message passes through a node, the intermediate node must be added to the end of the Via list.

The Via string below tells us that the message traveled through two proxies. It indicates that the first proxy, named proxy-62.irenes-isp.net, implements the HTTP/1.1 protocol, and that the second proxy, named cache.joes-hardware.com, implements HTTP/1.0:

Via: 1.1 proxy-62.irenes-isp.net, 1.0 cache.joes-hardware.com
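The append-to-the-end rule could be sketched as follows; the header-dictionary representation is an assumption of the example, and the node names are the ones from the Via string above.

```python
# Sketch: each intermediary appends its own "protocol-version node-name"
# entry to the end of the Via header before forwarding the message.

def append_via(headers: dict, version: str, node: str) -> dict:
    entry = f"{version} {node}"
    forwarded = dict(headers)          # don't mutate the incoming headers
    if "Via" in forwarded:
        forwarded["Via"] = forwarded["Via"] + ", " + entry  # add to the end
    else:
        forwarded["Via"] = entry
    return forwarded

h = append_via({}, "1.1", "proxy-62.irenes-isp.net")
h = append_via(h, "1.0", "cache.joes-hardware.com")
```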

Via privacy and security issues

Sometimes we don’t want to use the exact hostname in the Via string. In general, unless explicitly permitted, proxy servers used as part of a network firewall should not forward the names and port numbers of hosts behind the firewall, because the network structure information behind the firewall could be used by malicious groups.

If Via node-name forwarding is not allowed, a proxy that is part of a security perimeter should replace the hostname with an appropriate pseudonym. Generally, proxies should try to keep one Via waypoint entry for each proxy server, even if the real name is obscured.

For organizations with very strong privacy requirements to hide the design and topology of their internal networks, a proxy should combine an ordered sequence of Via waypoint entries that have identical received-protocol values into a single joined entry. For example, you could collapse:

Via: 1.0 foo, 1.1 devirus.company.com, 1.1 access-logger.company.com

Compressed into:

Via: 1.0 foo, 1.1 folded stuff

The entries cannot be merged unless they are all under the control of the same organization and the host name has been replaced with a pseudonym. Similarly, entries with different receive protocol values cannot be combined.

The TRACE method

The HTTP/1.1 TRACE method lets you trace a request message as it travels through a chain of proxies, seeing which proxies the message passes through and how each proxy modifies the request message. TRACE is very useful for debugging proxy flows.

Proxy authentication

Proxies can serve as access-control devices. HTTP defines a mechanism called proxy authentication that blocks requests for content until the user provides valid access credentials to the proxy.

  • A proxy can implement authentication mechanisms to control access to content

Interoperability between different agents

Clients, servers, and proxies are built by different vendors and implement different versions of the HTTP specification. They support different features and have different problems. Proxy servers sit between client and server devices that implement different protocols and can be problematic.

OPTIONS: Finds support for optional features

By using OPTIONS, the client can determine the capabilities of the server before interacting with it, so it can more easily interoperate with agents and servers with different features.

If the OPTIONS request URI is an asterisk (*), it is requesting functionality supported by the entire server. Such as:

OPTIONS * HTTP/1.1

If the URI is an actual resource address, the OPTIONS request queries the available features of that particular resource:

OPTIONS http://www.joes-hardware.com/index.html HTTP/1.1

If successful, the OPTIONS method returns a 200 OK response containing various header fields that describe the optional features supported by the server, or available for the resource. The only header field HTTP/1.1 specifies in the response is the Allow header, which lists the methods supported by the server (or by the particular resource on the server).

The Allow header

You can use the Allow header as a request header to request that certain methods be supported on a new resource. The server is not required to support those methods, but it should include an Allow header in the corresponding response, listing the methods it actually supports.

The cache

A Web cache is an HTTP device that automatically keeps copies of popular documents. When a Web request arrives at a cache, if a local "cached" copy is available, the document is served from the local store instead of from the origin server.

Using a cache has the following advantages.

  • Caching reduces redundant data transfers, saving network costs.
  • Caching relieves network bottlenecks. Pages load faster without requiring more bandwidth.
  • Caching reduces demand on the origin server. The server can respond faster and avoid overload.
  • Caching reduces distance latency, because pages load more slowly from farther away.

Redundant data transmission

When many clients access a popular origin server page, the server transmits the same document multiple times, once per client. The same bytes travel across the network over and over again. This redundant data transfer eats up expensive network bandwidth, slows down transfers, and increases the load on Web servers. With caching, the first server response can be kept as a copy, and subsequent requests can be fulfilled from the cached copy, reducing the wasteful duplicate traffic flowing to and from origin servers.

The bandwidth bottleneck

Caching can also alleviate network bottlenecks. Many networks provide more bandwidth for local network clients than for remote servers. The client accesses the server at the slowest network speed on the path. If the client gets a copy from the cache on a fast LAN, caching can improve performance — especially if large files are to be transferred.

Flash crowds

Caching is especially important for breaking up flash crowds. A flash crowd occurs when a sudden event (such as breaking news, a mass email announcement, or a celebrity event) causes many people to access a Web document at nearly the same time. The resulting spike of redundant traffic can cause a catastrophic collapse of networks and Web servers.

Distance latency

Even if bandwidth is not a problem, distance can be. Each network router increases the latency of Internet traffic. Even if there are not many routers between the client and server, the speed of light itself can cause significant delays.

Hits and misses

So caching is helpful. But a cache cannot hold a copy of every document in the world.

  • Some requests that arrive at a cache can be serviced from an available copy. This is called a cache hit.
  • Other requests arrive at a cache only to be forwarded to the origin server, because no copy is available. This is called a cache miss.

Few can afford a cache big enough to hold all the Web's documents. And even if you could afford a huge "whole-Web cache", some documents change so frequently that many of the cached copies would not be up to date.

Revalidation

HTTP provides several tools for revalidating cached objects, but the most common is the If-Modified-Since header. Adding this header to a GET request tells the server to send the object only if it has been modified since the time the copy was cached.

Here is what happens in each of three cases when a server receives a GET If-Modified-Since request:

  • Revalidation hit

    If the server object has not been modified, the server sends the client a small HTTP 304 Not Modified response.

  • Revalidation miss

    If the server object is different from the cached copy, the server sends a plain HTTP 200 OK response to the client with the full content.

  • Object deleted

    If the server object has been deleted, the server sends back a 404 Not Found response, and the cache removes its copy.

Hit rate

The percentage of requests serviced by the cache is known as the cache hit rate (or cache hit ratio), sometimes referred to as the Document hit rate.

Byte hit ratio

Since documents are not all the same size, the document hit rate doesn't tell the whole story. Some large objects may be accessed less often but, because of their size, contribute more to the overall data traffic. For this reason, some people prefer the byte hit rate as a metric (especially those who are billed by the byte!).

The byte hit ratio represents the percentage of bytes supplied by the cache out of all bytes transferred. Through this measure, you can know how much traffic is saved. The 100% byte hit ratio means that every byte comes from the cache and no traffic flows to the Internet.
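The difference between the two measures can be seen in a small worked example; the traffic numbers below are invented for illustration.

```python
# Sketch: document hit rate counts requests; byte hit rate weights each
# request by its size, so large objects dominate the byte measure.

def hit_rates(transactions):
    # transactions: list of (bytes_transferred, served_from_cache)
    hits = [t for t in transactions if t[1]]
    doc_rate = len(hits) / len(transactions)
    byte_rate = sum(t[0] for t in hits) / sum(t[0] for t in transactions)
    return doc_rate, byte_rate

# Three small cached pages plus one large uncached download
log = [(1_000, True), (1_000, True), (1_000, True), (97_000, False)]
doc_rate, byte_rate = hit_rates(log)
```

Here three of the four documents (75%) were cache hits, but only 3% of the bytes came from the cache, so the byte measure tells a very different story.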

Distinguish between hits and misses

Unfortunately, HTTP does not provide a means for users to distinguish between a cache hit and a response from a visit to the original server. In both cases, the response code is 200 OK, indicating that the response has a body. Some commercial proxy caches attach additional information to the Via header to describe what is happening in the cache.

One way the client can determine if the response is from the cache is to use the Date header. Compare the Date header value in the response to the current time. If the Date value in the response is early, the client can usually assume that it is a cached response.
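That heuristic might be sketched as below; the slack threshold is an invented parameter, and since real clocks skew, the check is only approximate.

```python
# Sketch of the heuristic: if the Date header in a response is noticeably
# earlier than the current time, the response probably came from a cache.

from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def looks_cached(date_header: str, now: datetime, slack_seconds: int = 60) -> bool:
    response_date = parsedate_to_datetime(date_header)
    # An old Date value suggests the response was generated earlier and cached
    return (now - response_date) > timedelta(seconds=slack_seconds)

now = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
assume_cached = looks_cached("Wed, 01 May 2024 11:00:00 GMT", now)  # an hour old
```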

The cache topology

A cache dedicated to a single user is called a private cache; a cache shared by many users is called a public cache.

  • A private cache is a personal cache, containing the pages most frequently used by a single user.
  • A public cache contains the pages commonly used by a community of users.

Private cache

Private caches don’t require a lot of power or storage, so they can be made small and cheap. Web browsers have built-in private caches – most browsers cache frequently used documents on your PC’s disk and memory and allow users to configure the cache size and settings.

Public proxy cache

A public cache is a special shared proxy server called a caching proxy server or, more commonly, a proxy cache. The proxy cache serves documents from the local cache or contacts the server on behalf of the user. A public cache accepts access from multiple users, so it is a better way to reduce redundant traffic.

Hierarchy of proxy caches

In practice, it often makes sense to deploy caches in a hierarchy, where cache misses in smaller caches are funneled up to larger parent caches that serve the remaining "distilled" traffic. The following figure shows a two-level cache hierarchy. The basic idea is to use small, inexpensive caches close to the clients and, at higher levels, progressively larger, more powerful caches that hold documents shared by many users.

Mesh caching, content routing, and peer caching

Some network architectures build complex cache meshes instead of simple cache hierarchies. Proxy caches in a mesh talk to one another in more sophisticated ways, making dynamic cache-communication decisions, deciding which parent caches to talk to, or deciding to bypass caches entirely and connect directly to the origin server. Because such proxy caches decide which routes to use to access, manage, and deliver content, they can be called content routers.

Cache processing steps

The basic workings of Web caching are mostly simple. The basic caching process of an HTTP GET packet consists of seven steps: receive -> parse -> query -> freshness check -> create response -> send -> log.

Receive

In the first step, the cache detects activity on a network connection and reads the input data. High-performance caches read data from multiple input connections at the same time and start processing transactions before the entire message arrives.

Parse

Next, the cache parses the request message into fragments, putting the pieces of the header into easy-to-manipulate data structures. This makes it easier for the caching software to process the header fields and modify them.

Lookup

In step 3, the cache takes the URL and checks for a local copy. The local copy might be stored in memory, on a local disk, or even on another nearby computer. Professional-grade caches use fast algorithms to determine whether an object is in the local cache. If the document is not available locally, the cache can fetch it from the origin server or a parent proxy, or return a failure message, depending on the situation and configuration.

The cached object contains both the server response body and the original server response header, so that the correct server header is returned in case of a cache hit. The cached object also contains metadata that records how long the object has been in the cache and how many times it has been used.
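One way such a cached object might be represented is sketched below; the field names and the dictionary-keyed-by-URL store are invented for the example.

```python
# Sketch of a cached object: the stored server response body and headers,
# plus the metadata the cache keeps about the copy.

import time

cache = {}

def store(url: str, status: int, headers: dict, body: bytes):
    cache[url] = {"status": status, "headers": headers, "body": body,
                  "stored_at": time.time(),   # metadata: when it was cached
                  "use_count": 0}             # metadata: how often it was used

def lookup(url: str):
    obj = cache.get(url)
    if obj is None:
        return None          # cache miss: fetch from origin or parent proxy
    obj["use_count"] += 1    # track usage so the metadata stays current
    return obj

store("http://www.joes-hardware.com/index.html", 200,
      {"Content-Type": "text/html"}, b"<html>tools</html>")
hit = lookup("http://www.joes-hardware.com/index.html")
miss = lookup("http://www.joes-hardware.com/other.html")
```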

Freshness detection

Through caching, HTTP lets a copy of a server document be held for a period of time. During this time, the document is considered "fresh", and the cache can serve it without contacting the server. But once the cached copy has been held too long, beyond the document's freshness limit, the object is considered "stale", and the cache must revalidate the copy with the server to see whether the document has changed before serving it. To complicate matters, request headers sent by the client to the cache can themselves force the cache to revalidate, or avoid validation altogether.

Create the response

We want the cached response to look as if it came from the origin server, so the cache uses the cached server response headers as the starting point for the response headers. The cache then modifies and augments these base headers.

The cache is responsible for adapting these headers to match the requirements of the client. For example, the server may return an HTTP/1.0 response (or even an HTTP/0.9 response) while the client expects an HTTP/1.1 response, in which case the cache must translate the headers accordingly. Caches also insert freshness information (the Cache-Control, Age, and Expires headers), and often include a Via header to note that the request was served by a proxy cache.

Send

Once the response header is ready, the cache sends the response back to the client. Like all proxy servers, proxy caches manage connections to clients.

Log

Most caches hold log files and some statistics related to the use of the cache. At the end of each cache transaction, the cache updates statistics on the number of cache hits and misses (as well as other relevant metrics) and inserts entries into a log file that displays the request type, URL, and event that occurred.

Keep copies fresh

HTTP has some simple mechanisms to keep cached data sufficiently consistent with server data without requiring servers to remember which caches have copies of their documents. HTTP calls these simple mechanisms document expiration and server revalidation.

  • Flowchart for caching GET requests

Document expiration

Through the special HTTP Cache-Control and Expires headers, HTTP lets an origin server attach an "expiration date" to each document. Like the expiration date on a milk carton, these headers dictate how long content can be considered fresh.

Until a cached document expires, the cache can serve the copy as often as it wants, without contacting the server, unless, of course, the client request includes headers that prevent serving cached or unvalidated resources. But once the cached document expires, the cache must check with the server to see whether the document has changed and, if so, get a fresh copy (with a new expiration date).

Expiration dates and ages

Servers specify an expiration date using either the HTTP/1.0+ Expires header or the HTTP/1.1 Cache-Control: max-age response header, which accompanies the response body. The Expires and Cache-Control: max-age headers do essentially the same thing, but the newer Cache-Control header is preferred because it uses relative time rather than an absolute date; absolute dates depend on computer clocks being set correctly.
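A minimal sketch of a max-age freshness check follows; the directive parsing is deliberately simplified (real Cache-Control parsing has many more cases), and the header values are invented.

```python
# Sketch: a cached copy is fresh while its age is below the freshness
# lifetime granted by Cache-Control: max-age (a relative time in seconds).

def is_fresh(cache_control: str, age_seconds: float) -> bool:
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name == "max-age":
            return age_seconds < int(value)
    return False  # no max-age granted: treat the copy as needing revalidation

fresh = is_fresh("public, max-age=3600", age_seconds=500)    # still fresh
stale = is_fresh("public, max-age=3600", age_seconds=7200)   # must revalidate
```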

Server revalidation

Just because a cached document has expired does not mean it is actually different from the document currently on the origin server; it just means it is time to check. This is called server revalidation: the cache needs to ask the origin server whether the document has changed.

  • If revalidation shows that the content has changed, the cache fetches a new copy of the document, stores it in place of the old document, and sends the document to the client.

  • If revalidation shows that the content has not changed, the cache only needs new headers, including a new expiration date, and it updates the headers in the cache.

Revalidation is performed using conditional methods

HTTP defines five conditional request headers. The two most useful for cache revalidation are If-Modified-Since and If-None-Match. All conditional headers begin with the prefix "If-".

If-Modified-Since: Date revalidation

A common cache revalidation header is If-Modified-Since. If-Modified-Since revalidation requests are often called IMS requests. An IMS request instructs a server to perform the request only if the resource has changed since a certain date:

  • If the document has been modified since the specified date, the If-Modified-Since condition is true, and the GET usually succeeds. A new document is returned to the cache with new headers, which include, among other things, a new expiration date.

  • If the document has not been modified since the specified date, the condition is false, and a small 304 Not Modified response message is returned to the client. For efficiency, the body of the document is not returned. Headers are returned in the response, but only those that need updating on the receiving end. For example, the Content-Type header usually has not changed, so it normally does not need to be sent; a new expiration date typically is sent.
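A minimal sketch of the server side of an IMS request might look like this (the helper function and its simplified date comparison are assumptions; only the header semantics come from HTTP):

```python
from email.utils import parsedate_to_datetime

def conditional_get_status(if_modified_since, last_modified):
    """Return 200 if the resource changed after the IMS date, else 304.

    Both arguments are HTTP date strings, e.g. "Tue, 02 Jan 2024 00:00:00 GMT".
    """
    ims = parsedate_to_datetime(if_modified_since)
    changed_at = parsedate_to_datetime(last_modified)
    if changed_at > ims:
        return 200  # modified since: send the full body with fresh headers
    return 304      # not modified: send updated headers only, no body
```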

If-None-Match: entity tag revalidation

In some cases, revalidation using the last-modified date alone is not sufficient:

  • Documents may be rewritten periodically (for example, by a background process), but the actual data they contain is often the same. Although the content does not change, the modification date does.
  • The document may have been modified, but the changes may not be important enough to require caches around the world to reload the data (such as changes to spelling or comments).
  • The server cannot accurately determine the last modification dates of its pages.
  • The server provides documents that change in sub-second intervals (for example, real-time monitors); for these servers, modification dates with one-second granularity may not be sufficient.

To address these problems, HTTP allows you to compare “version identifiers” called entity tags (ETags). Entity tags are arbitrary labels (quoted strings) attached to a document. They might contain a serial number or version name for the document, or a checksum or other fingerprint of the document’s content.

When a publisher changes a document, it can change the document’s entity tag to represent the new version. This way, if the entity tag has changed, the cache can use an If-None-Match conditional header to GET a new copy of the document.
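The server side of an If-None-Match check can be sketched as follows (a simplified strong comparison; a real server must also handle weak tags and quoting edge cases):

```python
def if_none_match_status(request_etags, current_etag):
    """Return 304 if any tag the cache holds still matches, else 200.

    request_etags: value of the If-None-Match header, e.g. '"v2.4", "v2.6"'.
    current_etag: the document's current tag, e.g. '"v2.6"'.
    """
    tags = [t.strip() for t in request_etags.split(",")]
    if "*" in tags or current_etag in tags:
        return 304  # a cached version is still current: no body needed
    return 200      # every cached version is stale: send the new document
```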

Strong and weak validators

Sometimes a server wants to make a cosmetic or insignificant change to a document without invalidating all cached copies. HTTP/1.1 supports “weak validators,” which allow the server to declare that the content is a “good enough” equivalent even though it has changed slightly.

A strong validator changes whenever the content changes. A weak validator allows some of the content to change, but usually changes only when the significant meaning of the content changes. Some operations cannot be performed with weak validators (such as conditionally fetching partial content), so servers identify weak validators with the prefix “W/”:

ETag: W/"v2.6"
If-None-Match: W/"v2.6"

When to use entity tags and last-modified dates

  • If the server sends back an entity tag, an HTTP/1.1 client must use the entity tag validator.
  • If the server sends back only a Last-Modified value, the client can use If-Modified-Since validation.
  • If both an entity tag and a last-modified date are provided, the client should use both revalidation schemes so that both HTTP/1.0 and HTTP/1.1 caches respond correctly.
  • If an HTTP/1.1 cache or server receives a request with both If-Modified-Since and entity tag conditional headers, it returns a 304 Not Modified response only if both conditions are met.
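The client-side rules above can be sketched as a small helper that decides which conditional headers to attach (the function is illustrative; the header names are real):

```python
def revalidation_headers(cached_response_headers):
    """Choose conditional headers for revalidating a cached response."""
    conditions = {}
    if "ETag" in cached_response_headers:
        # An entity tag was provided, so HTTP/1.1 requires using it.
        conditions["If-None-Match"] = cached_response_headers["ETag"]
    if "Last-Modified" in cached_response_headers:
        # Also send the date validator so HTTP/1.0 caches respond correctly.
        conditions["If-Modified-Since"] = cached_response_headers["Last-Modified"]
    return conditions
```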

Controlling cachability

The server can specify how long a document can be cached before it expires in several ways defined by HTTP. In descending order of priority, the server can:

  • Attach a Cache-Control: no-store header to the response;
  • Attach a Cache-Control: no-cache header to the response;
  • Attach a Cache-Control: must-revalidate header to the response;
  • Attach a Cache-Control: max-age header to the response;
  • Attach an Expires date header to the response;
  • Attach no expiration information, letting the cache determine its own heuristic expiration date.
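A cache’s view of this priority order can be sketched as follows (the parsing is deliberately simplified: real Cache-Control parsing must be case-insensitive and tokenize properly, and must-revalidate is omitted here because it constrains serving stale copies rather than setting a lifetime):

```python
def expiration_policy(headers):
    """Pick the effective expiration rule, highest priority first."""
    cc = [d.strip() for d in headers.get("Cache-Control", "").split(",") if d.strip()]
    if "no-store" in cc:
        return "do-not-store"
    if "no-cache" in cc:
        return "revalidate-before-each-use"
    for directive in cc:
        if directive.startswith("max-age="):
            return "fresh-for-" + directive.split("=", 1)[1] + "s"
    if "Expires" in headers:
        return "fresh-until-expires-date"
    return "heuristic-expiration"
```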

No-store and no-cache response headers

HTTP/1.1 provides several ways to limit object caching, or to limit how cached objects can be served, in order to maintain freshness. The no-store and no-cache headers prevent the cache from providing unverified cached objects:

Pragma: no-cache
Cache-Control: no-store 
Cache-Control: no-cache

A response marked no-store forbids the cache from making a copy of the response. A cache typically forwards a no-store response to the client, as a non-caching proxy server would, and then deletes the object.

A response marked no-cache can actually be stored in the local cache; the cache just cannot serve it to the client until its freshness has been revalidated with the original server. A better name for this header would be do-not-serve-from-cache-without-revalidation.
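The distinction can be sketched with a pair of flags (a simplification; a real cache tracks far more state):

```python
def may_serve_from_cache(directive, revalidated):
    """Whether a cached copy may be served to a client.

    directive: "no-store", "no-cache", or None (no restriction).
    revalidated: True if freshness was just re-verified with the origin.
    """
    if directive == "no-store":
        return False        # the response may not be stored at all
    if directive == "no-cache":
        return revalidated  # stored, but must be revalidated before each use
    return True
```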

The Pragma: no-cache header is included in HTTP/1.1 for backward compatibility with HTTP/1.0+. HTTP/1.1 applications should use Cache-Control: no-cache, except when interacting with HTTP/1.0 applications that understand only Pragma: no-cache.

Max-age response header

The Cache-Control: max-age header indicates the number of seconds, counted from when the response left the server, for which the document can be considered fresh. There is also an s-maxage header (note that “maxage” has no internal hyphen), which acts like max-age but applies only to shared (public) caches:

Cache-Control: max-age=3600
Cache-Control: s-maxage=3600

Expires header

Use of the Expires header, which specifies an actual expiration date rather than a number of seconds, is no longer recommended. The HTTP designers later decided that, because many server clocks are unsynchronized or incorrect, it is better to express expiration as seconds remaining rather than as an absolute time.

The must-revalidate response header

Caches can be configured to serve stale (expired) objects in order to improve performance. If the original server wants caches to adhere strictly to the expiration information, it can attach a Cache-Control: must-revalidate header to the original response.

The Cache-Control: must-revalidate response header tells the cache that it cannot serve a stale copy of the object without first revalidating it with the original server.

Heuristic expiration

If the response contains neither a Cache-Control: max-age header nor an Expires header, the cache may compute a heuristic maximum age. Any algorithm may be used, but if the computed maximum age is greater than 24 hours, a Heuristic Expiration Warning header should be added to the response headers. As far as we know, few browsers make this warning visible to users.
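One common heuristic (often called an LM-factor algorithm) bases the lifetime on how long the document has gone unmodified; the factor and cap below are illustrative values, not mandated by HTTP:

```python
from datetime import datetime, timedelta, timezone

def heuristic_max_age(response_date, last_modified, factor=0.1,
                      cap=timedelta(hours=24)):
    """Guess a freshness lifetime when no explicit expiration is given.

    A document untouched for a long time is assumed likely to stay stable,
    so its heuristic lifetime is a fraction of its time since modification.
    """
    lifetime = (response_date - last_modified) * factor
    return min(lifetime, cap)  # past 24h, a warning header would be due
```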

Client freshness limits

Web browsers all have Refresh or Reload buttons to forcibly refresh content that might be stale in the browser or proxy caches. The Refresh button issues a GET request with a Cache-Control request header attached, which forces a revalidation or an unconditional fetch of the document from the server. The exact behavior of Refresh depends on the configuration of the particular browser, document, and intercepting cache.

Setting cache controls

Different Web servers provide different mechanisms for setting HTTP Cache-Control and Expires headers. This section briefly describes how the popular Apache Web server supports cache control.

Controlling HTTP headers in Apache

The Apache Web server provides several mechanisms for setting HTTP cache-control headers. Many of them are disabled by default and must be explicitly enabled.

Caching and advertising

Dilemma for advertisers

Caching can bring those beautiful articles and ads to a user’s monitor in a faster, even better looking way, encouraging them to browse more content and see more ads. This is what content providers want! Attract more eyeballs and more advertising!

On the other hand, if caches work well, the original server might receive no HTTP accesses at all, because they are absorbed by Internet caches. If your revenue is based on access counts, you will not be happy.

Response from the publisher

One solution is to configure caches to revalidate with the original server on every access. That way, every access pushes a hit to the original server, but usually no body data is transferred. Of course, this slows down transactions.

Log migration

The ideal solution is one that does not require hits to be relayed to the server. After all, the cache keeps track of all the hits; it could simply send its hit logs to the server.

However, hit logs are large and difficult to move. Cache logs are not standardized, nor are they organized into separate logs that can be delivered to individual content providers. There are also authentication and privacy issues.

Hit count and usage limits

A much simpler scheme is defined in RFC 2227, “Simple Hit-Metering and Usage-Limiting for HTTP.” This protocol adds a header to HTTP called Meter, which periodically sends hit counts for particular URLs back to the server. This way, the server can get periodic updates from caches on the hit counts of cached documents.

Furthermore, the server can control how many times a document in a cache may be used before it must be reported back to the server, or set a wall-clock timeout on cached documents. This kind of control is called usage limiting; it lets the server regulate how much a cached resource can be used before the cache must report back to the original server.

Gateways, tunnels, and relays

This section describes some of the ways developers use HTTP to access different resources and shows how HTTP can be used as a framework for enabling other protocols and application-to-application communication.

The gateway

The evolution of HTTP extensions and interfaces has been driven by user needs. As the need to publish more complex resources on the Web arose, it became clear that no single application could handle all the conceivable resources.

To get around this problem, developers came up with the notion of a gateway, which can act as a kind of interpreter, abstracting a way to get at a resource. A gateway is the glue between resources and applications.

Client and server gateways

  • The server-side gateway communicates with the client through HTTP and communicates with the server through other protocols (HTTP/*).

  • A client-side gateway talks to the client over some other protocol and communicates with the server over HTTP (*/HTTP).

Protocol gateway

HTTP traffic is directed to gateways in the same ways that traffic is directed to proxies. Most commonly, the browser is explicitly configured to use a gateway, the traffic is transparently intercepted, or the gateway is deployed as a surrogate (reverse proxy).

HTTP/*: server-side Web gateways

As the request flows to the original server, the server-side Web gateway converts the client HTTP request to another protocol.

HTTP/HTTPS: server-side security gateway

An organization can encrypt all incoming Web requests through the gateway to provide additional privacy and security. Clients can browse Web content using plain HTTP, but the gateway automatically encrypts the user’s conversation.

  • Inbound HTTP/HTTPS security gateway

HTTPS/HTTP: client-side security accelerator gateways

Recently, the use of HTTPS/HTTP gateways as security accelerators has been increasing. These HTTPS/HTTP gateways are located in front of the Web server and are often used as invisible intercepting gateways or reverse proxies. They receive secure HTTPS traffic, decrypt secure traffic, and send plain HTTP requests to Web servers.

  • HTTPS/HTTP security accelerator Gateway

These gateways typically contain dedicated decryption hardware to decrypt secure traffic in a much more efficient manner than the original server, reducing the load on the original server. These gateways send unencrypted traffic between the gateway and the original server, so use caution to ensure that the network between the gateway and the original server is secure.

Resources gateway

The most common form of gateway, the application server, combines the destination server and the gateway into a single server. An application server is a server-side gateway that speaks HTTP with the client and connects to an application program on the server side.

Note: resource gateways let clients request many different types of resources.

  • Server gateway application mechanism

The first popular application gateway API was the Common Gateway Interface (CGI). CGI is a standardized set of interfaces that a Web server uses to launch a program in response to an HTTP request for a particular URL, collect the program’s output, and send it back in an HTTP response.

CGI

The Common Gateway Interface (CGI) was the first, and is probably still the most widely used, server extension. It is used throughout the Web for tasks such as dynamic HTML, credit card processing, and database queries.
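The essence of the CGI output format (headers, a blank line, then the body) can be sketched as a small handler; the function and its environment dictionary are illustrative, not part of the CGI specification:

```python
def cgi_style_output(environ):
    """Produce the header block, blank line, and body a CGI script would
    write to stdout; the server relays this back in an HTTP response."""
    body = "<html><body><p>Query: " + environ.get("QUERY_STRING", "") + "</p></body></html>"
    return "Content-Type: text/html\r\n\r\n" + body
```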

The tunnel

We have discussed several different ways that HTTP can be used to access different types of resources (through gateways) or to initiate application-to-application communication. In this section, we’ll look at another use of HTTP, a Web tunnel, which allows HTTP applications to access applications that use non-HTTP protocols.

Web tunneling allows users to send non-HTTP traffic over HTTP connections so that they can piggyback data from other protocols over HTTP. The most common reason to use Web tunneling is to embed non-HTTP traffic in HTTP connections so that it can pass through a firewall that only allows Web traffic.

Note: a tunnel can carry non-HTTP traffic without any gateway translating the protocol; tunnels transmit non-HTTP data streams over HTTP connections.

Establish an HTTP tunnel with CONNECT

Web tunnels are established using the CONNECT method of HTTP. The CONNECT method is not part of the HTTP/1.1 core specification, but is a widely used extension.

The CONNECT method asks the tunnel gateway to create a TCP connection to any destination server and port and, once established, to blindly forward subsequent data between the client and the server.
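The request and the success check can be sketched in a few lines (the host, port, and User-Agent below are placeholders, and the parsing is deliberately minimal):

```python
def build_connect_request(host, port):
    """Build the start line and headers a client sends to a tunnel gateway."""
    return ("CONNECT {0}:{1} HTTP/1.0\r\n"
            "User-Agent: example-client/1.0\r\n"
            "\r\n").format(host, port)

def tunnel_established(status_line):
    """Any 2xx status from the gateway means the TCP connection is open
    and raw bytes can now be blindly forwarded in both directions."""
    version, status = status_line.split(" ", 2)[:2]
    return version.startswith("HTTP/") and status.startswith("2")
```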

  • Establish an SSL tunnel with CONNECT

The example in the figure describes an SSL tunnel where SSL traffic is sent over an HTTP connection, but TCP connections can be established with any server using any protocol through the CONNECT method.

The CONNECT request

The syntax of CONNECT is similar to other HTTP methods, except for the start line. A host name followed by a colon and port number replaces the request URI. Both the host and port must be specified:

CONNECT home.netscape.com:443 HTTP/1.0
User-Agent: Mozilla/4.0

CONNECT responses

After sending the request, the client waits for a response from the gateway. As with ordinary HTTP messages, a response code of 200 indicates success. By convention, the reason phrase in the response is usually set to “Connection Established”:

HTTP/1.0 200 Connection Established 
Proxy-agent: Netscape-Proxy/1.1

Unlike ordinary HTTP responses, this response does not need to contain a Content-Type header. At this point the connection just forwards raw bytes and is no longer a carrier of HTTP messages, so no content type is needed.

SSL tunnel

Web tunnels were originally developed to carry encrypted SSL traffic through firewalls. Many organizations funnel all traffic through packet-filtering routers and proxy servers to improve security. But some protocols, such as encrypted SSL, cannot be forwarded by traditional proxy servers because their information is encrypted. A tunnel carries the SSL traffic over an HTTP connection, letting it pass through the HTTP firewall on port 80.

Comparison between SSL tunnels and HTTP/HTTPS gateways

The HTTP/HTTPS gateway

You can gateway the HTTPS protocol (HTTP over SSL) just as you would any other protocol:

  • The gateway (not the client) initiates the SSL session with the remote HTTPS server;
  • the gateway then performs the HTTPS transaction on the client’s behalf, receiving and decrypting the response;
  • the response is then sent to the client over (insecure) HTTP.

This is how gateways handle FTP, but this approach has several disadvantages:

  • The connection between the client and the gateway is plain, insecure HTTP;
  • the client cannot perform SSL client authentication with the remote server, because the gateway is the authenticated principal instead;
  • the gateway has to support a full SSL implementation.

SSL tunnel

For SSL tunneling, there is no need to implement SSL in the proxy. An SSL session is established between the requesting client and the destination (secure) Web server, with the proxy server in the middle simply tunneling encrypted data and playing no other role in the secure transaction.

Tunnel authentication

Where appropriate, other features of HTTP can be used with tunnels. In particular, a proxy’s authentication support can be used with tunnels to validate a client’s right to use a tunnel.

Tunnel security considerations

In general, a tunnel gateway cannot verify that the protocol currently being carried is really the protocol it was intended to tunnel.

The relay

An HTTP relay is a simple HTTP proxy that does not fully conform to the HTTP specification. A relay handles the connection-establishing portion of HTTP and then blindly forwards bytes.

A more common (and more notorious) problem with some simple blind-relay implementations is their potential to hang keep-alive connections, because they do not properly process the Connection header.
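What a compliant proxy must do, and a blind relay does not, can be sketched as hop-by-hop header stripping (the header list follows HTTP/1.1; the function itself is illustrative):

```python
def strip_hop_by_hop(headers):
    """Remove Connection and the headers it names before forwarding.

    A blind relay that skips this step forwards 'Connection: Keep-Alive'
    untouched, which is how keep-alive connections end up hung.
    """
    named = {t.strip().lower()
             for t in headers.get("Connection", "").split(",") if t.strip()}
    hop_by_hop = named | {"connection", "keep-alive", "proxy-connection",
                          "te", "trailer", "transfer-encoding", "upgrade"}
    return {k: v for k, v in headers.items() if k.lower() not in hop_by_hop}
```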