Notes on the Book Illustrated HTTP

Chapter 1

Background to the birth of HTTP

Dr Tim Berners-Lee of CERN, the European Organisation for Nuclear Research, came up with an idea that would let researchers in far-flung places share knowledge.

The basic concept of that original idea was the WWW (World Wide Web): multiple documents are linked to one another to form hypertext, through which each document can be reached from the others.

Three technologies were proposed for building the WWW: HTML (HyperText Markup Language), which is based on SGML (Standard Generalized Markup Language), as the text markup language for pages; HTTP as the document transfer protocol; and the URL (Uniform Resource Locator) to specify the address of a document. WWW was originally the name of the client application used to browse hypertext, that is, the Web browser of the time. It now refers to this whole collection of technologies and is also called the Web for short.

TCP/IP protocol family

For computers and network devices to communicate with each other, both sides must follow the same method. For example, how to find the communication target, which side starts the communication, which language is used, and how the communication ends all need to be agreed on in advance. Communication between different hardware and different operating systems requires such rules. A rule of this kind is called a protocol.

Protocols cover all sorts of things: cable specifications, how IP addresses are chosen, how the remote party is found, the order in which the two sides establish communication, the steps for displaying a Web page, and so on. The collection of protocols associated with the Internet is collectively called TCP/IP.

Layering is an important feature of the TCP/IP protocol family, which is divided into four layers: the application layer, transport layer, network layer, and data link layer. HTTP belongs to the application layer.

DNS service

The Domain Name System (DNS) is, like HTTP, an application-layer protocol. It resolves domain names to IP addresses.

Computers can be given host names and domain names as well as IP addresses, for example www.baidu.com.

Users usually reach other computers by host name or domain name rather than directly by IP address, because a name made of letters and digits is easier to remember than a purely numeric IP address.

For a computer, however, names are relatively hard to handle; computers are better at processing long strings of numbers.

DNS was created to solve this problem. It looks up the IP address for a domain name or, in reverse, the domain name for an IP address.
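The forward lookup described above can be sketched with Python's standard library. This is only an illustration: the function name resolve is an assumption made for this example, and an IP literal is passed so that the sketch runs without contacting a DNS server.

```python
import socket

def resolve(name):
    """Return the sorted, de-duplicated addresses the system resolver gives for name."""
    infos = socket.getaddrinfo(name, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})

# An IP literal resolves to itself, so this call needs no network access.
print(resolve("127.0.0.1"))
```

Passing a host name instead of an IP literal would make the resolver query DNS. For the reverse direction the standard library offers socket.gethostbyaddr.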

The URI and URL

Most of us are more familiar with the URL (Uniform Resource Locator) than with the URI (Uniform Resource Identifier). A URL is the Web page address you enter in a Web browser to access a page; www.baidu.com is an example.

URI stands for Uniform Resource Identifier. RFC 2396 defines the three words that make up the term as follows.

Uniform

A uniform format makes it easy to handle many different types of resources without having to work out from context how each resource is accessed. It also makes it easier to add new protocol schemes (such as http: or ftp:).

Resource

A resource is defined as “anything identifiable”. Not only documents, images, and services (such as today's weather forecast) but anything that can be distinguished from other things can be a resource. A resource can also be a collection of items rather than a single item.

Identifier

Denotes an object that can be identified. Also called an identifier.

In summary, a URI is an identifier that locates a resource and is expressed using some protocol scheme. The protocol scheme is the name of the type of protocol used to access the resource.

A URI identifies an Internet resource with a string, while a URL gives the location of the resource (where it is on the Internet). URLs are therefore a subset of URIs.
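To make the parts of a URL concrete, here is a small sketch using urlsplit from Python's standard library. The URL itself is a made-up example, not one taken from the book.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only to show how the parts are separated.
parts = urlsplit("http://www.example.com:80/dir/index.htm?uid=1#ch1")

print(parts.scheme)    # the protocol scheme
print(parts.hostname)  # the server to contact
print(parts.port)      # the port number, if one is given
print(parts.path)      # the resource path on that server
print(parts.query)     # the query string
print(parts.fragment)  # the fragment identifier
```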

Chapter 2

The HTTP protocol is used for communication between the client and server

HTTP is a protocol that does not save state

HTTP methods that tell the server the intent

GET: Obtains resources

The GET method requests the resource identified by the request URI. The server parses the specified resource and returns the response content: if the resource is text, it is returned as-is; if it is a program such as a CGI (Common Gateway Interface) program, its output after execution is returned.

POST: transmits the entity body

The POST method is used to transmit an entity body.

An entity body can also be sent with GET, but POST is normally used for this instead. Although POST works much like GET, its main purpose is not to obtain the response body.

PUT: transfers files

The PUT method is used to transfer files. As with an FTP upload, the file contents are placed in the body of the request message and then saved at the location specified by the request URI.

HEAD: obtains the message headers

The HEAD method is the same as the GET method except that the message body is not returned. It is used to check whether a URI is valid and when the resource was last updated.

DELETE: deletes a file

The DELETE method is used to delete a file and is the opposite of PUT. It deletes the resource specified by the request URI.

OPTIONS: Asks for supported methods

The OPTIONS method is used to query the supported methods for the resource specified by the request URI.

TRACE: traces the path

The TRACE method makes the Web server loop the request it received back to the client.

When the request is sent, the Max-Forwards header field is set to a number. Each server the request passes through decrements the value by one; when the value reaches zero, forwarding stops and the server that last received the request sends back the response.
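The Max-Forwards countdown can be illustrated with a short simulation. This is a sketch under assumptions: the function name trace and the server names are invented for the example, and no real network traffic is involved.

```python
def trace(max_forwards, servers):
    """Return the name of the server that answers a TRACE request.

    servers is the ordered chain the request would pass through:
    intermediaries first, the origin server last.
    """
    for name in servers[:-1]:
        if max_forwards == 0:
            return name        # the value is zero, so this hop replies instead of forwarding
        max_forwards -= 1      # decrement before forwarding to the next hop
    return servers[-1]         # the request reached the origin server

chain = ["proxy1", "proxy2", "origin"]   # hypothetical chain
print(trace(1, chain))
```

With Max-Forwards set to 1, the first proxy forwards the request and the second one answers it, which is how TRACE can reveal what each hop along the path received.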

Chapter 3

The HTTP message

The information exchanged over HTTP is called an HTTP message. Messages sent by the requesting side (the client) are request messages, and those sent by the responding side (the server) are response messages. An HTTP message is itself text made up of multiple lines, with CR+LF as the line break.

An HTTP message is broadly divided into a header section and a message body, separated by the first blank line (CR+LF). A message body is not always present.
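The split between header section and body can be sketched as follows. The function name parse_message and the sample request are assumptions for illustration; a real parser has to handle many more cases (repeated header fields, chunked bodies, and so on).

```python
def parse_message(raw):
    """Split a raw HTTP message into (start_line, headers, body)."""
    # The first blank line (CR+LF CR+LF) ends the header section.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body

# Hypothetical request with no message body.
request = (
    b"GET /index.htm HTTP/1.1\r\n"
    b"Host: www.example.com\r\n"
    b"\r\n"
)
start_line, headers, body = parse_message(request)
print(start_line)
print(headers)
print(body)
```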

Structure of request message and response message

Chapter 4

The HTTP status code

The first digit of a status code specifies the response class; the last two digits have no classification role.
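A minimal sketch of that rule: the class is found by looking only at the first digit. The names CLASSES and status_class are illustrative assumptions.

```python
# Response classes keyed by the first digit of the status code.
CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_class(code):
    """Return the response class of a three-digit status code."""
    return CLASSES.get(code // 100, "Unknown")

print(status_class(200))
print(status_class(404))
```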

Chapter 5

Hosting multiple domain names on one server with virtual hosts

The HTTP/1.1 specification allows a single HTTP server to host multiple Web sites. For example, a Web hosting provider can serve several customers from one server and run a separate Web site under each customer's own domain name. This is made possible by the virtual host (also called virtual server) feature.

Even though there is physically only one server, the virtual host feature makes it appear as if there were several.

Because virtual hosts on the same IP address can serve multiple Web sites with different host names and domain names, an HTTP request must specify the complete host name or domain name in the Host header field.
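The role of the Host header can be sketched as a simple lookup. Everything named here (the SITES table, the directory paths, the function document_root) is hypothetical and only illustrates how one server might choose between sites.

```python
# Hypothetical mapping from host name to the directory that holds each site.
SITES = {
    "www.example.com": "/srv/sites/example-com",
    "www.example.org": "/srv/sites/example-org",
}

def document_root(headers):
    """Pick the site for a request based on its Host header field."""
    host = headers.get("host", "")
    hostname = host.split(":", 1)[0].lower()   # drop an optional ":port" suffix
    return SITES.get(hostname)                 # None if the host is not served here

print(document_root({"host": "www.example.org"}))
```

Both sites share one IP address, so without the Host header the server would have no way to tell which one the client wants.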

Proxy, gateway, tunnel

Proxy

A proxy is a forwarding application that acts as a “middleman” between the server and the client, receiving requests sent by the client and forwarding them to the server, and receiving responses returned by the server and forwarding them to the client.

Gateway

A gateway is a server that relays traffic to other servers. When it receives a request from a client, it handles the request as if it were the origin server that owns the resource. Sometimes the client does not even realize that it is talking to a gateway.

Tunnel

A tunnel is an application that relays communication between a client and a server that are far apart.

Chapter 6

General header fields

Request header fields

Response header fields

Entity header fields

Chapter 7

The disadvantages of HTTP

So far we have seen the strengths and conveniences of HTTP. It also has weaknesses, however. The main ones are listed below.

Communication is in plaintext (not encrypted), so the content can be eavesdropped on
The identity of the other party is not verified, so impersonation is possible
The integrity of a message cannot be proved, so it may have been tampered with

HTTP traffic can be encrypted by using HTTP together with SSL (Secure Socket Layer) or TLS (Transport Layer Security).

Once SSL has established a secure communication line, HTTP communication can be carried out over that line. HTTP used in combination with SSL is called HTTPS (HTTP Secure) or HTTP over SSL.

HTTPS is HTTP in an SSL shell

HTTPS is not a new application-layer protocol. Only the communication interface part of HTTP is replaced by SSL (Secure Socket Layer) or TLS (Transport Layer Security).

Normally, HTTP communicates directly with TCP. When SSL is used, HTTP communicates with SSL first, and SSL then communicates with TCP. In short, HTTPS is HTTP wearing the shell of the SSL protocol.
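The same layering is visible in Python's standard library, where an SSL/TLS context is handed to the HTTP client. The sketch below only constructs the objects and opens no connection; the host name is a placeholder.

```python
import http.client
import ssl

# TLS settings: by default the server certificate and host name are verified.
context = ssl.create_default_context()

# HTTP is spoken on top of the TLS layer configured by the context.
# Constructing the object does not connect; www.example.com is a placeholder.
conn = http.client.HTTPSConnection("www.example.com", context=context)

print(conn.port)                                  # default HTTPS port
print(context.verify_mode == ssl.CERT_REQUIRED)   # certificate must validate
print(context.check_hostname)                     # host name must match
```

The HTTP code is unchanged compared with a plain connection; only the layer underneath is swapped.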

Public-key encryption for exchanging keys

Before we get into SSL, let’s look at encryption methods. SSL uses a type of encryption called public-key cryptography.

In modern encryption methods, the encryption algorithm is public while the key is kept secret. This is how the encryption stays secure.

Both encryption and decryption use a key. Without the key, the ciphertext cannot be decrypted; conversely, anyone who has the key can decrypt it. If an attacker obtains the key, the encryption is meaningless.

The dilemma of shared key encryption

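Encryption in which the same key is used to encrypt and to decrypt is called shared-key encryption (common key cryptosystem), also known as symmetric-key encryption. The dilemma is that the key itself has to be given to the other party, and if it is intercepted on the way the encryption no longer protects anything.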
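The defining property, one key for both directions, can be shown with a toy cipher. XOR with a repeating key is used here purely for illustration and is not secure; it is not what SSL actually uses, and the key value is made up.

```python
from itertools import cycle

def xor_cipher(data, key):
    """Toy cipher: XOR each byte with the repeating key. NOT secure."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

key = b"shared-secret"                     # hypothetical key known to both sides
ciphertext = xor_cipher(b"hello", key)     # sender encrypts
plaintext = xor_cipher(ciphertext, key)    # receiver decrypts with the same key

print(ciphertext != b"hello")
print(plaintext)
```

Anyone who learns the key can run the same function, which is exactly why delivering the key safely is the hard part.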

Public-key encryption using two keys

Public key encryption solves the difficulty of shared key encryption.

Public-key encryption uses a pair of asymmetric keys: a private key and a public key. As the names suggest, the private key must not be known to anyone else, whereas the public key can be published freely and obtained by anyone.
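The idea of a key pair can be made concrete with textbook RSA on tiny numbers. This is a classroom-style sketch only: the primes are far too small to be secure, there is no padding, and the book does not say that this particular scheme or these values are used.

```python
# Textbook RSA with tiny primes. Illustration only; never use values like this.
p, q = 61, 53
n = p * q                     # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                        # public exponent: (e, n) is the public key
d = pow(e, -1, phi)           # private exponent: (d, n) is the private key (Python 3.8+)

def encrypt(message):
    """Anyone can encrypt with the public key."""
    return pow(message, e, n)

def decrypt(cipher):
    """Only the holder of the private key can decrypt."""
    return pow(cipher, d, n)

cipher = encrypt(65)
print(decrypt(cipher))
```

The sender needs only (e, n), so no secret has to travel over the network.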

Chapter 8

A computer cannot by itself tell who is sitting in front of the monitor, and it has even less chance of confirming who is at the other end of a network connection. To find out who is accessing the server, the client on the other side has to be asked to identify itself.

The following describes the authentication modes used in HTTP/1.1.

BASIC authentication
DIGEST authentication
SSL client authentication
Form-based authentication
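As an illustration of the first mode in the list, BASIC authentication sends the user ID and password joined by a colon and Base64-encoded in the Authorization header field. The sketch below builds that value; the function name and the guest credentials are assumptions for the example.

```python
import base64

def basic_auth_header(user, password):
    """Build the value of the Authorization header for BASIC authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

print(basic_auth_header("guest", "guest"))
```

Base64 is an encoding, not encryption, so anyone who captures the header over plain HTTP can decode the credentials.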

Chapter 9

Protocols based on HTTP

When the HTTP specification was written, its authors intended HTTP mainly as a protocol for transferring HTML documents. As times changed, the uses of the Web became more diverse: online shopping sites, SNS (Social Networking Service), management tools inside companies and organizations, and so on.

The functionality these sites need can be implemented with Web applications and scripts. Even when the functionality meets the requirements, performance may not be optimal because of the constraints of the HTTP protocol and its own limitations.

HTTP's functional shortcomings could be overcome by creating an entirely new protocol. However, browsers built on HTTP are in use all over the world, so HTTP cannot simply be abandoned. Some new protocols therefore build their rules on top of HTTP and add new functionality to it.

WebSocket

Communicating with Ajax and Comet can speed up Web browsing. However, as long as the communication uses HTTP, the bottleneck cannot be removed completely. WebSocket is a new protocol and API designed to solve these problems.

WebSocket was originally planned as part of the HTML5 standard but has since become an independent protocol standard. The WebSocket protocol was standardized as RFC 6455 (The WebSocket Protocol) on December 11, 2011.

Main features of WebSocket protocol:

Push function
Reduced traffic

Chapter 10

HTML

HTML (HyperText Markup Language) is a markup language developed for sending hypertext over the Web. Hypertext is a document system in which information at any point in a document can be associated with other information (text, images, and so on); text linked in this way is called a hyperlink. A markup language decorates a document by marking parts of it with special strings. The special strings that appear in HTML documents are called HTML tags.

Most of the Web pages we browse are written in HTML. The document formed by HTML is parsed and rendered by the browser, and the result presented is a Web page.

<html>

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>baidu.com</title>
  <style type="text/css">
    .logo {
      padding: 20px;
      text-align: center;
    }
  </style>
</head>

<body>

  <div class="logo">

    <p><img src="photo.jpg"
        alt="photo"
        width="240"
        height="127" /></p>

    <p><img src="baidu.gif"
        alt="baidu.jp"
        width="240"
        height="84" /></p>

    <p><a href="https://www.baidu.com">baidu</a></p>
  </div>
</body>

</html>

CGI: linking the Web server and programs

CGI (Common Gateway Interface) is the mechanism by which a Web server, after receiving a request from a client, passes the request on to a program. With CGI, the program can respond to the request with content, for example by generating HTML and other dynamic content.

Programs that use CGI are called CGI programs and are usually written in programming languages such as Perl, PHP, Ruby, and C.

Chapter 11

Attack techniques against the Web

At present, most attacks from the Internet are directed at Web sites, and most of them target Web applications.

There are two attack modes against Web applications.

Active attacks
Passive attacks

Security vulnerability caused by incomplete escape of output values

Security measures for Web applications can be roughly divided into the following two parts.

Client-side validation
Web application side (server side) validation, which consists of:

Input value validation
Output value escaping

Cross-site scripting attacks

Cross-Site Scripting (XSS) is an attack that runs illegitimate HTML tags or JavaScript in the browser of a registered user of a Web site that has a security vulnerability. Security holes can hide wherever HTML is generated dynamically. The attacker writes a script and sets a trap; the script then runs in the user's own browser, so a careless user falls victim to a passive attack.
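Escaping output values when HTML is generated dynamically is the usual countermeasure. A minimal sketch with html.escape from Python's standard library follows; the function render_comment and the sample input are assumptions for illustration.

```python
import html

def render_comment(user_input):
    """Embed user-supplied text in HTML with special characters escaped."""
    return "<p>" + html.escape(user_input) + "</p>"

# The script tag is rendered as visible text instead of being executed.
print(render_comment("<script>alert(1)</script>"))
```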

SQL injection attack

SQL injection is an attack that runs illegitimate SQL against the database used by a Web application. The threat can be severe and sometimes leads directly to leaks of personal or confidential information. Web applications usually use a database, and SQL statements are used to connect to it and to retrieve, add, or delete data in its tables. If SQL statements are built or invoked carelessly, a maliciously injected SQL statement may be executed.
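The difference between a careless and a careful call can be sketched with Python's built-in sqlite3 module and an in-memory database. The table, its rows, and the function names are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, published INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [("Public Book", "alice", 1), ("Unpublished Draft", "alice", 0)],
)

def find_unsafe(author):
    # Vulnerable: the input is pasted into the SQL text itself.
    sql = "SELECT title FROM books WHERE author = '" + author + "' AND published = 1"
    return [row[0] for row in conn.execute(sql)]

def find_safe(author):
    # Safer: the input is bound as a parameter and never parsed as SQL.
    sql = "SELECT title FROM books WHERE author = ? AND published = 1"
    return [row[0] for row in conn.execute(sql, (author,))]

payload = "alice' --"   # closes the string, then comments out the rest of the statement
print(find_unsafe(payload))
print(find_safe(payload))
```

With the payload, the unsafe version drops the published condition and returns the unpublished row as well, while the parameterized version simply finds no author with that literal name.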

OS command injection attack

OS command injection is an attack in which illegitimate operating system commands are executed through a Web application. The risk exists wherever a shell function can be called.

A Web application can invoke operating system commands through a shell. If the shell is invoked carelessly, inserted illegitimate OS commands may be executed.

An OS command injection attack sends commands to the shell, which starts programs from the command line of an operating system such as Windows or Linux. In other words, the attack can run any of the programs installed on the OS.
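One common way to avoid the shell altogether is to pass arguments as a list, so that no shell ever interprets the input. The sketch below uses subprocess from Python's standard library and starts a child Python process as a stand-in for an external command; the function name echo_argument is an assumption.

```python
import subprocess
import sys

def echo_argument(value):
    """Pass value to a child process as one argument, without going through a shell."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

# The semicolon is treated as ordinary text, not as a command separator.
print(echo_argument("report.txt; echo injected"))
```

Had the same string been concatenated into a command line and run through a shell, the part after the semicolon would have been executed as a second command.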

HTTP header injection attack

HTTP header injection is an attack in which the attacker inserts line breaks into a response header field in order to add arbitrary response headers or a body. It is a passive attack.

An attack that adds content to the response body in this way is called an HTTP response splitting attack.
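Since the attack depends on line breaks, a simple guard is to refuse any header value that contains CR or LF before it is written into a response. The function name and the sample values below are assumptions for illustration.

```python
def safe_header_value(value):
    """Return value unchanged, or raise if it could start a new header line."""
    if "\r" in value or "\n" in value:
        raise ValueError("line break in header value")
    return value

print(safe_header_value("/index.htm"))

try:
    # An injected line break followed by an extra header field.
    safe_header_value("/index.htm\r\nSet-Cookie: SID=123")
except ValueError as error:
    print("rejected:", error)
```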

Mail header injection attack

Mail header injection targets the mail-sending function of a Web application: the attacker adds arbitrary illegitimate content to mail header fields such as To or Subject. By exploiting a vulnerable Web site, the attacker can send advertising mail or virus-laden mail to any address.

Directory traversal attack

A directory traversal attack gains access to a file directory that was not meant to be public by illegitimately truncating its directory path. It is sometimes called a path traversal attack.

When a Web application handles files and fails to check file names supplied from outside, a user can use a relative path such as ../ to reach an absolute path such as /etc/passwd. Any file or directory on the server can then be accessed, which makes it possible to read, tamper with, or delete files on the Web server illegitimately.

Although this is also an output-escaping problem, the more important measure is to stop allowing access to arbitrarily specified file names.
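One way to restrict externally supplied file names is to resolve the requested path and check that it still lies inside the public directory. This is a sketch under assumptions: BASE_DIR is a hypothetical directory and resolve_public_path is an invented helper.

```python
import os

BASE_DIR = "/var/www/files"   # hypothetical public directory

def resolve_public_path(filename):
    """Return the absolute path for filename, or raise if it leaves BASE_DIR."""
    base = os.path.realpath(BASE_DIR)
    target = os.path.realpath(os.path.join(base, filename))
    # After resolution, the target must still be inside the base directory.
    if os.path.commonpath([base, target]) != base:
        raise ValueError("path escapes the public directory")
    return target

print(resolve_public_path("report.txt"))

try:
    resolve_public_path("../../etc/passwd")
except ValueError as error:
    print("rejected:", error)
```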

Remote file inclusion vulnerability

Remote file inclusion is an attack that applies when part of a script is read in from another file: the attacker specifies the URL of an external server as the file to include, so that the script reads it in and arbitrary script code runs.

This is mainly a PHP security vulnerability. PHP's include and require can, depending on configuration, accept the URL of an external server as a file name. Because this feature is so dangerous, it has been disabled by default since PHP 5.2.0.

Although this is also an output-escaping problem, the more important measure is to control how arbitrary file names can be specified.

A security breach caused by a flaw in setup or design

Forced browsing

Forced browsing is a security vulnerability in which files that were never intentionally made public are browsed from among the files placed in a public directory on the Web server.

Incorrect error message handling

An error-handling vulnerability exists when a Web application's error messages contain information that is useful to an attacker. The main error messages related to Web applications are:

Error messages thrown by the Web application
Error messages thrown by systems such as the database

A Web application does not need to show detailed error messages in the user's browser. To an attacker, detailed error messages can be a hint for the next attack.

Open redirection

Open redirect is a feature that redirects to any URL given as a parameter. The associated security vulnerability is that if the specified redirect URL points to a malicious Web site, the user is led to that site.
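A common mitigation is to accept only redirect targets that are known to be safe. The sketch below allows same-site relative paths and an explicit list of hosts; ALLOWED_HOSTS and is_safe_redirect are hypothetical names, and a real application may need stricter rules.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"www.example.com"}   # hypothetical list of trusted hosts

def is_safe_redirect(url):
    """Accept same-site relative paths and absolute URLs on allowed hosts only."""
    if "\\" in url:
        return False                  # some browsers treat a backslash like a slash
    parts = urlsplit(url)
    if not parts.scheme and not parts.netloc:
        # Relative reference: must be a path on this site, not "//other-host/...".
        return url.startswith("/") and not url.startswith("//")
    return parts.scheme in ("http", "https") and parts.hostname in ALLOWED_HOSTS

print(is_safe_redirect("/account"))
print(is_safe_redirect("http://hackr.example/"))
```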

A security vulnerability caused by session management negligence

Session hijacking

Session hijacking means that an attacker obtains a user's session ID by some means and uses it illegitimately to impersonate that user.

Session fixation attack

Whereas session hijacking is an active attack that steals the target's session ID, a session fixation attack is a passive attack that forces the user to use a session ID specified by the attacker.

Cross-site request forgery

Cross-Site Request Forgery (CSRF) is a passive attack in which the attacker sets a trap that forces an authenticated user to make unintended updates to personal information, settings, or other state.