Articles in this series

  1. Sockets: these articles are all you need (1)
  2. Sockets: these articles are all you need (2): HTTP
  3. So this is what WebSocket is

In the last article, we introduced the HTTP protocol. HTTP is a stateless and connectionless protocol.

Before HTTP 1.1, the TCP/IP connection between client and server was closed as soon as a request had completed. Creating sockets on the server is expensive, and a client is likely to make many requests, each of which needed a three-way handshake to open the connection and a four-way handshake to close it. It is natural to ask how these steps can be reduced.

HTTP 1.1 therefore standardized persistent connections, together with header fields such as Connection: keep-alive, so that the socket connection from the client to the server can be kept open for a certain period of time instead of being destroyed. Each request from the client then no longer has to re-establish a socket connection and can send its data directly over the existing one.

Disadvantages of the HTTP protocol

Even though the HTTP protocol has evolved to reuse connections, it still leaves a lot to be desired:

1. The protocol overhead of an HTTP request (content unrelated to the payload) is expensive

As discussed in the last article, the part of an HTTP message that we actually work with is usually the body, plus a few header fields.

Here is a brief illustration in terms of bytes:

Suppose we need to poll a GET endpoint periodically for information (we only analyze the request being sent). The text of the request looks like this:

GET / HTTP/1.1\r\n
Host: www.example.com\r\n
\r\n

Some might say, well, that’s not a lot of data.

Note that overhead is relative, not absolute. A simple request like this takes 41 bytes in total, counting each \r\n as two bytes. Every time we make the request we have to send those 41 bytes, of which 14 bytes are pure formatting (HTTP/1.1 and the three \r\n pairs). This data has to be sent again with every request, so we can say that HTTP requests are relatively heavy.
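
The byte count can be checked directly. This is only a small sketch: it assumes the request is exactly the three lines shown above, encoded as UTF-8, and the variable names are made up for illustration.

```swift
import Foundation

// The example request, assuming single spaces in the request line.
let request = "GET / HTTP/1.1\r\nHost: www.example.com\r\n\r\n"

// Total number of bytes that go over the wire for this request.
let totalBytes = request.utf8.count

// Bytes spent purely on formatting: the version token plus three CRLF pairs.
let formattingBytes = "HTTP/1.1".utf8.count + 3 * "\r\n".utf8.count

print("total: \(totalBytes), formatting: \(formattingBytes)")
```

A real browser request carries many more header fields (User-Agent, Accept, cookies and so on), so the overhead in practice is usually far larger than in this minimal case.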

2. HTTP requests can be sent only in one direction

HTTP uses a request-response model: the client sends a request and the server responds. The drawback is that the server can only return data passively and cannot actively push it.

We can work around this by polling, but that brings us back to problem 1: HTTP requests are expensive and server resources are limited.

So this leads to websockets:

Websocket

Websocket is a protocol for full duplex communication over TCP connections

Full duplex means that both ends of the communication have the ability to actively send data

WebSocket makes it easier to exchange data between the client and the server, allowing the server to actively push data to the client. In the WebSocket API, the browser and server only need to complete an additional handshake to create a persistent connection and two-way data transfer between the two.

The protocol

Establishing the connection

The connection discussed here is one that has already been established by the TCP/IP three-way handshake.

WebSocket requires an additional HTTP handshake after the connection is established, to ensure that both parties support the protocol (and to guard against connecting to the wrong kind of server).

  1. The client initiates a protocol upgrade request

The client needs to send an HTTP header (containing Websocket information and other headers such as cookies). The client header structure is as follows:

GET /path HTTP/1.1\r\n
Host: www.example.com\r\n
Connection: Upgrade\r\n
Upgrade: websocket\r\n
Sec-WebSocket-Version: 13\r\n
Sec-WebSocket-Key: mgZ6+kXU1+mEgOXWDPPsBg==\r\n
\r\n

The request above contains the header fields that WebSocket requires:

  • The Connection field must be Upgrade, indicating that the client wants to upgrade the connection
  • The Upgrade field must be websocket, indicating that the client wants to upgrade from HTTP to WebSocket
  • The Sec-WebSocket-Version field is 13, the version number of the protocol in current use
  • The Sec-WebSocket-Key field is mandatory. Its value is 16 bytes of random data encoded as a Base64 string. The server uses it to build the response header that the client validates (this is how the client determines whether the server supports WebSocket)

A WebSocket handshake request uses the same header format as a standard HTTP request, but the protocol stipulates that it must be a GET request. Other header fields can be added as agreed between server and client.

Sec-WebSocket-Key lets the client determine whether the server supports WebSocket. The client may, for some reason, connect to a plain HTTP server by mistake. Such a server does not support WebSocket but can still answer the GET request. The client uses the fields in the server's response to decide whether the connection should continue or be closed.
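
To make the request concrete, here is a minimal sketch of how a client might build it in Swift. The function names, the /chat path and the use of UInt8.random are assumptions made for illustration, not the way any particular library does it.

```swift
import Foundation

/// Returns a Sec-WebSocket-Key value: 16 random bytes, Base64-encoded.
func makeWebSocketKey() -> String {
    let randomBytes = (0..<16).map { _ in UInt8.random(in: UInt8.min...UInt8.max) }
    return Data(randomBytes).base64EncodedString()
}

/// Builds the text of the upgrade request. `host` and `path` are placeholders.
func makeUpgradeRequest(host: String, path: String, key: String) -> String {
    let lines = [
        "GET \(path) HTTP/1.1",
        "Host: \(host)",
        "Connection: Upgrade",
        "Upgrade: websocket",
        "Sec-WebSocket-Version: 13",
        "Sec-WebSocket-Key: \(key)",
        "",   // the two empty entries produce the final \r\n\r\n
        ""
    ]
    return lines.joined(separator: "\r\n")
}

let key = makeWebSocketKey()
let upgradeRequest = makeUpgradeRequest(host: "www.example.com", path: "/chat", key: key)
print(upgradeRequest)
```

Because 16 bytes always encode to 24 Base64 characters, the key has a fixed length no matter which random bytes are drawn.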

  2. The server responds to the request

When the server receives the request header from the client, it needs to respond, and the response is also a standard HTTP response header:

HTTP/1.1 101 Switching Protocols\r\n
Connection: Upgrade\r\n
Upgrade: websocket\r\n
Sec-WebSocket-Accept: qIs5tRK57T9vTjEtFfTLOSe3K3w=\r\n
\r\n

The server must first return status code 101, which indicates that it has switched protocols and that subsequent data will no longer be parsed as HTTP.

The server also returns the Connection and Upgrade fields, processes the Sec-WebSocket-Key sent by the client, and returns the result in the Sec-WebSocket-Accept field for the client to verify.

  • How Sec-WebSocket-Key is processed:

The server appends the fixed string 258EAFA5-E914-47DA-95CA-C5AB0DC85B11 to the Sec-WebSocket-Key value and computes the SHA-1 hash of the result. The hash is then Base64-encoded and placed in the Sec-WebSocket-Accept field.

When the client receives Sec-WebSocket-Accept, it applies the same process to the Sec-WebSocket-Key it sent and compares the result with the value returned by the server. If they match, the client considers that the server supports WebSocket. If they do not match, the client should close the connection, as the protocol requires.
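
The calculation can be sketched in a few lines. This version assumes an Apple platform where CryptoKit is available (iOS 13 or macOS 10.15 and later), and the function names are hypothetical. The sample key is the one used in RFC 6455, not the one from the request above.

```swift
import Foundation
import CryptoKit  // assumption: an Apple platform with CryptoKit

/// The fixed GUID that the protocol appends to the client key.
let webSocketGUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

/// Computes the Sec-WebSocket-Accept value for a given Sec-WebSocket-Key.
func acceptValue(forKey key: String) -> String {
    let digest = Insecure.SHA1.hash(data: Data((key + webSocketGUID).utf8))
    return Data(digest).base64EncodedString()
}

/// Client-side check: does the server's answer match what we expect?
func serverAcceptIsValid(sentKey: String, receivedAccept: String) -> Bool {
    return acceptValue(forKey: sentKey) == receivedAccept
}

// Sample key from RFC 6455; the RFC lists s3pPLMBiTxaQ9kYGzzhZRbK+xOo= as the result.
print(acceptValue(forKey: "dGhlIHNhbXBsZSBub25jZQ=="))
```

SHA-1 is used here only because the protocol prescribes it for this handshake check. It is not meant to provide security, which is why CryptoKit files it under the Insecure namespace.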


As you can see, establishing a Websocket connection is essentially equivalent to the client making a plain HTTP request with an empty Body to the server, and the server responding in kind

WebSocket does this to stay compatible with standard HTTP, so that a server application can act as both an HTTP server and a WebSocket server without having to listen on multiple ports.

Websocket requests can also support HTTP headers such as cookies.

It’s not clear how Websocket addresses HTTP’s shortcomings here, because this is just Websocket’s extra handshake, not real data delivery.

Sending data

This brings us to the most important part of WebSocket.

First we need to define two terms, byte and bit:

  • Byte: the standard unit of computer storage and transmission. Interpreted as a non-negative integer, its largest value is 2^8 - 1 = 255. Written in binary, a byte looks like 0000 0000: eight digits that are each 0 or 1, and each digit is one bit
  • Bit: in binary, each 0 or 1 is a bit, which is the smallest unit of data storage. 1 byte = 8 bits

Next, we will look at how WebSocket structures the data it transmits. By convention, each complete packet is called a frame.

Frame data structure (fields in transmission order):

  • FIN: 1 bit
  • RSV1, RSV2, RSV3: 1 bit each
  • Opcode: 4 bits
  • MASK: 1 bit
  • Payload Len: 7 bits, optionally extended by a further 16 or 64 bits
  • Masking-Key: 32 bits, present only when MASK is 1
  • Payload Data: the remaining bytes

The layout is expressed in bits, but when we actually process data the smallest unit of memory we operate on is the byte, that is, 8 bits. In Swift, we can represent a byte as UInt8, an unsigned 8-bit integer.

We can see that the protocol-related part of a WebSocket frame takes only 2 to 10 bytes, or at most 14 bytes when a mask is included. Compared with HTTP, the overhead of WebSocket is small.

Let us go through the fields in order:

  • FIN:

This is the first bit of the frame and indicates whether the frame is the last one in a series of consecutive frames.

0: more frames of this message follow

1: this is the last frame of the message

  • RSV1-RSV3:

These bits are reserved for sub-protocols, extensions and similar purposes. The protocol requires all three to be 0 unless an extension that defines them has been negotiated. If any of the three bits is 1 and the receiver does not know how to interpret it, the receiver should close the connection.

Closed: here this does not mean that the TCP/IP connection is closed, but that the connection is closed as defined at the WebSocket layer. The same applies to every close mentioned below, which we will explain next.
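
As a sketch of how these flag bits can be read in practice, the first byte of a frame can be split with bit masks. The FirstByte type below is hypothetical and exists only for illustration. The low four bits are the opcode, which is described next.

```swift
/// The flag bits and opcode carried by the first byte of a frame.
struct FirstByte {
    let fin: Bool
    let rsv1: Bool
    let rsv2: Bool
    let rsv3: Bool
    let opcode: UInt8   // low 4 bits, 0x0...0xF

    init(_ byte: UInt8) {
        fin    = (byte & 0b1000_0000) != 0
        rsv1   = (byte & 0b0100_0000) != 0
        rsv2   = (byte & 0b0010_0000) != 0
        rsv3   = (byte & 0b0001_0000) != 0
        opcode = byte & 0b0000_1111
    }

    /// True when any reserved bit is set. Without a negotiated extension
    /// the receiver should close the connection in that case.
    var hasReservedBits: Bool { rsv1 || rsv2 || rsv3 }
}

let first = FirstByte(0x81)   // binary 1000 0001
print(first.fin, first.hasReservedBits, first.opcode)
```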

  • Opcode:

The opcode takes up four bits, so there are 2^4 = 16 possible opcodes.

They are listed here in hexadecimal:

  • 0: the current frame is a continuation frame
  • 1: the current frame is a text frame (the payload is UTF-8 encoded text)
  • 2: the current frame is a binary frame (Data in Swift)
  • 3-7: reserved for future non-control frames
  • 8: the current frame is a close frame
  • 9: the current frame is a Ping frame, used for heartbeat detection
  • A: the current frame is a Pong frame, the reply to a heartbeat Ping
  • B-F: reserved for future control frames
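
In Swift, the defined opcodes map naturally onto an enum with UInt8 raw values. The type and function names below are assumptions for illustration only.

```swift
/// Opcodes currently defined by the protocol; raw values are the 4-bit codes.
enum Opcode: UInt8 {
    case continuation = 0x0
    case text         = 0x1
    case binary       = 0x2
    case close        = 0x8
    case ping         = 0x9
    case pong         = 0xA

    /// Control opcodes are 0x8...0xF, i.e. the highest of the four bits is set.
    var isControl: Bool { (rawValue & 0x8) != 0 }
}

/// Extracts the opcode from the first byte of a frame.
/// Returns nil for the reserved codes (0x3-0x7 and 0xB-0xF).
func opcode(fromFirstByte byte: UInt8) -> Opcode? {
    return Opcode(rawValue: byte & 0x0F)
}

if let code = opcode(fromFirstByte: 0x89) {
    print(code, code.isControl)
}
```

Returning nil for reserved values gives the caller a natural place to close the connection when it meets an opcode it does not understand.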

Frames therefore fall into two categories: control frames and non-control frames.

Control frames

Control frames have some special requirements:

  1. A control frame cannot be fragmented into consecutive frames
  2. The payload of a control frame cannot exceed 125 bytes
  3. The FIN bit of a control frame must be 1

A control frame means that, on receiving it, the receiver should respond or act in a specific way.
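
These requirements are easy to check once the header has been parsed. A minimal sketch, assuming the caller has already extracted the FIN bit, the 4-bit opcode and the payload length (the function name is made up):

```swift
/// Checks the control-frame rules for an already-parsed header.
/// `opcode` is the 4-bit opcode, `payloadLength` the declared payload size.
func isValidControlFrame(fin: Bool, opcode: UInt8, payloadLength: Int) -> Bool {
    let isControl = (opcode & 0x8) != 0
    guard isControl else { return true }   // the rules only apply to control frames
    // A control frame must be a single, final frame with at most 125 bytes.
    return fin && payloadLength <= 125
}

print(isValidControlFrame(fin: true, opcode: 0x9, payloadLength: 4))
```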

8: Close frame

When the receiver receives a close frame, there are two situations:

  1. The receiver has not sent a close frame before

If the receiver is in the middle of sending consecutive data frames, it may continue sending them (though there is no guarantee that the other party will still process the data). It should then send a close frame of its own and afterwards disconnect the TCP/IP connection.

  2. The receiver has already sent a close frame

The receiver should not have sent any data frames after sending its close frame. On receiving the close frame, it disconnects the TCP/IP connection.

A close frame is a control frame and can therefore carry at most 125 bytes of data. The first two bytes of that data are a status code, and the remaining bytes describe the reason for closing (UTF-8 encoded text).

Close: when one party initiates a close, it actively sends a close frame and finally carries out the complete process of closing the TCP/IP connection.
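
Reading the data carried by a close frame could look like the sketch below. CloseInfo and parseClosePayload are hypothetical names. The sketch also assumes, following RFC 6455, that a close frame may carry no data at all and that the status code is stored in network byte order.

```swift
import Foundation

/// What a close frame told us. `code` is nil when the frame carried no data.
struct CloseInfo: Equatable {
    let code: UInt16?
    let reason: String
}

/// Splits a close-frame payload into status code and reason text.
/// Returns nil when the payload is malformed.
func parseClosePayload(_ payload: [UInt8]) -> CloseInfo? {
    if payload.isEmpty { return CloseInfo(code: nil, reason: "") }
    guard payload.count >= 2, payload.count <= 125 else { return nil }
    // The status code is a 16-bit unsigned integer, high byte first.
    let code = UInt16(payload[0]) << 8 | UInt16(payload[1])
    // The rest, if any, must be valid UTF-8 text.
    guard let reason = String(bytes: payload.dropFirst(2), encoding: .utf8) else { return nil }
    return CloseInfo(code: code, reason: reason)
}

// 0x03E8 is 1000, the status code RFC 6455 defines for a normal closure.
let info = parseClosePayload([0x03, 0xE8] + Array("bye".utf8))
print(String(describing: info))
```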

9: Ping

Ping is the heartbeat frame of WebSocket, mainly used to confirm that the other party has not dropped the connection because of a fault. When we receive a Ping frame, we should reply with a Pong frame. If we send a Ping and receive no reply for a long time, we should consider actively closing the connection.

A: Pong

The Pong frame is the reply frame in WebSocket's heartbeat mechanism.
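
As an illustration, here is a sketch of the Pong frame a server could send in reply to a Ping. It relies on the RFC 6455 rule that a Pong echoes the data carried by the Ping. The function name is made up, and a client would additionally have to mask the payload, which this sketch leaves out.

```swift
/// Builds the unmasked Pong frame a server would send in reply to a Ping.
/// Returns nil if the payload is too large for a control frame.
func makePongFrame(echoing pingPayload: [UInt8]) -> [UInt8]? {
    // Control frames may carry at most 125 bytes.
    guard pingPayload.count <= 125 else { return nil }
    let firstByte: UInt8 = 0b1000_0000 | 0xA       // FIN = 1, opcode = 0xA (Pong)
    let secondByte = UInt8(pingPayload.count)      // MASK = 0, 7-bit length
    return [firstByte, secondByte] + pingPayload
}

let pong = makePongFrame(echoing: Array("hi".utf8))
print(String(describing: pong))
```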

Remaining control frames

The current protocol does not define them. They may be added in future versions of WebSocket (or defined by sub-protocols).

If the receiver has no handling defined for such a frame, it should close the connection.

Non-control frames

Non-control frames, which we usually just call data frames, are used by both sides to send data, and are the frames we use most often.

0: Continuation frames (fragmentation)

Fragmentation

The main purpose of fragmentation is to allow a message of unknown size to be sent as soon as it is started, without having to buffer it first. If messages could not be fragmented, the endpoint would have to buffer the entire message to work out its length before the first byte could be sent. With fragmentation, the server or an intermediary can choose a buffer of reasonable size and, whenever the buffer is full, write a fragment to the network.

The second use case for fragmentation is multiplexing. It is not desirable for a large message on one logical channel to monopolize the output channel, so multiplexing needs to be able to split the message into smaller fragments to share the output channel better.

Requirements for sending fragmented data:

  1. The FIN bit of the first frame and of every intermediate frame is 0
  2. The opcode of the first frame must be the opcode of the corresponding non-control frame type and cannot be the continuation opcode
  3. The opcode of every intermediate frame and of the final frame must be the continuation opcode
  4. The FIN bit of the final frame must be 1

We can understand it as follows:

When we send fragmented data, the first frame must tell the other side what kind of data it is, and must not yet say that the data is complete. While sending, each intermediate frame must tell the other side that the data is not finished and that this frame is one part of it. And when we reach the last frame, we need to tell the other side that we are done.

In a nutshell, the rules are as follows:

  1. The first frame determines the data type, which neither the intermediate frames nor the final frame can change
  2. The final frame indicates that the data is complete

Receiving works in the same way: parse the first frame to determine the data type, then receive the intermediate data, and complete the message when the final frame arrives. If you receive data that does not meet the fragmentation requirements, you should close the connection.
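
The receiving side can be sketched as a small state machine. MessageAssembler is a hypothetical type written for this article. It only handles data frames, takes already-unmasked payloads, and signals a rule violation by throwing, at which point the caller would close the connection.

```swift
/// Collects the frames of one (possibly fragmented) message.
/// Opcode values: 0x0 continuation, 0x1 text, 0x2 binary.
struct MessageAssembler {
    enum Failure: Error { case unexpectedContinuation, unexpectedNewMessage }

    private var messageOpcode: UInt8?      // set by the first frame
    private var buffer: [UInt8] = []

    /// Feeds one data frame; returns the complete message after the final frame.
    mutating func receive(fin: Bool, opcode: UInt8, payload: [UInt8]) throws
        -> (opcode: UInt8, payload: [UInt8])? {
        if opcode == 0x0 {
            // A continuation frame is only valid after a first frame.
            guard messageOpcode != nil else { throw Failure.unexpectedContinuation }
        } else {
            // A new data type may not start while a message is in progress.
            guard messageOpcode == nil else { throw Failure.unexpectedNewMessage }
            messageOpcode = opcode
        }
        buffer += payload
        // FIN = 0 means more fragments follow.
        guard fin, let kind = messageOpcode else { return nil }
        let message = (opcode: kind, payload: buffer)
        messageOpcode = nil
        buffer = []
        return message
    }
}

var assembler = MessageAssembler()
let partial = try? assembler.receive(fin: false, opcode: 0x1, payload: Array("Hel".utf8))
let complete = try? assembler.receive(fin: true, opcode: 0x0, payload: Array("lo".utf8))
print(partial == nil, complete?.payload.count ?? 0)
```

An unfragmented message is simply the special case of a first frame that already has FIN set to 1, so the same code path covers both.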

1: Text frame

A text frame signals that the payload is UTF-8 encoded. When we receive one, we need to convert the data to a UTF-8 string, and if the conversion fails we need to close the connection.
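
In Swift, the conversion and the failure case fit into a single call. A minimal sketch with a made-up function name:

```swift
import Foundation

/// Decodes a text-frame payload. A nil result means the bytes are not
/// valid UTF-8 and the connection should be closed.
func decodeTextPayload(_ payload: [UInt8]) -> String? {
    return String(bytes: payload, encoding: .utf8)
}

print(decodeTextPayload(Array("Hello".utf8)) ?? "invalid UTF-8")
```

For a fragmented message, the check has to be applied to the reassembled payload, because a multi-byte character may be split across two frames.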

2: Binary frame

A binary frame indicates that the payload is raw binary data.

3 to 7: Other non-control frames

Reserved for future protocol upgrades or sub-protocol extensions.

The opcode is a key part of the frame header: among other things, it defines how the data is to be handled.

MASK

The mask flag occupies one bit and indicates whether the frame is sent with a mask, and therefore whether the payload needs to be decoded.

If the mask bit is 1, a mask is present and the payload must be unmasked.

Why design masks?

According to the protocol, data sent from the client to the server must be masked, and data sent by the server must not be masked. The mask is not there for confidentiality: RFC 6455 introduces it so that bytes chosen by a client cannot be mistaken for HTTP traffic by intermediaries such as caching proxies.

Data length (Payload Len)

The data length field takes up 7 bits (possibly more), so the largest value it can hold directly is 2^7 - 1 = 127, but the actual payload may be far larger than that. What should be done?

Here is what the protocol specifies:

  1. When the value is less than or equal to 125, it is the real payload length in bytes
  2. When the value is 126, the next 16 bits (2 bytes) hold the length, so lengths up to 2^16 - 1 = 65535 bytes are supported
  3. When the value is 127, the next 64 bits (8 bytes) hold the length, so lengths up to 2^64 - 1 bytes fit in the field (RFC 6455 additionally requires the most significant bit to be 0)

What if that is still not enough?

Then consider fragmenting the message -_-
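
The three cases can be sketched as a pair of functions. The names are hypothetical, and the sketch assumes that the extended length is stored in network byte order (most significant byte first), as RFC 6455 specifies.

```swift
/// Encodes a payload length as the 7-bit field plus any extended length bytes.
/// The MASK bit (top bit of the first returned byte) is left at 0.
func encodePayloadLength(_ length: UInt64) -> [UInt8] {
    if length <= 125 {
        return [UInt8(length)]
    } else if length <= UInt64(UInt16.max) {
        // 126 announces a 16-bit length.
        return [126, UInt8(length >> 8), UInt8(length & 0xFF)]
    } else {
        // 127 announces a 64-bit length.
        let extended = (0..<8).reversed().map { UInt8((length >> ($0 * 8)) & 0xFF) }
        return [127] + extended
    }
}

/// Decodes the length, starting at the second byte of the frame header.
/// Returns the length and how many bytes it occupied, or nil if bytes are missing.
func decodePayloadLength(_ bytes: [UInt8]) -> (length: UInt64, bytesUsed: Int)? {
    guard let first = bytes.first else { return nil }
    let marker = first & 0x7F                      // drop the MASK bit
    let extendedCount = marker == 126 ? 2 : (marker == 127 ? 8 : 0)
    guard bytes.count >= 1 + extendedCount else { return nil }
    if extendedCount == 0 { return (length: UInt64(marker), bytesUsed: 1) }
    // Fold the extended bytes into one integer, high byte first.
    let length = bytes[1...extendedCount].reduce(UInt64(0)) { ($0 << 8) | UInt64($1) }
    return (length: length, bytesUsed: 1 + extendedCount)
}

print(encodePayloadLength(300))
print(String(describing: decodePayloadLength(encodePayloadLength(70_000))))
```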

Masking-Key (the actual mask)

The masking key takes 32 bits (4 bytes).

Whether the field exists depends on the mask flag above. If the flag is 1, the field is present. If it is 0, the field is absent.

According to the protocol, the masking key should be 32 random bits (4 bytes) generated by an unpredictable algorithm.

In Swift, we can use SecRandomCopyBytes() from the Security framework to generate the random value.
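
A minimal sketch of that call, assuming an Apple platform where the Security framework is available. The function name makeMaskingKey is made up for illustration.

```swift
import Foundation
import Security  // assumption: an Apple platform

/// Returns 4 random bytes for use as a masking key, or nil if the
/// system random source reports an error.
func makeMaskingKey() -> Data? {
    var bytes = [UInt8](repeating: 0, count: 4)
    let status = SecRandomCopyBytes(kSecRandomDefault, bytes.count, &bytes)
    return status == errSecSuccess ? Data(bytes) : nil
}

if let maskingKey = makeMaskingKey() {
    print(maskingKey.map { String(format: "%02x", $0) }.joined())
}
```

A fresh key should be generated for every frame, so the call belongs in the frame-sending path rather than in connection setup.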

Once we have the masking key and the payload, we process the payload as follows (shown directly as Swift code):

import Foundation

// XOR each payload byte with the masking-key byte at (index % 4)
func maskData(payloadData: Data, maskingKey: Data) -> Data {
    precondition(maskingKey.count == 4, "the masking key must be 4 bytes")
    // Copy into arrays so that indices start at 0 even for Data slices
    let key = [UInt8](maskingKey)
    var finalData = [UInt8](payloadData)
    for index in finalData.indices {
        let indexMod = index % 4
        // Corresponding byte XOR (^)
        finalData[index] ^= key[indexMod]
    }
    return Data(finalData)
}

Masking and unmasking both use this same algorithm, because applying the XOR twice restores the original bytes.
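
Putting the header fields and the mask together, a complete single-frame text message from a client could be assembled as in the sketch below. The function name is hypothetical and only payloads of up to 125 bytes are handled. The sample key and text are the ones from the masked "Hello" example in RFC 6455.

```swift
/// Builds a single-frame, masked text message as a client would send it.
/// Only payloads up to 125 bytes are handled in this sketch.
func makeMaskedTextFrame(text: String, maskingKey: [UInt8]) -> [UInt8]? {
    let payload = Array(text.utf8)
    guard maskingKey.count == 4, payload.count <= 125 else { return nil }
    var frame: [UInt8] = [
        0b1000_0001,                          // FIN = 1, opcode = 0x1 (text)
        0b1000_0000 | UInt8(payload.count)    // MASK = 1, 7-bit length
    ]
    frame += maskingKey
    // XOR every payload byte with the key byte at (index mod 4).
    frame += payload.enumerated().map { pair in
        pair.element ^ maskingKey[pair.offset % 4]
    }
    return frame
}

let frame = makeMaskedTextFrame(text: "Hello", maskingKey: [0x37, 0xFA, 0x21, 0x3D])
print(frame?.map { String($0, radix: 16) } ?? [])
```

The whole message costs 11 bytes on the wire, 6 of which are protocol overhead. That is the saving compared with the HTTP request discussed at the beginning of the article.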

Payload Data

This is the payload, the data we actually work with most of the time. Nothing more needs to be said about it.

Other

There are a few aspects of WebSocket that we have not covered yet, such as sub-protocols, which the author still needs to study in more depth. They will be covered in supplementary articles in the future.

When do I need to use WebSocket?

As iOS developers, we do not get to use it much. But when we want the server to actively push data to us and do not want to design our own upper-layer protocol, WebSocket is worth considering and is very useful.

Why write this article?

The author is currently working with this protocol and is developing a third-party WebSocket client library in pure Swift, SwiftAsyncWebsocket, which is still under development. Having gained some experience from studying WebSocket, I wrote this article.

Finally

Every step we take now is a path paved by our predecessors.

If there are any mistakes in the article, please point them out in the comments.

PS: drawing the diagrams in PowerPoint was really laborious -_-

Reference:

SocketRocket source

RFC 6455