The basic Socket model

For a client and a server to communicate over a network, both sides have to use Socket programming. A Socket is a special form of interprocess communication; what makes it special is that it can communicate across hosts.

The word Socket literally means a plug or outlet, and the name fits: for two programs to communicate over the network, each side needs to create a Socket, as if the client and the server each open a "socket outlet" through which both sides read and send data. Pictured this way, it looks a bit like a network cable with one end plugged into the client and the other into the server, over which the two sides communicate.

When creating a Socket, you can specify whether IPv4 or IPv6 is used at the network layer and whether TCP or UDP is used at the transport layer. UDP Socket programming is relatively simple, so here we only cover Socket programming based on TCP.

The server program has to run first and then wait for client connections and data, so let's first look at what the server-side Socket programming flow looks like.

The server calls the socket() function to create a Socket that uses IPv4 at the network layer and TCP at the transport layer, and then calls bind() to bind the Socket to an IP address and a port. What is the purpose of this binding?

The purpose of binding the port: when the kernel receives a TCP packet, it uses the port number in the TCP header to find our application and hand the data to it.

The purpose of binding the IP address: a machine can have multiple network adapters, each with its own IP address. After binding to one of them, the kernel passes us only the packets that arrive on that adapter.

After the IP address and port are bound, the listen() function can be called to start listening; this corresponds to the LISTEN state in the TCP state diagram. If we want to check whether a network program on the server has started, we can use the netstat command to see whether the corresponding port is being listened on.

When the server enters the listening state, it calls accept() to take a client connection from the kernel. If no client connection has arrived yet, it blocks and waits for one.
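A minimal sketch of this server-side call sequence (socket, bind, listen, accept, then read/write), assuming IPv4/TCP; the port 8888 and the echo logic are illustrative only, and error handling is abbreviated:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* 1. Create an IPv4, TCP socket (the "listening Socket"). */
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0) { perror("socket"); exit(1); }

    /* 2. Bind it to a local IP address and port (8888 is just an example). */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* any local network adapter */
    addr.sin_port = htons(8888);
    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); exit(1);
    }

    /* 3. Enter the LISTEN state; 128 is the backlog hint. */
    if (listen(listen_fd, 128) < 0) { perror("listen"); exit(1); }

    /* 4. Block until a completed connection is available, then get the
     *    "connected Socket" used for actual data transfer. */
    int conn_fd = accept(listen_fd, NULL, NULL);
    if (conn_fd < 0) { perror("accept"); exit(1); }

    char buf[1024];
    ssize_t n = read(conn_fd, buf, sizeof(buf));   /* read client data */
    if (n > 0) write(conn_fd, buf, (size_t)n);     /* echo it back */

    close(conn_fd);
    close(listen_fd);
    return 0;
}
```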

How does the client initiate a connection? After creating its Socket, the client calls the connect() function, whose parameters specify the server's IP address and port number, and then the much-anticipated TCP three-way handshake begins.
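A matching client-side sketch; the server address 127.0.0.1:8888 is assumed purely for illustration:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);          /* client socket */
    if (fd < 0) { perror("socket"); exit(1); }

    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(8888);                      /* example port */
    inet_pton(AF_INET, "127.0.0.1", &server.sin_addr);  /* example IP   */

    /* connect() triggers the TCP three-way handshake with the server. */
    if (connect(fd, (struct sockaddr *)&server, sizeof(server)) < 0) {
        perror("connect"); exit(1);
    }

    const char *msg = "hello";
    write(fd, msg, strlen(msg));    /* send data once connected */

    char buf[1024];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n > 0) printf("got %zd bytes back\n", n);

    close(fd);
    return 0;
}
```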

While TCP connections are being established, the server's kernel actually maintains two queues for each listening Socket:
  1. One is the queue of connections that have not been fully established, called the TCP half-connection queue. It holds connections that have not yet completed the three-way handshake; for these, the server side is in the SYN_RCVD state.

  2. The other is the queue of established connections, called the TCP full connection queue. It holds connections that have completed the three-way handshake; for these, the server side is in the ESTABLISHED state.

When the TCP full connection queue is not empty, the server's accept() call returns a Socket for a completed connection from that kernel queue to the application, and that Socket is used for subsequent data transmission.

Note that the listening Socket and the Socket actually used to transmit data are two different Sockets:

  • One is called a listening Socket;

  • The other is called a connected Socket.

Once the connection is established, the client and server start transferring data to each other, and both can read and write data using the read() and write() functions.

At this point, the calling process of a TCP Socket program is complete; the whole flow is shown in the figure below:

Seeing this, you might feel that reading and writing a Socket is just like reading and writing a file.

Yes. Based on Linux's philosophy that "everything is a file", a Socket also exists in the kernel in the form of a "file", and it has a corresponding file descriptor.


What does a file descriptor do?

Each process has a data structure called task_struct, which contains a member pointing to an array of file descriptors. This array records every file the process has opened: the array index is the file descriptor (an integer), and each array element is a pointer into the kernel's table of open files, which means the kernel can find the corresponding open file through the file descriptor.

Each file has an inode, and the inode of a Socket "file" points to the Socket structure in the kernel. This structure contains two queues, a send queue and a receive queue, each of which stores struct sk_buff entries organized as a linked list.

An sk_buff can represent a packet at any layer: at the application layer it is called data (a message), at the TCP layer a segment, at the IP layer a packet, and at the data link layer a frame.

You may wonder why a single structure is used for packets at every layer. The protocol stack is layered: when data moves down the stack, a header must be added at each layer, and when it moves up the stack, a header must be removed. If each layer used its own structure, the data would have to be copied every time it crossed a layer, which would greatly reduce CPU efficiency.

So how does sk_buff describe packets at every layer while passing data between layers without copying? By adjusting the data pointer inside the sk_buff, for example:

  • When receiving a packet, it travels from the NIC driver up through the protocol stack layer by layer, and each protocol header is stripped by increasing the value of skb->data.

  • When sending a packet, a single sk_buff is allocated with enough headroom left in front of the data to hold the headers of every layer, and each protocol header is added by decreasing the value of skb->data as the packet passes down through the lower layers.

You can see in the figure below how the data pointer moves when a message is sent.
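As a simplified illustration of this idea (this is not the kernel's real struct sk_buff, just a user-space sketch of the pointer arithmetic), headers can be added or stripped by moving a data pointer instead of copying the payload:

```c
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct sk_buff: one buffer,
 * with data/tail pointers that are moved instead of copying payload. */
struct fake_skb {
    unsigned char buf[256];
    unsigned char *data;   /* start of the current layer's view */
    unsigned char *tail;   /* end of valid data                  */
};

/* Like the kernel's skb_push(): prepend a header by moving data downwards. */
static unsigned char *push(struct fake_skb *skb, size_t len) {
    skb->data -= len;
    return skb->data;
}

/* Like the kernel's skb_pull(): strip a header by moving data upwards. */
static unsigned char *pull(struct fake_skb *skb, size_t len) {
    skb->data += len;
    return skb->data;
}

int main(void) {
    struct fake_skb skb;
    /* Reserve headroom for the headers that will be added later. */
    skb.data = skb.buf + 64;
    memcpy(skb.data, "payload", 7);
    skb.tail = skb.data + 7;

    /* Sending: each layer prepends its header by decreasing data. */
    memset(push(&skb, 20), 'T', 20);   /* pretend TCP header */
    memset(push(&skb, 20), 'I', 20);   /* pretend IP header  */

    /* Receiving: each layer strips its header by increasing data. */
    pull(&skb, 20);  /* drop IP header  */
    pull(&skb, 20);  /* drop TCP header */

    printf("payload length: %ld\n", (long)(skb.tail - skb.data));
    return 0;
}
```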


How to serve more users?

The TCP Socket calling process described above is the simplest and most basic one. It can only handle one-to-one communication, because it uses synchronous blocking I/O: while the server has not finished processing one client's network I/O, or is blocked in a read or write, it cannot serve other clients.

However, if our server could only serve one client, it would be a waste of resources, so we need to improve the network I/O model to support more clients.

Before improving the network I/O model, let me ask you a question. Do you know the maximum number of clients that can be connected to a single server?

As you probably know, a TCP connection is uniquely identified by a four-tuple: local IP, local port, peer IP, peer port.

The server usually listens on a fixed local port and waits for clients to connect, so its local IP address and port are fixed. In the four-tuple of a server-side TCP connection, only the peer IP address and peer port can vary. Therefore, the theoretical maximum number of TCP connections = number of client IP addresses × number of client ports.

For IPv4, the number of client IP addresses is at most 2^32 and the number of client ports is at most 2^16, so the theoretical maximum number of TCP connections on a single server is about 2^32 × 2^16 = 2^48.

This theoretical value is quite generous, but a server certainly cannot hold that many connections. There are two main limits:

  1. File descriptors. A Socket is actually a file, and it corresponds to a file descriptor. Under Linux there is a limit on the number of file descriptors a single process may open; the default is usually 1024, but we can raise it with ulimit (see the sketch after this list).

  2. System memory. Each TCP connection has a corresponding data structure in the kernel, so every connection occupies a certain amount of memory.
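As a sketch of the first limit, a process can query and raise its own descriptor limit with getrlimit()/setrlimit() on RLIMIT_NOFILE (the same limit the ulimit shell command adjusts); the target value 65535 below is just an example:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;

    /* Query the per-process open-file-descriptor limit. */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

    /* Raise the soft limit (cannot exceed the hard limit without privilege).
     * 65535 is just an example target. */
    rl.rlim_cur = 65535;
    if (rl.rlim_cur > rl.rlim_max)
        rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");

    return 0;
}
```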

If the server only has 2 GB of memory and a gigabit network card, can it support 10,000 concurrent requests?

Serving 10,000 concurrent requests is the classic C10K problem: C stands for Client, and C10K is the problem of handling 10,000 client connections simultaneously on a single machine.

From the point of view of hardware resources, a server with 2 GB of memory and a gigabit network adapter can satisfy 10,000 concurrent requests as long as each request consumes less than 200 KB of memory and 100 Kbit/s of network bandwidth.

However, to actually build a C10K server, the key thing to consider is the server's network I/O model: an inefficient model increases system overhead and pushes us further and further away from the C10K goal.


Multiprocess model

With the original blocking network I/O, a traditional way for a server to support multiple clients is the multi-process model: each client is assigned a process to handle its requests.

Once the connection is established, accept() returns a "connected Socket". At this point, a child process is created with the fork() function, which in effect copies everything related to the parent process, including its file descriptors, memory address space, program counter, executed code, and so on.

When the child is first created, the two processes are almost identical. They are distinguished by fork()'s return value: if the return value is 0, we are in the child process; if it is a positive integer (the child's PID), we are in the parent process.

Because the child process copies the parent's file descriptors, it can use the "connected Socket" to communicate directly with the client.

Notice that the child process does not need to care about the "listening Socket", only about the "connected Socket"; the parent, on the other hand, hands client service off to the child, so it does not need to care about the "connected Socket", only about the "listening Socket".

The following diagram shows the flow from connection request to connection establishment, after which the parent process creates a child process to serve the client.

In addition, when a child process exits, the kernel actually keeps some information about it, which still occupies memory. If no "reaping" is done, the child becomes a zombie process, and the more zombie processes accumulate, the more they gradually exhaust our system resources.

Therefore, the parent process has to "clean up" after its own children. How? There are two ways: after a child exits, the parent reclaims its resources by calling wait() or waitpid().
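Putting the pieces together, here is a sketch of the accept-and-fork loop with a SIGCHLD handler that reaps children via waitpid(); listen_fd is assumed to be a Socket already bound and listening (as in the earlier server sketch), and the echo logic is illustrative only:

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap exited children so they do not linger as zombie processes. */
static void reap_children(int sig) {
    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
}

/* listen_fd is assumed to already be bound and listening. */
void serve_forever(int listen_fd) {
    signal(SIGCHLD, reap_children);

    for (;;) {
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0)
            continue;                 /* e.g. interrupted by SIGCHLD */

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: only needs the connected Socket. */
            close(listen_fd);
            char buf[1024];
            ssize_t n;
            while ((n = read(conn_fd, buf, sizeof(buf))) > 0)
                write(conn_fd, buf, (size_t)n);   /* echo back */
            close(conn_fd);
            exit(0);
        }
        /* Parent: only needs the listening Socket. */
        close(conn_fd);
    }
}
```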

This approach of using one process per client works fine for 100 clients, but when the number of clients rises to tens of thousands it certainly cannot hold up: every process occupies a certain amount of system resources, and the "baggage" carried across a process context switch is heavy, so performance suffers.

A process context switch covers not only user-space resources such as virtual memory, the stack, and global variables, but also kernel-space resources such as the kernel stack and registers.


Multithreaded model

Since context switching between processes is such a heavy burden, we have a lighter-weight model for handling multi-user requests: the multithreaded model.

A thread is a "logical flow" running inside a process, and a single process can run multiple threads. Threads in the same process share part of the process's resources, such as the file descriptor table, process address space, code, global data, heap, and shared libraries. These shared resources do not need to be switched during a context switch; only the thread's private data, registers, and other non-shared state do. Therefore, the overhead of a context switch between threads of the same process is much lower than that between processes.

After the server completes the TCP connection with a client, it creates a thread with the pthread_create() function and passes the file descriptor of the "connected Socket" to the thread function; the thread then communicates with the client, achieving concurrent processing.
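A minimal thread-per-connection sketch along these lines; the heap-allocated descriptor copy and the echo logic are illustrative choices, not a fixed API:

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Thread function: serves one client over its connected Socket. */
static void *handle_client(void *arg) {
    int conn_fd = *(int *)arg;
    free(arg);

    char buf[1024];
    ssize_t n;
    while ((n = read(conn_fd, buf, sizeof(buf))) > 0)
        write(conn_fd, buf, (size_t)n);   /* echo back */

    close(conn_fd);
    return NULL;
}

/* listen_fd is assumed to already be bound and listening. */
void serve_with_threads(int listen_fd) {
    for (;;) {
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0)
            continue;

        int *arg = malloc(sizeof(int));   /* heap copy: each thread gets its own fd */
        if (!arg) { close(conn_fd); continue; }
        *arg = conn_fd;

        pthread_t tid;
        if (pthread_create(&tid, NULL, handle_client, arg) != 0) {
            free(arg);
            close(conn_fd);
            continue;
        }
        pthread_detach(tid);   /* detach so resources are reclaimed on exit */
    }
}
```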

If a thread is created for every connection, the operating system also has to destroy the thread when it finishes. Although switching between threads is cheap, creating and destroying threads frequently is still expensive.

So we can use a thread pool to avoid frequent thread creation and destruction. A thread pool simply means creating several threads in advance; when a new connection is established, its connected Socket is placed on a queue, and the threads in the pool take connected Sockets off that queue and handle them.

It is important to note that this queue is global and every thread operates on it, so to avoid contention between threads, a thread must lock the queue before operating on it.
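A compact sketch of such a thread pool, with a mutex-protected queue of connected Sockets and a condition variable to wake the workers; the queue capacity, worker count, and echo handling are arbitrary example values:

```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define QUEUE_CAP   1024
#define NUM_WORKERS 4

/* Global queue of connected Sockets, protected by a mutex as noted above. */
static int queue[QUEUE_CAP];
static int q_head, q_tail, q_len;
static pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_notempty = PTHREAD_COND_INITIALIZER;

static void enqueue(int fd) {
    pthread_mutex_lock(&q_mutex);
    if (q_len < QUEUE_CAP) {                 /* drop the connection if full */
        queue[q_tail] = fd;
        q_tail = (q_tail + 1) % QUEUE_CAP;
        q_len++;
        pthread_cond_signal(&q_notempty);
    } else {
        close(fd);
    }
    pthread_mutex_unlock(&q_mutex);
}

static int dequeue(void) {
    pthread_mutex_lock(&q_mutex);
    while (q_len == 0)
        pthread_cond_wait(&q_notempty, &q_mutex);
    int fd = queue[q_head];
    q_head = (q_head + 1) % QUEUE_CAP;
    q_len--;
    pthread_mutex_unlock(&q_mutex);
    return fd;
}

/* Worker thread: repeatedly takes a connected Socket and serves it. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int conn_fd = dequeue();
        char buf[1024];
        ssize_t n;
        while ((n = read(conn_fd, buf, sizeof(buf))) > 0)
            write(conn_fd, buf, (size_t)n);
        close(conn_fd);
    }
    return NULL;
}

/* listen_fd is assumed to already be bound and listening. */
void serve_with_pool(int listen_fd) {
    for (int i = 0; i < NUM_WORKERS; i++) {   /* create the threads in advance */
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
        pthread_detach(tid);
    }
    for (;;) {
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd >= 0)
            enqueue(conn_fd);
    }
}
```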

The process-based and thread-based models above both have a problem: each new TCP connection must be assigned a process or a thread. To reach C10K, a machine would have to maintain 10,000 connections, which means maintaining 10,000 processes/threads, and the operating system simply cannot withstand that no matter how hard it tries.


I/O multiplexing

Since allocating one process/thread per request is not appropriate, can we maintain multiple Sockets with only one process? The answer is yes: this is I/O multiplexing.

A process can handle only one request at any given moment, but if handling each request's events takes less than 1 millisecond, thousands of requests can be handled within 1 second. Over a longer span of time, multiple requests reuse one process; this is multiplexing, and the idea is very similar to a single CPU running multiple processes concurrently, which is why it is also called time-division multiplexing.

The familiar select/poll/epoll functions are the multiplexing system calls the kernel provides to user space: a process can retrieve multiple events from the kernel through a single system call.

How do select/poll/epoll obtain network events? To retrieve events, all connections (file descriptors) are first passed to the kernel, the kernel returns the connections on which events have occurred, and the process then handles those connections in user space.


select/poll

select implements multiplexing by placing all connected Sockets into a file descriptor set and then calling select() to copy that set into the kernel. The kernel checks for network events by traversing the set; when an event is found, the corresponding Socket is marked readable or writable, and the whole set is copied back to user space. User space then traverses the set again to find the readable or writable Sockets and processes them.

So with select, the file descriptor set is "traversed" twice (once in kernel space and once in user space) and "copied" twice (from user space into the kernel, where the kernel modifies it, and then back out to user space).

  1. select uses a fixed-length BitsMap to represent the file descriptor set, so the number of descriptors it supports is limited: on Linux the limit is 1024 by default, and only descriptors 0 to 1023 can be monitored.

  2. poll no longer uses a BitsMap to store the file descriptors of interest; instead it uses a dynamic array organized as a linked list, which breaks select's limit on the number of descriptors (though it is, of course, still bound by the system's file descriptor limits).

However, poll and select are not fundamentally different: both use a "linear structure" to store the set of Sockets the process cares about, so both need to traverse the whole set to find readable or writable Sockets, which takes O(n) time, and both need to copy the set between user space and kernel space. As the number of concurrent connections grows, this overhead multiplies and performance degrades quickly.
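A sketch of a typical select() server loop that shows the copy-and-traverse pattern described above; listen_fd is assumed to be bound and listening, and the echo handling is illustrative only:

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* listen_fd is assumed to already be bound and listening. */
void serve_with_select(int listen_fd) {
    fd_set all_fds;                 /* the set of descriptors we care about */
    FD_ZERO(&all_fds);
    FD_SET(listen_fd, &all_fds);
    int max_fd = listen_fd;

    for (;;) {
        fd_set read_fds = all_fds;  /* select() modifies the set, so copy it */

        /* The set is copied into the kernel; the kernel marks ready descriptors. */
        if (select(max_fd + 1, &read_fds, NULL, NULL, NULL) < 0)
            continue;

        /* User space traverses the whole set again to find ready Sockets. */
        for (int fd = 0; fd <= max_fd; fd++) {
            if (!FD_ISSET(fd, &read_fds))
                continue;

            if (fd == listen_fd) {            /* new connection */
                int conn_fd = accept(listen_fd, NULL, NULL);
                if (conn_fd >= 0) {
                    FD_SET(conn_fd, &all_fds);
                    if (conn_fd > max_fd)
                        max_fd = conn_fd;
                }
            } else {                          /* data from a client */
                char buf[1024];
                ssize_t n = read(fd, buf, sizeof(buf));
                if (n <= 0) {
                    close(fd);
                    FD_CLR(fd, &all_fds);
                } else {
                    write(fd, buf, (size_t)n);
                }
            }
        }
    }
}
```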

epoll

Epoll solves the select/poll problem in two ways.

  1. epoll uses a red-black tree in the kernel to keep track of all the file descriptors to be checked; a Socket to be monitored is added to this tree via the epoll_ctl() function. The red-black tree is an efficient data structure whose insert and delete operations are generally O(log n). Thanks to it, epoll does not need to pass in the whole Socket set on every operation the way select/poll do; it passes in only the one Socket to be added, which greatly reduces data copying and memory allocation between kernel and user space.

  2. epoll uses an event-driven mechanism. The kernel maintains a linked list of ready events; when an event occurs on a Socket, the kernel adds it to this list through a callback function. The kernel therefore does not need to poll and scan the whole Socket set the way select/poll do, which greatly improves detection efficiency.

Here you can see the epoll-related interface functions:
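A minimal sketch of how epoll_create1(), epoll_ctl(), and epoll_wait() fit together; listen_fd is assumed to be bound and listening, and MAX_EVENTS and the echo handling are illustrative only:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 1024

/* listen_fd is assumed to already be bound and listening. */
void serve_with_epoll(int listen_fd) {
    /* Create the epoll instance (the red-black tree lives in the kernel). */
    int epfd = epoll_create1(0);

    struct epoll_event ev;
    ev.events = EPOLLIN;             /* level-triggered by default */
    ev.data.fd = listen_fd;
    /* Register one descriptor at a time with epoll_ctl(). */
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Only descriptors with ready events are returned. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {                   /* new connection */
                int conn_fd = accept(listen_fd, NULL, NULL);
                if (conn_fd >= 0) {
                    ev.events = EPOLLIN;
                    ev.data.fd = conn_fd;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fd, &ev);
                }
            } else {                                 /* data from a client */
                char buf[1024];
                ssize_t len = read(fd, buf, sizeof(buf));
                if (len <= 0) {
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } else {
                    write(fd, buf, (size_t)len);
                }
            }
        }
    }
}
```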

In epoll mode, efficiency does not drop much even as the number of monitored Sockets grows, and the number of Sockets that can be monitored simultaneously is very large: the upper limit is the maximum number of file descriptors the system allows a process to open. epoll is therefore regarded as the sharp tool that solves the C10K problem.

As an aside, it is often claimed that when epoll_wait returns, epoll uses shared memory for the ready events, i.e. that user space and the kernel both point at the ready list, thereby avoiding a memory copy. This is wrong! Anyone who has read the epoll kernel source code knows there is no such shared memory. As the code below shows, the kernel implementation of epoll_wait calls the __put_user function, which copies data from the kernel to user space.

Epoll supports two event triggering modes, edge-triggered (ET) and level-triggered (LT).

These two terms are quite abstract, but the difference between them is quite understandable.

  1. In edge-triggered mode, when a readable event occurs on the monitored Socket, the server is woken up from epoll_wait only once, even if the process never reads the data out of the kernel; therefore, the program must make sure it reads all the data in the kernel buffer in one go.

  2. In level-triggered mode, when a readable event occurs on the monitored Socket, the server keeps being woken up from epoll_wait until the kernel buffer has been drained, so we are continually told there is data to read.

For example, suppose your package is placed in a delivery locker. If the locker sends you only one SMS notification and never reminds you again even if you never pick the package up, that is edge triggering; if the locker keeps texting you until you take the package out, that is level triggering.

That is the difference between the two. Level triggering means that as long as the condition holds (for example, there is still data in the kernel to be read), the event keeps being delivered to the user; edge triggering means the event is delivered only the first time the condition becomes true, and the same event is never delivered again afterwards.

With level-triggered mode, once the kernel notifies us that a file descriptor is readable or writable, we can keep checking its state to see whether it is still readable or writable, so there is no need to do as much reading or writing as possible in one go after each notification.

In edge-triggered mode, we are notified only once when an I/O event occurs, and we do not know how much data can be read or written, so after receiving the notification we should read or write as much data as possible to avoid missing the chance. This means reading or writing the file descriptor in a loop; if the descriptor is blocking, then once there is no data left to read or write, the process will block inside the read/write call and the program cannot make progress. Therefore, edge-triggered mode is typically used together with non-blocking I/O: the program keeps performing I/O until the system call (such as read or write) returns an error of type EAGAIN or EWOULDBLOCK.
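A sketch of that read-until-EAGAIN loop for a non-blocking connected Socket in edge-triggered mode; the buffer size and error handling are illustrative only:

```c
#include <errno.h>
#include <unistd.h>

/* Edge-triggered mode: drain the kernel buffer completely on each
 * notification, reading until the non-blocking socket returns EAGAIN. */
void drain_socket(int conn_fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(conn_fd, buf, sizeof(buf));
        if (n > 0) {
            /* ... process n bytes of data ... */
            continue;
        }
        if (n == 0) {                 /* peer closed the connection */
            close(conn_fd);
            break;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                    /* buffer fully drained; wait for the next event */
        if (errno == EINTR)
            continue;                 /* interrupted by a signal; retry */
        close(conn_fd);               /* real error */
        break;
    }
}
```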

In general, edge triggering is more efficient than level triggering because it reduces the number of epoll_wait system calls, and system calls have their own overhead, including context switches.

select/poll support only level-triggered mode. epoll defaults to level-triggered mode, but it can be set to edge-triggered mode depending on the application scenario.

In addition, when using I/O multiplexing, it is best to combine it with non-blocking I/O, as the Linux manual page for select notes:

Under Linux, select() may report a socket file descriptor as “ready for reading”, while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.


To put it simply, an event returned by the multiplexing API is not guaranteed to actually be readable or writable. If blocking I/O is used, the program can block inside read/write, so it is best to use non-blocking I/O to handle these rare corner cases.
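For completeness, a socket is typically switched to non-blocking mode with fcntl() and the O_NONBLOCK flag mentioned in the manual excerpt; a small helper sketch:

```c
#include <fcntl.h>

/* Put a socket (or any descriptor) into non-blocking mode with fcntl(). */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```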

Conclusion

The most basic TCP Socket programming uses a blocking I/O model and can basically only support one-to-one communication. To serve more clients, we need to improve the network I/O model.

The more traditional approach is the multi-process/multi-thread model, which allocates a process/thread for each client connection and then keeps all subsequent reads and writes in that process/thread. This works fine with 100 clients, but when the number of clients grows to 10,000, scheduling 10,000 processes/threads, their context switches, and the memory they consume all become bottlenecks.

To solve this problem, I/O multiplexing appeared: a single process can handle I/O for multiple files. Linux provides three I/O multiplexing APIs: select, poll, and epoll.

select and poll are not fundamentally different: both use a "linear structure" to store the set of Sockets a process cares about.

When they are used, the Socket set of interest first has to be copied from user space into the kernel by the select/poll system call, and the kernel then detects events. When a network event occurs, the kernel has to traverse the process's Socket set to find the corresponding Socket and mark it readable/writable, then copy the whole set back to user space; user space in turn has to traverse the whole set again to find the readable/writable Sockets and process them.

Obviously, the drawback of select and poll is that as the number of clients grows, i.e. as the Socket set grows larger, traversing and copying the set brings a lot of overhead, so it is hard for them to cope with C10K.

epoll is the tool that solves the C10K problem. It fixes the problems of select/poll in two ways.

  1. epoll uses a "red-black tree" in the kernel to track the Sockets the process needs to have checked. The red-black tree is an efficient data structure whose insert and delete operations are generally O(log n); by managing Sockets through this tree, epoll does not need to pass in the whole Socket set on every operation the way select/poll do, which reduces data copying and memory allocation between kernel and user space.

  2. epoll uses an event-driven mechanism: a "linked list" is maintained in the kernel to record ready events, and only the Sockets on which events have occurred are passed to the application.

Also, epoll supports both edge triggering and level triggering, whereas select/poll support only level triggering. In general, edge triggering is more efficient than level triggering.

This article is from mp.weixin.qq.com/s/qVUXY7t51… (to be removed upon request in case of infringement).