Question introduction: Why is Redis so fast?

  • Firstly, multiplexing IO blocking mechanism is adopted. Single thread, avoiding switching overhead between threads.
  • However, the data structure is simple and the operation saves time.
  • Finally, running in memory is naturally fast.

Select function

1. Code parsing

  • 1. In a for loop, create a file descriptor based on the ACCEPT function and place it in the FDS array;
  • In a while loop, FD_ZERO() is called:
    • The length of fd_set is 1 byte. Each bit in fd_set can correspond to a file descriptor fd. A 1-byte fd_set can correspond to a maximum of eight FDS
    • Empty the given file descriptor, set in bits is 0000,0000;
  • 3, in a for loop summary, call FD_SET, the file descriptor array FDS array into the Rset;
    • The underlying rset is a bitmap, for example, with five file descriptors, set becomes 0001,0011;
  • Select (6,&set,0,0,0
    • First, the flag position is 0, indicating no data.
    • If there is data, mark position 1.
    • If a readable event occurs on both fd=1 and fd=2, select returns, and set changes to 0000,0011. Note: fd=5 with no event is cleared.
  • 5. Iterate over the file descriptor and check whether the FD_SET flag bit of the file descriptor is 1
    • If it is 1, data is returned, and the read operation is performed and the data is stored in buffer for later use.

2, select function introduction

  • Select () is used to wait for the state of the file descriptor to change.
    int select(int maxfd, fd_set *rdset, fd_set *wrset, fd_set *exset, struct timeval *timeout);
Copy the code
  • Parameter is introduced

    • Maxfd:
      • Is an integer value, which refers to the range of all file descriptors in the set, that is, the maximum value of all file descriptors plus 1;
    • Rdset:
      • A collection of readable file descriptors;
      • Is a pointer to an fd_set structure. This set should contain file descriptors that we want to monitor for read changes.
      • If there is a file in the collection that can be read, select returns a value greater than 0 indicating that there is a file that can be read.
      • If there are no readable files, use the timeout parameter to determine whether the file times out.
        • If the timeout is exceeded, select returns 0;
        • Returns a negative value in case of an error;
        • You can pass a NULL value to indicate that you don’t care about any file read changes.
    • wrset:
      • A collection of writable file descriptors;
      • Is a pointer to an fd_set structure. This set should contain file descriptors that we want to monitor for write changes.
      • If there is a file in the collection that can be written to, select returns a value greater than 0, indicating that there is a file to write to.
      • If there are no writable files, the timeout parameter is used to determine whether the timeout occurs.
        • If the timeout is exceeded, select returns 0;
        • Returns a negative value in case of an error;
        • You can pass a NULL value to indicate that you don’t care about any file write changes.
    • Exset: a collection of exception file descriptors;
    • struct timeval:
      • This function is used to describe a period of time during which no event occurred for the descriptor to be monitored. The function returns 0.
      • Is the timeout for the select, which can put the select in one of three states:
        • First, if NULL is passed as a parameter, that is, no time structure is passed, the SELECT is blocked until one of the file descriptors in the monitoring file descriptor set changes.
        • Second, if you set the time value to 0 seconds 0 milliseconds, it becomes a purely non-blocking function. It returns to continue execution immediately regardless of whether the file descriptor has changed, 0 if the file has not changed, and a positive value if the file has changed.
        • Third, the value of timeout is greater than 0, which is the timeout time. That is, the select will block within the timeout time. Otherwise, the select will return after the timeout anyway, which is the same value as above.
  • The return value

    • Negative value: select error
    • Positive value: some files can be read or written or error 0: wait timeout, no file can be read or error.
    • If timeout is set to NULL, select () does not have a timeout.
  • Other functions

    • FD_ZERO (fdset) :
      • Empty the specified file descriptor set. The file descriptor set must be initialized before being set. If the file descriptor set is not cleared, the result is unknown because the system does not clear the specified file descriptor set after allocating memory space.
    • FD_SET(fd_set, fdset)
      • Use to add a new file descriptor to the file descriptor collection.
    • FD_CLR(fd_set, fdset)
      • Deletes a file descriptor from a file descriptor collection.
    • FD_ISSET(int fd, fdset)
      • Tests whether the specified file descriptor is in the collection,
      • Used to test whether the relevant fd bit in the phrase set is true.

3 and disadvantages

  • The &rset is an array of bitmaps that can hold up to 1024 bits, depending on the operating system.
  • FD_SET is not reusable and must be reassigned each time through the for loop.
  • User mode – kernel mode, still has performance overhead.
  • The select function is wait blocking and O(n) query complexity.

2. Poll functions

1. Poll function introduction

  • Poll () and poll() are essentially the same as poll() system calls. Poll () has a similar mechanism to select() and is not much different in nature from select(). Poll (), however, has no limit on the maximum number of file descriptors (but performance degrades if you have too many).
  • A drawback of poll() and select() is that arrays containing a large number of file descriptors are copied between the user state and the kernel’s address space as a whole, regardless of whether the file descriptors are ready or not, and their overhead increases linearly with the number of file descriptors.
  • Monitor and wait for property changes for multiple file descriptors
    int poll(struct pollfd *fds, nfds_t nfds, int timeout);
Copy the code
  • Parameter is introduced

    • The FDS:
      • Pointer to the 0th element of an array of structures, each of which is a struct pollFD structure that specifies the conditions under which a given FD is tested.
      • Each pollFD structure specifies a file descriptor to be monitored, and multiple structures can be passed, instructing poll() to monitor multiple file descriptors.
      • Pollfd structure
        • Int fd: file descriptor
        • Short Events: Defines the type of event
        • Short Revents: Defines the type of event to return
      • Events: specifies the events (input, output, error) to monitor fd. Each event has multiple values, as follows:
      • Revents: The Revents field is the result of the operation event of the file descriptor, and the kernel sets this field when the call returns. Any events requested in the Events domain may be returned in the Revents domain.
      • Note: The Events field for each structure is set by the user to tell the kernel what we are interested in, whereas the Revents field is set by the kernel on return to indicate what events occurred for that descriptor.
    • NFDS: Specifies the number of elements in the first argument array.
    • Timeout: Specifies the number of milliseconds to wait. Poll () returns whether the I/O is ready or not.
      • -1: Wait forever until something happens;
      • 0: returns immediately.
      • 0: indicates the number of milliseconds to wait.

  • The return value

    • On success, poll() returns the number of file descriptors in the structure whose Revents field is not 0;
      • If no event occurs before the timeout, poll() returns 0;
    • On failure, poll() returns -1 and sets errno to one of the following values:
      • EBADF: The file descriptor specified in one or more structures is invalid.
      • EFAULT: The FDS pointer points to an address outside the process’s address space.
      • EINTR: A signal is generated before the requested event and the call can be re-initiated.
      • EINVAL: the NFDS parameter exceeds the PLIMIT_NOFILE value.
      • ENOMEM: Insufficient free memory to complete the request.

The working principle of

  • 1. Similarly, in a for loop, we use the Accept function to define a file descriptor that is directly assigned to a PollFDS array
    • Assign the POLLIN event type to the Events property.
  • 2. Also, in a while loop, poll is executed
    • The poll function sets the file descriptor with the data to the Revents property in the PollFD structure.
  • 3. Iterate through the file descriptor to see if there is any data, and if so, perform read operation
    • Determine the Revents attribute and event type at this point

Optimal point

  • To solve the storage upper limit problem of &rset, we declare a PollFDS array, theoretically unlimited.
  • The revent property in the PollFD structure is used to control whether the file descriptor has data reached or not.

Epoll function

3.1. Function introduction

3.1.1, epoll_create

    int epoll_create(int size);
Copy the code
  • Create a handle to epoll. Size tells the kernel how many listeners there are.
  • This parameter is different from the first parameter in select () and gives the maximum fd+1 value for listening.
    • When the epoll handle is created, it takes up a fd value, which can be viewed in /proc/process on Linux.
    • After epoll is used, close must be called to close it, otherwise fd may be used up.
  • When this method is called, an evemtpoll structure is also created:
Struct eventpoll {.../* * The root node of the red-black tree that stores all the events added to epollstruct rb_root rbr;/* * The bidirectional linked list rdlList holds the events that will be returned to the user via epoll_wait that meet the criteria */struct list_head rdllist; . };Copy the code
  • The kernel not only creates a file node handle in the epoll file system, but also creates more than one structure.
  • Create a red-black tree in the kernel cache to store future epoll_ctl incoming sockets.
  • An RDLList is also created to store ready events.
  • When epoll_wait is called, the rDLList is simply observed to see if there is any data in the list. If there is no data in the list, the rDLList will be returned. If there is no data in the list, the rDLList will be returned.

3.1.2, epoll_ctl

    int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
Copy the code
  • Epoll’s event registration function, unlike the select() function, tells the kernel what type of event to listen for when it listens for events, but registers the type of event to listen for first.
  • Parameter Description:
    • Int epfd: The epoll handle object returned by epoll_create.
    • Int op: indicates the registered action
      • EPOLL_CTL_ADD: Registers a new FD to an EPFD
      • EPOLL_CTL_MOD: Modifies a FD registered in an EPFD.
      • EPOLL_CTL_DEL: Deletes a FD from an EPFD.
    • Int fd: the fd to listen on. A FD file descriptor represents a client socket.
    • Struct epoll_event: tells the kernel what type of event to listen for.
  • Struct epoll_event:
    struct epoll_event {
      __uint32_t events;  /* Epoll events */
      epoll_data_t data;  /* User data variable */
    };
Copy the code
  • Events can be a collection of the following macros:
    • EPOLLIN: indicates that the corresponding file descriptor is readable (including that the peer socket is normally closed)
    • EPOLLOUT: indicates that the corresponding file descriptor can be written
    • EPOLLPRI: indicates that the corresponding file descriptor has urgent data to read, indicating that out-of-band data is arriving
    • EPOLLERR: Indicates that an error occurs in the corresponding file descriptor
    • EPOLLHUP: indicates that the corresponding file descriptor is hung up
    • EPOLLET: Treat EPOLL as Edge Triggered, as opposed to Level Triggered.
    • EPOLLONESHOT: only listens for one event. If you want to listen on the socket after this event, you need to add the socket to the epoll handle again.
  • A callback after an event
    • When the listening event occurs, a callback method called ep_poll_callback is invoked.
    • An EPItem structure is created for each event in epoll.
    • The epItem of the event is then placed into the RDLList bidirectional list.
struct epitem { ...// Red black tree node
           struct rb_node rbn;
           // Two-way linked list node
           struct list_head rdllink;
           // Event handle information
           struct epoll_filefd ffd; 
           // points to the eventepoll object to which it belongs
           struct eventpoll *ep; 
           // Expected event typestruct epoll_event event; . };// This contains information for each event.
Copy the code

3.1.3, epoll_wait

    int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout);
Copy the code
  • Wait for an event to occur, similar to a select () call.

  • Parameter Description:

    • Int epfd: The epoll handle object returned by epoll_create
    • Struct epoll_event * events: collection of all events obtained from the kernel.
    • Int maxEvents: tells the kernel the maximum number of events that can be processed.
    • Int timeout: indicates the timeout period.
      • 0: Returns immediately
      • -1: will be uncertain
      • 0: indicates the waiting time.

  • Working principle:

    • Epoll_wait () is called to check for any connection that has an event, just to check for epItem elements in the rDLList of the eventPoll object created by epoll_ctl().
    • If the RDLList is not empty, the events here are copied to user-mode memory (using shared memory for efficiency) and the number of events is returned to the user.
  • conclusion

    • A red-black tree, a rDLList of ready handles, and a small amount of kernel cache help us solve the problem of socket processing under large concurrency
    • When epoll_create() is executed, the red-black tree and ready list are created.
    • When epoll_ctl() is executed, if a socket handle is added, the system checks whether the socket handle exists in the red-black tree. If the socket handle exists, the system returns the result immediately. If the socket handle does not exist, it is added to the red-black tree.
      • It then registers a callback function with the kernel to insert data into the ready list RDLList when an interrupt event occurs.
    • When epoll_wait() is specified, data in the rDLList is returned immediately.

3.2 Trigger mode of epoll

3.2.1 EPOLLLT level trigger

  • This mode is Triggered by Level.
  • In this mode, epoll_wait returns its event each time the file descriptor has data to read, reminding the user to act on it.

3.2.1 EPOLLET edge trigger

  • Edge Triggered

  • In this mode, when it detects an I/O event, the epoll_wait call gets a file descriptor that is notified of the event. For each notified file descriptor, if readable, the file descriptor must be read until it is empty and errno returns EAGAIN. Otherwise, the next epoll_wait will not return the remaining data and the event data will be lost.

    • If the ET mode is not non-blocking, then reading or writing all the time is bound to block the last time.
  • ET mode greatly reduces the number of epoll events to be triggered repeatedly, so it is more efficient than LT mode.

    • When epoll works in ET mode, it must use a non-blocking socket to avoid starving the task of processing multiple file descriptors due to a blocking read/write operation on a file handle.
  • Why does epoll have EPOLLET trigger mode?

    • In EPOLLLT mode, if there are a large number of ready file descriptors in the system that you don’t need to read or write, they will be returned every time epoll_wait is called, which greatly reduces the efficiency of the processor to retrieve the ready file descriptors it cares about.
    • In EPOLLET’s edge-triggered mode, epoll_wait() tells the handler to read or write when a read or write event occurs on the file descriptor being monitored.
    • Epoll_wait () does not notify you of the next call to the file descriptor if the read/write buffer is too small. It does not notify you until the second read/write event occurs on the file descriptor.
    • This mode is more efficient than horizontal triggering, and the system is not flooded with ready file descriptors that you don’t care about.
  • conclusion

    • ET mode (edge trigger) is triggered only when data arrives, regardless of whether there is any data in the cache.
    • LT mode (horizontal, default) Epoll_wait returns if there is any data in the buffer.

conclusion

  • Epoll_create: When the Redis server starts up, it creates an event loop and calls the epoll_CREATE method to create an epoll instance.
  • Epoll_ctl: When a new client connection occurs, the new connection descriptor is registered with the epoll instance.
  • Epoll_wait: Call epoll_wait to obtain THE I/O events generated by the client.