The Alibaba Cloud Elasticsearch team is hiring for a variety of R&D positions. If you are interested, feel free to reach out or email [email protected]

This article covers:

  1. How does ES handle an incoming request?
  2. Which threads process HTTP and TCP requests?
  3. How is TCP communication between ES nodes sent and received?
  4. What should ES plugin development watch out for?

Introduction to ES Network Framework

  • Before the current 7.x versions, ES used Netty as its network framework. However, ES has long aimed for self-sufficiency and has wanted to break away from Netty since 2017, writing its own network framework on top of Java NIO. After 7.x, it was finally shipped as an official plugin [Issue 27260].
  • That said, Netty and the self-developed NIO framework share the same essence: both use the Selector and Channel provided by Java NIO to implement a multiplexing Reactor model. Each new channel registers with a selector, and all subsequent requests on that channel are handled by that selector. Each selector is bound to a fixed transport_worker thread, so all requests on a given channel are handled by a single thread.

  • A selector continuously polls the channels it manages for readiness; when a channel is ready, the event is consumed by the pipeline. In Netty this logic is in io.netty.channel.nio.NioEventLoop#run; in the NIO plugin, see org.elasticsearch.nio.NioSelector#runLoop.
  • By default, a pipeline executes on the selector thread, so if a pipeline handler blocks, the entire selector blocks and all the channels it manages become unresponsive.

  • The overall network model is shown below. One special point in ES: in both the Netty and NIO frameworks, the bossGroup (the thread pool that accepts connections) is shared with the workerGroup (the transport_worker thread pool)
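To make the Reactor model above concrete, here is a minimal single-threaded reactor sketch in plain Java NIO. It is a toy stand-in for what Netty's NioEventLoop and ES's NioSelector do, using a Pipe instead of a network socket so it is self-contained; one selector polls all registered channels on one thread, which is why a blocking handler would stall every channel it manages.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;

// Toy single-threaded reactor: one selector polls all registered channels,
// and exactly one thread consumes every ready event. A blocking handler in
// the loop below would stall every channel this selector manages.
public class MiniReactor {
    public static String runOnce() {
        try {
            Selector selector = Selector.open();
            Pipe pipe = Pipe.open();
            pipe.source().configureBlocking(false);
            // Register the channel with the selector, as a transport_worker does.
            pipe.source().register(selector, SelectionKey.OP_READ);
            // A "remote peer" writes a message into the channel.
            pipe.sink().write(ByteBuffer.wrap("ping".getBytes(StandardCharsets.UTF_8)));
            // The selector thread polls for readiness and consumes the event.
            selector.select();
            StringBuilder out = new StringBuilder();
            for (SelectionKey key : selector.selectedKeys()) {
                ByteBuffer buf = ByteBuffer.allocate(16);
                ((Pipe.SourceChannel) key.channel()).read(buf);
                buf.flip();
                out.append(StandardCharsets.UTF_8.decode(buf));
            }
            selector.close();
            pipe.sink().close();
            pipe.source().close();
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```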

ES HTTP communication flow

  • As is well known, an ES process listens on two ports: HTTP (9200), used to communicate with clients, and TCP (9300), used for internal inter-node communication.
  • Let’s take a look at how an HTTP request is handled using the Netty framework as an example.
    1. After accepting a channel, Netty registers it: ChannelInitializer#channelRegistered
    2. Pipeline registration is performed by ES's custom channel handler: HttpChannelHandler#initChannel
    3. Pipeline execution starts at AbstractChannelHandlerContext#invokeChannelRegistered. You can see that all pipeline execution happens on channel.eventLoop, the same thread the selector is bound to
    4. The last pipeline handler calls the logic we write in a RestAction. In our own development we can either return directly, such as sendResponse in RestCatAction, or respond from a callback, as RestBulkAction does (but never block)

  • As the above shows, unless the business layer hands work off to another thread pool, the entire HTTP processing flow runs on the HTTP selector thread.
  • Another point worth noting: as SharedGroupFactory#getHttpGroup shows, if httpWorkerCount is not set, HTTP reuses TCP's eventGroup, so HTTP and TCP actually share the same selectors
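The "do not block the selector thread" rule can be illustrated with a generic sketch. The class and method names below are illustrative, not the real ES API: a handler invoked on the event-loop thread immediately hands the slow work to its own pool and responds via a callback, the same shape as RestBulkAction's callback-style sendResponse.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Sketch of the "never block the selector thread" rule. The names here
// (DispatchingHandler, handleRequest, sendResponse) are illustrative stand-ins,
// not the real ES RestAction API.
public class DispatchingHandler {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Called on the selector (transport_worker) thread: must return quickly.
    public void handleRequest(String request, Consumer<String> sendResponse) {
        workers.submit(() -> {
            // Potentially slow, blocking work happens off the event loop.
            String result = "handled:" + request;
            sendResponse.accept(result); // callback-style response, like RestBulkAction
        });
    }

    public void shutdown() { workers.shutdown(); }
}
```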

ES TCP communication thread introduction

  • TCP's framework is similar to HTTP's, except the pipelines differ: HTTP channels are created when a request arrives, whereas TCP channels are opened at node initialization.
  • Those familiar with ES know that when ES nodes initialize, they build a fully connected graph: every pair of nodes establishes TCP connections. Each channel is registered with a randomly chosen selector, i.e. a transport_worker, and all subsequent requests on that connection are carried by that transport_worker thread.
  • By default, ES opens 2 recovery connections, 3 bulk connections, 6 general-purpose connections, 1 state connection, and 1 ping connection between each pair of nodes.
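Assuming the standard 7.x setting names (verify them against your version's documentation), these per-pair connection counts correspond to the transport.connections_per_node.* settings in elasticsearch.yml:

```yaml
# Per-node-pair connection counts (defaults shown; setting names assumed from 7.x)
transport.connections_per_node.recovery: 2
transport.connections_per_node.bulk: 3
transport.connections_per_node.reg: 6
transport.connections_per_node.state: 1
transport.connections_per_node.ping: 1
```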

  • However, unlike HTTP/1.x, where each connection handles requests synchronously one at a time, ES's TCP communication resembles HTTP/2 multiplexing: instead of waiting for a response after each request, a connection may carry request->request->response->response out of order.
  • So how does ES manage this multiplexing? Just as HTTP/2 reassembles streams using the stream identifier in the frame header, ES assigns a unique requestId to every TCP request.

  • The specific flow is in TransportService#sendRequestInternal: register the requestId, save the response handler, and send the request to the remote node. When the response arrives, the handler is looked up by requestId and processing continues. If no dedicated thread pool is configured, all of this executes on the transport thread.
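The requestId bookkeeping can be sketched as follows. This is a simplified stand-in for TransportService#sendRequestInternal and the response path, not the real ES code: the handler is registered under a fresh requestId before sending, and looked up again when the matching response frame arrives, even if responses come back out of order.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Simplified sketch of ES-style request multiplexing over one TCP connection.
// Not the real ES API: a stand-in for the requestId/handler bookkeeping done
// around TransportService#sendRequestInternal.
public class RequestMultiplexer {
    private final AtomicLong idGenerator = new AtomicLong();
    private final Map<Long, Consumer<String>> handlers = new ConcurrentHashMap<>();

    // Register the handler and return the requestId that tags the wire frame.
    public long sendRequest(Consumer<String> responseHandler) {
        long requestId = idGenerator.incrementAndGet();
        handlers.put(requestId, responseHandler);
        // ... here the request bytes, tagged with requestId, would be written ...
        return requestId;
    }

    // Called when a response frame arrives, possibly out of order.
    public void handleResponse(long requestId, String response) {
        Consumer<String> handler = handlers.remove(requestId);
        if (handler != null) {
            handler.accept(response); // runs on the transport thread by default
        }
    }
}
```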

Development needs attention

  • As the above shows, the main risk in ES's network framework lies in thread-pool reuse; once something goes wrong, a large number of requests will be blocked
    1. If all transport threads are blocked, new connections cannot be accepted and the entire node cannot respond to any request
    2. Since TCP and HTTP share the same thread pool by default, if HTTP requests are too heavy, or if RestAction logic blocks, then not only are the other HTTP channels managed by the same selector blocked, the TCP channels are blocked as well. Communication between nodes then misbehaves: responses intermittently fail to arrive, writes intermittently fail (the blocked channels may serve different purposes), and so on
  • So, during development, be careful to:
    1. Never block in a RestAction; if you must make a synchronous call, dispatch it to a separate thread pool
    2. For cross-node communication, set an explicit thread pool for both listeners and transport handlers rather than the default SAME
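Point 2 can be sketched like this (illustrative names, not the real ES API): instead of invoking a response handler inline on the transport thread, which is what the default SAME executor amounts to, hand it to an explicitly chosen pool so a slow handler cannot stall the selector.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

// Sketch of running handlers on an explicit pool instead of the transport
// thread. Illustrative names, not the real ES API; in ES this corresponds
// to choosing an executor other than the default SAME.
public class ExecutorDispatch {
    private final ExecutorService pool;

    public ExecutorDispatch(ExecutorService pool) { this.pool = pool; }

    // Hand the handler to the dedicated pool instead of running it inline.
    public CompletableFuture<String> onResponse(String response, Function<String, String> handler) {
        return CompletableFuture.supplyAsync(() -> handler.apply(response), pool);
    }
}
```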

References

  • A super-detailed introduction to Netty: developer.aliyun.com/article/769…
  • Why it is advised to separate Netty I/O threads from business threads: cloud.tencent.com/developer/a…
  • Elasticsearch 7.10 source code