Abstract: From the development history of the communication library to the design of the communication model, this article walks you through the principles of GaussDB communication.

MPPDB communication library development history

1. Postgres-XC

Methods: The connection between the CN and DNs was implemented with the libpq communication library. The CN was responsible for computation, and the DNs were used only for data storage.

Disadvantages: As computing tasks increased, the CN became a computing bottleneck, and the computing capacity of the DNs was wasted.

2. V1R3

Methods: A distributed execution framework was proposed, and computation was pushed down to the DNs, which no longer only stored data but also undertook computation tasks. Because each DN stores only part of the data, the DNs also need to communicate with each other. Stream threads communicate over direct TCP connections, and each connection requires a TCP port.

Disadvantages: As the cluster scale increases, the number of connections between DNs increases. In a high-concurrency scenario, a large number of TCP ports are required. Because a single machine has only 65535 TCP ports, this severely limits the cluster size and concurrency of the database.

3. V1R5-V1R7C00

Methods: Taking advantage of the multi-streaming capability of the SCTP protocol, an SCTP communication library architecture was adopted for communication between DNs, and logical connections were used instead of physical connections to solve the problem of too many connections between DNs.

Disadvantages: The SCTP protocol itself has kernel bugs that cause communication errors.

4. V1R7C10

Method: TCP proxy communication library.

Disadvantages: The number of physical connections between the CN and DNs still grows sharply.

5. V1R8C00

Methods: Logical connections were also adopted between the CN and DNs, that is, CN multi-stream.

Question 1: Is the data exchanged between DNs query results or raw data?

Solution: Both. The data exchanged between DNs is hashed first, and each DN obtains the part it needs.
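As a concrete illustration of this hash-based exchange (a minimal sketch, not the hash function GaussDB actually uses), the DN that should receive a row can be chosen from a hash of its distribution key:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: pick the target DN for a row by hashing its
 * distribution key, so each DN ends up with just the rows it needs. */
static int target_dn(uint64_t distribution_key, int dn_count)
{
    /* GaussDB uses its own hash; a simple multiplicative hash stands in here. */
    uint64_t h = distribution_key * 11400714819323198485ULL;
    return (int)(h % (uint64_t)dn_count);
}

int main(void)
{
    for (uint64_t key = 1; key <= 5; key++)
        printf("row with key %llu -> DN %d\n",
               (unsigned long long)key, target_dn(key, 4));
    return 0;
}
```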

Design of communication model

1. Pooler communication model

Question 2: Do the CN and DN connections held by different pools intersect, i.e., can a DN connection in pool A also appear in pool B?

Solution: No.

Question 3: What is the connection between CNs used for? Why are the connection pool's connections attached to the PG threads on the CN? Why does a CN have multiple PG threads, and what are they used for?

Solution: The connection between CNs is used for data synchronization, including information exchange after a table is created. The PG threads on a CN are also created to serve user queries.

Question 4: Why does a DN have multiple PG threads, and what are they used for?

Solution: Each query operation creates a new PG thread.

Question 5: How should we understand that when the gsql client exits, only the PG thread and the agent proxy thread exit, while the slot connection that was acquired is returned to the database pool?

Solution: The slot connection is returned to the pooler so that it can be obtained directly the next time; if it does not exist, it is created again.

Question 6: The number of TCP connections between DNs is C = (H * D * Q * S) * D. Why is this multiplied by D instead of D * (D - 1) / 2? Also, does this formula involve a lot of repetition, establishing communication between DNs for every query regardless of reuse? Is there a limit on the number of communication threads between DNs?

Solution: Data communication is bidirectional, but each operator sends data in only one direction, so the count cannot be divided by 2. The number is limited, but it can be adjusted.
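A small illustrative calculation (my own, with D standing for the number of DNs) of why one-way operator channels are counted per direction rather than per DN pair:

```c
#include <stdio.h>

int main(void)
{
    int d = 4;                               /* assumed number of DNs                     */
    int undirected_pairs  = d * (d - 1) / 2; /* what dividing by 2 would count: 6         */
    int directed_channels = d * d;           /* one-way channels, matching the formula's
                                                final multiplication by D: 16             */
    printf("undirected DN pairs: %d\n", undirected_pairs);
    printf("directed channels  : %d\n", directed_channels);
    return 0;
}
```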

2. Stream thread communication model between DN

The execution plan needs to be executed by multiple threads, including one CN thread and three threads (T1-T3) on each DN. Each DN node has three threads because the data is distributed across all DNs: during execution, every DN must participate in each stage of the query. Bottom-up is the data flow and top-down is the control flow. The specific execution process is as follows:

  • T3 on each DN scans the table sequentially and broadcasts the table data to the T2 threads of all DNs.
  • T2 on each DN receives the data from T3 and builds a hash table, then scans the store_sales data and performs a hash join with the customer table. After a hash aggregation, the join results are redistributed and sent to the T1 thread of the corresponding DN.
  • The T1 thread on each DN performs a second hash aggregation on the received data, sorts the results, and sends them to the CN.
  • The CN collects the data returned by all DNs and returns it as the result set.

Question 7: Is the query executed on the DNs? Does the broadcast data belong to the same table? If every DN broadcasts its data, do all DNs end up with the same data? In the figure, does T2 send data to the T1 threads of all DNs?

Solution: Yes, the query is executed on the DNs. The data stored on a DN is not a whole table or a part of a single table, but a part of every table, for the sake of concurrency. The data received by each DN differs, because each DN fetches only what it needs.

SCTP communication library design

1. Summary

  • SCTP: a reliable, order-preserving protocol that supports message-based transmission. A single channel supports 65535 streams, and the streams do not block one another. This feature removes the limit that the number of physical connections places on communication between nodes in a large-scale cluster and supports a larger node scale.
  • Channel sharing: Between every two nodes there is one unidirectional SCTP physical channel for data transmission. Inside this physical channel there are many logical channels (inner streams). Each stream carries data from a producer to a consumer, and different producer/consumer pairs use different streams in the channel (SCTP streams), so only two data channels are required between every two nodes.
  • Channel multiplexing: After a query completes, the logical connections inside the physical channel are closed, but the physical connection is not; subsequent queries continue to use the physical connection.
  • Flow control: A pull mode is adopted, because all streams in the SCTP channel share the kernel socket buffer. If one connection sent a large amount of data that its consumer did not receive, the kernel buffer would fill up and block the sending of other streams, so flow control is added. Each stream is assigned a quota, allocated by the receiver; when quota is available, the sender is told that it may send data. To guarantee the reliability of control information, control information is separated from the data channel, and flow-control messages travel over a separate bidirectional TCP control channel, as sketched below.
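The following is a minimal sketch (not the actual libcomm definitions) of how a logical connection can be addressed independently of any physical socket, and what a quota grant on the separate TCP control channel might carry. CTRL_READY appears later in this article; CTRL_ADD_QUOTA and CTRL_CLOSE are names assumed here purely for illustration.

```c
#include <stdint.h>

/* Sketch only: a logical connection is identified by (node, stream), so
 * many of them can share one physical channel between two nodes. */
typedef struct {
    uint16_t node_idx;   /* which peer node the logical connection targets */
    uint16_t stream_idx; /* stream id inside the shared physical channel   */
} logical_conn_id;

/* Control messages travel on the separate bidirectional TCP control channel. */
typedef enum {
    CTRL_READY,     /* peer has set up the logical connection                  */
    CTRL_ADD_QUOTA, /* receiver grants the sender more bytes it may send       */
    CTRL_CLOSE      /* logical connection closed; the physical channel stays up */
} ctrl_msg_type;

typedef struct {
    ctrl_msg_type   type;
    logical_conn_id conn;
    uint32_t        quota_bytes; /* meaningful for CTRL_ADD_QUOTA */
} ctrl_msg;
```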

2. The architecture

  • TCP Channels: TCP control channels through which the control flow passes.
  • SCTP Channels: the SCTP data channel, which contains many streams; the data flow passes through this channel.
  • Send Controller: the flow control thread on the sending end, gs_senders_flow_controller(), which sends and receives control messages.
  • Recv Controller: the flow control thread on the receiving end, gs_receivers_flow_controller(), which sends and receives control packets. It differs from the proxy receiving thread: the proxy receiving thread receives data, while the receiving flow control thread receives control packets.
  • Receiver: the proxy (agent) receiving thread, gs_receivers_loop(), which receives data from the SCTP data channel, puts the data into the cmailbox of the corresponding logical connection, and notifies the consumer worker thread in the executor to fetch it. The receiving flow control thread informs the sender via the TCP control channel how much free memory is available, that is, how much quota remains for receiving further data.
  • Auxiliary: the auxiliary thread, gs_auxiliary(), checked by the top consumer every two seconds; it handles common housekeeping such as DFX messages and Cancel signal responses.
  • Data PULL model: Each logical connection has a buffer of quota size. When the consumer needs data, it reports the size of its free buffer to the sender; the sender can send at most the quota, and blocks when the quota runs out until the receiver fetches data from the buffer and frees space (see the sketch after this list).
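Below is a minimal sketch of the pull model described above (illustrative structures, not the real cmailbox): the receiver buffers at most one quota of data per logical connection, and the bytes the consumer drains become the new quota granted back to the sender.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define QUOTA_BYTES 8192          /* assumed per-stream quota size */

typedef struct {
    uint8_t  buf[QUOTA_BYTES];
    uint32_t used;                /* bytes currently buffered */
} mailbox;

/* Receiver agent: store an incoming chunk; returns 0 if the sender exceeded
 * its quota (which flow control is meant to prevent). */
int mailbox_put(mailbox *mb, const uint8_t *data, uint32_t len)
{
    if (mb->used + len > QUOTA_BYTES)
        return 0;
    memcpy(mb->buf + mb->used, data, len);
    mb->used += len;
    return 1;
}

/* Consumer worker: drain the mailbox; the freed byte count is what the
 * receiving flow control thread would grant back to the sender as quota. */
uint32_t mailbox_drain(mailbox *mb, uint8_t *out)
{
    uint32_t freed = mb->used;
    memcpy(out, mb->buf, freed);
    mb->used = 0;
    return freed;
}

int main(void)
{
    mailbox mb = { .used = 0 };
    uint8_t chunk[64] = { 0 }, out[QUOTA_BYTES];

    mailbox_put(&mb, chunk, sizeof(chunk));
    printf("quota to grant back: %u bytes\n", mailbox_drain(&mb, out));
    return 0;
}
```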

TCP multi-stream implementation

The TCP proxy builds on the existing implementation of logical connections, proxy-based dispatching, and quota flow control, and changes the data channel from the SCTP protocol to the TCP protocol. The TCP proxy logical connection is implemented on a head+data packet send/receive model: the logical connection ID and the length of the following data are written into the HEAD.

Question 8: A single TCP node has only 65535 ports; what about SCTP? What is the difference between TCP multi-stream and TCP ports? Is the TCP three-way handshake still the same?

SCTP transmits data as message streams; the minimum unit of sending and receiving is a chunk. One SCTP association can support multiple streams at the same time, and each stream contains a series of chunks of user message data. The TCP protocol supports only one stream, so we distinguish the data of different logical connections within that stream by writing the logical connection ID and the length of the following data into the HEAD of each packet. Although TCP carries only a single byte stream, each packet carries its own block, which realizes TCP multi-stream. The head and data of a whole packet must be sent atomically as one unit.
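A minimal sketch of the head+data framing (field widths and layout are assumptions for illustration, not the actual libcomm wire format): the receiver reads one header, then exactly that many payload bytes, and routes the payload to the mailbox of the logical connection named in the header.

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

typedef struct {
    uint16_t logic_conn_id; /* which logical connection the chunk belongs to */
    uint32_t data_len;      /* payload bytes that follow this header         */
} msg_head;

/* TCP may return partial reads, so loop until exactly n bytes arrive. */
static ssize_t read_full(int fd, void *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, (char *)buf + got, n - got);
        if (r <= 0)
            return r;       /* error or peer closed the connection */
        got += (size_t)r;
    }
    return (ssize_t)got;
}

/* Receiver agent loop body: head first, then its payload, then dispatch. */
int recv_one_chunk(int fd, uint8_t *payload, size_t cap)
{
    msg_head head;

    if (read_full(fd, &head, sizeof(head)) <= 0)
        return -1;
    if (head.data_len > cap)
        return -1;          /* would overflow the caller's buffer */
    if (read_full(fd, payload, head.data_len) <= 0)
        return -1;

    /* ...route payload to the mailbox of head.logic_conn_id... */
    printf("chunk for logical connection %u (%u bytes)\n",
           head.logic_conn_id, head.data_len);
    return 0;
}
```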

Question 9: How is multi-stream achieved? Is it multiple channels sending head+data simultaneously?

Solution: One stream sends a complete piece of data; multi-stream means multiple streams doing so concurrently.

The sender side of the TCP proxy communication library works as follows. Producer A and producer B each send data to DN1. To guarantee the integrity of the head+data pair at the sending end, producer A first acquires the lock on the single physical channel while producer B waits on the lock. Producer A sends the complete data to the local protocol stack, returns, and releases the lock; producer B then obtains the lock and sends its data. The more nodes there are, the fewer lock conflicts occur.
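A sketch of the sender-side locking just described (illustrative, not the actual implementation): all producers sharing one physical channel serialize on a per-channel lock, so a header and its payload reach the local protocol stack as one contiguous unit and can never interleave with another producer's chunk.

```c
#include <pthread.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

typedef struct {
    uint16_t logic_conn_id;
    uint32_t data_len;
} msg_head;

/* One lock per physical channel (per peer node), shared by all producers. */
static pthread_mutex_t channel_lock = PTHREAD_MUTEX_INITIALIZER;

static int send_full(int fd, const void *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t s = send(fd, (const char *)buf + sent, len - sent, 0);
        if (s <= 0)
            return -1;
        sent += (size_t)s;
    }
    return 0;
}

/* Producer A holds the lock while writing head + data; producer B blocks on
 * the lock until the whole chunk has been handed to the protocol stack. */
int send_chunk(int fd, uint16_t conn_id, const void *data, uint32_t len)
{
    msg_head head = { conn_id, len };
    int rc = 0;

    pthread_mutex_lock(&channel_lock);
    if (send_full(fd, &head, sizeof(head)) != 0 || send_full(fd, data, len) != 0)
        rc = -1;
    pthread_mutex_unlock(&channel_lock);
    return rc;
}
```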

Question 10: Does waiting on the lock affect efficiency? Is the SCTP implementation the same?

Solution: No, because there is a cache buffer.

Question 11: What if data is lost and the HEAD never arrives? Would it be cached forever?

Solution: Data is not lost; the TCP protocol guarantees delivery.

CN multi-stream implementation

In V1R8C00, the link between the CN and DNs uses the libcomm communication library, that is, there is only one physical channel between two nodes, and logical connection channels on top of it are used for parallel communication.

1. CN process:

  • Set up a connection: The CN calls the libcomm gs_connect to set up a connection with the DN; the input parameters specify that a bidirectional logical channel should be established. Using the same nidx and sidx, the CN initializes the sending and receiving mailboxes and notifies the receiver via the sending flow control thread.
  • Wait for the DN to confirm: By checking the state of the sending mailbox, the CN confirms that the logical connection has been successfully established on the DN side (its sending flow control thread has received the CTRL_READY message from the DN side).
  • Send the startup packet: PQbuildPGconn is used to initialize the libpq pg_conn structure and generate the startup packet, which is then sent to the DN using gs_send.
  • Wait for the DN PG thread to be initialized: After LIBCOMMgetResult waits for the Ready-for-Query message sent by the DN, the connection is successfully established.
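Summarizing the four CN-side steps as a sketch (the phase names below are my own summary, not the libcomm API):

```c
/* Hypothetical phase enum for the CN-side connection setup described above. */
typedef enum {
    CN_GS_CONNECT,           /* gs_connect builds the bidirectional logical channel (nidx/sidx) */
    CN_WAIT_CTRL_READY,      /* sending mailbox shows CTRL_READY received from the DN           */
    CN_SEND_STARTUP_PACKET,  /* startup packet from PQbuildPGconn is sent with gs_send          */
    CN_WAIT_READY_FOR_QUERY, /* LIBCOMMgetResult waits for Ready-for-Query from the DN          */
    CN_CONNECTED             /* connection successfully established                             */
} cn_connect_phase;
```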

2. DN end process:

  • Initialize the sending and receiving mailboxes: After the flow control thread on the DN side recognizes that the connection request comes from a CN, it calls gs_build_reply_conntion, registers the CN's information, initializes the sending mailbox, and then initializes the receiving mailbox. Finally, the flow control thread returns the CTRL_READY message, indicating that the logical channel has been successfully established.
  • Create a Unix domain socket: The DN's receiving flow control thread creates a Unix domain socket and sends the generated logical connection address to the postmaster main thread through this channel.
  • Fork a postgres child thread: The serverloop of the postmaster main thread listens on the Unix domain socket, receives the logical connection address, saves it in the Port structure, and forks a postgres child thread (using the original logic).
  • The postgres thread completes initialization: After the PG thread completes initialization, it sends the Ready-for-Query message to the CN and then enters the ReadCommand function to wait for the next query from the CN.
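And the matching sketch for the DN side (again, the names are my own summary of the steps above, not real identifiers):

```c
/* Hypothetical phase enum for the DN-side handling of a CN connection request. */
typedef enum {
    DN_RECV_CN_REQUEST,      /* flow control thread recognizes a request from a CN             */
    DN_BUILD_REPLY_CONN,     /* gs_build_reply_conntion registers the CN and inits mailboxes   */
    DN_SEND_CTRL_READY,      /* logical channel established; CTRL_READY returned to the CN     */
    DN_PASS_ADDR_UNIX_SOCK,  /* logical connection address handed to the postmaster            */
    DN_FORK_POSTGRES_THREAD, /* postmaster serverloop forks the postgres child thread          */
    DN_SEND_READY_FOR_QUERY  /* PG thread initialized; waits in ReadCommand for the next query */
} dn_connect_phase;
```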

This document is shared from the Huawei Cloud community article "GaussDB Communication Principles Summary" by Caesar.
