It’s not uncommon for users to submit an Issue labeled “Help Wanted” on TDengine’s GitHub. Most of these issues are not bugs, but usage problems caused by unfamiliarity with TDengine’s mechanics. We share some of the common ones on a regular basis, and hope you can learn something from them. This installment covers “How to make the TDengine client highly available.”
How does TDengine make the client highly available?
Recently, on TDengine’s GitHub, we encountered two cluster users who both mentioned issues with the TDengine client’s high availability:
That two users asked the same question independently shows how representative the problem is, so we chose it for this installment, hoping to draw some product-optimization insights from the users’ perspective.
If the node that the client connects to with taos -h FQDN -p port goes down, does the connection simply end in failure?
The answer is clearly “yes”.
One user told us, “It’s a shame that the high availability of the TDengine client hasn’t caught up with the high availability of the server side.”
It is true that the connection fails. However, TDengine does not encourage users to connect to a cluster this way. Why not? Let’s walk through it.
Suppose a user is connected to a TDengine cluster and the node they are connected to goes down. At this point, two kinds of high availability come into play: one on the server side and the other on the client side.
High availability on the server side means that when a TDengine node fails and does not respond within the specified time, the cluster immediately raises a system alarm and removes the failed node. At the same time, automatic load balancing is triggered, and the system transfers the data on that node to other data nodes.
High availability on the client side means that if the connection fails, TDengine immediately directs the client to another available database server so it can keep working.
Can TDengine do this? Of course it can.
This is exactly why we recommend that, instead of specifying a single FQDN in the URL or in taos shell arguments, you connect to the cluster through the client configuration file taos.cfg. The client then automatically connects to firstEp, and if firstEp happens to be down at the moment of connection, it tries the node specified by secondEp.
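For reference, here is a minimal sketch of what the client-side taos.cfg might look like; the host names td-node1 and td-node2 are placeholders for your own dnode FQDNs, and 6030 is the default server port:

```
# /etc/taos/taos.cfg (client side)

# First cluster node the client tries at connection time
firstEp    td-node1:6030

# Fallback node if firstEp is unreachable at that moment
secondEp   td-node2:6030
```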
It is worth noting that as long as either node can be reached, the client is fine. firstEp and secondEp are used only at the moment of connection; they are not a complete list of services, just entry points. In that brief instant after the cluster is reached, the client automatically obtains the address information of the management nodes, and the probability of both entry nodes being down at the same moment is extremely low. From then on, even if the firstEp and secondEp nodes both fail, the client can still use the cluster as long as the cluster itself can keep providing service. This is how TDengine maintains client-side high availability.
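As a quick usage sketch (assuming the taos.cfg above), starting the taos shell without -h and -p lets the client fall back to the configured entry points:

```
$ taos              # no -h/-p: the client tries firstEp, then secondEp
taos> show dnodes;  # once connected, the client knows the cluster topology
```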
Of course, this is not the only way to achieve client high availability. Both users had placed a load balancer in front of the cluster, and in doing so both ran into the same problem: when they configured layer-4 network load balancing, they forwarded only the TCP port, and the connection failed. Hence the common question on GitHub about how TDengine achieves high availability on the client side.
Let’s analyze why they still failed to connect when doing network load balancing.
The official TDengine documentation explains: considering that packets written in IoT scenarios are typically small, RPC supports UDP connections in addition to TCP. When a packet is smaller than 15 KB, RPC uses UDP; otherwise it uses TCP. Packets larger than 15 KB, as well as all query operations, are transmitted over TCP.
This is the answer. The packet used to establish the connection is smaller than 15 KB, so it travels over UDP.
Therefore, once they added UDP forwarding rules, the network load balancing in front of the cluster worked. Besides delivering the same client high availability, this setup also makes the “development” and “operations” views of the system clearer and easier to manage.
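For illustration, here is a minimal sketch of such a layer-4 setup using nginx’s stream module; the dnode host names are placeholders, 6030 is the default client port, and the key point is the second server block, which forwards UDP as well as TCP:

```
# nginx.conf -- layer-4 forwarding sketch (requires nginx built with --with-stream)
stream {
    upstream tdengine {
        server td-node1:6030;
        server td-node2:6030;
    }

    # TCP: queries and packets larger than 15 KB
    server {
        listen 6030;
        proxy_pass tdengine;
    }

    # UDP: connection setup and writes smaller than 15 KB --
    # without this rule the initial connection fails
    server {
        listen 6030 udp;
        proxy_pass tdengine;
    }
}
```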
Interestingly, yakir-yang, the first to solve the problem, also gave an enthusiastic answer to stringhuang, the second questioner, when the latter’s issue appeared. Having just worked through exactly the same question, he immediately recognized the other questioner’s pain point.
Thus three parties who had never met carried out a harmonious technical exchange in the open source community. The end result: once the questioners understood TDengine’s mechanics, they were able to build their own familiar high-availability strategies on top of the new product.
And that’s what we love to see.
Have you learned how to make the TDengine client highly available? If you run into any problems while using TDengine, you can submit an Issue on GitHub. Besides getting official technical support, you can also communicate with many like-minded users.
Github.com/taosdata/TD…
About the author: Chen Yu previously worked as a database manager at IBM and has we-media entrepreneurship experience in other industries. He is currently responsible for community technical support and related operations at Taos Data.