Today’s problem was certainly tricky, and it was also a typical one. Since nobody always has the energy to read dry technical text, today’s lead article shares a lighthearted TDengine user case instead.
Here’s what happened: a user built a two-node TDengine cluster on a Huawei Cloud intranet, and the cluster worked. Apart from this cluster, the user had a separate Huawei Cloud server, which was not on the same intranet and belonged to a different Huawei Cloud account. That server ran a standalone TDengine instance.
One day, he suddenly noticed that connecting from his local machine through JDBC-RESTful to the standalone TDengine worked fine, but connecting to the cluster failed with a timeout error.
In fact, for a JDBC-RESTful connection it should make no difference whether TDengine is standalone or a cluster: the client only connects to the HTTP service on port 6041, and the host behind that port provides the taosd service (standalone or cluster).
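To make that point concrete, here is a minimal Python sketch of what a RESTful client sends, whichever kind of deployment sits behind port 6041. It is an illustration under assumptions that the article itself does not state: the `/rest/sql` path, HTTP Basic authentication, and the default `root` / `taosdata` credentials reflect a typical TDengine setup and may differ in yours, and the function names and host names are hypothetical.

```python
import base64
import json
import urllib.request


def build_rest_request(host, sql, user="root", password="taosdata", port=6041):
    """Build (but do not send) an HTTP request for the RESTful interface.

    Assumptions: the endpoint path /rest/sql, Basic authentication, and the
    default credentials. Adjust them to match your own deployment.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"http://{host}:{port}/rest/sql",
        data=sql.encode("utf-8"),  # the SQL statement is sent as the request body
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


def run_sql(host, sql, timeout=5.0, **kwargs):
    """Send the request and return the decoded JSON response.

    Raises urllib.error.URLError (or a timeout error) when the port is
    unreachable, which is the client-side symptom described in this article.
    """
    request = build_rest_request(host, sql, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage: the same call works for a standalone host or a cluster node.
# run_sql("standalone.example.com", "show databases")
# run_sql("cluster-node-1.example.com", "show databases")
```

Because the request is identical in both cases, a timeout against only one of the two targets points to the network path in front of port 6041 rather than to TDengine itself.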
So it was odd for one to work and the other not. As soon as we saw the problem reported in the user group, we jumped in and started to investigate.
When a cloud server cannot be reached from the external network, our first suspect is the port policy of the security group. So we first asked the user to log in to the Huawei Cloud console for the cluster nodes and send us a screenshot of the security group configuration. After confirming that the security group rules looked fine, we moved on to the rest of the troubleshooting.
At first, we tried switching the cluster from its intranet IPs to its external IPs. That only made things worse: at that point the whole cluster stopped working, and the familiar “Unable to establish connection” error appeared.
In this case, checking port connectivity between the nodes is a must. When we ran telnet against the external IP on port 6041, the connection failed, yet everything was normal when we switched to the internal IP on port 6041.
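The telnet check above can also be scripted. The following is a small sketch using only the Python standard library; the function names and the example addresses are hypothetical and only illustrate the comparison between the external and internal IPs.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False


def compare_addresses(addresses, port=6041, timeout=3.0):
    """Probe the same port on several addresses and report each result.

    `addresses` maps a label (for example "external") to a host or IP.
    """
    return {
        label: can_connect(host, port, timeout)
        for label, host in addresses.items()
    }


# Hypothetical usage with placeholder addresses:
# compare_addresses({"external": "203.0.113.10", "internal": "192.168.0.10"})
```

A result where the internal address is reachable and the external one is not suggests that the service is listening and that something between the public network and the instance is blocking the port.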
This is where we got confused.
Was the external IP the problem? We checked: these were all elastic IPs, that is, IPs bound to the cloud servers. So how could telnet to the external IP on port 6041 fail to connect?
Eventually it occurred to us that a security group has to be associated with the server instance, otherwise its rules have no effect. So we hurried back to the console to check, and sure enough: the user had configured the rules but, being new to the cloud service and unfamiliar with its operation, had never associated this security group with the two cluster servers.
The reason the standalone node could be reached is very simple: it is governed by the security group policy of the other Huawei Cloud account.
This is the real cause of the bizarre incident above. On the surface it looked almost like a joke: on the cloud, TDengine works standalone but not as a cluster? In fact, the cluster and the standalone server belonged to two different accounts, and the cluster’s security group had been configured but never associated with the instances.
It is just like the picture below: from a distance it looks like a scary water monster, but up close it is just a cute giraffe. (animated image)
As the TDengine ecosystem grows and its interactions with major platforms and components become more frequent, the variety of problems we encounter will increase. Many of them are caused by very small operational details, which requires us to examine the user’s scenario very carefully. This problem, for example, is a typical trap: the devil is in the details.
In the end, we spent a whole afternoon solving the problem, and it took another half-day to understand the whole story. Along the way, @freemine, the expert who previously helped solve the Docker cluster connection problem (https://mp.weixin.qq.com/s/PJ629gbF1_m3U2_S85Wbeg), quietly dropped by again and enthusiastically helped locate the cause.
In the end, the problem was solved and everyone was happy.
Told like this it sounds simple, but when both parties can only communicate by text, efficiency is low and troubleshooting is time-consuming and exhausting. For example, the fact that “they (cluster and standalone) do not belong to the same intranet and belong to two Huawei Cloud accounts” is something I only learned later, when working backward from the root cause.
Even so, we will continue to look out for TDengine users.