Today's multiple-choice question is certainly tricky, but is the situation behind it typical? Quite typical, in fact. Since nobody always has the energy to read dry technical text, today's theme, like the cover of this article, is a lighthearted TDengine use case.
Here is what happened. A user built a two-node TDengine cluster on Huawei Cloud, with the nodes communicating over the intranet, and the cluster worked normally. Besides this cluster, the user had another Huawei Cloud server running a standalone TDengine instance. The cluster and the standalone server belong to two different Huawei Cloud accounts and are not on the same intranet.
One day, he found that connecting locally to the standalone TDengine over JDBC-RESTful worked fine, but connecting to the cluster failed with a "timed out" error.
For JDBC-RESTful connections, it should make no difference whether TDengine is standalone or clustered: the client only connects to the HTTP service on port 6041, and the host serving that port is backed by the taosd service, whether standalone or clustered.
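To make this concrete, here is a minimal sketch of what a RESTful client sends. It assumes TDengine's documented REST endpoint `/rest/sql` with HTTP Basic authentication and the default `root` / `taosdata` credentials. None of that is stated in this article, and the host names are hypothetical placeholders. The point is only that the request has the same shape for a standalone server and for a cluster node.

```python
import base64
import urllib.request


def build_rest_request(host, sql, user="root", password="taosdata", port=6041):
    """Build an HTTP POST for the TDengine REST SQL endpoint.

    Assumptions (not from the article): the endpoint path /rest/sql,
    Basic authentication, and the default root/taosdata credentials.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"http://{host}:{port}/rest/sql",
        data=sql.encode("utf-8"),
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


def run(request, timeout=5.0):
    """Send the request; raises an exception if the port cannot be reached in time."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Hypothetical host names: only the host differs between the two requests.
standalone = build_rest_request("standalone.example.com", "show databases")
cluster = build_rest_request("cluster-node1.example.com", "show databases")
```

Because the two requests differ only in the host, a timeout against one of them points at the network path to that host rather than at anything specific to clustering.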
So one working and one failing was strange. As soon as we saw the problem reported in the user group, we rushed to the scene to investigate.
Our first instinct with external connection problems on cloud servers is the port policy in the security group. So we first asked the user to log in to the Huawei Cloud console for the cluster nodes and send us screenshots of the security group configuration. After confirming that the security group policy looked correct, we moved on to other checks.
We first tried switching the cluster from internal IP addresses to external ones. That made things worse: the entire cluster stopped working, with the familiar "Unable to establish connection" error.
At this point the port connectivity between nodes had to be checked. Running telnet against the external IP on port 6041 failed, but running telnet against the internal IP on port 6041 worked normally.
Now we were confused.
Was it a problem with the external IPs? But these were all elastic IPs, bound to the cloud servers. How could telnet to an external IP on port 6041 fail?
At a loss, we suddenly remembered that a security group configuration must be associated with a server instance, otherwise it does not take effect. We hurried back to the console to check. Sure enough, the user had configured the rules, but because this was his first time using the cloud service and he was unfamiliar with the workflow, the security group had never been associated with the two cluster servers.
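The idea that rules only count once they are attached to an instance can be shown with a deliberately simplified toy model. This is not the Huawei Cloud API, and every class and name below is hypothetical. Real platforms also have default groups, rule priorities, and source-address filters that this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityGroup:
    """Toy security group: just a name and a set of allowed inbound ports."""
    name: str
    allowed_inbound_ports: set


@dataclass
class Instance:
    """Toy server instance with the security groups associated with it."""
    name: str
    security_groups: list = field(default_factory=list)


def inbound_allowed(instance, port):
    """A port is reachable only if an *associated* group allows it."""
    return any(port in g.allowed_inbound_ports for g in instance.security_groups)


# The rules exist, but the group has not been associated with the node yet.
tdengine_rules = SecurityGroup("tdengine-rules", {6041})
node = Instance("cluster-node-1")
before = inbound_allowed(node, 6041)

# Associating the group with the instance is what makes the rules take effect.
node.security_groups.append(tdengine_rules)
after = inbound_allowed(node, 6041)
```

In this model `before` is false and `after` is true, even though the rule set itself never changed. This is why a screenshot of correct rules alone could not reveal the problem.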
The reason the standalone node could be reached is simple: it was associated with a security group policy under the other Huawei Cloud account.
This is the real cause of the strange behavior above. It sounds a bit absurd: can TDengine on the cloud only be served from a single machine and not from a cluster? The facts: the cluster and the standalone node belong to two different accounts, and the cluster's security group was never associated with its instances after being configured.
Like the image below: from a distance it looks like a fearsome water monster, but it is just a cute giraffe. (animated image)
As TDengine's ecosystem matures and it interacts more often with major platforms and components, the variety of problems encountered will grow. Many of them are caused by very trivial operations, which requires us to examine each scenario carefully. This case is a typical pitfall: the devil is in the details.
All told, it took us a whole afternoon to solve the problem and piece together the full story. A related case, helping to solve a Docker cluster connection problem, is described at mp.weixin.qq.com/s/PJ629gbF1…
In the end, the problem was solved and everyone was happy.
Although it sounds simple in the retelling, troubleshooting was slow at the time because both sides could communicate only by text, which made the investigation time-consuming and laborious. For example, the fact that the cluster and the standalone machine were not on the same intranet and belonged to two Huawei Cloud accounts only came out later, when we traced back from the root cause.
Even so, we will keep looking out for our TDengine users.