Abstract: How to test the functionality, performance, reliability, and other aspects of a data lake with tens of thousands of nodes in the laboratory has become a problem that the R&D team needs to consider.

This article is shared from the Huawei Cloud community post “How to Test MRS at Large Cluster Scale in the Laboratory”, by The Old Man and the Sea.

As time goes on, data is becoming more open and shared, and customers’ businesses require increasingly diverse processing. Existing clusters need to be expanded. At the same time, it is increasingly urgent to replace the siloed (“chimney-style”) construction of the original small clusters with an integrated data lake. In this context, clusters with thousands of nodes are far from meeting customers’ business demands, so there is an urgent need to build data lakes with tens of thousands of nodes.

How to test the functionality, performance, reliability, and other aspects of a data lake with tens of thousands of nodes in the laboratory has therefore become a problem that our R&D team needs to consider.

Under normal circumstances, our software is deployed directly on physical devices for testing. A large-scale cluster of 30,000 nodes would require 30,000 physical devices, which laboratory conditions obviously cannot provide, so we need the help of virtualization technology.

In our big data products, nodes are divided into management nodes, control nodes, and data nodes. In actual deployments, management nodes and control nodes tend to become the bottlenecks at large cluster scale, so they should be the first items to observe in testing. How, then, can limited laboratory resources be used to test effectively? Comparing Docker containers with virtual machines, we found that Docker containers share the host OS, consume fewer resources than virtual machines, and provide enough isolation for our needs. We therefore adopted the following method to build the experimental environment.

We use Docker Swarm to manage the Docker containers. Compared with Kubernetes, Docker Swarm is more lightweight and can be installed and uninstalled quickly. In addition, it can be cascaded to build very large clusters.

Here is the networking:

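Under this test scheme, one physical machine with 64 cores and 256 GB of memory (64U256G) can host 60 data nodes with 1 core and 4 GB of memory each (1U4G), so 200 machines can simulate 60 × 200 = 12,000 nodes, enough to test a cluster at the scale of tens of thousands of nodes.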
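
The sizing above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names are hypothetical, and the reservation of 4 cores and 16 GB for the host OS and Docker is an assumption chosen to reproduce the figure of 60 containers per machine, not a number stated in the article.

```python
import math


def containers_per_host(host_cores, host_mem_gb, node_cores, node_mem_gb,
                        reserved_cores=4, reserved_mem_gb=16):
    """Return how many data-node containers fit on one physical machine.

    reserved_cores / reserved_mem_gb are ASSUMED headroom for the host OS
    and the Docker daemon; the article only gives the result (60).
    """
    by_cpu = (host_cores - reserved_cores) // node_cores
    by_mem = (host_mem_gb - reserved_mem_gb) // node_mem_gb
    return max(0, min(by_cpu, by_mem))


def hosts_needed(target_nodes, per_host):
    """Return the number of physical machines needed for target_nodes."""
    if per_host <= 0:
        raise ValueError("per_host must be positive")
    return math.ceil(target_nodes / per_host)


per_host = containers_per_host(64, 256, 1, 4)
print(per_host)                       # 60 containers on a 64U256G machine
print(per_host * 200)                 # 12000 nodes on 200 machines
print(hosts_needed(30000, per_host))  # 500 machines for 30,000 nodes
```

Under the same assumptions, the 30,000-node cluster mentioned earlier would need 500 such machines instead of 30,000 physical devices.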

During implementation, we ran into many pitfalls, such as:

  • How to quickly deploy and install Docker data nodes that have very few resources?

Solution: Skip the installation process entirely. Build the startup scripts into the Docker image and start the data node directly when the container is launched from the image. This avoids two problems: management nodes being unable to deliver software packages, and slow package installation and deployment in a low-resource environment.

  • In the preceding scenario, how to ensure that capacity expansion and reduction are still tested properly in a large-scale cluster?

Solution: Perform the capacity expansion and reduction tests on physical nodes, which avoids the slow expansion and reduction seen in a low-resource environment.

  • How to solve the problem of IP address conflicts between Docker data nodes?

Solution: Design the networking with Docker Swarm and assign each physical node its own address range, so that Docker data nodes started on different physical nodes never receive the same IP address.
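
A minimal sketch of this kind of address planning, using Python's standard `ipaddress` module. The supernet `10.10.0.0/16`, the host names, and the decision to reserve the first block for gateways and physical machines are all assumptions made for illustration; the article does not state the actual addressing plan. Each resulting block could then be supplied as the per-node range, for example through the `--ip-range` option of `docker network create`.

```python
import ipaddress


def plan_ip_ranges(supernet, host_names, containers_per_host):
    """Assign each physical host a disjoint address block from supernet."""
    net = ipaddress.ip_network(supernet)
    # Smallest power-of-two block that holds containers_per_host addresses.
    bits = max(1, (containers_per_host - 1).bit_length())
    new_prefix = net.max_prefixlen - bits
    if new_prefix <= net.prefixlen:
        raise ValueError("supernet is too small for even one host block")
    blocks = net.subnets(new_prefix=new_prefix)
    next(blocks)  # ASSUMPTION: first block kept for gateway and physical hosts
    plan = {}
    for name in host_names:
        try:
            plan[name] = next(blocks)
        except StopIteration:
            raise ValueError("not enough address space for all hosts") from None
    return plan


# Hypothetical lab: 200 physical machines, 60 containers each.
hosts = [f"host-{i:03d}" for i in range(200)]
plan = plan_ip_ranges("10.10.0.0/16", hosts, 60)
print(plan["host-000"], plan["host-001"])  # 10.10.0.64/26 10.10.0.128/26
```

Because the blocks come from a single call to `subnets()`, they cannot overlap, which is the property the solution relies on.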

  • How to avoid broadcast storms on a large Layer 2 network?

Solution: To simplify networking and testing, we use the MAC-VLAN networking mode. Broadcast storms can occur in this mode, so we use static ARP cache entries to avoid them.
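
One way static ARP entries might be generated is sketched below. The function name, the IP-to-MAC mapping, and the interface name `eth0` are hypothetical; the article does not say how the entries were populated. The emitted lines use the iproute2 `ip neigh replace ... nud permanent` form, which installs a permanent neighbor entry so the kernel does not need to broadcast an ARP request for that address.

```python
import ipaddress


def static_arp_commands(ip_to_mac, device):
    """Build one `ip neigh replace` command per known peer, sorted by IP."""
    commands = []
    for ip in sorted(ip_to_mac, key=ipaddress.ip_address):
        mac = ip_to_mac[ip].lower()
        commands.append(
            f"ip neigh replace {ip} lladdr {mac} dev {device} nud permanent"
        )
    return commands


# Hypothetical peers; in practice the mapping would come from the lab's
# own inventory of container IP and MAC addresses.
peers = {
    "10.10.0.66": "02:42:0A:0A:00:42",
    "10.10.0.65": "02:42:0a:0a:00:41",
}
for line in static_arp_commands(peers, "eth0"):
    print(line)
```

The script only prints the commands; running them on each node (as root) is left to whatever provisioning tool the environment already uses.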

  • How to solve the problem of Docker data nodes sharing the same directories?

Solution: Plan a separate directory for each Docker data node. When a container starts from the image, a directory named after the container is created on disk, which effectively prevents directory conflicts.
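
A minimal sketch of per-container directory planning follows. The function name `data_dir_for`, the container names, and the base directory are hypothetical; the article only states that the container name is used as the variable part of the path. The name check is an added safeguard (an assumption, not something the article describes) so that a malformed name cannot escape the base directory.

```python
import re
import tempfile
from pathlib import Path

# Conservative pattern for names used as directory names (assumption).
_NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.-]*$")


def data_dir_for(container_name, base_dir):
    """Create (if needed) and return the directory owned by one container."""
    if not _NAME_RE.match(container_name):
        raise ValueError(f"unsafe container name: {container_name!r}")
    path = Path(base_dir) / container_name
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a throwaway directory instead of a real data disk.
with tempfile.TemporaryDirectory() as base:
    a = data_dir_for("datanode-001", base)
    b = data_dir_for("datanode-002", base)
    print(a != b, a.is_dir(), b.is_dir())  # True True True
```

Since container names are unique on a host, deriving the path from the name gives each data node its own directory without any extra coordination.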

These are some of the issues we encountered during the environment setup and deployment process. Next time, we will look at some of the product software improvements.
