TiDB Hackathon 2020

This year marks the fourth TiDB Hackathon. It drew the largest field of participants so far: 45 teams from around the world signed up, making it a global event for the first time. After two days of intense competition, many exciting projects emerged. To let more people hear the stories behind the participating teams, we have started the TiDB Hackathon 2020 Outstanding Projects series. This article shares the story of the Senhaifei Xia team, winner of the CNCF Special Award.
Imagine you have a 10-node TiKV cluster, and one day 3 disks fail at the same time, and one Raft group happens to have its replicas on exactly those 3 disks. You don't have to worry much about losing data, because the probability of 3 disks failing simultaneously in a 10-machine cluster is very small.
But what if the cluster has 5,000 TiKV nodes?
For a distributed storage system, keeping multiple replicas is what protects the data. In general, however, the number of replicas of each piece of data does not grow as the cluster grows. In clusters of hundreds or thousands of nodes, the probability that the number of simultaneously failed nodes equals or exceeds the replica count (usually three) therefore increases.
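The effect of cluster size can be sketched with a simple binomial model. This is only an illustration under stated assumptions: nodes are taken to fail independently, and the per-node probability `p_fail` of being down at a given moment is a made-up value, not a measurement from any TiKV cluster.

```python
from math import comb


def prob_at_least_k_failures(n_nodes: int, p_fail: float, k: int = 3) -> float:
    """P(at least k of n_nodes are down at once), assuming independent failures."""
    # Sum the binomial probabilities of seeing 0 .. k-1 failures ...
    p_fewer = sum(
        comb(n_nodes, i) * p_fail**i * (1 - p_fail) ** (n_nodes - i)
        for i in range(k)
    )
    # ... and take the complement.
    return 1.0 - p_fewer


if __name__ == "__main__":
    p = 0.001  # assumed per-node probability of being down (hypothetical)
    for n in (10, 100, 1000, 5000):
        print(f"{n:>5} nodes: {prob_at_least_k_failures(n, p):.6f}")
```

With the replica count fixed at three, the printed probability climbs steeply as the node count goes from 10 to 5,000, which is the concern described above.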
For a three-replica cluster, when three nodes go down at once, both the probability of data loss and the amount of data affected depend on the scheduling algorithm. **In this TiDB Hackathon, the Senhaifei Xia team reduced the probability of data loss with its Dynamic Copysets project and won the CNCF Special Award.** After the competition, we interviewed the Senhaifei Xia team members and judge Tang Liu and invited them to share their Hackathon experience.
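To make the dependence on the placement strategy concrete, here is a rough comparison of two strategies when exactly three nodes fail together. It is a sketch, not the team's implementation: `loss_prob_random` assumes every Raft group picks its nodes uniformly at random, and `loss_prob_copysets` uses the permutation-based count from the Copyset Replication idea (about `scatter_width / (replicas - 1)` permutations, each cut into `n_nodes / replicas` copysets) as an upper bound. The function names and the sample numbers are assumptions chosen for illustration.

```python
from math import ceil, comb, expm1, log1p


def loss_prob_random(n_nodes: int, n_groups: int, replicas: int = 3) -> float:
    """Chance that one specific set of `replicas` failed nodes holds every
    replica of at least one group, when each group picks its nodes uniformly
    at random and independently of the others."""
    p_one_group = 1 / comb(n_nodes, replicas)
    if p_one_group >= 1.0:  # only one possible replica set exists
        return 1.0 if n_groups > 0 else 0.0
    # 1 - (1 - p)^G, written with log1p/expm1 to stay accurate for tiny p.
    return -expm1(n_groups * log1p(-p_one_group))


def loss_prob_copysets(n_nodes: int, scatter_width: int, replicas: int = 3) -> float:
    """Upper bound on the same chance when replicas may only be placed on a
    fixed list of copysets built from node permutations."""
    permutations = ceil(scatter_width / (replicas - 1))
    n_copysets = permutations * (n_nodes // replicas)
    return min(1.0, n_copysets / comb(n_nodes, replicas))


if __name__ == "__main__":
    groups = 10_000_000  # assumed number of Raft groups (hypothetical)
    for n in (10, 5000):
        print(n, loss_prob_random(n, groups), loss_prob_copysets(n, scatter_width=4))
```

Under random placement the loss probability approaches 1 as the number of groups grows, because nearly every combination of three nodes ends up hosting some group. Restricting placement to a small list of copysets keeps the probability low, at the cost of less freedom for the scheduler.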
Q: Why is the team named Senhaifei Xia?
Captain Gao Song: Senhaifei Xia is a new hero in my favorite game, Dota 2. I got 5 kills in the first game I played with it, so I decided this is my signature hero. And since I am doing all the coding this time, I have to take on the big-brother role ~
Q: Why did you first come up with the idea for such a project? Can you share your inspiration?
Captain Gao Song: The project originated from a debate in the Shanghai office in February 2020. At that time, Feng Liyuan claimed in the group that "if a cluster has an infinite number of machines and Raft groups, then any 3 machines I pick at random will always hold a complete Raft group." At first, Dongxu did not believe this, until Feng Liyuan produced the paper that proved the seemingly counterintuitive conclusion. That debate planted the seed for the Hackathon. **The largest single TiKV cluster today probably has only a few hundred nodes, but we can't wait until there are thousands.** Copysets will need a long period of debugging and testing before they reach production-grade GA quality, so the problem must be addressed before clusters reach that size. Dynamic Copysets are a scheduling problem, and my work is also related to scheduling, so as soon as the Hackathon was announced I quietly settled on the topic. Feng Liyuan, who had taken on everyone in that debate, and I hit it off and formed a team.
Copysets are an area I've been following for a long time. I wrote about them a few years ago, but they are difficult to implement and simulate, so it's exciting to see a static Copysets implementation at the Hackathon and then see it run on the PD simulator.
Q: Your post on Zhihu says this is the only project that degrades TiDB performance. Can you explain why in detail?
Captain Gao Song: A scheduling system faces a big trade-off similar to CAP. It cannot simultaneously keep partitions (data distribution) uniform, meet the overall load needs of the business, and preserve data locality. These three goals inherently conflict, just as in the CAP theorem. Our project focuses on optimizing partition safety, which inevitably sacrifices some load balancing, so TiDB performance may regress.
Judge Tang Liu: I don't think Dynamic Copysets will degrade TiDB's performance. Once the project is complete, the performance impact should be negligible and safety will be greatly improved. If it does set performance back, it must be because of their poor implementation (a roast from the judges is the deadliest kind).
Q: What were the biggest technical difficulties you encountered during the competition? What are the biggest challenges of subsequent maintenance?
Team member Feng Liyuan: Scheduling needs machine resources, testing needs a very large cluster, and competition time is limited, so it is hard to produce a good result in such a short time. Repair speed is a very important metric, and the data load cannot be prepared in advance.
Captain Gao Song: We will continue to work on the project after the competition. Verifying Copysets themselves is not the hard part; the hard part is integrating them with the existing scheduling system while preserving scheduling quality. As far as we can tell, few industry peers have shared experience in this area, so we can only cross the river by feeling for the stones. So far, verification with scheduling groups on clusters of around ten nodes works fine, but whether the approach still holds at a scale of thousands is unknown. It may need another iteration, which is a new challenge in itself.
Q: Why do you think so few people do Dynamic Copysets?
Judge Tang Liu: I started following this problem a few years ago, and I am surprised that the industry has still made no progress. My guess is that hardware quality keeps getting better, and with users moving to the cloud on a large scale there may be less demand. Also, in very large AP clusters such as Hadoop, a bad node has little impact on the cluster, but for TP workloads the problem is still fairly serious.
Q: Is there anything interesting from the competition that you would like to share?
Captain Gao Song: I can share two small things. When the RFCs were made public, I saw that Dean's project was a table lookup optimization. I had just discussed this problem with my teammate Feng Liyuan and made up my mind to work on it in 2021, but unexpectedly someone beat me to it at the Hackathon. I was both surprised and happy when I saw the RFC. After all, great minds think alike 🙂 I will discuss the project with Dean later.
The second thing is that, although I was a little disappointed not to win first prize, running Dynamic Copysets in production is still uncharted territory in both industry and academia. Coincidentally, the student who is the first author of the Copysets paper contacted me on Zhihu to ask whether there are new research directions in scheduling. I will probably be able to write a paper after I finish this project.
Q: Besides your own project, which project do you like best?
Captain Gao Song: I like MODS best. What I agree with most is that MODS brings the biggest benefit to DBaaS: customers don't need to provide GPUs, because we can use GPUs in the cloud directly. The gain is striking. When I saw the demo I was amazed; I had never thought of using GPUs to optimize TiDB before. This project gave me a new perspective.
Team member Feng Liyuan: I also like MODS most. My second favorite is the cross-flow scheduling project by Brother Flathead, because it can save a lot of money in the cloud ~
Finally, Senhaifei Xia's captain Gao Song would like to express his gratitude to team member Feng Liyuan:
“To Feng Liyuan:
We worked together on many projects in 2020. Thank you for helping me grow so much. I hope we can keep working together in the future.”
– Gao Song