At Stanford, Jobs gave what I think was his best talk. The first story he tells is about connecting the dots, an idea that ran through his whole life.

You cannot connect the dots looking forward; it’s only when you look back that you see the connections between them. So trust that what you’re experiencing now will connect in your future life… It is this faith that keeps me from losing hope, and it has made all the difference in my life.

I can’t help but wonder: what are the dots in my coding and architecture career, and how will they eventually connect?

Ten years ago, when I first entered the IT industry, I was an ordinary engineer who was slow on the uptake and often missed the key points in my work. A classmate with a high IQ could write the code almost immediately after reading something once, while it took me a long time to digest and absorb the same thing, and I doubted whether I could survive in this industry.

I had no choice but to be the slow bird that starts flying early. Whenever I hit a problem, I sank my teeth into it and would not let go until I found the best solution. I thought about it in the shower, while eating, and even in the bathroom. Naturally, “chasing” source code became part of my programming life.

I’ve read a lot of source code, and I’d like to share with you a few experiences that have had a big impact on my career.

01 Database connection pool Druid

In 2013, I was in charge of rebuilding a lottery service. The original code was written in C#, and each run of the order amount calculation took 2-3 hours.

I used Druid as the database connection pool for the new project, and the results were impressive: performance improved by up to 10x.

There was one problem, however: the first database request of every day always reported a connection error. I didn’t know how to read source code at the time, so I emailed Druid’s author, Wen Shao (also the author of FastJson), directly.

Wen Shao replied, and I immediately went through the source code and found that the connection heartbeat I had configured was the problem. The core point is that the database saves resources by closing connections that have not been read from or written to for a long time, so the connection pool sends heartbeat packets to the database server at regular intervals to keep them alive.

This simple source code tour inspired me and made me pay more attention to the principles behind the technology.

1. Mindset: Asking questions is addictive.

2. Skill level: I came to understand how a connection pool is implemented. Druid’s pool is based on arrays, the Jedis connection pool I met later is built on Apache Commons Pool, and Netty’s connection pool is implemented by FixedChannelPool.

3. Architecture level: Long-lived connections between client and server need a heartbeat, as in the Druid connection pool’s heartbeat mechanism and Netty’s IdleStateHandler.
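To make the heartbeat idea concrete, here is a minimal Java sketch. It is not Druid's code: `FakeConnection`, `KeepAlivePool`, the one-minute heartbeat interval, and the 8-hour server idle timeout are all assumptions chosen for illustration, and time is passed in as a number so the example stays deterministic.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a pooled database connection (hypothetical, not a Druid class). */
class FakeConnection {
    long lastActiveMillis;
    int heartbeatsSent = 0;

    FakeConnection(long nowMillis) {
        this.lastActiveMillis = nowMillis;
    }

    /** Stand-in for a cheap validation query such as "SELECT 1". */
    void sendHeartbeat(long nowMillis) {
        heartbeatsSent++;
        lastActiveMillis = nowMillis;
    }
}

/** Pool that pings connections which have been idle for too long. */
class KeepAlivePool {
    private final List<FakeConnection> idle = new ArrayList<>();
    private final long heartbeatIntervalMillis;

    KeepAlivePool(long heartbeatIntervalMillis) {
        this.heartbeatIntervalMillis = heartbeatIntervalMillis;
    }

    void add(FakeConnection c) {
        idle.add(c);
    }

    /** Meant to be called periodically; returns how many heartbeats were sent. */
    int sweep(long nowMillis) {
        int sent = 0;
        for (FakeConnection c : idle) {
            if (nowMillis - c.lastActiveMillis >= heartbeatIntervalMillis) {
                c.sendHeartbeat(nowMillis);
                sent++;
            }
        }
        return sent;
    }

    /** True if a server with the given idle timeout would already have dropped c. */
    static boolean serverWouldHaveClosed(FakeConnection c, long nowMillis,
                                         long serverIdleTimeoutMillis) {
        return nowMillis - c.lastActiveMillis > serverIdleTimeoutMillis;
    }
}

class KeepAliveDemo {
    public static void main(String[] args) {
        long minute = 60_000L;
        long serverIdleTimeout = 8 * 60 * minute; // assumed 8-hour server-side idle timeout
        KeepAlivePool pool = new KeepAlivePool(minute);
        FakeConnection c = new FakeConnection(0);
        pool.add(c);

        // Simulate a quiet night: sweep once a minute for 10 hours.
        long morning = 10 * 60 * minute;
        for (long now = minute; now <= morning; now += minute) {
            pool.sweep(now);
        }
        // Prints "closed by server? false"
        System.out.println("closed by server? "
                + KeepAlivePool.serverWouldHaveClosed(c, morning, serverIdleTimeout));
    }
}
```

Without the periodic sweep, the same connection would have been idle for 10 hours by morning, longer than the assumed server timeout, so the first request of the day would land on a connection the server had already closed.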

02 Database sharding middleware Cobar

It was 2013. The mobile Internet boom had started, and the data held by Internet companies was exploding.

I once saw a post on JavaEye in which Taobao engineers shared how they split order data across databases and tables, and it felt like finding treasure. Unfortunately, because of limited space, the article did not explain the principles in detail, which left me wanting more.

I did not expect that Alibaba would soon open source Cobar. After pointing Navicat at Cobar, I could operate it as if it were a single MySQL instance, while the data was distributed evenly across multiple databases. For my limited technical understanding at the time, it was like the Droplet meeting the human fleet in The Three-Body Problem: a tremendous shock.
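The core idea of spreading rows evenly across databases can be sketched in a few lines of Java. This is only an illustration of modulo routing on a shard key, not Cobar's routing code; the class `ShardRouter` and the `order_db_N` / `t_order_N` naming are hypothetical.

```java
/** Routes a shard key to a database and table by modulo (illustrative only). */
class ShardRouter {
    private final int dbCount;
    private final int tablesPerDb;

    ShardRouter(int dbCount, int tablesPerDb) {
        if (dbCount < 1 || tablesPerDb < 1) {
            throw new IllegalArgumentException("counts must be positive");
        }
        this.dbCount = dbCount;
        this.tablesPerDb = tablesPerDb;
    }

    /** Returns a physical target such as "order_db_1.t_order_3". */
    String route(long userId) {
        // floorMod keeps the slot non-negative even for negative keys.
        int slot = (int) Math.floorMod(userId, (long) dbCount * tablesPerDb);
        int db = slot / tablesPerDb;
        int table = slot % tablesPerDb;
        return "order_db_" + db + ".t_order_" + table;
    }
}

class ShardRouterDemo {
    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(2, 4); // 2 databases x 4 tables each
        System.out.println(router.route(5L)); // prints order_db_1.t_order_1
        System.out.println(router.route(8L)); // prints order_db_0.t_order_0
    }
}
```

A middleware layer does this routing on behalf of the application, which is why the client can treat the whole cluster as one MySQL instance.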

Eager to understand the principles of database and table sharding, I spent about three months copying out Cobar’s entire core code. Where intelligence fell short, I made up for it with stamina.

But stamina alone was not enough. I often fell into self-doubt: some parts I still did not understand, and copying code while learning did not seem to produce obvious progress. I needed to find a point of entry.

Network communication is very important, so I decided to extract Cobar’s network communication module to understand how it communicates using native NIO. The process was painful too, but this time I had a goal instead of buzzing around like a headless fly, and it led to my first GitHub project.

While writing the NIO project, I also learned about packaging with the Maven Assembly plugin. It sounds simple now, but in 2013 deployment was still dominated by WAR packages on Tomcat, so it really opened my eyes.

While chasing Cobar, I also had the chance to talk face to face with a senior Alibaba engineer. Although I was far less experienced, he patiently answered my questions, and it felt as if a blockage had suddenly been cleared. It is fair to say that this experience was the most important one in my coding career.

Later, when I worked at eLong, I talked with the architect of eLong’s database middleware. Because I had studied the source code of Cobar, I understood his thinking very quickly, and I also helped him find a bug in distributed transactions.

03 MetaQ, Alibaba’s messaging middleware

I joined Shenzhou Zhuanche, a ride-hailing company, in 2015. At that time the business was on the rise, and every system was hitting serious bottlenecks. MetaQ played a very important role in that period, and I learned a great deal from it.

3.1 Broadcast messages in the push system

The MetaQ we used at that time was Zhuang Xiaodan’s open source version on GitHub. At the beginning of 2016, I checked out the source code of MetaQ to understand its mechanism while getting to know the business.

I was very lucky to discuss the design of the ride-hailing push system with colleagues in the architecture department. In the beginning we used JPush (Aurora Push), but because of the increasing demand for customization, we decided to develop our own push system.

So how does the server push a message to every connected ride-hailing app? The solution is simple: use MetaQ’s broadcast mode.

1. The business system publishes messages to MetaQ

2. Each TCP gateway consumes the MetaQ messages in broadcast mode

3. Each TCP gateway looks up the Sessions held on the current server and pushes the data to the app
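The three steps above can be sketched in Java as follows. This is a simplified in-memory model, not MetaQ's API: `BroadcastTopic` stands in for the broker's broadcast mode, `TcpGateway` keeps a map of sessions instead of real TCP channels, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** A push message addressed to one app user. */
class PushMessage {
    final String userId;
    final String payload;

    PushMessage(String userId, String payload) {
        this.userId = userId;
        this.payload = payload;
    }
}

/** Gateway node holding the sessions of the users connected to it. */
class TcpGateway {
    // userId -> payloads written to that user's session (stand-in for a TCP channel)
    private final Map<String, List<String>> sessions = new ConcurrentHashMap<>();

    void connect(String userId) {
        sessions.put(userId, new ArrayList<>());
    }

    /** Every gateway sees every message; only the one holding the session delivers it. */
    void onMessage(PushMessage m) {
        List<String> session = sessions.get(m.userId);
        if (session != null) {
            session.add(m.payload);
        }
    }

    List<String> delivered(String userId) {
        List<String> session = sessions.get(userId);
        return session == null ? Collections.<String>emptyList() : session;
    }
}

/** Stand-in for broadcast consumption: each subscriber receives each message. */
class BroadcastTopic {
    private final List<TcpGateway> subscribers = new ArrayList<>();

    void subscribe(TcpGateway gateway) {
        subscribers.add(gateway);
    }

    void publish(PushMessage m) {
        for (TcpGateway gateway : subscribers) {
            gateway.onMessage(m);
        }
    }
}

class PushDemo {
    public static void main(String[] args) {
        BroadcastTopic topic = new BroadcastTopic();
        TcpGateway g1 = new TcpGateway();
        TcpGateway g2 = new TcpGateway();
        topic.subscribe(g1);
        topic.subscribe(g2);
        g1.connect("driver-1");
        g2.connect("driver-2");

        topic.publish(new PushMessage("driver-1", "new order"));
        System.out.println(g1.delivered("driver-1")); // prints [new order]
        System.out.println(g2.delivered("driver-2")); // prints []
    }
}
```

The business system never needs to know which gateway a user is connected to. The cost is that every gateway receives every message and discards most of them.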

Later, I carefully studied the TCP gateway design of JD.com’s Jingmai system, and its push implementation was very similar to our scheme above.

In 2018, the e-commerce company I worked for developed a live quiz system, and I used this solution to push the questions.

3.2 Lessons from a ZK crash

We all know that MetaQ relies on ZooKeeper for load balancing. Suddenly one day, the whole ZK cluster went down. The architecture lead changed the JVM parameters for ZK, and the problem seemed to be resolved, but then questions arose.

Is it appropriate for MetaQ and service governance to share a ZK cluster?

After reading the source code of MetaQ, I found that when a large number of MetaQ consumers started up, they contended for locks frequently. Offsets were also updated frequently during consumption. As the number of topics and partitions grew, MetaQ put real write pressure on ZK.

Later, the Shenzhou architecture department did separate the ZK cluster of MetaQ from the ZK cluster of service governance. Of course, the migration was tricky, but I won’t go into it here.

Does ZK have a bottleneck as the service registry for Shenzhou’s systems?

ZK stores the IP and port of each service, as well as its exposed methods. As the ride-hailing system gained more and more services, could ZK really cope? Later, company management invited engineers from JD.com’s R&D team to answer our questions.

What does JD.com’s registry use to store service information? JD’s answer was: MySQL. I stood there gaping. What? MySQL? Later, the Taobao middleware blog published an article: “Why does Alibaba not use ZooKeeper for service discovery?”

When the services in a data center grow beyond a certain number, ZooKeeper, as a registry, quickly becomes overwhelmed, like an overloaded donkey.

Then I read a blog post by Kaitao Zhang and the JSF code scattered around GitHub, and hand-wrote a registry with an AP model and BerkeleyDB clients.

We also know that in 2019, Alibaba contributed to the Spring Cloud ecosystem and Nacos was born. Nacos supports both AP and CP models, adding to the diversity of the open source world. When using ZooKeeper, it is important to pay attention to cluster size and usage scenarios.

04 Task scheduling system xxl-job

In 2018, the technical department needed a reliable task scheduling system. I started with xxl-job, but it never felt quite right to use. From reading the source to optimizing and reworking it, I went through three phases.

In phase 1, after looking at the xxl-job source code, my first impression was “simple.” The author has made the system work out of the box as far as possible, removing the Quartz cluster scheduling mode and developing a database-based scheduler.

However, the company had developed its own RPC service, so it was not easy for other teams to add a JobHandler with xxl-job. So initially, I changed the scheduler code to execute tasks through the company’s RPC, replacing the JobHandler of xxl-job with the ServiceId of the company’s RPC. It worked fine and met the company’s requirements.

Phase 2: why optimize again? The RPC scheduling at that point was synchronous and did not support asynchronous execution. If an RPC task ran for a long time, the scheduling thread of xxl-job was blocked.

I approached a friend at Meituan and asked how his company designed its task scheduling system. He walked me through the execution process of Crane, Meituan’s task scheduling system. For confidentiality reasons, he only told me the principle. I made the following architectural design according to his description:

When schedule-client receives a scheduling request, it submits the task to a thread pool for asynchronous execution and immediately returns to the scheduling server, so the scheduler is not blocked.
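A minimal Java sketch of this "accept now, run later" pattern is shown below. It is an illustration under assumptions, not the actual client: the class `ScheduleClient`, the method `onTrigger`, the `"ACCEPTED:"` acknowledgement string, and the pool size of 4 are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Client side of the scheduler: acknowledges a trigger at once, runs the job later. */
class ScheduleClient {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    /** Called by the scheduling server; returns as soon as the job is queued. */
    String onTrigger(long taskId, Runnable job) {
        workers.submit(job);
        return "ACCEPTED:" + taskId;
    }

    /** Stops accepting jobs and waits for queued ones; true if all finished in time. */
    boolean shutdownAndWait(long timeoutMillis) {
        workers.shutdown();
        try {
            return workers.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

class ScheduleClientDemo {
    public static void main(String[] args) {
        ScheduleClient client = new ScheduleClient();
        CountDownLatch release = new CountDownLatch(1);

        String ack = client.onTrigger(1L, () -> {
            try {
                release.await(); // simulates a long-running job
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("job 1 finished");
        });

        // The acknowledgement is available while the job is still running.
        System.out.println(ack);
        release.countDown();
        client.shutdownAndWait(2_000);
    }
}
```

Because the acknowledgement only means "queued", a real system also needs a way for the client to report the final result back to the server, which this sketch leaves out.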

In phase 3, architecture alone was not enough. How could I implement it more elegantly in engineering terms? I thought of Alibaba Cloud’s SchedulerX. If I were a developer at Alibaba Cloud, how would I design a task scheduling system good enough to support on the order of 100 billion task executions every day?

I looked through SchedulerX’s public documentation and client-side source code, and it was a treasure trove. The SchedulerX client has the following highlights:

1. RPC calls are similar to RocketMQ Remoting

2. Task scheduling is triggered by RPC, with a unified registry (NameServer mode)

3. Supports startup on multiple ports, in case the current port fails to start

4. Task execution and task scheduling thread pools are isolated

Drawing on these strengths, I soon completed the implementation, and my technical colleagues were fairly satisfied with the result. But I knew there were two problems with the system:

1. Task scheduling depends heavily on the database. With 100,000 or 200,000 tasks, task allocation and trigger scheduling will certainly become bottlenecks.

2. Can it still work properly once the system is containerized?

Therefore, at the beginning of this year, I wrote a relatively complete task scheduling system of my own on GitHub, replacing Quartz with a timing wheel and changing task triggering to a server push model.

In the process of writing the task scheduler, I was constantly pushing past my own limits. If I want it to stand alongside the best in the industry, I must learn from the most advanced technical products and the most capable peers.
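For readers who have not met a timing wheel before, here is a minimal single-level version in Java. It is a generic sketch of the data structure, not the code from my project: the class `TimingWheel` is hypothetical, and the wheel is advanced by calling `tick()` by hand rather than by a real clock, so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Single-level hashed timing wheel, advanced manually one tick at a time. */
class TimingWheel {
    private static final class Entry {
        int remainingRounds; // full turns of the wheel still to wait
        final Runnable task;

        Entry(int remainingRounds, Runnable task) {
            this.remainingRounds = remainingRounds;
            this.task = task;
        }
    }

    private final List<List<Entry>> slots = new ArrayList<>();
    private int current = 0;

    TimingWheel(int slotCount) {
        if (slotCount < 1) {
            throw new IllegalArgumentException("slotCount must be >= 1");
        }
        for (int i = 0; i < slotCount; i++) {
            slots.add(new ArrayList<>());
        }
    }

    /** Schedules task to run on the delayTicks-th call to tick() from now. */
    void schedule(int delayTicks, Runnable task) {
        if (delayTicks < 1) {
            throw new IllegalArgumentException("delayTicks must be >= 1");
        }
        int n = slots.size();
        int slot = (current + delayTicks) % n;
        int rounds = (delayTicks - 1) / n;
        slots.get(slot).add(new Entry(rounds, task));
    }

    /** Advances one slot, runs the tasks that are due, and returns how many ran. */
    int tick() {
        current = (current + 1) % slots.size();
        List<Runnable> due = new ArrayList<>();
        Iterator<Entry> it = slots.get(current).iterator();
        while (it.hasNext()) {
            Entry e = it.next();
            if (e.remainingRounds == 0) {
                it.remove();
                due.add(e.task);
            } else {
                e.remainingRounds--;
            }
        }
        // Run after the scan so a task may safely schedule follow-up work.
        for (Runnable task : due) {
            task.run();
        }
        return due.size();
    }
}

class TimingWheelDemo {
    public static void main(String[] args) {
        TimingWheel wheel = new TimingWheel(8);
        wheel.schedule(3, () -> System.out.println("fired after 3 ticks"));
        wheel.schedule(9, () -> System.out.println("fired after 9 ticks"));
        for (int i = 1; i <= 9; i++) {
            wheel.tick();
        }
    }
}
```

Each tick only touches one slot, so the cost of triggering does not grow with the total number of scheduled tasks. That is the property that makes the structure attractive when the task count reaches the hundreds of thousands.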

05 Closing thoughts

Looking back on all those bits and pieces of chasing source code, I am flooded with emotion and cannot calm down for a long time. They are some of my fondest memories of being an architect.

Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you truly believe is great work. And the only way to do great work is to love it.

Focus on what you are doing now and do everything to the limit of your ability. You may not feel a great sense of achievement at the time, but at some point in the future you may suddenly look back and see it, as in the old line: “suddenly looking back, there it is, where the lights are dim.”

Dear programmer friends, love what you have chosen and devote yourself to it fully. I believe you will be rewarded.


About the author: master’s degree from a Project 985 university, former engineer at Amazon, now a technical director at 58.
