A reader asked me: What qualities do you think make a programmer a good programmer?

My answer: the ability to solve problems.

This answer may seem a bit abstract, but read on and it will gradually become clear.

I. Problem-solving skills

Many years ago, when I was a rookie, my team lead often told me not to confine myself to the technology at hand when solving problems, and gave me a vivid example.

Two programmers once argued for a long time about how to determine whether the network between two servers was working. A tester sitting nearby said, “Why not just ping it?” So they solved the problem by implementing a ping in Java code.
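The story does not show the code those two programmers wrote, so the following is only a minimal sketch of what a ping-style check can look like in Java, using the standard `InetAddress.isReachable` call. The class name, method name, and timeout are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.InetAddress;

public class ReachabilityCheck {

    /**
     * Returns true if the host answers within timeoutMs milliseconds.
     * InetAddress.isReachable typically sends an ICMP echo request when the
     * JVM has the privilege to do so, and otherwise tries a TCP connection
     * to port 7 (echo) on the target host.
     */
    public static boolean isReachable(String host, int timeoutMs) {
        try {
            return InetAddress.getByName(host).isReachable(timeoutMs);
        } catch (IOException e) {
            // Covers UnknownHostException (name does not resolve) and network errors.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical default target; pass a real host name as the first argument.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        System.out.println(host + " reachable: " + isReachable(host, 2000));
    }
}
```

A firewall that drops ICMP and blocks port 7 can make a healthy host look unreachable, so a `false` result here is a hint rather than proof that the network is down.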

Years later, I know there are more elegant ways to solve this problem, but I still think the tester was smart. We went on to work together for a year, and she was really capable: in a small company she effectively covered both the product manager and the tester roles.

What I want to make clear is that problem-solving ability and technical ability are two different things. I have seen many programmers who know source code inside out yet still have no idea what to do when a production problem occurs.

Production problems are the ultimate test of a programmer. A programmer who, under high intensity and high pressure, keeps their composure and can calmly think, analyze, and solve the problem has reached a level that in ancient times would have earned the rank of general.

I’ve always admired programmers who can solve problems quickly, and I am always willing to study and analyze production problems as soon as they occur. Put simply, good programmers are forged by solving problems, especially the ones who step up when something goes wrong in the production environment.

II. A late-night technical story

01. The old platform and the new platform

The company has an old system and a new system.

The old system has been in use for many years and has long exceeded the load it was designed to support. When it first launched in 2013, the daily trading volume was estimated at one to two billion. In fact, the daily trading volume has already reached 4 billion.

From 2013 to 2017, the technical team put in a great deal of effort. The old system used an Oracle database. To support the peak transaction volume, the team introduced read/write separation, database and table sharding, and the most powerful hardware available. The system was split, restructured, and optimized many times, but it still could not keep up with the company’s growing trading volume.

To be honest, it wasn’t easy for the team to keep an old system like this going. The original architecture was not designed properly, and patching it again and again would not solve the fundamental problem. Developing a new platform had become a necessity for the company’s growth. The first-stage design of the new platform can support a trading volume of ten billion yuan per day and, most importantly, it is built so that capacity can be expanded further at that scale in later stages.

After a lot of hard work by the team, the new platform finally went live. Going live was only a small part of the success; the data migration was the most important part. The old system had been running for several years and used a traditional vertical architecture (see this article about what Spring Cloud does from an architectural evolution perspective), with operations, policies, activities, and risk control all tangled together.

The new platform uses a microservice architecture, with hundreds of microservices and a MySQL HA database. In terms of architecture design, the two systems are a generation apart. To accommodate some functions of the old system, some redundancy was also designed into the new one. In short, the two systems are not products of the same era.

The requirement for the migration was that merchants’ normal transactions must not be affected while moving from the old platform to the new one. The analogy is changing the wheels of a car while it is driving down the highway, without the passengers feeling anything.

Therefore, we developed a migration system. The original plan was to migrate in batches, cutting over one or two hundred million in transaction volume to the new platform at a time, watching the effect, and then continuing at a steady pace. But a sudden policy (activity) disrupted that rhythm.

02. The changes brought about by the new policy

Third-party payment companies often launch new policies (activities) in response to the market environment. Some policies are relatively simple, but most are very complicated and require a large amount of development.

By then, traffic had been switching over to the new platform for some time and we had gradually gained confidence in it, so we decided to implement the new policy on the new platform. The plan was to move all the remaining merchants on the old platform to the new platform on the night the policy took effect.

Once the plan was settled, each department got to work. The operations center announced externally that a major change would take place on New Year’s Day and what the changes might be; the marketing center contacted the various agents for training in batches; the commercial department began issuing official company documents to each branch.

We communicated with the customer service and operations departments in advance to prepare for possible problems. Notices of the policy change and its start date went out through the public account, the company website, the app, and email. The product center sorted out the requirements for implementing the policy, and the R&D center developed the solution the new policy called for.

The most important thing was to ensure that the remaining millions of merchants could be moved to the new platform smoothly, in one go, on the night of New Year’s Day.

03. Migration begins at midnight

The migration procedure had been carried out many times before, so we were fairly at ease about this part, but we still confirmed the details repeatedly with the colleagues mainly responsible for it. The migration had to be tested in the development environment two weeks in advance and in the UAT environment one week in advance, with both R&D and testing verifying the results.

Three days before the migration, I checked in with the programmer responsible for it to see how things were going and to ask whether any simulation tests had been run in production. He confirmed there was no problem. Based on his time estimate, the migration would take three or four hours, so it would start at 1:00 a.m. and finish between 4:00 and 5:00 a.m.

The day before the migration, we held a meeting with all departments to discuss the situations that might arise and which staff each department needed to keep on site. After the meeting, everyone felt good and waited for the big push that night.

That evening there were about twenty of us: more than ten developers, two testers, and some colleagues from other departments. Before midnight we chatted, laughed, and played games while waiting for the migration at 1:00 a.m. Because it happened to be New Year’s Day, the office had a festive feel.

At 1 a.m. Beijing time, stars shone outside the window and the office was tense.

More than a dozen colleagues gathered around the programmer in charge of the migration, and he was obviously under stress (haha, I guess that would happen to anyone). Still, he worked through the steps as practiced in the earlier tests and, after checking the data several times, clicked the migration button.

First, we migrated a single agent in the production environment to see whether the data was correct. After it ran, everyone started checking: operations staff checked the logs, developers confirmed that the relevant nodes were healthy, database engineers checked the migrated data, and the testers checked the data on the operations platform and ran a POS card-swipe test. Everything was normal!

After two more agents went through without a problem, we were ready to go all in. Millions of merchants under thousands of agents remained to be moved. The programmer in charge configured all the agent numbers in the migration program and clicked the execute button; we tailed the production logs, and everything looked fine.

A few people stayed to monitor the data, and the rest of us dispersed to wait for the migration to complete. I went back to my desk and lit a cigarette, thinking the night was going well.

04. Sudden accident

The small hours are a sleepy time. Just as I lit my third cigarette, the programmer in charge of the migration came rushing over to find me.

“Johnny, there’s a problem!”

My heart jumped. I took one hard drag, stubbed out the cigarette, and asked at once: “What’s the problem?”

It turned out that the programmer had been tracking the progress since starting the migration program. He found that it had taken half an hour to migrate 100,000 merchants, and the old platform had millions of merchants in total. At this rate, the migration would take several days to complete.

This is a big deal!

If it’s not done by 8:00 a.m., it’s a major accident.

Leaving aside how to handle data split between the old and new platforms, if the company’s policy was delayed, notifying millions of merchants and thousands of agents in such a short time would be an impossible task.

You can imagine what the next day would look like: the customer service 400 hotline flooded with calls, operations staff explaining themselves until they were exhausted, and the delayed policy possibly causing losses for the company and compensation claims from agents…

If the problem was not solved within an hour, we would have to report to the company’s deputy general manager immediately, and most likely the whole management team would then meet to discuss how to handle the aftermath.

Even with the consequences of a failed migration flashing through your mind, you have to push those thoughts aside, first analyze what went wrong, and work out whether there is any fallback or remedy.

Analyzing the cause:

After querying the logs and checking the data, we had basically found the cause. The developers had run the production tests using small and medium-sized agents. However, the largest core agent alone may account for 5%-6% of the platform’s overall trading volume, so migrating agents of that size took far longer than the tests suggested.

A time estimate based on small and medium-sized agents was therefore bound to be inaccurate. At that point there was no use asking whose fault it was. The key was how to solve the problem quickly, so everyone worked together to come up with ways to make the migration faster.

Remedy:

One idea was to synchronize the core data first and process everything else later, to guarantee the next day’s transactions. Another was to export and import all the tables manually. The database engineers nearly fainted when they heard that one: there were thousands of tables with very complex relationships. There were all sorts of other schemes as well…

While discussing possible optimizations, we noticed that the main flow of the migration program did not use multithreading.

The migration program provides a web page. For each migration, the developer enters the agent numbers to migrate on the page; the backend receives those parameters and migrates the agents one by one in a for loop. The merchants under each agent are migrated with multiple threads, but the entry point that iterates over the agents is single-threaded, so we wanted to migrate several agents in parallel to speed things up.
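The difference between the two shapes can be sketched as follows. This is not the actual migration program: `migrateAgent`, the `MIGRATED` set, and the parallelism value are hypothetical stand-ins, and the real per-agent work (which already used its own threads for merchants) is reduced to recording the agent number.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AgentMigrationSketch {

    // Hypothetical record of finished agents; stands in for the real migration result.
    static final Set<String> MIGRATED = ConcurrentHashMap.newKeySet();

    // Hypothetical per-agent work. The real program migrated the agent's merchants here.
    static void migrateAgent(String agentNo) {
        MIGRATED.add(agentNo);
    }

    // The shape described above: the entry point handles one agent at a time.
    static void migrateSequentially(List<String> agentNos) {
        for (String agentNo : agentNos) {
            migrateAgent(agentNo);
        }
    }

    // The alternative we wanted: several agents in flight at once, bounded by a fixed pool.
    static void migrateConcurrently(List<String> agentNos, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        for (String agentNo : agentNos) {
            pool.execute(() -> migrateAgent(agentNo));
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("migration did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for migration", e);
        }
    }
}
```

The pool size caps how many agents run at the same time, which matters when each agent spawns further threads of its own.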

05. Manual multi-threading to the rescue

After some discussion, we agreed that migrating agents in multiple threads was the best option available. But writing the code on the spot and running it in production without testing would still be quite risky.

Was there any way to get the effect of concurrent agent migration without changing the program? There was!

As anyone who develops web applications knows, each request from the front end is handed to a servlet on the back end, and the container runs each request on its own thread. So if we opened several pages and submitted migration requests from them at the same time, wouldn’t that give us the effect of migrating agents in multiple threads?

No sooner said than done. We stopped the running migration program and picked a dozen or so agents for a multi-threaded migration test. We opened four pages at once, entered different agents on each page, and started the migrations. The test showed that everything was normal.

I then increased the size of the test to dozens of agents. After entering them on different pages, I started the migrations one after another. During this second concurrent run, I noticed that errors were being reported from time to time.

We stopped the migration program and started looking for the cause. The error messages showed that data was being shared between threads.

Servlets are not thread-safe by default: one servlet instance serves many request threads, so any global or shared variable becomes a thread-safety problem as soon as several requests run at once.

The remedy is ThreadLocal, which provides a separate copy of the variable for each thread that uses it, so each thread can change its own copy independently without affecting any other thread’s copy.
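As a rough illustration of that idea, the sketch below keeps the "current agent" in a `ThreadLocal` instead of a shared field. The names `CURRENT_AGENT`, `handleRequest`, and `describeCurrentAgent` are invented for this example and are not taken from the real migration code; the thread pool merely imitates a servlet container's request threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MigrationContextSketch {

    // A plain static String here would be shared by every request thread,
    // so concurrent requests could overwrite each other's agent number.
    // With ThreadLocal, each thread sees only the value it set itself.
    static final ThreadLocal<String> CURRENT_AGENT = new ThreadLocal<>();

    // Hypothetical request handler: one call per migration request.
    static String handleRequest(String agentNo) {
        CURRENT_AGENT.set(agentNo);
        try {
            return describeCurrentAgent();
        } finally {
            // Pooled threads are reused, so clear the value when the request ends.
            CURRENT_AGENT.remove();
        }
    }

    // Code deeper in the call chain reads the per-thread value.
    static String describeCurrentAgent() {
        return "migrating " + CURRENT_AGENT.get();
    }

    // Runs several requests on a small pool, imitating concurrent page submissions.
    static List<String> handleConcurrently(List<String> agentNos) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String agentNo : agentNos) {
                tasks.add(() -> handleRequest(agentNo));
            }
            List<String> results = new ArrayList<>();
            // invokeAll returns the futures in the same order as the tasks.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `remove()` in the `finally` block is a common precaution with pooled threads, since a value left behind would otherwise be visible to the next request that reuses the thread.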

With this problem fixed, we opened multiple pages and ran the migration again. But once a single Tomcat instance was running more than six migrations in parallel, the machine load became very high, because each of those threads in turn invoked additional thread pools to handle the migration logic for merchants and salesmen.

So I immediately asked the operations staff to find ten servers in the production environment and deploy the migration master program on all ten. To avoid mistakes caused by a developer’s shaky hands, I asked operations to grant me access so that I could run it myself.

On my computer (I used multiple screens), I opened the migration page of each of the ten servers. I split all the agents to be migrated into groups of fifteen and entered one group per page at a time, starting the migration on each server in turn.
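That night the grouping was done by hand, but the bookkeeping behind it is simple enough to sketch. The helper below is hypothetical: it splits a list of agent numbers into fixed-size groups and assigns each group to a server in round-robin order. The group size and server count are parameters, not values confirmed by the migration program.

```java
import java.util.ArrayList;
import java.util.List;

public class AgentBatchingSketch {

    // Splits agentNos into consecutive groups of at most groupSize entries.
    static List<List<String>> partition(List<String> agentNos, int groupSize) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < agentNos.size(); start += groupSize) {
            int end = Math.min(start + groupSize, agentNos.size());
            groups.add(new ArrayList<>(agentNos.subList(start, end)));
        }
        return groups;
    }

    // Round-robin assignment: group 0 goes to server 0, group 1 to server 1, and so on,
    // wrapping around once every server has received a group.
    static int serverFor(int groupIndex, int serverCount) {
        if (serverCount <= 0) {
            throw new IllegalArgumentException("serverCount must be positive");
        }
        return groupIndex % serverCount;
    }
}
```

With ten servers, groups 0 through 9 make up the first round and group 10 wraps around to server 0 again, which matches the idea of feeding each server one group per round.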

After I had gone through six rounds, the database engineer noticed that the data migration had become significantly faster. In the end I spent two hours submitting all the agents through the pages.

By about 4 a.m., my part was mostly done, and the remaining programs kept running on their own. By 5 a.m., most of the merchant data had been migrated, leaving only two servers still running. By 6 a.m., the migration programs on all ten servers had finished.

After arranging for all the relevant figures to be checked one by one, we finally breathed a long sigh of relief.

When we came down for breakfast at 7 a.m., we were still talking about how we almost couldn’t get through last night. I joked about how my boss would feel if I called him at 2 or 3 in the morning.

At the time, we felt that being fired by the boss over such a big accident would be a minor matter. What worried us most was how to clean up afterwards. A lost job can be found again, but either way the mess would have been ours to deal with.

After trading opened at 9 a.m., a few small problems gradually surfaced, but they were limited in scope and did not affect transactions. Overall, the problems were under control.

Looking back, we all felt as if we had survived a disaster.

III. Incident review

Later, when we held a review meeting, we identified many oversights, but those are not the focus of this article. Back to the question at the beginning: what makes a great programmer?

As you can see, the problem was not particularly complicated and the technical means needed to deal with it were relatively simple. What mattered most was solving the most urgent problem at that moment. There is no high or low technology; the essence of learning technology is to solve all kinds of problems. Don’t be blindly infatuated with any technology. Whatever fits the problem is best.

Technical people should learn to enjoy pressure, because pressure is what drives you to grow, and the earlier you meet it, the faster you grow. Under high pressure and high intensity, even very simple actions can go wrong, and that may cause a bigger secondary accident.

Keep a calm, analytical mind under high intensity and high pressure. Only when you calm down can you really find and solve problems. Many technical people look busy when a problem occurs, but in fact they have no plan and are just operating blindly.

Calm down, carefully analyze the whole chain, and think about where the problem might be. Then verify step by step, using the logs or the relevant commands, until you find the root cause. Only then can you solve the problem with confidence.

The programmers who stayed for the migration that night were all core programmers in our company, but it was easy to see who had the ability and who did not. Excellent programmers are like gold: they shine at the critical moment.

Many people instinctively take a few steps back when they encounter problems, while some like to rush forward. No matter how well you know the source code or how good your PowerPoint presentations are, companies need someone who steps up when things go wrong.

Programmers who can step up at a critical moment are, by and large, the ones who get moved into management positions. People build trust by working together over time. When leaders decide whom to promote, the main consideration is whether they can trust you to get things done.

So when we study technology, we should not go astray. Source code and design patterns are worth studying, but we should also think about how to apply them afterwards and focus more on practical, hands-on knowledge. These are the things that can save you at a critical moment in the workplace.

IV. How to become a capable programmer

So how do you develop your problem-solving skills as a programmer? Practice! Practice! Practice! Ordinary technical study is only input, and without practice those skills are soon lost.

So how do you practice? Do more projects. If the company’s projects don’t use a technology, write and debug code with it in your spare time. When colleagues run into problems, help solve them; when the company has problems, take the initiative to help. Solving a wide variety of problems is the fastest way to improve your ability.

After each round of practice, it is best to review what you learned and write it up in a journal or blog. These notes become a knowledge base you can search when you face similar problems in the future, steadily enriching your problem-solving experience.

Finally, I wish you every success in becoming a true technical master!