Ah! Is your service down again?

Troubleshooting is by no means a simple task, so can there be a consistent way to approach it? Because specific technologies eventually become obsolete, this article does not discuss technical details. Instead, it proposes ten methodologies for troubleshooting. By Steve Mushero, at medium.com/faun/shit-b…

1. Faults are unavoidable

Ah, your service is down again. Unfortunately, it often dies because of high load and the complexity of the business, and it dies in a way that cannot easily be “fixed” by “auto-scaling”, “adding containers”, “rebooting”, and so on; even the fancy scheduling system doesn’t help. Of course, I’m not saying these methods are useless; each has its own scenario. Sometimes it takes only 5 minutes to locate the cause of a fault, but as a veteran you know how much experience and effort lie behind those 5 minutes. As the saying goes, the real kung fu is practiced off stage. If you happen to use microservices, serverless, or other infinitely divisible, loosely coupled pieces and parts, faults are even harder to fix. So what can you rely on? Specific technologies become obsolete sooner or later, while methodologies endure. Only the “tao” (that is, methodology) is a guiding light for dealing with complex systems.

2. Model everything

Be able to say where each part sits in the model, how the parts interact, and how they are configured. If conditions permit, clarify how each part behaves as well. Obtain and understand the logical architecture diagrams and, if necessary, the physical and network architecture diagrams. Be clear about layering and grouping at different scales.

3. What is known is known

Figure out all the configurations and states as best you can. It is genuinely hard to make sense of the code in the repository, the configuration files, .env files, and infrastructure-as-code systems, let alone the dynamic state of the runtime. But like it or not, the real running system is where all the truth comes from.
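One way to make this concrete is to compare what the repository declares with what the running system reports. The following is a minimal sketch, not a definitive tool: the function `config_drift`, the `ABSENT` marker, and the sample keys are all hypothetical, and it assumes both sides have already been loaded into flat dictionaries.

```python
# Hypothetical marker for a key that exists on only one side.
ABSENT = "<absent>"


def config_drift(declared, running):
    """Return {key: (declared_value, running_value)} for every key that differs."""
    drift = {}
    for key in sorted(set(declared) | set(running)):
        want = declared.get(key, ABSENT)
        have = running.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Illustrative data only: what the repository says versus what the process reports.
declared = {"pool_size": 20, "timeout_s": 5, "cache": "on"}
running = {"pool_size": 20, "timeout_s": 30, "debug": True}

for key, (want, have) in config_drift(declared, running).items():
    print(f"{key}: declared={want!r} running={have!r}")
```

Every line printed is a place where the model in your head and the real system disagree, which is usually worth a look before anything else.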

4. Who changed the environment?

What has changed recently? Who changed it, and when? What was being operated on, and what was the effect? Who logged on to the server? Who pushed code? What configuration changed? And so on. Then ask what behavior changed. For example, which latencies changed, did the dynamics [1] of the relevant parts change, did the error rate change, and which resource loads or availability changed? Which of these changes matter?
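If change events (deploys, logins, configuration edits) are recorded somewhere, the questions above can be narrowed to a time window before the incident. This is a rough sketch under that assumption; the `Change` record, the `changes_before` helper, the 24-hour default window, and the sample events are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    when: datetime
    who: str
    what: str


def changes_before(changes, incident_start, window=timedelta(hours=24)):
    """Changes made inside the window before the incident, most recent first."""
    earliest = incident_start - window
    recent = [c for c in changes if earliest <= c.when <= incident_start]
    return sorted(recent, key=lambda c: c.when, reverse=True)


# Illustrative change log; in practice this would come from deploy and audit records.
CHANGES = [
    Change(datetime(2024, 1, 10, 9, 0), "alice", "deploy api v1.4"),
    Change(datetime(2024, 1, 11, 13, 30), "bob", "raise db pool_size to 50"),
    Change(datetime(2024, 1, 11, 14, 50), "carol", "ssh login on web-3"),
    Change(datetime(2024, 1, 11, 16, 0), "dave", "rotate TLS certs"),
]
INCIDENT = datetime(2024, 1, 11, 15, 0)

suspects = changes_before(CHANGES, INCIDENT)
for c in suspects:
    print(f"{c.when:%Y-%m-%d %H:%M}  {c.who:<6} {c.what}")
```

Listing the most recent change first reflects a common heuristic, not a rule: the change closest to the incident is often, but not always, the one that matters.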

5. Ask an expert

Apply knowledge and experience, directly or indirectly, to understand the relationships and dependencies between things, especially their dynamics and the associated failure modes [2]. Use every resource to find the people who know best: ask friends and colleagues, post on community forums, ask on social networks, on IRC, or on mailing lists… If the expert is a “ghost”, then “hold a séance” [3]. On-site guidance is best; if that is not possible, remote help will do. Where possible, use available expert systems or rule engines, which are codified expertise.

6. Ask clear and accurate questions

Before asking a question, observe and think it through carefully. The information you give the experts will always be incomplete, but a badly framed question cannot be rescued. Some low-risk, clear, and accurate questions do not even need a human to answer them; a rule engine can answer them quickly and automatically.

7. Local small-scale experiments

You can make minor changes or adjustments to the system and observe the impact. This is especially useful when using the method of elimination, exploring relationships between components, and verifying a guess that something doesn’t work.
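When the suspects form an ordered sequence (for example, a list of changes applied one after another), small experiments can be combined with elimination by bisecting the sequence. The sketch below is one possible technique, not something the article prescribes. It assumes that once the system turns bad it stays bad for all later changes, and the names `first_bad`, `is_good`, and the sample change labels are hypothetical.

```python
def first_bad(candidates, is_good):
    """Index of the first candidate for which is_good() is False.

    Assumes every candidate before the first bad one is good and every
    candidate after it is bad. Returns len(candidates) if all are good.
    """
    lo, hi = 0, len(candidates)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(candidates[mid]):
            lo = mid + 1  # the fault was introduced later
        else:
            hi = mid      # the fault is here or earlier
    return lo


# Illustrative experiment: eight changes, and the fault was introduced by "c6".
changes = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
probed = []


def is_good(change):
    """Stand-in for a real experiment, e.g. deploy up to this change and test."""
    probed.append(change)
    return changes.index(change) < changes.index("c6")


culprit = changes[first_bad(changes, is_good)]
print(f"first bad change: {culprit}, experiments run: {len(probed)}")
```

Each experiment halves the remaining search space, so the number of experiments grows with the logarithm of the number of suspects rather than linearly.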

8. Eliminate in advance

Don’t waste time going in an obviously wrong direction. It eats up time and energy, diverts attention, and wastes resources. Don’t wait until the end to regret not having eliminated a wrong option earlier. What the problem “is” matters, but never lose sight of what the problem “is not.” Keep ruling things out using logic and experience.

9. Check carefully

An investigation may end with contradictory conclusions, with some parts only seeming correct, and with the actual problem unsolved. In the words of Mark Twain, “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.” So be willing to challenge and test your basic assumptions, your basic facts, and the facts gathered along the way, because the specious part is often buried among them.

10. Seek solace

Identifying problems is difficult: there are never enough time or tools, and there is always a lot of pressure. Stop often to rethink and re-examine what is known and how it connects and interacts; the truth is often discovered in such unexpected ways…

Failure is inevitable, and by following these ten truths, the answer will come.

[1] We need to identify which is the cause, which is the effect, and to what extent. Interpreting the metrics is key.

[2] Failure modes: En.wikipedia.org/wiki/Failur… The emphasis here is on making effective use of previous experience.

[3] Ouija means seeking expertise from any source, including ghosts, if necessary.