One thing struck me particularly recently and prompted this post. During some internal extreme-load regression tests, we observed abnormal CPU utilization on TiKV (TiDB's distributed storage component), but nothing in our Grafana metrics or log output explained it, so we were confused for several days. Profiling was the only way to find the culprit, and it turned out to be somewhere no one expected: a debug log module. (To be clear: this bug has been fixed, and it only triggers when the log level is fully turned on under extreme pressure, so users can rest assured.) This article is not a bug analysis, though. What I find more important are the tools we used along the way and the thought processes of the veterans. As an observer, I watched my younger colleagues admire the old hands as they expertly operated perf and switched between various tools and interfaces, and I had an inkling that something was wrong: it meant the craft could not be replicated.

Afterwards, I did some research on the user experience of basic software. There is surprisingly little theory or literature in this field (most studies concern consumer products; for system software there is little beyond the UNIX philosophy), and what exists is unsystematic and depends on the author's "taste". Yet there is certainly such a thing as a good or bad software experience: an experienced engineer can look at a command-line tool, tap at it a few times, and know whether it is a "tasteful" tool. In many cases "taste" is called taste precisely because it cannot be explained. That is the art of software development, but it also means it cannot be copied or easily learned.
I don't think that is a good state of affairs, so in today's post, and possibly a few more to come (I don't yet know what they will cover, but let me plant the flag), I will try to summarize where a good basic-software experience comes from. This first part covers two important topics: observability and interactivity. As for why these two are put together, I will keep you in suspense and explain at the end.

Observability

What is observability? For that, see my article "Observability of Distributed Systems in My Eyes" [1], published two years ago, which I will not repeat here. As the observability practice in TiDB deepens our understanding of this topic, let's first clarify a question: when we talk about observability, who is observing?

Who is observing?

Many friends may be taken aback: of course it is a person, not a machine. Yes, it is people who observe, but this simple truth is often overlooked by software designers. So what is the difference between the two, and why is it important to emphasize the human subject?

To answer this question, one needs to know that human short-term working memory is very limited. A large body of psychological research shows that working-memory capacity is roughly 4, that is, we can attend to about four items of information at once over a short term [2]. Larger amounts of information can only be memorized by chunking, which is how we quickly memorize telephone numbers. Taking 13800001111 as an example, we usually do not memorize the digits one by one but group them: 138-0000-1111. Once you understand the basic assumptions and bandwidth of the human mental model, I think many system-software developers will stop bragging "my software has 1000+ monitoring metrics!". That is not a good thing: more information can actually damage the formation of short-term memory and introduce more noise, forcing users to spend a lot of time searching a sea of information for the key items, while the brain involuntarily indexes and classifies it in the background (which also consumes bandwidth). So here is the first conclusion: one screen should present at most four key pieces of information. The next question, then, is: what counts as key information? What is noise?

Distinguish key information from noise

There is no universally right answer to this question. My rule of thumb for system software is: follow the key resources. Software is actually very simple; its essence is the use and allocation of hardware resources, and the art lies in the balance. There are only a handful of key hardware resources. For each of them, over a given sampling period (a single point in time does not mean much), a few simple questions give a general picture of the system's running state:

  • CPU: Which threads are working? What are those threads doing? How much CPU time has each of them consumed?
  • Memory: What is currently stored in memory? What is its hit rate (usually we focus on the business cache)?
  • Network I/O: Is the QPS/TPS abnormal? What are the current major network I/O requests initiated by? Is there enough bandwidth? What is the request latency? Long connections or short connections (a proxy for syscall overhead)?
  • Disk I/O: Is the disk reading or writing files? Which files? What is the dominant read/write pattern? What is the throughput? How long does a single I/O take?
  • Key logs: Not all logs are useful; only logs containing certain keywords are interesting. So: are there logs containing those specific keywords?

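To make the "sampling period" point concrete, here is a tiny sketch in Go (my own illustration, not TiDB code): CPU utilization only makes sense as a delta between two cumulative snapshots, the way tools compute it from /proc/stat; a single sample tells you nothing.

```go
// Illustration of "a single point does not mean much": utilization is the
// busy fraction of the delta between two cumulative CPU-time samples.
package main

import "fmt"

// CPUSample holds cumulative busy/idle time (e.g. jiffies from /proc/stat).
type CPUSample struct {
	Busy, Idle uint64
}

// Utilization returns the fraction of time spent busy between two samples.
func Utilization(prev, cur CPUSample) float64 {
	busy := cur.Busy - prev.Busy
	idle := cur.Idle - prev.Idle
	total := busy + idle
	if total == 0 {
		return 0 // no time elapsed in this period
	}
	return float64(busy) / float64(total)
}

func main() {
	prev := CPUSample{Busy: 1000, Idle: 9000}
	cur := CPUSample{Busy: 1600, Idle: 9400} // 600 busy vs 400 idle this period
	fmt.Printf("cpu util over period: %.0f%%\n", Utilization(prev, cur)*100)
}
```

The same delta-over-a-period logic applies to disk throughput, network bytes, and the other resources in the list above.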
Asking these standard, soul-searching questions should give you a reasonable understanding of the system's running state.

  • Further, it is crucial that these system metrics be read in their business context. For example, if we look at the CPU threads and their call stacks and see that a lot of CPU time is spent in wait / sleep / idle states, and there is no bottleneck in other I/O resources, these numbers alone may be confusing. Combined with, say, the transaction conflict rate, it becomes much easier for the observer to see which transactions, or even which rows, the lock-wait time is being spent on.

This is not to say that other information is useless, but a great deal of it only has value in hindsight. The vast majority of debug logs, for example, or auxiliary information designed to confirm a conjecture, are of little help against an unknown problem and require a lot of background knowledge from the observer; such information is best presented folded up and out of sight. If you open TiDB's internal Grafana, you will see many metrics of this kind, such as stall-conditions-change-of-each-cf (I know what it means, but I would guess 99% of TiDB users do not). I can see the struggle of the engineer who named it: he must have wanted others (or his future self) to understand what the name meant, but unfortunately, at least for me, it did not work. What is the next step after observation? Taking action. And before taking action, what is the premise for it? The general pattern by which we handle a problem (my own summary, though any cognitive psychology textbook has a similar concept) is: observe -> find a motivation -> conjecture -> verify the conjecture -> form a plan -> act, then return to observation and repeat. In this loop, the place where the human (the veteran's experience) matters most is the step from observation to conjecture. As for the motivation for observing, there are only two kinds:

  1. Solve the problems at hand;
  2. Avoid potential risks (avoid future failures).

If we assume there is nothing wrong with the system, there is not much need to change it. I think these two steps matter because basically everything else can be automated, while these two are hard: they require human knowledge, experience, and intuition. A system with good observability makes masterful use of human intuition. A quick example: when opening a backend dashboard, we try not to focus on specific text first. If the screen has many red and yellow patches, intuition tells us the system may be unhealthy; if the red and yellow cluster at a particular spot on the screen, our attention is drawn there; if the interface is all green, it is probably healthy. How do we make the most of intuition, and on what? I think the best target is risk prediction.

Where is human intuition used? Risk prediction

We need some prior knowledge here. Before getting into the topic, I want to share a little story I once heard. A motor at a Ford factory broke down, and an old master was called in. He listened to the sound, watched the machine run for a while, then drew a chalk line on the motor and said the coil at that spot had too many windings. The skeptical workers removed them, and sure enough the problem was solved. When Ford's boss asked why drawing one line cost so much, the old master wrote out a bill: $1 for drawing the line, $9,999 for knowing where to draw it.

Regardless of whether the story is true, if it is, it shows that intuition and experience really can produce a lot of value. My first reaction on hearing it was that the old master must have seen this situation many times before (obviously), and that the failure must be a common one. The hardest part of problem solving is ruling out most of the unpromising directions through observation (especially of a few characteristic signals), trusting that the causes of common failures converge. Accordingly, the first step for a system with good observability is to give the user an intuitive direction: use prior knowledge to surface the most likely points of failure and their related indicators (CPU utilization, etc.). The second step is to present them with some psychological tricks. Here is an example from TopSQL, a small feature that will be introduced in TiDB. The feature is easy to describe: we found that many user failures were associated with a small number of SQL statements whose CPU footprint differed markedly from the rest, even though each statement's footprint looked normal on its own. So TopSQL answers the question: how much CPU is being consumed, and on which SQL? I will deliberately not annotate the screenshot below; I am guessing you will quickly see how to use it:

Your intuition tells you that the dense green band in the second half looks different from the rest, driving up overall CPU usage; it feels problematic. Yes, that is probably the right direction. Good visualization uses human intuition to quickly locate the main contradiction.

What is a cycle? Identify the operation's true lifecycle

This point was written with another key resource in mind, one that is often overlooked: time. I was going to put it in the key-resources section, but it seems more appropriate here.

On a slightly metaphysical level, today's computers are all implementations of Turing machines, and the minimal feature set of a Turing-complete language is one we learn early: read/write variables, branches, loops. Put poetically, a program is an endless cycle, with smaller cycles (loops) nested inside larger ones, each making choices (branches) based on the status quo (variables). At this point the astute reader may guess my point: talking about observability in isolation from cycles makes no sense. The definition of a cycle is flexible. For a person, the large cycle is obviously a lifetime, while a small cycle can be a year or a day, or even a unit with no fixed time span, such as the cycle of one job. What is a reasonable cycle for database software? The execution of one SQL statement? A transaction from Begin to Commit? There is no standard answer, but I personally recommend making the cycle as close to the end user's usage scenario as possible. For example, in a database, taking a single SQL statement's execution as the cycle is not as good as taking the transaction's cycle, and the transaction's cycle is not as good as the application's full request link. In fact, TiDB introduced OpenTracing long ago to track which functions are called and how much time is spent within one SQL execution cycle, but at first it was used only in the TiDB SQL layer (those familiar with TiDB know that SQL and storage are separated); it was not implemented in the TiKV storage layer, so tracing an SQL statement's execution down into TiKV hit a dead end. We were going to stop there, but then a little thing happened. One day a customer said: "Why is my application's access to TiDB so slow?" We looked at the TiDB monitoring and saw nothing wrong: SQL statements were returning from the database side within milliseconds. But the customer insisted: "Look, I did nothing else with this request; why don't the two sides match up?"
Later, after we added a Tracer, we found the problem was in the customer's network. This case reminded me that if full-link tracing can be built, what is meaningful is viewing the lifecycle from the business request. So by extending Session Variables in TiDB, we can let users pass OpenTracing-protocol Tracer information into the TiDB system through a session variable, connecting the business layer and the database layer and actually achieving full-lifecycle tracking. This feature will also be available in the near future.
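The span-per-step idea behind such tracing can be sketched in a few lines of Go (a toy illustration of the concept, not TiDB's actual tracing code): a trace whose scope is one business-level request, with each step inside it timed as a span.

```go
// Toy trace: measure where a request-level cycle spends its time,
// so the observed period matches the business request, not one function.
package main

import (
	"fmt"
	"time"
)

// Span records the duration of one named step inside a larger cycle.
type Span struct {
	Name     string
	Duration time.Duration
}

// Trace collects the spans of one business-level request.
type Trace struct {
	Spans []Span
}

// Step runs fn and records how long it took under the given name.
func (t *Trace) Step(name string, fn func()) {
	start := time.Now()
	fn()
	t.Spans = append(t.Spans, Span{Name: name, Duration: time.Since(start)})
}

func main() {
	var tr Trace
	// One request-level cycle: parse -> execute, all inside one trace.
	tr.Step("parse", func() { time.Sleep(5 * time.Millisecond) })
	tr.Step("execute", func() { time.Sleep(10 * time.Millisecond) })
	for _, s := range tr.Spans {
		fmt.Printf("%-8s %v\n", s.Name, s.Duration)
	}
}
```

Real systems propagate a trace context across process boundaries (which is exactly what passing it through a session variable achieves), but the unit of observation stays the same: the user's request.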

Having said that, here are a few points:

  1. Time is also an important resource.
  2. Whether you are capturing samples or tracing, it is important to pick the right cycle.
  3. The closer the cycle is to the business cycle, the more useful it is.

Observability saves lives: after-the-fact observation

I am sure no one stares at a monitoring dashboard all day. If you think about it, most of the time we need observability, a failure has already been perceived or a very clear risk exists: the system may be "sick", perhaps from an unknown root cause, or from some subtle anomaly at a point in the past for which nothing beyond the normal metrics was recorded. We certainly do not leave the CPU profiler on forever; profilers are usually triggered manually. But when diagnosing after the fact, having the CPU profile from before the incident is extremely helpful. A better approach is therefore: automatically run the profiler at relatively short intervals (minute level, say) and automatically save the diagnostic results, like keeping regular in-depth physical-examination records, deleting the old records periodically. In case of an accident, you can quickly go back in time and save lives more efficiently.

Continuous Profiling is exactly this very useful capability: profiling over a long time window. In our experience, combined with the sound tracing system described in the previous section, most debugging processes can find the root cause through Tracing + Logs.

The best observability is the ability to instruct the user: “What should I do next?”

As mentioned above, I noticed a particularly interesting phenomenon while watching old hands handle problems: experienced developers can always decide quickly, from observation alone, what to do next, without consulting references or waiting for guidance, entirely in a state of flow (for example, seeing the data distribution or hotspots within a TiDB cluster and knowing to modify the scheduling policy or manually split a Region). Newcomers often get stuck at exactly this stage, turning to Google or scrolling through documents, thinking: "I see the problem, but what do I do next?" If at this point the system could suggest what to observe next, or what to do, it would be far friendlier. Few systems can do this today; if yours can, its observability is in good shape. I put this point at the end of the observability section deliberately, to lead into the topic of interactivity.

Interactivity

Before talking about the interactivity of basic software, I would like to review the history of computing. In my view, a main thread of computer history is also a history of evolving human-computer interaction: from early machines that presented a tangle of switches and wires I would not have known how to operate, to today, when we can skillfully use an iPhone without ever reading the instructions. Behind this lies progress in many disciplines (including but not limited to psychology, cognitive science, neuroscience, philosophy, and computer science).

Back in our field: basic software is far from the general public, and much of its design has been done by engineers. People like us generally lack an understanding of human nature: "My design is understandable to me; I am human, so other people can understand it too. If they don't know how to use it, they can read the documentation."

When we review failures, we often conclude "user misoperation". But is that really the root cause? An incident at my previous company left a lasting impression on me. There was a home-grown distributed file system which, like all file systems, had a shell supporting some UNIX-style commands. Once, an engineer ran rm -rf usr local/… (note the space after usr), and the system obediently began deleting itself. In the post-mortem, the company did not blame the operator but punished the system's designer (the boss of the company at the time), because this was bad interaction design: a confirmation before deleting important directories, or a permissions system, would have prevented the whole thing. The machine did work exactly according to its logic, and the code was even bug-free (the deletion was even efficient, it being a distributed system, LOL). In my long years as an engineer, I came to understand that the best engineers find a balance between logic and sensibility, and that good design comes from understanding both technology and psychology. After all, we write programs for people. As software users, we are not so much using software as "conversing" with it. And since it is a conversation, it is an interactive process. What makes a good interactive experience? Below I try to summarize some principles for software designers; it is a first attempt, and I don't rule out adding more later.

No one reads the documentation: one command to start, then exploratory learning

Admit it: no one reads the instructions. When we get a new iPhone, our first reaction is to turn it on (it is almost magical that we subconsciously know where the power button is), certainly not to look up the boot procedure in a manual, and then to explore the new world with our fingers. The reason is that simple, so why, in the field of system software, must one read the documentation before getting to work?

I often tell our young product managers: "At best, your users will spend 10 seconds on your GitHub front page or the Quick Start section of your documentation. They will not have the patience to read the document through; they subconsciously look for the 'dark-background text' (shell commands), copy it into their terminal, and see what happens. Nothing more. If that first command fails, there will be no second attempt, so remember: you only get one chance." A small example: when I was building TiUP (TiDB's installation and deployment tool), I repeatedly told TiUP's product manager that the home page must contain no nonsense, just one command, and it should work:

A screenshot of TiUP's home page (tiup.io)

Let me extend the example a bit. I remember one year before the pandemic, at FOSDEM in Brussels, I was in a bar near the venue at night talking to a DevOps engineer from the UK. Probably drunk, he said: "System software that cannot be successfully installed with an apt-get install is not good software." Rough words, but not rough logic. So you might ask: if there really is some information or concept that must be conveyed to the user, what is the best way to construct the mental model (to borrow a term from cognitive psychology)? My own answer: exploratory learning. A system supporting this mode of cognitive construction usually needs to be self-explanatory: after the first step (such as turning on the iPhone), the user can use the output of the last step to determine the next action and complete the learning. Here is an example: the MySQL system tables, familiar to every MySQL user. With nothing but an interactive mysql client connected to an instance, and without anyone telling you what INFORMATION_SCHEMA contains, you can simply use SHOW TABLES and then SELECT * FROM to explore, step by step, the contents of the specific tables within INFORMATION_SCHEMA. This is a perfect example of self-explanatory design (one premise in this example being SQL as a unified interaction language). Another good example is Telegram's BotFather. I believe anyone who has written a bot for Telegram has been impressed by how easy BotFather makes it. Here is a picture to give you a sense:

The process of creating a chatbot with Telegram's BotFather

Telegram is a chat application, and BotFather cleverly applies IM's own interactive mode to the otherwise boring process of bot development, rather than coldly handing the user a URL (core.telegram.org/bots/api) and leaving them to it. Treat users as people. May you build software that even the laziest "salted fish" can use well.

Help the user think one step ahead: tell them half a step, let them take the other half

I am a big fan of science fiction, and much of it explores an ultimate philosophical question: do we really have self-consciousness? We think we do; yet when software prints Unknown Error, you still want a voice telling you what to do next, right? When a good piece of basic software delivers negative feedback, the best thing it can do is suggest what to do next. A classic example: every Rust developer has been schooled by the compiler at some point, but the process is not exactly painful, as the snippet below shows:

Plain Text
error[E0596]: cannot borrow immutable borrowed content `*some_string` as mutable
 --> error.rs:8:5
  |
7 | fn change(some_string: &String) {
  |                        ------- use `&mut String` here to make mutable
8 |     some_string.push_str(", world");
  |     ^^^^^^^^^^^ cannot borrow as mutable

It is not painful because the compiler tells you exactly what is wrong, why, and what to do next. An ordinary compiler might print only "cannot borrow as mutable", but a good compiler helps you think one step further.

Going back to the question of self-awareness, I once heard this joke: a test engineer walks into a bar and orders NaN glasses of Null; a test engineer disguised as the owner walks into a bar, orders 500 beers and doesn't pay; 10,000 test engineers howl outside the bar; a test engineer walks into a bar and orders a beer ';. The test engineers leave the bar satisfied; then a real customer orders fried rice, and the bar explodes, LOL. The moral: as a software designer, you can never fully anticipate the user's imagination. Rather than letting users' imaginations run wild, design the storyline yourself and let them follow you step by step. But why leave the user half a step? My answer:

  1. "Engagement" creates happiness. People are conflicted: we want the machine to do everything automatically, yet we also want to feel in control. Sometimes the software already knows what the next step should be, but leaving that step for the user to complete gives them a sense of accomplishment.
  2. The right to choose stays with the operator; especially for one-way-door decisions, go or no-go should still be decided by a person.

I have a few more tips for this:

  1. For operations that may trigger multiple consecutive actions (such as terraform-style deployment scripts, or features like cluster changes), provide a Dry Run mode that outputs the planned actions without executing them.
  2. For batch operations like the above, design save points wherever possible, so that a failure does not force redoing everything from scratch (similar to resumable downloads).
  3. When a true Unknown Error is encountered, output all the context that would help debug it, and at the end of the error log prompt the user where to file a GitHub Issue; better still, pre-fill the Issue title in the URL link (and let the user decide whether to file it).
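Tip 3 can be sketched in a few lines of Go: on an unknown error, build a "file an issue" link with the title and body pre-filled, and let the user decide whether to open it. The repository URL and error text here are placeholders of my own.

```go
// Sketch: pre-fill a GitHub "new issue" link so the user only has to click.
package main

import (
	"fmt"
	"net/url"
)

// issueURL returns a GitHub new-issue link with title and body pre-filled.
func issueURL(repo, title, body string) string {
	q := url.Values{}
	q.Set("title", title)
	q.Set("body", body)
	return repo + "/issues/new?" + q.Encode()
}

func main() {
	link := issueURL(
		"https://github.com/example/mydb", // placeholder repository
		"Unknown Error: checksum mismatch during compaction",
		"version: v1.2.3\nos: linux/amd64\n(please attach the log below)",
	)
	fmt.Println("It looks like you hit a bug. You can report it here:")
	fmt.Println(link)
}
```

The decision to actually file the issue stays with the user, which is exactly the "tell them half a step" principle.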

Unified language: controllers and controlled objects

I interview a lot of systems engineers, and I like to ask: what is the best CLI tool you have used? The vast majority answer, almost subconsciously: redis-cli. In fact I would give the same answer myself, which made me wonder: why?

"Controller and controlled object" is a very common pattern in basic software. Just as we operate a TV mostly through its remote control, the user's first and most frequent point of contact with the TV is the remote. By analogy, for basic software the design of the controller is critical, and to do it well I think the key points are:

  1. Build a unified interactive language
  2. Self-consistent and concise conceptual model

I will use redis-cli as an example for a bit of interpretation. Anyone who has used redis-cli knows that all operations follow the pattern [CMD] [ARG1] [ARG2]…, with no exceptions. Whether manipulating data or modifying configuration, everything happens in one unified interaction language, a language that is clear at a glance and has natural conventions, for example that a command (CMD) always consists of a few letters with no symbols.

Bash
redis 127.0.0.1:6379> SET k v
OK
redis 127.0.0.1:6379> DEL k
(integer) 1
redis 127.0.0.1:6379> CONFIG SET loglevel "notice"
OK
redis 127.0.0.1:6379> CONFIG GET loglevel
1) "loglevel"
2) "notice"

redis-cli interaction example

In fact, the MySQL case in the exploratory-learning section is similar: SQL itself is a unified interaction language, just less immediately intuitive than Redis's. The second point is the conceptual model. Redis has the advantage of being a key-value database, so the concept is very simple: everything is key-value. Looking at the CLI tool, you can see the author striving to map all functions and interactions onto this key-value model, which feels natural, because anyone who picks up redis-cli has already accepted that Redis is a KV database; the automatic mental assumption is key-value mode, so every operation in the CLI feels natural. The same applies to much good database software, such as Oracle, which in theory can do everything to the software itself through SQL, because any Oracle user can be assumed to know the relational model and SQL. Having given a positive example, here is a counterexample. As you may know, the TiDB main project (excluding tools such as CDC and Binlog) has at least three controller tools: tidb-ctl, tikv-ctl, and pd-ctl. TiDB is indeed a distributed system composed of multiple components, but for users, the object of operation is most of the time TiDB as a whole (a database), yet the several ctl tools work quite differently. For example, pd-ctl is an interactive controller whose scope of influence is roughly PD itself and TiKV; tikv-ctl overlaps with some of its functions but operates only on a single TiKV instance. This is confusing: TiKV is clearly a distributed system, yet tikv-ctl is a controller for a single node? So which ctl should you use to control TiKV? Answer: pd-ctl, most of the time. It is as if you had one TV set but needed three remote controls, and the remote that actually controls the TV were the one labelled "set-top box".
In daily life such a design would obviously be considered a problem, yet in the field of basic software, why does everyone's tolerance suddenly seem so much higher?
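The "unified interaction language" point above can be sketched as a toy dispatcher in Go (an illustration of the grammar, not any real CLI): data operations and configuration operations share the exact same [CMD] [ARG1] [ARG2]… sentence shape, as in redis-cli.

```go
// Toy CLI: one grammar for everything, whether data or configuration.
package main

import (
	"fmt"
	"strings"
)

type store struct {
	data   map[string]string
	config map[string]string
}

// exec parses one "[CMD] [ARG...]" line and dispatches on the command word.
func (s *store) exec(line string) string {
	parts := strings.Fields(line)
	if len(parts) == 0 {
		return "(empty)"
	}
	switch strings.ToUpper(parts[0]) {
	case "SET":
		s.data[parts[1]] = parts[2]
		return "OK"
	case "GET":
		return s.data[parts[1]]
	case "CONFIG": // config commands use the very same sentence shape
		if strings.ToUpper(parts[1]) == "SET" {
			s.config[parts[2]] = parts[3]
			return "OK"
		}
		return s.config[parts[2]]
	}
	return "(error) unknown command"
}

func main() {
	s := &store{data: map[string]string{}, config: map[string]string{}}
	for _, cmd := range []string{"SET k v", "GET k", "CONFIG SET loglevel notice", "CONFIG GET loglevel"} {
		fmt.Printf("> %s\n%s\n", cmd, s.exec(cmd))
	}
}
```

Because the grammar never changes, once the user has learned one command, they have learned them all; a fragmented family of ctl tools offers no such transfer.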

No surprises: don't violate the user's expectations

I don't know whether it is a universal phenomenon, but users of basic software tend to blame themselves and feel guilty when faced with errors (especially ones caused by bad interaction) and rarely attribute them to the software. Indeed, when someone can skillfully operate a complex, fragmented piece of software, many regard it as a kind of "skill"; after all, no one wants to be seen fumbling.

There is a deeper reason behind this (a bit of hacker culture that celebrates complexity), but I would say it is the software's problem! I have never been shy about admitting I cannot use GDB, not because I am not smart but because it is so hard to use. Yet I have seen plenty of people show off fluent command-line GDB as social capital. Back to the scene mentioned earlier: at a heavy TiDB user's site, I watched their operators do daily maintenance, skillfully switching between all sorts of ctl tools; the operator saw no problem with this and even took some pride in it. It struck me that human adaptability is strong, and what truly troubles people is not hassle itself but violated expectations. When you operate a system, you usually carry reflexive assumptions; for instance, if a feature is called an "xx switch", you expect positive feedback when you turn the switch on, and if it does not come, it is deeply frustrating. A true story: in TiDB 5.0 we introduced a new feature called MPP (Massively Parallel Processing), which comes with a switch configuration named tidb_allow_mpp.

I wonder if you noticed the problem: as a switch, when set to OFF it takes effect 100% of the time, which is fine; but when set to ON, whether the feature is actually used depends on the optimizer's judgment, meaning there is some probability that MPP will not kick in. It is like a light switch in a room: flipping it off always turns the light off, but flipping it on only sometimes turns the light on. You will not think it is smart; you will think it is broken. A better way to write the configuration would be:

tidb_mpp_mode = ON | OFF | AUTO

I don't even need to explain this, and you don't need the documentation: you know how to use it at a glance, right? Good configuration should be self-explanatory. In general, configuration items are among the worst offenders against user experience; more on that in the feedback section below.
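The tri-state idea can be sketched in Go (illustrative only, not TiDB's actual implementation): an explicit AUTO value is the honest name for "the optimizer decides", instead of hiding that "maybe" behind a boolean switch.

```go
// Sketch of a tri-state setting: make the "optimizer decides" case explicit.
package main

import (
	"fmt"
	"strings"
)

type MPPMode int

const (
	MPPOff  MPPMode = iota // never use MPP: flipping the switch off always "works"
	MPPOn                  // force MPP; under this design, fail loudly if the plan cannot use it
	MPPAuto                // let the optimizer decide: the honest name for "maybe"
)

// ParseMPPMode accepts ON | OFF | AUTO, case-insensitively.
func ParseMPPMode(s string) (MPPMode, error) {
	switch strings.ToUpper(s) {
	case "OFF":
		return MPPOff, nil
	case "ON":
		return MPPOn, nil
	case "AUTO":
		return MPPAuto, nil
	}
	return 0, fmt.Errorf("tidb_mpp_mode must be ON, OFF or AUTO, got %q", s)
}

func main() {
	for _, v := range []string{"ON", "off", "Auto"} {
		m, err := ParseMPPMode(v)
		fmt.Println(v, "->", m, err)
	}
}
```

With three values, every setting produces exactly the behavior its name promises, which is the whole point of a self-explanatory configuration.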

The UNIX philosophy includes a "rule of silence": if a program has nothing special to say, it should say nothing; a concrete expression is that a command-line program that succeeds should exit with return code 0 and no output. I actually have reservations about this point. If the user's action produced the expected result, give a definite positive feedback as a reward (such as printing Success). Don't forget Pavlov.

Feedback: Reveal progress, not internal details

I just mentioned feedback, and it is fair to say that feedback is the most important part of a good experience. Anyone who has studied cybernetics knows that feedback is a central concept. The reason the self-explanatory designs mentioned above feel good is precisely the timeliness of their feedback.

To my surprise, though, a lot of software is outrageously bad at designing feedback into interaction. For example, certain database software I am familiar with simply hangs after you hit Enter on a complex query. The database may indeed be working hard behind the scenes retrieving and scanning data, and minutes later it returns a result (or dies), with no feedback in between on how much data has been scanned or how much is expected, even though that information is progress (ClickHouse does this very well). Feedback needs careful design. My experience: feedback must be immediate, ideally within 200ms of hitting Enter (the human physiological reaction time; beyond that, the interaction feels stuck). The sense of smoothness is created by feedback.
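A sketch of progress feedback for a long scan, in Go (the stride and row counts are made up for illustration): instead of silence until the end, report rows scanned at a fixed stride so the user continuously sees movement.

```go
// Sketch: a long scan that reports progress instead of hanging silently.
package main

import "fmt"

// scan processes `total` rows, invoking progress every `stride` rows and at
// the end, so a slow query gives continuous feedback rather than silence.
func scan(total, stride int, progress func(done, total int)) int {
	processed := 0
	for i := 1; i <= total; i++ {
		processed++ // stand-in for real per-row work
		if i%stride == 0 || i == total {
			progress(i, total)
		}
	}
	return processed
}

func main() {
	scan(2500, 1000, func(done, total int) {
		fmt.Printf("scanned %d/%d rows (%.0f%%)\n", done, total, 100*float64(done)/float64(total))
	})
}
```

A real client would also rate-limit these updates to keep the first one within the ~200ms window mentioned above; the principle is the same either way: report progress, not silence.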

Give feedback on progress, not on internal details, and especially not on details that require internal context to interpret (unless you are in Debug Mode). Here is a counterexample of our own (asktug.com/t/topic/201…):

```sql
MySQL [test]> SELECT COUNT(1) AS count, SUM(account_balance) AS amount, trade_desc AS type FROM b_test WHERE member_id = '22792279001' AND detail_create_date >= '2019-11-19 17:00:00' AND detail_create_date < '2019-11-28 17:00:00' GROUP BY trade_desc;
ERROR 9005 (HY000): Region is unavailable
```

What is bad about this case? Obviously, a Region is an internal TiDB concept, so the user’s natural questions are: What is a Region? Why does a SELECT involve Regions? Why is the Region unavailable? How do I fix it? Exposing this information to the user is useless and only creates noise. The root cause of this case is that TiKV was too busy to return the requested data, so better feedback would say which data (in terms the user understands, such as which table and which rows) cannot be read because a TiKV node is too busy. Better still, tell the user why it is busy and how to resolve it, or at least include an FAQ link (I have even seen software print a StackOverflow search URL, LOL).
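The translation from internal error to user-facing error is mostly bookkeeping: carry the user-level context (table, cause) alongside the internal one, and render only the former. A hypothetical Go sketch (the struct, message wording, and FAQ URL are all my own placeholders, not TiDB’s actual error handling):

```go
package main

import "fmt"

// internalErr carries both the internal failure and the user-level
// context needed to explain it.
type internalErr struct {
	component string // internal node name, e.g. "tikv-4"
	table     string // what the user was actually reading
	cause     string // e.g. "server is busy"
}

// userFacingError renders the failure in the user's vocabulary (tables,
// not Regions) and points at a remedy instead of leaking internals.
func userFacingError(e internalErr) string {
	return fmt.Sprintf(
		"cannot read table %q because storage node %s is overloaded (%s); "+
			"see https://example.com/faq/overloaded-storage for mitigation",
		e.table, e.component, e.cause)
}

func main() {
	fmt.Println(userFacingError(internalErr{"tikv-4", "b_test", "server is busy"}))
}
```

The internal detail is still there for Debug Mode and logs; it just never becomes the headline of the error message.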

Set milestones for positive feedback: print an ASCII art banner when a server application is ready to serve, and use colored labels for different log levels; these are clear signals to the user. Redis-Server does this very well. Designing feedback for interactive command-line programs is usually easy. What is far more troublesome is that basic software often relies heavily on configuration files, and the feedback cycle between modifying a configuration and confirming it has taken effect is long: modify the configuration, restart, observe the effect. Because configuration usually lives in files, the feedback on the edit itself is also poor; the user does not know whether the change took effect, especially when the effect of a configuration item is not obvious. Some good practices:

- When the program starts, print which configuration file it read and what that file contains.
- Provide a command-line function like print-default-config that outputs a template configuration directly, saving the user a trip to Google.

For distributed systems, configuration is even more complex, because there is a distinction between local and global configuration, plus the problem of distributing updated configuration, including rolling restarts (needing a restart for configuration to take effect is itself poor design). Honestly, I do not have a perfect solution yet; my idea is to use a distributed global configuration center such as etcd, or (for databases) global configuration tables. But the general principles are clear: centralized is better than scattered; taking effect immediately is better than requiring a restart; a unified interaction (one way to modify and read configuration) is better than many.

In closing

Finally done, though I think this article is only a beginning; there must be many good practices I have not yet captured, and I hope friends with ideas will reach out to discuss. Now let me resolve the suspense I opened with: why, in this first article, are observability and interactivity written about together? It comes from the classic model of human action in cognitive psychology [3]:

There are two gulfs that users face when using software: the Gulf of Execution, where they have to figure out how to operate the software and “talk” to it; and the Gulf of Evaluation, where they have to figure out the result of their actions. Our mission as designers is to help users bridge these two gulfs, and they correspond to observability and interactivity in this article.

Designing software that is pleasant to use is an art, and it is no easier than designing a sophisticated algorithm or a robust program; in a sense it is harder, because it requires the designer to have a thorough understanding of, and real affection for, both people and software. Let me leave you with a quote from Steve Jobs: “Design is not just what it looks like and feels like. Design is how it works.”

[1] Huang Dongxu, Observability of Distributed Systems in My Eyes, 2020
[2] Overtaxed Working Memory Knocks the Brain Out of Sync, Quanta Magazine
[3] The Design of Everyday Things, Donald Norman, 1988