Chatting About Dynamic Tracing Technology

About the author

Hi, I’m Zhang Yichun, better known online as agentzh. Many of you probably know me from my open source work, such as the OpenResty project I created, the many third-party modules I have written for Nginx, the Perl modules I have contributed since college, and the many Lua libraries I have written in recent years. I have a wide range of interests. I like fancy things with a high level of abstraction, such as functional and logic programming languages, but I am equally interested in very low-level things, such as operating systems, web servers, databases, high-level language compilers, and other system software. In particular, I enjoy building and optimizing large-scale Internet application systems.

What is dynamic tracing

I am delighted to share the topic of dynamic tracing with you; it is a very exciting topic for me personally. So what is dynamic tracing?

Dynamic tracing is, in a sense, a postmodern advanced debugging technology. It can help software engineers answer difficult questions about a software system at very low cost and in a very short time, so they can troubleshoot and solve problems much faster. The backdrop for its rise and prosperity is the rapidly growing Internet era we live in. As engineers, we face two major challenges. The first is scale: the number of users, the size of data centers, and the number of machines are all growing rapidly. The second is complexity: our business logic is becoming more and more complex, and the software systems we run are becoming more and more complicated as well, divided into many, many layers, including the operating system kernel, all sorts of system software on top of it such as databases and web servers, then the interpreters and just-in-time (JIT) compilers of scripting languages or other high-level language virtual machines, and at the very top the various abstraction layers of application-level business logic and a lot of complex code.

The most serious consequence of these enormous challenges is that today’s software engineers are rapidly losing insight into, and control over, the entire production system. In such complex and huge systems, the probability of all kinds of problems is greatly increased. Some problems can be fatal, such as 500 error pages, memory leaks, or incorrect results being returned. The other big category is performance: we may find that the software runs very slowly at certain times, or very slowly on certain machines, but we do not know why. As everyone embraces cloud computing and big data, the ever-increasing number of weird problems in massive production environments can easily eat up most of an engineer’s time and energy. Most of them are online problems that are difficult or nearly impossible to reproduce offline, and some occur at very low rates: one in a hundred, one in a thousand, or even less. Ideally, we would analyze and locate the problem and apply a targeted fix while the system is still running, without taking machines offline, modifying code or configuration, or restarting services. If we can do that, then we can finally get a good night’s sleep every night.

Dynamic tracing can actually help us realize that vision and dramatically liberate engineers’ productivity. I still remember that when I worked at Yahoo China, I sometimes had to take a taxi to the office in the middle of the night to deal with online problems, which was obviously a helpless and frustrating way to live and work. I now work at a CDN company in the United States. Our customers have their own operations teams, and whenever they have spare time they dig through the raw logs our CDN provides. What looks to us like a one-in-a-hundred or one-in-a-thousand problem may be important to them, so it gets reported, and we have to investigate, find the real cause, and get back to them. It is this large volume of real, practical problems that stimulates the invention and development of new technology.

One of the great things about dynamic tracing, I think, is that it is an “in vivo analysis” technology. That is, while a program or an entire software system is still running, still serving online traffic, still processing real requests, we can analyze it (whether it likes it or not), much like querying a database. This is very interesting. Many engineers tend to overlook the fact that a running software system itself contains most of the valuable information, and can be treated directly as a real-time database to be “queried”. Of course, this special “database” must be read-only; otherwise our analysis and debugging could affect the behavior of the system itself and potentially compromise online services. With the help of the operating system kernel, we can launch a series of targeted queries from the outside and obtain a wealth of first-hand details about the running software system, which can then guide our problem analysis, performance analysis, and much else.

Dynamic tracing is usually implemented on top of the operating system kernel. The kernel effectively controls the entire software world, because it sits in the position of a “creator”. With its absolute authority, it can ensure that the various “queries” we issue against the software system do not affect the system’s normal operation. In other words, our queries must be safe enough to be used liberally on production systems. Treating the software system as a “database” also raises the question of how to query it, and obviously we do not use SQL to query this special “database”.

In dynamic tracing, queries are usually issued by means of probes. We place probes on one layer, or several layers, of the software system, and we define handlers associated with those probes. It is a bit like acupuncture in traditional Chinese medicine: if we think of the software system as a person, we can insert “needles” into certain acupuncture points on its body, and those needles usually carry “sensors” we define ourselves, which freely collect whatever key information is needed at each point. The information is then pieced together to produce a reliable diagnosis and a feasible treatment plan. Tracing here usually involves two dimensions. One is time: since the software is always running, it changes continuously along the time line. The other is space: tracing involves a number of different processes, possibly including kernel threads, and each process has its own memory space; so across different layers, and within the memory space of a single layer, we can gather a great deal of valuable information both vertically and horizontally at the same time. It is a bit like a spider hunting for prey on its web.
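To make the idea of probes and handlers concrete, here is a minimal sketch in SystemTap’s scripting language (the framework we will introduce in detail later). It attaches a tiny handler to every system call probe that simply counts calls per process, and a timer probe prints the aggregated result every five seconds; the five-second interval and the top-10 limit are arbitrary choices for illustration:

    # count system calls per process name and report every 5 seconds
    global counts

    probe syscall.*
    {
        counts[execname()]++        # the handler: bump a counter for this process
    }

    probe timer.s(5)
    {
        printf("=== top 10 processes by system call count ===\n")
        foreach (name in counts- limit 10)
            printf("%-20s %d\n", name, counts[name])
        delete counts
    }

Running such a script (with the stap command, as root) inserts the probes on the fly; pressing Ctrl-C removes them again and leaves the target system exactly as it was.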

We can collect information not only inside the operating system kernel, but also at higher levels, such as in user-space programs. The pieces of information can then be connected along the time line to build a complete software picture that effectively guides us through very complex analyses. The key point is that all of this is non-invasive. If the software system is a person, we obviously do not want to cut a living person open just to diagnose a disease. Instead, we take an X-ray, do an MRI, feel the pulse, or, most simply, listen with a stethoscope. Diagnosing a production system should be no different. Dynamic tracing lets us obtain exactly the first-hand information we want, quickly, efficiently, and non-invasively, without modifying the operating system kernel, our applications, our business code, or any configuration, and thereby helps us locate all kinds of problems.

I think most engineers are very familiar with the process of constructing software; it is, after all, one of our basic skills. We typically build different levels of abstraction and construct software layer by layer, whether bottom-up or top-down. There are many ways to establish these abstraction layers, for example through object-oriented classes and methods, or directly through functions and subroutines. Debugging works in the opposite direction to construction: we want to be able to easily “break through” the abstraction layers that were so carefully established and reach any information we need on any one level, or several levels, regardless of the encapsulation design, the isolation design, or any other artificial rules created during construction. This is because when debugging we want to get at as much information as possible, since a problem can occur at any level.

Because dynamic tracing is typically based on the operating system kernel, which is the “creator” and the absolute authority, it can easily pierce through the various layers of abstraction and encapsulation, so the layers established during software construction are not really a hindrance. On the contrary, well-designed abstraction and encapsulation actually help the debugging process, as we will discuss later. In my own work I often find that when something goes wrong online, some engineers get very flustered and start making wild guesses about the possible cause, without any evidence to support or refute their guesses and assumptions. They may even resort to trial and error on the live system, over and over, making themselves and their colleagues miserable and wasting valuable time. When we have dynamic tracing, troubleshooting can itself become a fun process: a weird online problem becomes exciting, as if we had finally caught the chance to solve a fascinating puzzle. All of this, of course, assumes that we have powerful tools at hand to help us gather information, reason about it, and quickly prove or disprove any hypothesis or conjecture.

Advantages of dynamic tracing

Dynamic tracing generally does not require any cooperation from the target application. It is as if we could take an X-ray of a friend running around a playground without them even knowing it, which, when you think about it, is quite remarkable. Analysis tools based on dynamic tracing work in a “hot-plug” fashion: we can run them at any time, start sampling at any time, and stop sampling at any time, regardless of the current state of the target system. Many statistics and analyses are only thought of after the target system has gone online; it is impossible to anticipate, before launch, every problem we might run into, let alone all the information we would need to collect to troubleshoot those unknown problems. The great advantage of dynamic tracing is that data can be collected “anytime, anywhere, on demand”. Another advantage is that the performance cost is minimal: a carefully written debugging tool usually affects the system’s peak performance by five percent or less, so it generally has no observable impact on end users. Moreover, even this tiny cost is only incurred during the few dozen seconds or minutes in which we actually sample. Once the debugging tool finishes running, the online system automatically returns to one hundred percent of its original performance and keeps running.

DTrace and SystemTap

You cannot talk about dynamic tracing without mentioning DTrace. DTrace is the ancestor of modern dynamic tracing technology. It was born on the Solaris operating system in the early 2000s and was written by engineers at the former Sun Microsystems. Many of you have probably heard of Solaris and Sun.

I remember a story from when it first came out: several Solaris operating system engineers spent days and nights troubleshooting what seemed to be a very weird online problem. At first they thought it must be some very sophisticated issue, so they worked very hard on it, only to discover after a few days that it was just a very silly configuration mistake in an obscure place. Since that incident, the engineers took a step back and created DTrace, a very advanced debugging tool, to help them avoid spending so much time on silly problems in the future. After all, most so-called “weird problems” turn out to be low-level ones, of the kind that make you miserable while you cannot find them and even more miserable once you do.

It should be said that DTrace is a fairly general-purpose debugging platform. It provides a scripting language called D, which resembles C, and DTrace-based debugging tools are written in this language. D provides special syntax for specifying “probes”, each of which carries a location description: you can place a probe at the entry or exit of a kernel function, at the entry or exit of a user-space function, or even at an arbitrary program statement or machine instruction. Writing D debugging programs requires a certain understanding of, and knowledge about, the target system; these debugging tools are exactly what we need to regain insight into complex systems. Brendan Gregg, then an engineer at Sun, was one of the earliest users of DTrace, even before it was open-sourced. Brendan wrote a large number of reusable DTrace-based debugging tools, collected in an open source project called the DTrace Toolkit. DTrace is one of the earliest dynamic tracing frameworks and one of the best known.

The advantage of DTrace is its tight integration with the operating system kernel. The D language is implemented as a small virtual machine (VM), somewhat like the Java Virtual Machine (JVM). One benefit is that the D runtime resides in the kernel and is very small, so every debugging tool starts up and exits very quickly. But I think DTrace also has significant drawbacks. One that really bothers me is that D has no loop constructs, which makes it very hard to write tools that analyze complex data structures in the target process. The official explanation is that loops are excluded to avoid runaway loops, but clearly DTrace could instead limit the number of iterations of each loop at the VM level. Another major drawback is DTrace’s relatively weak support for tracing user-space code: it cannot automatically load user-space debug symbols, so you have to declare in your D code the C struct types used by the user-space program.

DTrace’s influence was so great that many engineers ported it to other operating systems. Apple’s Mac OS X, for example, has a DTrace port; in fact, every Mac laptop or desktop released in recent years ships with a ready-to-use dtrace command-line tool that you can try out in a terminal. The FreeBSD operating system also has a DTrace port, though it is not enabled by default; you have to load the FreeBSD DTrace kernel module explicitly. Oracle also started porting DTrace to the Linux kernel in its own Oracle Linux distribution, but the effort never seemed to really take off. After all, the Linux kernel is not controlled by Oracle, and DTrace needs tight integration with the kernel. For similar reasons, the grassroots Linux port of DTrace attempted by some brave engineers has always been far from production quality.

Compared with the native DTrace on Solaris, these ports lack some advanced features to varying degrees, so their capabilities fall short of the original DTrace.

Another effect DTrace had on the Linux world is reflected in the open source project SystemTap, a relatively independent dynamic tracing framework created by Red Hat engineers. SystemTap provides its own little language, which is not identical to D. Obviously, Red Hat serves an enormous number of enterprise users, and its engineers have to deal with a great many “weird problems” online every day; technology like this is necessarily born out of practical need. I consider SystemTap the most powerful, and most usable, dynamic tracing framework in today’s Linux world, and I have used it successfully in my own work for years. Its authors, people like Frank Ch. Eigler and Josh Stone, are extremely enthusiastic and very smart engineers. When I ask questions on IRC or on the mailing list, they usually answer very quickly and in great detail. It is worth mentioning that I have contributed a significant new feature to SystemTap that enables it to access the values of user-space global variables from any probe context. The C++ patch that was merged into the SystemTap mainline was about a thousand lines, and it would not have been possible without the kind help of the SystemTap authors. This feature plays a key role in my SystemTap-based flame graph tools for dynamic scripting languages such as Perl and Lua.

SystemTap’s advantages are its very mature automatic loading of user-space debug symbols and a complete little language that includes loop constructs, so complex probe handlers can be written and very sophisticated analysis becomes possible. Because SystemTap’s early implementation was immature, the Internet is full of outdated criticism of it; it has made great progress in recent years.
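As a small illustration of why loops matter, here is a sketch (using helper functions such as task_current, task_parent, task_pid, and task_execname from SystemTap’s standard task tapset) that walks up the kernel’s task_struct parent pointers whenever some process calls execve, printing the full ancestry of that process. Expressing this kind of traversal is impossible in a language without loops:

    # print the ancestry of any process that calls execve()
    probe syscall.execve
    {
        printf("%s is about to exec %s; ancestry:\n", execname(), argstr)
        task = task_current()
        depth = 0
        while (task_pid(task) > 0 && depth < 16) {   # walk the parent chain
            printf("  %s(%d)\n", task_execname(task), task_pid(task))
            task = task_parent(task)
            depth++
        }
    }

The hard upper bound on the loop (16 levels here) is the script author’s own safeguard; SystemTap additionally enforces a global limit on the number of statements a probe handler may execute.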

Of course, SystemTap has its drawbacks too. First, it is not part of the Linux kernel; not being tightly integrated with the kernel means it has to constantly chase after mainline kernel changes. Second, it usually compiles its “little language” scripts (which somewhat resemble D) dynamically into the C source code of a Linux kernel module, so we often have to deploy the C compiler toolchain and the Linux kernel header files on production machines, dynamically load the compiled kernel modules to run our debugging logic, and dynamically unload the modules after the tool has finished. For these reasons, SystemTap scripts start much more slowly than DTrace ones, roughly on the order of JVM startup times. Despite these shortcomings, SystemTap is on the whole a very mature dynamic tracing framework.

Neither DTrace nor SystemTap really supports writing complete debugging tools on its own, because both lack convenient primitives for command-line interaction. That is why so many real-world tools based on them are wrapped in Perl, Python, or shell scripts. To allow complete debugging tools to be written in a cleaner language, I extended the SystemTap language into a higher-level “macro language” called stap++. My stap++ interpreter, implemented in Perl, can interpret and execute stap++ source code directly, invoking the SystemTap command-line tools internally. Interested readers can check out my open source stapxx repository on GitHub, which also contains many complete debugging tools implemented directly in the stap++ macro language.

Application of SystemTap in production

DTrace’s influence today owes a great deal to Brendan Gregg, the famous DTrace evangelist whose name we mentioned earlier. He started out at Sun Microsystems on the Solaris file system optimization team and was one of the earliest users of DTrace. He has written several books on DTrace and on performance optimization, as well as a great many blog posts about dynamic tracing.

After I left Taobao in 2011, I spent a year in Fuzhou living a so-called “pastoral life”. In the last few months of that idyllic period I systematically studied DTrace and dynamic tracing through Brendan’s public blog. I had actually first heard of DTrace through a comment from a friend on Weibo, who mentioned nothing more than the name. I wanted to know what it was, and once I looked into it, I was astonished: it turned out to be a whole new world that completely changed my view of the entire world of computing. So I spent a great deal of time reading Brendan’s posts one by one, and then one day I had an epiphany and finally felt I could grasp the essence and subtlety of dynamic tracing.

In 2012, I ended my “pastoral life” in Fuzhou and came to the United States to join my current CDN company. I immediately started applying SystemTap, and the whole dynamic tracing approach I had learned, to the CDN’s global network to solve some really weird online problems. I had noticed that many engineers at this company, when troubleshooting online problems, would plant their own instrumentation points all over the software system. They did this mostly in business code, and sometimes even in the code base of system software such as Nginx, making their own modifications to add counters or logging points. A large volume of logs would then be collected online in real time, fed into a dedicated database, and analyzed offline. The cost of this approach is obviously huge: not only does the cost of modifying and maintaining the business system itself jump, but the overhead of collecting and storing all of this instrumentation data in full across the network is also considerable. And all too often, one engineer adds a collection point in the business code today, another adds a similar one tomorrow, and they are all eventually forgotten in the code base with nobody caring about them anymore. In the end these points accumulate and make the code base messier and messier. This kind of invasive modification makes the corresponding software, whether system software or business code, harder and harder to maintain.

There are two main problems with hand-planted instrumentation points: “too much” and “too little”. “Too much” means we tend to collect information we do not actually need, just to be safe, which creates unnecessary collection and storage costs. Many problems can be analyzed with sampling alone, yet we habitually perform full collection across the entire network, which is obviously very expensive. “Too little” means it is very hard to plan all the collection points we will ever need in advance, because nobody is a prophet who can predict the problems that will need to be investigated in the future. So when a new problem appears, the existing collection points almost never provide enough information. This leads to frequent modification of the software system and frequent online operations, which greatly increases the workload of both development and operations engineers and raises the risk of even bigger online failures.

Another brute-force debugging approach often used by some of our operations engineers is to pull a machine out of service, set up a series of temporary firewall rules to block user traffic or their own monitoring traffic, and then fiddle with the production machine at will. This is a tedious and high-impact process. First, the machine is no longer serving traffic, which reduces the overall throughput of the online system. Second, some problems can only be reproduced with real traffic, and they simply will not reproduce in this setup. As you can imagine, these crude practices cause a lot of trouble.

Using SystemTap-based dynamic tracing solves these problems nicely, with a kind of quiet, unobtrusive elegance. First of all, we do not need to modify the software stack itself at all, whether system software or business software. I often write targeted tools that place carefully arranged probes on key “acupuncture points” of the system. The probes collect their respective pieces of information, and the debugging tool aggregates that information and prints it to the terminal. In this way, I can quickly obtain the key information I want from one machine, or a handful of machines, by sampling, quickly answer some very basic questions, and point the way for subsequent debugging.

As I mentioned earlier, rather than manually planting logging points in a production system and then collecting and storing the logs, it is far better to treat the entire production system itself as a directly queryable “database”, from which we can safely and quickly obtain exactly the information we want, leaving no trace behind and never collecting information we do not need. With this in mind, I have written many debugging tools, most of them open-sourced on GitHub. Many target system software such as Nginx, LuaJIT, and the operating system kernel, while others target higher-level web frameworks such as OpenResty. See the nginx-systemtap-toolkit, perl-systemtap-toolkit, and stapxx repositories on GitHub.

Using these tools I have successfully located countless online problems, some of which I stumbled upon by accident. Here are a few examples, chosen more or less at random.

In the first example, I used a SystemTap-based flame graph tool to analyze our online Nginx processes and found that a significant portion of CPU time was being spent on a very strange code path. It turned out to be temporary debugging code left over from an old problem that a colleague had been chasing long ago, rather like the “instrumentation points” we discussed earlier. The problem had long since been solved, but the instrumentation was forgotten and remained online and in the company’s code repository, and because this costly “debugging code” was never removed, it kept incurring a significant performance penalty that went unnoticed. I found it entirely by accident: I simply took a sample, let the tool draw a flame graph automatically, and spotted the problem at a glance, which allowed me to take action. It is a very, very effective way of working.

The second example concerned a very small number of requests with long latency, the so-called “long-tail requests”. Their number was small, but their latency could reach the order of seconds. A colleague was making wild guesses that my OpenResty had a bug; unconvinced, I immediately wrote a SystemTap tool to sample online requests whose total latency exceeded one second and analyze them. The tool directly measured the time distribution inside these problematic requests, including the latency of each typical I/O operation during request processing as well as pure CPU computation time. It quickly turned out that the latency came from OpenResty’s access to a DNS server written in Go. I then had the tool output the details of those long-tail DNS queries and found that they all involved CNAME expansion. Clearly this had nothing to do with OpenResty, and there was now a clear direction for further troubleshooting and optimization.
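My actual tool was more elaborate, but its core idea can be sketched in a few lines of SystemTap. The sketch below assumes an nginx binary at /usr/sbin/nginx built with debug symbols, and (as a simplification) treats the ngx_http_process_request and ngx_http_free_request functions as the start and end of request processing; only requests slower than one second are reported:

    # report nginx requests that take longer than 1 second in total
    global start_time

    probe process("/usr/sbin/nginx").function("ngx_http_process_request")
    {
        start_time[$r] = gettimeofday_us()        # key by the request object pointer
    }

    probe process("/usr/sbin/nginx").function("ngx_http_free_request")
    {
        begin = start_time[$r]
        if (begin) {
            elapsed = gettimeofday_us() - begin
            if (elapsed > 1000000)                # the "long tail" threshold: 1s
                printf("slow request: %d ms (pid %d)\n", elapsed / 1000, pid())
            delete start_time[$r]
        }
    }

The real tool went further and broke the elapsed time down by I/O operation, but the filtering pattern (measure everything cheaply and report only the requests above a threshold) is exactly the “lie in wait” strategy discussed later in the methodology section.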

In the third example, we noticed that one data center showed a significantly higher rate of network timeouts than the others, though only about one percent of the time. At first it was natural to suspect details of the network protocol stack, but with a series of specialized SystemTap tools I analyzed the internals of those timed-out requests directly and traced the problem to a hard disk configuration issue. Going from the network all the way to the hard disk, this kind of debugging is very interesting; first-hand data quickly put us on the right track.

In another example, we noticed that file open and close operations in the Nginx processes were taking a lot of CPU time, so we naturally enabled Nginx’s own file handle cache, but the optimization effect was not obvious. A fresh flame graph then showed that the “spin lock” protecting the metadata of Nginx’s file handle cache was now consuming a lot of CPU time. We had enabled the cache but set its size too large, so the cost of the metadata spin lock ate up the benefit the cache provided. All of this was plainly visible in the flame graphs. If we had not had flame graphs and had just experimented blindly, we would probably have concluded that Nginx’s file handle cache was useless, without ever thinking of tweaking the cache parameters.

In a final example, after an online deployment we noticed in the latest online flame graphs that regular expression compilation was taking a lot of CPU time, even though we had enabled caching of compiled regexes online. It became clear that the number of regular expressions used by our business system had exceeded the cache size we had originally set, so the natural fix was simply to enlarge the online regex cache. After that, regex compilation no longer appeared in the online flame graphs.

As these examples show, different data centers, different machines, and even the same machine at different times all produce their own unique problems. We need to analyze problems directly, by sampling them, rather than by guessing and repeated trial and error. With powerful tools, troubleshooting becomes far less painful.

Flame graphs

We have mentioned flame graphs many times already, so what exactly is a flame graph? It is an amazing visualization method invented by Brendan Gregg, whom we have brought up repeatedly.

A flame graph is like an X-ray image of a software system. It naturally merges information from both the time and space dimensions into a single picture and displays it in a very intuitive form, revealing many quantitative, statistical regularities in the system’s performance.

The classic example is a flame graph of how a piece of software’s CPU time is distributed across all of its code paths. This distribution shows at a glance which code paths consume a lot of CPU time and which consume hardly any. Furthermore, we can generate flame graphs at different software levels, for example one at the C/C++ level of the system software and another at the higher level of a dynamic scripting language such as Lua or Perl code. Flame graphs at different levels often provide different perspectives and reflect code hot spots at their respective levels.

Because I maintain open source communities such as OpenResty, which has its own mailing list, I often encourage users who report problems to provide flame graphs, so that we can discuss the problem while looking at the picture. This helps users locate performance problems quickly, without repeated trial and error and without guessing games, which saves everyone a great deal of time and keeps everyone happy.

It is worth noting that even when we encounter an unfamiliar program we know little about, we can usually tell roughly where its performance problem lies just by looking at the flame graph, without having read a single line of its source code. That is a rather remarkable thing. It works because most programs are reasonably well written, that is, they use layers of abstraction when constructed, for example through functions. The names of those functions usually carry semantic information and are displayed directly on the flame graph. From the function names we can infer roughly what a function, or even a whole code path, is doing, and from that deduce the program’s performance problems. So, once again, naming in program code matters a great deal, not only for reading source code but also for debugging. Conversely, flame graphs offer us a shortcut for learning unfamiliar software systems: the important code paths are almost always the ones that take more time, so they are worth studying first; otherwise, something must be badly wrong with the way the software was built.

Flame graphs can be extended to other dimensions. The flame graph we just discussed looks at how the time a program spends running on the CPU is distributed across all code paths; that is the on-CPU time dimension. The time a process spends not running on any CPU is just as interesting; we call it off-CPU time. Off-CPU time is spent when the process is put to sleep for some reason, for example while waiting on a system-level lock, or when it is forcibly deprived of its time slice by a very busy process scheduler. In all of these situations the process is not running on a CPU, yet a lot of wall-clock time is still being spent. A flame graph in this dimension paints a very different picture: it lets us analyze the overhead of system locks (for example the sem_wait system call), of certain blocking I/O operations (such as open and read), and of CPU contention between processes or threads. All of these are obvious at a glance in an off-CPU flame graph.
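My actual off-CPU tools are considerably more involved, but the basic measurement can be sketched in SystemTap using the scheduler tapset: record a timestamp and a backtrace when the target process is switched off a CPU, and accumulate the elapsed time when it is switched back on. This is only a rough sketch; the target process is selected with stap’s -x option, and in a real tool the backtraces printed at the end would be fed to the flame graph scripts:

    # accumulate off-CPU time per kernel backtrace for one target process
    global off_since, off_bt, off_us

    probe scheduler.cpu_off
    {
        if (pid() == target()) {                  # only the process given via stap -x
            off_since[tid()] = gettimeofday_us()
            off_bt[tid()] = backtrace()           # where the task is blocking
        }
    }

    probe scheduler.cpu_on
    {
        t = off_since[tid()]
        if (t) {
            off_us[off_bt[tid()]] += gettimeofday_us() - t
            delete off_since[tid()]
            delete off_bt[tid()]
        }
    }

    probe end
    {
        foreach (bt in off_us- limit 10) {
            printf("\n%d microseconds off-CPU at:\n", off_us[bt])
            print_stack(bt)
        }
    }

For a single-threaded program such as an Nginx worker, the most expensive blocking sites printed at the end usually point straight at the culprit.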

The off-CPU flame graph was a bold experiment of my own. I remember first reading Brendan’s blog post about off-CPU time by Lake Tahoe, on the border between California and Nevada, and it occurred to me that off-CPU time could perhaps be used in place of on-CPU time for a flame graph presentation. When I got back I tried it out on our company’s production systems, using SystemTap to draw off-CPU flame graphs of the Nginx processes. After I announced the success on Twitter, Brendan contacted me and said he had tried the same thing before and it had not worked out well. I suspect that was because he applied it to multi-threaded programs such as MySQL; thread synchronization puts a lot of noise into the off-CPU graph of a multi-threaded program, which tends to drown out the truly interesting parts. The scenario I applied it to was single-threaded programs such as Nginx, where the off-CPU flame graph tends to immediately reveal the system calls that block the Nginx event loop, lock operations such as sem_wait, or preemptive intervention by the process scheduler. This is invaluable for analyzing a broad class of performance problems. The only “noise” in such an off-CPU flame graph is the epoll_wait style system calls of the Nginx event loop itself, which are easy to recognize and ignore.

Similarly, we can extend flame graphs to other system-metric dimensions, such as the number of bytes leaked from memory. I once used a memory-leak flame graph to quickly locate a subtle leak in the Nginx core. Because the leak occurred inside Nginx’s own memory pool, it could not be caught with traditional tools such as Valgrind or AddressSanitizer. Another time, a memory-leak flame graph easily located a leak in an Nginx C module written by a European developer. The leak was so subtle and slow that it had troubled him for a long time, yet I was able to help him pinpoint it without reading his source code at all. Thinking back on it, it still feels a little magical. And of course we can extend flame graphs to other metrics as well, such as file I/O latency or data volume. It really is an amazing visualization that applies to a whole range of entirely different problem classes.

Methodology

Earlier we introduced sampling-based visualization methods such as flame graphs, which are actually quite general: whatever the system and whatever language it is written in, we can usually obtain a flame graph for some performance dimension and analyze it with little effort. More often, though, we need to analyze deeper and more specific problems, and for that we write a series of specialized dynamic tracing tools that approach the real problem step by step, according to a plan.

The strategy we recommend for this process is to close in on the problem in small, steady steps. That is, we do not expect to write one large, complex debugging tool that captures, all at once, every piece of information we might need to solve the final problem. Instead, we break the hypothesis about the final problem into a series of smaller hypotheses, then explore and verify them step by step, constantly correcting our direction and adjusting our trajectory and our working hypotheses as we close in on the real problem. One advantage of this is that the tool at each step can be simple enough that the tool itself is unlikely to be wrong. Brendan has also observed that when he tries to write a multi-purpose, complex tool, the odds of the tool itself introducing bugs are much higher; and a wrong tool yields wrong information, which can mislead us into wrong conclusions. That is very dangerous. Another benefit of simple tools is that the overhead imposed on the production system during sampling is relatively small, since fewer probes are introduced and each probe handler does less computation. Each such debugging tool also has a clear, specific purpose and can be used on its own, which increases the chance that it will be reused later. So overall, this debugging strategy pays off handsomely.

It is worth noting that we reject the so-called “big data” approach to debugging here. We do not try to collect as much information and data as possible in one go; rather, at each stage and each step we collect only the information the current step actually needs. At every step, guided by the information already collected, we confirm or revise the original plan and direction, and that in turn shapes the next, more fine-grained analysis tool we write.

In addition, for very infrequent online events we often adopt a “lie in wait” approach: we set a threshold or some other filter condition and wait for the interesting events to be caught by our probes. For example, when tracing rare high-latency requests, we first filter, inside the debugging tool, for requests whose latency exceeds a certain threshold, and then collect as many details as possible about just those requests. This strategy is the exact opposite of the traditional approach of collecting as much complete statistical data as possible. Because our sampling and analysis are targeted and deliberate, we can minimize loss and cost and avoid wasting resources unnecessarily.

Knowledge is power

I think dynamic tracing is a good illustration of the old saying, “Knowledge is power.”

With dynamic tracing tools, we can turn our knowledge of a system directly into very practical tools for solving real problems. Abstract concepts from computer science textbooks, such as the virtual file system, the virtual memory system, and the process scheduler, suddenly become vivid and concrete. For the first time we can actually observe how they operate, and their statistical behavior, in a real production system, without modifying the source code of the operating system kernel or the system software at all. Such non-invasive, real-time observation is made possible by dynamic tracing.

This technology is rather like the heavy iron sword wielded by Yang Guo in Jin Yong’s novels: someone who knows no martial arts at all cannot even lift it, but with a bit of skill you can wield it better and better, making constant progress, until even a wooden sword lets you roam the world unchallenged. If you have some knowledge of systems, you can swing this sword and solve basic problems that were previously unimaginable, and the more systems knowledge you accumulate, the better the sword serves you. Interestingly, each new thing you learn immediately lets you solve new problems. In the other direction, because these debugging tools let us solve so many problems, and let us measure and learn so many interesting statistical regularities, micro and macro, inside production systems, the visible results become a powerful motivation to learn even more about the system. It thus naturally becomes a kind of “magic weapon” that engineers pursue in order to keep leveling up.

I remember once saying on Weibo that “a tool that encourages engineers to keep learning in depth is a promising, good tool.” It really is a virtuous, mutually reinforcing cycle.

Open source and debug symbols

As mentioned earlier, dynamic tracing can turn a running software system into a real-time, read-only database that can be queried, but this usually requires the software system to have reasonably complete debug symbols. So what are debug symbols? They are metadata generated by the compiler at build time for the purpose of debugging. This metadata maps many details of the compiled binary, such as the addresses of functions and variables and the memory layout of data structures, back to the names of the abstract entities in the source code, such as function names, variable names, and type names. The debug symbol format commonly seen in the Linux world is called DWARF. Thanks to debug symbols, we have a map and a lighthouse in the cold, dark binary world, and it becomes possible to interpret and restore the semantics of every subtle aspect of that low-level world and to reconstruct its high-level abstractions and relationships.
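A small example shows what this metadata buys us in practice. With kernel debug symbols installed, a SystemTap probe can refer to function parameters by name and walk nested structures by their field names, because the DWARF data tells the translator every type and memory offset involved (vfs_read and the file structure fields below are standard kernel names, though exact signatures vary across kernel versions):

    # With DWARF debug symbols, parameters and struct fields can be used by name;
    # without them, we would be computing raw byte offsets by hand.
    probe kernel.function("vfs_read")
    {
        if (execname() == "nginx")     # limit the noise to one program of interest
            printf("nginx(%d) reading %d bytes from \"%s\"\n",
                   pid(), $count, d_name($file->f_path->dentry))
    }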

Generally only open source software can easily provide debug symbols, because most closed source software withholds them for confidentiality, to make reverse engineering and cracking harder. One example is Intel’s IPP library, which provides implementations of many common algorithms optimized for Intel chips. We once tried to use an IPP-based gzip compression library on our production systems, but unfortunately ran into trouble: the IPP library would crash online from time to time. Obviously, closed source software without debug symbols is very painful to debug, and after several remote sessions with Intel’s own engineers failed to locate or solve the problem, we had to give up. With source code, or at least debug symbols, the debugging process would very likely have been much easier.

Brendan Gregg has also discussed this relationship between open source and dynamic tracing in an earlier blog post. The power of dynamic tracing is maximized precisely when our entire software stack is open source. The stack typically includes the operating system kernel, various system software, and programs in higher-level languages; when all of it is open source, we can easily extract the information we want from every level and turn it into knowledge and into plans of action.

Because more sophisticated dynamic tracing relies on debug symbols, it is a problem that some C compilers generate flawed debug symbols. Such errors can greatly reduce the effectiveness of dynamic tracing or even directly block our analysis. Take the widely used GCC compiler: before version 4.5 the quality of the debug symbols it generated was rather poor, but it has improved considerably since then, especially when compiler optimizations are enabled.

Linux kernel support

As noted above, dynamic tracing is generally based on the operating system kernel, and for the Linux kernel that we all use every day, gaining dynamic tracing support has been a long and arduous journey. One of the main reasons may be that Linus, the leader of Linux, long regarded this technology as unnecessary.

Initially, Red Hat engineers prepared a patch for the Linux kernel called utrace to support user-space dynamic tracing; this is what frameworks such as SystemTap originally relied on. For many years, Linux distributions in the Red Hat family, such as RHEL, CentOS, and Fedora, shipped this utrace patch by default, and in the days when utrace dominated, SystemTap was only really meaningful on Red Hat systems. The utrace patch was never merged into the mainline Linux kernel and was eventually superseded by another, compromise mechanism.

The Linux mainline has long had the kprobes mechanism, which can dynamically insert probes at the entry and exit of specified kernel functions and run user-defined probe handlers there.

User-space dynamic tracing support was much longer in coming, after countless discussions and repeated revisions. Official Linux kernel 3.5 introduced the inode-based uprobes mechanism, which can safely place dynamic probes at user-space function entry points and elsewhere, and execute probe handlers there. Later, starting with the 3.10 kernel, the uretprobes mechanism was merged, which can additionally place dynamic probes on the return addresses of user-space functions. Together, uprobes and uretprobes can finally replace the main functionality of utrace, and the utrace patch has completed its historical mission. On newer kernels, SystemTap automatically uses mechanisms such as uprobes and uretprobes instead of relying on the utrace patch.
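In SystemTap these kernel mechanisms show up as ordinary probe points: kernel.function() probes are implemented with kprobes, while process().function() and its .return variant are implemented with uprobes and uretprobes on sufficiently new kernels. A rough sketch of both follows (the bash/readline example assumes the binary was built with debug symbols, and do_sys_open is a kernel function whose name and signature vary across kernel versions):

    # kprobes: fire at the entry of a kernel function
    probe kernel.function("do_sys_open")
    {
        printf("%s opens %s\n", execname(), user_string($filename))
    }

    # uprobes + uretprobes: catch the return value of a user-space function
    probe process("/bin/bash").function("readline").return
    {
        printf("bash read the command line: %s\n", user_string($return))
    }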

In recent years, Linux mainline developers have extended BPF, the dynamic compiler originally used by netfilter for firewalls, into so-called eBPF, which can serve as a more general-purpose in-kernel virtual machine. Through this mechanism we can, in effect, build a kernel-resident dynamic tracing virtual machine on Linux similar to DTrace, and there have indeed been attempts in this direction recently, such as the BPF Compiler Collection (BCC), which uses the LLVM toolchain to compile C code into bytecode accepted by the eBPF virtual machine. Overall, Linux’s dynamic tracing support keeps getting better. Since the 3.15 kernel in particular, the kernel mechanisms related to dynamic tracing have finally become considerably more robust and stable.

Hardware tracing

Having seen the key role dynamic tracing plays in analyzing software systems, it is natural to wonder whether similar ideas could be used to trace hardware.

We know that the operating system interacts directly with hardware, so by tracing drivers or other parts of the operating system we can indirectly analyze the behavior and problems of the hardware devices attached to it. Modern hardware, such as Intel CPUs, also commonly has built-in performance counters (hardware performance counters), and by reading these special registers from software we can obtain a lot of interesting information about the hardware directly. The perf tool in the Linux world, for example, was originally created for this purpose. Even virtual machine software such as VMware emulates some of these special hardware registers, which has enabled interesting debugging tools such as Mozilla rr, which can efficiently record and replay the execution of a process.

Placing dynamic probes directly inside hardware and tracing it that way probably still belongs to the realm of science fiction for now. Readers who are interested are welcome to contribute ideas and information.

Analyzing the remains of dead processes

Everything we have looked at so far has been the analysis of living processes, that is, running programs. What about dead processes? The most common form of a dead process is one that has crashed abnormally and produced a so-called core dump file. We can actually do a great deal of in-depth analysis on the “remains” of such a dead process as well, and it is often possible to determine its cause of death. In this sense, we programmers play the role of forensic examiners.

The classic tool for analyzing the remains of dead processes is the well-known GNU Debugger (GDB); the LLVM world has a similar tool called LLDB. GDB’s native command language is obviously very limited, and analyzing a core dump manually, command by command, yields only very limited information. Most engineers who analyze a core dump just run bt full to look at the current C call stack trace, info reg to check the values of the CPU registers, or examine the machine code sequence at the crash location. Much more information actually lies deep within the complex binary data structures allocated on the heap. Scanning and analyzing those complex heap structures clearly calls for automation, so we need a programmable way to write complex core dump analysis tools.

To meet this need, newer versions of GDB (starting with 7.0, I believe) have built-in support for Python scripting. We can now implement fairly complex GDB commands in Python and perform deep analysis on things like core dumps. I have written many advanced GDB-based debugging tools in Python myself, and many of them mirror my SystemTap tools for analyzing live processes. As with dynamic tracing, with the help of debug symbols we can find a path of light through the dark “world of the dead”.

One problem with this approach, however, is that developing and porting the tools becomes a heavy burden. Traversing C-style data structures in a scripting language such as Python is no fun; writing too much of this strange Python code genuinely drives people crazy. On top of that, we end up writing the same tool twice, once in the SystemTap scripting language and once in GDB’s Python, which is obviously a large burden, and both implementations need to be carefully developed and tested. They do similar things, but the implementation code and the corresponding APIs are completely different. (It is worth mentioning that the LLDB tool in the LLVM world offers similar Python scripting support, but its Python API is incompatible with GDB’s.)

We could of course also use GDB to analyze live programs, but compared with SystemTap the most obvious problem with GDB is performance. I once compared a fairly complex tool written for SystemTap with its GDB Python equivalent: their performance differed by an order of magnitude. GDB was clearly not designed for this kind of online analysis; it was built for interactive use. It can run in batch mode, but its internal implementation imposes very serious performance limitations, and the thing that drives me craziest is GDB’s internal abuse of longjmp for routine error handling, which causes a significant performance hit, plainly visible in the GDB flame graphs I generated with SystemTap. Fortunately, the analysis of dead processes can always be done offline, so timing is not as critical there; unfortunately, some of our very complex GDB Python tools take several minutes to run, which is frustrating even offline.

I have used SystemTap to profile GDB running Python tools, and based on the flame graphs I located the two biggest execution hot spots inside GDB. I then submitted two C patches upstream to GDB, one for Python string operations and one for GDB’s error handling, which together roughly doubled the overall speed of our most complex GDB Python tools. GDB has already officially merged one of the patches. Using dynamic tracing to analyze and improve a traditional debugging tool is quite interesting in its own right.

I have open-sourced many of the GDB Python debugging tools written in the course of my own work on GitHub; interested readers can take a look. They are mostly aimed at Nginx and LuaJIT and live in repositories such as nginx-gdb-utils. I used these tools to help LuaJIT’s author Mike Pall locate more than a dozen bugs inside LuaJIT, most of which had been hidden for years: subtle problems deep inside the just-in-time (JIT) compiler.

Since a dead process can no longer change over time, let us call this kind of core dump analysis “static tracing”.

Traditional debugging techniques

Speaking of GDB, we should also talk about the differences and connections between dynamic tracing and traditional debugging. A careful, experienced engineer will notice that the “precursor” of dynamic tracing is setting breakpoints in GDB and then performing a series of inspections at those breakpoints. The difference is that dynamic tracing always emphasizes non-interactive batch processing and the lowest possible performance overhead, whereas tools like GDB are built for interactive use, so their implementations pay little attention to production safety or performance overhead, and the overhead is generally extremely high. The very old ptrace system call that GDB relies on is also riddled with pitfalls and problems; for example, ptrace changes the parent of the target debuggee process and does not allow multiple debuggers to attach to the same process at the same time. So, in a sense, GDB only lets you simulate a kind of “poor man’s dynamic tracing”.

Many beginners like to use GDB for “single-stepping”, which is often inefficient in real industrial development. This is because single-stepping changes the timing of program execution, so many timing-related problems simply cannot be reproduced. Moreover, in complex software systems single-stepping easily gets you lost among the twists and turns of the code paths, unable to see the forest for the trees.

So for everyday debugging during development, we still recommend the simplest and seemingly dumbest method: adding print statements on the critical code paths. By examining the logs and other output we get a complete context, which lets us analyze the program’s behavior effectively. This approach works particularly well in combination with test-driven development. Obviously, this kind of logging and instrumentation is impractical for online debugging, as we discussed at length earlier. Meanwhile, traditional performance analysis tools, such as Perl’s DProf, gprof in the C world, and the profilers of other languages and environments, often require the program to be recompiled with special options or rerun in a special way. Performance tools that require such special handling and cooperation are clearly unsuitable for real-time, in vivo analysis of online systems.

A messy debugging world

Today’s debugging world is rather chaotic, as we have seen: DTrace, SystemTap, eBPF/BCC, GDB, LLDB, plus many more that we have not mentioned but that you can find online. Perhaps this chaos reflects the chaos of the real world we live in.

Sometimes I wonder whether we could design and implement a unified debugging language. I have even picked a name for it: the Y language. I would love to implement Y so that its compiler could automatically generate the input code accepted by the various debugging frameworks and technologies: D code for DTrace, stap scripts for SystemTap, Python scripts for GDB, the API-incompatible flavor of Python scripts for LLDB, bytecode for eBPF, or even a mixture of C and Python for BCC.

If a debugging tool we design needs to be ported to several different debugging frameworks, the manual porting effort is obviously very heavy, as I mentioned earlier. But if there were a unified Y language whose compiler could automatically translate the same Y code into input for the different debugging platforms, and optimize it for each, then every debugging tool would need to be written only once, in Y. That would be an enormous relief. And tool authors would no longer need to learn all the messy details of each specific debugging technology or step into every one of its “pits”.

This is a wonderful vision of mine.

Some of you may ask why I call it Y. One reason is that my name is Yichun, and the pinyin spelling of Yichun starts with Y. More importantly, though, it is a language for answering questions that begin with “why”, and “why” sounds like Y.

Once the Y language exists, I plan to share more complete dynamic tracing examples with you.

How to contribute

Having said all this, what I really hope is to attract more engineers to pay attention to, and participate in, the field of dynamic tracing, whether by contributing open source debugging tools based on dynamic tracing, as Brendan and I have done, or by improving the underlying technologies and infrastructure that dynamic tracing depends on at the level of the system core. This includes frameworks such as SystemTap as well as core mechanisms in the operating system kernel, such as kprobes, uprobes, and eBPF. There is also a lot of performance optimization work to be done on the “static tracing” tools based on GDB and LLDB.

I look forward to seeing you in the open source world!

Thanks

Many friends and family members helped with this article. First of all, I would like to thank Rui for the hard work of transcription; this article is based on an hour-long voice-sharing session. I would also like to thank the many friends who reviewed it carefully and gave feedback, especially He Weiping, Yang Shuxin, Anbang, Lin Zi, Dai Guanlan, Chi Jianqiang, Fu Kai, and many other good friends for their valuable comments and suggestions. Finally, I want to thank my father and my wife for their patience while I was writing.