Introduction
The purpose of this article is to share some ideas I’ve developed over the years about how to build a kind of application for which the term “server” is only a weak approximation. More precisely, it concerns a broad class of programs designed to handle very large numbers of discrete messages or requests per second. Network servers most commonly fit this definition, but not every program that does is really a server in any sense. Since “high-performance request handler” is a terrible title, however, “server” will have to do for simplicity’s sake.
Although multitasking within a single program is now commonplace, I’m not writing about such “mildly parallel” applications. The browser you’re using to read this article may perform some operations in parallel, but such low levels of parallelism don’t really present any interesting challenges. The interesting challenges arise when the request-handling infrastructure itself is the limiting factor on overall performance, so that improving the infrastructure actually improves performance. For a browser running on a multi-gigahertz processor with gigabytes of memory, it’s not hard for the infrastructure to avoid being the bottleneck even during six simultaneous downloads over a DSL line. The focus of this article is not an application that sips water through a straw, but an application that drinks from a fire hose, right at the edge of what the hardware can do, where how you do things really matters.
Inevitably, some people will be skeptical of some of my observations and suggestions, or will think they have a better way. Fine. I’m not trying to be the voice of God; these are simply approaches I’ve found useful, not only for their effect on performance but also for their effect on how hard the code is to debug or extend later. Your mileage may vary. If another approach works better for you, great, but be aware that almost every approach I suggest here exists as an alternative to something I tried before, only to find it abhorrent or horrifying. Your favorite idea may well feature prominently in one of those stories, and innocent readers might be bored to death if I told them all here. You don’t want to hurt them, do you?
The rest of this article revolves around what I call the “four horsemen of poor performance”:
- Data copies
- Context switches
- Memory allocation
- Lock contention
There will be a catch-all section at the end, but these are the biggest performance killers. If you can handle most requests without copying data, without a context switch, without going through the memory allocator, and without contending for locks, you’ll have a server that performs well even if it gets some of the smaller things wrong.
Data copies
This could be a short section, for one simple reason: most people have already learned this lesson. Everybody knows that data copies are bad. It’s obvious, right? In fact, it’s so obvious that you probably learned it very early in your computing career, simply because somebody started throwing the term around decades ago. I know that’s the case for me, but I digress. It is now covered in every school curriculum and every informal how-to. Even the marketing folks have figured out that “zero copy” is a good buzzword.
As obviously bad as copies are in hindsight, there still seem to be some nuances that people miss. The most important of these is that data copies are often hidden and disguised. Do you really know whether the code in the drivers or libraries you call makes a data copy? It’s probably more than you think. Guess what “programmed I/O” on a PC actually refers to. An example of a copy that is disguised rather than hidden is a hash function, which has all the memory-access cost of a copy and involves more computation besides. Once it’s pointed out that hashing is effectively “copying plus,” avoiding it seems obvious, but I know of at least one group of brilliant people who had to learn that the hard way. If you really want to get rid of data copies, either because they genuinely hurt performance or because you want to put “zero-copy operation” on your slides at a hacker conference, you’ll need to track down a lot of things that really are data copies but don’t advertise themselves as such.
A very effective way to avoid data copies is to use indirection and pass around buffer descriptors (or chains of buffer descriptors) instead of mere buffer pointers. Each descriptor typically consists of the following:
- A pointer and length for the whole buffer.
- A pointer and length, or offset and length, for the part of the buffer that is actually filled.
- Forward and back pointers to other buffer descriptors in the list.
- A reference count.
Now, instead of copying a piece of data to make sure it stays in memory, the code simply increments the reference count on the appropriate buffer descriptor. This can work extremely well in some cases, including the way a typical network protocol stack operates, but it can also become a real headache. Generally speaking, it’s easy to add a buffer at the beginning or end of a chain, to add a reference to a whole buffer, and to deallocate a whole chain at once. Adding in the middle, deallocating piece by piece, or referring to partial buffers each make life progressively harder. Trying to split or combine buffers will simply drive you insane.
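To make this concrete, here is a minimal sketch in C of what such a buffer descriptor might look like. The names (`buf_desc`, `buf_ref`, `buf_unref`) and the use of C11 atomics for the reference count are my own assumptions for illustration, not anything prescribed by the text.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* A minimal buffer descriptor: points into a buffer instead of copying it. */
struct buf_desc {
    char            *base;      /* pointer to the whole underlying buffer   */
    size_t           cap;       /* length of the whole buffer               */
    size_t           off;       /* offset of the filled portion             */
    size_t           len;       /* length of the filled portion             */
    struct buf_desc *next;      /* forward link in a descriptor chain       */
    struct buf_desc *prev;      /* back link in a descriptor chain          */
    atomic_int       refcnt;    /* how many holders reference this buffer   */
};

/* Take an extra reference instead of copying the data. */
static void buf_ref(struct buf_desc *d)
{
    atomic_fetch_add(&d->refcnt, 1);
}

/* Drop a reference; free the descriptor and buffer when the last one goes. */
static void buf_unref(struct buf_desc *d)
{
    if (atomic_fetch_sub(&d->refcnt, 1) == 1) {
        free(d->base);
        free(d);
    }
}
```

A caller that wants to keep the data around calls buf_ref() instead of copying, and calls buf_unref() when it’s done.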
Despite all that, I don’t actually recommend this approach in every situation. Why not? Because it becomes a huge pain when you have to walk through descriptor chains every time you want to look at a header field. There really are worse things than data copies. The best approach I’ve found is to identify the big objects in a program, such as blocks of data, make sure those get allocated separately as described above so they don’t need to be copied, and not worry too much about everything else.
This brings me to my final point about data copies: don’t go overboard avoiding them. I’ve seen too much code that avoids a data copy by doing something even worse, such as forcing a context switch or breaking up a large I/O request. Data copies are expensive, and they are one of the first things to look at when you’re hunting for redundant work, but there are diminishing returns. Combing through code and doubling its complexity just to get rid of the last few data copies is usually a waste of time that could be better spent elsewhere.
Context switches
Whereas everyone agrees that data copies are obviously bad, I’m often surprised by how many people completely ignore the effect of context switches on performance. In my experience, context switches are actually behind more total “meltdowns” at high load than data copies are; the system starts spending more time switching from one thread to another than it spends doing useful work within any thread. The surprising thing is that, at one level, the cause of excessive context switching is completely obvious: the number one cause is having more active threads than processors. As the ratio of active threads to processors increases, the number of context switches also increases, linearly if you’re lucky, but often exponentially. This very simple fact explains why multi-threaded designs with one thread per connection scale very poorly. The only realistic alternative for a scalable system is to limit the number of active threads so that it is (usually) less than or equal to the number of processors. One popular variant of this approach is to use only one thread, ever. While that does avoid context thrashing and avoids locking as well, it cannot achieve more than one processor’s worth of total throughput, and so it tends to be dismissed unless the program will be non-CPU-bound (usually network-I/O-bound) anyway.
The first thing a thread-frugal program has to do is figure out how to make one thread handle multiple connections at once. This usually means using select/poll, asynchronous I/O, signals, or completion ports on the front end, with an event-driven structure behind that. Many “religious wars” have been fought, and continue to be fought, over which front-end API is best. Dan Kegel’s C10K paper is a good resource in this area. Personally, I find all forms of select/poll and signals ugly, and therefore favor either AIO or completion ports, but it doesn’t actually matter that much. Everything except select() works reasonably well, and none of it really addresses anything beyond the outermost layer of your program, closest to the front end.
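To illustrate what such a front end looks like, here is a minimal sketch of a single thread multiplexing many connections with poll(). The fixed pollfd array and the hypothetical handle_readable() callback are assumptions made only to keep the example self-contained.

```c
#include <poll.h>
#include <stdio.h>

/* Hypothetical per-connection handler supplied elsewhere in the program. */
extern void handle_readable(int fd);

/* One thread, many connections: block until any fd is readable, then
 * dispatch each ready fd to the event-driven code behind the front end. */
void event_loop(struct pollfd *fds, int nfds)
{
    for (;;) {
        int ready = poll(fds, nfds, -1);    /* wait indefinitely */
        if (ready < 0) {
            perror("poll");
            continue;
        }
        for (int i = 0; i < nfds && ready > 0; i++) {
            if (fds[i].revents & POLLIN) {
                handle_readable(fds[i].fd);
                ready--;
            }
        }
    }
}
```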
The simplest conceptual model of a multi-threaded event-driven server has a queue at its center: one or more “listener” threads read requests and put them on the queue, and one or more “worker” threads remove and process them. Conceptually this is a good model, but all too often people actually implement their code this way. Why is that wrong? Because the number two cause of context switches is transferring work from one thread to another. Some people even compound the error by requiring that the response to a request be sent by the original thread, guaranteeing two context switches per request instead of one. The important thing is to use a “symmetric” approach, in which a given thread can go from listener to worker to listener again without ever changing context.
Often, it is impossible to know how many threads will be active even a fraction of a second into the future. After all, requests can come in on any connection at any moment, and “background” threads dedicated to various maintenance tasks could pick that moment to wake up. If you don’t know how many threads are active, how can you limit how many are active? In my experience, one of the most effective and simplest approaches is an old-fashioned counting semaphore that each thread must hold whenever it is doing “real work.” If the thread limit has already been reached, each listen-mode thread might incur one extra context switch as it wakes up and then blocks on the semaphore, but once all listen-mode threads have blocked this way they don’t keep contending for resources until one of the existing threads “retires,” so the system impact is negligible. More importantly, this approach handles maintenance threads, which sleep most of the time and therefore shouldn’t count against the active-thread limit, more gracefully than most alternatives.
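A minimal sketch of that counting-semaphore idea, using POSIX semaphores. Initializing the limit to the number of online processors and the process_request() hook are my own assumptions for the example.

```c
#include <semaphore.h>
#include <unistd.h>

/* One counting semaphore, initialized to the number of processors, limits
 * how many threads can be doing "real work" at the same time. */
static sem_t work_slots;

void init_limit(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    sem_init(&work_slots, 0, (unsigned)(ncpu > 0 ? ncpu : 1));
}

/* Hypothetical request-processing entry point. */
extern void process_request(void *req);

void handle_request(void *req)
{
    sem_wait(&work_slots);      /* block if too many threads are active */
    process_request(req);       /* "real work" happens while holding a slot */
    sem_post(&work_slots);      /* retire: let a blocked thread proceed */
}
```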
Once you’ve divided request processing into two stages (listener and worker) with multiple threads servicing those stages, it is natural to divide the processing into more than two stages. In its simplest form, handling a request then becomes a matter of invoking the stages successively in one direction and then in the other (for the reply). However, things can get more complicated: a stage might represent a “fork” between two distinct processing paths, or it might itself generate a reply (for example, a cached value) without invoking further stages. Each stage therefore needs to be able to specify what should happen to the request next. This is indicated by the return value of the stage’s dispatch function, and there are three possibilities (a minimal sketch follows the list):
- The request needs to be passed on to another stage (a stage ID or pointer in the return value).
- The request has been completed (a special “request done” return value).
- The request is blocked (a special “request blocked” return value). This is the same as the previous case, except that the request is not freed and will be continued later from another thread.
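Here is one way such a dispatch loop might look in C. The enum values and struct names are purely illustrative; the point is that a single thread drives a request through successive stages with no queuing or handoff in between.

```c
/* What a stage's dispatch function tells the framework to do next.
 * Names are illustrative, not taken from any particular framework. */
enum stage_verdict {
    STAGE_NEXT,     /* pass the request to the stage named in 'next'      */
    STAGE_DONE,     /* request is complete                                */
    STAGE_BLOCKED   /* keep the request; another thread will resume it    */
};

struct request;                         /* opaque per-request state       */

struct stage {
    const char *name;
    enum stage_verdict (*dispatch)(struct request *req, struct stage **next);
};

/* Drive one request through the stages on the current thread, with no
 * queueing between stages and no handoff to another thread. */
void run_request(struct stage *start, struct request *req)
{
    struct stage *cur = start;
    while (cur != NULL) {
        struct stage *next = NULL;
        switch (cur->dispatch(req, &next)) {
        case STAGE_NEXT:                  /* keep going on this thread    */
            cur = next;
            break;
        case STAGE_DONE:                  /* finished: stop               */
        case STAGE_BLOCKED:               /* parked: another thread resumes */
            cur = NULL;
            break;
        }
    }
}
```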
Note that in this model, queuing of requests happens within stages, not between stages. This avoids the common silliness of constantly putting a request on a successor stage’s queue and then immediately invoking that successor stage to dequeue it again; I call that a lot of queue activity, and locking, for nothing.
This idea of breaking a complex task into smaller communicating parts probably sounds familiar, because it is actually quite old. My approach has its roots in the concept of “Communicating Sequential Processes” elaborated by C.A.R. Hoare in 1978, based in turn on ideas from Per Brinch Hansen and Matthew Conway going back to 1963, before I was born! However, when Hoare coined the term CSP he meant “process” in the abstract mathematical sense, and a CSP process need not be related to the operating-system entities of the same name. In my opinion, the common approach of implementing CSP via thread-like coroutines within a single OS thread gives the user all the headaches of concurrency with none of the scalability.
Meanwhile, Matt Welsh’s SEDA is an example of the staged-execution idea moving in a more sensible direction. In fact, SEDA is such a good example of “server architecture done right” that some of its specific characteristics are worth commenting on (especially where they differ from what I outlined above).
- SEDA’s “batching” tends to emphasize processing multiple requests through a stage at once, whereas my approach tends to emphasize processing a single request through multiple stages at once.
- In my opinion, SEDA’s one significant flaw is that it allocates a separate thread pool to each stage, with only background reallocation of threads between stages in response to load. As a result, the number one and number two causes of context switches mentioned above are still very much present.
- In the context of an academic research project, implementing SEDA in Java might make sense. In the real world, however, this is not an appropriate choice.
Memory allocation
Allocating and freeing memory is one of the most common operations in many applications. Accordingly, many clever tricks have been developed to make general-purpose memory allocators more efficient. No amount of cleverness, however, can make up for the fact that the very generality of such allocators inevitably makes them far less efficient than more specialized alternatives in many cases. So I have three suggestions for how to avoid the system memory allocator altogether.
Tip 1: Simple preallocation. We all know that static allocation is bad when it limits the functionality of the program, but there are many other forms of preallocation that can be very beneficial. Usually the reason comes down to the fact that one trip through the system memory allocator is better than several, even if some memory gets “wasted” along the way. So if you can assert that no more than N items could ever be in use at once, preallocating them at program startup may be a valid choice. Even when that’s not the case, it may be preferable to preallocate everything a request handler could need right at the start, rather than allocating each piece as it’s needed. Aside from the possibility of allocating several items contiguously in one trip through the system allocator, this often greatly simplifies error-recovery code. If memory is very tight, preallocation may not be an option, but in all but the most extreme cases it generally turns out to be a net win.
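As a small illustration, here is a sketch of preallocating all request contexts in one trip through the system allocator at startup. The request_ctx layout and the MAX_REQUESTS bound are assumed values, not anything from the text.

```c
#include <stdlib.h>

/* Hypothetical per-request state; the real contents don't matter here. */
struct request_ctx {
    char   header[256];
    char  *body;
    size_t body_len;
};

#define MAX_REQUESTS 4096   /* assumed upper bound on concurrent requests */

static struct request_ctx *ctx_pool;

/* One trip through the system allocator at startup instead of one per
 * request; "out of memory" handling happens exactly once, here. */
int preallocate_contexts(void)
{
    ctx_pool = calloc(MAX_REQUESTS, sizeof(*ctx_pool));
    return ctx_pool ? 0 : -1;
}
```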
Tip 2: Use lookaside lists for objects that are allocated and freed frequently. The basic idea is to put recently freed objects onto a list instead of actually freeing them, in the hope that if they are needed again soon they can simply be taken off the list instead of being allocated from system memory. As an extra benefit, transitions to and from a lookaside list can often be implemented so that they skip complex object initialization and finalization.
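In its simplest single-threaded form, a lookaside list is just a free list in front of malloc/free. The sketch below assumes the object’s first field can double as the list link; the names are illustrative only.

```c
#include <stdlib.h>

/* A minimal single-threaded lookaside list for one object type. */
struct obj {
    struct obj *next;   /* link reused for the lookaside list */
    /* ... object payload ... */
};

static struct obj *lookaside;           /* head of the free list */

struct obj *obj_alloc(void)
{
    if (lookaside) {                    /* fast path: reuse a freed object */
        struct obj *o = lookaside;
        lookaside = o->next;
        return o;
    }
    return malloc(sizeof(struct obj));  /* slow path: system allocator */
}

void obj_free(struct obj *o)
{
    o->next = lookaside;                /* don't really free it */
    lookaside = o;
}
```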
You usually don’t want a lookaside list to grow without bound and never free anything even when the program is idle, so some sort of periodic “sweeper” task is usually needed to free inactive objects; but it is also undesirable for the sweeper to introduce undue locking complexity or contention. A good compromise is a system in which the lookaside list is actually two separately locked lists, an “old” list and a “new” list. Allocation is done preferentially from the new list, then from the old list, and from the system only as a last resort; objects are always freed onto the new list. The sweeper thread then operates as follows (a sketch follows the list):
- Lock both lists.
- Save the head of the old list.
- Make the (previously) new list into the old list by assigning list heads.
- Unlock.
- Free everything on the saved old list at leisure.
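Building on the previous sketch, here is roughly what the two-list variant and its sweeper pass might look like with pthread mutexes; the details are my own assumptions rather than a reference implementation.

```c
#include <pthread.h>
#include <stdlib.h>

struct obj { struct obj *next; /* ... payload ... */ };

/* Two separately locked lists: allocate from 'new' then 'old';
 * always free onto 'new'. */
static struct obj      *new_list, *old_list;
static pthread_mutex_t  new_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t  old_lock = PTHREAD_MUTEX_INITIALIZER;

void sweeper_pass(void)
{
    /* Lock both lists. */
    pthread_mutex_lock(&new_lock);
    pthread_mutex_lock(&old_lock);

    /* Save the head of the old list, then make the new list the old list. */
    struct obj *to_free = old_list;
    old_list = new_list;
    new_list = NULL;

    /* Unlock before doing the slow part. */
    pthread_mutex_unlock(&old_lock);
    pthread_mutex_unlock(&new_lock);

    /* Free everything on the saved old list at leisure, without holding
     * any lock that regular threads contend for. */
    while (to_free) {
        struct obj *next = to_free->next;
        free(to_free);
        to_free = next;
    }
}
```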
Objects in such a system are only truly freed when they have not been needed for at least one, but never more than two, full sweeper intervals. Most importantly, the sweeper does most of its work without holding any lock that the regular threads contend for. In theory the same approach could be generalized to more than two levels, but I haven’t found that to be useful.
One concern with lookaside lists is that the list pointers might increase object size. In my experience, most objects I would use a lookaside list for already contain a list pointer anyway, so the point is moot. Even if a pointer were needed only for the lookaside list, though, the savings from avoiding the system memory allocator (and object initialization) would more than make up for the extra memory.
Tip 3: This one actually has to do with locking, which we haven’t discussed yet, but I’ll throw it in anyway. Even with lookaside lists, lock contention is often the single biggest cost of allocating memory. One solution is to maintain multiple private lookaside lists, such that there is absolutely no possibility of contention for any one list. For example, each thread can have its own lookaside list. One list per processor can be even better, for reasons of cache warmth, but only works if threads cannot be preempted. If necessary, the private lookaside lists can even be combined with a shared list to create a system with extremely low allocation overhead.
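A sketch of that combination, with a contention-free per-thread list in front of a locked shared list. The use of the __thread storage class and the (very naive) spill policy are assumptions for illustration only.

```c
#include <pthread.h>
#include <stdlib.h>

struct obj { struct obj *next; /* ... payload ... */ };

/* Per-thread private list: no locking, no contention. */
static __thread struct obj *private_list;

/* Shared fallback list, protected by a lock. */
static struct obj      *shared_list;
static pthread_mutex_t  shared_lock = PTHREAD_MUTEX_INITIALIZER;

struct obj *obj_alloc(void)
{
    struct obj *o = private_list;
    if (o) {                                /* contention-free fast path */
        private_list = o->next;
        return o;
    }
    pthread_mutex_lock(&shared_lock);       /* fall back to the shared list */
    o = shared_list;
    if (o)
        shared_list = o->next;
    pthread_mutex_unlock(&shared_lock);
    return o ? o : malloc(sizeof(*o));      /* system allocator last */
}

void obj_free(struct obj *o)
{
    /* Always free to the private list; a real system would occasionally
     * spill long private lists back onto the shared one. */
    o->next = private_list;
    private_list = o;
}
```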
Lock contention
Efficient locking schemes are notoriously hard to design, because of what I call Scylla and Charybdis, after the monsters from the Odyssey. Scylla is locking that is too simplistic and/or too coarse-grained, serializing activities that can or should proceed in parallel and thus sacrificing performance and scalability. Charybdis is locking that is too complex or too fine-grained, so that the space occupied by locks and the time spent on lock operations again drag performance down. Near Scylla lie the shoals of deadlock and livelock; near Charybdis lie the shoals of race conditions. Between the two there is a narrow channel representing locking that is both efficient and correct... or is there? Since locking tends to be deeply tied to program logic, it is often impossible to design a good locking scheme without fundamentally changing how the program works. That is why people hate locking, and try to rationalize their way toward single-threaded implementations that don’t scale.
Almost every locking scheme starts out as “one big lock around everything” and a vague hope that performance won’t be too bad. When that hope is dashed, and it almost always is, the big lock is broken up into smaller ones, then the praying resumes, and the whole process repeats, presumably until performance is adequate. Typically, however, each iteration increases complexity and locking overhead by 20-50% in return for a 5-10% reduction in lock contention. With luck the net result is still a modest improvement in performance, but actual decreases are not uncommon. The designer is left scratching his head. “I made the locks finer-grained, just like the textbook said,” he thinks, “so why did performance get worse?”
In my opinion, things got worse because the approach above is fundamentally flawed. Imagine the solution space as a mountain range, with the high points representing good solutions and the low points representing bad ones. The problem is that the “one big lock” starting point is almost always separated from the higher peaks by all manner of valleys, saddles, lesser peaks, and dead ends. It is a classic hill-climbing problem: trying to get from such a starting point to a higher peak by taking only small steps and never going downhill almost never works. What is needed is a fundamentally different way of approaching the summit.
The first thing you need to do is form a mental map of your program’s locking. This map has two axes:
- The vertical axis represents code. If you are using a staged architecture with non-branching stages, you probably already have a diagram showing the divisions, like the ones everyone uses for the OSI-model network protocol stack.
- The horizontal axis represents data. At every stage, each request should be assigned to a data set that uses resources separate from those of any other data set.
You now have a grid, where each cell represents a particular data set in a particular processing stage. The most important rule is that two requests should never be in contention unless they are in the same data set and the same processing stage. If you can manage that, you have already won half the battle.
Once the grid is defined, every type of locking your program does can be plotted on it, and the next goal is to make sure the resulting dots are as evenly distributed along both axes as possible. Unfortunately, this part is very application-specific. You have to think like a diamond cutter, using your knowledge of what the program does to find the natural “cleavage lines” between stages and between data sets. Sometimes they are obvious from the start; sometimes they are hard to find but seem obvious in retrospect. Dividing code into stages is a complicated matter of program design, so there’s not much I can offer there, but here are some suggestions on how to define data sets:
- If you have some kind of block number, hash, or transaction ID associated with each request, you can rarely do better than dividing that value by the number of data sets (a small sketch follows this list).
- Sometimes it is better to assign requests to data sets dynamically, based on which data set has the most resources available, rather than on some intrinsic property of the request. Think of it like multiple integer units in a modern CPU; those folks know a thing or two about making discrete requests flow through a system.
- It is often useful to make sure the data-set assignment is different for each stage, so that requests that would contend at one stage are guaranteed not to contend again at another stage.
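To tie those suggestions together, here is a sketch of a lock “grid” indexed by stage and data set, where the data set is chosen by hashing a request ID modulo the number of data sets and mixing in the stage number so the assignment differs per stage. The constants and the hash are arbitrary illustrative choices, not anything from the text.

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_STAGES    4     /* vertical axis: processing stages       */
#define NUM_DATASETS  8     /* horizontal axis: data sets per stage   */

/* One lock per (stage, data set) cell of the grid. */
static pthread_mutex_t grid[NUM_STAGES][NUM_DATASETS];

void grid_init(void)
{
    for (int s = 0; s < NUM_STAGES; s++)
        for (int d = 0; d < NUM_DATASETS; d++)
            pthread_mutex_init(&grid[s][d], NULL);
}

/* Pick the lock for a request at a given stage by hashing its ID modulo
 * the number of data sets; mixing in the stage number makes the mapping
 * different at each stage, so requests that collide at one stage are
 * unlikely to collide again at the next. */
pthread_mutex_t *lock_for(unsigned stage, uint64_t request_id)
{
    uint64_t h = request_id * 2654435761u + stage * 40503u;  /* cheap mix */
    return &grid[stage][h % NUM_DATASETS];
}
```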
If you have divided your locking “field” both vertically and horizontally, and made sure that lock activity is spread evenly across the resulting cells, you can be fairly confident that your locking is in good shape. There is one more step, though. Remember the “small steps” approach I ridiculed a few paragraphs ago? It still has its place, because now you are starting from a good point instead of a terrible one. In metaphorical terms, you have probably climbed well up the slope of one of the range’s highest peaks, but you are probably not at the top yet. Now it is time to collect contention statistics, see what needs improvement, split stages and data sets in different ways, and collect more statistics, until you are satisfied. If you do all that, you are sure to have a fine view from the mountaintop.
Other content
As promised, I’ve covered the four biggest performance problems in server design. There are still other important issues that any particular server will need to address, though. Mostly, it comes down to knowing your platform and environment:
- How does the storage subsystem perform with larger versus smaller requests? With sequential versus random access? How well do read-ahead and write-behind work?
- How efficient is the network protocol you’re using? Can it be tuned with parameters or flags for better performance? Are there facilities such as TCP_CORK, MSG_PUSH, or the Nagle-toggling trick that can be used to avoid sending tiny messages?
- Does the system support scatter/gather I/O (e.g. readv/writev)? Using it can improve performance and also ease the pain of using buffer chains (a small writev sketch follows this list).
- What is the page size? What is the cache-line size? Is it worth aligning things on those boundaries? How expensive are system calls and context switches relative to other operations?
- Are your reader/writer locking primitives subject to starvation? Why? Do they have a “thundering herd” problem? Does your sleep/wakeup mechanism have the nasty (but very common) behavior in which, when X wakes Y, the context switches immediately to Y even though X still has things to do?
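As an example of the scatter/gather point above, here is a sketch of sending a buffer-descriptor chain (a trimmed version of the descriptor sketched earlier) with a single writev() call and no data copies. The names and the fixed array cap are my own assumptions.

```c
#include <sys/types.h>
#include <sys/uio.h>

struct buf_desc {                 /* trimmed version of the earlier sketch */
    char            *base;
    size_t           off, len;
    struct buf_desc *next;
};

/* Send a descriptor chain with one system call and no data copies by
 * translating it into an iovec array for writev().  A real implementation
 * would respect the system's IOV_MAX and loop over longer chains. */
#define MAX_IOV 64

ssize_t send_chain(int fd, struct buf_desc *chain)
{
    struct iovec iov[MAX_IOV];
    int n = 0;

    for (struct buf_desc *d = chain; d != NULL && n < MAX_IOV; d = d->next) {
        iov[n].iov_base = d->base + d->off;   /* only the filled part */
        iov[n].iov_len  = d->len;
        n++;
    }
    return writev(fd, iov, n);
}
```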
I’m sure I could come up with many more questions along these lines, and I’m sure you could too. For any single one of these issues, in any given situation, it might not be worth doing anything about it, but it’s usually worth at least thinking about. If you don’t know the answers, and many of them are not found in the system documentation, find out. Write a test program or micro-benchmark to find the answer empirically; writing such code is a useful skill in itself anyway. If you’re writing code that runs on multiple platforms, many of these questions relate to points where you should probably abstract functionality into per-platform libraries, so you can realize a performance gain on the platforms that support a particular feature.
The “know the answers” approach applies to your own code, too. Figure out what the important high-level operations in your code are, and time them under different conditions. This is not quite the same as traditional profiling; it measures design elements, not implementations. Low-level optimization is generally the last resort of someone who messed up the design.
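A trivial sketch of that kind of design-level measurement: wrap a high-level operation (a hypothetical handle_request() here) with a monotonic clock and compare the numbers under different loads and configurations.

```c
#include <time.h>

/* Time one high-level operation with a monotonic clock so design-level
 * costs can be compared under different conditions. */
extern void handle_request(void *req);

double time_operation(void *req)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    handle_request(req);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```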
High-Performance Server Architecture
Author: Cyningsun
Link: www.cyningsun.com/06-02-2021/…
Copyright notice: unless otherwise stated, all articles on this blog are licensed under CC BY-NC-ND 3.0 CN. Please credit the source when reprinting!