Before moving beyond locks, let's look at how to use them in common data structures. Adding locks to a data structure makes it thread safe, but exactly how the locks are added determines both the correctness and the performance of the structure. Thus, our challenge:

Key question: How to lock data structures?

Given a particular data structure, how should we add locks to it so that it works correctly? Further, how do we add locks so that the structure delivers high performance, allowing many threads to access it at once, i.e., concurrently?

Of course, we can hardly cover every data structure or every approach to adding concurrency; the topic has been studied for years, and thousands of papers have been published on it. Thus, we hope to provide a sufficient introduction to the required way of thinking, along with good pointers for your own further study. We find Moir and Shavit's survey [MS04] a good source.

29.1 Concurrent counters

Counters are one of the simplest data structures, widely used and with simple interfaces. A non-concurrent counter is defined in Figure 29.1.

typedef struct counter_t {
    int value;
} counter_t;

void init(counter_t *c) {
    c->value = 0;
}

void increment(counter_t *c) {
    c->value++;
}

void decrement(counter_t *c) {
    c->value--;
}

int get(counter_t *c) {
    return c->value;
}

Figure 29.1 A counter without locks

Simple but unscalable

As you can see, counters without synchronization are simple and require very little code to implement. Now our next challenge is: How do we make this code thread safe? Figure 29.2 illustrates our approach.

typedef struct counter_t {
    int             value;
    pthread_mutex_t lock;
} counter_t;

void init(counter_t *c) {
    c->value = 0;
    Pthread_mutex_init(&c->lock, NULL);
}

void increment(counter_t *c) {
    Pthread_mutex_lock(&c->lock);
    c->value++;
    Pthread_mutex_unlock(&c->lock);
}

void decrement(counter_t *c) {
    Pthread_mutex_lock(&c->lock);
    c->value--;
    Pthread_mutex_unlock(&c->lock);
}

int get(counter_t *c) {
    Pthread_mutex_lock(&c->lock);
    int rc = c->value;
    Pthread_mutex_unlock(&c->lock);
    return rc;
}

Figure 29.2 A locked counter

This concurrent counter is simple and correct. In fact, it follows a design pattern common to the simplest and most basic concurrent data structures: it adds a single lock, which is acquired whenever a function manipulating the structure is called and released when the call returns. In this way, it resembles a data structure built with monitors [BH73], where locks are acquired and released automatically as object methods are entered and exited.

With a lock in place, the remaining concern is likely performance. If the structure runs too slowly, a single lock may not be enough, and further optimization is needed; such refinements are the topic of the rest of this chapter. Note that if the data structure is not too slow, there is nothing more to do: if a simple solution works, there is no need for an elaborate design.

To understand the performance cost of the simple approach, we run a benchmark in which each thread updates a single shared counter a fixed number of times, varying the number of threads. Figure 29.3 shows the total time taken with one to four threads, each updating the counter one million times. The experiment was run on an iMac with four Intel 2.7 GHz i5 CPUs; the hope is that as we add CPUs, more total work gets done per unit time.

As the top curve in Figure 29.3 (labeled "accurate") shows, the synchronized counter scales poorly. A single thread completes one million updates very quickly (in roughly 0.03 seconds), but two threads, each performing one million updates concurrently, take considerably longer (over 5 seconds!). With more threads, performance gets even worse.

Figure 29.3 Traditional counter vs. sloppy counter

Ideally, threads on multiple processors should complete their work as quickly as a single thread does. Achieving this is called perfect scaling: even though the total amount of work grows, it is done in parallel, so the time to complete the task does not increase.
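To make the setup concrete, here is a minimal sketch of a benchmark driver in the spirit of the experiment described above. It is not the exact harness behind Figure 29.3; the thread count, iteration count, and timing code are our own choices, and the counter_t, init(), increment(), and get() routines from Figure 29.2 are assumed to be compiled into the same program.

// A minimal benchmark sketch (not the exact harness behind Figure 29.3).
// Assumes counter_t, init(), increment(), and get() from Figure 29.2.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

#define ITERATIONS 1000000    // one million updates per thread

static counter_t shared;      // the single shared (locked) counter

static void *worker(void *arg) {
    for (int i = 0; i < ITERATIONS; i++)
        increment(&shared);
    return NULL;
}

int main(int argc, char *argv[]) {
    int nthreads = (argc > 1) ? atoi(argv[1]) : 4;   // try 1, 2, 3, 4 ...
    pthread_t threads[nthreads];
    struct timeval start, end;

    init(&shared);
    gettimeofday(&start, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_usec - start.tv_usec) / 1000000.0;
    printf("threads %d  count %d  time %.2fs\n",
           nthreads, get(&shared), secs);
    return 0;
}

Running this with one thread and then with several (for example, ./counter_bench 1 vs. ./counter_bench 4) reproduces the kind of slowdown described above.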

Scalable counting

Amazingly, researchers have studied how to build scalable counters for years [MS04]. Even more amazing is that recent operating-system performance studies [B+10] show just how much scalable counters matter: without scalable counting, some workloads running on Linux suffer serious scalability problems on multicore machines.

Although many techniques have been developed to attack this problem, we will describe one particular approach here: the sloppy counter, introduced in recent research [B+10].

The sloppy counter represents one logical counter using several physical local counters, one per CPU core, plus a single global counter. Specifically, on a machine with four CPUs, there are four local counters and one global counter. In addition to these counters, there are also locks: one for each local counter and one for the global counter.

The basic idea of sloppy counting is this: when a thread running on a given core wants to increment the counter, it increments its local counter; access to that local counter is synchronized via the corresponding local lock. Because each CPU has its own local counter, threads on different CPUs do not contend, so counter updates scale well.

However, to keep the global counter reasonably up to date (in case a thread wishes to read its value), the local values are periodically transferred to the global counter: the global lock is acquired, the local counter's value is added to the global counter, and the local counter is then reset to zero.

How often this local-to-global transfer occurs is determined by a threshold, called S (for sloppiness). The smaller S is, the more the counter behaves like the non-scalable counter above; the larger S is, the better the counter scales, but the further the global value may lag behind the actual count. One could acquire all of the local locks and the global lock (in a fixed order, to avoid deadlock) to obtain an exact value, but that approach does not scale.

To make this clear, consider an example (see Table 29.1). Here the threshold S is set to 5, and a thread on each of four CPUs updates its local counter L1, ..., L4. The value of the global counter G is also shown over time. At each time step a local counter may be incremented; when a local value reaches the threshold S, it is transferred to the global counter and the local counter is reset to zero.

Table 29.1 Tracing the sloppy counter (threshold S = 5)

The lower line in Figure 29.3 shows the performance of the sloppy counter with a threshold S of 1024. Performance is excellent: the time for four processors to complete four million total updates is hardly longer than the time for one processor to complete one million.

Figure 29.4 shows the importance of the threshold S: four threads on four CPUs each increment the counter one million times. If S is low, performance is poor (but the global count is always quite accurate); if S is high, performance is excellent, but the global count lags behind the true value. The sloppy counter is exactly this compromise between accuracy and performance.

Figure 29.4 Scaling the sloppy counter (varying the threshold S)

Figure 29.5 shows a basic implementation of the sloppy counter. Read it, or better yet, run it and experiment with it, to understand better how it works.

typedef struct counter_t {
    int             global;            // global count
    pthread_mutex_t glock;             // global lock
    int             local[NUMCPUS];    // local count (per cpu)
    pthread_mutex_t llock[NUMCPUS];    // ... and locks
    int             threshold;         // update frequency
} counter_t;

// init: record threshold, init locks, init values
//       of all local counts and global count
void init(counter_t *c, int threshold) {
    c->threshold = threshold;

    c->global = 0;
    pthread_mutex_init(&c->glock, NULL);

    int i;
    for (i = 0; i < NUMCPUS; i++) {
        c->local[i] = 0;
        pthread_mutex_init(&c->llock[i], NULL);
    }
}

// update: usually, just grab local lock and update local amount
//         once local count has risen by 'threshold', grab global
//         lock and transfer local values to it
void update(counter_t *c, int threadID, int amt) {
    pthread_mutex_lock(&c->llock[threadID]);
    c->local[threadID] += amt;                    // assumes amt > 0
    if (c->local[threadID] >= c->threshold) {     // transfer to global
        pthread_mutex_lock(&c->glock);
        c->global += c->local[threadID];
        pthread_mutex_unlock(&c->glock);
        c->local[threadID] = 0;
    }
    pthread_mutex_unlock(&c->llock[threadID]);
}

// get: just return global amount (which may not be perfect)
int get(counter_t *c) {
    pthread_mutex_lock(&c->glock);
    int val = c->global;
    pthread_mutex_unlock(&c->glock);
    return val; // only approximate!
}

Figure 29.5 Implementation of a sloppy counter
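As noted above, an exact value can be obtained by grabbing all of the locks. A minimal sketch of such a routine, built on the counter_t from Figure 29.5, is shown below; the name get_exact() is our own, not part of the original interface.

// Sketch only: return an exact count by acquiring every local lock and
// then the global lock, always in the same order to avoid deadlock.
// Correct, but clearly not scalable.
int get_exact(counter_t *c) {
    int i, sum;

    for (i = 0; i < NUMCPUS; i++)
        pthread_mutex_lock(&c->llock[i]);
    pthread_mutex_lock(&c->glock);

    sum = c->global;
    for (i = 0; i < NUMCPUS; i++)
        sum += c->local[i];        // include values not yet transferred

    pthread_mutex_unlock(&c->glock);
    for (i = 0; i < NUMCPUS; i++)
        pthread_mutex_unlock(&c->llock[i]);
    return sum;
}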

29.2 Concurrent linked lists

Let's move on to a more complicated data structure, the linked list. Once again, we start with a basic implementation. For simplicity, we focus on insertion and lookup, leaving other operations such as delete to the reader. Figure 29.6 shows the code for this basic data structure.

// basic node structure
typedef struct node_t {
    int              key;
    struct node_t   *next;
} node_t;

// basic list structure (one used per list)
typedef struct list_t {
    node_t          *head;
    pthread_mutex_t  lock;
} list_t;

void List_Init(list_t *L) {
    L->head = NULL;
    pthread_mutex_init(&L->lock, NULL);
}

int List_Insert(list_t *L, int key) {
    pthread_mutex_lock(&L->lock);
    node_t *new = malloc(sizeof(node_t));
    if (new == NULL) {
        perror("malloc");
        pthread_mutex_unlock(&L->lock);
        return -1; // fail
    }
    new->key = key;
    new->next = L->head;
    L->head = new;
    pthread_mutex_unlock(&L->lock);
    return 0; // success
}

int List_Lookup(list_t *L, int key) {
    pthread_mutex_lock(&L->lock);
    node_t *curr = L->head;
    while (curr) {
        if (curr->key == key) {
            pthread_mutex_unlock(&L->lock);
            return 0; // success
        }
        curr = curr->next;
    }
    pthread_mutex_unlock(&L->lock);
    return -1; // failure
}

Figure 29.6 Concurrent linked list

As you can see from the code, the insert routine acquires the lock on entry and releases it on exit. One tricky detail arises if malloc() happens to fail (a rare case): in that situation, the code must also release the lock before failing the insert.

This kind of exceptional control flow has been shown to be quite error-prone. A recent study of Linux kernel patches found that nearly 40% of bugs lie on such rarely-taken code paths (indeed, this observation inspired some of our own research, in which we removed all memory-allocation failure paths from a Linux file system, yielding a more robust system [S+11]).

Thus, the challenge: can we rewrite the insert and lookup routines so that they remain correct under concurrent insertion, while avoiding the need to release the lock on the failure path?

In this case, the answer is yes. Specifically, we can rearrange the code so that the lock is acquired and released only around the true critical section of the insert code. This works because part of the insert does not need the lock at all: assuming malloc() is thread-safe, every thread can call it without fear of race conditions or other concurrency bugs. Only when the shared list is actually updated does the lock need to be held. Figure 29.7 shows the details of these modifications.

For the lookup routine, a simple adjustment makes the code break out of the main search loop and funnel through a single return path. Doing so reduces the number of points at which the lock must be acquired and released, and thus lowers the chance of accidentally introducing bugs, such as forgetting to release the lock before returning.

void List_Init(list_t *L) {
    L->head = NULL;
    pthread_mutex_init(&L->lock, NULL);
}

void List_Insert(list_t *L, int key) {
    // synchronization not needed
    node_t *new = malloc(sizeof(node_t));
    if (new == NULL) {
        perror("malloc");
        return;
    }
    new->key = key;

    // just lock critical section
    pthread_mutex_lock(&L->lock);
    new->next = L->head;
    L->head = new;
    pthread_mutex_unlock(&L->lock);
}

int List_Lookup(list_t *L, int key) {
    int rv = -1;
    pthread_mutex_lock(&L->lock);
    node_t *curr = L->head;
    while (curr) {
        if (curr->key == key) {
            rv = 0;
            break;
        }
        curr = curr->next;
    }
    pthread_mutex_unlock(&L->lock);
    return rv; // now both success and failure
}

Figure 29.7 Rewriting the concurrent linked list
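Delete was left as an exercise above; a minimal sketch in the same style (a single lock around the traversal, a single return path) might look like the following. The name List_Delete() and the exact semantics (remove the first matching node) are our own choices, not code from the chapter.

// Sketch only: delete the first node whose key matches.
// Returns 0 on success, -1 if the key was not found.
int List_Delete(list_t *L, int key) {
    int rv = -1;
    pthread_mutex_lock(&L->lock);
    node_t *prev = NULL;
    node_t *curr = L->head;
    while (curr) {
        if (curr->key == key) {
            if (prev)
                prev->next = curr->next;   // unlink an interior/tail node
            else
                L->head = curr->next;      // unlink the head node
            free(curr);
            rv = 0;
            break;
        }
        prev = curr;
        curr = curr->next;
    }
    pthread_mutex_unlock(&L->lock);
    return rv; // single return path, as in List_Lookup
}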

Scaling linked lists

Even with a basic concurrent linked list in hand, we again face the problem that the structure does not scale well. One technique researchers have explored to increase the concurrency of linked lists is hand-over-hand locking, also known as lock coupling [MS04].

The idea is simple: instead of a single lock for the entire list, each node has its own lock. When traversing the list, the code first acquires the next node's lock and only then releases the current node's lock.

Conceptually, a hand-over-hand list is appealing, because it enables a high degree of concurrency in list operations. In practice, however, the overhead of acquiring and releasing a lock at every node during a traversal is substantial, and the approach is hard to make faster than the single-lock method. Even with many threads and a large list, this concurrent scheme is not necessarily faster than the single-lock scheme. Perhaps some hybrid (grabbing a new lock every so many nodes) would be worth investigating.
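For concreteness, here is a minimal sketch of what a hand-over-hand lookup might look like. The types hh_node_t and hh_list_t and the routine HH_Lookup() are purely illustrative; the chapter itself does not provide this code.

// Sketch only: hand-over-hand (lock coupling) lookup.
// Each node carries its own lock; while traversing, acquire the next
// node's lock before releasing the current one, so the next pointer
// is never followed unprotected.
typedef struct hh_node_t {
    int                key;
    struct hh_node_t  *next;
    pthread_mutex_t    lock;      // per-node lock
} hh_node_t;

typedef struct hh_list_t {
    hh_node_t         *head;
    pthread_mutex_t    headLock;  // protects the head pointer
} hh_list_t;

int HH_Lookup(hh_list_t *L, int key) {
    pthread_mutex_lock(&L->headLock);
    hh_node_t *curr = L->head;
    if (curr == NULL) {
        pthread_mutex_unlock(&L->headLock);
        return -1; // empty list
    }
    pthread_mutex_lock(&curr->lock);       // lock the first node ...
    pthread_mutex_unlock(&L->headLock);    // ... then drop the head lock

    while (curr) {
        if (curr->key == key) {
            pthread_mutex_unlock(&curr->lock);
            return 0; // found
        }
        hh_node_t *next = curr->next;
        if (next)
            pthread_mutex_lock(&next->lock);   // grab next before ...
        pthread_mutex_unlock(&curr->lock);     // ... releasing current
        curr = next;
    }
    return -1; // not found
}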

Tip: More concurrency is not necessarily faster

If a design enables more concurrency but incurs a lot of overhead (for example, by frequently acquiring and releasing locks), the extra concurrency may buy you nothing. Simple schemes often work well, precisely because they use expensive routines rarely; adding more locks and complexity can backfire. That said, there is one sure way to find out: build both alternatives (the simple, less-concurrent one and the complex, more-concurrent one) and measure them. In the end, you cannot cheat on performance: your idea is either faster, or it isn't.


Tip: Beware of locks and control flow

A general design tip, useful in concurrent code as well as elsewhere: be wary of control-flow changes that cause a function to return or exit early, or other error paths that halt its execution. Because many routines begin by acquiring a lock, allocating memory, or performing some other stateful operation, an error in the middle means the code must undo all of that state before returning, which is error-prone. It is therefore best to structure code so as to minimize this pattern.

29.3 Concurrent Queue

As you now know, there is always a standard way to create a concurrent data structure: add a large lock. For a queue, we’ll skip this method, assuming you can figure it out.

Let’s look at a more concurrent queue designed by Michael and Scott [MS98]. Figure 29.8 shows the data structure and code for this queue.

typedef struct node_t {
    int              value;
    struct node_t   *next;
} node_t;

typedef struct queue_t {
    node_t          *head;
    node_t          *tail;
    pthread_mutex_t  headLock;
    pthread_mutex_t  tailLock;
} queue_t;

void Queue_Init(queue_t *q) {
    node_t *tmp = malloc(sizeof(node_t));
    tmp->next = NULL;
    q->head = q->tail = tmp;
    pthread_mutex_init(&q->headLock, NULL);
    pthread_mutex_init(&q->tailLock, NULL);
}

void Queue_Enqueue(queue_t *q, int value) {
    node_t *tmp = malloc(sizeof(node_t));
    assert(tmp != NULL);
    tmp->value = value;
    tmp->next = NULL;

    pthread_mutex_lock(&q->tailLock);
    q->tail->next = tmp;
    q->tail = tmp;
    pthread_mutex_unlock(&q->tailLock);
}

int Queue_Dequeue(queue_t *q, int *value) {
    pthread_mutex_lock(&q->headLock);
    node_t *tmp = q->head;
    node_t *newHead = tmp->next;
    if (newHead == NULL) {
        pthread_mutex_unlock(&q->headLock);
        return -1; // queue was empty
    }
    *value = newHead->value;
    q->head = newHead;
    pthread_mutex_unlock(&q->headLock);
    free(tmp);
    return 0;
}

Figure 29.8 Michael and Scott's concurrent queue

Studying this code closely, you will see two locks: one for the head of the queue and one for the tail. These two locks allow enqueue and dequeue operations to proceed concurrently, because enqueue touches only the tail lock and dequeue touches only the head lock.

One trick Michael and Scott use is to add a dummy node (allocated in the queue initialization code); the dummy node enables the separation of head and tail operations. Study the code, or better, type it in, run it, and test it to understand it more deeply.
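As a small illustrative usage example (the values are arbitrary, and the queue code from Figure 29.8 is assumed to be in scope), the following shows the effect of the dummy node: a freshly initialized queue reports empty, and values come back in FIFO order.

// Sketch only: exercising the queue of Figure 29.8 from a single thread.
#include <stdio.h>

int main(void) {
    queue_t q;
    int value;

    Queue_Init(&q);   // head and tail both point at the dummy node

    if (Queue_Dequeue(&q, &value) == -1)
        printf("queue empty, as expected\n");

    Queue_Enqueue(&q, 10);
    Queue_Enqueue(&q, 20);

    while (Queue_Dequeue(&q, &value) == 0)
        printf("dequeued %d\n", value);   // prints 10, then 20 (FIFO)

    return 0;
}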

Queues are widely used in multithreaded programs. However, the queue shown here (which merely adds locks) usually does not fully meet the needs of such programs; a more complete bounded queue also allows threads to wait when the queue is empty or full. That is the subject of our intense study of condition variables in the next chapter, so watch for it!

29.4 Concurrent hash tables

Let's close with a final widely used concurrent data structure, the hash table (see Figure 29.9). We focus on a simple hash table that does not resize; supporting resizing takes a bit more work and is left as an exercise for the reader.

#define BUCKETS (101)

typedef struct hash_t {
    list_t lists[BUCKETS];
} hash_t;

void Hash_Init(hash_t *H) {
    int i;
    for (i = 0; i < BUCKETS; i++) {
        List_Init(&H->lists[i]);
    }
}

int Hash_Insert(hash_t *H, int key) {
    int bucket = key % BUCKETS;
    return List_Insert(&H->lists[bucket], key);
}

int Hash_Lookup(hash_t *H, int key) {
    int bucket = key % BUCKETS;
    return List_Lookup(&H->lists[bucket], key);
}

Figure 29.9 Concurrent hash table

This hash table is built from the concurrent linked lists we developed earlier, and it performs remarkably well. Instead of a single lock for the whole table, it has one lock per hash bucket (each bucket being a linked list), which lets many operations proceed concurrently.
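As a quick illustrative sketch (the thread count and key scheme are arbitrary, and the hash table of Figure 29.9 together with its underlying list code is assumed to be in scope), several threads can insert at once; inserts that land in different buckets never contend for the same lock.

// Sketch only: concurrent inserts into the hash table of Figure 29.9.
#include <stdio.h>
#include <pthread.h>

#define NTHREADS        4
#define KEYS_PER_THREAD 10000

static hash_t H;   // the table from Figure 29.9

static void *inserter(void *arg) {
    int id = *(int *)arg;
    for (int i = 0; i < KEYS_PER_THREAD; i++)
        Hash_Insert(&H, id * KEYS_PER_THREAD + i);  // keys spread across buckets
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    int ids[NTHREADS];

    Hash_Init(&H);
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, inserter, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    printf("key 42 present? %s\n",
           Hash_Lookup(&H, 42) == 0 ? "yes" : "no");
    return 0;
}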

Figure 29.10 Scaling hash tables

Figure 29.10 shows the performance of the hash table under concurrent updates (again on the four-CPU iMac, with four threads each performing 10,000 to 50,000 concurrent updates). For comparison, the performance of a single linked list (with one big lock) is shown as well. As you can see, this simple concurrent hash table scales extremely well, whereas the linked list does not.

Tip: Avoid premature optimization (Knuth's Law)

When building a concurrent data structure, start with the most basic approach: add a single big lock for synchronization. By doing so, you are likely to build a correct solution. If you then discover a performance problem, refine the approach, optimizing only as much as your needs require. As Knuth famously stated, "Premature optimization is the root of all evil."

Many operating systems, including Sun's and Linux, used a single big lock in their initial transition to multiprocessors; in Linux, this lock even had a name, the big kernel lock (BKL). That approach worked well for many years, until multi-CPU systems became common and the fact that only one thread could be active inside the kernel at a time became a performance bottleneck. The time finally came to optimize these kernels for concurrency. Linux took the straightforward path of replacing the one lock with many. Sun went further, building a brand-new system, Solaris, that was fundamentally concurrent from the start. Read the Linux and Solaris kernel books for more information [BC05, MM00].

29.5 Summary

We have covered a number of concurrent data structures, from counters to linked lists and queues, and finally to the heavily used hash table. Along the way, we learned several lessons: be careful about lock acquisition and release when control flow changes; more concurrency does not necessarily mean better performance; and performance problems should only be remedied once they actually exist. This last point, avoiding premature optimization, is useful to any performance-minded developer: there is no value in making a small part of a program faster if doing so does not improve overall performance.

Of course, we have only scratched the surface of high-performance data structures; see Moir and Shavit's survey for more information, including pointers to other sources [MS04]. In particular, you might be interested in other structures (such as B-trees), for which a database course is a good next step, or in techniques that avoid traditional locks entirely. Such non-blocking data structures do have their place, and we will cover them briefly in the chapter on common concurrency problems. Frankly, though, this is a vast area of knowledge, far more than this book can cover; interested readers are encouraged to explore it on their own.

This article is excerpted from Introduction to Operating Systems.
