As part of my day job, I’m always striving towards higher performance code while at the same time trying to retain a readable and flexible design (since customers are always changing what they want).
One of the big killers of performance is locks: they’re the most basic way to ensure mutual exclusion of shared data between threads. Normally this is all well and good, and there are some situations in which locks are the only acceptable solution, but the big problem with them is that they block progress (which is part of their job, of course).
Say you have two threads, a higher priority one and a lower priority one. You don’t really care how long it takes the lower priority thread to do its job, but you want the higher priority thread to run really quickly when it has something to do (maybe it’s controlling some external hardware, maybe it’s transmitting some important data over the network, maybe it’s playing video or audio — there are lots of times when performance is critical to correct functionality). When the higher priority thread has something to do, it will preempt the lower priority thread, do its job, and then let the lower priority thread get back to chugging along with its slower task.
But what about when these two need to share some data? The easiest way to do this is to wrap the data structure in locks, so only one thread can access/update the structure at a time. This ensures correctness, but not performance. The problem case is when the lower priority thread has already acquired the lock and is in the middle of reading or updating the data when it gets preempted by the higher priority thread, which tries to acquire that same lock. And now the higher priority thread is forced to wait for the lower priority thread to release the lock before it can continue. This is known as a priority inversion.
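To make that lock-based approach concrete, here’s a minimal C++ sketch of the pattern (the class and method names are just illustrative): every access to the shared data goes through a single mutex, so if the low priority thread is preempted while inside `add()`, a high priority thread calling `count()` will block on that same mutex.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Illustrative lock-wrapped shared structure: correct, but any caller
// can be blocked by whichever thread currently holds the mutex.
class SharedSamples {
public:
    void add(int sample) {
        std::lock_guard<std::mutex> guard(mutex_);  // blocks if another thread holds the lock
        samples_.push_back(sample);
    }

    std::size_t count() const {
        std::lock_guard<std::mutex> guard(mutex_);  // same mutex guards reads
        return samples_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> samples_;
};
```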
Now, if these are the only two threads in the system, and if the low priority thread minimises the amount of time that it holds the shared lock (and minimising lock-held blocks should always be a goal of good software design), then this isn’t a big deal. You lose a bit of performance due to a couple of extra context switches (when the high priority thread is forced to wait, and when the low priority thread releases the lock and lets the high priority thread run again); but even this would be minimised on a dual-core system.
But of course, unless you’re using a dedicated OS and a simple program, these are never the only two threads in the system. And the real problem is when there’s some other thread with work to do in between the priority levels of our two threads above. The high priority thread is blocked waiting for the lock, but the low priority thread won’t run because the medium priority thread is still busy doing things. Thus the high priority thread has to wait for the medium priority thread to finish working before the lower priority thread gets a chance to release the lock. And again, multicore CPUs help with this (they increase the probability that the medium and low priority threads could still both run) but this just moves the problem and makes it rarer without eliminating it.
Ultimately even this isn’t really a problem for most software, as the delays we’re talking about here will typically be in the 10-50ms range (depending on what threads are running and what kind of work they’re doing), and most software won’t really care about latency at that level. But there are some applications where it is important, and some where it’s not critical but where decoupling some of the internals might be structurally beneficial.
One solution to these sorts of problems is to use lock-free data structures — code that has been carefully written (typically taking advantage of atomic processor instructions) to maintain correctness even without any locks. Somewhat paradoxically, many of these structures are actually slower than their lock-based cousins (at least when considering raw run speed in the non-contended case); but their advantage becomes apparent in the contended case, when both threads want to access the structure at the same time. Different structures make different guarantees, but typically one of those is “non-blocking” (or the stronger “wait-free”) — meaning that no matter which thread happens to be running at the time, and no matter what state the previous thread left the structure in before being preempted, the current thread can continue without waiting. (Note that this doesn’t necessarily mean it will be able to successfully update the structure; some structures can only guarantee non-blocking behaviour if they’re allowed to return a “busy, try again later” response.)
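For a taste of what those atomic instructions look like in practice, here’s a small C++ sketch (the function names are mine) of the two most common building blocks: an atomic fetch-and-add, and a compare-and-swap retry loop. Note that no thread ever blocks; a thread that loses a race simply observes the new value and retries.

```cpp
#include <atomic>

std::atomic<int> counter{0};

// A single atomic read-modify-write instruction; safe from any thread.
void increment() {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// Store `value` into `target` only if it's larger, retrying if another
// thread races us. compare_exchange_weak reloads `current` with the
// fresh value whenever it fails, so the loop always sees up-to-date data.
void store_max(std::atomic<int>& target, int value) {
    int current = target.load(std::memory_order_relaxed);
    while (current < value &&
           !target.compare_exchange_weak(current, value)) {
        // lost a race; `current` now holds the other thread's value, retry
    }
}
```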
The most common lock-free data structures are queues. There are many different varieties of lock-free queues; some examples of the variations include:
- Bounded or Unbounded
This determines whether the queue can only hold a limited number of items (bounded) or as many as needed (unbounded). Attempts to add a new item to a bounded queue when it is full will fail, so for those you need to pick the right size so that whichever thread is dequeuing can keep up.
Of course, every queue is ultimately bounded by available memory, so it’s always important to ensure that your consuming (dequeuing) thread doesn’t let too many items build up.
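The practical consequence of boundedness is the interface: enqueue operations can fail, so the producer has to decide what to do when they do. Here’s a sketch of that interface; the internals use a plain `std::deque` purely for brevity (this is not a lock-free implementation, just an illustration of the failure semantics).

```cpp
#include <cstddef>
#include <deque>
#include <optional>

// Illustrative bounded-queue interface: try_push fails rather than
// blocks when the queue is full. A real lock-free bounded queue would
// use a preallocated ring buffer and atomics internally.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    bool try_push(const T& item) {   // returns false when full
        if (items_.size() >= capacity_) return false;
        items_.push_back(item);
        return true;
    }

    std::optional<T> try_pop() {     // empty optional when nothing queued
        if (items_.empty()) return std::nullopt;
        T item = items_.front();
        items_.pop_front();
        return item;
    }

private:
    std::size_t capacity_;
    std::deque<T> items_;
};
```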
- Array or Linked-List
There are some other internal implementation possibilities, but these are the most common two. An array-based implementation allocates a single block of memory on creation (typically used as a circular buffer), while a linked-list implementation allocates disconnected blocks of memory linked via pointers.
Most commonly array-based implementations are bounded and linked list implementations are unbounded — but these are independent properties and it’s possible to mix-n-match.
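As an example of the common combination (array-based and bounded), here’s a minimal sketch of the classic SPSC circular-buffer queue: one fixed array allocated up front, with head and tail indices advanced atomically. This is only safe for a single producer thread and a single consumer thread, and it omits the tuning a production version would have (such as padding the indices onto separate cache lines to avoid false sharing).

```cpp
#include <atomic>
#include <cstddef>

// Minimal array-based bounded SPSC queue: no allocation after
// construction, and neither thread ever blocks the other.
template <typename T, std::size_t Capacity>
class SpscRingQueue {
public:
    bool try_push(const T& item) {   // producer thread only
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        std::size_t next = (tail + 1) % (Capacity + 1);
        if (next == head_.load(std::memory_order_acquire))
            return false;            // full
        buffer_[tail] = item;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    bool try_pop(T& out) {           // consumer thread only
        std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
            return false;            // empty
        out = buffer_[head];
        head_.store((head + 1) % (Capacity + 1), std::memory_order_release);
        return true;
    }

private:
    T buffer_[Capacity + 1];             // one spare slot distinguishes full from empty
    std::atomic<std::size_t> head_{0};   // next slot to read (owned by consumer)
    std::atomic<std::size_t> tail_{0};   // next slot to write (owned by producer)
};
```

The release/acquire pairing is what makes this safe: the consumer only sees the new tail index after the item write it guards is visible, and vice versa for the head.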
- Node caches
Unbounded queues in particular need some way to allocate additional memory as items are added to the queue. The trouble is that memory allocation is inherently bad for performance — most standard multi-threaded memory allocators use locks (which kinda defeats the point of using a lock-free queue), and there’s always the chance it will have to request a new block of memory from the kernel, which may trigger a page fault and disk swapping. Meanwhile deallocating memory will (either every time or occasionally) trigger a “garbage collection” of sorts where the allocator will try to merge adjacent free memory into larger blocks or even possibly release them back to the kernel.
Obviously, then, if you’re looking to squeeze the best performance out of your queue you’ll want it to minimise or avoid allocating or deallocating memory as you add or remove items from the queue. Some queues do this inherently (particularly bounded queues, which will typically do all their allocation only at construction time), and some others may provide a sort of caching mechanism to do something similar — eg. to allocate a bunch of “spare” nodes up front (possibly allocating additional nodes on-demand if unbounded), but then re-using these spares after they’ve been released by a dequeue operation.
This might be less of a concern if you’re doing memory allocation/deallocation with the items added or removed from the queue anyway (since you’re already going to have a potential performance hole there) — but on the other hand, it’s still worthwhile to reduce the number of potential holes by one.
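The shape of such a node cache is roughly this (shown single-threaded for clarity; a real lock-free queue would keep the free list as an atomic stack, and the names here are mine): dequeued nodes go onto a free list and are handed back out to later enqueues, so the heap is only touched when the cache runs dry.

```cpp
#include <cstddef>

// Illustrative node cache: reuses released nodes instead of
// round-tripping every enqueue/dequeue through the allocator.
struct Node {
    int value = 0;
    Node* next = nullptr;
};

class NodeCache {
public:
    explicit NodeCache(std::size_t prealloc) {
        for (std::size_t i = 0; i < prealloc; ++i)
            release(new Node);       // allocate the spares up front
    }
    ~NodeCache() {
        while (Node* n = take_cached()) delete n;
    }

    Node* acquire() {                // prefer a cached spare
        if (Node* n = take_cached()) return n;
        return new Node;             // unbounded: fall back to the heap
    }

    void release(Node* n) {          // return a node to the cache
        n->next = free_list_;
        free_list_ = n;
    }

private:
    Node* take_cached() {
        Node* n = free_list_;
        if (n) free_list_ = n->next;
        return n;
    }
    Node* free_list_ = nullptr;
};
```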
- Thread safety
It should go without saying that the goal of any lock-free queue is to safely separate enqueue and dequeue operations, such that one thread can enqueue and another can dequeue without blocking each other. But sometimes there are special cases (eg. a queue might temporarily block both if it’s a bounded queue that’s pretending to be unbounded, and needs to reallocate memory), so you need to carefully read the notes for a particular implementation.
But what if you want more than one thread to enqueue items? Or to dequeue items? Not all implementations will support this in a lock-free manner. Again you’ll need to check the notes; queues are typically characterised as “single producer” if only one thread can enqueue at a time and “single consumer” if only one can dequeue at a time. These are commonly abbreviated, so for example an “MPSC” queue means “multiple producer, single consumer”: any number of threads can simultaneously enqueue items without locks, but only one thread at a time can dequeue them.
Even the “single” queues can usually be extended to multiple threads, but only via locks or other mechanisms that guarantee only a single thread enqueues and/or dequeues at a time — and while at first this might seem to defeat the whole point of using a lock-free queue, it can sometimes actually be a good idea. Typically an SPSC queue is faster than an MPSC queue, which in turn is faster than an SPMC or MPMC queue. If the only threads enqueuing items are lower priority ones and you don’t mind them blocking each other, you can use an SPSC queue and lock the producer side, without harming performance on the consumer side. Do read up on the specific implementation though; in rare cases some queues may actually have thread affinity, and only be safe to call from specific threads, not just non-concurrently.
If you have a small and predetermined number of producer threads, then another possibility that avoids the lock is to use one SPSC queue per producer thread, and have the consumer thread pull items from all of these. However this increases the coupling (as the consumer now needs to know who all the producers are, or at least how to get at all of their queues) and could introduce bias as to which items get processed first. So it’s important to profile each before deciding.
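The consumer side of that per-producer arrangement might look something like the sketch below. A `std::deque` stands in for a real SPSC queue just to show the shape of the loop, and `poll_round_robin` is my own name: the consumer starts each scan where the last one left off, so no producer’s queue is permanently favoured (one way to reduce, though not eliminate, the ordering bias mentioned above).

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

// Stand-in for one producer's SPSC queue (not actually lock-free).
using ItemQueue = std::deque<int>;

// Pull the next available item across all producer queues, rotating the
// starting point so each queue gets a fair chance to be drained first.
std::optional<int> poll_round_robin(std::vector<ItemQueue>& queues,
                                    std::size_t& next) {
    for (std::size_t i = 0; i < queues.size(); ++i) {
        std::size_t idx = (next + i) % queues.size();
        if (!queues[idx].empty()) {
            int item = queues[idx].front();
            queues[idx].pop_front();
            next = (idx + 1) % queues.size();  // resume after this queue next time
            return item;
        }
    }
    return std::nullopt;  // all queues empty
}
```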
- Intrusive or Non-Intrusive
Some queues will require modifying the item being contained (usually either via inheritance or by adding a particular field), while others won’t. Typically intrusive containers (those that require a modified item) are faster but more of a pain to use.
There are lots more properties of course, but these are the most common and most useful to use when choosing a particular data structure. As always, though, it’s important to profile your performance (both with and without contention) with a reasonable workload before deciding whether it’s worthwhile to use lock-based or lock-free structures, and which of the many options would work better for your application.
Of course, simply adding a lock-free queue to your program isn’t enough by itself. Using one properly will typically require a design that works nicely with it. How this looks depends on what you’re doing, of course, but a typical design is to give each thread its own MPSC queue; any thread can then queue data or requests to any other thread, and each thread runs an action loop that either waits for a request to come in and then processes it, or periodically checks for new requests during its normal work. Even this isn’t as easy as it sounds at first, though: the thread making a request has to deal with the asynchronous nature of the reply (if any), either by doing some kind of continuation of processing when the reply arrives, or by blocking until it does. But bear in mind that you shouldn’t block unless you’re calling from a thread that doesn’t have its own queue to process; otherwise you can get deadlocked. Alternatively, you could re-enter the processing loop instead of blocking, but that requires your code to be re-entrancy-safe.
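The per-thread action loop described above can be sketched like this. A plain `std::deque` again stands in for the thread’s MPSC queue, and the queued items are request closures that only the owning thread ever executes; the function name is illustrative.

```cpp
#include <deque>
#include <functional>

// Stand-in for a thread's MPSC inbox: other threads enqueue request
// closures, and only the owning thread drains and runs them.
using RequestQueue = std::deque<std::function<void()>>;

// Drain all pending requests; returns how many were handled. The owning
// thread would call this from its wait loop, or periodically in between
// chunks of its normal work.
int process_pending(RequestQueue& inbox) {
    int handled = 0;
    while (!inbox.empty()) {
        std::function<void()> request = std::move(inbox.front());
        inbox.pop_front();
        request();  // runs on the owning thread, so no locking of its state needed
        ++handled;
    }
    return handled;
}
```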
There are more lock-free data structures than just queues, naturally. There are also stacks and even more complex structures such as hash tables and transactional memory; in addition there are some more borderline lock-minimisation techniques. I’m not going to go into full details on any of these things, since not that many people are reading this anyway.
But, the reason why I’m introducing this topic is that I’m planning to write another post soon with a more specific code example, and this seemed like worthwhile background material to present beforehand.