Kernel space: Ticket spinlocks
- 14 February, 2008 10:06
Spinlocks are the lowest-level mutual exclusion mechanism in the Linux kernel. As such, they have a great deal of influence over the safety and performance of the kernel, so it is not surprising that a great deal of optimization effort has gone into the various (architecture-specific) spinlock implementations. That does not mean that all of the work has been done, though; a patch merged for 2.6.25 shows that there is always more which can be done.
On the x86 architecture, in the 2.6.24 kernel, a spinlock is represented by an integer value. A value of one indicates that the lock is available. The spin_lock() code works by decrementing the value (in a system-wide atomic manner), then looking to see whether the result is zero; if so, the lock has been successfully obtained. Should, instead, the result of the decrement option be negative, the spin_lock() code knows that the lock is owned by somebody else. So it busy-waits ("spins") in a tight loop until the value of the lock becomes positive; then it goes back to the beginning and tries again.
Once the critical section has been executed, the owner of the lock releases it by setting it to 1.
This implementation is very fast, especially in the uncontended case (which is how things should be most of the time). It also makes it easy to see how bad the contention for a lock is - the more negative the value of the lock gets, the more processors are trying to acquire it. But there is one shortcoming with this approach: it is unfair. Once the lock is released, the first processor which is able to decrement it will be the new owner. There is no way to ensure that the processor which has been waiting the longest gets the lock first; in fact, the processor which just released the lock may, by virtue of owning that cache line, have an advantage should it decide to reacquire the lock quickly.
One would hope that spinlock unfairness would not be a problem; usually, if there is serious contention for locks, that contention is a performance issue even before fairness is taken into account. Nick Piggin recently revisited this issue, though, after noticing:
On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticable, with a userspace test having a difference of up to 2x runtime per thread, and some threads are starved or "unfairly" granted the lock up to 1 000 000 (!) times.
This sort of runtime difference is certainly undesirable. But lock unfairness can also create latency issues; it is hard to give latency guarantees when the wait time for a spinlock can be arbitrarily long.
Nick's response was a new spinlock implementation which he calls "ticket spinlocks." Under the initial version of this patch, a spinlock became a 16-bit quantity, split into two bytes: "Next" and "Owner".
Each byte can be thought of as a ticket number. If you have ever been to a store where customers take paper tickets to ensure that they are served in the order of arrival, you can think of the "next" field as being the number on the next ticket in the dispenser, while "owner" is the number appearing in the "now serving" display over the counter.
So, in the new scheme, the value of a lock is initialized (both fields) to zero. spin_lock() starts by noting the value of the lock, then incrementing the "next" field - all in a single, atomic operation. If the value of "next" (before the increment) is equal to "owner," the lock has been obtained and work can continue. Otherwise the processor will spin, waiting until "owner" is incremented to the right value. In this scheme, releasing a lock is a simple matter of incrementing "owner."
The implementation described above does have one small disadvantage in that it limits the number of processors to 256 - any more than that, and a heavily-contended lock could lead to multiple processors thinking they had the same ticket number. Needless to say, the resulting potential for mayhem is not something which can be tolerated. But the 256-processor limit is an unwelcome constraint for those working on large systems, which already have rather more processors than that. So the add-on "big ticket" patch - also merged for 2.6.25 - uses 16-bit values when the configured maximum number of processors exceeds 256. That raises the maximum system size to 65536 processors - who could ever want more than that?
With the older spinlock implementation, all processors contending for a lock fought to see who could grab it first. Now they wait nicely in line and grab the lock in the order of arrival. Multi-thread run times even out, and maximum latencies are reduced (and, more to the point, made deterministic). There is a slight cost to the new implementation, says Nick, but that gets very small on contemporary processors and is essentially zero relative to the cost of a cache miss - which is a common event when dealing with contended locks. The x86 maintainers clearly thought that the benefits of eliminating the unseemly scramble for spinlocks exceeded this small cost; it seems unlikely that others will disagree.