Nick's response was a new spinlock implementation which he calls "ticket spinlocks." Under the initial version of this patch, a spinlock became a 16-bit quantity, split into two bytes: "Next" and "Owner".
Each byte can be thought of as a ticket number. If you have ever been to a store where customers take paper tickets to ensure that they are served in the order of arrival, you can think of the "next" field as being the number on the next ticket in the dispenser, while "owner" is the number appearing in the "now serving" display over the counter.
So, in the new scheme, the value of a lock is initialized (both fields) to zero. spin_lock() starts by noting the value of the lock, then incrementing the "next" field - all in a single, atomic operation. If the value of "next" (before the increment) is equal to "owner," the lock has been obtained and work can continue. Otherwise the processor will spin, waiting until "owner" is incremented to the right value. In this scheme, releasing a lock is a simple matter of incrementing "owner."
The implementation described above does have one small disadvantage in that it limits the number of processors to 256 - any more than that, and a heavily-contended lock could lead to multiple processors thinking they had the same ticket number. Needless to say, the resulting potential for mayhem is not something which can be tolerated. But the 256-processor limit is an unwelcome constraint for those working on large systems, which already have rather more processors than that. So the add-on "big ticket" patch - also merged for 2.6.25 - uses 16-bit values when the configured maximum number of processors exceeds 256. That raises the maximum system size to 65536 processors - who could ever want more than that?
With the older spinlock implementation, all processors contending for a lock fought to see who could grab it first. Now they wait nicely in line and grab the lock in the order of arrival. Multi-thread run times even out, and maximum latencies are reduced (and, more to the point, made deterministic). There is a slight cost to the new implementation, says Nick, but that gets very small on contemporary processors and is essentially zero relative to the cost of a cache miss - which is a common event when dealing with contended locks. The x86 maintainers clearly thought that the benefits of eliminating the unseemly scramble for spinlocks exceeded this small cost; it seems unlikely that others will disagree.