Linux enthusiasts like to point out just how scalable the system is; Linux runs on everything from pocket-size devices to supercomputers with several thousand processors. What they talk about a little bit less is that, at the high end, the true scalability of the system is limited by the sort of workload being run. CPU-intensive scientific computing tasks can make good use of very large systems, but database-heavy workloads do not scale nearly as well. There is a lot of interest in making big database systems work better, but it has been a challenging task. Nick Piggin appears to have come up with a logical next step in that direction, though, with a relatively straightforward set of core memory management changes.
For some time, Linux has supported direct I/O from user space. This, too, is a scalability technology: the idea is to save processor time and memory by avoiding the need to copy data through the kernel as it moves between the application and the disks. With sufficient programming effort, the application should be able to make use of its superior knowledge of its own data access patterns to cache data more effectively than the kernel can; direct I/O allows that caching to happen without additional overhead. Large database management systems have had just that kind of programming effort applied to them, with the result that they use direct I/O heavily. To a significant extent, these systems use direct I/O to replace the kernel's paging algorithms with their own, specialized code.
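For readers who have not encountered it, a minimal sketch of what direct I/O looks like from user space follows; the file name is a placeholder, and the 4096-byte alignment is an assumption (the real requirement depends on the underlying device):

    #define _GNU_SOURCE   /* needed for O_DIRECT with glibc */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        ssize_t n;
        int fd = open("datafile", O_RDONLY | O_DIRECT);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* O_DIRECT requires the buffer (and, in general, the file
           offset and transfer length) to be suitably aligned. */
        if (posix_memalign(&buf, 4096, 4096)) {
            close(fd);
            return 1;
        }
        /* The data moves directly between the device and buf,
           bypassing the kernel's page cache. */
        n = read(fd, buf, 4096);
        printf("read %zd bytes\n", n);
        free(buf);
        close(fd);
        return 0;
    }

The price of bypassing the page cache is that the application takes on all of the caching and readahead work itself; that is exactly the bargain the large database systems have chosen to make.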
When the kernel is asked to carry out a direct I/O operation, one of the first things it must do is to pin all of the relevant user-space pages into memory and locate their physical addresses. The function which performs this task is get_user_pages():
    int get_user_pages(struct task_struct *tsk,
                       struct mm_struct *mm,
                       unsigned long start,
                       int len,
                       int write,
                       int force,
                       struct page **pages,
                       struct vm_area_struct **vmas);
A successful call to get_user_pages() will pin len pages into memory, beginning at the user-space address start as seen in the given mm. Pointers to the relevant struct page structures will be stored in pages, and the associated VMA pointers in vmas if it is not NULL.
This function works, but it has a problem (beyond the fact that it is a long, twisted, complex mess to read): it requires that the caller hold mm->mmap_sem. If two processes are performing direct I/O within the same address space - a common scenario for large database management systems - they will contend for that semaphore. This kind of lock contention quickly kills scalability; as soon as processors have to wait for each other, there is little to be gained by adding more of them.
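To see where the semaphore enters the picture, consider a sketch of how a direct I/O path might pin a user buffer; the helper names and the error handling here are hypothetical, but the locking pattern around get_user_pages() is the part that matters:

    #include <linux/mm.h>
    #include <linux/sched.h>

    /* Hypothetical helper: pin npages of the current process's
     * address space, starting at uaddr, for a direct I/O operation. */
    static int pin_user_buffer(unsigned long uaddr, int npages,
                               struct page **pages)
    {
        int ret;

        /* get_user_pages() must be called with mmap_sem held;
         * this is where concurrent direct I/O operations on the
         * same address space pile up. */
        down_read(&current->mm->mmap_sem);
        ret = get_user_pages(current, current->mm, uaddr, npages,
                             1 /* write */, 0 /* force */,
                             pages, NULL);
        up_read(&current->mm->mmap_sem);
        return ret;
    }

    /* Once the I/O completes, every pinned page must be released. */
    static void unpin_user_buffer(struct page **pages, int npages)
    {
        int i;

        for (i = 0; i < npages; i++)
            put_page(pages[i]);
    }

Note that the semaphore is only taken for read here, so concurrent callers do not actually exclude each other; but every acquisition still modifies the semaphore's shared state, bouncing its cache line between processors, and any writer (an mmap() call, say) will block all of them.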