Experimental container support for 2.6.24

Faster than virtualization, but harder to implement, containers are a promising security technology for Linux. Watch the 2.6.24 kernel for experimental support for creating and managing containers.

"Containers" are a form of lightweight virtualization as represented by projects like OpenVZ. While virtualization creates a new virtual machine upon which the guest system runs, containers implementations work by making walls around groups of processes. The result is that, while virtualized guests each run their own kernel (and can run different operating systems than the host), containerized systems all run on the host's kernel. So containers lack some of the flexibility of full virtualization, but they tend to be quite a bit more efficient.

As of 2.6.23, virtualization is quite well supported on Linux, at least for the x86 architecture. Container support, instead, lags a little behind. It turns out that, in many ways, containers are harder to implement than virtualization is. A container implementation must wrap a namespace layer around every global resource found in the kernel, and there are a lot of these resources: processes, filesystems, devices, firewall rules, even the system time. Finding ways to wrap all of these resources in a way which satisfies the needs of the various container projects out there, and which also does not irritate kernel developers who may have no interest in containers, has been a bit of a challenge.
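
The hostname is one small example of such a global resource; the UTS namespace support present in kernels since 2.6.19 shows what "wrapping" a resource looks like from user space. The following is a minimal sketch (it needs root privileges, and the "container0" name is purely illustrative): after the unshare() call, hostname changes are visible only to this process and its descendants.

    /* Wall off the UTS (hostname) namespace and change the hostname;
     * the rest of the system keeps seeing the old name. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char name[64];

        if (unshare(CLONE_NEWUTS) < 0) {     /* leave the global UTS namespace */
            perror("unshare");
            return 1;
        }
        sethostname("container0", 10);       /* only visible inside the namespace */
        gethostname(name, sizeof(name));
        printf("hostname in here: %s\n", name);
        return 0;
    }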

Full container support will get quite a bit closer once the 2.6.24 kernel is released. The merging of a number of significant patches in this development cycle fills in some important pieces, though a certain amount of work remains to be done.

Once upon a time, there was a patch set called process containers. The containers subsystem allows an administrator (or administrative daemon) to group processes into hierarchies of containers; each hierarchy is managed by one or more "subsystems." The original "containers" name was considered to be too generic - this code is an important part of a container solution, but it's far from the whole thing. So containers have now been renamed "control groups" (or "cgroups") and merged for 2.6.24.

Control groups need not be used for containers; for example, the group scheduling feature (also merged for 2.6.24) uses control groups to set the scheduling boundaries. But it makes sense to pair control groups with namespace management, and with resource management in general, to create a framework for a containers implementation.

The management of control groups is straightforward. The system administrator starts by mounting a special cgroup filesystem, associating the subsystems of interest with the filesystem at mount time. There can be more than one such filesystem mounted, as long as each subsystem is associated with at most one of them. So the administrator could create one cgroup filesystem to manage scheduling and a completely different one to associate processes with namespaces.
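
In code, a pair of mount() calls is all it takes. The sketch below is illustrative only: the /dev/cgroup mount points are assumptions, and the "cpu" and "ns" subsystem names depend on which controllers were built into the kernel.

    /* Mount two independent cgroup hierarchies: one bound to the "cpu"
     * (group scheduling) subsystem, one bound to the "ns" (namespace)
     * subsystem.  Paths and subsystem names are illustrative. */
    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>

    int main(void)
    {
        mkdir("/dev/cgroup", 0755);
        mkdir("/dev/cgroup/cpu", 0755);
        mkdir("/dev/cgroup/ns", 0755);

        if (mount("cgroup", "/dev/cgroup/cpu", "cgroup", 0, "cpu") < 0)
            perror("mount cpu hierarchy");

        if (mount("cgroup", "/dev/cgroup/ns", "cgroup", 0, "ns") < 0)
            perror("mount ns hierarchy");

        return 0;
    }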

Once the filesystem is mounted, specific groups are created by making directories within the cgroup filesystem. Putting a process into a control group is a simple matter of writing its process ID into the tasks virtual file in the cgroup directory. Processes can be moved between control groups at will.
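
A rough sketch of those two steps, reusing the hypothetical /dev/cgroup/cpu mount point from the previous example (the "mygroup" name is likewise made up):

    /* Create a control group and move the current process into it. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *tasks;

        mkdir("/dev/cgroup/cpu/mygroup", 0755);      /* the new control group */

        tasks = fopen("/dev/cgroup/cpu/mygroup/tasks", "w");
        if (!tasks) {
            perror("tasks");
            return 1;
        }
        fprintf(tasks, "%d\n", getpid());            /* join the group */
        fclose(tasks);
        return 0;
    }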

The concept of a process ID has gotten more complicated, though, since the PID namespace code was also merged. A PID namespace is a view of the processes on the system. On a "normal" Linux system, there is only the global PID namespace, and all processes can be found there. On a system with PID namespaces, different processes can have very different views of what is running on the system. When a new PID namespace is created, the only visible process is the one which created that namespace; it becomes, in essence, the init process for that namespace. Any descendants of that process will be visible in the new namespace, but they will never be able to see anything running outside of that namespace.
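
For the curious, here is a minimal sketch of what creating a PID namespace looks like with the clone() flag added by this work; it requires root privileges and a kernel built with PID namespace support. The child believes it is process 1, while the parent sees it under an ordinary global PID.

    /* Clone a child into a fresh PID namespace. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char stack[16384];

    static int child(void *arg)
    {
        (void) arg;
        printf("inside the namespace, my PID is %d\n", (int) getpid()); /* prints 1 */
        return 0;
    }

    int main(void)
    {
        pid_t pid = clone(child, stack + sizeof(stack),   /* stack grows down on x86 */
                          CLONE_NEWPID | SIGCHLD, NULL);
        if (pid < 0) {
            perror("clone");
            return 1;
        }
        printf("from the parent, the child is PID %d\n", (int) pid);
        waitpid(pid, NULL, 0);
        return 0;
    }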

Virtualizing process IDs in this way complicates a number of things. A process which creates a namespace remains visible to its parent in the old namespace - and it may not have the same process ID in both namespaces. So processes can have more than one ID, and the same process ID may refer to different processes in different namespaces. For example, it is fairly common in container implementations for the per-namespace init process to have ID 1 in its own namespace.

What all of this means is that process IDs only make sense when placed into a specific context. That, in turn, sets a trap for any kernel code which works with process IDs; any such code must take care to maintain the association between a process ID and the namespace in which it is defined. To make life easier (and safer), the containers developers have been working for some time to eliminate (to the greatest extent possible) use of process IDs within the kernel itself. Kernel code should use task_struct pointers (which are always unambiguous) to refer to specific processes; a process ID, instead, has become a cookie for communication with user space, and not much more.
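
A kernel-side sketch of that rule might look like the following. The helper names (find_vpid(), pid_task()) reflect the struct pid interfaces that grew out of this work and may differ in detail from one kernel version to another, so treat this as illustrative rather than as the API of any particular release.

    /* Turn a PID received from user space into a task_struct pointer,
     * resolved in the calling process's own PID namespace. */
    #include <linux/pid.h>
    #include <linux/rcupdate.h>
    #include <linux/sched.h>

    static struct task_struct *task_from_user_pid(pid_t nr)
    {
        struct task_struct *task;

        rcu_read_lock();
        task = pid_task(find_vpid(nr), PIDTYPE_PID); /* namespace-relative lookup */
        if (task)
            get_task_struct(task);                   /* pin it before use */
        rcu_read_unlock();
        return task;    /* caller must eventually call put_task_struct() */
    }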
