Computerworld
A better ext4 filesystem for Linux
A new Linux filesystem gets rid of the 256-petabyte limit and adds a checksum feature for the journal. But developers want you to know that it's not yet ready for production systems.
Jonathan Corbet (LinuxWorld)  31 January, 2008 12:52

Linux's ext4 filesystem, the successor to ext3, may well be the filesystem many of us are using a few years from now. Things have been relatively quiet on that front - at least, outside of the relevant mailing lists - but the ext4 developers have not been idle. Some of their work has now come to the surface with Ted Ts'o's posting of the ext4 merge plans for 2.6.25.

One of the changes going into ext4 is a lifting of the longstanding 4KB block size limit. That does not mean that just any block size works, though, and this feature will benefit fewer people than one might think, for one specific reason: the block size must still be no larger than the page size on the host system. So those of us running x86 systems with 4KB pages will still be stuck with 4KB blocks. And, on any system, the maximum block size is now 64KB.
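
The two constraints can be written down as a small check. This is only a sketch of the rule as described here, not the kernel's actual validation code; the function name is made up for illustration, and the power-of-two requirement is an added assumption rather than something stated above.

```python
# Sketch only: the two limits described in the text, plus an assumed
# power-of-two requirement. Not the kernel's real validation logic.
MAX_BLOCK_SIZE = 64 * 1024  # upper bound on any system


def usable_block_size(block_size: int, page_size: int) -> bool:
    """Return True if block_size fits within the page size and the 64KB cap."""
    # Assumption: block sizes are powers of two.
    is_power_of_two = block_size > 0 and (block_size & (block_size - 1)) == 0
    return (
        is_power_of_two
        and block_size <= page_size
        and block_size <= MAX_BLOCK_SIZE
    )


print(usable_block_size(4096, 4096))    # True: x86 with 4KB pages
print(usable_block_size(65536, 4096))   # False: larger than the page size
print(usable_block_size(65536, 65536))  # True: needs 64KB pages
```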

One amusing effect of this change is that the size of a directory entry can now be as large as 64KB as well. But the field which holds the size of directory entries is only 16 bits wide. So a special hack has been employed to recognize 64KB directory entries and keep everything consistent.
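
The problem is easy to see in a few lines: 65536 needs 17 bits, so it cannot be stored in a 16-bit field directly. The sketch below shows one possible workaround, using a stored value of 0 as a stand-in for 64KB on the grounds that a zero-length entry can never be valid. That choice of sentinel is an assumption made for illustration; the text above does not say which encoding the patch uses, and the function names are hypothetical.

```python
import struct

FIELD_MAX = 0xFFFF   # largest value a 16-bit field can hold (65535)
BIG_ENTRY = 1 << 16  # 65536, the size of a 64KB directory entry


def encode_entry_size(length: int) -> int:
    """Map an entry size to a 16-bit stored value (hypothetical encoding)."""
    if length == BIG_ENTRY:
        return 0  # assumed sentinel: a real entry is never zero bytes long
    if not 0 < length <= FIELD_MAX:
        raise ValueError(f"unsupported entry size: {length}")
    return length


def decode_entry_size(stored: int) -> int:
    """Inverse of encode_entry_size."""
    return BIG_ENTRY if stored == 0 else stored


# The stored value always packs into two bytes, even for a 64KB entry.
on_disk = struct.pack("<H", encode_entry_size(BIG_ENTRY))
print(decode_entry_size(struct.unpack("<H", on_disk)[0]))  # 65536
```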

Some internal variables have overflow problems as well. Block numbers are stored as a signed, 32-bit quantity, and so are block group numbers. That limits the maximum size of a filesystem to a mere 256 petabytes. In 2.6.25, these values will become unsigned long variables, eliminating that limit. Through some trickery, the inode field which stores the number of blocks associated with a file will be expanded to 48 bits, raising the maximum size of an individual file to just under 2^48 512-byte blocks.
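
The arithmetic behind these limits can be checked directly. The calculation below is one way to arrive at the 256-petabyte figure, not a derivation given in the article: it assumes 4KB blocks and block groups sized so that a single bitmap block (one bit per block) covers the whole group. Both assumptions are marked in the comments.

```python
# Worked arithmetic; figures marked "assumed" are not from the article.
PIB = 2**50                         # bytes in one pebibyte

block_size = 4096                   # assumed: 4KB blocks
blocks_per_group = 8 * block_size   # assumed: one bitmap block, one bit per block
group_bytes = blocks_per_group * block_size  # 128 MiB per block group

max_groups_signed = 2**31           # positive range of a signed 32-bit number
fs_limit_pib = max_groups_signed * group_bytes // PIB

max_file_bytes = 2**48 * 512        # a 48-bit count of 512-byte blocks
file_limit_pib = max_file_bytes // PIB

print(fs_limit_pib)    # 256
print(file_limit_pib)  # 128
```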

The work does not stop there, though: another patch redefines that field to mean the number of filesystem blocks (instead of 512-byte sectors) used by the file. This is a change which has to be handled carefully, since it is an on-disk format change which could create trouble for people with existing ext4 filesystems. Everybody who is using ext4 should certainly be doing so with the knowledge that it's a development filesystem and is only suitable for storing files which are not valuable for more than about 30 minutes - Rawhide OpenOffice.org updates, say. But it still would be nice to not trash every existing ext4 filesystem out there. So the i_blocks field will continue, by default, to hold the number of 512-byte blocks. But, if that field exceeds 32 bits and forces the use of 48-bit numbers, it is thereafter interpreted as filesystem blocks. Since no existing filesystems are yet using 48-bit numbers, this approach successfully avoids breaking them.
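
The compatibility rule can be sketched as follows. The per-inode marker (`counts_fs_blocks`) and both function names are hypothetical; the article says only that the interpretation changes once 48-bit numbers are needed, not how that state is recorded on disk.

```python
SECTOR_SIZE = 512  # the traditional unit for i_blocks


def store_block_count(sectors: int, fs_block_size: int) -> tuple[int, bool]:
    """Return (i_blocks, counts_fs_blocks) for a file using `sectors` sectors.

    Small counts keep the old meaning, so existing filesystems still read
    correctly. Only counts that overflow 32 bits switch units.
    """
    if sectors < 2**32:
        return sectors, False
    # Assumption: space is allocated in whole filesystem blocks, so the
    # sector count divides evenly.
    fs_blocks = sectors // (fs_block_size // SECTOR_SIZE)
    if fs_blocks >= 2**48:
        raise OverflowError("count does not fit in 48 bits")
    return fs_blocks, True


def bytes_used(i_blocks: int, counts_fs_blocks: bool, fs_block_size: int) -> int:
    """Interpret i_blocks according to the (hypothetical) unit marker."""
    unit = fs_block_size if counts_fs_blocks else SECTOR_SIZE
    return i_blocks * unit


print(store_block_count(8, 4096))      # (8, False): old meaning, 512-byte units
print(store_block_count(2**32, 4096))  # (536870912, True): filesystem blocks
```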

Journal checksums are another feature arriving for 2.6.25. If the system crashes, the journal is used to recover any transactions which were committed, but which did not actually make it to disk. It sure would be nice to know that the journal, as stored in the filesystem, is intact before using it to make changes elsewhere. The checksum enables the filesystem to ensure that the journal is good and avoid (further) corrupting the filesystem if it is not. An interesting side benefit is that the checksum loosens the constraints on how the journal is written to disk, since an incompletely-written journal will now be detected; that should help to improve filesystem performance slightly.
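
The idea can be illustrated with a toy journal transaction. This sketch uses CRC32 from Python's zlib module purely as an example checksum; the article does not name the algorithm, and the record layout and function names here are invented for illustration.

```python
import zlib


def checksum(blocks: list[bytes]) -> int:
    """Running CRC32 over all blocks of a transaction."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc


def commit(blocks: list[bytes]) -> dict:
    """Build a toy commit record: the blocks plus their checksum."""
    return {"blocks": list(blocks), "checksum": checksum(blocks)}


def safe_to_replay(txn: dict) -> bool:
    """Replay only if the journal contents still match the checksum."""
    return checksum(txn["blocks"]) == txn["checksum"]


txn = commit([b"A" * 16, b"B" * 16])
print(safe_to_replay(txn))  # True

# Simulate a block that was only partially written before the crash.
txn["blocks"][1] = b"B" * 15 + b"\x00"
print(safe_to_replay(txn))  # False
```

Because a torn write shows up as a mismatch at recovery time, the writer in this model does not need to wait for the data blocks to reach disk before writing the commit record, which is the loosened ordering constraint described above.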

Full data checksumming is still not on the agenda for ext4. But checksumming the journal is a good (if small) step in the right direction.

Another patch changes the VFS API: it turns the i_version field of the inode structure into an unsigned, 64-bit value on all architectures. This version number is incremented when the file is changed, and it's stored (split into two fields) in the on-disk inode. 64-bit version numbers are required by NFSv4, which uses them to provide the dreaded "stale file handle" error when things change.
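
Splitting a 64-bit counter across two on-disk fields is simple bit arithmetic. The sketch below assumes two 32-bit halves (low and high); the article does not give the actual field widths or names, so treat that layout and the function names as illustrative.

```python
MASK32 = 0xFFFFFFFF


def split_version(version: int) -> tuple[int, int]:
    """Split a 64-bit version into (low, high) 32-bit halves (assumed layout)."""
    if not 0 <= version < 2**64:
        raise ValueError("version must fit in an unsigned 64-bit value")
    return version & MASK32, version >> 32


def join_version(low: int, high: int) -> int:
    """Rebuild the 64-bit version from its two stored halves."""
    return (high << 32) | low


low, high = split_version(2**32 + 5)
print(low, high)                # 5 1
print(join_version(low, high))  # 4294967301
```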

There is a new ioctl() (EXT4_IOC_MIGRATE) which can be used to explicitly request that the on-disk inode for a file be converted to the ext4 format.

The ext4 filesystem is extent-based, and has been for some time. "Extent-based" means that it tracks block allocations by extents (first block, number of blocks) rather than storing pointers to each individual block, as is done in ext3. There are a number of performance benefits to doing things this way, especially for larger files. Those benefits disappear, though, if a file's blocks cannot be grouped into the smallest number of extents possible.
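
A short sketch shows the difference in bookkeeping. It converts a per-block list into (first block, number of blocks) pairs; the function name is made up, and real extent records carry more information than this.

```python
def blocks_to_extents(blocks: list[int]) -> list[tuple[int, int]]:
    """Collapse a list of block numbers into (first_block, count) extents."""
    extents: list[tuple[int, int]] = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            # The block continues the current run, so grow that extent.
            first, count = extents[-1]
            extents[-1] = (first, count + 1)
        else:
            extents.append((block, 1))
    return extents


# A contiguous file needs one extent, however many blocks it has.
print(blocks_to_extents([100, 101, 102, 103]))  # [(100, 4)]

# A fragmented file needs one extent per fragment, losing the benefit.
print(blocks_to_extents([100, 102, 104]))  # [(100, 1), (102, 1), (104, 1)]
```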

One technique which greatly helps in optimizing block allocations for files is to allocate them in relatively large groups, rather than individually. In 2.6.25, ext4 will contain the multi-block allocator, which does exactly that. One might think that allocating a few blocks at a time would not be that big of a change, but the multi-block allocator is by far the most complex patch in the set. A lot of effort and heuristics go into deciding how many blocks to allocate, finding the optimal set of blocks, tracking the allocation, recovering blocks which end up never being used, ensuring that an application cannot read pre-allocated (but unwritten) blocks in search of leaked secrets, etc. It is quite a bit of code, but it is worth the trouble; multi-block allocation will be enabled by default in 2.6.25.
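
A toy simulation shows why allocating in groups helps. Two files are appended to alternately, drawing space from a single free-space cursor. This is a deliberately simplified model with invented names: it ignores the allocator's heuristics, the recovery of unused blocks, and the protection of unwritten blocks mentioned above.

```python
def extents_for_file_a(chunk: int, writes_per_file: int = 8) -> int:
    """Count the extents file A ends up with when space is reserved
    `chunk` blocks at a time while files A and B grow in lockstep."""
    next_free = 0
    reserved = {"A": [], "B": []}  # blocks reserved but not yet written
    placed = {"A": [], "B": []}    # blocks actually holding file data

    for _ in range(writes_per_file):
        for name in ("A", "B"):
            if not reserved[name]:
                # Reserve a contiguous run of `chunk` blocks for this file.
                reserved[name] = list(range(next_free, next_free + chunk))
                next_free += chunk
            placed[name].append(reserved[name].pop(0))

    blocks = placed["A"]
    # A new extent starts wherever consecutive blocks are not adjacent.
    return 1 + sum(1 for prev, cur in zip(blocks, blocks[1:]) if cur != prev + 1)


print(extents_for_file_a(chunk=1))  # 8: every block is its own extent
print(extents_for_file_a(chunk=4))  # 2
print(extents_for_file_a(chunk=8))  # 1: the whole file is contiguous
```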


Copyright 2009 IDG Communications. ABN 14 001 592 650. All rights reserved.
Reproduction in whole or in part in any form or medium without express written permission of IDG Communications is prohibited.