The Linux Foundation

 
Linux Weather Forecast/filesystems

This is the filesystems page of the Linux Platform Weather Forecast.

Filesystems, of course, are a crucial part of any operating system; they are the code which maintains our persistent data. Reliability is of especially high importance in filesystems, since any mistakes can lead to lost data or (even worse) subtly corrupted data which is not discovered for a long time. But filesystems are also a performance-critical part of the system; a poorly-written filesystem will result in substandard performance for almost any kind of workload.

Traditional filesystems are driven by the requirements imposed by rotating storage - disk drives, in other words. For such devices, great care must be taken with the layout of data on the media so that it may be read and written with a minimum number of head seeks. In the time it takes to reposition the disk heads, vast amounts of data could have been moved, so unnecessary seeks will kill performance. The most heavily-used Linux filesystems have highly optimized layout code, and the up-and-coming filesystems have had even more work done in that area.

It is worth noting, though, that flash-based storage devices are getting larger and cheaper, to the point that they can replace rotating storage in low-end applications now. Flash brings a different set of constraints; data layout is less important (flash devices provide random access), but a flash-based filesystem must deal with large erase blocks, wear leveling, and other device-specific issues. For this reason, running traditional filesystems on flash-based devices is not the best solution; Linux has some flash-oriented filesystems now, with others in the works.
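To make the wear-leveling constraint concrete, here is a minimal sketch of one possible policy: always rewrite into the erase block that has been erased the fewest times. The class and method names are hypothetical, and real flash filesystems and flash translation layers use considerably more elaborate schemes.

```python
class WearLeveler:
    """Toy wear-leveling policy (illustrative only, not any real filesystem's algorithm)."""

    def __init__(self, num_erase_blocks: int) -> None:
        # One erase counter per erase block; all blocks start unworn.
        self.erase_counts = [0] * num_erase_blocks

    def pick_block(self) -> int:
        # Choose the least-worn block; ties go to the lowest block index.
        return min(range(len(self.erase_counts)),
                   key=self.erase_counts.__getitem__)

    def erase_and_rewrite(self) -> int:
        # Erasing a block is what wears it out, so count each erase.
        block = self.pick_block()
        self.erase_counts[block] += 1
        return block


if __name__ == "__main__":
    leveler = WearLeveler(num_erase_blocks=4)
    for _ in range(10):
        leveler.erase_and_rewrite()
    print(leveler.erase_counts)
```

With this policy, no block is ever erased more than once more than any other, which is the property wear leveling aims for.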

Finally, the growth in storage capacities in general presents scaling problems for all filesystems. Dealing with terabyte-scale filesystems requires different approaches, especially when it comes to integrity checking. Running a filesystem checker on a large array can take days even now; that problem is only going to get worse. So a lot of work is going into fixing the "fsck problem."

For an overview of where Linux filesystem development is heading, see this report from the 2006 Linux Filesystems Workshop by Val Henson.

Ext4

Ext4 is the successor to the longstanding ext2 and ext3 filesystems used in most Linux installations. It has been part of the mainline kernel since 2.6.19, but it is currently considered to be an experimental filesystem, not recommended for production use. Features which are present now include:

  • 48-bit block numbers and other enhancements, which greatly increase the maximum filesystem size,
  • extents, which improve the efficiency of the filesystem, especially with large files,
  • nanosecond time stamps, needed to track relative file modification times on fast systems,
  • block pre-allocation (for resource reservation and optimal block layout) and multi-block allocation,
  • journal checksums,
  • very large file support,
  • delayed allocation, which facilitates better layout of data on the disk.
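Two of the features above lend themselves to a quick worked example. The sketch below assumes a 4 KiB block size (an assumption, not something stated here) to show how the width of the block number bounds the filesystem size, and uses a hypothetical helper to show how a run of contiguous blocks collapses into a few extents. It is not the ext4 on-disk format.

```python
def max_filesystem_bytes(block_number_bits: int, block_size: int = 4096) -> int:
    """Upper bound on addressable bytes for a given block-number width."""
    return (1 << block_number_bits) * block_size


def blocks_to_extents(blocks):
    """Collapse physical block numbers (given in file order) into
    (start_block, length) extents."""
    extents = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            # This block directly follows the previous extent: extend it.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # Not contiguous: start a new extent.
            extents.append((block, 1))
    return extents


if __name__ == "__main__":
    print(max_filesystem_bytes(32) // 2**40, "TiB with 32-bit block numbers")
    print(max_filesystem_bytes(48) // 2**60, "EiB with 48-bit block numbers")
    print(blocks_to_extents([100, 101, 102, 200, 201]))
```

The extent list stays short as long as the allocator manages to keep files contiguous, which is why extents help most with large files.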

Other planned features include online defragmentation, faster checking of filesystems after a system crash, and more. It's important to note that, while ext4 adds a great many useful features, it still falls somewhat short of being a true next generation filesystem. So ext4 looks more like a stopgap system meant for use until btrfs is ready.

Forecast: Delayed allocation was merged for 2.6.27. As of this writing, about the only planned feature which remains out of the mainline kernel is online defragmentation. As of 2.6.28, ext4 is no longer considered a "developmental" filesystem, but it has not yet been proclaimed ready for general use. In the past, I have predicted that this declaration could happen around 2.6.30; that timeline still looks reasonable.

Btrfs

Btrfs is a new, from-scratch filesystem developed by Chris Mason at Oracle. It has a number of interesting features, including extent-based storage, the ability to divide a filesystem into subvolumes, a fast snapshot capability, checksums on data and metadata, online integrity checking, very fast offline checking, and more.
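The value of checksumming data as well as metadata can be sketched in a few lines: a checksum is stored when a block is written and recomputed when it is read, so silent corruption is caught rather than returned to the application. The function names are hypothetical, and `zlib.crc32` is used only as a convenient stand-in; the checksum algorithm is an assumption of this sketch, not a statement about what Btrfs uses.

```python
import zlib


def write_block(data: bytes):
    """Return the block together with the checksum to store alongside it."""
    return data, zlib.crc32(data)


def verify_block(data: bytes, stored_checksum: int) -> bool:
    """Recompute the checksum on read and compare it with the stored one."""
    return zlib.crc32(data) == stored_checksum


if __name__ == "__main__":
    block, checksum = write_block(b"persistent data" * 100)
    corrupted = bytearray(block)
    corrupted[7] ^= 0x01  # simulate a single flipped bit on the media
    print(verify_block(block, checksum))             # intact block
    print(verify_block(bytes(corrupted), checksum))  # corrupted block
```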

Forecast: btrfs is a very new project. This work is progressing, but conservatism reigns when new filesystems are being considered; the standards for reliability are very high in this area. Btrfs has generated a lot of interest, but there is much work yet to be done; the Btrfs timeline calls for new features to be added through January, 2009. A production-ready Btrfs before 2010 seems unlikely.

The Btrfs code was merged into the mainline for the 2.6.29 kernel release. It remains under heavy development, though, and should not be used for production data.

Object Storage Devices

Object storage devices (OSDs) are intended to offload much filesystem and security processing from the host system onto the hardware. To that end, the interface they provide deals in "objects" (files, primarily) rather than individual blocks. Supporting OSDs thus requires a different kind of filesystem; all of the complicated block-layout code can be replaced with higher-level protocol code and an interface to the security mechanism.

The open-osd project is working toward OSD support for Linux. Among other things, this project is implementing osd-initiator (the low-level interface code), osdfs (a filesystem built on top of osd-initiator), and a software OSD implementation.

Forecast: The osd-initiator and osdfs code was first posted for wider review in November, 2008. Initial reviews were positive; there is a fair amount of interest in getting this code into the mainline relatively quickly. We might see it merged by 2.6.30 or 2.6.31.

AdvFS

AdvFS (or the "Tru64 Advanced Filesystem") was developed by Digital Equipment Corporation in the 1990s. In June, 2008, HP (which has come to own that technology) announced that AdvFS had been released under the GPL. This filesystem has some appealing features, including some snapshot and volume management capabilities which are still not matched by mainline Linux filesystems.

Forecast: While the AdvFS code has been released under the GPL, it has not been ported to Linux, and may never be. It will certainly serve as a source of useful information - and possibly code - for ongoing Linux filesystem efforts, though.

Reiser4

Reiser4 is a successor to the ReiserFS filesystem. It offers a number of features, including good performance on small files (and claimed high performance on all files), space efficiency, and a "plugin" architecture which makes the addition of features (such as compression, encryption, and new formats) relatively easy.

Forecast: Reiser4 has been in the -mm tree for years, but has not managed to make the leap into the mainline. Parts of the filesystem were designed in ways which did not fit well with the Linux VFS layer and a number of features have been disabled in response. More recently, Reiser4 has lost its lead developer; while some people are working on the filesystem, it will need a new champion if it is to eventually become part of mainline Linux.


LogFS

LogFS is a completely new filesystem aimed at efficient operation on solid-state (flash) media. It is a log-structured filesystem, but, unlike others, it also includes a directory tree on the media, eliminating the need to build such a tree at mount time.

Forecast: LogFS is still in a relatively early stage of development, and development has stalled for lack of funding. For the short term, this work appears to have been upstaged by UBIFS, though it may yet reach a state of completion and make its way into the mainline.

Tux3

Tux3 is an exceedingly new filesystem being developed by Daniel Phillips; it is not available in a working form at this time. Daniel has high ambitions for this work, which is based on a new versioning scheme and a number of other novel ideas.

Forecast: It's too soon to tell. Daniel is a capable developer with a lot of good ideas, but his record for complete implementation of those ideas is not the strongest. In any case, Tux3 is late to the game; by the time it's getting ready, btrfs should be in solid shape. But there's always room for surprises; this project bears watching.

UBIFS

UBIFS is a flash-based filesystem being developed by engineers at Nokia and elsewhere. It is closer to completion than LogFS, but potentially has some boot-time scaling issues resulting from its use of the UBI flash-management layer. UBIFS would appear to perform better on benchmarks than LogFS at this point.

Forecast: UBIFS has been posted for public comment with an eye toward merging in the not-too-distant future. There has not been a great deal of discussion of this filesystem yet, though, so chances of a merge in 2.6.26 appear to be quite small. If all goes well, 2.6.27 or shortly thereafter may be a possibility.

ZFS

ZFS is a filesystem made for Solaris by Sun Microsystems. It includes a number of interesting features: 128-bit support, block checksumming, integrated volume management support, etc.

Forecast: This development would not normally merit inclusion here, but people keep asking. ZFS is an open-source filesystem, but it is still highly unlikely to find its way into Linux. The license used for ZFS is incompatible with the GPL, making the mixing of the two code bases impossible. There are also patents held on ZFS which will prevent its use with Linux.


FS-Cache

FS-Cache is a generic file caching layer meant to be used with network filesystems. A number of these filesystems support (or even require) the caching of remote files on the local system; FS-Cache is meant to serve as the mechanism for that caching for all filesystems which can make use of it.
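The general idea of a shared caching layer can be sketched as a read-through cache that any network filesystem could plug a fetch function into. This is only an illustration of the concept; the class, its methods, and the `fetch_remote` callback are hypothetical and do not reflect the FS-Cache kernel interface.

```python
class ReadThroughCache:
    """Toy local cache sitting in front of a remote file store."""

    def __init__(self, fetch_remote) -> None:
        # fetch_remote(path) -> bytes is supplied by the network filesystem.
        self._fetch_remote = fetch_remote
        self._local = {}
        self.remote_reads = 0

    def read(self, path: str) -> bytes:
        if path not in self._local:
            # Cache miss: go to the server once and keep a local copy.
            self.remote_reads += 1
            self._local[path] = self._fetch_remote(path)
        return self._local[path]

    def invalidate(self, path: str) -> None:
        # Drop the local copy, e.g. after the server reports a change.
        self._local.pop(path, None)


if __name__ == "__main__":
    server = {"/export/a.txt": b"hello"}
    cache = ReadThroughCache(server.__getitem__)
    cache.read("/export/a.txt")
    cache.read("/export/a.txt")
    print(cache.remote_reads)
```

Because the cache only depends on the fetch callback, the same mechanism can serve several different network filesystems.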

Forecast: This patch is in a relatively mature state, having been in circulation for some years now. Its path into the mainline has been difficult, but a merge for 2.6.30 or 2.6.31 now appears to be a real possibility. A large piece of prerequisite work - the credentials patch set - was accepted for 2.6.29.

Large block support

Filesystems in Linux cannot use block sizes greater than the native page size on the host processor. This limitation is not normally a problem - most files are small, so larger blocks would simply make the whole system more space- and time-inefficient. There are times, though, when larger blocks would be good to have: storing larger files on extent-based systems is a prime example. Large blocks would also improve portability of filesystems between systems with different page sizes.
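The space cost of large blocks for small files is easy to quantify, since every file occupies a whole number of blocks. The sketch below compares 4 KiB and 64 KiB blocks for a handful of made-up file sizes; the function names and the sample sizes are purely illustrative.

```python
def allocated_bytes(file_size: int, block_size: int) -> int:
    """Space consumed on disk when a file is rounded up to whole blocks."""
    blocks = (file_size + block_size - 1) // block_size  # ceiling division
    return blocks * block_size


def wasted_bytes(file_sizes, block_size: int) -> int:
    """Total slack space (allocated minus actual data) for a set of files."""
    return sum(allocated_bytes(size, block_size) - size for size in file_sizes)


if __name__ == "__main__":
    sample_sizes = [100, 1500, 5000, 70000]  # hypothetical file sizes in bytes
    for block_size in (4096, 65536):
        print(block_size, wasted_bytes(sample_sizes, block_size))
```

For these sample files the slack grows from about 13 KiB with 4 KiB blocks to about 245 KiB with 64 KiB blocks, which is the inefficiency described above.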

Forecast: A patch set adding large block support to the Linux page cache and virtual memory subsystem has been posted. It is incomplete (lacking mmap() support in particular) and there have been questions about whether the approach taken is the correct one. One might see large block support in 2.6.25, but it's a long shot.


unionfs

A union filesystem is a combination of two or more independent filesystems which appears as a coherent whole in its own right. One common use for such filesystems is live CD distributions: a writable filesystem is made into a union with the underlying, read-only, CD-based filesystem, giving the illusion of a filesystem with the full CD contents and which is modifiable. Many other applications are possible as well.
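The live CD case can be modeled with two layers: a read-only lower layer and a writable upper layer that is consulted first. The sketch below is a toy in-memory model with hypothetical names; recording deletions of lower-layer files as "whiteouts" is an assumption of this sketch rather than a description of how unionfs is implemented.

```python
class UnionView:
    """Toy union of a read-only lower layer and a writable upper layer."""

    def __init__(self, lower: dict) -> None:
        self._lower = lower       # read-only layer (e.g. the CD); never modified
        self._upper = {}          # writable layer holding all changes
        self._whiteouts = set()   # lower-layer paths hidden by a delete

    def read(self, path: str) -> bytes:
        if path in self._whiteouts:
            raise FileNotFoundError(path)
        if path in self._upper:   # the upper layer shadows the lower one
            return self._upper[path]
        if path in self._lower:
            return self._lower[path]
        raise FileNotFoundError(path)

    def write(self, path: str, data: bytes) -> None:
        self._whiteouts.discard(path)
        self._upper[path] = data

    def delete(self, path: str) -> None:
        self.read(path)           # raises FileNotFoundError if not visible
        self._upper.pop(path, None)
        if path in self._lower:
            self._whiteouts.add(path)  # hide it without touching the lower layer

    def listdir(self):
        return sorted((set(self._lower) | set(self._upper)) - self._whiteouts)


if __name__ == "__main__":
    cd = {"/etc/motd": b"welcome", "/bin/sh": b"binary"}
    root = UnionView(cd)
    root.write("/etc/motd", b"customized")
    root.delete("/bin/sh")
    print(root.listdir(), root.read("/etc/motd"))
```

The lower layer is never written to, yet the combined view behaves like an ordinary modifiable filesystem.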

The unionfs code is a union filesystem implementation for Linux which has been under development for a few years. It is currently shipped by a number of distributors, but is not part of the mainline kernel.

Forecast: A concerted effort was made to get unionfs merged into 2.6.25, but it was not successful. Some filesystem developers are opposed to its presence, thinking that it is the wrong approach to this problem. Their preferred approach ("union mounts," which works within the virtual filesystem layer) may be more elegant, but it is far from being ready for production use. So unionfs is the only way that Linux can have this capability in the near future. How it will all play out is anybody's guess.


This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

