Filesystems for NOR and NAND flash memory

To use raw flash chips for mass storage, you have to use a filesystem that understands the peculiarities of the underlying technology. There are three such filesystems: JFFS2, YAFFS2, and UBIFS.

All of these use MTD as the common interface to flash memory.

The Journaling Flash File System had its beginnings in the software for the Axis 2100 network camera in 1999. For many years, it was the only flash filesystem for Linux and has been deployed on many thousands of different types of devices. Today, it is not the best choice, but I will cover it first because it shows the beginning of the evolutionary path.

JFFS2 is a log-structured filesystem that uses MTD to access flash memory. In a log-structured filesystem, changes are written sequentially as nodes to the flash memory. A node may contain changes to a directory, such as the names of files created and deleted, or it may contain changes to file data. After a while, a node may be superseded by information contained in subsequent nodes and becomes an obsolete node.

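Erase blocks are categorized into three types: free blocks, which contain no nodes at all; clean blocks, which contain only valid nodes; and dirty blocks, which contain at least one obsolete node.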

At any one time, there is one block receiving updates which is called the open block. If power is lost or the system is reset, the only data that can be lost is the last write to the open block. In addition, nodes are compressed as they are written, increasing the effective storage capacity of the flash chip, which is important if you are using expensive NOR flash memory.

When the number of free blocks falls below a threshold, a garbage collector kernel thread is started, which scans for dirty blocks and copies the valid nodes into the open block, and then frees up the dirty block.

At the same time, the garbage collector provides a crude form of wear leveling because it cycles valid data from one block to another. The way that the open block is chosen means that each block is erased roughly the same number of times so long as it contains data that changes from time to time. Sometimes a clean block is chosen for garbage collection to make sure that blocks containing static data that is seldom written are also wear leveled.

JFFS2 filesystems have a write-through cache, meaning that writes go to the flash memory synchronously, as if the filesystem had been mounted with the -o sync option. While this improves reliability, it increases the time taken to write data. There is a further problem with small writes: if the length of a write is comparable to the size of the node header (40 bytes), the overhead becomes high. A well-known corner case is log files, produced, for example, by syslogd.
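To put a number on the small-write problem, here is a rough sketch that uses only the 40-byte header size quoted above. It ignores compression and padding, and overhead_pct is a hypothetical helper name, so treat the figures as a guide rather than a measurement:

```shell
#!/bin/sh
# Share of a JFFS2 node taken up by its header, for a given payload size.
# Assumption: a fixed 40-byte header, no compression, no padding.
HDR=40

# overhead_pct PAYLOAD_BYTES: print the header's share of the node, in percent
overhead_pct() {
    echo $(( HDR * 100 / (HDR + $1) ))
}

for n in 40 200 1000; do
    echo "payload of $n bytes: header is $(overhead_pct "$n")% of the node"
done
```

A logger that appends one short line at a time sits near the top of that range, which is why log files are the classic bad case.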

Creating an empty JFFS2 filesystem at runtime is as simple as erasing an MTD partition with clean markers and then mounting it. There is no formatting step because a blank JFFS2 filesystem consists entirely of free blocks. For example, to format MTD partition 6, you would enter these commands on the device:
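A minimal sketch of that sequence, assuming an existing mount point at /mnt (the mount point and the helper name jffs2_blank are assumptions). It prints the commands by default; on the target, run it with RUN set to an empty string to execute them:

```shell
#!/bin/sh
# Dry run by default: each command is printed, not executed.
# On the target, invoke with RUN= (empty) to run them for real.
RUN=${RUN-echo}

# jffs2_blank N DIR (hypothetical helper):
# erase MTD partition N with clean markers, then mount it on DIR
jffs2_blank() {
    $RUN flash_erase -j "/dev/mtd$1" 0 0 &&
    $RUN mount -t jffs2 "mtd$1" "$2"
}

jffs2_blank 6 /mnt
```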

The -j option to flash_erase adds the clean markers, and mounting with type jffs2 presents the partition as an empty filesystem. Note that the device to be mounted is given as mtd6, not /dev/mtd6. Alternatively, you can give the block device node /dev/mtdblock6. This is just a peculiarity of JFFS2. Once mounted, you can treat it like any filesystem and, when you next boot and mount it, all the files will still be there.

You can create a filesystem image directly from the staging area of your development system using mkfs.jffs2 to write out the files in JFFS2 format and sumtool to add the summary nodes. Both of these are part of the mtd-utils package.

As an example, to create an image of the files in rootfs for a NAND flash device with an erase block size of 128 KB (0x20000) and with summary nodes, you would use these two commands:
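A sketch of the two steps, assuming the staging directory is ./rootfs and the output files are named rootfs.jffs2 and rootfs-sum.jffs2 (the file names and the helper make_jffs2_image are assumptions). The commands are printed unless RUN is set to an empty string:

```shell
#!/bin/sh
# Dry run by default; set RUN= (empty) on a host with mtd-utils installed.
RUN=${RUN-echo}
EB=0x20000      # erase block size: 128 KiB

# make_jffs2_image SRC_DIR PREFIX (hypothetical helper):
# write PREFIX.jffs2, then add summary nodes into PREFIX-sum.jffs2
make_jffs2_image() {
    $RUN mkfs.jffs2 -n -e "$EB" -p -d "$1" -o "$2.jffs2" &&
    $RUN sumtool -n -e "$EB" -p -i "$2.jffs2" -o "$2-sum.jffs2"
}

make_jffs2_image rootfs rootfs
```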

The -p option adds padding at the end of the image file to make it a whole number of erase blocks. The -n option suppresses the creation of clean markers in the image, which is normal for NAND devices as the clean marker is in the OOB area. For NOR devices, you would leave out the -n option. You can use a device table with mkfs.jffs2 to set the permissions and the ownership of files by adding -D [device table]. Of course, Buildroot and the Yocto Project will do all this for you.

You can program the image into flash memory from your bootloader. For example, if you have loaded a filesystem image into RAM at address 0x82000000 and you want to write it to a flash partition that begins 0x163000 bytes from the start of the flash chip and is 0x7a9d000 bytes long, the U-Boot commands would be:
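U-Boot's nand erase and nand write commands take bare hexadecimal numbers, so it is easy to mistype them. The following host-side sketch derives the two command lines from the values above and prints where the partition ends as a sanity check; uboot_cmds is a hypothetical helper, and using the whole partition length as the write size is an assumption carried over from the example (in practice you may prefer the image size):

```shell
#!/bin/sh
# Values from the example; nothing here touches hardware.
RAM_ADDR=0x82000000
PART_OFF=0x163000
PART_LEN=0x7a9d000

# uboot_cmds (hypothetical helper): print the lines to type at the U-Boot prompt
uboot_cmds() {
    printf 'nand erase %x %x\n' "$PART_OFF" "$PART_LEN"
    printf 'nand write %x %x %x\n' "$RAM_ADDR" "$PART_OFF" "$PART_LEN"
}

uboot_cmds
printf 'partition ends at 0x%x\n' $(( $PART_OFF + $PART_LEN ))
```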

You can do the same thing from Linux using the mtd driver like this:
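A sketch of the Linux equivalent, assuming the partition is mtd6 and the image is the rootfs-sum.jffs2 file produced by sumtool (both names, and the helper flash_jffs2, are assumptions). Commands are printed unless RUN is set to an empty string:

```shell
#!/bin/sh
# Dry run by default; set RUN= (empty) on the target to execute.
RUN=${RUN-echo}

# flash_jffs2 N IMAGE (hypothetical helper):
# erase MTD partition N with clean markers, then write IMAGE to it
flash_jffs2() {
    $RUN flash_erase -j "/dev/mtd$1" 0 0 &&
    $RUN nandwrite "/dev/mtd$1" "$2"
}

flash_jffs2 6 rootfs-sum.jffs2
```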

To boot with a JFFS2 root filesystem, you need to pass the mtdblock device for the partition and a root filesystem type on the kernel command line, because JFFS2 cannot be auto-detected. For partition 6, that would be:

root=/dev/mtdblock6 rootfstype=jffs2

The YAFFS filesystem was written by Charles Manning beginning in 2001, specifically to handle NAND flash chips at a time when JFFS2 did not. Subsequent changes to handle larger (2 KiB) page sizes resulted in YAFFS2. The website for YAFFS is http://www.yaffs.net.

YAFFS is also a log-structured filesystem following the same design principles as JFFS2. Where its design decisions differ, they give it a faster mount-time scan and simpler, faster garbage collection. It also has no compression, which speeds up reads and writes at the expense of less efficient use of storage.

YAFFS is not limited to Linux; it has been ported to a wide range of operating systems. It has a dual license: GPLv2 to be compatible with Linux, and a commercial license for other operating systems. Unfortunately, the YAFFS code has never been merged into mainline Linux so you will have to patch your kernel, as shown in the following code.

To get YAFFS2 and patch a kernel, you would:
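A sketch of those steps, assuming the kernel source tree is in ../linux and that the YAFFS repository is still served from the address below; the URL, the patch-ker.sh arguments (c to copy the files, m for the multi-version variant), and the helper name yaffs_patch are assumptions to check against http://www.yaffs.net. Commands are printed unless RUN is set to an empty string:

```shell
#!/bin/sh
# Dry run by default; set RUN= (empty) to execute on the build host.
RUN=${RUN-echo}

# yaffs_patch KERNEL_DIR (hypothetical helper):
# fetch the YAFFS source and copy it into the kernel tree at KERNEL_DIR
yaffs_patch() {
    $RUN git clone git://www.aleph1.co.uk/yaffs2 &&
    $RUN cd yaffs2 &&
    $RUN ./patch-ker.sh c m "$1"
}

yaffs_patch ../linux
```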

Then, configure the kernel with CONFIG_YAFFS_YAFFS2.

As with JFFS2, to create a YAFFS2 filesystem at runtime, you only need to erase the partition and mount it. Note that, in this case, you do not enable clean markers:
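A minimal sketch, assuming partition mtd6 and a mount point at /mnt (the mount point and the helper name yaffs2_blank are assumptions). Note the absence of -j on flash_erase. Commands are printed unless RUN is set to an empty string:

```shell
#!/bin/sh
# Dry run by default; set RUN= (empty) on the target to execute.
RUN=${RUN-echo}

# yaffs2_blank N DIR (hypothetical helper):
# erase MTD partition N without clean markers, then mount it on DIR
yaffs2_blank() {
    $RUN flash_erase "/dev/mtd$1" 0 0 &&
    $RUN mount -t yaffs2 "/dev/mtdblock$1" "$2"
}

yaffs2_blank 6 /mnt
```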

To create a filesystem image, the simplest thing to do is use the mkyaffs2 tool from https://code.google.com/p/yaffs2utils using the following command:

$ mkyaffs2 -c 2048 -s 64 rootfs rootfs.yaffs2

Here -c is the page size and -s the OOB size. There is a tool named mkyaffs2image that is part of the YAFFS code, but it has a couple of drawbacks. Firstly, the page and OOB sizes are hard-coded in the source: you will have to edit and recompile if you have memory that does not match the defaults of 2,048 and 64. Secondly, the OOB layout is incompatible with MTD, which uses the first two bytes as a bad block marker, whereas mkyaffs2image uses those bytes to store part of the YAFFS metadata.

To copy the image to the MTD partition from a Linux shell prompt, follow these steps:

# flash_erase /dev/mtd6 0 0
# nandwrite -a /dev/mtd6 rootfs.yaffs2

To boot with a YAFFS2 root filesystem, add the following to the kernel command line:

root=/dev/mtdblock6 rootfstype=yaffs2

The unsorted block image (UBI) driver is a volume manager for flash memory that takes care of bad block handling and wear leveling. It was implemented by Artem Bityutskiy and first appeared in Linux 2.6.22. In parallel with that, engineers at Nokia were working on a filesystem that would take advantage of the features of UBI, which they called UBIFS; it appeared in Linux 2.6.27. Splitting the flash translation layer in this way makes the code more modular and also allows other filesystems to take advantage of the UBI driver, as we shall see later on.

UBI provides an idealized, reliable view of a flash chip by mapping physical erase blocks (PEB) to logical erase blocks (LEB). Bad blocks are not mapped to LEBs and so are never used. If a block cannot be erased, it is marked as bad and dropped from the mapping. UBI keeps a count of the number of times each PEB has been erased in the header of that PEB, and changes the mapping so that all PEBs are erased roughly the same number of times.

UBI accesses the flash memory through the MTD layer. As an extra feature, it can divide an MTD partition into a number of UBI volumes, which improves wear leveling in the following way. Imagine that you have two filesystems, one containing fairly static data, for example, a root filesystem, and the other containing data that is constantly changing. If they are stored in separate MTD partitions, the wear leveling only has an effect on the second one, whereas, if you choose to store them in two UBI volumes in a single MTD partition, the wear leveling takes place over both areas of the storage and the lifetime of the flash memory is increased. The following diagram illustrates this situation:

[Diagram: two UBI volumes sharing a single MTD partition]

In this way, UBI fulfills two of the requirements of a flash translation layer: wear leveling and bad block handling.

To prepare an MTD partition for UBI, you don't use flash_erase as with JFFS2 and YAFFS2. Instead, you use the ubiformat utility, which preserves the erase counts that are stored in the PEB headers. ubiformat needs to know the minimum unit of I/O which, for most NAND flash chips, is the page size, but some chips allow reading and writing in sub-pages that are a half or a quarter of the page size. Consult the chip data sheet for details and, if in doubt, use the page size. This example prepares mtd6 using a page size of 2,048 bytes:
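A minimal sketch of that step, passing the I/O unit through ubiformat's -s (sub-page size) option; the helper name ubi_prepare is an assumption. The command is printed unless RUN is set to an empty string:

```shell
#!/bin/sh
# Dry run by default; set RUN= (empty) on the target to execute.
RUN=${RUN-echo}

# ubi_prepare N IO_UNIT (hypothetical helper):
# format MTD partition N for UBI, keeping existing erase counters
ubi_prepare() {
    $RUN ubiformat "/dev/mtd$1" -s "$2"
}

ubi_prepare 6 2048
```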

You use the ubiattach command to load the UBI driver on an MTD partition that has been prepared in this way:
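A sketch of the attach step for mtd6. The -O option gives the offset of the UBI volume ID header and is set here to 2,048 on the assumption that the partition was formatted with a 2,048-byte I/O unit; the two values need to agree. The helper name ubi_attach is hypothetical, and the command is printed unless RUN is set to an empty string:

```shell
#!/bin/sh
# Dry run by default; set RUN= (empty) on the target to execute.
RUN=${RUN-echo}

# ubi_attach N VID_OFFSET (hypothetical helper):
# attach MTD partition N to UBI, with the VID header at VID_OFFSET
ubi_attach() {
    $RUN ubiattach -p "/dev/mtd$1" -O "$2"
}

ubi_attach 6 2048
```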

This creates the device node /dev/ubi0 through which you can access the UBI volumes. You can use ubiattach multiple times for other MTD partitions, in which case they can be accessed through /dev/ubi1, /dev/ubi2, and so on.

The PEB to LEB mapping is loaded into memory during the attach phase, a process that takes time proportional to the number of PEBs, typically a few seconds. A new feature was added in Linux 3.7 called the UBI fastmap which checkpoints the mapping to flash from time to time and so reduces the attach time. The kernel configuration option is CONFIG_MTD_UBI_FASTMAP.

The first time you attach to an MTD partition after a ubiformat, there will be no volumes. You can create volumes using ubimkvol. For example, suppose you have a 128 MiB MTD partition on a chip with 128 KiB erase blocks and 2 KiB pages, and you want to split it into two volumes: one of 32 MiB and one taking up the rest (nominally 96 MiB, although UBI's own overhead makes the usable figure somewhat smaller):
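The sketch below first works out an upper bound on the space UBI can hand out, assuming each 128 KiB PEB yields at most 126 KiB once UBI's headers are subtracted. Because that bound is already below 128 MiB, the second volume is created with ubimkvol's -m option (use all remaining space) rather than a fixed size. The volume names vol_1 and vol_2 and the helper make_volumes are assumptions; commands are printed unless RUN is set to an empty string:

```shell
#!/bin/sh
# Dry run by default; set RUN= (empty) on the target to execute.
RUN=${RUN-echo}

PEBS=$(( 128 * 1024 / 128 ))        # 128 MiB partition / 128 KiB PEBs
MAX_MIB=$(( PEBS * 126 / 1024 ))    # at most 126 KiB of data per PEB
echo "$PEBS PEBs, at most $MAX_MIB MiB available to volumes"

# make_volumes (hypothetical helper): one fixed-size volume, one taking the rest
make_volumes() {
    $RUN ubimkvol /dev/ubi0 -N vol_1 -s 32MiB &&
    $RUN ubimkvol /dev/ubi0 -N vol_2 -m
}

make_volumes
```

The real figure is lower still, since UBI also sets aside blocks for its volume table and for bad block handling.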

Now you have the device nodes /dev/ubi0_0 and /dev/ubi0_1. You can confirm the situation by running ubinfo -a /dev/ubi0, which reports on the UBI device and each of its volumes.

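Note that, since each PEB carries headers containing the meta information used by UBI, the LEB is smaller than the PEB: by one page if the chip supports sub-pages and the headers can share a page, or by two pages if each header occupies a page of its own. For example, a chip with a PEB size of 128 KiB and 2 KiB pages would have an LEB of 126 KiB in the first case and 124 KiB in the second. This is important information that you will need when creating a UBIFS image.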
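The arithmetic can be sketched as follows, assuming UBI's default layout: the erase counter header sits at the start of the PEB, the 64-byte volume ID header starts at the first sub-page boundary, and data starts at the next page boundary after it. The function name leb_size is hypothetical:

```shell
#!/bin/sh
# leb_size PEB PAGE SUBPAGE (hypothetical helper): LEB size in bytes.
# Assumes the default UBI header placement described in the lead-in.
leb_size() {
    # data begins at the first page boundary after the 64-byte VID header
    data_off=$(( ($3 + 64 + $2 - 1) / $2 * $2 ))
    echo $(( $1 - data_off ))
}

echo "512-byte sub-pages: $(leb_size 131072 2048 512) bytes"
echo "no sub-pages:       $(leb_size 131072 2048 2048) bytes"
```

With 512-byte sub-pages both headers fit in the first page, so the LEB is one page short of the PEB (126 KiB); without sub-pages each header takes a page of its own and the LEB is two pages short (124 KiB).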

UBIFS uses a UBI volume to create a robust filesystem. It adds sub-allocation and garbage collection to create a complete flash translation layer. Unlike JFFS2 and YAFFS2, it stores index information on-chip and so mounting is fast, although don't forget that attaching the UBI volume beforehand may take a significant amount of time. It also allows for write-back caching like a normal disk filesystem, which means that writes are much faster, but with the usual problem of potential loss of data that has not been flushed from the cache to flash memory in the event of power down. You can resolve the problem by making careful use of the fsync(2) and fdatasync(2) functions to force a flush of file data at crucial points.

UBIFS has a journal for fast recovery in the event of power down. The journal takes up some space, typically 4 MiB or more, so UBIFS is not suitable for very small flash devices.

Once you have created the UBI volumes, you can mount them using the device node for the volume, /dev/ubi0_0, or by using the device node for the whole partition plus the volume name, as shown here:
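A sketch of the mount by volume name, assuming a volume called vol_1 on /dev/ubi0 and a mount point at /mnt (all three, and the helper mount_ubifs, are assumptions). The command is printed unless RUN is set to an empty string:

```shell
#!/bin/sh
# Dry run by default; set RUN= (empty) on the target to execute.
RUN=${RUN-echo}

# mount_ubifs SPEC DIR (hypothetical helper): mount a UBIFS volume on DIR.
# SPEC is either ubiX:NAME or the volume's device node, such as /dev/ubi0_0.
mount_ubifs() {
    $RUN mount -t ubifs "$1" "$2"
}

mount_ubifs ubi0:vol_1 /mnt
```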

Creating a filesystem image for UBIFS is a two-stage process: first you create a UBIFS image using mkfs.ubifs, and then embed it into a UBI volume using ubinize.

For the first stage, mkfs.ubifs needs to be informed of the page size with -m, the size of the UBI LEB with -e, remembering that the LEB is one or two pages shorter than the PEB, and the maximum number of erase blocks in the volume with -c. If the first volume is 32 MiB and an erase block is 128 KiB, then the number of erase blocks is 256. So, to take the contents of the directory rootfs and create a UBIFS image named rootfs.ubi, you would type the following:
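A sketch of the first stage. It assumes that the UBI headers take a full page each (no sub-pages, matching a partition formatted with a 2,048-byte I/O unit), which makes the LEB 124 KiB; if sub-pages are in use it would be 126 KiB instead. The helper name make_ubifs is hypothetical, and the command is printed unless RUN is set to an empty string:

```shell
#!/bin/sh
# Dry run by default; set RUN= (empty) on a host with mtd-utils installed.
RUN=${RUN-echo}

LEBS=$(( 32 * 1024 / 128 ))     # 32 MiB volume / 128 KiB erase blocks

# make_ubifs SRC_DIR IMAGE (hypothetical helper)
make_ubifs() {
    $RUN mkfs.ubifs -r "$1" -m 2048 -e 124KiB -c "$LEBS" -o "$2"
}

make_ubifs rootfs rootfs.ubi
```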

The second stage requires you to create a configuration file for ubinize which describes the characteristics of each volume in the image. The help page (ubinize -h) gives details of the format. This example creates two volumes, vol_1 and vol_2:
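A sketch of such a file, written out from the shell as ubinize.cfg. The section names are arbitrary, and data.ubi is a hypothetical second UBIFS image standing in for whatever you want in vol_2:

```shell
#!/bin/sh
# Write a two-volume ubinize configuration to ubinize.cfg.
# Assumption: rootfs.ubi and data.ubi are UBIFS images made with mkfs.ubifs.
cat > ubinize.cfg <<'EOF'
[ubifs_vol_1]
mode=ubi
image=rootfs.ubi
vol_id=0
vol_name=vol_1
vol_size=32MiB
vol_type=dynamic

[ubifs_vol_2]
mode=ubi
image=data.ubi
vol_id=1
vol_name=vol_2
vol_type=dynamic
vol_flags=autoresize
EOF
```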

The second volume has an auto-resize flag and so will expand to fill the remaining space on the MTD partition. Only one volume can have this flag. From this information, ubinize will create an image file named by the -o parameter, with the PEB size -p, the page size -m, and the sub-page size -s:
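A sketch of the ubinize call, assuming the configuration is in ubinize.cfg, the output is named ubi.img, and the chip is used without sub-pages so that -s equals the page size (file names, the sub-page choice, and the helper make_ubi_image are assumptions). The command is printed unless RUN is set to an empty string:

```shell
#!/bin/sh
# Dry run by default; set RUN= (empty) on a host with mtd-utils installed.
RUN=${RUN-echo}

# make_ubi_image CFG OUT (hypothetical helper):
# 128 KiB PEBs, 2 KiB pages, sub-page size equal to the page size
make_ubi_image() {
    $RUN ubinize -o "$2" -p 128KiB -m 2048 -s 2048 "$1"
}

make_ubi_image ubinize.cfg ubi.img
```

Whatever value you give to -s has to match the one used when the partition is formatted and attached on the target, otherwise UBI will look for its headers in the wrong place.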

To install this image on the target, you would enter these commands on the target:
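One way to do this is sketched below. It uses ubiformat's -f option to write the image, which keeps the existing erase counters, and then attaches the partition; the choice of ubiformat -f, the image name ubi.img, the 2,048-byte sub-page and header offset, and the helper install_ubi are all assumptions. Commands are printed unless RUN is set to an empty string:

```shell
#!/bin/sh
# Dry run by default; set RUN= (empty) on the target to execute.
RUN=${RUN-echo}

# install_ubi N IMAGE (hypothetical helper):
# flash IMAGE to MTD partition N with ubiformat, then attach it to UBI
install_ubi() {
    $RUN ubiformat "/dev/mtd$1" -s 2048 -f "$2" &&
    $RUN ubiattach -p "/dev/mtd$1" -O 2048
}

install_ubi 6 ubi.img
```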

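If you want to boot with a UBIFS root filesystem, you need to tell the kernel which MTD partition to attach, which volume to mount, and the filesystem type. For volume vol_1 on MTD partition 6, the kernel command line parameters would be:

root=ubi0:vol_1 rootfstype=ubifs ubi.mtd=6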