As the trend towards managed flash technologies continues, particularly eMMC, we need to consider how to use them effectively. While these devices appear to have the same characteristics as hard disk drives, the underlying NAND flash chips still have the limitations of large erase blocks with limited erase cycles and bad block handling. And, of course, we need robustness in the event of losing power.
It is possible to use any of the normal disk filesystems but we should try to choose one that reduces disk writes and has a fast restart after an unscheduled shutdown, typically provided by a journal.
To make optimum use of the underlying flash memory, you need to know the erase block size and page size. Manufacturers do not publish these numbers, as a rule, but it is possible to deduce them by observing the behavior of the chip or card.
Flashbench is a tool that does this. It was initially written by Arnd Bergmann, as described in the LWN article available at http://lwn.net/Articles/428584. You can get the code from https://github.com/bradfa/flashbench.
Here is a typical run on a SanDisk GiB SDHC card:
$ sudo ./flashbench -a /dev/mmcblk0 --blocksize=1024
align 536870912 pre 4.38ms on 4.48ms post 3.92ms diff 332µs
align 268435456 pre 4.86ms on 4.9ms post 4.48ms diff 227µs
align 134217728 pre 4.57ms on 5.99ms post 5.12ms diff 1.15ms
align 67108864 pre 4.95ms on 5.03ms post 4.54ms diff 292µs
align 33554432 pre 5.46ms on 5.48ms post 4.58ms diff 462µs
align 16777216 pre 3.16ms on 3.28ms post 2.52ms diff 446µs
align 8388608 pre 3.89ms on 4.1ms post 3.07ms diff 622µs
align 4194304 pre 4.01ms on 4.89ms post 3.9ms diff 940µs
align 2097152 pre 3.55ms on 4.42ms post 3.46ms diff 917µs
align 1048576 pre 4.19ms on 5.02ms post 4.09ms diff 876µs
align 524288 pre 3.83ms on 4.55ms post 3.65ms diff 805µs
align 262144 pre 3.95ms on 4.25ms post 3.57ms diff 485µs
align 131072 pre 4.2ms on 4.25ms post 3.58ms diff 362µs
align 65536 pre 3.89ms on 4.24ms post 3.57ms diff 511µs
align 32768 pre 3.94ms on 4.28ms post 3.6ms diff 502µs
align 16384 pre 4.82ms on 4.86ms post 4.17ms diff 372µs
align 8192 pre 4.81ms on 4.83ms post 4.16ms diff 349µs
align 4096 pre 4.16ms on 4.21ms post 4.16ms diff 52.4µs
align 2048 pre 4.16ms on 4.16ms post 4.17ms diff 9ns
Flashbench reads blocks of, in this case, 1,024 bytes just before and just after various power-of-two boundaries. As you cross a page or erase block boundary, the reads after the boundary take longer. The rightmost column shows the difference and is the one that is most interesting. Reading from the bottom, there is a big jump at 4 KiB, which is the most likely size of a page. There is a second jump from 52.4µs to 349µs at 8 KiB. This is fairly common and indicates that the card can use multi-plane accesses to read two 4 KiB pages at the same time. Beyond that, the differences are less well marked, but there is a clear jump from 485µs to 805µs at 512 KiB, which is probably the erase block size. Given that the card being tested is quite old, these are the sort of numbers you would expect.
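Picking out the jumps by eye is error-prone, so it can help to let a script do the comparison. The following is a minimal sketch rather than part of flashbench: the function name flashbench_jumps and the default threshold of 1.5 are my own assumptions. It reads the output of flashbench -a and, working up from the smallest alignment, prints each alignment whose diff is at least 1.5 times that of the next smaller one:

```shell
# flashbench_jumps [FACTOR]
# Read "flashbench -a" output on stdin. Starting from the smallest
# alignment, print "ALIGN DIFF_US" for every row whose diff is at least
# FACTOR (assumed default: 1.5) times the diff of the next smaller alignment.
flashbench_jumps() {
    awk -v factor="${1:-1.5}" '
    function to_us(v) {
        # "1.15ms" -> 1150, "9ns" -> 0.009; anything else is taken as microseconds
        if (v ~ /ms$/) return (v + 0) * 1000
        if (v ~ /ns$/) return (v + 0) / 1000
        return v + 0
    }
    $1 == "align" && $9 == "diff" {
        n++
        align[n] = $2
        diff[n] = to_us($10)
    }
    END {
        # flashbench lists the largest alignment first, so walk the rows backwards
        for (i = n - 1; i >= 1; i--)
            if (diff[i] >= factor * diff[i + 1])
                printf "%s %.1f\n", align[i], diff[i]
    }'
}
```

You would run it as sudo ./flashbench -a /dev/mmcblk0 --blocksize=1024 | flashbench_jumps. Against the output shown above, it lists 4096, 8192, 524288, and 134217728: the three sizes already discussed, plus the isolated spike at 128 MiB, which is a reminder that the result still needs human interpretation.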
Usually, when you delete a file, only the modified directory node is written to storage while the sectors containing the file's contents remain unchanged. When the flash translation layer is in the disk controller, as with managed flash, it does not know that this group of disk sectors no longer contains useful data and so it ends up copying stale data.
In the last few years, the addition of transactions that pass information about deleted sectors down to the disk controller has improved the situation. The SATA specification has a TRIM command, SCSI has the equivalent UNMAP command, and MMC has a similar command named ERASE. In Linux, this feature is known as discard.
To make use of discard, you need a storage device that supports it (most current eMMC chips do) and a Linux device driver to match. You can check by looking at the block system queue parameters in /sys/block/<block device>/queue/. The ones of interest are as follows:
discard_granularity: The size of the internal allocation unit of the device
discard_max_bytes: The maximum number of bytes that can be discarded in one go
discard_zeroes_data: If 1, discarded data will be set to zero
If the device or the device driver does not support discard, these values are all set to zero. These are the parameters you will see from the eMMC chip on the BeagleBone Black:
# grep -s "" /sys/block/mmcblk0/queue/discard_*
/sys/block/mmcblk0/queue/discard_granularity:2097152
/sys/block/mmcblk0/queue/discard_max_bytes:2199023255040
/sys/block/mmcblk0/queue/discard_zeroes_data:1
There is more information in the kernel documentation file, Documentation/block/queue-sysfs.txt.
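If you want to make this check from a boot or provisioning script, it can be wrapped in a small function. This is a sketch that relies on the behavior described above, namely that discard_max_bytes reads as zero when discard is not supported; the function name supports_discard is hypothetical:

```shell
# supports_discard QUEUE_DIR
# Succeed if QUEUE_DIR (for example /sys/block/mmcblk0/queue) reports a
# non-zero discard_max_bytes. The value is compared as a string because
# it can be larger than a 32-bit shell integer.
supports_discard() {
    max=$(cat "$1/discard_max_bytes" 2>/dev/null) || return 1
    [ -n "$max" ] && [ "$max" != "0" ]
}
```

A typical call would be supports_discard /sys/block/mmcblk0/queue && echo "discard is available".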
You can enable discard when mounting a filesystem by adding the option -o discard to the mount command. Both ext4 and F2FS support it.
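To confirm that the option actually took effect, you can look at the mount options that the kernel reports in /proc/mounts. The function below is only a sketch: mounted_with_discard is a name I have made up, and the optional second argument exists purely so that it can be tried against a saved copy of the mounts table:

```shell
# mounted_with_discard MOUNTPOINT [MOUNTS_FILE]
# Succeed if MOUNTPOINT appears in MOUNTS_FILE (default /proc/mounts)
# with "discard" among its comma-separated mount options (field 4).
mounted_with_discard() {
    awk -v mnt="$1" '
    $2 == mnt {
        n = split($4, opt, ",")
        for (i = 1; i <= n; i++)
            if (opt[i] == "discard")
                found = 1
    }
    END { exit !found }
    ' "${2:-/proc/mounts}"
}
```

For example, mounted_with_discard / || echo "root is not mounted with discard".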
It is also possible to force discard from the command line independently of how the partition is mounted, using the fstrim command which is part of the util-linux package. Typically, you would run this command periodically, once a week perhaps, to free up unused space. fstrim operates on a mounted filesystem so, to trim the root filesystem /, you would type the following:
# fstrim -v /
/: 2061000704 bytes were trimmed
The preceding example uses the verbose option, -v, so that it prints out the number of bytes potentially freed up. In this case, 2,061,000,704 is the approximate amount of free space in the filesystem, so it is the maximum amount of storage that could have been freed.
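For the periodic run, a script dropped into a weekly cron directory is one option. The sketch below assumes that cron, or an equivalent scheduler, is available on the target. Both trim_mounts and the FSTRIM override are names I have invented for illustration; the override is there only so that the function can be tried out without touching a real device:

```shell
# trim_mounts MOUNTPOINT...
# Run "fstrim -v" on each mount point in turn. Return non-zero if any
# of them fails, but carry on with the remaining mount points.
# FSTRIM (hypothetical override) names the command to run; default: fstrim.
trim_mounts() {
    status=0
    for mnt in "$@"; do
        "${FSTRIM:-fstrim}" -v "$mnt" || status=1
    done
    return "$status"
}
```

The weekly script would then contain a single call such as trim_mounts / /data, where /data stands for whatever other flash-backed filesystems the device has.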
The extended filesystem, ext, has been the main filesystem for Linux desktops since 1992. The current version, ext4, is very stable, well tested and has a journal that makes recovery from an unscheduled shutdown fast and mostly painless. It is a good choice for managed flash devices and you will find that it is the preferred filesystem for Android devices that have eMMC storage. If the device supports discard, you should mount with the option -o discard.
To create an ext4 filesystem on a partition and mount it at runtime, you would type the following:
# mkfs.ext4 /dev/mmcblk0p2
# mount -t ext4 -o discard /dev/mmcblk0p2 /mnt
To create a filesystem image, you can use the genext2fs utility, available from http://genext2fs.sourceforge.net. In this example, I have specified the block size with -B and the number of blocks in the image with -b:
$ genext2fs -B 1024 -b 10000 -d rootfs rootfs.ext4
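With these values the image is 1,024 x 10,000 = 10,240,000 bytes, a little under 10 MiB. Rather than guessing the block count, you can derive it from the size of the staging directory. The helper below is a sketch only: image_blocks is a hypothetical name, it assumes 1 KiB blocks as in the example, and the amount of headroom to allow for filesystem metadata and future growth is a judgment call, not a rule:

```shell
# image_blocks USED_KIB HEADROOM_PERCENT
# Print the number of 1 KiB blocks needed to hold USED_KIB of files plus
# HEADROOM_PERCENT extra, rounded up to a whole block.
image_blocks() {
    echo $(( ($1 * (100 + $2) + 99) / 100 ))
}
```

For instance, with an arbitrary 25 percent headroom, you could write genext2fs -B 1024 -b "$(image_blocks "$(du -sk rootfs | cut -f1)" 25)" -d rootfs rootfs.ext4.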
genext2fs can make use of a device table to set the file permissions and ownership, as described in Chapter 5, Building a Root Filesystem, with -D [file table].
As the name implies, this will actually generate an image in .ext2 format. You can upgrade using tune2fs as follows (details of the command options are in the man page for tune2fs):
$ tune2fs -j -J size=1 -O filetype,extents,uninit_bg,dir_index rootfs.ext4
$ e2fsck -pDf rootfs.ext4
Both the Yocto Project and Buildroot use exactly these steps when creating images in .ext4 format.
While a journal is an asset for devices that may power down without warning, it does add extra write cycles to each write transaction, wearing out the flash memory. If the device is battery-powered, especially if the battery is not removable, the chances of an unscheduled power down are small and so you may want to leave the journal out.
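Should you decide to do without the journal, tune2fs can remove it from an unmounted filesystem with -O ^has_journal, and tune2fs -l lists the features that are currently enabled. As a sketch of how a script might check the result, the following function (has_journal is a hypothetical name) reads the output of tune2fs -l and succeeds only if the feature list contains has_journal:

```shell
# has_journal
# Read "tune2fs -l" output on stdin. Succeed if the
# "Filesystem features:" line lists has_journal as a whole word.
has_journal() {
    grep '^Filesystem features:' | grep -qw 'has_journal'
}
```

You might use it as tune2fs -l /dev/mmcblk0p2 | has_journal && echo "journal present".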
The Flash-Friendly File System, F2FS, is a log-structured filesystem designed for managed flash devices, especially eMMC and SD. It was written by Samsung and was merged into mainline Linux in 3.8. It is marked experimental, indicating that it has not been extensively deployed as yet, but it seems that some Android devices are using it.
F2FS takes into account the page and erase block sizes and tries to align data on these boundaries. The log format gives resilience in the face of power down and also gives good write performance, in some tests showing a two-fold improvement over ext4. There is a good description of the design of F2FS in the kernel documentation in Documentation/filesystems/f2fs.txt and there are references at the end of the chapter.
The mkfs.f2fs utility creates an empty F2FS filesystem; the -l option sets the label:
# mkfs.f2fs -l rootfs /dev/mmcblk0p1
# mount -t f2fs /dev/mmcblk0p1 /mnt
There isn't (yet) a tool to create F2FS filesystem images off-line.
The old Microsoft filesystems, FAT16 and FAT32, continue to be important as a common format that is understood by most operating systems. When you buy an SD card or USB flash drive, it is almost certain to be formatted as FAT32 and, in some cases, the on-card microcontroller is optimized for FAT32 access patterns. Also, some boot ROMs require a FAT partition for the second stage bootloader, the TI OMAP-based chips for example. However, FAT formats are definitely not suitable for storing critical files because they are prone to corruption and make poor use of the storage space.
Linux supports FAT16 through the msdos filesystem and both FAT32 and FAT16 through the vfat filesystem. In most cases, you need to include the vfat driver. Then, to mount a device, say an SD card on the second mmc hardware adapter, you would type this:
# mount -t vfat /dev/mmcblk1p1 /mnt
In the past, there have been licensing issues with the vfat driver which may (or may not) infringe a patent held by Microsoft.
FAT32 is, in practice, limited to devices of up to 32 GiB: the format itself can describe larger volumes, but 32 GiB is the largest size for which the SD specification (SDHC) and the Windows format tool use it. Devices of a larger capacity may be formatted using the Microsoft exFAT format and it is a requirement for SDXC cards. There is no kernel driver for exFAT, but it can be supported by means of a user-space FUSE driver. Since exFAT is proprietary to Microsoft, there are certain to be licensing implications if you support this format on your device.