Few filesystems have ever inspired the sort of zealous advocacy that fans of ZFS regularly display. While it has its weak points, in many respects ZFS is a reinvention of the filesystem concept, with significant advantages. One thing that's different about ZFS is that it integrates filesystem and RAID capabilities into a single layer. The RAID-Z implementation in ZFS is a worthwhile alternative to standard RAID5 and RAID6 installations.
ZFS defaults to a 128 KB record size. This is much larger than a PostgreSQL block, which can cause a variety of inefficiencies if your system regularly reads or writes only small portions of the database at a time (as many OLTP systems do). It's only really appropriate if you prefer to optimize your system for large operations. The default might be fine if you're running a data warehouse that is constantly scanning large chunks of tables. But standard practice for ZFS database installations that do more scattered random I/O is to reduce the ZFS record size to match the database one, which means 8K for PostgreSQL:
$ zfs set recordsize=8K zp1/data
You need to do this before creating any of the database files on the drive, because the record size is fixed per file once a file has been written; changing the setting later only affects newly created files. Note that this size will not be optimal for WAL writes, which may benefit from a larger record size such as the default.
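One way to handle the mismatch between data and WAL access patterns is to give each its own ZFS dataset. A minimal sketch, assuming a pool named zp1 and these (hypothetical) dataset names:

```shell
# Create separate datasets before running initdb, since existing files
# keep the record size they were written with.
zfs create -o recordsize=8K zp1/data   # matches PostgreSQL's 8K blocks
zfs create zp1/wal                     # keeps the 128K default for sequential WAL writes
zfs get recordsize zp1/data zp1/wal    # confirm the settings
```

The database's pg_xlog directory can then be symlinked or mounted onto the WAL dataset, following the usual practice of separating WAL from data.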
One important thing to know about ZFS is that unlike Solaris's UFS, which caches almost nothing by default, ZFS is known to consume just about all the memory available for its Adaptive Replacement Cache (ARC). You'll need to reduce that amount for use with PostgreSQL, where large blocks of RAM are expected to be allocated for the database buffer cache and things like working memory. The actual tuning details vary based on Solaris release, and are documented in the Limiting the ARC Cache section of the ZFS Evil Tuning Guide article at http://www.solaris-cookbook.eu/solaris/solaris-10-zfs-evil-tuning-guide/ or http://www.serverfocus.org/zfs-evil-tuning-guide.
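On Solaris, the usual way to cap the ARC is the zfs_arc_max tunable in /etc/system. A sketch, where the 2 GB figure is purely illustrative (the right value depends on how much RAM you've dedicated to shared_buffers and the rest of the system):

```shell
# /etc/system fragment (Solaris): cap the ARC at 2 GB.
# The value is in bytes; the tunable name can vary between releases,
# so check the tuning guide for your version before applying this.
set zfs:zfs_arc_max = 2147483648
```

A reboot is needed for /etc/system changes to take effect.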
For FreeBSD, refer to http://wiki.freebsd.org/ZFSTuningGuide for similar information. One of the scripts suggested there, arc_summary.pl, is a useful one in both its FreeBSD and Solaris incarnations, for determining just what's in the ARC cache and whether it's using its RAM effectively. This is potentially valuable tuning feedback for PostgreSQL, where the OS cache is used quite heavily, but such use is not tracked for effectiveness by the database.
ZFS handles its journaling using a structure called the intent log. High performance systems with many disks commonly dedicate fast storage just to hold the ZFS intent log for the database disks, in the same way that the database WAL is commonly put on another drive. There's no need to have a dedicated intent log for the WAL disk too, though.
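Adding a dedicated intent log device to an existing pool is a single zpool command. A sketch, where "c1t0d0" is a placeholder device name; a fast device with a non-volatile cache is the usual choice:

```shell
# Attach a dedicated log device to the pool holding the database.
# "c1t0d0" is a hypothetical device name; substitute your own.
zpool add zp1 log c1t0d0
zpool status zp1    # the device should now appear under a "logs" section
```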
As with XFS, if you have a system with a non-volatile write cache such as a battery-backed write controller, the cache flushing done by ZFS will defeat some of the benefit of that cache. You can disable that behavior by adjusting the zfs_nocacheflush parameter; the following line in /etc/system will do that:
set zfs:zfs_nocacheflush = 1
You can toggle the value to 1 (no cache flushing) and back to 0 (default, flushing enabled) with the following on a live filesystem:
echo zfs_nocacheflush/W0t1 | mdb -kw
echo zfs_nocacheflush/W0t0 | mdb -kw
ZFS has a few features that make it well suited to database use. All reads and writes include block checksums, which allow ZFS to detect the sadly common situation where data is quietly corrupted by RAM or disk errors. Some administrators consider such checksums vital for running a large database safely. Another useful feature is ZFS's robust snapshot support. This makes it far easier to make a copy of a database that you can replicate to another location, back up, or even use as a temporary copy you can roll back to an earlier version. This can be particularly valuable when doing risky migrations or changes you might want to back out.
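The snapshot workflow for a risky change might look like the following sketch, assuming the database lives on a dataset named zp1/data; stop PostgreSQL (or use its backup mode) first so the snapshot is consistent:

```shell
# Take a named snapshot before a risky migration.
zfs snapshot zp1/data@before-migration
zfs list -t snapshot            # verify the snapshot exists

# ...run the migration; if it goes wrong, restore the old state:
zfs rollback zp1/data@before-migration
```

Rollback discards everything written to the dataset after the snapshot, so it's an all-or-nothing way to back out a change.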
Because of the robustness of its intent log and block checksum features, ZFS is one filesystem where disabling PostgreSQL's full_page_writes parameter is a candidate optimization with little risk. It's quite resistant to the torn pages issue that makes that parameter important for other filesystems. There is also transparent compression available on ZFS. While expensive in terms of CPU, applications that do lots of sequential scans of data too small to be compressed by the PostgreSQL TOAST method might benefit from reading more logical data per physical read, which is what should happen if compression is enabled.
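Compression is a per-dataset property. A sketch, again assuming the zp1/data dataset; note that only blocks written after the change are compressed:

```shell
# Enable transparent compression with the default algorithm
# (which algorithm that is depends on your ZFS release).
zfs set compression=on zp1/data
zfs get compressratio zp1/data   # shows how much space is being saved
```

Checking compressratio after some normal workload is a quick way to judge whether the CPU cost is buying you anything.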
The choice for Windows is much easier, because only one filesystem there makes sense for a database disk: NTFS (https://en.wikipedia.org/wiki/NTFS).