Write-back caches

The CPUs and memory in your server are quite fast compared to its disk drives. Accordingly, making the rest of the system wait for the disks, particularly when things need to be written out, can drag overall performance down heavily. Systems that wait for the disks to complete their writes before moving on to their next task are referred to as having a write-through cache. While the data may be stored temporarily in a memory cache, any write an application requested isn't considered complete until it has made it all the way through to the physical disk.

The normal solution to making that faster is to introduce a different type of write cache between the program doing the writing and the disks. A write-back cache is one where data is copied into memory, and then control returns to the application that requested the write. Those writes are then handled asynchronously, at some future time dictated by the design of the write-back cache. It can take minutes before the data actually makes it to disk.
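The difference between the two cache types can be sketched with a toy model. This is only an illustration under simple assumptions: the class names Disk, WriteThroughCache, and WriteBackCache are hypothetical, the "disk" is a dictionary, and the asynchronous flush of a real write-back cache is replaced by an explicit flush() call.

```python
class Disk:
    """Stands in for permanent storage: its contents survive a crash."""
    def __init__(self):
        self.blocks = {}


class WriteThroughCache:
    """A write returns only after the data has reached the disk."""
    def __init__(self, disk):
        self.disk = disk
        self.memory = {}

    def write(self, block_no, data):
        self.memory[block_no] = data
        self.disk.blocks[block_no] = data  # the caller waits for this step


class WriteBackCache:
    """A write returns once the data is in memory; the disk is updated later."""
    def __init__(self, disk):
        self.disk = disk
        self.dirty = {}

    def write(self, block_no, data):
        self.dirty[block_no] = data  # returns immediately, disk untouched

    def flush(self):
        # A real cache does this asynchronously; here it is called by hand.
        self.disk.blocks.update(self.dirty)
        self.dirty.clear()

    def crash(self):
        # Power loss: anything not yet flushed is gone.
        self.dirty.clear()


through_disk = Disk()
WriteThroughCache(through_disk).write(1, "row A")

back_disk = Disk()
back = WriteBackCache(back_disk)
back.write(1, "row A")
on_disk_before_flush = dict(back_disk.blocks)  # nothing on disk yet
back.flush()
on_disk_after_flush = dict(back_disk.blocks)   # block 1 is now on disk
back.write(2, "row B")
back.crash()
on_disk_after_crash = dict(back_disk.blocks)   # block 2 never arrived
```

In this model the write-back cache reported the write of block 2 as finished even though the data never reached the disk.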

When PostgreSQL writes information to the WAL, and sometimes when it writes to the regular database files too, that information must be flushed to permanent storage in order for the database's protection against crash corruption to work. So, what happens if you have a write-back cache that says the write is complete when it really isn't? Drives that behave this way are often called lying drives, and the result can be very bad.
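As a rough illustration of what "flushed to permanent storage" means from an application's point of view, the following sketch writes a record and then asks the operating system to push it to the storage device with os.fsync. It is not PostgreSQL's own code; the function name durable_write and the file name are made up for the example. Whether the drive honors the request is outside the program's control, which is exactly the problem with a lying drive.

```python
import os
import tempfile


def durable_write(path, payload):
    """Write payload, then request that it be flushed to permanent storage."""
    with open(path, "wb") as f:
        f.write(payload)      # data may still sit in Python's own buffer
        f.flush()             # hand it to the operating system's cache
        os.fsync(f.fileno())  # ask the OS to flush it to the storage device


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "wal_record.bin")  # hypothetical file name
durable_write(demo_path, b"commit 42")

with open(demo_path, "rb") as f:
    read_back = f.read()
```

Reading the file back only shows that the data is visible to the operating system. It cannot prove the bytes are on the physical disk, because the program has to trust what the storage layer reports.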

If you have a system with a write-back cache and a system crash causes the contents of that write-back cache to be lost, this can corrupt a PostgreSQL database stored on that drive and make it unusable! You may discover that it takes expert intervention to even get the database to start again, and determining what data is damaged will be difficult.

Consider the case where you have committed a transaction. Details of that new transaction might be spread across two data blocks on the drive. Now, imagine that one of those made it to disk before the system crashed, but the other didn't. You've now left the database in a corrupted state: one block refers to a transaction that doesn't exist where it's supposed to in the other block.

Had all of the data blocks related to the WAL been written properly, the database WAL could correct this error after the crash. But the WAL protection only works if it gets honest information about whether data has been written to the disks properly, and lying write-back caches do not provide that.
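The two-block scenario and the role of the WAL can be sketched as a toy redo step. This is a deliberately simplified model, not PostgreSQL's recovery code: the function recover, the block contents, and the representation of the WAL as a list of (block number, data) pairs are all assumptions made for illustration.

```python
def recover(wal_on_disk, blocks_on_disk):
    """Redo every logged change; harmless for blocks that were already written."""
    recovered = dict(blocks_on_disk)
    for block_no, data in wal_on_disk:
        recovered[block_no] = data
    return recovered


# One committed transaction that touches two data blocks.
changes = [(1, "txn 7: debit"), (2, "txn 7: credit")]

# The crash happened after only block 1 reached the disk.
blocks_after_crash = {1: "txn 7: debit"}

# Honest storage: the WAL really was on disk before the commit was reported.
honest_wal = list(changes)
honest_result = recover(honest_wal, blocks_after_crash)

# Lying cache: the WAL write was reported complete but was lost in the crash.
lying_wal = []
lying_result = recover(lying_wal, blocks_after_crash)
```

With the honest log, replay restores both blocks. With the lost log, nothing remains to rebuild block 2 from, so the half-written transaction stays on disk.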