It is possible to force flushing of kernel buffers for output files. This is sometimes necessary if an application (e.g., a database journaling process) must ensure that output really has been written to the disk (or at least to the disk’s hardware cache) before continuing.
Before we describe the system calls used to control kernel buffering, it is useful to consider a few relevant definitions from SUSv3.
SUSv3 defines the term synchronized I/O completion to mean “an I/O operation that has either been successfully transferred [to the disk] or diagnosed as unsuccessful.”
SUSv3 defines two different types of synchronized I/O completion. The difference between the types involves the metadata (“data about data”) describing the file, which the kernel stores along with the data for a file. We consider file metadata in detail when we look at file i-nodes, but for now, it is sufficient to note that the file metadata includes information such as the file owner and group; file permissions; file size; number of (hard) links to the file; timestamps indicating the time of the last file access, last file modification, and last metadata change; and file data block pointers.
The first type of synchronized I/O completion defined by SUSv3 is synchronized I/O data integrity completion. This is concerned with ensuring that a file data update transfers sufficient information to allow a later retrieval of that data to proceed.
For a read operation, this means that the requested file data has been transferred (from the disk) to the process. If there were any pending write operations affecting the requested data, these are transferred to the disk before performing the read.
For a write operation, this means that the data specified in the write request has been transferred (to the disk) and all file metadata required to retrieve that data has also been transferred. The key point to note here is that not all modified file metadata attributes need to be transferred to allow the file data to be retrieved. An example of a modified file metadata attribute that would need to be transferred is the file size (if the write operation extended the file). By contrast, modified file timestamps would not need to be transferred to disk before a subsequent data retrieval could proceed.
The other type of synchronized I/O completion defined by SUSv3 is synchronized I/O file integrity completion, which is a superset of synchronized I/O data integrity completion. The difference with this mode of I/O completion is that during a file update, all updated file metadata is transferred to disk, even if it is not necessary for the operation of a subsequent read of the file data.
The fsync() system call causes the buffered data and all metadata associated with the open file descriptor fd to be flushed to disk. Calling fsync() forces the file to the synchronized I/O file integrity completion state.
#include <unistd.h>
int fsync(int fd);
Returns 0 on success, or -1 on error
An fsync() call returns only after the transfer to the disk device (or at least its cache) has completed.
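The following short program is a sketch of a typical use of fsync(); it is not one of the book’s example programs, and the file name and record text are invented for illustration. It appends a record to a file and then forces both the data and all associated metadata to disk before continuing:

#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    const char *rec = "journal record\n";   /* Arbitrary example data */
    int fd;

    fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd == -1) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    if (write(fd, rec, strlen(rec)) != (ssize_t) strlen(rec)) {
        perror("write");
        exit(EXIT_FAILURE);
    }

    /* Force the data and all metadata for 'fd' to the disk (synchronized
       I/O file integrity completion) before the program proceeds */
    if (fsync(fd) == -1) {
        perror("fsync");
        exit(EXIT_FAILURE);
    }

    if (close(fd) == -1) {
        perror("close");
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}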
The fdatasync() system call operates similarly to fsync(), but only forces the file to the synchronized I/O data integrity completion state.
#include <unistd.h>
int fdatasync(int fd);
Returns 0 on success, or -1 on error
Using fdatasync() potentially reduces the number of disk operations from the two required by fsync() to one. For example, if the file data has changed, but the file size has not, then calling fdatasync() only forces the data to be updated. (We noted above that changes to file metadata attributes such as the last modification timestamp don’t need to be transferred for synchronized I/O data completion.) By contrast, calling fsync() would also force the metadata to be transferred to disk.
Reducing the number of disk I/O operations in this manner is useful for certain applications in which performance is crucial and the accurate maintenance of certain metadata (such as timestamps) is not essential. This can make a considerable performance difference for applications that are making multiple file updates: because the file data and metadata normally reside on different parts of the disk, updating them both would require repeated seek operations backward and forward across the disk.
In Linux 2.2 and earlier, fdatasync() is implemented as a call to fsync(), and thus carries no performance gain.
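As a sketch of how an application might exploit this (the helper function below is hypothetical, not taken from the book’s source code), a program that repeatedly appends records, and that doesn’t require on-disk timestamps to be up to date, could call fdatasync() after each append:

#include <unistd.h>
#include <string.h>

/* Append 'rec' to the already open descriptor 'fd' and force only
   synchronized I/O data integrity completion; modified timestamps may be
   flushed later. Returns 0 on success, or -1 on error. */
int
append_record(int fd, const char *rec)
{
    size_t len = strlen(rec);

    if (write(fd, rec, len) != (ssize_t) len)
        return -1;

    return fdatasync(fd);
}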
Starting with kernel 2.6.17, Linux provides the nonstandard sync_file_range() system call, which allows more precise control than fdatasync() when flushing file data. The caller can specify the file region to be flushed, and specify flags controlling whether the system call blocks on disk writes. See the sync_file_range(2) manual page for further details.
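As an illustration only (a sketch based on the interface described in the sync_file_range(2) manual page; consult that page for the exact flag semantics), the following function initiates write-out of the first 64 kB of an open file and waits for the transfer to complete:

#define _GNU_SOURCE
#include <fcntl.h>

/* Flush the first 64 kB of the file referred to by 'fd' and wait for the
   transfer to complete; returns 0 on success, or -1 on error */
int
flush_head(int fd)
{
    return sync_file_range(fd, 0, 65536,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}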
The sync() system call causes all kernel buffers containing updated file information (i.e., data blocks, pointer blocks, metadata, and so on) to be flushed to disk.
#include <unistd.h>
void sync(void);
In the Linux implementation, sync() returns only after all data has been transferred to the disk device (or at least to its cache). However, SUSv3 permits an implementation of sync() to simply schedule the I/O transfer and return before it has completed.
A permanently running kernel thread ensures that modified kernel buffers are flushed to disk if they are not explicitly synchronized within 30 seconds. This is done to ensure that buffers don’t remain unsynchronized with the corresponding disk file (and thus vulnerable to loss in the event of a system crash) for long periods. In Linux 2.6, this task is performed by the pdflush kernel thread. (In Linux 2.4, it is performed by the kupdated kernel thread.)
The file /proc/sys/vm/dirty_expire_centisecs specifies the age (in hundredths of a second) that a dirty buffer must reach before it is flushed by pdflush. Additional files in the same directory control other aspects of the operation of pdflush.
Specifying the O_SYNC flag when calling open() makes all subsequent output synchronous:
fd = open(pathname, O_WRONLY | O_SYNC);
After this open() call, every write() to the file automatically flushes the file data and metadata to the disk (i.e., writes are performed according to synchronized I/O file integrity completion).
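Continuing the fragment above (buf, len, and numWritten are assumed to be defined by the surrounding code, as with pathname and fd), each subsequent write() then blocks until the transfer completes:

/* Because fd was opened with O_SYNC, this write() returns only after the
   data and metadata have been transferred to the disk (or its cache),
   so no explicit fsync() call is required */
numWritten = write(fd, buf, len);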
Older BSD systems used the O_FSYNC flag to provide O_SYNC functionality. In glibc, O_FSYNC is defined as a synonym for O_SYNC.
Using the O_SYNC flag (or making frequent calls to fsync(), fdatasync(), or sync()) can strongly affect performance. Table 13-3 shows the time required to write 1 million bytes to a newly created file (on an ext2 file system) for a range of buffer sizes with and without O_SYNC. The results were obtained (using the filebuff/write_bytes.c program provided in the source code distribution for this book) using a vanilla 2.6.30 kernel and an ext2 file system with a block size of 4096 bytes. Each row shows the average of 20 runs for the given buffer size.
As can be seen from the table, O_SYNC increases elapsed times enormously (in the 1-byte buffer case, by a factor of more than 1000). Note also the large differences between the elapsed and CPU times for writes with O_SYNC. This is a consequence of the program being blocked while each buffer is actually transferred to disk.
The results shown in Table 13-3 omit a further factor that affects performance when using O_SYNC. Modern disk drives have large internal caches, and by default, O_SYNC merely causes data to be transferred to the cache. If we disable caching on the disk (using the command hdparm -W0), then the performance impact of O_SYNC becomes even more extreme. In the 1-byte case, the elapsed time rises from 1030 seconds to around 16,000 seconds. In the 4096-byte case, the elapsed time rises from 0.34 seconds to 4 seconds.
In summary, if we need to force flushing of kernel buffers, we should consider whether we can design our application to use large write() buffer sizes or make judicious use of occasional calls to fsync() or fdatasync(), instead of using the O_SYNC flag when opening the file.
SUSv3 specifies two further open file status flags related to synchronized I/O: O_DSYNC and O_RSYNC.
The O_DSYNC flag causes writes to be performed according to the requirements of synchronized I/O data integrity completion (like fdatasync()). This contrasts with O_SYNC, which causes writes to be performed according to the requirements of synchronized I/O file integrity completion (like fsync()).
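By analogy with the O_SYNC fragment shown earlier (pathname is again assumed to be defined by the surrounding code), a file can be opened so that each write() provides only data integrity completion:

fd = open(pathname, O_WRONLY | O_DSYNC);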
The O_RSYNC flag is specified in conjunction with either O_SYNC or O_DSYNC, and extends the write behaviors of these flags to read operations. Specifying both O_RSYNC and O_DSYNC when opening a file means that all subsequent reads are completed according to the requirements of synchronized I/O data integrity (i.e., prior to performing the read, all pending file writes are completed as though carried out with O_DSYNC). Specifying both O_RSYNC and O_SYNC when opening a file means that all subsequent reads are completed according to the requirements of synchronized I/O file integrity (i.e., prior to performing the read, all pending file writes are completed as though carried out with O_SYNC).
Before kernel 2.6.33, the O_DSYNC and O_RSYNC flags were not implemented on Linux, and the glibc headers defined these constants to be the same as O_SYNC. (This isn’t actually correct in the case of O_RSYNC, since O_SYNC doesn’t provide any functionality for read operations.)
Starting with kernel 2.6.33, Linux implements O_DSYNC, and an implementation of O_RSYNC is likely to be added in a future kernel release.
Before kernel 2.6.33, Linux didn’t fully implement O_SYNC semantics. Instead, O_SYNC was implemented as O_DSYNC. To maintain consistent behavior for applications that were built for older kernels, applications that were linked against older versions of the GNU C library continue to provide O_DSYNC semantics for O_SYNC, even on Linux 2.6.33 and later.