Scatter-Gather I/O: readv() and writev()

The readv() and writev() system calls perform scatter-gather I/O.

#include <sys/uio.h>

ssize_t readv(int fd, const struct iovec *iov, int iovcnt);

Note

Returns number of bytes read, 0 on EOF, or -1 on error

ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

Note

Returns number of bytes written, or -1 on error

Instead of accepting a single buffer of data to be read or written, these functions transfer multiple buffers of data in a single system call. The set of buffers to be transferred is defined by the array iov. The integer iovcnt specifies the number of elements in iov. Each element of iov is a structure of the following form:

struct iovec {
    void  *iov_base;        /* Start address of buffer */
    size_t iov_len;         /* Number of bytes to transfer to/from buffer */
};

Note

SUSv3 allows an implementation to place a limit on the number of elements in iov. An implementation can advertise its limit by defining IOV_MAX in <limits.h> or at run time via the return from the call sysconf(_SC_IOV_MAX). (We describe sysconf() in Section 11.2.) SUSv3 requires that this limit be at least 16. On Linux, IOV_MAX is defined as 1024, which corresponds to the kernel’s limit on the size of this vector (defined by the kernel constant UIO_MAXIOV).

However, the glibc wrapper functions for readv() and writev() silently do some extra work. If the system call fails because iovcnt is too large, then the wrapper function temporarily allocates a single buffer large enough to hold all of the items described by iov and performs a read() or write() call (see the discussion below of how writev() could be implemented in terms of write()).
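The limit described in this note can be checked at run time. The following is a minimal sketch; iov_limit() is a hypothetical helper name of our own, and the fallback of 16 is simply the minimum that SUSv3 guarantees.

```c
#include <limits.h>
#include <unistd.h>

/* Return the limit on the number of elements in an iovec array.
   iov_limit() is a hypothetical helper, not a library function. */
long
iov_limit(void)
{
    long lim;

    lim = sysconf(_SC_IOV_MAX);
    if (lim == -1) {            /* Error, or limit is indeterminate */
#ifdef IOV_MAX
        lim = IOV_MAX;          /* Compile-time value from <limits.h> */
#else
        lim = 16;               /* Minimum guaranteed by SUSv3 */
#endif
    }
    return lim;
}
```

A caller with more buffers than this limit could split the array and issue several calls, at the cost of losing the single-transfer behavior of one readv() or writev().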

Figure 5-3 shows an example of the relationship between the iov and iovcnt arguments, and the buffers to which they refer.

Figure 5-3. Example of an iovec array and associated buffers

Scatter input

The readv() system call performs scatter input: it reads a contiguous sequence of bytes from the file referred to by the file descriptor fd and places (“scatters”) these bytes into the buffers specified by iov. Each of the buffers, starting with the one defined by iov[0], is completely filled before readv() proceeds to the next buffer.

An important property of readv() is that it completes atomically; that is, from the point of view of the calling process, the kernel performs a single data transfer between the file referred to by fd and user memory. This means, for example, that when reading from a file, we can be sure that the range of bytes read is contiguous, even if another process (or thread) sharing the same file offset attempts to manipulate the offset at the same time as the readv() call.

On successful completion, readv() returns the number of bytes read, or 0 if end-of-file was encountered. The caller must examine this count to verify whether all requested bytes were read. If insufficient data was available, then only some of the buffers may have been filled, and the last of these may be only partially filled.
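Because buffers are filled strictly in order, the return value alone tells us which buffers received data. The following sketch shows one way to work this out; buffersReached() is a hypothetical helper of our own, not part of any library.

```c
#include <sys/types.h>
#include <sys/uio.h>

/* Given the count returned by readv(), return the number of leading
   elements of iov that the transfer reached, and store in *lastLen the
   number of bytes placed in the last of them. Zero-length elements that
   precede the last filled buffer are included in the returned count.
   buffersReached() is a hypothetical helper. */
int
buffersReached(const struct iovec *iov, int iovcnt, ssize_t numRead,
               size_t *lastLen)
{
    size_t remaining;
    int j;

    remaining = (numRead > 0) ? (size_t) numRead : 0;
    *lastLen = 0;

    for (j = 0; j < iovcnt && remaining > 0; j++) {
        *lastLen = (remaining < iov[j].iov_len) ? remaining : iov[j].iov_len;
        remaining -= *lastLen;
    }
    return j;
}
```

For instance, with three buffers of 4, 4, and 8 bytes and a return value of 6, the helper reports that two buffers were reached and that the second holds 2 bytes.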

Example 5-2 demonstrates the use of readv().

Note

Using the prefix t_ followed by a function name as the name of an example program (e.g., t_readv.c in Example 5-2) is our way of indicating that the program primarily demonstrates the use of a single system call or library function.

Example 5-2. Performing scatter input with readv()

fileio/t_readv.c
#include <sys/stat.h>
#include <sys/uio.h>
#include <fcntl.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    int fd;
    struct iovec iov[3];
    struct stat myStruct;       /* First buffer */
    int x;                      /* Second buffer */
#define STR_SIZE 100
    char str[STR_SIZE];         /* Third buffer */
    ssize_t numRead, totRequired;

    if (argc != 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s file\n", argv[0]);

    fd = open(argv[1], O_RDONLY);
    if (fd == -1)
        errExit("open");

    totRequired = 0;

    iov[0].iov_base = &myStruct;
    iov[0].iov_len = sizeof(struct stat);
    totRequired += iov[0].iov_len;

    iov[1].iov_base = &x;
    iov[1].iov_len = sizeof(x);
    totRequired += iov[1].iov_len;

    iov[2].iov_base = str;
    iov[2].iov_len = STR_SIZE;
    totRequired += iov[2].iov_len;

    numRead = readv(fd, iov, 3);
    if (numRead == -1)
        errExit("readv");

    if (numRead < totRequired)
        printf("Read fewer bytes than requested\n");

    printf("total bytes requested: %ld; bytes read: %ld\n",
            (long) totRequired, (long) numRead);
    exit(EXIT_SUCCESS);
}
     fileio/t_readv.c

Gather output

The writev() system call performs gather output. It concatenates (“gathers”) data from all of the buffers specified by iov and writes it as a sequence of contiguous bytes to the file referred to by the file descriptor fd. The buffers are gathered in array order, starting with the buffer defined by iov[0].
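As a brief illustration of gather output, the following sketch writes a tag, a message, and a trailing newline with one call. The function name writeRecord() and its record layout are hypothetical, chosen only for this example.

```c
#include <string.h>
#include <sys/uio.h>

/* Write "tag", then "msg", then a newline, using a single writev() call.
   Returns whatever writev() returns. writeRecord() is a hypothetical
   helper. The casts discard const; writev() only reads from the buffers. */
ssize_t
writeRecord(int fd, const char *tag, const char *msg)
{
    struct iovec iov[3];

    iov[0].iov_base = (void *) tag;
    iov[0].iov_len = strlen(tag);

    iov[1].iov_base = (void *) msg;
    iov[1].iov_len = strlen(msg);

    iov[2].iov_base = (void *) "\n";
    iov[2].iov_len = 1;

    return writev(fd, iov, 3);
}
```

The three pieces reach the file in array order, so a call such as writeRecord(fd, "warn: ", "disk full") produces the line "warn: disk full".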

Like readv(), writev() completes atomically, with all data being transferred in a single operation from user memory to the file referred to by fd. Thus, when writing to a regular file, we can be sure that all of the requested data is written contiguously to the file, rather than being interspersed with writes by other processes (or threads).

As with write(), a partial write is possible. Therefore, we must check the return value from writev() to see if all requested bytes were written.
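One possible way of handling a partial write is to advance through the iovec array and call writev() again for the remainder. The sketch below assumes that the caller permits its iov array to be modified; writevAll() is a hypothetical name. Note that once a retry is needed, the output is no longer the result of a single transfer.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Call writev() repeatedly until all data described by iov is written.
   Modifies the elements of the caller's iov array. Returns 0 on success,
   or -1 on error (with errno set by writev()).
   writevAll() is a hypothetical helper. */
int
writevAll(int fd, struct iovec *iov, int iovcnt)
{
    ssize_t n;

    while (iovcnt > 0) {
        n = writev(fd, iov, iovcnt);
        if (n == -1) {
            if (errno == EINTR)
                continue;               /* Interrupted; try again */
            return -1;
        }

        /* Skip over buffers that were completely written */

        while (iovcnt > 0 && (size_t) n >= iov->iov_len) {
            n -= iov->iov_len;
            iov++;
            iovcnt--;
        }

        /* Adjust a partially written buffer to cover only what remains */

        if (iovcnt > 0) {
            iov->iov_base = (char *) iov->iov_base + n;
            iov->iov_len -= n;
        }
    }
    return 0;
}
```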

The primary advantages of readv() and writev() are convenience and speed. For example, we could replace a call to writev() by either:

code that allocates a single large buffer, copies the data to be written from other locations in the process’s address space into that buffer, and then calls write() to output the buffer; or
a series of write() calls that output the buffers individually.

The first of these options, while semantically equivalent to using writev(), leaves us with the inconvenience (and inefficiency) of allocating buffers and copying data in user space.
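To make the cost of the first option concrete, here is a minimal sketch of it. The name writevViaWrite() is hypothetical, and this is an illustration under our own assumptions rather than the code that any C library actually uses.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Emulate writev() by copying all buffers into one temporary buffer and
   calling write() once. Returns the result of write(), or -1 if the
   temporary buffer can't be allocated.
   writevViaWrite() is a hypothetical helper. */
ssize_t
writevViaWrite(int fd, const struct iovec *iov, int iovcnt)
{
    size_t total, off;
    ssize_t numWritten;
    char *buf;
    int j, savedErrno;

    total = 0;
    for (j = 0; j < iovcnt; j++)
        total += iov[j].iov_len;

    buf = malloc(total > 0 ? total : 1);    /* Extra allocation... */
    if (buf == NULL)
        return -1;

    off = 0;
    for (j = 0; j < iovcnt; j++) {          /* ...and extra copying */
        if (iov[j].iov_len > 0)
            memcpy(buf + off, iov[j].iov_base, iov[j].iov_len);
        off += iov[j].iov_len;
    }

    numWritten = write(fd, buf, total);

    savedErrno = errno;                     /* free() may change errno */
    free(buf);
    errno = savedErrno;

    return numWritten;
}
```

The malloc() and memcpy() calls are exactly the overhead that writev() lets us avoid.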

The second option is not semantically equivalent to a single call to writev(), since the write() calls are not performed atomically. Furthermore, performing a single writev() system call is cheaper than performing multiple write() calls (refer to the discussion of system call overhead in the section “System Calls”).

Performing scatter-gather I/O at a specified offset

Linux 2.6.30 adds two new system calls that combine scatter-gather I/O functionality with the ability to perform the I/O at a specified offset: preadv() and pwritev(). These system calls are nonstandard, but are also available on the modern BSDs.

#define _DEFAULT_SOURCE         /* glibc 2.19 and earlier: _BSD_SOURCE */
#include <sys/uio.h>

ssize_t preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset);

Note

Returns number of bytes read, 0 on EOF, or -1 on error

ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset);

Note

Returns number of bytes written, or -1 on error

The preadv() and pwritev() system calls perform the same task as readv() and writev(), but perform the I/O at the file location specified by offset (like pread() and pwrite()).

These system calls are useful for applications (e.g., multithreaded applications) that want to combine the benefits of scatter-gather I/O with the ability to perform I/O at a location that is independent of the current file offset.
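The following sketch shows how these calls might be used to store and fetch a record at a fixed location without touching the file offset. The recHeader structure and the putRecord() and getRecord() functions are hypothetical, invented for this example; we define _DEFAULT_SOURCE on the assumption of a recent glibc.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE         /* Older glibc versions: _BSD_SOURCE */
#endif
#include <sys/types.h>
#include <sys/uio.h>

struct recHeader {              /* Hypothetical record header */
    int id;
    int len;                    /* Number of payload bytes */
};

/* Write header and payload at 'offset' with one pwritev() call */
ssize_t
putRecord(int fd, off_t offset, const struct recHeader *hdr,
          const char *payload, size_t payloadLen)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *) hdr;     /* pwritev() only reads these */
    iov[0].iov_len = sizeof(*hdr);
    iov[1].iov_base = (void *) payload;
    iov[1].iov_len = payloadLen;

    return pwritev(fd, iov, 2, offset);
}

/* Read header and payload from 'offset' with one preadv() call */
ssize_t
getRecord(int fd, off_t offset, struct recHeader *hdr,
          char *payload, size_t payloadLen)
{
    struct iovec iov[2];

    iov[0].iov_base = hdr;
    iov[0].iov_len = sizeof(*hdr);
    iov[1].iov_base = payload;
    iov[1].iov_len = payloadLen;

    return preadv(fd, iov, 2, offset);
}
```

Since neither function changes the file offset, several threads sharing fd could each work on a different offset without needing to coordinate calls to lseek(). As with readv() and writev(), the caller must still check the returned byte count.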