The epoll API

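Like the I/O multiplexing system calls and signal-driven I/O, the Linux epoll (event poll) API is used to monitor multiple file descriptors to see if they are ready for I/O. The primary advantages of the epoll API are the following:

  • The performance of epoll scales much better than that of select() and poll() when monitoring large numbers of file descriptors.

  • The epoll API permits either level-triggered or edge-triggered notification. By contrast, select() and poll() provide only level-triggered notification, and signal-driven I/O provides only edge-triggered notification.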

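The performance of epoll and signal-driven I/O is similar. However, epoll has some advantages over signal-driven I/O:

  • We avoid the complexities of signal handling (e.g., signal-queue overflow).

  • We have greater flexibility in specifying what kind of monitoring we want to carry out (e.g., checking whether a file descriptor for a socket is ready for reading, writing, or both).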

The epoll API is Linux-specific, and is new in Linux 2.6.

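The central data structure of the epoll API is an epoll instance, which is referred to via an open file descriptor. This file descriptor is not used for I/O. Instead, it is a handle for kernel data structures that serve two purposes: recording the list of file descriptors that the process has declared an interest in monitoring (the interest list), and maintaining the list of file descriptors that are ready for I/O (the ready list).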

The membership of the ready list is a subset of the interest list.

For each file descriptor monitored by epoll, we can specify a bit mask indicating events that we are interested in knowing about. These bit masks correspond closely to the bit masks used with poll().

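The epoll API consists of three system calls: epoll_create(), which creates an epoll instance; epoll_ctl(), which manipulates the interest list of an epoll instance; and epoll_wait(), which returns items from the ready list of an epoll instance.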

The epoll_create() system call creates a new epoll instance whose interest list is initially empty.

#include <sys/epoll.h>

int epoll_create(int size);

Note

Returns file descriptor on success, or -1 on error

The size argument specifies the number of file descriptors that we expect to monitor via the epoll instance. This argument is not an upper limit, but rather a hint to the kernel about how to initially dimension internal data structures. (Since Linux 2.6.8, the size argument must be greater than zero but is otherwise ignored, because changes in the implementation meant that the information it provided is no longer required.)

As its function result, epoll_create() returns a file descriptor referring to the new epoll instance. This file descriptor is used to refer to the epoll instance in other epoll system calls. When the file descriptor is no longer required, it should be closed in the usual way, using close(). When all file descriptors referring to an epoll instance are closed, the instance is destroyed and its associated resources are released back to the system. (Multiple file descriptors may refer to the same epoll instance as a consequence of calls to fork() or descriptor duplication using dup() or similar.)
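The life cycle described above can be sketched as follows. This is only an illustration: the helper name epoll_lifecycle_demo() is hypothetical and not part of the epoll API. It creates an instance, duplicates the descriptor with dup(), and then closes both descriptors, after which the kernel is free to release the instance.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (not part of the epoll API). Returns 0 on success,
   or -1 if any call fails. */
int
epoll_lifecycle_demo(void)
{
    int epfd, dupfd;

    epfd = epoll_create(1);         /* 'size' must be > 0; otherwise ignored */
    if (epfd == -1)
        return -1;

    dupfd = dup(epfd);              /* Second descriptor, same epoll instance */
    if (dupfd == -1) {
        close(epfd);
        return -1;
    }

    /* The instance persists until the last descriptor that refers to it
       has been closed */

    if (close(epfd) == -1) {
        close(dupfd);
        return -1;
    }
    return close(dupfd);
}
```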

The epoll_ctl() system call modifies the interest list of the epoll instance referred to by the file descriptor epfd.

#include <sys/epoll.h>

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *ev);

Note

Returns 0 on success, or -1 on error

The fd argument identifies which of the file descriptors in the interest list is to have its settings modified. This argument can be a file descriptor for a pipe, FIFO, socket, POSIX message queue, inotify instance, terminal, device, or even another epoll descriptor (i.e., we can build a kind of hierarchy of monitored descriptors). However, fd can’t be a file descriptor for a regular file or a directory (the error EPERM results).
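The EPERM restriction is easy to observe. The following sketch, which uses the hypothetical helper name unpollable_add_errno(), tries to add a descriptor for the root directory to a new epoll instance and reports the resulting errno value. A directory is used here only because "/" can be opened on any system; a regular file should behave the same way.

```c
#include <sys/epoll.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to add a descriptor for the directory "/" to a
   new epoll instance. Returns the errno value from epoll_ctl(), 0 if the
   add unexpectedly succeeds, or -1 if setup fails. */
int
unpollable_add_errno(void)
{
    int fd, epfd, result;
    struct epoll_event ev;

    fd = open("/", O_RDONLY);
    if (fd == -1)
        return -1;

    epfd = epoll_create(1);
    if (epfd == -1) {
        close(fd);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    result = 0;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
        result = errno;             /* Expected to be EPERM */

    close(epfd);
    close(fd);
    return result;
}
```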

The op argument specifies the operation to be performed, and has one of the following values:

EPOLL_CTL_ADD

Add the file descriptor fd to the interest list for epfd. The set of events that we are interested in monitoring for fd is specified in the buffer pointed to by ev, as described below. If we attempt to add a file descriptor that is already in the interest list, epoll_ctl() fails with the error EEXIST.

EPOLL_CTL_MOD

Modify the events setting for the file descriptor fd, using the information specified in the buffer pointed to by ev. If we attempt to modify the settings of a file descriptor that is not in the interest list for epfd, epoll_ctl() fails with the error ENOENT.

EPOLL_CTL_DEL

Remove the file descriptor fd from the interest list for epfd. The ev argument is ignored for this operation. If we attempt to remove a file descriptor that is not in the interest list for epfd, epoll_ctl() fails with the error ENOENT. Closing a file descriptor automatically removes it from all of the epoll interest lists of which it is a member.

The ev argument is a pointer to a structure of type epoll_event, defined as follows:

struct epoll_event {
    uint32_t     events;        /* epoll events (bit mask) */
    epoll_data_t data;          /* User data */
};

The data field of the epoll_event structure is typed as follows:

typedef union epoll_data {
    void        *ptr;           /* Pointer to user-defined data */
    int          fd;            /* File descriptor */
    uint32_t     u32;           /* 32-bit integer */
    uint64_t     u64;           /* 64-bit integer */
} epoll_data_t;

The ev argument specifies settings for the file descriptor fd, as follows:

  • The events subfield is a bit mask specifying the set of events that we are interested in monitoring for fd. We say more about the bit values that can be used in this field in the next section.

  • The data subfield is a union, one of whose members can be used to specify information that is passed back to the calling process (via epoll_wait()) if fd later becomes ready.

Example 63-4 shows an example of the use of epoll_create() and epoll_ctl().
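As a rough sketch of these calls (not a reproduction of the example referred to above), the following code exercises the three epoll_ctl() operations on the read end of a pipe and checks the error cases described earlier. The helper name interest_list_demo() is hypothetical.

```c
#include <sys/epoll.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: exercise EPOLL_CTL_ADD, EPOLL_CTL_MOD, and
   EPOLL_CTL_DEL on the read end of a pipe. Returns 0 if every call
   behaves as described in the text, or -1 otherwise. */
int
interest_list_demo(void)
{
    int pfd[2], epfd, ok;
    struct epoll_event ev;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create(1);
    if (epfd == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];            /* Passed back later by epoll_wait() */

    /* Modifying a descriptor that is not yet in the interest list fails */
    ok = epoll_ctl(epfd, EPOLL_CTL_MOD, pfd[0], &ev) == -1 && errno == ENOENT;

    /* The first add succeeds; a second add of the same descriptor fails */
    ok = ok && epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0;
    ok = ok && epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1 &&
            errno == EEXIST;

    /* Change the event mask of the existing entry */
    ev.events = EPOLLIN | EPOLLET;
    ok = ok && epoll_ctl(epfd, EPOLL_CTL_MOD, pfd[0], &ev) == 0;

    /* Remove the entry ('ev' is ignored); a second removal fails */
    ok = ok && epoll_ctl(epfd, EPOLL_CTL_DEL, pfd[0], &ev) == 0;
    ok = ok && epoll_ctl(epfd, EPOLL_CTL_DEL, pfd[0], &ev) == -1 &&
            errno == ENOENT;

    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return ok ? 0 : -1;
}
```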

The epoll_wait() system call returns information about ready file descriptors from the epoll instance referred to by the file descriptor epfd. A single epoll_wait() call can return information about multiple ready file descriptors.

#include <sys/epoll.h>

int epoll_wait(int epfd, struct epoll_event *evlist, int maxevents, int timeout);

Note

Returns number of ready file descriptors, 0 on timeout, or -1 on error

The information about ready file descriptors is returned in the array of epoll_event structures pointed to by evlist. (The epoll_event structure was described in the previous section.) The evlist array is allocated by the caller, and the number of elements it contains is specified in maxevents.

Each item in the array evlist returns information about a single ready file descriptor. The events subfield returns a mask of the events that have occurred on this descriptor. The data subfield returns whatever value was specified in ev.data when we registered interest in this file descriptor using epoll_ctl(). Note that the data field provides the only mechanism for finding out the number of the file descriptor associated with this event. Thus, when we make the epoll_ctl() call that places a file descriptor in the interest list, we should either set ev.data.fd to the file descriptor number (as shown in Example 63-4) or set ev.data.ptr to point to a structure that contains the file descriptor number.
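The second option, storing a pointer in ev.data.ptr, might look like the following sketch. The structure fd_info and the helper data_ptr_demo() are hypothetical names chosen for illustration. Because data is a union, we can set either ptr or fd, but not both, which is why the structure carries the file descriptor number itself.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor record; the names are illustrative only */
struct fd_info {
    int fd;                         /* Descriptor being monitored */
    const char *label;              /* Any other application data */
};

/* Returns 0 if epoll_wait() hands back the pointer that was registered,
   or -1 otherwise */
int
data_ptr_demo(void)
{
    int pfd[2], epfd, result;
    struct fd_info info;
    struct fd_info *got;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create(1);
    if (epfd == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    info.fd = pfd[0];
    info.label = "pipe read end";

    ev.events = EPOLLIN;
    ev.data.ptr = &info;            /* Used instead of ev.data.fd */

    result = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0 &&
            write(pfd[1], "x", 1) == 1 &&
            epoll_wait(epfd, &out, 1, 1000) == 1) {
        got = out.data.ptr;         /* The pointer we registered */
        if (got == &info && got->fd == pfd[0])
            result = 0;
    }

    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return result;
}
```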

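The timeout argument determines the blocking behavior of epoll_wait(), as follows:

  • If timeout equals -1, block until an event occurs for one of the file descriptors in the interest list for epfd or until a signal is caught.

  • If timeout equals 0, perform a nonblocking check to see which events are currently available on the file descriptors in the interest list for epfd.

  • If timeout is greater than 0, block for up to timeout milliseconds, until an event occurs on one of the file descriptors in the interest list for epfd, or until a signal is caught.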

On success, epoll_wait() returns the number of items that have been placed in the array evlist, or 0 if no file descriptors were ready within the interval specified by timeout. On error, epoll_wait() returns -1, with errno set to indicate the error.
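The effect of the timeout argument can be seen in a small experiment. In the sketch below, a timeout of 0 requests a nonblocking check, a positive value is a limit in milliseconds, and -1 means block until an event occurs. The helper name timeout_demo() and the 50-millisecond value are arbitrary choices made for this illustration.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: record the results of three epoll_wait() calls on
   the read end of a pipe in results[0..2]. Returns 0 on success, or -1
   if setup fails. */
int
timeout_demo(int results[3])
{
    int pfd[2], epfd, status;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create(1);
    if (epfd == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    status = -1;
    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0) {

        /* timeout == 0: nonblocking check; the pipe is still empty */
        results[0] = epoll_wait(epfd, &out, 1, 0);

        /* timeout > 0: block for at most 50 milliseconds */
        results[1] = epoll_wait(epfd, &out, 1, 50);

        /* timeout == -1: block until an event occurs. Input is already
           waiting in the pipe, so the call returns at once. */
        if (write(pfd[1], "x", 1) == 1) {
            results[2] = epoll_wait(epfd, &out, 1, -1);
            status = 0;
        }
    }

    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return status;
}
```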

In a multithreaded program, it is possible for one thread to use epoll_ctl() to add file descriptors to the interest list of an epoll instance that is already being monitored by epoll_wait() in another thread. These changes to the interest list will be taken into account immediately, and the epoll_wait() call will return readiness information about the newly added file descriptors.

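Example 63-5 demonstrates the use of the epoll API. As command-line arguments, this program expects the pathnames of one or more terminals or FIFOs. The program performs the following steps:

  1. Create an epoll instance.

  2. Open each of the files named on the command line for input and add the resulting file descriptor to the interest list of the epoll instance, specifying EPOLLIN as the set of events to be monitored.

  3. Execute a loop that calls epoll_wait() to monitor the interest list and handles the events returned by each call. If epoll_wait() is interrupted by a signal (EINTR), the call is restarted. For each returned item, the program displays the descriptor number and the event mask, reads and displays some input if EPOLLIN is set, and otherwise closes the descriptor if EPOLLHUP or EPOLLERR is set. The loop terminates when all of the file descriptors have been closed.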

The following shell session logs demonstrate the use of the program in Example 63-5. We use two terminal windows. In one window, we use the program in Example 63-5 to monitor two FIFOs for input. (Each open of a FIFO for reading by this program will complete only after another process has opened the FIFO for writing, as described in Section 44.7.) In the other window, we run instances of cat(1) that write data to these FIFOs.

Terminal window 1                   Terminal window 2
$ mkfifo p q
$ ./epoll_input p q
                                    $ cat > p
Opened "p" on fd 4
                                    Type Control-Z to suspend cat
                                    [1]+  Stopped      cat >p
                                    $ cat > q
Opened "q" on fd 5
About to epoll_wait()
Type Control-Z to suspend the epoll_input program
[1]+  Stopped     ./epoll_input p q

Above, we suspended our monitoring program so that we can now generate input on both FIFOs, and close the write end of one of them:

                                    qqq
                                    Type Control-D to terminate “cat > q”
                                    $ fg %1
                                    cat >p
                                    ppp

Now we resume our monitoring program by bringing it into the foreground, at which point epoll_wait() returns two events:

$ fg
./epoll_input p q
About to epoll_wait()
Ready: 2
  fd=4; events: EPOLLIN
    read 4 bytes: ppp

  fd=5; events: EPOLLIN EPOLLHUP
    read 4 bytes: qqq

    closing fd 5
About to epoll_wait()

The two blank lines in the above output are the newlines that were read by the instances of cat, written to the FIFOs, and then read and echoed by our monitoring program.

Now we type Control-D in the second terminal window in order to terminate the remaining instance of cat, which causes epoll_wait() to once more return, this time with a single event:

                                    Type Control-D to terminate “cat >p”
Ready: 1
  fd=4; events: EPOLLHUP
    closing fd 4
All file descriptors closed; bye

Example 63-5. Using the epoll API

altio/epoll_input.c
    #include <sys/epoll.h>
    #include <fcntl.h>
    #include "tlpi_hdr.h"

    #define MAX_BUF     1000        /* Maximum bytes fetched by a single read() */
    #define MAX_EVENTS     5        /* Maximum number of events to be returned from
                                       a single epoll_wait() call */

    int
    main(int argc, char *argv[])
    {
        int epfd, ready, fd, s, j, numOpenFds;
        struct epoll_event ev;
        struct epoll_event evlist[MAX_EVENTS];
        char buf[MAX_BUF];

        if (argc < 2 || strcmp(argv[1], "--help") == 0)
            usageErr("%s file...\n", argv[0]);

        epfd = epoll_create(argc - 1);
        if (epfd == -1)
            errExit("epoll_create");

        /* Open each file on command line, and add it to the "interest
           list" for the epoll instance */

        for (j = 1; j < argc; j++) {
            fd = open(argv[j], O_RDONLY);
            if (fd == -1)
                errExit("open");
            printf("Opened \"%s\" on fd %d\n", argv[j], fd);

            ev.events = EPOLLIN;            /* Only interested in input events */
            ev.data.fd = fd;
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
                errExit("epoll_ctl");
        }

        numOpenFds = argc - 1;

        while (numOpenFds > 0) {

            /* Fetch up to MAX_EVENTS items from the ready list */

            printf("About to epoll_wait()\n");
            ready = epoll_wait(epfd, evlist, MAX_EVENTS, -1);
            if (ready == -1) {
                if (errno == EINTR)
                    continue;               /* Restart if interrupted by signal */
                else
                    errExit("epoll_wait");
            }

            printf("Ready: %d\n", ready);

            /* Deal with returned list of events */

            for (j = 0; j < ready; j++) {
                printf("  fd=%d; events: %s%s%s\n", evlist[j].data.fd,
                        (evlist[j].events & EPOLLIN)  ? "EPOLLIN "  : "",
                        (evlist[j].events & EPOLLHUP) ? "EPOLLHUP " : "",
                        (evlist[j].events & EPOLLERR) ? "EPOLLERR " : "");

                if (evlist[j].events & EPOLLIN) {
                    s = read(evlist[j].data.fd, buf, MAX_BUF);
                    if (s == -1)
                        errExit("read");
                    printf("    read %d bytes: %.*s\n", s, s, buf);

                } else if (evlist[j].events & (EPOLLHUP | EPOLLERR)) {

                    /* If EPOLLIN and EPOLLHUP were both set, then there might
                       be more than MAX_BUF bytes to read. Therefore, we close
                       the file descriptor only if EPOLLIN was not set.
                       We'll read further bytes after the next epoll_wait(). */

                    printf("    closing fd %d\n", evlist[j].data.fd);
                    if (close(evlist[j].data.fd) == -1)
                        errExit("close");
                    numOpenFds--;
                }
            }
        }

        printf("All file descriptors closed; bye\n");
        exit(EXIT_SUCCESS);
    }

          altio/epoll_input.c

We now look at some subtleties of the interaction of open files, file descriptors, and epoll. For the purposes of this discussion, it is worth reviewing Figure 5-2 (page 95), which shows the relationship between file descriptors, open file descriptions, and the system-wide file i-node table.

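When we create an epoll instance using epoll_create(), the kernel creates a new in-memory i-node and open file description, and allocates a new file descriptor in the calling process that refers to the open file description. The interest list for an epoll instance is associated with the open file description, not with the epoll file descriptor. This has the following consequences:

  • If we duplicate an epoll file descriptor using dup() (or similar), then the duplicated descriptor refers to the same epoll interest and ready lists as the original descriptor. We may modify the interest list, or retrieve items from the ready list, by specifying either descriptor as the epfd argument in a call to epoll_ctl() or epoll_wait().

  • The preceding point also applies after a call to fork(). The child inherits a duplicate of the parent's epoll file descriptor, and this duplicate refers to the same epoll data structures.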

When we perform an epoll_ctl() EPOLL_CTL_ADD operation, the kernel adds an item to the epoll interest list that records both the number of the monitored file descriptor and a reference to the corresponding open file description. For the purpose of epoll_wait() calls, the kernel monitors the open file description. This means that we must refine our earlier statement that when a file descriptor is closed, it is automatically removed from any epoll interest lists of which it is a member. The refinement is this: an open file description is removed from the epoll interest list once all file descriptors that refer to it have been closed. This means that if we create duplicate descriptors referring to an open file—using dup() (or similar) or fork()—then the open file will be removed only after the original descriptor and all of the duplicates have been closed.

These semantics can lead to some behavior that at first appears surprising. Suppose that we execute the code shown in Example 63-6. The epoll_wait() call in this code will tell us that the file descriptor fd1 is ready (in other words, evlist[0].data.fd will be equal to fd1), even though fd1 has been closed. This is because there is still one open file descriptor, fd2, referring to the open file description contained in the epoll interest list. A similar scenario occurs when two processes hold duplicate descriptors for the same open file description (typically, as a result of a fork()), and the process performing the epoll_wait() has closed its file descriptor, but the other process still holds the duplicate descriptor open.
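The scenario just described can be reproduced with a few calls. The following is a sketch written for this discussion, using a pipe as the monitored file and the hypothetical helper name closed_fd_still_reported(); it is not the listing that the text refers to.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if epoll_wait() reports 'fd1' as ready
   after 'fd1' has been closed, or -1 otherwise */
int
closed_fd_still_reported(void)
{
    int pfd[2], epfd, fd1, fd2, result;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    fd1 = pfd[0];                   /* Read end of the pipe */

    epfd = epoll_create(1);
    fd2 = dup(fd1);                 /* Same open file description as fd1 */
    if (epfd == -1 || fd2 == -1) {
        if (epfd != -1)
            close(epfd);
        if (fd2 != -1)
            close(fd2);
        close(fd1);
        close(pfd[1]);
        return -1;
    }

    result = -1;
    ev.events = EPOLLIN;
    ev.data.fd = fd1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd1, &ev) == 0) {

        /* Closing fd1 does not remove the interest list entry, because
           fd2 still refers to the same open file description */
        close(fd1);

        /* Make the pipe readable and see what epoll_wait() reports */
        if (write(pfd[1], "x", 1) == 1 &&
                epoll_wait(epfd, &out, 1, 1000) == 1 &&
                out.data.fd == fd1)
            result = 0;
    } else {
        close(fd1);
    }

    close(epfd);
    close(fd2);
    close(pfd[1]);
    return result;
}
```

A program that wants the entry removed while a duplicate remains open can call epoll_ctl() with EPOLL_CTL_DEL before closing the descriptor.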

Table 63-9 shows the results (on Linux 2.6.25) when we monitor N contiguous file descriptors in the range 0 to N - 1 using poll(), select(), and epoll. (The test was arranged such that during each monitoring operation, exactly one randomly selected file descriptor is ready.) From this table, we see that as the number of file descriptors to be monitored grows large, poll() and select() perform poorly. By contrast, the performance of epoll hardly declines as N grows large. (The small decline in performance as N increases is possibly a result of reaching CPU caching limits on the test system.)

In Problems with select() and poll(), we saw why select() and poll() perform poorly when monitoring large numbers of file descriptors. We now look at the reasons why epoll performs better:

  • On each call to select() or poll(), the kernel must check all of the file descriptors specified in the call. By contrast, when we mark a descriptor to be monitored with epoll_ctl(), the kernel records this fact in a list associated with the underlying open file description, and whenever an I/O operation that makes the file descriptor ready is performed, the kernel adds an item to the ready list for the epoll descriptor. (An I/O event on a single open file description may cause multiple file descriptors associated with that description to become ready.) Subsequent epoll_wait() calls simply fetch items from the ready list.

  • Each time we call select() or poll(), we pass a data structure to the kernel that identifies all of the file descriptors that are to be monitored, and, on return, the kernel passes back a data structure describing the readiness of all of these descriptors. By contrast, with epoll, we use epoll_ctl() to build up a data structure in kernel space that lists the set of file descriptors to be monitored. Once this data structure has been built, each later call to epoll_wait() doesn’t need to pass any information about file descriptors to the kernel, and the call returns information about only those descriptors that are ready.

Note

In addition to the above points, for select(), we must initialize the input data structure prior to each call, and for both select() and poll(), we must inspect the returned data structure to find out which of the N file descriptors are ready. However, some testing showed that the time required for these other steps was insignificant compared to the time required for the system call to monitor N descriptors. Table 63-9 doesn’t include the times for the inspection step.

Very roughly, we can say that for large values of N (the number of file descriptors being monitored), the performance of select() and poll() scales linearly with N. We start to see this behavior for the N = 100 and N = 1000 cases in Table 63-9. By the time we reach N = 10000, the scaling has actually become worse than linear.

By contrast, epoll scales (linearly) according to the number of I/O events that occur. The epoll API is thus particularly efficient in a scenario that is common in servers that handle many simultaneous clients: of the many file descriptors being monitored, most are idle; only a few descriptors are ready.
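To make the "mostly idle" scenario concrete, the following sketch registers the read ends of a number of pipes, makes only a few of them readable, and checks how many items epoll_wait() returns. The names count_ready() and NUM_PIPES are hypothetical, and the sketch says nothing about timing; it shows only that the returned array describes the ready descriptors and not the idle ones.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define NUM_PIPES 20                /* Hypothetical number of monitored pipes */

/* Hypothetical helper: monitor NUM_PIPES pipes, write one byte to the
   first 'numActive' of them, and return the number of items reported by a
   single epoll_wait() call (or -1 on error) */
int
count_ready(int numActive)
{
    int pfd[NUM_PIPES][2];
    int epfd, j, opened, ready;
    struct epoll_event ev;
    struct epoll_event evlist[NUM_PIPES];

    if (numActive < 0 || numActive > NUM_PIPES)
        return -1;
    epfd = epoll_create(NUM_PIPES);
    if (epfd == -1)
        return -1;

    /* Build the interest list once, before any monitoring takes place */
    for (opened = 0; opened < NUM_PIPES; opened++) {
        if (pipe(pfd[opened]) == -1)
            break;
        ev.events = EPOLLIN;
        ev.data.fd = pfd[opened][0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[opened][0], &ev) == -1) {
            close(pfd[opened][0]);
            close(pfd[opened][1]);
            break;
        }
    }

    ready = -1;
    if (opened == NUM_PIPES) {
        for (j = 0; j < numActive; j++)
            if (write(pfd[j][1], "x", 1) != 1)
                break;

        /* Only the descriptors that are ready appear in evlist */
        if (j == numActive)
            ready = epoll_wait(epfd, evlist, NUM_PIPES, 0);
    }

    for (j = 0; j < opened; j++) {
        close(pfd[j][0]);
        close(pfd[j][1]);
    }
    close(epfd);
    return ready;
}
```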

By default, the epoll mechanism provides level-triggered notification. By this, we mean that epoll tells us whether an I/O operation can be performed on a file descriptor without blocking. This is the same type of notification as is provided by poll() and select().

The epoll API also allows for edge-triggered notification—that is, a call to epoll_wait() tells us if there has been I/O activity on a file descriptor since the previous call to epoll_wait() (or since the descriptor was opened, if there was no previous call). Using epoll with edge-triggered notification is semantically similar to signal-driven I/O, except that if multiple I/O events occur, epoll coalesces them into a single notification returned via epoll_wait(); with signal-driven I/O, multiple signals may be generated.

To employ edge-triggered notification, we specify the EPOLLET flag in ev.events when calling epoll_ctl():

struct epoll_event ev;

ev.data.fd = fd;
ev.events = EPOLLIN | EPOLLET;
if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
    errExit("epoll_ctl");

We illustrate the difference between level-triggered and edge-triggered epoll notification using an example. Suppose that we are using epoll to monitor a socket for input (EPOLLIN), and the following steps occur:

  1. Input arrives on the socket.

  2. We perform an epoll_wait(). This call will tell us that the socket is ready, regardless of whether we are employing level-triggered or edge-triggered notification.

  3. We perform a second call to epoll_wait().

If we are employing level-triggered notification, then the second epoll_wait() call will inform us that the socket is ready. If we are employing edge-triggered notification, then the second call to epoll_wait() will block, because no new input has arrived since the previous call to epoll_wait().
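The three steps can be tried out on a pipe instead of a socket. In the following sketch, the hypothetical helper second_wait_result() makes input available, calls epoll_wait() twice without reading the input, and returns the result of the second call. A timeout of 0 is used so that, in the edge-triggered case, the second call returns 0 instead of blocking.

```c
#include <sys/epoll.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: 'extraFlags' is 0 for level-triggered notification
   or EPOLLET for edge-triggered notification. Returns the result of the
   second epoll_wait() call, or -1 if an earlier step fails. */
int
second_wait_result(uint32_t extraFlags)
{
    int pfd[2], epfd, result;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create(1);
    if (epfd == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    result = -1;
    ev.events = EPOLLIN | extraFlags;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0 &&
            write(pfd[1], "x", 1) == 1 &&           /* Step 1: input arrives */
            epoll_wait(epfd, &out, 1, 0) == 1)      /* Step 2: reported ready */
        result = epoll_wait(epfd, &out, 1, 0);      /* Step 3: second call */

    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return result;
}
```

With level-triggered notification the unread byte keeps the descriptor ready, so the second call again returns 1. With EPOLLET it returns 0, because nothing new has arrived.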

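Edge-triggered notification is usually employed in conjunction with nonblocking file descriptors. Thus, the general framework for using edge-triggered epoll notification is as follows:

  1. Make all file descriptors that are to be monitored nonblocking.

  2. Build the epoll interest list using epoll_ctl().

  3. Handle I/O events in a loop: retrieve a list of ready descriptors using epoll_wait() and, for each ready descriptor, perform I/O until the relevant system call (e.g., read(), write(), recv(), send(), or accept()) fails with the error EAGAIN or EWOULDBLOCK.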
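With edge-triggered notification and a nonblocking descriptor, the usual approach is to perform I/O until the call fails with EAGAIN or EWOULDBLOCK, since no further notification arrives for input that is already waiting. A minimal sketch of the input side, in which set_nonblocking() and drain_fd() are hypothetical helper names and the input is simply counted and discarded:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: enable O_NONBLOCK on 'fd'. Returns 0 on success,
   or -1 on error. */
int
set_nonblocking(int fd)
{
    int flags;

    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) ? -1 : 0;
}

/* Hypothetical helper: read from a nonblocking 'fd' until no more input
   is available. Returns the number of bytes consumed, or -1 on error. */
long
drain_fd(int fd)
{
    char buf[4096];
    long total;
    ssize_t numRead;

    total = 0;
    for (;;) {
        numRead = read(fd, buf, sizeof(buf));
        if (numRead > 0) {
            total += numRead;       /* A real program would process 'buf' */
        } else if (numRead == 0) {
            break;                  /* End-of-file */
        } else if (errno == EINTR) {
            continue;               /* Interrupted by a signal; retry */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                  /* Input exhausted for now */
        } else {
            return -1;
        }
    }
    return total;
}
```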

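Suppose that we are monitoring multiple file descriptors using edge-triggered notification, and that a ready file descriptor has a large amount (perhaps an endless stream) of input available. If, after detecting that this file descriptor is ready, we attempt to consume all of the input using nonblocking reads, then we risk starving the other file descriptors of attention (i.e., it may be a long time before we again check them for readiness and perform I/O on them). One solution to this problem is for the application to maintain a list of file descriptors that have been notified as being ready, and execute a loop that continuously performs the following actions:

  1. Monitor the file descriptors using epoll_wait() and add ready descriptors to the application list. If any descriptors are already registered as ready in the application list, the timeout for this step should be small or 0, so that the application can quickly proceed to servicing the descriptors that are already known to be ready.

  2. Perform a limited amount of I/O on the descriptors registered as ready in the application list, perhaps cycling through them in round-robin fashion. A descriptor can be removed from the application list when the relevant nonblocking I/O system call fails with the error EAGAIN or EWOULDBLOCK.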

Although it requires extra programming work, this approach offers other benefits in addition to preventing file-descriptor starvation. For example, we can include other steps in the above loop, such as handling timers and accepting signals with sigwaitinfo() (or similar).
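One possible shape for such a loop is sketched below. It assumes that every monitored descriptor is nonblocking and registered with EPOLLIN | EPOLLET, with ev.data.fd set to the descriptor number. The names add_nonblocking_et(), fair_service(), MAX_READY, and CHUNK are hypothetical, and the sketch simply discards the input that it reads. It returns when no descriptor is known to be ready, where a long-running server would instead block in epoll_wait().

```c
#include <sys/epoll.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_READY 64                /* Hypothetical cap on tracked descriptors */
#define CHUNK     1024              /* Hypothetical per-turn read limit */

/* Hypothetical helper: make 'fd' nonblocking and register it for
   edge-triggered input notification */
int
add_nonblocking_et(int epfd, int fd)
{
    struct epoll_event ev;
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical helper: service ready descriptors in turn, reading at most
   CHUNK bytes from each per pass. Returns total bytes read, or -1. */
long
fair_service(int epfd)
{
    int readyFds[MAX_READY];        /* Application list of ready descriptors */
    int numReady = 0, n, i, k, fd, known;
    long total = 0;
    ssize_t s;
    struct epoll_event evlist[MAX_READY];
    char buf[CHUNK];

    for (;;) {
        /* Nonblocking check for descriptors that have newly become ready */
        n = epoll_wait(epfd, evlist, MAX_READY, 0);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (i = 0; i < n; i++) {
            fd = evlist[i].data.fd;
            known = 0;
            for (k = 0; k < numReady; k++)
                if (readyFds[k] == fd)
                    known = 1;
            if (!known && numReady < MAX_READY)
                readyFds[numReady++] = fd;
        }
        if (numReady == 0)
            return total;           /* Nothing is known to be ready */

        /* One bounded read per ready descriptor, so that no descriptor
           monopolizes the loop */
        for (k = 0; k < numReady; ) {
            s = read(readyFds[k], buf, CHUNK);
            if (s > 0) {
                total += s;
                k++;
            } else if (s == -1 && errno == EINTR) {
                continue;           /* Retry the same descriptor */
            } else if (s == -1 && errno != EAGAIN && errno != EWOULDBLOCK) {
                return -1;
            } else {
                /* EAGAIN or end-of-file: drop from the application list */
                readyFds[k] = readyFds[--numReady];
            }
        }
    }
}
```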

Starvation considerations can also apply when using signal-driven I/O, since it also presents an edge-triggered notification mechanism. By contrast, starvation considerations don’t necessarily apply in applications employing a level-triggered notification mechanism. This is because we can employ blocking file descriptors with level-triggered notification and use a loop that continuously checks descriptors for readiness, and then performs some I/O on the ready descriptors before once more checking for ready file descriptors.