Like the I/O multiplexing system calls and signal-driven I/O, the Linux epoll (event poll) API is used to monitor multiple file descriptors to see if they are ready for I/O. The primary advantages of the epoll API are the following:
The performance of epoll scales much better than select() and poll() when monitoring large numbers of file descriptors.
The epoll API permits either level-triggered or edge-triggered notification. By contrast, select() and poll() provide only level-triggered notification, and signal-driven I/O provides only edge-triggered notification.
The performance of epoll and signal-driven I/O is similar. However, epoll has some advantages over signal-driven I/O:
We avoid the complexities of signal handling (e.g., signal-queue overflow).
We have greater flexibility in specifying what kind of monitoring we want to perform (e.g., checking to see if a file descriptor for a socket is ready for reading, writing, or both).
The epoll API is Linux-specific, and is new in Linux 2.6.
The central data structure of the epoll API is an epoll instance, which is referred to via an open file descriptor. This file descriptor is not used for I/O. Instead, it is a handle for kernel data structures that serve two purposes:
recording a list of file descriptors that this process has declared an interest in monitoring (the interest list); and
maintaining a list of file descriptors that are ready for I/O (the ready list).
The membership of the ready list is a subset of the interest list.
For each file descriptor monitored by epoll, we can specify a bit mask indicating events that we are interested in knowing about. These bit masks correspond closely to the bit masks used with poll().
The epoll API consists of three system calls:
The epoll_create() system call creates an epoll instance and returns a file descriptor referring to the instance.
The epoll_ctl() system call manipulates the interest list associated with an epoll instance. Using epoll_ctl(), we can add a new file descriptor to the list, remove an existing descriptor from the list, and modify the mask that determines which events are to be monitored for a descriptor.
The epoll_wait() system call returns items from the ready list associated with an epoll instance.
The epoll_create() system call creates a new epoll instance whose interest list is initially empty.
```c
#include <sys/epoll.h>

int epoll_create(int size);
```
Returns file descriptor on success, or -1 on error
The size argument specifies the number of file descriptors that we expect to monitor via the epoll instance. This argument is not an upper limit, but rather a hint to the kernel about how to initially dimension internal data structures. (Since Linux 2.6.8, the size argument must be greater than zero but is otherwise ignored, because changes in the implementation meant that the information it provided is no longer required.)
As its function result, epoll_create() returns a file descriptor referring to the new epoll instance. This file descriptor is used to refer to the epoll instance in other epoll system calls. When the file descriptor is no longer required, it should be closed in the usual way, using close(). When all file descriptors referring to an epoll instance are closed, the instance is destroyed and its associated resources are released back to the system. (Multiple file descriptors may refer to the same epoll instance as a consequence of calls to fork() or descriptor duplication using dup() or similar.)
Starting with kernel 2.6.27, Linux supports a new system call, epoll_create1(). This system call performs the same task as epoll_create(), but drops the obsolete size argument and adds a flags argument that can be used to modify the behavior of the system call. One flag is currently supported: EPOLL_CLOEXEC, which causes the kernel to enable the close-on-exec flag (FD_CLOEXEC) for the new file descriptor. This flag is useful for the same reasons as the open() O_CLOEXEC flag described earlier.
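By way of illustration, the following minimal sketch (the helper name createEpollCloexec() is our own invention, not part of any API) prefers epoll_create1() and falls back to epoll_create() plus fcntl() on kernels that lack the newer call. Note that, unlike epoll_create1(), the fallback can't set the flag atomically with respect to a concurrent fork() and exec():

```c
#include <sys/epoll.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

int                             /* Hypothetical helper: returns epoll FD, or -1 on error */
createEpollCloexec(void)
{
    int epfd;

    epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd == -1 && errno == ENOSYS) {    /* Kernel predates epoll_create1() */
        epfd = epoll_create(1);             /* 'size' is merely a hint */
        if (epfd != -1 && fcntl(epfd, F_SETFD, FD_CLOEXEC) == -1) {
            close(epfd);
            return -1;
        }
    }
    return epfd;
}
```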
The epoll_ctl() system call modifies the interest list of the epoll instance referred to by the file descriptor epfd.
```c
#include <sys/epoll.h>

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *ev);
```
Returns 0 on success, or -1 on error
The fd argument identifies which of the file descriptors in the interest list is to have its settings modified. This argument can be a file descriptor for a pipe, FIFO, socket, POSIX message queue, inotify instance, terminal, device, or even another epoll descriptor (i.e., we can build a kind of hierarchy of monitored descriptors). However, fd can't be a file descriptor for a regular file or a directory (the error EPERM results).
The op argument specifies the operation to be performed, and has one of the following values:
EPOLL_CTL_ADD
Add the file descriptor fd to the interest list for epfd. The set of events that we are interested in monitoring for fd is specified in the buffer pointed to by ev, as described below. If we attempt to add a file descriptor that is already in the interest list, epoll_ctl() fails with the error EEXIST.
EPOLL_CTL_MOD
Modify the events setting for the file descriptor fd, using the information specified in the buffer pointed to by ev. If we attempt to modify the settings of a file descriptor that is not in the interest list for epfd, epoll_ctl() fails with the error ENOENT.
EPOLL_CTL_DEL
Remove the file descriptor fd from the interest list for epfd. The ev argument is ignored for this operation. If we attempt to remove a file descriptor that is not in the interest list for epfd, epoll_ctl() fails with the error ENOENT. Closing a file descriptor automatically removes it from all of the epoll interest lists of which it is a member.
The ev argument is a pointer to a structure of type epoll_event, defined as follows:
```c
struct epoll_event {
    uint32_t     events;    /* epoll events (bit mask) */
    epoll_data_t data;      /* User data */
};
```
The data field of the epoll_event structure is typed as follows:
```c
typedef union epoll_data {
    void     *ptr;          /* Pointer to user-defined data */
    int       fd;           /* File descriptor */
    uint32_t  u32;          /* 32-bit integer */
    uint64_t  u64;          /* 64-bit integer */
} epoll_data_t;
```
The ev argument specifies settings for the file descriptor fd, as follows:
The events subfield is a bit mask specifying the set of events that we are interested in monitoring for fd. We say more about the bit values that can be used in this field in the next section.
The data subfield is a union, one of whose members can be used to specify information that is passed back to the calling process (via epoll_wait()) if fd later becomes ready.
Example 63-4 shows an example of the use of epoll_create() and epoll_ctl().
Example 63-4. Using epoll_create() and epoll_ctl()
```c
int epfd;
struct epoll_event ev;

epfd = epoll_create(5);
if (epfd == -1)
    errExit("epoll_create");

ev.data.fd = fd;
ev.events = EPOLLIN;
if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
    errExit("epoll_ctl");
```
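Example 63-4 registers fd using only EPOLL_CTL_ADD. Continuing that example, a sketch of the other two operations might look like this:

```c
/* Change the settings for fd so that we are also notified when fd
   becomes ready for writing; ev must be fully reinitialized, because
   EPOLL_CTL_MOD replaces all settings for the descriptor */

ev.data.fd = fd;
ev.events = EPOLLIN | EPOLLOUT;
if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) == -1)
    errExit("epoll_ctl");

/* Remove fd from the interest list; the ev argument is ignored for
   this operation (although before Linux 2.6.9 it had to be non-NULL) */

if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL) == -1)
    errExit("epoll_ctl");
```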
Because each file descriptor registered in an epoll interest list requires a small amount of nonswappable kernel memory, the kernel provides an interface that defines a limit on the total number of file descriptors that each user can register in all epoll interest lists. The value of this limit can be viewed and modified via max_user_watches, a Linux-specific file in the /proc/sys/fs/epoll directory. The default value of this limit is calculated based on available system memory (see the epoll(7) manual page).
The epoll_wait() system call returns information about ready file descriptors from the epoll instance referred to by the file descriptor epfd. A single epoll_wait() call can return information about multiple ready file descriptors.
```c
#include <sys/epoll.h>

int epoll_wait(int epfd, struct epoll_event *evlist, int maxevents, int timeout);
```
Returns number of ready file descriptors, 0 on timeout, or -1 on error
The information about ready file descriptors is returned in the array of epoll_event structures pointed to by evlist. (The epoll_event structure was described in the previous section.) The evlist array is allocated by the caller, and the number of elements it contains is specified in maxevents.
Each item in the array evlist returns information about a single ready file descriptor. The events subfield returns a mask of the events that have occurred on this descriptor. The data subfield returns whatever value was specified in ev.data when we registered interest in this file descriptor using epoll_ctl(). Note that the data field provides the only mechanism for finding out the number of the file descriptor associated with this event. Thus, when we make the epoll_ctl() call that places a file descriptor in the interest list, we should either set ev.data.fd to the file descriptor number (as shown in Example 63-4) or set ev.data.ptr to point to a structure that contains the file descriptor number.
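For example, to take the second approach, we might register a pointer to a small application-defined structure that carries the file descriptor number along with any other per-descriptor state. The structure and its fields shown here are hypothetical, chosen purely for illustration:

```c
struct connInfo {               /* Hypothetical per-descriptor state */
    int fd;                     /* Number of the monitored file descriptor */
    /* ... other application-specific fields ... */
};

struct connInfo *ci;
struct epoll_event ev;

ci = malloc(sizeof(struct connInfo));
if (ci == NULL)
    errExit("malloc");
ci->fd = fd;

ev.events = EPOLLIN;
ev.data.ptr = ci;               /* Handed back verbatim in evlist[].data.ptr */
if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
    errExit("epoll_ctl");
```

When epoll_wait() later returns an event for this descriptor, casting evlist[j].data.ptr back to a struct connInfo pointer recovers the file descriptor number.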
The timeout argument determines the blocking behavior of epoll_wait(), as follows:
If timeout equals -1, block until an event occurs for one of the file descriptors in the interest list for epfd or until a signal is caught.
If timeout equals 0, perform a nonblocking check to see which events are currently available on the file descriptors in the interest list for epfd.
If timeout is greater than 0, block for up to timeout milliseconds, until an event occurs on one of the file descriptors in the interest list for epfd, or until a signal is caught.
On success, epoll_wait() returns the number of items that have been placed in the array evlist, or 0 if no file descriptors were ready within the interval specified by timeout. On error, epoll_wait() returns -1, with errno set to indicate the error.
In a multithreaded program, it is possible for one thread to use epoll_ctl() to add file descriptors to the interest list of an epoll instance that is already being monitored by epoll_wait() in another thread. These changes to the interest list will be taken into account immediately, and the epoll_wait() call will return readiness information about the newly added file descriptors.
The bit values that can be specified in ev.events when we call epoll_ctl() and that are placed in the evlist[].events fields returned by epoll_wait() are shown in Table 63-8. With the addition of an E prefix, most of these bits have names that are the same as the corresponding event bits used with poll(). (The exceptions are EPOLLET and EPOLLONESHOT, which we describe in more detail below.) The reason for this correspondence is that, when specified as input to epoll_ctl() or returned as output via epoll_wait(), these bits convey exactly the same meaning as the corresponding poll() event bits.
Table 63-8. Bit-mask values for the epoll events field
Bit | Input to epoll_ctl()? | Returned by epoll_wait()? | Description
---|---|---|---
EPOLLIN | • | • | Data other than high-priority data can be read
EPOLLPRI | • | • | High-priority data can be read
EPOLLRDHUP | • | • | Shutdown on peer socket (since Linux 2.6.17)
EPOLLOUT | • | • | Normal data can be written
EPOLLET | • | | Employ edge-triggered event notification
EPOLLONESHOT | • | | Disable monitoring after event notification
EPOLLERR | | • | An error has occurred
EPOLLHUP | | • | A hangup has occurred
By default, once a file descriptor is added to an epoll interest list using the epoll_ctl() EPOLL_CTL_ADD operation, it remains active (i.e., subsequent calls to epoll_wait() will inform us whenever the file descriptor is ready) until we explicitly remove it from the list using the epoll_ctl() EPOLL_CTL_DEL operation. If we want to be notified only once about a particular file descriptor, then we can specify the EPOLLONESHOT flag (available since Linux 2.6.2) in the ev.events value passed in epoll_ctl(). If this flag is specified, then, after the next epoll_wait() call that informs us that the corresponding file descriptor is ready, the file descriptor is marked inactive in the interest list, and we won't be informed about its state by future epoll_wait() calls. If desired, we can subsequently reenable monitoring of this file descriptor using the epoll_ctl() EPOLL_CTL_MOD operation. (We can't use the EPOLL_CTL_ADD operation for this purpose, because the inactive file descriptor is still part of the epoll interest list.)
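A sketch of one-shot monitoring and subsequent rearming (reusing epfd, fd, and ev from Example 63-4) might look like this:

```c
ev.data.fd = fd;
ev.events = EPOLLIN | EPOLLONESHOT;     /* Notify us once, then mark inactive */
if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
    errExit("epoll_ctl");

/* ... epoll_wait() reports fd as ready, and we service the I/O ... */

/* Rearm fd: we must use EPOLL_CTL_MOD here, since fd is still in the
   interest list and EPOLL_CTL_ADD would fail with EEXIST */

ev.data.fd = fd;
ev.events = EPOLLIN | EPOLLONESHOT;
if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) == -1)
    errExit("epoll_ctl");
```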
Example 63-5 demonstrates the use of the epoll API. As command-line arguments, this program expects the pathnames of one or more terminals or FIFOs. The program performs the following steps:
Open each of the files named on the command line for input and add the resulting file descriptor to the interest list of the epoll instance, specifying the set of events to be monitored as EPOLLIN.
Execute a loop that calls epoll_wait() to monitor the interest list of the epoll instance and handles the returned events from each call. Note the following points about this loop:
After the epoll_wait() call, the program checks for an EINTR return, which may occur if the program was stopped by a signal in the middle of the epoll_wait() call and then resumed by SIGCONT. (Refer to Section 21.5.) If this occurs, the program restarts the epoll_wait() call.
If the epoll_wait() call was successful, the program uses a further loop to check each of the ready items in evlist. For each item in evlist, the program checks the events field for the presence of not just EPOLLIN, but also EPOLLHUP and EPOLLERR. These latter events can occur if the other end of a FIFO was closed or a terminal hangup occurred. If EPOLLIN was returned, then the program reads some input from the corresponding file descriptor and displays it on standard output. Otherwise, if either EPOLLHUP or EPOLLERR occurred, the program closes the corresponding file descriptor and decrements the counter of open files (numOpenFds).
The loop terminates when all open file descriptors have been closed (i.e., when numOpenFds equals 0).
The following shell session logs demonstrate the use of the program in Example 63-5. We use two terminal windows. In one window, we use the program in Example 63-5 to monitor two FIFOs for input. (Each open of a FIFO for reading by this program will complete only after another process has opened the FIFO for writing, as described in Section 44.7.) In the other window, we run instances of cat(1) that write data to these FIFOs.
```
Terminal window 1                          Terminal window 2

$ mkfifo p q
$ ./epoll_input p q
                                           $ cat > p
Opened "p" on fd 4
                                           Type Control-Z to suspend cat
                                           [1]+  Stopped        cat >p
                                           $ cat > q
Opened "q" on fd 5
About to epoll_wait()
Type Control-Z to suspend the epoll_input program
[1]+  Stopped        ./epoll_input p q
```
Above, we suspended our monitoring program so that we can now generate input on both FIFOs, and close the write end of one of them:
```
                                           qqq
                                           Type Control-D to terminate "cat > q"
                                           $ fg %1
                                           cat >p
                                           ppp
```
Now we resume our monitoring program by bringing it into the foreground, at which point epoll_wait() returns two events:
```
$ fg
./epoll_input p q
About to epoll_wait()
Ready: 2
  fd=4; events: EPOLLIN
    read 4 bytes: ppp

  fd=5; events: EPOLLIN EPOLLHUP
    read 4 bytes: qqq

    closing fd 5
About to epoll_wait()
```
The two blank lines in the above output are the newlines that were read by the instances of cat, written to the FIFOs, and then read and echoed by our monitoring program.
Now we type Control-D in the second terminal window in order to terminate the remaining instance of cat, which causes epoll_wait() to once more return, this time with a single event:
Type Control-D to terminate "cat >p"
```
Ready: 1
  fd=4; events: EPOLLHUP
    closing fd 4
All file descriptors closed; bye
```
Example 63-5. Using the epoll API
altio/epoll_input.c
```c
#include <sys/epoll.h>
#include <fcntl.h>
#include "tlpi_hdr.h"

#define MAX_BUF     1000        /* Maximum bytes fetched by a single read() */
#define MAX_EVENTS     5        /* Maximum number of events to be returned from
                                   a single epoll_wait() call */

int
main(int argc, char *argv[])
{
    int epfd, ready, fd, s, j, numOpenFds;
    struct epoll_event ev;
    struct epoll_event evlist[MAX_EVENTS];
    char buf[MAX_BUF];

    if (argc < 2 || strcmp(argv[1], "--help") == 0)
        usageErr("%s file...\n", argv[0]);

    epfd = epoll_create(argc - 1);
    if (epfd == -1)
        errExit("epoll_create");

    /* Open each file on command line, and add it to the "interest
       list" for the epoll instance */

    for (j = 1; j < argc; j++) {
        fd = open(argv[j], O_RDONLY);
        if (fd == -1)
            errExit("open");
        printf("Opened \"%s\" on fd %d\n", argv[j], fd);

        ev.events = EPOLLIN;            /* Only interested in input events */
        ev.data.fd = fd;
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
            errExit("epoll_ctl");
    }

    numOpenFds = argc - 1;

    while (numOpenFds > 0) {

        /* Fetch up to MAX_EVENTS items from the ready list */

        printf("About to epoll_wait()\n");
        ready = epoll_wait(epfd, evlist, MAX_EVENTS, -1);
        if (ready == -1) {
            if (errno == EINTR)
                continue;               /* Restart if interrupted by signal */
            else
                errExit("epoll_wait");
        }

        printf("Ready: %d\n", ready);

        /* Deal with returned list of events */

        for (j = 0; j < ready; j++) {
            printf("  fd=%d; events: %s%s%s\n", evlist[j].data.fd,
                    (evlist[j].events & EPOLLIN)  ? "EPOLLIN "  : "",
                    (evlist[j].events & EPOLLHUP) ? "EPOLLHUP " : "",
                    (evlist[j].events & EPOLLERR) ? "EPOLLERR " : "");

            if (evlist[j].events & EPOLLIN) {
                s = read(evlist[j].data.fd, buf, MAX_BUF);
                if (s == -1)
                    errExit("read");
                printf("    read %d bytes: %.*s\n", s, s, buf);

            } else if (evlist[j].events & (EPOLLHUP | EPOLLERR)) {

                /* If EPOLLIN and EPOLLHUP were both set, then there might
                   be more than MAX_BUF bytes to read. Therefore, we close
                   the file descriptor only if EPOLLIN was not set.
                   We'll read further bytes after the next epoll_wait(). */

                printf("    closing fd %d\n", evlist[j].data.fd);
                if (close(evlist[j].data.fd) == -1)
                    errExit("close");
                numOpenFds--;
            }
        }
    }

    printf("All file descriptors closed; bye\n");
    exit(EXIT_SUCCESS);
}
```
altio/epoll_input.c
We now look at some subtleties of the interaction of open files, file descriptors, and epoll. For the purposes of this discussion, it is worth reviewing Figure 5-2 (page 95), which shows the relationship between file descriptors, open file descriptions, and the system-wide file i-node table.
When we create an epoll instance using epoll_create(), the kernel creates a new in-memory i-node and open file description, and allocates a new file descriptor in the calling process that refers to the open file description. The interest list for an epoll instance is associated with the open file description, not with the epoll file descriptor. This has the following consequences:
If we duplicate an epoll file descriptor using dup() (or similar), then the duplicated descriptor refers to the same epoll interest and ready lists as the original descriptor. We may modify the interest list by specifying either file descriptor as the epfd argument in a call to epoll_ctl(). Similarly, we can retrieve items from the ready list by specifying either file descriptor as the epfd argument in a call to epoll_wait().
The preceding point also applies after a call to fork(). The child inherits a duplicate of the parent’s epoll file descriptor, and this duplicate descriptor refers to the same epoll data structures.
When we perform an epoll_ctl() EPOLL_CTL_ADD operation, the kernel adds an item to the epoll interest list that records both the number of the monitored file descriptor and a reference to the corresponding open file description. For the purpose of epoll_wait() calls, the kernel monitors the open file description. This means that we must refine our earlier statement that when a file descriptor is closed, it is automatically removed from any epoll interest lists of which it is a member. The refinement is this: an open file description is removed from the epoll interest list once all file descriptors that refer to it have been closed. This means that if we create duplicate descriptors referring to an open file, using dup() (or similar) or fork(), then the open file will be removed only after the original descriptor and all of the duplicates have been closed.
These semantics can lead to some behavior that at first appears surprising. Suppose that we execute the code shown in Example 63-6. The epoll_wait() call in this code will tell us that the file descriptor fd1 is ready (in other words, evlist[0].data.fd will be equal to fd1), even though fd1 has been closed. This is because there is still one open file descriptor, fd2, referring to the open file description contained in the epoll interest list. A similar scenario occurs when two processes hold duplicate descriptors for the same open file description (typically, as a result of a fork()), and the process performing the epoll_wait() has closed its file descriptor, but the other process still holds the duplicate descriptor open.
Example 63-6. Semantics of epoll with duplicate file descriptors
```c
int epfd, fd1, fd2, ready;
struct epoll_event ev;
struct epoll_event evlist[MAX_EVENTS];

/* Omitted: code to open 'fd1' and create epoll file descriptor 'epfd' ... */

ev.data.fd = fd1;
ev.events = EPOLLIN;
if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd1, &ev) == -1)
    errExit("epoll_ctl");

/* Suppose that 'fd1' now happens to become ready for input */

fd2 = dup(fd1);
close(fd1);

ready = epoll_wait(epfd, evlist, MAX_EVENTS, -1);
if (ready == -1)
    errExit("epoll_wait");
```
Table 63-9 shows the results (on Linux 2.6.25) when we monitor N contiguous file descriptors in the range 0 to N - 1 using poll(), select(), and epoll. (The test was arranged such that during each monitoring operation, exactly one randomly selected file descriptor is ready.) From this table, we see that as the number of file descriptors to be monitored grows large, poll() and select() perform poorly. By contrast, the performance of epoll hardly declines as N grows large. (The small decline in performance as N increases is possibly a result of reaching CPU caching limits on the test system.)
For the purposes of this test, FD_SETSIZE was changed to 16,384 in the glibc header files to allow the test program to monitor large numbers of file descriptors using select().
Table 63-9. Times taken by poll(), select(), and epoll for 100,000 monitoring operations
Number of descriptors monitored (N) | poll() CPU time (seconds) | select() CPU time (seconds) | epoll CPU time (seconds)
---|---|---|---
10 | 0.61 | 0.73 | 0.41
100 | 2.9 | 3.0 | 0.42
1,000 | 35 | 35 | 0.53
10,000 | 990 | 930 | 0.66
In Problems with select() and poll(), we saw why select() and poll() perform poorly when monitoring large numbers of file descriptors. We now look at the reasons why epoll performs better:
On each call to select() or poll(), the kernel must check all of the file descriptors specified in the call. By contrast, when we mark a descriptor to be monitored with epoll_ctl(), the kernel records this fact in a list associated with the underlying open file description, and whenever an I/O operation that makes the file descriptor ready is performed, the kernel adds an item to the ready list for the epoll descriptor. (An I/O event on a single open file description may cause multiple file descriptors associated with that description to become ready.) Subsequent epoll_wait() calls simply fetch items from the ready list.
Each time we call select() or poll(), we pass a data structure to the kernel that identifies all of the file descriptors that are to be monitored, and, on return, the kernel passes back a data structure describing the readiness of all of these descriptors. By contrast, with epoll, we use epoll_ctl() to build up a data structure in kernel space that lists the set of file descriptors to be monitored. Once this data structure has been built, each later call to epoll_wait() doesn’t need to pass any information about file descriptors to the kernel, and the call returns information about only those descriptors that are ready.
In addition to the above points, for select(), we must initialize the input data structure prior to each call, and for both select() and poll(), we must inspect the returned data structure to find out which of the N file descriptors are ready. However, some testing showed that the time required for these other steps was insignificant compared to the time required for the system call to monitor N descriptors. Table 63-9 doesn’t include the times for the inspection step.
Very roughly, we can say that for large values of N (the number of file descriptors being monitored), the performance of select() and poll() scales linearly with N. We start to see this behavior for the N = 100 and N = 1000 cases in Table 63-9. By the time we reach N = 10000, the scaling has actually become worse than linear.
By contrast, epoll scales (linearly) according to the number of I/O events that occur. The epoll API is thus particularly efficient in a scenario that is common in servers that handle many simultaneous clients: of the many file descriptors being monitored, most are idle; only a few descriptors are ready.
By default, the epoll mechanism provides level-triggered notification. By this, we mean that epoll tells us whether an I/O operation can be performed on a file descriptor without blocking. This is the same type of notification as is provided by poll() and select().
The epoll API also allows for edge-triggered notification—that is, a call to epoll_wait() tells us if there has been I/O activity on a file descriptor since the previous call to epoll_wait() (or since the descriptor was opened, if there was no previous call). Using epoll with edge-triggered notification is semantically similar to signal-driven I/O, except that if multiple I/O events occur, epoll coalesces them into a single notification returned via epoll_wait(); with signal-driven I/O, multiple signals may be generated.
To employ edge-triggered notification, we specify the EPOLLET flag in ev.events when calling epoll_ctl():
```c
struct epoll_event ev;

ev.data.fd = fd;
ev.events = EPOLLIN | EPOLLET;
if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
    errExit("epoll_ctl");
```
We illustrate the difference between level-triggered and edge-triggered epoll notification using an example. Suppose that we are using epoll to monitor a socket for input (EPOLLIN), and the following steps occur:
Input arrives on the socket.
We perform an epoll_wait(). This call will tell us that the socket is ready, regardless of whether we are employing level-triggered or edge-triggered notification.
We perform a second call to epoll_wait().
If we are employing level-triggered notification, then the second epoll_wait() call will inform us that the socket is ready. If we are employing edge-triggered notification, then the second call to epoll_wait() will block, because no new input has arrived since the previous call to epoll_wait().
As we noted earlier, edge-triggered notification is usually employed in conjunction with nonblocking file descriptors. Thus, the general framework for using edge-triggered epoll notification is as follows:
Make all file descriptors that are to be monitored nonblocking.
Build the epoll interest list using epoll_ctl().
Handle I/O events using the following loop:
Retrieve a list of ready descriptors using epoll_wait().
For each file descriptor that is ready, process I/O until the relevant system call (e.g., read(), write(), recv(), send(), or accept()) returns with the error EAGAIN or EWOULDBLOCK.
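For a descriptor monitored with EPOLLIN | EPOLLET, the last step might be sketched as follows (assuming that fd has already been placed in nonblocking mode, and with BUF_SIZE defined elsewhere):

```c
for (;;) {
    char buf[BUF_SIZE];
    ssize_t numRead;

    numRead = read(fd, buf, BUF_SIZE);
    if (numRead == -1) {
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;              /* Input drained for now; wait for the
                                   next edge-triggered notification */
        errExit("read");
    } else if (numRead == 0) {
        close(fd);              /* End-of-file: peer closed connection */
        break;
    }

    /* ... process the numRead bytes in buf ... */
}
```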
Suppose that we are monitoring multiple file descriptors using edge-triggered notification, and that a ready file descriptor has a large amount (perhaps an endless stream) of input available. If, after detecting that this file descriptor is ready, we attempt to consume all of the input using nonblocking reads, then we risk starving the other file descriptors of attention (i.e., it may be a long time before we again check them for readiness and perform I/O on them). One solution to this problem is for the application to maintain a list of file descriptors that have been notified as being ready, and execute a loop that continuously performs the following actions (a sketch in code follows the list):
Monitor the file descriptors using epoll_wait() and add ready descriptors to the application list. If any file descriptors are already registered as being ready in the application list, then the timeout for this monitoring step should be small or 0, so that if no new file descriptors are ready, the application can quickly proceed to the next step and service any file descriptors that are already known to be ready.
Perform a limited amount of I/O on those file descriptors registered as being ready in the application list (perhaps cycling through them in round-robin fashion, rather than always starting from the beginning of the list after each call to epoll_wait()). A file descriptor can be removed from the application list when the relevant nonblocking I/O system call fails with the EAGAIN or EWOULDBLOCK error.
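A skeleton of such a loop might look like the following; the readyList*() helpers are hypothetical stand-ins for an application-maintained list of ready descriptors:

```c
for (;;) {
    struct epoll_event evlist[MAX_EVENTS];
    int nready, timeout, j;

    /* Block only if no descriptors are already awaiting service */

    timeout = readyListIsEmpty() ? -1 : 0;

    nready = epoll_wait(epfd, evlist, MAX_EVENTS, timeout);
    if (nready == -1) {
        if (errno == EINTR)
            continue;                       /* Restart if interrupted by signal */
        errExit("epoll_wait");
    }

    for (j = 0; j < nready; j++)
        readyListAdd(evlist[j].data.fd);    /* Record newly ready descriptors */

    /* Perform a limited amount of I/O on each listed descriptor, cycling
       in round-robin fashion; a descriptor leaves the list once
       nonblocking I/O on it fails with EAGAIN or EWOULDBLOCK */

    readyListServiceRoundRobin();
}
```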
Although it requires extra programming work, this approach offers other benefits in addition to preventing file-descriptor starvation. For example, we can include other steps in the above loop, such as handling timers and accepting signals with sigwaitinfo() (or similar).
Starvation considerations can also apply when using signal-driven I/O, since it also presents an edge-triggered notification mechanism. By contrast, starvation considerations don’t necessarily apply in applications employing a level-triggered notification mechanism. This is because we can employ blocking file descriptors with level-triggered notification and use a loop that continuously checks descriptors for readiness, and then performs some I/O on the ready descriptors before once more checking for ready file descriptors.