Reading and Writing Files

Now that we’ve covered the basics of I/O, let’s get to some more practical applications of reading and writing files. The following sections discuss reading data frames or matrices from files, working with text files, accessing files on remote machines, and getting file and directory information.

In Section 5.1.2, we discussed the use of the function read.table() to read in a data frame. As a quick review, suppose the file z looks like this:

name age
John 25
Mary 28
Jim 19

The first line contains an optional header, specifying column names. We could read the file this way:

> z <- read.table("z",header=TRUE)
> z
  name age
1 John  25
2 Mary  28
3  Jim  19

Note that scan() would not work here, because our file has a mixture of numeric and character data (and a header).

There appears to be no direct way of reading in a matrix from a file, but it can be done easily with other tools. A simple, quick way is to use scan() to read in the matrix row by row. You use the byrow option in the function matrix() to indicate that you are defining the elements of the matrix in a row-wise, rather than column-wise, manner.

For instance, say the file x contains a 5-by-3 matrix, stored row-wise:

1 0 1
1 1 1
1 1 0
1 1 0
0 0 1

We can read it into a matrix this way:

> x <- matrix(scan("x"),nrow=5,byrow=TRUE)

This is fine for quick, one-time operations, but for generality, you can use read.table(), which returns a data frame, and then convert via as.matrix(). Here is a general method:

read.matrix <- function(filename) {
   as.matrix(read.table(filename))
}
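
As a quick check of the two approaches, the following sketch writes the 5-by-3 matrix shown earlier to a temporary file and reads it back both ways. The temporary file is only a stand-in for the file x. Note that read.table() assigns default column names (V1, V2, V3), which the scan() version does not have.

```r
# Write the 5-by-3 example matrix to a temporary file (a stand-in for "x").
tmp <- tempfile()
writeLines(c("1 0 1", "1 1 1", "1 1 0", "1 1 0", "0 0 1"), tmp)

# Approach 1: scan() reads the numbers in row order; byrow=TRUE restores the rows.
m1 <- matrix(scan(tmp, quiet = TRUE), nrow = 5, byrow = TRUE)

# Approach 2: read.table() plus as.matrix(), as in read.matrix() above.
read.matrix <- function(filename) {
   as.matrix(read.table(filename))
}
m2 <- read.matrix(tmp)

dim(m1)        # 5 3
colnames(m2)   # "V1" "V2" "V3", the default names from read.table()
all(m1 == m2)  # TRUE: the same elements either way
unlink(tmp)
```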

In computer literature, there is often a distinction made between text files and binary files. That distinction is somewhat misleading: every file is binary in the sense that it consists of 0s and 1s. Let’s take the term text file to mean a file that consists mainly of ASCII characters, or characters in an encoding for some other human language (such as GB for Chinese), and that uses newline characters to give humans the perception of lines. The latter aspect will turn out to be central here. Nontext files, such as JPEG images or executable program files, are generally called binary files.

You can use readLines() to read in a text file, either one line at a time or in a single operation. For example, suppose we have a file z1 with the following contents:

John 25
Mary 28
Jim 19

We can read the file all at once, like this:

> z1 <- readLines("z1")
> z1
[1] "John 25" "Mary 28" "Jim 19"

Since each line is treated as a string, the return value here is a vector of strings—that is, a vector of character mode. There is one vector element for each line read, thus three elements here.
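
Because each element is a whole line, a common next step is to split the lines into fields. The following is a minimal sketch of one way to do that with strsplit(), assuming the fields are separated by a single space as in z1. The vector is typed in directly here rather than read from the file.

```r
# The three lines as readLines("z1") would return them.
z1 <- c("John 25", "Mary 28", "Jim 19")

# Split each line on the single space; the result is a list of character vectors.
parts <- strsplit(z1, " ")

# Pull out the first and second field of every line.
name <- sapply(parts, function(p) p[1])
age  <- as.integer(sapply(parts, function(p) p[2]))

name   # "John" "Mary" "Jim"
age    # 25 28 19
```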

Alternatively, we can read it in one line at a time. For this, we first need to create a connection, as described next.

Connection is R’s term for a fundamental mechanism used in various kinds of I/O operations. Here, it will be used for file access.

The connection is created by calling file(), url(), or one of several other R functions. To see a list of those functions, type this:

> ?connection

So, we can now read in the z1 file (introduced in the previous section) line by line, as follows:

> c <- file("z1","r")
> readLines(c,n=1)
[1] "John 25"
> readLines(c,n=1)
[1] "Mary 28"
> readLines(c,n=1)
[1] "Jim 19"
> readLines(c,n=1)
character(0)

We opened the connection, assigned the result to c, and then read the file one line at a time, as specified by the argument n=1. When R encountered the end of file (EOF), it returned an empty result. We needed to set up a connection so that R could keep track of our position in the file as we read through it.

We can detect EOF in our code:

> c <- file("z1","r")
> while(TRUE) {
+    rl <- readLines(c,n=1)
+    if (length(rl) == 0) {
+       print("reached the end")
+       break
+    } else print(rl)
+ }
[1] "John 25"
[1] "Mary 28"
[1] "Jim 19"
[1] "reached the end"

If we wish to “rewind”—to start again at the beginning of the file—we can use seek():

> c <- file("z1","r")
> readLines(c,n=2)
[1] "John 25" "Mary 28"
> seek(con=c,where=0)
[1] 16
> readLines(c,n=1)
[1] "John 25"

The argument where=0 in our call to seek() means that we wish to position the file pointer zero characters from the start of the file—in other words, directly at the beginning.

The call returns 16, meaning that the file pointer was at position 16 before we made the call. That makes sense. The first line consists of "John 25" plus the end-of-line character, for a total of eight characters, and the same is true for the second line. So, after reading the first two lines, we were at position 16.
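
The arithmetic can be checked directly. This small sketch assumes a one-byte end-of-line character, as on Unix-family systems; on a system that writes two-byte line endings the offset would differ.

```r
# The first two lines of z1, without their end-of-line characters.
first_two <- c("John 25", "Mary 28")

# Each line contributes its characters plus one end-of-line character.
offset <- sum(nchar(first_two) + 1)
offset   # 16
```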

You can close a connection by calling—what else?—close(). You would use this to let the system know that the file you have been writing is complete and should now be officially written to disk. As another example, in a client/server relationship over the Internet (see Section 10.3.1), a client would use close() to indicate to the server that the client is signing off.
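
When a connection is opened inside a function, one way to make sure it gets closed is to register the close() call with on.exit(). The following sketch uses a hypothetical helper, countlines(), which is not part of the text's examples; it reuses the EOF test shown earlier.

```r
# Hypothetical helper: count the lines in a file, reading one line at a time.
countlines <- function(filename) {
   con <- file(filename, "r")
   on.exit(close(con))   # runs when the function exits, normally or via an error
   n <- 0
   repeat {
      rl <- readLines(con, n = 1)
      if (length(rl) == 0) break   # EOF
      n <- n + 1
   }
   n
}

tmp <- tempfile()   # stand-in for a file such as z1
writeLines(c("John 25", "Mary 28", "Jim 19"), tmp)
nlines <- countlines(tmp)
nlines   # 3
unlink(tmp)
```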

The U.S. Census Bureau makes census data available in the form of Public Use Microdata Samples (PUMS). The term microdata here means that we are dealing with raw data and each record is for a real person, as opposed to statistical summaries. Data on many, many variables are included.

The data is organized by household. For each unit, there is first a Household record, describing the various characteristics of that household, followed by one Person record for each person in the household. Character positions 106 and 107 (with numbering starting at 1) in the Household record state the number of Person records for that household. (The number can be very large, since some institutions count as households.)

To enhance the integrity of the data, character position 1 contains H or P to confirm that this is a Household or Person record. So, if you read an H record, and it tells you there are three people in the household, then the following three records should be P records, followed by another H record; if not, you’ve encountered an error.
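
The consistency rule just described can be sketched as a small checker. The function checkpums() below is hypothetical, and the records it is tried on are synthetic strings padded to 107 characters, not real PUMS data. It works on a character vector of records, such as readLines() would return.

```r
# Hypothetical checker: TRUE if every H record is followed by exactly the
# number of P records it declares in character positions 106-107.
checkpums <- function(recs) {
   i <- 1
   while (i <= length(recs)) {
      if (substr(recs[i], 1, 1) != "H") return(FALSE)   # expected a Household record
      npr <- as.integer(substr(recs[i], 106, 107))
      if (is.na(npr)) return(FALSE)                     # count field unreadable
      if (npr > 0) {
         if (i + npr > length(recs)) return(FALSE)      # data ends too soon
         idx <- (i + 1):(i + npr)
         if (any(substr(recs[idx], 1, 1) != "P")) return(FALSE)
      }
      i <- i + npr + 1   # move to the next Household record
   }
   TRUE
}

# Synthetic records: "H" or "P" in position 1, the count in positions 106-107.
hrec <- function(n) paste0("H", strrep("0", 104), sprintf("%02d", n))
prec <- paste0("P", strrep("0", 106))

good <- c(hrec(1), prec, hrec(2), prec, prec, hrec(0))
bad  <- c(hrec(2), prec, hrec(1), prec)   # second household starts one record early

checkpums(good)   # TRUE
checkpums(bad)    # FALSE
```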

As our test file, we’ll take the first 1,000 records of the year 2000 1 percent sample. The first few records look like this:

H000019510649     06010        99979997  70                               631973
15758   59967658436650000012000000  0 0 0 0 0 0 0 0 0 0 0 0 0    0    0    0
0    0 0 0     0 0     0 0000 0    0    0  0 0     00000000000000000000000000000
00000000000000000000000000
P00001950100010923000420190010110000010147050600206011099999904200000 0040010000
00300280     28600  70    9997    9997202020202020220000040000000000000006000000
     00000  00    0000    00000000000000000132241057904MS     476041-20311010310
07000049010000000000900100000100000100000100000010000001000139010000490000
H000040710649     06010        99979997  70                               631973
15758   599676584365300800200000300106060503010101010102010 01200006000000100001
00600020 0     0 0     0 0000 0    0    0  0 0     02000102010102200000000010750
02321125100004000000040000
P00004070100005301000010380010110000010147030400100009005199901200000 0006010000
00100000     00000  00    0000    0000202020202020220000040000000000000001000060
     06010  70    9997    99970101004900100000001018703221    770051-10111010500
40004000000000000000000000000000000000000000000000000000004000000040000349
P00004070200005303011010140010110000010147050000204004005199901200000 0006010000
00100000     00000  00    0000    000020202020 0 0200000000000000000000000050000
     00000  00    0000    000000000000000000000000000000000000000000-00000000000
000      0      0      0     0     0     0      0      0       00000000349
H000061010649     06010        99979997  70                               631973
15758   599676584360801190100000200204030502010101010102010 00770004800064000001
1    0 030     0 0     0 0340 00660000000170 0     06010000000004410039601000000
00021100000004940000000000

The records are very wide and thus wrap around. Each one occupies four lines on the page here.

We’ll create a function called extractpums() to read in a PUMS file and create a data frame from its Person records. The user specifies the filename and lists fields to extract and names to assign to those fields.

We also want to retain the household serial number. This is good to have because data for persons in the same household may be correlated and we may want to add that aspect to our statistical model. Also, the household data may provide important covariates. (In the latter case, we would want to retain the covariate data as well.)

Before looking at the function code, let’s see what the function does. In this data set, gender is in column 23 and age in columns 25 and 26. In the example, our filename is pumsa. The following call creates a data frame consisting of those two variables.

pumsdf <- extractpums("pumsa",list(Gender=c(23,23),Age=c(25,26)))

Note that we are stating here the names we want the columns to have in the resulting data frame. We can use any names we want—say Sex and Ancientness.

Here is the first part of that data frame:

> head(pumsdf)
  serno Gender Age
2   195      2  19
3   407      1  38
4   407      1  14
5   610      2  65
6  1609      1  50
7  1609      2  49

The following is the code for the extractpums() function.

# reads in PUMS file pf, extracting the Person records, returning a data
# frame; each row of the output will consist of the Household serial
# number and the fields specified in the list flds; the columns of
# the data frame will have the names of the indices in flds

extractpums <- function(pf,flds) {
   dtf <- data.frame()  # data frame to be built
   con <- file(pf,"r")  # connection
   # process the input file
   repeat {
      hrec <- readLines(con,1)  # read Household record
      if (length(hrec) == 0) break  # end of file, leave loop
      # get household serial number
      serno <- intextract(hrec,c(2,8))
      # how many Person records?
      npr <- intextract(hrec,c(106,107))
      if (npr > 0)
         for (i in 1:npr) {
            prec <- readLines(con,1)  # get Person record
            # make this person's row for the data frame
            person <- makerow(serno,prec,flds)
            # add it to the data frame
            dtf <- rbind(dtf,person)
         }
   }
   close(con)  # done reading; release the connection
   return(dtf)
}

# set up this person's row for the data frame
makerow <- function(srn,pr,fl) {
   l <- list()
   l[["serno"]] <- srn
   for (nm in names(fl)) {
      l[[nm]] <- intextract(pr,fl[[nm]])
   }
   return(l)
}

# extracts an integer field in the string s, in character positions
# rng[1] through rng[2]
intextract <- function(s,rng) {
   fld <- substr(s,rng[1],rng[2])
   return(as.integer(fld))
}

Let’s see how this works. At the beginning of extractpums(), we create an empty data frame and set up the connection for the PUMS file read.

dtf <- data.frame()  # data frame to be built
con <- file(pf,"r")  # connection

The main body of the code then consists of a repeat loop.

repeat {
   hrec <- readLines(con,1)  # read Household record
   if (length(hrec) == 0) break  # end of file, leave loop
   # get household serial number
   serno <- intextract(hrec,c(2,8))
   # how many Person records?
   npr <- intextract(hrec,c(106,107))
   if (npr > 0)
      for (i in 1:npr) {
         ...
      }
}

This loop iterates until the end of the input file is reached. The latter condition will be sensed by encountering a zero-length Household record, as seen in the preceding code.

Within the repeat loop, we alternate reading a Household record and reading the associated Person records. The number of Person records for the current Household record is extracted from columns 106 and 107 of that record by a call to our function intextract() and stored in npr.

The for loop then reads in the Person records one by one, in each case forming the desired row for the output data frame and then attaching it to the latter via rbind():

for (i in 1:npr) {
   prec <- readLines(con,1)  # get Person record
   # make this person's row for the data frame
   person <- makerow(serno,prec,flds)
   # add it to the data frame
   dtf <- rbind(dtf,person)
}

Note how makerow() creates the row to be added for a given person. Here the formal arguments are srn for the household serial number, pr for the given Person record, and fl for the list of variable names and column fields.

makerow <- function(srn,pr,fl) {
   l <- list()
   l[["serno"]] <- srn
   for (nm in names(fl)) {
      l[[nm]] <- intextract(pr,fl[[nm]])
   }
   return(l)
}

For instance, consider our sample call:

pumsdf <- extractpums("pumsa",list(Gender=c(23,23),Age=c(25,26)))

When makerow() executes, fl will be a list with two elements, named Gender and Age. The string pr, the current Person record, will have Gender in column 23 and Age in columns 25 and 26. We call intextract() to pull out the desired numbers.

The intextract() function itself is a straightforward conversion of characters to numbers, such as converting the string "12" to the number 12.
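
Since substr() and as.integer() are both vectorized, intextract() also works on several records at once. As a small illustration, the sketch below applies it to the first 26 characters of the three Person records shown in the sample data; the results agree with the first three rows of pumsdf.

```r
# First 26 characters of the three Person records shown earlier.
precs <- c("P0000195010001092300042019",
           "P0000407010000530100001038",
           "P0000407020000530301101014")

# Same logic as intextract() in the listing; s may be a vector of records.
intextract <- function(s, rng) {
   as.integer(substr(s, rng[1], rng[2]))
}

gender <- intextract(precs, c(23, 23))
age    <- intextract(precs, c(25, 26))
gender   # 2 1 1
age      # 19 38 14
```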

Note that, if not for the presence of Household records, we could do all of this much more easily with a built-in R function: read.fwf(). The name of this function is an abbreviation for “read fixed-width formatted,” alluding to the fact that each variable is stored in given character positions of a record. In essence, this function eliminates the need to write a function like intextract().
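
To give a feel for read.fwf(), here is a minimal sketch on a hypothetical file that contains only Person records. The three records are the first 26 characters of the Person records shown earlier, written to a temporary file. Negative entries in widths tell read.fwf() to skip that many characters.

```r
# Person-only records, truncated to 26 characters, in a temporary file
# (a stand-in for a hypothetical file with no Household records).
tmp <- tempfile()
writeLines(c("P0000195010001092300042019",
             "P0000407010000530100001038",
             "P0000407020000530301101014"), tmp)

# Skip 1 character ("P"), read 7 (serial number), skip 14,
# read 1 (gender, column 23), skip 1, read 2 (age, columns 25-26).
persons <- read.fwf(tmp, widths = c(-1, 7, -14, 1, -1, 2),
                    col.names = c("serno", "Gender", "Age"))
persons
unlink(tmp)
```

The resulting data frame has one row per record, with the same serial numbers, genders, and ages as the first three rows of pumsdf.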

Certain I/O functions, such as read.table() and scan(), accept web URLs as arguments. (Check R’s online help facility to see if your favorite function allows this.)

As an example, we’ll read some data from the University of California, Irvine archive at http://archive.ics.uci.edu/ml/datasets.html, using the Echocardiogram data set. After navigating the links, we find the location of that file and then read it from R, as follows:

> uci <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
> uci <- paste(uci,"echocardiogram/echocardiogram.data",sep="")
> ecc <- read.csv(uci)

(We’ve built up the URL in stages here to fit the page.)

Let’s take a look at what we downloaded:

> head(ecc)
  X11 X0 X71 X0.1 X0.260     X9 X4.600  X14    X1  X1.1 name X1.2 X0.2
1  19  0  72    0  0.380      6  4.100   14 1.700 0.588 name    1    0
2  16  0  55    0  0.260      4  3.420   14     1     1 name    1    0
3  57  0  60    0  0.253 12.062  4.603   16 1.450 0.788 name    1    0
4  19  1  57    0  0.160     22  5.750   18 2.250 0.571 name    1    0
5  26  0  68    0  0.260      5  4.310   12     1 0.857 name    1    0
6  13  0  62    0  0.230     31  5.430 22.5 1.875 0.857 name    1    0

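We could then do our analyses. For example, the third column is age, so we could find its mean or perform other calculations on that data. Note, though, that this file has no header line. Since read.csv() assumes one by default, the first record was consumed as the column names (X11, X0, and so on); passing header=FALSE keeps that record as data. See the echocardiogram.names page at http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.names for descriptions of all of the variables.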
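
The effect of the header argument can be seen without a network connection. The sketch below rebuilds the first two records from the output above and reads them with and without header=FALSE. Reconstructing the first record from the column names is an assumption: it treats suffixes such as the .1 in X0.1 as added by R to make duplicate names unique.

```r
# The first two records, as recovered from the output above (an assumption).
txt <- paste("11,0,71,0,0.260,9,4.600,14,1,1,name,1,0",
             "19,0,72,0,0.380,6,4.100,14,1.700,0.588,name,1,0",
             sep = "\n")

# Default: header=TRUE, so the first record becomes the column names.
with_header <- read.csv(text = txt)
names(with_header)[1:3]   # "X11" "X0" "X71"
nrow(with_header)         # 1

# header=FALSE keeps every record as data; columns are named V1, V2, ...
no_header <- read.csv(text = txt, header = FALSE)
nrow(no_header)           # 2
mean(no_header$V3)        # 71.5, the mean age in these two records
```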

Given the statistical basis of R, file reads are probably much more common than writes. But writes are sometimes necessary, and this section will present methods for writing to files.

The function write.table() works very much like read.table(), except that it writes a data frame instead of reading one. For instance, let’s take the little Jack and Jill example from the beginning of Chapter 5:

> kids <- c("Jack","Jill")
> ages <- c(12,10)
> d <- data.frame(kids,ages,stringsAsFactors=FALSE)
> d
  kids ages
1 Jack   12
2 Jill   10
> write.table(d,"kds")

The file kds will now have these contents:

"kids" "ages"
"1" "Jack" 12
"2" "Jill" 10

To write a matrix to a file, just state that you do not want row or column names. For a matrix xc, for example:

> write.table(xc,"xcnew",row.names=FALSE,col.names=FALSE)
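
As a sanity check, a matrix written this way can be read back with the read.table() and as.matrix() combination shown earlier. In this sketch, the small matrix is a made-up stand-in for xc and a temporary file stands in for xcnew.

```r
# A small hypothetical matrix standing in for xc.
xc <- matrix(1:6, nrow = 2)

tmp <- tempfile()   # stand-in for "xcnew"
write.table(xc, tmp, row.names = FALSE, col.names = FALSE)
written <- readLines(tmp)
written             # "1 3 5" "2 4 6"

# Read it back and drop the V1, V2, ... names that read.table() adds.
xc2 <- unname(as.matrix(read.table(tmp)))
all(xc2 == xc)      # TRUE
unlink(tmp)
```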

The function cat() can also be used to write to a file, one part at a time. Here’s an example:

> cat("abc\n",file="u")
> cat("de\n",file="u",append=TRUE)

The first call to cat() creates the file u, consisting of one line with contents "abc". The second call appends a second line. Unlike the case of using the writeLines() function (which we’ll discuss next), the file is automatically saved after each operation. For instance, after the previous calls, the file will look like this:

abc
de

You can write multiple fields as well. So:

> cat(file="v",1,2,"xyz\n")

would produce a file v consisting of a single line:

1 2 xyz
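
By default cat() puts a single space between fields; the sep argument changes that. A brief sketch, using a temporary file in place of a named file such as v:

```r
tmp <- tempfile()   # stand-in for a file such as v

# sep="," replaces the default single space between fields.
cat(1, 2, "xyz\n", file = tmp, sep = ",")
# append=TRUE adds to the file instead of overwriting it.
cat("a", "b\n", file = tmp, sep = "\t", append = TRUE)

out <- readLines(tmp)
out   # "1,2,xyz" "a\tb"
unlink(tmp)
```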

You can also use writeLines(), the counterpart of readLines(). If you use a connection, you must specify "w" to indicate you are writing to the file, not reading from it:

> c <- file("www","w")
> writeLines(c("abc","de","f"),c)
> close(c)

The file www will be created with these contents:

abc
de
f

Note the need to proactively close the file.
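
If you do not need to keep the connection open, writeLines() also accepts a filename in place of a connection, in which case it opens and closes the file itself. A minimal sketch, with a temporary file standing in for www:

```r
tmp <- tempfile()   # stand-in for "www"

# Passing a file name instead of a connection: writeLines() opens the file,
# writes the lines, and closes it again.
writeLines(c("abc", "de", "f"), tmp)

back <- readLines(tmp)
back   # "abc" "de" "f"
unlink(tmp)
```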

R has a variety of functions for getting information about directories and files, setting file access permissions, and the like. To see all the file- and directory-related functions, type the following:

> ?files

Some of these functions, including dir(), file.info(), and file.path(), will be demonstrated in the next example.

Here, we’ll develop a function to find the sum of the contents (assumed numeric) in all files in a directory tree. In our example, a directory dir1 contains the files filea and fileb, as well as a subdirectory dir2, which holds the file filec. Together, the three files contain nine numbers.

If dir1 is in our current directory, the call sumtree("dir1") will yield the sum of those nine numbers, 98. Otherwise, we need to specify the full pathname of dir1, such as sumtree("/home/nm/dir1"). Here is the code:

1    sumtree <- function(drtr) {
2       tot <- 0
3       # get names of all files in the tree
4       fls <- dir(drtr,recursive=TRUE)
5       for (f in fls) {
6          # form the path to f; add its contents unless f is a directory
7          f <- file.path(drtr,f)
8          if (!file.info(f)$isdir) {
9             tot <- tot + sum(scan(f,quiet=TRUE))
10         }
11      }
12      return(tot)
13   }

Note that this problem is a natural for recursion, which we discussed in Section 7.9. But here, R has done the recursion for us by allowing it as an option in dir(). Thus, in line 4, we set recursive=TRUE in order to find the files throughout the various levels of the directory tree.

To call file.info(), we need to account for the fact that the current filename f is relative to drtr, so our file filea would be referred to as dir1/filea. To form that pathname, we need to concatenate drtr, a slash, and filea. We could use the R string concatenation function paste() for this, but then we would need a separate case for Windows, which uses a backslash instead of a slash. The function file.path() handles all of that for us.

Some commentary pertaining to line 8 is in order. The function file.info() returns information about f as a data frame, one of whose columns is isdir, with one row for each file and with row names being the filenames. That column consists of Boolean values indicating whether each file is a directory. In line 8, then, we can detect whether the current file f is a directory. If f is an ordinary file, we go ahead and add its contents to our running total.
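
Since file.info() accepts a whole vector of filenames, the same computation can also be written without an explicit loop. The sketch below builds a small temporary tree with the same layout as dir1 and tries a hypothetical variant, sumtree2(). The file contents are made up for illustration, so the total differs from the 98 in the text.

```r
# Build a small tree mirroring the layout above. The numbers are made up;
# they are not the file contents used in the text.
root <- tempfile("dir1")
dir.create(file.path(root, "dir2"), recursive = TRUE)
writeLines(c("1", "2", "3"), file.path(root, "filea"))
writeLines(c("4", "5"),      file.path(root, "fileb"))
writeLines(c("10", "20"),    file.path(root, "dir2", "filec"))

# file.info() returns one row per path; isdir flags the directories.
paths <- dir(root, recursive = TRUE, full.names = TRUE, include.dirs = TRUE)
info <- file.info(paths)
info$isdir   # one logical value per path

# Hypothetical vectorized variant of sumtree().
sumtree2 <- function(drtr) {
   fls <- dir(drtr, recursive = TRUE, full.names = TRUE)
   fls <- fls[!file.info(fls)$isdir]   # keep ordinary files only
   sum(vapply(fls, function(f) sum(scan(f, quiet = TRUE)), numeric(1)))
}

total <- sumtree2(root)
total   # 45
unlink(root, recursive = TRUE)
```

Here full.names=TRUE makes dir() return paths that already include drtr, so no call to file.path() is needed inside the function.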