Now that we’ve covered the basics of I/O, let’s get to some more practical applications of reading and writing files. The following sections discuss reading data frames or matrices from files, working with text files, accessing files on remote machines, and getting file and directory information.
In Section 5.1.2, we discussed the use of the function read.table() to read in a data frame. As a quick review, suppose the file z looks like this:
name age
John 25
Mary 28
Jim 19
The first line contains an optional header, specifying column names. We could read the file this way:
> z <- read.table("z",header=TRUE)
> z
  name age
1 John  25
2 Mary  28
3  Jim  19
Note that scan() would not work here, because our file has a mixture of numeric and character data (and a header).
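As a quick check of this review, here is a minimal, self-contained sketch. It assumes a temporary file (created with tempfile()) standing in for the file z, and confirms that read.table() keeps the name column as character data and the age column as numbers.

```r
# Write the sample file to a temporary path (a stand-in for the file z)
zfile <- tempfile()
writeLines(c("name age", "John 25", "Mary 28", "Jim 19"), zfile)

# header=TRUE tells read.table() that the first line holds column names
z <- read.table(zfile, header = TRUE, stringsAsFactors = FALSE)

str(z)        # shows one character column and one integer column
mean(z$age)   # the numeric column is usable right away
```

Because each column gets its own mode, the mixture of character and numeric data poses no problem here.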
There appears to be no direct way of reading in a matrix from a file, but it can be done easily with other tools. A simple, quick way is to use scan() to read in the matrix row by row. You use the byrow option in the function matrix() to indicate that you are defining the elements of the matrix in a row-wise, rather than column-wise, manner.
For instance, say the file x contains a 5-by-3 matrix, stored row-wise:
1 0 1
1 1 1
1 1 0
1 1 0
0 0 1
We can read it into a matrix this way:
> x <- matrix(scan("x"),nrow=5,byrow=TRUE)
This is fine for quick, one-time operations, but for generality, you can use read.table(), which returns a data frame, and then convert via as.matrix(). Here is a general method:
read.matrix <- function(filename) {
   as.matrix(read.table(filename))
}
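The two approaches can be compared side by side. The following sketch assumes a temporary file in place of the file x; it is an illustration under that assumption rather than part of the original example.

```r
# Write the 5-by-3 matrix to a temporary file, one row per line
xfile <- tempfile()
writeLines(c("1 0 1", "1 1 1", "1 1 0", "1 1 0", "0 0 1"), xfile)

# Approach 1: scan() returns a flat vector; matrix() reshapes it row-wise
m1 <- matrix(scan(xfile, quiet = TRUE), nrow = 5, byrow = TRUE)

# Approach 2: read.table() works out the number of rows and columns itself
read.matrix <- function(filename) {
  as.matrix(read.table(filename))
}
m2 <- read.matrix(xfile)

# m2 carries the default column names V1, V2, V3; drop them to compare
all(m1 == unname(m2))
```

The first approach requires us to know the number of rows in advance, while the second does not, which is what makes it the more general of the two.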
In computer literature, there is often a distinction made between text files and binary files. That distinction is somewhat misleading—every file is binary in the sense that it consists of 0s and 1s. Let’s take the term text file to mean a file that consists mainly of ASCII characters or coding for some other human language (such as GB for Chinese) and that uses newline characters to give humans the perception of lines. The latter aspect will turn out to be central here. Nontext files, such as JPEG images or executable program files, are generally called binary files.
You can use readLines() to read in a text file, either one line at a time or in a single operation. For example, suppose we have a file z1 with the following contents:
John 25
Mary 28
Jim 19
We can read the file all at once, like this:
> z1 <- readLines("z1")
> z1
[1] "John 25" "Mary 28" "Jim 19"
Since each line is treated as a string, the return value here is a vector of strings—that is, a vector of character mode. There is one vector element for each line read, thus three elements here.
Alternatively, we can read it in one line at a time. For this, we first need to create a connection, as described next.
Connection is R’s term for a fundamental mechanism used in various kinds of I/O operations. Here, it will be used for file access.
The connection is created by calling file(), url(), or one of several other R functions. To see a list of those functions, type this:
> ?connection
So, we can now read in the z1 file (introduced in the previous section) line by line, as follows:
> c <- file("z1","r")
> readLines(c,n=1)
[1] "John 25"
> readLines(c,n=1)
[1] "Mary 28"
> readLines(c,n=1)
[1] "Jim 19"
> readLines(c,n=1)
character(0)
We opened the connection, assigned the result to c, and then read the file one line at a time, as specified by the argument n=1. When R encountered the end of file (EOF), it returned an empty result. We needed to set up a connection so that R could keep track of our position in the file as we read through it.
We can detect EOF in our code, again using the file z1:
> c <- file("z1","r")
> while(TRUE) {
+    rl <- readLines(c,n=1)
+    if (length(rl) == 0) {
+       print("reached the end")
+       break
+    } else print(rl)
+ }
[1] "John 25"
[1] "Mary 28"
[1] "Jim 19"
[1] "reached the end"
If we wish to “rewind” (that is, to start again at the beginning of the file), we can use seek():
> c <- file("z1","r")
> readLines(c,n=2)
[1] "John 25" "Mary 28"
> seek(con=c,where=0)
[1] 16
> readLines(c,n=1)
[1] "John 25"
The argument where=0 in our call to seek() means that we wish to position the file pointer zero characters from the start of the file; in other words, directly at the beginning.
The call returns 16, meaning that the file pointer was at position 16 before we made the call. That makes sense. The first line consists of "John 25" plus the end-of-line character, for a total of eight characters, and the same is true for the second line. So, after reading the first two lines, we were at position 16.
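The position arithmetic and the rewind can be reproduced with a short sketch. It assumes a temporary file in place of z1 and a system that ends lines with a single newline character. The value returned by seek() itself is not relied on here, since it may depend on the platform and R version.

```r
# Temporary stand-in for the file z1
zfile <- tempfile()
writeLines(c("John 25", "Mary 28", "Jim 19"), zfile)

con <- file(zfile, "r")
first_two <- readLines(con, n = 2)

# Each line read accounts for its characters plus one end-of-line character
expected_offset <- sum(nchar(first_two) + 1)   # 8 + 8 = 16

invisible(seek(con, where = 0))   # rewind to the start of the file
again <- readLines(con, n = 1)    # reads the first line a second time
close(con)
```

After the rewind, again holds the first line of the file once more.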
You can close a connection by calling (what else?) close(). You would use this to let the system know that the file you have been writing is complete and should now be officially written to disk. As another example, in a client/server relationship over the Internet (see Section 10.3.1), a client would use close() to indicate to the server that the client is signing off.
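Opening, reading to EOF, and closing can be packaged together. The sketch below uses a hypothetical helper, each_line(), which is not part of R or of the examples above. It relies on on.exit() so that the connection is closed even if processing a line fails.

```r
# Hypothetical helper: apply fun to each line of a text file, one at a time
each_line <- function(path, fun) {
  con <- file(path, "r")
  on.exit(close(con))              # runs when the function exits, even on error
  results <- list()
  repeat {
    rl <- readLines(con, n = 1)
    if (length(rl) == 0) break     # a zero-length result signals EOF
    results[[length(results) + 1]] <- fun(rl)
  }
  results
}

# Try it on a temporary stand-in for the file z1
zfile <- tempfile()
writeLines(c("John 25", "Mary 28", "Jim 19"), zfile)
ages <- each_line(zfile, function(s) as.integer(strsplit(s, " ")[[1]][2]))
unlist(ages)
```

The caller never handles the connection directly, so there is no way to forget the close() call.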
The U.S. Census Bureau makes census data available in the form of Public Use Microdata Samples (PUMS). The term microdata here means that we are dealing with raw data and each record is for a real person, as opposed to statistical summaries. Data on many, many variables are included.
The data is organized by household. For each unit, there is first a Household record, describing the various characteristics of that household, followed by one Person record for each person in the household. Character positions 106 and 107 (with numbering starting at 1) in the Household record state the number of Person records for that household. (The number can be very large, since some institutions count as households.)
To enhance the integrity of the data, character position 1 contains H or P to confirm that this is a Household or Person record. So, if you read an H record, and it tells you there are three people in the household, then the following three records should be P records, followed by another H record; if not, you’ve encountered an error.
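The integrity check just described can be sketched as follows. The records here are synthetic (built by the hypothetical helper make_rec(), not taken from a real PUMS file), and check_pums() is an illustrative name of our own.

```r
# Build a synthetic record: type letter in position 1,
# Person-record count in positions 106 and 107
make_rec <- function(type, npersons = 0) {
  rec <- strrep(" ", 107)
  substr(rec, 1, 1) <- type
  substr(rec, 106, 107) <- sprintf("%02d", npersons)
  rec
}

# Returns TRUE if every H record is followed by exactly the number of
# P records it announces, and FALSE otherwise
check_pums <- function(recs) {
  i <- 1
  while (i <= length(recs)) {
    if (substr(recs[i], 1, 1) != "H") return(FALSE)
    npr <- as.integer(substr(recs[i], 106, 107))
    if (is.na(npr)) return(FALSE)
    if (npr > 0) {
      idx <- (i + 1):(i + npr)
      if (max(idx) > length(recs)) return(FALSE)   # file ends too early
      if (any(substr(recs[idx], 1, 1) != "P")) return(FALSE)
    }
    i <- i + npr + 1                               # jump to the next H record
  }
  TRUE
}

good <- c(make_rec("H", 2), make_rec("P"), make_rec("P"),
          make_rec("H", 0), make_rec("H", 1), make_rec("P"))
bad  <- c(make_rec("H", 2), make_rec("P"), make_rec("H", 0))
check_pums(good)
check_pums(bad)
```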
As our test file, we’ll take the first 1,000 records of the year 2000 1 percent sample. The first few records look like this:
H000019510649 06010 99979997 70 631973 15758 59967658436650000012000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0000 0 0 0 0 0 00000000000000000000000000000 00000000000000000000000000
P00001950100010923000420190010110000010147050600206011099999904200000 0040010000 00300280 28600 70 9997 9997202020202020220000040000000000000006000000 00000 00 0000 00000000000000000132241057904MS 476041-20311010310 07000049010000000000900100000100000100000100000010000001000139010000490000
H000040710649 06010 99979997 70 631973 15758 599676584365300800200000300106060503010101010102010 01200006000000100001 00600020 0 0 0 0 0000 0 0 0 0 0 02000102010102200000000010750 02321125100004000000040000
P00004070100005301000010380010110000010147030400100009005199901200000 0006010000 00100000 00000 00 0000 0000202020202020220000040000000000000001000060 06010 70 9997 99970101004900100000001018703221 770051-10111010500 40004000000000000000000000000000000000000000000000000000004000000040000349
P00004070200005303011010140010110000010147050000204004005199901200000 0006010000 00100000 00000 00 0000 000020202020 0 0200000000000000000000000050000 00000 00 0000 000000000000000000000000000000000000000000-00000000000 000 0 0 0 0 0 0 0 0 00000000349
H000061010649 06010 99979997 70 631973 15758 599676584360801190100000200204030502010101010102010 00770004800064000001 1 0 030 0 0 0 0340 00660000000170 0 06010000000004410039601000000 00021100000004940000000000
The records are very wide and thus wrap around. Each record is shown on its own line here, beginning with H or P.
We’ll create a function called extractpums() to read in a PUMS file and create a data frame from its Person records. The user specifies the filename and lists fields to extract and names to assign to those fields.
We also want to retain the household serial number. This is good to have because data for persons in the same household may be correlated and we may want to add that aspect to our statistical model. Also, the household data may provide important covariates. (In the latter case, we would want to retain the covariate data as well.)
Before looking at the function code, let’s see what the function does. In this data set, gender is in column 23 and age in columns 25 and 26. In the example, our filename is pumsa. The following call creates a data frame consisting of those two variables.
pumsdf <- extractpums("pumsa",list(Gender=c(23,23),Age=c(25,26)))
Note that we are stating here the names we want the columns to have in the resulting data frame. We can use any names we want—say Sex and Ancientness.
Here is the first part of that data frame:
> head(pumsdf)
  serno Gender Age
2   195      2  19
3   407      1  38
4   407      1  14
5   610      2  65
6  1609      1  50
7  1609      2  49
The following is the code for the extractpums() function.
# reads in PUMS file pf, extracting the Person records, returning a data
# frame; each row of the output will consist of the Household serial
# number and the fields specified in the list flds; the columns of
# the data frame will have the names of the indices in flds

extractpums <- function(pf,flds) {
   dtf <- data.frame()  # data frame to be built
   con <- file(pf,"r")  # connection
   # process the input file
   repeat {
      hrec <- readLines(con,1)  # read Household record
      if (length(hrec) == 0) break  # end of file, leave loop
      # get household serial number
      serno <- intextract(hrec,c(2,8))
      # how many Person records?
      npr <- intextract(hrec,c(106,107))
      if (npr > 0)
         for (i in 1:npr) {
            prec <- readLines(con,1)  # get Person record
            # make this person's row for the data frame
            person <- makerow(serno,prec,flds)
            # add it to the data frame
            dtf <- rbind(dtf,person)
         }
   }
   close(con)  # release the file once all records have been read
   return(dtf)
}

# set up this person's row for the data frame
makerow <- function(srn,pr,fl) {
   l <- list()
   l[["serno"]] <- srn
   for (nm in names(fl)) {
      l[[nm]] <- intextract(pr,fl[[nm]])
   }
   return(l)
}

# extracts an integer field in the string s, in character positions
# rng[1] through rng[2]
intextract <- function(s,rng) {
   fld <- substr(s,rng[1],rng[2])
   return(as.integer(fld))
}
Let’s see how this works. At the beginning of extractpums(), we create an empty data frame and set up the connection for the PUMS file read.
dtf <- data.frame()  # data frame to be built
con <- file(pf,"r")  # connection
The main body of the code then consists of a repeat loop.
repeat {
   hrec <- readLines(con,1)  # read Household record
   if (length(hrec) == 0) break  # end of file, leave loop
   # get household serial number
   serno <- intextract(hrec,c(2,8))
   # how many Person records?
   npr <- intextract(hrec,c(106,107))
   if (npr > 0)
      for (i in 1:npr) {
         ...
      }
}
This loop iterates until the end of the input file is reached. The latter condition will be sensed by encountering a zero-length Household record, as seen in the preceding code.
Within the repeat loop, we alternate reading a Household record and reading the associated Person records. The number of Person records for the current Household record is extracted from columns 106 and 107 of that record and stored in npr. That extraction is done by a call to our function intextract().
The for loop then reads in the Person records one by one, in each case forming the desired row for the output data frame and then attaching it to the latter via rbind():
for (i in 1:npr) {
   prec <- readLines(con,1)  # get Person record
   # make this person's row for the data frame
   person <- makerow(serno,prec,flds)
   # add it to the data frame
   dtf <- rbind(dtf,person)
}
Note how makerow() creates the row to be added for a given person. Here the formal arguments are srn for the household serial number, pr for the given Person record, and fl for the list of variable names and column fields.
makerow <- function(srn,pr,fl) {
   l <- list()
   l[["serno"]] <- srn
   for (nm in names(fl)) {
      l[[nm]] <- intextract(pr,fl[[nm]])
   }
   return(l)
}
For instance, consider our sample call:
pumsdf <- extractpums("pumsa",list(Gender=c(23,23),Age=c(25,26)))
When makerow() executes, fl will be a list with two elements, named Gender and Age. The string pr, the current Person record, will have Gender in column 23 and Age in columns 25 and 26. We call intextract() to pull out the desired numbers.
The intextract() function itself is a straightforward conversion of characters to numbers, such as converting the string "12" to the number 12.
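To see the two helper functions in isolation, we can run them on a made-up record. The 30-character string below is synthetic, not real PUMS data. The final step, which collects rows in a list and combines them once, is an alternative we are assuming for illustration rather than what extractpums() does.

```r
intextract <- function(s, rng) {
  as.integer(substr(s, rng[1], rng[2]))
}

makerow <- function(srn, pr, fl) {
  l <- list()
  l[["serno"]] <- srn
  for (nm in names(fl)) l[[nm]] <- intextract(pr, fl[[nm]])
  l
}

# Synthetic Person record: gender in column 23, age in columns 25 and 26
prec <- strrep("0", 30)
substr(prec, 1, 1) <- "P"
substr(prec, 23, 23) <- "2"
substr(prec, 25, 26) <- "19"

flds <- list(Gender = c(23, 23), Age = c(25, 26))
person <- makerow(195L, prec, flds)
str(person)   # a list with elements serno, Gender, and Age

# Alternative to growing a data frame row by row: gather the rows in a
# list and bind them in one step at the end
rows <- list(person, makerow(407L, prec, flds))
combined <- do.call(rbind, lapply(rows, as.data.frame))
```

Calling rbind() once at the end avoids recopying the growing data frame on every iteration, which can matter for large files.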
Note that, if not for the presence of Household records, we could do all of this much more easily with a handy built-in R function: read.fwf(). The name of this function is an abbreviation for “read fixed-width formatted,” alluding to the fact that each variable is stored in given character positions of a record. In essence, this function alleviates the need to write a function like intextract().
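Here is a minimal sketch of read.fwf() at work, assuming a file that contains only Person records. The records are synthetic, and make_person() is a hypothetical helper used to build them.

```r
# Synthetic 30-character Person records (not real PUMS data)
make_person <- function(gender, age) {
  rec <- strrep("0", 30)
  substr(rec, 1, 1) <- "P"
  substr(rec, 23, 23) <- as.character(gender)
  substr(rec, 25, 26) <- sprintf("%02d", age)
  rec
}
pfile <- tempfile()
writeLines(c(make_person(2, 19), make_person(1, 38)), pfile)

# Negative widths skip characters: skip 22, read 1 (column 23),
# skip 1 (column 24), read 2 (columns 25 and 26)
persons <- read.fwf(pfile, widths = c(-22, 1, -1, 2),
                    col.names = c("Gender", "Age"))
persons
```

The widths vector plays the role that the column ranges played in our calls to intextract().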
Certain I/O functions, such as read.table() and scan(), accept web URLs as arguments. (Check R’s online help facility to see if your favorite function allows this.)
As an example, we’ll read some data from the University of California, Irvine archive at http://archive.ics.uci.edu/ml/datasets.html, using the Echocardiogram data set. After navigating the links, we find the location of that file and then read it from R, as follows:
> uci <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
> uci <- paste(uci,"echocardiogram/echocardiogram.data",sep="")
> ecc <- read.csv(uci)
(We’ve built up the URL in stages here to fit the page.)
Let’s take a look at what we downloaded:
> head(ecc)
  X11 X0 X71 X0.1 X0.260     X9 X4.600  X14    X1  X1.1 name X1.2 X0.2
1  19  0  72    0  0.380      6  4.100   14 1.700 0.588 name    1    0
2  16  0  55    0  0.260      4  3.420   14     1     1 name    1    0
3  57  0  60    0  0.253 12.062  4.603   16 1.450 0.788 name    1    0
4  19  1  57    0  0.160     22  5.750   18 2.250 0.571 name    1    0
5  26  0  68    0  0.260      5  4.310   12     1 0.857 name    1    0
6  13  0  62    0  0.230     31  5.430 22.5 1.875 0.857 name    1    0
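We could then do our analyses. For example, the third column is age, so we could find its mean or perform other calculations on that data. Note that the column names shown (X11, X0, and so on) were generated from the first record of the file, because read.csv() assumes a header line by default; calling read.csv(uci,header=FALSE) would keep that record as data. See the echocardiogram.names page at http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.names for descriptions of all of the variables.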
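The effect of the header setting on a headerless file can be seen without a network connection. The sketch below assumes a small made-up file in the same comma-separated style, addressed through a file:// URL so that it runs offline. That form of URL is written here for Unix-like systems, and the same read.csv() calls accept an http:// address.

```r
# Three made-up records with no header line
csvfile <- tempfile(fileext = ".data")
writeLines(c("11,0,71", "19,0,72", "16,0,55"), csvfile)
local_url <- paste0("file://", normalizePath(csvfile, winslash = "/"))

with_default <- read.csv(local_url)                  # first record becomes the header
with_no_hdr  <- read.csv(local_url, header = FALSE)  # every record is kept as data

names(with_default)   # names built from the first record's values
nrow(with_default)
nrow(with_no_hdr)
```

With the default setting, one record is silently absorbed into the column names, so the data frame has one row fewer than the file has lines.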
Given the statistical basis of R, file reads are probably much more common than writes. But writes are sometimes necessary, and this section will present methods for writing to files.
The function write.table() works very much like read.table(), except that it writes a data frame instead of reading one. For instance, let’s take the little Jack and Jill example from the beginning of Chapter 5:
> kids <- c("Jack","Jill")
> ages <- c(12,10)
> d <- data.frame(kids,ages,stringsAsFactors=FALSE)
> d
  kids ages
1 Jack   12
2 Jill   10
> write.table(d,"kds")
The file kds will now have these contents:
"kids" "ages"
"1" "Jack" 12
"2" "Jill" 10
To write a matrix to a file, just state that you do not want row or column names. Here, xc is the matrix being written:
> write.table(xc,"xcnew",row.names=FALSE,col.names=FALSE)
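A round trip shows that nothing but the matrix entries lands in the file. The matrix and the temporary output file below are assumptions made for illustration.

```r
# A small example matrix standing in for xc
xc <- matrix(c(1, 0, 1,
               1, 1, 1), nrow = 2, byrow = TRUE)

outfile <- tempfile()
write.table(xc, outfile, row.names = FALSE, col.names = FALSE)
readLines(outfile)   # one line per matrix row, entries separated by spaces

# Read it back row-wise and compare with the original
back <- matrix(scan(outfile, quiet = TRUE), nrow = nrow(xc), byrow = TRUE)
all(back == xc)
```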
The function cat() can also be used to write to a file, one part at a time. Here’s an example:
> cat("abc\n",file="u")
> cat("de\n",file="u",append=TRUE)
The first call to cat() creates the file u, consisting of one line with contents "abc". The second call appends a second line. Unlike the case of using the writeLines() function with an open connection (which we’ll discuss next), the file is automatically saved after each operation. For instance, after the previous calls, the file will look like this:
abc
de
You can write multiple fields as well. So:
> cat(file="v",1,2,"xyz\n")
would produce a file v consisting of a single line:
1 2 xyz
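The following sketch gathers these calls in one place and adds the sep argument, which controls what cat() places between fields. The temporary file is an assumption made so that the example leaves nothing behind in the working directory.

```r
ufile <- tempfile()
cat("abc\n", file = ufile)                  # creates (or overwrites) the file
cat("de\n", file = ufile, append = TRUE)    # adds a second line
cat(1, 2, "xyz\n", file = ufile, append = TRUE)             # default separator: a space
cat(1, 2, "xyz\n", file = ufile, sep = ",", append = TRUE)  # comma-separated fields

readLines(ufile)
```

Without append=TRUE, each call to cat() would replace the previous contents of the file.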
You can also use writeLines(), the counterpart of readLines(). If you use a connection, you must specify "w" to indicate you are writing to the file, not reading from it:
> c <- file("www","w")
> writeLines(c("abc","de","f"),c)
> close(c)
The file www will be created with these contents:
abc
de
f
Note the need to proactively close the file.
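A brief sketch contrasts the two ways of calling writeLines(): with an open connection, where successive calls continue where the last one stopped, and with a plain filename. The temporary file is an assumption for illustration.

```r
wfile <- tempfile()

# With a connection: successive calls keep writing at the current position
con <- file(wfile, "w")
writeLines(c("abc", "de"), con)
writeLines("f", con)
close(con)                    # flush the output and release the file
via_connection <- readLines(wfile)

# With a filename: writeLines() opens and closes the file itself,
# so no explicit close() is needed
writeLines(c("abc", "de", "f"), wfile)
via_filename <- readLines(wfile)

identical(via_connection, via_filename)
```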
R has a variety of functions for getting information about directories and files, setting file access permissions, and the like. The following are a few examples:
file.info(): Gives file size, modification time, directory-versus-ordinary file status, and so on for each file whose name is in the argument, a character vector.
dir(): Returns a character vector listing the names of all the files in the directory specified in its first argument. If the optional argument recursive=TRUE is specified, the result will show the entire directory tree rooted at the first argument.
file.exists(): Returns a Boolean vector indicating whether the given file exists for each name in the first argument, a character vector.
getwd() and setwd(): Used to determine or change the current working directory.
To see all the file- and directory-related functions, type the following:
> ?files
Some of these functions will be demonstrated in the next example.
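Before that, here is a quick tour of the functions just listed, run in a scratch directory. The directory and filenames are made up for the illustration.

```r
# Build a small scratch directory tree
base <- tempfile("demo")
dir.create(base)
dir.create(file.path(base, "sub"))
writeLines("1 2 3", file.path(base, "a.txt"))
writeLines("4 5", file.path(base, "sub", "b.txt"))

top  <- dir(base)                     # entries at the top level only
tree <- dir(base, recursive = TRUE)   # files throughout the tree
found <- file.exists(file.path(base, c("a.txt", "missing.txt")))
info <- file.info(file.path(base, c("a.txt", "sub")))

top
tree
found
info[, c("size", "isdir")]            # one row per name passed in
```

Note that with recursive=TRUE, dir() lists the files inside subdirectories, with their relative paths, rather than the subdirectories themselves.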
Here, we’ll develop a function to find the sum of the contents (assumed numeric) in all files in a directory tree. In our example, a directory dir1 contains the files filea and fileb, as well as a subdirectory dir2, which holds the file filec. The contents of the files are as follows:
filea: 5, 12, 13
fileb: 3, 4, 5
filec: 24, 25, 7
If dir1 is in our current directory, the call sumtree("dir1") will yield the sum of those nine numbers, 98. Otherwise, we need to specify the full pathname of dir1, such as sumtree("/home/nm/dir1"). Here is the code:
1  sumtree <- function(drtr) {
2     tot <- 0
3     # get names of all files in the tree
4     fls <- dir(drtr,recursive=TRUE)
5     for (f in fls) {
6        # is f a directory?
7        f <- file.path(drtr,f)
8        if (!file.info(f)$isdir) {
9           tot <- tot + sum(scan(f,quiet=TRUE))
10       }
11    }
12    return(tot)
13 }
Note that this problem is a natural for recursion, which we discussed in Section 7.9. But here, R has done the recursion for us by allowing it as an option in dir(). Thus, in line 4, we set recursive=TRUE in order to find the files throughout the various levels of the directory tree.
To call file.info(), we need to account for the fact that the current filename f is relative to drtr, so our file filea would be referred to as dir1/filea. In order to form that pathname, we need to concatenate drtr, a slash, and filea. We could use the R string concatenation function paste() for this, but we would need a separate case for Windows, which uses a backslash instead of a slash. But file.path() does all that for us.
Some commentary pertaining to line 8 is in order. The function file.info() returns information about f as a data frame, one of whose columns is isdir, with one row for each file and with row names being the filenames. That column consists of Boolean values indicating whether each file is a directory. In line 8, then, we can detect whether the current file f is a directory. If f is an ordinary file, we go ahead and add its contents to our running total.
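To try the function, we can build the example tree in a temporary location. The sketch below assumes that the numbers in each file are separated by whitespace, and it creates dir1 under a temporary directory rather than the current one.

```r
sumtree <- function(drtr) {
  tot <- 0
  # get names of all files in the tree, relative to drtr
  fls <- dir(drtr, recursive = TRUE)
  for (f in fls) {
    f <- file.path(drtr, f)
    # add the file's contents unless it is a directory
    if (!file.info(f)$isdir) {
      tot <- tot + sum(scan(f, quiet = TRUE))
    }
  }
  tot
}

# Recreate the example tree: dir1 holds filea, fileb, and dir2/filec
root <- tempfile("tree")
dir1 <- file.path(root, "dir1")
dir.create(file.path(dir1, "dir2"), recursive = TRUE)
writeLines("5 12 13", file.path(dir1, "filea"))
writeLines("3 4 5",   file.path(dir1, "fileb"))
writeLines("24 25 7", file.path(dir1, "dir2", "filec"))

total <- sumtree(dir1)
total   # 30 + 12 + 56 = 98
```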