Importing Data from External Files

One of the nicest things about R is how easy it is to pull in data from other programs. R can import data from text files, other statistics software, and even spreadsheets. You don’t even need a local copy of the file: you can specify a file at a URL, and R will fetch the file for you over the Internet.

Most text files containing data are formatted similarly: each line of a text file represents an observation (or record). Each line contains a set of different variables associated with that observation. Sometimes, different variables are separated by a special character called the delimiter. Other times, variables are differentiated by their location on each line.

R includes a family of functions for importing delimited text files into R, based on the read.table function:

read.table(file, header, sep = , quote = , dec = , row.names, col.names,
           as.is = , na.strings , colClasses , nrows =, skip = ,
           check.names = , fill = , strip.white = , blank.lines.skip = ,
           comment.char = , allowEscapes = , flush = , stringsAsFactors = ,
           encoding = )

The read.table function reads a text file into R and returns a data.frame object. Each row in the input file is interpreted as an observation. Each column in the input file represents a variable. The read.table function expects each field to be separated by a delimiter.

For example, suppose that you had a file called top.5.salaries.csv that contained the following text (and only this text):

name.last,name.first,team,position,salary
"Manning","Peyton","Colts","QB",18700000
"Brady","Tom","Patriots","QB",14626720
"Pepper","Julius","Panthers","DE",14137500
"Palmer","Carson","Bengals","QB",13980000
"Manning","Eli","Giants","QB",12916666

This file contains the same data frame that we entered in Entering Data Using R Commands. Notice how this data is encoded:

  • The first row contains the column names.

  • Each text field is encapsulated in quotes.

  • Each field is separated by commas.

To load this file into R, you would specify that the first row contained column names (header=TRUE), that the delimiter was a comma (sep=","), and that quotes were used to encapsulate text (quote="\""). Here is an R statement that loads in this file:

> top.5.salaries <- read.table("top.5.salaries.csv",
+   header=TRUE, sep=",", quote="\"")

The read.table function is very flexible and allows you to load files with many different properties. Here is a brief description of the options for read.table.

ArgumentDescriptionDefault
fileThe name of the file to open or, alternatively, the name of a connection containing the data. You can even use a URL. (This is the one required argument for read.table.) 
headerA logical value indicating whether the first row of the file contains variable names.FALSE
sepThe character (or characters) separating fields. When “” is specified, any white space is used as a separator.""
quoteIf character values are enclosed in quotes, this argument should specify the type of quotes.“”
decThe character used for decimal points..
row.namesA character vector containing row names for the returned data frame. 
col.namesA character vector containing column names for the returned data frame. 
as.isA logical vector (the same length as the number of columns) that specifies whether or not to convert character values to factors.!stringsAsFactors
na.stringsA character vector specifying values that should be interpreted as NA.NA
colClassesA character vector of class names to be assigned to each column.NA
nrowsAn integer value specifying the number of rows to read. (Invalid values, such as negatives, are ignored.)-1
skipAn integer value specifying the number of rows in the text file to skip before beginning to read data.0
check.namesA logical value that specifies whether read.table should check if the column names are valid symbol names in R.TRUE
fillSometimes, a file might contain rows of unequal length. This argument is a logical value that specifies whether read.table should implicitly add blank fields at the end of rows where some values were missing.!blank.lines.skip
strip.whiteWhen sep !="", this logical value specifies whether read.table should remove extra leading and trailing white space from character fields.FALSE
blank.lines.skipA logical value that specifies whether read.table should ignore blank lines.TRUE
comment.charread.table can ignore comment lines in input files if the comment lines begin with a single special character. This argument specifies the character used to delineate these lines."#"
allowEscapesA logical value that indicates whether escapes (such as “\n” for a new line) should be interpreted or if character strings should be read literally.FALSE
flushA logical value that indicates whether read.table should skip to the next line when all requested fields have been read in from a line.FALSE
stringsAsFactorsA logical value indicating whether text fields should be converted to factors. default.stringsAsFactors()
encodingThe encoding scheme used for the source file."unknown"

The most important options are sep and header. You almost always have to know the field separator and know if there is a header field. R includes a set of convenience functions that call read.table with different default options for these values (and a couple of others). Here is a description of these functions.

Functionheadersepquotedecfillcomment.char
read.tableFALSE \" or \'.!blank.lines.skip#
read.csvTRUE,\".TRUE 
read.csv2TRUE;\",TRUE 
read.delimTRUE\t\".TRUE 
read.delim2TRUE\t\",TRUE 

In most cases, you will find that you can use read.csv for comma-separated files or read.delim for tab-delimited files without specifying any other options. (Except, I suppose, if you are in Europe, and you use commas to indicate the decimal point in numbers. Then you can use read.csv2 and read.delim2.)

As another example, suppose that you wanted to analyze some historical stock quote data. Yahoo! Finance provides this information in an easily downloadable form on its website; you can fetch a CSV file from a single URL. For example, to fetch the closing price of the S&P 500 index for every month between April 1, 1999, and April 1, 2009, you could use the following URL: http://ichart.finance.yahoo.com/table.csv?s=%5EGSPC&a=03&b=1&c=1999&d=03&e=1&f=2009&g=m&ignore=.csv.

Conveniently, you can use a URL in place of a filename in R. This means that you could load this data into R with the following expression:

> sp500 <- read.csv(paste("http://ichart.finance.yahoo.com/table.csv?",
+   "s=%5EGSPC&a=03&b=1&c=1999&d=03&e=1&f=2009&g=m&ignore=.csv", sep=""))
> # show the first 5 rows
> sp500[1:5,]
        Date   Open   High    Low  Close      Volume Adj.Close
1 2009-04-01 793.59 813.62 783.32 811.08 12068280000    811.08
2 2009-03-02 729.57 832.98 666.79 797.87  7633306300    797.87
3 2009-02-02 823.09 875.01 734.52 735.09  7022036200    735.09
4 2009-01-02 902.99 943.85 804.30 825.88  5844561500    825.88
5 2008-12-01 888.61 918.85 815.69 903.25  5320791300    903.25

We will revisit this example in the next section.

If you’re trying to load a really big file, you might find that loading the file takes a long time. It can be very frustrating to wait 15 minutes for a file to load, only to discover that you have specified the wrong separator. A useful technique for testing is to load only a small number of rows into R. For example, to load 20 rows, you would add nrows=20 as an argument to read.table.

Many programs can export data as text files. Here are a few tips for creating text files that you can easily read into R:

To read a fixed-width format text file into a data frame, you can use the read.fwf function:

read.fwf(file, widths, header = , sep = ,
         skip = , row.names, col.names, n = ,
         buffersize = , ...)

Here is a description of the arguments to read.fwf.

ArgumentDescriptionDefault
fileThe name of the file to open or, alternatively, the name of a connection containing the data. (This is a required argument.) 
widthsAn integer vector or a list of integer vectors. If the input file has one record per line, then use an integer vector where each value represents the width of each variable. If each record spans multiple lines, then use a list of integer vectors where each integer vector corresponds to the widths of the variables on that line. (This is a required argument.) 
headerA logical value indicating whether the first line of the file contains variable names. (If it does, the names must be delimited by sep.)FALSE
sepThe character used to delimit variable names in the header.\t
skipAn integer specifying the number of lines to skip at the beginning of the file.0
row.namesA character vector used to specify row names in the data frame. 
col.namesA character vector used to specify column names in the data frame. 
nAn integer value specifying the number of rows of records to read into R. (Invalid values, such as negatives, are ignored.)-1
buffersizeAn integer specifying the maximum number of lines to be read at one time. (This value may be tuned to optimize performance.)2000

Note that read.fwf can also take many arguments used by read.table, including as.is, na.strings, colClasses, and strip.white.

Most of the time, you should be able to load text files into R with the read.table function. Sometimes, however, you might be provided with a file that cannot be read correctly with this function. For example, observations in the file might span multiple lines. To read data into R one line at a time, use the function readLines:

readLines(con = stdin(), n = -1L, ok = TRUE, warn = TRUE,
          encoding = "unknown")

The readLines function will return a character vector, with one value corresponding to each row in the file. Here is a description of the arguments to readLines.

ArgumentDescriptionDefault
conA character string (specifying a file or URL) or a connection containing the data to read.stdin()
nAn integer value specifying the number of lines to read. (Negative values mean “read until the end of the file.”)-1L
okA logical value specifying whether to trigger an error if the number of lines in the file is less than n.TRUE
warnA logical value specifying whether to warn the user if the file does not end with an EOL.TRUE
encodingA character value specifying the encoding of the input file."unknown"

Note that you can use readLines interactively to enter data.

Another useful function for reading more complex file formats is scan:

scan(file = "", what = double(0), nmax = -1, n = -1, sep = "",
     quote = if(identical(sep, "\n")) "" else "'\"", dec = ".",
     skip = 0, nlines = 0, na.strings = "NA",
     flush = FALSE, fill = FALSE, strip.white = FALSE,
     quiet = FALSE, blank.lines.skip = TRUE, multi.line = TRUE,
     comment.char = "", allowEscapes = FALSE,
     encoding = "unknown")

The scan function allows you to read the contents of a file into R. Unlike readLines, scan allows you to read data into a specifically defined data structure using the argument what.

Here is a description of the arguments to scan.

ArgumentDescriptionDefault
fileA character string (specifying a file or URL) or a connection containing the data to read.""
whatThe type of data to be read. If all fields are the same type, you can specify logical, integer, numeric, complex, character, or raw. Otherwise, specify a list of types to read values into a list. (You can specify the type of each element in the list individually.)double(0)
nmaxAn integer value specifying the number of values to read or the number of records to read (if what is a list). (Negative values mean “read until the end of the file.”)-1
nAn integer value specifying the number of values to read. (Negative values mean “read until the end of the file.”)-1
sepCharacter value specifying the separator between values. sep="" means that any white space character is interpreted as a separator.“”
quoteCharacter value used to quote strings.if(identical(sep, "\n")) "" else "'\""
decCharacter value used for decimal place in numbers."."
skipNumber of lines to skip at the top of the file.0
nlinesNumber of lines of data to read. Nonpositive values mean that there is no limit.0
na.stringsCharacter values specifying how NA values are encoded."NA"
flushA logical value specifying whether to “flush” any remaining text on a line after the last requested item on a line is read into what. (Commonly used to allow comments at the end of lines or to ignore unneeded fields.)FALSE
fillSpecifies whether to add empty fields to lines with fewer fields than specified by what.FALSE
strip.whiteSpecifies whether to strip leading and trailing white space from character fields. Applies only when sep is specified.FALSE
quietIf quiet=FALSE, scan will print a message showing how many lines were read. If quiet=TRUE, then this message is suppressed.FALSE
blank.lines.skipSpecifies whether to ignore blank lines.TRUE
multi.lineIf what is a list, allows records to span multiple lines.TRUE
comment.charNotes a character to be used to specify comment lines.""
allowEscapesSpecifies whether C-style escapes (such as \t for Tab character or \n for newlines) should be interpreted by scan or read verbatim. If allowEscapes=FALSE, then they are interpreted as special characters; if allowEscapes=TRUE, then they are read literally.FALSE
encodingA character value specifying the encoding of the input file."unknown"

Like readLines, you can also use scan to enter data directly into R.

Although many software packages can export data as text files, you might find it more convenient to read their data files directly. R can read files in many other formats. Table 11-1 shows a list of functions for reading (and writing) files in other formats. You can find more information about these functions in the help files.