Importing Data from External Files

One of the nicest things about R is how easy it is to pull in data from other programs. R can import data from text files, other statistics software, and even spreadsheets. You don’t even need a local copy of the file: you can specify a file at a URL, and R will fetch the file for you over the Internet.

Text Files

Most text files containing data are formatted similarly: each line of a text file represents an observation (or record). Each line contains a set of different variables associated with that observation. Sometimes, different variables are separated by a special character called the delimiter. Other times, variables are differentiated by their location on each line.

Delimited files

R includes a family of functions for importing delimited text files into R, based on the read.table function:

read.table(file, header, sep = , quote = , dec = , row.names, col.names,
           as.is = , na.strings , colClasses , nrows =, skip = ,
           check.names = , fill = , strip.white = , blank.lines.skip = ,
           comment.char = , allowEscapes = , flush = , stringsAsFactors = ,
           encoding = )

The read.table function reads a text file into R and returns a data.frame object. Each row in the input file is interpreted as an observation. Each column in the input file represents a variable. The read.table function expects each field to be separated by a delimiter.

For example, suppose that you had a file called top.5.salaries.csv that contained the following text (and only this text):

name.last,name.first,team,position,salary
"Manning","Peyton","Colts","QB",18700000
"Brady","Tom","Patriots","QB",14626720
"Pepper","Julius","Panthers","DE",14137500
"Palmer","Carson","Bengals","QB",13980000
"Manning","Eli","Giants","QB",12916666

This file contains the same data frame that we entered in Entering Data Using R Commands. Notice how this data is encoded:

The first row contains the column names.
Each text field is encapsulated in quotes.
Each field is separated by commas.

To load this file into R, you would specify that the first row contained column names (header=TRUE), that the delimiter was a comma (sep=","), and that quotes were used to encapsulate text (quote="\""). Here is an R statement that loads in this file:

> top.5.salaries <- read.table("top.5.salaries.csv",
+   header=TRUE, sep=",", quote="\"")

The read.table function is very flexible and allows you to load files with many different properties. Here is a brief description of the options for read.table.

Argument	Description	Default
file	The name of the file to open or, alternatively, the name of a connection containing the data. You can even use a URL. (This is the one required argument for `read.table`.)
header	A logical value indicating whether the first row of the file contains variable names.	`FALSE`
sep	The character (or characters) separating fields. When `“”` is specified, any white space is used as a separator.	`""`
quote	If character values are enclosed in quotes, this argument should specify the type of quotes.	“”
dec	The character used for decimal points.	`.`
row.names	A character vector containing row names for the returned data frame.
col.names	A character vector containing column names for the returned data frame.
as.is	A logical vector (the same length as the number of columns) that specifies whether or not to convert character values to factors.	`!stringsAsFactors`
na.strings	A character vector specifying values that should be interpreted as NA.	`NA`
colClasses	A character vector of class names to be assigned to each column.	`NA`
nrows	An integer value specifying the number of rows to read. (Invalid values, such as negatives, are ignored.)	`-1`
skip	An integer value specifying the number of rows in the text file to skip before beginning to read data.	`0`
check.names	A logical value that specifies whether `read.table` should check if the column names are valid symbol names in R.	`TRUE`
fill	Sometimes, a file might contain rows of unequal length. This argument is a logical value that specifies whether `read.table` should implicitly add blank fields at the end of rows where some values were missing.	`!blank.lines.skip`
strip.white	When `sep !=""`, this logical value specifies whether `read.table` should remove extra leading and trailing white space from character fields.	`FALSE`
blank.lines.skip	A logical value that specifies whether `read.table` should ignore blank lines.	`TRUE`
comment.char	`read.table` can ignore comment lines in input files if the comment lines begin with a single special character. This argument specifies the character used to delineate these lines.	`"#"`
allowEscapes	A logical value that indicates whether escapes (such as “\n” for a new line) should be interpreted or if character strings should be read literally.	`FALSE`
flush	A logical value that indicates whether `read.table` should skip to the next line when all requested fields have been read in from a line.	`FALSE`
stringsAsFactors	A logical value indicating whether text fields should be converted to factors.	`default.stringsAsFactors()`
encoding	The encoding scheme used for the source file.	`"unknown"`

The most important options are sep and header. You almost always have to know the field separator and know if there is a header field. R includes a set of convenience functions that call read.table with different default options for these values (and a couple of others). Here is a description of these functions.

Function	header	sep	quote	dec	fill	comment.char
`read.table`	`FALSE`		`\"` or `\'`	`.`	`!blank.lines.skip`	#
`read.csv`	`TRUE`	`,`	\"	`.`	`TRUE`
`read.csv2`	`TRUE`	`;`	`\"`	`,`	`TRUE`
`read.delim`	`TRUE`	`\t`	`\"`	`.`	`TRUE`
`read.delim2`	`TRUE`	`\t`	`\"`	`,`	`TRUE`

In most cases, you will find that you can use read.csv for comma-separated files or read.delim for tab-delimited files without specifying any other options. (Except, I suppose, if you are in Europe, and you use commas to indicate the decimal point in numbers. Then you can use read.csv2 and read.delim2.)

As another example, suppose that you wanted to analyze some historical stock quote data. Yahoo! Finance provides this information in an easily downloadable form on its website; you can fetch a CSV file from a single URL. For example, to fetch the closing price of the S&P 500 index for every month between April 1, 1999, and April 1, 2009, you could use the following URL: http://ichart.finance.yahoo.com/table.csv?s=%5EGSPC&a=03&b=1&c=1999&d=03&e=1&f=2009&g=m&ignore=.csv.

Conveniently, you can use a URL in place of a filename in R. This means that you could load this data into R with the following expression:

> sp500 <- read.csv(paste("http://ichart.finance.yahoo.com/table.csv?",
+   "s=%5EGSPC&a=03&b=1&c=1999&d=03&e=1&f=2009&g=m&ignore=.csv", sep=""))
> # show the first 5 rows
> sp500[1:5,]
        Date   Open   High    Low  Close      Volume Adj.Close
1 2009-04-01 793.59 813.62 783.32 811.08 12068280000    811.08
2 2009-03-02 729.57 832.98 666.79 797.87  7633306300    797.87
3 2009-02-02 823.09 875.01 734.52 735.09  7022036200    735.09
4 2009-01-02 902.99 943.85 804.30 825.88  5844561500    825.88
5 2008-12-01 888.61 918.85 815.69 903.25  5320791300    903.25

We will revisit this example in the next section.

If you’re trying to load a really big file, you might find that loading the file takes a long time. It can be very frustrating to wait 15 minutes for a file to load, only to discover that you have specified the wrong separator. A useful technique for testing is to load only a small number of rows into R. For example, to load 20 rows, you would add nrows=20 as an argument to read.table.

Many programs can export data as text files. Here are a few tips for creating text files that you can easily read into R:

For Microsoft Excel spreadsheets, you can export them as either comma-delimited files (CSV files) or tab-delimited files (TXT files). When possible, you should specify Unix-style line delimiters, not MS-DOS line delimiters. (MS-DOS files end each line with “\n\r,” while Unix-style systems end lines with “\n.”) There are two things to think about when choosing between CSV and TXT files.
CSV files can be more convenient because (by default) opening these files in Windows Explorer will open these files in Microsoft Excel. However, if you are using CSV files, then you must be careful to enclose text in quotes if the data contains commas (and, additionally, you must escape any quotation marks within text fields). Tab characters occur less often in text, so tab-delimited files are less likely to cause problems.
If you are exporting data from a database, consider using a GUI tool to query the database and export the results. It is possible to use command-line scripts to export data using tools like SQL Plus, pgsSQL, or MySQL, but doing so is often tricky.
Here are a few options I have tried. If you are using Microsoft Windows, a good choice is Toad for Data Analysts (available from http://www.toadsoft.com/tda/tdaindex.html); this will work with many different databases. If you are exporting from MySQL, MySQL Query Browser is also a good choice; versions are available for Microsoft Windows, Mac OS X, and Linux (you can download it from http://dev.mysql.com/downloads/gui-tools/5.0.html). Oracle now produces a free multi-platform query tool called SQL Developer. (You can find it at http://www.oracle.com/technology/products/database/sql_developer/index.html.)

Fixed-width files

To read a fixed-width format text file into a data frame, you can use the read.fwf function:

read.fwf(file, widths, header = , sep = ,
         skip = , row.names, col.names, n = ,
         buffersize = , ...)

Here is a description of the arguments to read.fwf.

Argument	Description	Default
file	The name of the file to open or, alternatively, the name of a connection containing the data. (This is a required argument.)
widths	An integer vector or a list of integer vectors. If the input file has one record per line, then use an integer vector where each value represents the width of each variable. If each record spans multiple lines, then use a list of integer vectors where each integer vector corresponds to the widths of the variables on that line. (This is a required argument.)
header	A logical value indicating whether the first line of the file contains variable names. (If it does, the names must be delimited by `sep`.)	`FALSE`
sep	The character used to delimit variable names in the header.	`\t`
skip	An integer specifying the number of lines to skip at the beginning of the file.	`0`
row.names	A character vector used to specify row names in the data frame.
col.names	A character vector used to specify column names in the data frame.
n	An integer value specifying the number of rows of records to read into R. (Invalid values, such as negatives, are ignored.)	`-1`
buffersize	An integer specifying the maximum number of lines to be read at one time. (This value may be tuned to optimize performance.)	`2000`

Note that read.fwf can also take many arguments used by read.table, including as.is, na.strings, colClasses, and strip.white.

Other functions to parse data

Most of the time, you should be able to load text files into R with the read.table function. Sometimes, however, you might be provided with a file that cannot be read correctly with this function. For example, observations in the file might span multiple lines. To read data into R one line at a time, use the function readLines:

readLines(con = stdin(), n = -1L, ok = TRUE, warn = TRUE,
          encoding = "unknown")

The readLines function will return a character vector, with one value corresponding to each row in the file. Here is a description of the arguments to readLines.

Argument	Description	Default
con	A character string (specifying a file or URL) or a connection containing the data to read.	`stdin()`
n	An integer value specifying the number of lines to read. (Negative values mean “read until the end of the file.”)	`-1L`
ok	A logical value specifying whether to trigger an error if the number of lines in the file is less than n.	`TRUE`
warn	A logical value specifying whether to warn the user if the file does not end with an EOL.	`TRUE`
encoding	A character value specifying the encoding of the input file.	`"unknown"`

Note that you can use readLines interactively to enter data.

Another useful function for reading more complex file formats is scan:

scan(file = "", what = double(0), nmax = -1, n = -1, sep = "",
     quote = if(identical(sep, "\n")) "" else "'\"", dec = ".",
     skip = 0, nlines = 0, na.strings = "NA",
     flush = FALSE, fill = FALSE, strip.white = FALSE,
     quiet = FALSE, blank.lines.skip = TRUE, multi.line = TRUE,
     comment.char = "", allowEscapes = FALSE,
     encoding = "unknown")

The scan function allows you to read the contents of a file into R. Unlike readLines, scan allows you to read data into a specifically defined data structure using the argument what.

Here is a description of the arguments to scan.

Argument	Description	Default
file	A character string (specifying a file or URL) or a connection containing the data to read.	`""`
what	The type of data to be read. If all fields are the same type, you can specify logical, integer, numeric, complex, character, or raw. Otherwise, specify a list of types to read values into a list. (You can specify the type of each element in the list individually.)	`double(0)`
nmax	An integer value specifying the number of values to read or the number of records to read (if `what` is a list). (Negative values mean “read until the end of the file.”)	`-1`
n	An integer value specifying the number of values to read. (Negative values mean “read until the end of the file.”)	`-1`
sep	Character value specifying the separator between values. `sep=""` means that any white space character is interpreted as a separator.	“”
quote	Character value used to quote strings.	`if(identical(sep, "\n")) "" else "'\""`
dec	Character value used for decimal place in numbers.	`"."`
skip	Number of lines to skip at the top of the file.	`0`
nlines	Number of lines of data to read. Nonpositive values mean that there is no limit.	`0`
na.strings	Character values specifying how `NA` values are encoded.	`"NA"`
flush	A logical value specifying whether to “flush” any remaining text on a line after the last requested item on a line is read into `what`. (Commonly used to allow comments at the end of lines or to ignore unneeded fields.)	`FALSE`
fill	Specifies whether to add empty fields to lines with fewer fields than specified by `what`.	`FALSE`
strip.white	Specifies whether to strip leading and trailing white space from character fields. Applies only when `sep` is specified.	`FALSE`
quiet	If `quiet=FALSE`, scan will print a message showing how many lines were read. If `quiet=TRUE`, then this message is suppressed.	`FALSE`
blank.lines.skip	Specifies whether to ignore blank lines.	`TRUE`
multi.line	If `what` is a list, allows records to span multiple lines.	`TRUE`
comment.char	Notes a character to be used to specify comment lines.	`""`
allowEscapes	Specifies whether C-style escapes (such as `\t` for Tab character or `\n` for newlines) should be interpreted by scan or read verbatim. If `allowEscapes=FALSE`, then they are interpreted as special characters; if `allowEscapes=TRUE`, then they are read literally.	`FALSE`
encoding	A character value specifying the encoding of the input file.	`"unknown"`

Like readLines, you can also use scan to enter data directly into R.

Other Software

Although many software packages can export data as text files, you might find it more convenient to read their data files directly. R can read files in many other formats. Table 11-1 shows a list of functions for reading (and writing) files in other formats. You can find more information about these functions in the help files.

Table 11-1. Functions to read and write data

File format	Reading	Writing
ARFF	`read.arff`	`write.arff`
DBF	`read.dbf`	`write.dbf`
Stata	`read.dta`	`write.dta`
Epi Info	`read.epiinfo`
Minitab	`read.mtp`
Octave	`read.octave`
S3 binary files, data.dump files	`read.S`
SPSS	`read.spss`
SAS Permanent Dataset	`read.ssd`
Systat	`read.sysstat`
SAS XPORT File	`read.xport`