One of the nicest things about R is how easy it is to pull in data from other programs. R can import data from text files, other statistics software, and even spreadsheets. You don’t even need a local copy of the file: you can specify a file at a URL, and R will fetch the file for you over the Internet.
Most text files containing data are formatted similarly: each line of a text file represents an observation (or record). Each line contains a set of different variables associated with that observation. Sometimes, different variables are separated by a special character called the delimiter. Other times, variables are differentiated by their location on each line.
R includes a family of functions for importing delimited
text files into R, based on the read.table
function:
read.table(file, header, sep = , quote = , dec = , row.names, col.names, as.is = , na.strings , colClasses , nrows =, skip = , check.names = , fill = , strip.white = , blank.lines.skip = , comment.char = , allowEscapes = , flush = , stringsAsFactors = , encoding = )
The read.table
function reads a text file into R and returns a
data.frame
object. Each row in the
input file is interpreted as an observation. Each column in the input
file represents a variable. The read.table
function expects each field to be
separated by a delimiter.
For example, suppose that you had a file called top.5.salaries.csv that contained the following text (and only this text):
name.last,name.first,team,position,salary "Manning","Peyton","Colts","QB",18700000 "Brady","Tom","Patriots","QB",14626720 "Pepper","Julius","Panthers","DE",14137500 "Palmer","Carson","Bengals","QB",13980000 "Manning","Eli","Giants","QB",12916666
This file contains the same data frame that we entered in Entering Data Using R Commands. Notice how this data is encoded:
The first row contains the column names.
Each text field is encapsulated in quotes.
Each field is separated by commas.
To load this file into R, you would specify that the first row
contained column names (header=TRUE
), that the delimiter was a comma
(sep=","
), and that quotes were
used to encapsulate text (quote="\""
). Here is an R statement that
loads in this file:
> top.5.salaries <- read.table("top.5.salaries.csv", + header=TRUE, sep=",", quote="\"")
The read.table
function is
very flexible and allows you to load files with many different
properties. Here is a brief description of the options for read.table
.
Argument | Description | Default |
---|---|---|
file | The name of the file to open or, alternatively, the
name of a connection containing the data. You can even use a
URL. (This is the one required argument for read.table .) | |
header | A logical value indicating whether the first row of the file contains variable names. | FALSE |
sep | The character (or characters) separating fields. When
“” is specified, any white
space is used as a separator. | "" |
quote | If character values are enclosed in quotes, this argument should specify the type of quotes. | “” |
dec | The character used for decimal points. | . |
row.names | A character vector containing row names for the returned data frame. | |
col.names | A character vector containing column names for the returned data frame. | |
as.is | A logical vector (the same length as the number of columns) that specifies whether or not to convert character values to factors. | !stringsAsFactors |
na.strings | A character vector specifying values that should be interpreted as NA. | NA |
colClasses | A character vector of class names to be assigned to each column. | NA |
nrows | An integer value specifying the number of rows to read. (Invalid values, such as negatives, are ignored.) | -1 |
skip | An integer value specifying the number of rows in the text file to skip before beginning to read data. | 0 |
check.names | A logical value that specifies whether read.table should check if the
column names are valid symbol names in R. | TRUE |
fill | Sometimes, a file might contain rows of unequal length.
This argument is a logical value that specifies whether
read.table should
implicitly add blank fields at the end of rows where some
values were missing. | !blank.lines.skip |
strip.white | When sep !="" , this
logical value specifies whether read.table should remove extra
leading and trailing white space from character
fields. | FALSE |
blank.lines.skip | A logical value that specifies whether read.table should ignore blank
lines. | TRUE |
comment.char | read.table can
ignore comment lines in input files if the comment lines begin
with a single special character. This argument specifies the
character used to delineate these lines. | "#" |
allowEscapes | A logical value that indicates whether escapes (such as “\n” for a new line) should be interpreted or if character strings should be read literally. | FALSE |
flush | A logical value that indicates whether read.table should skip to the next
line when all requested fields have been read in from a
line. | FALSE |
stringsAsFactors | A logical value indicating whether text fields should be converted to factors. | default.stringsAsFactors()
|
encoding | The encoding scheme used for the source file. | "unknown" |
The most important options are sep
and header
. You almost always have to know the
field separator and know if there is a header field. R includes a set
of convenience functions that call read.table
with different default options
for these values (and a couple of others). Here is a description of
these functions.
Function | header | sep | quote | dec | fill | comment.char |
---|---|---|---|---|---|---|
read.table | FALSE | \" or \' | . | !blank.lines.skip | # | |
read.csv | TRUE | , | \" | . | TRUE | |
read.csv2 | TRUE | ; | \" | , | TRUE | |
read.delim | TRUE | \t | \" | . | TRUE | |
read.delim2 | TRUE | \t | \" | , | TRUE |
In most cases, you will find that you can use read.csv
for comma-separated files or read.delim
for tab-delimited files without specifying any other
options. (Except, I suppose, if you are in Europe, and you use commas
to indicate the decimal point in numbers. Then you can use read.csv2
and read.delim2
.)
As another example, suppose that you wanted to analyze some historical stock quote data. Yahoo! Finance provides this information in an easily downloadable form on its website; you can fetch a CSV file from a single URL. For example, to fetch the closing price of the S&P 500 index for every month between April 1, 1999, and April 1, 2009, you could use the following URL: http://ichart.finance.yahoo.com/table.csv?s=%5EGSPC&a=03&b=1&c=1999&d=03&e=1&f=2009&g=m&ignore=.csv.
Conveniently, you can use a URL in place of a filename in R. This means that you could load this data into R with the following expression:
> sp500 <- read.csv(paste("http://ichart.finance.yahoo.com/table.csv?", + "s=%5EGSPC&a=03&b=1&c=1999&d=03&e=1&f=2009&g=m&ignore=.csv", sep="")) > # show the first 5 rows > sp500[1:5,] Date Open High Low Close Volume Adj.Close 1 2009-04-01 793.59 813.62 783.32 811.08 12068280000 811.08 2 2009-03-02 729.57 832.98 666.79 797.87 7633306300 797.87 3 2009-02-02 823.09 875.01 734.52 735.09 7022036200 735.09 4 2009-01-02 902.99 943.85 804.30 825.88 5844561500 825.88 5 2008-12-01 888.61 918.85 815.69 903.25 5320791300 903.25
We will revisit this example in the next section.
If you’re trying to load a really big file, you might find that
loading the file takes a long time. It can be very frustrating to wait
15 minutes for a file to load, only to discover that you have
specified the wrong separator. A useful technique for testing is to
load only a small number of rows into R. For example, to load 20 rows,
you would add nrows=20
as an
argument to read.table
.
Many programs can export data as text files. Here are a few tips for creating text files that you can easily read into R:
For Microsoft Excel spreadsheets, you can export them as either comma-delimited files (CSV files) or tab-delimited files (TXT files). When possible, you should specify Unix-style line delimiters, not MS-DOS line delimiters. (MS-DOS files end each line with “\n\r,” while Unix-style systems end lines with “\n.”) There are two things to think about when choosing between CSV and TXT files.
CSV files can be more convenient because (by default) opening these files in Windows Explorer will open these files in Microsoft Excel. However, if you are using CSV files, then you must be careful to enclose text in quotes if the data contains commas (and, additionally, you must escape any quotation marks within text fields). Tab characters occur less often in text, so tab-delimited files are less likely to cause problems.
If you are exporting data from a database, consider using a GUI tool to query the database and export the results. It is possible to use command-line scripts to export data using tools like SQL Plus, pgsSQL, or MySQL, but doing so is often tricky.
Here are a few options I have tried. If you are using Microsoft Windows, a good choice is Toad for Data Analysts (available from http://www.toadsoft.com/tda/tdaindex.html); this will work with many different databases. If you are exporting from MySQL, MySQL Query Browser is also a good choice; versions are available for Microsoft Windows, Mac OS X, and Linux (you can download it from http://dev.mysql.com/downloads/gui-tools/5.0.html). Oracle now produces a free multi-platform query tool called SQL Developer. (You can find it at http://www.oracle.com/technology/products/database/sql_developer/index.html.)
To read a fixed-width format text file into a data
frame, you can use the read.fwf
function:
read.fwf(file, widths, header = , sep = , skip = , row.names, col.names, n = , buffersize = , ...)
Here is a description of the arguments to read.fwf
.
Argument | Description | Default |
---|---|---|
file | The name of the file to open or, alternatively, the name of a connection containing the data. (This is a required argument.) | |
widths | An integer vector or a list of integer vectors. If the input file has one record per line, then use an integer vector where each value represents the width of each variable. If each record spans multiple lines, then use a list of integer vectors where each integer vector corresponds to the widths of the variables on that line. (This is a required argument.) | |
header | A logical value indicating whether the first line of
the file contains variable names. (If it does, the names must be delimited by sep .) | FALSE |
sep | The character used to delimit variable names in the header. | \t |
skip | An integer specifying the number of lines to skip at the beginning of the file. | 0 |
row.names | A character vector used to specify row names in the data frame. | |
col.names | A character vector used to specify column names in the data frame. | |
n | An integer value specifying the number of rows of records to read into R. (Invalid values, such as negatives, are ignored.) | -1 |
buffersize | An integer specifying the maximum number of lines to be read at one time. (This value may be tuned to optimize performance.) | 2000 |
Note that read.fwf
can also
take many arguments used by read.table
, including as.is
, na.strings
, colClasses
, and strip.white
.
Most of the time, you should be able to load text files
into R with the read.table
function. Sometimes, however, you might be provided with a file that
cannot be read correctly with this function. For example, observations
in the file might span multiple lines. To read data into R one line at
a time, use the function readLines
:
readLines(con = stdin(), n = -1L, ok = TRUE, warn = TRUE, encoding = "unknown")
The readLines
function will
return a character vector, with one value corresponding to each row in
the file. Here is a description of the arguments to readLines
.
Argument | Description | Default |
---|---|---|
con | A character string (specifying a file or URL) or a connection containing the data to read. | stdin() |
n | An integer value specifying the number of lines to read. (Negative values mean “read until the end of the file.”) | -1L |
ok | A logical value specifying whether to trigger an error if the number of lines in the file is less than n. | TRUE |
warn | A logical value specifying whether to warn the user if the file does not end with an EOL. | TRUE |
encoding | A character value specifying the encoding of the input file. | "unknown" |
Note that you can use readLines
interactively to enter
data.
Another useful function for reading more complex file formats is
scan
:
scan(file = "", what = double(0), nmax = -1, n = -1, sep = "", quote = if(identical(sep, "\n")) "" else "'\"", dec = ".", skip = 0, nlines = 0, na.strings = "NA", flush = FALSE, fill = FALSE, strip.white = FALSE, quiet = FALSE, blank.lines.skip = TRUE, multi.line = TRUE, comment.char = "", allowEscapes = FALSE, encoding = "unknown")
The scan
function allows you to read the contents of a file into
R. Unlike readLines
, scan
allows you to read data into a
specifically defined data structure using the argument what
.
Here is a description of the arguments to scan
.
Argument | Description | Default |
---|---|---|
file | A character string (specifying a file or URL) or a connection containing the data to read. | "" |
what | The type of data to be read. If all fields are the same type, you can specify logical, integer, numeric, complex, character, or raw. Otherwise, specify a list of types to read values into a list. (You can specify the type of each element in the list individually.) | double(0) |
nmax | An integer value specifying the number of values to
read or the number of records to read (if what is a list). (Negative values mean “read until the end
of the file.”) | -1 |
n | An integer value specifying the number of values to read. (Negative values mean “read until the end of the file.”) | -1 |
sep | Character value specifying the separator between
values. sep="" means that
any white space character is interpreted as a
separator. | “” |
quote | Character value used to quote strings. | if(identical(sep, "\n")) ""
else "'\"" |
dec | Character value used for decimal place in numbers. | "." |
skip | Number of lines to skip at the top of the file. | 0 |
nlines | Number of lines of data to read. Nonpositive values mean that there is no limit. | 0 |
na.strings | Character values specifying how NA values are encoded. | "NA" |
flush | A logical value specifying whether to “flush” any
remaining text on a line after the last requested item on a
line is read into what .
(Commonly used to allow comments at the end of lines or to
ignore unneeded fields.) | FALSE |
fill | Specifies whether to add empty fields to lines with
fewer fields than specified by what . | FALSE |
strip.white | Specifies whether to strip leading and trailing white
space from character fields. Applies only when sep is specified. | FALSE |
quiet | If quiet=FALSE , scan
will print a message showing how many lines were read. If
quiet=TRUE , then this
message is suppressed. | FALSE |
blank.lines.skip | Specifies whether to ignore blank lines. | TRUE |
multi.line | If what is a list,
allows records to span multiple lines. | TRUE |
comment.char | Notes a character to be used to specify comment lines. | "" |
allowEscapes | Specifies whether C-style escapes (such as \t for Tab character or \n for newlines) should be
interpreted by scan or read verbatim. If allowEscapes =FALSE , then they are interpreted as
special characters; if allowEscapes=TRUE , then they are
read literally. | FALSE |
encoding | A character value specifying the encoding of the input file. | "unknown" |
Like readLines
, you can also
use scan
to enter data directly
into R.
Although many software packages can export data as text files, you might find it more convenient to read their data files directly. R can read files in many other formats. Table 11-1 shows a list of functions for reading (and writing) files in other formats. You can find more information about these functions in the help files.