Let’s start with one of the most common obstacles to data analysis: working with data that’s stored in two different places. For example, suppose that you wanted to look at batting statistics for baseball players by age. In most baseball data sources (like the Baseball Databank data), player information (like ages) is kept in different files from performance data (like batting statistics). So you would need to combine two files to do this analysis. This section discusses several tools in R used for combining data sets.
R provides several functions that allow you to paste together multiple data structures into a single structure.
The simplest of these functions is paste
. The paste
function allows you to concatenate
multiple character vectors into a single vector. (If you concatenate a
vector of another type, it will be coerced to a character vector
first.)
> x <- c("a", "b", "c", "d", "e") > y <- c("A", "B", "C", "D", "E") > paste(x,y) [1] "a A" "b B" "c C" "d D" "e E"
By default, values are separated by a space; you can specify
another separator (or none at all) with the sep
argument:
> paste(x, y, sep="-")
[1] "a-A" "b-B" "c-C" "d-D" "e-E"
If you would like all of values in the returned vector to be
concatenated with one another (to return just a single value), then
specify a value for the collapse
argument. The value of collapse
will be used as the separator in this value:
> paste(x, y, sep="-", collapse="#")
[1] "a-A#b-B#c-C#d-D#e-E"
Sometimes, you would like to bind together multiple data
frames or matrices. You can do this with the rbind
and cbind
functions. The cbind
function will combine objects by
adding columns. You can picture this as combining two tables
horizontally. As an example, let’s start with the data frame for the
top five salaries in the NFL in 2008:[33]
> top.5.salaries
name.last name.first team position salary
1 Manning Peyton Colts QB 18700000
2 Brady Tom Patriots QB 14626720
3 Pepper Julius Panthers DE 14137500
4 Palmer Carson Bengals QB 13980000
5 Manning Eli Giants QB 12916666
Now let’s create a new data frame with two more columns (a year and a rank):
> year <- c(2008, 2008, 2008, 2008, 2008) > rank <- c(1, 2, 3, 4, 5) > more.cols <- data.frame(year, rank) > more.cols year rank 1 2008 1 2 2008 2 3 2008 3 4 2008 4 5 2008 5
Finally, let’s put together these two data frames:
> cbind(top.5.salaries, more.cols)
name.last name.first team position salary year rank
1 Manning Peyton Colts QB 18700000 2008 1
2 Brady Tom Patriots QB 14626720 2008 2
3 Pepper Julius Panthers DE 14137500 2008 3
4 Palmer Carson Bengals QB 13980000 2008 4
5 Manning Eli Giants QB 12916666 2008 5
The rbind
function will
combine objects by adding rows. You can picture this as combining two
tables vertically.
As an example, suppose that you had a data frame with the top five salaries (as shown above) and a second data frame with the next three salaries:
> top.5.salaries
name.last name.first team position salary
1 Manning Peyton Colts QB 18700000
2 Brady Tom Patriots QB 14626720
3 Pepper Julius Panthers DE 14137500
4 Palmer Carson Bengals QB 13980000
5 Manning Eli Giants QB 12916666
> next.three
name.last name.first team position salary
6 Favre Brett Packers QB 12800000
7 Bailey Champ Broncos CB 12690050
8 Harrison Marvin Colts WR 12000000
You could combine these into a single data frame using the
rbind
function:
> rbind(top.5.salaries, next.three)
name.last name.first team position salary
1 Manning Peyton Colts QB 18700000
2 Brady Tom Patriots QB 14626720
3 Pepper Julius Panthers DE 14137500
4 Palmer Carson Bengals QB 13980000
5 Manning Eli Giants QB 12916666
6 Favre Brett Packers QB 12800000
7 Bailey Champ Broncos CB 12690050
8 Harrison Marvin Colts WR 12000000
To show how to fetch and combine together data and build a data frame for analysis, we’ll use an example from the previous chapter: stock quotes. Yahoo! Finance allows you to download CSV files with stock quotes for a single ticker.
Suppose that you wanted a single data set with stock quotes for multiple securities (say, the 30 stocks in the Dow Jones Industrial Average). You would need a way to bind together the data returned by the query into a single data set. Let’s write a function that can return historical stock quotes for multiple securities in a single data frame. First, let’s write a function that assembles the URL for the CSV file and then fetches a data frame with the contents.
Here is what this function will do. First, it will define the
URL. (I determined the format of the URL by trial and error: I tried
fetching CSV files from Yahoo! Finance with different ticker symbols
and different date ranges until I knew how to construct the queries.)
We will use the paste
function to
put together all these different character values. Next, we will fetch
the URL with the read.csv
function,
assigning the data frame to the symbol tmp
. The data frame has most of the
information we want but doesn’t include the ticker symbol. So we will
use the cbind
function to attach a
vector of ticker symbols to the data frame. (By the way, the function
uses Date
objects to represent the
date. I also used the current date as the default value for to
, and the date one year ago as the default
value for from
.)
Here is the function:
get.quotes <- function(ticker, from=(Sys.Date()-365), to=(Sys.Date()), interval="d") { # define parts of the URL base <- "http://ichart.finance.yahoo.com/table.csv?"; symbol <- paste("s=", ticker, sep=""); # months are numbered from 00 to 11, so format the month correctly from.month <- paste("&a=", formatC(as.integer(format(from,"%m"))-1,width=2,flag="0"), sep=""); from.day <- paste("&b=", format(from,"%d"), sep=""); from.year <- paste("&c=", format(from,"%Y"), sep=""); to.month <- paste("&d=", formatC(as.integer(format(to,"%m"))-1,width=2,flag="0"), sep=""); to.day <- paste("&e=", format(to,"%d"), sep=""); to.year <- paste("&f=", format(to,"%Y"), sep=""); inter <- paste("&g=", interval, sep=""); last <- "&ignore=.csv"; # put together the url url <- paste(base, symbol, from.month, from.day, from.year, to.month, to.day, to.year, inter, last, sep=""); # get the file tmp <- read.csv(url); # add a new column with ticker symbol labels cbind(symbol=ticker,tmp); }
Now let’s write a function that returns a data frame with quotes
from multiple securities. This function will simply call get.quotes
once for every ticker in a vector
of tickers and bind together the results using rbind
:
get.multiple.quotes <- function(tkrs, from=(Sys.Date()-365), to=(Sys.Date()), interval="d") { tmp <- NULL; for (tkr in tkrs) { if (is.null(tmp)) tmp <- get.quotes(tkr,from,to,interval) else tmp <- rbind(tmp,get.quotes(tkr,from,to,interval)) } tmp }
Finally, let’s define a vector with the set of ticker symbols in the Dow Jones Industrial Average and then build a data frame with data from all 30 tickers:
> dow.tickers <- c("MMM", "AA", "AXP", "T", "BAC", "BA", "CAT", "CVX", + "CSCO", "KO", "DD", "XOM", "GE", "HPQ", "HD", "INTC", + "IBM", "JNJ", "JPM", "KFT", "MCD", "MRK", "MSFT", "PFE", + "PG", "TRV", "UTX", "VZ", "WMT", "DIS") > # date on which I ran this code > Sys.Date() [1] "2012-01-08" > dow30 <- get.multiple.quotes(dow30.tickers)
We’ll return to this data set below.data
As an example, let’s return to the Baseball Databank
database that we used in Importing Data From Databases. In
this database, player information is stored in the Master
table. Players are uniquely identified
by the column playerID
:
> dbListFields(con,"Master")
[1] "lahmanID" "playerID" "managerID" "hofID"
[5] "birthYear" "birthMonth" "birthDay" "birthCountry"
[9] "birthState" "birthCity" "deathYear" "deathMonth"
[13] "deathDay" "deathCountry" "deathState" "deathCity"
[17] "nameFirst" "nameLast" "nameNote" "nameGiven"
[21] "nameNick" "weight" "height" "bats"
[25] "throws" "debut" "finalGame" "college"
[29] "lahman40ID" "lahman45ID" "retroID" "holtzID"
[33] "bbrefID"
Batting information is stored in the Batting
table. Players are uniquely identified
by playerID
in this table as
well:
> dbListFields(con, "Batting")
[1] "playerID" "yearID" "stint" "teamID" "lgID"
[6] "G" "G_batting" "AB" "R" "H"
[11] "2B" "3B" "HR" "RBI" "SB"
[16] "CS" "BB" "SO" "IBB" "HBP"
[21] "SH" "SF" "GIDP" "G_old"
Suppose that you wanted to show batting statistics for each player
along with his name and age. To do this, you would need to merge data
from the two tables. In R, you can do this with the merge
function:
> batting <- dbGetQuery(con, "SELECT * FROM Batting") > master <- dbGetQuery(con, "SELECT * FROM Master") > batting.w.names <- merge(batting, master)
In this case, there was only one common variable between the two
tables: playerID
:
> intersect(names(batting), names(master))
[1] "playerID"
By default, merge
uses common
variables between the two data frames as the merge keys. So, in this
case, we did not have to specify any more arguments to merge. Let’s take
a closer look at the arguments to merge
(for data frames):
merge(x, y, by = , by.x = , by.y = , all = , all.x = , all.y = , sort = , suffixes = , incomparables = , ...)
Here is a description of the arguments to merge
.
Argument | Description | Default |
---|---|---|
x | One of the two data frames to combine. | |
y | One of the two data frames to combine. | |
by | A vector of character values corresponding to column names. | intersect(names(x), names(y))
|
by.x | A vector of character values corresponding to column
names in x . Overrides the
list given in by . | by |
by.y | A vector of character values corresponding to column
names in y . Overrides the
list given in by . | by |
all | A logical value specifying whether rows from each data
frame should be included even if there is no match in the other
data frame. This is equivalent to an OUTER JOIN in a database.
(Equivalent to all.x=TRUE and
all.y=TRUE .) | FALSE |
all.x | A logical value specifying whether rows from data frame
x should be included even if
there is no match in the other data frame. This is equivalent to
x LEFT OUTER JOIN y in a database. | all |
all.y | A logical value specifying whether rows from data frame
x should be included even if
there is no match in the other data frame. This is equivalent to
x RIGHT OUTER JOIN y in a database. | all |
sort | A logical value that specifies whether the results should
be sorted by the by
columns. | TRUE |
suffixes | A character vector with two values. If there are columns
in x and y with the same name that are not used
in the by list, they will be
renamed with the suffixes given by this argument. | suffixes = c(“.x”,
“.y”) |
incomparables | A list of variables that cannot be matched. | NULL |
By default, merge
is equivalent
to a NATURAL JOIN in SQL. You can specify other columns to make it use
merge
like an INNER JOIN. You can
specify values of ALL to get the same results as OUTER or FULL joins. If
there are no matching field names, or if by
is of length 0 (or by.x
and by.y
are of length 0), then merge
will return the full Cartesian product
of x
and y
.
[33] Salary data is from http://sportsillustrated.cnn.com/football/nfl/salaries/2008/all.html. The salary numbers are cap numbers, not cash salaries.