Combining Data Sets

Let’s start with one of the most common obstacles to data analysis: working with data that’s stored in two different places. For example, suppose that you wanted to look at batting statistics for baseball players by age. In most baseball data sources (like the Baseball Databank data), player information (like ages) is kept in different files from performance data (like batting statistics). So you would need to combine two files to do this analysis. This section discusses several tools in R used for combining data sets.

R provides several functions that allow you to paste together multiple data structures into a single structure.

Sometimes, you would like to bind together multiple data frames or matrices. You can do this with the rbind and cbind functions. The cbind function will combine objects by adding columns. You can picture this as combining two tables horizontally. As an example, let’s start with the data frame for the top five salaries in the NFL in 2008:[33]

> top.5.salaries
  name.last name.first     team position   salary
1   Manning     Peyton    Colts       QB 18700000
2     Brady        Tom Patriots       QB 14626720
3    Pepper     Julius Panthers       DE 14137500
4    Palmer     Carson  Bengals       QB 13980000
5   Manning        Eli   Giants       QB 12916666

Now let’s create a new data frame with two more columns (a year and a rank):

> year <- c(2008, 2008, 2008, 2008, 2008)
> rank <- c(1, 2, 3, 4, 5)
> more.cols <- data.frame(year, rank)
> more.cols
  year rank
1 2008    1
2 2008    2
3 2008    3
4 2008    4
5 2008    5

Finally, let’s put together these two data frames:

> cbind(top.5.salaries, more.cols)
  name.last name.first     team position   salary year rank
1   Manning     Peyton    Colts       QB 18700000 2008    1
2     Brady        Tom Patriots       QB 14626720 2008    2
3    Pepper     Julius Panthers       DE 14137500 2008    3
4    Palmer     Carson  Bengals       QB 13980000 2008    4
5   Manning        Eli   Giants       QB 12916666 2008    5

The rbind function will combine objects by adding rows. You can picture this as combining two tables vertically.

As an example, suppose that you had a data frame with the top five salaries (as shown above) and a second data frame with the next three salaries:

> top.5.salaries
  name.last name.first     team position   salary
1   Manning     Peyton    Colts       QB 18700000
2     Brady        Tom Patriots       QB 14626720
3    Pepper     Julius Panthers       DE 14137500
4    Palmer     Carson  Bengals       QB 13980000
5   Manning        Eli   Giants       QB 12916666
> next.three
  name.last name.first    team position   salary
6     Favre      Brett Packers       QB 12800000
7    Bailey      Champ Broncos       CB 12690050
8  Harrison     Marvin   Colts       WR 12000000

You could combine these into a single data frame using the rbind function:

> rbind(top.5.salaries, next.three)
  name.last name.first     team position   salary
1   Manning     Peyton    Colts       QB 18700000
2     Brady        Tom Patriots       QB 14626720
3    Pepper     Julius Panthers       DE 14137500
4    Palmer     Carson  Bengals       QB 13980000
5   Manning        Eli   Giants       QB 12916666
6     Favre      Brett  Packers       QB 12800000
7    Bailey      Champ  Broncos       CB 12690050
8  Harrison     Marvin    Colts       WR 12000000

To show how to fetch and combine together data and build a data frame for analysis, we’ll use an example from the previous chapter: stock quotes. Yahoo! Finance allows you to download CSV files with stock quotes for a single ticker.

Suppose that you wanted a single data set with stock quotes for multiple securities (say, the 30 stocks in the Dow Jones Industrial Average). You would need a way to bind together the data returned by the query into a single data set. Let’s write a function that can return historical stock quotes for multiple securities in a single data frame. First, let’s write a function that assembles the URL for the CSV file and then fetches a data frame with the contents.

Here is what this function will do. First, it will define the URL. (I determined the format of the URL by trial and error: I tried fetching CSV files from Yahoo! Finance with different ticker symbols and different date ranges until I knew how to construct the queries.) We will use the paste function to put together all these different character values. Next, we will fetch the URL with the read.csv function, assigning the data frame to the symbol tmp. The data frame has most of the information we want but doesn’t include the ticker symbol. So we will use the cbind function to attach a vector of ticker symbols to the data frame. (By the way, the function uses Date objects to represent the date. I also used the current date as the default value for to, and the date one year ago as the default value for from.)

Here is the function:

get.quotes <- function(ticker,
                       from=(Sys.Date()-365),
                       to=(Sys.Date()),
                       interval="d") {

 # define parts of the URL
 base <- "http://ichart.finance.yahoo.com/table.csv?";
 symbol <- paste("s=", ticker, sep="");

 # months are numbered from 00 to 11, so format the month correctly
 from.month <- paste("&a=",
  formatC(as.integer(format(from,"%m"))-1,width=2,flag="0"),
  sep="");
 from.day <- paste("&b=", format(from,"%d"), sep="");
 from.year <- paste("&c=", format(from,"%Y"), sep="");
 to.month <- paste("&d=",
  formatC(as.integer(format(to,"%m"))-1,width=2,flag="0"),
  sep="");
 to.day <- paste("&e=", format(to,"%d"), sep="");
 to.year <- paste("&f=", format(to,"%Y"), sep="");
 inter <- paste("&g=", interval, sep="");
 last <- "&ignore=.csv";

 # put together the url
 url <- paste(base, symbol, from.month, from.day, from.year,
              to.month, to.day, to.year, inter, last, sep="");

 # get the file
 tmp <- read.csv(url);

 # add a new column with ticker symbol labels
 cbind(symbol=ticker,tmp);
}

Now let’s write a function that returns a data frame with quotes from multiple securities. This function will simply call get.quotes once for every ticker in a vector of tickers and bind together the results using rbind:

get.multiple.quotes <- function(tkrs,
                                from=(Sys.Date()-365),
                                to=(Sys.Date()),
                                interval="d") {
    tmp <- NULL;
    for (tkr in tkrs) {
        if (is.null(tmp))
             tmp <- get.quotes(tkr,from,to,interval)
        else tmp <- rbind(tmp,get.quotes(tkr,from,to,interval))
        }
    tmp
}

Finally, let’s define a vector with the set of ticker symbols in the Dow Jones Industrial Average and then build a data frame with data from all 30 tickers:

>  dow.tickers <- c("MMM", "AA", "AXP", "T", "BAC", "BA", "CAT", "CVX",
+                   "CSCO", "KO", "DD", "XOM", "GE", "HPQ", "HD", "INTC",
+                   "IBM", "JNJ", "JPM", "KFT", "MCD", "MRK", "MSFT", "PFE",
+                   "PG", "TRV", "UTX", "VZ", "WMT", "DIS")
> # date on which I ran this code
> Sys.Date()
[1] "2012-01-08"
> dow30 <- get.multiple.quotes(dow30.tickers)

We’ll return to this data set below.data

As an example, let’s return to the Baseball Databank database that we used in Importing Data From Databases. In this database, player information is stored in the Master table. Players are uniquely identified by the column playerID:

> dbListFields(con,"Master")
 [1] "lahmanID"     "playerID"     "managerID"    "hofID"
 [5] "birthYear"    "birthMonth"   "birthDay"     "birthCountry"
 [9] "birthState"   "birthCity"    "deathYear"    "deathMonth"
[13] "deathDay"     "deathCountry" "deathState"   "deathCity"
[17] "nameFirst"    "nameLast"     "nameNote"     "nameGiven"
[21] "nameNick"     "weight"       "height"       "bats"
[25] "throws"       "debut"        "finalGame"    "college"
[29] "lahman40ID"   "lahman45ID"   "retroID"      "holtzID"
[33] "bbrefID"

Batting information is stored in the Batting table. Players are uniquely identified by playerID in this table as well:

> dbListFields(con, "Batting")
 [1] "playerID"  "yearID"    "stint"     "teamID"    "lgID"
 [6] "G"         "G_batting" "AB"        "R"         "H"
[11] "2B"        "3B"        "HR"        "RBI"       "SB"
[16] "CS"        "BB"        "SO"        "IBB"       "HBP"
[21] "SH"        "SF"        "GIDP"      "G_old"

Suppose that you wanted to show batting statistics for each player along with his name and age. To do this, you would need to merge data from the two tables. In R, you can do this with the merge function:

> batting <- dbGetQuery(con, "SELECT * FROM Batting")
> master <- dbGetQuery(con, "SELECT * FROM Master")
> batting.w.names <- merge(batting, master)

In this case, there was only one common variable between the two tables: playerID:

> intersect(names(batting), names(master))
[1] "playerID"

By default, merge uses common variables between the two data frames as the merge keys. So, in this case, we did not have to specify any more arguments to merge. Let’s take a closer look at the arguments to merge (for data frames):

merge(x, y, by = , by.x = , by.y = , all = , all.x = , all.y = ,
      sort = , suffixes = , incomparables = , ...)

Here is a description of the arguments to merge.

ArgumentDescriptionDefault
xOne of the two data frames to combine. 
yOne of the two data frames to combine. 
byA vector of character values corresponding to column names. intersect(names(x), names(y))
by.xA vector of character values corresponding to column names in x. Overrides the list given in by.by
by.yA vector of character values corresponding to column names in y. Overrides the list given in by.by
allA logical value specifying whether rows from each data frame should be included even if there is no match in the other data frame. This is equivalent to an OUTER JOIN in a database. (Equivalent to all.x=TRUE and all.y=TRUE.)FALSE
all.xA logical value specifying whether rows from data frame x should be included even if there is no match in the other data frame. This is equivalent to x LEFT OUTER JOIN y in a database.all
all.yA logical value specifying whether rows from data frame x should be included even if there is no match in the other data frame. This is equivalent to x RIGHT OUTER JOIN y in a database.all
sortA logical value that specifies whether the results should be sorted by the by columns.TRUE
suffixesA character vector with two values. If there are columns in x and y with the same name that are not used in the by list, they will be renamed with the suffixes given by this argument.suffixes = c(“.x”, “.y”)
incomparablesA list of variables that cannot be matched.NULL

By default, merge is equivalent to a NATURAL JOIN in SQL. You can specify other columns to make it use merge like an INNER JOIN. You can specify values of ALL to get the same results as OUTER or FULL joins. If there are no matching field names, or if by is of length 0 (or by.x and by.y are of length 0), then merge will return the full Cartesian product of x and y.



[33] Salary data is from http://sportsillustrated.cnn.com/football/nfl/salaries/2008/all.html. The salary numbers are cap numbers, not cash salaries.