Transformations

Sometimes, there will be some variables in your source data that aren’t quite right. This section explains how to change a variable in a data frame.

One of the most convenient ways to redefine a variable in a data frame is to use the assignment operator. For example, suppose that you wanted to change the type of a variable in the dow30 data frame that we created above. When read.csv imported this data, it interpreted the “Date” field as a character string and converted it to a factor:

> class(dow30$Date)
[1] "factor"

Factors are fine for some things, but we could better represent the date field as a Date object. (That would create a proper ordering on dates and allow us to extract information from them.) Luckily, Yahoo! Finance prints dates in the default date format for R, so we can just transform these values into Date objects using as.Date (see the help file for as.Date for more information). So let’s change this variable within the data frame to use Date objects:

> dow30$Date <- as.Date(dow30$Date)
> class(dow30$Date)
[1] "Date"
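
As a self-contained sketch of the same conversion, the following uses a small, made-up data frame (prices and its values are hypothetical, not the dow30 data). It sets stringsAsFactors=TRUE explicitly: since R 4.0.0, data.frame and read.csv keep strings as character vectors by default, so on a recent version of R the column may start out as "character" rather than "factor". The as.Date call works the same way in either case.

```r
# Hypothetical sample data; the dates use R's default format (YYYY-MM-DD)
prices <- data.frame(Date = c("2009-09-02", "2009-08-31", "2009-09-01"),
                     Close = c(10.5, 10.0, 10.2),
                     stringsAsFactors = TRUE)
class(prices$Date)                    # "factor"

# Replace the column in place with Date objects
prices$Date <- as.Date(prices$Date)
class(prices$Date)                    # "Date"

# Date objects sort chronologically and support date arithmetic
sorted <- prices[order(prices$Date), ]
span <- max(prices$Date) - min(prices$Date)   # a difftime of 2 days
months(prices$Date)                   # month names (locale dependent)
```

Sorting a factor or character column only orders the dates correctly when the text format happens to sort chronologically; a Date column does not depend on that.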

It’s also possible to make other changes to data frames. For example, suppose that we wanted to define a new midpoint variable that is the mean of the high and low price. We can add this variable with the same notation:

> dow30$mid <- (dow30$High + dow30$Low) / 2
> names(dow30)
[1] "symbol"    "Date"      "Open"      "High"      "Low"
[6] "Close"     "Volume"    "Adj.Close" "mid"

A convenient function for changing variables in a data frame is the transform function. Formally, transform is defined as:

transform(`_data`, ...)

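Notice that, apart from the data frame itself, there aren’t any named arguments for this function. To use transform, you specify a data frame (as the first argument) and a set of expressions of the form name=value that use variables within the data frame. The transform function evaluates each expression in the context of the data frame, stores the result under the given name (replacing any existing column with that name), and then returns the modified data frame.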

For example, suppose that we wanted to perform the two transformations listed above: changing the Date column to a Date format, and adding a new midpoint variable. We could do this with transform using the following expression:

> dow30.transformed <- transform(dow30, Date=as.Date(Date),
+   mid = (High + Low) / 2)
> names(dow30.transformed)
[1] "symbol"    "Date"      "Open"      "High"      "Low"
[6] "Close"     "Volume"    "Adj.Close" "mid"
> class(dow30.transformed$Date)
[1] "Date"
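
The same pattern can be tried without the dow30 data. The sketch below assumes a small, made-up data frame (quotes and the spread column are hypothetical names) and adds two columns in one call:

```r
# Hypothetical quotes; column names mirror the High and Low columns above
quotes <- data.frame(High = c(12, 15, 9), Low = c(10, 11, 7))

# Add two columns in one call; expressions refer to columns by bare name
quotes2 <- transform(quotes,
                     mid    = (High + Low) / 2,
                     spread = High - Low)
names(quotes2)    # "High" "Low" "mid" "spread"
quotes2$mid       # 11 13 8

# The original data frame is left unchanged
names(quotes)     # "High" "Low"
```

One caveat: transform evaluates every expression against the original data frame, so an expression cannot refer to a column created earlier in the same call. To build one new column from another, use two separate calls.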

When transforming data, one common operation is to apply a function to a set of objects (or each part of a composite object) and return a new set of objects (or a new composite object). The base R library includes a set of different functions for doing this.

To apply a function to parts of an array (or matrix), use the apply function:

apply(X, MARGIN, FUN, ...)

The apply function accepts three arguments: X is the array to which a function is applied, FUN is the function, and MARGIN specifies the dimensions over which you would like to apply the function. (Optionally, you can pass arguments to FUN as additional arguments to apply.) To show how this works, here’s a simple example. Let’s create a matrix with five rows of four elements, corresponding to the numbers between 1 and 20:

> x <- 1:20
> dim(x) <- c(5, 4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

Now let’s show how apply works. We’ll use the function max because it’s easy to look at the matrix above and see where the results came from.

First, let’s select the maximum element of each row. (These are the values in the rightmost column: 16, 17, 18, 19, and 20.) To do this, we will specify X=x, MARGIN=1 (rows are the first dimension), and FUN=max:

> apply(X=x, MARGIN=1, FUN=max)
[1] 16 17 18 19 20

To do the same thing for columns, we simply have to change the value of MARGIN:

> apply(X=x, MARGIN=2, FUN=max)
[1]  5 10 15 20
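
Additional arguments given to apply are handed on to FUN. As a minimal sketch (the matrix m and the NA value are made up for illustration), here is the column maximum again, this time with a missing value in the data:

```r
# Same 5 x 4 matrix as above, with one value replaced by NA
m <- matrix(1:20, nrow = 5, ncol = 4)
m[2, 3] <- NA

# Without na.rm, the maximum of any column containing NA is NA
col_max_raw <- apply(X = m, MARGIN = 2, FUN = max)

# na.rm = TRUE is not an argument of apply; it is passed through to max
col_max <- apply(X = m, MARGIN = 2, FUN = max, na.rm = TRUE)
col_max           # 5 10 15 20
```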

As a slightly more complex example, we can also use MARGIN to apply a function over multiple dimensions. (We’ll switch to the function paste to show which elements were included.) Consider the following three-dimensional array:

> x <- 1:27
> dim(x) <- c(3, 3, 3)
> x
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27

Let’s start by looking at which values are grouped for each value of MARGIN:

> apply(X=x, MARGIN=1, FUN=paste, collapse=",")
[1] "1,4,7,10,13,16,19,22,25" "2,5,8,11,14,17,20,23,26"
[3] "3,6,9,12,15,18,21,24,27"
> apply(X=x, MARGIN=2, FUN=paste, collapse=",")
[1] "1,2,3,10,11,12,19,20,21" "4,5,6,13,14,15,22,23,24"
[3] "7,8,9,16,17,18,25,26,27"
> apply(X=x, MARGIN=3, FUN=paste, collapse=",")
[1] "1,2,3,4,5,6,7,8,9"          "10,11,12,13,14,15,16,17,18"
[3] "19,20,21,22,23,24,25,26,27"

Let’s do something more complicated. Let’s select MARGIN=c(1, 2) to see which elements are selected:

> apply(X=x, MARGIN=c(1,2), FUN=paste, collapse=",")
     [,1]      [,2]      [,3]
[1,] "1,10,19" "4,13,22" "7,16,25"
[2,] "2,11,20" "5,14,23" "8,17,26"
[3,] "3,12,21" "6,15,24" "9,18,27"

This is the equivalent of doing the following: for each value of i between 1 and 3 and each value of j between 1 and 3, calculate FUN of x[i,j,1], x[i,j,2], x[i,j,3].
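
To make that equivalence concrete, the sketch below rebuilds the result with explicit loops and compares it with the output of apply. The names a, by_apply, and by_loop are chosen here for illustration:

```r
# The same 3 x 3 x 3 array as above
a <- array(1:27, dim = c(3, 3, 3))

# Result from apply over the first two dimensions
by_apply <- apply(X = a, MARGIN = c(1, 2), FUN = paste, collapse = ",")

# The same computation written out by hand
by_loop <- matrix("", nrow = 3, ncol = 3)
for (i in 1:3) {
  for (j in 1:3) {
    # a[i, j, ] holds the three values along the third dimension
    by_loop[i, j] <- paste(a[i, j, ], collapse = ",")
  }
}

all(by_apply == by_loop)   # TRUE
by_loop[1, 1]              # "1,10,19"
```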

To apply a function to each element in a vector or a list and return a list, you can use the function lapply. The function lapply requires two arguments: an object X and a function FUN. (You may specify additional arguments that will be passed to FUN.) Let’s look at a simple example of how to use lapply:

> x <- as.list(1:5)
> lapply(x,function(x) 2^x)
[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 8

[[4]]
[1] 16

[[5]]
[1] 32

You can also use lapply on a data frame, and the function will be applied to each vector (column) in the data frame. For example:

> d <- data.frame(x=1:5, y=6:10)
> d
  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10
> lapply(d,function(x) 2^x)
$x
[1]  2  4  8 16 32

$y
[1]   64  128  256  512 1024
> lapply(d,FUN=max)
$x
[1] 5

$y
[1] 10
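
Extra arguments work with lapply in the same way as with apply. A minimal sketch, assuming a made-up named list (scores) with one missing value:

```r
# Hypothetical named list of numeric vectors, one containing a missing value
scores <- list(a = c(1, NA, 3), b = c(4, 5, 6))

# na.rm = TRUE is passed through lapply to mean
means <- lapply(scores, mean, na.rm = TRUE)
means$a           # 2
means$b           # 5

# The result is always a list with one element per input element
class(means)      # "list"
names(means)      # "a" "b"
```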

Sometimes, you might prefer to get a vector, matrix, or array instead of a list. To do this, use the sapply function. This function works exactly the same way as lapply, except that it returns a vector or matrix (when appropriate):

> sapply(d, FUN=function(x) 2^x)
      x    y
[1,]  2   64
[2,]  4  128
[3,]  8  256
[4,] 16  512
[5,] 32 1024
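
The shape of the result depends on what FUN returns. The sketch below reuses the data frame d from above and shows the one-value-per-column case, along with two base R alternatives (the simplify argument of sapply, and vapply) for when the result type should be predictable:

```r
d <- data.frame(x = 1:5, y = 6:10)

# One value per column: sapply simplifies the result to a named vector
col_max <- sapply(d, max)
col_max                       # x = 5, y = 10

# simplify = FALSE keeps the list, as lapply would
as_list <- sapply(d, max, simplify = FALSE)
class(as_list)                # "list"

# vapply states the expected shape of each result up front
col_mean <- vapply(d, mean, FUN.VALUE = numeric(1))
col_mean                      # x = 3, y = 8
```

Because vapply stops with an error when a result does not match FUN.VALUE, it can be a safer choice inside functions, where a silent change from vector to list would be hard to spot.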

Another related function is mapply, the “multivariate” version of sapply:

mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)

Here is a description of the arguments to mapply.

Argument    Description                                              Default
FUN         The function to apply.
...         A set of vectors over which FUN should be applied.
MoreArgs    A list of additional arguments to pass to FUN.
SIMPLIFY    A logical value indicating whether to simplify the       TRUE
            returned array.
USE.NAMES   A logical value indicating whether to use names for      TRUE
            returned values. Names are taken from the values in
            the first vector (if it is a character vector) or
            from the names of elements in that vector.

This function will apply FUN to the first element of each vector, then to the second, and so on, until it reaches the last element.

Here is a simple example of mapply:

> mapply(paste,
+        c(1, 2, 3, 4, 5),
+        c("a", "b", "c", "d", "e"), 
+        c("A", "B", "C", "D", "E"),
+        MoreArgs=list(sep="-"))
[1] "1-a-A" "2-b-B" "3-c-C" "4-d-D" "5-e-E"
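
As a further sketch of how the vectors are walked in parallel, the following applies a made-up two-argument function pairwise and then shows two cases in which the result is a list rather than a vector:

```r
# A hypothetical two-argument function applied pairwise
total <- mapply(function(price, qty) price * qty,
                c(2.5, 4, 10),
                c(4, 3, 1))
total             # 10 12 10

# rep returns vectors of different lengths, so the result stays a list
reps <- mapply(rep, 1:3, 3:1)
reps[[1]]         # 1 1 1
reps[[3]]         # 3

# SIMPLIFY = FALSE always returns a list
pairs <- mapply(function(a, b) c(a, b), 1:2, 3:4, SIMPLIFY = FALSE)
length(pairs)     # 2
```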

At this point, you’re probably confused by all the different apply functions. They all accept different arguments, they’re named inconsistently, and they work differently. Luckily, you don’t have to remember any of the details of these functions if you use the plyr package.

The plyr package contains a set of 12 logically named functions for applying another function to an R data object and returning the results. Each of these functions takes an array, data frame, or list as input and returns an array, data frame, list, or nothing as output. (You can choose to discard the results.) Here’s a table of the most useful functions:

Input        Array Output   Data Frame Output   List Output   Discard Output
Array        aaply          adply               alply         a_ply
Data Frame   daply          ddply               dlply         d_ply
List         laply          ldply               llply         l_ply

All of these functions accept the following arguments:

Argument    Description                                              Default
.data       The input data object
.fun        The function to apply to the data                        NULL
.progress   The type of progress bar (created with                   "none"
            create_progress_bar); choices include "none", "text",
            "tk", and "win"
.expand     If .data is a data frame, controls how output is         TRUE
            expanded; choose .expand=TRUE for 1d output,
            .expand=FALSE for nd.
.parallel   Specifies whether to apply the function in parallel      FALSE
            (through foreach)
...         Other arguments passed to .fun

Other arguments depend on the input and output. If the input is an array, then these arguments are available:

Argument    Description                                              Default
.margins    A vector describing the subscripts to split up data by

If the input is a data frame, then these arguments are available:

Argument    Description                                              Default
.drop (or .drop_i for daply)
            Specifies whether to drop combinations of variables      TRUE
            that do not appear in the data input
.variables  Specifies a set of variables by which to split the
            data frame
.drop_o (for daply only)
            Specifies whether to drop extra dimensions in the        TRUE
            output for dimensions of length 1

If the output is dropped, then this argument is available:

Argument    Description                                              Default
.print      Specifies whether to print each output value             FALSE

Let’s try to re-create some of our examples from above using plyr:

> # (1) input list, output list
> lapply(d, function(x) 2^x)
$x
[1]  2  4  8 16 32

$y
[1]   64  128  256  512 1024
> # equivalent is llply
> llply(.data=d, .fun=function(x) 2^x)
$x
[1]  2  4  8 16 32

$y
[1]   64  128  256  512 1024
> # (2) input is an array (x is the 3 x 3 x 3 array defined earlier), output is a vector
> apply(X=x,MARGIN=1, FUN=paste, collapse=",")
[1] "1,4,7,10,13,16,19,22,25" "2,5,8,11,14,17,20,23,26"
[3] "3,6,9,12,15,18,21,24,27"
> # equivalent (but note labels)
> aaply(.data=x,.margins=1, .fun=paste, collapse=",")
                        1                         2
"1,4,7,10,13,16,19,22,25" "2,5,8,11,14,17,20,23,26"
                        3
"3,6,9,12,15,18,21,24,27"
> # (3) Data frame in, matrix out
> t(sapply(d, FUN=function(x) 2^x))
  [,1] [,2] [,3] [,4] [,5]
x    2    4    8   16   32
y   64  128  256  512 1024
> # equivalent (but note the additional labels)
> aaply(.data=d, .fun=function(x) 2^x, .margins=2)

X1   1   2   3   4    5
  x  2   4   8  16   32
  y 64 128 256 512 1024
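
The examples above re-create base R calls, but the data frame functions are often used to split a data frame by a variable, apply a function to each piece, and combine the results. The sketch below does this with ddply. The quotes data frame and the mean_close column are made up for illustration, and the code is wrapped in a check so that it is simply skipped if plyr is not installed:

```r
# Guarded so the sketch is skipped when plyr is not installed
if (requireNamespace("plyr", quietly = TRUE)) {
  # Hypothetical quotes for two symbols
  quotes <- data.frame(symbol = c("AA", "AA", "BB", "BB"),
                       Close  = c(10, 12, 20, 24))

  # Split by symbol, summarize each piece, combine into a data frame
  by_symbol <- plyr::ddply(.data = quotes, .variables = "symbol",
                           .fun = function(piece) {
                             data.frame(mean_close = mean(piece$Close))
                           })
  print(by_symbol)   # one row per symbol: AA 11, BB 22
}
```

The function passed as .fun receives each piece as a data frame and returns a data frame; ddply adds the splitting variable as a column in the combined result.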