Sometimes, there will be some variables in your source data that aren’t quite right. This section explains how to change a variable in a data frame.
One of the most convenient ways to redefine a variable in a data frame is to use the assignment operator. For example, suppose that you wanted to change the type of a variable in the dow30 data frame that we created above. When read.csv imported this data, it interpreted the “Date” field as a character string and converted it to a factor:
> class(dow30$Date)
[1] "factor"
Factors are fine for some things, but we could better represent the date field as a Date object. (That would create a proper ordering on dates and allow us to extract information from them.) Luckily, Yahoo! Finance prints dates in the default date format for R, so we can just transform these values into Date objects using as.Date (see the help file for as.Date for more information). So let’s change this variable within the data frame to use Date objects:
> dow30$Date <- as.Date(dow30$Date)
> class(dow30$Date)
[1] "Date"
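As a minimal sketch of the same conversion, the following uses a small stand-in data frame (the name prices and its values are made up for illustration and are not taken from dow30). It shows the default-format case and, for contrast, a date string that needs an explicit format argument:

```r
# Hypothetical stand-in for dow30; the values are illustrative only.
prices <- data.frame(
  Date  = c("2009-09-21", "2009-09-18", "2009-09-17"),
  Close = c(9778.86, 9820.20, 9783.92),
  stringsAsFactors = TRUE   # mimic the factor conversion described above
)

class(prices$Date)                   # "factor"
prices$Date <- as.Date(prices$Date)  # yyyy-mm-dd is as.Date's default format
class(prices$Date)                   # "Date"

# Date objects order chronologically and support arithmetic.
earliest <- min(prices$Date)
span     <- as.numeric(max(prices$Date) - min(prices$Date))  # days between

# A non-default layout needs an explicit format string.
us_style <- as.Date("09/21/2009", format = "%m/%d/%Y")
```

Once the column is a Date, functions such as min, max, and sort work chronologically rather than by factor level.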
It’s also possible to make other changes to data frames. For example, suppose that we wanted to define a new midpoint variable that is the mean of the high and low price. We can add this variable with the same notation:
> dow30$mid <- (dow30$High + dow30$Low) / 2
> names(dow30)
[1] "symbol"    "Date"      "Open"      "High"      "Low"
[6] "Close"     "Volume"    "Adj.Close" "mid"
A convenient function for changing variables in a data frame is the transform function. Formally, transform is defined as:
transform(`_data`, ...)
Notice that there aren’t any named arguments for this function. To use transform, you specify a data frame (as the first argument) and a set of expressions that use variables within the data frame. The transform function applies each expression to the data frame and then returns the final data frame.
For example, suppose that we wanted to perform the two transformations listed above: changing the Date column to a Date format, and adding a new midpoint variable. We could do this with transform using the following expression:
> dow30.transformed <- transform(dow30, Date=as.Date(Date),
+   mid = (High + Low) / 2)
> names(dow30.transformed)
[1] "symbol"    "Date"      "Open"      "High"      "Low"
[6] "Close"     "Volume"    "Adj.Close" "mid"
> class(dow30.transformed$Date)
[1] "Date"
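One detail worth illustrating: transform evaluates its expressions against the columns of the input data frame, so an expression cannot refer to a column created earlier in the same call. The sketch below (with a made-up quotes data frame and a made-up spread column) chains two calls instead:

```r
# Hypothetical quotes; column names mirror the High/Low columns used above.
quotes <- data.frame(High = c(10, 12), Low = c(8, 9))

# Step 1: add mid. Step 2: use mid, which now exists in the input.
with_mid    <- transform(quotes, mid = (High + Low) / 2)
with_spread <- transform(with_mid, spread = (High - Low) / mid)

# The original data frame is unchanged; transform returns a modified copy.
names(quotes)       # "High" "Low"
names(with_spread)  # "High" "Low" "mid" "spread"
```

Because transform returns a new data frame rather than modifying its argument, the result has to be assigned to a name to be kept.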
When transforming data, one common operation is to apply a function to a set of objects (or each part of a composite object) and return a new set of objects (or a new composite object). The base R library includes a set of different functions for doing this.
To apply a function to parts of an array (or matrix), use the apply function:
apply(X, MARGIN, FUN, ...)
The apply function accepts three arguments: X is the array to which a function is applied, FUN is the function, and MARGIN specifies the dimensions to which you would like to apply a function. Optionally, you can pass additional arguments to apply, and they will be passed on to FUN. To show how this works, here’s a simple example. Let’s create a matrix with five rows of four elements, corresponding to the numbers between 1 and 20:
> x <- 1:20
> dim(x) <- c(5, 4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
Now let’s show how apply works. We’ll use the function max because it’s easy to look at the matrix above and see where the results came from.
First, let’s select the maximum element of each row. (These are the values in the rightmost column: 16, 17, 18, 19, and 20.) To do this, we will specify X=x, MARGIN=1 (rows are the first dimension), and FUN=max:
> apply(X=x, MARGIN=1, FUN=max)
[1] 16 17 18 19 20
To do the same thing for columns, we simply have to change the value of MARGIN:
> apply(X=x, MARGIN=2, FUN=max)
[1] 5 10 15 20
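Extra arguments given to apply are handed on to FUN. As a small sketch of that behavior (the matrix m is invented for this illustration), na.rm=TRUE is passed through to max so that a missing value does not turn the whole row result into NA:

```r
# Illustrative 2 x 3 matrix with one missing value (filled column by column):
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]   NA    4    6
m <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)

row_max_default <- apply(X = m, MARGIN = 1, FUN = max)                # row 2 is NA
row_max_clean   <- apply(X = m, MARGIN = 1, FUN = max, na.rm = TRUE)  # NA ignored
```

Any argument that is not X, MARGIN, or FUN is forwarded this way, so it helps to name such arguments explicitly.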
As a slightly more complex example, we can also use MARGIN to apply a function over multiple dimensions. (We’ll switch to the function paste to show which elements were included.) Consider the following three-dimensional array:
> x <- 1:27
> dim(x) <- c(3, 3, 3)
> x
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27
Let’s start by looking at which values are grouped for each value of MARGIN:
> apply(X=x, MARGIN=1, FUN=paste, collapse=",")
[1] "1,4,7,10,13,16,19,22,25" "2,5,8,11,14,17,20,23,26"
[3] "3,6,9,12,15,18,21,24,27"
> apply(X=x, MARGIN=2, FUN=paste, collapse=",")
[1] "1,2,3,10,11,12,19,20,21" "4,5,6,13,14,15,22,23,24"
[3] "7,8,9,16,17,18,25,26,27"
> apply(X=x, MARGIN=3, FUN=paste, collapse=",")
[1] "1,2,3,4,5,6,7,8,9"          "10,11,12,13,14,15,16,17,18"
[3] "19,20,21,22,23,24,25,26,27"
Let’s do something more complicated. Let’s select MARGIN=c(1, 2) to see which elements are selected:
> apply(X=x, MARGIN=c(1,2), FUN=paste, collapse=",")
[,1] [,2] [,3]
[1,] "1,10,19" "4,13,22" "7,16,25"
[2,] "2,11,20" "5,14,23" "8,17,26"
[3,] "3,12,21" "6,15,24" "9,18,27"
This is the equivalent of doing the following: for each value of i between 1 and 3 and each value of j between 1 and 3, calculate FUN of x[i, j, 1], x[i, j, 2], x[i, j, 3].
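This equivalence can be checked with an explicit loop. The sketch below rebuilds the 3 x 3 x 3 array, fills a result matrix cell by cell using R’s x[i, j, k] indexing, and compares it with the output of apply (the names by_loop and by_apply are chosen only for this illustration):

```r
x <- array(1:27, dim = c(3, 3, 3))

# Build the result by hand: one cell per (i, j) pair, collapsing over
# the third dimension.
by_loop <- matrix("", nrow = 3, ncol = 3)
for (i in 1:3) {
  for (j in 1:3) {
    by_loop[i, j] <- paste(x[i, j, ], collapse = ",")
  }
}

by_apply <- apply(X = x, MARGIN = c(1, 2), FUN = paste, collapse = ",")

same <- all(by_loop == by_apply)  # every cell matches
```

Leaving the third index empty, as in x[i, j, ], selects all values along that dimension.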
To apply a function to each element in a vector or a list and return a list, you can use the function lapply. The function lapply requires two arguments: an object X and a function FUN. (You may specify additional arguments that will be passed to FUN.) Let’s look at a simple example of how to use lapply:
> x <- as.list(1:5)
> lapply(x,function(x) 2^x)
[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 8

[[4]]
[1] 16

[[5]]
[1] 32
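Additional arguments to lapply are passed along to the applied function on every call. A brief sketch, using an invented named list called samples:

```r
# Illustrative named list; one element contains a missing value.
samples <- list(a = c(1, 2, NA), b = c(10, 20, 30))

# na.rm = TRUE is forwarded to mean for each element.
means <- lapply(samples, mean, na.rm = TRUE)

means$a  # mean of 1 and 2
means$b  # mean of 10, 20, and 30
```

The result is always a list with one element per input element, and the names of the input carry over to the output.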
You can apply a function to a data frame, and the function will be applied to each vector in the data frame. For example:
> d <- data.frame(x=1:5, y=6:10)
> d
  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10
> lapply(d,function(x) 2^x)
$x
[1]  2  4  8 16 32

$y
[1]   64  128  256  512 1024

> lapply(d,FUN=max)
$x
[1] 5

$y
[1] 10
Sometimes, you might prefer to get a vector, matrix, or array instead of a list. To do this, use the sapply function. This function works the same way as lapply, except that it simplifies the result to a vector or matrix (when appropriate):
> sapply(d, FUN=function(x) 2^x)
x y
[1,] 2 64
[2,] 4 128
[3,] 8 256
[4,] 16 512
[5,] 32 1024
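The shape that sapply returns depends on what the applied function produces. The following sketch (variable names are chosen only for this illustration) shows the three common cases: one value per element gives a vector, equal-length results give a matrix, and results of differing lengths stay a list:

```r
d <- data.frame(x = 1:5, y = 6:10)

col_max   <- sapply(d, max)    # one value per column: a named vector
col_range <- sapply(d, range)  # two values per column: a 2 x 2 matrix

# Results of different lengths cannot be simplified, so a list comes back.
ragged <- sapply(list(a = 1:2, b = 1:3), function(v) v * 2)
```

Because the return type can change with the data, code that needs a predictable type may be better served by lapply.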
Another related function is mapply, the “multivariate” version of sapply:
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
Here is a description of the arguments to mapply.
Argument | Description | Default |
---|---|---|
FUN | The function to apply. | |
... | A set of vectors over which FUN should be applied. | |
MoreArgs | A list of additional arguments to pass to FUN. | |
SIMPLIFY | A logical value indicating whether to simplify the returned array. | TRUE |
USE.NAMES | A logical value indicating whether to use names for returned values. Names are taken from the values in the first vector (if it is a character vector) or from the names of elements in that vector. | TRUE |
This function will apply FUN to the first element of each vector, then to the second, and so on, until it reaches the last element.
Here is a simple example of mapply:
> mapply(paste,
+        c(1, 2, 3, 4, 5),
+        c("a", "b", "c", "d", "e"),
+        c("A", "B", "C", "D", "E"),
+        MoreArgs=list(sep="-"))
[1] "1-a-A" "2-b-B" "3-c-C" "4-d-D" "5-e-E"
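To make the element-by-element pairing and the SIMPLIFY and USE.NAMES behavior more concrete, here is a short sketch with invented inputs. The first call pairs numbers, the second takes its names from a character first argument, and the third returns a list because the results have different lengths:

```r
# Pairs (1, 4), (2, 5), (3, 6) and adds each pair.
sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# The first vector is character, so its values become the result names.
prefixes <- mapply(function(word, n) substr(word, 1, n),
                   c("alpha", "beta"), c(2, 3))

# rep(1, 3), rep(2, 2), rep(3, 1): lengths differ, so no simplification.
reps <- mapply(rep, 1:3, 3:1)
```

Setting SIMPLIFY=FALSE forces a list result in every case, which can be easier to handle in later code.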
At this point, you’re probably confused by all the different apply functions. They all accept different arguments, they’re named inconsistently, and they work differently. Luckily, you don’t have to remember any of the details of these functions if you use the plyr package.
The plyr package contains a set of 12 logically named functions for applying another function to an R data object and returning the results. Each of these functions takes an array, data frame, or list as input and returns an array, data frame, list, or nothing as output. (You can choose to discard the results.) Here’s a table of the most useful functions:
Input | Array Output | Data Frame Output | List Output | Discard Output |
---|---|---|---|---|
Array | aaply | adply | alply | a_ply |
Data Frame | daply | ddply | dlply | d_ply |
List | laply | ldply | llply | l_ply |
All of these functions accept the following arguments:
Argument | Description | Default |
---|---|---|
.data | The input data object | |
.fun | The function to apply to the data | NULL |
.progress | The type of progress bar (created with create_progress_bar); choices include "none", "text", "tk", and "win" | "none" |
.expand | If .data is a data frame, controls how output is expanded; choose .expand=TRUE for 1d output, .expand=FALSE for nd. | TRUE |
.parallel | Specifies whether to apply the function in parallel (through foreach) | FALSE |
... | Other arguments passed to .fun | |
Other arguments depend on the input and output. If the input is an array, then these arguments are available:
Argument | Description | Default |
---|---|---|
.margins | A vector describing the subscripts to split up data by | |
If the input is a data frame, then these arguments are available:
Argument | Description | Default |
---|---|---|
.drop (or .drop_i for daply) | Specifies whether to drop combinations of variables that do not appear in the data input | TRUE |
.variables | Specifies a set of variables by which to split the data frame | |
.drop_o (for daply only) | Specifies whether to drop extra dimensions in the output for dimensions of length 1 | TRUE |
If the output is dropped, then this argument is available:
Argument | Description | Default |
---|---|---|
.print | Specifies whether to print each output value | FALSE |
Let’s try to re-create some of our examples from above using plyr:
> # (1) input list, output list
> lapply(d, function(x) 2^x)
$x
[1] 2 4 8 16 32
$y
[1] 64 128 256 512 1024
> # equivalent is llply
> llply(.data=d, .fun=function(x) 2^x)
$x
[1] 2 4 8 16 32
$y
[1] 64 128 256 512 1024
> # (2) input is an array, output is a vector
> apply(X=x,MARGIN=1, FUN=paste, collapse=",")
[1] "1,4,7,10,13,16,19,22,25" "2,5,8,11,14,17,20,23,26"
[3] "3,6,9,12,15,18,21,24,27"
> # equivalent (but note labels)
> aaply(.data=x,.margins=1, .fun=paste, collapse=",")
1 2
"1,4,7,10,13,16,19,22,25" "2,5,8,11,14,17,20,23,26"
3
"3,6,9,12,15,18,21,24,27"
> # (3) Data frame in, matrix out
> t(sapply(d, FUN=function(x) 2^x))
[,1] [,2] [,3] [,4] [,5]
x 2 4 8 16 32
y 64 128 256 512 1024
> # equivalent (but note the additional labels)
> aaply(.data=d, .fun=function(x) 2^x, .margins=2)
X1 1 2 3 4 5
x 2 4 8 16 32
y 64 128 256 512 1024
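The examples above do not exercise the .variables argument for data frame input. As a minimal sketch, assuming the plyr package is installed, ddply can split a data frame by one variable and return one summary row per group. The quotes data frame and the mean_close column below are invented for illustration:

```r
# Assumes the plyr package is installed; the block is skipped if it is not.
if (requireNamespace("plyr", quietly = TRUE)) {
  # Hypothetical quotes for two symbols; values are illustrative only.
  quotes <- data.frame(
    symbol = c("AA", "AA", "BB", "BB"),
    Close  = c(10, 12, 20, 24)
  )

  # Split by symbol, apply the function to each piece, and combine the
  # pieces into a single data frame.
  mean_close <- plyr::ddply(
    .data      = quotes,
    .variables = "symbol",
    .fun       = function(piece) data.frame(mean_close = mean(piece$Close))
  )

  print(mean_close)
}
```

The function given as .fun receives each group as a data frame, and the splitting variable is added back as a column in the combined result.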