One of the most famous and most used features of R is the *apply() family of functions, such as apply(), tapply(), and lapply(). Here, we’ll look at apply(), which instructs R to call a user-specified function on each of the rows or each of the columns of a matrix.
This is the general form of apply() for matrices:
apply(m,dimcode,f,fargs)
where the arguments are as follows:
m is the matrix.
dimcode is the dimension, equal to 1 if the function applies to rows or 2 for columns.
f is the function to be applied.
fargs is an optional set of arguments to be supplied to f.
For example, here we apply the R function mean() to each column of a matrix z:
> z
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> apply(z,2,mean)
[1] 2 5
In this case, we could have used the colMeans() function, but this provides a simple example of using apply().
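As a quick check on the dimcode argument, the following sketch rebuilds the matrix z from the example above and compares apply() with the dedicated colMeans() and rowMeans() functions. The matrix() call is an assumption about how z was created; the names col_means and row_means are hypothetical.

```r
# Rebuild z; matrix() fills column by column by default
z <- matrix(1:6, nrow = 3)

col_means <- apply(z, 2, mean)   # dimcode 2: one call to mean() per column
row_means <- apply(z, 1, mean)   # dimcode 1: one call to mean() per row

print(col_means)                 # 2 5
print(row_means)                 # 2.5 3.5 4.5

# The dedicated functions give the same values
print(all.equal(col_means, colMeans(z)))
print(all.equal(row_means, rowMeans(z)))
```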
A function you write yourself is just as legitimate for use in apply() as any R built-in function such as mean(). Here’s an example using our own function f:
> z
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> f <- function(x) x/c(2,8)
> y <- apply(z,1,f)
> y
     [,1]  [,2] [,3]
[1,]  0.5 1.000 1.50
[2,]  0.5 0.625 0.75
Our f() function divides a two-element vector by the vector (2,8). (Recycling would be used if x had a length longer than 2.) The call to apply() asks R to call f() on each of the rows of z. The first such row is (1,4), so in the call to f(), the actual argument corresponding to the formal argument x is (1,4). Thus, R computes the value of (1,4)/(2,8), which in R’s element-wise vector arithmetic is (0.5,0.5). The computations for the other two rows are similar.
You may have been surprised that the size of the result here is 2 by 3 rather than 3 by 2. That first computation, (0.5,0.5), ends up in the first column of the output of apply(), not the first row. This is simply how apply() behaves: if the function to be applied returns a vector of k components, then the result of apply() will have k rows. You can use the matrix transpose function t() to change this if necessary, as follows:
> t(apply(z,1,f))
     [,1]  [,2]
[1,]  0.5 0.500
[2,]  1.0 0.625
[3,]  1.5 0.750
If the function returns a scalar (which we know is just a one-element vector), the final result will be a vector, not a matrix.
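The shape rules just described can be checked directly. The sketch below is only an illustration: it rebuilds z and f from the text, and g is a hypothetical function introduced here to show a case with k = 3 components.

```r
z <- matrix(1:6, nrow = 3)     # same values as z in the text
f <- function(x) x / c(2, 8)

y <- apply(z, 1, f)            # f returns 2 values for each of 3 rows
print(dim(y))                  # 2 3
print(dim(t(y)))               # 3 2

# g is a hypothetical function that returns k = 3 components
g <- function(x) c(min(x), mean(x), max(x))
g_out <- apply(z, 2, g)
print(dim(g_out))              # 3 2: k rows, one column per column of z

# A function returning a single value yields a plain vector
row_sums <- apply(z, 1, sum)
print(row_sums)                # 5 7 9
print(is.matrix(row_sums))     # FALSE
```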
As you can see, the function to be applied needs to take at least one argument. The formal argument here will correspond to an actual argument of one row or column in the matrix, as described previously. In some cases, you will need additional arguments for this function, which you can place following the function name in your call to apply().
For instance, suppose we have a matrix of 1s and 0s and want to create a vector as follows: For each row of the matrix, the corresponding element of the vector will be either 1 or 0, depending on whether the majority of the first d elements in that row is 1 or 0. Here, d will be a parameter that we may wish to vary. We could do this:
> copymaj
function(rw,d) {
   maj <- sum(rw[1:d]) / d
   return(if(maj > 0.5) 1 else 0)
}
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    1    1    0
[2,]    1    1    1    1    0
[3,]    1    0    0    1    1
[4,]    0    1    1    1    0
> apply(x,1,copymaj,3)
[1] 1 1 0 1
> apply(x,1,copymaj,2)
[1] 0 1 0 0
Here, the values 3 and 2 form the actual arguments for the formal argument d in copymaj(). Let’s look at what happened in the case of row 1 of x. That row consisted of (1,0,1,1,0), the first d elements of which were (1,0,1). A majority of those three elements were 1s, so copymaj() returned a 1, and thus the first element of the output of apply() was a 1.
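Extra arguments can also be supplied by name, which some readers may find easier to follow. The sketch below retypes copymaj() and x from the example above (the matrix() call is an assumption about how x was built) and passes d as a named argument; the names maj3 and maj2 are hypothetical.

```r
copymaj <- function(rw, d) {
  maj <- sum(rw[1:d]) / d          # share of 1s among the first d elements
  return(if (maj > 0.5) 1 else 0)  # an exact tie (0.5) yields 0
}

# Same values as x in the text, entered row by row
x <- matrix(c(1, 0, 1, 1, 0,
              1, 1, 1, 1, 0,
              1, 0, 0, 1, 1,
              0, 1, 1, 1, 0),
            nrow = 4, byrow = TRUE)

maj3 <- apply(x, 1, copymaj, d = 3)
maj2 <- apply(x, 1, copymaj, d = 2)

print(maj3)   # 1 1 0 1
print(maj2)   # 0 1 0 0
```

With d = 2, rows 1, 3, and 4 each split evenly between 1 and 0, so the strict comparison maj > 0.5 returns 0 for them.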
Contrary to common opinion, using apply() will generally not speed up your code. The benefits are that it makes for very compact code, which may be easier to read and modify, and you avoid possible bugs in writing code for looping. Moreover, as R moves closer and closer to parallel processing, functions like apply() will become more and more important. For example, the clusterApply() function in the snow package gives R some parallel-processing capability by distributing the submatrix data to various network nodes, with each node basically applying the given function on its submatrix.
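To illustrate the compactness point, here is a minimal sketch that computes the same row-wise result with an explicit loop and with apply(). It makes no claim about timing; the names row_max_loop and row_max_apply are hypothetical.

```r
z <- matrix(1:6, nrow = 3)

# Explicit loop: preallocate the result, then fill one element per row
row_max_loop <- numeric(nrow(z))
for (i in seq_len(nrow(z))) {
  row_max_loop[i] <- max(z[i, ])
}

# The same computation as a single apply() call
row_max_apply <- apply(z, 1, max)

print(row_max_loop)    # 4 5 6
print(row_max_apply)   # 4 5 6
```

The loop version needs a preallocated result and an index variable, both of which are places where an off-by-one or wrong-dimension bug could creep in.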
In statistics, outliers are data points that differ greatly from most of the other observations. As such, they are treated either as suspect (they might be erroneous) or unrepresentative (such as Bill Gates’s income among the incomes of the citizens of the state of Washington). Many methods have been devised to identify outliers. We’ll build a very simple one here.
Say we have retail sales data in a matrix rs. Each row of data is for a different store, and observations within a row are daily sales figures. As a simple (undoubtedly overly simple) approach, let’s write code to identify the most deviant observation for each store. We’ll define that as the observation furthest from the median value for that store. Here’s the code:
1   findols <- function(x) {
2      findol <- function(xrow) {
3         mdn <- median(xrow)
4         devs <- abs(xrow-mdn)
5         return(which.max(devs))
6      }
7      return(apply(x,1,findol))
8   }
Our call will be as follows:
findols(rs)
How will this work? First, we need a function to specify in our apply() call. Since this function will be applied to each row of our sales matrix, our description implies that it needs to report the index of the most deviant observation in a given row. Our function findol() does that, in lines 4 and 5. (Note that we’ve defined one function within another here, a common practice if the inner function is short.) In the expression xrow-mdn, we are subtracting a number that is a one-element vector from a vector that generally will have a length greater than 1. Thus, recycling is used to extend mdn to conform with xrow before the subtraction.
Then in line 5, we use the R function which.max(). Instead of finding the maximum value in a vector, which the max() function does, which.max() tells us where that maximum value occurs, that is, the index at which it occurs. This is just what we need.
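The body of findol() can be traced on a single row. The sketch below uses made-up sales figures for one hypothetical store, purely for illustration, to show the recycling in xrow-mdn and the difference between max() and which.max().

```r
xrow <- c(10, 12, 11, 50, 9)   # made-up daily sales for one store

mdn <- median(xrow)            # 11
devs <- abs(xrow - mdn)        # mdn is recycled to the length of xrow

print(devs)                    # 1 1 0 39 2
print(max(devs))               # 39: the largest deviation itself
print(which.max(devs))         # 4: the position of that deviation
```

If several elements tie for the maximum, which.max() reports the first of them.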
Finally, in line 7, we ask R to apply findol() to each row of x, thus producing the indices of the most deviant observation in each row.
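Here is a short usage sketch for findols(). The matrix rs below contains made-up figures for three hypothetical stores over five days; it is not data from the text. The names ol_idx and ol_vals are also hypothetical.

```r
findols <- function(x) {
  findol <- function(xrow) {
    mdn <- median(xrow)
    devs <- abs(xrow - mdn)
    return(which.max(devs))
  }
  return(apply(x, 1, findol))
}

# Made-up sales data: one row per store, one column per day
rs <- matrix(c(10, 12, 11, 50,  9,
               20, 21,  5, 22, 19,
               30, 30, 31, 29, 90),
             nrow = 3, byrow = TRUE)

ol_idx <- findols(rs)
print(ol_idx)                  # 4 3 5: the day index of each store's outlier

# Look up the outlying values themselves by (row, column) pairs
ol_vals <- rs[cbind(seq_len(nrow(rs)), ol_idx)]
print(ol_vals)                 # 50 5 90
```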