In addition to the usual if-then-else construct found in most languages, R also includes a vectorized version, the ifelse() function. The form is as follows:
ifelse(b,u,v)
where b is a Boolean vector, and u and v are vectors.
The return value is itself a vector; element i is u[i] if b[i] is true, or v[i] if b[i] is false. The concept is pretty abstract, so let’s go right to an example:
> x <- 1:10
> y <- ifelse(x %% 2 == 0,5,12)  # %% is the mod operator
> y
 [1] 12  5 12  5 12  5 12  5 12  5
Here, we wish to produce a vector in which there is a 5 wherever x is even and a 12 wherever x is odd. So, the actual argument corresponding to the formal argument b is (F,T,F,T,F,T,F,T,F,T). The second actual argument, 5, corresponding to u, is treated as (5,5,...) (ten 5s) by recycling. The third argument, 12, is also recycled, to (12,12,...).
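To make the recycling concrete, here is a small sketch that spells out the recycled vectors by hand and checks that the result is the same. The names b, y_short, and y_long are ours, chosen only for illustration.

```r
x <- 1:10
b <- x %% 2 == 0                       # the Boolean vector (F,T,F,T,...)

# Scalars 5 and 12 are recycled to length 10 by ifelse()
y_short <- ifelse(b, 5, 12)

# The same call with the recycling written out explicitly
y_long <- ifelse(b, rep(5, 10), rep(12, 10))

print(y_short)
print(identical(y_short, y_long))      # the two calls agree
```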
Here is another example:
> x <- c(5,2,9,12)
> ifelse(x > 6,2*x,3*x)
[1] 15  6 18 24
We return a vector consisting of the elements of x, multiplied by either 2 or 3, depending on whether the element is greater than 6.
Again, it helps to think through what is really occurring here. The expression x > 6 is a vector of Booleans. If the ith component is true, then the ith element of the return value will be set to the ith element of 2*x; otherwise, it will be set to 3*x[i], and so on.
The advantage of ifelse() over the standard if-then-else construct is that it is vectorized, thus potentially much faster.
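For comparison, here is a sketch of what the same computation looks like with an explicit loop and the ordinary if-else construct. The variable names z_loop and z_vec are ours. Note that the plain if statement tests one element at a time; in R 4.2.0 and later, handing it a condition of length greater than one is an error.

```r
x <- c(5, 2, 9, 12)

# Loop version: test each element separately with if-else
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  if (x[i] > 6) {
    z_loop[i] <- 2 * x[i]
  } else {
    z_loop[i] <- 3 * x[i]
  }
}

# Vectorized version: one call handles all elements
z_vec <- ifelse(x > 6, 2 * x, 3 * x)

print(z_loop)
print(z_vec)
```

Both versions produce the same vector; the ifelse() form simply avoids the explicit loop.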
In assessing the statistical relation of two variables, there are many alternatives to the standard correlation measure (Pearson product-moment correlation). Some readers may have heard of the Spearman rank correlation, for example. These alternative measures have various motivations, such as robustness to outliers, which are extreme and possibly erroneous data items.
Here, let’s propose a new such measure, not necessarily for novel statistical merits (actually it is related to one in broad use, Kendall’s τ), but to illustrate some of the R programming techniques introduced in this chapter, especially ifelse().
Consider vectors x and y, which are time series, say for measurements of air temperature and pressure collected once each hour. We’ll define our measure of association between them to be the fraction of the time x and y increase or decrease together; that is, the proportion of i for which y[i+1]-y[i] has the same sign as x[i+1]-x[i]. Here is the code:
1  # findud() converts vector v to 1s, -1s, representing an element
2  # increasing or not, relative to the previous one; output length is 1
3  # less than input
4  findud <- function(v) {
5     vud <- v[-1] - v[-length(v)]
6     return(ifelse(vud > 0,1,-1))
7  }
8
9  udcorr <- function(x,y) {
10    ud <- lapply(list(x,y),findud)
11    return(mean(ud[[1]] == ud[[2]]))
12 }
> x
 [1]  5 12 13  3  6  0  1 15 16  8 88
> y
 [1]  4  2  3 23  6 10 11 12  6  3  2
> udcorr(x,y)
[1] 0.4
In this example, x and y increased together in 3 of the 10 opportunities (the first time being the increases from 12 to 13 and 2 to 3) and decreased together once. That gives an association measure of 4/10 = 0.4.
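The counts quoted above can be checked directly. The following sketch recomputes them from the same x and y; the names dx, dy, up_together, and down_together are ours.

```r
x <- c(5, 12, 13, 3, 6, 0, 1, 15, 16, 8, 88)
y <- c(4, 2, 3, 23, 6, 10, 11, 12, 6, 3, 2)

# Successive differences: 10 values for 11 observations
dx <- x[-1] - x[-length(x)]
dy <- y[-1] - y[-length(y)]

up_together   <- sum(dx > 0 & dy > 0)   # both series rise
down_together <- sum(dx < 0 & dy < 0)   # both series fall

print(up_together)                                   # 3
print(down_together)                                 # 1
print((up_together + down_together) / length(dx))    # 0.4
```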
Let’s see how this works. The first order of business is to recode x and y to sequences of 1s and −1s, with a value of 1 meaning an increase of the current observation over the last. We’ve done that in lines 5 and 6.
For example, think what happens in line 5 when we call findud() with v having a length of, say, 16 elements. Then v[-1] will be a vector of 15 elements, starting with the second element in v. Similarly, v[-length(v)] will again be a vector of 15 elements, this time starting from the first element in v. The result is that we are subtracting each element of the series from the element one time period later. The difference gives us the sequence of increase/decrease statuses for each time period, which is exactly what we need.
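A short sketch on a four-element vector shows the two shifted copies and their difference. The vector v here is just the first four values of x from the earlier example, and the names lead, lagged, and vud_small are ours.

```r
v <- c(5, 12, 13, 3)

lead   <- v[-1]           # drops the first element:  12 13  3
lagged <- v[-length(v)]   # drops the last element:    5 12 13

# Element i of the result is v[i+1] - v[i]
vud_small <- lead - lagged
print(vud_small)          # 7 1 -10
```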
We then need to change those differences to 1s and −1s, according to whether a difference is positive or not. (Note that a difference of zero is coded as −1 here, since the test is vud > 0.) The ifelse() call does this easily, compactly, and with smaller execution time than a loop version of the code would have.
We could have then written two calls to findud(): one for x and the other for y. But by putting x and y into a list and then using lapply(), we can do this without duplicating code. If we were applying the same operation to many vectors instead of only two, especially in the case of a variable number of vectors, using lapply() like this would be a big help in compacting and clarifying the code, and it might be slightly faster as well.
We then find the fraction of matches, as follows:
return(mean(ud[[1]] == ud[[2]]))
Note that lapply() returns a list. The components are our 1/−1–coded vectors. The expression ud[[1]] == ud[[2]] returns a vector of TRUE and FALSE values, which are treated as 1 and 0 values by mean(). That gives us the desired fraction.
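As a minimal illustration of how mean() treats logical values, consider two short coded vectors. The vectors a and b2 are made up for this sketch and are not taken from the data above.

```r
a  <- c(1,  1, -1,  1)
b2 <- c(1, -1, -1, -1)

matches <- a == b2     # TRUE FALSE TRUE FALSE
print(matches)
print(sum(matches))    # TRUE counts as 1, so this is 2
print(mean(matches))   # 2 matches out of 4, so 0.5
```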
A more advanced version would make use of R’s diff() function, which does lag operations for vectors. We might, for instance, compare each element with the element three spots behind it, termed a lag of 3. The default lag value is one time period, just what we need here.
> u
[1] 1 6 7 2 3 5
> diff(u)
[1]  5  1 -5  1  2
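To see what a lag of 3 looks like on the same vector, we can pass the lag argument of diff() explicitly. The names d1 and d3 are ours.

```r
u <- c(1, 6, 7, 2, 3, 5)

d1 <- diff(u)            # default lag of 1: u[i+1] - u[i]
d3 <- diff(u, lag = 3)   # lag of 3: u[i+3] - u[i]

print(d1)   # 5 1 -5 1 2
print(d3)   # 2-1, 3-6, 5-7, that is 1 -3 -2
```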
Then line 5 in the preceding example would become this:
vud <- diff(v)
We can make the code really compact by using another advanced R function, sign(), which converts the numbers in its argument vector to 1, 0, or −1, depending on whether they are positive, zero, or negative. Here is an example:
> u
[1] 1 6 7 2 3 5
> diff(u)
[1]  5  1 -5  1  2
> sign(diff(u))
[1]  1  1 -1  1  1
Using sign() then allows us to turn this udcorr() function into a one-liner, as follows:
> udcorr <- function(x,y) mean(sign(diff(x)) == sign(diff(y)))
This is certainly a lot shorter than the original version. But is it better? For most people, it probably would take longer to write. And although the code is short, it is arguably harder to understand.
All R programmers must find their own “happy medium” in trading brevity for clarity.
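One detail worth keeping in mind when choosing between the two versions is that they are not quite equivalent when a series stays flat. The ifelse() version codes a zero difference as −1, while sign() codes it as 0. The sketch below uses made-up data with one tie; the names udcorr_ifelse and udcorr_sign are ours, introduced only to keep the two versions side by side.

```r
# Original approach: zero differences are coded as -1
findud <- function(v) {
  vud <- diff(v)
  ifelse(vud > 0, 1, -1)
}
udcorr_ifelse <- function(x, y) mean(findud(x) == findud(y))

# One-liner: zero differences are coded as 0
udcorr_sign <- function(x, y) mean(sign(diff(x)) == sign(diff(y)))

x <- c(1, 2, 2, 1)   # differences:  1,  0, -1  (one tie)
y <- c(3, 4, 3, 2)   # differences:  1, -1, -1

print(udcorr_ifelse(x, y))   # codes agree in all 3 positions
print(udcorr_sign(x, y))     # the tie no longer matches: 2 of 3
```

With no ties in either series, the two functions return the same value.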
Due to the vector nature of the arguments, you can nest ifelse() operations. In the following example, which involves an abalone data set, gender is coded as M, F, or I (for infant). We wish to recode those characters as 1, 2, or 3. The real data set consists of more than 4,000 observations, but for our example, we’ll say we have just a few, stored in g:
> g
[1] "M" "F" "F" "I" "M" "M" "F"
> ifelse(g == "M",1,ifelse(g == "F",2,3))
[1] 1 2 2 3 1 1 2
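As an aside, nesting gets harder to read as the number of categories grows. One alternative, which is not the approach taken in the text, is to index a named lookup vector by the codes. The sketch below assumes every element of g is one of the three known codes; the name codes is ours.

```r
g <- c("M", "F", "F", "I", "M", "M", "F")

# Nested ifelse(), as in the text
recoded_nested <- ifelse(g == "M", 1, ifelse(g == "F", 2, 3))

# Alternative: a named vector used as a lookup table
codes <- c(M = 1, F = 2, I = 3)
recoded_lookup <- unname(codes[g])   # index by name, then drop the names

print(recoded_nested)
print(recoded_lookup)
```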
What actually happens in that nested ifelse()? Let’s take a careful look. First, for the sake of concreteness, let’s find what the formal argument names are in the function ifelse():
> args(ifelse)
function (test, yes, no)
NULL
Remember, for each element of test that is true, the function evaluates to the corresponding element in yes. Similarly, if test[i] is false, the function evaluates to no[i]. All values so generated are returned together in a vector.
In our case here, R will execute the outer ifelse() call first, in which test is g == "M", and yes is 1 (recycled); no will (later) be the result of executing ifelse(g=="F",2,3). Now since test[1] is true, we generate yes[1], which is 1. So, the first element of the return value of our outer call will be 1.
Next R will evaluate test[2]. That is false, so R needs to find no[2]. R now needs to execute the inner ifelse() call. It hasn’t done so before, because it hasn’t needed it until now. R uses the principle of lazy evaluation, meaning that an expression is not computed until it is needed.
R will now evaluate ifelse(g=="F",2,3), yielding (3,2,2,3,3,3,2); this is no for the outer ifelse() call, so the latter’s second return element will be the second element of (3,2,2,3,3,3,2), which is 2.
When the outer ifelse() call gets to test[4], it will see that value to be false and thus will return no[4]. Since R had already computed no, it has the value needed, which is 3.
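The walk-through above can be reproduced in two steps, computing the inner call separately and then handing its result to the outer call. The sketch also includes a small check of lazy evaluation: as ifelse() is implemented in current versions of base R, the whole no vector is computed once, and only if at least one element of test is false. The names inner, outer, and all_true are ours.

```r
g <- c("M", "F", "F", "I", "M", "M", "F")

# The inner call, evaluated on its own
inner <- ifelse(g == "F", 2, 3)
print(inner)    # 3 2 2 3 3 3 2

# The outer call picks from 'inner' wherever g is not "M"
outer <- ifelse(g == "M", 1, inner)
print(outer)    # 1 2 2 3 1 1 2

# Lazy evaluation: every test element is TRUE, so the 'no'
# argument is never evaluated and stop() is never reached
all_true <- ifelse(c(TRUE, TRUE), 1, stop("'no' was evaluated"))
print(all_true) # 1 1
```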
Remember that the vectors involved could be columns in matrices, which is a very common scenario. Say our abalone data is stored in the matrix ab, with gender in the first column. Then if we wish to recode as in the preceding example, we could do it this way:
> ab[,1] <- ifelse(ab[,1] == "M",1,ifelse(ab[,1] == "F",2,3))
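One caveat about doing this in a matrix: all elements of a matrix share a single type, so a matrix holding the letters M, F, and I is a character matrix, and the recoded values are stored as the strings "1", "2", and "3". The sketch below uses a tiny made-up matrix and, for contrast, a made-up data frame, where each column keeps its own type. The column names and values are assumptions for illustration only.

```r
# A small character matrix standing in for the abalone data
ab <- cbind(Gender = c("M", "F", "I"),
            Length = c("0.455", "0.530", "0.330"))
ab[, 1] <- ifelse(ab[, 1] == "M", 1, ifelse(ab[, 1] == "F", 2, 3))
print(ab[, 1])          # character strings "1" "2" "3"

# In a data frame, the recoded column becomes numeric
abdf <- data.frame(Gender = c("M", "F", "I"),
                   Length = c(0.455, 0.530, 0.330))
abdf$Gender <- ifelse(abdf$Gender == "M", 1,
                      ifelse(abdf$Gender == "F", 2, 3))
print(abdf$Gender)      # numbers 1 2 3
```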
Suppose we wish to form subgroups according to gender. We could use which() to find the element numbers corresponding to M, F, and I:
> m <- which(g == "M")
> f <- which(g == "F")
> i <- which(g == "I")
> m
[1] 1 5 6
> f
[1] 2 3 7
> i
[1] 4
Going one step further, we could save these groups in a list, like this:
> grps <- list()
> for (gen in c("M","F","I")) grps[[gen]] <- which(g==gen)
> grps
$M
[1] 1 5 6

$F
[1] 2 3 7

$I
[1] 4
Note that we take advantage of the fact that R’s for() loop has the ability to loop through a vector of strings. (You’ll see a more efficient approach in Section 4.4.)
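As one possibility for a loop-free version (not necessarily the approach the later section uses), base R’s split() can group the element numbers by gender in a single call. The name grps2 is ours.

```r
g <- c("M", "F", "F", "I", "M", "M", "F")

# Split the indices 1..7 into groups according to the values of g
grps2 <- split(seq_along(g), g)

print(grps2$M)   # 1 5 6
print(grps2$F)   # 2 3 7
print(grps2$I)   # 4
```

The groups contain the same indices as in the loop version, although split() orders the list components by sorted group label rather than by the order given in the loop.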
We might use our recoded data to draw some graphs, exploring the various variables in the abalone data set. Let’s summarize the nature of the variables by adding the following header to the file:
Gender,Length,Diameter,Height,WholeWt,ShuckedWt,ViscWt,ShellWt,Rings
We could, for instance, plot diameter versus length, with a separate plot for males and females, using the following code:
aba <- read.csv("abalone.data",header=TRUE,as.is=TRUE)
grps <- list()
for (gen in c("M","F")) grps[[gen]] <- which(aba$Gender==gen)
abam <- aba[grps$M,]
abaf <- aba[grps$F,]
plot(abam$Length,abam$Diameter)
points(abaf$Length,abaf$Diameter,pch="x")
First, we read in the data set, assigning it to the variable aba (to remind us that it’s abalone data). The call to read.csv() is similar to the read.table() call we used in Chapter 1, as we’ll discuss in Chapter 6 and Chapter 10. We then form abam and abaf, the subsets of aba corresponding to males and females, respectively.
Next, we create the plots. The call to plot() does a scatter plot of diameter against length for the males. The second call is for the females. Since we want these points to be superimposed on the same graph as the males, we use points() rather than a second call to plot(), because points() adds to the current graph instead of starting a new one. The argument pch="x" means that we want the plot characters for the female graph to consist of x characters, rather than the default o characters.
The graph (for the entire data set) is shown in Figure 2-1. By the way, it is not completely satisfactory. Apparently, there is such a strong correlation between diameter and length that the points densely fill up a section of the graph, and the male and female plots pretty much coincide. (It does appear that males have more variability, though.) This is a common issue in statistical graphics. A finer graphical analysis may be more illuminating, but at least here we see evidence of the strong correlation and that the relation does not vary much across genders.
We can compact the plotting code in the previous example by yet another use of ifelse(). This exploits the fact that the plot parameter pch is allowed to be a vector rather than a single character. In other words, R allows us to specify a different plot character for each point.
pchvec <- ifelse(aba$Gender == "M","o","x")
plot(aba$Length,aba$Diameter,pch=pchvec)
(Here, we’ve omitted the recoding to 1, 2, and 3, but you may wish to retain it for various reasons.)
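Note that the two-way ifelse() above gives every non-male observation, infants included, the x character. If we wanted a separate symbol for infants, a nested call in the style of the earlier recoding example would do it. The sketch below uses a short made-up gender vector in place of aba$Gender, and the choice of "+" for infants is ours.

```r
# A few made-up gender codes standing in for aba$Gender
gender <- c("M", "F", "F", "I", "M")

# One plot character per observation: o for M, x for F, + for I
pchvec3 <- ifelse(gender == "M", "o",
                  ifelse(gender == "F", "x", "+"))
print(pchvec3)

# The vector could then be passed to plot() through pch, for example:
# plot(aba$Length, aba$Diameter, pch = pchvec3)
```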