Using all() and any()

The any() and all() functions are handy shortcuts. They report whether any or all of the elements of their arguments are TRUE.

> x <- 1:10
> any(x > 8)
[1] TRUE
> any(x > 88)
[1] FALSE
> all(x > 88)
[1] FALSE
> all(x > 0)
[1] TRUE

For example, suppose that R executes the following:

> any(x > 8)

It first evaluates x > 8, yielding this:

(FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE)

The any() function then reports whether any of those values is TRUE. The all() function works similarly, reporting whether all of the values are TRUE.

Suppose that we are interested in finding runs of consecutive 1s in vectors that consist of just 1s and 0s. In the vector (1,0,0,1,1,1,0,1,1), for instance, there is a run of length 3 starting at index 4, and runs of length 2 beginning at indices 4, 5, and 8. So for the function findruns(), to be shown below, the call findruns(c(1,0,0,1,1,1,0,1,1),2) returns (4,5,8). Here is the code:

1    findruns <- function(x,k) {
2       n <- length(x)
3       runs <- NULL
4       for (i in 1:(n-k+1)) {
5          if (all(x[i:(i+k-1)]==1)) runs <- c(runs,i)
6       }
7       return(runs)
8    }

In line 5, we need to determine whether all of the k values starting at x[i]—that is, all of the values in x[i],x[i+1],...,x[i+k-1]—are 1s. The expression x[i:(i+k-1)] gives us this range in x, and then applying all() tells us whether there is a run there.
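For instance, taking the vector from the example above, with i = 4 and k = 3:

> x <- c(1,0,0,1,1,1,0,1,1)
> i <- 4; k <- 3
> x[i:(i+k-1)]
[1] 1 1 1
> all(x[i:(i+k-1)] == 1)
[1] TRUE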

Let’s test it.

> y <- c(1,0,0,1,1,1,0,1,1)
> findruns(y,3)
[1] 4
> findruns(y,2)
[1] 4 5 8
> findruns(y,6)
NULL

Although the use of all() is good in the preceding code, the buildup of the vector runs is not so good. Vector allocation is time consuming. Each execution of the following statement slows down our code, as the call c(runs,i) allocates a new vector. (The fact that the new vector is assigned to runs is irrelevant; we have still done a vector memory space allocation.)

runs <- c(runs,i)

In a short loop, this probably will be no problem, but when application performance is an issue, there are better ways.

One alternative is to preallocate the memory space, like this:

1    findruns1 <- function(x,k) {
2       n <- length(x)
3       runs <- vector(length=n)
4       count <- 0
5       for (i in 1:(n-k+1)) {
6          if (all(x[i:(i+k-1)]==1)) {
7             count <- count + 1
8             runs[count] <- i
9          }
10      }
11      if (count > 0) {
12         runs <- runs[1:count]
13      } else runs <- NULL
14      return(runs)
15    }

In line 3, we set up space for a vector of length n. This means we avoid new allocations during execution of the loop; we merely fill runs, in line 8. Just before exiting the function, we redefine runs in line 12 to remove the unused portion of the vector.

This is better, as we’ve reduced the number of memory allocations to just two, down from possibly many in the first version of the code.
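If you want to see the difference on your own machine, you might time the two versions on a larger input. Here is a minimal sketch; the vector size and the value of k are arbitrary choices for illustration, and the timings themselves will vary from system to system:

x <- sample(0:1,250000,replace=TRUE)   # a large random 0/1 vector
system.time(findruns(x,5))    # repeated c() calls: many allocations
system.time(findruns1(x,5))   # preallocated: just two allocations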

If we really need the speed, we might consider recoding this in C, as discussed in Chapter 14.

Suppose we observe 0- and 1-valued data, one per time period. To make things concrete, say it’s daily weather data: 1 for rain and 0 for no rain. Suppose we wish to predict whether it will rain tomorrow, knowing whether it rained or not in recent days. Specifically, for some number k, we will predict tomorrow’s weather based on the weather record of the last k days. We’ll use majority rule: If the number of 1s in the previous k time periods is at least k/2, we’ll predict the next value to be 1; otherwise, our prediction is 0. For instance, if k = 3 and the data for the last three periods is 1,0,1, we’ll predict the next period to be a 1.

But how should we choose k? Clearly, if we choose too small a value, it may give us too small a sample from which to predict. Too large a value will cause us to rely on data from the distant past that may have little or no predictive value.

A common solution to this problem is to take known data, called a training set, and then ask how well various values of k would have performed on that data.

In the weather case, suppose we have 500 days of data and suppose we are considering using k = 3. To assess the predictive ability of that value for k, we “predict” each day in our data from the previous three days and then compare the predictions with the known values. After doing this throughout our data, we have an error rate for k = 3. We do the same for k = 1, k = 2, k = 4, and so on, up to some maximum value of k that we feel is enough. We then use whichever value of k worked best in our training data for future predictions.

So how would we code that in R? Here’s a naive approach:

1    preda <- function(x,k) {
2       n <- length(x)
3       k2 <- k/2
4       # the vector pred will contain our predicted values
5       pred <- vector(length=n-k)
6       for (i in 1:(n-k)) {
7          if (sum(x[i:(i+(k-1))]) >= k2) pred[i] <- 1 else pred[i] <- 0
8       }
9       return(mean(abs(pred-x[(k+1):n])))
10    }

The heart of the code is line 7. There, we’re predicting day i+k (prediction to be stored in pred[i]) from the k days previous to it—that is, days i,...,i+k-1. Thus, we need to count the 1s among those days. Since we’re working with 0 and 1 data, the number of 1s is simply the sum of x[j] among those days, which we can conveniently obtain as follows:

sum(x[i:(i+(k-1))])

The use of sum() and vector indexing allows us to do this computation compactly, avoiding the need to write a loop, so it's simpler and faster. This is typical R.
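For instance, with a small made-up vector, the number of 1s among days 2 through 4 (that is, i = 2 and k = 3) is:

> x <- c(0,1,1,0,1)
> sum(x[2:(2+(3-1))])
[1] 2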

The same is true for this expression, on line 9:

mean(abs(pred-x[(k+1):n]))

Here, pred contains the predicted values, while x[(k+1):n] has the actual values for the days in question. Subtracting the second from the first gives us values of 0, 1, or −1. A 1 or −1 corresponds to a prediction error in one direction or the other: predicting 0 when the true value was 1, or vice versa. Taking absolute values with abs(), we have 0s and 1s, the latter corresponding to errors.

So we now know which days gave us errors. It remains to calculate the proportion of errors. We do this by applying mean(), where we are exploiting the mathematical fact that the mean of 0 and 1 data is the proportion of 1s. This is a common R trick.
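Here is a quick illustration of that computation, with made-up prediction and actual values:

> pred <- c(1,0,0,1)
> actual <- c(1,1,0,0)
> abs(pred-actual)
[1] 0 1 0 1
> mean(abs(pred-actual))
[1] 0.5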

The above coding of our preda() function is fairly straightforward, and it has the advantages of simplicity and compactness. However, it is probably slow. We could try to speed it up by vectorizing the loop, as discussed in Section 2.6, but that would not address the major obstacle to speed here, which is all of the duplicate computation that the code does. For successive values of i in the loop, sum() is being called on vectors that differ by only two elements. Except for cases in which k is very small, this could really slow things down.

So, let’s rewrite the code to take advantage of previous computation. In each iteration of the loop, we will update the previous sum we found, rather than compute the new sum from scratch.

1    predb <- function(x,k) {
2       n <- length(x)
3       k2 <- k/2
4       pred <- vector(length=n-k)
5       sm <- sum(x[1:k])
6       if (sm >= k2) pred[1] <- 1 else pred[1] <- 0
7       if (n-k >= 2) {
8          for (i in 2:(n-k)) {
9             sm <- sm + x[i+k-1] - x[i-1]
10            if (sm >= k2) pred[i] <- 1 else pred[i] <- 0
11         }
12      }
13      return(mean(abs(pred-x[(k+1):n])))
14    }

The key is line 9. Here, we are updating sm by subtracting the oldest element making up the sum (x[i-1]) and adding the new one (x[i+k-1]).
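To see the update in action, here is a small made-up example with k = 3:

> x <- c(1,0,1,1,0)
> sm <- sum(x[1:3])        # sum over the first window, days 1 through 3
> sm
[1] 2
> sm <- sm + x[4] - x[1]   # slide the window: drop day 1, add day 4
> sm
[1] 2
> sum(x[2:4])              # agrees with direct computation
[1] 2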

Yet another approach to this problem is to use the R function cumsum(), which forms cumulative sums from a vector. Here is an example:

> y <- c(5,2,-3,8)
> cumsum(y)
[1]  5  7  4 12

Here, the cumulative sums of y are 5 = 5, 5 + 2 = 7, 5 + 2 + (−3) = 4, and 5 + 2 + (−3) + 8 = 12, the values returned by cumsum().

The expression sum(x[i:(i+(k-1))]) in preda() suggests using differences of cumsum() instead:

predc <- function(x,k) {
   n <- length(x)
   k2 <- k/2
   # the vector pred will contain our predicted values
   pred <- vector(length=n-k)
   csx <- c(0,cumsum(x))
   for (i in 1:(n-k)) {
      if (csx[i+k] - csx[i] >= k2) pred[i] <- 1 else pred[i] <- 0
   }
   return(mean(abs(pred-x[(k+1):n])))
}

Instead of applying sum() to a window of k consecutive elements in x, like this:

sum(x[i:(i+(k-1))])

we compute that same sum by finding the difference between the cumulative sums at the end and beginning of that window, like this:

csx[i+k] - csx[i]

Note the prepending of a 0 in the vector of cumulative sums:

csx <- c(0,cumsum(x))

This is needed in order to handle the case i = 1 correctly.
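You can check this with a small example. With the 0 prepended, the sum of the window x[i],...,x[i+k-1] is csx[i+k] - csx[i] even when i = 1 (here k = 2):

> x <- c(1,0,1,1)
> csx <- c(0,cumsum(x))
> csx
[1] 0 1 1 2 3
> csx[1+2] - csx[1]   # equals sum(x[1:2])
[1] 1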

This approach in predc() requires just one operation per iteration of the loop (a subtraction), compared to two in predb() (an addition and a subtraction).
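With any of these versions in hand, the training-set procedure described earlier amounts to computing the error rate for each candidate value of k and keeping the winner. Here is a minimal sketch using predc(); the simulated 0/1 data and the range of k tried are arbitrary choices for illustration:

trainx <- sample(0:1,500,replace=TRUE)             # stand-in for 500 days of rain data
errs <- sapply(1:10,function(k) predc(trainx,k))   # error rate for each k
kbest <- which.min(errs)                           # the k with the lowest error rate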