Filtering

Another feature reflecting the functional language nature of R is filtering. This allows us to extract a vector’s elements that satisfy certain conditions. Filtering is one of the most common operations in R, as statistical analyses often focus on data that satisfies conditions of interest.

Generating Filtering Indices

Let’s start with a simple example:

> z <- c(5,2,-3,8)
> w <- z[z*z > 8]
> w
[1]  5 -3  8

Looking at this code in an intuitive, “What is our intent?” manner, we see that we asked R to extract from z all its elements whose squares were greater than 8 and then assign that subvector to w.

But filtering is such a key operation in R that it’s worthwhile to examine the technical details of how R achieves our intent above. Let’s look at it done piece by piece:

> z <- c(5,2,-3,8)
> z
[1]  5  2 -3  8
> z*z > 8
[1]  TRUE FALSE  TRUE  TRUE

Evaluation of the expression z*z > 8 gives us a vector of Boolean values! It’s very important that you understand exactly how this comes about.

First, in the expression z*z > 8, note that everything is a vector or vector operator:

Since z is a vector, that means z*z will also be a vector (of the same length as z).
Due to recycling, the number 8 (which R treats as a vector of length 1) becomes the vector (8,8,8,8) here.
The operator >, like +, is actually a function.

Let’s look at an example of that last point:

> ">"(2,1)
[1] TRUE
> ">"(2,5)
[1] FALSE

Thus, the following:

z*z > 8

is really this:

">"(z*z,8)

In other words, we are applying a function to vectors—yet another case of vectorization, no different from the others you’ve seen. And thus the result is a vector—in this case, a vector of Booleans. Then the resulting Boolean values are used to cull out the desired elements of z:

> z[c(TRUE,FALSE,TRUE,TRUE)]
[1]  5 -3  8
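
The three observations above can be checked directly. The following is a minimal sketch that spells out each step; the variable names (squares, recycled, implicit, as_call) are illustrative and not part of the original example.

```r
z <- c(5, 2, -3, 8)

# Step 1: element-wise multiplication gives a vector as long as z.
squares <- z * z                       # 25 4 9 64

# Step 2: the single 8 is recycled to length 4; writing the
# recycled vector out by hand gives the same comparison result.
implicit <- squares > 8
recycled <- squares > rep(8, length(z))

# Step 3: the operator > called in function form.
as_call <- ">"(squares, 8)

identical(implicit, recycled)          # TRUE
identical(implicit, as_call)           # TRUE

# Any of these logical vectors selects the same elements of z.
z[implicit]                            # 5 -3 8
```

Since all three forms produce the same logical vector, the short form z[z*z > 8] is simply the most convenient way to write it.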

This next example will place things into even sharper focus. Here, we will again define our extraction condition in terms of z, but then we will use the results to extract from another vector, y, instead of extracting from z:

> z <- c(5,2,-3,8)
> j <- z*z > 8
> j
[1]  TRUE FALSE  TRUE  TRUE
> y <- c(1,2,30,5)
> y[j]
[1]  1 30  5

Or, more compactly, we could write the following:

> z <- c(5,2,-3,8)
> y <- c(1,2,30,5)
> y[z*z > 8]
[1]  1 30  5

Again, the point is that in this example, we are using one vector, z, to determine indices to use in filtering another vector, y. In contrast, our earlier example used z to filter itself.
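
A typical use of this pattern is a pair of parallel vectors that describe the same subjects. The sketch below uses made-up data; the names age, subject, and adults are hypothetical and chosen only for illustration.

```r
# Hypothetical parallel vectors: element i of each describes subject i.
age     <- c(34, 17, 52, 15)
subject <- c("Ann", "Bob", "Cal", "Dee")

# The condition is computed from age ...
adults <- age >= 18                    # TRUE FALSE TRUE FALSE

# ... but applied as an index to subject.
subject[adults]                        # "Ann" "Cal"
```

This works as intended only when the two vectors have the same length and the same ordering. If the logical index is shorter than the vector being filtered, R recycles it, which is rarely what you want.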

Here’s another example, this one involving assignment. Say we have a vector x in which we wish to replace all elements larger than 3 with 0. We can do that very compactly, in just one line:

> x[x > 3] <- 0

Let’s check:

> x <- c(1,3,8,2,20)
> x[x > 3] <- 0
> x
[1] 1 3 0 2 0
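
To see what the one-line assignment is doing, it can help to compare it with an explicit loop. This is only a sketch for comparison; the names x_vec and x_loop are illustrative.

```r
x <- c(1, 3, 8, 2, 20)

# Vectorized form: the logical index picks the positions to overwrite.
x_vec <- x
x_vec[x_vec > 3] <- 0

# Equivalent explicit loop over the positions of x.
x_loop <- x
for (i in seq_along(x_loop)) {
   if (x_loop[i] > 3) x_loop[i] <- 0
}

x_vec                                  # 1 3 0 2 0
identical(x_vec, x_loop)               # TRUE
```

The single value 0 on the right side is recycled to fill every selected position, just as 8 was recycled in the comparison earlier.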

Filtering with the subset() Function

Filtering can also be done with the subset() function. When applied to vectors, the difference between using this function and ordinary filtering lies in the manner in which NA values are handled.

> x <- c(6,1:3,NA,12)
> x
[1]  6  1  2  3 NA 12
> x[x > 5]
[1]  6 NA 12
> subset(x,x > 5)
[1]  6 12

When we did ordinary filtering above, R basically said, “Well, x[5] is unknown, so it’s also unknown whether it is greater than 5.” But you may not want NAs in your results. When you wish to exclude NA values, using subset() saves you the trouble of removing the NA values yourself.
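
For comparison, here is a sketch of two ways to remove the NA values yourself with ordinary filtering. The variable names are illustrative; the point is only that each form gives the same result as subset() on this vector.

```r
x <- c(6, 1:3, NA, 12)

# The comparison itself yields NA wherever x is NA.
x > 5                                  # TRUE FALSE FALSE FALSE NA TRUE

by_subset <- subset(x, x > 5)

# Manual alternative 1: require the element to be non-missing as well.
by_isna <- x[!is.na(x) & x > 5]

# Manual alternative 2: which() reports only the TRUE positions,
# so the NA position is never selected.
by_which <- x[which(x > 5)]

by_subset                              # 6 12
identical(by_subset, by_isna)          # TRUE
identical(by_subset, by_which)         # TRUE
```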

The Selection Function which()

As you’ve seen, filtering consists of extracting elements of a vector z that satisfy a certain condition. In some cases, though, we may just want to find the positions within z at which the condition occurs. We can do this using which(), as follows:

> z <- c(5,2,-3,8)
> which(z*z > 8)
[1] 1 3 4

The result says that elements 1, 3, and 4 of z have squares greater than 8.

As with filtering, it is important to understand exactly what occurred in the preceding code. The expression

z*z > 8

is evaluated to (TRUE,FALSE,TRUE,TRUE). The which() function then simply reports which elements of the latter expression are TRUE.
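
The relationship between which() and ordinary filtering can be sketched as follows. The names cond and idx are illustrative, and the example assumes the condition contains no NA values.

```r
z <- c(5, 2, -3, 8)

cond <- z * z > 8                      # TRUE FALSE TRUE TRUE
idx  <- which(cond)                    # 1 3 4

# Indexing by position or by the logical vector selects the same elements.
z[idx]                                 # 5 -3 8
identical(z[idx], z[cond])             # TRUE

# which() behaves like filtering the position numbers by the condition.
identical(idx, seq_along(cond)[cond])  # TRUE
```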

One handy (though somewhat wasteful) use of which() is for determining the location within a vector at which the first occurrence of some condition holds. For example, recall our code in “Obtaining the Length of a Vector” to find the first 1 value within a vector x:

first1 <- function(x) {
   for (i in 1:length(x)) {
      if (x[i] == 1) break  # break out of loop
   }
   return(i)
}

Here is an alternative way of coding this task:

first1a <- function(x) return(which(x == 1)[1])

The call to which() yields the indices of the 1s in x. These indices will be given in the form of a vector, and we ask for element index 1 in that vector, which is the index of the first 1.

That is much more compact. On the other hand, it’s wasteful, as it finds every 1 in x when we need only the first. So, although a vectorized approach is often faster than a loop, this one may actually be slower than the loop when the first 1 comes early in x.
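
The two versions also differ when x contains no 1 at all, which is worth knowing before choosing between them. The sketch below repeats both definitions and adds a third variant, first1b, built on base R's match(); first1b is not from the text above and is shown only as one possible alternative.

```r
first1 <- function(x) {
   for (i in 1:length(x)) {
      if (x[i] == 1) break  # break out of loop
   }
   return(i)
}

first1a <- function(x) return(which(x == 1)[1])

# Hypothetical alternative: match() returns the position of the
# first match of its first argument in its second, or NA if none.
first1b <- function(x) match(1, x)

x <- c(0, 3, 1, 7, 1)
first1(x)                              # 3
first1a(x)                             # 3
first1b(x)                             # 3

# With no 1 present, the loop runs to the end and returns the last
# index, while the other two versions return NA.
no_ones <- c(0, 3, 7)
first1(no_ones)                        # 3
first1a(no_ones)                       # NA
first1b(no_ones)                       # NA
```

Returning NA makes the "not found" case distinguishable from a genuine position, so a caller of the loop version would need a separate check.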