Functional Programming and Memory Issues

Most R objects are immutable, or unchangeable. Thus, an R operation that appears to modify an object is implemented as a function whose result is reassigned to that object, a trait that can have performance implications.

As an example of some of the issues that can arise, consider this simple-looking statement:

z[3] <- 8

As noted in Chapter 7, this assignment is more complex than it seems. It is actually implemented via the replacement function "[<-" through this call and assignment:

z <- "[<-"(z,3,value=8)

An internal copy of z is made, element 3 of the copy is changed to 8, and then the resulting vector is reassigned to z. Recall that the latter simply means that z now points to the copy.
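The equivalence of the two forms can be checked directly. The following is a minimal sketch; the variable names z and w and the starting values are chosen only for illustration.

```r
z <- c(5, 12, 13)
w <- z

# Ordinary-looking element assignment.
z[3] <- 8

# The same operation written as an explicit call to the replacement
# function, with the result reassigned to the variable.
w <- `[<-`(w, 3, value = 8)

print(identical(z, w))
```

Both forms leave the variable holding the same vector, which is what the replacement-function description predicts.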

In other words, even though we are ostensibly changing just one element of the vector, the semantics say that the entire vector is recomputed. For a long vector, this would slow down the program considerably. The same would be true for a shorter vector if it were assigned from within a loop of our code.

In some situations, R does take some measures to mitigate this impact, but it is a key point to consider when aiming for fast code. You should be mindful of this when working with vectors (including arrays). If your code seems to be running unexpectedly slowly, assignment of vectors should be a prime area of suspicion.
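As a small illustration of the loop concern, the sketch below sets the same elements two ways: one call to "[<-" per loop iteration, and a single vectorized call. The vector length and the index set idx are arbitrary choices made for this example, and whether R actually copies in each iteration depends on the version and the situation, so no timings are claimed here.

```r
z <- numeric(1000)
idx <- seq(1, 1000, by = 10)   # illustrative choice of elements to change

# Element at a time: one call to "[<-" per iteration.
z_loop <- z
for (i in idx) z_loop[i] <- 8

# Vectorized: a single call to "[<-" sets all chosen elements at once.
z_vec <- z
z_vec[idx] <- 8

print(identical(z_loop, z_vec))
```

The two results are the same, but the vectorized form gives R only one assignment to carry out, so any copying cost would be paid at most once.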

A related issue is that R (usually) follows a copy-on-change policy. For instance, if we execute the following in the previous setting:

> y <- z

then initially y shares the same memory area with z. But if either of them changes, then a copy is made in a different area of memory, and the changed variable will occupy the new area of memory. However, only the first change incurs this cost, since once the changed variable has been relocated, the two no longer share memory. The function tracemem() will report such memory relocations.
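A minimal sketch of this behavior follows. It assumes nothing beyond base R, but tracemem() works only in R builds with memory profiling enabled, so the calls are guarded with capabilities("profmem"). The values used are arbitrary.

```r
z <- c(5, 12, 13)
y <- z            # y and z initially share one memory area

# tracemem() needs an R build with memory profiling; skip it otherwise.
if (capabilities("profmem")) tracemem(z)

y[1] <- 100       # first change to y: this is where a copy would be reported
y[2] <- 200       # later changes to y need not copy again

if (capabilities("profmem")) untracemem(z)

print(z)          # z is unaffected by the changes to y
print(y)
```

Whatever tracemem() reports on a given build, z keeps its original values while y holds the changed ones, which is the visible consequence of the copy.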

Though R usually adheres to copy-on-change semantics, there are exceptions. For example, R doesn’t exhibit the location-change behavior in the following setting:

> z <- runif(10)
> tracemem(z)
[1] "<0x88c3258>"
> z[3] <- 8
> tracemem(z)
[1] "<0x88c3258>"

The location of z didn’t change; it was at memory address 0x88c3258 both before and after the assignment to z[3] was executed. Thus, although you should be vigilant about location change, you also can’t assume it.

Let’s look at the times involved.

> z <- 1:10000000
> system.time(z[3] <- 8)
   user  system elapsed
  0.180   0.084   0.265
> system.time(z[33] <- 88)
   user  system elapsed
      0       0       0
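The timings alone do not say why the first assignment was the slow one. One factor that can be checked directly is the storage type: 1:10000000 is an integer vector, while the constant 8 is a double, so the first assignment has to produce a vector of a different type. The sketch below uses a much shorter vector for convenience, and the variable w is introduced only for comparison.

```r
z <- 1:10
type_before <- typeof(z)   # the colon operator yields an integer vector here

z[3] <- 8                  # 8 is a double constant
type_after <- typeof(z)    # the whole vector has been converted

w <- 1:10
w[3] <- 8L                 # 8L is an integer constant
type_w <- typeof(w)        # no conversion is needed in this case

print(c(type_before, type_after, type_w))
```

This shows only that a type conversion occurs on the first assignment; it is offered as a likely contributor to the timing gap, not as a full account of it.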

In any event, if copying is done, the vehicle is R’s internal function duplicate(). (The function is called duplicate1() in recent versions of R.) If you’re familiar with the GDB debugging tool and your R build includes debugging information, you can explore the circumstances under which a copy is performed.

Following the guide in Section 15.1.4, start R under GDB, step through R within GDB, and place a breakpoint at duplicate1(). Each time you break at that function, submit this GDB command:

call Rf_PrintValue(s)

This will print the value of s (or whatever variable is of interest).

The following example, though artificial, demonstrates the memory-copy issues discussed above.

Suppose we have a large number of unrelated vectors and, among other things, we wish to set the third element of each to 8. We could store the vectors in a matrix, one vector per row. But since they are unrelated, maybe even of different lengths, we may consider storing them in a list.

But things can get very subtle when it comes to R performance issues, so let’s try it out.

> m <- 5000
> n <- 1000
> z <- list()
> for (i in 1:m) z[[i]] <- sample(1:10,n,replace=TRUE)
> system.time(for (i in 1:m) z[[i]][3] <- 8)
   user  system elapsed
  0.288   0.024   0.321
> z <- matrix(sample(1:10,m*n,replace=TRUE),nrow=m)
> system.time(z[,3] <- 8)
   user  system elapsed
  0.008   0.044   0.052

Except for system time (again), the matrix formulation did better. One of the reasons is that in the list version, we encounter the memory-copy problem in each iteration of the loop. But in the matrix version, we encounter it only once. And, of course, the matrix version is vectorized.

But what about using lapply() on the list version?

> set3 <- function(lv) {
+    lv[3] <- 8
+    return(lv)
+ }
> z <- list()
> for (i in 1:m) z[[i]] <- sample(1:10,n,replace=TRUE)
> system.time(lapply(z,set3))
   user  system elapsed
  0.100   0.012   0.112

The lapply() version beat the explicit loop, but it was still slower than the matrix version. It’s hard to beat vectorized code.
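Note that the timed lapply() call above does not store its result, so z itself is left unchanged. The sketch below, on deliberately small data, assigns the result and checks that the three approaches agree. The names z_list, z_mat, loop_list, and lapply_list, the sizes, and the seed are all choices made for this illustration, not part of the original session.

```r
set.seed(1)                       # fixed seed so the run is repeatable
m <- 50
n <- 10
z_list <- lapply(1:m, function(i) sample(1:10, n, replace = TRUE))
z_mat <- do.call(rbind, z_list)   # the same data, one vector per row

# Approach 1: explicit loop over the list.
loop_list <- z_list
for (i in 1:m) loop_list[[i]][3] <- 8

# Approach 2: lapply(); the result must be assigned to be kept.
set3 <- function(lv) {
   lv[3] <- 8
   return(lv)
}
lapply_list <- lapply(z_list, set3)

# Approach 3: one vectorized assignment on the matrix.
z_mat[, 3] <- 8

print(identical(loop_list, lapply_list))
print(identical(do.call(rbind, loop_list), z_mat))
```

With the results confirmed to match, the choice among the three comes down to the timing considerations discussed above.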