R is a great language, but it can’t do everything well. Thus, it is sometimes desirable to call code written in other languages from R. Conversely, when working in other great languages, you may encounter tasks that could be better done in R.
R interfaces have been developed for a number of other languages, fromubiquitous languages like C to esoteric ones like the Yacas computeralgebra system. This chapter will cover two interfaces: one for callingC/C++ from R and the other for calling R from Python.
You may wish to write your own C/C++ functions to be called from R. Typically, the goal is performance enhancement, since C/C++ code may run much faster than R, even if you use vectorization and other R optimization techniques to speed things up.
Another possible goal in dropping down to the C/C++ level is specialized I/O. For example, R uses the TCP protocol in layer 3 of the standard Internet communication system, but UDP can be faster in some settings.
To work in UDP, you need C/C++, which requires an interface to R for those languages.
R actually offers two C/C++ interfaces via the functions .C()
and .Call()
. The latter is more versatile but requires some knowledge of R’s internal structures, so we’ll stick with .C()
here.
In C, two-dimensional arrays are stored in row-major order, in contrast to R’s column-major order. For instance, if you have a 3-by-4 array, the element in the second row and second column is element number 5 of the array when viewed linearly, since there are three elements in the first column and this is the second element in the second column. Also keep in mind that C subscripts begin at 0, rather than at 1, as with R.
All the arguments passed from R to C are received by C as pointers. Note that the C function itself must return void
. Values that you would ordinarily return must be communicated through the function’s arguments, such as result
in the following example.
Here, we will write C code to extract subdiagonals from a square matrix. (Thanks to my former graduate assistant, Min-Yu Huang, who wrote an earlier version of this function.) Here’s the code for the file sd.c:
#include <R.h> // required // arguments: // m: a square matrix // n: number of rows/columns of m // k: the subdiagonal index--0 for main diagonal, 1 for first // subdiagonal, 2 for the second, etc. // result: space for the requested subdiagonal, returned here void subdiag(double *m, int *n, int *k, double *result) { int nval = *n, kval = *k; int stride = nval + 1; for (int i = 0, j = kval; i < nval-kval; ++i, j+= stride) result[i] = m[j]; }
The variable stride
alludes to a concept from the parallel-processing community. Say we have a matrix in 1,000 columns and our C code is looping through all the elements in a given column, from top to bottom. Again, since C uses row-major order, consecutive elements in the column are 1,000 elements apart from each other if the matrix is viewed as one long vector.
Here, we would say that we are traversing that long vector with a stride of 1,000—that is, accessing every thousandth element.
You compile your code using R. For example, in a Linux terminal window, we could compile our file like this:
% R CMD SHLIB sd.c gcc -std=gnu99 -I/usr/share/R/include -fpic -g -O2 -c sd.c -o sd.o gcc -std=gnu99 -shared -o sd.so sd.o -L/usr/lib/R/lib -lR
This would produce the dynamic shared library file sd.so.
Note that R has reported how it invoked GCC in the output of the example. You can also run these commands by hand if you have special requirements, such as special libraries to be linked in. Also note that the locations of the include and lib directories may be system-dependent.
GCC is easily downloadable for Linux systems. For Windows, it is included in Cygwin, an open source package available from http://www.cygwin.com/
We can then load our library into R and call our C function like this:
> dyn.load("sd.so") > m <- rbind(1:5, 6:10, 11:15, 16:20, 21:25) > k <- 2 > .C("subdiag", as.double(m), as.integer(dim(m)[1]), as.integer(k), result=double(dim(m)[1]-k)) [[1]] [1] 1 6 11 16 21 2 7 12 17 22 3 8 13 18 23 4 9 14 19 24 5 10 15 20 25 [[2]] [1] 5 [[3]] [1] 2 $result [1] 11 17 23
For convenience here, we’ve given the name result
to both the formal argument (in the C code) and the actual argument (in the R code). Note that we needed to allocate space for result
in our R code.
As you can see from the example, the return value takes on the form of a list consisting of the arguments in the R call. In this case, the call had four arguments (in addition to the function name), so the returned list has four components. Typically, some of the arguments will be changed during execution of the C code, as was the case here with result
.
Chapter 13 discussed a number of tools and methods for debugging R code. However, the R/C interface presents an extra challenge. The problem in using a debugging tool such as GDB here is that you must first apply it to R itself.
The following is a walk-through of the R/C debugging steps using GDB on our previous sd.c code as the example.
$ R -d gdb GNU gdb 6.8-debian ... (gdb) run Starting program: /usr/lib/R/bin/exec/R ... > dyn.load("sd.so") > # hit ctrl-c here Program received signal SIGINT, Interrupt. 0xb7ffa430 in __kernel_vsyscall () (gdb) b subdiag Breakpoint 1 at 0xb77683f3: file sd.c, line 3. (gdb) continue Continuing. Breakpoint 1, subdiag (m=0x92b9480, n=0x9482328, k=0x9482348, result=0x9817148) at sd.c:3 3 int nval = *n, kval = *k; (gdb)
So, what happened in this debugging session?
We launched the debugger, GDB, with R loaded into it, from a command line in a terminal window:
R -d gdb
We told GDB to run R:
(gdb) run
We loaded our compiled C code into R as usual:
> dyn.load("sd.so")
We hit the ctrl-C interrupt key pair to pause R and put us back at the GDB prompt.
We set a breakpoint at the entry to subdiag()
:
(gdb) b subdiag
We told GDB to resume executing R (we needed to hit the enter key a second time in order to get the R prompt):
(gdb) continue
We then executed our C code:
< m >- rbind(1:5, 6:10, 11:15, 16:20, 21:25) < k >- 2 < .C("subdiag", as.double(m), as.integer(dim(m)[1]), as.integer(k), + result=double(dim(m)[1]-k)) Breakpoint 1, subdiag (m=0x942f270, n=0x96c3328, k=0x96c3348, result=0x9a58148) at subdiag.c:46 46 if (*n > 1) error("n > 1\n");
At this point, we can use GDB to debug as usual. If you’re not familiar with GDB, you may want to try one of the many quick tutorials on the Web. Table 15-1 lists some of the most useful commands.
Recall our example in Section 2.5.2 where we observed 0- and 1-valued data, one per time period, and attempted to predict the value in any period from the previous k
values, using majority rule. We developed two competing functions for the job, preda()
and predb()
, as follows:
# prediction in discrete time series; 0s and 1s; use k consecutive # observations to predict the next, using majority rule; calculate the # error rate preda <- function(x,k) { n <- length(x) k2 <- k/2 # the vector pred will contain our predicted values pred <- vector(length=n-k) for (i in 1:(n-k)) { if (sum(x[i:(i+(k-1))]) >= k2) pred[i] <- 1 else pred[i] <- 0 } return(mean(abs(pred-x[(k+1):n]))) } predb <- function(x,k) { n <- length(x) k2 <- k/2 pred <- vector(length=n-k) sm <- sum(x[1:k]) if (sm >= k2) pred[1] <- 1 else pred[1] <- 0 if (n-k >= 2) { for (i in 2:(n-k)) { sm <- sm + x[i+k-1] - x[i-1] if (sm >= k2) pred[i] <- 1 else pred[i] <- 0 } } return(mean(abs(pred-x[(k+1):n]))) }
Since the latter avoids duplicate computation, we speculated it would be faster. Now is the time to check that.
> y <- sample(0:1,100000,replace=T) > system.time(preda(y,1000)) user system elapsed 3.816 0.016 3.873 > system.time(predb(y,1000)) user system elapsed 1.392 0.008 1.427
Hey, not bad! That’s quite an improvement.
However, you should always ask whether R already has a fine-tuned function that will suit your needs. Since we’re basically computing a moving average, we might try the filter()
function, with a constant coefficient vector, as follows:
predc <- function(x,k) { n <- length(x) f <- filter(x,rep(1,k),sides=1)[k:(n-1)] k2 <- k/2 pred <- as.integer(f >= k2) return(mean(abs(pred-x[(k+1):n]))) }
That’s even more compact than our first version. But it’s a lot harder to read, and for reasons we will explore soon, it may not be so fast. Let’s check.
> system.time(predc(y,1000)) user system elapsed 3.872 0.016 3.945
Well, our second version remains the champion so far. This actually should be expected, as a look at the source code shows. Typing the following shows the source for that function:
> filter
This reveals (not shown here) that filter1()
is called. The latter is written in C, which should give us some speedup, but it still suffers from the duplicate computation problem—hence the slowness.
So, let’s write our own C code.
#include <R.h> void predd(int *x, int *n, int *k, double *errrate) { int nval = *n, kval = *k, nk = nval - kval, i; int sm = 0; // moving sum int errs = 0; // error count int pred; // predicted value double k2 = kval/2.0; // initialize by computing the initial window for (i = 0; i < kval; i++) sm += x[i]; if (sm >= k2) pred = 1; else pred = 0; errs = abs(pred-x[kval]); for (i = 1; i < nk; i++) { sm = sm + x[i+kval-1] - x[i-1]; if (sm >= k2) pred = 1; else pred = 0; errs += abs(pred-x[i+kval]); } *errrate = (double) errs / nk; }
This is basically predb()
from before, “hand translated” into C. Let’s see if it will outdo predb()
.
> system.time(.C("predd",as.integer(y),as.integer(length(y)),as.integer(1000), + errrate=double(1))) user system elapsed 0.004 0.000 0.003
You can see that writing certain functions in C can be worth the effort. This is especially true for functions that involve iteration, as R’s own iteration constructs, such as for()
, are slow.