As you’ve seen, using parallel R may greatly speed up your R code. This allows you to retain the convenience and expressive power of R while still reducing the long runtimes of large applications. If the parallelized R gives you sufficiently good performance, then all is well.
Nevertheless, parallel R is still R and thus still subject to the performance issues covered in Chapter 14. Recall that one solution offered in that chapter was to write a performance-critical portion of your code in C and then call that code from your main R program. (The references to C here mean C or C++.) We will explore this from a parallel-processing viewpoint. Here, instead of writing parallel R, we write ordinary R code that calls parallel C. (I assume a knowledge of C.)
The C code covered here runs only on multicore systems, so we must discuss the nature of such systems.
You are probably familiar with dual-core machines. Any computer includes a CPU, which is the part that actually runs your program. In essence, a dual-core machine has two CPUs, a quad-core system has four, and so on. With multiple cores, you can do parallel computation!
This parallel computation is done with threads, which are analogous to snow’s workers. In computationally intensive applications, you generally set up as many threads as there are cores, for example, two threads on a dual-core machine. Ideally, these threads run simultaneously, though overhead issues do arise, as will be explained when we look at general performance issues in Section 16.4.1.
If your machine has multiple cores, it is structured as a shared-memory system. All cores access the same RAM. The shared nature of the memory makes communication between the cores easy to program. If a thread writes to a memory location, the change is visible to the other threads, without the programmer needing to insert code to make that happen.
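To make the shared-memory idea concrete, here is a minimal sketch (not taken from the examples in this chapter; the array ids and the printing are purely illustrative). Each thread writes its ID into one slot of a shared array, and after the parallel block those writes are visible to the original thread, with no explicit communication code.

#include <stdio.h>
#include <omp.h>

int main(void)
{  int ids[64];  // shared by all threads (assumes at most 64 of them)
   int nth = 0;  // also shared
   #pragma omp parallel
   {  int me = omp_get_thread_num();  // private to this thread
      ids[me] = 100 + me;             // write to shared memory
      if (me == 0) nth = omp_get_num_threads();
   }
   // back in the original thread, every slot written above is visible
   for (int i = 0; i < nth; i++)
      printf("slot %d holds %d\n", i, ids[i]);
   return 0;
}

Compile it with the -fopenmp flag, just as with the larger example that follows.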
OpenMP is a very popular package for programming on multicore machines. To see how it works, here is the mutual outlinks example again, this time in R-callable OpenMP code:
 1   #include <omp.h>
 2   #include <R.h>
 3
 4   int tot;  // grand total of matches, over all threads
 5
 6   // processes row pairs (i,i+1), (i,i+2), ...
 7   int procpairs(int i, int *m, int n)
 8   {  int j,k,sum=0;
 9      for (j = i+1; j < n; j++) {
10         for (k = 0; k < n; k++)
11            // find m[i][k]*m[j][k] but remember R uses col-major order
12            sum += m[n*k+i] * m[n*k+j];
13      }
14      return sum;
15   }
16
17   void mutlinks(int *m, int *n, double *mlmean)
18   {  int nval = *n;
19      tot = 0;
20      #pragma omp parallel
21      {  int i,mysum=0,
22            me = omp_get_thread_num(),
23            nth = omp_get_num_threads();
24         // in checking all (i,j) pairs, partition the work according to i;
25         // this thread me will handle all i that equal me mod nth
26         for (i = me; i < nval; i += nth) {
27            mysum += procpairs(i,m,nval);
28         }
29         #pragma omp atomic
30         tot += mysum;
31      }
32      int divisor = nval * (nval-1) / 2;
33      *mlmean = ((float) tot)/divisor;
34   }
Again, compilation follows the recipe in Chapter 15. We do need to link in the OpenMP library, though, by using the -fopenmp and -lgomp options. Suppose our source file is romp.c. Then we use the following commands to compile and link the code:
gcc -std=gnu99 -fopenmp -I/usr/share/R/include -fpic -g -O2 -c romp.c -o romp.o
gcc -std=gnu99 -shared -o romp.so romp.o -L/usr/lib/R/lib -lR -lgomp
Here’s an R test:
> dyn.load("romp.so")
> Sys.setenv(OMP_NUM_THREADS=4)
> n <- 1000
> m <- matrix(sample(0:1,n^2,replace=T),nrow=n)
> system.time(z <- .C("mutlinks",as.integer(m),as.integer(n),result=double(1)))
   user  system elapsed
  0.830   0.000   0.218
> z$result
[1] 249.9471
The typical way to specify the number of threads in OpenMP is through an operating system environment variable, OMP_NUM_THREADS. R is capable of setting operating system environment variables with the Sys.setenv() function. Here, I set the number of threads to 4, because I was running on a quad-core machine.
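If you prefer to fix the thread count on the C side instead, OpenMP also provides the library call omp_set_num_threads(). The following fragment is only a sketch of that alternative (the function name is made up); the example above relies on the environment variable.

#include <omp.h>

// request 4 threads for subsequent parallel regions, overriding
// whatever OMP_NUM_THREADS happens to be set to
void usefourthreads(void)
{  omp_set_num_threads(4);
}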
Note the runtime: only 0.2 seconds! This compares to the 5.0-second time we saw earlier for a 12-node snow system. This might be surprising to some readers, as our code in the snow version was vectorized to a fair degree, as mentioned earlier. Vectorizing is good, but again, R has many hidden sources of overhead, so C might do even better.
I tried R’s new byte-compilation function cmpfun(), but mtl() actually became slower.
Thus, if you are willing to write part of your code in parallel C, dramatic speedups may be possible.
OpenMP code is C, with the addition of pragmas that instruct the compiler to insert some library code to perform OpenMP operations. Look at line 20, for instance. When execution reaches this point, the threads will be activated. Each thread then executes the block that follows—lines 21 through 31—in parallel.
A key point is variable scope. All the variables within the block starting on line 21 are local to their specific threads. For example, we’ve named the total variable in line 21 mysum because each thread will maintain its own sum. By contrast, the global variable tot on line 4 is held in common by all the threads. Each thread makes its contribution to that grand total on line 30.
But even the variable nval on line 18 is held in common by all the threads (during the execution of mutlinks()), as it is declared outside the block beginning on line 21. So, even though it is a local variable in terms of C scope, it is global to all the threads. Indeed, we could have declared tot on that line, too. It needs to be shared by all the threads, but since it’s not used outside mutlinks(), it could have been declared on line 18.
Line 29 contains another pragma, atomic. This one applies only to the single line following it (line 30, in this case), rather than to a whole block. The purpose of the atomic pragma is to avoid what is called a race condition in parallel-processing circles. This term describes a situation in which two threads are updating a variable at the same time, which may produce incorrect results. The atomic pragma ensures that line 30 will be executed by only one thread at a time. Note that this implies that in this section of the code, our parallel program becomes temporarily serial, which is a potential source of slowdown.
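To see why the pragma is needed, consider the following sketch (the functions racysum() and safesum() are made up for illustration and are not part of the mutlinks() code). A statement such as tot += mysum is really a read, an add, and a write; without synchronization, two threads can both read the old value of tot, each add its own mysum, and write back, so one contribution is silently lost.

#include <omp.h>

int racysum(int *x, int n)  // deliberately wrong
{  int tot = 0;
   #pragma omp parallel
   {  int mysum = 0,
         me = omp_get_thread_num(),
         nth = omp_get_num_threads();
      for (int i = me; i < n; i += nth)
         mysum += x[i];
      tot += mysum;  // BUG: unsynchronized update of a shared variable
   }
   return tot;  // may come up short
}

int safesum(int *x, int n)
{  int tot = 0;
   #pragma omp parallel
   {  int mysum = 0,
         me = omp_get_thread_num(),
         nth = omp_get_num_threads();
      for (int i = me; i < n; i += nth)
         mysum += x[i];
      #pragma omp atomic
      tot += mysum;  // one thread at a time; no update is lost
   }
   return tot;
}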
Where is the manager’s role in all of this? Actually, the manager is the original thread, and it executes lines 18 and 19, as well as .C(), the R function that makes the call to mutlinks(). When the worker threads are activated in line 21, the manager goes dormant. The worker threads become dormant once they finish line 31. At that point, the manager resumes execution. Due to the dormancy of the manager while the workers are executing, we do want to have as many workers as our machine has cores.
The function procpairs() is straightforward, but note the manner in which the matrix m is accessed. Recall from the discussion in Chapter 15 on interfacing R to C that the two languages store matrices differently: column by column in R and row by row in C. We need to be aware of that difference here. In addition, we have treated the matrix m as a one-dimensional array, as is common in parallel C code. In other words, if n is, say, 4, then we treat m as a vector of 16 elements. Due to the column-major nature of R matrix storage, the vector consists first of the four elements of column 1, then the four of column 2, and so on. To further complicate matters, we must keep in mind that array indices in C start at 0, instead of starting at 1 as in R.
Putting all of this together yields the multiplication in line 12. The factors here are the (k,i) and (k,j) elements of the version of m in the C code, which are the (i+1,k+1) and (j+1,k+1) elements back in the R code.
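As a quick check of that index arithmetic, here is a small standalone sketch (the values are made up for illustration). It fills the array the way R lays out a 4-by-4 matrix, column by column, and then shows that m[n*k+i] with i = 1 and k = 2 really is R’s element in row 2, column 3.

#include <stdio.h>

int main(void)
{  int n = 4, m[16];
   // fill column by column, encoding R's (row,column) in the value
   for (int k = 0; k < n; k++)      // 0-based column index
      for (int i = 0; i < n; i++)   // 0-based row index
         m[n*k+i] = 10*(i+1) + (k+1);
   printf("%d\n", m[n*2+1]);  // prints 23, i.e. R's m[2,3]
   return 0;
}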
OpenMP includes a wide variety of possible operations—far too many to list here. This section provides an overview of some OpenMP pragmas that I consider especially useful.
The parallel-processing term barrier refers to a line of code at which the threads rendezvous. The syntax for the omp barrier pragma is simple:
#pragma omp barrier
When a thread reaches a barrier, its execution is suspended until all other threads have reached that line. This is very useful for iterative algorithms; threads wait at a barrier at the end of every iteration.
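Here is a sketch of that pattern (the smoothing update, the arrays xold and xnew, and the function name are all made up for illustration). Each thread updates its share of xnew from xold, and the barriers keep any thread from racing ahead to the next iteration while others are still reading or writing the current one.

#include <omp.h>

void smooth(double *xold, double *xnew, int n, int niters)
{  #pragma omp parallel
   {  int me = omp_get_thread_num(),
         nth = omp_get_num_threads();
      for (int t = 0; t < niters; t++) {
         for (int i = me; i < n; i += nth) {
            double left  = (i > 0)   ? xold[i-1] : xold[i];
            double right = (i < n-1) ? xold[i+1] : xold[i];
            xnew[i] = (left + xold[i] + right) / 3.0;
         }
         #pragma omp barrier  // all reads of xold must finish before anyone overwrites it
         for (int i = me; i < n; i += nth)
            xold[i] = xnew[i];
         #pragma omp barrier  // the copy must finish before the next iteration reads xold
      }
   }
}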
Note that in addition to this explicit barrier invocation, some other pragmas place an implicit barrier following their blocks. These include single and parallel. There is an implied barrier immediately following line 31 in the previous listing, for example, which is why the manager stays dormant until all worker threads finish.
The block that follows the omp critical pragma is a critical section, meaning one in which only one thread is allowed to execute at a time. The omp critical pragma essentially serves the same purpose as the atomic pragma discussed earlier, except that the latter is limited to a single statement.
The OpenMP designers defined a special pragma for this single-statement situation in the hope that the compiler can translate this to an especially fast machine instruction.
Here is the omp critical syntax:
#pragma omp critical
{
   // place one or more statements here
}
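As an illustration (the running-maximum example here is made up and not from the listing above), suppose each thread needs to update two shared variables together, the current maximum and its index. That is more than the single-statement atomic pragma can protect, so a critical section is the right tool:

#include <omp.h>

void trackmax(double *x, int n, double *maxval, int *maxidx)
{  *maxval = x[0];
   *maxidx = 0;
   #pragma omp parallel
   {  int me = omp_get_thread_num(),
         nth = omp_get_num_threads();
      double mymax = x[0];
      int myidx = 0;
      for (int i = me; i < n; i += nth)
         if (x[i] > mymax) { mymax = x[i]; myidx = i; }
      #pragma omp critical
      {  // only one thread at a time runs this block, so the value
         // and the index always stay consistent with each other
         if (mymax > *maxval) {
            *maxval = mymax;
            *maxidx = myidx;
         }
      }
   }
}

Each thread first finds the maximum over its own chunk, so the serialization is limited to one short merge step per thread.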
The block that follows the omp single pragma is to be executed by only one of the threads. Here is the syntax:
#pragma omp single
{
   // place one or more statements here
}
This is useful for initializing sum variables that are shared by the threads, for instance. As noted earlier, an automatic barrier is placed after the block. This should make sense to you. If one thread is initializing a sum, you wouldn’t want other threads that make use of this variable to continue execution until the sum has been properly set.
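For instance (this small sketch is made up for illustration and is not part of the earlier listing), one thread can zero out a shared sum, and the implied barrier after the single block keeps the others from adding to it too early:

#include <omp.h>

void sumall(double *x, int n, double *overallsum)
{  #pragma omp parallel
   {  int me = omp_get_thread_num(),
         nth = omp_get_num_threads();
      #pragma omp single
      *overallsum = 0.0;  // one thread initializes; the rest wait at the implied barrier
      double mysum = 0.0;
      for (int i = me; i < n; i += nth)
         mysum += x[i];
      #pragma omp atomic
      *overallsum += mysum;
   }
}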
You can learn more about OpenMP in my open source textbook on parallel processing at http://heather.cs.ucdavis.edu/parprocbook.
Another type of shared-memory parallel hardware consists of graphics processing units (GPUs). If you have a sophisticated graphics card in your machine, say for playing games, you may not realize that it is also a very powerful computational device—so powerful that the slogan “A supercomputer on your desk!” is often used to refer to PCs equipped with high-end GPUs.
As with OpenMP, the idea here is that instead of writing parallel R, you write R code interfaced to parallel C. (Similar to the OpenMP case, C here means a slightly augmented version of the C language.) The technical details become rather complex, so I won’t show any code examples, but an overview of the platform is worthwhile.
As mentioned, GPUs do follow the shared-memory/threads model, but on a much larger scale. They have dozens, or even hundreds, of cores (depending on how you define core). One major difference is that several threads can be run together in a block, which can produce certain efficiencies.
Programs that access GPUs begin their run on your machine’s CPU, referred to as the host. They then start code running on the GPU, or device. This means that your data must be transferred from the host to the device, and after the device finishes its computation, the results must be transferred back to the host.
As of this writing, GPU programming has not yet become common among R users. The most common usage is probably through the CRAN package gputools, which consists of some matrix algebra and statistical routines callable from R. For instance, consider matrix inversion. R provides the function solve() for this, but a parallel alternative is available in gputools with the name gpuSolve().
For more about GPU programming, again see my book on parallel processing at http://heather.cs.ucdavis.edu/parprocbook.