This section discusses some issues that you may find generally useful in parallelizing R applications. I’ll present some material on the main sources of overhead and then discuss a couple of algorithmic issues.
Having at least a rough idea of the physical causes of overhead is essential to successful parallel programming. Let’s take a look at these in the contexts of the two main platforms, shared-memory and networked computers.
As noted earlier, the memory sharing in multicore machines makes for easier programming. However, the sharing also produces overhead, since the two cores will bump into each other if they both try to access memory at the same time. This means that one of them will need to wait, causing overhead. That overhead is typically in the range of hundreds of nanoseconds (billionths of seconds). This sounds really small, but keep in mind that the CPU is working at a subnanosecond speed, so memory access often becomes a bottleneck.
Each core may also have a cache, in which it keeps a local copy of some of the shared memory. It’s intended to reduce contention for memory among the cores, but it produces its own overhead, involving time spent in keeping the caches consistent with each other.
Recall that GPUs are special types of multicore machines. As such, they suffer from the problems I’ve described, and more. First, the latency, which is the time delay before the first bit arrives at the GPU from its memory after a memory read request, is quite long in GPUs.
There is also the overhead incurred in transferring data between the host and the device. The latency here is on the order of microseconds (millionths of seconds), an eternity compared to the nanosecond scale of the CPU and GPU.
GPUs have great performance potential for certain classes of applications, but overhead can be a major issue. The authors of gputools note that their matrix operations start achieving a speedup only at matrix sizes of 1000 by 1000. I wrote a GPU version of our mutual outlinks application, which turned out to have a runtime of 3.0 seconds, about half of the snow version but still far slower than the OpenMP implementation.
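To get a feel for where that crossover point lies on your own hardware, you might time a matrix multiply on the CPU and on the GPU. Here is a minimal sketch, assuming the gputools package is installed and a CUDA-capable GPU is available; the 1000-by-1000 size is just an illustrative choice echoing the authors' observation:

library(gputools)
n <- 1000
a <- matrix(runif(n^2), nrow = n)
b <- matrix(runif(n^2), nrow = n)
system.time(a %*% b)           # ordinary CPU matrix multiply
system.time(gpuMatMult(a, b))  # GPU version; the time includes host-to-device transfer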
Again, there are ways of ameliorating these problems, but they require very careful, creative programming and a sophisticated knowledge of the physical GPU structure.
As you saw earlier, another way to achieve parallel computation is through networked systems of computers. You still have multiple CPUs, but in this case, they are in entirely separate computers, each with its own memory.
As pointed out earlier, network data transfer causes overhead. Its latency is again on the order of microseconds. Thus, even accessing a small amount of data across the network incurs a major delay.
Also note that snow has additional overhead, as it changes numeric objects such as vectors and matrices to character form before sending them, say from the manager to the workers. Not only does this entail time for the conversion (both in changing from numeric to character form and in changing back to numeric at the receiver), but the character form tends to make for much longer messages, thus longer network transfer time.
Shared-memory systems can be networked together, which, in fact, is what we did in the previous example. We had a hybrid situation in which we formed snow clusters from several networked dual-core computers.
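As a rough sketch of how such a hybrid cluster might be formed, one can hand snow's makeCluster() a vector of machine names, repeating each name once per core to be used on that machine; the hostnames below are hypothetical:

library(snow)
# two dual-core machines, two workers on each (hypothetical hostnames)
cl <- makeCluster(c("pc28", "pc28", "pc29", "pc29"), type = "SOCK")
clusterCall(cl, function() Sys.info()["nodename"])  # check where the workers are running
stopCluster(cl)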
It’s no shame to be poor, but it’s no great honor either.
--Tevye, Fiddler on the Roof
Man is the only animal that blushes, or needs to.
--Mark Twain
The term embarrassingly parallel is heard often in talk about parallel R (and in the parallel processing field in general). The word embarrassing alludes to the fact that the problems are so easy to parallelize that there is no intellectual challenge involved; they are embarrassingly easy.
Both of the example applications we’ve looked at here would be considered embarrassingly parallel. Parallelizing the for i loop for the mutual outlinks problem in Section 16.1 was pretty obvious. Partitioning the work in the KMC example in Section 16.2.4 was also natural and easy.
By contrast, most parallel sorting algorithms require a great deal of interaction. For instance, consider merge sort, a common method of sorting numbers. It breaks the vector to be sorted into two (or more) independent parts, say the left half and right half, which are then sorted in parallel by two processes. So far, this is embarrassingly parallel, at least after the vector is divided in half. But then the two sorted halves must be merged to produce the sorted version of the original vector, and that process is not embarrassingly parallel. It can be parallelized but in a more complex manner.
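To make the distinction concrete, here is a rough two-worker sketch in R: sorting the two halves is the embarrassingly parallel part, while the merge at the end is inherently serial. It uses mclapply() from the parallel package, so it assumes a Unix-family system, and the function name parsort2 is made up for this illustration:

library(parallel)
parsort2 <- function(x) {
   n <- length(x)
   halves <- list(x[1:(n %/% 2)], x[(n %/% 2 + 1):n])
   sorted <- mclapply(halves, sort, mc.cores = 2)  # embarrassingly parallel part
   a <- sorted[[1]]; b <- sorted[[2]]
   out <- numeric(n)
   i <- j <- k <- 1
   while (i <= length(a) && j <= length(b)) {  # the serial merge step
      if (a[i] <= b[j]) {
         out[k] <- a[i]; i <- i + 1
      } else {
         out[k] <- b[j]; j <- j + 1
      }
      k <- k + 1
   }
   if (i <= length(a)) out[k:n] <- a[i:length(a)]
   if (j <= length(b)) out[k:n] <- b[j:length(b)]
   out
}
parsort2(rnorm(10))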
Of course, to paraphrase Tevye, it’s no shame to have an embarrassingly parallel problem! It may not exactly be an honor, but it is a cause for celebration, as it is easy to program. More important, embarrassingly parallel problems tend to have low communication overhead, which is crucial to performance, as discussed earlier. In fact, when most people refer to embarrassingly parallel applications, they have this low overhead in mind.
But what about nonembarrassingly parallel applications? Unfortunately, parallel R code is simply not suitable for many of them for a very basic reason: the functional programming nature of R. As discussed in Section 14.3, a statement like this:
x[3] <- 8
is deceptively simple, because it can cause the entire vector x to be rewritten. This really compounds communication traffic problems. Accordingly, if your application is not embarrassingly parallel, your best strategy is probably to write the computationally intensive parts of the code in C, say using OpenMP or GPU programming.
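A quick way to see this copying behavior for yourself is with tracemem(), which reports when R duplicates an object. This is just a small illustrative experiment; the exact conditions under which R copies can vary somewhat by R version:

x <- runif(1000000)
y <- x          # a second reference to the same vector
tracemem(x)     # ask R to report any copies of x
x[3] <- 8       # modifying one element forces a copy of the entire vector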
Also, note carefully that even being embarrassingly parallel does not make an algorithm efficient. Some such algorithms can still have significant communication traffic, thus compromising performance.
Consider the KMC problem, run under snow. Suppose we were to set up a large enough number of workers so that each worker had relatively little work to do. In that case, the communication with the manager after each iteration would become a significant portion of the run time. In this situation, we would say that the granularity is too fine, and we would probably then switch to using fewer workers. We would then have larger tasks for each worker, thus a coarser granularity.
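In snow terms, coarsening the granularity amounts to sending each worker one big chunk of work rather than many tiny tasks. Here is a minimal sketch of the idea, using a trivial stand-in computation; clusterSplit() does the chunking for us:

library(snow)
cl <- makeCluster(rep("localhost", 2), type = "SOCK")
n <- 10000
# fine granularity: one task, and one round of communication, per value
# res <- clusterApply(cl, 1:n, function(i) i^2)
# coarser granularity: one chunk of values per worker
chunks <- clusterSplit(cl, 1:n)
res <- clusterApply(cl, chunks, function(chunk) sapply(chunk, function(i) i^2))
stopCluster(cl)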
Look again at the loop beginning on line 26 of our OpenMP example, reproduced here for convenience:
for (i = me; i < nval; i += nth) {
   mysum += procpairs(i,m,nval);
}
The variable me here was the thread number, so the effect of this code was that the various threads would work on nonoverlapping sets of values of i. We do want the values to be nonoverlapping, to avoid duplicate work and an incorrect count of the total number of links, so the code was fine. But the point now is that we were, in effect, preassigning the tasks that each thread would handle. This is called static assignment.
An alternative approach is to revise the for loop to look something like this:
int nexti = 0;  // global variable
...
for ( ; myi < n; ) {  // revised "for" loop
   #pragma omp critical
   {
      nexti += 1;
      myi = nexti;
   }
   if (myi < n) {
      mysum += procpairs(myi,m,nval);
      ...
   }
}
...
This is dynamic task assignment, in which it is not determined ahead of time which threads handle which values of i. Task assignment is done during execution. At first glance, dynamic assignment seems to have the potential for better performance. Suppose, for instance, that in a static assignment setting, one thread finishes its last value of i early, while another thread still has two values of i left to do. This would mean our program would finish somewhat later than it could. In parallel-processing parlance, we would have a load balance problem. With dynamic assignment, the thread that finished when there were two values of i left to handle could have taken up one of those values itself. We would have better balance and theoretically less overall runtime.
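The same contrast exists at the R level. In snow, clusterApply() hands tasks to the workers in a fixed round-robin order, while clusterApplyLB() gives each new task to whichever worker becomes free first. A rough sketch, using randomly varying sleep times as a stand-in for uneven work:

library(snow)
cl <- makeCluster(rep("localhost", 2), type = "SOCK")
worktimes <- runif(8, 0, 0.5)                       # simulated, uneven task durations
system.time(clusterApply(cl, worktimes, Sys.sleep))    # static-style assignment
system.time(clusterApplyLB(cl, worktimes, Sys.sleep))  # dynamic (load-balancing) assignment
stopCluster(cl)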
But don’t jump to conclusions. As always, we have the overhead issue to reckon with. Recall that a critical pragma, used in the dynamic version of the code above, has the effect of temporarily rendering the program serial rather than parallel, thus causing a slowdown. In addition, for reasons too technical to discuss here, these pragmas may cause considerable cache activity overhead. So in the end, the dynamic code could actually be substantially slower than the static version.
Various solutions to this problem have been developed, such as an OpenMP construct named guided. But rather than present these, the point I wish to make is that they are unnecessary. In most situations, static assignment is just fine. Why is this the case?
You may recall that the standard deviation of the sum of independent, identically distributed random variables, divided by the mean of that sum, goes to zero as the number of terms goes to infinity. In other words, sums are approximately constant. This has a direct implication for our load-balancing concerns: Since the total work time for a thread in static assignment is the sum of its individual task times, that total work time will be approximately constant; there will be very little variation from thread to thread. Thus, they will all finish at pretty close to the same time, and we do not need to worry about load imbalance. Dynamic scheduling will not be necessary.
This reasoning does depend on a statistical assumption, but in practice, the assumption will typically be met sufficiently well for the outcome: Static scheduling does as well as dynamic in terms of uniformity of total work times across threads. And since static scheduling doesn’t have the overhead problems of the dynamic kind, in most cases the static approach will give better performance.
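A quick simulation illustrates the point. Here each thread's total work time is modeled as a sum of independent exponential task times (an arbitrary distribution chosen just for illustration); the ratio of standard deviation to mean shrinks as the number of tasks per thread grows:

set.seed(1)
for (ntasks in c(10, 100, 10000)) {
   tot <- replicate(1000, sum(rexp(ntasks)))     # 1000 simulated per-thread work totals
   cat(ntasks, "tasks: sd/mean =", sd(tot) / mean(tot), "\n")
}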
There is one more aspect of this to discuss. To illustrate the issue, consider again the mutual outlinks example. Let’s review the outline of the algorithm:
sum = 0
for i = 0...n-1
   for j = i+1...n-1
      for k = 0...n-1
         sum = sum + a[i][k]*a[j][k]
mean = sum / (n*(n-1)/2)
Say n is 10000 and we have four threads, and consider ways to partition the for i loop. Naively, we might at first decide to have thread 0 handle the i values 0 through 2499, thread 1 handle 2500 through 4999, and so on. However, this would produce a severe load imbalance, since the thread that handles a given value of i does an amount of work proportional to n-i. That, in fact, is why we staggered the values of i in our actual code: Thread 0 handled the i values 0, 4, 8, ..., thread 1 worked on 1, 5, 9, ..., and so on, yielding good load balance.
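A small numeric check makes the imbalance vivid. Treating the work for a given i as proportional to n-i, we can compare the total work each thread would get under the naive contiguous split and under the staggered split:

n <- 10000; nth <- 4
work <- n - (0:(n-1))                                     # work for i = 0, 1, ..., n-1
contig  <- split(work, rep(0:(nth-1), each = n/nth))      # thread 0 gets i = 0..2499, etc.
stagger <- split(work, rep(0:(nth-1), length.out = n))    # thread 0 gets i = 0, 4, 8, ...
sapply(contig, sum)    # wildly unequal totals
sapply(stagger, sum)   # nearly equal totals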
The point then is that static assignment might require a bit more planning. One general approach to this is to randomly assign tasks (i values, in our case here) to threads, still doing so at the outset, before work begins. With a bit of forethought such as this, static assignment should work well in most applications.
As discussed earlier, it’s difficult to attain good performance from non-embarrassingly parallel algorithms. Fortunately, for statistical applications, there is a way to turn nonembarrassingly parallel problems into embarrassingly parallel ones. The key is to exploit some statistical properties.
To demonstrate the method, let’s once again turn to our mutual outlinks problem. The method, applied with w workers on a links matrix m, consists of the following:
Break the rows of m into w chunks.
Have each worker find the mean number of mutual outlinks for pairs of vertices in its chunk.
Average the results returned by the workers.
It can be shown mathematically that for large problems (the only ones you would need parallel computing for anyway), this chunked approach gives estimators of the same statistical accuracy as the nonchunked method. But meanwhile, we’ve turned a nonparallel problem into not just a parallel one but an embarrassingly parallel one! The workers in the preceding outline compute entirely independently of each other.
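Here is a rough snow sketch of the procedure, using a random test matrix in place of the real links matrix; the helper name mutoutchunk() is made up for this illustration:

library(snow)
mutoutchunk <- function(m) {
   # mean number of mutual outlinks over all pairs of rows of m
   n <- nrow(m); tot <- 0
   for (i in 1:(n-1))
      for (j in (i+1):n)
         tot <- tot + sum(m[i,] * m[j,])
   tot / (n * (n-1) / 2)
}
cl <- makeCluster(rep("localhost", 4), type = "SOCK")
n <- 1000
m <- matrix(sample(0:1, n^2, replace = TRUE), nrow = n)
clusterExport(cl, c("mutoutchunk", "m"))
chunks <- clusterSplit(cl, 1:n)                        # break the rows into w chunks
res <- clusterApply(cl, chunks, function(idx) mutoutchunk(m[idx, ]))
mean(unlist(res))                                      # average the workers' results
stopCluster(cl)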
This method should not be confused with the usual chunk-based approaches in parallel processing. In those, such as the merge-sort example discussed in Embarrassingly Parallel Applications and Those That Aren’t, the chunking is embarrassingly parallel, but the combining of results is not. By contrast, here the combining of results consists of simple averaging, thanks to the mathematical theory.
I tried this approach on the mutual outlinks problem with a 4-worker snow cluster. This reduced the runtime to 1.5 seconds, far better than the serial time of about 16 seconds, double the speedup obtained by the GPU, and approaching comparability with the OpenMP time. And the theory showing that the two methods give the same statistical accuracy was confirmed as well: The chunked method found the mean number of mutual outlinks to be 249.2881, compared to 249.2993 for the original estimator.