One of the most common uses of R is simulation. Let’s see what kinds of tools R has available for this application.
As mentioned, R has functions to generate variates from a number of different distributions. For example, rbinom()
generates binomial or Bernoulli random variates.[3]
Let’s say we want to find the probability of getting at least four heads out of five tosses of a coin (easy to find analytically, but a handy example). Here’s how we can do this:
> x <- rbinom(100000,5,0.5) > mean(x >= 4) [1] 0.18829
First, we generate 100,000 variates from a binomial distribution with five trials and a success probability of 0.5. We then determine which of them has a value 4 or 5, resulting in a Boolean vector of the same length as x
. The TRUE
and FALSE
values in that vector are treated as 1s and 0s by mean()
, giving us our estimated probability (since the average of a bunch of 1s and 0s is the proportion of 1s).
Other functions include rnorm()
for the normal distribution, rexp()
for the exponential, runif()
for the uniform, rgamma()
for the gamma, rpois()
for the Poisson, and so on.
Here is another simple example, which finds E[max(X,Y)], the expected value of the maximum of independent N(0,1) random variables X and Y:
sum <- 0 nreps <- 100000 for (i in 1:nreps) { xy <- rnorm(2) # generate 2 N(0,1)s sum <- sum + max(xy) } print(sum/nreps)
We generated 100,000 pairs, found the maximum for each, and averaged those maxima to obtain our estimated expected value.
The preceding code, with an explicit loop, may be clearer, but as before, if we are willing to use some more memory, we can do this more compactly.
> emax function(nreps) { x <- rnorm(2*nreps) maxxy <- pmax(x[1:nreps],x[(nreps+1):(2*nreps)]) return(mean(maxxy)) }
Here, we generated double nreps
values. The first nreps
value simulates X, and the remaining nreps
value represents Y. The pmax()
call then computes the pair-wise maxima that we need. Again, note the contrast here between max()
and pmax()
, the latter producing pair-wise maxima.
According to the R documentation, all random-number generators use 32-bit integers for seed values. Thus, other than round-off error, the same initial seed should generate the same stream of numbers.
By default, R will generate a different random number stream from run to run of a program. If you want the same stream each time—important in debugging, for instance—call set.seed()
, like this:
> set.seed(8888) # or your favorite number as an argument
Consider the following probability problem:
Three committees, of sizes 3, 4 and 5, are chosen from 20 people. What is the probability that persons A and B are chosen for the same committee?
This problem is not hard to solve analytically, but we may wish to check our solution using simulation, and in any case, writing the code will demonstrate how R’s set operations can come in handy in combinatorial settings. Here is the code:
1 sim <- function(nreps) { 2 commdata <- list() # will store all our info about the 3 committees 3 commdata$countabsamecomm <- 0 4 for (rep in 1:nreps) { 5 commdata$whosleft <- 1:20 # who's left to choose from 6 commdata$numabchosen <- 0 # number among A, B chosen so far 7 # choose committee 1, and check for A,B serving together 8 commdata <- choosecomm(commdata,5) 9 # if A or B already chosen, no need to look at the other comms. 10 if (commdata$numabchosen > 0) next 11 # choose committee 2 and check 12 commdata <- choosecomm(commdata,4) 13 if (commdata$numabchosen > 0) next 14 # choose committee 3 and check 15 commdata <- choosecomm(commdata,3) 16 } 17 print(commdata$countabsamecomm/nreps) 18 } 19 20 choosecomm <- function(comdat,comsize) { 21 # choose committee 22 committee <- sample(comdat$whosleft,comsize) 23 # count how many of A and B were chosen 24 comdat$numabchosen <- length(intersect(1:2,committee)) 25 if (comdat$numabchosen == 2) 26 comdat$countabsamecomm <- comdat$countabsamecomm + 1 27 # delete chosen committee from the set of people we now have to choose from 28 comdat$whosleft <- setdiff(comdat$whosleft,committee) 29 return(comdat) 30 }
We number the potential committee members from 1 to 20, with persons A and B having ID 1 and 2. Recalling that R lists are often used to store several related variables in one basket, we se up a list comdat
. Its components include the following:
comdat$whosleft
: We simulate the random selection of the committees by randomly choosing from this vector. Each time we choose a committee, we remove the committee members’ IDs. It is initialized to 1:20, indicating that no one has been selected yet.
comdat$numabchosen
: This is a count of how many among the people A and B have been chosen so far. If we choose a committee and find this to be positive, we can skip choosing the remaining committees for the following reason: If this number is 2, we know definitely that A and B are on the same committee; if it is 1, we know definitely that A and B are not on the same committee.
comdat$countabsamecomm
: Here, we store a count of the number of times A and B are on the same committee.
Since committee selection involves subsets, it’s not surprising that a couple of R’s set operations—intersect()
and setdiff()
—come in handy here. Note, too, the use of R’s next
statement, which tells R to skip the rest of this iteration of the loop.
[3] A sequence of independent 0- and 1- valued random variables with the same probability of 1 for each is called Bernoulli.