As detailed in the introduction, R is an extremely versatile open source programming language for statistics and data science. It is widely used in every field where there is data—business, industry, government, medicine, academia, and so on.
In this chapter, you’ll get a quick introduction to R—how to invoke it, what it can do, and what files it uses. We’ll cover just enough to give you the basics you need to work through the examples in the next few chapters, where the details will be presented.
R may already be installed on your system if your employer or university has made it available to users. If not, see Appendix A for installation instructions.
R operates in two modes: interactive and batch. The one typically used is interactive mode. In this mode, you type in commands, R displays results, you type in more commands, and so on. On the other hand, batch mode does not require interaction with the user. It’s useful for production jobs, such as when a program must be run periodically, say once per day, because you can automate the process.
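A script can check which of the two modes it is running in. The following is a minimal sketch using base R's interactive() function, which returns TRUE in an interactive session and FALSE otherwise. The variable name mode_label is made up for this illustration.

```r
# interactive() is TRUE at the R console and FALSE under batch execution.
mode_label <- if (interactive()) "interactive" else "batch"
cat("R is running in", mode_label, "mode\n")
```

A check like this can be useful when the same file should prompt the user in one mode and run silently in the other.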
On a Linux or Mac system, start an R session by typing R on the command line in a terminal window. On a Windows machine, start R by clicking the R icon.
The result is a greeting and the R prompt, which is the > sign. The screen will look something like this:
R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
...
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
You can then execute R commands. The window in which all this appears is called the R console.
As a quick example, consider a standard normal distribution—that is, with mean 0 and variance 1. If a random variable X has that distribution, then its values are centered around 0, some negative, some positive, averaging in the end to 0. Now form a new random variable Y = |X|. Since we’ve taken the absolute value, the values of Y will not be centered around 0, and the mean of Y will be positive.
Let’s find the mean of Y. Our approach is to estimate it by simulating N(0,1) variates.
> mean(abs(rnorm(100)))
[1] 0.7194236
This code generates the 100 random variates, finds their absolute values, and then finds the mean of the absolute values.
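One way to sanity-check the simulation is to compare it with the known value of E|X| for a standard normal X, which is sqrt(2/pi), roughly 0.798. The sketch below assumes a larger sample size and a fixed seed. Both are choices made here for illustration, not part of the original example. They make the result reproducible and bring it close to the theoretical value.

```r
set.seed(1)                      # fixed seed, chosen arbitrarily, for reproducibility
n <- 100000                      # larger sample than in the text, to reduce noise
sim_mean <- mean(abs(rnorm(n)))  # simulated mean of Y = |X|
exact_mean <- sqrt(2 / pi)       # theoretical mean of |X| for X ~ N(0,1)
print(c(simulated = sim_mean, exact = exact_mean))
```

With only 100 variates, the standard error of the estimate is about 0.06, so a result such as 0.7194236 is consistent with the exact value.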
The [1] you see means that the first item in this line of output is item 1. In this case, our output consists of only one line (and one item), so this is redundant. This notation becomes helpful when you need to read voluminous output that consists of a lot of items spread over many lines. For example, if there were two rows of output with six items per row, the second row would be labeled [7].
> rnorm(10)
[1] -0.6427784 -1.0416696 -1.4020476 -0.6718250 -0.9590894 -0.8684650
[7] -0.5974668 0.6877001 1.3577618 -2.2794378
Here, there are 10 values in the output, and the label [7] in the second row lets you quickly see that 0.6877001, for instance, is the eighth output item.
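The bracketed labels are the same indices you use to extract elements from a vector. Here is a small sketch, using a hypothetical vector x of 30 evenly spaced values so that positions are easy to verify by eye:

```r
x <- seq(10, 300, by = 10)  # 30 values: 10, 20, ..., 300
print(x)                    # each output row starts with the index of its first item
x[8]                        # the eighth item, which is 80
```

How many items appear per row depends on the width of your console, so the row labels you see may differ from one machine to another.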
You can also store R commands in a file. By convention, R code files have the suffix .R or .r. If you create a code file called z.R, you can execute the contents of that file by issuing the following command:
> source("z.R")
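To try source() without creating a file by hand, you can first write a script to a temporary location. This is a sketch only: the temporary file stands in for z.R, and the names script, env, and y are made up for the illustration.

```r
script <- tempfile(fileext = ".R")                # stand-in for z.R
writeLines("y <- mean(abs(rnorm(100)))", script)  # a one-line script
env <- new.env()                                  # keeps the script's variables separate
source(script, local = env)                       # run every command in the file
get("y", envir = env)                             # the value computed by the script
```

Calling source() with only the file name, as in the command above, evaluates the script in your global workspace instead of a separate environment.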
Sometimes it’s convenient to automate R sessions. For example, you may wish to run an R script that generates a graph without needing to bother with manually launching R and executing the script yourself. Here you would run R in batch mode.
As an example, let’s put our graph-making code into a file named z.R with the following contents:
pdf("xh.pdf")  # set graphical output file
hist(rnorm(100))  # generate 100 N(0,1) variates and plot their histogram
dev.off()  # close the graphical output file
The items marked with # are comments. They’re ignored by the R interpreter. Comments serve as notes to remind us and others what the code is doing, in a human-readable format.
Here’s a step-by-step breakdown of what we’re doing in the preceding code:
We call the pdf() function to inform R that we want the graph we create to be saved in the PDF file xh.pdf.
We call rnorm() (for random normal) to generate 100 N(0,1) random variates.
We call hist() on those variates to draw a histogram of these values.
We call dev.off() to close the graphical “device” we are using, which is the file xh.pdf in this case. This is the mechanism that actually causes the file to be written to disk.
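The same steps can be tried at the console and the result checked from within R. The sketch below writes to a temporary directory rather than the current one. That choice, and the name out_file, are assumptions of this illustration.

```r
out_file <- file.path(tempdir(), "xh.pdf")  # temporary location for the PDF
pdf(out_file)                               # open the PDF graphics device
hist(rnorm(100))                            # draw the histogram onto that device
invisible(dev.off())                        # close the device, finishing the file
file.exists(out_file)                       # TRUE once the file is on disk
```

If dev.off() is omitted, the PDF may be incomplete or unreadable, since the device is still open.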
We could run this code automatically, without entering R’s interactive mode, by invoking R with an operating system shell command (such as at the $ prompt commonly used in Linux systems):
$ R CMD BATCH z.R
You can confirm that this worked by using your PDF viewer to display the saved histogram. (It will just be a plain-vanilla histogram, but R is capable of producing quite sophisticated variations.)
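For experimentation, the batch command can also be launched from within an R session using system2(). The following is a sketch under several assumptions: a Unix-like system where the R executable lives in the directory reported by R.home("bin"), a temporary working directory, and the added --vanilla option so that no workspace is loaded or saved. The names work_dir, script, and status are made up for the illustration.

```r
# Write the script from the text into a temporary directory.
work_dir <- tempfile("batch_demo")
dir.create(work_dir)
script <- file.path(work_dir, "z.R")
writeLines(c(
  'pdf("xh.pdf")',
  'hist(rnorm(100))',
  'dev.off()'
), script)

# Run the script in batch mode from inside work_dir.
old_dir <- setwd(work_dir)
r_binary <- file.path(R.home("bin"), "R")
status <- system2(r_binary, c("CMD", "BATCH", "--vanilla", "z.R", "z.Rout"))
setwd(old_dir)

status                # 0 indicates that the batch run finished without error
list.files(work_dir)  # files present after the run
```

Here the console output of the batch run is written to z.Rout, the output file named as the last argument.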