R is a scripting language for statistical data manipulation and analysis. It was inspired by, and is mostly compatible with, the statistical language S developed by AT&T. The name S, for statistics, was an allusion to another programming language with a one-letter name developed at AT&T: the famous C language. S was later sold to a small firm, which added a graphical user interface (GUI) and named the result S-Plus.
R has become more popular than S or S-Plus, both because it’s free and because more people are contributing to it. R is sometimes called GNU S, to reflect its open source nature. (The GNU Project is a major collection of open source software.)
As the Cantonese say, yauh peng, yauh leng, which means “both inexpensive and beautiful.” Why use anything else?
R is an open source implementation of the widely respected S statistical language, and the R/S platform is a de facto standard among professional statisticians.
It is comparable, and often superior, in power to commercial products in most of the significant senses—variety of operations available, programmability, graphics, and so on.
It is available for the Windows, Mac, and Linux operating systems.
In addition to providing statistical operations, R is a general-purpose programming language, so you can use it to automate analyses and create new functions that extend the existing language features.
It incorporates features found in object-oriented and functional programming languages.
The system saves data sets between sessions, so you don’t need to reload them each time. It saves your command history too.
Because R is open source software, it’s easy to get help from the user community. Also, a lot of new functions are contributed by users, many of whom are prominent statisticians.
I should warn you at the outset that you typically submit commands to R by typing in a terminal window, rather than clicking a mouse in a GUI, and most R users do not use a GUI. This doesn’t mean that R doesn’t do graphics. On the contrary, it includes tools for producing graphics of great utility and beauty, but they are used for system output, such as plots, not for user input.
If you can’t live without a GUI, you can use one of the free or open source GUIs that have been developed for R, such as the following:
RStudio, http://www.rstudio.org/
StatET, an Eclipse-based IDE for R
ESS (Emacs Speaks Statistics), http://ess.r-project.org/
R Commander: John Fox, “The R Commander: A Basic-Statistics Graphical Interface to R,” Journal of Statistical Software 14, no. 9 (2005):1–42.
JGR (Java GUI for R), http://cran.r-project.org/web/packages/JGR/index.html
The first three, RStudio, StatET and ESS, should be considered integrated development environments (IDEs), aimed more toward programming. StatET and ESS provide the R programmer with an IDE in the famous Eclipse and Emacs settings, respectively.
On the commercial side, another IDE is available from Revolution Analytics, an R service company (http://www.revolutionanalytics.com/).
Because R is a programming language rather than a collection of discrete commands, you can combine several commands, each using the output of the previous one. (Linux users will recognize the similarity to chaining shell commands using pipes.) The ability to combine R functions gives tremendous flexibility and, if used properly, is quite powerful. As a simple example, consider this (compound) command:
nrow(subset(x03,z == 1))
First, the subset() function takes the data frame x03 and extracts all records for which the variable z has the value 1. This results in a new frame, which is then fed to the nrow() function. This function counts the number of rows in a frame. The net effect is to report a count of z = 1 in the original frame.
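To make the compound command concrete, here is a minimal sketch. The contents of x03 below are invented for illustration, since the text does not show the actual data; only the names x03 and z come from the example above.

```r
# Hypothetical stand-in for the data frame x03; the values are made up.
x03 <- data.frame(
  id = 1:6,
  z  = c(1, 0, 1, 1, 0, 2)
)

# Step by step: subset() keeps only the rows where z equals 1 ...
z_is_one <- subset(x03, z == 1)

# ... and nrow() counts the rows of that intermediate frame.
count_stepwise <- nrow(z_is_one)

# The compound form does the same work without naming the intermediate frame.
count_compound <- nrow(subset(x03, z == 1))

print(count_stepwise)
print(count_compound)
```

Both forms give the same count. The compound version simply feeds the result of subset() straight into nrow(), much as a shell pipe feeds one command's output into the next.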
The terms object-oriented programming and functional programming were mentioned earlier. These topics pique the interest of computer scientists, and though they may be somewhat foreign to most other readers, they are relevant to anyone who uses R for statistical programming. The following sections provide an overview of both topics.
The advantages of object orientation can be explained by example. Consider statistical regression. When you perform a regression analysis with other statistical packages, such as SAS or SPSS, you get a mountain of output on the screen. By contrast, if you call the lm() regression function in R, the function returns an object containing all the results: the estimated coefficients, their standard errors, residuals, and so on. You then pick and choose, programmatically, which parts of that object to extract.
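As a rough sketch of what "pick and choose" can look like, the following fits a regression and then extracts a few pieces of the result. The built-in cars data set and the variable names fit, est, se, and res are illustrative choices, not something taken from the text.

```r
# Fit a simple linear regression on the built-in cars data set
# (stopping distance as a function of speed). The data set is only
# an illustrative choice.
fit <- lm(dist ~ speed, data = cars)

# Nothing has been printed yet: the results live inside the object fit.
print(class(fit))

# Extract only the pieces of interest.
est <- coef(fit)                                   # estimated coefficients
se  <- summary(fit)$coefficients[, "Std. Error"]   # their standard errors
res <- residuals(fit)                              # one residual per observation

print(est)
print(se)
print(head(res))
```

Because the results are held in an object rather than dumped to the screen, later code can use est or res directly, for example to feed the residuals into another computation.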
You will see that R’s approach makes programming much easier, partly because it offers a certain uniformity of access to data. This uniformity stems from the fact that R is polymorphic, which means that a single function can be applied to different types of inputs, which the function processes in the appropriate way. Such a function is called a generic function. (If you are a C++ programmer, you have seen a similar concept in virtual functions.)
For instance, consider the plot() function. If you apply it to a list of numbers, you get a simple plot. But if you apply it to the output of a regression analysis, you get a set of plots representing various aspects of the analysis. Indeed, you can use the plot() function on just about any object produced by R. This is nice, since it means that you, as a user, have fewer commands to remember!
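The following short sketch shows the same plot() call applied to two different kinds of object. The vector nums and the use of the built-in cars data set are assumptions made for illustration.

```r
# Two very different kinds of object.
nums <- c(3, 1, 4, 1, 5, 9, 2, 6)
fit  <- lm(dist ~ speed, data = cars)

print(class(nums))  # "numeric"
print(class(fit))   # "lm"

# Same function name, different behavior depending on the class
# of the argument.
plot(nums)            # simple plot of the values against their index
plot(fit, which = 1)  # one regression diagnostic: residuals vs. fitted values
```

Here the which argument restricts the regression plot to a single diagnostic. Calling plot(fit) without it steps through several diagnostic plots in turn.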
As is typical in functional programming languages, a common theme in R programming is avoidance of explicit iteration. Instead of coding loops, you exploit R’s functional features, which let you express iterative behavior implicitly. This can lead to code that executes much more efficiently, and it can make a huge timing difference when running R on large data sets.
As you will see, the functional programming nature of the R language offers many advantages:
Clearer, more compact code
Potentially much faster execution speed
Less debugging, because the code is simpler
Easier transition to parallel programming
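As a small illustration of avoiding explicit iteration, the sketch below computes the same result three ways. The vector x and the variable names are made up for the example, and any speed difference will depend on the size of the data.

```r
x <- c(2, 7, 1, 8, 2, 8)

# Explicit iteration: build the result one element at a time.
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Implicit iteration: the arithmetic operator works on the whole vector.
squares_vec <- x^2

# Implicit iteration with a function applied to each element.
squares_sapply <- sapply(x, function(v) v^2)

# Count the elements above a threshold without writing a loop.
n_large <- sum(x > 5)

print(squares_vec)
print(n_large)
```

All three approaches produce the same squares. The vectorized form x^2 is typically the one that runs fastest on large inputs, while sapply() mainly buys compactness and clarity.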