This book aims to provide a broad introduction to the R statistical computing environment (R Development Core Team, 2009a) in the context of applied regression analysis, which is typically studied by social scientists and others in a second course in applied statistics. We assume that the reader is learning or is otherwise familiar with the statistical methods that we describe; thus, this book is a companion to a text or course on modern applied regression, such as, but not necessarily, our own Applied Regression Analysis and Generalized Linear Models, second edition (Fox, 2008) and Applied Linear Regression, third edition (Weisberg, 2005). Of course, different texts and courses have somewhat different content, and if you encounter a topic that is unfamiliar or that is not of interest, feel free to skip it or to pass over it lightly. With a caveat concerning the continuity of examples within chapters, the book is designed to let you skip around and study only the sections you need.
The availability of cheap, powerful, and convenient computing has revolutionized the practice of statistical data analysis, as it has revolutionized other aspects of our society. Once upon a time, but well within living memory, data analysis was typically performed by statistical packages running on mainframe computers. The primary input medium was the punchcard, large data sets were stored on magnetic tapes, and printed output was produced by line printers; data were in rectangular case-by-variable format. The job of the software was to combine instructions for data analysis with a data set to produce a printed report. Computing jobs were submitted in batch mode, rather than interactively, and a substantial amount of time (hours, or even days) elapsed between the submission of a job and its completion.
Eventually, batch-oriented computers were superseded by interactive, timeshared, terminal-based computing systems and then successively by personal computers and workstations, networks of computers, and the Internet—and perhaps in a few years by cloud computing. But some statistical software still in use traces its heritage to the days of the card reader and line printer. Statistical packages, such as SAS and IBM SPSS,1 have acquired a variety of accoutrements, including programming capabilities, but they are still principally oriented toward combining instructions with rectangular data sets to produce printed output.2
The package model of statistical computing can work well in the application of standard methods of analysis, but we believe that it has several serious drawbacks, both for students and for practitioners of statistics. In particular, we think that the package approach of submitting code and getting output to decipher is not a desirable pedagogical model for learning new statistical ideas, and students can have difficulty separating the ideas of data analysis that are under study from the generally rigid implementation of those ideas in a statistical package. We prefer that students learn to request specific output, examine the result, and then modify the analysis or seek additional output. This feedback loop of intermediate examination forces students to think about what is being computed and why it is useful. Once statistical techniques are mastered, the data-analytic process of inserting the intelligence of the user in the middle of an analysis becomes second nature—in our view, a very desirable outcome.
The origins of R are in the S programming language, which was developed at Bell Labs by experts in statistical computing, including John Chambers, Richard Becker, and Allan Wilks (see, e.g., Becker et al., 1988, Preface). Like most good software, S has evolved considerably since its origins in the mid-1970s. Although Bell Labs originally distributed S directly, it is now available only as the commercial product S-PLUS. R is an independent, open-source, and free implementation of the S language, developed by an international team of statisticians, including John Chambers. As described in Ihaka and Gentleman (1996), what evolved into the R Project for Statistical Computing was originated by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. A key advantage of the R system is that it is free—simply download and install it, as we will describe shortly, and then use it.
R is a statistical computing environment and includes an interpreter, with which the user-programmer can interact in a conversational manner.3 R is one of several statistical programming environments; others include Gauss, Stata, and Lisp-Stat (which are described, e.g., in Stine and Fox, 1996).
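As a rough illustration of this conversational style, the sketch below fits a model, requests one specific piece of output, and then modifies the analysis in light of it. The cars data set (distributed with R) and the quadratic term are illustrative choices of ours, not examples taken from the text.

```r
# Fit a simple regression to the cars data set that ships with R
fit <- lm(dist ~ speed, data = cars)
coef(fit)                      # request one specific piece of output

# Having examined the result, modify the model and compare the two fits
fit2 <- update(fit, . ~ . + I(speed^2))
anova(fit, fit2)
```

Each command returns immediately, so the analyst can decide what to ask for next on the basis of what has just been seen.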
If you can master the art of typing commands, a good statistical programming environment allows you to have your cake and eat it too. Routine data analysis is convenient, but so are programming and the incorporation of new statistical methods. We believe that R balances these factors especially well:
One of the great strengths of R is that it allows users and experts in particular areas of statistics to add new capabilities to the software. Not only is it possible to write new programs in R, but it is also convenient to combine related sets of programs, data, and documentation in R packages. The previous edition of this book, published in 2002, touted the then “more than 100 contributed packages available on the R website, many of them prepared by experts in various areas of applied statistics, such as resampling methods, mixed models, and survival analysis” (p. xii). The Comprehensive R Archive Network (abbreviated CRAN and variously pronounced see-ran or kran) now holds more than 2,500 packages (see Figure 1, drawn, of course, with R); other R package archives (most notably the archive of the Bioconductor project, which develops software for bioinformatics) add several hundred more packages to the total. In the statistical literature, new methods are often accompanied by implementations in R; indeed, R has become a kind of lingua franca of statistical computing, at least among statisticians, although interest in R is also robust in other areas, including the social and behavioral sciences.
New in the Second Edition
Readers familiar with the first edition of this book will immediately notice two key changes. First, and most significant, there are now two authors, the first edition having been written by John Fox alone. Second, “S-PLUS” is missing from the title of the book (originally An R and S-PLUS Companion to Applied Regression), which now describes only R. In the decade since the first edition of the book was written, the open-source, free R has completely eclipsed its commercial cousin, S-PLUS. Moreover, where R and S-PLUS differ, we believe that the advantage generally goes to R. Although most of the contents of this second edition are applicable to S-PLUS as well as to R, we see little reason to discuss S-PLUS explicitly.
We have added a variety of new material—for example, with respect to transformations and effects plots—and in addition, virtually all the text has been rewritten. We have taken pains to make the book as self-contained as possible, providing the information that a new user needs to get started. Many topics, such as R graphics (in Chapter 7) and R programming (in Chapter 8), have been considerably expanded in the second edition.
Figure 1 The number of packages on CRAN grew roughly exponentially from 2001, when reliable data first became available, through 2009.
Source: Fox (2009).
The book has a companion R package called car, and we have substantially added to, extended, and revised the functions in the car package to make them more consistent, easier to use, and, we hope, more useful. The new car package includes several functions inherited from the alr3 package designed to accompany Weisberg (2005). The alr3 package still exists, but it now contains mostly data.
Obtaining and Installing R
We assume that you’re working on a single-user computer on which R has not yet been installed and for which you have administrator privileges to install software. To state the obvious, before you can start using R, you have to get it and install it. The good news is that R is free and runs under all commonly available computer operating systems—Windows, Mac OS X, and Linux and Unix—and that precompiled binary distributions of R are available for these systems. There is no bad news—at least not yet. It is our expectation that most readers of the book will use either the Windows or the Mac OS X implementations of R, and the presentation in the text reflects that assumption. Virtually everything in the text, however, applies equally to Linux and Unix systems, although the details of installing R vary across specific Linux distributions and Unix systems.
The best way to obtain R is by downloading it over the Internet from CRAN, at http://cran.r-project.org/ . It is faster, and better netiquette, to download R from one of the many CRAN mirror sites than from the main CRAN site: Click on the “Mirrors” link near the top left of the CRAN home page and select a mirror near you.
Warning: The following instructions are current as of version 2.11.0 of R. Some of the details may change, so check for updates on the website for this book, and also consult the instructions on the CRAN site.
INSTALLING R ON A WINDOWS SYSTEM
Click on the “Windows” link in the “Download and Install R” section near the top of the CRAN home page. Then click on the “base” link on the “R for Windows” page. We recommend that you install the latest “patched build” of the current version of R; the patched release incorporates fixes to known bugs, usually small. Click on the link in “Patches to this release are incorporated in the r-patched snapshot build” and then on “Download R-x.y.z Patched build for Windows” to download the R Windows installer. “R-x.y.z” is the current version of R—for example, R-2.11.0.
R installs as a standard Windows application. We suggest that you take all the defaults in the installation, with one exception: We recommend that you select the single-document interface (SDI) in preference to the default multiple-document interface (MDI). In the former, various R windows float freely on the desktop, while in the latter they are contained within a master window.
Once R is installed, you can start it as you would any Windows application, for example, by double-clicking on its desktop icon.
Whenever you start R, a number of files are automatically read and their contents executed. The start-up process provides the user with the ability to customize the program to meet particular needs or tastes, and as you gain experience with R, you may wish to customize the program in this way. On a single-user Windows system, probably the easiest route to customization is to edit the Rprofile.site file located in R’s etc subdirectory. The possibilities for customization are nearly endless, but here are two useful steps, both of which assume that you have an active Internet connection:
# set a CRAN mirror
# local({r <- getOption("repos")
# r["CRAN"] <- "http://my.local.cran"
# options(repos=r)})
You must then replace the dummy site http://my.local.cran with a link to a real mirror site, such as http://probability.ca/cran for the CRAN mirror at the University of Toronto. This is, of course, just an example: You should pick a mirror site near you.
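Assuming the four commented lines above are the template as it ships in Rprofile.site, the edited version, with the comment characters removed and the Toronto mirror substituted, might look like the following sketch. The mirror URL is the one named in the text; substitute one near you.

```r
# set a CRAN mirror (uncommented, with a real mirror in place of the dummy site)
local({
  r <- getOption("repos")                  # current repository settings
  r["CRAN"] <- "http://probability.ca/cran"
  options(repos = r)                       # store the modified settings
})

# confirm the setting that later install.packages() calls will use
getOption("repos")["CRAN"]
```

Wrapping the assignment in local() keeps the temporary variable r from lingering in the workspace of every session.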
utils::update.packages(ask=FALSE)
A disadvantage of the last change is that starting up R will take a bit longer. If you find the wait annoying, you can always remove this line from your Rprofile.site file.
Edit the Rprofile.site file with a plain-text (ASCII) editor, such as Windows Notepad; if you use a word-processing program, such as Word, make sure to save the file as plain text.
You can also customize certain aspects of the R graphical user interface via the Edit → GUI preferences menu.
INSTALLING R ON A MAC OS X SYSTEM
Click on the “Mac OS X” link in the “Download and Install R” section near the top of the CRAN home page. Click on the “R-x.y.z.pkg (latest version)” link on the “R for Mac OS X” page to download the R Mac OS X installer. “R-x.y.z,” as mentioned earlier, is the current version of R—for example, R-2.11.0.
R installs as a standard Mac OS X application. Just double-click on the downloaded installer package, and follow the on-screen directions. Once R is installed, you can treat it as you would any Mac OS X application. For example, you can put the R.app program (or, on a 64-bit system, the R64.app program) on the Mac OS X Dock, from which it can conveniently be launched.
R is highly configurable under Mac OS X, but some of the details differ from the Windows details. The possibilities for customization are nearly endless. Here are the same two customizations that we suggested for Windows users:
utils::update.packages(ask=FALSE)
INSTALLING AND USING THE CAR PACKAGE
Most of the examples in this book require the car package, which is not part of the standard R installation. The car package is available on CRAN. It “depends” on some other packages and “suggests” still others; the packages on which it depends will automatically be installed along with the car package.
Although both the Windows and the Mac OS X versions of R have menus for installing packages, the following command entered at the R command prompt will install the car package and all the other packages that it requires (i.e., both depends on and suggests):
> install.packages("car", dependencies=TRUE)
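If you expect to rerun a script on several machines, you may prefer to install a package only when it is missing. The following is a minimal sketch; install_if_missing is a hypothetical helper name of ours, not a function in R or in the car package.

```r
# Install a package only if it is not already available.
# Returns (invisibly) TRUE if an installation was attempted, FALSE otherwise.
install_if_missing <- function(pkg) {
  # requireNamespace() reports whether the package can be loaded,
  # without attaching it to the search path
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))
  }
  install.packages(pkg, dependencies = TRUE)
  invisible(TRUE)
}

# stats is distributed with R, so nothing is downloaded here;
# install_if_missing("car") would be the call for this book
install_if_missing("stats")
```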
Installing a package does not make it available for use in a particular R session. When R starts up, it automatically loads a set of standard packages that are part of the R distribution. To access the programs and data in another package, you must first load the package using the library command:4
> library(car)
This command also loads all the packages on which the car package depends. If you want to use still other packages, you need to enter a separate library command for each. The process of loading packages as you need them will come naturally as you grow more familiar with R. You can also arrange to load packages automatically at the beginning of every R session by adding a pair of commands such as the following to your R profile:
pkgs <- getOption("defaultPackages")
options(defaultPackages = c(pkgs, "car", "alr3"))
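The difference between an installed package and a loaded one can be checked directly. The sketch below uses the splines package, chosen only because it is distributed with R but is typically not attached at start-up; the same commands would apply to car once it is installed.

```r
# Libraries are the directories in which installed packages reside
.libPaths()

# Is the package on the search path before and after the library command?
was_attached <- "package:splines" %in% search()
library(splines)
now_attached <- "package:splines" %in% search()
c(before = was_attached, after = now_attached)
```

The search() function lists the packages currently attached to the session, which is a quick way to see what a library command has added.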
The Website for the R Companion
There is a website for this book at http://socserv.socsci.mcmaster.ca/jfox/Books/Companion/ .5 If you are currently using R and are connected to the Internet, the carWeb command will open the website for the book in your browser:
> library(car)
> carWeb()
The website for the book includes the following materials:
All these can be accessed using the carWeb function; after loading the car package in R, type help(carWeb) for details.
Using This Book
This book is intended primarily as a companion for use with another textbook that covers linear and generalized linear models. For details on the statistical methods, particularly in Chapters 3 to 6, you will need to consult the regression textbook that you are using. To help you with this task, we provide sections of complementary readings, including references to Fox (2008) and Weisberg (2005).
While the R Companion is not intended as a comprehensive users’ manual for R, we anticipate that most students learning regression methods and researchers already familiar with regression but interested in learning to use R will find this book sufficiently thorough for their needs.6 Various features of R are introduced as they are needed, primarily in the context of detailed, worked-through examples. If you want to locate information about a particular feature, however, consult the index of functions and operators, or the subject index, at the end of the book; there is also an index of the data sets used in the text.
Occasionally, more demanding material (e.g., requiring a knowledge of matrix algebra or calculus) is marked with an asterisk; this material may be skipped without loss of continuity, as may the footnotes.7
Most readers will want to try out the examples in the text. You should therefore install R and the car package associated with this book before you start to work through the book. As you duplicate the examples in the text, feel free to innovate, experimenting with R commands that do not appear in the examples. Examples are often reused within a chapter, and so later examples in a chapter can depend on earlier ones in the same chapter; packages used in a chapter are loaded only once. The examples in different chapters are independent of each other; however, think of the R code in each chapter as pertaining to a separate R session.
Here are brief chapter synopses:
With the possible exception of starred material, Chapters 1 and 2 contain general information that should be of interest to all readers. Chapters 3 to 6 cover material that will be contained in most regression courses. The material in Chapters 7 and 8 has less to do with regression specifically and more to do with using R in real-world applications, where the facilities provided either in the basic packages or in the car package need to be modified to meet a particular goal. Readers with an interest in programming may prefer to read the last two chapters before Chapters 3 to 6.
We employ a few simple typographical conventions:
> mean(1:10) # an input line
[1] 5.5
The > prompt at the beginning of the input and the + prompt (not illustrated in this example), which begins continuation lines, are provided by R, not typed by the user.
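To see the continuation prompt, enter a command that is incomplete at the end of its first line. The sketch below shows only what you would type; at the console, the first line would follow the > prompt and the second the + prompt.

```r
# A single expression split across two lines: the open parenthesis tells R
# that the command is not finished, so it waits for more input.
m <- mean(c(1, 2, 3,
            4, 5))
m
```

If a + prompt appears unexpectedly, the previous line was probably missing a closing parenthesis or quotation mark.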
Graphical output from R is shown in many figures scattered through the text; in normal use, graphs appear on the computer screen in graphics device windows that can be moved, resized, copied into other programs, saved, or printed (as described in Section 7.4).
There is, of course, much to R beyond the material in this book. The S language is documented in several books by John Chambers and his colleagues: The New S Language: A Programming Environment for Data Analysis and Graphics (Becker et al., 1988) and an edited volume, Statistical Models in S (Chambers and Hastie, 1992), describe what came to be known as S3, including the S3 object-oriented programming system, and facilities for specifying and fitting statistical models. Similarly, Programming With Data (Chambers, 1998) describes the S4 language and object system. The R dialect of S incorporates both S3 and S4, and so these books remain valuable sources.
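For readers who have not met the two object systems, the following sketch contrasts them in a few lines. The class names, the generic toFahrenheit, and the temperature values are all invented for illustration and do not come from the books cited above.

```r
library(methods)  # S4 support; attached by default in interactive sessions

# S3: a class is simply an attribute, and a method is found by its name
# (generic.class), here print.celsius for the print generic
temp <- structure(list(degrees = 21.5), class = "celsius")
print.celsius <- function(x, ...) cat(x$degrees, "degrees C\n")
print(temp)

# S4: classes, generics, and methods are declared formally
setClass("Celsius", representation(degrees = "numeric"))
setGeneric("toFahrenheit", function(obj) standardGeneric("toFahrenheit"))
setMethod("toFahrenheit", "Celsius", function(obj) obj@degrees * 9 / 5 + 32)
boiling <- toFahrenheit(new("Celsius", degrees = 100))
boiling
```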
Beyond these basic sources, there are now so many books that describe the application of R to various areas of statistics that it is impractical to compile a list here, a list that would inevitably be out-of-date by the time this book goes to press. We include complementary readings at the end of many chapters, however. There is nevertheless one book that is especially worthy of mention here: The fourth edition of Modern Applied Statistics With S (Venables and Ripley, 2002), though somewhat dated, demonstrates the use of R for a wide range of statistical applications. The book is associated with several R packages, including the MASS package, to which we make occasional reference. Venables and Ripley’s text is generally more advanced and has a broader focus than our book. There are also some differences in emphasis: For example, the R Companion has more material on diagnostic methods.
Acknowledgments
We are grateful to a number of individuals who provided valuable assistance in writing this book and its predecessor:
1SPSS was acquired by IBM in October 2009.
2With SAS, in particular, the situation is not so clear-cut, because there are several facilities for programming: The SAS DATA step is a simple programming language for manipulating data sets, the IML (interactive matrix language) procedure provides a programming language for matrix computations, and the macro facility allows the user to build applications that incorporate DATA steps and calls to SAS procedures. Nevertheless, programming in SAS is considerably less consistent and convenient than programming in a true statistical programming environment, and it remains fair to say that SAS principally is oriented toward processing rectangular data sets to produce printed output. Interestingly, both SAS and SPSS recently introduced facilities to link to R code.
3A compiler translates a program written in a programming language into an independently executable program in machine code. In contrast, an interpreter translates and executes a program under the control of the interpreter. Although it is in theory possible to write a compiler for a high-level, interactive language such as R, it is difficult to do so. Compiled programs usually execute more efficiently than interpreted programs. In advanced use, R has facilities for incorporating compiled programs written in Fortran and C.
4The name of the library command is an endless source of confusion among new users of R. The command loads a package, such as car, which in turn resides in a library of packages. If you want to be among the R cognoscenti, never call a package a “library”!
5If you have difficulty accessing this website, please check the Sage Publications website at www.sagepub.com for up-to-date information. Search for “John Fox,” and follow the links to the website for the book.
6A set of manuals in PDF and HTML format is distributed with R and can be accessed with Windows or Mac OS X through the Help menu. The manuals are also available on the R website. R has a substantial user community, which contributes to active and helpful email lists. See the previously mentioned website for details. And please remember to observe proper netiquette: Look for answers in the documentation and frequently-asked-questions (FAQ) lists before posting a question to an email discussion list; the people who answer your question are volunteering their time. Also, check the posting guide, at www.r-project.org/posting-guide.html , before posting a question to one of the R email lists.
7Footnotes include references to supplementary material (e.g., cross-references to other parts of the text), elaboration of points in the text, and indications of portions of the text that represent (we hope) innocent distortion for the purpose of simplification. The objective is to present more complete and correct information without interrupting the flow of the text and without making the main text overly difficult.