Chapter 1. Introduction

Congratulations! You’ve just begun your quest to become an R programmer. So you don’t pull any mental muscles, this chapter starts you off gently with a nice warm-up. Before you begin coding, we’re going to talk about what R is, and how to install it and begin working with it. Then you’ll try writing your first program and learn how to get help.

After reading this chapter, you should:

Just to confuse you, R refers to two things. There is R, the programming language, and R, the piece of software that you use to run programs written in R. Fortunately, most of the time it should be clear from the context which R is being referred to.

R (the language) was created in the early 1990s by Ross Ihaka and Robert Gentleman, then both working at the University of Auckland. It is based upon the S language that was developed at Bell Laboratories in the 1970s, primarily by John Chambers. R (the software) is a GNU project, reflecting its status as important free and open source software. Both the language and the software are now developed by a group of (currently) 20 people known as the R Core Team.

The fact that R’s history dates back to the 1970s is important, because it has evolved over the decades, rather than having been designed from scratch (contrast this with, for example, Microsoft’s .NET Framework, which has a much more “created”[2] feel). As with life-forms, the process of evolution has led to some quirks and inconsistencies. The upside of the more free-form nature of R (and the free license in particular) is that if you don’t like how something in R is done, you can write a package to make it do things the way that you want. Many people have already done that, and the common question now is not “Can I do this in R?” but “Which of the three implementations should I use?”

R is an interpreted language (sometimes called a scripting language), which means that your code doesn’t need to be compiled before you run it. It is a high-level language in that you don’t have access to the inner workings of the computer you are running your code on; everything is pitched toward helping you analyze data.

R supports a mixture of programming paradigms. At its core, it is an imperative language (you write a script that does one calculation after another), but it also supports object-oriented programming (data and functions are combined inside classes) and functional programming (functions are first-class objects; you treat them like any other variable, and you can call them recursively). This mix of programming styles means that R code can bear a lot of similarity to several other languages. The curly braces mean that you can write imperative code that looks like C (but the vectorized nature of R that we’ll discuss in Chapter 2 means that you have fewer loops). If you use reference classes, then you can write object-oriented code that looks a bit like C# or Java. The functional programming constructs are Lisp-inspired (the variable-scoping rules are taken from the Lisp dialect, Scheme), but there are fewer brackets. All this is a roundabout way of saying that R follows the Perl ethos:

There is more than one way to do it.

Larry Wall
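As a small sketch of that ethos, here is the same calculation (the sum of the squares of the numbers from 1 to 5) written in three of the styles just described. The variable and function names are made up for this illustration, and you don’t need to follow every detail yet:

```r
numbers <- 1:5

# Imperative style: loop over the values, updating a running total
total_loop <- 0
for (x in numbers) {
  total_loop <- total_loop + x^2
}

# Vectorized style: operate on the whole vector at once, no explicit loop
total_vectorized <- sum(numbers^2)

# Functional style: pass a function as an argument to another function
square <- function(x) x^2
total_functional <- sum(sapply(numbers, square))

# All three approaches give the same answer
c(total_loop, total_vectorized, total_functional)
## [1] 55 55 55
```

None of these is the “right” way; which one reads best depends on the problem and on your own taste.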

If you are using a Linux machine, then it is likely that your package manager will have R available, though possibly not the latest version. For everyone else, to install R you must first go to http://www.r-project.org. Don’t be deceived by the slightly archaic website;[3] it doesn’t reflect on the quality of R. Click the link that says “download R” in the “Getting Started” pane at the bottom of the page.

Once you’ve chosen a mirror close to you, choose a link in the “Download and Install R” pane at the top of the page that’s appropriate to your operating system. After that there are one or two OS-specific clicks that you need to make to get to the download.

If you are a Windows user who doesn’t like clicking, there is a cheeky shortcut to the setup file at http://<CRAN MIRROR>/bin/windows/base/release.htm.

If you use R under Windows or Mac OS X, then a graphical user interface (GUI) is available to you. This consists of a command-line interpreter, facilities for displaying plots and help pages, and a basic text editor. It is perfectly possible to use R in this way, but for serious coding you’ll at least want to use a more powerful text editor. There are countless text editors for programmers; if you already have a favorite, then take a look to see if you can get syntax highlighting of R code for it.

If you aren’t already wedded to a particular editor, then I suggest that you’ll get the best experience of R by using an integrated development environment (IDE). Using an IDE rather than a separate text editor gives you the benefit of only using one piece of software rather than two. You get all the facilities of the stock R GUI, but with a better editor, and in some cases things like integrated version control.

The following sections introduce five popular choices, but this is by no means an exhaustive list (a few additional suggestions follow). It is worth trying several IDEs; a development environment is a piece of software that you could be spending thousands of hours using, so it’s worth taking the time to find one[4] that you like.

Although Emacs calls itself a text editor, 36 years (and counting) of development have given it an unparalleled number of features. If you’ve been programming for any substantial length of time, you probably already know whether or not you want to use it. Converts swear by its limitless customizability and raw editing power; others complain that it overcomplicates things and that the key chords give them repetitive strain injury. There is certainly a steep learning curve, so be willing to spend a month or two getting used to it. The other big benefit is that Emacs is not R-specific, so you can use it for programming in many languages. The original version of Emacs is (like R) a GNU project, available from http://www.gnu.org/software/emacs/.

Another popular fork is XEmacs, available from http://www.xemacs.org/.

Emacs Speaks Statistics (ESS) is an add-on for Emacs that assists you in writing R code. Actually, it works with S-Plus, SAS, and Stata, too, so you can write statistical code with whichever package you like (choose R!). Several of the authors of ESS are also R Core Team members, so you are guaranteed good integration with R. It is available through the Emacs package management system, or you can download it from http://ess.r-project.org/.

Use it if you want to write code in multiple languages, you want the most powerful editor available, and you are fearless with learning curves.

It is a law of programming books that the first example shall be a program to print the phrase “Hello world!” In R that’s really boring, since you just type “Hello world!” at the command prompt, and it will parrot it back to you. Instead, we’re going to write the simplest statistical program possible.

Open up R GUI, or whichever IDE you’ve decided to use, find the command prompt (in the code editor window), and type:

mean(1:5)

Hit Enter to run the line of code. Hopefully, you’ll get the answer 3. As you might have guessed, this code is calculating the arithmetic mean of the numbers from 1 to 5. The colon operator, :, creates a sequence of numbers from the first number, in this case 1, to the second number (5), each separated by 1. The resulting sequence is called a vector. mean is a function (that calculates the arithmetic mean), and the vector that we enclose inside the parentheses is called an argument to the function.

Well done! You’ve calculated a statistic using R.
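If you want to see the pieces separately, the following sketch breaks the calculation into steps. The variable name one_to_five is just an illustrative choice; the last line computes the mean by hand to show that it agrees with the mean function:

```r
one_to_five <- 1:5        # the colon operator creates the vector 1, 2, 3, 4, 5
one_to_five
## [1] 1 2 3 4 5

mean(one_to_five)         # pass the vector as the argument to mean
## [1] 3

sum(one_to_five) / length(one_to_five)   # the same calculation, done by hand
## [1] 3
```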

Before you get started writing R code, the most important thing to know is how to get help. There are lots of ways to do this. Firstly, if you want help on a function or a dataset that you know the name of, type ? followed by its name. To find functions, type two question marks (??) followed by a keyword related to your problem. Special characters, reserved words, and multiword search terms need enclosing in double or single quotes. For example:

?mean                  #opens the help page for the mean function
?"+"                   #opens the help page for addition
?"if"                  #opens the help page for if, used for branching code
??plotting             #searches for topics containing words like "plotting"
??"regression model"   #searches for topics containing phrases like this

The functions help and help.search do the same things as ? and ??, respectively. help.search always needs its argument enclosed in quotes, and help needs quotes for special characters and reserved words, so it is simplest to quote everything. The following commands are equivalent to the previous lot:

help("mean")
help("+")
help("if")
help.search("plotting")
help.search("regression model")

The apropos function[5] finds variables (including functions) that match its input. This is really useful if you can only half-remember the name of a variable that you’ve created, or a function that you want to use. For example, suppose you create a variable a_vector:

a_vector <- c(1, 3, 6, 10)

You can then recall this variable using apropos:

apropos("vector")
## [1] ".__C__vector"         "a_vector"             "as.data.frame.vector"
## [4] "as.vector"            "as.vector.factor"     "is.vector"
## [7] "vector"               "Vectorize"

The results contain the variable you just created, a_vector, and all other variables that contain the string vector. In this case, all the others are functions that are built into R.

Just finding variables that contain a particular string is fine, but you can also do fancier matching with apropos using regular expressions.

A simple usage of apropos could, for example, find all variables that end in z, or find all variables containing a digit between 4 and 9:

apropos("z$")
## [1] "alpe_d_huez" "alpe_d_huez" "force_tz"    "indexTZ"     "SSgompertz"
## [6] "toeplitz"    "tz"          "unz"         "with_tz"
apropos("[4-9]")
##  [1] ".__C__S4"            ".__T__xmlToS4:XML"   ".parseISO8601"
##  [4] ".SQL92Keywords"      ".TAOCP1997init"      "asS4"
##  [7] "assert_is_64_bit_os" "assert_is_S4"        "base64"
## [10] "base64Decode"        "base64Encode"        "blues9"
## [13] "car90"               "enc2utf8"            "fixPre1.8"
## [16] "Harman74.cor"        "intToUtf8"           "is_64_bit_os"
## [19] "is_S4"               "isS4"                "seemsS4Object"
## [22] "state.x77"           "to.minutes15"        "to.minutes5"
## [25] "utf8ToInt"           "xmlToS4"
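Here is one more small sketch that combines these ideas. The variables a_vector and another_vector are hypothetical and exist only for this illustration; mode is an argument of apropos that restricts the results to a particular kind of object. The exact results depend on which packages you have loaded, so no output is shown:

```r
# Hypothetical variables, created only for this illustration
a_vector <- c(1, 3, 6, 10)
another_vector <- 1:3

# ^ anchors the match at the start of a name and $ at the end, so this
# finds names that begin with "a" and end with "vector"
# (apropos ignores case by default)
apropos("^a.*vector$")

# mode = "function" keeps only functions, so the two variables
# created above are dropped from the results
apropos("vector$", mode = "function")
```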

Most functions have examples that you can run to get a better idea of how they work. Use the example function to run these. There are also some longer demonstrations of concepts that are accessible with the demo function:

example(plot)
demo()         #list all demonstrations
demo(Japanese)

R is modular and is split into packages (more on this later), some of which contain vignettes, which are short documents on how to use the packages. You can browse all the vignettes on your machine using browseVignettes:

browseVignettes()

You can also access a specific vignette using the vignette function (but if your memory is as bad as mine, using browseVignettes combined with a page search is easier than trying to remember the name of a vignette and which package it’s in):

vignette("Sweave", package = "utils")

The help search operator ?? and browseVignettes will only find things in packages that you have installed on your machine. If you want to look in any package, you can use RSiteSearch, which runs a query at http://search.r-project.org. Multiword terms need to be wrapped in braces:

RSiteSearch("{Bayesian regression}")

Tip

Learning to help yourself is extremely important. Think of a keyword related to your work and try ?, ??, apropos, and RSiteSearch with it.

There are also lots of R-related resources on the Internet that are worth trying. There are too many to list here, but start with these:

There are a few other bits of software that R can use to extend its functionality. Under Linux, your package manager should be able to retrieve them. Under Windows, rather than hunting all over the Internet to track down this software, you can use the installr add-on package to automatically install these extra pieces of software. None of this software is compulsory, so you can skip this section now if you want, but it’s worth knowing that the package exists when you come to need the additional software. Installing and loading packages is discussed in detail in Chapter 10, so don’t worry if you don’t understand the commands yet:

install.packages("installr")   #download and install the package named installr
library(installr)              #load the installr package
install.RStudio()              #download and install the RStudio IDE
install.Rtools()               #Rtools is needed for building your own packages
install.git()                  #git provides version control for your code

Exercise 1-1

Visit http://www.r-project.org, download R, and install it. For extra credit, download and install one of the IDEs mentioned in Other IDEs and Editors. [30]

Exercise 1-2

The function sd calculates the standard deviation. Calculate the standard deviation of the numbers from 0 to 100. Hint: the answer should be about 29.3. [5]

Exercise 1-3

Watch the demonstration on mathematical symbols in plots, using demo(plotmath). [5]


[2] Intelligently designed?

[4] You don’t need to limit yourself to just one way of using R. I have IDE commitment issues and use a mix of Eclipse + StatET, RStudio, Live-R, Tinn-R, Notepad++, and R GUI. Experiment, and find something that works for you.

[5] apropos is Latin for “A Unix program that finds manpages.”