Now that you’ve seen a simple example of creating a list, let’s look at how to access and work with lists.
You can access a list component in several different ways:
> j$salary
[1] 55000
> j[["salary"]]
[1] 55000
> j[[2]]
[1] 55000
We can refer to list components by their numerical indices, treating the list as a vector. However, note that in this case, we use double brackets instead of single ones.
So, there are three ways to access an individual component c of a list lst and return it in the data type of c:
lst$c
lst[["c"]]
lst[[i]], where i is the index of c within lst
Each of these is useful in different contexts, as you will see in subsequent examples. But note the qualifying phrase, “return it in the data type of c.” An alternative to the second and third techniques listed is to use single brackets rather than double brackets:
lst["c"]
lst[i], where i is the index of c within lst
Both single-bracket and double-bracket indexing access list elements in vector-index fashion. But there is an important difference from ordinary (atomic) vector indexing. If single brackets [ ] are used, the result is another list, a sublist of the original. For instance, continuing the preceding example, we have this:
> j[1:2]
$name
[1] "Joe"
$salary
[1] 55000
> j2 <- j[2]
> j2
$salary
[1] 55000
> class(j2)
[1] "list"
> str(j2)
List of 1
 $ salary: num 55000
The subsetting operation returned another list consisting of the first two components of the original list j. Note that the word returned makes sense here, since index brackets are functions. This is similar to other cases you’ve seen for operators that do not at first appear to be functions, such as +.
By contrast, you can use double brackets [[ ]] for referencing only a single component, with the result having the type of that component.
> j[[1:2]]
Error in j[[1:2]] : subscript out of bounds
> j2a <- j[[2]]
> j2a
[1] 55000
> class(j2a)
[1] "numeric"
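The practical consequence of this difference can be checked directly. The following is a minimal sketch; the list j is rebuilt here so the block runs on its own, and its third component, union, is an assumption, since this section shows only name and salary.

```r
# Rebuild the running example; the `union` component is assumed
j <- list(name = "Joe", salary = 55000, union = TRUE)

sub <- j["salary"]     # single brackets: a list of length 1
val <- j[["salary"]]   # double brackets: the component itself

is.list(sub)           # TRUE
is.numeric(val)        # TRUE

# Arithmetic works on the extracted component...
raise <- val * 2
# ...but not on the sublist: sub * 2 stops with a
# "non-numeric argument to binary operator" error
```

In short, reach for double brackets (or $) when you want to compute with a component, and single brackets when you want a smaller list.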
The operations of adding and deleting list elements arise in a surprising number of contexts. This is especially true for data structures in which lists form the foundation, such as data frames and R classes.
New components can be added after a list is created.
> z <- list(a="abc",b=12)
> z
$a
[1] "abc"
$b
[1] 12
> z$c <- "sailing"  # add a c component
> # did c really get added?
> z
$a
[1] "abc"
$b
[1] 12
$c
[1] "sailing"
Adding components can also be done via a vector index:
> z[[4]] <- 28
> z[5:7] <- c(FALSE,TRUE,TRUE)
> z
$a
[1] "abc"
$b
[1] 12
$c
[1] "sailing"
[[4]]
[1] 28
[[5]]
[1] FALSE
[[6]]
[1] TRUE
[[7]]
[1] TRUE
You can delete a list component by setting it to NULL.
> z$b <- NULL
> z
$a
[1] "abc"
$c
[1] "sailing"
[[3]]
[1] 28
[[4]]
[1] FALSE
[[5]]
[1] TRUE
[[6]]
[1] TRUE
Note that upon deleting z$b, the indices of the elements after it moved up by 1. For instance, the former z[[4]] became z[[3]].
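The index shift is easy to confirm with length() and names(). This is a small sketch using a shortened version of z; deleting several components at once by assigning NULL to a single-bracket selection is standard R behavior, though it is not shown in the text above.

```r
z <- list(a = "abc", b = 12, c = "sailing")
z[[4]] <- 28          # add an unnamed fourth component
n_before <- length(z) # 4

z$b <- NULL           # delete component b
n_after <- length(z)  # 3
names(z)              # "a" "c" ""
moved <- z[[3]]       # 28, which was z[[4]] before the deletion

# Several components can be deleted in one step
z[c("a", "c")] <- NULL
remaining <- length(z)  # 1
```

Because deletion renumbers the remaining components, code that deletes inside a loop over numeric indices needs care; referring to components by name avoids the problem.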
You can also concatenate lists.
> c(list("Joe", 55000, T),list(5))
[[1]]
[1] "Joe"
[[2]]
[1] 55000
[[3]]
[1] TRUE
[[4]]
[1] 5
Since a list is a vector, you can obtain the number of components in a list via length().
> length(j) [1] 3
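length() also makes it easy to see what concatenation does. As a brief sketch (the variable names here are illustrative), compare c(), which merges the components of its arguments into one list, with list(), which nests its arguments:

```r
a <- list("Joe", 55000, TRUE)
b <- list(5)

flat   <- c(a, b)      # one list holding all four components
nested <- list(a, b)   # a list whose two components are themselves lists

length(flat)     # 4
length(nested)   # 2
length(nested[[1]])  # 3, the length of the inner list a
```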
Web search and other types of textual data mining are of great interest today. Let’s use this area for an example of R list code.
We’ll write a function called findwords() that will determine which words are in a text file and compile a list of the locations of each word’s occurrences in the text. This would be useful for contextual analysis, for example.
Suppose our input file, testconcord.txt, has the following contents (taken from this book!):
The [1] here means that the first item in this line of output is item 1. In this case, our output consists of only one line (and one item), so this is redundant, but this notation helps to read voluminous output that consists of many items spread over many lines. For example, if there were two rows of output with six items per row, the second row would be labeled [7].
In order to identify words, we replace all nonletter characters with blanks and get rid of capitalization. We could use the string functions presented in Chapter 11 to do this, but to keep matters simple, such code is not shown here. The new file, testconcorda.txt, looks like this:
the here means that the first item in this line of output is item in this case our output consists of only one line and one item so this is redundant but this notation helps to read voluminous output that consists of many items spread over many lines for example if there were two rows of output with six items per row the second row would be labeled
Then, for instance, the word item has locations 7, 14, and 27, which means that it occupies the seventh, fourteenth, and twenty-seventh word positions in the file.
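For readers who want to reproduce the cleaning step, here is one possible sketch using base R's gsub() and tolower(). It is not the code used to produce testconcorda.txt; the helper name clean_text and the regular expression are assumptions made for illustration.

```r
# Hypothetical helper: name and pattern are illustrative assumptions
clean_text <- function(lines) {
  lines <- gsub("[^A-Za-z]", " ", lines)  # every nonletter becomes a blank
  tolower(lines)                          # get rid of capitalization
}

raw     <- "The [1] here means that the first item"
cleaned <- clean_text(raw)

# Split on whitespace, as scan() will do when reading the file
words <- scan(text = cleaned, what = "", quiet = TRUE)
which(words == "item")   # 7
```

To process a whole file under the same assumptions, you could apply clean_text() to the result of readLines() and save the output with writeLines().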
Here is an excerpt from the list that is returned when our function findwords() is called on this file:
> findwords("testconcorda.txt")
Read 68 items
$the
[1]  1  5 63
$here
[1] 2
$means
[1] 3
$that
[1]  4 40
$first
[1] 6
$item
[1]  7 14 27
...
The list consists of one component per word in the file, with a word’s component showing the positions within the file where that word occurs. Sure enough, the word item is shown as occurring at positions 7, 14, and 27.
Before looking at the code, let’s talk a bit about our choice of a list structure here. One alternative would be to use a matrix, with one row per word in the text. We could use rownames() to name the rows, with the entries within a row showing the positions of that word. For instance, row item would consist of 7, 14, 27, and then 0s in the remainder of the row. But the matrix approach has a couple of major drawbacks:
There is a problem in terms of the columns to allocate for our matrix. If the maximum frequency with which a word appears in our text is, say, 10, then we would need 10 columns. But we would not know that ahead of time. We could add a new column with cbind() whenever a word's number of occurrences exceeded the current column count (in addition to using rbind() to add a row for each new word). Or we could write code to do a preliminary run through the input file to determine the maximum word frequency. Either of these would come at the expense of increased code complexity and possibly increased runtime.
Such a storage scheme would be quite wasteful of memory, since most rows would probably consist of a lot of zeros. In other words, the matrix would be sparse—a situation that also often occurs in numerical analysis contexts.
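Both drawbacks can be made concrete with a rough count. The sketch below uses only the first seven words of the example text and tallies how many matrix cells would hold padding zeros; the variable names are illustrative.

```r
txt <- c("the", "here", "means", "that", "the", "first", "item")

freq  <- table(txt)        # occurrences of each distinct word
ncols <- max(freq)         # columns needed; known only after a full pass
cells <- length(freq) * ncols   # total cells in the word-by-position matrix
used  <- length(txt)            # cells that hold a real position
zeros <- cells - used           # cells that are only padding

zeros   # 5 of the 12 cells are wasted
```

Here a single repeated word ("the") forces a second column on every row, and the proportion of wasted cells tends to grow as a few common words become far more frequent than the rest.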
Thus, the list structure really makes sense. Let’s see how to code it.
1  findwords <- function(tf) {
2     # read in the words from the file, into a vector of mode character
3     txt <- scan(tf,"")
4     wl <- list()
5     for (i in 1:length(txt)) {
6        wrd <- txt[i]  # ith word in input file
7        wl[[wrd]] <- c(wl[[wrd]],i)
8     }
9     return(wl)
10 }
We read in the words of the file (words simply meaning any groups of letters separated by spaces) by calling scan(). The details of reading and writing files are covered in Chapter 10, but the important point here is that txt will now be a vector of strings: one string per instance of a word in the file. Here is what txt looks like after the read:
> txt
 [1] "the"        "here"       "means"      "that"       "the"
 [6] "first"      "item"       "in"         "this"       "line"
[11] "of"         "output"     "is"         "item"       "in"
[16] "this"       "case"       "our"        "output"     "consists"
[21] "of"         "only"       "one"        "line"       "and"
[26] "one"        "item"       "so"         "this"       "is"
[31] "redundant"  "but"        "this"       "notation"   "helps"
[36] "to"         "read"       "voluminous" "output"     "that"
[41] "consists"   "of"         "many"       "items"      "spread"
[46] "over"       "many"       "lines"      "for"        "example"
[51] "if"         "there"      "were"       "two"        "rows"
[56] "of"         "output"     "with"       "six"        "items"
[61] "per"        "row"        "the"        "second"     "row"
[66] "would"      "be"         "labeled"
The list operations in lines 4 through 8 build up our main variable, a list wl (for word list). We loop through all the words in txt, with wrd being the current one.
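To try the function without creating testconcorda.txt by hand, the following sketch writes the same 68 words to a temporary file and runs the same logic on it. The function body is repeated without line numbers so that the block runs on its own; the use of seq_along(), quiet = TRUE, and a temporary file are choices made for this illustration.

```r
# Same logic as findwords() above, repeated so this block is self-contained.
# seq_along(txt) replaces 1:length(txt), and quiet = TRUE suppresses the
# "Read 68 items" message; both are choices made for this sketch.
findwords <- function(tf) {
  txt <- scan(tf, "", quiet = TRUE)
  wl <- list()
  for (i in seq_along(txt)) {
    wrd <- txt[i]                   # ith word in input file
    wl[[wrd]] <- c(wl[[wrd]], i)    # append position i to this word's vector
  }
  wl
}

# Stand-in for testconcorda.txt: a temporary file holding the same words
sample_text <- paste(
  "the here means that the first item in this line of output is item in",
  "this case our output consists of only one line and one item so this is",
  "redundant but this notation helps to read voluminous output that",
  "consists of many items spread over many lines for example if there",
  "were two rows of output with six items per row the second row would",
  "be labeled"
)
tf <- tempfile(fileext = ".txt")
writeLines(sample_text, tf)

wl <- findwords(tf)
wl$item              # 7 14 27
wl$that              # 4 40
n_positions <- length(unlist(wl))   # 68, one position per word in the file
unlink(tf)           # remove the temporary file
```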
Let’s see what happens with the code in line 7 when i = 4, so that wrd = "that" in our example file testconcorda.txt. At this point, wl[["that"]] will not yet exist. As mentioned, R is set up so that in such a case, wl[["that"]] evaluates to NULL, which means in line 7, we can concatenate it! Thus wl[["that"]] will become the one-element vector (4). Later, when i = 40, wl[["that"]] will become (4,40), representing the fact that words 4 and 40 in the file are both "that". Note how convenient it is that list indexing can be done through quoted strings, such as in wl[["that"]].
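The two facts this step relies on can be verified in isolation: a missing list component reads as NULL, and concatenating NULL with a value yields just the value. A minimal sketch:

```r
wl <- list()

missing_is_null <- is.null(wl[["that"]])   # TRUE: no such component yet
c(NULL, 4)                                  # 4: NULL contributes nothing

wl[["that"]] <- c(wl[["that"]], 4)    # first occurrence, as when i = 4
wl[["that"]] <- c(wl[["that"]], 40)   # second occurrence, as when i = 40
wl[["that"]]                          # 4 40
```

This is why the loop needs no special case for the first time a word is seen.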
An advanced, more elegant version of this code uses R’s split() function, as you’ll see in Section 6.2.2.
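As a preview of the idea, here is one plausible way split() could do the same job. This is a sketch, not necessarily the version given in Section 6.2.2; the name findwords_split is hypothetical, and the function is assumed to take the word vector directly rather than a file name.

```r
# Hypothetical variant: group the positions 1, 2, ... by the word found there
findwords_split <- function(txt) {
  split(seq_along(txt), txt)
}

txt <- c("the", "here", "means", "that", "the", "first", "item")
wl  <- findwords_split(txt)

wl$the      # 1 5
names(wl)   # words in sorted order, not order of first appearance
```

One visible difference from the loop version is the ordering of the components: split() groups by the sorted distinct words, whereas the loop adds words in the order they first appear in the file.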