Oh No, the Data Doesn’t Fit into Memory!

As mentioned earlier, all objects in an R session are stored in memory. R places a limit of 2^31 - 1 bytes on the size of any object, regardless of word size (32-bit versus 64-bit) and the amount of RAM in your machine. However, you really should not consider this an obstacle. With a little extra care, applications that have large memory requirements can indeed be handled well in R. Some common approaches are chunking and using R packages for memory management.

One option involving no extra R packages at all is to read in your data from a disk file one chunk at a time. For example, suppose that our goal is to find means or proportions of some variables. We can use the skip argument of read.table(), together with nrows to bound the size of each read.

Say our data set has 1,000,000 records and we divide them into 10 chunks (or more, whatever is needed to cut the data down to pieces that fit in memory). Then we set skip = 0 on our first read, skip = 100000 the second time, and so on, reading 100,000 records each time. Each time we read in a chunk, we calculate the counts or totals for that chunk and record them. After reading all the chunks, we add up all the counts or totals in order to calculate our grand means or proportions.
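
Here is a minimal sketch of that scheme, computing the grand mean of one column; the file name datafile.txt, the assumption that the first column is numeric, and the chunk size are placeholders for illustration.

   chunksize <- 100000                    # hypothetical chunk size
   nchunks <- 10
   total <- 0
   count <- 0
   for (i in 1:nchunks) {
      chunk <- read.table("datafile.txt", skip = (i - 1) * chunksize,
         nrows = chunksize)
      total <- total + sum(chunk[, 1])    # running total for column 1
      count <- count + nrow(chunk)        # running record count
   }
   grandmean <- total / count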

As another example, suppose we are performing a statistical operation, say calculating principal components, in which we have a huge number of rows, that is, a huge number of observations, but the number of variables is manageable. Again, chunking could be the solution. We apply the statistical operation to each chunk and then average the results over all the chunks. My mathematical research shows that the resulting estimators are statistically efficient in a wide class of statistical methods.
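
One way to realize that idea for principal components, sketched below under the assumption that every column of the (hypothetical) file datafile.txt is numeric, is to average the per-chunk covariance matrices and then extract the eigenvectors of the average.

   chunksize <- 100000
   nchunks <- 10
   covsum <- NULL
   for (i in 1:nchunks) {
      chunk <- read.table("datafile.txt", skip = (i - 1) * chunksize,
         nrows = chunksize)
      cc <- cov(chunk)                    # covariance matrix for this chunk
      covsum <- if (is.null(covsum)) cc else covsum + cc
   }
   covavg <- covsum / nchunks             # average over the chunks
   pcs <- eigen(covavg)                   # eigenvectors estimate the principal components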

Moving up a bit in sophistication, there are alternatives for accommodating large memory requirements in the form of specialized R packages.

One such package is RMySQL, an R interface to SQL databases. Using it requires some database expertise, but this package provides a much more efficient and convenient way to handle large data sets. The idea is to have SQL do its variable/case selection operations for you back at the database end and then read the resulting selected data as it is produced by SQL. Since the latter will typically be much smaller than the overall data set, you will likely be able to circumvent R’s memory restriction.
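
A minimal sketch of that workflow follows; the connection details, the database name studydb, the table patients, and the particular query are all hypothetical.

   library(RMySQL)                        # also loads the DBI package
   con <- dbConnect(MySQL(), user = "jsmith", password = "xxxxxx",
      dbname = "studydb", host = "someserver")   # hypothetical credentials
   # let the database do the row/column selection; read only the (small) result
   smalldata <- dbGetQuery(con,
      "SELECT age, weight FROM patients WHERE age >= 65")
   dbDisconnect(con)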

Another useful package is biglm, which does regression and generalized linear-model analysis on very large data sets. It also uses chunking but in a different manner: Each chunk is used to update the running totals of sums needed for the regression analysis and then discarded.
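
That updating pattern looks roughly like the following sketch; the file name, chunk sizes, and column names are placeholders.

   library(biglm)
   # fit on the first chunk, then fold in the remaining chunks one at a time
   chunk <- read.table("datafile.txt", skip = 0, nrows = 100000,
      col.names = c("y", "x1", "x2"))
   fit <- biglm(y ~ x1 + x2, data = chunk)
   for (i in 2:10) {
      chunk <- read.table("datafile.txt", skip = (i - 1) * 100000,
         nrows = 100000, col.names = c("y", "x1", "x2"))
      fit <- update(fit, chunk)           # update the running sums; the chunk can then be discarded
   }
   summary(fit)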

Finally, some packages do their own storage management independently of R and thus can deal with very large data sets. The two most commonly used today are ff and bigmemory. The former sidesteps memory constraints by storing data on disk instead of in memory, essentially transparently to the programmer. The highly versatile bigmemory package does the same, but it can store data not only on disk but also in the machine’s main memory, which is ideal for multicore machines.
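
Here is a small sketch using bigmemory to create a file-backed matrix; the dimensions and backing-file names are made up for illustration.

   library(bigmemory)
   # create a file-backed matrix; the data live on disk, not in R's memory
   x <- big.matrix(nrow = 1000000, ncol = 3, type = "double",
      backingfile = "x.bin", descriptorfile = "x.desc")
   x[1, ] <- c(5.2, 1.0, 3.7)             # use it much like an ordinary matrix
   mean(x[, 2])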