Packages are also used to deliver research and research results and to make them reproducible, so it is often necessary to include a dataset in our R package. Many packages also use bundled data in their demo code, which lets users try out the package's functions right away without having to import their own data.
R provides several options for including this data. Which one to choose depends on the kind of data we want to attach to the package and on what the data will be used for.
The most common way is to include it in the data subdirectory. This way is often used when our dataset is used for the example code. Another way is to store it in the sysdata.rda file, an .rda file kept in the package's R subdirectory. We can use this option if we do not want the package's users to have full access to these datasets: objects in this file are available to the package's own functions but are not exported to users.
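The two locations described above can be sketched as follows. This is a minimal illustration rather than a prescribed workflow: the package root mypkg, the temporary directory, and the object names demo_data and lookup_table are all hypothetical.

```r
# Hypothetical package root inside a temporary directory (illustration only)
pkg_root <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg_root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_root, "R"), recursive = TRUE, showWarnings = FALSE)

# Dataset meant for users and example code: stored in data/
demo_data <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 8.1, 9.8))
save(demo_data, file = file.path(pkg_root, "data", "demo_data.rda"))

# Dataset meant only for the package's own functions: stored in R/sysdata.rda
lookup_table <- c(a = 1, b = 2)
save(lookup_table, file = file.path(pkg_root, "R", "sysdata.rda"))

# Show the resulting layout
list.files(pkg_root, recursive = TRUE)
```

In a real package, pkg_root would simply be the package's source directory.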
The data files we can include in the package can be in three formats:
* Plain R code (.R or .r files)
* Tables (.txt or .csv files)
* save() images (.RData or .rda files)
CSV files
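Please note that the .csv files in this context are not standard CSV files. When a dataset is loaded with data(), its .csv file is read with read.table(..., header = TRUE, sep = ";"), so the fields have to be separated by semicolons rather than commas. For comparison, the standard CSV format is described at: http://tools.ietf.org/html/rfc4180.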
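As a small sketch of what such a file looks like, the following code writes a semicolon-separated file and reads it back with the same arguments that data() uses for .csv files. The file name measurements.csv and the column names are made up for illustration.

```r
# Hypothetical example data and file name
measurements <- data.frame(id = 1:2, value = c(3.5, 4.25))
csv_path <- file.path(tempdir(), "measurements.csv")

# Write the table with ';' as the field separator and '.' as the decimal mark
write.table(measurements, file = csv_path, sep = ";",
            row.names = FALSE, quote = FALSE)

# Read it back the way data() reads .csv files
read_back <- read.table(csv_path, header = TRUE, sep = ";")
str(read_back)
```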
To create .rda files, we can create the data objects in R, or load existing data into R, and then call the save() function. This function writes the given objects to an .rda file.
The following code shows how to create such a data file:
df <- data.frame(matrix(rnorm(10), nrow = 5))
save(df, file = "dataFile.Rda")
This code will create the file dataFile.Rda in the current working directory, which is normally the root directory of our project.
These .Rda files can be loaded into the working environment simply by clicking on them. This will open a dialog where we can confirm that this RData file should be loaded.
After loading the file into the environment, we can find its objects in the Environment panel. Then we can work with them as we are used to.
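The same file can also be loaded from code with the base R function load(), which restores the saved objects under their original names and returns those names. The sketch below loads into a separate environment so that it is easy to inspect what was restored; the temporary path is an assumption made to keep the example self-contained.

```r
# Create and save a small data frame (temporary path used for illustration)
rda_path <- file.path(tempdir(), "dataFile.Rda")
df <- data.frame(matrix(rnorm(10), nrow = 5))
save(df, file = rda_path)

# Load the file into a fresh environment; load() returns the object names
env <- new.env()
loaded_names <- load(rda_path, envir = env)
loaded_names

# The restored object is available under its original name
head(env$df)
```

Calling load(rda_path) without the envir argument restores the objects into the calling environment instead, which is what happens when the file is opened interactively.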
Because the save() function compresses its output by default, the .rda format is also the best choice when we want to ship very large datasets with our package.
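The compression method can be chosen with the compress argument of save(). The following sketch saves the same data once with the default method and once with xz, then inspects both files with tools::checkRdaFiles(). The dataset and file names are invented for illustration, and which method yields the smaller file depends on the data.

```r
# Hypothetical larger dataset
set.seed(1)
big <- data.frame(id = seq_len(1e5),
                  group = sample(letters, 1e5, replace = TRUE))

gz_path <- file.path(tempdir(), "big_gzip.rda")
xz_path <- file.path(tempdir(), "big_xz.rda")

save(big, file = gz_path)                   # default compression (gzip)
save(big, file = xz_path, compress = "xz")  # usually slower to write, often smaller

# Compare file sizes in bytes and report the compression detected in each file
file.size(c(gz_path, xz_path))
tools::checkRdaFiles(c(gz_path, xz_path))
```

For files that already exist, tools::resaveRdaFiles() can re-save them with a different compression method.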
As R has to load every dataset into memory before it can be used, it is important to use LazyData in our package, especially when we have bigger datasets. We can activate it by adding the following line to the DESCRIPTION file:
LazyData: true
Then, our datasets are not loaded into memory until we actually use them. This often saves a lot of memory, so we should use LazyData in all packages that include data files.
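To illustrate what lazy loading looks like from the user's side, we can look at the datasets package that ships with R, which lazy-loads its data. This is only an analogy for our own package: a dataset such as mtcars can be referenced directly, without a prior call to data(), and is read into memory on first use.

```r
# No data() call is needed: the dataset is fetched when it is first accessed
cars_head <- head(datasets::mtcars, 3)
cars_head

# The full dataset is an ordinary data frame once accessed
dim(datasets::mtcars)
```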