Using DataFrames

If you measure n variables (each of a different type) of a single object, then you get a table with n columns for each object row. If there are m observations, then we have m rows of data. For example, given the student grades as data, you might want to know compute the average grade for each socioeconomic group, where grade and socioeconomic group are both columns in the table, and there is one row per student.

DataFrame is the most natural representation to work with such a (m x n) table of data. They are similar to Pandas DataFrames in Python or data.frame in R. DataFrame is a more specialized tool than a normal array for working with tabular and statistical data, and it is defined in the DataFrames package, a popular Julia library for statistical work. Install it in your environment by typing in add DataFrames in the REPL. Then, import it into your current workspace with using DataFrames. Do the same for the DataArrays and RDatasets packages (which contain a collection of example datasets mostly used in the R literature).

A common case in statistical data is that data values can be missing (the information is not known). The Missings package provides us with a unique value, missing, which represents a non-existing value, and has the Missing type. The result of the computations that contain the missing values mostly cannot be determined, for example, 42 + missing returns missing.

DataFrame is a kind of in-memory database, versatile in the various ways you can work with data. It consists of columns with names such as Col1, Col2, and Col3. All of these columns are DataArrays that have their own type, and the data they contain can be referred to by the column names as well, so we have substantially more forms of indexing. Unlike two-dimensional arrays, columns in DataFrame can be of different types. One column might, for instance, contain the names of students and should therefore be a string. Another column could contain their age and should be an integer.

We construct DataFrame from the program data as follows:

// code in Chapter 8\dataframes.jl 
using DataFrames, Missings 
# constructing a DataFrame: 
df = DataFrame() 
df[:Col1] = 1:4 
df[:Col2] = [exp(1), pi, sqrt(2), 42] 
df[:Col3] = [true, false, true, false] 
show(df) 

Notice that the column headers are used as symbols. This returns the following 4 x 3 DataFrame object:

show(df) produces a nicely formatted output (whereas show(:Col2) does not). This is because there is a show() routine defined in the package for the entire contents of DataFrame.

We could also have used the full constructor, as follows:

df = DataFrame(Col1 = 1:4, Col2 = [e, pi, sqrt(2), 42],
Col3 = [true, false, true, false])

You can refer to columns either by an index (the column number) or by a name; both of the following expressions return the same output:

show(df[2]) 
show(df[:Col2]) 

This gives the following output:

[2.718281828459045, 3.141592653589793, 1.4142135623730951,42.0]

To show the rows or subsets of rows and columns, use the familiar splice (:) syntax, for example:

| Row | Col1 | Col2    | Col3 |
|-----|------|---------|------|
| 1   | 1    | 2.71828 | true |
2x2 DataFrame
 | Row | Col2    | Col3  |
 |---- |-----   -|-------|
 | 1   | 3.14159 | false |
 | 2   | 1.41421 | true  |

The following functions are very useful when working with DataFrames:

df0 = DataFrame(i = 1:10, x = rand(10), 
y = rand(["a", "b", "c"], 10))
head(df0

To load in data from a local CSV file, use the read method from the CSV package (the following are the docs for that package: https://juliadata.github.io/CSV.jl/stable/). The returned object is of the DataFrame type:

// code in Chapter 8\dataframes.jl 
using DataFrames, CSV 
fname = "winequality.csv" 
data = CSV.read(fname, delim = ';') 
typeof(data) # DataFrame 
size(data) # (1599,12)

Here is a fraction of the output:

The readtable method also supports reading in the gzip CSV files.

Writing DataFrame to a file can be done with the CSV.write function, which takes the filename and DataFrame as arguments, for example, CSV.write ("dataframe1.csv", df, delim = ';'). By default, write will use the delimiter specified by the filename extension and write the column names as headers.

Both read and write support numerous options for special cases. Refer to the docs for more information.

To demonstrate some of the power of DataFrames, here are some queries you can do:

Here, we use the .== operator, which does an element-wise comparison. data[:alcohol] .== 9.5 returns an array of Boolean values (true for datapoints, where :alcohol is 9.5, and false otherwise). data[boolean_array, : ] selects those rows where boolean_array is true.

6x2 DataFrame
| Row | quality | x1  |
|-----|---------|-----|
| 1    | 3      | 10  |
| 2    | 4      | 53  |
| 3    | 5      | 681 |
| 4    | 6      | 638 |
| 5    | 7      | 199 |
| 6    | 8      | 18  |

Another easy way to get the distribution among quality is to execute the histogram hist function, hist(data[:quality]), which gives the counts over the range of quality (2.0:1.0:8.0,[10,53,681,638,199,18]). More precisely, this is a tuple with the first element corresponding to the edges of the histogram bins, and the second denoting the number of items in each bin. So there are, for example, 10 wines with quality between 2 and 3.

To extract the counts as a count variable of the Vector type, we can execute _, count = hist(data[:quality])_; this means that we neglect the first element of the tuple. To obtain the quality classes as a DataArray class, we execute the following:

class = sort(unique(data[:quality]))

We can now construct df_quality, a DataFrame type with class and count columns as df_quality = DataFrame(qual=class, no=count). This gives the following output:

6x2 DataFrame
| Row | qual | no  |
|-----|------|-----|
| 1   | 3    | 10  |
| 2   | 4    | 53  |
| 3   | 5    | 681 |
| 4   | 6    | 638 |
| 5   | 7    | 199 |
| 6   | 8    | 18  |

To deepen your understanding and learn about the other features of Julia's DataFrames (such as joining, reshaping, and sorting), refer to the documentation available at http://juliadata.github.io/DataFrames.jl/latest/index.html.