Using SQLite3 SELECT and the DescriptiveStats module for descriptive statistics

This section will serve as our review of the SQLite3 SELECT queries and our DescriptiveStats module. In this section, we're going to take a look at loading data from our MovieLens dataset into an IHaskell notebook. We're going to be demonstrating the built-in avg function in SQLite3 and a simple inner table join in SQLite3. We're also going to be using SQLite3 and descriptive stats to find the highest median across three genres of movies. We want to know which genre is considered the highest-rated: action, drama, or sci-fi movies. In the previous section, we had just created the movies.sqlite3 database. What we need to do now is to copy that file into our data folder, and, if we look at this folder, we should see our movies.sqlite3 file, as shown in the following example:

Now let's go back and create a MovieLens notebook in our IHaskell environment. Here is where we're going to study the MovieLens dataset. So, let's begin by loading our DescriptiveStats model and importing our central libraries: Data.List, Database.HDBC, Database.HDBC.SQLite3, and DescriptiveStats, as shown in the following example:

Next, we need to create a connection to our database, as shown in the following example:

Now, if you've used databases for any length of time, you've probably already run into the avg function for computing means. It's built into SQL. So let's go ahead and demonstrate that now, in the following way:

So, we have done a quick query on our database and selected the average rating from our dataset, and we can see that the average rating is 3.53. This rating is across all movies. Now, let's change avg to say something such as median, which is another popular descriptive statistic, as shown in the following example:

We see that median doesn't exist. Let's try again, with another popular one: the mode, as shown in the following example:

Again, it doesn't exist. It gives us an error. So, this addresses the importance of having a descriptive stats library, such as the one we made in Chapter 1, Descriptive Statistics. So, let's create a dataset consisting of all of the ratings for movies that are of the action genre. Our genre information is in the items table, but our ratings are in the data table. So, what we have to do is perform an inner join in order to align these two datasets. This is fairly straightforward to do using a little bit of SQLite3 SQL query. So, we are going to call this ratingsAction, as shown in the following screenshot:

So, we have done a query on our database. We're going to do a SELECT data.ratings. Anytime we do a table join, we prefix all of our columns with the tables. So, we do table.column. In our case, it's data.rating. We're then pulling from data and items, and we want to know any place where our data.itemid field matches our items.movieid field, and where the items.action field is equal to 1.

Now let's check the length of that dataset, as shown in the following example:

Of the 100,000 ratings in our dataset, it looks like over 25,000 ratings represent action movies. Let's create the readColumn partial function that we've used previously, as shown in the following example:

So, readColumn is a mapping from SQL. Now, let's compute our mean, as shown in the following example:

We are reading in the column information. We then did a head . transpose of our ratings of action movies, and parsed that into a list of Doubles. As we recall from earlier, the overall average of our ratings is 3.53, but, for the action movies, it's 3.48. Let's try this again, with the median, as demonstrated in the following screenshot:

So, we have taken our previous statement and changed the mean to median, and we see that the median is 4. Let's do this again, but, instead of action movies, let's check for drama movies. To do this, we are going to modify our earlier statements and we are going to call this ratingsDrama, as shown in the following example:

Let's compute the mean of our drama movies, which will give us the following result:

We can see it's 3.69. Now let's check the median, as demonstrated in the following example:

Once again it's 4. Finally, we will do this again for sci-fi movies, as shown in the following example:

Now, let's pull our mean of the sci-fi movies, and we should see the following code:

We got 3.56. Finally, the median of the sci-fi movies should give us the following result:

So, we've discovered that the median rating for all three genres that we decided to look at in this video was 4, and that they all had slightly different mean values. The highest mean was from the drama category followed by the sci-fi category, and then the action category. So, more users appreciate our dramas than our sci-fi or action movies. In our next section, we will be creating compelling visualizations of the MovieLens dataset.