Application of the KDE

This section serves as our real-world application of the kernel density estimator (KDE). We're going to take a look at the Monet dataset. Monet was a famous French impressionist painter, and many of his paintings have sold at auction for millions of dollars. The Monet dataset is a record of the final auction prices of his paintings. We'll review the parts of the kernel density estimator function, and then use it to answer the following question: what is the probability that, in the future, a Monet painting will sell for 5 million dollars or more? Let's do a Google search for Monet paintings, as illustrated by the following screenshot:

This is my excuse for putting beautiful Monet paintings in our book. Aren't those lovely? Hopefully, you recognize a few of them. These are the paintings whose auction prices we're going to discuss. Now let's go ahead and find the dataset, as demonstrated by the following screenshot:

We are looking for the link Econometric Analysis, 7th Edition and 8th Edition, Data Sets. Let's click on it and scroll down to look for the option for Monet datasets, as shown in the next screenshot:

It's Table F4.1. There are several columns in this dataset, but we're only interested in the first one, Sale Price, which is in millions of dollars. We're going to download the CSV version of the dataset, as seen in the preceding screenshot. Now let's rename the downloaded file from TableF4.1.csv to monet.csv, as shown in the following example:

Let's open the file using the command vi monet.csv, as shown in the following code:

As you can see, we have the header line and the prices. There's one thing that I want to point out: prices of less than a million have no leading 0 on their value, and this appears throughout the dataset. So, we are going to introduce a quick fix. First, we will remove the header line and save the file. Now, let's go back to our terminal and type the following code:

We have written a quick sed script. The -i flag means that the change is made in place on the file. The substitution matches a dot at the very beginning of a line and replaces it with a 0 followed by a dot (a pattern along the lines of s/^\./0./). Now let's go and check our file, which should give us the following result:

As you can see, all of those values that had just a dot at the beginning now have a 0 at the beginning. Now let's go ahead and copy this file into our data folder, as shown in the following example:

Let's now go back to our notebook and play with the imports a little bit. We need to load our CSV library and also add several imports, as demonstrated in the following example:

We have imported Data.Maybe, MyCSV, and Text.CSV, and that should do the trick. Now, to save a little time, scroll down to where we defined the normal function in our notebook and rerun that line, just to make sure it's still defined. Now let's introduce the KDE function, as shown in the following example:

Now, this is a review of the last section. We're passing in only one dataset and returning a Maybe holding two lists of Double values: the first is the domain and the second is the range, where the range represents the adjusted KDE. As you recall from the last section, we set the low end of the domain to the lowest value in the data minus 5, and the high end to the highest value plus 5. We then follow the steps of the KDE as we did earlier, finishing by dividing the shape of the KDE by its sum so that the adjusted values add up to 1.
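As a rough illustration of the steps just reviewed, here is a minimal sketch of a one-dataset KDE. It is not the listing from the previous section: the names normalPdf and kdeSketch, the grid spacing of 0.1, and the use of a standard normal kernel with a bandwidth of 1 are all assumptions made for this example.

```haskell
-- Standard normal density. This stands in for the `normal` function
-- defined earlier in the notebook (its exact form is an assumption).
normalPdf :: Double -> Double
normalPdf x = exp (negate (x * x) / 2) / sqrt (2 * pi)

-- Hypothetical one-dataset KDE. Returns Nothing for an empty dataset,
-- otherwise Just (domain, adjusted), where `adjusted` sums to 1.
kdeSketch :: [Double] -> Maybe ([Double], [Double])
kdeSketch [] = Nothing
kdeSketch xs = Just (domain, adjusted)
  where
    step     = 0.1                -- grid spacing (assumed)
    lo       = minimum xs - 5     -- low end: smallest value minus 5
    hi       = maximum xs + 5     -- high end: largest value plus 5
    steps    = floor ((hi - lo) / step) :: Int
    domain   = [lo + step * fromIntegral i | i <- [0 .. steps]]
    -- Shape of the KDE: one kernel per observation, summed at each grid point.
    shape    = [sum [normalPdf (d - x) | x <- xs] | d <- domain]
    -- Divide by the total so that the values can be read as probabilities.
    adjusted = map (/ sum shape) shape
```

Because the adjusted values sum to 1, adding up the entries over any part of the domain gives an estimated probability for that part.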

Now let's go ahead and grab our data from the Monet dataset, as shown in the following example:

Now, we need to read those Monet prices. Hopefully you did the sed trick that added a leading 0 to some of the values. If you didn't, this next step isn't going to work, because a value such as .5 will not parse as a Double. So, we're going to get our Monet prices by reading column 0 of the Monet data and parsing it as a list of Double, as shown in the following code:
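To see why the leading 0 matters, here is a small sketch that uses only the base library rather than the MyCSV module. The names addLeadingZero and parsePrices are made up for this example, and the input is assumed to be the column already extracted as a list of strings.

```haskell
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- Prepend a 0 to a field that starts with a dot, so ".5" becomes "0.5".
-- This mirrors the sed fix, applied to a single field.
addLeadingZero :: String -> String
addLeadingZero s@('.' : _) = '0' : s
addLeadingZero s           = s

-- Parse each field as a Double. Fields that fail to parse are dropped
-- silently, which is a simplification made for this sketch.
parsePrices :: [String] -> [Double]
parsePrices = mapMaybe readMaybe
```

For example, parsePrices ["3.2", ".5"] keeps only 3.2, because ".5" is not accepted as a Double, whereas parsePrices (map addLeadingZero ["3.2", ".5"]) keeps both values.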

Without doing anything to this dataset, let's take a moment to plot what this dataset looks like, as shown in the following example:
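The plotting call itself depends on the charting library in use, so it is left out here. As a minimal sketch, the points for such a plot can be built by sorting the prices and pairing each one with its position, starting from 1; a second helper counts how many prices fall under a threshold. The names rankedPrices and countBelow are illustrative only.

```haskell
import Data.List (sort)

-- Pair each price with its position in the sorted list, starting from 1.
-- The first component is the x value and the second is the price.
rankedPrices :: [Double] -> [(Double, Double)]
rankedPrices prices = zip [1 ..] (sort prices)

-- Count how many prices are strictly below the given threshold.
countBelow :: Double -> [Double] -> Int
countBelow threshold = length . filter (< threshold)
```

With the Monet prices loaded, countBelow 5 applied to them gives the number of paintings that sold for less than 5 million dollars.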

We have sorted the Monet prices and plotted them against their position in the sorted list, starting from 1, so you get a brief idea of what the smallest and largest values are. This is shown on the following graph:

You can see here that we have over 400 observations, and that the y axis is the sale price in millions of dollars. Some of these get to over 30 million dollars, but, as you can see, most of them are below 5 million dollars. It looks like 370 or so paintings sold for less than 5 million dollars and the rest for more. This gives you an idea of how the data is spread, but it doesn't describe the shape of the distribution; the KDE does that. So, let's compute that KDE now, as shown in the following example:

There we go. Let's go ahead and plot this, using the following code:

The following graph shows the actual shape of the data:

You can see that there is a strong peak in the dataset, just after 0. Note that the x axis has changed meaning: it now represents a dollar amount in millions. To the left of 0, you can see that our continuous function assigns some probability to negative sale prices, which cannot happen for a real Monet painting.

You could simply cut the estimate off at 0, but the question that we set at the beginning of this section concerns the other end: what is the probability that a painting will sell for 5 million dollars or more? To answer it, we find all of the domain values that are 5 or greater, look up the corresponding probabilities, and add those probabilities together. So, let's find the indices at which the price of the painting is 5 or greater, as shown in the following example:

findIndices is a wonderful little function in the Data.List library. We use it to search for domain values greater than or equal to 5. We then map those indices to the adjusted KDE and compute the sum, as demonstrated in the following code:

We have passed map a function that looks up the adjusted KDE value at each index x from our indices list. We get a probability of almost 0.2, or almost 20%. So, according to this estimate, if a Monet painting comes up for sale in the future, there is roughly a 0.2 probability that it will go for 5 million dollars or more. Hopefully, that gives you an idea of the utility of the kernel density estimator.
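The index lookup and sum described above can be sketched as a single function. The name tailProbability is made up for this example, and the function assumes that the domain and the adjusted KDE are lists of the same length.

```haskell
import Data.List (findIndices)

-- Sum the adjusted KDE values at every index whose domain value is
-- greater than or equal to the threshold. Assumes both lists have the
-- same length; (!!) would fail on an out-of-range index otherwise.
tailProbability :: Double -> [Double] -> [Double] -> Double
tailProbability threshold domain adjusted =
  sum (map (adjusted !!) indices)
  where
    indices = findIndices (>= threshold) domain
```

On a toy example, tailProbability 5 [0, 2.5, 5, 7.5] [0.4, 0.4, 0.1, 0.1] adds the last two entries, giving a result of about 0.2. Indexing with (!!) is linear in the index, so for long lists zipping the domain with the adjusted values and filtering the pairs would be a cheaper way to get the same sum.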