In this section, we'll be discussing the central limit theorem, which is essential to our understanding of the normal distribution. The normal distribution is important to the study of even basic statistics in data science; data science, at its heart, is mathematical, and with this topic we're transitioning away from the technical aspects of Haskell and file formats. First, we'll look at the central limit theorem, then introduce the normal distribution, and then explore the parameters of the normal distribution. So, let's start with the definition of the central limit theorem, as given on Wikipedia.
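Wikipedia's exact wording changes over time, but the standard statement runs roughly as follows: if $X_1, X_2, \ldots, X_n$ are independent, identically distributed random variables with mean $\mu$ and finite variance $\sigma^2$, then as $n$ grows, the distribution of the standardized sample mean approaches the standard normal distribution:

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$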
Now, I realize that's a mouthful, but let's see if I can convert this definition into some plain English. Imagine that you have a large data source (and we've worked with several datasets that are sufficiently large) and you need to calculate the mean and the standard deviation. Doing that over the entire dataset at once isn't practical. Instead, we're going to split that data into smaller chunks, and we're going to take the average of each of those smaller chunks.
We're going to count how many times we see each average, and then we're going to plot that information. Once you have that plot, you'll always get something that looks like the normal distribution, and this happens regardless of what your original data source is. Now, let's go over to our Jupyter Notebook, create a notebook entitled Normal, and get a few imports ready to go.
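The exact import list depends on your setup; a plausible set, assuming Graphics.EasyPlot as the plotting library (substitute whichever plotting library you've been using), looks like this:

```haskell
import Data.List (sort, group) -- sorting and grouping, used for run-length encoding
import System.Random           -- random number generation
import Graphics.EasyPlot       -- plotting (an assumption; adapt to your own setup)
```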
The one import here that you may not be familiar with is System.Random, which is for generating random numbers. So, to begin, we need to create our random number generator.
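One way to get a generator is to use the global one from System.Random (a fixed seed via mkStdGen would work just as well):

```haskell
g <- getStdGen -- the global standard generator; use mkStdGen 42 for reproducible runs
```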
This is the generator we're going to use to produce our random numbers. Next, we will generate 100,000 values in the range of 1 to 100.
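A sketch using randomRs (the name values is our own choice):

```haskell
-- take the first 100,000 values from an infinite stream of random Doubles in [1, 100]
let values = take 100000 (randomRs (1, 100) g) :: [Double]
```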
We have used a function called randomRs, which generates an infinite stream of random values, and we're going to take the first 100,000 from that list. So, randomRs generates our random stream over the range from 1 to 100. You then have to pass in the random number generator, g, which we defined earlier, and we specify that these values are all going to be of type Double. Let's look at the first 10 values in our list.
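In the notebook, that's just:

```haskell
take 10 values
-- ten Doubles between 1 and 100; the exact values depend on the generator's seed
```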
You can see that they're all Double values with long floating-point portions, and that they're the first 10 values. What we need to do now is divide this dataset up into chunks; for that, we'll generate 10,000 chunks, each of size 10. So, let's create a quick function to do this.
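A minimal sketch, matching the description that follows:

```haskell
chunk :: [a] -> [[a]]
chunk [] = []                               -- an empty list produces no chunks
chunk xs = take 10 xs : chunk (drop 10 xs)  -- peel off 10 elements, recurse on the rest

let chunkedValues = chunk values
```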
This function is called chunk and, if it takes an empty list, it'll just return an empty list; but, if it sees a list with data in it, it's going to take the first 10 items of that list and then recursively call itself on the remainder, dropping those first 10 items. So, this is going to generate our list of lists, where each sublist is 10 elements long. Now, let's check the length of chunkedValues.
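In the notebook:

```haskell
length chunkedValues
-- 10000
```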
We have 10,000, which is what we were expecting: 10,000 times 10 is 100,000, our original number of values. Next, we need to compute the average of each of these chunks.
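Assuming a small average helper (you may already have one from an earlier chapter; a simple version is included here):

```haskell
-- arithmetic mean of a list of Doubles
average :: [Double] -> Double
average xs = sum xs / fromIntegral (length xs)

let averages = map average chunkedValues
```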
We have taken the average of each of these sublists, so we now need to record how often we see each average. To do this, we are going to cheat a little bit (only a little) and round each average to the nearest integer. We then sort the list of values and perform run-length encoding, using the function we wrote back in Chapter 1, Descriptive Statistics.
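A sketch of that helper, pairing each distinct value in a sorted list with its count (the Chapter 1 version may differ in its exact signature), together with the definition of pairs:

```haskell
-- run-length encoding of a sorted list: each distinct element paired with its count
runLengthEncoding :: Eq a => [a] -> [(a, Int)]
runLengthEncoding = map (\grp -> (head grp, length grp)) . group

let pairs = runLengthEncoding (sort (map round averages :: [Integer]))
```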
So, pairs is the result of applying runLengthEncoding to the sorted list of our rounded averages. Now, let's plot this to the screen.
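Assuming the plot and Data2D interface from Graphics.EasyPlot (adapt this to whatever plotting library you're using):

```haskell
-- plot each (rounded average, count) pair; both components must be Doubles
plot (PNG "normal.png") $ Data2D [Title "Normal"] [] $
    (map (\(x, y) -> (fromIntegral x, fromIntegral y)) pairs :: [(Double, Double)])
```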
Notice that we map a quick little conversion function over the pairs before plotting; here is why. Our run-length encoding algorithm produces integral values, and the plot function only knows how to plot Double values, so we have to convert from integral. Thus, we use fromIntegral on each component of the pairs and pass the result to plot. Since this is a randomly generated dataset, we don't know exactly what will be displayed, but the overall shape is predictable.
Now, according to the central limit theorem, whatever is displayed should be approximately the normal distribution. The characteristic shape of the normal distribution is a high center in the middle, tapering off fairly evenly on both sides, and that is the shape our plot approximates. Looking closely at the approximated normal distribution, you will see that it's jagged, because our data is random and finite rather than perfect. What we would like to do now is to introduce the actual normal distribution.
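For reference, the density of the normal distribution, whose two parameters are the mean $\mu$ and the standard deviation $\sigma$, is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

These are the parameters we'll explore next.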