Histograms

Histograms are used to assess the probability distribution of a variable by plotting how frequently observations fall within given ranges of values. They were first described by Karl Pearson. In their simplest form, histograms plot the frequency of a variable across ranges of values called bins. We start this chapter with histograms because they are among the simplest graphs and accommodate only one variable. Adding a density curve makes them more informative, but let's begin with the basic form. We will use the Class dataset that has been used extensively in the previous chapters:

Proc SGPLOT Data = Class;
  Histogram Height;
  Title 'Height of children in class across years';
Run;

The only change to the Class dataset is that the Weight variable has been renamed Weights in this chapter.

This produces the following diagram:

Of the 12 observations in the dataset, one child has a height of 70. This observation falls in the first bin and accounts for around 8% of all observations (1 of 12). As the preceding diagram shows, around 40% of observations have a height between 80 and 90 units. Given the spread of the Height variable, the height of 70 for ClassID B5324 looks like an outlier. Let's remove the observation using the following code:

Proc SGPLOT Data = Class;
  Histogram Height;
  Title 'Height of children in class across years';
  Where ClassID ne 'B5324';
Run;

We get the following diagram as the output:

Although we have removed the outlier from the overall population, the Class dataset contains observations from both 2013 and 2019. Because these are school-going children, their heights will certainly have changed over those six years. We will use the following code to create a panel of histograms, one for each observation year:

Proc SGPANEL Data=Class;
  Panelby Year / Rows=2 Layout=Rowlattice;
  Histogram Height;
Run;

The following diagram clearly illustrates the difference in heights across the observation years:

The Panelby statement creates one histogram per value of Year in the panel. Rows specifies the number of rows in the panel layout; we set it to 2 because the data has two year groups. In our dataset, the heights do not overlap across the years. Looking at the lower panel, it may seem that the growth spurt has come to an end and that the children are no longer getting taller. However, that's not true. For 2019, the x-axis groups height values spanning a wide band into just two bins, each holding 50% of the observations. This doesn't mean that all the children in 2019 have the same height; the wide bins simply hide the variation within each one.
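One way to confirm that the 2019 heights are not all identical is to look at summary statistics per year. The following is a minimal sketch, assuming the Class dataset holds the Year and Height variables used above; it is a numeric cross-check, not part of the original example.

```sas
/* Summarize Height separately for each observation year */
Proc MEANS Data=Class N Min Mean Max Std;
  Class Year;   /* one row of statistics per year */
  Var Height;
Run;
```

A nonzero standard deviation for 2019, or a gap between its minimum and maximum, would show that the heights vary even though the histogram displays only two bins.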

A visualization should aid the correct interpretation of the data. On some occasions you won't be able to present both the data and the graph to your end users, so the graph alone should make it relatively easy to draw inferences and insights.

One way to make the histogram easy to interpret is to control the number and size of the bins on the x-axis. You can use the following code to achieve that:

Proc SGPLOT Data = Class;
  Histogram Height / Binstart=70 Binwidth=.5 Scale=count;
  Title 'Height of Class in Customized Bins';
Run;

The following diagram is an extreme example: the bin width is so small that we end up with as many bins as observations. We could also have used the NBins option to control the number of bins.
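As a sketch of the NBins alternative, the following step requests a fixed number of bins instead of a bin width. The value 5 and the title text are arbitrary choices for illustration and are not taken from the original example.

```sas
Proc SGPLOT Data = Class;
  /* NBins sets the number of bins; SAS derives the bin width from the data range */
  Histogram Height / NBins=5 Scale=count;
  Title 'Height of Class in Five Bins';
Run;
```

Choosing the number of bins directly is convenient when the range of the variable is not known in advance, whereas Binstart and Binwidth give exact control over where the bins fall.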

While defining histograms, we mentioned that they are useful if we wish to understand the probability distribution of the variable.

Let's add a density curve. In SGPLOT, the Density statement draws a normal density curve by default, with its parameters estimated from the data:

Proc SGPLOT Data = Class;
  Histogram Height;
  Density Height;
  Title 'Height of children in class across years';
Run;

We have managed to plot the following histogram and density curve together in the same graph:

While this chapter isn't about probability distribution functions, we will look at an example that overlays multiple density functions on the same histogram:

Proc SGPLOT Data = Class;
  Histogram Height;
  Density Height;
  Density Height / Type= Kernel;
  Keylegend / Location = Inside Position = TopRight
    Across = 1 Title = 'Density Curves';
  Title 'Height and Density with Multiple Curves';
Run;

The Type option specifies the density function; the available values are Normal and Kernel. The Normal option specifies a normal distribution based on the mean and the standard deviation. The Kernel option specifies a non-parametric kernel density estimate.
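Both Type values accept sub-options in parentheses. The following sketch assumes the Mu= and Sigma= sub-options of Type=Normal and the C= and Weight= sub-options of Type=Kernel, as documented for the SGPLOT Density statement. The numeric values and legend labels are arbitrary illustrations, not estimates from the Class data.

```sas
Proc SGPLOT Data = Class;
  Histogram Height;
  /* Normal curve with user-supplied parameters instead of estimates from the data */
  Density Height / Type=Normal(Mu=85 Sigma=5)
    Legendlabel='Normal (fixed parameters)';
  /* Kernel estimate; C sets the standardized bandwidth, Weight the kernel function */
  Density Height / Type=Kernel(C=0.5 Weight=Normal)
    Legendlabel='Kernel (C=0.5)';
Run;
```

Leaving the sub-options out, as in the chapter's example, lets SAS estimate the parameters from the data, which is usually the sensible starting point.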

The Keylegend statement adds a legend that identifies the plotted curves. It offers various options, such as Location, Position, Across, and Title. The location can be inside or outside the plot area. If you specify Down = 1 instead of Across = 1, the legend entries are laid out in a single horizontal row rather than the stacked list of density function names that can be seen in the following diagram:
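For comparison, a sketch of the horizontal variant replaces Across = 1 with Down = 1; everything else is carried over from the preceding example.

```sas
Proc SGPLOT Data = Class;
  Histogram Height;
  Density Height;
  Density Height / Type= Kernel;
  /* Down = 1 limits the legend to one row, so the entries sit side by side */
  Keylegend / Location = Inside Position = TopRight
    Down = 1 Title = 'Density Curves';
  Title 'Height and Density with Multiple Curves';
Run;
```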

Previously, we created two histograms in the same chart but different panels. But how can you plot their density functions while still being able to compare the histograms? Use the following code to create histograms by group in the same panel and create multiple density plots:

Proc SGPLOT Data=Class;
  Histogram Height / Group=Year Transparency=0.5;
  Density Height / Group=Year;
Run;

This produces the following output:

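The Group option draws one histogram per year in the same panel, and the 50% transparency keeps overlapping bars visible. The Density statement with the same Group option adds a separate density curve for each year.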
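The paneled layout used earlier can carry density curves too. The following is a sketch, assuming the same Class dataset, that adds a Density statement to the earlier SGPANEL step so that each year's panel shows its own default normal curve.

```sas
Proc SGPANEL Data=Class;
  Panelby Year / Rows=2 Layout=Rowlattice;
  Histogram Height;
  Density Height;   /* default normal density, drawn in each panel */
Run;
```

The grouped SGPLOT version makes direct comparison on one axis easier, while the paneled version avoids overlapping bars altogether.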