Chapter 12. Data Visualization Using ggplot2

In this chapter, we will cover the following recipes:

Creating bar charts
Creating multiple bar charts
Creating a bar chart with error bars
Visualizing the density of a numeric variable
Creating a box plot
Creating a layered plot with a scatter plot and fitted line
Creating a line chart
Graph annotation with ggplot

Introduction

In this chapter, we will mainly use the ggplot2 library to visualize data with common graphs, such as bar charts, histograms, box plots, and scatter plots. The ggplot2 library implements the Grammar of Graphics, which lets the user build almost any kind of graph by combining layers. At the end of this chapter, we will see how to annotate graphs using the theme option of the ggplot2 library.
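To make the idea of layers concrete before the recipes begin, here is a minimal sketch. It assumes the ggplot2 package is installed and uses R's built-in mtcars data purely as a stand-in (it is not the dataset used in this chapter); the object name p is only for illustration.

```r
library(ggplot2)

# Start from the data and the aesthetic mappings, then add layers with `+`.
# mtcars is a built-in dataset used here only as a placeholder.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                           # layer 1: scatter plot
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # layer 2: fitted line

print(p)
```

Each call after the + sign contributes one layer, so removing or swapping a geom changes the graph without touching the rest of the specification.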

Throughout this chapter, we will use a single dataset. We will simulate the dataset with some numeric and categorical variables so that we can use the same dataset for each recipe.

There are four numeric variables, namely disA, disB, disC, and disD. Here, disA and disD are correlated: disD is produced by taking the values of disA and adding a random error drawn from a normal distribution with a mean of zero and a standard deviation of three. There are three categorical variables that represent the age category, sex, and economic status. In the following code snippet, we will create the dataset as described. First, we set a seed value so that the same data is generated on any computer and on every run (given the same R version and random number generator settings). Here is the code:

# Set a seed value to make the data reproducible
set.seed(12345)

# Three numeric variables and three categorical variables, 100 rows each
ggplotdata <- data.frame(
  disA = rnorm(n = 100, mean = 20, sd = 3),
  disB = rnorm(n = 100, mean = 25, sd = 4),
  disC = rnorm(n = 100, mean = 15, sd = 1.5),
  age = factor(sample(c(1, 2, 3, 4), size = 100, replace = TRUE),
               levels = c(1, 2, 3, 4),
               labels = c("< 5yrs", "5-10 yrs", "10-18 yrs", "18 +yrs")),
  sex = factor(sample(c(1, 2), size = 100, replace = TRUE),
               levels = c(1, 2), labels = c("Male", "Female")),
  econ_status = factor(sample(c(1, 2, 3), size = 100, replace = TRUE),
                       levels = c(1, 2, 3),
                       labels = c("Poor", "Middle", "Rich"))
)

# Fourth numeric variable: disA plus normal noise (mean 0, sd 3)
ggplotdata$disD <- ggplotdata$disA + rnorm(100, sd = 3)
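As a quick check on how strongly disA and disD should be related, the following sketch works out the correlation implied by the construction and compares it with a simulated value. It is a stand-alone illustration under stated assumptions: the names a, d, n, r_theory, and r_sample are hypothetical, and the seed and sample size differ from the chapter's dataset, so the simulated numbers will not match ggplotdata exactly.

```r
# Hypothetical stand-alone check; `a`, `d`, and `n` are illustration names only.
set.seed(1)                        # any seed works; chosen only for repeatability
n <- 10000                         # larger than the chapter's 100 to reduce sampling noise
a <- rnorm(n, mean = 20, sd = 3)   # plays the role of disA
d <- a + rnorm(n, sd = 3)          # plays the role of disD: disA plus N(0, 3) noise

# Correlation implied by the construction: sd_a / sqrt(sd_a^2 + sd_noise^2)
r_theory <- 3 / sqrt(3^2 + 3^2)
r_sample <- cor(a, d)
print(c(theory = r_theory, sample = r_sample))

# If the chapter's data frame has already been created, inspect it the same way
if (exists("ggplotdata")) {
  str(ggplotdata)
  print(cor(ggplotdata$disA, ggplotdata$disD))
}
```

Because the noise has the same standard deviation as disA, the implied correlation is 1/sqrt(2), roughly 0.71; with only 100 rows, the value observed in ggplotdata can differ noticeably from that figure.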

We will use this data throughout the chapter. We will mainly use the ggplot2 library; where a recipe requires another library, we will mention it in that recipe's section. So, let's start with the actual recipes.