This section describes tests that apply to continuous random variables. Many important measurements fall into this category, such as times, dollar amounts, and chemical concentrations.
We’ll start off by showing how to use some common statistical tests that assume the underlying data is normally distributed. Normal distributions occur frequently in nature, so this is often a good assumption.[50]
Suppose that you designed an experiment to show that some effect is true. You have collected some data and now want to know if the data supports your hypothesis. One common question is whether the mean of the experimental data is equal to the value that the experimenter expected; the statement that it is equal is called the null hypothesis. Alternatively, the experimenter may ask how much evidence the data provides for an alternative hypothesis. Specifically, suppose that you have a set of observations x1, x2, ..., xn with sample mean x̄ and want to know if the sample mean is significantly different from the null hypothesis mean μ0. Furthermore, assume that the observations are normally distributed. To test the validity of the hypothesis, you can use a t-test. In R, you would use the function t.test:
## Default S3 method:
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
Here is a description of the arguments to the t.test
function.
Argument | Description | Default |
---|---|---|
x | A numeric vector of data values. | |
y | An optional numeric vector of data values. Use y=NULL to compare a single vector of values to the null hypothesis mean mu; supply a second vector to compare vector x to vector y. | NULL |
alternative | A character value specifying the alternative hypothesis: use alternative="two.sided" for a two-sided alternative, alternative="less" for lower, and alternative="greater" for higher. | c("two.sided", "less", "greater") |
mu | A numeric value specifying the mean under the null hypothesis (if testing a single vector of values) or the difference between the means (if comparing the means of two vectors). | 0 |
paired | A logical value indicating whether the vectors are paired. See the next section for a description of how to use paired. | FALSE |
var.equal | A logical value indicating whether the variances of the two vectors are assumed to be equal. If var.equal=TRUE, the pooled variance is used; if var.equal=FALSE, the Welch method is used. | FALSE |
conf.level | The confidence level of the reported interval. | 0.95 |
... | Optional values passed to other methods. | |
Let’s take a look at an example of how you would use the
t.test
function. We’ll use the same
example data that we used in Dot plots. Suppose that
we thought, a priori, that tires of type H should last for
approximately nine hours until failure.[51] We’d like to compare the sample mean of this data to the hypothesized mean and determine whether the difference is statistically significant using a t-test.
To load the sample data, use the following command:
> library(nutshell)
> data(tires.sus)
To begin, let’s extract a vector with the set of values in which we are interested and calculate the sample mean:
> times.to.failure.h <- subset(tires.sus,
+    Tire_Type=="H" &
+    Speed_At_Failure_km_h==160
+    )$Time_To_Failure
> times.to.failure.h
 [1] 10.00 16.67 13.58 13.53 16.83  7.62  4.25 10.67  4.42  4.25
> mean(times.to.failure.h)
[1] 10.182
As you can see, the sample mean for these 10 tests was slightly longer than expected (10.182 hours). We can use the function t.test to check whether this difference is statistically significant:
> t.test(times.to.failure.h, mu=9)
One Sample t-test
data: times.to.failure.h
t = 0.7569, df = 9, p-value = 0.4684
alternative hypothesis: true mean is not equal to 9
95 percent confidence interval:
6.649536 13.714464
sample estimates:
mean of x
10.182
Here’s an explanation of the output from the t.test
function. First, the function shows
us the test statistic (t = 0.7569), the degrees
of freedom (df = 9), and the calculated p-value
for the test (p-value = 0.4684). The p-value means that, if the true mean were 9, the probability of observing a sample mean at least as far from 9 as the one we saw (that is, above 10.182 or below 7.818) would be 0.4684.
The next line states the alternative hypothesis: that the true mean is not equal to 9. Because the p-value is well above 0.05, the test gives us no grounds for accepting this alternative. Next, the t.test function shows the 95% confidence interval for this test, and, finally, it gives the actual sample mean. As a statistician would say, this evidence does not imply that the true mean is not equal to 9.
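If you want to see where these numbers come from, here is a minimal sketch that recomputes the test statistic and p-value by hand, assuming times.to.failure.h is loaded as above:

# One-sample t statistic: (sample mean - hypothesized mean) / standard error
x <- times.to.failure.h
t.stat <- (mean(x) - 9) / (sd(x) / sqrt(length(x)))   # 0.7569
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
2 * pt(-abs(t.stat), df = length(x) - 1)              # 0.4684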
Another common situation is when you have two groups of observations, and you want to know if there is a significant difference between the means of the two groups of observations. You can also use a t-test to compare the means of the two groups.
Let’s pick another example from the tire data. Looking at the characteristics of the different tires that were tested, notice that three of the six tires had the same speed rating: S. Based on this speed rating, we would expect all three tires to last the same amount of time in the test:
> times.to.failure.e <- subset(tires.sus,
+    Tire_Type=="E" & Speed_At_Failure_km_h==180)$Time_To_Failure
> times.to.failure.d <- subset(tires.sus,
+    Tire_Type=="D" & Speed_At_Failure_km_h==180)$Time_To_Failure
> times.to.failure.b <- subset(tires.sus,
+    Tire_Type=="B" & Speed_At_Failure_km_h==180)$Time_To_Failure
Let’s start by comparing the mean times until failure for tires of types D and E:
> t.test(times.to.failure.e, times.to.failure.d)
Welch Two Sample t-test
data: times.to.failure.e and times.to.failure.d
t = -2.5042, df = 8.961, p-value = 0.03373
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.82222528 -0.04148901
sample estimates:
mean of x mean of y
4.321000 4.752857
The results here are similar to the results from the single-sample t-test. In this case, notice that the difference is statistically significant at the 95% confidence level: on average, tires of type D lasted longer than tires of type E (4.75 hours versus 4.32 hours, and the confidence interval for the difference excludes zero).
As another example, let’s compare tires of types E and B:
> t.test(times.to.failure.e, times.to.failure.b)
Welch Two Sample t-test
data: times.to.failure.e and times.to.failure.b
t = -1.4549, df = 16.956, p-value = 0.1640
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5591177 0.1027844
sample estimates:
mean of x mean of y
4.321000 4.549167
In this case, the difference in means was not significant between the two groups (the calculated p-value was 0.1640). Notice that the output in R is otherwise identical; the t.test function doesn’t explicitly say whether the results were significant or not.
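If you want to check significance programmatically instead of reading the printed output, note that t.test returns an object of class "htest" whose components you can extract directly; a minimal sketch:

# t.test returns an "htest" object; pull the p-value out of it
# rather than reading the printed output.
result <- t.test(times.to.failure.e, times.to.failure.b)
result$p.value          # 0.1640
result$p.value < 0.05   # FALSE: not significant at the 95% level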
For two-sample t-tests, you can also use a formula to specify a t-test if the data is included in a data frame, and the two groups of observations are differentiated by a factor:
## S3 method for class 'formula':
t.test(formula, data, subset, na.action, ...)
The formula specifies the variables to use in the test.
As an example, let’s look at data on field goals kicked in the NFL during 2005. Specifically, let’s look at the distance of successful field goals kicked in indoor and outdoor stadiums. Many TV commentators talk about the difficulty of kicking field goals outdoors, due to snow, wind, and so on. But does it make a significant difference in the distance of successful field goals? (Or, for that matter, in bad field goals?) We can use a t-test to find out.
First, let’s put together the data set:
> library(nutshell)
> data(field.goals)
> good <- transform(
+    field.goals[field.goals$play.type=="FG good",
+                c("yards","stadium.type")],
+    outside=(stadium.type=="Out"))
> bad <- transform(
+    field.goals[field.goals$play.type=="FG no",
+                c("yards","stadium.type")],
+    outside=(stadium.type=="Out"))
Now, let’s use the t.test
function to compare the distance of field goals in indoor and outdoor
stadiums:
> t.test(yards~outside, data=good)
Welch Two Sample t-test
data: yards by outside
t = 1.1259, df = 319.428, p-value = 0.2610
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.685112 2.518571
sample estimates:
mean in group FALSE mean in group TRUE
35.31707 34.40034
Although successful field goals averaged about a yard longer in indoor stadiums, the difference is not significant at a 95% confidence level. The same is true for field goals that missed:
> t.test(yards~outside, data=bad)
Welch Two Sample t-test
data: yards by outside
t = 1.2016, df = 70.726, p-value = 0.2335
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.097564 4.425985
sample estimates:
mean in group FALSE mean in group TRUE
45.18421 43.52000
Was there a statistically significant difference in the distances that coaches attempted to kick field goals? Let’s take a look:
> field.goals.inout <-
+   transform(field.goals,
+             outside=(stadium.type=="Out"))
> t.test(yards~outside, data=field.goals.inout)

Welch Two Sample t-test

data:  yards by outside
t = 1.5473, df = 401.509, p-value = 0.1226
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3152552  2.6461541
sample estimates:
mean in group FALSE  mean in group TRUE
           37.14625            35.98080
Again, the difference does not appear to be statistically significant at a 95% level.
Sometimes you are provided with paired data. For
example, you might have two observations per subject: one before an
experiment and one after the experiment. In this case, you would use a
paired t-test. You can use the t.test function, specifying paired=TRUE, to perform this test.
As an example of paired data, we can look at the SPECint2006 results. SPEC is an organization that provides computer performance data using standardized tests. The organization defines a number of different tests for different applications: database performance, web server performance, graphics performance, and so on. For our example, we’ll use a simple metric: the integer performance of different computers on typical desktop computing tasks.
SPEC provides two different types of tests: tests with standard settings and tests that are optimized for specific computers. As an example of paired data, we will compare the unoptimized results (called “baseline”) with the optimized results, to see if there is a statistically significant difference between the results. This data set is a good example of paired data: we have two different test results for each computer system. As an example, we will look only at single-chip, dual-core systems:
> library(nutshell)
> data(SPECint2006)
> t.test(subset(SPECint2006,Num.Chips==1&Num.Cores==2)$Baseline,
+        subset(SPECint2006,Num.Chips==1&Num.Cores==2)$Result,
+        paired=TRUE)

Paired t-test

data:  subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Baseline and subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Result
t = -21.8043, df = 111, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.957837 -1.631627
sample estimates:
mean of the differences
              -1.794732
In this case, we can clearly see that the difference is significant at the 95% confidence level. (This isn’t a very big surprise. It’s well known that optimizing compiler settings and system parameters can make a big difference on system performance. Additionally, submitting optimized results is optional: organizations that could not tune their systems very well probably would not voluntarily share that fact.)
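Incidentally, a paired t-test is equivalent to a one-sample t-test on the within-pair differences; a minimal sketch using the same subset as above:

# Equivalent formulation: test whether the mean Baseline - Result
# difference is zero for single-chip, dual-core systems.
dual.core <- subset(SPECint2006, Num.Chips==1 & Num.Cores==2)
t.test(dual.core$Baseline - dual.core$Result, mu=0)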
To compare the variances of two samples from normal
populations, R includes the var.test
function, which performs an
F-test:
## Default S3 method:
var.test(x, y, ratio = 1,
         alternative = c("two.sided", "less", "greater"),
         conf.level = 0.95, ...)

## S3 method for class 'formula':
var.test(formula, data, subset, na.action, ...)
Let’s continue with the example from above. Is there a difference in the variance of field goal lengths between indoor and outdoor stadiums? Let’s take a look:
> var.test(yards~outside, data=field.goals.inout)
F test to compare two variances
data: yards by outside
F = 1.2432, num df = 252, denom df = 728, p-value = 0.03098
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.019968 1.530612
sample estimates:
ratio of variances
1.243157
As you can see from the output above, the
p-value is less than 0.05, indicating that the
difference in variance between the two populations is statistically
significant. To test whether the variances in each of two or more groups (samples) are the same, you can use Bartlett’s test. In R, this is available through the bartlett.test function:
bartlett.test(x, ...)

## Default S3 method:
bartlett.test(x, g, ...)

## S3 method for class 'formula':
bartlett.test(formula, data, subset, na.action, ...)
Using the same example as above, let’s compare variances of the two groups using the Bartlett test:
> bartlett.test(yards~outside, data=field.goals.inout)
Bartlett test of homogeneity of variances
data: yards by outside
Bartlett's K-squared = 4.5808, df = 1, p-value = 0.03233
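Unlike var.test, bartlett.test can compare more than two groups at once. As a sketch (assuming the tires.sus data from earlier is still loaded), you could test whether the variance of time to failure is the same across all tire types:

# Bartlett's test across all tire types, not just a single pair
bartlett.test(Time_To_Failure ~ Tire_Type, data=tires.sus)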
To compare the means across more than two groups, you can use a method called analysis of variance (ANOVA).[52] ANOVA methods are very important for statistics. A full explanation of ANOVA requires an explanation of statistical models in R, which are covered in Chapter 20.
A simple way to perform these tests is through the aov function:
aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)
As an example, let’s consider the 2006 U.S. mortality data set. (I showed how to load this data set in Using Other Languages to Preprocess Text Files.) Specifically, we’ll look at differences in age at death by cause of death. This is a pretty silly example; clearly, the average age at which people die of natural causes is going to be higher than the age at which they die for other reasons. However, this should help illustrate how the statistic works.
I mapped the disease codes in the original file into readable
values and then summarized causes into a small number of reasons. To
do this, I created a function to translate the numeric codes into
character values. (I grouped together some common causes of death.)
The mort06.smpl
data set is
included in the nutshell
package.
Let’s take a look at the summary statistics for age by cause:
> library(nutshell)
> data(mort06.smpl)
> tapply(mort06.smpl$age, INDEX=list(mort06.smpl$Cause), FUN=summary)
$Accidents
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   31.00   48.00   50.88   73.00  108.00    8.00

$`Alzheimer's Disease`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  40.00   82.00   87.00   86.07   91.00  109.00

$Cancer
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   61.00   72.00   70.24   81.00  107.00

$`Chronic respiratory diseases`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   70.00   78.00   76.37   84.00  106.00    1.00

$Diabetes
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   63.00   75.00   72.43   83.00  104.00    1.00

$`Heart Disease`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   70.00   81.00   77.66   88.00  112.00    4.00

$Homicide
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   22.00   28.00   32.19   42.00   92.00    2.00

$`Influenza and pneumonia`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   76.00   84.00   80.16   90.00  108.00    1.00

$Other
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   60.00   78.00   70.44   87.00  110.00   10.00

$Suicide
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   8.00   32.00   45.00   46.14   57.00   97.00    2.00
Now let’s fit an ANOVA model to the data and show a summary of
the model. To do this in R, we simply need to use the aov
function:
> aov(age~Cause, data=mort06.smpl)
Call:
aov(formula = age ~ Cause, data = mort06.smpl)
Terms:
Cause Residuals
Sum of Squares 15727886 72067515
Deg. of Freedom 9 243034
Residual standard error: 17.22012
Estimated effects may be unbalanced
29 observations deleted due to missingness
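Note that printing an aov object shows sums of squares but no F statistic or p-value. One way to see the full ANOVA table is to call summary on the fitted model; a minimal sketch:

# summary() on an aov fit prints the full ANOVA table,
# including the F statistic and its p-value.
mort06.aov <- aov(age~Cause, data=mort06.smpl)
summary(mort06.aov)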
To get more information on ANOVA results, you can use the model.tables function to print information on aov objects:
## S3 method for class 'aov':
model.tables(x, type = "effects", se = FALSE, cterms, ...)
The argument x
specifies the
model object, type
specifies the
type of results to print, se
specifies whether to compute standard errors, and cterms
specifies which tables should be
compared. As an example, here is the output of model.tables
for the cause of death example
above:
> model.tables(aov(age~Cause, data=mort06.smpl))
Tables of effects
Cause
Accidents Alzheimer's Disease Cancer
-21.41 13.77 -2.056
rep 12363.00 7336.00 57162.000
Chronic respiratory diseases Diabetes Heart Disease Homicide
4.077 0.1343 5.371 -40.1
rep 12386.000 7271.0000 82593.000 1917.0
Influenza and pneumonia Other Suicide
7.863 -1.855 -26.15
rep 5826.000 52956.000 3234.00
As another example of aov, let’s consider weight gain by women during pregnancy:
> library(nutshell)
> data(births2006.smpl)
> births2006.cln <- births2006.smpl[births2006.smpl$WTGAIN<99 &
+    !is.na(births2006.smpl$WTGAIN),]
> tapply(X=births2006.cln$WTGAIN,
+        INDEX=births2006.cln$DOB_MM,
+        FUN=mean)
       1        2        3        4        5        6
30.94405 31.08356 31.29317 31.33610 31.07242 30.92589
       7        8        9       10       11       12
30.57734 30.54855 30.25546 30.43985 30.79077 30.85564
It appears that weight gain is slightly higher for births in the late winter and spring months (peaking in March and April), but is this difference statistically significant? Let’s take a look:
> aov(WTGAIN~DOB_MM, births2006.cln)
Call:
aov(formula = WTGAIN ~ DOB_MM, data = births2006.cln)
Terms:
DOB_MM Residuals
Sum of Squares 14777 73385301
Deg. of Freedom 1 351465
Residual standard error: 14.44986
Estimated effects may be unbalanced
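One caveat with this example: DOB_MM is a numeric column, so aov fits a single linear trend in month, which is why the output shows only 1 degree of freedom for DOB_MM. To compare mean weight gain across the 12 months as separate groups, you would convert the month to a factor, as in this sketch:

# Treat month as a categorical variable (11 degrees of freedom)
# rather than a single linear trend.
aov(WTGAIN~factor(DOB_MM), births2006.cln)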
Often, it’s better to use lm
to fit a linear model and then use the anova
function to extract information about
analysis of variance. For large models, it is often more efficient to
use the update
function to change
an existing model than to create a new model from scratch. See Example: A Simple Linear Model for more information on the lm
function, model objects, and the update
function. The anova
function presents results slightly
differently than the aov
function,
as you can see in this example:
> mort06.smpl.lm <- lm(age~Cause, data=mort06.smpl)
> anova(mort06.smpl.lm)
Analysis of Variance Table

Response: age
              Df   Sum Sq Mean Sq F value    Pr(>F)
Cause          9 15727886 1747543  5893.3 < 2.2e-16 ***
Residuals 243034 72067515     297
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA calculations assume that the variance is equal across
groups. When you know this is not true (or suspect this is not true),
you can use the oneway.test
function to calculate whether two or more samples have the same
mean:
oneway.test(formula, data, subset, na.action, var.equal = FALSE)
This is similar to calling t.test
with var.equal=FALSE
(and the calculation method
was also developed by Welch).
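As a sketch, here is how you could apply oneway.test to the mortality data used above, without assuming equal variances across causes of death:

# Welch-style one-way comparison of mean age at death by cause
oneway.test(age~Cause, data=mort06.smpl)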
There are other functions for printing information about
aov
objects: proj
returns a list of matrices giving the
projections of data onto the linear model, TukeyHSD
returns confidence intervals on the
differences between the means of the levels of a factor with the
specified family-wise probability of coverage, and se.contrast
returns the standard errors for
one or more contrasts in an aov
object.
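As an illustration of TukeyHSD, here is a sketch (using the tires.sus data from earlier) that computes confidence intervals on all pairwise differences in mean time to failure among the tire types:

# Tukey's honest significant differences for all pairs of tire types
TukeyHSD(aov(Time_To_Failure~Tire_Type, data=tires.sus))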
Sometimes, you’re not interested in just whether there
is a difference across groups, but would like to know more details
about the differences. One way to do this is by performing a
t-test between every pair of groups. To do this
in R, you can use the pairwise.t.test
function:
pairwise.t.test(x, g, p.adjust.method = p.adjust.methods,
                pool.sd = !paired, paired = FALSE,
                alternative = c("two.sided", "less", "greater"),
                ...)
This function calculates pairwise comparisons among group levels
with corrections for multiple testing. The argument x
specifies a vector of numeric values, and
g
specifies a factor that is used
to group values. The argument pool.sd
specifies whether to calculate a
single standard deviation value across all groups and use this for the
test.
As an example, let’s return to the tire data that we used in the
example above. When we looked at the t.test
function, we created three different
vectors for the different types of tires. Here we’ll just use the
pairwise t-test to compare all the tires by
type:
> pairwise.t.test(tires.sus$Time_To_Failure, tires.sus$Tire_Type)
Pairwise comparisons using t tests with pooled SD
data: tires.sus$Time_To_Failure and tires.sus$Tire_Type
B C D E H
C 0.2219 - - - -
D 1.0000 0.5650 - - -
E 1.0000 0.0769 1.0000 - -
H 2.4e-07 0.0029 2.6e-05 1.9e-08 -
L 0.1147 1.0000 0.4408 0.0291 0.0019
P value adjustment method: holm
As you can see, there is no statistically significant difference between the means of a few pairs of groups (such as C and L, D and L, or D and E), but there is a significant difference between some others (such as B and H, C and H, or E and L).
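By default, pairwise.t.test adjusts the p-values using Holm’s method (as the last line of the output notes) to correct for multiple testing. To see how much the correction matters, here is a sketch repeating the comparison with no adjustment:

# Same comparison without multiple-testing correction;
# the unadjusted p-values are smaller.
pairwise.t.test(tires.sus$Time_To_Failure, tires.sus$Tire_Type,
                p.adjust.method="none")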
To test whether a set of data values comes from a normal distribution, you can use the Shapiro-Wilk test for normality through the shapiro.test function:
shapiro.test(x)
Using the example above, let’s look at field goal lengths in the NFL in 2005. Was the distribution of field goal lengths normally distributed? My first instinct is to take a look at the distribution using a histogram or a quantile-quantile plot. Here is some R code to plot both, side by side:
> par(mfcol=c(1, 2), ps=6.5)
> hist(field.goals$yards, breaks=25)
> qqnorm(field.goals$yards, pch=".")
The plot is shown in Figure 18-1. It seems plausible that the distribution was normal: the distribution is roughly bell-curve shaped, and the quantile-quantile plot is roughly linear. To get a more rigorous answer, we can use the Shapiro-Wilk test. Here’s what the Shapiro-Wilk test tells us:
> shapiro.test(field.goals$yards)
Shapiro-Wilk normality test
data: field.goals$yards
W = 0.9728, p-value = 1.307e-12
As you can tell from the tiny p-value, we can reject the null hypothesis of normality; it is quite unlikely that this data came from a normal distribution.
You can use the Kolmogorov-Smirnov test to see if a vector came from an arbitrary probability distribution (not just a normal distribution):
ks.test(x, y, ..., alternative = c("two.sided", "less", "greater"), exact = NULL)
The argument x
specifies the
test data. The argument y
specifies
the arbitrary distribution; it can be a vector of data values, a
character name for a probability distribution function, or a
distribution function. (You can pass along additional arguments to the distribution function.)
The alternative
argument allows you
to specify the alternative hypothesis, and the exact
value specifies whether to calculate
exact values (for large x and
y) or approximations.
Using the example above, we can use the ks.test
function. We’ll specify the normal distribution (using
the pnorm
function):
> ks.test(field.goals$yards, pnorm)
One-sample Kolmogorov-Smirnov test
data: field.goals$yards
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
Warning message:
In ks.test(field.goals$yards, pnorm) :
cannot compute correct p-values with ties
Notice the warning message; ties are extremely unlikely for values from a true normal distribution. If there are ties in the data, that is a good sign that the test data is not actually normally distributed, so the function prints a warning.
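Also note that passing pnorm with no additional arguments compares the data against a standard normal distribution (mean 0, standard deviation 1). Because every field goal length is far above 0, the empirical and theoretical distributions barely overlap, which is why D = 1. A fairer sketch passes the sample mean and standard deviation through the ... argument:

# Compare against a normal distribution with the sample's own
# mean and standard deviation (extra arguments are passed to pnorm).
ks.test(field.goals$yards, pnorm,
        mean=mean(field.goals$yards),
        sd=sd(field.goals$yards))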
The Kolmogorov-Smirnov test can also be used to test the
probability that two data vectors came from the same distribution. As
an example, let’s look at the SPECint2006 data that we saw in Comparing paired data. What is the probability that the
benchmark data and the optimized data come from the same distribution?
We’ll compare the benchmark and optimized data using the ks.test
function, adding some jitter to the
values to suppress the warning about ties:
> ks.test(jitter(subset(SPECint2006, Num.Chips==1&Num.Cores==2)$Baseline),
+         jitter(subset(SPECint2006, Num.Chips==1&Num.Cores==2)$Result))

Two-sample Kolmogorov-Smirnov test

data:  jitter(subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Baseline) and jitter(subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Result)
D = 0.2143, p-value = 0.01168
alternative hypothesis: two-sided
The p-value of this test was 0.01168, which is much less than 0.05. So this test implies that it is unlikely that these two samples came from the same distribution.
The functions in Correlation and Covariance simply compute the degree of
correlation between pairs of vectors, but they don’t tell you if the
correlation is significant. If you’d like to check whether there is a
statistically significant correlation between two vectors, you can use the cor.test
function:
## Default S3 method:
cor.test(x, y,
         alternative = c("two.sided", "less", "greater"),
         method = c("pearson", "kendall", "spearman"),
         exact = NULL, conf.level = 0.95, ...)

## S3 method for class 'formula':
cor.test(formula, data, subset, na.action, ...)
For example, let’s look at how this function works on two obviously correlated vectors:
> cor.test(c(1, 2, 3, 4, 5, 6, 7, 8),
+          c(0, 2, 4, 6, 8, 10, 11, 14))

Pearson's product-moment correlation

data:  c(1, 2, 3, 4, 5, 6, 7, 8) and c(0, 2, 4, 6, 8, 10, 11, 14)
t = 36.1479, df = 6, p-value = 2.989e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9868648 0.9996032
sample estimates:
     cor
0.997712
And two less correlated vectors:
> cor.test(c(1, 2, 3, 4, 5, 6, 7, 8),
+          c(5, 3, 8, 1, 7, 0, 0, 3))

Pearson's product-moment correlation

data:  c(1, 2, 3, 4, 5, 6, 7, 8) and c(5, 3, 8, 1, 7, 0, 0, 3)
t = -1.2232, df = 6, p-value = 0.2671
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8757371  0.3764066
sample estimates:
       cor
-0.4467689
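The method argument selects the correlation measure. As a sketch, here is the second example rerun with Spearman’s rank correlation, a non-parametric alternative that uses only the ranks of the values:

# Spearman's rank correlation; with tied values (the two 0s),
# R warns that it cannot compute an exact p-value.
cor.test(c(1, 2, 3, 4, 5, 6, 7, 8),
         c(5, 3, 8, 1, 7, 0, 0, 3),
         method="spearman")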
Let’s revisit the data on environmental toxins and lung cancer that we examined in Scatter Plots. This data compared the amount of airborne toxins released in each state with the deaths by lung cancer in each state:
> cor.test(air_on_site/Surface_Area, deaths_lung/Population)
Pearson's product-moment correlation
data: air_on_site/Surface_Area and deaths_lung/Population
t = 3.4108, df = 39, p-value = 0.001520
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2013723 0.6858402
sample estimates:
cor
0.4793273
The test shows that there appears to be a positive correlation between these two quantities that is statistically significant. However, don’t infer that there is a causal relationship between the rates of toxins released and the rates of lung cancer deaths. There are many alternate explanations for this phenomenon. For example, states with lots of dirty industrial activity may also be states with lower levels of income, which, in turn, correlates with lower-quality medical care. Or, perhaps, states with lots of industrial activity may be states with higher rates of smoking. Or maybe states with lower levels of industrial activity are less likely to identify cancer as a cause of death. Whatever the explanation, I thought this was a neat result.
Although many real data sources can be approximated well by a normal distribution, there are many cases where you know that the data is not normally distributed, or you do not know the shape of the distribution. A good alternative to the tests described in Normal Distribution-Based Tests are non-parametric tests. These tests can be more computationally intensive than tests based on a normal distribution, but they can lead to better conclusions when the data is not normally distributed.
The Wilcoxon test is the non-parametric equivalent to the t-test:
## Default S3 method:
wilcox.test(x, y = NULL,
            alternative = c("two.sided", "less", "greater"),
            mu = 0, paired = FALSE, exact = NULL,
            correct = TRUE, conf.int = FALSE,
            conf.level = 0.95, ...)

## S3 method for class 'formula':
wilcox.test(formula, data, subset, na.action, ...)
The Wilcoxon test works by looking at the ranks of different elements in x and y; the exact values don’t matter. To get the test statistic W for x and y, you can calculate:

W = Σi Σj I(y[j] < x[i])

where I is the indicator function: 1 if the condition holds and 0 otherwise. That is, look at all pairs of values (x[i], y[j]) and count the number of cases where y[j] < x[i]. If the two vectors came from the same distribution, you’d expect this to be true for roughly half the pairs. The Wilcoxon distribution can be used to estimate the probability of different values for W; the p-value for the Wilcoxon test comes from this distribution. Note that no form of the Wilcoxon test compares a sample with a hypothesized mean; the closest analog, the one-sample signed-rank test (wilcox.test with a single vector and mu), tests a hypothesized location, roughly the median.
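To make the definition concrete, here is a minimal sketch that computes W by brute force for the tire vectors used below; a tied pair counts as one half, matching how R scores ties:

# Count pairs (x[i], y[j]) with y[j] < x[i]; ties count as 1/2.
W <- sum(outer(times.to.failure.e, times.to.failure.d, ">")) +
     0.5 * sum(outer(times.to.failure.e, times.to.failure.d, "=="))
W   # 14.5, matching the W reported by wilcox.test below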
Let’s take a look at the same examples we used for t-tests. Let’s start by looking at the times to failure for tires. As above, let’s start by comparing tires of type E to tires of type D:
> wilcox.test(times.to.failure.e, times.to.failure.d)
Wilcoxon rank sum test with continuity correction
data: times.to.failure.e and times.to.failure.d
W = 14.5, p-value = 0.05054
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(times.to.failure.e, times.to.failure.d) :
cannot compute exact p-value with ties
Here’s an explanation of the output. The test function first shows the test statistic (W = 14.5) and the p-value for the statistic (0.05054). Notice that this is different from the result for the t-test. With the t-test, there was a significant difference between the means of the two groups, but with the Wilcoxon rank-sum test, the difference between the two groups is not significant at a 95% confidence level (though it barely misses).
Also note the warning. The Wilcoxon test statistic is based on the rank order of the observations, not their specific values. In our test data, there are a few ties:
> times.to.failure.d
[1] 5.22 4.47 4.25 5.22 4.68 5.05 4.30
> times.to.failure.e
 [1] 4.48 4.70 4.52 4.63 4.07 4.28 4.25 4.10 4.08 4.10
Because there was a tie, the function above actually used a normal approximation; see the help file for more information.
As with the standard t-test function, there is also a formula method for wilcox.test. As above, let’s compare the distance of field goals made in indoor stadiums versus outdoor stadiums:
> wilcox.test(yards~outside, data=good)
Wilcoxon rank sum test with continuity correction
data: yards by outside
W = 62045, p-value = 0.3930
alternative hypothesis: true location shift is not equal to 0
The Kruskal-Wallis rank-sum test is a non-parametric equivalent to ANOVA analysis:
kruskal.test(x, ...)

## Default S3 method:
kruskal.test(x, g, ...)

## S3 method for class 'formula':
kruskal.test(formula, data, subset, na.action, ...)
As an example, here is the output for the mortality data that we used as an example for ANOVA statistics:
> kruskal.test(age~Cause, data=mort06.smpl)
Kruskal-Wallis rank sum test
data: age by Cause
Kruskal-Wallis chi-squared = 34868.1, df = 9, p-value < 2.2e-16
To compare the variance among different groups using a
nonparametric test, R includes an implementation of the
Fligner-Killeen (median) test through the fligner.test
function:
## Default S3 method:
fligner.test(x, g, ...)

## S3 method for class 'formula':
fligner.test(formula, data, subset, na.action, ...)
Here is the output of fligner.test
for the mortality data
above:
> fligner.test(age~Cause, data=mort06.smpl)
Fligner-Killeen test of homogeneity of variances
data: age by Cause
Fligner-Killeen:med chi-squared = 15788, df = 9, p-value < 2.2e-16
R also includes some tests for differences in scale parameters. To use the Ansari-Bradley two-sample test for a difference in scale parameters, use the function ansari.test:
## Default S3 method:
ansari.test(x, y,
            alternative = c("two.sided", "less", "greater"),
            exact = NULL, conf.int = FALSE, conf.level = 0.95,
            ...)

## S3 method for class 'formula':
ansari.test(formula, data, subset, na.action, ...)
To use Mood’s two-sample test for a difference in scale parameters in R, try the function mood.test:
## Default S3 method:
mood.test(x, y,
          alternative = c("two.sided", "less", "greater"), ...)

## S3 method for class 'formula':
mood.test(formula, data, subset, na.action, ...)
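Neither of these tests is demonstrated in this section, but as a sketch, either could be applied to the field goal data from earlier to ask whether the spread of distances differs between indoor and outdoor stadiums:

# Two non-parametric tests for a difference in scale between groups
ansari.test(yards~outside, data=field.goals.inout)
mood.test(yards~outside, data=field.goals.inout)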
[50] One of the most famous results in probability theory is the central limit theorem. The central limit theorem states, in a nutshell, that if x is the sum of a set of n independent, identically distributed random variables x1, x2, ..., xn, then for large n the distribution of x is approximately normal.
[51] This is a slightly contrived example, because I just made up the hypothetical mean value. In reality, the mean value might come from another experiment (perhaps a published experiment, where the raw data was not available). Or the mean value might have been derived from theory.
[52] This process is named for the analysis process, not for the results. It doesn’t compare variances; it compares means by analyzing variances.