This section describes tests that apply to continuous random variables. Many important measurements fall into this category, such as times, dollar amounts, and chemical concentrations.
We’ll start off by showing how to use some common statistical tests that assume the underlying data is normally distributed. Normal distributions occur frequently in nature, so this is often a good assumption.[50]
Suppose that you designed an experiment to show that some effect is true. You have collected some data and now want to know if the data supports your hypothesis. One common question is whether the mean of the experimental data is equal to the value that the experimenter expected; the statement that it is equal is called the null hypothesis. Alternatively, the experimenter may ask how much evidence the data provides for an alternative hypothesis. Specifically, suppose that you have a set of observations x1, x2, ..., xn with sample mean x̄ and want to know if the sample mean is significantly different from the null hypothesis mean μ0. Furthermore, assume that the observations are normally distributed. To test the validity of the hypothesis, you can use a t-test. In R, you would use the function t.test:
## Default S3 method:
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
Here is a description of the arguments to the t.test
function.
Argument | Description | Default |
---|---|---|
x | A numeric vector of data values. | |
y | An optional numeric vector of data values. Use y=NULL to compare a single vector of values to the null hypothesis mean mu; supply a second vector to compare vector x to vector y. | NULL |
alternative | A character value specifying the alternative hypothesis: use alternative="two.sided" for a two-sided alternative, alternative="less" for lower, and alternative="greater" for higher. | c("two.sided", "less", "greater") |
mu | A numeric value specifying the mean under the null hypothesis (if testing a single vector of values) or the difference between the means (if comparing the means of two vectors). | 0 |
paired | A logical value indicating whether the vectors are paired. See the next section for a description of how to use paired. | FALSE |
var.equal | A logical value indicating whether the variances of the two vectors are assumed to be equal. If var.equal=TRUE, the pooled variance is used; if var.equal=FALSE, the Welch method is used. | FALSE |
conf.level | The confidence level of the reported interval. | 0.95 |
... | Optional values passed to other methods. | |
Let’s take a look at an example of how you would use the
t.test
function. We’ll use the same
example data that we used in Dot plots. Suppose that
we thought, a priori, that tires of type H should last for
approximately nine hours until failure.[51] We’d like to compare the sample mean of this data to the hypothesized mean and determine whether the difference is statistically significant using a t-test.
To load the sample data, use the following command:
> library(nutshell)
> data(tires.sus)
To begin, let’s extract a vector with the set of values in which we are interested and calculate the sample mean:
> times.to.failure.h <- subset(tires.sus,
+    Tire_Type=="H" &
+    Speed_At_Failure_km_h==160
+    )$Time_To_Failure
> times.to.failure.h
 [1] 10.00 16.67 13.58 13.53 16.83  7.62  4.25 10.67  4.42  4.25
> mean(times.to.failure.h)
[1] 10.182
As you can see, the sample mean for these 10 tests was slightly longer than expected (10.182 hours). We can use the function t.test to check whether this difference is statistically significant:
> t.test(times.to.failure.h, mu=9)
One Sample t-test
data: times.to.failure.h
t = 0.7569, df = 9, p-value = 0.4684
alternative hypothesis: true mean is not equal to 9
95 percent confidence interval:
6.649536 13.714464
sample estimates:
mean of x
10.182
Here’s an explanation of the output from the t.test
function. First, the function shows
us the test statistic (t = 0.7569), the degrees
of freedom (df = 9), and the calculated p-value
for the test (p-value = 0.4684). The p-value means that, if the true mean were 9, the probability of observing a sample mean at least as far from 9 as the one we saw (that is, above 10.182 or below 7.818) would be 0.4684.
The next line states the alternative hypothesis: that the true mean is not equal to 9. Because the p-value is well above 0.05, the test gives us no grounds for accepting this alternative. Next, the t.test function shows the 95% confidence interval for this test, and, finally, it gives the actual sample mean. As a statistician would say, this evidence does not imply that the true mean is not equal to 9.
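If you want to see where these numbers come from, here is a minimal sketch that recomputes the test statistic and p-value by hand, assuming times.to.failure.h is loaded as above:

# One-sample t statistic: (sample mean - hypothesized mean) / standard error
x <- times.to.failure.h
t.stat <- (mean(x) - 9) / (sd(x) / sqrt(length(x)))   # 0.7569
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
2 * pt(-abs(t.stat), df = length(x) - 1)              # 0.4684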
Another common situation is when you have two groups of observations, and you want to know if there is a significant difference between the means of the two groups of observations. You can also use a t-test to compare the means of the two groups.
Let’s pick another example from the tire data. Looking at the characteristics of the different tires that were tested, notice that three of the six tires had the same speed rating: S. Based on this speed rating, we would expect all three tires to last the same amount of time in the test:
> times.to.failure.e <- subset(tires.sus,
+    Tire_Type=="E" & Speed_At_Failure_km_h==180)$Time_To_Failure
> times.to.failure.d <- subset(tires.sus,
+    Tire_Type=="D" & Speed_At_Failure_km_h==180)$Time_To_Failure
> times.to.failure.b <- subset(tires.sus,
+    Tire_Type=="B" & Speed_At_Failure_km_h==180)$Time_To_Failure
Let’s start by comparing the mean times until failure for tires of types D and E:
> t.test(times.to.failure.e, times.to.failure.d)
Welch Two Sample t-test
data: times.to.failure.e and times.to.failure.d
t = -2.5042, df = 8.961, p-value = 0.03373
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.82222528 -0.04148901
sample estimates:
mean of x mean of y
4.321000 4.752857
The results here are similar to the results from the single-sample t-test. In this case, notice that the difference is statistically significant at the 95% confidence level: on average, tires of type D lasted longer than tires of type E (4.75 hours versus 4.32 hours, and the confidence interval for the difference excludes zero).
As another example, let’s compare tires of types E and B:
> t.test(times.to.failure.e, times.to.failure.b)
Welch Two Sample t-test
data: times.to.failure.e and times.to.failure.b
t = -1.4549, df = 16.956, p-value = 0.1640
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5591177 0.1027844
sample estimates:
mean of x mean of y
4.321000 4.549167
In this case, the difference in means was not significant between the two groups (the calculated p-value was 0.1640). Notice that the output in R is otherwise identical; the t.test function doesn’t explicitly say whether the results were significant or not.
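If you want to check significance programmatically instead of reading the printed output, note that t.test returns an object of class "htest" whose components you can extract directly; a minimal sketch:

# t.test returns an "htest" object; pull the p-value out of it
# rather than reading the printed output.
result <- t.test(times.to.failure.e, times.to.failure.b)
result$p.value          # 0.1640
result$p.value < 0.05   # FALSE: not significant at the 95% level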
For two-sample t-tests, you can also use a formula to specify a t-test if the data is included in a data frame, and the two groups of observations are differentiated by a factor:
## S3 method for class 'formula':
t.test(formula, data, subset, na.action, ...)
The formula specifies the variables to use in the test.
As an example, let’s look at data on field goals kicked in the NFL during 2005. Specifically, let’s look at the distance of successful field goals kicked in indoor and outdoor stadiums. Many TV commentators talk about the difficulty of kicking field goals outdoors, due to snow, wind, and so on. But does it make a significant difference in the distance of successful field goals? (Or, for that matter, in bad field goals?) We can use a t-test to find out.
First, let’s put together the data set:
> library(nutshell)
> data(field.goals)
> good <- transform(
+    field.goals[field.goals$play.type=="FG good",
+                c("yards","stadium.type")],
+    outside=(stadium.type=="Out"))
> bad <- transform(
+    field.goals[field.goals$play.type=="FG no",
+                c("yards","stadium.type")],
+    outside=(stadium.type=="Out"))
Now, let’s use the t.test
function to compare the distance of field goals in indoor and outdoor
stadiums:
> t.test(yards~outside, data=good)
Welch Two Sample t-test
data: yards by outside
t = 1.1259, df = 319.428, p-value = 0.2610
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.685112 2.518571
sample estimates:
mean in group FALSE mean in group TRUE
35.31707 34.40034
Although successful field goals averaged about a yard longer in indoor stadiums, the difference is not significant at a 95% confidence level. The same is true for field goals that missed:
> t.test(yards~outside, data=bad)
Welch Two Sample t-test
data: yards by outside
t = 1.2016, df = 70.726, p-value = 0.2335
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.097564 4.425985
sample estimates:
mean in group FALSE mean in group TRUE
45.18421 43.52000
Was there a statistically significant difference in the distances that coaches attempted to kick field goals? Let’s take a look:
> field.goals.inout <-
+   transform(field.goals,
+             outside=(stadium.type=="Out"))
> t.test(yards~outside, data=field.goals.inout)

Welch Two Sample t-test

data:  yards by outside
t = 1.5473, df = 401.509, p-value = 0.1226
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3152552  2.6461541
sample estimates:
mean in group FALSE  mean in group TRUE
           37.14625            35.98080
Again, the difference does not appear to be statistically significant at a 95% level.
Sometimes you are provided with paired data. For
example, you might have two observations per subject: one before an
experiment and one after the experiment. In this case, you would use a
paired t-test. You can use the t.test function, specifying paired=TRUE, to perform this test.
As an example of paired data, we can look at the SPECint2006 results. SPEC is an organization that provides computer performance data using standardized tests. The organization defines a number of different tests for different applications: database performance, web server performance, graphics performance, and so on. For our example, we’ll use a simple metric: the integer performance of different computers on typical desktop computing tasks.
SPEC provides two different types of tests: tests with standard settings and tests that are optimized for specific computers. As an example of paired data, we will compare the unoptimized results (called “baseline”) with the optimized results, to see if there is a statistically significant difference between the results. This data set is a good example of paired data: we have two different test results for each computer system. As an example, we will look only at single-chip, dual-core systems:
> library(nutshell)
> data(SPECint2006)
> t.test(subset(SPECint2006,Num.Chips==1&Num.Cores==2)$Baseline,
+        subset(SPECint2006,Num.Chips==1&Num.Cores==2)$Result,
+        paired=TRUE)

Paired t-test

data:  subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Baseline and subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Result
t = -21.8043, df = 111, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.957837 -1.631627
sample estimates:
mean of the differences
              -1.794732
In this case, we can clearly see that the difference is significant at the 95% confidence level. (This isn’t a very big surprise. It’s well known that optimizing compiler settings and system parameters can make a big difference on system performance. Additionally, submitting optimized results is optional: organizations that could not tune their systems very well probably would not voluntarily share that fact.)
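Incidentally, a paired t-test is equivalent to a one-sample t-test on the within-pair differences; a minimal sketch using the same subset as above:

# Equivalent formulation: test whether the mean Baseline - Result
# difference is zero for single-chip, dual-core systems.
dual.core <- subset(SPECint2006, Num.Chips==1 & Num.Cores==2)
t.test(dual.core$Baseline - dual.core$Result, mu=0)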
To compare the variances of two samples from normal
populations, R includes the var.test
function, which performs an
F-test:
## Default S3 method:
var.test(x, y, ratio = 1,
         alternative = c("two.sided", "less", "greater"),
         conf.level = 0.95, ...)

## S3 method for class 'formula':
var.test(formula, data, subset, na.action, ...)
Let’s continue with the example from above. Is there a difference in the variance of field goal lengths between indoor and outdoor stadiums? Let’s take a look:
> var.test(yards~outside, data=field.goals.inout)
F test to compare two variances
data: yards by outside
F = 1.2432, num df = 252, denom df = 728, p-value = 0.03098
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.019968 1.530612
sample estimates:
ratio of variances
1.243157
As you can see from the output above, the
p-value is less than 0.05, indicating that the
difference in variance between the two populations is statistically
significant. To test whether the variances in each of two or more groups (samples) are the same, you can use Bartlett’s test. In R, this is available through the bartlett.test function:
bartlett.test(x, ...)

## Default S3 method:
bartlett.test(x, g, ...)

## S3 method for class 'formula':
bartlett.test(formula, data, subset, na.action, ...)
Using the same example as above, let’s compare variances of the two groups using the Bartlett test:
> bartlett.test(yards~outside, data=field.goals.inout)
Bartlett test of homogeneity of variances
data: yards by outside
Bartlett's K-squared = 4.5808, df = 1, p-value = 0.03233
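Unlike var.test, bartlett.test can compare more than two groups at once. As a sketch (assuming the tires.sus data from earlier is still loaded), you could test whether the variance of time to failure is the same across all tire types:

# Bartlett's test across all tire types, not just a single pair
bartlett.test(Time_To_Failure ~ Tire_Type, data=tires.sus)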
To compare the means across more than two groups, you can use a method called analysis of variance (ANOVA).[52] ANOVA methods are very important for statistics. A full explanation of ANOVA requires an explanation of statistical models in R, which are covered in Chapter 20.
A simple way to perform these tests is through the aov function:
aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)
As an example, let’s consider the 2006 U.S. mortality data set. (I showed how to load this data set in Using Other Languages to Preprocess Text Files.) Specifically, we’ll look at differences in age at death by cause of death. This is a pretty silly example; clearly, the average age at which people die of natural causes is going to be higher than the age at which they die for other reasons. However, this should help illustrate how the statistic works.
I mapped the disease codes in the original file into readable
values and then summarized causes into a small number of reasons. To
do this, I created a function to translate the numeric codes into
character values. (I grouped together some common causes of death.)
The mort06.smpl
data set is
included in the nutshell
package.
Let’s take a look at the summary statistics for age by cause:
> library(nutshell)
> data(mort06.smpl)
> tapply(mort06.smpl$age, INDEX=list(mort06.smpl$Cause), FUN=summary)
$Accidents
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   31.00   48.00   50.88   73.00  108.00    8.00

$`Alzheimer's Disease`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  40.00   82.00   87.00   86.07   91.00  109.00

$Cancer
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   61.00   72.00   70.24   81.00  107.00

$`Chronic respiratory diseases`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   70.00   78.00   76.37   84.00  106.00    1.00

$Diabetes
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   63.00   75.00   72.43   83.00  104.00    1.00

$`Heart Disease`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   70.00   81.00   77.66   88.00  112.00    4.00

$Homicide
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   22.00   28.00   32.19   42.00   92.00    2.00

$`Influenza and pneumonia`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   76.00   84.00   80.16   90.00  108.00    1.00

$Other
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   60.00   78.00   70.44   87.00  110.00   10.00

$Suicide
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   8.00   32.00   45.00   46.14   57.00   97.00    2.00
Now let’s fit an ANOVA model to the data and show a summary of
the model. To do this in R, we simply need to use the aov
function:
> aov(age~Cause, data=mort06.smpl)
Call:
aov(formula = age ~ Cause, data = mort06.smpl)
Terms:
Cause Residuals
Sum of Squares 15727886 72067515
Deg. of Freedom 9 243034
Residual standard error: 17.22012
Estimated effects may be unbalanced
29 observations deleted due to missingness
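Note that printing an aov object shows sums of squares but no F statistic or p-value. One way to see the full ANOVA table is to call summary on the fitted model; a minimal sketch:

# summary() on an aov fit prints the full ANOVA table,
# including the F statistic and its p-value.
mort06.aov <- aov(age~Cause, data=mort06.smpl)
summary(mort06.aov)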
To get more information on ANOVA results, you can use the model.tables function to print information on aov objects:
## S3 method for class 'aov':
model.tables(x, type = "effects", se = FALSE, cterms, ...)
The argument x
specifies the
model object, type
specifies the
type of results to print, se
specifies whether to compute standard errors, and cterms
specifies which tables should be
compared. As an example, here is the output of model.tables
for the cause of death example
above:
> model.tables(aov(age~Cause, data=mort06.smpl))
Tables of effects
Cause
Accidents Alzheimer's Disease Cancer
-21.41 13.77 -2.056
rep 12363.00 7336.00 57162.000
Chronic respiratory diseases Diabetes Heart Disease Homicide
4.077 0.1343 5.371 -40.1
rep 12386.000 7271.0000 82593.000 1917.0
Influenza and pneumonia Other Suicide
7.863 -1.855 -26.15
rep 5826.000 52956.000 3234.00
As another example of aov, let’s consider weight gain by women during pregnancy:
> library(nutshell)
> data(births2006.smpl)
> births2006.cln <- births2006.smpl[births2006.smpl$WTGAIN<99 &
+    !is.na(births2006.smpl$WTGAIN),]
> tapply(X=births2006.cln$WTGAIN,
+        INDEX=births2006.cln$DOB_MM,
+        FUN=mean)
       1        2        3        4        5        6
30.94405 31.08356 31.29317 31.33610 31.07242 30.92589
       7        8        9       10       11       12
30.57734 30.54855 30.25546 30.43985 30.79077 30.85564
It appears that weight gain is slightly higher for births in the late winter and spring months (peaking in March and April), but is this difference statistically significant? Let’s take a look:
> aov(WTGAIN~DOB_MM, births2006.cln)
Call:
aov(formula = WTGAIN ~ DOB_MM, data = births2006.cln)
Terms:
DOB_MM Residuals
Sum of Squares 14777 73385301
Deg. of Freedom 1 351465
Residual standard error: 14.44986
Estimated effects may be unbalanced
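One caveat with this example: DOB_MM is a numeric column, so aov fits a single linear trend in month, which is why the output shows only 1 degree of freedom for DOB_MM. To compare mean weight gain across the 12 months as separate groups, you would convert the month to a factor, as in this sketch:

# Treat month as a categorical variable (11 degrees of freedom)
# rather than a single linear trend.
aov(WTGAIN~factor(DOB_MM), births2006.cln)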
Often, it’s better to use lm
to fit a linear model and then use the anova
function to extract information about
analysis of variance. For large models, it is often more efficient to
use the update
function to change
an existing model than to create a new model from scratch. See Example: A Simple Linear Model for more information on the lm
function, model objects, and the update
function. The anova
function presents results slightly
differently than the aov
function,
as you can see in this example:
> mort06.smpl.lm <- lm(age~Cause, data=mort06.smpl)
> anova(mort06.smpl.lm)
Analysis of Variance Table

Response: age
              Df   Sum Sq Mean Sq F value    Pr(>F)
Cause          9 15727886 1747543  5893.3 < 2.2e-16 ***
Residuals 243034 72067515     297
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA calculations assume that the variance is equal across
groups. When you know this is not true (or suspect this is not true),
you can use the oneway.test
function to calculate whether two or more samples have the same
mean:
oneway.test(formula, data, subset, na.action, var.equal = FALSE)
This is similar to calling t.test
with var.equal=FALSE
(and the calculation method
was also developed by Welch).
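As a sketch, here is how you could apply oneway.test to the mortality data used above, without assuming equal variances across causes of death:

# Welch-style one-way comparison of mean age at death by cause
oneway.test(age~Cause, data=mort06.smpl)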
There are other functions for printing information about
aov
objects: proj
returns a list of matrices giving the
projections of data onto the linear model, TukeyHSD
returns confidence intervals on the
differences between the means of the levels of a factor with the
specified family-wise probability of coverage, and se.contrast
returns the standard errors for
one or more contrasts in an aov
object.
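As an illustration of TukeyHSD, here is a sketch (using the tires.sus data from earlier) that computes confidence intervals on all pairwise differences in mean time to failure among the tire types:

# Tukey's honest significant differences for all pairs of tire types
TukeyHSD(aov(Time_To_Failure~Tire_Type, data=tires.sus))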
Sometimes, you’re not interested in just whether there
is a difference across groups, but would like to know more details
about the differences. One way to do this is by performing a
t-test between every pair of groups. To do this
in R, you can use the pairwise.t.test
function:
pairwise.t.test(x, g, p.adjust.method = p.adjust.methods,
                pool.sd = !paired, paired = FALSE,
                alternative = c("two.sided", "less", "greater"),
                ...)
This function calculates pairwise comparisons among group levels
with corrections for multiple testing. The argument x
specifies a vector of numeric values, and
g
specifies a factor that is used
to group values. The argument pool.sd
specifies whether to calculate a
single standard deviation value across all groups and use this for the
test.
As an example, let’s return to the tire data that we used in the
example above. When we looked at the t.test
function, we created three different
vectors for the different types of tires. Here we’ll just use the
pairwise t-test to compare all the tires by
type:
> pairwise.t.test(tires.sus$Time_To_Failure, tires.sus$Tire_Type)
Pairwise comparisons using t tests with pooled SD
data: tires.sus$Time_To_Failure and tires.sus$Tire_Type
B C D E H
C 0.2219 - - - -
D 1.0000 0.5650 - - -
E 1.0000 0.0769 1.0000 - -
H 2.4e-07 0.0029 2.6e-05 1.9e-08 -
L 0.1147 1.0000 0.4408 0.0291 0.0019
P value adjustment method: holm
As you can see, there is no statistically significant difference between the means of a few pairs of groups (such as C and L, D and L, or D and E), but there is a significant difference between some others (such as B and H, C and H, or E and L).
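By default, pairwise.t.test adjusts the p-values using Holm’s method (as the last line of the output notes) to correct for multiple testing. To see how much the correction matters, here is a sketch repeating the comparison with no adjustment:

# Same comparison without multiple-testing correction;
# the unadjusted p-values are smaller.
pairwise.t.test(tires.sus$Time_To_Failure, tires.sus$Tire_Type,
                p.adjust.method="none")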
To test whether a set of data values comes from a normal distribution, you can use the Shapiro-Wilk test for normality through the shapiro.test function:
shapiro.test(x)
Using the example above, let’s look at field goal lengths in the NFL in 2005. Was the distribution of field goal lengths normally distributed? My first instinct is to take a look at the distribution using a histogram or a quantile-quantile plot. Here is some R code to plot both, side by side:
> par(mfcol=c(1, 2), ps=6.5)
> hist(field.goals$yards, breaks=25)
> qqnorm(field.goals$yards, pch=".")
The plot is shown in Figure 18-1. It seems plausible that the distribution was normal: the distribution is roughly bell-curve shaped, and the quantile-quantile plot is roughly linear. To get a more rigorous answer, we can use the Shapiro-Wilk test. Here’s what the Shapiro-Wilk test tells us:
> shapiro.test(field.goals$yards)
Shapiro-Wilk normality test
data: field.goals$yards
W = 0.9728, p-value = 1.307e-12
As you can tell from the tiny p-value, we can reject the null hypothesis of normality; it is quite unlikely that this data came from a normal distribution.
You can use the Kolmogorov-Smirnov test to see if a vector came from an arbitrary probability distribution (not just a normal distribution):
ks.test(x, y, ..., alternative = c("two.sided", "less", "greater"), exact = NULL)
The argument x
specifies the
test data. The argument y
specifies
the arbitrary distribution; it can be a vector of data values, a
character name for a probability distribution function, or a
distribution function. (You can pass along additional arguments to the distribution function.)
The alternative
argument allows you
to specify the alternative hypothesis, and the exact
value specifies whether to calculate
exact values (for large x and
y) or approximations.
Using the example above, we can use the ks.test
function. We’ll specify the normal distribution (using
the pnorm
function):
> ks.test(field.goals$yards, pnorm)
One-sample Kolmogorov-Smirnov test
data: field.goals$yards
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
Warning message:
In ks.test(field.goals$yards, pnorm) :
cannot compute correct p-values with ties
Notice the warning message; ties are extremely unlikely for values from a true normal distribution. If there are ties in the data, that is a good sign that the test data is not actually normally distributed, so the function prints a warning.
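Also note that passing pnorm with no additional arguments compares the data against a standard normal distribution (mean 0, standard deviation 1). Because every field goal length is far above 0, the empirical and theoretical distributions barely overlap, which is why D = 1. A fairer sketch passes the sample mean and standard deviation through the ... argument:

# Compare against a normal distribution with the sample's own
# mean and standard deviation (extra arguments are passed to pnorm).
ks.test(field.goals$yards, pnorm,
        mean=mean(field.goals$yards),
        sd=sd(field.goals$yards))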
The Kolmogorov-Smirnov test can also be used to test the
probability that two data vectors came from the same distribution. As
an example, let’s look at the SPECint2006 data that we saw in Comparing paired data. What is the probability that the
benchmark data and the optimized data come from the same distribution?
We’ll compare the benchmark and optimized data using the ks.test
function, adding some jitter to the
values to suppress the warning about ties:
> ks.test(jitter(subset(SPECint2006, Num.Chips==1&Num.Cores==2)$Baseline),
+         jitter(subset(SPECint2006, Num.Chips==1&Num.Cores==2)$Result))

Two-sample Kolmogorov-Smirnov test

data:  jitter(subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Baseline) and jitter(subset(SPECint2006, Num.Chips == 1 & Num.Cores == 2)$Result)
D = 0.2143, p-value = 0.01168
alternative hypothesis: two-sided
The p-value of this test was 0.01168, which is much less than 0.05. So this test implies that it is unlikely that these two samples came from the same distribution.
The functions in Correlation and Covariance simply compute the degree of
correlation between pairs of vectors, but they don’t tell you if the
correlation is significant. If you’d like to check whether there is a
statistically significant correlation between two vectors, you can use the cor.test
function:
## Default S3 method:
cor.test(x, y,
         alternative = c("two.sided", "less", "greater"),
         method = c("pearson", "kendall", "spearman"),
         exact = NULL, conf.level = 0.95, ...)

## S3 method for class 'formula':
cor.test(formula, data, subset, na.action, ...)
For example, let’s look at how this function works on two obviously correlated vectors:
> cor.test(c(1, 2, 3, 4, 5, 6, 7, 8),
+          c(0, 2, 4, 6, 8, 10, 11, 14))

Pearson's product-moment correlation

data:  c(1, 2, 3, 4, 5, 6, 7, 8) and c(0, 2, 4, 6, 8, 10, 11, 14)
t = 36.1479, df = 6, p-value = 2.989e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9868648 0.9996032
sample estimates:
     cor
0.997712
And two less correlated vectors:
> cor.test(c(1, 2, 3, 4, 5, 6, 7, 8),
+          c(5, 3, 8, 1, 7, 0, 0, 3))

Pearson's product-moment correlation

data:  c(1, 2, 3, 4, 5, 6, 7, 8) and c(5, 3, 8, 1, 7, 0, 0, 3)
t = -1.2232, df = 6, p-value = 0.2671
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8757371  0.3764066
sample estimates:
       cor
-0.4467689
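The method argument selects the correlation measure. As a sketch, here is the second example rerun with Spearman’s rank correlation, a non-parametric alternative that uses only the ranks of the values:

# Spearman's rank correlation; with tied values (the two 0s),
# R warns that it cannot compute an exact p-value.
cor.test(c(1, 2, 3, 4, 5, 6, 7, 8),
         c(5, 3, 8, 1, 7, 0, 0, 3),
         method="spearman")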
Let’s revisit the data on environmental toxins and lung cancer that we examined in Scatter Plots. This data compared the amount of airborne toxins released in each state with the deaths by lung cancer in each state:
> cor.test(air_on_site/Surface_Area, deaths_lung/Population)
Pearson's product-moment correlation
data: air_on_site/Surface_Area and deaths_lung/Population
t = 3.4108, df = 39, p-value = 0.001520
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2013723 0.6858402
sample estimates:
cor
0.4793273
The test shows that there appears to be a positive correlation between these two quantities that is statistically significant. However, don’t infer that there is a causal relationship between the rates of toxins released and the rates of lung cancer deaths. There are many alternate explanations for this phenomenon. For example, states with lots of dirty industrial activity may also be states with lower levels of income, which, in turn, correlates with lower-quality medical care. Or, perhaps, states with lots of industrial activity may be states with higher rates of smoking. Or maybe states with lower levels of industrial activity are less likely to identify cancer as a cause of death. Whatever the explanation, I thought this was a neat result.
Although many real data sources can be approximated well by a normal distribution, there are many cases where you know that the data is not normally distributed, or you do not know the shape of the distribution. A good alternative to the tests described in Normal Distribution-Based Tests are non-parametric tests. These tests can be more computationally intensive than tests based on a normal distribution, but they can lead to better conclusions when the data is not normally distributed.
The Wilcoxon test is the non-parametric equivalent to the t-test:
## Default S3 method:
wilcox.test(x, y = NULL,
            alternative = c("two.sided", "less", "greater"),
            mu = 0, paired = FALSE, exact = NULL,
            correct = TRUE, conf.int = FALSE,
            conf.level = 0.95, ...)

## S3 method for class 'formula':
wilcox.test(formula, data, subset, na.action, ...)
The Wilcoxon test works by looking at the ranks of different elements in x and y; the exact values don’t matter. To get the test statistic W for x and y, you can calculate:

W = Σi Σj I(y[j] < x[i])

where I is the indicator function: 1 if the condition holds and 0 otherwise. That is, look at all pairs of values (x[i], y[j]) and count the number of cases where y[j] < x[i]. If the two vectors came from the same distribution, you’d expect this to be true for roughly half the pairs. The Wilcoxon distribution can be used to estimate the probability of different values for W; the p-value for the Wilcoxon test comes from this distribution. Note that no form of the Wilcoxon test compares a sample with a hypothesized mean; the closest analog, the one-sample signed-rank test (wilcox.test with a single vector and mu), tests a hypothesized location, roughly the median.
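To make the definition concrete, here is a minimal sketch that computes W by brute force for the tire vectors used below; a tied pair counts as one half, matching how R scores ties:

# Count pairs (x[i], y[j]) with y[j] < x[i]; ties count as 1/2.
W <- sum(outer(times.to.failure.e, times.to.failure.d, ">")) +
     0.5 * sum(outer(times.to.failure.e, times.to.failure.d, "=="))
W   # 14.5, matching the W reported by wilcox.test below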
Let’s take a look at the same examples we used for t-tests. Let’s start by looking at the times to failure for tires. As above, let’s start by comparing tires of type E to tires of type D:
> wilcox.test(times.to.failure.e, times.to.failure.d)
Wilcoxon rank sum test with continuity correction
data: times.to.failure.e and times.to.failure.d
W = 14.5, p-value = 0.05054
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(times.to.failure.e, times.to.failure.d) :
cannot compute exact p-value with ties
Here’s an explanation of the output. The test function first shows the test statistic (W = 14.5) and the p-value for the statistic (0.05054). Notice that this is different from the result for the t-test. With the t-test, there was a significant difference between the means of the two groups, but with the Wilcoxon rank-sum test, the difference between the two groups is not significant at a 95% confidence level (though it barely misses).
Also note the warning. The Wilcoxon test statistic is based on the rank order of the observations, not their specific values. In our test data, there are a few ties:
> times.to.failure.d
[1] 5.22 4.47 4.25 5.22 4.68 5.05 4.30
> times.to.failure.e
 [1] 4.48 4.70 4.52 4.63 4.07 4.28 4.25 4.10 4.08 4.10
Because there was a tie, the function above actually used a normal approximation; see the help file for more information.
As with the standard t-test function, there is also a formula method for wilcox.test. As above, let’s compare the distance of field goals made in indoor stadiums versus outdoor stadiums:
> wilcox.test(yards~outside, data=good)
Wilcoxon rank sum test with continuity correction
data: yards by outside
W = 62045, p-value = 0.3930
alternative hypothesis: true location shift is not equal to 0
The Kruskal-Wallis rank-sum test is a non-parametric equivalent to ANOVA analysis:
kruskal.test(x, ...)

## Default S3 method:
kruskal.test(x, g, ...)

## S3 method for class 'formula':
kruskal.test(formula, data, subset, na.action, ...)
As an example, here is the output for the mortality data that we used as an example for ANOVA statistics:
> kruskal.test(age~Cause, data=mort06.smpl)
Kruskal-Wallis rank sum test
data: age by Cause
Kruskal-Wallis chi-squared = 34868.1, df = 9, p-value < 2.2e-16
To compare the variance among different groups using a
nonparametric test, R includes an implementation of the
Fligner-Killeen (median) test through the fligner.test
function:
## Default S3 method:
fligner.test(x, g, ...)

## S3 method for class 'formula':
fligner.test(formula, data, subset, na.action, ...)
Here is the output of fligner.test
for the mortality data
above:
> fligner.test(age~Cause, data=mort06.smpl)
Fligner-Killeen test of homogeneity of variances
data: age by Cause
Fligner-Killeen:med chi-squared = 15788, df = 9, p-value < 2.2e-16
R also includes some tests for differences in scale parameters. To use the Ansari-Bradley two-sample test for a difference in scale parameters, use the function ansari.test:
## Default S3 method:
ansari.test(x, y,
            alternative = c("two.sided", "less", "greater"),
            exact = NULL, conf.int = FALSE, conf.level = 0.95,
            ...)

## S3 method for class 'formula':
ansari.test(formula, data, subset, na.action, ...)
To use Mood’s two-sample test for a difference in scale parameters in R, try the function mood.test:
## Default S3 method:
mood.test(x, y,
          alternative = c("two.sided", "less", "greater"), ...)

## S3 method for class 'formula':
mood.test(formula, data, subset, na.action, ...)
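Neither of these tests is demonstrated in this section, but as a sketch, either could be applied to the field goal data from earlier to ask whether the spread of distances differs between indoor and outdoor stadiums:

# Two non-parametric tests for a difference in scale between groups
ansari.test(yards~outside, data=field.goals.inout)
mood.test(yards~outside, data=field.goals.inout)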
[50] One of the most famous results in probability theory is the central limit theorem. The central limit theorem states, in a nutshell, that if x is the sum of a set of n independent, identically distributed random variables x1, x2, ..., xn, then for large n the distribution of x is approximately normal.
[51] This is a slightly contrived example, because I just made up the hypothetical mean value. In reality, the mean value might come from another experiment (perhaps a published experiment, where the raw data was not available). Or the mean value might have been derived from theory.
[52] This process is named for the analysis process, not for the results. It doesn’t compare variances; it compares means by analyzing variances.