There is a different set of tests for looking at the statistical significance of discrete random variables (like counts or proportions), and so there is a different set of functions in R for performing those tests.
If you have a data set with several different groups of observations and are measuring the probability of success in each group (or the fraction of some other characteristic), you can use the function prop.test to measure whether the difference between groups is statistically significant. Specifically, prop.test can be used for testing the null hypothesis that the proportions (probabilities of success) in several groups are the same or that they equal certain given values:
prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"), conf.level = 0.95, correct = TRUE)
As an example, let’s revisit the field goal data. Above, we considered the question, “Is there a difference in the length of attempts indoors and outdoors?” Now we’ll ask the question, “Is the probability of success the same indoors as it is outdoors?”
First, let’s create a new data set containing only good and bad field goals. (We’ll eliminate blocked and aborted attempts; there were only 8 aborted attempts and 24 blocked attempts in 2005, but 787 good attempts and 163 bad [no good] attempts.)
> field.goals.goodbad <- field.goals[field.goals$play.type=="FG good" |
+    field.goals$play.type=="FG no", ]
Now let’s create a table of successes and failures by stadium type:
> field.goals.table <- table(field.goals.goodbad$play.type,
+    field.goals.goodbad$stadium.type)
> field.goals.table

             Both  In Out
  FG aborted    0   0   0
  FG blocked    0   0   0
  FG good      53 152 582
  FG no        14  24 125
The table isn’t quite right for prop.test; we need a table with two columns (one with a count of successes and one with a count of failures), and we don’t want to show empty factor levels. Let’s remove the two rows we don’t need and transpose the table:
> field.goals.table.t <- t(field.goals.table[3:4,])
> field.goals.table.t

       FG good FG no
  Both      53    14
  In       152    24
  Out      582   125
Now we’re ready to see if there is a statistically significant difference in success rates among the three groups. We can simply call prop.test on the field.goals.table.t object to check:
> prop.test(field.goals.table.t)
3-sample test for equality of proportions without
continuity correction
data: field.goals.table.t
X-squared = 2.3298, df = 2, p-value = 0.3120
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3
0.7910448 0.8636364 0.8231966
As you can see, the results are not statistically significant: the p-value (0.3120) is well above the conventional 0.05 cutoff, so we cannot reject the null hypothesis that the probability of success is the same in all three groups.
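Incidentally, we could have run the same test by passing prop.test a vector of success counts and a vector of trial totals directly; here the totals (67, 176, and 707) are just the row sums of field.goals.table.t:

> prop.test(x = c(53, 152, 582), n = c(67, 176, 707))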
Often, an experiment consists of a series of identical trials, each of which has only two outcomes. For example, suppose that you wanted to test the hypothesis that the probability that a coin would land on heads was 0.5. You might design an experiment where you flipped the coin 50 times and counted the number of heads. Each coin flip is an example of a Bernoulli trial. The distribution of the number of heads is given by the binomial distribution.
R includes a function for evaluating such a trial to determine whether to accept or reject the hypothesis:
binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)
The argument x gives the number of successes, n gives the total number of trials, p gives the hypothesized probability of success, alternative gives the alternative hypothesis, and conf.level gives the confidence level for the returned confidence interval.
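For instance, here is a minimal sketch of the coin-flip experiment described above, assuming a hypothetical outcome of 28 heads in 50 flips:

> binom.test(x=28, n=50, p=0.5)

If the reported p-value were above the conventional 0.05 cutoff, we could not reject the hypothesis that the coin was fair.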
As a real-world example, let’s look at David Ortiz’s performance during the 2008 season. In 2008, he had a batting average of .264 (110 hits in 416 at bats). Suppose that he was actually a .300 hitter, meaning that the actual probability that he would get a hit in a given at bat was 0.3. What were the odds that he hit .264 or less in this number of at bats? We can use the function binom.test to estimate this probability:
> binom.test(x=110, n=416, p=0.3, alternative="less")
Exact binomial test
data: 110 and 416
number of successes = 110, number of trials = 416, p-value = 0.06174
alternative hypothesis: true probability of success is less than 0.3
95 percent confidence interval:
0.0000000 0.3023771
sample estimates:
probability of success
0.2644231
The p-value here represents the probability that, if the true probability of success were .300, the observed fraction of successes would be at least as far from that value as the fraction actually observed (0.2644231). Because we specified the alternative hypothesis “less,” the p-value is the probability that the fraction of successes was 0.2644231 or lower, which in this case was 0.06174.
In plain English, this means that if David Ortiz were a “true” .300 hitter, the probability that he would hit .264 or worse over 416 at bats was 0.06174.
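We can double-check this figure against the cumulative binomial distribution function: because we specified alternative="less", the exact test’s p-value is simply the probability of observing 110 or fewer successes in 416 trials when the success probability is 0.3:

> pbinom(110, 416, 0.3)

This call returns the same value as the p-value reported above (0.06174).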
A common problem is to look at a table of data and determine if there is a relationship between two categorical variables. If there were no relationship, the two variables would be statistically independent. In these tests, the hypothesis is that the two variables are independent. The alternative hypothesis is that the two variables are not independent.
Tables of data often come up in experimental contexts: there is one column of data from a test population and one from a control population. In this context, the analyst often wants to calculate the probability that the two sets of data could have come from the same population (which would imply the same proportions in each). This is an equivalent problem, so the same test functions can be used.
For small contingency tables (and small counts), you can obtain the best results using Fisher’s exact test. Fisher’s exact test calculates the probability that the deviation from independence is greater than or equal to that in the observed sample, assuming the variables really are independent. So a high p-value means that the sample data is consistent with the two variables being independent, while a low p-value means that the observed data would be unlikely if the two variables were truly independent, suggesting that they are not.
In R, you can use the function fisher.test to perform Fisher’s exact test:
fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE, control = list(), or = 1, alternative = "two.sided", conf.int = TRUE, conf.level = 0.95, simulate.p.value = FALSE, B = 2000)
Here is a description of the arguments to fisher.test.
Argument | Description | Default |
---|---|---|
x | Specifies the sample data to use for the test. Either a matrix (representing a two-dimensional contingency table) or a factor. | |
y | Specifies the sample data to use for the test. If x is a factor, then y should be a factor. If x is a matrix, then y is ignored. | NULL |
workspace | An integer value specifying the size of the workspace to use in the network algorithm (in units of 4 bytes). | 200000 |
hybrid | For tables larger than 2 × 2, specifies whether exact probabilities should be calculated (hybrid=FALSE) or an approximation should be used (hybrid=TRUE). | FALSE |
control | A list of named components for low-level control of fisher.test; see the help file for more information. | list() |
or | The hypothesized odds ratio for the 2 × 2 case. | 1 |
alternative | The alternative hypothesis. Must be one of "two.sided", "greater", or "less". | "two.sided" |
conf.int | A logical value specifying whether to compute and return confidence intervals in the results. | TRUE |
conf.level | Specifies the confidence level to use in computing the confidence interval. | 0.95 |
simulate.p.value | A logical value indicating whether to use Monte Carlo simulation to compute p-values in tables larger than 2 × 2. | FALSE |
B | An integer indicating the number of replicates to use in Monte Carlo simulations. | 2000 |
If you specify x and y as factors, then R will compute a contingency table from these factors. Alternatively, you can specify a matrix for x containing the contingency table.
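For instance, here is a minimal sketch using the matrix form, with a hypothetical 2 × 2 contingency table (the counts and labels are invented for illustration):

> treatment.table <- matrix(c(12, 5, 9, 14), nrow = 2,
+    dimnames = list(c("improved", "not improved"),
+                    c("treated", "control")))
> fisher.test(treatment.table)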
Fisher’s exact test can be very computationally intensive for large tables, so statisticians usually use an alternative: chi-squared tests. Chi-squared tests are not exactly the same as Fisher’s test. With a chi-squared test, you explicitly state a hypothesis about the probability of each event and then compare the sample distribution to the hypothesis. The p-value is the probability of obtaining, by chance, a distribution at least as different from the hypothesized distribution as the one observed.
In R, you can use the function chisq.test to calculate both chi-squared contingency table tests and goodness-of-fit tests:
chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), rescale.p = FALSE, simulate.p.value = FALSE, B = 2000)
Here is a description of the arguments to chisq.test.
Argument | Description | Default |
---|---|---|
x | Specifies the sample data to use for the test. Either a matrix or a vector. | |
y | Specifies the sample data to use for the test. If x is a factor, then y should be a vector. If x is a matrix, then y is ignored. | NULL |
correct | A logical value specifying whether to apply continuity correction when computing the test statistic for 2 × 2 tables. | TRUE |
p | A vector of probabilities that represent the hypothesis to test. (Note that the default is to assume equal probability for each item.) | rep(1/length(x), length(x)) |
rescale.p | A logical value indicating whether p needs to be rescaled to sum to 1. | FALSE |
simulate.p.value | A logical value indicating whether to compute p-values using Monte Carlo simulation. | FALSE |
B | An integer indicating the number of replicates to use in Monte Carlo simulations. | 2000 |
If you specify x and y as vectors, then R will compute a contingency table from these vectors (after coercing them to factors). Alternatively, you can specify a matrix for x containing the contingency table.
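You can also pass a vector of counts for x together with the p argument to run a goodness-of-fit test. Here is a minimal sketch with hypothetical counts from 60 rolls of a die, testing the hypothesis that each face is equally likely (p is written out explicitly here, although it matches the default):

> rolls <- c(10, 8, 12, 9, 14, 7)
> chisq.test(rolls, p = rep(1/6, 6))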
As an example, let’s use the 2006 births data set. (For a detailed description of this data set, see Univariate Trellis Plots.) We will take a look at the number of male and female babies delivered during July 2006, by delivery method. We’ll take a subset of births during July where the delivery method was known and then tabulate the results:
> births.july.2006 <- births2006.smpl[births2006.smpl$DMETH_REC!="Unknown" &
+    births2006.smpl$DOB_MM==7, ]
> nrow(births2006.smpl)
[1] 427323
> nrow(births.july.2006)
[1] 37060
> method.and.sex <- table(
+    births.july.2006$SEX,
+    as.factor(as.character(births.july.2006$DMETH_REC)))
> method.and.sex

    C-section Vaginal
  F      5326   12622
  M      6067   13045
Note that the delivery methods were actually slightly unbalanced by gender during July 2006:
> 5326 / (5326 + 6067)
[1] 0.46748
> 12622 / (12622 + 13045)
[1] 0.4917598
However, there isn’t an intuitive reason why this should be true. So let’s check whether this difference is statistically significant: is the difference simply due to chance (which would be consistent with delivery method and sex being independent), or is it evidence of a real association? We can use Fisher’s exact test to answer this question:
> fisher.test(method.and.sex)
Fisher's Exact Test for Count Data
data: method.and.sex
p-value = 1.604e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.8678345 0.9485129
sample estimates:
odds ratio
0.9072866
The p-value is the probability of obtaining results at least as far removed from independence as these, assuming the two variables really were independent. In this case, the p-value is very low, indicating that results this far from independence would be very unlikely by chance. This implies that we should reject the hypothesis that the two variables are independent.
As a second example, let’s look only at twin births. (Note that each record represents a single birth, not a single pregnancy.)
> twins.2006 <- births2006.smpl[births2006.smpl$DPLURAL=="2 Twin" &
+    births2006.smpl$DMETH_REC != "Unknown",]
> method.and.sex.twins <-
+    table(twins.2006$SEX,
+          as.factor(as.character(twins.2006$DMETH_REC)))
> method.and.sex.twins

    C-section Vaginal
  F      4924    1774
  M      5076    1860
Now let’s see if there is a statistically significant difference in delivery methods between the two sexes:
> fisher.test(method.and.sex.twins)
Fisher's Exact Test for Count Data
data: method.and.sex.twins
p-value = 0.67
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.9420023 1.0981529
sample estimates:
odds ratio
1.017083
In this case, the p-value (0.67) is very high, so we cannot reject the hypothesis that the two variables are independent.
We can look at the same table using a chi-squared test:
> chisq.test(method.and.sex.twins)
Pearson's Chi-squared test with Yates' continuity
correction
data: method.and.sex.twins
X-squared = 0.1745, df = 1, p-value = 0.6761
By the way, we could also have just passed the two factors to chisq.test, and chisq.test would have calculated the contingency table for us:
> chisq.test(twins.2006$DMETH_REC, twins.2006$SEX)
Pearson's Chi-squared test with Yates' continuity
correction
data: twins.2006$DMETH_REC and twins.2006$SEX
X-squared = 0.1745, df = 1, p-value = 0.6761
As above, the p-value is very high, so we cannot reject the hypothesis that delivery method and sex are independent for twin births.
Let’s ask another interesting question: how many babies are born on weekdays versus weekends? Let’s start by tabulating the number of births, by day of week, during 2006:
> births2006.byday <- table(births2006.smpl$DOB_WK)
> births2006.byday

    1     2     3     4     5     6     7
40274 62757 69775 70290 70164 68380 45683
Curiously, the numbers of births on days 1 and 7 (Sunday and Saturday, respectively) are sharply lower than the numbers of births on other days. We can use a chi-squared test to determine the probability that this distribution arose by chance. As noted above, by default, chisq.test assumes that the actual probability of a baby being born on each day is given by the vector p=rep(1/length(x), length(x)), which in this case is 1/7 for every day. So we’re asking what the probability is that a distribution at least as unbalanced as the one above arose by chance:
> chisq.test(births2006.byday)
Chi-squared test for given probabilities
data: births2006.byday
X-squared = 15873.20, df = 6, p-value < 2.2e-16
As you might have guessed, this effect was statistically significant. The p-value is very, very small, indicating that it is very unlikely that this effect arose due to chance. (Of course, with a sample this big, it’s not hard to find significant effects.)
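If you instead wanted to test a non-uniform hypothesis, you could supply your own vector of probabilities through p. As a sketch, suppose we hypothesized (purely for illustration) that weekend days were only two-thirds as likely as weekdays; setting rescale.p=TRUE normalizes the weights so they sum to 1:

> day.weights <- c(2, 3, 3, 3, 3, 3, 2)
> chisq.test(births2006.byday, p = day.weights, rescale.p = TRUE)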
The chisq.test function can also perform tests on multidimensional tables. As an example, let’s build a table showing the number of births by day and month:
> births2006.bydayandmonth <- table(births2006.smpl$DOB_WK,
+    births2006.smpl$DOB_MM)
> births2006.bydayandmonth

       1    2    3    4    5    6    7    8    9   10   11   12
  1 3645 2930 2965 3616 2969 3036 3976 3160 3270 3964 2999 3744
  2 5649 4737 4779 4853 5712 5033 6263 5127 4850 6167 5043 4544
  3 6265 5293 5251 5297 6472 5178 5149 7225 5805 6887 5619 5334
  4 5131 5280 6486 5173 6496 5540 5499 7011 5725 5445 6838 5666
  5 5127 5271 6574 5162 5347 6863 5780 6945 5822 5538 6165 5570
  6 4830 5305 6330 5042 4975 6622 5760 5530 7027 5256 5079 6624
  7 3295 3392 3408 4185 3364 3464 4751 3686 4669 3564 3509 4396
As above, let’s check the probability that this distribution arose by chance under the assumption that the probability of each combination was equal:
> chisq.test(births2006.bydayandmonth)
Pearson's Chi-squared test
data: births2006.bydayandmonth
X-squared = 4729.620, df = 66, p-value < 2.2e-16
Much like the one-dimensional table, we see that the effects are statistically significant; it is very unlikely that this unbalanced distribution arose due to chance.
For three-way interactions, you can try a Cochran-Mantel-Haenszel test. This is implemented in R through the mantelhaen.test function:
mantelhaen.test(x, y = NULL, z = NULL, alternative = c("two.sided", "less", "greater"), correct = TRUE, exact = FALSE, conf.level = 0.95)
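For example, R’s built-in UCBAdmissions data set is a three-dimensional contingency table (admission decision by gender by department), so a quick sketch of the test checks whether admission and gender are associated after controlling for department:

> mantelhaen.test(UCBAdmissions)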
To test for symmetry in a two-dimensional contingency table, you can use McNemar’s chi-squared test. This is implemented in R as mcnemar.test:
mcnemar.test(x, y = NULL, correct = TRUE)
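McNemar’s test is typically applied to paired observations, for example, the same subjects measured at two points in time. Here is a minimal sketch, assuming a hypothetical 2 × 2 table of approval ratings from the same respondents in two successive surveys:

> approval <- matrix(c(794, 86, 150, 570), nrow = 2,
+    dimnames = list(c("approve", "disapprove"),
+                    c("approve", "disapprove")))
> mcnemar.test(approval)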
The Friedman rank-sum test is a non-parametric counterpart to two-way ANOVA tests. In R, this is implemented through the friedman.test function:
friedman.test(y, ...)

## Default S3 method:
friedman.test(y, groups, blocks, ...)

## S3 method for class 'formula':
friedman.test(formula, data, subset, na.action, ...)
As an example, let’s revisit one of the tables from above:
> friedman.test(method.and.sex.twins)
Friedman rank sum test
data: method.and.sex.twins
Friedman chi-squared = 2, df = 1, p-value = 0.1573
Just like the chi-squared test, the Friedman rank-sum test finds no statistically significant association between the two variables: the p-value (0.1573) is well above 0.05.
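Note that friedman.test is more typically applied to unreplicated blocked data: a matrix with one row per block and one column per group. Here is a minimal sketch with hypothetical scores from four judges (the blocks) each rating three wines (the groups):

> wine.scores <- matrix(c(7, 8, 6, 9,
+                         5, 6, 4, 7,
+                         8, 9, 7, 8), nrow = 4,
+    dimnames = list(paste0("judge", 1:4), c("A", "B", "C")))
> friedman.test(wine.scores)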