There is a different set of tests for looking at the statistical significance of discrete random variables (like counts or proportions), and so there is a different set of functions in R for performing those tests.
If you have a data set with several different groups of observations and are measuring the probability of success in each group (or the fraction of some other characteristic), you can use the function prop.test to measure whether the difference between groups is statistically significant. Specifically, prop.test can be used for testing the null hypothesis that the proportions (probabilities of success) in several groups are the same or that they equal certain given values:
prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"), conf.level = 0.95, correct = TRUE)
As an example, let’s revisit the field goal data. Above, we considered the question, “Is there a difference in the length of attempts indoors and outdoors?” Now we’ll ask the question, “Is the probability of success the same indoors as it is outdoors?”
First, let’s create a new data set containing only good and bad field goals. (We’ll eliminate blocked and aborted attempts; there were only 8 aborted attempts and 24 blocked attempts in 2005, but 787 good attempts and 163 bad [no good] attempts.)
> field.goals.goodbad <- field.goals[field.goals$play.type=="FG good" |
+    field.goals$play.type=="FG no", ]
Now let’s create a table of successes and failures by stadium type:
> field.goals.table <- table(field.goals.goodbad$play.type,
+    field.goals.goodbad$stadium.type)
> field.goals.table

             Both  In Out
  FG aborted    0   0   0
  FG blocked    0   0   0
  FG good      53 152 582
  FG no        14  24 125
The table isn’t quite right for prop.test; we need a table with two columns (one with a count of successes and one with a count of failures), and we don’t want to show empty factor levels. Let’s remove the two rows we don’t need and transpose the table:
> field.goals.table.t <- t(field.goals.table[3:4,])
> field.goals.table.t

       FG good FG no
  Both      53    14
  In       152    24
  Out      582   125
Now we’re ready to see if there is a statistically significant difference in success rates among the three groups. We can simply call prop.test on the field.goals.table.t object to check:
> prop.test(field.goals.table.t)
3-sample test for equality of proportions without
continuity correction
data: field.goals.table.t
X-squared = 2.3298, df = 2, p-value = 0.3120
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3
0.7910448 0.8636364 0.8231966
As you can see, the results are not statistically significant: the p-value (0.3120) is well above the conventional 0.05 cutoff, so we cannot reject the null hypothesis that the probability of success is the same in all three groups.
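Incidentally, we could have run the same test by passing prop.test a vector of success counts and a vector of trial totals directly; here the totals (67, 176, and 707) are just the row sums of field.goals.table.t:

> prop.test(x = c(53, 152, 582), n = c(67, 176, 707))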
Often, an experiment consists of a series of identical trials, each of which has only two outcomes. For example, suppose that you wanted to test the hypothesis that the probability that a coin would land on heads was 0.5. You might design an experiment where you flipped the coin 50 times and counted the number of heads. Each coin flip is an example of a Bernoulli trial. The distribution of the number of heads is given by the binomial distribution.
R includes a function for evaluating such a trial to determine whether to accept or reject the hypothesis:
binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"), conf.level = 0.95)
The argument x gives the number of successes, n gives the total number of trials, p gives the hypothesized probability of success, alternative gives the alternative hypothesis, and conf.level gives the confidence level for the returned confidence interval.
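For instance, here is a minimal sketch of the coin-flip experiment described above, assuming a hypothetical outcome of 28 heads in 50 flips:

> binom.test(x=28, n=50, p=0.5)

If the reported p-value were above the conventional 0.05 cutoff, we could not reject the hypothesis that the coin was fair.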
As a real-world example, let’s look at David Ortiz’s performance during the 2008 season. In 2008, he had a batting average of .264 (110 hits in 416 at bats). Suppose that he was actually a .300 hitter, meaning that the actual probability that he would get a hit in a given at bat was 0.3. What were the odds that he hit .264 or less in this number of at bats? We can use the function binom.test to estimate this probability:
> binom.test(x=110, n=416, p=0.3, alternative="less")
Exact binomial test
data: 110 and 416
number of successes = 110, number of trials = 416, p-value = 0.06174
alternative hypothesis: true probability of success is less than 0.3
95 percent confidence interval:
0.0000000 0.3023771
sample estimates:
probability of success
0.2644231
The p-value here represents the probability that, if the true probability of success were .300, the observed fraction of successes would be at least as far from that value as the fraction actually observed (0.2644231). Because we specified the alternative hypothesis “less,” the p-value is the probability that the fraction of successes was 0.2644231 or lower, which in this case was 0.06174.
In plain English, this means that if David Ortiz were a “true” .300 hitter, the probability that he would hit .264 or worse over 416 at bats was 0.06174.
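We can double-check this figure against the cumulative binomial distribution function: because we specified alternative="less", the exact test’s p-value is simply the probability of observing 110 or fewer successes in 416 trials when the success probability is 0.3:

> pbinom(110, 416, 0.3)

This call returns the same value as the p-value reported above (0.06174).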
A common problem is to look at a table of data and determine if there is a relationship between two categorical variables. If there were no relationship, the two variables would be statistically independent. In these tests, the hypothesis is that the two variables are independent. The alternative hypothesis is that the two variables are not independent.
Tables of data often come up in experimental contexts: there is one column of data from a test population and one from a control population. In this context, the analyst often wants to calculate the probability that the two sets of data could have come from the same population (which would imply the same proportions in each). This is an equivalent problem, so the same test functions can be used.
For small contingency tables (and small counts), you can obtain the best results using Fisher’s exact test. Fisher’s exact test calculates the probability that the deviation from independence is greater than or equal to that in the observed sample, assuming the variables really are independent. So a high p-value means that the sample data is consistent with the two variables being independent, while a low p-value means that the observed data would be unlikely if the two variables were truly independent, suggesting that they are not.
In R, you can use the function fisher.test to perform Fisher’s exact test:
fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE, control = list(), or = 1, alternative = "two.sided", conf.int = TRUE, conf.level = 0.95, simulate.p.value = FALSE, B = 2000)
Here is a description of the arguments to fisher.test.
Argument | Description | Default |
---|---|---|
x | Specifies the sample data to use for the test. Either a matrix (representing a two-dimensional contingency table) or a factor. | |
y | Specifies the sample data to use for the test. If x is a factor, then y should be a factor. If x is a matrix, then y is ignored. | NULL |
workspace | An integer value specifying the size of the workspace to use in the network algorithm (in units of 4 bytes). | 200000 |
hybrid | For tables larger than 2 × 2, specifies whether exact probabilities should be calculated (hybrid=FALSE) or an approximation should be used (hybrid=TRUE). | FALSE |
control | A list of named components for low-level control of fisher.test; see the help file for more information. | list() |
or | The hypothesized odds ratio for the 2 × 2 case. | 1 |
alternative | The alternative hypothesis. Must be one of "two.sided", "greater", or "less". | "two.sided" |
conf.int | A logical value specifying whether to compute and return confidence intervals in the results. | TRUE |
conf.level | Specifies the confidence level to use in computing the confidence interval. | 0.95 |
simulate.p.value | A logical value indicating whether to use Monte Carlo simulation to compute p-values in tables larger than 2 × 2. | FALSE |
B | An integer indicating the number of replicates to use in Monte Carlo simulations. | 2000 |
If you specify x and y as factors, then R will compute a contingency table from these factors. Alternatively, you can specify a matrix for x containing the contingency table.
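For instance, here is a minimal sketch using the matrix form, with a hypothetical 2 × 2 contingency table (the counts and labels are invented for illustration):

> treatment.table <- matrix(c(12, 5, 9, 14), nrow = 2,
+    dimnames = list(c("improved", "not improved"),
+                    c("treated", "control")))
> fisher.test(treatment.table)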
Fisher’s exact test can be very computationally intensive for large tables, so statisticians usually use an alternative: chi-squared tests. Chi-squared tests are not exactly the same as Fisher’s test. With a chi-squared test, you explicitly state a hypothesis about the probability of each event and then compare the sample distribution to the hypothesis. The p-value is the probability of obtaining, by chance, a distribution at least as different from the hypothesized distribution as the one observed.
In R, you can use the function chisq.test to calculate both chi-squared contingency table tests and goodness-of-fit tests:
chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), rescale.p = FALSE, simulate.p.value = FALSE, B = 2000)
Here is a description of the arguments to chisq.test.
Argument | Description | Default |
---|---|---|
x | Specifies the sample data to use for the test. Either a matrix or a vector. | |
y | Specifies the sample data to use for the test. If x is a factor, then y should be a vector. If x is a matrix, then y is ignored. | NULL |
correct | A logical value specifying whether to apply continuity correction when computing the test statistic for 2 × 2 tables. | TRUE |
p | A vector of probabilities that represent the hypothesis to test. (Note that the default is to assume equal probability for each item.) | rep(1/length(x), length(x)) |
rescale.p | A logical value indicating whether p needs to be rescaled to sum to 1. | FALSE |
simulate.p.value | A logical value indicating whether to compute p-values using Monte Carlo simulation. | FALSE |
B | An integer indicating the number of replicates to use in Monte Carlo simulations. | 2000 |
If you specify x and y as vectors, then R will compute a contingency table from these vectors (after coercing them to factors). Alternatively, you can specify a matrix for x containing the contingency table.
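You can also pass a vector of counts for x together with the p argument to run a goodness-of-fit test. Here is a minimal sketch with hypothetical counts from 60 rolls of a die, testing the hypothesis that each face is equally likely (p is written out explicitly here, although it matches the default):

> rolls <- c(10, 8, 12, 9, 14, 7)
> chisq.test(rolls, p = rep(1/6, 6))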
As an example, let’s use the 2006 births data set. (For a detailed description of this data set, see Univariate Trellis Plots.) We will take a look at the number of male and female babies delivered during July 2006, by delivery method. We’ll take a subset of births during July where the delivery method was known and then tabulate the results:
> births.july.2006 <- births2006.smpl[births2006.smpl$DMETH_REC!="Unknown" &
+    births2006.smpl$DOB_MM==7, ]
> nrow(births2006.smpl)
[1] 427323
> nrow(births.july.2006)
[1] 37060
> method.and.sex <- table(
+    births.july.2006$SEX,
+    as.factor(as.character(births.july.2006$DMETH_REC)))
> method.and.sex

    C-section Vaginal
  F      5326   12622
  M      6067   13045
Note that the delivery methods were actually slightly unbalanced by gender during July 2006:
> 5326 / (5326 + 6067)
[1] 0.46748
> 12622 / (12622 + 13045)
[1] 0.4917598
However, there isn’t an intuitive reason why this should be true. So let’s check whether this difference is statistically significant: is the difference simply due to chance (which would be consistent with delivery method and sex being independent), or is it evidence of a real association? We can use Fisher’s exact test to answer this question:
> fisher.test(method.and.sex)
Fisher's Exact Test for Count Data
data: method.and.sex
p-value = 1.604e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.8678345 0.9485129
sample estimates:
odds ratio
0.9072866
The p-value is the probability of obtaining results at least as far removed from independence as these, assuming the two variables really were independent. In this case, the p-value is very low, indicating that results this far from independence would be very unlikely by chance. This implies that we should reject the hypothesis that the two variables are independent.
As a second example, let’s look only at twin births. (Note that each record represents a single birth, not a single pregnancy.)
> twins.2006 <- births2006.smpl[births2006.smpl$DPLURAL=="2 Twin" &
+    births2006.smpl$DMETH_REC != "Unknown",]
> method.and.sex.twins <-
+    table(twins.2006$SEX,
+          as.factor(as.character(twins.2006$DMETH_REC)))
> method.and.sex.twins

    C-section Vaginal
  F      4924    1774
  M      5076    1860
Now let’s see if there is a statistically significant difference in delivery methods between the two sexes:
> fisher.test(method.and.sex.twins)
Fisher's Exact Test for Count Data
data: method.and.sex.twins
p-value = 0.67
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.9420023 1.0981529
sample estimates:
odds ratio
1.017083
In this case, the p-value (0.67) is very high, so we cannot reject the hypothesis that the two variables are independent.
We can look at the same table using a chi-squared test:
> chisq.test(method.and.sex.twins)
Pearson's Chi-squared test with Yates' continuity
correction
data: method.and.sex.twins
X-squared = 0.1745, df = 1, p-value = 0.6761
By the way, we could also have just passed the two factors to chisq.test, and chisq.test would have calculated the contingency table for us:
> chisq.test(twins.2006$DMETH_REC, twins.2006$SEX)
Pearson's Chi-squared test with Yates' continuity
correction
data: twins.2006$DMETH_REC and twins.2006$SEX
X-squared = 0.1745, df = 1, p-value = 0.6761
As above, the p-value is very high, so we cannot reject the hypothesis that delivery method and sex are independent for twin births.
Let’s ask another interesting question: how many babies are born on weekdays versus weekends? Let’s start by tabulating the number of births, by day of week, during 2006:
> births2006.byday <- table(births2006.smpl$DOB_WK)
> births2006.byday

    1     2     3     4     5     6     7
40274 62757 69775 70290 70164 68380 45683
Curiously, the numbers of births on days 1 and 7 (Sunday and Saturday, respectively) are sharply lower than the numbers of births on other days. We can use a chi-squared test to determine the probability that this distribution arose by chance. As noted above, by default, chisq.test assumes that the actual probability of a baby being born on each day is given by the vector p=rep(1/length(x), length(x)), which in this case is 1/7 for every day. So we’re asking what the probability is that a distribution at least as unbalanced as the one above arose by chance:
> chisq.test(births2006.byday)
Chi-squared test for given probabilities
data: births2006.byday
X-squared = 15873.20, df = 6, p-value < 2.2e-16
As you might have guessed, this effect was statistically significant. The p-value is very, very small, indicating that it is very unlikely that this effect arose due to chance. (Of course, with a sample this big, it’s not hard to find significant effects.)
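If you instead wanted to test a non-uniform hypothesis, you could supply your own vector of probabilities through p. As a sketch, suppose we hypothesized (purely for illustration) that weekend days were only two-thirds as likely as weekdays; setting rescale.p=TRUE normalizes the weights so they sum to 1:

> day.weights <- c(2, 3, 3, 3, 3, 3, 2)
> chisq.test(births2006.byday, p = day.weights, rescale.p = TRUE)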
The chisq.test function can also perform tests on multidimensional tables. As an example, let’s build a table showing the number of births by day and month:
> births2006.bydayandmonth <- table(births2006.smpl$DOB_WK,
+    births2006.smpl$DOB_MM)
> births2006.bydayandmonth

       1    2    3    4    5    6    7    8    9   10   11   12
  1 3645 2930 2965 3616 2969 3036 3976 3160 3270 3964 2999 3744
  2 5649 4737 4779 4853 5712 5033 6263 5127 4850 6167 5043 4544
  3 6265 5293 5251 5297 6472 5178 5149 7225 5805 6887 5619 5334
  4 5131 5280 6486 5173 6496 5540 5499 7011 5725 5445 6838 5666
  5 5127 5271 6574 5162 5347 6863 5780 6945 5822 5538 6165 5570
  6 4830 5305 6330 5042 4975 6622 5760 5530 7027 5256 5079 6624
  7 3295 3392 3408 4185 3364 3464 4751 3686 4669 3564 3509 4396
As above, let’s check the probability that this distribution arose by chance under the assumption that the probability of each combination was equal:
> chisq.test(births2006.bydayandmonth)
Pearson's Chi-squared test
data: births2006.bydayandmonth
X-squared = 4729.620, df = 66, p-value < 2.2e-16
Much like the one-dimensional table, we see that the effects are statistically significant; it is very unlikely that this unbalanced distribution arose due to chance.
For three-way interactions, you can try a Cochran-Mantel-Haenszel test. This is implemented in R through the mantelhaen.test function:
mantelhaen.test(x, y = NULL, z = NULL, alternative = c("two.sided", "less", "greater"), correct = TRUE, exact = FALSE, conf.level = 0.95)
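For example, R’s built-in UCBAdmissions data set is a three-dimensional contingency table (admission decision by gender by department), so a quick sketch of the test checks whether admission and gender are associated after controlling for department:

> mantelhaen.test(UCBAdmissions)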
To test for symmetry in a two-dimensional contingency table, you can use McNemar’s chi-squared test. This is implemented in R as mcnemar.test:
mcnemar.test(x, y = NULL, correct = TRUE)
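McNemar’s test is typically applied to paired observations, for example, the same subjects measured at two points in time. Here is a minimal sketch, assuming a hypothetical 2 × 2 table of approval ratings from the same respondents in two successive surveys:

> approval <- matrix(c(794, 86, 150, 570), nrow = 2,
+    dimnames = list(c("approve", "disapprove"),
+                    c("approve", "disapprove")))
> mcnemar.test(approval)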
The Friedman rank-sum test is a non-parametric counterpart to two-way ANOVA tests. In R, this is implemented through the friedman.test function:
friedman.test(y, ...)

## Default S3 method:
friedman.test(y, groups, blocks, ...)

## S3 method for class 'formula':
friedman.test(formula, data, subset, na.action, ...)
As an example, let’s revisit one of the tables from above:
> friedman.test(method.and.sex.twins)
Friedman rank sum test
data: method.and.sex.twins
Friedman chi-squared = 2, df = 1, p-value = 0.1573
Just like the chi-squared test, the Friedman rank-sum test finds no statistically significant association between the two variables: the p-value (0.1573) is well above 0.05.
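Note that friedman.test is more typically applied to unreplicated blocked data: a matrix with one row per block and one column per group. Here is a minimal sketch with hypothetical scores from four judges (the blocks) each rating three wines (the groups):

> wine.scores <- matrix(c(7, 8, 6, 9,
+                         5, 6, 4, 7,
+                         8, 9, 7, 8), nrow = 4,
+    dimnames = list(paste0("judge", 1:4), c("A", "B", "C")))
> friedman.test(wine.scores)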