Chapter 13
Selecting Statistical Tests

13.1 Overview
13.2 Glossary
13.3 Examples
13.4 Exercises

Overview

The last nine chapters of the textbook are structured around flowcharts that describe how a statistician selects a method of analysis for a particular dataset and a particular research question. While we read each of those chapters, the flowcharts gave us an idea of how the statistical procedures described fit into the organization of statistical methods, but we were focused on the methods being described rather than the overall structure of the flowcharts. Chapter 13 in the workbook gives us an opportunity to step back and consider all of the flowcharts.

The flowcharts are collected in Appendix A of both the textbook and the workbook, as well as at the beginning of each of the last nine chapters of the textbook. To start, consider the master flowchart (Flowchart 1). This is where we begin to select an appropriate approach to a set of data. First, we need to identify the dependent variable. We can do this by asking ourselves, “What do we want to make an estimate of or test a hypothesis about?” The answer is the dependent variable. It is not unusual to have more than one dependent variable in a dataset. That is ok. The most common approach to this situation is to consider each of the dependent variables, one at a time.

Next, we need to identify the independent variable(s). We find these by asking ourselves, “Under what conditions are we interested in examining the dependent variable?” For example, this could be comparing groups of dependent variable values. Nominal independent variables are needed to specify these groups. We need one fewer nominal independent variable(s) than there are groups to compare. Our concern at this point is just to count the independent variables and decide if there are no independent variables, one independent variable, or more than one independent variable.
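
The rule that we need one fewer nominal independent variable than there are groups can be illustrated with indicator (0/1) coding. The following is a minimal sketch rather than the textbook's own notation: the function name `indicator_variables` and the group labels are hypothetical, and using the first group as the reference group is an assumption made for illustration.

```python
def indicator_variables(groups):
    """Map each group to a list of 0/1 indicators, using the first group as reference.

    With k groups, each list has k - 1 indicators (an assumed, common coding scheme).
    """
    others = groups[1:]
    coding = {}
    for group in groups:
        # The reference group is identified by all of its indicators being 0.
        coding[group] = [1 if group == other else 0 for other in others]
    return coding


# Hypothetical labels for three groups of dependent variable values.
coding = indicator_variables(["group A", "group B", "group C"])
for group, indicators in coding.items():
    print(group, indicators)
```

Three groups are fully distinguished by two indicators, which is why the count of nominal independent variables is one less than the number of groups.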

The final thing we need to do with the master flowchart (Flowchart 1) is decide what type of data is being represented by the dependent variable. We need to choose among three types: continuous, ordinal, or nominal. The introduction to Part Two in the textbook describes these types of data. To summarize, the types of data differ according to their ability to order the data values in a biologically meaningful way and according to the spacing between values. Continuous data have ordered, evenly-spaced values. Ordinal data have ordered values, but the spacing between values is undefined. Nominal data cannot be ordered in a meaningful way. Identifying the type of data represented by the dependent variable gets us to the next part of the flowchart, each associated with a chapter of the textbook.

The next thing we do depends on whether or not there are independent variables. If there are independent variables, we need to identify the type of data represented by those independent variables. Then, we might have committed to a single path to the end of the flowchart, or there may be more decisions to make. These decisions are related to the research interest, and the nature of the independent variable(s).

As we approach the end of the flowchart, three pieces of information are itemized. First, there is the point estimate(s) that is most often of interest. Next comes the common name of the procedure. Finally, there is the name of the general method or standard distribution that is used to perform the analysis. Now, we know quite a bit about how to analyze a particular set of data, but there still might be decisions to make. For example, we need to decide whether we want to take chance into account by calculating an interval estimate or by hypothesis testing. We turn to Chapter 3 of the textbook to answer that question. Other decisions are discussed in the chapter to which the flowchart belongs.
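
The first decisions in the master flowchart (count the independent variables, then identify the type of data represented by the dependent variable) can be sketched as a simple lookup. This is only an illustration: the names `CHAPTER_BY_PATH`, `dataset_type`, and `select_chapter` are hypothetical, and the table contains only the chapter numbers that the examples in this chapter mention. Other paths exist in the flowcharts but are deliberately left out here.

```python
# Chapters named in this chapter's examples, keyed by
# (dataset type, type of data represented by the dependent variable).
# Paths not mentioned in the examples are intentionally omitted.
CHAPTER_BY_PATH = {
    ("univariable", "continuous"): 4,
    ("univariable", "nominal"): 6,
    ("bivariable", "continuous"): 7,
    ("bivariable", "nominal"): 9,
    ("multivariable", "continuous"): 10,
}


def dataset_type(n_independent):
    """Classify a dataset by its number of independent variables."""
    if n_independent == 0:
        return "univariable"
    if n_independent == 1:
        return "bivariable"
    return "multivariable"


def select_chapter(n_independent, dependent_type):
    """Return the textbook chapter for this path, or None if it is not listed above."""
    return CHAPTER_BY_PATH.get((dataset_type(n_independent), dependent_type))


print(select_chapter(0, "nominal"))     # 6
print(select_chapter(2, "continuous"))  # 10
```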

Glossary

Bivariable Dataset – a collection of data that includes one dependent variable and one independent variable.
Continuous Data – measurements that can be ordered and that are evenly-spaced. They generally have a large number of possible values.
Dependent Variable – the variable that represents the data of primary interest. These are the data for which we want to make an estimate or test a hypothesis.
Independent Variable – a variable that represents data that specify conditions under which we are interested in the dependent variable.
Multivariable Dataset – a collection of data that includes one dependent variable and more than one independent variable.
Multivariate Methods – statistical methods designed to analyze more than one dependent variable in a single analysis. Multivariate methods are not commonly used in health research.
Nominal Data – measurements that cannot be ordered in a biologically meaningful way.
Ordinal Data – measurements that can be ordered, but where the spacing between values is uneven or undefined.
Univariable Dataset – a collection of data in which there is one dependent variable and no independent variables.
Variable – a theoretical entity that represents data in the mathematics of statistical methods.

Examples

13.1. A state health department has conducted a survey in a random sample of 500 high school students to estimate the frequency of two risk-taking behaviors: smoking and unprotected sexual intercourse. Among those persons in the sample, they found that 123 students were currently smoking and 97 students were regularly engaged in unprotected sexual intercourse. What type of point estimate should be used to summarize these observations?

We begin by trying to identify the data that will be represented by a dependent variable. We do this by asking ourselves for which data we are interested in making an estimate or testing a hypothesis. There are two types of data mentioned in this study: smoking and unprotected intercourse. We are equally interested in both, so there are two dependent variables. The usual biostatistical practice is to perform a separate analysis for each dependent variable. Thus, we consider the data in this example to be part of two separate datasets. There are no specific conditions under which we are interested in examining these two dependent variables, so they are univariable datasets.

Both dependent variables represent nominal data. This means that Chapter 6 discusses how to analyze these data. There are three point estimates addressed in Chapter 6: prevalence, risk, and incidence. Incidence and risk address intervals of time. We do not have information on intervals of time; only a point in time. Therefore, we would estimate prevalences.
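
As a quick check of the arithmetic, the two prevalences can be computed directly from the counts given in the example. The function name `prevalence` is hypothetical and is used only for this sketch.

```python
def prevalence(cases, sample_size):
    """Proportion of the sample with the characteristic at a point in time."""
    return cases / sample_size


n_students = 500
smoking = prevalence(123, n_students)
unprotected = prevalence(97, n_students)

print(f"Prevalence of smoking: {smoking:.3f}")
print(f"Prevalence of unprotected intercourse: {unprotected:.3f}")
```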

13.2. For the survey described in Example 13.1, what is the best way to take into account the role of chance in obtaining the sample of 500 students?

There are two aspects of taking chance into account that we need to consider. The first involves choosing between interval estimation and hypothesis testing. We can always perform interval estimation, but hypothesis testing requires a sensible (i.e., of biologic interest) null hypothesis. When we have a univariable dataset, a sensible null hypothesis requires a paired sample. This is not a paired sample, so we will use interval estimation.

The second aspect is particular to nominal dependent variables. It is the choice between an exact procedure and a normal approximation. The better choice is the exact procedure, if we have a computer program to perform it. Otherwise, the normal approximation is the better choice.
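
To see how the two choices compare for the smoking data (123 of 500), the sketch below computes a normal-approximation interval and an exact interval. It rests on assumptions that should be read as such: the example does not specify a confidence level, so 95% (z = 1.96) is assumed, and the "exact procedure" is taken to mean the Clopper-Pearson interval, which may or may not be the procedure the textbook has in mind. All function names are hypothetical, and only the Python standard library is used.

```python
import math


def normal_approx_interval(cases, n, z=1.96):
    """Normal-approximation interval for a proportion (z = 1.96 assumes about 95%)."""
    p = cases / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_term)
    return min(total, 1.0)


def bisect_increasing(g, lo=1e-12, hi=1 - 1e-12, tol=1e-10):
    """Root of an increasing function g on (lo, hi), found by bisection."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(cases, n, alpha=0.05):
    """Clopper-Pearson bounds (assumed meaning of 'exact'); requires 0 < cases < n."""
    # Lower bound: P(X >= cases) equals alpha / 2.
    lower = bisect_increasing(lambda p: (1 - binomial_cdf(cases - 1, n, p)) - alpha / 2)
    # Upper bound: P(X <= cases) equals alpha / 2.
    upper = bisect_increasing(lambda p: alpha / 2 - binomial_cdf(cases, n, p))
    return lower, upper


approx_low, approx_high = normal_approx_interval(123, 500)
exact_low, exact_high = exact_interval(123, 500)
print(f"Normal approximation: {approx_low:.4f} to {approx_high:.4f}")
print(f"Exact (Clopper-Pearson): {exact_low:.4f} to {exact_high:.4f}")
```

With a sample of 500 and a prevalence that is not close to 0 or 1, the two intervals should be similar; the difference tends to matter more for small samples or extreme proportions.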

13.3. A group of researchers is studying the relationship between RBC (red blood cell) counts and doses of folic acid. In this study, they randomly assign 100 persons with pernicious anemia to each of 10 doses of folic acid and, after a period of two weeks, they perform a RBC count. What point estimate(s) should they use to summarize these observations?

Here we have two types of data: RBC counts and dose of folic acid. When we ask ourselves which we want to make estimates of, the answer is RBC counts. Dose specifies the condition under which we are interested in RBC counts. Thus, it is represented by an independent variable. It is the only independent variable, so we have a bivariable dataset. The dependent variable is RBC counts, which can be considered continuous, so we are interested in the flowchart in Chapter 7.

Now that we have an independent variable, the next thing we need to decide is what type of data are represented by the independent variable. Doses are continuous, so we are in the left-hand branch of the flowchart. Now, we have to decide what question the researchers are asking. Do they want to estimate RBC for a given dose or do they want to describe the strength of the association between RBC and dose? This is not completely clear, but the likely answer is that they want to describe the relationship, rather than say how strong it is. Thus, “Intercept and slope” is the best answer.

13.4. In the study described in Example 13.3, the researchers are concerned about potential confounding effects of gender in their investigation of the relationship between RBC counts and dose of folic acid. What is the name of the general class of statistical procedures that they should use to analyze these observations while controlling for gender?

This is a continuation of Example 13.3. What has changed is that another variable, gender, has been introduced. The interest in including gender is to control for its effect. Thus, gender is represented by an independent variable. Now, we have two independent variables: dose and gender. This means we have a multivariable dataset with a continuous dependent variable. This puts us in Chapter 10 of the textbook. Since we have both a continuous (dose) and a nominal (gender) independent variable, we find ourselves in the right-hand branch of the flowchart. This leads to analysis of covariance.

13.5. Suppose we conducted a study of serum cholesterol levels among persons randomly sampled from five different countries. In the analysis of these data, we are interested in comparing mean serum cholesterol levels between genders as well as between countries. What is the name of the general class of statistical procedures that we should use to analyze these data?

In this study, it is clear that serum cholesterol is represented by the dependent variable and that gender and country will be represented by independent variables. We have more than one independent variable and the dependent variable is continuous, so we are in Chapter 10 of the textbook. All independent variables are nominal, so we are in the middle branch of the flowchart. Here, we have to decide whether the nominal independent variables specify categories of one or more than one characteristic. There are two characteristics: gender and country. Thus, we want to use a factorial analysis of variance to analyze these data.

13.6. Suppose we are interested in the strength of the relationship between body mass index (BMI) and systolic blood pressure (SBP). To investigate this relationship, we randomly select 250 persons from a particular population and measure their BMI and SBP. What point estimate(s) would be the most appropriate to summarize the data?

It is not clear which of the two continuous variables is the dependent variable and which is the independent variable. That often is the case when the interest is in the strength of the association between two continuous variables. That does not impede our use of the flowchart. It is clear that the dependent variable must be continuous and that we have one independent variable. Thus, we are in Chapter 7 of the textbook. The independent variable is continuous, so we are in the left-hand branch of that flowchart. If we follow the path for strength of the association, we encounter Pearson's correlation coefficient.
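
A short sketch of Pearson's correlation coefficient shows why it does not matter which variable is called dependent: the formula treats the two variables symmetrically. The BMI and SBP values below are hypothetical numbers invented for illustration (they are not the 250 observations in the example), and the function name `pearson_r` is likewise an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for five persons, for illustration only.
bmi = [21.0, 24.5, 27.0, 30.5, 33.0]
sbp = [112.0, 118.0, 125.0, 131.0, 140.0]

r = pearson_r(bmi, sbp)
print(f"r = {r:.3f}")
# Swapping the roles of the two variables gives the same coefficient.
print(f"r with roles swapped = {pearson_r(sbp, bmi):.3f}")
```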

13.7. For the data in Example 13.6, suppose we are interested in estimating the mean SBP for persons with a particular BMI. What point estimates would be the most appropriate to summarize the data to reflect that interest?

This is a modification of Example 13.6. Now we know the dependent variable represents SBP. The difference here is the question being asked. Instead of strength of the association, we are interested in estimating dependent variable values corresponding to values of the independent variable. Thus, we take the branch that leads to slope and intercept.

13.8. What is the common name of the statistical method that would be best used to address the interest in Example 13.7?

This continuation of Example 13.7 requires us to follow the same branch of the flowchart so we can find the common name of the procedure. That name is linear regression analysis.
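
The intercept and slope from Example 13.7 can be estimated by least squares and then used to estimate SBP at a chosen BMI. The sketch below uses hypothetical data for five persons (not the study's observations) and a hypothetical function name, `least_squares_line`; the BMI of 28 is also chosen arbitrarily.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical values for five persons, for illustration only.
bmi = [21.0, 24.5, 27.0, 30.5, 33.0]
sbp = [112.0, 118.0, 125.0, 131.0, 140.0]

intercept, slope = least_squares_line(bmi, sbp)
estimated_sbp = intercept + slope * 28.0  # estimated mean SBP at a BMI of 28
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"estimated mean SBP at BMI 28 = {estimated_sbp:.1f}")
```

Unlike the correlation coefficient, this calculation is not symmetric: here SBP is the dependent variable and BMI is the independent variable.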

13.9. Now, suppose we want to control for age in the analysis addressing the interest described in Example 13.8. What is the common name of the statistical method that would be the best to use?

This is a continuation of Example 13.8. Here, we add a second continuous independent variable (age). That means we have a multivariable dataset. Since the dependent variable is continuous, we are in Chapter 10. Both independent variables are continuous, so we are in the left-hand branch of the flowchart. Our interest still is in estimating dependent variable values; thus we want to use multiple regression analysis.

13.10. Finally, suppose we want to control for both age and gender in the analysis addressing the interest described in Example 13.8. What is the common name of the statistical method that would be the best to use?

This is a continuation of Example 13.9. Gender is added as another independent variable. That means we have a mixture of continuous and nominal independent variables. This leads us to analysis of covariance.

13.11. Suppose we measure the serum cholesterol level for 50 persons on a particular diet. What is the best way to summarize those measurements while taking chance into account?

This is similar to Example 13.2 in that we need to decide between interval estimation and hypothesis testing as ways to take chance into account. As in Example 13.2, there is no sensible null hypothesis to test, so the only possible answer is interval estimation. If we were uncertain about what parameter to estimate, we could recognize this as a univariable sample with a continuous dependent variable. That selects the Chapter 4 flowchart.

13.12. For the study described in Example 13.11, suppose we want to look at how much the serum cholesterol changes when people use a particular diet. To study that change, we measure, for each of the 50 persons, serum cholesterol just prior to starting the diet and then again after being on the diet for 30 days (i.e., both measurements are made for each individual person). What would be the best statistical test to analyze these data?

Here, we have two measurements of serum cholesterol for each person. We are told, however, it is neither measurement of serum cholesterol that is of interest. Instead the difference between the two measurements is the dependent variable. There are no other variables, so this is a univariable sample with a continuous dependent variable. That selects the flowchart in Chapter 4.
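
Treating the difference as the dependent variable can be sketched as follows. The cholesterol values are hypothetical (five persons rather than the 50 in the example), and the function name `paired_summary` is an assumption. The t statistic shown is the one conventionally used in a paired t test of the null hypothesis that the mean difference is zero; naming that test here is an assumption about where the Chapter 4 flowchart leads, since the answer above stops at the chapter.

```python
import math
import statistics


def paired_summary(before, after):
    """Summarize within-person differences (after minus before)."""
    differences = [a - b for a, b in zip(after, before)]
    n = len(differences)
    mean_diff = statistics.mean(differences)
    sd_diff = statistics.stdev(differences)  # sample standard deviation
    # t statistic for the null hypothesis that the mean difference is zero.
    t_statistic = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, sd_diff, t_statistic


# Hypothetical serum cholesterol values (mg/dL) for five persons.
before = [220, 245, 198, 260, 231]
after = [210, 238, 200, 244, 225]

mean_diff, sd_diff, t_statistic = paired_summary(before, after)
print(f"mean difference = {mean_diff:.1f}")
print(f"standard deviation of differences = {sd_diff:.2f}")
print(f"t statistic = {t_statistic:.2f} with {len(before) - 1} degrees of freedom")
```

Once the differences are formed, the two original columns are no longer needed: the analysis is that of a single continuous dependent variable with no independent variables.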

13.13. For the study described in Example 13.12, suppose we want to look at how strong the association is between serum cholesterol measured just prior to beginning the diet and after 30 days on the diet. To examine the strength of that association, what would be the best statistical test to use?

Although these are the same data as in Example 13.12, the question being asked changes how we think about those data. Instead of using the two measurements of serum cholesterol to find the difference, we are now interested in the two measurements themselves. Thus, this is a bivariable sample with a continuous dependent variable. This puts us in Chapter 7. There, we are in the left-hand branch because the independent variable represents continuous data as well. We are interested in the strength of association, so we follow the flowchart to Pearson's correlation coefficient.

13.14. For the study described in Example 13.12, suppose there are actually three diets of interest to us and that we randomly assign 50 different persons to each of them. To compare the mean change in serum cholesterol among those three diets, what would be the best statistical test to use?

This is a continuation of Example 13.12 in which the difference between cholesterol measurements is the dependent variable. Now, we are going to divide those differences into three groups. To specify three groups of dependent variable values, we need two nominal independent variables.¹ Thus, we have a multivariable sample with a continuous dependent variable. This puts us in Chapter 10.

Both of the independent variables represent nominal data, so we are in the middle branch of the flowchart. Diet is a single characteristic. This leads us to one-way analysis of variance.

13.15. For the study described in Example 13.14, suppose we are also interested in the ability of chitosan to reduce the serum cholesterol further. To study this treatment, we randomly assign each of the 50 persons on each diet to receive either chitosan or a placebo. To look at the effect of chitosan as well as the effect of the diets, what would be the best statistical test to analyze these data?

This is a continuation of Example 13.14. Now, we are adding another treatment to diet. This creates six groups of dependent variable values, so we have five nominal independent variables. This keeps us in the middle branch of the flowchart. When we changed the number of groups to six, we also added another characteristic (chitosan). Thus, the flowchart leads us to factorial analysis of variance.

13.16. For the study described in Example 13.15, suppose we want to compare serum cholesterol values for the three diets and two treatments while controlling for the potential confounding of age and gender. What would be the best statistical test to analyze these data?

This is a continuation of Example 13.15. We have added two more independent variables: gender and age. Since we now have both nominal and continuous independent variables, we are in the right-hand branch of the flowchart in Chapter 10. This branch leads to analysis of covariance.
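
The Chapter 10 decisions used in Examples 13.4 through 13.16 can be collected into one small function. This is a sketch of the reasoning in those examples, not a reproduction of the flowchart: the function name and arguments are hypothetical, and the left-hand branch assumes the interest is in estimating dependent variable values, as in Example 13.9.

```python
def chapter_10_method(independent_types, n_characteristics=1):
    """Name the procedure for a continuous dependent variable and 2+ independent variables.

    independent_types: list of "continuous" or "nominal", one per independent variable.
    n_characteristics: how many characteristics the nominal variables describe
        (only consulted when every independent variable is nominal).
    """
    if len(independent_types) < 2:
        raise ValueError("this sketch covers more than one independent variable")
    kinds = set(independent_types)
    if kinds == {"continuous"}:
        # Left-hand branch, assuming interest in estimating dependent variable values.
        return "multiple regression analysis"
    if kinds == {"nominal"}:
        # Middle branch: one characteristic or more than one.
        if n_characteristics == 1:
            return "one-way analysis of variance"
        return "factorial analysis of variance"
    if kinds == {"continuous", "nominal"}:
        # Right-hand branch: a mixture of continuous and nominal variables.
        return "analysis of covariance"
    raise ValueError("types must be 'continuous' or 'nominal'")


print(chapter_10_method(["continuous", "continuous"]))             # Example 13.9
print(chapter_10_method(["nominal", "nominal"]))                   # Example 13.14
print(chapter_10_method(["nominal"] * 5, n_characteristics=2))     # Example 13.15
print(chapter_10_method(["nominal"] * 6 + ["continuous"]))         # Example 13.16
```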

13.17. Imagine we are interested in the relationship between HPV (human papilloma virus) and development of cervical cancer. To study this relationship, we identify 1,000 women with normal PAP smears who are positive for HPV and follow them for five years. During that period, each of the women receives biannual examinations, including a PAP smear. Of those 1,000 women, 250 of them develop abnormal PAP smears during the five-year period. What would be the best point estimate to summarize those results?

The dependent variable represents developing an abnormal PAP smear. This is a nominal dependent variable. There are no independent variables.² This means we have a univariable sample with a nominal dependent variable. This puts us in Chapter 6. In the Chapter 6 flowchart we need to distinguish between a nominal dependent variable affected by time and not affected by time. Since everyone was followed for a five-year period, the dependent variable is not affected by time. This means we can estimate a probability. There are three probabilities listed among the answers. Since we are interested in new cases developed over time, it is the five-year risk we want to estimate.

13.18. Now, suppose we identify another 1,000 women with normal PAP smears who are negative for HPV and we follow these women for five years as well. During that period of time, 75 of these 1,000 women develop abnormal PAP smears. What is the best point estimate to compare the results for these women to the results for the women in Example 13.17 if we do not want the comparison to reflect the underlying frequency of abnormal PAP smears?

This is a continuation of Example 13.17. Another group of women is added to the study. Now, HPV is an independent variable. This means we have a bivariable sample with a nominal dependent variable. We are in Chapter 9. Since HPV is a nominal variable, we follow the right-hand branch of the flowchart. As explained in Example 13.17, the dependent variable is not affected by time. Since there is no attempt to pair HPV persons to non-HPV persons, this is an unpaired study. This leads us to a choice among three parameters to estimate. To make this choice, we recall from Chapter 6 that ratios are not affected by the underlying frequency of the event or characteristic.
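
The five-year risks from Examples 13.17 and 13.18 can be compared with a short calculation. The example does not name the three candidate parameters, so the risk ratio and the risk difference below are shown only as assumed examples of a ratio and a difference; the function name `risk` is hypothetical.

```python
def risk(cases, followed):
    """Proportion of persons followed who develop the outcome during the period."""
    return cases / followed


risk_hpv_positive = risk(250, 1000)  # Example 13.17
risk_hpv_negative = risk(75, 1000)   # Example 13.18

# A ratio and a difference of the two risks (assumed examples of the candidates).
risk_ratio = risk_hpv_positive / risk_hpv_negative
risk_difference = risk_hpv_positive - risk_hpv_negative

print(f"five-year risk, HPV positive = {risk_hpv_positive:.3f}")
print(f"five-year risk, HPV negative = {risk_hpv_negative:.3f}")
print(f"risk ratio = {risk_ratio:.2f}")
print(f"risk difference = {risk_difference:.3f}")
```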

13.19. For the study described in Examples 13.17 and 13.18, what is the common name of the statistical method that would be best to test a null hypothesis about the occurrence of abnormal PAP smears between the two groups of women?

This is a continuation of Example 13.18. Now, we are to continue down the flowchart to find the common name of the statistical method we should use. There are two: a normal approximation and an exact test. We learned in Chapter 6 that, if we have a choice, we should use the results from the exact test.

Exercises

13.1 Suppose we are interested in how well a new medication prevents migraine headaches. To study this relationship, we identify 200 persons who suffer from migraines. Then, we assign 100 people to receive the new medication and 100 people to receive the standard therapy and record the number of people who have at least one migraine during a three-month period. In this study, the dependent variable represents which of the following?
1. People in the population
2. People with a history of migraines
3. Occurrence of migraine during the three-month study period
4. The new medication
5. The distinction between the new medication and the standard therapy
13.2 Suppose we measure the diastolic blood pressure for 25 students taking this examination. Which of the following is the best way to summarize those measurements?
1. Estimate the mean diastolic blood pressure and calculate its 95% confidence interval
2. Estimate the mean diastolic blood pressure and test the null hypothesis that it is equal to zero in the population
3. Estimate the proportion of persons with high blood pressure and test the null hypothesis that it is equal to zero in the population
4. Estimate the proportion of persons with high blood pressure and test the null hypothesis that it is equal to 0.5 in the population
5. Estimate the proportion of persons with high blood pressure and test the null hypothesis that it is equal to one in the population
13.3 Suppose we measure the diastolic blood pressure for 25 students taking this examination. Further suppose we want to look at how much diastolic blood pressure changes between the last lecture and the final examination. To examine that change, we measure for each of the 25 students diastolic blood pressure both during the last lecture and the final examination (i.e., both measurements are made for each individual student). Which of the following would be the best statistical test to analyze these data?
1. Paired t test
2. Student's t test
3. Linear regression analysis
4. Pearson's correlation analysis
5. Normal approximation to the binomial
13.4 Suppose we measure the diastolic blood pressure for 25 students taking this examination. Further suppose we want to look at how much diastolic blood pressure changes between the last lecture and the final examination. To examine that change, we measure for each of the 25 students diastolic blood pressure both during the last lecture and the final examination (i.e., both measurements are made for each individual student). Now suppose we want to look at how strong the association is between diastolic blood pressure measured during the last lecture and diastolic blood pressure measured during the final examination. To examine the strength of that association, which of the following would be the best statistical test to use?
1. Paired t test
2. Student's t test
3. Linear regression analysis
4. Pearson's correlation analysis
5. Normal approximation to the binomial
13.5 Suppose we measure the diastolic blood pressure for 25 students taking this examination. Further suppose we want to look at how much diastolic blood pressure changes between the last lecture and the final examination. To examine that change, we measure for each of the 25 students diastolic blood pressure both during the last lecture and the final examination (i.e., both measurements are made for each individual student). Now, suppose that we want to compare diastolic blood pressure values during the final examination among students taking a statistics course during different semesters. Suppose that we measure diastolic blood pressure during the final for samples of 25 students for each of the three semesters (i.e., fall, spring, and summer) of one academic year. Which of the following would be the best statistical test to analyze these data?
1. Multiple regression analysis
2. Multiple correlation analysis
3. One-way analysis of variance
4. Factorial analysis of variance
5. Analysis of covariance
13.6 Suppose we measure the diastolic blood pressure for 25 students taking this examination. Further suppose we want to look at how much diastolic blood pressure changes between the last lecture and the final examination. To examine that change, we measure for each of the 25 students diastolic blood pressure both during the last lecture and the final examination (i.e., both measurements are made for each individual student). Now, suppose we want to compare diastolic blood pressure values during the final examination among students taking a statistics course during two different academic years (e.g., 2015-2016 and 2016-2017) as well as during different semesters in each of those years. Suppose that we measure diastolic blood pressure during the final for samples of 25 students for each of the three semesters (i.e., fall, spring, and summer) of each academic year. Which of the following would be the best statistical test to analyze these data?
1. Multiple regression analysis
2. Multiple correlation analysis
3. One-way analysis of variance
4. Factorial analysis of variance
5. Analysis of covariance
13.7 Suppose we measure the diastolic blood pressure for 25 students taking this examination. Further suppose we want to look at how much diastolic blood pressure changes between the last lecture and the final examination. To examine that change, we measure for each of the 25 students diastolic blood pressure both during the last lecture and the final examination (i.e., both measurements are made for each individual student). Now, suppose we want to compare diastolic blood pressure values during the final examination while controlling for the potential confounding of age and gender. Which of the following would be the best statistical test to analyze these data?
1. Multiple regression analysis
2. Multiple correlation analysis
3. One-way analysis of variance
4. Factorial analysis of variance
5. Analysis of covariance
13.8 Suppose that we are interested in the relationship between smoking and getting an “A” on a final examination. To study this relationship, we identify 25 smokers and 25 nonsmokers in a particular course. Then we determine how many in each of those groups got “A”s. Which of the following would be the best way to analyze those data?
1. Normal approximation to the binomial
2. Student's t test
3. Chi-square test
4. Logistic regression analysis
5. Multiple regression analysis
13.9 Suppose that we are interested in the relationship between smoking and getting an “A” on a final examination. To study this relationship, we identify 25 smokers and 25 nonsmokers in a particular course. Then we determine how many students in each of those groups got “A”s. Further suppose we want to control for age and gender. Which of the following would be the best way to analyze those data while controlling for these potential confounders?
1. Normal approximation to the binomial
2. Student's t test
3. Chi-square test
4. Logistic regression analysis
5. Multiple regression analysis
13.10 Suppose that we are interested in the change in heart rate that occurs during a particular form of exercise. To investigate this change, we select 100 persons to be in a study. In that study, each of the 100 persons has their heart rate measured before and, then, during exercise. Which of the following would be the best point estimate(s) to reflect this change?
1. Coefficient of determination
2. Mean of the differences in heart rates
3. Difference between the mean heart rates
4. Correlation coefficient
5. Intercept and slope
13.11 Suppose that we are interested in the change in heart rate that occurs during a particular form of exercise. To investigate this change, we select 100 persons to be in a study. In that study, each of the 100 persons has their heart rate measured before and, then, during exercise. Now, suppose we are interested in estimating the heart rate during exercise based on an individual's heart rate before exercise while controlling for differences in heart rate related to body mass index (BMI). Which of the following is the common name of the statistical test that would be best to use in that analysis?
1. Analysis of covariance
2. Multiple regression analysis
3. Student-Newman-Keuls test
4. One-way analysis of variance
5. Factorial analysis of variance
13.12 Suppose that we are interested in the change in heart rate that occurs during a particular form of exercise. To investigate this change, we select 100 persons to be in a study. In that study, each of the 100 persons has their heart rate measured before and, then, during exercise. Now, suppose we are interested in estimating the heart rate during exercise based on an individual's heart rate before exercise while controlling for differences in heart rate related to body mass index (BMI) and gender. Which of the following is the common name of the statistical test that would be best to use in that analysis?
1. Analysis of covariance
2. Multiple regression analysis
3. Student-Newman-Keuls test
4. One-way analysis of variance
5. Factorial analysis of variance
13.13 Suppose that we are interested in studying the association between being exposed to second-hand smoke and having a low birth-weight infant. To study this association, we identify 50 women who have recently delivered a singleton infant and who were exposed to second-hand smoke during their pregnancy and 50 women who have recently delivered a singleton infant and who were not exposed to second-hand smoke during their pregnancy. For each of these women we determine the birth-weight of their infant. Which of the following is the common name of the statistical procedure that would be best to use to compare birth-weights between these two groups of women?
1. Fisher's exact test
2. McNemar's test
3. Paired t test
4. Student's t test
5. Logistic regression analysis
13.14 Suppose that we are interested in studying the association between being exposed to second-hand smoke and having a low birth-weight infant. To study this association, we identify 50 women who have recently delivered a singleton infant and who were exposed to second-hand smoke during their pregnancy and 50 women who have recently delivered a singleton infant and who were not exposed to second-hand smoke during their pregnancy. For each of these women we determine the birth-weight of their infant. Now, suppose we define a low birth weight as being less than 1,500 grams. Which of the following is the common name of the statistical procedure that would be best to use to compare the odds of having a low birth weight infant between these two groups of women?
1. Fisher's exact test
2. McNemar's test
3. Paired t test
4. Student's t test
5. Logistic regression analysis
13.15 Suppose that we are interested in studying the association between being exposed to second-hand smoke and having a low birth-weight infant. To study this association, we identify 50 women who have recently delivered a singleton infant and who were exposed to second-hand smoke during their pregnancy and 50 women who have recently delivered a singleton infant and who were not exposed to second-hand smoke during their pregnancy. For each of these women we determine the birth-weight of their infant. Now, suppose we define a low birth weight as being less than 1,500 grams. Which of the following is the common name of the statistical procedure that would be best to use to compare the odds of having a low birth weight infant between these two groups of women while controlling for the confounding effect of maternal age?
1. Fisher's exact test
2. McNemar's test
3. Paired t test
4. Student's t test
5. Logistic regression analysis