The last nine chapters of the textbook are structured around flowcharts that describe how a statistician selects a method of analysis for a particular dataset and a particular research question. While reading each of those chapters, we used the flowcharts to see how the statistical procedures described fit into the organization of statistical methods, but we were focused on the methods being described rather than the overall structure of the flowcharts. Chapter 13 in the workbook gives us an opportunity to step back and consider all of the flowcharts.
The flowcharts are collected in Appendix A of both the textbook and the workbook, as well as at the beginning of each of the last nine chapters of the textbook. To start, consider the master flowchart (Flowchart 1). This is where we begin to select an appropriate approach to a set of data. First, we need to identify the dependent variable. We can do this by asking ourselves, “What do we want to make an estimate of or test a hypothesis about?” The answer is the dependent variable. It is not unusual to have more than one dependent variable in a dataset. That is ok. The most common approach to this situation is to consider each of the dependent variables, one at a time.
Next, we need to identify the independent variable(s). We find these by asking ourselves, “Under what conditions are we interested in examining the dependent variable?” For example, the condition could be a comparison of groups of dependent variable values. Nominal independent variables are needed to specify these groups. We need one fewer nominal independent variable than there are groups to compare. Our concern at this point is just to count the independent variables and decide whether there are no independent variables, one independent variable, or more than one independent variable.
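The rule that we need one fewer nominal independent variable than there are groups can be illustrated with a minimal sketch. The function name `indicator_variables` and the group labels are hypothetical, chosen only for illustration, and treating the first group as the reference group is an assumption of this sketch.

```python
def indicator_variables(groups):
    """Code k groups with k - 1 indicator (0/1) variables.

    Hypothetical helper: the first group listed is treated as the
    reference group and is coded 0 on every indicator.
    """
    reference, *others = groups
    coding = {}
    for group in groups:
        # Each non-reference group gets a 1 on its own indicator only.
        coding[group] = [1 if group == other else 0 for other in others]
    return coding


# Three hypothetical groups require two indicator variables.
coding = indicator_variables(["A", "B", "C"])
for group, indicators in coding.items():
    print(group, indicators)
```

Each group receives a distinct pattern of zeros and ones, so the two indicators are enough to tell the three groups apart.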
The final thing we need to do with the master flowchart (Flowchart 1) is decide what type of data is being represented by the dependent variable. We need to choose among three types: continuous, ordinal, or nominal. The introduction to Part Two in the textbook describes these types of data. To summarize, the types of data differ according to their ability to order the data values in a biologically meaningful way and according to the spacing between values. Continuous data have ordered, evenly spaced values. Ordinal data have ordered values, but the spacing between values is undefined. Nominal data cannot be ordered in a meaningful way. Identifying the type of data represented by the dependent variable takes us to one of the next parts of the flowchart, each of which is associated with a chapter of the textbook.
The next thing we do depends on whether or not there are independent variables. If there are independent variables, we need to identify the type of data represented by those independent variables. At that point, we might be committed to a single path to the end of the flowchart, or there may be more decisions to make. These decisions are related to the research interest and the nature of the independent variable(s).
As we approach the end of the flowchart, three pieces of information are itemized. First, there is the point estimate(s) that is most often of interest. Next comes the common name of the procedure. Finally, there is the name of the general method or standard distribution that is used to perform the analysis. Now, we know quite a bit about how to analyze a particular set of data, but there still might be decisions to make. For example, we need to decide whether we want to take chance into account by calculating an interval estimate or by hypothesis testing. We turn to Chapter 3 of the textbook to answer that question. Other decisions are discussed in the chapter to which the flowchart belongs.
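The first two decisions in the master flowchart (counting the independent variables and typing the dependent variable) can be sketched as a small lookup. This is only an illustration: the function names and the `CHAPTERS` table are hypothetical, and the table lists only the chapter assignments that are stated in the examples of this chapter. Other combinations are deliberately left out rather than guessed.

```python
def dataset_kind(n_independent):
    """Classify a dataset by its number of independent variables."""
    if n_independent < 0:
        raise ValueError("the number of independent variables cannot be negative")
    if n_independent == 0:
        return "univariable"
    if n_independent == 1:
        return "bivariable"
    return "multivariable"


# Only the assignments mentioned in this chapter's examples are listed;
# any other combination returns None instead of a guessed chapter.
CHAPTERS = {
    ("univariable", "continuous"): 4,
    ("univariable", "nominal"): 6,
    ("bivariable", "continuous"): 7,
    ("bivariable", "nominal"): 9,
    ("multivariable", "continuous"): 10,
}


def textbook_chapter(n_independent, dependent_type):
    """Return the textbook chapter for a dataset, or None if it is not listed."""
    return CHAPTERS.get((dataset_kind(n_independent), dependent_type))


# No independent variables and a nominal dependent variable.
print(dataset_kind(0), textbook_chapter(0, "nominal"))
```

The lookup stops where the master flowchart stops. The remaining decisions (type of independent variable, research interest) belong to the flowchart of the chapter it returns.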
We begin by trying to identify the data that will be represented by a dependent variable. We do this by asking ourselves for which data we are interested in making an estimate or testing a hypothesis. There are two types of data mentioned in this study: smoking and unprotected intercourse. We are equally interested in both, so there are two dependent variables. The usual biostatistical practice is to perform a separate analysis for each dependent variable. Thus, we consider the data in this example to be part of two separate datasets. There are no specific conditions under which we are interested in examining these two dependent variables, so they are univariable datasets.
Both dependent variables represent nominal data. This means that Chapter 6 discusses how to analyze these data. There are three point estimates addressed in Chapter 6: prevalence, risk, and incidence. Incidence and risk address intervals of time. We do not have information on intervals of time; only a point in time. Therefore, we would estimate prevalences.
There are two aspects of taking chance into account that we need to consider. The first involves choosing between interval estimation and hypothesis testing. We can always perform interval estimation, but hypothesis testing requires a sensible (ie, of biologic interest) null hypothesis. When we have a univariable dataset, a sensible null hypothesis requires a paired sample. This is not a paired sample, so we will use interval estimation.
The second aspect is particular to nominal dependent variables. It is the choice between an exact procedure and a normal approximation. The better choice is the exact procedure, if we have a computer program to perform it. Otherwise, the normal approximation is the better choice.
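To see how the two choices can differ, the sketch below computes both kinds of interval estimate for a prevalence. It rests on assumptions: the exact procedure is taken to be the Clopper-Pearson interval and the normal approximation is taken to be the simple (Wald) interval, which may not match the formulas in Chapter 6 of the textbook. The function names and the counts are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_increasing(func, iterations=100):
    """Bisection on [0, 1] for a function rising from negative to positive."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(x, n, confidence=0.95):
    """Clopper-Pearson interval (assumed here to be the 'exact' procedure)."""
    alpha = 1 - confidence
    # Lower limit: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _root_of_increasing(
        lambda p: (1 - binomial_cdf(x - 1, n, p)) - alpha / 2
    )
    # Upper limit: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _root_of_increasing(
        lambda p: alpha / 2 - binomial_cdf(x, n, p)
    )
    return lower, upper


def normal_approximation_interval(x, n, confidence=0.95):
    """Wald interval: estimate plus or minus z times the standard error."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Hypothetical counts, not from the study: 20 of 100 persons have the characteristic.
print(exact_interval(20, 100))
print(normal_approximation_interval(20, 100))
```

The exact interval is not symmetric around the point estimate, whereas the normal approximation always is. The difference tends to matter most when the sample is small or the prevalence is close to 0 or 1.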
Here we have two types of data: RBC counts and dose of folic acid. When we ask ourselves which we want to make estimates of, the answer is RBC counts. Dose specifies the condition under which we are interested in RBC counts. Thus, it is represented by an independent variable. It is the only independent variable, so we have a bivariable dataset. The dependent variable is RBC counts, which can be considered continuous, so we are interested in the flowchart in Chapter 7.
Now that we have an independent variable, the next thing we need to decide is what type of data are represented by the independent variable. Doses are continuous, so we are in the left-hand branch of the flowchart. Now, we have to decide what question the researchers are asking. Do they want to estimate RBC for a given dose or do they want to describe the strength of the association between RBC and dose? This is not completely clear, but the likely answer is that they want to describe the relationship, rather than say how strong it is. Thus, “Intercept and slope” is the best answer.
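As an illustration of what “Intercept and slope” refers to, the following minimal sketch computes the least-squares estimates for a straight line. The function name `least_squares_line` and the data values are hypothetical; they are not taken from the study.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical values: dose (independent) and RBC count (dependent).
dose = [0, 1, 2, 3, 4]
rbc = [4.0, 4.3, 4.4, 4.7, 4.8]

intercept, slope = least_squares_line(dose, rbc)
print(intercept, slope)

# The estimated line gives a dependent variable value for any chosen dose.
print(intercept + slope * 2.5)
```

The slope estimates the change in the dependent variable for each one-unit change in dose, and the intercept estimates the dependent variable value when the dose is zero.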
This is a continuation of Example 13.3. What has changed is that another variable has been established. This is gender. The interest in including gender is to control for its effect. Thus, gender is represented by an independent variable. Now, we have two independent variables: dose and gender. This means we have a multivariable dataset with a continuous dependent variable. This puts us in Chapter 10 of the textbook. Since we have both a continuous (dose) and a nominal (gender) independent variable, we find ourselves in the right-hand branch of the flowchart. This leads to analysis of covariance.
In this study, it is clear that serum cholesterol is represented by the dependent variable and that gender and country will be represented by independent variables. We have more than one independent variable and the dependent variable is continuous, so we are in Chapter 10 of the textbook. All independent variables are nominal, so we are in the middle branch of the flowchart. Here, we have to decide whether the nominal independent variables specify categories of one or more than one characteristic. There are two characteristics: gender and country. Thus, we want to use a factorial analysis of variance to analyze these data.
It is not clear which of the two continuous variables is the dependent variable and which is the independent variable. That often is the case when the interest is in the strength of the association between two continuous variables. That does not impede our use of the flowchart. It is clear that the dependent variable must be continuous and that we have one independent variable. Thus, we are in Chapter 7 of the textbook. The independent variable is continuous, so we are in the left-hand branch of that flowchart. If we follow the path for strength of the association, we encounter Pearson's correlation coefficient.
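One reason the ambiguity does not matter is that Pearson's correlation coefficient treats the two variables symmetrically. The sketch below shows this with hypothetical data; the function name `pearson_r` and the values are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two continuous variables."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired measurements of two continuous variables.
first = [118, 125, 131, 140, 152]
second = [61, 70, 66, 78, 80]

# Swapping the roles of the variables leaves the coefficient unchanged.
print(pearson_r(first, second))
print(pearson_r(second, first))
```

Because the formula is unchanged when the two variables trade places, we get the same estimate regardless of which one we call the dependent variable.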
This is a modification of Example 13.6. Now we know the dependent variable represents SBP. The difference here is the question being asked. Instead of strength of the association, we are interested in estimating dependent variable values corresponding to values of the independent variable. Thus, we take the branch that leads to slope and intercept.
This continuation of Example 13.7 requires us to follow the same branch of the flowchart so we can find the common name of the procedure. That name is linear regression analysis.
This is a continuation of Example 13.8. Here, we add a second continuous independent variable (age). That means we have a multivariable dataset. Since the dependent variable is continuous, we are in Chapter 10. Both independent variables are continuous, so we are in the left-hand branch of the flowchart. Our interest still is in estimating dependent variable values; thus we want to use multiple regression analysis.
This is a continuation of Example 13.9. Gender is added as another independent variable. That means we have a mixture of continuous and nominal independent variables. This leads us to analysis of covariance.
This is similar to Example 13.1 in that we need to decide between interval estimation and hypothesis testing as ways to take chance into account. As in Example 13.1, there is no sensible null hypothesis to test, so the only possible answer is interval estimation. If we were uncertain about what parameter to estimate, we could recognize this as a univariable sample with a continuous dependent variable. That selects the Chapter 4 flowchart.
Here, we have two measurements of serum cholesterol for each person. We are told, however, that neither measurement of serum cholesterol is itself of interest. Instead, the difference between the two measurements is the dependent variable. There are no other variables, so this is a univariable sample with a continuous dependent variable. That selects the flowchart in Chapter 4.
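The way two measurements per person collapse into a single dependent variable can be sketched as follows. The cholesterol values are hypothetical, and summarizing the differences with their mean and standard deviation is an assumption of this sketch rather than something stated in the example.

```python
from statistics import mean, stdev

# Hypothetical serum cholesterol values (mg/dL) for five persons.
first = [210, 195, 240, 188, 225]
second = [198, 190, 221, 185, 214]

# One difference per person: this single column is the dependent variable.
differences = [a - b for a, b in zip(first, second)]

print(differences)
print(mean(differences))   # point estimate of the mean difference
print(stdev(differences))  # spread of the differences
```

After this step there is only one value per person, which is why the sample is treated as univariable.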
Although these are the same data as in Example 13.12, the question being asked changes how we think about those data. Instead of using the two measurements of serum cholesterol to find the difference, we are now interested in the two measurements themselves. Thus, this is a bivariable sample with a continuous dependent variable. This puts us in Chapter 7. There, we are in the left-hand branch because the independent variable represents continuous data as well. We are interested in the strength of association, so we follow the flowchart to Pearson's correlation coefficient.
This is a continuation of Example 13.12 in which the difference between cholesterol measurements is the dependent variable. Now, we are going to divide those differences into three groups. To specify three groups of dependent variable values, we need two nominal independent variables. Thus, we have a multivariable sample with a continuous dependent variable. This puts us in Chapter 10.
Both of the independent variables represent nominal data, so we are in the middle branch of the flowchart. Diet is a single characteristic. This leads us to one-way analysis of variance.
This is a continuation of Example 13.14. Now, we are adding another treatment to diet. This creates six groups of dependent variable values, so we have five nominal independent variables. This keeps us in the middle branch of the flowchart. When we changed the number of groups to six, we also added another characteristic (chitosan). Thus, the flowchart leads us to factorial analysis of variance.
This is a continuation of Example 13.15. We have added two more independent variables: gender and age. Since we now have both nominal and continuous independent variables, we are in the right-hand branch of the flowchart in Chapter 10. This branch leads to analysis of covariance.
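The branches of the Chapter 10 flowchart that the examples have followed can be summarized in a short sketch. The function name and its arguments are hypothetical, and the sketch covers only the paths used in these examples (with estimation as the interest when all independent variables are continuous). It is not a full transcription of the flowchart.

```python
def chapter10_procedure(independent_types, n_characteristics=1):
    """Name the procedure for a continuous dependent variable.

    independent_types: list of "continuous" or "nominal", one entry per
        independent variable (at least two for a multivariable dataset).
    n_characteristics: number of characteristics the nominal independent
        variables describe; used only when all of them are nominal.
    """
    if len(independent_types) < 2:
        raise ValueError("a multivariable dataset needs two or more independent variables")
    kinds = set(independent_types)
    if kinds == {"continuous"}:
        # Left-hand branch, assuming the interest is in estimation.
        return "multiple regression analysis"
    if kinds == {"nominal"}:
        # Middle branch: one characteristic or more than one.
        if n_characteristics == 1:
            return "one-way analysis of variance"
        return "factorial analysis of variance"
    if kinds == {"continuous", "nominal"}:
        # Right-hand branch: a mixture of continuous and nominal.
        return "analysis of covariance"
    raise ValueError("independent variable types must be 'continuous' or 'nominal'")


# Dose (continuous) and gender (nominal) as independent variables.
print(chapter10_procedure(["continuous", "nominal"]))
```

The number of characteristics matters only in the middle branch. Five nominal independent variables describing diet and chitosan lead to factorial analysis of variance, while two nominal independent variables describing diet alone lead to one-way analysis of variance.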
The dependent variable represents developing an abnormal PAP smear. This is a nominal dependent variable. There are no independent variables. This means we have a univariable sample with a nominal dependent variable. This puts us in Chapter 6. In the Chapter 6 flowchart, we need to distinguish between a nominal dependent variable affected by time and not affected by time. Since everyone was followed for a five-year period, the dependent variable is not affected by time. This means we can estimate a probability. There are three probabilities listed among the answers. Since we are interested in new cases developed over time, it is the five-year risk we want to estimate.
This is a continuation of Example 13.17. Another group of women is added to the study. Now, HPV is an independent variable. This means we have a bivariable sample with a nominal dependent variable. We are in Chapter 9. Since HPV is a nominal variable, we follow the right-hand branch of the flowchart. As explained in Example 13.17, the dependent variable is not affected by time. Since there is no attempt to pair HPV persons to non-HPV persons, this is an unpaired study. This leads us to a choice among three parameters to estimate. To make this choice, we recall from Chapter 6 that ratios are not affected by the underlying frequency of the event or characteristic.
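The three parameters are not named in this explanation. A common set for comparing two unpaired groups on a nominal dependent variable is the risk difference, the risk ratio, and the odds ratio, and the sketch below assumes that set. The function name and the counts are hypothetical and are not taken from the study.

```python
def two_group_measures(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Compare the risks in two unpaired groups (assumed set of three measures).

    Note: the odds are undefined when a risk equals 1, and the ratios are
    undefined when the unexposed risk equals 0; this sketch does not handle
    those edge cases.
    """
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return {
        "risk difference": risk_exposed - risk_unexposed,
        "risk ratio": risk_exposed / risk_unexposed,
        "odds ratio": odds_exposed / odds_unexposed,
    }


# Hypothetical five-year counts: 30 of 100 women with HPV and
# 10 of 100 women without HPV develop an abnormal PAP smear.
for name, value in two_group_measures(30, 100, 10, 100).items():
    print(name, round(value, 3))
```

With these hypothetical counts the risk ratio is 3 and the risk difference is 0.2, which shows how a ratio and a difference describe the same comparison on different scales.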
This is a continuation of Example 13.18. Now, we are to continue down the flowchart to find the common name of the statistical method we should use. There are two: a normal approximation and an exact test. We learned in Chapter 6 that, if we have a choice, we should use the results from the exact test.