CHAPTER 27
Eliminating common misconceptions to enable intelligent use of biostatistics: How can a novice use statistics more intelligently?

Gary Gaddis

St. Luke’s Hospital of Kansas City, MO, USA

Recognizing types of data [1]

Medical researchers almost always collect or tally numbers. Three types of numeric (quantitative) data exist. Each type of data is described and analyzed by different statistical tools. Researchers must be able to distinguish among these types of data in order to select the proper statistical tools to describe and analyze them without committing a statistical error. The types of data are:

  • Nominal data: Rates, proportions, or frequencies of observations classed into named categories. (The names of the categories are chosen by the researcher.)
  • Ordinal data: Numeric data collected using a rank-ordered scale.
  • Interval or ratio data, which is also called “parametric data”: Data that can be expected to fit the Gaussian, or “normal” distribution. Ratio data have a minimum value of zero (such as age or degrees Kelvin), while interval data can have a negative value (such as degrees Celsius). Otherwise, they are essentially identical.

(Qualitative data, by which subjects provide the words to answer research questions, are seldom studied in medical research, and will not be further described in this chapter.)

A landmark study that strongly influenced clinical practice and contains all three types of quantitative data is that of Sullivan et al. [2]. This manuscript, “Early Treatment with Prednisolone or Acyclovir in Bell’s Palsy”, was produced by a group in Scotland and was published in the New England Journal of Medicine. This manuscript delivered strong and clear evidence that full recovery from Bell’s Palsy is enhanced by prednisolone, but not by acyclovir [2]. The manuscript also provides examples of the statistical concepts discussed in this chapter.

Nominal data

Nominal data consist of numeric rates, proportions, or frequencies. The primary outcomes Sullivan et al. studied were frequencies of full facial motor recovery observed three months and nine months after initiation of treatment for Bell’s Palsy.

The four treatment conditions were prednisolone without acyclovir, acyclovir without prednisolone, prednisolone plus acyclovir, and no active medication. The two primary outcomes reported were whether or not full recovery of facial nerve function (Grade 1 on the House–Brackmann scale) was achieved at three months and at nine months after onset of Bell’s Palsy. Table 27.1 shows the data regarding the frequency of full facial nerve recovery that appeared in Table 2 of the Sullivan et al. manuscript. Although patients were randomized into one of the four groups noted above, the results were tallied as frequency data, as noted here in Table 27.1.

Table 27.1 Proportions of patients manifesting a Grade 1 House–Brackmann scale (complete recovery from Bell’s palsy).

Source: Adapted from Sullivan et al. 2007 [2].

Time | Prednisolone (n = 251) | No Prednisolone (n = 245) | p-value | Acyclovir (n = 247) | No Acyclovir (n = 249) | p-value
At 3 months | 205/247 (83.0%) | 152/239 (63.6%) | <0.001 | 173/243 (71.2%) | 184/243 (75.7%) | 0.50
At 9 months | 237/251 (94.4%) | 200/245 (81.6%) | <0.001 | 211/247 (85.4%) | 226/249 (90.8%) | 0.10

Ordinal data

Ordinal data are data grouped into rankings, from low to high or from high to low. Sullivan et al. used the ordinal House–Brackmann score [3] for judging the degree of facial palsy, which is scored from 1 (no paralysis) to 6 (total paralysis).

A critical feature of any ordinal data is that an incremental difference of one rating point does not have a consistent meaning. The amount of difference between House–Brackmann scores of 1 versus 2 is not equivalent to the difference between scores of 5 versus 6. CHADS2 scores for risk of stroke with atrial fibrillation [4] and Glasgow Coma Scale scores [5] are other examples of ordinal data that are probably more familiar to many clinicians. Notice that these ordinal data are dimensionless; there are no units of measure that can be applied to ordinal data.

For all of these scoring systems, there is no consistent level of difference between scores on an ordinal data scale. A consequence of this distinction is that it is inappropriate to attempt to calculate a mean or standard deviation for ordinal data.

Interval or ratio (parametric) data

Parametric data are normally distributed (or nearly so), and the magnitude of difference between the gradations of parametric data scales is consistent. Whether the data are “interval scale” (data with no absolute zero, such as temperature in degrees C or F) or “ratio scale” (data with an absolute zero, such as temperature in kelvin, or age), this consistency of the data scale is always present. Parametric data have a dimension: a unit of measure (such as kilograms, meters, or seconds) that has a consistent magnitude everywhere on the measurement scale and can be subjected to dimensional analysis. Having dimension is a key way to distinguish parametric data from ordinal data, which are dimensionless.

Sullivan et al. reported their subjects’ age in years, an example of ratio scale data (Table 27.2). There is a consistent meaning of an increment of n year(s) between any two possible uniformly-spaced ages. This consistency permits valid calculation of a mean and standard deviation, characterizations not valid for ordinal or nominal data.

Table 27.2 Selected baseline characteristics of patients studied by Sullivan et al.

Source: Adapted from Sullivan et al. 2007 [2].

Characteristic | Prednisolone | No Prednisolone | Acyclovir | No Acyclovir | Total
Age (years) | 43.2 ± 16.2 | 44.9 ± 16.6 | 45.0 ± 16.6 | 43.0 ± 16.1 | 44.0 ± 16.4
House–Brackmann score | 3.5 ± 1.2 | 3.8 ± 1.3 | 3.6 ± 1.3 | 3.7 ± 1.2 | 3.6 ± 1.3
Health Utilities Index Mark 3 score | 0.80 ± 0.22 | 0.78 ± 0.21 | 0.79 ± 0.21 | 0.78 ± 0.22 | 0.79 ± 0.22
Derriford Appearance Scale 59 score | 71 ± 37 | 75 ± 41 | 72 ± 39 | 74 ± 38 | 73 ± 39
Brief Pain Inventory score | 10 ± 18 | 16 ± 21 | 12 ± 18 | 14 ± 21 | 13 ± 20

Values are mean ± standard deviation.

Describing data [6]

Researchers can describe their data in terms of its “central tendency” and its “variability”.

Central tendency and parametric data

Anyone who has ever computed an average is familiar with “central tendency”. For parametric data, three measures of central tendency exist. One is the mean, the arithmetic average of the data. Another is the median, the mid-most observation, which has half of the data points lying below it and half lying above it. (Another way to describe the median is to use the term “50th percentile”.) The final descriptor of central tendency is the mode, the most commonly observed data value. In the normal distribution, these three points are identical. However, in skewed distributions, where one “tail” of the distribution is longer than the other, the mean lies toward the longer tail relative to the median, while the mode lies at the “peak” of the skewed distribution (Figure 27.1).
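These relationships can be checked with Python’s standard `statistics` module. The right-skewed sample below is hypothetical, chosen only to show the mean being pulled toward the long tail:

```python
import statistics

# Hypothetical right-skewed sample (one large value forms the long right tail)
data = [2, 3, 3, 3, 4, 5, 12]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # mid-most observation (50th percentile)
mode = statistics.mode(data)      # most frequently observed value

# In this right-skewed sample, the mean lies toward the longer tail
print(mean, median, mode)  # mean (~4.57) > median (3) = mode (3)
```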


Figure 27.1 Systolic blood pressure in persons with renovascular hypertension. Mean, 228.7; median, 230; mode, 240.

Source: Gaddis and Gaddis [6]; reproduced with permission from Elsevier.

Central tendency and ordinal data

For ordinal data, it is mathematically possible, yet methodologically incorrect, to use the individual data points to calculate an apparent mean value. It is inappropriate to calculate a “mean” of ordinal data, due to the lack of a consistent magnitude of difference between points on the data scale. Ordinal data are dimensionless. For example, Likert Scale data are not quantified in “Agreement Units”; they are generally quantified as “Strongly Agree”, “Agree”, “Neutral”, “Disagree”, or “Strongly Disagree”. There is no consistent magnitude of agreement or disagreement between the five points on the scale. This helps explain why such dimensionless data cannot have a mean. However, the median and the mode are appropriate descriptors of ordinal data.

Central tendency and nominal data

For nominal data, it is possible only to express the mode, the type of observation that was made most frequently in the study. With nominal data, there is no logical ranking of named data categories, so it is impossible to calculate a median.

Variability and parametric data

The “standard deviation” most thoroughly expresses the variability of parametric data. The standard deviation is the square root of a term called the “variance”. The variance is the sum of squared deviations from the sample mean, divided by one less than the total number of observations:

s² = Σᵢ (xᵢ − x̄)² / (n − 1)

where n is the number of data points in the distribution, xᵢ is the value of each individual data point, and x̄ is the sample mean.

Deviations of each data point from the mean are squared in the variance calculation, so that a non-zero variability term is obtained. (If one calculated the difference between each data point and the mean in a normal distribution, the sum would be zero, because some data points fall above the mean, and the others fall below the mean.)
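As a sketch with made-up numbers, the variance calculation above can be verified against Python’s `statistics` module, which also uses the n − 1 denominator for sample variance:

```python
import math
import statistics

data = [4, 8, 6, 5, 7]  # hypothetical sample; mean = 6
mean = statistics.mean(data)

# Raw deviations from the mean sum to zero...
assert math.isclose(sum(x - mean for x in data), 0, abs_tol=1e-12)

# ...so deviations are squared before summing
variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
sd = math.sqrt(variance)

assert math.isclose(variance, statistics.variance(data))  # 2.5
assert math.isclose(sd, statistics.stdev(data))           # ~1.58
```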

Another appropriate tool to express data variability is the range, a statement of the largest and smallest values observed.

Variability and ordinal data

Neither standard deviation nor variance can be computed for ordinal data, because of the lack of a consistent size of difference between units of the data scale, as previously described. Another way to express this: since ordinal data cannot correctly have a mean determined, and the mean is used to calculate the variance, ordinal data cannot have a variance or standard deviation correctly calculated. Variability of ordinal data is most commonly expressed by the “interquartile range”, the range spanning the 25th to the 75th percentile values. The range can also be used to express variability of ordinal data.
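A short sketch with hypothetical ordinal scores: the interquartile range is the span from the 25th to the 75th percentile, which the standard `statistics` module can locate directly:

```python
import statistics

# Hypothetical ordinal scores from a rank-ordered clinical scale
scores = [3, 4, 5, 6, 7, 8, 9, 10, 11]

# quantiles(..., n=4) returns the 25th, 50th, and 75th percentiles
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = (q1, q3)                           # interquartile range
full_range = (min(scores), max(scores))  # range

print(q2, iqr, full_range)  # median 7.0, IQR (5.0, 9.0), range (3, 11)
```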

Variability and nominal data

For nominal data, the only possible expression of data variability is to express the number of observations in each named category.

More detail regarding this topic is available elsewhere [6]. Sullivan et al. [2] expressed central tendency and variability for their data (Tables 27.1 and 27.2). They correctly used mean and standard deviation to describe subjects’ ages, but incorrectly used mean and standard deviation to describe ordinal data that were studied as secondary outcomes. These ordinal scores included the House–Brackmann, Health Utilities Index Mark 3, Derriford Appearance Scale 59, and Brief Pain Inventory scores.

Statistical inference, random sampling, bias and error [1]

A quirk but not a fatal flaw

Statistics permit researchers to infer a conclusion about an entire population based upon analysis of dependent variable data from samples of that population. Sullivan et al. inferred, from their patients with Bell’s Palsy, that corticosteroids help increase the probability of full recovery from facial paralysis, but acyclovir does not [2].

This study was peculiar regarding how the authors assembled their study cohort. They went to the extraordinary step of establishing patient receiving centers at 17 hospitals, serving 88% of Scotland’s population of 5.1 million persons, as their source for referrals. They studied nearly an entire population (Scots with Bell’s Palsy), rather than a random sample. Usually, via random sampling, well-executed studies enroll a much smaller portion of the potentially eligible subjects.

If some factor unique to Scottish patients were likely to influence recovery from Bell’s Palsy, the results that Sullivan’s group obtained might not be applicable to other populations. However, many physicians would probably agree that Scotland supplied an appropriately representative population for study of treatment of Bell’s Palsy, and that the inferences from this study are broadly applicable.

The lack of random sampling and attempt to enroll nearly an entire population by Sullivan et al. thus constitutes a quirk, but not a fatal flaw that would suggest their conclusions were invalid. You will seldom read a manuscript for which the authors attempted to assemble an entire population or nearly an entire population. Random sampling is much more common, as it is less expensive and much easier from a practical perspective to study a random sample than to study an entire population.

Random sampling, a tool to decrease bias

Random sampling of a population is the tool that is usually employed to minimize bias in any study. Bias is any factor or attribute that can influence a study and make a particular conclusion more or less probable than would have been the case if that influence were not present. For example, if a researcher attempted to estimate the average height of American males, and all study subjects were professional basketball players from the National Basketball Association, the non-random sample would be biased, because it should over-estimate the average height of American men.

All individuals in a population must have an equal chance of being a member of a study sample, if sampling bias is to be avoided. It is a matter of judgment to determine how much bias is introduced and how much influence on study conclusions exists when random sampling does not occur.

Numerous other aspects of a study’s methodology can give rise to bias; space does not permit their enumeration here. (The reader is referred to other chapters in this book, especially Chapters 10 and 12.) As a result, all research studies risk stating erroneous conclusions [6]. The greater the degree of bias, the greater the probability of reaching erroneous conclusions.

Mechanics of statistical inference [7]

In any inferential study, two or more groups are compared for some trait during the testing of an hypothesis. An hypothesis that some type of difference between groups exists regarding this trait underlies most research. The hypothesis that a difference exists is known as H1, the alternative hypothesis. H1 does not specify the size of difference; it merely specifies that a difference is believed to exist. Because an infinite number of numeric sizes of difference are possible, H1 contains an infinite number of hypotheses.

It is both impractical and mathematically impossible to attempt to test an infinite number of hypotheses. Researchers, therefore, must test the “Null hypothesis” H0, the single hypothesis of zero difference between groups.

After collection of data and calculation of inferential statistics, the researcher can only reach one of two possible conclusions:

  • One possible conclusion is that all groups studied appear to be statistically indistinguishable from each other (even though a numerical difference between groups is almost certain to be observed from the study data). This conclusion supports H0. No significant difference appears to exist between the study groups. Statisticians would say that the null hypothesis is accepted as tenable.
  • The only other possible conclusion is that the groups are inferred to appear to be different from each other, because a “significant” difference exists between the study groups. This difference is large enough that it is unlikely to have occurred due to chance. Statisticians would say in this case that the null hypothesis is rejected.

The statistical methods used to reach these inferential conclusions are rooted in probabilities. All inferential statistical methods yield a “p-value”, an expression of how probable it is that the numerical difference(s) observed within the study were obtained due to chance alone. By convention, researchers deem any comparison between groups that appears less than 5% likely to have occurred due to chance alone to be “statistically significant”, and indicative of a probable difference between study groups. (This is the reason researchers often report, “Statistical significance was accepted if p < 0.05.”)

Type I and Type II errors

The fact that p < 0.05 is deemed “statistically significant,” leading the researcher to reject the null hypothesis, demonstrates a critical point. Statisticians and researchers inherently accept that an erroneous conclusion, a conclusion that a difference appears to exist between groups when no true difference actually exists, can occur up to 5% of the time, which is to say, one time in 20. On the other hand, when the probability that the observed difference arose by chance alone appears to be 5% or greater, statisticians say “the null hypothesis is accepted as tenable”.

In summary, four possibilities exist with hypothesis testing:

  • Biostatistics yield correct conclusions when the null hypothesis is supported as tenable, and the groups being compared do not differ from each other.
  • Biostatistics yield correct conclusions when the null hypothesis is rejected, and the groups being compared truly differ from each other.
  • Biostatistics sometimes suggest that a significant statistical difference exists between groups, when in fact the groups are not truly different. To reject the null hypothesis when the null hypothesis is true is to make a “Type I” error.
  • Biostatistics sometimes suggest that no significant statistical difference exists between groups, when in fact the groups do differ from each other. To accept H0 as tenable when it is false is to make a “Type II” error.

These possibilities are illustrated in Figure 27.2 (see Chapter 28, regarding statistical power, to learn more about how to decrease the probability of a Type II error).


Figure 27.2 Correct conclusions, and Type I and Type II Error. Comparing possible findings of statistical inferential testing versus that which is actually true.

Matching the type of data to the appropriate inferential statistical test [8, 9]

When making comparisons between two or more groups, it is critical to use the proper inferential statistical test. It is a matter of using tools correctly! It is straightforward to list the appropriate statistical test to use, when certain data attributes are present [7, 8].

Inferential tests for nominal data

Most statistical testing of nominal data is performed using a Chi-square test. (The exception: when a cell of a 2 × 2 contingency table has an expected frequency of less than five, Fisher’s Exact Test is used [7].) Chi-square testing compares observed versus expected outcome frequencies from each cell in a contingency table to derive a statistic with a numeric value, in this case called “Chi-square.” Computer software can convert the value of the Chi-square statistic obtained directly from the data to the probability of obtaining that Chi-square value by chance, given the size of the contingency table developed by the experiment [7]. More about Chi-square testing is presented in Chapter 32.
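As an illustration, the 2 × 2 comparison of prednisolone versus no prednisolone at nine months (Table 27.1) can be worked by hand. The sketch below computes the Chi-square statistic from observed and expected frequencies; the p-value conversion uses the closed-form identity for one degree of freedom. Note that it omits the Yates continuity correction that some software applies by default for 2 × 2 tables, so the statistic differs slightly from corrected values:

```python
import math

# Observed 9-month frequencies from Table 27.1: [recovered, not recovered]
observed = [[237, 14],   # prednisolone (n = 251)
            [200, 45]]   # no prednisolone (n = 245)

row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
total = sum(row)

# Chi-square = sum over cells of (observed - expected)^2 / expected,
# where expected = (row total * column total) / grand total
chi2 = sum((observed[i][j] - row[i] * col[j] / total) ** 2
           / (row[i] * col[j] / total)
           for i in range(2) for j in range(2))

# For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), p_value)  # chi2 ≈ 19.35, p < 0.001
```

The resulting p-value agrees with the “<0.001” reported in Table 27.1.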

Tests for ordinal data

For ordinal data, the appropriate test can be chosen after determining:

  • Whether or not subjects participate in all treatment groups, as occurs in “cross-over” designs. Cross-over designs inherently decrease intersubject variability between groups. Thus, treatment effects can account for a relatively larger portion of the total variability between each individual subject’s data.
  • Whether or not “cumulative frequency grouping” occurs. Cumulative frequency grouping implies the use of a range of ordinal values within one outcome group. (For instance, grouping Glasgow Coma Scores of 3, 4, or 5 into one group is an example of cumulative frequency grouping.)
  • Whether there are two, or more than two groups being compared. See Table 27.3 for a summary, which matches aspects of a study’s design (whether or not a “cross-over” design is used) and the number of groups being studied to determine the proper inferential test for ordinal data.

Table 27.3 Picking the correct test for ordinal data.

Number of groups compared | Cross-over design? | Cumulative frequency grouping? | Correct test
2 | No | No | Mann–Whitney “U” (a.k.a. Wilcoxon Rank Sum)
2 | No | Yes | Kolmogorov–Smirnov
2 | Yes | No | Wilcoxon Signed Ranks
3 or more | No | No | Kruskal–Wallis
3 or more | Yes | No | Friedman

Inferential tests for ordinal data require calculation of that test’s statistic, such as the Mann–Whitney “U” for the Mann–Whitney test, or Friedman’s “Q” for the Friedman test. From the numeric value of this statistic, calculated using the numbers that comprise the study data, the probability of obtaining the result by chance can be determined. This process is highly analogous to how a Chi-square value for tests of nominal data is used to determine the probability that the value of Chi-square could have occurred simply by chance. Statistical software performs these functions and provides the probability estimate (the “p-value”) that the researcher reports. For further detail regarding the underlying mathematical assumptions of these tests of ordinal data, or regarding the calculation of the test statistic and its translation to a p-value, the reader is encouraged to consult a statistical text that addresses these matters.
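To make the rank-based logic concrete, here is a minimal sketch, with invented scores, of how the Mann–Whitney U statistic is tallied: U counts, for each observation in one group, how many observations in the other group it exceeds (ties count one half). Software then converts U and the group sizes to a p-value:

```python
def mann_whitney_u(a, b):
    """U for group a: count of pairs where a's value beats b's (ties = 0.5)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)

# Hypothetical ordinal scores for two independent (non-cross-over) groups
group_a = [1, 2, 4]
group_b = [3, 5, 6]

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)

# The two U values always sum to the number of possible pairs
assert u_a + u_b == len(group_a) * len(group_b)
print(min(u_a, u_b))  # 1.0 -- the smaller U is referred to tables or software
```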

Tests for parametric data

For parametric data, the appropriate test to use depends upon the number of groups being compared. The Student t-test is used to compare two groups, and a form of Analysis of Variance (ANOVA) is used to compare three or more groups. Further, specific versions of the Student t-test are used for cross-over experimental designs (the “paired t-test”) and non-cross-over designs (the “non-paired t-test”). For more about ANOVA, see Chapter 32.
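As a hedged sketch with invented data, the non-paired (pooled) t statistic for two independent groups can be computed from the group means and variances; statistical software then converts t and its degrees of freedom to a p-value:

```python
import math
import statistics

# Hypothetical parametric measurements from two independent groups
a = [5, 6, 7, 8, 9]     # mean 7
b = [8, 9, 10, 11, 12]  # mean 10

na, nb = len(a), len(b)
# Pooled variance: weighted average of the two sample variances
sp2 = ((na - 1) * statistics.variance(a) +
       (nb - 1) * statistics.variance(b)) / (na + nb - 2)
# Non-paired t statistic: difference in means over its standard error
t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
df = na + nb - 2

print(t, df)  # t = -3.0 on 8 df; |t| exceeds the 2.306 critical value (p < 0.05)
```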

Multiple comparisons and Type I error rates

You may wonder why ANOVA, Friedman Tests, and Kruskal–Wallis tests are necessary and useful. These tests are specifically designed to determine, with a single test, whether or not a statistically significant difference exists among three or more treatment groups. The reason for their utility is clear to those who understand the impact of incorrectly making multiple inferential comparisons among three or more groups, using tests designed to compare two groups at a time.

The overall probability of committing a “Type I” statistical error, when making multiple comparisons between study groups, is [1 − (1 − α)^c], where c is the number of statistical comparisons made and α is the probability deemed to be “statistically significant” (Table 27.4).

Table 27.4 Multiple comparisons impact the probability of a Type I error (α = 0.05).

Number of comparisons | Type I error probability
1 | 0.0500
2 | 0.0975
3 | 0.1426
4 | 0.1855
5 | 0.2262
6 | 0.2649

For example, if a study has four treatment groups, six pairwise comparisons are possible: Group 1 vs 2, 1 vs 3, 1 vs 4, 2 vs 3, 2 vs 4, and 3 vs 4. Kruskal–Wallis and Friedman tests for ordinal data, and Analysis of Variance for parametric data, are designed to hold the experiment-wise Type I error rate constant at 5%, rather than letting it increase with the multiple possible intergroup comparisons. Contrast this 5% Type I error rate, when inferential tests are used properly, with the probability of a Type I error that accrues from incorrectly making the six possible pairwise statistical comparisons. When p < 0.05 is set as the threshold for a significant difference on each individual test, the overall probability of making a Type I error rises to approximately 26.5%!

The issue of multiple comparisons and the equation [1 − (1 − α)^c] also applies to evaluating the experiment-wise Type I error rate in studies that have primary and secondary outcomes. A study with one primary outcome and five secondary outcomes also makes six statistical comparisons. The probability of committing a Type I error when making six statistical comparisons is likewise approximately 26.5% (Table 27.4).
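The experiment-wise error figures in Table 27.4 follow directly from the formula; a few lines of Python reproduce them:

```python
def type1_overall(c, alpha=0.05):
    """Experiment-wise probability of at least one Type I error in c comparisons."""
    return 1 - (1 - alpha) ** c

# Reproduce Table 27.4: 1 - (1 - 0.05)^c for c = 1..6
for c in range(1, 7):
    print(c, round(type1_overall(c), 4))
# Six comparisons give ~0.2649: roughly a 26.5% chance of at least one Type I error
```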

Beware of studies with multiple secondary outcomes. They are reasonably likely to commit at least one Type I error. The fact that Type I errors are inherently more likely for secondary outcomes than for primary outcomes explains why secondary outcome findings that appear to be statistically significant should be restudied to determine whether a significant statistical association remains.

Summary

The prudent researcher and the prudent consumer of medical literature must be able to recognize the type of data being collected and analyzed in a study, in order to express study results correctly, in terms of characterizations of its central tendency and variability. Also, the type of data being analyzed must be known in order to pick the correct tool, the appropriate statistical test. Those who perform and read about research must also understand the concepts of Type I and Type II error, and the fact that experimental results are expressed as matters of probabilities, and not certainties.

References

  1. Gaddis, M.L. and Gaddis, G.M. (1990) Introduction to biostatistics: Part 1, basic concepts. Ann Emerg Med, 19:86–89.
  2. Sullivan, F.M., Swan, I.R.C., Donnan, P.T., et al. (2007) Early treatment with prednisolone or acyclovir in Bell’s Palsy. N Engl J Med, 357:1598–1607.
  3. House, J.W. and Brackmann, D.E. (1985) Facial nerve grading system. Otolaryngol Head Neck Surg, 93:146–147.
  4. Gage, B.F., van Walraven, C., Pearce, L., et al. (2004) Selecting patients with atrial fibrillation for anticoagulation: stroke risk stratification in patients taking aspirin. Circulation, 110:2287–2292.
  5. Teasdale, G. and Jennett, B. (1974) Assessment of coma and impaired consciousness: A practical scale. Lancet, 2(7872):81–84.
  6. Gaddis, G.M. and Gaddis, M.L. (1990) Introduction to biostatistics: Part 2, descriptive statistics. Ann Emerg Med, 19:309–315.
  7. Gaddis, G.M. and Gaddis, M.L. (1990) Introduction to biostatistics: Part 3, sensitivity, specificity, predictive value, and hypothesis testing. Ann Emerg Med, 19:591–597.
  8. Gaddis, G.M. and Gaddis, M.L. (1990) Introduction to biostatistics: Part 5, statistical inference techniques for hypothesis testing with nonparametric data. Ann Emerg Med, 19:1054–1059.
  9. Gaddis, G.M. and Gaddis, M.L. (1990) Introduction to biostatistics: Part 4, statistical inference techniques in hypothesis testing. Ann Emerg Med, 19:820–825.