Studies evaluating diagnostic test performance yield a variety of numerical results. While these ought to be accompanied by confidence intervals, this is less commonly done than in other types of medical research.1
Diagnosis may be based on the presence or absence of some feature or symptom, on a classification into three or more groups (perhaps using a pathological grading system), or on a measurement. Values of a measurement may also be grouped into two or more categories, or may be kept as measurements. Each case will be considered in turn. See Altman2 for more detailed discussion of most of these methods.
A confidence interval indicates uncertainty in the estimated value. As noted in earlier chapters, it does not take account of any additional uncertainty that might relate to other aspects such as bias in the study design, a common problem with studies evaluating diagnostic tests.3
The simplest diagnostic test is one where the results of an investigation, such as an X-ray or biopsy, are used to classify patients into two groups according to the presence or absence of a symptom or sign. The question is then to quantify the ability of this binary test to discriminate between patients who do or do not have the disease or condition of interest. (The general term “disease” is used here, although the target disorder is not always a disease.) Table 10.1 shows a 2 × 2 table representing this situation, in which a, b, c, and d are the numbers of individuals in each cell of the table.
The two most common indices of the performance of a test are the sensitivity and specificity. The sensitivity is the proportion of true positives that are correctly identified by the test, given by a/(a + c) in Table 10.1. The specificity is the proportion of true negatives that are correctly identified by the test, or d/(b + d).
The sensitivity and specificity are proportions, and so confidence intervals can be calculated for them using the traditional method or Wilson’s method, as discussed in chapter 6. Note that the traditional method may not perform well when proportions are close to 1 (100%) as is often the case in this context, and may even give confidence intervals which exceed 100%.4 Wilson’s method is thus generally preferable.
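As an illustration, the following short Python sketch computes both proportions and their Wilson confidence intervals from a 2 × 2 table laid out as in Table 10.1. The cell counts and the function name wilson_ci are hypothetical, and a 95% level (z = 1·96) is assumed.

```python
from math import sqrt

def wilson_ci(r, n, z=1.96):
    """Wilson confidence interval for a proportion r/n (95% by default)."""
    p = r / n
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half_width = (z / (1 + z**2 / n)) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half_width, centre + half_width

# Hypothetical 2 x 2 table in the layout of Table 10.1:
#                 disease present   disease absent
# test positive         a                 b
# test negative         c                 d
a, b, c, d = 36, 4, 4, 36

sensitivity = a / (a + c)   # proportion of true positives correctly identified
specificity = d / (b + d)   # proportion of true negatives correctly identified

lo, hi = wilson_ci(a, a + c)
print(f"sensitivity {sensitivity:.3f}, 95% CI {lo:.3f} to {hi:.3f}")
lo, hi = wilson_ci(d, b + d)
print(f"specificity {specificity:.3f}, 95% CI {lo:.3f} to {hi:.3f}")
```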
In clinical practice the test result is all that is known, so we want to know how good the test is at predicting abnormality. In other words, what proportion of patients with abnormal test results are truly abnormal? The sensitivity and specificity do not give us this information. Instead we must approach the data from the direction of the test results. These two proportions are defined as follows: the positive predictive value (PV+) is the proportion of patients with positive test results who are correctly diagnosed, and the negative predictive value (PV–) is the proportion of patients with negative test results who are correctly diagnosed.
In the notation of Table 10.1, PV+ = a/(a + b) and PV– = d/(c + d).
We should not stop the analysis here. The predictive values of a test depend upon the prevalence of the abnormality in the patients being tested, which may not be known. The values just calculated assume that the prevalence of detrusor hyperreflexia among the population of patients likely to be tested is the same as in the sample, namely 60/80 or 75%. In a different clinical setting the prevalence of abnormality may differ considerably.
Predictive values observed in one study do not apply universally. The rarer the true abnormality is, the more sure we can be that a negative test indicates no abnormality, and the less sure that a positive result really indicates an abnormal patient.
More general formulae for calculating predictive values for any prevalence (prev) are

PV+ = sens × prev / [sens × prev + (1 – spec) × (1 – prev)]

and

PV– = spec × (1 – prev) / [(1 – sens) × prev + spec × (1 – prev)]
where sens and spec are the sensitivity and specificity as previously defined. The prevalence can be interpreted as the probability before the test is carried out that the subject has the disease, also known as the prior probability of disease. PV+ and PV– are the revised estimates of the probability of disease for those subjects who are positive and negative to the test, and are known as posterior probabilities. The comparison of the prior and posterior probabilities is one way of assessing the usefulness of the test. The predictive values can change considerably with a different prevalence.
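As an illustration, the following Python sketch applies these formulae over a range of prevalences; the sensitivity, specificity and prevalence values are hypothetical, and they show how markedly the predictive values change as the prevalence falls.

```python
def predictive_values(sens, spec, prev):
    """Positive and negative predictive values for a given prevalence (Bayes' theorem)."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / ((1 - sens) * prev + spec * (1 - prev))
    return ppv, npv

# Hypothetical test with sensitivity 0.90 and specificity 0.90
for prev in (0.75, 0.25, 0.05):
    ppv, npv = predictive_values(0.90, 0.90, prev)
    print(f"prevalence {prev:.2f}: PV+ = {ppv:.3f}, PV- = {npv:.3f}")
```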
We can obtain approximate confidence intervals for prevalence-adjusted predictive values by expressing them as proportions of the number of positive or negative test results in the study.
For any test result we can compare the probability of getting that result, if the patient truly had the condition of interest, with the corresponding probability if they were healthy. The ratio of these probabilities is called the likelihood ratio (LR). The likelihood ratio for a positive test result is calculated as

LR+ = Prob(positive test result if the disease is present) / Prob(positive test result if the disease is absent)

and the likelihood ratio for a negative test result is calculated as

LR– = Prob(negative test result if the disease is present) / Prob(negative test result if the disease is absent)
The contrast between these values indicates the value of the test for increasing certainty about a diagnosis.2
The likelihood ratio is increasingly being quoted in papers describing diagnostic test results, but is rarely accompanied by a confidence interval. The likelihood ratio for a positive test can also be expressed as the ratio of the true-positive and false-positive rates, that is

LR+ = sensitivity / (1 – specificity) = [a/(a + c)] / [b/(b + d)]

Similarly, the likelihood ratio for a negative test can be expressed as the ratio of the false-negative and true-negative rates, that is

LR– = (1 – sensitivity) / specificity = [c/(a + c)] / [d/(b + d)]
The likelihood ratio is identical in construction to a relative risk (or risk ratio)—that is, it is the ratio of two independent proportions. It follows immediately that a method for deriving a confidence interval for a relative risk can be applied also to the likelihood ratio. There are several possible methods, not all of which are equally good. Two are considered here.
Confidence intervals for the population value of the likelihood ratio can be constructed through a logarithmic transformation,2 as described in chapter 7. The method is illustrated using LR+. The standard error of loge LR+ is

SE(loge LR+) = √[1/a – 1/(a + c) + 1/b – 1/(b + d)]

from which a 100(1 – α)% confidence interval for loge LR+ is found in the standard way. We obtain a confidence interval for LR+ by antilogging (exponentiating) these values. (The derivation of this method, sometimes called the log method, is given in the appendix of Simel et al.6)
Note that either a or b can be zero, in which case SE(loge LR+) becomes infinite. To avoid this problem, it may be preferable in such cases to add 0·5 to the counts in all four cells of the observed table before calculating both LR+ and SE(loge LR+).
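The calculations for LR+ can be sketched in Python as follows. The counts are hypothetical and follow the layout of Table 10.1; the function adds 0·5 to every cell when a or b is zero, as suggested above, and a 95% level is assumed.

```python
from math import exp, log, sqrt

def lr_positive_ci(a, b, c, d, z=1.96):
    """LR+ with a confidence interval by the log method.

    Adds 0.5 to every cell if a or b is zero, so that the standard error is defined.
    """
    if a == 0 or b == 0:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    lr = (a / (a + c)) / (b / (b + d))                       # true-positive / false-positive rate
    se_log = sqrt(1 / a - 1 / (a + c) + 1 / b - 1 / (b + d))  # SE of loge LR+
    return lr, exp(log(lr) - z * se_log), exp(log(lr) + z * se_log)

# Hypothetical counts in the layout of Table 10.1
a, b, c, d = 36, 4, 4, 36
lr, lo, hi = lr_positive_ci(a, b, c, d)
print(f"LR+ = {lr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```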
The “score test” method reportedly performs somewhat better than the usual log method,7,8 although it is too complex for the formula to be given here.
Multicategory classifications represent an intermediate step between dichotomous tests and tests based on measurements. With few categories, say three or four, we can evaluate the preceding statistics using each division between categories in turn to create a binary test. Sometimes this procedure is adopted for tests which are measurements. For example, Sackett et al.10 discuss the use of serum ferritin level in five bands to diagnose, or rule out, iron-deficiency anaemia.
Many diagnostic tests are quantitative, notably in clinical chemistry. The methods of the preceding sections can be applied only if we can select a cutpoint to distinguish “normal” from “abnormal”, which is not a trivial problem. It is often done by taking the observed median value, or perhaps the upper limit of a predetermined reference interval.
This approach is wasteful of information, however, and involves a degree of arbitrariness. Classification into several groups is better than just two, but there are ways of proceeding that do not require any grouping of the data; they are described in this section.
First we can investigate to what extent the test results differ among people who do or do not have the diagnosis of interest. The receiver operating characteristic (ROC) plot is one way to do this. ROC plots were developed in the 1950s for evaluating radar signal detection. A paper by Hanley and McNeil11 was very influential in introducing the method to medicine.
The ROC plot is obtained by calculating the sensitivity and specificity for every distinct observed data value and plotting sensitivity against 1 – specificity, as in Figure 10.1. A test that discriminates perfectly between the two groups would yield a “curve” that coincided with the left and top sides of the plot. A test that is completely useless would give a straight line from the bottom left corner to the top right corner. In practice the ROC curve will lie somewhere between these extremes according to the degree of overlap of the values in the groups.
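As an illustration, the following Python sketch computes the points of an ROC plot by taking every distinct observed value as a cutpoint, assuming that higher values indicate disease; the measurements and the function name roc_points are hypothetical.

```python
def roc_points(diseased, healthy):
    """Coordinates (1 - specificity, sensitivity) for every distinct observed value
    used as a cutpoint, counting values at or above the cutpoint as test positive."""
    points = [(0.0, 0.0)]
    for cut in sorted(set(diseased) | set(healthy), reverse=True):
        sens = sum(x >= cut for x in diseased) / len(diseased)
        one_minus_spec = sum(x >= cut for x in healthy) / len(healthy)
        points.append((one_minus_spec, sens))
    return points   # ready to plot, ending at (1, 1)

# Hypothetical measurements
diseased = [3.1, 4.7, 5.2, 5.9, 6.4, 7.8]
healthy = [1.2, 2.4, 2.9, 3.3, 4.1, 4.6]
for x, y in roc_points(diseased, healthy):
    print(f"1 - specificity = {x:.2f}, sensitivity = {y:.2f}")
```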
A global assessment of the performance of the test, sometimes called diagnostic accuracy,12 is given by the area under the ROC curve (often abbreviated AUC). This area is equal to the probability that a random person with the disease has a higher value of the measurement than a random person without the disease. The area is 1 for a perfect test and 0·5 for an uninformative test. The non-parametric calculation of the area under the curve is closely related to the Mann-Whitney U statistic.
The area under the curve and its standard error can be obtained by comparing every member of one group with every member of the other. We consider the n individuals in the first group with values Xi and the m individuals in the second group with observations Yj, so that there are nm pairs of values Xi and Yj. For each pair we obtain a “placement score” ψij, which indicates which value is larger. We have ψij = 1 if Yj > Xi; ψij = 0·5 if Yj = Xi; and ψij = 0 if Yj < Xi. In effect we assess where each observation is placed compared with all of the values in the other group.14,15
The area under the curve is simply the mean of all of the values ψij. In mathematical notation, we have

AUC = (1/nm) Σi Σj ψij

The area under the curve can also be written as

AUC = (1/n) Σi Ri = (1/m) Σj Cj

where

Ri = (1/m) Σj ψij   and   Cj = (1/n) Σi ψij

The values Ri indicate for each member of the first group the proportion of the observations in the second group which exceed it, and similarly for Cj. (They can be obtained from the row and column totals of ψij when the data are arranged in a two-way table, as in the worked example below.)
The standard error of AUC can be calculated from the variability of the values Ri and Cj as

SE(AUC) = √[ var(Ri)/n + var(Cj)/m ]

where var indicates the usual sample variance.14,15 The standard error of AUC, the area under the ROC curve, can be used to construct a confidence interval in the familiar way using the general formula given in chapter 3. A 100(1 – α)% confidence interval for AUC is

AUC – z1 – α/2 × SE(AUC)   to   AUC + z1 – α/2 × SE(AUC)
where z1 – α/2 is the appropriate value from the standard Normal distribution for the 100(1 – α/2) percentile (see Table 18.1). For a 95% confidence interval, z1 – α/2 = 1·96.
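A minimal Python sketch of this nonparametric calculation is given below, using the placement scores ψij, the placement proportions Ri and Cj, and the standard error based on their variances as described above. The data and function names are hypothetical, and the second group is taken to be the diseased group (higher values expected).

```python
from math import sqrt
from statistics import variance

def psi(x, y):
    """Placement score: 1 if y > x, 0.5 if tied, 0 otherwise."""
    return 1.0 if y > x else (0.5 if y == x else 0.0)

def auc_with_ci(group_x, group_y, z=1.96):
    """Nonparametric AUC with a confidence interval based on the variability
    of the per-subject placement proportions Ri and Cj."""
    n, m = len(group_x), len(group_y)
    R = [sum(psi(x, y) for y in group_y) / m for x in group_x]   # one value per first-group subject
    C = [sum(psi(x, y) for x in group_x) / n for y in group_y]   # one value per second-group subject
    auc = sum(R) / n                                             # equals sum(C) / m
    se = sqrt(variance(R) / n + variance(C) / m)
    return auc, auc - z * se, auc + z * se

# Hypothetical data: group_y (diseased) tends to have higher values than group_x (healthy)
group_x = [1.2, 2.4, 2.9, 3.3, 4.1, 4.6, 5.0, 5.5]
group_y = [3.1, 4.7, 5.2, 5.9, 6.4, 7.8, 8.0, 8.3]
auc, lo, hi = auc_with_ci(group_x, group_y)
print(f"AUC = {auc:.3f}, 95% CI {lo:.3f} to {hi:.3f}")
```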
The method just described relies on the assumption that the area under the curve has a Normal sampling distribution. Obuchowski and Lieber16 note that for methods of high accuracy (AUC > 0·95) use of the preceding method for the area under a single ROC curve may require a sample size of 200. For smaller samples a bootstrap approach is recommended (see chapter 13).
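A minimal sketch of such a bootstrap interval follows, assuming simple percentile limits and resampling within each group separately; the data, the number of replicates and the seed are arbitrary, and more refined bootstrap intervals are possible.

```python
import random

def psi(x, y):
    return 1.0 if y > x else (0.5 if y == x else 0.0)

def auc(group_x, group_y):
    """Nonparametric AUC: mean placement score over all pairs."""
    n, m = len(group_x), len(group_y)
    return sum(psi(x, y) for x in group_x for y in group_y) / (n * m)

def bootstrap_auc_ci(group_x, group_y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for AUC, resampling within each group separately."""
    rng = random.Random(seed)
    stats = sorted(
        auc(rng.choices(group_x, k=len(group_x)), rng.choices(group_y, k=len(group_y)))
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

group_x = [1.2, 2.4, 2.9, 3.3, 4.1, 4.6, 5.0, 5.5]   # hypothetical, as before
group_y = [3.1, 4.7, 5.2, 5.9, 6.4, 7.8, 8.0, 8.3]
print("bootstrap 95% CI for AUC: %.3f to %.3f" % bootstrap_auc_ci(group_x, group_y))
```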
Having determined that a test provides good discrimination, the best cutpoint for clinical use can be chosen. The simple approach of minimising “errors” (equivalent to maximising the sum of the sensitivity and specificity) is common, but it is not necessarily best. Consideration needs to be given to the costs (not just financial) of false-negative and false-positive diagnoses, and to the prevalence of the disease in the subjects being tested.12 For example, when screening the general population for cancer the cutpoint would be chosen to ensure that most cases were detected (high sensitivity) at the cost of many false positives (low specificity), which could be eliminated by a further test.
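For illustration, the following sketch finds the cutpoint that maximises the sum of the sensitivity and specificity, assuming that values at or above the cutpoint count as test positive; the measurements are hypothetical and no allowance is made for differing costs or prevalence.

```python
def best_cutpoint(diseased, healthy):
    """Cutpoint maximising sensitivity + specificity, treating values at or
    above the cutpoint as test positive."""
    best = None
    for cut in sorted(set(diseased) | set(healthy)):
        sens = sum(x >= cut for x in diseased) / len(diseased)
        spec = sum(x < cut for x in healthy) / len(healthy)
        if best is None or sens + spec > best[0]:
            best = (sens + spec, cut, sens, spec)
    return best[1:]

diseased = [3.1, 4.7, 5.2, 5.9, 6.4, 7.8]   # hypothetical measurements
healthy = [1.2, 2.4, 2.9, 3.3, 4.1, 4.6]
cut, sens, spec = best_cutpoint(diseased, healthy)
print(f"cutpoint {cut}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```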
The ROC plot is particularly useful for comparing two or more measures. A test whose curve lies wholly above that of another is clearly better. Methods for comparing the areas under two ROC curves for both paired and unpaired data are reviewed by Zweig and Campbell.12
An important aspect of classifying patients into two or more categories is the consistency with which it can be done. Of particular interest here is the ability of different observers to agree on the classification. A similar situation arises when we wish to compare two alternative diagnostic tests to see how well they agree with each other. In each case, we can construct a two-way table such as Table 10.5 and compare the observed agreement with that which we would expect by chance alone, using the statistic known as kappa (K).
Kappa is a measure of agreement beyond the level of agreement expected by chance alone. The observed agreement is the proportion of samples for which both observers agree, given by po = (a + d)/n. To get the expected agreement we use the row and column totals to estimate the expected numbers agreeing for each category. For positive agreement (+, +) the expected proportion is the product of (a + b)/n and (a + c)/n, giving (a + b)(a + c)/n². Likewise, for negative agreement the expected proportion is (c + d)(b + d)/n². The expected proportion of agreements for the whole table (pe) is the sum of these two terms. From these elements we obtain kappa as

K = (po – pe) / (1 – pe)

and its standard error is

SE(K) = √[ po(1 – po) / (n(1 – pe)²) ]

from which a 100(1 – α)% confidence interval for K is found in the standard way as

K – z1 – α/2 × SE(K)   to   K + z1 – α/2 × SE(K)
where z1 – α/2 is defined as above.
Kappa has its maximum value of 1 when there is perfect agreement. A kappa value of 0 indicates agreement no better than chance, while negative values (rarely encountered) indicate agreement worse than chance.
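As an illustration, the following Python sketch computes kappa, its approximate standard error and a 95% confidence interval from a 2 × 2 agreement table; the counts are hypothetical and follow the layout of Table 10.5.

```python
from math import sqrt

def kappa_2x2(a, b, c, d, z=1.96):
    """Kappa for a 2 x 2 agreement table, with an approximate confidence interval."""
    n = a + b + c + d
    po = (a + d) / n                                        # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2     # chance-expected agreement
    kappa = (po - pe) / (1 - pe)
    se = sqrt(po * (1 - po) / n) / (1 - pe)
    return kappa, kappa - z * se, kappa + z * se

# Hypothetical table: observer 1 in rows, observer 2 in columns
#              obs 2 +   obs 2 -
# obs 1 +        a=32       b=6
# obs 1 -        c=4        d=38
k, lo, hi = kappa_2x2(32, 6, 4, 38)
print(f"kappa = {k:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```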
The calculation of kappa can be extended quite easily to assessments with more than two categories.2 If there are g categories and fi is the number of agreements for the ith category, then the overall observed agreement is

po = (Σi fi) / n

If ri and ci are the totals of the ith row and ith column, then the overall expected agreement is

pe = (Σi ri ci) / n²
Kappa and its standard error are then calculated as before.
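A short sketch of the calculation for a g × g table follows; the three-category counts are hypothetical, and the same approximate standard error as above is assumed.

```python
from math import sqrt

def kappa_multi(table, z=1.96):
    """Kappa for a square g x g agreement table (list of rows), with an approximate CI."""
    n = sum(sum(row) for row in table)
    g = len(table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(g)) for j in range(g)]
    po = sum(table[i][i] for i in range(g)) / n               # observed agreement
    pe = sum(row_totals[i] * col_totals[i] for i in range(g)) / n**2   # expected agreement
    kappa = (po - pe) / (1 - pe)
    se = sqrt(po * (1 - po) / n) / (1 - pe)
    return kappa, kappa - z * se, kappa + z * se

# Hypothetical 3-category table: rows = observer 1, columns = observer 2
table = [[20, 5, 1],
         [4, 15, 3],
         [1, 2, 9]]
k, lo, hi = kappa_multi(table)
print(f"kappa = {k:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```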
Kappa can also be extended to multiple observers, and it is possible to weight the disagreements according to the number of categories separating the two assessments.
1 Harper R, Reeves B. Reporting of precision of estimates for diagnostic accuracy: a review. BMJ 1999;318:1322–3.
2 Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991:403–19.
3 Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research. JAMA 1995;274:645–51.
4 Deeks JJ, Altman DG. Sensitivity and specificity and their confidence intervals cannot exceed 100. BMJ 1999;318:193–4.
5 Petersen T, Chandiramani V, Fowler CJ. The ice-water test in detrusor hyperreflexia and bladder instability. Br J Urol 1997;79:163–7.
6 Simel DL, Samsa GP, Matchar DB. Likelihood ratios with confidence: sample size estimation for diagnostic test studies. J Clin Epidemiol 1991;44:763–70.
7 Gart JJ, Nam J. Approximate interval estimation of the ratio of binomial proportions: a review and corrections for skewness. Biometrics 1988;44:323–38.
8 Nam J. Confidence limits for the ratio of two binomial proportions based on likelihood scores: non-iterative method. Biom J 1995;37:375–9.
9 Koopman PAR. Confidence limits of the ratio of two binomial proportions. Biometrics 1984;40:513–17.
10 Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-based medicine. How to practise and teach EBM. London: Churchill-Livingstone, 1997:124.
11 Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36.
12 Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental tool in clinical medicine. Clin Chem 1993;39:561–77.
13 Bagot M, Mary J-Y, Heslan M et al. The mixed epidermal cell lymphocyte reaction is the most predictive factor of acute graft-versus-host disease in bone marrow graft recipients. Br J Haematol 1988;70:403–9.
14 DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44:837–45.
15 Hanley JA, Hajian-Tilaki KO. Sampling variability of nonparametric estimates of the areas under receiver operating characteristic curves: an update. Acad Radiol 1997;4:49–58.
16 Obuchowski NA, Lieber ML. Confidence intervals for the receiver operating characteristic area in studies with small samples. Acad Radiol 1998;5:561–71.
17 Fooks J, Mutch L, Yudkin P, Johnson A, Elbourne D. Comparing two methods of follow up in a multicentre randomised trial. Arch Dis Child 1997;76:369–76.