As noted in chapter 1, confidence intervals are not a modern device, yet their use in medicine (and indeed other scientific areas) was quite unusual until the second half of the 1980s. For some reason in the mid-1980s there was a spate of interest in the topic, with many journals publishing editorials and expository articles (see chapter 1). It seems that several such articles in leading medical journals were particularly influential. Since the first edition of this book there have been many further such publications, often contrasting confidence intervals and significance tests. There has been a continuing increase in the use of confidence intervals in medical research papers, although some medical specialties seem somewhat slower to move in this direction. This chapter briefly summarises some of this literature.
There is a long tradition of reviewing the statistical content of medical journals, and several recent reviews have included the use of confidence intervals. Of particular interest is a review of the use of statistics in papers in the British Medical Journal in 1977 and 1994, before and after it adopted its policy of requiring authors to use confidence intervals.1 One of the most marked increases was in the use of confidence intervals, which had risen from 4% to 62% of papers using some statistical technique – a large increase, but still well short of universal use. Similarly, between 1980 and 1990 the use of confidence intervals in the American Journal of Epidemiology approximately doubled to 70%, and it was around 90% in the subset of papers related to cancer,2 despite a lack of editorial directive.3 This review also illustrated a wider phenomenon: the increased use of confidence intervals was not so much instead of P values as in addition to them.2
The uptake of confidence intervals has not been equal throughout medicine. A review of papers published in the American Journal of Physiology in 1996 found that out of 370 papers only one reported confidence intervals!4 They were presented in just 16% of 100 papers in two radiology journals in 1993 compared with 52% of 50 concurrent papers in the British Medical Journal.5
Confidence intervals may also be uncommon in certain contexts. For example, they were used in only 2 of 112 articles in anaesthesia journals (in 1991–92) in conjunction with analyses of data from visual analogue scales.6
Editorials7–19 and expository articles20–31 related to confidence intervals have continued to appear in medical journals, some being quite lengthy and detailed. Almost without exception the authors have favoured greater use of confidence intervals and reduced use of P values (the few exceptions are discussed below). Many of these papers have contrasted estimation and confidence intervals with significance tests and P values.
Such articles seem to have become rarer in the second half of the 1990s, which may indicate that confidence intervals are now routinely included in introductory statistics courses, that there is a wide belief that this particular battle has been won, or that their use is so widespread that researchers use them to conform. Probably all of these are true to some degree.
As noted in chapter 1, when the first edition of this book was published in 1989, a few medical journals had begun to include some mention of confidence intervals in their instructions to authors. In 1988 the influential ‘Vancouver guidelines’32 (originally published in 1979) included the following passage:
Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to verify the reported results. When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid relying solely on statistical hypothesis testing, such as the use of P values, which fails to convey important quantitative information.
This passage has survived essentially intact to May 1999, apart from one trivial rewording.33 The comment on confidence intervals is, however, very brief and rather nebulous. In 1988 Bailar and Mosteller published a helpful amplification of the Vancouver section,34 but this article is not cited in recent versions of the guidelines. Over 500 medical journals have agreed to use the Vancouver requirements in their instructions to authors.33
Despite the continuing flow of editorials in medical journals in favour of greater use of confidence intervals,7–19 it is clear that the uptake of this advice has been patchy, as illustrated by reviews of published papers and also journals’ instructions to authors. In 1993, I reviewed the ‘Instructions to Authors’ of 135 journals, chosen to have high impact factors within their specialties. Only 19 (14%) mentioned confidence intervals explicitly in their instructions for authors, although about half made some mention of the Vancouver guidelines. Journals’ instructions to authors change frequently, and not necessarily in the anticipated direction. Statistical guidelines published (anonymously) in 1993 in Diabetic Medicine included the following: ‘Confidence intervals should be used to indicate the precision of estimated effects and differences’.35 At the same time they published an editorial stating ‘Diabetic Medicine is now requesting the use of confidence intervals wherever possible’.14 These two publications are not referenced in the 1999 guidelines, however, and there is no explicit mention of confidence intervals, although there is a reference to the Vancouver guidelines.36
Kenneth Rothman was an early advocate of confidence intervals in medical papers.37 In 1986 he wrote: ‘Testing for significance continues today not on its merits as a methodological tool but on the momentum of tradition. Rather than serving as a thinker’s tool, it has become for some a clumsy substitute for thought, subverting what should be a contemplative exercise into an algorithm prone to error.’38 Subsequently, as editor of Epidemiology, he has gone further:39
When writing for Epidemiology, you can also enhance your prospects if you omit tests of statistical significance. Despite a widespread belief that many journals require significance tests for publication, the Uniform Requirements for Manuscripts Submitted to Biomedical Journals discourages them, and every worthwhile journal will accept papers that omit them entirely. In Epidemiology, we do not publish them at all. Not only do we eschew publishing claims of the presence or absence of statistical significance, we discourage the use of this type of thinking in the data analysis, such as in the use of stepwise regression.
Curiously, this information is not given in the journal’s ‘Guidelines for Contributors’ (http://www.epidem.com/), perhaps reflecting the slightly softer position of a 1997 editorial: ‘it would be too dogmatic simply to ban the reporting of all P-values from Epidemiology.’40 Despite widespread encouragement to include confidence intervals, I am unaware of any other medical journal which has taken such a strong stance against P values.
A relevant issue is the inclusion of confidence intervals in abstracts of papers. Many commentators have noted that the abstract is the most read part of a paper,41 yet it is clear that it is the part that receives the least attention by authors, and perhaps also by editors. A few journals explicitly state in their instructions that abstracts should include confidence intervals. However, confidence intervals are often not included in the abstracts of papers even in journals which have signed up to guidelines requiring such presentation.42,43
The most obvious example of the misuse of confidence intervals is the presentation in a comparative study of separate confidence intervals for each group rather than a confidence interval for the contrast, as is recommended (chapter 14). This practice leads to inferences based on whether the two separate confidence intervals, such as for the means in each group, overlap or not. This is not the appropriate comparison and may mislead (see chapters 3 and 11). Of 100 consecutive papers (excluding randomised trials) that I refereed for the British Medical Journal, 8 papers out of the 59 (14%) which used confidence intervals used them inappropriately.44
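The pitfall can be illustrated with a small numerical sketch (the group summary figures below are invented purely for illustration): two 95% confidence intervals that overlap may nevertheless correspond to a confidence interval for the difference that excludes zero.

```python
import math

def ci95(estimate, se):
    """95% confidence interval for an estimate with standard error se."""
    return (estimate - 1.96 * se, estimate + 1.96 * se)

# Invented group summaries (mean and standard error), for illustration only
mean_a, se_a = 10.0, 1.0
mean_b, se_b = 13.0, 1.0

ci_a = ci95(mean_a, se_a)        # (8.04, 11.96)
ci_b = ci95(mean_b, se_b)        # (11.04, 14.96)
overlap = ci_a[1] > ci_b[0]      # True: the separate intervals overlap

# The recommended presentation: a confidence interval for the contrast
se_diff = math.sqrt(se_a**2 + se_b**2)
ci_diff = ci95(mean_b - mean_a, se_diff)   # about (0.23, 5.77): excludes 0
```

Judging significance from the overlap of the two separate intervals would wrongly suggest no difference here, whereas the interval for the contrast shows a difference that is statistically significant at the 5% level.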
The use for small samples of statistical methods intended for large samples can cause problems. In particular, confidence intervals for quantities constrained between limits should not include values outside the range of possible values for the quantities concerned. For example, the confidence interval for a proportion should not go outside the range 0 to 1 (or 0% to 100%) (see chapters 6 and 10). Quoted confidence intervals which include impossible values – such as the sensitivity of a diagnostic test greater than 100%, the area under the ROC curve greater than 1, and negative values of the odds ratio – should not be accepted by journals.45,46
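A minimal sketch of the point for the proportion case, with invented counts: the simple large-sample ('Wald') interval can stray outside the range 0 to 1 near the boundary, whereas a boundary-respecting method such as the Wilson score interval cannot.

```python
import math

def wald_ci(k, n, z=1.96):
    """Simple large-sample (Wald) interval: can stray outside [0, 1]."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

def wilson_ci(k, n, z=1.96):
    """Wilson score interval: always lies within [0, 1]."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (centre - half, centre + half)

# Invented example: 19 'positives' out of 20, i.e. a sensitivity of 95%
lo_w, hi_w = wald_ci(19, 20)     # upper limit exceeds 1 - impossible
lo_s, hi_s = wilson_ci(19, 20)   # stays within [0, 1]
```

The Wald upper limit here is about 1.046, an impossible value of the kind journals should not accept; the Wilson interval for the same data is roughly 0.76 to 0.99.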
One criticism of confidence intervals as used is that many researchers seem concerned only with whether the confidence interval includes the ‘null’ value representing no difference between the groups. Confidence intervals wholly to one side of the no effect point are deemed to indicate a significant result. This practice, which is based on a correct link between confidence interval and the P value, is indeed common. But even if the author of a paper acts in this way, by presenting the confidence interval they give readers the opportunity to take a different and more informative interpretation. When results are presented simply as P values, this option is unavailable.
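The correspondence being relied on can be sketched as follows, using a normal approximation and an invented effect estimate: a 95% confidence interval excludes the null value of zero exactly when the two-sided P value is below 0.05.

```python
import math

def ci95(estimate, se):
    """95% confidence interval under a normal approximation."""
    return (estimate - 1.96 * se, estimate + 1.96 * se)

def two_sided_p(estimate, se):
    """Two-sided P value from the z statistic estimate / se."""
    z = abs(estimate) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Invented treatment effect (difference in means) and its standard error
est, se = 3.0, 1.4

lo, hi = ci95(est, se)       # (0.26, 5.74): the interval excludes 0 ...
p = two_sided_p(est, se)     # ... and correspondingly P is about 0.03
```

The point in the text is that the interval carries much more than this yes/no verdict: its location and width show which sizes of effect remain plausible, information that the P value alone cannot convey.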
It is clear that there is a considerable consensus among statisticians that confidence intervals represent a far better approach to the presentation and interpretation of results than significance tests and P values. Apart from those, mostly statisticians, who criticise all frequentist approaches to statistical inference (usually in favour of Bayesian methods), there seem to have been very few who have spoken out against the general view that confidence intervals are a much better way to present results than P values.
In a short editorial in the Journal of Obstetrics and Gynaecology, the editor attacked several targets including confidence intervals.47 He expressed the unshakeable view that only positive results (P < 0.05) indicate important findings, and suggested that ‘The adoption of the [confidence interval] approach has already enabled the publication in full of many large but inconclusive studies … ’ Charlton48 argued that confidence intervals do not provide information of any value to clinicians. In fact, he criticised confidence intervals for not doing something which they do not purport to do, namely indicate the variation in response for individual patients.
Hilden49 cautioned that confidence intervals should not be presented ‘when there are major threats to accuracy besides sampling error; or when a characteristic is too local and study-dependent to be generalizable’. Hall50 took this line of reasoning further, arguing that confidence intervals ‘should be used sparingly, if at all’ when presenting the results of clinical trials. He also argued, contrary to the common view, that they might be particularly misleading ‘when a clinical trial has failed to produce anticipated results’. His reasoning was that patients in a trial are not a random sample and thus the results cannot be generalised, and also that ‘a clinical trial is designed to confirm expectation of treatment efficacy by rejecting the null hypothesis that differences are due to chance’. He went further, and suggested that ‘there are few, if any, situations in which a confidence interval proves useful’. This line of reasoning has a rational basis, but he has taken it to unreasonable extremes. Other articles in the same journal issue51,52 presented a more mainstream view.
It is interesting that there is no consensus among this small group of critics about what are the failings of confidence intervals. It is right to observe that we should always think carefully about the appropriate use and interpretation of all statistics, but it is wrong to suggest that all confidence intervals are meaningless or misleading.
As with many innovations, it is now hard to imagine the medical literature without confidence intervals. Overall, this is surely a development of great value, not least for the associated downplaying (though by no means elimination) of the wide use of P < 0.05 or P > 0.05 as a rule for interpreting study findings. However, as noted, confidence intervals can be both misused and overused, and there are arguments in favour of other approaches to statistical inference. Also, despite a large increase in the use of confidence intervals, even in those journals which require confidence intervals – such as the British Medical Journal – their use is not universal, and in some fields, such as physiology and psychology, it remains uncommon.
Confidence intervals are especially valuable to aid the interpretation of clinical trials and meta-analyses53 (see chapter 11). In cases where the estimated treatment effect is small the confidence interval indicates where clinically valuable treatment benefit remains plausible in the light of the data, and may help to avoid mistaking lack of evidence of effectiveness for evidence of lack of effectiveness.54 The CONSORT statement43 for reporting randomised trials requires confidence intervals, as does the QUOROM statement55 for reporting systematic reviews and meta-analyses (see chapters 11 and 15).
None of this is meant to imply that confidence intervals offer a cure for all the problems associated with significance testing and P values, as several observers have noted.56,57 We should certainly expect continuing developments in thinking about statistical inference.58–61
1 Seldrup J. Whatever happened to the t-test? Drug Inf J 1997;31:745–50.
2 Savitz DA, Tolo K-A, Poole C. Statistical significance testing in the American Journal of Epidemiology, 1970–1990. Am J Epidemiol 1994;139:1047–52.
3 Walter SD. Methods of reporting statistical results from medical research studies. Am J Epidemiol 1995;141:896–908.
4 Curran-Everett D, Taylor S, Kafadar K. Fundamental concepts in statistics: elucidation and illustration. J Appl Physiol 1998;85:775–86.
5 Cozens NJA. Should we have confidence intervals in radiology papers? Clin Radiol 1994;49:199–201.
6 Mantha S, Thisted R, Foss J, Ellis JE, Roizen MF. A proposal to use confidence intervals for visual analog scale data for pain measurement to determine clinical significance. Anesth Analg 1993;77:1041–7.
7 Keiding N. Sikkerhedsintervaller. Ugeskr Læger 1990;152:2622.
8 Braitman LE. Confidence intervals assess both clinical significance and statistical significance. Ann Intern Med 1991;114:515–17.
9 Russell I. Statistics – with confidence? Br J Gen Pract 1991;41:179–80.
10 Altman DG, Gardner MJ. Confidence intervals for research findings. Br J Obstet Gynaecol 1992;99:90–1.
11 Grimes DA. The case for confidence intervals. Obstet Gynecol 1992;80:865–6.
12 Scialli AR. Confidence and the null hypothesis. Reprod Toxicol 1992;6:383–4.
13 Harris EK. On P values and confidence intervals (why can’t we P with more confidence?) Clin Chem 1993;39:927–8.
14 Hollis S. Statistics in Diabetic Medicine: how confident can you be? Diabetic Med 1993;10:103–4.
15 Potter RH. Significance level and confidence interval. J Dent Res 1994;73:494–6.
16 Waller PC, Jackson PR, Tucker GT, Ramsay LE. Clinical pharmacology with confidence. Br J Clin Pharmacol 1994;37:309–10.
17 Altman DG. Use of confidence intervals to indicate uncertainty in research findings. Evidence-Based Med 1996;1 (May-June): 102–4.
18 Northridge ME, Levin B, Feinleib M, Susser MW. Statistics in the journal – significance, confidence and all that. Am J Public Health 1997;87:1092–5.
19 Sim J, Reid N. Statistical inference by confidence intervals: issues of interpretation and utilization. Phys Ther 1999;79:186–95.
20 Kelbæk HS, Gjørup T, Hilden J. Sikkerhedsintervaller i stedet for P-værdier. Ugeskr Læger 1990;152:2623–8.
21 Chinn S. Statistics in respiratory medicine. 1. Ranges, confidence intervals and related quantities: what they are and when to use them. Thorax 1991;46:391–3.
22 Borenstein M. A note on the use of confidence intervals in psychiatric research. Psychopharmacol Bull 1994;30:235–8.
23 Healy MJR. Size, power, and confidence. Arch Dis Child 1992;67:1495–7.
24 Dorey F, Nasser S, Amstutz H. The need for confidence intervals in the presentation of orthopaedic data. J Bone Joint Surg 1993;75A:1844–52.
25 Birnbaum D, Sheps SB. The merits of confidence intervals relative to hypothesis testing. Infect Control Hosp Epidemiol 1992;13:553–5.
25a Henderson AR. Chemistry with confidence: should Clinical Chemistry require confidence intervals for analytical and other data? Clin Chem 1993;39:929–35.
26 Metz CE. Quantification of failure to demonstrate statistical significance. Invest Radiol 1993;28:59–63.
27 Borenstein M. Hypothesis testing and effect size estimation in clinical trials. Ann Allergy Asthma Immunol 1997;78:5–11.
28 Young KD, Lewis RJ. What is confidence? Part 1: The use and interpretation of confidence intervals. Ann Emerg Med 1997;30:307–10.
29 Young KD, Lewis RJ. What is confidence? Part 2: Detailed definition and determination of confidence intervals. Ann Emerg Med 1997;30:311–18.
30 Greenfield MVH, Kuhn JE, Wojtys EM. A statistics primer. Confidence intervals. Am J Sports Med 1998;26:145–9.
31 Fitzmaurice G. Confidence intervals. Nutrition 1999;15:515–16.
32 International Committee of Medical Journal Editors. Uniform Requirements for Manuscripts Submitted to Biomedical Journals. BMJ 1988;296:401–5.
33 International Committee of Medical Journal Editors. Uniform Requirements for Manuscripts Submitted to Biomedical Journals. Ann Intern Med 1997;126:36–47 (see also http://www.acponline.org/journals/resource/unifreqr.htm dated May 1999 – accessed 23 September 1999).
34 Bailar JC, Mosteller F. Guidelines for statistical reporting in articles for medical journals. Amplifications and explanations. Ann Intern Med 1988;108:266–73.
35 Anonymous. Statistical guidelines for Diabetic Medicine. Diabetic Med 1993;10:93–4.
36 Diabetic Medicine. Instructions for Authors. http://www.blacksci.co.uk/ (accessed 23 September 1999).
37 Rothman KJ. A show of confidence. N Engl J Med 1978;299:1362–3.
38 Rothman KJ. Significance questing. Ann Intern Med 1986;105:445–7.
39 Rothman KJ. Writing for Epidemiology. Epidemiology 1998;9. See also http://www.epidem.com.
40 Lang JM, Rothman KJ, Cann CI. The confounded P value. Epidemiology 1998;9:7–8.
41 Pitkin RM, Branagan MA. Can the accuracy of abstracts be improved by providing specific instructions? A randomized controlled trial. JAMA 1998;280:267–9.
42 Haynes RB, Mulrow CD, Huth EJ, Altman DG, Gardner MJ. More informative abstracts revisited. Ann Intern Med 1990;113:69–76.
43 Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA 1996;276:637–9.
44 Altman DG. Statistical reviewing for medical journals. Stat Med 1998;17:2662–74.
45 Deeks JJ, Altman DG. Sensitivity and specificity and their confidence intervals cannot exceed 100%. BMJ 1999;318:193–4.
46 Altman DG. ROC curves and confidence intervals: getting them right. Heart 2000;83:236.
47 Hawkins DF. Clinical trials – meta-analysis, confidence limits and ‘intention to treat’ analysis. J Obstet Gynaecol 1990;10:259–60.
48 Charlton BG. The future of clinical research: from megatrials towards methodological rigour and representative sampling. J Eval Clin Pract 1996;2:159–69.
49 Hilden J. Book review of Lang TA, Secic M, ‘How to report statistics in medicine. Annotated guidelines for authors, editors and reviewers’. Med Decis Making 1998;18:351–2.
50 Hall DB. Confidence intervals and controlled clinical trials: incompatible tools for medical research. J Biopharmaceut Stat 1993;3:257–63.
51 Braitman LE. Statistical estimates and clinical trials. J Biopharmaceut Stat 1993;3:249–56.
52 Simon R. Why confidence intervals are useful tools in clinical therapeutics. J Biopharmaceut Stat 1993;3:243–8.
53 Borenstein M. The case for confidence intervals in controlled clinical trials. Controlled Clin Trials 1994;15:411–28.
54 Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995;311:485.
55 Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, Stroup DF, et al. Improving the quality of reports of meta-analyses of randomized controlled trials: the QUOROM statement. Lancet, in press.
56 Freeman PR. The role of P-values in analysing trial results. Stat Med 1993;12:1443–52.
57 Feinstein AR. P-values and confidence intervals: two sides to the same unsatisfactory coin. J Clin Epidemiol 1998;51:355–60.
58 Savitz DA. Is statistical significance testing useful in interpreting data? Reprod Toxicol 1993;7:95–100.
59 Burton PR, Gurrin LC, Campbell MJ. Clinical significance not statistical significance: a simple Bayesian alternative to P values. J Epidemiol Community Health 1998;52:318–23.
60 Goodman SN. Towards evidence-based medical statistics. Part 1. The P value fallacy. Ann Intern Med 1999;130:995–1004.
61 Goodman SN. Towards evidence-based medical statistics. Part 2. The Bayes factor. Ann Intern Med 1999;130:1005–21.