3

How Not to Lie with Statistics

Category shift is only one example where setting the wrong targets can lead to unwanted consequences that are bad for the patients, but at least it can be neutralised by setting intelligent targets and by truthful and wholesome reporting of outcomes. Let us assume that this happy state of affairs now prevails, and that our data have not been distorted by category shift or other misrepresentations. Can we now use mortality data to find out which surgeon or which hospital is the better one?

When surgeon X has a higher mortality rate than surgeon Y, there are three possible reasons: (1) the difference is due to chance alone; (2) the difference is due to different case mixes (surgeon X operates on many high-risk patients, say, and surgeon Y runs a mile when one comes along); or (3) surgeon Y is better. Before we can safely conclude that reason 3 is right, we must eliminate any possibility of reason 1 and reason 2.

Ensuring that the difference is not due to chance alone is actually remarkably easy to do, and has been one of the essential features of interpreting medical and other scientific research. The interpretation relies on basic statistics.

Imagine that you have a headache and you stumble upon a snake-oil salesman. He is flogging his snake oil as a treatment for headache. He confidently (and truthfully) asserts that, in his experience, 100 per cent of those who took the snake oil had their headache cured in an hour, whereas 100 per cent of those who didn’t buy the snake oil still had a sore head after an hour. You might be suitably impressed, but you shouldn’t be, or at least not yet. Before coming to any conclusions, you really, really must find out how many people with a headache took the snake oil and how many did not. You ask the snake-oil salesman, who replies (also truthfully) that only two people ever approached him with a headache. One took the oil and lost the headache, the other did not take the oil and the headache persisted. Does the snake oil work? Maybe it does and maybe it doesn’t, but his earlier assertions are certainly no evidence for the efficacy of the medicine. Concluding that the snake oil works on the basis of this pathetic, two-patient, non-randomised clinical trial is exactly the same as stepping out of a cafe, meeting a six-foot-tall woman followed by a five-foot-11-inch-tall man, and concluding that women are taller than men. You cannot make such conclusions on the basis of so few observations. If, on the other hand, you measured the height of 10,000 women and 10,000 men at random in your town, and found that, on average, the women were six feet tall and the men only five feet, you would almost certainly be right in concluding that women are taller than men.

Statisticians are strange people in many ways. They tend to dissect data to death before telling us what the data mean, and one thing they never, ever do is jump to conclusions. In fact, there is an old joke about a journalist, a scientist, and a statistician on a train going through Scotland. They look out of the carriage window and see a black sheep on the hillside. The journalist exclaims: ‘Wow, guys, look at that! Sheep are black in Scotland!’ The scientist interjects: ‘No, that’s not strictly true. There are some sheep in Scotland that are black.’ The statistician thinks for a long while, and finally says: ‘I think we can safely conclude that, in Scotland, there appears to be at least one sheep, at least one side of which appears to be black.’

There is something absolutely fundamental to intelligent statistics, and it is this: the size of the sample. When we have data on a sample taken from a population, we need to know the size of the sample before jumping to any conclusions about that particular population. If we want to compare a feature between two populations, say for example the prevalence of blue eyes, then we might take a random sample of 100 people from population A and another 100 people from population B, and we count the number of people with blue eyes. Say we have 20 people with blue eyes in sample A and 21 in sample B. Does that mean that there are more blue-eyed people in population B? Probably not, as it seems highly likely that such a small difference may be simply due to chance. This means that a second important feature, once we have ascertained that the sample itself is of a decent size, is the size of any measured difference. Essentially, what we observe in the size of the samples and the size of the difference between them will determine, with appropriate statistical tests, how sure we can be that there is a real difference between the populations from which the samples were taken. There are, of course, other statistical measures, but the size of the sample and the size of the difference are sine qua non. No conclusion can be drawn without knowing them. An isolated observation obtained from looking at one lone sheep on the hillside tells you next to nothing about the nation’s ovine population or its colouring characteristics.

There are pretty basic statistical tools that allow you to determine how likely it is that a difference as large as the one found would turn up purely by chance, and these tools usually express their results as a ‘p-value’. The p-value is the probability of seeing a difference at least as big as the one observed if chance alone were at work: that is, if there were no real difference between the populations. If the p-value is close to 1, the observed difference is exactly the sort of thing that chance throws up all the time. If it is 0.1, chance alone would produce a difference this big 10 per cent of the time. If it is 0.01, chance alone would do so only 1 per cent of the time. In medicine, as in many other scientific fields, the cut-off point has been arbitrarily set at 0.05: that means, if the likelihood of chance alone producing such a difference is less than 5 per cent, we conclude that the difference is real, or ‘significant’. Anything more than 5 per cent, and the difference is non-significant: in other words, it doesn’t cut the mustard. In a way, this is similar to the concept of the burden of proof in the practice of law. In a criminal court, the jury can only find the defendant guilty if it considers the case proven ‘beyond reasonable doubt’. In a civil court, the decision can be made on ‘the balance of probabilities’. In medical statistics, we make it easy: we actually measure the ‘doubt’. More than 5 per cent: not guilty. Less than 5 per cent: guilty as charged.*

[* I often have wondered what this represents in a court of law: beyond reasonable doubt or on the balance of probabilities? I think that it is somewhere between the two, but probably closer to the former. If any lawyer is reading this, please feel free to set me straight.]
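To see what such a test looks like in practice, here is a minimal sketch applied to the blue-eyed samples described earlier (20 out of 100 against 21 out of 100). It assumes a pooled two-proportion z-test, which is only one of several tests a statistician might reasonably choose, and the function name is made up for the illustration.

```python
import math


def two_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for the difference between two sample proportions.

    Assumption: a pooled two-proportion z-test, chosen for simplicity.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Under 'no real difference', both samples share one underlying proportion.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# The blue-eyed example: 20 of 100 in sample A, 21 of 100 in sample B.
p_blue_eyes = two_proportion_p_value(20, 100, 21, 100)
print(f"p-value: {p_blue_eyes:.2f}")
```

The result comes out at roughly 0.86, nowhere near the 0.05 cut-off, so a one-person difference between two samples of 100 gives no grounds for saying that population B has more blue-eyed people.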

The story of the snake-oil salesman is not as far-fetched as one might think. We humans often make decisions that affect our entire life based on what can only be called anecdotes. Something happens once, and we make it guide our future decisions forever, especially if the event is vivid, recent, memorable, or all three. A swarthy curly-haired man in a BMW cuts me off rather dangerously at a road junction. I get scared and angry. I subconsciously conclude that swarthy curly-haired men in BMWs are all dangerous drivers. This is called the ‘availability heuristic’, or the example rule. Our minds are tuned to look for an example in memory and follow its findings and conclusions, rather than to ask the question: ‘If this situation occurs 1,000 times, how likely is it that the outcome will be such and such?’*

[* That the example rule often deceives and misguides is explored beautifully in Dan Gardner’s Risk. It is also ruthlessly exploited by peddlers of snake oil and alternative-medicine remedies.]

Let us now go back to surgeons. If you hear that surgeon A has a mortality of 20 per cent and surgeon B has a mortality of 40 per cent, the first question you must ask is: how many patients are we talking about here? If it is only a handful of patients, the difference could easily be due to chance. If it is a thousand patients each, we almost certainly have a real difference (and, by the way, two surgeons to avoid absolutely, if the operation we are talking about is CABG).
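The same 20 per cent against 40 per cent can be checked at both sample sizes. The sketch below assumes Fisher's exact test, a common choice when the numbers are small, and takes the 'handful' to be five patients each (one death against two); the function name and the patient counts are illustrative, not real data.

```python
from math import comb


def fisher_exact_p_value(deaths_a, n_a, deaths_b, n_b):
    """Two-sided Fisher's exact test on deaths and survivors for two surgeons."""
    total_deaths = deaths_a + deaths_b
    lowest = max(0, total_deaths - n_b)
    highest = min(n_a, total_deaths)

    def ways(k):
        # Number of ways the deaths could fall so that surgeon A has k of them.
        return comb(n_a, k) * comb(n_b, total_deaths - k)

    observed = ways(deaths_a)
    # Add up every split that is no more likely than the one actually observed.
    extreme = sum(ways(k) for k in range(lowest, highest + 1) if ways(k) <= observed)
    return extreme / comb(n_a + n_b, total_deaths)


# Hypothetical figures: 20 per cent versus 40 per cent at two sample sizes.
p_handful = fisher_exact_p_value(1, 5, 2, 5)
p_thousand = fisher_exact_p_value(200, 1000, 400, 1000)
print(p_handful, p_thousand)
```

With five patients each, the p-value is 1: the data are entirely compatible with chance. With a thousand patients each, it is vanishingly small, and chance is no longer a credible explanation.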

Even after we have confirmed that the difference is statistically significant, we still need some more information before coming to any conclusion about how good or bad A and B are. Imagine that we have looked at the data for the two surgeons with different CABG mortality, and that we have ensured that the proper statistical tests have been applied. We have confirmed that their figures were derived from over a thousand patients each (large enough sample) and that the difference between the mortalities is also large enough to matter. The statistics tell us that the p-value on this difference is less than 0.05. In plain language, statistics say that the likelihood of finding a difference of that magnitude in that number of patients due to chance alone is less than 5 per cent. The conclusion that can be drawn about our two surgeons is inescapable: they do indeed have a difference in CABG mortality, and, this time, the difference is real. Can we conclude that the surgeon with the lower mortality is a better surgeon?

The answer is still ‘No’. The main reason for this is that our two surgeons could well be operating on very different patients. One surgeon may only take on young and relatively healthy patients with absolutely nothing wrong with them other than the heart problem. The other surgeon may be operating on very old patients with a weak heart muscle, diabetes, kidney failure, previous strokes, and damaged lungs. Other things being equal, you would certainly expect the latter group to have a higher mortality than the former, no matter who was doing the operation. The problem is that we did not, until fairly recently, have any objective way of assessing how old and sick the patients were. In other words, there was no easy way of telling the predicted risk of an operation in advance by looking at the features of the patient.

Naturally, there were always subjective ways of predicting the risk. Experienced doctors and surgeons have an eye for these things, so we could ask the surgeons themselves, but, in my experience, that is a terrible method of obtaining risk information. Ask any surgeon the question, ‘Who in your hospital does the most difficult, challenging, and high-risk operations?’ and the answer is very likely to be, ‘Why, that would be me, of course!’

There are good (and not so good) reasons for surgeons to harbour these self-delusions. To some extent, surgeons in general and heart surgeons in particular need a lot of self-confidence to function. A heart surgeon, day after day, walks into an operating theatre, nonchalantly cracks open the chest, puts the patient on an artificial heart–lung machine, stops the heart, opens it, fixes it, starts it again, disconnects the patient from the heart–lung machine, and expects that the heart will handle supporting life again after this abuse. Self-doubt has never been a prominent feature in the heart surgeon’s psychological make-up. In fact, it has been said that, if you ask any heart surgeon to name the three greatest heart surgeons who ever existed, he would struggle to find the names of the other two. Surgeons’ opinions of their own prowess and of the risk profile of their patients are therefore not reliable, and certainly not objective. Can an objective method be found?

It can, and it was.

The year was 1989, and I had just been employed as a senior trainee in heart surgery in Sheffield, the city in the north of England once famed for steel manufacture, the decline of which was vividly illustrated in the hit film The Full Monty. In the Northern General Hospital, among the disused steel mills of Sheffield, I worked for a senior surgeon called Professor Geoffrey Smith, a normally placid, affable man, and, by all accounts, an easy boss to work for. Early one Monday morning, I had finished the rounds of the cardiac intensive care unit, and was in the staff common room enjoying a coffee and indulging my longstanding addiction to The Guardian’s cryptic crossword before the day’s operating began. Geoffrey Smith burst into the room in an uncharacteristic state of obvious excitement, clutching some photocopied pieces of paper, which he handed to me, saying, ‘I believe this is one of the most important papers I have ever seen! Will you read it and tell me what you think?’

I tried, probably unsuccessfully, to conceal my deeply felt preference for the crossword and a bit of peace and quiet before going to the operating theatre, and picked up the research article. It was something he had dug out of a recently published supplement to the medical journal Circulation. The article was by Victor Parsonnet, a New Jersey cardiac surgeon, and his two collaborators, Bernstein and Dean, from the Newark Beth Israel Medical Center (Parsonnet 1989). It had the unwieldy title ‘A Method of Uniform Stratification of Risk for Evaluating the Results of Surgery in Acquired Adult Heart Disease’. What Parsonnet did was so simple that it is remarkable that it had not been done earlier in medicine. He gathered data on a few thousand patients having heart operations at his hospital. He found out what features in these patients, their heart condition, and their heart operations were linked to mortality from the operation. He assigned each factor a ‘weight’ depending on how likely it was that this factor would be associated with an outcome of death, and built a scoring system. The scoring system was ‘additive’: you took the risk factors that were present, and added their weights to come up with a number. This number, expressed as a percentage, tells you the likelihood of death after such an operation in such a patient. We call this the predicted mortality. For example, if the patient was female (1 point) and diabetic (3 points), the risk of CABG, according to the Parsonnet model, was 1 + 3 = 4 per cent.

Geoffrey Smith was right. This was a very important paper. The obvious immediate benefit of such a system was that it could help both the patient and the doctor come to a sensible decision about whether or not to go ahead with an operation. Less obviously, but perhaps more importantly, by providing an objective way of predicting the outcome, Parsonnet gave us a benchmark against which to compare the actual outcome. The paper had the potential of turning the world of heart surgery (and of medical practice in general) on its head. Doctors and nurses would, of course, continue to do their best for their patients, but now, for the first time ever, at least in this one specialty of heart surgery, we had a way of knowing how good that ‘best’ really was.

The risk factors that contribute to the Parsonnet score, and their ‘weights’, are listed below:

| Risk Factor | Parsonnet Score |
| --- | --- |
| female | 1 |
| obesity | 3 |
| diabetes | 3 |
| hypertension (> 140 mmHg) | 3 |
| ejection fraction (a measure of how well the left ventricle, the main heart pump, works) | |
| • > 50 per cent (good) | 0 |
| • 30–49 per cent (moderate) | 2 |
| • < 30 per cent (poor) | 4 |
| age | |
| • 70–74 years | 7 |
| • 75–79 years | 12 |
| • > 79 years | 20 |
| valve operation | 5 |
| valve + CABG | 7 |
| pulmonary hypertension (high blood pressure in the lungs) | 3 |
| intra-aortic balloon pump (having a device for supporting a weak heart) | 2 |
| previous heart operation | |
| • one | 5 |
| • two | 10 |
| on dialysis for kidney failure | 10 |
| catheter lab emergency | 10 |
| catastrophic states | 10–50, to be decided by the surgical resident |
| a few other rare factors | to be decided by the surgical resident |
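The additive arithmetic is simple enough to sketch in a few lines. The weights below are the ones listed above; the dictionary keys and the function name are invented for the illustration, and the two subjective entries (catastrophic states and rare factors) are left out because they have no fixed weight.

```python
# Weights as listed in the text; the key names are hypothetical labels.
PARSONNET_WEIGHTS = {
    "female": 1,
    "obesity": 3,
    "diabetes": 3,
    "hypertension": 3,
    "ejection_fraction_30_to_49": 2,
    "ejection_fraction_below_30": 4,
    "age_70_to_74": 7,
    "age_75_to_79": 12,
    "age_over_79": 20,
    "valve_operation": 5,
    "valve_plus_cabg": 7,
    "pulmonary_hypertension": 3,
    "intra_aortic_balloon_pump": 2,
    "one_previous_heart_operation": 5,
    "two_previous_heart_operations": 10,
    "dialysis": 10,
    "catheter_lab_emergency": 10,
}


def parsonnet_score(risk_factors):
    """Add up the weights of the risk factors that are present.

    The total, read as a percentage, is the predicted mortality.
    """
    return sum(PARSONNET_WEIGHTS[factor] for factor in risk_factors)


# The example from the text: a female, diabetic patient having CABG.
example_score = parsonnet_score(["female", "diabetes"])
print(f"Predicted mortality: {example_score} per cent")
```

A patient with none of the listed factors scores 0, and an unrecognised label raises a KeyError rather than being silently ignored, which seems the safer behaviour for a sketch like this one.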

Parsonnet and colleagues went on to give an interpretation, with a somewhat arbitrary grouping of patients into risk categories from ‘good’ risk to ‘extremely high’ risk, as in the table below:

| Parsonnet Score | Risk |
| --- | --- |
| 0–4 | good |
| 5–9 | fair |
| 10–14 | poor |
| 15–19 | high |
| 20+ | extremely high |
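Turning a score into one of these categories is a simple band lookup. The sketch below is one way to write it, with the band boundaries taken from the table and everything else (the names, the use of the standard library's bisect module) chosen for the illustration.

```python
from bisect import bisect_right

# Lower bound of each band in the table above, with the matching label.
CATEGORY_LOWER_BOUNDS = [0, 5, 10, 15, 20]
CATEGORY_NAMES = ["good", "fair", "poor", "high", "extremely high"]


def risk_category(score):
    """Return the risk category for a Parsonnet score."""
    if score < 0:
        raise ValueError("a Parsonnet score cannot be negative")
    # Count how many lower bounds the score has reached, then step back one.
    return CATEGORY_NAMES[bisect_right(CATEGORY_LOWER_BOUNDS, score) - 1]


print(risk_category(4))   # the female, diabetic CABG patient
print(risk_category(22))
```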

Knowing what we now know about risk modelling, there are a few features of the Parsonnet model that can be readily criticised. First, it was built from data obtained from a single hospital, so its widespread application may not be reliable. In fact, the model may well contain a biased reflection of the peculiarities of practice in that one institution. Second, as folk tried the model in their own units, many found that their own mortality rates were a lot smaller than Parsonnet predicted, so that the ‘calibration’ of the model may have been a little off-kilter: it overpredicted mortality. Perhaps, at the time, the results of heart surgery were not as good in Parsonnet’s own unit as they should have been, or perhaps the model deliberately overstated the risks in order not to upset the medical establishment. Last, there were features in the model that left too much to subjective interpretation: for example, the definition of ‘catastrophic state’ or ‘rare risk factors’ could be awarded whatever risk value the surgical resident chose. This was neither rigorous nor objective, and it opened the model to the possibility of gaming.

Be that as it may, it would be churlish to criticise what was, in effect, a major groundbreaker. Parsonnet had created a model that made sense to his surgical colleagues, was intuitive, could reasonably predict outcomes, and could be applied by all. It was easy to use, and it was the first of its kind. The vital importance of this first step cannot be overstated: it opened the door to quality control in cardiac surgery, and that, in turn, ushered in quality control in medicine as a whole.

Surgeons are not, as a general rule, well known for their voracious appetite for reading. An old medical joke goes like this:

Q: How do you hide money from an orthopaedic surgeon?

A: Put it in a medical textbook.

Q: How do you hide money from a general surgeon?

A: Put it in the patient’s case notes.

Q: How do you hide money from a plastic surgeon?

A: You can’t hide money from a plastic surgeon!

When surgeons actually do read scientific papers, it is often to find out how other surgeons do operations, to learn about newfangled operations, and to look at gory pictures.

The initial response to the publication of Parsonnet’s paper lay somewhere between ‘muted’ and ‘non-existent’. When it was first published, relatively few people read it, and even fewer, like Geoffrey Smith, appreciated its tremendous importance. To begin with, it had appeared in a specialist medical journal rather than in a dedicated surgical one. It also was fairly well hidden in a supplement, rather than appearing in the main journal itself. Finally, your average heart surgeon (who rarely reads Circulation anyway, and almost never reads the supplement) is unlikely to be fired with enthusiasm by a paper describing risk modelling (and with no gory pictures).

Slowly, however, the value of having a tool for predicting outcomes began to achieve the recognition it deserved. Some American hospitals tried the Parsonnet model in their own heart surgery practice, and found that it worked pretty well, and they published this finding in journals. Having shortly afterwards left Sheffield to work as a senior registrar in Manchester, I thought I would do the same in a British hospital. I collected the data on the Parsonnet risk factors in our patients, and used the information to see how well Parsonnet worked in a British hospital. At the time, the senior surgeons for whom I worked looked at my obsession with the subject with some degree of detached bemusement. I went on to publish the findings in the British Medical Journal (Nashef 1992) as the first application of the Parsonnet model in a British hospital, my own Wythenshawe Hospital in Manchester. Essentially, the study proved that a risk model from across the Atlantic worked pretty well in a British hospital, and that our results were actually better than the model predicted, so everybody was happy. Soon, other reports on clinical research in heart surgery would routinely begin by describing groups of patients in terms of average age, sex distribution, and Parsonnet score. The importance of measuring the risk profile of patients when reporting clinical results was at last being appreciated.

Much more importantly, Parsonnet gave us a way of predicting mortality, and, by doing so, he provided a benchmark: at last, we had something against which actual mortality could be assessed. In other words, surgeons who operated on widely varying patient groups in terms of risk could pitch their results against this benchmark.
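As a rough sketch of what pitching results against the benchmark might look like, the snippet below compares each surgeon's actual mortality with the mortality that the scores of their own patients predict. The two surgeons and all their figures are invented for the illustration, and the observed-to-expected ratio is a common way of summarising the comparison rather than anything prescribed by Parsonnet's paper.

```python
def benchmark(predicted_risks_percent, deaths):
    """Compare actual mortality with the mortality predicted by the scores.

    predicted_risks_percent: one predicted mortality (in per cent) per patient.
    deaths: number of those patients who actually died.
    Returns (observed %, predicted %, observed-to-expected ratio).
    """
    patients = len(predicted_risks_percent)
    expected_deaths = sum(predicted_risks_percent) / 100
    observed_mortality = 100 * deaths / patients
    predicted_mortality = 100 * expected_deaths / patients
    return observed_mortality, predicted_mortality, deaths / expected_deaths


# Hypothetical case mixes: X takes on sicker patients than Y.
surgeon_x = benchmark([10] * 200, deaths=16)  # every patient scores 10
surgeon_y = benchmark([4] * 200, deaths=10)   # every patient scores 4

for name, (observed, predicted, ratio) in (("X", surgeon_x), ("Y", surgeon_y)):
    print(f"Surgeon {name}: observed {observed:.1f}%, "
          f"predicted {predicted:.1f}%, ratio {ratio:.2f}")
```

On these made-up numbers, surgeon X has the higher raw mortality (8 per cent against 5 per cent) yet does better than the benchmark, while surgeon Y does worse. A ratio below 1 means fewer deaths than predicted, though any such comparison is only as trustworthy as the calibration of the model behind it.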