Chapter 13. Using Hypothesis Tests: Look At The Evidence

Not everything you’re told is absolutely certain.

The trouble is, how do you know when what you’re being told isn’t right? Hypothesis tests give you a way of using samples to test whether or not statistical claims are likely to be true. They give you a way of weighing the evidence and testing whether extreme results can be explained by mere coincidence, or whether there are darker forces at work. Come with us on a ride through this chapter, and we’ll show you how you can use hypothesis tests to confirm or allay your deepest suspicions.

Statsville’s leading drug company has produced a new remedy for curing snoring. Frustrated snorers are flocking to their doctors in hopes of finding nightly relief.

The drug company claims that their miracle drug cures 90% of people within two weeks, which is great news for the people with snoring difficulties. The trouble is, not everyone’s convinced.

The doctor at the Statsville Surgery has been prescribing SnoreCull to her patients, but she’s disappointed by the results. She decides to conduct her own trial of the drug.

She takes a random sample of 15 snorers and puts them on a course of SnoreCull for two weeks. After two weeks, she calls them back in to see whether their snoring has stopped.

Here are the results:

Cured?       Yes    No
Frequency    11     4

Here’s the probability distribution for how many people the drug company says should have been cured by the snoring remedy: X ~ B(15, 0.9), where X is the number of people cured in the sample of 15.

The number of people cured by SnoreCull in the doctor’s sample is lower than you’d expect it to be. Given the claims made by the drug company, you’d expect 13 or 14 of the 15 people to be cured (15 × 0.9 = 13.5), but instead, only 11 people have been.
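If you’d like to see the distribution as numbers, here’s a minimal Python sketch. It assumes, as the drug company’s claim implies, that each of the 15 patients is cured independently with probability 0.9. The function name binomial_pmf is our own label, not something from the chapter.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Return P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 15, 0.9            # sample size and the drug company's claimed cure rate
expected_cured = n * p    # mean of B(n, p)

# Probability of each possible number of cures if the claim is true
distribution = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

print(f"Expected number cured: {expected_cured:.1f}")
for k in range(9, n + 1):
    print(f"P(X = {k:2d}) = {distribution[k]:.4f}")
```

Most of the probability sits at 13, 14, and 15 cures, which is why a result of 11 looks suspicious.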

So why the discrepancy?

The drug company might not be deliberately telling lies, but their claims might be misleading.

It’s possible that the drug company’s tests were flawed. They may have inadvertently conducted flawed or biased tests on SnoreCull, which resulted in them making inaccurate predictions about the population.

If the success rate of SnoreCull is actually lower than 90%, this would explain why only 11 people in the sample were cured.

The drug company’s claims might actually be accurate.

Rather than the drug company being at fault, the patients in the doctor’s sample may not have been representative of the snoring population as a whole. It’s possible that the snoring remedy does cure 90% of snorers, but the doctor just happens to have a higher proportion of people in her sample whom it doesn’t cure. In other words, her sample might be biased in some way, or it could just come down to there being a small number of patients in the sample.

So how do we resolve the conflict between the doctor and the drug company? Let’s take a very high level view of what we need to do.

We can resolve the conflict between the drug company and the doctor by putting the claims of the drug company on trial. In other words, we’ll accept the word of the drug company by default, but if there’s strong evidence against it, we’ll side with the doctor instead.

Here’s what we’ll do:

1. Take the claim made by the drug company.
2. Examine the evidence from the doctor’s sample.
3. Decide whether the evidence is strong enough to reject the claim.

In general, this process is called hypothesis testing, as you take a hypothesis or claim and then test it against the evidence. Let’s look at the general process for this.

Here are the broad steps that are involved in hypothesis testing. We’ll go through each one in detail in the following pages.

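1. Decide on the hypothesis you’re going to test.
2. Choose your test statistic.
3. Determine the critical region for your decision.
4. Find the p-value of the test statistic.
5. See whether the sample result is within the critical region.
6. Make your decision.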

We need to make sure we properly test the drug claim before we reject it.

That way we’ll know we’re making an impartial decision either way, and we’ll be giving the claim a fair trial. What we don’t want to do is reject the claim if there’s insufficient evidence against it, and this means that we need some way of deciding what constitutes sufficient evidence.

Let’s start with step one of the hypothesis test, and look at the key claim we want to test. This claim is called a hypothesis.

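The null hypothesis is the claim you’re going to test. For SnoreCull, it’s the drug company’s claim that the drug cures 90% of people, so p = 0.9.

H0: p = 0.9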

We’ve looked at what the claim is we’re going to test, the null hypothesis, but what if it’s not true? What’s the alternative?

The alternate hypothesis for SnoreCull is the claim you’ll accept if the drug company’s claim turns out to be false. If there’s sufficiently strong evidence against the drug company, then it’s likely that the doctor is right.

The doctor believes that SnoreCull cures less than 90% of people, so this means that the alternate hypothesis is that p < 90%.

H1: p < 0.9

Now that we have the null and alternate hypotheses for the SnoreCull hypothesis test, we can move onto step 2.

Now that you’ve determined exactly what it is you’re going to test, you need some means of testing it. You can do this with a test statistic.

The test statistic is the statistic that you use to test your hypothesis. It’s the statistic that’s most relevant to the test.

The critical region of a hypothesis test is the set of values that present the most extreme evidence against the null hypothesis.

Let’s see how this works by taking another look at the doctor’s sample. If 90% or more people had been cured, this would have been in line with the claims made by the drug company. As the number of people cured decreases, the more unlikely it becomes that the claims of the drug company are true.

If the drug company’s claim is true, the number of people cured in the sample follows the probability distribution X ~ B(15, 0.9).

Before we can find the critical region of the hypothesis test, we first need to decide on the significance level. The significance level of a test is a measure of how unlikely you want the results of the sample to be before you reject the null hypothesis H0. Just like the confidence level for a confidence interval, the significance level is given as a percentage.

As an example, suppose we want to test the claims of the drug company at a 5% level of significance. This means that we choose the critical region so that the probability of fewer than c snorers being cured is less than 0.05. It’s the lowest 5% of the probability distribution.

The significance level is normally represented by the Greek letter α. The lower α is, the more unlikely the results in your sample need to be before we reject H0.

Let’s use a significance level of 5% in our hypothesis test. This means that if the number of snorers cured in the sample is in the lowest 5% of the probability distribution, then we will reject the claims of the drug company. If the number of snorers cured lies in the top 95% of the probability distribution, then we’ll decide there isn’t enough evidence to reject the null hypothesis, and accept the claims of the drug company.

If we use X to represent the number of snorers cured, then we define the critical region as being values such that

P(X < c) < α

where

α = 5%
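To make the definition concrete, here’s a short Python sketch that searches for the largest whole number c satisfying P(X < c) < α when X ~ B(15, 0.9). The helper binomial_cdf and the simple loop are our own illustration, assuming the binomial model above; they aren’t a method given in the chapter.

```python
from math import comb

def binomial_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n, p, alpha = 15, 0.9, 0.05

# Start from c = 0, where P(X < 0) = 0, and push c upward while the next
# value still satisfies P(X < c + 1) = P(X <= c) < alpha.
c = 0
while c <= n and binomial_cdf(c, n, p) < alpha:
    c += 1

print(f"Critical region: X < {c}")
print(f"P(X < {c}) = {binomial_cdf(c - 1, n, p):.4f}")
```

Under these assumptions the search stops at c = 11, so only results of 10 cures or fewer fall inside the critical region at the 5% level.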

Now that we’ve looked at critical regions, we can move on to step 4, finding the p-value.

A p-value is the probability of getting a value up to and including the one in your sample in the direction of your critical region. It’s a way of taking your sample and working out whether the result falls within the critical region for your hypothesis test. In other words, we use the p-value to say whether or not we can reject the null hypothesis.

To find the p-value of our hypothesis test, we had to find P(X ≤ 11). This means that the p-value is 0.0555.
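Here’s a minimal Python sketch of that calculation, assuming X ~ B(15, 0.9). The helper name binomial_cdf is ours. It sums the binomial probabilities for 0 through 11 cures.

```python
from math import comb

def binomial_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# p-value for a lower-tail test: probability of 11 or fewer cures out of 15
p_value = binomial_cdf(11, 15, 0.9)
print(f"P(X <= 11) = {p_value:.5f}")
```

The result agrees with the 0.0555 quoted above to three decimal places; the fourth digit depends on whether you round or truncate.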

A p-value is the probability of getting the results in the sample, or something more extreme, in the direction of the critical region.

In our hypothesis test for SnoreCull, the critical region is the lower tail of the probability distribution. In order to see whether 11 people being cured of snoring is in the critical region, we calculated P(X ≤ 11), as this is the probability of getting a result at least as extreme as the results of our sample in the direction of the lower tail.

Had our critical region been the upper tail of the probability distribution instead, we would have needed to find P(X ≥ 11). We would have counted more extreme results as being greater than 11, as these would have been closer to the critical region.

Now that we’ve found the p-value, we can use it to see whether the result from our sample falls within the critical region. If it does, then we’ll have sufficient evidence to reject the claims of the drug company.

Our critical region is the lower tail of the probability distribution, and we’re using a significance level of 5%. This means that we can reject the null hypothesis if our p-value is less than 0.05. As our p-value is 0.0555, this means that the number of people cured by SnoreCull in the sample doesn’t fall within the critical region.
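The comparison itself is tiny, but writing it down makes the rule explicit. This is just a sketch; the function name decide and its return strings are our own.

```python
def decide(p_value: float, alpha: float) -> str:
    """Apply the decision rule: reject H0 only if the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"

# SnoreCull, sample of 15: p-value 0.0555 tested at the 5% level
print(decide(0.0555, 0.05))
```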

We’ve now reached the final step of the hypothesis test. We can decide whether to accept the null hypothesis, or reject it in favor of the alternative.

The p-value of the hypothesis test falls just outside the critical region of the test. This means that there isn’t sufficient evidence to reject the null hypothesis. In other words, we accept the claims of the drug company.

Let’s summarize what we just did.

First of all, we took the claims of the drug company, which the doctor had misgivings about. We used these claims as the basis of a hypothesis test. We formed a null hypothesis that the probability of curing a patient is 0.9, and then we applied this to the number of people in the doctor’s sample.

We then decided to conduct a test at the 5% level, using the success rate in the doctor’s sample. We looked at the probability of 11 people or fewer being cured, and checked to see whether the probability of this was less than 5%, or 0.05. In other words, we looked at the probability of getting a result this extreme, or even more so.

Finally, we found that at the 5% level, there wasn’t strong enough evidence to reject the claims of the drug company.

Once you’ve fixed the significance level of the test, you can’t change it.

The test needs to be completely impartial. This means that you decide what level you need the test to be at, based on what level of evidence you require, before you look at what evidence you actually have.

If you were to look at the amount of evidence you have before deciding on the level of the test, this could influence any decisions you made. You might be tempted to decide on a specific level of test just to get the result you want. This would make the outcome of the test biased, and you might make the wrong decision.

So far the doctor has conducted her trial using a sample of just 15 people, and on the basis of this, there was insufficient evidence to reject the claims of the drug company.

It’s possible that the size of the sample wasn’t large enough to get an accurate result. The doctor might get more reliable results by using a larger sample.

Here are the results from the doctor’s new trial:

Cured?       Yes    No
Frequency    80     20

We want to determine whether the new data will make a difference in the outcome of the test.

Let’s run through another hypothesis test, this time with the larger sample.

The doctor still has misgivings about the claims made by the drug company. Let’s conduct a hypothesis test based on the new data.

We need to start off by finding the null hypothesis and alternate hypothesis of the SnoreCull trial. As a reminder, the null hypothesis is the claim that we’re testing, and the alternate hypothesis is what we’ll accept if there’s sufficient evidence against the null hypothesis.

So what are the null and alternate hypotheses?

The claim being tested hasn’t changed, so the hypotheses are the same as before.

H0: p = 0.9

H1: p < 0.9

As before, the next step is to choose the test statistic. In other words, we need some statistic that we can use to test the hypothesis.

For the previous hypothesis test, we conducted the test by looking at the number of successes in the sample and seeing how significant the result was. We used the binomial distribution to find the probability of getting a result at least as extreme as the value we got in the sample. In other words, we used a test statistic of X ~ B(15, 0.9) to test whether P(X ≤ 11) was less than 0.05, the level of significance.

This time the number of people in the sample is 100, and we’re testing the same claim, that the probability of successfully curing someone is 0.9. This means that our new test statistic is X ~ B(100, 0.9).

We can use another probability distribution instead of the binomial.

Using the binomial distribution for this sort of problem would be time consuming, as we’d have to calculate lots of probabilities.
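That’s true by hand, at least. If you have a computer available, the exact binomial sum is only a few lines, which gives you something to compare an approximation against. Here’s a hedged Python sketch, assuming X ~ B(100, 0.9); the helper binomial_cdf is our own.

```python
from math import comb

def binomial_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ B(n, p) by summing all k + 1 terms."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Exact probability of 80 or fewer cures out of 100 if the claim p = 0.9 is true
exact_p_value = binomial_cdf(80, 100, 0.9)
print(f"P(X <= 80) = {exact_p_value:.5f}")
```

That’s 81 separate terms, each involving large combinations, which is why working with tables and a simpler distribution is so much easier when you’re doing the calculation yourself.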

Fortunately, there’s another way. Rather than use the binomial distribution, we can use some other distribution instead.

We still need to find a test statistic we can use in our hypothesis test, one that doesn’t involve calculating lots of binomial probabilities.

There are 100 people in the sample, and the proportion of successes according to the drug company is 0.9. In other words, the number of successes follows a binomial distribution, where n = 100 and p = 0.9.

As n is large, and both np and nq are greater than 5, we can use X ~ N(np, npq) as our test statistic, where X is the number of patients successfully cured. In other words, we can use

X ~ N(90, 9)

to approximate any probabilities that we may need.
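Here’s a small Python sketch of the check and the resulting parameters. It assumes the rule of thumb stated above (np and nq both greater than 5); the variable names are ours.

```python
n, p = 100, 0.9
q = 1 - p

# Rule of thumb from the text: both np and nq should be greater than 5
normal_approximation_ok = n * p > 5 and n * q > 5

mean = n * p          # np
variance = n * p * q  # npq

print(f"np = {n * p:.1f}, nq = {n * q:.1f}, approximation OK: {normal_approximation_ok}")
print(f"X ~ N({mean:.0f}, {variance:.0f})")
```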

If we standardize this, we get

Z = (X – 90) / √9

where Z ~ N(0, 1).

This means that for our test statistic we can use

Z = (X – 90) / 3

You use the test statistic to work out probabilities you can use as evidence.

This means that we use Z as our test statistic, as we can easily use it to look up probabilities and see how unlikely the results of our sample are given the claims of the drug company. We substitute our value of 80 in place of X, so we can use it to find the probability of 80 or fewer being cured.
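As a quick sketch of that substitution in Python, assuming the normal approximation X ~ N(90, 9) from above (the function name standard_score is our own):

```python
from math import sqrt

n, p0 = 100, 0.9
mean = n * p0                  # 90
sd = sqrt(n * p0 * (1 - p0))   # square root of 9, so 3

def standard_score(x: float) -> float:
    """Convert a number of cures X into the test statistic Z."""
    return (x - mean) / sd

z = standard_score(80)
print(f"z = {z:.2f}")
```

A value of about –3.33 means the doctor’s result sits more than three standard deviations below what the drug company’s claim predicts.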

Now that we have a test statistic for our test, we need to come up with a critical region. As our alternate hypothesis is p < 0.9, this means that our critical region lies in the lower tail just as before.

The critical region also depends on the significance level of the test. Let’s choose the same significance level as before, so let’s test at the 5% level.

As our test statistic follows a standard normal distribution, we can use probability tables to find the critical value, c. The critical value is the boundary between whether we have strong enough evidence to reject the null hypothesis or not.

As our significance level is 5%, this means that our critical value c is the value where P(Z < c) = 0.05. If we look up the probability 0.05 in the probability tables, this gives us a value for c of –1.64. In other words,

P(Z < –1.64) = 0.05

This means that if our test statistic is less than –1.64, we have strong enough evidence to reject the null hypothesis.
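If you don’t have probability tables to hand, Python’s standard library can do the lookup. This sketch uses statistics.NormalDist to find the critical value and then checks the doctor’s result of 80 cures against it. The variable names are ours, and the p-value shown relies on the normal approximation without a continuity correction.

```python
from statistics import NormalDist

alpha = 0.05
standard_normal = NormalDist()                   # Z ~ N(0, 1)

# Critical value c such that P(Z < c) = alpha (tables round this to -1.64)
critical_value = standard_normal.inv_cdf(alpha)

z = (80 - 90) / 3                                # test statistic for 80 cures
p_value = standard_normal.cdf(z)                 # P(Z < z), lower tail
reject_h0 = z < critical_value

print(f"critical value = {critical_value:.3f}")
print(f"z = {z:.2f}, p-value = {p_value:.5f}, reject H0: {reject_h0}")
```

The test statistic is well below the critical value, so the result falls inside the critical region.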

This time when we performed a hypothesis test on SnoreCull, there was sufficient evidence to reject the null hypothesis. In other words, we can reject the claims made by the drug company.

Hypothesis tests require evidence.

With a hypothesis test, you accept a claim and then put it on trial. You only reject it if there’s enough evidence against it. This means that the tests are impartial, as you only make a decision based on whether or not there’s sufficient evidence.

If we had just accepted the doctor’s opinion in the first place, we wouldn’t have properly considered the evidence. We would have made a decision without considering whether the results could have been explained away by mere coincidence. As it is, we have enough evidence to show that the results of the sample are extreme enough to justify rejecting the null hypothesis. The results are statistically significant, as they’re unlikely to have happened by chance.

So does this guarantee that the claims of the drug company are wrong?

So far we’ve looked at how we can use the results of a sample as evidence in a hypothesis test. If the evidence is sufficiently strong, then we can use it to justify rejecting the null hypothesis.

We’ve found that there is strong evidence that the claims of the drug company are wrong, but is this guaranteed?

Even though the evidence is strong, we can’t absolutely guarantee that the drug company claims are wrong.

Even though it’s unlikely, we could still have made the wrong decision. We can examine evidence with a hypothesis test, and we can specify how certain we want to be before rejecting the null hypothesis, but it doesn’t prove with absolute certainty that our decision is right.

The question is, how do we know?

Conducting a hypothesis test is a bit like putting a prisoner on trial in front of a jury. The jury assumes that the prisoner is innocent unless there is strong evidence against him, but even considering the evidence, it’s still possible for the jury to make wrong decisions. Have a go at the exercise on the next page, and you’ll see how.

The errors we can make when conducting a hypothesis test are the same sort of errors we could make when putting a prisoner on trial.

Hypothesis tests are basically tests where you take a claim and put it on trial by assessing the evidence against it. If there’s sufficient evidence against it, you reject it, but if there’s insufficient evidence against it, you accept it.

You may correctly accept or reject the null hypothesis, but even considering the evidence, it’s also possible to make an error. You may reject a valid null hypothesis, or you might accept it when it’s actually false.

Statisticians have special names for these types of errors. A Type I error is when you wrongly reject a true null hypothesis, and a Type II error is when you wrongly accept a false null hypothesis.

The power of a hypothesis test is the probability that you will correctly reject a false null hypothesis.

A Type I error is what you get when you reject the null hypothesis when the null hypothesis is actually correct. It’s like putting a prisoner on trial and finding him guilty when he’s actually innocent.

The probability of getting a Type I error is the same as the significance level of the test, α.

P(Type I error) = α

A Type II error is what you get when you accept the null hypothesis, and the null hypothesis is actually wrong. It’s like putting a prisoner on trial and finding him innocent when he’s actually guilty.

The probability of getting a Type II error is normally represented by the Greek letter β.

P(Type II error) = β

Let’s see if we can find the probability of getting Type I and Type II errors for the SnoreCull hypothesis test. As a reminder, our standardized test statistic is

Z = (X – 90) / 3

where X is the number of people cured in the sample. The significance level of the test is 5%.

To find the probability of a Type II error, the alternate hypothesis H1 needs to give a specific value for p. Here we take H1: p = 0.8. The next step is to find the values of X that lie outside the critical region of the hypothesis test.

We saw back in step 3 that the critical region for the test is given by Z < –1.64; in other words, P(Z < –1.64) = 0.05. This means that values that fall outside the critical region are given by Z ≥ –1.64.

If we de-standardize this, we get

(X – 90) / 3 ≥ –1.64

X – 90 ≥ –4.92

X ≥ 85.08

In other words, we would have accepted the null hypothesis if 85.08 people or more had been cured by SnoreCull.
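De-standardizing just reverses the standard score formula. Here’s a short Python sketch, assuming the mean of 90 and standard deviation of 3 from the normal approximation; destandardize is our own name for the helper.

```python
def destandardize(z: float, mean: float, sd: float) -> float:
    """Convert a standard score back to the original scale: x = mean + z * sd."""
    return mean + z * sd

# Boundary of the critical region on the scale of X, under H0: X ~ N(90, 9)
boundary = destandardize(-1.64, mean=90, sd=3)
print(f"Accept H0 if X >= {boundary:.2f}")
```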

The final thing we need to do is work out P(X ≥ 85.08), assuming that H1 is true. That way, we’ll be able to work out the probability of accepting the null hypothesis when actually H1 is true instead. As we’re using the normal distribution to approximate X, we need to use a probability distribution X ~ N(np, npq), where n = 100 and p = 0.8. This gives us

X ~ N(80, 16)

This means that if we can calculate P(X ≥ 85.08) where X ~ N(80, 16), we’ll have found the probability of getting a Type II error.

We calculate this in the same way we calculate other normal distribution probabilities, by finding the standard score and then looking up the value in standard normal probability tables.

Let’s start off by finding the standard score of 85.08.

z = (85.08 – 80) / √16 = 5.08 / 4 = 1.27

This means that in order to find P(X ≥ 85.08), we need to use standard probability tables to find P(Z ≥ 1.27).

P(Z ≥ 1.27) = 1 – P(Z < 1.27)
            = 1 – 0.8980
            = 0.102

In other words,

P(Type II error) = 0.102
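You can check this without tables using Python’s statistics.NormalDist. The sketch below assumes H1: p = 0.8, so X ~ N(80, 16), and uses the boundary of 85.08 found earlier; the variable names are ours.

```python
from statistics import NormalDist

# Distribution of X if H1 is true: mean 80, variance 16, so standard deviation 4
h1_distribution = NormalDist(mu=80, sigma=4)

boundary = 85.08   # we accept H0 when X >= 85.08

# P(Type II error) = P(X >= 85.08) when H1 is true
beta = 1 - h1_distribution.cdf(boundary)
print(f"P(Type II error) = {beta:.3f}")
```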

So far we’ve looked at the probability of getting different types of error in our hypothesis test. One thing that we haven’t looked at is power.

The power of a hypothesis test is the probability that we will reject H0 when H0 is false. In other words, it’s the probability that we will make the correct decision to reject H0.

Once you’ve found P(Type II error), calculating the power of a hypothesis test is easy.

Rejecting H0 when H0 is false is actually the opposite of making a Type II error. This means that

Power = 1 – β

where β is the probability of making a Type II error.
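Putting the pieces together, here’s a hedged Python sketch that goes from the hypotheses to the power in one function. It assumes a lower-tail test, the normal approximation to the binomial, and a specific alternative value for p. The function name power_lower_tail and its parameters are our own illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_lower_tail(n: int, p0: float, p1: float, z_crit: float) -> float:
    """Power of a lower-tail test of H0: p = p0 against H1: p = p1,
    using the normal approximation X ~ N(np, npq)."""
    # Boundary of the critical region on the scale of X, found under H0
    boundary = n * p0 + z_crit * sqrt(n * p0 * (1 - p0))
    # Distribution of X if H1 is true
    h1 = NormalDist(mu=n * p1, sigma=sqrt(n * p1 * (1 - p1)))
    beta = 1 - h1.cdf(boundary)   # P(accept H0 when H1 is true)
    return 1 - beta               # Power = 1 - beta

power = power_lower_tail(n=100, p0=0.9, p1=0.8, z_crit=-1.64)
print(f"Power = {power:.3f}")
```

For the SnoreCull test this gives a power of about 0.898, which matches 1 – 0.102.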

In this chapter, you’ve run through two hypothesis tests. The first, based on a sample of 15, didn’t give enough evidence to reject the claims made by the drug company, but the second, based on a sample of 100, did. You’ve been able to show that based on the doctor’s larger sample, there’s sufficient evidence that SnoreCull doesn’t cure 90% of snorers, as the drug company claims.

Keep reading, and we’ll show you what other sorts of hypothesis tests you can use. We’ll see you over at Fat Dan’s Casino...