STATISTICAL POWER AND UNDERPOWERED STATISTICS

You’ve seen how it’s possible to miss real effects by not collecting enough data. You might miss a viable medicine or fail to notice an important side effect. So how do you know how much data to collect?
The concept of statistical power provides the answer. The
power of a study is the probability that it will distinguish an
effect of a certain size from pure luck. A study might easily detect a huge benefit from a medication, but detecting a subtle difference is much less likely.
The Power Curve

Suppose I’m convinced that my archnemesis has an unfair coin. Rather than getting heads half the time and tails half the time, it’s biased to give one outcome 60% of the time, allowing him to cheat at incredibly boring coin-flipping betting games.
I suspect he’s cheating—but how to prove it?
I can’t just take the coin, flip it 100 times, and count the heads. Even a perfectly fair coin won’t always get 50 heads, as the solid line in Figure 2-1 shows.
Figure 2-1: The probability of getting different numbers of heads if you flip a fair coin (solid line) or biased coin (dashed line) 100 times. The biased coin gives heads 60% of the time.
Even though 50 heads is the most likely outcome, it still happens less than 10% of the time. I’m also reasonably likely to get 51 or 52 heads. In fact, when flipping a fair coin 100 times, I’ll get between 40 and 60 heads 95% of the time. On the other hand, results far outside this range are unlikely: with a fair coin, there’s only a 1% chance of obtaining more than 63 or fewer than 37 heads. Getting 90 or 100 heads is almost impossible.
Compare this to the dashed line in Figure 2-1, showing the probability of outcomes for a coin biased to give heads 60% of the time. The curves do overlap, but you can see that an unfair coin is much more likely to produce 70 heads than a fair coin is.
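These probabilities follow directly from the binomial distribution. A quick sketch in plain Python (the language and the exact cutoffs below are my own choices for illustration, not from the text) reproduces the figures above:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 100
p_50 = binom_pmf(50, n, 0.5)  # the single most likely outcome
p_40_to_60 = sum(binom_pmf(k, n, 0.5) for k in range(40, 61))
p_extreme = (sum(binom_pmf(k, n, 0.5) for k in range(0, 37))
             + sum(binom_pmf(k, n, 0.5) for k in range(64, n + 1)))

print(f"P(exactly 50 heads) = {p_50:.3f}")      # about 0.08, well under 10%
print(f"P(40-60 heads)      = {p_40_to_60:.3f}")  # roughly 0.96
print(f"P(<37 or >63 heads) = {p_extreme:.4f}")   # under 1%
```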
Let’s work out the math. Say I run 100 trials and count the number of heads. If the result is 60 heads or more, I’ll conclude the coin is biased.
Figure 2-2: The power curves for 100 and 1,000 coin flips, showing the probability of detecting biases of different magnitudes. The vertical line indicates a 60% probability of heads.
Let’s start with the size of the bias. The solid line in Figure 2-2 shows that if the coin is rigged to give heads 60% of the time, I have a 50% chance of concluding that it’s rigged after 100 flips. (That is, when the true probability of heads is 0.6, the power is 0.5.) The other half of the time, I’ll get fewer than 60 heads and fail to detect the bias. With only 100 flips, there’s just too little data to always separate bias from random variation. The coin would have to be incredibly biased—yielding heads more than 80% of the time, for example—for me to notice nearly 100% of the time.
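Under the simple decision rule above (call the coin biased at 60 or more heads in 100 flips — my reading of the setup, not necessarily the book’s exact cutoff), the power at any true bias is a one-line binomial sum:

```python
from math import comb

def power(p_true, n=100, threshold=60):
    """Probability of seeing at least `threshold` heads in n flips,
    i.e., the chance the test flags a coin whose true P(heads) is p_true."""
    return sum(comb(n, k) * p_true**k * (1 - p_true) ** (n - k)
               for k in range(threshold, n + 1))

print(power(0.6))  # roughly 0.5: a 60% coin is caught only about half the time
print(power(0.8))  # essentially 1: an 80% coin is nearly always caught
print(power(0.5))  # the false positive rate for a perfectly fair coin
```

Sweeping `p_true` from 0.5 to 1 traces out exactly the power curve of Figure 2-2.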
Another problem is that even if the coin is perfectly fair, I will falsely accuse it of bias 5% of the time. I’ve designed my test to allow that 5% false positive rate, the conventional cutoff for statistical significance.
Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of detecting a real effect of the expected size. (If the true effect is actually larger, the study will have greater power.) However, few scientists ever perform this calculation, and few journal articles even mention statistical power. In the prestigious journals Science and Nature, fewer than 3% of articles calculate statistical power before starting their study.1 Indeed, many trials conclude that “there was no statistically significant difference in adverse effects between groups,” without noting that there was insufficient data to detect any but the largest differences.2 If one of these trials was comparing side effects in two drugs, a doctor might erroneously think the medications are equally safe, when one could very well be much more dangerous than the other.
Maybe this is a problem only for rare side effects or only when a medication has a weak effect? Nope. In one sample of studies published in prestigious medical journals between 1975 and 1990, more than four-fifths of randomized controlled trials that reported negative results didn’t collect enough data to detect a 25% difference in primary outcome between treatment groups. That is, even if one medication reduced symptoms by 25% more than another, there was insufficient data to make that conclusion. And nearly two-thirds of the negative trials didn’t have the power to detect a 50% difference.3 A more recent study of trials in cancer research found similar results: only about half of published studies with negative results had enough statistical power to detect even a large difference in their primary outcome variable.4 Less than 10% of these studies explained why their sample sizes were so poor.
Similar problems have been consistently seen in other fields of medicine.5,6 In neuroscience, the problem is even worse. Each individual neuroscience study collects so little data that the median study has only a 20% chance of being able to detect the effect it’s looking for. You could compensate for this by aggregating data collected across several papers all investigating the same effect. But since many neuroscience studies use animal subjects, this raises a significant ethical concern. If each study is underpowered, the true effect will likely be discovered only after many studies using many animals have been completed and analyzed—using far more animal subjects than if the study had been done properly in the first place.7 An ethical review board should not approve a trial if it knows the trial is unable to detect the effect it is looking for.
Even so, you’d think scientists would notice their power problems and try to correct them; after five or six studies with insignificant results, a scientist might start wondering what she’s doing wrong. But the average study performs not one hypothesis test but many and so has a good shot at finding something significant.11 As long as this significant result is interesting enough to feature in a paper, the scientist will not feel that her studies are underpowered.
The perils of insufficient power do not mean that scientists are lying when they state they detected no significant difference between groups. But it’s misleading to assume these results mean there is no real difference. There may be a difference, even an important one, but the study was so small it’d be lucky to notice it. Let’s consider an example we see every day.
Wrong Turns on Red

In the 1970s, many parts of the United States began allowing drivers to turn right at a red light. For many years prior, road designers and civil engineers argued that allowing right turns on a red light would be a safety hazard, causing many additional crashes and pedestrian deaths. But the 1973 oil crisis and its fallout spurred traffic agencies to consider allowing right turns on red to save fuel wasted by commuters waiting at red lights, and eventually Congress required states to allow right turns on red, treating it as an energy conservation measure just like building insulation standards and more efficient lighting.
Several studies were conducted to consider the safety impact of the change. In one, a consultant for the Virginia Department of Highways and Transportation conducted a before-and-after study of 20 intersections that had begun to allow right turns on red. Before the change, there were 308 accidents at the intersections; after, there were 337 in a similar length of time. But this difference was not statistically significant, which the consultant indicated in his report. When the report was forwarded to the governor, the commissioner of the Department of Highways and Transportation wrote that “we can discern no significant hazard to motorists or pedestrians from implementation” of right turns on red.12 In other words, he turned statistical insignificance into practical insignificance.
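The text doesn’t say which test the consultant ran, but one standard way to compare two accident counts is a conditional binomial test: treating each count as Poisson, the “after” count, given the total of 645, is Binomial(645, 0.5) under the null hypothesis of no change. A sketch of that calculation:

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

before, after = 308, 337
total = before + after

# Normal approximation to Binomial(total, 0.5) under the null hypothesis
z = (after - total / 2) / sqrt(total * 0.25)
p_value = 2 * (1 - norm_cdf(abs(z)))

print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

The p-value comes out around 0.25: not significant, but entirely consistent with a meaningful increase in crashes — precisely the ambiguity an underpowered study cannot resolve.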
Several subsequent studies had similar findings: small increases in the number of crashes, but not enough data to conclude the increases were statistically significant.
A narrow confidence interval shows an effect was measured precisely, while a wide interval clearly shows that the measurement was not precise enough to draw conclusions.
Physicists commonly use confidence intervals to place bounds on quantities that are not significantly different from zero. In the search for a new fundamental particle, for example, it’s not helpful to say, “The signal was not statistically significant.” Instead, physicists can use a confidence interval to place an upper bound on the rate at which the particle is produced in the particle collisions under study and then compare this result to the competing theories that predict its behavior (and force future experimenters to build yet bigger instruments to find it).
Thinking about results in terms of confidence intervals provides a new way to approach experimental design. Instead of focusing on the power of significance tests, ask, “How much data must I collect to measure the effect to my desired precision?” Even a powerful experiment can nonetheless produce significant results with extremely wide confidence intervals, making its results difficult to interpret.
Of course, the sizes of our confidence intervals vary from one experiment to the next because our data varies from experiment to experiment. Instead of choosing a sample size to achieve a certain level of power, we choose a sample size so the confidence interval will be suitably narrow 99% of the time (or 95%; there’s not yet a standard convention for this number, called the assurance, which determines how often the confidence interval must beat our target width).16 Sample size selection methods based on assurance have been developed for many common statistical tests, though not for all; it is a new field, and statisticians have yet to fully explore it.17 (These methods go by the name accuracy in parameter estimation, or AIPE.) Statistical power is used far more often than assurance, which has not yet been widely adopted by scientists in any field. Nonetheless, these methods are enormously useful.
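The assurance idea can be illustrated with a Monte Carlo sketch. This is only an illustration, not the AIPE methods cited above; the normal model, σ = 1, and the target half-width of 0.2 are made-up assumptions:

```python
import random
import statistics
from math import sqrt

def ci_halfwidth(sample, z=1.96):
    """Half-width of an approximate 95% confidence interval for the mean."""
    return z * statistics.stdev(sample) / sqrt(len(sample))

def assurance(n, sigma=1.0, target=0.2, trials=2000, seed=1):
    """Fraction of simulated experiments whose CI half-width beats the target."""
    rng = random.Random(seed)
    hits = sum(
        ci_halfwidth([rng.gauss(0.0, sigma) for _ in range(n)]) <= target
        for _ in range(trials)
    )
    return hits / trials

# The naive sample size (1.96 * sigma / target)^2, about 96, hits the target
# width only about half the time; high assurance demands a larger n.
print(assurance(96))
print(assurance(130))
```

The naive calculation treats the sample standard deviation as if it always equaled σ; since it overshoots σ about half the time, so does the interval width, which is exactly the gap assurance methods are designed to close.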
Statistical significance is often a crutch, a catchier-sounding but less informative substitute for a good confidence interval.
Truth Inflation

Suppose Fixitol reduces symptoms by 20% over a placebo, but the trial you’re using to test it is too small to have adequate statistical power to detect this difference reliably. We know that small trials tend to have varying results; it’s easy to get 10 lucky patients who have shorter colds than usual but much harder to get 10,000 who all do.
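Truth inflation is easy to demonstrate by simulation. In the sketch below, every number is invented for illustration (a true effect of 0.2 standard deviations, 20 patients per trial): among the underpowered trials that happen to reach significance, the estimated effect is wildly exaggerated.

```python
import random
import statistics
from math import sqrt

def significant_estimates(true_effect=0.2, sigma=1.0, n=20, trials=5000, seed=2):
    """Run many underpowered trials; return the observed effects from the
    trials that reached two-sided significance (z test, alpha = 0.05)."""
    rng = random.Random(seed)
    winners = []
    for _ in range(trials):
        observed = statistics.fmean(rng.gauss(true_effect, sigma) for _ in range(n))
        if abs(observed) / (sigma / sqrt(n)) > 1.96:
            winners.append(observed)
    return winners

winners = significant_estimates()
print(f"power ≈ {len(winners) / 5000:.2f}")
print(f"average significant estimate: {statistics.fmean(winners):.2f} (true effect: 0.2)")
```

With these settings the power is only around 15%, and the trials that cross the significance threshold report an effect more than double the truth, because only the lucky overestimates clear the bar.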
Consider also that top-ranked journals, such as Nature and Science, prefer to publish studies with groundbreaking results—meaning large effect sizes in novel fields with little prior research. This is a perfect combination for chronic truth inflation. Some evidence suggests a correlation between a journal’s impact factor (a rough measure of its prominence and importance) and the factor by which its studies overestimate effect sizes. Studies that produce less “exciting” results are closer to the truth but less interesting to a major journal editor.21,22 When a study claims to have detected a large effect with a relatively small sample, your first reaction should not be “Wow, they’ve found something big!” but “Wow, this study is underpowered!”23

Here’s an example. Starting in 2005, Satoshi Kanazawa published a series of papers on the theme of gender ratios, culminating with “Beautiful Parents Have More Daughters.” He followed up with a book discussing this and other “politically incorrect truths” he’d discovered.
The studies were popular in the press at the time, particularly because of the large effect size they reported: Kanazawa claimed the most beautiful parents have daughters 52% of the time, but the least attractive parents have daughters only 44% of the time.
To biologists, a small effect—perhaps one or two percentage points—would be plausible. The Trivers–Willard hypothesis suggests that if parents have a trait that benefits girls more than boys, then they will have more girls than boys (or vice versa). If you assume girls benefit more from beauty than boys, then the hypothesis would predict beautiful parents would have, on average, slightly more daughters.
But the effect size claimed by Kanazawa was extraordinary.
And as it turned out, he committed several errors in his statistical analysis. A corrected regression analysis found that his data showed attractive parents were indeed 4.7% more likely to have girls—but the confidence interval stretched from 13.3% more likely to 3.9% less likely.23 Though Kanazawa’s study used data from nearly 3,000 parents, the results were not statistically significant.
Enormous amounts of data would be needed to reliably detect a small difference. Imagine a more realistic effect size—say, 0.3%. Even with 3,000 parents, an observed difference of 0.3% is far too small to distinguish from luck. You’d be lucky to obtain a statistically significant result just 5% of the time.
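That 5% figure can be checked with a standard normal-approximation power calculation for comparing two proportions. The split into two groups of 1,500 parents each and the two-sided 5% test are my simplifying assumptions, not details from the study:

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def two_prop_power(p1, p2, n_per_group, z_alpha=1.96):
    """Approximate power of a two-sided two-proportion z test."""
    pbar = (p1 + p2) / 2
    se0 = sqrt(2 * pbar * (1 - pbar) / n_per_group)  # standard error under H0
    se1 = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    diff = abs(p1 - p2)
    return (1 - norm_cdf((z_alpha * se0 - diff) / se1)
            + norm_cdf((-z_alpha * se0 - diff) / se1))

# A realistic 0.3-point difference in daughter rates is nearly undetectable...
print(two_prop_power(0.5015, 0.4985, 1500))  # barely above the 5% false positive rate
# ...while Kanazawa's claimed 8-point gap would be caught almost every time.
print(two_prop_power(0.52, 0.44, 1500))
```

The contrast is the point: a sample that gives near-certain power for the claimed 8-point effect has essentially no power for the biologically plausible one.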
Figure 2-4: Schools with more students have less random variation in their test scores. This data is simulated but based on real observations of Pennsylvania public schools.
Another example: in the United States, counties with the lowest rates of kidney cancer tend to be Midwestern, Southern, and Western rural counties. Why might this be? Maybe rural people get more exercise or inhale less-polluted air. Or perhaps they just lead less stressful lives.
On the other hand, counties with the highest rates of kidney cancer tend to be Midwestern, Southern, and Western rural counties.
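This paradox, with small counties dominating both extremes, falls out of a short simulation. Every number below is invented for illustration: 500 counties, populations between 1,000 and 100,000, and an identical true incidence everywhere:

```python
import random
import statistics
from math import exp

def poisson(lam, rng):
    """Knuth's Poisson sampler; fine for the modest rates used here."""
    threshold, k, p = exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(3)
true_rate = 1e-4  # the same underlying cancer rate for every county

counties = []
for _ in range(500):
    pop = rng.randint(1_000, 100_000)
    observed_rate = poisson(pop * true_rate, rng) / pop
    counties.append((observed_rate, pop))

counties.sort()  # order counties by observed rate
median_pop = statistics.median(pop for _, pop in counties)
lowest_pop = statistics.median(pop for _, pop in counties[:20])
highest_pop = statistics.median(pop for _, pop in counties[-20:])
print(median_pop, lowest_pop, highest_pop)
```

Even though every county has exactly the same true rate, the counties with the most extreme observed rates, at both ends, are typically far smaller than the median county: small samples produce the wild swings.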