You’ve seen how it’s possible to miss real effects by not collecting enough data. You might miss a viable medicine or fail to notice an important side effect. So how do you know how much data to collect?

The concept of statistical power provides the answer. The power of a study is the probability that it will distinguish an effect of a certain size from pure luck. A study might easily detect a huge benefit from a medication, but detecting a subtle difference is much less likely.

Statistics Done Wrong © 2015 Alex Reinhart

The Power Curve

Suppose I’m convinced that my archnemesis has an unfair coin. Rather than getting heads half the time and tails half the time, it’s biased to give one outcome 60% of the time, allowing him to cheat at incredibly boring coin-flipping betting games.

I suspect he’s cheating—but how to prove it?

I can’t just take the coin, flip it 100 times, and count the heads. Even a perfectly fair coin won’t always get 50 heads, as the solid line in Figure 2-1 shows.

Figure 2-1: The probability of getting different numbers of heads if you flip a fair coin (solid line) or biased coin (dashed line) 100 times. The biased coin gives heads 60% of the time. (Plot not shown; horizontal axis: Number of Heads, vertical axis: Probability.)

Even though 50 heads is the most likely outcome, it still happens less than 10% of the time. I’m also reasonably likely to get 51 or 52 heads. In fact, when flipping a fair coin 100 times, I’ll get between 40 and 60 heads 95% of the time. On the other hand, results far outside this range are unlikely: with a fair coin, there’s only a 1% chance of obtaining more than 63 or fewer than 37 heads. Getting 90 or 100 heads is almost impossible.
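The figures in this paragraph can be checked directly from the binomial distribution. The following is a minimal sketch using only the Python standard library; the helper names `binom_pmf` and `prob_between` are illustrative and not from the book.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_between(lo, hi, n, p):
    """Probability that the number of heads falls in lo..hi inclusive."""
    return sum(binom_pmf(k, n, p) for k in range(lo, hi + 1))

n = 100

# Fair coin: chance of exactly 50 heads (the single most likely outcome).
p_exactly_50 = binom_pmf(50, n, 0.5)

# Fair coin: chance of landing anywhere from 40 to 60 heads.
p_40_to_60 = prob_between(40, 60, n, 0.5)

# Fair coin: chance of fewer than 37 or more than 63 heads.
p_extreme = prob_between(0, 36, n, 0.5) + prob_between(64, n, n, 0.5)

# Chance of 70 or more heads with a fair coin versus a 60% coin.
p_70_plus_fair = prob_between(70, n, n, 0.5)
p_70_plus_biased = prob_between(70, n, n, 0.6)

print(f"exactly 50 heads (fair):   {p_exactly_50:.4f}")
print(f"40 to 60 heads (fair):     {p_40_to_60:.4f}")
print(f"<37 or >63 heads (fair):   {p_extreme:.4f}")
print(f"70+ heads, fair vs biased: {p_70_plus_fair:.6f} vs {p_70_plus_biased:.6f}")
```

The exact values differ slightly from the rounded percentages quoted in the text, which is expected: the text rounds to convenient figures.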

Compare this to the dashed line in Figure 2-1, showing the probability of outcomes for a coin biased to give heads 60% of the time. The curves do overlap, but you can see that an unfair coin is much more likely to produce 70 heads than a fair coin is.

Let’s work out the math. Say I run 100 trials and count the number of heads. If the re

–  –  –

Figure 2-2: The power curves for 100 and 1,000 coin flips, showing the probability of detecting biases of different magnitudes. The vertical line indicates a 60% probability of heads.

Let’s start with the size of the bias. The solid line in Figure 2-2 shows that if the coin is rigged to give heads 60% of the time, I have a 50% chance of concluding that it’s rigged after 100 flips. (That is, when the true probability of heads is 0.6, the power is 0.5.) The other half of the time, I’ll get fewer than 60 heads and fail to detect the bias. With only 100 flips, there’s just too little data to always separate bias from random variation. The coin would have to be incredibly biased—yielding heads more than 80% of the time, for example—for me to notice nearly 100% of the time.
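A power curve like the one in Figure 2-2 can be approximated numerically. The sketch below rests on an assumption, because the passage that defines the test is cut off in this excerpt: the coin is declared rigged whenever the number of heads falls outside the narrowest symmetric range around half the flips that a fair coin stays inside at least 95% of the time. The function names are hypothetical, and the number of flips is assumed to be even.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def acceptance_margin(n, alpha=0.05):
    """Smallest k such that a fair coin lands outside n/2 +/- k
    with probability at most alpha (n assumed even)."""
    half = n // 2
    for k in range(half + 1):
        inside = sum(binom_pmf(h, n, 0.5) for h in range(half - k, half + k + 1))
        if 1.0 - inside <= alpha:
            return k
    return half

def power(n, p_true, alpha=0.05):
    """Probability of declaring the coin rigged when P(heads) = p_true."""
    half = n // 2
    k = acceptance_margin(n, alpha)
    inside = sum(binom_pmf(h, n, p_true) for h in range(half - k, half + k + 1))
    return 1.0 - inside

# Tabulate a few points of the power curve for 100 and 1,000 flips.
for p_true in (0.5, 0.55, 0.6, 0.7, 0.8):
    print(f"P(heads)={p_true:.2f}  "
          f"power(100 flips)={power(100, p_true):.3f}  "
          f"power(1000 flips)={power(1000, p_true):.3f}")
```

Under this assumed rule, the acceptance range for 100 flips works out to 40 through 60 heads, the power at a 60% bias comes out a little under one half, and 1,000 flips detect the same bias almost every time. The row for a fair coin shows the false positive rate, which stays at or below 5% by construction.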

Another problem is that even if the coin is perfectly fair, I will falsely accuse it of bias 5% of the time. I’ve designed my test

–  –  –

are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of detecting a real effect of the expected size. (If the true effect is actually larger, the study will have greater power.) However, few scientists ever perform this calculation, and few journal articles even mention statistical power. In the prestigious journals Science and Nature, fewer than 3% of articles calculate statistical power before starting their study.1 Indeed, many trials conclude that “there was no statistically significant difference in adverse effects between groups,” without noting that there was insufficient data to detect any but the largest differences.2 If one of these trials was comparing side effects in two drugs, a doctor might erroneously think the medications are equally safe, when one could very well be much more dangerous than the other.
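To make the 0.8 target concrete, here is a rough sketch of the sample size calculation for the coin example: it searches for the smallest even number of flips at which a 60% coin is caught at least 80% of the time. The rejection rule (a symmetric two-sided test at the 5% level) and the function names are assumptions made for illustration. Exact binomial power does not rise perfectly smoothly with the number of flips, so the answer should be read as approximate.

```python
from math import comb

def power(n, p_true, alpha=0.05):
    """Power of a symmetric two-sided test of a fair coin with n flips (n even)."""
    half = n // 2
    fair = [comb(n, h) / 2**n for h in range(n + 1)]
    # Widen the acceptance range half-k..half+k until a fair coin
    # falls outside it at most alpha of the time.
    k = 0
    while 1.0 - sum(fair[half - k: half + k + 1]) > alpha:
        k += 1
    accept = range(half - k, half + k + 1)
    missed = sum(comb(n, h) * p_true**h * (1 - p_true)**(n - h) for h in accept)
    return 1.0 - missed

def flips_needed(p_true, target=0.8, max_flips=1000):
    """Smallest even number of flips whose power reaches the target, or None."""
    for n in range(2, max_flips + 1, 2):
        if power(n, p_true) >= target:
            return n
    return None

n_required = flips_needed(0.6)
print(f"flips needed for 80% power against a 60% coin: {n_required}")
print(f"power at that sample size: {power(n_required, 0.6):.3f}")
```

The result lands at roughly twice the 100 flips used earlier, which matches the earlier observation that 100 flips give only about even odds of catching a 60% coin.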

Maybe this is a problem only for rare side effects or only when a medication has a weak effect? Nope. In one sample of studies published in prestigious medical journals between 1975 and 1990, more than four-fifths of randomized controlled trials that reported negative results didn’t collect enough data to detect a 25% difference in primary outcome between treatment groups. That is, even if one medication reduced symptoms by 25% more than another, there was insufficient data to make that conclusion. And nearly two-thirds of the negative trials didn’t have the power to detect a 50% difference.3 A more recent study of trials in cancer research found similar results: only about half of published studies with negative results had enough statistical power to detect even a large difference in their primary outcome variable.4 Less than 10% of these studies explained why their sample sizes were so poor.

Similar problems have been consistently seen in other fields of medicine.5,6 In neuroscience, the problem is even worse. Each individual neuroscience study collects so little data that the median study has only a 20% chance of being able to detect the effect it’s looking for. You could compensate for this by aggregating data collected across several papers all investigating the same effect. But since many neuroscience studies use animal subjects, this raises a significant ethical concern. If each study is underpowered, the true effect will likely be discovered only after many studies using many animals have been completed and analyzed, using far more animal subjects than if the study had been done properly in the first place.7 An ethical review board should not approve a trial if it knows the trial is unable to detect the effect it is looking for.

–  –  –

Even so, you’d think scientists would notice their power problems and try to correct them; after five or six studies with insignificant results, a scientist might start wondering what she’s doing wrong. But the average study performs not one hypothesis test but many and so has a good shot at finding something significant.11 As long as this significant result is interesting enough to feature in a paper, the scientist will not feel that her studies are underpowered.

The perils of insufficient power do not mean that scientists are lying when they state they detected no significant difference between groups. But it’s misleading to assume these results mean there is no real difference. There may be a difference, even an important one, but the study was so small it’d be lucky to notice it. Let’s consider an example we see every day.

Wrong Turns on Red

In the 1970s, many parts of the United States began allowing drivers to turn right at a red light. For many years prior, road designers and civil engineers argued that allowing right turns on a red light would be a safety hazard, causing many additional crashes and pedestrian deaths. But the 1973 oil crisis and its fallout spurred traffic agencies to consider allowing right turns on red to save fuel wasted by commuters waiting at red lights, and eventually Congress required states to allow right turns on red, treating it as an energy conservation measure just like building insulation standards and more efficient lighting.

Several studies were conducted to consider the safety impact of the change. In one, a consultant for the Virginia Department of Highways and Transportation conducted a before-and-after study of 20 intersections that had begun to allow right turns on red. Before the change, there were 308 accidents at the intersections; after, there were 337 in a similar length of time. But this difference was not statistically significant, which the consultant indicated in his report. When the report was forwarded to the governor, the commissioner of the Department of Highways and Transportation wrote that “we can discern no significant hazard to motorists or pedestrians from implementation” of right turns on red.12 In other words, he turned statistical insignificance into practical insignificance.
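The excerpt does not say which test the consultant used, so the following is only one plausible way to see why 308 versus 337 accidents falls short of significance. It assumes the total of 645 accidents is fixed and that, if the rule change had no effect, each accident would be equally likely to fall in the before or the after period. The function name is hypothetical.

```python
from math import comb

before, after = 308, 337       # accident counts reported in the text
total = before + after

def two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n fair 50/50 trials."""
    # Count every outcome at least as far from n/2 as the observed one.
    distance = abs(k - n / 2)
    extreme = sum(comb(n, j) for j in range(n + 1) if abs(j - n / 2) >= distance)
    return extreme / 2**n

p_value = two_sided_p(after, total)
relative_change = (after - before) / before

print(f"observed increase: {relative_change:.1%}")
print(f"two-sided p-value under the 50/50 assumption: {p_value:.3f}")
```

Under these assumptions the p-value sits well above 0.05 even though accidents rose by more than 9%. The data are compatible with no change, and also with a real increase of a size that would matter a great deal.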

Several subsequent studies had similar findings: small increases in the number of crashes but not enough data to

–  –  –

while a wide interval clearly shows that the measurement was not precise enough to draw conclusions.

Physicists commonly use confidence intervals to place bounds on quantities that are not significantly different from zero. In the search for a new fundamental particle, for example, it’s not helpful to say, “The signal was not statistically significant.” Instead, physicists can use a confidence interval to place an upper bound on the rate at which the particle is produced in the particle collisions under study and then compare this result to the competing theories that predict its behavior (and force future experimenters to build yet bigger instruments to find it).

Thinking about results in terms of confidence intervals provides a new way to approach experimental design. Instead of focusing on the power of significance tests, ask, “How much data must I collect to measure the effect to my desired precision?” Even a powerful experiment can nonetheless produce significant results with extremely wide confidence intervals, making its results difficult to interpret.

Of course, the sizes of our confidence intervals vary from one experiment to the next because our data varies from experiment to experiment. Instead of choosing a sample size to achieve a certain level of power, we choose a sample size so the confidence interval will be suitably narrow 99% of the time (or 95%; there’s not yet a standard convention for this number, called the assurance, which determines how often the confidence interval must beat our target width).16 Sample size selection methods based on assurance have been developed for many common statistical tests, though not for all; it is a new field, and statisticians have yet to fully explore it.17 (These methods go by the name accuracy in parameter estimation, or AIPE.) Statistical power is used far more often than assurance, which has not yet been widely adopted by scientists in any field. Nonetheless, these methods are enormously useful.
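The idea of assurance can be illustrated by brute force on the coin example. This sketch is not one of the published AIPE methods; it assumes a true heads probability of 0.6, a simple Wald 95% interval for a proportion, and a target full width of 0.10, all chosen only for illustration. It finds the smallest number of flips for which the interval is at most that wide 99% of the time.

```python
from math import comb, sqrt

Z95 = 1.959964  # normal quantile for a two-sided 95% interval

def ci_width(heads, n):
    """Full width of the Wald 95% interval for a proportion."""
    phat = heads / n
    return 2 * Z95 * sqrt(phat * (1 - phat) / n)

def assurance(n, p_true, target_width):
    """Probability that the interval from n flips is no wider than target_width."""
    return sum(
        comb(n, h) * p_true**h * (1 - p_true)**(n - h)
        for h in range(n + 1)
        if ci_width(h, n) <= target_width
    )

def flips_for_assurance(p_true, target_width, level=0.99, max_flips=2000):
    """Smallest number of flips whose assurance reaches the level, or None."""
    for n in range(2, max_flips + 1):
        if assurance(n, p_true, target_width) >= level:
            return n
    return None

n_needed = flips_for_assurance(p_true=0.6, target_width=0.10)
print(f"flips needed for a 0.10-wide interval with 99% assurance: {n_needed}")
```

The design choice here is to enumerate every possible number of heads rather than simulate, which keeps the answer deterministic. It only scales to small problems, and a different interval formula would give a somewhat different sample size.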

Statistical significance is often a crutch, a catchier-sounding but less informative substitute for a good confidence interval.

Truth Inflation

Suppose Fixitol reduces symptoms by 20% over a placebo, but the trial you’re using to test it is too small to have adequate statistical power to detect this difference reliably. We know that small trials tend to have varying results; it’s easy to get 10 lucky patients who have shorter colds than usual but much harder to get 10,000 who all do.

–  –  –

Consider also that top-ranked journals, such as Nature and Science, prefer to publish studies with groundbreaking results, meaning large effect sizes in novel fields with little prior research. This is a perfect combination for chronic truth inflation. Some evidence suggests a correlation between a journal’s impact factor (a rough measure of its prominence and importance) and the factor by which its studies overestimate effect sizes. Studies that produce less “exciting” results are closer to the truth but less interesting to a major journal editor.21,22 When a study claims to have detected a large effect with a relatively small sample, your first reaction should not be “Wow, they’ve found something big!” but “Wow, this study is underpowered!”23

Here’s an example. Starting in 2005, Satoshi Kanazawa published a series of papers on the theme of gender ratios, culminating with “Beautiful Parents Have More Daughters.” He followed up with a book discussing this and other “politically incorrect truths” he’d discovered.

The studies were popular in the press at the time, particularly because of the large effect size they reported: Kanazawa claimed the most beautiful parents have daughters 52% of the time, but the least attractive parents have daughters only 44% of the time.

To biologists, a small effect—perhaps one or two percentage points—would be plausible. The Trivers–Willard Hypothesis suggests that if parents have a trait that benefits girls more than boys, then they will have more girls than boys (or vice versa).

If you assume girls benefit more from beauty than boys, then the hypothesis would predict beautiful parents would have, on average, slightly more daughters.

But the effect size claimed by Kanazawa was extraordinary.

And as it turned out, he committed several errors in his statistical analysis. A corrected regression analysis found that his data showed attractive parents were indeed 4.7% more likely to have girls—but the confidence interval stretched from 13.3% more likely to 3.9% less likely.23 Though Kanazawa’s study used data from nearly 3,000 parents, the results were not statistically significant.
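The reported interval already shows why the corrected result is not significant. The arithmetic below assumes the interval is a symmetric 95% interval based on a normal approximation; the text does not state the confidence level, so treat that as an assumption.

```python
from math import erf, sqrt

estimate = 4.7                 # percent more likely to have girls, from the text
ci_low, ci_high = -3.9, 13.3   # reported interval, in the same units

Z95 = 1.959964  # normal quantile for a two-sided 95% interval (assumed level)

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# A symmetric interval is centred on the estimate and spans +/- Z95 standard errors.
midpoint = (ci_low + ci_high) / 2
std_error = (ci_high - ci_low) / (2 * Z95)

# How many standard errors the estimate sits from zero, and the matching p-value.
z = estimate / std_error
p_value = 2 * (1 - normal_cdf(abs(z)))

print(f"interval midpoint: {midpoint:.1f}")
print(f"implied standard error: {std_error:.2f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

The midpoint matches the 4.7% estimate, which supports the symmetric-interval assumption. The estimate is only about one standard error from zero, so an interval that includes zero and a p-value far above 0.05 are two views of the same fact.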

Enormous amounts of data would be needed to reliably detect a small difference. Imagine a more realistic effect size, say 0.3%. Even with 3,000 parents, an observed difference of 0.3% is far too small to distinguish from luck. You’d be lucky to obtain a statistically significant result just 5% of the time. These

–  –  –
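The 0.3% scenario described above can be sketched with a normal approximation for comparing two proportions. Several inputs are hypothetical and chosen only for illustration: the 3,000 parents are split evenly into two groups of 1,500, the baseline share of daughters is 48.5%, and the 0.3% effect is read as 0.3 percentage points.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

Z95 = 1.959964        # two-sided 5% significance threshold

n_per_group = 1500    # hypothetical: 3,000 parents split evenly
p_low = 0.485         # hypothetical baseline share of daughters
true_diff = 0.003     # the 0.3% effect, read as percentage points
p_high = p_low + true_diff

# Standard error of the difference between the two observed proportions.
std_error = sqrt(p_low * (1 - p_low) / n_per_group
                 + p_high * (1 - p_high) / n_per_group)

# Smallest observed difference that would reach significance.
critical_diff = Z95 * std_error

# Power: chance the observed difference clears that bar in either direction.
shift = true_diff / std_error
power = normal_cdf(-Z95 + shift) + normal_cdf(-Z95 - shift)

# How much a significant result must exaggerate the true effect.
inflation = critical_diff / true_diff

print(f"power: {power:.3f}")
print(f"smallest significant difference: {critical_diff:.3%}")
print(f"minimum exaggeration factor: {inflation:.1f}x")
```

With these inputs the power is barely above the 5% false positive rate, in line with the figure quoted in the text. Any result that does reach significance must report a difference of more than 3 percentage points, over ten times the assumed true effect, which is the truth inflation pattern this section describes.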

Figure 2-4: Schools with more students have less random variation in their test scores. This data is simulated but based on real observations of Pennsylvania public schools. (Plot not shown; axes: Number of Students and Average Test Score.)

Another example: in the United States, counties with the lowest rates of kidney cancer tend to be Midwestern, Southern, and Western rural counties. Why might this be? Maybe rural people get more exercise or inhale less-polluted air. Or perhaps they just lead less stressful lives.

On the other hand, counties with the highest rates of kidney cancer tend to be Midwestern, Southern, and Western rural counties.
