There’s a feature article in the new issue of Science News on the failure of science “to face the shortcomings of statistics”. The author, Tom Siegfried, argues that many scientific results shouldn’t be believed because they depend on faulty statistical practices:

Even when performed correctly, statistical tests are widely misunderstood and frequently misinterpreted. As a result, countless conclusions in the scientific literature are erroneous, and tests of medical dangers or treatments are often contradictory and confusing.

I have mixed feelings about the article. It’s hard to disagree with the basic idea that many scientific results are the product of statistical malpractice and/or misfortune. And Siegfried generally provides lucid explanations of some common statistical pitfalls when he sticks to the descriptive side of things. For instance, he gives nice accounts of Bayesian inference, of the multiple comparisons problem, and of the distinction between statistical significance and clinical/practical significance. And he nicely articulates what’s wrong with one of the most common (mis)interpretations of *p* values:

Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct). But many explanations mangle the subtleties in that definition. A recent popular book on issues involving science, for example, states a commonly held misperception about the meaning of statistical significance at the .05 level: “This means that it is 95 percent certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance.”
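
That definition is easy to check by brute force: if the null hypothesis is true, a correctly computed *p* value falls below .05 about 5 percent of the time. Here’s a minimal simulation sketch (a one-sample z-test with known variance, chosen purely for simplicity; the sample size of 30 and number of trials are arbitrary):

```python
import math
import random

def z_test_p(sample, sigma=1.0):
    """Two-sided p-value for H0: true mean == 0, known sigma (z-test)."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    # Standard normal CDF via the error function
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi(abs(z)))

random.seed(1)
trials = 20_000
false_positives = 0
for _ in range(trials):
    # The null hypothesis is TRUE here: every sample comes from N(0, 1)
    sample = [random.gauss(0.0, 1.0) for _ in range(30)]
    if z_test_p(sample) < 0.05:
        false_positives += 1

print(false_positives / trials)  # close to 0.05, by construction
```

The 5 percent is a statement about how often the data would look this extreme *given* the null, not about how likely the null is given the data.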

So as a laundry list of common statistical pitfalls, it works quite nicely.

What I *don’t* really like about the article is that it seems to lay the blame squarely on the use of statistics to do science, rather than on the way statistical analysis tends to be performed. That is to say, a lay person reading the article could well come away with the impression that the very problem with science is *that it relies on statistics*. As opposed to the much more reasonable conclusion that science is hard, and statistics is hard, and ensuring that your work sits at the intersection of good science *and* good statistical practice is even harder. Siegfried all but implies that scientists are silly to base their conclusions on statistical inference. For instance:

It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation. Statistical tests are supposed to guide scientists in judging whether an experimental result reflects some real effect or is merely a random fluke, but the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions.

Or:

Experts in the math of probability and statistics are well aware of these problems and have for decades expressed concern about them in major journals. Over the years, hundreds of published papers have warned that science’s love affair with statistics has spawned countless illegitimate findings. In fact, if you believe what you read in the scientific literature, you shouldn’t believe what you read in the scientific literature.

The problem is that there isn’t really any viable alternative to the “love affair with statistics”. Presumably Siegfried doesn’t think (most) scientists ought to be doing qualitative research; so the choice isn’t between statistics and no statistics, it’s between good and bad statistics.

In that sense, the tone of a lot of the article is pretty condescending: it comes off more like Siegfried saying “boy, scientists sure are dumb” and less like the more accurate observation that doing statistics is really hard, and it’s not surprising that even very smart people mess up frequently.

What makes it worse is that Siegfried slips up on a couple of basic points himself, and says some demonstrably false things in a couple of places. For instance, he explains failures to replicate genetic findings this way:

Nowhere are the problems with statistics more blatant than in studies of genetic influences on disease. In 2007, for instance, researchers combing the medical literature found numerous studies linking a total of 85 genetic variants in 70 different genes to acute coronary syndrome, a cluster of heart problems. When the researchers compared genetic tests of 811 patients that had the syndrome with a group of 650 (matched for sex and age) that didn’t, only one of the suspect gene variants turned up substantially more often in those with the syndrome — a number to be expected by chance.

“Our null results provide no support for the hypothesis that any of the 85 genetic variants tested is a susceptibility factor” for the syndrome, the researchers reported in the Journal of the American Medical Association.

How could so many studies be wrong? Because their conclusions relied on “statistical significance,” a concept at the heart of the mathematical analysis of modern scientific experiments.

This is wrong for at least two reasons. One is that, to believe the JAMA study Siegfried is referring to, and disbelieve the results of all 85 previously reported findings, you have to *accept the null hypothesis*, which is one of the very same errors Siegfried is supposed to be warning us against. In fact, you have to accept the null hypothesis *85 times*. In the JAMA paper, the authors are careful to note that it’s possible the actual effects were simply overstated in the original studies, and that at least some of the original findings might still hold under more restrictive conditions. The conclusion that there really is *no* effect whatsoever is almost never warranted, because you rarely have enough power to rule out even very small effects. But Siegfried offers no such qualifiers; instead, he happily accepts 85 null hypotheses in support of his own argument.
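
The power issue is worth making concrete. Here’s a back-of-the-envelope sketch using the JAMA study’s sample sizes; the standardized effect size of d = 0.1 is a hypothetical “small” effect I’ve picked purely for illustration, and the calculation assumes a simple two-sample z-test with known unit variance:

```python
import math

# Standard normal CDF via the error function
phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sample_power(d, n1, n2):
    """Approximate power of a two-sided, alpha = .05 two-sample z-test
    to detect a true standardized mean difference d (unit variance)."""
    se = math.sqrt(1.0 / n1 + 1.0 / n2)
    z_crit = 1.959963984540054  # two-sided 5% critical value
    shift = d / se
    # P(|Z + shift| > z_crit) under the alternative
    return (1.0 - phi(z_crit - shift)) + phi(-z_crit - shift)

# Sample sizes from the JAMA study quoted above; d = 0.1 is hypothetical
print(two_sample_power(0.10, 811, 650))  # roughly 0.48
```

Under these (made-up but plausible) numbers, even ~1,400 participants give you about a coin flip’s chance of detecting a genuinely small effect, which is exactly why a non-significant result can’t be read as “no effect.”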

The other issue is that it isn’t really the reliance on statistical significance that causes replication failures; it’s usually the use of excessively liberal statistical criteria. The problem has very little to do with the hypothesis testing framework per se. To see this, consider that if researchers always used a criterion of p < .0000001 instead of the conventional p < .05, there would almost never be any replication failures (because there would almost never be any statistically significant findings, period). So the problem is not so much with the classical hypothesis testing framework as with the choices many researchers make about how to set thresholds *within* that framework. (That’s not to say that there aren’t any problems associated with frequentist statistics, just that this isn’t really a fair one.)
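
The arithmetic here is easy to make explicit. If you run 85 independent tests of true null hypotheses at a threshold of p < .05, you expect about 85 × .05 ≈ 4 “significant” results by chance alone; at p < .0000001, essentially none. A quick simulation (the 85 tests and the two thresholds come from the discussion above; the number of repetitions and the seed are arbitrary):

```python
import math
import random

# Standard normal CDF via the error function
phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

random.seed(7)
m = 85        # number of variants tested, as in the study above
runs = 2_000  # repeat the whole 85-test screen many times
hits_05, hits_strict = 0, 0
for _ in range(runs):
    for _ in range(m):
        # The null is true for every test: z ~ N(0, 1)
        p = 2.0 * (1.0 - phi(abs(random.gauss(0.0, 1.0))))
        if p < 0.05:
            hits_05 += 1
        if p < 1e-7:
            hits_strict += 1

print(hits_05 / runs)      # on average 85 * .05 = 4.25 "findings" per screen
print(hits_strict / runs)  # essentially zero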

Anyway, Siegfried’s explanations of the pitfalls of statistical significance then lead him to make what has to be, hands down, the silliest statement in the article:

But in fact, there’s no logical basis for using a P value from a single study to draw any conclusion. If the chance of a fluke is less than 5 percent, two possible conclusions remain: There is a real effect, or the result is an improbable fluke. Fisher’s method offers no way to know which is which. On the other hand, if a study finds no statistically significant effect, that doesn’t prove anything, either. Perhaps the effect doesn’t exist, or maybe the statistical test wasn’t powerful enough to detect a small but real effect.

If you take this statement at face value, you should conclude there’s no point in doing statistical analysis, period. No matter what statistical procedure you use, you’re never going to know for *cross-your-heart-hope-to-die* sure that your conclusions are warranted. After all, you’re always going to have the same two possibilities: either the effect is real, or it’s not (or, if you prefer to frame the problem in terms of magnitude, either the effect is about as big as you think it is, or it’s very different in size). The same exact conclusion goes through if you take a threshold of p < .001 instead of one of p < .05: the effect can *still* be a spurious and improbable fluke. And it also goes through if you have twelve replications instead of just one positive finding: you could *still* be wrong (and people *have* been wrong). So saying that “two possible conclusions remain” isn’t offering any deep insight; it’s utterly vacuous.

The reason scientists use a conventional threshold of p < .05 when evaluating results isn’t because we think it gives us some magical certainty about whether a finding is “real” or not; it’s because it feels like a reasonable level of confidence to shoot for when making claims about whether the null hypothesis of no effect is *likely* to hold. Now there certainly are many problems associated with the hypothesis testing framework (some of them very serious), but if you really believe that “there’s no logical basis for using a P value from a single study to draw any conclusion,” your beef isn’t actually with *p* values, it’s with the very underpinnings of the scientific enterprise.

Anyway, the bottom line is that Siegfried’s article is not so much bad as irresponsible. As an accessible description of some serious problems with common statistical practices, it’s actually quite good. But the sense I got reading the article was that at some point Siegfried became more interested in writing a contrarian piece about how scientists are falling down on the job than in explaining how doing statistics well is just really hard for almost all of us (I certainly fail at it all the time!). And ironically, in the process of trying to explain just why “science fails to face the shortcomings of statistics”, Siegfried commits some of the very same errors he’s taking scientists to task for.

[UPDATE: Steve Novella says much the same thing here.]

[UPDATE 2: Andrew Gelman has a nice roundup of other comments on Siegfried’s article throughout the blogosphere.]

On a related topic, an article in the current TAS suggests that publication bias is so strong (due to some of the problems noted above, and a few others) that we’d often be better off throwing out 90% of the published studies in order to get a better estimate of the true magnitude of an effect.

See http://mikekr.blogspot.com/2010/03/could-it-be-better-to-discard-90-of.html

You note “If you take this statement at face value, you should conclude there’s no point in doing statistical analysis, period.” You might go further and say there’s no point in doing the study, period, since a single study is unlikely to be definitive. That’s why we have more than one scientist doing more than one study.

Nice commentary Tal. NHST is deeply unintuitive, and it’s hard for even people with good statistical training and acumen to avoid messing up sometimes. Your formulation that the problems of science and statistics are hard (rather than that the people are dumb) definitely resonates with me.

Part of the reason the shortcomings of NHST are so well known is because it has been in widespread use for so long — and thus subjected to intense field-testing and scrutiny. That makes it easy to come along and write a curmudgeonly rant cataloging common errors. I suspect that if we were all doing Bayesian modeling or anything else, there’d be plenty of (different) misuses and errors abounding in the applied literature. I’m not saying we wouldn’t be better off. But my hunch is that no matter what statistical paradigm scientists were using, you’d always be able to write this sort of article.

zbicyclist, agreed, though there certainly are (rare) cases where a single study offers definitive evidence–or at least, orients researchers to important directions that hadn’t previously been considered.

Sanjay, thanks for the comment. To be fair, I think there’s an important place for curmudgeonly rants provided they’re done constructively (e.g., pointing out potential solutions, and not just problems). Personally I’d actually love to see more laundry lists of common problems researchers run into; I learn a lot that way. But those kinds of papers are hard to do right, and a pop science magazine probably isn’t the right venue. When people write well-reasoned methodological critiques that offer actual solutions along with the criticism, I’m all ears. But in this case it’s not clear what the point of the article actually is, other than to give naive readers the mistaken impression that scientists are wasting their time doing statistics.

Nowhere are the stats more untrustworthy than with fMRI. For instance, many of these studies routinely use sample sizes of 10 subjects and get published in prestigious science journals. In fact, a recent paper on SSRN titled “Brain Imaging for Legal Thinkers: A Guide for the Perplexed” justifies this practice not on any scientific basis but merely on grounds of cost. And yet the wild claims persist that fMRI has proven just about everything under the sun.

Now, how are these papers getting published?

There’s actually no such thing as statistics, it’s just a word someone made up to sell textbooks.

If you calculate a p value that X could have happened by chance, that’s one kind of inference; if you look at the data and it’s obvious that X is true, that’s another kind of inference. But both are “statistical” (it’s not as if your judgment that X is obvious has nothing to do with the formal statistics of the data, although you may not be conscious of them). What matters is whether your method of inference is sound and whether you’re aware of the limitations (have you done loads of comparisons and not corrected your p? Have you looked at the data and thought “X” when in fact Y is a better explanation?).

So saying science has a problem with statistics is a bit silly. Specific areas of science have specific problems with various kinds of inference, e.g. in certain kinds of neuroimaging there’s the non-independence error, in GWAS there’s the multiple-comparisons problem, population stratification – but there’s no problem with statistics per se.

Statistics ARE hard.

Or

Understanding statistics is difficult.

Minor typo, but an otherwise excellent article. 🙂

I think one part of the issue is that “Scientific Method 101” teaches that one exception disproves the hypothesis. This is also true in pure Mathematics 101.

Reality is somewhat different…

Another issue is how easy it is to overlook that statistically insignificant results are still results, and exist no matter how hard you try to pretend they don’t.

A recent example is the 1000-point drop in the DJI index: when your normal boundary for insignificance is the 2nd (or, for the paranoid, 3rd) standard deviation from the mean, such an event seems so unlikely that when it happens everyone scrambles for an outside explanation for the cause.

That neatly leads me to my last point. Correlation != Causation, which is one of the many deadly pitfalls lying in wait for the incautious data modeler or sloppy science reporter.