Posts Tagged ‘criticism’

what the arsenic effect means for scientific publishing

Thursday, December 9th, 2010

I don’t know very much about DNA (and by ‘not very much’ I sadly mean ‘next to nothing’), so when someone tells me that life as we know it generally doesn’t use arsenic to make DNA, and that it’s a big deal to find a bacterium that does, I’m willing to believe them. So too, apparently, are at least two or three reviewers for Science, which published a paper last week by a NASA group purporting to demonstrate exactly that.

Turns out the paper might have a few holes. In the last few days, the blogosphere has reached fever delirium pitch as critiques of the article have emerged from every corner; it seems like pretty much everyone with some knowledge of the science in question is unhappy about the paper. Since I’m not in any position to critique the article myself, I’ll take Carl Zimmer’s word for it in Slate yesterday:

Was this merely a case of a few isolated cranks? To find out, I reached out to a dozen experts on Monday. Almost unanimously, they think the NASA scientists have failed to make their case.  “It would be really cool if such a bug existed,” said San Diego State University’s Forest Rohwer, a microbiologist who looks for new species of bacteria and viruses in coral reefs. But, he added, “none of the arguments are very convincing on their own.” That was about as positive as the critics could get. “This paper should not have been published,” said Shelley Copley of the University of Colorado.

Zimmer then follows his Slate piece up with a blog post today in which he provides 13 experts’ unadulterated comments. While there are one or two (somewhat) positive reviews, the consensus clearly seems to be that the Science paper is (very) bad science.

Of course, scientists (yes, even Science reviewers) do occasionally make mistakes, so if we’re being charitable about it, we might chalk it up to human error (though some of the critiques suggest that these are elementary problems that could have been very easily addressed, so it’s possible there’s some disingenuousness involved). But what many bloggers (1, 2, 3, etc.) have found particularly inexcusable is the way NASA and the research team have handled the criticism. Zimmer again, in Slate:

I asked two of the authors of the study if they wanted to respond to the criticism of their paper. Both politely declined by email.

“We cannot indiscriminately wade into a media forum for debate at this time,” declared senior author Ronald Oremland of the U.S. Geological Survey. “If we are wrong, then other scientists should be motivated to reproduce our findings. If we are right (and I am strongly convinced that we are) our competitors will agree and help to advance our understanding of this phenomenon. I am eager for them to do so.”

“Any discourse will have to be peer-reviewed in the same manner as our paper was, and go through a vetting process so that all discussion is properly moderated,” wrote Felisa Wolfe-Simon of the NASA Astrobiology Institute. “The items you are presenting do not represent the proper way to engage in a scientific discourse and we will not respond in this manner.”

A NASA spokesperson basically reiterated this point of view, indicating that NASA scientists weren’t going to respond to criticism of their work unless that criticism appeared in, you know, a respectable, peer-reviewed outlet. (Fortunately, at least one of the critics already has a draft letter to Science up on her blog.)

I don’t think it’s surprising that people who spend much of their free time blogging about science, and think it’s important to discuss scientific issues in a public venue, generally aren’t going to like being told that science blogging isn’t a legitimate form of scientific discourse. Especially considering that the critics here aren’t laypeople without scientific training; they’re well-respected scientists with areas of expertise that are directly relevant to the paper. In this case, dismissing trenchant criticism because it’s on the web rather than in a peer-reviewed journal seems kind of like telling someone who’s screaming at you that your house is on fire that you’re not going to listen to them until they adopt a more polite tone. It just seems counterproductive.

That said, I personally don’t think we should take the NASA team’s statements at face value. I very much doubt that what the NASA researchers are saying really reflect any deep philosophical view about the role of blogs in scientific discourse; it’s much more likely that they’re simply trying to buy some time while they figure out how to respond. On the face of it, they have a choice between two lousy options: either ignore the criticism entirely, which would be antithetical to the scientific process and would look very bad, or address it head-on–which, judging by the vociferousness and near-unanimity of the commentators, is probably going to be a losing battle. Shifting the terms of the debate by insisting on responding only in a peer-reviewed venue doesn’t really change anything, but it does buy the authors two or three weeks. And two or three weeks is worth like, forty attentional cycles in the blogosphere.

Mind you, I’m not saying we should sympathize with the NASA researchers just because they’re in a tough position. I think one of the main reasons the story’s attracted so much attention is precisely because people see it as a case of justice being served. The NASA team called a major press conference ahead of the paper’s publication, published its results in one of the world’s most prestigious science journals, and yet apparently failed to run relatively basic experimental controls in support of its conclusions. If the critics are to be believed, the NASA researchers are either disingenuous or incompetent; either way, we shouldn’t feel sorry for them.

What I do think this episode shows is that the rules of scientific publishing have fundamentally changed in the last few years–and largely for the better. I haven’t been doing science for very long, but even in the halcyon days of 2003, when I started graduate school, science blogging was practically nonexistent, and the main way you’d find out what other people thought about an influential new paper was by talking to people you knew at conferences (which could take several months) or waiting for critiques or replication failures to emerge in other peer-reviewed journals (which could take years). That kind of delay between publication and evaluation is disastrous for science, because in the time it takes for a consensus to emerge that a paper is no good, several research teams might have already started trying to replicate and extend the reported findings, and several dozen other researchers might have uncritically cited their paper peripherally in their own work. This delay is probably why, as John Ioannidis’ work so elegantly demonstrates, major studies published in high-impact journals tend to exert a disproportionate influence on the literature long after they’ve been resoundingly discredited.

The Arsenic Effect, if we can call it that, provides a nice illustration of the impact of new media on scientific communication. It’s a safe bet that there are now very few people who do anything even vaguely related to the NASA team’s research who haven’t been made aware that the reported findings are controversial. Which means that the process of attempting to replicate (or falsify) the findings will proceed much more quickly than it might have ten or twenty years ago, and there probably won’t be very many people who cite the Science paper as compelling evidence of terrestrial arsenic-based life. Perhaps more importantly, as researchers get used to the idea that their high-profile work is going to be instantly evaluated by thousands of pairs of highly trained eyes, any of which might be attached to a highly prolific pair of typing hands, there will be an increasingly strong disincentive to avoid being careless. That isn’t to say that bad science will disappear, of course; just that, in cases where the badness reflects a pressure to tell a good story at all costs, we’ll probably see less of it.

scientists aren’t dumb; statistics is hard

Wednesday, March 17th, 2010

There’s a feature article in the new issue of Science News on the failure of science “to face the shortcomings of statistics”. The author, Tom Siegfried, argues that many scientific results shouldn’t be believed because they depend on faulty statistical practices:

Even when performed correctly, statistical tests are widely misunderstood and frequently misinterpreted. As a result, countless conclusions in the scientific literature are erroneous, and tests of medical dangers or treatments are often contradictory and confusing.

I have mixed feelings about the article. It’s hard to disagree with the basic idea that many scientific results are the results of statistical malpractice and/or misfortune. And Siegfried generally provides lucid explanations of some common statistical pitfalls when he sticks to the descriptive side of things. For instance, he gives nice accounts of Bayesian inference, of the multiple comparisons problem, and of the distinction between statistical significance and clinical/practical significance. And he nicely articulates what’s wrong with one of the most common (mis)interpretations of p values:

Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct). But many explanations mangle the subtleties in that definition. A recent popular book on issues involving science, for example, states a commonly held misperception about the meaning of statistical significance at the .05 level: “This means that it is 95 percent certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance.”

So as a laundry list of common statistical pitfalls, it works quite nicely.

What I don’t really like about the article is that it seems to lay the blame squarely on the use of statistics to do science, rather than the way statistical analysis tends to be performed. That’s to say, a lay person reading the article could well come away with the impression that the very problem with science is that it relies on statistics. As opposed to the much more reasonable conclusion that science is hard, and statistics is hard, and ensuring that your work sits at the intersection of good science and good statistical practice is even harder. Siegfried all but implies that scientists are silly to base their conclusions on statistical inference. For instance:

It’s science’s dirtiest secret: The “scientific method” of testing hypotheses by statistical analysis stands on a flimsy foundation. Statistical tests are supposed to guide scientists in judging whether an experimental result reflects some real effect or is merely a random fluke, but the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions.

Or:

Experts in the math of probability and statistics are well aware of these problems and have for decades expressed concern about them in major journals. Over the years, hundreds of published papers have warned that science’s love affair with statistics has spawned countless illegitimate findings. In fact, if you believe what you read in the scientific literature, you shouldn’t believe what you read in the scientific literature.

The problem is that there isn’t really any viable alternative to the “love affair with statistics”. Presumably Siegfried doesn’t think (most) scientists ought to be doing qualitative research; so the choice isn’t between statistics and no statistics, it’s between good and bad statistics.

In that sense, the tone of a lot of the article is pretty condescending: it comes off more like Siegfried saying “boy, scientists sure are dumb” and less like the more accurate observation that doing statistics is really hard, and it’s not surprising that even very smart people mess up frequently.

What makes it worse is that Siegfried slips up on a couple of basic points himself, and says some demonstrably false things in a couple of places. For instance, he explains failures to replicate genetic findings this way:

Nowhere are the problems with statistics more blatant than in studies of genetic influences on disease. In 2007, for instance, researchers combing the medical literature found numerous studies linking a total of 85 genetic variants in 70 different genes to acute coronary syndrome, a cluster of heart problems. When the researchers compared genetic tests of 811 patients that had the syndrome with a group of 650 (matched for sex and age) that didn’t, only one of the suspect gene variants turned up substantially more often in those with the syndrome — a number to be expected by chance.

“Our null results provide no support for the hypothesis that any of the 85 genetic variants tested is a susceptibility factor” for the syndrome, the researchers reported in the Journal of the American Medical Association.

How could so many studies be wrong? Because their conclusions relied on “statistical significance,” a concept at the heart of the mathematical analysis of modern scientific experiments.

This is wrong for at least two reasons. One is that, to believe the JAMA study Siegfried is referring to, and disbelieve the results of all 85 previously reported findings, you have to accept the null hypothesis, which is one of the very same errors Siegfried is supposed to be warning us against. In fact, you have to accept the null hypothesis 85 times. In the JAMA paper, the authors are careful to note that it’s possible the actual effects were simply overstated in the original studies, and that at least some of the original findings might still hold under more restrictive conditions. The conclusion that there really is no effect whatsoever is almost never warranted, because you rarely have enough power to rule out even very small effects. But Siegfried offers no such qualifiers; instead, he happily accepts 85 null hypotheses in support of his own argument.

The other issue is that it isn’t really the reliance on statistical significance that causes replication failures; it’s usually the use of excessively liberal statistical criteria. The problem has very little to do with the hypothesis testing framework per se. To see this, consider that if researchers always used a criterion of p < .0000001 instead of the conventional p < .05, there would almost never be any replication failures (because there would almost never be any statistically significant findings, period). So the problem is not so much with the classical hypothesis testing framework as with the choices many researchers make about how to set thresholds within that framework. (That’s not to say that there aren’t any problems associated with frequentist statistics, just that this isn’t really a fair one.)

Anyway, Siegfried’s explanations of the pitfalls of statistical significance then leads him to make what has to be hands-down the silliest statement in the article:

But in fact, there’s no logical basis for using a P value from a single study to draw any conclusion. If the chance of a fluke is less than 5 percent, two possible conclusions remain: There is a real effect, or the result is an improbable fluke. Fisher’s method offers no way to know which is which. On the other hand, if a study finds no statistically significant effect, that doesn’t prove anything, either. Perhaps the effect doesn’t exist, or maybe the statistical test wasn’t powerful enough to detect a small but real effect.

If you take this statement at face value, you should conclude there’s no point in doing statistical analysis, period. No matter what statistical procedure you use, you’re never going to know for cross-your-heart-hope-to-die sure that your conclusions are warranted. After all, you’re always going to have the same two possibilities: either the effect is real, or it’s not (or, if you prefer to frame the problem in terms of magnitude, either the effect is about as big as you think it is, or it’s very different in size). The same exact conclusion goes through if you take a threshold of p < .001 instead of one of p < .05: the effect can still be a spurious and improbable fluke. And it also goes through if you have twelve replications instead of just one positive finding: you could still be wrong (and people have been wrong). So saying that “two possible conclusions remain” isn’t offering any deep insight; it’s utterly vacuous.

The reason scientists use a conventional threshold of p < .05 when evaluating results isn’t because we think it gives us some magical certainty into whether a finding is “real” or not; it’s because it feels like a reasonable level of confidence to shoot for when making claims about whether the null hypothesis of no effect is likely to hold or not. Now there certainly are many problems associated with the hypothesis testing framework–some of them very serious–but if you really believe that “there’s no logical basis for using a P value from a single study to draw any conclusion,” your beef isn’t actually with p values, it’s with the very underpinnings of the scientific enterprise.

Anyway, the bottom line is Siegfried’s article is not so much bad as irresponsible. As an accessible description of some serious problems with common statistical practices, it’s actually quite good. But I guess the sense I got in reading the article was that at some point Siegfried became more interested in writing a contrarian piece about how scientists are falling down on the job than about how doing statistics well is just really hard for almost all of us (I certainly fail at it all the time!). And ironically, in the process of trying to explain just why “science fails to face the shortcomings of statistics”, Siegfried commits some of the very same errors he’s taking scientists to task for.

[UPDATE: Steve Novella says much the same thing here.]

[UPDATE 2: Andrew Gelman has a nice roundup of other comments on Siegfried's article throughout the blogosphere.]