Tag Archives: publishing

strong opinions about data sharing mandates–mine included

Apparently, many scientists have rather strong feelings about data sharing mandates. In the wake of PLOS’s recent announcement–which says that, effective now, all papers published in PLOS journals must deposit their data in a publicly accessible location–a veritable gaggle of scientists have taken to their blogs to voice their outrage and/or support for the policy. The nays have posts like DrugMonkey’s complaint that the inmates are running the asylum at PLOS (more choice posts are here, here, here, and here); the yays have Edmund Hart telling the nays to get over themselves and share their data (more posts here, here, and here). While I’m a bit late to the party (mostly because I’ve been traveling and otherwise indisposed), I guess I’ll go ahead and throw my hat into the ring in support of data sharing mandates. For a number of reasons outlined below, I think time will show the anti-PLOS folks to very clearly be on the wrong side of this issue.

Mandatory public deposition is like, totally way better than a “share-upon-request” approach

You might think that proactive data deposition has little incremental utility over a philosophy of sharing one’s data upon request, since emails are these wordy little things that only take a few minutes of a data-seeker’s time to write. But it’s not just the time and effort that matter. It’s also the psychology and technology. Psychology, because if you don’t know the person on the other end, or if the data is potentially useful but not essential to you, or if you’re the agreeable sort who doesn’t like to bother other people, it’s very easy to just say, “nah, I’ll just go do something else”. Scientists are busy people. If a dataset is a click away, many people will be happy to download that dataset and play with it who wouldn’t feel comfortable emailing the author to ask for it. Technology, because data that isn’t publicly available is data that isn’t publicly indexed. It’s all well and good to say that if someone really wants a dataset, they can email you to ask for it, but if someone doesn’t know about your dataset in the first place–because it isn’t in the first three pages of Google results–they’re going to have a hard time asking.

People don’t actually share on request

Much of the criticism of the PLoS data sharing policy rests on the notion that the policy is unnecessary, because in practice most journals already mandate that authors must share their data upon request. One point that defenders of the PLOS mandate haven’t stressed enough is that such “soft” mandates are largely meaningless. Empirical studies have repeatedly demonstrated  that it’s actually very difficult  to get authors to share their data upon request –even when they’re obligated to do so by the contractual agreement they’ve signed with a publisher. And when researchers do fulfill data sharing requests, they often take inordinately long to do so, and the data often don’t line up properly with what was reported in the paper (as the PLOS editors noted in their explanation for introducing the policy), or reveal potentially serious errors.

Personally, I have to confess that I often haven’t fulfilled other researchers’ requests for my data–and in at least two cases, I never even responded to the request. These failures to share didn’t reflect my desire to hide anything; they occurred largely because I knew it would be a lot of work, and/or the data were no longer readily accessible to me, and/or I was too busy to take care of the request right when it came in. I think I’m sufficiently aware of my own character flaws to know that good intentions are no match for time pressure and divided attention–and that’s precisely why I’d rather submit my work to journals that force me to do the tedious curation work up front, when I have a strong incentive to do it, rather than later, when I don’t.

Comprehensive evaluation requires access to the data

It’s hard to escape the feeling that some of the push-back against the policy is actually rooted in the fear that other researchers will find mistakes in one’s work by going through one’s data. In some cases, this fear is made explicit. For example, DrugMonkey suggested that:

There will be efforts to say that the way lab X deals with their, e.g., fear conditioning trials, is not acceptable and they MUST do it the way lab Y does it. Keep in mind that this is never going to be single labs but rather clusters of lab methods traditions. So we’ll have PLoS inserting itself in the role of how experiments are to be conducted and interpreted!

This rather dire premonition prompted a commenter to ask if it’s possible that DM might ever be wrong about what his data means–necessitating other pairs of eyes and/or opinions. DM’s response was, in essence, “No.”. But clearly, this is wishful thinking: we have plenty of reasons to think that everyone in science–even the luminaries among us–make mistakes all the time. Science is hard. In the fields I’m most familiar with, I rarely read a paper that I don’t feel has some serious flaws–even though nearly all of these papers were written by people who have, in DM’s words, “been at this for a while”. By the same token, I’m certain that other people read each of my papers and feel exactly the same way. Of course, it’s not pleasant to confront our mistakes by putting everything out into the open, and I don’t doubt that one consequence of sharing data proactively is that error-finding will indeed become much more common. At least initially (i.e., until we develop an appreciation for the true rate of error in the average dataset, and become more tolerant of minor problems), this will probably cause everyone some discomfort. But temporary discomfort surely isn’t a good excuse to continue to support practices that clearly impede scientific progress.

Part of the problem, I suspect, is that scientists have collectively internalized as acceptable many practices that are on some level clearly not good for the community as a whole. To take just one example, it’s an open secret in biomedical science that so-called “representative figures” (of spiking neurons, Western blots, or whatever else you like) are rarely truly representative. Frequently, they’re actually among the best examples the authors of a paper were able to find. The communal wink-and-shake agreement to ignore this kind of problem is deeply problematic, in that it likely allows many claims to go unchallenged that are actually not strongly supported by the data. In a world where other researchers could easily go through my dataset and show that the “representative” raster plot I presented in Figure 2C was actually the best case rather than the norm, I would probably have to be more careful about making that kind of claim up front–and someone else might not waste a lot of their time chasing results that can’t possibly be as good as my figures make them look.

Figure 1.  A representative planet.

The Data are a part of the Methods

If you still don’t find this convincing, consider that one could easily have applied nearly all of the arguments people having been making in the blogosphere these past two weeks to that dastardly scientific timesink that is the common Methods sections. Imagine that we lived in a culture where scientists always reported their Results telegraphically–that is, with the brevity of a typical Nature or Science paper, but without the accompanying novel’s worth of Supplementary Methods. Then, when someone first suggested that it might perhaps be a good idea to introduce a separate section that describes in dry, technical language how authors actually produced all those exciting results, we would presumably see many people in the community saying something like the following:

Why should I bother to tell you in excruciating detail what software, reagents, and stimuli I used in my study? The vast majority of readers will never try to directly replicate my experiment, and those who do want to can just email me to get the information they need–which of course I’m always happy to provide in a timely and completely disinterested fashion. Asking me to proactively lay out every little methodological step I took is really unreasonable; it would take a very long time to write a clear “Methods” section of the kind you propose, and the benefits seem very dubious. I mean, the only thing that will happen if I adopt this new policy is that half of my competitors will start going through this new section with a fine-toothed comb in order to find problems, and the other half will now be able to scoop me by repeating the exact procedures I used before I have a chance to follow them up myself! And for what? Why do I need to tell everyone exactly what I did? I’m an expert with many years of experience in this field! I know what I’m doing, and I don’t appreciate your casting aspersions on my work and implying that my conclusions might not always be 100% sound!

As far as I can see, there isn’t any qualitative difference between reporting detailed Methods and providing comprehensive Data. In point of fact, many decisions about which methods one should use depend entirely on the nature of the data, so it’s often actually impossible to evaluate the methodological choices the authors made without seeing their data. If DrugMonkey et al think it’s crazy for one researcher to want access to another researcher’s data in order to determine whether the distribution of some variable looks normal, they should also think it’s crazy for researchers to have to report their reasoning for choosing a particular transformation in the first place. Or for using a particular reagent. Or animal strain. Or learning algorithm, or… you get the idea. But as Bjorn Brembs succinctly put it, in the digital age, this is silly: for all intents and purposes, there’s no longer any difference between text and data.

The data are funded by the taxpayers, and (in some sense) belong to the taxpayers

People vary widely in the extent to which they feel the public deserves to have access to the products of the work it funds. I don’t think I hold a particularly extreme position in this regard, in the sense that I don’t think the mere fact that someone’s effort is funded by the public automatically means any of their products should be publicly available for anyone’s perusal or use. However, when we’re talking about scientific data–where the explicit rationale for funding the work is to produce new generalizable knowledge, and where the marginal cost of replicating digital data is close to zero–I really don’t see any reason not to push very strongly to force scientists to share their data. I’m sympathetic to claims about scooping and credit assignment, but as a number of other folks have pointed out in comment threads, these are fundamentally arguments in favor of better credit assignment, and not arguments against sharing data. The fear some people have of being scooped is not sufficient justification for impeding our collective scientific progress.

It’s also worth noting that, in principle, PLOS’s new data sharing policy shouldn’t actually make it any easier for someone else to scoop you. Remember that under PLOS’s current data sharing mandate–as well as the equivalent policies at most other scientific journals–authors are already required to provide their data to anyone else upon request. Critics who argue that the new public archiving mandate opens the door to being scooped are in effect admitting that the old mandate to share upon request doesn’t work, because in theory there already shouldn’t really be anything preventing me from scooping you with your data simply by asking you for it (other than social norms–but then, the people who are actively out to usurp others’ ideas are the least likely to abide by those norms anyway). It’s striking to see how many of the posts defending the “share-upon-request” approach have no compunction in saying that they’re currently only willing to share their data after determining what the person on the other end wants to use it for–in clear violation of most journals’ existing policy.

It’s really not that hard

Organizing one’s data or code in a form minimally suitable for public consumption isn’t much fun. I do it fairly regularly; I know it sucks. It takes some time out of your day, and requires you to allocate resources to the problem that could otherwise be directed elsewhere. That said, a lot of the posts complaining about how much effort the new policy requires seem absurdly overwrought. There seems to be a widespread belief–which, as far as I can tell, isn’t supported by a careful reading of the actual PLOS policy–that there’s some incredibly strict standard that datasets have to live up to before pulic release. I don’t really understand where this concern comes from. Personally, I spend much of my time analyzing data other people have collected. I’ve worked with many other people’s data, and rarely is it in exactly the form I would like. Often times it’s not even in the ballpark of what I’d like. And I’ve had to invest a considerable amount of my time understanding what columns and rows mean, and scrounging for morsels of (poor) documentation. My working assumption when I do this–and, I think, most other people’s–is that the onus is on me to expend some effort figuring out what’s in a dataset I wish to use, and not on the author to release that dataset in a form that a completely naive person could understand without any effort. Of course it would be nice if everyone put their data up on the web in a form that maximized accessibility, but it certainly isn’t expected*. In asking authors to deposit their data publicly, PLOS isn’t asserting that there’s a specific format or standard that all data must meet; they’re just saying data must meet accepted norms. Since those norms depend on one’s field, it stands to reason that expectations will be lower for a 10-TB fMRI dataset than for an 800-row spreadsheet of behavioral data.

There are some valid concerns, but…

I don’t want to sound too Pollyannaish about all this. I’m not suggesting that the PLOS policy is perfect, or that issues won’t arise in the course of its implementation and enforcement. It’s very clear that there are some domains in which data sharing is a hassle, and I sympathize with the people who’ve pointed out that it’s not really clear what “all” the data means–is it the raw data, which aren’t likely to be very useful to anyone, or the post-processed data, which may be too close to the results reported in the paper? But such domain- or case-specific concerns are grossly outweighed by the very general observation that it’s often impossible to evaluate previous findings adequately, or to build a truly replicable science, if you don’t have access to other scientists’ data. There’s no doubt that edge cases will arise in the course of enforcing the new policy. But they’ll be dealt with on a case-by-case basis, exactly as the PLOS policy indicates. In the meantime, our default assumption should be that editors at PLOS–who are, after all, also working scientists–will behave reasonably, since they face many of the same considerations in their own research. When a researcher tells an editor that she doesn’t have anywhere to put the 50 TB of raw data for her imaging study, I expect that that editor will typically respond by saying, “fine, but surely you can drag and drop a directory full of the first- and second-level beta images, along with a basic description, into NeuroVault, right?”, and not “Whut!? No raw DICOM images, no publication!”

As for the people who worry that by sharing their data, they’ll be giving away a competitive advantage… to be honest, I think many of these folks are mistaken about the dire consequences that would ensue if they shared their data publicly. I suspect that many of the researchers in question would be pleasantly surprised at the benefits of data sharing (increased citation rates, new offers of collaboration, etc.) Still, it’s clear enough that some of the people who’ve done very well for themselves in the current scientific system–typically by leveraging some incredibly difficult-to-acquire dataset into a cottage industry of derivative studies–would indeed do much less well in a world where open data sharing was mandatory. What I fail to see, though, is why PLOS, or the scientific community as a whole, should care very much about this latter group’s concerns. As far as I can tell, PLOS’s new policy is a significant net positive for the scientific community as a whole, even if it hurts one segment of that community in the short term. For the moment, scientists who harbor proprietary attitudes towards their data can vote with their feet by submitting their papers somewhere other than PLOS. Contrary to the dire premonitions floating around, I very much doubt any potential drop in submissions is going to deliver a terminal blow to PLOS (and the upside is that the articles that do get published in PLOS will arguably be of higher quality). In the medium-to-long term, I suspect that cultural norms surrounding who gets credit for acquiring and sharing data vs. analyzing and reporting new findings based on those data are are going to undergo a sea change–to the point where in the not-too-distant future, the scoopophobia that currently drives many people to privately hoard their data is a complete non-factor. At that point, it’ll be seen as just plain common sense that if you want your scientific assertions to be taken seriously, you need to make the data used to support those assertions available for public scrutiny, re-analysis, and re-use.

 

* As a case in point, just yesterday I came across a publicly accessible dataset I really wanted to use, but that was in SPSS format. I don’t own a copy of SPSS, so I spent about an hour trying to get various third-party libraries to extract the data appropriately, without any luck. So eventually I sent the file to a colleague who was helpful enough to convert it. My first thought when I received the tab-delimited file in my mailbox this morning was not “ugh, I can’t believe they released the file in SPSS”, it was “how amazing is it that I can download this gigantic dataset acquired half the world away instantly, and with just one minor hiccup, be able to test a novel hypothesis in a high-powered way without needing to spend months of time collecting data?”

the truth is not optional: five bad reasons (and one mediocre one) for defending the status quo

You could be forgiven for thinking that academic psychologists have all suddenly turned into professional whistleblowers. Everywhere you look, interesting new papers are cropping up purporting to describe this or that common-yet-shady methodological practice, and telling us what we can collectively do to solve the problem and improve the quality of the published literature. In just the last year or so, Uri Simonsohn introduced new techniques for detecting fraud, and used those tools to identify at least 3 cases of high-profile, unabashed data forgery. Simmons and colleagues reported simulations demonstrating that standard exploitation of research degrees of freedom in analysis can produce extremely high rates of false positive findings. Pashler and colleagues developed a “Psych file drawer” repository for tracking replication attempts. Several researchers raised trenchant questions about the veracity and/or magnitude of many high-profile psychological findings such as John Bargh’s famous social priming effects. Wicherts and colleagues showed that authors of psychology articles who are less willing to share their data upon request are more likely to make basic statistical errors in their papers. And so on and so forth. The flood shows no signs of abating; just last week, the APS journal Perspectives in Psychological Science announced that it’s introducing a new “Registered Replication Report” section that will commit to publishing pre-registered high-quality replication attempts, irrespective of their outcome.

Personally, I think these are all very welcome developments for psychological science. They’re solid indications that we psychologists are going to be able to police ourselves successfully in the face of some pretty serious problems, and they bode well for the long-term health of our discipline. My sense is that the majority of other researchers–perhaps the vast majority–share this sentiment. Still, as with any zeitgeist shift, there are always naysayers. In discussing these various developments and initiatives with other people, I’ve found myself arguing, with somewhat surprising frequency, with people who for various reasons think it’s not such a good thing that Uri Simonsohn is trying to catch fraudsters, or that social priming findings are being questioned, or that the consequences of flexible analyses are being exposed. Since many of the arguments I’ve come across tend to recur, I thought I’d summarize the most common ones here–along with the rebuttals I usually offer for why, with one possible exception, the arguments for giving a pass to sloppy-but-common methodological practices are not very compelling.

“But everyone does it, so how bad can it be?”

We typically assume that long-standing conventions must exist for some good reason, so when someone raises doubts about some widespread practice, it’s quite natural to question the person raising the doubts rather than the practice itself. Could it really, truly be (we say) that there’s something deeply strange and misguided about using p values? Is it really possible that the reporting practices converged on by thousands of researchers in tens of thousands of neuroimaging articles might leave something to be desired? Could failing to correct for the many researcher degrees of freedom associated with most datasets really inflate the false positive rate so dramatically?

The answer to all these questions, of course, is yes–or at least, we should allow that it could be yes. It is, in principle, entirely possible for an entire scientific field to regularly do things in a way that isn’t very good. There are domains where appeals to convention or consensus make perfect sense, because there are few good reasons to do things a certain way except inasmuch as other people do them the same way. If everyone else in your country drives on the right side of the road, you may want to consider driving on the right side of the road too. But science is not one of those domains. In science, there is no intrinsic benefit to doing things just for the sake of convention. In fact, almost by definition, major scientific advances are ones that tend to buck convention and suggest things that other researchers may not have considered possible or likely.

In the context of common methodological practice, it’s no defense at all to say but everyone does it this way, because there are usually relatively objective standards by which we can gauge the quality of our methods, and it’s readily apparent that there are many cases where the consensus approach leave something to be desired. For instance, you can’t really justify failing to correct for multiple comparisons when you report a single test that’s just barely significant at p < .05 on the grounds that nobody else corrects for multiple comparisons in your field. That may be a valid explanation for why your paper successfully got published (i.e., reviewers didn’t want to hold your feet to the fire for something they themselves are guilty of in their own work), but it’s not a valid defense of the actual science. If you run a t-test on randomly generated data 20 times, you will, on average, get a significant result, p < .05, once. It does no one any good to argue that because the convention in a field is to allow multiple testing–or to ignore statistical power, or to report only p values and not effect sizes, or to omit mention of conditions that didn’t ‘work’, and so on–it’s okay to ignore the issue. There’s a perfectly reasonable question as to whether it’s a smart career move to start imposing methodological rigor on your work unilaterally (see below), but there’s no question that the mere presence of consensus or convention surrounding a methodological practice does not make that practice okay from a scientific standpoint.

“But psychology would break if we could only report results that were truly predicted a priori!”

This is a defense that has some plausibility at first blush. It’s certainly true that if you force researchers to correct for multiple comparisons properly, and report the many analyses they actually conducted–and not just those that “worked”–a lot of stuff that used to get through the filter will now get caught in the net. So, by definition, it would be harder to detect unexpected effects in one’s data–even when those unexpected effects are, in some sense, ‘real’. But the important thing to keep in mind is that raising the bar for what constitutes a believable finding doesn’t actually prevent researchers from discovering unexpected new effects; all it means is that it becomes harder to report post-hoc results as pre-hoc results. It’s not at all clear why forcing researchers to put in more effort validating their own unexpected finding is a bad thing.

In fact, forcing researchers to go the extra mile in this way would have one exceedingly important benefit for the field as a whole: it would shift the onus of determining whether an unexpected result is plausible enough to warrant pursuing away from the community as a whole, and towards the individual researcher who discovered the result in the first place. As it stands right now, if I discover an unexpected result (p < .05!) that I can make up a compelling story for, there’s a reasonable chance I might be able to get that single result into a short paper in, say, Psychological Science. And reap all the benefits that attend getting a paper into a “high-impact” journal. So in practice there’s very little penalty to publishing questionable results, even if I myself am not entirely (or even mostly) convinced that those results are reliable. This state of affairs is, to put it mildly, not A Good Thing.

In contrast, if you as an editor or reviewer start insisting that I run another study that directly tests and replicates my unexpected finding before you’re willing to publish my result, I now actually have something at stake. Because it takes time and money to run new studies, I’m probably not going to bother to follow up on my unexpected finding unless I really believe it. Which is exactly as it should be: I’m the guy who discovered the effect, and I know about all the corners I have or haven’t cut in order to produce it; so if anyone should make the decision about whether to spend more taxpayer money chasing the result, it should be me. You, as the reviewer, are not in a great position to know how plausible the effect truly is, because you have no idea how many different types of analyses I attempted before I got something to ‘work’, or how many failed studies I ran that I didn’t tell you about. Given the huge asymmetry in information, it seems perfectly reasonable for reviewers to say, You think you have a really cool and unexpected effect that you found a compelling story for? Great; go and directly replicate it yourself and then we’ll talk.

“But mistakes happen, and people could get falsely accused!”

Some people don’t like the idea of a guy like Simonsohn running around and busting people’s data fabrication operations for the simple reason that they worry that the kind of approach Simonsohn used to detect fraud is just not that well-tested, and that if we’re not careful, innocent people could get swept up in the net. I think this concern stems from fundamentally good intentions, but once again, I think it’s also misguided.

For one thing, it’s important to note that, despite all the press, Simonsohn hasn’t actually done anything qualitatively different from what other whistleblowers or skeptics have done in the past. He may have suggested new techniques that improve the efficiency with which cheating can be detected, but it’s not as though he invented the ability to report or investigate other researchers for suspected misconduct. Researchers suspicious of other researchers’ findings have always used qualitatively similar arguments to raise concerns. They’ve said things like, hey, look, this is a pattern of data that just couldn’t arise by chance, or, the numbers are too similar across different conditions.

More to the point, perhaps, no one is seriously suggesting that independent observers shouldn’t be allowed to raise their concerns about possible misconduct with journal editors, professional organizations, and universities. There really isn’t any viable alternative. Naysayers who worry that innocent people might end up ensnared by false accusations presumably aren’t suggesting that we do away with all of the existing mechanisms for ensuring accountability; but since the role of people like Simonsohn is only to raise suspicion and provide evidence (and not to do the actual investigating or firing), it’s clear that there’s no way to regulate this type of behavior even if we wanted to (which I would argue we don’t). If I wanted to spend the rest of my life scanning the statistical minutiae of psychology articles for evidence of misconduct and reporting it to the appropriate authorities (and I can assure you that I most certainly don’t), there would be nothing anyone could do to stop me, nor should there be. Remember that accusing someone of misconduct is something anyone can do, but establishing that misconduct has actually occurred is a serious task that requires careful internal investigation. No one–certainly not Simonsohn–is suggesting that a routine statistical test should be all it takes to end someone’s career. In fact, Simonsohn himself has noted that he identified a 4th case of likely fraud that he dutifully reported to the appropriate authorities only to be met with complete silence. Given all the incentives universities and journals have to look the other way when accusations of fraud are made, I suspect we should be much more concerned about the false negative rate than the false positive rate when it comes to fraud.

“But it hurts the public’s perception of our field!”

Sometimes people argue that even if the field does have some serious methodological problems, we still shouldn’t discuss them publicly, because doing so is likely to instill a somewhat negative view of psychological research in the public at large. The unspoken implication being that, if the public starts to lose confidence in psychology, fewer students will enroll in psychology courses, fewer faculty positions will be created to teach students, and grant funding to psychologists will decrease. So, by airing our dirty laundry in public, we’re only hurting ourselves. I had an email exchange with a well-known researcher to exactly this effect a few years back in the aftermath of the Vul et al “voodoo correlations” paper–a paper I commented on to the effect that the problem was even worse than suggested. The argument my correspondent raised was, in effect, that we (i.e., neuroimaging researchers) are all at the mercy of agencies like NIH to keep us employed, and if it starts to look like we’re clowning around, the unemployment rate for people with PhDs in cognitive neuroscience might start to rise precipitously.

While I obviously wouldn’t want anyone to lose their job or their funding solely because of a change in public perception, I can’t say I’m very sympathetic to this kind of argument. The problem is that it places short-term preservation of the status quo above both the long-term health of the field and the public’s interest. For one thing, I think you have to be quite optimistic to believe that some of the questionable methodological practices that are relatively widespread in psychology (data snooping, selective reporting, etc.) are going to sort themselves out naturally if we just look the other way and let nature run its course. The obvious reason for skepticism in this regard is that many of the same criticisms have been around for decades, and it’s not clear that anything much has improved. Maybe the best example of this is Gigerenzer and Sedlmeier’s 1989 paper entitled “Do studies of statistical power have an effect on the power of studies?“, in which the authors convincingly showed that despite three decades of work by luminaries like Jacob Cohen advocating power analyses, statistical power had not risen appreciably in psychology studies. The presence of such unwelcome demonstrations suggests that sweeping our problems under the rug in the hopes that someone (the mice?) will unobtrusively take care of them for us is wishful thinking.

In any case, even if problems did tend to solve themselves when hidden away from the prying eyes of the media and public, the bigger problem with what we might call the “saving face” defense is that it is, fundamentally, an abuse of taxypayers’ trust. As with so many other things, Richard Feynman summed up the issue eloquently in his famous Cargo Cult science commencement speech:

For example, I was a little surprised when I was talking to a friend who was going to go on the radio. He does work on cosmology and astronomy, and he wondered how he would explain what the applications of this work were. “Well,” I said, “there aren’t any.” He said, “Yes, but then we won’t get support for more research of this kind.” I think that’s kind of dishonest. If you’re representing yourself as a scientist, then you should explain to the layman what you’re doing–and if they don’t want to support you under those circumstances, then that’s their decision.

The fact of the matter is that our livelihoods as researchers depend directly on the goodwill of the public. And the taxpayers are not funding our research so that we can “discover” interesting-sounding but ultimately unreplicable effects. They’re funding our research so that we can learn more about the human mind and hopefully be able to fix it when it breaks. If a large part of the profession is routinely employing practices that are at odds with those goals, it’s not clear why taxpayers should be footing the bill. From this perspective, it might actually be a good thing for the field to revise its standards, even if (in the worst-case scenario) that causes a short-term contraction in employment.

“But unreliable effects will just fail to replicate, so what’s the big deal?”

This is a surprisingly common defense of sloppy methodology, maybe the single most common one. It’s also an enormous cop-out, since it pre-empts the need to think seriously about what you’re doing in the short term. The idea is that, since no single study is definitive, and a consensus about the reality or magnitude of most effects usually doesn’t develop until many studies have been conducted, it’s reasonable to impose a fairly low bar on initial reports and then wait and see what happens in subsequent replication efforts.

I think this is a nice ideal, but things just don’t seem to work out that way in practice. For one thing, there doesn’t seem to be much of a penalty for publishing high-profile results that later fail to replicate. The reason, I suspect, is that we incline to give researchers the benefit of the doubt: surely (we say to ourselves), Jane Doe did her best, and we like Jane, so why should we question the work she produces? If we’re really so skeptical about her findings, shouldn’t we go replicate them ourselves, or wait for someone else to do it?

While this seems like an agreeable and fair-minded attitude, it isn’t actually a terribly good way to look at things. Granted, if you really did put in your best effort–dotted all your i’s and crossed all your t’s–and still ended up reporting a false result, we shouldn’t punish you for it. I don’t think anyone is seriously suggesting that researchers who inadvertently publish false findings should be ostracized or shunned. On the other hand, it’s not clear why we should continue to celebrate scientists who ‘discover’ interesting effects that later turn out not to replicate. If someone builds a career on the discovery of one or more seemingly important findings, and those findings later turn out to be wrong, the appropriate attitude is to update our beliefs about the merit of that person’s work. As it stands, we rarely seem to do this.

In any case, the bigger problem with appeals to replication is that the delay between initial publication of an exciting finding and subsequent consensus disconfirmation can be very long, and often spans entire careers. Waiting decades for history to prove an influential idea wrong is a very bad idea if the available alternative is to nip the idea in the bud by requiring stronger evidence up front.

There are many notable examples of this in the literature. A well-publicized recent one is John Bargh’s work on the motor effects of priming people with elderly stereotypes–namely, that priming people with words related to old age makes them walk away from the experiment more slowly. Bargh’s original paper was published in 1996, and according to Google Scholar, has now been cited over 2,000 times. It has undoubtedly been hugely influential in directing many psychologists’ research programs in certain directions (in many cases, in directions that are equally counterintuitive and also now seem open to question). And yet it’s taken over 15 years for a consensus to develop that the original effect is at the very least much smaller in magnitude than originally reported, and potentially so small as to be, for all intents and purposes, “not real”. I don’t know who reviewed Bargh’s paper back in 1996, but I suspect that if they ever considered the seemingly implausible size of the effect being reported, they might have well thought to themselves, well, I’m not sure I believe it, but that’s okay–time will tell. Time did tell, of course; but time is kind of lazy, so it took fifteen years for it to tell. In an alternate universe, a reviewer might have said, well, this is a striking finding, but the effect seems implausibly large; I would like you to try to directly replicate it in your lab with a much larger sample first. I recognize that this is onerous and annoying, but my primary responsibility is to ensure that only reliable findings get into the literature, and inconveniencing you seems like a small price to pay. Plus, if the effect is really what you say it is, people will be all the more likely to believe you later on.

Or take the actor-observer asymmetry, which appears in just about every introductory psychology textbook written in the last 20 – 30 years. It states that people are relatively more likely to attribute their own behavior to situational factors, and relatively more likely to attribute other agents’ behaviors to those agents’ dispositions. When I slip and fall, it’s because the floor was wet; when you slip and fall, it’s because you’re dumb and clumsy. This putative asymmetry was introduced and discussed at length in a book by Jones and Nisbett in 1971, and hundreds of studies have investigated it at this point. And yet a 2006 meta-analysis by Malle suggested that the cumulative evidence for the actor-observer asymmetry is actually very weak. There are some specific circumstances under which you might see something like the postulated effect, but what is quite clear is that it’s nowhere near strong enough an effect to justify being routinely invoked by psychologists and even laypeople to explain individual episodes of behavior. Unfortunately, at this point it’s almost impossible to dislodge the actor-observer asymmetry from the psyche of most researchers–a reality underscored by the fact that the Jones and Nisbett book has been cited nearly 3,000 times, whereas the 1996 meta-analysis has been cited only 96 times (a very low rate for an important and well-executed meta-analysis published in Psychological Bulletin).

The fact that it can take many years–whether 15 or 45–for a literature to build up to the point where we’re even in a position to suggest with any confidence that an initially exciting finding could be wrong means that we should be very hesitant to appeal to long-term replication as an arbiter of truth. Replication may be the gold standard in the very long term, but in the short and medium term, appealing to replication is a huge cop-out. If you can see problems with an analysis right now that cast aspersions on a study’s results, it’s an abdication of responsibility to downplay your concerns and wait for someone else to come along and spend a lot more time and money trying to replicate the study. You should point out now why you have concerns. If the authors can address them, the results will look all the better for it. And if the authors can’t address your concerns, well, then, you’ve just done science a service. If it helps, don’t think of it as a matter of saying mean things about someone else’s work, or of asserting your own ego; think of it as potentially preventing a lot of very smart people from wasting a lot of time chasing down garden paths–and also saving a lot of taxpayer money. Remember that our job as scientists is not to make other scientists’ lives easy in the hopes they’ll repay the favor when we submit our own papers; it’s to establish and apply standards that produce convergence on the truth in the shortest amount of time possible.

“But it would hurt my career to be meticulously honest about everything I do!”

Unlike the other considerations listed above, I think the concern that being honest carries a price when it comes to do doing research has a good deal of merit to it. Given the aforementioned delay between initial publication and later disconfirmation of findings (which even in the best case is usually longer than the delay between obtaining a tenure-track position and coming up for tenure), researchers have many incentives to emphasize expediency and good story-telling over accuracy, and it would be disingenuous to suggest otherwise. No malevolence or outright fraud is implied here, mind you; the point is just that if you keep second-guessing and double-checking your analyses, or insist on routinely collecting more data than other researchers might think is necessary, you will very often find that results that could have made a bit of a splash given less rigor are actually not particularly interesting upon careful cross-examination. Which means that researchers who have, shall we say, less of a natural inclination to second-guess, double-check, and cross-examine their own work will, to some degree, be more likely to publish results that make a bit of a splash (it would be nice to believe that pre-publication peer review filters out sloppy work, but empirically, it just ain’t so). So this is a classic tragedy of the commons: what’s good for a given individual, career-wise, is clearly bad for the community as a whole.

I wish I had a good solution to this problem, but I don’t think there are any quick fixes. The long-term solution, as many people have observed, is to restructure the incentives governing scientific research in such a way that individual and communal benefits are directly aligned. Unfortunately, that’s easier said than done. I’ve written a lot both in papers (1, 2, 3) and on this blog (see posts linked here) about various ways we might achieve this kind of realignment, but what’s clear is that it will be a long and difficult process. For the foreseeable future, it will continue to be an understandable though highly lamentable defense to say that the cost of maintaining a career in science is that one sometimes has to play the game the same way everyone else plays the game, even if it’s clear that the rules everyone plays by are detrimental to the communal good.

 

Anyway, this may all sound a bit depressing, but I really don’t think it should be taken as such. Personally I’m actually very optimistic about the prospects for large-scale changes in the way we produce and evaluate science within the next few years. I do think we’re going to collectively figure out how to do science in a way that directly rewards people for employing research practices that are maximally beneficial to the scientific community as a whole. But I also think that for this kind of change to take place, we first need to accept that many of the defenses we routinely give for using iffy methodological practices are just not all that compelling.

The reviewer’s dilemma, or why you shouldn’t get too meta when you’re supposed to be writing a review that’s already overdue

When I review papers for journals, I often find myself facing something of a tension between two competing motives. On the one hand, I’d like to evaluate each manuscript as an independent contribution to the scientific literature–i.e., without having to worry about how the manuscript stacks up against other potential manuscripts I could be reading. The rationale being that the plausibility of the findings reported in a manuscript shouldn’t really depend on what else is being published in the same journal, or in the field as a whole: if there are methodological problems that threaten the conclusions, they shouldn’t become magically more or less problematic just because some other manuscript has (or doesn’t have) gaping holes. Reviewing should simply be a matter of documenting one’s major concerns and suggestions and sending them back to the Editor for infallible judgment.

The trouble with this idea is that if you’re of a fairly critical bent, you probably don’t believe the majority of the findings reported in the manuscripts sent to you to review. Empirically, this actually appears to be the right attitude to hold, because as a good deal of careful work by biostatisticians like John Ioannidis shows, most published research findings are false, and most true associations are inflated. So, in some ideal world, where the job of a reviewer is simply to assess the likelihood that the findings reported in a paper provide an accurate representation of reality, and/or to identify ways of bringing those findings closer in line with reality, skepticism is the appropriate default attitude. Meaning, if you keep the question “why don’t I believe these results?” firmly in mind as you read through a paper and write your review, you probably aren’t going to go wrong all that often.

The problem is that, for better or worse, one’s job as a reviewer isn’t really–or at least, solely–to evaluate the plausibility of other people’s findings. In large part, it’s to evaluate the plausibility of reported findings in relation to the other stuff that routinely gets published in the same journal. For instance, if you regularly reviewing papers for a very low-tier journal, the editor is probably not going to be very thrilled to hear you say “well, Ms. Editor, none of the last 15 papers you’ve sent me are very good, so you should probably just shut down the journal.” So a tension arises between writing a comprehensive review that accurately captures what the reviewer really thinks about the results–which is often (at least in my case) something along the lines of “pffft, there’s no fucking way this is true”–and writing a review that weighs the merits of the reviewed manuscript relative to the other candidates for publication in the same journal.

To illustrate, suppose I review a paper and decide that, in my estimation, there’s only a 20% chance the key results reported in the paper would successfully replicate (for the sake of argument, we’ll pretend I’m capable of this level of precision). Should I recommend outright rejection? Maybe, since 1 in 5 odds of long-term replication don’t seem very good. But then again, what if 20% is actually better than average? What if I think the average article I’m sent to review only has a 10% chance of holding up over time? In that case, if I recommend rejection of the 20% article, and the editor follows my recommendation, most of the time I’ll actually be contributing to the journal publishing poorer quality articles than if I’d recommended accepting the manuscript, even if I’m pretty sure the findings reported in the manuscript are false.

Lest this sound like I’m needlessly overanalyzing the review process instead of buckling down and writing my own overdue reviews (okay, you’re right, now stop being a jerk), consider what happens when you scale the problem up. When journal editors send reviewers manuscripts to look over, the question they really want an answer to is, “how good is this paper compared to everything else that crosses my desk?” But most reviewers naturally incline to answer a somewhat different–and easier–question, namely, “in the grand scheme of life, the universe, and everything, how good is this paper?” The problem, then, is that if the variance in curmudgeonliness between reviewers exceeds the (reliable) variance within reviewers, then arguably the biggest factor in determining whether or not a given paper gets rejected is simply who happens to review it. Not how much expertise the reviewer has, or even how ‘good’ they are (in the sense that some reviewers are presumably better than others at identifying serious problems and overlooking trivial ones), but simply how critical they are on average. Which is to say, if I’m Reviewer 2 on your manuscript, you’ll probably have a better chance of rejection than if Reviewer 2 is someone who characteristically writes one-paragraph reviews that begin with the words “this is an outstanding and important piece of work…”

Anyway, on some level this is a pretty trivial observation; after all, we all know that the outcome of the peer review process is, to a large extent, tantamount to a roll of the dice. We know that there are cranky reviewers and friendly reviewers, and we often even have a sense of who they are, which is why we often suggest people to include or exclude as reviewers in our cover letters. The practical question though–and the reason for bringing this up here–is this: given that we have this obvious and ubiquitous problem of reviewers having different standards for what’s publishable, and that this undeniably impacts the outcome of peer review, are there any simple steps we could take to improve the reliability of the review process?

The way I’ve personally made peace between my desire to provide the most comprehensive and accurate review I can and the pragmatic need to evaluate each manuscript in relation to other manuscripts is to use the “comments to the Editor” box to provide some additional comments about my review. Usually what I end up doing is writing my review with little or no thought for practical considerations such as “how prestigious is this journal” or “am I a particularly harsh reviewer” or “is this a better or worse paper than most others in this journal”. Instead, I just write my review, and then when I’m done, I use the comments to the editor to say things like “I’m usually a pretty critical reviewer, so don’t take the length of my review as an indication I don’t like the manuscript, because I do,” or, “this may seem like a negative review, but it’s actually more positive than most of my reviews, because I’m a huge jerk.” That way I can appease my conscience by writing the review I want to while still giving the editor some indication as to where I fit in the distribution of reviewers they’re likely to encounter.

I don’t know if this approach makes any difference at all, and maybe editors just routinely ignore this kind of thing; it’s just the best solution I’ve come up with that I can implement all by myself, without asking anyone else to change their behavior. But if we allow ourselves to contemplate alternative approaches that include changes to the review process itself (while still adhering to the standard pre-publication review model, which, like many other people, I’ve argued is fundamentally dysfunctional), then there are many other possibilities.

One idea, for instance, would be to include calibration questions that could be used to estimate (and correct for) individual differences in curmudgeonliness. For instance, in addition to questions about the merit of the manuscript itself, the review form could have a question like “what proportion of articles you review do you estimate end up being rejected?” or “do you consider yourself a more critical or less critical reviewer than most of your peers?”

Another, logistically more difficult, idea would be to develop a centralized database of review outcomes, so that editors could see what proportion of each reviewer’s assignments ultimately end up being rejected (though they couldn’t see the actual content of the reviews). I don’t know if this type of approach would improve matters at all; it’s quite possible that the review process is fundamentally so inefficient and slow that editors just don’t have the time to spend worrying about this kind of thing. But it’s hard to believe that there aren’t some simple calibration steps we could take to bring reviewers into closer alignment with one another–even if we’re confined to working within the standard pre-publication model of peer review. And given the abysmally low reliability of peer review, even small improvements could potentially produce large benefits in the aggregate.

building better platforms for evaluating science: a request for feedback

UPDATE 4/20/2012: a revised version of the paper mentioned below is now available here.

A couple of months ago I wrote about a call for papers for a special issue of Frontiers in Computational Neuroscience focusing on “Visions for Open Evaluation of Scientific Papers by Post-Publication Peer Review“. I wrote a paper for the issue, the gist of which is that many of the features scientists should want out of a next-generation open evaluation platform are already implemented all over the place in social web applications, so that building platforms for evaluating scientific output should be more a matter of adapting existing techniques than having to come up with brilliant new approaches. I’m talking about features like recommendation engines, APIs, and reputation systems, which you can find everywhere from Netflix to Pandora to Stack Overflow to Amazon, but (unfortunately) virtually nowhere in the world of scientific publishing.

Since the official deadline for submission is two months away (no, I’m not so conscientious that I habitually finish my writing assignments two months ahead of time–I just failed to notice that the deadline had been pushed way back), I figured I may as well use the opportunity to make the paper openly accessible right now in the hopes of soliciting some constructive feedback. This is a topic that’s kind of off the beaten path for me, and I’m not convinced I really know what I’m talking about (well, fine, I’m actually pretty sure I don’t know what I’m talking about), so I’d love to get some constructive criticism from people before I submit a final version of the manuscript. Not only from scientists, but ideally also from people with experience developing social web applications–or actually, just about anyone with good ideas about how to implement and promote next-generation evaluation platforms. I mean, if you use Netflix or reddit regularly, you’re pretty much a de facto expert on collaborative filtering and recommendation systems, right?

Anyway, here’s the abstract:

Traditional pre-publication peer review of scientific output is a slow, inefficient, and unreliable process. Efforts to replace or supplement traditional evaluation models with open evaluation platforms that leverage advances in information technology are slowly gaining traction, but remain in the early stages of design and implementation. Here I discuss a number of considerations relevant to the development of such platforms. I focus particular attention on three core elements that next-generation evaluation platforms should strive to emphasize, including (a) open and transparent access to accumulated evaluation data, (b) personalized and highly customizable performance metrics, and (c) appropriate short-term incentivization of the userbase. Because all of these elements have already been successfully implemented on a large scale in hundreds of existing social web applications, I argue that development of new scientific evaluation platforms should proceed largely by adapting existing techniques rather than engineering entirely new evaluation mechanisms. Successful implementation of open evaluation platforms has the potential to substantially advance both the pace and the quality of scientific publication and evaluation, and the scientific community has a vested interest in shifting towards such models as soon as possible.

You can download the PDF here (or grab it from SSRN here). It features a cameo by Archimedes and borrows concepts liberally from sites like reddit, Netflix, and Stack Overflow (with attribution, of course). I’d love to hear your comments; you can either leave them below or email me directly. Depending on what kind of feedback I get (if any), I’ll try to post a revised version of the paper here in a month or so that works in people’s comments and suggestions.

(fanciful depiction of) Archimedes, renowned ancient Greek mathematician and co-inventor (with Al Gore) of the open access internet repository

Too much p = .048? Towards partial automation of scientific evaluation

Distinguishing good science from bad science isn’t an easy thing to do. One big problem is that what constitutes ‘good’ work is, to a large extent, subjective; I might love a paper you hate, or vice versa. Another problem is that science is a cumulative enterprise, and the value of each discovery is, in some sense, determined by how much of an impact that discovery has on subsequent work–something that often only becomes apparent years or even decades after the fact. So, to an uncomfortable extent, evaluating scientific work involves a good deal of guesswork and personal preference, which is probably why scientists tend to fall back on things like citation counts and journal impact factors as tools for assessing the quality of someone’s work. We know it’s not a great way to do things, but it’s not always clear how else we could do better.

Fortunately, there are many aspects of scientific research that don’t depend on subjective preferences or require us to suspend judgment for ten or fifteen years. In particular, methodological aspects of a paper can often be evaluated in a (relatively) objective way, and strengths or weaknesses of particular experimental designs are often readily discernible. For instance, in psychology, pretty much everyone agrees that large samples are generally better than small samples, reliable measures are better than unreliable measures, representative samples are better than WEIRD ones, and so on. The trouble when it comes to evaluating the methodological quality of most work isn’t so much that there’s rampant disagreement between reviewers (though it does happen), it’s that research articles are complicated products, and the odds of any individual reviewer having the expertise, motivation, and attention span to catch every major methodological concern in a paper are exceedingly small. Since only two or three people typically review a paper pre-publication, it’s not surprising that in many cases, whether or not a paper makes it through the review process depends as much on who happened to review it as on the paper itself.

A nice example of this is the Bem paper on ESP I discussed here a few weeks ago. I think most people would agree that things like data peeking, lumping and splitting studies, and post-hoc hypothesis testing–all of which are apparent in Bem’s paper–are generally not good research practices. And no doubt many potential reviewers would have noted these and other problems with Bem’s paper had they been asked to reviewer. But as it happens, the actual reviewers didn’t note those problems (or at least, not enough of them), so the paper was accepted for publication.

I’m not saying this to criticize Bem’s reviewers, who I’m sure all had a million other things to do besides pore over the minutiae of a paper on ESP (and for all we know, they could have already caught many other problems with the paper that were subsequently addressed before publication). The problem is a much more general one: the pre-publication peer review process in psychology, and many other areas of science, is pretty inefficient and unreliable, in the sense that it draws on the intense efforts of a very few, semi-randomly selected, individuals, as opposed to relying on a much broader evaluation by the community of researchers at large.

In the long term, the best solution to this problem may be to fundamentally rethink the way we evaluate scientific papers–e.g., by designing new platforms for post-publication review of papers (e.g., see this post for more on efforts towards that end). I think that’s far and away the most important thing the scientific community could do to improve the quality of scientific assessment, and I hope we ultimately will collectively move towards alternative models of review that look a lot more like the collaborative filtering systems found on, say, reddit or Stack Overflow than like peer review as we now know it. But that’s a process that’s likely to take a long time, and I don’t profess to have much of an idea as to how one would go about kickstarting it.

What I want to focus on here is something much less ambitious, but potentially still useful–namely, the possibility of automating the assessment of at least some aspects of research methodology. As I alluded to above, many of the factors that help us determine how believable a particular scientific finding is are readily quantifiable. In fact, in many cases, they’re already quantified for us. Sample sizes, p values, effect sizes,  coefficient alphas… all of these things are, in one sense or another, indices of the quality of a paper (however indirect), and are easy to capture and code. And many other things we care about can be captured with only slightly more work. For instance, if we want to know whether the authors of a paper corrected for multiple comparisons, we could search for strings like “multiple comparisons”, “uncorrected”, “Bonferroni”, and “FDR”, and probably come away with a pretty decent idea of what the authors did or didn’t do to correct for multiple comparisons. It might require a small dose of technical wizardry to do this kind of thing in a sensible and reasonably accurate way, but it’s clearly feasible–at least for some types of variables.

Once we extracted a bunch of data about the distribution of p values and sample sizes from many different papers, we could then start to do some interesting (and potentially useful) things, like generating automated metrics of research quality. For instance:

  • In multi-study articles, the variance in sample size across studies could tell us something useful about the likelihood that data peeking is going on (for an explanation as to why, see this). Other things being equal, an article with 9 studies with identical sample sizes is less likely to be capitalizing on chance than one containing 9 studies that range in sample size between 50 and 200 subjects (as the Bem paper does), so high variance in sample size could be used as a rough index for proclivity to peek at the data.
  • Quantifying the distribution of p values found in an individual article or an author’s entire body of work might be a reasonable first-pass measure of the amount of fudging (usually inadvertent) going on. As I pointed out in my earlier post, it’s interesting to note that with only one or two exceptions, virtually all of Bem’s statistically significant results come very close to p = .05. That’s not what you expect to see when hypothesis testing is done in a really principled way, because it’s exceedingly unlikely to think a researcher would be so lucky as to always just barely obtain the expected result. But a bunch of p = .03 and p = .048 results are exactly what you expect to find when researchers test multiple hypotheses and report only the ones that produce significant results.
  • The presence or absence of certain terms or phrases is probably at least slightly predictive of the rigorousness of the article as a whole. For instance, the frequent use of phrases like “cross-validated”, “statistical power”, “corrected for multiple comparisons”, and “unbiased” is probably a good sign (though not necessarily a strong one); conversely, terms like “exploratory”, “marginal”, and “small sample” might provide at least some indication that the reported findings are, well, exploratory.

These are just the first examples that come to mind; you can probably think of other better ones. Of course, these would all be pretty weak indicators of paper (or researcher) quality, and none of them are in any sense unambiguous measures. There are all sorts of situations in which such numbers wouldn’t mean much of anything. For instance, high variance in sample sizes would be perfectly justifiable in a case where researchers were testing for effects expected to have very different sizes, or conducting different kinds of statistical tests (e.g., detecting interactions is much harder than detecting main effects, and so necessitates larger samples). Similarly, p values close to .05 aren’t necessarily a marker of data snooping and fishing expeditions; it’s conceivable that some researchers might be so good at what they do that they can consistently design experiments that just barely manage to show what they’re intended to (though it’s not very plausible). And a failure to use terms like “corrected”, “power”, and “cross-validated” in a paper doesn’t necessarily mean the authors failed to consider important methodological issues, since such issues aren’t necessarily relevant to every single paper. So there’s no question that you’d want to take these kinds of metrics with a giant lump of salt.

Still, there are several good reasons to think that even relatively flawed automated quality metrics could serve an important purpose. First, many of the problems could be overcome to some extent through aggregation. You might not want to conclude that a particular study was poorly done simply because most of the reported p values were very close to .05; but if you were look at a researcher’s entire body of, say, thirty or forty published articles, and noticed the same trend relative to other researchers, you might start to wonder. Similarly, we could think about composite metrics that combine many different first-order metrics to generate a summary estimate of a paper’s quality that may not be so susceptible to contextual factors or noise. For instance, in the case of the Bem ESP article, a measure that took into account the variance in sample size across studies, the closeness of the reported p values to .05, the mention of terms like ‘one-tailed test’, and so on, would likely not have assigned Bem’s article a glowing score, even if each individual component of the measure was not very reliable.

Second, I’m not suggesting that crude automated metrics would replace current evaluation practices; rather, they’d be used strictly as a complement. Essentially, you’d have some additional numbers to look at, and you could choose to use them or not, as you saw fit, when evaluating a paper. If nothing else, they could help flag potential issues that reviewers might not be spontaneously attuned to. For instance, a report might note the fact that the term “interaction” was used several times in a paper in the absence of “main effect,” which might then cue a reviewer to ask, hey, why you no report main effects? — but only if they deemed it a relevant concern after looking at the issue more closely.

Third, automated metrics could be continually updated and improved using machine learning techniques. Given some criterion measure of research quality, one could systematically train and refine an algorithm capable of doing a decent job recapturing that criterion. Of course, it’s not clear that we really have any unobjectionable standard to use as a criterion in this kind of training exercise (which only underscores why it’s important to come up with better ways to evaluate scientific research). But a reasonable starting point might be to try to predict replication likelihood for a small set of well-studied effects based on the features of the original report. Could you for instance show, in an automated way, that initial effects reported in studies that failed to correct for multiple comparisons or reported p values closer to .05 were less likely to be subsequently replicated?

Of course, as always with this kind of stuff, the rub is that it’s easy to talk the talk and not so easy to walk the walk. In principle, we can make up all sorts of clever metrics, but in practice, it’s not trivial to automatically extract even a piece of information as seemingly simple as sample size from many papers (consider the difference between “Undergraduates (N = 15) participated…” and “Forty-two individuals diagnosed with depression and an equal number of healthy controls took part…”), let alone build sophisticated composite measures that could reasonably well approximate human judgments. It’s all well and good to write long blog posts about how fancy automated metrics could help separate good research from bad, but I’m pretty sure I don’t want to actually do any work to develop them, and you probably don’t either. Still, the potential benefits are clear, and it’s not like this is science fiction–it’s clearly viable on at least a modest scale. So someone should do it… Maybe Elsevier? Jorge Hirsch? Anyone? Bueller? Bueller?

of postdocs and publishing models: two opportunities of (possible) interest

I don’t usually use this blog to advertise things (so please don’t send me requests to publicize your third cousin’s upcoming bar mitzvah), but I think these two opportunities are pretty cool. They also happen to be completely unrelated, but I’m too lazy to write two separate posts, so…

Opportunity 1: We’re hiring!

Well, not me personally, but a guy I know. My current postdoc advisor, Tor Wager, is looking to hire up to 4 postdocs in the next few months to work on various NIH-funded projects related to the neural substrates of pain and emotion. You would get to play with fun things like fMRI scanners, thermal stimulators, and machine learning techniques. Oh, and snow, because we’re located in Boulder, Colorado. So we have. A lot. Of snow.

Anyway, Tor is great to work with, the lab is full of amazing people and great resources, and Boulder is a fantastic place to live, so if you have (or expect to soon have) a PhD in affective/cognitive neuroscience or related field and a background in pain/emotion research and/or fMRI analysis and/or machine learning and/or psychophysiology, you should consider applying! See this flyer for more details. And no, I’m not being paid to say this.

Opportunity 2: Design the new science!

That’s a cryptic way of saying that there’s a forthcoming special issue of Frontiers in Computational Neuroscience that’s going to focus on “Visions for Open Evaluation of Scientific Papers by Post-Publication Peer Review.” As far as I can tell, that basically means that if you’re like every other scientist, and think there’s more to scientific evaluation than the number of publications and citations one has, you now have an opportunity to design a perfect evaluation system of your very own–meaning, of course, that system in which you end up at or near the very top.

In all seriousness though, this seems like a really great idea, and I think it’s the kind of thing that could actually have a very large impact on how we’re all doing–or at least communicating–science 10 or 20 years from now. The special issue will be edited by Niko Kriegeskorte, whose excellent ideas about scientific publishing I’ve previously blogged about, and Diana Deca. Send them your best ideas! And then, if it’s not too much trouble, put my name on your paper. You know, as a finder’s fee. Abstracts are due January 15th.

The psychology of parapsychology, or why good researchers publishing good articles in good journals can still get it totally wrong

Unless you’ve been pleasantly napping under a rock for the last couple of months, there’s a good chance you’ve heard about a forthcoming article in the Journal of Personality and Social Psychology (JPSP) purporting to provide strong evidence for the existence of some ESP-like phenomenon. (If you’ve been napping, see here, here, here, here, here, or this comprehensive list). In the article–appropriately titled Feeling the FutureDaryl Bem reports the results of 9 (yes, 9!) separate experiments that catch ordinary college students doing things they’re not supposed to be able to do–things like detecting the on-screen location of erotic images that haven’t actually been presented yet, or being primed by stimuli that won’t be displayed until after a response has already been made.

As you might expect, Bem’s article’s causing quite a stir in the scientific community. The controversy isn’t over whether or not ESP exists, mind you; scientists haven’t lost their collective senses, and most of us still take it as self-evident that college students just can’t peer into the future and determine where as-yet-unrevealed porn is going to soon be hidden (as handy as that ability might be). The real question on many people’s minds is: what went wrong? If there’s obviously no such thing as ESP, how could a leading social psychologist publish an article containing a seemingly huge amount of evidence in favor of ESP in the leading social psychology journal, after being peer reviewed by four other psychologists? Or, to put it in more colloquial terms–what the fuck?

What the fuck?

Many critiques of Bem’s article have tried to dismiss it by searching for the smoking gun–the single critical methodological flaw that dooms the paper. For instance, one critique that’s been making the rounds, by Wagenmakers et al, argues that Bem should have done a Bayesian analysis, and that his failure to adjust his findings for the infitesimally low prior probability of ESP (essentially, the strength of subjective belief against ESP) means that the evidence for ESP is vastly overestimated. I think these types of argument have a kernel of truth, but also suffer from some problems (for the record, I don’t really agree with the Wagenmaker critique, for reasons Andrew Gelman has articulated here). Having read the paper pretty closely twice, I really don’t think there’s any single overwhelming flaw in Bem’s paper (actually, in many ways, it’s a nice paper). Instead, there are a lot of little problems that collectively add up to produce a conclusion you just can’t really trust. Below is a decidedly non-exhaustive list of some of these problems. I’ll warn you now that, unless you care about methodological minutiae, you’ll probably find this very boring reading. But that’s kind of the point: attending to this stuff is so boring that we tend not to do it, with potentially serious consequences. Anyway:

  • Bem reports 9 different studies, which sounds (and is!) impressive. But a noteworthy feature these studies is that they have grossly uneven sample sizes, ranging all the way from N = 50 to N = 200, in blocks of 50. As far as I can tell, no justification for these differences is provided anywhere in the article, which raises red flags, because the most common explanation for differing sample sizes–especially on this order of magnitude–is data peeking. That is, what often happens is that researchers periodically peek at their data, and halt data collection as soon as they obtain a statistically significant result. This may seem like a harmless little foible, but as I’ve discussed elsewhere, is actually a very bad thing, as it can substantially inflate Type I error rates (i.e., false positives).To his credit, Bem was at least being systematic about his data peeking, since his sample sizes always increase in increments of 50. But even in steps of 50, false positives can be grossly inflated. For instance, for a one-sample t-test, a researcher who peeks at her data in increments of 50 subjects and terminates data collection when a significant result is obtained (or N = 200, if no such result is obtained) can expect an actual Type I error rate of about 13%–nearly 3 times the nominal rate of 5%!
  • There’s some reason to think that the 9 experiments Bem reports weren’t necessarily designed as such. Meaning that they appear to have been ‘lumped’ or ‘splitted’ post hoc based on the results. For instance, Experiment 2 had 150 subjects, but the experimental design for the first 100 differed from the final 50 in several respects. They were minor respects, to be sure (e.g., pictures were presented randomly in one study, but in a fixed sequence in the other), but were still comparable in scope to those that differentiated Experiment 8 from Experiment 9 (which had the same sample size splits of 100 and 50, but were presented as two separate experiments). There’s no obvious reason why a researcher would plan to run 150 subjects up front, then decide to change the design after 100 subjects, and still call it the same study. A more plausible explanation is that Experiment 2 was actually supposed to be two separate experiments (a successful first experiment with N = 100 followed by an intended replication with N = 50) that was collapsed into one large study when the second experiment failed–preserving the statistically significant result in the full sample. Needless to say, this kind of lumping and splitting is liable to additionally inflate the false positive rate.
  • Most of Bem’s experiments allow for multiple plausible hypotheses, and it’s rarely clear why Bem would have chosen, up front, the hypotheses he presents in the paper. For instance, in Experiment 1, Bem finds that college students are able to predict the future location of erotic images that haven’t yet been presented (essentially a form of precognition), yet show no ability to predict the location of negative, positive, or romantic pictures. Bem’s explanation for this selective result is that “… such anticipation would be evolutionarily advantageous for reproduction and survival if the organism could act instrumentally to approach erotic stimuli …”. But this seems kind of silly on several levels. For one thing, it’s really hard to imagine that there’s an adaptive benefit to keeping an eye out for potential mates, but not for other potential positive signals (represented by non-erotic positive images). For another, it’s not like we’re talking about actual people or events here; we’re talking about digital images on an LCD. What Bem is effectively saying is that, somehow, someway, our ancestors evolved the extrasensory capacity to read digital bits from the future–but only pornographic ones. Not very compelling, and one could easily have come up with a similar explanation in the event that any of the other picture categories had selectively produced statistically significant results. Of course, if you get to test 4 or 5 different categories at p < .05, and pretend that you called it ahead of time, your false positive rate isn’t really 5%–it’s closer to 20%.
  • I say p < .05, but really, it’s more like p < .1, because the vast majority of tests Bem reports use one-tailed tests–effectively instantaneously doubling the false positive rate. There’s a long-standing debate in the literature, going back at least 60 years, as to whether it’s ever appropriate to use one-tailed tests, but even proponents of one-tailed tests will concede that you should only use them if you really truly have a directional hypothesis in mind before you look at your data. That seems exceedingly unlikely in this case, at least for many of the hypotheses Bem reports testing.
  • Nearly all of Bem’s statistically significant p values are very close to the critical threshold of .05. That’s usually a marker of selection bias, particularly given the aforementioned unevenness of sample sizes. When experiments are conducted in a principled way (i.e., with minimal selection bias or peeking), researchers will often get very low p values, since it’s very difficult to know up front exactly how large effect sizes will be. But in Bem’s 9 experiments, he almost invariably collects just enough subjects to detect a statistically significant effect. There are really only two explanations for that: either Bem is (consciously or unconsciously) deciding what his hypotheses are based on which results attain significance (which is not good), or he’s actually a master of ESP himself, and is able to peer into the future and identify the critical sample size he’ll need in each experiment (which is great, but unlikely).
  • Some of the correlational effects Bem reports–e.g., that people with high stimulus seeking scores are better at ESP–appear to be based on measures constructed post hoc. For instance, Bem uses a non-standard, two-item measure of boredom susceptibility, with no real justification provided for this unusual item selection, and no reporting of results for the presumably many other items and questionnaires that were administered alongside these items (except to parenthetically note that some measures produced non-significant results and hence weren’t reported). Again, the ability to select from among different questionnaires–and to construct custom questionnaires from different combinations of items–can easily inflate Type I error.
  • It’s not entirely clear how many studies Bem ran. In the Discussion section, he notes that he could “identify three sets of findings omitted from this report so far that should be mentioned lest they continue to languish in the file drawer”, but it’s not clear from the description that follows exactly how many studies these “three sets of findings” comprised (or how many ‘pilot’ experiments were involved). What we’d really like to know is the exact number of (a) experiments and (b) subjects Bem ran, without qualification, and including all putative pilot sessions.

It’s important to note that none of these concerns is really terrible individually. Sure, it’s bad to peek at your data, but data peeking alone probably isn’t going to produce 9 different false positives. Nor is using one-tailed tests, or constructing measures on the fly, etc. But when you combine data peeking, liberal thresholds, study recombination, flexible hypotheses, and selective measures, you have a perfect recipe for spurious results. And the fact that there are 9 different studies isn’t any guard against false positives when fudging is at work; if anything, it may make it easier to produce a seemingly consistent story, because reviewers and readers have a natural tendency to relax the standards for each individual experiment. So when Bem argues that “…across all nine experiments, Stouffer’s z = 6.66, p = 1.34 × 10-11,” that statement that the cumulative p value is 1.34 x 10-11 is close to meaningless. Combining p values that way would only be appropriate under the assumption that Bem conducted exactly 9 tests, and without any influence of selection bias. But that’s clearly not the case here.

What would it take to make the results more convincing?

Admittedly, there are quite a few assumptions involved in the above analysis. I don’t know for a fact that Bem was peeking at his data; that just seems like a reasonable assumption given that no justification was provided anywhere for the use of uneven samples. It’s conceivable that Bem had perfectly good, totally principled, reasons for conducting the experiments exactly has he did. But if that’s the case, defusing these criticisms should be simple enough. All it would take for Bem to make me (and presumably many other people) feel much more comfortable with the results is an affirmation of the following statements:

  • That the sample sizes of the different experiments were determined a priori, and not based on data snooping;
  • That the distinction between pilot studies and ‘real’ studies was clearly defined up front–i.e., there weren’t any studies that started out as pilots but eventually ended up in the paper, or studies that were supposed to end up in the paper but that were disqualified as pilots based on the (lack of) results;
  • That there was a clear one-to-one mapping between intended studies and reported studies; i.e., Bem didn’t ‘lump’ together two different studies in cases where one produced no effect, or split one study into two in cases where different subsets of the data both showed an effect;
  • That the predictions reported in the paper were truly made a priori, and not on the basis of the results (e.g., that the hypothesis that sexually arousing stimuli would be the only ones to show an effect was actually written down in one of Bem’s notebooks somewhere);
  • That the various transformations applied to the RT and memory performance measures in some Experiments weren’t selected only after inspecting the raw, untransformed values and failing to identify significant results;
  • That the individual differences measures reported in the paper were selected a priori and not based on post-hoc inspection of the full pattern of correlations across studies;
  • That Bem didn’t run dozens of other statistical tests that failed to produce statistically non-significant results and hence weren’t reported in the paper.

Endorsing this list of statements (or perhaps a somewhat more complete version, as there are other concerns I didn’t mention here) would be sufficient to cast Bem’s results in an entirely new light, and I’d go so far as to say that I’d even be willing to suspend judgment on his conclusions pending additional data (which would be a big deal for me, since I don’t have a shred of a belief in ESP). But I confess that I’m not holding my breath, if only because I imagine that Bem would have already addressed these concerns in his paper if there were indeed principled justifications for the design choices in question.

It isn’t a bad paper

If you’ve read this far (why??), this might seem like a pretty damning review, and you might be thinking, boy, this is really a terrible paper. But I don’t think that’s true at all. In many ways, I think Bem’s actually been relatively careful. The thing to remember is that this type of fudging isn’t unusual; to the contrary, it’s rampant–everyone does it. And that’s because it’s very difficult, and often outright impossible, to avoid. The reality is that scientists are human, and like all humans, have a deep-seated tendency to work to confirm what they already believe. In Bem’s case, there are all sorts of reasons why someone who’s been working for the better part of a decade to demonstrate the existence of psychic phenomena isn’t necessarily the most objective judge of the relevant evidence. I don’t say that to impugn Bem’s motives in any way; I think the same is true of virtually all scientists–including myself. I’m pretty sure that if someone went over my own work with a fine-toothed comb, as I’ve gone over Bem’s above, they’d identify similar problems. Put differently, I don’t doubt that, despite my best efforts, I’ve reported some findings that aren’t true, because I wasn’t as careful as a completely disinterested observer would have been. That’s not to condone fudging, of course, but simply to recognize that it’s an inevitable reality in science, and it isn’t fair to hold Bem to a higher standard than we’d hold anyone else.

If you set aside the controversial nature of Bem’s research, and evaluate the quality of his paper purely on methodological grounds, I don’t think it’s any worse than the average paper published in JPSP, and actually probably better. For all of the concerns I raised above, there are many things Bem is careful to do that many other researchers don’t. For instance, he clearly makes at least a partial effort to avoid data peeking by collecting samples in increments of 50 subjects (I suspect he simply underestimated the degree to which Type I error rates can be inflated by peeking, even with steps that large); he corrects for multiple comparisons in many places (though not in some places where it matters); and he devotes an entire section of the discussion to considering the possibility that he might be inadvertently capitalizing on chance by falling prey to certain biases. Most studies–including most of those published in JPSP, the premier social psychology journal–don’t do any of these things, even though the underlying problems are just applicable. So while you can confidently conclude that Bem’s article is wrong, I don’t think it’s fair to say that it’s a bad article–at least, not by the standards that currently hold in much of psychology.

Should the study have been published?

Interestingly, much of the scientific debate surrounding Bem’s article has actually had very little to do with the veracity of the reported findings, because the vast majority of scientists take it for granted that ESP is bunk. Much of the debate centers instead over whether the article should have ever been published in a journal as prestigious as JPSP (or any other peer-reviewed journal, for that matter). For the most part, I think the answer is yes. I don’t think it’s the place of editors and reviewers to reject a paper based solely on the desirability of its conclusions; if we take the scientific method–and the process of peer review–seriously, that commits us to occasionally (or even frequently) publishing work that we believe time will eventually prove wrong. The metrics I think reviewers should (and do) use are whether (a) the paper is as good as most of the papers that get published in the journal in question, and (b) the methods used live up to the standards of the field. I think that’s true in this case, so I don’t fault the editorial decision. Of course, it sucks to see something published that’s virtually certain to be false… but that’s the price we pay for doing science. As long as they play by the rules, we have to engage with even patently ridiculous views, because sometimes (though very rarely) it later turns out that those views weren’t so ridiculous after all.

That said, believing that it’s appropriate to publish Bem’s article given current publishing standards doesn’t preclude us from questioning those standards themselves. On a pretty basic level, the idea that Bem’s article might be par for the course, quality-wise, yet still be completely and utterly wrong, should surely raise some uncomfortable questions about whether psychology journals are getting the balance between scientific novelty and methodological rigor right. I think that’s a complicated issue, and I’m not going to try to tackle it here, though I will say that personally I do think that more stringent standards would be a good thing for psychology, on the whole. (It’s worth pointing out that the problem of (arguably) lax standards is hardly unique to psychology; as John Ionannidis has famously pointed out, most published findings in the biomedical sciences are false.)

Conclusion

The controversy surrounding the Bem paper is fascinating for many reasons, but it’s arguably most instructive in underscoring the central tension in scientific publishing between rapid discovery and innovation on the one hand, and methodological rigor and cautiousness on the other. Both values are important, but it’s important to recognize the tradeoff that pursuing either one implies. Many of the people who are now complaining that JPSP should never have published Bem’s article seem to overlook the fact that they’ve probably benefited themselves from the prevalence of the same relaxed standards (note that by ‘relaxed’ I don’t mean to suggest that journals like JPSP are non-selective about what they publish, just that methodological rigor is only one among many selection criteria–and often not the most important one). Conversely, maintaining editorial standards that would have precluded Bem’s article from being published would almost certainly also make it much more difficult to publish most other, much less controversial, findings. A world in which fewer spurious results are published is a world in which fewer studies are published, period. You can reasonably debate whether that would be a good or bad thing, but you can’t have it both ways. It’s wishful thinking to imagine that reviewers could somehow grow a magic truth-o-meter that applies lax standards to veridical findings and stringent ones to false positives.

From a bird’s eye view, there’s something undeniably strange about the idea that a well-respected, relatively careful researcher could publish an above-average article in a top psychology journal, yet have virtually everyone instantly recognize that the reported findings are totally, irredeemably false. You could read that as a sign that something’s gone horribly wrong somewhere in the machine; that the reviewers and editors of academic journals have fallen down and can’t get up, or that there’s something deeply flawed about the way scientists–or at least psychologists–practice their trade. But I think that’s wrong. I think we can look at it much more optimistically. We can actually see it as a testament to the success and self-corrective nature of the scientific enterprise that we actually allow articles that virtually nobody agrees with to get published. And that’s because, as scientists, we take seriously the possibility, however vanishingly small, that we might be wrong about even our strongest beliefs. Most of us don’t really believe that Cornell undergraduates have a sixth sense for future porn… but if they did, wouldn’t you want to know about it?

ResearchBlogging.org
Bem, D. J. (2011). Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect Journal of Personality and Social Psychology

how many Cortex publications in the hand is a Nature publication in the bush worth?

A provocative and very short Opinion piece by Julien Mayor (Are scientists nearsighted gamblers? The misleading nature of impact factors) was recently posted on the Frontiers in Psychology website (open access! yay!). Mayor’s argument is summed up nicely in this figure:

The left panel plots the mean versus median number of citations per article in a given year (each year is a separate point) for 3 journals: Nature (solid circles), Psych Review (squares), and Psych Science (triangles). The right panel plots the number of citations each paper receives in each of the first 15 years following its publication. What you can clearly see is that (a) the mean and median are very strongly related for the psychology journals, but completely unrelated for Nature, implying that a very small number of articles account for the vast majority of Nature citations (Mayor cites data indicating that up to 40% of Nature papers are never cited); and (b) Nature papers tend to get cited heavily for a year or two, and then disappear, whereas Psych Science, and particularly Psych Review, tend to have much longer shelf lives. Based on these trends, Mayor concludes that:

From this perspective, the IF, commonly accepted as golden standard for performance metrics seems to reward high-risk strategies (after all your Nature article has only slightly over 50% chance of being ever cited!), and short-lived outbursts. Are scientists then nearsighted gamblers?

I’d very much like to believe this, in that I think the massive emphasis scientists collectively place on publishing work in broad-interest, short-format journals like Nature and Science is often quite detrimental to the scientific enterprise as a whole. But I don’t actually believe it, because I think that, for any individual paper, researchers generally do have good incentives to try to publish in the glamor mags rather than in more specialized journals. Mayor’s figure, while informative, doesn’t take a number of factors into account:

  • The type of papers that gets published in Psych Review and Nature are very different. Review papers, in general, tend to get cited more often, and for a longer time. A better comparison would be between Psych Review papers and only review papers in Nature (there’s not many of them, unfortunately). My guess is that that difference alone probably explains much of the difference in citation rates later on in an article’s life. That would also explain why the temporal profile of Psych Science articles (which are also overwhelmingly short empirical reports) is similar to that of Nature. Major theoretical syntheses stay relevant for decades; individual empirical papers, no matter how exciting, tend to stop being cited as frequently once (a) the finding fails to replicate, or (b) a literature builds up around the original report, and researchers stop citing individual studies and start citing review articles (e.g., in Psych Review).
  • Scientists don’t just care about citation counts, they also care about reputation. The reality is that much of the appeal of having a Nature or Science publication isn’t necessarily that you expect the work to be cited much more heavily, but that you get to tell everyone else how great you must be because you have a publication in Nature. Now, on some level, we know that it’s silly to hold glamor mags in such high esteem, and Mayor’s data are consistent with that idea. In an ideal world, we’d read all papers ultra-carefully before making judgments about their quality, rather than using simple but flawed heuristics like what journal those papers happen to be published in. But this isn’t an ideal world, and the reality is that people do use such heuristics. So it’s to each scientist’s individual advantage (but to the field’s detriment) to take advantage of that knowledge.
  • Different fields have very different citation rates. And articles in different fields have very different shelf lives. For instance, I’ve heard that in many areas of physics, the field moves so fast that articles are basically out of date within a year or two (I have no way to verify if this is true or not). That’s certainly not true of most areas of psychology. For instance, in cognitive neuroscience, the current state of the field in many areas is still reasonably well captured by highly-cited publications that are 5 – 10 years old. Most behavioral areas of psychology seem to advance even more slowly. So one might well expect articles in psychology journals to peak later in time than the average Nature article, because Nature contains a high proportion of articles in the natural sciences.
  • Articles are probably selected for publication in Nature, Psych Science, and Psych Review for different reasons. In particular, there’s no denying the fact that Nature selects articles in large part based on the perceived novelty and unexpectedness of the result. That’s not to say that methodological rigor doesn’t play a role, just that, other things being equal, unexpected findings are less likely to be replicated. Since Nature and Science overwhelmingly publish articles with new and surprising findings, it shouldn’t be surprising if the articles in these journals have a lower rate of replication several years on (and hence, stop being cited). That’s presumably going to be less true of articles in specialist journals, where novelty factor and appeal to a broad audience are usually less important criteria.

Addressing these points would probably go a long way towards closing, and perhaps even reversing, the gap implied  by Mayor’s figure. I suspect that if you could do a controlled experiment and publish the exact same article in Nature and Psych Science, it would tend to get cited more heavily in Nature over the long run. So in that sense, if citations were all anyone cared about, I think it would be perfectly reasonable for scientists to try to publish in the most prestigious journals–even though, again, I think the pressure to publish in such journals actually hurts the field as a whole.

Of course, in reality, we don’t just care about citation counts anyway; lots of other things matter. For one thing, we also need to factor in the opportunity cost associated with writing a paper up in a very specific format for submission to Nature or Science, knowing that we’ll probably have to rewrite much or all of it before it gets published. All that effort could probably have been spent on other projects, so one way to put the question is: how many lower-tier publications in the hand is a top-tier publication in the bush worth?

Ultimately, it’s an empirical matter; I imagine if you were willing to make some strong assumptions, and collect the right kind of data, you could come up with a meaningful estimate of the actual value of a Nature publication, as a function of important variables like the number of other publications the authors had, the amount of work invested in rewriting the paper after rejection, the authors’ career stage, etc. But I don’t know of any published work to that effect; it seems like it would probably be more trouble than it was worth (or, to get meta: how many Nature manuscripts can you write in the time it takes you to write a manuscript about how many Nature manuscripts you should write?). And, to be honest, I suspect that any estimate you obtained that way would have little or no impact on the actual decisions scientists make about where to submit their manuscripts anyway, because, in practice, such decisions are driven as much by guesswork and wishful thinking as by any well-reasoned analysis. And on that last point, I speak from extensive personal experience…

what the arsenic effect means for scientific publishing

I don’t know very much about DNA (and by ‘not very much’ I sadly mean ‘next to nothing’), so when someone tells me that life as we know it generally doesn’t use arsenic to make DNA, and that it’s a big deal to find a bacterium that does, I’m willing to believe them. So too, apparently, are at least two or three reviewers for Science, which published a paper last week by a NASA group purporting to demonstrate exactly that.

Turns out the paper might have a few holes. In the last few days, the blogosphere has reached fever delirium pitch as critiques of the article have emerged from every corner; it seems like pretty much everyone with some knowledge of the science in question is unhappy about the paper. Since I’m not in any position to critique the article myself, I’ll take Carl Zimmer’s word for it in Slate yesterday:

Was this merely a case of a few isolated cranks? To find out, I reached out to a dozen experts on Monday. Almost unanimously, they think the NASA scientists have failed to make their case.  “It would be really cool if such a bug existed,” said San Diego State University’s Forest Rohwer, a microbiologist who looks for new species of bacteria and viruses in coral reefs. But, he added, “none of the arguments are very convincing on their own.” That was about as positive as the critics could get. “This paper should not have been published,” said Shelley Copley of the University of Colorado.

Zimmer then follows his Slate piece up with a blog post today in which he provides 13 experts’ unadulterated comments. While there are one or two (somewhat) positive reviews, the consensus clearly seems to be that the Science paper is (very) bad science.

Of course, scientists (yes, even Science reviewers) do occasionally make mistakes, so if we’re being charitable about it, we might chalk it up to human error (though some of the critiques suggest that these are elementary problems that could have been very easily addressed, so it’s possible there’s some disingenuousness involved). But what many bloggers (1, 2, 3, etc.) have found particularly inexcusable is the way NASA and the research team have handled the criticism. Zimmer again, in Slate:

I asked two of the authors of the study if they wanted to respond to the criticism of their paper. Both politely declined by email.

“We cannot indiscriminately wade into a media forum for debate at this time,” declared senior author Ronald Oremland of the U.S. Geological Survey. “If we are wrong, then other scientists should be motivated to reproduce our findings. If we are right (and I am strongly convinced that we are) our competitors will agree and help to advance our understanding of this phenomenon. I am eager for them to do so.”

“Any discourse will have to be peer-reviewed in the same manner as our paper was, and go through a vetting process so that all discussion is properly moderated,” wrote Felisa Wolfe-Simon of the NASA Astrobiology Institute. “The items you are presenting do not represent the proper way to engage in a scientific discourse and we will not respond in this manner.”

A NASA spokesperson basically reiterated this point of view, indicating that NASA scientists weren’t going to respond to criticism of their work unless that criticism appeared in, you know, a respectable, peer-reviewed outlet. (Fortunately, at least one of the critics already has a draft letter to Science up on her blog.)

I don’t think it’s surprising that people who spend much of their free time blogging about science, and think it’s important to discuss scientific issues in a public venue, generally aren’t going to like being told that science blogging isn’t a legitimate form of scientific discourse. Especially considering that the critics here aren’t laypeople without scientific training; they’re well-respected scientists with areas of expertise that are directly relevant to the paper. In this case, dismissing trenchant criticism because it’s on the web rather than in a peer-reviewed journal seems kind of like telling someone who’s screaming at you that your house is on fire that you’re not going to listen to them until they adopt a more polite tone. It just seems counterproductive.

That said, I personally don’t think we should take the NASA team’s statements at face value. I very much doubt that what the NASA researchers are saying really reflect any deep philosophical view about the role of blogs in scientific discourse; it’s much more likely that they’re simply trying to buy some time while they figure out how to respond. On the face of it, they have a choice between two lousy options: either ignore the criticism entirely, which would be antithetical to the scientific process and would look very bad, or address it head-on–which, judging by the vociferousness and near-unanimity of the commentators, is probably going to be a losing battle. Shifting the terms of the debate by insisting on responding only in a peer-reviewed venue doesn’t really change anything, but it does buy the authors two or three weeks. And two or three weeks is worth like, forty attentional cycles in the blogosphere.

Mind you, I’m not saying we should sympathize with the NASA researchers just because they’re in a tough position. I think one of the main reasons the story’s attracted so much attention is precisely because people see it as a case of justice being served. The NASA team called a major press conference ahead of the paper’s publication, published its results in one of the world’s most prestigious science journals, and yet apparently failed to run relatively basic experimental controls in support of its conclusions. If the critics are to be believed, the NASA researchers are either disingenuous or incompetent; either way, we shouldn’t feel sorry for them.

What I do think this episode shows is that the rules of scientific publishing have fundamentally changed in the last few years–and largely for the better. I haven’t been doing science for very long, but even in the halcyon days of 2003, when I started graduate school, science blogging was practically nonexistent, and the main way you’d find out what other people thought about an influential new paper was by talking to people you knew at conferences (which could take several months) or waiting for critiques or replication failures to emerge in other peer-reviewed journals (which could take years). That kind of delay between publication and evaluation is disastrous for science, because in the time it takes for a consensus to emerge that a paper is no good, several research teams might have already started trying to replicate and extend the reported findings, and several dozen other researchers might have uncritically cited their paper peripherally in their own work. This delay is probably why, as John Ioannidis’ work so elegantly demonstrates, major studies published in high-impact journals tend to exert a disproportionate influence on the literature long after they’ve been resoundingly discredited.

The Arsenic Effect, if we can call it that, provides a nice illustration of the impact of new media on scientific communication. It’s a safe bet that there are now very few people who do anything even vaguely related to the NASA team’s research who haven’t been made aware that the reported findings are controversial. Which means that the process of attempting to replicate (or falsify) the findings will proceed much more quickly than it might have ten or twenty years ago, and there probably won’t be very many people who cite the Science paper as compelling evidence of terrestrial arsenic-based life. Perhaps more importantly, as researchers get used to the idea that their high-profile work is going to be instantly evaluated by thousands of pairs of highly trained eyes, any of which might be attached to a highly prolific pair of typing hands, there will be an increasingly strong disincentive to avoid being careless. That isn’t to say that bad science will disappear, of course; just that, in cases where the badness reflects a pressure to tell a good story at all costs, we’ll probably see less of it.

PLoS ONE needs new subjects

I like the PLoS journals, including PLoS ONE, a lot. But it drives me a little bit crazy that the list of PLoS ONE subjects includes things like Non-Clinical Health, Nutrition, and Science Policy, while perfectly respectable subjects like Psychology, Economics, and Political Science are nowhere to be found (note: I’m not saying there’s anything wrong with Nutrition, just that there’s also nothing wrong with Psychology).

I can sort of understand the rationale; PLoS ONE is supposed to be a science journal, and I imagine the editors feel that if they opened up the door to the aforementioned categories, some of the submissions they’d start receiving would have tenuous or nonexistent relationships to anything that you could call science. But in practice, PLoS ONE already does take articles in all of those subjects–and many others. And what then happens, no doubt, is that the editorial board has epic battles over which of the 40-odd existing subjects is going to become the proud beneficiary of a completely unrelated article.

I imagine it goes down something like this:

Editor A: Look, “Patriarchal principles of pop music in a post-Jacksonian era” is clearly an Epidemiology article. It’s going under Public Health and Epidemiology.

Editor B: Don’t be a fool. There isn’t a single word in the paper about health or disease. You’d know that if you’d bothered to read it. It obviously belongs under Mental Health.

Editor A: Absolutely not. Infectious Diseases, Pediatrics and Child Health, or Anesthesiology and Pain Management. Pick one. Final offer.

Editor B: No. But I’ll tell you what. Send it back to the authors, ask them to add a section on the influence of barbiturates and opiates on modern composition, and then we’ll stick it under Pharmacology.

Editor A: Deal.

Lest you think I’m making shit up exaggerating, witness exhibit A: a paper published today by Araújo et al entitled “Tactical Voting in Plurality Elections”. To be fair, I don’t know anything about tactics, voting, plurality, or elections, so I can’t tell you if the paper is any good or not. It looks interesting, but I don’t understand much more than the abstract.

What I can tell you though with something approaching certainty is that the paper has absolutely nothing to do with Neuroscience–which is one of the categories it’s filed under (the other is Physics, which it also seems to bear no relation to, save for the fact that the authors are physicists). It doesn’t mention the words ‘brain’, ‘neuro-’, ‘neural’, or ‘neuron’ anywhere in the text, which is pretty much a necessary condition for a neuroscience article in my book. The only conceivable link I can think of is that it’s a paper about voting, and voting is done by people, and people have brains. But that’s not very compelling. Really, it should go under Political Science, or Economics, or Applied Statistics, or even a catch-all category like Social Sciences. Except that none of those exist.

Pretty please, PLoS ONE, can we get a Social Sciences section?