building better platforms for evaluating science: a request for feedback

UPDATE 4/20/2012: a revised version of the paper mentioned below is now available here.

A couple of months ago I wrote about a call for papers for a special issue of Frontiers in Computational Neuroscience focusing on “Visions for Open Evaluation of Scientific Papers by Post-Publication Peer Review“. I wrote a paper for the issue, the gist of which is that many of the features scientists should want out of a next-generation open evaluation platform are already implemented all over the place in social web applications, so that building platforms for evaluating scientific output should be more a matter of adapting existing techniques than having to come up with brilliant new approaches. I’m talking about features like recommendation engines, APIs, and reputation systems, which you can find everywhere from Netflix to Pandora to Stack Overflow to Amazon, but (unfortunately) virtually nowhere in the world of scientific publishing.

Since the official deadline for submission is two months away (no, I’m not so conscientious that I habitually finish my writing assignments two months ahead of time–I just failed to notice that the deadline had been pushed way back), I figured I may as well use the opportunity to make the paper openly accessible right now in the hopes of soliciting some constructive feedback. This is a topic that’s kind of off the beaten path for me, and I’m not convinced I really know what I’m talking about (well, fine, I’m actually pretty sure I don’t know what I’m talking about), so I’d love to get some constructive criticism from people before I submit a final version of the manuscript. Not only from scientists, but ideally also from people with experience developing social web applications–or actually, just about anyone with good ideas about how to implement and promote next-generation evaluation platforms. I mean, if you use Netflix or reddit regularly, you’re pretty much a de facto expert on collaborative filtering and recommendation systems, right?

Anyway, here’s the abstract:

Traditional pre-publication peer review of scientific output is a slow, inefficient, and unreliable process. Efforts to replace or supplement traditional evaluation models with open evaluation platforms that leverage advances in information technology are slowly gaining traction, but remain in the early stages of design and implementation. Here I discuss a number of considerations relevant to the development of such platforms. I focus particular attention on three core elements that next-generation evaluation platforms should strive to emphasize, including (a) open and transparent access to accumulated evaluation data, (b) personalized and highly customizable performance metrics, and (c) appropriate short-term incentivization of the userbase. Because all of these elements have already been successfully implemented on a large scale in hundreds of existing social web applications, I argue that development of new scientific evaluation platforms should proceed largely by adapting existing techniques rather than engineering entirely new evaluation mechanisms. Successful implementation of open evaluation platforms has the potential to substantially advance both the pace and the quality of scientific publication and evaluation, and the scientific community has a vested interest in shifting towards such models as soon as possible.

You can download the PDF here (or grab it from SSRN here). It features a cameo by Archimedes and borrows concepts liberally from sites like reddit, Netflix, and Stack Overflow (with attribution, of course). I’d love to hear your comments; you can either leave them below or email me directly. Depending on what kind of feedback I get (if any), I’ll try to post a revised version of the paper here in a month or so that works in people’s comments and suggestions.

(fanciful depiction of) Archimedes, renowned ancient Greek mathematician and co-inventor (with Al Gore) of the open access internet repository

what aspirin can tell us about the value of antidepressants

There’s a nice post on Science-Based Medicine by Harriet Hall pushing back (kind of) against the increasingly popular idea that antidepressants don’t work. For context, there have been a couple of large recent meta-analyses that used comprehensive FDA data on clinical trials of antidepressants (rather than only published studies, which are biased towards larger, statistically significant, effects) to argue that antidepressants are of little or no use in mild or moderately-depressed people, and achieve a clinically meaningful benefit only in the severely depressed.

Hall points out that whether you think antidepressants have a clinically meaningful benefit or not depends on how you define clinically meaningful (okay, this sounds vacuous, but bear with me). Most meta-analyses of antidepressant efficacy reveal an effect size of somewhere between 0.3 and 0.5 standard deviations. Historically, psychologists consider effect sizes of 0.2, 0.5, and 0.8 standard deviations to be small, medium, and large, respectively. But as Hall points out:

The psychologist who proposed these landmarks [Jacob Cohen] admitted that he had picked them arbitrarily and that they had “no more reliable a basis than my own intuition.” Later, without providing any justification, the UK’s National Institute for Health and Clinical Excellence (NICE) decided to turn the 0.5 landmark (why not the 0.2 or the 0.8 value?) into a one-size-fits-all cut-off for clinical significance.

She goes on to explain why this ultimately leaves the efficacy of antidepressants open to interpretation:

In an editorial published in the British Medical Journal (BMJ), Turner explains with an elegant metaphor: journal articles had sold us a glass of juice advertised to contain 0.41 liters (0.41 being the effect size Turner, et al. derived from the journal articles); but the truth was that the “glass” of efficacy contained only 0.31 liters. Because these amounts were lower than the (arbitrary) 0.5 liter cut-off, NICE standards (and Kirsch) consider the glass to be empty. Turner correctly concludes that the glass is far from full, but it is also far from empty. He also points out that patients’ responses are not all-or-none and that partial responses can be meaningful.

I think this pretty much hits the nail on the head; no one really doubts that antidepressants work at this point; the question is whether they work well enough to justify their side effects and the social and economic costs they impose. I don’t have much to add to Hall’s argument, except that I think she doesn’t sufficiently emphasize how big a role scale plays when trying to evaluate the utility of antidepressants (or any other treatment). At the level of a single individual, a change of one-third of a standard deviation may not seem very big (then again, if you’re currently depressed, it might!). But on a societal scale, even canonically ‘small’ effects can have very large effects in the aggregate.

The example I’m most fond of here is Robert Rosenthal’s famous illustration of the effects of aspirin on heart attack. The correlation between taking aspirin daily and decreased risk of heart attack is, at best, .03 (I say at best because the estimate is based on a large 1988 study, but my understanding is that more recent studies have moderated even this small effect). In most domains of psychology, a correlation of .03 is so small as to be completely uninteresting. Most psychologists would never seriously contemplate running a study to try to detect an effect of that size. And yet, at a population level, even an r of .03 can have serious implications. Cast in a different light, what this effect means is that 3% of people who would be expected to have a heart attack without aspirin would be saved from that heart attack given a daily aspirin regimen. Needless to say, this isn’t trivial. It amounts to a potentially life-saving intervention for 30 out of every 1,000 people. At a public policy level, you’d be crazy to ignore something like that (which is why, for a long time, many doctors recommended that people take an aspirin a day). And yet, by the standards of experimental psychology, this is a tiny, tiny effect that probably isn’t worth getting out of bed for.

The point of course is that when you consider how many people are currently on antidepressants (millions), even small effects–and certainly an effect of one-third of a standard deviation–are going to be compounded many times over. Given that antidepressants demonstrably reduce the risk of suicide (according to Hall, by about 20%), there’s little doubt that tens of thousands of lives have been saved by antidepressants. That doesn’t necessarily justify their routine use, of course, because the side effects and costs also scale up to the societal level (just imagine how many millions of bouts of nausea could be prevented by eliminating antidepressants from the market!). The point is that just that, if you think the benefits of antidepressants outweigh their costs even slightly at the level of the average depressed individual, you’re probably committing yourself to thinking that they have a hugely beneficial impact at a societal level–and that holds true irrespective of whether the effects are ‘clinically meaningful’ by conventional standards.

in praise of self-policing

It’s IRB week over at The Hardest Science; Sanjay has an excellent series of posts (1, 2, 3) discussing some proposed federal rule changes to the way IRBs oversee research. The short of it is that the proposed changes are mostly good news for people who do minimal risk-type research with human subjects (i.e., stuff that doesn’t involve poking people with needles); if the changes pass as written, most of us will no longer have to file any documents with our IRBs before running our studies. We’ll just put in a short note saying we’ve determined that our studies are excused from review, and then we can start collecting data right away. It’ll work something like this*:

This doesn’t mean federal oversight of human subjects research will cease, of course. There will still be guidelines we all have to follow. But instead of making researchers jump through flaming hoops preemptively, enforcement will take place on an ad-hoc basis and via random audits. For the most part, the important decisions will be left to investigators rather than IRBs. For more details, see Sanjay’s excellent breakdown.

I also agree with Sanjay’s sentiment in his latest post that this is the right way to do things; researchers should police themselves, rather than employing an entire staff of people whose jobs it is to tell researchers how to safely and ethically do their research. In principle, the idea of having trained IRB analysts go over every study sounds nice; the problem is that it takes a very long time, generates a lot of extra work for everyone, and perhaps most problematically, sets up all sorts of perverse incentives. Namely, IRB analysts have an incentive to be pedantic (since they rarely lose their jobs if they ask for too much detail, but could be liable if they give too much leeway and something bad happens), and investigators have an incentive to off-load their conscience onto the IRB rather than actually having to think about the impact of their experiment on subjects. I catch myself doing this more often than I’d like, and I’m not really happy about it. (For instance, I recently found myself telling someone it was okay for them to present gruesome pictures to subjects “because the IRB doesn’t mind that”, and not because I thought the psychological impact was negligible. I gave myself twenty lashes for that one**.) I suspect that, aside from saving everyone a good deal of time and effort, placing the responsibility of doing research on researchers’ shoulders would actually lead them to give more, and not less, consideration to ethical issues.

Anyway, it remains to be seen whether the proposed rules actually pass in their current form. One of the interesting features of the situation is that IRBs may now perversely actually have an incentive to fight against these rules going into effect, since they’d almost certainly need to lay off staff if we move to a system where most studies are entirely excused from review. I don’t really think that this will be much of an issue, and on balance I’m sure university administrations recognize how much IRBs slow down research; but it still can’t hurt for those of us who do research with human subjects to stick our heads past the Department of Health and Human Service’s doors and affirm that excusing most non-invasive human subjects research from review is the right thing to do.


* I know, I know. I managed to go two whole years on this blog without a single lolcat appearance, and now I throw it all away for this. Sorry.

** With a feather duster.