Yes, your research is very noble. No, that’s not a reason to flout copyright law.

Scientific research is cumulative; many elements of a typical research project would not and could not exist but for the efforts of many previous researchers. This goes not only for knowledge, but also for measurement. In much of the clinical world–and also in many areas of “basic” social and life science research–people routinely save themselves inordinate amounts of work by using behavioral or self-report measures developed and validated by other researchers.

Among many researchers who work in fields heavily dependent on self-report instruments (e.g., personality psychology), there appears to be a tacit belief that, once a measure is publicly available–either because it’s reported in full in a journal article, or because all of the items and instructions can be found on the web–it’s fair game for use in subsequent research. There’s a time-honored tradition of asking one’s colleagues if they happen to “have a copy” of the NEO-PI-3, or the Narcissistic Personality Inventory, or the Hamilton Depression Rating Scale. The fact that many such measures are technically published under restrictive copyright licenses, and are often listed for sale at rather exorbitant prices (e.g., you can buy 25 paper copies of the NEO-PI-3 from the publisher for $363 US), does not seem to deter researchers much. The general understanding seems to be that if a measure is publicly available, it’s okay to use it for research purposes. I don’t think most researchers have a well-thought-out, internally consistent justification for this behavior; it seems to almost invariably be an article of tacit belief that nothing bad can or should happen to someone who uses a commercially available instrument for a purpose as noble as scientific research.

The trouble with tacit beliefs is that, like all beliefs, they can sometimes be wrong–only, because they’re tacit, they’re often not evaluated openly until things go horribly wrong. Exhibit A on the frontier of horrible wrongness is a recent news article in Science that reports on a rather disconcerting case where the author of a measure (the Eight-Item Morisky Medication Adherence Scale–which also provides a clue to its author’s name) has been demanding rather large sums of money (ranging from $2000 to $6500) from the authors of hundreds of published articles that have used the MMAS-8 without explicitly requesting permission. As the article notes, there appears to be a general agreement that Morisky is within his legal rights to demand such payment; what people seem to be objecting to is the amount Morisky is requesting, and the way he’s going about the process (i.e., with lawyers):

Morisky is well within his rights to seek payment for use of his copyrighted tool. U.S. law encourages academic scientists and their universities to protect and profit from their inventions, including those developed with public funds. But observers say Morisky’s vigorous enforcement and the size of his demands stand out. “It’s unusual that he is charging as much as he is,” says Kurt Geisinger, director of the Buros Center for Testing at the University of Nebraska in Lincoln, which evaluates many kinds of research-related tests. He and others note that many scientists routinely waive payments for such tools, as long as they are used for research.

It’s a nice article, and I think it suggests two things fairly clearly. First, Morisky is probably not a very nice man. He seems to have no compunction about charging resource-strapped researchers in third-world countries licensing fees that require them to take out loans from their home universities, and he would apparently rather see dozens of published articles retracted from the literature than suffer the indignity of having someone use his measure without going through the proper channels (and paying the corresponding fees).

Second, the normative practice in many areas of science that depend on the (re)use of measures developed by other people is to essentially flout copyright law, bury one’s head in the sand, and hope for the best.

I don’t know that anything can be done about the first observation–and even if something could be done, there will always be other Moriskys. I do, however, think that we could collectively do quite a few things to change the way scientists think about, and deal with, the re-use of self-report (and other kinds of) measures. Most of these amount to providing better guidance and training. In principle, this shouldn’t be hard to do; in most disciplines, scientists are trained in all manner of research method, statistical praxis, and scientific convention. Yet I know of no graduate program in my own discipline (psychology) that provides its students with even a cursory overview of intellectual property law. This despite the fact that many scientists’ chief assets–and the things they most closely identify their career achievements with–are their intellectual products.

This is, in my view, a serious training failure. More important, it’s an unnecessary failure, because there isn’t really very much that a social scientist needs to know about copyright law in order to dramatically reduce their odds of ending up a target of legal action. The goal is not to train PhDs who can moonlight as bad attorneys; it’s to prevent behavior that flagrantly exposes one to potential Moriskying (look! I coined a verb!). For that, a single 15-minute segment of a research methods class would likely suffice. While I’m sure someone better-informed and more lawyer-like than me could come up with a more accurate precis, here’s the gist of what I think one would want to cover:

  • Just because a measure is publicly available does not mean it’s in the public domain. It’s intuitive to suppose that any measure that can be found in a publicly accessible place (e.g., on the web) is, by default, okay for public use–meaning that, unless the author of a measure has indicated that they don’t want their measure to be used by others, it can be. In fact, the opposite is true. By default, the author of a newly produced work retains all usage and distribution rights to that work. The author can, if they are so inclined, immediately place that work in the public domain. Alternatively, they could stipulate that every time someone uses their measure, that user must, within 72 hours of use, send the author 22 green jelly beans in an unmarked paper bag. You don’t like those terms of use? Fine: don’t use the measure.

Importantly, an author isn’t under any obligation to say anything at all about how they wish their work to be reproduced or used. This means that when a researcher uses a measure that lacks explicit licensing information, that researcher is assuming the risk of running afoul of the measure author’s desires, whether or not those desires have been made publicly known. The fact that the measure happens to be publicly available may be a mitigating factor (e.g., one could potentially claim fair use, though as far as I know there’s little precedent for this type of thing in the scientific domain), but that’s a matter for lawyers to hash out, and I think most of us scientists would rather avoid lawyer-hashing if we can help it.

This takes us directly to the next point…

  • Don’t use a measure unless you’ve read, and agree with, its licensing terms. Of course, in practice, very few scientific measures are currently released with an explicit license–which gives rise to an important corollary injunction: don’t use a measure that doesn’t come with a license.

The latter statement may seem unfair; after all, it’s clear enough that most measures developed by social scientists are missing licenses not because their authors are intentionally trying to capitalize on ambiguity, but simply because most authors are ignorant of the fact that the lack of a license creates a significant liability for potential users. Walking away from unlicensed measures would amount to giving up on huge swaths of potential research, which surely doesn’t seem like a good idea.

Fortunately, I’m not suggesting anything nearly this drastic. Because the lack of licensing is typically unintentional, often, a simple, friendly email to an author may be sufficient to magic an explicit license into existence. While I haven’t had occasion to try this yet for self-report measures, I’ve been on both ends of such requests on multiple occasions when dealing with open-source software. In virtually every case I’ve been involved in, the response to an inquiry along the lines of “hey, I’d like to use your software, but there’s no license information attached” has been to either add a license to the repository (for example…), or provide an explicit statement to the effect of “you’re welcome to use this for the use case you describe”. Of course, if a response is not forthcoming, that too is instructive, as it suggests that perhaps steering clear of the tool (or measure) in question might be a good idea.

Of course, taking licensing seriously requires one to abide by copyright law–which, like it or not, means that there may be cases where the responsible (and legal) thing to do is to just walk away from a measure, even if it seems perfect for your use case from a research standpoint. If you’re serious about taking copyright seriously, and, upon emailing the author to inquire about the terms of use, you’re informed that the terms of use involve paying $100 per participant, you can either put up the money, or use a different measure. Burying your head in the sand and using the measure anyway, without paying for it, is not a good look.

  • Attach a license to every reusable product you release into the wild. This follows directly from the previous point: if you want responsible, informed users to feel comfortable using your measure, you should tell them what they can and can’t do with it. If you’re so inclined, you can of course write your own custom license, which can involve dollar bills, jelly beans, or anything else your heart desires. But unless you feel a strong need to depart from existing practices, it’s generally a good idea to select one of the many pre-existing licenses out there, because most of them have the helpful property of having been written by lawyers, and lawyers are people who generally know how to formulate sentiments like “you must give me heap big credit” in somewhat more precise language.

There are a lot of practical recommendations out there about what license one should or shouldn’t choose; I won’t get into those here, except to say that in general, I’m a strong proponent of using permissive licenses (e.g., MIT or CC-BY), and also, that I agree with many people’s sentiment that placing restrictions on commercial use–while intuitively appealing to scientists who value public goods–is generally counterproductive. In any case, the real point here is not to push people to use any particular license, but just to think about it for a few minutes when releasing a measure. I mean, you’re probably going to spend tens or hundreds of hours thinking about the measure itself; the least you can do is make sure you tell people what they’re allowed to do with it.

I think covering just the above three points in the context of a graduate research methods class–or at the very least, in those methods classes slanted towards measure development or evaluation (e.g., psychometrics)–would go a long way towards changing scientific norms surrounding measure use.

Most importantly, perhaps, the point of learning a little bit about copyright law is not just to reduce one’s exposure to legal action. There are also large communal benefits. If academic researchers collectively decided to stop flouting copyright law when choosing research measures, the developers of measures would face a very different–and, from a societal standpoint, much more favorable–set of incentives. The present state of affairs–where an instrument’s author is able to legally charge well-meaning researchers exorbitant fees post-hoc for use of an 8-item scale–exists largely because researchers refuse to take copyright seriously, and insist on acting as if science, being such a noble and humanitarian enterprise, is somehow exempt from legal considerations that people in other fields have to constantly worry about. Perversely, the few researchers who do the right thing by offering to pay for the scales they use then end up incurring large costs, while the majority who use the measures without permission suffer no consequences (except on the rare occasions when someone like Morisky comes knocking on the door with a lawyer).

By contrast, in an academic world that cared more about copyright law, many widely-used measures that are currently released under ambiguous or restrictive licenses (or, most commonly, no license at all) would never have attained widespread use in the first place. If, say, Costa & McCrae’s NEO measures–used by thousands of researchers every year–had been developed in a world where academics had a standing norm of avoiding restrictively licensed measures, the most likely outcome is that the NEO would have changed to accommodate the norm, and not vice versa. The net result is that we would be living in a world where the vast majority of measures–just like the vast majority of open-source software–really would be free to use in every sense of the word, without risk of lawsuits, and with the ability to redistribute, reuse, and modify freely. That, I think, is a world we should want to live in. And while the ship may have already sailed when it comes to the most widely used existing measures, it’s a world we could still have going forward. We just have to commit to not using new measures unless they have a clear license–and be prepared to follow the terms of that license to the letter.

strong opinions about data sharing mandates–mine included

Apparently, many scientists have rather strong feelings about data sharing mandates. In the wake of PLOS’s recent announcement–which says that, effective now, all papers published in PLOS journals must deposit their data in a publicly accessible location–a veritable gaggle of scientists have taken to their blogs to voice their outrage and/or support for the policy. The nays have posts like DrugMonkey’s complaint that the inmates are running the asylum at PLOS (more choice posts are here, here, here, and here); the yays have Edmund Hart telling the nays to get over themselves and share their data (more posts here, here, and here). While I’m a bit late to the party (mostly because I’ve been traveling and otherwise indisposed), I guess I’ll go ahead and throw my hat into the ring in support of data sharing mandates. For a number of reasons outlined below, I think time will show the anti-PLOS folks to very clearly be on the wrong side of this issue.

Mandatory public deposition is like, totally way better than a “share-upon-request” approach

You might think that proactive data deposition has little incremental utility over a philosophy of sharing one’s data upon request, since emails are these wordy little things that only take a few minutes of a data-seeker’s time to write. But it’s not just the time and effort that matter. It’s also the psychology and technology. Psychology, because if you don’t know the person on the other end, or if the data is potentially useful but not essential to you, or if you’re the agreeable sort who doesn’t like to bother other people, it’s very easy to just say, “nah, I’ll just go do something else”. Scientists are busy people. If a dataset is a click away, many people will be happy to download that dataset and play with it who wouldn’t feel comfortable emailing the author to ask for it. Technology, because data that isn’t publicly available is data that isn’t publicly indexed. It’s all well and good to say that if someone really wants a dataset, they can email you to ask for it, but if someone doesn’t know about your dataset in the first place–because it isn’t in the first three pages of Google results–they’re going to have a hard time asking.

People don’t actually share on request

Much of the criticism of the PLOS data sharing policy rests on the notion that the policy is unnecessary, because in practice most journals already require authors to share their data upon request. One point that defenders of the PLOS mandate haven’t stressed enough is that such “soft” mandates are largely meaningless. Empirical studies have repeatedly demonstrated that it’s actually very difficult to get authors to share their data upon request–even when they’re obligated to do so by the contractual agreement they’ve signed with a publisher. And when researchers do fulfill data sharing requests, they often take inordinately long to do so, and the data often don’t line up properly with what was reported in the paper (as the PLOS editors noted in their explanation for introducing the policy), or reveal potentially serious errors.

Personally, I have to confess that I often haven’t fulfilled other researchers’ requests for my data–and in at least two cases, I never even responded to the request. These failures to share didn’t reflect my desire to hide anything; they occurred largely because I knew it would be a lot of work, and/or the data were no longer readily accessible to me, and/or I was too busy to take care of the request right when it came in. I think I’m sufficiently aware of my own character flaws to know that good intentions are no match for time pressure and divided attention–and that’s precisely why I’d rather submit my work to journals that force me to do the tedious curation work up front, when I have a strong incentive to do it, rather than later, when I don’t.

Comprehensive evaluation requires access to the data

It’s hard to escape the feeling that some of the push-back against the policy is actually rooted in the fear that other researchers will find mistakes in one’s work by going through one’s data. In some cases, this fear is made explicit. For example, DrugMonkey suggested that:

There will be efforts to say that the way lab X deals with their, e.g., fear conditioning trials, is not acceptable and they MUST do it the way lab Y does it. Keep in mind that this is never going to be single labs but rather clusters of lab methods traditions. So we’ll have PLoS inserting itself in the role of how experiments are to be conducted and interpreted!

This rather dire premonition prompted a commenter to ask if it’s possible that DM might ever be wrong about what his data means–necessitating other pairs of eyes and/or opinions. DM’s response was, in essence, “No.” But clearly, this is wishful thinking: we have plenty of reasons to think that everyone in science–even the luminaries among us–makes mistakes all the time. Science is hard. In the fields I’m most familiar with, I rarely read a paper that I don’t feel has some serious flaws–even though nearly all of these papers were written by people who have, in DM’s words, “been at this for a while”. By the same token, I’m certain that other people read each of my papers and feel exactly the same way. Of course, it’s not pleasant to confront our mistakes by putting everything out into the open, and I don’t doubt that one consequence of sharing data proactively is that error-finding will indeed become much more common. At least initially (i.e., until we develop an appreciation for the true rate of error in the average dataset, and become more tolerant of minor problems), this will probably cause everyone some discomfort. But temporary discomfort surely isn’t a good excuse to continue to support practices that clearly impede scientific progress.

Part of the problem, I suspect, is that scientists have collectively internalized as acceptable many practices that are on some level clearly not good for the community as a whole. To take just one example, it’s an open secret in biomedical science that so-called “representative figures” (of spiking neurons, Western blots, or whatever else you like) are rarely truly representative. Frequently, they’re actually among the best examples the authors of a paper were able to find. The communal wink-and-shake agreement to ignore this kind of problem is deeply problematic, in that it likely allows many claims to go unchallenged that are actually not strongly supported by the data. In a world where other researchers could easily go through my dataset and show that the “representative” raster plot I presented in Figure 2C was actually the best case rather than the norm, I would probably have to be more careful about making that kind of claim up front–and someone else might not waste a lot of their time chasing results that can’t possibly be as good as my figures make them look.

Figure 1.  A representative planet.

The Data are a part of the Methods

If you still don’t find this convincing, consider that one could easily have applied nearly all of the arguments people have been making in the blogosphere these past two weeks to that dastardly scientific timesink that is the common Methods section. Imagine that we lived in a culture where scientists always reported their Results telegraphically–that is, with the brevity of a typical Nature or Science paper, but without the accompanying novel’s worth of Supplementary Methods. Then, when someone first suggested that it might perhaps be a good idea to introduce a separate section that describes in dry, technical language how authors actually produced all those exciting results, we would presumably see many people in the community saying something like the following:

Why should I bother to tell you in excruciating detail what software, reagents, and stimuli I used in my study? The vast majority of readers will never try to directly replicate my experiment, and those who do want to can just email me to get the information they need–which of course I’m always happy to provide in a timely and completely disinterested fashion. Asking me to proactively lay out every little methodological step I took is really unreasonable; it would take a very long time to write a clear “Methods” section of the kind you propose, and the benefits seem very dubious. I mean, the only thing that will happen if I adopt this new policy is that half of my competitors will start going through this new section with a fine-toothed comb in order to find problems, and the other half will now be able to scoop me by repeating the exact procedures I used before I have a chance to follow them up myself! And for what? Why do I need to tell everyone exactly what I did? I’m an expert with many years of experience in this field! I know what I’m doing, and I don’t appreciate your casting aspersions on my work and implying that my conclusions might not always be 100% sound!

As far as I can see, there isn’t any qualitative difference between reporting detailed Methods and providing comprehensive Data. In point of fact, many decisions about which methods one should use depend entirely on the nature of the data, so it’s often actually impossible to evaluate the methodological choices the authors made without seeing their data. If DrugMonkey et al think it’s crazy for one researcher to want access to another researcher’s data in order to determine whether the distribution of some variable looks normal, they should also think it’s crazy for researchers to have to report their reasoning for choosing a particular transformation in the first place. Or for using a particular reagent. Or animal strain. Or learning algorithm, or… you get the idea. But as Bjorn Brembs succinctly put it, in the digital age, this is silly: for all intents and purposes, there’s no longer any difference between text and data.

The data are funded by the taxpayers, and (in some sense) belong to the taxpayers

People vary widely in the extent to which they feel the public deserves to have access to the products of the work it funds. I don’t think I hold a particularly extreme position in this regard, in the sense that I don’t think the mere fact that someone’s effort is funded by the public automatically means any of their products should be publicly available for anyone’s perusal or use. However, when we’re talking about scientific data–where the explicit rationale for funding the work is to produce new generalizable knowledge, and where the marginal cost of replicating digital data is close to zero–I really don’t see any reason not to push very strongly to force scientists to share their data. I’m sympathetic to claims about scooping and credit assignment, but as a number of other folks have pointed out in comment threads, these are fundamentally arguments in favor of better credit assignment, and not arguments against sharing data. The fear some people have of being scooped is not sufficient justification for impeding our collective scientific progress.

It’s also worth noting that, in principle, PLOS’s new data sharing policy shouldn’t actually make it any easier for someone else to scoop you. Remember that under PLOS’s current data sharing mandate–as well as the equivalent policies at most other scientific journals–authors are already required to provide their data to anyone else upon request. Critics who argue that the new public archiving mandate opens the door to being scooped are in effect admitting that the old mandate to share upon request doesn’t work, because in theory there already shouldn’t really be anything preventing me from scooping you with your data simply by asking you for it (other than social norms–but then, the people who are actively out to usurp others’ ideas are the least likely to abide by those norms anyway). It’s striking to see how many of the posts defending the “share-upon-request” approach have no compunction in saying that they’re currently only willing to share their data after determining what the person on the other end wants to use it for–in clear violation of most journals’ existing policy.

It’s really not that hard

Organizing one’s data or code in a form minimally suitable for public consumption isn’t much fun. I do it fairly regularly; I know it sucks. It takes some time out of your day, and requires you to allocate resources to the problem that could otherwise be directed elsewhere. That said, a lot of the posts complaining about how much effort the new policy requires seem absurdly overwrought. There seems to be a widespread belief–which, as far as I can tell, isn’t supported by a careful reading of the actual PLOS policy–that there’s some incredibly strict standard that datasets have to live up to before public release. I don’t really understand where this concern comes from. Personally, I spend much of my time analyzing data other people have collected. I’ve worked with many other people’s data, and rarely is it in exactly the form I would like. Oftentimes it’s not even in the ballpark of what I’d like. And I’ve had to invest a considerable amount of my time understanding what columns and rows mean, and scrounging for morsels of (poor) documentation. My working assumption when I do this–and, I think, most other people’s–is that the onus is on me to expend some effort figuring out what’s in a dataset I wish to use, and not on the author to release that dataset in a form that a completely naive person could understand without any effort. Of course it would be nice if everyone put their data up on the web in a form that maximized accessibility, but it certainly isn’t expected*. In asking authors to deposit their data publicly, PLOS isn’t asserting that there’s a specific format or standard that all data must meet; they’re just saying data must meet accepted norms. Since those norms depend on one’s field, it stands to reason that expectations will be lower for a 10-TB fMRI dataset than for an 800-row spreadsheet of behavioral data.

There are some valid concerns, but…

I don’t want to sound too Pollyannaish about all this. I’m not suggesting that the PLOS policy is perfect, or that issues won’t arise in the course of its implementation and enforcement. It’s very clear that there are some domains in which data sharing is a hassle, and I sympathize with the people who’ve pointed out that it’s not really clear what “all” the data means–is it the raw data, which aren’t likely to be very useful to anyone, or the post-processed data, which may be too close to the results reported in the paper? But such domain- or case-specific concerns are grossly outweighed by the very general observation that it’s often impossible to evaluate previous findings adequately, or to build a truly replicable science, if you don’t have access to other scientists’ data. There’s no doubt that edge cases will arise in the course of enforcing the new policy. But they’ll be dealt with on a case-by-case basis, exactly as the PLOS policy indicates. In the meantime, our default assumption should be that editors at PLOS–who are, after all, also working scientists–will behave reasonably, since they face many of the same considerations in their own research. When a researcher tells an editor that she doesn’t have anywhere to put the 50 TB of raw data for her imaging study, I expect that that editor will typically respond by saying, “fine, but surely you can drag and drop a directory full of the first- and second-level beta images, along with a basic description, into NeuroVault, right?”, and not “Whut!? No raw DICOM images, no publication!”

As for the people who worry that by sharing their data, they’ll be giving away a competitive advantage… to be honest, I think many of these folks are mistaken about the dire consequences that would ensue if they shared their data publicly. I suspect that many of the researchers in question would be pleasantly surprised at the benefits of data sharing (increased citation rates, new offers of collaboration, etc.). Still, it’s clear enough that some of the people who’ve done very well for themselves in the current scientific system–typically by leveraging some incredibly difficult-to-acquire dataset into a cottage industry of derivative studies–would indeed do much less well in a world where open data sharing was mandatory. What I fail to see, though, is why PLOS, or the scientific community as a whole, should care very much about this latter group’s concerns. As far as I can tell, PLOS’s new policy is a significant net positive for the scientific community as a whole, even if it hurts one segment of that community in the short term. For the moment, scientists who harbor proprietary attitudes towards their data can vote with their feet by submitting their papers somewhere other than PLOS. Contrary to the dire premonitions floating around, I very much doubt any potential drop in submissions is going to deliver a terminal blow to PLOS (and the upside is that the articles that do get published in PLOS will arguably be of higher quality). In the medium-to-long term, I suspect that cultural norms surrounding who gets credit for acquiring and sharing data vs. analyzing and reporting new findings based on those data are going to undergo a sea change–to the point where in the not-too-distant future, the scoopophobia that currently drives many people to privately hoard their data is a complete non-factor. At that point, it’ll be seen as just plain common sense that if you want your scientific assertions to be taken seriously, you need to make the data used to support those assertions available for public scrutiny, re-analysis, and re-use.

 

* As a case in point, just yesterday I came across a publicly accessible dataset I really wanted to use, but that was in SPSS format. I don’t own a copy of SPSS, so I spent about an hour trying to get various third-party libraries to extract the data appropriately, without any luck. So eventually I sent the file to a colleague who was helpful enough to convert it. My first thought when I received the tab-delimited file in my mailbox this morning was not “ugh, I can’t believe they released the file in SPSS”, it was “how amazing is it that I can download this gigantic dataset acquired half the world away instantly, and with just one minor hiccup, be able to test a novel hypothesis in a high-powered way without needing to spend months of time collecting data?”

What we can and can’t learn from the Many Labs Replication Project

By now you will most likely have heard about the “Many Labs” Replication Project (MLRP)–a 36-site, 12-country, 6,344-subject effort to try to replicate a variety of classical and not-so-classical findings in psychology. You probably already know that the authors tested a variety of different effects–some recent, some not so recent (the oldest one dates back to 1941!); some well-replicated, others not so much–and reported successful replications of 10 out of 13 effects (though with widely varying effect sizes).

By and large, the reception of the MLRP paper has been overwhelmingly positive. Setting aside for the moment what the findings actually mean (see also Rolf Zwaan’s earlier take), my sense is that most psychologists are united in agreement that the mere fact that researchers at 36 different sites were able to get together and run a common protocol testing 13 different effects is a pretty big deal, and bodes well for the field in light of recent concerns about iffy results and questionable research practices.

But not everyone’s convinced. There now seems to be something of an incipient backlash against replication. Or perhaps not so much against replication itself as against the notion that the ongoing replication efforts have any special significance. An in press paper by Joseph Cesario makes a case for deferring independent efforts to replicate an effect until the original effect is theoretically well understood (a suggestion I disagree with quite strongly, and plan to follow up on in a separate post). And a number of people have questioned, in blog comments and tweets, what the big deal is. A case in point:

I think the charitable way to interpret this sentiment is that Gilbert and others are concerned that some people might read too much into the fact that the MLRP successfully replicated 10 out of 13 effects. And clearly, at least some journalists have; for instance, Science News rather irresponsibly reported that the MLRP “offers reassurance” to psychologists. That said, I don’t think it’s fair to characterize this as anything close to a dominant reaction, and I don’t think I’ve seen any researchers react to the MLRP findings as if the 10/13 number means anything special. The piece Dan Gilbert linked to in his tweet, far from promoting “hysteria” about replication, is a Nature News article by the inimitable Ed Yong, and is characteristically careful and balanced. Far from trumpeting the fact that 10 out of 13 findings replicated, here’s a direct quote from the article:

Project co-leader Brian Nosek, a psychologist at the Center of Open Science in Charlottesville, Virginia, finds the outcomes encouraging. “It demonstrates that there are important effects in our field that are replicable, and consistently so,” he says. “But that doesn’t mean that 10 out of every 13 effects will replicate.”

Kahneman agrees. The study “appears to be extremely well done and entirely convincing”, he says, “although it is surely too early to draw extreme conclusions about entire fields of research from this single effort”.

Clearly, the mere fact that 10 out of 13 effects replicated is not in and of itself very interesting. For one thing (and as Ed Yong also noted in his article), a number of the effects were selected for inclusion in the project precisely because they had already been repeatedly replicated. Had the MLRP failed to replicate these effects–including, for instance, the seminal anchoring effect discovered by Kahneman and Tversky in the 1970s–the conclusion would likely have been that something was wrong with the methodology, and not that the anchoring effect doesn’t exist. So I think pretty much everyone can agree with Gilbert that we have most assuredly not learned, as a result of the MLRP, that there’s no replication crisis in psychology after all, and that roughly 76.9% of effects are replicable. Strictly speaking, all we know is that there are at least 10 effects in all of psychology that can be replicated. But that’s not exactly what one would call an earth-shaking revelation. What’s important to appreciate, however, is that the utility of the MLRP was never supposed to be about the number of successfully replicated effects. Rather, its value is tied to a number of other findings and demonstrations–some of which are very important, and have potentially big implications for the field at large. To wit:

1. The variance between effects is greater than the variance within effects.

Here’s the primary figure from the MLRP paper:

[Figure: Many Labs Replication Project results]

Notice that the range of meta-analytic estimates for the different effect sizes (i.e., the solid green circles) is considerably larger than the range of individual estimates within a given effect. In other words, if you want to know how big a given estimate is likely to be, it’s more informative to know what effect is being studied than to know which of the 36 sites is doing the study. This may seem like a rather esoteric point, but it has important implications. Most notably, it speaks directly to the question of how much one should expect effect sizes to fluctuate from lab to lab when direct replications are attempted. If you’ve been following the controversy over the relative (non-)replicability of a number of high-profile social priming studies, you’ve probably noticed that a common defense researchers use when their findings fail to replicate is to claim that the underlying effect is very fragile, and can’t be expected to work in other researchers’ hands. What the MLRP shows, for a reasonable set of studies, is that there does not in fact appear to be a huge amount of site-to-site variability in effects. Take currency priming, for example–an effect in which priming participants with money supposedly leads them to express capitalistic beliefs and behaviors more strongly. Given a single failure to replicate the effect, one could plausibly argue that perhaps the effect was simply too fragile to reproduce consistently. But when 36 different sites all produce effects within a very narrow range–with a mean that is effectively zero–it becomes much harder to argue that the problem is that the effect is highly variable. To the contrary, the effect size estimates are remarkably consistent–it’s just that they’re consistently close to zero.
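To make the between-versus-within comparison concrete, here is a minimal sketch using made-up numbers. The effect names and per-site estimates below are synthetic placeholders, not values from the MLRP paper, and the function name is just an illustrative choice.

```python
from statistics import mean, pvariance

# Synthetic per-site effect size estimates (Cohen's d); NOT the MLRP data.
site_estimates = {
    "large_effect": [1.10, 1.25, 0.95, 1.30, 1.05],
    "medium_effect": [0.50, 0.42, 0.61, 0.55, 0.47],
    "null_effect": [0.02, -0.03, 0.01, -0.01, 0.04],
}


def between_and_within_variance(estimates_by_effect):
    """Return (variance of per-effect means, mean of per-effect variances)."""
    effect_means = [mean(v) for v in estimates_by_effect.values()]
    within = [pvariance(v) for v in estimates_by_effect.values()]
    return pvariance(effect_means), mean(within)


between, within = between_and_within_variance(site_estimates)
print(f"between-effect variance: {between:.3f}")
print(f"mean within-effect variance: {within:.3f}")
```

When the first number dwarfs the second, knowing which effect is being studied tells you far more about the expected estimate than knowing which site ran the study.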

2. Larger effects show systematically greater variability.

You can see in the above figure that the larger an effect is, the more individual estimates appear to vary across sites. In one sense, this is not terribly surprising–you might already have the statistical intuition that the larger an effect is, the more reliable variance should be available to interact with other moderating variables. Conversely, if an effect is very small to begin with, it’s probably less likely that it could turn into a very large effect under certain circumstances–or that it might reverse direction entirely. But in another sense, this finding is actually quite unexpected, because, as noted above, there’s a general sense in the field that it’s the smaller effects that tend to be more fragile and heterogeneous. To the extent we can generalize from these 13 studies, these findings should give researchers some pause before attributing replication failures to invisible moderators that somehow manage to turn very robust effects (e.g., the original currency priming effect was nearly a full standard deviation in size) into nonexistent ones.

3. A number of seemingly important variables don’t systematically moderate effects.

There have long been expressions of concern over the potential impact of cultural and population differences on psychological effects. For instance, despite repeated demonstrations that internet samples typically provide data that are as good as conventional lab samples, many researchers continue to display a deep (and in my view, completely unwarranted) skepticism of findings obtained online. More reasonably, many researchers have worried that effects obtained using university students in Western nations–the so-called WEIRD samples–may not generalize to other social groups, cultures and countries. While the MLRP results are obviously not the last word on this debate, it’s instructive to note that factors like data acquisition approach (online vs. offline) and cultural background (US vs. non-US) didn’t appear to exert a systematic effect on results. This doesn’t mean that there are no culture-specific effects in psychology of course (there undoubtedly are), but simply that our default expectation should probably be that most basic effects will generalize across cultures to at least some extent.

4. Researchers have pretty good intuitions about which findings will replicate and which ones won’t.

At the risk of offending some researchers, I submit that the likelihood that a published finding will successfully replicate is correlated to some extent with (a) the field of study it falls under and (b) the journal in which it was originally published. For example, I don’t think it’s crazy to suggest that if one were to try to replicate all of the social priming studies and all of the vision studies published in Psychological Science in the last decade, the vision studies would replicate at a consistently higher rate. Anecdotal support for this intuition comes from a string of high-profile failures to replicate famous findings–e.g., John Bargh’s demonstration that priming participants with elderly concepts leads them to walk away from an experiment more slowly. However, the MLRP goes one better than anecdote, as it included a range of effects that clearly differ in their a priori plausibility. Fortuitously, just prior to publicly releasing the MLRP results, Brian Nosek asked the following question on Twitter:

Several researchers, including me, took Brian up on his offer; here are the responses:

As you can see, pretty much everyone who replied to Brian expressed skepticism about the two priming studies (#9 and #10 in Hal Pashler’s reply). There was less consensus on the third effect. (As it happens, there were ultimately only 2 failures to replicate–the third effect became statistically significant when samples were weighted properly.) Nonetheless, most of us picked Imagined Contact as number 3, which did in fact emerge as the smallest of the statistically significant effects. (It’s probably worth mentioning that I’d personally only heard of 4 or 5 of the 13 effects prior to reading their descriptions, so it’s not as though my response was based on a deep knowledge of prior work on these effects–I simply read the descriptions of the findings and gauged their plausibility accordingly.)

Admittedly, these are just two (or three) studies. It’s possible that the MLRP researchers just happened to pick two of the only high-profile priming studies that both seem highly counterintuitive and happen to be false positives. That said, I don’t really think these findings stand out from the mass of other counterintuitive priming studies in social psychology in any way. While we obviously shouldn’t conclude from this that no high-profile, counterintuitive priming studies will successfully replicate, the fact that a number of researchers were able to prospectively determine, with a high degree of accuracy, which effects would fail to replicate (and, among those that replicated, which were rather weak), is a pretty good sign that researchers’ intuitions about plausibility and replicability are pretty decent.

Personally, I’d love to see this principle pushed further, and formalized as a much broader tool for evaluating research findings. For example, one can imagine a website where researchers could publicly (and perhaps anonymously) register their degree of confidence in the likely replicability of any finding associated with a DOI or PubMed ID. I think such a service would be hugely valuable–not only because it would help calibrate individual researchers’ intuitions and provide a sense of the field’s overall belief in an effect, but because it would provide a useful index of a finding’s importance in the event of successful replication (i.e., the authors of a well-replicated finding should probably receive more credit if the finding was initially viewed with great skepticism than if it was universally deemed rather obvious).
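As a rough illustration of what the core of such a service might look like, here is a toy in-memory sketch. Everything in it is hypothetical: the class name, the "surprise credit" scoring rule (one minus the mean prior confidence), and the placeholder identifier are assumptions made for the example, not a description of any existing system.

```python
from collections import defaultdict
from statistics import mean


class ReplicabilityRegistry:
    """Toy in-memory store of confidence ratings keyed by DOI or PubMed ID."""

    def __init__(self):
        self._ratings = defaultdict(list)

    def register(self, finding_id, confidence):
        """Record one researcher's probability (0 to 1) that a finding replicates."""
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        self._ratings[finding_id].append(confidence)

    def field_belief(self, finding_id):
        """Mean registered confidence for a finding."""
        return mean(self._ratings[finding_id])

    def surprise_credit(self, finding_id, replicated):
        """Hypothetical credit: higher when a skeptically viewed finding replicates."""
        return (1.0 - self.field_belief(finding_id)) if replicated else 0.0


registry = ReplicabilityRegistry()
finding = "10.0000/placeholder-doi"  # made-up identifier
for rating in (0.2, 0.4, 0.3):
    registry.register(finding, rating)

print(registry.field_belief(finding))
print(registry.surprise_credit(finding, replicated=True))
```

A real service would need persistence, identity handling, and a better-motivated scoring rule (a proper scoring rule such as the Brier score would be one candidate for calibrating individual raters).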

There are other potentially important findings in the MLRP paper that I haven’t mentioned here (see Rolf Zwaan’s blog post for additional points), but if nothing else, I hope this will help convince any remaining skeptics that this is indeed a landmark paper for psychology–even though the number of successful replications is itself largely meaningless.

Oh, there’s one last point worth mentioning, in light of the rather disagreeable tone of the debate surrounding previous replication efforts. If your findings are ever called into question by a multinational consortium of 36 research groups, this is exactly how you should respond:

Social psychologist Travis Carter of Colby College in Waterville, Maine, who led the original flag-priming study, says that he is disappointed but trusts Nosek’s team wholeheartedly, although he wants to review their data before commenting further. Behavioural scientist Eugene Caruso at the University of Chicago in Illinois, who led the original currency-priming study, says, “We should use this lack of replication to update our beliefs about the reliability and generalizability of this effect”, given the “vastly larger and more diverse sample” of the MLRP. Both researchers praised the initiative.

Carter and Caruso’s attitude towards the MLRP is really exemplary; people make mistakes all the time when doing research, and shouldn’t be held responsible for the mere act of publishing incorrect findings (excepting cases of deliberate misconduct or clear negligence). What matters is, as Caruso notes, whether and to what extent one shows a willingness to update one’s beliefs in response to countervailing evidence. That’s one mark of a good scientist.

…and then there were two!

Last year when I launched my lab (which, full disclosure, is really just me, plus some of my friends who were kind enough to let me plaster their names and faces on my website), I decided to call it the Psychoinformatics Lab (or PILab for short and pretentious), because, well, why not. It seemed to nicely capture what my research is about: psychology and informatics. But it wasn’t an entirely comfortable decision, because a non-trivial portion of my brain was quite convinced that everyone was going to laugh at me. And even now, after more than a year of saying I’m a “psychoinformatician” whenever anyone asks me what I do, I still feel a little bit fraudulent each time–as if I’d just said I was a member of the Estonian Cosmonaut program, or the president of the Build-a-Bear fan club*.

But then… just last week… everything suddenly changed! All in one fell swoop–in one tiny little nudge of a shove-this-on-the-internet button, things became magically better. And now colors are vibrating**, birds are chirping merry chirping songs–no, wait, those are actually cicadas–and the world is basking in a pleasant red glow of humming monitors and five-star Amazon reviews. Or something like that. I’m not so good with the metaphors.

Why so upbeat, you ask? Well, because as of this writing, there is no longer just the one lone Psychoinformatics Lab. No! Now there are not one, not three, not seven Psychoinformatics Labs, but… two! There are two Psychoinformatics Labs. The good Dr. Michael Hanke (of PyMVPA and NeuroDebian fame) has just finished putting the last coat of paint on the inside of his brand new cage Psychoinformatics Lab at the Otto-von-Guericke University Magdeburg in Magdeburg, Germany. No, really***: his startup package didn’t include any money for paint, so he had to barter his considerable programming skills for three buckets of Going to the Chapel (yes, that’s a real paint color).

The good Dr. Hanke drifts through interstellar space in search of new psychoinformatic horizons.

Anyway, in case you can’t tell, I’m quite excited about this. Not because it’s a sign that informatics approaches are making headway in psychology, or that pretty soon every psychology lab will have a high-performance computing cluster hiding in its closet (one can dream, right?). No sir. I’m excited for two much more pedestrian reasons. First, because from now on, any time anyone makes fun of me for calling myself a psychoinformatician, I’ll be able to say, with a straight face, well it’s not just me, you know–there are multiple ones of us doing this here research-type thing with the data and the psychology and the computers. And secondly, because Michael is such a smart and hardworking guy that I’m pretty sure he’s going to legitimize this whole enterprise and drag me along for the ride with him, so I won’t have to do anything else myself. Which is good, because if laziness was an Olympic sport, I’d never leave the starting block.

No, but in all seriousness, Michael is an excellent scientist and an exceptional human being, and I couldn’t be happier for him in his new job as Lord Director of All Things Psychoinformatic (Eastern Division). You might think I’m only saying this because he just launched the world’s second PILab, complete with quote from yours truly on said lab’s website front page. Well, you’d be right. But still. He’s a pretty good guy, and I’m sure we’re going to see amazing things coming out of Magdeburg.

Now if anyone wants to launch PILab #3 (maybe in Asia or South America?), just let me know, and I’ll make you the same offer I made Michael: an envelope full of $1 bills (well, you know, I’m an academic–I can’t afford Benjamins just yet) and a blog post full of ridiculous superlatives.

 

* Perhaps that’s not a good analogy, because that one may actually exist.

** But seriously, in real life, colors should not vibrate. If you ever notice colors vibrating, drive to the nearest emergency room and tell them you’re seeing colors vibrating.

*** No, not really.

the Neurosynth viewer goes modular and open source

If you’ve visited the Neurosynth website lately, you may have noticed that it looks… the same way it’s always looked. It hasn’t really changed in the last ~20 months, despite the vague promise on the front page that in the next few months, we’re going to do X, Y, Z to improve the functionality. The lack of updates is not by design; it’s because until recently I didn’t have much time to work on Neurosynth. Now that much of my time is committed to the project, things are moving ahead pretty nicely, though the changes behind the scenes aren’t reflected in any user-end improvements yet.

The github repo is now regularly updated and even gets the occasional contribution from someone other than myself; I expect that to ramp up considerably in the coming months. You can already use the code to run your own automated meta-analyses fairly easily; e.g., with everything set up right (follow the Readme and examples in the repo), the following lines of code:

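import cPickle
from neurosynth.analysis import meta
dataset = cPickle.load(open('dataset.pkl', 'rb'))
# Select studies that use memory* but not wm, working, or episod*
studies = dataset.get_ids_by_expression("memory* &~ (wm|working|episod*)", threshold=0.001)
ma = meta.MetaAnalysis(dataset, studies)
ma.save_results('memory')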

…will perform an automated meta-analysis of all studies in the Neurosynth database that use the term ‘memory’ at a frequency of 1 in 1,000 words or greater, but don’t use the terms wm or working, or words that start with ‘episod’ (e.g., episodic). You can perform queries that nest to arbitrary depths, so it’s a pretty powerful engine for quickly generating customized meta-analyses, subject to all of the usual caveats surrounding Neurosynth (i.e., that the underlying data are very noisy, that terms aren’t mental states, etc.).
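To give a feel for what that kind of query does conceptually, here is a small self-contained sketch. It is not how Neurosynth implements expression parsing; the study IDs, term frequencies, and helper names are all made up for illustration, and the boolean logic is hard-coded as "include one pattern, exclude a list of patterns" rather than parsed from a string.

```python
from fnmatch import fnmatch

# Toy term frequencies (proportion of words) per study; IDs and values are made up.
frequencies = {
    "study_a": {"memory": 0.004, "retrieval": 0.002},
    "study_b": {"memory": 0.003, "working": 0.002},
    "study_c": {"memory": 0.0005},
    "study_d": {"memory": 0.002, "episodic": 0.0015},
}


def uses_term(term_freqs, pattern, threshold=0.001):
    """True if any term matching the wildcard pattern meets the frequency threshold."""
    return any(
        fnmatch(term, pattern) and freq >= threshold
        for term, freq in term_freqs.items()
    )


def select_studies(freqs, include, exclude, threshold=0.001):
    """IDs of studies that match `include` but none of the `exclude` patterns."""
    return sorted(
        study
        for study, tf in freqs.items()
        if uses_term(tf, include, threshold)
        and not any(uses_term(tf, pat, threshold) for pat in exclude)
    )


selected = select_studies(frequencies, "memory*", ["wm", "working", "episod*"])
print(selected)
```

Here study_b and study_d are dropped because they use an excluded term, and study_c is dropped because its use of "memory" falls below the 1-in-1,000 threshold.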

Anyway, with the core tools coming along, I’ve started to turn back to other elements of the project, starting with the image viewer. Yesterday I pushed the first commit of a new version of the viewer that’s currently on the Neurosynth website. In the next few weeks, this new version will be replacing the current version of the viewer, along with a bunch of other changes to the website.

A live demo of the new viewer is available here. It’s not much to look at right now, but behind the scenes, it’s actually a huge improvement on the old viewer in a number of ways:

  • The code is completely refactored and is all nice and object-oriented now. It’s also in CoffeeScript, which is an alternative and (if you’re coming from a Python or Ruby background) much more readable syntax for JavaScript. The source code is on github and contributions are very much encouraged. Like most scientists, I’m generally loath to share my code publicly because I think it sucks most of the time. But I actually feel pretty good about this code. It’s not good code by any stretch, but I think it rises to the level of ‘mostly sensible’, which is about as much as I can hope for.
  • The viewer now handles multiple layers simultaneously, with the ability to hide and show layers, reorder them by dragging, vary the transparency, assign different color palettes, etc. These features have been staples of offline viewers pretty much since the prehistoric beginnings of fMRI time, but they aren’t available in the current Neurosynth viewer or most other online viewers I’m aware of, so this is a nice addition.
  • The architecture is modular, so that it should be quite easy in future to drop in other alternative views onto the data without having to muck about with the app logic. E.g., adding a 3D WebGL-based view to complement the current 2D slice-based HTML5 canvas approach is on the near-term agenda.
  • The resolution of the viewer is now higher–up from 4 mm to 2 mm (which is the most common native resolution used in packages like SPM and FSL). The original motivation for downsampling to 4 mm in the prior viewer was to keep filesize to a minimum and speed up the initial loading of images. But at some point I realized, hey, we’re living in the 21st century; people have fast internet connections now. So now the files are all in 2 mm resolution, which has the unpleasant effect of increasing file sizes by a factor of about 8, but also has the pleasant effect of making it so that you can actually tell what the hell you’re looking at.
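The factor of 8 in that last point follows from the fact that halving the voxel edge length doubles the voxel count along each of the three axes. A minimal sketch of the arithmetic, assuming isotropic voxels and an uncompressed image (the function name is just for illustration):

```python
def voxel_count_ratio(old_mm, new_mm):
    """Factor by which the voxel count grows when isotropic voxel size shrinks."""
    return (old_mm / new_mm) ** 3


print(voxel_count_ratio(4, 2))  # 8.0
```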

Most importantly, there’s now a clean, and near-complete, separation between the HTML/CSS content and the JavaScript code. Which means that you can now effectively drop the viewer into just about any HTML page with just a few lines of code. So in theory, you can have basically the same viewer you see in the demo just by sticking something like the following into your page:

 viewer = Viewer.get('#layer_list', '.layer_settings')
 viewer.addView('#view_axial', 2);
 viewer.addView('#view_coronal', 1);
 viewer.addView('#view_sagittal', 0);
 viewer.addSlider('opacity', '.slider#opacity', 'horizontal', 'false', 0, 1, 1, 0.05);
 viewer.addSlider('pos-threshold', '.slider#pos-threshold', 'horizontal', 'false', 0, 1, 0, 0.01);
 viewer.addSlider('neg-threshold', '.slider#neg-threshold', 'horizontal', 'false', 0, 1, 0, 0.01);
 viewer.addColorSelect('#color_palette');
 viewer.addDataField('voxelValue', '#data_current_value')
 viewer.addDataField('currentCoords', '#data_current_coords')
 viewer.loadImageFromJSON('data/MNI152.json', 'MNI152 2mm', 'gray')
 viewer.loadImageFromJSON('data/emotion_meta.json', 'emotion meta-analysis', 'bright lights')
 viewer.loadImageFromJSON('data/language_meta.json', 'language meta-analysis', 'hot and cold')
 viewer.paint()

Well, okay, there are some other dependencies and styling stuff you’re not seeing. But all of that stuff is included in the example folder here. And of course, you can modify any of the HTML/CSS you see in the example; the whole point is that you can now easily style the viewer however you want it, without having to worry about any of the app logic.

What’s also nice about this is that you can easily pick and choose which of the viewer’s features you want to include in your page; nothing will (or at least, should) break no matter what you do. So, for example, you could decide you only want to display a single view showing only axial slices; or to allow users to manipulate the threshold of layers but not their opacity; or to show the current position of the crosshairs but not the corresponding voxel value; and so on. All you have to do is include or exclude the various addView(), addSlider(), and addDataField() lines you see above.

Of course, it wouldn’t be a mediocre open source project if it didn’t have some important limitations I’ve been hiding from you until near the very end of this post (hoping, of course, that you wouldn’t bother to read this far down). The biggest limitation is that the viewer expects images to be in JSON format rather than a binary format like NIFTI or Analyze. This is a temporary headache until I or someone else can find the time and motivation to adapt one of the JavaScript NIFTI readers that are already out there (e.g., Satra Ghosh’s parser for xtk), but for now, if you want to load your own images, you’re going to have to take the extra step of first converting them to JSON. Fortunately, the core Neurosynth Python package has an img_to_json() function in the imageutils module that will read in a NIFTI or Analyze volume and produce a JSON string in the expected format. That said, I’m pretty sure it doesn’t handle orientation properly for some images, so don’t be surprised if your images look wonky. (And more importantly, if you fix the orientation issue, please commit your changes to the repo.)

In any case, as long as you’re comfortable with a bit of HTML/CSS/JavaScript hacking, the example/ folder in the github repo has everything you need to drop the viewer into your own pages. If you do use this code internally, please let me know! Partly for my own edification, but mostly because when I write my annual progress reports to the NIH, it’s nice to be able to truthfully say, “hey, look, people are actually using this neat thing we built with taxpayer money.”

tracking replication attempts in psychology–for real this time

I’ve written a few posts on this blog about how the development of better online infrastructure could help address and even solve many of the problems psychologists and other scientists face (e.g., the low reliability of peer review, the ‘fudge factor’ in statistical reporting, the sheer size of the scientific literature, etc.). Actually, that general question–how we can use technology to do better science–occupies a good chunk of my research these days (see e.g., Neurosynth). One question I’ve been interested in for a long time is how to keep track not only of ‘successful’ studies (i.e., those that produce sufficiently interesting effects to make it into the published literature), but also replication failures (or successes of limited interest) that wind up in researchers’ file drawers. A couple of years ago I went so far as to build a prototype website for tracking replication attempts in psychology. Unfortunately, it never went anywhere, partly (okay, mostly) because the site really sucked, and partly because I didn’t really invest much effort in drumming up interest (mostly due to lack of time). But I still think the idea is a valuable one in principle, and a lot of other people have independently had the same idea (which means it must be right, right?).

Anyway, it looks like someone finally had the cleverness, time, and money to get this right. Hal Pashler, Sean Kang*, and colleagues at UCSD have been developing an online database for tracking attempted replications of psychology studies for a while now, and it looks like it’s now in beta. PsychFileDrawer is a very slick, full-featured platform that really should–if there’s any justice in the world–provide the kind of service everyone’s been saying we need for a long time now. If it doesn’t work, I think we’ll have some collective soul-searching to do, because I don’t think it’s going to get any easier than this to add and track attempted replications. So go use it!

 

*Full disclosure: Sean Kang is a good friend of mine, so I’m not completely impartial in plugging this (though I’d do it anyway). Sean also happens to be amazingly smart and in search of a faculty job right now. If I were you, I’d hire him.

see me flub my powerpoint slides on NIF tv!

 

UPDATE: the webcast is now archived here for posterity.

This is kind of late notice and probably of interest to few people, but I’m giving the NIF webinar tomorrow (or today, depending on your time zone–either way, we’re talking about November 1st). I’ll be talking about Neurosynth, and focusing in particular on the methods and data, since that’s what NIF (which stands for Neuroscience Information Framework) is all about. Assuming all goes well, the webinar should start at 11 am PST. But since I haven’t done a webcast of any kind before, and have a surprising knack for breaking audiovisual equipment at a distance, all may not go well. Which I suppose could make for a more interesting presentation. In any case, here’s the abstract:

The explosive growth of the human neuroimaging literature has led to major advances in understanding of human brain function, but has also made aggregation and synthesis of neuroimaging findings increasingly difficult. In this webinar, I will describe a highly automated brain mapping framework called NeuroSynth that uses text mining, meta-analysis and machine learning techniques to generate a large database of mappings between neural and cognitive states. The NeuroSynth framework can be used to automatically conduct large-scale, high-quality neuroimaging meta-analyses, address long-standing inferential problems in the neuroimaging literature (e.g., how to infer cognitive states from distributed activity patterns), and support accurate ‘decoding’ of broad cognitive states from brain activity in both entire studies and individual human subjects. This webinar will focus on (a) the methods used to extract the data, (b) the structure of the resulting (publicly available) datasets, and (c) some major limitations of the current implementation. If time allows, I’ll also provide a walk-through of the associated web interface (http://neurosynth.org) and will provide concrete examples of some potential applications of the framework.

There’s some more info (including details about how to connect, which might be important) here. And now I’m off to prepare my slides. And script some evasive and totally non-committal answers to deploy in case of difficult questions from the peanut gallery respected audience.

in which I suffer a minor setback due to hyperbolic discounting

I wrote a paper with some collaborators that was officially published today in Nature Methods (though it’s been available online for a few weeks). I spent a year of my life on this (a YEAR! That’s like 30 years in opossum years!), so go read the abstract, just to humor me. It’s about large-scale automated synthesis of human functional neuroimaging data. In fact, it’s so about that that that’s the title of the paper*. There’s also a companion website over here, which you might enjoy playing with if you like brains.

I plan to write a long post about this paper at some point in the near future, but not today. What I will do today is tell you all about why I didn’t write anything about the paper much earlier (i.e., 4 weeks ago, when it appeared online), because you seem very concerned. You see, I had grand plans for writing a very detailed and wonderfully engaging multi-part series of blog posts about the paper, starting with the background and motivation for the project (that would have been Part 1), then explaining the methods we used (Part 2), then the results (III; let’s switch to Roman numerals for effect), then some of the implications (IV), then some potential applications and future directions (V), then some stuff that didn’t make it into the paper (VI), and then, finally, a behind-the-science account of how it really all went down (VII; complete with filmed interviews with collaborators who left the project early due to creative differences). A seven-part blog post! All about one paper! It would have been longer than the article itself! And all the supplemental materials! Combined! Take my word for it, it would have been amazing.

Unfortunately, like most everyone else, I’m a much better person in the future than I am in the present; things that would take me a week of full-time work in the Now apparently take me only five to ten minutes when I plan them three months ahead of time. If you plotted my temporal discounting curve for intellectual effort, it would look like this:

So that’s why my seven-part series of blog posts didn’t debut at the same time the paper was published online a few weeks ago. In fact, it hasn’t debuted at all. At this point, my much more modest goal is just to write a single much shorter post, which will no longer be able to DEBUT, but can at least slink into the bar unnoticed while everyone else is out on the patio having a smoke. And really, I’m only doing it so I can look myself in the eye again when I look myself in the mirror. Because it turns out it’s very hard to shave your face safely if you’re not allowed to look yourself in the eye. And my labmates are starting to call me PapercutMan, which isn’t really a superpower worth having.

So yeah, I’ll write something about this paper soon. But just to play it safe, I’m not going to operationally define ‘soon’ right now.

 

* Three “that”s in a row! What are the odds! Good luck parsing that sentence!

The psychology of parapsychology, or why good researchers publishing good articles in good journals can still get it totally wrong

Unless you’ve been pleasantly napping under a rock for the last couple of months, there’s a good chance you’ve heard about a forthcoming article in the Journal of Personality and Social Psychology (JPSP) purporting to provide strong evidence for the existence of some ESP-like phenomenon. (If you’ve been napping, see here, here, here, here, here, or this comprehensive list). In the article–appropriately titled Feeling the Future–Daryl Bem reports the results of 9 (yes, 9!) separate experiments that catch ordinary college students doing things they’re not supposed to be able to do–things like detecting the on-screen location of erotic images that haven’t actually been presented yet, or being primed by stimuli that won’t be displayed until after a response has already been made.

As you might expect, Bem’s article’s causing quite a stir in the scientific community. The controversy isn’t over whether or not ESP exists, mind you; scientists haven’t lost their collective senses, and most of us still take it as self-evident that college students just can’t peer into the future and determine where as-yet-unrevealed porn is going to soon be hidden (as handy as that ability might be). The real question on many people’s minds is: what went wrong? If there’s obviously no such thing as ESP, how could a leading social psychologist publish an article containing a seemingly huge amount of evidence in favor of ESP in the leading social psychology journal, after being peer reviewed by four other psychologists? Or, to put it in more colloquial terms–what the fuck?

What the fuck?

Many critiques of Bem’s article have tried to dismiss it by searching for the smoking gun–the single critical methodological flaw that dooms the paper. For instance, one critique that’s been making the rounds, by Wagenmakers et al., argues that Bem should have done a Bayesian analysis, and that his failure to adjust his findings for the infinitesimally low prior probability of ESP (essentially, the strength of subjective belief against ESP) means that the evidence for ESP is vastly overestimated. I think these types of argument have a kernel of truth, but also suffer from some problems (for the record, I don’t really agree with the Wagenmakers critique, for reasons Andrew Gelman has articulated here). Having read the paper pretty closely twice, I really don’t think there’s any single overwhelming flaw in Bem’s paper (actually, in many ways, it’s a nice paper). Instead, there are a lot of little problems that collectively add up to produce a conclusion you just can’t really trust. Below is a decidedly non-exhaustive list of some of these problems. I’ll warn you now that, unless you care about methodological minutiae, you’ll probably find this very boring reading. But that’s kind of the point: attending to this stuff is so boring that we tend not to do it, with potentially serious consequences. Anyway:

  • Bem reports 9 different studies, which sounds (and is!) impressive. But a noteworthy feature of these studies is that they have grossly uneven sample sizes, ranging all the way from N = 50 to N = 200, in blocks of 50. As far as I can tell, no justification for these differences is provided anywhere in the article, which raises red flags, because the most common explanation for differing sample sizes–especially on this order of magnitude–is data peeking. That is, what often happens is that researchers periodically peek at their data, and halt data collection as soon as they obtain a statistically significant result. This may seem like a harmless little foible, but as I’ve discussed elsewhere, it’s actually a very bad thing, as it can substantially inflate Type I error rates (i.e., false positives). To his credit, Bem was at least being systematic about his data peeking, since his sample sizes always increase in increments of 50. But even in steps of 50, false positives can be grossly inflated. For instance, for a one-sample t-test, a researcher who peeks at her data in increments of 50 subjects and terminates data collection when a significant result is obtained (or N = 200, if no such result is obtained) can expect an actual Type I error rate of about 13%–nearly 3 times the nominal rate of 5%!
  • There’s some reason to think that the 9 experiments Bem reports weren’t necessarily designed as such. Meaning that they appear to have been ‘lumped’ or ‘split’ post hoc based on the results. For instance, Experiment 2 had 150 subjects, but the experimental design for the first 100 differed from the final 50 in several respects. They were minor respects, to be sure (e.g., pictures were presented randomly in one study, but in a fixed sequence in the other), but were still comparable in scope to those that differentiated Experiment 8 from Experiment 9 (which had the same sample size splits of 100 and 50, but were presented as two separate experiments). There’s no obvious reason why a researcher would plan to run 150 subjects up front, then decide to change the design after 100 subjects, and still call it the same study. A more plausible explanation is that Experiment 2 was actually supposed to be two separate experiments (a successful first experiment with N = 100 followed by an intended replication with N = 50) that were collapsed into one large study when the second experiment failed–preserving the statistically significant result in the full sample. Needless to say, this kind of lumping and splitting is liable to additionally inflate the false positive rate.
  • Most of Bem’s experiments allow for multiple plausible hypotheses, and it’s rarely clear why Bem would have chosen, up front, the hypotheses he presents in the paper. For instance, in Experiment 1, Bem finds that college students are able to predict the future location of erotic images that haven’t yet been presented (essentially a form of precognition), yet show no ability to predict the location of negative, positive, or romantic pictures. Bem’s explanation for this selective result is that “… such anticipation would be evolutionarily advantageous for reproduction and survival if the organism could act instrumentally to approach erotic stimuli …”. But this seems kind of silly on several levels. For one thing, it’s really hard to imagine that there’s an adaptive benefit to keeping an eye out for potential mates, but not for other potential positive signals (represented by non-erotic positive images). For another, it’s not like we’re talking about actual people or events here; we’re talking about digital images on an LCD. What Bem is effectively saying is that, somehow, someway, our ancestors evolved the extrasensory capacity to read digital bits from the future–but only pornographic ones. Not very compelling, and one could easily have come up with a similar explanation in the event that any of the other picture categories had selectively produced statistically significant results. Of course, if you get to test 4 or 5 different categories at p < .05, and pretend that you called it ahead of time, your false positive rate isn’t really 5%–it’s closer to 20%.
  • I say p < .05, but really, it’s more like p < .1, because the vast majority of tests Bem reports use one-tailed tests–effectively instantaneously doubling the false positive rate. There’s a long-standing debate in the literature, going back at least 60 years, as to whether it’s ever appropriate to use one-tailed tests, but even proponents of one-tailed tests will concede that you should only use them if you really truly have a directional hypothesis in mind before you look at your data. That seems exceedingly unlikely in this case, at least for many of the hypotheses Bem reports testing.
  • Nearly all of Bem’s statistically significant p values are very close to the critical threshold of .05. That’s usually a marker of selection bias, particularly given the aforementioned unevenness of sample sizes. When experiments are conducted in a principled way (i.e., with minimal selection bias or peeking), researchers will often get very low p values, since it’s very difficult to know up front exactly how large effect sizes will be. But in Bem’s 9 experiments, he almost invariably collects just enough subjects to detect a statistically significant effect. There are really only two explanations for that: either Bem is (consciously or unconsciously) deciding what his hypotheses are based on which results attain significance (which is not good), or he’s actually a master of ESP himself, and is able to peer into the future and identify the critical sample size he’ll need in each experiment (which is great, but unlikely).
  • Some of the correlational effects Bem reports–e.g., that people with high stimulus seeking scores are better at ESP–appear to be based on measures constructed post hoc. For instance, Bem uses a non-standard, two-item measure of boredom susceptibility, with no real justification provided for this unusual item selection, and no reporting of results for the presumably many other items and questionnaires that were administered alongside these items (except to parenthetically note that some measures produced non-significant results and hence weren’t reported). Again, the ability to select from among different questionnaires–and to construct custom questionnaires from different combinations of items–can easily inflate Type I error.
  • It’s not entirely clear how many studies Bem ran. In the Discussion section, he notes that he could “identify three sets of findings omitted from this report so far that should be mentioned lest they continue to languish in the file drawer”, but it’s not clear from the description that follows exactly how many studies these “three sets of findings” comprised (or how many ‘pilot’ experiments were involved). What we’d really like to know is the exact number of (a) experiments and (b) subjects Bem ran, without qualification, and including all putative pilot sessions.
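If you'd rather not take my word for the numbers in the first and third bullets, they're easy enough to check. What follows is a minimal sketch in Python (standard library only), not anything taken from Bem's paper. The function names are mine, the simulated data are standard normal with a true effect of exactly zero, the two-sided 5% critical values of t are hardcoded from a standard table, and the family-wise calculation assumes the tests are independent, which real picture categories may well not be.

```python
import math
import random

# Two-sided 5% critical values of Student's t for df = n - 1 at each look.
# Hardcoded from a standard t table so the sketch needs no third-party packages.
T_CRIT = {50: 2.0096, 100: 1.9842, 150: 1.9760, 200: 1.9720}


def t_statistic(sample):
    """One-sample t statistic for H0: population mean = 0."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean / math.sqrt(var / n)


def peeking_false_positive_rate(n_sims=4000, looks=(50, 100, 150, 200), seed=0):
    """Fraction of null experiments declared significant at any of the looks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        data = []
        for n in looks:
            # Run more subjects until we reach the next look (true effect is zero).
            needed = n - len(data)
            data.extend(rng.gauss(0.0, 1.0) for _ in range(needed))
            if abs(t_statistic(data)) > T_CRIT[n]:
                hits += 1
                break  # "significant" result: stop collecting data
    return hits / n_sims


def familywise_rate(alpha, k):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


if __name__ == "__main__":
    print("Type I error with peeking:", peeking_false_positive_rate())
    print("4 categories at .05:", round(familywise_rate(0.05, 4), 3))
    print("5 categories at .05:", round(familywise_rate(0.05, 5), 3))
```

The simulated rate bounces around a bit from seed to seed, but it should land in the neighborhood of the 13% figure quoted above, and the family-wise rates for 4 and 5 independent tests come out a little under and a little over 20%, respectively.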

It’s important to note that none of these concerns is really terrible individually. Sure, it’s bad to peek at your data, but data peeking alone probably isn’t going to produce 9 different false positives. Nor is using one-tailed tests, or constructing measures on the fly, etc. But when you combine data peeking, liberal thresholds, study recombination, flexible hypotheses, and selective measures, you have a perfect recipe for spurious results. And the fact that there are 9 different studies isn’t any guard against false positives when fudging is at work; if anything, it may make it easier to produce a seemingly consistent story, because reviewers and readers have a natural tendency to relax the standards for each individual experiment. So when Bem argues that “…across all nine experiments, Stouffer’s z = 6.66, p = 1.34 × 10⁻¹¹,” the statement that the cumulative p value is 1.34 × 10⁻¹¹ is close to meaningless. Combining p values that way would only be appropriate under the assumption that Bem conducted exactly 9 tests, and without any influence of selection bias. But that’s clearly not the case here.
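To make the point about combined p values concrete, here's a rough sketch of Stouffer's method (sum the z-scores and divide by the square root of the number of studies), along with a toy demonstration of what selection does to it. To be clear about what's assumed: the function names are mine, and the scenario in which each "reported" study is the best of 5 independent null tests is purely hypothetical. It is not a claim about what Bem actually did, just an illustration of how quickly the combined statistic stops meaning anything once selection enters the picture.

```python
import math
import random


def stouffer_z(z_scores):
    """Stouffer's combined z: sum of the z-scores over sqrt(number of studies)."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def one_tailed_p(z):
    """Upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def selected_null_studies(n_studies=9, tests_per_study=5, seed=0):
    """Hypothetical selective reporting: no true effect anywhere, but each
    'reported' z-score is the largest of several independent null tests."""
    rng = random.Random(seed)
    return [
        max(rng.gauss(0.0, 1.0) for _ in range(tests_per_study))
        for _ in range(n_studies)
    ]


if __name__ == "__main__":
    # The tail probability for the z reported in the paper; 6.66 is rounded,
    # so this only approximately reproduces the quoted p value.
    print("p for z = 6.66:", one_tailed_p(6.66))

    # Nine pure-noise studies, each cherry-picked from 5 tests.
    z = stouffer_z(selected_null_studies())
    print("combined z under selection:", round(z, 2), "p =", one_tailed_p(z))
```

Averaged over many runs, the combined z from these pure-noise studies comes out somewhere around 3.5, which would look wildly significant if you took it at face value. The formula isn't wrong; it just assumes that the nine numbers going in are the only nine that were ever computed.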

What would it take to make the results more convincing?

Admittedly, there are quite a few assumptions involved in the above analysis. I don’t know for a fact that Bem was peeking at his data; that just seems like a reasonable assumption given that no justification was provided anywhere for the use of uneven samples. It’s conceivable that Bem had perfectly good, totally principled, reasons for conducting the experiments exactly as he did. But if that’s the case, defusing these criticisms should be simple enough. All it would take for Bem to make me (and presumably many other people) feel much more comfortable with the results is an affirmation of the following statements:

  • That the sample sizes of the different experiments were determined a priori, and not based on data snooping;
  • That the distinction between pilot studies and ‘real’ studies was clearly defined up front–i.e., there weren’t any studies that started out as pilots but eventually ended up in the paper, or studies that were supposed to end up in the paper but that were disqualified as pilots based on the (lack of) results;
  • That there was a clear one-to-one mapping between intended studies and reported studies; i.e., Bem didn’t ‘lump’ together two different studies in cases where one produced no effect, or split one study into two in cases where different subsets of the data both showed an effect;
  • That the predictions reported in the paper were truly made a priori, and not on the basis of the results (e.g., that the hypothesis that sexually arousing stimuli would be the only ones to show an effect was actually written down in one of Bem’s notebooks somewhere);
  • That the various transformations applied to the RT and memory performance measures in some Experiments weren’t selected only after inspecting the raw, untransformed values and failing to identify significant results;
  • That the individual differences measures reported in the paper were selected a priori and not based on post-hoc inspection of the full pattern of correlations across studies;
  • That Bem didn’t run dozens of other statistical tests that failed to produce statistically significant results and hence weren’t reported in the paper.

Endorsing this list of statements (or perhaps a somewhat more complete version, as there are other concerns I didn’t mention here) would be sufficient to cast Bem’s results in an entirely new light, and I’d go so far as to say that I’d even be willing to suspend judgment on his conclusions pending additional data (which would be a big deal for me, since I don’t have a shred of a belief in ESP). But I confess that I’m not holding my breath, if only because I imagine that Bem would have already addressed these concerns in his paper if there were indeed principled justifications for the design choices in question.

It isn’t a bad paper

If you’ve read this far (why??), this might seem like a pretty damning review, and you might be thinking, boy, this is really a terrible paper. But I don’t think that’s true at all. In many ways, I think Bem’s actually been relatively careful. The thing to remember is that this type of fudging isn’t unusual; to the contrary, it’s rampant–everyone does it. And that’s because it’s very difficult, and often outright impossible, to avoid. The reality is that scientists are human, and like all humans, have a deep-seated tendency to work to confirm what they already believe. In Bem’s case, there are all sorts of reasons why someone who’s been working for the better part of a decade to demonstrate the existence of psychic phenomena isn’t necessarily the most objective judge of the relevant evidence. I don’t say that to impugn Bem’s motives in any way; I think the same is true of virtually all scientists–including myself. I’m pretty sure that if someone went over my own work with a fine-toothed comb, as I’ve gone over Bem’s above, they’d identify similar problems. Put differently, I don’t doubt that, despite my best efforts, I’ve reported some findings that aren’t true, because I wasn’t as careful as a completely disinterested observer would have been. That’s not to condone fudging, of course, but simply to recognize that it’s an inevitable reality in science, and it isn’t fair to hold Bem to a higher standard than we’d hold anyone else.

If you set aside the controversial nature of Bem’s research, and evaluate the quality of his paper purely on methodological grounds, I don’t think it’s any worse than the average paper published in JPSP, and actually probably better. For all of the concerns I raised above, there are many things Bem is careful to do that many other researchers don’t. For instance, he clearly makes at least a partial effort to avoid data peeking by collecting samples in increments of 50 subjects (I suspect he simply underestimated the degree to which Type I error rates can be inflated by peeking, even with steps that large); he corrects for multiple comparisons in many places (though not in some places where it matters); and he devotes an entire section of the discussion to considering the possibility that he might be inadvertently capitalizing on chance by falling prey to certain biases. Most studies–including most of those published in JPSP, the premier social psychology journal–don’t do any of these things, even though the underlying problems are just as applicable. So while you can confidently conclude that Bem’s article is wrong, I don’t think it’s fair to say that it’s a bad article–at least, not by the standards that currently hold in much of psychology.

Should the study have been published?

Interestingly, much of the scientific debate surrounding Bem’s article has actually had very little to do with the veracity of the reported findings, because the vast majority of scientists take it for granted that ESP is bunk. Much of the debate centers instead on whether the article should have ever been published in a journal as prestigious as JPSP (or any other peer-reviewed journal, for that matter). For the most part, I think the answer is yes. I don’t think it’s the place of editors and reviewers to reject a paper based solely on the desirability of its conclusions; if we take the scientific method–and the process of peer review–seriously, that commits us to occasionally (or even frequently) publishing work that we believe time will eventually prove wrong. The metrics I think reviewers should (and do) use are whether (a) the paper is as good as most of the papers that get published in the journal in question, and (b) the methods used live up to the standards of the field. I think that’s true in this case, so I don’t fault the editorial decision. Of course, it sucks to see something published that’s virtually certain to be false… but that’s the price we pay for doing science. As long as authors play by the rules, we have to engage with even patently ridiculous views, because sometimes (though very rarely) it later turns out that those views weren’t so ridiculous after all.

That said, believing that it’s appropriate to publish Bem’s article given current publishing standards doesn’t preclude us from questioning those standards themselves. On a pretty basic level, the idea that Bem’s article might be par for the course, quality-wise, yet still be completely and utterly wrong, should surely raise some uncomfortable questions about whether psychology journals are getting the balance between scientific novelty and methodological rigor right. I think that’s a complicated issue, and I’m not going to try to tackle it here, though I will say that personally I do think that more stringent standards would be a good thing for psychology, on the whole. (It’s worth pointing out that the problem of (arguably) lax standards is hardly unique to psychology; as John Ioannidis has famously pointed out, most published findings in the biomedical sciences are false.)

Conclusion

The controversy surrounding the Bem paper is fascinating for many reasons, but it’s arguably most instructive in underscoring the central tension in scientific publishing between rapid discovery and innovation on the one hand, and methodological rigor and cautiousness on the other. Both values are important, but it’s important to recognize the tradeoff that pursuing either one implies. Many of the people who are now complaining that JPSP should never have published Bem’s article seem to overlook the fact that they’ve probably benefited themselves from the prevalence of the same relaxed standards (note that by ‘relaxed’ I don’t mean to suggest that journals like JPSP are non-selective about what they publish, just that methodological rigor is only one among many selection criteria–and often not the most important one). Conversely, maintaining editorial standards that would have precluded Bem’s article from being published would almost certainly also make it much more difficult to publish most other, much less controversial, findings. A world in which fewer spurious results are published is a world in which fewer studies are published, period. You can reasonably debate whether that would be a good or bad thing, but you can’t have it both ways. It’s wishful thinking to imagine that reviewers could somehow grow a magic truth-o-meter that applies lax standards to veridical findings and stringent ones to false positives.

From a bird’s eye view, there’s something undeniably strange about the idea that a well-respected, relatively careful researcher could publish an above-average article in a top psychology journal, yet have virtually everyone instantly recognize that the reported findings are totally, irredeemably false. You could read that as a sign that something’s gone horribly wrong somewhere in the machine; that the reviewers and editors of academic journals have fallen down and can’t get up, or that there’s something deeply flawed about the way scientists–or at least psychologists–practice their trade. But I think that’s wrong. I think we can look at it much more optimistically. We can actually see it as a testament to the success and self-corrective nature of the scientific enterprise that we actually allow articles that virtually nobody agrees with to get published. And that’s because, as scientists, we take seriously the possibility, however vanishingly small, that we might be wrong about even our strongest beliefs. Most of us don’t really believe that Cornell undergraduates have a sixth sense for future porn… but if they did, wouldn’t you want to know about it?

Bem, D. J. (2011). Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. Journal of Personality and Social Psychology.

what the arsenic effect means for scientific publishing

I don’t know very much about DNA (and by ‘not very much’ I sadly mean ‘next to nothing’), so when someone tells me that life as we know it generally doesn’t use arsenic to make DNA, and that it’s a big deal to find a bacterium that does, I’m willing to believe them. So too, apparently, are at least two or three reviewers for Science, which published a paper last week by a NASA group purporting to demonstrate exactly that.

Turns out the paper might have a few holes. In the last few days, the blogosphere has reached fever delirium pitch as critiques of the article have emerged from every corner; it seems like pretty much everyone with some knowledge of the science in question is unhappy about the paper. Since I’m not in any position to critique the article myself, I’ll take Carl Zimmer’s word for it in Slate yesterday:

Was this merely a case of a few isolated cranks? To find out, I reached out to a dozen experts on Monday. Almost unanimously, they think the NASA scientists have failed to make their case.  “It would be really cool if such a bug existed,” said San Diego State University’s Forest Rohwer, a microbiologist who looks for new species of bacteria and viruses in coral reefs. But, he added, “none of the arguments are very convincing on their own.” That was about as positive as the critics could get. “This paper should not have been published,” said Shelley Copley of the University of Colorado.

Zimmer then follows his Slate piece up with a blog post today in which he provides 13 experts’ unadulterated comments. While there are one or two (somewhat) positive reviews, the consensus clearly seems to be that the Science paper is (very) bad science.

Of course, scientists (yes, even Science reviewers) do occasionally make mistakes, so if we’re being charitable about it, we might chalk it up to human error (though some of the critiques suggest that these are elementary problems that could have been very easily addressed, so it’s possible there’s some disingenuousness involved). But what many bloggers (1, 2, 3, etc.) have found particularly inexcusable is the way NASA and the research team have handled the criticism. Zimmer again, in Slate:

I asked two of the authors of the study if they wanted to respond to the criticism of their paper. Both politely declined by email.

“We cannot indiscriminately wade into a media forum for debate at this time,” declared senior author Ronald Oremland of the U.S. Geological Survey. “If we are wrong, then other scientists should be motivated to reproduce our findings. If we are right (and I am strongly convinced that we are) our competitors will agree and help to advance our understanding of this phenomenon. I am eager for them to do so.”

“Any discourse will have to be peer-reviewed in the same manner as our paper was, and go through a vetting process so that all discussion is properly moderated,” wrote Felisa Wolfe-Simon of the NASA Astrobiology Institute. “The items you are presenting do not represent the proper way to engage in a scientific discourse and we will not respond in this manner.”

A NASA spokesperson basically reiterated this point of view, indicating that NASA scientists weren’t going to respond to criticism of their work unless that criticism appeared in, you know, a respectable, peer-reviewed outlet. (Fortunately, at least one of the critics already has a draft letter to Science up on her blog.)

I don’t think it’s surprising that people who spend much of their free time blogging about science, and think it’s important to discuss scientific issues in a public venue, generally aren’t going to like being told that science blogging isn’t a legitimate form of scientific discourse. Especially considering that the critics here aren’t laypeople without scientific training; they’re well-respected scientists with areas of expertise that are directly relevant to the paper. In this case, dismissing trenchant criticism because it’s on the web rather than in a peer-reviewed journal seems kind of like telling someone who’s screaming at you that your house is on fire that you’re not going to listen to them until they adopt a more polite tone. It just seems counterproductive.

That said, I personally don’t think we should take the NASA team’s statements at face value. I very much doubt that what the NASA researchers are saying really reflects any deep philosophical view about the role of blogs in scientific discourse; it’s much more likely that they’re simply trying to buy some time while they figure out how to respond. On the face of it, they have a choice between two lousy options: either ignore the criticism entirely, which would be antithetical to the scientific process and would look very bad, or address it head-on–which, judging by the vociferousness and near-unanimity of the commentators, is probably going to be a losing battle. Shifting the terms of the debate by insisting on responding only in a peer-reviewed venue doesn’t really change anything, but it does buy the authors two or three weeks. And two or three weeks is worth like, forty attentional cycles in the blogosphere.

Mind you, I’m not saying we should sympathize with the NASA researchers just because they’re in a tough position. I think one of the main reasons the story’s attracted so much attention is precisely because people see it as a case of justice being served. The NASA team called a major press conference ahead of the paper’s publication, published its results in one of the world’s most prestigious science journals, and yet apparently failed to run relatively basic experimental controls in support of its conclusions. If the critics are to be believed, the NASA researchers are either disingenuous or incompetent; either way, we shouldn’t feel sorry for them.

What I do think this episode shows is that the rules of scientific publishing have fundamentally changed in the last few years–and largely for the better. I haven’t been doing science for very long, but even in the halcyon days of 2003, when I started graduate school, science blogging was practically nonexistent, and the main way you’d find out what other people thought about an influential new paper was by talking to people you knew at conferences (which could take several months) or waiting for critiques or replication failures to emerge in other peer-reviewed journals (which could take years). That kind of delay between publication and evaluation is disastrous for science, because in the time it takes for a consensus to emerge that a paper is no good, several research teams might have already started trying to replicate and extend the reported findings, and several dozen other researchers might have uncritically cited their paper peripherally in their own work. This delay is probably why, as John Ioannidis’ work so elegantly demonstrates, major studies published in high-impact journals tend to exert a disproportionate influence on the literature long after they’ve been resoundingly discredited.

The Arsenic Effect, if we can call it that, provides a nice illustration of the impact of new media on scientific communication. It’s a safe bet that there are now very few people who do anything even vaguely related to the NASA team’s research who haven’t been made aware that the reported findings are controversial. Which means that the process of attempting to replicate (or falsify) the findings will proceed much more quickly than it might have ten or twenty years ago, and there probably won’t be very many people who cite the Science paper as compelling evidence of terrestrial arsenic-based life. Perhaps more importantly, as researchers get used to the idea that their high-profile work is going to be instantly evaluated by thousands of pairs of highly trained eyes, any of which might be attached to a highly prolific pair of typing hands, there will be an increasingly strong incentive to avoid being careless. That isn’t to say that bad science will disappear, of course; just that, in cases where the badness reflects a pressure to tell a good story at all costs, we’ll probably see less of it.