What we can and can’t learn from the Many Labs Replication Project

By now you will most likely have heard about the “Many Labs” Replication Project (MLRP)–a 36-site, 12-country, 6,344-subject effort to try to replicate a variety of classical and not-so-classical findings in psychology. You probably already know that the authors tested a variety of different effects–some recent, some not so recent (the oldest one dates back to 1941!); some well-replicated, others not so much–and reported successful replications of 10 out of 13 effects (though with widely varying effect sizes).

By and large, the reception of the MLRP paper has been overwhelmingly positive. Setting aside for the moment what the findings actually mean (see also Rolf Zwaan’s earlier take), my sense is that most psychologists are united in agreement that the mere fact that researchers at 36 different sites were able to get together and run a common protocol testing 13 different effects is a pretty big deal, and bodes well for the field in light of recent concerns about iffy results and questionable research practices.

But not everyone’s convinced. There now seems to be something of an incipient backlash against replication. Or perhaps not so much against replication itself as against the notion that the ongoing replication efforts have any special significance. An in-press paper by Joseph Cesario makes a case for deferring independent efforts to replicate an effect until the original effect is theoretically well understood (a suggestion I disagree with quite strongly, and plan to follow up on in a separate post). And a number of people have questioned, in blog comments and tweets, what the big deal is. A case in point:

I think the charitable way to interpret this sentiment is that Gilbert and others are concerned that some people might read too much into the fact that the MLRP successfully replicated 10 out of 13 effects. And clearly, at least some journalists have; for instance, Science News rather irresponsibly reported that the MLRP “offers reassurance” to psychologists. That said, I don’t think it’s fair to characterize this as anything close to a dominant reaction, and I don’t think I’ve seen any researchers react to the MLRP findings as if the 10/13 number means anything special. The piece Dan Gilbert linked to in his tweet, far from promoting “hysteria” about replication, is a Nature News article by the inimitable Ed Yong, and is characteristically careful and balanced. Far from trumpeting the fact that 10 out of 13 findings replicated, here’s a direct quote from the article:

Project co-leader Brian Nosek, a psychologist at the Center of Open Science in Charlottesville, Virginia, finds the outcomes encouraging. “It demonstrates that there are important effects in our field that are replicable, and consistently so,” he says. “But that doesn’t mean that 10 out of every 13 effects will replicate.”

Kahneman agrees. The study “appears to be extremely well done and entirely convincing”, he says, “although it is surely too early to draw extreme conclusions about entire fields of research from this single effort”.

Clearly, the mere fact that 10 out of 13 effects replicated is not in and of itself very interesting. For one thing (and as Ed Yong also noted in his article), a number of the effects were selected for inclusion in the project precisely because they had already been repeatedly replicated. Had the MLRP failed to replicate these effects–including, for instance, the seminal anchoring effect discovered by Kahneman and Tversky in the 1970s–the conclusion would likely have been that something was wrong with the methodology, and not that the anchoring effect doesn’t exist. So I think pretty much everyone can agree with Gilbert that we have most assuredly not learned, as a result of the MLRP, that there’s no replication crisis in psychology after all, and that roughly 76.9% of effects are replicable. Strictly speaking, all we know is that there are at least 10 effects in all of psychology that can be replicated. But that’s not exactly what one would call an earth-shaking revelation. What’s important to appreciate, however, is that the utility of the MLRP was never supposed to be about the number of successfully replicated effects. Rather, its value is tied to a number of other findings and demonstrations–some of which are very important, and have potentially big implications for the field at large. To wit:

1. The variance between effects is greater than the variance within effects.

Here’s the primary figure from the MLRP paper:

[Figure: Many Labs Replication Project results]

Notice that the range of meta-analytic estimates for the different effect sizes (i.e., the solid green circles) is considerably larger than the range of individual estimates within a given effect. In other words, if you want to know how big a given estimate is likely to be, it’s more informative to know what effect is being studied than to know which of the 36 sites is doing the study. This may seem like a rather esoteric point, but it has important implications. Most notably, it speaks directly to the question of how much one should expect effect sizes to fluctuate from lab to lab when direct replications are attempted. If you’ve been following the controversy over the relative (non-)replicability of a number of high-profile social priming studies, you’ve probably noticed that a common defense researchers use when their findings fail to replicate is to claim that the underlying effect is very fragile, and can’t be expected to work in other researchers’ hands. What the MLRP shows, for a reasonable set of studies, is that there does not in fact appear to be a huge amount of site-to-site variability in effects. Take currency priming, for example–an effect in which priming participants with money supposedly leads them to express capitalistic beliefs and behaviors more strongly. Given a single failure to replicate the effect, one could plausibly argue that perhaps the effect was simply too fragile to reproduce consistently. But when 36 different sites all produce effects within a very narrow range–with a mean that is effectively zero–it becomes much harder to argue that the problem is that the effect is highly variable. To the contrary, the effect size estimates are remarkably consistent–it’s just that they’re consistently close to zero.
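For readers who like to see the comparison spelled out, here is a minimal sketch of what "variance between effects" versus "variance within effects" means in practice. The numbers are made up for illustration (they are not the MLRP estimates), and the effect names and the `variance_components` helper are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical site-level effect size estimates (Cohen's d), one list per effect.
# These are illustrative toy values, NOT the actual MLRP data.
site_estimates = {
    "large_effect": [1.10, 1.35, 0.95, 1.25, 1.20],
    "medium_effect": [0.55, 0.60, 0.48, 0.66, 0.52],
    "null_effect": [0.02, -0.03, 0.01, 0.04, -0.02],
}


def variance_components(estimates):
    """Return (between-effect variance, average within-effect variance)."""
    # Between: how much the per-effect mean estimates differ from one another.
    effect_means = [mean(values) for values in estimates.values()]
    between = pvariance(effect_means)
    # Within: how much sites disagree about the same effect, averaged over effects.
    within = mean([pvariance(values) for values in estimates.values()])
    return between, within


between, within = variance_components(site_estimates)
print(f"between-effect variance: {between:.4f}")
print(f"within-effect variance:  {within:.4f}")
```

In this toy data the between-effect variance dwarfs the within-effect variance, which is the pattern described above: knowing which effect is being studied tells you far more than knowing which site ran the study. Note that the "null" effect here is tightly clustered around zero across sites, so it is consistent, just consistently absent.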

2. Larger effects show systematically greater variability.

You can see in the above figure that the larger an effect is, the more individual estimates appear to vary across sites. In one sense, this is not terribly surprising–you might already have the statistical intuition that the larger an effect is, the more reliable variance should be available to interact with other moderating variables. Conversely, if an effect is very small to begin with, it’s probably less likely that it could turn into a very large effect under certain circumstances–or that it might reverse direction entirely. But in another sense, this finding is actually quite unexpected, because, as noted above, there’s a general sense in the field that it’s the smaller effects that tend to be more fragile and heterogeneous. To the extent we can generalize from these 13 studies, these findings should give researchers some pause before attributing replication failures to invisible moderators that somehow manage to turn very robust effects (e.g., the original currency priming effect was nearly a full standard deviation in size) into nonexistent ones.

3. A number of seemingly important variables don’t systematically moderate effects.

There have long been expressions of concern over the potential impact of cultural and population differences on psychological effects. For instance, despite repeated demonstrations that internet samples typically provide data that are as good as conventional lab samples, many researchers continue to display a deep (and in my view, completely unwarranted) skepticism of findings obtained online. More reasonably, many researchers have worried that effects obtained using university students in Western nations–the so-called WEIRD samples–may not generalize to other social groups, cultures and countries. While the MLRP results are obviously not the last word on this debate, it’s instructive to note that factors like data acquisition approach (online vs. offline) and cultural background (US vs. non-US) didn’t appear to exert a systematic effect on results. This doesn’t mean that there are no culture-specific effects in psychology of course (there undoubtedly are), but simply that our default expectation should probably be that most basic effects will generalize across cultures to at least some extent.

4. Researchers have pretty good intuitions about which findings will replicate and which ones won’t.

At the risk of offending some researchers, I submit that the likelihood that a published finding will successfully replicate is correlated to some extent with (a) the field of study it falls under and (b) the journal in which it was originally published. For example, I don’t think it’s crazy to suggest that if one were to try to replicate all of the social priming studies and all of the vision studies published in Psychological Science in the last decade, the vision studies would replicate at a consistently higher rate. Anecdotal support for this intuition comes from a string of high-profile failures to replicate famous findings–e.g., John Bargh’s demonstration that priming participants with elderly concepts leads them to walk away from an experiment more slowly. However, the MLRP goes one better than anecdote, as it included a range of effects that clearly differ in their a priori plausibility. Fortuitously, just prior to publicly releasing the MLRP results, Brian Nosek asked the following question on Twitter:

Several researchers, including me, took Brian up on his offer; here are the responses:

As you can see, pretty much everyone who replied to Brian expressed skepticism about the two priming studies (#9 and #10 in Hal Pashler’s reply). There was less consensus on the third effect. (As it happens, there were ultimately only 2 failures to replicate–the third effect became statistically significant when samples were weighted properly.) Nonetheless, most of us picked Imagined Contact as number 3, which did in fact emerge as the smallest of the statistically significant effects. (It’s probably worth mentioning that I’d personally only heard of 4 or 5 of the 13 effects prior to reading their descriptions, so it’s not as though my response was based on a deep knowledge of prior work on these effects–I simply read the descriptions of the findings and gauged their plausibility accordingly.)

Admittedly, these are just two (or three) studies. It’s possible that the MLRP researchers just happened to pick two of the only high-profile priming studies that both seem highly counterintuitive and happen to be false positives. That said, I don’t really think these findings stand out from the mass of other counterintuitive priming studies in social psychology in any way. While we obviously shouldn’t conclude from this that no high-profile, counterintuitive priming studies will successfully replicate, the fact that a number of researchers were able to prospectively determine, with a high degree of accuracy, which effects would fail to replicate (and, among those that replicated, which were rather weak), is a pretty good sign that researchers’ intuitions about plausibility and replicability are pretty decent.

Personally, I’d love to see this principle pushed further, and formalized as a much broader tool for evaluating research findings. For example, one can imagine a website where researchers could publicly (and perhaps anonymously) register their degree of confidence in the likely replicability of any finding associated with a DOI or PubMed ID. I think such a service would be hugely valuable–not only because it would help calibrate individual researchers’ intuitions and provide a sense of the field’s overall belief in an effect, but because it would provide a useful index of a finding’s importance in the event of successful replication (i.e., the authors of a well-replicated finding should probably receive more credit if the finding was initially viewed with great skepticism than if it was universally deemed rather obvious).
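To make the idea a little more concrete, here is a rough sketch of the core bookkeeping such a service might do. Everything in it is an assumption for illustration: the `ReplicabilityRegistry` class, its method names, the 0-to-1 confidence scale, the placeholder identifier, and especially the `surprise_if_replicated` score, which is just one simple way one might quantify "viewed with great skepticism".

```python
from collections import defaultdict
from statistics import mean


class ReplicabilityRegistry:
    """Toy in-memory store of confidence ratings, keyed by DOI or PubMed ID."""

    def __init__(self):
        self._ratings = defaultdict(list)

    def register(self, finding_id, confidence):
        """Record one researcher's belief (0 to 1) that the finding will replicate."""
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        self._ratings[finding_id].append(confidence)

    def field_belief(self, finding_id):
        """Mean registered confidence, or None if nobody has rated the finding."""
        ratings = self._ratings.get(finding_id)
        return mean(ratings) if ratings else None

    def surprise_if_replicated(self, finding_id):
        """Hypothetical credit index: higher when the field was more skeptical."""
        belief = self.field_belief(finding_id)
        return None if belief is None else 1.0 - belief


registry = ReplicabilityRegistry()
finding = "doi:10.0000/placeholder-finding"  # placeholder, not a real DOI
for rating in (0.2, 0.3, 0.4):
    registry.register(finding, rating)

print(registry.field_belief(finding))
print(registry.surprise_if_replicated(finding))
```

A real service would obviously need persistence, identity handling (to support the "perhaps anonymously" part), and some thought about how to weight raters, but the central object is just a mapping from a finding's identifier to a distribution of prior beliefs.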

There are other potentially important findings in the MLRP paper that I haven’t mentioned here (see Rolf Zwaan’s blog post for additional points), but if nothing else, I hope this will help convince any remaining skeptics that this is indeed a landmark paper for psychology–even though the number of successful replications is itself largely meaningless.

Oh, there’s one last point worth mentioning, in light of the rather disagreeable tone of the debate surrounding previous replication efforts. If your findings are ever called into question by a multinational consortium of 36 research groups, this is exactly how you should respond:

Social psychologist Travis Carter of Colby College in Waterville, Maine, who led the original flag-priming study, says that he is disappointed but trusts Nosek’s team wholeheartedly, although he wants to review their data before commenting further. Behavioural scientist Eugene Caruso at the University of Chicago in Illinois, who led the original currency-priming study, says, “We should use this lack of replication to update our beliefs about the reliability and generalizability of this effect”, given the “vastly larger and more diverse sample” of the MLRP. Both researchers praised the initiative.

Carter and Caruso’s attitude towards the MLRP is really exemplary; people make mistakes all the time when doing research, and shouldn’t be held responsible for the mere act of publishing incorrect findings (excepting cases of deliberate misconduct or clear negligence). What matters is, as Caruso notes, whether and to what extent one shows a willingness to update one’s beliefs in response to countervailing evidence. That’s one mark of a good scientist.

9 thoughts on “What we can and can’t learn from the Many Labs Replication Project”

  1. Your “charitable” interpretation of my tweet is in fact an accurate and literal interpretation: the mere fact that 10 of 13 hand-picked studies replicate tells us nothing. The Yong article to which my tweet linked suggests otherwise. (At least to me). Just read the opening paragraphs, where we learn that there is a replication crisis in psychology, that everyone is worried, and then…wait! ….this just in! …a new paper shows that results DO replicate! (And while I like and respect Danny Oppenheimer, his quote plays right into this misreading of the Many Labs paper). Sorry, that paper is NOT about whether most effects in psychology are or are not replicable, and the authors of the Many Labs paper never claimed such a ridiculous thing. Journalists who spin it that way to generate catchy headlines (and then clarify the issue somewhere in paragraph 247 to create plausible deniability for themselves) do everyone a disservice. But who can blame them? Their job is to sell magazines, not to make us smarter or wiser. And indeed, we encourage muckraking with our overblown hysteria about the fact that some published results don’t replicate. (BTW, if you didn’t know that already, you have been living in a cartoon or a coma). In short, the Many Labs paper makes some nice points & doesn’t overclaim. I like it just fine and didn’t criticize it. I criticized the coverage of it and the hysteria that encourages that kind of coverage. I think that’s clear from my tweet which links to the coverage, not to the paper, and which condemns hysteria, not replicability. Anyway, nice post!

  2. Thanks for the comment, Dan!

    I think the interpretation is charitable in the sense that without context (and there wasn’t any), it really wasn’t clear what you meant by “replication hysteria”. I frankly couldn’t tell from your tweet who you were suggesting was being hysterical: Ed Yong, journalists in general, vocal proponents of replication like Nosek, the people who had done the original studies and were complaining about being picked on, or someone else. So, for instance, a less charitable but not unreasonable (given the lack of context) interpretation might have been that you see the whole replication movement as a waste of time. I understand that that wasn’t what you meant (and that’s what I guessed when writing the post), but I hope you can appreciate that the fact that several people on Twitter did think that’s what you meant suggests that your tweet was not as unambiguous as you might feel.

    That said, I agree with you that the Oppenheimer quote was the weakest part of the article. But again, to be charitable, my guess is that the point both Oppenheimer and Yong intended to make was really a reductio ad absurdum along the lines of: look, if you thought that all the hubbub over the replication crisis in recent months means that nothing ever replicates in psychology, these results clearly show that that’s not true. The point may have been made badly, and it might be obvious to you and me, but I think it’s probably not obvious to a sizable proportion of Nature News readers–who might think, after recent coverage, that psychology should just be written off as a science. In any case, I very much doubt that anyone who read the whole piece would come away thinking “great, so now we know there’s no problem in psychology, and 10 out of every 13 studies will replicate”. I read it and thought it was a nice, fair article, and I would be the first to complain to Ed if I thought it was too pollyannaish, as I’m firmly in the camp that thinks there are very serious systemic problems with current publication practices in psychology.

  3. You suggest that one of the things we can learn from the MLRP is that “the variance between effects is greater than the variance within effects.” But I’m really not sure how much the findings of the MLRP allow us to say about whether this is true in general, particularly if the 13 effects selected for inclusion here were deliberately chosen such that there would be a lot of between-effect variation in the mean effect sizes. Now, the manuscript does not say that the effects were selected in order to maximize between-effect variability; however, it does list one of the criteria as achieving a “diversity of effects,” in particular, “differing levels of certainty and existing impact.” It’s pretty easy to imagine that this is closely related to effect size. So it seems not at all unlikely that in a random sample of effects–where probably the great majority of the effect sizes are in the d = 0 to 1 range–there would be greater within-effect variation than between-effect variation.

  4. Jake,

    Yeah, that’s a fair point, and was noted by folks on Twitter as well. For what it’s worth, I don’t think the question of whether there’s more variance within or between effects in an absolute sense is a terribly interesting one, and I probably shouldn’t have worded it that way in the section header. The more important point is that there seems to be much less variation within-effect than one might intuitively suppose (the large between-effect variation at the upper end is mostly useful for calibrating intuitions, I think), and that has important implications for how we think about appeals to invisible moderators.

    Of course, it’s certainly conceivable that this set of effects is deeply unrepresentative in that sense as well–i.e., that in a truly random sample of 13 effects (whatever that means!)–there would be massive variation in effect sizes within individual effects. But given these findings, I think that seems substantially less likely than I would have thought pre-MLRP. While the selection process for MLRP effects almost certainly resulted in a biased distribution of effect sizes, I see no obvious reason why, e.g., the 3 smallest effects reported here should be more stable than other kinds of effects that people have argued must be very fragile. That to me is the real upshot of these findings. But I’ll be happy to walk that back if someone does a second MLRP with very different effects and reports that this time the putatively fragile ones show considerable between-site variation.

  5. “Carter and Caruso’s attitude towards the MLRP is really exemplary; people make mistakes all the time when doing research, and shouldn’t be held responsible for the mere act of publishing incorrect findings ”

    I don’t agree when you call this a mistake… I don’t see where they made a mistake. They had data and analysed the data. Where is the mistake? Every experiment should be seen as one estimation of a variable and that estimation might be closer or further from the truth. But we should abandon this mistake idea.

  6. Do we really learn nothing from the 10/13 figure? I was under the impression that these effects were picked as credible high-profile well-known effects, and so the 10/13 is quite interesting – as an upper bound!
