Category Archives: research

strong opinions about data sharing mandates–mine included

Apparently, many scientists have rather strong feelings about data sharing mandates. In the wake of PLOS’s recent announcement–which says that, effective now, all papers published in PLOS journals must deposit their data in a publicly accessible location–a veritable gaggle of scientists have taken to their blogs to voice their outrage and/or support for the policy. The nays have posts like DrugMonkey’s complaint that the inmates are running the asylum at PLOS (more choice posts are here, here, here, and here); the yays have Edmund Hart telling the nays to get over themselves and share their data (more posts here, here, and here). While I’m a bit late to the party (mostly because I’ve been traveling and otherwise indisposed), I guess I’ll go ahead and throw my hat into the ring in support of data sharing mandates. For a number of reasons outlined below, I think time will show the anti-PLOS folks to very clearly be on the wrong side of this issue.

Mandatory public deposition is like, totally way better than a “share-upon-request” approach

You might think that proactive data deposition has little incremental utility over a philosophy of sharing one’s data upon request, since emails are these wordy little things that only take a few minutes of a data-seeker’s time to write. But it’s not just the time and effort that matter. It’s also the psychology and technology. Psychology, because if you don’t know the person on the other end, or if the data is potentially useful but not essential to you, or if you’re the agreeable sort who doesn’t like to bother other people, it’s very easy to just say, “nah, I’ll just go do something else”. Scientists are busy people. If a dataset is a click away, many people will be happy to download that dataset and play with it who wouldn’t feel comfortable emailing the author to ask for it. Technology, because data that isn’t publicly available is data that isn’t publicly indexed. It’s all well and good to say that if someone really wants a dataset, they can email you to ask for it, but if someone doesn’t know about your dataset in the first place–because it isn’t in the first three pages of Google results–they’re going to have a hard time asking.

People don’t actually share on request

Much of the criticism of the PLoS data sharing policy rests on the notion that the policy is unnecessary, because in practice most journals already mandate that authors must share their data upon request. One point that defenders of the PLOS mandate haven’t stressed enough is that such “soft” mandates are largely meaningless. Empirical studies have repeatedly demonstrated  that it’s actually very difficult  to get authors to share their data upon request –even when they’re obligated to do so by the contractual agreement they’ve signed with a publisher. And when researchers do fulfill data sharing requests, they often take inordinately long to do so, and the data often don’t line up properly with what was reported in the paper (as the PLOS editors noted in their explanation for introducing the policy), or reveal potentially serious errors.

Personally, I have to confess that I often haven’t fulfilled other researchers’ requests for my data–and in at least two cases, I never even responded to the request. These failures to share didn’t reflect my desire to hide anything; they occurred largely because I knew it would be a lot of work, and/or the data were no longer readily accessible to me, and/or I was too busy to take care of the request right when it came in. I think I’m sufficiently aware of my own character flaws to know that good intentions are no match for time pressure and divided attention–and that’s precisely why I’d rather submit my work to journals that force me to do the tedious curation work up front, when I have a strong incentive to do it, rather than later, when I don’t.

Comprehensive evaluation requires access to the data

It’s hard to escape the feeling that some of the push-back against the policy is actually rooted in the fear that other researchers will find mistakes in one’s work by going through one’s data. In some cases, this fear is made explicit. For example, DrugMonkey suggested that:

There will be efforts to say that the way lab X deals with their, e.g., fear conditioning trials, is not acceptable and they MUST do it the way lab Y does it. Keep in mind that this is never going to be single labs but rather clusters of lab methods traditions. So we’ll have PLoS inserting itself in the role of how experiments are to be conducted and interpreted!

This rather dire premonition prompted a commenter to ask if it’s possible that DM might ever be wrong about what his data means–necessitating other pairs of eyes and/or opinions. DM’s response was, in essence, “No.”. But clearly, this is wishful thinking: we have plenty of reasons to think that everyone in science–even the luminaries among us–make mistakes all the time. Science is hard. In the fields I’m most familiar with, I rarely read a paper that I don’t feel has some serious flaws–even though nearly all of these papers were written by people who have, in DM’s words, “been at this for a while”. By the same token, I’m certain that other people read each of my papers and feel exactly the same way. Of course, it’s not pleasant to confront our mistakes by putting everything out into the open, and I don’t doubt that one consequence of sharing data proactively is that error-finding will indeed become much more common. At least initially (i.e., until we develop an appreciation for the true rate of error in the average dataset, and become more tolerant of minor problems), this will probably cause everyone some discomfort. But temporary discomfort surely isn’t a good excuse to continue to support practices that clearly impede scientific progress.

Part of the problem, I suspect, is that scientists have collectively internalized as acceptable many practices that are on some level clearly not good for the community as a whole. To take just one example, it’s an open secret in biomedical science that so-called “representative figures” (of spiking neurons, Western blots, or whatever else you like) are rarely truly representative. Frequently, they’re actually among the best examples the authors of a paper were able to find. The communal wink-and-shake agreement to ignore this kind of problem is deeply problematic, in that it likely allows many claims to go unchallenged that are actually not strongly supported by the data. In a world where other researchers could easily go through my dataset and show that the “representative” raster plot I presented in Figure 2C was actually the best case rather than the norm, I would probably have to be more careful about making that kind of claim up front–and someone else might not waste a lot of their time chasing results that can’t possibly be as good as my figures make them look.

Figure 1.  A representative planet.

The Data are a part of the Methods

If you still don’t find this convincing, consider that one could easily have applied nearly all of the arguments people having been making in the blogosphere these past two weeks to that dastardly scientific timesink that is the common Methods sections. Imagine that we lived in a culture where scientists always reported their Results telegraphically–that is, with the brevity of a typical Nature or Science paper, but without the accompanying novel’s worth of Supplementary Methods. Then, when someone first suggested that it might perhaps be a good idea to introduce a separate section that describes in dry, technical language how authors actually produced all those exciting results, we would presumably see many people in the community saying something like the following:

Why should I bother to tell you in excruciating detail what software, reagents, and stimuli I used in my study? The vast majority of readers will never try to directly replicate my experiment, and those who do want to can just email me to get the information they need–which of course I’m always happy to provide in a timely and completely disinterested fashion. Asking me to proactively lay out every little methodological step I took is really unreasonable; it would take a very long time to write a clear “Methods” section of the kind you propose, and the benefits seem very dubious. I mean, the only thing that will happen if I adopt this new policy is that half of my competitors will start going through this new section with a fine-toothed comb in order to find problems, and the other half will now be able to scoop me by repeating the exact procedures I used before I have a chance to follow them up myself! And for what? Why do I need to tell everyone exactly what I did? I’m an expert with many years of experience in this field! I know what I’m doing, and I don’t appreciate your casting aspersions on my work and implying that my conclusions might not always be 100% sound!

As far as I can see, there isn’t any qualitative difference between reporting detailed Methods and providing comprehensive Data. In point of fact, many decisions about which methods one should use depend entirely on the nature of the data, so it’s often actually impossible to evaluate the methodological choices the authors made without seeing their data. If DrugMonkey et al think it’s crazy for one researcher to want access to another researcher’s data in order to determine whether the distribution of some variable looks normal, they should also think it’s crazy for researchers to have to report their reasoning for choosing a particular transformation in the first place. Or for using a particular reagent. Or animal strain. Or learning algorithm, or… you get the idea. But as Bjorn Brembs succinctly put it, in the digital age, this is silly: for all intents and purposes, there’s no longer any difference between text and data.

The data are funded by the taxpayers, and (in some sense) belong to the taxpayers

People vary widely in the extent to which they feel the public deserves to have access to the products of the work it funds. I don’t think I hold a particularly extreme position in this regard, in the sense that I don’t think the mere fact that someone’s effort is funded by the public automatically means any of their products should be publicly available for anyone’s perusal or use. However, when we’re talking about scientific data–where the explicit rationale for funding the work is to produce new generalizable knowledge, and where the marginal cost of replicating digital data is close to zero–I really don’t see any reason not to push very strongly to force scientists to share their data. I’m sympathetic to claims about scooping and credit assignment, but as a number of other folks have pointed out in comment threads, these are fundamentally arguments in favor of better credit assignment, and not arguments against sharing data. The fear some people have of being scooped is not sufficient justification for impeding our collective scientific progress.

It’s also worth noting that, in principle, PLOS’s new data sharing policy shouldn’t actually make it any easier for someone else to scoop you. Remember that under PLOS’s current data sharing mandate–as well as the equivalent policies at most other scientific journals–authors are already required to provide their data to anyone else upon request. Critics who argue that the new public archiving mandate opens the door to being scooped are in effect admitting that the old mandate to share upon request doesn’t work, because in theory there already shouldn’t really be anything preventing me from scooping you with your data simply by asking you for it (other than social norms–but then, the people who are actively out to usurp others’ ideas are the least likely to abide by those norms anyway). It’s striking to see how many of the posts defending the “share-upon-request” approach have no compunction in saying that they’re currently only willing to share their data after determining what the person on the other end wants to use it for–in clear violation of most journals’ existing policy.

It’s really not that hard

Organizing one’s data or code in a form minimally suitable for public consumption isn’t much fun. I do it fairly regularly; I know it sucks. It takes some time out of your day, and requires you to allocate resources to the problem that could otherwise be directed elsewhere. That said, a lot of the posts complaining about how much effort the new policy requires seem absurdly overwrought. There seems to be a widespread belief–which, as far as I can tell, isn’t supported by a careful reading of the actual PLOS policy–that there’s some incredibly strict standard that datasets have to live up to before pulic release. I don’t really understand where this concern comes from. Personally, I spend much of my time analyzing data other people have collected. I’ve worked with many other people’s data, and rarely is it in exactly the form I would like. Often times it’s not even in the ballpark of what I’d like. And I’ve had to invest a considerable amount of my time understanding what columns and rows mean, and scrounging for morsels of (poor) documentation. My working assumption when I do this–and, I think, most other people’s–is that the onus is on me to expend some effort figuring out what’s in a dataset I wish to use, and not on the author to release that dataset in a form that a completely naive person could understand without any effort. Of course it would be nice if everyone put their data up on the web in a form that maximized accessibility, but it certainly isn’t expected*. In asking authors to deposit their data publicly, PLOS isn’t asserting that there’s a specific format or standard that all data must meet; they’re just saying data must meet accepted norms. Since those norms depend on one’s field, it stands to reason that expectations will be lower for a 10-TB fMRI dataset than for an 800-row spreadsheet of behavioral data.

There are some valid concerns, but…

I don’t want to sound too Pollyannaish about all this. I’m not suggesting that the PLOS policy is perfect, or that issues won’t arise in the course of its implementation and enforcement. It’s very clear that there are some domains in which data sharing is a hassle, and I sympathize with the people who’ve pointed out that it’s not really clear what “all” the data means–is it the raw data, which aren’t likely to be very useful to anyone, or the post-processed data, which may be too close to the results reported in the paper? But such domain- or case-specific concerns are grossly outweighed by the very general observation that it’s often impossible to evaluate previous findings adequately, or to build a truly replicable science, if you don’t have access to other scientists’ data. There’s no doubt that edge cases will arise in the course of enforcing the new policy. But they’ll be dealt with on a case-by-case basis, exactly as the PLOS policy indicates. In the meantime, our default assumption should be that editors at PLOS–who are, after all, also working scientists–will behave reasonably, since they face many of the same considerations in their own research. When a researcher tells an editor that she doesn’t have anywhere to put the 50 TB of raw data for her imaging study, I expect that that editor will typically respond by saying, “fine, but surely you can drag and drop a directory full of the first- and second-level beta images, along with a basic description, into NeuroVault, right?”, and not “Whut!? No raw DICOM images, no publication!”

As for the people who worry that by sharing their data, they’ll be giving away a competitive advantage… to be honest, I think many of these folks are mistaken about the dire consequences that would ensue if they shared their data publicly. I suspect that many of the researchers in question would be pleasantly surprised at the benefits of data sharing (increased citation rates, new offers of collaboration, etc.) Still, it’s clear enough that some of the people who’ve done very well for themselves in the current scientific system–typically by leveraging some incredibly difficult-to-acquire dataset into a cottage industry of derivative studies–would indeed do much less well in a world where open data sharing was mandatory. What I fail to see, though, is why PLOS, or the scientific community as a whole, should care very much about this latter group’s concerns. As far as I can tell, PLOS’s new policy is a significant net positive for the scientific community as a whole, even if it hurts one segment of that community in the short term. For the moment, scientists who harbor proprietary attitudes towards their data can vote with their feet by submitting their papers somewhere other than PLOS. Contrary to the dire premonitions floating around, I very much doubt any potential drop in submissions is going to deliver a terminal blow to PLOS (and the upside is that the articles that do get published in PLOS will arguably be of higher quality). In the medium-to-long term, I suspect that cultural norms surrounding who gets credit for acquiring and sharing data vs. analyzing and reporting new findings based on those data are are going to undergo a sea change–to the point where in the not-too-distant future, the scoopophobia that currently drives many people to privately hoard their data is a complete non-factor. At that point, it’ll be seen as just plain common sense that if you want your scientific assertions to be taken seriously, you need to make the data used to support those assertions available for public scrutiny, re-analysis, and re-use.

 

* As a case in point, just yesterday I came across a publicly accessible dataset I really wanted to use, but that was in SPSS format. I don’t own a copy of SPSS, so I spent about an hour trying to get various third-party libraries to extract the data appropriately, without any luck. So eventually I sent the file to a colleague who was helpful enough to convert it. My first thought when I received the tab-delimited file in my mailbox this morning was not “ugh, I can’t believe they released the file in SPSS”, it was “how amazing is it that I can download this gigantic dataset acquired half the world away instantly, and with just one minor hiccup, be able to test a novel hypothesis in a high-powered way without needing to spend months of time collecting data?”

What we can and can’t learn from the Many Labs Replication Project

By now you will most likely have heard about the “Many Labs” Replication Project (MLRP)–a 36-site, 12-country, 6,344-subject effort to try to replicate a variety of classical and not-so-classical findings in psychology. You probably already know that the authors tested a variety of different effects–some recent, some not so recent (the oldest one dates back to 1941!); some well-replicated, others not so much–and reported successful replications of 10 out of 13 effects (though with widely varying effect sizes).

By and large, the reception of the MLRP paper has been overwhelmingly positive. Setting aside for the moment what the findings actually mean (see also Rolf Zwaan’s earlier take), my sense is that most psychologists are united in agreement that the mere fact that researchers at 36 different sites were able to get together and run a common protocol testing 13 different effects is a pretty big deal, and bodes well for the field in light of recent concerns about iffy results and questionable research practices.

But not everyone’s convinced. There now seems to be something of an incipient backlash against replication. Or perhaps not so much against replication itself as against the notion that the ongoing replication efforts have any special significance. An in press paper by Joseph Cesario makes a case for deferring independent efforts to replicate an effect until the original effect is theoretically well understood (a suggestion I disagree with quite strongly, and plan to follow up on in a separate post). And a number of people have questioned, in blog comments and tweets, what the big deal is. A case in point:

I think the charitable way to interpret this sentiment is that Gilbert and others are concerned that some people might read too much into the fact that the MLRP successfully replicated 10 out of 13 effects. And clearly, at least some journalists have; for instance, Science News rather irresponsibly reported that the MLRP “offers reassurance” to psychologists. That said, I don’t think it’s fair to characterize this as anything close to a dominant reaction, and I don’t think I’ve seen any researchers react to the MLRP findings as if the 10/13 number means anything special. The piece Dan Gilbert linked to in his tweet, far from promoting “hysteria” about replication, is a Nature News article by the inimitable Ed Yong, and is characteristically careful and balanced. Far from trumpeting the fact that 10 out of 13 findings replicated, here’s a direct quote from the article:

Project co-leader Brian Nosek, a psychologist at the Center of Open Science in Charlottesville, Virginia, finds the outcomes encouraging. “It demonstrates that there are important effects in our field that are replicable, and consistently so,” he says. “But that doesn’t mean that 10 out of every 13 effects will replicate.”

Kahneman agrees. The study “appears to be extremely well done and entirely convincing”, he says, “although it is surely too early to draw extreme conclusions about entire fields of research from this single effort”.

Clearly, the mere fact that 10 out of 13 effects replicated is not in and of itself very interesting. For one thing (and as Ed Yong also noted in his article), a number of the effects were selected for inclusion in the project precisely because they had already been repeatedly replicated. Had the MLRP failed to replicate these effects–including, for instance, the seminal anchoring effect discovered by Kahneman and Tversky in the 1970s–the conclusion would likely have been that something was wrong with the methodology, and not that the anchoring effect doesn’t exist. So I think pretty much everyone can agree with Gilbert that we have most assuredly not learned, as a result of the MLRP, that there’s no replication crisis in psychology after all, and that roughly 76.9% of effects are replicable. Strictly speaking, all we know is that there are at least 10 effects in all of psychology that can be replicated. But that’s not exactly what one would call an earth-shaking revelation. What’s important to appreciate, however, is that the utility of the MLRP was never supposed to be about the number of successfully replicated effects. Rather, its value is tied to a number of other findings and demonstrations–some of which are very important, and have potentially big implications for the field at large. To wit:

1. The variance between effects is greater than the variance within effects.

Here’s the primary figure from the MLRP paper: Many Labs Replication Project results

Notice that the range of meta-analytic estimates for the different effect sizes (i.e., the solid green circles) is considerably larger than the range of individual estimates within a given effect. In other words, if you want to know how big a given estimate is likely to be, it’s more informative to know what effect is being studied than to know which of the 36 sites is doing the study. This may seem like a rather esoteric point, but it has important implications. Most notably, it speaks directly to the question of how much one should expect effect sizes to fluctuate from lab to lab when direct replications are attempted. If you’ve been following the controversy over the relative (non-)replicability of a number of high-profile social priming studies, you’ve probably noticed that a common defense researchers use when their findings fails to replicate is to claim that the underlying effect is very fragile, and can’t be expected to work in other researchers’ hands. What the MLRP shows, for a reasonable set of studies, is that there does not in fact appear to be a huge amount of site-to-site variability in effects. Take currency priming, for example–an effect in which priming participants with money supposedly leads them to express capitalistic beliefs and behaviors more strongly. Given a single failure to replicate the effect, one could plausibly argue that perhaps the effect was simply too fragile to reproduce consistently. But when 36 different sites all produce effects within a very narrow range–with a mean that is effectively zero–it becomes much harder to argue that the problem is that the effect is highly variable. To the contrary, the effect size estimates are remarkably consistent–it’s just that they’re consistently close to zero.

2. Larger effects show systematically greater variability.

You can see in the above figure that the larger an effect is, the more individual estimates appear to vary across sites. In one sense, this is not terribly surprising–you might already have the statistical intuition that the larger an effect is, the more reliable variance should be available to interact with other moderating variables. Conversely, if an effect is very small to begin with, it’s probably less likely that it could turn into a very large effect under certain circumstances–or that it might reverse direction entirely. But in another sense, this finding is actually quite unexpected, because, as noted above, there’s a general sense in the field that it’s the smaller effects that tend to be more fragile and heterogeneous. To the extent we can generalize from these 13 studies, these findings should give researchers some pause before attributing replication failures to invisible moderators that somehow manage to turn very robust effects (e.g., the original currency priming effect was nearly a full standard deviation in size) into nonexistent ones.

3. A number of seemingly important variables don’t systematically moderate effects.

There have long been expressions of concern over the potential impact of cultural and population differences on psychological effects. For instance, despite repeated demonstrations that internet samples typically provide data that are as good as conventional lab samples, many researchers continue to display a deep (and in my view, completely unwarranted) skepticism of findings obtained online. More reasonably, many researchers have worried that effects obtained using university students in Western nations–the so-called WEIRD samples–may not generalize to other social groups, cultures and countries. While the MLRP results are obviously not the last word on this debate, it’s instructive to note that factors like data acquisition approach (online vs. offline) and cultural background (US vs. non-US) didn’t appear to exert a systematic effect on results. This doesn’t mean that there are no culture-specific effects in psychology of course (there undoubtedly are), but simply that our default expectation should probably be that most basic effects will generalize across cultures to at least some extent.

4. Researchers have pretty good intuitions about which findings will replicate and which ones won’t.

At the risk of offending some researchers, I submit that the likelihood that a published finding will successfully replicate is correlated to some extent with (a) the field of study it falls under and (b) the journal in which it was originally published. For example, I don’t think it’s crazy to suggest that if one were to try to replicate all of the social priming studies and all of the vision studies published in Psychological Science in the last decade, the vision studies would replicate at a consistently higher rate. Anecdotal support for this intuition comes from a string of high-profile failures to replicate famous findings–e.g., John Bargh’s demonstration that priming participants with elderly concepts leads them to walk away from an experiment more slowly. However, the MLRP goes one better than anecdote, as it included a range of effects that clearly differ in their a priori plausibility. Fortuitously, just prior to publicly releasing the MLRP results, Brian Nosek asked the following question on Twitter:

Several researchers, including me, took Brian up on his offers; here are the responses:

As you can see, pretty much everyone that replied to Brian expressed skepticism about the two priming studies (#9 and #10 in Hal Pashler’s reply). There was less consensus on the third effect. (Actually, as it happens, there were actually ultimately only 2 failures to replicate–the third effect became statistically significant when samples were weighted properly.) Nonetheless, most of us picked Imagined Contact as number 3, which did in fact emerge as the smallest of the statistically significant effects. (It’s probably worth mentioning that I’d personally only heard of 4 or 5 of the 13 effects prior to reading their descriptions, so it’s not as though my response was based on a deep knowledge of prior work on these effects–I simply read the descriptions of the findings and gauged their plausibility accordingly.)

Admittedly, these are just two (or three) studies. It’s possible that the MLRP researchers just happened to pick two of the only high-profile priming studies that both seem highly counterintuitive and happen to be false positives. That said, I don’t really think these findings stand out from the mass of other counterintuitive priming studies in social psychology in any way. While we obviously shouldn’t conclude from this that no high-profile, counterintuitive priming studies will successfully replicate, the fact that a number of researchers were able to prospectively determine, with a high degree of accuracy, which effects would fail to replicate (and, among those that replicated, which were rather weak), is a pretty good sign that researchers’ intuitions about plausibility and replicability are pretty decent.

Personally, I’d love to see this principle pushed further, and formalized as a much broader tool for evaluating research findings. For example, one can imagine a website where researchers could publicly (and perhaps anonymously) register their degree of confidence in the likely replicability of any finding associated with a doi or PubMed ID. I think such a service would be hugely valuable–not only because it would help calibrate individual researchers’ intuitions and provide a sense of the field’s overall belief in an effect, but because it would provide a useful index of a finding’s importance in the event of successful replication (i.e., the authors of a well-replicated finding should probably receive more credit if the finding was initially viewed with great skepticism than if it was universally deemed rather obvious).

There are other potentially important findings in the MLRP paper that I haven’t mentioned here (see Rolf Zwaan’s blog post for additional points), but if nothing else, I hope this will help convince any remaining skeptics that this is indeed a landmark paper for psychology–even though the number of successful replications is itself largely meaningless.

Oh, there’s one last point worth mentioning, in light of the rather disagreeable tone of the debate surrounding previous replication efforts. If your findings are ever called into question by a multinational consortium of 36 research groups, this is exactly how you should respond:

Social psychologist Travis Carter of Colby College in Waterville, Maine, who led the original flag-priming study, says that he is disappointed but trusts Nosek’s team wholeheartedly, although he wants to review their data before commenting further. Behavioural scientist Eugene Caruso at the University of Chicago in Illinois, who led the original currency-priming study, says, “We should use this lack of replication to update our beliefs about the reliability and generalizability of this effect”, given the “vastly larger and more diverse sample” of the MLRP. Both researchers praised the initiative.

Carter and Caruso’s attitude towards the MLRP is really exemplary; people make mistakes all the time when doing research, and shouldn’t be held responsible for the mere act of publishing incorrect findings (excepting cases of deliberate misconduct or clear negligence). What matters is, as Caruso notes, whether and to what extent one shows a willingness to update one’s beliefs in response to countervailing evidence. That’s one mark of a good scientist.

…and then there were two!

Last year when I launched my lab (which, full disclosure, is really just me, plus some of my friends who were kind enough to let me plaster their names and faces on my website), I decided to call it the Psychoinformatics Lab (or PILab for short and pretentious), because, well, why not. It seemed to nicely capture what my research is about: psychology and informatics. But it wasn’t an entirely comfortable decision, because a non-trivial portion of my brain was quite convinced that everyone was going to laugh at me. And even now, after more than a year of saying I’m a “psychoinformatician” whenever anyone asks me what I do, I still feel a little bit fraudulent each time–as if I’d just said I was a member of the Estonian Cosmonaut program, or the president of the Build-a-Bear fan club*.

But then… just last week… everything suddenly changed! All in one fell swoop–in one tiny little nudge of a shove-this-on-the-internet button, things became magically better. And now colors are vibrating**, birds are chirping merry chirping songs–no, wait, those are actually cicadas–and the world is basking in a pleasant red glow of humming monitors and five-star Amazon reviews. Or something like that. I’m not so good with the metaphors.

Why so upbeat, you ask? Well, because as of this writing, there is no longer just the one lone Psychoinformatics Lab. No! Now there are not one, not three, not seven Psychoinformatics Labs, but… two! There are two Psychoinformatics Labs. The good Dr. Michael Hanke (of PyMVPA and NeuroDebian fame) has just finished putting the last coat of paint on the inside of his brand new cage Psychoinformatics Lab at the Otto-von-Guericke University Magdeburg in Magdeburg, Germany. No, really***: his startup package didn’t include any money for paint, so he had to barter his considerable programming skills for three buckets of Going to the Chapel (yes, that’s a real paint color).

The good Dr. Hanke drifts through interstellar space in search of new psychoinformatic horizons.

Anyway, in case you can’t tell, I’m quite excited about this. Not because it’s a sign that informatics approaches are making headway in psychology, or that pretty soon every psychology lab will have a high-performance computing cluster hiding in its closet (one can dream, right?). No sir. I’m excited for two much more pedestrian reasons. First, because from now on, any time anyone makes fun of me for calling myself a psychoinformatician, I’ll be able to say, with a straight face, well it’s not just me, you know–there are multiple ones of us doing this here research-type thing with the data and the psychology and the computers. And secondly, because Michael is such a smart and hardworking guy that I’m pretty sure he’s going to legitimize this whole enterprise and drag me along for the ride with him, so I won’t have to do anything else myself. Which is good, because if laziness was an olympic sport, I’d never leave the starting block.

No, but in all seriousness, Michael is an excellent scientist and an exceptional human being, and I couldn’t be happier for him in his new job as Lord Director of All Things Psychoinformatic (Eastern Division). You might think I’m only saying this because he just launched the world’s second PILab, complete with quote from yours truly on said lab’s website front page. Well, you’d be right. But still. He’s a pretty good guy, and I’m sure we’re going to see amazing things coming out of Magdeburg.

Now if anyone wants to launch PILab #3 (maybe in Asia or South America?), just let me know, and I’ll make you the same offer I made Michael: an envelope full of $1 bills (well, you know, I’m an academic–I can’t afford Benjamins just yet) and a blog post full of ridiculous superlatives.

 

* Perhaps that’s not a good analogy, because that one may actually exist.

** But seriously, in real life, colors should not vibrate. If you ever notice colors vibrating, drive to the nearest emergency room and tell them you’re seeing colors vibrating.

*** No, not really.

the Neurosynth viewer goes modular and open source

If you’ve visited the Neurosynth website lately, you may have noticed that it looks… the same way it’s always looked. It hasn’t really changed in the last ~20 months, despite the vague promise on the front page that in the next few months, we’re going to do X, Y, Z to improve the functionality. The lack of updates is not by design; it’s because until recently I didn’t have much time to work on Neurosynth. Now that much of my time is committed to the project, things are moving ahead pretty nicely, though the changes behind the scenes aren’t reflected in any user-end improvements yet.

The github repo is now regularly updated and even gets the occasional contribution from someone other than myself; I expect that to ramp up considerably in the coming months. You can already use the code to run your own automated meta-analyses fairly easily; e.g., with everything set up right (follow the Readme and examples in the repo), the following lines of code:

dataset = cPickle.load(open('dataset.pkl', 'rb'))
studies = get_ids_by_expression("memory* &~ ("wm|working|episod*"), threshold=0.001)
ma = meta.MetaAnalysis(dataset, studies)
ma.save_results('memory')

…will perform an automated meta-analysis of all studies in the Neurosynth database that use the term ‘memory’ at a frequency of 1 in 1,000 words or greater, but don’t use the terms wm or working, or words that start with ‘episod’ (e.g., episodic). You can perform queries that nest to arbitrary depths, so it’s a pretty powerful engine for quickly generating customized meta-analyses, subject to all of the usual caveats surrounding Neurosynth (i.e., that the underlying data are very noisy, that terms aren’t mental states, etc.).

Anyway, with the core tools coming along, I’ve started to turn back to other elements of the project, starting with the image viewer. Yesterday I pushed the first commit of a new version of the viewer that’s currently on the Neurosynth website. In the next few weeks, this new version will be replacing the current version of the viewer, along with a bunch of other changes to the website.

A live demo of the new viewer is available here. It’s not much to look at right now, but behind the scenes, it’s actually a huge improvement on the old viewer in a number of ways:

  • The code is completely refactored and is all nice and object-oriented now. It’s also in CoffeeScript, which is an alternative and (if you’re coming from a Python or Ruby background) much more readable syntax for JavaScript. The source code is on github and contributions are very much encouraged. Like most scientists, I’m generally loathe to share my code publicly because I think it sucks most of the time. But I actually feel pretty good about this code. It’s not good code by any stretch, but I think it rises to the level of ‘mostly sensible’, which is about as much as I can hope for.
  • The viewer now handles multiple layers simultaneously, with the ability to hide and show layers, reorder them by dragging, vary the transparency, assign different color palettes, etc. These features have been staples of offline viewers pretty much since the prehistoric beginnings of fMRI time, but they aren’t available in the current Neurosynth viewer or most other online viewers I’m aware of, so this is a nice addition.
  • The architecture is modular, so that it should be quite easy in future to drop in other alternative views onto the data without having to muck about with the app logic. E.g., adding a 3D WebGL-based view to complement the current 2D slice-based HTML5 canvas approach is on the near-term agenda.
  • The resolution of the viewer is now higher–up from 4 mm to 2 mm (which is the most common native resolution used in packages like SPM and FSL). The original motivation for downsampling to 4 mm in the prior viewer was to keep filesize to a minimum and speed up the initial loading of images. But at some point I realized, hey, we’re living in the 21st century; people have fast internet connections now. So now the files are all in 2 mm resolution, which has the unpleasant effect of increasing file sizes by a factor of about 8, but also has the pleasant effect of making it so that you can actually tell what the hell you’re looking at.

Most importantly, there’s now a clean, and near-complete, separation between the HTML/CSS content and the JavaScript code. Which means that you can now effectively drop the viewer into just about any HTML page with just a few lines of code. So in theory, you can have basically the same viewer you see in the demo just by sticking something like the following into your page:

 viewer = Viewer.get('#layer_list', '.layer_settings')
 viewer.addView('#view_axial', 2);
 viewer.addView('#view_coronal', 1);
 viewer.addView('#view_sagittal', 0);
 viewer.addSlider('opacity', '.slider#opacity', 'horizontal', 'false', 0, 1, 1, 0.05);
 viewer.addSlider('pos-threshold', '.slider#pos-threshold', 'horizontal', 'false', 0, 1, 0, 0.01);
 viewer.addSlider('neg-threshold', '.slider#neg-threshold', 'horizontal', 'false', 0, 1, 0, 0.01);
 viewer.addColorSelect('#color_palette');
 viewer.addDataField('voxelValue', '#data_current_value')
 viewer.addDataField('currentCoords', '#data_current_coords')
 viewer.loadImageFromJSON('data/MNI152.json', 'MNI152 2mm', 'gray')
 viewer.loadImageFromJSON('data/emotion_meta.json', 'emotion meta-analysis', 'bright lights')
 viewer.loadImageFromJSON('data/language_meta.json', 'language meta-analysis', 'hot and cold')
 viewer.paint()

Well, okay, there are some other dependencies and styling stuff you’re not seeing. But all of that stuff is included in the example folder here. And of course, you can modify any of the HTML/CSS you see in the example; the whole point is that you can now easily style the viewer however you want it, without having to worry about any of the app logic.

What’s also nice about this is that you can easily pick and choose which of the viewer’s features you want to include in your page; nothing will (or at least, should) break no matter what you do. So, for example, you could decide you only want to display a single view showing only axial slices; or to allow users to manipulate the threshold of layers but not their opacity; or to show the current position of the crosshairs but not the corresponding voxel value; and so on. All you have to do is include or exclude the various addSlider() and addData() lines you see above.

Of course, it wouldn’t be a mediocre open source project if it didn’t have some important limitations I’ve been hiding from you until near the very end of this post (hoping, of course, that you wouldn’t bother to read this far down). The biggest limitation is that the viewer expects images to be in JSON format rather than a binary format like NIFTI or Analyze. This is a temporary headache until I or someone else can find the time and motivation to adapt one of the JavaScript NIFTI readers that are already out there (e.g., Satra Ghosh‘s parser for xtk), but for now, if you want to load your own images, you’re going to have to take the extra step of first converting them to JSON. Fortunately, the core Neurosynth Python package has a img_to_json() method in the imageutils module that will read in a NIFTI or Analyze volume and produce a JSON string in the expected format. Although I’m pretty sure it doesn’t handle orientation properly for some images, so don’t be surprised if your images look wonky. (And more importantly, if you fix the orientation issue, please commit your changes to the repo.)

In any case, as long as you’re comfortable with a bit of HTML/CSS/JavaScript hacking, the example/ folder in the github repo has everything you need to drop the viewer into your own pages. If you do use this code internally, please let me know! Partly for my own edification, but mostly because when I write my annual progress reports to the NIH, it’s nice to be able to truthfully say, “hey, look, people are actually using this neat thing we built with taxpayer money.”

tracking replication attempts in psychology–for real this time

I’ve written a few posts on this blog about how the development of better online infrastructure could help address and even solve many of the problems psychologists and other scientists face (e.g., the low reliability of peer review, the ‘fudge factor’ in statistical reporting, the sheer size of the scientific literature, etc.). Actually, that general question–how we can use technology to do better science–occupies a good chunk of my research these days (see e.g., Neurosynth). One question I’ve been interested in for a long time is how to keep track not only of ‘successful’ studies (i.e., those that produce sufficiently interesting effects to make it into the published literature), but also replication failures (or successes of limited interest) that wind up in researchers’ file drawers. A couple of years ago I went so far as to build a prototype website for tracking replication attempts in psychology. Unfortunately, it never went anywhere, partly (okay, mostly) because the site really sucked, and partly because I didn’t really invest much effort in drumming up interest (mostly due to lack of time). But I still think the idea is a valuable one in principle, and a lot of other people have independently had the same idea (which means it must be right, right?).

Anyway, it looks like someone finally had the cleverness, time, and money to get this right. Hal Pashler, Sean Kang*, and colleagues at UCSD have been developing an online database for tracking attempted replications of psychology studies for a while now, and it looks like it’s now in beta. PsychFileDrawer is a very slick, full-featured platform that really should–if there’s any justice in the world–provide the kind of service everyone’s been saying we need for a long time now. If it doesn’t work, I think we’ll have some collective soul-searching to do, because I don’t think it’s going to get any easier than this to add and track attempted replications. So go use it!

 

*Full disclosure: Sean Kang is a good friend of mine, so I’m not completely impartial in plugging this (though I’d do it anyway). Sean also happens to be amazingly smart and in search of a faculty job right now. If I were you, I’d hire him.

see me flub my powerpoint slides on NIF tv!

 

UPDATE: the webcast is now archived here for posterity.

This is kind of late notice and probably of interest to few people, but I’m giving the NIF webinar tomorrow (or today, depending on your time zone–either way, we’re talking about November 1st). I’ll be talking about Neurosynth, and focusing in particular on the methods and data, since that’s what NIF (which stands for Neuroscience Information Framework) is all about. Assuming all goes well, the webinar should start at 11 am PST. But since I haven’t done a webcast of any kind before, and have a surprising knack for breaking audiovisual equipment at a distance, all may not go well. Which I suppose could make for a more interesting presentation. In any case, here’s the abstract:

The explosive growth of the human neuroimaging literature has led to major advances in understanding of human brain function, but has also made aggregation and synthesis of neuroimaging findings increasingly difficult. In this webinar, I will describe a highly automated brain mapping framework called NeuroSynth that uses text mining, meta-analysis and machine learning techniques to generate a large database of mappings between neural and cognitive states. The NeuroSynth framework can be used to automatically conduct large-scale, high-quality neuroimaging meta-analyses, address long-standing inferential problems in the neuroimaging literature (e.g., how to infer cognitive states from distributed activity patterns), and support accurate ‘decoding’ of broad cognitive states from brain activity in both entire studies and individual human subjects. This webinar will focus on (a) the methods used to extract the data, (b) the structure of the resulting (publicly available) datasets, and (c) some major limitations of the current implementation. If time allows, I’ll also provide a walk-through of the associated web interface (http://neurosynth.org) and will provide concrete examples of some potential applications of the framework.

There’s some more info (including details about how to connect, which might be important) here. And now I’m off to prepare my slides. And script some evasive and totally non-committal answers to deploy in case of difficult questions from the peanut gallery respected audience.

in which I suffer a minor setback due to hyperbolic discounting

I wrote a paper with some collaborators that was officially published today in Nature Methods (though it’s been available online for a few weeks). I spent a year of my life on this (a YEAR! That’s like 30 years in opossum years!), so go read the abstract, just to humor me. It’s about large-scale automated synthesis of human functional neuroimaging data. In fact, it’s so about that that that’s the title of the paper*. There’s also a companion website over here, which you might enjoy playing with if you like brains.

I plan to write a long post about this paper at some point in the near future, but not today. What I will do today is tell you all about why I didn’t write anything about the paper much earlier (i.e., 4 weeks ago, when it appeared online), because you seem very concerned. You see, I had grand plans for writing a very detailed and wonderfully engaging multi-part series of blog posts about the paper, starting with the background and motivation for the project (that would have been Part 1), then explaining the methods we used (Part 2), then the results (III; let’s switch to Roman numerals for effect), then some of the implications (IV), then some potential applications and future directions (V), then some stuff that didn’t make it into the paper (VI), and then, finally, a behind-the-science account of how it really all went down (VII; complete with filmed interviews with collaborators who left the project early due to creative differences). A seven-part blog post! All about one paper! It would have been longer than the article itself! And all the supplemental materials! Combined! Take my word for it, it would have been amazing.

Unfortunately, like most everyone else, I’m a much better person in the future than I am in the present; things that would take me a week of full-time work in the Now apparently take me only five to ten minutes when I plan them three months ahead of time. If you plotted my temporal discounting curve for intellectual effort, it would look like this:

So that’s why my seven-part series of blog posts didn’t debut at the same time the paper was published online a few weeks ago. In fact, it hasn’t debuted at all. At this point, my much more modest goal is just to write a single much shorter post, which will no longer be able to DEBUT, but can at least slink into the bar unnoticed while everyone else is out on the patio having a smoke. And really, I’m only doing it so I can look myself in the eye again when I look myself in the mirror. Because it turns out it’s very hard to shave your face safely if you’re not allowed to look yourself in the eye. And my labmates are starting to call me PapercutMan, which isn’t really a superpower worth having.

So yeah, I’ll write something about this paper soon. But just to play it safe, I’m not going to operationally define ‘soon’ right now.

 

* Three “that”s in a row! What are the odds! Good luck parsing that sentence!

The psychology of parapsychology, or why good researchers publishing good articles in good journals can still get it totally wrong

Unless you’ve been pleasantly napping under a rock for the last couple of months, there’s a good chance you’ve heard about a forthcoming article in the Journal of Personality and Social Psychology (JPSP) purporting to provide strong evidence for the existence of some ESP-like phenomenon. (If you’ve been napping, see here, here, here, here, here, or this comprehensive list). In the article–appropriately titled Feeling the FutureDaryl Bem reports the results of 9 (yes, 9!) separate experiments that catch ordinary college students doing things they’re not supposed to be able to do–things like detecting the on-screen location of erotic images that haven’t actually been presented yet, or being primed by stimuli that won’t be displayed until after a response has already been made.

As you might expect, Bem’s article’s causing quite a stir in the scientific community. The controversy isn’t over whether or not ESP exists, mind you; scientists haven’t lost their collective senses, and most of us still take it as self-evident that college students just can’t peer into the future and determine where as-yet-unrevealed porn is going to soon be hidden (as handy as that ability might be). The real question on many people’s minds is: what went wrong? If there’s obviously no such thing as ESP, how could a leading social psychologist publish an article containing a seemingly huge amount of evidence in favor of ESP in the leading social psychology journal, after being peer reviewed by four other psychologists? Or, to put it in more colloquial terms–what the fuck?

What the fuck?

Many critiques of Bem’s article have tried to dismiss it by searching for the smoking gun–the single critical methodological flaw that dooms the paper. For instance, one critique that’s been making the rounds, by Wagenmakers et al, argues that Bem should have done a Bayesian analysis, and that his failure to adjust his findings for the infitesimally low prior probability of ESP (essentially, the strength of subjective belief against ESP) means that the evidence for ESP is vastly overestimated. I think these types of argument have a kernel of truth, but also suffer from some problems (for the record, I don’t really agree with the Wagenmaker critique, for reasons Andrew Gelman has articulated here). Having read the paper pretty closely twice, I really don’t think there’s any single overwhelming flaw in Bem’s paper (actually, in many ways, it’s a nice paper). Instead, there are a lot of little problems that collectively add up to produce a conclusion you just can’t really trust. Below is a decidedly non-exhaustive list of some of these problems. I’ll warn you now that, unless you care about methodological minutiae, you’ll probably find this very boring reading. But that’s kind of the point: attending to this stuff is so boring that we tend not to do it, with potentially serious consequences. Anyway:

  • Bem reports 9 different studies, which sounds (and is!) impressive. But a noteworthy feature these studies is that they have grossly uneven sample sizes, ranging all the way from N = 50 to N = 200, in blocks of 50. As far as I can tell, no justification for these differences is provided anywhere in the article, which raises red flags, because the most common explanation for differing sample sizes–especially on this order of magnitude–is data peeking. That is, what often happens is that researchers periodically peek at their data, and halt data collection as soon as they obtain a statistically significant result. This may seem like a harmless little foible, but as I’ve discussed elsewhere, is actually a very bad thing, as it can substantially inflate Type I error rates (i.e., false positives).To his credit, Bem was at least being systematic about his data peeking, since his sample sizes always increase in increments of 50. But even in steps of 50, false positives can be grossly inflated. For instance, for a one-sample t-test, a researcher who peeks at her data in increments of 50 subjects and terminates data collection when a significant result is obtained (or N = 200, if no such result is obtained) can expect an actual Type I error rate of about 13%–nearly 3 times the nominal rate of 5%!
  • There’s some reason to think that the 9 experiments Bem reports weren’t necessarily designed as such. Meaning that they appear to have been ‘lumped’ or ‘splitted’ post hoc based on the results. For instance, Experiment 2 had 150 subjects, but the experimental design for the first 100 differed from the final 50 in several respects. They were minor respects, to be sure (e.g., pictures were presented randomly in one study, but in a fixed sequence in the other), but were still comparable in scope to those that differentiated Experiment 8 from Experiment 9 (which had the same sample size splits of 100 and 50, but were presented as two separate experiments). There’s no obvious reason why a researcher would plan to run 150 subjects up front, then decide to change the design after 100 subjects, and still call it the same study. A more plausible explanation is that Experiment 2 was actually supposed to be two separate experiments (a successful first experiment with N = 100 followed by an intended replication with N = 50) that was collapsed into one large study when the second experiment failed–preserving the statistically significant result in the full sample. Needless to say, this kind of lumping and splitting is liable to additionally inflate the false positive rate.
  • Most of Bem’s experiments allow for multiple plausible hypotheses, and it’s rarely clear why Bem would have chosen, up front, the hypotheses he presents in the paper. For instance, in Experiment 1, Bem finds that college students are able to predict the future location of erotic images that haven’t yet been presented (essentially a form of precognition), yet show no ability to predict the location of negative, positive, or romantic pictures. Bem’s explanation for this selective result is that “… such anticipation would be evolutionarily advantageous for reproduction and survival if the organism could act instrumentally to approach erotic stimuli …”. But this seems kind of silly on several levels. For one thing, it’s really hard to imagine that there’s an adaptive benefit to keeping an eye out for potential mates, but not for other potential positive signals (represented by non-erotic positive images). For another, it’s not like we’re talking about actual people or events here; we’re talking about digital images on an LCD. What Bem is effectively saying is that, somehow, someway, our ancestors evolved the extrasensory capacity to read digital bits from the future–but only pornographic ones. Not very compelling, and one could easily have come up with a similar explanation in the event that any of the other picture categories had selectively produced statistically significant results. Of course, if you get to test 4 or 5 different categories at p < .05, and pretend that you called it ahead of time, your false positive rate isn’t really 5%–it’s closer to 20%.
  • I say p < .05, but really, it’s more like p < .1, because the vast majority of tests Bem reports use one-tailed tests–effectively instantaneously doubling the false positive rate. There’s a long-standing debate in the literature, going back at least 60 years, as to whether it’s ever appropriate to use one-tailed tests, but even proponents of one-tailed tests will concede that you should only use them if you really truly have a directional hypothesis in mind before you look at your data. That seems exceedingly unlikely in this case, at least for many of the hypotheses Bem reports testing.
  • Nearly all of Bem’s statistically significant p values are very close to the critical threshold of .05. That’s usually a marker of selection bias, particularly given the aforementioned unevenness of sample sizes. When experiments are conducted in a principled way (i.e., with minimal selection bias or peeking), researchers will often get very low p values, since it’s very difficult to know up front exactly how large effect sizes will be. But in Bem’s 9 experiments, he almost invariably collects just enough subjects to detect a statistically significant effect. There are really only two explanations for that: either Bem is (consciously or unconsciously) deciding what his hypotheses are based on which results attain significance (which is not good), or he’s actually a master of ESP himself, and is able to peer into the future and identify the critical sample size he’ll need in each experiment (which is great, but unlikely).
  • Some of the correlational effects Bem reports–e.g., that people with high stimulus seeking scores are better at ESP–appear to be based on measures constructed post hoc. For instance, Bem uses a non-standard, two-item measure of boredom susceptibility, with no real justification provided for this unusual item selection, and no reporting of results for the presumably many other items and questionnaires that were administered alongside these items (except to parenthetically note that some measures produced non-significant results and hence weren’t reported). Again, the ability to select from among different questionnaires–and to construct custom questionnaires from different combinations of items–can easily inflate Type I error.
  • It’s not entirely clear how many studies Bem ran. In the Discussion section, he notes that he could “identify three sets of findings omitted from this report so far that should be mentioned lest they continue to languish in the file drawer”, but it’s not clear from the description that follows exactly how many studies these “three sets of findings” comprised (or how many ‘pilot’ experiments were involved). What we’d really like to know is the exact number of (a) experiments and (b) subjects Bem ran, without qualification, and including all putative pilot sessions.

It’s important to note that none of these concerns is really terrible individually. Sure, it’s bad to peek at your data, but data peeking alone probably isn’t going to produce 9 different false positives. Nor is using one-tailed tests, or constructing measures on the fly, etc. But when you combine data peeking, liberal thresholds, study recombination, flexible hypotheses, and selective measures, you have a perfect recipe for spurious results. And the fact that there are 9 different studies isn’t any guard against false positives when fudging is at work; if anything, it may make it easier to produce a seemingly consistent story, because reviewers and readers have a natural tendency to relax the standards for each individual experiment. So when Bem argues that “…across all nine experiments, Stouffer’s z = 6.66, p = 1.34 × 10-11,” that statement that the cumulative p value is 1.34 x 10-11 is close to meaningless. Combining p values that way would only be appropriate under the assumption that Bem conducted exactly 9 tests, and without any influence of selection bias. But that’s clearly not the case here.

What would it take to make the results more convincing?

Admittedly, there are quite a few assumptions involved in the above analysis. I don’t know for a fact that Bem was peeking at his data; that just seems like a reasonable assumption given that no justification was provided anywhere for the use of uneven samples. It’s conceivable that Bem had perfectly good, totally principled, reasons for conducting the experiments exactly has he did. But if that’s the case, defusing these criticisms should be simple enough. All it would take for Bem to make me (and presumably many other people) feel much more comfortable with the results is an affirmation of the following statements:

  • That the sample sizes of the different experiments were determined a priori, and not based on data snooping;
  • That the distinction between pilot studies and ‘real’ studies was clearly defined up front–i.e., there weren’t any studies that started out as pilots but eventually ended up in the paper, or studies that were supposed to end up in the paper but that were disqualified as pilots based on the (lack of) results;
  • That there was a clear one-to-one mapping between intended studies and reported studies; i.e., Bem didn’t ‘lump’ together two different studies in cases where one produced no effect, or split one study into two in cases where different subsets of the data both showed an effect;
  • That the predictions reported in the paper were truly made a priori, and not on the basis of the results (e.g., that the hypothesis that sexually arousing stimuli would be the only ones to show an effect was actually written down in one of Bem’s notebooks somewhere);
  • That the various transformations applied to the RT and memory performance measures in some Experiments weren’t selected only after inspecting the raw, untransformed values and failing to identify significant results;
  • That the individual differences measures reported in the paper were selected a priori and not based on post-hoc inspection of the full pattern of correlations across studies;
  • That Bem didn’t run dozens of other statistical tests that failed to produce statistically non-significant results and hence weren’t reported in the paper.

Endorsing this list of statements (or perhaps a somewhat more complete version, as there are other concerns I didn’t mention here) would be sufficient to cast Bem’s results in an entirely new light, and I’d go so far as to say that I’d even be willing to suspend judgment on his conclusions pending additional data (which would be a big deal for me, since I don’t have a shred of a belief in ESP). But I confess that I’m not holding my breath, if only because I imagine that Bem would have already addressed these concerns in his paper if there were indeed principled justifications for the design choices in question.

It isn’t a bad paper

If you’ve read this far (why??), this might seem like a pretty damning review, and you might be thinking, boy, this is really a terrible paper. But I don’t think that’s true at all. In many ways, I think Bem’s actually been relatively careful. The thing to remember is that this type of fudging isn’t unusual; to the contrary, it’s rampant–everyone does it. And that’s because it’s very difficult, and often outright impossible, to avoid. The reality is that scientists are human, and like all humans, have a deep-seated tendency to work to confirm what they already believe. In Bem’s case, there are all sorts of reasons why someone who’s been working for the better part of a decade to demonstrate the existence of psychic phenomena isn’t necessarily the most objective judge of the relevant evidence. I don’t say that to impugn Bem’s motives in any way; I think the same is true of virtually all scientists–including myself. I’m pretty sure that if someone went over my own work with a fine-toothed comb, as I’ve gone over Bem’s above, they’d identify similar problems. Put differently, I don’t doubt that, despite my best efforts, I’ve reported some findings that aren’t true, because I wasn’t as careful as a completely disinterested observer would have been. That’s not to condone fudging, of course, but simply to recognize that it’s an inevitable reality in science, and it isn’t fair to hold Bem to a higher standard than we’d hold anyone else.

If you set aside the controversial nature of Bem’s research, and evaluate the quality of his paper purely on methodological grounds, I don’t think it’s any worse than the average paper published in JPSP, and actually probably better. For all of the concerns I raised above, there are many things Bem is careful to do that many other researchers don’t. For instance, he clearly makes at least a partial effort to avoid data peeking by collecting samples in increments of 50 subjects (I suspect he simply underestimated the degree to which Type I error rates can be inflated by peeking, even with steps that large); he corrects for multiple comparisons in many places (though not in some places where it matters); and he devotes an entire section of the discussion to considering the possibility that he might be inadvertently capitalizing on chance by falling prey to certain biases. Most studies–including most of those published in JPSP, the premier social psychology journal–don’t do any of these things, even though the underlying problems are just applicable. So while you can confidently conclude that Bem’s article is wrong, I don’t think it’s fair to say that it’s a bad article–at least, not by the standards that currently hold in much of psychology.

Should the study have been published?

Interestingly, much of the scientific debate surrounding Bem’s article has actually had very little to do with the veracity of the reported findings, because the vast majority of scientists take it for granted that ESP is bunk. Much of the debate centers instead over whether the article should have ever been published in a journal as prestigious as JPSP (or any other peer-reviewed journal, for that matter). For the most part, I think the answer is yes. I don’t think it’s the place of editors and reviewers to reject a paper based solely on the desirability of its conclusions; if we take the scientific method–and the process of peer review–seriously, that commits us to occasionally (or even frequently) publishing work that we believe time will eventually prove wrong. The metrics I think reviewers should (and do) use are whether (a) the paper is as good as most of the papers that get published in the journal in question, and (b) the methods used live up to the standards of the field. I think that’s true in this case, so I don’t fault the editorial decision. Of course, it sucks to see something published that’s virtually certain to be false… but that’s the price we pay for doing science. As long as they play by the rules, we have to engage with even patently ridiculous views, because sometimes (though very rarely) it later turns out that those views weren’t so ridiculous after all.

That said, believing that it’s appropriate to publish Bem’s article given current publishing standards doesn’t preclude us from questioning those standards themselves. On a pretty basic level, the idea that Bem’s article might be par for the course, quality-wise, yet still be completely and utterly wrong, should surely raise some uncomfortable questions about whether psychology journals are getting the balance between scientific novelty and methodological rigor right. I think that’s a complicated issue, and I’m not going to try to tackle it here, though I will say that personally I do think that more stringent standards would be a good thing for psychology, on the whole. (It’s worth pointing out that the problem of (arguably) lax standards is hardly unique to psychology; as John Ionannidis has famously pointed out, most published findings in the biomedical sciences are false.)

Conclusion

The controversy surrounding the Bem paper is fascinating for many reasons, but it’s arguably most instructive in underscoring the central tension in scientific publishing between rapid discovery and innovation on the one hand, and methodological rigor and cautiousness on the other. Both values are important, but it’s important to recognize the tradeoff that pursuing either one implies. Many of the people who are now complaining that JPSP should never have published Bem’s article seem to overlook the fact that they’ve probably benefited themselves from the prevalence of the same relaxed standards (note that by ‘relaxed’ I don’t mean to suggest that journals like JPSP are non-selective about what they publish, just that methodological rigor is only one among many selection criteria–and often not the most important one). Conversely, maintaining editorial standards that would have precluded Bem’s article from being published would almost certainly also make it much more difficult to publish most other, much less controversial, findings. A world in which fewer spurious results are published is a world in which fewer studies are published, period. You can reasonably debate whether that would be a good or bad thing, but you can’t have it both ways. It’s wishful thinking to imagine that reviewers could somehow grow a magic truth-o-meter that applies lax standards to veridical findings and stringent ones to false positives.

From a bird’s eye view, there’s something undeniably strange about the idea that a well-respected, relatively careful researcher could publish an above-average article in a top psychology journal, yet have virtually everyone instantly recognize that the reported findings are totally, irredeemably false. You could read that as a sign that something’s gone horribly wrong somewhere in the machine; that the reviewers and editors of academic journals have fallen down and can’t get up, or that there’s something deeply flawed about the way scientists–or at least psychologists–practice their trade. But I think that’s wrong. I think we can look at it much more optimistically. We can actually see it as a testament to the success and self-corrective nature of the scientific enterprise that we actually allow articles that virtually nobody agrees with to get published. And that’s because, as scientists, we take seriously the possibility, however vanishingly small, that we might be wrong about even our strongest beliefs. Most of us don’t really believe that Cornell undergraduates have a sixth sense for future porn… but if they did, wouldn’t you want to know about it?

ResearchBlogging.org
Bem, D. J. (2011). Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect Journal of Personality and Social Psychology

what the arsenic effect means for scientific publishing

I don’t know very much about DNA (and by ‘not very much’ I sadly mean ‘next to nothing’), so when someone tells me that life as we know it generally doesn’t use arsenic to make DNA, and that it’s a big deal to find a bacterium that does, I’m willing to believe them. So too, apparently, are at least two or three reviewers for Science, which published a paper last week by a NASA group purporting to demonstrate exactly that.

Turns out the paper might have a few holes. In the last few days, the blogosphere has reached fever delirium pitch as critiques of the article have emerged from every corner; it seems like pretty much everyone with some knowledge of the science in question is unhappy about the paper. Since I’m not in any position to critique the article myself, I’ll take Carl Zimmer’s word for it in Slate yesterday:

Was this merely a case of a few isolated cranks? To find out, I reached out to a dozen experts on Monday. Almost unanimously, they think the NASA scientists have failed to make their case.  “It would be really cool if such a bug existed,” said San Diego State University’s Forest Rohwer, a microbiologist who looks for new species of bacteria and viruses in coral reefs. But, he added, “none of the arguments are very convincing on their own.” That was about as positive as the critics could get. “This paper should not have been published,” said Shelley Copley of the University of Colorado.

Zimmer then follows his Slate piece up with a blog post today in which he provides 13 experts’ unadulterated comments. While there are one or two (somewhat) positive reviews, the consensus clearly seems to be that the Science paper is (very) bad science.

Of course, scientists (yes, even Science reviewers) do occasionally make mistakes, so if we’re being charitable about it, we might chalk it up to human error (though some of the critiques suggest that these are elementary problems that could have been very easily addressed, so it’s possible there’s some disingenuousness involved). But what many bloggers (1, 2, 3, etc.) have found particularly inexcusable is the way NASA and the research team have handled the criticism. Zimmer again, in Slate:

I asked two of the authors of the study if they wanted to respond to the criticism of their paper. Both politely declined by email.

“We cannot indiscriminately wade into a media forum for debate at this time,” declared senior author Ronald Oremland of the U.S. Geological Survey. “If we are wrong, then other scientists should be motivated to reproduce our findings. If we are right (and I am strongly convinced that we are) our competitors will agree and help to advance our understanding of this phenomenon. I am eager for them to do so.”

“Any discourse will have to be peer-reviewed in the same manner as our paper was, and go through a vetting process so that all discussion is properly moderated,” wrote Felisa Wolfe-Simon of the NASA Astrobiology Institute. “The items you are presenting do not represent the proper way to engage in a scientific discourse and we will not respond in this manner.”

A NASA spokesperson basically reiterated this point of view, indicating that NASA scientists weren’t going to respond to criticism of their work unless that criticism appeared in, you know, a respectable, peer-reviewed outlet. (Fortunately, at least one of the critics already has a draft letter to Science up on her blog.)

I don’t think it’s surprising that people who spend much of their free time blogging about science, and think it’s important to discuss scientific issues in a public venue, generally aren’t going to like being told that science blogging isn’t a legitimate form of scientific discourse. Especially considering that the critics here aren’t laypeople without scientific training; they’re well-respected scientists with areas of expertise that are directly relevant to the paper. In this case, dismissing trenchant criticism because it’s on the web rather than in a peer-reviewed journal seems kind of like telling someone who’s screaming at you that your house is on fire that you’re not going to listen to them until they adopt a more polite tone. It just seems counterproductive.

That said, I personally don’t think we should take the NASA team’s statements at face value. I very much doubt that what the NASA researchers are saying really reflect any deep philosophical view about the role of blogs in scientific discourse; it’s much more likely that they’re simply trying to buy some time while they figure out how to respond. On the face of it, they have a choice between two lousy options: either ignore the criticism entirely, which would be antithetical to the scientific process and would look very bad, or address it head-on–which, judging by the vociferousness and near-unanimity of the commentators, is probably going to be a losing battle. Shifting the terms of the debate by insisting on responding only in a peer-reviewed venue doesn’t really change anything, but it does buy the authors two or three weeks. And two or three weeks is worth like, forty attentional cycles in the blogosphere.

Mind you, I’m not saying we should sympathize with the NASA researchers just because they’re in a tough position. I think one of the main reasons the story’s attracted so much attention is precisely because people see it as a case of justice being served. The NASA team called a major press conference ahead of the paper’s publication, published its results in one of the world’s most prestigious science journals, and yet apparently failed to run relatively basic experimental controls in support of its conclusions. If the critics are to be believed, the NASA researchers are either disingenuous or incompetent; either way, we shouldn’t feel sorry for them.

What I do think this episode shows is that the rules of scientific publishing have fundamentally changed in the last few years–and largely for the better. I haven’t been doing science for very long, but even in the halcyon days of 2003, when I started graduate school, science blogging was practically nonexistent, and the main way you’d find out what other people thought about an influential new paper was by talking to people you knew at conferences (which could take several months) or waiting for critiques or replication failures to emerge in other peer-reviewed journals (which could take years). That kind of delay between publication and evaluation is disastrous for science, because in the time it takes for a consensus to emerge that a paper is no good, several research teams might have already started trying to replicate and extend the reported findings, and several dozen other researchers might have uncritically cited their paper peripherally in their own work. This delay is probably why, as John Ioannidis’ work so elegantly demonstrates, major studies published in high-impact journals tend to exert a disproportionate influence on the literature long after they’ve been resoundingly discredited.

The Arsenic Effect, if we can call it that, provides a nice illustration of the impact of new media on scientific communication. It’s a safe bet that there are now very few people who do anything even vaguely related to the NASA team’s research who haven’t been made aware that the reported findings are controversial. Which means that the process of attempting to replicate (or falsify) the findings will proceed much more quickly than it might have ten or twenty years ago, and there probably won’t be very many people who cite the Science paper as compelling evidence of terrestrial arsenic-based life. Perhaps more importantly, as researchers get used to the idea that their high-profile work is going to be instantly evaluated by thousands of pairs of highly trained eyes, any of which might be attached to a highly prolific pair of typing hands, there will be an increasingly strong disincentive to avoid being careless. That isn’t to say that bad science will disappear, of course; just that, in cases where the badness reflects a pressure to tell a good story at all costs, we’ll probably see less of it.

what the Dunning-Kruger effect is and isn’t

If you regularly read cognitive science or psychology blogs (or even just the lowly New York Times!), you’ve probably heard of something called the Dunning-Kruger effect. The Dunning-Kruger effect refers to the seemingly pervasive tendency of poor performers to overestimate their abilities relative to other people–and, to a lesser extent, for high performers to underestimate their abilities. The explanation for this, according to Kruger and Dunning, who first reported the effect in an extremely influential 1999 article in the Journal of Personality and Social Psychology, is that incompetent people by lack the skills they’d need in order to be able to distinguish good performers from bad performers:

…people who lack the knowledge or wisdom to perform well are often unaware of this fact. We attribute this lack of awareness to a deficit in metacognitive skill. That is, the same incompetence that leads them to make wrong choices also deprives them of the savvy necessary to recognize competence, be it their own or anyone else’s.

For reasons I’m not really clear on, the Dunning-Kruger effect seems to be experiencing something of a renaissance over the past few months; it’s everywhere in the blogosphere and media. For instance, here are just a few alleged Dunning-Krugerisms from the past few weeks:

So what does this mean in business? Well, it’s all over the place. Even the title of Dunning and Kruger’s paper, the part about inflated self-assessments, reminds me of a truism that was pointed out by a supervisor early in my career: The best employees will invariably be the hardest on themselves in self-evaluations, while the lowest performers can be counted on to think they are doing excellent work…

Heidi Montag and Spencer Pratt are great examples of the Dunning-Kruger effect. A whole industry of assholes are making a living off of encouraging two attractive yet untalented people they are actually genius auteurs. The bubble around them is so thick, they may never escape it. At this point, all of America (at least those who know who they are), is in on the joke – yet the two people in the center of this tragedy are completely unaware…

Not so fast there — the Dunning-Kruger effect comes into play here. People in the United States do not have a high level of understanding of evolution, and this survey did not measure actual competence. I’ve found that the people most likely to declare that they have a thorough knowledge of evolution are the creationists…but that a brief conversation is always sufficient to discover that all they’ve really got is a confused welter of misinformation…

As you can see, the findings reported by Kruger and Dunning are often interpreted to suggest that the less competent people are, the more competent they think they are. People who perform worst at a task tend to think they’re god’s gift to said task, and the people who can actually do said task often display excessive modesty. I suspect we find this sort of explanation compelling because it appeals to our implicit just-world theories: we’d like to believe that people who obnoxiously proclaim their excellence at X, Y, and Z must really not be so very good at X, Y, and Z at all, and must be (over)compensating for some actual deficiency; it’s much less pleasant to imagine that people who go around shoving their (alleged) superiority in our faces might really be better than us at what they do.

Unfortunately, Kruger and Dunning never actually provided any support for this type of just-world view; their studies categorically didn’t show that incompetent people are more confident or arrogant than competent people. What they did show is this:

This is one of the key figures from Kruger and Dunning’s 1999 paper (and the basic effect has been replicated many times since). The critical point to note is that there’s a clear positive correlation between actual performance (gray line) and perceived performance (black line): the people in the top quartile for actual performance think they perform better than the people in the second quartile, who in turn think they perform better than the people in the third quartile, and so on. So the bias is definitively not that incompetent people think they’re better than competent people. Rather, it’s that incompetent people think they’re much better than they actually are. But they typically still don’t think they’re quite as good as people who, you know, actually are good. (It’s important to note that Dunning and Kruger never claimed to show that the unskilled think they’re better than the skilled; that’s just the way the finding is often interpreted by others.)

That said, it’s clear that there is a very large discrepancy between the way incompetent people actually perform and the way they perceive their own performance level, whereas the discrepancy is much smaller for highly competent individuals. So the big question is why. Kruger and Dunning’s explanation, as I mentioned above, is that incompetent people lack the skills they’d need in order to know they’re incompetent. For example, if you’re not very good at learning languages, it might be hard for you to tell that you’re not very good, because the very skills that you’d need in order to distinguish someone who’s good from someone who’s not are the ones you lack. If you can’t hear the distinction between two different phonemes, how could you ever know who has native-like pronunciation ability and who doesn’t? If you don’t understand very many words in another language, how can you evaluate the size of your own vocabulary in relation to other people’s?

This appeal to people’s meta-cognitive abilities (i.e., their knowledge about their knowledge) has some intuitive plausibility, and Kruger, Dunning and their colleagues have provided quite a bit of evidence for it over the past decade. That said, it’s by no means the only explanation around; over the past few years, a fairly sizeable literature criticizing or extending Kruger and Dunning’s work has developed. I’ll mention just three plausible (and mutually compatible) alternative accounts people have proposed (but there are others!)

1. Regression toward the mean. Probably the most common criticism of the Dunning-Kruger effect is that it simply reflects regression to the mean–that is, it’s a statistical artifact. Regression to the mean refers to the fact that any time you select a group of individuals based on some criterion, and then measure the standing of those individuals on some other dimension, performance levels will tend to shift (or regress) toward the mean level. It’s a notoriously underappreciated problem, and probably explains many, many phenomena that people have tried to interpret substantively. For instance, in placebo-controlled clinical trials of SSRIs, depressed people tend to get better in both the drug and placebo conditions. Some of this is undoubtedly due to the placebo effect, but much of it is probably also due to what’s often referred to as “natural history”. Depression, like most things, tends to be cyclical: people get better or worse better over time, often for no apparent rhyme or reason. But since people tend to seek help (and sign up for drug trials) primarily when they’re doing particularly badly, it follows that most people would get better to some extent even without any treatment. That’s regression to the mean (the Wikipedia entry has other nice examples–for example, the famous Sports Illustrated Cover Jinx).

In the context of the Dunning-Kruger effect, the argument is that incompetent people simply regress toward the mean when you ask them to evaluate their own performance. Since perceived performance is influenced not only by actual performance, but also by many other factors (e.g., one’s personality, meta-cognitive ability, measurement error, etc.), it follows that, on average, people with extreme levels of actual performance won’t be quite as extreme in terms of their perception of their performance. So, much of the Dunning-Kruger effect arguably doesn’t need to be explained at all, and in fact, it would be quite surprising if you didn’t see a pattern of results that looks at least somewhat like the figure above.

2. Regression to the mean plus better-than-average. Having said that, it’s clear that regression to the mean can’t explain everything about the Dunning-Kruger effect. One problem is that it doesn’t explain why the effect is greater at the low end than at the high end. That is, incompetent people tend to overestimate their performance to a much greater extent than competent people underestimate their performance. This asymmetry can’t be explained solely by regression to the mean. It can, however, be explained by a combination of RTM and a “better-than-average” (or self-enhancement) heuristic which says that, in general, most people have a tendency to view themselves excessively positively. This two-pronged explanation was proposed by Krueger and Mueller in a 2002 study (note that Krueger and Kruger are different people!), who argued that poor performers suffer from a double whammy: not only do their perceptions of their own performance regress toward the mean, but those perceptions are also further inflated by the self-enhancement bias. In contrast, for high performers, these two effects largely balance each other out: regression to the mean causes high performers to underestimate their performance, but to some extent that underestimation is offset by the self-enhancement bias. As a result, it looks as though high performers make more accurate judgments than low performers, when in reality the high performers are just lucky to be where they are in the distribution.

3. The instrumental role of task difficulty. Consistent with the notion that the Dunning-Kruger effect is at least partly a statistical artifact, some studies have shown that the asymmetry reported by Kruger and Dunning (i.e., the smaller discrepancy for high performers than for low performers) actually goes away, and even reverses, when the ability tests given to participants are very difficult. For instance, Burson and colleagues (2006), writing in JPSP, showed that when University of Chicago undergraduates were asked moderately difficult trivia questions about their university, the subjects who performed best were just as poorly calibrated as the people who performed worst, in the sense that their estimates of how well they did relative to other people were wildly inaccurate. Here’s what that looks like:

Notice that this finding wasn’t anomalous with respect to the Kruger and Dunning findings; when participants were given easier trivia (the diamond-studded line), Burson et al observed the standard pattern, with poor performers seemingly showing worse calibration. Simply knocking about 10% off the accuracy rate on the trivia questions was enough to induce a large shift in the relative mismatch between perceptions of ability and actual ability. Burson et al then went on to replicate this pattern in two additional studies involving a number of different judgments and tasks, so this result isn’t specific to trivia questions. In fact, in the later studies, Burson et al showed that when the task was really difficult, poor performers were actually considerably better calibrated than high performers.

Looking at the figure above, it’s not hard to see why this would be. Since the slope of the line tends to be pretty constant in these types of experiments, any change in mean performance levels (i.e., a shift in intercept on the y-axis) will necessarily result in a larger difference between actual and perceived performance at the high end. Conversely, if you raise the line, you maximize the difference between actual and perceived performance at the lower end.

To get an intuitive sense of what’s happening here, just think of it this way: if you’re performing a very difficult task, you’re probably going to find the experience subjectively demanding even if you’re at the high end relative to other people. Since people’s judgments about their own relative standing depends to a substantial extent on their subjective perception of their own performance (i.e., you use your sense of how easy a task was as a proxy of how good you must be at it), high performers are going to end up systematically underestimating how well they did. When a task is difficult, most people assume they must have done relatively poorly compared to other people. Conversely, when a task is relatively easy (and the tasks Dunning and Kruger studied were on the easier side), most people assume they must be pretty good compared to others. As a result, it’s going to look like the people who perform well are well-calibrated when the task is easy and poorly-calibrated when the task is difficult; less competent people are going to show exactly the opposite pattern. And note that this doesn’t require us to assume any relationship between actual performance and perceived performance. You would expect to get the Dunning-Kruger effect for easy tasks even if there was exactly zero correlation between how good people actually are at something and how good they think they are.

Here’s how Burson et al summarized their findings:

Our studies replicate, eliminate, or reverse the association between task performance and judgment accuracy reported by Kruger and Dunning (1999) as a function of task difficulty. On easy tasks, where there is a positive bias, the best performers are also the most accurate in estimating their standing, but on difficult tasks, where there is a negative bias, the worst performers are the most accurate. This pattern is consistent with a combination of noisy estimates and overall bias, with no need to invoke differences in metacognitive abilities. In this  regard, our findings support Krueger and Mueller’s (2002) reinterpretation of Kruger and Dunning’s (1999) findings. An association between task-related skills and metacognitive insight may indeed exist, and later we offer some suggestions for ways to test for it. However, our analyses indicate that the primary drivers of errors in judging relative standing are general inaccuracy and overall biases tied to task difficulty. Thus, it is important to know more about those sources of error in order to better understand and ameliorate them.

What should we conclude from these (and other) studies? I think the jury’s still out to some extent, but at minimum, I think it’s clear that much of the Dunning-Kruger effect reflects either statistical artifact (regression to the mean), or much more general cognitive biases (the tendency to self-enhance and/or to use one’s subjective experience as a guide to one’s standing in relation to others). This doesn’t mean that the meta-cognitive explanation preferred by Dunning, Kruger and colleagues can’t hold in some situations; it very well may be that in some cases, and to some extent, people’s lack of skill is really what prevents them from accurately determining their standing in relation to others. But I think our default position should be to prefer the alternative explanations I’ve discussed above, because they’re (a) simpler, (b) more general (they explain lots of other phenomena), and (c) necessary (frankly, it’d be amazing if regression to the mean didn’t explain at least part of the effect!).

We should also try to be aware of another very powerful cognitive bias whenever we use the Dunning-Kruger effect to explain the people or situations around us–namely, confirmation bias. If you believe that incompetent people don’t know enough to know they’re incompetent, it’s not hard to find anecdotal evidence for that; after all, we all know people who are both arrogant and not very good at what they do. But if you stop to look for it, it’s probably also not hard to find disconfirming evidence. After all, there are clearly plenty of people who are good at what they do, but not nearly as good as they think they are (i.e., they’re above average, and still totally miscalibrated in the positive direction). Just like there are plenty of people who are lousy at what they do and recognize their limitations (e.g., I don’t need to be a great runner in order to be able to tell that I’m not a great runner–I’m perfectly well aware that I have terrible endurance, precisely because I can’t finish runs that most other runners find trivial!). But the plural of anecdote is not data, and the data appear to be equivocal. Next time you’re inclined to chalk your obnoxious co-worker’s delusions of grandeur down to the Dunning-Kruger effect, consider the possibility that your co-worker’s simply a jerk–no meta-cognitive incompetence necessary.

ResearchBlogging.orgKruger J, & Dunning D (1999). Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of personality and social psychology, 77 (6), 1121-34 PMID: 10626367
Krueger J, & Mueller RA (2002). Unskilled, unaware, or both? The better-than-average heuristic and statistical regression predict errors in estimates of own performance. Journal of personality and social psychology, 82 (2), 180-8 PMID: 11831408
Burson KA, Larrick RP, & Klayman J (2006). Skilled or unskilled, but still unaware of it: how perceptions of difficulty drive miscalibration in relative comparisons. Journal of personality and social psychology, 90 (1), 60-77 PMID: 16448310