Archive for the ‘research’ Category

tracking replication attempts in psychology–for real this time

Tuesday, November 22nd, 2011

I’ve written a few posts on this blog about how the development of better online infrastructure could help address and even solve many of the problems psychologists and other scientists face (e.g., the low reliability of peer review, the ‘fudge factor’ in statistical reporting, the sheer size of the scientific literature, etc.). Actually, that general question–how we can use technology to do better science–occupies a good chunk of my research these days (see e.g., Neurosynth). One question I’ve been interested in for a long time is how to keep track not only of ‘successful’ studies (i.e., those that produce sufficiently interesting effects to make it into the published literature), but also replication failures (or successes of limited interest) that wind up in researchers’ file drawers. A couple of years ago I went so far as to build a prototype website for tracking replication attempts in psychology. Unfortunately, it never went anywhere, partly (okay, mostly) because the site really sucked, and partly because I didn’t really invest much effort in drumming up interest (mostly due to lack of time). But I still think the idea is a valuable one in principle, and a lot of other people have independently had the same idea (which means it must be right, right?).

Anyway, it looks like someone finally had the cleverness, time, and money to get this right. Hal Pashler, Sean Kang*, and colleagues at UCSD have been developing an online database for tracking attempted replications of psychology studies for a while now, and it looks like it’s now in beta. PsychFileDrawer is a very slick, full-featured platform that really should–if there’s any justice in the world–provide the kind of service everyone’s been saying we need for a long time now. If it doesn’t work, I think we’ll have some collective soul-searching to do, because I don’t think it’s going to get any easier than this to add and track attempted replications. So go use it!

 

*Full disclosure: Sean Kang is a good friend of mine, so I’m not completely impartial in plugging this (though I’d do it anyway). Sean also happens to be amazingly smart and in search of a faculty job right now. If I were you, I’d hire him.

see me flub my powerpoint slides on NIF tv!

Monday, October 31st, 2011

 

UPDATE: the webcast is now archived here for posterity.

This is kind of late notice and probably of interest to few people, but I’m giving the NIF webinar tomorrow (or today, depending on your time zone–either way, we’re talking about November 1st). I’ll be talking about Neurosynth, and focusing in particular on the methods and data, since that’s what NIF (which stands for Neuroscience Information Framework) is all about. Assuming all goes well, the webinar should start at 11 am PST. But since I haven’t done a webcast of any kind before, and have a surprising knack for breaking audiovisual equipment at a distance, all may not go well. Which I suppose could make for a more interesting presentation. In any case, here’s the abstract:

The explosive growth of the human neuroimaging literature has led to major advances in understanding of human brain function, but has also made aggregation and synthesis of neuroimaging findings increasingly difficult. In this webinar, I will describe a highly automated brain mapping framework called NeuroSynth that uses text mining, meta-analysis and machine learning techniques to generate a large database of mappings between neural and cognitive states. The NeuroSynth framework can be used to automatically conduct large-scale, high-quality neuroimaging meta-analyses, address long-standing inferential problems in the neuroimaging literature (e.g., how to infer cognitive states from distributed activity patterns), and support accurate ‘decoding’ of broad cognitive states from brain activity in both entire studies and individual human subjects. This webinar will focus on (a) the methods used to extract the data, (b) the structure of the resulting (publicly available) datasets, and (c) some major limitations of the current implementation. If time allows, I’ll also provide a walk-through of the associated web interface (http://neurosynth.org) and will provide concrete examples of some potential applications of the framework.

There’s some more info (including details about how to connect, which might be important) here. And now I’m off to prepare my slides. And script some evasive and totally non-committal answers to deploy in case of difficult questions from the peanut gallery (er, respected audience).

in which I suffer a minor setback due to hyperbolic discounting

Thursday, July 28th, 2011

I wrote a paper with some collaborators that was officially published today in Nature Methods (though it’s been available online for a few weeks). I spent a year of my life on this (a YEAR! That’s like 30 years in opossum years!), so go read the abstract, just to humor me. It’s about large-scale automated synthesis of human functional neuroimaging data. In fact, it’s so about that that that’s the title of the paper*. There’s also a companion website over here, which you might enjoy playing with if you like brains.

I plan to write a long post about this paper at some point in the near future, but not today. What I will do today is tell you all about why I didn’t write anything about the paper much earlier (i.e., 4 weeks ago, when it appeared online), because you seem very concerned. You see, I had grand plans for writing a very detailed and wonderfully engaging multi-part series of blog posts about the paper, starting with the background and motivation for the project (that would have been Part 1), then explaining the methods we used (Part 2), then the results (III; let’s switch to Roman numerals for effect), then some of the implications (IV), then some potential applications and future directions (V), then some stuff that didn’t make it into the paper (VI), and then, finally, a behind-the-science account of how it really all went down (VII; complete with filmed interviews with collaborators who left the project early due to creative differences). A seven-part blog post! All about one paper! It would have been longer than the article itself! And all the supplemental materials! Combined! Take my word for it, it would have been amazing.

Unfortunately, like most everyone else, I’m a much better person in the future than I am in the present; things that would take me a week of full-time work in the Now apparently take me only five to ten minutes when I plan them three months ahead of time. If you plotted my temporal discounting curve for intellectual effort, it would look like this:

So that’s why my seven-part series of blog posts didn’t debut at the same time the paper was published online a few weeks ago. In fact, it hasn’t debuted at all. At this point, my much more modest goal is just to write a single much shorter post, which will no longer be able to DEBUT, but can at least slink into the bar unnoticed while everyone else is out on the patio having a smoke. And really, I’m only doing it so I can look myself in the eye again when I look myself in the mirror. Because it turns out it’s very hard to shave your face safely if you’re not allowed to look yourself in the eye. And my labmates are starting to call me PapercutMan, which isn’t really a superpower worth having.

So yeah, I’ll write something about this paper soon. But just to play it safe, I’m not going to operationally define ‘soon’ right now.

 

* Three “that”s in a row! What are the odds! Good luck parsing that sentence!

The psychology of parapsychology, or why good researchers publishing good articles in good journals can still get it totally wrong

Monday, January 10th, 2011

Unless you’ve been pleasantly napping under a rock for the last couple of months, there’s a good chance you’ve heard about a forthcoming article in the Journal of Personality and Social Psychology (JPSP) purporting to provide strong evidence for the existence of some ESP-like phenomenon. (If you’ve been napping, see here, here, here, here, here, or this comprehensive list). In the article–appropriately titled Feeling the Future–Daryl Bem reports the results of 9 (yes, 9!) separate experiments that catch ordinary college students doing things they’re not supposed to be able to do–things like detecting the on-screen location of erotic images that haven’t actually been presented yet, or being primed by stimuli that won’t be displayed until after a response has already been made.

As you might expect, Bem’s article’s causing quite a stir in the scientific community. The controversy isn’t over whether or not ESP exists, mind you; scientists haven’t lost their collective senses, and most of us still take it as self-evident that college students just can’t peer into the future and determine where as-yet-unrevealed porn is going to soon be hidden (as handy as that ability might be). The real question on many people’s minds is: what went wrong? If there’s obviously no such thing as ESP, how could a leading social psychologist publish an article containing a seemingly huge amount of evidence in favor of ESP in the leading social psychology journal, after being peer reviewed by four other psychologists? Or, to put it in more colloquial terms–what the fuck?

What the fuck?

Many critiques of Bem’s article have tried to dismiss it by searching for the smoking gun–the single critical methodological flaw that dooms the paper. For instance, one critique that’s been making the rounds, by Wagenmakers et al., argues that Bem should have done a Bayesian analysis, and that his failure to adjust his findings for the infinitesimally low prior probability of ESP (essentially, the strength of subjective belief against ESP) means that the evidence for ESP is vastly overestimated. I think these types of argument have a kernel of truth, but also suffer from some problems (for the record, I don’t really agree with the Wagenmakers critique, for reasons Andrew Gelman has articulated here). Having read the paper pretty closely twice, I really don’t think there’s any single overwhelming flaw in Bem’s paper (actually, in many ways, it’s a nice paper). Instead, there are a lot of little problems that collectively add up to produce a conclusion you just can’t really trust. Below is a decidedly non-exhaustive list of some of these problems. I’ll warn you now that, unless you care about methodological minutiae, you’ll probably find this very boring reading. But that’s kind of the point: attending to this stuff is so boring that we tend not to do it, with potentially serious consequences. Anyway:

  • Bem reports 9 different studies, which sounds (and is!) impressive. But a noteworthy feature of these studies is that they have grossly uneven sample sizes, ranging all the way from N = 50 to N = 200, in blocks of 50. As far as I can tell, no justification for these differences is provided anywhere in the article, which raises red flags, because the most common explanation for differing sample sizes–especially on this order of magnitude–is data peeking. That is, what often happens is that researchers periodically peek at their data, and halt data collection as soon as they obtain a statistically significant result. This may seem like a harmless little foible, but as I’ve discussed elsewhere, it’s actually a very bad thing, as it can substantially inflate Type I error rates (i.e., false positives). To his credit, Bem was at least being systematic about his data peeking, since his sample sizes always increase in increments of 50. But even in steps of 50, false positives can be grossly inflated. For instance, for a one-sample t-test, a researcher who peeks at her data in increments of 50 subjects and terminates data collection when a significant result is obtained (or N = 200, if no such result is obtained) can expect an actual Type I error rate of about 13%–nearly 3 times the nominal rate of 5%!
  • There’s some reason to think that the 9 experiments Bem reports weren’t necessarily designed as such; that is, they appear to have been ‘lumped’ or ‘split’ post hoc based on the results. For instance, Experiment 2 had 150 subjects, but the experimental design for the first 100 differed from the final 50 in several respects. They were minor respects, to be sure (e.g., pictures were presented randomly in one study, but in a fixed sequence in the other), but were still comparable in scope to those that differentiated Experiment 8 from Experiment 9 (which had the same sample size splits of 100 and 50, but were presented as two separate experiments). There’s no obvious reason why a researcher would plan to run 150 subjects up front, then decide to change the design after 100 subjects, and still call it the same study. A more plausible explanation is that Experiment 2 was actually supposed to be two separate experiments (a successful first experiment with N = 100 followed by an intended replication with N = 50) that were collapsed into one large study when the second experiment failed–preserving the statistically significant result in the full sample. Needless to say, this kind of lumping and splitting is liable to additionally inflate the false positive rate.
  • Most of Bem’s experiments allow for multiple plausible hypotheses, and it’s rarely clear why Bem would have chosen, up front, the hypotheses he presents in the paper. For instance, in Experiment 1, Bem finds that college students are able to predict the future location of erotic images that haven’t yet been presented (essentially a form of precognition), yet show no ability to predict the location of negative, positive, or romantic pictures. Bem’s explanation for this selective result is that “… such anticipation would be evolutionarily advantageous for reproduction and survival if the organism could act instrumentally to approach erotic stimuli …”. But this seems kind of silly on several levels. For one thing, it’s really hard to imagine that there’s an adaptive benefit to keeping an eye out for potential mates, but not for other potential positive signals (represented by non-erotic positive images). For another, it’s not like we’re talking about actual people or events here; we’re talking about digital images on an LCD. What Bem is effectively saying is that, somehow, someway, our ancestors evolved the extrasensory capacity to read digital bits from the future–but only pornographic ones. Not very compelling, and one could easily have come up with a similar explanation in the event that any of the other picture categories had selectively produced statistically significant results. Of course, if you get to test 4 or 5 different categories at p < .05, and pretend that you called it ahead of time, your false positive rate isn’t really 5%–it’s closer to 20%.
  • I say p < .05, but really, it’s more like p < .1, because the vast majority of tests Bem reports use one-tailed tests–effectively instantaneously doubling the false positive rate. There’s a long-standing debate in the literature, going back at least 60 years, as to whether it’s ever appropriate to use one-tailed tests, but even proponents of one-tailed tests will concede that you should only use them if you really truly have a directional hypothesis in mind before you look at your data. That seems exceedingly unlikely in this case, at least for many of the hypotheses Bem reports testing.
  • Nearly all of Bem’s statistically significant p values are very close to the critical threshold of .05. That’s usually a marker of selection bias, particularly given the aforementioned unevenness of sample sizes. When experiments are conducted in a principled way (i.e., with minimal selection bias or peeking), researchers will often get very low p values, since it’s very difficult to know up front exactly how large effect sizes will be. But in Bem’s 9 experiments, he almost invariably collects just enough subjects to detect a statistically significant effect. There are really only two explanations for that: either Bem is (consciously or unconsciously) deciding what his hypotheses are based on which results attain significance (which is not good), or he’s actually a master of ESP himself, and is able to peer into the future and identify the critical sample size he’ll need in each experiment (which is great, but unlikely).
  • Some of the correlational effects Bem reports–e.g., that people with high stimulus seeking scores are better at ESP–appear to be based on measures constructed post hoc. For instance, Bem uses a non-standard, two-item measure of boredom susceptibility, with no real justification provided for this unusual item selection, and no reporting of results for the presumably many other items and questionnaires that were administered alongside these items (except to parenthetically note that some measures produced non-significant results and hence weren’t reported). Again, the ability to select from among different questionnaires–and to construct custom questionnaires from different combinations of items–can easily inflate Type I error.
  • It’s not entirely clear how many studies Bem ran. In the Discussion section, he notes that he could “identify three sets of findings omitted from this report so far that should be mentioned lest they continue to languish in the file drawer”, but it’s not clear from the description that follows exactly how many studies these “three sets of findings” comprised (or how many ‘pilot’ experiments were involved). What we’d really like to know is the exact number of (a) experiments and (b) subjects Bem ran, without qualification, and including all putative pilot sessions.
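
For anyone who wants to check the arithmetic behind the first and third bullets, here’s a minimal sketch in Python (standard library only). To be clear, this isn’t the code behind the 13% figure quoted above; it’s an illustration under stated assumptions. The function names are made up, the data are pure noise from a standard normal distribution, the critical t values are hard-coded from a standard t table, and the tests on different picture categories are treated as independent.

```python
import math
import random

# Two-tailed critical t values at alpha = .05 for df = n - 1, hard-coded from a
# standard t table (assumption: the researcher peeks only at these four Ns).
T_CRIT = {50: 2.0096, 100: 1.9842, 150: 1.9760, 200: 1.9720}


def one_sample_t(total, total_sq, n):
    """t statistic for H0: mean = 0, computed from running sums of x and x**2."""
    mean = total / n
    var = (total_sq - n * mean * mean) / (n - 1)
    return mean / math.sqrt(var / n)


def peeking_false_positive_rate(n_sims=10000, seed=0):
    """Share of null experiments that reach significance at any of the peeks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = total_sq = 0.0
        n = 0
        for look in sorted(T_CRIT):
            # Collect subjects (pure noise, so H0 is true) up to the next peek.
            while n < look:
                x = rng.gauss(0.0, 1.0)
                total += x
                total_sq += x * x
                n += 1
            # Stop collecting data as soon as the test comes out significant.
            if abs(one_sample_t(total, total_sq, n)) > T_CRIT[look]:
                hits += 1
                break
    return hits / n_sims


def familywise_error_rate(alpha, k):
    """Chance of at least one false positive across k independent tests."""
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    print(f"peeking at N = 50/100/150/200: {peeking_false_positive_rate():.3f}")
    for k in (4, 5):
        print(f"{k} categories at alpha = .05: {familywise_error_rate(0.05, k):.3f}")
```

The peeking estimate will wobble a little from seed to seed, but it should land well above the nominal 5%. The second function is just 1 − (1 − α)^k, which is where a figure in the neighborhood of 20% for 4 or 5 categories comes from.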

It’s important to note that none of these concerns is really terrible individually. Sure, it’s bad to peek at your data, but data peeking alone probably isn’t going to produce 9 different false positives. Nor is using one-tailed tests, or constructing measures on the fly, etc. But when you combine data peeking, liberal thresholds, study recombination, flexible hypotheses, and selective measures, you have a perfect recipe for spurious results. And the fact that there are 9 different studies isn’t any guard against false positives when fudging is at work; if anything, it may make it easier to produce a seemingly consistent story, because reviewers and readers have a natural tendency to relax the standards for each individual experiment. So when Bem argues that “…across all nine experiments, Stouffer’s z = 6.66, p = 1.34 × 10⁻¹¹,” the claim that the cumulative p value is 1.34 × 10⁻¹¹ is close to meaningless. Combining p values that way would only be appropriate under the assumption that Bem conducted exactly 9 tests, and without any influence of selection bias. But that’s clearly not the case here.
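
To make the Stouffer point concrete, here’s a small sketch in Python of how that kind of combined z and p are computed. It’s an illustration rather than a reconstruction of Bem’s analysis: the function names are my own, the unweighted form of Stouffer’s method is assumed, and the nine z scores of 1.70 are hypothetical values chosen to sit just past the one-tailed .05 cutoff.

```python
import math


def stouffer_z(z_scores):
    """Unweighted Stouffer combination: sum of z scores over sqrt of their count."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def upper_tail_p(z):
    """One-tailed (upper tail) p value for a standard normal z score."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Tail probability for the combined z quoted in the paper.
    print(f"p for z = 6.66: {upper_tail_p(6.66):.2e}")

    # Hypothetical: nine studies that each barely clear one-tailed p < .05.
    marginal = [1.70] * 9
    z = stouffer_z(marginal)
    print(f"nine z = 1.70 studies: z = {z:.2f}, p = {upper_tail_p(z):.1e}")
```

The one-tailed p for z = 6.66 comes out on the order of 10⁻¹¹, in line with the quoted figure. More to the point, nine results that each only just squeak past the threshold combine into a tiny cumulative p, so if selection helped each one over the line, the combined number inherits the problem rather than fixing it.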

What would it take to make the results more convincing?

Admittedly, there are quite a few assumptions involved in the above analysis. I don’t know for a fact that Bem was peeking at his data; that just seems like a reasonable assumption given that no justification was provided anywhere for the use of uneven samples. It’s conceivable that Bem had perfectly good, totally principled, reasons for conducting the experiments exactly as he did. But if that’s the case, defusing these criticisms should be simple enough. All it would take for Bem to make me (and presumably many other people) feel much more comfortable with the results is an affirmation of the following statements:

  • That the sample sizes of the different experiments were determined a priori, and not based on data snooping;
  • That the distinction between pilot studies and ‘real’ studies was clearly defined up front–i.e., there weren’t any studies that started out as pilots but eventually ended up in the paper, or studies that were supposed to end up in the paper but that were disqualified as pilots based on the (lack of) results;
  • That there was a clear one-to-one mapping between intended studies and reported studies; i.e., Bem didn’t ‘lump’ together two different studies in cases where one produced no effect, or split one study into two in cases where different subsets of the data both showed an effect;
  • That the predictions reported in the paper were truly made a priori, and not on the basis of the results (e.g., that the hypothesis that sexually arousing stimuli would be the only ones to show an effect was actually written down in one of Bem’s notebooks somewhere);
  • That the various transformations applied to the RT and memory performance measures in some Experiments weren’t selected only after inspecting the raw, untransformed values and failing to identify significant results;
  • That the individual differences measures reported in the paper were selected a priori and not based on post-hoc inspection of the full pattern of correlations across studies;
  • That Bem didn’t run dozens of other statistical tests that failed to produce statistically significant results and hence weren’t reported in the paper.

Endorsing this list of statements (or perhaps a somewhat more complete version, as there are other concerns I didn’t mention here) would be sufficient to cast Bem’s results in an entirely new light, and I’d go so far as to say that I’d even be willing to suspend judgment on his conclusions pending additional data (which would be a big deal for me, since I don’t have a shred of a belief in ESP). But I confess that I’m not holding my breath, if only because I imagine that Bem would have already addressed these concerns in his paper if there were indeed principled justifications for the design choices in question.

It isn’t a bad paper

If you’ve read this far (why??), this might seem like a pretty damning review, and you might be thinking, boy, this is really a terrible paper. But I don’t think that’s true at all. In many ways, I think Bem’s actually been relatively careful. The thing to remember is that this type of fudging isn’t unusual; to the contrary, it’s rampant–everyone does it. And that’s because it’s very difficult, and often outright impossible, to avoid. The reality is that scientists are human, and like all humans, have a deep-seated tendency to work to confirm what they already believe. In Bem’s case, there are all sorts of reasons why someone who’s been working for the better part of a decade to demonstrate the existence of psychic phenomena isn’t necessarily the most objective judge of the relevant evidence. I don’t say that to impugn Bem’s motives in any way; I think the same is true of virtually all scientists–including myself. I’m pretty sure that if someone went over my own work with a fine-toothed comb, as I’ve gone over Bem’s above, they’d identify similar problems. Put differently, I don’t doubt that, despite my best efforts, I’ve reported some findings that aren’t true, because I wasn’t as careful as a completely disinterested observer would have been. That’s not to condone fudging, of course, but simply to recognize that it’s an inevitable reality in science, and it isn’t fair to hold Bem to a higher standard than we’d hold anyone else.

If you set aside the controversial nature of Bem’s research, and evaluate the quality of his paper purely on methodological grounds, I don’t think it’s any worse than the average paper published in JPSP, and actually probably better. For all of the concerns I raised above, there are many things Bem is careful to do that many other researchers don’t. For instance, he clearly makes at least a partial effort to avoid data peeking by collecting samples in increments of 50 subjects (I suspect he simply underestimated the degree to which Type I error rates can be inflated by peeking, even with steps that large); he corrects for multiple comparisons in many places (though not in some places where it matters); and he devotes an entire section of the discussion to considering the possibility that he might be inadvertently capitalizing on chance by falling prey to certain biases. Most studies–including most of those published in JPSP, the premier social psychology journal–don’t do any of these things, even though the underlying problems are just as applicable. So while you can confidently conclude that Bem’s article is wrong, I don’t think it’s fair to say that it’s a bad article–at least, not by the standards that currently hold in much of psychology.

Should the study have been published?

Interestingly, much of the scientific debate surrounding Bem’s article has actually had very little to do with the veracity of the reported findings, because the vast majority of scientists take it for granted that ESP is bunk. Much of the debate centers instead on whether the article should have ever been published in a journal as prestigious as JPSP (or any other peer-reviewed journal, for that matter). For the most part, I think the answer is yes. I don’t think it’s the place of editors and reviewers to reject a paper based solely on the desirability of its conclusions; if we take the scientific method–and the process of peer review–seriously, that commits us to occasionally (or even frequently) publishing work that we believe time will eventually prove wrong. The metrics I think reviewers should (and do) use are whether (a) the paper is as good as most of the papers that get published in the journal in question, and (b) the methods used live up to the standards of the field. I think that’s true in this case, so I don’t fault the editorial decision. Of course, it sucks to see something published that’s virtually certain to be false… but that’s the price we pay for doing science. As long as they play by the rules, we have to engage with even patently ridiculous views, because sometimes (though very rarely) it later turns out that those views weren’t so ridiculous after all.

That said, believing that it’s appropriate to publish Bem’s article given current publishing standards doesn’t preclude us from questioning those standards themselves. On a pretty basic level, the idea that Bem’s article might be par for the course, quality-wise, yet still be completely and utterly wrong, should surely raise some uncomfortable questions about whether psychology journals are getting the balance between scientific novelty and methodological rigor right. I think that’s a complicated issue, and I’m not going to try to tackle it here, though I will say that personally I do think that more stringent standards would be a good thing for psychology, on the whole. (It’s worth pointing out that the problem of (arguably) lax standards is hardly unique to psychology; as John Ioannidis has famously pointed out, most published findings in the biomedical sciences are false.)

Conclusion

The controversy surrounding the Bem paper is fascinating for many reasons, but it’s arguably most instructive in underscoring the central tension in scientific publishing between rapid discovery and innovation on the one hand, and methodological rigor and cautiousness on the other. Both values are important, but it’s important to recognize the tradeoff that pursuing either one implies. Many of the people who are now complaining that JPSP should never have published Bem’s article seem to overlook the fact that they’ve probably benefited themselves from the prevalence of the same relaxed standards (note that by ‘relaxed’ I don’t mean to suggest that journals like JPSP are non-selective about what they publish, just that methodological rigor is only one among many selection criteria–and often not the most important one). Conversely, maintaining editorial standards that would have precluded Bem’s article from being published would almost certainly also make it much more difficult to publish most other, much less controversial, findings. A world in which fewer spurious results are published is a world in which fewer studies are published, period. You can reasonably debate whether that would be a good or bad thing, but you can’t have it both ways. It’s wishful thinking to imagine that reviewers could somehow grow a magic truth-o-meter that applies lax standards to veridical findings and stringent ones to false positives.

From a bird’s eye view, there’s something undeniably strange about the idea that a well-respected, relatively careful researcher could publish an above-average article in a top psychology journal, yet have virtually everyone instantly recognize that the reported findings are totally, irredeemably false. You could read that as a sign that something’s gone horribly wrong somewhere in the machine; that the reviewers and editors of academic journals have fallen down and can’t get up, or that there’s something deeply flawed about the way scientists–or at least psychologists–practice their trade. But I think that’s wrong. I think we can look at it much more optimistically. We can actually see it as a testament to the success and self-corrective nature of the scientific enterprise that we actually allow articles that virtually nobody agrees with to get published. And that’s because, as scientists, we take seriously the possibility, however vanishingly small, that we might be wrong about even our strongest beliefs. Most of us don’t really believe that Cornell undergraduates have a sixth sense for future porn… but if they did, wouldn’t you want to know about it?

Bem, D. J. (2011). Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. Journal of Personality and Social Psychology.

what the arsenic effect means for scientific publishing

Thursday, December 9th, 2010

I don’t know very much about DNA (and by ‘not very much’ I sadly mean ‘next to nothing’), so when someone tells me that life as we know it generally doesn’t use arsenic to make DNA, and that it’s a big deal to find a bacterium that does, I’m willing to believe them. So too, apparently, are at least two or three reviewers for Science, which published a paper last week by a NASA group purporting to demonstrate exactly that.

Turns out the paper might have a few holes. In the last few days, the blogosphere has reached fever (or maybe delirium) pitch as critiques of the article have emerged from every corner; it seems like pretty much everyone with some knowledge of the science in question is unhappy about the paper. Since I’m not in any position to critique the article myself, I’ll take Carl Zimmer’s word for it in Slate yesterday:

Was this merely a case of a few isolated cranks? To find out, I reached out to a dozen experts on Monday. Almost unanimously, they think the NASA scientists have failed to make their case.  “It would be really cool if such a bug existed,” said San Diego State University’s Forest Rohwer, a microbiologist who looks for new species of bacteria and viruses in coral reefs. But, he added, “none of the arguments are very convincing on their own.” That was about as positive as the critics could get. “This paper should not have been published,” said Shelley Copley of the University of Colorado.

Zimmer then follows his Slate piece up with a blog post today in which he provides 13 experts’ unadulterated comments. While there are one or two (somewhat) positive reviews, the consensus clearly seems to be that the Science paper is (very) bad science.

Of course, scientists (yes, even Science reviewers) do occasionally make mistakes, so if we’re being charitable about it, we might chalk it up to human error (though some of the critiques suggest that these are elementary problems that could have been very easily addressed, so it’s possible there’s some disingenuousness involved). But what many bloggers (1, 2, 3, etc.) have found particularly inexcusable is the way NASA and the research team have handled the criticism. Zimmer again, in Slate:

I asked two of the authors of the study if they wanted to respond to the criticism of their paper. Both politely declined by email.

“We cannot indiscriminately wade into a media forum for debate at this time,” declared senior author Ronald Oremland of the U.S. Geological Survey. “If we are wrong, then other scientists should be motivated to reproduce our findings. If we are right (and I am strongly convinced that we are) our competitors will agree and help to advance our understanding of this phenomenon. I am eager for them to do so.”

“Any discourse will have to be peer-reviewed in the same manner as our paper was, and go through a vetting process so that all discussion is properly moderated,” wrote Felisa Wolfe-Simon of the NASA Astrobiology Institute. “The items you are presenting do not represent the proper way to engage in a scientific discourse and we will not respond in this manner.”

A NASA spokesperson basically reiterated this point of view, indicating that NASA scientists weren’t going to respond to criticism of their work unless that criticism appeared in, you know, a respectable, peer-reviewed outlet. (Fortunately, at least one of the critics already has a draft letter to Science up on her blog.)

I don’t think it’s surprising that people who spend much of their free time blogging about science, and think it’s important to discuss scientific issues in a public venue, generally aren’t going to like being told that science blogging isn’t a legitimate form of scientific discourse. Especially considering that the critics here aren’t laypeople without scientific training; they’re well-respected scientists with areas of expertise that are directly relevant to the paper. In this case, dismissing trenchant criticism because it’s on the web rather than in a peer-reviewed journal seems kind of like telling someone who’s screaming at you that your house is on fire that you’re not going to listen to them until they adopt a more polite tone. It just seems counterproductive.

That said, I personally don’t think we should take the NASA team’s statements at face value. I very much doubt that what the NASA researchers are saying really reflects any deep philosophical view about the role of blogs in scientific discourse; it’s much more likely that they’re simply trying to buy some time while they figure out how to respond. On the face of it, they have a choice between two lousy options: either ignore the criticism entirely, which would be antithetical to the scientific process and would look very bad, or address it head-on–which, judging by the vociferousness and near-unanimity of the commentators, is probably going to be a losing battle. Shifting the terms of the debate by insisting on responding only in a peer-reviewed venue doesn’t really change anything, but it does buy the authors two or three weeks. And two or three weeks is worth like, forty attentional cycles in the blogosphere.

Mind you, I’m not saying we should sympathize with the NASA researchers just because they’re in a tough position. I think one of the main reasons the story’s attracted so much attention is precisely because people see it as a case of justice being served. The NASA team called a major press conference ahead of the paper’s publication, published its results in one of the world’s most prestigious science journals, and yet apparently failed to run relatively basic experimental controls in support of its conclusions. If the critics are to be believed, the NASA researchers are either disingenuous or incompetent; either way, we shouldn’t feel sorry for them.

What I do think this episode shows is that the rules of scientific publishing have fundamentally changed in the last few years–and largely for the better. I haven’t been doing science for very long, but even in the halcyon days of 2003, when I started graduate school, science blogging was practically nonexistent, and the main way you’d find out what other people thought about an influential new paper was by talking to people you knew at conferences (which could take several months) or waiting for critiques or replication failures to emerge in other peer-reviewed journals (which could take years). That kind of delay between publication and evaluation is disastrous for science, because in the time it takes for a consensus to emerge that a paper is no good, several research teams might have already started trying to replicate and extend the reported findings, and several dozen other researchers might have uncritically cited their paper peripherally in their own work. This delay is probably why, as John Ioannidis’ work so elegantly demonstrates, major studies published in high-impact journals tend to exert a disproportionate influence on the literature long after they’ve been resoundingly discredited.

The Arsenic Effect, if we can call it that, provides a nice illustration of the impact of new media on scientific communication. It’s a safe bet that there are now very few people who do anything even vaguely related to the NASA team’s research who haven’t been made aware that the reported findings are controversial. Which means that the process of attempting to replicate (or falsify) the findings will proceed much more quickly than it might have ten or twenty years ago, and there probably won’t be very many people who cite the Science paper as compelling evidence of terrestrial arsenic-based life. Perhaps more importantly, as researchers get used to the idea that their high-profile work is going to be instantly evaluated by thousands of pairs of highly trained eyes, any of which might be attached to a highly prolific pair of typing hands, there will be an increasingly strong incentive to avoid being careless. That isn’t to say that bad science will disappear, of course; just that, in cases where the badness reflects a pressure to tell a good story at all costs, we’ll probably see less of it.

what the Dunning-Kruger effect is and isn’t

Wednesday, July 7th, 2010

If you regularly read cognitive science or psychology blogs (or even just the lowly New York Times!), you’ve probably heard of something called the Dunning-Kruger effect. The Dunning-Kruger effect refers to the seemingly pervasive tendency of poor performers to overestimate their abilities relative to other people–and, to a lesser extent, for high performers to underestimate their abilities. The explanation for this, according to Kruger and Dunning, who first reported the effect in an extremely influential 1999 article in the Journal of Personality and Social Psychology, is that incompetent people lack the skills they’d need in order to be able to distinguish good performers from bad performers:

…people who lack the knowledge or wisdom to perform well are often unaware of this fact. We attribute this lack of awareness to a deficit in metacognitive skill. That is, the same incompetence that leads them to make wrong choices also deprives them of the savvy necessary to recognize competence, be it their own or anyone else’s.

For reasons I’m not really clear on, the Dunning-Kruger effect seems to be experiencing something of a renaissance over the past few months; it’s everywhere in the blogosphere and media. For instance, here are just a few alleged Dunning-Krugerisms from the past few weeks:

So what does this mean in business? Well, it’s all over the place. Even the title of Dunning and Kruger’s paper, the part about inflated self-assessments, reminds me of a truism that was pointed out by a supervisor early in my career: The best employees will invariably be the hardest on themselves in self-evaluations, while the lowest performers can be counted on to think they are doing excellent work…

Heidi Montag and Spencer Pratt are great examples of the Dunning-Kruger effect. A whole industry of assholes are making a living off of encouraging two attractive yet untalented people they are actually genius auteurs. The bubble around them is so thick, they may never escape it. At this point, all of America (at least those who know who they are), is in on the joke – yet the two people in the center of this tragedy are completely unaware…

Not so fast there — the Dunning-Kruger effect comes into play here. People in the United States do not have a high level of understanding of evolution, and this survey did not measure actual competence. I’ve found that the people most likely to declare that they have a thorough knowledge of evolution are the creationists…but that a brief conversation is always sufficient to discover that all they’ve really got is a confused welter of misinformation…

As you can see, the findings reported by Kruger and Dunning are often interpreted to suggest that the less competent people are, the more competent they think they are. People who perform worst at a task tend to think they’re god’s gift to said task, and the people who can actually do said task often display excessive modesty. I suspect we find this sort of explanation compelling because it appeals to our implicit just-world theories: we’d like to believe that people who obnoxiously proclaim their excellence at X, Y, and Z must really not be so very good at X, Y, and Z at all, and must be (over)compensating for some actual deficiency; it’s much less pleasant to imagine that people who go around shoving their (alleged) superiority in our faces might really be better than us at what they do.

Unfortunately, Kruger and Dunning never actually provided any support for this type of just-world view; their studies categorically didn’t show that incompetent people are more confident or arrogant than competent people. What they did show is this:

This is one of the key figures from Kruger and Dunning’s 1999 paper (and the basic effect has been replicated many times since). The critical point to note is that there’s a clear positive correlation between actual performance (gray line) and perceived performance (black line): the people in the top quartile for actual performance think they perform better than the people in the second quartile, who in turn think they perform better than the people in the third quartile, and so on. So the bias is definitively not that incompetent people think they’re better than competent people. Rather, it’s that incompetent people think they’re much better than they actually are. But they typically still don’t think they’re quite as good as people who, you know, actually are good. (It’s important to note that Dunning and Kruger never claimed to show that the unskilled think they’re better than the skilled; that’s just the way the finding is often interpreted by others.)

That said, it’s clear that there is a very large discrepancy between the way incompetent people actually perform and the way they perceive their own performance level, whereas the discrepancy is much smaller for highly competent individuals. So the big question is why. Kruger and Dunning’s explanation, as I mentioned above, is that incompetent people lack the skills they’d need in order to know they’re incompetent. For example, if you’re not very good at learning languages, it might be hard for you to tell that you’re not very good, because the very skills that you’d need in order to distinguish someone who’s good from someone who’s not are the ones you lack. If you can’t hear the distinction between two different phonemes, how could you ever know who has native-like pronunciation ability and who doesn’t? If you don’t understand very many words in another language, how can you evaluate the size of your own vocabulary in relation to other people’s?

This appeal to people’s meta-cognitive abilities (i.e., their knowledge about their knowledge) has some intuitive plausibility, and Kruger, Dunning and their colleagues have provided quite a bit of evidence for it over the past decade. That said, it’s by no means the only explanation around; over the past few years, a fairly sizeable literature criticizing or extending Kruger and Dunning’s work has developed. I’ll mention just three plausible (and mutually compatible) alternative accounts people have proposed (but there are others!):

1. Regression toward the mean. Probably the most common criticism of the Dunning-Kruger effect is that it simply reflects regression to the mean–that is, it’s a statistical artifact. Regression to the mean refers to the fact that any time you select a group of individuals based on some criterion, and then measure the standing of those individuals on some other dimension, performance levels will tend to shift (or regress) toward the mean level. It’s a notoriously underappreciated problem, and probably explains many, many phenomena that people have tried to interpret substantively. For instance, in placebo-controlled clinical trials of SSRIs, depressed people tend to get better in both the drug and placebo conditions. Some of this is undoubtedly due to the placebo effect, but much of it is probably also due to what’s often referred to as “natural history”. Depression, like most things, tends to be cyclical: people get better or worse over time, often for no apparent rhyme or reason. But since people tend to seek help (and sign up for drug trials) primarily when they’re doing particularly badly, it follows that most people would get better to some extent even without any treatment. That’s regression to the mean (the Wikipedia entry has other nice examples–for example, the famous Sports Illustrated Cover Jinx).

In the context of the Dunning-Kruger effect, the argument is that incompetent people simply regress toward the mean when you ask them to evaluate their own performance. Since perceived performance is influenced not only by actual performance, but also by many other factors (e.g., one’s personality, meta-cognitive ability, measurement error, etc.), it follows that, on average, people with extreme levels of actual performance won’t be quite as extreme in terms of their perception of their performance. So, much of the Dunning-Kruger effect arguably doesn’t need to be explained at all, and in fact, it would be quite surprising if you didn’t see a pattern of results that looks at least somewhat like the figure above.
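
If you want to see the artifact for yourself, here’s a minimal simulation sketch. Everything in it is an assumption made for illustration rather than anything taken from the studies: perceived standing is modeled as a noisy linear function of actual standing, the correlation of 0.3 is arbitrary, and the function name is made up.

```python
import random
from statistics import NormalDist, mean

def quartile_means(n=20000, r=0.3, seed=0):
    """Mean actual and perceived percentile within each actual-performance quartile.

    Hypothetical model: perceived standing = r * actual standing + noise,
    with both on a standard-normal scale before conversion to percentiles.
    """
    rng = random.Random(seed)
    to_percentile = NormalDist().cdf
    noise_sd = (1 - r ** 2) ** 0.5  # keeps perceived scores at unit variance
    people = []
    for _ in range(n):
        actual = rng.gauss(0, 1)
        perceived = r * actual + rng.gauss(0, noise_sd)
        people.append((100 * to_percentile(actual), 100 * to_percentile(perceived)))
    people.sort()  # order by actual percentile
    size = n // 4
    quartiles = [people[i * size:(i + 1) * size] for i in range(4)]
    return [(mean(a for a, _ in q), mean(p for _, p in q)) for q in quartiles]

for actual, perceived in quartile_means():
    print(f"actual {actual:5.1f}  perceived {perceived:5.1f}")
```

Under these assumptions the bottom quartile overestimates itself and the top quartile underestimates itself by about the same amount, while perceived standing still rises from quartile to quartile. Nobody in the simulation has any special meta-cognitive deficit.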

2. Regression to the mean plus better-than-average. Having said that, it’s clear that regression to the mean can’t explain everything about the Dunning-Kruger effect. One problem is that it doesn’t explain why the effect is greater at the low end than at the high end. That is, incompetent people tend to overestimate their performance to a much greater extent than competent people underestimate their performance. This asymmetry can’t be explained solely by regression to the mean. It can, however, be explained by a combination of RTM and a “better-than-average” (or self-enhancement) heuristic which says that, in general, most people have a tendency to view themselves excessively positively. This two-pronged explanation was proposed by Krueger and Mueller in a 2002 study (note that Krueger and Kruger are different people!), who argued that poor performers suffer from a double whammy: not only do their perceptions of their own performance regress toward the mean, but those perceptions are also further inflated by the self-enhancement bias. In contrast, for high performers, these two effects largely balance each other out: regression to the mean causes high performers to underestimate their performance, but to some extent that underestimation is offset by the self-enhancement bias. As a result, it looks as though high performers make more accurate judgments than low performers, when in reality the high performers are just lucky to be where they are in the distribution.
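
The double whammy is easy to sketch in the same spirit. The code below is my own toy model, not Krueger and Mueller’s analysis: it assumes perceived standing is a noisy linear function of actual standing, and adds a constant `bias` term as a stand-in for the better-than-average heuristic. The correlation and the bias value are arbitrary.

```python
import random
from statistics import NormalDist, mean

def calibration_gaps(bias, n=20000, r=0.3, seed=1):
    """Perceived-minus-actual percentile gap for the bottom and top quartiles.

    Hypothetical model: perceived = r * actual + bias + noise, where a positive
    `bias` stands in for the better-than-average (self-enhancement) heuristic.
    """
    rng = random.Random(seed)
    to_percentile = NormalDist().cdf
    noise_sd = (1 - r ** 2) ** 0.5
    people = []
    for _ in range(n):
        actual = rng.gauss(0, 1)
        perceived = r * actual + bias + rng.gauss(0, noise_sd)
        people.append((100 * to_percentile(actual), 100 * to_percentile(perceived)))
    people.sort()  # order by actual percentile
    size = n // 4
    bottom, top = people[:size], people[-size:]

    def gap(group):
        return mean(p - a for a, p in group)

    return gap(bottom), gap(top)

print("no bias:  ", calibration_gaps(bias=0.0))
print("with bias:", calibration_gaps(bias=0.5))
```

With no bias the two gaps are roughly mirror images of each other. With a positive bias, the bottom quartile’s overestimate grows and the top quartile’s underestimate shrinks, which is the asymmetry described above.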

3. The instrumental role of task difficulty. Consistent with the notion that the Dunning-Kruger effect is at least partly a statistical artifact, some studies have shown that the asymmetry reported by Kruger and Dunning (i.e., the smaller discrepancy for high performers than for low performers) actually goes away, and even reverses, when the ability tests given to participants are very difficult. For instance, Burson and colleagues (2006), writing in JPSP, showed that when University of Chicago undergraduates were asked moderately difficult trivia questions about their university, the subjects who performed best were just as poorly calibrated as the people who performed worst, in the sense that their estimates of how well they did relative to other people were wildly inaccurate. Here’s what that looks like:

Notice that this finding wasn’t anomalous with respect to the Kruger and Dunning findings; when participants were given easier trivia (the diamond-studded line), Burson et al observed the standard pattern, with poor performers seemingly showing worse calibration. Simply knocking about 10% off the accuracy rate on the trivia questions was enough to induce a large shift in the relative mismatch between perceptions of ability and actual ability. Burson et al then went on to replicate this pattern in two additional studies involving a number of different judgments and tasks, so this result isn’t specific to trivia questions. In fact, in the later studies, Burson et al showed that when the task was really difficult, poor performers were actually considerably better calibrated than high performers.

Looking at the figure above, it’s not hard to see why this would be. Since the slope of the line tends to be pretty constant in these types of experiments, lowering the mean level of perceived performance (i.e., shifting the line’s intercept down on the y-axis) will necessarily result in a larger difference between actual and perceived performance at the high end. Conversely, if you raise the line, you maximize the difference between actual and perceived performance at the lower end.

To get an intuitive sense of what’s happening here, just think of it this way: if you’re performing a very difficult task, you’re probably going to find the experience subjectively demanding even if you’re at the high end relative to other people. Since people’s judgments about their own relative standing depend to a substantial extent on their subjective perception of their own performance (i.e., you use your sense of how easy a task was as a proxy of how good you must be at it), high performers are going to end up systematically underestimating how well they did. When a task is difficult, most people assume they must have done relatively poorly compared to other people. Conversely, when a task is relatively easy (and the tasks Dunning and Kruger studied were on the easier side), most people assume they must be pretty good compared to others. As a result, it’s going to look like the people who perform well are well-calibrated when the task is easy and poorly-calibrated when the task is difficult; less competent people are going to show exactly the opposite pattern. And note that this doesn’t require us to assume any relationship between actual performance and perceived performance. You would expect to get the Dunning-Kruger effect for easy tasks even if there was exactly zero correlation between how good people actually are at something and how good they think they are.
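
The zero-correlation case can be worked through with simple arithmetic. In this sketch the numbers are invented for illustration: everyone is assumed to rate themselves at the 65th percentile on an easy task and at the 35th on a hard one, and `slope` is a hypothetical knob for how strongly self-ratings track actual standing (0.0 means not at all).

```python
# Mean actual percentile of each performance quartile (bottom to top).
QUARTILE_MIDPOINTS = [12.5, 37.5, 62.5, 87.5]

def gaps(mean_perceived, slope=0.0):
    """Perceived-minus-actual percentile for each quartile.

    Hypothetical linear model: perceived = mean_perceived + slope * (actual - 50).
    slope=0.0 means self-assessments carry no information about actual standing.
    """
    return [mean_perceived + slope * (actual - 50) - actual
            for actual in QUARTILE_MIDPOINTS]

easy = gaps(mean_perceived=65)  # easy task: everyone feels above average
hard = gaps(mean_perceived=35)  # hard task: everyone feels below average
print(easy)  # [52.5, 27.5, 2.5, -22.5]
print(hard)  # [22.5, -2.5, -27.5, -52.5]
```

On the easy task the bottom quartile is off by 52.5 points and the top quartile by only 22.5; on the hard task the pattern flips. Moving the line up or down is all it takes to change who looks better calibrated.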

Here’s how Burson et al summarized their findings:

Our studies replicate, eliminate, or reverse the association between task performance and judgment accuracy reported by Kruger and Dunning (1999) as a function of task difficulty. On easy tasks, where there is a positive bias, the best performers are also the most accurate in estimating their standing, but on difficult tasks, where there is a negative bias, the worst performers are the most accurate. This pattern is consistent with a combination of noisy estimates and overall bias, with no need to invoke differences in metacognitive abilities. In this regard, our findings support Krueger and Mueller’s (2002) reinterpretation of Kruger and Dunning’s (1999) findings. An association between task-related skills and metacognitive insight may indeed exist, and later we offer some suggestions for ways to test for it. However, our analyses indicate that the primary drivers of errors in judging relative standing are general inaccuracy and overall biases tied to task difficulty. Thus, it is important to know more about those sources of error in order to better understand and ameliorate them.

What should we conclude from these (and other) studies? I think the jury’s still out to some extent, but at minimum, I think it’s clear that much of the Dunning-Kruger effect reflects either statistical artifact (regression to the mean), or much more general cognitive biases (the tendency to self-enhance and/or to use one’s subjective experience as a guide to one’s standing in relation to others). This doesn’t mean that the meta-cognitive explanation preferred by Dunning, Kruger and colleagues can’t hold in some situations; it very well may be that in some cases, and to some extent, people’s lack of skill is really what prevents them from accurately determining their standing in relation to others. But I think our default position should be to prefer the alternative explanations I’ve discussed above, because they’re (a) simpler, (b) more general (they explain lots of other phenomena), and (c) necessary (frankly, it’d be amazing if regression to the mean didn’t explain at least part of the effect!).

We should also try to be aware of another very powerful cognitive bias whenever we use the Dunning-Kruger effect to explain the people or situations around us–namely, confirmation bias. If you believe that incompetent people don’t know enough to know they’re incompetent, it’s not hard to find anecdotal evidence for that; after all, we all know people who are both arrogant and not very good at what they do. But if you stop to look for it, it’s probably also not hard to find disconfirming evidence. After all, there are clearly plenty of people who are good at what they do, but not nearly as good as they think they are (i.e., they’re above average, and still totally miscalibrated in the positive direction). Just like there are plenty of people who are lousy at what they do and recognize their limitations (e.g., I don’t need to be a great runner in order to be able to tell that I’m not a great runner–I’m perfectly well aware that I have terrible endurance, precisely because I can’t finish runs that most other runners find trivial!). But the plural of anecdote is not data, and the data appear to be equivocal. Next time you’re inclined to chalk your obnoxious co-worker’s delusions of grandeur up to the Dunning-Kruger effect, consider the possibility that your co-worker’s simply a jerk–no meta-cognitive incompetence necessary.

Kruger J, & Dunning D (1999). Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of personality and social psychology, 77 (6), 1121-34 PMID: 10626367
Krueger J, & Mueller RA (2002). Unskilled, unaware, or both? The better-than-average heuristic and statistical regression predict errors in estimates of own performance. Journal of personality and social psychology, 82 (2), 180-8 PMID: 11831408
Burson KA, Larrick RP, & Klayman J (2006). Skilled or unskilled, but still unaware of it: how perceptions of difficulty drive miscalibration in relative comparisons. Journal of personality and social psychology, 90 (1), 60-77 PMID: 16448310

will trade two Methods sections for twenty-two subjects worth of data

Thursday, July 1st, 2010

The excellent and ever-candid Candid Engineer in Academia has an interesting post discussing the love-hate relationship many scientists who work in wet labs have with benchwork. She compares two very different perspectives:

She [a current student] then went on to say that, despite wanting to go to grad school, she is pretty sure she doesn’t want to continue in academia beyond the Ph.D. because she just loves doing the science so much and she can’t imagine ever not being at the bench.

Being young and into the benchwork, I remember once asking my grad advisor if he missed doing experiments. His response: “Hell no.” I didn’t understand it at the time, but now I do. So I wonder if my student will always feel the way she does now- possessing of that unbridled passion for the pipet, that unquenchable thirst for the cell culture hood.

Wet labs are pretty much nonexistent in psychology–I’ve never had to put on gloves or goggles to do anything that I’d consider an “experiment”, and I’ve certainly never run the risk of spilling dangerous chemicals all over myself–so I have no opinion at all about benchwork. Maybe I’d love it, maybe I’d hate it; I couldn’t tell you. But Candid Engineer’s post did get me thinking about opinions surrounding the psychological equivalent of benchwork–namely, collecting data from human subjects. My sense is that there’s somewhat more consensus among psychologists, in that most of us don’t seem to like data collection very much. But there are plenty of exceptions, and there certainly are strong feelings on both sides.

More generally, I’m perpetually amazed at the wide range of opinions people can hold about the various elements of scientific research, even when the people doing the different-opinion-holding all work in very similar domains. For instance, my favorite aspect of the research I do, hands down, is data analysis. I’d be ecstatic if I could analyze data all day and never have to worry about actually communicating the results to anyone (though I enjoy doing that too). After that, there are activities like writing and software development, which I spend a lot of time doing, and occasionally enjoy, but also frequently find very frustrating. And then, at the other end, there are aspects of research that I find have little redeeming value save for their instrumental value in supporting other, more pleasant, activities–nasty, evil activities like writing IRB proposals and, yes, collecting data.

To me, collecting data is something you do because you’re fundamentally interested in some deep (or maybe not so deep) question about how the mind works, and the only way to get an answer is to actually interrogate people while they do stuff in a controlled environment. It isn’t something I do for fun. Yet I know people who genuinely seem to love collecting data–or, for that matter, writing Methods sections or designing new experiments–even as they loathe perfectly pleasant activities like, say, sitting down to analyze the data they’ve collected, or writing a few lines of code that could save them hours’ worth of manual data entry. On a personal level, I find this almost incomprehensible: how could anyone possibly enjoy collecting data more than actually crunching the numbers and learning new things? But I know these people exist, because I’ve talked to them. And I recognize that, from their perspective, I’m the guy with the strange views. They’re sitting there thinking: what kind of joker actually likes to turn his data inside out several dozen times? What’s wrong with just running a simple t-test and writing up the results as fast as possible, so you can get back to the pleasure of designing and running new experiments?

This of course leads us directly to the care bears fucking tea party moment where I tell you how wonderful it is that we all have these different likes and dislikes. I’m not being sarcastic; it really is great. Ultimately, it works to everyone’s advantage that we enjoy different things, because it means we get to collaborate on projects and take advantage of complementary strengths and interests, instead of all having to fight over who gets to write the same part of the Methods section. It’s good that there are some people who love benchwork and some people who hate it, and it’s good that there are people who’re happy to write software that other people who hate writing software can use. We don’t all have to pretend we understand each other; it’s enough just to nod and smile and say “but of course you can write the Methods for that paper; I really don’t mind. And yes, I guess I can run some additional analyses for you, really, it’s not too much trouble at all.”

a possible link between pesticides and ADHD

Tuesday, May 18th, 2010

A forthcoming article in the journal Pediatrics that’s been getting a lot of press attention suggests that exposure to common pesticides may be associated with a substantially elevated risk of ADHD. More precisely, what the study found was that elevated urinary concentrations of organophosphate metabolites were associated with an increased likelihood of meeting criteria for an ADHD diagnosis. One of the nice things about this study is that the authors used archival data from the (very large) National Health and Nutrition Examination Survey (NHANES), so they were able to control for a relatively broad range of potential confounds (e.g., gender, age, SES, etc.). The primary finding is, of course, still based on observational data, so you wouldn’t necessarily want to conclude that exposure to pesticides causes ADHD. But it’s a finding that converges with previous work in animal models demonstrating that high exposure to organophosphate pesticides causes neurodevelopmental changes, so it’s by no means a crazy hypothesis.

I think it’s really pleasantly surprising to see how responsibly the popular press has covered this story (e.g., this, this, and this). Despite the obvious potential for alarmism, very few articles have led with a headline implying a causal link between pesticides and ADHD. They all say things like “associated with”, “tied to”, or “linked to”, which is exactly right. And many even explicitly mention the size of the effect in question–namely, approximately a 50% increase in risk of ADHD per 10-fold increase in concentration of pesticide metabolites. Given that most of the articles contain cautionary quotes from the study’s authors, I’m guessing the authors really emphasized the study’s limitations when dealing with the press, which is great. In any case, because the basic details of the study have already been amply described elsewhere (I thought this short CBS article was particularly good), I’ll just mention a few random thoughts here:

  • Often, epidemiological studies suffer from a gaping flaw in the sense that the more interesting causal story (and the one that prompts media attention) is far less plausible than other potential explanations (a nice example of this is the recent work on the social contagion of everything from obesity to loneliness). That doesn’t seem to be the case here. Obviously, there are plenty of other reasons you might get a correlation between pesticide metabolites and ADHD risk–for instance, ADHD is substantially heritable, so it could be that parents with a disposition to ADHD also have systematically different dietary habits (i.e., parental dispositions are a common cause of both urinary metabolites and ADHD status in children). But given the aforementioned experimental evidence, it’s not obvious that alternative explanations for the correlation are much more plausible than the causal story linking pesticide exposure to ADHD, so in that sense this is potentially a very important finding.
  • The use of a dichotomous dependent variable (i.e., children either meet criteria for ADHD or don’t; there are no shades of ADHD gray here) is a real problem in this kind of study, because it can make the resulting effects seem deceptively large. The intuitive way we think about the members of a category is to think in terms of prototypes, so that when you think about “ADHD” and “Not-ADHD” categories, you’re probably mentally representing an extremely hyperactive, inattentive child for the former, and a quiet, conscientious kid for the latter. If that’s your mental model, and someone comes along and tells you that pesticide exposure increases the risk of ADHD by 50%, you’re understandably going to freak out, because it’ll seem quite natural to interpret that as a statement that pesticides have a 50% chance of turning average kids into hyperactive ones. But that’s not the right way to think about it. In all likelihood, pesticides aren’t causing a small proportion of kids to go from perfectly average to completely hyperactive; instead, what’s probably happening is that the entire distribution is shifting over slightly. In other words, most kids who are exposed to pesticides (if we assume for the sake of argument that there really is a causal link) are becoming slightly more hyperactive and/or inattentive.
  • Put differently, what happens when you have a strict cut-off for diagnosis is that even small increases in underlying symptoms can result in a qualitative shift in category membership. If ADHD symptoms were measured on a continuous scale (which they actually probably were, before being dichotomized to make things simple and more consistent with previous work), these findings might have been reported as something like “a ten-fold increase in pesticide exposure is associated with a 2-point increase on a 30-point symptom scale,” which would have made it much clearer that, at worst, pesticides are only one of many contributing factors to ADHD, and almost certainly not nearly as big a factor as some others. That’s not to say we shouldn’t be concerned if subsequent work supports a causal link, but just that we should retain perspective on what’s involved. No one’s suggesting that you’re going to feed your child an unwashed pear or two and end up with a prescription for Ritalin; the more accurate view would be that you might have a minority of kids who are already at risk for ADHD, and this would be just one more precipitating factor.
  • It’s also worth keeping in mind that the relatively large increase in ADHD risk is associated with a ten-fold increase in pesticide metabolites. As the authors note, that corresponds to the difference between the 25th and 75th percentiles in the sample. Although we don’t know exactly what that means in terms of real-world exposure to pesticides (because the authors didn’t have any data on grocery shopping or eating habits), it’s almost certainly a very sizable difference (I won’t get into the reasons why, except to note that the rank-order of pesticide metabolites must be relatively stable among children, or else there wouldn’t be any association with a temporally-extended phenotype like ADHD). So the point is, it’s probably not so easy to go from the 25th to the 75th percentile just by eating a few more fruits and vegetables here and there. So while it’s certainly advisable to try and eat better, and potentially to buy organic produce (if you can afford it), you shouldn’t assume that you can halve your child’s risk of ADHD simply by changing his or her diet slightly. These are, at the end of the day, small effects.
  • The authors report that fully 12% of children in this nationally representative sample met criteria for ADHD (mostly of the inattentive subtype). This, frankly, says a lot more about how silly the diagnostic criteria for ADHD are than about the state of the nation’s children. It’s just not plausible to suppose that 1 in 8 children really suffers from what is, in theory at least, a severe, potentially disabling disorder. I’m not trying to trivialize ADHD or argue that there’s no such thing, but simply to point out the dangers of medicalization. Once you’ve reached the point where 1 in every 8 people meets criteria for a serious disorder, the label is in danger of losing all meaning.
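
To make the dichotomization point concrete, here is a minimal sketch of what a fixed diagnostic cutoff does to a small shift in the whole symptom distribution. It rests on assumptions that are mine, not the paper's: symptoms are treated as normally distributed, the cutoff is placed so that 12% of unexposed children exceed it (the prevalence reported above), and the shift sizes are purely illustrative. The function name `fraction_above_cutoff` is made up for this example.

```python
from statistics import NormalDist


def fraction_above_cutoff(mean_shift_sd, baseline_rate=0.12):
    """Fraction of children above a fixed diagnostic cutoff after the whole
    symptom distribution shifts by `mean_shift_sd` standard deviations.

    Assumption: symptom scores are normal with unit variance (illustrative only).
    """
    standard_normal = NormalDist(0.0, 1.0)
    # Place the cutoff so that `baseline_rate` of the unshifted population exceeds it.
    cutoff = standard_normal.inv_cdf(1.0 - baseline_rate)
    # Everyone moves up by the same small amount; the cutoff stays where it was.
    shifted = NormalDist(mean_shift_sd, 1.0)
    return 1.0 - shifted.cdf(cutoff)


if __name__ == "__main__":
    baseline = fraction_above_cutoff(0.0)
    for shift in (0.1, 0.2, 0.3):  # hypothetical shift sizes, in SD units
        rate = fraction_above_cutoff(shift)
        print(f"shift of {shift:.1f} SD: {rate:.1%} above cutoff "
              f"({rate / baseline - 1:+.0%} relative to baseline)")
```

Under these assumptions, nudging every child up by a fifth of a standard deviation raises the fraction above the cutoff by more than a third, even though no individual child has changed much. That is the sense in which a large-sounding relative risk can coexist with a small underlying effect.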

Bouchard, M., Bellinger, D., Wright, R., & Weisskopf, M. (2010). Attention-Deficit/Hyperactivity Disorder and Urinary Metabolites of Organophosphate Pesticides. Pediatrics. DOI: 10.1542/peds.2009-3058

in defense of three of my favorite sayings

Friday, May 14th, 2010

Seth Roberts takes issue with three popular maxims that (he argues) people use “to push away data that contradicts this or that approved view of the world”. He terms this preventive stupidity. I’m a frequent user of all three sayings, so I suppose that might make me preventively stupid; but I do feel like I have good reasons for using these sayings, and I confess to not really seeing Roberts’ point.

Here’s what Roberts has to say about the three sayings in question:

1. Absence of evidence is not evidence of absence. Øyhus explains why this is wrong. That such an Orwellian saying is popular in discussions of data suggests there are many ways we push away inconvenient data.

In my own experience, by far the biggest reason this saying is popular in discussions of data (and the primary reason I use it when reviewing papers) is that many people have a very strong tendency to interpret null results as an absence of any meaningful effect. That’s a very big problem, because the majority of studies in psychology tend to have relatively little power to detect small to moderate-sized effects. For instance, as I’ve discussed here, most whole-brain analyses in typical fMRI samples (of say, 15 – 20 subjects) have very little power to detect anything but massive effects. And yet people routinely interpret a failure to detect hypothesized effects as an indication that they must not exist at all. The simplest and most direct counter to this type of mistake is to note that one shouldn’t accept the null hypothesis unless one has very good reasons to think that power is very high and effect size estimates are consequently quite accurate. Which is just another way of saying that absence of evidence is not evidence of absence.
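
For a rough sense of the numbers, here is a back-of-the-envelope power calculation. It is only a sketch under stated assumptions: it uses a normal approximation to a one-sample test rather than an exact t-test (so it slightly overstates power at small n), and the effect size and alpha levels in the example are illustrative choices of mine, not values from any particular study. The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size_d, n_subjects, alpha):
    """Approximate two-sided power of a one-sample test.

    Uses a normal approximation; an exact t-test with small n would have
    somewhat lower power than this returns.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the test statistic when the true effect is d.
    noncentrality = effect_size_d * sqrt(n_subjects)
    # Probability that the statistic lands beyond either critical value.
    upper_tail = 1.0 - z.cdf(z_crit - noncentrality)
    lower_tail = z.cdf(-z_crit - noncentrality)
    return upper_tail + lower_tail


if __name__ == "__main__":
    # d = 0.5 is a conventional "medium" effect; alpha = 0.001 stands in for a
    # stringent whole-brain threshold (both are assumptions for illustration).
    for alpha in (0.05, 0.001):
        print(f"n=20, d=0.5, alpha={alpha}: power ~ {approx_power(0.5, 20, alpha):.2f}")
```

With 20 subjects and a medium-sized true effect, the approximation gives power of roughly 0.6 at alpha = 0.05 and well under 0.2 at alpha = 0.001. In other words, under these assumptions a null result is what you should expect most of the time even when the effect is real.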

2. Correlation does not equal causation. In practice, this is used to mean that correlation is not evidence for causation. At UC Berkeley, a job candidate for a faculty position in psychology said this to me. I said, “Isn’t zero correlation evidence against causation?” She looked puzzled.

Again, Roberts’ experience clearly differs from mine; I’ve far more often seen this saying used as a way of suggesting that a researcher may be drawing overly strong causal conclusions from the data, not as a way of simply dismissing a correlation outright. A good example of this is found in the developmental literature, where many researchers have observed strong correlations between parents’ behavior and their children’s subsequent behavior. It is, of course, quite plausible to suppose that parenting behavior exerts a direct causal influence on children’s behavior, so that the children of negligent or abusive parents are more likely to exhibit delinquent behavior and grow up to perpetuate the “cycle of violence”. But this line of reasoning is substantially weakened by behavioral genetic studies indicating that very little of the correlation between parents’ and children’s personalities is explained by shared environmental factors, and that the vast majority reflects heritable influences and/or unique environmental influences. Given such findings, it’s a perfectly appropriate rebuttal to much of the developmental literature to note that correlation doesn’t imply causation.

It’s also worth pointing out that the anecdote Roberts provides isn’t exactly a refutation of the maxim; it’s actually an affirmation of the consequent. The fact that an absence of any correlation could potentially be strong evidence against causation (under the right circumstances) doesn’t mean that the presence of a correlation is strong evidence for causation. It may or may not be, but that’s something to be weighed on a case-by-case basis. There certainly are plenty of cases where it’s perfectly appropriate (and even called for) to remind someone that correlation doesn’t imply causation.
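
The parenting example can be illustrated with a toy simulation. Everything here is an assumption made for illustration: a single shared heritable factor drives both parent and child behavior, noise is Gaussian with unit variance, and parenting has no direct effect on the child at all. The names `simulate_families` and `pearson_r` are hypothetical helpers, and this is not a model of any actual behavioral genetic study.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate_families(n_families=5000, randomize_parenting=False, seed=0):
    """Simulate parent and child behavior driven by a shared heritable factor.

    Parent behavior never enters the child's equation, so any correlation
    comes from the common cause alone.
    """
    rng = random.Random(seed)
    parent_behavior, child_behavior = [], []
    for _ in range(n_families):
        shared_factor = rng.gauss(0, 1)
        if randomize_parenting:
            # Hypothetical experiment: assign parenting at random, which
            # severs its link to the shared factor.
            parent = rng.gauss(0, 1)
        else:
            parent = shared_factor + rng.gauss(0, 1)
        child = shared_factor + rng.gauss(0, 1)
        parent_behavior.append(parent)
        child_behavior.append(child)
    return parent_behavior, child_behavior


if __name__ == "__main__":
    observed = pearson_r(*simulate_families())
    randomized = pearson_r(*simulate_families(randomize_parenting=True))
    print(f"observational correlation: {observed:.2f}")
    print(f"correlation with randomized parenting: {randomized:.2f}")
```

In this toy world the observational correlation between parent and child behavior is substantial (around 0.5 in expectation), yet it vanishes once parenting is assigned at random. The correlation was real; the causal story was not, which is exactly the situation the maxim is meant to flag.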

3. The plural of anecdote is not data. How dare you try to learn from stories you are told or what you yourself observe!

I suspect this is something of a sore spot for Roberts, who’s been an avid proponent of self-experimentation and case studies. I imagine people often dismiss his work as mere anecdote rather than valuable data. Personally, I happen to think there’s tremendous value to self-experimentation (at least when done in as controlled a manner as possible), so I don’t doubt there are many cases where this saying is unfairly applied. That said, I think Roberts fails to appreciate that people who do his kind of research constitute a tiny fraction of the population. Most of the time, when someone says that “the plural of anecdote is not data,” they’re not talking to someone who does rigorous self-experimentation, but to people who, say, don’t believe they should give up smoking seeing as how their grandmother smoked till she was 88 and died in a bungee-jumping accident, or who are convinced that texting while driving is perfectly acceptable because they don’t personally know anyone who’s gotten in an accident. In such cases, it’s not only legitimate but arguably desirable to point out that personal anecdote is no substitute for hard data.

Orwell was right. People use these sayings — especially #1 and #3 — to push away data that contradicts this or that approved view of the world. Without any data at all, the world would be simpler: We would simply believe what authorities tell us. Data complicates things. These sayings help those who say them ignore data, thus restoring comforting certainty.

Maybe there should be a term (antiscientific method?) to describe the many ways people push away data. Or maybe preventive stupidity will do.

I’d like to be charitable here, since there very clearly are cases where Roberts’ point holds true: sometimes people do toss out these sayings as a way of not really contending with data they don’t like. But frankly, the general claim that these sayings are antiscientific and constitute an act of stupidity just seems silly. All three sayings are clearly applicable in a large number of situations; to deny that, you’d have to believe that (a) it’s always fine to accept the null hypothesis, (b) correlation is always a good indicator of a causal relationship, and (c) personal anecdotes are just as good as large, well-controlled studies. I take it that no one, including Roberts, really believes that. So then it becomes a matter of when to apply these sayings, and not whether or not to use them. After all, it’d be silly to think that the people who use these sayings are always on the side of darkness, and the people who wield null results, correlations, and anecdotes with reckless abandon are always on the side of light.

My own experience, for what it’s worth, is that the use of these sayings is justified far more often than not, and I don’t have any reservation applying them myself when I think they’re warranted (which is relatively often–particularly the first one). But I grant that that’s just my own personal experience talking, and no matter how many experiences I’ve had of people using these sayings appropriately, I’m well aware that the plural of anecdote…

de Waal and Ferrari on cognition in humans and animals

Thursday, May 6th, 2010

Humans do many things that most animals can’t. That much no one would dispute. The more interesting and controversial question is just how many things we can do that most animals can’t, and just how many animal species can or can’t do the things we do. That question is at the center of a nice opinion piece in Trends in Cognitive Sciences by Frans de Waal and Pier Francesco Ferrari.

De Waal and Ferrari argue for what they term a bottom-up approach to human and animal cognition. The fundamental idea–which isn’t new, and in fact owes much to decades of de Waal’s own work with primates–is that most of our cognitive abilities, including many that are often characterized as uniquely human, are in fact largely continuous with abilities found in other species. De Waal and Ferrari highlight a number of putatively “special” functions like imitation and empathy that turn out to have relatively frequent primate (and in some cases non-primate) analogs. They push for a bottom-up scientific approach that seeks to characterize the basic mechanisms that complex functionality might have arisen out of, rather than (what they see as) “the overwhelming tendency outside of biology to give human cognition special treatment.”

Although I agree pretty strongly with the thesis of the paper, its scope is also, in some ways, quite limited: De Waal and Ferrari clearly believe that many complex functions depend on homologous mechanisms in both humans and non-human primates, but they don’t actually say very much about what these mechanisms might be, save for some brief allusions to relatively broad neural circuits (e.g., the oft-criticized mirror neuron system, which Ferrari played a central role in identifying and characterizing). To some extent that’s understandable given the brevity of TICS articles, but given how much de Waal has written about primate cognition, it would have been nice to see a more detailed example of the types of cognitive representations de Waal thinks underlie, say, the homologous abilities of humans and capuchin monkeys to empathize with conspecifics.

Also, despite its categorization as an “Opinion” piece (these are supposed to stir up debate), I don’t think many people (at least, the kind of people who read TICS articles) are going to take issue with the basic continuity hypothesis advanced by de Waal and Ferrari. I suspect many more people would agree than disagree with the notion that most complex cognitive abilities displayed by humans share a closely intertwined evolutionary history with seemingly less sophisticated capacities displayed by primates and other mammalian species. So in that sense, de Waal and Ferrari might be accused of constructing something of a straw man. But it’s important to recognize that de Waal’s own work is a very large part of the reason why the continuity hypothesis is so widely accepted these days. So in that sense, even if you already agree with its premise, the TICS paper is worth reading simply as an elegant summary of a long-standing and important line of research.