[Editorial note: The people and events described here are fictional. But the paper in question is quite real.]
“Dearly Beloved,” The Graduate Student began. “We are gathered here to–”
“Again?” Samantha interrupted. “Again with the Dearly Beloved speech? Can’t we just start a meeting like a normal journal club for once? We’re discussing papers here, not holding a funeral.”
“We will discuss papers,” said The Graduate Student indignantly. “In good time. But first, we have to follow the rules of Great Minds Journal Club. There’s a protocol, you know.”
Samantha was about to point out that she didn’t know, because The Graduate Student was the sole author of the alleged rules, and the alleged rules had a habit of changing every week. But she was interrupted by the sound of the double doors at the back of the room swinging violently inwards.
“Sorry I’m late,” said Jin, strolling into the room, one hand holding what looked like a large bucket of coffee with a lid on top. “What are we reading today?”
“Nothing,” said Lionel. “The reading has already happened. What we’re doing now is discussing the paper that everyone’s already read.”
“Right, right,” said Jin. “What I meant to ask was: what paper that we’ve all already read are we discussing today?”
“Statistically controlling for confounding constructs is harder than you think,” said The Graduate Student.
“I doubt it,” said Jin. “I think almost everything is intolerably difficult.”
“Do I work here… Hah. Funny man. Remember, Lionel… I’ll be on your tenure committee in the Fall.”
“Why don’t we get started,” said The Graduate Student, eager to prevent a full-out sarcastathon. “I guess we can do our standard thing where Samantha and I describe the basic ideas and findings, talk about how great the paper is, and suggest some possible extensions… and then Jin and Lionel tear it to shreds.”
“Sounds good,” said Jin and Lionel in concert.
“The basic problem the authors highlight is pretty simple,” said Samantha. “It’s easy to illustrate with an example. Say you want to know if eating more bacon is associated with a higher incidence of colorectal cancer–like that paper that came out a while ago suggested. In theory, you could just ask people how often they eat bacon and how often they get cancer, and then correlate the two. But suppose you find a positive correlation–what can you conclude?”
“Not much,” said Pablo–apparently in a talkative mood. It was the first thing he’d said to anyone all day–and it was only 3 pm.
“Right. It’s correlational data,” Samantha continued. “Nothing is being experimentally manipulated here, so we have no idea if the bacon-cancer correlation reflects the effect of bacon itself, or if there’s some other confounding variable that explains the association away.”
“Like, people who exercise less tend to eat more bacon, and exercise also prevents cancer,” The Graduate Student offered.
“Or it could be a general dietary thing, and have nothing to do with bacon per se,” said Jin. “People who eat a lot of bacon also have all kinds of other terrible dietary habits, and it’s really the gestalt of all the bad effects that causes cancer, not any one thing in particular.”
“Or maybe,” suggested Pablo, “a sneaky parasite unknown to science invades the brain and the gut. It makes you want to eat bacon all the time. Because bacon is its intermediate host. And then it also gives you cancer. Just to spite you.”
“Right, it could be any of those things.” Samantha said. “Except for maybe that last one. The point is, there are many potential confounds. If we want to establish that there’s a ‘real’ association between bacon and cancer, we need to somehow remove the effect of other variables that could be correlated with both bacon-eating and cancer-having. The traditional way to do this is to statistical “control for” or “hold constant” the effects of confounding variables. The idea is that you adjust the variables in your regression equation so that you’re essentially asking what would the relationship between bacon and cancer look like if we could eliminate the confounding influence of things like exercise, diet, alcohol, and brain-and-gut-eating parasites? It’s a very common move, and the logic of statistical control is used to justify a huge number of claims all over the social and biological sciences.”
“I just published a paper showing that brain activation in frontoparietal regions predicts people’s economic preferences even after controlling for self-reported product preferences,” said Jin. “Please tell me you’re not going to shit all over my paper. Is that where this is going?”
“It is,” said Lionel gleefully. “That’s exactly where this is going.”
“It’s true,” Samantha said apologetically. “But if it’s any consolation, we’re also going to shit on Lionel’s finding that implicit prejudice is associated with voting behavior after controlling for explicit attitudes.”
“That’s actually pretty consoling,” said Jin, smiling at Lionel.
“So anyway, statistical control is pervasive,” Samantha went on. “But there’s a problem: statistical control–at least the way people typically do it–is a measurement-level technique. Meaning, when you control for the rate of alcohol use in a regression of cancer on bacon, you’re not really controlling for alcohol use. What you’re actually controlling for is just one particular operationalization of alcohol use–which probably doesn’t cover the entire construct, and is also usually measured with some error.”
“Could you maybe give an example,” asked Pablo. He was the youngest in the group, being only a second-year graduate student. (The Graduate Student, by contrast, had been in the club for so long that his real name had long ago been forgotten by the other members of the GMJC.)
“Sure,” said The Graduate Student. “Suppose your survey includes an item like ‘how often do you consume alcoholic beverages’, and the response options include things like never, less than once a month, I’m never not consuming alcoholic beverages, and so on. Now, people are not that great at remembering exactly how often they have a drink–especially the ones who tend to have a lot of drinks. On top of that, there’s a stigma against drinking a lot, so there’s probably going to be some degree of systematic underreporting. All of this contrives to give you a measure that’s less than perfectly reliable–meaning, it won’t give you the same values that you would get if you could actually track people for an extended period of time and accurately measure exactly how much ethanol they consume, by volume. In many, many cases, measured covariates of this kind are pretty mediocre.”
“I see,” said Pablo. “That makes sense. So why is that a problem?”
“Because you can’t control for that which you aren’t measuring,” Samantha said. “Meaning, if your alleged measure of alcohol consumption–or any other variable you care about–isn’t measuring the thing you care about with perfect accuracy, then you can’t remove its influence on other things. It’s easiest to see this if you think about the limiting case where your measurements are completely unreliable. Say you think you’re measuring weekly hours of exercise, but actually your disgruntled research assistant secretly switched out the true exercise measure for randomly generated values. When you then control for the alleged ‘exercise’ variable in your model, how much of the true influence of exercise are you removing?”
“None,” said Pablo.
“Right. Your alleged measure of exercise doesn’t actually reflect anything about exercise, so you’re accomplishing nothing by controlling for it. The same exact point holds–to varying degrees–when your measure is somewhat reliable, but not perfect. Which is to say, pretty much always.”
“You could also think about the same general issue in terms of construct validity,” The Graduate Student chimed in. “What you’re typically trying to do by controlling for something is account for a latent construct or concept you care about–not a specific measure. For example, the latent construct of a “healthy diet” could be measured in many ways. You could ask people how much broccoli they eat, how much sugar or transfat they consume, how often they eat until they can’t move, and so on. If you surveyed people with a lot of different items like this, and then extracted the latent variance common to all of them, then you might get a component that could be interpreted as something like ‘healthy diet’. But if you only use one or two items, they’re going to be very noisy indicators of the construct you care about. Which means you’re not really controlling for how healthy people’s diet is in your model relating bacon to cancer. At best, you’re controlling for, say, self-reported number of vegetables eaten. But there’s a very powerful temptation for authors to forget that caveat, and to instead think that their measurement-level conclusions automatically apply at the construct level. The result is that you end up with a huge number of papers saying things like ‘we show that fish oil promotes heart health even after controlling for a range of dietary and lifestyle factors’. When in fact the measurement-level variables they’ve controlled for can’t help but capture only a tiny fraction of all of the dietary and lifestyle factors that could potentially confound the association you care about.”
“I see,” said Pablo. “But this seems like a pretty basic point, doesn’t it?”
“Yes,” said Lionel. “It’s a problem as old as time itself. It might even be older than Jin.”
Jin smiled at Lionel and tipped her coffee cup-slash-bucket towards him slightly in salute.
“In fairness to the authors,” said The Graduate Student, “they do acknowledge that essentially the same problem has been discussed in many literatures over the past few decades. And they cite some pretty old papers. Oldest one is from… 1965. Kahneman, 1965.”
An uncharacteristic silence fell over the room.
“That Kahneman?” Jin finally probed.
“The one and only.”
“Fucking Kahneman,” said Lionel. “That guy could really stand to leave a thing or two for the rest of us to discover.”
“So, wait,” said Jin, evidently coming around to Lionel’s point of view. “These guys cite a 50-year old paper that makes essentially the same argument, and still have the temerity to publish this thing?”
“Yes,” said Samantha and The Graduate Student in unison.
“But to be fair, their presentation is very clear,” Samantha said. “They lay out the problem really nicely–which is more than you can say for many of the older papers. Plus there’s some neat stuff in here that hasn’t been done before, as far as I know.”
“Like what?” asked Lionel.
“There’s a nice framework for analytically computing error rates for any set of simple or partial correlations between two predictors and a DV. And, to save you the trouble of having to write your own code, there’s a Shiny web app.”
“In my day, you couldn’t just write a web app and publish it as a paper,” Jin grumbled. “Shiny or otherwise.”
“That’s because in your day, the internet didn’t exist,” Lionel helpfully offered.
“No internet?” the Graduate Student shrieked in horror. “How old are you, Jin?”
“Old enough to become very wise,” said Jin. “Very, very wise… and very corpulent with federal grant money. Money that I could, theoretically, use to fund–or not fund–a graduate student of my choosing next semester. At my complete discretion, of course.” She shot The Graduate Student a pointed look.
“There’s more,” Samantha went on. “They give some nice examples that draw on real data. Then they show how you can solve the problem with SEM–although admittedly that stuff all builds directly on textbook SEM work as well. And then at the end they go on to do some power calculations based on SEM instead of the standard multiple regression approach. I think that’s new. And the results are… not pretty.”
“How so,” asked Lionel.
“Well. Westfall and Yarkoni suggest that for fairly typical parameter regimes, researchers who want to make incremental validity claims at the latent-variable level–using SEM rather than multiple regression–might be looking at a bare minimum of several hundred participants, and often many thousands, in order to adequately power the desired inference.”
“Ouchie,” said Jin.
“What happens if there’s more than one potential confound?” asked Lionel. “Do they handle the more general multiple regression case, or only two predictors?”
“No, only two predictors,” said The Graduate Student. “Not sure why. Maybe they were worried they were already breaking enough bad news for one day.”
“Could be,” said Lionel. “You have to figure that in an SEM, when unreliability in the predictors is present, the uncertainty is only going to compound as you pile on more covariates–because it’s going to become increasingly unclear how the model should attribute any common variance that the predictor of interest shares with both the DV and at least one other covariate. So whatever power estimates they come up with in the paper for the single-covariate case are probably upper bounds on the ability to detect incremental contributions in the presence of multiple covariates. If you have a lot of covariates–like the epidemiology or nutrition types usually do–and at least some of your covariates are fairly unreliable, things could get ugly really quickly. Who knows what kind of sample sizes you’d need in order to make incremental validity claims about small effects in epi studies where people start controlling for the sun, moon, and stars. Hundreds of thousands? Millions? I have no idea.”
“Jesus,” said The Graduate Student. “That would make it almost impossible to isolate incremental contributions in large observational datasets.”
“Correct,” said Lionel.
“The thing I don’t get,” said Samantha, “is that the epidemiologists clearly already know about this problem. Or at least, some of them do. They’ve written dozens of papers about ‘residual confounding’, which is another name for the same problem Westfall and Yarkoni discuss. And yet there are literally thousands of large-sample, observational papers published in prestigious epidemiology, nutrition, or political science journals that never even mention this problem. If it’s such a big deal, why does almost nobody actually take any steps to address it?”
“Ah…” said Jin. “As the senior member of our group, I can probably answer that question best for you. You see, it turns out it’s quite difficult to publish a paper titled After an extensive series of SEM analyses of a massive observational dataset that cost the taxpayer three million dollars to assemble, we still have no idea if bacon causes cancer. Nobody wants to read that paper. You know what paper people do want to read? The one called Look at me, I eat so much bacon I’m guaranteed to get cancer according to the new results in this paper–but I don’t even care, because bacon is so delicious. That’s the paper people will read, and publish, and fund. So that’s the paper many scientists are going to write.”
A second uncharacteristic silence fell over the room.
“Bit of a downer today, aren’t you,” Lionel finally said. “I guess you’re playing the role of me? I mean, that’s cool. It’s a good look for you.”
“Yes,” Jin agreed. “I’m playing you. Or at least, a smarter, more eloquent, and better-dressed version of you.”
“Why don’t we move on,” Samantha interjected before Lionel could re-arm and respond. “Now that we’ve laid out the basic argument, should we try to work through the details and see what we find?”
“Yes,” said Lionel and Jin in unison–and proceeded to tear the paper to shreds.
[Editorial note: this was originally posted on April 1, 2016. April 1 is a day marked by a general lack of seriousness. Interpret this post accordingly.]
As many people who follow this blog will be aware, much of my research effort over the past few years has been dedicated to developing Neurosynth—a framework for large-scale, automated meta-analysis of neuroimaging data. Neurosynth has expanded steadily over time, with an ever-increasing database of studies, and a host of new features in the pipeline. I’m very grateful to NIMH for the funding that allows me to keep working on the project, and also to the hundreds (thousands?) of active Neurosynth users who keep finding novel applications for the data and tools we’re generating.
That said, I have to confess that, over the past year or so, I’ve gradually grown dissatisfied at my inability to scale up the Neurosynth operation in a way that would take the platform to the next level . My colleagues and I have come up with, and in some cases even prototyped, a number of really exciting ideas that we think would substantially advance the state of the art in neuroimaging. But we find ourselves spending an ever-increasing chunk of our time applying for the grants we need to support the work, and having little time left to over to actually do the work. Given the current funding climate and other logistical challenges (e.g., it’s hard to hire professional software developers on postdoc budgets), it’s become increasingly clear to me that the Neurosynth platform will be hard to sustain in an academic environment over the long term. So, for the past few months, I’ve been quietly exploring opportunities to help Neurosynth ladder up via collaborations with suitable industry partners.
Initially, my plan was simply to license the Neurosynth IP and use the proceeds to fund further development of Neurosynth out of my lab at UT-Austin. But as I started talking to folks in industry, I realized that there were opportunities available outside of academia that would allow me to take Neurosynth in directions that the academic environment would never allow. After a lot of negotiation, consultation, and soul-searching, I’m happy (though also a little sad) to announce that I’ll be leaving my position at the University of Texas at Austin later this year and assuming a new role as Senior Technical Fellow at Elsevier Open Science (EOS). EOS is a brand new division of Elsevier that seeks to amplify and improve scientific communication and evaluation by developing cutting-edge open science tools. The initial emphasis will be on the neurosciences, but other divisions are expected to come online in the next few years (and we’ll be hiring soon!). EOS will be building out a sizable insight-as-a-service operation that focuses on delivering real value to scientists—no p-hacking, no gimmicks, just actionable scientific information. The platforms we build will seek to replace flawed citation-based metrics with more accurate real-time measures that quantify how researchers actually use one another’s data, ideas, and tools—ultimately paving the way to a new suite of microneuroservices that reward researchers both professionally and financially for doing high-quality science.
On a personal level, I’m thrilled to be in a position to help launch an initiative like this. Having spent my entire career in an academic environment, I was initially a bit apprehensive at the thought of venturing into industry. But the move to Elsevier ended up feeling very natural. I’ve always seen Elsevier as a forward-thinking company at the cutting edge of scientific publishing, so I wasn’t shocked to hear about the EOS initiative. But as I’ve visited a number of Elsevier offices over the past few weeks (in the process of helping to decide where to locate EOS), I’ve been continually struck at how open and energetic—almost frenetic—the company is. It’s the kind of environment that combines many of the best elements of the tech world and academia, but without a lot of the administrative bureaucracy of the latter. At the end of the day, it was an opportunity I couldn’t pass up.
It will, of course, be a bittersweet transition for me; I’ve really enjoyed my 3 years in Austin, both professionally and personally. While I’m sure I’ll enjoy Norwich, CT (where EOS will be based), I’m going to really miss Austin. The good news is, I won’t be making the move alone! A big part of what sold me on Elsevier’s proposal was their commitment to developing an entire open science research operation; over the next five years, the goal is to make Elsevier the premier place to work for anyone interested in advancing open science. I’m delighted to say that Chris Gorgolewski (Stanford), Satrajit Ghosh (MIT), and Daniel Margulies (Max Planck Institute for Human Cognitive and Brain Sciences) have all also been recruited to Elsevier, and will be joining EOS at (or in Satra’s case, shortly after) launch. I expect that they’ll make their own announcements shortly, so I won’t steal their thunder much. But the short of it is that Chris, Satra, and I will be jointly spearheading the technical operation. Daniel will be working on other things, and is getting the fancy title of “Director of Interactive Neuroscience”; I think this means he’ll get to travel a lot and buy expensive pictures of brains to put on his office walls. So really, it’s a lot like his current job.
It goes without saying that Neurosynth isn’t making the jump to Elsevier all alone; NeuroVault—a whole-brain image repository developed by Chris—will also be joining the Elsevier family. We have some exciting plans in the works for much closer NeuroVault-Neurosynth integration, and we think the neuroimaging community is going to really like the products we develop. We’ll also be bringing with us the OpenfMRI platform created by Russ Poldrack. While Russ wasn’t interested in leaving Stanford (as I recall, his exact words were “over all of your dead bodies”), he did agree to release the OpenfMRI IP to Elsevier (and in return, Elsevier is endowing a permanent Open Science fellowship at Stanford). Russ will, of course, continue to actively collaborate on OpenfMRI, and all data currently in the OpenfMRI database will remain where it is (though all original contributors will be given the opportunity to withdraw their datasets if they choose). We also have some new Nipype-based tools rolling out over the coming months that will allow researchers to conduct state-of-the-art neuroimaging analyses in the cloud (for a small fee)–but I’ll have much more to say about that in a later post.
Naturally, a transition like this one can’t be completed without hitting a few speed bumps along the way. The most notable one is that the current version of Neurosynth will be retired permanently in mid-April (so grab any maps you need right now!). A new and much-improved version will be released in September, coinciding with the official launch of EOS. One of the things I’m most excited about is that the new version will support an “Enhanced Usage” tier. The vertical integration of Neurosynth with the rest of the Elsevier ecosystem will be a real game-changer; for example, authors submitting papers to NeuroImage will automatically be able to push their content into NeuroVault and Neurosynth upon acceptance, and readers will be able to instantly visualize and cognitively decode any activation map in the Elsevier system (for a nominal fee handled via an innovative new micropayment system). Users will, of course, retain full control over their content, ensuring that only readers who have the appropriate permissions (and a valid micropayment account of their own) can access other people’s data. We’re even drawing up plans to return a portion of the revenues earned through the system to the content creators (i.e., article authors)—meaning that for the first time, neuroimaging researchers will be able to easily monetize their research.
As you might expect, the Neurosynth brand will be undergoing some changes to reflect the new ownership. While Chris and I initially fought hard to preserve the names Neurosynth and NeuroVault, Elsevier ultimately convinced us that using a consistent name for all of our platforms would reduce confusion, improve branding, and make for a much more streamlined user experience*. There’s also a silver lining to the name we ended up with: Chris, Russ, and I have joked in the past that we should unite our various projects into a single “NeuroStuff” website—effectively the Voltron of neuroimaging tools—and I even went so far as to register neurostuff.org a while back. When we mentioned this to the Elsevier execs (intending it as a joke), we were surprised at their positive response! The end result (after a lot of discussion) is that Neurosynth, NeuroVault, and OpenfMRI will be merging into The NeuroStuff Collection, by Elsevier (or just NeuroStuff for short)–all coming in late 2016!
Admittedly, right now we don’t have a whole lot to show for all these plans, except for a nifty logo created by Daniel (and reluctantly approved by Elsevier—I think they might already be rethinking this whole enterprise). But we’ll be rolling out some amazing new services in the very near future. We also have some amazing collaborative projects that will be announced in the next few weeks, well ahead of the full launch. A particularly exciting one that I’m at liberty to mention** is that next year, EOS will be teaming up with Brian Nosek and folks at the Center for Open Science (COS) in Charlottesville to create a new preregistration publication stream. All successful preregistered projects uploaded to the COS’s flagship Open Science Framework (OSF) will be eligible, at the push of a button, for publication in EOS’s new online-only journal Preregistrations. Submission fees will be competitive with the very cheapest OA journals (think along the lines of PeerJ’s $99 lifetime subscription model).
It’s been a great ride working on Neurosynth for the past 5 years, and I hope you’ll all keep using (and contributing to) Neurosynth in its new incarnation as Elsevier NeuroStuff!
* Okay, there’s no point in denying it—there was also some money involved.
** See? Money can’t get in the way of open science—I can talk about whatever I want!
In my last post, I wrote a long commentary on a recent PNAS article by Lieberman & Eisenberger claiming to find evidence that the dorsal anterior cingulate cortex is “selective for pain” using my Neurosynth framework for large-scale fMRI meta-analysis. I argued that nothing about Neurosynth supports any of L&E’s major conclusions, and that they made several major errors of inference and analysis. L&E have now responded in detail on Lieberman’s blog. If this is the first you’re hearing of this exchange, and you have a couple of hours to spare, I’d suggest proceeding in chronological order: read the original article first, then my commentary, then L&E’s response then this response to the response (if you really want to leave no stone unturned, you could also read Alex Shackman’s commentary, which focuses on anatomical issues). If you don’t have that kind of time on your hands, just read on and hope for the best, I guess.
Before I get to the substantive issues, let me say that I appreciate L&E taking the time to reply to my comments in detail. I recognize that they have other things they could be doing (as do I), and I think their willingness to engage in this format sets an excellent example as the scientific community continues to move rapidly towards more open, rapid, and interactive online scientific discussion. I would encourage readers to weigh in on the debate themselves or raise any questions they feel haven’t been addressed (either here on on Lieberman’s blog).
With that said, I have to confess that I don’t think my view is any closer to L&E’s than it previously was. I disagree with L&E’s suggestions that we actually agree on more than I thought in my original post; if anything, I think the opposite is true. However, I did find L&E’s response helpful inasmuch as it helped me better understand where their misunderstandings of Neurosynth lie.
In what follows, I provided a detailed rebuttal to L&E’s response. I’ll warn you right now that this will be a very long and fairly detail-oriented post. In a (probably fruitless) effort to minimize reader boredom, I’ve divided my response into two sections, much as L&E did. In the first section, I summarize what I see as the two most important points of disagreement. In the second part, I quote L&E’s entire response and insert my own comments in-line (essentially responding email-style). I recognize that this is a rather unusual thing to do, and it makes for a decidedly long read (the post clocks in at over 20,000 words, though much of that is quotes from L&E’s response). but I did it this way because, frankly, I think L&E badly misrepresented much of what I said in my last post. I want to make sure the context is very clear to readers, so I’m going to quote the entirety of each of L&E’s points before I respond to them, so that at the very least I can’t be accused of quoting them out of context.
The big issues: reverse inference and selectivity
With preliminaries out of the way, let me summarize what I see as the two biggest problems with L&E’s argument (though, if you make it to the second half of this post, you’ll see that there are many other statistical and interpretational issues that are pretty serious in their own right). The first concerns their fundamental misunderstanding of the statistical framework underpinning Neurosynth, and its relation to reverse inference. The second concerns their use of a definition of selectivity that violates common sense and can’t possibly support their claim that “the dACC is selective for pain”.
Misunderstandings about the statistics of reverse inference
I don’t think there’s any charitable way to say this, so I’ll just be blunt: I don’t think L&E understand the statistics behind the images Neurosynth produces. In particular, I don’t think they understand the foundational role that the notion of probability plays in reverse inference. In their reply, L&E repeatedly say that my concerns about their lack of attention to effect sizes (i.e., conditional probabilities) are irrelevant, because they aren’t trying to make an argument about effect sizes. For example:
TY suggests that we made a major error by comparing the Z-scores associated with different terms and should have used posterior probabilities instead. If our goal had been to compare effect sizes this might have made sense, but comparing effect sizes was not our goal. Our goal was to see whether there was accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.
This captures perhaps the crux of L&E’s misunderstanding about both Neurosynth and reverse inference. Their argument here is basically that they don’t care about the actual probability of a term being used conditional on a particular pattern of activation; they just want to know that there’s “support for the reverse inference”. Unfortunately, it doesn’t work that way. The z-scores produced by Neurosynth (which are just transformations of p-values) don’t provide a direct index of the support for a reverse inference. What they measure is what p-values always measure: the probability of observing a result as extreme as the one observed under the assumption that the null of no effect is true. Conceptually, we can interpret this as a claim about the population-level association between a region and a term. Roughly, we can say that as z-scores increase, we can be more confident that there’s a non-zero (positive) relationship between a term and a brain region (though some Bayesians might want to take issue with even this narrow assertion). So, if all L&E wanted to say was, “there’s good evidence that there’s a non-zero association between pain and dACC activation across the population of published fMRI studies”, they would be in good shape. But what they’re arguing for is much stronger: they want to show that the dACC is selective for pain. And z-scores are of no use here. Knowing that there’s a non-zero association between dACC activation and pain tells us nothing about the level of specificity or selectivity of that association in comparison to other terms. If the z-score for the association between dACC activation and ‘pain’ occurrence is 12.4 (hugely statistically significant!), does that mean that the probability of pain conditional on dACC activation is closer to 95%, or to 25%? Does it tell us that dACC activation is a better marker of pain than conflict, vision, or memory? We don’t know. We literally have no way to tell, unless we’re actually willing to talk about probabilities within a Bayesian framework.
To demonstrate that this isn’t just a pedantic point about what could in theory happen, and that the issue is in fact completely fundamental to understanding what Neurosynth can and can’t support, here are three different flavors of the Neurosynth maps for the “pain” map:
The top row is the reverse inference z-score map available on the website. The values here are z-scores, and what they tell us (being simply transformations of p-values) is nothing more than what the probability would be of observing an association at least as extreme as the one we observe under the null hypothesis of no effect. The second and third maps are both posterior probability maps. They display the probability of a study using the term ‘pain’ when activation is observed at each voxel in the brain. These maps aren’t available on the website (for reasons I won’t get into here, though the crux of it is that they’re extremely easy to misinterpret, for reasons that may become clear below)—though you can easily generate them with the Neurosynth core tools if you’re so inclined.
The main feature of these two probability maps that should immediately jump out at you is how strikingly different their numbers are. In the first map (i.e., middle row), the probabilities of “pain” max out around 20%; in the second map (bottom row), they range from around 70% – 90%. And yet, here I am telling you that these are both posterior probability maps that tell us the probability of a study using the term “pain” conditional on that study observing activity at each voxel. How could this be? How could the two maps be so different, if they’re supposed to be estimates of the same thing?
The answer lies in the prior. In the natural order of things, different terms occur with wildly varying frequencies in the literature (remember that Neurosynth is based on extraction of words from abstracts, not direct measurement of anyone’s mental state!). “Pain” occurs in only about 3.5% of Neurosynth studies. By contrast, the term “memory” occurs in about 16% of studies. One implication of this is that, if we know nothing at all about the pattern of brain activity reported in a given study, we should already expect that study to be about five times more likely to involve memory than pain. Of course, knowing something about the pattern of brain activity should change our estimate. In Bayesian terminology, we can say that our prior belief about the likelihood of different termsgets updated by the activity pattern we observe, producing somewhat more informed posterior estimates. For example, if the hippocampus and left inferior frontal gyrus are active, that should presumably increase our estimate of “memory” somewhat; conversely, if the periaqueductal gray, posterior insula, and dACC are all active, that should instead increase our estimate of “pain”.
In practice, the degree to which the data modulate our Neurosynth-based beliefs is not nearly as extreme as you might expect. In the first posterior probability map above (labeled “empirical prior”), what you can see are the posterior estimates for “pain” under the assumption that pain occurs in about 3.5% of all studies—which is the actual empirical frequency observed in the Neurosynth database. Notice that the very largest probabilities we ever see—located, incidentally, in the posterior insula, and not in the dACC—max out around 15 – 20%. This is not to be scoffed at; it means that observing activation in the posterior insula implies approximately a 5-fold increase in the likelihood of “pain” being present (relative to our empirical prior of 3.5%). Yet, in absolute terms, the probability of “pain” is still very low. Based on these data, no one in their right mind should, upon observing posterior insula activation (let alone dACC, where most voxels show a probability no higher than 10%), draw the reverse inference that pain is likely to be present.
To make it even clearer why this inference would be unsupportable, here are posterior probabilities for the same voxels as above, but now plotted for several other terms, in addition to pain:
Notice how, in the bottom map (for ‘motor’, which occurs in about 18% of all studies in Neurosynth), the posterior probabilities in all of dACC are substantially higher for than for ‘pain’, even though z-scores in most of dACC show the opposite pattern. For ‘working memory’ and ‘reward’, the posterior probabilities are in the same ballpark as for pain (mostly around 8 – 12%). And for ‘fear’, there are no voxels with posterior probabilities above 5% anywhere, because the empirical prior is so low (only 2% of Neurosynth studies).
What this means is that, if you observe activation in dACC—a region which shows large z-scores for “pain” and much lower ones for “motor”—your single best guess as to what process might be involved (of the five candidates in the above figure) should be ‘motor’ by a landslide. You could also guess ‘reward’ or ‘working memory’ with about the same probability as ‘pain’. Of course, the more general message you should take away from this is that it’s probably a bad idea to infer any particular process on the basis of observed activity, given how low the posterior probability estimates for most terms are going to be. Put simply, it’s a giant leap to go from these results—which clearly don’t license anyone to conclude that the dACC is a marker of any single process—to concluding that “the dACC is selective for pain” and that pain represents the best psychological characterization of dACC function.
As if this isn’t bad enough, we now need to add a further complication to the picture. The analysis above assumes we have a good prior for terms like “pain” and “memory”. In reality, we have no reason to think that the empirical estimates of term frequency we get out of Neurosynth are actually good reflections of the real world. For all we know, it could be that pain processing is actually 10 times as common as it appears to be in Neurosynth (i.e., that pain is severely underrepresented in fMRI studies relative to its occurrence in real-world human brains). If we use the empirical estimates from Neurosynth as our priors—with all of their massive between-term variation—then, as you saw above, the priors will tend to overwhelm our posteriors. In other words, no amount of activation in pain-related regions would ever lead us to conclude that a study is about a low-frequency term like pain rather than a high-frequency term like memory or vision.
For this reason, when I first built Neurosynth, my colleagues and I made the deliberate decision to impose a uniform (i.e., 50/50) prior on all terms displayed on the Neurosynth website. This approach greatly facilitates qualitative comparison of different terms; but it necessarily does so by artificially masking the enormous between-term variability in base rates. What this means is that when you see a posterior probability like 85% for pain in the dACC in the third row of the pain figure above, the right interpretation of this is “if you pretend that the prior likelihood of a study using the term pain is exactly 50%, then your posterior estimate after observing dACC activation should now be 85%”. Is this a faithful representation of reality? No. It most certainly isn’t. And in all likelihood, neither is the empirical prior of 3.5%. But the problem is, we have to do something; Bayes’ rule has to have priors to work with; it can’t just conjure into existence a conditional probability for a term (i.e., P(Term|Activation)) without knowing anything about its marginal probability (i.e., P(Term)). Unfortunately, as you can see in the above figure, the variation in the posterior that’s attributable to the choice of prior will tend to swamp the variation that’s due to observed differences in brain activity.
The upshot is, if you come into a study thinking that ‘pain’ is 90% likely to be occurring, then Neurosynth is probably not going to give you much reason to revise that belief. Conversely, if your task involves strictly visual stimuli, and you know that there’s no sensory stimulation at all—so maybe you feel comfortable setting the prior on pain at 1%—then no pattern of activity you could possibly see is going to lead you to conclude that there’s a high probability of pain. This may not be very satisfying, but hey, that’s life.
The interesting thing about all this is that, no matter what prior you choose for any given term, the Neurosynth z-score will never change. That’s because the z-score is a frequentist measure of statistical association between term occurrence and voxel activation. All it tells us is that, if the null of no effect were true, the data we observe would be very unlikely. This may or may not be interesting (I would argue that it’s not, but that’s for a different post), but it certainly doesn’t license a reverse inference like “dACC activation suggests that pain is present”. To draw the latter claim, you have to use a Bayesian framework and pick some sensible priors. No priors, no reverse inference.
Now, as I noted in my last post, it’s important to maintain a pragmatic perspective. I’m obviously not suggesting that the z-score maps on Neurosynth are worthless. If one’s goal is just to draw weak qualitative inferences about brain-cognition relationships, I think it’s reasonable to use Neurosynth reverse inference z-score maps for that purpose. For better or worse, the vast majority of claims researchers make in cognitive neuroscience are not sufficiently quantitative that it makes much difference whether the probability of a particular term occurring given some observed pattern of activation is 24% or 58%. Personally, I would argue that this is to the detriment of the field; but regardless, the fact remains that if one’s goal is simply to say something like “we think that the temporoparietal junction is associated with biological motion and theory of mind,” or “evidence suggests that the parahippocampal cortex is associated with spatial navigation,” I don’t see anything wrong with basing that claim on Neurosynth z-score maps. In marked contrast, however, Neurosynth provides no license for saying much stronger things like “the dACC is selective for pain” or suggesting that one can make concrete reverse inferences about mental processes on the basis of observed patterns of brain activity. If the question we’re asking is what are we entitled to conclude about the presence of pain when we observed significant activation in the dACC in a particular study?, the simple answer is: almost nothing.
Let’s now reconsider L&E’s statement—and by extension, their entire argument for selectivity—in this light. L&E say that their goal is not to compare effect sizes for different terms, but rather “to see whether there [is] accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.” But what could this claim possibly mean, if not something like “we want to know whether it’s safe to infer the presence of pain given the presence of dACC activation?” How could this possibly be anything other than a statement about probability? Are L&E really saying that, given a sufficiently high z-score for dACC/pain, it would make no difference to them at all if the probability of pain given dACC activation was only 5%, even if there were plenty of other terms with much higher conditional probabilities? Do they expect us to believe that, in their 2003 social pain paper—where they drew a strong reverse inference that social pain shares mechanisms with physical pain based purely on observation of dACC activation (which, ironically, wasn’t even in pain-related areas of dACC—it would have made no difference to their conclusion even if they’d known conclusively that dACC activation actually only reflects pain processing 5% of the time? Such a claim is absurd on its face.
Let me summarize this section by making the following points about Neurosynth. First, it’s possible to obtain almost any posterior probability for any term given activation in any voxel, simply by adjusting the prior probability of term occurrence. Second, a choice about the prior must be made; there is no “default” setting (well, there is on the website, but that’s only because I’ve already made the choice for you). Third, the choice of prior will tend to dominate the posterior—which is to say, if you’re convinced that there’s a high (or low) prior probability that your study involves pain, then observing different patterns of brain activity will generally not do nearly as much as you might expect to change your conclusions. Fourth, this is not a Neurosynth problem, it’s a reality problem. The fundamental fact of the matter is that we simply do not know with any reasonable certainty, in any given context, what the prior probability of a particular process occuring in our subjects’ head is. Yet, without that, we have little basis for drawing any kind of reverse inference when we observe brain activity in a given study.
If all this makes you think, “oh, this seems like it would make it almost impossible in practice to draw meaningful reverse inferences in individual studies,” well, you’re not wrong.
L&E’s PNAS paper, and their reply to my last post, suggests that they don’t appreciate any of these points. The fact of the matter is that it’s impossible to draw any reverse inference about an individual study unless one is willing to talk about probabilities. L&E don’t seem to understand this, because if they did, they wouldn’t feel comfortable saying that they don’t care about effect sizes, and that z-scores provide adequate support for reverse inference claims. In fact, they wouldn’t feel comfortable making any claim about the dACC’s selectivity for pain relative to other terms on the basis of Neurosynth data.
I want to be clear that I don’t think L&E’s confusion about these issues is unusual. The reality is that many of these core statistical concepts—both frequentist and Bayesian—are easy to misunderstand, even for researchers who rely on them on a day-to-day basis. By no means am I excluding myself from this analysis; I still occasionally catch myself making similar slips when explaining what the z-scores and conditional probabilities in Neurosynth mean—and I’ve been thinking about these exact ideas in this exact context for a pretty long time! So I’m not criticizing L&E for failing to correctly understand reverse inference and its relation to Neurosynth. What I’m criticizing L&E for is writing an entire paper making extremely strong claims about functional selectivity based entirely on Neurosynth results, without ensuring that they understand the statistical underpinnings of the framework, and without soliciting feedback from anyone who might be in a position to correct their misconceptions. Personally, if I were in their position, I would move to retract the paper. But I have no control over that. All I can say is that it’s my informed opinion—as the creator of the software framework underlying all of L&E’s analyses—that the conclusions they draw in their paper are not remotely supported by any data that I’ve ever seen come out of Neurosynth.
On ‘strong’ vs. ‘weak’ selectivity
The other major problem with L&E’s paper, from my perspective, lies in their misuse of the term ‘selective’. In their response, L&E take issue with my criticism of their claim that they’ve shown the dACC to be selective for pain. They write:
Regarding the term selective, I suppose we could say there’s a strong form and a weak form of the word, with the strong form entailing further constraints on what constitutes an effect being selective. TY writes in his blog: “it’s one thing to use Neurosynth to support a loose claim like “some parts of the dACC are preferentially associated with pain”, and quite another to claim that the dACC is selective for pain, that virtually nothing else activates dACC”. The last part there gets at what TY thinks we mean by selective and what we would call the strong form of selectivity.
L&E respectively define these strong and weak forms of selectivity as follows:
Selectivitystrong: The dACC is selective for pain, if pain and only pain activates the dACC.
Selectivityweak: The dACC is selective for pain, if pain is a more reliable source of dACC activation than the other terms of interest (executive, conflict, salience).
They suggest that I accused them of claiming ‘strong’ selectivity when they were really just making the much weaker claim that dACC activation is more strongly associated with dACC activation than with other terms. I disagree with this characterization. I’ll come back to what I meant by ‘selective’ in a bit (I certainly didn’t assume anything like L&E’s strong definition). But first, let’s talk about L&E’s ‘weak’ notion of selectivity, which in my view is at odds with any common-sense understanding of what ‘selective’ means, and would have an enormously destructive effect on the field if it were to become widely used.
The fundamental problem with the suggestion that we can say dACC is pain-selective if “it’s a more reliable source of dACC activation than the other terms of interest” is that this definition provides a free pass for researchers to make selectivity claims about an extremely large class of associations, simply by deciding what is or isn’t of interest in any given instance. L&E claim to be “interested” in executive control, conflict, and salience. This seems reasonable enough; after all, these are certainly candidate functions that people have discussed at length in the literature. The problem lies with all the functions L&E don’t seem to be interested in: e.g., fear, autonomic control, or reward—three other processes that many researchers have argued the dACC is crucially involved in, and that demonstrably show robust effects in dACC in Neurosynth. If we take L&E’s definition of weak selectivity at face value, we find ourselves in the rather odd position of saying that one can use Neurosynth to claim that a region is “selective” for a particular function just as long as it’s differentiable from some other very restricted set of functions. Worse still, one apparently does not have to justify the choice of comparison functions! In their PNAS paper, L&E never explain why they chose to focus only on three particular ACC accounts that don’t show robust activation in dACC in Neurosynth, and ignored several other common accounts that do show robust activation.
If you think this is a reasonable way to define selectivity, I have some very good news for you. I’ve come up with a list of other papers that someone could easily write (and, apparently, publish in a high-profile journal) based entirely on results you can obtain from the Neurosynth websites. The titles of these papers (and you could no doubt come up with many more) include:
“The TPJ is selective for theory of mind”
“The TPJ is selective for biological motion”
“The anterior insula is selective for inhibition”
“The anterior insula is selective for orthography”
“The VMPFC is selective for autobiographical memory”
“The VMPFC is selective for valuation”
“The VMPFC is selective for autonomic control”
“The dACC is selective for fear”
“The dACC is selective for autonomic control”
“The dACC is selective for reward”
These are all interesting-sounding articles that I’m sure would drum up considerable interest and controversy. And the great thing is, as long as you’re careful about what you find “interesting” (and you don’t have to explicitly explain yourself in the paper!), Neurosynth will happily support all of these conclusions. You just need to make sure not to include any comparison terms that don’t fit with your story. So, if you’re writing a paper about the VMPFC and valuation, make sure you don’t include autobiographical memory as a control. And if you’re writing about theory of mind in the TPJ, it’s probably best to not find biological motion interesting.
Now, you might find yourself thinking, “how could it make sense to have multiple people write different papers using Neurosynth, each one claiming that a given region is ‘selective’ for a variety of different processes? Wouldn’t that sort of contradict any common-sense understanding of what the term ‘selective’ means?” My own answer would be “yes, yes it would”. But L&E’s definition of “weak selectivity”—and the procedures they use in their paper—allow for multiple such papers to co-exist without any problem. Since what counts as an “interesting” comparison condition is subjective—and, if we take L&E’s PNAS example as a model, one doesn’t even need to explicitly justify the choices one makes—there’s really nothing stopping anyone from writing any of the papers I suggested above. Following L&E’s logic, a researcher who favored a fear-based account of dACC could simply select two or three alternative processes as comparison conditions—say, sustained attention and salience—do all of the same analyses L&E did (pretending for the moment that those analyses are valid, which they aren’t), and conclude that the dACC is selective for fear. It really is that easy.
In reality, I imagine that if L&E came across an article claiming that Neurosynth shows that the dACC is selective for fear, I doubt they’d say “well, I guess the dACC is selective for fear. Good to know.” I suspect they would (quite reasonably) take umbrage at the fear paper’s failure to include pain as a comparison condition in the analysis. Yet, by their own standards, they’d have no real basis for any complaint. The fear paper’s author could simply, say, “pain’s not interesting to me,” and that would be that. No further explanation necessary.
Perhaps out of recognition that there’s something a bit odd about their definition of selectivity, L&E try to prime our intuition that their usage is consistent with the rest of the field. They point out that, in most experimental fMRI studies claiming evidence for selectivity, researchers only ever compare the target stimulus or process to a small number of candidates. For example, they cite a Haxby commentary on a paper that studied category specificity in visual cortex:
From Haxby (2006): “numerous small spots of cortex were found that respond with very high selectivity to faces. However, these spots were intermixed with spots that responded with equally high selectivity to the other three categories.”
Their point is that nobody expects ‘selective’ here to mean that the voxel in question responds to only that visual category and no other stimulus that could conceivably have been presented. In practice, people take ‘selective’ to mean “showed a greater response to the target category than to other categories that were tested”.
I agree with L&E that Haxby’s usage of the term ‘selective’ here is completely uncontroversial. The problem is, the study in question is a lousy analogy for L&E’s PNAS paper. A much better analogy would be a study that presented 10 visual categories to participants, but then made a selectivity claim in the paper’s title on the basis of a comparison between the target category and only 2 other categories, with no explanation given for excluding the other 7 categories, even though (a) some of those 7 categories were well known to also be associated with the same brain region, and (b) strong activation in response to some of those excluded categories was clearly visible in a supplementary figure. I don’t know about L&E, but I’m pretty sure that, presented with such a paper, the vast majority of cognitive neuroscientists would want to say something like, “how can you seriously be arguing that this part of visual cortex responds selectively to spheres, when you only compared spheres with faces and houses in the main text, and your supplemental figure clearly shows that the same region responds strongly to cubes and pyramids as well? Shouldn’t you maybe be arguing that this is a region specialized for geometric objects, if anything?” And I doubt anyone would be very impressed if the authors’ response to this critique was “well, it doesn’t matter what else we’re not focusing on in the paper. We said this region is sphere-selective, which just means it’s more selective than a couple of other stimulus categories people have talked about. Pyramids and cubes are basically interchangeable with spheres, right? What more do you want from us?”
I think it’s clear that there’s no basis for making a claim like “the dACC is selective for pain” when one knows full well that at least half a dozen other candidate functions all reliably activate the dACC. As I noted in my original post, the claim is particularly egregious in this case, because it’s utterly trivial to generate a ranked list of associations for over 3,000 different terms in Neurosynth. So it’s not even as if one needs to think very carefully about which conditions to include in one’s experiment, or to spend a lot of time running computationally intensive analyses. L&E were clearly aware that a bunch of other terms also activated dACC; they briefly noted as much in the Discussion of their paper. What they didn’t explain is why this observation didn’t lead them to seriously revise their framing. Given what they knew, there were at least two alternative articles they could have written that wouldn’t have violated common sense understanding of what the term ‘selective’ means. One might have been titled something like “Heterogeneous aspects of dACC are preferentially associated with pain, autonomic control, fear, reward, negative affect, and conflict monitoring”. The other might have been titled “the dACC is preferentially associated with X-related processes”—where “X” is some higher-order characterization that explains why all of these particular processes (and not others) are activated in dACC. I have no idea whether either of these papers would have made it through peer review at PNAS (or any other journal), but at the very least they wouldn’t have been flatly contradicted by Neurosynth results.
To be fair to L&E, while they didn’t justify their exlcusion of terms like fear and autonomic control in the PNAS paper, they did provide some explanation in their reply to my last post. Here’s what they say:
TY criticizes us several times for not focusing on other accounts of the dACC including fear, emotion, and autonomic processes. We agree with TY that these kind of processes are relevant to dACC function. Indeed, we were writing about the affective functions of dACC (Eisenberger & Lieberman, 2004) when the rest of the field was saying that the dACC was purely for cognitive processes (Bush, Luu, & Posner, 2000). We have long posited that one of the functions of the dACC was to sound an alarm when certain kinds of conflict arise. We think the dACC is evoked by a variety of distress-related processes including pain, fear, and anxiety. As Eisenberger (2015) wrote: “Interestingly, the consistency with which the dACC is linked with fear and anxiety is not at odds with a role for this region in physical and social pain, as threats of physical and social pain are key elicitors of fear and anxiety.” And the outputs of this alarm process are partially autonomic in nature. Thus, we don’t think of fear and autonomic accounts as in opposition to the pain account, but rather in the same family of explanations. We think this class of dACC explanations stands in contrast to the cognitive explanations that we did compare to (executive, conflict, salience). Most of this, and what is said below, is discussed in Naomi Eisenberger’s (2015) Annual Review chapter.
Essentially, their response is: “it didn’t make sense for us to include fear or autonomic control, because these functions are compatible with the underlying role we think the dACC is playing in pain”. This is not compelling, for three reasons. First, it’s a bait-and-switch. L&E’s paper isn’t titled “the dACC is selective for a family of distress-related processes”, it’s titled “the dACC is selective for pain“. One cannot publish a paper purporting to show that the dACC is selective for pain, and arguing that pain is the single best psychological characterization of its role in cognition, and then, in a section of their Discussion that they admit is the “most speculative” part of the paper, essentially say, “just kidding–we don’t think it’s really doing pain per se, we think it’s a much more general set of functions. But we don’t have any real evidence for that.”
Second, it’s highly uncharitable for L&E to spontaneously lump alternative accounts of dACC function like fear/avoidance, autonomic control, and bodily orientation in with their general “distress-related” account, because proponents of many alternative views of dACC function have been very explicit in saying that they don’t view these functions as fundamentally affective (e.g., Vogt and colleagues view posterior dACC as a premotor region). While L&E may themselves believe that pain, fear, and autonomic control in dACC all reflect some common function, that’s an extremely strong claim that requires independent evidence, and is not something that they’re entitled to simply assume. A perfectly sensible alternative is that these are actually dissociable functions with only partially overlapping spatial representations in dACC. Since the terms themselves are distinct in Neurosynth, that should be L&E’s operating assumption until they provide evidence for their stronger claim that there’s some underlying commonality. Nothing about this conclusion simply falls out of the data in advance.
Third, let me reiterate the point I made above about L&E’s notion of ‘weak selectivity’: if we take at face value L&E’s claim that fear and autonomic control don’t need to be explicitly considered because they could be interpreted alongside pain under a common account, then they’re effectively conceding that it would have made just as much sense to publish a paper titled “the dACC is selective for fear” or “the dACC is selective for autonomic control” that relegated the analysis of the term “pain” to a supplementary figure. In the paper’s body, you would find repeated assertions that the authors have shown that autonomic control is the “best general psychological account of dACC function”. When pressed as to whether this was a reasonable conclusion, the authors would presumably defend their decision to ignore pain as a viable candidate by saying things like, “well, sure pain also activates the dACC; everyone knows that. But that’s totally consistent with our autonomic control account, because pain produces autonomic outputs! So we don’t need to consider that explicitly.”
I confess to some skepticism that L&E would simply accept such a conclusion without any objection.
Before moving on, let me come full circle and offer a definition of selectivity that I think is much more workable than either of the ones L&E propose, and is actually compatible with the way people use the term ‘selective’ more broadly in the field:
Selectivityrealistic: A brain region can be said to be ‘selective’ for a particular function if it (i) shows a robust association with that function, (ii) shows a negligible association with all other readily available alternatives, and (iii) the authors have done due diligence in ensuring that the major candidate functions proposed in the literature are well represented in their analysis.
Personally, I’m not in love with this definition. I think it still allows researchers to make claims that are far too strong in many cases. And it still allows for a fair amount of subjectivity in determining what gets to count as a suitable control—at least in experimental studies where researchers necessarily have to choose what kinds of conditions to include. But I think this definition is more or less in line with the way most cognitive neuroscientists expect each other to use the term. It captures the fact that most people would feel justifiably annoyed if someone reported a “selective” effect in one condition while failing to acknowledge that 4 other unreported conditions showed the same effect. And it also captures the notion that researchers should be charitable to each other: if I publish a paper claiming that the so-called fusiform ‘face’ area is actually selective for houses, based on a study that completely failed to include a face condition, no one is going to take my claim of house selectivity seriously. Instead, they’re going to conclude that I wasn’t legitimately engaging with other people’s views.
In the context of Neurosynth—where one has 3,000 individual terms or several hundred latent topics at their disposal—this definition makes it very clear that researchers who want to say that a region is selective for something have an obligation to examine the database comprehensively, and not just to cherry-pick a couple of terms for analysis. That is what I meant when I said that L&E need to show that “virtually nothing else activates dACC”. I wasn’t saying that they have to show that no other conceivable process reliably activates the dACC (which would be impossible, as they observe), but simply that they need to show that no non-synonymous terms in the Neurosynth database do. I stand by this assertion. I see no reason why anyone should accept a claim of selectivity based on Neurosynth data if just a minute or two of browsing the Neurosynth website provides clear-cut evidence that plenty of other terms also reliably activate the same region.
To sum up, nothing L&E say in their paper gives us any reason to think that the dACC is selective for pain (even if we were to ignore all the problems with their understanding of reverse inference and allow them to claim selectivity based on inappropriate statistical tests). I submit that no definition of ‘selective’ that respects common sense usage of the term, and is appropriately charitable to other researchers, could possibly have allowed L&E to conclude that dACC activity is “selective” for pain when they knew full well that fear, autonomic control, and reward all also reliably activated the dACC in Neurosynth.
Having focused on what I view as the two overarching issues raised by L&E’s reply, I now turn to comprehensively addressing each of their specific claims. As I noted at the outset, I recognize this is going to make for slow reading. But I want to make sure I address L&E’s points clearly and comprehensively, as I feel that they blatantly mischaracterized what I said in my original post in many cases. I don’t actually recommend that anyone read this entire section linearly. I’m writing it primarily as a reference—so that if you think there were some good points L&E made in their reply to my original post, you can find those points by searching for the quote, and my response will be directly below.
Okay, let’s begin.
Tal Yarkoni (hereafter, TY), the creator of Neurosynth, has now posted a blog (here (link is external)) suggesting that pretty much all of our claims are either false, trivial, or already well-known. While this response was not unexpected, it’s disappointing because we love Neurosynth and think it’s a powerful tool for drawing exactly the kinds of conclusions we’ve drawn.
I’m surprised to hear that my response was not unexpected. This would seem to imply that L&E had some reason to worry that I wouldn’t approve of the way they were using Neurosynth, which leads me to wonder why they didn’t solicit my input ahead of time.
While TY is the creator of Neurosynth, we don’t think that means he has the last word when it comes to what is possible to do with it (nor does he make this claim). In the end, we think there may actually be a fair bit of agreement between us and TY. We do think that TY has misunderstood some of our claims (section 1 below) and failed to appreciate the significance and novelty of our actual claims (sections 2 and 4). TY also thinks we should have used different statistical analyses than we did, but his critique assumes we had a different question than the one we really had (section 5).
I agree that I don’t have the last word, and I encourage readers to consider both L&E’s arguments and mine dispassionately. I don’t, however, think that there’s a fair bit of agreement between us. Nor do I think I misunderstood L&E’s claim or failed to appreciate their significance or novelty. And, as I discuss at length both above and below, the problem is not that L&E are asking a different question than I think, it’s that they don’t understand that the methods they’re using simply can’t speak to the question they say they’re asking.
1. Misunderstandings (where we sort of probably agree)
We think a lot of the heat in TY’s blog comes from two main misunderstandings of what we were trying to accomplish. The good news (and we really hope it is good news) is that ultimately, we may actually mostly agree on both of these points once we get clear on what we mean. The two issues have to do with the use of the term “selective” and then why we chose to focus on the four categories we did (pain, executive, conflict, salience) and not others like fear and autonomic.
Misunderstanding #1: Selectivity. Regarding the term selective, I suppose we could say there’s a strong form and a weak form of the word…
I’ve already addressed this in detail at the beginning of this post, so I’ll skip the next few paragraphs and pick up here:
We mean this in the same way that Haxby and lots of others do. We never give a technical definition of selectivity in our paper, though in the abstract we do characterize our results as follows:
“Results clearly indicated that the best psychological description of dACC function was related to pain processing—not executive, conflict, or salience processing.”
Thus, the context of what comparisons our selectivity refers to is given in the same sentence, right up front in the abstract. In the end, we would have been just as happy if “selectivity” in the title was replaced with “preferentially activated”. We think this is what the weak form of selectivity entails and it is really what we meant. We stress again, we are not familiar with researchers who use the strong form of selectivity. TY’s blog is the first time we have encountered this and was not what we meant in the paper.
I strongly dispute L&E’s suggestion that the average reader will conclude from the above sentence that they’re clearly analyzing only 4 terms. Here’s the sentence in their abstract that directly precedes the one they quote:
Using Neurosynth, an automated brainmapping database [of over 10,000 functional MRI (fMRI) studies], we performed quantitative reverse inference analyses to explore the best general psychological account of the dACC function P(Ψ processjdACC activity).
It seems quite clear to me that the vast majority of readers are going to parse the title and abstract of L&E’s paper as implying a comprehensive analysis to find the best general psychological account of dACC function, and not “the best general psychological account if you only consider these 4 very specific candidates”. Indeed, I have trouble making any sense of the use of the terms “best” and “general” in this context, if what L&E meant was “a very restricted set of possibilities”. I’ll also note that in five minutes of searching the literature, I couldn’t find any other papers with titles or abstracts that make nearly as strong a claim about anterior cingulate function as L&E’s present claims about pain. So I reject the idea that their usage is par for the course. Still, I’m happy to give them the benefit of the doubt and accept that they truly didn’t realize that their wording might lead others to misinterpret their claims. I guess the good news is that, now that they’re aware of the potential confusion claims like this can cause, they will surely be much more circumspect in the titles and abstracts of their future papers.
Before moving on, we want to note that in TY’11 (i.e. the Yarkoni et al., 2011 paper announcing Neurosynth), the weak form of selectivity is used multiple times. In the caption for Figure 2, the authors refer to “regions in c were selectively associated with the term” when as far as we can tell, they are talking only about the comparison of three terms (working memory, emotion, pain). Similarly on p. 667 the authors write “However, the reverse inference map instead implicated the anterior prefrontal cortex and posterior parietal cortex as the regions that were most selectively activated by working memory tasks.” Here again, the comparison is to emotion and pain, and the authors are not claiming selectivity relative to all other psychological processes in the Neurosynth database. If it is fair for Haxby, Botvinick, and the eminent coauthors of TY’11 to use selectivity in this manner, we think it was fine for us as well.
I reject the implication of equivalence here. I think the scope of the selectivity claim I made in the figure caption in question is abundantly clear from the immediate context, and provides essentially no room for ambiguity. Who would expect, in a figure with 3 different maps, the term ‘selective’ to mean anything other than ‘for this one and not those two’? I mean, if L&E had titled their paper “pain preferentially activates the dACC relative to conflict, salience, or executive control”, and avoided saying that they were proposing the “best general account” of psychological function in dACC, I wouldn’t have taken issue with their use of the term ‘selective’ in their manuscript either, because the scope would have been equally clear. Conversely, if I had titled my 2011 paper “the dACC shows no selectivity for any cognitive process”, and said, in the abstract, something like “we show that there is no best general psychological function of the dACC–not pain, working memory, or emotion”, I would have fully expected to receive scorn from others.
That said, I’m willing to put my money where my mouth is. If a few people (say 5) write in to say (in the comments below, on twitter, or by email) that they took the caption in Figure 2 of my 2011 paper to mean anything other than “of these 3 terms, only this one showed an effect”, I’ll happily send the journal a correction. And perhaps, L&E could respond in kind by commiting to changing the title of their manuscript to something like “the dACC is preferentially active for pain relative to conflict, salience or executive control” if 5 people write in to say that they interpreted L&E’s claims as being much more global than L&E suggest they are. I encourage readers to use the comments below to clarify how they understood both of these selectivity claims.
We would also point readers to the fullest characterization of the implication of our results on p. 15253 of the article:
“The conclusion from the Neurosynth reverse inference maps is unequivocal: The dACC is involved in pain processing. When only forward inference data were available, it was reasonable to make the claim that perhaps dACC was not involved in pain per se, but that pain processing could be reduced to the dACC’s “real” function, such as executive processes, conflict detection, or salience responses to painful stimuli. The reverse inference maps do not support any of these accounts that attempt to reduce pain to more generic cognitive processes.”
We think this claim is fully defensible and nothing in TY’s blog contradicts this. Indeed, he might even agree with it.
This claim does indeed seem to me largely unobjectionable. However, I’m at a loss to understand how the reader is supposed to know that this one very modest sentence represents “the fullest characterization” of the results in a paper replete with much stronger assertions. Is the reader supposed to, upon reading this sentence, retroactively ignore all of the other claims—e.g., the title itself, and L&E’s repeated claim throughout the paper that “the best psychological interpretation of dACC activity is in terms of pain processes”?
*Misunderstanding #2: We did not focus on fear, emotion, and autonomic accounts*. TY criticizes us several times for not focusing on other accounts of the dACC including fear, emotion, and autonomic processes. We agree with TY that these kind of processes are relevant to dACC function. Indeed, we were writing about the affective functions of dACC (Eisenberger & Lieberman, 2004) when the rest of the field was saying that the dACC was purely for cognitive processes (Bush, Luu, & Posner, 2000). We have long posited that one of the functions of the dACC was to sound an alarm when certain kinds of conflict arise. We think the dACC is evoked by a variety of distress-related processes including pain, fear, and anxiety. As Eisenberger (2015) wrote: “Interestingly, the consistency with which the dACC is linked with fear and anxiety is not at odds with a role for this region in physical and social pain, as threats of physical and social pain are key elicitors of fear and anxiety.” And the outputs of this alarm process are partially autonomic in nature. Thus, we don’t think of fear and autonomic accounts as in opposition to the pain account, but rather in the same family of explanations. We think this class of dACC explanations stands in contrast to the cognitive explanations that we did compare to (executive, conflict, salience). Most of this, and what is said below, is discussed in Naomi Eisenberger’s (2015) Annual Review chapter.
I addressed this in detail above, in the section on “selectivity”.
We speak to some but not all of this in the paper. On p. 15254, we revisit our neural alarm account and write “Distress-related emotions (“negative affect” “distress” “fear”) were each linked to a dACC cluster, albeit much smaller than the one associated with “pain”.” While we could have said more explicitly that pain is in this distress-related category, we have written about this several times before and assumed this would be understood by readers.
There is absolutely no justification for assuming this. The community of people who might find a paper titled “the dorsal anterior cingulate cortex is selective for pain” interesting is surely at least an order of magnitude larger than the community of people who are familiar with L&E’s previous work on distress-related emotions.
So why did we focus on executive, conflict, and salience? Like most researchers, we are the products of our early (academic) environment. When we were first publishing on social pain, we were confused by the standard account of dACC function. A half century of lesion data and a decade of fMRI studies of pain pointed towards more evidence of the dACC’s involvement in distress-related emotions (pain & anxiety), yet every new paper about the dACC’s function described it in cognitive terms. These cognitive papers either ignored all of the pain and distress findings for dACC or they would redescribe pain findings as reducible to or just an instance of something more cognitive.
When we published our first social pain paper, the first rebuttal paper suggested our effects were really just due to “expectancy violation” (Somerville et al., 2006), an account that was later invalidated (Kawamoto 2012). Many other cognitive accounts have also taken this approach to physical pain (Price 2000; Vogt, Derbyshire, & Jones, 2006).
Thus for us, the alternative to pain accounts of dACC all these years were conflict detection and cognitive control explanations. This led to the focus on the executive and conflict-related terms. In more recent years, several papers have attempted to explain away pain responses in the dACC as nothing more than salience processes (e.g Iannetti’s group) that have nothing to do with pain, and so salience became a natural comparison as well. We haven’t been besieged with papers saying that pain responses in the dACC are “nothing but” fear or “nothing but” autonomic processes, so those weren’t the focus of our analyses.
This is a informative explanation of L&E’s worldview and motivations. But it doesn’t justify ignoring numerous alternative accounts whose proponents very clearly don’t agree with L&E that their views can be explained away as “distress-related”. If L&E had written a paper titled “salience is not a good explanation of dACC function,” I would have happily agreed with their conclusion here. But they didn’t. They wrote a paper explicitly asserting that pain is the best psychological characterization of the dACC. They’re not entitled to conclude this unless they compare pain properly with a comprehensive set of other possible candidates—not just the ones that make pain look favorable.
We want to comment further on fear specifically. We think one of the main reasons that fear shows up in the dACC is because so many studies of fear use pain manipulations (i.e. shock administration) in the process of conditioning fear responses. This is yet another reason that we were not interested in contrasting pain and fear maps. That said, if we do compare the Z-scores in the same eight locations we used in the PNAS paper, the pain effect has more accumulated evidence than fear in all seven locations where there is any evidence for pain at all.
This is a completely speculative account, and no evidence is provided for it. Worse, it’s completely invertible: one could just as easily say that pain shows up in the dACC because it invariably produces fear, or because it invariably elicits autonomic changes (frankly, it seems more plausible to me that pain almost always generates fear than that fear is almost always elicited by pain). There’s no basis for ruling out these other candidate functions a priori as being more causally important. This is simply question-begging.
Its interesting to us that TY does not in principle seem to like us trying to generate some kind of unitary account of dACC writing “There’s no reason why nature should respect our human desire for simple, interpretable models of brain function.” Yet, TY then goes on to offer a unitary account more to his liking. He highlights Vogt’s “four-region” model of the cingulate writing “I’m especially partial to the work of Brent Vogt…”. In Vogt’s model, the aMCC appears to be largely the same region as what we are calling dACC. Although the figure shown by TY doesn’t provide anatomical precision, in other images, Vogt shows the regions with anatomical boundaries. Rotge et al. (2015) used such an image from Vogt (2009) to estimate the boundaries of aMCC as spanning 4.5 ≤ y ≤ 30 which is very similar to our dACC anterior/posterior boundaries of 0 ≤ y ≤ 30) (see Figure below). Vogt ascribes the function of avoidance behavior to this region – a pretty unitary description of the region that TY thinks we should avoid unitary descriptions of.
There is no charitable way to put it: this is nothing short of a gross misrepresentation of what I said about the Vogt account. As a reminder, here’s what I actually wrote in my post:
I’m especially partial to the work of Brent Vogt and colleagues (e.g., Vogt (2005); Vogt & Sikes, 2009), who have suggested a division within the anterior mid-cingulate cortex (aMCC; a region roughly co-extensive with the dACC in L&E’s nomenclature) between a posterior region involved in bodily orienting, and an anterior region associated with fear and avoidance behavior (though the two functions overlap in space to a considerable degree) … the Vogt characterization of dACC/aMCC … fits almost seamlessly with the Neurosynth results displayed above (e.g., we find MCC activation associated with pain, fear, autonomic, and sensorimotor processes, with pain and fear overlapping closely in aMCC). Perhaps most importantly, Vogt and colleagues freely acknowledge that their model—despite having a very rich neuroanatomical elaboration—is only an approximation. They don’t attempt to ascribe a unitary role to aMCC or dACC, and they explicitly recognize that there are distinct populations of neurons involved in reward processing, response selection, value learning, and other aspects of emotion and cognition all closely interdigitated with populations involved in aspects of pain, touch, and fear. Other systems-level neuroanatomical models of cingulate function share this respect for the complexity of the underlying circuitry—complexity that cannot be adequately approximated by labeling the dACC simply as a pain region (or, for that matter, a “survival-relevance” region).
I have no idea how L&E read this and concluded that I was arguing that we should simply replace the label “pain” with “fear”. I don’t feel the need to belabor the point further, because I think what I wrote is quite clear.
In the end though, if TY prefers a fear story to our pain story, we think there is some evidence for both of these (a point we make in our PNAS paper). We think they are in a class of processes that overlap both conceptually (i.e. distress-related emotions) and methodologically (i.e. many fear studies use pain manipulations to condition fear).
No, I don’t prefer a fear story. My view (which should be abundantly clear from the above quote) is that both a fear story and a pain story would be gross oversimplifications that shed more heat than light. I will, however, reiterate my earlier point (which L&E never responded to), which is that their PNAS paper provides no reason at all to think that the dACC is involved in distress-related emotion (indeed, they explicitly said that this was the most speculative part of the paper). If anything, the absence of robust dACC activation for terms like ‘disgust’, ’emotion’, and ‘social’ would seem to me like pretty strong evidence against a simplistic model of this kind. I’m not sure why L&E are so resistant to the idea that maybe, just maybe, the dACC is just too big a region to attach a single simple label to. As far as I can tell, they provide no defense of this assumption in either their paper or their reply.
After focusing on potential misunderstandings we want to turn to our first disagreement with TY. Near the end of his blog, TY surprised us by writing that the following conclusions can be reasonably drawn from Neurosynth analyses:
* “There are parts of dACC (particularly the more posterior aspects) that are preferentially activated in studies involving painful stimulation.”
* “It’s likely that parts of dACC play a greater role in some aspect of pain processing than in many other candidate processes that at various times have been attributed to dACC (e.g., monitoring for cognitive conflict)”
Our first response was ‘Wow. After pages and pages of criticizing our paper, TY pretty much agrees with what we take to be the major claims of our paper. Yes, his version is slightly watered down from what we were claiming, but these are definitely in the ballpark of what we believe.’
L&E omitted my third bullet point here, which was that “Many of the same regions of dACC that preferentially activate during pain are also preferentially activated by other processes or tasks—e.g., fear conditioning, autonomic arousal, etc.” I’m not sure why they left it out; they could hardly disagree with it either, if they want to stand by their definition of “weak selectivity”.
I’ll leave it to you to decide whether or not my conclusions are really just “watered down” versions “in the ballpark” of the major claims L&E make in their paper.
But then TY’s next statement surprised us in a different sort of way. He wrote
“I think these are all interesting and potentially important observations. They’re hardly novel…”.
We’ve been studying the dACC for more than a decade and wondered what he might have meant by this. We can think of two alternatives for what he might have meant:
* That L&E and a small handful of others have made this claim for over a decade (but clearly not with the kind of evidence that Neurosynth provides).
* That TY already used Neurosynth in 2011 to show this. In the blog, he refers to this paper writing “We explicitly noted that there is preferential activation for pain in dACC”.
I’m not sure what was confusing about what I wrote. Let’s walk through the three bullet points. The first one is clearly not novel. We’ve known for many years that many parts of dACC are preferentially active when people experience painful stimulation. As I noted in my last post, L&E explicitly appealed to this literature over a decade ago in their 2003 social pain paper. The second one is also clearly not novel. For example, Vogt and colleagues (among others) have been arguing for at least two decades now that the posterior aspects of dACC support pain processing in virtue of their involvement in processes (e.g., bodily orientation) that clearly preclude most higher cognitive accounts of dACC. The third claim isn’t novel either, as there has been ample evidence for at least a decade now that virtually every part of dACC that responds to painful stimulation also systematically responds to other non-nociceptive stimuli (e.g., the posterior dACC responds to non-painful touch, the anterior to reward, etc.). I pointed to articles and textbooks comprehensively reviewing this literature in my last post. So I don’t understand L&E’s surprise. Which of these three claims do they think is actually novel to their paper?
In either case, “they’re hardly novel” implies this is old news and that everyone knows and believes this, as if we’re claiming to have discovered that most people have two eyes, a nose, and a mouth. But this implication could not be further from the truth.
No, that’s not what “hardly novel” implies. I think it’s fair to say that the claim that social pain is represented in the dACC in virtue of representations shared with physical pain is also hardly novel at this point, yet few people appear to know and believe it. I take ‘hardly novel’ to mean “it’s been said before multiple times in the published literature.”
There is a 20+ year history of researchers ignoring or explaining away the role of pain processing in dACC.
I’ll address the “explained away” part of this claim below, but it’s completely absurd to suggest that researchers have ignored the role of pain processing in dACC for 20 years. I don’t think I can do any better than link to Google Scholar, where the reader is invited to browse literally hundreds of articles that all take it as an established finding that the dACC is important for pain processing (and many of which have hundreds of citations from other articles).
When pain effects are mentioned in most papers about the function of dACC, it is usually to say something along the lines of ‘Pain effects in the dACC are just one manifestation of the broader cognitive function of conflict detection (or salience or executive processes)’. This long history is indisputable. Here are just a few examples (and these are all reasonable accounts of dACC function in the absence of reverse inference data):
* Executive account: Price’s 2000 Science paper on the neural mechanisms of pain assigns to the dACC the roles of “directing attention and assigning response priorities”
* Executive account: Vogt et al. (1996) says the dACC “is not a ‘pain centre’” and “is involved in response selection” and “response inhibition or visual guidance of responses”
* Conflict account: Botvinick et al. (2004) wrote that “the ACC might serve to detect events or internal states indicating a need to shift the focus of attention or strengthen top-down control (, see also ), an idea consistent, for example, with the fact that the ACC responds to pain ” (Botvinick et al. 2004)
* Salience account: Iannetti suggests the ‘pain matrix’ is a myth and in Legrain et al. (2011) suggests that the dACC’s responses to pain “could mainly reflect brain processes that are not directly related to the emergence of pain and that can be engaged by sensory inputs that do not originate from the activation of nociceptors.”
I’m not really sure what to make of this argument either. All of these examples clearly show that even proponents of other theories of dACC function are well aware of the association with pain, and don’t dispute it in any way. So L&E’s objection can’t be that other people just don’t believe that the dACC supports pain processing. Instead, L&E seem to dislike the idea that other theorists have tried to “explain away” the role of dACC in pain by appealing to other mechanisms. Frankly, I’m not sure what the alternative to such an approach could possibly be. Unless L&E are arguing that dACC is the neural basis of an integrated, holistic pain experience (whatever such a thing might mean), there presumably must be some specific computational operations going on within dACC that can be ascribed a sensible mechanistic function. I mean, even L&E themselves don’t take the dACC to be just about, well, pain. Their whole “distress-related emotion” story is itself intended to explain what it is that dACC actually does in relation to pain (since pretty much everyone accepts that the sensory aspects of pain aren’t coded in dACC).
The only way I can make sense of this “explained away” concern is if what L&E are actually objecting to is the fact that other researchers have disagreed or ignored their particular story about what the dACC does in pain—i.e., L&E’s view that the dACC role in pain is derived from distress-related emotion. As best I can tell, what bothers them is that other researchers fundamentally disagree with–and hence, don’t cite–their “distress-related emotion” account. Now, maybe this irritation is justified, and there’s actually an enormous amount of evidence out there in favor of the distress account that other researchers are willfully ignoring. I’m not qualified to speak to that (though I’m skeptical). What I do feel qualified to say is that none of the Neurosynth results L&E present in their paper make any kind of case for an affective account of pain processing in dACC. The most straightforward piece of evidence for that claim would be if there were a strong overlap between pain and negative affect activations in dACC. But we just don’t see this in Neurosynth. As L&E themselves acknowledge, the peak sectors of pain-related activation in dACC are in mid-to-posterior dACC, and affect-related terms only seem to reliably activate the most anterior aspects.
To be charitable to L&E, I do want to acknowledge one valuable point that they contribute here, which is that it’s clear that dACC function cannot be comprehensively explained by, say, a salience account or a conflict monitoring account. I think that’s a nice point (though I gather that some people who know much more about anatomy than I do are in the process of writing rebuttals to L&E that argue it’s not as nice as I think it is). The problem is, this argument can be run both ways. Meaning, much as L&E do a nice job showing that conflict monitoring almost certainly can’t explain activations in posterior dACC, the very maps they show make it clear that pain can’t explain all the other activations in anterior dACC (for reward, emotion, etc.). Personally, I think the sensible conclusion one ought to take away from all this is “it’s really complicated, and we’re not going to be able to neatly explain away all of dACC function with a single tidy label like ‘pain’.” L&E draw a different conclusion.
But perhaps this approach to dACC function has changed in light of TY’11 findings (i.e. Yarkoni et al. 2011). There he wrote “For pain, the regions of maximal pain-related activation in the insula and DACC shifted from anterior foci in the forward analysis to posterior ones in the reverse analysis.” This hardly sounds like a resounding call for a different understanding of dACC that involves an appreciation of its preferential involvement in pain.
Right. It wasn’t a resounding call for a different understanding of dACC, because it wasn’t a paper about the dACC—a brain region I lack any deep interest in or knowledge of—it was a paper about Neurosynth and reverse inference.
Here are quotes from other papers showing how they view the dACC in light of TY’11:
* Poldrack (2012) “The striking insight to come from analyses of this database (Yarkoni et al., in press) is that some regions (e.g., anterior cingulate) can show high degrees of activation in forward inference maps, yet be of almost no use for reverse inference due to their very high base rates of activation across studies”
* Chang, Yarkoni et al. (2012) “the ACC tends to show substantially higher rates of activation than other regions in neuroimaging studies (Duncan and Owen 2000; Nelson et al. 2010; Yarkoni et al. 2011), which has lead some to conclude that the network is processing goal-directed cognition (Yarkoni et al. 2009)”
* Atlas & Wager (2012) “In fact, the regions that are reliably modulated (insula, cingulate, and thalamus) are actually not specific to pain perception, as they are activated by a number of processes such as interoception, conflict, negative affect, and response inhibition”
I won’t speak for papers I’m not an author on, but with respect to the quote from the Chang et al paper, I’m not sure what L&E’s point actually is. In Yarkoni et al. (2009), I argued that “effort” might be a reasonable generic way to characterize the ubiquitous role of the frontoparietal “task-positive” network in cognition. I mistakenly called the region in question ‘dACC’ when I should have said ‘preSMA’. I already gave L&E deserved credit in my last post for correcting my poor knowledge of anatomy. But I would think that, if anything, the fact that I was routinely confusing these terms circa 2011 should lead L&E to conclude that maybe I don’t know or care very much about the dACC, and not that I’m a proud advocate for a strong theory of dACC function that many other researchers also subscribe to. I think L&E give me far too much credit if they think that my understanding of the dACC in 2011 (or, for that matter, now) is somehow representative of the opinions of experts who study that region.
Perhaps the reason why people who cite TY’11 in their discussion of dACC didn’t pay much attention to the above quote from TY’11 (““For pain, the regions of maximal pain-related…”) was because they read and endorsed the following more direct conclusion that followed “…because the dACC is activated consistently in all of these states [cognitive control, pain, emotion], its activation may not be diagnostic of any one of them” (bracketed text added). If this last quote is taken as TY’11’s global statement regarding dACC function, then it strikes us still as quite novel to assert that the dACC is more consistently associated with one category of processes (pain) than others (executive, conflict, and salience processes).
I don’t think TY’11 makes any ‘global statement regarding dACC function’, because TY’11 was a methodological paper about the nature of reverse inference, not a paper about grand models of dACC function. As for the quote L&E reproduce, here’s the full context:
These results showed that without the ability to distinguish consistency from selectivity, neuroimaging data can produce misleading inferences. For instance, neglecting the high base rate of DACC activity might lead researchers in the areas of cognitive control, pain and emotion to conclude that the DACC has a key role in each domain. Instead, because the DACC is activated consistently in all of these states, its activation may not be diagnostic of any one of them and conversely, might even predict their absence. The NeuroSynth framework can potentially address this problem by enabling researchers to conduct quantitative reverse inference on a large scale.
I stand by everything I said here, and I’m not sure what L&E object to. It’s demonstrably true if you look at Figure 2 in TY’11 that pain, emotion, and cognitive control all robustly activate the dACC in the forward inference map, but not in the reverse inference maps. The only sense I can make of L&E’s comment is if they’re once again conflating z-scores with probabilities, and assuming that the presence of significant activation for pain means that dACC is in fact diagnostic for pain. But, as I showed much earlier in this post, that would betray very deep misunderstanding of what the reverse inference maps generated by Neurosynth mean. There is absolutely no basis for concluding, in any individual study, that people are likely to be perceiving pain just because the dACC is active.
In the article, we showed forward and reverse inference maps for 21 terms and then another 9 in the supplemental materials. These are already crowded busy figures and so we didn’t have room to show multiple slices for each term. Fortunately, since Neurosynth is easily accessible (go check it out now at neurosynth.org – its awesome!) you can look at anything we didn’t show you in the paper. Tal takes us to task for this.
He then shows a bunch of maps from x=-8 to x=+8 on a variety of terms. Many of these terms weren’t the focus of our paper because we think they are in the same class of processes as pain (as noted above). So it’s no surprise to us that terms such as ‘fear,’ ‘empathy,’ and ‘autonomic’ produce dACC reverse inference effects. In the paper, we reported that ‘reward’ does indeed produce reverse inference effects in the anterior portion of the dACC (and show the figure in the supplemental materials), so no surprise there either. Then at the bottom he shows cognitive control, conflict, and inhibition which all show very modest footprints in dACC proper, as we report in the paper.
Once again: L&E are not entitled to exclude a large group of viable candidate functions from their analysis simply because they believe that they’re “in the same class of [distress-related affect] processes” (a claim that many people, including me, would dispute). If proponents of the salience monitoring view wrote a Neurosynth-based paper neglecting to compare salience with pain because “pain is always salient, so it’s in the same class of salience-related processes”, I expect that L&E would not be very happy about it. They should show others the same charity they themselves would expect.
But in any case, if it’s not surprising to L&E that reward, fear, and autonomic control all activate the dACC, then I’m at a loss to understand why they didn’t title the paper something like “the dACC is selectively involved in pain, reward, fear, and autonomic control”. That would have much more accurately represented the results they report, and would be fully consistent with their notion of “weak selectivity”.
There are two things that make the comparison of what he shows and what we reported in the paper not a fair comparison. First, his maps are thresholded at p<.001 and yet all the maps that we report use Neurosynth’s standard, more conservative, FDR criterion of p<.01 (a standard TY literally set). Here, TY is making a biased, apples-to-oranges comparison by juxtaposing the maps at a much more liberal threshold than what we did. Given that each of the terms we were interested in (pain, executive, conflict, salience) had more than 200 studies in the database its not clear why TY moved from FDR to uncorrected maps here.
The reason I used a threshold of p < .001 for this analysis is because it’s what L&E themselves used:
In addition, we used a threshold of Z > 3.1, P < 0.001 as our threshold for indicating significance. This threshold was chosen instead of Neurosynth’s more strict false discovery rate (FDR) correction to maximize the opportunity for multiple psychological terms to “claim” the dACC.
This is a sensible thing to do here, because L&E are trying to accept the null of no effect (or at least, it’s more sensible than applying a standard, conservative correction). Accepting the null hypothesis because an effect fails to achieve significance is the cardinal sin of null hypothesis significance testing, so there’s no real justification for doing what L&E are trying to do. But if you are going to accept the null, it at least behooves you to use a very liberal threshold for your analysis. I’m not sure why it’s okay for L&E to use a threshold of p < .001 but not for me to do the same (and for what it’s worth, I think p < .001 is still an absurdly conservative cut-off given the context).
Second, the Neurosynth database has been updated since we did our analyses. The number of studies in the database has only increased by about 5% (from 10,903 to 11,406 studies) and yet there are some curious changes. For instance, fear shows more robust dACC now than it did a few months ago even though it only increased from 272 studies to 298 studies.
Although the number of studies has nominally increased by only 5%, this actually reflects the removal of around 1,000 studies as a result of newer quality control heuristics, and the addition of around 1,500 new studies. So it should not be surprising if there are meaningful differences between the two. In any case, it seems odd for L&E to use the discrepancy between old and new versions of the database as a defense of their findings, given that the newer results are bound to be more accurate. If L&E accept that there’s a discrepancy, perhaps what they should be saying is “okay, since we used poorer data for our analyses than what Neurosynth currently contains, we should probably re-run our analyses and revise our conclusions accordingly”.
We were more surprised to discover that the term ‘rejection’ has been removed from the Neurosynth database altogether such that it can no longer be used as a term to generate forward and reverse inference maps (even though it was in the database prior to the latest update).
This claim is both incorrect and mildly insulting. It’s incorrect because the term “rejection” hasn’t been in the online Neurosynth database for nearly two years, and was actually removed three updates ago. And it’s mildly insulting, because all L&E had to do to verify the date at which rejection was removed, as well as understand why, was visit the Neurosynth data repository and inspect the different data releases. Failing that, they could have simply asked me for an explanation, instead of intimating that there are “curious” changes. So let me take this opportunity to remind L&E and other readers that the data displayed on the Neurosynth website are always archived on GitHub. If you don’t like what’s on the website at any given moment, you can always reconstruct the database based on an earlier snapshot. This can be done in just a few lines of Python code, as the IPython notebook I linked to last time illustrates.
As to why the term “rejection” disappeared: in April 2014, I switched from a manually curated set of 525 terms (which I had basically picked entirely subjectively) to the more comprehensive and principled approach of including all terms that passed a minimum frequency threshold (i.e., showing up in at least 60 unique article abstracts). The term “rejection” was not frequent enough to survive. I don’t make decisions about individual terms on a case-by-case basis (well, not since April 2014, anyway), and I certainly hope L&E weren’t implying that I pulled the ‘rejection’ term in response to their paper or any of their other work, because, frankly, they would be giving themselves entirely too much credit.
Anyway, since L&E seem concerned with the removal of ‘rejection’ from Neurosynth, I’m happy to rectify that for them. Here are two maps for the term “rejection” (both thresholded at voxel-wise p < .001, uncorrected):
The first map is from the last public release (March 2013) that included “rejection” as a feature, and is probably what L&E remember seeing on the website (though, again, it hasn’t been online since 2014). It’s based on 33 studies. The second map is the current version of the map, based on 52 studies. The main conclusion I personally would take away from both of these maps is that there’s not enough data here to say anything meaningful, because they’re both quite noisy and based on a small number of studies. This is exactly why I impose a frequency cut-off for all terms I put online.
That said, if L&E would like to treat these “rejection” analyses as admissible evidence, I think it’s pretty clear that these maps actually weigh directly against their argument. In both cases, we see activation in pain-related areas of dACC for the forward inference analysis but not for the reverse. Interestingly, we do see activation in the most anterior part of dACC in both cases. This seems to me entirely consistent with the argument many people have made that subjective representations of emotion (including social pain) are to be found primarily in anterior medial frontal cortex, and that posterior dACC activations for pain have much more to do with motor control, response selection, and fear than with anything affective.
Given that Neurosynth is practically a public utility and federally funded, it would be valuable to know more about the specific procedures that determine which journals and articles are added to the database and on what schedule. Also, what are the conditions that can lead to terms being removed from the database and what are the set of terms that were once included that have now been removed.
I appreciate L&E’s vote of confidence (indeed, I wish that I believed Neurosynth could do half of what they claim it can do). As I’ve repeatedly said in this post and the last one, I’m happy to answer any questions L&E have about Neurosynth methods (preferably on the mailing list, which is publicly archived and searchable). But to date, they haven’t asked me any. I’ll also reiterate that it would behoove L&E to check the data repository on GitHub (which is linked to from the neurosynth.org portal) before they conclude that the information they want isn’t already publicly accessible (because most of it is).
In any event, we did not cherry pick data. We used the data that was available to us as of June 2015 when we wrote the paper. For the four topics of interest, below we provide more representative views of the dACC, thresholded as typical Neurosynth maps are, at FDR p<.01. We’ve made the maps nice and big so you can see the details and have marked in green the dACC region on the different slices (the coronal slice are at y=14 and y=22). When you look at these, we think they tell the same story we told in the paper.
I’m not sure what the point here is. I was not suggesting that L&E were lying; I was arguing that (a) visual inspection of a few slices is no way to make a strong argument about selectivity; (b) the kinds of analyses L&E report are a statistically invalid way to draw the conclusion they are trying to draw, and (c) even if we (inappropriately) use L&E’s criteria, analyses done with more current data clearly demonstrate the presence of plenty of effects for terms other than pain. L&E dispute the first two points (which we’ll come back to), but they don’t seem to contest the last. This seems to me like it should lead L&E to the logical conclusion that they should change their conclusions, since newer and better data are now available that clearly produce different results given the same assumptions.
(I do want to be clear again that I don’t condone L&E’s analyses, which I show above and below in detail simply don’t support their conclusions. I was simply pointing out that even by their own criteria, Neurosynth results don’t support their claims.)
4. Surprising lack of appreciation for what the reverse inference maps show in pretty straightforward manner.
Let’s start with pain and salience. Iannetti and his colleagues have made quite a bit of hay the last few years saying that the dACC is not involved in pain, but rather codes for salience. One of us has critiqued the methods of this work elsewhere (Eisenberger, 2015, Annual Review). The reverse inference maps above show widespread robust reverse inference effects throughout the dACC for pain and not a single voxel for salience. When we ran this initially for the paper, there were 222 studies tagged for the term salience and now that number is up to 269 and the effects are the same.
Should our tentative conclusion be that we should hold off judgment until there is more evidence? TY thinks so: “If some terms have too few studies in Neurosynth to support reliable comparisons with pain, the appropriate thing to do is to withhold judgment until more data is available.” This would be reasonable if we were talking about topics with 10 or 15 studies in the database. But, there are 269 studies for the term salience and yet there is nothing in the dACC reverse inference maps. I can’t think of anyone who has ever run a meta-analysis of anything with 250 studies, found no accumulated evidence for an effect and then said “we should withhold judgment until more data is available”.
This is another gross misrepresentation of what I said in my commentary. So let me quote what I actually said. Here’s the context:
While it’s true that terms with fewer associated studies will have more variable (i.e., extreme) posterior probability estimates, this is an unavoidable problem that isn’t in any way remedied by focusing on z-scores instead of posterior probabilities. If some terms have too few studies in Neurosynth to support reliable comparisons with pain, the appropriate thing to do is to withhold judgment until more data is available. One cannot solve the problem of data insufficiency by pretending that p-values or z-scores are measures of effect size.
This is pretty close to the textbook definition of “quoting out of context”. It should be abundantly clear that I was not saying that L&E shouldn’t interpret results from a Neurosynth meta-analysis of 250 studies (which would be absurd). The point of the above quote was that if L&E don’t like the result they get when they conduct meta-analytic comparisons properly with Neurosynth, they’re not entitled to replace the analysis with a statistically invalid procedure that does give results they like.
TY and his collaborators have criticized researchers in major media outlets (e.g. New York Times) for poor reverse inference – for drawing invalid reverse inference conclusions from forward inference data. The analyses we presented suggest that claims about salience and the dACC are also based on unfounded reverse inference claims. One would assume that TY and his collaborators are readying a statement to criticize the salience researchers in the same way they have previously.
This is another absurd, and frankly insulting, comparison. My colleagues and I have criticized people for saying that insula activation is evidence that people are in love with their iPhones. I certainly hope that this is in a completely different league from inferring that people must be experiencing pain if the dACC is activated (because if not, some of L&E’s previous work would appear to be absurd on its face). For what it’s worth, I agree with L&E that nobody should interpret dACC activation in a study as strong evidence of “salience”—and, for that matter, also of “pain”. As for why I’m not readying a statement to criticize the salience researchers, the answer is that it’s not my job to police the ACC literature. My interest is in making sure Neurosynth is used appropriately. L&E can rest assured that if someone published an article based entirely on Neurosynth results in which their primary claim was that the dACC is selective for salience, I would have written precisely the same kind of critique. Though it should perhaps concern them that, of the hundreds of published uses of Neurosynth to date, theirs is the first and only one that has moved me to write a critical commentary.
But no. Nowhere in the blog does TY comment on this finding that directly contradicts a major current account of the dACC. Not so much as a “Geez, isn’t it crazy that so many folks these days think the dACC and AI can be best described in terms of salience detection and yet there is no reverse inference evidence at all for this claim.”
Once again: I didn’t comment on this because I’m not interested in the dACC; I’m interested in making sure Neurosynth is used appropriately. If L&E had asked me, “hey, do you think Neurosynth supports saying that dACC activation is a good marker of ‘salience’?”, I would have said “no, of course not.” But L&E didn’t write a paper titled “dACC activity should not be interpreted as a marker of salience”. They wrote a paper titled “the dACC is selective for pain”, in which they argue that pain is the best psychological characterization of dACC—a claim that Neurosynth simply does not support.
For the terms executive and conflict, our Figure 3 in the PNAS paper shows a tiny bit of dACC. We think the more comprehensive figures we’ve included here continue to tell the same story. If someone wants to tell the conflict story of why pain activates the dACC, we think there should be evidence of widespread robust reverse inference mappings from the dACC to conflict. But the evidence for such a claim just isn’t there. Whatever else you think about the rest of our statistics and claims, this should give a lot of folks pause, because this is not what almost any of us would have expected to see in these reverse inference maps (including us).
No objections here.
If you generally buy into Neurosynth as a useful tool (and you should), then when you look at the four maps above, it should be reasonable to conclude, at least among these four processes, that the dACC is much more involved in that first one (i.e. pain). Let’s test this intuition in a new thought experiment.
Imagine you were given the three reverse inference maps below and you were interested in the function of the occipital cortex area marked off with the green outline. You’d probably feel comfortable saying the region seems to have a lot more to do with Term A than Terms B or C. And if you know much about neuroanatomy, you’d probably be surprised, and possibly even angered, when I tell you that Term A is ‘motor’, Term B is ‘engaged’, and Term C is ‘visual’. How is this possible since we all know this region is primarily involved in visual processes? Well it isn’t possible because I lied. Term A is actually ‘visual’ and Term C is ‘motor’. And now the world makes sense again because these maps do indeed tell us that this region is widely and robustly associated with vision and only modestly associated with engagement and motor processes. The surprise you felt, if you believed momentarily that Term A was motor was because you have the same intuition we do that these reverse inference maps tell us that Term A is the likely function of this region, not Term B or Term C – and we’d like that reverse inference to be what we always thought this region was associated with – vision. It’s important to note that while a few voxels appear in this region for Terms B and C, it still feels totally fine to say this region’s psychological function can best be described as vision-related. It is the widespread robust nature of the effect in Term A, relative to the weak and limited effects of Terms B and C, that makes this a compelling explanation of the region.
I’m happy to grant L&E that it may “feel totally fine” to some people to make a claim like this. But this is purely an appeal to intuition, and has zero bearing on the claim’s actual validity. I hope L&E aren’t seriously arguing that cognitive neuroscientists should base the way we do statistical inference on our intuitions about what “feels totally fine”. I suspect it felt totally fine to L&E to conclude in 2003 that people were experiencing physical pain because the dACC was active, even though there was no evidential basis for such a claim (and there still isn’t). Recall that, in surveys of practicing researchers, a majority of respondents routinely endorse the idea that a p-value of .05 means that that there’s at least a 95% probability that the alternative hypothesis is correct (it most certainly doesn’t mean this). Should we allow people to draw clearly invalid conclusions in their publications on the grounds that it “feels right” to them? Indeed, as I show below, L&E’s arguments for selectivity rest in part on an invalid acceptance of the null hypothesis. Should they be given a free pass on what is probably the cardinal sin of NHST, on the grounds that it probably “felt right” to them to equate non-significance with evidence of absence?
The point of Neurosynth is that it provides a probabilistic framework for understanding the relationship between psychological function and brain activity. The framework has many very serious limitations that, in practice, make it virtually impossible to draw any meaningful reverse inference from observed patterns of brain activity in any individual study. If L&E don’t like this, they’re welcome to build their own framework that overcomes the limitations of Neurosynth (or, they could even help me improve Neurosynth!). But they don’t get to violate basic statistical tenets in favor of what “feels totally fine” to them.
Another point of this thought experiment is that if Term A is what we expect it to be (i.e. vision) then we can keep assuming that Neurosynth reverse inference maps tell us something valuable about the function of this region. But if Term A violates our expectation of what this region does, then we are likely to think about the ways in which Neurosynth’s results are not conclusive on this point.
We suspect if the dACC results had come out differently, say with conflict showing wide and robust reverse inference effects throughout the dACC, and pain showing little to nothing in dACC, that most of our colleagues would have said “Makes sense. The reverse inference map confirms what we thought – that dACC serves a general cognitive function of detecting conflicts.” We think it is because of the content of the results rather than our approach that is likely to draw ire from many.
I can’t speak for L&E’s colleagues, but my own response to their paper was indeed driven entirely by their approach. If someone had published a paper using Neurosynth to argue that the dACC is selective for conflict, using the same kinds of arguments L&E make, I would have written exactly the same kind of critique I wrote in response to L&E’s paper. I don’t know how I can make it any clearer that I have zero attachment to any particular view of the dACC; my primary concern is with L&E’s misuse of Neurosynth, not what they or anyone else thinks about dACC function. I’ve already made it clear several times that I endorse their conclusion that conflict, salience, and cognitive control are not adequate explanations for dACC function. What they don’t seem to accept is that pain isn’t an adequate explanation either, as the data from Neurosynth readily demonstrate.
5. L&E did the wrong analyses
TY suggests that we made a major error by comparing the Z-scores associated with different terms and should have used posterior probabilities instead. If our goal had been to compare effect sizes this might have made sense, but comparing effect sizes was not our goal. Our goal was to see whether there was accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.
I’ve already addressed the overarching problem with L&E’s statistical analyses in the first part of this post. Below I’ll just walk through each of L&E’s assertions in detail and point out all of the specific issues in detail. I’ll warn you right now that this is not likely to make for very exciting reading.
While we think the maps for each term speak volumes just from visual inspection, we thought it was also critical to run the comparisons across terms directly. We all know the statistical error of showing that A is significant, while B is not and then assuming, but not testing A > B, directly. TY has a section called “A>B does not imply ~B” (where ~B means ‘not B’). Indeed it does not, but all the reverse inference maps for the executive, conflict, and salience terms already established ~B. We were just doing due diligence by showing that the difference between A and B was indeed significant.
I apologize for implying that L&E weren’t aware that A > B doesn’t entail ~B. I drew that conclusion because the only other way I could see their claim of selectivity making any sense is if they were interpreting a failure to detect a significant effect for B as positive evidence of no effect. I took that to be much more unlikely, because it’s essentially the cardinal sin of NHST. But their statement here explicitly affirms that this is, in fact, exactly what they were arguing—which leads me to conclude that they don’t understand the null hypothesis statistical testing (NHST) framework they’re using. The whole point of this section of my post was that L&E cannot conclude that there’s no activity in dACC for terms like conflict or salience, because accepting the null is an invalid move under NHST. Perhaps I wasn’t sufficiently clear about this in my last post, so let me reiterate: the reverse inference maps do not establish ~B, and cannot establish ~B. The (invalid) comparison tests of A > B do not establish ~B, and cannot cannot establish ~B. In fact, no analysis, figure, or number L&E report anywhere in their paper establishes ~B for any of the terms they compare with pain. Under NHST, the only possible result of any of L&E’s analyses that would allow them to conclude that a term is not positively associated with dACC activation would be a significant result in the negative direction (i.e., if dACC activation implied a decrease in likelihood of a term). But that’s clearly not true of any of the terms they examine.
Note that this isn’t a fundamental limitation of statistical inference in general; it’s specifically an NHST problem. A Bayesian model comparison approach would have allowed L&E to make a claim about the evidence for the null in comparison to the alternative (though specifying the appropriate priors here might not be very straightforward). Absent such an analysis, L&E are not in any position to make claims about conflict or salience not activating the dACC—and hence, per their own criteria for selectivity, they have no basis for arguing that pain is selective.
Now, in my last post, I went well beyond this logical objection and argued that, if you analyze the data using L&E’s own criteria, there’s plenty of evidence for significant effects of other terms in dACC. I now regret including those analyses. Not because they were wrong; I stand by my earlier conclusion (which should be apparent to anyone who spends five minutes browsing maps on Neurosynth.org), and this alone should have prevented L&E from making claims about pain selectivity. But the broader point is that I don’t want to give the impression that this debate is over what the appropriate statistical threshold for analysis is—i.e., that maybe if we use p < 0.05, I’m right, and if we use FDR = 0.1, L&E are right. The entire question of which terms do or don’t show a significant effect is actually completely beside the point given that L&E’s goal is to establish that only pain activates the dACC, and that terms like conflict or salience don’t. To accomplish that, L&E would need to use an entirely different statistical framework that allows them them to accept the null (relative to some alternative).
If it’s reasonable to use the Z-scores from Neurosynth to say “How much evidence is there for process A being a reliable reverse inference target for region X” then it has to be reasonable to compare Z-scores from two analyses to ask “How much MORE evidence is there for process A than process B being a reliable reverse inference target for region X”. This is all we did when we compared the Z-scores for different terms to each other (using a standard formula from a meta-analysis textbook) and we think this is the question many people are asking when they look at the Neurosynth maps for any two competing accounts of a neural region.
I addressed this in the earlier part of this post, where I explained why one cannot obtain support for a reverse inference using z-scores or p-values. Reverse inference is inherently a Bayesian notion, and makes sense only if you’re willing to talk about prior and posterior probabilities. So L&E’s first premise here—i.e., that it’s reasonable to use z-scores from Neurosynth to quantify “evidence for process A being a reliable reverse inference target for region X” is already false.
For what it’s worth, the second premise is also independently false, because it’s grossly inappropriate to use meta-analytic z-score comparison test in this situation. For one thing, there’s absolutely no reason to compare z-scores given that the distributional information is readily available. Rosenthal (the author of the meta-analysis textbook L&E cite) himself explicitly notes that such a test is inferior to effect size-based tests, and is essentially a last-ditch approach. Moreover, the intended use of the test in meta-analysis is to determine whether or not there’s heterogeneity in p-values as a precursor to combining them in an analysis (which is a concern that makes no sense in the context of Neurosynth data). At best, what L&E would be able to say with this test is something like “it looks like these two z-scores may be coming from different underlying distributions”. I don’t know why L&E think this is at all an interesting question here, because we already know with certainty that there can be no meaningful heterogeneity of this sort in these z-scores given that they’re all generated using exactly the same set of studies.
In fact, the problems with the z-score comparison test L&E are using run so deep that I can’t help point out just one truly stupefying implication of the approach: it’s possible, under a wide range of scenarios, to end up concluding that there’s evidence that one term is “preferentially” activated relative to another term even when the point estimate is (significantly) larger for the latter term. For example, consider a situation in which we have a probability of 0.65 for one term with n = 1000 studies, and a probability of 0.8 for a second term with n = 100 studies. The one-sample proportion test for these two samples, versus a null of 0.5, gives z-scores of 9.5 and 5.9, respectively–so both tests are highly significant, as one would expect. But the Rosenthal z-score test favored by L&E tells us that the z-score for the first sample is significantly larger than the z-score for the second. It isn’t just wrong to interpret this as evidence that the first term has a more selective effect; it’s dangerously wrong. A two-sample test for the difference in proportions correctly reveals a significant effect in the expected direction (i.e., the 0.8 probablity in the smaller sample is in fact significantly greater than the 0.65 probability in the much larger sample). Put simply, L&E’s test is broken. It’s not clear that it tests anything meaningful in this context, let alone allowing us to conclude anything useful about functional selectivity in dACC.
As for what people are asking when they look at the Neurosynth maps for any two competing accounts of a neural region: I really don’t know, and I don’t see how that would have any bearing on whether the methods L&E are using are valid or not. What I do know that I’ve never seen anyone else compare Neurosynth z-scores using a meta-analytic procedure intended to test for heterogeneity of effects—and I certainly wouldn’t recommend it.
TY then raises two quite reasonable issues with the Z-score comparisons, one of which we already directly addressed in our paper. First, TY raises the issue that Z-scores increase with accumulating evidence, so terms with more studies in the database will tend to have larger Z-scores. This suggests that terms with the most studies in the database (e.g. motor with 2081 studies) should have significant Z-scores everywhere in the brain. But terms with the most studies don’t look like this. Indeed, the reverse inference map for “functional magnetic” with 4990 studies is a blank brain with no significant Z-scores.
Not quite. It’s true that for any fixed effect size, z-scores will rise (in absolute value) as sample size increases. But if the true effect size is very small, one will still obtain a negligible z-score even in a very large sample. So while terms with more studies will indeed tend to have larger absolute z-scores, it’s categorically false that “terms with the most studies in the database should have significant z-scores everywhere in the brain”.
However, TY has a point. If two terms have similar true underlying effects in dACC, then the one with the larger number of studies will have a larger Z-score, all else being equal. We addressed this point in the limitations section of our paper writing “It is possible that terms that occur more frequently, like “pain,” might naturally produce stronger reverse inference effects than less frequent terms. This concern is addressed in two ways. First, the current analyses included a variety of terms that included both more or fewer studies than the term “pain” and no frequency-based gradient of dACC effects is observable.” So while pain (410 studies) is better represented in the Neurosynth database than conflict (246 studies), effort (137 studies), or Stroop (162 studies), several terms are better represented than pain including auditory (1004 studies), cognitive control (2474 studies), control (2781 studies), detection (485 studies), executive (531 studies), inhibition (432 studies), motor (1910 studies), and working memory (815). All of these, regardless of whether they are better or worse represented in the Neurosynth database show minimal presence in the dACC reverse inference maps. It’s also worth noting that painful and noxious, with only 158 and 85 studies respectively, both show broader coverage within the dACC than any of the cognitive or salience terms considered in our paper.
L&E don’t seem to appreciate that the relationship between the point estimate of a parameter and the uncertainty around that estimate is not like the relationship between two predictors in a regression, where one can (perhaps) reason logically about what would or should be true if one covariate was having an influence on another. One cannot “rule out” the possibility that sample size is a problem by pointing to some large-N terms with small effects or some small-N terms with large effects. Sampling error is necessarily larger in smaller samples. The appropriate way to handle between-term variation in sample size is to properly build that differential uncertainty into one’s inferential test. Rosenthal’s z-score comparison doesn’t do this. The direct meta-analytic contrast one can perform with Neurosynth does do this, but of course, being much more conservative than the Rosenthal test (appropriately so!), L&E don’t seem to like the results it produces. (And note that the direct meta-analytic contrast would still require one to make strong assumptions about priors if the goal was to make quantitative reverse inferences, as opposed to detecting a mean difference in probability of activation.)
TY’s second point is also reasonable, but is also not a problem for our findings. TY points out that some effects may be easier to produce in the scanner than others and thus may be biased towards larger effect sizes. We are definitely sympathetic to this point in general, but TY goes on to focus on how this is a problem for comparing pain studies to emotion studies because pain is easy to generate in the scanner and emotion is hard. If we were writing a paper comparing effect sizes of pain and emotion effects this would be a problem but (a) we were not primarily interested in comparing effect sizes and (b) we definitely weren’t comparing pain and emotion because we think the aspect of pain that the dACC is involved in is the affective component of pain as we’ve written in many other papers dating back to 2003 (Eisenberger & Lieberman, 2004; Eisenberger, 2012; Eisenberger, 2015).
It certainly is a problem for L&E’s findings. Z-scores are related one-to-one with effect size for any fixed sample size, so if the effect size is artificially increased in one condition, so too is the z-score that L&E stake their (invalid) analysis on. Any bias in the point estimate will necessarily distort the z-value as well. This is not a matter of philosophical debate or empirical conjecture, it’s a mathematical necessity.
Is TY’s point relevant to our actual terms of comparison: executive, conflict, and salience processes? We think not. Conflict tasks are easy and reliable ways to produce conflict processes. In multiple ways, we think pain is actually at a disadvantage in the comparison to conflict. First, pain effects are so variable from one person to the next that most pain researchers begin by calibrating the objective pain stimuli delivered, to each participant’s subjective responses to pain. As a result, each participant may actually be receiving different objective inputs and this might limit the reliability or interpretability of certain observed effects. Second, unlike conflict, pain can only be studied at the low end of its natural range. Due to ethical considerations, we do not come close to studying the full spectrum of pain phenomena. Both of these issues may limit the observation of robust pain effects relative to our actual comparisons of interest (executive, conflict, and salience processes.
Perhaps I wasn’t sufficiently clear, but I gave the pain-emotion contrast as an example. The point is that meta-analytic comparisons of the kind L&E are trying to make are a very dangerous proposition unless one has reason to think that two classes of manipulations are equally “strong”. It’s entirely possible that L&E are right that executive control manipulations are generally stronger than pain manipulations, but that case needs to be made on the basis of data, and cannot be taken for granted.
6. About those effect size comparison maps
After criticizing us for not comparing effect sizes, rather than Z-scores, TY goes on to produce his own maps comparing the effect sizes of different terms and claiming that these represent evidence that the dACC is not selective for pain. A lot of our objections to these analyses as evidence against our claims repeats what’s already been said so we’ll start with what’s new and then only briefly reiterate the earlier points.
a) We don’t think it makes much sense to compare effect sizes for terms in voxels for which there is no evidence that it is a valid reverse inference target. For instance, the posterior probability at 0 26 26 for pain is .80 and for conflict is .61 (with .50 representing a null effect). Are these significantly different from one another? I don’t think it matters much because the Z-score associated with conflict at this spot is 1.37, which is far from significant (or at least it was when we ran our analyses last summer. Strangely, now, any non-significant Z-scores seem to come back with a value of 0, whereas they used to give the exact non-significant Z-score).
I’m not sure why L&E think that statistical significance makes a term a “valid target” for reverse inference (or conversely, that non-significant terms cannot be valid targets). If they care to justify this assertion, I’ll be happy to respond to it. It is, in any case, a moot point, since many of the examples I gave were statistically significant, and L&E don’t provide any explanation as to why those terms aren’t worth worrying about either.
As for the disappearance of non-significant z-scores, that’s a known bug introduced by the last major update to Neurosynth, and it’ll be fixed in the next major update (when the entire database is re-generated).
If I flip a coin twice I might end up with a probability estimate of 100% heads, but this estimate is completely unreliable. Comparing this estimate to those from a coin flipped 10,000 times which comes up 51% heads makes little sense. Would the first coin having a higher probability estimate than the second tell us anything useful? No, because we wouldn’t trust the probability estimate to be meaningful. Similarly, if a high posterior probability is associated with a non-significant Z-score, we shouldn’t take this posterior probability as a particularly reliable estimate.
L&E are correct that it wouldn’t make much sense to compare an estimate from 2 coin flips to an estimate from 10,000 coin flips. But the error is in thinking that comparing p-values somehow addresses this problem. As noted above, the p-value comparison they use is a meta-analytic test that only tells one if a set of z-scores are heterogenous, and is not helpful for comparing proportions when one has actual distributional information available. It would be impossible to answer the question of whether one coin is biased relative to another using this test—and it’s equally impossible to use it to determine whether one term is more important than another for dACC function.
b) TY’s approach for these analyses is to compare the effect sizes for any two processes A & B by finding studies in the database tagged for A but not B and others tagged for B but not A and to compare these two sets. In some cases this might be fine, but in others it leaves us with a clean but totally unrealistic comparison. To give the most extreme example, imagine we did this for the terms pain and painful. It’s possible there are some studies tagged for painful but not pain, but how representative would these studies be of “painful” as a general term or construct? It’s much like the clinical problem of comparing depression to anxiety by comparing those with depression (but not anxiety) to those with anxiety (but not depression). These folks are actually pretty rare because depression and anxiety are so highly comorbid, so the comparison is hardly a valid test of depression vs. anxiety. Given that we think pain, fear, emotion, and autonomic are actually all in the same class of explanations, we think comparisons within this family are likely to suffer from this issue.
There’s nothing “unrealistic” about this comparison. It’s not the inferential test’s job to make sure that the analyst is doing something sensible, it’s the analyst’s job. Nothing compels L&E to run a comparison between ‘pain’ and ‘painful’, and I fully agree that this would be a dumb thing to do (and it would be an equally dumb thing to do using any other statistical test). One the other hand, comparing the terms ‘pain’ and ’emotion’ is presumably not a dumb thing to do, so it behooves us to make sure that we use an inferential test that doesn’t grossly violate common sense and basic statistical assumptions.
Now, if L&E would like to suggest an alternative statistical test that doesn’t exclude the intersection of the two terms and still (i) produces interpretable results, (ii) weights all studies equally, (iii) appropriately accounts for the partial dependency structure of the data, and (iv) is sufficiently computationally efficient to apply to thousands of terms in a reasonable amount of time (which rules out most permutation-based tests), then I’d be delighted to consider their suggestions. The relevant code can be found here, and L&E are welcome to open a GitHub issue to discuss this further. But unless they have concrete suggestions, it’s not clear what I’m supposed to do with their assertion that doing meta-analytic comparison properly sometimes “leaves us with a clean but totally unrealistic comparison”. If they don’t like the reality, they’re welcome to help me improve the reality. Otherwise they’re simply engaging in wishful thinking. Nobody owes L&E a statistical test that’s both valid and gives them results they like.
c) TY compared topics (i.e., a cluster of related terms), not terms. This is fine, but it is one more way that what TY did is not comparable to what we did (i.e. one more way his maps can’t be compared to those we presented).
I almost always use topics rather than terms in my own analyses, for a variety of reasons (they have better construct validity, are in theory more reliable, reduce the number of comparisons, etc.). I didn’t try out the analyses I ran with any of the term-based features, but I encourage L&E to do so if they like, and I’d be surprised if the results differ appreciably (they should, in general, simply be slightly less robust all around). In any case, I deliberately made my code available so that L&E (or anyone else) could easily reproduce and modify my analyses. (And of course, nothing at all hangs on the results in any case, because the whole premise that this is a suitable way to demonstrate selectivity is unfounded.)
d) Finally and most importantly, our question would not have led us to comparing effect sizes. We were interested in whether there was greater accumulated evidence for one term (i.e. pain) being a reverse inference target for dACC activations than for another term (e.g. conflict). Using the Z-scores as we did is a perfectly reasonable way to do this.
See above. Using the z-scores the way L&E did is not reasonable and doesn’t tell us anything anyone would want to know about functional selectivity.
7. Biases all around
Towards the end of his blog, TY says what we think many cognitive folks believe:
“I don’t think it’s plausible to think that much of the brain really prizes pain representation above all else.”
We think this is very telling because it suggests that the findings such as those in our PNAS paper are likely to be unacceptable regardless of what the data shows.
Another misrepresentation of what I actually said, which was:
One way to see this is to note that when we meta-analytically compare pain with almost any other term in Neurosynth (see the figure above), there are typically a lot of brain regions (extending well outside of dACC and other putative pain regions) that show greater activation for pain than for the comparison condition, and very few brain regions that show the converse pattern. I don’t think it’s plausible to think that much of the brain really prizes pain representation above all else. A more sensible interpretation is that the Neurosynth posterior probability estimates for pain are inflated to some degree by the relative ease of inducing pain experimentally.
The context makes it abundantly clear that I was not making a general statement about the importance of pain in some grand evolutionary sense, but simply pointing out the implausibility of supposing that Neurosynth reverse inference maps provide unbiased windows into the neural substrates of cognition. In the case of pain, there’s tentative evidence to believe that effect sizes are overestimated.
In contrast, we can’t think of too many things that the brain would prize above pain (and distress) representations. People who don’t feel pain (i.e. congenital insensitivity to pain) invariably die an early death – it is literally a death sentence to not feel pain. What could be more important for survival? Blind and deaf people survive and thrive, but those without the ability to feel pain are pretty much doomed.
I’m not sure what this observation is supposed to tell us. One could make the same kind of argument about plenty of other functions. People who suffer from a variety of autonomic or motor problems are also likely to suffer horrible early deaths; it’s unclear to me how this would justify a claim like “the brain prizes little above autonomic control”, or what possibly implications such a claim would have for understanding dACC function.
Similar (but not identical) to TY’s conclusions that we opened this blog with, we think the following conclusions are supported by the Neurosynth evidence in our PNAS paper:
I’ll take these one at a time.
* There is more widespread and robust reverse inference evidence for the role of pain throughout the dACC than for executive, conflict, and salience-related processes.
I’m not sure what is meant here by “robust reverse inference evidence”. Neurosynth certainly provides essentially no basis for drawing reverse inferences about the presence of pain in individual studies. (Let me remind L&E once again: at best, the posterior probability for ‘pain’ in dACC is around 80%–but that’s given an assumed based rate of 50%, not the more realistic real-world rate of around 3%). If what they mean is something like “on average, taking the average of all voxels in dACC, there’s more evidence of a statistical association between pain and dACC than pain and conflict monitoring”, then I’m fine with that.
* There is little to no evidence from the Neurosynth database that executive, conflict, and salience-related processes are reasonable reverse inference targets for dACC activity.
Again, this depends on what L&E mean. If they mean that one shouldn’t, upon observing activation in dACC, proclaim that conflict must be present, then they’re absolutely right. But again, the same is true for pain. On the other hand, if they mean that there’s no evidence in Neurosynth for a reverse inference association between these terms and dACC activity, where the criterion is surviving FDR-correction, then that’s clearly not true: for example, the conflict map clearly includes voxels within the dACC. Alternatively, if L&E’s point is that the dACC/preSMA region centrally associated with conflict monitoring or executive control is more dorsal than many (though not all) people have assumed, then I agree with them without qualification.
* Pain processes, particularly the affective or distressing part of pain, are in the same family with other distress-related processes including terms like distress, fear, and negative affect.
I have absolutely no idea what evidence this conclusion is based on. Nothing I can see in Neurosynth seems to support this—let alone anything in the PNAS paper. As I’ve noted several times now, most distress-related terms do not seem to overlap meaningfully with pain-related activations in dACC. To the extent that one thinks spatial overlap is a good criterion for determining family membership (and for what it’s worth, I don’t think it is), the evidence does not seem particularly suggestive of any such relationship (and L&E don’t test it formally in any way).
Postscript. *L&E should have used reverse inference, not forward inference, when examining the anatomical boundaries of dACC.*
We saved this one for the postscript because this has little bearing on the major claims of our paper. In our paper, we observed that when one does a forward inference analysis of the term ‘dACC’ the strongest effect occurs outside the dACC in what is actually SMA. This suggested to us that people might be getting activations outside the dACC and calling them dACC (much as many activations clearly not in the amygdala have been called amygdala because it fits a particular narrative). TY admits having been guilty of this in TY’11 and points out that we made this mistake in our 2003 Science paper on social pain. A couple of thoughts on this.
a) In 2003, we did indeed call an activation outside of dACC (-6 8 45) by the term “dACC”. TY notes that if this is entered into a Neurosynth analysis the first anatomical term that appears is SMA. Fair enough. It was our first fMRI paper ever and we identified that activation incorrectly. What TY doesn’t mention is that there are two other activations from the same paper (-8 20 40; -6 21 41) where the top named anatomical term in Neurosynth is anterior cingulate. And if you read this in TY’s blog and thought “I guess social pain effects aren’t even in the dACC”, we would point you to the recent meta-analysis of social pain by Rotge et al. (2015) where they observed the strongest effect for social pain in the dACC (8 24 24; Z=22.2 PFDR<.001). So while we made a mistake, no real harm was done.
I mentioned the preSMA activation because it was the critical data point L&E leaned on to argue that the dACC was specifically associated with the affective component of pain. Here’s the relevant excerpt from the 2003 social pain paper:
As predicted, group analysis of the fMRI data indicated that dorsal ACC (Fig. 1A) (x – 8, y 20, z 40) was more active during ESE than during inclusion (t 3.36, r 0.71, P < 0.005) (23, 24). Self-reported distress was positively correlated with ACC activity in this contrast (Fig. 2A) (x – 6, y 8, z 45, r 0.88, P < 0.005; x – 4, y 31, z 41, r 0.75, P < 0.005), suggesting that dorsal ACC activation during ESE was associated with emotional distress paralleling previous studies of physical pain (7, 8). The anterior insula (x 42, y 16, z 1) was also active in this comparison (t 4.07, r 0.78, P < 0.005); however, it was not associated with self-reported distress.
Note that both the dACC and anterior insula were activated by the exclusion vs. inclusion contrast, but L&E concluded that it was specifically the dACC that supports the “neural alarm” system, by virtue of being correlated with participants’ subjective reports (whereas the insula was not). Setting aside the fact that these results were observed in a sample size of 13 using very liberal statistical thresholds (so that the estimates are highly variable, spatial error is going to be very high, there’s a high risk of false positives, and accepting the null in the insula because of the absence of a significant effect is probably a bad idea), in focusing on the the preSMA activation in my critique, I was only doing what L&E themselves did in their paper:
Dorsal ACC activation during ESE could reflect enhanced attentional processing, previously associated with ACC activity (4, 5), rather than an underlying distress due to exclusion. Two pieces of evidence make this possibility unlikely. First, ACC activity was strongly correlated with perceived distress after exclusion, indicating that the ACC activity was associated with changes in participants’ self-reported feeling states.
By L&E’s own admission, without the subjective correlation, there would have been little basis for concluding that the effect they observed was attributable to distress rather than other confounds (attentional increases, expectancy violation, etc.). That’s why I focused on the preSMA activation: because they did too.
That said, since L&E bring up the other two activations, let’s consider those too, since they also have their problems. While it’s true that both of them are in the anterior cingulate, according to Neurosynth, neither of them is a “pain” voxel. The top functional associates for both locations are ‘inteference’, ‘task’, ‘verbal’, ‘verbal fluency’, ‘word’, ‘demands’, ‘words’, ‘reading’ … you get the idea. Pain is not significantly associated with these points in Neurosynth. So while L&E might be technically right that these other activations were in the anterior cingulate, if we take Neurosynth to be as reliable a guide to reverse inference as they think, then L&E never had any basis for attributing the social exclusion effect to pain to begin with—because, according to Neurosynth, literally none of the medial frontal cortex activations reported in the 2003 paper are associated with pain. I’ll leave it to others to decide whether “no harm was done” by their claim that the dACC is involved in social pain.
In contrast, TY’11’s mistake is probably of greater significance. Many have taken Figure 3 of TY’11 as strong evidence that the dACC activity can’t be reliably associated with working memory, emotion, or pain. If TY had tested instead (2 8 40), a point directly below his that is actually in dACC (rather than 2 8 50 which TY now acknowledges is in SMA), he would have found that pain produces robust reverse inference effects, while neither working memory or emotion do. This would have led to a very different conclusion than the one most have taken from TY’11 about the dACC.
Nowhere in TY’11 is it claimed that dACC activity isn’t reliably associated with working memory, emotion or pain (and, as I already noted in my last post, I explicitly said that the posterior aspects of dACC are preferentially associated with pain). What I did say is that dACC activation may not be diagnostic of any of these processes. That’s entirely accurate. As I’ve explained at great length above, there is simply no basis for drawing any strong reverse inference on the basis of dACC activation.
That said, if it’s true that many people have misinterpreted what I said in my paper, that would indeed be potentially damaging to the field. I would appreciate feedback from other people on this issue, because if there’s a consensus that my paper has in fact led people to think that dACC plays no specific role in cognition, then I’m happy to submit an erratum to the journal. But absent such feedback, I’m not convinced that my paper has had nearly as much influence on people’s views as L&E seem to think.
b) TY suggested that we should have looked for “dACC” in the reverse inference map rather than the forward inference map writing “All the forward inference map tells you is where studies that use the term “dACC” tend to report activation most often”. Yet this is exactly what we were interested in. If someone is talking about dACC in their paper, is that the region most likely to appear in their tables? The answer appears to be no.
No, it isn’t what L&E are interested in. Let’s push this argument to its logical extreme to illustrate the problem: imagine that every single fMRI paper in the literature reported activation in preSMA (plus other varying activations)—perhaps because it became standard practice to do a “task-positive localizer” of some kind. This is far-fetched, but certainly conceptually possible. In such a case, searching for every single region by name (“amygdala”, “V1”, you name it) would identify preSMA as the peak voxel in the forward inference map. But what would this tell us, other than that preSMA is activated with alarming frequency? Nothing. What L&E want to know is what brain regions have the biggest impact on the likelihood that an author says “hey, that’s dACC!”. That’s a matter of reverse inference.
c) But again, this is not one of the central claims of the paper. We just thought it was noteworthy so we noted it. Nothing else in the paper depends on these results.
I agree with this. I guess it’s nice to end on a positive note.
[Update 12/10/2015: Lieberman & Eisenberger have now posted a lengthy response to this post here. I’ll post my own reply to their reply in the next few days.]
[Update 12/14/2015: I’ve posted an even lengthier reply to L&E’s reply here.]
[Update 12/16/2015: Alex Shackman has posted an interesting commentary of his own on the L&E paper. It focuses on anatomical concerns unrelated to the issues I raise here and in my last post.]
The anterior cingulate cortex (ACC)—located immediately above the corpus callosum on the medial surface of the brain’s frontal cortex—is an intriguing brain region. Despite decades of extensive investigation in thousands of animal and human studies, understanding the function(s) of this region has proven challenging. Neuroscientists have proposed a seemingly never-ending string of hypotheses about what role it might play in in emotion and/or cognition. The field of human neuroimaging has taken a particular shine to the ACC in the past two decades; if you’ve ever heard overheard some nerdy-looking people talking about “conflict monitoring”, “error detection”, or “reinforcement learning” in the human brain, there’s a reasonable chance they were talking at least partly about the role of the ACC.
In a new PNAS paper, Matt Lieberman and Naomi Eisenberger wade into the debate with what is quite possibly the strongest claim yet about ACC function, arguing (and this is a verbatim quote from the paper’s title) that “the dorsal anterior cingulate cortex is selective for pain”. That conclusion rests almost entirely on inspection of meta-analytic results produced by Neurosynth, an automated framework for large-scale synthesis of results from thousands of published fMRI studies. And while I’ll be the first to admit that I know very little about the anterior cingulate cortex, I am probably the world’s foremost expert on Neurosynth*—because I created it. I also have an obvious interest in making sure that Neurosynth is used with appropriate care and caution. In what follows, I provide my HIBAR reactions to the Lieberman & Eisenberger (2015) manuscript, focusing largely on whether L&E’s bold conclusion is supported by the Neurosynth findings they review (spoiler alert: no).
Before going any further, I should clarify my role in the paper, since I’m credited in the Acknowledgments section for “providing Neurosynth assistance”. My contribution consisted entirely of sending the first author (per an email request) an aggregate list of study counts for different terms on the Neurosynth website. I didn’t ask what it was for, he didn’t say what it was for, and I had nothing to do with any other aspect of the paper—nor did PNAS ask me to review it. None of this is at all problematic, from my perspective. My policy has always been that people can do whatever they want with any of the Neurosynth data, code, or results, without having to ask me or anyone else for permission. I do encourage people to ask questions or solicit feedback (we have a mailing list), but in this case the authors didn’t contact me before this paper was published (other than to request data). So being acknowledged by name shouldn’t be taken as an endorsement of any of the results.
With that out of the way, we can move onto the paper. The basic argument L&E make is simple, and largely hangs on the following observation about Neurosynth data: when we look for activation in the dorsal ACC (dACC) in various “reverse inference” brain maps on Neurosynth, the dominant associate is the term “pain”. Other candidate functions people have considered in relation to dACC—e.g., “working memory”, “salience”, and “conflict”—show (at least according to L&E) virtually no association with dACC. L&E take this as strong evidence against various models of dACC function that propose that the dACC plays a non-pain-related role in cognition—e.g., that it monitors for conflict between cognitive representations or detects salient events. They state, in no uncertain terms, that Neurosynth results “clearly indicated that the best psychological description of dACC function was related to pain processing – not executive, conflict, or salience processing”. This is a strong claim, and would represent a major advance in our understanding of dACC function if it were borne out. Unfortunately, it isn’t.
A crash course in reverse inference
To understand why, we need to understand the nature of the Neurosynth data L&E focus on. And to do that, we need to talk about something called reverse inference. L&E begin their paper by providing an excellent explanation of why the act of inferring mental states from patterns of brain activity (i.e., reverse inference—a term popularized in a seminal 2006 article by Russ Poldrack)—is a difficult business. Many experienced fMRI researchers might feel that the issue has already been beaten to death (see for instance this, this, this, or this). Those readers are invited to skip to the next section.
For everyone else, we can summarize the problem by observing that the probability of a particular pattern of brain activity conditional on a given mental state is not the same thing as the probability of a particular mental state conditional on a given pattern of observed brain activity (i.e., P(activity|mental state) != P(mental state|activity)). For example, if I know that doing a difficult working memory task produces activation in the dorsolateral prefrontal cortex (DLPFC) 80% of the time, I am not entitled to conclude that observing DLPFC activation in someone’s brain implies an 80% chance that that person is doing a working memory task.
To see why, imagine that a lot of other cognitive tasks—say, those that draw on recognition memory, emotion recognition, pain processing, etc.—also happen to produce DLPFC activation around 80% of the time. Then we would be justified in saying that all of these processes consistently produce DLPFC activity, but we would have no basis for saying that DLPFC activation is specific, or even preferential, for any one of these processes. To make the latter claim, we would need to directly estimate the probability of working memory being involved given the presence of DLPFC activation. But this is a difficult proposition, because most fMRI studies only compare a small number of experimental conditions (typically with low statistical power), and cannot really claim to demonstrate that a particular pattern of activity is specific to a given cognitive process.
Unfortunately, a huge proportion of fMRI studies continue to draw strong reverse inferences on the basis of little or no quantitative evidence. The practice is particularly common in Discussion sections, when authors often want to say something more than just “we found a bunch of differences as a result of this experimental manipulation”, and end up drawing inferences about what such-and-such activation implies about subjects’ mental states on the basis of a handful of studies that previously reported activation in the same region(s). Many of these attributions could well be correct, of course; but the point is that it’s exceedingly rare to see any quantitative evidence provided in support of claims that are often fundamental to the interpretation authors wish to draw.
Fortunately, this is where large-scale meta-analytic databases like Neurosynth can help—at least to some degree. Because Neurosynth contains results from over 11,000 fMRI studies drawn from virtually every domain of cognitive neuroscience, we can use it to produce quantitative whole-brain reverse inference maps (for more details, see Yarkoni et al. (2011)). In other words, we can estimate the relative specificity with which a particular pattern of brain activity implies that some cognitive process is in play—provided we’re willing to make some fairly strong assumptions (which we’ll return to below).
The dACC, lost and found
Armed with an understanding of the forward/reverse inference distinction, we can now turn to the focus of the L&E paper: a brain region known as the dorsal anterior cingulate cortex (dACC). The first thing L&E set out to do, quite reasonably, is identify the boundaries of the dACC, so that it’s clear what constitutes the target of analysis. To this end, they compare the anatomically-defined boundaries of dACC with the boundaries found in the Neurosynth forward inference map for “dACC”. Here’s what they show us:
The blue outline in panel A is the anatomical boundary of dACC; the colorful stuff in B is the Neurosynth map for ‘dACC’. (It’s worth noting in passing that the choice to rely on anatomy as the gold standard here is not completely uncontroversial; given the distributed nature of fMRI activation and the presence of considerable registration error in most studies, another reasonable approach would have been to use a probabilistic template). As you can see, the two don’t converge all that closely. Much of the Neurosynth map sits squarely inside preSMA territory rather than in dACC proper. As L&E report:
When “dACC” is entered as a term into a Neurosynth forward inference analysis (Fig. 1B), there is substantial activity present in the anatomically defined dACC region; however, there is also substantial activity present in the SMA/preSMA region. Moreover, the location with the highest Z-score in this analysis is actually in SMA, not dACC. The same is true if the term “anterior cingulate” is used (Fig. 1C).
L&E interpret this as a sign of confusion in the literature about the localization of dACC, and suggest that this observation might explain why people have misattributed certain functions to dACC:
These findings suggest that some of the disagreement over the function of the dACC may actually apply to the SMA/pre-SMA, rather than the dACC. In fact, a previous paper reporting that a reverse inference analysis for dACC was not selective for pain, emotion, or working memory (see figure 3 in ref. 13) seems to have used coordinates for the dACC that are in fact in the SMA/ pre-SMA (MNI coordinates 2, 8, 50), not in the dACC.
This is an interesting point, and clearly has a kernel of truth to it, inasmuch as some researchers undoubtedly confuse dACC with more dorsal regions. As L&E point out, I made this mistake myself in the original Neurosynth paper (that’s the ‘ref. 13’ in the above quote); specifically, here’s the figure where I clearly labeled dACC in the wrong place:
Mea culpa—I made a mistake, and I appreciate L&E pointing it out. I should have known better.
That said, L&E should also have known better, because they were among the first authors to ascribe a strong functional role to a region of dorsal ACC that wasn’t really dACC at all. I refer here to their influential 2003 Science paper on social exclusion, in which they reported that a region of dorsal ACC centered on (-6, 8, 45) was specifically associated with the feeling of social exclusion and concluded (based on the assumption that the same region was already known to be implicated in pain processing) that social pain shares core neural substrates with physical pain. Much of the ongoing debate over what the putative role of dACC is traces back directly to this paper. Yet it’s quite clear that the region identified in that paper was not the same as the one L&E now argue is the pain-specific dACC. At coordinates (-6, 8, 45), the top hits in Neurosynth are “SMA”, “motor”, and “supplementary motor”. If we scan down to the first cognitive terms, we find the terms “task”, “execution”, and “orthographic”. “Pain” is not significantly associated with activation at this location at all. So, to the extent that people have mislabeled this region in the past, L&E would appear to share much of the blame. Which is fine—we all make mistakes. But given the context, I think it would behoove L&E to clarify their own role in perpetuating this confusion.
That said, even if L&E are correct that a subset of researchers have sometimes confused dACC and pre-SMA, they’re clearly wrong to suggest that the cognitive neuroscience community as a whole is guilty of the same confusion. A perplexing aspect of their argument is that they base their claim of localization confusion entirely on inspection of the forward inference Neurosynth map for “dACC”—an odd decision, coming immediately after several paragraphs in which they lucidly explain why a forward inference analysis is exactly the wrong way to determine what brain regions are specifically associated with a particular term. If you want to use Neurosynth to find out where people think dACC is, you should use the reverse inference map, not the forward inference map. All the forward inference map tells you is where studies that use the term “dACC” tend to report activation most often. But as discussed above, and in the L&E paper, that estimate will be heavily biased by differences between regions in the base rate of activation.
Perhaps in tacit recognition of this potential criticism, L&E go on to suggest that the alleged “distortion” problem isn’t ubiquitous, and doesn’t happen in regions like the amygdala, hippocampus, or posterior cingulate:
We tested several other anatomical terms including “amygdala,” “hippocampus,” “posterior cingulate,” “basal ganglia,” “thalamus,” “supplementary motor,” and “pre sma.” In each of these regions, the location with the highest Z-score was within the expected anatomical boundaries. Only within the dACC did we find this distortion. These results indicate that studies focused on the dACC are more likely to be reporting SMA/pre-SMA activations than dACC activations.
But this isn’t quite right. While it may be the case that dACC was the only brain region among the ones L&E examined that didn’t show this “distortion”, it’s certainly not the only brain region that shows this pattern. For example, the forward inference maps for “DMPFC” and “middle cingulate” (and probably others—I only spent a couple of minutes looking) show peak voxels in pre-SMA and the anterior insula, respectively, and not within the boundaries of the expected anatomical structures. If we take L&E’s “localization confusion” explanation seriously, we would be forced to conclude not only that cognitive neuroscientists generally don’t know where dACC is, but also that they don’t know DMPFC from pre-SMA or mid-cingulate from anterior insula. I don’t think this is a tenable suggestion.
For what it’s worth, Neurosynth clearly agrees with me: the “distortion” L&E point to completely vanishes as soon as one inspects the reverse inference map for “dacc” rather then forward inference map. Here’s what the two maps look like, side-by-side (incidentally, the code and data used to generate this plot and all the others in this post can be found here):
You can see that the extent of dACC in the bottom row (reverse inference) is squarely within the area that L&E take to be the correct extent of dACC (see their Figure 1). So, when we follow L&E’s recommendations, rather than their actual practice, there’s no evidence of any spatial confusion. Researchers (collectively, at least) do know where dACC is. It’s just that, as L&E themselves argue at length earlier in the paper, you would expect to find evidence of that knowledge in the reverse inference map, and not in the forward inference map.
The unobjectionable claim: dACC is associated with pain
Localization issues aside, L&E clearly do have a point when they note that there appears to be a relatively strong association between the posterior dACC and pain. Of course, it’s not a novel point. It couldn’t be, given that L&E’s 2003 claim that social pain and physical pain share common mechanisms was already predicated on the assumption that the dACC is selectively implicated in pain (even though, as I noted above, the putative social exclusion locus reported in that paper was actually centered in preSMA and not dACC). Moreover, the Neurosynth pain meta-analysis map that L&E used has been online for nearly 5 years now. Since the reverse inference map is loaded by default on Neurosynth, and the sagittal orthview is by default centered on x = 0, one of the first things anybody sees when they visit this page is the giant pain-related blob in the anterior cingulate cortex. When I give talks on Neurosynth, the preferential activation for pain in the posterior dACC is one of the most common examples I use to illustrate the importance of reverse inference.
But you don’t have to take my word for any of this, because my co-authors and I made this exact point in the 2011 paper introducing Neurosynth, where we observed that:
For pain, the regions of maximal pain-related activation in the insula and DACC shifted from anterior foci in the forward analysis to posterior ones in the reverse analysis. This is consistent with studies of nonhuman primates that have implicated the dorsal posterior insula as a primary integration center for nociceptive afferents and with studies of humans in which anterior aspects of the so-called ‘pain matrix’ responded nonselectively to multiple modalities.
Contrary to what L&E suggest, we did not claim in our paper that reverse inference analysis demonstrates that the dACC is not preferentially associated with any cognitive function; we made the considerably weaker point that accounting for differences in the base rate of activation changes the observed pattern of association for many terms. And we explicitly noted that there is preferential activation for pain in dACC and insula—much as L&E themselves do.
The objectionable claim: dACC is selective for pain
Of course, L&E go beyond the claims made in Yarkoni et al (2011)—and what the Neurosynth page for pain reveals—in that they claim not only that pain is preferentially associated with dACC, but that “the clearest account of dACC function is that it is selectively involved in pain-related processes.” The latter is a much stronger claim, and, if anything, is directly contradicted by the very same kind of evidence (i.e., Neurosynth maps) L&E claim to marshal in its support.
Perhaps the most obvious problem with the claim is that it’s largely based on comparison of pain with just three other groups of terms, reflecting executive function, cognitive conflict, and salience**. This is, on its face, puzzling evidence for the claim that the dACC is pain-selective. By analogy, it would be like giving people a multiple choice question asking whether their favorite color is green, fuchsia, orange, or yellow, and then proclaiming, once results were in, that the evidence suggests that green is the only color people like.
Given that Neurosynth contains more than 3,000 terms, it’s not clear why L&E only compared pain to 3 other candidates. After all, it’s entirely conceivable that dACC might be much more frequently activated by pain than by conflict or executive control, and still also be strongly associated with a large number of other functions. L&E’s only justification for this narrow focus, as far as I can tell, is that they’ve decided to only consider candidate functions that have been previously proposed in the literature:
We first examined forward inference maps for many of the psychological terms that have been associated with dACC activity. These terms were in the categories of pain (“pain”, “painful”, “noxious”), executive control (“executive”, “working memory”, “effort”, “cognitive control”, “cognitive”, “control”), conflict processing (“conflict”, “error”, “inhibition”, “stop signal”, “Stroop”, “motor”), and salience (“salience”, “detection”, “task relevant”, “auditory”, “tactile”, “visual”).
This seems like an odd decision considering that one can retrieve a rank-ordered listing of 3,000+ terms from Neurosynth at the push of a button. More importantly, L&E also omit a bunch of other accounts of dACC function that don’t focus on the above categories—for example, that the dACC is involved in various aspects of value learning (e.g., Kennerley et al., 2006; Behrens et al., 2007; autonomic control (e.g., Critchley et al., 2003; or fear processing (e.g., Milad et al., 2007). In effect, L&E are not really testing whether dACC is selective for pain; what they’re doing is, at best, testing whether the dACC is preferentially associated with pain in comparison to a select number of other candidate processes.
To be fair, L&E do report inspecting the full term rankings, even if they don’t report them explicitly:
Beyond the specific terms we selected for analyses, we also identified which psychological term was associated with the highest Z-score for each of the 8 dACC locations across all the psychological terms in the NeuroSynth database. Despite the fact that there are several hundred psychological terms in the NeuroSynth database, “pain” was the top term for 6 out of 8 locations in the dACC.
This may seem compelling at face value, but there are several problems. First, z-scores don’t provide a measure of strength of effect, they provide (at best) a measure of strength of evidence. Pain has been extensively studied in the fMRI literature, so it’s not terribly surprising if z-scores for pain are larger than z-scores for many other terms in Neurosynth. Saying that dACC is specific to pain because it shows the strongest z-score is like saying that SSRIs are the only effective treatment for depression because a drug study with a sample size of 3,000 found a smaller p-value than a cognitive-behavioral therapy (CBT) study of 100 people. If we want to know if SSRIs beat CBT as a treatment for depression, we need to directly compare effect sizes for the two treatments, not p-values or z-scores. Otherwise we’re conflating how much evidence there is for each effect with how big the effect is. At best, we might be able to claim that we’re more confident that there’s a non-zero association between dACC activation and pain than that there’s a non-zero association between dACC activation and, say, conflict monitoring. But that doesn’t constitute evidence that the dACC is more strongly associated with pain than with conflict.
Second, if one looks at effect sizes estimates rather than z-scores—which is exactly what one should do if the goal is to make claims about the relative strengths of different associations—then it’s clearly not true that dACC is specific to pain. For the vast majority of voxels within the dACC, ranking associates by descending order of posterior probability results in some term or terms other than pain occupying the top spot for a majority of dACC voxels. For example, for coordinates (0, 22 26), we get ‘experiencing’ as the top associate (PP = 86%), then pain (82%), then ’empathic’ (81%). These results seem to cast dACC in a very different light than simply saying that dACC is involved in pain. Don’t like (0, 22, 26)? Okay, pick a different dACC coordinate. Say (4, 10, 28). Now the top associates are ‘aversive’ (79%), ‘anxiety disorders’ (79%), and ‘conditioned’ (78%) (‘pain’ is a little ways back, hanging out with ‘heart’, ‘skin conductance’, and ‘taste’). Or maybe you’d like something more anterior. Well, at (-2 30 22), we have ‘abuse’ (85%), ‘incentive delay’ (84%), ‘nociceptive’ (83%), and ‘substance’ (83%). At (0, 28, 16), we have ‘dysregulation’ (84%), ‘heat’ (83%), and ‘happy faces’ (82%). And so on.
Why didn’t L&E look at the posterior probabilities, which would have been a more appropriate way to compare different terms? They justify the decision as follows:
Because Z-scores are less likely to be inflated from smaller sample sizes than the posterior probabilities, our statistical analyses were all carried out on the Z-scores associated with each posterior probability (21).”
While it’s true that terms with fewer associated studies will have more variable (i.e., extreme) posterior probability estimates, this is an unavoidable problem that isn’t in any way remedied by focusing on z-scores instead of posterior probabilities. If some terms have too few studies in Neurosynth to support reliable comparisons with pain, the appropriate thing to do is to withhold judgment until more data is available. One cannot solve the problem of data insufficiency by pretending that p-values or z-scores are measures of effect size.
Meta-analytic contrasts in Neurosynth
It doesn’t have to be this way, mind you. If we want to directly compare effect sizes for different terms—which I think is what L&E want, even if they don’t actually do it—we can do that fairly easily using Neurosynth (though you have to use the Python core tools, rather than the website). The crux of the approach is that we need to directly compare the two conditions (or terms) using only those studies in the Neurosynth database that load on exactly one of the two target terms. This typically results in a rather underpowered test, because we end up working with only a few hundred studies, rather than the full database of 11,000+ studies. But such is our Rumsfeldian life—we do analysis with the data we have, not the data we wish we had.
In any case, if we conduct direct meta-analytic contrasts of pain versus a bunch of other terms like salience, emotion, and cognitive control, we get results that look like this:
These maps are thresholded very liberally (p < .001, uncorrected), so we should be wary of reading too much into them. And, as noted above, power for meta-analytic contrasts in Neurosynth is typically quite low. Still, it’s pretty clear that the results don’t support L&E’s conclusion. While pain does indeed activate the dACC with significantly higher probability than some other topics (e.g., emotion or touch), it doesn’t differentiate pain from a number of other viable candidates (e.g., salience, fear, and autonomic control). Moreover, there are other contrasts not involving pain that also elicit significant differences—e.g., between autonomic control and emotion, or fear and cognitive control.
Given that this is the correct way to test for activation differences between different Neurosynth maps, if we were to take seriously the idea that more frequent dACC activation in pain studies than in other kinds of studies implies pain selectivity, the above results would seem to indicate that dACC isn’t selective to pain (or at least, that there’s no real evidence for that claim). Perhaps we could reasonably say that dACC cares more about pain than, say, emotion (though, as discussed below, even that’s not a given); but that’s hardly the same thing as saying that “the best psychological description of dACC function is related to pain processing”.
A > B does not imply ~B
Of course, we wouldn’t want to buy L&E’s claim that the dACC is selective for pain even if the dACC did show significantly more frequent activation for pain than for all other terms, because showing that dACC activation is greater for task A than task B (or even tasks B through Z) doesn’t entail that the dACC is not also important for task B. By analogy, demonstrating that people on average prefer the color blue to the color green doesn’t entitle us to conclude that nobody likes green.
In fairness, L&E do say that the other candidate terms they examined don’t show any associations with the dACC in the Neurosynth reverse inference maps. For instance, they show us this figure:
A cursory inspection indeed reveals very little going on for terms other than pain. But this is pretty woeful evidence for the claim of no effect, as it’s based on low-resolution visual inspection of just one mid-saggital brain slice for just a handful of terms. The only quantitative support L&E marshal for their “nothing else activates dACC” claim is an inspection of activation at 8 individual voxels within dACC, which they report largely fail to activate for anything other than pain. The latter is not a very comprehensive analysis, and makes one wonder why L&E didn’t do something a little more systematic given the strength of their claim (e.g., they could have averaged over all dACC voxels and tested whether activation occurs more frequently than chance for each term).
As it turns out, when we look at the entire dACC rather than just 8 voxels, there’s plenty of evidence that the dACC does in fact care about things other than pain. You can easily see this on neurosynth.org just by browsing around for a few minutes, but to spare you the trouble, here are reverse inference maps for a bunch of terms that L&E either didn’t analyze at all, or looked at in only the 8 selected voxels (the pain map is displayed in the first row for reference):
In every single one of these cases, we see significant associations with dACC activation in the reverse inference meta-analysis. The precise location of activation varies from case to case (which might lead us to question whether it makes sense to talk about dACC as a monolithic system with a unitary function), but the point is that pain is clearly not the only process that activates dACC. So the notion that dACC is selective to pain doesn’t survive scrutiny even if you use L&E’s own criteria.
The limits of Neurosynth
All of the above problems are, in my view, already sufficient to lay the argument that dACC is pain selective to rest. But there’s another still more general problem with the L&E analysis that would, in my view, be sufficient to warrant extreme skepticism about their conclusion even if you knew nothing at all about the details of the analysis. Namely, in arguing for pain selectivity, L&E ignore many of the known limitations of Neurosynth. There are a number of reasons to think that—at least in its present state—Neurosynth simply can’t support the kind of inference that L&E are trying to draw. While L&E do acknowledge some of these limitations in their Discussion section, in my view, they don’t take them nearly as seriously as they ought to.
First, it’s important to remember that Neurosynth can’t directly tell us whether activation is specific to pain (or any other process), because terms in Neurosynth are just that—terms. They’re not carefully assigned task labels, let alone actual mental states. The strict interpretation of a posterior probability of 80% for pain in a dACC voxel is that, if we were to take 11,000 published fMRI studies and pretend that exactly 50% of them included the term ‘pain’ in their abstracts, the presence of activation in the voxel in question should increase our estimate of the likelihood of the term ‘pain’ occurring from 50% to 80%. If this seems rather weak, that’s because it is. It’s something of a leap to go from words in abstracts to processes in people’s heads.
Now, in most cases, I think it’s a perfectly defensible leap. I don’t begrudge anyone for treating Neurosynth terms as if they were decent proxies for mental states or cognitive tasks. I do it myself all the time, and I don’t feel apologetic about it. But that’s because it’s one thing to use Neurosynth to support a loose claim like “some parts of the dACC are preferentially associated with pain”, and quite another to claim that the dACC is selective for pain, that virtually nothing else activates dACC, and that “pain represents the best psychological characterization of dACC function”. The latter is an extremely strong claim that requires one to demonstrate not only that there’s a robust association between dACC and pain (which Neurosynth supports), but also that (i) the association is meaningfully stronger than every other potential candidate, and (ii) no other process activates dACC in a meaningful way independently of its association with pain. L&E have done neither of these things, and frankly, I can’t imagine how they could do such a thing—at least, not with Neurosynth.
Second, there’s the issue of bias. Terms in Neurosynth are only good proxies for mental processes to the extent that they’re accurately represented in the literature. One important source of bias many people often point to (including L&E) is that if the results researchers report are colored by their expectations—which they almost certainly are—then Neurosynth is likely to reflect that bias. So, for example, if people think dACC supports pain, and disproportionately report activation in dACC in their papers (relative to other regions), the Neurosynth estimate of the pain-dACC assocation is likely be biased upwards. I think this is a legitimate concern, though (for technical reasons I won’t get into here) I also think it’s overstated. But there’s a second source of bias that I think is likely to be much more problematic in this particular case, which is that Neurosynth estimates (and, for that matter, estimates from every other large-scale meta-analysis, irrespective of database or method) are invariably biased to some degree by differences in the strength of different experimental manipulations.
To see what I mean, consider that pain is quite easy to robustly elicit in the scanner in comparison with many other processes or states. Basically, you attach some pain-inducing device to someone’s body and turn it on. If the device is calibrated properly and the subject has normal pain perception, you’re pretty much guaranteed to produce the experience of pain. In general, that effect is likely to be large, because it’s easy to induce fairly intense pain in the scanner.
Contrast that, with, say, emotion tasks. It’s an open secret in much of emotion research that what passes for an “emotional” stimulus is usually pretty benign by the standards of day-to-day emotional episodes. A huge proportion of studies use affective pictures to induce emotions like fear or disgust, and while there’s no doubt that such images successfully induce some change in emotional state, there are very few subjects who report large changes in experienced emotion (if you doubt this, try replacing the “extremely disgusted” upper anchor of your rating scale with “as disgusted as I would feel if someone threw up next to me” in your next study). One underappreciated implication of this is that if we decide to meta-analytically compare brain activation during emotion with brain activation during pain, our results are necessarily going to be biased by differences in the relative strengths of the two kinds of experimental manipulation—independently of any differences in the underlying neural substrates of pain and emotion. In other words, we may be comparing apples to oranges without realizing it. If we suppose, for the sake of hypothesis, that the dACC plays the same role in pain and emotion, and then compare strong manipulations of pain with weak manipulations of emotion, we would be confounding differences in experimental strength with differences in underlying psychology and biology. And we might well conclude that dACC is more important for pain than emotion—all because we have no good way of correcting for this rather mundane bias.
In point of fact, I think something like this is almost certainly true for the pain map in Neurosynth. One way to see this is to note that when we meta-analytically compare pain with almost any other term in Neurosynth (see the figure above), there are typically a lot of brain regions (extending well outside of dACC and other putative pain regions) that show greater activation for pain than for the comparison condition, and very few brain regions that show the converse pattern. I don’t think it’s plausible to think that much of the brain really prizes pain representation above all else. A more sensible interpretation is that the Neurosynth posterior probability estimates for pain are inflated to some degree by the relative ease of inducing pain experimentally. I’m not sure there’s any good way to correct for this, but given that small differences in posterior probabilities (e.g., going from 80% to 75%) would probably have large effects on the rank order of different terms, I think the onus is on L&E to demonstrate why this isn’t a serious concern for their analysis.
But it’s still good for plenty of other stuff!
Having spent a lot of time talking about Neurosynth’s limitations—and all the conclusions one can’t draw from reverse inference maps in Neurosynth—I want to make sure I don’t leave you with the wrong impression about where I see Neurosynth fitting into the cognitive neuroscience ecosystem. Despite its many weaknesses, I still feel quite strongly that Neurosynth is one of the most useful tools we have at the moment for quantifying the relative strengths of association between psychological processes and neurobiological substrates. There are all kinds of interesting uses for the data, website, and software that are completely unobjectionable. I’ve seen many published articles use Neurosynth in a variety of interesting ways, and a few studies have even used Neurosynth as their primary data source (and my colleagues and I have several more on the way). Russ Poldrack and I have a forthcoming paper in Annual Review of Psychology in which we review some of the ways databases like Neurosynth can play an invaluable role in the brain mapping enterprise. So clearly, I’m the last person who would tell anyone that Neurosynth isn’t useful for anything. It’s useful for a lot of things; but it probably shouldn’t be the primary source of evidence for very strong claims about brain-cognition or brain-behavior relationships.
What can we learn about the dACC using Neurosynth? A number of things. Here are some conclusions I think one can reasonably draw based solely on inspection of Neurosynth maps:
There are parts of dACC (particularly the more posterior aspects) that are preferentially activated in studies involving painful stimulation.
It’s likely that parts of dACC play a greater role in some aspect of pain processing than in many other candidate processes that at various times have been attributed to dACC (e.g., monitoring for cognitive conflict)—though we should be cautious, because in some cases some of those other functions are clearly represented in dACC, just in different sectors.
Many of the same regions of dACC that preferentially activate during pain are also preferentially activated by other processes or tasks—e.g., fear conditioning, autonomic arousal, etc.
I think these are all interesting and potentially important observations. They’re hardly novel, of course, but it’s still nice to have convergent meta-analytic support for claims that have been made using other methods.
So what does the dACC do?
Having read this far, you might be thinking, well if dACC isn’t selective for pain, then what does it do? While I don’t pretend to have a good answer to this question, let me make three tentative observations about the potential role of dACC in cognition that may or may not be helpful.
First, there’s actually no particular reason why dACC has to play any unitary role in cognition. It may be a human conceit to think that just because we can draw some nice boundaries around a region and give it the name ‘dACC’, there must be some corresponding sensible psychological process that passably captures what all the neurons within that chunk of tissue are doing. But the dACC is a large brain region that contains hundreds of millions of neurons with enormously complex response profiles and connectivity patterns. There’s no reason why nature should respect our human desire for simple, interpretable models of brain function. To the contrary, our default assumption should probably be that there’s considerable functional heterogeneity within dACC, so that slapping a label like “pain” onto the entire dACC is almost certainly generating more heat than light.
Second, to the degree that we nevertheless insist on imposing a single unifying label on the entire dACC, it’s very unlikely that a generic characterization like “pain” is up to the job. While we can reasonably get away with loosely describing some (mostly sensory) parts of the brain as broadly supporting vision or motor function, the dACC—a frontal region located much higher in the processing hierarchy—is unlikely to submit to a similar analysis. It’s telling that most of the serious mechanistic accounts of dACC function have shied away from extensional definitions of regional function like “pain” or “emotion” and have instead focused on identifying broad computational roles that dACC might play. Thus, we have suggestions that dACC might be involved in response selection, conflict monitoring, or value learning. While these models are almost certainly wrong (or at the very least, grossly incomplete), they at least attempt to articulate some kind of computational role dACC circuits might be playing in cognition. Saying that the dACC is for “pain”, by contrast, tells us nothing about the nature of the representations in the region.
To their credit, L&E do address this issue to some extent. Specifically, they suggest that the dACC may be involved in monitoring for “survival-relevant goal conflicts”. Admittedly, it’s a bit odd that L&E make such a suggestion at all, seeing as it directly contradicts everything they argue for in the rest of the paper (i.e., if the dACC supports detection of the general class of things that are relevant for survival, then it is by definition not selective for pain, and vice versa). Contradictions aside, however, L&E’s suggestion is not completely implausible. As the Neurosynth maps above show, the dACC is clearly preferentially activated by fear conditioning, autonomic control, and reward—all of which could broadly be construed as “survival-relevant”. The main difficulty for L&E’s survival account comes from (a) the lack of evidence of dACC involvement in other clearly survival-relevant stimuli or processes—e.g., disgust, respiration, emotion, or social interaction, and (b) the availability of other much more plausible theories of dACC function (see the next point). Still, if we’re relying strictly on Neurosynth for evidence, we can give L&E the benefit of the doubt and reserve judgment on their survival-relevant account until more data becomes available. In the interim, what should not be controversial is that such an account has no business showing up in a paper titled “the dorsal anterior cingulate cortex is selective for pain”—a claim it is completely incompatible with.
Third, theories of dACC function based largely on fMRI evidence don’t (or shouldn’t) operate in a vacuum. Over the past few decades, literally thousands of animal and human studies have investigated the structure and function of the anterior cingulate cortex. Many of these studies have produced considerable insights into the role of the ACC (including dACC), and I think it’s safe to say that they collectively offer a much richer understanding than what fMRI studies—let alone a meta-analytic engine like Neurosynth—have produced to date. I’m especially partial to the work of Brent Vogt and colleagues (e.g., Vogt (2005); Vogt & Sikes, 2009), who have suggested a division within the anterior mid-cingulate cortex (aMCC; a region roughly co-extensive with the dACC in L&E’s nomenclature) between a posterior region involved in bodily orienting, and an anterior region associated with fear and avoidance behavior (though the two functions overlap in space to a considerable degree). Schematically, their “four-region” architectural model looks like this:
While the aMCC is assumed to contains many pain-selective neurons (as do more anterior sectors of the cingulate), it’s demonstrably not pain-selective, as neurons throughout the aMCC also respond to other stimuli (e.g., non-painful touch, fear cues, etc.).
Aside from being based on an enormous amount of evidence from lesion, electrophysiology, and imaging studies, the Vogt characterization of dACC/aMCC has several other nice features. For one thing, it fits almost seamlessly with the Neurosynth results displayed above (e.g., we find MCC activation associated with pain, fear, autonomic, and sensorimotor processes, with pain and fear overlapping closely in aMCC). For another, it provides an elegant and parsimonious explanation for the broad extent of pain-related activation in anterior cingulate cortex even though no part of aMCC is selective for pain (i.e., unlike other non-physical stimuli, pain involves skeletomotor orienting, and unlike non-painful touch, it elicits avoidance behavior and subjective unpleasantness).
Perhaps most importantly, Vogt and colleagues freely acknowledge that their model—despite having a very rich neuroanatomical elaboration—is only an approximation. They don’t attempt to ascribe a unitary role to aMCC or dACC, and they explicitly recognize that there are distinct populations of neurons involved in reward processing, response selection, value learning, and other aspects of emotion and cognition all closely interdigitated with populations involved in aspects of pain, touch, and fear. Other systems-level neuroanatomical models of cingulate function share this respect for the complexity of the underlying circuitry—complexity that cannot be adequately approximated by labeling the dACC simply as a pain region (or, for that matter, a “survival-relevance” region).
Lieberman & Eisenberger (2015) argue, largely on the basis of evidence from my Neurosynth framework, that the dACC is selective for pain. They are wrong. Neurosynth does not—and, at present, cannot—support such a conclusion. Moreover, a more careful examination of Neurosynth results directly refutes Lieberman and Eisenberger’s claims, providing clear evidence that the dACC is associated with many other operations, and converging with extensive prior animal and human work to suggest a far more complex view of dACC function.
This is probably the first time I’ve been able to call myself the world’s foremost expert on anything while keeping a straight face. It feels pretty good.
** L&E show meta-analysis maps for a few more terms in an online supplement, but barely discuss them, even though at least one term (fear) clearly activates very similar parts of dACC.
I like to think of myself as a data-respecting guy–by which I mean that I try to follow the data wherever it leads, and work hard to suppress my intuitions in cases where those intuitions are convincingly refuted by the empirical evidence. Over the years, I’ve managed to argue myself into believing many things that I would have once found ludicrous–for instance, that parents have very little influence on their children’s personalities, or that in many fields, the judgments of acclaimed experts with decades of training are only marginally better than those of people selected at random, and often considerably worse than simple actuarial models. I believe these things not because I want to or like to, but because I think a dispassionate reading of the available evidence suggests that that’s just how the world works, whether I like it or not.
Still, for all of my efforts, there are times when I find myself unable to set aside my intuitions in the face of what would otherwise be pretty compelling evidence. A case in point is the putative relationship between weather and mood. I think most people–including me–take it as a self-evident fact that weather exerts a strong effect on mood. Climate is one of the first things people bring up when discussing places they’ve lived or visited. When I visit other cities and talk to people about what Austin, Texas (my current home) is like, my description usually amounts to something like it’s an amazing place to live so long as you don’t mind the heat. When people talk about Seattle, they bitch about the rain and the clouds; when people rave about living in California, they’re often thinking in no small part about the constant sunshine that pervades most of the state. When someone comments on the absurdly high rate of death metal bands in Finland, our first reaction is to chuckle and think well, what the hell else is there to do that far up north in the winter?–a reaction promptly followed by a twinge of guilt, because Seasonal Affective Disorder is no laughing matter.
And yet… and yet, the empirical evidence linking variations in the weather to variations in human mood is surprisingly scant. There are a few published reports of very large effects of weather on mood going back several decades, but these are invariably from very small samples–and we know that big correlations tend to occur in little studies. By contrast, large-scale studies with hundreds or thousands of subjects have found very little evidence of a relationship between mood and weather–and the effects identified are not necessarily consistent across studies.
For example, Denissen and colleagues (2008) fit a series of multilevel models of the relationship between objective weather parameters and self-reported mood in 1,233 German subjects, and found only very small associations between weather variables and negative (but not positive) affect. [Klimstra et al (2011)] found similarly negligible main effects in another sample of ~500 subjects. The state of the empirical literature on weather and mood was nicely summed up by Denissen et al in their Discussion:
As indicated by the relatively small regression weights, weather fluctuations accounted for very little variance in people’s day-to-day mood. This result may be unexpected given the existence of commonly held conceptions that weather exerts a strong influence on mood (Watson, 2000), though it replicates findings by Watson (2000) and Keller et al. (2005), who also failed to report main effects. –Dennisen et al (2008)
With the advent of social media and that whole Big Data thing, we can now conduct analyses on a scale that makes the Denissen or Klimstra studies look almost like case studies. In particular, the availability of hundreds of millions of tweets and facebook posts, coupled with comprehensive weather records from every part of the planet, means that we can now investigate the effects of almost every kind of weather pattern (cloud cover, temperature, humidity, barometric pressure, etc.) on many different indices of mood. And yet, here again, the evidence is not very kind to our intuitive notion of a strong association between weather and mood.
For example, in a study of 10 million facebook users in 100 US cities, Coviello et al (2014) found that the incidence of positive posts decreased by approximately 1%, and that of negative posts increased by 1%, on days when rain fell compared to days without rain. While that finding is certainly informative (and served as a starting point for other much more impressive analyses of network contagion), it’s not a terribly impressive demonstration of weather’s supposedly robust impact on mood. I mean, a 1% increase in rain-induced negative affect is probably not what’s really keeping anyone from moving to Seattle. Yet if anyone’s managed to detect a much bigger effect of weather on mood in a large-sample study, I’m not aware of it.
I’ve also had the pleasure of experiencing the mysterious absence of weather effects firsthand: as a graduate student, I once spent nearly two weeks trying to find effects of weather on mood in a large dataset (thousands of users from over twenty cities worldwide) culled from LiveJournal, taking advantage of users’ ability to indicate their mood in a status field via an emoticon (a feat of modern technology that’s now become nearly universal thanks to the introduction of those 4-byte UTF-8 emoji monstrosities 🙀👻🍧😻). I stratified my data eleventy different ways; I tried kneading it into infinity-hundred pleasant geometric shapes; I sang to it in the shower and brought it ice cream in bed. But nothing worked. And I’m pretty sure it wasn’t that my analysis pipeline was fundamentally broken, because I did manage (as a sanity check) to successfully establish that LiveJournal users are more likely to report feeling “cold” when the temperature outside is lower (❄️😢). So it’s not like physical conditions have no effect on people’s internal states. It’s just that the obvious weather variables (temperature, rain, humidity, etc.) don’t seem to shift our mood very much, despite our persistent convictions.
Needless to say, that project is currently languishing quite comfortably in the seventh level of file drawer hell (i.e., that bottom drawer that I locked then somehow lost the key to).
Anyway, the question I’ve been mulling over on and off for several years now–though, two-week data-mining binge aside, never for long enough to actually arrive at a satisfactory answer–is why empirical studies have been largely unable to detect an effect of weather on mood. Here are some of the potential answers I’ve come up with:
There really isn’t a strong effect of weather on mood, and the intuition that there is one stems from a perverse kind of cultural belief or confirmation bias that leads us all to behave in very strange, and often life-changing, ways–for example, to insist on moving to Miami instead of Seattle (which, climate aside, would be a crazy move, right?). This certainly allows for the possibility that there are weak effects on mood–which plenty of data already support–but then, that’s not so exciting, and doesn’t explain why so many people are so eager to move to Hawaii or California for the great weather.
Weather does exert a big effect on mood, but it does so in a highly idiosyncratic way that largely averages out across individuals. On this view, while most people’s mood might be sensitive to weather to some degree, the precise manifestation differs across individuals, so that some people would rather shoot themselves in the face than spend a week in an Edmonton winter, while others will swear up and down that it really is possible (no, literally!) to melt in the heat of a Texas summer. From a modeling standpoint, if the effects of weather on mood are reliable but extremely idiosyncratic, identifying consistent patterns could be a very difficult proposition, as it would potentially require us to model some pretty complex higher-order interactions. And the difficulty is further compounded by strong geographic selection biases: since people tend to move to places where they like the climate, the variance in mood attributable to weather changes is probably much smaller than it would be under random dispersal.
People’s mood is heavily influenced by the weather when they first spend time somewhere new, but then they get used to it. We habituate to almost everything else, so why not weather? Maybe people who live in California don’t really benefit from living in constant sunshine. Maybe they only enjoyed the sun for their first two weeks in California, and the problem is that now, whenever they travel somewhere else, the rain/snow/heat of other places makes them feel worse than their baseline (habituated) state. In other words, maybe Californians have been snorting sunshine for so long that they now need a hit of clarified sunbeams three times a day just to feel normal.
The relationship between objective weather variables and subjective emotional states is highly non-linear. Maybe we can’t consistently detect a relationship between high temperatures and anger because the perception of temperature is highly dependent on a range of other variables (e.g., 30 degrees celsius can feel quite pleasant on a cloudy day in a dry climate, but intolerable if it’s humid and the sun is out). This would make the modeling challenge more difficult, but certainly not insurmountable.
Our measures of mood are not very reliable, and since reliability limits validity, it’s no surprise if we can’t detect consistent effects of weather on mood. Personally I’m actually very skeptical about this one, since there’s plenty of evidence that self-reports of emotion are more than adequate in any number of other situations (e.g., it’s not at all hard to detect strong trait effects of personality on reported mood states). But it’s still not entirely crazy to suggest that maybe what we’re looking at is at least partly a measurement problem—especially once we start talking about algorithmically extracting sentiment from Twitter or Facebook posts, which is a notoriously difficult problem.
The effects of weather on mood are strong, but very transient, and we’re simply not very good at computing mental integrals over all of our moment-by-moment experiences. That is, we tend to overestimate the impact of weather on our mood because we find it easy to remember instances when the weather affected our mood, and not so easy to track all of the other background factors that might influence our mood more deeply but less perceptibly. There are many heuristics and biases you could attribute this to (e.g., the peak-end rule, the availability heuristic, etc.), but the basic point is that, on this view, the belief that the weather robustly influences our mood is a kind of mnemonic illusion attributable to well-known bugs in (or, more charitably, features of) our cognitive architecture.
Anyway, as far as I can tell, none of the above explanations fully account for the available data. And, to be fair, there’s no reason to think any of them should: if I had to guess, I would put money on the true explanation being a convoluted mosaic of some or all of the above factors (plus others I haven’t considered, no doubt). But the proximal problem is that there just doesn’t seem to be much data to speak to the question one way or the other. And this annoys me more than I would like. I won’t go so far as to say I spend a lot of time thinking about the problem, because I don’t. But I think about it often enough that writing a 2,000-word blog post in the hopes that other folks will provide some compelling input seems like a very reasonable time investment.
And so, having read this far—which must mean you’re at least vaguely entertained, right?—it’s your turn to help me out. Please tell me: Why is it so damn hard to detect the effects of weather on mood? Make it rain comments! It will probably cheer me up. Slightly.
There’s a general consensus among biomedical scientists working in the United States that the NIH funding system is in a state of serious debilitation, if not yet on life support. After years of flat budgets and an ever-increasing number of PIs, success rates for R01s (the primary research grant mechanism at NIH) are at an all-time low, even as the average annual budget of awards has decreased in real dollars. The problem, unfortunately, is that there doesn’t appear to be an easy way to fix this problem. As many commentators have noted, there are some very deeply-rooted and systematic incentives that favor a perpetuation, and even exacerbation, of the current problems.
Here’s my suggestion, which I’m also dutifully sending in to NIH in much-abridged form. The basic idea I’ll explore in this post is very simple: NIH should start yoking the success rates of proposals to the amount of money they request. The proposal is not meant to be a long-term solution, and is in some ways just a stopgap measure until more serious policy changes take place. But it’s a stopgap measure that could conceivably increase success rates by a few points for at least a few years, with relatively little implementation cost and few obvious downsides. So I think it’s at least worth considering.
At the moment, the NIH funding system arguably incentivizes PIs to ask for as much money as they think they can responsibly handle. To see why, let’s forget about NIH for the moment and consider, in day-to-day life, the typical relationship between investment cost and probability of investment (holding constant expected returns, which I’ll address later). Generally speaking, the two are inversely related. If a friend asks you to lend them $10, you might lend it without even asking them what they need it for. If, instead, your friend asks you for $100, you might want to know what it’s for, and you might also ask for some indication of how soon you’ll be paid back. But if your friend asks you for $10,000… well, you’re probably going to want to see a business plan and a legally-binding contract laying out a repayment schedule. There is a general understanding in most walks of life that if someone asks you to invest in them more heavily, they expect to see more evidence that you can deliver on whatever it is that you’re promising to do.
At NIH, things don’t work exactly that way. In many ways, there’s actually a positive incentive to ask for more money when writing a grant application. The perverse incentives play out at multiple levels–both across different grant mechanisms, and within the workhorse R01 mechanism. In the former case, a glance at the success rates for different R mechanisms reveals something that many PIs are, in my experience, completely unaware of: “small”-grant mechanisms like the R03 and R21 have lower–in some cases much lower–success rates than R01s at nearly all NIH institutes. This despite the fact that R21s and R03s are advertised as requiring little or no pilot data, and have low budget caps and short award durations (e.g., a maximum of $275,000 over two years for the R21).
Now you might say: well, sure, if you have a grant program expressly designed for exploratory projects, it’s not surprising if the funding rate is much lower, because you’re probably getting an obscene number of applications from people who aren’t in a position to compete for a full-blown R01. But that’s not really it, because the number of R21 and R03 submissions is also much lower than the number of R01 submissions (e.g., in 2013, NCI funded 14.7% of 4,170 R01 applications, but only 10.6% of 2,557 R21 applications). In the grand scheme of things, the amount of money allocated to “small” grants at NIH pales in comparison to the amount allocated to R01s.
The reason that R21s and R03s aren’t much more common is… well, I actually don’t know. But the point is that the data suggest that, in general (though there are of course exceptions), it’s empirically a pretty bad idea to submit R03s and R21s (particularly if you’re an Early Stage Investigator). The succes rates for R01s are higher, you can ask for a lot more money, the project periods are longer, and the amount of work involved in writing the proposal is not dramatically higher. When you look at it that way, it’s not so surprising that PIs don’t submit that many R21/R03 applications: on average, they’re a bad time investment.
The same perverse incentives apply even if you focus on only R01 submissions. You might think that, other things being equal, NIH would prioritize proposals that ask for less money. That may well be true from an administrative standpoint, in the sense that, if two applications receive exactly the same score from a review panel, and are pretty similar in most respects, one imagines that most program officers would prefer to fund the proposal with the smaller budget. But the problem is that, in the grand scheme of things, discretionary awards (i.e., where the PO has the power to choose which award to fund) are a relatively small proportion of the total budget. The majority of proposals get funded because they receive very good scores at review. And it turns out that, at review, asking for more money can actually work in a PI’s favor.
Unless specified otherwise in the Funding Opportunity Announcement, consideration of the budget and project period should not affect the overall impact score.
What should the reviewer do, in regards to the budget? Well, not much:
The reviewer should determine whether the requested budget is realistic for the conduct of the project proposed.
The explicit decoupling of budget from merit sets up a very serious problem, because if you allow yourself to ask for more money, you can also propose correspondingly grander work. By the time reviewers see your proposal, they have no real way of knowing whether you first decided on the minimum viable research program you want to run and then came up with an appropriate budget, or if you instead picked a largish number out of a hat and then proposed a perfectly reasonable (but large) amount of science you could do in order to fit that budget.
At the risk of making my own life a little bit more difficult, I’m willing to put my money where my mouth is on this point. For just about every proposal I’ve sent to NIH so far, I’ve asked for more money than I strictly need. Now, “need” is a tricky word in this context. I emphatically am not suggesting that I routinely ask NIH for more money just for the sake of having more money. I can honestly say that I’ve never asked for any funds that I didn’t think I could use responsibly in the pursuit of what I consider to be good science. But the trouble is, virtually every PI who’s ever applied for government funding will happily tell you that they could always do more good science if they just had more money. And, to a first order of approximation, they’re right. Unless a PI already has multiple major grants (which is a very small proportion of PIs at NIH), she or he probably could do more good work if given more money. There might be diminishing returns at some point, but for the most part it should not be terribly surprising if the average PI could increase her or his productivity level somewhat if given the money to hire more personnel, buy better equipment, run more experiments, and so on.
Unfortunately, the NIH budget is a zero-sum game. Every grant dollar I get is a grant dollar some other PI doesn’t get. So, when I go out and ask for a large-but-not-unreasonable amount of money, knowing full well that I could still run a research lab and get at least some good science done with less money, I am, in a sense, screwing everyone else over. Except that I’m not really screwing everyone else over, because everyone else is doing exactly the same thing I am. And the result is that we end up with a lot of PIs proposing a lot of very large projects. The PIs who win the grant lottery (because, increasingly, that’s what it is) will, generally, do a lot of good science with it. So it’s not so much that money is wasted; it’s more that it’s not distributed optimally, because the current system incentivizes people to ask for as much money as they think they can responsibly manage, rather than asking for the minimum amount they need to actually sustain a viable research enterprise.
The solution to this problem is, on paper, quite simple (which is probably why it’s only on paper). The way to induce PIs to ask for the minimum amount they think they can do their research with–thereby freeing up money for everyone else–is to explicitly yoke risk to reward, so that there’s a clearly discernible cost to asking for every increment in funding. You want $50,000 a year? Okay, that’s pretty easy to fund, so we’re not going to ask you a lot of questions. You want $500k/year? Well, hey, look, there are 10 people out in the hallway who each claim they can produce two papers a year on just $50k. So you’re going to have to explain why we should fund one of you instead of ten of them.
How would this proposal be implemented? There are many ways one could go about it, but here’s one that makes sense to me. First, we get rid of all of the research grant (R-type) mechanisms–except maybe for those that have some clearly differentiated purpose (e.g., R25s for training courses). Second, we introduce new R grant programs defined only by their budget caps and durations. For example, we might have R50s (max 50k/year for 2 years), R150s (max 150k/year for 3 years), R300s (max 300k/year for 5 years), and so on. The top tier would have no explicit cap, just like the current R01s. Third, we explicitly tie success rates to budget caps by deciding (and publicly disclosing) how much money we’re allocating to each tier. Each NIH institute would have to decide approximately what its payline for each tier would be for the next year–with the general constraint that the money would be allocated in such a way as to produce a strong inverse correlation between success rate and budget amount. So we might see, for instance, NIMH funding R50s at 50%, R150s at 28%, R300s at 22%, and R1000s at 8%. There would presumably be an initial period of fine-tuning, but over four or five award cycles, the system would almost certainly settle into a fairly stable equilibrium. Paylines would necessarily rise, because PIs would be incentivized to ask for only as much money as they truly need.
Are there objections to the approach I’ve suggested above? Sure. Perhaps the most obvious concern will come from people who do genuinely “big” science–i.e., who work in fields where simply keeping a small lab running can cost hundreds of thousands of dollars a year. Researchers in such fields might complain that yoking success rates to budgets would mean that their colleagues who work on less expensive scientific problems have a major advantage when it comes to securing funding, and that Big Science types would consequently find it harder to survive.
There are several things to note about this objection. First, there’s actually no necessary reason why yoking success rates to budgets has to hurt larger applications. The only assumption this proposal depends on is that, at the moment, some proportion of budgets are inflated–i.e., there are many researchers who could operate successfully (if less comfortably) on smaller budgets than they currently do. The fact that many other investigators couldn’t operate on smaller budgets is immaterial. If 25% of NIH PIs voluntarily opt into a research grant program that guarantees higher success rates in return for smaller budgets, the other 75% of PIs could potentially benefit even if they do nothing at all (depending on how success rates are set). So if you currently run a lab that can’t possibly run on less than $500k/year, you don’t necessarily lose anything if one of your colleagues who was previously submitting grants with $250k annual budgets decides to start writing grants with $125k caps in return for, say, a 10% increase in funding likelihood. On the contrary, it could actually mean that there’s more money left over at the end of the day to fund your own big grants.
Now, it’s certainly true that NIH PIs who work in cheaper domains would have an easier time staying afloat than ones who work in expensive domains. And it’s also true that NIH could explicitly bias in favor of small grants by raising the success rates for small grants disproportionately. But that isn’t necessarily a problem. Personally, I would argue that a moderate bias towards small grants is actually a very good thing. Remember: funding is a zero-sum game. It may seem egalitarian to make success rates independent of operating costs, because it feels like we’re giving everyone a roughly equal shot at a career in biomedical science, no matter what science they like to do. But in another sense, we aren’t being egalitarian at all, because what we’re actually saying is that a scientist who likes to work on $500k problems is worth five times as much to the taxpayer as one who likes to work on $100k problems. That seems unlikely to be true in the general case (though it may certainly be true in a minority of cases), because it’s hard to believe that the cost of doing scientific research is very closely linked to the potential benefits to people’s health (i.e., there are almost certainly many very expensive scientific disciplines that don’t necessarily produce very big benefits to taxpayers). Personally, I don’t see anything wrong with setting a higher bar for research programs that cost more taxpayer money to fund. And note that I’m arguing against my own self-interest here, because my own research is relatively expensive (most of it involves software development, and the average developer salary is roughly double the average postdoc salary).
Lastly, it’s important to keep in mind that this proposal doesn’t in any way precludes the use of other, complementary, funding mechanisms. At present, NIH already routinely issues PAs and RFAs for proposals in areas of particular interest, or which for various reasons (including budget-related considerations) need to be considered separately from other applications. This wouldn’t change in any way under the proposed system. So, for example, if NIH officials decided that it was in the nation’s best interest to fund a round of $10 million grants to develop new heart transplant techniques, they could still issue a special call for such proposals. The plan I’ve sketched above would apply only to “normal” grants.
Okay, so that’s all I have. I was initially going to list a few other potential objections (and rebuttals), but decided to leave that for discussion. Please use the comments to tell me (and perhaps NIH) why this proposal would or wouldn’t work.
[The report below was collectively authored by participants at the Open Source, Open Science meeting, and has been cross-posted in other places.]
On March 19th and 20th, the Center for Open Science hosted a small meeting in Charlottesville, VA, convened by COS and co-organized by Kaitlin Thaney (Mozilla Science Lab) and Titus Brown (UC Davis). People working across the open science ecosystem attended, including publishers, infrastructure non-profits, public policy experts, community builders, and academics.
Open Science has emerged into the mainstream, primarily due to concerted efforts from various individuals, institutions, and initiatives. This small, focused gathering brought together several of those community leaders. The purpose of the meeting was to define common goals, discuss common challenges, and coordinate on common efforts.
We had good discussions about several issues at the intersection of technology and social hacking including badging, improving standards for scientific APIs, and developing shared infrastructure. We also talked about coordination challenges due to the rapid growth of the open science community. At least three collaborative projects emerged from the meeting as concrete outcomes to combat the coordination challenges.
A repeated theme was how to make the value proposition of open science more explicit. Why should scientists become more open, and why should institutions and funders support open science? We agreed that incentives in science are misaligned with practices, and we identified particular pain points and opportunities to nudge incentives. We focused on providing information about the benefits of open science to researchers, funders, and administrators, and emphasized reasons aligned with each stakeholders’ interests. We also discussed industry interest in “open”, both in making good use of open data, and also in participating in the open ecosystem. One of the collaborative projects emerging from the meeting is a paper or papers to answer the question “Why go open?” for researchers.
Many groups are providing training for tools, statistics, or workflows that could improve openness and reproducibility. We discussed methods of coordinating training activities, such as a training “decision tree” defining potential entry points and next steps for researchers. For example, Center for Open Science offers statistics consulting, rOpenSci offers training on tools, and Software Carpentry, Data Carpentry, and Mozilla Science Lab offer training on workflows. A federation of training services could be mutually reinforcing and bolster collective effectiveness, and facilitate sustainable funding models.
The challenge of supporting training efforts was linked to the larger challenge of funding the so-called “glue” – the technical infrastructure that is only noticed when it fails to function. One such collaboration is the SHARE project, a partnership between the Association of Research Libraries, its academic association partners, and the Center for Open Science. There is little glory in training and infrastructure, but both are essential elements for providing knowledge to enable change, and tools to enact change.
Another repeated theme was the “open science bubble”. Many participants felt that they were failing to reach people outside of the open science community. Training in data science and software development was recognized as one way to introduce people to open science. For example, data integration and techniques for reproducible computational analysis naturally connect to discussions of data availability and open source. Re-branding was also discussed as a solution – rather than “post preprints!”, say “get more citations!” Another important realization was that researchers who engage with open practices need not, and indeed may not want to, self-identify as “open scientists” per se. The identity and behavior need not be the same.
A number of concrete actions and collaborative activities emerged at the end, including a more coordinated effort around badging, collaboration on API connections between services and producing an article on best practices for scientific APIs, and the writing of an opinion paper outlining the value proposition of open science for researchers. While several proposals were advanced for “next meetings” such as hackathons, no decision has yet been reached. But, a more important decision was clear – the open science community is emerging, strong, and ready to work in concert to help the daily scientific practice live up to core scientific values.
Authors [Authors are listed in reverse alphabetical order; order does not denote relative contribution.]
Digital object identifiers (DOIs) are much sought-after commodities in the world of academic publishing. If you’ve never seen one, a DOI is a unique string associated with a particular digital object (most commonly a publication of some kind) that lets the internet know where to find the stuff you’ve written. For example, say you want to know where you can get a hold of an article titled, oh, say, Designing next-generation platforms for evaluating scientific output: what scientists can learn from the social web. In the real world, you’d probably go to Google, type that title in, and within three or four clicks, you’d arrive at the document you’re looking for. As it turns out, the world of formal resource location is fairly similar to the real world, except that instead of using Google, you go to a website called dx.DOI.org, and then you plug in the string ‘10.3389/fncom.2012.00072’, which is the DOI associated with the aforementioned article. And then, poof, you’re automagically linked directly to the original document, upon which you can gaze in great awe for as long as you feel comfortable.
Historically, DOIs have almost exclusively been issued by official-type publishers: Elsevier, Wiley, PLoS and such. Consequently, DOIs have had a reputation as a minor badge of distinction–probably because you’d traditionally only get one if your work was perceived to be important enough for publication in a journal that was (at least nominally) peer-reviewed. And perhaps because of this tendency to view the presence of a DOIs as something like an implicit seal of approval from the Great Sky Guild of Academic Publishing, many journals impose official or unofficial commandments to the effect that, when writing a paper, one shalt only citeth that which hath been DOI-ified. For example, here’s a boilerplate Elsevier statement regarding references (in this case, taken from the Neuron author guidelines):
References should include only articles that are published or in press. For references to in press articles, please confirm with the cited journal that the article is in fact accepted and in press and include a DOI number and online publication date. Unpublished data, submitted manuscripts, abstracts, and personal communications should be cited within the text only.
This seems reasonable enough until you realize that citations that occur “within the text only” aren’t very useful, because they’re ignored by virtually all formal citation indices. You want to cite a blog post in your Neuron paper and make sure it counts? Well, you can’t! Blog posts don’t have DOIs! You want to cite a what? A tweet? That’s just crazy talk! Tweets are 140 characters! You can’t possibly cite a tweet; the citation would be longer than the tweet itself!
The injunction against citing DOI-less documents is unfortunate, because people deserve to get credit for the interesting things they say–and it turns out that they have, on rare occasion, been known to say interesting things in formats other than the traditional peer-reviewed journal article. I’m pretty sure if Mark Twain were alive today, he’d write the best tweets EVER. Well, maybe it would be a tie between Mark Twain and the NIH Bear. But Mark Twain would definitely be up there. And he’d probably write some insightful blog posts too. And then, one imagines that other people would probably want to cite this brilliant 21st-century man of letters named @MarkTwain in their work. Only they wouldn’t be allowed to, you see, because 21st-century Mark Twain doesn’t publish all, or even most, of his work in traditional pre-publication peer-reviewed journals. He’s too impatient to rinse-and-repeat his way through the revise-and-resubmit process every time he wants to share a new idea with the world, even when those ideas are valuable. 21st-century @MarkTwain just wants his stuff out there already where people can see it.
Why does Elsevier hate 21st-century Mark Twain, you ask? I don’t know. But in general, I think there are two main reasons for the disdain many people seem to feel at the thought of allowing authors to freely cite DOI-less objects in academic papers. The first reason has to do with permanence—or lack thereof. The concern here is that if we allowed everyone to cite just any old web page, blog post, or tweet in academic articles, there would be no guarantee that those objects would still be around by the time the citing work was published, let alone several years hence. Which means that readers might be faced with a bunch of dead links. And dead links are not very good at backing up scientific arguments. In principle, the DOI requirement is supposed to act like some kind of safety word that protects a citation from the ravages of time—presumably because having a DOI means the cited work is important enough for the watchful eye of Sauron Elsevier to periodically scan across it and verify that it hasn’t yet fallen off of the internet’s cliffside.
The second reason has to do with quality. Here, the worry is that we can’t just have authors citing any old opinion someone else published somewhere on the web, because, well, think of the children! Terrible things would surely happen if we allowed authors to link to unverified and unreviewed works. What would stop me from, say, writing a paper criticizing the idea that human activity is contributing to climate change, and supporting my argument with “citations” to random pages I’ve found via creative Google searches? For that matter, what safeguard would prevent a brazen act of sockpuppetry in which I cite a bunch of pages that I myself have (anonymously) written? Loosening the injunction against formally citing non-peer-reviewed work seems tantamount to inviting every troll on the internet to a formal academic dinner.
To be fair, I think there’s some merit to both of these concerns. Or at least, I think there used to be some merit to these concerns. Back when the internet was a wee nascent flaky thing winking in and out of existence every time a dial-up modem connection went down, it made sense to worry about permanence (I mean, just think: if we had allowed people to cite GeoCities webpages in published articles, every last one of those citations links would now be dead!) And similarly, back in the days when peer review was an elite sort of activity that could only be practiced by dignified gentlepersons at the cordial behest of a right honorable journal editor, it probably made good sense to worry about quality control. But the merits of such concerns have now largely disappeared, because we now live in a world of marvelous technology, where bits of information cost virtually nothing to preserve forever, and a new post-publication platform that allows anyone to review just about any academic work in existence seems to pop up every other week (cf. PubPeer, PubMed Commons, Publons, etc.). In the modern world, nothing ever goes out of print, and if you want to know what a whole bunch of experts think about something, you just have to ask them about it on Twitter.
Unfortunately, there’s a small problem with this URL: it contains nary a DOI in sight. Really. None of the eleventy billion possible substrings in it look anything like a DOI. You can even scramble the characters if you like; I don’t care. You’re still not going to find one. Which means that most journals won’t allow you to officially cite this blog post in your academic writing. Or any other post, for that matter. You can’t cite my post about statistical power and magical sample sizes; you can’t cite Joe Simmons’ Data Colada post about Mturk and effect sizes; you can’t cite Sanjay Srivastava’s discussion of replication and falsifiability; and so on ad infinitum. Which is a shame, because it’s a reasonably safe bet that there are at least one or two citation-worthy nuggets of information trapped in some of those blog posts (or millions of others), and there’s no reason to believe that these nuggets must all have readily-discoverable analogs somewhere in the “formal” scientific literature. As the Elsevier author guidelines would have it, the appropriate course of action in such cases is to acknowledge the source of an idea or finding in the text of the article, but not to grant any other kind of formal credit.
Now, typically, this is where the story would end. The URL can’t be formally cited in an Elsevier article; end of story. BUT! In this case, the story doesn’t quite end there. A strange thing happens! A short time after it appears on my blog, this post also appears–in virtually identical form–on something called The Winnower, which isn’t a blog at all, but rather, a respectable-looking alternative platform for scientific publication and evaluation.
Even more strangely, on The Winnower, a mysterious-looking set of characters appear alongside the text. For technical reasons, I can’t tell you what the set of characters actually is (because it isn’t assigned until this piece is published!). But I can tell you that it starts with “10.15200/winn”. And I can also tell you what it is: It’s a DOI! It’s one bona fide free DOI, courtesy of The Winnower. I didn’t have to pay for it, or barter any of my services for it, or sign away any little pieces of my soul to get it*. I just installed a WordPress plugin, pressed a few buttons, and… poof, instant DOI. So now this is, proudly, one of the world’s first N (where N is some smallish number probably below 1000) blog posts to dress itself up in a nice DOI (Figure 1). Presumably because it’s getting ready for a wild night out on the academic town.
Does the mere fact that my blog post now has a DOI actually change anything, as far as the citation rules go? I don’t know. I have no idea if publishers like Elsevier will let you officially cite this piece in an article in one of their journals. I would guess not, but I strongly encourage you to try it anyway (in fact, I’m willing to let you try to cite this piece in every paper you write for the next year or so—that’s the kind of big-hearted sacrifice I’m willing to make in the name of science). But I do think it solves both the permanence and quality control issues that are, in theory, the whole reason for journals having a no-DOI-no-shoes-no-service policy in the first place.
How? Well, it solves the permanence problem because The Winnower is a participant in the CLOCKSS archive, which means that if The Winnower ever goes out of business (a prospect that, let’s face it, became a little bit more likely the moment this piece appeared on their site), this piece will be immediately, freely, and automatically made available to the worldwide community in perpetuity via the associated DOI. So you don’t need to trust the safety of my blog—or even The Winnower—any more. This piece is here to stay forever! Rejoice in the cheapness of digital information and librarians’ obsession with archiving everything!
As for the quality argument, well, clearly, this here is not what you would call a high-quality academic work. But I still think you should be allowed to cite it wherever and whenever you want. Why? For several reasons. First, it’s not exactly difficult to determine whether or not it’s a high-quality academic work—even if you’re not willing to exercise your own judgment. When you link to a publication on The Winnower, you aren’t just linking to a paper; you’re also linking to a review platform. And the reviews are very prominently associated with the paper. If you dislike this piece, you can use the comment form to indicate exactly why you dislike it (if you like it, you don’t need to write a comment; instead, send an envelope stuffed with money to my home address).
Second, it’s not at all clear that banning citations to non-prepublication-reviewed materials accomplishes anything useful in the way of quality control. The reliability of the peer-review process is sufficiently low that there is simply no way for it to consistently sort the good from the bad. The problem is compounded by the fact that rejected manuscripts are rarely discarded forever; typically, they’re quickly resubmitted to another journal. The bibliometric literature shows that it’s possible to publish almost anything in the peer-reviewed literature given enough persistence.
Third, I suspect—though I have no data to support this claim—that a worldview that treats having passed peer review and/or receiving a DOI as markers of scientific quality is actually counterproductive to scientific progress, because it promotes a lackadaisical attitude on the part of researchers. A reader who believes that a claim is significantly more likely to be true in virtue of having a DOI is a reader who is slightly less likely to take the extra time to directly evaluate the evidence for that claim. The reality, unfortunately, is that most scientific claims are wrong, because the world is complicated and science is hard. Pretending that there is some reasonably accurate mechanism that can sort all possible sources into reliable and unreliable buckets—even to a first order of approximation—is misleading at best and dangerous at worst. Of course, I’m not suggesting that you can’t trust a paper’s conclusions unless you’ve read every work it cites in detail (I don’t believe I’ve ever done that for any paper!). I’m just saying that you can’t abdicate the responsibility of evaluating the evidence to some shapeless, anonymous mass of “reviewers”. If I decide not to chase down the Smith & Smith (2007) paper that Jones & Jones (2008) cite as critical support for their argument, I shouldn’t be able to turn around later and say something like “hey, Smith & Smith (2007) was peer reviewed, so it’s not my fault for not bothering to read it!”
So where does that leave us? Well, if you’ve read this far, and agree with most or all of the above arguments, I hope I can convince you of one more tiny claim. Namely, that this piece represents (a big part of) the future of academic publishing. Not this particular piece, of course; I mean the general practice of (a) assigning unique identifiers to digital objects, (b) preserving those objects for all posterity in a centralized archive, and (c) allowing researchers to cite any and all such objects in their work however they like. (We could perhaps also add (d) working very hard to promote centralized “post-publication” peer review of all of those objects–but that’s a story for another day.)
These are not new ideas, mind you. People have been calling for a long time for a move away from a traditional gatekeeping-oriented model of pre-publication review and towards more open publication and evaluation models. These calls have intensified in recent years; for instance, in 2012, a special topic in Frontiers in Computational Neuroscience featured 18 different papers that all independently advocated for very similar post-publication review models. Even the actual attachment of DOIs to blog posts isn’t new; as a case in point, consider that C. Titus Brown—in typical pioneering form—was already experimenting with ways to automatically DOIfy his blog posts via FigShare way back in the same dark ages of 2012. What is new, though, is the emergence and widespread adoption of platforms like The Winnower, FigShare, or Research Gate that make it increasingly easy to assign a DOI to academically-relevant works other than traditional journal articles. Thanks to such services, you can now quickly and effortlessly attach a DOI to your open-source software packages, technical manuals and white papers, conference posters, or virtually any other kind of digital document.
Once such efforts really start to pick up steam—perhaps even in the next two or three years—I think there’s a good chance we’ll fall into a positive feedback loop, because it will become increasingly clear that for many kinds of scientific findings or observations, there’s simply nothing to be gained by going through the cumbersome, time-consuming conventional peer review process. To the contrary, there will be all kinds of incentives for researchers to publish their work as soon as they feel it’s ready to share. I mean, look, I can write blog posts a lot faster than I can write traditional academic papers. Which means that if I write, say, one DOI-adorned blog post a month, my Google Scholar profile is going to look a lot bulkier a year from now, at essentially no extra effort or cost (since I’m going to write those blog posts anyway!). In fact, since services like The Winnower and FigShare can assign DOIs to documents retroactively, you might not even have to wait that long. Check back this time next week, and I might have a dozen new indexed publications! And if some of these get cited—whether in “real” journals or on other indexed blog posts—they’ll then be contributing to my citation count and h-index too (at least on Google Scholar). What are you going to do to keep up?
Now, this may all seem a bit off-putting if you’re used to thinking of scientific publication as a relatively formal, laborious process, where two or three experts have to sign off on what you’ve written before it gets to count for anything. If you’ve grown comfortable with the idea that there are “real” scientific contributions on the one hand, and a blooming, buzzing confusion of second-rate opinions on the other, you might find the move to suddenly make everything part of the formal record somewhat disorienting. It might even feel like some people (like, say, me) are actively trying to game the very system that separates science from tabloid news. But I think that’s the wrong perspective. I don’t think anybody—certainly not me—is looking to get rid of peer review. What many people are actively working towards are alternative models of peer review that will almost certainly work better.
The right perspective, I would argue, is to embrace the benefits of technology and seek out new evaluation models that emphasize open, collaborative review by the community as a whole instead of closed pro forma review by two or three semi-randomly selected experts. We now live in an era where new scientific results can be instantly shared at essentially no cost, and where sophisticated collaborative filtering algorithms and carefully constructed reputation systems can potentially support truly community-driven, quantitatively-grounded open peer review on a massive scale. In such an environment, there are few legitimate excuses for sticking with archaic publication and evaluation models—only the familiar, comforting pull of the status quo. Viewed in this light, using technology to get around the limitations of old gatekeeper-based models of scientific publication isn’t gaming the system; it’s actively changing the system—in ways that will ultimately benefit us all. And in that context, the humble self-assigned DOI may ultimately become—to liberally paraphrase Robert Oppenheimer and the Bhagavad Gita—one of the destroyers of the old gatekeeping world.
“I’m a statistician,” she wrote. “By day, I work for the census bureau. By night, I use my statistical skills to build the perfect profile. I’ve mastered the mysterious headline, the alluring photo, and the humorous description that comes off as playful but with a hint of an edge. I’m pretty much irresistible at this point.”
“Really?” I wrote back. “That sounds pretty amazing. The stuff about building the perfect profile, I mean. Not the stuff about working at the census bureau. Working at the census bureau sounds decent, I guess, but not amazing. How do you build the perfect profile? What kind of statistical analysis do you do? I have a bit of programming experience, but I don’t know any statistics. Maybe we can meet some time and you can teach me a bit of statistics.”
I am, as you can tell, a smooth operator.
A reply arrived in my inbox a day later:
No, of course I don’t really spend all my time constructing the perfect profile. What are you, some kind of idiot?
And so was born our brief relationship; it was love at first insult.
“This probably isn’t going to work out,” she told me within five minutes of meeting me in person for the first time. We were sitting in the lobby of the Chateau Laurier downtown. Her choice of venue. It’s an excellent place to meet an internet date; if you don’t like the way they look across the lobby, you just back out quietly and then email the other person to say sorry, something unexpected came up.
“That fast?” I asked. “You can already tell you don’t like me? I’ve barely introduced myself.”
“Oh, no, no. It’s not that. So far I like you okay. I’m just going by the numbers here. It probably isn’t going to work out. It rarely does.”
“That’s a reasonable statement,” I said, “but a terrible thing to say on a first date. How do you ever get a second date with anyone, making that kind of conversation?”
“It helps to be smoking hot,” she said. “Did I offend you terribly?”
“Not really, no. But I’m not a very sentimental kind of guy.”
“Well, that’s good.”
Later, in bed, I awoke to a shooting pain in my leg. It felt like I’d been kicked in the shin.
“Did you just kick me in the shin,” I asked.
“Any particular reason?”
“You were a little bit on my side of the bed. I don’t like that.”
“Oh. Okay. Sorry.”
“I still don’t think this will work,” she said, then rolled over and went back to sleep.
She was right. We dated for several months, but it never really worked. We had terrific fights, and reasonable make-up sex, but our interactions never had very much substance. We related to one another like two people who were pretty sure something better was going to come along any day now, but in the meantime, why not keep what we had going, because it was better than eating dinner alone.
I never really learned what she liked; I did learn that she disliked most things. Mostly our conversations revolved around statistics and food. I’ll give you some examples.
“Beer is the reason for statistics,” she informed me one night while we were sitting at Cicero’s and sharing a lasagna.
“I imagine beer might be the reason for a lot of bad statistics,” I said.
“No, no. Not just bad statistics. All statistics. The discipline of statistics as we know it exists in large part because of beer.”
“Pray, do go on,” I said, knowing it would have been futile to ask her to shut up.
“Well,” she said, “there once was a man named Student…”
I won’t bore you with all the details; the gist of it is that there once was a man by name of William Gosset, who worked for Guinness as a brewer in the early 1900s. Like a lot of other people, Gosset was interested in figuring out how to make Guinness taste better, so he invented a bunch of statistical tests to help him quantify the differences in quality between different batches of beer. Guinness didn’t want Gosset to publish his statistical work under his real name, for fear he might somehow give away their trade secrets, so they made him use the pseudonym “Student”. As a result, modern-day statisticians often work with somethinfg called Student’s t distribution, which is apparently kind of a big deal. And all because of beer.
“That’s a nice story,” I said. “But clearly, if Student—or Gosset or whatever his real name was—hadn’t been working for Guinness, someone else would have invented the same tests shortly afterwards, right? It’s not like he was so brilliant no one else would have ever thought of the same thing. I mean, if Edison hadn’t invented the light bulb, someone else would have. I take it you’re not really saying that without beer, there would be no statistics.”
“No, that is what I’m saying. No beer, no stats. Simple.”
“Yeah, okay. I don’t believe you.”
“No. What’s that thing about lies, damned lies, and stat—”
“No idea,” she said. “Never heard that saying.”
“It’s that they lie. The saying is that statisticians lie. Repeatedly and often. About anything at all. It’s that they have no moral compass.”
“Sounds about right.”
“I don’t get this whole accurate to within 3 percent 19 times out of 20 business,” I whispered into her ear late one night after we’d had sex all over her apartment. “I mean, either you’re accurate or you’re not, right? If you’re accurate, you’re accurate. And if you’re not accurate, I guess maybe then you could be within 3 percent or 7 percent or whatever. But what the hell does it mean to be accurate X times out of Y? And how would you even know how many times you’re accurate? And why is it always 19 out of 20?”
She turned on the lamp on the nightstand and rolled over to face me. Her hair covered half of her face; the other half was staring at me with those pale blue eyes that always looked like they wanted to either jump you or murder you, and you never knew which.
“You really want me to explain confidence intervals to you at 11:30 pm on a Thursday night?”
“How much time do you have?”
“All, Night, Long,” I said, channeling Lionel Richie.
“Wonderful. Let me put my spectacles on.”
She fumbled around on the nightstand looking for them.
“What do you need your glasses for,” I asked. “We’re just talking.”
“Well, I need to be able to see you clearly. I use the amount of confusion on your face to gauge how much I need to dumb down my explanations.”
Frankly, most of the time she was as cold as ice. The only time she really came alive—other than in the bedroom—was when she talked about statistics. Then she was a different person: excited and exciting, full of energy. She looked like a giant Tesla coil, mid-discharge.
“Why do you like statistics so much,” I asked her over a bento box at ZuNama one day.
“Because,” she said, “without statistics, you don’t really know anything.”
“I thought you said statistics was all about uncertainty.”
“Right. Without statistics, you don’t know anything… and with statistics, you still don’t know anything. But with statistics, we can at least get a sense of how much we know or don’t know.”
“Sounds very… Rumsfeldian,” I said. “Known knowns… unknown unknowns… is that right?”
“It’s kind of right,” she said. “But the error bars are pretty huge.”
“I’m going to pretend I know what that means. If I admit I have no idea, you’ll think I wasn’t listening to you in bed the other night.”
“No,” she said. “I know you were listening. You were listening very well. It’s just that you were understanding very poorly.”
Uncertainty was a big theme for her. Once, to make a point, she asked me how many nostrils a person breathes through at any given time. And then, after I experimented on myself and discovered that the answer was one and not two, she pushed me on it:
“Well, how do you know you’re not the only freak in the world who breathes through one nostril?”
“Easily demonstrated,” I said, and stuck my hand right in front of her face, practically covering her nose.
“And now breathe in! And then repeat several times!”
“You see,” I said, retracting my hand once I was satisfied. “It’s not just me. You also breathe through one nostril at a time. Right now it’s your left.”
“That proves nothing,” she said. “We’re not independent observations; I live with you. You probably just gave me your terrible mononarial disease. All you’ve shown is that we’re both sick.”
I realized then that I wasn’t going to win this round—or any other round.
“Try the unagi,” I said, waving at the sushi in a heroic effort to change the topic.
“You know I don’t like to try new things. It’s bad enough I’m eating sushi.”
“Try the unagi,” I suggested again.
So she did.
“It’s not bad,” she said after chewing on it very carefully for a very long time. “But it could use some ketchup.”
“Don’t you dare ask them for ketchup,” I said. “I will get up and leave if you ask them for ketchup.”
She waved her hand at the server.
“There once was a gentleman named Bayes,” she said over coffee at Starbucks one morning. I was running late for work, but so what? Who’s going to pass up the chance to hear about a gentleman named Bayes when the alternative is spending the morning refactoring enterprise code and filing progress reports?
“Oh yes, I’ve heard about him,” I said. “He’s the guy who came up with Bayes’ theorem.” I’d heard of Bayes theorem in some distant class somewhere, and knew it had something to do with statistics, though I had not one clue what it actually referred to.
“No, the Bayes I’m talking about is John Bayes—my mechanic. He’s working on my car right now.”
“No, not really, you idiot. Yes, Bayes as in Bayes’ theorem.”
“Thought so. Well, go ahead and tell me all about him. What is John Bayes famous for?”
“Huh. How about that.”
She launched into a very dry explanation of conditional probabilities and prior distributions and a bunch of other terms I’d never heard of before and haven’t remembered since. I stopped her about three minutes in.
“You know none of this helps me, right? I mean, really, I’m going to forget anything you tell me. You know what might help, is maybe if instead of giving me these long, dry explanations, you could put things in a way I can remember. Like, if you, I don’t know, made up a limerick. I bet I could remember your explanations that way.”
“Oh, a limerick. You want a Bayesian limerick. Okay.”
She scrunched up her forehead like she was thinking very deeply. Held the pose for a few seconds.
“There once was a man named John Bayes,” she began, and then stopped.
“Yes,” I said. “Go on.”
“Who spent most of his days… calculating the posterior probability of go fuck yourself.”
“Very memorable,” I said, waving for the check.
“Suppose I wanted to estimate how much I love you,” I said over asparagus and leek salad at home one night. “How would I do that?”
“You love me?” she arched an eyebrow.
“Good lord no,” I laughed hysterically. “It’s a completely and utterly hypothetical question. But answer it anyway. How would I do it?”
“That’s a measurement problem. I’m a statistician, not a psychometrician. I develop and test statistical models. I don’t build psychological instruments. I haven’t the faintest idea how you’d measure love. As I’m sure you’ve observed, it’s something I don’t know or care very much about.”
I nodded. I had observed that.
“You act like there’s a difference between all these things there’s really no difference between,” I said. “Models, measures… what the hell do I care? I asked a simple question, and I want a simple answer.”
“Well, my friend, in that case, the answer is that you must look deep into your own heart and say, heart, how much do I love this woman, and then your heart will surely whisper the answer delicately into your oversized ear.”
“That’s the dumbest thing I’ve ever heard,” I said, tugging self-consciously at my left earlobe. It wasn’t that big.
“Right?” she said. “You said you wanted a simple answer. I gave you a simple answer. It also happens to be a very dumb answer. Well, great, now you know one of the fundamental principles of statistical analysis.”
“That simple answers tend to be bad answers?”
“No,” she said. “That when you’re asking a statistician for help, you need to operationalize your question very carefully, or the statistician is going to give you a sensible answer to a completely different question than the one you actually care about.”
“How come you never ask me about my work,” I asked her one night as we were eating dinner at Chez Margarite. She was devouring lemon-infused pork chops; I was eating a green papaya salad with mint chutney and mango salsa dressing.
“Because I don’t really care about your work,” she said.
“Oh. That’s… kind of blunt.”
“Sorry. I figured I should be honest. That’s what you say you want in a relationship, right? Honesty?”
“Sure,” I said, as the server refilled our water glasses.
“Well,” I offered. “Maybe not that much honesty.”
“Would you like me to feign interest?”
“Maybe just for a bit. That might be nice.”
“Okay,” she sighed, giving me the green light with a hand wave. “Tell me about your work.”
It was a new experience for me; I didn’t want to waste the opportunity, so I tried to choose my words carefully.
“Well, for the last month or so, I’ve been working on re-architecting our site’s database back-end. We’ve never had to worry about scaling before. Our DB can handle a few dozen queries per second, even with some pretty complicated joins. But then someone posts a product page to reddit because of a funny typo, and suddenly we’re getting hundreds of requests a second, and all hell breaks loose.”
I went on to tell her about normal forms and multivalued dependencies and different ways of modeling inheritance in databases. She listened along, nodding intermittently and at roughly appropriate intervals. But I could tell her heart wasn’t in it. She kept looking over with curiosity at the group of middle-aged Japanese businessmen seated at the next table over from us. Or out the window at the homeless man trying to sell rhododendrons to passers-by. Really, she looked everywhere but at me. Finally, I gave up.
“Look,” I said, “I know you’re not into this. I guess I don’t really need to tell you about what I do. Do you want to tell me more about the Weeble distribution?”
Her face lit up with excitement; for a moment, she looked like the moon. A cold, heartless, beautiful moon, full of numbers and error bars and mascara.
“Weibull,” she said.
“Fine,” I said. “You tell me about the Weibull distribution, and I’ll feign interest. Then we’ll have crème brulee for dessert, and then I’ll buy you a rhododendron from that guy out there on the way out.”
“Rhododendrons,” she snorted. “What a ridiculous choice of flower.”
“How long do you think this relationship is going to last,” I asked her one brisk evening as we stood outside Gordon’s Gourmets with oversized hot dogs in hand.
I was fully aware our relationship was a transient thing—like two people hanging out on a ferry for a couple of hours, both perfectly willing to having a reasonably good time together until the boat hits the far side of the lake, but neither having any real interest in trading numbers or full names.
I was in it for—let’s be honest—the sex and the conversation. As for her, I’m not really sure what she got out of it; I’m not very good at either of those things. I suppose she probably had a hard time finding anyone willing to tolerate her for more than a couple of days.
“About another month,” she said. “We should take a trip to Europe and break up there. That way it won’t be messy when we come back. You book your plane ticket, I’ll book mine. We’ll go together, but come back separately. I’ve always wanted to end a relationship that way—in a planned fashion where there are no weird expectations and no hurt feelings.”
“You think planning to break up in Europe a month from now is a good way to avoid hurt feelings?”
“Okay, I guess I can see that.”
And that’s pretty much how it went. About a month later, we were sitting in a graveyard in a small village in southern France, winding our relationship down. Wine was involved, and had been involved for most of the day; we were both quite drunk.
We’d gone to see this documentary film about homeless magicians who made their living doing card tricks for tourists on the beaches of the French Riviera, and then we stumbled around town until we came across the graveyard, and then, having had a lot of wine, we decided, why not sit on the graves and talk. And so we sat on graves and talked for a while until we finally ran out of steam and affection for each other.
“How do you want to end it,” I asked her when we were completely out of meaningful words, which took less time than you might imagine.
“You sound so sinister,” she said. “Like we’re talking about a suicide pact. When really we’re just two people sitting on graves in a quiet cemetery in France, about to break up forever.”
“Yeah, that. How do you want to end it.”
“Well, I like endings like in Sex, Lies and Videotape, you know? Endings that don’t really mean anything.”
“You like endings that don’t mean anything.”
“They don’t have to literally mean nothing. I just mean they don’t have to have any deep meaning. I don’t like movies that end on some fake bullshit dramatic note just to further the plot line or provide a sense of closure. I like the ending of Sex, Lies, and Videotape because it doesn’t follow from anything; it just happens.”
“Remind me how it ends?”
“They’re sitting on the steps outside, and Ann—-Andie McDowell’s character–says “I think it’s going to rain. Then Graham says, “it is raining.” And that’s it. Fade to black.”
“So that’s what you like.”
“And you want to end our relationship like that.”
“Okay,” I said. “I guess I can do that.”
I looked around. It was almost dark, and the bottle of wine was empty. Well, why not.
“I think it’s going to rain,” I said.
“Jesus,” she said incredulously, leaning back against a headstone belonging to some guy named Jean-Francois. ” I meant we should end it like that. That kind of thing. Not that actual thing. What are you, some kind of moron?”
“Oh. Okay. And yes.”
I thought about it for a while.
“I think I got this,” I finally said.
“Ok, go,” she smiled. One of the last—and only—times I saw her smile. It was devastating.
“Okay. I’m going to say: I have some unfinished business to attend to at home. I should really get back to my life. And then you should say something equally tangential and vacuous. Something like: ‘yes, you really should get back there. Your life must be lonely without you.'”
“Your life must be lonely without you…” she tried the words out.
“That’s perfect,” she smiled. “That’s exactly what I wanted.”
[This is the first of a two-part series motivating and introducing precis, a Python package for automated abbreviation of psychometric measures. In part I, I motivate the search for shorter measures by arguing that internal consistency is highly overrated. In part II, I describe some software that makes it relatively easy to act on this newly-acquired disregard by gleefully sacrificing internal consistency at the altar of automated abbreviation. If you’re interested in this general topic but would prefer a slightly less ridiculous more academic treatment, read this paper with Hedwig Eisenbarth and Scott Lilienfeld, or take a look at look at the demo IPython notebook.]
Developing a new questionnaire measure is a tricky business. There are multiple objectives one needs to satisfy simultaneously. Two important ones are:
The measure should be reliable. Validity is bounded by reliability; a highly unreliable measure cannot support valid inferences, and is largely useless as a research instrument.
The measure should be as short as is practically possible. Time is money, and nobody wants to sit around filling out a 300-item measure if a 60-item version will do.
Unfortunately, these two objectives are in tension with one another to some degree. Random error averages out as one adds more measurements, so in practice, one of the easiest ways to increase the reliability of a measure is to simply add more items. From a reliability standpoint, it’s often better to have many shitty indicators of a latent construct than a few moderately reliable ones*. For example, Cronbach’s alpha–an index of the internal consistency of a measure–is higher for a 20-item measure with a mean inter-item correlation of 0.1 than for a 5-item measure with a mean inter-item correlation of 0.3.
Because it’s so easy to increase reliability just by adding items, reporting a certain level of internal consistency is now practically a requirement in order for a measure to be taken seriously. There’s a reasonably widespread view that an adequate level of reliability is somewhere around .8, and that anything below around .6 is just unacceptable. Perhaps as a consequence of this convention, researchers developing new questionnaires will typically include as many items as it takes to hit a “good” level of internal consistency. In practice, relatively few measures use fewer than 8 to 10 items to score each scale (though there are certainly exceptions, e.g., the Ten Item Personality Inventory). Not surprisingly, one practical implication of this policy is that researchers are usually unable to administer more than a handful of questionnaires to participants, because nobody has time to sit around filling out a dozen 100+ item questionnaires.
While understandable from one perspective, the insistence on attaining a certain level of internal consistency is also problematic. It’s easy to forget that while reliability may be necessary for validity, high internal consistency is not. One can have an extremely reliable measure that possesses little or no internal consistency. This is trivial to demonstrate by way of thought experiment. As I wrote in this post a few years ago:
Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable.
In fact, we can push this line of thought even further, and say that the perfect measure—in the sense of maximizing both reliability and brevity—should actually have an internal consistency of exactly zero. A value any higher than zero would imply the presence of redundancy between items, which in turn would suggest that we could (at least in theory, though typically not in practice) get rid of one or more items without reducing the amount of variance captured by the measure as a whole.
To use a spatial analogy, suppose we think of each of our measure’s items as a circle in a 2-dimensional space:
Here, our goal is to cover the maximum amount of territory using the smallest number of circles (analogous to capturing as much variance in participant responses as possible using the fewest number of items). By this light, the solution in the above figure is kind of crummy, because it fails to cover much of the space despite having 20 circles to work with. The obvious problem is that there’s a lot of redundancy between the circles—many of them overlap in space. A more sensible arrangement, assuming we insisted on keeping all 20 circles, would look like this:
In this case we get complete coverage of the target space just by realigning the circles to minimize overlap.
Alternatively, we could opt to cover more or less the same territory as the first arrangement, but using many fewer circles (in this case, 10):
It turns out that what goes for our toy example in 2D space also holds for self-report measurement of psychological constructs that exist in much higher dimensions. For example, suppose we’re interested in developing a new measure of Extraversion, broadly construed. We want to make sure our measure covers multiple aspects of Extraversion—including sociability, increased sensitivity to reward, assertiveness, talkativeness, and so on. So we develop a fairly large item pool, and then we iteratively select groups of items that (a) have good face validity as Extraversion measures, (b) predict external criteria we think Extraversion should predict (predictive validity), and (c) tend to to correlate with each other modestly-to-moderately. At some point we end up with a measure that satisfies all of these criteria, and then presumably we can publish our measure and go on to achieve great fame and fortune.
So far, so good—we’ve done everything by the book. But notice something peculiar about the way the book would have us do things: the very fact that we strive to maintain reasonably solid correlations between our items actually makes our measurement approach much less efficient. To return to our spatial analogy, it amounts to insisting that our circles have to have a high degree of overlap, so that we know for sure that we’re actually measuring what we think we’re measuring. And to be fair, we do gain something for our trouble, in the sense that we can look at our little plot above and say, a-yup, we’re definitely covering that part of the space. But we also lose something, in that we waste a lot of items (or circles) trying to cover parts of the space that have already been covered by other items.
Why would we do something so inefficient? Well, the problem is that in the real world—unlike in our simple little 2D world—we don’t usually know ahead of time exactly what territory we need to cover. We probably have a fuzzy idea of our Extraversion construct, and we might have a general sense that, you know, we should include both reward-related and sociability-related items. But it’s not as if there’s a definitive and unambiguous answer to the question “what behaviors are part of the Extraversion construct?”. There’s a good deal of variation in human behavior that could in principle be construed as part of the latent Extraversion construct, but that in practice is likely to be overlooked (or deliberately omitted) by any particular measure of Extraversion. So we have to carefully explore the space. And one reasonable way to determine whether any given item within that space is still measuring Extraversion is to inspect its correlations with other items that we consider to be unambiguous Extraversion items. If an item correlates, say, 0.5 with items like “I love big parties” and “I constantly seek out social interactions”, there’s a reasonable case to be made that it measures at least some aspects of Extraversion. So we might decide to keep it in our measure. Conversely, if an item shows very low correlations with other putative Extraversion items, we might incline to throw it out.
Now, there’s nothing intrinsically wrong with this strategy. But what’s important to realize is that, once we’ve settled on a measure we’re happy with, there’s no longer a good reason to keep all of that redundancy hanging around. It may be useful when we first explore the territory, but as soon as we yell out FIN! and put down our protractors and levels (or whatever it is the kids are using to create new measures these days), it’s now just costing us time and money by making data collection less efficient. We would be better off saying something like, hey, now that we know what we’re trying to measure, let’s see if we can measure it equally well with fewer items. And at that point, we’re in the land of criterion-based measure development, where the primary goal is to predict some target criterion as accurately as possible, foggy notions of internal consistency be damned.
Unfortunately, committing ourselves fully to the noble and just cause of more efficient measurement still leaves open the question of just how we should go about eliminating items from our overly long measures. For that, you’ll have to stay tuned for Part II, wherein I use many flowery words and some concise Python code to try to convince you that this piece of software provides one reasonable way to go about it.
* On a tangential note, this is why traditional pre-publication peer review isn’t very effective, and is in dire need of replacement. Meta-analytic estimates put the inter-reviewer reliability across fields at around .2 to .3, and it’s rare to have more than two or three reviewers on a paper. No psychometrician would recommend evaluating people’s performance in high-stakes situations with just two items that have a ~.3 correlation, yet that’s how we evaluate nearly all of the scientific literature!