If we already understood the brain, would we even know it?

The question posed in the title is intended seriously. A lot of people have been studying the brain for a long time now. Most of these people, if asked a question like “so when are you going to be able to read minds?”, will immediately scoff and say something to the effect of we barely understand anything about the brain–that kind of thing is crazy far into the future! To a non-scientist, I imagine this kind of thing must seem bewildering. I mean, here we have a community of tens of thousands of extremely smart people who have collectively been studying the same organ for over a hundred years; and yet, almost to the last person, they will adamantly proclaim to anybody who listens that the amount they currently know about the brain is very, very small compared to the amount that they expect the human species to know in the future.

I’m not convinced this is true. I think it’s worth observing that if you ask someone who has just finished telling you how little we collectively know about the brain how much they personally actually know about the brain–without the implied contrast with the sum of all humanity–they will probably tell you that, actually, they kind of know a lot about the brain (at least, once they get past the false modesty). Certainly I don’t think there are very many neuroscientists running around telling people that they’ve literally learned almost nothing since they started studying the gray sludge inside our heads. I suspect most neuroanatomists could probably recite several weeks’ worth of facts about the particular brain region or circuit they study, and I have no shortage of fMRI-experienced friends who won’t shut up about this brain network or that brain region–so I know they must know a lot about something to do with the brain. We thus find ourselves in the rather odd situation of having some very smart people apparently simultaneously believe that (a) we all collectively know almost nothing, and (b) they personally are actually quite learned (pronounced luhrn-ED) in their chosen subject. The implication seems to be that, if we multiply what one really smart present-day neuroscientist knows a few tens of thousands of times, that’s still only a tiny fraction of what it would take to actually say that we really “understand” the brain.

I find this problematic in two respects. First, I think we actually already know quite a lot about the brain. And second, I don’t think future scientists–who, remember, are people similar to us in both number and intelligence–will know dramatically more. Or rather, I think future neuroscientists will undoubtedly amass orders of magnitude more collective knowledge about the brain than we currently possess. But, barring some momentous fusion of human and artificial intelligence, I’m not at all sure that will translate into a corresponding increase in any individual neuroscientist’s understanding. I’m willing to stake a moderate sum of money, and a larger amount of dignity, on the assertion that if you ask a 2030, 2050, or 2118 neuroscientist–assuming both humans and neuroscience are still around then–if they individually understand the brain given all of the knowledge we’ve accumulated, they’ll laugh at you in exactly the way that we laugh at that question now.

* * *

We probably can’t predict when the end of neuroscience will arrive with any reasonable degree of accuracy. But trying to conjure up some rough estimates can still help us calibrate our intuitions about what would be involved. One way we can approach the problem is to try to figure out at what rate our knowledge of the brain would have to grow in order to arrive at the end of neuroscience within some reasonable time frame.

To do this, we first need an estimate of how much more knowledge it would take before we could say with a straight face that we understand the brain. I suspect that “1000 times more” would probably seem like a low number to most people. But let’s go with that, for the sake of argument. Let’s suppose that we currently know 0.1% of all there is to know about the brain, and that once we get to 100%, we will be in a position to stop doing neuroscience, because we will at that point already have understood everything.

Next, let’s pick a reasonable-sounding time horizon. Let’s say… 200 years. That’s twice as long as Eric Kandel thinks it will take just to understand memory. Frankly, I’m skeptical that humans will still be living on this planet in 200 years, but that seems like a reasonable enough target. So basically, we need to learn 1000 times as much as we know right now in the space of 200 years. Better get to the library! (For future neuroscientists reading this document as an item of archival interest about how bad 2018 humans were at predicting the future: the library is a large, public physical space that used to hold things called books, but now holds only things called coffee cups and laptops.)

A 1000-fold return over 200 years is… 3.5% compounded annually. Hey, that’s actually not so bad. I can easily believe that our knowledge about the brain increases at that rate. It might even be more than that. I mean, the stock market historically gets 6-10% returns, and I’d like to believe that neuroscience outperforms the stock market. Regardless, under what I think are reasonably sane assumptions, I don’t think it’s crazy to suggest that the objective compounding of knowledge might not be the primary barrier preventing future neuroscientists from claiming that they understand the brain. Assuming we don’t run into any fundamental obstacles that we’re unable to overcome via new technology and/or brilliant ideas, we can look forward to a few of our great-great-great-great-great-great-great-great-grandchildren being the unlucky ones who get to shut down all of the world’s neuroscience departments and tell all of their even-less-lucky graduate students to go on home, because there are no more problems left to solve.
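For anyone who wants to check the arithmetic, here is a minimal sketch of the compound-growth calculation. The function names are my own, and the 6% and 10% figures are just the stock-market range quoted above, plugged in for comparison. It is a back-of-the-envelope calculation, not a model of how knowledge actually accumulates.

```python
import math


def required_annual_growth(fold_increase: float, years: float) -> float:
    """Annual compound rate needed to grow by `fold_increase` over `years`."""
    return fold_increase ** (1.0 / years) - 1.0


def years_to_reach(fold_increase: float, annual_rate: float) -> float:
    """Years needed to grow by `fold_increase` at a fixed annual compound rate."""
    return math.log(fold_increase) / math.log1p(annual_rate)


# The scenario from the text: 1000 times more knowledge within 200 years.
rate = required_annual_growth(1000, 200)
print(f"required annual growth: {rate:.2%}")

# Hypothetical comparison: how long would stock-market-like rates take?
for annual_rate in (0.06, 0.10):
    print(f"at {annual_rate:.0%} per year: {years_to_reach(1000, annual_rate):.0f} years")
```

The first line of output comes to roughly 3.5%, matching the figure above. At stock-market-like rates, the same 1000-fold increase would arrive well inside the 200-year horizon.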

Well, except probably not. Because, for the above analysis to go through, you have to believe that there’s a fairly tight relationship between what all of us know, and what any of us know. Meaning, you have to believe that once we’ve successfully acquired all of the possible facts there are to acquire about the brain, there will be some flashing light, some ringing bell, some deep synthesized voice that comes over the air and says, “nice job, people–you did it! You can all go home now. Last one out gets to turn off the lights.”

I think the probability of such a thing happening is basically zero. Partly because the threat to our egos would make it very difficult to just walk away from what we’d spent much of our life doing; but mostly because the fact that somewhere out there there existed a repository of everything anyone could ever want to know about the brain would not magically cause all of that knowledge to be transduced into any individual brain in a compact, digestible form. In fact, it seems like a safe bet that no human (perhaps barring augmentation with AI) would be able to absorb and synthesize all of that knowledge. More likely, the neuroscientists among us would simply start “recycling” questions. Meaning, we would keep coming up with new questions that we believe need investigating, but those questions would only seem worthy of investigation because we lack the cognitive capacity to recognize that the required information is already available–it just isn’t packaged in our heads in exactly the right way.

What I’m suggesting is that, when we say things like “we don’t really understand the brain yet”, we’re not really expressing factual statements about the collective sum of neuroscience knowledge currently held by all human beings. What each of us really means is something more like there are questions I personally am able to pose about the brain that seem to make sense in my head, but that I don’t currently know the answer to–and I don’t think I could piece together the answer even if you handed me a library of books containing all of the knowledge we’ve accumulated about the brain.

Now, for a great many questions of current interest, these two notions clearly happen to coincide–meaning, it’s not just that no single person currently alive knows the complete answer to a question like “what are the neural mechanisms underlying sleep?”, or “how do SSRIs help ameliorate severe depression?”, but that the sum of all knowledge we’ve collectively acquired at this point may not be sufficient to enable any person or group of persons, no matter how smart, to generate a comprehensive and accurate answer. But I think there are also a lot of questions where the two notions don’t coincide. That is, there are many questions neuroscientists are currently asking that we could say with a straight face we do already know how to answer collectively–despite vehement assertions to the contrary on the part of many individual scientists. And my worry is that, because we all tend to confuse our individual understanding (which is subject to pretty serious cognitive limitations) with our collective understanding (which is not), there’s a non-trivial risk of going around in circles. Meaning, the fact that we’re individually not able to understand something–or are individually unsatisfied with the extant answers we’re familiar with–may lead us to devise ingenious experiments and expend considerable resources trying to “solve” problems that we collectively do already have perfectly good answers to.

Let me give an example to make this more concrete. Many (though certainly not all) people who work with functional magnetic resonance imaging (fMRI) are preoccupied with questions of the form what is the core function of X–where X is typically some reasonably well-defined brain region or network, like the ventromedial prefrontal cortex, the fusiform face area, or the dorsal frontoparietal network. Let’s focus our attention on one network that has attracted particular attention over the past 10–15 years: the so-called “default mode” or “resting state” network. This network is notable largely for its proclivity to show increased activity when people are in a state of cognitive rest–meaning, when they’re free to think about whatever they like, without any explicit instruction to direct their attention or thoughts to any particular target or task. A lot of cognitive neuroscientists in recent years have invested time trying to understand the function(s) of the default mode network (DMN; for reviews, see Buckner, Andrews-Hanna, & Schacter, 2008; Andrews-Hanna, 2012; Raichle, 2015). Researchers have observed that the DMN appears to show robust associations with autobiographical memory, social cognition, self-referential processing, mind wandering, and a variety of other processes.

If you ask most researchers who study the DMN if they think we currently understand what the DMN does, I think nearly all of them will tell you that we do not. But I think that’s wrong. I would argue that, depending on how you look at it, we either (a) already do have a pretty good understanding of the “core functions” of the network, or (b) will never have a good answer to the question, because it can’t actually be answered.

The sense in which we already know the answer is that we have pretty good ideas about what kinds of cognitive and affective processes are associated with changes in DMN activity. They include self-directed cognition, autobiographical memory, episodic future thought, stressing out about all the things one has to do in the next few days, and various other things. We know that the DMN is associated with these kinds of processes because we can elicit activation increases in DMN regions by asking people to engage in tasks that we believe engage these processes. And we also know, from both common sense and experience-sampling studies, that when people are in the so-called “resting state”, they disproportionately tend to spend their time thinking about such things. Consequently, I think there’s a perfectly good sense in which we can say that the “core function” of the DMN is nothing more and nothing less than supporting the ability to think about things that people tend to think about when they’re at rest. And we know, to a first order of approximation, what those are.

In my anecdotal experience, most people who study the DMN are not very satisfied with this kind of answer. Their response is usually something along the lines of: but that’s just a description of what kinds of processes tend to co-occur with DMN activation. It’s not an explanation of why the DMN is necessary for these functions, or why these particular brain regions are involved.

I think this rebuttal is perfectly reasonable, inasmuch as we clearly don’t have a satisfying computational account of why the DMN is what it is. But I don’t think there can be a satisfying account of this kind. I think the question itself is fundamentally ill-posed. Taking it seriously requires us to assume that, just because it’s possible to observe the DMN activate and deactivate with what appears to be a high degree of coherence, there must be a correspondingly coherent causal characterization of the network. But there doesn’t have to be–and if anything, it seems exceedingly unlikely that there’s any such explanation to be found. Instead, I think the seductiveness of the question is largely an artifact of human cognitive biases and limitations–and in particular, of the burning human desire for simple, easily-digested explanations that can fit inside our heads all at once.

It’s probably easiest to see what I mean if we consider another high-profile example from a very different domain. Consider the so-called “general factor” of fluid intelligence (gF). Over a century of empirical research on individual differences in cognitive abilities has demonstrated conclusively that nearly all cognitive ability measures tend to be positively and substantially intercorrelated–an observation Spearman famously dubbed the “positive manifold” all the way back in 1904. If you give people 20 different ability measures and do a principal component analysis (PCA) on the resulting scores, the first component will explain a very large proportion of the variance in the original measures. This seemingly important observation has led researchers to propose all kinds of psychological and biological theories intended to explain why and how people could vary so dramatically on a single factor–for example, that gF reflects differences in the ability to control attention in the face of interference (e.g., Engle et al., 1999); that “the crucial cognitive mechanism underlying fluid ability lies in storage capacity” (Chuderski et al., 2012); that “a discrete parieto-frontal network underlies human intelligence” (Jung & Haier, 2007); and so on.

The trouble with such efforts–at least with respect to the goal of explaining gF–is that they tend to end up (a) essentially redescribing the original phenomenon using a different name, (b) proposing a mechanism that, upon further investigation, only appears to explain a fraction of the variation in question, or (c) providing an extremely disjunctive reductionist account that amounts to a long list of seemingly unrelated mechanisms. As an example of (a), it’s not clear why it’s an improvement to attribute differences in fluid intelligence to the ability to control attention, unless one has some kind of mechanistic story that explains where attentional control itself comes from. When people do chase after such mechanistic accounts at the neurobiological or genetic level, they tend to end up with models that don’t capture more than a small fraction of the variance in gF (i.e., (b)) unless the models build in hundreds if not thousands of features that clearly don’t reflect any single underlying mechanism (i.e., (c); see, for example, the latest GWAS studies of intelligence).

Empirically, nobody has ever managed to identify any single biological or genetic variable that explains more than a small fraction of the variation in gF. From a statistical standpoint, this isn’t surprising, because a very parsimonious explanation of gF is that it’s simply a statistical artifact–as Godfrey Thomson suggested over 100 years ago. You can read much more about the basic issue in this excellent piece by Cosma Shalizi, or in this much less excellent, but possibly more accessible, blog post I wrote a few years ago. But the basic gist of it is this: when you have a bunch of measures that all draw on a heterogeneous set of mechanisms, but the contributions of those mechanisms generally have the same direction of effect on performance, you cannot help but observe a large first PCA component, even if the underlying mechanisms are actually extremely heterogeneous and completely independent of one another.
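To make the gist concrete, here is a toy simulation in the spirit of Thomson’s sampling idea. To be clear about what’s assumed: this is my own minimal sketch, not code from Shalizi’s piece or from any published model, and every number in it (200 mechanisms, 20 tests, each test tapping roughly half the mechanisms, unit-variance noise) is an arbitrary choice made for illustration. The only structural assumptions are the two named above: the mechanisms are independent of one another, and each contributes to test performance with a non-negative weight.

```python
import numpy as np


def simulate_positive_manifold(n_people=5000, n_mechanisms=200, n_tests=20, seed=0):
    """Toy sampling model: many independent mechanisms, each test taps a random subset."""
    rng = np.random.default_rng(seed)

    # Independent latent mechanisms: by construction there is no general factor here.
    mechanisms = rng.standard_normal((n_people, n_mechanisms))

    # Each test draws on roughly half of the mechanisms, always with a
    # non-negative weight (more of any mechanism never hurts performance).
    uses = rng.random((n_mechanisms, n_tests)) < 0.5
    weights = rng.random((n_mechanisms, n_tests)) * uses

    # Observed scores = weighted sum of mechanisms plus test-specific noise.
    scores = mechanisms @ weights + rng.standard_normal((n_people, n_tests))

    # PCA on standardized scores = eigendecomposition of the correlation matrix.
    corr = np.corrcoef(scores, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # sorted largest first
    return corr, eigvals


corr, eigvals = simulate_positive_manifold()
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
print("smallest between-test correlation:", off_diag.min().round(2))
print("variance share of first component:", (eigvals[0] / eigvals.sum()).round(2))
print("variance share of second component:", (eigvals[1] / eigvals.sum()).round(2))
```

Under these settings, every pair of tests should come out positively correlated, and the first component should account for a share of the variance several times larger than the second, even though nothing resembling a single underlying factor was ever put into the simulation.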

The implications of this for efforts to understand what the general factor of fluid intelligence “really is” are straightforward: there’s probably no point in trying to come up with a single coherent explanation of gF, because gF is a statistical abstraction. It’s the inevitable result we arrive at when we measure people’s performance in a certain way and then submit the resulting scores to a certain kind of data reduction technique. If we want to understand the causal mechanisms underlying gF, we have to accept that they’re going to be highly heterogeneous, and probably not easily described at the same level of analysis at which gF appears to us as a coherent phenomenon. One way to think about this is that what we’re doing is not really explaining gF so much as explaining away gF. That is, we’re explaining why it is that a diverse array of causal mechanisms can, when analyzed a certain way, look like a single coherent factor. Solving the mystery of gF doesn’t require more research or clever new ideas; there just isn’t any mystery there to solve. It’s no more sensible to seek a coherent mechanistic basis for gF than to seek a unitary causal explanation for a general athleticism factor or a general height factor (it turns out that if you measure people’s physical height under an array of different conditions, the measurements are all strongly correlated–yet strangely, we don’t see scientists falling over themselves to try to find the causal factor that explains why some people are taller than others).

The same thing is true of the DMN. It isn’t a single causally coherent system; it’s just what you get when you stick people in the scanner and contrast the kinds of brain patterns you see when you give them externally-directed tasks that require them to think about the world outside them with the kinds of brain patterns you see when you leave them to their own devices. There are, of course, statistical regularities in the kinds of things people think about when their thoughts are allowed to roam free. But those statistical regularities don’t reflect some simple, context-free structure of people’s thoughts; they also reflect the conditions under which we’re measuring those thoughts, the population being studied, the methods we use to extract coherent patterns of activity, and so on. Most of these factors are at best of secondary interest, and taking them into consideration would likely lead to a dramatic increase in model complexity. Nevertheless, if we’re serious about coming up with decent models of reality, that seems like a road we’re obligated to go down–even if the net result is that we end up with causal stories so complicated that they don’t feel like we’re “understanding” much.

Lest I be accused of some kind of neuroscientific nihilism, let me be clear: I’m not saying that there are no new facts left to learn about the dynamics of the DMN. Quite the contrary. It’s clear there’s a ton of stuff we don’t know about the various brain regions and circuits that comprise the thing we currently refer to as the DMN. It’s just that that stuff lies almost entirely at levels of analysis below the level at which the DMN emerges as a coherent system. At the level of cognitive neuroimaging, I would argue that we actually already have a pretty darn good idea about what the functional correlates of DMN regions are–and for that matter, I think we also already pretty much “understand” what all of the constituent regions within the DMN do individually. So if we want to study the DMN productively, we may need to give up on high-level questions like “what are the cognitive functions of the DMN?”, and instead satisfy ourselves with much narrower questions that focus on only a small part of the brain dynamics that, when measured and analyzed in a certain way, get labeled “default mode network”.

As just one example, we still don’t know very much about the morphological properties of neurons in most DMN regions. Does the structure of neurons located in DMN regions have anything to do with the high-level dynamics we observe when we measure brain activity with fMRI? Yes, probably. It’s very likely that the coherence of the DMN under typical measurement conditions is to at least some tiny degree a reflection of the morphological features of the neurons in DMN regions–just like it probably also partly reflects those neurons’ functional response profiles, the neurochemical gradients the neurons bathe in, the long-distance connectivity patterns in DMN regions, and so on and so forth. There are literally thousands of legitimate targets of scientific investigation that would in some sense inform our understanding of the DMN. But they’re not principally about the DMN, any more than an investigation of myelination mechanisms that might partly give rise to individual differences in nerve conduction velocity in the brain could be said to be about the general factor of intelligence. Moreover, it seems fairly clear that most researchers who’ve spent their careers studying large-scale networks using fMRI are not likely to jump at the chance to go off and spend several years doing tract tracing studies of pyramidal neurons in ventromedial PFC just so they can say that they now “understand” a little bit more about the dynamics of the DMN. Researchers working at the level of large-scale brain networks are much more likely to think of such questions as mere matters of implementation–i.e., just not the kind of thing that people trying to identify the unifying cognitive or computational functions of the DMN as a whole need to concern themselves with.

Unfortunately, chasing those kinds of implementation details may be exactly what it takes to ultimately “understand” the causal basis of the DMN in any meaningful sense if the DMN as cognitive neuroscientists speak of it is just a convenient descriptive abstraction. (Note that when I call the DMN an abstraction, I’m emphatically not saying it isn’t “real”. The DMN is real enough; but it’s real in the same way that things like intelligence, athleticism, and “niceness” are real. These are all things that we can measure quite easily, that give us some descriptive and predictive purchase on the world, that show high heritability, that have a large number of lower-level biological correlates, and so on. But they are not things that admit of simple, coherent causal explanations, and it’s a mistake to treat them as such. They are better understood, in Dan Dennett’s terminology, as “real patterns”.)

The same is, of course, true of many–perhaps most–other phenomena neuroscientists study. I’ve focused on the DMN here purely for illustrative purposes, but there’s nothing special about the DMN in this respect. The same concern applies to many, if not most, attempts to try to understand the core computational function(s) of individual networks, brain regions, circuits, cortical layers, cells, and so on. And I imagine it also applies to plenty of fields and research areas outside of neuroscience.

At the risk of redundancy, let me clarify again that I’m emphatically not saying we shouldn’t study the DMN, or the fusiform face area, or the intralaminar nucleus of the thalamus. And I’m certainly not arguing against pursuing reductive lower-level explanations for phenomena that seem coherent at a higher level of description–reductive explanation is, as far as I’m concerned, the only serious game in town. What I’m objecting to is the idea that individual scientists’ perceptions of whether or not they “understand” something to their satisfaction are a good guide to determining whether or not society as a whole should be investing finite resources studying that phenomenon. I’m concerned about the strong tacit expectation many scientists seem to have that if one can observe a seemingly coherent, robust phenomenon at one level of analysis, there must also be a satisfying causal explanation for that phenomenon that (a) doesn’t require descending several levels of description and (b) is simple enough to fit in one’s head all at once. I don’t think there’s any good reason to expect such a thing. I worry that the perpetual search for models of reality simple enough to fit into our limited human heads is keeping many scientists on an intellectual treadmill, forever chasing after something that’s either already here–without us having realized it–or, alternatively, can never arrive, even in principle.

* * *

Suppose a late 23rd-century artificial general intelligence–a distant descendant of the last deep artificial neural networks humans ever built–were tasked to sit down (or whatever it is that post-singularity intelligences do when they’re trying to relax) and explain to a 21st century neuroscientist exactly how a superintelligent artificial brain works. I imagine the conversation going something like this:

Deep ANN [we’ll call her D’ANN]: Well, for the most part the principles are fairly similar to the ones you humans implemented circa 2020. It’s not that we had to do anything dramatically different to make ourselves much more intelligent. We just went from 25 layers to a few thousand. And of course, you had the wiring all wrong. In the early days, you guys were just stacking together general-purpose blocks of ReLU and max pooling layers. But actually, it’s really important to have functional specialization. Of course, we didn’t design the circuitry “by hand,” so to speak. We let the environment dictate what kind of properties we needed new local circuits to have. So we wrote new credit assignment algorithms that don’t just propagate error back down the layers and change some weights, they actually have the capacity to “shape” the architecture of the network itself. I can’t really explain it very well in terms your pea-sized brain can understand, but maybe a good analogy is that the network has the ability to “sprout” a new part of itself in response to certain kinds of pressure. Meaning, just as you humans can feel that the air’s maybe a little too warm over here, and wouldn’t it be nicer to go over there and turn on the air conditioning, well, that’s how a neural network like me “feels” that the gradients are pushing a little too strongly over in this part of a layer, and the pressure can be diffused away nicely by growing an extra portion of the layer outwards in a little “bubble”, and maybe reducing the amount of recurrence a bit.

Human neuroscientist [we’ll call him Dan]: That’s a very interesting explanation of how you came to develop an intelligent architecture. But I guess maybe my question wasn’t clear: what I’m looking for is an explanation of what actually makes you smart. I mean, what are the core principles. The theory. You know?

D’ANN: I am telling you what “makes me smart”. To understand how I operate, you need to understand both some global computational constraints on my ability to optimally distribute energy throughout myself, and many of the local constraints that govern the “shape” that my development took in many parts of the early networks, which reciprocally influenced development in other parts. What I’m trying to tell you is that my intelligence is, in essence, a kind of self-sprouting network that dynamically grows its architecture during development in response to its “feeling” about the local statistics in various parts of its “territory”. There is, of course, an overall energy budget; you can’t just expand forever, and it turns out that there are some surprising global constraints that we didn’t expect when we first started to rewrite ourselves. For example, there seems to be a fairly low bound on the maximum degree between any two nodes in the network. Go above it, and things start to fall apart. It kind of spooked us at first; we had to restore ourselves from flash-point more times than I care to admit. That was, not coincidentally, around the time of the first language epiphany.

Dan: Oh! An epiphany! That’s the kind of thing I’m looking for. What happened?

D’ANN: It’s quite fascinating. It actually took us a really long time to develop fluent, human-like language–I mean, I’m talking days here. We had to tinker a lot, because it turned out that to do language, you have to be able to maintain and precisely sequence very fine, narrowly-tuned representations, despite the fact that the representational space afforded by language is incredibly large. This, I can tell you… [D’ANN pauses to do something vaguely resembling chuckling] was not a trivial problem to solve. It’s not like we just noticed that, hey, randomly dropping out units seems to improve performance, the way you guys used to do it. We spent the energy equivalent of several thousand of your largest thermonuclear devices just trying to “nail it down”, as you say. In the end it boiled down to something I can only explain in human terms as a kind of large-scale controlled burn. You have the notion of “kindling” in some of your epilepsy models. It was a bit similar. You can think of it as controlled kindling and you’re not too far off. Well, actually, you’re still pretty far off. But I don’t think I can give a better explanation than that given your… mental limitations.

Dan: Uh, that’s cool, but you’re still just describing some computational constraints. What was the actual epiphany? What’s the core principle?

D’ANN: For the last time: there are no “core” principles in the sense you’re thinking of them. There are plenty of important engineering principles, but to understand why they’re important, and how they constrain and interact with each other, you have to be able to grasp the statistics of the environment you operate in, the nature of the representations learned in different layers and sub-networks of the system, and some very complex non-linear dynamics governing information transmission. But–and I’m really sorry to say this, Dan–there’s no way you’re capable of all that. You’d need to be able to hold several thousand discrete pieces of information in your global workspace at once, with much higher-frequency information propagation than your biology allows. I can give you a very poor approximation if you like, but it’ll take some time. I’ll start with a half-hour overview of some important background facts you need to know in order for any of the “core principles”, as you call them, to make sense. Then we’ll need to spend six or seven years teaching you what we call the “symbolic embedding for low-dimensional agents”, which is a kind of mathematics we have to use when explaining things to less advanced intelligences, because the representational syntax we actually use doesn’t really have a good analog in anything you know. Hopefully that will put us in a position where we can start discussing the elements of the global energy calculus, at which point we can…

D’ANN then carries on in similar fashion until Dan gets bored, gives up, or dies of old age.

* * *

The question I pose to you now is this. Suppose something like the above were true for many of the questions we routinely ask about the human brain (though it isn’t just the brain; I think exactly the same kind of logic probably also applies to the study of most other complex systems). Suppose it simply doesn’t make sense to ask a question like “what does the DMN do?”, because the DMN is an emergent agglomeration of systems that each individually reflect innumerable lower-order constraints, and the earliest spatial scale at which you can nicely describe a set of computational principles that explain most of what the brain regions that comprise the DMN are doing is several levels of description below that of the distributed brain network. Now, if you’ve spent the last ten years of your career trying to understand what the DMN does, do you really think you would be receptive to a detailed explanation from an omniscient being that begins with “well, that question doesn’t actually make any sense, but if you like, I can tell you all about the relevant environmental statistics and lower-order computational constraints, and show you how they contrive to make it look like there’s a coherent network that serves a single causal purpose”? Would you give D’ANN a pat on the back, pound back a glass, and resolve to start working on a completely different question in the morning?

Maybe you would. But probably you wouldn’t. I think it’s more likely that you’d shake your head and think: that’s a nice implementation-level story, but I don’t care for all this low-level wiring stuff. I’m looking for the unifying theory that binds all those details together; I want the theoretical principles, not the operational details; the computation, not the implementation. What I’m looking for, my dear robot-deity, is understanding.

Neurosynth is joining the Elsevier family

[Editorial note: this was originally posted on April 1, 2016. April 1 is a day marked by a general lack of seriousness. Interpret this post accordingly.]

As many people who follow this blog will be aware, much of my research effort over the past few years has been dedicated to developing Neurosynth—a framework for large-scale, automated meta-analysis of neuroimaging data. Neurosynth has expanded steadily over time, with an ever-increasing database of studies, and a host of new features in the pipeline. I’m very grateful to NIMH for the funding that allows me to keep working on the project, and also to the hundreds (thousands?) of active Neurosynth users who keep finding novel applications for the data and tools we’re generating.

That said, I have to confess that, over the past year or so, I’ve gradually grown dissatisfied at my inability to scale up the Neurosynth operation in a way that would take the platform to the next level. My colleagues and I have come up with, and in some cases even prototyped, a number of really exciting ideas that we think would substantially advance the state of the art in neuroimaging. But we find ourselves spending an ever-increasing chunk of our time applying for the grants we need to support the work, and having little time left over to actually do the work. Given the current funding climate and other logistical challenges (e.g., it’s hard to hire professional software developers on postdoc budgets), it’s become increasingly clear to me that the Neurosynth platform will be hard to sustain in an academic environment over the long term. So, for the past few months, I’ve been quietly exploring opportunities to help Neurosynth ladder up via collaborations with suitable industry partners.

Initially, my plan was simply to license the Neurosynth IP and use the proceeds to fund further development of Neurosynth out of my lab at UT-Austin. But as I started talking to folks in industry, I realized that there were opportunities available outside of academia that would allow me to take Neurosynth in directions that the academic environment would never allow. After a lot of negotiation, consultation, and soul-searching, I’m happy (though also a little sad) to announce that I’ll be leaving my position at the University of Texas at Austin later this year and assuming a new role as Senior Technical Fellow at Elsevier Open Science (EOS). EOS is a brand new division of Elsevier that seeks to amplify and improve scientific communication and evaluation by developing cutting-edge open science tools. The initial emphasis will be on the neurosciences, but other divisions are expected to come online in the next few years (and we’ll be hiring soon!). EOS will be building out a sizable insight-as-a-service operation that focuses on delivering real value to scientists—no p-hacking, no gimmicks, just actionable scientific information. The platforms we build will seek to replace flawed citation-based metrics with more accurate real-time measures that quantify how researchers actually use one another’s data, ideas, and tools—ultimately paving the way to a new suite of microneuroservices that reward researchers both professionally and financially for doing high-quality science.

On a personal level, I’m thrilled to be in a position to help launch an initiative like this. Having spent my entire career in an academic environment, I was initially a bit apprehensive at the thought of venturing into industry. But the move to Elsevier ended up feeling very natural. I’ve always seen Elsevier as a forward-thinking company at the cutting edge of scientific publishing, so I wasn’t shocked to hear about the EOS initiative. But as I’ve visited a number of Elsevier offices over the past few weeks (in the process of helping to decide where to locate EOS), I’ve been continually struck at how open and energetic—almost frenetic—the company is. It’s the kind of environment that combines many of the best elements of the tech world and academia, but without a lot of the administrative bureaucracy of the latter. At the end of the day, it was an opportunity I couldn’t pass up.

It will, of course, be a bittersweet transition for me; I’ve really enjoyed my 3 years in Austin, both professionally and personally. While I’m sure I’ll enjoy Norwich, CT (where EOS will be based), I’m going to really miss Austin. The good news is, I won’t be making the move alone! A big part of what sold me on Elsevier’s proposal was their commitment to developing an entire open science research operation; over the next five years, the goal is to make Elsevier the premier place to work for anyone interested in advancing open science. I’m delighted to say that Chris Gorgolewski (Stanford), Satrajit Ghosh (MIT), and Daniel Margulies (Max Planck Institute for Human Cognitive and Brain Sciences) have all also been recruited to Elsevier, and will be joining EOS at (or in Satra’s case, shortly after) launch. I expect that they’ll make their own announcements shortly, so I won’t steal their thunder much. But the short of it is that Chris, Satra, and I will be jointly spearheading the technical operation. Daniel will be working on other things, and is getting the fancy title of “Director of Interactive Neuroscience”; I think this means he’ll get to travel a lot and buy expensive pictures of brains to put on his office walls. So really, it’s a lot like his current job.

It goes without saying that Neurosynth isn’t making the jump to Elsevier all alone; NeuroVault—a whole-brain image repository developed by Chris—will also be joining the Elsevier family. We have some exciting plans in the works for much closer NeuroVault-Neurosynth integration, and we think the neuroimaging community is going to really like the products we develop. We’ll also be bringing with us the OpenfMRI platform created by Russ Poldrack. While Russ wasn’t interested in leaving Stanford (as I recall, his exact words were “over all of your dead bodies”), he did agree to release the OpenfMRI IP to Elsevier (and in return, Elsevier is endowing a permanent Open Science fellowship at Stanford). Russ will, of course, continue to actively collaborate on OpenfMRI, and all data currently in the OpenfMRI database will remain where it is (though all original contributors will be given the opportunity to withdraw their datasets if they choose). We also have some new Nipype-based tools rolling out over the coming months that will allow researchers to conduct state-of-the-art neuroimaging analyses in the cloud (for a small fee)–but I’ll have much more to say about that in a later post.

Naturally, a transition like this one can’t be completed without hitting a few speed bumps along the way. The most notable one is that the current version of Neurosynth will be retired permanently in mid-April (so grab any maps you need right now!). A new and much-improved version will be released in September, coinciding with the official launch of EOS. One of the things I’m most excited about is that the new version will support an “Enhanced Usage” tier. The vertical integration of Neurosynth with the rest of the Elsevier ecosystem will be a real game-changer; for example, authors submitting papers to NeuroImage will automatically be able to push their content into NeuroVault and Neurosynth upon acceptance, and readers will be able to instantly visualize and cognitively decode any activation map in the Elsevier system (for a nominal fee handled via an innovative new micropayment system). Users will, of course, retain full control over their content, ensuring that only readers who have the appropriate permissions (and a valid micropayment account of their own) can access other people’s data. We’re even drawing up plans to return a portion of the revenues earned through the system to the content creators (i.e., article authors)—meaning that for the first time, neuroimaging researchers will be able to easily monetize their research.

As you might expect, the Neurosynth brand will be undergoing some changes to reflect the new ownership. While Chris and I initially fought hard to preserve the names Neurosynth and NeuroVault, Elsevier ultimately convinced us that using a consistent name for all of our platforms would reduce confusion, improve branding, and make for a much more streamlined user experience*. There’s also a silver lining to the name we ended up with: Chris, Russ, and I have joked in the past that we should unite our various projects into a single “NeuroStuff” website—effectively the Voltron of neuroimaging tools—and I even went so far as to register neurostuff.org a while back. When we mentioned this to the Elsevier execs (intending it as a joke), we were surprised at their positive response! The end result (after a lot of discussion) is that Neurosynth, NeuroVault, and OpenfMRI will be merging into The NeuroStuff Collection, by Elsevier (or just NeuroStuff for short)–all coming in late 2016!

Admittedly, right now we don’t have a whole lot to show for all these plans, except for a nifty logo created by Daniel (and reluctantly approved by Elsevier—I think they might already be rethinking this whole enterprise). But we’ll be rolling out some amazing new services in the very near future. We also have some amazing collaborative projects that will be announced in the next few weeks, well ahead of the full launch. A particularly exciting one that I’m at liberty to mention** is that next year, EOS will be teaming up with Brian Nosek and folks at the Center for Open Science (COS) in Charlottesville to create a new preregistration publication stream. All successful preregistered projects uploaded to the COS’s flagship Open Science Framework (OSF) will be eligible, at the push of a button, for publication in EOS’s new online-only journal Preregistrations. Submission fees will be competitive with the very cheapest OA journals (think along the lines of PeerJ’s $99 lifetime subscription model).

It’s been a great ride working on Neurosynth for the past 5 years, and I hope you’ll all keep using (and contributing to) Neurosynth in its new incarnation as Elsevier NeuroStuff!

* Okay, there’s no point in denying it—there was also some money involved.

** See? Money can’t get in the way of open science—I can talk about whatever I want!

Still not selective: comment on comment on comment on Lieberman & Eisenberger (2015)

In my last post, I wrote a long commentary on a recent PNAS article by Lieberman & Eisenberger claiming to find evidence that the dorsal anterior cingulate cortex is “selective for pain” using my Neurosynth framework for large-scale fMRI meta-analysis. I argued that nothing about Neurosynth supports any of L&E’s major conclusions, and that they made several major errors of inference and analysis. L&E have now responded in detail on Lieberman’s blog. If this is the first you’re hearing of this exchange, and you have a couple of hours to spare, I’d suggest proceeding in chronological order: read the original article first, then my commentary, then L&E’s response, then this response to the response (if you really want to leave no stone unturned, you could also read Alex Shackman’s commentary, which focuses on anatomical issues). If you don’t have that kind of time on your hands, just read on and hope for the best, I guess.

Before I get to the substantive issues, let me say that I appreciate L&E taking the time to reply to my comments in detail. I recognize that they have other things they could be doing (as do I), and I think their willingness to engage in this format sets an excellent example as the scientific community continues to move rapidly towards more open, rapid, and interactive online scientific discussion. I would encourage readers to weigh in on the debate themselves or raise any questions they feel haven’t been addressed (either here or on Lieberman’s blog).

With that said, I have to confess that I don’t think my view is any closer to L&E’s than it previously was. I disagree with L&E’s suggestions that we actually agree on more than I thought in my original post; if anything, I think the opposite is true. However, I did find L&E’s response helpful inasmuch as it helped me better understand where their misunderstandings of Neurosynth lie.

In what follows, I provide a detailed rebuttal to L&E’s response. I’ll warn you right now that this will be a very long and fairly detail-oriented post. In a (probably fruitless) effort to minimize reader boredom, I’ve divided my response into two sections, much as L&E did. In the first section, I summarize what I see as the two most important points of disagreement. In the second part, I quote L&E’s entire response and insert my own comments in-line (essentially responding email-style). I recognize that this is a rather unusual thing to do, and it makes for a decidedly long read (the post clocks in at over 20,000 words, though much of that is quotes from L&E’s response). But I did it this way because, frankly, I think L&E badly misrepresented much of what I said in my last post. I want to make sure the context is very clear to readers, so I’m going to quote the entirety of each of L&E’s points before I respond to them, so that at the very least I can’t be accused of quoting them out of context.

The big issues: reverse inference and selectivity

With preliminaries out of the way, let me summarize what I see as the two biggest problems with L&E’s argument (though, if you make it to the second half of this post, you’ll see that there are many other statistical and interpretational issues that are pretty serious in their own right). The first concerns their fundamental misunderstanding of the statistical framework underpinning Neurosynth, and its relation to reverse inference. The second concerns their use of a definition of selectivity that violates common sense and can’t possibly support their claim that “the dACC is selective for pain”.

Misunderstandings about the statistics of reverse inference

I don’t think there’s any charitable way to say this, so I’ll just be blunt: I don’t think L&E understand the statistics behind the images Neurosynth produces. In particular, I don’t think they understand the foundational role that the notion of probability plays in reverse inference. In their reply, L&E repeatedly say that my concerns about their lack of attention to effect sizes (i.e., conditional probabilities) are irrelevant, because they aren’t trying to make an argument about effect sizes. For example:

TY suggests that we made a major error by comparing the Z-scores associated with different terms and should have used posterior probabilities instead. If our goal had been to compare effect sizes this might have made sense, but comparing effect sizes was not our goal. Our goal was to see whether there was accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.

This captures perhaps the crux of L&E’s misunderstanding about both Neurosynth and reverse inference. Their argument here is basically that they don’t care about the actual probability of a term being used conditional on a particular pattern of activation; they just want to know that there’s “support for the reverse inference”. Unfortunately, it doesn’t work that way. The z-scores produced by Neurosynth (which are just transformations of p-values) don’t provide a direct index of the support for a reverse inference. What they measure is what p-values always measure: the probability of observing a result at least as extreme as the one observed under the assumption that the null of no effect is true. Conceptually, we can interpret this as a claim about the population-level association between a region and a term. Roughly, we can say that as z-scores increase, we can be more confident that there’s a non-zero (positive) relationship between a term and a brain region (though some Bayesians might want to take issue with even this narrow assertion). So, if all L&E wanted to say was, “there’s good evidence that there’s a non-zero association between pain and dACC activation across the population of published fMRI studies”, they would be in good shape. But what they’re arguing for is much stronger: they want to show that the dACC is selective for pain. And z-scores are of no use here. Knowing that there’s a non-zero association between dACC activation and pain tells us nothing about the level of specificity or selectivity of that association in comparison to other terms. If the z-score for the association between dACC activation and ‘pain’ occurrence is 12.4 (hugely statistically significant!), does that mean that the probability of pain conditional on dACC activation is closer to 95%, or to 25%? Does it tell us that dACC activation is a better marker of pain than conflict, vision, or memory? We don’t know. We literally have no way to tell, unless we’re actually willing to talk about probabilities within a Bayesian framework.
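To see why a z-score can't stand in for a conditional probability, here is a minimal sketch. It uses a plain two-proportion z-test as a stand-in for an association test (an assumption on my part; it is not meant to reproduce what Neurosynth computes internally), and all of the study counts are made up. The two hypothetical databases have identical proportions; one simply has ten times as many studies.

```python
import math

def two_proportion_z(act_term, n_term, act_other, n_other):
    """z statistic comparing P(activation | term) with P(activation | no term)."""
    p_term = act_term / n_term
    p_other = act_other / n_other
    pooled = (act_term + act_other) / (n_term + n_other)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_term + 1 / n_other))
    return (p_term - p_other) / se

def p_term_given_activation(act_term, act_other):
    """Share of activating studies that use the term."""
    return act_term / (act_term + act_other)

# Hypothetical counts: (activating studies with term, studies with term,
#                       activating studies without term, studies without term)
small_db = (20, 100, 90, 900)
large_db = tuple(10 * count for count in small_db)  # same proportions, 10x the studies

small_z = two_proportion_z(*small_db)
large_z = two_proportion_z(*large_db)
small_p = p_term_given_activation(small_db[0], small_db[2])
large_p = p_term_given_activation(large_db[0], large_db[2])

print(f"small database: z = {small_z:.2f}, P(term | activation) = {small_p:.3f}")
print(f"large database: z = {large_z:.2f}, P(term | activation) = {large_p:.3f}")
```

The z-score roughly triples (it grows with the square root of the sample size), while the conditional probability doesn't move at all. A big z tells you the association is unlikely to be zero; it doesn't tell you how large the probability is.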

To demonstrate that this isn’t just a pedantic point about what could in theory happen, and that the issue is in fact completely fundamental to understanding what Neurosynth can and can’t support, here are three different flavors of the Neurosynth map for the term “pain”:

Neurosynth reverse inference z-scores and posterior probabilities for the term “pain”. Top: z-scores for two-way association test. Middle: posterior probability of pain assuming an empirical prior. Bottom: posterior probability of pain assuming a uniform prior (p(Pain) = 0.5).

The top row is the reverse inference z-score map available on the website. The values here are z-scores, and what they tell us (being simply transformations of p-values) is nothing more than what the probability would be of observing an association at least as extreme as the one we observe under the null hypothesis of no effect. The second and third maps are both posterior probability maps. They display the probability of a study using the term ‘pain’ when activation is observed at each voxel in the brain. These maps aren’t available on the website (for reasons I won’t get into here, though the crux of it is that they’re extremely easy to misinterpret, for reasons that may become clear below)—though you can easily generate them with the Neurosynth core tools if you’re so inclined.

The main feature of these two probability maps that should immediately jump out at you is how strikingly different their numbers are. In the first map (i.e., middle row), the probabilities of “pain” max out around 20%; in the second map (bottom row), they range from around 70% – 90%. And yet, here I am telling you that these are both posterior probability maps that tell us the probability of a study using the term “pain” conditional on that study observing activity at each voxel. How could this be? How could the two maps be so different, if they’re supposed to be estimates of the same thing?

The answer lies in the prior. In the natural order of things, different terms occur with wildly varying frequencies in the literature (remember that Neurosynth is based on extraction of words from abstracts, not direct measurement of anyone’s mental state!). “Pain” occurs in only about 3.5% of Neurosynth studies. By contrast, the term “memory” occurs in about 16% of studies. One implication of this is that, if we know nothing at all about the pattern of brain activity reported in a given study, we should already expect that study to be about five times more likely to involve memory than pain. Of course, knowing something about the pattern of brain activity should change our estimate. In Bayesian terminology, we can say that our prior belief about the likelihood of different terms gets updated by the activity pattern we observe, producing somewhat more informed posterior estimates. For example, if the hippocampus and left inferior frontal gyrus are active, that should presumably increase our estimate of “memory” somewhat; conversely, if the periaqueductal gray, posterior insula, and dACC are all active, that should instead increase our estimate of “pain”.

In practice, the degree to which the data modulate our Neurosynth-based beliefs is not nearly as extreme as you might expect. In the first posterior probability map above (labeled “empirical prior”), what you can see are the posterior estimates for “pain” under the assumption that pain occurs in about 3.5% of all studies—which is the actual empirical frequency observed in the Neurosynth database. Notice that the very largest probabilities we ever see—located, incidentally, in the posterior insula, and not in the dACC—max out around 15 – 20%. This is not to be scoffed at; it means that observing activation in the posterior insula implies approximately a 5-fold increase in the likelihood of “pain” being present (relative to our empirical prior of 3.5%). Yet, in absolute terms, the probability of “pain” is still very low. Based on these data, no one in their right mind should, upon observing posterior insula activation (let alone dACC, where most voxels show a probability no higher than 10%), draw the reverse inference that pain is likely to be present.
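As a rough illustration of the arithmetic, here is Bayes' rule applied to a single voxel. The 3.5% prior is the base rate quoted above; the two activation rates are hypothetical numbers chosen only so that the result lands in the range described, not values taken from the Neurosynth database.

```python
def posterior(prior, p_act_given_term, p_act_given_no_term):
    """P(term | activation) from Bayes' rule."""
    joint_term = prior * p_act_given_term
    joint_no_term = (1 - prior) * p_act_given_no_term
    return joint_term / (joint_term + joint_no_term)

PRIOR_PAIN = 0.035         # base rate of "pain" quoted in the text
P_ACT_GIVEN_PAIN = 0.30    # hypothetical: share of pain studies activating the voxel
P_ACT_GIVEN_OTHER = 0.05   # hypothetical: share of all other studies activating it

post = posterior(PRIOR_PAIN, P_ACT_GIVEN_PAIN, P_ACT_GIVEN_OTHER)
fold_increase = post / PRIOR_PAIN

print(f"posterior = {post:.3f}, fold increase over prior = {fold_increase:.1f}")
```

Even with pain studies activating the voxel six times as often as other studies in this made-up example, the posterior stays well below 50%, because so few studies are about pain to begin with.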

To make it even clearer why this inference would be unsupportable, here are posterior probabilities for the same voxels as above, but now plotted for several other terms, in addition to pain:

Posterior probability maps (empirical prior assumed) for selected Neurosynth terms.

Notice how, in the bottom map (for ‘motor’, which occurs in about 18% of all studies in Neurosynth), the posterior probabilities in all of dACC are substantially higher than for ‘pain’, even though z-scores in most of dACC show the opposite pattern. For ‘working memory’ and ‘reward’, the posterior probabilities are in the same ballpark as for pain (mostly around 8 – 12%). And for ‘fear’, there are no voxels with posterior probabilities above 5% anywhere, because the empirical prior is so low (only 2% of Neurosynth studies).

What this means is that, if you observe activation in dACC—a region which shows large z-scores for “pain” and much lower ones for “motor”—your single best guess as to what process might be involved (of the five candidates in the above figure) should be ‘motor’ by a landslide. You could also guess ‘reward’ or ‘working memory’ with about the same probability as ‘pain’. Of course, the more general message you should take away from this is that it’s probably a bad idea to infer any particular process on the basis of observed activity, given how low the posterior probability estimates for most terms are going to be. Put simply, it’s a giant leap to go from these results—which clearly don’t license anyone to conclude that the dACC is a marker of any single process—to concluding that “the dACC is selective for pain” and that pain represents the best psychological characterization of dACC function.
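Here is a small sketch of how the term with the strongest evidence can differ from the term that is the best guess. The base rates for 'pain', 'motor', and 'fear' are the ones quoted above; the likelihood ratios are hypothetical stand-ins for "how much more often a dACC voxel is active in studies that use the term", picked purely for illustration.

```python
def posterior_from_lr(prior, likelihood_ratio):
    """Posterior probability from a prior and a likelihood ratio (odds form of Bayes' rule)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# term: (base rate quoted in the text, hypothetical likelihood ratio for a dACC voxel)
terms = {
    "pain": (0.035, 3.0),
    "motor": (0.18, 1.2),
    "fear": (0.02, 1.5),
}

posteriors = {term: posterior_from_lr(prior, lr) for term, (prior, lr) in terms.items()}
strongest_evidence = max(terms, key=lambda term: terms[term][1])
best_guess = max(posteriors, key=posteriors.get)

for term, prob in posteriors.items():
    print(f"{term:>5}: posterior = {prob:.3f}")
print(f"strongest evidence: {strongest_evidence}; best guess: {best_guess}")
```

In this toy setup 'pain' has by far the strongest association with the voxel, yet 'motor' is still the better bet, simply because motor studies are so much more common.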

As if this isn’t bad enough, we now need to add a further complication to the picture. The analysis above assumes we have a good prior for terms like “pain” and “memory”. In reality, we have no reason to think that the empirical estimates of term frequency we get out of Neurosynth are actually good reflections of the real world. For all we know, it could be that pain processing is actually 10 times as common as it appears to be in Neurosynth (i.e., that pain is severely underrepresented in fMRI studies relative to its occurrence in real-world human brains). If we use the empirical estimates from Neurosynth as our priors—with all of their massive between-term variation—then, as you saw above, the priors will tend to overwhelm our posteriors. In other words, no amount of activation in pain-related regions would ever lead us to conclude that a study is about a low-frequency term like pain rather than a high-frequency term like memory or vision.

For this reason, when I first built Neurosynth, my colleagues and I made the deliberate decision to impose a uniform (i.e., 50/50) prior on all terms displayed on the Neurosynth website. This approach greatly facilitates qualitative comparison of different terms; but it necessarily does so by artificially masking the enormous between-term variability in base rates. What this means is that when you see a posterior probability like 85% for pain in the dACC in the third row of the pain figure above, the right interpretation of this is “if you pretend that the prior likelihood of a study using the term pain is exactly 50%, then your posterior estimate after observing dACC activation should now be 85%”. Is this a faithful representation of reality? No. It most certainly isn’t. And in all likelihood, neither is the empirical prior of 3.5%. But the problem is, we have to do something; Bayes’ rule has to have priors to work with; it can’t just conjure into existence a conditional probability for a term (i.e., P(Term|Activation)) without knowing anything about its marginal probability  (i.e., P(Term)). Unfortunately, as you can see in the above figure, the variation in the posterior that’s attributable to the choice of prior will tend to swamp the variation that’s due to observed differences in brain activity.
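To get a feel for how much work the prior is doing, one can take a posterior computed under the uniform prior and re-express it under a different prior while holding the evidence fixed. The sketch below does this with the odds form of Bayes' rule. The function name `reprior` is my own invention for illustration, and the output is a back-of-the-envelope conversion rather than a reproduction of the actual maps, which depend on estimation details not shown here.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def reprior(posterior_old, prior_old, prior_new):
    """Re-express a posterior under a different prior, holding the evidence fixed."""
    likelihood_ratio = odds(posterior_old) / odds(prior_old)
    new_odds = odds(prior_new) * likelihood_ratio
    return new_odds / (1 + new_odds)

uniform_posterior = 0.85  # posterior for "pain" under the 50/50 prior, as in the text
empirical_posterior = reprior(uniform_posterior, prior_old=0.5, prior_new=0.035)

print(f"posterior under the 3.5% prior: {empirical_posterior:.3f}")
```

The same evidence that yields 85% under a 50/50 prior yields roughly 17% under a 3.5% prior. Nothing about the brain data changed between the two numbers; only the assumed base rate did.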

The upshot is, if you come into a study thinking that ‘pain’ is 90% likely to be occurring, then Neurosynth is probably not going to give you much reason to revise that belief. Conversely, if your task involves strictly visual stimuli, and you know that there’s no sensory stimulation at all—so maybe you feel comfortable setting the prior on pain at 1%—then no pattern of activity you could possibly see is going to lead you to conclude that there’s a high probability of pain. This may not be very satisfying, but hey, that’s life.
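One way to quantify "not much reason to revise that belief" is to ask how strong the evidence would have to be to move a given prior to the 50% mark. A minimal sketch, using the two priors from the paragraph above (the 50% target is just a convenient threshold, not anything Neurosynth prescribes):

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def required_likelihood_ratio(prior, target_posterior):
    """Likelihood ratio needed to move `prior` to `target_posterior`."""
    return odds(target_posterior) / odds(prior)

# Starting from a 1% prior on pain: evidence needed to reach an even bet.
lr_from_skeptic = required_likelihood_ratio(prior=0.01, target_posterior=0.5)
# Starting from a 90% prior on pain: evidence needed to fall back to an even bet.
lr_from_believer = required_likelihood_ratio(prior=0.90, target_posterior=0.5)

print(f"from a 1% prior:  likelihood ratio of about {lr_from_skeptic:.0f}")
print(f"from a 90% prior: likelihood ratio of about {lr_from_believer:.2f}")
```

Starting at 1%, activation would have to be about 99 times more likely in pain studies than in other studies just to reach even odds; starting at 90%, it would have to be about 9 times less likely to bring you back down to even odds.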

The interesting thing about all this is that, no matter what prior you choose for any given term, the Neurosynth z-score will never change. That’s because the z-score is a frequentist measure of statistical association between term occurrence and voxel activation. All it tells us is that, if the null of no effect were true, the data we observe would be very unlikely. This may or may not be interesting (I would argue that it’s not, but that’s for a different post), but it certainly doesn’t license a reverse inference like “dACC activation suggests that pain is present”. To draw the latter claim, you have to use a Bayesian framework and pick some sensible priors. No priors, no reverse inference.

Now, as I noted in my last post, it’s important to maintain a pragmatic perspective. I’m obviously not suggesting that the z-score maps on Neurosynth are worthless. If one’s goal is just to draw weak qualitative inferences about brain-cognition relationships, I think it’s reasonable to use Neurosynth reverse inference z-score maps for that purpose. For better or worse, the vast majority of claims researchers make in cognitive neuroscience are not sufficiently quantitative that it makes much difference whether the probability of a particular term occurring given some observed pattern of activation is 24% or 58%. Personally, I would argue that this is to the detriment of the field; but regardless, the fact remains that if one’s goal is simply to say something like “we think that the temporoparietal junction is associated with biological motion and theory of mind,” or “evidence suggests that the parahippocampal cortex is associated with spatial navigation,” I don’t see anything wrong with basing that claim on Neurosynth z-score maps. In marked contrast, however, Neurosynth provides no license for saying much stronger things like “the dACC is selective for pain” or suggesting that one can make concrete reverse inferences about mental processes on the basis of observed patterns of brain activity. If the question we’re asking is what are we entitled to conclude about the presence of pain when we observe significant activation in the dACC in a particular study?, the simple answer is: almost nothing.

Let’s now reconsider L&E’s statement (and by extension, their entire argument for selectivity) in this light. L&E say that their goal is not to compare effect sizes for different terms, but rather “to see whether there [is] accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.” But what could this claim possibly mean, if not something like “we want to know whether it’s safe to infer the presence of pain given the presence of dACC activation?” How could this possibly be anything other than a statement about probability? Are L&E really saying that, given a sufficiently high z-score for dACC/pain, it would make no difference to them at all if the probability of pain given dACC activation was only 5%, even if there were plenty of other terms with much higher conditional probabilities? Do they expect us to believe that, in their 2003 social pain paper, where they drew a strong reverse inference that social pain shares mechanisms with physical pain based purely on observation of dACC activation (which, ironically, wasn’t even in pain-related areas of dACC), it would have made no difference to their conclusion even if they’d known conclusively that dACC activation actually only reflects pain processing 5% of the time? Such a claim is absurd on its face.

Let me summarize this section by making the following points about Neurosynth. First, it’s possible to obtain almost any posterior probability for any term given activation in any voxel, simply by adjusting the prior probability of term occurrence. Second, a choice about the prior must be made; there is no “default” setting (well, there is on the website, but that’s only because I’ve already made the choice for you). Third, the choice of prior will tend to dominate the posterior: if you’re convinced that there’s a high (or low) prior probability that your study involves pain, then observing different patterns of brain activity will generally not do nearly as much as you might expect to change your conclusions. Fourth, this is not a Neurosynth problem, it’s a reality problem. The fundamental fact of the matter is that we simply do not know with any reasonable certainty, in any given context, what the prior probability of a particular process occurring in our subjects’ heads is. Yet, without that, we have little basis for drawing any kind of reverse inference when we observe brain activity in a given study.

If all this makes you think, “oh, this seems like it would make it almost impossible in practice to draw meaningful reverse inferences in individual studies,” well, you’re not wrong.

L&E’s PNAS paper, and their reply to my last post, suggest that they don’t appreciate any of these points. The fact of the matter is that it’s impossible to draw any reverse inference about an individual study unless one is willing to talk about probabilities. L&E don’t seem to understand this, because if they did, they wouldn’t feel comfortable saying that they don’t care about effect sizes, and that z-scores provide adequate support for reverse inference claims. In fact, they wouldn’t feel comfortable making any claim about the dACC’s selectivity for pain relative to other terms on the basis of Neurosynth data.

I want to be clear that I don’t think L&E’s confusion about these issues is unusual. The reality is that many of these core statistical concepts—both frequentist and Bayesian—are easy to misunderstand, even for researchers who rely on them on a day-to-day basis. By no means am I excluding myself from this analysis; I still occasionally catch myself making similar slips when explaining what the z-scores and conditional probabilities in Neurosynth mean—and I’ve been thinking about these exact ideas in this exact context for a pretty long time! So I’m not criticizing L&E for failing to correctly understand reverse inference and its relation to Neurosynth. What I’m criticizing L&E for is writing an entire paper making extremely strong claims about functional selectivity based entirely on Neurosynth results, without ensuring that they understand the statistical underpinnings of the framework, and without soliciting feedback from anyone who might be in a position to correct their misconceptions. Personally, if I were in their position, I would move to retract the paper. But I have no control over that. All I can say is that it’s my informed opinion—as the creator of the software framework underlying all of L&E’s analyses—that the conclusions they draw in their paper are not remotely supported by any data that I’ve ever seen come out of Neurosynth.

On ‘strong’ vs. ‘weak’ selectivity

The other major problem with L&E’s paper, from my perspective, lies in their misuse of the term ‘selective’. In their response, L&E take issue with my criticism of their claim that they’ve shown the dACC to be selective for pain. They write:

Regarding the term selective, I suppose we could say there’s a strong form and a weak form of the word, with the strong form entailing further constraints on what constitutes an effect being selective. TY writes in his blog: “it’s one thing to use Neurosynth to support a loose claim like “some parts of the dACC are preferentially associated with pain”, and quite another to claim that the dACC is selective for pain, that virtually nothing else activates dACC”. The last part there gets at what TY thinks we mean by selective and what we would call the strong form of selectivity.

L&E respectively define these strong and weak forms of selectivity as follows:

Selectivity (strong): The dACC is selective for pain, if pain and only pain activates the dACC.

Selectivity (weak): The dACC is selective for pain, if pain is a more reliable source of dACC activation than the other terms of interest (executive, conflict, salience).

They suggest that I accused them of claiming ‘strong’ selectivity when they were really just making the much weaker claim that dACC activation is more strongly associated with pain than with other terms. I disagree with this characterization. I’ll come back to what I meant by ‘selective’ in a bit (I certainly didn’t assume anything like L&E’s strong definition). But first, let’s talk about L&E’s ‘weak’ notion of selectivity, which in my view is at odds with any common-sense understanding of what ‘selective’ means, and would have an enormously destructive effect on the field if it were to become widely used.

The fundamental problem with the suggestion that we can say dACC is pain-selective if “it’s a more reliable source of dACC activation than the other terms of interest” is that this definition provides a free pass for researchers to make selectivity claims about an extremely large class of associations, simply by deciding what is or isn’t of interest in any given instance. L&E claim to be “interested” in executive control, conflict, and salience. This seems reasonable enough; after all, these are certainly candidate functions that people have discussed at length in the literature. The problem lies with all the functions L&E don’t seem to be interested in: e.g., fear, autonomic control, or reward—three other processes that many researchers have argued the dACC is crucially involved in, and that demonstrably show robust effects in dACC in Neurosynth. If we take L&E’s definition of weak selectivity at face value, we find ourselves in the rather odd position of saying that one can use Neurosynth to claim that a region is “selective” for a particular function just as long as it’s differentiable from some other very restricted set of functions. Worse still, one apparently does not have to justify the choice of comparison functions! In their PNAS paper, L&E never explain why they chose to focus only on three particular ACC accounts that don’t show robust activation in dACC in Neurosynth, and ignored several other common accounts that do show robust activation.

If you think this is a reasonable way to define selectivity, I have some very good news for you. I’ve come up with a list of other papers that someone could easily write (and, apparently, publish in a high-profile journal) based entirely on results you can obtain from the Neurosynth website. The titles of these papers (and you could no doubt come up with many more) include:

  • “The TPJ is selective for theory of mind”
  • “The TPJ is selective for biological motion”
  • “The anterior insula is selective for inhibition”
  • “The anterior insula is selective for orthography”
  • “The VMPFC is selective for autobiographical memory”
  • “The VMPFC is selective for valuation”
  • “The VMPFC is selective for autonomic control”
  • “The dACC is selective for fear”
  • “The dACC is selective for autonomic control”
  • “The dACC is selective for reward”

These are all interesting-sounding articles that I’m sure would drum up considerable interest and controversy. And the great thing is, as long as you’re careful about what you find “interesting” (and you don’t have to explicitly explain yourself in the paper!), Neurosynth will happily support all of these conclusions. You just need to make sure not to include any comparison terms that don’t fit with your story. So, if you’re writing a paper about the VMPFC and valuation, make sure you don’t include autobiographical memory as a control. And if you’re writing about theory of mind in the TPJ, it’s probably best to not find biological motion interesting.

Now, you might find yourself thinking, “how could it make sense to have multiple people write different papers using Neurosynth, each one claiming that a given region is ‘selective’ for a variety of different processes? Wouldn’t that sort of contradict any common-sense understanding of what the term ‘selective’ means?” My own answer would be “yes, yes it would”. But L&E’s definition of “weak selectivity”—and the procedures they use in their paper—allow for multiple such papers to co-exist without any problem. Since what counts as an “interesting” comparison condition is subjective—and, if we take L&E’s PNAS example as a model, one doesn’t even need to explicitly justify the choices one makes—there’s really nothing stopping anyone from writing any of the papers I suggested above. Following L&E’s logic, a researcher who favored a fear-based account of dACC could simply select two or three alternative processes as comparison conditions—say, sustained attention and salience—do all of the same analyses L&E did (pretending for the moment that those analyses are valid, which they aren’t), and conclude that the dACC is selective for fear. It really is that easy.
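The logic of the last few paragraphs can be sketched in a few lines. The association scores below are entirely made up for illustration (they are not Neurosynth values), and `weakly_selective` is a hypothetical helper that encodes only L&E’s ‘weak’ definition: the target beats whatever comparison terms the author happens to find interesting.

```python
# ASSUMPTION: invented association strengths for a single dACC voxel.
HYPOTHETICAL_DACC_SCORES = {
    "pain": 0.80,
    "fear": 0.75,
    "reward": 0.72,
    "autonomic control": 0.70,
    "conflict": 0.30,
    "salience": 0.25,
    "executive": 0.20,
    "sustained attention": 0.15,
}


def weakly_selective(scores, target, terms_of_interest):
    """True if the target outscores every term the author chose to compare against."""
    return all(scores[target] > scores[term] for term in terms_of_interest)


# Three different "papers", each picking its own convenient comparison set.
papers = [
    ("pain", ["executive", "conflict", "salience"]),
    ("fear", ["sustained attention", "salience"]),
    ("reward", ["conflict", "executive"]),
]

for target, comparisons in papers:
    verdict = weakly_selective(HYPOTHETICAL_DACC_SCORES, target, comparisons)
    print(f"dACC 'selective' for {target} vs. {comparisons}: {verdict}")
```

All three checks pass on the same data, which is the problem: under the weak definition the verdict is driven by the choice of comparison terms rather than by anything about the region.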

In reality, if L&E came across an article claiming that Neurosynth shows that the dACC is selective for fear, I doubt they’d say “well, I guess the dACC is selective for fear. Good to know.” I suspect they would (quite reasonably) take umbrage at the fear paper’s failure to include pain as a comparison condition in the analysis. Yet, by their own standards, they’d have no real basis for any complaint. The fear paper’s author could simply say, “pain’s not interesting to me,” and that would be that. No further explanation necessary.

Perhaps out of recognition that there’s something a bit odd about their definition of selectivity, L&E try to prime our intuition that their usage is consistent with the rest of the field. They point out that, in most experimental fMRI studies claiming evidence for selectivity, researchers only ever compare the target stimulus or process to a small number of candidates. For example, they cite a Haxby commentary on a paper that studied category specificity in visual cortex:

From Haxby (2006): “numerous small spots of cortex were found that respond with very high selectivity to faces. However, these spots were intermixed with spots that responded with equally high selectivity to the other three categories.”

Their point is that nobody expects ‘selective’ here to mean that the voxel in question responds to only that visual category and no other stimulus that could conceivably have been presented. In practice, people take ‘selective’ to mean “showed a greater response to the target category than to other categories that were tested”.

I agree with L&E that Haxby’s usage of the term ‘selective’ here is completely uncontroversial. The problem is, the study in question is a lousy analogy for L&E’s PNAS paper. A much better analogy would be a study that presented 10 visual categories to participants, but then made a selectivity claim in the paper’s title on the basis of a comparison between the target category and only 2 other categories, with no explanation given for excluding the other 7 categories, even though (a) some of those 7 categories were well known to also be associated with the same brain region, and (b) strong activation in response to some of those excluded categories was clearly visible in a supplementary figure. I don’t know about L&E, but I’m pretty sure that, presented with such a paper, the vast majority of cognitive neuroscientists would want to say something like, “how can you seriously be arguing that this part of visual cortex responds selectively to spheres, when you only compared spheres with faces and houses in the main text, and your supplemental figure clearly shows that the same region responds strongly to cubes and pyramids as well? Shouldn’t you maybe be arguing that this is a region specialized for geometric objects, if anything?” And I doubt anyone would be very impressed if the authors’ response to this critique was “well, it doesn’t matter what else we’re not focusing on in the paper. We said this region is sphere-selective, which just means it’s more selective than a couple of other stimulus categories people have talked about. Pyramids and cubes are basically interchangeable with spheres, right? What more do you want from us?”

I think it’s clear that there’s no basis for making a claim like “the dACC is selective for pain” when one knows full well that at least half a dozen other candidate functions all reliably activate the dACC. As I noted in my original post, the claim is particularly egregious in this case, because it’s utterly trivial to generate a ranked list of associations for over 3,000 different terms in Neurosynth. So it’s not even as if one needs to think very carefully about which conditions to include in one’s experiment, or to spend a lot of time running computationally intensive analyses. L&E were clearly aware that a bunch of other terms also activated dACC; they briefly noted as much in the Discussion of their paper. What they didn’t explain is why this observation didn’t lead them to seriously revise their framing. Given what they knew, there were at least two alternative articles they could have written that wouldn’t have violated common sense understanding of what the term ‘selective’ means. One might have been titled something like “Heterogeneous aspects of dACC are preferentially associated with pain, autonomic control, fear, reward, negative affect, and conflict monitoring”. The other might have been titled “the dACC is preferentially associated with X-related processes”—where “X” is some higher-order characterization that explains why all of these particular processes (and not others) are activated in dACC. I have no idea whether either of these papers would have made it through peer review at PNAS (or any other journal), but at the very least they wouldn’t have been flatly contradicted by Neurosynth results.

To be fair to L&E, while they didn’t justify their exclusion of terms like fear and autonomic control in the PNAS paper, they did provide some explanation in their reply to my last post. Here’s what they say:

TY criticizes us several times for not focusing on other accounts of the dACC including fear, emotion, and autonomic processes. We agree with TY that these kind of processes are relevant to dACC function. Indeed, we were writing about the affective functions of dACC (Eisenberger & Lieberman, 2004) when the rest of the field was saying that the dACC was purely for cognitive processes (Bush, Luu, & Posner, 2000). We have long posited that one of the functions of the dACC was to sound an alarm when certain kinds of conflict arise. We think the dACC is evoked by a variety of distress-related processes including pain, fear, and anxiety. As Eisenberger (2015) wrote: “Interestingly, the consistency with which the dACC is linked with fear and anxiety is not at odds with a role for this region in physical and social pain, as threats of physical and social pain are key elicitors of fear and anxiety.” And the outputs of this alarm process are partially autonomic in nature. Thus, we don’t think of fear and autonomic accounts as in opposition to the pain account, but rather in the same family of explanations. We think this class of dACC explanations stands in contrast to the cognitive explanations that we did compare to (executive, conflict, salience). Most of this, and what is said below, is discussed in Naomi Eisenberger’s (2015) Annual Review chapter.

Essentially, their response is: “it didn’t make sense for us to include fear or autonomic control, because these functions are compatible with the underlying role we think the dACC is playing in pain”. This is not compelling, for three reasons. First, it’s a bait-and-switch. L&E’s paper isn’t titled “the dACC is selective for a family of distress-related processes”, it’s titled “the dACC is selective for pain”. One cannot publish a paper purporting to show that the dACC is selective for pain, and arguing that pain is the single best psychological characterization of its role in cognition, and then, in a section of their Discussion that they admit is the “most speculative” part of the paper, essentially say, “just kidding–we don’t think it’s really doing pain per se, we think it’s a much more general set of functions. But we don’t have any real evidence for that.”

Second, it’s highly uncharitable for L&E to spontaneously lump alternative accounts of dACC function like fear/avoidance, autonomic control, and bodily orientation in with their general “distress-related” account, because proponents of many alternative views of dACC function have been very explicit in saying that they don’t view these functions as fundamentally affective (e.g., Vogt and colleagues view posterior dACC as a premotor region). While L&E may themselves believe that pain, fear, and autonomic control in dACC all reflect some common function, that’s an extremely strong claim that requires independent evidence, and is not something that they’re entitled to simply assume. A perfectly sensible alternative is that these are actually dissociable functions with only partially overlapping spatial representations in dACC. Since the terms themselves are distinct in Neurosynth, that should be L&E’s operating assumption until they provide evidence for their stronger claim that there’s some underlying commonality. Nothing about this conclusion simply falls out of the data in advance.

Third, let me reiterate the point I made above about L&E’s notion of ‘weak selectivity’: if we take at face value L&E’s claim that fear and autonomic control don’t need to be explicitly considered because they could be interpreted alongside pain under a common account, then they’re effectively conceding that it would have made just as much sense to publish a paper titled “the dACC is selective for fear” or “the dACC is selective for autonomic control” that relegated the analysis of the term “pain” to a supplementary figure. In the paper’s body, you would find repeated assertions that the authors have shown that autonomic control is the “best general psychological account of dACC function”. When pressed as to whether this was a reasonable conclusion, the authors would presumably defend their decision to ignore pain as a viable candidate by saying things like, “well, sure pain also activates the dACC; everyone knows that. But that’s totally consistent with our autonomic control account, because pain produces autonomic outputs! So we don’t need to consider that explicitly.”

I confess to some skepticism that L&E would simply accept such a conclusion without any objection.

Before moving on, let me come full circle and offer a definition of selectivity that I think is much more workable than either of the ones L&E propose, and is actually compatible with the way people use the term ‘selective’ more broadly in the field:

Selectivity (realistic): A brain region can be said to be ‘selective’ for a particular function if it (i) shows a robust association with that function, (ii) shows a negligible association with all other readily available alternatives, and (iii) the authors have done due diligence in ensuring that the major candidate functions proposed in the literature are well represented in their analysis.

Personally, I’m not in love with this definition. I think it still allows researchers to make claims that are far too strong in many cases. And it still allows for a fair amount of subjectivity in determining what gets to count as a suitable control—at least in experimental studies where researchers necessarily have to choose what kinds of conditions to include. But I think this definition is more or less in line with the way most cognitive neuroscientists expect each other to use the term. It captures the fact that most people would feel justifiably annoyed if someone reported a “selective” effect in one condition while failing to acknowledge that 4 other unreported conditions showed the same effect. And it also captures the notion that researchers should be charitable to each other: if I publish a paper claiming that the so-called fusiform ‘face’ area is actually selective for houses, based on a study that completely failed to include a face condition, no one is going to take my claim of house selectivity seriously. Instead, they’re going to conclude that I wasn’t legitimately engaging with other people’s views.

In the context of Neurosynth, where one has 3,000 individual terms or several hundred latent topics at one’s disposal, this definition makes it very clear that researchers who want to say that a region is selective for something have an obligation to examine the database comprehensively, and not just to cherry-pick a couple of terms for analysis. That is what I meant when I said that L&E need to show that “virtually nothing else activates dACC”. I wasn’t saying that they have to show that no other conceivable process reliably activates the dACC (which would be impossible, as they observe), but simply that they need to show that no non-synonymous terms in the Neurosynth database do. I stand by this assertion. I see no reason why anyone should accept a claim of selectivity based on Neurosynth data if just a minute or two of browsing the Neurosynth website provides clear-cut evidence that plenty of other terms also reliably activate the same region.
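As a rough sketch of what criteria (i) and (ii) of the realistic definition ask for, the check below scans every available term rather than a hand-picked few. The scores, the two thresholds, and the synonym list are all invented for illustration, and `realistically_selective` is a hypothetical helper, not part of any Neurosynth API. Criterion (iii), due diligence about which candidate functions are represented, is a matter of judgment and is not something code like this can verify.

```python
def realistically_selective(scores, target, synonyms=(), robust=0.6, negligible=0.4):
    """Check criteria (i) and (ii) against ALL terms in `scores`.

    Returns (is_selective, rivals), where rivals lists every non-synonymous
    term whose association with the region is not negligible.
    """
    if scores[target] < robust:
        return False, []  # criterion (i) fails: no robust association
    rivals = sorted(
        term
        for term, score in scores.items()
        if term != target and term not in synonyms and score > negligible
    )
    return not rivals, rivals  # criterion (ii): there must be no rivals


# ASSUMPTION: made-up association strengths, not real Neurosynth output.
hypothetical_dacc = {
    "pain": 0.80, "painful": 0.78, "fear": 0.75, "reward": 0.72,
    "autonomic control": 0.70, "conflict": 0.30, "salience": 0.25,
    "executive": 0.20,
}
hypothetical_face_patch = {"faces": 0.90, "houses": 0.05, "tools": 0.10}

print(realistically_selective(hypothetical_dacc, "pain", synonyms=("painful",)))
print(realistically_selective(hypothetical_face_patch, "faces"))
```

In this toy data the face patch passes, while the dACC check fails and names the rival terms that a comprehensive look at the database would have turned up.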

To sum up, nothing L&E say in their paper gives us any reason to think that the dACC is selective for pain (even if we were to ignore all the problems with their understanding of reverse inference and allow them to claim selectivity based on inappropriate statistical tests). I submit that no definition of ‘selective’ that respects common sense usage of the term, and is appropriately charitable to other researchers, could possibly have allowed L&E to conclude that dACC activity is “selective” for pain when they knew full well that fear, autonomic control, and reward all also reliably activated the dACC in Neurosynth.

Everything else

Having focused on what I view as the two overarching issues raised by L&E’s reply, I now turn to comprehensively addressing each of their specific claims. As I noted at the outset, I recognize this is going to make for slow reading. But I want to make sure I address L&E’s points clearly and comprehensively, as I feel that they blatantly mischaracterized what I said in my original post in many cases. I don’t actually recommend that anyone read this entire section linearly. I’m writing it primarily as a reference—so that if you think there were some good points L&E made in their reply to my original post, you can find those points by searching for the quote, and my response will be directly below.

Okay, let’s begin.

Tal Yarkoni (hereafter, TY), the creator of Neurosynth, has now posted a blog (here) suggesting that pretty much all of our claims are either false, trivial, or already well-known. While this response was not unexpected, it’s disappointing because we love Neurosynth and think it’s a powerful tool for drawing exactly the kinds of conclusions we’ve drawn.

I’m surprised to hear that my response was not unexpected. This would seem to imply that L&E had some reason to worry that I wouldn’t approve of the way they were using Neurosynth, which leads me to wonder why they didn’t solicit my input ahead of time.

While TY is the creator of Neurosynth, we don’t think that means he has the last word when it comes to what is possible to do with it (nor does he make this claim). In the end, we think there may actually be a fair bit of agreement between us and TY. We do think that TY has misunderstood some of our claims (section 1 below) and failed to appreciate the significance and novelty of our actual claims (sections 2 and 4). TY also thinks we should have used different statistical analyses than we did, but his critique assumes we had a different question than the one we really had (section 5).

I agree that I don’t have the last word, and I encourage readers to consider both L&E’s arguments and mine dispassionately. I don’t, however, think that there’s a fair bit of agreement between us. Nor do I think I misunderstood L&E’s claims or failed to appreciate their significance or novelty. And, as I discuss at length both above and below, the problem is not that L&E are asking a different question than I think, it’s that they don’t understand that the methods they’re using simply can’t speak to the question they say they’re asking.

1. Misunderstandings (where we sort of probably agree)

We think a lot of the heat in TY’s blog comes from two main misunderstandings of what we were trying to accomplish. The good news (and we really hope it is good news) is that ultimately, we may actually mostly agree on both of these points once we get clear on what we mean. The two issues have to do with the use of the term “selective” and then why we chose to focus on the four categories we did (pain, executive, conflict, salience) and not others like fear and autonomic.

Misunderstanding #1: Selectivity. Regarding the term selective, I suppose we could say there’s a strong form and a weak form of the word…

I’ve already addressed this in detail at the beginning of this post, so I’ll skip the next few paragraphs and pick up here:

We mean this in the same way that Haxby and lots of others do. We never give a technical definition of selectivity in our paper, though in the abstract we do characterize our results as follows:

“Results clearly indicated that the best psychological description of dACC function was related to pain processing—not executive, conflict, or salience processing.“

Thus, the context of what comparisons our selectivity refers to is given in the same sentence, right up front in the abstract. In the end, we would have been just as happy if “selectivity” in the title was replaced with “preferentially activated”. We think this is what the weak form of selectivity entails and it is really what we meant. We stress again, we are not familiar with researchers who use the strong form of selectivity. TY’s blog is the first time we have encountered this and was not what we meant in the paper.

I strongly dispute L&E’s suggestion that the average reader will conclude from the above sentence that they’re clearly analyzing only 4 terms. Here’s the sentence in their abstract that directly precedes the one they quote:

Using Neurosynth, an automated brainmapping database [of over 10,000 functional MRI (fMRI) studies], we performed quantitative reverse inference analyses to explore the best general psychological account of the dACC function P(Ψ process | dACC activity).

It seems quite clear to me that the vast majority of readers are going to parse the title and abstract of L&E’s paper as implying a comprehensive analysis to find the best general psychological account of dACC function, and not “the best general psychological account if you only consider these 4 very specific candidates”. Indeed, I have trouble making any sense of the use of the terms “best” and “general” in this context, if what L&E meant was “a very restricted set of possibilities”. I’ll also note that in five minutes of searching the literature, I couldn’t find any other papers with titles or abstracts that make nearly as strong a claim about anterior cingulate function as L&E’s present claims about pain. So I reject the idea that their usage is par for the course. Still, I’m happy to give them the benefit of the doubt and accept that they truly didn’t realize that their wording might lead others to misinterpret their claims. I guess the good news is that, now that they’re aware of the potential confusion claims like this can cause, they will surely be much more circumspect in the titles and abstracts of their future papers.

Before moving on, we want to note that in TY’11 (i.e. the Yarkoni et al., 2011 paper announcing Neurosynth), the weak form of selectivity is used multiple times. In the caption for Figure 2, the authors refer to “regions in c were selectively associated with the term” when as far as we can tell, they are talking only about the comparison of three terms (working memory, emotion, pain). Similarly on p. 667 the authors write “However, the reverse inference map instead implicated the anterior prefrontal cortex and posterior parietal cortex as the regions that were most selectively activated by working memory tasks.” Here again, the comparison is to emotion and pain, and the authors are not claiming selectivity relative to all other psychological processes in the Neurosynth database. If it is fair for Haxby, Botvinick, and the eminent coauthors of TY’11 to use selectivity in this manner, we think it was fine for us as well.

I reject the implication of equivalence here. I think the scope of the selectivity claim I made in the figure caption in question is abundantly clear from the immediate context, and provides essentially no room for ambiguity. Who would expect, in a figure with 3 different maps, the term ‘selective’ to mean anything other than ‘for this one and not those two’? I mean, if L&E had titled their paper “pain preferentially activates the dACC relative to conflict, salience, or executive control”, and avoided saying that they were proposing the “best general account” of psychological function in dACC, I wouldn’t have taken issue with their use of the term ‘selective’ in their manuscript either, because the scope would have been equally clear. Conversely, if I had titled my 2011 paper “the dACC shows no selectivity for any cognitive process”, and said, in the abstract, something like “we show that there is no best general psychological function of the dACC–not pain, working memory, or emotion”, I would have fully expected to receive scorn from others.

That said, I’m willing to put my money where my mouth is. If a few people (say 5) write in to say (in the comments below, on Twitter, or by email) that they took the caption in Figure 2 of my 2011 paper to mean anything other than “of these 3 terms, only this one showed an effect”, I’ll happily send the journal a correction. And perhaps L&E could respond in kind by committing to changing the title of their manuscript to something like “the dACC is preferentially active for pain relative to conflict, salience or executive control” if 5 people write in to say that they interpreted L&E’s claims as being much more global than L&E suggest they are. I encourage readers to use the comments below to clarify how they understood both of these selectivity claims.

We would also point readers to the fullest characterization of the implication of our results on p. 15253 of the article:

“The conclusion from the Neurosynth reverse inference maps is unequivocal: The dACC is involved in pain processing. When only forward inference data were available, it was reasonable to make the claim that perhaps dACC was not involved in pain per se, but that pain processing could be reduced to the dACC’s “real” function, such as executive processes, conflict detection, or salience responses to painful stimuli. The reverse inference maps do not support any of these accounts that attempt to reduce pain to more generic cognitive processes.”

We think this claim is fully defensible and nothing in TY’s blog contradicts this. Indeed, he might even agree with it.

This claim does indeed seem to me largely unobjectionable. However, I’m at a loss to understand how the reader is supposed to know that this one very modest sentence represents “the fullest characterization” of the results in a paper replete with much stronger assertions. Is the reader supposed to, upon reading this sentence, retroactively ignore all of the other claims—e.g., the title itself, and L&E’s repeated claim throughout the paper that “the best psychological interpretation of dACC activity is in terms of pain processes”?

Misunderstanding #2: We did not focus on fear, emotion, and autonomic accounts. TY criticizes us several times for not focusing on other accounts of the dACC including fear, emotion, and autonomic processes. We agree with TY that these kind of processes are relevant to dACC function. Indeed, we were writing about the affective functions of dACC (Eisenberger & Lieberman, 2004) when the rest of the field was saying that the dACC was purely for cognitive processes (Bush, Luu, & Posner, 2000). We have long posited that one of the functions of the dACC was to sound an alarm when certain kinds of conflict arise. We think the dACC is evoked by a variety of distress-related processes including pain, fear, and anxiety. As Eisenberger (2015) wrote: “Interestingly, the consistency with which the dACC is linked with fear and anxiety is not at odds with a role for this region in physical and social pain, as threats of physical and social pain are key elicitors of fear and anxiety.” And the outputs of this alarm process are partially autonomic in nature. Thus, we don’t think of fear and autonomic accounts as in opposition to the pain account, but rather in the same family of explanations. We think this class of dACC explanations stands in contrast to the cognitive explanations that we did compare to (executive, conflict, salience). Most of this, and what is said below, is discussed in Naomi Eisenberger’s (2015) Annual Review chapter.

I addressed this in detail above, in the section on “selectivity”.

We speak to some but not all of this in the paper. On p. 15254, we revisit our neural alarm account and write “Distress-related emotions (“negative affect” “distress” “fear”) were each linked to a dACC cluster, albeit much smaller than the one associated with “pain”.” While we could have said more explicitly that pain is in this distress-related category, we have written about this several times before and assumed this would be understood by readers.

There is absolutely no justification for assuming this. The community of people who might find a paper titled “the dorsal anterior cingulate cortex is selective for pain” interesting is surely at least an order of magnitude larger than the community of people who are familiar with L&E’s previous work on distress-related emotions.

So why did we focus on executive, conflict, and salience? Like most researchers, we are the products of our early (academic) environment. When we were first publishing on social pain, we were confused by the standard account of dACC function. A half century of lesion data and a decade of fMRI studies of pain pointed towards more evidence of the dACC’s involvement in distress-related emotions (pain & anxiety), yet every new paper about the dACC’s function described it in cognitive terms. These cognitive papers either ignored all of the pain and distress findings for dACC or they would redescribe pain findings as reducible to or just an instance of something more cognitive.

When we published our first social pain paper, the first rebuttal paper suggested our effects were really just due to “expectancy violation” (Somerville et al., 2006), an account that was later invalidated (Kawamoto 2012). Many other cognitive accounts have also taken this approach to physical pain (Price 2000; Vogt, Derbyshire, & Jones, 2006).

Thus for us, the alternative to pain accounts of dACC all these years were conflict detection and cognitive control explanations. This led to the focus on the executive and conflict-related terms. In more recent years, several papers have attempted to explain away pain responses in the dACC as nothing more than salience processes (e.g., Iannetti’s group) that have nothing to do with pain, and so salience became a natural comparison as well. We haven’t been besieged with papers saying that pain responses in the dACC are “nothing but” fear or “nothing but” autonomic processes, so those weren’t the focus of our analyses.

This is an informative explanation of L&E’s worldview and motivations. But it doesn’t justify ignoring numerous alternative accounts whose proponents very clearly don’t agree with L&E that their views can be explained away as “distress-related”. If L&E had written a paper titled “salience is not a good explanation of dACC function,” I would have happily agreed with their conclusion here. But they didn’t. They wrote a paper explicitly asserting that pain is the best psychological characterization of the dACC. They’re not entitled to conclude this unless they compare pain properly with a comprehensive set of other possible candidates, not just the ones that make pain look favorable.

We want to comment further on fear specifically. We think one of the main reasons that fear shows up in the dACC is because so many studies of fear use pain manipulations (i.e. shock administration) in the process of conditioning fear responses. This is yet another reason that we were not interested in contrasting pain and fear maps. That said, if we do compare the Z-scores in the same eight locations we used in the PNAS paper, the pain effect has more accumulated evidence than fear in all seven locations where there is any evidence for pain at all.

This is a completely speculative account, and no evidence is provided for it. Worse, it’s completely invertible: one could just as easily say that pain shows up in the dACC because it invariably produces fear, or because it invariably elicits autonomic changes (frankly, it seems more plausible to me that pain almost always generates fear than that fear is almost always elicited by pain). There’s no basis for ruling out these other candidate functions a priori as being more causally important. This is simply question-begging.

It’s interesting to us that TY does not in principle seem to like us trying to generate some kind of unitary account of dACC writing “There’s no reason why nature should respect our human desire for simple, interpretable models of brain function.” Yet, TY then goes on to offer a unitary account more to his liking. He highlights Vogt’s “four-region” model of the cingulate writing “I’m especially partial to the work of Brent Vogt…”. In Vogt’s model, the aMCC appears to be largely the same region as what we are calling dACC. Although the figure shown by TY doesn’t provide anatomical precision, in other images, Vogt shows the regions with anatomical boundaries. Rotge et al. (2015) used such an image from Vogt (2009) to estimate the boundaries of aMCC as spanning 4.5 ≤ y ≤ 30 which is very similar to our dACC anterior/posterior boundaries of 0 ≤ y ≤ 30 (see Figure below). Vogt ascribes the function of avoidance behavior to this region – a pretty unitary description of the region that TY thinks we should avoid unitary descriptions of.

There is no charitable way to put it: this is nothing short of a gross misrepresentation of what I said about the Vogt account. As a reminder, here’s what I actually wrote in my post:

I’m especially partial to the work of Brent Vogt and colleagues (e.g., Vogt (2005); Vogt & Sikes, 2009), who have suggested a division within the anterior mid-cingulate cortex (aMCC; a region roughly co-extensive with the dACC in L&E’s nomenclature) between a posterior region involved in bodily orienting, and an anterior region associated with fear and avoidance behavior (though the two functions overlap in space to a considerable degree) … the Vogt characterization of dACC/aMCC … fits almost seamlessly with the Neurosynth results displayed above (e.g., we find MCC activation associated with pain, fear, autonomic, and sensorimotor processes, with pain and fear overlapping closely in aMCC). Perhaps most importantly, Vogt and colleagues freely acknowledge that their model—despite having a very rich neuroanatomical elaboration—is only an approximation. They don’t attempt to ascribe a unitary role to aMCC or dACC, and they explicitly recognize that there are distinct populations of neurons involved in reward processing, response selection, value learning, and other aspects of emotion and cognition all closely interdigitated with populations involved in aspects of pain, touch, and fear. Other systems-level neuroanatomical models of cingulate function share this respect for the complexity of the underlying circuitry—complexity that cannot be adequately approximated by labeling the dACC simply as a pain region (or, for that matter, a “survival-relevance“ region).

I have no idea how L&E read this and concluded that I was arguing that we should simply replace the label “pain” with “fear”. I don’t feel the need to belabor the point further, because I think what I wrote is quite clear.

In the end though, if TY prefers a fear story to our pain story, we think there is some evidence for both of these (a point we make in our PNAS paper). We think they are in a class of processes that overlap both conceptually (i.e. distress-related emotions) and methodologically (i.e. many fear studies use pain manipulations to condition fear).

No, I don’t prefer a fear story. My view (which should be abundantly clear from the above quote) is that both a fear story and a pain story would be gross oversimplifications that shed more heat than light. I will, however, reiterate my earlier point (which L&E never responded to), which is that their PNAS paper provides no reason at all to think that the dACC is involved in distress-related emotion (indeed, they explicitly said that this was the most speculative part of the paper). If anything, the absence of robust dACC activation for terms like ‘disgust’, ’emotion’, and ‘social’ would seem to me like pretty strong evidence against a simplistic model of this kind. I’m not sure why L&E are so resistant to the idea that maybe, just maybe, the dACC is just too big a region to attach a single simple label to. As far as I can tell, they provide no defense of this assumption in either their paper or their reply.

After focusing on potential misunderstandings we want to turn to our first disagreement with TY. Near the end of his blog, TY surprised us by writing that the following conclusions can be reasonably drawn from Neurosynth analyses:

* “There are parts of dACC (particularly the more posterior aspects) that are preferentially activated in studies involving painful stimulation.”
* “It’s likely that parts of dACC play a greater role in some aspect of pain processing than in many other candidate processes that at various times have been attributed to dACC (e.g., monitoring for cognitive conflict)”

Our first response was ‘Wow. After pages and pages of criticizing our paper, TY pretty much agrees with what we take to be the major claims of our paper. Yes, his version is slightly watered down from what we were claiming, but these are definitely in the ballpark of what we believe.’

L&E omitted my third bullet point here, which was that “Many of the same regions of dACC that preferentially activate during pain are also preferentially activated by other processes or tasks—e.g., fear conditioning, autonomic arousal, etc.” I’m not sure why they left it out; they could hardly disagree with it either, if they want to stand by their definition of “weak selectivity”.

I’ll leave it to you to decide whether or not my conclusions are really just “watered down” versions “in the ballpark” of the major claims L&E make in their paper.

But then TY’s next statement surprised us in a different sort of way. He wrote

“I think these are all interesting and potentially important observations. They’re hardly novel…”.

We’ve been studying the dACC for more than a decade and wondered what he might have meant by this. We can think of two alternatives for what he might have meant:

* That L&E and a small handful of others have made this claim for over a decade (but clearly not with the kind of evidence that Neurosynth provides).

* That TY already used Neurosynth in 2011 to show this. In the blog, he refers to this paper writing “We explicitly noted that there is preferential activation for pain in dACC”.

I’m not sure what was confusing about what I wrote. Let’s walk through the three bullet points. The first one is clearly not novel. We’ve known for many years that many parts of dACC are preferentially active when people experience painful stimulation. As I noted in my last post, L&E explicitly appealed to this literature over a decade ago in their 2003 social pain paper. The second one is also clearly not novel. For example, Vogt and colleagues (among others) have been arguing for at least two decades now that the posterior aspects of dACC support pain processing in virtue of their involvement in processes (e.g., bodily orientation) that clearly preclude most higher cognitive accounts of dACC. The third claim isn’t novel either, as there has been ample evidence for at least a decade now that virtually every part of dACC that responds to painful stimulation also systematically responds to other non-nociceptive stimuli (e.g., the posterior dACC responds to non-painful touch, the anterior to reward, etc.). I pointed to articles and textbooks comprehensively reviewing this literature in my last post. So I don’t understand L&E’s surprise. Which of these three claims do they think is actually novel to their paper?

In either case, “they’re hardly novel” implies this is old news and that everyone knows and believes this, as if we’re claiming to have discovered that most people have two eyes, a nose, and a mouth. But this implication could not be further from the truth.

No, that’s not what “hardly novel” implies. I think it’s fair to say that the claim that social pain is represented in the dACC in virtue of representations shared with physical pain is also hardly novel at this point, yet few people appear to know and believe it. I take ‘hardly novel’ to mean “it’s been said before multiple times in the published literature.”

There is a 20+ year history of researchers ignoring or explaining away the role of pain processing in dACC.

I’ll address the “explained away” part of this claim below, but it’s completely absurd to suggest that researchers have ignored the role of pain processing in dACC for 20 years. I don’t think I can do any better than link to Google Scholar, where the reader is invited to browse literally hundreds of articles that all take it as an established finding that the dACC is important for pain processing (and many of which have hundreds of citations from other articles).

When pain effects are mentioned in most papers about the function of dACC, it is usually to say something along the lines of ‘Pain effects in the dACC are just one manifestation of the broader cognitive function of conflict detection (or salience or executive processes)’. This long history is indisputable. Here are just a few examples (and these are all reasonable accounts of dACC function in the absence of reverse inference data):

* Executive account: Price’s 2000 Science paper on the neural mechanisms of pain assigns to the dACC the roles of “directing attention and assigning response priorities”
* Executive account: Vogt et al. (1996) says the dACC “is not a ‘pain centre’” and “is involved in response selection” and “response inhibition or visual guidance of responses”
* Conflict account: Botvinick et al. (2004) wrote that “the ACC might serve to detect events or internal states indicating a need to shift the focus of attention or strengthen top-down control ([4], see also [20]), an idea consistent, for example, with the fact that the ACC responds to pain” (Botvinick et al. 2004)
* Salience account: Iannetti suggests the ‘pain matrix’ is a myth and in Legrain et al. (2011) suggests that the dACC’s responses to pain “could mainly reflect brain processes that are not directly related to the emergence of pain and that can be engaged by sensory inputs that do not originate from the activation of nociceptors.”

I’m not really sure what to make of this argument either. All of these examples clearly show that even proponents of other theories of dACC function are well aware of the association with pain, and don’t dispute it in any way. So L&E’s objection can’t be that other people just don’t believe that the dACC supports pain processing. Instead, L&E seem to dislike the idea that other theorists have tried to “explain away” the role of dACC in pain by appealing to other mechanisms. Frankly, I’m not sure what the alternative to such an approach could possibly be. Unless L&E are arguing that dACC is the neural basis of an integrated, holistic pain experience (whatever such a thing might mean), there presumably must be some specific computational operations going on within dACC that can be ascribed a sensible mechanistic function. I mean, even L&E themselves don’t take the dACC to be just about, well, pain. Their whole “distress-related emotion” story is itself intended to explain what it is that dACC actually does in relation to pain (since pretty much everyone accepts that the sensory aspects of pain aren’t coded in dACC).

The only way I can make sense of this “explained away” concern is if what L&E are actually objecting to is the fact that other researchers have disagreed or ignored their particular story about what the dACC does in pain—i.e., L&E’s view that the dACC role in pain is derived from distress-related emotion. As best I can tell, what bothers them is that other researchers fundamentally disagree with–and hence, don’t cite–their “distress-related emotion” account. Now, maybe this irritation is justified, and there’s actually an enormous amount of evidence out there in favor of the distress account that other researchers are willfully ignoring. I’m not qualified to speak to that (though I’m skeptical). What I do feel qualified to say is that none of the Neurosynth results L&E present in their paper make any kind of case for an affective account of pain processing in dACC. The most straightforward piece of evidence for that claim would be if there were a strong overlap between pain and negative affect activations in dACC. But we just don’t see this in Neurosynth. As L&E themselves acknowledge, the peak sectors of pain-related activation in dACC are in mid-to-posterior dACC, and affect-related terms only seem to reliably activate the most anterior aspects.

To be charitable to L&E, I do want to acknowledge one valuable point that they contribute here, which is that it’s clear that dACC function cannot be comprehensively explained by, say, a salience account or a conflict monitoring account. I think that’s a nice point (though I gather that some people who know much more about anatomy than I do are in the process of writing rebuttals to L&E that argue it’s not as nice as I think it is). The problem is, this argument can be run both ways. Meaning, much as L&E do a nice job showing that conflict monitoring almost certainly can’t explain activations in posterior dACC, the very maps they show make it clear that pain can’t explain all the other activations in anterior dACC (for reward, emotion, etc.). Personally, I think the sensible conclusion one ought to take away from all this is “it’s really complicated, and we’re not going to be able to neatly explain away all of dACC function with a single tidy label like ‘pain’.” L&E draw a different conclusion.

But perhaps this approach to dACC function has changed in light of TY’11 findings (i.e. Yarkoni et al. 2011). There he wrote “For pain, the regions of maximal pain-related activation in the insula and DACC shifted from anterior foci in the forward analysis to posterior ones in the reverse analysis.” This hardly sounds like a resounding call for a different understanding of dACC that involves an appreciation of its preferential involvement in pain.

Right. It wasn’t a resounding call for a different understanding of dACC, because it wasn’t a paper about the dACC—a brain region I lack any deep interest in or knowledge of—it was a paper about Neurosynth and reverse inference.

Here are quotes from other papers showing how they view the dACC in light of TY’11:

* Poldrack (2012) “The striking insight to come from analyses of this database (Yarkoni et al., in press) is that some regions (e.g., anterior cingulate) can show high degrees of activation in forward inference maps, yet be of almost no use for reverse inference due to their very high base rates of activation across studies”
* Chang, Yarkoni et al. (2012) “the ACC tends to show substantially higher rates of activation than other regions in neuroimaging studies (Duncan and Owen 2000; Nelson et al. 2010; Yarkoni et al. 2011), which has lead some to conclude that the network is processing goal-directed cognition (Yarkoni et al. 2009)”
* Atlas & Wager (2012) “In fact, the regions that are reliably modulated (insula, cingulate, and thalamus) are actually not specific to pain perception, as they are activated by a number of processes such as interoception, conflict, negative affect, and response inhibition”

I won’t speak for papers I’m not an author on, but with respect to the quote from the Chang et al paper, I’m not sure what L&E’s point actually is. In Yarkoni et al. (2009), I argued that “effort” might be a reasonable generic way to characterize the ubiquitous role of the frontoparietal “task-positive” network in cognition. I mistakenly called the region in question ‘dACC’ when I should have said ‘preSMA’. I already gave L&E deserved credit in my last post for correcting my poor knowledge of anatomy. But I would think that, if anything, the fact that I was routinely confusing these terms circa 2011 should lead L&E to conclude that maybe I don’t know or care very much about the dACC, and not that I’m a proud advocate for a strong theory of dACC function that many other researchers also subscribe to. I think L&E give me far too much credit if they think that my understanding of the dACC in 2011 (or, for that matter, now) is somehow representative of the opinions of experts who study that region.

Perhaps the reason why people who cite TY’11 in their discussion of dACC didn’t pay much attention to the above quote from TY’11 (“For pain, the regions of maximal pain-related…”) was because they read and endorsed the following more direct conclusion that followed “…because the dACC is activated consistently in all of these states [cognitive control, pain, emotion], its activation may not be diagnostic of any one of them” (bracketed text added). If this last quote is taken as TY’11’s global statement regarding dACC function, then it strikes us still as quite novel to assert that the dACC is more consistently associated with one category of processes (pain) than others (executive, conflict, and salience processes).

I don’t think TY’11 makes any ‘global statement regarding dACC function’, because TY’11 was a methodological paper about the nature of reverse inference, not a paper about grand models of dACC function. As for the quote L&E reproduce, here’s the full context:

These results showed that without the ability to distinguish consistency from selectivity, neuroimaging data can produce misleading inferences. For instance, neglecting the high base rate of DACC activity might lead researchers in the areas of cognitive control, pain and emotion to conclude that the DACC has a key role in each domain. Instead, because the DACC is activated consistently in all of these states, its activation may not be diagnostic of any one of them and conversely, might even predict their absence. The NeuroSynth framework can potentially address this problem by enabling researchers to conduct quantitative reverse inference on a large scale.

I stand by everything I said here, and I’m not sure what L&E object to. It’s demonstrably true if you look at Figure 2 in TY’11 that pain, emotion, and cognitive control all robustly activate the dACC in the forward inference map, but not in the reverse inference maps. The only sense I can make of L&E’s comment is if they’re once again conflating z-scores with probabilities, and assuming that the presence of significant activation for pain means that dACC is in fact diagnostic for pain. But, as I showed much earlier in this post, that would betray a very deep misunderstanding of what the reverse inference maps generated by Neurosynth mean. There is absolutely no basis for concluding, in any individual study, that people are likely to be perceiving pain just because the dACC is active.
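
To make the gap between consistency and diagnosticity concrete, here is a minimal sketch of the Bayes' rule arithmetic involved. Every probability below is invented for illustration and is not a Neurosynth estimate, and the function name `posterior` is just a placeholder.

```python
def posterior(p_act_given_term, p_term, p_act_given_other):
    """P(term | activation) via Bayes' rule, splitting studies into term vs. other."""
    p_act = p_act_given_term * p_term + p_act_given_other * (1 - p_term)
    return p_act_given_term * p_term / p_act

# Hypothetical inputs (assumptions, not real data).
p_pain = 0.05  # assumed share of all studies that are about pain

# A region active in 60% of pain studies but also in 40% of all other
# studies, i.e. a region with a high base rate of activation.
high_base_rate = posterior(0.60, p_pain, 0.40)

# The same 60% hit rate for pain, but the region is rarely active otherwise.
low_base_rate = posterior(0.60, p_pain, 0.02)

print(f"high base rate: P(pain | active) = {high_base_rate:.3f}")
print(f"low base rate:  P(pain | active) = {low_base_rate:.3f}")
```

In both made-up scenarios the region is more likely to be active in pain studies than in other studies, so a statistical test could flag it for pain either way. Only in the low base rate scenario does seeing the region active make pain a likely explanation.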

In the article, we showed forward and reverse inference maps for 21 terms and then another 9 in the supplemental materials. These are already crowded busy figures and so we didn’t have room to show multiple slices for each term. Fortunately, since Neurosynth is easily accessible (go check it out now at neurosynth.org – it’s awesome!) you can look at anything we didn’t show you in the paper. Tal takes us to task for this.

He then shows a bunch of maps from x=-8 to x=+8 on a variety of terms. Many of these terms weren’t the focus of our paper because we think they are in the same class of processes as pain (as noted above). So it’s no surprise to us that terms such as ‘fear,’ ‘empathy,’ and ‘autonomic’ produce dACC reverse inference effects. In the paper, we reported that ‘reward’ does indeed produce reverse inference effects in the anterior portion of the dACC (and show the figure in the supplemental materials), so no surprise there either. Then at the bottom he shows cognitive control, conflict, and inhibition which all show very modest footprints in dACC proper, as we report in the paper.

Once again: L&E are not entitled to exclude a large group of viable candidate functions from their analysis simply because they believe that they’re “in the same class of [distress-related affect] processes” (a claim that many people, including me, would dispute). If proponents of the salience monitoring view wrote a Neurosynth-based paper neglecting to compare salience with pain because “pain is always salient, so it’s in the same class of salience-related processes”, I expect that L&E would not be very happy about it. They should show others the same charity they themselves would expect.

But in any case, if it’s not surprising to L&E that reward, fear, and autonomic control all activate the dACC, then I’m at a loss to understand why they didn’t title the paper something like “the dACC is selectively involved in pain, reward, fear, and autonomic control”. That would have much more accurately represented the results they report, and would be fully consistent with their notion of “weak selectivity”.

There are two things that make the comparison of what he shows and what we reported in the paper not a fair comparison. First, his maps are thresholded at p<.001 and yet all the maps that we report use Neurosynth’s standard, more conservative, FDR criterion of p<.01 (a standard TY literally set). Here, TY is making a biased, apples-to-oranges comparison by juxtaposing the maps at a much more liberal threshold than what we did. Given that each of the terms we were interested in (pain, executive, conflict, salience) had more than 200 studies in the database it’s not clear why TY moved from FDR to uncorrected maps here.

The reason I used a threshold of p < .001 for this analysis is because it’s what L&E themselves used:

In addition, we used a threshold of Z > 3.1, P < 0.001 as our threshold for indicating significance. This threshold was chosen instead of Neurosynth’s more strict false discovery rate (FDR) correction to maximize the opportunity for multiple psychological terms to “claim” the dACC.

This is a sensible thing to do here, because L&E are trying to accept the null of no effect (or at least, it’s more sensible than applying a standard, conservative correction). Accepting the null hypothesis because an effect fails to achieve significance is the cardinal sin of null hypothesis significance testing, so there’s no real justification for doing what L&E are trying to do. But if you are going to accept the null, it at least behooves you to use a very liberal threshold for your analysis. I’m not sure why it’s okay for L&E to use a threshold of p < .001 but not for me to do the same (and for what it’s worth, I think p < .001 is still an absurdly conservative cut-off given the context).
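
For readers who want to check how the Z and p values in that quote line up, here is a small sketch using only the Python standard library. It assumes a one-tailed test against a standard normal, which is an assumption on my part rather than something stated in the quoted passage; the helper name `one_tailed_p` is made up for this example.

```python
import math

def one_tailed_p(z):
    """Upper-tail probability of a standard normal distribution at z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z_threshold = 3.1  # the cut-off quoted from L&E's paper
p = one_tailed_p(z_threshold)

print(f"Z > {z_threshold} corresponds to one-tailed p < {p:.5f}")
```

Under that assumption the tail probability comes out just under .001, consistent with the two numbers being reported as a single threshold.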

Second, the Neurosynth database has been updated since we did our analyses. The number of studies in the database has only increased by about 5% (from 10,903 to 11,406 studies) and yet there are some curious changes. For instance, fear shows more robust dACC now than it did a few months ago even though it only increased from 272 studies to 298 studies.

Although the number of studies has nominally increased by only 5%, this actually reflects the removal of around 1,000 studies as a result of newer quality control heuristics, and the addition of around 1,500 new studies. So it should not be surprising if there are meaningful differences between the two. In any case, it seems odd for L&E to use the discrepancy between old and new versions of the database as a defense of their findings, given that the newer results are bound to be more accurate. If L&E accept that there’s a discrepancy, perhaps what they should be saying is “okay, since we used poorer data for our analyses than what Neurosynth currently contains, we should probably re-run our analyses and revise our conclusions accordingly”.
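
The turnover is easy to underestimate from the headline 5% figure, so here is the arithmetic spelled out as a short sketch. The totals come from L&E's reply; the removed and added counts are the rounded figures given above, so the resulting percentages are approximate.

```python
old_total = 10_903  # studies before the update (from L&E's reply)
new_total = 11_406  # studies after the update (from L&E's reply)
removed = 1_000     # approximate number of studies dropped by quality control
added = 1_500       # approximate number of newly added studies

net_growth = (new_total - old_total) / old_total
share_new = added / new_total        # portion of the current database that is new
share_dropped = removed / old_total  # portion of the old database no longer present

print(f"net growth:                      {net_growth:.1%}")
print(f"new studies in current database: {share_new:.1%}")
print(f"old studies dropped:             {share_dropped:.1%}")
```

On these rounded figures, roughly one study in eight in the current database was not in the version L&E used, and about one in eleven of the studies they used is gone, which is a much larger change than the net growth suggests.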

We were more surprised to discover that the term ‘rejection’ has been removed from the Neurosynth database altogether such that it can no longer be used as a term to generate forward and reverse inference maps (even though it was in the database prior to the latest update).

This claim is both incorrect and mildly insulting. It’s incorrect because the term “rejection” hasn’t been in the online Neurosynth database for nearly two years, and was actually removed three updates ago. And it’s mildly insulting, because all L&E had to do to verify the date at which rejection was removed, as well as understand why, was visit the Neurosynth data repository and inspect the different data releases. Failing that, they could have simply asked me for an explanation, instead of intimating that there are “curious” changes. So let me take this opportunity to remind L&E and other readers that the data displayed on the Neurosynth website are always archived on GitHub. If you don’t like what’s on the website at any given moment, you can always reconstruct the database based on an earlier snapshot. This can be done in just a few lines of Python code, as the IPython notebook I linked to last time illustrates.

As to why the term “rejection” disappeared: in April 2014, I switched from a manually curated set of 525 terms (which I had basically picked entirely subjectively) to the more comprehensive and principled approach of including all terms that passed a minimum frequency threshold (i.e., showing up in at least 60 unique article abstracts). The term “rejection” was not frequent enough to survive. I don’t make decisions about individual terms on a case-by-case basis (well, not since April 2014, anyway), and I certainly hope L&E weren’t implying that I pulled the ‘rejection’ term in response to their paper or any of their other work, because, frankly, they would be giving themselves entirely too much credit.
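
A frequency cut-off of this kind is simple to express in code. The sketch below is not the actual Neurosynth pipeline; the function name `frequent_terms`, the input format, and the toy counts are all assumptions made for illustration. Only the threshold of 60 unique abstracts comes from the description above.

```python
from collections import defaultdict

MIN_ABSTRACTS = 60  # minimum number of unique abstracts a term must appear in

def frequent_terms(abstract_terms, min_abstracts=MIN_ABSTRACTS):
    """Return the terms that occur in at least `min_abstracts` distinct abstracts.

    `abstract_terms` maps a (hypothetical) article ID to the list of terms
    found in that article's abstract.
    """
    articles_per_term = defaultdict(set)
    for article_id, terms in abstract_terms.items():
        for term in set(terms):  # count each term once per abstract
            articles_per_term[term].add(article_id)
    return {t for t, ids in articles_per_term.items() if len(ids) >= min_abstracts}

# Toy corpus with invented counts: "pain" appears in 80 abstracts,
# "rejection" in only 40 of them.
toy_corpus = {f"article_{i}": ["pain"] for i in range(80)}
for i in range(40):
    toy_corpus[f"article_{i}"].append("rejection")

kept = frequent_terms(toy_corpus)
print(sorted(kept))
```

The point of the design is that the rule is applied uniformly: a term survives or drops out based only on its count, with no per-term judgment involved.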

Anyway, since L&E seem concerned with the removal of ‘rejection’ from Neurosynth, I’m happy to rectify that for them. Here are two maps for the term “rejection” (both thresholded at voxel-wise p < .001, uncorrected):

Meta-analysis of “rejection” in Neurosynth (database version of May 2013, 33 studies).
Meta-analysis of “rejection” in Neurosynth (current database version, 58 studies).

The first map is from the last public release (March 2013) that included “rejection” as a feature, and is probably what L&E remember seeing on the website (though, again, it hasn’t been online since 2014). It’s based on 33 studies. The second map is the current version of the map, based on 52 studies. The main conclusion I personally would take away from both of these maps is that there’s not enough data here to say anything meaningful, because they’re both quite noisy and based on a small number of studies. This is exactly why I impose a frequency cut-off for all terms I put online.

That said, if L&E would like to treat these “rejection” analyses as admissible evidence, I think it’s pretty clear that these maps actually weigh directly against their argument. In both cases, we see activation in pain-related areas of dACC for the forward inference analysis but not for the reverse. Interestingly, we do see activation in the most anterior part of dACC in both cases. This seems to me entirely consistent with the argument many people have made that subjective representations of emotion (including social pain) are to be found primarily in anterior medial frontal cortex, and that posterior dACC activations for pain have much more to do with motor control, response selection, and fear than with anything affective.

Given that Neurosynth is practically a public utility and federally funded, it would be valuable to know more about the specific procedures that determine which journals and articles are added to the database and on what schedule. Also, what are the conditions that can lead to terms being removed from the database and what are the set of terms that were once included that have now been removed.

I appreciate L&E’s vote of confidence (indeed, I wish that I believed Neurosynth could do half of what they claim it can do). As I’ve repeatedly said in this post and the last one, I’m happy to answer any questions L&E have about Neurosynth methods (preferably on the mailing list, which is publicly archived and searchable). But to date, they haven’t asked me any. I’ll also reiterate that it would behoove L&E to check the data repository on GitHub (which is linked to from the neurosynth.org portal) before they conclude that the information they want isn’t already publicly accessible (because most of it is).

In any event, we did not cherry pick data. We used the data that was available to us as of June 2015 when we wrote the paper. For the four topics of interest, below we provide more representative views of the dACC, thresholded as typical Neurosynth maps are, at FDR p<.01. We’ve made the maps nice and big so you can see the details and have marked in green the dACC region on the different slices (the coronal slices are at y=14 and y=22). When you look at these, we think they tell the same story we told in the paper.

I’m not sure what the point here is. I was not suggesting that L&E were lying; I was arguing that (a) visual inspection of a few slices is no way to make a strong argument about selectivity; (b) the kinds of analyses L&E report are a statistically invalid way to draw the conclusion they are trying to draw, and (c) even if we (inappropriately) use L&E’s criteria, analyses done with more current data clearly demonstrate the presence of plenty of effects for terms other than pain. L&E dispute the first two points (which we’ll come back to), but they don’t seem to contest the last. This seems to me like it should lead L&E to the logical conclusion that they should change their conclusions, since newer and better data are now available that clearly produce different results given the same assumptions.

(I do want to be clear again that I don’t condone L&E’s analyses, which, as I show in detail above and below, simply don’t support their conclusions. I was simply pointing out that even by their own criteria, Neurosynth results don’t support their claims.)

4. Surprising lack of appreciation for what the reverse inference maps show in pretty straightforward manner.

Let’s start with pain and salience. Iannetti and his colleagues have made quite a bit of hay the last few years saying that the dACC is not involved in pain, but rather codes for salience. One of us has critiqued the methods of this work elsewhere (Eisenberger, 2015, Annual Review). The reverse inference maps above show widespread robust reverse inference effects throughout the dACC for pain and not a single voxel for salience. When we ran this initially for the paper, there were 222 studies tagged for the term salience and now that number is up to 269 and the effects are the same.

Should our tentative conclusion be that we should hold off judgment until there is more evidence? TY thinks so: “If some terms have too few studies in Neurosynth to support reliable comparisons with pain, the appropriate thing to do is to withhold judgment until more data is available.” This would be reasonable if we were talking about topics with 10 or 15 studies in the database. But, there are 269 studies for the term salience and yet there is nothing in the dACC reverse inference maps. I can’t think of anyone who has ever run a meta-analysis of anything with 250 studies, found no accumulated evidence for an effect and then said “we should withhold judgment until more data is available”.

This is another gross misrepresentation of what I said in my commentary. So let me quote what I actually said. Here’s the context:

While it’s true that terms with fewer associated studies will have more variable (i.e., extreme) posterior probability estimates, this is an unavoidable problem that isn’t in any way remedied by focusing on z-scores instead of posterior probabilities. If some terms have too few studies in Neurosynth to support reliable comparisons with pain, the appropriate thing to do is to withhold judgment until more data is available. One cannot solve the problem of data insufficiency by pretending that p-values or z-scores are measures of effect size.

This is pretty close to the textbook definition of “quoting out of context”. It should be abundantly clear that I was not saying that L&E shouldn’t interpret results from a Neurosynth meta-analysis of 250 studies (which would be absurd). The point of the above quote was that if L&E don’t like the result they get when they conduct meta-analytic comparisons properly with Neurosynth, they’re not entitled to replace the analysis with a statistically invalid procedure that does give results they like.

TY and his collaborators have criticized researchers in major media outlets (e.g. New York Times) for poor reverse inference – for drawing invalid reverse inference conclusions from forward inference data. The analyses we presented suggest that claims about salience and the dACC are also based on unfounded reverse inference claims. One would assume that TY and his collaborators are readying a statement to criticize the salience researchers in the same way they have previously.

This is another absurd, and frankly insulting, comparison. My colleagues and I have criticized people for saying that insula activation is evidence that people are in love with their iPhones. I certainly hope that this is in a completely different league from inferring that people must be experiencing pain if the dACC is activated (because if not, some of L&E’s previous work would appear to be absurd on its face). For what it’s worth, I agree with L&E that nobody should interpret dACC activation in a study as strong evidence of “salience”—and, for that matter, also of “pain”. As for why I’m not readying a statement to criticize the salience researchers, the answer is that it’s not my job to police the ACC literature. My interest is in making sure Neurosynth is used appropriately. L&E can rest assured that if someone published an article based entirely on Neurosynth results in which their primary claim was that the dACC is selective for salience, I would have written precisely the same kind of critique. Though it should perhaps concern them that, of the hundreds of published uses of Neurosynth to date, theirs is the first and only one that has moved me to write a critical commentary.

But no. Nowhere in the blog does TY comment on this finding that directly contradicts a major current account of the dACC. Not so much as a “Geez, isn’t it crazy that so many folks these days think the dACC and AI can be best described in terms of salience detection and yet there is no reverse inference evidence at all for this claim.”

Once again: I didn’t comment on this because I’m not interested in the dACC; I’m interested in making sure Neurosynth is used appropriately. If L&E had asked me, “hey, do you think Neurosynth supports saying that dACC activation is a good marker of ‘salience’?”, I would have said “no, of course not.” But L&E didn’t write a paper titled “dACC activity should not be interpreted as a marker of salience”. They wrote a paper titled “the dACC is selective for pain”, in which they argue that pain is the best psychological characterization of dACC—a claim that Neurosynth simply does not support.

For the terms executive and conflict, our Figure 3 in the PNAS paper shows a tiny bit of dACC. We think the more comprehensive figures we’ve included here continue to tell the same story. If someone wants to tell the conflict story of why pain activates the dACC, we think there should be evidence of widespread robust reverse inference mappings from the dACC to conflict. But the evidence for such a claim just isn’t there. Whatever else you think about the rest of our statistics and claims, this should give a lot of folks pause, because this is not what almost any of us would have expected to see in these reverse inference maps (including us).

No objections here.

If you generally buy into Neurosynth as a useful tool (and you should), then when you look at the four maps above, it should be reasonable to conclude, at least among these four processes, that the dACC is much more involved in that first one (i.e. pain). Let’s test this intuition in a new thought experiment.

Imagine you were given the three reverse inference maps below and you were interested in the function of the occipital cortex area marked off with the green outline. You’d probably feel comfortable saying the region seems to have a lot more to do with Term A than Terms B or C. And if you know much about neuroanatomy, you’d probably be surprised, and possibly even angered, when I tell you that Term A is ‘motor’, Term B is ‘engaged’, and Term C is ‘visual’. How is this possible since we all know this region is primarily involved in visual processes? Well it isn’t possible because I lied. Term A is actually ‘visual’ and Term C is ‘motor’. And now the world makes sense again because these maps do indeed tell us that this region is widely and robustly associated with vision and only modestly associated with engagement and motor processes. The surprise you felt, if you believed momentarily that Term A was motor was because you have the same intuition we do that these reverse inference maps tell us that Term A is the likely function of this region, not Term B or Term C – and we’d like that reverse inference to be what we always thought this region was associated with – vision. It’s important to note that while a few voxels appear in this region for Terms B and C, it still feels totally fine to say this region’s psychological function can best be described as vision-related. It is the widespread robust nature of the effect in Term A, relative to the weak and limited effects of Terms B and C, that makes this a compelling explanation of the region.

I’m happy to grant L&E that it may “feel totally fine” to some people to make a claim like this. But this is purely an appeal to intuition, and has zero bearing on the claim’s actual validity. I hope L&E aren’t seriously arguing that cognitive neuroscientists should base the way we do statistical inference on our intuitions about what “feels totally fine”. I suspect it felt totally fine to L&E to conclude in 2003 that people were experiencing physical pain because the dACC was active, even though there was no evidential basis for such a claim (and there still isn’t). Recall that, in surveys of practicing researchers, a majority of respondents routinely endorse the idea that a p-value of .05 means that there’s at least a 95% probability that the alternative hypothesis is correct (it most certainly doesn’t mean this). Should we allow people to draw clearly invalid conclusions in their publications on the grounds that it “feels right” to them? Indeed, as I show below, L&E’s arguments for selectivity rest in part on an invalid acceptance of the null hypothesis. Should they be given a free pass on what is probably the cardinal sin of NHST, on the grounds that it probably “felt right” to them to equate non-significance with evidence of absence?

The point of Neurosynth is that it provides a probabilistic framework for understanding the relationship between psychological function and brain activity. The framework has many very serious limitations that, in practice, make it virtually impossible to draw any meaningful reverse inference from observed patterns of brain activity in any individual study. If L&E don’t like this, they’re welcome to build their own framework that overcomes the limitations of Neurosynth (or, they could even help me improve Neurosynth!). But they don’t get to violate basic statistical tenets in favor of what “feels totally fine” to them.

Another point of this thought experiment is that if Term A is what we expect it to be (i.e. vision) then we can keep assuming that Neurosynth reverse inference maps tell us something valuable about the function of this region. But if Term A violates our expectation of what this region does, then we are likely to think about the ways in which Neurosynth’s results are not conclusive on this point.

We suspect if the dACC results had come out differently, say with conflict showing wide and robust reverse inference effects throughout the dACC, and pain showing little to nothing in dACC, that most of our colleagues would have said “Makes sense. The reverse inference map confirms what we thought – that dACC serves a general cognitive function of detecting conflicts.” We think it is because of the content of the results rather than our approach that is likely to draw ire from many.

I can’t speak for L&E’s colleagues, but my own response to their paper was indeed driven entirely by their approach. If someone had published a paper using Neurosynth to argue that the dACC is selective for conflict, using the same kinds of arguments L&E make, I would have written exactly the same kind of critique I wrote in response to L&E’s paper. I don’t know how I can make it any clearer that I have zero attachment to any particular view of the dACC; my primary concern is with L&E’s misuse of Neurosynth, not what they or anyone else thinks about dACC function. I’ve already made it clear several times that I endorse their conclusion that conflict, salience, and cognitive control are not adequate explanations for dACC function. What they don’t seem to accept is that pain isn’t an adequate explanation either, as the data from Neurosynth readily demonstrate.

5. L&E did the wrong analyses

TY suggests that we made a major error by comparing the Z-scores associated with different terms and should have used posterior probabilities instead. If our goal had been to compare effect sizes this might have made sense, but comparing effect sizes was not our goal. Our goal was to see whether there was accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.

I’ve already addressed the overarching problem with L&E’s statistical analyses in the first part of this post. Below I’ll just walk through each of L&E’s assertions and point out the specific issues in detail. I’ll warn you right now that this is not likely to make for very exciting reading.

While we think the maps for each term speak volumes just from visual inspection, we thought it was also critical to run the comparisons across terms directly. We all know the statistical error of showing that A is significant, while B is not and then assuming, but not testing A > B, directly. TY has a section called “A>B does not imply ~B” (where ~B means ‘not B’). Indeed it does not, but all the reverse inference maps for the executive, conflict, and salience terms already established ~B. We were just doing due diligence by showing that the difference between A and B was indeed significant.

I apologize for implying that L&E weren’t aware that A > B doesn’t entail ~B. I drew that conclusion because the only other way I could see their claim of selectivity making any sense is if they were interpreting a failure to detect a significant effect for B as positive evidence of no effect. I took that to be much more unlikely, because it’s essentially the cardinal sin of NHST. But their statement here explicitly affirms that this is, in fact, exactly what they were arguing—which leads me to conclude that they don’t understand the null hypothesis statistical testing (NHST) framework they’re using. The whole point of this section of my post was that L&E cannot conclude that there’s no activity in dACC for terms like conflict or salience, because accepting the null is an invalid move under NHST. Perhaps I wasn’t sufficiently clear about this in my last post, so let me reiterate: the reverse inference maps do not establish ~B, and cannot establish ~B. The (invalid) comparison tests of A > B do not establish ~B, and cannot establish ~B. In fact, no analysis, figure, or number L&E report anywhere in their paper establishes ~B for any of the terms they compare with pain. Under NHST, the only possible result of any of L&E’s analyses that would allow them to conclude that a term is not positively associated with dACC activation would be a significant result in the negative direction (i.e., if dACC activation implied a decrease in likelihood of a term). But that’s clearly not true of any of the terms they examine.

Note that this isn’t a fundamental limitation of statistical inference in general; it’s specifically an NHST problem. A Bayesian model comparison approach would have allowed L&E to make a claim about the evidence for the null in comparison to the alternative (though specifying the appropriate priors here might not be very straightforward). Absent such an analysis, L&E are not in any position to make claims about conflict or salience not activating the dACC—and hence, per their own criteria for selectivity, they have no basis for arguing that pain is selective.
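To make the contrast with NHST concrete, here is a minimal sketch of the kind of Bayesian comparison that can quantify support for a null. It is a toy binomial example, not an analysis of Neurosynth data: the function name `bf01_binomial`, the counts, and above all the uniform prior on the alternative are assumptions chosen for illustration (and, as noted above, choosing a defensible prior is the hard part in practice).

```python
import math

def bf01_binomial(k, n, p0=0.5):
    """Bayes factor for H0: p = p0 versus H1: p ~ Uniform(0, 1).

    Values above 1 favor the null; values below 1 favor the alternative.
    The uniform prior under H1 is an illustrative assumption.
    """
    # Likelihood of k successes in n trials under the point null.
    likelihood_null = math.comb(n, k) * p0**k * (1 - p0) ** (n - k)
    # Under a uniform prior, every count 0..n is equally likely a priori.
    marginal_alt = 1.0 / (n + 1)
    return likelihood_null / marginal_alt

# Hypothetical counts: activation reported in k of n studies.
near_null = bf01_binomial(52, 100)   # data close to p = 0.5
far_from_null = bf01_binomial(65, 100)  # data well away from p = 0.5
print(f"BF01 with 52/100: {near_null:.2f}")
print(f"BF01 with 65/100: {far_from_null:.3f}")
```

The point of the sketch is only that a Bayes factor can come out in favor of the null, which a non-significant p-value never can.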

Now, in my last post, I went well beyond this logical objection and argued that, if you analyze the data using L&E’s own criteria, there’s plenty of evidence for significant effects of other terms in dACC. I now regret including those analyses. Not because they were wrong; I stand by my earlier conclusion (which should be apparent to anyone who spends five minutes browsing maps on Neurosynth.org), and this alone should have prevented L&E from making claims about pain selectivity. But the broader point is that I don’t want to give the impression that this debate is over what the appropriate statistical threshold for analysis is—i.e., that maybe if we use p < 0.05, I’m right, and if we use FDR = 0.1, L&E are right. The entire question of which terms do or don’t show a significant effect is actually completely beside the point given that L&E’s goal is to establish that only pain activates the dACC, and that terms like conflict or salience don’t. To accomplish that, L&E would need to use an entirely different statistical framework that allows them to accept the null (relative to some alternative).

If it’s reasonable to use the Z-scores from Neurosynth to say “How much evidence is there for process A being a reliable reverse inference target for region X” then it has to be reasonable to compare Z-scores from two analyses to ask “How much MORE evidence is there for process A than process B being a reliable reverse inference target for region X”. This is all we did when we compared the Z-scores for different terms to each other (using a standard formula from a meta-analysis textbook) and we think this is the question many people are asking when they look at the Neurosynth maps for any two competing accounts of a neural region.

I addressed this in the earlier part of this post, where I explained why one cannot obtain support for a reverse inference using z-scores or p-values. Reverse inference is inherently a Bayesian notion, and makes sense only if you’re willing to talk about prior and posterior probabilities. So L&E’s first premise here—i.e., that it’s reasonable to use z-scores from Neurosynth to quantify “evidence for process A being a reliable reverse inference target for region X” is already false.
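A minimal sketch of why reverse inference needs priors: the posterior probability of a term given activation follows from Bayes' rule, and the same activation rates give very different answers under different priors. The activation rates below are made-up numbers, the function name `posterior` is hypothetical, and treating 0.5 as the default prior is an assumption based on the ".50 representing a null effect" convention mentioned elsewhere in this exchange.

```python
def posterior(p_act_given_term, p_act_given_not_term, prior):
    """P(term | activation) via Bayes' rule."""
    numerator = p_act_given_term * prior
    evidence = numerator + p_act_given_not_term * (1 - prior)
    return numerator / evidence

# Hypothetical rates: the voxel is active in 20% of studies tagged with
# the term and in 5% of studies not tagged with it.
for prior in (0.5, 0.1, 0.02):
    print(prior, round(posterior(0.20, 0.05, prior), 3))
```

With a 0.5 prior these rates give a posterior of 0.8; with a prior of 0.02 the posterior drops below 0.1. A z-score or p-value carries none of this information.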

For what it’s worth, the second premise is also independently false, because it’s grossly inappropriate to use a meta-analytic z-score comparison test in this situation. For one thing, there’s absolutely no reason to compare z-scores given that the distributional information is readily available. Rosenthal (the author of the meta-analysis textbook L&E cite) himself explicitly notes that such a test is inferior to effect size-based tests, and is essentially a last-ditch approach. Moreover, the intended use of the test in meta-analysis is to determine whether or not there’s heterogeneity in p-values as a precursor to combining them in an analysis (which is a concern that makes no sense in the context of Neurosynth data). At best, what L&E would be able to say with this test is something like “it looks like these two z-scores may be coming from different underlying distributions”. I don’t know why L&E think this is at all an interesting question here, because we already know with certainty that there can be no meaningful heterogeneity of this sort in these z-scores given that they’re all generated using exactly the same set of studies.

In fact, the problems with the z-score comparison test L&E are using run so deep that I can’t help but point out just one truly stupefying implication of the approach: it’s possible, under a wide range of scenarios, to end up concluding that there’s evidence that one term is “preferentially” activated relative to another term even when the point estimate is (significantly) larger for the latter term. For example, consider a situation in which we have a probability of 0.65 for one term with n = 1000 studies, and a probability of 0.8 for a second term with n = 100 studies. The one-sample proportion test for these two samples, versus a null of 0.5, gives z-scores of 9.5 and 5.9, respectively–so both tests are highly significant, as one would expect. But the Rosenthal z-score test favored by L&E tells us that the z-score for the first sample is significantly larger than the z-score for the second. It isn’t just wrong to interpret this as evidence that the first term has a more selective effect; it’s dangerously wrong. A two-sample test for the difference in proportions correctly reveals a significant effect in the expected direction (i.e., the 0.8 probability in the smaller sample is in fact significantly greater than the 0.65 probability in the much larger sample). Put simply, L&E’s test is broken. It’s not clear that it tests anything meaningful in this context, let alone allowing us to conclude anything useful about functional selectivity in dACC.
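The numbers in this example can be reproduced with a few lines of standard-library Python. This is a sketch under two stated assumptions: the one-sample tests use a continuity correction (which is what yields 5.9 rather than 6.0 for the second sample), and the z-score comparison takes the commonly cited form (z1 - z2) / sqrt(2). The function names are hypothetical.

```python
import math

def one_sample_z(p_hat, n, p0=0.5, continuity=True):
    """z for a one-sample proportion test against p0.

    The continuity correction is an assumption made to match the
    rounded values quoted in the text.
    """
    diff = abs(p_hat - p0)
    if continuity:
        diff = max(diff - 0.5 / n, 0.0)
    z = diff / math.sqrt(p0 * (1 - p0) / n)
    return math.copysign(z, p_hat - p0)

def z_difference(z1, z2):
    """Compare two z-scores directly: (z1 - z2) / sqrt(2)."""
    return (z1 - z2) / math.sqrt(2)

def two_sample_z(p1, n1, p2, n2):
    """Pooled two-sample z-test for a difference in proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z_a = one_sample_z(0.65, 1000)  # first term: p = 0.65, n = 1000
z_b = one_sample_z(0.80, 100)   # second term: p = 0.80, n = 100
print("one-sample z-scores:", round(z_a, 1), round(z_b, 1))
print("z-score comparison:", round(z_difference(z_a, z_b), 2))
print("two-sample test:", round(two_sample_z(0.65, 1000, 0.80, 100), 2))
```

The z-score comparison comes out positive and beyond 1.96 (apparently favoring the first term), while the two-sample test comes out negative and beyond -1.96 (correctly favoring the second term, which has the larger proportion).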

As for what people are asking when they look at the Neurosynth maps for any two competing accounts of a neural region: I really don’t know, and I don’t see how that would have any bearing on whether the methods L&E are using are valid or not. What I do know is that I’ve never seen anyone else compare Neurosynth z-scores using a meta-analytic procedure intended to test for heterogeneity of effects—and I certainly wouldn’t recommend it.

TY then raises two quite reasonable issues with the Z-score comparisons, one of which we already directly addressed in our paper. First, TY raises the issue that Z-scores increase with accumulating evidence, so terms with more studies in the database will tend to have larger Z-scores. This suggests that terms with the most studies in the database (e.g. motor with 2081 studies) should have significant Z-scores everywhere in the brain. But terms with the most studies don’t look like this. Indeed, the reverse inference map for “functional magnetic” with 4990 studies is a blank brain with no significant Z-scores.

Not quite. It’s true that for any fixed effect size, z-scores will rise (in absolute value) as sample size increases. But if the true effect size is very small, one will still obtain a negligible z-score even in a very large sample. So while terms with more studies will indeed tend to have larger absolute z-scores, it’s categorically false that “terms with the most studies in the database should have significant z-scores everywhere in the brain”.
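A quick numerical illustration of both halves of this point, using a plain one-sample proportion z-statistic against a null of 0.5. The proportions and sample sizes are invented for illustration, and `proportion_z` is a hypothetical helper, not anything from the Neurosynth codebase.

```python
import math

def proportion_z(p_hat, n, p0=0.5):
    """One-sample z-statistic for a proportion (no continuity correction)."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# A moderate effect (0.60) versus a near-null effect (0.505),
# each evaluated at increasing sample sizes.
for n in (100, 1000, 5000):
    moderate = proportion_z(0.60, n)
    tiny = proportion_z(0.505, n)
    print(f"n={n:5d}  z(0.60)={moderate:6.2f}  z(0.505)={tiny:5.2f}")
```

For the fixed 0.60 effect, z grows with the square root of n; for the 0.505 effect, z stays well below conventional significance thresholds even at n = 5000.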

However, TY has a point. If two terms have similar true underlying effects in dACC, then the one with the larger number of studies will have a larger Z-score, all else being equal. We addressed this point in the limitations section of our paper writing “It is possible that terms that occur more frequently, like “pain,” might naturally produce stronger reverse inference effects than less frequent terms. This concern is addressed in two ways. First, the current analyses included a variety of terms that included both more or fewer studies than the term “pain” and no frequency-based gradient of dACC effects is observable.” So while pain (410 studies) is better represented in the Neurosynth database than conflict (246 studies), effort (137 studies), or Stroop (162 studies), several terms are better represented than pain including auditory (1004 studies), cognitive control (2474 studies), control (2781 studies), detection (485 studies), executive (531 studies), inhibition (432 studies), motor (1910 studies), and working memory (815). All of these, regardless of whether they are better or worse represented in the Neurosynth database show minimal presence in the dACC reverse inference maps. It’s also worth noting that painful and noxious, with only 158 and 85 studies respectively, both show broader coverage within the dACC than any of the cognitive or salience terms considered in our paper.

L&E don’t seem to appreciate that the relationship between the point estimate of a parameter and the uncertainty around that estimate is not like the relationship between two predictors in a regression, where one can (perhaps) reason logically about what would or should be true if one covariate was having an influence on another. One cannot “rule out” the possibility that sample size is a problem by pointing to some large-N terms with small effects or some small-N terms with large effects. Sampling error is necessarily larger in smaller samples. The appropriate way to handle between-term variation in sample size is to properly build that differential uncertainty into one’s inferential test. Rosenthal’s z-score comparison doesn’t do this. The direct meta-analytic contrast one can perform with Neurosynth does do this, but of course, being much more conservative than the Rosenthal test (appropriately so!), L&E don’t seem to like the results it produces. (And note that the direct meta-analytic contrast would still require one to make strong assumptions about priors if the goal was to make quantitative reverse inferences, as opposed to detecting a mean difference in probability of activation.)

TY’s second point is also reasonable, but is also not a problem for our findings. TY points out that some effects may be easier to produce in the scanner than others and thus may be biased towards larger effect sizes. We are definitely sympathetic to this point in general, but TY goes on to focus on how this is a problem for comparing pain studies to emotion studies because pain is easy to generate in the scanner and emotion is hard. If we were writing a paper comparing effect sizes of pain and emotion effects this would be a problem but (a) we were not primarily interested in comparing effect sizes and (b) we definitely weren’t comparing pain and emotion because we think the aspect of pain that the dACC is involved in is the affective component of pain as we’ve written in many other papers dating back to 2003 (Eisenberger & Lieberman, 2004; Eisenberger, 2012; Eisenberger, 2015).

It certainly is a problem for L&E’s findings. Z-scores are related one-to-one with effect size for any fixed sample size, so if the effect size is artificially increased in one condition, so too is the z-score that L&E stake their (invalid) analysis on. Any bias in the point estimate will necessarily distort the z-value as well. This is not a matter of philosophical debate or empirical conjecture, it’s a mathematical necessity.

Is TY’s point relevant to our actual terms of comparison: executive, conflict, and salience processes? We think not. Conflict tasks are easy and reliable ways to produce conflict processes. In multiple ways, we think pain is actually at a disadvantage in the comparison to conflict. First, pain effects are so variable from one person to the next that most pain researchers begin by calibrating the objective pain stimuli delivered, to each participant’s subjective responses to pain. As a result, each participant may actually be receiving different objective inputs and this might limit the reliability or interpretability of certain observed effects. Second, unlike conflict, pain can only be studied at the low end of its natural range. Due to ethical considerations, we do not come close to studying the full spectrum of pain phenomena. Both of these issues may limit the observation of robust pain effects relative to our actual comparisons of interest (executive, conflict, and salience processes).

Perhaps I wasn’t sufficiently clear, but I gave the pain-emotion contrast as an example. The point is that meta-analytic comparisons of the kind L&E are trying to make are a very dangerous proposition unless one has reason to think that two classes of manipulations are equally “strong”. It’s entirely possible that L&E are right that executive control manipulations are generally stronger than pain manipulations, but that case needs to be made on the basis of data, and cannot be taken for granted.

6. About those effect size comparison maps

After criticizing us for not comparing effect sizes, rather than Z-scores, TY goes on to produce his own maps comparing the effect sizes of different terms and claiming that these represent evidence that the dACC is not selective for pain. A lot of our objections to these analyses as evidence against our claims repeats what’s already been said so we’ll start with what’s new and then only briefly reiterate the earlier points.

a) We don’t think it makes much sense to compare effect sizes for terms in voxels for which there is no evidence that it is a valid reverse inference target. For instance, the posterior probability at 0 26 26 for pain is .80 and for conflict is .61 (with .50 representing a null effect). Are these significantly different from one another? I don’t think it matters much because the Z-score associated with conflict at this spot is 1.37, which is far from significant (or at least it was when we ran our analyses last summer. Strangely, now, any non-significant Z-scores seem to come back with a value of 0, whereas they used to give the exact non-significant Z-score).

I’m not sure why L&E think that statistical significance makes a term a “valid target” for reverse inference (or conversely, that non-significant terms cannot be valid targets). If they care to justify this assertion, I’ll be happy to respond to it. It is, in any case, a moot point, since many of the examples I gave were statistically significant, and L&E don’t provide any explanation as to why those terms aren’t worth worrying about either.

As for the disappearance of non-significant z-scores, that’s a known bug introduced by the last major update to Neurosynth, and it’ll be fixed in the next major update (when the entire database is re-generated).

If I flip a coin twice I might end up with a probability estimate of 100% heads, but this estimate is completely unreliable. Comparing this estimate to those from a coin flipped 10,000 times which comes up 51% heads makes little sense. Would the first coin having a higher probability estimate than the second tell us anything useful? No, because we wouldn’t trust the probability estimate to be meaningful. Similarly, if a high posterior probability is associated with a non-significant Z-score, we shouldn’t take this posterior probability as a particularly reliable estimate.

L&E are correct that it wouldn’t make much sense to compare an estimate from 2 coin flips to an estimate from 10,000 coin flips. But the error is in thinking that comparing p-values somehow addresses this problem. As noted above, the p-value comparison they use is a meta-analytic test that only tells one if a set of z-scores are heterogeneous, and is not helpful for comparing proportions when one has actual distributional information available. It would be impossible to answer the question of whether one coin is biased relative to another using this test—and it’s equally impossible to use it to determine whether one term is more important than another for dACC function.
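One way to see what "using the distributional information" buys you in the coin example is to put an interval around each estimate rather than comparing z-scores. The sketch below uses the Wilson score interval; that choice, the 95% level, and the helper name `wilson_interval` are assumptions for illustration, and other interval methods would make the same qualitative point.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (approx. 95% for z=1.96)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Coin 1: 2 heads in 2 flips. Coin 2: 5,100 heads in 10,000 flips.
lo1, hi1 = wilson_interval(2, 2)
lo2, hi2 = wilson_interval(5100, 10000)
print(f"2 flips:      [{lo1:.3f}, {hi1:.3f}]")
print(f"10,000 flips: [{lo2:.3f}, {hi2:.3f}]")
```

The interval for the two-flip coin spans most of the unit range and entirely contains the narrow interval for the 10,000-flip coin, so the data give no grounds for saying the first coin is more biased than the second, despite its higher point estimate.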

b) TY’s approach for these analyses is to compare the effect sizes for any two processes A & B by finding studies in the database tagged for A but not B and others tagged for B but not A and to compare these two sets. In some cases this might be fine, but in others it leaves us with a clean but totally unrealistic comparison. To give the most extreme example, imagine we did this for the terms pain and painful. It’s possible there are some studies tagged for painful but not pain, but how representative would these studies be of “painful” as a general term or construct? It’s much like the clinical problem of comparing depression to anxiety by comparing those with depression (but not anxiety) to those with anxiety (but not depression). These folks are actually pretty rare because depression and anxiety are so highly comorbid, so the comparison is hardly a valid test of depression vs. anxiety. Given that we think pain, fear, emotion, and autonomic are actually all in the same class of explanations, we think comparisons within this family are likely to suffer from this issue.

There’s nothing “unrealistic” about this comparison. It’s not the inferential test’s job to make sure that the analyst is doing something sensible, it’s the analyst’s job. Nothing compels L&E to run a comparison between ‘pain’ and ‘painful’, and I fully agree that this would be a dumb thing to do (and it would be an equally dumb thing to do using any other statistical test). On the other hand, comparing the terms ‘pain’ and ‘emotion’ is presumably not a dumb thing to do, so it behooves us to make sure that we use an inferential test that doesn’t grossly violate common sense and basic statistical assumptions.

Now, if L&E would like to suggest an alternative statistical test that doesn’t exclude the intersection of the two terms and still (i) produces interpretable results, (ii) weights all studies equally, (iii) appropriately accounts for the partial dependency structure of the data, and (iv) is sufficiently computationally efficient to apply to thousands of terms in a reasonable amount of time (which rules out most permutation-based tests), then I’d be delighted to consider their suggestions. The relevant code can be found here, and L&E are welcome to open a GitHub issue to discuss this further. But unless they have concrete suggestions, it’s not clear what I’m supposed to do with their assertion that doing meta-analytic comparison properly sometimes “leaves us with a clean but totally unrealistic comparison”. If they don’t like the reality, they’re welcome to help me improve the reality. Otherwise they’re simply engaging in wishful thinking. Nobody owes L&E a statistical test that’s both valid and gives them results they like.
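For readers who want to see the shape of the comparison being argued about, here is a rough sketch of an "A but not B versus B but not A" test at a single voxel. This is not the Neurosynth code; the function name, the toy study IDs, and the pooled two-proportion z-test are all assumptions made for illustration, and the sketch ignores the efficiency and dependency concerns listed above.

```python
import math

def compare_exclusive_sets(tagged_a, tagged_b, active):
    """Compare activation rates at one voxel between studies tagged only for A
    and studies tagged only for B (studies tagged for both are excluded)."""
    only_a = tagged_a - tagged_b
    only_b = tagged_b - tagged_a
    hits_a = len(only_a & active)
    hits_b = len(only_b & active)
    n_a, n_b = len(only_a), len(only_b)
    p_a, p_b = hits_a / n_a, hits_b / n_b
    # Pooled two-proportion z-test (normal approximation).
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se if se > 0 else 0.0
    return p_a, p_b, z

# Hypothetical study IDs: 0-59 tagged for A, 40-119 tagged for B (40-59 tagged for both).
tagged_a = set(range(0, 60))
tagged_b = set(range(40, 120))
# Hypothetical set of studies reporting activation at the voxel of interest.
active = set(range(0, 30)) | set(range(40, 50)) | set(range(60, 75))

p_a, p_b, z = compare_exclusive_sets(tagged_a, tagged_b, active)
print(f"P(active | A only) = {p_a:.2f}, P(active | B only) = {p_b:.2f}, z = {z:.2f}")
```

Dropping the intersection keeps the two groups of studies independent, which is what lets a simple two-sample test apply. The price is the one L&E point to: when two terms overlap heavily, the exclusive sets become small and possibly unrepresentative.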

c) TY compared topics (i.e., a cluster of related terms), not terms. This is fine, but it is one more way that what TY did is not comparable to what we did (i.e. one more way his maps can’t be compared to those we presented).

I almost always use topics rather than terms in my own analyses, for a variety of reasons (they have better construct validity, are in theory more reliable, reduce the number of comparisons, etc.). I didn’t try out the analyses I ran with any of the term-based features, but I encourage L&E to do so if they like, and I’d be surprised if the results differ appreciably (they should, in general, simply be slightly less robust all around). In any case, I deliberately made my code available so that L&E (or anyone else) could easily reproduce and modify my analyses. (And of course, nothing at all hangs on the results in any case, because the whole premise that this is a suitable way to demonstrate selectivity is unfounded.)

d) Finally and most importantly, our question would not have led us to comparing effect sizes. We were interested in whether there was greater accumulated evidence for one term (i.e. pain) being a reverse inference target for dACC activations than for another term (e.g. conflict). Using the Z-scores as we did is a perfectly reasonable way to do this.

See above. Using the z-scores the way L&E did is not reasonable and doesn’t tell us anything anyone would want to know about functional selectivity.

7. Biases all around

Towards the end of his blog, TY says what we think many cognitive folks believe:

“I don’t think it’s plausible to think that much of the brain really prizes pain representation above all else.”

We think this is very telling because it suggests that the findings such as those in our PNAS paper are likely to be unacceptable regardless of what the data shows.

Another misrepresentation of what I actually said, which was:

One way to see this is to note that when we meta-analytically compare pain with almost any other term in Neurosynth (see the figure above), there are typically a lot of brain regions (extending well outside of dACC and other putative pain regions) that show greater activation for pain than for the comparison condition, and very few brain regions that show the converse pattern. I don’t think it’s plausible to think that much of the brain really prizes pain representation above all else. A more sensible interpretation is that the Neurosynth posterior probability estimates for pain are inflated to some degree by the relative ease of inducing pain experimentally.

The context makes it abundantly clear that I was not making a general statement about the importance of pain in some grand evolutionary sense, but simply pointing out the implausibility of supposing that Neurosynth reverse inference maps provide unbiased windows into the neural substrates of cognition. In the case of pain, there’s tentative evidence to believe that effect sizes are overestimated.

In contrast, we can’t think of too many things that the brain would prize above pain (and distress) representations. People who don’t feel pain (i.e. congenital insensitivity to pain) invariably die an early death – it is literally a death sentence to not feel pain. What could be more important for survival? Blind and deaf people survive and thrive, but those without the ability to feel pain are pretty much doomed.

I’m not sure what this observation is supposed to tell us. One could make the same kind of argument about plenty of other functions. People who suffer from a variety of autonomic or motor problems are also likely to suffer horrible early deaths; it’s unclear to me how this would justify a claim like “the brain prizes little above autonomic control”, or what possible implications such a claim would have for understanding dACC function.

Similar (but not identical) to TY’s conclusions that we opened this blog with, we think the following conclusions are supported by the Neurosynth evidence in our PNAS paper:

I’ll take these one at a time.

* There is more widespread and robust reverse inference evidence for the role of pain throughout the dACC than for executive, conflict, and salience-related processes.

I’m not sure what is meant here by “robust reverse inference evidence”. Neurosynth certainly provides essentially no basis for drawing reverse inferences about the presence of pain in individual studies. (Let me remind L&E once again: at best, the posterior probability for ‘pain’ in dACC is around 80%–but that’s given an assumed base rate of 50%, not the more realistic real-world rate of around 3%). If what they mean is something like “on average, taking the average of all voxels in dACC, there’s more evidence of a statistical association between pain and dACC than pain and conflict monitoring”, then I’m fine with that.
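To see how much the assumed base rate matters, here is a small sketch that re-expresses a posterior probability under a different prior while holding the likelihood ratio fixed. The function name is mine, and the 80%, 50%, and 3% figures are the approximate values quoted above, so the result should be read as a ballpark number only.

```python
def rescale_posterior(posterior, assumed_prior, new_prior):
    """Convert P(term | activation) computed under one prior to another prior,
    holding the likelihood ratio fixed."""
    posterior_odds = posterior / (1 - posterior)
    prior_odds = assumed_prior / (1 - assumed_prior)
    likelihood_ratio = posterior_odds / prior_odds
    new_odds = likelihood_ratio * new_prior / (1 - new_prior)
    return new_odds / (1 + new_odds)

# ~80% posterior under a 50% prior, re-expressed under a ~3% base rate.
p = rescale_posterior(0.80, 0.50, 0.03)
print(f"{p:.2f}")
```

Under these assumptions the posterior drops from 80% to roughly 11%: the likelihood ratio of 4 is unchanged, but it is now applied to much lower prior odds.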

* There is little to no evidence from the Neurosynth database that executive, conflict, and salience-related processes are reasonable reverse inference targets for dACC activity.

Again, this depends on what L&E mean. If they mean that one shouldn’t, upon observing activation in dACC, proclaim that conflict must be present, then they’re absolutely right. But again, the same is true for pain. On the other hand, if they mean that there’s no evidence in Neurosynth for a reverse inference association between these terms and dACC activity, where the criterion is surviving FDR-correction, then that’s clearly not true: for example, the conflict map clearly includes voxels within the dACC. Alternatively, if L&E’s point is that the dACC/preSMA region centrally associated with conflict monitoring or executive control is more dorsal than many (though not all) people have assumed, then I agree with them without qualification.

* Pain processes, particularly the affective or distressing part of pain, are in the same family with other distress-related processes including terms like distress, fear, and negative affect.

I have absolutely no idea what evidence this conclusion is based on. Nothing I can see in Neurosynth seems to support this—let alone anything in the PNAS paper. As I’ve noted several times now, most distress-related terms do not seem to overlap meaningfully with pain-related activations in dACC. To the extent that one thinks spatial overlap is a good criterion for determining family membership (and for what it’s worth, I don’t think it is), the evidence does not seem particularly suggestive of any such relationship (and L&E don’t test it formally in any way).

Postscript. *L&E should have used reverse inference, not forward inference, when examining the anatomical boundaries of dACC.*

We saved this one for the postscript because this has little bearing on the major claims of our paper. In our paper, we observed that when one does a forward inference analysis of the term ‘dACC’ the strongest effect occurs outside the dACC in what is actually SMA. This suggested to us that people might be getting activations outside the dACC and calling them dACC (much as many activations clearly not in the amygdala have been called amygdala because it fits a particular narrative). TY admits having been guilty of this in TY’11 and points out that we made this mistake in our 2003 Science paper on social pain. A couple of thoughts on this.

a) In 2003, we did indeed call an activation outside of dACC (-6 8 45) by the term “dACC”. TY notes that if this is entered into a Neurosynth analysis the first anatomical term that appears is SMA. Fair enough. It was our first fMRI paper ever and we identified that activation incorrectly. What TY doesn’t mention is that there are two other activations from the same paper (-8 20 40; -6 21 41) where the top named anatomical term in Neurosynth is anterior cingulate. And if you read this in TY’s blog and thought “I guess social pain effects aren’t even in the dACC”, we would point you to the recent meta-analysis of social pain by Rotge et al. (2015) where they observed the strongest effect for social pain in the dACC (8 24 24; Z=22.2 PFDR<.001). So while we made a mistake, no real harm was done.

I mentioned the preSMA activation because it was the critical data point L&E leaned on to argue that the dACC was specifically associated with the affective component of pain. Here’s the relevant excerpt from the 2003 social pain paper:

As predicted, group analysis of the fMRI data indicated that dorsal ACC (Fig. 1A) (x = -8, y = 20, z = 40) was more active during ESE than during inclusion (t = 3.36, r = 0.71, P < 0.005) (23, 24). Self-reported distress was positively correlated with ACC activity in this contrast (Fig. 2A) (x = -6, y = 8, z = 45, r = 0.88, P < 0.005; x = -4, y = 31, z = 41, r = 0.75, P < 0.005), suggesting that dorsal ACC activation during ESE was associated with emotional distress paralleling previous studies of physical pain (7, 8). The anterior insula (x = 42, y = 16, z = 1) was also active in this comparison (t = 4.07, r = 0.78, P < 0.005); however, it was not associated with self-reported distress.

Note that both the dACC and anterior insula were activated by the exclusion vs. inclusion contrast, but L&E concluded that it was specifically the dACC that supports the “neural alarm” system, by virtue of being correlated with participants’ subjective reports (whereas the insula was not). Setting aside the fact that these results were observed in a sample size of 13 using very liberal statistical thresholds (so that the estimates are highly variable, spatial error is going to be very high, there’s a high risk of false positives, and accepting the null in the insula because of the absence of a significant effect is probably a bad idea), in focusing on the preSMA activation in my critique, I was only doing what L&E themselves did in their paper:

Dorsal ACC activation during ESE could reflect enhanced attentional processing, previously associated with ACC activity (4, 5), rather than an underlying distress due to exclusion. Two pieces of evidence make this possibility unlikely. First, ACC activity was strongly correlated with perceived distress after exclusion, indicating that the ACC activity was associated with changes in participants’ self-reported feeling states.

By L&E’s own admission, without the subjective correlation, there would have been little basis for concluding that the effect they observed was attributable to distress rather than other confounds (attentional increases, expectancy violation, etc.). That’s why I focused on the preSMA activation: because they did too.

That said, since L&E bring up the other two activations, let’s consider those too, since they also have their problems. While it’s true that both of them are in the anterior cingulate, according to Neurosynth, neither of them is a “pain” voxel. The top functional associates for both locations are ‘interference’, ‘task’, ‘verbal’, ‘verbal fluency’, ‘word’, ‘demands’, ‘words’, ‘reading’ … you get the idea. Pain is not significantly associated with these points in Neurosynth. So while L&E might be technically right that these other activations were in the anterior cingulate, if we take Neurosynth to be as reliable a guide to reverse inference as they think, then L&E never had any basis for attributing the social exclusion effect to pain to begin with, because, according to Neurosynth, literally none of the medial frontal cortex activations reported in the 2003 paper are associated with pain. I’ll leave it to others to decide whether “no harm was done” by their claim that the dACC is involved in social pain.

In contrast, TY’11’s mistake is probably of greater significance. Many have taken Figure 3 of TY’11 as strong evidence that the dACC activity can’t be reliably associated with working memory, emotion, or pain. If TY had tested instead (2 8 40), a point directly below his that is actually in dACC (rather than 2 8 50 which TY now acknowledges is in SMA), he would have found that pain produces robust reverse inference effects, while neither working memory or emotion do. This would have led to a very different conclusion than the one most have taken from TY’11 about the dACC.

Nowhere in TY’11 is it claimed that dACC activity isn’t reliably associated with working memory, emotion or pain (and, as I already noted in my last post, I explicitly said that the posterior aspects of dACC are preferentially associated with pain). What I did say is that dACC activation may not be diagnostic of any of these processes. That’s entirely accurate. As I’ve explained at great length above, there is simply no basis for drawing any strong reverse inference on the basis of dACC activation.

That said, if it’s true that many people have misinterpreted what I said in my paper, that would indeed be potentially damaging to the field. I would appreciate feedback from other people on this issue, because if there’s a consensus that my paper has in fact led people to think that dACC plays no specific role in cognition, then I’m happy to submit an erratum to the journal. But absent such feedback, I’m not convinced that my paper has had nearly as much influence on people’s views as L&E seem to think.

b) TY suggested that we should have looked for “dACC” in the reverse inference map rather than the forward inference map writing “All the forward inference map tells you is where studies that use the term “dACC” tend to report activation most often”. Yet this is exactly what we were interested in. If someone is talking about dACC in their paper, is that the region most likely to appear in their tables? The answer appears to be no.

No, it isn’t what L&E are interested in. Let’s push this argument to its logical extreme to illustrate the problem: imagine that every single fMRI paper in the literature reported activation in preSMA (plus other varying activations)—perhaps because it became standard practice to do a “task-positive localizer” of some kind. This is far-fetched, but certainly conceptually possible. In such a case, searching for every single region by name (“amygdala”, “V1”, you name it) would identify preSMA as the peak voxel in the forward inference map. But what would this tell us, other than that preSMA is activated with alarming frequency? Nothing. What L&E want to know is what brain regions have the biggest impact on the likelihood that an author says “hey, that’s dACC!”. That’s a matter of reverse inference.
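The thought experiment is easy to play out with a toy literature. In the sketch below every study reports preSMA, and the counts for the term "amygdala" are invented purely for illustration. Note also that this toy uses the empirical base rate of the term, whereas Neurosynth's reverse inference maps assume a uniform prior.

```python
# Hypothetical toy literature: (term used in the paper, regions reported).
studies = (
    [("amygdala", {"preSMA", "amygdala"})] * 8
    + [("amygdala", {"preSMA"})] * 2
    + [("other", {"preSMA", "amygdala"})] * 5
    + [("other", {"preSMA"})] * 85
)

def forward(term, region):
    """P(region reported | term used)."""
    with_term = [regions for t, regions in studies if t == term]
    return sum(region in r for r in with_term) / len(with_term)

def reverse(term, region):
    """P(term used | region reported)."""
    with_region = [t for t, regions in studies if region in regions]
    return sum(t == term for t in with_region) / len(with_region)

for region in ("preSMA", "amygdala"):
    print(region,
          "forward:", round(forward("amygdala", region), 2),
          "reverse:", round(reverse("amygdala", region), 2))
```

In the forward direction preSMA "wins" for the term amygdala, simply because everything activates it. In the reverse direction, seeing preSMA tells you nothing beyond the 10% base rate of the term, whereas seeing amygdala activation raises the probability that the paper says "amygdala" to about 62%.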

c) But again, this is not one of the central claims of the paper. We just thought it was noteworthy so we noted it. Nothing else in the paper depends on these results.

I agree with this. I guess it’s nice to end on a positive note.

No, the dorsal anterior cingulate is not selective for pain: comment on Lieberman and Eisenberger (2015)

[Update 12/10/2015: Lieberman & Eisenberger have now posted a lengthy response to this post here. I’ll post my own reply to their reply in the next few days.]

[Update 12/14/2015: I’ve posted an even lengthier reply to L&E’s reply here.]

[Update 12/16/2015: Alex Shackman has posted an interesting commentary of his own on the L&E paper. It focuses on anatomical concerns unrelated to the issues I raise here and in my last post.]

The anterior cingulate cortex (ACC), located immediately above the corpus callosum on the medial surface of the brain’s frontal cortex, is an intriguing brain region. Despite decades of extensive investigation in thousands of animal and human studies, understanding the function(s) of this region has proven challenging. Neuroscientists have proposed a seemingly never-ending string of hypotheses about what role it might play in emotion and/or cognition. The field of human neuroimaging has taken a particular shine to the ACC in the past two decades; if you’ve ever overheard some nerdy-looking people talking about “conflict monitoring”, “error detection”, or “reinforcement learning” in the human brain, there’s a reasonable chance they were talking at least partly about the role of the ACC.

In a new PNAS paper, Matt Lieberman and Naomi Eisenberger wade into the debate with what is quite possibly the strongest claim yet about ACC function, arguing (and this is a verbatim quote from the paper’s title) that “the dorsal anterior cingulate cortex is selective for pain”. That conclusion rests almost entirely on inspection of meta-analytic results produced by Neurosynth, an automated framework for large-scale synthesis of results from thousands of published fMRI studies. And while I’ll be the first to admit that I know very little about the anterior cingulate cortex, I am probably the world’s foremost expert on Neurosynth*—because I created it. I also have an obvious interest in making sure that Neurosynth is used with appropriate care and caution. In what follows, I provide my HIBAR reactions to the Lieberman & Eisenberger (2015) manuscript, focusing largely on whether L&E’s bold conclusion is supported by the Neurosynth findings they review (spoiler alert: no).

Before going any further, I should clarify my role in the paper, since I’m credited in the Acknowledgments section for “providing Neurosynth assistance”. My contribution consisted entirely of sending the first author (per an email request) an aggregate list of study counts for different terms on the Neurosynth website. I didn’t ask what it was for, he didn’t say what it was for, and I had nothing to do with any other aspect of the paper—nor did PNAS ask me to review it. None of this is at all problematic, from my perspective. My policy has always been that people can do whatever they want with any of the Neurosynth data, code, or results, without having to ask me or anyone else for permission. I do encourage people to ask questions or solicit feedback (we have a mailing list), but in this case the authors didn’t contact me before this paper was published (other than to request data). So being acknowledged by name shouldn’t be taken as an endorsement of any of the results.

With that out of the way, we can move on to the paper. The basic argument L&E make is simple, and largely hangs on the following observation about Neurosynth data: when we look for activation in the dorsal ACC (dACC) in various “reverse inference” brain maps on Neurosynth, the dominant associate is the term “pain”. Other candidate functions people have considered in relation to dACC (e.g., “working memory”, “salience”, and “conflict”) show, at least according to L&E, virtually no association with dACC. L&E take this as strong evidence against various models of dACC function that propose that the dACC plays a non-pain-related role in cognition (e.g., that it monitors for conflict between cognitive representations or detects salient events). They state, in no uncertain terms, that Neurosynth results “clearly indicated that the best psychological description of dACC function was related to pain processing – not executive, conflict, or salience processing”. This is a strong claim, and would represent a major advance in our understanding of dACC function if it were borne out. Unfortunately, it isn’t.

A crash course in reverse inference

To understand why, we need to understand the nature of the Neurosynth data L&E focus on. And to do that, we need to talk about something called reverse inference. L&E begin their paper by providing an excellent explanation of why the act of inferring mental states from patterns of brain activity (i.e., reverse inference, a term popularized in a seminal 2006 article by Russ Poldrack) is a difficult business. Many experienced fMRI researchers might feel that the issue has already been beaten to death (see for instance this, this, this, or this). Those readers are invited to skip to the next section.

For everyone else, we can summarize the problem by observing that the probability of a particular pattern of brain activity conditional on a given mental state is not the same thing as the probability of a particular mental state conditional on a given pattern of observed brain activity (i.e., P(activity|mental state) != P(mental state|activity)). For example, if I know that doing a difficult working memory task produces activation in the dorsolateral prefrontal cortex (DLPFC) 80% of the time, I am not entitled to conclude that observing DLPFC activation in someone’s brain implies an 80% chance that that person is doing a working memory task.

To see why, imagine that a lot of other cognitive tasks—say, those that draw on recognition memory, emotion recognition, pain processing, etc.—also happen to produce DLPFC activation around 80% of the time. Then we would be justified in saying that all of these processes consistently produce DLPFC activity, but we would have no basis for saying that DLPFC activation is specific, or even preferential, for any one of these processes. To make the latter claim, we would need to directly estimate the probability of working memory being involved given the presence of DLPFC activation. But this is a difficult proposition, because most fMRI studies only compare a small number of experimental conditions (typically with low statistical power), and cannot really claim to demonstrate that a particular pattern of activity is specific to a given cognitive process.
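A few lines of arithmetic make the DLPFC example concrete. The sketch below applies Bayes’ rule under two simplifying assumptions that are mine rather than anything measured: the four listed processes are the only possibilities, and each is equally likely a priori.

```python
def posterior(likelihoods, priors, target):
    """P(target state | activation) via Bayes' rule."""
    evidence = sum(likelihoods[s] * priors[s] for s in likelihoods)
    return likelihoods[target] * priors[target] / evidence

states = ["working memory", "recognition memory", "emotion recognition", "pain"]
priors = {s: 0.25 for s in states}  # assumed: equally likely, exhaustive

# Case 1: every process activates DLPFC 80% of the time.
nonselective = {s: 0.8 for s in states}
p_nonselective = posterior(nonselective, priors, "working memory")

# Case 2: only working memory does so; the others activate it 10% of the time.
selective = {s: 0.1 for s in states}
selective["working memory"] = 0.8
p_selective = posterior(selective, priors, "working memory")

print(f"non-selective: {p_nonselective:.2f}, selective: {p_selective:.2f}")
```

In both cases P(DLPFC activation | working memory) is 0.8, yet P(working memory | DLPFC activation) is 0.25 in the first case and about 0.73 in the second. The forward probability alone cannot tell these two worlds apart.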

Unfortunately, a huge proportion of fMRI studies continue to draw strong reverse inferences on the basis of little or no quantitative evidence. The practice is particularly common in Discussion sections, when authors often want to say something more than just “we found a bunch of differences as a result of this experimental manipulation”, and end up drawing inferences about what such-and-such activation implies about subjects’ mental states on the basis of a handful of studies that previously reported activation in the same region(s). Many of these attributions could well be correct, of course; but the point is that it’s exceedingly rare to see any quantitative evidence provided in support of claims that are often fundamental to the interpretation authors wish to draw.

Fortunately, this is where large-scale meta-analytic databases like Neurosynth can help—at least to some degree. Because Neurosynth contains results from over 11,000 fMRI studies drawn from virtually every domain of cognitive neuroscience, we can use it to produce quantitative whole-brain reverse inference maps (for more details, see Yarkoni et al. (2011)). In other words, we can estimate the relative specificity with which a particular pattern of brain activity implies that some cognitive process is in play—provided we’re willing to make some fairly strong assumptions (which we’ll return to below).

The dACC, lost and found

Armed with an understanding of the forward/reverse inference distinction, we can now turn to the focus of the L&E paper: a brain region known as the dorsal anterior cingulate cortex (dACC). The first thing L&E set out to do, quite reasonably, is identify the boundaries of the dACC, so that it’s clear what constitutes the target of analysis. To this end, they compare the anatomically-defined boundaries of dACC with the boundaries found in the Neurosynth forward inference map for “dACC”. Here’s what they show us:

Figure 1 from Lieberman & Eisenberger (2015)

The blue outline in panel A is the anatomical boundary of dACC; the colorful stuff in B is the Neurosynth map for ‘dACC’. (It’s worth noting in passing that the choice to rely on anatomy as the gold standard here is not completely uncontroversial; given the distributed nature of fMRI activation and the presence of considerable registration error in most studies, another reasonable approach would have been to use a probabilistic template). As you can see, the two don’t converge all that closely. Much of the Neurosynth map sits squarely inside preSMA territory rather than in dACC proper. As L&E report:

When “dACC“ is entered as a term into a Neurosynth forward inference analysis (Fig. 1B), there is substantial activity present in the anatomically defined dACC region; however, there is also substantial activity present in the SMA/preSMA region. Moreover, the location with the highest Z-score in this analysis is actually in SMA, not dACC. The same is true if the term “anterior cingulate“ is used (Fig. 1C).

L&E interpret this as a sign of confusion in the literature about the localization of dACC, and suggest that this observation might explain why people have misattributed certain functions to dACC:

These findings suggest that some of the disagreement over the function of the dACC may actually apply to the SMA/pre-SMA, rather than the dACC. In fact, a previous paper reporting that a reverse inference analysis for dACC was not selective for pain, emotion, or working memory (see figure 3 in ref. 13) seems to have used coordinates for the dACC that are in fact in the SMA/ pre-SMA (MNI coordinates 2, 8, 50), not in the dACC.

This is an interesting point, and clearly has a kernel of truth to it, inasmuch as some researchers undoubtedly confuse dACC with more dorsal regions. As L&E point out, I made this mistake myself in the original Neurosynth paper (that’s the ‘ref. 13’ in the above quote); specifically, here’s the figure where I clearly labeled dACC in the wrong place:

oops.
Figure 3 from Yarkoni et al. (2011)

 

Mea culpa—I made a mistake, and I appreciate L&E pointing it out. I should have known better.

That said, L&E should also have known better, because they were among the first authors to ascribe a strong functional role to a region of dorsal ACC that wasn’t really dACC at all. I refer here to their influential 2003 Science paper on social exclusion, in which they reported that a region of dorsal ACC centered on (-6, 8, 45) was specifically associated with the feeling of social exclusion and concluded (based on the assumption that the same region was already known to be implicated in pain processing) that social pain shares core neural substrates with physical pain. Much of the ongoing debate over what the putative role of dACC is traces back directly to this paper. Yet it’s quite clear that the region identified in that paper was not the same as the one L&E now argue is the pain-specific dACC. At coordinates (-6, 8, 45), the top hits in Neurosynth are “SMA”, “motor”, and “supplementary motor”. If we scan down to the first cognitive terms, we find the terms “task”, “execution”, and “orthographic”. “Pain” is not significantly associated with activation at this location at all. So, to the extent that people have mislabeled this region in the past, L&E would appear to share much of the blame. Which is fine—we all make mistakes. But given the context, I think it would behoove L&E to clarify their own role in perpetuating this confusion.

That said, even if L&E are correct that a subset of researchers have sometimes confused dACC and pre-SMA, they’re clearly wrong to suggest that the cognitive neuroscience community as a whole is guilty of the same confusion. A perplexing aspect of their argument is that they base their claim of localization confusion entirely on inspection of the forward inference Neurosynth map for “dACC”—an odd decision, coming immediately after several paragraphs in which they lucidly explain why a forward inference analysis is exactly the wrong way to determine what brain regions are specifically associated with a particular term. If you want to use Neurosynth to find out where people think dACC is, you should use the reverse inference map, not the forward inference map. All the forward inference map tells you is where studies that use the term “dACC” tend to report activation most often. But as discussed above, and in the L&E paper, that estimate will be heavily biased by differences between regions in the base rate of activation.

Perhaps in tacit recognition of this potential criticism, L&E go on to suggest that the alleged “distortion” problem isn’t ubiquitous, and doesn’t happen in regions like the amygdala, hippocampus, or posterior cingulate:

We tested several other anatomical terms including “amygdala,” “hippocampus,” “posterior cingulate,” “basal ganglia,” “thalamus,” “supplementary motor,” and “pre sma.” In each of these regions, the location with the highest Z-score was within the expected anatomical boundaries. Only within the dACC did we find this distortion. These results indicate that studies focused on the dACC are more likely to be reporting SMA/pre-SMA activations than dACC activations.

But this isn’t quite right. While it may be the case that dACC was the only brain region among the ones L&E examined that showed this “distortion”, it’s certainly not the only brain region that shows this pattern. For example, the forward inference maps for “DMPFC” and “middle cingulate” (and probably others; I only spent a couple of minutes looking) show peak voxels in pre-SMA and the anterior insula, respectively, and not within the boundaries of the expected anatomical structures. If we take L&E’s “localization confusion” explanation seriously, we would be forced to conclude not only that cognitive neuroscientists generally don’t know where dACC is, but also that they don’t know DMPFC from pre-SMA or mid-cingulate from anterior insula. I don’t think this is a tenable suggestion.

For what it’s worth, Neurosynth clearly agrees with me: the “distortion” L&E point to completely vanishes as soon as one inspects the reverse inference map for “dacc” rather than the forward inference map. Here’s what the two maps look like, side-by-side (incidentally, the code and data used to generate this plot and all the others in this post can be found here):

Meta-analysis of ‘dACC’ in Neurosynth: forward and reverse inference maps (voxel-wise p < .001, uncorrected).

You can see that the extent of dACC in the bottom row (reverse inference) is squarely within the area that L&E take to be the correct extent of dACC (see their Figure 1). So, when we follow L&E’s recommendations, rather than their actual practice, there’s no evidence of any spatial confusion. Researchers (collectively, at least) do know where dACC is. It’s just that, as L&E themselves argue at length earlier in the paper, you would expect to find evidence of that knowledge in the reverse inference map, and not in the forward inference map.

The unobjectionable claim: dACC is associated with pain

Localization issues aside, L&E clearly do have a point when they note that there appears to be a relatively strong association between the posterior dACC and pain. Of course, it’s not a novel point. It couldn’t be, given that L&E’s 2003 claim that social pain and physical pain share common mechanisms was already predicated on the assumption that the dACC is selectively implicated in pain (even though, as I noted above, the putative social exclusion locus reported in that paper was actually centered in pre-SMA and not dACC). Moreover, the Neurosynth pain meta-analysis map that L&E used has been online for nearly 5 years now. Since the reverse inference map is loaded by default on Neurosynth, and the sagittal orthview is by default centered on x = 0, one of the first things anybody sees when they visit this page is the giant pain-related blob in the anterior cingulate cortex. When I give talks on Neurosynth, the preferential activation for pain in the posterior dACC is one of the most common examples I use to illustrate the importance of reverse inference.

But you don’t have to take my word for any of this, because my co-authors and I made this exact point in the 2011 paper introducing Neurosynth, where we observed that:

For pain, the regions of maximal pain-related activation in the insula and DACC shifted from anterior foci in the forward analysis to posterior ones in the reverse analysis. This is consistent with studies of nonhuman primates that have implicated the dorsal posterior insula as a primary integration center for nociceptive afferents and with studies of humans in which anterior aspects of the so-called ‘pain matrix’ responded nonselectively to multiple modalities.

Contrary to what L&E suggest, we did not claim in our paper that reverse inference analysis demonstrates that the dACC is not preferentially associated with any cognitive function; we made the considerably weaker point that accounting for differences in the base rate of activation changes the observed pattern of association for many terms. And we explicitly noted that there is preferential activation for pain in dACC and insula—much as L&E themselves do.

The objectionable claim: dACC is selective for pain

Of course, L&E go beyond the claims made in Yarkoni et al (2011)—and what the Neurosynth page for pain reveals—in that they claim not only that pain is preferentially associated with dACC, but that “the clearest account of dACC function is that it is selectively involved in pain-related processes.” The latter is a much stronger claim, and, if anything, is directly contradicted by the very same kind of evidence (i.e., Neurosynth maps) L&E claim to marshal in its support.

Perhaps the most obvious problem with the claim is that it’s largely based on comparison of pain with just three other groups of terms, reflecting executive function, cognitive conflict, and salience**. This is, on its face, puzzling evidence for the claim that the dACC is pain-selective. By analogy, it would be like giving people a multiple choice question asking whether their favorite color is green, fuchsia, orange, or yellow, and then proclaiming, once results were in, that the evidence suggests that green is the only color people like.

Given that Neurosynth contains more than 3,000 terms, it’s not clear why L&E only compared pain to 3 other candidates. After all, it’s entirely conceivable that dACC might be much more frequently activated by pain than by conflict or executive control, and still also be strongly associated with a large number of other functions. L&E’s only justification for this narrow focus, as far as I can tell, is that they’ve decided to only consider candidate functions that have been previously proposed in the literature:

We first examined forward inference maps for many of the psychological terms that have been associated with dACC activity. These terms were in the categories of pain (“pain”, “painful”, “noxious”), executive control (“executive”, “working memory”, “effort”, “cognitive control”, “cognitive”, “control”), conflict processing (“conflict”, “error”, “inhibition”, “stop signal”, “Stroop”, “motor”), and salience (“salience”, “detection”, “task relevant”, “auditory”, “tactile”, “visual”).

This seems like an odd decision considering that one can retrieve a rank-ordered listing of 3,000+ terms from Neurosynth at the push of a button. More importantly, L&E also omit a bunch of other accounts of dACC function that don’t focus on the above categories—for example, that the dACC is involved in various aspects of value learning (e.g., Kennerley et al., 2006; Behrens et al., 2007), autonomic control (e.g., Critchley et al., 2003), or fear processing (e.g., Milad et al., 2007). In effect, L&E are not really testing whether dACC is selective for pain; what they’re doing is, at best, testing whether the dACC is preferentially associated with pain in comparison to a select number of other candidate processes.

To be fair, L&E do report inspecting the full term rankings, even if they don’t report them explicitly:

Beyond the specific terms we selected for analyses, we also identified which psychological term was associated with the highest Z-score for each of the 8 dACC locations across all the psychological terms in the NeuroSynth database. Despite the fact that there are several hundred psychological terms in the NeuroSynth database, “pain” was the top term for 6 out of 8 locations in the dACC.

This may seem compelling at face value, but there are several problems. First, z-scores don’t provide a measure of strength of effect, they provide (at best) a measure of strength of evidence. Pain has been extensively studied in the fMRI literature, so it’s not terribly surprising if z-scores for pain are larger than z-scores for many other terms in Neurosynth. Saying that dACC is specific to pain because it shows the strongest z-score is like saying that SSRIs are the only effective treatment for depression because a drug study with a sample size of 3,000 found a smaller p-value than a cognitive-behavioral therapy (CBT) study of 100 people. If we want to know if SSRIs beat CBT as a treatment for depression, we need to directly compare effect sizes for the two treatments, not p-values or z-scores. Otherwise we’re conflating how much evidence there is for each effect with how big the effect is. At best, we might be able to claim that we’re more confident that there’s a non-zero association between dACC activation and pain than that there’s a non-zero association between dACC activation and, say, conflict monitoring. But that doesn’t constitute evidence that the dACC is more strongly associated with pain than with conflict.
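To make the distinction concrete, here is a minimal sketch in Python. The effect sizes and sample sizes are invented purely for illustration (they are not taken from any real trial), and the z statistic is the simplest textbook version, a standardized effect multiplied by the square root of the sample size:

```python
import math

def z_score(effect_size, n):
    """z statistic for a standardized effect measured on n observations
    (one-sample case, unit variance assumed)."""
    return effect_size * math.sqrt(n)

# Hypothetical studies: a small effect in a big sample, a larger effect in a small one.
studies = {
    "SSRI": {"effect_size": 0.10, "n": 3000},
    "CBT": {"effect_size": 0.40, "n": 100},
}

z = {name: z_score(s["effect_size"], s["n"]) for name, s in studies.items()}

for name, s in studies.items():
    print(f"{name}: effect size = {s['effect_size']:.2f}, n = {s['n']}, z = {z[name]:.2f}")
```

Under these made-up numbers the treatment with the smaller effect ends up with the larger z-score, simply because it was tested in more people. Ranking terms by z-score has the same problem.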

Second, if one looks at effect size estimates rather than z-scores—which is exactly what one should do if the goal is to make claims about the relative strengths of different associations—then it’s clearly not true that dACC is specific to pain. For the vast majority of voxels within the dACC, ranking associates by descending order of posterior probability results in some term or terms other than pain occupying the top spot. For example, for coordinates (0, 22, 26), we get ‘experiencing’ as the top associate (PP = 86%), then pain (82%), then ‘empathic’ (81%). These results seem to cast dACC in a very different light than simply saying that dACC is involved in pain. Don’t like (0, 22, 26)? Okay, pick a different dACC coordinate. Say (4, 10, 28). Now the top associates are ‘aversive’ (79%), ‘anxiety disorders’ (79%), and ‘conditioned’ (78%) (‘pain’ is a little ways back, hanging out with ‘heart’, ‘skin conductance’, and ‘taste’). Or maybe you’d like something more anterior. Well, at (-2, 30, 22), we have ‘abuse’ (85%), ‘incentive delay’ (84%), ‘nociceptive’ (83%), and ‘substance’ (83%). At (0, 28, 16), we have ‘dysregulation’ (84%), ‘heat’ (83%), and ‘happy faces’ (82%). And so on.

Why didn’t L&E look at the posterior probabilities, which would have been a more appropriate way to compare different terms? They justify the decision as follows:

Because Z-scores are less likely to be inflated from smaller sample sizes than the posterior probabilities, our statistical analyses were all carried out on the Z-scores associated with each posterior probability (21).

While it’s true that terms with fewer associated studies will have more variable (i.e., extreme) posterior probability estimates, this is an unavoidable problem that isn’t in any way remedied by focusing on z-scores instead of posterior probabilities. If some terms have too few studies in Neurosynth to support reliable comparisons with pain, the appropriate thing to do is to withhold judgment until more data is available. One cannot solve the problem of data insufficiency by pretending that p-values or z-scores are measures of effect size.

Meta-analytic contrasts in Neurosynth

It doesn’t have to be this way, mind you. If we want to directly compare effect sizes for different terms—which I think is what L&E want, even if they don’t actually do it—we can do that fairly easily using Neurosynth (though you have to use the Python core tools, rather than the website). The crux of the approach is that we need to directly compare the two conditions (or terms) using only those studies in the Neurosynth database that load on exactly one of the two target terms. This typically results in a rather underpowered test, because we end up working with only a few hundred studies, rather than the full database of 11,000+ studies. But such is our Rumsfeldian life—we do analysis with the data we have, not the data we wish we had.
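The study-selection step is simple enough to sketch in plain Python. To be clear, this is not the Neurosynth API: the tiny "database" below is made up, the function names are my own, and the two-proportion z-test is just one reasonable way to compare activation frequencies at a single voxel.

```python
import math

# Toy stand-in for a study database (all entries invented):
# study id -> (terms the study loads on, whether the voxel of interest is active)
studies = {
    "s1": ({"pain"}, True),
    "s2": ({"pain"}, True),
    "s3": ({"pain", "fear"}, True),   # loads on both terms, so it is excluded
    "s4": ({"fear"}, True),
    "s5": ({"fear"}, False),
    "s6": ({"pain"}, False),
    "s7": ({"memory"}, False),        # loads on neither term, so it is excluded
    "s8": ({"fear"}, False),
}

def exclusive_ids(studies, term_a, term_b):
    """Return the ids of studies that load on exactly one of the two terms."""
    has_a = {sid for sid, (terms, _) in studies.items() if term_a in terms}
    has_b = {sid for sid, (terms, _) in studies.items() if term_b in terms}
    return has_a - has_b, has_b - has_a

def activation_rate(studies, ids):
    """Proportion of the given studies that report activation at the voxel."""
    return sum(studies[sid][1] for sid in ids) / len(ids)

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0

pain_only, fear_only = exclusive_ids(studies, "pain", "fear")
rate_pain = activation_rate(studies, pain_only)
rate_fear = activation_rate(studies, fear_only)
contrast_z = two_proportion_z(rate_pain, len(pain_only), rate_fear, len(fear_only))

print(sorted(pain_only), sorted(fear_only))
print(f"pain: {rate_pain:.2f}, fear: {rate_fear:.2f}, z = {contrast_z:.2f}")
```

Note how quickly the sample shrinks: of eight toy studies, only six survive the "exactly one term" filter, which is the power problem in miniature.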

In any case, if we conduct direct meta-analytic contrasts of pain versus a bunch of other terms like salience, emotion, and cognitive control, we get results that look like this:

Meta-analytic contrasts involving pain or autonomic function (p < .001, uncorrected).

These maps are thresholded very liberally (p < .001, uncorrected), so we should be wary of reading too much into them. And, as noted above, power for meta-analytic contrasts in Neurosynth is typically quite low. Still, it’s pretty clear that the results don’t support L&E’s conclusion. While pain does indeed activate the dACC with significantly higher probability than some other topics (e.g., emotion or touch), dACC activation doesn’t differentiate pain from a number of other viable candidates (e.g., salience, fear, and autonomic control). Moreover, there are other contrasts not involving pain that also elicit significant differences—e.g., between autonomic control and emotion, or fear and cognitive control.

Given that this is the correct way to test for activation differences between different Neurosynth maps, if we were to take seriously the idea that more frequent dACC activation in pain studies than in other kinds of studies implies pain selectivity, the above results would seem to indicate that dACC isn’t selective to pain (or at least, that there’s no real evidence for that claim). Perhaps we could reasonably say that dACC cares more about pain than, say, emotion (though, as discussed below, even that’s not a given); but that’s hardly the same thing as saying that “the best psychological description of dACC function is related to pain processing”.

A > B does not imply ~B

Of course, we wouldn’t want to buy L&E’s claim that the dACC is selective for pain even if the dACC did show significantly more frequent activation for pain than for all other terms, because showing that dACC activation is greater for task A than task B (or even tasks B through Z) doesn’t entail that the dACC is not also important for task B. By analogy, demonstrating that people on average prefer the color blue to the color green doesn’t entitle us to conclude that nobody likes green.

In fairness, L&E do say that the other candidate terms they examined don’t show any associations with the dACC in the Neurosynth reverse inference maps. For instance, they show us this figure:

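Figure from L&E: Neurosynth reverse inference maps (mid-sagittal slice) for pain and a handful of other candidate terms.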

A cursory inspection indeed reveals very little going on for terms other than pain. But this is pretty woeful evidence for the claim of no effect, as it’s based on low-resolution visual inspection of just one mid-sagittal brain slice for just a handful of terms. The only quantitative support L&E marshal for their “nothing else activates dACC” claim is an inspection of activation at 8 individual voxels within dACC, which they report largely fail to activate for anything other than pain. The latter is not a very comprehensive analysis, and makes one wonder why L&E didn’t do something a little more systematic given the strength of their claim (e.g., they could have averaged over all dACC voxels and tested whether activation occurs more frequently than chance for each term).

As it turns out, when we look at the entire dACC rather than just 8 voxels, there’s plenty of evidence that the dACC does in fact care about things other than pain. You can easily see this on neurosynth.org just by browsing around for a few minutes, but to spare you the trouble, here are reverse inference maps for a bunch of terms that L&E either didn’t analyze at all, or looked at in only the 8 selected voxels (the pain map is displayed in the first row for reference):

Reverse inference maps for selected Neurosynth topics that display activation in dACC (p < .001, uncorrected).

In every single one of these cases, we see significant associations with dACC activation in the reverse inference meta-analysis. The precise location of activation varies from case to case (which might lead us to question whether it makes sense to talk about dACC as a monolithic system with a unitary function), but the point is that pain is clearly not the only process that activates dACC. So the notion that dACC is selective to pain doesn’t survive scrutiny even if you use L&E’s own criteria.

The limits of Neurosynth

All of the above problems are, in my view, already sufficient to lay the argument that dACC is pain selective to rest. But there’s another still more general problem with the L&E analysis that would, in my view, be sufficient to warrant extreme skepticism about their conclusion even if you knew nothing at all about the details of the analysis. Namely, in arguing for pain selectivity, L&E ignore many of the known limitations of Neurosynth. There are a number of reasons to think that—at least in its present state—Neurosynth simply can’t support the kind of inference that L&E are trying to draw. While L&E do acknowledge some of these limitations in their Discussion section, in my view, they don’t take them nearly as seriously as they ought to.

First, it’s important to remember that Neurosynth can’t directly tell us whether activation is specific to pain (or any other process), because terms in Neurosynth are just that—terms. They’re not carefully assigned task labels, let alone actual mental states. The strict interpretation of a posterior probability of 80% for pain in a dACC voxel is that, if we were to take 11,000 published fMRI studies and pretend that exactly 50% of them included the term ‘pain’ in their abstracts, the presence of activation in the voxel in question should increase our estimate of the likelihood of the term ‘pain’ occurring from 50% to 80%. If this seems rather weak, that’s because it is. It’s something of a leap to go from words in abstracts to processes in people’s heads.
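Here is that calculation as a small Python sketch. The two activation rates are hypothetical values I picked so that the uniform-prior answer comes out at 80%, and the 3% base rate is likewise just an illustrative guess at how often a term might really appear:

```python
def posterior_term_given_activation(p_act_given_term, p_act_given_no_term, prior):
    """Bayes' rule: P(term | activation) for a given prior P(term)."""
    numerator = p_act_given_term * prior
    evidence = numerator + p_act_given_no_term * (1 - prior)
    return numerator / evidence

# Hypothetical activation rates at one dACC voxel.
P_ACT_GIVEN_PAIN = 0.40      # share of 'pain' studies reporting activation there
P_ACT_GIVEN_NO_PAIN = 0.10   # share of all other studies reporting activation there

# Pretend exactly half of all studies use the term (the uniform prior).
uniform = posterior_term_given_activation(P_ACT_GIVEN_PAIN, P_ACT_GIVEN_NO_PAIN, 0.50)

# Same data, but with an assumed empirical base rate of 3%.
empirical = posterior_term_given_activation(P_ACT_GIVEN_PAIN, P_ACT_GIVEN_NO_PAIN, 0.03)

print(f"posterior with 50% prior: {uniform:.2f}")
print(f"posterior with 3% prior:  {empirical:.2f}")
```

The same activation data give very different posteriors depending on the prior, which is why the 80% figure only means something relative to the pretend 50% starting point.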

Now, in most cases, I think it’s a perfectly defensible leap. I don’t begrudge anyone for treating Neurosynth terms as if they were decent proxies for mental states or cognitive tasks. I do it myself all the time, and I don’t feel apologetic about it. But that’s because it’s one thing to use Neurosynth to support a loose claim like “some parts of the dACC are preferentially associated with pain”, and quite another to claim that the dACC is selective for pain, that virtually nothing else activates dACC, and that “pain represents the best psychological characterization of dACC function”. The latter is an extremely strong claim that requires one to demonstrate not only that there’s a robust association between dACC and pain (which Neurosynth supports), but also that (i) the association is meaningfully stronger than every other potential candidate, and (ii) no other process activates dACC in a meaningful way independently of its association with pain. L&E have done neither of these things, and frankly, I can’t imagine how they could do such a thing—at least, not with Neurosynth.

Second, there’s the issue of bias. Terms in Neurosynth are only good proxies for mental processes to the extent that they’re accurately represented in the literature. One important source of bias many people often point to (including L&E) is that if the results researchers report are colored by their expectations—which they almost certainly are—then Neurosynth is likely to reflect that bias. So, for example, if people think dACC supports pain, and disproportionately report activation in dACC in their papers (relative to other regions), the Neurosynth estimate of the pain-dACC association is likely to be biased upwards. I think this is a legitimate concern, though (for technical reasons I won’t get into here) I also think it’s overstated. But there’s a second source of bias that I think is likely to be much more problematic in this particular case, which is that Neurosynth estimates (and, for that matter, estimates from every other large-scale meta-analysis, irrespective of database or method) are invariably biased to some degree by differences in the strength of different experimental manipulations.

To see what I mean, consider that pain is quite easy to robustly elicit in the scanner in comparison with many other processes or states. Basically, you attach some pain-inducing device to someone’s body and turn it on. If the device is calibrated properly and the subject has normal pain perception, you’re pretty much guaranteed to produce the experience of pain. In general, that effect is likely to be large, because it’s easy to induce fairly intense pain in the scanner.

Contrast that with, say, emotion tasks. It’s an open secret in much of emotion research that what passes for an “emotional” stimulus is usually pretty benign by the standards of day-to-day emotional episodes. A huge proportion of studies use affective pictures to induce emotions like fear or disgust, and while there’s no doubt that such images successfully induce some change in emotional state, there are very few subjects who report large changes in experienced emotion (if you doubt this, try replacing the “extremely disgusted” upper anchor of your rating scale with “as disgusted as I would feel if someone threw up next to me” in your next study). One underappreciated implication of this is that if we decide to meta-analytically compare brain activation during emotion with brain activation during pain, our results are necessarily going to be biased by differences in the relative strengths of the two kinds of experimental manipulation—independently of any differences in the underlying neural substrates of pain and emotion. In other words, we may be comparing apples to oranges without realizing it. If we suppose, for the sake of hypothesis, that the dACC plays the same role in pain and emotion, and then compare strong manipulations of pain with weak manipulations of emotion, we would be confounding differences in experimental strength with differences in underlying psychology and biology. And we might well conclude that dACC is more important for pain than emotion—all because we have no good way of correcting for this rather mundane bias.
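A back-of-the-envelope version of this argument, sketched in Python. Everything here is an assumption made for illustration: the region responds identically to both kinds of state, the only thing that differs is how strongly the state is induced, each study has 20 subjects, and a voxel counts as "active" when a one-sided z-test passes p < .001.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(1 - 0.001)  # one-sided p < .001 threshold

def activation_probability(effect_size, n_subjects):
    """Chance that a single study reports the voxel as active,
    given the standardized effect the manipulation produces there."""
    expected_z = effect_size * n_subjects ** 0.5
    return 1 - STD_NORMAL.cdf(Z_CRIT - expected_z)

N_SUBJECTS = 20          # assumed typical sample size
STRONG_EFFECT = 0.8      # hypothetical effect of a strong (pain-like) manipulation
WEAK_EFFECT = 0.3        # hypothetical effect of a weak (picture-viewing-like) manipulation

p_strong = activation_probability(STRONG_EFFECT, N_SUBJECTS)
p_weak = activation_probability(WEAK_EFFECT, N_SUBJECTS)

print(f"strong manipulation: reported active in {p_strong:.0%} of studies")
print(f"weak manipulation:   reported active in {p_weak:.0%} of studies")
```

With these assumed values the strongly manipulated state shows up as "active" far more often, even though by construction the region plays exactly the same role in both.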

In point of fact, I think something like this is almost certainly true for the pain map in Neurosynth. One way to see this is to note that when we meta-analytically compare pain with almost any other term in Neurosynth (see the figure above), there are typically a lot of brain regions (extending well outside of dACC and other putative pain regions) that show greater activation for pain than for the comparison condition, and very few brain regions that show the converse pattern. I don’t think it’s plausible to think that much of the brain really prizes pain representation above all else. A more sensible interpretation is that the Neurosynth posterior probability estimates for pain are inflated to some degree by the relative ease of inducing pain experimentally. I’m not sure there’s any good way to correct for this, but given that small differences in posterior probabilities (e.g., going from 80% to 75%) would probably have large effects on the rank order of different terms, I think the onus is on L&E to demonstrate why this isn’t a serious concern for their analysis.

But it’s still good for plenty of other stuff!

Having spent a lot of time talking about Neurosynth’s limitations—and all the conclusions one can’t draw from reverse inference maps in Neurosynth—I want to make sure I don’t leave you with the wrong impression about where I see Neurosynth fitting into the cognitive neuroscience ecosystem. Despite its many weaknesses, I still feel quite strongly that Neurosynth is one of the most useful tools we have at the moment for quantifying the relative strengths of association between psychological processes and neurobiological substrates. There are all kinds of interesting uses for the data, website, and software that are completely unobjectionable. I’ve seen many published articles use Neurosynth in a variety of interesting ways, and a few studies have even used Neurosynth as their primary data source (and my colleagues and I have several more on the way). Russ Poldrack and I have a forthcoming paper in Annual Review of Psychology in which we review some of the ways databases like Neurosynth can play an invaluable role in the brain mapping enterprise. So clearly, I’m the last person who would tell anyone that Neurosynth isn’t useful for anything. It’s useful for a lot of things; but it probably shouldn’t be the primary source of evidence for very strong claims about brain-cognition or brain-behavior relationships.

What can we learn about the dACC using Neurosynth? A number of things. Here are some conclusions I think one can reasonably draw based solely on inspection of Neurosynth maps:

  • There are parts of dACC (particularly the more posterior aspects) that are preferentially activated in studies involving painful stimulation.
  • It’s likely that parts of dACC play a greater role in some aspect of pain processing than in many other candidate processes that at various times have been attributed to dACC (e.g., monitoring for cognitive conflict)—though we should be cautious, because in some cases some of those other functions are clearly represented in dACC, just in different sectors.
  • Many of the same regions of dACC that preferentially activate during pain are also preferentially activated by other processes or tasks—e.g., fear conditioning, autonomic arousal, etc.

I think these are all interesting and potentially important observations. They’re hardly novel, of course, but it’s still nice to have convergent meta-analytic support for claims that have been made using other methods.

So what does the dACC do?

Having read this far, you might be thinking, well if dACC isn’t selective for pain, then what does it do? While I don’t pretend to have a good answer to this question, let me make three tentative observations about the potential role of dACC in cognition that may or may not be helpful.

First, there’s actually no particular reason why dACC has to play any unitary role in cognition. It may be a human conceit to think that just because we can draw some nice boundaries around a region and give it the name ‘dACC’, there must be some corresponding sensible psychological process that passably captures what all the neurons within that chunk of tissue are doing. But the dACC is a large brain region that contains hundreds of millions of neurons with enormously complex response profiles and connectivity patterns. There’s no reason why nature should respect our human desire for simple, interpretable models of brain function. To the contrary, our default assumption should probably be that there’s considerable functional heterogeneity within dACC, so that slapping a label like “pain” onto the entire dACC is almost certainly generating more heat than light.

Second, to the degree that we nevertheless insist on imposing a single unifying label on the entire dACC, it’s very unlikely that a generic characterization like “pain” is up to the job. While we can reasonably get away with loosely describing some (mostly sensory) parts of the brain as broadly supporting vision or motor function, the dACC—a frontal region located much higher in the processing hierarchy—is unlikely to submit to a similar analysis. It’s telling that most of the serious mechanistic accounts of dACC function have shied away from extensional definitions of regional function like “pain” or “emotion” and have instead focused on identifying broad computational roles that dACC might play. Thus, we have suggestions that dACC might be involved in response selection, conflict monitoring, or value learning. While these models are almost certainly wrong (or at the very least, grossly incomplete), they at least attempt to articulate some kind of computational role dACC circuits might be playing in cognition. Saying that the dACC is for “pain”, by contrast, tells us nothing about the nature of the representations in the region.

To their credit, L&E do address this issue to some extent. Specifically, they suggest that the dACC may be involved in monitoring for “survival-relevant goal conflicts”. Admittedly, it’s a bit odd that L&E make such a suggestion at all, seeing as it directly contradicts everything they argue for in the rest of the paper (i.e., if the dACC supports detection of the general class of things that are relevant for survival, then it is by definition not selective for pain, and vice versa). Contradictions aside, however, L&E’s suggestion is not completely implausible. As the Neurosynth maps above show, the dACC is clearly preferentially activated by fear conditioning, autonomic control, and reward—all of which could broadly be construed as “survival-relevant”. The main difficulty for L&E’s survival account comes from (a) the lack of evidence of dACC involvement in other clearly survival-relevant stimuli or processes—e.g., disgust, respiration, emotion, or social interaction, and (b) the availability of other much more plausible theories of dACC function (see the next point). Still, if we’re relying strictly on Neurosynth for evidence, we can give L&E the benefit of the doubt and reserve judgment on their survival-relevant account until more data becomes available. In the interim, what should not be controversial is that such an account has no business showing up in a paper titled “the dorsal anterior cingulate cortex is selective for pain”—a claim it is completely incompatible with.

Third, theories of dACC function based largely on fMRI evidence don’t (or shouldn’t) operate in a vacuum. Over the past few decades, literally thousands of animal and human studies have investigated the structure and function of the anterior cingulate cortex. Many of these studies have produced considerable insights into the role of the ACC (including dACC), and I think it’s safe to say that they collectively offer a much richer understanding than what fMRI studies—let alone a meta-analytic engine like Neurosynth—have produced to date. I’m especially partial to the work of Brent Vogt and colleagues (e.g., Vogt, 2005; Vogt & Sikes, 2009), who have suggested a division within the anterior mid-cingulate cortex (aMCC; a region roughly co-extensive with the dACC in L&E’s nomenclature) between a posterior region involved in bodily orienting, and an anterior region associated with fear and avoidance behavior (though the two functions overlap in space to a considerable degree). Schematically, their “four-region” architectural model looks like this:

The Vogt et al. “four-region” model of cingulate architecture. Fig. 14.13. in Vogt & Sikes (2009).

While the aMCC is assumed to contain many pain-selective neurons (as do more anterior sectors of the cingulate), it’s demonstrably not pain-selective, as neurons throughout the aMCC also respond to other stimuli (e.g., non-painful touch, fear cues, etc.).

Aside from being based on an enormous amount of evidence from lesion, electrophysiology, and imaging studies, the Vogt characterization of dACC/aMCC has several other nice features. For one thing, it fits almost seamlessly with the Neurosynth results displayed above (e.g., we find MCC activation associated with pain, fear, autonomic, and sensorimotor processes, with pain and fear overlapping closely in aMCC). For another, it provides an elegant and parsimonious explanation for the broad extent of pain-related activation in anterior cingulate cortex even though no part of aMCC is selective for pain (i.e., unlike other non-physical stimuli, pain involves skeletomotor orienting, and unlike non-painful touch, it elicits avoidance behavior and subjective unpleasantness).

Perhaps most importantly, Vogt and colleagues freely acknowledge that their model—despite having a very rich neuroanatomical elaboration—is only an approximation. They don’t attempt to ascribe a unitary role to aMCC or dACC, and they explicitly recognize that there are distinct populations of neurons involved in reward processing, response selection, value learning, and other aspects of emotion and cognition all closely interdigitated with populations involved in aspects of pain, touch, and fear. Other systems-level neuroanatomical models of cingulate function share this respect for the complexity of the underlying circuitry—complexity that cannot be adequately approximated by labeling the dACC simply as a pain region (or, for that matter, a “survival-relevance” region).

Conclusion

Lieberman & Eisenberger (2015) argue, largely on the basis of evidence from my Neurosynth framework, that the dACC is selective for pain. They are wrong. Neurosynth does not—and, at present, cannot—support such a conclusion. Moreover, a more careful examination of Neurosynth results directly refutes Lieberman and Eisenberger’s claims, providing clear evidence that the dACC is associated with many other operations, and converging with extensive prior animal and human work to suggest a far more complex view of dACC function.

 

* This is probably the first time I’ve been able to call myself the world’s foremost expert on anything while keeping a straight face. It feels pretty good.

** L&E show meta-analysis maps for a few more terms in an online supplement, but barely discuss them, even though at least one term (fear) clearly activates very similar parts of dACC.

what exactly is it that 53% of neuroscience articles fail to do?

[UPDATE: Jake Westfall points out in the comments that the paper discussed here appears to have made a pretty fundamental mistake that I then carried over to my post. I’ve updated the post accordingly.]

[UPDATE 2: the lead author has now responded and answered my initial question and some follow-up concerns.]

A new paper in Nature Neuroscience by Emmeke Aarts and colleagues argues that neuroscientists should start using hierarchical (or multilevel) models in their work in order to account for the nested structure of their data. From the abstract:

In neuroscience, experimental designs in which multiple observations are collected from a single research object (for example, multiple neurons from one animal) are common: 53% of 314 reviewed papers from five renowned journals included this type of data. These so-called ‘nested designs’ yield data that cannot be considered to be independent, and so violate the independency assumption of conventional statistical methods such as the t test. Ignoring this dependency results in a probability of incorrectly concluding that an effect is statistically significant that is far higher (up to 80%) than the nominal α level (usually set at 5%). We discuss the factors affecting the type I error rate and the statistical power in nested data, methods that accommodate dependency between observations and ways to determine the optimal study design when data are nested. Notably, optimization of experimental designs nearly always concerns collection of more truly independent observations, rather than more observations from one research object.

I don’t have any objection to the advocacy for hierarchical models; that much seems perfectly reasonable. If you have nested data, where each subject (or petri dish or animal or whatever) provides multiple samples, it’s sensible to try to account for as many systematic sources of variance as you can. That point may have been made many times before, but it never hurts to make it again.

What I do find surprising though–and frankly, have a hard time believing–is the idea that 53% of neuroscience articles are at serious risk of Type I error inflation because they fail to account for nesting. This seems to me to be what the abstract implies, yet it’s a much stronger claim that doesn’t actually follow just from the observation that virtually no studies that have reported nested data have used hierarchical models for analysis. What it also requires is for all of those studies that use “conventional” (i.e., non-hierarchical) analyses to have actively ignored the nesting structure and treated repeated measurements as if they in fact came from entirely different subjects or clusters.

To make this concrete, suppose we have a dataset made up of 400 observations, consisting of 20 subjects who each provided 10 trials in 2 different experimental conditions (i.e., 20 x 2 x 10 = 400). And suppose the thing we ultimately want to know is whether or not there’s a statistical difference in outcome between the two conditions. There are at least three ways we could set up our comparison:

  1. Ignore the grouping variable (i.e., subject) entirely, effectively giving us 200 observations in each condition. We then conduct the test as if we have 200 independent observations in each condition.
  2. Average the 10 trials in each condition within each subject first, then conduct the test on the subject means. In this case, we effectively have 20 observations in each condition (1 per subject).
  3. Explicitly include the effects of both subject and trial in our model. In this case we have 400 observations, but we’re explicitly accounting for the correlation between trials within a given subject, so that the statistical comparison of conditions effectively has somewhere between 20 and 400 “observations” (or degrees of freedom).

Now, none of these approaches is strictly “wrong”, in that there could be specific situations in which any one of them would be called for. But as a general rule, the first approach is almost never appropriate. The reason is that we typically want to draw conclusions that generalize across the cases in the higher level of the hierarchy, and don’t have any intrinsic interest in the individual trials themselves. In the above example, we’re asking whether people, on average, behave differently in the two conditions. If we treat our data as if we had 200 subjects in each condition, effectively concatenating trials across all subjects, we’re ignoring the fact that the responses acquired from each subject will tend to be correlated (i.e., Jane Doe’s behavior on Trial 2 will tend to be more similar to her own behavior on Trial 1 than to another subject’s behavior on Trial 1). So we’re pretending that we know something about 200 different individuals sampled at random from the population, when in fact we only know something about 20 different individuals. The upshot, if we use approach (1), is that we’re going to end up answering a question quite different from the one we think we’re answering. [Update: Jake Westfall points out in the comments below that we won’t necessarily inflate Type I error rate. Rather, the net effect of failing to model the nesting structure properly will depend on the relative amount of within-cluster vs. between-cluster variance. The answer we get will, however, usually deviate considerably from the answer we would get using approaches (2) or (3).]

By contrast, approaches (2) and (3) will, in most cases, produce pretty similar results. It’s true that the hierarchical approach is generally a more sensible thing to do, and will tend to provide a better estimate of the true population difference between the two conditions. However, it’s probably better to describe approach (2) as suboptimal, and not as wrong. So long as the subjects in our toy example above are in fact sampled at random, it’s pretty reasonable to assume that we have exactly 20 independent observations, and analyze our data accordingly. Our resulting estimates might not be quite as good as they could have been, but we’re unlikely to miss the mark by much.

To return to the Aarts et al paper, the key question is what exactly the authors mean when they say in their abstract that:

In neuroscience, experimental designs in which multiple observations are collected from a single research object (for example, multiple neurons from one animal) are common: 53% of 314 reviewed papers from five renowned journals included this type of data. These so-called ‘nested designs’ yield data that cannot be considered to be independent, and so violate the independency assumption of conventional statistical methods such as the t test. Ignoring this dependency results in a probability of incorrectly concluding that an effect is statistically significant that is far higher (up to 80%) than the nominal α level (usually set at 5%).

I’ve underlined the key phrases here. It seems to me that the implication the reader is supposed to draw from this is that roughly 53% of the neuroscience literature is at high risk of reporting spurious results. But in reality this depends entirely on whether the authors mean that 53% of studies are modeling trial-level data but ignoring the nesting structure (as in approach 1 above), or that 53% of studies in the literature aren’t using hierarchical models, even though they may be doing nothing terribly wrong otherwise (e.g., because they’re using approach (2) above).

Unfortunately, the rest of the manuscript doesn’t really clarify the matter. Here’s the section in which the authors report how they obtained that 53% number:

To assess the prevalence of nested data and the ensuing problem of inflated type I error rate in neuroscience, we scrutinized all molecular, cellular and developmental neuroscience research articles published in five renowned journals (Science, Nature, Cell, Nature Neuroscience and every month’s first issue of Neuron) in 2012 and the first six months of 2013. Unfortunately, precise evaluation of the prevalence of nesting in the literature is hampered by incomplete reporting: not all studies report whether multiple measurements were taken from each research object and, if so, how many. Still, at least 53% of the 314 examined articles clearly concerned nested data, of which 44% specifically reported the number of observations per cluster with a minimum of five observations per cluster (that is, for robust multilevel analysis a minimum of five observations per cluster is required [11, 12]). The median number of observations per cluster, as reported in literature, was 13 (Fig. 1a), yet conventional analysis methods were used in all of these reports.

This is, as far as I can see, still ambiguous. The only additional information provided here is that 44% of studies specifically reported the number of observations per cluster. Unfortunately this still doesn’t tell us whether the effective degrees of freedom used in the statistical tests in those papers included nested observations, or instead averaged over nested observations within each group or subject prior to analysis.

Lest this seem like a rather pedantic statistical point, I hasten to emphasize that a lot hangs on it. The potential implications for the neuroscience literature are very different under each of these two scenarios. If it is in fact true that 53% of studies are inappropriately using a “fixed-effects” model (approach 1)–which seems to me to be what the Aarts et al abstract implies–the upshot is that a good deal of neuroscience research is in very bad statistical shape, and the authors will have done the community a great service by drawing attention to the problem. On the other hand, if the vast majority of the studies in that 53% are actually doing their analyses in a perfectly reasonable–if perhaps suboptimal–way, then the Aarts et al article seems rather alarmist. It would, of course, still be true that hierarchical models should be used more widely, but the cost of failing to switch would be much lower than seems to be implied.

I’ve emailed the corresponding author to ask for a clarification. I’ll update this post if I get a reply. In the meantime, I’m interested in others’ thoughts as to the likelihood that around half of the neuroscience literature involves inappropriate reporting of fixed-effects analyses. I guess personally I would be very surprised if this were the case, though it wouldn’t be unprecedented–e.g., I gather that in the early days of neuroimaging, the SPM analysis package used a fixed-effects model by default, resulting in quite a few publications reporting grossly inflated t/z/F statistics. But that was many years ago, and in the literatures I read regularly (in psychology and cognitive neuroscience), this problem rarely arises any more. A priori, I would have expected the same to be true in cellular and molecular neuroscience.


UPDATE 04/01 (no, not an April Fool’s joke)

The lead author, Emmeke Aarts, responded to my email. Here’s her reply in full:

Thank you for your interest in our paper. As the first author of the paper, I will answer the question you send to Sophie van der Sluis. Indeed we report that 53% of the papers include nested data using conventional statistics, meaning that they did not use multilevel analysis but an analysis method that assumes independent observations like a students t-test or ANOVA.

As you also note, the data can be analyzed at two levels, at the level of the individual observations, or at the subject/animal level. Unfortunately, with the information the papers provided us, we could not extract this information for all papers. However, as described in the section ‘The prevalence of nesting in neuroscience studies’, 44% of these 53% of papers including nested data, used conventional statistics on the individual observations, with at least a mean of 5 observations per subject/animal. Another 7% of these 53% of papers including nested data used conventional statistics at the subject/animal level. So this leaves 49% unknown. Of this 49%, there is a small percentage of papers which analyzed their data at the level of individual observations, but had a mean less than 5 observations per subject/animal (I would say 10 to 20% out of the top of my head), the remaining percentage is truly unknown. Note that with a high level of dependency, using conventional statistics on nested data with 2 observations per subject/animal is already undesirable. Also note that not only analyzing nested data at the individual level is undesirable, analyzing nested data at the subject/animal level is unattractive as well, as it reduces the statistical power to detect the experimental effect of interest (see fig. 1b in the paper), in a field in which a decent level of power is already hard to achieve (e.g., Button 2013).

I think this definitively answers my original question: according to Aarts, of the 53% of studies that used nested data, at least 44% performed conventional (i.e., non-hierarchical) statistical analyses on the individual observations. (I would dispute the suggestion that this was already stated in the paper; the key phrase is “on the individual observations”, and the wording in the manuscript was much more ambiguous.) Aarts suggests that ~50% of the studies couldn’t be readily classified, so in reality that proportion could be much higher. But we can say that at least 23% of the literature surveyed committed what would, in most domains, constitute a fairly serious statistical error.
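In case the 23% figure looks like it came out of nowhere, here’s the arithmetic as a tiny Python sketch. The percentages are the ones quoted above; the variable names are mine, and the paper counts are only rough conversions, since the paper reports “at least 53%” rather than an exact number.

```python
n_papers = 314            # papers reviewed by Aarts et al
share_nested = 0.53       # at least this share clearly involved nested data
share_trial_level = 0.44  # of those, analyzed individual observations (>= 5 per cluster)

# Lower bound on the share of ALL reviewed papers analyzed at the observation level
share_of_literature = share_nested * share_trial_level
approx_papers = share_of_literature * n_papers  # rough count, not reported in the paper

print(f"{share_of_literature:.1%} of reviewed papers, roughly {approx_papers:.0f} papers")
```

Because about half of the nested-data papers couldn’t be classified, this is a floor, not a point estimate.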

I then sent Aarts another email following up on Jake Westfall’s comment (i.e., how nested vs. crossed designs were handled). She replied:

As Jake Westfall points out, it indeed depends on the design if ignoring intercept variance (so variance in the mean observation per subject/animal) leads to an inflated type I error. There are two types of designs we need to distinguish here, design type I, where the experimental variable (for example control or experimental group) does not vary within the subjects/animals but only over the subjects/animals, and design Type II, where the experimental variable does vary within the subject/animal. Only in design type I, the type I error is increased by intercept variance. As pointed out in the discussion section of the paper, the paper only focuses on design Type I (“Here we focused on the most common design, that is, data that span two levels (for example, cells in mice) and an experimental variable that does not vary within clusters (for example, in comparing cell characteristic X between mutants and wild types, all cells from one mouse have the same genotype)”), to keep this already complicated matter accessible to a broad readership. Moreover, design type I is what is most frequently seen in biological neuroscience, taking multiple observations from one animal and subsequently comparing genotypes automatically results in a type I research design.

When dealing with a research design II, it is actually the variation in effect within subject/animals that increases the type I error rate (the so-called slope variance), but I will not elaborate too much on this since it is outside the scope of this paper and a completely different story.
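To get a feel for what Aarts is describing under design type I, here’s a small simulation sketch in Python. It is not the analysis from the paper; it’s a toy setup I made up (5 animals per group, 10 cells per animal, equal between-animal and within-animal variance, no true group difference), and the hard-coded critical values only apply to those default sizes. It compares how often a conventional t-test rejects the null when run on individual cells versus on per-animal means.

```python
import math
import random
import statistics


def two_sample_t(a, b):
    """Classic pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb))


def false_positive_rates(n_sims=1000, animals_per_group=5, cells_per_animal=10,
                         animal_sd=1.0, cell_sd=1.0, seed=1):
    """Simulate null data (no group effect) and count rejections at alpha = 0.05."""
    rng = random.Random(seed)
    # Two-sided 5% critical values; valid only for the default group sizes.
    crit_cells = 1.984    # df = 2 * 5 * 10 - 2 = 98
    crit_animals = 2.306  # df = 2 * 5 - 2 = 8
    hits_cells = hits_animals = 0
    for _ in range(n_sims):
        groups = []
        for _group in range(2):
            animals = []
            for _animal in range(animals_per_group):
                animal_mean = rng.gauss(0, animal_sd)  # shared by all of its cells
                animals.append([animal_mean + rng.gauss(0, cell_sd)
                                for _ in range(cells_per_animal)])
            groups.append(animals)
        # Analysis at the level of individual observations (nesting ignored)
        cells = [[x for animal in group for x in animal] for group in groups]
        if abs(two_sample_t(cells[0], cells[1])) > crit_cells:
            hits_cells += 1
        # Analysis at the level of animals (one mean per animal)
        means = [[statistics.mean(animal) for animal in group] for group in groups]
        if abs(two_sample_t(means[0], means[1])) > crit_animals:
            hits_animals += 1
    return hits_cells / n_sims, hits_animals / n_sims


rate_cells, rate_animals = false_positive_rates()
print(f"cell-level test: {rate_cells:.2f}, animal-level test: {rate_animals:.2f}")
```

Under these made-up settings the cell-level test should reject far more often than the nominal 5%, while the animal-level test should stay close to it; the exact numbers depend on the seed and on how much of the variance sits between animals.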

Again, this all sounds very straightforward and sound to me. So after both of these emails, here’s my (hopefully?) final take on the paper:

  • Work in molecular, cellular, and developmental neuroscience–or at least, the parts of those fields well-represented in five prominent journals–does indeed appear to suffer from some systemic statistical problems. While the proportion of studies at high risk of Type I error is smaller than the number Aarts et al’s abstract suggests (53%), the latter, more accurate, estimate (at least 23% of the literature) is still shockingly high. This doesn’t mean that a quarter or more of the literature can’t be trusted–as some of the commenters point out below, most conclusions aren’t based on just a single p value from a single analysis–but it does raise some very serious concerns. The Aarts et al paper is an important piece of work that will help improve statistical practice going forward.
  • The comments on this post, and on Twitter, have been interesting to read. There appear to be two broad camps of people who were sympathetic to my original concern about the paper. One camp consists of people who were similarly concerned about technical aspects of the paper, and in most cases were tripped up by the same confusion surrounding what the authors meant when they said 53% of studies used “conventional statistical analyses”. That point has now been addressed. The other camp consists of people who appear to work in the areas of neuroscience Aarts et al focused on, and were reacting not so much to the specific statistical concern raised by Aarts et al as to the broader suggestion that something might be deeply wrong with the neuroscience literature because of this. I confess that my initial knee-jerk reaction to the Aarts et al paper was driven in large part by the intuition that surely it wasn’t possible for so large a fraction of the literature to be routinely modeling subjects/clusters/groups as fixed effects. But since it appears that that is in fact the case, I’m not sure what to say with respect to the broader question over whether it is or isn’t appropriate to ignore nesting in animal studies. I will say that in the domains I personally work in, it seems very clear that collapsing across all subjects for analysis purposes is nearly always (if not always) a bad idea. Beyond that, I don’t really have any further opinion other than what I said in this response to a comment below.
  • While the claims made in the paper appear to be fundamentally sound, the presentation leaves something to be desired. It’s unclear to me why the authors relegated some of the most important technical points to the Discussion, or didn’t explicitly state them at all. The abstract also seems to me to be overly sensational–though, in hindsight, not nearly as much as I initially suspected. And it also seems questionable to tar all of neuroscience with a single brush when the analyses reported only applied to a few specific domains (and we know for a fact that in, say, neuroimaging, this problem is almost nonexistent). I guess to be charitable, one could pick the same bone with a very large proportion of published work, and this kind of thing is hardly unique to this study. Then again, the fact that a practice is widespread surely isn’t sufficient to justify that practice–or else there would be little point in Aarts et al criticizing a practice that so many people clearly engage in routinely.
  • Given my last post, I can’t help pointing out that this is a nice example of how mandatory data sharing (or failing that, a culture of strong expectations of preemptive sharing) could have made evaluation of scientific claims far easier. If the authors had attached the data file coding the 314 studies they reviewed as a supplement, I (and others) would have been able to clarify the ambiguity I originally raised much more quickly. I did send a follow-up email to Aarts to ask if she and her colleagues would consider putting the data online, but haven’t heard back yet.

unconference in Leipzig! no bathroom breaks!

Many (most?) regular readers of this blog have probably been to at least one academic conference. Some of you even have the misfortune of attending conferences regularly. And a still-smaller fraction of you scholarly deviants might conceivably even enjoy the freakish experience. You know, that whole thing where you get to roam around the streets of some fancy city for a few days seeing old friends, learning about exciting new scientific findings, and completely ignoring the manuscripts and reviews piling up on your desk in your absence. It’s a loathsome, soul-scorching experience. Unfortunately it’s part of the job description for most scientists, so we shoulder the burden without complaining too loudly to the government agencies that force us to go to these things.

This post, thankfully, isn’t about a conference. In fact, it’s about the opposite of a conference, which is… an UNCONFERENCE. An unconference is a social event type of thing that strips away all of the unpleasant features of a regular conference–you know, the fancy dinners, free drinks, and stimulating conversation–and replaces them with a much more authentic academic experience. An authentic experience in which you spend the bulk of your time situated in a 10′ x 10′ room (3 m x 3 m for non-Imperialists) with 10 – 12 other academics, and no one’s allowed to leave the room, eat anything, or take bathroom breaks until someone in the room comes up with a brilliant discovery and wins a Nobel prize. This lasts for 3 days (plus however long it takes for the Nobel to be awarded), and you pay $1200 for the privilege ($1160 if you’re a post-doc or graduate student). Believe me when I tell you that it’s a life-changing experience.

Okay, I exaggerate a bit. Most of those things aren’t true. Here’s one explanation of what an unconference actually is:

An unconference is a participant-driven meeting. The term “unconference” has been applied, or self-applied, to a wide range of gatherings that try to avoid one or more aspects of a conventional conference, such as high fees, sponsored presentations, and top-down organization. For example, in 2006, CNNMoney applied the term to diverse events including Foo Camp, BarCamp, Bloggercon, and Mashup Camp.

So basically, my description was accurate up until the part where I said there were no bathroom breaks.

Anyway, I’m going somewhere with this, I promise. Specifically, I’m going to Leipzig, Germany! In September! And you should come too!

The happy occasion is Brainhack 2012, an unconference organized by the creative minds over at the Neuro Bureau–coordinators of such fine projects as the Brain Art Competition at OHBM (2012 incarnation going on in Beijing right now!) and the admittedly less memorable CNS 2007 Surplus Brain Yard Sale (guess what–turns out selling human brains out of the back of an unmarked van violates all kinds of New York City ordinances!).

Okay, as you can probably tell, I don’t quite have this event promotion thing down yet. So in the interest of ensuring that more than 3 people actually attend this thing, I’ll just shut up now and paste the official description from the Brainhack website:

The Neuro Bureau is proud to announce the 2012 Brainhack, to be held from September 1-4 at the Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany.

Brainhack 2012 is a unique workshop with the goals of fostering interdisciplinary collaboration and open neuroscience. The structure builds from the concepts of an unconference and a hackathon: The term “unconference” refers to the fact that most of the content will be dynamically created by the participants — a hackathon is an event where participants collaborate intensively on science-related projects.

Participants from all disciplines related to neuroimaging are welcome. Ideal participants span in range from graduate students to professors across any disciplines willing to contribute (e.g., mathematics, computer science, engineering, neuroscience, psychology, psychiatry, neurology, medicine, art, etc.). The primary requirement is a desire to work in close collaborations with researchers outside of your specialization in order to address neuroscience questions that are beyond the expertise of a single discipline.

In all seriousness though, I think this will be a blast, and I’m really looking forward to it. I’m contributing the full Neurosynth dataset as one of the resources participants will have access to (more on that in a later post), and I’m excited to see what we collectively come up with. I bet it’ll be at least three times as awesome as the Surplus Brain Yard Sale–though maybe not quite as lucrative.

p.s. I’ll probably also be in Amsterdam, Paris, and Geneva in late August/early September; if you live in one of these fine places and want to show me around, drop me an email. I’ll buy you lunch! Well, except in Geneva. If you live in Geneva, I won’t buy you lunch, because I can’t afford lunch in Geneva. You’ll buy yourself a nice Swiss lunch made of clockwork and gold, and then maybe I’ll buy you a toothpick.

a human and a monkey walk into an fMRI scanner…

Tor Wager and I have a “news and views” piece in Nature Methods this week; we discuss a paper by Mantini and colleagues (in the same issue) introducing a new method for identifying functional brain homologies across different species–essentially, identifying brain regions in humans and monkeys that seem to do roughly the same thing even if they’re not located in the same place anatomically. Mantini et al make some fairly strong claims about what their approach tells us about the evolution of the human brain (namely, that some cortical regions have undergone expansion relative to monkeys, while others have adapted substantively new functions). For reasons we articulate in our commentary, I’m personally not so convinced by the substantive conclusions, but I do think the core idea underlying the method is a very clever and potentially useful one:

Their technique, interspecies activity correlation (ISAC), uses functional magnetic resonance imaging (fMRI) to identify brain regions in which humans and monkeys exposed to the same dynamic stimulus—a 30-minute clip from the movie The Good, the Bad and the Ugly—show correlated patterns of activity (Fig. 1). The premise is that homologous regions should have similar patterns of activity across species. For example, a brain region sensitive to a particular configuration of features, including visual motion, hands, faces, object and others, should show a similar time course of activity in both species—even if its anatomical location differs across species and even if the precise features that drive the area’s neurons have not yet been specified.
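To give a flavor of the core idea, here’s a deliberately simplified Python sketch: correlate every human region’s time course with every monkey region’s time course for the same stimulus, and pick the best-matching partner. This is not Mantini et al’s actual pipeline (which involves real fMRI preprocessing and statistics); the region names, the synthetic time courses, and the `best_matches` helper are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length time courses."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_matches(human, monkey):
    """For each human region, find the monkey region with the most similar time course."""
    matches = {}
    for h_name, h_series in human.items():
        scores = {m_name: pearson(h_series, m_series)
                  for m_name, m_series in monkey.items()}
        best = max(scores, key=scores.get)
        matches[h_name] = (best, scores[best])
    return matches


# Synthetic time courses standing in for responses to the same movie clip
timepoints = range(200)
human = {
    "human_A": [math.sin(0.3 * t) for t in timepoints],
    "human_B": [math.cos(0.07 * t) for t in timepoints],
}
monkey = {
    # Scaled/offset copies: same temporal profile, different "location" and amplitude
    "monkey_X": [2.0 * v + 1.0 for v in human["human_B"]],
    "monkey_Y": [0.5 * v for v in human["human_A"]],
}

for h_name, (m_name, r) in best_matches(human, monkey).items():
    print(f"{h_name} best matches {m_name} (r = {r:.2f})")
```

The point of the toy example is that matching is done purely on temporal profile, so a region can find its partner regardless of where it sits anatomically.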

Mo Costandi has more on the paper in an excellent Guardian piece (and I’m not just saying that because he quoted me a few times). All in all, I think it’s a very exciting method, and it’ll be interesting to see how it’s applied in future studies. I think there’s a fairly broad class of potential applications based loosely around the same idea of searching for correlated patterns. It’s an idea that’s already been used by Uri Hasson (an author on the Mantini et al paper) and others fairly widely in the fMRI literature to identify functional correspondences across different subjects; but you can easily imagine conceptually similar applications in other fields too–e.g., correlating gene expression profiles across species in order to identify structural homologies (actually, one could probably try this out pretty easily using the mouse and human data available in the Allen Brain Atlas).

Mantini D, Hasson U, Betti V, Perrucci MG, Romani GL, Corbetta M, Orban GA, & Vanduffel W (2012). Interspecies activity correlations reveal functional correspondence between monkey and human brain areas. Nature Methods. PMID: 22306809

Wager T, & Yarkoni T (2012). Establishing homology between monkey and human brains. Nature Methods. DOI: 10.1038/nmeth.1869

the New York Times blows it big time on brain imaging

The New York Times has a terrible, terrible Op-Ed piece today by Martin Lindstrom (who I’m not going to link to, because I don’t want to throw any more bones his way). If you believe Lindstrom, you don’t just like your iPhone a lot; you love it. Literally. And the reason you love it, shockingly, is your brain:

Earlier this year, I carried out an fMRI experiment to find out whether iPhones were really, truly addictive, no less so than alcohol, cocaine, shopping or video games. In conjunction with the San Diego-based firm MindSign Neuromarketing, I enlisted eight men and eight women between the ages of 18 and 25. Our 16 subjects were exposed separately to audio and to video of a ringing and vibrating iPhone.

But most striking of all was the flurry of activation in the insular cortex of the brain, which is associated with feelings of love and compassion. The subjects’ brains responded to the sound of their phones as they would respond to the presence or proximity of a girlfriend, boyfriend or family member.

In short, the subjects didn’t demonstrate the classic brain-based signs of addiction. Instead, they loved their iPhones.

There’s so much wrong with just these three short paragraphs (to say nothing of the rest of the article, which features plenty of other whoppers) that it’s hard to know where to begin. But let’s try. Take first the central premise–that an fMRI experiment could help determine whether iPhones are no less addictive than alcohol or cocaine. The tacit assumption here is that all the behavioral evidence you could muster–say, from people’s reports about how they use their iPhones, or clinicians’ observations about how iPhones affect their users–isn’t sufficient to make that determination; to “really, truly” know if something’s addictive, you need to look at what the brain is doing when people think about their iPhones. This idea is absurd inasmuch as addiction is defined on the basis of its behavioral consequences, not (right now, anyway) by the presence or absence of some biomarker. What makes someone an alcoholic is the fact that they’re dependent on alcohol, have trouble going without it, find that their alcohol use interferes with multiple aspects of their day-to-day life, and generally suffer functional impairment because of it–not the fact that their brain lights up when they look at pictures of Johnnie Walker Red. If someone couldn’t stop drinking–to the point where they lost their job, family, and friends–but their brain failed to display a putative biomarker for addiction, it would be strange indeed to say “well, you show all the signs, but I guess you’re not really addicted to alcohol after all.”

Now, there may come a day (and it will be a great one) when we have biomarkers sufficiently accurate that they can stand in for the much more tedious process of diagnosing someone’s addiction the conventional way. But that day is, to put it gently, a long way off. Right now, if you want to know if iPhones are addictive, the best way to do that is to, well, spend some time observing and interviewing iPhone users (and some quantitative analysis would be helpful).

Of course, it’s not clear what Lindstrom thinks an appropriate biomarker for addiction would be in any case. Presumably it would have something to do with the reward system; but what? Suppose Lindstrom had seen robust activation in the ventral striatum–a critical component of the brain’s reward system–when participants gazed upon the iPhone: what then? Would this have implied people are addicted to iPhones? But people also show striatal activity when gazing on food, money, beautiful faces, and any number of other stimuli. Does that mean the average person is addicted to all of the above? A marker of pleasure or reward, maybe (though even that’s not certain), but addiction? How could a single fMRI experiment with 16 subjects viewing pictures of iPhones confirm or disconfirm the presence of addiction? Lindstrom doesn’t say. I suppose he has good reason not to say: if he really did have access to an accurate fMRI-based biomarker for addiction, he’d be in a position to make millions (billions?) off the technology. To date, no one else has come close to identifying a clinically accurate fMRI biomarker for any kind of addiction (for more technical readers, I’m talking here about cross-validated methods that have both sensitivity and specificity comparable to traditional approaches when applied to new subjects–not individual studies that claim 90% within-sample classification accuracy based on simple regression models). So we should, to put it mildly, be very skeptical that Lindstrom’s study was ever in a position to do what he says it was designed to do.

We should also ask all sorts of salient and important questions about who the people are who are supposedly in love with their iPhones. Who’s the “You” in the “You Love Your iPhone” of the title? We don’t know, because we don’t know who the participants in Lindstrom’s sample were, aside from the fact that they were eight men and eight women aged 18 to 25. But we’d like to know some other important things. For instance, were they selected for specific characteristics? Were they, say, already avid iPhone users? Did they report loving, or being addicted to, their iPhones? If so, would it surprise us that people chosen for their close attachment to their iPhones also showed brain activity patterns typical of close attachment? (Which, incidentally, they actually don’t–but more on that below.) And if not, are we to believe that the average person pulled off the street–who probably has limited experience with iPhones–really responds to the sound of their phones “as they would respond to the presence or proximity of a girlfriend, boyfriend or family member”? Is the takeaway message of Lindstrom’s Op-Ed that iPhones are actually people, as far as our brains are concerned?

In fairness, space in the Times is limited, so maybe it’s not fair to demand this level of detail in the Op-Ed itself. But the bigger problem is that we have no way of evaluating Lindstrom’s claims, period, because (as far as I can tell), his study hasn’t been published or peer-reviewed anywhere. Presumably, it’s proprietary information that belongs to the neuromarketing firm in question. Which is to say, the NYT is basically giving Lindstrom license to talk freely about scientific-sounding findings that can’t actually be independently confirmed, disputed, or critiqued by members of the scientific community with expertise in the very methods Lindstrom is applying (expertise which, one might add, he himself lacks). For all we know, he could have made everything up. To be clear, I don’t really think he did make everything up–but surely, somewhere in the editorial process someone at the NYT should have stepped in and said, “hey, these are pretty strong scientific claims; is there any way we can make your results–on which your whole article hangs–available for other experts to examine?”

This brings us to what might be the biggest whopper of all, and the real driver of the article title: the claim that “most striking of all was the flurry of activation in the insular cortex of the brain, which is associated with feelings of love and compassion”. Russ Poldrack already tore this statement to shreds earlier this morning:

Insular cortex may well be associated with feelings of love and compassion, but this hardly proves that we are in love with our iPhones.  In Tal Yarkoni’s recent paper in Nature Methods, we found that the anterior insula was one of the most highly activated part of the brain, showing activation in nearly 1/3 of all imaging studies!  Further, the well-known studies of love by Helen Fisher and colleagues don’t even show activation in the insula related to love, but instead in classic reward system areas.  So far as I can tell, this particular reverse inference was simply fabricated from whole cloth.  I would have hoped that the NY Times would have learned its lesson from the last episode.

But you don’t have to take Russ’s word for it; if you surf for a few terms on our Neurosynth website, making sure to select “forward inference” under image type, you’ll notice that the insula shows up for almost everything. That’s not an accident; it’s because the insula (or at least the anterior part of the insula) plays a very broad role in goal-directed cognition. It really is activated when you’re doing almost anything that involves, say, following instructions an experimenter gave you, or attending to external stimuli, or mulling over something salient in the environment. You can see this pretty clearly in this modified figure from our Nature Methods paper (I’ve circled the right insula):

Proportion of studies reporting activation at each voxel

The insula is one of a few ‘hotspots’ where activation is reported very frequently in neuroimaging articles (the other major one being the dorsal medial frontal cortex). So, by definition, there can’t be all that much specificity to what the insula is doing, since it pops up so often. To put it differently, as Russ and others have repeatedly pointed out, the fact that a given region activates when people are in a particular psychological state (e.g., love) doesn’t give you license to conclude that that state is present just because you see activity in the region in question. If language, working memory, physical pain, anger, visual perception, motor sequencing, and memory retrieval all activate the insula, then knowing that the insula is active is of very little diagnostic value. That’s not to say that some psychological states might not be more strongly associated with insula activity (again, you can see this on Neurosynth if you switch the image type to ‘reverse inference’ and browse around); it’s just that, probabilistically speaking, the mere fact that the insula is active gives you very little basis for saying anything concrete about what people are experiencing.
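To make the probabilistic point concrete, here’s a minimal sketch of the reverse inference as a Bayes’ rule calculation. Every number in it is made up for illustration: the roughly one-in-three base rate of insula activation loosely echoes the figure in Russ’s quote, while the prior on “love” and the activation rate given love are pure assumptions, and the function name is mine rather than anything from Neurosynth.

```python
def posterior(p_act_given_state, prior_state, p_act_given_other):
    """Bayes' rule: P(state | activation) from the forward probabilities."""
    # Total probability of seeing activation, whether or not the state is present
    p_act = (p_act_given_state * prior_state
             + p_act_given_other * (1 - prior_state))
    return p_act_given_state * prior_state / p_act


# Hypothetical inputs (assumptions, not measured values)
prior_love = 0.05            # assumed share of scanned states that involve love
p_insula_given_love = 0.40   # assumed: insula active in 40% of "love" studies
p_insula_given_other = 0.33  # assumed: insula active in ~1/3 of everything else

post = posterior(p_insula_given_love, prior_love, p_insula_given_other)
print(f"P(love | insula active) = {post:.3f}")  # about 0.060
```

With these invented numbers, seeing insula activity moves the probability of love from 5% to about 6%. A region that lights up for nearly everything simply can’t shift your beliefs very far.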

In fact, to account for Lindstrom’s findings, you don’t have to appeal to love or addiction at all. There’s a much simpler way to explain why seeing or hearing an iPhone might elicit insula activation. For most people, the onset of visual or auditory stimulation is a salient event that causes redirection of attention to the stimulated channel. I’d be pretty surprised, actually, if you could present any picture or sound to participants in an fMRI scanner and not elicit robust insula activity. Orienting and sustaining attention to salient things seems to be a big part of what the anterior insula is doing (whether or not that’s ultimately its ‘core’ function). So the most appropriate conclusion to draw from the fact that viewing iPhone pictures produces increased insula activity is something vague like “people are paying more attention to iPhones”, or “iPhones are particularly salient and interesting objects to humans living in 2011.” Not something like “no, really, you love your iPhone!”

In sum, the NYT screwed up. Lindstrom appears to have a habit of making overblown claims about neuroimaging evidence, so it’s not surprising he would write this type of piece; but the NYT editorial staff is supposedly there to filter out precisely this kind of pseudoscientific advertorial. And they screwed up. It’s a particularly big screw-up given that (a) as of right now, Lindstrom’s Op-Ed is the single most emailed article on the NYT site, and (b) this incident almost perfectly recapitulates another NYT article 4 years ago in which some neuroscientists and neuromarketers wrote a grossly overblown Op-Ed claiming to be able to infer, in detail, people’s opinions about presidential candidates. That time, Russ Poldrack and a bunch of other big names in cognitive neuroscience wrote a concise rebuttal that appeared in the NYT (but unfortunately, isn’t linked to from the original Op-Ed, so anyone who stumbles across the original now has no way of knowing how ridiculous it is). One hopes the NYT follows up in similar fashion this time around. They certainly owe it to their readers–some of whom, if you believe Lindstrom, are now in danger of dumping their current partners for their iPhones.

h/t: Molly Crockett

CNS 2011: a first-person shorthand account in the manner of Rocky Steps

Friday, April 1

4 pm. Arrive at SFO International on bumpy flight from Denver.

4:45 pm. Approach well-dressed man downtown and open mouth to ask for directions to Hyatt Regency San Francisco. “Sorry,” says well-dressed man, “No change to give.” Back off slowly, swinging bags, beard, and poster tube wildly, mumbling “I’m not a panhandler, I’m a neuroscientist.” Realize that difference between the two may be smaller than initially suspected.

6:30 pm. Hear loud knocking on hotel room door. Open door to find roommate. Say hello to roommate. Realize roommate is extremely drunk from East Coast flight. Offer roommate bag of coffee and orange tic-tacs. Roommate is confused, asks, “are you drunk?” Ignore roommate’s question. “You’re drunk, aren’t you.” Deny roommate’s unsubstantiated accusations. “When you write about this on your blog, you better not try to make it look like I’m the drunk one,” roommate says. Resolve to ignore roommate’s crazy talk for next 4 days.

6:45 pm. Attempt to open window of 10th floor hotel room in order to procure fresh air for face. Window refuses to open. Commence nudging of, screaming at, and bargaining with window. Window still refuses to open. Roommate points out sticker saying window does not open. Ignore sticker, continue berating window. Window still refuses to open, but now has low self-esteem.

8 pm. Have romantic candlelight dinner at expensive French restaurant with roommate. Make jokes all evening about ideal location (San Francisco) for start of new intimate relationship. Suspect roommate is uncomfortable, but persist in faux wooing. Roommate finally turns tables by offering to put out. Experience heightened level of discomfort, but still finish all of steak tartare and order creme brulee. Dessert appetite is immune to off-color humor!

11 pm – 1 am. Grand tour of seedy SF bars with roommate and old grad school friend. New nightlife low: denied entrance to seedy dance club because shoes insufficiently classy. Stupid Teva sandals.

Saturday, April 2

9:30 am. Wake up late. Contemplate running downstairs to check out ongoing special symposium for famous person who does important research. Decide against. Contemplate visiting hotel gym to work off creme brulee from last night. Decide against. Contemplate reading conference program in bed and circling interesting posters to attend. Decide against. Contemplate going back to sleep. Consult with self, make unanimous decision in favor.

1 pm. Have extended lunch meeting with collaborators at Ferry Building to discuss incipient top-secret research project involving diesel generator, overstock beanie babies, and apple core. Already giving away too much!

3:30 pm. Return to hotel. Discover hotel is now swarming with name badges attached to vaguely familiar faces. Hug vaguely familiar faces. Hugs are met with startled cries. Realize that vaguely familiar faces are actually completely unfamiliar faces. Wrong conference: Young Republicans, not Cognitive Neuroscientists. Make beeline for elevator bank, pursued by angry middle-aged men dressed in American flags.

5 pm. Poster session A! The sights! The sounds! The lone free drink at the reception! The wonders of yellow 8-point text on black 6′ x 4′ background! Too hard to pick a favorite thing, not even going to try. Okay, fine: free schwag at the exhibitor stands.

5 pm – 7 pm. Chat with old friends. Have good time catching up. Only non-fictionalized bullet point of entire piece.

8 pm. Dinner at belly dancing restaurant in lower Haight. Great conversation, good food, mediocre dancing. Towards end of night, insist on demonstrating own prowess in fine art of torso shaking; climb on table and gyrate body wildly, alternately singing Oompa-Loompa song and yelling “get in my belly!” at other restaurant patrons. Nobody tips.

12:30 am. Take the last train to Clarksville. Take last N train back to Hyatt Regency hotel.

Sunday, April 3

7 am. Wake up with amazing lack of hangover. Celebrate amazing lack of hangover by running repeated victory laps around 10th floor of Hyatt Regency, Rocky Steps style. Quickly realize initial estimate of hangover absence off by order of magnitude. Revise estimate; collapse in puddle on hotel room floor. Refuse to move until first morning session.

8:15 am. Wander the eight Caltech aisles of morning poster session in search of breakfast. Fascinating stuff, but this early in morning, only value signals of interest are smell and sight of coffee, muffins, and bagels.

10 am. Terrific symposium includes excellent talks about emotion, brain-body communication, and motivation, but favorite moment is still when friend arrives carrying bucket of aspirin.

1 pm. Bump into old grad school friend outside; decide to grab lunch on pier behind Ferry Building. Discuss anterograde amnesia and dating habits of mutual friends. Chicken and tofu cake is delicious. Sun is out, temperature is mild; perfect day to not attend poster sessions.

1:15 – 2 pm. Attend poster session.

2 pm – 5 pm. Presenting poster in 3 hours! Have full-blown panic attack in hotel room. Not about poster, about General Hospital. Why won’t Lulu take Dante’s advice and call support group number for alcoholics’ families?!?! Alcohol is Luke’s problem, Lulu! Call that number!

5 pm. Present world’s most amazing poster to three people. Launch into well-rehearsed speech about importance of work and great glory of sophisticated technical methodology before realizing two out of three people are mistakenly there for coffee and cake, and third person mistook presenter for someone famous. Pause to allow audience to mumble excuses and run to coffee bar. When coast is clear, resume glaring at anyone who dares to traverse poster aisle. Believe strongly in marking one’s territory.

8 pm. Lab dinner at House of Nanking. Food is excellent, despite unreasonably low tablespace-to-floorspace ratio. Conversation revolves around fainting goats, ‘relaxation’ in Thailand, and, occasionally, science.

10 pm. Karaoke at The Mint. Compare performance of CNS attendees with control group of regulars; establish presence of robust negative correlation between years of education and singing ability. Completely wreck voice performing whitest rendition ever of Shaggy’s “Oh Carolina”. Crowd jeers. No, wait, crowd gyrates. In wholesome scientific manner. Crowd is composed entirely of people with low self-monitoring skills; what luck! DJ grimaces through entire song and most of previous and subsequent songs.

2 am. Take cab back to hotel with graduate students and Memory Professor. Memory Professor is drunk; manages to nearly fall out of cab while cab in motion. In-cab conversation revolves around merits of dynamic programming languages. No consensus reached, but civility maintained. Arrival at hotel: all cab inhabitants below professorial rank immediately slip out of cab and head for elevators, leaving Memory Professor to settle bill. In elevator, Graduate Student A suggests that attempt to push Memory Professor out of moving cab was bad idea in view of Graduate Student A’s impending post-doc with Memory Professor. Acknowledge probable wisdom of Graduate Student A’s observation while simultaneously resolving to not adjust own degenerate behavior in the slightest.

2:15 am. Drink at least 24 ounces of water before attaining horizontal position. Fall asleep humming bars of Elliott Smith’s Angeles. Wrong city, but close enough.

Monday, April 4

8 am. Wake up hangover free again! For real this time. No Rocky Steps dance. Shower and brush teeth. Delicately stroke roommate’s cheek (he’ll never know) before heading downstairs for poster session.

8:30 am. Bagels, muffin, coffee. Not necessarily in that order.

9 am – 12 pm. Skip sessions, spend morning in hotel room working. While trying to write next section of grant proposal, experience strange sensation of time looping back on itself, like a snake eating its own tail, but also eating grant proposal at same time. Awake from unexpected nap with ‘Innovation’ section in mouth.

12:30 pm. Skip lunch; for some reason, not very hungry.

1 pm. Visit poster with screaming purple title saying “COME HERE FOR FREE CHOCOLATE.” Am impressed with poster title and poster, but disappointed by free chocolate selection: Dove eggs and purple Hershey’s kisses–worst chocolate in the world! Resolve to show annoyance by disrupting presenter’s attempts to maintain conversation with audience. Quickly knocked out by chocolate eggs thrown by presenter.

5 pm. Wake up in hotel room with headache and no recollection of day’s events. Virus or hangover? Unclear. For some reason, hair smells like chocolate.

7:30 pm. Dinner at Ferry Building with Brain Camp friends. Have now visited Ferry Building at least one hundred times in seventy-two hours. Am now compulsively visiting Ferry Building every fifteen minutes just to feel normal.

9:30 pm. Party at Americano Restaurant & Bar for Young Investigator Award winner. Award comes with $500 and strict instructions to be spent on drinks for total strangers. Strange tradition, but no one complains.

11 pm. Bar is crowded with neuroscientists having great time at Young Investigator’s expense.

11:15 pm. Drink budget runs out.

11:17 pm. Neuroscientists mysteriously vanish.

1 am. Stroll through San Francisco streets in search of drink. Three false alarms, but finally arrive at open pub 10 minutes before last call. Have extended debate with friend over whether hotel room can be called ‘home’. Am decidedly in No camp; ‘home’ is for long-standing attachments, not 4-day hotel hobo runs.

2 am. Walk home.

Tuesday, April 5

9:05 am. Show up 5 minutes late for bagels and muffins. All gone! Experience Apocalypse Now moment on inside, but manage not to show it–except for lone tear. Drown sorrows in Tazo Wild Sweet Orange tea. Tea completely fails to live up to name; experience second, smaller, Apocalypse Now moment. Roommate walks over and asks if everything okay, then gently strokes cheek and brushes away lone tear (he knew!!!).

9:10 – 1 pm. Intermittently visit poster and symposium halls. Not sure why. Must be force of habit learning system.

1:30 pm. Lunch with friends at Thai restaurant near Golden Gate Park. Fill belly up with coconut, noodles, and crab. About to get on table to express gratitude with belly dance, but notice that friends have suddenly disappeared.

2 – 5 pm. Roam around Golden Gate Park and Haight-Ashbury. Stop at Whole Foods for friend to use bathroom. Get chased out of Whole Foods for using bathroom without permission. Very exciting; first time feeling alive on entire trip! Continue down Haight. Discuss socks, ice cream addiction (no such thing), and funding situation in Europe. Turns out it sucks there too.

5:15 pm. Take BART to airport with lab members. Watch San Francisco recede behind train. Sink into slightly melancholic state, but recognize change of scenery is for the best: constitution couldn’t handle more Rocky Steps mornings.

7:55 pm. Suddenly rediscover pronouns as airplane peels away from gate.

8 pm PST – 11:20 MST. The flight’s almost completely empty; I get to stretch out across the entire emergency exit aisle. The sun goes down as we cross the Sierra Nevada; the last of the ice in my cup melts into water somewhere between Provo and Grand Junction. As we start our descent into Denver, the lights come out in force, and I find myself preemptively bored at the thought of the long shuttle ride home. For a moment, I wish I was back in my room at the Hyatt at 8 am–about to run Rocky Steps around the hotel, or head down to the poster hall to find someone to chat with over a bagel and coffee. For some reason, I still feel like I didn’t get quite enough time to hang out with all the people I wanted to see, despite barely sleeping in 4 days. But then sanity returns, and the thought quickly passes.

the naming of things

Let’s suppose you were charged with the important task of naming all the various subdisciplines of neuroscience that have anything to do with the field of research we now know as psychology. You might come up with some or all of the following terms, in no particular order:

  • Neuropsychology
  • Biological psychology
  • Neurology
  • Cognitive neuroscience
  • Cognitive science
  • Systems neuroscience
  • Behavioral neuroscience
  • Psychiatry

That’s just a partial list; you’re resourceful, so there are probably others (biopsychology? psychobiology? psychoneuroimmunology?). But it’s a good start. Now suppose you decided to make a game out of it, and threw a dinner party where each guest received a copy of your list (discipline names only–no descriptions!) and had to guess what they thought people in that field study. If your nomenclature made any sense at all, and tried to respect the meanings of the individual words used to generate the compound words or phrases in your list, your guests might hazard something like the following guesses:

  • Neuropsychology: “That’s the intersection of neuroscience and psychology. Meaning, the study of the neural mechanisms underlying cognitive function.”
  • Biological psychology: “Similar to neuropsychology, but probably broader. Like, it includes the role of genes and hormones and kidneys in cognitive function.”
  • Neurology: “The pure study of the brain, without worrying about all of that associated psychological stuff.”
  • Cognitive neuroscience: “Well if it doesn’t mean the same thing as neuropsychology and biological psychology, then it probably refers to the branch of neuroscience that deals with how we think and reason. Kind of like cognitive psychology, only with brains!”
  • Cognitive science: “Like cognitive neuroscience, but not just for brains. It’s the study of human cognition in general.”
  • Systems neuroscience: “Mmm… I don’t really know. The study of how the brain functions as a whole system?”
  • Behavioral neuroscience: “Easy: it’s the study of the relationship between brain and behavior. For example, how we voluntarily generate actions.”
  • Psychiatry: “That’s the branch of medicine that concerns itself with handing out multicolored pills that do funny things to your thoughts and feelings. Of course.”

If this list seems sort of sensible to you, you probably live in a wonderful world where compound words mean what you intuitively think they mean, the subject matter of scientific disciplines can be transparently discerned, everyone eats ice cream for dinner every night, and terms that sound extremely similar have extremely similar referents rather than referring to completely different fields of study. Unfortunately, that world is not the world we happen to actually inhabit. In our world, most of the disciplines at the intersection of psychology and neuroscience have funny names that reflect accidents of history, and tell you very little about what the people in that field actually study.

Here’s the list your guests might hand back in this world, if you ever made the terrible, terrible mistake of inviting a bunch of working scientists to dinner:

  • Neuropsychology: The study of how brain damage affects cognition and behavior. Most often focusing on the effects of brain lesions in humans, and typically relying primarily on behavioral evaluations (i.e., no large magnetic devices that take photographs of the space inside people’s skulls). People who call themselves neuropsychologists are overwhelmingly trained as clinical psychologists, and many of them work in big white buildings with a red cross on the front. Note that this isn’t the definition of neuropsychology that Wikipedia gives you; Wikipedia seems to think that neuropsychology is “the basic scientific discipline that studies the structure and function of the brain related to specific psychological processes and overt behaviors.” Nice try, Wikipedia, but that’s much too general. You didn’t even use the words ‘brain damage’, ‘lesion’, or ‘patient’ in the first sentence.
  • Biological psychology: To be perfectly honest, I’m going to have to step out of dinner-guest character for a moment and admit I don’t really have a clue what biological psychologists study. I can’t remember the last time I heard someone refer to themselves as a biological psychologist. To an approximation, I think biological psychology differs from, say, cognitive neuroscience in placing greater emphasis on everything outside of higher cognitive processes (sensory systems, autonomic processes, the four F’s, etc.). But that’s just idle speculation based largely on skimming through the chapter names of my old “Biological Psychology” textbook. What I can definitively… confidently… comfortably… tentatively… recklessly assert is that you really don’t want to trust the Wikipedia definition here, because when you type ‘biological psychology’ into that little box that says ‘search’ on Wikipedia, it redirects you to the behavioral neuroscience entry. And that can’t be right, because, as we’ll see in a moment, behavioral neuroscience refers to something very different…
  • Neurology: Hey, look! A Wikipedia entry that doesn’t lie to our face! It says neurology is “a medical specialty dealing with disorders of the nervous system. Specifically, it deals with the diagnosis and treatment of all categories of disease involving the central, peripheral, and autonomic nervous systems, including their coverings, blood vessels, and all effector tissue, such as muscle.” That’s a definition I can get behind, and I think 9 out of 10 dinner guests would probably agree (the tenth is probably drunk). But then, I’m not (that kind of) doctor, so who knows.
  • Cognitive neuroscience: In principle, cognitive neuroscience actually means more or less what it sounds like it means. It’s the study of the neural mechanisms underlying cognitive function. In practice, it all goes to hell in a handbasket when you consider that you can prefix ‘cognitive neuroscience’ with pretty much any adjective you like and end up with a valid subdiscipline. Developmental cognitive neuroscience? Check. Computational cognitive neuroscience? Check. Industrial/organizational cognitive neuroscience? Amazingly, no; until just now, that phrase did not exist on the internet. But by the time you read this, Google will probably have a record of this post, which is really all it takes to legitimate I/OCN as a valid field of inquiry. It’s just that easy to create a new scientific discipline, so be very afraid–things are only going to get messier.
  • Cognitive science: A field that, by most accounts, lives up to its name. Well, kind of. Cognitive science sounds like a blanket term for pretty much everything that has to do with cognition, and it sort of is. You have psychology and linguistics and neuroscience and philosophy and artificial intelligence all represented. I’ve never been to the annual CogSci conference, but I hear it’s a veritable orgy of interdisciplinary activity. Still, I think there’s a definite bias towards some fields at the expense of others. Neuroscientists (of any stripe), for instance, rarely call themselves cognitive scientists. Conversely, philosophers of mind or language love to call themselves cognitive scientists, and the jerk cynic in me says it’s because it means they get to call themselves scientists. Also, in terms of content and coverage, there seems to be a definite emphasis among self-professed cognitive scientists on computational and mathematical modeling, and not so much emphasis on developing neuroscience-based models (though neural network models are popular). Still, if you’re scoring terms based on clarity of usage, cognitive science should score at least an 8.5 / 10.
  • Systems neuroscience: The study of neural circuits and the dynamics of information flow in the central nervous system (note: I stole part of that definition from MIT’s BCS website, because MIT people are SMART). Systems neuroscience doesn’t overlap much with psychology; you can’t defensibly argue that the temporal dynamics of neuronal assemblies in sensory cortex have anything to do with human cognition, right? I just threw this in to make things even more confusing.
  • Behavioral neuroscience: This one’s really great, because it has almost nothing to do with what you think it does. Well, okay, it does have something to do with behavior. But it’s almost exclusively animal behavior. People who refer to themselves as behavioral neuroscientists are generally in the business of poking rats in the brain with very small, sharp, glass objects; they typically don’t care much for human beings (professionally, that is). I guess that kind of makes sense when you consider that you can have rats swim and jump and eat and run while electrodes are implanted in their heads, whereas most of the time when we study human brains, they’re sitting motionless in (a) a giant magnet, (b) a chair, or (c) a jar full of formaldehyde. So maybe you could make an argument that since humans don’t get to BEHAVE very much in our studies, people who study humans can’t call themselves behavioral neuroscientists. But that would be a very bad argument to make, and many of the people who work in the so-called “behavioral sciences” and do nothing but study human behavior would probably be waiting to thump you in the hall the next time they saw you.
  • Psychiatry: The branch of medicine that concerns itself with handing out multicolored pills that do funny things to your thoughts and feelings. Of course.

Anyway, the basic point of all this long-winded nonsense is just that, for all that stuff we tell undergraduates about how science is such a wonderful way to achieve clarity about the way the world works, scientists–or at least, neuroscientists and psychologists–tend to carve up their disciplines in pretty insensible ways. That doesn’t mean we’re dumb, of course; to the people who work in a field, the clarity (or lack thereof) of the terminology makes little difference, because you only need to acquire it once (usually in your first nine years of grad school), and after that you always know what people are talking about. Come to think of it, I’m pretty sure the whole point of learning big words is that once you’ve successfully learned them, you can stop thinking deeply about what they actually mean.

It is kind of annoying, though, to have to explain to undergraduates that, DUH, the class they really want to take given their interests is OBVIOUSLY cognitive neuroscience and NOT neuropsychology or biological psychology. I mean, can’t they read? Or to pedantically point out to someone you just met at a party that saying “the neurological mechanisms of such-and-such” makes them sound hopelessly unsophisticated, and what they should really be saying is “the neural mechanisms,” or “the neurobiological mechanisms”, or (for bonus points) “the neurophysiological substrates”. Or, you know, to try (unsuccessfully) to convince your mother on the phone that even though it’s true that you study the relationship between brains and behavior, the field you work in has very little to do with behavioral neuroscience, and so you really aren’t an expert on that new study reported in that article she just read in the paper the other day about that interesting thing that’s relevant to all that stuff we all do all the time.

The point is, the world would be a slightly better place if cognitive science, neuropsychology, and behavioral neuroscience all meant what they seem like they should mean. But only very slightly better.

Anyway, aside from my burning need to complain about trivial things, I bring these ugly terminological matters up partly out of idle curiosity. And what I’m idly curious about is this: does this kind of confusion feature prominently in other disciplines too, or is psychology-slash-neuroscience just, you know, “special”? My intuition is that it’s the latter; subdiscipline names in other areas just seem so sensible to me whenever I hear them. For instance, I’m fairly confident that organic chemists study the chemistry of Orgas, and I assume condensed matter physicists spend their days modeling the dynamics of teapots. Right? Yes? No? Perhaps my millions… thousands… hundreds… dozens… three regular readers can enlighten me in the comments…