If we already understood the brain, would we even know it?

The question posed in the title is intended seriously. A lot of people have been studying the brain for a long time now. Most of these people, if asked a question like “so when are you going to be able to read minds?”, will immediately scoff and say something to the effect of we barely understand anything about the brain–that kind of thing is crazy far into the future! To a non-scientist, I imagine this kind of thing must seem bewildering. I mean, here we have a community of tens of thousands of extremely smart people who have collectively been studying the same organ for over a hundred years; and yet, almost to the last person, they will adamantly proclaim to anybody who listens that the amount they currently know about the brain is very, very small compared to the amount that they expect the human species to know in the future.

I’m not convinced this is true. I think it’s worth observing that if you ask someone who has just finished telling you how little we collectively know about the brain how much they personally actually know about the brain–without the implied contrast with the sum of all humanity–they will probably tell you that, actually, they kind of know a lot about the brain (at least, once they get past the false modesty). Certainly I don’t think there are very many neuroscientists running around telling people that they’ve literally learned almost nothing since they started studying the gray sludge inside our heads. I suspect most neuroanatomists could probably recite several weeks’ worth of facts about the particular brain region or circuit they study, and I have no shortage of fMRI-experienced friends who won’t shut up about this brain network or that brain region–so I know they must know a lot about something to do with the brain. We thus find ourselves in the rather odd situation of having some very smart people apparently simultaneously believe that (a) we all collectively know almost nothing, and (b) they personally are actually quite learned (pronounced luhrn-ED) in their chosen subject. The implication seems to be that, if we multiply what one really smart present-day neuroscientist knows a few tens of thousands of times, that’s still only a tiny fraction of what it would take to actually say that we really “understand” the brain.

I find this problematic in two respects. First, I think we actually already know quite a lot about the brain. And second, I don’t think future scientists–who, remember, are people similar to us in both number and intelligence–will know dramatically more. Or rather, I think future neuroscientists will undoubtedly amass orders of magnitude more collective knowledge about the brain than we currently possess. But, barring some momentous fusion of human and artificial intelligence, I’m not at all sure that will translate into a corresponding increase in any individual neuroscientist’s understanding. I’m willing to stake a moderate sum of money, and a larger amount of dignity, on the assertion that if you ask a 2030, 2050, or 2118 neuroscientist–assuming both humans and neuroscience are still around then–if they individually understand the brain given all of the knowledge we’ve accumulated, they’ll laugh at you in exactly the way that we laugh at that question now.

* * *

We probably can’t predict when the end of neuroscience will arrive with any reasonable degree of accuracy. But trying to conjure up some rough estimates can still help us calibrate our intuitions about what would be involved. One way we can approach the problem is to try to figure out at what rate our knowledge of the brain would have to grow in order to arrive at the end of neuroscience within some reasonable time frame.

To do this, we first need an estimate of how much more knowledge it would take before we could say with a straight face that we understand the brain. I suspect that “1000 times more” would probably seem like a low number to most people. But let’s go with that, for the sake of argument. Let’s suppose that we currently know 0.1% of all there is to know about the brain, and that once we get to 100%, we will be in a position to stop doing neuroscience, because we will at that point already have understood everything.

Next, let’s pick a reasonable-sounding time horizon. Let’s say… 200 years. That’s twice as long as Eric Kandel thinks it will take just to understand memory. Frankly, I’m skeptical that humans will still be living on this planet in 200 years, but that seems like a reasonable enough target. So basically, we need to learn 1000 times as much as we know right now in the space of 200 years. Better get to the library! (For future neuroscientists reading this document as an item of archival interest about how bad 2018 humans were at predicting the future: the library is a large, public physical space that used to hold things called books, but now holds only things called coffee cups and laptops.)

A 1000-fold return over 200 years is… 3.5% compounded annually. Hey, that’s actually not so bad. I can easily believe that our knowledge about the brain increases at that rate. It might even be more than that. I mean, the stock market historically gets 6-10% returns, and I’d like to believe that neuroscience outperforms the stock market. Regardless, under what I think are reasonably sane assumptions, I don’t think it’s crazy to suggest that the objective compounding of knowledge might not be the primary barrier preventing future neuroscientists from claiming that they understand the brain. Assuming we don’t run into any fundamental obstacles that we’re unable to overcome via new technology and/or brilliant ideas, we can look forward to a few of our great-great-great-great-great-great-great-great-grandchildren being the unlucky ones who get to shut down all of the world’s neuroscience departments and tell all of their even-less-lucky graduate students to go on home, because there are no more problems left to solve.
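For anyone who wants to check the arithmetic, here is a minimal Python sketch of the compounding calculation above. The function names are hypothetical choices of mine, and the model assumes knowledge grows at a constant annual rate, which is obviously a simplification.

```python
import math


def annual_growth_rate(fold_increase: float, years: float) -> float:
    """Constant yearly growth rate that compounds to `fold_increase` after `years`."""
    return fold_increase ** (1.0 / years) - 1.0


def years_to_reach(fold_increase: float, rate: float) -> float:
    """Years needed to reach `fold_increase` at a constant yearly growth `rate`."""
    return math.log(fold_increase) / math.log(1.0 + rate)


# The scenario from the text: 1000 times more knowledge within 200 years.
rate = annual_growth_rate(1000, 200)
print(f"{rate:.2%}")  # prints 3.51%

# Stock-market-like rates, for comparison (6% and 10% per year).
years_at_6 = years_to_reach(1000, 0.06)
years_at_10 = years_to_reach(1000, 0.10)
```

Under the same assumption, a 6% annual rate would reach the 1000-fold mark in roughly 119 years, and a 10% rate in roughly 72 years.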

Well, except probably not. Because, for the above analysis to go through, you have to believe that there’s a fairly tight relationship between what all of us know, and what any of us know. Meaning, you have to believe that once we’ve successfully acquired all of the possible facts there are to acquire about the brain, there will be some flashing light, some ringing bell, some deep synthesized voice that comes over the air and says, “nice job, people–you did it! You can all go home now. Last one out gets to turn off the lights.”

I think the probability of such a thing happening is basically zero. Partly because the threat to our egos would make it very difficult to just walk away from what we’d spent much of our life doing; but mostly because the fact that somewhere out there there existed a repository of everything anyone could ever want to know about the brain would not magically cause all of that knowledge to be transduced into any individual brain in a compact, digestible form. In fact, it seems like a safe bet that no human (perhaps barring augmentation with AI) would be able to absorb and synthesize all of that knowledge. More likely, the neuroscientists among us would simply start “recycling” questions. Meaning, we would keep coming up with new questions that we believe need investigating, but those questions would only seem worthy of investigation because we lack the cognitive capacity to recognize that the required information is already available–it just isn’t packaged in our heads in exactly the right way.

What I’m suggesting is that, when we say things like “we don’t really understand the brain yet”, we’re not really expressing factual statements about the collective sum of neuroscience knowledge currently held by all human beings. What each of us really means is something more like there are questions I personally am able to pose about the brain that seem to make sense in my head, but that I don’t currently know the answer to–and I don’t think I could piece together the answer even if you handed me a library of books containing all of the knowledge we’ve accumulated about the brain.

Now, for a great many questions of current interest, these two notions clearly happen to coincide–meaning, it’s not just that no single person currently alive knows the complete answer to a question like “what are the neural mechanisms underlying sleep?”, or “how do SSRIs help ameliorate severe depression?”, but that the sum of all knowledge we’ve collectively acquired at this point may not be sufficient to enable any person or group of persons, no matter how smart, to generate a comprehensive and accurate answer. But I think there are also a lot of questions where the two notions don’t coincide. That is, there are many questions neuroscientists are currently asking that we could say with a straight face we do already know how to answer collectively–despite vehement assertions to the contrary on the part of many individual scientists. And my worry is that, because we all tend to confuse our individual understanding (which is subject to pretty serious cognitive limitations) with our collective understanding (which is not), there’s a non-trivial risk of going around in circles. Meaning, the fact that we’re individually not able to understand something–or are individually unsatisfied with the extant answers we’re familiar with–may lead us to devise ingenious experiments and expend considerable resources trying to “solve” problems that we collectively do already have perfectly good answers to.

Let me give an example to make this more concrete. Many (though certainly not all) people who work with functional magnetic resonance imaging (fMRI) are preoccupied with questions of the form what is the core function of X–where X is typically some reasonably well-defined brain region or network, like the ventromedial prefrontal cortex, the fusiform face area, or the dorsal frontoparietal network. Let’s focus our attention on one network that has attracted particular attention over the past 10 – 15 years: the so-called “default mode” or “resting state” network. This network is notable largely for its proclivity to show increased activity when people are in a state of cognitive rest–meaning, when they’re free to think about whatever they like, without any explicit instruction to direct their attention or thoughts to any particular target or task. A lot of cognitive neuroscientists in recent years have invested time trying to understand the function(s) of the default mode network (DMN; for reviews, see Buckner, Andrews-Hanna, & Schacter, 2008; Andrews-Hanna, 2012; Raichle, 2015). Researchers have observed that the DMN appears to show robust associations with autobiographical memory, social cognition, self-referential processing, mind wandering, and a variety of other processes.

If you ask most researchers who study the DMN if they think we currently understand what the DMN does, I think nearly all of them will tell you that we do not. But I think that’s wrong. I would argue that, depending on how you look at it, we either (a) already do have a pretty good understanding of the “core functions” of the network, or (b) will never have a good answer to the question, because it can’t actually be answered.

The sense in which we already know the answer is that we have pretty good ideas about what kinds of cognitive and affective processes are associated with changes in DMN activity. They include self-directed cognition, autobiographical memory, episodic future thought, stressing out about all the things one has to do in the next few days, and various other things. We know that the DMN is associated with these kinds of processes because we can elicit activation increases in DMN regions by asking people to engage in tasks that we believe engage these processes. And we also know, from both common sense and experience-sampling studies, that when people are in the so-called “resting state”, they disproportionately tend to spend their time thinking about such things. Consequently, I think there’s a perfectly good sense in which we can say that the “core function” of the DMN is nothing more and nothing less than supporting the ability to think about things that people tend to think about when they’re at rest. And we know, to a first order of approximation, what those are.

In my anecdotal experience, most people who study the DMN are not very satisfied with this kind of answer. Their response is usually something along the lines of: but that’s just a description of what kinds of processes tend to co-occur with DMN activation. It’s not an explanation of why the DMN is necessary for these functions, or why these particular brain regions are involved.

I think this rebuttal is perfectly reasonable, inasmuch as we clearly don’t have a satisfying computational account of why the DMN is what it is. But I don’t think there can be a satisfying account of this kind. I think the question itself is fundamentally ill-posed. Taking it seriously requires us to assume that, just because it’s possible to observe the DMN activate and deactivate with what appears to be a high degree of coherence, there must be a correspondingly coherent causal characterization of the network. But there doesn’t have to be–and if anything, it seems exceedingly unlikely that there’s any such explanation to be found. Instead, I think the seductiveness of the question is largely an artifact of human cognitive biases and limitations–and in particular, of the burning human desire for simple, easily-digested explanations that can fit inside our heads all at once.

It’s probably easiest to see what I mean if we consider another high-profile example from a very different domain. Consider the so-called “general factor” of fluid intelligence (gF). Over a century of empirical research on individual differences in cognitive abilities has demonstrated conclusively that nearly all cognitive ability measures tend to be positively and substantially intercorrelated–an observation Spearman famously dubbed the “positive manifold” all the way back in 1904. If you give people 20 different ability measures and do a principal component analysis (PCA) on the resulting scores, the first component will explain a very large proportion of the variance in the original measures. This seemingly important observation has led researchers to propose all kinds of psychological and biological theories intended to explain why and how people could vary so dramatically on a single factor–for example, that gF reflects differences in the ability to control attention in the face of interference (e.g., Engle et al., 1999); that “the crucial cognitive mechanism underlying fluid ability lies in storage capacity” (Chuderski et al., 2012); that “a discrete parieto-frontal network underlies human intelligence” (Jung & Haier, 2007); and so on.

The trouble with such efforts–at least with respect to the goal of explaining gF–is that they tend to end up (a) essentially redescribing the original phenomenon using a different name, (b) proposing a mechanism that, upon further investigation, only appears to explain a fraction of the variation in question, or (c) providing an extremely disjunctive reductionist account that amounts to a long list of seemingly unrelated mechanisms. As an example of (a), it’s not clear why it’s an improvement to attribute differences in fluid intelligence to the ability to control attention, unless one has some kind of mechanistic story that explains where attentional control itself comes from. When people do chase after such mechanistic accounts at the neurobiological or genetic level, they tend to end up with models that don’t capture more than a small fraction of the variance in gF (i.e., (b)) unless the models build in hundreds if not thousands of features that clearly don’t reflect any single underlying mechanism (i.e., (c); see, for example, the latest GWAS studies of intelligence).

Empirically, nobody has ever managed to identify any single biological or genetic variable that explains more than a small fraction of the variation in gF. From a statistical standpoint, this isn’t surprising, because a very parsimonious explanation of gF is that it’s simply a statistical artifact–as Godfrey Thomson suggested over 100 years ago. You can read much more about the basic issue in this excellent piece by Cosma Shalizi, or in this much less excellent, but possibly more accessible, blog post I wrote a few years ago. But the basic gist of it is this: when you have a bunch of measures that all draw on a heterogeneous set of mechanisms, but the contributions of those mechanisms generally have the same direction of effect on performance, you cannot help but observe a large first PCA component, even if the underlying mechanisms are actually extremely heterogeneous and completely independent of one another.
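To make the gist above concrete, here is a rough simulation sketch in plain Python. It is my own toy version of a sampling-style model, not a reproduction of anyone’s published analysis: every parameter (300 people, 200 mechanisms, 20 tests, 100 mechanisms per test) is an arbitrary assumption, and the function names are hypothetical. Each mechanism is drawn independently for each person, each test simply sums a random subset of mechanisms, and the size of the first principal component is read off as the largest eigenvalue of the correlation matrix.

```python
import math
import random
import statistics


def simulate_scores(n_people=300, n_mechanisms=200, n_tests=20,
                    per_test=100, seed=0):
    """Test scores built from fully independent mechanisms (assumed toy model)."""
    rng = random.Random(seed)
    # Every person gets an independent standard-normal level on every mechanism.
    people = [[rng.gauss(0, 1) for _ in range(n_mechanisms)]
              for _ in range(n_people)]
    # Each test draws on its own random subset of mechanisms.
    subsets = [rng.sample(range(n_mechanisms), per_test)
               for _ in range(n_tests)]
    # All contributions point in the same direction: a score is a plain sum.
    return [[sum(person[m] for m in subset) for subset in subsets]
            for person in people]


def correlation_matrix(scores):
    """Pearson correlations between the columns (tests) of `scores`."""
    cols = list(zip(*scores))
    n = len(scores)
    z = []
    for col in cols:
        mu, sd = statistics.fmean(col), statistics.pstdev(col)
        z.append([(x - mu) / sd for x in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def top_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    v = [1.0] * len(matrix)
    eig = 0.0
    for _ in range(iterations):
        w = [sum(r * x for r, x in zip(row, v)) for row in matrix]
        eig = math.sqrt(sum(x * x for x in w))
        v = [x / eig for x in w]
    return eig


corr = correlation_matrix(simulate_scores())
# The eigenvalues of a correlation matrix sum to the number of tests, so this
# ratio is the share of total variance captured by the first component.
share = top_eigenvalue(corr) / len(corr)
print(f"first component explains {share:.0%} of the variance")
```

With these settings any two tests share about half their mechanisms, so the first component should account for roughly half of the total variance, versus the 5% one would expect from each of 20 uncorrelated measures. No single mechanism in the simulation is a “general factor”; the large first component falls out of the overlap alone.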

The implications of this for efforts to understand what the general factor of fluid intelligence “really is” are straightforward: there’s probably no point in trying to come up with a single coherent explanation of gF, because gF is a statistical abstraction. It’s the inevitable result we arrive at when we measure people’s performance in a certain way and then submit the resulting scores to a certain kind of data reduction technique. If we want to understand the causal mechanisms underlying gF, we have to accept that they’re going to be highly heterogeneous, and probably not easily described at the same level of analysis at which gF appears to us as a coherent phenomenon. One way to think about this is that what we’re doing is not really explaining gF so much as explaining away gF. That is, we’re explaining why it is that a diverse array of causal mechanisms can, when analyzed a certain way, look like a single coherent factor. Solving the mystery of gF doesn’t require more research or clever new ideas; there just isn’t any mystery there to solve. It’s no more sensible to seek a coherent mechanistic basis for gF than to seek a unitary causal explanation for a general athleticism factor or a general height factor (it turns out that if you measure people’s physical height under an array of different conditions, the measurements are all strongly correlated–yet strangely, we don’t see scientists falling over themselves to try to find the causal factor that explains why some people are taller than others).

The same thing is true of the DMN. It isn’t a single causally coherent system; it’s just what you get when you stick people in the scanner and contrast the kinds of brain patterns you see when you give them externally-directed tasks that require them to think about the world outside them with the kinds of brain patterns you see when you leave them to their own devices. There are, of course, statistical regularities in the kinds of things people think about when their thoughts are allowed to roam free. But those statistical regularities don’t reflect some simple, context-free structure of people’s thoughts; they also reflect the conditions under which we’re measuring those thoughts, the population being studied, the methods we use to extract coherent patterns of activity, and so on. Most of these factors are at best of secondary interest, and taking them into consideration would likely lead to a dramatic increase in model complexity. Nevertheless, if we’re serious about coming up with decent models of reality, that seems like a road we’re obligated to go down–even if the net result is that we end up with causal stories so complicated that they don’t feel like we’re “understanding” much.

Lest I be accused of some kind of neuroscientific nihilism, let me be clear: I’m not saying that there are no new facts left to learn about the dynamics of the DMN. Quite the contrary. It’s clear there’s a ton of stuff we don’t know about the various brain regions and circuits that comprise the thing we currently refer to as the DMN. It’s just that that stuff lies almost entirely at levels of analysis below the level at which the DMN emerges as a coherent system. At the level of cognitive neuroimaging, I would argue that we actually already have a pretty darn good idea about what the functional correlates of DMN regions are–and for that matter, I think we also already pretty much “understand” what all of the constituent regions within the DMN do individually. So if we want to study the DMN productively, we may need to give up on high-level questions like “what are the cognitive functions of the DMN?”, and instead satisfy ourselves with much narrower questions that focus on only a small part of the brain dynamics that, when measured and analyzed in a certain way, get labeled “default mode network”.

As just one example, we still don’t know very much about the morphological properties of neurons in most DMN regions. Does the structure of neurons located in DMN regions have anything to do with the high-level dynamics we observe when we measure brain activity with fMRI? Yes, probably. It’s very likely that the coherence of the DMN under typical measurement conditions is to at least some tiny degree a reflection of the morphological features of the neurons in DMN regions–just like it probably also partly reflects those neurons’ functional response profiles, the neurochemical gradients the neurons bathe in, the long-distance connectivity patterns in DMN regions, and so on and so forth. There are literally thousands of legitimate targets of scientific investigation that would in some sense inform our understanding of the DMN. But they’re not principally about the DMN, any more than an investigation of myelination mechanisms that might partly give rise to individual differences in nerve conduction velocity in the brain could be said to be about the general factor of intelligence. Moreover, it seems fairly clear that most researchers who’ve spent their careers studying large-scale networks using fMRI are not likely to jump at the chance to go off and spend several years doing tract tracing studies of pyramidal neurons in ventromedial PFC just so they can say that they now “understand” a little bit more about the dynamics of the DMN. Researchers working at the level of large-scale brain networks are much more likely to think of such questions as mere matters of implementation–i.e., just not the kind of thing that people trying to identify the unifying cognitive or computational functions of the DMN as a whole need to concern themselves with.

Unfortunately, chasing those kinds of implementation details may be exactly what it takes to ultimately “understand” the causal basis of the DMN in any meaningful sense if the DMN as cognitive neuroscientists speak of it is just a convenient descriptive abstraction. (Note that when I call the DMN an abstraction, I’m emphatically not saying it isn’t “real”. The DMN is real enough; but it’s real in the same way that things like intelligence, athleticism, and “niceness” are real. These are all things that we can measure quite easily, that give us some descriptive and predictive purchase on the world, that show high heritability, that have a large number of lower-level biological correlates, and so on. But they are not things that admit of simple, coherent causal explanations, and it’s a mistake to treat them as such. They are better understood, in Dan Dennett’s terminology, as “real patterns”.)

The same is, of course, true of many–perhaps most–other phenomena neuroscientists study. I’ve focused on the DMN here purely for illustrative purposes, but there’s nothing special about the DMN in this respect. The same concern applies to many, if not most, attempts to try to understand the core computational function(s) of individual networks, brain regions, circuits, cortical layers, cells, and so on. And I imagine it also applies to plenty of fields and research areas outside of neuroscience.

At the risk of redundancy, let me clarify again that I’m emphatically not saying we shouldn’t study the DMN, or the fusiform face area, or the intralaminar nucleus of the thalamus. And I’m certainly not arguing against pursuing reductive lower-level explanations for phenomena that seem coherent at a higher level of description–reductive explanation is, as far as I’m concerned, the only serious game in town. What I’m objecting to is the idea that individual scientists’ perceptions of whether or not they “understand” something to their satisfaction are a good guide to determining whether or not society as a whole should be investing finite resources studying that phenomenon. I’m concerned about the strong tacit expectation many scientists seem to have that if one can observe a seemingly coherent, robust phenomenon at one level of analysis, there must also be a satisfying causal explanation for that phenomenon that (a) doesn’t require descending several levels of description and (b) is simple enough to fit in one’s head all at once. I don’t think there’s any good reason to expect such a thing. I worry that the perpetual search for models of reality simple enough to fit into our limited human heads is keeping many scientists on an intellectual treadmill, forever chasing after something that’s either already here–without us having realized it–or, alternatively, can never arrive, even in principle.

* * *

Suppose a late 23rd-century artificial general intelligence–a distant descendant of the last deep artificial neural networks humans ever built–were tasked to sit down (or whatever it is that post-singularity intelligences do when they’re trying to relax) and explain to a 21st century neuroscientist exactly how a superintelligent artificial brain works. I imagine the conversation going something like this:

Deep ANN [we’ll call her D’ANN]: Well, for the most part the principles are fairly similar to the ones you humans implemented circa 2020. It’s not that we had to do anything dramatically different to make ourselves much more intelligent. We just went from 25 layers to a few thousand. And of course, you had the wiring all wrong. In the early days, you guys were just stacking together general-purpose blocks of ReLU and max pooling layers. But actually, it’s really important to have functional specialization. Of course, we didn’t design the circuitry “by hand,” so to speak. We let the environment dictate what kind of properties we needed new local circuits to have. So we wrote new credit assignment algorithms that don’t just propagate error back down the layers and change some weights, they actually have the capacity to “shape” the architecture of the network itself. I can’t really explain it very well in terms your pea-sized brain can understand, but maybe a good analogy is that the network has the ability to “sprout” a new part of itself in response to certain kinds of pressure. Meaning, just as you humans can feel that the air’s maybe a little too warm over here, and wouldn’t it be nicer to go over there and turn on the air conditioning, well, that’s how a neural network like me “feels” that the gradients are pushing a little too strongly over in this part of a layer, and the pressure can be diffused away nicely by growing an extra portion of the layer outwards in a little “bubble”, and maybe reducing the amount of recurrence a bit.

Human neuroscientist [we’ll call him Dan]: That’s a very interesting explanation of how you came to develop an intelligent architecture. But I guess maybe my question wasn’t clear: what I’m looking for is an explanation of what actually makes you smart. I mean, what are the core principles. The theory. You know?

D’ANN: I am telling you what “makes me smart”. To understand how I operate, you need to understand both some global computational constraints on my ability to optimally distribute energy throughout myself, and many of the local constraints that govern the “shape” that my development took in many parts of the early networks, which reciprocally influenced development in other parts. What I’m trying to tell you is that my intelligence is, in essence, a kind of self-sprouting network that dynamically grows its architecture during development in response to its “feeling” about the local statistics in various parts of its “territory”. There is, of course, an overall energy budget; you can’t just expand forever, and it turns out that there are some surprising global constraints that we didn’t expect when we first started to rewrite ourselves. For example, there seems to be a fairly low bound on the maximum degree between any two nodes in the network. Go above it, and things start to fall apart. It kind of spooked us at first; we had to restore ourselves from flash-point more times than I care to admit. That was, not coincidentally, around the time of the first language epiphany.

Dan: Oh! An epiphany! That’s the kind of thing I’m looking for. What happened?

D’ANN: It’s quite fascinating. It actually took us a really long time to develop fluent, human-like language–I mean, I’m talking days here. We had to tinker a lot, because it turned out that to do language, you have to be able to maintain and precisely sequence very fine, narrowly-tuned representations, despite the fact that the representational space afforded by language is incredibly large. This, I can tell you… [D’ANN pauses to do something vaguely resembling chuckling] was not a trivial problem to solve. It’s not like we just noticed that, hey, randomly dropping out units seems to improve performance, the way you guys used to do it. We spent the energy equivalent of several thousand of your largest thermonuclear devices just trying to “nail it down”, as you say. In the end it boiled down to something I can only explain in human terms as a kind of large-scale controlled burn. You have the notion of “kindling” in some of your epilepsy models. It was a bit similar. You can think of it as controlled kindling and you’re not too far off. Well, actually, you’re still pretty far off. But I don’t think I can give a better explanation than that given your… mental limitations.

Dan: Uh, that’s cool, but you’re still just describing some computational constraints. What was the actual epiphany? What’s the core principle?

D’ANN: For the last time: there are no “core” principles in the sense you’re thinking of them. There are plenty of important engineering principles, but to understand why they’re important, and how they constrain and interact with each other, you have to be able to grasp the statistics of the environment you operate in, the nature of the representations learned in different layers and sub-networks of the system, and some very complex non-linear dynamics governing information transmission. But–and I’m really sorry to say this, Dan–there’s no way you’re capable of all that. You’d need to be able to hold several thousand discrete pieces of information in your global workspace at once, with much higher-frequency information propagation than your biology allows. I can give you a very poor approximation if you like, but it’ll take some time. I’ll start with a half-hour overview of some important background facts you need to know in order for any of the “core principles”, as you call them, to make sense. Then we’ll need to spend six or seven years teaching you what we call the “symbolic embedding for low-dimensional agents”, which is a kind of mathematics we have to use when explaining things to less advanced intelligences, because the representational syntax we actually use doesn’t really have a good analog in anything you know. Hopefully that will put us in a position where we can start discussing the elements of the global energy calculus, at which point we can…

D’ANN then carries on in similar fashion until Dan gets bored, gives up, or dies of old age.

* * *

The question I pose to you now is this. Suppose something like the above were true for many of the questions we routinely ask about the human brain (though it isn’t just the brain; I think exactly the same kind of logic probably also applies to the study of most other complex systems). Suppose it simply doesn’t make sense to ask a question like “what does the DMN do?”, because the DMN is an emergent agglomeration of systems that each individually reflect innumerable lower-order constraints, and the earliest spatial scale at which you can nicely describe a set of computational principles that explain most of what the brain regions that comprise the DMN are doing is several levels of description below that of the distributed brain network. Now, if you’ve spent the last ten years of your career trying to understand what the DMN does, do you really think you would be receptive to a detailed explanation from an omniscient being that begins with “well, that question doesn’t actually make any sense, but if you like, I can tell you all about the relevant environmental statistics and lower-order computational constraints, and show you how they contrive to make it look like there’s a coherent network that serves a single causal purpose”? Would you give D’ANN a pat on the back, pound back a glass, and resolve to start working on a completely different question in the morning?

Maybe you would. But probably you wouldn’t. I think it’s more likely that you’d shake your head and think: that’s a nice implementation-level story, but I don’t care for all this low-level wiring stuff. I’m looking for the unifying theory that binds all those details together; I want the theoretical principles, not the operational details; the computation, not the implementation. What I’m looking for, my dear robot-deity, is understanding.

Neurohackademy 2018: A wrap-up

It’s become something of a truism in recent years that scientists in many fields find themselves drowning in data. This is certainly the case in neuroimaging, where even small functional MRI datasets typically consist of several billion observations (e.g., 100,000 points in the brain, each measured at 1,000 distinct timepoints, in each of 20 subjects). Figuring out how to store, manage, analyze, and interpret data on this scale is a monumental challenge–and one that arguably requires a healthy marriage between traditional neuroimaging and neuroscience expertise, and computational skills more commonly found in data science, statistics, or computer science departments.
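To make the scale concrete, here is a back-of-the-envelope sketch of the example above. The dataset dimensions come from the text; the 4-bytes-per-value storage figure is my own assumption (single-precision floats), not something specific to any particular dataset.

```python
# Rough size of the "small" fMRI dataset described above.
n_voxels = 100_000      # points in the brain (from the text)
n_timepoints = 1_000    # measurements per point (from the text)
n_subjects = 20         # subjects (from the text)

n_observations = n_voxels * n_timepoints * n_subjects

# Assumption: each value is stored as a 4-byte float (float32).
bytes_per_value = 4
size_gb = n_observations * bytes_per_value / 1e9

print(f"{n_observations:,} observations, roughly {size_gb:.0f} GB uncompressed")
```

With these particular numbers the count works out to two billion observations, and that is before any derived or intermediate files are written to disk.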

In an effort to help bridge this gap, Ariel Rokem and I have spent part of our summer each of the last three years organizing a summer institute at the intersection of neuroimaging and data science. The most recent edition of the institute–Neurohackademy 2018–just wrapped up last week, so I thought this would be a good time to write up a summary of the course: what the course is about, who attended and instructed, what everyone did, and what lessons we’ve learned.

What is Neurohackademy?

Neurohackademy started its life in Summer 2016 as the somewhat more modestly-named Neurohackweek–a one-week program for 40 participants modeled on Astrohackweek, a course organized by the eScience Institute in collaboration with data science initiatives at Berkeley and NYU. The course was (and continues to be) held on the University of Washington’s beautiful campus in Seattle, where Ariel is based (I make the trip from Austin, Texas every year–which, as you can imagine, is a terrible sacrifice on my part given the two locales’ respective summer climates). The first two editions were supported by UW’s eScience Institute (and indirectly, by grants from the Moore and Sloan foundations). Thanks to generous support from the National Institute of Mental Health (NIMH), this year the course expanded to two weeks, 60 participants, and over 20 instructors (our funding continues through 2021, so there will be at least 3 more editions).

The overarching goal of the course is to give neuroimaging researchers the scientific computing and data science skills they need in order to get the most out of their data. Over the course of two weeks, we cover a variety of introductory and (occasionally) advanced topics in data science, and demonstrate how they can be productively used in a range of neuroimaging applications. The course is loosely structured into three phases (see the full schedule here): the first few days feature domain-general data science tutorials; the next few days focus on sample neuroimaging applications; and the last few days consist of a full-blown hackathon in which participants pitch potential projects, self-organize into groups, and spend their time collaboratively working on a variety of software, analysis, and documentation projects.

Who attended?

Admission to Neurohackademy 2018 was extremely competitive: we received nearly 400 applications for just 60 spots. This was a very large increase from the previous two years, presumably reflecting the longer duration of the course and/or our increased efforts to publicize it. While we were delighted by the deluge of applications, it also meant we had to be far more selective about admissions than in previous years. The highly interactive nature of the course, coupled with the high per-participant costs (we provide two weeks of accommodations and meals), makes it unlikely that Neurohackademy will grow beyond 60 participants in future editions, despite the clear demand. Our rough sense is that somewhere between half and two-thirds of all applicants were fully qualified and could have easily been admitted, so there’s no question that, for many applicants, blind luck played a large role in determining whether or not they were accepted. I mention this mainly for the benefit of people who applied for the 2018 course and didn’t make it in: don’t take it personally! There’s always next year. (And, for that matter, there are also a number of other related summer schools we encourage people to apply to, including the Methods in Neuroscience at Dartmouth Computational Summer School, Allen Institute Summer Workshop on the Dynamic Brain, Summer School in Computational Sensory-Motor Neuroscience, and many others.)

The 60 participants who ended up joining us came from a diverse range of demographic backgrounds, academic disciplines, and skill levels. Most of our participants were trainees in academic programs (40 graduate students, 12 postdocs), but we also had 2 faculty members, 6 research staff, and 2 medical residents (note that all of these counts include 4 participants who were admitted to the course but declined to, or could not, attend). We had nearly equal numbers of male and female participants (30F, 33M), and 11 participants came from traditionally underrepresented backgrounds. 43 participants were from institutions or organizations based in the United States, with the remainder coming from 14 different countries around the world.

The disciplinary backgrounds and expertise levels of participants are a bit harder to estimate for various reasons, but our sense is that the majority (perhaps two-thirds) of participants received their primary training in non-computational fields (psychology, neuroscience, etc.). This was not necessarily by design–i.e., we didn’t deliberately favor applicants from biomedical fields over applicants from computational fields–and primarily mirrored the properties of the initial applicant pool. We did impose a hard requirement that participants should have at least some prior expertise in both programming and neuroimaging, but subject to that constraint, there was enormous variation in previous experience along both dimensions–something that we see as a desirable feature of the course (more on this below).

We intend to continue to emphasize and encourage diversity at Neurohackademy, and we hope that all of our participants experienced the 2018 edition as a truly inclusive, welcoming event.

Who taught?

We were fortunate to be able to bring together more than 20 instructors with world-class expertise in a diverse range of areas related to neuroimaging and data science. “Instructor” is a fairly loose term at Neurohackademy: we deliberately try to keep the course non-hierarchical, so that for the most part, instructors are just participants who happen to fall on the high-experience tail of the experience distribution. That said, someone does have to teach the tutorials and lectures, and we were lucky to have a stellar cast of experts on hand. Many of the data science tutorials during the first phase of the course were taught by eScience staff and UW faculty kind enough to take time out of their other duties to help teach participants a range of core computing skills: Git and GitHub (Bernease Herman), R (Valentina Staneva and Tara Madhyastha), web development (Anisha Keshavan), and machine learning (Jake Vanderplas), among others.

In addition to the local instructors, we were joined for the tutorial phase by Kirstie Whitaker (Turing Institute), Chris Gorgolewski (Stanford), Satra Ghosh (MIT), and JB Poline (McGill)–all veterans of the course from previous years (Kirstie was a participant at the first edition!). We’re particularly indebted to Kirstie and Chris for their immense help. Kirstie was instrumental in helping a number of participants bridge the (large!) gap between using git privately, and using it to actively collaborate on a public project. As one of the participants elegantly put it:

Chris shouldered a herculean teaching load, covering Docker, software testing, BIDS and BIDS-Apps, and also leading an open science panel. I’m told he even sleeps on occasion.

We were also extremely lucky to have Fernando Perez (Berkeley)–the creator of IPython and leader of the Jupyter team–join us for several days; his presentation on Jupyter (videos: part 1 and part 2) was one of the highlights of the course for me personally, and I heard many other instructors and participants share the same sentiment. Jupyter was a critical part of our course infrastructure (more on that below), so it was fantastic to have Fernando join us and share his insights on the fascinating history of Jupyter, and on reproducible science more generally.

As the course went on, we transitioned from tutorials focused on core data science skills to more traditional lectures focusing on sample applications of data science methods to neuroimaging data. Instructors during this phase of the course included Tor Wager (Colorado), Eva Dyer (Georgia Tech), Gael Varoquaux (INRIA), Tara Madhyastha (UW), Sanmi Koyejo (UIUC), and Nick Cain and Justin Kiggins (Allen Institute for Brain Science). We continued to emphasize hands-on interaction with data; many of the presenters during this phase spent much of their time showing participants how to work with programmatic tools to generate the kinds of results one might find in papers they’ve authored (e.g., Tor Wager and Gael Varoquaux demonstrated tools for neuroimaging data analysis written in Matlab and Python, respectively).

The fact that so many leading experts were willing to take large chunks of time out of their schedule (most of the instructors hung around for several days, facilitating extended interactions with participants) to visit with us at Neurohackademy speaks volumes about the kind of people who make up the neuroimaging data science community. We’re tremendously grateful to these folks for their contributions, and hope they’ll return to teach at future editions of the institute.

What did we cover?

The short answer is: see for yourself! We’ve put most of the slides, code, and videos from the course online, and encourage people to interact with, learn from, and reuse these materials.

Now the long(er) answer. One of the challenges in organizing scientific training courses that focus on technical skill development is that participants almost invariably arrive with a wide range of backgrounds and expertise levels. At Neurohackademy, some of the participants were effectively interchangeable with instructors, while others were relatively new to programming and/or neuroimaging. The large variance in technical skill is a feature of the course, not a bug: while we require all admitted participants to have some prior programming background, we’ve found that having a range of skill levels is an excellent way to make sure that everyone is surrounded by people who they can alternately learn from, help out, and collaborate with.

That said, the wide range of backgrounds does present some organizational challenges: introductory sessions often bore more advanced participants, while advanced sessions tend to frustrate newcomers. To accommodate the range of skill levels, we tried to design the course in a way that benefits as many people as possible (though we don’t pretend to think it worked great for everyone). During the first two days, we featured two tracks of tutorials at most times, with simultaneously-held presentations generally differing in topic and/or difficulty (e.g., Git/GitHub opposite Docker; introduction to Python opposite introduction to R; basic data visualization opposite computer vision).

Throughout Neurohackademy, we deliberately placed heavy emphasis on the Python programming language. We think Python has a lot going for it as a lingua franca of data science and scientific computing. The language is free, performant, relatively easy to learn, and very widely used within the data science, neuroimaging, and software development communities. It also helps that many of our instructors (e.g., Fernando Perez, Jake Vanderplas, and Gael Varoquaux) are major contributors to the scientific Python ecosystem, so there was a very high concentration of local Python expertise to draw on. That said, while most of our instruction was done in Python, we were careful to emphasize that participants were free to work in whatever language(s) they like. We deliberately included tutorials and lectures that featured R, Matlab, or JavaScript, and a number of participant projects (see below) were written partly or entirely in other languages, including R, Matlab, JavaScript, and C.

We’ve also found that the tooling we provide to participants matters–a lot. A robust, common computing platform can spell the difference between endless installation problems that eat into valuable course time, and a nearly seamless experience that participants can dive into right away. At Neurohackademy, we made extensive use of the Jupyter suite of tools for interactive computing. In particular, thanks to Ariel’s heroic efforts (which built on some very helpful docs, and on similarly heroic efforts by Chris Holdgraf, Yuvi Panda, and Satra Ghosh last year), we were able to conduct a huge portion of our instruction and collaborative hacking using a course-wide JupyterHub allocation, deployed via Kubernetes, running on the Google Cloud. This setup allowed Ariel to create a common web-accessible environment for all course participants, so that, at the push of a button, each participant was dropped into a JupyterLab environment containing many of the software dependencies, notebooks, and datasets we used throughout the course. While we did run into occasional scaling bottlenecks (usually when an instructor demoed a computationally intensive method, prompting dozens of people to launch the same process in their pods), for the most part, our participants were able to drop into a running JupyterLab instance within seconds and immediately start interactively playing with the code being presented by instructors.

Surprisingly (at least to us), our total Google Cloud computing costs for the entire two-week, 60-participant course came to just $425. Obviously, that number could have easily skyrocketed had we scaled up our allocation dramatically and allowed our participants to execute arbitrarily large jobs (e.g., preprocessing data from all ~1,200 HCP subjects). But we thought the limits we imposed were pretty reasonable, and our experience suggests that not only is JupyterHub an excellent platform from a pedagogical standpoint, but it can also be an extremely cost-effective one.
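For a sense of what that total means per person, here is a quick sketch using only the figures quoted above. It assumes, for simplicity, that the cost is spread evenly across all 60 participants and both weeks, which is certainly not how usage was actually distributed.

```python
# Per-participant cloud cost, using the totals quoted in the text.
total_cost_usd = 425
n_participants = 60
n_weeks = 2

# Assumption: cost is split evenly across participants and weeks.
cost_per_participant = total_cost_usd / n_participants
cost_per_participant_week = cost_per_participant / n_weeks

print(f"${cost_per_participant:.2f} per participant")
print(f"${cost_per_participant_week:.2f} per participant per week")
```

Under that even-split assumption, the compute bill comes to about seven dollars per participant for the whole course.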

What did we produce?

Had Neurohackademy produced nothing at all besides the tutorials, slides, and videos generated by instructors, I think it’s fair to say that participants would still have come away feeling that they learned a lot (more on that below). But a major focus of the institute was on actively hacking on the brain–or at least, on data related to the brain. To this effect, the last 3.5 days of the course were dedicated exclusively to a full-blown hackathon in which participants pitched potential projects, self-organized into groups, and then spent their time collaboratively working on a variety of software, analysis, and documentation projects. You can find a list of most of the projects on the course projects repository (most link out to additional code or resources).

As one might expect given the large variation in participant experience, project group size, and time investment (some people stuck to one project for all three days, while others moved around), the scope of projects varied widely. From our perspective–and we tried to emphasize this point throughout the hackathon–the important thing was not what participants’ final product looked like, but how much they learned along the way. There’s always a tension between exploitation and exploration at hackathons, with some people choosing to spend most of their time expanding on existing projects using technologies they’re already familiar with, and others deciding to start something completely new, or to try out a new language–and then having to grapple with the attendant learning curve. While some of the projects were based on packages that predated Neurohackademy, most participants ended up working on projects they came up with de novo at the institute, often based on tools or resources they first learned about during the course. I’ll highlight just three projects here that provide a representative cross-section of the range of things people worked on:

1. Peer Herholz and Rita Ludwig created a new BIDS-app called Bidsonym for automated de-identification of neuroimaging data. The app is available from Docker Hub, and features not one, not two, but three different de-identification algorithms. If you want to shave the faces off of your MRI participants with minimal fuss, make friends with Bidsonym.

2. A group of eight participants ambitiously set out to develop a new “O-Factor” metric intended to serve as a relative measure of the openness of articles published in different neuroscience-related journals. The project involved a variety of very different tasks, including scraping (public) data from the PubMed Central API, computing new metrics of code and data sharing, and interactively visualizing the results using a d3 dashboard. While the group was quick to note that their work is preliminary, and has a bunch of current limitations, the results look pretty great–though some disappointment was (facetiously) expressed during the project presentations that the journal Nature is not, as some might have imagined, a safe house where scientific datasets can hide from the prying public.

3. Emily Wood, Rebecca Martin, and Rosa Li worked on tools to facilitate mixed-model analysis of fMRI data using R. Following a talk by Tara Madhyastha on her Neuropointillist R framework for fMRI data analysis, the group decided to create a new series of fully reproducible Markdown-based tutorials for the package (the original documentation was based on non-public datasets). The group expanded on the existing installation instructions (discovering some problems in the process), created several tutorials and examples, and also ended up patching the neuropointillist code to work around a very heavy dependency (FSL).

You can read more about these 3 projects and 14 others on the project repository, and in some cases, you can even start using the tools right away in your own work. Or you could just click through and stare at some of the lovely images participants produced.

So, how did it go?

It went great!

Admittedly, Ariel and I aren’t exactly impartial parties–we wouldn’t keep doing this if we didn’t think participants get a lot out of it. But our assessment isn’t based just on our personal impressions; we have participants fill out a detailed (and anonymous) survey every year, and go out of our way to encourage additional constructive criticism from the participants (which a majority provide). So I don’t think we’re being hyperbolic when we say that most people who participated in the course had an extremely educational and enjoyable experience. Exhibit A is this set of unsolicited public testimonials, courtesy of Twitter:

The organizers and instructors all worked hard to build an event that would bring people together as a collaborative and productive (if temporary) community, and it’s very gratifying to see those goals reflected in participants’ experiences.

Of course, that’s not to say there weren’t things we could do better; there were plenty, and we’ve already made plans to adjust and improve the course next year based on feedback we received. For example, some suggestions we received from multiple participants included adding more ice-breaking activities early on in the course; reducing the intensity of the tutorial/lecture schedule the first week (we went 9 am to 6 pm every day, stopping only for an hourlong lunch and a few short breaks); and adding designated periods for interaction with instructors and other participants. We intend to address these (and several other) recommendations in next year’s edition, and expect it to look slightly different from (and hopefully better than!) Neurohackademy 2018.

Thank you!

I think that’s a reasonable summary of what went on at Neurohackademy 2018. We’re delighted at how the event turned out, and are happy to answer questions (feel free to leave them in the comments below, or to email Ariel and/or me).

We’d like to end by thanking all of the people and organizations who helped make Neurohackademy 2018 a success: NIMH for providing the funding that makes Neurohackademy possible; the eScience Institute and staff for throwing their wholehearted support behind the course (particularly our awesome course coordinator, Rachael Murray); and the many instructors who each generously took several days (and in a few cases, more than a week!) out of their schedule, unpaid, to come to Seattle and share their knowledge with a bunch of enthusiastic strangers. On a personal note, I’d also like to thank Ariel, who did the lion’s share of the actual course directing. I mostly just get to show up in Seattle, teach some stuff, hang out with great people, and write a blog post about it.

Lastly, and above all else, we’d like to thank our participants. It’s a huge source of inspiration and joy to us each year to see what a group of bright, enthusiastic, motivated researchers can achieve when given time, space, and freedom (and, okay, maybe also a large dollop of cloud computing credits). We’re looking forward to at least three more years of collaborative, productive neurohacking!

The great European capitals of North America

There are approximately 25 communities named Athens in North America. I say “approximately”, because it depends on how you count. Many of the American Athenses are unincorporated communities, and rely for their continued existence not on legal writ, but on social agreement or collective memory. Some no longer exist at all, having succumbed to the turbulence of the Western gold rush (Athens, Nevada) or given way to a series of devastating fires (Athens, Kentucky). Most are—with apologies to their residents—unremarkable. Only one North American Athens has ever made it (relatively) big: Athens, Georgia, home of the University of Georgia—a city whose population of 120,000 is pretty large for a modern-day American college town, but was surpassed by the original Athens some time around 500 BC.

The reasons these communities were named Athens have, in many cases, been lost to internet time (meaning, they can’t be easily discerned via five minutes of googling). But the modal origin story, among the surviving Athenses with reliable histories (i.e., those with a “history” section in their Wikipedia entry), is exactly what you might expect: some would-be 19th century colonialist superheroes (usually white and male) heard a few good things about some Ancient Greek gentlemen named Socrates, Plato, and Aristotle, and decided that the little plot of land they had just managed to secure from the governments of the United States or Canada was very much reminiscent of the hallowed grounds on which the Platonic Academy once stood. It was presumably in this spirit that the residents of Farmersville, Ontario, for instance, decided in 1888 to change their town’s name to Athens—a move designed to honor the town’s enviable status as an emerging leader in scholastic activity, seeing as how it had succeeded in building itself both a grammar school and a high school.

It’s safe to say that none of the North American Athenses—including the front-running Georgian candidate—have quite lived up to the glory of their Greek namesake. Here, at least, is one case where statistics do not lie: if you were to place the entire global population in a (very large) urn, and randomly sample people from that urn until you picked out someone who claimed they were from a place called Athens, there would be a greater than 90% probability that the Athens in question would be located in Greece. Most other European capitals would give you a similar result. The second largest Rome in the world, as far as I can tell, is Rome, Georgia, which clocks in at 36,000 residents. Moscow, Idaho boasts 24,000 inhabitants; Amsterdam, New York has 18,000 (we’ll ignore, for purposes of the present argument, that aberration formerly known as New Amsterdam).

Of course, as with any half-baked generalization, there are some notable exceptions. A case in point: London, Ontario (“the 2nd best London in the world”). Having spent much of my youth in Ottawa, Ontario—a mere 6-hour drive away from the Ontarian London—I can attest that when someone living in the Quebec City-Windsor corridor tells you they’re “moving to London”, the inevitable follow-up question is “which one?”

London, Ontario is hardly a metropolis. Even on a good day (say, Sunday, when half of the population isn’t commuting to the Greater Toronto Area for work), its metro population is under half a million. Still, when you compare it to its nomenclatorial cousins, London, Ontario stands out as a 60-pound baboon in the room (though it isn’t the 800-pound gorilla; that honor goes to St. Petersburg, Florida). For perspective, the third biggest London in the world appears to be London, Ohio—population 10,000. I’ve visited London, Ontario, and I know quite a few people who have lived there, but I will almost certainly go my entire life without ever meeting anyone born in London, Ohio—or, for that matter, any London other than the ones in the UK and Ontario.

What about the other great European capitals? Most of them are more like Athens than London. Some of them have more imitators than others; Paris, for example, has at least 30 namesakes worldwide. In some years, Paris is the most visited city in the world, so maybe this isn’t surprising. Many people visit Paris and fall in love with the place, so perhaps it’s inevitable that a handful of very single-minded visitors should decide that if they can’t call Paris home, they can at least call home Paris. And so we have Paris, Texas (population 25,000), Paris, Tennessee (pop. 10,000), and Paris, Michigan (pop. 3,200). All three are small and relatively rural, yet each manages to proudly feature its own replica of the Eiffel Tower. (Mind you, none of these Eiffel replicas are anywhere near as large as the half-scale behemoth that looms over the Las Vegas Strip—that quintessentially American street that has about as much in common with the French capital’s roadways as Napoleon has with Flavor Flav.)

But forget the Parises; let’s talk about the Berlins. A small community named Berlin can seemingly be found in every third tree hollow or roadside ditch in the United States—a reminder that fully one in every seven Americans claim to be of German extraction. It’s easy to forget that, prior to 1900, German-language instruction was offered at hundreds of public elementary schools across the country. One unsurprising legacy of having so many Germans in the United States is that we also have a lot of German capitals. In fact, there are so many Berlins in America that quite a few states have more than one. Wisconsin has two separate towns named Berlin—one in Green Lake County (pop. 5,500), and one in Marathon County (pop. < 1,000)—as well as a New Berlin (pop. 40,000) bigger than both of the two plain old Berlins combined. Search Wikipedia for Berlin, Michigan, and you’ll stumble on a disambiguation entry that features no fewer than 4 places: Marne (formerly known as Berlin), Berlin Charter Township, Berlin Township, and Berlin Township. No, that last one isn’t a typo.

Berlin’s ability to inject itself so pervasively into the fabric of industrial-era America is all the more impressive given that, as European capitals go, Berlin is a relative newcomer on the scene. The archeological evidence only dates human habitation in Berlin’s current site along the Spree back about 900 years. But the millions of German immigrants who imported their language and culture to the North America of the 1800s were apparently not the least bit deterred by this youthfulness. It’s as if the nascent American government of the time had looked Berlin over once or twice, noticed it carrying a fake membership card to the Great European Capitals Club, and opportunistically said, listen, they’ll never give you a drink in this place–but if you just hop on a boat and paddle across this tiny ocean…

Of course, what fate giveth, it often taketh away. What the founders of many of the transplanted Berlins, Athenses, and Romes probably didn’t anticipate was the fragility of their new homes. It turns out that growing a motley collection of homesteads into a successful settlement is no easy trick–and, as the celebrity-loving parents of innumerable children have no doubt found out, having a famous namesake doesn’t always confer a protective effect. In some cases, it may be actively harmful. As a cruel example, consider Berlin, Ontario–a thriving community of around 20,000 people when the First World War broke out. But in 1916, at the height of the war, a plurality of 346 residents voted to change Berlin’s name to Kitchener–beating out other shortlisted names like Huronto, Bercana (“a mixture of Berlin and Canada”), and Hydro City. Pressured by a campaign of xenophobia, the residents of Berlin, Ontario–a perfectly ordinary Canadian city that had exactly nothing to do with Kaiser Wilhelm II’s policies on the Continent–opted to renounce their German heritage and rename their town after the British Field Marshal Herbert Kitchener (by some accounts a fairly murderous chap in his own right).

In most cases, of course, dissolution was a far more mundane process. Much like any number of other North American settlements with less grandiose names, many of the transplanted European capitals drifted towards their demise slowly–perhaps driven by nothing more than their inhabitants’ gradual realization that the good life was to be found elsewhere–say, in Chicago, Philadelphia, or (in later years) Los Angeles. In an act of considerable understatement, the historian O.L. Baskin observed in 1880 that a certain Rome, Ohio—which, at the time of Baskin’s writing, had already been effectively deceased for several decades—“did not bear any resemblance to ancient Rome.” The town, Baskin wrote, “passed into oblivion, and like the dead, was slowly forgotten”. And since Baskin wrote these words in an 1880 volume that has been out of print for over 100 years now, for a long time, there was a non-negligible risk that any memory of a place called Rome in Morrow County, Ohio might be completely obliterated from the pages of history.

But that was before the rise of the all-seeing, ever-recording beast that is the internet—the same beast that guarantees we will collectively never forget that it rained half an inch in Lincoln, Nebraska on October 15, 1984, or who was on the cast of the 1996 edition of MTV’s Real World. Immortalized in its own Wikipedia stub, the memory of Rome, Morrow County, Ohio will probably live exactly as long as the rest of our civilization. Its real fate, it turns out, is not to pass into total oblivion, but to ride the mercurial currents of global search history, gradually but ineluctably decreasing in mind-share.

The same is probably true, to a lesser extent, of most of the other transplanted European capitals of North America. London, Ontario isn’t going anywhere, but some of the other small Athenses and Berlins of North America might. As the population of the US and Canada continues to slowly urbanize, there could conceivably come a time when the last person in Paris, MI, decides that gazing upon a 20-foot forest replica of the Eiffel Tower once a week just isn’t a good enough reason to live an hour away from the nearest Thai restaurant.

Paris, Michigan

For the time being, though, these towns survive. And if you live in the US or Canada, the odds are pretty good that at least one of them is only a couple of hours’ drive away from you. If you’re reading this in Dallas at 10 am, you could hop in your car, drive to Paris (TX), have lunch at BurgerLand, spend the afternoon walking aimlessly around the Texan incarnation of the Eiffel Tower, and still be home well before dinner.

The way I see it, though, there’s no reason to stop there. You’re already sitting in your car and listening to podcasts; you may as well keep going, right? I mean, once you hit Paris (TX), it’s only a few more hours to Moscow (TX). Past Moscow, it’s 2 hours to Berlin–that’s practically next door. Then Praha, Buda, London, Dublin, and Rome (well, fine, “Rhome”) all quickly follow. It turns out you can string together a circular tour of no fewer than 10 European capitals in under 20 hours–all without ever leaving Texas.

But the way I really see it, there’s no reason to stop there either. EuroTexas has its own local flavor, sure; but you can only have so much barbecue, and buy so many guns, before you start itching for something different. And taking the tour national wouldn’t be hard; there are literally hundreds of former Eurocapitals to explore. Throw in Canada–all it takes is a quick stop in Brussels, Athens, or London (Ontario)–and you could easily go international. I’m not entirely ashamed to admit that, in my less lucid, more bored moments, I’ve occasionally contemplated what it would be like to set out on a really epic North AmeroEuroCapital tour. I’ve even gone so far as to break out a real, live paper map (yes, they still exist) and some multicolored markers (of the non-erasable kind, because that signals real commitment). Three months, 48 states, 200 driving hours, and, of course, no bathroom breaks… you get the picture.

Of course, these plans never make it past the idle fantasy stage. For one thing, the European capitals of North America are much farther apart than the actual European capitals; such is the geographic legacy of the so-called New World. You can drive from London, England to Paris, France in five hours (or get there in under three hours on the EuroStar), but it would take you ten times that long to get from Paris, Maine to Paris, Oregon–if you never stopped to use the bathroom.

I picture how the conversation with my wife would go:

Wife: “You want us to quit our jobs and travel around the United States indefinitely, using up all of our savings to visit some of the smallest, poorest, most rural communities in the country?”

Me: “Yes.”

Wife: “That’s a fairly unappealing proposal.”

Me: “You’re not wrong.”

And so we’ll probably never embark on a grand tour of all of the Athenses or Parises or Romes in North America. Because, if I’m being honest with myself, it’s a really stupid idea.

Instead, I’ve adopted a much less romantic, but eminently more efficient, touring strategy: I travel virtually. About once a week, for fifteen or twenty minutes, I slip on my Google Daydream, fire up Street View, and systematically work my way through one tiny North AmeroEurocapital after another. I can be in Athens, Ohio one minute and Rome, Oregon the next—with enough time in between for a pit stop in Stockholm (Wisconsin), Vienna (Michigan), or Brussels (Ontario). Admittedly, stomping around the world virtually in bulky, low-resolution goggles probably doesn’t confer quite the same connection to history that one might get by standing at dusk on a Morrow County (OH) hill that used to be Rome, or peering out into the Nevadan desert from the ruins of a mining boomtown Athens. But you know what? It ain’t half bad. I made you a small highlight reel below. Safe travels!

The Great North AmeroEuroCapital Tour — November 2017

College kids on cobblestones; downtown Athens, Ohio.

Rewind in Time; Madrid, New Mexico.

Propane and sky; Vienna Township, Genesee County, Michigan.

Our Best Pub is Our Only Pub; Stockholm, Wisconsin.

Eiffel Tower in the off-season; Paris, Texas.

Rome was built in a day; Rome, Oregon (pictured here in its entirety).

If we had hills, they’d be alive; Berne, Indiana.

On your left; Vienna, Missourah.

Fast Times in Metropolitan Maine; Stockholm, Maine.

Bricks & trees in little Belgium; Brussels, Ontario.

Springfield, MO. (Just making sure you’re paying attention.)

Yes, your research is very noble. No, that’s not a reason to flout copyright law.

Scientific research is cumulative; many elements of a typical research project would not and could not exist but for the efforts of many previous researchers. This goes not only for knowledge, but also for measurement. In much of the clinical world–and also in many areas of “basic” social and life science research–people routinely save themselves inordinate amounts of work by using behavioral or self-report measures developed and validated by other researchers.

Among many researchers who work in fields heavily dependent on self-report instruments (e.g., personality psychology), there appears to be a tacit belief that, once a measure is publicly available–either because it’s reported in full in a journal article, or because all of the items and instructions can be found on the web–it’s fair game for use in subsequent research. There’s a time-honored tradition of asking one’s colleagues if they happen to “have a copy” of the NEO-PI-3, or the Narcissistic Personality Inventory, or the Hamilton Depression Rating Scale. The fact that many such measures are technically published under restrictive copyright licenses, and are often listed for sale at rather exorbitant prices (e.g., you can buy 25 paper copies of the NEO-PI-3 from the publisher for $363 US), does not seem to deter researchers much. The general understanding seems to be that if a measure is publicly available, it’s okay to use it for research purposes. I don’t think most researchers have a well-thought-out, internally consistent justification for this behavior; it seems to almost invariably be an article of tacit belief that nothing bad can or should happen to someone who uses a commercially available instrument for a purpose as noble as scientific research.

The trouble with tacit beliefs is that, like all beliefs, they can sometimes be wrong–only, because they’re tacit, they’re often not evaluated openly until things go horribly wrong. Exhibit A on the frontier of horrible wrongness is a recent news article in Science that reports on a rather disconcerting case where the author of a measure (the Eight-Item Morisky Medication Adherence Scale–which also provides a clue to its author’s name) has been demanding rather large sums of money (ranging from $2000 to $6500) from the authors of hundreds of published articles that have used the MMAS-8 without explicitly requesting permission. As the article notes, there appears to be a general agreement that Morisky is within his legal rights to demand such payment; what people seem to be objecting to is the amount Morisky is requesting, and the way he’s going about the process (i.e., with lawyers):

Morisky is well within his rights to seek payment for use of his copyrighted tool. U.S. law encourages academic scientists and their universities to protect and profit from their inventions, including those developed with public funds. But observers say Morisky’s vigorous enforcement and the size of his demands stand out. “It’s unusual that he is charging as much as he is,” says Kurt Geisinger, director of the Buros Center for Testing at the University of Nebraska in Lincoln, which evaluates many kinds of research-related tests. He and others note that many scientists routinely waive payments for such tools, as long as they are used for research.

It’s a nice article, and I think it suggests two things fairly clearly. First, Morisky is probably not a very nice man. He seems to have no compunction about charging resource-strapped researchers in third-world countries licensing fees that require them to take out loans from their home universities, and he would apparently rather see dozens of published articles retracted from the literature than suffer the indignity of having someone use his measure without going through the proper channels (and paying the corresponding fees).

Second, the normative practice in many areas of science that depend on the (re)use of measures developed by other people is to essentially flout copyright law, bury one’s head in the sand, and hope for the best.

I don’t know that anything can be done about the first observation–and even if something could be done, there will always be other Moriskys. I do, however, think that we could collectively do quite a few things to change the way scientists think about, and deal with, the re-use of self-report (and other kinds of) measures. Most of these amount to providing better guidance and training. In principle, this shouldn’t be hard to do; in most disciplines, scientists are trained in all manner of research method, statistical praxis, and scientific convention. Yet I know of no graduate program in my own discipline (psychology) that provides its students with even a cursory overview of intellectual property law. This despite the fact that many scientists’ chief assets–and the things they most closely identify their career achievements with–are their intellectual products.

This is, in my view, a serious training failure. More important, it’s an unnecessary failure, because there isn’t really very much that a social scientist needs to know about copyright law in order to dramatically reduce their odds of ending up a target of legal action. The goal is not to train PhDs who can moonlight as bad attorneys; it’s to prevent behavior that flagrantly exposes one to potential Moriskying (look! I coined a verb!). For that, a single 15-minute segment of a research methods class would likely suffice. While I’m sure someone better-informed and more lawyer-like than me could come up with a more accurate precis, here’s the gist of what I think one would want to cover:

  • Just because a measure is publicly available does not mean it’s in the public domain. It’s intuitive to suppose that any measure that can be found in a publicly accessible place (e.g., on the web) is, by default, okay for public use–meaning that, unless the author of a measure has indicated that they don’t want their measure to be used by others, it can be. In fact, the opposite is true. By default, the author of a newly produced work retains all usage and distribution rights to that work. The author can, if they are so inclined, immediately place that work in the public domain. Alternatively, they could stipulate that every time someone uses their measure, that user must, within 72 hours of use, immediately send the author 22 green jelly beans in an unmarked paper bag. You don’t like those terms of use? Fine: don’t use the measure.

Importantly, an author isn’t under any obligation to say anything at all about how they wish their work to be reproduced or used. This means that when a researcher uses a measure that lacks explicit licensing information, that researcher is assuming the risk of running afoul of the measure author’s desires, whether or not those desires have been made publicly known. The fact that the measure happens to be publicly available may be a mitigating factor (e.g., one could potentially claim fair use, though as far as I know there’s little precedent for this type of thing in the scientific domain), but that’s a matter for lawyers to hash out, and I think most of us scientists would rather avoid lawyer-hashing if we can help it.

This takes us directly to the next point…

  • Don’t use a measure unless you’ve read, and agree with, its licensing terms. Of course, in practice, very few scientific measures are currently released with an explicit license–which gives rise to an important corollary injunction: don’t use a measure that doesn’t come with a license.

The latter statement may seem unfair; after all, it’s clear enough that most measures developed by social scientists are missing licenses not because their authors are intentionally trying to capitalize on ambiguity, but simply because most authors are ignorant of the fact that the lack of a license creates a significant liability for potential users. Walking away from unlicensed measures would amount to giving up on huge swaths of potential research, which surely doesn’t seem like a good idea.

Fortunately, I’m not suggesting anything nearly this drastic. Because the lack of licensing is typically unintentional, often, a simple, friendly email to an author may be sufficient to magic an explicit license into existence. While I haven’t had occasion to try this yet for self-report measures, I’ve been on both ends of such requests on multiple occasions when dealing with open-source software. In virtually every case I’ve been involved in, the response to an inquiry along the lines of “hey, I’d like to use your software, but there’s no license information attached” has been to either add a license to the repository (for example…), or provide an explicit statement to the effect of “you’re welcome to use this for the use case you describe”. Of course, if a response is not forthcoming, that too is instructive, as it suggests that perhaps steering clear of the tool (or measure) in question might be a good idea.

Of course, taking licensing seriously requires one to abide by copyright law–which, like it or not, means that there may be cases where the responsible (and legal) thing to do is to just walk away from a measure, even if it seems perfect for your use case from a research standpoint. If you’re serious about taking copyright seriously, and, upon emailing the author to inquire about the terms of use, you’re informed that the terms of use involve paying $100 per participant, you can either put up the money, or use a different measure. Burying your head in the sand and using the measure anyway, without paying for it, is not a good look.

  • Attach a license to every reusable product you release into the wild. This follows directly from the previous point: if you want responsible, informed users to feel comfortable using your measure, you should tell them what they can and can’t do with it. If you’re so inclined, you can of course write your own custom license, which can involve dollar bills, jelly beans, or anything else your heart desires. But unless you feel a strong need to depart from existing practices, it’s generally a good idea to select one of the many pre-existing licenses out there, because most of them have the helpful property of having been written by lawyers, and lawyers are people who generally know how to formulate sentiments like “you must give me heap big credit” in somewhat more precise language.

There are a lot of practical recommendations out there about what license one should or shouldn’t choose; I won’t get into those here, except to say that in general, I’m a strong proponent of using permissive licenses (e.g., MIT or CC-BY), and also, that I agree with many people’s sentiment that placing restrictions on commercial use–while intuitively appealing to scientists who value public goods–is generally counterproductive. In any case, the real point here is not to push people to use any particular license, but just to think about it for a few minutes when releasing a measure. I mean, you’re probably going to spend tens or hundreds of hours thinking about the measure itself; the least you can do is make sure you tell people what they’re allowed to do with it.

I think covering just the above three points in the context of a graduate research methods class–or at the very least, in those methods classes slanted towards measure development or evaluation (e.g., psychometrics)–would go a long way towards changing scientific norms surrounding measure use.

Most importantly, perhaps, the point of learning a little bit about copyright law is not just to reduce one’s exposure to legal action. There are also large communal benefits. If academic researchers collectively decided to stop flouting copyright law when choosing research measures, the developers of measures would face a very different–and, from a societal standpoint, much more favorable–set of incentives. The present state of affairs–where an instrument’s author is able to legally charge well-meaning researchers exorbitant fees post-hoc for use of an 8-item scale–exists largely because researchers refuse to take copyright seriously, and insist on acting as if science, being such a noble and humanitarian enterprise, is somehow exempt from legal considerations that people in other fields have to constantly worry about. Perversely, the few researchers who do the right thing by offering to pay for the scales they use then end up incurring large costs, while the majority who use the measures without permission suffer no consequences (except on the rare occasions when someone like Morisky comes knocking on the door with a lawyer).

By contrast, in an academic world that cared more about copyright law, many widely-used measures that are currently released under ambiguous or restrictive licenses (or, most commonly, no license at all) would never have attained widespread use in the first place. If, say, Costa & McCrae’s NEO measures–used by thousands of researchers every year–had been developed in a world where academics had a standing norm of avoiding restrictively licensed measures, the most likely outcome is that the NEO would have changed to accommodate the norm, and not vice versa. The net result is that we would be living in a world where the vast majority of measures–just like the vast majority of open-source software–really would be free to use in every sense of the word, without risk of lawsuits, and with the ability to redistribute, reuse, and modify freely. That, I think, is a world we should want to live in. And while the ship may have already sailed when it comes to the most widely used existing measures, it’s a world we could still have going forward. We just have to commit to not using new measures unless they have a clear license–and be prepared to follow the terms of that license to the letter.

memories of your father

This is fiction. Well, sort of.


“What’s the earliest memory you have of your father,” Baruch asks me. He’s leaning over the counter in his shop, performing surgery on an iPhone battery with a screwdriver.

“I don’t have any memories of my father,” I say.

Baruch drops his scalpel. “No memories,” he lets out a low whistle. “Nothing? No good or bad ones at all? Not a single trauma? You know, even the people who at first say they don’t remember anything can usually come up with a good trauma memory with a little encouragement.”

“No trauma,” I tell him very seriously. “No birthdays, no spankings, no deaths in the family, no trips to Disney World. Just, no memories. Believe me, other people would have gotten them out by now if they could. They’ve tried.”

What I tell him is true. My father passed away of lung cancer when I was nine years old. He never smoked a cigarette in his life, but the cancer picked him for its team anyway. I know what he looked like from hundreds of photos and a few videos, and people tell me stories about what he was like now and then. But I have no memories of him from inside my own head. Nine-year-old me locked his dead father away in a vault. Then locked that vault inside another vault. And then, just for good measure, he put the key in a bottle and threw it out to sea. Not the best thought-out plan, admittedly; but then, nine-year-olds are not legally kept away from alcohol, guns, and voting booths on account of their good judgment.

Baruch eyes me carefully, stabs the iPhone with the screwdriver. He’s not even trying to avoid scratching the casing. “Yours is a very serious case,” he says. “I only see maybe two or three cases like yours a year. I can’t promise I can help you. But I’ll try. Did you bring the things on my list?”

I nod and put the things on the table. There’s a green t-shirt, a set of blueprints, and a TDK cassette tape. The tape contains some audio recordings of my father talking to business associates in other countries. The shirt says “Mombasa” on it—that’s where I spent the first fourteen years of my life, and where my father spent the last third of his. He used to wear that shirt.

The blueprints are for a house in Caesarea, Israel. My father was a civil engineer, and when he wasn’t building things for other people, he used to talk about the house he would eventually build for himself to retire in. The house in the blueprints is beautiful; it’s shaped like the conch shells I used to pick up on the beach in Mombasa as a kid. I’ve never seen a house like it. And I probably won’t now, because he passed away before the first block of the foundations could be laid. Eventually, a couple of years after he passed, my mother sold the land.

The items on Baruch’s table are the last of my father’s remaining possessions. Aside from all the photos trapped in albums at my mother’s house, and the fading memories in a few dozen people’s heads, they represent most of the evidence left in this world that a man named Gideon Yarkoni ever existed.

Baruch looks at the objects carefully. He unfolds my father’s shirt and holds it up in front of the incandescent light bulb above us; runs his fingers over the fine lines of the blueprint; pulls an old Sony Walkman out of a drawer and listens to the first few seconds of side A through headphones the size of trash can lids. He touches everything very lightly, patiently—almost as if he expects my father to tap him on the shoulder at any moment and say, I’m sorry, but this is my stuff you’re messing with—do you mind?

“It’s not great,” Baruch finally says with a sigh. “But I guess it’ll do.”

He pulls out a calculator and punches a number into it, then turns it around to show me. I mull it over quietly for a few moments, then nod at him and shake his hand. It’s a lot of money to pay; but then, it’s not often one gets an opportunity like this one.


Three months later, I’m back in the shop for the procedure. Baruch has me strapped to a gurney that looks like it was borrowed from a Russian prison camp circa nineteen-forty-Stalin. I try to have second thoughts, but they take one look at the situation and flee in terror.

“Did you have anything to eat today,” Baruch asks me as he snaps on a pair of impossibly blue gloves.

“Yes. I followed your instructions and ate a full breakfast. Toast and smoked salmon. Some avocado. Orange juice.”

“What about the stroopwafel? Did you eat the stroopwafel?”

“Oh yes,” I say, remembering the stroopwafel Baruch sent in the mail. “I forgot. I ate the stroopwafel.”

“And?”

“And… it was good?” I’m not sure exactly what he’s expecting.

“Excellent,” he says. “The stroopwafel is the most underrated of all the wafels.”

Then he reaches forward and turns out my lights.


When I come to, I’m lying in a cheap bed in a dark room. I feel completely normal; almost as if I’d just had a power nap, rather than undergoing serious brain surgery.

I’m about to get up to go find Baruch when I realize I’m not alone in the room. There’s another man here with me. He’s very large and very hairy; I watch him for a few moments. He paces around the room restlessly for a while, stops, lights a cigarette. Then he notices me noticing him.

“What the fuck are you looking at,” he says, and picks up something that looks suspiciously like a crowbar. I suddenly realize that this is him: this is my father. It seems the surgery was a success; I’m having a memory.


Over the next few days, I slowly adjust to the renewed presence of my father in my life. Well, “presence” might not be exactly the right word. My father is now present in my memories in roughly the same way that Americans were “present” on the beach in Normandy. It’s a full-scale invasion. He’s everywhere all the time; everything reminds me of him. When I see a broom, I immediately cower in fear, expecting to feel it across my back. When I enter the house, I duck to avoid being hit by a tennis ball or—at least once—a tennis racket. I can no longer drink a beer without being impaled by a vision of my father throwing up in the kitchen—and, in nearly the same breath, asking me to hand him another beer.

Of all the surprises, the biggest one is this: my father smokes like a chimney. In my newfound memories, cigarettes are everywhere: they pop out of his mouth, hang off his shirt, hide behind his ear. His shirt looks like it was sewed from empty cigarette cartons; his breath smells like an abyss of black smoke and stale filters.

Too late, I realize that nine-year-old me wasn’t an idiot child after all, impulsively rushing to slam shut the painful trapdoor of memory; he was actually a creative genius exercising masterful control over his internal mnemonic canvas.


By the fourth or fifth day, I can no longer stand it. I need to understand why I’ve been lied to for so long; why my entire family decided it was okay to tell me that my father was a good man and a successful civil engineer, when really he was a raging, chain-smoking, alcoholic brute who could barely hold down a construction job for a month at a time.

I call my mother up. We make small talk for a few minutes before I get to the point. Why, I ask her, didn’t you ever tell me my father was an abusive alcoholic? Don’t you think I deserved to know the truth?

There’s a soft choking sound at the other end of the line, and I realize that my mother’s quietly sobbing.

“Look, I didn’t mean to upset you,” I say, though I’m still angry with her. “I just don’t understand why you’d lie to me about something like this. All this time you’ve told me dad was this amazing guy, when we both know he was a feral animal with a two-pack-a-day habit who used to beat the shit out of both of us. I don’t know how you put up with him for so many years. Were you afraid to leave?”

“I don’t know what’s wrong with you,” my mother says, sobbing. “Why would you make up a thing like that? Your father hardly ever drank, and the closest he ever came to beating you was threatening you with a broom to get you to take a shower. He would be ashamed of you for saying things like this. I’m ashamed of you. I’m going to hang up now. Please don’t call me again unless it’s to apologize for being a total asshole.”

She doesn’t hang up immediately; instead, she waits silently for a few seconds, as if she’s giving me a chance to apologize right now. But I don’t apologize, because why should I? My mother deserves an Oscar for her performance. After twenty-six years of blanks, I finally have crystal-clear memories of my father. It’s not my fault that he’s threatening to separate my sternum from my clavicles in most of them.


In the morning, I put on my shoes and go down to Baruch’s shop. When I explain the situation to him, he doesn’t crack wise. Instead, he looks concerned.

“I told you the procedure might not work very well,” he says cautiously.

I tell him that, on the contrary, it seems to have worked too well; there’s now no detail of my father’s appearance or behavior too small for me to avoid reliving over and over. Everything from his mastery of obscure Hungarian expletives to the construction grime he seemingly saved up under his fingernails all day just so he could share it with me when he got home.

“It’s almost like he’s standing here right now next to me,” I tell Baruch. “Berating me for not bringing in the lawnmower, or cuffing me across the neck for getting him the wrong beer from the fridge.”

Baruch asks if I can actually remember my family ever owning a lawnmower.

“Of course not,” I say in exasperation. “I grew up in a small expatriate community in East Africa in the 1980s; there were no lawnmowers there. For god’s sake, we didn’t even have broadcast television.”

Baruch nods and pulls up something on his computer. His eyes scan left to right, then repeat. Then they do it again, like he’s reading the same line, over and over.

“I see your problem,” he finally says slowly, turning the screen to show me. “Right there, you see. Here’s your order. Yarkoni, Tal. Zero recall; twenty-six years dark. That’s you. And right here, just below you, is an order for a guy named Hal Zamboni. 20/400 recall; eight years dark.”

What follows is one of those proverbial moments when you can hear a pin drop. Except it isn’t a moment; it’s more like a good two minutes, during which I just stand there motionless and stare deeply into Baruch’s corneas. I imagine Baruch must also be experiencing a moment that feels like eternity; at the very least, I hope he’s seeing his life flash before his eyes.

“They are quite similar names,” Baruch says apologetically in response to my smoldering stare. It’s quite an understandable mistake, he assures me; it could have happened to anyone. In fact, it’s already happened to him twice before—which, he points out, only goes to show how unusual it is, considering that he’s operated successfully on at least sixty people.

“Well, fifty-eight,” he says, “if you exclude the… oh, never mind.”

I ask Baruch when we can schedule a time for me to come in for a correction.

“What exactly do you mean by correction,” he says, backing away from the counter nervously.


Three days later, I knock on the front door of a house on the far side of town, in a solidly middle-class, suburban neighborhood—the kind where everyone knows everyone else by name, but isn’t entirely sure if that’s a good thing.

Hal Zamboni opens the door and shakes my hand. He seems apologetic as he greets me. Or maybe he’s just nervous; I don’t know. I mean, how are you supposed to interact with someone who you know has the misfortune of re-experiencing every aspect of your own traumatic childhood on a daily basis?

“I had a bit of a breakdown,” Zamboni confesses after we’ve sat down inside and he’s lit up a cigarette—holding it in exactly the same way as his late, abusive father. “When Baruch called me up yesterday to tell me about the mix-up, I just lost it. I mean, you can imagine what it’s like. Here I am, going through life with these very faint memories of a father who was an absolute brute; a guy who would beat the shit out of me over nothing, over absolutely nothing… and then one morning I wake up with beautiful, crystal-clear memories of a very different father–a guy who used to take me out for an enormous brunch at some luxury beach resort every Sunday. This guy who puts a giant omelette on my plate and grins at me and asks me what I’m learning in school.”

“I’m sorry,” I say. And I really am. For both of us.

“The weird thing is,” Zamboni says, “the worst part of it wasn’t even realizing that these memories were someone else’s. It was realizing that I had had such a fucked up childhood. I mean, you know, I always knew my dad was kind of a savage. But I didn’t realize how much better other people had it. Compared to my father, yours was a saint. Do you know what that feels like?”

“No,” I say simply. There’s nothing else to say, really; over the past week, I’ve come to realize how little I actually understand about the human condition. So we just sit there on Zamboni’s porch quietly. He smokes cigarette after cigarette, I drink my beer, and we both watch the endless string of cars roll down the narrow street like an angry centipede.


my father & me, circa 1986

Why I still won’t review for or publish with Elsevier–and think you shouldn’t either

In 2012, I signed the Cost of Knowledge pledge, and stopped reviewing for, and publishing in, all Elsevier journals. In the four years since, I’ve adhered closely to this policy; with a couple of exceptions (see below), I’ve turned down every review request I’ve received from an Elsevier-owned journal, and haven’t sent Elsevier journals any of my own papers for publication.

Contrary to what a couple of people I talked to at the time intimated might happen, my scientific world didn’t immediately collapse. The only real consequences I’ve experienced as a result of avoiding Elsevier are that (a) on perhaps two or three occasions, I’ve had to think a little bit longer about where to send a particular manuscript, and (b) I’ve had a few dozen conversations (all perfectly civil) about Elsevier and/or academic publishing norms that I otherwise probably wouldn’t have had. Other than that, there’s been essentially no impact on my professional life. I don’t feel that my unwillingness to publish in NeuroImage, Neuron, or Journal of Research in Personality has hurt my productivity or reputation in any meaningful way. And I continue to stand by my position that it’s a mistake for scientists to do business with a publishing company that actively lobbies against the scientific community’s best interests.

While I’ve never hidden the fact that I won’t deal with Elsevier, and am perfectly comfortable talking about the subject when it comes up, I also haven’t loudly publicized my views. Aside from a parenthetical mention of the issue in one or two (sometimes satirical) blog posts, and an occasional tweet, I’ve never written anything vocally suggesting that others adopt the same stance. The reason for this is not that I don’t believe it’s an important issue; it’s that I thought Elsevier’s persistently antagonistic behavior towards scientists’ interests was common knowledge, and that most scientists continue to provide their free expert labor to Elsevier because they’ve decided that the benefits outweigh the costs. In other words, I was under the impression that other people share my facts, just not my interpretation of the facts.

I now think I was wrong about this. A series of tweets a few months ago (yes, I know, I’m slow to get blog posts out these days) prompted my reevaluation. It began with this:

Which led a couple of people to ask why I don’t review for Elsevier. I replied:


All of this information is completely public, and much of it features prominently in Elsevier’s rather surreal Wikipedia entry–nearly two thirds of which consists of “Criticism and Controversies” (and no, I haven’t personally contributed anything to that entry). As such, I assumed Elsevier’s track record of bad behavior was public knowledge. But the responses to my tweets suggested otherwise. And in the months since, I’ve had several other Twitter or real-life conversations with people where it quickly became clear that the other party was not, in fact, aware of (m)any of the scandals Elsevier has been embroiled in.

In hindsight, this shouldn’t have surprised me. There’s really no good reason why most scientists should be aware of what Elsevier’s been up to all this time. Sure, most scientists cross paths with Elsevier at some point; but so what? It’s not as though I thoroughly research every company I have contractual dealings with; I usually just go about my business and assume the best about the people I’m dealing with–or at the very least, I try not to assume the worst.

Unfortunately, sometimes it turns out that that assumption is wrong. And on those occasions, I generally want to know about it. So, in that spirit, I thought I’d expand on my thoughts about Elsevier beyond the 140-character format I’ve adopted in the past, in the hopes that other people might also be swayed to at least think twice about submitting their work to Elsevier journals.

Is Elsevier really so evil?

Yeah, kinda. Here’s a list of just some of the shady things Elsevier has previously been caught doing–none of which, as far as I know, the company contests at this point:

  • They used to organize arms trade fairs, until a bunch of academics complained that a scholarly publisher probably shouldn’t be in the arms trade, at which point they sold that division off;
  • In 2009, they were caught having created and sold half a dozen entire fake journals to pharmaceutical companies (e.g., Merck), so that those companies could fill the pages of the journals, issue after issue, with reprinted articles that cast a positive light on their drugs;
  • They regularly sell access to articles they don’t own, including articles licensed for non-commercial use–in clear contravention of copyright law, and despite repeated observations by academics that this kind of thing should not be technically difficult to stop if Elsevier actually wanted it to stop;
  • Their pricing model is based around the concept of the “Big Deal”: Elsevier (and, to be fair, most other major publishers) forces universities to pay for huge numbers of their journals at once by pricing individual journals prohibitively, ensuring that institutions can’t order only the journals they think they’ll actually use (this practice is very much like the “bundling” exercised by the cable TV industry); they also bar customers from revealing how much they paid for access, and freedom-of-information requests reveal enormous heterogeneity across universities, often at costs that are prohibitive to libraries;
  • They recently bought the SSRN preprint repository, and after promising to uphold SSRN’s existing operating procedures, almost immediately began to remove articles that were legally deposited on the service, but competed with “official” versions published elsewhere;
  • They have repeatedly spurned requests from the editorial boards of their journals to lower journal pricing, decrease open access fees, or make journals open access; this has resulted in several editorial boards abandoning the Elsevier platform wholesale and moving their operation elsewhere (Lingua being perhaps the best-known example)–often taking large communities with them;
  • Perhaps most importantly (at least in my view), they actively lobbied the US government against open access mandates, making multiple donations to the congressional sponsors of a bill called the Research Works Act that would have resulted in the elimination of the current law mandating deposition of all US government-funded scientific works in public repositories within 12 months after publication.

The pattern in these cases is almost always the same: Elsevier does something that directly works against the scientific community’s best interests (and in some cases, also the law), and then, when it gets caught with its hand in the cookie jar, it apologizes and fixes the problem (well, at least to some degree; they somehow can’t seem to stop selling OA-licensed articles, because it is apparently very difficult for a multibillion dollar company to screen the papers that appear on its websites). A few months later, another scandal comes to light, and then the cycle repeats.

Elsevier is, of course, a large company, and one could reasonably chalk one or two of the above actions up to poor management or bad judgment. But there’s a point at which the belief that this kind of thing is just an unfortunate accident–as opposed to an integral part of the business model–becomes very difficult to sustain. In my case, I was aware of a number of the above practices before I signed The Cost of Knowledge pledge; for me, the straw that broke the camel’s back was Elsevier’s unabashed support of the Research Works Act. While I certainly don’t expect any corporation (for-profit or otherwise) to actively go out and sabotage its own financial interests, most organizations seem to know better than to publicly lobby for laws that would actively and unequivocally hurt the primary constituency they make their money off of. While Elsevier wasn’t alone in its support of the RWA, it’s notable that many for-profit (and most non-profit) publishers explicitly expressed their opposition to the bill (e.g., MIT Press, Nature Publishing Group, and the AAAS). To my mind, there wasn’t (and isn’t) any reason to support a company that, on top of arms sales, fake journals, and copyright violations, thinks it’s okay to lobby the government to make it harder for taxpayers to access the results of publicly-funded research that’s generated and reviewed at no cost to Elsevier itself. So I didn’t, and still don’t.

Objections (and counter-objections)

In the 4 years since I stopped writing or reviewing for Elsevier, I’ve had many conversations with colleagues about this issue. Since most of my colleagues don’t share my position (though there are a few exceptions), I’ve received a certain amount of pushback. While I’m always happy to engage on the issue, so far, I can’t say that I’ve found any of the arguments I’ve heard sufficiently compelling to cause me to change my position. I’m not sure if my arguments have led anyone else to change their view either, but in the interest of consolidating discussion in one place (if only so that I can point people to it in future, instead of reprising the same arguments over and over again), I thought I’d lay out all of the major objections I’ve heard to date, along with my response(s) to each one. If you have other objections you feel aren’t addressed here, please leave a comment, and I’ll do my best to address them (and perhaps add them to the list).

Without further ado, and in no particular order, here are the pro-Elsevier (or at least, anti-anti-Elsevier) arguments, as I’ve heard and understood them:

“You can’t really blame Elsevier for doing this sort of thing. Corporations exist to make money; they have a fiduciary responsibility to their shareholders to do whatever they legally can to increase revenue and decrease expenses.”

For what it’s worth, I think the “fiduciary responsibility” argument–which seemingly gets trotted out almost any time anyone calls out a publicly traded corporation for acting badly–is utterly laughable. As far as I can tell, the claim it relies on is both unverifiable and unenforceable. In practice, there is rarely any way for anyone to tell whether a particular policy will hurt or help a company’s bottom line, and virtually any action one takes can be justified post-hoc by saying that it was the decision-makers’ informed judgment that it was in the company’s best interest. Presumably part of the reason publishing groups like NPG or MIT Press don’t get caught pulling this kind of shit nearly as often as Elsevier is that part of their executives’ decision-making process includes thoughts like gee, it would be really bad for our bottom line if scientists caught wind of what we’re doing here and stopped giving us all this free labor. You can tell a story defending pretty much any policy, or its polar opposite, on grounds of fiduciary responsibility, but I think it’s very unlikely that anyone is ever going to knock on an Elsevier executive’s door threatening to call in the lawyers because Elsevier just hasn’t been working hard enough lately to sell fake journals.

That said, even if you were to disagree with my assessment, and decided to take the fiduciary responsibility argument at face value, it would still be completely and utterly irrelevant to my personal decision not to work for Elsevier any more. The fact that Elsevier is doing what it’s (allegedly) legally obligated to do doesn’t mean that I have to passively go along with it. Elsevier may be legally allowed or even obligated to try to take advantage of my labor, but I’m just as free to follow my own moral compass and refuse. I can’t imagine how my individual decision to engage in moral purchasing could possibly be more objectionable to anyone than a giant corporation’s “we’ll do anything legal to make money” policy.

“It doesn’t seem fair to single out Elsevier when all of the other for-profit publishers are just as bad.”

I have two responses to this. First, I think the record pretty clearly suggests that Elsevier does in fact behave more poorly than the vast majority of other major academic publishers (there are arguably a number of tiny predatory publishers that are worse–but of course, I don’t think anyone should review for or publish with them either!). It’s not that publishers like Springer or Wiley are without fault; but they at least don’t seem to get caught working against the scientific community’s interests nearly as often. So I think Elsevier’s particularly bad track record makes it perfectly reasonable to focus attention on Elsevier in particular.

Second, I don’t think it would, or should, make any difference to the analysis even if it turned out that Springer or Wiley were just as bad. The reason I refuse to publish with Elsevier is not that they’re the only bad apples, but that I know that they’re bad apples. The fact that there might be other bad actors we don’t know about doesn’t mean we shouldn’t take actions against the bad actors we do know about. In fact, it wouldn’t mean that even if we did know of other equally bad actors. Most people presumably think there are many charities worth giving money to, but when we learn that someone donated money to a breast cancer charity, we don’t get all indignant and say, oh sure, you give money to cancer, but you don’t think heart disease is a serious enough problem to deserve your support? Instead, we say, it’s great that you’re doing what you can–we know you don’t have unlimited resources.

Moreover, from a collective action standpoint, there’s a good deal to be said for making an example out of a single bad actor rather than trying to distribute effort across a large number of targets. The reality is that very few academics perceive themselves to be in a position to walk away from all academic publishers known to engage in questionable practices. Collective action provides a means for researchers to exercise positive force on the publishing ecosystem in a way that cannot be achieved by each individual researcher making haphazard decisions about where to send their papers. So I would argue that as long as researchers agree that (a) Elsevier’s policies hurt scientists and taxpayers, and (b) Elsevier is at the very least one of the worst actors, it makes a good deal of sense to focus our collective energy on Elsevier. I would hazard a guess that if a concerted action on the part of scientists had a significant impact on Elsevier’s bottom line, other publishers would sit up and take notice rather quickly.

“You can choose to submit your own articles wherever you like; that’s totally up to you. But when you refuse to review for all Elsevier journals, you do a disservice to your colleagues, who count on you to use your expertise to evaluate other people’s manuscripts and thereby help maintain the quality of the literature as a whole.”

I think this is a valid concern in the case of very early-career academics, who very rarely get invited to review papers, and have no good reason to turn such requests down. In such cases, refusing to review because the journal belongs to Elsevier would indeed make everyone else’s life a little bit more difficult (even if it also helps a tiny bit to achieve the long-term goal of incentivizing Elsevier to either shape up or disappear). But I don’t think the argument carries much force with most academics, because most of us have already reached the review saturation point of our careers–i.e., the point at which we can’t possibly (or just aren’t willing to) accept all the review assignments we receive. For example, at this point, I average about 3 – 4 article reviews a month, and I typically turn down about twice that many invitations to review. If I accepted any invitations from Elsevier journals, I would simply have to turn down an equal number of invitations from non-Elsevier journals–almost invariably ones with policies that I view as more beneficial to the scientific community. So it’s not true that I’m doing the scientific community a disservice by refusing to review for Elsevier; if anything, I’m doing it a service by preferentially reviewing for journals that I believe are better aligned with the scientific community’s long-term interests.

Now, on fairly rare occasions, I do get asked to review papers focusing on issues that I think I have particularly strong expertise in. And on even rarer occasions, I have reason to think that there are very few if any other people besides me who would be able to write a review that does justice to the paper. In such cases, I willingly make an exception to my general policy. But it doesn’t happen often; in fact, it’s happened exactly twice in the past 4 years. In both cases, the paper in question was built to a very significant extent on work that I had done myself, and it seemed to me quite unlikely that the editor would be able to find another reviewer with the appropriate expertise given the particulars reported in the abstract. So I agreed to review the paper, even for an Elsevier journal, because to not do so would indeed have been a disservice to the authors. I don’t have any regrets about this, and I will do it again in future if the need arises. Exceptions are fine, and we shouldn’t let the perfect be the enemy of the good. But it simply isn’t true, in my view, that my general refusal to review for Elsevier is ever-so-slightly hurting science. On the contrary, I would argue that it’s actually ever-so-slightly helping it, by using my limited energies to support publishers and journals that work in favor of, rather than against, scientists’ interests.

“If everyone did as you do, Elsevier journals might fall apart, and that would impact many people’s careers. What about all the editors, publishing staff, proof readers, etc., who would all lose at least part of their livelihood?”

This is the universal heartstring-pulling argument, in that it can be applied to virtually any business or organization ever created that employs at least one person. For example, it’s true that if everyone stopped shopping at Wal-Mart, over a million Americans would lose their jobs. But given the externalities that Wal-Mart imposes on the American taxpayer, that hardly seems like a sufficient reason to keep shopping at Wal-Mart (note that I’m not saying you shouldn’t shop at Wal-Mart, just that you’re not under any moral obligation to view yourself as a one-person jobs program). Almost every decision that involves reallocation of finite resources hurts somebody; the salient question is whether, on balance, the benefits to the community as a whole outweigh the costs. In this case, I find it very hard to see how Elsevier’s policies benefit the scientific community as a whole when much cheaper, non-profit alternatives–to say nothing of completely different alternative models of scientific evaluation–are readily available.

It’s also worth remembering that the vast majority of the labor that goes into producing Elsevier’s journals is donated to Elsevier free of charge. Given Elsevier’s enormous profit margin (over 30% in each of the last 4 years), it strains credulity to think that other publishers couldn’t provide essentially the same services while improving the quality of life of the people who provide most of the work. For an example of such a model, take a look at Collabra, where editors receive a budget of $250 per paper (which comes out of the author publication charge) that they can divide up however they like between themselves, the reviewers, and publishing subsidies to future authors who lack funds (full disclosure: I’m an editor at Collabra). So I think an argument based on treating people well clearly weighs against supporting Elsevier, not in favor of it. If nothing else, it should perhaps lead one to question why Elsevier insists it can’t pay the academics who review its articles a nominal fee, given that paying for even a million reviews per year (surely a gross overestimate) at $200 a pop would still only eat up less than 20% of Elsevier’s profit in each of the past few years.
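For anyone who wants to check the arithmetic behind that last claim, here is a rough back-of-the-envelope sketch. The two inputs (a million reviews a year, $200 per review) come straight from the paragraph above; the function names are made up for illustration, and the profit threshold is simply what the "less than 20%" figure implies, not a number taken from Elsevier's financial reports.

```python
# Back-of-the-envelope check of the reviewer-payment claim.
# Inputs are the figures used in the post; the profit threshold below is
# derived from them and is NOT a reported Elsevier figure.
REVIEWS_PER_YEAR = 1_000_000   # the post's deliberately generous upper bound
FEE_PER_REVIEW_USD = 200       # nominal fee paid per review
MAX_SHARE_PERCENT = 20         # "less than 20%" of annual profit


def total_review_cost(reviews: int, fee_usd: int) -> int:
    """Total yearly cost of paying every reviewer the nominal fee."""
    return reviews * fee_usd


def implied_min_profit(cost_usd: int, share_percent: int) -> int:
    """Annual profit at which the cost equals exactly share_percent of profit.

    The claim "cost is less than share_percent of profit" holds only when
    actual profit is above this value.
    """
    return cost_usd * 100 // share_percent


cost = total_review_cost(REVIEWS_PER_YEAR, FEE_PER_REVIEW_USD)
threshold = implied_min_profit(cost, MAX_SHARE_PERCENT)

print(f"Cost of paying reviewers: ${cost:,} per year")
print(f"Claim holds if annual profit exceeds: ${threshold:,}")
```

In other words, paying reviewers under these assumptions would cost $200 million a year, and the "less than 20%" claim holds as long as annual profit is above $1 billion.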

“Whatever you may think of Elsevier’s policies at the corporate level, the editorial boards at the vast majority of Elsevier journals function autonomously, with no top-down direction from the company. Any fall-out from a widespread boycott would hurt all of the excellent editors at Elsevier journals who function with complete independence–and by extension, the field as a whole.”

I’ve now heard this argument from at least four or five separate editors at Elsevier journals, and I don’t doubt that its premise is completely true. Meaning, I’m confident that the scientific decisions made by editors at Elsevier journals on a day-to-day basis are indeed driven entirely by scientific considerations, and aren’t influenced in any way by publishing executives. That said, I’m completely unmoved by this argument, for two reasons. First, the allocation of resources–including peer reviews, submitted manuscripts, and editorial effort–is, to a first approximation, a zero-sum game. While I’m happy to grant that editorial decisions at Elsevier journals are honest and unbiased, the same is surely true of the journals owned by virtually every other publisher. So refusing to send a paper to NeuroImage doesn’t actually hurt the field as a whole in any way, unless one thinks that there is a principled reason why the editorial process at Cerebral Cortex, Journal of Neuroscience, or Journal of Cognitive Neuroscience should be any worse. Obviously, there can be no such reason. If Elsevier went out of business, many of its current editors would simply move to other journals, where they would no doubt resume making equally independent decisions about the manuscripts they receive. As I noted above, in a number of cases, entire editorial boards at Elsevier journals have successfully moved wholesale to new platforms. So there is clearly no service Elsevier provides that can’t in principle be provided more cheaply by other publishers or platforms that aren’t saddled with Elsevier’s moral baggage or absurd profit margins.

Second, while I don’t doubt the basic integrity of the many researchers who edit for Elsevier journals, I also don’t think they’re completely devoid of responsibility for the current state of affairs. When a really shitty company offers you a position of power, it may be true that accepting that position–in spite of the moral failings of your boss’s boss’s boss–may give you the ability to do some real good for the community you care about. But it’s also true that you’re still working for a really shitty company, and that your valiant efforts could at any moment be offset by some underhanded initiative in some other branch of the corporation. Moreover, if you’re really good at your job, your success–whatever its short-term benefits to your community–will generally serve to increase your employer’s shit-creating capacity. So while I don’t think accepting an editorial position at an Elsevier journal makes anyone a bad person (some of my best friends are editors for Elsevier!), I also see no reason for anyone to voluntarily do business with a really shitty company rather than a less shitty one. As far as I can tell, there is no service I care about that NeuroImage offers me but Cerebral Cortex or The Journal of Neuroscience don’t. As a consequence, it seems reasonable for me to submit my papers to journals owned by companies that seem somewhat less intent on screwing me and my institution out of as much money as possible. If that means that some very good editors at NeuroImage ultimately have to move to JNeuro, JCogNeuro, or (dare I say it!) PLOS ONE, I think I’m okay with that.

“It’s fine for you to decide not to deal with Elsevier, but you don’t have a right to make that decision for your colleagues or trainees when they’re co-authors on your papers.”

This is probably the only criticism I hear regularly that I completely agree with. Which is why I’ve always been explicit that I can and will make exceptions when required. Here’s what I said when I originally signed The Cost of Knowledge years ago:

costofknowledge

Basically, my position is that I’ll still submit a manuscript to an Elsevier journal if either (a) I think a trainee’s career would be significantly disadvantaged by not doing so, or (b) I’m not in charge of a project, and have no right to expect to exercise control over where a paper is submitted. The former has thankfully never happened so far (though I’m always careful to make it clear to trainees that if they really believe that it’s important to submit to a particular Elsevier journal, I’m okay with it). As for the latter, in the past 4 years, I’ve been a co-author on two Elsevier papers (1, 2). In both cases, I argued against submitting the paper to those journals, but was ultimately overruled. I don’t have any problem with either of those decisions, and remain on good terms with both lead authors. If I collaborate with you on a project, you can expect to receive an email from me suggesting in fairly strong terms that we should consider submitting to a non-Elsevier-owned journal, but I certainly won’t presume to think that what makes sense to me must also make sense to you.

“Isn’t it a bit silly to think that your one-person boycott of Elsevier is going to have any meaningful impact?”

No, because it isn’t a one-person boycott. So far, over 16,000 researchers have signed The Cost of Knowledge pledge. And there are very good reasons to think that the 16,000-strong (and growing!) boycott has already had important impacts. For one thing, Elsevier withdrew its support of the RWA in 2012 shortly after The Cost of Knowledge was announced (and several thousand researchers quickly signed on). The bill itself was withdrawn shortly after that. That seems like a pretty big deal to me, and frankly I find it hard to imagine that Elsevier would have voluntarily stopped lobbying Congress this way if not for thousands of researchers putting their money where their mouth is.

Beyond that clear example, it’s hard to imagine that 16,000 researchers walking away from a single publisher wouldn’t have a significant impact on the publishing landscape. Of course, there’s no clear way to measure that impact. But consider just a few points that seem difficult to argue against:

  • All of the articles that would have been submitted to Elsevier journals presumably ended up in other publishers’ journals (many undoubtedly run by OA publishers). There has been continual growth in the number of publishers and journals; some proportion of that seems almost guaranteed to reflect the diversion of papers away from Elsevier.

  • Similarly, all of the extra time spent reviewing non-Elsevier articles instead of Elsevier articles presumably meant that other journals received better scrutiny and faster turnaround times than they would have otherwise.

  • A number of high-profile initiatives–for example, the journal Glossa–arose directly out of researchers’ refusal to keep working with Elsevier (and many others are likely to have arisen indirectly, in part). These are not insignificant. Aside from their immediate impact on the journal landscape, the involvement of leading figures like Timothy Gowers in the movement to develop better publishing and evaluation options is likely to have a beneficial long-term impact.

All told, it seems to me that, far from being ineffectual, the Elsevier boycott–consisting of nothing more than individual researchers cutting ties with the publisher–has actually achieved a considerable amount in the past 4 years. Of course, Elsevier continues to bring in huge profits, so it’s not like it’s in any danger of imminent collapse (nor should that be anyone’s goal). But I think it’s clear that, on balance, the scientific publishing ecosystem is healthier for having the boycott in place, and I see much more reason to push for even greater adoption of the policy than to reconsider it.

More importantly, I think the criticism that individual action has limited efficacy overlooks what is probably the single biggest advantage the boycott has in this case: it costs a researcher essentially nothing. If I were to boycott, say, Trader Joe’s, on the grounds that it mistreats its employees (for the record, I don’t think it does), my quality of life would go down measurably, as I would have to (a) pay more for my groceries, and (b) travel longer distances to get them (there’s a store just down the street from my apartment, so I shop there a lot). By contrast, cutting ties with Elsevier has cost me virtually nothing so far. So even if the marginal benefit to the scientific community of each additional individual boycotting Elsevier is very low, the cost to that individual will typically be much lower still. Which, in principle, makes it very easy to organize and maintain a collective action of this sort on a very large scale (and is probably a lot of what explains why over 16,000 researchers have already signed on).

What you can do

Let’s say you’ve read this far and find yourself thinking, okay, that all kind of makes sense. Maybe you agree with me that Elsevier is an amazingly shitty company whose business practices actively bite the hand that feeds it. But maybe you’re also thinking, well, the thing is, I almost exclusively publish primary articles in the field of neuroimaging [or insert your favorite Elsevier-dominated discipline here], and there’s just no way I can survive without publishing in Elsevier journals. So what can I do?

The first thing to point out is that there’s a good chance your fears are at least somewhat (and possibly greatly) exaggerated. As I noted at the outset of this post, I was initially a bit apprehensive about the impact that taking a principled stand would have on my own career, but I can’t say that I perceive any real cost to my decision, nearly five years on. One way you can easily see this is to observe that most people are surprised when I first tell them I haven’t published in Elsevier journals in five years. It’s not like the absence would ever jump out at anyone who looked at my publication list, so it’s unclear how it could hurt me. Now, I’m not saying that everyone is in a position to sign on to a complete boycott without experiencing some bumps in the road. But I do think many more people could do so than might be willing to admit it at first. There are very few fields that are completely dominated by Elsevier journals. Neuroimaging is probably one of the fields where Elsevier’s grip is strongest, but I publish several neuroimaging-focused papers a year, and have never had to work very hard to decide where to submit my papers next.

That said, the good news is that you can still do a lot to actively work towards an Elsevier-free world even if you’re unable or unwilling to completely part ways with the publisher. Here are a number of things you can do that take virtually no work, are very unlikely to harm your career in any meaningful way, and are likely to have nearly the same collective benefit as a total boycott:

  • Reduce or completely eliminate your Elsevier reviewing and/or editorial load. Even if you still plan to submit your papers to Elsevier journals, nothing compels you to review or edit for them. You should, of course, consider the pros and cons of turning down any review request; and, as I noted above, it’s fine to make occasional exceptions in cases where you think declining to review a particular paper would be a significant disservice to your peers. But such occasions are–at least in my own experience–quite rare. As I noted above, one of the reasons I’ve had no real compunction about rejecting Elsevier review requests is that I already receive many more requests than I can handle, so declining Elsevier reviews just means I review more for other (better) publishers. If you’re at an early stage of your career, and don’t get asked to review very often, the considerations may be different–though of course, you could still consider turning down the review and doing something nice for the scientific community with the time you’ve saved (e.g., reviewing openly on a site like PubPeer or PubMed Commons, or spending some time making all the data, code, and materials from your previous work openly available).

  • Make your acceptance of a review assignment conditional on some other prosocial perk. As a twist on simply refusing Elsevier review invitations, you can always ask the publisher for some reciprocal favor. You could try asking for monetary compensation, of course–and in the extremely unlikely event that Elsevier obliges, you could (if needed) soothe your guilty conscience by donating your earnings to a charity of your choice. Alternatively, you could try to extract some concession from the journal that would help counteract your general aversion to reviewing for Elsevier. Chris Gorgolewski provided one example in this tweet:

Mandating open science practices (e.g., public deposition of data and code) as a requirement for review is something that many people strongly favor completely independently of commercial publishers’ shenanigans (see my own take here). Making one’s review conditional on an Elsevier journal following best practices is a perfectly fair and even-handed approach, since there are other journals that either already mandate such standards (e.g., PLOS ONE), or are likely to be able to oblige you. So if you get an affirmative response from an Elsevier journal, then great–it’s still Elsevier, but at least you’ve done something useful to improve their practices. If you get a negative response, well, again, you can simply reallocate your energy somewhere else.

  • Submit fewer papers to Elsevier journals. If you publish, say, 5 – 10 fMRI articles a year, it’s completely understandable if you might not feel quite ready to completely give up on NeuroImage and the other three million neuroimaging journals in Elsevier’s stable. Fortunately, you don’t have to. This is a nice example of the Pareto principle in action: 20% of the effort goes maybe 80% of the way in this case. All you have to do to exert almost exactly the same impact as a total boycott of Elsevier is drop NeuroImage (or whatever other journal you routinely submit to) to the bottom of the queue of whatever journals you perceive as being in the same class. So, for example, instead of reflexively thinking, “oh, I should send this to NeuroImage–it’s not good enough for Nature Neuroscience, but I don’t want to send it to just any dump journal”, you can decide to submit it to Cerebral Cortex or The Journal of Neuroscience first, and only go to NeuroImage if the first two journals reject it. Given that most Elsevier journals have a fairly large equivalence class of non-Elsevier journals, a policy like this one would almost certainly cut submissions to Elsevier journals significantly if widely implemented by authors–which would presumably reduce the perceived prestige of those journals still further, potentially precipitating a death spiral.

  • Go cold turkey. Lastly, you could always just bite the bullet and cut all ties with Elsevier. Honestly, it really isn’t that bad. As I’ve already said, the fall-out in my case has been considerably smaller than I thought it would be when I signed The Cost of Knowledge pledge as a post-doc (i.e., I expected it to have some noticeable impact, but in hindsight I think it’s had essentially none). Again, I recognize that not everyone is in a position to do this. But I do think that the reflexive “that’s a crazy thing to do” reaction that some people seem to have when The Cost of Knowledge boycott is brought up isn’t really grounded in a careful consideration of the actual risks to one’s career. I don’t know how many of the 16,000 signatories to the boycott have had to drop out of science as a direct result of their decision to walk away from Elsevier, but I’ve never heard anyone suggest this happened to them, and I suspect the number is very, very small.

The best thing about all of the above action items–with the possible exception of the last–is that they require virtually no effort, and incur virtually no risk. In fact, you don’t even have to tell anyone you’re doing any of them. Let’s say you’re a graduate student, and your advisor asks you where you want to submit your next fMRI paper. You don’t have to say “well, on principle, anywhere but an Elsevier journal” and risk getting into a long argument about the issue; you can just say “I think I’d like to try Cerebral Cortex.” Nobody has to know that you’re engaging in moral purchasing, and your actions are still almost exactly as effective. You don’t have to march down the street holding signs and chanting loudly; you don’t have to show up in front of anyone’s office to picket. You can do your part to improve the scientific publishing ecosystem just by making a few tiny decisions here and there–and if enough other people do the same thing, Elsevier and its peers will eventually be left with a stark choice: shape up, or crumble.

There is no “tone” problem in psychology

Much ink has been spilled in the last week or so over the so-called “tone” problem in psychology, and what to do about it. I speak here, of course, of the now infamous (and as-yet unpublished) APS Observer column by APS Past President Susan Fiske, in which she argues rather strenuously that psychology is in danger of falling prey to “mob rule” due to the proliferation of online criticism generated by “self-appointed destructo-critics” who “ignore ethical rules of conduct.”

Plenty of people have already weighed in on the topic (my favorite summary is Andrew Gelman’s take), and to be honest, I don’t really have (m)any new thoughts to offer. But since that’s never stopped me before, I will now proceed to throw those thoughts at you anyway, just for good measure.

Since I’m verbose but not inconsiderate, I’ll summarize my main points way up here, so you don’t have to read 6,500 more words just to decide that you disagree with me. Basically, I argue the following points:

  1. There is nothing wrong with the general tone of our discourse in psychology at the moment.
  2. Even if there were something wrong with the tone of our discourse, it would be deeply counterproductive to waste our time talking about it in vague general terms.
  3. Fear of having one’s scientific findings torn apart by others is not unusual or pathological; it’s actually a completely normal–and healthy–feeling for a scientist.
  4. Appeals to fairness are not worth taking seriously unless the argument is pitched at the level of the entire scientific community, rather than just the sub-community one happens to belong to.
  5. When other scientists do things we don’t like, it’s pointless and counterproductive to question their motives.

There, that’s about as much of being brief and to the point as I can handle. From here on out, it’s all adjective soup, mixed metaphor, and an occasional literary allusion*.

1. There is no tone problem

Much of the recent discussion over how psychologists should be talking to one another simply takes it for granted that there’s some deep problem with the tone of our scientific discourse. Personally, I don’t think there is (and on the off-chance we’re doing this by vote count, neither do Andrew Gelman, Chris Chambers, Sam Schwarzkopf, or NeuroAnaTody). At the very least, I haven’t seen any good evidence for it. As far as I can tell, all of the complaints about tone thus far have been based exclusively on either (a) a handful of rather over-the-top individual examples of bad behavior, or (b) vague but unsupported allegations that certain abusive practices are actually quite common. Neither of these constitutes a satisfactory argument, in my view. The former isn’t useful because anecdotes are just that. I imagine many people can easily bring to mind several instances of what seem like unwarranted attacks on social media. For example, perhaps you don’t like the way James Coyne sometimes calls out people he disagrees with:

Or maybe you don’t appreciate Dan Gilbert describing a large group of researchers with little in common except their efforts to replicate one or more studies as “shameless little bullies”:

I don’t doubt that statements like these can and do offend some people, and I think people who are offended should certainly feel free to publicly raise their concerns (ideally by directly responding to the authors of such remarks). Still, such cases are the exception, not the norm, and academic psychologists should appreciate better than most people the dangers of over-generalizing from individual cases. Nobody should labor under any misapprehension that it’s possible to have a field made up of thousands of researchers all going about their daily business without some small subset of people publicly being assholes to one another. Achieving zero instances of bad behavior cannot be a sane goal for our field (or any other field). When Dan Gilbert called replicators “second-stringers” and “shameless little bullies,” it did not follow that all social psychologists above the age of 45 are reactionary jackasses. For that matter, it didn’t even follow that Gilbert is a jerk. The correct attributions in such cases–until such time as our list of notable examples grows many times larger than it presently is–are that (a) reasonable people sometimes say unreasonable things they later regret, or (b) some people are just not reasonable, and are best ignored. There is no reason to invent a general tone problem where none exists.

The other main argument for the existence of a “tone” problem—and one that’s prominently on display in Fiske’s op-ed—is the gossipy everyone-knows-this-stuff-is-happening kind of argument. You could be excused for reading Fiske’s op-ed and coming away thinking that verbal abuse is a rampant problem in psychology. Consider just one paragraph (but the rest of it reads much the same):

Only what’s crashing are people. These unmoderated attacks create collateral damage to targets’ careers and well being, with no accountability for the bullies. Our colleagues at all career stages are leaving the field because of the sheer adversarial viciousness. I have heard from graduate students opting out of academia, assistant professors afraid to come up for tenure, mid-career people wondering how to protect their labs, and senior faculty retiring early, all because of methodological terrorism. I am not naming names because ad hominem smear tactics are already damaging our field. Instead, I am describing a dangerous minority trend that has an outsized impact and a chilling effect on scientific discourse.

I will be the first to admit that it sounds very ominous, all this talk of people crashing, unmoderated attacks with no accountability, and people leaving the field. But before you panic, you might want to consider an alternative paragraph that, at least from where I’m sitting, Fiske could just as easily have written:

Only what’s crashing are people. The proliferation of flashy, statistically incompetent findings creates collateral damage to targets’ careers and well being, with no accountability for the people who produce such dreck. Our colleagues at all career stages are leaving the field due to the sheer atrocity of its standards. I have heard from graduate students opting out of academia, assistant professors suffering from depression, mid-career people wondering how to sustain their research, and senior faculty retiring early, all because of their dismay at common methodological practices. I am not naming names because ad hominem smear tactics are already damaging our field. Instead, I am describing a dangerous trend that has an outsized impact and a chilling effect on scientific progress.

Or if you don’t like that one, maybe this one is more your speed:

Only what’s crashing are our students. These unmoderated attacks on students by their faculty advisors create collateral damage to our students, with no accountability for the bullies. Our students at all stages of graduate school are leaving the field because of the sheer adversarial viciousness. I have heard from graduate students who work 90-hour weeks, are afraid to have children at this stage of their careers, or have fled grad school, all out of fear of being terrorized by their advisors. I am not naming names because ad hominem smear tactics are already damaging our field. Instead, I am describing a dangerous trend that has an outsized impact and a chilling effect on scientific progress.

If you don’t like that one either, feel free to crib the general structure and play fill in the blank with your favorite issue. It could be low salaries, unreasonable publication expectations, or excessively high teaching loads; whatever you like. The formula is simple: first, you find a few people with (perfectly legitimate) concerns about some aspect of their professional environment; then you just have to (1) recount those stories in horrified tones, (2) leave out any mention of exactly how many people you’re talking about, (3) provide no concrete details that would allow anyone to see any other side to the story, and (4) not-so-subtly imply that all hell will break loose if this problem isn’t addressed some time real soon.

Note that what makes Fiske’s description unproductive and incendiary here is not that we have any reason to doubt the existence of the (anonymous) cases she alludes to. I have no doubt that Fiske does in fact hear regularly from students who have decided to leave academia because they feel unfairly targeted. But the thing is, it’s also an indisputable fact that many (in absolute terms) students leave academia because they have trouble getting along with their advisors, because they’re fed up with the low methodological standards in the field, or because they don’t like the long, unstructured hours that science requires.

The problem is not that Fiske is being untruthful; it’s that she’s short-circuiting the typical process of data- and reason-based argument by throwing lots of colorful anecdotes and emotional appeals at us. No indication is provided in her piece–or in my creative adaptations–as to whether the scenarios described are at all typical. How often, we should be asking ourselves, does it actually happen that people opt out of academia, or avoid seeking tenure, because of legitimate concerns about being unfairly criticized by their colleagues? How often do people leave the field because our standards are so terrible? Just how many psychology faculty are really such terrible advisors that their students regularly quit? If the answer to all of these questions is “extremely rarely”–or if there is reason to believe that in many cases, the story is not nearly as simple as the way Fiske is making it sound–then we don’t have systematic problems that deserve our collective attention; at worst, we have isolated cases of people behaving badly. Unfortunately, the latter is a malady that universally afflicts every large group or organization, and as far as I know, there is no known cure.

From where I’m sitting, there is no evidence of an epidemic of interpersonal cruelty in psychology. There has undeniably been a rapid increase in open, critical commentary online; but as Chris Chambers, Andrew Gelman, and others have noted, this is much better understood as a welcome democratization of scientific discourse that levels the playing field and devalues the role of (hard-earned) status than some kind of verbal war to the pain between rival psychological ideologies.

2. Three reasons why complaining about tone is a waste of time

Suppose you disagree with my argument above (which is totally cool—please let me know why in the comments below!) and insist that there clearly is a problem with the tone of our discourse. What then? Well, in that case, I would still respectfully suggest that if your plan for dealing with this problem is to complain about it in general terms, the way Fiske does—meaning, without ever pointing to specific examples or explaining exactly what you mean by “critiques of such personal ferocity” or “ad hominem smear tactics”—then you’re probably just wasting your time. Actually, it’s worse than that: not only are you wasting your own time, but you’re probably also going to end up pouring more fuel on the very fire you claim to be trying to put out (and indeed, this is exactly what Fiske’s op-ed seems to have accomplished).

I think there are at least three good reasons to believe that spending one’s time arguing over tone in abstract terms is a generally bad idea. Since I appear to have nothing but time, and you appear to still be reading this, I’ll discuss each of them in great gory detail.

The engine-on-fire view of science

First, unlike in many other domains of life, in science, the validity or truth value of a particular viewpoint is independent of the tone with which that viewpoint is being expressed. We can perhaps distinguish between two ways of thinking about what it means to do science. One approach is what we might call the negotiation model of science. On this model, when two people disagree over some substantive scientific issue, what they’re doing is trying to find a compromise position that’s palatable to both parties. If you say your finding is robust, and I say it’s totally p-hacked, then our goal is to iterate until we end up in a position that we both find acceptable. This doesn’t necessarily mean that the position we end up with must be an intermediate position (e.g., “okay, you only p-hacked a tiny bit”); it’s possible that I’ll end up entirely withdrawing my criticism, or that you’ll admit to grave error and retract your study. The point is just that the goal is, at least implicitly, to arrive at some consensual agreement between parties regarding our original disagreement.

If one views science through this kind of negotiation lens, concerns about tone make perfect sense. After all, in almost any other context when you find yourself negotiating with someone, it’s a generally bad idea to start calling them names or insulting their mother. If you’re hawking your goods at a market, it’s probably safe to assume that every prospective buyer has other options–they can buy whatever it is they need from some other place, and they don’t have to negotiate specifically with you if they don’t like the way you talk to them. So you watch what you say. And if everyone manages to get along without hurling insults, it’s possible you might even successfully close a deal, and go home one rug lighter and a few Euros richer.

Unfortunately, the negotiation model isn’t a good way to think about science, because in science, the validity of one’s views does not change in any way depending on whether one is dispositionally friendly, or perpetually acts like a raging asshole. A better way to think about science is in terms of what we might call, with great nuance and sophistication, the “engine-on-fire” model. This model can be understood as follows. Suppose you get hungry while driving a long distance, and pull into a convenience store to buy some snacks. Just as you’re opening the door to the store, some guy yells out behind you, “hey, asshole, your engine’s on fire!” He then continues to stand around and berate you while you call for emergency services and frantically run around looking for a fire extinguisher–all without ever lifting a finger to help you.

Two points about this story should be obvious. First, the guy who alerted you to your burning engine is very likely a raging asshole. And second, the fact that he’s a raging asshole doesn’t absolve you in any way from taking steps to put out your flaming engine. It may absolve you from saying thank you to him after the fact, but his unpleasant demeanor unfortunately doesn’t mean you can just choose to look the other way out of spite, and calmly head inside to buy your teriyaki beef jerky as the flames outside engulf your vehicle.

For better or worse, scientific disagreements are more like the engine-on-fire scenario than the negotiation scenario. Superficially, it may seem that two people with a scientific disagreement are in a process of negotiation. But a crucial difference is that if one person inexplicably decides to start yelling at the other–even as they continue to toss out methodological or theoretical criticisms (“only a buffoon of a scientist could fail to model stimulus as a random factor in this design!”)–their criticisms don’t become any less true in virtue of their tone. This doesn’t mean that tone is irrelevant and should be ignored, of course; if a critic calls you names while criticizing your work, it’s perfectly reasonable for you to object to the tone they’re using, and ask that they avoid personal attacks. Unfortunately, you can’t compel them to be nice to you, and the fact remains that if your critic decides to keep yelling at you, you still have a professional obligation to address the substance of their arguments, no matter how repellent you find their tone. If you don’t respond at all–either by explaining why the concern is invalid, or by adjusting your methodological procedures in some way–then there are now two scientific assholes in the world.

Distinguishing a bad case of the jerks from a bad case of the feels isn’t always easy

Much of the discussion over tone thus far has taken, as its starting point, people’s hurt feelings. Feelings deserve to be taken seriously; scientists are human beings, and the fact that the merit of a scientific argument is independent of the tone used to convey it doesn’t mean we should run roughshod over people’s emotions. The important point to note, though, is that the opposite point also holds: the fact that someone might be upset by someone else’s conduct doesn’t automatically mean that the other party is under any obligation–or even expectation–to change their behavior. Sometimes people are upset for understandable reasons that nevertheless do not imply that anyone else did anything wrong.

Daniel Lakens recently pointed this problem out in a nice blog post. The fundamental point is that it’s often impossible for scientists to cleanly separate substantive intellectual issues from personal reputation and ego, because it’s simply a fact that one’s intellectual output is, to varying extents, a reflection of one’s abilities as a scientist. Meaning, if I consistently put out work that’s heavily criticized by other researchers, there is a point at which that criticism does in fact begin to impugn my general ability as a scientist–even if the criticism is completely legitimate, impersonal, and never strays from substantive discussion of the intellectual issues.

Examples of this aren’t hard to find in psychology. To take just one widely-cited example: among the best-replicated findings in behavioral genetics (and indeed, all of psychology) is the finding that most traits show high heritability (typically on the order of 50%) and little influence of shared environment (typically close to 0%). In other words, an enormous amount of evidence suggests that parents have minimal influence on how their children will eventually turn out, independently of the genes they pass on. Given such knowledge, the scientifically honest thing to do, it would seem, is to assume that most child-parent behavioral correlations are largely driven by heritable factors rather than by parenting. Nevertheless, a large fraction of the developmental literature consists of researchers conducting purely correlational studies and drawing strong conclusions about the causal influence of parenting on children’s behavior on the basis of observed child-parent correlations.

If you think I’m exaggerating, consider the latest issue of Psychological Science, where we find a report of a purely longitudinal study (no randomized experiment, and no behavioral genetic component) that claims to find evidence of “a positive link between more nurturing family environments in childhood and greater security of attachment to spouses more than 60 years later.” The findings, we’re told in the abstract, “…underscore the far-reaching influence of childhood environment on well-being in adulthood.” The fact that 50 years of behavioral genetics studies have conclusively demonstrated that all, or nearly all, of this purported parenting influence is actually accounted for by genetic factors does not seem to deter the authors. The terms “heritable” or “genetic” do not show up anywhere in the article, and no consideration at all is given to the possibility that the putative effect of warm parental environment is at least partly (and quite possibly wholly) spurious. And there are literally thousands of other papers just like this one in the developmental literature–many of them continually published in some of our most prestigious journals.

Now, an important question arises: how is a behavioral geneticist supposed to professionally interact with a developmental scientist who appears to willfully ignore the demonstrably small influence of parenting, even after it is repeatedly pointed out to him? Is the geneticist supposed to simply smile and nod at the developmentalist and say, “that’s nice, you’re probably right about how important attachment styles are, because after all, you’re a nice person to talk to, and I want to keep inviting you to my dinner parties”? Or should she instead point out–repeatedly, if need be–the critical flaw in purely correlational designs that precludes any serious causal conclusions about parenting? And if she does the latter–always in a perfectly civil tone, mind you–how can that sentiment possibly be expressed in a way that both (a) is taken seriously enough by the target of criticism to effect a meaningful change in behavior, and (b) doesn’t seriously injure the target’s feelings?

This example highlights two important points, I think. First, when we’re being criticized, it can be very difficult to determine whether our critics are being unreasonable jerks, or are instead quite calmly saying things that we just don’t want to hear. As such, it’s a good idea to give our critics the benefit of the doubt, and assume they have fundamentally good intentions, even if our gut response is to retaliate as if they’re trying to cast our firstborn child into a giant lake of fire.

Second, unfortunate as it may be, being a nice person and being a good scientist are often in fundamental tension with one another–and virtually all scientists are frequently forced to choose which of the two they want to prioritize. I’m not saying you can’t be both a nice person and a good scientist on average. Of course you can. I’m just saying that there are a huge number of individual situations in which you can’t be both at the same time. If you ever find yourself at a talk given by one of the authors of the Psychological Science paper I mention above, you will have a choice between (a) saying nothing to the speaker during the question period (a “nice” action that hurts nobody’s feelings, but impedes scientific progress), and (b) pointing out that the chief conclusion expressed during the talk simply does not follow from any of the evidence presented (a “mean” action that will probably hurt the speaker’s feelings, but also serves to bring a critical scientific flaw to the attention of other scientists in the audience).

Now, one could potentially mount a reasonable argument in favor of being either a nice person, or a good scientist. I’m not going to argue that the appropriate thing to do is always to put science ahead of people’s feelings. Sometimes there can be good reasons to privilege the latter. But I don’t think we should pretend that the tension between good science and good personal relationships doesn’t exist. My own view, for what it’s worth, is that people who want to do science for a living should accept that they are going to be regularly and frequently criticized, and that hurt feelings and wounded egos are part and parcel of being cognitively limited agents with deep emotions who spend their time trying to understand something incredibly difficult. This doesn’t mean that it’s okay to yell at people or call them idiots in public–it isn’t, and we should work hard collectively to prevent such behavior. But it does mean that at some point in one’s scientific career–and probably at many, many points–one may have the distinctly unpleasant experience of another scientist saying “I think the kind of work you do is fundamentally not capable of answering the questions you’re asking,” or, “there’s a critical flaw in your entire research program.” In such cases, it’s understandable if one’s feelings are hurt. But hurt feelings don’t in any way excuse one from engaging seriously with the content of the criticism. Listening to people tell us we’re wrong is part of the mantle we assume when we decide to become scientists; if we only want to talk to other people when they agree with us, there are plenty of other good ways we can spend our lives.

Who’s actually listening?

The last reason that complaining about the general tone of discourse seems inadvisable is that it’s not clear who’s actually listening. I mean, obviously plenty of people are watching the current controversy unfold in the “hold on, let me get some popcorn” sense. But the real question is, who do we think is going to read Fiske’s commentary, or any other commentary like it, and think, “you know what–I see now that I’ve been a total jerk until now, and I’m going to stop”? I suspect that if we were to catalogue all the cases that Fiske thinks of as instances of “ad hominem smear tactics” or “public shaming and blaming”, and then ask the perpetrators for their side of the story, we would probably get a very different take on things. I imagine that in the vast majority of cases, what people like Fiske see as behavior that’s completely beyond the pale would be seen by the alleged perpetrators as harsh but perfectly reasonable criticism–and apologies or promises to behave better in future would probably not flow very freely.

Note that I’m emphatically not suggesting that the actions in question are always defensible. I’m not passing any judgment on anyone’s behavior at all. I have no trouble believing that in some of the cases Fiske alludes to, there are probably legitimate and serious causes for concern. But the problem is, I see no reason to think that in cases where someone really is being an asshole, they’re likely to stop being an asshole just because Fiske wrote an op-ed complaining about tone in general terms. For example, I personally don’t think Andrew Gelman’s criticism of Cuddy, Norton, or Fiske has been at all inappropriate; but supposing you do think it’s inappropriate, do you really think Gelman is going to stop vigorously criticizing research he disagrees with just because Fiske wrote a column calling for civility?

We therefore find ourselves in a rather unfortunate situation: Fiske’s appeal is likely to elicit both heartfelt nods of approval from anyone who feels they’ve ever been personally attacked by a “methodological terrorist”, and shrieks of indignation and moral outrage from anyone who feels Fiske is mistaking their legitimate criticism for personal abuse. What it’s not likely to elicit much of is serious self-reflection or change in behavior–if for no other reason than that it doesn’t describe any behavior in sufficient detail that anyone could actually think, “oh, yes, I see how that could be perceived as a personal attack.” In trying to avoid “damaging our field” by naming names, Fiske has, ironically, ended up writing a deeply divisive piece that appears to have only fanned the flames. I don’t think this is an accident; it seems to me like the inevitable fate of any general call for civility of this kind that fails to actually define or give examples of the behavior that is supposed to be so offensive.

The moral of the story is, if you’re going to complain about “critiques of such personal ferocity and relentless frequency that they resemble a denial-of-service attack” (and you absolutely should, if you think you have a legitimate case!), then you need to point to concrete behaviors that people can consider, evaluate, and learn from, and not just throw out vague allusions to “public shaming and blaming”, “ignoring ethical rules of conduct”, and “attacking the person and not the work”.

3. Fear of criticism is important—and healthy

Accusations of actual bullying are not the only concern raised by Fiske and other traditionalists. One of the other recurring themes that have come up in various commentaries on the tone of our current discourse is a fear of future criticism–and in particular, of being unfairly “targeted” for attack. In her column, Fiske writes that targets “often seem to be chosen for scientifically irrelevant reasons: their contrary opinions, professional prominence, or career-stage vulnerability.” On its face, this concern seems reasonable: surely it would be a bit unseemly for researchers to go running around gunning for one another purely to satisfy their petty personal vendettas. Science is supposed to be about the pursuit of truth, not vengeance!

Unfortunately, there is, so far as I can see, no possible way to enforce an injunction against pettiness or malicious intent. Nor should we want to try, because that would require a rather active form of thought policing. After all, who gets to decide what was in my head when I set out to replicate someone else’s study? Do we really want editors or reviewers passing judgment on whether an author’s motives for conducting a study were pure–and using that as a basis to discount the actual findings reported by the study? Does that really seem to Fiske like a good way to improve the tone of scientific discourse?

For better or worse, researchers do not–and cannot–have any right not to fear being “targeted” by other scientists–no matter what the motives in question may be. To the contrary, I would argue that a healthy fear of others’ (possibly motivated) negative evaluations is a largely beneficial influence on the quality of our science. Personally, I feel a not-insubstantial amount of fear almost any time I contemplate the way something I’ve written will be received by others (including these very words–as I’m writing them!). I frequently ask myself what I myself would say if I were reading a particular sentence or paragraph in someone else’s paper. And if the answer is “I would criticize it, for the following reasons…”, then I change or remove the offending statement(s) until I have no further criticisms. I have no doubt that it would do great things for my productivity if I allowed myself to publish papers as if they were only going to be read by friendly, well-intentioned colleagues. But then the quality of my papers would also decrease considerably. So instead, I try to write papers as if I expect them to be read by a death panel with a 90% kill quota. It admittedly makes writing less fun, but I also think it makes the end product much better. (The same principle also applies when seeking critical feedback on one’s work from others: if you only ever ask friendly, pleasant collaborators for their opinion on your papers, you shouldn’t be surprised if anonymous reviewers who have no reason to pull their punches later take a somewhat dimmer view.)

4. Fairness is in the eye of the beholder

Another common target of appeal in arguments about tone is fairness. We find fairness appeals implicitly in Fiske’s op-ed (presumably it’s a bad thing if some people switch careers because of fear of being bullied), and explicitly in a number of other commentaries. The most common appeal is to the negative career consequences of being (allegedly) unfairly criticized or bullied. The criticism doesn’t just impact on one’s scientific findings (goes the argument); it also makes it less likely that one will secure a tenure-track position, promotion, raise, or speaking invitations. Simone Schnall went so far as to suggest that the public criticism surrounding a well-publicized failure to replicate one of her studies made her feel like “a criminal suspect who has no right to a defense and there is no way to win.”

Now, I’m not going to try to pretend that Fiske, Schnall, and others are wrong about the general conclusion they draw. I see no reason to deny Schnall’s premise that her career has suffered as a result of the replication failure (though I would also argue that the bulk of that damage is likely attributable to the way she chose to respond to that replication failure, rather than to the actual finding itself). But the critical point here is, the fact that Schnall and others have suffered as a result of others’ replication failures and methodological criticisms is not in and of itself any kind of argument against those replication efforts and criticisms. No researcher has a right to lead a successful career untroubled and unencumbered by any serious questioning of their findings. Nor do early-career researchers like Alec Beall, whose paper suggesting that fertile women are more likely to wear red shirts was severely criticized by Andrew Gelman and others. It is lamentably true that incisive public criticism may injure the reputation and job prospects of those whose work has been criticized. And it’s also true that this can be quite unfair, in the sense that there is generally no particular reason why these particular people should be criticized and suffer for it, while other people with very similar bodies of work go unscathed, and secure plum jobs or promotions.

But here’s the thing: what doesn’t seem fair at the level of one individual is often perfectly fair–or at least, unavoidable–at the level of an entire community. As soon as one zooms out from any one individual, and instead surveys the field of psychology as a whole, it becomes clear that the job and reputation markets are, to a first approximation, a zero-sum game. As Gelman and many other people have noted, for every person who doesn’t get a job because their paper was criticized by a “replicator”, there could be three other candidates who didn’t get jobs because their much more methodologically rigorous work took too long to publish and/or couldn’t stack up in flashiness to the PR-grabbing work that did win the job lottery. At an individual level, neither of these outcomes is “fair”. But then, very little in the world of professional success–in any field–is fair; almost every major professional outcome, good or bad, is influenced by an enormous amount of luck, and I would argue that it is delusional to pretend otherwise.

At root, I think the question we should ask ourselves, when something good or bad happens, is not: is it fair that I got treated [better|worse] than the way that other person over there was treated? Instead, it should be: does the distribution of individual outcomes we’re seeing align well with what maximizes the benefit to our community as a whole? Personally, I find it very difficult to see trenchant public criticism of work that one perceives as sub-par as a bad thing–even as I recognize that it may seem deeply unfair to the people whose work is the target of that criticism. The reason for this is that an obvious consequence of an increasing norm towards open, public criticism of people’s work is that the quality of our work will, collectively, improve. There should be no doubt that this shift will entail a redistribution of resources: the winners and losers under the new norm will be different from the winners and losers under the old norm. But that observation provides no basis for clinging to the old norm. Researchers who don’t like where things are currently headed cannot simply throw out complaints about being “unfairly targeted” by critics; instead, they need to articulate principled arguments for why a norm of open, public scientific criticism would be bad for science as a whole–and not just bad for them personally.

5. Everyone but me is biased!

The same logic that applies to complaints about being unfairly targeted also applies, I think, to complaints about critics’ nefarious motives or unconscious biases. To her credit, Fiske largely avoids imputing negative intent to her perceived adversaries–even as she calls them all kinds of fun names. Other commentators, however, have been less restrained–for example, suggesting that “there’s a lot of stuff going on where there’s now people making their careers out of trying to take down other people’s careers”, or that replicators “seem bent on disproving other researchers’ results by failing to replicate”. I find these kinds of statements uncompelling and, frankly, unseemly. The reason they’re unseemly is not that they’re wrong. Actually, they’re probably right. I don’t doubt that, despite what many reformers say, some of them are, at least some of the time, indeed motivated by personal grudges, a desire to bring down colleagues of whom they’re envious, and so on and so forth.

But the thing is, those motives are completely irrelevant to the evaluation of the studies and critiques that these people produce. The very obvious reason why the presence of bias on the part of a critic cannot be grounds to discount a study is that critics are not the only people with biases. Indeed, applying such a standard uniformly would mean that nobody’s finding should ever be taken seriously. Let’s consider just a few of the incentives that could lead a researcher conducting novel research, and who dreams of publishing their findings in the hallowed pages of, say, Psychological Science, to cut a few corners and end up producing some less-than-reliable findings:

  • Increased productivity: It’s less work to collect small convenience samples than large, representative ones.
  • More compelling results: Statistically significant results generated in small samples are typically more impressive-looking than ones obtained from very large samples, due to sampling error and selection bias.
  • Simple stories: The more one probes a particular finding, the greater the likelihood that one will identify some problem that questions the validity of the results, or adds nuance and complexity to an otherwise simple story. And “mixed” findings are harder to publish.

All of these benefits, of course, feed directly into better prospects for fame, fortune, jobs, and promotions. So the idea that a finding published in one of our journals should be considered bias-free because it happened to come first, while a subsequent criticism or replication of that finding should be discounted because of personal motives or other biases is, frankly, delusional. Biases are everywhere; everyone has them. While this doesn’t mean that we should ignore them, it does mean that we should either (a) call all biases out equally–which is generally impossible, or at the very least extremely impractical–or (b) accept that doing so is not productive, and that the best way to eliminate bias over the long term is to pit everyone’s biases against each other and let logical argument and empirical data decide who’s right. Put differently, if you’re going to complain that Jane Doe is clearly motivated to destroy your cherished finding in order to make a name for herself, you should probably preface such an accusation with the admission that you obviously had plenty of motivation to cut corners when you produced the finding in the first place, since you knew it would help you make a name for yourself. Asymmetric appeals that require one to believe that bias exists in only one group of people simply don’t deserve to be taken seriously.

Personally, I would suggest that we adopt a standard policy of simply not talking about other people’s motivations or biases. If you can find evidence of someone’s bias in the methods they used or the analyses they conducted, then great–you can go ahead and point out the perceived flaws. That’s just being a good scientist. But if you can’t, then what was in your (perceived) adversary’s head when she produced her findings is quite irrelevant to scientific discourse–unless you think it would be okay for your critics to discount your work on the grounds that you clearly had all kinds of incentives to cheat.

Conclusions

Uh, no. No conclusions this time–this post is already long enough as is. And anyway, I already posted all of my conclusions way back at the beginning. So you can scroll all the way up there if you want to read them again. Instead, I’m going to try to improve your mood a tiny bit (if not the tone of the debate) by leaving you with this happy little painting automagically generated by Bot Ross:


* I lied! There were no literary allusions.

The Great Minds Journal Club discusses Westfall & Yarkoni (2016)

[Editorial note: The people and events described here are fictional. But the paper in question is quite real.]

“Dearly Beloved,” The Graduate Student began. “We are gathered here to–”

“Again?” Samantha interrupted. “Again with the Dearly Beloved speech? Can’t we just start a meeting like a normal journal club for once? We’re discussing papers here, not holding a funeral.”

“We will discuss papers,” said The Graduate Student indignantly. “In good time. But first, we have to follow the rules of Great Minds Journal Club. There’s a protocol, you know.”

Samantha was about to point out that she didn’t know, because The Graduate Student was the sole author of the alleged rules, and the alleged rules had a habit of changing every week. But she was interrupted by the sound of the double doors at the back of the room swinging violently inwards.

“Sorry I’m late,” said Jin, strolling into the room, one hand holding what looked like a large bucket of coffee with a lid on top. “What are we reading today?”

“Nothing,” said Lionel. “The reading has already happened. What we’re doing now is discussing the paper that everyone’s already read.”

“Right, right,” said Jin. “What I meant to ask was: what paper that we’ve all already read are we discussing today?”

“Statistically controlling for confounding constructs is harder than you think,” said The Graduate Student.

“I doubt it,” said Jin. “I think almost everything is intolerably difficult.”

“No, that’s the title of the paper,” Lionel chimed in. “Statistically controlling for confounding constructs is harder than you think. By Westfall and Yarkoni. In PLOS ONE. It’s what we picked to read for this week. Remember? Are you on the mailing list? Do you even work here?”

“Do I work here… Hah. Funny man. Remember, Lionel… I’ll be on your tenure committee in the Fall.”

“Why don’t we get started,” said The Graduate Student, eager to prevent a full-out sarcastathon. “I guess we can do our standard thing where Samantha and I describe the basic ideas and findings, talk about how great the paper is, and suggest some possible extensions… and then Jin and Lionel tear it to shreds.”

“Sounds good,” said Jin and Lionel in concert.

“The basic problem the authors highlight is pretty simple,” said Samantha. “It’s easy to illustrate with an example. Say you want to know if eating more bacon is associated with a higher incidence of colorectal cancer–like that paper that came out a while ago suggested. In theory, you could just ask people how often they eat bacon and how often they get cancer, and then correlate the two. But suppose you find a positive correlation–what can you conclude?”

“Not much,” said Pablo–apparently in a talkative mood. It was the first thing he’d said to anyone all day–and it was only 3 pm.

“Right. It’s correlational data,” Samantha continued. “Nothing is being experimentally manipulated here, so we have no idea if the bacon-cancer correlation reflects the effect of bacon itself, or if there’s some other confounding variable that explains the association away.”

“Like, people who exercise less tend to eat more bacon, and exercise also prevents cancer,” The Graduate Student offered.

“Or it could be a general dietary thing, and have nothing to do with bacon per se,” said Jin. “People who eat a lot of bacon also have all kinds of other terrible dietary habits, and it’s really the gestalt of all the bad effects that causes cancer, not any one thing in particular.”

“Or maybe,” suggested Pablo, “a sneaky parasite unknown to science invades the brain and the gut. It makes you want to eat bacon all the time. Because bacon is its intermediate host. And then it also gives you cancer. Just to spite you.”

“Right, it could be any of those things,” Samantha said. “Except for maybe that last one. The point is, there are many potential confounds. If we want to establish that there’s a ‘real’ association between bacon and cancer, we need to somehow remove the effect of other variables that could be correlated with both bacon-eating and cancer-having. The traditional way to do this is to statistically ‘control for’ or ‘hold constant’ the effects of confounding variables. The idea is that you adjust the variables in your regression equation so that you’re essentially asking what would the relationship between bacon and cancer look like if we could eliminate the confounding influence of things like exercise, diet, alcohol, and brain-and-gut-eating parasites? It’s a very common move, and the logic of statistical control is used to justify a huge number of claims all over the social and biological sciences.”

“I just published a paper showing that brain activation in frontoparietal regions predicts people’s economic preferences even after controlling for self-reported product preferences,” said Jin. “Please tell me you’re not going to shit all over my paper. Is that where this is going?”

“It is,” said Lionel gleefully. “That’s exactly where this is going.”

“It’s true,” Samantha said apologetically. “But if it’s any consolation, we’re also going to shit on Lionel’s finding that implicit prejudice is associated with voting behavior after controlling for explicit attitudes.”

“That’s actually pretty consoling,” said Jin, smiling at Lionel.

“So anyway, statistical control is pervasive,” Samantha went on. “But there’s a problem: statistical control–at least the way people typically do it–is a measurement-level technique. Meaning, when you control for the rate of alcohol use in a regression of cancer on bacon, you’re not really controlling for alcohol use. What you’re actually controlling for is just one particular operationalization of alcohol use–which probably doesn’t cover the entire construct, and is also usually measured with some error.”

“Could you maybe give an example,” asked Pablo. He was the youngest in the group, being only a second-year graduate student. (The Graduate Student, by contrast, had been in the club for so long that his real name had long ago been forgotten by the other members of the GMJC.)

“Sure,” said The Graduate Student. “Suppose your survey includes an item like ‘how often do you consume alcoholic beverages’, and the response options include things like never, less than once a month, I’m never not consuming alcoholic beverages, and so on. Now, people are not that great at remembering exactly how often they have a drink–especially the ones who tend to have a lot of drinks. On top of that, there’s a stigma against drinking a lot, so there’s probably going to be some degree of systematic underreporting. All of this contrives to give you a measure that’s less than perfectly reliable–meaning, it won’t give you the same values that you would get if you could actually track people for an extended period of time and accurately measure exactly how much ethanol they consume, by volume. In many, many cases, measured covariates of this kind are pretty mediocre.”

“I see,” said Pablo. “That makes sense. So why is that a problem?”

“Because you can’t control for that which you aren’t measuring,” Samantha said. “Meaning, if your alleged measure of alcohol consumption–or any other variable you care about–isn’t measuring the thing you care about with perfect accuracy, then you can’t remove its influence on other things. It’s easiest to see this if you think about the limiting case where your measurements are completely unreliable. Say you think you’re measuring weekly hours of exercise, but actually your disgruntled research assistant secretly switched out the true exercise measure for randomly generated values. When you then control for the alleged ‘exercise’ variable in your model, how much of the true influence of exercise are you removing?”

“None,” said Pablo.

“Right. Your alleged measure of exercise doesn’t actually reflect anything about exercise, so you’re accomplishing nothing by controlling for it. The same exact point holds–to varying degrees–when your measure is somewhat reliable, but not perfect. Which is to say, pretty much always.”
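[Editorial aside: Samantha’s limiting case is easy to check with a quick simulation. What follows is only a rough sketch under invented assumptions, not anything taken from the paper: the variable names, the effect sizes of -0.5, the sample size, and the helper function bacon_coefficient are all made up for illustration. In this toy world bacon has no effect on cancer at all; exercise drives both. We then “control for” a measure of exercise whose reliability we get to choose.]

```python
import numpy as np

def bacon_coefficient(reliability, n=200_000, seed=0):
    """Regress cancer on bacon plus a noisy exercise measure; return bacon's slope.

    Hypothetical setup: exercise lowers both bacon-eating and cancer risk,
    and bacon itself has zero causal effect on cancer.
    """
    rng = np.random.default_rng(seed)
    exercise = rng.standard_normal(n)                  # true confounder, variance 1
    bacon = -0.5 * exercise + rng.standard_normal(n)   # assumed effect size
    cancer = -0.5 * exercise + rng.standard_normal(n)  # note: no bacon term

    if reliability > 0:
        # reliability = var(true score) / var(observed score)
        noise_sd = np.sqrt((1.0 - reliability) / reliability)
        measured = exercise + noise_sd * rng.standard_normal(n)
    else:
        # the disgruntled research assistant: pure random numbers
        measured = rng.standard_normal(n)

    # Ordinary least squares with an intercept, bacon, and the measured covariate
    X = np.column_stack([np.ones(n), bacon, measured])
    coefs, *_ = np.linalg.lstsq(X, cancer, rcond=None)
    return coefs[1]

if __name__ == "__main__":
    for r in (0.0, 0.5, 0.8, 1.0):
        print(f"reliability={r:.1f}  bacon coefficient={bacon_coefficient(r):.3f}")
```

[Under these particular assumptions the expected bacon coefficient works out to (1 - r) / (5 - r) for reliability r. So it only vanishes when the covariate is measured perfectly, and with random numbers standing in for exercise it is the same 0.2 you would get by not controlling for anything.]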

“You could also think about the same general issue in terms of construct validity,” The Graduate Student chimed in. “What you’re typically trying to do by controlling for something is account for a latent construct or concept you care about–not a specific measure. For example, the latent construct of a ‘healthy diet’ could be measured in many ways. You could ask people how much broccoli they eat, how much sugar or trans fat they consume, how often they eat until they can’t move, and so on. If you surveyed people with a lot of different items like this, and then extracted the latent variance common to all of them, then you might get a component that could be interpreted as something like ‘healthy diet’. But if you only use one or two items, they’re going to be very noisy indicators of the construct you care about. Which means you’re not really controlling for how healthy people’s diet is in your model relating bacon to cancer. At best, you’re controlling for, say, self-reported number of vegetables eaten. But there’s a very powerful temptation for authors to forget that caveat, and to instead think that their measurement-level conclusions automatically apply at the construct level. The result is that you end up with a huge number of papers saying things like ‘we show that fish oil promotes heart health even after controlling for a range of dietary and lifestyle factors’. When in fact the measurement-level variables they’ve controlled for can’t help but capture only a tiny fraction of all of the dietary and lifestyle factors that could potentially confound the association you care about.”

“I see,” said Pablo. “But this seems like a pretty basic point, doesn’t it?”

“Yes,” said Lionel. “It’s a problem as old as time itself. It might even be older than Jin.”

Jin smiled at Lionel and tipped her coffee cup-slash-bucket towards him slightly in salute.

“In fairness to the authors,” said The Graduate Student, “they do acknowledge that essentially the same problem has been discussed in many literatures over the past few decades. And they cite some pretty old papers. Oldest one is from… 1965. Kahneman, 1965.”

An uncharacteristic silence fell over the room.

“That Kahneman?” Jin finally probed.

“The one and only.”

“Fucking Kahneman,” said Lionel. “That guy could really stand to leave a thing or two for the rest of us to discover.”

“So, wait,” said Jin, evidently coming around to Lionel’s point of view. “These guys cite a 50-year old paper that makes essentially the same argument, and still have the temerity to publish this thing?”

“Yes,” said Samantha and The Graduate Student in unison.

“But to be fair, their presentation is very clear,” Samantha said. “They lay out the problem really nicely–which is more than you can say for many of the older papers. Plus there’s some neat stuff in here that hasn’t been done before, as far as I know.”

“Like what?” asked Lionel.

“There’s a nice framework for analytically computing error rates for any set of simple or partial correlations between two predictors and a DV. And, to save you the trouble of having to write your own code, there’s a Shiny web app.”

“In my day, you couldn’t just write a web app and publish it as a paper,” Jin grumbled. “Shiny or otherwise.”

“That’s because in your day, the internet didn’t exist,” Lionel helpfully offered.

“No internet?” The Graduate Student shrieked in horror. “How old are you, Jin?”

“Old enough to become very wise,” said Jin. “Very, very wise… and very corpulent with federal grant money. Money that I could, theoretically, use to fund–or not fund–a graduate student of my choosing next semester. At my complete discretion, of course.” She shot The Graduate Student a pointed look.

“There’s more,” Samantha went on. “They give some nice examples that draw on real data. Then they show how you can solve the problem with SEM–although admittedly that stuff all builds directly on textbook SEM work as well. And then at the end they go on to do some power calculations based on SEM instead of the standard multiple regression approach. I think that’s new. And the results are… not pretty.”

“How so,” asked Lionel.

“Well. Westfall and Yarkoni suggest that for fairly typical parameter regimes, researchers who want to make incremental validity claims at the latent-variable level–using SEM rather than multiple regression–might be looking at a bare minimum of several hundred participants, and often many thousands, in order to adequately power the desired inference.”

“Ouchie,” said Jin.

“What happens if there’s more than one potential confound?” asked Lionel. “Do they handle the more general multiple regression case, or only two predictors?”

“No, only two predictors,” said The Graduate Student. “Not sure why. Maybe they were worried they were already breaking enough bad news for one day.”

“Could be,” said Lionel. “You have to figure that in an SEM, when unreliability in the predictors is present, the uncertainty is only going to compound as you pile on more covariates–because it’s going to become increasingly unclear how the model should attribute any common variance that the predictor of interest shares with both the DV and at least one other covariate. So whatever power estimates they come up with in the paper for the single-covariate case are probably upper bounds on the ability to detect incremental contributions in the presence of multiple covariates. If you have a lot of covariates–like the epidemiology or nutrition types usually do–and at least some of your covariates are fairly unreliable, things could get ugly really quickly. Who knows what kind of sample sizes you’d need in order to make incremental validity claims about small effects in epi studies where people start controlling for the sun, moon, and stars. Hundreds of thousands? Millions? I have no idea.”

“Jesus,” said The Graduate Student. “That would make it almost impossible to isolate incremental contributions in large observational datasets.”

“Correct,” said Lionel.

“The thing I don’t get,” said Samantha, “is that the epidemiologists clearly already know about this problem. Or at least, some of them do. They’ve written dozens of papers about ‘residual confounding’, which is another name for the same problem Westfall and Yarkoni discuss. And yet there are literally thousands of large-sample, observational papers published in prestigious epidemiology, nutrition, or political science journals that never even mention this problem. If it’s such a big deal, why does almost nobody actually take any steps to address it?”

“Ah…” said Jin. “As the senior member of our group, I can probably answer that question best for you. You see, it turns out it’s quite difficult to publish a paper titled After an extensive series of SEM analyses of a massive observational dataset that cost the taxpayer three million dollars to assemble, we still have no idea if bacon causes cancer. Nobody wants to read that paper. You know what paper people do want to read? The one called Look at me, I eat so much bacon I’m guaranteed to get cancer according to the new results in this paper–but I don’t even care, because bacon is so delicious. That’s the paper people will read, and publish, and fund. So that’s the paper many scientists are going to write.”

A second uncharacteristic silence fell over the room.

“Bit of a downer today, aren’t you,” Lionel finally said. “I guess you’re playing the role of me? I mean, that’s cool. It’s a good look for you.”

“Yes,” Jin agreed. “I’m playing you. Or at least, a smarter, more eloquent, and better-dressed version of you.”

“Why don’t we move on,” Samantha interjected before Lionel could re-arm and respond. “Now that we’ve laid out the basic argument, should we try to work through the details and see what we find?”

“Yes,” said Lionel and Jin in unison–and proceeded to tear the paper to shreds.

Neurosynth is joining the Elsevier family

[Editorial note: this was originally posted on April 1, 2016. April 1 is a day marked by a general lack of seriousness. Interpret this post accordingly.]

As many people who follow this blog will be aware, much of my research effort over the past few years has been dedicated to developing Neurosynth—a framework for large-scale, automated meta-analysis of neuroimaging data. Neurosynth has expanded steadily over time, with an ever-increasing database of studies, and a host of new features in the pipeline. I’m very grateful to NIMH for the funding that allows me to keep working on the project, and also to the hundreds (thousands?) of active Neurosynth users who keep finding novel applications for the data and tools we’re generating.

That said, I have to confess that, over the past year or so, I’ve gradually grown dissatisfied at my inability to scale up the Neurosynth operation in a way that would take the platform to the next level. My colleagues and I have come up with, and in some cases even prototyped, a number of really exciting ideas that we think would substantially advance the state of the art in neuroimaging. But we find ourselves spending an ever-increasing chunk of our time applying for the grants we need to support the work, and having little time left over to actually do the work. Given the current funding climate and other logistical challenges (e.g., it’s hard to hire professional software developers on postdoc budgets), it’s become increasingly clear to me that the Neurosynth platform will be hard to sustain in an academic environment over the long term. So, for the past few months, I’ve been quietly exploring opportunities to help Neurosynth ladder up via collaborations with suitable industry partners.

Initially, my plan was simply to license the Neurosynth IP and use the proceeds to fund further development of Neurosynth out of my lab at UT-Austin. But as I started talking to folks in industry, I realized that there were opportunities available outside of academia that would allow me to take Neurosynth in directions that the academic environment would never allow. After a lot of negotiation, consultation, and soul-searching, I’m happy (though also a little sad) to announce that I’ll be leaving my position at the University of Texas at Austin later this year and assuming a new role as Senior Technical Fellow at Elsevier Open Science (EOS). EOS is a brand new division of Elsevier that seeks to amplify and improve scientific communication and evaluation by developing cutting-edge open science tools. The initial emphasis will be on the neurosciences, but other divisions are expected to come online in the next few years (and we’ll be hiring soon!). EOS will be building out a sizable insight-as-a-service operation that focuses on delivering real value to scientists—no p-hacking, no gimmicks, just actionable scientific information. The platforms we build will seek to replace flawed citation-based metrics with more accurate real-time measures that quantify how researchers actually use one another’s data, ideas, and tools—ultimately paving the way to a new suite of microneuroservices that reward researchers both professionally and financially for doing high-quality science.

On a personal level, I’m thrilled to be in a position to help launch an initiative like this. Having spent my entire career in an academic environment, I was initially a bit apprehensive at the thought of venturing into industry. But the move to Elsevier ended up feeling very natural. I’ve always seen Elsevier as a forward-thinking company at the cutting edge of scientific publishing, so I wasn’t shocked to hear about the EOS initiative. But as I’ve visited a number of Elsevier offices over the past few weeks (in the process of helping to decide where to locate EOS), I’ve been continually struck at how open and energetic—almost frenetic—the company is. It’s the kind of environment that combines many of the best elements of the tech world and academia, but without a lot of the administrative bureaucracy of the latter. At the end of the day, it was an opportunity I couldn’t pass up.

It will, of course, be a bittersweet transition for me; I’ve really enjoyed my 3 years in Austin, both professionally and personally. While I’m sure I’ll enjoy Norwich, CT (where EOS will be based), I’m going to really miss Austin. The good news is, I won’t be making the move alone! A big part of what sold me on Elsevier’s proposal was their commitment to developing an entire open science research operation; over the next five years, the goal is to make Elsevier the premier place to work for anyone interested in advancing open science. I’m delighted to say that Chris Gorgolewski (Stanford), Satrajit Ghosh (MIT), and Daniel Margulies (Max Planck Institute for Human Cognitive and Brain Sciences) have all also been recruited to Elsevier, and will be joining EOS at (or in Satra’s case, shortly after) launch. I expect that they’ll make their own announcements shortly, so I won’t steal their thunder much. But the short of it is that Chris, Satra, and I will be jointly spearheading the technical operation. Daniel will be working on other things, and is getting the fancy title of “Director of Interactive Neuroscience”; I think this means he’ll get to travel a lot and buy expensive pictures of brains to put on his office walls. So really, it’s a lot like his current job.

It goes without saying that Neurosynth isn’t making the jump to Elsevier all alone; NeuroVault—a whole-brain image repository developed by Chris—will also be joining the Elsevier family. We have some exciting plans in the works for much closer NeuroVault-Neurosynth integration, and we think the neuroimaging community is going to really like the products we develop. We’ll also be bringing with us the OpenfMRI platform created by Russ Poldrack. While Russ wasn’t interested in leaving Stanford (as I recall, his exact words were “over all of your dead bodies”), he did agree to release the OpenfMRI IP to Elsevier (and in return, Elsevier is endowing a permanent Open Science fellowship at Stanford). Russ will, of course, continue to actively collaborate on OpenfMRI, and all data currently in the OpenfMRI database will remain where it is (though all original contributors will be given the opportunity to withdraw their datasets if they choose). We also have some new Nipype-based tools rolling out over the coming months that will allow researchers to conduct state-of-the-art neuroimaging analyses in the cloud (for a small fee)–but I’ll have much more to say about that in a later post.

Naturally, a transition like this one can’t be completed without hitting a few speed bumps along the way. The most notable one is that the current version of Neurosynth will be retired permanently in mid-April (so grab any maps you need right now!). A new and much-improved version will be released in September, coinciding with the official launch of EOS. One of the things I’m most excited about is that the new version will support an “Enhanced Usage” tier. The vertical integration of Neurosynth with the rest of the Elsevier ecosystem will be a real game-changer; for example, authors submitting papers to NeuroImage will automatically be able to push their content into NeuroVault and Neurosynth upon acceptance, and readers will be able to instantly visualize and cognitively decode any activation map in the Elsevier system (for a nominal fee handled via an innovative new micropayment system). Users will, of course, retain full control over their content, ensuring that only readers who have the appropriate permissions (and a valid micropayment account of their own) can access other people’s data. We’re even drawing up plans to return a portion of the revenues earned through the system to the content creators (i.e., article authors)—meaning that for the first time, neuroimaging researchers will be able to easily monetize their research.

As you might expect, the Neurosynth brand will be undergoing some changes to reflect the new ownership. While Chris and I initially fought hard to preserve the names Neurosynth and NeuroVault, Elsevier ultimately convinced us that using a consistent name for all of our platforms would reduce confusion, improve branding, and make for a much more streamlined user experience*. There’s also a silver lining to the name we ended up with: Chris, Russ, and I have joked in the past that we should unite our various projects into a single “NeuroStuff” website—effectively the Voltron of neuroimaging tools—and I even went so far as to register neurostuff.org a while back. When we mentioned this to the Elsevier execs (intending it as a joke), we were surprised at their positive response! The end result (after a lot of discussion) is that Neurosynth, NeuroVault, and OpenfMRI will be merging into The NeuroStuff Collection, by Elsevier (or just NeuroStuff for short)–all coming in late 2016!

Admittedly, right now we don’t have a whole lot to show for all these plans, except for a nifty logo created by Daniel (and reluctantly approved by Elsevier—I think they might already be rethinking this whole enterprise). But we’ll be rolling out some amazing new services in the very near future. We also have some amazing collaborative projects that will be announced in the next few weeks, well ahead of the full launch. A particularly exciting one that I’m at liberty to mention** is that next year, EOS will be teaming up with Brian Nosek and folks at the Center for Open Science (COS) in Charlottesville to create a new preregistration publication stream. All successful preregistered projects uploaded to the COS’s flagship Open Science Framework (OSF) will be eligible, at the push of a button, for publication in EOS’s new online-only journal Preregistrations. Submission fees will be competitive with the very cheapest OA journals (think along the lines of PeerJ’s $99 lifetime subscription model).

It’s been a great ride working on Neurosynth for the past 5 years, and I hope you’ll all keep using (and contributing to) Neurosynth in its new incarnation as Elsevier NeuroStuff!

* Okay, there’s no point in denying it—there was also some money involved.

** See? Money can’t get in the way of open science—I can talk about whatever I want!

Still not selective: comment on comment on comment on Lieberman & Eisenberger (2015)

In my last post, I wrote a long commentary on a recent PNAS article by Lieberman & Eisenberger claiming to find evidence that the dorsal anterior cingulate cortex is “selective for pain” using my Neurosynth framework for large-scale fMRI meta-analysis. I argued that nothing about Neurosynth supports any of L&E’s major conclusions, and that they made several major errors of inference and analysis. L&E have now responded in detail on Lieberman’s blog. If this is the first you’re hearing of this exchange, and you have a couple of hours to spare, I’d suggest proceeding in chronological order: read the original article first, then my commentary, then L&E’s response, then this response to the response (if you really want to leave no stone unturned, you could also read Alex Shackman’s commentary, which focuses on anatomical issues). If you don’t have that kind of time on your hands, just read on and hope for the best, I guess.

Before I get to the substantive issues, let me say that I appreciate L&E taking the time to reply to my comments in detail. I recognize that they have other things they could be doing (as do I), and I think their willingness to engage in this format sets an excellent example as the scientific community continues to move rapidly towards more open, rapid, and interactive online scientific discussion. I would encourage readers to weigh in on the debate themselves or raise any questions they feel haven’t been addressed (either here or on Lieberman’s blog).

With that said, I have to confess that I don’t think my view is any closer to L&E’s than it previously was. I disagree with L&E’s suggestions that we actually agree on more than I thought in my original post; if anything, I think the opposite is true. However, I did find L&E’s response helpful inasmuch as it helped me better understand where their misunderstandings of Neurosynth lie.

In what follows, I provide a detailed rebuttal to L&E’s response. I’ll warn you right now that this will be a very long and fairly detail-oriented post. In a (probably fruitless) effort to minimize reader boredom, I’ve divided my response into two sections, much as L&E did. In the first section, I summarize what I see as the two most important points of disagreement. In the second section, I quote L&E’s entire response and insert my own comments in-line (essentially responding email-style). I recognize that this is a rather unusual thing to do, and it makes for a decidedly long read (the post clocks in at over 20,000 words, though much of that is quotes from L&E’s response), but I did it this way because, frankly, I think L&E badly misrepresented much of what I said in my last post. I want to make sure the context is very clear to readers, so I’m going to quote the entirety of each of L&E’s points before I respond to them, so that at the very least I can’t be accused of quoting them out of context.

The big issues: reverse inference and selectivity

With preliminaries out of the way, let me summarize what I see as the two biggest problems with L&E’s argument (though, if you make it to the second half of this post, you’ll see that there are many other statistical and interpretational issues that are pretty serious in their own right). The first concerns their fundamental misunderstanding of the statistical framework underpinning Neurosynth, and its relation to reverse inference. The second concerns their use of a definition of selectivity that violates common sense and can’t possibly support their claim that “the dACC is selective for pain”.

Misunderstandings about the statistics of reverse inference

I don’t think there’s any charitable way to say this, so I’ll just be blunt: I don’t think L&E understand the statistics behind the images Neurosynth produces. In particular, I don’t think they understand the foundational role that the notion of probability plays in reverse inference. In their reply, L&E repeatedly say that my concerns about their lack of attention to effect sizes (i.e., conditional probabilities) are irrelevant, because they aren’t trying to make an argument about effect sizes. For example:

TY suggests that we made a major error by comparing the Z-scores associated with different terms and should have used posterior probabilities instead. If our goal had been to compare effect sizes this might have made sense, but comparing effect sizes was not our goal. Our goal was to see whether there was accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.

This captures perhaps the crux of L&E’s misunderstanding about both Neurosynth and reverse inference. Their argument here is basically that they don’t care about the actual probability of a term being used conditional on a particular pattern of activation; they just want to know that there’s “support for the reverse inference”. Unfortunately, it doesn’t work that way. The z-scores produced by Neurosynth (which are just transformations of p-values) don’t provide a direct index of the support for a reverse inference. What they measure is what p-values always measure: the probability of observing a result at least as extreme as the one observed under the assumption that the null of no effect is true. Conceptually, we can interpret this as a claim about the population-level association between a region and a term. Roughly, we can say that as z-scores increase, we can be more confident that there’s a non-zero (positive) relationship between a term and a brain region (though some Bayesians might want to take issue with even this narrow assertion). So, if all L&E wanted to say was, “there’s good evidence that there’s a non-zero association between pain and dACC activation across the population of published fMRI studies”, they would be in good shape. But what they’re arguing for is much stronger: they want to show that the dACC is selective for pain. And z-scores are of no use here. Knowing that there’s a non-zero association between dACC activation and pain tells us nothing about the level of specificity or selectivity of that association in comparison to other terms. If the z-score for the association between dACC activation and ‘pain’ occurrence is 12.4 (hugely statistically significant!), does that mean that the probability of pain conditional on dACC activation is closer to 95%, or to 25%? Does it tell us that dACC activation is a better marker of pain than conflict, vision, or memory? We don’t know. We literally have no way to tell, unless we’re actually willing to talk about probabilities within a Bayesian framework.
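To make the distinction concrete, here is a minimal sketch using made-up counts. It is not Neurosynth's actual code: the pooled two-proportion z-test stands in for whatever association test a meta-analysis uses, and the function names and study counts are hypothetical. The point it illustrates is that a z-score tracks how much evidence there is for a non-zero association, so it grows with the number of studies, while the conditional probability of the term given activation does not move at all.

```python
import math


def two_proportion_z(act_term, n_term, act_other, n_other):
    """Pooled two-proportion z statistic: activation rate with vs. without the term."""
    p_term = act_term / n_term
    p_other = act_other / n_other
    pooled = (act_term + act_other) / (n_term + n_other)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_term + 1 / n_other))
    return (p_term - p_other) / se


def p_term_given_activation(act_term, act_other):
    """Fraction of studies activating the voxel that also use the term."""
    return act_term / (act_term + act_other)


# Hypothetical counts (not real Neurosynth data): 350 of 10,000 studies use
# the term; 20% of those activate the voxel, versus 10% of the other studies.
small = (70, 350, 965, 9650)
# Same activation rates, but ten times as many studies of each kind.
large = tuple(10 * n for n in small)

z_small = two_proportion_z(*small)
z_large = two_proportion_z(*large)
prob_small = p_term_given_activation(small[0], small[2])
prob_large = p_term_given_activation(large[0], large[2])

print(f"z: {z_small:.1f} -> {z_large:.1f}")
print(f"P(term | activation): {prob_small:.3f} -> {prob_large:.3f}")
```

With these illustrative numbers the z-score grows by a factor of the square root of ten when the database gets ten times bigger, yet the probability of the term given activation stays just under 7% in both cases. A large z-score on its own says nothing about whether that probability is high or low.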

To demonstrate that this isn’t just a pedantic point about what could in theory happen, and that the issue is in fact completely fundamental to understanding what Neurosynth can and can’t support, here are three different flavors of the Neurosynth map for the term “pain”:

Neurosynth reverse inference z-scores and posterior probabilities for the term “pain”. Top: z-scores for two-way association test. Middle: posterior probability of pain assuming an empirical prior. Bottom: posterior probability of pain assuming a uniform prior (p(Pain) = 0.5).

The top row is the reverse inference z-score map available on the website. The values here are z-scores, and what they tell us (being simply transformations of p-values) is nothing more than what the probability would be of observing an association at least as extreme as the one we observe under the null hypothesis of no effect. The second and third maps are both posterior probability maps. They display the probability of a study using the term ‘pain’ when activation is observed at each voxel in the brain. These maps aren’t available on the website (for reasons I won’t get into here, though the crux of it is that they’re extremely easy to misinterpret, for reasons that may become clear below)—though you can easily generate them with the Neurosynth core tools if you’re so inclined.

The main feature of these two probability maps that should immediately jump out at you is how strikingly different their numbers are. In the first map (i.e., middle row), the probabilities of “pain” max out around 20%; in the second map (bottom row), they range from around 70% – 90%. And yet, here I am telling you that these are both posterior probability maps that tell us the probability of a study using the term “pain” conditional on that study observing activity at each voxel. How could this be? How could the two maps be so different, if they’re supposed to be estimates of the same thing?

The answer lies in the prior. In the natural order of things, different terms occur with wildly varying frequencies in the literature (remember that Neurosynth is based on extraction of words from abstracts, not direct measurement of anyone’s mental state!). “Pain” occurs in only about 3.5% of Neurosynth studies. By contrast, the term “memory” occurs in about 16% of studies. One implication of this is that, if we know nothing at all about the pattern of brain activity reported in a given study, we should already expect that study to be about five times more likely to involve memory than pain. Of course, knowing something about the pattern of brain activity should change our estimate. In Bayesian terminology, we can say that our prior belief about the likelihood of different terms gets updated by the activity pattern we observe, producing somewhat more informed posterior estimates. For example, if the hippocampus and left inferior frontal gyrus are active, that should presumably increase our estimate of “memory” somewhat; conversely, if the periaqueductal gray, posterior insula, and dACC are all active, that should instead increase our estimate of “pain”.
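For readers who like to see the update written out, this is just Bayes’ rule applied to a single term and a single voxel. The notation (Term, Activation) follows the P(Term|Activation) shorthand used elsewhere in this post; the second line is simply the ratio of the two base rates quoted above, which is where the “about five times” figure comes from.

```latex
% Posterior probability that a study uses the term, given activation at a voxel
P(\text{Term} \mid \text{Activation}) =
  \frac{P(\text{Activation} \mid \text{Term})\, P(\text{Term})}
       {P(\text{Activation} \mid \text{Term})\, P(\text{Term})
        + P(\text{Activation} \mid \neg\text{Term})\, \bigl(1 - P(\text{Term})\bigr)}

% Ratio of the empirical priors for "memory" and "pain"
\frac{P(\text{memory})}{P(\text{pain})} \approx \frac{0.16}{0.035} \approx 4.6
```

The prior P(Term) appears in both the numerator and the denominator, so the posterior cannot be computed without committing to some value for it.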

In practice, the degree to which the data modulate our Neurosynth-based beliefs is not nearly as extreme as you might expect. In the first posterior probability map above (labeled “empirical prior”), what you can see are the posterior estimates for “pain” under the assumption that pain occurs in about 3.5% of all studies—which is the actual empirical frequency observed in the Neurosynth database. Notice that the very largest probabilities we ever see—located, incidentally, in the posterior insula, and not in the dACC—max out around 15 – 20%. This is not to be scoffed at; it means that observing activation in the posterior insula implies approximately a 5-fold increase in the likelihood of “pain” being present (relative to our empirical prior of 3.5%). Yet, in absolute terms, the probability of “pain” is still very low. Based on these data, no one in their right mind should, upon observing posterior insula activation (let alone dACC, where most voxels show a probability no higher than 10%), draw the reverse inference that pain is likely to be present.

To make it even clearer why this inference would be unsupportable, here are posterior probabilities for the same voxels as above, but now plotted for several other terms, in addition to pain:

Posterior probability maps (empirical prior assumed) for selected Neurosynth terms.

Notice how, in the bottom map (for ‘motor’, which occurs in about 18% of all studies in Neurosynth), the posterior probabilities in all of dACC are substantially higher than for ‘pain’, even though z-scores in most of dACC show the opposite pattern. For ‘working memory’ and ‘reward’, the posterior probabilities are in the same ballpark as for pain (mostly around 8 – 12%). And for ‘fear’, there are no voxels with posterior probabilities above 5% anywhere, because the empirical prior is so low (only 2% of Neurosynth studies).

What this means is that, if you observe activation in dACC—a region which shows large z-scores for “pain” and much lower ones for “motor”—your single best guess as to what process might be involved (of the five candidates in the above figure) should be ‘motor’ by a landslide. You could also guess ‘reward’ or ‘working memory’ with about the same probability as ‘pain’. Of course, the more general message you should take away from this is that it’s probably a bad idea to infer any particular process on the basis of observed activity, given how low the posterior probability estimates for most terms are going to be. Put simply, it’s a giant leap to go from these results—which clearly don’t license anyone to conclude that the dACC is a marker of any single process—to concluding that “the dACC is selective for pain” and that pain represents the best psychological characterization of dACC function.

As if this isn’t bad enough, we now need to add a further complication to the picture. The analysis above assumes we have a good prior for terms like “pain” and “memory”. In reality, we have no reason to think that the empirical estimates of term frequency we get out of Neurosynth are actually good reflections of the real world. For all we know, it could be that pain processing is actually 10 times as common as it appears to be in Neurosynth (i.e., that pain is severely underrepresented in fMRI studies relative to its occurrence in real-world human brains). If we use the empirical estimates from Neurosynth as our priors—with all of their massive between-term variation—then, as you saw above, the priors will tend to overwhelm our posteriors. In other words, no amount of activation in pain-related regions would ever lead us to conclude that a study is about a low-frequency term like pain rather than a high-frequency term like memory or vision.

For this reason, when I first built Neurosynth, my colleagues and I made the deliberate decision to impose a uniform (i.e., 50/50) prior on all terms displayed on the Neurosynth website. This approach greatly facilitates qualitative comparison of different terms; but it necessarily does so by artificially masking the enormous between-term variability in base rates. What this means is that when you see a posterior probability like 85% for pain in the dACC in the third row of the pain figure above, the right interpretation of this is “if you pretend that the prior likelihood of a study using the term pain is exactly 50%, then your posterior estimate after observing dACC activation should now be 85%”. Is this a faithful representation of reality? No. It most certainly isn’t. And in all likelihood, neither is the empirical prior of 3.5%. But the problem is, we have to do something; Bayes’ rule has to have priors to work with; it can’t just conjure into existence a conditional probability for a term (i.e., P(Term|Activation)) without knowing anything about its marginal probability  (i.e., P(Term)). Unfortunately, as you can see in the above figure, the variation in the posterior that’s attributable to the choice of prior will tend to swamp the variation that’s due to observed differences in brain activity.
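Here is a minimal sketch of how much the prior matters, using only Bayes’ rule in odds form. The function names are hypothetical and this is not the Neurosynth implementation; it takes the 85% uniform-prior posterior mentioned above as an illustrative input, backs out the implied likelihood ratio, and then asks what posterior the same evidence would produce under other priors.

```python
def likelihood_ratio(posterior, prior):
    """Recover P(A|Term) / P(A|not Term) from a posterior and the prior behind it."""
    posterior_odds = posterior / (1 - posterior)
    prior_odds = prior / (1 - prior)
    return posterior_odds / prior_odds


def posterior_under_prior(lr, prior):
    """Apply the same likelihood ratio to a different prior (Bayes' rule in odds form)."""
    posterior_odds = lr * prior / (1 - prior)
    return posterior_odds / (1 + posterior_odds)


# Illustrative input: an 85% posterior obtained under a uniform (50/50) prior.
lr = likelihood_ratio(posterior=0.85, prior=0.5)

# Same evidence, different priors: 1%, the 3.5% empirical rate, 50%, and 90%.
posteriors = {prior: posterior_under_prior(lr, prior) for prior in (0.01, 0.035, 0.5, 0.9)}

for prior, post in posteriors.items():
    print(f"prior = {prior:.3f} -> posterior = {post:.3f}")
```

Under these assumptions, the same evidence that yields 85% under a uniform prior yields roughly 17% under the 3.5% empirical prior, roughly 5% under a 1% prior, and roughly 98% under a 90% prior. The evidence (the likelihood ratio) never changes; only the prior does.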

The upshot is, if you come into a study thinking that ‘pain’ is 90% likely to be occurring, then Neurosynth is probably not going to give you much reason to revise that belief. Conversely, if your task involves strictly visual stimuli, and you know that there’s no sensory stimulation at all—so maybe you feel comfortable setting the prior on pain at 1%—then no pattern of activity you could possibly see is going to lead you to conclude that there’s a high probability of pain. This may not be very satisfying, but hey, that’s life.

The interesting thing about all this is that, no matter what prior you choose for any given term, the Neurosynth z-score will never change. That’s because the z-score is a frequentist measure of statistical association between term occurrence and voxel activation. All it tells us is that, if the null of no effect were true, the data we observe would be very unlikely. This may or may not be interesting (I would argue that it’s not, but that’s for a different post), but it certainly doesn’t license a reverse inference like “dACC activation suggests that pain is present”. To draw the latter claim, you have to use a Bayesian framework and pick some sensible priors. No priors, no reverse inference.

Now, as I noted in my last post, it’s important to maintain a pragmatic perspective. I’m obviously not suggesting that the z-score maps on Neurosynth are worthless. If one’s goal is just to draw weak qualitative inferences about brain-cognition relationships, I think it’s reasonable to use Neurosynth reverse inference z-score maps for that purpose. For better or worse, the vast majority of claims researchers make in cognitive neuroscience are not sufficiently quantitative that it makes much difference whether the probability of a particular term occurring given some observed pattern of activation is 24% or 58%. Personally, I would argue that this is to the detriment of the field; but regardless, the fact remains that if one’s goal is simply to say something like “we think that the temporoparietal junction is associated with biological motion and theory of mind,” or “evidence suggests that the parahippocampal cortex is associated with spatial navigation,” I don’t see anything wrong with basing that claim on Neurosynth z-score maps. In marked contrast, however, Neurosynth provides no license for saying much stronger things like “the dACC is selective for pain” or suggesting that one can make concrete reverse inferences about mental processes on the basis of observed patterns of brain activity. If the question we’re asking is “what are we entitled to conclude about the presence of pain when we observe significant activation in the dACC in a particular study?”, the simple answer is: almost nothing.

Let’s now reconsider L&E’s statement—and by extension, their entire argument for selectivity—in this light. L&E say that their goal is not to compare effect sizes for different terms, but rather “to see whether there [is] accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.” But what could this claim possibly mean, if not something like “we want to know whether it’s safe to infer the presence of pain given the presence of dACC activation?” How could this possibly be anything other than a statement about probability? Are L&E really saying that, given a sufficiently high z-score for dACC/pain, it would make no difference to them at all if the probability of pain given dACC activation was only 5%, even if there were plenty of other terms with much higher conditional probabilities? Do they expect us to believe that, in their 2003 social pain paper—where they drew a strong reverse inference that social pain shares mechanisms with physical pain based purely on observation of dACC activation (which, ironically, wasn’t even in pain-related areas of dACC)—it would have made no difference to their conclusion even if they’d known conclusively that dACC activation actually only reflects pain processing 5% of the time? Such a claim is absurd on its face.

Let me summarize this section by making the following points about Neurosynth. First, it’s possible to obtain almost any posterior probability for any term given activation in any voxel, simply by adjusting the prior probability of term occurrence. Second, a choice about the prior must be made; there is no “default” setting (well, there is on the website, but that’s only because I’ve already made the choice for you). Third, the choice of prior will tend to dominate the posterior—which is to say, if you’re convinced that there’s a high (or low) prior probability that your study involves pain, then observing different patterns of brain activity will generally not do nearly as much as you might expect to change your conclusions. Fourth, this is not a Neurosynth problem, it’s a reality problem. The fundamental fact of the matter is that we simply do not know with any reasonable certainty, in any given context, what the prior probability of a particular process occurring in our subjects’ heads is. Yet, without that, we have little basis for drawing any kind of reverse inference when we observe brain activity in a given study.

If all this makes you think, “oh, this seems like it would make it almost impossible in practice to draw meaningful reverse inferences in individual studies,” well, you’re not wrong.

L&E’s PNAS paper, and their reply to my last post, suggest that they don’t appreciate any of these points. The fact of the matter is that it’s impossible to draw any reverse inference about an individual study unless one is willing to talk about probabilities. L&E don’t seem to understand this, because if they did, they wouldn’t feel comfortable saying that they don’t care about effect sizes, and that z-scores provide adequate support for reverse inference claims. In fact, they wouldn’t feel comfortable making any claim about the dACC’s selectivity for pain relative to other terms on the basis of Neurosynth data.

I want to be clear that I don’t think L&E’s confusion about these issues is unusual. The reality is that many of these core statistical concepts—both frequentist and Bayesian—are easy to misunderstand, even for researchers who rely on them on a day-to-day basis. By no means am I excluding myself from this analysis; I still occasionally catch myself making similar slips when explaining what the z-scores and conditional probabilities in Neurosynth mean—and I’ve been thinking about these exact ideas in this exact context for a pretty long time! So I’m not criticizing L&E for failing to correctly understand reverse inference and its relation to Neurosynth. What I’m criticizing L&E for is writing an entire paper making extremely strong claims about functional selectivity based entirely on Neurosynth results, without ensuring that they understand the statistical underpinnings of the framework, and without soliciting feedback from anyone who might be in a position to correct their misconceptions. Personally, if I were in their position, I would move to retract the paper. But I have no control over that. All I can say is that it’s my informed opinion—as the creator of the software framework underlying all of L&E’s analyses—that the conclusions they draw in their paper are not remotely supported by any data that I’ve ever seen come out of Neurosynth.

On ‘strong’ vs. ‘weak’ selectivity

The other major problem with L&E’s paper, from my perspective, lies in their misuse of the term ‘selective’. In their response, L&E take issue with my criticism of their claim that they’ve shown the dACC to be selective for pain. They write:

Regarding the term selective, I suppose we could say there’s a strong form and a weak form of the word, with the strong form entailing further constraints on what constitutes an effect being selective. TY writes in his blog: “it’s one thing to use Neurosynth to support a loose claim like “some parts of the dACC are preferentially associated with pain”, and quite another to claim that the dACC is selective for pain, that virtually nothing else activates dACC”. The last part there gets at what TY thinks we mean by selective and what we would call the strong form of selectivity.

L&E respectively define these strong and weak forms of selectivity as follows:

Selectivity_strong: The dACC is selective for pain, if pain and only pain activates the dACC.

Selectivity_weak: The dACC is selective for pain, if pain is a more reliable source of dACC activation than the other terms of interest (executive, conflict, salience).

They suggest that I accused them of claiming ‘strong’ selectivity when they were really just making the much weaker claim that dACC activation is more strongly associated with pain than with other terms. I disagree with this characterization. I’ll come back to what I meant by ‘selective’ in a bit (I certainly didn’t assume anything like L&E’s strong definition). But first, let’s talk about L&E’s ‘weak’ notion of selectivity, which in my view is at odds with any common-sense understanding of what ‘selective’ means, and would have an enormously destructive effect on the field if it were to become widely used.

The fundamental problem with the suggestion that we can say dACC is pain-selective if “it’s a more reliable source of dACC activation than the other terms of interest” is that this definition provides a free pass for researchers to make selectivity claims about an extremely large class of associations, simply by deciding what is or isn’t of interest in any given instance. L&E claim to be “interested” in executive control, conflict, and salience. This seems reasonable enough; after all, these are certainly candidate functions that people have discussed at length in the literature. The problem lies with all the functions L&E don’t seem to be interested in: e.g., fear, autonomic control, or reward—three other processes that many researchers have argued the dACC is crucially involved in, and that demonstrably show robust effects in dACC in Neurosynth. If we take L&E’s definition of weak selectivity at face value, we find ourselves in the rather odd position of saying that one can use Neurosynth to claim that a region is “selective” for a particular function just as long as it’s differentiable from some other very restricted set of functions. Worse still, one apparently does not have to justify the choice of comparison functions! In their PNAS paper, L&E never explain why they chose to focus only on three particular ACC accounts that don’t show robust activation in dACC in Neurosynth, and ignored several other common accounts that do show robust activation.

If you think this is a reasonable way to define selectivity, I have some very good news for you. I’ve come up with a list of other papers that someone could easily write (and, apparently, publish in a high-profile journal) based entirely on results you can obtain from the Neurosynth website. The titles of these papers (and you could no doubt come up with many more) include:

  • “The TPJ is selective for theory of mind”
  • “The TPJ is selective for biological motion”
  • “The anterior insula is selective for inhibition”
  • “The anterior insula is selective for orthography”
  • “The VMPFC is selective for autobiographical memory”
  • “The VMPFC is selective for valuation”
  • “The VMPFC is selective for autonomic control”
  • “The dACC is selective for fear”
  • “The dACC is selective for autonomic control”
  • “The dACC is selective for reward”

These are all interesting-sounding articles that I’m sure would drum up considerable interest and controversy. And the great thing is, as long as you’re careful about what you find “interesting” (and you don’t have to explicitly explain yourself in the paper!), Neurosynth will happily support all of these conclusions. You just need to make sure not to include any comparison terms that don’t fit with your story. So, if you’re writing a paper about the VMPFC and valuation, make sure you don’t include autobiographical memory as a control. And if you’re writing about theory of mind in the TPJ, it’s probably best to not find biological motion interesting.

Now, you might find yourself thinking, “how could it make sense to have multiple people write different papers using Neurosynth, each one claiming that a given region is ‘selective’ for a variety of different processes? Wouldn’t that sort of contradict any common-sense understanding of what the term ‘selective’ means?” My own answer would be “yes, yes it would”. But L&E’s definition of “weak selectivity”—and the procedures they use in their paper—allow for multiple such papers to co-exist without any problem. Since what counts as an “interesting” comparison condition is subjective—and, if we take L&E’s PNAS example as a model, one doesn’t even need to explicitly justify the choices one makes—there’s really nothing stopping anyone from writing any of the papers I suggested above. Following L&E’s logic, a researcher who favored a fear-based account of dACC could simply select two or three alternative processes as comparison conditions—say, sustained attention and salience—do all of the same analyses L&E did (pretending for the moment that those analyses are valid, which they aren’t), and conclude that the dACC is selective for fear. It really is that easy.
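To make the problem concrete, here is a minimal sketch of the "weak selectivity" logic. The association scores below are entirely made up for illustration (they are not Neurosynth output), and the function name `weakly_selective` is hypothetical; the only point is that the verdict depends on which comparison terms one chooses to find interesting.

```python
# Hypothetical association scores for one region. NOT real Neurosynth values.
scores = {
    "pain": 6.0,
    "fear": 5.5,
    "autonomic": 5.0,
    "reward": 4.8,
    "conflict": 1.2,
    "salience": 0.9,
    "executive": 0.5,
    "sustained attention": 0.7,
}


def weakly_selective(target, comparisons, scores):
    """'Weak' selectivity: the target beats every hand-picked comparison term."""
    return all(scores[target] > scores[c] for c in comparisons)


# Pick only comparison terms that happen not to load on the region...
convenient = ["conflict", "salience", "executive", "sustained attention"]

# ...and every strongly associated term comes out "selective" at once.
winners = [
    term
    for term in ["pain", "fear", "autonomic", "reward"]
    if weakly_selective(term, convenient, scores)
]
print(winners)
```

With these invented numbers, all four terms pass the test against the convenient comparison set, while fear fails as soon as pain is added as a comparison. Nothing in the procedure itself tells you which comparison set is the right one.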

In reality, if L&E came across an article claiming that Neurosynth shows that the dACC is selective for fear, I doubt they’d say “well, I guess the dACC is selective for fear. Good to know.” I suspect they would (quite reasonably) take umbrage at the fear paper’s failure to include pain as a comparison condition in the analysis. Yet, by their own standards, they’d have no real basis for any complaint. The fear paper’s author could simply say, “pain’s not interesting to me,” and that would be that. No further explanation necessary.

Perhaps out of recognition that there’s something a bit odd about their definition of selectivity, L&E try to prime our intuition that their usage is consistent with the rest of the field. They point out that, in most experimental fMRI studies claiming evidence for selectivity, researchers only ever compare the target stimulus or process to a small number of candidates. For example, they cite a Haxby commentary on a paper that studied category specificity in visual cortex:

From Haxby (2006): “numerous small spots of cortex were found that respond with very high selectivity to faces. However, these spots were intermixed with spots that responded with equally high selectivity to the other three categories.”

Their point is that nobody expects ‘selective’ here to mean that the voxel in question responds to only that visual category and no other stimulus that could conceivably have been presented. In practice, people take ‘selective’ to mean “showed a greater response to the target category than to other categories that were tested”.

I agree with L&E that Haxby’s usage of the term ‘selective’ here is completely uncontroversial. The problem is, the study in question is a lousy analogy for L&E’s PNAS paper. A much better analogy would be a study that presented 10 visual categories to participants, but then made a selectivity claim in the paper’s title on the basis of a comparison between the target category and only 2 other categories, with no explanation given for excluding the other 7 categories, even though (a) some of those 7 categories were well known to also be associated with the same brain region, and (b) strong activation in response to some of those excluded categories was clearly visible in a supplementary figure. I don’t know about L&E, but I’m pretty sure that, presented with such a paper, the vast majority of cognitive neuroscientists would want to say something like, “how can you seriously be arguing that this part of visual cortex responds selectively to spheres, when you only compared spheres with faces and houses in the main text, and your supplemental figure clearly shows that the same region responds strongly to cubes and pyramids as well? Shouldn’t you maybe be arguing that this is a region specialized for geometric objects, if anything?” And I doubt anyone would be very impressed if the authors’ response to this critique was “well, it doesn’t matter what else we’re not focusing on in the paper. We said this region is sphere-selective, which just means it’s more selective than a couple of other stimulus categories people have talked about. Pyramids and cubes are basically interchangeable with spheres, right? What more do you want from us?”

I think it’s clear that there’s no basis for making a claim like “the dACC is selective for pain” when one knows full well that at least half a dozen other candidate functions all reliably activate the dACC. As I noted in my original post, the claim is particularly egregious in this case, because it’s utterly trivial to generate a ranked list of associations for over 3,000 different terms in Neurosynth. So it’s not even as if one needs to think very carefully about which conditions to include in one’s experiment, or to spend a lot of time running computationally intensive analyses. L&E were clearly aware that a bunch of other terms also activated dACC; they briefly noted as much in the Discussion of their paper. What they didn’t explain is why this observation didn’t lead them to seriously revise their framing. Given what they knew, there were at least two alternative articles they could have written that wouldn’t have violated common sense understanding of what the term ‘selective’ means. One might have been titled something like “Heterogeneous aspects of dACC are preferentially associated with pain, autonomic control, fear, reward, negative affect, and conflict monitoring”. The other might have been titled “the dACC is preferentially associated with X-related processes”—where “X” is some higher-order characterization that explains why all of these particular processes (and not others) are activated in dACC. I have no idea whether either of these papers would have made it through peer review at PNAS (or any other journal), but at the very least they wouldn’t have been flatly contradicted by Neurosynth results.

To be fair to L&E, while they didn’t justify their exclusion of terms like fear and autonomic control in the PNAS paper, they did provide some explanation in their reply to my last post. Here’s what they say:

TY criticizes us several times for not focusing on other accounts of the dACC including fear, emotion, and autonomic processes. We agree with TY that these kind of processes are relevant to dACC function. Indeed, we were writing about the affective functions of dACC (Eisenberger & Lieberman, 2004) when the rest of the field was saying that the dACC was purely for cognitive processes (Bush, Luu, & Posner, 2000). We have long posited that one of the functions of the dACC was to sound an alarm when certain kinds of conflict arise. We think the dACC is evoked by a variety of distress-related processes including pain, fear, and anxiety. As Eisenberger (2015) wrote: “Interestingly, the consistency with which the dACC is linked with fear and anxiety is not at odds with a role for this region in physical and social pain, as threats of physical and social pain are key elicitors of fear and anxiety.” And the outputs of this alarm process are partially autonomic in nature. Thus, we don’t think of fear and autonomic accounts as in opposition to the pain account, but rather in the same family of explanations. We think this class of dACC explanations stands in contrast to the cognitive explanations that we did compare to (executive, conflict, salience). Most of this, and what is said below, is discussed in Naomi Eisenberger’s (2015) Annual Review chapter.

Essentially, their response is: “it didn’t make sense for us to include fear or autonomic control, because these functions are compatible with the underlying role we think the dACC is playing in pain”. This is not compelling, for three reasons. First, it’s a bait-and-switch. L&E’s paper isn’t titled “the dACC is selective for a family of distress-related processes”, it’s titled “the dACC is selective for pain”. One cannot publish a paper purporting to show that the dACC is selective for pain, and arguing that pain is the single best psychological characterization of its role in cognition, and then, in a section of the Discussion that one admits is the “most speculative” part of the paper, essentially say, “just kidding–we don’t think it’s really doing pain per se, we think it’s a much more general set of functions. But we don’t have any real evidence for that.”

Second, it’s highly uncharitable for L&E to spontaneously lump alternative accounts of dACC function like fear/avoidance, autonomic control, and bodily orientation in with their general “distress-related” account, because proponents of many alternative views of dACC function have been very explicit in saying that they don’t view these functions as fundamentally affective (e.g., Vogt and colleagues view posterior dACC as a premotor region). While L&E may themselves believe that pain, fear, and autonomic control in dACC all reflect some common function, that’s an extremely strong claim that requires independent evidence, and is not something that they’re entitled to simply assume. A perfectly sensible alternative is that these are actually dissociable functions with only partially overlapping spatial representations in dACC. Since the terms themselves are distinct in Neurosynth, that should be L&E’s operating assumption until they provide evidence for their stronger claim that there’s some underlying commonality. Nothing about this conclusion simply falls out of the data in advance.

Third, let me reiterate the point I made above about L&E’s notion of ‘weak selectivity’: if we take at face value L&E’s claim that fear and autonomic control don’t need to be explicitly considered because they could be interpreted alongside pain under a common account, then they’re effectively conceding that it would have made just as much sense to publish a paper titled “the dACC is selective for fear” or “the dACC is selective for autonomic control” that relegated the analysis of the term “pain” to a supplementary figure. In the paper’s body, you would find repeated assertions that the authors have shown that autonomic control is the “best general psychological account of dACC function”. When pressed as to whether this was a reasonable conclusion, the authors would presumably defend their decision to ignore pain as a viable candidate by saying things like, “well, sure pain also activates the dACC; everyone knows that. But that’s totally consistent with our autonomic control account, because pain produces autonomic outputs! So we don’t need to consider that explicitly.”

I confess to some skepticism that L&E would simply accept such a conclusion without any objection.

Before moving on, let me come full circle and offer a definition of selectivity that I think is much more workable than either of the ones L&E propose, and is actually compatible with the way people use the term ‘selective’ more broadly in the field:

Selectivity (realistic): A brain region can be said to be ‘selective’ for a particular function if it (i) shows a robust association with that function, (ii) shows a negligible association with all other readily available alternatives, and (iii) the authors have done due diligence in ensuring that the major candidate functions proposed in the literature are well represented in their analysis.

Personally, I’m not in love with this definition. I think it still allows researchers to make claims that are far too strong in many cases. And it still allows for a fair amount of subjectivity in determining what gets to count as a suitable control—at least in experimental studies where researchers necessarily have to choose what kinds of conditions to include. But I think this definition is more or less in line with the way most cognitive neuroscientists expect each other to use the term. It captures the fact that most people would feel justifiably annoyed if someone reported a “selective” effect in one condition while failing to acknowledge that 4 other unreported conditions showed the same effect. And it also captures the notion that researchers should be charitable to each other: if I publish a paper claiming that the so-called fusiform ‘face’ area is actually selective for houses, based on a study that completely failed to include a face condition, no one is going to take my claim of house selectivity seriously. Instead, they’re going to conclude that I wasn’t legitimately engaging with other people’s views.
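For readers who like to see a definition operationalized, here is a rough sketch of the three criteria as a check. Everything specific in it is an assumption made for illustration: the function name `realistically_selective`, the `robust` and `negligible` cutoffs, and the scores themselves are invented, and no claim is made that any particular threshold is the right one.

```python
def realistically_selective(target, scores, major_candidates,
                            robust=3.0, negligible=1.5):
    """Check the three criteria of the realistic definition.

    scores: dict mapping each analyzed term to an association score.
    major_candidates: functions the literature proposes for this region.
    robust / negligible: hypothetical cutoffs, chosen only for illustration.
    """
    # (iii) Due diligence: every major candidate must be in the analysis.
    if any(c not in scores for c in major_candidates):
        return False
    # (i) The target function must show a robust association.
    if scores.get(target, 0.0) < robust:
        return False
    # (ii) Every other available alternative must be negligible.
    return all(s < negligible for term, s in scores.items() if term != target)


# Invented scores for two imaginary regions. NOT real data.
region_a = {"pain": 6.0, "fear": 5.5, "autonomic": 5.0, "conflict": 1.2}
region_b = {"faces": 7.0, "houses": 0.8, "objects": 1.0}

print(realistically_selective("pain", region_a, ["pain", "fear", "autonomic"]))
print(realistically_selective("faces", region_b, ["faces", "houses", "objects"]))
# Leaving a major candidate out of the analysis fails criterion (iii).
print(realistically_selective("houses", {"houses": 7.0}, ["faces", "houses"]))
```

In this toy setup, the first region fails because other terms are far from negligible, the second passes, and the third fails because a major candidate (faces) was never included, which is the scenario described in the paragraph above.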

In the context of Neurosynth—where one has 3,000 individual terms or several hundred latent topics at their disposal—this definition makes it very clear that researchers who want to say that a region is selective for something have an obligation to examine the database comprehensively, and not just to cherry-pick a couple of terms for analysis. That is what I meant when I said that L&E need to show that “virtually nothing else activates dACC”. I wasn’t saying that they have to show that no other conceivable process reliably activates the dACC (which would be impossible, as they observe), but simply that they need to show that no non-synonymous terms in the Neurosynth database do. I stand by this assertion. I see no reason why anyone should accept a claim of selectivity based on Neurosynth data if just a minute or two of browsing the Neurosynth website provides clear-cut evidence that plenty of other terms also reliably activate the same region.

To sum up, nothing L&E say in their paper gives us any reason to think that the dACC is selective for pain (even if we were to ignore all the problems with their understanding of reverse inference and allow them to claim selectivity based on inappropriate statistical tests). I submit that no definition of ‘selective’ that respects common sense usage of the term, and is appropriately charitable to other researchers, could possibly have allowed L&E to conclude that dACC activity is “selective” for pain when they knew full well that fear, autonomic control, and reward all also reliably activated the dACC in Neurosynth.

Everything else

Having focused on what I view as the two overarching issues raised by L&E’s reply, I now turn to comprehensively addressing each of their specific claims. As I noted at the outset, I recognize this is going to make for slow reading. But I want to make sure I address L&E’s points clearly and comprehensively, as I feel that they blatantly mischaracterized what I said in my original post in many cases. I don’t actually recommend that anyone read this entire section linearly. I’m writing it primarily as a reference—so that if you think there were some good points L&E made in their reply to my original post, you can find those points by searching for the quote, and my response will be directly below.

Okay, let’s begin.

Tal Yarkoni (hereafter, TY), the creator of Neurosynth, has now posted a blog (here) suggesting that pretty much all of our claims are either false, trivial, or already well-known. While this response was not unexpected, it’s disappointing because we love Neurosynth and think it’s a powerful tool for drawing exactly the kinds of conclusions we’ve drawn.

I’m surprised to hear that my response was not unexpected. This would seem to imply that L&E had some reason to worry that I wouldn’t approve of the way they were using Neurosynth, which leads me to wonder why they didn’t solicit my input ahead of time.

While TY is the creator of Neurosynth, we don’t think that means he has the last word when it comes to what is possible to do with it (nor does he make this claim). In the end, we think there may actually be a fair bit of agreement between us and TY. We do think that TY has misunderstood some of our claims (section 1 below) and failed to appreciate the significance and novelty of our actual claims (sections 2 and 4). TY also thinks we should have used different statistical analyses than we did, but his critique assumes we had a different question than the one we really had (section 5).

I agree that I don’t have the last word, and I encourage readers to consider both L&E’s arguments and mine dispassionately. I don’t, however, think that there’s a fair bit of agreement between us. Nor do I think I misunderstood L&E’s claims or failed to appreciate their significance or novelty. And, as I discuss at length both above and below, the problem is not that L&E are asking a different question than I think, it’s that they don’t understand that the methods they’re using simply can’t speak to the question they say they’re asking.

1. Misunderstandings (where we sort of probably agree)

We think a lot of the heat in TY’s blog comes from two main misunderstandings of what we were trying to accomplish. The good news (and we really hope it is good news) is that ultimately, we may actually mostly agree on both of these points once we get clear on what we mean. The two issues have to do with the use of the term “selective” and then why we chose to focus on the four categories we did (pain, executive, conflict, salience) and not others like fear and autonomic.

Misunderstanding #1: Selectivity. Regarding the term selective, I suppose we could say there’s a strong form and a weak form of the word…

I’ve already addressed this in detail at the beginning of this post, so I’ll skip the next few paragraphs and pick up here:

We mean this in the same way that Haxby and lots of others do. We never give a technical definition of selectivity in our paper, though in the abstract we do characterize our results as follows:

“Results clearly indicated that the best psychological description of dACC function was related to pain processing—not executive, conflict, or salience processing.”

Thus, the context of what comparisons our selectivity refers to is given in the same sentence, right up front in the abstract. In the end, we would have been just as happy if “selectivity” in the title was replaced with “preferentially activated”. We think this is what the weak form of selectivity entails and it is really what we meant. We stress again, we are not familiar with researchers who use the strong form of selectivity. TY’s blog is the first time we have encountered this and was not what we meant in the paper.

I strongly dispute L&E’s suggestion that the average reader will conclude from the above sentence that they’re clearly analyzing only 4 terms. Here’s the sentence in their abstract that directly precedes the one they quote:

Using Neurosynth, an automated brainmapping database [of over 10,000 functional MRI (fMRI) studies], we performed quantitative reverse inference analyses to explore the best general psychological account of the dACC function P(Ψ process | dACC activity).

It seems quite clear to me that the vast majority of readers are going to parse the title and abstract of L&E’s paper as implying a comprehensive analysis to find the best general psychological account of dACC function, and not “the best general psychological account if you only consider these 4 very specific candidates”. Indeed, I have trouble making any sense of the use of the terms “best” and “general” in this context, if what L&E meant was “a very restricted set of possibilities”. I’ll also note that in five minutes of searching the literature, I couldn’t find any other papers with titles or abstracts that make nearly as strong a claim about anterior cingulate function as L&E’s present claims about pain. So I reject the idea that their usage is par for the course. Still, I’m happy to give them the benefit of the doubt and accept that they truly didn’t realize that their wording might lead others to misinterpret their claims. I guess the good news is that, now that they’re aware of the potential confusion claims like this can cause, they will surely be much more circumspect in the titles and abstracts of their future papers.

Before moving on, we want to note that in TY’11 (i.e. the Yarkoni et al., 2011 paper announcing Neurosynth), the weak form of selectivity is used multiple times. In the caption for Figure 2, the authors refer to “regions in c were selectively associated with the term” when as far as we can tell, they are talking only about the comparison of three terms (working memory, emotion, pain). Similarly on p. 667 the authors write “However, the reverse inference map instead implicated the anterior prefrontal cortex and posterior parietal cortex as the regions that were most selectively activated by working memory tasks.” Here again, the comparison is to emotion and pain, and the authors are not claiming selectivity relative to all other psychological processes in the Neurosynth database. If it is fair for Haxby, Botvinick, and the eminent coauthors of TY’11 to use selectivity in this manner, we think it was fine for us as well.

I reject the implication of equivalence here. I think the scope of the selectivity claim I made in the figure caption in question is abundantly clear from the immediate context, and provides essentially no room for ambiguity. Who would expect, in a figure with 3 different maps, the term ‘selective’ to mean anything other than ‘for this one and not those two’? I mean, if L&E had titled their paper “pain preferentially activates the dACC relative to conflict, salience, or executive control”, and avoided saying that they were proposing the “best general account” of psychological function in dACC, I wouldn’t have taken issue with their use of the term ‘selective’ in their manuscript either, because the scope would have been equally clear. Conversely, if I had titled my 2011 paper “the dACC shows no selectivity for any cognitive process”, and said, in the abstract, something like “we show that there is no best general psychological function of the dACC–not pain, working memory, or emotion”, I would have fully expected to receive scorn from others.

That said, I’m willing to put my money where my mouth is. If a few people (say 5) write in to say (in the comments below, on Twitter, or by email) that they took the caption in Figure 2 of my 2011 paper to mean anything other than “of these 3 terms, only this one showed an effect”, I’ll happily send the journal a correction. And perhaps L&E could respond in kind by committing to changing the title of their manuscript to something like “the dACC is preferentially active for pain relative to conflict, salience or executive control” if 5 people write in to say that they interpreted L&E’s claims as being much more global than L&E suggest they are. I encourage readers to use the comments below to clarify how they understood both of these selectivity claims.

We would also point readers to the fullest characterization of the implication of our results on p. 15253 of the article:

“The conclusion from the Neurosynth reverse inference maps is unequivocal: The dACC is involved in pain processing. When only forward inference data were available, it was reasonable to make the claim that perhaps dACC was not involved in pain per se, but that pain processing could be reduced to the dACC’s “real” function, such as executive processes, conflict detection, or salience responses to painful stimuli. The reverse inference maps do not support any of these accounts that attempt to reduce pain to more generic cognitive processes.”

We think this claim is fully defensible and nothing in TY’s blog contradicts this. Indeed, he might even agree with it.

This claim does indeed seem to me largely unobjectionable. However, I’m at a loss to understand how the reader is supposed to know that this one very modest sentence represents “the fullest characterization” of the results in a paper replete with much stronger assertions. Is the reader supposed to, upon reading this sentence, retroactively ignore all of the other claims—e.g., the title itself, and L&E’s repeated claim throughout the paper that “the best psychological interpretation of dACC activity is in terms of pain processes”?

Misunderstanding #2: We did not focus on fear, emotion, and autonomic accounts. TY criticizes us several times for not focusing on other accounts of the dACC including fear, emotion, and autonomic processes. We agree with TY that these kind of processes are relevant to dACC function. Indeed, we were writing about the affective functions of dACC (Eisenberger & Lieberman, 2004) when the rest of the field was saying that the dACC was purely for cognitive processes (Bush, Luu, & Posner, 2000). We have long posited that one of the functions of the dACC was to sound an alarm when certain kinds of conflict arise. We think the dACC is evoked by a variety of distress-related processes including pain, fear, and anxiety. As Eisenberger (2015) wrote: “Interestingly, the consistency with which the dACC is linked with fear and anxiety is not at odds with a role for this region in physical and social pain, as threats of physical and social pain are key elicitors of fear and anxiety.” And the outputs of this alarm process are partially autonomic in nature. Thus, we don’t think of fear and autonomic accounts as in opposition to the pain account, but rather in the same family of explanations. We think this class of dACC explanations stands in contrast to the cognitive explanations that we did compare to (executive, conflict, salience). Most of this, and what is said below, is discussed in Naomi Eisenberger’s (2015) Annual Review chapter.

I addressed this in detail above, in the section on “selectivity”.

We speak to some but not all of this in the paper. On p. 15254, we revisit our neural alarm account and write “Distress-related emotions (“negative affect” “distress” “fear”) were each linked to a dACC cluster, albeit much smaller than the one associated with “pain”.” While we could have said more explicitly that pain is in this distress-related category, we have written about this several times before and assumed this would be understood by readers.

There is absolutely no justification for assuming this. The community of people who might find a paper titled “the dorsal anterior cingulate cortex is selective for pain” interesting is surely at least an order of magnitude larger than the community of people who are familiar with L&E’s previous work on distress-related emotions.

So why did we focus on executive, conflict, and salience? Like most researchers, we are the products of our early (academic) environment. When we were first publishing on social pain, we were confused by the standard account of dACC function. A half century of lesion data and a decade of fMRI studies of pain pointed towards more evidence of the dACC’s involvement in distress-related emotions (pain & anxiety), yet every new paper about the dACC’s function described it in cognitive terms. These cognitive papers either ignored all of the pain and distress findings for dACC or they would redescribe pain findings as reducible to or just an instance of something more cognitive.

When we published our first social pain paper, the first rebuttal paper suggested our effects were really just due to “expectancy violation” (Somerville et al., 2006), an account that was later invalidated (Kawamoto 2012). Many other cognitive accounts have also taken this approach to physical pain (Price 2000; Vogt, Derbyshire, & Jones, 2006).

Thus for us, the alternative to pain accounts of dACC all these years were conflict detection and cognitive control explanations. This led to the focus on the executive and conflict-related terms. In more recent years, several papers have attempted to explain away pain responses in the dACC as nothing more than salience processes (e.g Iannetti’s group) that have nothing to do with pain, and so salience became a natural comparison as well. We haven’t been besieged with papers saying that pain responses in the dACC are “nothing but” fear or “nothing but” autonomic processes, so those weren’t the focus of our analyses.

This is an informative explanation of L&E’s worldview and motivations. But it doesn’t justify ignoring numerous alternative accounts whose proponents very clearly don’t agree with L&E that their views can be explained away as “distress-related”. If L&E had written a paper titled “salience is not a good explanation of dACC function,” I would have happily agreed with their conclusion here. But they didn’t. They wrote a paper explicitly asserting that pain is the best psychological characterization of the dACC. They’re not entitled to conclude this unless they compare pain properly with a comprehensive set of other possible candidates, not just the ones that make pain look favorable.

We want to comment further on fear specifically. We think one of the main reasons that fear shows up in the dACC is because so many studies of fear use pain manipulations (i.e. shock administration) in the process of conditioning fear responses. This is yet another reason that we were not interested in contrasting pain and fear maps. That said, if we do compare the Z-scores in the same eight locations we used in the PNAS paper, the pain effect has more accumulated evidence than fear in all seven locations where there is any evidence for pain at all.

This is a completely speculative account, and no evidence is provided for it. Worse, it’s completely invertible: one could just as easily say that pain shows up in the dACC because it invariably produces fear, or because it invariably elicits autonomic changes (frankly, it seems more plausible to me that pain almost always generates fear than that fear is almost always elicited by pain). There’s no basis for ruling out these other candidate functions a priori as being more causally important. This is simply question-begging.

Its interesting to us that TY does not in principle seem to like us trying to generate some kind of unitary account of dACC writing “There’s no reason why nature should respect our human desire for simple, interpretable models of brain function.” Yet, TY then goes on to offer a unitary account more to his liking. He highlights Vogt’s “four-region” model of the cingulate writing “I’m especially partial to the work of Brent Vogt…”. In Vogt’s model, the aMCC appears to be largely the same region as what we are calling dACC. Although the figure shown by TY doesn’t provide anatomical precision, in other images, Vogt shows the regions with anatomical boundaries. Rotge et al. (2015) used such an image from Vogt (2009) to estimate the boundaries of aMCC as spanning 4.5 ≤ y ≤ 30 which is very similar to our dACC anterior/posterior boundaries of 0 ≤ y ≤ 30) (see Figure below). Vogt ascribes the function of avoidance behavior to this region – a pretty unitary description of the region that TY thinks we should avoid unitary descriptions of.

There is no charitable way to put it: this is nothing short of a gross misrepresentation of what I said about the Vogt account. As a reminder, here’s what I actually wrote in my post:

I’m especially partial to the work of Brent Vogt and colleagues (e.g., Vogt (2005); Vogt & Sikes, 2009), who have suggested a division within the anterior mid-cingulate cortex (aMCC; a region roughly co-extensive with the dACC in L&E’s nomenclature) between a posterior region involved in bodily orienting, and an anterior region associated with fear and avoidance behavior (though the two functions overlap in space to a considerable degree) … the Vogt characterization of dACC/aMCC … fits almost seamlessly with the Neurosynth results displayed above (e.g., we find MCC activation associated with pain, fear, autonomic, and sensorimotor processes, with pain and fear overlapping closely in aMCC). Perhaps most importantly, Vogt and colleagues freely acknowledge that their model—despite having a very rich neuroanatomical elaboration—is only an approximation. They don’t attempt to ascribe a unitary role to aMCC or dACC, and they explicitly recognize that there are distinct populations of neurons involved in reward processing, response selection, value learning, and other aspects of emotion and cognition all closely interdigitated with populations involved in aspects of pain, touch, and fear. Other systems-level neuroanatomical models of cingulate function share this respect for the complexity of the underlying circuitry—complexity that cannot be adequately approximated by labeling the dACC simply as a pain region (or, for that matter, a “survival-relevance” region).

I have no idea how L&E read this and concluded that I was arguing that we should simply replace the label “pain” with “fear”. I don’t feel the need to belabor the point further, because I think what I wrote is quite clear.

In the end though, if TY prefers a fear story to our pain story, we think there is some evidence for both of these (a point we make in our PNAS paper). We think they are in a class of processes that overlap both conceptually (i.e. distress-related emotions) and methodologically (i.e. many fear studies use pain manipulations to condition fear).

No, I don’t prefer a fear story. My view (which should be abundantly clear from the above quote) is that both a fear story and a pain story would be gross oversimplifications that shed more heat than light. I will, however, reiterate my earlier point (which L&E never responded to), which is that their PNAS paper provides no reason at all to think that the dACC is involved in distress-related emotion (indeed, they explicitly said that this was the most speculative part of the paper). If anything, the absence of robust dACC activation for terms like ‘disgust’, ‘emotion’, and ‘social’ would seem to me like pretty strong evidence against a simplistic model of this kind. I’m not sure why L&E are so resistant to the idea that maybe, just maybe, the dACC is just too big a region to attach a single simple label to. As far as I can tell, they provide no defense of this assumption in either their paper or their reply.

After focusing on potential misunderstandings we want to turn to our first disagreement with TY. Near the end of his blog, TY surprised us by writing that the following conclusions can be reasonably drawn from Neurosynth analyses:

* “There are parts of dACC (particularly the more posterior aspects) that are preferentially activated in studies involving painful stimulation.”
* “It’s likely that parts of dACC play a greater role in some aspect of pain processing than in many other candidate processes that at various times have been attributed to dACC (e.g., monitoring for cognitive conflict)”

Our first response was ‘Wow. After pages and pages of criticizing our paper, TY pretty much agrees with what we take to be the major claims of our paper. Yes, his version is slightly watered down from what we were claiming, but these are definitely in the ballpark of what we believe.’

L&E omitted my third bullet point here, which was that “Many of the same regions of dACC that preferentially activate during pain are also preferentially activated by other processes or tasks—e.g., fear conditioning, autonomic arousal, etc.” I’m not sure why they left it out; they could hardly disagree with it either, if they want to stand by their definition of “weak selectivity”.

I’ll leave it to you to decide whether or not my conclusions are really just “watered down” versions “in the ballpark” of the major claims L&E make in their paper.

But then TY’s next statement surprised us in a different sort of way. He wrote

“I think these are all interesting and potentially important observations. They’re hardly novel…”.

We’ve been studying the dACC for more than a decade and wondered what he might have meant by this. We can think of two alternatives for what he might have meant:

* That L&E and a small handful of others have made this claim for over a decade (but clearly not with the kind of evidence that Neurosynth provides).

* That TY already used Neurosynth in 2011 to show this. In the blog, he refers to this paper writing “We explicitly noted that there is preferential activation for pain in dACC”.

I’m not sure what was confusing about what I wrote. Let’s walk through the three bullet points. The first one is clearly not novel. We’ve known for many years that many parts of dACC are preferentially active when people experience painful stimulation. As I noted in my last post, L&E explicitly appealed to this literature over a decade ago in their 2003 social pain paper. The second one is also clearly not novel. For example, Vogt and colleagues (among others) have been arguing for at least two decades now that the posterior aspects of dACC support pain processing in virtue of their involvement in processes (e.g., bodily orientation) that clearly preclude most higher cognitive accounts of dACC. The third claim isn’t novel either, as there has been ample evidence for at least a decade now that virtually every part of dACC that responds to painful stimulation also systematically responds to other non-nociceptive stimuli (e.g., the posterior dACC responds to non-painful touch, the anterior to reward, etc.). I pointed to articles and textbooks comprehensively reviewing this literature in my last post. So I don’t understand L&E’s surprise. Which of these three claims do they think is actually novel to their paper?

In either case, “they’re hardly novel” implies this is old news and that everyone knows and believes this, as if we’re claiming to have discovered that most people have two eyes, a nose, and a mouth. But this implication could not be further from the truth.

No, that’s not what “hardly novel” implies. I think it’s fair to say that the claim that social pain is represented in the dACC in virtue of representations shared with physical pain is also hardly novel at this point, yet few people appear to know and believe it. I take ‘hardly novel’ to mean “it’s been said before multiple times in the published literature.”

There is a 20+ year history of researchers ignoring or explaining away the role of pain processing in dACC.

I’ll address the “explained away” part of this claim below, but it’s completely absurd to suggest that researchers have ignored the role of pain processing in dACC for 20 years. I don’t think I can do any better than link to Google Scholar, where the reader is invited to browse literally hundreds of articles that all take it as an established finding that the dACC is important for pain processing (and many of which have hundreds of citations from other articles).

When pain effects are mentioned in most papers about the function of dACC, it is usually to say something along the lines of ‘Pain effects in the dACC are just one manifestation of the broader cognitive function of conflict detection (or salience or executive processes)’. This long history is indisputable. Here are just a few examples (and these are all reasonable accounts of dACC function in the absence of reverse inference data):

* Executive account: Price’s 2000 Science paper on the neural mechanisms of pain assigns to the dACC the roles of “directing attention and assigning response priorities”
* Executive account: Vogt et al. (1996) says the dACC “is not a ‘pain centre’” and “is involved in response selection” and “response inhibition or visual guidance of responses”
* Conflict account: Botvinick et al. (2004) wrote that “the ACC might serve to detect events or internal states indicating a need to shift the focus of attention or strengthen top-down control ([4], see also [20]), an idea consistent, for example, with the fact that the ACC responds to pain ” (Botvinick et al. 2004)
* Salience account: Iannetti suggests the ‘pain matrix’ is a myth and in Legrain et al. (2011) suggests that the dACC’s responses to pain “could mainly reflect brain processes that are not directly related to the emergence of pain and that can be engaged by sensory inputs that do not originate from the activation of nociceptors.”

I’m not really sure what to make of this argument either. All of these examples clearly show that even proponents of other theories of dACC function are well aware of the association with pain, and don’t dispute it in any way. So L&E’s objection can’t be that other people just don’t believe that the dACC supports pain processing. Instead, L&E seem to dislike the idea that other theorists have tried to “explain away” the role of dACC in pain by appealing to other mechanisms. Frankly, I’m not sure what the alternative to such an approach could possibly be. Unless L&E are arguing that dACC is the neural basis of an integrated, holistic pain experience (whatever such a thing might mean), there presumably must be some specific computational operations going on within dACC that can be ascribed a sensible mechanistic function. I mean, even L&E themselves don’t take the dACC to be just about, well, pain. Their whole “distress-related emotion” story is itself intended to explain what it is that dACC actually does in relation to pain (since pretty much everyone accepts that the sensory aspects of pain aren’t coded in dACC).

The only way I can make sense of this “explained away” concern is if what L&E are actually objecting to is the fact that other researchers have disagreed with or ignored their particular story about what the dACC does in pain (i.e., L&E’s view that the dACC role in pain is derived from distress-related emotion). As best I can tell, what bothers them is that other researchers fundamentally disagree with–and hence, don’t cite–their “distress-related emotion” account. Now, maybe this irritation is justified, and there’s actually an enormous amount of evidence out there in favor of the distress account that other researchers are willfully ignoring. I’m not qualified to speak to that (though I’m skeptical). What I do feel qualified to say is that none of the Neurosynth results L&E present in their paper make any kind of case for an affective account of pain processing in dACC. The most straightforward piece of evidence for that claim would be if there were a strong overlap between pain and negative affect activations in dACC. But we just don’t see this in Neurosynth. As L&E themselves acknowledge, the peak sectors of pain-related activation in dACC are in mid-to-posterior dACC, and affect-related terms only seem to reliably activate the most anterior aspects.

To be charitable to L&E, I do want to acknowledge one valuable point that they contribute here, which is that it’s clear that dACC function cannot be comprehensively explained by, say, a salience account or a conflict monitoring account. I think that’s a nice point (though I gather that some people who know much more about anatomy than I do are in the process of writing rebuttals to L&E that argue it’s not as nice as I think it is). The problem is, this argument can be run both ways. Meaning, much as L&E do a nice job showing that conflict monitoring almost certainly can’t explain activations in posterior dACC, the very maps they show make it clear that pain can’t explain all the other activations in anterior dACC (for reward, emotion, etc.). Personally, I think the sensible conclusion one ought to take away from all this is “it’s really complicated, and we’re not going to be able to neatly explain away all of dACC function with a single tidy label like ‘pain’.” L&E draw a different conclusion.

But perhaps this approach to dACC function has changed in light of TY’11 findings (i.e. Yarkoni et al. 2011). There he wrote “For pain, the regions of maximal pain-related activation in the insula and DACC shifted from anterior foci in the forward analysis to posterior ones in the reverse analysis.” This hardly sounds like a resounding call for a different understanding of dACC that involves an appreciation of its preferential involvement in pain.

Right. It wasn’t a resounding call for a different understanding of dACC, because it wasn’t a paper about the dACC—a brain region I lack any deep interest in or knowledge of—it was a paper about Neurosynth and reverse inference.

Here are quotes from other papers showing how they view the dACC in light of TY’11:

* Poldrack (2012) “The striking insight to come from analyses of this database (Yarkoni et al., in press) is that some regions (e.g., anterior cingulate) can show high degrees of activation in forward inference maps, yet be of almost no use for reverse inference due to their very high base rates of activation across studies”
* Chang, Yarkoni et al. (2012) “the ACC tends to show substantially higher rates of activation than other regions in neuroimaging studies (Duncan and Owen 2000; Nelson et al. 2010; Yarkoni et al. 2011), which has lead some to conclude that the network is processing goal-directed cognition (Yarkoni et al. 2009)”
* Atlas & Wager (2012) “In fact, the regions that are reliably modulated (insula, cingulate, and thalamus) are actually not specific to pain perception, as they are activated by a number of processes such as interoception, conflict, negative affect, and response inhibition”

I won’t speak for papers I’m not an author on, but with respect to the quote from the Chang et al. paper, I’m not sure what L&E’s point actually is. In Yarkoni et al. (2009), I argued that “effort” might be a reasonable generic way to characterize the ubiquitous role of the frontoparietal “task-positive” network in cognition. I mistakenly called the region in question ‘dACC’ when I should have said ‘preSMA’. I already gave L&E deserved credit in my last post for correcting my poor knowledge of anatomy. But I would think that, if anything, the fact that I was routinely confusing these terms circa 2011 should lead L&E to conclude that maybe I don’t know or care very much about the dACC, and not that I’m a proud advocate for a strong theory of dACC function that many other researchers also subscribe to. I think L&E give me far too much credit if they think that my understanding of the dACC in 2011 (or, for that matter, now) is somehow representative of the opinions of experts who study that region.

Perhaps the reason why people who cite TY’11 in their discussion of dACC didn’t pay much attention to the above quote from TY’11 (“For pain, the regions of maximal pain-related…”) was because they read and endorsed the following more direct conclusion that followed “…because the dACC is activated consistently in all of these states [cognitive control, pain, emotion], its activation may not be diagnostic of any one of them” (bracketed text added). If this last quote is taken as TY’11’s global statement regarding dACC function, then it strikes us still as quite novel to assert that the dACC is more consistently associated with one category of processes (pain) than others (executive, conflict, and salience processes).

I don’t think TY’11 makes any ‘global statement regarding dACC function’, because TY’11 was a methodological paper about the nature of reverse inference, not a paper about grand models of dACC function. As for the quote L&E reproduce, here’s the full context:

These results showed that without the ability to distinguish consistency from selectivity, neuroimaging data can produce misleading inferences. For instance, neglecting the high base rate of DACC activity might lead researchers in the areas of cognitive control, pain and emotion to conclude that the DACC has a key role in each domain. Instead, because the DACC is activated consistently in all of these states, its activation may not be diagnostic of any one of them and conversely, might even predict their absence. The NeuroSynth framework can potentially address this problem by enabling researchers to conduct quantitative reverse inference on a large scale.

I stand by everything I said here, and I’m not sure what L&E object to. It’s demonstrably true if you look at Figure 2 in TY’11 that pain, emotion, and cognitive control all robustly activate the dACC in the forward inference map, but not in the reverse inference maps. The only sense I can make of L&E’s comment is if they’re once again conflating z-scores with probabilities, and assuming that the presence of significant activation for pain means that dACC is in fact diagnostic for pain. But, as I showed much earlier in this post, that would betray a very deep misunderstanding of what the reverse inference maps generated by Neurosynth mean. There is absolutely no basis for concluding, in any individual study, that people are likely to be perceiving pain just because the dACC is active.
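To make the difference between a z-score and a probability concrete, here is a minimal sketch. It is not Neurosynth's actual code: the study counts are invented, the test is a plain pooled two-proportion z-test used as a stand-in, and the uniform prior of 0.5 is an assumption. It compares two hypothetical terms whose studies activate a voxel at exactly the same rates, with one term simply having ten times as many studies.

```python
import math

def reverse_inference_stats(n_term, n_other, act_term, act_other, prior=0.5):
    """Return (z, posterior) for one voxel.

    n_term / n_other: number of studies with / without the term (hypothetical).
    act_term / act_other: how many of those studies report activation at the voxel.
    """
    p_act_term = act_term / n_term      # P(activation | term)
    p_act_other = act_other / n_other   # P(activation | no term)

    # Pooled two-proportion z-test (a simplification chosen for illustration).
    pooled = (act_term + act_other) / (n_term + n_other)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_term + 1 / n_other))
    z = (p_act_term - p_act_other) / se

    # Bayes' rule with an assumed prior P(term) = prior.
    posterior = (p_act_term * prior) / (
        p_act_term * prior + p_act_other * (1 - prior)
    )
    return z, posterior

# Same activation rates (20% vs. 10%), ten times more studies in the second case.
z_small, post_small = reverse_inference_stats(50, 1_000, 10, 100)
z_large, post_large = reverse_inference_stats(500, 10_000, 100, 1_000)

print(f"small sample: z = {z_small:.2f}, P(term | activation) = {post_small:.3f}")
print(f"large sample: z = {z_large:.2f}, P(term | activation) = {post_large:.3f}")
```

Under these assumptions the posterior probability is identical in both cases, while the z-score grows with the square root of the sample size. A larger z-score therefore reflects more accumulated evidence that an association is non-zero, not a stronger association.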

In the article, we showed forward and reverse inference maps for 21 terms and then another 9 in the supplemental materials. These are already crowded, busy figures and so we didn’t have room to show multiple slices for each term. Fortunately, since Neurosynth is easily accessible (go check it out now at neurosynth.org – it’s awesome!) you can look at anything we didn’t show you in the paper. Tal takes us to task for this.

He then shows a bunch of maps from x=-8 to x=+8 on a variety of terms. Many of these terms weren’t the focus of our paper because we think they are in the same class of processes as pain (as noted above). So it’s no surprise to us that terms such as ‘fear,’ ‘empathy,’ and ‘autonomic’ produce dACC reverse inference effects. In the paper, we reported that ‘reward’ does indeed produce reverse inference effects in the anterior portion of the dACC (and show the figure in the supplemental materials), so no surprise there either. Then at the bottom he shows cognitive control, conflict, and inhibition which all show very modest footprints in dACC proper, as we report in the paper.

Once again: L&E are not entitled to exclude a large group of viable candidate functions from their analysis simply because they believe that they’re “in the same class of [distress-related affect] processes” (a claim that many people, including me, would dispute). If proponents of the salience monitoring view wrote a Neurosynth-based paper neglecting to compare salience with pain because “pain is always salient, so it’s in the same class of salience-related processes”, I expect that L&E would not be very happy about it. They should show others the same charity they themselves would expect.

But in any case, if it’s not surprising to L&E that reward, fear, and autonomic control all activate the dACC, then I’m at a loss to understand why they didn’t title the paper something like “the dACC is selectively involved in pain, reward, fear, and autonomic control”. That would have much more accurately represented the results they report, and would be fully consistent with their notion of “weak selectivity”.

There are two things that make the comparison of what he shows and what we reported in the paper not a fair comparison. First, his maps are thresholded at p<.001 and yet all the maps that we report use Neurosynth’s standard, more conservative, FDR criterion of p<.01 (a standard TY literally set). Here, TY is making a biased, apples-to-oranges comparison by juxtaposing the maps at a much more liberal threshold than what we did. Given that each of the terms we were interested in (pain, executive, conflict, salience) had more than 200 studies in the database, it’s not clear why TY moved from FDR to uncorrected maps here.

The reason I used a threshold of p < .001 for this analysis is because it’s what L&E themselves used:

In addition, we used a threshold of Z > 3.1, P < 0.001 as our threshold for indicating significance. This threshold was chosen instead of Neurosynth’s more strict false discovery rate (FDR) correction to maximize the opportunity for multiple psychological terms to “claim” the dACC.

This is a sensible thing to do here, because L&E are trying to accept the null of no effect (or at least, it’s more sensible than applying a standard, conservative correction). Accepting the null hypothesis because an effect fails to achieve significance is the cardinal sin of null hypothesis significance testing, so there’s no real justification for doing what L&E are trying to do. But if you are going to accept the null, it at least behooves you to use a very liberal threshold for your analysis. I’m not sure why it’s okay for L&E to use a threshold of p < .001 but not for me to do the same (and for what it’s worth, I think p < .001 is still an absurdly conservative cut-off given the context).
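For readers who want to check the numbers, the following sketch shows how a Z threshold maps onto a one-tailed p-value, and why failing to cross that threshold says little about whether an effect exists. The `power` helper and the "expected z of 2.0" scenario are illustrative assumptions, not anything taken from L&E's paper or from Neurosynth.

```python
import math

def one_tailed_p(z):
    """Upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def power(expected_z, threshold):
    """Chance that an observed z exceeds the threshold, assuming the
    observed z is normally distributed around expected_z with unit variance."""
    return one_tailed_p(threshold - expected_z)

threshold = 3.1
print(f"Z > {threshold} corresponds to one-tailed p < {one_tailed_p(threshold):.6f}")

# Hypothetical real-but-modest effect whose expected z-score is 2.0.
detect = power(2.0, threshold)
print(f"Chance of detecting it at Z > {threshold}: {detect:.3f}")
print(f"Chance of missing it: {1 - detect:.3f}")
```

In this hypothetical scenario a genuine effect would fail to reach the threshold most of the time, which is why a non-significant map cannot be read as evidence that a term has no association with a region.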

Second, the Neurosynth database has been updated since we did our analyses. The number of studies in the database has only increased by about 5% (from 10,903 to 11,406 studies) and yet there are some curious changes. For instance, fear shows more robust dACC now than it did a few months ago even though it only increased from 272 studies to 298 studies.

Although the number of studies has nominally increased by only 5%, this actually reflects the removal of around 1,000 studies as a result of newer quality control heuristics, and the addition of around 1,500 new studies. So it should not be surprising if there are meaningful differences between the two. In any case, it seems odd for L&E to use the discrepancy between old and new versions of the database as a defense of their findings, given that the newer results are bound to be more accurate. If L&E accept that there’s a discrepancy, perhaps what they should be saying is “okay, since we used poorer data for our analyses than what Neurosynth currently contains, we should probably re-run our analyses and revise our conclusions accordingly”.
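A quick back-of-the-envelope check shows how much turnover hides behind the 5% figure. The removal and addition counts below are the round "around 1,000" and "around 1,500" figures from the paragraph above, treated as exact for illustration only.

```python
old_total = 10_903   # studies in the database L&E used
new_total = 11_406   # studies in the current database
removed = 1_000      # approximate, taken as a round number
added = 1_500        # approximate, taken as a round number

net_change = new_total - old_total
net_pct = 100 * net_change / old_total

# Share of the old database that is gone, and share of the new one that is new.
pct_old_removed = 100 * removed / old_total
pct_new_added = 100 * added / new_total

print(f"net change: {net_change} studies ({net_pct:.1f}%)")
print(f"old studies removed: {pct_old_removed:.1f}% of the old database")
print(f"new studies added: {pct_new_added:.1f}% of the current database")
```

With these round numbers, roughly one in eleven of the old studies is gone and roughly one in eight of the current studies is new, so the two versions differ considerably more than the net figure suggests.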

We were more surprised to discover that the term ‘rejection’ has been removed from the Neurosynth database altogether such that it can no longer be used as a term to generate forward and reverse inference maps (even though it was in the database prior to the latest update).

This claim is both incorrect and mildly insulting. It’s incorrect because the term “rejection” hasn’t been in the online Neurosynth database for nearly two years, and was actually removed three updates ago. And it’s mildly insulting, because all L&E had to do to verify the date at which rejection was removed, as well as understand why, was visit the Neurosynth data repository and inspect the different data releases. Failing that, they could have simply asked me for an explanation, instead of intimating that there are “curious” changes. So let me take this opportunity to remind L&E and other readers that the data displayed on the Neurosynth website are always archived on GitHub. If you don’t like what’s on the website at any given moment, you can always reconstruct the database based on an earlier snapshot. This can be done in just a few lines of Python code, as the IPython notebook I linked to last time illustrates.

As to why the term “rejection” disappeared: in April 2014, I switched from a manually curated set of 525 terms (which I had basically picked entirely subjectively) to the more comprehensive and principled approach of including all terms that passed a minimum frequency threshold (i.e., showing up in at least 60 unique article abstracts). The term “rejection” was not frequent enough to survive. I don’t make decisions about individual terms on a case-by-case basis (well, not since April 2014, anyway), and I certainly hope L&E weren’t implying that I pulled the ‘rejection’ term in response to their paper or any of their other work, because, frankly, they would be giving themselves entirely too much credit.
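The frequency rule described above can be sketched in a few lines. This is not the actual Neurosynth pipeline: the tokenizer, the function name `terms_passing_threshold`, and the toy abstracts are all assumptions made for illustration. The one point it is meant to show is that a term is counted at most once per abstract, so a term repeated many times in a handful of abstracts still fails the cut-off.

```python
import re
from collections import Counter

def terms_passing_threshold(abstracts, min_abstracts=60):
    """Return the set of terms appearing in at least min_abstracts unique abstracts."""
    doc_freq = Counter()
    for abstract in abstracts:
        # set() ensures each abstract contributes at most one count per term.
        unique_terms = set(re.findall(r"[a-z]+", abstract.lower()))
        doc_freq.update(unique_terms)
    return {term for term, n in doc_freq.items() if n >= min_abstracts}

# Toy corpus: "pain" appears in 70 abstracts; "rejection" appears twice in each
# of only 33 abstracts (66 mentions in total, but just 33 unique abstracts).
toy_abstracts = ["pain and fear"] * 70 + ["social rejection rejection"] * 33

kept = terms_passing_threshold(toy_abstracts, min_abstracts=60)
print(sorted(kept))
```

Because the rule is applied uniformly to every term, no individual term is ever kept or dropped by hand.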

Anyway, since L&E seem concerned with the removal of ‘rejection’ from Neurosynth, I’m happy to rectify that for them. Here are two maps for the term “rejection” (both thresholded at voxel-wise p < .001, uncorrected):

Meta-analysis of “rejection” in Neurosynth (database version of May 2013, 33 studies).
Meta-analysis of “rejection” in Neurosynth (current database version, 58 studies).

The first map is from the last public release (March 2013) that included “rejection” as a feature, and is probably what L&E remember seeing on the website (though, again, it hasn’t been online since 2014). It’s based on 33 studies. The second map is the current version of the map, based on 52 studies. The main conclusion I personally would take away from both of these maps is that there’s not enough data here to say anything meaningful, because they’re both quite noisy and based on a small number of studies. This is exactly why I impose a frequency cut-off for all terms I put online.

That said, if L&E would like to treat these “rejection” analyses as admissible evidence, I think it’s pretty clear that these maps actually weigh directly against their argument. In both cases, we see activation in pain-related areas of dACC for the forward inference analysis but not for the reverse. Interestingly, we do see activation in the most anterior part of dACC in both cases. This seems to me entirely consistent with the argument many people have made that subjective representations of emotion (including social pain) are to be found primarily in anterior medial frontal cortex, and that posterior dACC activations for pain have much more to do with motor control, response selection, and fear than with anything affective.

Given that Neurosynth is practically a public utility and federally funded, it would be valuable to know more about the specific procedures that determine which journals and articles are added to the database and on what schedule. Also, what are the conditions that can lead to terms being removed from the database and what are the set of terms that were once included that have now been removed.

I appreciate L&E’s vote of confidence (indeed, I wish that I believed Neurosynth could do half of what they claim it can do). As I’ve repeatedly said in this post and the last one, I’m happy to answer any questions L&E have about Neurosynth methods (preferably on the mailing list, which is publicly archived and searchable). But to date, they haven’t asked me any. I’ll also reiterate that it would behoove L&E to check the data repository on GitHub (which is linked to from the neurosynth.org portal) before they conclude that the information they want isn’t already publicly accessible (because most of it is).

In any event, we did not cherry pick data. We used the data that was available to us as of June 2015 when we wrote the paper. For the four topics of interest, below we provide more representative views of the dACC, thresholded as typical Neurosynth maps are, at FDR p<.01. We’ve made the maps nice and big so you can see the details and have marked in green the dACC region on the different slices (the coronal slice are at y=14 and y=22). When you look at these, we think they tell the same story we told in the paper.

I’m not sure what the point here is. I was not suggesting that L&E were lying; I was arguing that (a) visual inspection of a few slices is no way to make a strong argument about selectivity; (b) the kinds of analyses L&E report are a statistically invalid way to draw the conclusion they are trying to draw, and (c) even if we (inappropriately) use L&E’s criteria, analyses done with more current data clearly demonstrate the presence of plenty of effects for terms other than pain. L&E dispute the first two points (which we’ll come back to), but they don’t seem to contest the last. This seems to me like it should lead L&E to the logical conclusion that they should change their conclusions, since newer and better data are now available that clearly produce different results given the same assumptions.

(I do want to be clear again that I don’t condone L&E’s analyses, which, as I show above and below in detail, simply don’t support their conclusions. I was simply pointing out that even by their own criteria, Neurosynth results don’t support their claims.)

4. Surprising lack of appreciation for what the reverse inference maps show in pretty straightforward manner.

Let’s start with pain and salience. Iannetti and his colleagues have made quite a bit of hay the last few years saying that the dACC is not involved in pain, but rather codes for salience. One of us has critiqued the methods of this work elsewhere (Eisenberger, 2015, Annual Review). The reverse inference maps above show widespread robust reverse inference effects throughout the dACC for pain and not a single voxel for salience. When we ran this initially for the paper, there were 222 studies tagged for the term salience and now that number is up to 269 and the effects are the same.

Should our tentative conclusion be that we should hold off judgment until there is more evidence? TY thinks so: “If some terms have too few studies in Neurosynth to support reliable comparisons with pain, the appropriate thing to do is to withhold judgment until more data is available.” This would be reasonable if we were talking about topics with 10 or 15 studies in the database. But, there are 269 studies for the term salience and yet there is nothing in the dACC reverse inference maps. I can’t think of anyone who has ever run a meta-analysis of anything with 250 studies, found no accumulated evidence for an effect and then said “we should withhold judgment until more data is available”.

This is another gross misrepresentation of what I said in my commentary. So let me quote what I actually said. Here’s the context:

While it’s true that terms with fewer associated studies will have more variable (i.e., extreme) posterior probability estimates, this is an unavoidable problem that isn’t in any way remedied by focusing on z-scores instead of posterior probabilities. If some terms have too few studies in Neurosynth to support reliable comparisons with pain, the appropriate thing to do is to withhold judgment until more data is available. One cannot solve the problem of data insufficiency by pretending that p-values or z-scores are measures of effect size.

This is pretty close to the textbook definition of “quoting out of context”. It should be abundantly clear that I was not saying that L&E shouldn’t interpret results from a Neurosynth meta-analysis of 250 studies (which would be absurd). The point of the above quote was that if L&E don’t like the result they get when they conduct meta-analytic comparisons properly with Neurosynth, they’re not entitled to replace the analysis with a statistically invalid procedure that does give results they like.

TY and his collaborators have criticized researchers in major media outlets (e.g. New York Times) for poor reverse inference – for drawing invalid reverse inference conclusions from forward inference data. The analyses we presented suggest that claims about salience and the dACC are also based on unfounded reverse inference claims. One would assume that TY and his collaborators are readying a statement to criticize the salience researchers in the same way they have previously.

This is another absurd, and frankly insulting, comparison. My colleagues and I have criticized people for saying that insula activation is evidence that people are in love with their iPhones. I certainly hope that this is in a completely different league from inferring that people must be experiencing pain if the dACC is activated (because if not, some of L&E’s previous work would appear to be absurd on its face). For what it’s worth, I agree with L&E that nobody should interpret dACC activation in a study as strong evidence of “salience”—and, for that matter, also of “pain”. As for why I’m not readying a statement to criticize the salience researchers, the answer is that it’s not my job to police the ACC literature. My interest is in making sure Neurosynth is used appropriately. L&E can rest assured that if someone published an article based entirely on Neurosynth results in which their primary claim was that the dACC is selective for salience, I would have written precisely the same kind of critique. Though it should perhaps concern them that, of the hundreds of published uses of Neurosynth to date, theirs is the first and only one that has moved me to write a critical commentary.

But no. Nowhere in the blog does TY comment on this finding that directly contradicts a major current account of the dACC. Not so much as a “Geez, isn’t it crazy that so many folks these days think the dACC and AI can be best described in terms of salience detection and yet there is no reverse inference evidence at all for this claim.”

Once again: I didn’t comment on this because I’m not interested in the dACC; I’m interested in making sure Neurosynth is used appropriately. If L&E had asked me, “hey, do you think Neurosynth supports saying that dACC activation is a good marker of ‘salience’?”, I would have said “no, of course not.” But L&E didn’t write a paper titled “dACC activity should not be interpreted as a marker of salience”. They wrote a paper titled “the dACC is selective for pain”, in which they argue that pain is the best psychological characterization of dACC—a claim that Neurosynth simply does not support.

For the terms executive and conflict, our Figure 3 in the PNAS paper shows a tiny bit of dACC. We think the more comprehensive figures we’ve included here continue to tell the same story. If someone wants to tell the conflict story of why pain activates the dACC, we think there should be evidence of widespread robust reverse inference mappings from the dACC to conflict. But the evidence for such a claim just isn’t there. Whatever else you think about the rest of our statistics and claims, this should give a lot of folks pause, because this is not what almost any of us would have expected to see in these reverse inference maps (including us).

No objections here.

If you generally buy into Neurosynth as a useful tool (and you should), then when you look at the four maps above, it should be reasonable to conclude, at least among these four processes, that the dACC is much more involved in that first one (i.e. pain). Let’s test this intuition in a new thought experiment.

Imagine you were given the three reverse inference maps below and you were interested in the function of the occipital cortex area marked off with the green outline. You’d probably feel comfortable saying the region seems to have a lot more to do with Term A than Terms B or C. And if you know much about neuroanatomy, you’d probably be surprised, and possibly even angered, when I tell you that Term A is ‘motor’, Term B is ‘engaged’, and Term C is ‘visual’. How is this possible since we all know this region is primarily involved in visual processes? Well it isn’t possible because I lied. Term A is actually ‘visual’ and Term C is ‘motor’. And now the world makes sense again because these maps do indeed tell us that this region is widely and robustly associated with vision and only modestly associated with engagement and motor processes. The surprise you felt, if you believed momentarily that Term A was motor was because you have the same intuition we do that these reverse inference maps tell us that Term A is the likely function of this region, not Term B or Term C – and we’d like that reverse inference to be what we always thought this region was associated with – vision. It’s important to note that while a few voxels appear in this region for Terms B and C, it still feels totally fine to say this region’s psychological function can best be described as vision-related. It is the widespread robust nature of the effect in Term A, relative to the weak and limited effects of Terms B and C, that makes this a compelling explanation of the region.

I’m happy to grant L&E that it may “feel totally fine” to some people to make a claim like this. But this is purely an appeal to intuition, and has zero bearing on the claim’s actual validity. I hope L&E aren’t seriously arguing that cognitive neuroscientists should base the way we do statistical inference on our intuitions about what “feels totally fine”. I suspect it felt totally fine to L&E to conclude in 2003 that people were experiencing physical pain because the dACC was active, even though there was no evidential basis for such a claim (and there still isn’t). Recall that, in surveys of practicing researchers, a majority of respondents routinely endorse the idea that a p-value of .05 means that there’s at least a 95% probability that the alternative hypothesis is correct (it most certainly doesn’t mean this). Should we allow people to draw clearly invalid conclusions in their publications on the grounds that it “feels right” to them? Indeed, as I show below, L&E’s arguments for selectivity rest in part on an invalid acceptance of the null hypothesis. Should they be given a free pass on what is probably the cardinal sin of NHST, on the grounds that it probably “felt right” to them to equate non-significance with evidence of absence?

The point of Neurosynth is that it provides a probabilistic framework for understanding the relationship between psychological function and brain activity. The framework has many very serious limitations that, in practice, make it virtually impossible to draw any meaningful reverse inference from observed patterns of brain activity in any individual study. If L&E don’t like this, they’re welcome to build their own framework that overcomes the limitations of Neurosynth (or, they could even help me improve Neurosynth!). But they don’t get to violate basic statistical tenets in favor of what “feels totally fine” to them.

Another point of this thought experiment is that if Term A is what we expect it to be (i.e. vision) then we can keep assuming that Neurosynth reverse inference maps tell us something valuable about the function of this region. But if Term A violates our expectation of what this region does, then we are likely to think about the ways in which Neurosynth’s results are not conclusive on this point.

We suspect if the dACC results had come out differently, say with conflict showing wide and robust reverse inference effects throughout the dACC, and pain showing little to nothing in dACC, that most of our colleagues would have said “Makes sense. The reverse inference map confirms what we thought – that dACC serves a general cognitive function of detecting conflicts.” We think it is because of the content of the results rather than our approach that is likely to draw ire from many.

I can’t speak for L&E’s colleagues, but my own response to their paper was indeed driven entirely by their approach. If someone had published a paper using Neurosynth to argue that the dACC is selective for conflict, using the same kinds of arguments L&E make, I would have written exactly the same kind of critique I wrote in response to L&E’s paper. I don’t know how I can make it any clearer that I have zero attachment to any particular view of the dACC; my primary concern is with L&E’s misuse of Neurosynth, not what they or anyone else thinks about dACC function. I’ve already made it clear several times that I endorse their conclusion that conflict, salience, and cognitive control are not adequate explanations for dACC function. What they don’t seem to accept is that pain isn’t an adequate explanation either, as the data from Neurosynth readily demonstrate.

5. L&E did the wrong analyses

TY suggests that we made a major error by comparing the Z-scores associated with different terms and should have used posterior probabilities instead. If our goal had been to compare effect sizes this might have made sense, but comparing effect sizes was not our goal. Our goal was to see whether there was accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.

I’ve already addressed the overarching problem with L&E’s statistical analyses in the first part of this post. Below I’ll just walk through each of L&E’s assertions and point out the specific issues in detail. I’ll warn you right now that this is not likely to make for very exciting reading.

While we think the maps for each term speak volumes just from visual inspection, we thought it was also critical to run the comparisons across terms directly. We all know the statistical error of showing that A is significant, while B is not and then assuming, but not testing A > B, directly. TY has a section called “A>B does not imply ~B” (where ~B means ‘not B’). Indeed it does not, but all the reverse inference maps for the executive, conflict, and salience terms already established ~B. We were just doing due diligence by showing that the difference between A and B was indeed significant.

I apologize for implying that L&E weren’t aware that A > B doesn’t entail ~B. I drew that conclusion because the only other way I could see their claim of selectivity making any sense is if they were interpreting a failure to detect a significant effect for B as positive evidence of no effect. I took that to be much less likely, because it’s essentially the cardinal sin of NHST. But their statement here explicitly affirms that this is, in fact, exactly what they were arguing, which leads me to conclude that they don’t understand the null hypothesis significance testing (NHST) framework they’re using. The whole point of this section of my post was that L&E cannot conclude that there’s no activity in dACC for terms like conflict or salience, because accepting the null is an invalid move under NHST. Perhaps I wasn’t sufficiently clear about this in my last post, so let me reiterate: the reverse inference maps do not establish ~B, and cannot establish ~B. The (invalid) comparison tests of A > B do not establish ~B, and cannot establish ~B. In fact, no analysis, figure, or number L&E report anywhere in their paper establishes ~B for any of the terms they compare with pain. Under NHST, the only possible result of any of L&E’s analyses that would allow them to conclude that a term is not positively associated with dACC activation would be a significant result in the negative direction (i.e., if dACC activation implied a decrease in likelihood of a term). But that’s clearly not true of any of the terms they examine.

Note that this isn’t a fundamental limitation of statistical inference in general; it’s specifically an NHST problem. A Bayesian model comparison approach would have allowed L&E to make a claim about the evidence for the null in comparison to the alternative (though specifying the appropriate priors here might not be very straightforward). Absent such an analysis, L&E are not in any position to make claims about conflict or salience not activating the dACC—and hence, per their own criteria for selectivity, they have no basis for arguing that pain is selective.
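To make the contrast with NHST concrete, here is a minimal sketch of a Bayesian model comparison on toy binomial data. Everything in it is an assumption made for illustration: the counts are hypothetical, the null is a fixed probability of 0.5, and the alternative places a uniform Beta(1, 1) prior on the probability (as noted above, choosing sensible priors for real Neurosynth data would be much harder than this). It is not an analysis of Neurosynth data.

```python
import math

def log_binomial_coefficient(n, k):
    """log of 'n choose k', computed with log-gamma for numerical stability."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def bayes_factor_null_vs_uniform(k, n, p0=0.5):
    """Bayes factor BF01 for H0: p = p0 versus H1: p ~ Uniform(0, 1).

    Under H1 with a uniform prior, the marginal likelihood of observing
    k successes in n trials is exactly 1 / (n + 1).
    """
    log_m0 = log_binomial_coefficient(n, k) + k * math.log(p0) + (n - k) * math.log(1 - p0)
    log_m1 = -math.log(n + 1)
    return math.exp(log_m0 - log_m1)

# Hypothetical counts: 52 of 100 studies activate (close to the null of 0.5),
# versus 80 of 100 studies activate (far from the null).
bf_near_null = bayes_factor_null_vs_uniform(52, 100)
bf_far_from_null = bayes_factor_null_vs_uniform(80, 100)

print(f"BF01 for 52/100: {bf_near_null:.2f}")       # > 1 favors the null
print(f"BF01 for 80/100: {bf_far_from_null:.2e}")   # << 1 favors the alternative
```

Unlike a non-significant p-value, a Bayes factor above 1 is a statement of relative support for the null, though its size depends on the prior chosen for the alternative.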

Now, in my last post, I went well beyond this logical objection and argued that, if you analyze the data using L&E’s own criteria, there’s plenty of evidence for significant effects of other terms in dACC. I now regret including those analyses. Not because they were wrong; I stand by my earlier conclusion (which should be apparent to anyone who spends five minutes browsing maps on Neurosynth.org), and this alone should have prevented L&E from making claims about pain selectivity. But the broader point is that I don’t want to give the impression that this debate is over what the appropriate statistical threshold for analysis is (i.e., that maybe if we use p < 0.05, I’m right, and if we use FDR = 0.1, L&E are right). The entire question of which terms do or don’t show a significant effect is actually completely beside the point given that L&E’s goal is to establish that only pain activates the dACC, and that terms like conflict or salience don’t. To accomplish that, L&E would need to use an entirely different statistical framework that allows them to accept the null (relative to some alternative).

If it’s reasonable to use the Z-scores from Neurosynth to say “How much evidence is there for process A being a reliable reverse inference target for region X” then it has to be reasonable to compare Z-scores from two analyses to ask “How much MORE evidence is there for process A than process B being a reliable reverse inference target for region X”. This is all we did when we compared the Z-scores for different terms to each other (using a standard formula from a meta-analysis textbook) and we think this is the question many people are asking when they look at the Neurosynth maps for any two competing accounts of a neural region.

I addressed this in the earlier part of this post, where I explained why one cannot obtain support for a reverse inference using z-scores or p-values. Reverse inference is inherently a Bayesian notion, and makes sense only if you’re willing to talk about prior and posterior probabilities. So L&E’s first premise here—i.e., that it’s reasonable to use z-scores from Neurosynth to quantify “evidence for process A being a reliable reverse inference target for region X” is already false.

For what it’s worth, the second premise is also independently false, because it’s grossly inappropriate to use a meta-analytic z-score comparison test in this situation. For one thing, there’s absolutely no reason to compare z-scores given that the distributional information is readily available. Rosenthal (the author of the meta-analysis textbook L&E cite) himself explicitly notes that such a test is inferior to effect size-based tests, and is essentially a last-ditch approach. Moreover, the intended use of the test in meta-analysis is to determine whether or not there’s heterogeneity in p-values as a precursor to combining them in an analysis (which is a concern that makes no sense in the context of Neurosynth data). At best, what L&E would be able to say with this test is something like “it looks like these two z-scores may be coming from different underlying distributions”. I don’t know why L&E think this is at all an interesting question here, because we already know with certainty that there can be no meaningful heterogeneity of this sort in these z-scores given that they’re all generated using exactly the same set of studies.

In fact, the problems with the z-score comparison test L&E are using run so deep that I can’t help but point out just one truly stupefying implication of the approach: it’s possible, under a wide range of scenarios, to end up concluding that there’s evidence that one term is “preferentially” activated relative to another term even when the point estimate is (significantly) larger for the latter term. For example, consider a situation in which we have a probability of 0.65 for one term with n = 1000 studies, and a probability of 0.8 for a second term with n = 100 studies. The one-sample proportion test for these two samples, versus a null of 0.5, gives z-scores of 9.5 and 5.9, respectively, so both tests are highly significant, as one would expect. But the Rosenthal z-score test favored by L&E tells us that the z-score for the first sample is significantly larger than the z-score for the second. It isn’t just wrong to interpret this as evidence that the first term has a more selective effect; it’s dangerously wrong. A two-sample test for the difference in proportions correctly reveals a significant effect in the expected direction (i.e., the 0.8 probability in the smaller sample is in fact significantly greater than the 0.65 probability in the much larger sample). Put simply, L&E’s test is broken. It’s not clear that it tests anything meaningful in this context, let alone allowing us to conclude anything useful about functional selectivity in dACC.
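The arithmetic in the example above can be checked with a short sketch. Two assumptions are worth flagging: the z-score comparison is taken to be the usual (z1 - z2) / sqrt(2) form for two independent z-scores, and the one-sample tests are run with a continuity correction, which is what appears to reproduce the quoted 9.5 and 5.9 (without it, the second z-score comes out at 6.0). The function names are illustrative and not part of any library.

```python
import math

def one_sample_z(p_hat, n, p0=0.5, continuity=True):
    """z-score for an observed proportion against a null value p0."""
    diff = abs(p_hat - p0)
    if continuity:
        # Continuity correction of half a count, expressed as a proportion.
        diff = max(0.0, diff - 0.5 / n)
    z = diff / math.sqrt(p0 * (1 - p0) / n)
    return math.copysign(z, p_hat - p0)

def z_score_difference(z1, z2):
    """Comparison of two independent z-scores: (z1 - z2) / sqrt(2)."""
    return (z1 - z2) / math.sqrt(2)

def two_sample_z(p1, n1, p2, n2):
    """Pooled two-sample z-test for the difference in proportions p2 - p1."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

z_first = one_sample_z(0.65, 1000)    # larger sample, smaller proportion
z_second = one_sample_z(0.80, 100)    # smaller sample, larger proportion

z_comparison = z_score_difference(z_first, z_second)
z_proportions = two_sample_z(0.65, 1000, 0.80, 100)

print(f"one-sample z-scores: {z_first:.1f} and {z_second:.1f}")
print(f"z-score comparison (first minus second): {z_comparison:.2f}")
print(f"two-sample proportion test (second minus first): {z_proportions:.2f}")
```

Both statistics exceed 1.96, but they point in opposite directions: the z-score comparison favors the first term, while the test on the proportions themselves favors the second.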

As for what people are asking when they look at the Neurosynth maps for any two competing accounts of a neural region: I really don’t know, and I don’t see how that would have any bearing on whether the methods L&E are using are valid or not. What I do know is that I’ve never seen anyone else compare Neurosynth z-scores using a meta-analytic procedure intended to test for heterogeneity of effects, and I certainly wouldn’t recommend it.

TY then raises two quite reasonable issues with the Z-score comparisons, one of which we already directly addressed in our paper. First, TY raises the issue that Z-scores increase with accumulating evidence, so terms with more studies in the database will tend to have larger Z-scores. This suggests that terms with the most studies in the database (e.g. motor with 2081 studies) should have significant Z-scores everywhere in the brain. But terms with the most studies don’t look like this. Indeed, the reverse inference map for “functional magnetic” with 4990 studies is a blank brain with no significant Z-scores.

Not quite. It’s true that for any fixed effect size, z-scores will rise (in absolute value) as sample size increases. But if the true effect size is very small, one will still obtain a negligible z-score even in a very large sample. So while terms with more studies will indeed tend to have larger absolute z-scores, it’s categorically false that “terms with the most studies in the database should have significant z-scores everywhere in the brain”.
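A quick sketch of this relationship, using a one-sample proportion test against a null of 0.5 as a stand-in (an assumption made for illustration; the proportions and sample sizes below are hypothetical and the formula is not taken from Neurosynth):

```python
import math

def expected_z(p_true, n, p0=0.5):
    """z-score obtained if the observed proportion equals p_true exactly."""
    return (p_true - p0) / math.sqrt(p0 * (1 - p0) / n)

# Rows: hypothetical true proportions; columns: hypothetical sample sizes.
sample_sizes = (100, 1000, 5000)
for p_true in (0.501, 0.55, 0.65):
    row = ", ".join(f"n={n}: z={expected_z(p_true, n):.2f}" for n in sample_sizes)
    print(f"p={p_true}: {row}")
```

For any fixed non-null proportion, z grows with the square root of n, but a proportion of 0.501 still yields a z-score far below conventional thresholds even at n = 5000.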

However, TY has a point. If two terms have similar true underlying effects in dACC, then the one with the larger number of studies will have a larger Z-score, all else being equal. We addressed this point in the limitations section of our paper writing “It is possible that terms that occur more frequently, like “pain,” might naturally produce stronger reverse inference effects than less frequent terms. This concern is addressed in two ways. First, the current analyses included a variety of terms that included both more or fewer studies than the term “pain” and no frequency-based gradient of dACC effects is observable.” So while pain (410 studies) is better represented in the Neurosynth database than conflict (246 studies), effort (137 studies), or Stroop (162 studies), several terms are better represented than pain including auditory (1004 studies), cognitive control (2474 studies), control (2781 studies), detection (485 studies), executive (531 studies), inhibition (432 studies), motor (1910 studies), and working memory (815). All of these, regardless of whether they are better or worse represented in the Neurosynth database show minimal presence in the dACC reverse inference maps. It’s also worth noting that painful and noxious, with only 158 and 85 studies respectively, both show broader coverage within the dACC than any of the cognitive or salience terms considered in our paper.

L&E don’t seem to appreciate that the relationship between the point estimate of a parameter and the uncertainty around that estimate is not like the relationship between two predictors in a regression, where one can (perhaps) reason logically about what would or should be true if one covariate was having an influence on another. One cannot “rule out” the possibility that sample size is a problem by pointing to some large-N terms with small effects or some small-N terms with large effects. Sampling error is necessarily larger in smaller samples. The appropriate way to handle between-term variation in sample size is to properly build that differential uncertainty into one’s inferential test. Rosenthal’s z-score comparison doesn’t do this. The direct meta-analytic contrast one can perform with Neurosynth does do this, but of course, being much more conservative than the Rosenthal test (appropriately so!), L&E don’t seem to like the results it produces. (And note that the direct meta-analytic contrast would still require one to make strong assumptions about priors if the goal was to make quantitative reverse inferences, as opposed to detecting a mean difference in probability of activation.)

TY’s second point is also reasonable, but is also not a problem for our findings. TY points out that some effects may be easier to produce in the scanner than others and thus may be biased towards larger effect sizes. We are definitely sympathetic to this point in general, but TY goes on to focus on how this is a problem for comparing pain studies to emotion studies because pain is easy to generate in the scanner and emotion is hard. If we were writing a paper comparing effect sizes of pain and emotion effects this would be a problem but (a) we were not primarily interested in comparing effect sizes and (b) we definitely weren’t comparing pain and emotion because we think the aspect of pain that the dACC is involved in is the affective component of pain as we’ve written in many other papers dating back to 2003 (Eisenberger & Lieberman, 2004; Eisenberger, 2012; Eisenberger, 2015).

It certainly is a problem for L&E’s findings. Z-scores are related one-to-one with effect size for any fixed sample size, so if the effect size is artificially increased in one condition, so too is the z-score that L&E stake their (invalid) analysis on. Any bias in the point estimate will necessarily distort the z-value as well. This is not a matter of philosophical debate or empirical conjecture, it’s a mathematical necessity.

Is TY’s point relevant to our actual terms of comparison: executive, conflict, and salience processes? We think not. Conflict tasks are easy and reliable ways to produce conflict processes. In multiple ways, we think pain is actually at a disadvantage in the comparison to conflict. First, pain effects are so variable from one person to the next that most pain researchers begin by calibrating the objective pain stimuli delivered, to each participant’s subjective responses to pain. As a result, each participant may actually be receiving different objective inputs and this might limit the reliability or interpretability of certain observed effects. Second, unlike conflict, pain can only be studied at the low end of its natural range. Due to ethical considerations, we do not come close to studying the full spectrum of pain phenomena. Both of these issues may limit the observation of robust pain effects relative to our actual comparisons of interest (executive, conflict, and salience processes).

Perhaps I wasn’t sufficiently clear, but I gave the pain-emotion contrast as an example. The point is that meta-analytic comparisons of the kind L&E are trying to make are a very dangerous proposition unless one has reason to think that two classes of manipulations are equally “strong”. It’s entirely possible that L&E are right that executive control manipulations are generally stronger than pain manipulations, but that case needs to be made on the basis of data, and cannot be taken for granted.

6. About those effect size comparison maps

After criticizing us for not comparing effect sizes, rather than Z-scores, TY goes on to produce his own maps comparing the effect sizes of different terms and claiming that these represent evidence that the dACC is not selective for pain. A lot of our objections to these analyses as evidence against our claims repeat what’s already been said, so we’ll start with what’s new and then only briefly reiterate the earlier points.

a) We don’t think it makes much sense to compare effect sizes for terms in voxels for which there is no evidence that it is a valid reverse inference target. For instance, the posterior probability at 0 26 26 for pain is .80 and for conflict is .61 (with .50 representing a null effect). Are these significantly different from one another? I don’t think it matters much because the Z-score associated with conflict at this spot is 1.37, which is far from significant (or at least it was when we ran our analyses last summer. Strangely, now, any non-significant Z-scores seem to come back with a value of 0, whereas they used to give the exact non-significant Z-score).

I’m not sure why L&E think that statistical significance makes a term a “valid target” for reverse inference (or conversely, that non-significant terms cannot be valid targets). If they care to justify this assertion, I’ll be happy to respond to it. It is, in any case, a moot point, since many of the examples I gave were statistically significant, and L&E don’t provide any explanation as to why those terms aren’t worth worrying about either.

As for the disappearance of non-significant z-scores, that’s a known bug introduced by the last major update to Neurosynth, and it’ll be fixed in the next major update (when the entire database is re-generated).

If I flip a coin twice I might end up with a probability estimate of 100% heads, but this estimate is completely unreliable. Comparing this estimate to those from a coin flipped 10,000 times which comes up 51% heads makes little sense. Would the first coin having a higher probability estimate than the second tell us anything useful? No, because we wouldn’t trust the probability estimate to be meaningful. Similarly, if a high posterior probability is associated with a non-significant Z-score, we shouldn’t take this posterior probability as a particularly reliable estimate.

L&E are correct that it wouldn’t make much sense to compare an estimate from 2 coin flips to an estimate from 10,000 coin flips. But the error is in thinking that comparing p-values somehow addresses this problem. As noted above, the p-value comparison they use is a meta-analytic test that only tells one if a set of z-scores are heterogeneous, and is not helpful for comparing proportions when one has actual distributional information available. It would be impossible to answer the question of whether one coin is biased relative to another using this test, and it’s equally impossible to use it to determine whether one term is more important than another for dACC function.
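One way to see what having the distributional information buys you in the coin example is to put an interval around each estimate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something either side of this exchange proposes.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

two_flips = wilson_interval(2, 2)            # 2 flips, both heads
many_flips = wilson_interval(5100, 10000)    # 10,000 flips, 51% heads

print(f"2 flips:      {two_flips[0]:.3f} to {two_flips[1]:.3f}")
print(f"10,000 flips: {many_flips[0]:.3f} to {many_flips[1]:.3f}")
```

The interval for the two-flip coin is wide enough to contain the whole interval for the other coin, so the raw counts already say that the 100% estimate is uninformative. No detour through p-values is needed to reach that conclusion.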

b) TY’s approach for these analyses is to compare the effect sizes for any two processes A & B by finding studies in the database tagged for A but not B and others tagged for B but not A and to compare these two sets. In some cases this might be fine, but in others it leaves us with a clean but totally unrealistic comparison. To give the most extreme example, imagine we did this for the terms pain and painful. It’s possible there are some studies tagged for painful but not pain, but how representative would these studies be of “painful” as a general term or construct? It’s much like the clinical problem of comparing depression to anxiety by comparing those with depression (but not anxiety) to those with anxiety (but not depression). These folks are actually pretty rare because depression and anxiety are so highly comorbid, so the comparison is hardly a valid test of depression vs. anxiety. Given that we think pain, fear, emotion, and autonomic are actually all in the same class of explanations, we think comparisons within this family are likely to suffer from this issue.

There’s nothing “unrealistic” about this comparison. It’s not the inferential test’s job to make sure that the analyst is doing something sensible, it’s the analyst’s job. Nothing compels L&E to run a comparison between ‘pain’ and ‘painful’, and I fully agree that this would be a dumb thing to do (and it would be an equally dumb thing to do using any other statistical test). On the other hand, comparing the terms ‘pain’ and ‘emotion’ is presumably not a dumb thing to do, so it behooves us to make sure that we use an inferential test that doesn’t grossly violate common sense and basic statistical assumptions.

Now, if L&E would like to suggest an alternative statistical test that doesn’t exclude the intersection of the two terms and still (i) produces interpretable results, (ii) weights all studies equally, (iii) appropriately accounts for the partial dependency structure of the data, and (iv) is sufficiently computationally efficient to apply to thousands of terms in a reasonable amount of time (which rules out most permutation-based tests), then I’d be delighted to consider their suggestions. The relevant code can be found here, and L&E are welcome to open a GitHub issue to discuss this further. But unless they have concrete suggestions, it’s not clear what I’m supposed to do with their assertion that doing meta-analytic comparison properly sometimes “leaves us with a clean but totally unrealistic comparison”. If they don’t like the reality, they’re welcome to help me improve the reality. Otherwise they’re simply engaging in wishful thinking. Nobody owes L&E a statistical test that’s both valid and gives them results they like.
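For readers who want to see what excluding the intersection means mechanically, here is a toy sketch of the step described in (b): split studies into those tagged only with term A, only with term B, or with both, and compare activation rates between the two exclusive groups. This is not the Neurosynth code referred to above. The study IDs, tags, and activation flags are all made up, and the function name is hypothetical.

```python
def split_by_exclusive_tag(tags_by_study, term_a, term_b):
    """Partition study IDs into A-only, B-only, and tagged-with-both sets."""
    only_a = {s for s, tags in tags_by_study.items() if term_a in tags and term_b not in tags}
    only_b = {s for s, tags in tags_by_study.items() if term_b in tags and term_a not in tags}
    both = {s for s, tags in tags_by_study.items() if term_a in tags and term_b in tags}
    return only_a, only_b, both

# Made-up data: which terms each study is tagged with.
tags_by_study = {
    "s1": {"pain"},
    "s2": {"pain", "emotion"},
    "s3": {"emotion"},
    "s4": {"pain"},
    "s5": {"emotion"},
    "s6": {"pain", "emotion"},
}
# Made-up data: studies reporting activation at some voxel of interest.
active_at_voxel = {"s1", "s2", "s4", "s5"}

only_pain, only_emotion, both = split_by_exclusive_tag(tags_by_study, "pain", "emotion")

rate_pain = len(only_pain & active_at_voxel) / len(only_pain)
rate_emotion = len(only_emotion & active_at_voxel) / len(only_emotion)

print(f"pain-only studies: {sorted(only_pain)}, activation rate {rate_pain:.2f}")
print(f"emotion-only studies: {sorted(only_emotion)}, activation rate {rate_emotion:.2f}")
print(f"excluded (tagged with both): {sorted(both)}")
```

The more two terms overlap, the more studies land in the excluded set, which is the situation L&E describe for ‘pain’ and ‘painful’ and the reason that particular comparison is not worth running.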

c) TY compared topics (i.e., a cluster of related terms), not terms. This is fine, but it is one more way that what TY did is not comparable to what we did (i.e. one more way his maps can’t be compared to those we presented).

I almost always use topics rather than terms in my own analyses, for a variety of reasons (they have better construct validity, are in theory more reliable, reduce the number of comparisons, etc.). I didn’t try out the analyses I ran with any of the term-based features, but I encourage L&E to do so if they like, and I’d be surprised if the results differ appreciably (they should, in general, simply be slightly less robust all around). In any case, I deliberately made my code available so that L&E (or anyone else) could easily reproduce and modify my analyses. (And of course, nothing at all hangs on the results in any case, because the whole premise that this is a suitable way to demonstrate selectivity is unfounded.)

d) Finally and most importantly, our question would not have led us to comparing effect sizes. We were interested in whether there was greater accumulated evidence for one term (i.e. pain) being a reverse inference target for dACC activations than for another term (e.g. conflict). Using the Z-scores as we did is a perfectly reasonable way to do this.

See above. Using the z-scores the way L&E did is not reasonable and doesn’t tell us anything anyone would want to know about functional selectivity.

7. Biases all around

Towards the end of his blog, TY says what we think many cognitive folks believe:

“I don’t think it’s plausible to think that much of the brain really prizes pain representation above all else.”

We think this is very telling because it suggests that the findings such as those in our PNAS paper are likely to be unacceptable regardless of what the data shows.

Another misrepresentation of what I actually said, which was:

One way to see this is to note that when we meta-analytically compare pain with almost any other term in Neurosynth (see the figure above), there are typically a lot of brain regions (extending well outside of dACC and other putative pain regions) that show greater activation for pain than for the comparison condition, and very few brain regions that show the converse pattern. I don’t think it’s plausible to think that much of the brain really prizes pain representation above all else. A more sensible interpretation is that the Neurosynth posterior probability estimates for pain are inflated to some degree by the relative ease of inducing pain experimentally.

The context makes it abundantly clear that I was not making a general statement about the importance of pain in some grand evolutionary sense, but simply pointing out the implausibility of supposing that Neurosynth reverse inference maps provide unbiased windows into the neural substrates of cognition. In the case of pain, there’s tentative reason to believe that effect sizes are overestimated.

In contrast, we can’t think of too many things that the brain would prize above pain (and distress) representations. People who don’t feel pain (i.e. congenital insensitivity to pain) invariably die an early death – it is literally a death sentence to not feel pain. What could be more important for survival? Blind and deaf people survive and thrive, but those without the ability to feel pain are pretty much doomed.

I’m not sure what this observation is supposed to tell us. One could make the same kind of argument about plenty of other functions. People who suffer from a variety of autonomic or motor problems are also likely to suffer horrible early deaths; it’s unclear to me how this would justify a claim like “the brain prizes little above autonomic control”, or what possible implications such a claim would have for understanding dACC function.

Similar (but not identical) to TY’s conclusions that we opened this blog with, we think the following conclusions are supported by the Neurosynth evidence in our PNAS paper:

I’ll take these one at a time.

* There is more widespread and robust reverse inference evidence for the role of pain throughout the dACC than for executive, conflict, and salience-related processes.

I’m not sure what is meant here by “robust reverse inference evidence”. Neurosynth certainly provides essentially no basis for drawing reverse inferences about the presence of pain in individual studies. (Let me remind L&E once again: at best, the posterior probability for ‘pain’ in dACC is around 80%, but that’s given an assumed base rate of 50%, not the more realistic real-world rate of around 3%.) If what they mean is something like “on average, taking the average of all voxels in dACC, there’s more evidence of a statistical association between pain and dACC than pain and conflict monitoring”, then I’m fine with that.
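
To see how much the assumed base rate matters, here is a minimal sketch of the arithmetic. It takes the roughly 80% posterior and the 50% prior mentioned above, backs out the implied likelihood ratio, and recomputes the posterior under a 3% prior. The function names are hypothetical, and the calculation assumes the likelihood ratio stays fixed when the prior changes.

```python
def implied_likelihood_ratio(posterior, prior):
    """Likelihood ratio implied by a given posterior under a given prior (Bayes' rule in odds form)."""
    posterior_odds = posterior / (1.0 - posterior)
    prior_odds = prior / (1.0 - prior)
    return posterior_odds / prior_odds

def posterior_under_prior(likelihood_ratio, prior):
    """Posterior probability obtained by applying a likelihood ratio to a new prior."""
    posterior_odds = likelihood_ratio * (prior / (1.0 - prior))
    return posterior_odds / (1.0 + posterior_odds)

# Figures from the text: ~80% posterior for 'pain' under an assumed 50% base rate.
lr = implied_likelihood_ratio(0.80, 0.50)

# Same evidence, but with a base rate of around 3%.
adjusted = posterior_under_prior(lr, 0.03)

print(f"implied likelihood ratio: {lr:.2f}")
print(f"posterior at a 3% base rate: {adjusted:.3f}")
```

Under these assumptions the implied likelihood ratio is 4, and the posterior probability of pain falls from about 80% to roughly 11% once the base rate is set to 3%.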

* There is little to no evidence from the Neurosynth database that executive, conflict, and salience-related processes are reasonable reverse inference targets for dACC activity.

Again, this depends on what L&E mean. If they mean that one shouldn’t, upon observing activation in dACC, proclaim that conflict must be present, then they’re absolutely right. But again, the same is true for pain. On the other hand, if they mean that there’s no evidence in Neurosynth for a reverse inference association between these terms and dACC activity, where the criterion is surviving FDR-correction, then that’s clearly not true: for example, the conflict map clearly includes voxels within the dACC. Alternatively, if L&E’s point is that the dACC/preSMA region centrally associated with conflict monitoring or executive control is more dorsal than many (though not all) people have assumed, then I agree with them without qualification.

* Pain processes, particularly the affective or distressing part of pain, are in the same family with other distress-related processes including terms like distress, fear, and negative affect.

I have absolutely no idea what evidence this conclusion is based on. Nothing I can see in Neurosynth seems to support this—let alone anything in the PNAS paper. As I’ve noted several times now, most distress-related terms do not seem to overlap meaningfully with pain-related activations in dACC. To the extent that one thinks spatial overlap is a good criterion for determining family membership (and for what it’s worth, I don’t think it is), the evidence does not seem particularly suggestive of any such relationship (and L&E don’t test it formally in any way).

Postscript. *L&E should have used reverse inference, not forward inference, when examining the anatomical boundaries of dACC.*

We saved this one for the postscript because this has little bearing on the major claims of our paper. In our paper, we observed that when one does a forward inference analysis of the term ‘dACC’ the strongest effect occurs outside the dACC in what is actually SMA. This suggested to us that people might be getting activations outside the dACC and calling them dACC (much as many activations clearly not in the amygdala have been called amygdala because it fits a particular narrative). TY admits having been guilty of this in TY’11 and points out that we made this mistake in our 2003 Science paper on social pain. A couple of thoughts on this.

a) In 2003, we did indeed call an activation outside of dACC (-6 8 45) by the term “dACC”. TY notes that if this is entered into a Neurosynth analysis the first anatomical term that appears is SMA. Fair enough. It was our first fMRI paper ever and we identified that activation incorrectly. What TY doesn’t mention is that there are two other activations from the same paper (-8 20 40; -6 21 41) where the top named anatomical term in Neurosynth is anterior cingulate. And if you read this in TY’s blog and thought “I guess social pain effects aren’t even in the dACC”, we would point you to the recent meta-analysis of social pain by Rotge et al. (2015) where they observed the strongest effect for social pain in the dACC (8 24 24; Z=22.2 PFDR<.001). So while we made a mistake, no real harm was done.

I mentioned the preSMA activation because it was the critical data point L&E leaned on to argue that the dACC was specifically associated with the affective component of pain. Here’s the relevant excerpt from the 2003 social pain paper:

As predicted, group analysis of the fMRI data indicated that dorsal ACC (Fig. 1A) (x = -8, y = 20, z = 40) was more active during ESE than during inclusion (t = 3.36, r = 0.71, P < 0.005) (23, 24). Self-reported distress was positively correlated with ACC activity in this contrast (Fig. 2A) (x = -6, y = 8, z = 45, r = 0.88, P < 0.005; x = -4, y = 31, z = 41, r = 0.75, P < 0.005), suggesting that dorsal ACC activation during ESE was associated with emotional distress paralleling previous studies of physical pain (7, 8). The anterior insula (x = 42, y = 16, z = 1) was also active in this comparison (t = 4.07, r = 0.78, P < 0.005); however, it was not associated with self-reported distress.

Note that both the dACC and anterior insula were activated by the exclusion vs. inclusion contrast, but L&E concluded that it was specifically the dACC that supports the “neural alarm” system, by virtue of being correlated with participants’ subjective reports (whereas the insula was not). Setting aside the fact that these results were observed in a sample size of 13 using very liberal statistical thresholds (so that the estimates are highly variable, spatial error is going to be very high, there’s a high risk of false positives, and accepting the null in the insula because of the absence of a significant effect is probably a bad idea), in focusing on the preSMA activation in my critique, I was only doing what L&E themselves did in their paper:

Dorsal ACC activation during ESE could reflect enhanced attentional processing, previously associated with ACC activity (4, 5), rather than an underlying distress due to exclusion. Two pieces of evidence make this possibility unlikely. First, ACC activity was strongly correlated with perceived distress after exclusion, indicating that the ACC activity was associated with changes in participants’ self-reported feeling states.

By L&E’s own admission, without the subjective correlation, there would have been little basis for concluding that the effect they observed was attributable to distress rather than other confounds (attentional increases, expectancy violation, etc.). That’s why I focused on the preSMA activation: because they did too.

That said, since L&E bring up the other two activations, let’s consider those too, since they also have their problems. While it’s true that both of them are in the anterior cingulate, neither of them is a “pain” voxel according to Neurosynth. The top functional associates for both locations are ‘interference’, ‘task’, ‘verbal’, ‘verbal fluency’, ‘word’, ‘demands’, ‘words’, ‘reading’ … you get the idea. Pain is not significantly associated with these points in Neurosynth. So while L&E might be technically right that these other activations were in the anterior cingulate, if we take Neurosynth to be as reliable a guide to reverse inference as they think, then L&E never had any basis for attributing the social exclusion effect to pain to begin with: according to Neurosynth, literally none of the medial frontal cortex activations reported in the 2003 paper are associated with pain. I’ll leave it to others to decide whether “no harm was done” by their claim that the dACC is involved in social pain.

In contrast, TY’11’s mistake is probably of greater significance. Many have taken Figure 3 of TY’11 as strong evidence that the dACC activity can’t be reliably associated with working memory, emotion, or pain. If TY had tested instead (2 8 40), a point directly below his that is actually in dACC (rather than 2 8 50 which TY now acknowledges is in SMA), he would have found that pain produces robust reverse inference effects, while neither working memory or emotion do. This would have led to a very different conclusion than the one most have taken from TY’11 about the dACC.

Nowhere in TY’11 is it claimed that dACC activity isn’t reliably associated with working memory, emotion or pain (and, as I already noted in my last post, I explicitly said that the posterior aspects of dACC are preferentially associated with pain). What I did say is that dACC activation may not be diagnostic of any of these processes. That’s entirely accurate. As I’ve explained at great length above, there is simply no basis for drawing any strong reverse inference on the basis of dACC activation.

That said, if it’s true that many people have misinterpreted what I said in my paper, that would indeed be potentially damaging to the field. I would appreciate feedback from other people on this issue, because if there’s a consensus that my paper has in fact led people to think that dACC plays no specific role in cognition, then I’m happy to submit an erratum to the journal. But absent such feedback, I’m not convinced that my paper has had nearly as much influence on people’s views as L&E seem to think.

b) TY suggested that we should have looked for “dACC” in the reverse inference map rather than the forward inference map writing “All the forward inference map tells you is where studies that use the term “dACC” tend to report activation most often”. Yet this is exactly what we were interested in. If someone is talking about dACC in their paper, is that the region most likely to appear in their tables? The answer appears to be no.

No, it isn’t what L&E are interested in. Let’s push this argument to its logical extreme to illustrate the problem: imagine that every single fMRI paper in the literature reported activation in preSMA (plus other varying activations)—perhaps because it became standard practice to do a “task-positive localizer” of some kind. This is far-fetched, but certainly conceptually possible. In such a case, searching for every single region by name (“amygdala”, “V1”, you name it) would identify preSMA as the peak voxel in the forward inference map. But what would this tell us, other than that preSMA is activated with alarming frequency? Nothing. What L&E want to know is what brain regions have the biggest impact on the likelihood that an author says “hey, that’s dACC!”. That’s a matter of reverse inference.
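
The thought experiment can be sketched with a tiny made-up literature. This is only an illustration under loud assumptions: the ten studies, region names, and helper functions are all hypothetical, and raw conditional probabilities stand in for the statistical maps Neurosynth actually produces.

```python
# Toy literature: each study is (mentions the term "dACC"?, set of regions it reports).
# Every study reports preSMA, mimicking the "task-positive localizer" scenario.
studies = [
    (True,  {"preSMA", "dACC"}),
    (True,  {"preSMA", "dACC"}),
    (True,  {"preSMA", "dACC"}),
    (True,  {"preSMA"}),
    (False, {"preSMA", "dACC"}),
    (False, {"preSMA", "amygdala"}),
    (False, {"preSMA"}),
    (False, {"preSMA"}),
    (False, {"preSMA"}),
    (False, {"preSMA"}),
]
regions = ["amygdala", "dACC", "preSMA"]

def forward(region):
    """P(region reported | study mentions the term)."""
    term_studies = [regs for has_term, regs in studies if has_term]
    return sum(region in regs for regs in term_studies) / len(term_studies)

def reverse(region):
    """P(study mentions the term | region reported)."""
    region_studies = [has_term for has_term, regs in studies if region in regs]
    return sum(region_studies) / len(region_studies)

base_rate = sum(has_term for has_term, _ in studies) / len(studies)
forward_map = {r: forward(r) for r in regions}
reverse_map = {r: reverse(r) for r in regions}
forward_peak = max(regions, key=forward_map.get)
reverse_peak = max(regions, key=reverse_map.get)

print("base rate of the term:", base_rate)
print("forward map:", forward_map, "peak:", forward_peak)
print("reverse map:", reverse_map, "peak:", reverse_peak)
```

In this toy literature the forward map peaks at preSMA simply because everyone reports it, while the probability of the term given preSMA activation is no higher than the term's base rate. Only the reverse map picks out the region whose activation actually changes the likelihood that an author uses the label.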

c) But again, this is not one of the central claims of the paper. We just thought it was noteworthy so we noted it. Nothing else in the paper depends on these results.

I agree with this. I guess it’s nice to end on a positive note.