No, the dorsal anterior cingulate is not selective for pain: comment on Lieberman and Eisenberger (2015)

[Update 12/10/2015: Lieberman & Eisenberger have now posted a lengthy response to this post here. I’ll post my own reply to their reply in the next few days.]

[Update 12/14/2015: I’ve posted an even lengthier reply to L&E’s reply here.]

[Update 12/16/2015: Alex Shackman has posted an interesting commentary of his own on the L&E paper. It focuses on anatomical concerns unrelated to the issues I raise here and in my last post.]

The anterior cingulate cortex (ACC)—located immediately above the corpus callosum on the medial surface of the brain’s frontal cortex—is an intriguing brain region. Despite decades of extensive investigation in thousands of animal and human studies, understanding the function(s) of this region has proven challenging. Neuroscientists have proposed a seemingly never-ending string of hypotheses about what role it might play in emotion and/or cognition. The field of human neuroimaging has taken a particular shine to the ACC in the past two decades; if you’ve ever overheard some nerdy-looking people talking about “conflict monitoring”, “error detection”, or “reinforcement learning” in the human brain, there’s a reasonable chance they were talking at least partly about the role of the ACC.

In a new PNAS paper, Matt Lieberman and Naomi Eisenberger wade into the debate with what is quite possibly the strongest claim yet about ACC function, arguing (and this is a verbatim quote from the paper’s title) that “the dorsal anterior cingulate cortex is selective for pain”. That conclusion rests almost entirely on inspection of meta-analytic results produced by Neurosynth, an automated framework for large-scale synthesis of results from thousands of published fMRI studies. And while I’ll be the first to admit that I know very little about the anterior cingulate cortex, I am probably the world’s foremost expert on Neurosynth*—because I created it. I also have an obvious interest in making sure that Neurosynth is used with appropriate care and caution. In what follows, I provide my HIBAR reactions to the Lieberman & Eisenberger (2015) manuscript, focusing largely on whether L&E’s bold conclusion is supported by the Neurosynth findings they review (spoiler alert: no).

Before going any further, I should clarify my role in the paper, since I’m credited in the Acknowledgments section for “providing Neurosynth assistance”. My contribution consisted entirely of sending the first author (per an email request) an aggregate list of study counts for different terms on the Neurosynth website. I didn’t ask what it was for, he didn’t say what it was for, and I had nothing to do with any other aspect of the paper—nor did PNAS ask me to review it. None of this is at all problematic, from my perspective. My policy has always been that people can do whatever they want with any of the Neurosynth data, code, or results, without having to ask me or anyone else for permission. I do encourage people to ask questions or solicit feedback (we have a mailing list), but in this case the authors didn’t contact me before this paper was published (other than to request data). So being acknowledged by name shouldn’t be taken as an endorsement of any of the results.

With that out of the way, we can move onto the paper. The basic argument L&E make is simple, and largely hangs on the following observation about Neurosynth data: when we look for activation in the dorsal ACC (dACC) in various “reverse inference” brain maps on Neurosynth, the dominant associate is the term “pain”. Other candidate functions people have considered in relation to dACC—e.g., “working memory”, “salience”, and “conflict”—show (at least according to L&E) virtually no association with dACC. L&E take this as strong evidence against various models of dACC function that propose that the dACC plays a non-pain-related role in cognition—e.g., that it monitors for conflict between cognitive representations or detects salient events. They state, in no uncertain terms, that Neurosynth results “clearly indicated that the best psychological description of dACC function was related to pain processing – not executive, conflict, or salience processing”. This is a strong claim, and would represent a major advance in our understanding of dACC function if it were borne out. Unfortunately, it isn’t.

A crash course in reverse inference

To understand why, we need to understand the nature of the Neurosynth data L&E focus on. And to do that, we need to talk about something called reverse inference. L&E begin their paper by providing an excellent explanation of why the act of inferring mental states from patterns of brain activity (i.e., reverse inference—a term popularized in a seminal 2006 article by Russ Poldrack) is a difficult business. Many experienced fMRI researchers might feel that the issue has already been beaten to death (see for instance this, this, this, or this). Those readers are invited to skip to the next section.

For everyone else, we can summarize the problem by observing that the probability of a particular pattern of brain activity conditional on a given mental state is not the same thing as the probability of a particular mental state conditional on a given pattern of observed brain activity (i.e., P(activity|mental state) != P(mental state|activity)). For example, if I know that doing a difficult working memory task produces activation in the dorsolateral prefrontal cortex (DLPFC) 80% of the time, I am not entitled to conclude that observing DLPFC activation in someone’s brain implies an 80% chance that that person is doing a working memory task.

To see why, imagine that a lot of other cognitive tasks—say, those that draw on recognition memory, emotion recognition, pain processing, etc.—also happen to produce DLPFC activation around 80% of the time. Then we would be justified in saying that all of these processes consistently produce DLPFC activity, but we would have no basis for saying that DLPFC activation is specific, or even preferential, for any one of these processes. To make the latter claim, we would need to directly estimate the probability of working memory being involved given the presence of DLPFC activation. But this is a difficult proposition, because most fMRI studies only compare a small number of experimental conditions (typically with low statistical power), and cannot really claim to demonstrate that a particular pattern of activity is specific to a given cognitive process.
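To make the arithmetic concrete, here’s a toy reverse inference calculation in Python. The numbers are invented purely for illustration (they are not Neurosynth estimates): four candidate processes that all activate DLPFC equally often, each assumed to account for a quarter of the literature.

```python
# Toy reverse inference calculation with invented numbers, purely for illustration.
# Assume four candidate processes that all activate DLPFC 80% of the time, and that
# each accounts for a quarter of the studies in our hypothetical literature.
processes = {
    "working memory":      {"p_activation": 0.80, "base_rate": 0.25},
    "recognition memory":  {"p_activation": 0.80, "base_rate": 0.25},
    "emotion recognition": {"p_activation": 0.80, "base_rate": 0.25},
    "pain":                {"p_activation": 0.80, "base_rate": 0.25},
}

# Marginal probability of observing DLPFC activation at all.
p_activation = sum(p["p_activation"] * p["base_rate"] for p in processes.values())

# Bayes' rule: P(process | activation) = P(activation | process) * P(process) / P(activation)
for name, p in processes.items():
    posterior = p["p_activation"] * p["base_rate"] / p_activation
    print(f"P({name} | DLPFC activation) = {posterior:.2f}")

# Every posterior comes out to 0.25: an 80% forward probability buys us no
# specificity at all when other processes activate the region just as often.
```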

Unfortunately, a huge proportion of fMRI studies continue to draw strong reverse inferences on the basis of little or no quantitative evidence. The practice is particularly common in Discussion sections, when authors often want to say something more than just “we found a bunch of differences as a result of this experimental manipulation”, and end up drawing inferences about what such-and-such activation implies about subjects’ mental states on the basis of a handful of studies that previously reported activation in the same region(s). Many of these attributions could well be correct, of course; but the point is that it’s exceedingly rare to see any quantitative evidence provided in support of claims that are often fundamental to the interpretation authors wish to draw.

Fortunately, this is where large-scale meta-analytic databases like Neurosynth can help—at least to some degree. Because Neurosynth contains results from over 11,000 fMRI studies drawn from virtually every domain of cognitive neuroscience, we can use it to produce quantitative whole-brain reverse inference maps (for more details, see Yarkoni et al. (2011)). In other words, we can estimate the relative specificity with which a particular pattern of brain activity implies that some cognitive process is in play—provided we’re willing to make some fairly strong assumptions (which we’ll return to below).

The dACC, lost and found

Armed with an understanding of the forward/reverse inference distinction, we can now turn to the focus of the L&E paper: a brain region known as the dorsal anterior cingulate cortex (dACC). The first thing L&E set out to do, quite reasonably, is identify the boundaries of the dACC, so that it’s clear what constitutes the target of analysis. To this end, they compare the anatomically-defined boundaries of dACC with the boundaries found in the Neurosynth forward inference map for “dACC”. Here’s what they show us:

Figure 1 from Lieberman & Eisenberger (2015)

The blue outline in panel A is the anatomical boundary of dACC; the colorful stuff in B is the Neurosynth map for ‘dACC’. (It’s worth noting in passing that the choice to rely on anatomy as the gold standard here is not uncontroversial; given the distributed nature of fMRI activation and the presence of considerable registration error in most studies, another reasonable approach would have been to use a probabilistic template.) As you can see, the two don’t converge all that closely. Much of the Neurosynth map sits squarely inside preSMA territory rather than in dACC proper. As L&E report:

When “dACC” is entered as a term into a Neurosynth forward inference analysis (Fig. 1B), there is substantial activity present in the anatomically defined dACC region; however, there is also substantial activity present in the SMA/preSMA region. Moreover, the location with the highest Z-score in this analysis is actually in SMA, not dACC. The same is true if the term “anterior cingulate” is used (Fig. 1C).

L&E interpret this as a sign of confusion in the literature about the localization of dACC, and suggest that this observation might explain why people have misattributed certain functions to dACC:

These findings suggest that some of the disagreement over the function of the dACC may actually apply to the SMA/pre-SMA, rather than the dACC. In fact, a previous paper reporting that a reverse inference analysis for dACC was not selective for pain, emotion, or working memory (see figure 3 in ref. 13) seems to have used coordinates for the dACC that are in fact in the SMA/pre-SMA (MNI coordinates 2, 8, 50), not in the dACC.

This is an interesting point, and clearly has a kernel of truth to it, inasmuch as some researchers undoubtedly confuse dACC with more dorsal regions. As L&E point out, I made this mistake myself in the original Neurosynth paper (that’s the ‘ref. 13’ in the above quote); specifically, here’s the figure where I clearly labeled dACC in the wrong place:

oops.
Figure 3 from Yarkoni et al. (2011)

 

Mea culpa—I made a mistake, and I appreciate L&E pointing it out. I should have known better.

That said, L&E should also have known better, because they were among the first authors to ascribe a strong functional role to a region of dorsal ACC that wasn’t really dACC at all. I refer here to their influential 2003 Science paper on social exclusion, in which they reported that a region of dorsal ACC centered on (-6, 8, 45) was specifically associated with the feeling of social exclusion and concluded (based on the assumption that the same region was already known to be implicated in pain processing) that social pain shares core neural substrates with physical pain. Much of the ongoing debate over what the putative role of dACC is traces back directly to this paper. Yet it’s quite clear that the region identified in that paper was not the same as the one L&E now argue is the pain-specific dACC. At coordinates (-6, 8, 45), the top hits in Neurosynth are “SMA”, “motor”, and “supplementary motor”. If we scan down to the first cognitive terms, we find the terms “task”, “execution”, and “orthographic”. “Pain” is not significantly associated with activation at this location at all. So, to the extent that people have mislabeled this region in the past, L&E would appear to share much of the blame. Which is fine—we all make mistakes. But given the context, I think it would behoove L&E to clarify their own role in perpetuating this confusion.

Still, even if L&E are correct that a subset of researchers have sometimes confused dACC and pre-SMA, they’re clearly wrong to suggest that the cognitive neuroscience community as a whole is guilty of the same confusion. A perplexing aspect of their argument is that they base their claim of localization confusion entirely on inspection of the forward inference Neurosynth map for “dACC”—an odd decision, coming immediately after several paragraphs in which they lucidly explain why a forward inference analysis is exactly the wrong way to determine what brain regions are specifically associated with a particular term. If you want to use Neurosynth to find out where people think dACC is, you should use the reverse inference map, not the forward inference map. All the forward inference map tells you is where studies that use the term “dACC” tend to report activation most often. But as discussed above, and in the L&E paper, that estimate will be heavily biased by differences between regions in the base rate of activation.

Perhaps in tacit recognition of this potential criticism, L&E go on to suggest that the alleged “distortion” problem isn’t ubiquitous, and doesn’t happen in regions like the amygdala, hippocampus, or posterior cingulate:

We tested several other anatomical terms including “amygdala,” “hippocampus,” “posterior cingulate,” “basal ganglia,” “thalamus,” “supplementary motor,” and “pre sma.” In each of these regions, the location with the highest Z-score was within the expected anatomical boundaries. Only within the dACC did we find this distortion. These results indicate that studies focused on the dACC are more likely to be reporting SMA/pre-SMA activations than dACC activations.

But this isn’t quite right. While it may be the case that dACC was the only brain region among the ones L&E examined that showed this “distortion,” it’s certainly not the only brain region that shows this pattern. For example, the forward inference maps for “DMPFC” and “middle cingulate” (and probably others—I only spent a couple of minutes looking) show peak voxels in pre-SMA and the anterior insula, respectively, and not within the boundaries of the expected anatomical structures. If we take L&E’s “localization confusion” explanation seriously, we would be forced to conclude not only that cognitive neuroscientists generally don’t know where dACC is, but also that they don’t know DMPFC from pre-SMA or mid-cingulate from anterior insula. I don’t think this is a tenable suggestion.

For what it’s worth, Neurosynth clearly agrees with me: the “distortion” L&E point to completely vanishes as soon as one inspects the reverse inference map for “dacc” rather than the forward inference map. Here’s what the two maps look like, side-by-side (incidentally, the code and data used to generate this plot and all the others in this post can be found here):

Meta-analysis of ‘dACC’ in Neurosynth: forward and reverse inference maps (voxel-wise p < .001, uncorrected).

You can see that the extent of dACC in the bottom row (reverse inference) is squarely within the area that L&E take to be the correct extent of dACC (see their Figure 1). So, when we follow L&E’s recommendations, rather than their actual practice, there’s no evidence of any spatial confusion. Researchers (collectively, at least) do know where dACC is. It’s just that, as L&E themselves argue at length earlier in the paper, you would expect to find evidence of that knowledge in the reverse inference map, and not in the forward inference map.

The unobjectionable claim: dACC is associated with pain

Localization issues aside, L&E clearly do have a point when they note that there appears to be a relatively strong association between the posterior dACC and pain. Of course, it’s not a novel point. It couldn’t be, given that L&E’s 2003 claim that social pain and physical pain share common mechanisms was already predicated on the assumption that the dACC is selectively implicated in pain (even though, as I noted above, the putative social exclusion locus reported in that paper was actually centered in preSMA and not dACC). Moreover, the Neurosynth pain meta-analysis map that L&E used has been online for nearly 5 years now. Since the reverse inference map is loaded by default on Neurosynth, and the sagittal orthview is by default centered on x = 0, one of the first things anybody sees when they visit this page is the giant pain-related blob in the anterior cingulate cortex. When I give talks on Neurosynth, the preferential activation for pain in the posterior dACC is one of the most common examples I use to illustrate the importance of reverse inference.

But you don’t have to take my word for any of this, because my co-authors and I made this exact point in the 2011 paper introducing Neurosynth, where we observed that:

For pain, the regions of maximal pain-related activation in the insula and DACC shifted from anterior foci in the forward analysis to posterior ones in the reverse analysis. This is consistent with studies of nonhuman primates that have implicated the dorsal posterior insula as a primary integration center for nociceptive afferents and with studies of humans in which anterior aspects of the so-called ‘pain matrix’ responded nonselectively to multiple modalities.

Contrary to what L&E suggest, we did not claim in our paper that reverse inference analysis demonstrates that the dACC is not preferentially associated with any cognitive function; we made the considerably weaker point that accounting for differences in the base rate of activation changes the observed pattern of association for many terms. And we explicitly noted that there is preferential activation for pain in dACC and insula—much as L&E themselves do.

The objectionable claim: dACC is selective for pain

Of course, L&E go beyond the claims made in Yarkoni et al (2011)—and what the Neurosynth page for pain reveals—in that they claim not only that pain is preferentially associated with dACC, but that “the clearest account of dACC function is that it is selectively involved in pain-related processes.” The latter is a much stronger claim, and, if anything, is directly contradicted by the very same kind of evidence (i.e., Neurosynth maps) L&E claim to marshal in its support.

Perhaps the most obvious problem with the claim is that it’s largely based on comparison of pain with just three other groups of terms, reflecting executive function, cognitive conflict, and salience**. This is, on its face, puzzling evidence for the claim that the dACC is pain-selective. By analogy, it would be like giving people a multiple choice question asking whether their favorite color is green, fuchsia, orange, or yellow, and then proclaiming, once results were in, that the evidence suggests that green is the only color people like.

Given that Neurosynth contains more than 3,000 terms, it’s not clear why L&E only compared pain to 3 other candidates. After all, it’s entirely conceivable that dACC might be much more frequently activated by pain than by conflict or executive control, and still also be strongly associated with a large number of other functions. L&E’s only justification for this narrow focus, as far as I can tell, is that they’ve decided to only consider candidate functions that have been previously proposed in the literature:

We first examined forward inference maps for many of the psychological terms that have been associated with dACC activity. These terms were in the categories of pain (“pain”, “painful”, “noxious”), executive control (“executive”, “working memory”, “effort”, “cognitive control”, “cognitive”, “control”), conflict processing (“conflict”, “error”, “inhibition”, “stop signal”, “Stroop”, “motor”), and salience (“salience”, “detection”, “task relevant”, “auditory”, “tactile”, “visual”).

This seems like an odd decision considering that one can retrieve a rank-ordered listing of 3,000+ terms from Neurosynth at the push of a button. More importantly, L&E also omit a bunch of other accounts of dACC function that don’t focus on the above categories—for example, that the dACC is involved in various aspects of value learning (e.g., Kennerley et al., 2006; Behrens et al., 2007), autonomic control (e.g., Critchley et al., 2003), or fear processing (e.g., Milad et al., 2007). In effect, L&E are not really testing whether dACC is selective for pain; what they’re doing is, at best, testing whether the dACC is preferentially associated with pain in comparison to a select number of other candidate processes.

To be fair, L&E do report inspecting the full term rankings, even if they don’t report them explicitly:

Beyond the specific terms we selected for analyses, we also identified which psychological term was associated with the highest Z-score for each of the 8 dACC locations across all the psychological terms in the NeuroSynth database. Despite the fact that there are several hundred psychological terms in the NeuroSynth database, “pain” was the top term for 6 out of 8 locations in the dACC.

This may seem compelling at face value, but there are several problems. First, z-scores don’t provide a measure of strength of effect; they provide (at best) a measure of strength of evidence. Pain has been extensively studied in the fMRI literature, so it’s not terribly surprising if z-scores for pain are larger than z-scores for many other terms in Neurosynth. Saying that dACC is specific to pain because it shows the strongest z-score is like saying that SSRIs are the only effective treatment for depression because a drug study with a sample size of 3,000 found a smaller p-value than a cognitive-behavioral therapy (CBT) study of 100 people. If we want to know if SSRIs beat CBT as a treatment for depression, we need to directly compare effect sizes for the two treatments, not p-values or z-scores. Otherwise we’re conflating how much evidence there is for each effect with how big the effect is. At best, we might be able to claim that we’re more confident that there’s a non-zero association between dACC activation and pain than that there’s a non-zero association between dACC activation and, say, conflict monitoring. But that doesn’t constitute evidence that the dACC is more strongly associated with pain than with conflict.
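To see how sample size alone can drive this kind of ranking, here’s a toy calculation (again with invented numbers, not Neurosynth values): two terms with exactly the same activation rate in a region, but very different study counts, yield very different z-scores against a chance baseline of 50%.

```python
import math

def one_sample_proportion_z(k, n, p0=0.5):
    """z-statistic for k 'activating' studies out of n, against a null activation rate p0."""
    p_hat = k / n
    se = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se

# Two hypothetical terms with the *same* activation rate (65% of studies activate
# the region), but very different amounts of data behind them.
z_heavily_studied = one_sample_proportion_z(k=650, n=1000)
z_lightly_studied = one_sample_proportion_z(k=65, n=100)

print(f"heavily studied term: p_hat = 0.65, z = {z_heavily_studied:.1f}")  # ~9.5
print(f"lightly studied term: p_hat = 0.65, z = {z_lightly_studied:.1f}")  # ~3.0

# Identical effect sizes, wildly different z-scores: ranking terms by z mostly
# tells us how much evidence there is, not how strong the association is.
```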

Second, if one looks at effect size estimates rather than z-scores—which is exactly what one should do if the goal is to make claims about the relative strengths of different associations—then it’s clearly not true that dACC is specific to pain. For the vast majority of voxels within the dACC, ranking associates in descending order of posterior probability puts some term other than pain in the top spot. For example, for coordinates (0, 22, 26), we get ‘experiencing’ as the top associate (PP = 86%), then pain (82%), then ‘empathic’ (81%). These results seem to cast dACC in a very different light than simply saying that dACC is involved in pain. Don’t like (0, 22, 26)? Okay, pick a different dACC coordinate. Say (4, 10, 28). Now the top associates are ‘aversive’ (79%), ‘anxiety disorders’ (79%), and ‘conditioned’ (78%) (‘pain’ is a little ways back, hanging out with ‘heart’, ‘skin conductance’, and ‘taste’). Or maybe you’d like something more anterior. Well, at (-2, 30, 22), we have ‘abuse’ (85%), ‘incentive delay’ (84%), ‘nociceptive’ (83%), and ‘substance’ (83%). At (0, 28, 16), we have ‘dysregulation’ (84%), ‘heat’ (83%), and ‘happy faces’ (82%). And so on.

Why didn’t L&E look at the posterior probabilities, which would have been a more appropriate way to compare different terms? They justify the decision as follows:

Because Z-scores are less likely to be inflated from smaller sample sizes than the posterior probabilities, our statistical analyses were all carried out on the Z-scores associated with each posterior probability (21).

While it’s true that terms with fewer associated studies will have more variable (i.e., extreme) posterior probability estimates, this is an unavoidable problem that isn’t in any way remedied by focusing on z-scores instead of posterior probabilities. If some terms have too few studies in Neurosynth to support reliable comparisons with pain, the appropriate thing to do is to withhold judgment until more data is available. One cannot solve the problem of data insufficiency by pretending that p-values or z-scores are measures of effect size.

Meta-analytic contrasts in Neurosynth

It doesn’t have to be this way, mind you. If we want to directly compare effect sizes for different terms—which I think is what L&E want, even if they don’t actually do it—we can do that fairly easily using Neurosynth (though you have to use the Python core tools, rather than the website). The crux of the approach is that we need to directly compare the two conditions (or terms) using only those studies in the Neurosynth database that load on exactly one of the two target terms. This typically results in a rather underpowered test, because we end up working with only a few hundred studies, rather than the full database of 11,000+ studies. But such is our Rumsfeldian life—we do analysis with the data we have, not the data we wish we had.
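To make the procedure concrete without trying to reproduce the package’s exact method names from memory (the actual code used for the figures below is linked above), here’s a sketch of the underlying logic in plain numpy, run on simulated stand-in data:

```python
import numpy as np

# Simulated stand-in data (names and shapes are made up for illustration; in
# practice these would come from the Neurosynth database):
#   activation:   (n_studies x n_voxels) binary matrix; 1 = study reports activation near that voxel
#   has_pain / has_salience: which studies load on each term
rng = np.random.default_rng(0)
n_studies, n_voxels = 11000, 500
activation = rng.random((n_studies, n_voxels)) < 0.05
has_pain = rng.random(n_studies) < 0.03
has_salience = rng.random(n_studies) < 0.03

# The crux of the contrast: keep only studies that load on exactly one of the two terms.
pain_only = has_pain & ~has_salience
salience_only = has_salience & ~has_pain

# Voxel-wise activation rates in each set of studies...
p1, n1 = activation[pain_only].mean(axis=0), pain_only.sum()
p2, n2 = activation[salience_only].mean(axis=0), salience_only.sum()

# ...compared with a two-proportion z-test (the real tools use a similar
# count-based comparison, plus multiple-comparisons correction).
pooled = (activation[pain_only].sum(axis=0) + activation[salience_only].sum(axis=0)) / (n1 + n2)
z = (p1 - p2) / np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

# |z| > 3.29 corresponds roughly to two-tailed p < .001, uncorrected.
print(f"{(np.abs(z) > 3.29).sum()} of {n_voxels} voxels differ at p < .001")
```

With random stand-in data essentially nothing survives, of course; the point is just the study-selection and comparison logic, which is what produces maps like the ones below.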

In any case, if we conduct direct meta-analytic contrasts of pain versus a bunch of other terms like salience, emotion, and cognitive control, we get results that look like this:

Meta-analytic contrasts involving pain or autonomic function (p < .001, uncorrected).

These maps are thresholded very liberally (p < .001, uncorrected), so we should be wary of reading too much into them. And, as noted above, power for meta-analytic contrasts in Neurosynth is typically quite low. Still, it’s pretty clear that the results don’t support L&E’s conclusion. While pain does indeed activate the dACC with significantly higher probability than some other topics (e.g., emotion or touch), it doesn’t differentiate pain from a number of other viable candidates (e.g., salience, fear, and autonomic control). Moreover, there are other contrasts not involving pain that also elicit significant differences—e.g., between autonomic control and emotion, or fear and cognitive control.

Given that this is the correct way to test for activation differences between different Neurosynth maps, if we were to take seriously the idea that more frequent dACC activation in pain studies than in other kinds of studies implies pain selectivity, the above results would seem to indicate that dACC isn’t selective to pain (or at least, that there’s no real evidence for that claim). Perhaps we could reasonably say that dACC cares more about pain than, say, emotion (though, as discussed below, even that’s not a given); but that’s hardly the same thing as saying that “the best psychological description of dACC function is related to pain processing”.

A > B does not imply ~B

Of course, we wouldn’t want to buy L&E’s claim that the dACC is selective for pain even if the dACC did show significantly more frequent activation for pain than for all other terms, because showing that dACC activation is greater for task A than task B (or even tasks B through Z) doesn’t entail that the dACC is not also important for task B. By analogy, demonstrating that people on average prefer the color blue to the color green doesn’t entitle us to conclude that nobody likes green.

In fairness, L&E do say that the other candidate terms they examined don’t show any associations with the dACC in the Neurosynth reverse inference maps. For instance, they show us this figure:

Figure from Lieberman & Eisenberger (2015): reverse inference maps showing little dACC activity for terms other than pain.

A cursory inspection indeed reveals very little going on for terms other than pain. But this is pretty woeful evidence for the claim of no effect, as it’s based on low-resolution visual inspection of just one mid-sagittal brain slice for just a handful of terms. The only quantitative support L&E marshal for their “nothing else activates dACC” claim is an inspection of activation at 8 individual voxels within dACC, which they report largely fail to activate for anything other than pain. The latter is not a very comprehensive analysis, and makes one wonder why L&E didn’t do something a little more systematic given the strength of their claim (e.g., they could have averaged over all dACC voxels and tested whether activation occurs more frequently than chance for each term).

As it turns out, when we look at the entire dACC rather than just 8 voxels, there’s plenty of evidence that the dACC does in fact care about things other than pain. You can easily see this on neurosynth.org just by browsing around for a few minutes, but to spare you the trouble, here are reverse inference maps for a bunch of terms that L&E either didn’t analyze at all, or looked at in only the 8 selected voxels (the pain map is displayed in the first row for reference):

Reverse inference maps for selected Neurosynth topics that display activation in dACC (p < .001, uncorrected).

In every single one of these cases, we see significant associations with dACC activation in the reverse inference meta-analysis. The precise location of activation varies from case to case (which might lead us to question whether it makes sense to talk about dACC as a monolithic system with a unitary function), but the point is that pain is clearly not the only process that activates dACC. So the notion that dACC is selective to pain doesn’t survive scrutiny even if you use L&E’s own criteria.

The limits of Neurosynth

All of the above problems are, in my view, already sufficient to lay the argument that dACC is pain-selective to rest. But there’s another, more general problem with the L&E analysis that would be sufficient to warrant extreme skepticism about their conclusion even if you knew nothing at all about the details of the analysis. Namely, in arguing for pain selectivity, L&E ignore many of the known limitations of Neurosynth. There are a number of reasons to think that—at least in its present state—Neurosynth simply can’t support the kind of inference that L&E are trying to draw. While L&E do acknowledge some of these limitations in their Discussion section, in my view, they don’t take them nearly as seriously as they ought to.

First, it’s important to remember that Neurosynth can’t directly tell us whether activation is specific to pain (or any other process), because terms in Neurosynth are just that—terms. They’re not carefully assigned task labels, let alone actual mental states. The strict interpretation of a posterior probability of 80% for pain in a dACC voxel is that, if we were to take 11,000 published fMRI studies and pretend that exactly 50% of them included the term ‘pain’ in their abstracts, the presence of activation in the voxel in question should increase our estimate of the likelihood of the term ‘pain’ occurring from 50% to 80%. If this seems rather weak, that’s because it is. It’s something of a leap to go from words in abstracts to processes in people’s heads.

Now, in most cases, I think it’s a perfectly defensible leap. I don’t begrudge anyone for treating Neurosynth terms as if they were decent proxies for mental states or cognitive tasks. I do it myself all the time, and I don’t feel apologetic about it. But that’s because it’s one thing to use Neurosynth to support a loose claim like “some parts of the dACC are preferentially associated with pain”, and quite another to claim that the dACC is selective for pain,  that virtually nothing else activates dACC, and that “pain represents the best psychological characterization of dACC function”. The latter is an extremely strong claim that requires one to demonstrate not only that there’s a robust association between dACC and pain (which Neurosynth supports), but also that (i) the association is meaningfully stronger than every other potential candidate, and (ii) no other process activates dACC in a meaningful way independently of its association with pain. L&E have done neither of these things, and frankly, I can’t imagine how they could do such a thing—at least, not with Neurosynth.

Second, there’s the issue of bias. Terms in Neurosynth are only good proxies for mental processes to the extent that they’re accurately represented in the literature. One important source of bias many people often point to (including L&E) is that if the results researchers report are colored by their expectations—which they almost certainly are—then Neurosynth is likely to reflect that bias. So, for example, if people think dACC supports pain, and disproportionately report activation in dACC in their papers (relative to other regions), the Neurosynth estimate of the pain-dACC association is likely to be biased upwards. I think this is a legitimate concern, though (for technical reasons I won’t get into here) I also think it’s overstated. But there’s a second source of bias that I think is likely to be much more problematic in this particular case, which is that Neurosynth estimates (and, for that matter, estimates from every other large-scale meta-analysis, irrespective of database or method) are invariably biased to some degree by differences in the strength of different experimental manipulations.

To see what I mean, consider that pain is quite easy to robustly elicit in the scanner in comparison with many other processes or states. Basically, you attach some pain-inducing device to someone’s body and turn it on. If the device is calibrated properly and the subject has normal pain perception, you’re pretty much guaranteed to produce the experience of pain. In general, that effect is likely to be large, because it’s easy to induce fairly intense pain in the scanner.

Contrast that with, say, emotion tasks. It’s an open secret in much of emotion research that what passes for an “emotional” stimulus is usually pretty benign by the standards of day-to-day emotional episodes. A huge proportion of studies use affective pictures to induce emotions like fear or disgust, and while there’s no doubt that such images successfully induce some change in emotional state, there are very few subjects who report large changes in experienced emotion (if you doubt this, try replacing the “extremely disgusted” upper anchor of your rating scale with “as disgusted as I would feel if someone threw up next to me” in your next study). One underappreciated implication of this is that if we decide to meta-analytically compare brain activation during emotion with brain activation during pain, our results are necessarily going to be biased by differences in the relative strengths of the two kinds of experimental manipulation—independently of any differences in the underlying neural substrates of pain and emotion. In other words, we may be comparing apples to oranges without realizing it. If we suppose, for the sake of argument, that the dACC plays the same role in pain and emotion, and then compare strong manipulations of pain with weak manipulations of emotion, we would be confounding differences in experimental strength with differences in underlying psychology and biology. And we might well conclude that dACC is more important for pain than emotion—all because we have no good way of correcting for this rather mundane bias.

In point of fact, I think something like this is almost certainly true for the pain map in Neurosynth. One way to see this is to note that when we meta-analytically compare pain with almost any other term in Neurosynth (see the figure above), there are typically a lot of brain regions (extending well outside of dACC and other putative pain regions) that show greater activation for pain than for the comparison condition, and very few brain regions that show the converse pattern. I don’t think it’s plausible that much of the brain really prizes pain representation above all else. A more sensible interpretation is that the Neurosynth posterior probability estimates for pain are inflated to some degree by the relative ease of inducing pain experimentally. I’m not sure there’s any good way to correct for this, but given that small differences in posterior probabilities (e.g., going from 80% to 75%) would probably have large effects on the rank order of different terms, I think the onus is on L&E to demonstrate why this isn’t a serious concern for their analysis.

But it’s still good for plenty of other stuff!

Having spent a lot of time talking about Neurosynth’s limitations—and all the conclusions one can’t draw from reverse inference maps in Neurosynth—I want to make sure I don’t leave you with the wrong impression about where I see Neurosynth fitting into the cognitive neuroscience ecosystem. Despite its many weaknesses, I still feel quite strongly that Neurosynth is one of the most useful tools we have at the moment for quantifying the relative strengths of association between psychological processes and neurobiological substrates. There are all kinds of interesting uses for the data, website, and software that are completely unobjectionable. I’ve seen many published articles use Neurosynth in a variety of interesting ways, and a few studies have even used Neurosynth as their primary data source (and my colleagues and I have several more on the way). Russ Poldrack and I have a forthcoming paper in Annual Review of Psychology in which we review some of the ways databases like Neurosynth can play an invaluable role in the brain mapping enterprise. So clearly, I’m the last person who would tell anyone that Neurosynth isn’t useful for anything. It’s useful for a lot of things; but it probably shouldn’t be the primary source of evidence for very strong claims about brain-cognition or brain-behavior relationships.

What can we learn about the dACC using Neurosynth? A number of things. Here are some conclusions I think one can reasonably draw based solely on inspection of Neurosynth maps:

  • There are parts of dACC (particularly the more posterior aspects) that are preferentially activated in studies involving painful stimulation.
  • It’s likely that parts of dACC play a greater role in some aspect of pain processing than in many other candidate processes that at various times have been attributed to dACC (e.g., monitoring for cognitive conflict)—though we should be cautious, because in some cases some of those other functions are clearly represented in dACC, just in different sectors.
  • Many of the same regions of dACC that preferentially activate during pain are also preferentially activated by other processes or tasks—e.g., fear conditioning, autonomic arousal, etc.

I think these are all interesting and potentially important observations. They’re hardly novel, of course, but it’s still nice to have convergent meta-analytic support for claims that have been made using other methods.

So what does the dACC do?

Having read this far, you might be thinking, well if dACC isn’t selective for pain, then what does it do? While I don’t pretend to have a good answer to this question, let me make three tentative observations about the potential role of dACC in cognition that may or may not be helpful.

First, there’s actually no particular reason why dACC has to play any unitary role in cognition. It may be a human conceit to think that just because we can draw some nice boundaries around a region and give it the name ‘dACC’, there must be some corresponding sensible psychological process that passably captures what all the neurons within that chunk of tissue are doing. But the dACC is a large brain region that contains hundreds of millions of neurons with enormously complex response profiles and connectivity patterns. There’s no reason why nature should respect our human desire for simple, interpretable models of brain function. To the contrary, our default assumption should probably be that there’s considerable functional heterogeneity within dACC, so that slapping a label like “pain” onto the entire dACC is almost certainly generating more heat than light.

Second, to the degree that we nevertheless insist on imposing a single unifying label on the entire dACC, it’s very unlikely that a generic characterization like “pain” is up to the job. While we can reasonably get away with loosely describing some (mostly sensory) parts of the brain as broadly supporting vision or motor function, the dACC—a frontal region located much higher in the processing hierarchy—is unlikely to submit to a similar analysis. It’s telling that most of the serious mechanistic accounts of dACC function have shied away from extensional definitions of regional function like “pain” or “emotion” and have instead focused on identifying broad computational roles that dACC might play. Thus, we have suggestions that dACC might be involved in response selection, conflict monitoring, or value learning. While these models are almost certainly wrong (or at the very least, grossly incomplete), they at least attempt to articulate some kind of computational role dACC circuits might be playing in cognition. Saying that the dACC is for “pain”, by contrast, tells us nothing about the nature of the representations in the region.

To their credit, L&E do address this issue to some extent. Specifically, they suggest that the dACC may be involved in monitoring for “survival-relevant goal conflicts”. Admittedly, it’s a bit odd that L&E make such a suggestion at all, seeing as it directly contradicts everything they argue for in the rest of the paper (i.e., if the dACC supports detection of the general class of things that are relevant for survival, then it is by definition not selective for pain, and vice versa). Contradictions aside, however, L&E’s suggestion is not completely implausible. As the Neurosynth maps above show, the dACC is clearly preferentially activated by fear conditioning, autonomic control, and reward—all of which could broadly be construed as “survival-relevant”. The main difficulty for L&E’s survival account comes from (a) the lack of evidence of dACC involvement in other clearly survival-relevant stimuli or processes—e.g., disgust, respiration, emotion, or social interaction, and (b) the availability of other much more plausible theories of dACC function (see the next point). Still, if we’re relying strictly on Neurosynth for evidence, we can give L&E the benefit of the doubt and reserve judgment on their survival-relevant account until more data becomes available. In the interim, what should not be controversial is that such an account has no business showing up in a paper titled “the dorsal anterior cingulate cortex is selective for pain”—a claim it is completely incompatible with.

Third, theories of dACC function based largely on fMRI evidence don’t (or shouldn’t) operate in a vacuum. Over the past few decades, literally thousands of animal and human studies have investigated the structure and function of the anterior cingulate cortex. Many of these studies have produced considerable insights into the role of the ACC (including dACC), and I think it’s safe to say that they collectively offer a much richer understanding than what fMRI studies—let alone a meta-analytic engine like Neurosynth—have produced to date. I’m especially partial to the work of Brent Vogt and colleagues (e.g., Vogt (2005); Vogt & Sikes, 2009), who have suggested a division within the anterior mid-cingulate cortex (aMCC; a region roughly co-extensive with the dACC in L&E’s nomenclature) between a posterior region involved in bodily orienting, and an anterior region associated with fear and avoidance behavior (though the two functions overlap in space to a considerable degree). Schematically, their “four-region” architectural model looks like this:

The Vogt et al. “four-region” model of cingulate architecture. Fig. 14.13 in Vogt & Sikes (2009).

While the aMCC is assumed to contain many pain-selective neurons (as do more anterior sectors of the cingulate), it’s demonstrably not pain-selective, as neurons throughout the aMCC also respond to other stimuli (e.g., non-painful touch, fear cues, etc.).

Aside from being based on an enormous amount of evidence from lesion, electrophysiology, and imaging studies, the Vogt characterization of dACC/aMCC has several other nice features. For one thing, it fits almost seamlessly with the Neurosynth results displayed above (e.g., we find MCC activation associated with pain, fear, autonomic, and sensorimotor processes, with pain and fear overlapping closely in aMCC). For another, it provides an elegant and parsimonious explanation for the broad extent of pain-related activation in anterior cingulate cortex even though no part of aMCC is selective for pain (i.e., unlike other non-physical stimuli, pain involves skeletomotor orienting, and unlike non-painful touch, it elicits avoidance behavior and subjective unpleasantness).

Perhaps most importantly, Vogt and colleagues freely acknowledge that their model—despite having a very rich neuroanatomical elaboration—is only an approximation. They don’t attempt to ascribe a unitary role to aMCC or dACC, and they explicitly recognize that there are distinct populations of neurons involved in reward processing, response selection, value learning, and other aspects of emotion and cognition all closely interdigitated with populations involved in aspects of pain, touch, and fear. Other systems-level neuroanatomical models of cingulate function share this respect for the complexity of the underlying circuitry—complexity that cannot be adequately approximated by labeling the dACC simply as a pain region (or, for that matter, a “survival-relevance” region).

Conclusion

Lieberman & Eisenberger (2015) argue, largely on the basis of evidence from my Neurosynth framework, that the dACC is selective for pain. They are wrong. Neurosynth does not—and, at present, cannot—support such a conclusion. Moreover, a more careful examination of Neurosynth results directly refutes Lieberman and Eisenberger’s claims, providing clear evidence that the dACC is associated with many other operations, and converging with extensive prior animal and human work to suggest a far more complex view of dACC function.

 

* This is probably the first time I’ve been able to call myself the world’s foremost expert on anything while keeping a straight face. It feels pretty good.

** L&E show meta-analysis maps for a few more terms in an online supplement, but barely discuss them, even though at least one term (fear) clearly activates very similar parts of dACC.

the mysterious inefficacy of weather

I like to think of myself as a data-respecting guy–by which I mean that I try to follow the data wherever it leads, and work hard to suppress my intuitions in cases where those intuitions are convincingly refuted by the empirical evidence. Over the years, I’ve managed to argue myself into believing many things that I would have once found ludicrous–for instance, that parents have very little influence on their children’s personalities, or that in many fields, the judgments of acclaimed experts with decades of training are only marginally better than those of people selected at random, and often considerably worse than simple actuarial models. I believe these things not because I want to or like to, but because I think a dispassionate reading of the available evidence suggests that that’s just how the world works, whether I like it or not.

Still, for all of my efforts, there are times when I find myself unable to set aside my intuitions in the face of what would otherwise be pretty compelling evidence. A case in point is the putative relationship between weather and mood. I think most people–including me–take it as a self-evident fact that weather exerts a strong effect on mood. Climate is one of the first things people bring up when discussing places they’ve lived or visited. When I visit other cities and talk to people about what Austin, Texas (my current home) is like, my description usually amounts to something like it’s an amazing place to live so long as you don’t mind the heat. When people talk about Seattle, they bitch about the rain and the clouds; when people rave about living in California, they’re often thinking in no small part about the constant sunshine that pervades most of the state. When someone comments on the absurdly high rate of death metal bands in Finland, our first reaction is to chuckle and think well, what the hell else is there to do that far up north in the winter?–a reaction promptly followed by a twinge of guilt, because Seasonal Affective Disorder is no laughing matter.

And yet… and yet, the empirical evidence linking variations in the weather to variations in human mood is surprisingly scant. There are a few published reports of very large effects of weather on mood going back several decades, but these are invariably from very small samples–and we know that big correlations tend to occur in little studies. By contrast, large-scale studies with hundreds or thousands of subjects have found very little evidence of a relationship between mood and weather–and the effects identified are not necessarily consistent across studies.

For example, Denissen and colleagues (2008) fit a series of multilevel models of the relationship between objective weather parameters and self-reported mood in 1,233 German subjects, and found only very small associations between weather variables and negative (but not positive) affect. Klimstra et al. (2011) found similarly negligible main effects in another sample of ~500 subjects. The state of the empirical literature on weather and mood was nicely summed up by Denissen et al. in their Discussion:

As indicated by the relatively small regression weights, weather fluctuations accounted for very little variance in people’s day-to-day mood. This result may be unexpected given the existence of commonly held conceptions that weather exerts a strong influence on mood (Watson, 2000), though it replicates findings by Watson (2000) and Keller et al. (2005), who also failed to report main effects. –Denissen et al. (2008)
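For concreteness, the kind of model Denissen et al. fit looks roughly like the sketch below, here run on simulated data with statsmodels. Everything about the data is invented (variable names, sample sizes, effect sizes), and the published analyses included more predictors and person-level moderators; the point is just the structure: daily mood reports nested within subjects, weather as a fixed effect, and a subject-level random effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data: daily mood reports nested within subjects, with a
# deliberately tiny weather effect built in (in line with the published estimates).
rng = np.random.default_rng(42)
n_subjects, n_days = 200, 30
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), n_days),
    "sunshine_hours": rng.uniform(0, 12, n_subjects * n_days),
    "temperature": rng.normal(15, 8, n_subjects * n_days),
})
subject_baseline = rng.normal(0, 1, n_subjects)[df["subject"]]
df["neg_affect"] = (subject_baseline
                    - 0.02 * df["sunshine_hours"]       # tiny weather effect
                    + rng.normal(0, 1, len(df)))        # day-to-day noise

# Multilevel model: weather variables as fixed effects, a random intercept per subject.
model = smf.mixedlm("neg_affect ~ sunshine_hours + temperature", df, groups=df["subject"])
result = model.fit()
print(result.summary())  # weather coefficients are tiny next to the between- and within-person variance
```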

With the advent of social media and that whole Big Data thing, we can now conduct analyses on a scale that makes the Denissen or Klimstra studies look almost like case studies. In particular, the availability of hundreds of millions of tweets and facebook posts, coupled with comprehensive weather records from every part of the planet, means that we can now investigate the effects of almost every kind of weather pattern (cloud cover, temperature, humidity, barometric pressure, etc.) on many different indices of mood. And yet, here again, the evidence is not very kind to our intuitive notion of a strong association between weather and mood.

For example, in a study of 10 million facebook users in 100 US cities, Coviello et al (2014) found that the incidence of positive posts decreased by approximately 1%, and that of negative posts increased by 1%, on days when rain fell compared to days without rain. While that finding is certainly informative (and served as a starting point for other much more impressive analyses of network contagion), it’s not a terribly impressive demonstration of weather’s supposedly robust impact on mood. I mean, a 1% increase in rain-induced negative affect is probably not what’s really keeping anyone from moving to Seattle. Yet if anyone’s managed to detect a much bigger effect of weather on mood in a large-sample study, I’m not aware of it.

I’ve also had the pleasure of experiencing the mysterious absence of weather effects firsthand: as a graduate student, I once spent nearly two weeks trying to find effects of weather on mood in a large dataset (thousands of users from over twenty cities worldwide) culled from LiveJournal, taking advantage of users’ ability to indicate their mood in a status field via an emoticon (a feat of modern technology that’s now become nearly universal thanks to the introduction of those 4-byte UTF-8 emoji monstrosities 🙀👻🍧😻). I stratified my data eleventy different ways; I tried kneading it into infinity-hundred pleasant geometric shapes; I sang to it in the shower and brought it ice cream in bed. But nothing worked. And I’m pretty sure it wasn’t that my analysis pipeline was fundamentally broken, because I did manage (as a sanity check) to successfully establish that LiveJournal users are more likely to report feeling “cold” when the temperature outside is lower (❄️😢). So it’s not like physical conditions have no effect on people’s internal states. It’s just that the obvious weather variables (temperature, rain, humidity, etc.) don’t seem to shift our mood very much, despite our persistent convictions.

Needless to say, that project is currently languishing quite comfortably in the seventh level of file drawer hell (i.e., that bottom drawer that I locked then somehow lost the key to).

Anyway, the question I’ve been mulling over on and off for several years now–though, two-week data-mining binge aside, never for long enough to actually arrive at a satisfactory answer–is why empirical studies have been largely unable to detect an effect of weather on mood. Here are some of the potential answers I’ve come up with:

  • There really isn’t a strong effect of weather on mood, and the intuition that there is one stems from a perverse kind of cultural belief or confirmation bias that leads us all to behave in very strange, and often life-changing, ways–for example, to insist on moving to Miami instead of Seattle (which, climate aside, would be a crazy move, right?). This certainly allows for the possibility that there are weak effects on mood–which plenty of data already support–but then, that’s not so exciting, and doesn’t explain why so many people are so eager to move to Hawaii or California for the great weather.

  • Weather does exert a big effect on mood, but it does so in a highly idiosyncratic way that largely averages out across individuals. On this view, while most people’s mood might be sensitive to weather to some degree, the precise manifestation differs across individuals, so that some people would rather shoot themselves in the face than spend a week in an Edmonton winter, while others will swear up and down that it really is possible (no, literally!) to melt in the heat of a Texas summer. From a modeling standpoint, if the effects of weather on mood are reliable but extremely idiosyncratic, identifying consistent patterns could be a very difficult proposition, as it would potentially require us to model some pretty complex higher-order interactions (see the toy simulation right after this list). And the difficulty is further compounded by strong geographic selection biases: since people tend to move to places where they like the climate, the variance in mood attributable to weather changes is probably much smaller than it would be under random dispersal.

  • People’s mood is heavily influenced by the weather when they first spend time somewhere new, but then they get used to it. We habituate to almost everything else, so why not weather? Maybe people who live in California don’t really benefit from living in constant sunshine. Maybe they only enjoyed the sun for their first two weeks in California, and the problem is that now, whenever they travel somewhere else, the rain/snow/heat of other places makes them feel worse than their baseline (habituated) state. In other words, maybe Californians have been snorting sunshine for so long that they now need a hit of clarified sunbeams three times a day just to feel normal.

  • The relationship between objective weather variables and subjective emotional states is highly non-linear. Maybe we can’t consistently detect a relationship between high temperatures and anger because the perception of temperature is highly dependent on a range of other variables (e.g., 30 degrees Celsius can feel quite pleasant on a cloudy day in a dry climate, but intolerable if it’s humid and the sun is out). This would make the modeling challenge more difficult, but certainly not insurmountable.

  • Our measures of mood are not very reliable, and since reliability limits validity, it’s no surprise if we can’t detect consistent effects of weather on mood. Personally I’m actually very skeptical about this one, since there’s plenty of evidence that self-reports of emotion are more than adequate in any number of other situations (e.g., it’s not at all hard to detect strong trait effects of personality on reported mood states). But it’s still not entirely crazy to suggest that maybe what we’re looking at is at least partly a measurement problem—especially once we start talking about algorithmically extracting sentiment from Twitter or Facebook posts, which is a notoriously difficult problem.

  • The effects of weather on mood are strong, but very transient, and we’re simply not very good at computing mental integrals over all of our moment-by-moment experiences. That is, we tend to overestimate the  impact of weather on our mood because we find it easy to remember instances when the weather affected our mood, and not so easy to track all of the other background factors that might influence our mood more deeply but less perceptibly. There are many heuristics and biases you could attribute this to (e.g., the peak-end rule, the availability heuristic, etc.), but the basic point is that, on this view, the belief that the weather robustly influences our mood is a kind of mnemonic illusion attributable to well-known bugs in (or, more charitably, features of) our cognitive architecture.
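
To make the second possibility concrete (the “idiosyncratic effects that average out” one), here’s a minimal simulation sketch in Python. Every number in it is invented purely for illustration: each simulated person has a substantial, linear weather sensitivity, but the direction and size of that sensitivity vary across people, so the pooled effect is close to zero even though the within-person effect is easy to see.

    import numpy as np

    rng = np.random.default_rng(0)
    n_people, n_days = 500, 60

    # Each person gets their own weather-sensitivity slope: individually large,
    # but centered on zero across the population (some love heat, some hate it).
    slopes = rng.normal(loc=0.0, scale=1.0, size=n_people)

    temp = rng.normal(20, 8, size=(n_people, n_days))    # daily temperatures
    noise = rng.normal(0, 1, size=(n_people, n_days))    # everything else
    mood = slopes[:, None] * (temp - 20) / 8 + noise     # person-specific effect

    # Pooled analysis: correlate temperature with mood across all observations.
    pooled_r = np.corrcoef(temp.ravel(), mood.ravel())[0, 1]

    # Within-person analysis: magnitude of each person's own correlation.
    within_r = [np.corrcoef(temp[i], mood[i])[0, 1] for i in range(n_people)]

    print(f"pooled r: {pooled_r:.2f}")                                 # close to zero
    print(f"mean |within-person r|: {np.mean(np.abs(within_r)):.2f}")  # substantial

Nobody thinks mood really works like this, of course; the point is just that an effect that is perfectly real at the individual level can be nearly invisible in a between-person analysis unless the model explicitly allows for person-specific slopes.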

Anyway, as far as I can tell, none of the above explanations fully account for the available data. And, to be fair, there’s no reason to think any of them should: if I had to guess, I would put money on the true explanation being a convoluted mosaic of some or all of the above factors (plus others I haven’t considered, no doubt). But the proximal problem is that there just doesn’t seem to be much data to speak to the question one way or the other. And this annoys me more than I would like. I won’t go so far as to say I spend a lot of time thinking about the problem, because I don’t. But I think about it often enough that writing a 2,000-word blog post in the hopes that other folks will provide some compelling input seems like a very reasonable time investment.

And so, having read this far—which must mean you’re at least vaguely entertained, right?—it’s your turn to help me out. Please tell me: Why is it so damn hard to detect the effects of weather on mood? Make it rain comments! It will probably cheer me up. Slightly.

☀️🌞😎😅

To increase sustainability, NIH should yoke success rates to budgets

There’s a general consensus among biomedical scientists working in the United States that the NIH funding system is in a state of serious debilitation, if not yet on life support. After years of flat budgets and an ever-increasing number of PIs, success rates for R01s (the primary research grant mechanism at NIH) are at an all-time low, even as the average annual budget of awards has decreased in real dollars. Unfortunately, there doesn’t appear to be an easy way to fix the problem. As many commentators have noted, there are some very deeply-rooted and systemic incentives that favor a perpetuation, and even exacerbation, of the current problems.

Last month, NIH released an RFI asking for suggestions for strategies to improve the impact and sustainability of biomedical research. This isn’t a formal program announcement, and doesn’t carry any real force at the moment, but it does at least signal some interest in making policy changes that could help prevent serious problems from getting worse.

Here’s my suggestion, which I’m also dutifully sending in to NIH in much-abridged form. The basic idea I’ll explore in this post is very simple: NIH should start yoking the success rates of proposals to the amount of money they request. The proposal is not meant to be a long-term solution, and is in some ways just a stopgap measure until more serious policy changes take place. But it’s a stopgap measure that could conceivably increase success rates by a few points for at least a few years, with relatively little implementation cost and few obvious downsides. So I think it’s at least worth considering.

The problem

At the moment, the NIH funding system arguably incentivizes PIs to ask for as much money as they think they can responsibly handle. To see why, let’s forget about NIH for the moment and consider, in day-to-day life, the typical relationship between investment cost and probability of investment (holding constant expected returns, which I’ll address later). Generally speaking, the two are inversely related. If a friend asks you to lend them $10, you might lend it without even asking them what they need it for. If, instead, your friend asks you for $100, you might want to know what it’s for, and you might also ask for some indication of how soon you’ll be paid back. But if your friend asks you for $10,000… well, you’re probably going to want to see a business plan and a legally-binding contract laying out a repayment schedule. There is a general understanding in most walks of life that if someone asks you to invest in them more heavily, you expect to see more evidence that they can deliver on whatever it is that they’re promising to do.

At NIH, things don’t work exactly that way. In many ways, there’s actually a positive incentive to ask for more money when writing a grant application. The perverse incentives play out at multiple levels–both across different grant mechanisms, and within the workhorse R01 mechanism. In the former case, a glance at the success rates for different R mechanisms reveals something that many PIs are, in my experience, completely unaware of: “small”-grant mechanisms like the R03 and R21 have lower–in some cases much lower–success rates than R01s at nearly all NIH institutes. This despite the fact that R21s and R03s are advertised as requiring little or no pilot data, and have low budget caps and short award durations (e.g., a maximum of $275,000 over  two years for the R21).

Now you might say: well, sure, if you have a grant program expressly designed for exploratory projects, it’s not surprising if the funding rate is much lower, because you’re probably getting an obscene number of applications from people who aren’t in a position to compete for a full-blown R01. But that’s not really it, because the number of R21 and R03 submissions is also much lower than the number of R01 submissions (e.g., in 2013, NCI funded 14.7% of 4,170 R01 applications, but only 10.6% of 2,557 R21 applications). In the grand scheme of things, the amount of money allocated to “small” grants at NIH pales in comparison to the amount allocated to R01s.

The reason that R21s and R03s aren’t much more common is… well, I actually don’t know. But the point is that the data suggest that, in general (though there are of course exceptions), it’s empirically a pretty bad idea to submit R03s and R21s (particularly if you’re an Early Stage Investigator). The success rates for R01s are higher, you can ask for a lot more money, the project periods are longer, and the amount of work involved in writing the proposal is not dramatically higher. When you look at it that way, it’s not so surprising that PIs don’t submit that many R21/R03 applications: on average, they’re a bad time investment.

The same perverse incentives apply even if you focus on only R01 submissions. You might think that, other things being equal, NIH would prioritize proposals that ask for less money. That may well be true from an administrative standpoint, in the sense that, if two applications receive exactly the same score from a review panel, and are pretty similar in most respects, one imagines that most program officers would prefer to fund the proposal with the smaller budget. But the problem is that, in the grand scheme of things, discretionary awards (i.e., where the PO has the power to choose which award to fund) are a relatively small proportion of the total budget. The  majority of proposals get funded because they receive very good scores at review. And it turns out that, at review, asking for more money can actually work in a PI’s favor.

To see why, consider the official NIH guidelines for reviewing budgets. Reviewers are explicitly instructed not to judge a proposal’s merit based on its budget:

Unless specified otherwise in the Funding Opportunity Announcement, consideration of the budget and project period should not affect the overall impact score.

What should the reviewer do with regard to the budget? Well, not much:

The reviewer should determine whether the requested budget is realistic for the conduct of the project proposed.

The explicit decoupling of budget from merit sets up a very serious problem, because if you allow yourself to ask for more money, you can also propose correspondingly grander work. By the time reviewers see your proposal, they have no real way of knowing whether you first decided on the minimum viable research program you want to run and then came up with an appropriate budget, or if you instead picked a largish number out of a hat and then proposed a perfectly reasonable (but large) amount of science you could do in order to fit that budget.

At the risk of making my own life a little bit more difficult, I’m willing to put my money where my mouth is on this point. For just about every proposal I’ve sent to NIH so far, I’ve asked for more money than I strictly need. Now, “need” is a tricky word in this context. I emphatically am not suggesting that I routinely ask NIH for more money just for the sake of having more money. I can honestly say that I’ve never asked for any funds that I didn’t think I could use responsibly in the pursuit of what I consider to be good science. But the trouble is, virtually every PI who’s ever applied for government funding will happily tell you that they could always do more good science if they just had more money. And, to a first order of approximation, they’re right. Unless a PI already has multiple major grants (which is true of only a small proportion of PIs at NIH), she or he probably could do more good work if given more money. There might be diminishing returns at some point, but for the most part it should not be terribly surprising if the average PI could increase her or his productivity level somewhat if given the money to hire more personnel, buy better equipment, run more experiments, and so on.

Unfortunately, the NIH budget is a zero-sum game. Every grant dollar I get is a grant dollar some other PI doesn’t get. So, when I go out and ask for a large-but-not-unreasonable amount of money, knowing full well that I could still run a research lab and get at least some good science done with less money, I am, in a sense, screwing everyone else over. Except that I’m not really screwing everyone else over, because everyone else is doing exactly the same thing I am. And the result is that we end up with a lot of PIs proposing a lot of very large projects. The PIs who win the grant lottery (because, increasingly, that’s what it is) will, generally, do a lot of good science with it. So it’s not so much that money is wasted; it’s more that it’s not distributed optimally, because the current system incentivizes people to ask for as much money as they think they can responsibly manage, rather than asking for the minimum amount they need to actually sustain a viable research enterprise.

The fix

The solution to this problem is, on paper, quite simple (which is probably why it’s only on paper). The way to induce PIs to ask for the minimum amount they think they can do their research with–thereby freeing up money for everyone else–is to explicitly yoke risk to reward, so that there’s a clearly discernible cost to asking for every increment in funding. You want $50,000 a year? Okay, that’s pretty easy to fund, so we’re not going to ask you a lot of questions. You want $500k/year? Well, hey, look, there are 10 people out in the hallway who each claim they can produce two papers a year on just $50k. So you’re going to have to explain why we should fund one of you instead of ten of them.

How would this proposal be implemented? There are many ways one could go about it, but here’s one that makes sense to me. First, we get rid of all of the research grant (R-type) mechanisms–except maybe for those that have some clearly differentiated purpose (e.g., R25s for training courses). Second, we introduce new R grant programs defined only by their budget caps and durations. For example, we might have R50s (max 50k/year for 2 years), R150s (max 150k/year for 3 years), R300s (max 300k/year for 5 years), and so on. The top tier would have no explicit cap, just like the current R01s. Third, we explicitly tie success rates to budget caps by deciding (and publicly disclosing) how much money we’re allocating to each tier. Each NIH institute would have to decide approximately what its payline for each tier would be for the next year–with the general constraint that the money would be allocated in such a way as to produce a strong inverse correlation between success rate and budget amount. So we might see, for instance, NIMH funding R50s at 50%, R150s at 28%, R300s at 22%, and R1000s at 8%. There would presumably be an initial period of fine-tuning, but over four or five award cycles, the system would almost certainly settle into a fairly stable equilibrium. Paylines would necessarily rise, because PIs would be incentivized to ask for only as much money as they truly need.
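
To make the arithmetic concrete, here’s a toy sketch in Python of how the success-rate gradient would fall out of the budget allocation. Every number in it (the annual pool, the application counts, the allocation shares) is invented purely for illustration, and for simplicity it treats the budget as a single annual pool rather than modeling multi-year commitments.

    # Toy sketch: once the pool and the per-tier allocation are fixed and disclosed,
    # the implied success rates follow mechanically. All figures are hypothetical.
    tiers = {
        # name: (annual budget cap in $, hypothetical n applications, share of pool)
        "R50":   (50_000,    1500, 0.10),
        "R150":  (150_000,   2000, 0.25),
        "R300":  (300_000,   2200, 0.35),
        "R1000": (1_000_000, 1100, 0.30),
    }
    annual_pool = 300_000_000  # hypothetical annual pool for new awards

    for name, (cap, n_apps, share) in tiers.items():
        n_fundable = int(annual_pool * share // cap)
        success_rate = min(1.0, n_fundable / n_apps)
        print(f"{name}: ~{n_fundable} awards, success rate ~{success_rate:.0%}")

With these made-up inputs the gradient comes out to roughly 40%, 25%, 16%, and 8% across the four tiers. The specific values don’t matter; what matters is that applicants could see exactly what they’re trading off when they ask for more money.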

The objection(s)

Are there objections to the approach I’ve suggested above? Sure. Perhaps the most obvious concern will come from people who do genuinely “big” science–i.e., who work in fields where simply keeping a small lab running can cost hundreds of thousands of dollars a year. Researchers in such fields might complain that yoking success rates to budgets would mean that their colleagues who work on less expensive scientific problems have a major advantage when it comes to securing funding, and that Big Science types would consequently find it harder to survive.

There are several things to note about this objection. First, there’s actually no necessary reason why yoking success rates to budgets has to hurt larger applications. The only assumption this proposal depends on is that, at the moment, some proportion of budgets are inflated–i.e., there are many researchers who could operate successfully (if less comfortably) on smaller budgets than they currently do. The fact that many other investigators couldn’t operate on smaller budgets is immaterial. If 25% of NIH PIs voluntarily opt into a research grant program that guarantees higher success rates in return for smaller budgets, the other 75% of PIs could potentially benefit even if they do nothing at all (depending on how success rates are set). So if you currently run a lab that can’t possibly run on less than $500k/year, you don’t necessarily lose anything if one of your colleagues who was previously submitting grants with $250k annual budgets decides to start writing grants with $125k caps in return for, say, a 10% increase in funding likelihood. On the contrary, it could actually mean that there’s more money left over at the end of the day to fund your own big grants.

Now, it’s certainly true that NIH PIs who work in cheaper domains would have an easier time staying afloat than ones who work in expensive domains. And it’s also true that NIH could explicitly bias in favor of small grants by raising the success rates for small grants disproportionately. But that isn’t necessarily a problem. Personally, I would argue that a moderate bias towards small grants is actually a very good thing. Remember: funding is a zero-sum game. It may seem egalitarian to make success rates independent of operating costs, because it feels like we’re giving everyone a roughly equal shot at a career in biomedical science, no matter what science they like to do. But in another sense, we aren’t being egalitarian at all, because what we’re actually saying is that a scientist who likes to work on $500k problems is worth five times as much to the taxpayer as one who likes to work on $100k problems. That seems unlikely to be true in the general case (though it may certainly be true in a minority of cases), because it’s hard to believe that the cost of doing scientific research is very closely linked to the potential benefits to people’s health (i.e., there are almost certainly many very expensive scientific disciplines that don’t necessarily produce very big benefits to taxpayers). Personally, I don’t see anything wrong with setting a higher bar for research programs that cost more taxpayer money to fund. And note that I’m arguing against my own self-interest here, because my own research is relatively expensive (most of it involves software development, and the average developer salary is roughly double the average postdoc salary).

Lastly, it’s important to keep in mind that this proposal doesn’t in any way preclude the use of other, complementary, funding mechanisms. At present, NIH already routinely issues PAs and RFAs for proposals in areas of particular interest, or which for various reasons (including budget-related considerations) need to be considered separately from other applications. This wouldn’t change in any way under the proposed system. So, for example, if NIH officials decided that it was in the nation’s best interest to fund a round of $10 million grants to develop new heart transplant techniques, they could still issue a special call for such proposals. The plan I’ve sketched above would apply only to “normal” grants.

Okay, so that’s all I have. I was initially going to list a few other potential objections (and rebuttals), but decided to leave that for discussion. Please use the comments to tell me (and perhaps NIH) why this proposal would or wouldn’t work.

“Open Source, Open Science” Meeting Report – March 2015

[The report below was collectively authored by participants at the Open Source, Open Science meeting, and has been cross-posted in other places.]

On March 19th and 20th, the Center for Open Science hosted a small meeting in Charlottesville, VA, convened by COS and co-organized by Kaitlin Thaney (Mozilla Science Lab) and Titus Brown (UC Davis). People working across the open science ecosystem attended, including publishers, infrastructure non-profits, public policy experts, community builders, and academics.

Open Science has emerged into the mainstream, primarily due to concerted efforts from various individuals, institutions, and initiatives. This small, focused gathering brought together several of those community leaders. The purpose of the meeting was to define common goals, discuss common challenges, and coordinate on common efforts.

We had good discussions about several issues at the intersection of technology and social hacking, including badging, improving standards for scientific APIs, and developing shared infrastructure. We also talked about coordination challenges due to the rapid growth of the open science community. At least three collaborative projects emerged from the meeting as concrete outcomes to combat the coordination challenges.

A repeated theme was how to make the value proposition of open science more explicit. Why should scientists become more open, and why should institutions and funders support open science? We agreed that incentives in science are misaligned with practices, and we identified particular pain points and opportunities to nudge incentives. We focused on providing information about the benefits of open science to researchers, funders, and administrators, and emphasized reasons aligned with each stakeholder’s interests. We also discussed industry interest in “open”, both in making good use of open data, and also in participating in the open ecosystem. One of the collaborative projects emerging from the meeting is a paper or papers to answer the question “Why go open?” for researchers.

Many groups are providing training for tools, statistics, or workflows that could improve openness and reproducibility. We discussed methods of coordinating training activities, such as a training “decision tree” defining potential entry points and next steps for researchers. For example, Center for Open Science offers statistics consulting, rOpenSci offers training on tools, and Software Carpentry, Data Carpentry, and Mozilla Science Lab offer training on workflows. A federation of training services could be mutually reinforcing and bolster collective effectiveness, and facilitate sustainable funding models.

The challenge of supporting training efforts was linked to the larger challenge of funding the so-called “glue” – the technical infrastructure that is only noticed when it fails to function. One such infrastructure effort is the SHARE project, a partnership between the Association of Research Libraries, its academic association partners, and the Center for Open Science. There is little glory in training and infrastructure, but both are essential elements for providing knowledge to enable change, and tools to enact change.

Another repeated theme was the “open science bubble”. Many participants felt that they were failing to reach people outside of the open science community. Training in data science and software development was recognized as one way to introduce people to open science. For example, data integration and techniques for reproducible computational analysis naturally connect to discussions of data availability and open source. Re-branding was also discussed as a solution – rather than “post preprints!”, say “get more citations!” Another important realization was that researchers who engage with open practices need not, and indeed may not want to, self-identify as “open scientists” per se. The identity and behavior need not be the same.

A number of concrete actions and collaborative activities emerged at the end, including a more coordinated effort around badging, collaboration on API connections between services and producing an article on best practices for scientific APIs, and the writing of an opinion paper outlining the value proposition of open science for researchers. While several proposals were advanced for “next meetings” such as hackathons, no decision has yet been reached. But a more important decision was clear – the open science community is emerging, strong, and ready to work in concert to help the daily scientific practice live up to core scientific values.

Authors
[Authors are listed in reverse alphabetical order; order does not denote relative contribution.]

  1. Tal Yarkoni, University of Texas at Austin
  2. Kara Woo, NCEAS
  3. Andrew Updegrove, Gesmer Updegrove and ConsortiumInfo.org
  4. Kaitlin Thaney, Mozilla Science Lab
  5. Jeffrey Spies, Center for Open Science
  6. Courtney Soderberg, Center for Open Science
  7. Elliott Shore, Association of Research Libraries
  8. Andrew Sallans, Center for Open Science
  9. Karthik Ram, rOpenSci and Berkeley Institute for Data Science
  10. Min Ragan-Kelley, IPython and UC Berkeley
  11. Brian Nosek, Center for Open Science and University of Virginia
  12. Erin C. McKiernan, Wilfrid Laurier University
  13. Jennifer Lin, PLOS
  14. Amye Kenall, BioMed Central
  15. Mark Hahnel, figshare
  16. C. Titus Brown, UC Davis
  17. Sara D. Bowman, Center for Open Science

Now I am become DOI, destroyer of gatekeeping worlds

Digital object identifiers (DOIs) are much sought-after commodities in the world of academic publishing. If you’ve never seen one, a DOI is a unique string associated with a particular digital object (most commonly a publication of some kind) that lets the internet know where to find the stuff you’ve written. For example, say you want to know where you can get a hold of an article titled, oh, say, Designing next-generation platforms for evaluating scientific output: what scientists can learn from the social web. In the real world, you’d probably go to Google, type that title in, and within three or four clicks, you’d arrive at the document you’re looking for. As it turns out, the world of formal resource location is fairly similar to the real world, except that instead of using Google, you go to a website called dx.DOI.org, and then you plug in the string ‘10.3389/fncom.2012.00072’, which is the DOI associated with the aforementioned article. And then, poof, you’re automagically linked directly to the original document, upon which you can gaze in great awe for as long as you feel comfortable.
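
If you’re curious what that lookup actually involves under the hood, here’s a minimal sketch in Python (using the requests library; the DOI is the one mentioned above): the resolver simply answers with an HTTP redirect pointing at wherever the document currently lives.

    import requests

    doi = "10.3389/fncom.2012.00072"
    resp = requests.get(f"https://dx.doi.org/{doi}", allow_redirects=False)

    print(resp.status_code)               # typically a 30x redirect
    print(resp.headers.get("Location"))   # the publisher URL the DOI currently points to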

Historically, DOIs have almost exclusively been issued by official-type publishers: Elsevier, Wiley, PLoS and such. Consequently, DOIs have had a reputation as a minor badge of distinction–probably because you’d traditionally only get one if your work was perceived to be important enough for publication in a journal that was (at least nominally) peer-reviewed. And perhaps because of this tendency to view the presence of a DOI as something like an implicit seal of approval from the Great Sky Guild of Academic Publishing, many journals impose official or unofficial commandments to the effect that, when writing a paper, one shalt only citeth that which hath been DOI-ified. For example, here’s a boilerplate Elsevier statement regarding references (in this case, taken from the Neuron author guidelines):

References should include only articles that are published or in press. For references to in press articles, please confirm with the cited journal that the article is in fact accepted and in press and include a DOI number and online publication date. Unpublished data, submitted manuscripts, abstracts, and personal communications should be cited within the text only.

This seems reasonable enough until you realize that citations that occur “within the text only” aren’t very useful, because they’re ignored by virtually all formal citation indices. You want to cite a blog post in your Neuron paper and make sure it counts? Well, you can’t! Blog posts don’t have DOIs! You want to cite a what? A tweet? That’s just crazy talk! Tweets are 140 characters! You can’t possibly cite a tweet; the citation would be longer than the tweet itself!

The injunction against citing DOI-less documents is unfortunate, because people deserve to get credit for the interesting things they say–and it turns out that they have, on rare occasion, been known to say interesting things in formats other than the traditional peer-reviewed journal article. I’m pretty sure if Mark Twain were alive today, he’d write the best tweets EVER. Well, maybe it would be a tie between Mark Twain and the NIH Bear. But Mark Twain would definitely be up there. And he’d probably write some insightful blog posts too. And then, one imagines that other people would probably want to cite this brilliant 21st-century man of letters named @MarkTwain in their work. Only they wouldn’t be allowed to, you see, because 21st-century Mark Twain doesn’t publish all, or even most, of his work in traditional pre-publication peer-reviewed journals. He’s too impatient to rinse-and-repeat his way through the revise-and-resubmit process every time he wants to share a new idea with the world, even when those ideas are valuable. 21st-century @MarkTwain just wants his stuff out there already where people can see it.

Why does Elsevier hate 21st-century Mark Twain, you ask? I don’t know. But in general, I think there are two main reasons for the disdain many people seem to feel at the thought of allowing authors to freely cite DOI-less objects in academic papers. The first reason has to do with permanence—or lack thereof. The concern here is that if we allowed everyone to cite just any old web page, blog post, or tweet in academic articles, there would be no guarantee that those objects would still be around by the time the citing work was published, let alone several years hence. Which means that readers might be faced with a bunch of dead links. And dead links are not very good at backing up scientific arguments. In principle, the DOI requirement is supposed to act like some kind of safety word that protects a citation from the ravages of time—presumably because having a DOI means the cited work is important enough for the watchful eye of Sauron Elsevier to periodically scan across it and verify that it hasn’t yet fallen off of the internet’s cliffside.

The second reason has to do with quality. Here, the worry is that we can’t just have authors citing any old opinion someone else published somewhere on the web, because, well, think of the children! Terrible things would surely happen if we allowed authors to link to unverified and unreviewed works. What would stop me from, say, writing a paper criticizing the idea that human activity is contributing to climate change, and supporting my argument with “citations” to random pages I’ve found via creative Google searches? For that matter, what safeguard would prevent a brazen act of sockpuppetry in which I cite a bunch of pages that I myself have (anonymously) written? Loosening the injunction against formally citing non-peer-reviewed work seems tantamount to inviting every troll on the internet to a formal academic dinner.

To be fair, I think there’s some merit to both of these concerns. Or at least, I think there used to be some merit to these concerns. Back when the internet was a wee nascent flaky thing winking in and out of existence every time a dial-up modem connection went down, it made sense to worry about permanence (I mean, just think: if we had allowed people to cite GeoCities webpages in published articles, every last one of those citation links would now be dead!) And similarly, back in the days when peer review was an elite sort of activity that could only be practiced by dignified gentlepersons at the cordial behest of a right honorable journal editor, it probably made good sense to worry about quality control. But the merits of such concerns have now largely disappeared, because we now live in a world of marvelous technology, where bits of information cost virtually nothing to preserve forever, and a new post-publication platform that allows anyone to review just about any academic work in existence seems to pop up every other week (cf. PubPeer, PubMed Commons, Publons, etc.). In the modern world, nothing ever goes out of print, and if you want to know what a whole bunch of experts think about something, you just have to ask them about it on Twitter.

Which brings me to this blog post. Or paper. Whatever you want to call it. It was first published on my blog. You can find it–or at least, you could find it at one point in time–at the following URL: http://www.talyarkoni.org/blog/2015/03/04/now-i-am-become-doi-destroyer-of-gates.

Unfortunately, there’s a small problem with this URL: it contains nary a DOI in sight. Really. None of the eleventy billion possible substrings in it look anything like a DOI. You can even scramble the characters if you like; I don’t care. You’re still not going to find one. Which means that most journals won’t allow you to officially cite this blog post in your academic writing. Or any other post, for that matter. You can’t cite my post about statistical power and magical sample sizes; you can’t cite Joe Simmons’ Data Colada post about Mturk and effect sizes; you can’t cite Sanjay Srivastava’s discussion of replication and falsifiability; and so on ad infinitum. Which is a shame, because it’s a reasonably safe bet that there are at least one or two citation-worthy nuggets of information trapped in some of those blog posts (or millions of others), and there’s no reason to believe that these nuggets must all have readily-discoverable analogs somewhere in the “formal” scientific literature. As the Elsevier author guidelines would have it, the appropriate course of action in such cases is to acknowledge the source of an idea or finding in the text of the article, but not to grant any other kind of formal credit.

Now, typically, this is where the story would end. The URL can’t be formally cited in an Elsevier article; end of story. BUT! In this case, the story doesn’t quite end there. A strange thing happens! A short time after it appears on my blog, this post also appears–in virtually identical form–on something called The Winnower, which isn’t a blog at all, but rather, a respectable-looking alternative platform for scientific publication and evaluation.

Even more strangely, on The Winnower, a mysterious-looking set of characters appear alongside the text. For technical reasons, I can’t tell you what the set of characters actually is (because it isn’t assigned until this piece is published!). But I can tell you that it starts with “10.15200/winn”. And I can also tell you what it is: It’s a DOI! It’s one bona fide free DOI, courtesy of The Winnower. I didn’t have to pay for it, or barter any of my services for it, or sign away any little pieces of my soul to get it*. I just installed a WordPress plugin, pressed a few buttons, and… poof, instant DOI. So now this is, proudly, one of the world’s first N (where N is some smallish number probably below 1000) blog posts to dress itself up in a nice DOI (Figure 1). Presumably because it’s getting ready for a wild night out on the academic town.

[Image: “sticks and stones may break my bones, but DOIs make me feel pretty”]
Figure 1. Effects of assigning DOIs to blog posts: an anthropomorphic depiction. (A) A DOI-less blog post feels exposed and inadequate; it envies its more reputable counterparts and languishes in a state of torpor and existential disarray. (B) Freshly clothed in a newly-minted DOI, the same blog post feels confident, charismatic, and alert. Brimming with energy, it eagerly awaits the opportunity to move mountains and reshape scientific discourse. Also, it has longer arms.

Does the mere fact that my blog post now has a DOI actually change anything, as far as the citation rules go? I don’t know. I have no idea if publishers like Elsevier will let you officially cite this piece in an article in one of their journals. I would guess not, but I strongly encourage you to try it anyway (in fact, I’m willing to let you try to cite this piece in every paper you write for the next year or so—that’s the kind of big-hearted sacrifice I’m willing to make in the name of science). But I do think it solves both the permanence and quality control issues that are, in theory, the whole reason for journals having a no-DOI-no-shoes-no-service policy in the first place.

How? Well, it solves the permanence problem because The Winnower is a participant in the CLOCKSS archive, which means that if The Winnower ever goes out of business (a prospect that, let’s face it, became a little bit more likely the moment this piece appeared on their site), this piece will be immediately, freely, and automatically made available to the worldwide community in perpetuity via the associated DOI. So you don’t need to trust the safety of my blog—or even The Winnower—any more. This piece is here to stay forever! Rejoice in the cheapness of digital information and librarians’ obsession with archiving everything!

As for the quality argument, well, clearly, this here is not what you would call a high-quality academic work. But I still think you should be allowed to cite it wherever and whenever you want. Why? For several reasons. First, it’s not exactly difficult to determine whether or not it’s a high-quality academic work—even if you’re not willing to exercise your own judgment. When you link to a publication on The Winnower, you aren’t just linking to a paper; you’re also linking to a review platform. And the reviews are very prominently associated with the paper. If you dislike this piece, you can use the comment form to indicate exactly why you dislike it (if you like it, you don’t need to write a comment; instead, send an envelope stuffed with money to my home address).

Second, it’s not at all clear that banning citations to non-prepublication-reviewed materials accomplishes anything useful in the way of quality control. The reliability of the peer-review process is sufficiently low that there is simply no way for it to consistently sort the good from the bad. The problem is compounded by the fact that rejected manuscripts are rarely discarded forever; typically, they’re quickly resubmitted to another journal. The bibliometric literature shows that it’s possible to publish almost anything in the peer-reviewed literature given enough persistence.

Third, I suspect—though I have no data to support this claim—that a worldview that treats having passed peer review and/or receiving a DOI as markers of scientific quality is actually counterproductive to scientific progress, because it promotes a lackadaisical attitude on the part of researchers. A reader who believes that a claim is significantly more likely to be true in virtue of having a DOI is a reader who is slightly less likely to take the extra time to directly evaluate the evidence for that claim. The reality, unfortunately, is that most scientific claims are wrong, because the world is complicated and science is hard. Pretending that there is some reasonably accurate mechanism that can sort all possible sources into reliable and unreliable buckets—even to a first order of approximation—is misleading at best and dangerous at worst. Of course, I’m not suggesting that you can’t trust a paper’s conclusions unless you’ve read every work it cites in detail (I don’t believe I’ve ever done that for any paper!). I’m just saying that you can’t abdicate the responsibility of evaluating the evidence to some shapeless, anonymous mass of “reviewers”. If I decide not to chase down the Smith & Smith (2007) paper that Jones & Jones (2008) cite as critical support for their argument, I shouldn’t be able to turn around later and say something like “hey, Smith & Smith (2007) was peer reviewed, so it’s not my fault for not bothering to read it!”

So where does that leave us? Well, if you’ve read this far, and agree with most or all of the above arguments, I hope I can convince you of one more tiny claim. Namely, that this piece represents (a big part of) the future of academic publishing. Not this particular piece, of course; I mean the general practice of (a) assigning unique identifiers to digital objects, (b) preserving those objects for all posterity in a centralized archive, and (c) allowing researchers to cite any and all such objects in their work however they like. (We could perhaps also add (d) working very hard to promote centralized “post-publication” peer review of all of those objects–but that’s a story for another day.)

These are not new ideas, mind you. People have been calling for a long time for a move away from a traditional gatekeeping-oriented model of pre-publication review and towards more open publication and evaluation models. These calls have intensified in recent years; for instance, in 2012, a special topic in Frontiers in Computational Neuroscience featured 18 different papers that all independently advocated for very similar post-publication review models. Even the actual attachment of DOIs to blog posts isn’t new; as a case in point, consider that C. Titus Brown—in typical pioneering form—was already experimenting with ways to automatically DOIfy his blog posts via FigShare way back in the same dark ages of 2012. What is new, though, is the emergence and widespread adoption of platforms like The Winnower, FigShare, or ResearchGate that make it increasingly easy to assign a DOI to academically-relevant works other than traditional journal articles. Thanks to such services, you can now quickly and effortlessly attach a DOI to your open-source software packages, technical manuals and white papers, conference posters, or virtually any other kind of digital document.

Once such efforts really start to pick up steam—perhaps even in the next two or three years—I think there’s a good chance we’ll fall into a positive feedback loop, because it will become increasingly clear that for many kinds of scientific findings or observations, there’s simply nothing to be gained by going through the cumbersome, time-consuming conventional peer review process. To the contrary, there will be all kinds of incentives for researchers to publish their work as soon as they feel it’s ready to share. I mean, look, I can write blog posts a lot faster than I can write traditional academic papers. Which means that if I write, say, one DOI-adorned blog post a month, my Google Scholar profile is going to look a lot bulkier a year from now, at essentially no extra effort or cost (since I’m going to write those blog posts anyway!). In fact, since services like The Winnower and FigShare can assign DOIs to documents retroactively, you might not even have to wait that long. Check back this time next week, and I might have a dozen new indexed publications! And if some of these get cited—whether in “real” journals or on other indexed blog posts—they’ll then be contributing to my citation count and h-index too (at least on Google Scholar). What are you going to do to keep up?

Now, this may all seem a bit off-putting if you’re used to thinking of scientific publication as a relatively formal, laborious process, where two or three experts have to sign off on what you’ve written before it gets to count for anything. If you’ve grown comfortable with the idea that there are “real” scientific contributions on the one hand, and a blooming, buzzing confusion of second-rate opinions on the other, you might find the move to suddenly make everything part of the formal record somewhat disorienting. It might even feel like some people (like, say, me) are actively trying to game the very system that separates science from tabloid news. But I think that’s the wrong perspective. I don’t think anybody—certainly not me—is looking to get rid of peer review. What many people are actively working towards are alternative models of peer review that will almost certainly work better.

The right perspective, I would argue, is to embrace the benefits of technology and seek out new evaluation models that emphasize open, collaborative review by the community as a whole instead of closed pro forma review by two or three semi-randomly selected experts. We now live in an era where new scientific results can be instantly shared at essentially no cost, and where sophisticated collaborative filtering algorithms and carefully constructed reputation systems can potentially support truly community-driven, quantitatively-grounded open peer review on a massive scale. In such an environment, there are few legitimate excuses for sticking with archaic publication and evaluation models—only the familiar, comforting pull of the status quo. Viewed in this light, using technology to get around the limitations of old gatekeeper-based models of scientific publication isn’t gaming the system; it’s actively changing the system—in ways that will ultimately benefit us all. And in that context, the humble self-assigned DOI may ultimately become—to liberally paraphrase Robert Oppenheimer and the Bhagavad Gita—one of the destroyers of the old gatekeeping world.

the weeble distribution: a love story

“I’m a statistician,” she wrote. “By day, I work for the census bureau. By night, I use my statistical skills to build the perfect profile. I’ve mastered the mysterious headline, the alluring photo, and the humorous description that comes off as playful but with a hint of an edge. I’m pretty much irresistible at this point.”

“Really?” I wrote back. “That sounds pretty amazing. The stuff about building the perfect profile, I mean. Not the stuff about working at the census bureau. Working at the census bureau sounds decent, I guess, but not amazing. How do you build the perfect profile? What kind of statistical analysis do you do? I have a bit of programming experience, but I don’t know any statistics. Maybe we can meet some time and you can teach me a bit of statistics.”

I am, as you can tell, a smooth operator.

A reply arrived in my inbox a day later:

No, of course I don’t really spend all my time constructing the perfect profile. What are you, some kind of idiot?

And so was born our brief relationship; it was love at first insult.


“This probably isn’t going to work out,” she told me within five minutes of meeting me in person for the first time. We were sitting in the lobby of the Chateau Laurier downtown. Her choice of venue. It’s an excellent place to meet an internet date; if you don’t like the way they look across the lobby, you just back out quietly and then email the other person to say sorry, something unexpected came up.

“That fast?” I asked. “You can already tell you don’t like me? I’ve barely introduced myself.”

“Oh, no, no. It’s not that. So far I like you okay. I’m just going by the numbers here. It probably isn’t going to work out. It rarely does.”

“That’s a reasonable statement,” I said, “but a terrible thing to say on a first date. How do you ever get a second date with anyone, making that kind of conversation?”

“It helps to be smoking hot,” she said. “Did I offend you terribly?”

“Not really, no. But I’m not a very sentimental kind of guy.”

“Well, that’s good.”


Later, in bed, I awoke to a shooting pain in my leg. It felt like I’d been kicked in the shin.

“Did you just kick me in the shin,” I asked.

“Yes.”

“Any particular reason?”

“You were a little bit on my side of the bed. I don’t like that.”

“Oh. Okay. Sorry.”

“I still don’t think this will work,” she said, then rolled over and went back to sleep.


She was right. We dated for several months, but it never really worked. We had terrific fights, and reasonable make-up sex, but our interactions never had very much substance. We related to one another like two people who were pretty sure something better was going to come along any day now, but in the meantime, why not keep what we had going, because it was better than eating dinner alone.

I never really learned what she liked; I did learn that she disliked most things. Mostly our conversations revolved around statistics and food. I’ll give you some examples.


“Beer is the reason for statistics,” she informed me one night while we were sitting at Cicero’s and sharing a lasagna.

“I imagine beer might be the reason for a lot of bad statistics,” I said.

“No, no. Not just bad statistics. All statistics. The discipline of statistics as we know it exists in large part because of beer.”

“Pray, do go on,” I said, knowing it would have been futile to ask her to shut up.

“Well,” she said, “there once was a man named Student…”

I won’t bore you with all the details; the gist of it is that there once was a man by the name of William Gosset, who worked for Guinness as a brewer in the early 1900s. Like a lot of other people, Gosset was interested in figuring out how to make Guinness taste better, so he invented a bunch of statistical tests to help him quantify the differences in quality between different batches of beer. Guinness didn’t want Gosset to publish his statistical work under his real name, for fear he might somehow give away their trade secrets, so they made him use the pseudonym “Student”. As a result, modern-day statisticians often work with something called Student’s t distribution, which is apparently kind of a big deal. And all because of beer.

“That’s a nice story,” I said. “But clearly, if Student—or Gosset or whatever his real name was—hadn’t been working for Guinness, someone else would have invented the same tests shortly afterwards, right? It’s not like he was so brilliant no one else would have ever thought of the same thing. I mean, if Edison hadn’t invented the light bulb, someone else would have. I take it you’re not really saying that without beer, there would be no statistics.”

“No, that is what I’m saying. No beer, no stats. Simple.”

“Yeah, okay. I don’t believe you.”

“Oh no?”

“No. What’s that thing about lies, damned lies, and stat—”

“Statistics?”

“No. Statisticians.”

“No idea,” she said. “Never heard that saying.”

“It’s that they lie. The saying is that statisticians lie. Repeatedly and often. About anything at all. It’s that they have no moral compass.”

“Sounds about right.”


“I don’t get this whole accurate to within 3 percent 19 times out of 20 business,” I whispered into her ear late one night after we’d had sex all over her apartment. “I mean, either you’re accurate or you’re not, right? If you’re accurate, you’re accurate. And if you’re not accurate, I guess maybe then you could be within 3 percent or 7 percent or whatever. But what the hell does it mean to be accurate X times out of Y? And how would you even know how many times you’re accurate? And why is it always 19 out of 20?”

She turned on the lamp on the nightstand and rolled over to face me. Her hair covered half of her face; the other half was staring at me with those pale blue eyes that always looked like they wanted to either jump you or murder you, and you never knew which.

“You really want me to explain confidence intervals to you at 11:30 pm on a Thursday night?”

“Absolutely.”

“How much time do you have?”

“All, Night, Long,” I said, channeling Lionel Richie.

“Wonderful. Let me put my spectacles on.”

She fumbled around on the nightstand looking for them.

“What do you need your glasses for,” I asked. “We’re just talking.”

“Well, I need to be able to see you clearly. I use the amount of confusion on your face to gauge how much I need to dumb down my explanations.”


Frankly, most of the time she was as cold as ice. The only time she really came alive—other than in the bedroom—was when she talked about statistics. Then she was a different person: excited and exciting, full of energy. She looked like a giant Tesla coil, mid-discharge.

“Why do you like statistics so much,” I asked her over a bento box at ZuNama one day.

“Because,” she said, “without statistics, you don’t really know anything.”

“I thought you said statistics was all about uncertainty.”

“Right. Without statistics, you don’t know anything… and with statistics, you still don’t know anything. But with statistics, we can at least get a sense of how much we know or don’t know.”

“Sounds very… Rumsfeldian,” I said. “Known knowns… unknown unknowns… is that right?”

“It’s kind of right,” she said. “But the error bars are pretty huge.”

“I’m going to pretend I know what that means. If I admit I have no idea, you’ll think I wasn’t listening to you in bed the other night.”

“No,” she said. “I know you were listening. You were listening very well. It’s just that you were understanding very poorly.”


Uncertainty was a big theme for her. Once, to make a point, she asked me how many nostrils a person breathes through at any given time. And then, after I experimented on myself and discovered that the answer was one and not two, she pushed me on it:

“Well, how do you know you’re not the only freak in the world who breathes through one nostril?”

“Easily demonstrated,” I said, and stuck my hand right in front of her face, practically covering her nose.

“Breathe out!”

She did.

“And now breathe in! And then repeat several times!”

She did.

“You see,” I said, retracting my hand once I was satisfied. “It’s not just me. You also breathe through one nostril at a time. Right now it’s your left.”

“That proves nothing,” she said. “We’re not independent observations; I live with you. You probably just gave me your terrible mononarial disease. All you’ve shown is that we’re both sick.”

I realized then that I wasn’t going to win this round—or any other round.

“Try the unagi,” I said, waving at the sushi in a heroic effort to change the topic.

“You know I don’t like to try new things. It’s bad enough I’m eating sushi.”

“Try the unagi,” I suggested again.

So she did.

“It’s not bad,” she said after chewing on it very carefully for a very long time. “But it could use some ketchup.”

“Don’t you dare ask them for ketchup,” I said. “I will get up and leave if you ask them for ketchup.”

She waved her hand at the server.


“There once was a gentleman named Bayes,” she said over coffee at Starbucks one morning. I was running late for work, but so what? Who’s going to pass up the chance to hear about a gentleman named Bayes when the alternative is spending the morning refactoring enterprise code and filing progress reports?

“Oh yes, I’ve heard about him,” I said. “He’s the guy who came up with Bayes’ theorem.” I’d heard of Bayes’ theorem in some distant class somewhere, and knew it had something to do with statistics, though I had not one clue what it actually referred to.

“No, the Bayes I’m talking about is John Bayes—my mechanic. He’s working on my car right now.”

“Really?”

“No, not really, you idiot. Yes, Bayes as in Bayes’ theorem.”

“Thought so. Well, go ahead and tell me all about him. What is John Bayes famous for?”

“Bayes’ theorem.”

“Huh. How about that.”

She launched into a very dry explanation of conditional probabilities and prior distributions and a bunch of other terms I’d never heard of before and haven’t remembered since. I stopped her about three minutes in.

“You know none of this helps me, right? I mean, really, I’m going to forget anything you tell me. You know what might help, is maybe if instead of giving me these long, dry explanations, you could put things in a way I can remember. Like, if you, I don’t know, made up a limerick. I bet I could remember your explanations that way.”

“Oh, a limerick. You want a Bayesian limerick. Okay.”

She scrunched up her forehead like she was thinking very deeply. Held the pose for a few seconds.

“There once was a man named John Bayes,” she began, and then stopped.

“Yes,” I said. “Go on.”

“Who spent most of his days… calculating the posterior probability of go fuck yourself.”

“Very memorable,” I said, waving for the check.


“Suppose I wanted to estimate how much I love you,” I said over asparagus and leek salad at home one night. “How would I do that?”

“You love me?” she arched an eyebrow.

“Good lord no,” I laughed hysterically. “It’s a completely and utterly hypothetical question. But answer it anyway. How would I do it?”

She shrugged.

“That’s a measurement problem. I’m a statistician, not a psychometrician. I develop and test statistical models. I don’t build psychological instruments. I haven’t the faintest idea how you’d measure love. As I’m sure you’ve observed, it’s something I don’t know or care very much about.”

I nodded. I had observed that.

“You act like there’s a difference between all these things there’s really no difference between,” I said. “Models, measures… what the hell do I care? I asked a simple question, and I want a simple answer.”

“Well, my friend, in that case, the answer is that you must look deep into your own heart and say, heart, how much do I love this woman, and then your heart will surely whisper the answer delicately into your oversized ear.”

“That’s the dumbest thing I’ve ever heard,” I said, tugging self-consciously at my left earlobe. It wasn’t that big.

“Right?” she said. “You said you wanted a simple answer. I gave you a simple answer. It also happens to be a very dumb answer. Well, great, now you know one of the fundamental principles of statistical analysis.”

“That simple answers tend to be bad answers?”

“No,” she said. “That when you’re asking a statistician for help, you need to operationalize your question very carefully, or the statistician is going to give you a sensible answer to a completely different question than the one you actually care about.”


“How come you never ask me about my work,” I asked her one night as we were eating dinner at Chez Margarite. She was devouring lemon-infused pork chops; I was eating a green papaya salad with mint chutney and mango salsa dressing.

“Because I don’t really care about your work,” she said.

“Oh. That’s… kind of blunt.”

“Sorry. I figured I should be honest. That’s what you say you want in a relationship, right? Honesty?”

“Sure,” I said, as the server refilled our water glasses.

“Well,” I offered. “Maybe not that much honesty.”

“Would you like me to feign interest?”

“Maybe just for a bit. That might be nice.”

“Okay,” she sighed, giving me the green light with a hand wave. “Tell me about your work.”

It was a new experience for me; I didn’t want to waste the opportunity, so I tried to choose my words carefully.

“Well, for the last month or so, I’ve been working on re-architecting our site’s database back-end. We’ve never had to worry about scaling before. Our DB can handle a few dozen queries per second, even with some pretty complicated joins. But then someone posts a product page to reddit because of a funny typo, and suddenly we’re getting hundreds of requests a second, and all hell breaks loose.”

I went on to tell her about normal forms and multivalued dependencies and different ways of modeling inheritance in databases. She listened along, nodding intermittently and at roughly appropriate intervals. But I could tell her heart wasn’t in it. She kept looking over with curiosity at the group of middle-aged Japanese businessmen seated at the next table over from us. Or out the window at the homeless man trying to sell rhododendrons to passers-by. Really, she looked everywhere but at me. Finally, I gave up.

“Look,” I said, “I know you’re not into this. I guess I don’t really need to tell you about what I do. Do you want to tell me more about the Weeble distribution?”

Her face lit up with excitement; for a moment, she looked like the moon. A cold, heartless, beautiful moon, full of numbers and error bars and mascara.

“Weibull,” she said.

“Fine,” I said. “You tell me about the Weibull distribution, and I’ll feign interest. Then we’ll have crème brulee for dessert, and then I’ll buy you a rhododendron from that guy out there on the way out.”

“Rhododendrons,” she snorted. “What a ridiculous choice of flower.”


“How long do you think this relationship is going to last,” I asked her one brisk evening as we stood outside Gordon’s Gourmets with oversized hot dogs in hand.

I was fully aware our relationship was a transient thing—like two people hanging out on a ferry for a couple of hours, both perfectly willing to have a reasonably good time together until the boat hits the far side of the lake, but neither having any real interest in trading numbers or full names.

I was in it for—let’s be honest—the sex and the conversation. As for her, I’m not really sure what she got out of it; I’m not very good at either of those things. I suppose she probably had a hard time finding anyone willing to tolerate her for more than a couple of days.

“About another month,” she said. “We should take a trip to Europe and break up there. That way it won’t be messy when we come back. You book your plane ticket, I’ll book mine. We’ll go together, but come back separately. I’ve always wanted to end a relationship that way—in a planned fashion where there are no weird expectations and no hurt feelings.”

“You think planning to break up in Europe a month from now is a good way to avoid hurt feelings?”

“Correct.”

“Okay, I guess I can see that.”


And that’s pretty much how it went. About a month later, we were sitting in a graveyard in a small village in southern France, winding our relationship down. Wine was involved, and had been involved for most of the day; we were both quite drunk.

We’d gone to see this documentary film about homeless magicians who made their living doing card tricks for tourists on the beaches of the French Riviera, and then we stumbled around town until we came across the graveyard, and then, having had a lot of wine, we decided, why not sit on the graves and talk. And so we sat on graves and talked for a while until we finally ran out of steam and affection for each other.

“How do you want to end it,” I asked her when we were completely out of meaningful words, which took less time than you might imagine.

“You sound so sinister,” she said. “Like we’re talking about a suicide pact. When really we’re just two people sitting on graves in a quiet cemetery in France, about to break up forever.”

“Yeah, that. How do you want to end it.”

“Well, I like endings like in Sex, Lies and Videotape, you know? Endings that don’t really mean anything.”

“You like endings that don’t mean anything.”

“They don’t have to literally mean nothing. I just mean they don’t have to have any deep meaning. I don’t like movies that end on some fake bullshit dramatic note just to further the plot line or provide a sense of closure. I like the ending of Sex, Lies, and Videotape because it doesn’t follow from anything; it just happens.”

“Remind me how it ends?”

“They’re sitting on the steps outside, and Ann–Andie MacDowell’s character–says, “I think it’s going to rain.” Then Graham says, “It is raining.” And that’s it. Fade to black.”

“So that’s what you like.”

“Yes.”

“And you want to end our relationship like that.”

“Yes.”

“Okay,” I said. “I guess I can do that.”

I looked around. It was almost dark, and the bottle of wine was empty. Well, why not.

“I think it’s going to rain,” I said.

“Jesus,” she said incredulously, leaning back against a headstone belonging to some guy named Jean-Francois. “I meant we should end it like that. That kind of thing. Not that actual thing. What are you, some kind of moron?”

“Oh. Okay. And yes.”

I thought about it for a while.

“I think I got this,” I finally said.

“Ok, go,” she smiled. One of the last—and only—times I saw her smile. It was devastating.

“Okay. I’m going to say: I have some unfinished business to attend to at home. I should really get back to my life. And then you should say something equally tangential and vacuous. Something like: ‘yes, you really should get back there. Your life must be lonely without you.'”

“Your life must be lonely without you…” she tried the words out.

“That’s perfect,” she smiled. “That’s exactly what I wanted.”


Internal consistency is overrated, or How I learned to stop worrying and love shorter measures, Part I

[This is the first of a two-part series motivating and introducing precis, a Python package for automated abbreviation of psychometric measures. In part I, I motivate the search for shorter measures by arguing that internal consistency is highly overrated. In part II, I describe some software that makes it relatively easy to act on this newly-acquired disregard by gleefully sacrificing internal consistency at the altar of automated abbreviation. If you’re interested in this general topic but would prefer a slightly less ridiculous (read: more academic) treatment, read this paper with Hedwig Eisenbarth and Scott Lilienfeld, or take a look at the demo IPython notebook.]

Developing a new questionnaire measure is a tricky business. There are multiple objectives one needs to satisfy simultaneously. Two important ones are:

  • The measure should be reliable. Validity is bounded by reliability; a highly unreliable measure cannot support valid inferences, and is largely useless as a research instrument.
  • The measure should be as short as is practically possible. Time is money, and nobody wants to sit around filling out a 300-item measure if a 60-item version will do.

Unfortunately, these two objectives are in tension with one another to some degree. Random error averages out as one adds more measurements, so in practice, one of the easiest ways to increase the reliability of a measure is to simply add more items. From a reliability standpoint, it’s often better to have many shitty indicators of a latent construct than a few moderately reliable ones*. For example, Cronbach’s alpha–an index of the internal consistency of a measure–is higher for a 20-item measure with a mean inter-item correlation of 0.1 than for a 5-item measure with a mean inter-item correlation of 0.3.
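If you want to check that arithmetic yourself, it only takes a couple of lines. The standardized alpha for k items with mean inter-item correlation r̄ is k·r̄ / (1 + (k - 1)·r̄); plugging in the hypothetical numbers from the example above (a quick sketch, nothing more):

```python
# Standardized Cronbach's alpha as a function of the number of items (k)
# and the mean inter-item correlation (r_bar).
def standardized_alpha(k, r_bar):
    return (k * r_bar) / (1 + (k - 1) * r_bar)

print(round(standardized_alpha(20, 0.1), 2))  # 0.69 -- 20 weakly correlated items
print(round(standardized_alpha(5, 0.3), 2))   # 0.68 -- 5 moderately correlated items
```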

Because it’s so easy to increase reliability just by adding items, reporting a certain level of internal consistency is now practically a requirement in order for a measure to be taken seriously. There’s a reasonably widespread view that an adequate level of reliability is somewhere around .8, and that anything below around .6 is just unacceptable. Perhaps as a consequence of this convention, researchers developing new questionnaires will typically include as many items as it takes to hit a “good” level of internal consistency. In practice, relatively few measures use fewer than 8 to 10 items to score each scale (though there are certainly exceptions, e.g., the Ten Item Personality Inventory). Not surprisingly, one practical implication of this policy is that researchers are usually unable to administer more than a handful of questionnaires to participants, because nobody has time to sit around filling out a dozen 100+ item questionnaires.

While understandable from one perspective, the insistence on attaining a certain level of internal consistency is also problematic. It’s easy to forget that while reliability may be necessary for validity, high internal consistency is not. One can have an extremely reliable measure that possesses little or no internal consistency. This is trivial to demonstrate by way of thought experiment. As I wrote in this post a few years ago:

Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable.
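That logic is easy to check with a quick simulation. The sketch below is purely illustrative (the sample size and error variance are arbitrary assumptions): it generates two uncorrelated traits, measures each with only a little error, and shows that the test-retest reliability of the summed score is very high even though the inter-item correlation (and hence the internal consistency) is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two uncorrelated "true" traits: shoelace-tying ability and blue-liking.
shoelace = rng.normal(size=n)
blueness = rng.normal(size=n)

def administer(trait, error_sd=0.2):
    """Observed item score = true trait plus a little measurement error."""
    return trait + rng.normal(scale=error_sd, size=n)

# Administer the two-item "scale" on two occasions and sum the items.
total_t1 = administer(shoelace) + administer(blueness)
total_t2 = administer(shoelace) + administer(blueness)

# Inter-item correlation (internal consistency) is ~0...
print(np.corrcoef(administer(shoelace), administer(blueness))[0, 1])
# ...but the test-retest reliability of the total score is ~0.96.
print(np.corrcoef(total_t1, total_t2)[0, 1])
```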

In fact, we can push this line of thought even further, and say that the perfect measure—in the sense of maximizing both reliability and brevity—should actually have an internal consistency of exactly zero. A value any higher than zero would imply the presence of redundancy between items, which in turn would suggest that we could (at least in theory, though typically not in practice) get rid of one or more items without reducing the amount of variance captured by the measure as a whole.

To use a spatial analogy, suppose we think of each of our measure’s items as a circle in a 2-dimensional space:

[Figure: 20 circles scattered across a 2D space, with substantial overlap between many of them.]

Here, our goal is to cover the maximum amount of territory using the smallest number of circles (analogous to capturing as much variance in participant responses as possible using the fewest number of items). By this light, the solution in the above figure is kind of crummy, because it fails to cover much of the space despite having 20 circles to work with. The obvious problem is that there’s a lot of redundancy between the circles—many of them overlap in space. A more sensible arrangement, assuming we insisted on keeping all 20 circles, would look like this:

[Figure: the same 20 circles rearranged to minimize overlap, covering the full space.]

In this case we get complete coverage of the target space just by realigning the circles to minimize overlap.

Alternatively, we could opt to cover more or less the same territory as the first arrangement, but using many fewer circles (in this case, 10):

[Figure: roughly the same coverage as in the first figure, achieved with only 10 circles.]

It turns out that what goes for our toy example in 2D space also holds for self-report measurement of psychological constructs that exist in much higher dimensions. For example, suppose we’re interested in developing a new measure of Extraversion, broadly construed. We want to make sure our measure covers multiple aspects of Extraversion—including sociability, increased sensitivity to reward, assertiveness, talkativeness, and so on. So we develop a fairly large item pool, and then we iteratively select groups of items that (a) have good face validity as Extraversion measures, (b) predict external criteria we think Extraversion should predict (predictive validity), and (c) tend to correlate with each other modestly-to-moderately. At some point we end up with a measure that satisfies all of these criteria, and then presumably we can publish our measure and go on to achieve great fame and fortune.

So far, so good—we’ve done everything by the book. But notice something peculiar about the way the book would have us do things: the very fact that we strive to maintain reasonably solid correlations between our items actually makes our measurement approach much less efficient. To return to our spatial analogy, it amounts to insisting that our circles have to have a high degree of overlap, so that we know for sure that we’re actually measuring what we think we’re measuring. And to be fair, we do gain something for our trouble, in the sense that we can look at our little plot above and say, a-yup, we’re definitely covering that part of the space. But we also lose something, in that we waste a lot of items (or circles) trying to cover parts of the space that have already been covered by other items.

Why would we do something so inefficient? Well, the problem is that in the real world—unlike in our simple little 2D world—we don’t usually know ahead of time exactly what territory we need to cover. We probably have a fuzzy idea of our Extraversion construct, and we might have a general sense that, you know, we should include both reward-related and sociability-related items. But it’s not as if there’s a definitive and unambiguous answer to the question “what behaviors are part of the Extraversion construct?”. There’s a good deal of variation in human behavior that could in principle be construed as part of the latent Extraversion construct, but that in practice is likely to be overlooked (or deliberately omitted) by any particular measure of Extraversion. So we have to carefully explore the space. And one reasonable way to determine whether any given item within that space is still measuring Extraversion is to inspect its correlations with other items that we consider to be unambiguous Extraversion items. If an item correlates, say, 0.5 with items like “I love big parties” and “I constantly seek out social interactions”, there’s a reasonable case to be made that it measures at least some aspects of Extraversion. So we might decide to keep it in our measure. Conversely, if an item shows very low correlations with other putative Extraversion items, we might be inclined to throw it out.

Now, there’s nothing intrinsically wrong with this strategy. But what’s important to realize is that, once we’ve settled on a measure we’re happy with, there’s no longer a good reason to keep all of that redundancy hanging around. It may be useful when we first explore the territory, but as soon as we yell out FIN! and put down our protractors and levels (or whatever it is the kids are using to create new measures these days), it’s now just costing us time and money by making data collection less efficient. We would be better off saying something like, hey, now that we know what we’re trying to measure, let’s see if we can measure it equally well with fewer items. And at that point, we’re in the land of criterion-based measure development, where the primary goal is to predict some target criterion as accurately as possible, foggy notions of internal consistency be damned.
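To give a slightly more concrete flavor of what criterion-based abbreviation can look like, here’s one possible sketch: an L1-penalized regression that retains only a sparse subset of items useful for predicting a simulated criterion. This is purely an illustration of the general idea (the data and parameters are made up), and it is not the approach described in Part II.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n_subjects, n_items = 1000, 60

# Simulated item responses; the criterion is driven by only 8 of the 60 items.
items = rng.normal(size=(n_subjects, n_items))
true_weights = np.zeros(n_items)
true_weights[:8] = 0.5
criterion = items @ true_weights + rng.normal(size=n_subjects)

# The L1 penalty zeroes out coefficients for items that don't help predict
# the criterion, leaving a much shorter measure behind.
model = LassoCV(cv=5).fit(items, criterion)
retained = np.flatnonzero(model.coef_)
print(f"Retained {len(retained)} of {n_items} items: {retained}")
```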

Unfortunately, committing ourselves fully to the noble and just cause of more efficient measurement still leaves open the question of just how we should go about eliminating items from our overly long measures. For that, you’ll have to stay tuned for Part II, wherein I use many flowery words and some concise Python code to try to convince you that this piece of software provides one reasonable way to go about it.

* On a tangential note, this is why traditional pre-publication peer review isn’t very effective, and is in dire need of replacement. Meta-analytic estimates put the inter-reviewer reliability across fields at around .2 to .3, and it’s rare to have more than two or three reviewers on a paper. No psychometrician would recommend evaluating people’s performance in high-stakes situations with just two items that have a ~.3 correlation, yet that’s how we evaluate nearly all of the scientific literature!

yet another Python state machine (and why you might care)

TL;DR: I wrote a minimalistic state machine implementation in Python. You can find the code on GitHub. The rest of this post explains what a state machine is and why you might (or might not) care. The post is slanted towards scientists who are technically inclined but lack formal training in computer science or software development. If you just want some documentation or examples, see the README.

A common problem that arises in many software applications is the need to manage an application’s trajectory through a space of discrete states. This problem will be familiar, for instance, to almost every researcher who has ever had to program an experiment for a study involving human subjects: there are typically a number of different states your study can be in (informed consent, demographic information, stimulus presentation, response collection, etc.), and these states are governed by a set of rules that determine the valid progression of your participants from one state to another. For example, a participant can proceed from informed consent to a cognitive task, but never the reverse (on pain of entering IRB hell!).

In the best possible case, the transition rules are straightforward. For example, given states [A, B, C, D], life would be simple if the only valid transitions were A –> B, B –> C, and C –> D. Unfortunately, the real world is more complicated, and state transitions are rarely completely sequential. More commonly, at least some states have multiple potential destinations. Sometimes the identity of the next state depends on meeting certain conditions while in the current state (e.g., if the subject responded incorrectly, the study may transition to a different state than if they had responded correctly); other times the rules may be probabilistic, or depend on the recent trajectory through state space (e.g., a slot machine transitions to a winning or losing state with some fixed probability that may also depend on its current position, recent history, etc.).

In software development, a standard method for dealing with this kind of problem is to use something called a finite-state machine (FSM). FSMs have been around a relatively long time (at least since Mealy and Moore’s work in the 1950s), and have all kinds of useful applications. In a nutshell, what a good state machine implementation does is represent much of the messy logic governing state transitions in a more abstract, formal and clean way. Rather than having to write a lot of complicated nested logic to direct the flow of the application through state space, one can usually get away with a terse description of (a) the possible states of the machine and (b) a list of possible transitions, including a specification of the source and destination states for each transition, what conditions must be met in order for the transition to execute, etc.

For example, suppose you need to write some code to transition between different phases in an online experiment. Your naive implementation might look vaguely like this (leaving out a lot of supporting code and focusing just on the core logic):

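Something along the lines of the following sketch, where the state names and stub helper methods are just placeholders standing in for the real validation and persistence logic:

```python
# Illustrative sketch only: a hand-rolled chain of conditionals keyed on self.state.
class OnlineStudy(object):

    def __init__(self):
        self.state = 'consent'

    # Stub helpers standing in for real validation / persistence logic.
    def user_has_signed(self): return True
    def validate_demographics(self): return True
    def save_demographics(self): print('saving demographic responses...')
    def save_personality_responses(self): print('saving personality responses...')

    def advance(self):
        if self.state == 'consent':
            # Don't advance past informed consent until the form is signed.
            if self.user_has_signed():
                self.state = 'demographics'
        elif self.state == 'demographics':
            if self.validate_demographics():
                self.save_demographics()      # save responses before moving on
                self.state = 'personality'
        elif self.state == 'personality':
            self.save_personality_responses()
            self.state = 'task'
        elif self.state == 'task':
            self.state = 'debriefing'
```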
This is a minimalistic example, but already, it illustrates several common scenarios–e.g., that the transition from one state to another often depends on meeting some specified condition (we don’t advance beyond the informed consent stage until the user signs the document), and that there may be some actions we want to issue immediately before or after a particular kind of transition (e.g., we save survey responses before we move onto the next phase).

The above code is still quite manageable, so if things never get any more complex than this, there may be no reason to abandon a (potentially lengthy) chain of conditionals in favor of a fundamentally different approach. But trouble tends to arise when the complexity does increase–e.g., you need to throw a few more states into the mix later on–or when you need to move stuff around (e.g., you decide to administer the task before the demographic survey). If you’ve ever had the frustrating experience of tracing the flow of your app through convoluted logic scattered across several files, and being unable to figure out why your code is entering the wrong state in response to some triggered event, the state machine pattern may be right for you.

I’ve made extensive use of state machines in the past when building online studies, and finding a suitable implementation has never been a problem. For example, in Rails–which is what most of my apps have been built in–there are a number of excellent options, including the state_machine plugin and (more recently) Statesman. In the last year or two, though, I’ve begun to transition all of my web development to Python (if you want to know why, read this). Python is a very common language, and the basic FSM pattern is very simple, so there are dozens of Python FSM implementations out there. But for some reason, very few of the Python implementations are as elegant and usable as their Ruby analogs. This isn’t to say there aren’t some nice ones (I’m partial to Fysom, for instance)–just that none of them quite meet my needs (in particular, there are very few fully object-oriented implementations, and I like to have my state machine tightly coupled with the model it’s managing). So I decided to write one. It’s called Transitions, and you can find the code on GitHub, or install it directly from the command prompt (“pip install transitions”, assuming you have pip installed). It’s very lightweight–fewer than 200 lines of code (the documentation is about 10 times as long!)–but still turns out to be quite functional.

For example, here’s some code that does almost exactly the same thing as what we saw above (there are much more extensive examples and documentation in the GitHub README):

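Again as an illustrative sketch, reusing the placeholder states and stub methods from the earlier snippet, and following the dictionary-based configuration style documented in the Transitions README:

```python
from transitions import Machine

class OnlineStudy(object):
    # Same stub helpers as in the earlier sketch.
    def user_has_signed(self): return True
    def validate_demographics(self): return True
    def save_demographics(self): print('saving demographic responses...')
    def save_personality_responses(self): print('saving personality responses...')

states = ['consent', 'demographics', 'personality', 'task', 'debriefing']

transitions = [
    {'trigger': 'advance', 'source': 'consent', 'dest': 'demographics',
     'conditions': 'user_has_signed'},
    {'trigger': 'advance', 'source': 'demographics', 'dest': 'personality',
     'conditions': 'validate_demographics', 'before': 'save_demographics'},
    {'trigger': 'advance', 'source': 'personality', 'dest': 'task',
     'before': 'save_personality_responses'},
    {'trigger': 'advance', 'source': 'task', 'dest': 'debriefing'},
]

study = OnlineStudy()
machine = Machine(model=study, states=states, transitions=transitions,
                  initial='consent')

study.advance()      # consent -> demographics (only if user_has_signed() is True)
print(study.state)   # 'demographics'
```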
That’s it! And now we have a nice object-oriented state machine that elegantly transitions between phases of the experiment, triggers callback functions as needed, and supports conditional transitions, branching, and various other nice features, all without ever having to write a single explicit conditional or for-loop. Understanding what’s going on is as simple as looking at the specification of the states and transitions. For example, we can tell at a glance from the second transition that if the model is currently in the ‘demographics’ state, calling advance() will effect a transition to the ‘personality’ state–conditional on the validate_demographics() function returning True. Also, right before the transition executes, the save_demographics() callback will be called.

As I noted above, given the simplicity of the example, this may not seem like a huge win. If anything, the second snippet is slightly longer than the first. But it’s also much clearer (once you’re familiar with the semantics of Transitions), scales much better as complexity increases, and will be vastly easier to modify when you need to change anything.

Anyway, I mention all of this here for two reasons. First, as small and simple a project as this is, I think it ended up being one of the more elegant and functional minimalistic Python FSMs–so I imagine a few other people might find it useful (yes, I’m basically just exploiting my PageRank on Google to drive traffic to GitHub). And second, I know many people who read this blog are researchers who regularly program experiments, but probably haven’t encountered state machines before. So, Python implementation aside, the general idea that there’s a better way to manage complex state transitions than writing a lot of ugly logic seems worth spreading.

In defense of In Defense of Facebook

A long, long time ago (in social media terms), I wrote a post defending Facebook against accusations of ethical misconduct related to a newly-published study in PNAS. I won’t rehash the study, or the accusations, or my comments in any detail here; for that, you can read the original post (I also recommend reading this or this for added context). While I stand by most of what I wrote, as is the nature of things, sometimes new information comes to light, and sometimes people say things that make me change my mind. So I thought I’d post my updated thoughts and reactions. I also left some additional thoughts in a comment on my last post, which I won’t rehash here.

Anyway, in no particular order…

I’m not arguing for a lawless world where companies can do as they like with your data

Some people apparently interpreted my last post as a defense of Facebook’s data use policy in general. It wasn’t. I probably brought this on myself in part by titling the post “In Defense of Facebook”. Maybe I should have called it something like “In Defense of this one particular study done by one Facebook employee”. In any case, I’ll reiterate: I’m categorically not saying that Facebook–or any other company, for that matter–should be allowed to do whatever it likes with its users’ data. There are plenty of valid concerns one could raise about the way companies like Facebook store, manage, and use their users’ data. And for what it’s worth, I’m generally in favor of passing new rules regulating the use of personal data in the private sector. So, contrary to what some posts suggested, I was categorically not advocating for a laissez-faire world in which large corporations get to do as they please with your information, and there’s nothing us little people can do about it.

The point I made in my last post was much narrower than that–namely, that picking on the PNAS study as an example of ethically questionable practices at Facebook was a bad idea, because (a) there aren’t any new risks introduced by this manipulation that aren’t already dwarfed by the risks associated with using Facebook itself (which is not exactly a high-risk enterprise to begin with), and (b) there are literally thousands of experiments just like this being conducted every day by large companies intent on figuring out how best to market their products and services–so Facebook’s study doesn’t stand out in any respect. My point was not that you shouldn’t be concerned about who has your data and how they’re using it, but that it’s deeply counterproductive to go after Facebook for this particular experiment when Facebook is one of the few companies in this arena that actually (occasionally) publish the results of their findings in the scientific literature, instead of hiding them entirely from the light, as almost everyone else does. Of course, that will probably change as a result of this controversy.

I Was Wrong–A/B Testing Edition.

One claim I made in my last post that was very clearly wrong is this (emphasis added):

What makes the backlash on this issue particularly strange is that I’m pretty sure most people do actually realize that their experience on Facebook (and on other websites, and on TV, and in restaurants, and in museums, and pretty much everywhere else) is constantly being manipulated. I expect that most of the people who’ve been complaining about the Facebook study on Twitter are perfectly well aware that Facebook constantly alters its user experience–I mean, they even see it happen in a noticeable way once in a while, whenever Facebook introduces a new interface.

After watching the commentary over the past two days, I think it’s pretty clear I was wrong about this. A surprisingly large number of people clearly were genuinely unaware that Facebook, Twitter, Google, and other major players in every major industry (not just tech–also banks, groceries, department stores, you name it) are constantly running large-scale, controlled experiments on their users and customers. For instance, here’s a telling comment left on my last post:

The main issue I have with the experiment is that they conducted it without telling us. Given, that would have been counterproductive, but even a small adverse affect is still an adverse affect. I just don’t like the idea that corporations can do stuff to me without my consent. Just my opinion.

Similar sentiments are all over the place. Clearly, the revelation that Facebook regularly experiments on its users without their knowledge was indeed just that to many people–a revelation. I suppose in this sense, there’s potentially a considerable upside to this controversy, inasmuch as it has clearly served to raise awareness of industry-standard practices.

Questions about the ethics of the PNAS paper’s publication

My post focused largely on the question of whether the experiment Facebook conducted was itself illegal or unethical. I took this to be the primary concern of most lay people who have expressed concern about the episode. As I discussed in my post, I think it’s quite clear that the experiment itself is (a) entirely legal and that (b) any ethical objections one could raise are actually much broader objections about the way we regulate data use and consumer privacy, and have nothing to do with Facebook in particular. However, there’s a separate question that does specifically concern Facebook–or really, the authors of the PNAS paper–which is whether the authors, in their efforts to publish their findings, violated any laws or regulations.

When I wrote my post, I was under the impression–based largely on reports of an interview with the PNAS editor, Susan Fiske–that the authors had in fact obtained approval to conduct the study from an IRB, and had simply neglected to include that information in the text (which would have been an Editorial lapse, but not an unethical act). I wrote as much in a comment on my post. I was not suggesting–as some seemed to take away–that Facebook doesn’t need to get IRB approval. I was operating on the assumption that it had obtained IRB approval, based on the information available at the time.

In any case, it now appears that may not be exactly what happened. Unfortunately, it’s not yet clear exactly what did happen. One version of events people have suggested is that the study’s authors exploited a loophole in the rules by having Facebook conduct and analyze the experiment without the involvement of the other authors–who only contributed to the genesis of the idea and the writing of the manuscript. However, this interpretation is far from established, and risks maligning the authors’ reputations unfairly, because Adam Kramer’s post explaining the motivation for the experiment suggests that the idea for the experiment originated entirely at Facebook, and was related to internal needs:

The reason we did this research is because we care about the emotional impact of Facebook and the people that use our product. We felt that it was important to investigate the common worry that seeing friends post positive content leads to people feeling negative or left out. At the same time, we were concerned that exposure to friends’ negativity might lead people to avoid visiting Facebook. We didn’t clearly state our motivations in the paper.

How you interpret the ethics of the study thus depends largely on what you believe actually happened. If you believe that the genesis and design of the experiment were driven by Facebook’s internal decision-making, and the decision to publish an interesting finding came only later, then there’s nothing at all ethically questionable about the authors’ behavior. It would have made no more sense to seek out IRB approval for this one experiment than for any of the other in-house experiments Facebook regularly conducts. And there is, again, no question whatsoever that Facebook does not have to get approval from anyone to do experiments that are not for the purpose of systematic, generalizable research.

Moreover, since the non-Facebook authors did in fact ask the IRB to review their proposal to use archival data–and the IRB exempted them from review, as is routinely done for this kind of analysis–there would be no legitimacy to the claim that the authors acted unethically. About the only claim one could raise an eyebrow at is that the authors “didn’t clearly state” their motivations. But since presenting a post-hoc justification for one’s studies that has nothing to do with the original intention is extremely common in psychology (though it shouldn’t be), it’s not really fair to fault Kramer et al for doing something that is standard practice.

If, on the other hand, the idea for the study did originate outside of Facebook, and the authors deliberately attempted to avoid prospective IRB review, then I think it’s fair to say that their behavior was unethical. However, given that the authors were following the letter of the law (if clearly not the spirit), it’s not clear that PNAS should have, or could have, rejected the paper. It certainly should have demanded that information regarding interactions with the IRB be included in the manuscript, and perhaps it could have published some kind of expression of concern alongside the paper. But I agree with Michelle Meyer’s analysis that, in taking the steps they took, the authors are almost certainly operating within the rules, because (a) Facebook itself is not subject to HHS rules, (b) the non-Facebook authors were not technically “engaged in research”, and (c) the archival use of already-collected data by the non-Facebook authors was approved by the Cornell IRB (or rather, the study was exempted from further review).

Absent clear evidence of what exactly happened in the lead-up to publication, I think the appropriate course of action is to withhold judgment. In the interim, what the episode clearly does do is lay bare how ill-prepared the existing HHS regulations are for dealing with the research use of data collected online–particularly when the data was acquired by private entities. Actually, it’s not just research use that’s problematic; it’s clear that many people complaining about Facebook’s conduct this week don’t really give a hoot about the “generalizable knowledge” side of things, and are fundamentally just upset that Facebook is allowed to run these kinds of experiments at all without providing any notification.

In my view, what’s desperately called for is a new set of regulations that provide a unitary code for dealing with consumer data across the board–i.e., in both research and non-research contexts. This leaves aside exactly what such regulations would look like, of course. My personal view is that the right direction to move in is to tighten consumer protection laws to better regulate management and use of private citizens’ data, while simultaneously liberalizing the research use of private datasets that have already been acquired. For example, I would favor a law that (a) forced Facebook and other companies to more clearly and explicitly state how they use their users’ data, (b) provided opt-out options when possible, along with the ability for users to obtain a report of how their data has been used in the past, and (c) gave blanket approval to use data acquired under these conditions for any and all academic research purposes so long as the data are deidentified. Many people will disagree with this, of course, and have very different ideas. That’s fine; the key point is that the conversation we should be having is about how to update and revise the rules governing research vs. non-research uses of data in such a way that situations like the PNAS study don’t come up again.

What Facebook does is not research–until they try to publish it

Much of the outrage over the Facebook experiment is centered around the perception that Facebook shouldn’t be allowed to conduct research on its users without their consent. What many people mean by this, I think, is that Facebook shouldn’t be allowed to conduct any experiments on its users for purposes of learning things about user experience and behavior unless Facebook explicitly asks for permission. A point that I should have clarified in my original post is that Facebook users are, in the normal course of things, not considered participants in a research study, no matter how or how much their emotions are manipulated. That’s because the HHS’s definition of research includes, as a necessary component, that there be an active intention to contribute to generalizable new knowledge.

Now, to my mind, this isn’t a great way to define “research”–I think it’s a good idea to avoid definitions that depend on knowing what people’s intentions were when they did something. But that’s the definition we’re stuck with, and there’s really no ambiguity over whether Facebook’s normal operations–which include constant randomized, controlled experimentation on its users–constitute research in this sense. They clearly don’t. Put simply, if Facebook were to eschew disseminating its results to the broader community, the experiment in question would not have been subject to any HHS regulations whatsoever (though, as Michelle Meyer astutely pointed out, technically the experiment probably isn’t subject to HHS regulation even now, so the point is moot). Again, to reiterate: it’s only the fact that Kramer et al wanted to publish their results in a scientific journal that opened them up to criticism of research misconduct in the first place.

This observation may not have any impact on your view if your concern is fundamentally about the publication process–i.e., you don’t object to Facebook doing the experiment; what you object to is Facebook trying to disseminate their findings as research. But it should have a strong impact on your views if you were previously under the impression that Facebook’s actions must have violated some existing human subjects regulation or consumer protection law. The laws in the United States–at least as I understand them, and I admittedly am not a lawyer–currently afford you no such protection.

Now, is it a good idea to have two very separate standards, one for research and one for everything else? Probably not. Should Facebook be allowed to do whatever it wants to your user experience so long as it’s covered under the Data Use policy in the user agreement you didn’t read? Probably not. But what’s unequivocally true is that, as it stands right now, your interactions with Facebook–no matter how your user experience, data, or emotions are manipulated–are not considered research unless Facebook manipulates your experience with the express intent of disseminating new knowledge to the world.

Informed consent is not mandatory for research studies

As a last point, there seems to be a very common misconception floating around among commentators that the Facebook experiment was unethical because it didn’t provide informed consent, which is a requirement for all research studies involving experimental manipulation. I addressed this in the comments on my last post in response to other comments:

[I]t’s simply not correct to suggest that all human subjects research requires informed consent. At least in the US (where Facebook is based), the rules governing research explicitly provide for a waiver of informed consent. Directly from the HHS website:

An IRB may approve a consent procedure which does not include, or which alters, some or all of the elements of informed consent set forth in this section, or waive the requirements to obtain informed consent provided the IRB finds and documents that:

(1) The research involves no more than minimal risk to the subjects;

(2) The waiver or alteration will not adversely affect the rights and welfare of the subjects;

(3) The research could not practicably be carried out without the waiver or alteration; and

(4) Whenever appropriate, the subjects will be provided with additional pertinent information after participation.

Granting such waivers is a commonplace occurrence; I myself have had online studies granted waivers before for precisely these reasons. In this particular context, it’s very clear that conditions (1) and (2) are met (because this easily passes the “not different from ordinary experience” test). Further, Facebook can also clearly argue that (3) is met, because explicitly asking for informed consent is likely not viable given internal policy, and would in any case render the experimental manipulation highly suspect (because it would no longer be random). The only point one could conceivably raise questions about is (4), but here again I think there’s a very strong case to be made that Facebook is not about to start providing debriefing information to users every time it changes some aspect of the news feed in pursuit of research, considering that its users have already agreed to its User Agreement, which authorizes this and much more.

Now, if you disagree with the above analysis, that’s fine, but what should be clear enough is that there are many IRBs (and I’ve personally interacted with some of them) that would have authorized a waiver of consent in this particular case without blinking. So this is clearly well within “reasonable people can disagree” territory, rather than “oh my god, this is clearly illegal and unethical!” territory.

I can understand the objection that Facebook should have applied for IRB approval prior to conducting the experiment (though, as I note above, that’s only true if the experiment was initially conducted as research, which is not clear right now). However, it’s important to note that there is no guarantee that an IRB would have insisted on informed consent at all in this case. There’s considerable heterogeneity in different IRBs’ interpretation of the HHS guidelines (and in fact, even across different reviewers within the same IRB), and I don’t doubt that many IRBs would have allowed Facebook’s application to sail through without any problems (see, e.g., this comment on my last post)–though I think there’s a general consensus that a debriefing of some kind would almost certainly be requested.

In defense of Facebook

[UPDATE July 1st: I’ve now posted some additional thoughts in a second post here.]

It feels a bit strange to write this post’s title, because I don’t find myself defending Facebook very often. But there seems to be some discontent in the socialmediaverse at the moment over a new study in which Facebook data scientists conducted a large-scale–over half a million participants!–experimental manipulation on Facebook in order to show that emotional contagion occurs on social networks. The news that Facebook has been actively manipulating its users’ emotions has, apparently, enraged a lot of people.

The study

Before getting into the sources of that rage–and why I think it’s misplaced–though, it’s worth describing the study and its results. Here’s a description of the basic procedure, from the paper:

The experiment manipulated the extent to which people (N = 689,003) were exposed to emotional expressions in their News Feed. This tested whether exposure to emotions led people to change their own posting behaviors, in particular whether exposure to emotional content led people to post content that was consistent with the exposure—thereby testing whether exposure to verbal affective expressions leads to similar verbal expressions, a form of emotional contagion. People who viewed Facebook in English were qualified for selection into the experiment. Two parallel experiments were conducted for positive and negative emotion: One in which exposure to friends’ positive emotional content in their News Feed was reduced, and one in which exposure to negative emotional content in their News Feed was reduced. In these conditions, when a person loaded their News Feed, posts that contained emotional content of the relevant emotional valence, each emotional post had between a 10% and 90% chance (based on their User ID) of being omitted from their News Feed for that specific viewing.

And here’s their central finding:

What the figure shows is that, in the experimental conditions, where negative or positive emotional posts are censored, users produce correspondingly more positive or negative emotional words in their own status updates. Reducing the number of negative emotional posts users saw led those users to produce more positive, and fewer negative words (relative to the unmodified control condition); conversely, reducing the number of presented positive posts led users to produce more negative and fewer positive words of their own.

Taken at face value, these results are interesting and informative. For the sake of contextualizing the concerns I discuss below, though, two points are worth noting. First, these effects, while highly statistically significant, are tiny. The largest effect size reported had a Cohen’s d of 0.02–meaning that eliminating a substantial proportion of emotional content from a user’s feed had the monumental effect of shifting that user’s own emotional word use by two hundredths of a standard deviation. In other words, the manipulation had a negligible real-world impact on users’ behavior. To put it in intuitive terms, the effect of condition in the Facebook study is roughly comparable to a hypothetical treatment that increased the average height of the male population in the United States by about one twentieth of an inch (given a standard deviation of ~2.8 inches). Theoretically interesting, perhaps, but not very meaningful in practice.
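For those who want the arithmetic behind that height analogy spelled out, the implied shift is just the effect size rescaled by the population standard deviation:

$$ d \times \mathrm{SD} \approx 0.02 \times 2.8\ \text{in} = 0.056\ \text{in} \approx \tfrac{1}{20}\ \text{in} $$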

Second, the fact that users in the experimental conditions produced content with very slightly more positive or negative emotional content doesn’t mean that those users actually felt any differently. It’s entirely possible–and I would argue, even probable–that much of the effect was driven by changes in the expression of ideas or feelings that were already on users’ minds. For example, suppose I log onto Facebook intending to write a status update to the effect that I had an “awesome day today at the beach with my besties!” Now imagine that, as soon as I log in, I see in my news feed that an acquaintance’s father just passed away. I might very well think twice about posting my own message–not necessarily because the news has made me feel sad myself, but because it surely seems a bit unseemly to celebrate one’s own good fortune around people who are currently grieving. I would argue that such subtle behavioral changes, while certainly responsive to others’ emotions, shouldn’t really be considered genuine cases of emotional contagion. Yet given how small the effects were, one wouldn’t need very many such changes to occur in order to produce the observed results. So, at the very least, the jury should still be out on the extent to which Facebook users actually feel differently as a result of this manipulation.

The concerns

Setting aside the rather modest (though still interesting!) results, let’s turn to look at the criticism. Here’s what Katy Waldman, writing in a Slate piece titled “Facebook’s Unethical Experiment”, had to say:

The researchers, who are affiliated with Facebook, Cornell, and the University of California–San Francisco, tested whether reducing the number of positive messages people saw made those people less likely to post positive content themselves. The same went for negative messages: Would scrubbing posts with sad or angry words from someone’s Facebook feed make that person write fewer gloomy updates?

The upshot? Yes, verily, social networks can propagate positive and negative feelings!

The other upshot: Facebook intentionally made thousands upon thousands of people sad.

Or consider an article in The Wire, quoting Jacob Silverman:

“What’s disturbing about how Facebook went about this, though, is that they essentially manipulated the sentiments of hundreds of thousands of users without asking permission (blame the terms of service agreements we all opt into). This research may tell us something about online behavior, but it’s undoubtedly more useful for, and more revealing of, Facebook’s own practices.”

On Twitter, the reaction to the study has been similarly negative. A lot of people appear to be very upset at the revelation that Facebook would actively manipulate its users’ news feeds in a way that could potentially influence their emotions.

Why the concerns are misplaced

To my mind, the concerns expressed in the Slate piece and elsewhere are misplaced, for several reasons. First, they largely mischaracterize the study’s experimental procedures–to the point that I suspect most of the critics haven’t actually bothered to read the paper. In particular, the suggestion that Facebook “manipulated users’ emotions” is quite misleading. Framing it that way tacitly implies that Facebook must have done something specifically designed to induce a different emotional experience in its users. In reality, for users assigned to the experimental condition, Facebook simply removed a variable proportion of status messages that were automatically detected as containing positive or negative emotional words. Let me repeat that: Facebook removed emotional messages for some users. It did not, as many people seem to be assuming, add content specifically intended to induce specific emotions. Now, given that a large amount of content on Facebook is already highly emotional in nature–think about all the people sharing their news of births, deaths, break-ups, etc.–it seems very hard to argue that Facebook would have been introducing new risks to its users even if it had presented some of them with more emotional content. But it’s certainly not credible to suggest that replacing 10% – 90% of emotional content with neutral content constitutes a potentially dangerous manipulation of people’s subjective experience.

Second, it’s not clear what the notion that Facebook users’ experience is being “manipulated” really even means, because the Facebook news feed is, and has always been, a completely contrived environment. I hope that people who are concerned about Facebook “manipulating” user experience in support of research realize that Facebook is constantly manipulating its users’ experience. In fact, by definition, every single change Facebook makes to the site alters the user experience, since there simply isn’t any experience to be had on Facebook that isn’t entirely constructed by Facebook. When you log onto Facebook, you’re not seeing a comprehensive list of everything your friends are doing, nor are you seeing a completely random subset of events. In the former case, you would be overwhelmed with information, and in the latter case, you’d get bored of Facebook very quickly. Instead, what you’re presented with is a carefully curated experience that is, from the outset, crafted in such a way as to create a more engaging experience (read: keeps you spending more time on the site, and coming back more often). The items you get to see are determined by a complex and ever-changing algorithm that you make only a partial contribution to (by indicating what you like, what you want hidden, etc.). It has always been this way, and it’s not clear that it could be any other way. So I don’t really understand what people mean when they sarcastically suggest–as Katy Waldman does in her Slate piece–that “Facebook reserves the right to seriously bum you out by cutting all that is positive and beautiful from your news feed”. Where does Waldman think all that positive and beautiful stuff comes from in the first place? Does she think it spontaneously grows wild in her news feed, free from the meddling and unnatural influence of Facebook engineers?

Third, if you were to construct a scale of possible motives for manipulating users’ behavior–with the global betterment of society at one end, and something really bad at the other end–I submit that conducting basic scientific research would almost certainly be much closer to the former end than would the other standard motives we find on the web–like trying to get people to click on more ads. The reality is that Facebook–and virtually every other large company with a major web presence–is constantly conducting large controlled experiments on user behavior. Data scientists and user experience researchers at Facebook, Twitter, Google, etc. routinely run dozens, hundreds, or thousands of experiments a day, all of which involve random assignment of users to different conditions. Typically, these manipulations aren’t conducted in order to test basic questions about emotional contagion; they’re conducted with the explicit goal of helping to increase revenue. In other words, if the idea that Facebook would actively try to manipulate your behavior bothers you, you should probably stop reading this right now and go close your account. You also should definitely not read this paper suggesting that a single social message on Facebook prior to the last US presidential election may have single-handedly increased national voter turnout by as much as 0.6%. Oh, and you should probably also stop using Google, YouTube, Yahoo, Twitter, Amazon, and pretty much every other major website–because I can assure you that, in every single case, there are people out there who get paid a good salary to… yes, manipulate your emotions and behavior! For better or worse, this is the world we live in. If you don’t like it, you can abandon the internet, or at the very least close all of your social media accounts. But the suggestion that Facebook is doing something unethical simply by publishing the results of one particular experiment among thousands–and in this case, an experiment featuring a completely innocuous design that, if anything, is probably less driven by a profit motive than most of what Facebook does–seems kind of absurd.

Fourth, it’s worth keeping in mind that there’s nothing intrinsically evil about the idea that large corporations might be trying to manipulate your experience and behavior. Everybody you interact with–including every one of your friends, family, and colleagues–is constantly trying to manipulate your behavior in various ways. Your mother wants you to eat more broccoli; your friends want you to come get smashed with them at a bar; your boss wants you to stay at work longer and take fewer breaks. We are always trying to get other people to feel, think, and do certain things that they would not otherwise have felt, thought, or done. So the meaningful question is not whether people are trying to manipulate your experience and behavior, but whether they’re trying to manipulate you in a way that aligns with or contradicts your own best interests. The mere fact that Facebook, Google, and Amazon run experiments intended to alter your emotional experience in a revenue-increasing way is not necessarily a bad thing if in the process of making more money off you, those companies also improve your quality of life. I’m not taking a stand one way or the other, mind you, but simply pointing out that without controlled experimentation, the user experience on Facebook, Google, Twitter, etc. would probably be very, very different–and most likely less pleasant. So before we lament the perceived loss of all those “positive and beautiful” items in our Facebook news feeds, we should probably remind ourselves that Facebook’s ability to identify and display those items consistently is itself in no small part a product of its continual effort to experimentally test its offering by, yes, experimentally manipulating its users’ feelings and thoughts.

What makes the backlash on this issue particularly strange is that I’m pretty sure most people do actually realize that their experience on Facebook (and on other websites, and on TV, and in restaurants, and in museums, and pretty much everywhere else) is constantly being manipulated. I expect that most of the people who’ve been complaining about the Facebook study on Twitter are perfectly well aware that Facebook constantly alters its user experience–I mean, they even see it happen in a noticeable way once in a while, whenever Facebook introduces a new interface. Given that Facebook has over half a billion users, it’s a foregone conclusion that every tiny change Facebook makes to the news feed or any other part of its websites induces a change in millions of people’s emotions. Yet nobody seems to complain about this much–presumably because, when you put it this way, it seems kind of silly to suggest that a company whose business model is predicated on getting its users to use its product more would do anything other than try to manipulate its users into, you know, using its product more.

Why the backlash is deeply counterproductive

Now, none of this is meant to suggest that there aren’t legitimate concerns one could raise about Facebook’s more general behavior–or about the immense and growing social and political influence that social media companies like Facebook wield. One can certainly question whether it’s really fair to expect users signing up for a service like Facebook’s to read and understand user agreements containing dozens of pages of dense legalese, or whether it would make sense to introduce new regulations on companies like Facebook to ensure that they don’t acquire or exert undue influence on their users’ behavior (though personally I think that would be unenforceable and kind of silly). So I’m certainly not suggesting that we give Facebook, or any other large web company, a free pass to do as it pleases. What I am suggesting, however, is that even if your real concerns are, at bottom, about the broader social and political context Facebook operates in, using this particular study as a lightning rod for criticism of Facebook is an extremely counterproductive, and potentially very damaging, strategy.

Consider: by far the most likely outcome of the backlash Facebook is currently experiencing is that, in future, its leadership will be less likely to allow its data scientists to publish their findings in the scientific literature. Remember, Facebook is not a research institute expressly designed to further understanding of the human condition; it’s a publicly-traded corporation that exists to create wealth for its shareholders. Facebook doesn’t have to share any of its data or findings with the rest of the world if it doesn’t want to; it could comfortably hoard all of its knowledge and use it for its own ends, and no one else would ever be any wiser for it. The fact that Facebook is willing to allow its data science team to spend at least some of its time publishing basic scientific research that draws on Facebook’s unparalleled resources is something to be commended, not criticized.

There is little doubt that the present backlash will do absolutely nothing to deter Facebook from actually conducting controlled experiments on its users, because A/B testing is a central component of pretty much every major web company’s business strategy at this point–and frankly, Facebook would be crazy not to try to empirically determine how to improve user experience. What criticism of the Kramer et al article will almost certainly do is decrease the scientific community’s access to, and interaction with, one of the largest and richest sources of data on human behavior in existence. You can certainly take a dim view of Facebook as a company if you like, and you’re free to critique the way they do business to your heart’s content. But haranguing Facebook and other companies like it for publicly disclosing scientifically interesting results of experiments that it is already constantly conducting anyway–and that are directly responsible for many of the positive aspects of the user experience–is not likely to accomplish anything useful. If anything, it’ll only ensure that, going forward, all of Facebook’s societally relevant experimental research is done in the dark, where nobody outside the company can ever find out–or complain–about it.

[UPDATE July 1st: I’ve posted some additional thoughts in a second post here.]