what exactly is it that 53% of neuroscience articles fail to do?

[UPDATE: Jake Westfall points out in the comments that the paper discussed here appears to have made a pretty fundamental mistake that I then carried over to my post. I've updated the post accordingly.]

[UPDATE 2: the lead author has now responded and answered my initial question and some follow-up concerns.]

A new paper in Nature Neuroscience by Emmeke Aarts and colleagues argues that neuroscientists should start using hierarchical (or multilevel) models in their work in order to account for the nested structure of their data. From the abstract:

In neuroscience, experimental designs in which multiple observations are collected from a single research object (for example, multiple neurons from one animal) are common: 53% of 314 reviewed papers from five renowned journals included this type of data. These so-called ‘nested designs’ yield data that cannot be considered to be independent, and so violate the independency assumption of conventional statistical methods such as the t test. Ignoring this dependency results in a probability of incorrectly concluding that an effect is statistically significant that is far higher (up to 80%) than the nominal α level (usually set at 5%). We discuss the factors affecting the type I error rate and the statistical power in nested data, methods that accommodate dependency between observations and ways to determine the optimal study design when data are nested. Notably, optimization of experimental designs nearly always concerns collection of more truly independent observations, rather than more observations from one research object.

I don’t have any objection to the advocacy for hierarchical models; that much seems perfectly reasonable. If you have nested data, where each subject (or petri dish or animal or whatever) provides multiple samples, it’s sensible to try to account for as many systematic sources of variance as you can. That point may have been made many times before, but it never hurts to make it again.

What I do find surprising though–and frankly, have a hard time believing–is the idea that 53% of neuroscience articles are at serious risk of Type I error inflation because they fail to account for nesting. This seems to me to be what the abstract implies, yet it’s a much stronger claim that doesn’t actually follow just from the observation that virtually no studies that have reported nested data have used hierarchical models for analysis. What it also requires is for all of those studies that use “conventional” (i.e., non-hierarchical) analyses to have actively ignored the nesting structure and treated repeated measurements as if they in fact came from entirely different subjects or clusters.

To make this concrete, suppose we have a dataset made up of 400 observations, consisting of 20 subjects who each provided 10 trials in 2 different experimental conditions (i.e., 20 x 2 x 10 = 400). And suppose the thing we ultimately want to know is whether or not there’s a statistical difference in outcome between the two conditions. There are at least three ways we could set up our comparison:

  1. Ignore the grouping variable (i.e., subject) entirely, effectively giving us 200 observations in each condition. We then conduct the test as if we have 200 independent observations in each condition.
  2. Average the 10 trials in each condition within each subject first, then conduct the test on the subject means. In this case, we effectively have 20 observations in each condition (1 per subject).
  3. Explicitly include the effects of both subject and trial in our model. In this case we have 400 observations, but we’re explicitly accounting for the correlation between trials within a given subject, so that the statistical comparison of conditions effectively has somewhere between 20 and 400 “observations” (or degrees of freedom).

Now, none of these approaches is strictly “wrong”, in that there could be specific situations in which any one of them would be called for. But as a general rule, the first approach is almost never appropriate. The reason is that we typically want to draw conclusions that generalize across the cases in the higher level of the hierarchy, and don’t have any intrinsic interest in the individual trials themselves. In the above example, we’re asking whether people, on average, behave differently in the two conditions. If we treat our data as if we had 200 subjects in each condition, effectively concatenating trials across all subjects, we’re ignoring the fact that the responses acquired from each subject will tend to be correlated (i.e., Jane Doe’s behavior on Trial 2 will tend to be more similar to her own behavior on Trial 1 than to another subject’s behavior on Trial 1). So we’re pretending that we know something about 200 different individuals sampled at random from the population, when in fact we only know something about 20 different individuals. The upshot, if we use approach (1), is that we’re going to end up answering a question quite different from the one we think we’re answering. [Update: I originally wrote here that we “do indeed run a high risk of producing false positives”. Jake Westfall points out in the comments below that we won't necessarily inflate the Type I error rate. Rather, the net effect of failing to model the nesting structure properly will depend on the relative amount of within-cluster vs. between-cluster variance. The answer we get will, however, usually deviate considerably from the answer we would get using approaches (2) or (3).]
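
To get a feel for how much this matters, here is a minimal simulation sketch of the toy example, in Python using only the standard library. It assumes a true condition effect of zero, normally distributed trial noise with unit variance, and two hypothetical variance parameters that are not part of the example above: `sd_intercept` (how much subjects differ in overall level) and `sd_slope` (how much the condition effect itself varies across subjects). The function name, the parameter values and the hardcoded t critical values are all illustrative choices of mine. It compares approach (1) with approach (2) only; approach (3) is left out because fitting a hierarchical model needs more than the standard library.

```python
import math
import random

N_SUBJECTS = 20
N_TRIALS = 10          # trials per subject per condition
# Two-sided 5% critical values of Student's t (hardcoded; they match the sizes above).
T_CRIT_DF398 = 1.966   # approach (1): 200 + 200 - 2 degrees of freedom
T_CRIT_DF19 = 2.093    # approach (2): 20 - 1 degrees of freedom


def mean(xs):
    return sum(xs) / len(xs)


def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def simulate_null(sd_intercept, sd_slope, n_sims=2000, seed=0):
    """Return (rate_trials, rate_means): false-positive rates when the true effect is zero."""
    rng = random.Random(seed)
    hits_trials = 0
    hits_means = 0
    for _ in range(n_sims):
        cond_a, cond_b, diffs = [], [], []
        for _ in range(N_SUBJECTS):
            intercept = rng.gauss(0, sd_intercept)  # this subject's overall level
            slope = rng.gauss(0, sd_slope)          # this subject's own condition effect
            a = [intercept + rng.gauss(0, 1) for _ in range(N_TRIALS)]
            b = [intercept + slope + rng.gauss(0, 1) for _ in range(N_TRIALS)]
            cond_a.extend(a)
            cond_b.extend(b)
            diffs.append(mean(b) - mean(a))

        # Approach (1): independent-samples t-test on all 200 + 200 trials,
        # ignoring which subject each trial came from.
        n = len(cond_a)
        pooled_var = (sample_var(cond_a) + sample_var(cond_b)) / 2
        t_trials = (mean(cond_b) - mean(cond_a)) / math.sqrt(2 * pooled_var / n)
        if abs(t_trials) > T_CRIT_DF398:
            hits_trials += 1

        # Approach (2): one-sample t-test on the 20 per-subject mean differences.
        t_means = mean(diffs) / math.sqrt(sample_var(diffs) / N_SUBJECTS)
        if abs(t_means) > T_CRIT_DF19:
            hits_means += 1
    return hits_trials / n_sims, hits_means / n_sims


if __name__ == "__main__":
    print("intercepts only:", simulate_null(sd_intercept=1.0, sd_slope=0.0))
    print("varying effects:", simulate_null(sd_intercept=0.3, sd_slope=0.5))
```

Under these assumptions, approach (1) should come out conservative when subjects differ only in their overall level (the first call), and should exceed the nominal 5% once subjects differ in how they respond to the condition (the second call), while approach (2) should stay close to 5% in both. That is one way of seeing why the outcome hinges on the within-cluster vs. between-cluster variance.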

By contrast, approaches (2) and (3) will, in most cases, produce pretty similar results. It’s true that the hierarchical approach is generally a more sensible thing to do, and will tend to provide a better estimate of the true population difference between the two conditions. However, it’s probably better to describe approach (2) as suboptimal, and not as wrong. So long as the subjects in our toy example above are in fact sampled at random, it’s pretty reasonable to assume that we have exactly 20 independent observations, and analyze our data accordingly. Our resulting estimates might not be quite as good as they could have been, but we’re unlikely to miss the mark by much.

To return to the Aarts et al paper, the key question is what exactly the authors mean when they say in their abstract that:

In neuroscience, experimental designs in which multiple observations are collected from a single research object (for example, multiple neurons from one animal) are common: 53% of 314 reviewed papers from five renowned journals included this type of data. These so-called ‘nested designs’ yield data that cannot be considered to be independent, and so violate the independency assumption of conventional statistical methods such as the t test. Ignoring this dependency results in a probability of incorrectly concluding that an effect is statistically significant that is far higher (up to 80%) than the nominal α level (usually set at 5%).

I’ve underlined the key phrases here. It seems to me that the implication the reader is supposed to draw from this is that roughly 53% of the neuroscience literature is at high risk of reporting spurious results. But in reality this depends entirely on whether the authors mean that 53% of studies are modeling trial-level data but ignoring the nesting structure (as in approach 1 above), or that 53% of studies in the literature aren’t using hierarchical models, even though they may be doing nothing terribly wrong otherwise (e.g., because they’re using approach (2) above).

Unfortunately, the rest of the manuscript doesn’t really clarify the matter. Here’s the section in which the authors report how they obtained that 53% number:

To assess the prevalence of nested data and the ensuing problem of inflated type I error rate in neuroscience, we scrutinized all molecular, cellular and developmental neuroscience research articles published in five renowned journals (Science, Nature, Cell, Nature Neuroscience and every month’s first issue of Neuron) in 2012 and the first six months of 2013. Unfortunately, precise evaluation of the prevalence of nesting in the literature is hampered by incomplete reporting: not all studies report whether multiple measurements were taken from each research object and, if so, how many. Still, at least 53% of the 314 examined articles clearly concerned nested data, of which 44% specifically reported the number of observations per cluster with a minimum of five observations per cluster (that is, for robust multilevel analysis a minimum of five observations per cluster is required11, 12). The median number of observations per cluster, as reported in literature, was 13 (Fig. 1a), yet conventional analysis methods were used in all of these reports.

This is, as far as I can see, still ambiguous. The only additional information provided here is that 44% of studies specifically reported the number of observations per cluster. Unfortunately this still doesn’t tell us whether the effective degrees of freedom used in the statistical tests in those papers included nested observations, or instead averaged over nested observations within each group or subject prior to analysis.

Lest this seem like a rather pedantic statistical point, I hasten to emphasize that a lot hangs on it. The potential implications for the neuroscience literature are very different under each of these two scenarios. If it is in fact true that 53% of studies are inappropriately using a “fixed-effects” model (approach 1)–which seems to me to be what the Aarts et al abstract implies–the upshot is that a good deal of neuroscience research is in very bad statistical shape, and the authors will have done the community a great service by drawing attention to the problem. On the other hand, if the vast majority of the studies in that 53% are actually doing their analyses in a perfectly reasonable–if perhaps suboptimal–way, then the Aarts et al article seems rather alarmist. It would, of course, still be true that hierarchical models should be used more widely, but the cost of failing to switch would be much lower than seems to be implied.

I’ve emailed the corresponding author to ask for a clarification. I’ll update this post if I get a reply. In the meantime, I’m interested in others’ thoughts as to the likelihood that around half of the neuroscience literature involves inappropriate reporting of fixed-effects analyses. I guess personally I would be very surprised if this were the case, though it wouldn’t be unprecedented–e.g., I gather that in the early days of neuroimaging, the SPM analysis package used a fixed-effects model by default, resulting in quite a few publications reporting grossly inflated t/z/F statistics. But that was many years ago, and in the literatures I read regularly (in psychology and cognitive neuroscience), this problem rarely arises any more. A priori, I would have expected the same to be true in cellular and molecular neuroscience.


UPDATE 04/01 (no, not an April Fool’s joke)

The lead author, Emmeke Aarts, responded to my email. Here’s her reply in full:

Thank you for your interest in our paper. As the first author of the paper, I will answer the question you send to Sophie van der Sluis. Indeed we report that 53% of the papers include nested data using conventional statistics, meaning that they did not use multilevel analysis but an analysis method that assumes independent observations like a students t-test or ANOVA.

As you also note, the data can be analyzed at two levels, at the level of the individual observations, or at the subject/animal level. Unfortunately, with the information the papers provided us, we could not extract this information for all papers. However, as described in the section ‘The prevalence of nesting in neuroscience studies’, 44% of these 53% of papers including nested data, used conventional statistics on the individual observations, with at least a mean of 5 observations per subject/animal. Another 7% of these 53% of papers including nested data used conventional statistics at the subject/animal level. So this leaves 49% unknown. Of this 49%, there is a small percentage of papers which analyzed their data at the level of individual observations, but had a mean less than 5 observations per subject/animal (I would say 10 to 20% out of the top of my head), the remaining percentage is truly unknown. Note that with a high level of dependency, using conventional statistics on nested data with 2 observations per subject/animal is already undesirable. Also note that not only analyzing nested data at the individual level is undesirable, analyzing nested data at the subject/animal level is unattractive as well, as it reduces the statistical power to detect the experimental effect of interest (see fig. 1b in the paper), in a field in which a decent level of power is already hard to achieve (e.g., Button 2013).

I think this definitively answers my original question: according to Aarts, of the 53% of studies that used nested data, at least 44% performed conventional (i.e., non-hierarchical) statistical analyses on the individual observations. (I would dispute the suggestion that this was already stated in the paper; the key phrase is “on the individual observations”, and the wording in the manuscript was much more ambiguous.) Aarts suggests that ~50% of the studies couldn’t be readily classified, so in reality that proportion could be much higher. But we can say that at least 23% of the literature surveyed committed what would, in most domains, constitute a fairly serious statistical error.
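
For anyone who wants to check the arithmetic, here is a short sketch using the percentages from Aarts’s reply. The variable names are mine, and the upper bound rests on my own assumption that every paper she couldn’t classify also analyzed individual observations, which is almost certainly too pessimistic.

```python
nested_share = 0.53         # share of the 314 reviewed papers with nested data
trial_level_share = 0.44    # of those: conventional tests on individual observations
subject_level_share = 0.07  # of those: conventional tests at the subject/animal level
unknown_share = 1 - trial_level_share - subject_level_share  # roughly 0.49, unclassified

# Lower bound: only the papers known to analyze individual observations.
lower = nested_share * trial_level_share
# Upper bound (assumption): all unclassified papers did the same.
upper = nested_share * (trial_level_share + unknown_share)

print(f"{lower:.1%} to {upper:.1%} of the surveyed literature")
# prints "23.3% to 49.3% of the surveyed literature"
```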

I then sent Aarts another email following up on Jake Westfall’s comment (i.e., how nested vs. crossed designs were handled). She replied:

As Jake Westfall points out, it indeed depends on the design if ignoring intercept variance (so variance in the mean observation per subject/animal) leads to an inflated type I error. There are two types of designs we need to distinguish here, design type I, where the experimental variable (for example control or experimental group) does not vary within the subjects/animals but only over the subjects/animals, and design Type II, where the experimental variable does vary within the subject/animal. Only in design type I, the type I error is increased by intercept variance. As pointed out in the discussion section of the paper, the paper only focuses on design Type I (“Here we focused on the most common design, that is, data that span two levels (for example, cells in mice) and an experimental variable that does not vary within clusters (for example, in comparing cell characteristic X between mutants and wild types, all cells from one mouse have the same genotype)”), to keep this already complicated matter accessible to a broad readership. Moreover, design type I is what is most frequently seen in biological neuroscience, taking multiple observations from one animal and subsequently comparing genotypes automatically results in a type I research design.

When dealing with a research design II, it is actually the variation in effect within subject/animals that increases the type I error rate (the so-called slope variance), but I will not elaborate too much on this since it is outside the scope of this paper and a completely different story.
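
The design type I case Aarts describes (genotype varies only between animals) can be sketched with another small standard-library simulation. The setup is an assumption of mine rather than anything taken from the paper: 6 animals per genotype, 13 cells per animal (the median cluster size quoted above), no true genotype effect, and a total variance of 1 split between animals and cells by a hypothetical intraclass correlation `icc`.

```python
import math
import random

N_ANIMALS = 6    # animals per genotype (hypothetical)
N_CELLS = 13     # cells per animal (median cluster size reported in the paper)
# Two-sided 5% critical values of Student's t, hardcoded for the sizes above.
T_CRIT_DF154 = 1.975  # cells as units: 2 * 6 * 13 - 2 degrees of freedom
T_CRIT_DF10 = 2.228   # animal means as units: 2 * 6 - 2 degrees of freedom


def two_sample_t(xs, ys):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    ss = sum((x - mx) ** 2 for x in xs) + sum((y - my) ** 2 for y in ys)
    pooled_var = ss / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled_var * (1 / nx + 1 / ny))


def simulate_between(icc, n_sims=2000, seed=0):
    """Return (rate_cells, rate_animals): false-positive rates with no true effect."""
    rng = random.Random(seed)
    sd_animal = math.sqrt(icc)      # between-animal (intercept) variation
    sd_cell = math.sqrt(1 - icc)    # cell-to-cell variation within an animal
    hits_cells = 0
    hits_animals = 0
    for _ in range(n_sims):
        cells = {"wild_type": [], "mutant": []}
        animal_means = {"wild_type": [], "mutant": []}
        for genotype in cells:
            for _ in range(N_ANIMALS):
                animal_level = rng.gauss(0, sd_animal)
                obs = [animal_level + rng.gauss(0, sd_cell) for _ in range(N_CELLS)]
                cells[genotype].extend(obs)
                animal_means[genotype].append(sum(obs) / N_CELLS)
        # Conventional test treating every cell as an independent observation.
        if abs(two_sample_t(cells["mutant"], cells["wild_type"])) > T_CRIT_DF154:
            hits_cells += 1
        # Conventional test on one mean per animal.
        if abs(two_sample_t(animal_means["mutant"], animal_means["wild_type"])) > T_CRIT_DF10:
            hits_animals += 1
    return hits_cells / n_sims, hits_animals / n_sims


if __name__ == "__main__":
    for icc in (0.0, 0.1, 0.5):
        print(icc, simulate_between(icc))
```

In this setup the cell-level test should sit near 5% only when `icc` is zero and climb steeply as animals start to differ from one another, whereas the test on animal means should hold its nominal rate throughout. It says nothing about power, which is the cost Aarts attributes to analyzing at the animal level.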

Again, this all sounds very straightforward and sound to me. So after both of these emails, here’s my (hopefully?) final take on the paper:

  • Work in molecular, cellular, and developmental neuroscience–or at least, the parts of those fields well-represented in five prominent journals–does indeed appear to suffer from some systemic statistical problems. While the proportion of studies at high risk of Type I error is smaller than the number Aarts et al’s abstract suggests (53%), the latter, more accurate, estimate (at least 23% of the literature) is still shockingly high. This doesn’t mean that a quarter or more of the literature can’t be trusted–as some of the commenters point out below, most conclusions aren’t based on just a single p value from a single analysis–but it does raise some very serious concerns. The Aarts et al paper is an important piece of work that will help improve statistical practice going forward.
  • The comments on this post, and on Twitter, have been interesting to read. There appear to be two broad camps of people who were sympathetic to my original concern about the paper. One camp consists of people who were similarly concerned about technical aspects of the paper, and in most cases were tripped up by the same confusion surrounding what the authors meant when they said 53% of studies used “conventional statistical analyses”. That point has now been addressed. The other camp consists of people who appear to work in the areas of neuroscience Aarts et al focused on, and were reacting not so much to the specific statistical concern raised by Aarts et al as to the broader suggestion that something might be deeply wrong with the neuroscience literature because of this. I confess that my initial knee-jerk reaction to the Aarts et al paper was driven in large part by the intuition that surely it wasn’t possible for so large a fraction of the literature to be routinely modeling subjects/clusters/groups as fixed effects. But since it appears that that is in fact the case, I’m not sure what to say with respect to the broader question of whether it is or isn’t appropriate to ignore nesting in animal studies. I will say that in the domains I personally work in, it seems very clear that collapsing across all subjects for analysis purposes is nearly always (if not always) a bad idea. Beyond that, I don’t really have any further opinion other than what I said in this response to a comment below.
  • While the claims made in the paper appear to be fundamentally sound, the presentation leaves something to be desired. It’s unclear to me why the authors relegated some of the most important technical points to the Discussion, or didn’t explicitly state them at all. The abstract also seems to me to be overly sensational–though, in hindsight, not nearly as much as I initially suspected. And it also seems questionable to tar all of neuroscience with a single brush when the analyses reported only applied to a few specific domains (and we know for a fact that in, say, neuroimaging, this problem is almost nonexistent). I guess to be charitable, one could pick the same bone with a very large proportion of published work, and this kind of thing is hardly unique to this study. Then again, the fact that a practice is widespread surely isn’t sufficient to justify that practice–or else there would be little point in Aarts et al criticizing a practice that so many people clearly engage in routinely.
  • Given my last post, I can’t help pointing out that this is a nice example of how mandatory data sharing (or failing that, a culture of strong expectations of preemptive sharing) could have made evaluation of scientific claims far easier. If the authors had attached the data file coding the 314 studies they reviewed as a supplement, I (and others) would have been able to clarify the ambiguity I originally raised much more quickly. I did send a follow-up email to Aarts to ask if she and her colleagues would consider putting the data online, but haven’t heard back yet.

strong opinions about data sharing mandates–mine included

Apparently, many scientists have rather strong feelings about data sharing mandates. In the wake of PLOS’s recent announcement–which says that, effective now, all papers published in PLOS journals must deposit their data in a publicly accessible location–a veritable gaggle of scientists have taken to their blogs to voice their outrage and/or support for the policy. The nays have posts like DrugMonkey’s complaint that the inmates are running the asylum at PLOS (more choice posts are here, here, here, and here); the yays have Edmund Hart telling the nays to get over themselves and share their data (more posts here, here, and here). While I’m a bit late to the party (mostly because I’ve been traveling and otherwise indisposed), I guess I’ll go ahead and throw my hat into the ring in support of data sharing mandates. For a number of reasons outlined below, I think time will show the anti-PLOS folks to very clearly be on the wrong side of this issue.

Mandatory public deposition is like, totally way better than a “share-upon-request” approach

You might think that proactive data deposition has little incremental utility over a philosophy of sharing one’s data upon request, since emails are these wordy little things that only take a few minutes of a data-seeker’s time to write. But it’s not just the time and effort that matter. It’s also the psychology and technology. Psychology, because if you don’t know the person on the other end, or if the data is potentially useful but not essential to you, or if you’re the agreeable sort who doesn’t like to bother other people, it’s very easy to just say, “nah, I’ll just go do something else”. Scientists are busy people. If a dataset is a click away, many people will be happy to download that dataset and play with it who wouldn’t feel comfortable emailing the author to ask for it. Technology, because data that isn’t publicly available is data that isn’t publicly indexed. It’s all well and good to say that if someone really wants a dataset, they can email you to ask for it, but if someone doesn’t know about your dataset in the first place–because it isn’t in the first three pages of Google results–they’re going to have a hard time asking.

People don’t actually share on request

Much of the criticism of the PLOS data sharing policy rests on the notion that the policy is unnecessary, because in practice most journals already mandate that authors must share their data upon request. One point that defenders of the PLOS mandate haven’t stressed enough is that such “soft” mandates are largely meaningless. Empirical studies have repeatedly demonstrated that it’s actually very difficult to get authors to share their data upon request–even when they’re obligated to do so by the contractual agreement they’ve signed with a publisher. And when researchers do fulfill data sharing requests, they often take inordinately long to do so, and the data often don’t line up properly with what was reported in the paper (as the PLOS editors noted in their explanation for introducing the policy), or reveal potentially serious errors.

Personally, I have to confess that I often haven’t fulfilled other researchers’ requests for my data–and in at least two cases, I never even responded to the request. These failures to share didn’t reflect my desire to hide anything; they occurred largely because I knew it would be a lot of work, and/or the data were no longer readily accessible to me, and/or I was too busy to take care of the request right when it came in. I think I’m sufficiently aware of my own character flaws to know that good intentions are no match for time pressure and divided attention–and that’s precisely why I’d rather submit my work to journals that force me to do the tedious curation work up front, when I have a strong incentive to do it, rather than later, when I don’t.

Comprehensive evaluation requires access to the data

It’s hard to escape the feeling that some of the push-back against the policy is actually rooted in the fear that other researchers will find mistakes in one’s work by going through one’s data. In some cases, this fear is made explicit. For example, DrugMonkey suggested that:

There will be efforts to say that the way lab X deals with their, e.g., fear conditioning trials, is not acceptable and they MUST do it the way lab Y does it. Keep in mind that this is never going to be single labs but rather clusters of lab methods traditions. So we’ll have PLoS inserting itself in the role of how experiments are to be conducted and interpreted!

This rather dire premonition prompted a commenter to ask if it’s possible that DM might ever be wrong about what his data means–necessitating other pairs of eyes and/or opinions. DM’s response was, in essence, “No.”. But clearly, this is wishful thinking: we have plenty of reasons to think that everyone in science–even the luminaries among us–makes mistakes all the time. Science is hard. In the fields I’m most familiar with, I rarely read a paper that I don’t feel has some serious flaws–even though nearly all of these papers were written by people who have, in DM’s words, “been at this for a while”. By the same token, I’m certain that other people read each of my papers and feel exactly the same way. Of course, it’s not pleasant to confront our mistakes by putting everything out into the open, and I don’t doubt that one consequence of sharing data proactively is that error-finding will indeed become much more common. At least initially (i.e., until we develop an appreciation for the true rate of error in the average dataset, and become more tolerant of minor problems), this will probably cause everyone some discomfort. But temporary discomfort surely isn’t a good excuse to continue to support practices that clearly impede scientific progress.

Part of the problem, I suspect, is that scientists have collectively internalized as acceptable many practices that are on some level clearly not good for the community as a whole. To take just one example, it’s an open secret in biomedical science that so-called “representative figures” (of spiking neurons, Western blots, or whatever else you like) are rarely truly representative. Frequently, they’re actually among the best examples the authors of a paper were able to find. The communal wink-and-shake agreement to ignore this kind of problem is deeply problematic, in that it likely allows many claims to go unchallenged that are actually not strongly supported by the data. In a world where other researchers could easily go through my dataset and show that the “representative” raster plot I presented in Figure 2C was actually the best case rather than the norm, I would probably have to be more careful about making that kind of claim up front–and someone else might not waste a lot of their time chasing results that can’t possibly be as good as my figures make them look.

Figure 1.  A representative planet.

The Data are a part of the Methods

If you still don’t find this convincing, consider that one could easily have applied nearly all of the arguments people have been making in the blogosphere these past two weeks to that dastardly scientific timesink that is the common Methods section. Imagine that we lived in a culture where scientists always reported their Results telegraphically–that is, with the brevity of a typical Nature or Science paper, but without the accompanying novel’s worth of Supplementary Methods. Then, when someone first suggested that it might perhaps be a good idea to introduce a separate section that describes in dry, technical language how authors actually produced all those exciting results, we would presumably see many people in the community saying something like the following:

Why should I bother to tell you in excruciating detail what software, reagents, and stimuli I used in my study? The vast majority of readers will never try to directly replicate my experiment, and those who do want to can just email me to get the information they need–which of course I’m always happy to provide in a timely and completely disinterested fashion. Asking me to proactively lay out every little methodological step I took is really unreasonable; it would take a very long time to write a clear “Methods” section of the kind you propose, and the benefits seem very dubious. I mean, the only thing that will happen if I adopt this new policy is that half of my competitors will start going through this new section with a fine-toothed comb in order to find problems, and the other half will now be able to scoop me by repeating the exact procedures I used before I have a chance to follow them up myself! And for what? Why do I need to tell everyone exactly what I did? I’m an expert with many years of experience in this field! I know what I’m doing, and I don’t appreciate your casting aspersions on my work and implying that my conclusions might not always be 100% sound!

As far as I can see, there isn’t any qualitative difference between reporting detailed Methods and providing comprehensive Data. In point of fact, many decisions about which methods one should use depend entirely on the nature of the data, so it’s often actually impossible to evaluate the methodological choices the authors made without seeing their data. If DrugMonkey et al think it’s crazy for one researcher to want access to another researcher’s data in order to determine whether the distribution of some variable looks normal, they should also think it’s crazy for researchers to have to report their reasoning for choosing a particular transformation in the first place. Or for using a particular reagent. Or animal strain. Or learning algorithm, or… you get the idea. But as Bjorn Brembs succinctly put it, in the digital age, this is silly: for all intents and purposes, there’s no longer any difference between text and data.

The data are funded by the taxpayers, and (in some sense) belong to the taxpayers

People vary widely in the extent to which they feel the public deserves to have access to the products of the work it funds. I don’t think I hold a particularly extreme position in this regard, in the sense that I don’t think the mere fact that someone’s effort is funded by the public automatically means any of their products should be publicly available for anyone’s perusal or use. However, when we’re talking about scientific data–where the explicit rationale for funding the work is to produce new generalizable knowledge, and where the marginal cost of replicating digital data is close to zero–I really don’t see any reason not to push very strongly to force scientists to share their data. I’m sympathetic to claims about scooping and credit assignment, but as a number of other folks have pointed out in comment threads, these are fundamentally arguments in favor of better credit assignment, and not arguments against sharing data. The fear some people have of being scooped is not sufficient justification for impeding our collective scientific progress.

It’s also worth noting that, in principle, PLOS’s new data sharing policy shouldn’t actually make it any easier for someone else to scoop you. Remember that under PLOS’s current data sharing mandate–as well as the equivalent policies at most other scientific journals–authors are already required to provide their data to anyone else upon request. Critics who argue that the new public archiving mandate opens the door to being scooped are in effect admitting that the old mandate to share upon request doesn’t work, because in theory there already shouldn’t really be anything preventing me from scooping you with your data simply by asking you for it (other than social norms–but then, the people who are actively out to usurp others’ ideas are the least likely to abide by those norms anyway). It’s striking to see how many of the posts defending the “share-upon-request” approach have no compunction in saying that they’re currently only willing to share their data after determining what the person on the other end wants to use it for–in clear violation of most journals’ existing policy.

It’s really not that hard

Organizing one’s data or code in a form minimally suitable for public consumption isn’t much fun. I do it fairly regularly; I know it sucks. It takes some time out of your day, and requires you to allocate resources to the problem that could otherwise be directed elsewhere. That said, a lot of the posts complaining about how much effort the new policy requires seem absurdly overwrought. There seems to be a widespread belief–which, as far as I can tell, isn’t supported by a careful reading of the actual PLOS policy–that there’s some incredibly strict standard that datasets have to live up to before public release. I don’t really understand where this concern comes from. Personally, I spend much of my time analyzing data other people have collected. I’ve worked with many other people’s data, and rarely is it in exactly the form I would like. Oftentimes it’s not even in the ballpark of what I’d like. And I’ve had to invest a considerable amount of my time understanding what columns and rows mean, and scrounging for morsels of (poor) documentation. My working assumption when I do this–and, I think, most other people’s–is that the onus is on me to expend some effort figuring out what’s in a dataset I wish to use, and not on the author to release that dataset in a form that a completely naive person could understand without any effort. Of course it would be nice if everyone put their data up on the web in a form that maximized accessibility, but it certainly isn’t expected. In asking authors to deposit their data publicly, PLOS isn’t asserting that there’s a specific format or standard that all data must meet; they’re just saying data must meet accepted norms. Since those norms depend on one’s field, it stands to reason that expectations will be lower for a 10-TB fMRI dataset than for an 800-row spreadsheet of behavioral data.

There are some valid concerns, but…

I don’t want to sound too Pollyannaish about all this. I’m not suggesting that the PLOS policy is perfect, or that issues won’t arise in the course of its implementation and enforcement. It’s very clear that there are some domains in which data sharing is a hassle, and I sympathize with the people who’ve pointed out that it’s not really clear what “all” the data means–is it the raw data, which aren’t likely to be very useful to anyone, or the post-processed data, which may be too close to the results reported in the paper? But such domain- or case-specific concerns are grossly outweighed by the very general observation that it’s often impossible to evaluate previous findings adequately, or to build a truly replicable science, if you don’t have access to other scientists’ data. There’s no doubt that edge cases will arise in the course of enforcing the new policy. But they’ll be dealt with on a case-by-case basis, exactly as the PLOS policy indicates. In the meantime, our default assumption should be that editors at PLOS–who are, after all, also working scientists–will behave reasonably, since they face many of the same considerations in their own research. When a researcher tells an editor that she doesn’t have anywhere to put the 50 TB of raw data for her imaging study, I expect that that editor will typically respond by saying, “fine, but surely you can drag and drop a directory full of the first- and second-level beta images, along with a basic description, into NeuroVault, right?”, and not “Whut!? No raw DICOM images, no publication!”

As for the people who worry that by sharing their data, they’ll be giving away a competitive advantage… to be honest, I think many of these folks are mistaken about the dire consequences that would ensue if they shared their data publicly. I suspect that many of the researchers in question would be pleasantly surprised at the benefits of data sharing (increased citation rates, new offers of collaboration, etc.). Still, it’s clear enough that some of the people who’ve done very well for themselves in the current scientific system–typically by leveraging some incredibly difficult-to-acquire dataset into a cottage industry of derivative studies–would indeed do much less well in a world where open data sharing was mandatory. What I fail to see, though, is why PLOS, or the scientific community as a whole, should care very much about this latter group’s concerns. As far as I can tell, PLOS’s new policy is a significant net positive for the scientific community as a whole, even if it hurts one segment of that community in the short term. For the moment, scientists who harbor proprietary attitudes towards their data can vote with their feet by submitting their papers somewhere other than PLOS. Contrary to the dire premonitions floating around, I very much doubt any potential drop in submissions is going to deliver a terminal blow to PLOS (and the upside is that the articles that do get published in PLOS will arguably be of higher quality). In the medium-to-long term, I suspect that cultural norms surrounding who gets credit for acquiring and sharing data vs. analyzing and reporting new findings based on those data are going to undergo a sea change–to the point where in the not-too-distant future, the scoopophobia that currently drives many people to privately hoard their data is a complete non-factor.
At that point, it’ll be seen as just plain common sense that if you want your scientific assertions to be taken seriously, you need to make the data used to support those assertions available for public scrutiny, re-analysis, and re-use.

 

* As a case in point, just yesterday I came across a publicly accessible dataset I really wanted to use, but that was in SPSS format. I don’t own a copy of SPSS, so I spent about an hour trying to get various third-party libraries to extract the data appropriately, without any luck. So eventually I sent the file to a colleague who was helpful enough to convert it. My first thought when I received the tab-delimited file in my mailbox this morning was not “ugh, I can’t believe they released the file in SPSS”, it was “how amazing is it that I can download this gigantic dataset acquired half the world away instantly, and with just one minor hiccup, be able to test a novel hypothesis in a high-powered way without needing to spend months of time collecting data?”

What we can and can’t learn from the Many Labs Replication Project

By now you will most likely have heard about the “Many Labs” Replication Project (MLRP)–a 36-site, 12-country, 6,344-subject effort to try to replicate a variety of classical and not-so-classical findings in psychology. You probably already know that the authors tested a variety of different effects–some recent, some not so recent (the oldest one dates back to 1941!); some well-replicated, others not so much–and reported successful replications of 10 out of 13 effects (though with widely varying effect sizes).

By and large, the reception of the MLRP paper has been overwhelmingly positive. Setting aside for the moment what the findings actually mean (see also Rolf Zwaan’s earlier take), my sense is that most psychologists are united in agreement that the mere fact that researchers at 36 different sites were able to get together and run a common protocol testing 13 different effects is a pretty big deal, and bodes well for the field in light of recent concerns about iffy results and questionable research practices.

But not everyone’s convinced. There now seems to be something of an incipient backlash against replication. Or perhaps not so much against replication itself as against the notion that the ongoing replication efforts have any special significance. An in-press paper by Joseph Cesario makes a case for deferring independent efforts to replicate an effect until the original effect is theoretically well understood (a suggestion I disagree with quite strongly, and plan to follow up on in a separate post). And a number of people have questioned, in blog comments and tweets, what the big deal is. A case in point:

I think the charitable way to interpret this sentiment is that Gilbert and others are concerned that some people might read too much into the fact that the MLRP successfully replicated 10 out of 13 effects. And clearly, at least some journalists have; for instance, Science News rather irresponsibly reported that the MLRP “offers reassurance” to psychologists. That said, I don’t think it’s fair to characterize this as anything close to a dominant reaction, and I don’t think I’ve seen any researchers react to the MLRP findings as if the 10/13 number means anything special. The piece Dan Gilbert linked to in his tweet, far from promoting “hysteria” about replication, is a Nature News article by the inimitable Ed Yong, and is characteristically careful and balanced. Far from trumpeting the fact that 10 out of 13 findings replicated, here’s a direct quote from the article:

Project co-leader Brian Nosek, a psychologist at the Center of Open Science in Charlottesville, Virginia, finds the outcomes encouraging. “It demonstrates that there are important effects in our field that are replicable, and consistently so,” he says. “But that doesn’t mean that 10 out of every 13 effects will replicate.”

Kahneman agrees. The study “appears to be extremely well done and entirely convincing”, he says, “although it is surely too early to draw extreme conclusions about entire fields of research from this single effort”.

Clearly, the mere fact that 10 out of 13 effects replicated is not in and of itself very interesting. For one thing (and as Ed Yong also noted in his article), a number of the effects were selected for inclusion in the project precisely because they had already been repeatedly replicated. Had the MLRP failed to replicate these effects–including, for instance, the seminal anchoring effect discovered by Kahneman and Tversky in the 1970s–the conclusion would likely have been that something was wrong with the methodology, and not that the anchoring effect doesn’t exist. So I think pretty much everyone can agree with Gilbert that we have most assuredly not learned, as a result of the MLRP, that there’s no replication crisis in psychology after all, and that roughly 76.9% of effects are replicable. Strictly speaking, all we know is that there are at least 10 effects in all of psychology that can be replicated. But that’s not exactly what one would call an earth-shaking revelation. What’s important to appreciate, however, is that the utility of the MLRP was never supposed to be about the number of successfully replicated effects. Rather, its value is tied to a number of other findings and demonstrations–some of which are very important, and have potentially big implications for the field at large. To wit:

1. The variance between effects is greater than the variance within effects.

Here’s the primary figure from the MLRP paper: [Figure: Many Labs Replication Project results]

Notice that the range of meta-analytic estimates for the different effect sizes (i.e., the solid green circles) is considerably larger than the range of individual estimates within a given effect. In other words, if you want to know how big a given estimate is likely to be, it’s more informative to know what effect is being studied than to know which of the 36 sites is doing the study. This may seem like a rather esoteric point, but it has important implications. Most notably, it speaks directly to the question of how much one should expect effect sizes to fluctuate from lab to lab when direct replications are attempted. If you’ve been following the controversy over the relative (non-)replicability of a number of high-profile social priming studies, you’ve probably noticed that a common defense researchers use when their findings fail to replicate is to claim that the underlying effect is very fragile, and can’t be expected to work in other researchers’ hands. What the MLRP shows, for a reasonable set of studies, is that there does not in fact appear to be a huge amount of site-to-site variability in effects. Take currency priming, for example–an effect in which priming participants with money supposedly leads them to express capitalistic beliefs and behaviors more strongly. Given a single failure to replicate the effect, one could plausibly argue that perhaps the effect was simply too fragile to reproduce consistently. But when 36 different sites all produce effects within a very narrow range–with a mean that is effectively zero–it becomes much harder to argue that the problem is that the effect is highly variable. To the contrary, the effect size estimates are remarkably consistent–it’s just that they’re consistently close to zero.
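
To make the between-vs-within comparison concrete, here is a minimal simulation sketch. It does not use the MLRP data: the true effect sizes, the site-level noise, and all variable names are made-up assumptions chosen only to illustrate what "more variance between effects than within effects" looks like numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

n_effects, n_sites = 13, 36

# Hypothetical true standardized effect sizes (NOT the MLRP estimates).
true_effects = np.linspace(0.0, 1.2, n_effects)

# Assumed model: each site's estimate = true effect + site-level noise.
site_noise_sd = 0.15
estimates = true_effects[:, None] + rng.normal(
    0.0, site_noise_sd, size=(n_effects, n_sites)
)

# Between-effect variance: how much the per-effect means differ from each other.
between_var = estimates.mean(axis=1).var(ddof=1)

# Within-effect variance: average spread of site estimates around
# their own effect's mean.
within_var = estimates.var(axis=1, ddof=1).mean()

print(f"between-effect variance: {between_var:.3f}")
print(f"within-effect variance:  {within_var:.3f}")
```

Under these assumptions the between-effect variance comes out several times larger than the within-effect variance, which is the pattern described above: knowing which effect is being studied tells you more about the expected estimate than knowing which site ran the study.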

2. Larger effects show systematically greater variability.

You can see in the above figure that the larger an effect is, the more individual estimates appear to vary across sites. In one sense, this is not terribly surprising–you might already have the statistical intuition that the larger an effect is, the more reliable variance should be available to interact with other moderating variables. Conversely, if an effect is very small to begin with, it’s probably less likely that it could turn into a very large effect under certain circumstances–or that it might reverse direction entirely. But in another sense, this finding is actually quite unexpected, because, as noted above, there’s a general sense in the field that it’s the smaller effects that tend to be more fragile and heterogeneous. To the extent we can generalize from these 13 studies, these findings should give researchers some pause before attributing replication failures to invisible moderators that somehow manage to turn very robust effects (e.g., the original currency priming effect was nearly a full standard deviation in size) into nonexistent ones.

3. A number of seemingly important variables don’t systematically moderate effects.

There have long been expressions of concern over the potential impact of cultural and population differences on psychological effects. For instance, despite repeated demonstrations that internet samples typically provide data that are as good as conventional lab samples, many researchers continue to display a deep (and in my view, completely unwarranted) skepticism of findings obtained online. More reasonably, many researchers have worried that effects obtained using university students in Western nations–the so-called WEIRD samples–may not generalize to other social groups, cultures and countries. While the MLRP results are obviously not the last word on this debate, it’s instructive to note that factors like data acquisition approach (online vs. offline) and cultural background (US vs. non-US) didn’t appear to exert a systematic effect on results. This doesn’t mean that there are no culture-specific effects in psychology of course (there undoubtedly are), but simply that our default expectation should probably be that most basic effects will generalize across cultures to at least some extent.

4. Researchers have pretty good intuitions about which findings will replicate and which ones won’t.

At the risk of offending some researchers, I submit that the likelihood that a published finding will successfully replicate is correlated to some extent with (a) the field of study it falls under and (b) the journal in which it was originally published. For example, I don’t think it’s crazy to suggest that if one were to try to replicate all of the social priming studies and all of the vision studies published in Psychological Science in the last decade, the vision studies would replicate at a consistently higher rate. Anecdotal support for this intuition comes from a string of high-profile failures to replicate famous findings–e.g., John Bargh’s demonstration that priming participants with elderly concepts leads them to walk away from an experiment more slowly. However, the MLRP goes one better than anecdote, as it included a range of effects that clearly differ in their a priori plausibility. Fortuitously, just prior to publicly releasing the MLRP results, Brian Nosek asked the following question on Twitter:

Several researchers, including me, took Brian up on his offers; here are the responses:

As you can see, pretty much everyone who replied to Brian expressed skepticism about the two priming studies (#9 and #10 in Hal Pashler’s reply). There was less consensus on the third effect. (As it happens, there were ultimately only 2 failures to replicate–the third effect became statistically significant when samples were weighted properly.) Nonetheless, most of us picked Imagined Contact as number 3, which did in fact emerge as the smallest of the statistically significant effects. (It’s probably worth mentioning that I’d personally only heard of 4 or 5 of the 13 effects prior to reading their descriptions, so it’s not as though my response was based on a deep knowledge of prior work on these effects–I simply read the descriptions of the findings and gauged their plausibility accordingly.)

Admittedly, these are just two (or three) studies. It’s possible that the MLRP researchers just happened to pick two of the only high-profile priming studies that both seem highly counterintuitive and happen to be false positives. That said, I don’t really think these findings stand out from the mass of other counterintuitive priming studies in social psychology in any way. While we obviously shouldn’t conclude from this that no high-profile, counterintuitive priming studies will successfully replicate, the fact that a number of researchers were able to prospectively determine, with a high degree of accuracy, which effects would fail to replicate (and, among those that replicated, which were rather weak), is a pretty good sign that researchers’ intuitions about plausibility and replicability are pretty decent.

Personally, I’d love to see this principle pushed further, and formalized as a much broader tool for evaluating research findings. For example, one can imagine a website where researchers could publicly (and perhaps anonymously) register their degree of confidence in the likely replicability of any finding associated with a doi or PubMed ID. I think such a service would be hugely valuable–not only because it would help calibrate individual researchers’ intuitions and provide a sense of the field’s overall belief in an effect, but because it would provide a useful index of a finding’s importance in the event of successful replication (i.e., the authors of a well-replicated finding should probably receive more credit if the finding was initially viewed with great skepticism than if it was universally deemed rather obvious).

There are other potentially important findings in the MLRP paper that I haven’t mentioned here (see Rolf Zwaan’s blog post for additional points), but if nothing else, I hope this will help convince any remaining skeptics that this is indeed a landmark paper for psychology–even though the number of successful replications is itself largely meaningless.

Oh, there’s one last point worth mentioning, in light of the rather disagreeable tone of the debate surrounding previous replication efforts. If your findings are ever called into question by a multinational consortium of 36 research groups, this is exactly how you should respond:

Social psychologist Travis Carter of Colby College in Waterville, Maine, who led the original flag-priming study, says that he is disappointed but trusts Nosek’s team wholeheartedly, although he wants to review their data before commenting further. Behavioural scientist Eugene Caruso at the University of Chicago in Illinois, who led the original currency-priming study, says, “We should use this lack of replication to update our beliefs about the reliability and generalizability of this effect”, given the “vastly larger and more diverse sample” of the MLRP. Both researchers praised the initiative.

Carter and Caruso’s attitude towards the MLRP is really exemplary; people make mistakes all the time when doing research, and shouldn’t be held responsible for the mere act of publishing incorrect findings (excepting cases of deliberate misconduct or clear negligence). What matters is, as Caruso notes, whether and to what extent one shows a willingness to update one’s beliefs in response to countervailing evidence. That’s one mark of a good scientist.

whether or not you should pursue a career in science still depends mostly on that thing that is you

I took the plunge a couple of days ago and answered my first question on Quora. Since Brad Voytek won’t shut up about how great Quora is, I figured I should give it a whirl. So far, Brad is not wrong.

The question in question is: “How much do you agree with Johnathan Katz’s advice on (not) choosing science as a career? Or how realistic is it today (the article was written in 1999)?” The Katz piece referred to is here. The gist of it should be familiar to many academics; the argument boils down to the observation that relatively few people who start graduate programs in science actually end up with permanent research positions, and even then, the need to obtain funding often crowds out the time one has to do actual science. Katz’s advice is basically: don’t pursue a career in science. It’s not an optimistic piece.

My answer is, I think, somewhat more optimistic. Here’s the full text:

The real question is what you think it means to be a scientist. Science differs from many other professions in that the typical process of training as a scientist–i.e., getting a Ph.D. in a scientific field from a major research university–doesn’t guarantee you a position among the ranks of the people who are training you. In fact, it doesn’t come close to guaranteeing it; the proportion of PhD graduates in science who go on to obtain tenure-track positions at research-intensive universities is very small–around 10% in most recent estimates. So there is a very real sense in which modern academic science is a bit of a pyramid scheme: there are a relatively small number of people at the top, and a lot of people on the rungs below laboring to get up to the top–most of whom will, by definition, fail to get there.

If you equate a career in science solely with a tenure-track position at a major research university, and are considering the prospect of a Ph.D. in science solely as an investment intended to secure that kind of position, then Katz’s conclusion is difficult to escape. He is, in most respects, correct: in most biomedical, social, and natural science fields, science is now an extremely competitive enterprise. Not everyone makes it through the PhD; of those who do, not everyone makes it into–and then through–one or more postdocs; and of those who do that, relatively few secure tenure-track positions. Then, of those few “lucky” ones, some will fail to get tenure, and many others will find themselves spending much or most of their time writing grants and managing people instead of actually doing science. So from that perspective, Katz is probably right: if what you mean when you say you want to become a scientist is that you want to run your own lab at a major research university, then your odds of achieving that at the outset are probably not very good (though, to be clear, they’re still undoubtedly better than your odds of becoming a successful artist, musician, or professional athlete). Unless you have really, really good reasons to think that you’re particularly brilliant, hard-working, and creative (note: undergraduate grades, casual feedback from family and friends, and your own internal gut sense do not qualify as really, really good reasons), you probably should not pursue a career in science.

But that’s only true given a rather narrow conception where your pursuit of a scientific career is motivated entirely by the end goal rather than by the process, and where failure is anything other than ending up with a permanent tenure-track position. By contrast, if what you’re really after is an environment in which you can pursue interesting questions in a rigorous way, surrounded by brilliant minds who share your interests, and with more freedom than you might find at a typical 9 to 5 job, the dream of being a scientist is certainly still alive, and is worth pursuing. The trivial demonstration of this is that if you’re one of the many people who actually enjoy the graduate school environment (yes, they do exist!), it may not even matter to you that much whether or not you have a good shot of getting a tenure-track position when you graduate.

To see this, imagine that you’ve just graduated with an undergraduate degree in science, and someone offers you a choice between two positions for the next six years. One position is (relatively) financially secure, but involves rather boring work of questionable utility to society, an inflexible schedule, and colleagues who are mostly only there for a paycheck. The other position has terrible pay, but offers fascinating and potentially important work, a flexible lifestyle, and colleagues who are there because they share your interests and want to do scientific research.

Admittedly, real-world choices are rarely this stark. Many non-academic jobs offer many of the same perceived benefits of academia (e.g., many tech jobs offer excellent working conditions, flexible schedules, and important work). Conversely, many academic environments don’t quite live up to the ideal of a place where you can go to pursue your intellectual passion unfettered by the annoyances of “real” jobs–there’s often just as much in the way of political intrigue, personality dysfunction, and menial dues-paying duties. But to a first approximation, this is basically the choice you have when considering whether to go to graduate school in science or pursue some other career: you’re trading financial security and a fixed 40-hour work week against intellectual engagement and a flexible lifestyle. And the point to note is that, even if we completely ignore what happens after the six years of grad school are up, there is clearly a non-negligible segment of the population who would quite happily opt for the second choice–even recognizing full well that at the end of six years they may have to leave and move on to something else, with little to show for their effort. (Of course, in reality we don’t need to ignore what happens after six years, because many PhDs who don’t get tenure-track positions find rewarding careers in other fields–many of them scientific in nature. And, even though it may not be a great economic investment, having a Ph.D. in science is a great thing to be able to put on one’s resume when applying for a very broad range of non-academic positions.)

The bottom line is that whether or not you should pursue a career in science has as much or more to do with your goals and personality as it does with the current environment within or outside of (academic) science. In an ideal world (which is certainly what the 1970s as described by Katz sound like, though I wasn’t around then), it wouldn’t matter: if you had any inkling that you wanted to do science for a living, you would simply go to grad school in science, and everything would probably work itself out. But given real-world constraints, it’s absolutely essential that you think very carefully about what kind of environment makes you happy and what your expectations and goals for the future are. You have to ask yourself: Am I the kind of person who values intellectual freedom more than financial security? Do I really love the process of actually doing science–not some idealized movie version of it, but the actual messy process–enough to warrant investing a huge amount of my time and energy over the next few years? Can I deal with perpetual uncertainty about my future? And ultimately, would I be okay doing something that I really enjoy for six years if at the end of that time I have to walk away and do something very different?

If the answer to all of these questions is yes–and for many people it is!–then pursuing a career in science is still a very good thing to do (and hey, you can always quit early if you don’t like it–then you’ve lost very little time!). If the answer to any of them is no, then Katz may be right. A prospective career in science may or may not be for you, but at the very least, you should carefully consider alternative prospects. There’s absolutely no shame in going either route; the important thing is just to make an honest decision that takes the facts as they are and not as you wish that they were.

A couple of other thoughts I’ll add belatedly:

  • Calling academia a pyramid scheme is admittedly a bit hyperbolic. It’s true that the personnel structure in academia broadly has the shape of a pyramid, but that’s true of most organizations in most other domains too. Pyramid schemes are typically built on promises and lies that (almost by definition) can’t be realized, and I don’t think many people who enter a Ph.D. program in science can claim with a straight face that they were guaranteed a permanent research position at the end of the road (or that it’s impossible to get such a position). As I suggested in this post, it’s much more likely that everyone involved is simply guilty of minor (self-)deception: faculty don’t go out of their way to tell prospective students what the odds are of actually getting a tenure-track position, and prospective grad students don’t work very hard to find out the painful truth, or to tell faculty what their real intentions are after they graduate. And it may actually be better for everyone that way.
  • Just in case it’s not clear from the above, I’m not in any way condoning the historically low levels of science funding, or the fact that very few science PhDs go on to careers in academic research. I would love for NIH and NSF budgets (or whatever your local agency is) to grow substantially–and for everyone to get exactly the kind of job they want, academic or not. But that’s not the world we live in, so we may as well be pragmatic about it and try to identify the conditions under which it does or doesn’t make sense to pursue a career in science right now.
  • I briefly mention this above, but it’s probably worth stressing that there are many jobs outside of academia that still allow one to do scientific research, albeit typically with less freedom (but often for better hours and pay). In particular, the market for data scientists is booming right now, and many of the hires are coming directly from academia. One lesson to take away from this is: if you’re in a science Ph.D. program right now, you should really spend as much time as you can building up your quantitative and technical skills, because they could very well be the difference between a job that involves scientific research and one that doesn’t in the event you leave academia. And those skills will still serve you well in your research career even if you end up staying in academia.

 

The homogenization of scientific computing, or why Python is steadily eating other languages’ lunch

Over the past two years, my scientific computing toolbox has been steadily homogenizing. Around 2010 or 2011, my toolbox looked something like this:

  • Ruby for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development;
  • Python/NumPy (mostly) and MATLAB (occasionally) for numerical computing;
  • MATLAB for neuroimaging data analysis;
  • R for statistical analysis;
  • R for plotting and visualization;
  • Occasional excursions into other languages/environments for other stuff.

In 2013, my toolbox looks like this:

  • Python for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development, except for an occasional date with Django or Flask (Python frameworks);
  • Python (NumPy/SciPy) for numerical computing;
  • Python (Neurosynth, NiPy etc.) for neuroimaging data analysis;
  • Python (NumPy/SciPy/pandas/statsmodels) for statistical analysis;
  • Python (Matplotlib) for plotting and visualization, except for web-based visualizations (JavaScript/d3.js);
  • Python (scikit-learn) for machine learning;
  • Excursions into other languages have dropped markedly.

You may notice a theme here.

The increasing homogenization (Pythonification?) of the tools I use on a regular basis primarily reflects the spectacular recent growth of the Python ecosystem. A few years ago, you couldn’t really do statistics in Python unless you wanted to spend most of your time pulling your hair out and wishing Python were more like R (which is a pretty remarkable confession considering what R is like). Neuroimaging data could be analyzed in SPM (MATLAB-based), FSL, or a variety of other packages, but there was no viable full-featured, free, open-source Python alternative. Packages for machine learning, natural language processing, and web application development were only just starting to emerge.

These days, tools for almost every aspect of scientific computing are readily available in Python. And in a growing number of cases, they’re eating the competition’s lunch.

Take R, for example. R’s out-of-the-box performance with out-of-memory datasets has long been recognized as its Achilles heel (yes, I’m aware you can get around that if you’re willing to invest the time–but not many scientists have the time). But even people who hated the way R chokes on large datasets, and its general clunkiness as a language, often couldn’t help running back to R as soon as any kind of serious data manipulation was required. You could always laboriously write code in Python or some other high-level language to pivot, aggregate, reshape, and otherwise pulverize your data, but why would you want to? The beauty of packages like plyr in R was that you could, in a matter of 2–3 lines of code, perform enormously powerful operations that could take hours to duplicate in other languages. The downside was the steep learning curve associated with each package’s often quite complicated API (e.g., ggplot2 is incredibly expressive, but every time I stop using ggplot2 for 3 months, I have to completely re-learn it), and having to contend with R’s general awkwardness. But still, on the whole, it was clearly worth it.

Flash forward to The Now. Last week, someone asked me for some simulation code I’d written in R a couple of years ago. As I was firing up RStudio to dig around for it, I realized that I hadn’t actually fired up RStudio for a very long time prior to that moment–probably not in about 6 months. The combination of NumPy/SciPy, Matplotlib, pandas and statsmodels had effectively replaced R for me, and I hadn’t even noticed. At some point I just stopped dropping out of Python and into R whenever I had to do the “real” data analysis. Instead, I just started importing pandas and statsmodels into my code. The same goes for machine learning (scikit-learn), natural language processing (nltk), document parsing (BeautifulSoup), and many other things I used to do outside Python.

It turns out that the benefits of doing all of your development and analysis in one language are quite substantial. For one thing, when you can do everything in the same language, you don’t have to suffer the constant cognitive switch costs of reminding yourself, say, that Ruby uses blocks instead of comprehensions, or that you need to call len(array) instead of array.length to get the size of an array in Python; you can just keep solving the problem you’re trying to solve with as little cognitive overhead as possible. Also, you no longer need to worry about interfacing between different languages used for different parts of a project. Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses*. In isolation, this kind of thing is not a big deal. It doesn’t take very long to write out a CSV or JSON file from Python and then read it into R. But it does add up. It makes integrated development more complicated, because you end up with more code scattered around your drive in more locations (well, at least if you have my organizational skills). It means you spend a non-negligible portion of your “analysis” time writing trivial little wrappers for all that interface stuff, instead of thinking deeply about how to actually transform and manipulate your data. And it means that your beautiful analytics code is marred by all sorts of ugly open() and read() I/O calls. All of this overhead vanishes as soon as you move to a single language.

Convenience aside, another thing that’s impressive about the Python scientific computing ecosystem is that a surprising number of Python-based tools are now best-in-class (or close to it) in terms of scope and ease of use–and, in virtue of C bindings, often even in terms of performance. It’s hard to imagine an easier-to-use machine learning package than scikit-learn, even before you factor in the breadth of implemented algorithms, excellent documentation, and outstanding performance. Similarly, I haven’t missed any of the data manipulation functionality in R since I switched to pandas. Actually, I’ve discovered many new tricks in pandas I didn’t know in R (some of which I’ll describe in an upcoming post). Considering that pandas considerably outperforms R for many common operations, the reasons for me to switch back to R or other tools–even occasionally–have dwindled.
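To make the data manipulation point a bit more concrete, here’s a minimal sketch of the aggregate-then-reshape workflow in pandas. The dataset, column names, and values are all made up for illustration, and the calls shown are the ones I’d reach for in a current pandas release; older versions may need slightly different syntax.

```python
import pandas as pd

# Hypothetical long-format data: one row per subject x condition x trial.
df = pd.DataFrame({
    "subject":   ["s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2"],
    "condition": ["a", "a", "b", "b", "a", "a", "b", "b"],
    "rt":        [510, 530, 620, 640, 480, 500, 590, 610],
})

# Aggregate: mean reaction time per subject and condition.
means = df.groupby(["subject", "condition"], as_index=False)["rt"].mean()

# Reshape: one row per subject, one column per condition.
wide = means.pivot(index="subject", columns="condition", values="rt")

# Derive a new column from the reshaped table.
wide["b_minus_a"] = wide["b"] - wide["a"]

print(wide)
```

That’s roughly the same 2–3 lines of real work you’d write with plyr, and the result is still an ordinary Python object you can hand straight to the rest of your code.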

Mind you, I don’t mean to imply that Python can now do everything anyone could ever do in other languages. That’s obviously not true. For instance, there are currently no viable replacements for many of the thousands of statistical packages users have contributed to R (if there’s a good analog for lme4 in Python, I’d love to know about it). In signal processing, I gather that many people are wedded to various MATLAB toolboxes and packages that don’t have good analogs within the Python ecosystem. And for people who need serious performance and work with very, very large datasets, there’s often still no substitute for writing highly optimized code in a low-level compiled language. So, clearly, what I’m saying here won’t apply to everyone. But I suspect it applies to the majority of scientists.

Speaking only for myself, I’ve now arrived at the point where around 90–95% of what I do can be done comfortably in Python. So the major consideration for me, when determining what language to use for a new project, has shifted from “what’s the best tool for the job that I’m willing to learn and/or tolerate using?” to “is there really no way to do this in Python?” By and large, this mentality is a good thing, though I won’t deny that it occasionally has its downsides. For example, back when I did most of my data analysis in R, I would frequently play around with random statistics packages just to see what they did. I don’t do that much any more, because the pain of having to refresh my R knowledge and deal with that thing again usually outweighs the perceived benefits of aimless statistical exploration. Conversely, sometimes I end up using Python packages that I don’t like quite as much as comparable packages in other languages, simply for the sake of preserving language purity. For example, I prefer Rails’ ActiveRecord ORM to the much more explicit SQLAlchemy ORM for Python–but I don’t prefer it enough to justify mixing Ruby and Python objects in the same application. So, clearly, there are costs. But they’re pretty small costs, and for me personally, the scales have now clearly tipped in favor of using Python for almost everything. I know many other researchers who’ve had the same experience, and I don’t think it’s entirely unfair to suggest that, at this point, Python has become the de facto language of scientific computing in many domains. If you’re reading this and haven’t had much prior exposure to Python, now’s a great time to come on board!

Postscript: In the period of time between starting this post and finishing it (two sessions spread about two weeks apart), I discovered not one but two new Python-based packages for data visualization: Michael Waskom’s seaborn package–which provides very high-level wrappers for complex plots, with a beautiful ggplot2-like aesthetic–and Continuum Analytics’ bokeh, which looks like a potential game-changer for web-based visualization**. At the rate the Python ecosystem is moving, there’s a non-zero chance that by the time you read this, I’ll be using some new Python package that directly transliterates my thoughts into analytics code.

 

* I’m aware that there are various interfaces between Python, R, etc. that allow you to internally pass objects between these languages. My experience with these has not been overwhelmingly positive, and in any case they still introduce all the overhead of writing extra lines of code and having to deal with multiple languages.

** Yes, you heard right: web-based visualization in Python. Bokeh generates static JavaScript and JSON for you from Python code, so your users are magically able to interact with your plots on a webpage without you having to write a single line of native JS code.

I’m moving to Austin!

The title pretty much says it. After spending four great years in Colorado, I’m happy to say that I’ll be moving to Austin at the end of the month. I’ll be joining the Department of Psychology at UT-Austin as a Research Associate, where I plan to continue dabbling in all things psychological and informatic, but with less snow and more air conditioning.

While my new position nominally has the same title as my old one, the new one’s a bit unusual in that the funding is coming from two quite different sources. Half of it comes from my existing NIH grant for development of the Neurosynth framework, which means that half of my time will be spent more or less the same way I’m spending it now–namely, on building tools to improve and automate the large-scale synthesis of functional MRI data. (Incidentally, I’ll be hiring a software developer and/or postdoc in the very near future, so drop me a line if you think you might be interested.)

The other half of the funding is tied to the PsyHorns course developed by Jamie Pennebaker and Sam Gosling over the past few years. PsyHorns is a synchronous massive online course (SMOC) that lets anyone in the world with an internet connection (okay, and $550 in loose change lying around) take an introductory psychology class via the internet and officially receive credit for it from the University of Texas (this recent WSJ article on PsyHorns provides some more details). My role will be to serve as a bridge between the psychologists and the developers–which means I’ll have an eclectic assortment of duties like writing algorithms to detect cheating, developing tools to predict how well people are doing in the class, mining the gigantic reams of data we’re acquiring, developing ideas for new course features, and, of course, publishing papers.

Naturally, the PILab will be joining me in my southern adventure. Since the PILab currently only has one permanent member (guess who?), and otherwise consists of a single Mac Pro workstation, this latter move involves much less effort than you might think (though it does mean I’ll have to change the lab website’s URL, logo, and–horror of horrors–color scheme). Unfortunately, all the wonderful people of the PILab will be staying behind, as they all have various much more important ties to Boulder (by which I mean that I’m not actually currently paying any of their salaries, and none of them were willing to subsist on the stipend of baked beans, love, and high-speed internet I offered them).

While I’m super excited about moving to Austin, I’m not at all excited to leave Colorado. Boulder is a wonderful place to live*–it’s sunny all the time, has a compact, walkable core, a surprising amount of stuff to do, and these gigantic mountain things you can walk all over. My wife and I have made many incredible friends here, and after four years in Colorado, it’s come to feel very much like home. So leaving will be difficult. Still, I’m excited to move on to new things. As great as the past four years have been, a number of factors precipitated this move:

  • The research fit is better. This isn’t in any way a knock against the environment here at Colorado, which has been great (hey, they’re hiring! If you do computational cognitive neuroscience, you should apply!). I had great colleagues here who work on some really interesting questions–particularly Tor Wager, my postdoc advisor for my first 3 years here, who’s an exceptional scientist and stellar human being. But every department necessarily has to focus on some areas at the expense of others, and much of the research I do (or would ideally like to do) wasn’t well-represented here. In particular, my interests in personality and individual differences have languished during my time in Boulder, as I’ve had trouble finding collaborators for most of the project ideas I’ve had. UT-Austin, by contrast, has one of the premier personality and individual differences groups anywhere. I’m delighted to be working a few doors down from people like Sam Gosling, Jamie Pennebaker, Elliot Tucker-Drob, and David Buss. On top of that, UT-Austin still has major strengths in most of my other areas of interest, most notably neuroimaging (I expect to continue to collaborate frequently with Russ Poldrack) and data mining (a world-class CS department with an expanding focus on Big Data). So, purely in terms of fit, it’s hard for me to imagine a better place than UT.
  • I’m excited to work on a project with immediate real-world impact. While I’d love to believe that most of the work I currently do is making the world better in some very small way, the reality most scientists engaged in basic research face is that at the end of the day, we don’t actually know what impact we’re having. There’s nothing inherently wrong with that, mind you; as a general rule, I’m a big believer in the idea of doing science just because it’s interesting and exciting, without worrying about the consequences (or lack thereof). You know, knowledge for its own sake and all that. Still, on a personal level, I find myself increasingly wanting to do something that I feel confers some clear and measurable benefit on the world right now–however small. In that respect, online education strikes me as an excellent area to pour my energy into. And PsyHorns is a particularly unusual (and, to my mind, promising) experiment in online education. The preliminary data from previous iterations of the course suggests that students who take the course synchronously online do better academically–not just in this particular class (as compared to an in-class section), but in other courses as well. While I’m not hugely optimistic about the malleability of the human mind as a general rule–meaning, I don’t think there are as-yet undiscovered teaching approaches that are going to radically improve the learning experience–I do believe strongly in the cumulative impact of many small nudges in the right direction. I think this is the right platform for that kind of nudging.
  • Data. Lots and lots of data. Enrollment in PsyHorns this year is about 1,500 students, and previous iterations have seen comparable numbers. As part of their introduction to psychology, the students engage in a wide range of activities: they have group chats about the material they’re learning; they write essays about a range of topics; they fill out questionnaires and attitude surveys; and, for the first time this year, they use a mobile app that assesses various aspects of their daily experience. Aside from the feedback we provide to the students (some of which is potentially actionable right away), the data we’re collecting provides a unique opportunity to address many questions at the intersection of personality and individual differences, health and subjective well-being, and education. It’s not Big Data by, say, Google or Amazon standards (we’re talking thousands of rows rather than billions), but it’s a dataset with few parallels in psychology, and I’m thrilled to be able to work on it.
  • I like doing research more than I like teaching** or doing service work. Like my current position, the position I’m assuming at UT-Austin is 100% research-focused, with very little administrative or teaching overhead. Obviously, it doesn’t have the long-term security of a tenure-track position, but I’m okay with that. I’m still selectively applying for tenure-track positions (and turned one down this year in favor of the UT position), so it’s not as though I have any principled objections to the tenure stream. But short of a really amazing opportunity, I’m very happy with my current arrangement.
  • mmm, chocolatey Austin goodness...
    Austin seems like a pretty awesome place to live. Boulder is too, but after four years of living in a relatively small place (population: ~100,000), my wife and I are looking forward to living somewhere more city-like. We’ve opted to take the (expensive) plunge and live downtown–where we’ll be within walking distance of just about everything we need. By which of course I mean the chocolate fountain at the Whole Foods mothership.
  • The tech community in Austin is booming. Given that most of my work these days lies at the interface of psychology and informatics, and there are unprecedented opportunities for psychology-related data mining in industry these days, I’m hoping to develop better collaborations with people in industry–at both startups and established companies. While I have no intention of leaving academia in the near future, I do think psychologists have collectively failed to take advantage of the many opportunities to collaborate with folks in industry on interesting questions about human behavior–often at an unprecedented scale. I’ve done a terrible job of that myself, and fixing that is near the top of my agenda. So, hey, if you work at a tech company in Austin and have some data lying around that you think might shed new insights on what people feel, think, and do, let’s chat!
  • I guess sometimes you just get the itch to move on to something new. For me, this is that.

* Okay, it was an amazing place to live until the massive floods this past week rearranged rivers, roads, and lives. My wife and I were fortunate enough to escape any personal or material damage, but many others were not so lucky. If you’d like to help, please consider making a donation.

** Actually, I love teaching. What I don’t love is all the stuff surrounding teaching.

Jirafas

This is fiction.

The party is supposed to start at 7 pm, but of course, no one shows up before 8:45. When the guests finally do arrive, I randomly assign each of them to one of four groups–A through D–as they enter. Each assignment comes with an adhesive 2″ color patch, a nametag, and a sharpie.

“The labels are not for the dinner,” I say, “they’re for the orgy that follows the dinner. The bedrooms are all color-coded; there are strict rules governing inter-cubicular transitions. Please read the manual on the table.”

Nobody moves to pick up the manual. There’s a long and uncomfortable silence, made longer and more uncomfortable by the fact that we can all hear the upstairs neighbors loudly having sex on their kitchen counter.

“Turn on the music,” my wife says. “It masks the sex.”

I put on some music. Something soft, by Elton John, followed by something angry—a duet by Tenacious D and Leonard Skynyrd. One of the guests—unsoothed by the music, and noticing the random collection of chairs scattered around the living room—grows restless and asks whether we will all be playing musical chairs this fine evening.

“No,” I reply; “this fine night, we all play Mafia.” Then I shoot him dead as everyone else pretends to stare out the window.

In the kitchen, my wife uncorks the last bottle of wine. As trendy wines go, this one wears its pretension with pride: Jugo de Jirafas, the label proclaims in vermilion Helvetica Neue overtones.

“What does jirafas mean,” I ask my Spanish friend. “Giraffes?”

“No,” she says. “Jirafas was a famous rebel general who came out of hiding during the Spanish Civil War to challenge Franco to a fight to the death. They brawled in the streets for hours, and just when it looked like Jirafas was about to snap Franco’s neck, Franco screamed for his deputies, who immediately pumped several rounds straight through Jirafas’s heart. They say the body continued to bleed courage into the street for several weeks.”

Jugo de Jirafas, I enunciate out loud.

There’s an awkward silence in the living room as the assembled guests all hold an involuntary thirty-second vigil for the dearly departed General Jirafas, who was taken from us much too soon. Poor man—we barely knew him.

Then the vigil is broken up by the arrival of my Brazilian friend João, who lives across the way. Our housing complex is nominally open to all faculty and staff affiliated with the university, but in practice it more or less operates as a kind of hippie commune for expatriate scientists. On any given day you can hear forty different languages being spoken, and stumble across marauding groups of eight-year-old children all babbling away at each other in mutual incomprehension. Walking through our apartment complex is like taking a simultaneous trip through every foreign-language channel on extended cable.

It does have its perks, though. For example, if you want to experience other cultures, you don’t need to travel anywhere. When people suggest that I’ve been working too hard and need a vacation, I yell at João through the bedroom window: how’s Rio this time of year?

Exceptional, he’ll yell back. The cannonball trees are in full bloom. You should come for a visit.

Then I usually take a bottle of wine over—nothing of Jugo de Jirafas caliber, just a basic Zinfandel from Whole Foods—and we sit around and talk about the strange places we’ve lived: Rio and Istanbul for him; Mombasa and Ottawa for me. After dinner we usually play a few games of backgammon, which is not a Brazilian game at all, but is acceptable to play because João spent three years of his life doing a postdoc in Turkey. Thus begins and ends my cosmetic Latin American vacation, punctuated by a detour to the Near East.

Tonight, João shows up with a German lady on his arm. She’s a newly arrived faculty member in the Department of Earth Sciences.

“This is the bad Jew I was telling you about,” he says to the lady by way of introduction.

“It’s true,” I say; “I’m a very bad Jew. Even by Jewish standards.”

She wants to know what makes a Jew a bad Jew. I tell her I eat bacon on the Sabbath and wrap myself in cheeseburgers before bed. And that I make sure to drink the blood of goyim at least four times a year. And that I’m so money-hungry and cunning, I’ve been banned from lending money even to other Jews.

My joke doesn’t go over so well. Germans have had, for obvious reasons, a lot of trouble putting the war behind them. When you make Jew jokes in Germany, people give you a look that’s made up of one part contempt, one part cognitive dissonance. They don’t know what to do; it’s like you’ve lit a warehouse full of bottle rockets up inside their heads all at once. As an American, I don’t mind this, of course. In America, it’s your god-given birthright to make ethnic jokes at your own expense. As long as you’re making fun only of your own in-group and nobody else, no one is allowed to come between you and your chuckles.

The German lady doesn’t see it this way.

“You should not make fun of the Jews,” she says in over-articled English. “Even if you are a one yourself.”

“Well,” says I. “If you can’t laugh at yourself, who can you laugh at?”

She shrugs her shoulders.

“Other people,” offers João.

So I laugh at João, because he’s another person. There’s an uncomfortable pause, but then the earth scientist–whose name turns out to be Brunhilde–laughs too. A moment later, we’re all making small talk again, and I feel pretty confident that any budding crisis in diplomatic relations has been averted.

“Speaking of making fun of others,” João says, “what happened to your lip? It looks like you have the herpes.”

“I damaged myself while flossing,” I tell him.

It’s true: I have a persistent cut on my lip caused by aggressive flossing. It refuses to heal. And now, after several days of incubation, it looks exactly like a cold sore. So I have to walk around my life constantly putting up with herpes jokes.

“I’ll go put something on it,” I say, self-consciously rubbing at the wound. “You just stand here and keep laughing at me, you anti-semite.”

Turns out, I’ve forgotten the name of the lip balm my wife buys. So I walk around the party with a chafed, bloody lip, asking everyone I know if they’ve seen my Tampax. The guests mostly demur quietly, but one particularly mercurial friend looks slightly alarmed, and slowly starts to edge towards the door.

He means Carmex, my wife yells from the kitchen.

Eventually, all of the wine is drunk and the conversation is spent. The guests begin to leave, each one curling his or her self carefully through the doorway in sequence. For some reason, they remind me of ants circling around a drain—but I don’t tell anyone that. There is no longer any music; there was never an orgy. There are no more Jew jokes. I turn the phonograph off—by which I mean I press the stop button on my iTunes playlist—and dim the lights. My wife stays downstairs.

“To do some research,” she says.

Much later, just as I’m making the delicate nightly transition from restless leg syndrome to stage 1 sleep, I’m suddenly jarred wide awake by the sound of someone cursing loudly and repeatedly as they get into bed next to me. I vaguely recognize my wife’s voice, though it sounds different over the haze of near-sleep and a not-insignificant amount of wine.

What’s going on, I ask her.

She mutters that she’s just spent the last hour and a half exhausting the infinite wisdom of Google, circumnavigating the information superhighway, and consulting with various technical support workers scattered all around the Indian subcontinent. And the clear consensus among all sources is that there is not now, and never was, any General Jirafas.

“It just means giraffes,” she says.

…and then there were two!

Last year when I launched my lab (which, full disclosure, is really just me, plus some of my friends who were kind enough to let me plaster their names and faces on my website), I decided to call it the Psychoinformatics Lab (or PILab for short and pretentious), because, well, why not. It seemed to nicely capture what my research is about: psychology and informatics. But it wasn’t an entirely comfortable decision, because a non-trivial portion of my brain was quite convinced that everyone was going to laugh at me. And even now, after more than a year of saying I’m a “psychoinformatician” whenever anyone asks me what I do, I still feel a little bit fraudulent each time–as if I’d just said I was a member of the Estonian Cosmonaut program, or the president of the Build-a-Bear fan club*.

But then… just last week… everything suddenly changed! All in one fell swoop–in one tiny little nudge of a shove-this-on-the-internet button, things became magically better. And now colors are vibrating**, birds are chirping merry chirping songs–no, wait, those are actually cicadas–and the world is basking in a pleasant red glow of humming monitors and five-star Amazon reviews. Or something like that. I’m not so good with the metaphors.

Why so upbeat, you ask? Well, because as of this writing, there is no longer just the one lone Psychoinformatics Lab. No! Now there are not one, not three, not seven Psychoinformatics Labs, but… two! There are two Psychoinformatics Labs. The good Dr. Michael Hanke (of PyMVPA and NeuroDebian fame) has just finished putting the last coat of paint on the inside of his brand new cage Psychoinformatics Lab at the Otto-von-Guericke University Magdeburg in Magdeburg, Germany. No, really***: his startup package didn’t include any money for paint, so he had to barter his considerable programming skills for three buckets of Going to the Chapel (yes, that’s a real paint color).

The good Dr. Hanke drifts through interstellar space in search of new psychoinformatic horizons.

Anyway, in case you can’t tell, I’m quite excited about this. Not because it’s a sign that informatics approaches are making headway in psychology, or that pretty soon every psychology lab will have a high-performance computing cluster hiding in its closet (one can dream, right?). No sir. I’m excited for two much more pedestrian reasons. First, because from now on, any time anyone makes fun of me for calling myself a psychoinformatician, I’ll be able to say, with a straight face, well it’s not just me, you know–there are multiple ones of us doing this here research-type thing with the data and the psychology and the computers. And secondly, because Michael is such a smart and hardworking guy that I’m pretty sure he’s going to legitimize this whole enterprise and drag me along for the ride with him, so I won’t have to do anything else myself. Which is good, because if laziness were an Olympic sport, I’d never leave the starting block.

No, but in all seriousness, Michael is an excellent scientist and an exceptional human being, and I couldn’t be happier for him in his new job as Lord Director of All Things Psychoinformatic (Eastern Division). You might think I’m only saying this because he just launched the world’s second PILab, complete with quote from yours truly on said lab’s website front page. Well, you’d be right. But still. He’s a pretty good guy, and I’m sure we’re going to see amazing things coming out of Magdeburg.

Now if anyone wants to launch PILab #3 (maybe in Asia or South America?), just let me know, and I’ll make you the same offer I made Michael: an envelope full of $1 bills (well, you know, I’m an academic–I can’t afford Benjamins just yet) and a blog post full of ridiculous superlatives.

 

* Perhaps that’s not a good analogy, because that one may actually exist.

** But seriously, in real life, colors should not vibrate. If you ever notice colors vibrating, drive to the nearest emergency room and tell them you’re seeing colors vibrating.

*** No, not really.

what do you get when you put 1,000 psychologists together in one journal?

I’m working on a TOP SEKKRIT* project involving large-scale data mining of the psychology literature. I don’t have anything to say about the TOP SEKKRIT* project just yet, but I will say that in the process of extracting certain information I needed in order to do certain things I won’t talk about, I ended up with certain kinds of data that are useful for certain other tangential analyses. Just for fun, I threw some co-authorship data from 2,000+ Psychological Science articles into the d3.js blender, and out popped an interactive network graph of all researchers who have published at least 2 papers in Psych Science in the last 10 years**. It looks like this:

[Image: co-authorship network graph]

You can click on the image to take a closer (and interactive) look.

I don’t think this is very useful for anything right now, but if nothing else, it’s fun to drag Adam Galinsky around the screen and watch half of the field come along for the ride. There are plenty of other more interesting things one could do with this, though, and it’s also quite easy to generate the same graph for other journals, so I expect to have more to say about this later on.
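For anyone curious about the mechanics, here’s a rough sketch of how per-article author lists could be turned into the kind of node/link JSON that d3.js network layouts are commonly fed. This is not the actual code behind the graph: the function name, the field names, and the toy author lists are all made up, and the pruning rule is just my reading of the “at least 2 papers with other retained researchers” criterion.

```python
import json
from collections import Counter
from itertools import combinations

# Hypothetical input: one list of author names per article.
articles = [
    ["A", "B"],
    ["A", "B", "C"],
    ["A", "C"],
    ["D", "E"],
    ["F"],
]

def coauthorship_graph(articles, min_papers=2):
    """Return {"nodes": [...], "links": [...]} for authors who have at least
    `min_papers` papers co-authored with other retained authors."""
    current = [sorted(set(authors)) for authors in articles]
    # Prune repeatedly: dropping one author can push a co-author
    # below the threshold, so loop until nothing changes.
    while True:
        counts = Counter()
        for authors in current:
            if len(authors) >= 2:  # only co-authored papers count
                counts.update(authors)
        keep = {a for a, n in counts.items() if n >= min_papers}
        pruned = [[a for a in authors if a in keep] for authors in current]
        if pruned == current:
            break
        current = pruned

    # Edge weight = number of papers a pair of retained authors shares.
    pair_counts = Counter()
    for authors in current:
        pair_counts.update(combinations(authors, 2))

    nodes = [{"id": a, "papers": counts[a]} for a in sorted(keep)]
    links = [{"source": a, "target": b, "weight": w}
             for (a, b), w in sorted(pair_counts.items())]
    return {"nodes": nodes, "links": links}

graph = coauthorship_graph(articles)
print(json.dumps(graph, indent=2))
```

In the toy data, D, E, and F drop out because none of them has two co-authored papers, which is the same filter that keeps the real graph from being even messier than it already is.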

 

* It’s not really TOP SEKKRIT at all–it just sounds more exciting that way.

** Or, more accurately, researchers who have co-authored at least 2 Psych Science papers with other researchers who meet the same criterion. Otherwise we’d have even more nodes in the graph, and as you can see, it’s already pretty messy.

the truth is not optional: five bad reasons (and one mediocre one) for defending the status quo

You could be forgiven for thinking that academic psychologists have all suddenly turned into professional whistleblowers. Everywhere you look, interesting new papers are cropping up purporting to describe this or that common-yet-shady methodological practice, and telling us what we can collectively do to solve the problem and improve the quality of the published literature. In just the last year or so, Uri Simonsohn introduced new techniques for detecting fraud, and used those tools to identify at least 3 cases of high-profile, unabashed data forgery. Simmons and colleagues reported simulations demonstrating that standard exploitation of research degrees of freedom in analysis can produce extremely high rates of false positive findings. Pashler and colleagues developed a “Psych file drawer” repository for tracking replication attempts. Several researchers raised trenchant questions about the veracity and/or magnitude of many high-profile psychological findings such as John Bargh’s famous social priming effects. Wicherts and colleagues showed that authors of psychology articles who are less willing to share their data upon request are more likely to make basic statistical errors in their papers. And so on and so forth. The flood shows no signs of abating; just last week, the APS journal Perspectives on Psychological Science announced that it’s introducing a new “Registered Replication Report” section that will commit to publishing pre-registered high-quality replication attempts, irrespective of their outcome.

Personally, I think these are all very welcome developments for psychological science. They’re solid indications that we psychologists are going to be able to police ourselves successfully in the face of some pretty serious problems, and they bode well for the long-term health of our discipline. My sense is that the majority of other researchers–perhaps the vast majority–share this sentiment. Still, as with any zeitgeist shift, there are always naysayers. In discussing these various developments and initiatives with other people, I’ve found myself arguing, with somewhat surprising frequency, with people who for various reasons think it’s not such a good thing that Uri Simonsohn is trying to catch fraudsters, or that social priming findings are being questioned, or that the consequences of flexible analyses are being exposed. Since many of the arguments I’ve come across tend to recur, I thought I’d summarize the most common ones here–along with the rebuttals I usually offer for why, with one possible exception, the arguments for giving a pass to sloppy-but-common methodological practices are not very compelling.

“But everyone does it, so how bad can it be?”

We typically assume that long-standing conventions must exist for some good reason, so when someone raises doubts about some widespread practice, it’s quite natural to question the person raising the doubts rather than the practice itself. Could it really, truly be (we say) that there’s something deeply strange and misguided about using p values? Is it really possible that the reporting practices converged on by thousands of researchers in tens of thousands of neuroimaging articles might leave something to be desired? Could failing to correct for the many researcher degrees of freedom associated with most datasets really inflate the false positive rate so dramatically?

The answer to all these questions, of course, is yes–or at least, we should allow that it could be yes. It is, in principle, entirely possible for an entire scientific field to regularly do things in a way that isn’t very good. There are domains where appeals to convention or consensus make perfect sense, because there are few good reasons to do things a certain way except inasmuch as other people do them the same way. If everyone else in your country drives on the right side of the road, you may want to consider driving on the right side of the road too. But science is not one of those domains. In science, there is no intrinsic benefit to doing things just for the sake of convention. In fact, almost by definition, major scientific advances are ones that tend to buck convention and suggest things that other researchers may not have considered possible or likely.

In the context of common methodological practice, it’s no defense at all to say but everyone does it this way, because there are usually relatively objective standards by which we can gauge the quality of our methods, and it’s readily apparent that there are many cases where the consensus approach leaves something to be desired. For instance, you can’t really justify failing to correct for multiple comparisons when you report a single test that’s just barely significant at p < .05 on the grounds that nobody else corrects for multiple comparisons in your field. That may be a valid explanation for why your paper successfully got published (i.e., reviewers didn’t want to hold your feet to the fire for something they themselves are guilty of in their own work), but it’s not a valid defense of the actual science. If you run a t-test on randomly generated data 20 times, you will, on average, get a significant result, p < .05, once. It does no one any good to argue that because the convention in a field is to allow multiple testing–or to ignore statistical power, or to report only p values and not effect sizes, or to omit mention of conditions that didn’t ‘work’, and so on–it’s okay to ignore the issue. There’s a perfectly reasonable question as to whether it’s a smart career move to start imposing methodological rigor on your work unilaterally (see below), but there’s no question that the mere presence of consensus or convention surrounding a methodological practice does not make that practice okay from a scientific standpoint.
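For anyone who wants to see the 20-tests arithmetic in action, here is a minimal simulation sketch. Some assumptions to flag: it uses a two-sided z-test with known variance rather than a t-test (so that it runs on the Python standard library alone), the null hypothesis is true for every test, and the sample size, number of simulated studies, and function names are all made up for illustration.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05      # nominal significance threshold
N_TESTS = 20      # uncorrected tests run per simulated "study"
N_PER_GROUP = 20  # hypothetical sample size per group


def null_p_value(rng, n=N_PER_GROUP):
    """Two-sided z-test p-value for two groups drawn from the same N(0, 1)."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    # Each observation has known variance 1, so the difference in means
    # has variance 2 / n under the null.
    z = (mean(a) - mean(b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_studies=2000, seed=0):
    """Return (mean number of 'significant' tests, share of studies with any)."""
    rng = random.Random(seed)
    hits = [
        sum(null_p_value(rng) < ALPHA for _ in range(N_TESTS))
        for _ in range(n_studies)
    ]
    mean_hits = sum(hits) / n_studies
    share_with_any = sum(1 for h in hits if h > 0) / n_studies
    return mean_hits, share_with_any


# Analytic values for independent tests when the null is true everywhere.
expected_hits = N_TESTS * ALPHA               # 20 * 0.05 = 1
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS   # 1 - 0.95**20, roughly 0.64

if __name__ == "__main__":
    mean_hits, share_with_any = simulate()
    print(f"analytic:  {expected_hits:.2f} hits per study, "
          f"P(at least one) = {p_at_least_one:.2f}")
    print(f"simulated: {mean_hits:.2f} hits per study, "
          f"P(at least one) = {share_with_any:.2f}")
```

The second number is arguably the more sobering one: under these assumptions, roughly 64% of studies that run 20 independent uncorrected tests on pure noise will have at least one “significant” result to report.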

“But psychology would break if we could only report results that were truly predicted a priori!”

This is a defense that has some plausibility at first blush. It’s certainly true that if you force researchers to correct for multiple comparisons properly, and report the many analyses they actually conducted–and not just those that “worked”–a lot of stuff that used to get through the filter will now get caught in the net. So, by definition, it would be harder to detect unexpected effects in one’s data–even when those unexpected effects are, in some sense, ‘real’. But the important thing to keep in mind is that raising the bar for what constitutes a believable finding doesn’t actually prevent researchers from discovering unexpected new effects; all it means is that it becomes harder to report post-hoc results as pre-hoc results. It’s not at all clear why forcing researchers to put in more effort validating their own unexpected finding is a bad thing.

In fact, forcing researchers to go the extra mile in this way would have one exceedingly important benefit for the field as a whole: it would shift the onus of determining whether an unexpected result is plausible enough to warrant pursuing away from the community as a whole, and towards the individual researcher who discovered the result in the first place. As it stands right now, if I discover an unexpected result (p < .05!) that I can make up a compelling story for, there’s a reasonable chance I might be able to get that single result into a short paper in, say, Psychological Science. And reap all the benefits that attend getting a paper into a “high-impact” journal. So in practice there’s very little penalty to publishing questionable results, even if I myself am not entirely (or even mostly) convinced that those results are reliable. This state of affairs is, to put it mildly, not A Good Thing.

In contrast, if you as an editor or reviewer start insisting that I run another study that directly tests and replicates my unexpected finding before you’re willing to publish my result, I now actually have something at stake. Because it takes time and money to run new studies, I’m probably not going to bother to follow up on my unexpected finding unless I really believe it. Which is exactly as it should be: I’m the guy who discovered the effect, and I know about all the corners I have or haven’t cut in order to produce it; so if anyone should make the decision about whether to spend more taxpayer money chasing the result, it should be me. You, as the reviewer, are not in a great position to know how plausible the effect truly is, because you have no idea how many different types of analyses I attempted before I got something to ‘work’, or how many failed studies I ran that I didn’t tell you about. Given the huge asymmetry in information, it seems perfectly reasonable for reviewers to say, You think you have a really cool and unexpected effect that you found a compelling story for? Great; go and directly replicate it yourself and then we’ll talk.

“But mistakes happen, and people could get falsely accused!”

Some people don’t like the idea of a guy like Simonsohn running around and busting people’s data fabrication operations for the simple reason that they worry that the kind of approach Simonsohn used to detect fraud is just not that well-tested, and that if we’re not careful, innocent people could get swept up in the net. I think this concern stems from fundamentally good intentions, but once again, I think it’s also misguided.

For one thing, it’s important to note that, despite all the press, Simonsohn hasn’t actually done anything qualitatively different from what other whistleblowers or skeptics have done in the past. He may have suggested new techniques that improve the efficiency with which cheating can be detected, but it’s not as though he invented the ability to report or investigate other researchers for suspected misconduct. Researchers suspicious of other researchers’ findings have always used qualitatively similar arguments to raise concerns. They’ve said things like, hey, look, this is a pattern of data that just couldn’t arise by chance, or, the numbers are too similar across different conditions.

More to the point, perhaps, no one is seriously suggesting that independent observers shouldn’t be allowed to raise their concerns about possible misconduct with journal editors, professional organizations, and universities. There really isn’t any viable alternative. Naysayers who worry that innocent people might end up ensnared by false accusations presumably aren’t suggesting that we do away with all of the existing mechanisms for ensuring accountability; but since the role of people like Simonsohn is only to raise suspicion and provide evidence (and not to do the actual investigating or firing), it’s clear that there’s no way to regulate this type of behavior even if we wanted to (which I would argue we don’t). If I wanted to spend the rest of my life scanning the statistical minutiae of psychology articles for evidence of misconduct and reporting it to the appropriate authorities (and I can assure you that I most certainly don’t), there would be nothing anyone could do to stop me, nor should there be. Remember that accusing someone of misconduct is something anyone can do, but establishing that misconduct has actually occurred is a serious task that requires careful internal investigation. No one–certainly not Simonsohn–is suggesting that a routine statistical test should be all it takes to end someone’s career. In fact, Simonsohn himself has noted that he identified a 4th case of likely fraud that he dutifully reported to the appropriate authorities only to be met with complete silence. Given all the incentives universities and journals have to look the other way when accusations of fraud are made, I suspect we should be much more concerned about the false negative rate than the false positive rate when it comes to fraud.

“But it hurts the public’s perception of our field!”

Sometimes people argue that even if the field does have some serious methodological problems, we still shouldn’t discuss them publicly, because doing so is likely to instill a somewhat negative view of psychological research in the public at large. The unspoken implication being that, if the public starts to lose confidence in psychology, fewer students will enroll in psychology courses, fewer faculty positions will be created to teach students, and grant funding to psychologists will decrease. So, by airing our dirty laundry in public, we’re only hurting ourselves. I had an email exchange with a well-known researcher to exactly this effect a few years back in the aftermath of the Vul et al “voodoo correlations” paper–a paper I commented on to the effect that the problem was even worse than suggested. The argument my correspondent raised was, in effect, that we (i.e., neuroimaging researchers) are all at the mercy of agencies like NIH to keep us employed, and if it starts to look like we’re clowning around, the unemployment rate for people with PhDs in cognitive neuroscience might start to rise precipitously.

While I obviously wouldn’t want anyone to lose their job or their funding solely because of a change in public perception, I can’t say I’m very sympathetic to this kind of argument. The problem is that it places short-term preservation of the status quo above both the long-term health of the field and the public’s interest. For one thing, I think you have to be quite optimistic to believe that some of the questionable methodological practices that are relatively widespread in psychology (data snooping, selective reporting, etc.) are going to sort themselves out naturally if we just look the other way and let nature run its course. The obvious reason for skepticism in this regard is that many of the same criticisms have been around for decades, and it’s not clear that anything much has improved. Maybe the best example of this is Sedlmeier and Gigerenzer’s 1989 paper entitled “Do studies of statistical power have an effect on the power of studies?”, in which the authors convincingly showed that despite three decades of work by luminaries like Jacob Cohen advocating power analyses, statistical power had not risen appreciably in psychology studies. The presence of such unwelcome demonstrations suggests that sweeping our problems under the rug in the hopes that someone (the mice?) will unobtrusively take care of them for us is wishful thinking.

In any case, even if problems did tend to solve themselves when hidden away from the prying eyes of the media and public, the bigger problem with what we might call the “saving face” defense is that it is, fundamentally, an abuse of taxpayers’ trust. As with so many other things, Richard Feynman summed up the issue eloquently in his famous Cargo Cult science commencement speech:

For example, I was a little surprised when I was talking to a friend who was going to go on the radio. He does work on cosmology and astronomy, and he wondered how he would explain what the applications of this work were. “Well,” I said, “there aren’t any.” He said, “Yes, but then we won’t get support for more research of this kind.” I think that’s kind of dishonest. If you’re representing yourself as a scientist, then you should explain to the layman what you’re doing–and if they don’t want to support you under those circumstances, then that’s their decision.

The fact of the matter is that our livelihoods as researchers depend directly on the goodwill of the public. And the taxpayers are not funding our research so that we can “discover” interesting-sounding but ultimately unreplicable effects. They’re funding our research so that we can learn more about the human mind and hopefully be able to fix it when it breaks. If a large part of the profession is routinely employing practices that are at odds with those goals, it’s not clear why taxpayers should be footing the bill. From this perspective, it might actually be a good thing for the field to revise its standards, even if (in the worst-case scenario) that causes a short-term contraction in employment.

“But unreliable effects will just fail to replicate, so what’s the big deal?”

This is a surprisingly common defense of sloppy methodology, maybe the single most common one. It’s also an enormous cop-out, since it pre-empts the need to think seriously about what you’re doing in the short term. The idea is that, since no single study is definitive, and a consensus about the reality or magnitude of most effects usually doesn’t develop until many studies have been conducted, it’s reasonable to impose a fairly low bar on initial reports and then wait and see what happens in subsequent replication efforts.

I think this is a nice ideal, but things just don’t seem to work out that way in practice. For one thing, there doesn’t seem to be much of a penalty for publishing high-profile results that later fail to replicate. The reason, I suspect, is that we’re inclined to give researchers the benefit of the doubt: surely (we say to ourselves), Jane Doe did her best, and we like Jane, so why should we question the work she produces? If we’re really so skeptical about her findings, shouldn’t we go replicate them ourselves, or wait for someone else to do it?

While this seems like an agreeable and fair-minded attitude, it isn’t actually a terribly good way to look at things. Granted, if you really did put in your best effort–dotted all your i’s and crossed all your t’s–and still ended up reporting a false result, we shouldn’t punish you for it. I don’t think anyone is seriously suggesting that researchers who inadvertently publish false findings should be ostracized or shunned. On the other hand, it’s not clear why we should continue to celebrate scientists who ‘discover’ interesting effects that later turn out not to replicate. If someone builds a career on the discovery of one or more seemingly important findings, and those findings later turn out to be wrong, the appropriate attitude is to update our beliefs about the merit of that person’s work. As it stands, we rarely seem to do this.

In any case, the bigger problem with appeals to replication is that the delay between initial publication of an exciting finding and subsequent consensus disconfirmation can be very long, and often spans entire careers. Waiting decades for history to prove an influential idea wrong is a very bad idea if the available alternative is to nip the idea in the bud by requiring stronger evidence up front.

There are many notable examples of this in the literature. A well-publicized recent one is John Bargh’s work on the motor effects of priming people with elderly stereotypes–namely, that priming people with words related to old age makes them walk away from the experiment more slowly. Bargh’s original paper was published in 1996, and according to Google Scholar, has now been cited over 2,000 times. It has undoubtedly been hugely influential in directing many psychologists’ research programs in certain directions (in many cases, in directions that are equally counterintuitive and also now seem open to question). And yet it’s taken over 15 years for a consensus to develop that the original effect is at the very least much smaller in magnitude than originally reported, and potentially so small as to be, for all intents and purposes, “not real”. I don’t know who reviewed Bargh’s paper back in 1996, but I suspect that if they ever considered the seemingly implausible size of the effect being reported, they might well have thought to themselves, well, I’m not sure I believe it, but that’s okay–time will tell. Time did tell, of course; but time is kind of lazy, so it took fifteen years for it to tell. In an alternate universe, a reviewer might have said, well, this is a striking finding, but the effect seems implausibly large; I would like you to try to directly replicate it in your lab with a much larger sample first. I recognize that this is onerous and annoying, but my primary responsibility is to ensure that only reliable findings get into the literature, and inconveniencing you seems like a small price to pay. Plus, if the effect is really what you say it is, people will be all the more likely to believe you later on.

Or take the actor-observer asymmetry, which appears in just about every introductory psychology textbook written in the last 20–30 years. It states that people are relatively more likely to attribute their own behavior to situational factors, and relatively more likely to attribute other agents’ behaviors to those agents’ dispositions. When I slip and fall, it’s because the floor was wet; when you slip and fall, it’s because you’re dumb and clumsy. This putative asymmetry was introduced and discussed at length in a book by Jones and Nisbett in 1971, and hundreds of studies have investigated it at this point. And yet a 2006 meta-analysis by Malle suggested that the cumulative evidence for the actor-observer asymmetry is actually very weak. There are some specific circumstances under which you might see something like the postulated effect, but what is quite clear is that it’s nowhere near strong enough an effect to justify being routinely invoked by psychologists and even laypeople to explain individual episodes of behavior. Unfortunately, at this point it’s almost impossible to dislodge the actor-observer asymmetry from the psyche of most researchers–a reality underscored by the fact that the Jones and Nisbett book has been cited nearly 3,000 times, whereas the 2006 meta-analysis has been cited only 96 times (a very low rate for an important and well-executed meta-analysis published in Psychological Bulletin).

The fact that it can take many years–whether 15 or 45–for a literature to build up to the point where we’re even in a position to suggest with any confidence that an initially exciting finding could be wrong means that we should be very hesitant to appeal to long-term replication as an arbiter of truth. Replication may be the gold standard in the very long term, but in the short and medium term, appealing to replication is a huge cop-out. If you can see problems with an analysis right now that cast aspersions on a study’s results, it’s an abdication of responsibility to downplay your concerns and wait for someone else to come along and spend a lot more time and money trying to replicate the study. You should point out now why you have concerns. If the authors can address them, the results will look all the better for it. And if the authors can’t address your concerns, well, then, you’ve just done science a service. If it helps, don’t think of it as a matter of saying mean things about someone else’s work, or of asserting your own ego; think of it as potentially preventing a lot of very smart people from wasting a lot of time chasing down garden paths–and also saving a lot of taxpayer money. Remember that our job as scientists is not to make other scientists’ lives easy in the hopes they’ll repay the favor when we submit our own papers; it’s to establish and apply standards that produce convergence on the truth in the shortest amount of time possible.

“But it would hurt my career to be meticulously honest about everything I do!”

Unlike the other considerations listed above, I think the concern that being honest carries a price when it comes to doing research has a good deal of merit to it. Given the aforementioned delay between initial publication and later disconfirmation of findings (which even in the best case is usually longer than the delay between obtaining a tenure-track position and coming up for tenure), researchers have many incentives to emphasize expediency and good story-telling over accuracy, and it would be disingenuous to suggest otherwise. No malevolence or outright fraud is implied here, mind you; the point is just that if you keep second-guessing and double-checking your analyses, or insist on routinely collecting more data than other researchers might think is necessary, you will very often find that results that could have made a bit of a splash given less rigor are actually not particularly interesting upon careful cross-examination. Which means that researchers who have, shall we say, less of a natural inclination to second-guess, double-check, and cross-examine their own work will, to some degree, be more likely to publish results that make a bit of a splash (it would be nice to believe that pre-publication peer review filters out sloppy work, but empirically, it just ain’t so). So this is a classic tragedy of the commons: what’s good for a given individual, career-wise, is clearly bad for the community as a whole.

I wish I had a good solution to this problem, but I don’t think there are any quick fixes. The long-term solution, as many people have observed, is to restructure the incentives governing scientific research in such a way that individual and communal benefits are directly aligned. Unfortunately, that’s easier said than done. I’ve written a lot both in papers (1, 2, 3) and on this blog (see posts linked here) about various ways we might achieve this kind of realignment, but what’s clear is that it will be a long and difficult process. For the foreseeable future, it will continue to be an understandable though highly lamentable defense to say that the cost of maintaining a career in science is that one sometimes has to play the game the same way everyone else plays the game, even if it’s clear that the rules everyone plays by are detrimental to the communal good.

 

Anyway, this may all sound a bit depressing, but I really don’t think it should be taken as such. Personally I’m actually very optimistic about the prospects for large-scale changes in the way we produce and evaluate science within the next few years. I do think we’re going to collectively figure out how to do science in a way that directly rewards people for employing research practices that are maximally beneficial to the scientific community as a whole. But I also think that for this kind of change to take place, we first need to accept that many of the defenses we routinely give for using iffy methodological practices are just not all that compelling.
