fourteen questions about selection bias, circularity, nonindependence, etc.
A new paper published online this week in the Journal of Cerebral Blood Flow & Metabolism discusses the infamous problem of circular analysis in fMRI research. The paper is aptly titled “Everything you never wanted to know about circular analysis, but were afraid to ask,” and is authored by several well-known biostatisticians and cognitive neuroscientists (to wit, Niko Kriegeskorte, Martin Lindquist, Tom Nichols, Russ Poldrack, and Ed Vul). The paper has an interesting format, and one that I really like: it’s set up as a series of fourteen questions related to circular analysis, and each author answers each question in 100 words or less.
I won’t bother going over the gist of the paper, because the Neuroskeptic already beat me to the punch in an excellent post a couple of days ago (actually, that’s how I found out about the paper); instead, I’ll just give my own answers to the same set of questions raised in the paper. And since blog posts don’t have the same length constraints as NPG journals, I’m going to be characteristically long-winded and ignore the 100 word limit…
(1) Is circular analysis a problem in systems and cognitive neuroscience?
Yes, it’s a huge problem. That said, I think the term ‘circular’ is somewhat misleading here, because it has the connotation that an analysis is completely vacuous. Truly circular analyses (i.e., those where an initial analysis is performed, and the researchers then conduct a “follow-up” analysis that literally adds no new information) are relatively rare in fMRI research. Much more common are cases where there’s some dependency between two different analyses, but the second one still adds some novel information.
(2) How widespread are slight distortions and serious errors caused by circularity in the neuroscience literature?
I think Nichols sums it up nicely here:
TN: False positives due to circularity are minimal; biased estimates of effect size are common. False positives due to brushing off the multiple testing problem (e.g., ‘P<0.001 uncorrected’ and crossing your fingers) remain pervasive.
The only thing I’d add to this is that the bias in effect size estimates is not only common, but, in most cases, is probably very large.
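To give a feel for how large that bias can get, here’s a minimal simulation sketch. Everything in it is an assumption made for illustration rather than anything from the paper: each “voxel” has the same true standardized effect (d = 0.3), there are 20 subjects, and a voxel is “reported” only if its one-sample t statistic exceeds 3.88 (roughly the two-sided p < .001 cutoff with 19 degrees of freedom).

```python
import math
import random
import statistics


def simulate_selection(true_d=0.3, n_subjects=20, n_voxels=5000,
                       t_crit=3.88, seed=0):
    """Return (mean d over all voxels, mean d over selected voxels, n selected).

    All parameter values are illustrative assumptions, not estimates
    taken from any real fMRI data set.
    """
    rng = random.Random(seed)
    all_d, selected_d = [], []
    for _ in range(n_voxels):
        # One simulated voxel: n_subjects draws around the same true effect.
        sample = [rng.gauss(true_d, 1.0) for _ in range(n_subjects)]
        d = statistics.fmean(sample) / statistics.stdev(sample)
        t = d * math.sqrt(n_subjects)  # one-sample t statistic
        all_d.append(d)
        if t > t_crit:  # the voxel "survives" the threshold and gets reported
            selected_d.append(d)
    mean_selected = statistics.fmean(selected_d) if selected_d else float("nan")
    return statistics.fmean(all_d), mean_selected, len(selected_d)


mean_all, mean_selected, n_selected = simulate_selection()
print(f"mean d, all voxels:      {mean_all:.2f}")
print(f"mean d, selected voxels: {mean_selected:.2f} (n = {n_selected})")
```

The arithmetic alone tells you what to expect: with 20 subjects, a voxel can only pass the threshold if its observed d is at least 3.88 / sqrt(20), or about 0.87, so every reported effect is nearly three times the true value of 0.3 or more.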
(3) Are circular estimates useful measures of effect size?
Yes and no. They’re less useful than unbiased measures of effect size. But given that the vast majority of effects reported in whole-brain fMRI analyses (and, more generally, analyses in most fields) are likely to be inflated to some extent, the only way to ensure we don’t rely on circular estimates of effect size would be to disregard effect size estimates entirely, which doesn’t seem prudent.
(4) Should circular estimates of effect size be presented in papers and, if so, how?
Yes, because the only principled alternatives are to either (a) never report effect sizes (which seems much too drastic), or (b) report the results of every single test performed, irrespective of the result (i.e., to never give selection bias an opportunity to rear its head). Neither of these is reasonable. We should generally report effect sizes for all key effects, but they should be accompanied by appropriate confidence intervals. As Lindquist notes:
In general, it may be useful to present any effect size estimate as confidence intervals, so that readers can see for themselves how much uncertainty is related to the point estimate.
A key point I’d add is that the width of the reported CIs should match the threshold used to identify results in the first place. In other words, if you conduct a whole-brain analysis at p < .001, you should report all resulting effects with 99.9% CIs, and not 95% CIs. I think this simple step would go a long way towards conveying the true uncertainty surrounding most point estimates in fMRI studies.
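Here’s a rough sketch of what that difference looks like in practice. It assumes the effect in question is a Pearson correlation and uses the standard Fisher z approximation; the function name and the example values (r = 0.7, n = 20) are mine, chosen purely for illustration.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, level):
    """Approximate CI for a Pearson correlation via the Fisher z transform.

    `level` is the coverage, e.g. 0.95 or 0.999. This is the textbook
    normal approximation, not a selection-adjusted interval.
    """
    z = math.atanh(r)                 # Fisher z transform of r
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


r, n = 0.7, 20  # hypothetical "significant" brain-behavior correlation
lo95, hi95 = correlation_ci(r, n, 0.95)
lo999, hi999 = correlation_ci(r, n, 0.999)
print(f"95%   CI: [{lo95:.2f}, {hi95:.2f}]")
print(f"99.9% CI: [{lo999:.2f}, {hi999:.2f}]")
```

For these made-up numbers the 95% interval runs from roughly 0.37 to 0.87, while the 99.9% interval runs from roughly 0.07 to 0.93. The second one makes it much harder to walk away thinking the true correlation must be large.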
(5) Are effect size estimates important/useful for neuroscience research, and why?
I think my view here is closest to Ed Vul’s:
Yes, very much so. Null-hypothesis testing is insufficient for most goals of neuroscience because it can only indicate that a brain region is involved to some nonzero degree in some task contrast. This is likely to be true of most combinations of task contrasts and brain regions when measured with sufficient power.
I’d go further than Ed does though, and say that in a sense, effect size estimates are the only things that matter. As Ed notes, there are few if any cases where it’s plausible to suppose that the effect of some manipulation on brain activation is really zero. The brain is a very dense causal system–almost any change in one variable is going to have downstream effects on many, and perhaps most, others. So the real question we care about is almost never “is there or isn’t there an effect,” it’s whether there’s an effect that’s big enough to actually care about. (This problem isn’t specific to fMRI research, of course; it’s been a persistent source of criticism of null hypothesis significance testing for many decades.)
People sometimes try to deflect this concern by saying that they’re not trying to make any claims about how big an effect is, but only about whether or not one can reject the null–i.e., whether any kind of effect is present or not. I’ve never found this argument convincing, because whether or not you own up to it, you’re always making an effect size claim whenever you conduct a hypothesis test. Testing against a null of zero is equivalent to saying that you care about any effect that isn’t exactly zero, which is simply false. No one in fMRI research cares about r or d values of 0.0001, yet we routinely conduct tests whose results could be consistent with those types of effect sizes.
Since we’re always making implicit claims about effect sizes when we conduct hypothesis tests, we may as well make them explicit so that they can be evaluated properly. If you only care about correlations greater than 0.1, there’s no sense in hiding that fact; why not explicitly test against a null range of -0.1 to 0.1, instead of a meaningless null of zero?
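As a sketch of what testing against a null range might look like for a correlation, here’s one simple way to do it using the Fisher z approximation. The function name is hypothetical, and the approach (evaluating the p-value at the boundary of the null range, where it is largest) is just one reasonable option, not a method proposed in the paper.

```python
import math
from statistics import NormalDist


def minimum_effect_p(r, n, bound=0.1):
    """Approximate p-value for H0: |rho| <= bound, given observed r and n.

    Uses the Fisher z normal approximation and evaluates the probability
    of a result at least this extreme when the true correlation sits on
    the boundary of the null range. With bound=0 this reduces to the
    usual two-sided test against zero.
    """
    nd = NormalDist()
    se = 1.0 / math.sqrt(n - 3)
    z_obs = abs(math.atanh(r))
    z_bound = math.atanh(bound)
    near_tail = 1.0 - nd.cdf((z_obs - z_bound) / se)  # beyond the nearer bound
    far_tail = 1.0 - nd.cdf((z_obs + z_bound) / se)   # opposite tail, usually tiny
    return near_tail + far_tail


r, n = 0.3, 50  # hypothetical observed correlation and sample size
p_zero = minimum_effect_p(r, n, bound=0.0)
p_range = minimum_effect_p(r, n, bound=0.1)
print(f"p against rho = 0:       {p_zero:.3f}")
print(f"p against |rho| <= 0.1:  {p_range:.3f}")
```

The same observed correlation that comfortably rejects a null of exactly zero can fail to reject the null range, which is exactly the kind of information a reader needs in order to judge whether the effect is worth caring about.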
(6) What is the best way to accurately estimate effect sizes from imaging data?
Use large samples, conduct multivariate analyses, report results comprehensively, use meta-analysis… I don’t think there’s any single way to ensure accurate effect size estimates, but plenty of things help. Maybe the most general recommendation is to ensure adequate power (see below), which will naturally minimize effect size inflation.
(7) What makes data sets independent? Are different sets of subjects required?
Most of the authors think (as I do) that different sets of subjects are indeed required in order to ensure independence. Here’s Nichols:
Only data sets collected on distinct individuals can be assured to be independent. Splitting an individual’s data (e.g., using run 1 and run 2 to create two data sets) does not yield independence at the group level, as each subject’s true random effect will correlate the data sets.
Put differently, splitting data within subjects only protects against measurement error, and not sampling error. You could in theory measure activation perfectly reliably (in which case the two halves of subjects’ data would be perfectly correlated) and still have grossly inflated effects, simply because the multivariate distribution of scores in your sample doesn’t accurately reflect the distribution in the population. So, as Nichols points out, you always need new subjects if you want to be absolutely certain your analyses are independent. But since this generally isn’t feasible, I’d argue we should worry less about whether or not our data sets are completely independent, and more about reporting results in a way that makes the presence of any bias as clear as possible.
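If the distinction between measurement error and sampling error seems abstract, a toy simulation may help. In the sketch below (all names and parameter values are invented for illustration), the true population correlation between activation and behavior is exactly zero at every voxel. Voxels are selected using run 1, and the correlation is then re-estimated either in run 2 of the same subjects or in a fresh sample of subjects.

```python
import math
import random
import statistics


def pearson(x, y):
    """Plain Pearson correlation, written out to keep the example stdlib-only."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_demo(n_subjects=20, n_voxels=5000, noise_sd=0.5,
                    r_select=0.5, seed=1):
    """Select voxels on run 1, then re-estimate r in run 2 vs. new subjects."""
    rng = random.Random(seed)
    same_subjects, new_subjects = [], []
    for _ in range(n_voxels):
        behavior = [rng.gauss(0, 1) for _ in range(n_subjects)]
        # True activation is independent of behavior: population r = 0.
        true_act = [rng.gauss(0, 1) for _ in range(n_subjects)]
        run1 = [a + rng.gauss(0, noise_sd) for a in true_act]
        if pearson(run1, behavior) > r_select:
            # Same subjects, independent measurement noise (run 2).
            run2 = [a + rng.gauss(0, noise_sd) for a in true_act]
            same_subjects.append(pearson(run2, behavior))
            # Entirely new subjects at the same voxel.
            new_beh = [rng.gauss(0, 1) for _ in range(n_subjects)]
            new_run = [rng.gauss(0, 1) + rng.gauss(0, noise_sd)
                       for _ in range(n_subjects)]
            new_subjects.append(pearson(new_run, new_beh))
    return (statistics.fmean(same_subjects),
            statistics.fmean(new_subjects),
            len(same_subjects))


mean_same, mean_new, n_sel = split_half_demo()
print(f"selected voxels: {n_sel}")
print(f"mean r, run 2 of same subjects: {mean_same:.2f}")
print(f"mean r, new subjects:           {mean_new:.2f}")
```

The run 2 estimate shrinks a bit relative to run 1 because the measurement noise is fresh, but it stays well above zero, since the same subjects carry the same chance association between their true activation and their behavior. Only the new subjects bring the estimate back to where it belongs.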
(8) What information can one glean from data selected for a certain effect?
I think this is kind of a moot question, since virtually all data are susceptible to some form of selection bias (scientists generally don’t write papers detailing all the analyses they conducted that didn’t pan out!). As I note above, I think it’s a bad idea to disregard effect sizes entirely; they’re actually what we should be focusing most of our attention on. Better to report confidence intervals that accurately reflect the selection procedure and make the uncertainty around the point estimate clear.
(9) Are visualizations of nonindependent data helpful to illustrate the claims of a paper?
Not in cases where there’s an extremely strong dependency between the selection criteria and the effect size estimate. In cases of weak to moderate dependency, visualization is fine so long as confidence bands are plotted alongside the best fit. Again, the key is to always be explicit about the limitations of the analysis and provide some indication of the uncertainty involved.
(10) Should data exploration be discouraged in favor of valid confirmatory analyses?
No. I agree with Poldrack’s sentiment here:
Our understanding of brain function remains incredibly crude, and limiting research to the current set of models and methods would virtually guarantee scientific failure. Exploration of new approaches is thus critical, but the findings must be confirmed using new samples and convergent methods.
(11) Is a confirmatory analysis safer than an exploratory analysis in terms of drawing neuroscientific conclusions?
In principle, sure, but in practice, it’s virtually impossible to determine which reported analyses really started out as confirmatory analyses and which started out as exploratory analyses and then mysteriously evolved into “a priori” predictions once the paper was written. I’m not saying there’s anything wrong with this (everyone reports results strategically to some extent), just that I don’t know that the distinction between confirmatory and exploratory analyses is all that meaningful in practice. Also, as the previous point makes clear, safety isn’t the only criterion we care about; we also want to discover new and unexpected findings, which requires exploration.
(12) What makes a whole-brain mapping analysis valid? What constitutes sufficient adjustment for multiple testing?
From a hypothesis testing standpoint, you need to ensure adequate control of the family-wise error (FWE) rate or false discovery rate (FDR). But as I suggested above, I think this only ensures validity in a limited sense; it doesn’t ensure that the results are actually going to be worth caring about. If you want to feel confident that any effects that survive are meaningfully large, you need to do the extra work up front and define what constitutes a meaningful effect size (and then test against that).
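For readers who haven’t seen the two kinds of correction side by side, here’s a minimal sketch. It uses Bonferroni as the simplest example of FWE control and the Benjamini-Hochberg step-up procedure for FDR control; the p-values are made up, and real imaging packages typically rely on more sophisticated approaches (e.g., random field theory or permutation tests) for FWE.

```python
def bonferroni(pvals, alpha=0.05):
    """FWE control: reject only p-values at or below alpha / number of tests."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control via the Benjamini-Hochberg step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject


pvals = [0.001, 0.004, 0.012, 0.03, 0.2]  # hypothetical p-values from 5 tests
print("Bonferroni:", bonferroni(pvals))
print("BH (FDR):  ", benjamini_hochberg(pvals))
```

With these five p-values, Bonferroni rejects the two smallest and Benjamini-Hochberg rejects the four smallest. Note that neither procedure says anything about how big the surviving effects are, which is the point I’m making above.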
(13) How much power should a brain-mapping analysis have to be useful?
As much as possible! Concretely, the conventional target of 80% seems like a good place to start. But as I’ve argued before, that would require more than doubling conventional sample sizes in most cases. The reality is that fMRI studies are expensive, so we’re probably stuck with underpowered analyses for the foreseeable future. So we need to find other ways to compensate for that (e.g., relying more heavily on meta-analytic effect size estimates).
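To put some rough numbers on the 80% target, here’s a back-of-the-envelope sketch for a simple correlational analysis. It uses the usual Fisher z approximation for the required sample size; the function name and the assumed true correlation of 0.3 are illustrative choices of mine, and the result is only as good as that assumption.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect a true correlation r (two-sided test).

    Fisher z approximation: n = ((z_alpha + z_power) / atanh(r))^2 + 3.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1.0 - alpha / 2.0)
    z_power = nd.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)


n_uncorrected = n_for_correlation(0.3, alpha=0.05)
n_whole_brain = n_for_correlation(0.3, alpha=0.001)
print(f"n for r = 0.3 at p < .05:  {n_uncorrected}")
print(f"n for r = 0.3 at p < .001: {n_whole_brain}")
```

Under these assumptions you’d need about 85 subjects for a single test at p < .05, and somewhere around 180 at a whole-brain-style threshold of p < .001.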
(14) In which circumstances are nonindependent selective analyses acceptable for scientific publication?
It depends on exactly what’s problematic about the analysis. Analyses that are truly circular and provide no new information should never be reported, but those constitute only a small fraction of all analyses. More commonly, the nonindependence simply amounts to selection bias: researchers tend to report only those results that achieve statistical significance, thereby inflating apparent effect sizes. I think the solution to this is to still report all key effect sizes, but to ensure they’re accompanied by confidence intervals and appropriate qualifiers.
Kriegeskorte N, Lindquist MA, Nichols TE, Poldrack RA, & Vul E (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. Journal of Cerebral Blood Flow & Metabolism. PMID: 20571517