Archive for the ‘fmri’ Category

fourteen questions about selection bias, circularity, nonindependence, etc.

Sunday, June 27th, 2010

A new paper published online this week in the Journal of Cerebral Blood Flow & Metabolism this week discusses the infamous problem of circular analysis in fMRI research. The paper is aptly titled “Everything you never wanted to know about circular analysis, but were afraid to ask,” and is authored by several well-known biostatisticians and cognitive neuroscientists–to wit, Niko Kriegeskorte, Martin Lindquist, Tom Nichols, Russ Poldrack, and Ed Vul. The paper has an interesting format, and one that I really like: it’s set up as a series of fourteen questions related to circular analysis, and each author answers each question in 100 words or less.

I won’t bother going over the gist of the paper, because the Neuroskeptic already beat me to the punch in an excellent post a couple of days ago (actually, that’s how I found out about the paper); instead,  I’ll just give my own answers to the same set of questions raised in the paper. And since blog posts don’t have the same length constraints as NPG journals, I’m going to be characteristically long-winded and ignore the 100 word limit…

(1) Is circular analysis a problem in systems and cognitive neuroscience?

Yes, it’s a huge problem. That said, I think the term ‘circular’ is somewhat misleading here, because it has the connotation than an analysis is completely vacuous. Truly circular analyses–i.e., those where an initial analysis is performed, and the researchers then conduct a “follow-up” analysis that literally adds no new information–are relatively rare in fMRI research. Much more common are cases where there’s some dependency between two different analyses, but the second one still adds some novel information.

(2) How widespread are slight distortions and serious errors caused by circularity in the neuroscience literature?

I think Nichols sums it up nicely here:

TN: False positives due to circularity are minimal; biased estimates of effect size are common. False positives due to brushing off the multiple testing problem (e.g., ‘P<0.001 uncorrected’ and crossing your fingers) remain pervasive.

The only thing I’d add to this is that the bias in effect size estimates is not only common, but, in most cases, is probably very large.

(3) Are circular estimates useful measures of effect size?

Yes and no. They’re less useful than unbiased measures of effect size. But given that the vast majority of effects reported in whole-brain fMRI analyses (and, more generally, analyses in most fields) are likely to be inflated to some extent, the only way to ensure we don’t rely on circular estimates of effect size would be to disregard effect size estimates entirely, which doesn’t seem prudent.

(4) Should circular estimates of effect size be presented in papers and, if so, how?

Yes, because the only principled alternatives are to either (a) never report effect sizes (which seems much too drastic), or (b) report the results of every single test performed, irrespective of the result (i.e., to never give selection bias an opportunity to rear its head). Neither of these is reasonable. We should generally report effect sizes for all key effects, but they should be accompanied by appropriate confidence intervals. As Lindquist notes:

In general, it may be useful to present any effect size estimate as confidence intervals, so that readers can see for themselves how much uncertainty is related to the point estimate.

A key point I’d add is that the width of the reported CIs should match the threshold used to identify results in the first place. In other words, if you conduct a whole brain analysis at p < .001, you should report all resulting effects with 99.9% CIs, and not 95% CIs. I think this simple step would go a considerable ways towards conveying the true uncertainty surrounding most point estimates in fMRI studies.

(5) Are effect size estimates important/useful for neuroscience research, and why?

I think my view here is closest to Ed Vul’s:

Yes, very much so. Null-hypothesis testing is insufficient for most goals of neuroscience because it can only indicate that a brain region is involved to some nonzero degree in some task contrast. This is likely to be true of most combinations of task contrasts and brain regions when measured with sufficient power.

I’d go further than Ed does though, and say that in a sense, effect size estimates are the only things that matter. As Ed notes, there are few if any cases where it’s plausible to suppose that the effect of some manipulation on brain activation is really zero. The brain is a very dense causal system–almost any change in one variable is going to have downstream effects on many, and perhaps most, others. So the real question we care about is almost never “is there or isn’t there an effect,” it’s whether there’s an effect that’s big enough to actually care about. (This problem isn’t specific to fMRI research, of course; it’s been a persistent source of criticism of null hypothesis significance testing for many decades.)

People sometimes try to deflect this concern by saying that they’re not trying to make any claims about how big an effect is, but only about whether or not one can reject the null–i.e., whether any kind of effect is present or not. I’ve never found this argument convincing, because whether or not you own up to it, you’re always making an effect size claim whenever you conduct a hypothesis test. Testing against a null of zero is equivalent to saying that you care about any effect that isn’t exactly zero, which is simply false. No one in fMRI research cares about r or d values of 0.0001, yet we routinely conduct tests whose results could be consistent with those types of effect sizes.

Since we’re always making implicit claims about effect sizes when we conduct hypothesis tests, we may as well make them explicit so that they can be evaluated properly. If you only care about correlations greater than 0.1, there’s no sense in hiding that fact; why not explicitly test against a null range of -0.1 to 0.1, instead of a meaningless null of zero?

(6) What is the best way to accurately estimate effect sizes from imaging data?

Use large samples, conduct multivariate analyses, report results comprehensively, use meta-analysis… I don’t think there’s any single way to ensure accurate effect size estimates, but plenty of things help. Maybe the most general recommendation is to ensure adequate power (see below), which will naturally minimize effect size inflation.

(7) What makes data sets independent? Are different sets of subjects required?

Most of the authors think (as I do too) that different sets of subjects are indeed required in order to ensure independence. Here’s Nichols:

Only data sets collected on distinct individuals can be assured to be independent. Splitting an individual’s data (e.g., using run 1 and run 2 to create two data sets) does not yield independence at the group level, as each subject’s true random effect will correlate the data sets.

Put differently, splitting data within subjects only eliminates measurement error, and not sampling error. You could in theory measure activation perfectly reliably (in which case the two halves of subjects’ data would be perfectly correlated) and still have grossly inflated effects, simply because the multivariate distribution of scores in your sample doesn’t accurately reflect the distribution in the population. So, as Nichols points out, you always need new subjects if you want to be absolutely certain your analyses are independent. But since this generally isn’t feasible, I’d argue we should worry less about whether or not our data sets are completely independent, and more about reporting results in a way that makes the presence of any bias as clear as possible.

(8) What information can one glean from data selected for a certain effect?

I think this is kind of a moot question, since virtually all data are susceptible to some form of selection bias (scientists generally don’t write papers detailing all the analyses they conducted that didn’t pan out!). As I note above, I think it’s a bad idea to disregard effect sizes entirely; they’re actually what we should be focusing most of our attention on. Better to report confidence intervals that accurately reflect the selection procedure and make the uncertainty around the point estimate clear.

(9) Are visualizations of nonindependent data helpful to illustrate the claims of a paper?

Not in cases where there’s an extremely strong dependency between the selection criteria and the effect size estimate. In cases of weak to moderate dependency, visualization is fine so long as confidence bands are plotted alongside the best fit. Again, the key is to always be explicit about the limitations of the analysis and provide some indication of the uncertainty involved.

(10) Should data exploration be discouraged in favor of valid confirmatory analyses?

No. I agree with Poldrack’s sentiment here:

Our understanding of brain function remains incredibly crude, and limiting research to the current set of models and methods would virtually guarantee scientific failure. Exploration of new approaches is thus critical, but the findings must be confirmed using new samples and convergent methods.

(11) Is a confirmatory analysis safer than an exploratory analysis in terms of drawing neuroscientific conclusions?

In principle, sure, but in practice, it’s virtually impossible to determine which reported analyses really started out their lives as confirmatory analyses and which started life out as exploratory analyses and then mysteriously evolved into “a priori” predictions once the paper was written. I’m not saying there’s anything wrong with this–everyone reports results strategically to some extent–just that I don’t know that the distinction between confirmatory and exploratory analyses is all that meaningful in practice. Also, as the previous point makes clear, safety isn’t the only criterion we care about; we also want to discover new and unexpected findings, which requires exploration.

(12) What makes a whole-brain mapping analysis valid? What constitutes sufficient adjustment for multiple testing?

From a hypothesis testing standpoint, you need to ensure adequate control of the family-wise error (FWE) rate or false discovery rate (FDR). But as I suggested above, I think this only ensures validity in a limited sense; it doesn’t ensure that the results are actually going to be worth caring about. If you want to feel confident that any effects that survive are meaningfully large, you need to do the extra work up front and define what constitutes a meaningful effect size (and then test against that).

(13) How much power should a brain-mapping analysis have to be useful?

As much as possible! Concretely, the conventional target of 80% seems like a good place to start. But as I’ve argued before (e.g., here), that would require more than doubling conventional sample sizes in most cases. The reality is that fMRI studies are expensive, so we’re probably stuck with underpowered analyses for the foreseeable future. So we need to find other ways to compensate for that (e.g., relying more heavily on meta-analytic effect size estimates).

(14) In which circumstances are nonindependent selective analyses acceptable for scientific publication?

It depends on exactly what’s problematic about the analysis. Analyses that are truly circular and provide no new information should never be reported, but those constitute only a small fraction of all analyses. More commonly, the nonindependence simply amounts to selection bias: researchers tend to report only those results that achieve statistical significance, thereby inflating apparent effect sizes. I think the solution to this is to still report all key effect sizes, but to ensure they’re accompanied by confidence intervals and appropriate qualifiers.

ResearchBlogging.orgKriegeskorte N, Lindquist MA, Nichols TE, Poldrack RA, & Vul E (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. Journal of cerebral blood flow and metabolism : official journal of the International Society of Cerebral Blood Flow and Metabolism PMID: 20571517

time-on-task effects in fMRI research: why you should care

Wednesday, June 16th, 2010

There’s a ubiquitous problem in experimental psychology studies that use behavioral measures that require participants to make speeded responses. The problem is that, in general, the longer people take to do something, the more likely they are to do it correctly. If I have you do a visual search task and ask you to tell me whether or not a display full of letters contains a red ‘X’, I’m not going to be very impressed that you can give me the right answer if I let you stare at the screen for five minutes before responding. In most experimental situations, the only way we can learn something meaningful about people’s capacity to perform a task is by imposing some restriction on how long people can take to respond. And the problem that then presents is that any changes we observe in the resulting variable we care about (say, the proportion of times you successfully detect the red ‘X’) are going to be confounded with the time people took to respond. Raise the response deadline and performance goes up; shorten it and performance goes down.

This fundamental fact about human performance is commonly referred to as the speed-accuracy tradeoff. The speed-accuracy tradeoff isn’t a law in any sense; it allows for violations, and there certainly are situations in which responding quickly can actually promote accuracy. But as a general rule, when researchers run psychology experiments involving response deaadlines, they usually work hard to rule out the speed-accuracy tradeoff as an explanation for any observed results. For instance, if I have a group of adolescents with ADHD do a task requiring inhibitory control, and compare their performance to a group of adolescents without ADHD, I may very well find that the ADHD group performs more poorly, as reflected by lower accuracy rates. But the interpretation of that result depends heavily on whether or not there are also any differences in reaction times (RT). If the ADHD group took about as long on average to respond as the non-ADHD group, it might be reasonable to conclude that the ADHD group suffers a deficit in inhibitory control: they take as long as the control group to do the task, but they still do worse. On the other hand, if the ADHD group responded much faster than the control group on average, the interpretation would become more complicated. For instance, one possibility would be that the accuracy difference reflects differences in motivation rather than capacity per se. That is, maybe the ADHD group just doesn’t care as much about being accurate as about responding quickly. Maybe if you motivated the ADHD group appropriately (e.g., by giving them a task that was intrinsically interesting), you’d find that performance was actually equivalent across groups. Without explicitly considering the role of reaction time–and ideally, controlling for it statistically–the types of inferences you can draw about underlying cognitive processes are somewhat limited.

An important point to note about the speed-accuracy tradeoff is that it isn’t just a tradeoff between speed and accuracy; in principle, any variable that bears some systematic relation to how long people take to respond is going to be confounded with reaction time. In the world of behavioral studies, there aren’t that many other variables we need to worry about. But when we move to the realm of brain imaging, the game changes considerably. Nearly all fMRI studies measure something known as the blood-oxygen-level-dependent (BOLD) signal. I’m not going to bother explaining exactly what the BOLD signal is (there are plenty of other excellent explanations at varying levels of technical detail, e.g., here, here, or here); for present purposes, we can just pretend that the BOLD signal is basically a proxy for the amount of neural activity going on in different parts of the brain (that’s actually a pretty reasonable assumption, as emerging studies continue to demonstrate). In other words, a simplistic but not terribly inaccurate model is that when neurons in region X increase their firing rate, blood flow in region X also increases, and so in turn does the BOLD signal that fMRI scanners detect.

A critical question that naturally arises is just how strong the temporal relation is between the BOLD signal and underlying neuronal processes. From a modeling perspective, what we’d really like is a system that’s completely linear and time-invariant–meaning that if you double the duration of a stimulus presented to the brain, the BOLD response elicited by that stimulus also doubles, and it doesn’t matter when the stimulus is presented (i.e., there aren’t any funny interactions between different phases of the response, or with the responses to other stimuli). As it turns out, the BOLD response isn’t perfectly linear, but it’s pretty close. In a seminal series of studies in the mid-90s, Randy Buckner, Anders Dale and others showed that, at least for stimuli that aren’t presented extremely rapidly (i.e., a minimum of 1 – 2 seconds apart), we can reasonably pretend that the BOLD response sums linearly over time without suffering any serious ill effects. And that’s extremely fortunate, because it makes modeling brain activation with fMRI much easier to do. In fact, the vast majority of fMRI studies, which employ what are known as rapid event-related designs, implicitly assume linearity. If the hemodynamic response wasn’t approximately linear, we would have to throw out a very large chunk of the existing literature–or at least seriously question its conclusions.

Aside from the fact that it lets us model things nicely, the assumption of linearity has another critical, but underappreciated, ramification for the way we do fMRI research. Which is this: if the BOLD response sums approximately linearly over time, it follows that two neural responses that have the same amplitude but differ in duration will produce BOLD responses with different amplitudes. To characterize that visually, here’s a figure from a paper I published with Deanna Barch, Jeremy Gray, Tom Conturo, and Todd Braver last year:

plos_one_figure1

Each of these panels shows you the firing rates and durations of two hypothetical populations of neurons (on the left), along with the (observable) BOLD response that would result (on the right). Focus your attention on panel C first. What this panel shows you is what, I would argue, most people intuitively think of when they come across a difference in activation between two conditions. When you see time courses that clearly differ in their amplitude, it’s very natural to attribute a similar difference to the underlying neuronal mechanisms, and suppose that there must just be more firing going on in one condition than the other–where ‘more’ is taken to mean something like “firing at a higher rate”.

The problem, though, is that this inference isn’t justified. If you look at panel B, you can see that you get exactly the same pattern of observed differences in the BOLD response even when the amplitude of neuronal activation is identical, simply because there’s a difference in duration. In other words, if someone shows you a plot of two BOLD time courses for different experimental conditions, and one has a higher amplitude than the other, you don’t know whether that’s because there’s more neuronal activation in one condition than the other, or if processing is identical in both conditions but simply lasts longer in one than in the other. (As a technical aside, this equivalence only holds for short trials, when the BOLD response doesn’t have time to saturate. If you’re using longer trials–say 4 seconds more more–then it becomes fairly easy to tell apart changes in duration from changes in amplitude. But the vast majority of fMRI studies use much shorter trials, in which case the problem I describe holds.)

Now, functionally, this has some potentially very serious implications for the inferences we can draw about psychological processes based on observed differences in the BOLD response. What we would usually like to conclude when we report “more” activation for condition X than condition Y is that there’s some fundamental difference in the nature of the processes involved in the two conditions that’s reflected at the neuronal level. If it turns out that the reason we see more activation in one condition than the other is simply that people took longer to respond in one condition than in the other, and so were sustaining attention for longer, that can potentially undermine that conclusion.

For instance, if you’re contrasting a feature search condition with a conjunction search condition, you’re quite likely to observe greater activation in regions known to support visual attention. But since a central feature of conjunction search is that it takes longer than a feature search, it could theoretically be that the same general regions support both types of search, and what we’re seeing is purely a time-on-task effect: visual attention regions are activated for longer because it takes longer to complete the conjunction search, but these regions aren’t doing anything fundamentally different in the two conditions (at least at the level we can see with fMRI). So this raises an issue similar to the speed-accuracy tradeoff we started with. Other things being equal, the longer it takes you to respond, the more activation you’ll tend to see in a given region. Unless you explicitly control for differences in reaction time, your ability to draw conclusions about underlying neuronal processes on the basis of observed BOLD differences may be severely hampered.

It turns out that very few fMRI studies actually control for differences in RT. In an elegant 2008 study discussing different ways of modeling time-varying signals, Jack Grinband and colleagues reviewed a random sample of 170 studies and found that, “Although response times were recorded in 82% of event-related studies with a decision component, only 9% actually used this information to construct a regression model for detecting brain activity”. Here’s what that looks like (Panel C), along with some other interesting information about the procedures used in fMRI studies:

grinband_figure
So only one in ten studies made any effort to control for RT differences; and Grinband et al argue in their paper that most of those papers didn’t model RT the right way anyway (personally I’m not sure I agree; I think there are tradeoffs associated with every approach to modeling RT–but that’s a topic for another post).

The relative lack of attention to RT differences is particularly striking when you consider what cognitive neuroscientists do care a lot about: differences in response accuracy. The majority of researchers nowadays make a habit of discarding all trials on which participants made errors. The justification we give for this approach–which is an entirely reasonable one–is that if we analyzed correct and incorrect trials together, we’d be confounding the processes we care about (e.g., differences between conditions) with activation that simply reflects error-related processes. So we drop trials with errors, and that gives us cleaner results.

I suspect that the reasons for our concern with accuracy effects but not RT effects in fMRI research are largely historical. In the mid-90s, when a lot of formative cognitive neuroscience was being done, people (most of them then located in Pittsburgh, working in Jonathan Cohen‘s group) discovered that the brain doesn’t like to make errors. When people make mistakes during task performance, they tend to recognize that fact; on a neural level, frontoparietal regions implicated in goal-directed processing–and particularly the anterior cingulate cortex–ramp up activation substantially. The interpretation of this basic finding has been a source of much contention among cognitive neuroscientists for the past 15 years, and remains a hot area of investigation. For present purposes though, we don’t really care why error-related activation arises; the point is simply that it does arise, and so we do the obvious thing and try to eliminate it as a source of error from our analyses. I suspect we don’t do the same for RT not because we lack principled reasons to, but because there haven’t historically been clear-cut demonstrations of the effects of RT differences on brain activation.

The goal of the 2009 study I mentioned earlier was precisely to try to quantify those effects. The hypothesis my co-authors and I tested was straightforward: if brain activity scales approximately linearly with RT (as standard assumptions would seem to entail), we should see a strong “time-on-task” effect in brain areas that are associated with the general capacity to engage in goal-directed processing. In other words, on trials when people take longer to respond, activation in frontal and parietal regions implicated in goal-directed processing and cognitive control should increase. These regions are often collectively referred to as the “task-positive” network (Fox et al., 2005), in reference to the fact that they tend to show activation increases any time people are engaging in goal-directed processing, irrespective of the precise demands of the task. We figured that identifying a time-on-task effect in the task-positive network would provide a nice demonstration of the relation between RT differences and the BOLD response, since it would underscore the generality of the problem.

Concretely, what we did was take five datasets that were lying around from previous studies, and do a multi-study analysis focusing specifically on RT-related activation. We deliberately selected studies that employed very different tasks, designs, and even scanners, with the aim of ensuring the generalizability of the results. Then, we identified regions in each study in which activation covaried with RT on a trial-by-trial basis. When we put all of the resulting maps together and picked out only those regions that showed an association with RT in all five studies, here’s the map we got:

plos_one_figure2

There’s a lot of stuff going on here, but in the interest of keeping this post short slightly less excruciatingly long, I’ll stick to the frontal areas. What we found, when we looked at the timecourse of activation in those regions, was the predicted time-on-task effect. Here’s a plot of the timecourses from all five studies for selected regions:

plos_one_figure4

If you focus on the left time course plot for the medial frontal cortex (labeled R1, in row B), you can see that increases in RT are associated with increased activation in medial frontal cortex in all five studies (the way RT effects are plotted here is not completely intuitive, so you may want to read the paper for a clearer explanation). It’s worth pointing out that while these regions were all defined based on the presence of an RT effect in all five studies, the precise shape of that RT effect wasn’t constrained; in principle, RT could have exerted very different effects across the five studies (e.g., positive in some, negative in others; early in some, later in others; etc.). So the fact that the timecourses look very similar in all five studies isn’t entailed by the analysis, and it’s an independent indicator that there’s something important going on here.

The clear-cut implication of these findings is that a good deal of BOLD activation in most studies can be explained simply as a time-on-task effect. The longer you spend sustaining goal-directed attention to an on-screen stimulus, the more activation you’ll show in frontal regions. It doesn’t much matter what it is that you’re doing; these are ubiquitous effects (since this study, I’ve analyzed many other datasets in the same way, and never fail to find the same basic relationship). And it’s worth keeping in mind that these are just the regions that show common RT-related activation across multiple studies; what you’re not seeing are regions that covary with RT only within one (or for that matter, four) studies. I’d argue that most regions that show involvement in a task are probably going to show variations with RT. After all, that’s just what falls out of the assumption of linearity–an assumption we all depend on in order to do our analyses in the first place.

Exactly what proportion of results can be explained away as time-on-task effects? That’s impossible to determine, unfortunately. I suspect that if you could go back through the entire fMRI literature and magically control for trial-by-trial RT differences in every study, a very large number of published differences between experimental conditions would disappear. That doesn’t mean those findings were wrong or unimportant, I hasten to note; there are many cases in which it’s perfectly appropriate to argue that differences between conditions should reflect a difference in quantity rather than quality. Still, it’s clear that in many cases that isn’t the preferred interpretation, and controlling for RT differences probably would have changed the conclusions. As just one example, much of what we think of as a “conflict” effect in the medial frontal cortex/anterior cingulate could simply reflect prolonged attention on high-conflict trials. When you’re experiencing cognitive difficulty or conflict, you tend to slow down and take longer to respond, which is naturally going to produce BOLD increases that scale with reaction time. The question as to what remains of the putative conflict signal after you control for RT differences is one that hasn’t really been adequately addressed yet.

The practical question, of course, is what we should do about this. How can we minimize the impact of the time-on-task effect on our results, and, in turn, on the conclusions we draw? I think the most general suggestion is to always control for reaction time differences. That’s really the only way to rule out the possibility that any observed differences between conditions simply reflect differences in how long it took people to respond. This leaves aside the question of exactly how one should model out the effect of RT, which is a topic for another time (though I discuss it at length in the paper, and the Grinband paper goes into even more detail). Unfortunately, there isn’t any perfect solution; as with most things, there are tradeoffs inherent in pretty much any choice you make. But my personal feeling is that almost any approach one could take to modeling RT explicitly is a big step in the right direction.

A second, and nearly as important, suggestion is to not only control for RT differences, but to do it both ways. Meaning, you should run your model both with and without an RT covariate, and carefully inspect both sets of results. Comparing the results across the two models is what really lets you draw the strongest conclusions about whether activation differences between two conditions reflect a difference of quality or quantity. This point applies regardless of which hypothesis you favor: if you think two conditions draw on very similar neural processes that differ only in degree, your prediction is that controlling for RT should make effects disappear. Conversely, if you think that a difference in activation reflects the recruitment of qualitatively different processes, you’re making the prediction that the difference will remain largely unchanged after controlling for RT. Either way, you gain important information by comparing the two models.

The last suggestion I have to offer is probably obvious, and not very helpful, but for what it’s worth: be cautious about how you interpret differences in activation any time there are sizable differences in task difficulty and/or mean response time. It’s tempting to think that if you always analyze only trials with correct responses and follow the suggestions above to explicitly model RT, you’ve done all you need in order to perfectly control for the various tradeoffs and relationships between speed, accuracy, and cognitive effort. It really would be nice if we could all sleep well knowing that our data have unambiguous interpretations. But the truth is that all of these techniques for “controlling” for confounds like difficulty and reaction time are imperfect, and in some cases have known deficiencies (for instance, it’s not really true that throwing out error trials eliminates all error-related activation from analysis–sometimes when people don’t know the answer, they guess right!). That’s not to say we should stop using the tools we have–which offer an incredibly powerful way to peer inside our gourds–just that we should use them carefully.

ResearchBlogging.org

Yarkoni T, Barch DM, Gray JR, Conturo TE, & Braver TS (2009). BOLD correlates of trial-by-trial reaction time variability in gray and white matter: a multi-study fMRI analysis. PloS one, 4 (1) PMID: 19165335

Grinband J, Wager TD, Lindquist M, Ferrera VP, & Hirsch J (2008). Detection of time-varying signals in event-related fMRI designs. NeuroImage, 43 (3), 509-20 PMID: 18775784

fMRI, not coming to a courtroom near you so soon after all

Friday, June 4th, 2010

That’s a terribly constructed title, I know, but bear with me. A couple of weeks ago I blogged about a courtroom case in Tennessee where the defense was trying to introduce fMRI to the courtroom as a way of proving the defendant’s innocence (his brain, apparently, showed no signs of guilt). The judge’s verdict is now in, and…. fMRI is out. In United States v. Lorne Semrau, Judge Pham recommended that the government’s motion to exclude fMRI scans from consideration be granted. That’s the outcome I think most respectable cognitive neuroscientists were hoping for; as many people associated with the case or interviewed about it have noted (and as the judge recognized), there just isn’t a shred of evidence to suggest that fMRI has any utility as a lie detector in real-world situations.

The judge’s decision, which you can download in PDF form here (hat-tip: Thomas Nadelhoffer), is really quite elegant, and worth reading (or at least skimming through). He even manages some subtle snark in places. For instance (my italics):

Regarding the existence and maintenance of standards, Dr. Laken testified as to the protocols and controlling standards that he uses for his own exams. Because the use of fMRI-based lie detection is still in its early stages of development, standards controlling the real-life application have not yet been established. Without such standards, a court cannot adequately evaluate the reliability of a particular lie detection examination. Cordoba, 194 F.3d at 1061. Assuming, arguendo, that the standards testified to by Dr. Laken could satisfy Daubert, it appears that Dr. Laken violated his own protocols when he re-scanned Dr. Semrau on the AIMS tests SIQs, after Dr. Semrau was found “deceptive” on the first AIMS tests scan. None of the studies cited by Dr. Laken involved the subject taking a second exam after being found to have been deceptive on the first exam. His decision to conduct a third test begs the question whether a fourth scan would have revealed Dr. Semrau to be deceptive again.

The absence of real-life error rates, lack of controlling standards in the industry for real-life exams, and Dr. Laken’s apparent deviation from his own protocols are negative factors in the analysis of whether fMRI-based lie detection is scientifically valid. See Bonds, 12 F.3d at 560.

The reference here is to the fact that Laken and his company scanned Semrau (the defendant) on three separate occasions. The first two scans were planned ahead of time, but the third apparently wasn’t:

From the first scan, which included SIQs relating to defrauding the government, the results showed that Dr. Semrau was “not deceptive.” However, from the second scan, which included SIQs relating to AIMS tests, the results showed that Dr. Semrau was “being deceptive.” According to Dr. Laken, “testing indicates that a positive test result in a person purporting to tell the truth is accurate only 6% of the time.” Dr. Laken also believed that the second scan may have been affected by Dr. Semrau’s fatigue. Based on his findings on the second test, Dr. Laken suggested that Dr. Semrau be administered another fMRI test on the AIMS tests topic, but this time with shorter questions and conducted later in the day to reduce the effects of fatigue. … The third scan was conducted on January 12, 2010 at around 7:00 p.m., and according to Dr. Laken, Dr. Semrau tolerated it well and did not express any fatigue. Dr. Laken reviewed this data on January 18, 2010, and concluded that Dr. Semrau was not deceptive. He further stated that based on his prior studies, “a finding such as this is 100% accurate in determining truthfulness from a truthful person.”

I may very well be misunderstanding something here (and so might the judge), but if the positive predictive value of the test is only 6%, I’m guessing that the probability that the test is seriously miscalibrated is somewhat higher than 6%. Especially since the base rate for lying among people who are accused of committing serious fraud is probably reasonably high (this matters, because when base rates are very low, low positive predictive values are not unexpected). But then, no one really knows how to calibrate these tests properly, because the data you’d need to do that simply don’t exist. Serious validation of fMRI as a tool for lie detection would require assembling a large set of brain scans from defendants accused of various crimes (real crimes, not simulated ones) and using that data to predict whether those defendants were ultimately found guilty or not. There really isn’t any substitute for doing a serious study of that sort, but as far as I know, no one’s done it yet. Fortunately, the few judges who’ve had to rule on the courtroom use of fMRI seem to recognize that.

Regarding the existence and maintenance of standards, Dr. Laken testified as to the protocols and controlling standards that he uses for his own exams. Because the use of fMRI-based lie detection is still in its early stages of development, standards controlling the real-life application have not yet been established. Without such standards, a court cannot adequately evaluate the reliability of a particular lie detection examination. Cordoba, 194 F.3d at 1061. Assuming, arguendo, that the standards testified to by Dr. Laken could satisfy Daubert, it appears that Dr. Laken violated his own protocols when he re-scanned Dr. Semrau on the AIMS tests SIQs, after Dr. Semrau was found “deceptive” on the first AIMS tests scan. None of the studies cited by Dr. Laken involved the subject taking a second exam after being found to have been deceptive on the first exam. His decision to conduct a third test begs the question whether a fourth scan would have revealed Dr. Semrau to be deceptive again.
The absence of real-life error rates, lack of controlling standards in the industry for real-life exams, and Dr. Laken’s apparent deviation from his own protocols are negative factors in the analysis of whether fMRI-based lie detection is scientifically valid. See Bonds, 12 F.3d at 560

links and slides from the CNS symposium

Thursday, April 22nd, 2010

After the CNS symposium on building a cumulative cognitive neuroscience, several people I talked to said it was a pity there wasn’t an online repository where all the sites that the speakers discussed could be accessed. I should have thought of that ahead of time, because even if we made one now, no one would ever find it. So, belatedly, the best I can do is put together a list here, where I’m pretty sure no one’s ever going to read it.

Anyway, this is mostly from memory, so I may be forgetting some of the things people talked about, but here’s what I can remember:

Let me know if there’s anything I’m leaving out.

On a related note, several people at the conference asked me for my slides, but I promptly forgot who they were, so here they are.

UPDATED: Russ Poldrack’s slides are now also on the web here.

green chile muffins and brains in a truck: weekend in albuquerque

Monday, March 22nd, 2010

I spent the better part of last week in Albuquerque for the Mind Research Network fMRI course. It’s a really well-organized 3-day course, and while it’s geared toward people without much background in fMRI, I found a lot of the lectures really helpful. It’s hard impossible to get everything right when you run an fMRI study; the magnet is very fickle and doesn’t like to do what you ask it to–and that assumes you’re asking it to do the right thing, which is also not so common. So I find I learn something interesting from almost every fMRI talk I attend, even when it’s stuff I thought I already knew.

Of course, since I know very little, there’s also almost always stuff that’s completely new to me. In this case, it was a series of lectures on independent components analysis (ICA) of fMRI data, focusing on Vince Calhoun‘s group’s implementation of ICA in the GIFT toolbox. It’s a beautifully implemented set of tools that offer a really powerful alternative to standard univariate analysis, and I’m pretty sure I’ll be using it regularly from now on. So the ICA lectures alone were worth the price of admission. (In the interest of full disclosure, I should note that my post-doc mentor, Tor Wager, is one of the organizers of the MRN course, and I wasn’t paying the $700 tab out of pocket. But I’m not getting any kickbacks to say nice things about the course, I promise.)

Between the lectures and the green chile corn muffins, I didn’t get to see much of Albuquerque (except from the air, where the urban sprawl makes the city seem much larger than its actual population of 800k people would suggest), so I’ll reserve judgment for another time. But the MRN itself is a pretty spectacular facility. Aside from a 3T Siemens Trio magnet, they also have a 1.5T mobile scanner built into a truck. It’s mostly used to scan inmates in the New Mexico prison system (you’ll probably be surprised to learn that they don’t let hardened criminals out of jail to participate in scientific experiments–so the scanner has to go to jail instead). We got a brief tour of the mobile scanner and it was pretty awesome. Which is to say, it beats the pants off my Honda.

There are also some parts of the course I don’t remember so well. Here’s a (blurry) summary of those parts, courtesy of Alex Shackman:

Scott, Tor, and me in Albuquerque

BlurryScott, BlurryTor, and BlurryTal: The Boulder branch of the lab, Albuquerque 2010 edition

fMRI becomes big, big science

Friday, March 12th, 2010

There are probably lots of criteria you could use to determine the relative importance of different scientific disciplines, but the one I like best is the Largest Number of Authors on a Paper. Physicists have long had their hundred-authored papers (see for example this individual here; be sure to click on the “show all authors/affiliations” link), and with the initial sequencing and analysis of the human genome, which involved contributions from 452 different persons, molecular geneticists also joined the ranks of Officially Big Science. Meanwhile, us cognitive neuroscientists have long had to content ourselves with silly little papers that have only four to seven authors (maybe a dozen on a really good day). Which means, despite the pretty pictures we get to put in our papers, we’ve long had this inferiority complex about our work, and a nagging suspicion that it doesn’t really qualify as big science (full disclosure: so when I say “we”, I probably just mean “I”).

UNTIL NOW.

Thanks to the efforts of Bharat Biswal and 53 collaborators (yes, I counted) reported in a recent paper in PNAS, fMRI is now officially Big, Big Science. Granted, 54 authors is still small potatoes in physics-and-biology-land. And for all I know, there could be other fMRI papers with even larger author lists out there that I’ve missed.  BUT THAT’S NOT THE POINT. The point is, people like me now get to run around and say we do something important.

You might think I’m being insincere here, and that I’m really poking fun at ridiculously long author lists that couldn’t possibly reflect meaningful contributions from that many people. Well, I’m not. While I’m not seriously suggesting that the mark of good science is how many authors are on the paper, I really do think that the prevalence of long author lists in a discipline are an important sign of a discipline’s maturity, and that the fact that you can get several dozen contributors to a single paper means you’re seeing a level of collaboration across different labs that previously didn’t exist.

The importance of large-scale collaboration is one of the central elements of the new PNAS article, which is appropriately entitled Toward discovery science of human brain function. What Biswal et al have done is compile the largest publicly-accessible fMRI dataset on the planet, consisting of over 1,400 scans from 35 different centers. All of the data, along with some tools for analysis, are freely available for download from NITRC. Be warned though: you’re probably going to need a couple of terabytes of free space if you want to download the entire dataset.

You might be wondering why no one’s assembled an fMRI dataset of this scope until now; after all, fMRI isn’t that new a technique, having been around for about 20 years now. The answer (or at least, one answer) is that it’s not so easy–and often flatly impossible–to combine raw fMRI datasets in any straightforward way. The problem is that the results of any given fMRI study only really make sense in the context of a particular experimental design. Functional MRI typically measures the change in signal associated with some particular task, which means that you can’t really go about combining the results of studies of phonological processing with those of thermal pain and obtain anything meaningful (actually, this isn’t entirely true; there’s a movement afoot to create image-based centralized databases that will afford meta-analyses on an even more massive scale,  but that’s a post for another time). You need to ensure that the tasks people performed across different sites are at least roughly in the same ballpark.

What allowed Biswal et al  to consolidate datasets to such a degree is that they focused exclusively on one particular kind of cognitive task. Or rather, they focused on a non-task: all 1400+ scans in the 1000 Functional Connectomes Project (as they’re calling it) are from participants being scanned during the “resting state”. The resting state is just what it sounds like: participants are scanned while they’re just resting; usually they’re given no specific instructions other than to lie still, relax, and not fall asleep. The typical finding is that, when you contrast this resting state with activation during virtually any kind of goal-directed processing, you get widespread activation increases in a network that’s come to be referred to as the “default” or “task-negative” network (in reference to the fact that it’s maximally active when people are in their “default” state).

One of the main (and increasingly important) applications of resting state fMRI data is in functional connectivity analyses, which aim to identify patterns of coactivation across different regions rather than mean-level changes associated with some task. The fundamental idea is that you can get a lot of traction on how the brain operates by studying how different brain regions interact with one another spontaneously over time, without having to impose an external task set. The newly released data is ideal for this kind of exploration, since you have a simply massive dataset that includes participants from all over the world scanned in a range of different settings using different scanners. So if you want to explore the functional architecture of the human brain during the resting state, this should really be your one-stop shop. (In fact, I’m tempted to say that there’s going to be much less incentive for people to collect resting-state data from now on, since there really isn’t much you’re going to learn from one sample of 20 – 30 people that you can’t learn from 1,400 people from 35+ combined samples).

Aside from introducing the dataset to the literature, Biswal et al also report a number of new findings. One neat finding is that functional parcellation of the brain using seed-based connectivity (i.e., identifying brain regions that coactivate with a particular “seed” or target region) shows marked consistency across different sites, revealing what Biswal et al call a “universal architecture”. This type of approach by itself isn’t particularly novel, as similar techniques have been used before. Bt no one’s done it on anything approaching this scale. Here’s what the results look like:

You can see that different seeds produce difference functional parcellations across the brain (the brighter areas denote ostensive boundaries).

Another interesting finding is the presence of gender and age differences in functional connectivity:

What this image shows is differences in functional connectivity with specific seed regions (the black dots) as a function of age (left) or gender (right). (The three rows reflect different techniques for producing the maps, with the upshot being that the results are very similar regardless of exactly how you do the analysis.) It isn’t often you get to see scatterplots with 1,400+ points in cognitive neuroscience, so this is a welcome sight. Although it’s also worth pointing out the inevitable downside of having huge sample sizes, which is that even tiny effects attain statistical significance. Which is to say, while the above findings are undoubtedly more representative of gender and age differences in functional connectivity than anything else you’re going to see for a long time, notice that they’re they’re very small effects (e.g., in the right panels, you can see that the differences between men and women are only a fraction of a standard deviation in size, despite the fact that these regions are probably selected because they show some of the “strongest” effects). That’s not meant as a criticism; it’s actually a very good thing, in that these modest effects are probably much closer to the truth than what previous studies have reported. Such findings should serve as an important reminder that most of the effects identified by fMRI studies are almost certainly massively inflated by small sample size (as I’ve discussed before here and in this paper).

Anyway, the bottom line is that if you’ve ever thought to yourself, “gee, I wish I could do cutting-edge fMRI research, but I really don’t want to leave my house to get a PhD; it’s almost lunchtime,” this is your big chance. You can download the data, rejoice in the magic that is the resting state, and bathe yourself freely in functional connectivity. The Biswal et al paper bills itself as “a watershed event in functional imaging,” and it’s hard to argue otherwise. Researchers now have a definitive data set to use for analyses of functional connectivity and the resting state, as well as a model for what other similar data sets might look like in the future.

More importantly, with 54 authors on the paper, fMRI is now officially big science. Prepare to suck it, Human Genome Project!

ResearchBlogging.orgBiswal, B., Mennes, M., Zuo, X., Gohel, S., Kelly, C., Smith, S., Beckmann, C., Adelstein, J., Buckner, R., Colcombe, S., Dogonowski, A., Ernst, M., Fair, D., Hampson, M., Hoptman, M., Hyde, J., Kiviniemi, V., Kotter, R., Li, S., Lin, C., Lowe, M., Mackay, C., Madden, D., Madsen, K., Margulies, D., Mayberg, H., McMahon, K., Monk, C., Mostofsky, S., Nagel, B., Pekar, J., Peltier, S., Petersen, S., Riedl, V., Rombouts, S., Rypma, B., Schlaggar, B., Schmidt, S., Seidler, R., Siegle, G., Sorg, C., Teng, G., Veijola, J., Villringer, A., Walter, M., Wang, L., Weng, X., Whitfield-Gabrieli, S., Williamson, P., Windischberger, C., Zang, Y., Zhang, H., Castellanos, F., & Milham, M. (2010). Toward discovery science of human brain function Proceedings of the National Academy of Sciences, 107 (10), 4734-4739 DOI: 10.1073/pnas.0911855107

functional MRI and the many varieties of reliability

Friday, March 5th, 2010

Craig Bennett and Mike Miller have a new paper on the reliability of fMRI. It’s a nice review that I think most people who work with fMRI will want to read. Bennett and Miller discuss a number of issues related to reliability, including why we should care about the reliability of fMRI, what factors influence reliability, how to obtain estimates of fMRI reliability, and what previous studies suggest about the reliability of fMRI. Their bottom line is that the reliability of fMRI often leaves something to be desired:

One thing is abundantly clear: fMRI is an effective research tool that has opened broad new horizons of investigation to scientists around the world. However, the results from fMRI research may be somewhat less reliable than many researchers implicitly believe. While it may be frustrating to know that fMRI results are not perfectly replicable, it is beneficial to take a longer-term view regarding the scientific impact of these studies. In neuroimaging, as in other scientific fields, errors will be made and some results will not replicate.

I think this is a wholly appropriate conclusion, and strongly recommend reading the entire article. Because there’s already a nice write-up of the paper over at Mind Hacks, I’ll content myself to adding a number of points to B&M’s discussion (I talk about some of these same issues in a chapter I wrote with Todd Braver).

First, even though I agree enthusiastically with the gist of B&M’s conclusion, it’s worth noting that, strictly speaking, there’s actually no such thing as “the reliability of fMRI”. Reliability isn’t a property of a technique or instrument, it’s a property of a specific measurement. Because every measurement is made under slightly different conditions, reliability will inevitably vary on a case-by-case basis. But since it’s not really practical (or even possible) to estimate reliability for every single analysis, researchers take necessary short-cuts. The standard in the psychometric literature is to establish reliability on a per-measure (not per-method!) basis, so long as conditions don’t vary too dramatically across samples. For example, once someone “validates” a given self-report measure, it’s generally taken for granted that that measure is “reliable”, and most people feel comfortable administering it to new samples without having to go to the trouble of estimating reliability themselves. That’s a perfectly reasonable approach, but the critical point is that it’s done on a relatively specific basis. Supposing you made up a new self-report measure of depression from a set of items you cobbled together yourself, you wouldn’t be entitled to conclude that your measure was reliable simply because some other self-report measure of depression had already been psychometrically validated. You’d be using an entirely new set of items, so you’d have to go to the trouble of validating your instrument anew.

By the same token, the reliability of any given fMRI measurement is going to fluctuate wildly depending on the task used, the timing of events, and many other factors. That’s not just because some estimates of reliability are better than others; it’s because there just isn’t a fact of the matter about what the “true” reliability of fMRI is. Rather, there are facts about how reliable fMRI is for specific types of tasks with specific acquisition parameters and preprocessing streams in specific scanners, and so on (which can then be summarized by talking about the general distribution of fMRI reliabilities). B&M are well aware of this point, and discuss it in some detail, but I think it’s worth emphasizing that when they say that “the results from fMRI research may be somewhat less reliable than many researchers implicitly believe,” what they mean isn’t that the “true” reliability of fMRI is likely to be around .5; rather, it’s that if you look at reliability estimates across a bunch of different studies and analyses, the estimated reliability is often low. But it’s not really possible to generalize from this overall estimate to any particular study; ultimately, if you want to know whether your data were measured reliably, you need to quantify that yourself. So the take-away message shouldn’t be that fMRI is an inherently unreliable method (and I really hope that isn’t how B&M’s findings get reported by the mainstream media should they get picked up), but rather, that there’s a very good chance that the reliability of fMRI in any given situation is not particularly high. It’s a subtle difference, but an important one.

Second, there’s a common misconception that reliability estimates impose an upper bound on the true detectable effect size. B&M make this point in their review, Vul et al made it in their “voodoo correlations”" paper, and in fact, I’ve made it myself before. But it’s actually not quite correct. It’s true that, for any given test, the true reliability of the variables involved limits the potential size of the true effect. But there are many different types of reliability, and most will generally only be appropriate and informative for a subset of statistical procedures. Virtually all types of reliability estimate will underestimate the true reliability in some cases and overestimate it in others. And in extreme cases, there may be close to zero relationship between the estimate and the truth.

To see this, take the following example, which focuses on internal consistency. Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable. This problem of underestimating reliability when items are heterogeneous has been discussed in the psychometric literature for at least 70 years, and yet you still very commonly see people do questionable things like “correcting for attenuation” based on dubious internal consistency estimates.

In their review, B&M mostly focus on test-retest reliability rather than internal consistency, but the same general point applies. Test-retest reliability is the degree to which people’s scores on some variable are consistent across multiple testing occasions. The intuition is that, if the rank-ordering of scores varies substantially across occasions (e.g., if the people who show the highest activation of visual cortex at Time 1 aren’t the same ones who show the highest activation at Time 2), the measurement must not have been reliable, so you can’t trust any effects that are larger than the estimated test-retest reliability coefficient. The problem with this intuition is that there can be any number of systematic yet session-specific influences on a person’s score on some variable (e.g., activation level). For example, let’s say you’re doing a study looking at the relation between performance on a difficult working memory task and frontoparietal activation during the same task. Suppose you do the exact same experiment with the same subjects on two separate occasions three weeks apart, and it turns out that the correlation between DLPFC activation across the two occasions is only .3. A simplistic view would be that this means that the reliability of DLPFC activation is only .3, so you couldn’t possibly detect any correlations between performance level and activation greater than .3 in DLPFC. But that’s simply not true. It could, for example, be that the DLPFC response during WM performance is perfectly reliable, but is heavily dependent on session-specific factors such as baseline fatigue levels, motivation, and so on. In other words, there might be a very strong and perfectly “real” correlation between WM performance and DLPFC activation on each of the two testing occasions, even though there’s very little consistency across the two occasions. Test-retest reliability estimates only tell you how much of the signal is reliably due to temporally stable variables, and not how much of the signal is reliable, period.

The general point is that you can’t just report any estimate of reliability that you like (or that’s easy to calculate) and assume that tells you anything meaningful about the likelihood of your analyses succeeding. You have to think hard about exactly what kind of reliability you care about, and then come up with an estimate to match that. There’s a reasonable argument to be made that most of the estimates of fMRI reliability reported to date are actually not all that relevant to many people’s analyses, because the majority of reliability analyses have focused on test-retest reliability, which is only an appropriate way to estimate reliability if you’re trying to relate fMRI activation to stable trait measures (e.g., personality or cognitive ability). If you’re interested in relating in-scanner task performance or state-dependent variables (e.g., mood) to brain activation (arguably the more common approach), or if you’re conducting within-subject analyses that focus on comparisons between conditions, using test-retest reliability isn’t particularly informative, and you really need to focus on other types of reliability (or reproducibility).

Third, and related to the above point, between-subject and within-subject reliability are often in statistical tension with one another. B&M don’t talk about this, as far as I can tell, but it’s an important point to remember when designing studies and/or conducting analyses. Essentially, the issue is that what counts as error depends on what effects you’re interested in. If you’re interested in individual differences, it’s within-subject variance that counts as error, so you want to minimize that. Conversely, if you’re interested in within-subject effects (the norm in fMRI), you want to minimize between-subject variance. But you generally can’t do both of these at the same time. If you use a very “strong” experimental manipulation (i.e., a task that produces a very large difference between conditions for virtually all subjects), you’re going to reduce the variability between individuals, and you may very well end up with very low test-retest reliability estimates. And that would actually be a good thing! Conversely, if you use a “weak” experimental manipulation, you might get no mean effect at all, because there’ll be much more variability between individuals. There’s no right or wrong here; the trick is to pick a design that matches the focus of your study. In the context of reliability, the essential point is that if all you’re interested in is the contrast between high and low working memory load, it shouldn’t necessarily bother you if someone tells you that the test-retest reliability of induced activation in your study is close to zero. Conversely, if you care about individual differences, it shouldn’t worry you if activations aren’t reproducible across studies at the group level. In some ways, those are actual the ideal situations for each of those two types of studies.

Lastly, B&M raise a question as to what level of reliability we should consider “acceptable” for fMRI research:

There is no consensus value regarding what constitutes an acceptable level of reliability in fMRI. Is an ICC value of 0.50 enough? Should studies be required to achieve an ICC of 0.70? All of the studies in the review simply reported what the reliability values were. Few studies proposed any kind of criteria to be considered a ‘reliable’ result. Cicchetti and Sparrow did propose some qualitative descriptions of data based on the ICC-derived reliability of results (1981). They proposed that results with an ICC above 0.75 be considered ‘excellent’, results between 0.59 and 0.75 be considered ‘good’, results between .40 and .58 be considered ‘fair’, and results lower than 0.40 be considered ‘poor’. More specifically to neuroimaging, Eaton et al. (2008) used a threshold of ICC > 0.4 as the mask value for their study while Aron et al. (2006) used an ICC cutoff of ICC > 0.5 as the mask value.

On this point, I don’t really see any reason to depart from psychometric convention just because we’re using fMRI rather than some other technique. Conventionally, reliability estimates of around .8 (or maybe .7, if you’re feeling generous) are considered adequate. Any lower and you start to run into problems, because effect sizes will shrivel up. So I think we should be striving to attain the same levels of reliability with fMRI as with any other measure. If it turns out that that’s not possible, we’ll have to live with that, but I don’t think the solution is to conclude that reliability estimates on the order of .5 are ok “for fMRI” (I’m not saying that’s what B&M say, just that that’s what we should be careful not to conclude). Rather, we should just accept that the odds of detecting certain kinds of effects with fMRI are probably going to be lower than with other techniques. And maybe we should minimize the use of fMRI for those types of analyses where reliability is generally not so good (e.g., using brain activation to predict trait variables over long intervals).

I hasten to point out that none of this should be taken as a criticism of B&M’s paper; I think all of these points complement B&M’s discussion, and don’t detract in any way from its overall importance. Reliability is a big topic, and there’s no way Bennett and Miller could say everything there is to be said about it in one paper. I think they’ve done the field of cognitive neuroscience an important service by raising awareness and providing an accessible overview of some of the issues surrounding reliability, and it’s certainly a paper that’s going on my “essential readings in fMRI methods” list.

ResearchBlogging.org
Bennett, C. M., & Miller, M. B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences

specificity statistics for ROI analyses: a simple proposal

Sunday, December 13th, 2009

The brain is a big place. In the context of fMRI analysis, what that bigness means is that a typical 3D image of the brain might contain anywhere from 50,000 – 200,000 distinct voxels (3D pixels). Any of those voxels could theoretically show meaningful activation in relation to some contrast of interest, so the only way to be sure that you haven’t overlooked potentially interesting activations is to literally test every voxel (or, given some parcellation algorithm, every region).

Unfortunately, the problem that approach raises–which I’ve discussed in more detail here–is the familiar one of multiple comparisons: If you’re going to test 100,000 locations, it’s not really fair to test each one at the conventional level of p < .05, because on average, you’ll get about 5,000 statistically significant results just by chance that way. So you need to do something to correct for the fact that you’re running thousands of tests. The most common approach is to simply make the threshold for significance more conservative–for example, by testing at p < .0001 instead of p < .05, or by using some combination of intensity and cluster extent thresholds (e.g., you look for 20 contiguous voxels that are all significant at, say, p < .001) that’s supposed to guarantee a cluster-wise error rate of .05.

There is, however, a natural tension between false positives and false negatives: When you make your analysis more conservative, you let fewer false positives through the filter, but you also keep more of the true positives out. A lot of fMRI analysis really just boils down to walking a very thin line between running overconservative analyses that can’t detect anything but the most monstrous effects, and running overly liberal analyses that lack any real ability to distinguish meaningful signals from noise. One very common approach that fMRI researchers have adopted in an effort to optimize this balance is to use complementary hypothesis-driven and whole-brain analyses. The idea is that you’re basically carving the brain up into two separate search spaces: One small space for which you have a priori hypotheses that can be tested using a small number of statistical comparisons, and one much larger space (containing everything but the a priori space) where you continue to use a much more conservative threshold.

For example, if I believe that there’s a very specific chunk of right inferotemporal cortex that’s specialized for detecting clown faces, I can focus my hypothesis-testing on that particular region, without having to pretend that all voxels are created equal. So I delineate the boundaries of a CRC (Clown Representation Cortex) region-of-interest (ROI) based on some prior criteria (e.g., anatomy, or CRC activation in previous studies), and then I can run a single test at p < .05 to test my hypothesis, no correction needed. But to ensure that I don’t miss out on potentially important clown-related activation elsewhere in the brain, I also go ahead and run an additional whole-brain analysis that’s fully corrected for multiple comparisons. By coupling these two analyses, I hopefully get the best of both worlds. That is, I combine one approach (the ROI analysis) that maximizes power to test a priori hypotheses at the cost of an inability to detect effects in unexpected places with another approach (the whole-brain analysis) that has a much more limited capacity to detect effects in both expected and unexpected locations.

This two-pronged strategy is generally a pretty successful one, and I’d go so far as to say that a very large minority, if not an outright majority, of fMRI studies currently use it. Used wisely, I think it’s really an invaluable strategy. There is, however, one fairly serious and largely unappreciated problem associated with the incautious application of this approach. It has to do with claims about the specificity of activation that often tend to accompany studies that use a complementary ROI/whole-brain strategy. Specifically, a pretty common pattern is for researchers to (a) confirm their theoretical predictions by successfully detecting activation in one or more a priori ROIs; (b) identify few if any whole-brain activations; and consequently, (c) conclude that not only were the theoretical predictions confirmed, but that the hypothesized effects in the a priori ROIs were spatially selective, because a complementary whole-brain analysis didn’t turn up much (if anything). Or, to put it in less formal terms, not only were we right, we were really right! There isn’t any other part of the brain that shows the effect we hypothesized we’d see in our a priori ROI!

The problem with this type of inference is that there’s usually a massive discrepancy in the level of power available to detect effects in a priori ROIs versus the rest of the brain. If you search at p < .05 within some predetermined space, but at only p < .0001 everywhere else, you’re naturally going to detect results at a much lower rate everywhere else. But that’s not necessarily because there wasn’t just as much to look at everywhere else; it could just be because you didn’t look very carefully. By way of analogy, if you’re out picking berries in the forest, and you decide to spend half your time on just one bush that (from a distance) seemed particularly berry-full, and the other half of your time divided between the other 40 bushes in the area, you’re not really entitled to conclude that you picked the best bush all along simply because you came away with a relatively full basket. Had you done a better job checking out the other bushes, you might well have found some that were even better, and then you’d have come away carrying two baskets full of delicious, sweet, sweet berries.

Now, in an ideal world, we’d solve this problem by simply going around and carefully inspecting all the berry bushes, until we were berry, berry sure really convinced that we’d found all of the best bushes. Unfortunately, we can’t do that, because we’re out here collecting berries on our lunch break, and the boss isn’t paying us to dick around in the woods. Or, to return to fMRI World, we simply can’t carefully inspect every single voxel (say, by testing it at p < .05), because then we’re right back in mega-false-positive-land, which we’ve already established as a totally boring place we want to avoid at all costs.

Since an optimal solution isn’t likely, the next best thing is to figure out what we can do to guard against careless overinterpretation. Here I think there’s actually a very simple, and relatively elegant, solution. What I’ve suggested when I’ve given recent talks on this topic is that we mandate (or at least, encourage) the use of what you could call a specificity statistic (SS). The SS is a very simple measure of how specific a given ROI-level finding is; it’s just the proportion of voxels that are statistically significant when tested at the same level as the ROI-level effects. In most cases, that’s going to be p < .05, so the SS will usually just be the proportion of all voxels anywhere in the brain that are activated at p < .05.

To see why this is useful, consider what could no longer happen: Researchers would no longer be able to (inadvertently) capitalize on the fact that the one or two regions they happened to define as a priori ROIs turned up significant effects when no other regions did in a whole-brain analysis. Suppose that someone reports a finding that negative emotion activates the amygdala in an ROI analysis, but doesn’t activate any other region in a whole-brain analysis. (While I’m pulling this particular example out of a hat here, I feel pretty confident that if you went and did a thorough literature review, you’d find at least three or four studies that have made this exact claim.) This is a case where the SS would come in really handy. Because if the SS is, say, 26% (i.e., about a quarter of all voxels in the brain are active at p < .05, even if none survive full correction for multiple comparisons), you would want to draw a very different conclusion than if it was just 4%. If fully a quarter of the brain were to show greater activation for a negative-minus-neutral emotion contrast, you wouldn’t want to conclude that the amygdala was critically involved in negative emotion; a better interpretation would be that the researchers in question just happened to define an a priori region that fell within the right quarter of the brain. Perhaps all that’s happening is that negative emotion elicits a general increase in attention, and much of the brain (including, but by no means limited to, the amygdala) tends to increase activation correspondingly. So as a reviewer and reader, you’d want to know how specific the reported amygdala activation really is*. But in the vast majority of papers, you currently have no way of telling (and the researchers probably don’t even know the answer themselves!).

The principal beauty of this statistic lies in its simplicity: It’s easy to understand, easy to calculate, and easy to report. Ideally, researchers would report the SS any time ROI analyses are involved, and would do it for every reported contrast. But at minimum, I think we should all encourage each other (and ourselves) to report such a statistic any time we’re making a specificity claim about ROI-based results. In other words,if you want to argue that a particular cognitive function is relatively localized to the ROI(s) you happened to select, you should be required to show that there aren’t that many other voxels (or regions) that show the same effect when tested at the liberal threshold you used for the ROI analysis. There shouldn’t be an excuse for not doing this; it’s a very easy procedure for researchers to implement, and an even easier one for reviewers to demand.

* An alternative measure of specificity would be to report the percentile ranking of all of the voxels within the ROI mask relative to all other individual voxels. In the above example, you’d assign very different interpretations depending on whether the amygdala was in the 32nd or 87th percentile of all voxels, when ordered according to the strength of the effect for the negative – neutral contrast.

a well-written mainstream article on fMRI?!

Wednesday, December 9th, 2009

Craig Bennett, of prefrontal.org and dead salmon fame, links to a really great Science News article on the promises and pitfalls of fMRI. As Bennett points out, the real gem of the article is the “quote of the week” from Nikos Logethetis (which I won’t spoil for you here; you’ll have to do just a little more work to get to it). But the article is full of many other insightful quotes from fMRI researchers, and manages to succinctly and accurately describe a number of recent controversies in the fMRI literature without sacrificing too much detail. Usually when I come across a mainstream article on fMRI, I pre-emptively slap the screen a few times before I start reading, because I know I’m about to get angry. Well, I did that this time too, so my hand hurts per usual, but at least this time I feel pretty good about it. Kudos to Laura Sanders for writing one of the best non-technical accounts I’ve seen of the current state of fMRI research (and that, unlike a number of other articles in this vein, actually ends on a balanced and optimistic note).

Ioannidis on effect size inflation, with guest appearance by Bozo the Clown

Saturday, November 21st, 2009

Andrew Gelman posted a link on his blog today to a paper by John Ioannidis I hadn’t seen before. In many respects, it’s basically the same paper I wrote earlier this year as a commentary on the Vul et al “voodoo correlations” paper (the commentary was itself based largely on an earlier chapter I wrote with my PhD advisor, Todd Braver). Well, except that the Ioannidis paper came out a year earlier than mine, and is also much better in just about every respect (more on this below).

What really surprises me is that I never came across Ioannidis’ paper when I was doing a lit search for my commentary. The basic point I made in the commentary–which can be summarized as the observation that low power coupled with selection bias almost invariably inflates significant effect sizes–is a pretty straightforward statistical point, so I figured that many people, and probably most statisticians, were well aware of it. But no amount of Google Scholar-ing helped me find an authoritative article that made the same point succinctly; I just kept coming across articles that made the point tangentially, in an off-hand “but of course we all know we shouldn’t trust these effect sizes, because…” kind of way. So I chalked it down as one of those statistical factoids (of which there are surprisingly many) that live in the unhappy land of too-obvious-for-statisticians-to-write-an-article-about-but-not-obvious-enough-for-most-psychologists-to-know-about. And so I just went ahead and wrote the commentary in a non-technical way that I hoped would get the point across intuitively.

Anyway, after the commentary was accepted, I sent a copy to Andrew Gelman, who had written several posts about the Vul et al controversy. He promptly send me back a link to this paper of his, which basically makes the same point about sampling error, but with much more detail and much better examples than I did. His paper also cites an earlier article in American Scientist by Wainer, which I also recommend, and again expresses very similar ideas. So then I felt a bit like a fool for not stumbling across either Gelman’s paper or Wainer’s earlier. And now that I’ve read Ioannidis’ paper, I feel even dumber, seeing as I could have saved myself a lot of trouble by writing two or three paragraphs and then essentially pointing to Ioannidis’ work. Oh well.

That all said, it wasn’t a complete loss; I still think the basic point is important enough that it’s worth repeating loudly and often, no matter how many times it’s been said before. And I’m skeptical that many fMRI researchers would have appreciated the point otherwise, given that none of the papers I’ve mentioned were published in venues fMRI researchers are likely to read regularly (which is presumably part of the reason I never came across them!). Of course, I don’t think that many people who do fMRI research actually bothered to read my commentary, so it’s questionable whether it had much impact anyway.

At any rate, the Ioannidis paper makes a number of points that my paper didn’t, so I figured I’d talk about them a bit. I’ll start by revisiting what I said in my commentary, and then I’ll tell you why you should read Ioannidis’ paper instead of mine.

The basic intuition can be captured as follows. Suppose you’re interested in the following question: Do clowns suffer depression at a higher rate than us non-comical folk do? You might think this is a contrived (to put it delicately) question, but I can assure you it has all sorts of important real-world implications. For instance, you wouldn’t be so quick to book a clown for your child’s next birthday party if you knew that The Great Mancini was going to be out in the parking lot half an hour later drinking cheap gin out of a top hat. If that example makes you feel guilty, congratulations: you’ve just discovered the translational value of basic science.

Anyway, back to the question, and how we’re going to answer it. You can’t just throw a bunch of clowns and non-clowns in a room and give them a depression measure. There’s nothing comical about that. What you need to do, if you’re rigorous about it, is give them multiple measures of depression, because we all know how finicky individual questionnaires can be. So the clowns and non-clowns each get to fill out the Beck Depression Inventory (BDI), the Center for Epidemiologic Studies Depression Scale, the Depression Adjective Checklist, the Zung Self-Rating Depression Scale (ZSRDS), and, let’s say, six other measures. Ten measures in all. And let’s say we have 20 individuals in each group, because that’s all I personally a cash-strapped but enthusiastic investigator can afford. After collecting the data, we score the questionnaires and run a bunch of t-tests to determine whether clowns and non-clowns have different levels of depression. Being scrupulous researchers who care a lot about multiple comparisons correction, we decide to divide our critical p-value by 10 (the dreaded Bonferroni correction, for 10 tests in this case) and test at p < .005. That’s a conservative analysis, of course; but better safe than sorry!

So we run our tests and get what look like mixed results. Meaning, we get statistically significant positive correlations between clown-dom status and depression for 2 measures–the BDI and Zung inventories–but not for the other 8 measures. So that’s admittedly not great; it would have been better if all 10 had come out right. Still, it at least partially supports our hypothesis: Clowns are fucking miserable! And because we’re already thinking ahead to how we’re going to present these results when they (inevitably) get published in Psychological Science, we go ahead and compute the effect sizes for the two significant correlations, because, after all, it’s important to know not only that there is a “real” effect, but also how big that effect is. When we do that, it turns out that the point-biserial correlation is huge! It’s .75 for the BDI and .68 for the ZSRDS. In other words, about half of the variance in clowndom can be explained by depression levels. And of course, because we’re well aware that correlation does not imply causation, we get to interpret the correlation both ways! So we quickly issue a press release claiming that we’ve discovered that it’s possible to conclusively diagnose depression just by knowing whether or not someone’s a clown! (We’re not going to worry about silly little things like base rates in a press release.)

Now, this may all seem great. And it’s probably not an unrealistic depiction of how much of psychology works (well, minus the colorful scarves, big hair, and face paint). That is, very often people report interesting findings that were selectively reported from amongst a larger pool of potential findings on the basis of the fact that the former but not the latter surpassed some predetermined criterion for statistical significance. For example, in our hypothetical in press clown paper, we don’t bother to report results for the correlation between clownhood and the Center for Epidemiologic Studies Depression Scale (r = .12, p > .1). Why should we? It’d be silly to report a whole pile of additional correlations only to turn around and say “null effect, null effect, null effect, null effect, null effect, null effect, null effect, and null effect” (see how boring it was to read that?). Nobody cares about variables that don’t predict other variables; we care about variables that do predict other variables. And we’re not really doing anything wrong, we think; it’s not like the act of selective reporting is inflating our Type I error (i.e., the false positive rate), because we’ve already taken care of that up front by deliberately being overconservative in our analyses.

Unfortunately, while it’s true that our Type I error doesn’t suffer, the act of choosing which findings to report based on the results of a statistical test does have another unwelcome consequence. Specifically, there’s a very good chance that the effect sizes we end up reporting for statistically significant results will be artificially inflated–perhaps dramatically so.

Why would this happen? It’s actually entailed by the selection procedure. To see this, let’s take the classical measurement model, under which the variance in any measured variable reflects the sum of two components: the “true” scores (i.e., the scores we would get if our measurements were always completely accurate) and some random error. The error term can in turn be broken down into many more specific sources of error; but we’ll ignore that and just focus on one source of error–namely, sampling error. Sampling error refers to the fact that we can never select a perfectly representative group of subjects when we collect a sample; there’s always some (ideally small) way in which the sample characteristics differ from the population. This error term can artificially inflate an effect or artificially deflate it, and it can inflate or deflate it more or less, but it’s going to have an effect one way or the other. You can take that to the bank as sure as my name’s Bozo the Clown.

To put this in context, let’s go back to our BDI scores. Recall that what we observed is that clowns have higher BDI scores than non-clowns. But what we’re now saying is that that difference in scores is going to be affected by sampling error. That is, just by chance, we may have selected a group of clowns that are particularly depressed, or a group of non-clowns who are particularly jolly. Maybe if we could measure depression in all clowns and all non-clowns, we would actually find no difference between groups.

Now, if we allow that sampling error really is random, and that we’re not actively trying to pre-determine the outcome of our study by going out of our way to recruit The Great Depressed Mancini and his extended dysthymic clown family, then in theory we have no reason to think that sampling error is going to introduce any particular bias into our results. It’s true that the observed correlations in our sample may not be perfectly representative of the true correlations in the population; but that’s not a big deal so long as there’s no systematic bias (i.e., that we have no reason to think that our sample will systematically inflate correlations or deflate them). But here’s the problem: the act of choosing to report some correlations but not others on the basis of their statistical significance (or lack thereof) introduces precisely such a bias. The reason is that, when you go looking for correlations that are of a certain size or greater, you’re inevitably going to be more likely to select those correlations that happen to have been helped by chance than hurt by it.

Here’s a series of figures that should make the point even clearer. Let’s pretend for a moment that the truth of the matter is that there is in fact a positive correlation between clown status and all 10 depression measures. Except, we’ll make it 100 measures, because it’ll be easier to illustrate the point that way. Moreover, let’s suppose that the correlation is exactly the same for all 100 measures, at .3. Here’s what that would look like if we just plotted the correlations for all 100 measures, 1 through 100:

figure1

It’s just a horizontal red line, because all the individual correlations have the same value (0.3). So that’s not very exciting. But remember, these are the population correlations. They’re not what we’re going to observe in our sample of 20 clowns and 20 non-clowns, because depression scores in our sample aren’t a perfect representation of the population. There’s also error to worry about. And error–or at least, sampling error–is going to be greater for smaller samples than for bigger ones. (The reason for this can be expressed intuitively: other things being equal, the more observations you have, the more representative your sample must be of the population as a whole, because deviations in any given direction will tend to cancel each other out the more data you collect. And if you keep collecting, at the limit, your sample will constitute the whole population, and must therefore by definition be perfectly representative). With only 20 subjects in each group, our estimates of each group’s depression level are not going to be terrifically stable. You can see this in the following figure, which shows the results of a simulation on 100 different variables, assuming that all have an identical underlying correlation of .3:

figure2

Notice how much variability there is in the correlations! The weakest correlation is actually negative, at -.18; the strongest is much larger than .3, at .63. (Caveat for more technical readers: this assumes that the above variables are completely independent, which in practice is unlikely to be true when dealing with 100 measures of the same construct.) So even though the true correlation is .3 in all cases, the magic of sampling will necessarily produce some values that are below .3, and some that are above .3. In some cases, the deviations will be substantial.

By now you can probably see where this is going. Here we have a distribution of effect sizes that to some extent may reflect underlying variability in population effect sizes, but is also almost certainly influenced by sampling error. And now we come along and decide that, hey, it doesn’t really make sense to report all 100 of these correlations in a paper; that’s too messy. Really, for the sake of brevity and clarity, we should only report those correlations that are in some sense more important and “real”. And we do that by calculating p-values and only reporting the results of tests that are significant at some predetermined level (in our case, p < .005). Well, here’s what that would look like:

figure3

This is exactly the same figure as the previous one, except we’ve now grayed out all the non-significant correlations. And in the process, we’ve made Bozo the Clown cry:

Why? Because unfortunately, the criterion that we’ve chosen is an extremely conservative one. In order to detect a significant difference in means between two groups of 20 subjects at p < .005, the observed correlation (depicted as the horizontal black line above) needs to be .42 or greater! That’s substantially larger than the actual population effect size of .3. Effects of this magnitude don’t occur very frequently in our sample; in fact, they only occur 16 times. As a result, we’re going to end up failing to detect 84 of 100 correlations, and will walk away thinking they’re null results–even though the truth is that, in the population, they’re actually all pretty strong, at .3. This quantity–the proportion of “real” effects that we’re likely to end up calling statistically significant given the constraints of our sample–is formally called statistical power. If you do a power analysis for a two-sample t-test on a correlation of .3 at p < .005, it turns out that power is only .17 (which is essentially what we see above; the slight discrepancy is due to chance). In other words, even when there are real and relatively strong associations between depression and clownhood, our sample would only identify those associations 17% of the time, on average.

That’s not good, obviously, but there’s more. Now the other shoe drops, because not only have we systematically missed out on most of the effects we’re interested in (in virtue of using small samples and overly conservative statistical thresholds), but notice what we’ve also done to the effect sizes of those correlations that we do end up identifying. What is in reality a .3 correlation spuriously appears, on average, as  a .51 correlation in the 16 tests that surpass our threshold. So, through the combined magic of low power and selection bias, we’ve turned what may in reality be a relatively diffuse association between two variables (say, clownhood and depression) into a seemingly selective and extremely strong association. After all the excitement about getting a high-profile publication, it might ultimately turn out that clowns aren’t really so depressed after all–it’s all an illusion induced by the sampling apparatus. So you might say that the clowns get the last laugh. Or that the joke’s on us. Or maybe just that this whole clown example is no longer funny and it’s now time for it to go bury itself in a hole somewhere.

Anyway, that, in a nutshell, was the point my commentary on the Vul et al paper made, and it’s the same point the Gelman and Wainer papers make too, in one way or another. While it’s a very general point that really applies in any domain where (a) power is less than 100% (which is just about always) and (b) there is some selection bias (which is also just about always), there were some considerations that were particularly applicable to fMRI research. The basic issue is that, in fMRI research, we often want to conduct analyses that span the entire brain, which means we’re usually faced with conducting many more statistical comparisons than researchers in other domains generally deal with (though not, say, molecular geneticists conducting genome-wide association studies). As a result, there is a very strong emphasis in imaging research on controlling Type I error rates by using very conservative statistical thresholds. You can agree or disagree with this general advice (for the record, I personally think there’s much too great an emphasis in imaging on Type I error, and not nearly enough emphasis on Type II error), but there’s no avoiding the fact that following it will tend to produce highly inflated significant effect sizes, because in the act of reducing p-value thresholds, we’re also driving down power dramatically, and making the selection bias more powerful.

While it’d be nice if there was an easy fix for this problem, there really isn’t one. In behavioral domains, there’s often a relatively simple prescription: report all effect sizes, both significant and non-significant. This doesn’t entirely solve the problem, because people are still likely to overemphasize statistically significant results relative to non-significant ones; but at least at that point you can say you’ve done what you can. In the fMRI literature, this course of action isn’t really available, because most journal editors are not going to be very happy with you when you send them a 25-page table that reports effect sizes and p-values for each of the 100,000 voxels you tested. So we’re forced adopt other strategies. The one I’ve argued for most strongly is to increase sample size (which increases power and decreases the uncertainty of resulting estimates). But that’s understandably difficult in a field where scanning each additional subject can cost $1,000 or more. There are a number of other things you can do, but I won’t talk about them much here, partly because this is already much too long a post, but mostly because I’m currently working on a paper that discusses this problem, and potential solutions, in much more detail.

So now finally I get to the Ioannidis article. As I said, the basic point is the same one made in my paper and Gelman’s and others, and the one I’ve described above in excruciating clownish detail. But there are a number of things about the Ioannidis that are particularly nice. One is that Ioannidis considers not only inflation due to selection of statistically significant results coupled with low power, but also inflation due to the use of flexible analyses (or, as he puts it, “vibration” of effects–also known as massaging the data). Another is that he considers cultural aspects of the phenomenon, e.g., the fact that investigators tend to be rewarded for reporting large effects, even if they subsequently fail to replicate. He also discusses conditions under which you might actually get deflation of effect sizes–something I didn’t touch on in my commentary, and hadn’t really thought about. Finally, he makes some interesting recommendations for minimizing effect size inflation. Whereas my commentary focused primarily on concrete steps researchers could take in individual studies to encourage clearer evaluation of results (e.g., reporting confidence intervals, including power calculations, etc.), Ioannidis focuses on longer-term solutions and the possibility that we’ll need to dramatically change the way we do science (at least in some fields).

Anyway, this whole issue of inflated effect sizes is a critical one to appreciate if you do any kind of social or biomedical science research, because it almost certainly affects your findings on a regular basis, and has all sorts of implications for what kind of research we conduct and how we interpret our findings. (To give just one trivial example, if you’ve ever been tempted to attribute your failure to replicate a previous finding to some minute experimental difference between studies, you should seriously consider the possibility that the original effect size may have been grossly inflated, and that your own study consequently has insufficient power to replicate the effect.) If you only have time to read one article that deals with this issue, read the Ioannidis paper. And remember it when you write your next Discussion section. Bozo the Clown will thank you for it.

Ioannidis, J. (2008). Why Most Discovered True Associations Are Inflated Epidemiology, 19 (5), 640-648 DOI: 10.1097/EDE.0b013e31818131e7

Yarkoni, T. (2009). Big Correlations in Little Studies: Inflated fMRI Correlations Reflect Low Statistical Power-Commentary on Vul et al. (2009) Perspectives on Psychological Science, 4 (3), 294-298 DOI: 10.1111/j.1745-6924.2009.01127.x