functional MRI and the many varieties of reliability
Craig Bennett and Mike Miller have a new paper on the reliability of fMRI. It’s a nice review that I think most people who work with fMRI will want to read. Bennett and Miller discuss a number of issues related to reliability, including why we should care about the reliability of fMRI, what factors influence reliability, how to obtain estimates of fMRI reliability, and what previous studies suggest about the reliability of fMRI. Their bottom line is that the reliability of fMRI often leaves something to be desired:
One thing is abundantly clear: fMRI is an effective research tool that has opened broad new horizons of investigation to scientists around the world. However, the results from fMRI research may be somewhat less reliable than many researchers implicitly believe. While it may be frustrating to know that fMRI results are not perfectly replicable, it is beneficial to take a longer-term view regarding the scientific impact of these studies. In neuroimaging, as in other scientific fields, errors will be made and some results will not replicate.
I think this is a wholly appropriate conclusion, and strongly recommend reading the entire article. Because there’s already a nice write-up of the paper over at Mind Hacks, I’ll content myself with adding a number of points to B&M’s discussion (I talk about some of these same issues in a chapter I wrote with Todd Braver).
First, even though I agree enthusiastically with the gist of B&M’s conclusion, it’s worth noting that, strictly speaking, there’s actually no such thing as “the reliability of fMRI”. Reliability isn’t a property of a technique or instrument, it’s a property of a specific measurement. Because every measurement is made under slightly different conditions, reliability will inevitably vary on a case-by-case basis. But since it’s not really practical (or even possible) to estimate reliability for every single analysis, researchers take necessary short-cuts. The standard in the psychometric literature is to establish reliability on a per-measure (not per-method!) basis, so long as conditions don’t vary too dramatically across samples. For example, once someone “validates” a given self-report measure, it’s generally taken for granted that that measure is “reliable”, and most people feel comfortable administering it to new samples without having to go to the trouble of estimating reliability themselves. That’s a perfectly reasonable approach, but the critical point is that it’s done on a relatively specific basis. Supposing you made up a new self-report measure of depression from a set of items you cobbled together yourself, you wouldn’t be entitled to conclude that your measure was reliable simply because some other self-report measure of depression had already been psychometrically validated. You’d be using an entirely new set of items, so you’d have to go to the trouble of validating your instrument anew.
By the same token, the reliability of any given fMRI measurement is going to fluctuate wildly depending on the task used, the timing of events, and many other factors. That’s not just because some estimates of reliability are better than others; it’s because there just isn’t a fact of the matter about what the “true” reliability of fMRI is. Rather, there are facts about how reliable fMRI is for specific types of tasks with specific acquisition parameters and preprocessing streams in specific scanners, and so on (which can then be summarized by talking about the general distribution of fMRI reliabilities). B&M are well aware of this point, and discuss it in some detail, but I think it’s worth emphasizing that when they say that “the results from fMRI research may be somewhat less reliable than many researchers implicitly believe,” what they mean isn’t that the “true” reliability of fMRI is likely to be around .5; rather, it’s that if you look at reliability estimates across a bunch of different studies and analyses, the estimated reliability is often low. But it’s not really possible to generalize from this overall estimate to any particular study; ultimately, if you want to know whether your data were measured reliably, you need to quantify that yourself. So the take-away message shouldn’t be that fMRI is an inherently unreliable method (and I really hope that isn’t how B&M’s findings get reported by the mainstream media should they get picked up), but rather, that there’s a very good chance that the reliability of fMRI in any given situation is not particularly high. It’s a subtle difference, but an important one.
Second, there’s a common misconception that reliability estimates impose an upper bound on the true detectable effect size. B&M make this point in their review, Vul et al. made it in their “voodoo correlations” paper, and in fact, I’ve made it myself before. But it’s actually not quite correct. It’s true that, for any given test, the true reliability of the variables involved limits the potential size of the true effect. But there are many different types of reliability, and most will generally only be appropriate and informative for a subset of statistical procedures. Virtually all types of reliability estimate will underestimate the true reliability in some cases and overestimate it in others. And in extreme cases, there may be close to zero relationship between the estimate and the truth.
To see this, take the following example, which focuses on internal consistency. Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable. This problem of underestimating reliability when items are heterogeneous has been discussed in the psychometric literature for at least 70 years, and yet you still very commonly see people do questionable things like “correcting for attenuation” based on dubious internal consistency estimates.
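For anyone who wants to see the shoelace-and-blue example in numbers, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the sample size, the small measurement-error SD, the variable names, and the choice of Cronbach's alpha as the internal consistency estimate. It builds two uncorrelated true scores, measures each with a little noise on two occasions, and compares alpha with the correlation of the summed score across occasions.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects, n_items) score matrix."""
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

rng = np.random.default_rng(0)
n = 5000          # hypothetical sample size
noise_sd = 0.1    # hypothetical (small) measurement error per item

# Two true scores that are uncorrelated by construction.
shoelace_true = rng.normal(size=n)
blue_true = rng.normal(size=n)

def measure(true_scores):
    """One noisy measurement of a set of true scores."""
    return true_scores + rng.normal(scale=noise_sd, size=n)

# Administer the two-item "scale" on two occasions.
time1 = np.column_stack([measure(shoelace_true), measure(blue_true)])
time2 = np.column_stack([measure(shoelace_true), measure(blue_true)])

alpha = cronbach_alpha(time1)
retest_r = np.corrcoef(time1.sum(axis=1), time2.sum(axis=1))[0, 1]

print(f"Cronbach's alpha at time 1:      {alpha:.3f}")
print(f"Retest correlation of sum score: {retest_r:.3f}")
```

Under these assumptions alpha comes out near zero while the summed score correlates almost perfectly with itself across occasions, which is the dissociation described above.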
In their review, B&M mostly focus on test-retest reliability rather than internal consistency, but the same general point applies. Test-retest reliability is the degree to which people’s scores on some variable are consistent across multiple testing occasions. The intuition is that, if the rank-ordering of scores varies substantially across occasions (e.g., if the people who show the highest activation of visual cortex at Time 1 aren’t the same ones who show the highest activation at Time 2), the measurement must not have been reliable, so you can’t trust any effects that are larger than the estimated test-retest reliability coefficient. The problem with this intuition is that there can be any number of systematic yet session-specific influences on a person’s score on some variable (e.g., activation level). For example, let’s say you’re doing a study looking at the relation between performance on a difficult working memory task and frontoparietal activation during the same task. Suppose you do the exact same experiment with the same subjects on two separate occasions three weeks apart, and it turns out that the correlation between DLPFC activation across the two occasions is only .3. A simplistic view would be that this means that the reliability of DLPFC activation is only .3, so you couldn’t possibly detect any correlations between performance level and activation greater than .3 in DLPFC. But that’s simply not true. It could, for example, be that the DLPFC response during WM performance is perfectly reliable, but is heavily dependent on session-specific factors such as baseline fatigue levels, motivation, and so on. In other words, there might be a very strong and perfectly “real” correlation between WM performance and DLPFC activation on each of the two testing occasions, even though there’s very little consistency across the two occasions. 
Test-retest reliability estimates only tell you how much of the signal is reliably due to temporally stable variables, and not how much of the signal is reliable, period.
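The working memory scenario can be sketched the same way. This is a toy simulation under assumptions I'm making up for the purpose: a shared "drive" that is 30% stable trait and 70% session-specific state (fatigue, motivation, and so on), with both activation and performance measured from that drive with very little noise. None of the numbers come from real data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000                     # hypothetical number of subjects
trait = rng.normal(size=n)   # temporally stable component

def run_session():
    """Simulate one session: activation and performance share a drive
    that is mostly session-specific state (assumed 70% of variance)."""
    state = rng.normal(size=n)  # fatigue, motivation, etc.
    drive = np.sqrt(0.3) * trait + np.sqrt(0.7) * state
    activation = drive + rng.normal(scale=0.1, size=n)
    performance = drive + rng.normal(scale=0.1, size=n)
    return activation, performance

act1, perf1 = run_session()
act2, perf2 = run_session()

retest_r = np.corrcoef(act1, act2)[0, 1]    # activation, session 1 vs 2
within_r1 = np.corrcoef(act1, perf1)[0, 1]  # brain-behavior, session 1
within_r2 = np.corrcoef(act2, perf2)[0, 1]  # brain-behavior, session 2

print(f"Test-retest r of activation:       {retest_r:.2f}")
print(f"Activation-performance r, time 1:  {within_r1:.2f}")
print(f"Activation-performance r, time 2:  {within_r2:.2f}")
```

With these settings the test-retest correlation lands around .3, yet the activation-performance correlation within each session is far higher than that, so the retest coefficient clearly isn't acting as a ceiling.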
The general point is that you can’t just report any estimate of reliability that you like (or that’s easy to calculate) and assume that tells you anything meaningful about the likelihood of your analyses succeeding. You have to think hard about exactly what kind of reliability you care about, and then come up with an estimate to match that. There’s a reasonable argument to be made that most of the estimates of fMRI reliability reported to date are actually not all that relevant to many people’s analyses, because the majority of reliability analyses have focused on test-retest reliability, which is only an appropriate way to estimate reliability if you’re trying to relate fMRI activation to stable trait measures (e.g., personality or cognitive ability). If you’re interested in relating in-scanner task performance or state-dependent variables (e.g., mood) to brain activation (arguably the more common approach), or if you’re conducting within-subject analyses that focus on comparisons between conditions, using test-retest reliability isn’t particularly informative, and you really need to focus on other types of reliability (or reproducibility).
Third, and related to the above point, between-subject and within-subject reliability are often in statistical tension with one another. B&M don’t talk about this, as far as I can tell, but it’s an important point to remember when designing studies and/or conducting analyses. Essentially, the issue is that what counts as error depends on what effects you’re interested in. If you’re interested in individual differences, it’s within-subject variance that counts as error, so you want to minimize that. Conversely, if you’re interested in within-subject effects (the norm in fMRI), you want to minimize between-subject variance. But you generally can’t do both of these at the same time. If you use a very “strong” experimental manipulation (i.e., a task that produces a very large difference between conditions for virtually all subjects), you’re going to reduce the variability between individuals, and you may very well end up with very low test-retest reliability estimates. And that would actually be a good thing! Conversely, if you use a “weak” experimental manipulation, you might get no mean effect at all, because there’ll be much more variability between individuals. There’s no right or wrong here; the trick is to pick a design that matches the focus of your study. In the context of reliability, the essential point is that if all you’re interested in is the contrast between high and low working memory load, it shouldn’t necessarily bother you if someone tells you that the test-retest reliability of induced activation in your study is close to zero. Conversely, if you care about individual differences, it shouldn’t worry you if activations aren’t reproducible across studies at the group level. In some ways, those are actually the ideal situations for each of those two types of studies.
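Here is a rough sketch of that tension, again with invented parameters. The "strong" design is assumed to produce a large mean condition difference with almost no true variation between people; the "weak" design is assumed to produce a tiny mean difference with lots of true variation. Measurement noise is the same in both. The function name and all the numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500  # hypothetical number of subjects

def simulate_design(mean_effect, between_sd, noise_sd=1.0):
    """Return (one-sample t for the group effect, test-retest r)
    for a condition difference measured in two sessions."""
    true_effect = rng.normal(mean_effect, between_sd, size=n)
    session1 = true_effect + rng.normal(scale=noise_sd, size=n)
    session2 = true_effect + rng.normal(scale=noise_sd, size=n)
    t_stat = session1.mean() / (session1.std(ddof=1) / np.sqrt(n))
    retest_r = np.corrcoef(session1, session2)[0, 1]
    return t_stat, retest_r

# "Strong" manipulation: big effect in nearly everyone, little spread.
t_strong, r_strong = simulate_design(mean_effect=3.0, between_sd=0.2)
# "Weak" manipulation: tiny average effect, large individual differences.
t_weak, r_weak = simulate_design(mean_effect=0.1, between_sd=2.0)

print(f"strong design: group t = {t_strong:6.1f}, retest r = {r_strong:.2f}")
print(f"weak design:   group t = {t_weak:6.1f}, retest r = {r_weak:.2f}")
```

In this toy setup the strong design gives an enormous group-level t with a test-retest correlation near zero, and the weak design gives the reverse, which is exactly why neither number is alarming on its own if it isn't the one your study depends on.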
Lastly, B&M raise a question as to what level of reliability we should consider “acceptable” for fMRI research:
There is no consensus value regarding what constitutes an acceptable level of reliability in fMRI. Is an ICC value of 0.50 enough? Should studies be required to achieve an ICC of 0.70? All of the studies in the review simply reported what the reliability values were. Few studies proposed any kind of criteria to be considered a ‘reliable’ result. Cicchetti and Sparrow did propose some qualitative descriptions of data based on the ICC-derived reliability of results (1981). They proposed that results with an ICC above 0.75 be considered ‘excellent’, results between 0.59 and 0.75 be considered ‘good’, results between .40 and .58 be considered ‘fair’, and results lower than 0.40 be considered ‘poor’. More specifically to neuroimaging, Eaton et al. (2008) used a threshold of ICC > 0.4 as the mask value for their study while Aron et al. (2006) used an ICC cutoff of ICC > 0.5 as the mask value.
On this point, I don’t really see any reason to depart from psychometric convention just because we’re using fMRI rather than some other technique. Conventionally, reliability estimates of around .8 (or maybe .7, if you’re feeling generous) are considered adequate. Any lower and you start to run into problems, because effect sizes will shrivel up. So I think we should be striving to attain the same levels of reliability with fMRI as with any other measure. If it turns out that that’s not possible, we’ll have to live with that, but I don’t think the solution is to conclude that reliability estimates on the order of .5 are ok “for fMRI” (I’m not saying that’s what B&M say, just that that’s what we should be careful not to conclude). Rather, we should just accept that the odds of detecting certain kinds of effects with fMRI are probably going to be lower than with other techniques. And maybe we should minimize the use of fMRI for those types of analyses where reliability is generally not so good (e.g., using brain activation to predict trait variables over long intervals).
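To put rough numbers on how effect sizes “shrivel up”, here is a small sketch using the standard attenuation formula from classical test theory, where the expected observed correlation is the true correlation times the square root of the product of the two reliabilities. The formula is brought in here as an assumption rather than taken from B&M, the true correlation of .5 is made up, and the reliabilities are treated as true reliabilities of the relevant kind, which (per the earlier points) a given estimate may or may not be.

```python
import math

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation under classical test theory,
    given the true correlation and the reliability of each variable."""
    return true_r * math.sqrt(rel_x * rel_y)

true_r = 0.5  # hypothetical true brain-behavior correlation

for rel in (0.8, 0.7, 0.5):
    # Assume, for simplicity, both variables have the same reliability.
    observed = attenuated_r(true_r, rel, rel)
    print(f"reliability {rel:.1f} -> expected observed r = {observed:.2f}")
```

With both reliabilities at .8 a true correlation of .5 is expected to show up as about .4; at .5 it drops to about .25, which is a much harder effect to detect with typical fMRI sample sizes.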
I hasten to point out that none of this should be taken as a criticism of B&M’s paper; I think all of these points complement B&M’s discussion, and don’t detract in any way from its overall importance. Reliability is a big topic, and there’s no way Bennett and Miller could say everything there is to be said about it in one paper. I think they’ve done the field of cognitive neuroscience an important service by raising awareness and providing an accessible overview of some of the issues surrounding reliability, and it’s certainly a paper that’s going on my “essential readings in fMRI methods” list.