Posts Tagged ‘reliability’

The reviewer’s dilemma, or why you shouldn’t get too meta when you’re supposed to be writing a review that’s already overdue

Tuesday, October 25th, 2011

When I review papers for journals, I often find myself facing something of a tension between two competing motives. On the one hand, I’d like to evaluate each manuscript as an independent contribution to the scientific literature–i.e., without having to worry about how the manuscript stacks up against other potential manuscripts I could be reading. The rationale being that the plausibility of the findings reported in a manuscript shouldn’t really depend on what else is being published in the same journal, or in the field as a whole: if there are methodological problems that threaten the conclusions, they shouldn’t become magically more or less problematic just because some other manuscript has (or doesn’t have) gaping holes. Reviewing should simply be a matter of documenting one’s major concerns and suggestions and sending them back to the Editor for infallible judgment.

The trouble with this idea is that if you’re of a fairly critical bent, you probably don’t believe the majority of the findings reported in the manuscripts sent to you to review. Empirically, this actually appears to be the right attitude to hold, because as a good deal of careful work by biostatisticians like John Ioannidis shows, most published research findings are false, and most true associations are inflated. So, in some ideal world, where the job of a reviewer is simply to assess the likelihood that the findings reported in a paper provide an accurate representation of reality, and/or to identify ways of bringing those findings closer in line with reality, skepticism is the appropriate default attitude. Meaning, if you keep the question “why don’t I believe these results?” firmly in mind as you read through a paper and write your review, you probably aren’t going to go wrong all that often.

The problem is that, for better or worse, one’s job as a reviewer isn’t really–or at least, solely–to evaluate the plausibility of other people’s findings. In large part, it’s to evaluate the plausibility of reported findings in relation to the other stuff that routinely gets published in the same journal. For instance, if you regularly reviewing papers for a very low-tier journal, the editor is probably not going to be very thrilled to hear you say “well, Ms. Editor, none of the last 15 papers you’ve sent me are very good, so you should probably just shut down the journal.” So a tension arises between writing a comprehensive review that accurately captures what the reviewer really thinks about the results–which is often (at least in my case) something along the lines of “pffft, there’s no fucking way this is true”–and writing a review that weighs the merits of the reviewed manuscript relative to the other candidates for publication in the same journal.

To illustrate, suppose I review a paper and decide that, in my estimation, there’s only a 20% chance the key results reported in the paper would successfully replicate (for the sake of argument, we’ll pretend I’m capable of this level of precision). Should I recommend outright rejection? Maybe, since 1 in 5 odds of long-term replication don’t seem very good. But then again, what if 20% is actually better than average? What if I think the average article I’m sent to review only has a 10% chance of holding up over time? In that case, if I recommend rejection of the 20% article, and the editor follows my recommendation, most of the time I’ll actually be contributing to the journal publishing poorer quality articles than if I’d recommended accepting the manuscript, even if I’m pretty sure the findings reported in the manuscript are false.

Lest this sound like I’m needlessly overanalyzing the review process instead of buckling down and writing my own overdue reviews (okay, you’re right, now stop being a jerk), consider what happens when you scale the problem up. When journal editors send reviewers manuscripts to look over, the question they really want an answer to is, “how good is this paper compared to everything else that crosses my desk?” But most reviewers naturally incline to answer a somewhat different–and easier–question, namely, “in the grand scheme of life, the universe, and everything, how good is this paper?” The problem, then, is that if the variance in curmudgeonliness between reviewers exceeds the (reliable) variance within reviewers, then arguably the biggest factor in determining whether or not a given paper gets rejected is simply who happens to review it. Not how much expertise the reviewer has, or even how ‘good’ they are (in the sense that some reviewers are presumably better than others at identifying serious problems and overlooking trivial ones), but simply how critical they are on average. Which is to say, if I’m Reviewer 2 on your manuscript, you’ll probably have a better chance of rejection than if Reviewer 2 is someone who characteristically writes one-paragraph reviews that begin with the words “this is an outstanding and important piece of work…”

Anyway, on some level this is a pretty trivial observation; after all, we all know that the outcome of the peer review process is, to a large extent, tantamount to a roll of the dice. We know that there are cranky reviewers and friendly reviewers, and we often even have a sense of who they are, which is why we often suggest people to include or exclude as reviewers in our cover letters. The practical question though–and the reason for bringing this up here–is this: given that we have this obvious and ubiquitous problem of reviewers having different standards for what’s publishable, and that this undeniably impacts the outcome of peer review, are there any simple steps we could take to improve the reliability of the review process?

The way I’ve personally made peace between my desire to provide the most comprehensive and accurate review I can and the pragmatic need to evaluate each manuscript in relation to other manuscripts is to use the “comments to the Editor” box to provide some additional comments about my review. Usually what I end up doing is writing my review with little or no thought for practical considerations such as “how prestigious is this journal” or “am I a particularly harsh reviewer” or “is this a better or worse paper than most others in this journal”. Instead, I just write my review, and then when I’m done, I use the comments to the editor to say things like “I’m usually a pretty critical reviewer, so don’t take the length of my review as an indication I don’t like the manuscript, because I do,” or, “this may seem like a negative review, but it’s actually more positive than most of my reviews, because I’m a huge jerk.” That way I can appease my conscience by writing the review I want to while still giving the editor some indication as to where I fit in the distribution of reviewers they’re likely to encounter.

I don’t know if this approach makes any difference at all, and maybe editors just routinely ignore this kind of thing; it’s just the best solution I’ve come up with that I can implement all by myself, without asking anyone else to change their behavior. But if we allow ourselves to contemplate alternative approaches that include changes to the review process itself (while still adhering to the standard pre-publication review model, which, like many other people, I’ve argued is fundamentally dysfunctional), then there are many other possibilities.

One idea, for instance, would be to include calibration questions that could be used to estimate (and correct for) individual differences in curmudgeonliness. For instance, in addition to questions about the merit of the manuscript itself, the review form could have a question like “what proportion of articles you review do you estimate end up being rejected?” or “do you consider yourself a more critical or less critical reviewer than most of your peers?”

Another, logistically more difficult, idea would be to develop a centralized database of review outcomes, so that editors could see what proportion of each reviewer’s assignments ultimately end up being rejected (though they couldn’t see the actual content of the reviews). I don’t know if this type of approach would improve matters at all; it’s quite possible that the review process is fundamentally so inefficient and slow that editors just don’t have the time to spend worrying about this kind of thing. But it’s hard to believe that there aren’t some simple calibration steps we could take to bring reviewers into closer alignment with one another–even if we’re confined to working within the standard pre-publication model of peer review. And given the abysmally low reliability of peer review, even small improvements could potentially produce large benefits in the aggregate.

functional MRI and the many varieties of reliability

Friday, March 5th, 2010

Craig Bennett and Mike Miller have a new paper on the reliability of fMRI. It’s a nice review that I think most people who work with fMRI will want to read. Bennett and Miller discuss a number of issues related to reliability, including why we should care about the reliability of fMRI, what factors influence reliability, how to obtain estimates of fMRI reliability, and what previous studies suggest about the reliability of fMRI. Their bottom line is that the reliability of fMRI often leaves something to be desired:

One thing is abundantly clear: fMRI is an effective research tool that has opened broad new horizons of investigation to scientists around the world. However, the results from fMRI research may be somewhat less reliable than many researchers implicitly believe. While it may be frustrating to know that fMRI results are not perfectly replicable, it is beneficial to take a longer-term view regarding the scientific impact of these studies. In neuroimaging, as in other scientific fields, errors will be made and some results will not replicate.

I think this is a wholly appropriate conclusion, and strongly recommend reading the entire article. Because there’s already a nice write-up of the paper over at Mind Hacks, I’ll content myself to adding a number of points to B&M’s discussion (I talk about some of these same issues in a chapter I wrote with Todd Braver).

First, even though I agree enthusiastically with the gist of B&M’s conclusion, it’s worth noting that, strictly speaking, there’s actually no such thing as “the reliability of fMRI”. Reliability isn’t a property of a technique or instrument, it’s a property of a specific measurement. Because every measurement is made under slightly different conditions, reliability will inevitably vary on a case-by-case basis. But since it’s not really practical (or even possible) to estimate reliability for every single analysis, researchers take necessary short-cuts. The standard in the psychometric literature is to establish reliability on a per-measure (not per-method!) basis, so long as conditions don’t vary too dramatically across samples. For example, once someone “validates” a given self-report measure, it’s generally taken for granted that that measure is “reliable”, and most people feel comfortable administering it to new samples without having to go to the trouble of estimating reliability themselves. That’s a perfectly reasonable approach, but the critical point is that it’s done on a relatively specific basis. Supposing you made up a new self-report measure of depression from a set of items you cobbled together yourself, you wouldn’t be entitled to conclude that your measure was reliable simply because some other self-report measure of depression had already been psychometrically validated. You’d be using an entirely new set of items, so you’d have to go to the trouble of validating your instrument anew.

By the same token, the reliability of any given fMRI measurement is going to fluctuate wildly depending on the task used, the timing of events, and many other factors. That’s not just because some estimates of reliability are better than others; it’s because there just isn’t a fact of the matter about what the “true” reliability of fMRI is. Rather, there are facts about how reliable fMRI is for specific types of tasks with specific acquisition parameters and preprocessing streams in specific scanners, and so on (which can then be summarized by talking about the general distribution of fMRI reliabilities). B&M are well aware of this point, and discuss it in some detail, but I think it’s worth emphasizing that when they say that “the results from fMRI research may be somewhat less reliable than many researchers implicitly believe,” what they mean isn’t that the “true” reliability of fMRI is likely to be around .5; rather, it’s that if you look at reliability estimates across a bunch of different studies and analyses, the estimated reliability is often low. But it’s not really possible to generalize from this overall estimate to any particular study; ultimately, if you want to know whether your data were measured reliably, you need to quantify that yourself. So the take-away message shouldn’t be that fMRI is an inherently unreliable method (and I really hope that isn’t how B&M’s findings get reported by the mainstream media should they get picked up), but rather, that there’s a very good chance that the reliability of fMRI in any given situation is not particularly high. It’s a subtle difference, but an important one.

Second, there’s a common misconception that reliability estimates impose an upper bound on the true detectable effect size. B&M make this point in their review, Vul et al made it in their “voodoo correlations”" paper, and in fact, I’ve made it myself before. But it’s actually not quite correct. It’s true that, for any given test, the true reliability of the variables involved limits the potential size of the true effect. But there are many different types of reliability, and most will generally only be appropriate and informative for a subset of statistical procedures. Virtually all types of reliability estimate will underestimate the true reliability in some cases and overestimate it in others. And in extreme cases, there may be close to zero relationship between the estimate and the truth.

To see this, take the following example, which focuses on internal consistency. Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable. This problem of underestimating reliability when items are heterogeneous has been discussed in the psychometric literature for at least 70 years, and yet you still very commonly see people do questionable things like “correcting for attenuation” based on dubious internal consistency estimates.

In their review, B&M mostly focus on test-retest reliability rather than internal consistency, but the same general point applies. Test-retest reliability is the degree to which people’s scores on some variable are consistent across multiple testing occasions. The intuition is that, if the rank-ordering of scores varies substantially across occasions (e.g., if the people who show the highest activation of visual cortex at Time 1 aren’t the same ones who show the highest activation at Time 2), the measurement must not have been reliable, so you can’t trust any effects that are larger than the estimated test-retest reliability coefficient. The problem with this intuition is that there can be any number of systematic yet session-specific influences on a person’s score on some variable (e.g., activation level). For example, let’s say you’re doing a study looking at the relation between performance on a difficult working memory task and frontoparietal activation during the same task. Suppose you do the exact same experiment with the same subjects on two separate occasions three weeks apart, and it turns out that the correlation between DLPFC activation across the two occasions is only .3. A simplistic view would be that this means that the reliability of DLPFC activation is only .3, so you couldn’t possibly detect any correlations between performance level and activation greater than .3 in DLPFC. But that’s simply not true. It could, for example, be that the DLPFC response during WM performance is perfectly reliable, but is heavily dependent on session-specific factors such as baseline fatigue levels, motivation, and so on. In other words, there might be a very strong and perfectly “real” correlation between WM performance and DLPFC activation on each of the two testing occasions, even though there’s very little consistency across the two occasions. Test-retest reliability estimates only tell you how much of the signal is reliably due to temporally stable variables, and not how much of the signal is reliable, period.

The general point is that you can’t just report any estimate of reliability that you like (or that’s easy to calculate) and assume that tells you anything meaningful about the likelihood of your analyses succeeding. You have to think hard about exactly what kind of reliability you care about, and then come up with an estimate to match that. There’s a reasonable argument to be made that most of the estimates of fMRI reliability reported to date are actually not all that relevant to many people’s analyses, because the majority of reliability analyses have focused on test-retest reliability, which is only an appropriate way to estimate reliability if you’re trying to relate fMRI activation to stable trait measures (e.g., personality or cognitive ability). If you’re interested in relating in-scanner task performance or state-dependent variables (e.g., mood) to brain activation (arguably the more common approach), or if you’re conducting within-subject analyses that focus on comparisons between conditions, using test-retest reliability isn’t particularly informative, and you really need to focus on other types of reliability (or reproducibility).

Third, and related to the above point, between-subject and within-subject reliability are often in statistical tension with one another. B&M don’t talk about this, as far as I can tell, but it’s an important point to remember when designing studies and/or conducting analyses. Essentially, the issue is that what counts as error depends on what effects you’re interested in. If you’re interested in individual differences, it’s within-subject variance that counts as error, so you want to minimize that. Conversely, if you’re interested in within-subject effects (the norm in fMRI), you want to minimize between-subject variance. But you generally can’t do both of these at the same time. If you use a very “strong” experimental manipulation (i.e., a task that produces a very large difference between conditions for virtually all subjects), you’re going to reduce the variability between individuals, and you may very well end up with very low test-retest reliability estimates. And that would actually be a good thing! Conversely, if you use a “weak” experimental manipulation, you might get no mean effect at all, because there’ll be much more variability between individuals. There’s no right or wrong here; the trick is to pick a design that matches the focus of your study. In the context of reliability, the essential point is that if all you’re interested in is the contrast between high and low working memory load, it shouldn’t necessarily bother you if someone tells you that the test-retest reliability of induced activation in your study is close to zero. Conversely, if you care about individual differences, it shouldn’t worry you if activations aren’t reproducible across studies at the group level. In some ways, those are actual the ideal situations for each of those two types of studies.

Lastly, B&M raise a question as to what level of reliability we should consider “acceptable” for fMRI research:

There is no consensus value regarding what constitutes an acceptable level of reliability in fMRI. Is an ICC value of 0.50 enough? Should studies be required to achieve an ICC of 0.70? All of the studies in the review simply reported what the reliability values were. Few studies proposed any kind of criteria to be considered a ‘reliable’ result. Cicchetti and Sparrow did propose some qualitative descriptions of data based on the ICC-derived reliability of results (1981). They proposed that results with an ICC above 0.75 be considered ‘excellent’, results between 0.59 and 0.75 be considered ‘good’, results between .40 and .58 be considered ‘fair’, and results lower than 0.40 be considered ‘poor’. More specifically to neuroimaging, Eaton et al. (2008) used a threshold of ICC > 0.4 as the mask value for their study while Aron et al. (2006) used an ICC cutoff of ICC > 0.5 as the mask value.

On this point, I don’t really see any reason to depart from psychometric convention just because we’re using fMRI rather than some other technique. Conventionally, reliability estimates of around .8 (or maybe .7, if you’re feeling generous) are considered adequate. Any lower and you start to run into problems, because effect sizes will shrivel up. So I think we should be striving to attain the same levels of reliability with fMRI as with any other measure. If it turns out that that’s not possible, we’ll have to live with that, but I don’t think the solution is to conclude that reliability estimates on the order of .5 are ok “for fMRI” (I’m not saying that’s what B&M say, just that that’s what we should be careful not to conclude). Rather, we should just accept that the odds of detecting certain kinds of effects with fMRI are probably going to be lower than with other techniques. And maybe we should minimize the use of fMRI for those types of analyses where reliability is generally not so good (e.g., using brain activation to predict trait variables over long intervals).

I hasten to point out that none of this should be taken as a criticism of B&M’s paper; I think all of these points complement B&M’s discussion, and don’t detract in any way from its overall importance. Reliability is a big topic, and there’s no way Bennett and Miller could say everything there is to be said about it in one paper. I think they’ve done the field of cognitive neuroscience an important service by raising awareness and providing an accessible overview of some of the issues surrounding reliability, and it’s certainly a paper that’s going on my “essential readings in fMRI methods” list.

ResearchBlogging.org
Bennett, C. M., & Miller, M. B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences