time-on-task effects in fMRI research: why you should care
There’s a ubiquitous problem in experimental psychology studies that use behavioral measures requiring participants to make speeded responses. The problem is that, in general, the longer people take to do something, the more likely they are to do it correctly. If I have you do a visual search task and ask you to tell me whether or not a display full of letters contains a red ‘X’, I’m not going to be very impressed that you can give me the right answer if I let you stare at the screen for five minutes before responding. In most experimental situations, the only way we can learn something meaningful about people’s capacity to perform a task is by imposing some restriction on how long they can take to respond. The problem that then arises is that any changes we observe in the variable we care about (say, the proportion of times you successfully detect the red ‘X’) are going to be confounded with the time people took to respond. Raise the response deadline and performance goes up; shorten it and performance goes down.
This fundamental fact about human performance is commonly referred to as the speed-accuracy tradeoff. The speed-accuracy tradeoff isn’t a law in any sense; it allows for violations, and there certainly are situations in which responding quickly can actually promote accuracy. But as a general rule, when researchers run psychology experiments involving response deadlines, they usually work hard to rule out the speed-accuracy tradeoff as an explanation for any observed results. For instance, if I have a group of adolescents with ADHD do a task requiring inhibitory control, and compare their performance to a group of adolescents without ADHD, I may very well find that the ADHD group performs more poorly, as reflected by lower accuracy rates. But the interpretation of that result depends heavily on whether or not there are also any differences in reaction times (RT). If the ADHD group took about as long on average to respond as the non-ADHD group, it might be reasonable to conclude that the ADHD group suffers a deficit in inhibitory control: they take as long as the control group to do the task, but they still do worse. On the other hand, if the ADHD group responded much faster than the control group on average, the interpretation would become more complicated. For instance, one possibility would be that the accuracy difference reflects differences in motivation rather than capacity per se. That is, maybe the ADHD group just doesn’t care as much about being accurate as about responding quickly. Maybe if you motivated the ADHD group appropriately (e.g., by giving them a task that was intrinsically interesting), you’d find that performance was actually equivalent across groups. Without explicitly considering the role of reaction time–and ideally, controlling for it statistically–the types of inferences you can draw about underlying cognitive processes are somewhat limited.
An important point to note about the speed-accuracy tradeoff is that it isn’t just a tradeoff between speed and accuracy; in principle, any variable that bears some systematic relation to how long people take to respond is going to be confounded with reaction time. In the world of behavioral studies, there aren’t that many other variables we need to worry about. But when we move to the realm of brain imaging, the game changes considerably. Nearly all fMRI studies measure something known as the blood-oxygen-level-dependent (BOLD) signal. I’m not going to bother explaining exactly what the BOLD signal is (there are plenty of other excellent explanations at varying levels of technical detail, e.g., here, here, or here); for present purposes, we can just pretend that the BOLD signal is basically a proxy for the amount of neural activity going on in different parts of the brain (that’s actually a pretty reasonable assumption, as emerging studies continue to demonstrate). In other words, a simplistic but not terribly inaccurate model is that when neurons in region X increase their firing rate, blood flow in region X also increases, and so in turn does the BOLD signal that fMRI scanners detect.
A critical question that naturally arises is just how strong the temporal relation is between the BOLD signal and underlying neuronal processes. From a modeling perspective, what we’d really like is a system that’s completely linear and time-invariant–meaning that if you double the duration of a stimulus presented to the brain, the BOLD response elicited by that stimulus also doubles, and it doesn’t matter when the stimulus is presented (i.e., there aren’t any funny interactions between different phases of the response, or with the responses to other stimuli). As it turns out, the BOLD response isn’t perfectly linear, but it’s pretty close. In a seminal series of studies in the mid-90s, Randy Buckner, Anders Dale and others showed that, at least for stimuli that aren’t presented extremely rapidly (i.e., a minimum of 1 – 2 seconds apart), we can reasonably pretend that the BOLD response sums linearly over time without suffering any serious ill effects. And that’s extremely fortunate, because it makes modeling brain activation with fMRI much easier to do. In fact, the vast majority of fMRI studies, which employ what are known as rapid event-related designs, implicitly assume linearity. If the hemodynamic response wasn’t approximately linear, we would have to throw out a very large chunk of the existing literature–or at least seriously question its conclusions.
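If you like seeing assumptions spelled out in code, here’s a minimal sketch of what linearity buys us, using numpy/scipy and a simplified double-gamma HRF (the shape parameters and time grid below are just illustrative stand-ins, not anyone’s canonical values):

```python
import numpy as np
from scipy.stats import gamma

dt = 0.1                   # time resolution (s)
t = np.arange(0, 30, dt)   # 30 s window

def hrf(time):
    # Simplified double-gamma hemodynamic response function (illustrative only)
    return gamma.pdf(time, 6) - gamma.pdf(time, 16) / 6

def bold(duration, amplitude=1.0):
    # The LTI assumption: BOLD = neural boxcar convolved with a fixed HRF
    neural = np.where(t < duration, amplitude, 0.0)
    return np.convolve(neural, hrf(t))[: len(t)] * dt

# Under this model, a 2 s neural event produces close to (though not exactly)
# twice the peak BOLD response of a 1 s event with the same firing rate.
print(bold(2.0).max() / bold(1.0).max())
```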
Aside from the fact that it lets us model things nicely, the assumption of linearity has another critical, but underappreciated, ramification for the way we do fMRI research. Which is this: if the BOLD response sums approximately linearly over time, it follows that two neural responses that have the same amplitude but differ in duration will produce BOLD responses with different amplitudes. To characterize that visually, here’s a figure from a paper I published with Deanna Barch, Jeremy Gray, Tom Conturo, and Todd Braver last year:
Each of these panels shows you the firing rates and durations of two hypothetical populations of neurons (on the left), along with the (observable) BOLD response that would result (on the right). Focus your attention on panel C first. What this panel shows you is what, I would argue, most people intuitively think of when they come across a difference in activation between two conditions. When you see time courses that clearly differ in their amplitude, it’s very natural to attribute a similar difference to the underlying neuronal mechanisms, and suppose that there must just be more firing going on in one condition than the other–where ‘more’ is taken to mean something like “firing at a higher rate”.
The problem, though, is that this inference isn’t justified. If you look at panel B, you can see that you get exactly the same pattern of observed differences in the BOLD response even when the amplitude of neuronal activation is identical, simply because there’s a difference in duration. In other words, if someone shows you a plot of two BOLD time courses for different experimental conditions, and one has a higher amplitude than the other, you don’t know whether that’s because there’s more neuronal activation in one condition than the other, or if processing is identical in both conditions but simply lasts longer in one than in the other. (As a technical aside, this equivalence only holds for short trials, when the BOLD response doesn’t have time to saturate. If you’re using longer trials–say, 4 seconds or more–then it becomes fairly easy to tell apart changes in duration from changes in amplitude. But the vast majority of fMRI studies use much shorter trials, in which case the problem I describe holds.)
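To see the problem in the same toy model as above: scale up the duration of the neural response by 50%, or scale up its amplitude by 50%, and for short trials you get BOLD time courses that are almost perfectly correlated (again, simplified HRF and made-up numbers; this just illustrates the logic behind panels B and C):

```python
import numpy as np
from scipy.stats import gamma

dt = 0.1
t = np.arange(0, 30, dt)
hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6   # simplified HRF, as before

def bold(duration, amplitude):
    neural = np.where(t < duration, amplitude, 0.0)
    return np.convolve(neural, hrf)[: len(t)] * dt

longer   = bold(1.5, 1.0)   # same firing rate, 50% longer duration
stronger = bold(1.0, 1.5)   # same duration, 50% higher firing rate

# The two time courses are nearly identical in shape, so with realistic
# noise levels they're effectively indistinguishable.
print(np.corrcoef(longer, stronger)[0, 1])
```

Make the trials several seconds long instead and the longer-duration response visibly widens and starts to plateau, which is why the ambiguity mostly goes away for slow designs.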
Now, functionally, this has some potentially very serious implications for the inferences we can draw about psychological processes based on observed differences in the BOLD response. What we would usually like to conclude when we report “more” activation for condition X than condition Y is that there’s some fundamental difference in the nature of the processes involved in the two conditions that’s reflected at the neuronal level. If it turns out that the reason we see more activation in one condition than the other is simply that people took longer to respond in one condition than in the other, and so were sustaining attention for longer, that can potentially undermine that conclusion.
For instance, if you’re contrasting a feature search condition with a conjunction search condition, you’re quite likely to observe greater activation in regions known to support visual attention. But since a central feature of conjunction search is that it takes longer than a feature search, it could theoretically be that the same general regions support both types of search, and what we’re seeing is purely a time-on-task effect: visual attention regions are activated for longer because it takes longer to complete the conjunction search, but these regions aren’t doing anything fundamentally different in the two conditions (at least at the level we can see with fMRI). So this raises an issue similar to the speed-accuracy tradeoff we started with. Other things being equal, the longer it takes you to respond, the more activation you’ll tend to see in a given region. Unless you explicitly control for differences in reaction time, your ability to draw conclusions about underlying neuronal processes on the basis of observed BOLD differences may be severely hampered.
It turns out that very few fMRI studies actually control for differences in RT. In an elegant 2008 study discussing different ways of modeling time-varying signals, Jack Grinband and colleagues reviewed a random sample of 170 studies and found that, “Although response times were recorded in 82% of event-related studies with a decision component, only 9% actually used this information to construct a regression model for detecting brain activity”. Here’s what that looks like (Panel C), along with some other interesting information about the procedures used in fMRI studies:
So only one in ten studies made any effort to control for RT differences; and Grinband et al. argue in their paper that most of those papers didn’t model RT the right way anyway (personally I’m not sure I agree; I think there are tradeoffs associated with every approach to modeling RT–but that’s a topic for another post).
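Just for concreteness, here’s roughly what the two most common ways of bringing trial-by-trial RT into a first-level GLM look like: a parametric modulator whose amplitude is the mean-centered RT, and a “variable epoch” regressor whose boxcar duration equals the RT on each trial. (Toy onsets, RTs, and HRF; a real analysis would do this inside SPM, FSL, or nilearn rather than in raw numpy.)

```python
import numpy as np
from scipy.stats import gamma

dt = 0.1
t = np.arange(0, 200, dt)
hrf = gamma.pdf(np.arange(0, 30, dt), 6) - gamma.pdf(np.arange(0, 30, dt), 16) / 6

onsets = np.array([10, 40, 70, 100, 130, 160])     # trial onsets (s), toy values
rts    = np.array([0.6, 1.2, 0.8, 1.5, 0.7, 1.0])  # reaction times (s), toy values

def convolve(neural):
    return np.convolve(neural, hrf)[: len(t)] * dt

# (a) Constant impulses at trial onset, plus an RT parametric modulator
events = np.zeros_like(t)
modulator = np.zeros_like(t)
for onset, rt in zip(onsets, rts):
    i = int(onset / dt)
    events[i] = 1.0
    modulator[i] = rt - rts.mean()   # mean-centered, so the main effect stays interpretable

X_parametric = np.column_stack([convolve(events), convolve(modulator), np.ones_like(t)])

# (b) Variable epochs: each trial is a boxcar that lasts as long as the RT
epochs = np.zeros_like(t)
for onset, rt in zip(onsets, rts):
    epochs[int(onset / dt): int((onset + rt) / dt)] = 1.0

X_epoch = np.column_stack([convolve(epochs), np.ones_like(t)])

# Either design matrix would then be fit to each voxel's time series with
# ordinary least squares, e.g.:
#   betas = np.linalg.lstsq(X_parametric, y, rcond=None)[0]
```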
The relative lack of attention to RT differences is particularly striking when you consider what cognitive neuroscientists do care a lot about: differences in response accuracy. The majority of researchers nowadays make a habit of discarding all trials on which participants made errors. The justification we give for this approach–which is an entirely reasonable one–is that if we analyzed correct and incorrect trials together, we’d be confounding the processes we care about (e.g., differences between conditions) with activation that simply reflects error-related processes. So we drop trials with errors, and that gives us cleaner results.
I suspect that the reasons for our concern with accuracy effects but not RT effects in fMRI research are largely historical. In the mid-90s, when a lot of formative cognitive neuroscience was being done, people (most of them then located in Pittsburgh, working in Jonathan Cohen’s group) discovered that the brain doesn’t like to make errors. When people make mistakes during task performance, they tend to recognize that fact; on a neural level, frontoparietal regions implicated in goal-directed processing–and particularly the anterior cingulate cortex–ramp up activation substantially. The interpretation of this basic finding has been a source of much contention among cognitive neuroscientists for the past 15 years, and remains a hot area of investigation. For present purposes though, we don’t really care why error-related activation arises; the point is simply that it does arise, and so we do the obvious thing and try to eliminate it as a source of error from our analyses. I suspect we don’t do the same for RT not because we lack principled reasons to, but because there haven’t historically been clear-cut demonstrations of the effects of RT differences on brain activation.
The goal of the 2009 study I mentioned earlier was precisely to try to quantify those effects. The hypothesis my co-authors and I tested was straightforward: if brain activity scales approximately linearly with RT (as standard assumptions would seem to entail), we should see a strong “time-on-task” effect in brain areas that are associated with the general capacity to engage in goal-directed processing. In other words, on trials when people take longer to respond, activation in frontal and parietal regions implicated in goal-directed processing and cognitive control should increase. These regions are often collectively referred to as the “task-positive” network (Fox et al., 2005), in reference to the fact that they tend to show activation increases any time people are engaging in goal-directed processing, irrespective of the precise demands of the task. We figured that identifying a time-on-task effect in the task-positive network would provide a nice demonstration of the relation between RT differences and the BOLD response, since it would underscore the generality of the problem.
Concretely, what we did was take five datasets that were lying around from previous studies, and do a multi-study analysis focusing specifically on RT-related activation. We deliberately selected studies that employed very different tasks, designs, and even scanners, with the aim of ensuring the generalizability of the results. Then, we identified regions in each study in which activation covaried with RT on a trial-by-trial basis. When we put all of the resulting maps together and picked out only those regions that showed an association with RT in all five studies, here’s the map we got:
There’s a lot of stuff going on here, but in the interest of keeping this post slightly less excruciatingly long, I’ll stick to the frontal areas. What we found, when we looked at the timecourse of activation in those regions, was the predicted time-on-task effect. Here’s a plot of the timecourses from all five studies for selected regions:
If you focus on the left time course plot for the medial frontal cortex (labeled R1, in row B), you can see that increases in RT are associated with increased activation in medial frontal cortex in all five studies (the way RT effects are plotted here is not completely intuitive, so you may want to read the paper for a clearer explanation). It’s worth pointing out that while these regions were all defined based on the presence of an RT effect in all five studies, the precise shape of that RT effect wasn’t constrained; in principle, RT could have exerted very different effects across the five studies (e.g., positive in some, negative in others; early in some, later in others; etc.). So the fact that the timecourses look very similar in all five studies isn’t entailed by the analysis, and it’s an independent indicator that there’s something important going on here.
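Incidentally, the conjunction step I described a couple of paragraphs back is conceptually trivial, something like the sketch below, with boolean placeholder arrays standing in for the thresholded per-study statistical maps (the actual thresholding in the paper was more involved than this):

```python
import numpy as np

rng = np.random.default_rng(1)
n_voxels = 10000

# Placeholder: one boolean "trial-by-trial RT effect present" map per study.
# In the real analysis these come from thresholded per-study statistical maps.
study_maps = [rng.random(n_voxels) < 0.15 for _ in range(5)]

conjunction = np.logical_and.reduce(study_maps)  # effect present in all five studies
print(conjunction.sum(), "voxels show an RT effect in every study")
```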
The clear-cut implication of these findings is that a good deal of BOLD activation in most studies can be explained simply as a time-on-task effect. The longer you spend sustaining goal-directed attention to an on-screen stimulus, the more activation you’ll show in frontal regions. It doesn’t much matter what it is that you’re doing; these are ubiquitous effects (since this study, I’ve analyzed many other datasets in the same way, and never fail to find the same basic relationship). And it’s worth keeping in mind that these are just the regions that show common RT-related activation across multiple studies; what you’re not seeing are regions that covary with RT in only one (or, for that matter, four) of the five studies. I’d argue that most regions that show involvement in a task are probably going to show variations with RT. After all, that’s just what falls out of the assumption of linearity–an assumption we all depend on in order to do our analyses in the first place.
Exactly what proportion of results can be explained away as time-on-task effects? That’s impossible to determine, unfortunately. I suspect that if you could go back through the entire fMRI literature and magically control for trial-by-trial RT differences in every study, a very large number of published differences between experimental conditions would disappear. That doesn’t mean those findings were wrong or unimportant, I hasten to note; there are many cases in which it’s perfectly appropriate to argue that differences between conditions should reflect a difference in quantity rather than quality. Still, it’s clear that in many cases that isn’t the preferred interpretation, and controlling for RT differences probably would have changed the conclusions. As just one example, much of what we think of as a “conflict” effect in the medial frontal cortex/anterior cingulate could simply reflect prolonged attention on high-conflict trials. When you’re experiencing cognitive difficulty or conflict, you tend to slow down and take longer to respond, which is naturally going to produce BOLD increases that scale with reaction time. The question as to what remains of the putative conflict signal after you control for RT differences is one that hasn’t really been adequately addressed yet.
The practical question, of course, is what we should do about this. How can we minimize the impact of the time-on-task effect on our results, and, in turn, on the conclusions we draw? I think the most general suggestion is to always control for reaction time differences. That’s really the only way to rule out the possibility that any observed differences between conditions simply reflect differences in how long it took people to respond. This leaves aside the question of exactly how one should model out the effect of RT, which is a topic for another time (though I discuss it at length in the paper, and the Grinband paper goes into even more detail). Unfortunately, there isn’t any perfect solution; as with most things, there are tradeoffs inherent in pretty much any choice you make. But my personal feeling is that almost any approach one could take to modeling RT explicitly is a big step in the right direction.
A second, and nearly as important, suggestion is to not only control for RT differences, but to do it both ways. Meaning, you should run your model both with and without an RT covariate, and carefully inspect both sets of results. Comparing the results across the two models is what really lets you draw the strongest conclusions about whether activation differences between two conditions reflect a difference of quality or quantity. This point applies regardless of which hypothesis you favor: if you think two conditions draw on very similar neural processes that differ only in degree, your prediction is that controlling for RT should make effects disappear. Conversely, if you think that a difference in activation reflects the recruitment of qualitatively different processes, you’re making the prediction that the difference will remain largely unchanged after controlling for RT. Either way, you gain important information by comparing the two models.
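To make the “both ways” suggestion concrete, here’s a toy simulation of a voxel whose signal is driven purely by time on task (the neural epoch lasts exactly as long as the RT, at the same amplitude in both conditions). Fit it once without and once with a trial-by-trial RT covariate, and the apparent condition difference mostly vanishes. All the numbers here (onsets, RTs, HRF, noise) are invented for illustration:

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
dt = 0.1
t = np.arange(0, 240, dt)
hrf = gamma.pdf(np.arange(0, 30, dt), 6) - gamma.pdf(np.arange(0, 30, dt), 16) / 6

def convolve(x):
    return np.convolve(x, hrf)[: len(t)] * dt

onsets_a = np.arange(10, 230, 40.0)          # condition A trial onsets (s)
onsets_b = np.arange(30, 230, 40.0)          # condition B trial onsets (s)
rt_a = rng.normal(0.7, 0.1, len(onsets_a))   # A is "easy": fast RTs
rt_b = rng.normal(1.3, 0.1, len(onsets_b))   # B is "hard": slow RTs

def epochs(onsets, durations):
    x = np.zeros_like(t)
    for o, d in zip(onsets, durations):
        x[int(o / dt): int((o + d) / dt)] = 1.0
    return x

def events(onsets):
    x = np.zeros_like(t)
    x[(onsets / dt).astype(int)] = 1.0
    return x

# Ground truth: identical neural amplitude in both conditions; only the
# duration of processing (= RT) differs. Plus a little noise.
y = convolve(epochs(onsets_a, rt_a) + epochs(onsets_b, rt_b)) + rng.normal(0, 0.01, len(t))

all_onsets = np.concatenate([onsets_a, onsets_b])
all_rts = np.concatenate([rt_a, rt_b])
rt_reg = np.zeros_like(t)
rt_reg[(all_onsets / dt).astype(int)] = all_rts - all_rts.mean()  # mean-centered RT modulator

A, B = convolve(events(onsets_a)), convolve(events(onsets_b))
X_without = np.column_stack([A, B, np.ones_like(t)])
X_with = np.column_stack([A, B, convolve(rt_reg), np.ones_like(t)])

b_without = np.linalg.lstsq(X_without, y, rcond=None)[0]
b_with = np.linalg.lstsq(X_with, y, rcond=None)[0]

# Without the RT covariate there's a sizable apparent B > A difference;
# with it, most of that difference goes away.
print("B - A, no RT covariate:  ", round(b_without[1] - b_without[0], 3))
print("B - A, with RT covariate:", round(b_with[1] - b_with[0], 3))
```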
The last suggestion I have to offer is probably obvious, and not very helpful, but for what it’s worth: be cautious about how you interpret differences in activation any time there are sizable differences in task difficulty and/or mean response time. It’s tempting to think that if you always analyze only trials with correct responses and follow the suggestions above to explicitly model RT, you’ve done all you need in order to perfectly control for the various tradeoffs and relationships between speed, accuracy, and cognitive effort. It really would be nice if we could all sleep well knowing that our data have unambiguous interpretations. But the truth is that all of these techniques for “controlling” for confounds like difficulty and reaction time are imperfect, and in some cases have known deficiencies (for instance, it’s not really true that throwing out error trials eliminates all error-related activation from analysis–sometimes when people don’t know the answer, they guess right!). That’s not to say we should stop using the tools we have–which offer an incredibly powerful way to peer inside our gourds–just that we should use them carefully.
Yarkoni, T., Barch, D. M., Gray, J. R., Conturo, T. E., & Braver, T. S. (2009). BOLD correlates of trial-by-trial reaction time variability in gray and white matter: a multi-study fMRI analysis. PLoS ONE, 4(1). PMID: 19165335
Grinband, J., Wager, T. D., Lindquist, M., Ferrera, V. P., & Hirsch, J. (2008). Detection of time-varying signals in event-related fMRI designs. NeuroImage, 43(3), 509-520. PMID: 18775784