the Neurosynth viewer goes modular and open source

If you’ve visited the Neurosynth website lately, you may have noticed that it looks… the same way it’s always looked. It hasn’t really changed in the last ~20 months, despite the vague promise on the front page that in the next few months, we’re going to do X, Y, Z to improve the functionality. The lack of updates is not by design; it’s because until recently I didn’t have much time to work on Neurosynth. Now that much of my time is committed to the project, things are moving ahead pretty nicely, though the changes behind the scenes aren’t reflected in any user-end improvements yet.

The github repo is now regularly updated and even gets the occasional contribution from someone other than myself; I expect that to ramp up considerably in the coming months. You can already use the code to run your own automated meta-analyses fairly easily; e.g., with everything set up right (follow the Readme and examples in the repo), the following lines of code:

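import cPickle
from neurosynth.analysis import meta
dataset = cPickle.load(open('dataset.pkl', 'rb'))
# studies that use memory* but not wm, working, or episod*
studies = dataset.get_ids_by_expression("memory* &~ (wm|working|episod*)", threshold=0.001)
ma = meta.MetaAnalysis(dataset, studies)
ma.save_results('memory')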

…will perform an automated meta-analysis of all studies in the Neurosynth database that use the term ‘memory’ at a frequency of 1 in 1,000 words or greater, but don’t use the terms wm or working, or words that start with ‘episod’ (e.g., episodic). You can perform queries that nest to arbitrary depths, so it’s a pretty powerful engine for quickly generating customized meta-analyses, subject to all of the usual caveats surrounding Neurosynth (i.e., that the underlying data are very noisy, that terms aren’t mental states, etc.).
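To make the semantics of that query concrete, here is a toy sketch in plain Python. It is not how Neurosynth implements expressions: the `studies` table, its frequencies, and the `uses` helper are all invented for illustration, and the real parser handles arbitrarily nested expressions.

```python
from fnmatch import fnmatch

# Hypothetical term frequencies (fraction of words in each article); invented for illustration.
studies = {
    'study_a': {'memory': 0.004, 'retrieval': 0.002},
    'study_b': {'memory': 0.003, 'working': 0.005},
    'study_c': {'memory': 0.0002, 'emotion': 0.006},
    'study_d': {'memory': 0.002, 'episodic': 0.003},
}

def uses(freqs, pattern, threshold=0.001):
    """True if any term matching the wildcard pattern occurs at or above the threshold."""
    return any(fnmatch(term, pattern) and freq >= threshold
               for term, freq in freqs.items())

# Toy equivalent of "memory* &~ (wm|working|episod*)"
excluded = ('wm', 'working', 'episod*')
selected = sorted(
    sid for sid, freqs in studies.items()
    if uses(freqs, 'memory*') and not any(uses(freqs, p) for p in excluded)
)
print(selected)
```

Only `study_a` survives here: `study_b` and `study_d` hit an excluded term, and `study_c` mentions memory too rarely to clear the threshold.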

Anyway, with the core tools coming along, I’ve started to turn back to other elements of the project, starting with the image viewer. Yesterday I pushed the first commit of a new version of the viewer that’s currently on the Neurosynth website. In the next few weeks, this new version will be replacing the current version of the viewer, along with a bunch of other changes to the website.

A live demo of the new viewer is available here. It’s not much to look at right now, but behind the scenes, it’s actually a huge improvement on the old viewer in a number of ways:

  • The code is completely refactored and is all nice and object-oriented now. It’s also in CoffeeScript, which is an alternative and (if you’re coming from a Python or Ruby background) much more readable syntax for JavaScript. The source code is on github and contributions are very much encouraged. Like most scientists, I’m generally loath to share my code publicly because I think it sucks most of the time. But I actually feel pretty good about this code. It’s not good code by any stretch, but I think it rises to the level of ‘mostly sensible’, which is about as much as I can hope for.
  • The viewer now handles multiple layers simultaneously, with the ability to hide and show layers, reorder them by dragging, vary the transparency, assign different color palettes, etc. These features have been staples of offline viewers pretty much since the prehistoric beginnings of fMRI time, but they aren’t available in the current Neurosynth viewer or most other online viewers I’m aware of, so this is a nice addition.
  • The architecture is modular, so that it should be quite easy in future to drop in other alternative views onto the data without having to muck about with the app logic. E.g., adding a 3D WebGL-based view to complement the current 2D slice-based HTML5 canvas approach is on the near-term agenda.
  • The resolution of the viewer is now higher–up from 4 mm to 2 mm (which is the most common native resolution used in packages like SPM and FSL). The original motivation for downsampling to 4 mm in the prior viewer was to keep filesize to a minimum and speed up the initial loading of images. But at some point I realized, hey, we’re living in the 21st century; people have fast internet connections now. So now the files are all in 2 mm resolution, which has the unpleasant effect of increasing file sizes by a factor of about 8, but also has the pleasant effect of making it so that you can actually tell what the hell you’re looking at.

Most importantly, there’s now a clean, and near-complete, separation between the HTML/CSS content and the JavaScript code. Which means that you can now effectively drop the viewer into just about any HTML page with just a few lines of code. So in theory, you can have basically the same viewer you see in the demo just by sticking something like the following into your page:

 viewer = Viewer.get('#layer_list', '.layer_settings');
 viewer.addView('#view_axial', 2);
 viewer.addView('#view_coronal', 1);
 viewer.addView('#view_sagittal', 0);
 viewer.addSlider('opacity', '.slider#opacity', 'horizontal', 'false', 0, 1, 1, 0.05);
 viewer.addSlider('pos-threshold', '.slider#pos-threshold', 'horizontal', 'false', 0, 1, 0, 0.01);
 viewer.addSlider('neg-threshold', '.slider#neg-threshold', 'horizontal', 'false', 0, 1, 0, 0.01);
 viewer.addColorSelect('#color_palette');
 viewer.addDataField('voxelValue', '#data_current_value');
 viewer.addDataField('currentCoords', '#data_current_coords');
 viewer.loadImageFromJSON('data/MNI152.json', 'MNI152 2mm', 'gray');
 viewer.loadImageFromJSON('data/emotion_meta.json', 'emotion meta-analysis', 'bright lights');
 viewer.loadImageFromJSON('data/language_meta.json', 'language meta-analysis', 'hot and cold');
 viewer.paint();

Well, okay, there are some other dependencies and styling stuff you’re not seeing. But all of that stuff is included in the example folder here. And of course, you can modify any of the HTML/CSS you see in the example; the whole point is that you can now easily style the viewer however you want it, without having to worry about any of the app logic.

What’s also nice about this is that you can easily pick and choose which of the viewer’s features you want to include in your page; nothing will (or at least, should) break no matter what you do. So, for example, you could decide you only want to display a single view showing only axial slices; or to allow users to manipulate the threshold of layers but not their opacity; or to show the current position of the crosshairs but not the corresponding voxel value; and so on. All you have to do is include or exclude the various addView(), addSlider(), and addDataField() lines you see above.

Of course, it wouldn’t be a mediocre open source project if it didn’t have some important limitations I’ve been hiding from you until near the very end of this post (hoping, of course, that you wouldn’t bother to read this far down). The biggest limitation is that the viewer expects images to be in JSON format rather than a binary format like NIFTI or Analyze. This is a temporary headache until I or someone else can find the time and motivation to adapt one of the JavaScript NIFTI readers that are already out there (e.g., Satra Ghosh’s parser for xtk), but for now, if you want to load your own images, you’re going to have to take the extra step of first converting them to JSON. Fortunately, the core Neurosynth Python package has an img_to_json() method in the imageutils module that will read in a NIFTI or Analyze volume and produce a JSON string in the expected format. That said, I’m pretty sure it doesn’t handle orientation properly for some images, so don’t be surprised if your images look wonky. (And more importantly, if you fix the orientation issue, please commit your changes to the repo.)

In any case, as long as you’re comfortable with a bit of HTML/CSS/JavaScript hacking, the example/ folder in the github repo has everything you need to drop the viewer into your own pages. If you do use this code internally, please let me know! Partly for my own edification, but mostly because when I write my annual progress reports to the NIH, it’s nice to be able to truthfully say, “hey, look, people are actually using this neat thing we built with taxpayer money.”

A very classy reply from Karl Friston

After writing my last post critiquing Karl Friston’s commentary in NeuroImage, I emailed him the link, figuring he might want the opportunity to respond, and also to make sure he knew my commentary wasn’t intended as a personal attack (I have enormous respect for his seminal contributions to the field of neuroimaging). Here’s his very classy reply (posted with permission):

Many thanks for your kind e-mail and link to your blog. I thought your review and deconstruction of the issues were excellent and I concur with the points that you make.

You are absolutely right that I ignored the use of high (corrected) thresholds when controlling for multiple comparisons – and was focusing on the simple case of a single test. I also agree that, ideally, one would report confidence intervals on effect sizes – indeed the original version of my article concluded with this recommendation (now the last line of appendix 1). I remember – at the inception of SPM – discussing with Andrew Holmes the reporting of confidence intervals using statistical maps – however, the closest we ever got was posterior probability maps (PPM), many years later.

My agenda was probably a bit simpler than you might have supposed – it was to point out that significant p-values from small sample studies are valid and will – on average – detect effects whose sizes are bigger than the equivalent effects with larger sample sizes. I did not mean to imply that large studies are useless – although I do believe that unqualified reports of significant p-values from large sample sizes should be treated with caution. Although my agenda was fairly simple, the issues raised may well require more serious consideration – of the sort that you have offered. I submitted the article as a ‘comments and controversy’, anticipating that it would elicit a thoughtful response of the sort in your blog. If you have not done so already; you could prepare your blog for peer-reviewed submission – perhaps as a response to the ‘comments and controversy’ at NeuroImage?

I will not respond to your blog directly; largely because I have never blogged before and prefer to restrict myself to peer-reviewed formats. However, please feel free to use this e-mail in any way you see fit.

With very best wishes,

Karl

PS: although you may have difficulty believing it – all the critiques I caricatured I have actually seen in one form or another – even the retinotopic mapping critique!

Seeing as my optimistic thought when I sent Friston the link was “I hope he doesn’t eat me alive” (not because he has that kind of reputation, but because, frankly, if someone obnoxiously sent me a link to an abrasive article criticizing my work at length, I might not be very happy either), I was very happy to read that. I wrote back:

Thanks very much for your gracious reply–especially since the tone of my commentary was probably a bit abrasive. If I’m being honest with myself, I’m pretty sure I’d have a hard time setting my ego aside long enough to respond this constructively if someone criticized me like this (no matter how I felt about the substantive issues), so it’s very much appreciated.

I won’t take up any of the substantive issues here, since it sounds like we’re in reasonable agreement on most of the issues. As far as submitting a formal response to NeuroImage, I’d normally be happy to do that, but I’m currently boycotting Elsevier journals as part of the Cost of Knowledge campaign, and feel pretty strongly about that, so I won’t be submitting anything to NeuroImage for the foreseeable future. This isn’t meant as an indictment of the NeuroImage editorial board or staff in any way; it’s strictly out of frustration at Elsevier’s policies and recent actions.

Also, while I like the comments and controversies format at NeuroImage a lot, there’s no question that the process is considerably slower than what online communication affords. The reality is that by the time my comment ever came out (probably in a much abridged form), much of the community will have moved on and lost interest in the issue, and I’ve found in the past that the kind of interactive and rapid engagement it’s possible to get online is very hard to approximate in a print forum. But I can completely understand your hesitation to respond this way; it could quickly become unmanageable. For what it’s worth, I don’t really think blogs are the right medium for this kind of thing in the long term anyway, but until we get publisher-independent evaluation platforms that centralize the debate in one place (which I’m hopeful will happen relatively soon), I think they play a useful role.

Anyway, whatever your opinion of the original commentary and/or my post, I think Friston deserves a lot of credit for his response, which, I’ll just reiterate, is much more civil and tactful than mine would probably have been in his situation. I can’t think of many cases, either in print or online, in which someone has responded so constructively to criticism.

One other thing that I forgot to mention in my reply to Friston, but that is worth bringing up here: I think SPM confidence interval maps would be a great idea! It would be fantastic if fMRI analysis packages by default produced 3 effect size maps for every analysis, giving respectively the observed, lower-bound, and upper-bound estimates of effect size at every voxel. This would naturally discourage researchers from making excessively strong claims (since one imagines almost everyone would at least glance at the lower-bound map) while providing reviewers a very easy way to frame concerns about power and sample size (“can the authors please present the confidence interval maps in the appendix?”). Anyone want to write an SPM plug-in to do this?

Sixteen is not magic: Comment on Friston (2012)

UPDATE: I’ve posted a very classy email response from Friston here.

In a “comments and controversies” piece published in NeuroImage last week, Karl Friston describes “Ten ironic rules for non-statistical reviewers”. As the title suggests, the piece is presented ironically; Friston frames it as a series of guidelines reviewers can follow in order to ensure successful rejection of any neuroimaging paper. But of course, Friston’s real goal is to convince you that the practices described in the commentary are bad ones, and that reviewers should stop picking on papers for such things as having too little power, not cross-validating results, and not being important enough to warrant publication.

Friston’s piece is, simultaneously, an entertaining satire of some lamentable reviewer practices, and—in my view, at least—a frustratingly misplaced commentary on the relationship between sample size, effect size, and inference in neuroimaging. While it’s easy to laugh at some of the examples Friston gives, many of the positions Friston presents and then skewers aren’t just humorous portrayals of common criticisms; they’re simply bad caricatures of comments that I suspect only a small fraction of reviewers ever make. Moreover, the cures Friston proposes—most notably, the recommendation that sample sizes on the order of 16 to 32 are just fine for neuroimaging studies—are, I’ll argue, much worse than the diseases he diagnoses.

Before taking up the objectionable parts of Friston’s commentary, I’ll just touch on the parts I don’t think are particularly problematic. Of the ten rules Friston discusses, seven seem palatable, if not always helpful:

  • Rule 6 seems reasonable; there does seem to be excessive concern about the violation of assumptions of standard parametric tests. It’s not that this type of thing isn’t worth worrying about at some point, just that there are usually much more egregious things to worry about, and it’s been demonstrated that the most common parametric tests are (relatively) insensitive to violations of normality under realistic conditions.
  • Rule 10 is also on point; given that we know the reliability of peer review is very low, it’s problematic when reviewers make the subjective assertion that a paper just isn’t important enough to be published in such-and-such journal, even as they accept that it’s technically sound. Subjective judgments about importance and innovation should be left to the community to decide. That’s the philosophy espoused by open-access venues like PLoS ONE and Frontiers, and I think it’s a good one.
  • Rules 7 and 9—criticizing a lack of validation or a failure to run certain procedures—aren’t wrong, but seem to me much too broad to support blanket pronouncements. Surely much of the time when reviewers highlight missing procedures, or complain about a lack of validation, there are perfectly good reasons for doing so. I don’t imagine Friston is really suggesting that reviewers should stop asking authors for more information or for additional controls when they think it’s appropriate, so it’s not clear what the point of including this here is. The example Friston gives in Rule 9 (of requesting retinotopic mapping in an olfactory study), while humorous, is so absurd as to be worthless as an indictment of actual reviewer practices. In fact, I suspect it’s so absurd precisely because anything less extreme Friston could have come up with would have caused readers to think, “but wait, that could actually be a reasonable concern…”
  • Rules 1, 2, and 3 seem reasonable as far as they go; it’s just common sense to avoid overconfidence, arguments from emotion, and tardiness. Still, I’m not sure what’s really accomplished by pointing this out; I doubt there are very many reviewers who will read Friston’s commentary and say “you know what, I’m an overconfident, emotional jerk, and I’m always late with my reviews–I never realized this before.” I suspect the people who fit that description—and for all I know, I may be one of them—will be nodding and chuckling along with everyone else.

This leaves Rules 4, 5, and 8, which, conveniently, all focus on a set of interrelated issues surrounding low power, effect size estimation, and sample size. Because Friston’s treatment of these issues strikes me as dangerously wrong, and liable to send a very bad message to the neuroimaging community, I’ve laid out some of these issues in considerably more detail than you might be interested in. If you just want the direct rebuttal, skip to the “Reprising the rules” section below; otherwise the next two sections sketch Friston’s argument for using small sample sizes in fMRI studies, and then describe some of the things wrong with it.

Friston’s argument

Friston’s argument is based on three central claims:

  1. Classical inference (i.e., the null hypothesis testing framework) suffers from a critical flaw, which is that the null is always false: no effects (at least in psychology) are ever truly zero. Collect enough data and you will always end up rejecting the null hypothesis with probability of 1.
  2. Researchers care more about large effects than about small ones. In particular, there is some size of effect that any given researcher will call ‘trivial’, below which that researcher is uninterested in the effect.
  3. If the null hypothesis is always false, and if some effects are not worth caring about in practical terms, then researchers who collect very large samples will invariably end up identifying many effects that are statistically significant but completely uninteresting.

I think it would be hard to dispute any of these claims. The first one is the source of persistent statistical criticism of the null hypothesis testing framework, and the second one is self-evidently true (if you doubt it, ask yourself whether you would really care to continue your research if you knew with 100% confidence that none of your effects would ever be larger than one one-thousandth of a standard deviation). The third one follows directly from the first two.

Where Friston’s commentary starts to depart from conventional wisdom is in the implications he thinks these premises have for the sample sizes researchers should use in neuroimaging studies. Specifically, he argues that since large samples will invariably end up identifying trivial effects, whereas small samples will generally only have power to detect large effects, it’s actually in neuroimaging researchers’ best interest not to collect a lot of data. In other words, Friston turns what most commentators have long considered a weakness of fMRI studies—their small sample size—into a virtue.

Here’s how he characterizes an imaginary reviewer’s misguided concern about low power:

Reviewer: Unfortunately, this paper cannot be accepted due to the small number of subjects. The significant results reported by the authors are unsafe because the small sample size renders their design insufficiently powered. It may be appropriate to reconsider this work if the authors recruit more subjects.

Friston suggests that the appropriate response from a clever author would be something like the following:

Response: We would like to thank the reviewer for his or her comments on sample size; however, his or her conclusions are statistically misplaced. This is because a significant result (properly controlled for false positives), based on a small sample indicates the treatment effect is actually larger than the equivalent result with a large sample. In short, not only is our result statistically valid. It is quantitatively more significant than the same result with a larger number of subjects.

This is supported by an extensive appendix (written non-ironically), where Friston presents a series of nice sensitivity and classification analyses intended to give the reader an intuitive sense of what different standardized effect sizes mean, and what the implications are for the detection of statistically significant effects using a classical inference (i.e., hypothesis testing) approach. The centerpiece of the appendix is a loss-function analysis where Friston pits the benefit of successfully detecting a large effect (which he defines as a Cohen’s d of 1, i.e., an effect of one standard deviation) against the cost of rejecting the null when the effect is actually trivial (defined as a d of 0.125 or less). Friston notes that the loss function is minimized (i.e., the difference between the hit rate for large effects and the rate at which trivial effects are declared significant is maximized) when n = 16, which is where the number he repeatedly quotes as a reasonable sample size for fMRI studies comes from. (Actually, as I discuss in my Appendix I below, I think Friston’s power calculations are off, and the right number, even given his assumptions, is more like 22. But the point is, it’s a small number either way.)

It’s important to note that Friston is not shy about asserting his conclusion that small samples are just fine for neuroimaging studies—especially in the Appendices, which are not intended to be ironic. He makes claims like the following:

The first appendix presents an analysis of effect size in classical inference that suggests the optimum sample size for a study is between 16 and 32 subjects. Crucially, this analysis suggests significant results from small samples should be taken more seriously than the equivalent results in oversized studies.

And:

In short, if we wanted to optimise the sensitivity to large effects but not expose ourselves to trivial effects, sixteen subjects would be the optimum number.

And:

In short, if you cannot demonstrate a significant effect with sixteen subjects, it is probably not worth demonstrating.

These are very strong claims delivered with minimal qualification, and given Friston’s influence, could potentially lead many reviewers to discount their own prior concerns about small sample size and low power—which would be disastrous for the field. So I think it’s important to explain exactly why Friston is wrong and why his recommendations regarding sample size shouldn’t be taken seriously.

What’s wrong with the argument

Broadly speaking, there are three problems with Friston’s argument. The first one is that Friston presents the absolute best-case scenario as if it were typical. Specifically, the recommendation that a sample of 16 – 32 subjects is generally adequate for fMRI studies assumes that fMRI researchers are conducting single-sample t-tests at an uncorrected threshold of p < .05; that they only care about effects on the order of 1 sd in size; and that any effect smaller than d = .125 is trivially small and is to be avoided. If all of this were true, an n of 16 (or rather, 22; see Appendix I below) might be reasonable. But it doesn’t really matter, because if you make even slightly less optimistic assumptions, you end up in a very different place. For example, for a two-sample t-test at p < .001 (a very common scenario in group difference studies), the optimal sample size, according to Friston’s own loss-function analysis, turns out to be 87 per group, or 174 subjects in total.

I discuss the problems with the loss-function analysis in much more detail in Appendix I below; the main point here is that even if you take Friston’s argument at face value, his own numbers put the lie to the notion that a sample size of 16 – 32 is sufficient for the majority of cases. It flatly isn’t. There’s nothing magic about 16, and it’s very bad advice to suggest that authors should routinely shoot for sample sizes this small when conducting their studies given that Friston’s own analysis would seem to demand a much larger sample size the vast majority of the time.

What about uncertainty?

The second problem is that Friston’s argument entirely ignores the role of uncertainty in drawing inferences about effect sizes. The notion that an effect that comes from a small study is likely to be bigger than one that comes from a larger study may be strictly true in the sense that, for any fixed p value, the observed effect size necessarily varies inversely with sample size. It’s true, but it’s also not very helpful. The reason it’s not helpful is that while the point estimate of statistically significant effects obtained from a small study will tend to be larger, the uncertainty around that estimate is also greater—and with sample sizes in the neighborhood of 16 – 20, will typically be so large as to be nearly worthless. For example, a correlation of r = .75 sounds huge, right? But when that correlation is detected at a threshold of p < .001 in a sample of 16 subjects, the corresponding 99.9% confidence interval is .06 – .95—a range so wide as to be almost completely uninformative.
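If you want to check an interval like that yourself, here is a minimal sketch using the standard Fisher z approximation. The `correlation_ci` helper is my own name for illustration, not part of any package, and the approximation is only that: an approximation.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def correlation_ci(r, n, level):
    """Approximate confidence interval for a Pearson r via the Fisher z-transform."""
    z = atanh(r)                 # move r onto an approximately normal scale
    se = 1.0 / sqrt(n - 3)       # standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return tanh(z - crit * se), tanh(z + crit * se)

lo, hi = correlation_ci(r=0.75, n=16, level=0.999)
print(f'99.9% CI for r = .75, n = 16: {lo:.2f} to {hi:.2f}')
```

Rerunning it with a larger n is a quick way to see how fast the interval tightens as the sample grows.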

Fortunately, what Friston argues small samples can do for us indirectly (namely, establish that effect sizes are big enough to care about) can be done much more directly, simply by looking at the uncertainty associated with our estimates. That’s exactly what confidence intervals are for. If our goal is to ensure that we only end up talking about results big enough to care about, it’s surely better to answer the question “how big is the effect?” by saying, “d = 1.1, with a 95% confidence interval of 0.2 – 2.1” than by saying “well it’s statistically significant at p < .001 in a sample of 16 subjects, so it’s probably pretty big”. In fact, if you take the latter approach, you’ll be wrong quite often, for the simple reason that p values will generally be closer to the statistical threshold with small samples than with big ones. Remember that, by definition, the point at which one is allowed to reject the null hypothesis is also the point at which the relevant confidence interval borders on zero. So it doesn’t really matter whether your sample is small or large; if you only just barely managed to reject the null hypothesis, you cannot possibly be in a good position to conclude that the effect is likely to be a big one.

As far as I can tell, Friston completely ignores the role of uncertainty in his commentary. For example, he gives the following example, which is supposed to convince you that you don’t really need large samples:

Imagine we compared the intelligence quotient (IQ) between the pupils of two schools. When comparing two groups of 800 pupils, we found mean IQs of 107.1 and 108.2, with a difference of 1.1. Given that the standard deviation of IQ is 15, this would be a trivial effect size … In short, although the differential IQ may be extremely significant, it is scientifically uninteresting … Now imagine that your research assistant had the bright idea of comparing the IQ of students who had and had not recently changed schools. On selecting 16 students who had changed schools within the past five years and 16 matched pupils who had not, she found an IQ difference of 11.6, where this medium effect size just reached significance. This example highlights the difference between an uninformed overpowered hypothesis test that gives very significant, but uninformative results and a more mechanistically grounded hypothesis that can only be significant with a meaningful effect size.

But the example highlights no such thing. One is not entitled to conclude, in the latter case, that the true effect must be medium-sized just because it came from a small sample. If the effect only just reached significance, the confidence interval by definition just barely excludes zero, and we can’t say anything meaningful about the size of the effect, but only about its sign (i.e., that it was in the expected direction)—which is (in most cases) not nearly as useful.
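To put rough numbers on the 16-versus-16 comparison, here is a small sketch. It makes simplifying assumptions that are mine rather than Friston’s: the population standard deviation of 15 is treated as known, so the interval is z-based rather than t-based, and `mean_difference_ci` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def mean_difference_ci(diff, sd, n_per_group, level=0.95):
    """z-based CI for a difference between two group means, common sd assumed known."""
    se = sd * sqrt(2.0 / n_per_group)   # standard error of the difference
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff - crit * se, diff + crit * se

SD = 15.0
lo, hi = mean_difference_ci(diff=11.6, sd=SD, n_per_group=16)
print(f'IQ difference: {lo:.1f} to {hi:.1f} points')
print(f"as Cohen's d: {lo / SD:.2f} to {hi / SD:.2f}")
```

Under these assumptions the interval runs from roughly 1 to 22 IQ points, which is to say from an effect Friston would call trivial to one larger than his ‘large’ benchmark.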

In fact, we will generally be in a much worse position with a small sample than a large one, because with a large sample, we at least stand a chance of being able to distinguish small effects from large ones. Recall that Friston advises against collecting very large samples for the very reason that they are likely to produce a wealth of statistically-significant-but-trivially-small effects. Well, maybe so, but so what? Why would it be a bad thing to detect trivial effects so long as we were also in an excellent position to know that those effects were trivial? Nothing about the hypothesis-testing framework commits us to treating all of our statistically significant results like they’re equally important. If we have a very large sample, and some of our effects have confidence intervals from 0.02 to 0.15 while others have CIs from 0.42 to 0.52, we would be wise to focus most of our attention on the latter rather than the former. At the very least this seems like a more reasonable approach than deliberately collecting samples so small that they will rarely be able to tell us anything meaningful about the size of our effects.

What about the prior?

The third, and arguably biggest, problem with Friston’s argument is that it completely ignores the prior—i.e., the expected distribution of effect sizes across the brain. Friston’s commentary assumes a uniform prior everywhere; for the analysis to go through, one has to believe that trivial effects and very large effects are equally likely to occur. But this is patently absurd; while that might be true in select situations, by and large, we should expect small effects to be much more common than large ones. In a previous commentary (on the Vul et al “voodoo correlations” paper), I discussed several reasons for this; rather than go into detail here, I’ll just summarize them:

  • It’s frankly just not plausible to suppose that effects are really as big as they would have to be in order to support adequately powered analyses with small samples. For example, a correlational analysis with 20 subjects at p < .001 would require a population effect size of r = .77 to have 80% power. If you think it’s plausible that focal activation in a single brain region can explain 60% of the variance in a complex trait like fluid intelligence or extraversion, I have some property under a bridge I’d like you to come by and look at.
  • The low-hanging fruit get picked off first. Back when fMRI was in its infancy in the mid-1990s, people could indeed publish findings based on samples of 4 or 5 subjects. I’m not knocking those studies; they taught us a huge amount about brain function. In fact, it’s precisely because they taught us so much about the brain that researchers can no longer stick 5 people in a scanner and report that doing a working memory task robustly activates the frontal cortex. Nowadays, identifying an interesting effect is more difficult—and if that effect were really enormous, odds are someone would have found it years ago. But this shouldn’t surprise us; neuroimaging is now a relatively mature discipline, and effects on the order of 1 sd or more are extremely rare in most mature fields (for a nice review, see Meyer et al (2001)).
  • fMRI studies with very large samples invariably seem to report much smaller effects than fMRI studies with small samples. This can only mean one of two things: (a) large studies are done much more poorly than small studies (implausible—if anything, the opposite should be true); or (b) the true effects are actually quite small in both small and large fMRI studies, but they’re inflated by selection bias in small studies, whereas large studies give an accurate estimate of their magnitude (very plausible).
  • Individual differences or between-group analyses, which have much less power than within-subject analyses, tend to report much more sparing activations. Again, this is consistent with the true population effects being on the small side.
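
The first point above is easy to check numerically. Here's a minimal sketch of the power of a two-sided correlation test, using the Fisher z approximation. That approximation is my choice for illustration; a dedicated power package may use a slightly different method and return slightly different numbers. The function name `correlation_power` and its defaults are made up for this example.

```python
from math import atanh, sqrt
from statistics import NormalDist

def correlation_power(r, n, alpha=0.001):
    """Approximate two-sided power to detect a population correlation r
    with n subjects, using the Fisher z transformation."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)   # critical value under the null
    shift = atanh(r) * sqrt(n - 3)         # expected value of the test statistic
    # Probability of landing beyond either critical value
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

for r in (0.3, 0.5, 0.77):
    print(f"r = {r:.2f}, n = 20: power ~ {correlation_power(r, 20):.2f}")
```

Under this approximation, r = .77 comes out at roughly 80% power with 20 subjects, while a more believable r = .3 comes out at only a few percent.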

To be clear, I’m not saying there are never any large effects in fMRI studies. Under the right circumstances, there certainly will be. What I’m saying is that, in the absence of very good reasons to suppose that a particular experimental manipulation is going to produce a large effect, our default assumption should be that the vast majority of (interesting) experimental contrasts are going to produce diffuse and relatively weak effects.

Note that Friston’s assertion that “if one finds a significant effect with a small sample size, it is likely to have been caused by a large effect size” depends entirely on the prior effect size distribution. If the brain maps we look at are actually dominated by truly small effects, then it’s simply not true that a statistically significant effect obtained from a small sample is likely to have been caused by a large effect size. We can see this easily by thinking of a situation in which an experiment has a weak but very diffuse effect on brain activity. Suppose that the entire brain showed ‘trivial’ effects of d = 0.125 in the population, and that there were actually no large effects at all. A one-sample t-test at p < .001 has less than 1% power to detect this effect, so you might suppose, as Friston does, that we could discount the possibility that a significant effect would have come from a trivial effect size. And yet, because a whole-brain analysis typically involves tens of thousands of tests, there’s a very good chance such an analysis will end up identifying statistically significant effects somewhere in the brain. Unfortunately, because the only way to identify a trivial effect with a small sample is to capitalize on chance (Friston discusses this point in his Appendix II, and additional treatments can be found in Ioannidis (2008), or in my 2009 commentary), that tiny effect won’t look tiny when we examine it; it will in all likelihood look enormous.
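
If you'd rather see this than take my word for it, here's a small simulation sketch of exactly that scenario. The specifics are assumptions made for illustration: 20 subjects, 20,000 independent 'voxels', unit-variance Gaussian noise, and a critical t value taken from a standard t table. Real voxels are not independent, so treat this as a cartoon rather than a model of actual fMRI data.

```python
import random
from math import sqrt

random.seed(1)

N_SUBJECTS = 20
N_VOXELS = 20000    # stand-in for a whole-brain analysis
TRUE_D = 0.125      # every voxel has the same trivial effect
T_CRIT = 3.883      # two-sided p < .001 cutoff for df = 19, from a t table

def observed_effect(n, true_d):
    """Draw one sample of n subjects and return (t statistic, observed d)."""
    sample = [random.gauss(true_d, 1.0) for _ in range(n)]
    mean = sum(sample) / n
    sd = sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    d = mean / sd
    return d * sqrt(n), d

results = [observed_effect(N_SUBJECTS, TRUE_D) for _ in range(N_VOXELS)]
significant = [d for t, d in results if abs(t) > T_CRIT]

print(f"{len(significant)} of {N_VOXELS} voxels reach p < .001")
print(f"smallest |d| among them: {min(abs(d) for d in significant):.2f}")
```

Only a tiny fraction of voxels survive, but every one that does has an observed |d| above about 0.87 (that's just T_CRIT divided by the square root of 20), which is roughly seven times the true value of 0.125.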

Since they say a picture is worth a thousand words, here’s one (from an unpublished paper in progress):

The top panel shows you a hypothetical distribution of effects (Pearson’s r) in a 2-dimensional ‘brain’ in the population. Note that there aren’t any astronomically strong effects (though the white circles indicate correlations of .5 or greater, which are certainly very large). The bottom panel shows what happens when you draw random samples of various sizes from the population and use different correction thresholds/approaches. You can see that the conclusion you’d draw if you followed Friston’s advice—i.e., that any effect you observe with n = 20 must be pretty robust to survive correction—is wrong; the isolated region that survives correction at FDR = .05, while ‘real’ in a trivial sense, is not in fact very strong in the true map—it just happens to be grossly inflated by sampling error. This is to be expected; when power is very low but the number of tests you’re performing is very large, the odds are good that you’ll end up identifying some real effect somewhere in the brain–and the estimated effect size within that region will be grossly distorted because of the selection process.

Encouraging people to use small samples is a sure way to ensure that researchers continue to publish highly biased findings that lead other researchers down garden paths trying unsuccessfully to replicate ‘huge’ effects. It may make for an interesting, more publishable story (who wouldn’t rather talk about the single cluster that supports human intelligence than about the complex, highly distributed pattern of relatively weak effects?), but it’s bad science. It’s exactly the same problem geneticists confronted ten or fifteen years ago when the first candidate gene and genome-wide association studies (GWAS) seemed to reveal remarkably strong effects of single genetic variants that subsequently failed to replicate. And it’s the same reason geneticists now run association studies with 10,000+ subjects and not 300.

Unfortunately, the costs of fMRI scanning haven’t come down the same way the costs of genotyping have, so there’s tremendous resistance at present to the idea that we really do need to routinely acquire much larger samples if we want to get a clear picture of how big effects really are. Be that as it may, we shouldn’t indulge in wishful thinking just because of logistical constraints. The fact that it’s difficult to get good estimates doesn’t mean we should pretend our bad estimates are actually good ones.

What’s right with the argument

Having criticized much of Friston’s commentary, I should note that there’s one part I like a lot, and that’s the section on protected inference in Appendix I. The point Friston makes here is that you can still use a standard hypothesis testing approach fruitfully—i.e., without falling prey to the problem of classical inference—so long as you explicitly protect against the possibility of identifying trivial effects. Friston’s treatment is mathematical, but all he’s really saying here is that it makes sense to use non-zero ranges instead of true null hypotheses. I’ve advocated the same approach before (e.g., here), as I’m sure many other people have. The point is simple: if you think an effect of, say, 1/8th of a standard deviation is too small to care about, then you should define a ‘pseudonull’ hypothesis of d = -.125 to .125 instead of a null of exactly zero.

Once you do that, any time you reject the null, you’re now entitled to conclude with reasonable certainty that your effects are in fact non-trivial in size. So I completely agree with Friston when he observes in the conclusion to the Appendix I that:

…the adage ‘you can never have enough data’ is also true, provided one takes care to protect against inference on trivial effect sizes – for example using protected inference as described above.

Of course, the reason I agree with it is precisely because it directly contradicts Friston’s dominant recommendation to use small samples. In fact, since rejecting non-zero values is more difficult than rejecting a null of zero, when you actually perform power calculations based on protected inference, it becomes immediately apparent just how inadequate samples on the order of 16 – 32 subjects will be most of the time (e.g., detecting an effect of d = 0.5 with 80% power using a one-sample t-test at p < .05 requires 33 subjects if the null is exactly zero, but if you want to reject ‘trivial’ effect sizes of |d| <= .125, that n is now upwards of 50).
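
For a rough feel for where numbers like these come from, here's a back-of-the-envelope sketch. It uses the textbook normal approximation n ≈ ((z_alpha + z_power) / (d - d_null))^2 rather than an exact t-based calculation, so it runs a little below what a proper power package reports. The function `approx_n` and its arguments are hypothetical names for this example.

```python
from math import ceil
from statistics import NormalDist

def approx_n(d, d_null=0.0, alpha=0.05, power=0.80):
    """Rough sample size for a two-sided one-sample test to reject
    effects no larger than d_null when the true effect is d.
    Normal approximation, so slightly below exact t-test answers."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return ceil(((z_alpha + z_power) / (d - d_null)) ** 2)

print(approx_n(0.5))                 # null of exactly zero
print(approx_n(0.5, d_null=0.125))   # pseudonull of |d| <= .125
```

The approximation lands in the low 30s for the ordinary null and in the mid 50s for the pseudonull, which is the same pattern as the exact figures: shrinking the distance between the true effect and the edge of the null from 0.5 to 0.375 nearly doubles the required sample.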

Reprising the rules

With the above considerations in mind, we can now turn back to Friston’s rules 4, 5, and 8, and see why his admonitions to reviewers are uncharitable at best and insensible at worst. First, Rule 4 (the under-sampled study). Here’s the kind of comment Friston (ironically) argues reviewers should avoid:

 Reviewer: Unfortunately, this paper cannot be accepted due to the small number of subjects. The significant results reported by the authors are unsafe because the small sample size renders their design insufficiently powered. It may be appropriate to reconsider this work if the authors recruit more subjects.

Perhaps many reviewers make exactly this argument; I haven’t been an editor, so I don’t know (though I can say that I’ve read many reviews of papers I’ve co-reviewed and have never actually seen this particular variant). But even if we give Friston the benefit of the doubt and accept that one shouldn’t question the validity of a finding on the basis of small samples (i.e., we accept that p values mean the same thing in large and small samples), that doesn’t mean the more general critique from low power is itself a bad one. To the contrary, a much better form of the same criticism–and one that I’ve raised frequently myself in my own reviews–is the following:

 Reviewer: the authors draw some very strong conclusions in their Discussion about the implications of their main finding. But their finding issues from a sample of only 16 subjects, and the confidence interval around the effect is consequently very large, and nearly includes zero. In other words, the authors’ findings are entirely consistent with the effect they report actually being very small–quite possibly too small to care about. The authors should either weaken their assertions considerably, or provide additional evidence for the importance of the effect.

Or another closely related one, which I’ve also raised frequently:

 Reviewer: the authors tout their results as evidence that region R is ‘selectively’ activated by task T. However, this claim is based entirely on the fact that region R was the only part of the brain to survive correction for multiple comparisons. Given that the sample size in question is very small, and power to detect all but the very largest effects is consequently very low, the authors are in no position to conclude that the absence of significant effects elsewhere in the brain suggests selectivity in region R. With this small a sample, the authors’ data are entirely consistent with the possibility that many other brain regions are just as strongly activated by task T, but failed to attain significance due to sampling error. The authors should either avoid making any claim that the activity they observed is selective, or provide direct statistical support for their assertion of selectivity.

Neither of these criticisms can be defused by suggesting that effect sizes from smaller samples are likely to be larger than effect sizes from large studies. And it would be disastrous for the field of neuroimaging if Friston’s commentary succeeded in convincing reviewers to stop criticizing studies on the basis of low power. If anything, we collectively need to focus far greater attention on issues surrounding statistical power.

Next, Rule 5 (the over-sampled study):

Reviewer: I would like to commend the authors for studying such a large number of subjects; however, I suspect they have not heard of the fallacy of classical inference. Put simply, when a study is overpowered (with too many subjects), even the smallest treatment effect will appear significant. In this case, although I am sure the population effects reported by the authors are significant; they are probably trivial in quantitative terms. It would have been much more compelling had the authors been able to show a significant effect without resorting to large sample sizes. However, this was not the case and I cannot recommend publication.

I’ve already addressed this above; the problem with this line of reasoning is that nothing says you have to care equally about every statistically significant effect you detect. If you ever run into a reviewer who insists that your sample is overpowered and has consequently produced too many statistically significant effects, you can simply respond like this:

 Response: we appreciate the reviewer’s concern that our sample is potentially overpowered. However, this strikes us as a limitation of classical inference rather than a problem with our study. To the contrary, the benefit of having a large sample is that we are able to focus on effect sizes rather than on rejecting a null hypothesis that we would argue is meaningless to begin with. To this end, we now display a second, more conservative, brain activation map alongside our original one that raises the statistical threshold to the point where the confidence intervals around all surviving voxels exclude effects smaller than d = .125. The reviewer can now rest assured that our results protect against trivial effects. We would also note that this stronger inference would not have been possible if our study had had a much smaller sample.

There is rarely if ever a good reason to criticize authors for having a large sample after it’s already collected. You can always raise the statistical threshold to protect against trivial effects if you need to; what you can’t easily do is magic more data into existence in order to shrink your confidence intervals.

Lastly, Rule 8 (exploiting ‘superstitious’ thinking about effect sizes):

 Reviewer: It appears that the authors are unaware of the dangers of voodoo correlations and double dipping. For example, they report effect sizes based upon data (regions of interest) previously identified as significant in their whole brain analysis. This is not valid and represents a pernicious form of double dipping (biased sampling or non-independence problem). I would urge the authors to read Vul et al. (2009) and Kriegeskorte et al. (2009) and present unbiased estimates of their effect size using independent data or some form of cross validation.

Friston’s recommended response is to point out that concerns about double-dipping are misplaced, because the authors are typically not making any claims that the reported effect size is an accurate representation of the population value, but only following standard best-practice guidelines to include effect size measures alongside p values. This would be a fair recommendation if it were true that reviewers frequently object to the mere act of reporting effect sizes based on the specter of double-dipping; but I simply don’t think this is an accurate characterization. In my experience, the impetus for bringing up double-dipping is almost always one of two things: (a) authors getting overly excited about the magnitude of the effects they have obtained, or (b) authors conducting non-independent tests and treating them as though they were independent (e.g., when identifying an ROI based on a comparison of conditions A and B, and then reporting a comparison of A and C without considering the bias inherent in this second test). Both of these concerns are valid and important, and it’s a very good thing that reviewers bring them up.

The right way to determine sample size

If we can’t rely on blanket recommendations to guide our choice of sample size, then what? Simple: perform a power calculation. There’s no mystery to this; both brief and extended treatises on statistical power are all over the place, and power calculators for most standard statistical tests are available online as well as in most off-line statistical packages (e.g., I use the pwr package for R). For more complicated statistical tests for which analytical solutions aren’t readily available (e.g., fancy interactions involving multiple within- and between-subject variables), you can get reasonably good power estimates through simulation.
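
To show how little is involved in the simulation route, here's a minimal sketch for the simplest possible case: a two-sided one-sample t-test with unit-variance Gaussian data. The scenario (n = 16, d = 0.5), the function name `simulated_power`, and the number of simulations are all choices made for illustration. For a fancier design you'd swap in your own data-generating step and test.

```python
import random
from math import sqrt

def simulated_power(n, d, t_crit, n_sims=5000, seed=0):
    """Estimate power of a two-sided one-sample t-test by brute force."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Simulate one study: n subjects with a true standardized effect of d
        sample = [rng.gauss(d, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        sd = sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        t = mean / (sd / sqrt(n))
        if abs(t) > t_crit:
            hits += 1
    # Power is the proportion of simulated studies that reach significance
    return hits / n_sims

# 2.131 is the two-sided p < .05 critical t for df = 15 (from a t table)
power = simulated_power(n=16, d=0.5, t_crit=2.131)
print(f"estimated power: {power:.2f}")
```

The estimate should land a bit under 50%, so even a 'medium' effect would be missed more often than not with 16 subjects.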

Of course, there’s no guarantee you’ll like the answers you get. Actually, in most cases, if you’re honest about the numbers you plug in, you probably won’t like the answer you get. But that’s life; nature doesn’t care about making things convenient for us. If it turns out that it takes 80 subjects to have adequate power to detect the effects we care about and expect, we can (a) suck it up and go for n = 80, (b) decide not to run the study, or (c) accept that logistical constraints mean our study will have less power than we’d like (which implies that any results we obtain will offer only a fractional view of what’s really going on). What we don’t get to do is look the other way and pretend that it’s just fine to go with 16 subjects simply because the last time we did that, we got this amazingly strong, highly selective activation that successfully made it into a good journal. That’s the same logic that repeatedly produced unreplicable candidate gene findings in the 1990s, and, if it continues to go unchecked in fMRI research, risks turning the field into a laughing stock among other scientific disciplines.

Conclusion

The point of all this is not to convince you that it’s impossible to do good fMRI research with just 16 subjects, or that reviewers don’t sometimes say silly things. There are many questions that can be answered with 16 or even fewer subjects, and reviewers most certainly do say silly things (I sometimes cringe when re-reading my own older reviews). The point is that blanket pronouncements, particularly when made ironically and with minimal qualification, are not helpful in advancing the field, and can be very damaging. It simply isn’t true that there’s some magic sample size range like 16 to 32 that researchers can bank on reflexively. If there’s any generalization that we can allow ourselves, it’s probably that, under reasonable assumptions, Friston’s recommended sample sizes are much too small. Typical effect sizes and analysis procedures will generally require much larger samples than neuroimaging researchers are used to collecting. But again, there’s no substitute for careful case-by-case consideration.

In the natural course of things, there will be cases where n = 4 is enough to detect an effect, and others where the effort is questionable even with 100 subjects; unfortunately, we won’t know which situation we’re in unless we take the time to think carefully and dispassionately about what we’re doing. It would be nice to believe otherwise; certainly, it would make life easier for the neuroimaging community in the short term. But since the point of doing science is to discover what’s true about the world, and not to publish an endless series of findings that sound exciting but don’t replicate, I think we have an obligation to both ourselves and to the taxpayers that fund our research to take the exercise more seriously.

 

 

Appendix I: Evaluating Friston’s loss-function analysis

In this appendix I review a number of weaknesses in Friston’s loss-function analysis, and show that under realistic assumptions, the recommendation to use sample sizes of 16 – 32 subjects is far too optimistic.

First, the numbers don’t seem to be right. I say this with a good deal of hesitation, because I have very poor mathematical skills, and I’m sure Friston is much smarter than I am. That said, I’ve tried several different power packages in R and finally resorted to empirically estimating power with simulated draws, and all approaches converge on numbers quite different from Friston’s. Even the sensitivity plots seem off by a good deal (for instance, Friston’s Figure 3 suggests around 30% sensitivity with n = 80 and d = 0.125, whereas all the sources I’ve consulted produce a value around 20%). In my analysis, the loss function is minimized at n = 22 rather than n = 16. I suspect the problem is with Friston’s approximation, but I’m open to the possibility that I’ve done something very wrong, and confirmations or disconfirmations are welcome in the comments below. In what follows, I’ll report the numbers I get rather than Friston’s (mine are somewhat more pessimistic, but the overarching point doesn’t change either way).

Second, there’s the statistical threshold. Friston’s analysis assumes that all of our tests are conducted without correction for multiple comparisons (i.e., at p < .05), but this clearly doesn’t apply to the vast majority of neuroimaging studies, which are either conducting massive univariate (whole-brain) analyses, or testing at least a few different ROIs or networks. As soon as you lower the threshold, the optimal sample size returned by the loss-function analysis increases dramatically. If the threshold is a still-relatively-liberal (for whole-brain analysis) p < .001, the loss function is now minimized at 48 subjects–hardly a welcome conclusion, and a far cry from 16 subjects. Since this is probably still the modal fMRI threshold, one could argue Friston should have been trumpeting a sample size of 48 all along, which is not exactly a ‘small’ sample size given the associated costs.

Third, the n = 16 (or 22) figure only holds for the simplest of within-subject tests (e.g., a one-sample t-test)–again, a best-case scenario (though certainly a common one). It doesn’t apply to many other kinds of tests that are the primary focus of a huge proportion of neuroimaging studies–for instance, two-sample t-tests, or interactions between multiple within-subject factors. In fact, if you apply the same analysis to a two-sample t-test (or equivalently, a correlation test), the optimal sample size turns out to be 82 (41 per group) at a threshold of p < .05, and a whopping 174 (87 per group) at a threshold of p < .001. In other words, if we were to follow Friston’s own guidelines, the typical fMRI researcher who aims to conduct a (liberal) whole-brain individual differences analysis should be collecting 174 subjects a pop. For other kinds of tests (e.g., 3-way interactions), even larger samples might be required.

Fourth, the claim that only large effects–i.e., those that can be readily detected with a sample size of 16–are worth worrying about is likely to annoy and perhaps offend any number of researchers who have perfectly good reasons for caring about effects much smaller than half a standard deviation. A cursory look at most literatures suggests that effects of 1 sd are not the norm; they’re actually highly unusual in mature fields. For perspective, the standardized difference in height between genders is about 1.5 sd; the validity of job interviews for predicting success is about .4 sd; and the effect of gender on risk-taking (men take more risks) is about .2 sd—what Friston would call a very small effect (for other examples, see Meyer et al., 2001). Against this backdrop, suggesting that only effects greater than 1 sd (about the strength of the association between height and weight in adults) are of interest would seem to preclude many, and perhaps most, questions that researchers currently use fMRI to address. Imaging genetics studies are immediately out of the picture; so too, in all likelihood, are cognitive training studies, most investigations of individual differences, and pretty much any experimental contrast that claims to very carefully isolate a relatively subtle cognitive difference. Put simply, if the field were to take Friston’s analysis seriously, the majority of its practitioners would have to pack up their bags and go home. Entire domains of inquiry would shutter overnight.

To be fair, Friston briefly considers the possibility that small effect sizes could be important. But he doesn’t seem to take it very seriously:

Can true but trivial effect sizes can ever be interesting? It could be that a very small effect size may have important implications for understanding the mechanisms behind a treatment effect and that one should maximise sensitivity by using large numbers of subjects. The argument against this is that reporting a significant but trivial effect size is equivalent to saying that one can be fairly confident the treatment effect exists but its contribution to the outcome measure is trivial in relation to other unknown effects…

The problem with the latter argument is that the real world is a complicated place, and most interesting phenomena have many causes. A priori, it is reasonable to expect that the vast majority of effects will be small. We probably shouldn’t expect any single genetic variant to account for more than a small fraction of the variation in brain activity, but that doesn’t mean we should give up entirely on imaging genetics. And of course, it’s worth remembering that, in the context of fMRI studies, when Friston talks about ‘very small effect sizes,’ that’s a bit misleading; even medium-sized effects that Friston presumably allows are interesting could be almost impossible to detect at the sample sizes he recommends. For example, a one-sample t-test with n = 16 subjects detects an effect of d = 0.5 only 46% or 5% of the time at p < .05 and p < .001, respectively. Applying Friston’s own loss function analysis to detection of d = 0.5 returns an optimal sample size of n = 63 at p < .05 and n = 139 at p < .001—a message not entirely consistent with the recommendations elsewhere in his commentary.

Friston, K. (2012). Ten ironic rules for non-statistical reviewers. NeuroImage. DOI: 10.1016/j.neuroimage.2012.04.018

a human and a monkey walk into an fMRI scanner…

Tor Wager and I have a “news and views” piece in Nature Methods this week; we discuss a paper by Mantini and colleagues (in the same issue) introducing a new method for identifying functional brain homologies across different species–essentially, identifying brain regions in humans and monkeys that seem to do roughly the same thing even if they’re not located in the same place anatomically. Mantini et al make some fairly strong claims about what their approach tells us about the evolution of the human brain (namely, that some cortical regions have undergone expansion relative to monkeys, while others have adapted substantively new functions). For reasons we articulate in our commentary, I’m personally not so convinced by the substantive conclusions, but I do think the core idea underlying the method is a very clever and potentially useful one:

Their technique, interspecies activity correlation (ISAC), uses functional magnetic resonance imaging (fMRI) to identify brain regions in which humans and monkeys exposed to the same dynamic stimulus—a 30-minute clip from the movie The Good, the Bad and the Ugly—show correlated patterns of activity (Fig. 1). The premise is that homologous regions should have similar patterns of activity across species. For example, a brain region sensitive to a particular configuration of features, including visual motion, hands, faces, object and others, should show a similar time course of activity in both species—even if its anatomical location differs across species and even if the precise features that drive the area’s neurons have not yet been specified.

Mo Costandi has more on the paper in an excellent Guardian piece (and I’m not just saying that because he quoted me a few times). All in all, I think it’s a very exciting method, and it’ll be interesting to see how it’s applied in future studies. I think there’s a fairly broad class of potential applications based loosely around the same idea of searching for correlated patterns. It’s an idea that’s already been used by Uri Hasson (an author on the Mantini et al paper) and others fairly widely in the fMRI literature to identify functional correspondences across different subjects; but you can easily imagine conceptually similar applications in other fields too–e.g., correlating gene expression profiles across species in order to identify structural homologies (actually, one could probably try this out pretty easily using the mouse and human data available in the Allen Brain Atlas).
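
To make the basic logic concrete, here's a toy sketch of matching regions across two 'species' purely by the similarity of their time courses. To be clear, this is not Mantini et al's actual pipeline; it's a cartoon with synthetic data. The region labels, the two stimulus-driven signals, and the noise level are all invented for the example.

```python
import random
from math import sqrt

def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

rng = random.Random(42)
n_timepoints = 300

# Two made-up stimulus-driven signals (say, motion and faces in the movie)
motion = [rng.gauss(0, 1) for _ in range(n_timepoints)]
faces = [rng.gauss(0, 1) for _ in range(n_timepoints)]

def noisy(signal):
    """Return the signal plus independent measurement noise."""
    return [s + rng.gauss(0, 0.5) for s in signal]

# Hypothetical regions; the labels carry no anatomical meaning
human = {"human_A": noisy(motion), "human_B": noisy(faces)}
monkey = {"monkey_X": noisy(faces), "monkey_Y": noisy(motion)}

# For each human region, pick the monkey region with the most similar time course
best_match = {
    h_name: max(monkey, key=lambda m_name: pearson(h_ts, monkey[m_name]))
    for h_name, h_ts in human.items()
}
print(best_match)
```

Nothing in the matching step knows where a region sits anatomically; the pairing falls out of the shared response to the stimulus, which is the appeal of the idea.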

Mantini D, Hasson U, Betti V, Perrucci MG, Romani GL, Corbetta M, Orban GA, & Vanduffel W (2012). Interspecies activity correlations reveal functional correspondence between monkey and human brain areas. Nature Methods. PMID: 22306809

Wager, T., & Yarkoni, T. (2012). Establishing homology between monkey and human brains. Nature Methods. DOI: 10.1038/nmeth.1869

Attention publishers: the data in your tables want to be free! Free!

The Neurosynth database is getting an upgrade over the next couple of weeks; it’s going to go from 4,393 neuroimaging studies to around 5,800. Unfortunately, updating the database is kind of a pain, because academic publishers like to change the format of their full-text HTML articles, which has a nasty habit of breaking the publisher-specific HTML parsers I’ve written. When you expect ScienceDirect to give you <table cellspacing=10>, but you get <table> with no cellspacing attribute (the horror!), bad things happen in XPath land. And then those bad things need to be repaired. And I hate repairing stuff! So I don’t do it very often. Like, once every 6 to 9 months.

In an ideal world, there would be no need to write (and fix) custom filters for different publishers, because the publishers would all simultaneously make XML representations of their articles available (in addition to HTML, PDF, etc.), and then people who have legitimate data mining reasons for regularly downloading hundreds of articles at a time wouldn’t have to cry themselves to sleep every night. But as it stands, only one major publisher of neuroimaging articles (PLoS) provides XML versions of all articles. A minority of articles from other publishers are available in XML from BioMed Central, but that’s still just a fraction of the existing literature.

Anyway, the HTML thing is annoying, but it’s possible to work around it. What’s much more problematic is that some publishers lock up the data in the tables of their articles. To make Neurosynth work, I have to be able to identify rows in tables that look like brain activations. That is, things that look roughly like this:

Most publishers are nice enough to format article tables as HTML tables; which is to say, I can look for tags like <table> and then work down the XPath tree to identify all the rows, and then scan each row for values that look activation-like. Then those values go into the database, and poof, next thing you know, you have meta-analytic brain activation maps from hundreds of studies. But some publishers–most notably, Frontiers–throw a wrench in the works by failing to format tables in HTML; instead, they present the tables as images (see for instance this JPEG table, pulled from this article). Which means I can’t really extract any data from them, and as a result, you’re not going to see activations from articles published in Frontiers journals in Neurosynth any time soon. So if you publish fMRI articles in Frontiers in Human Neuroscience regularly, and are wondering why I’ve been ignoring you (I like you! I promise!), now you know.
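
For the curious, the row-scanning step looks very roughly like the sketch below. This is not the actual Neurosynth parser; it's a stripped-down illustration that assumes a tidy, well-formed table and treats any three adjacent integer cells within a plausible range as an x/y/z coordinate. The sample table, the `extract_coordinates` function, and the 100 mm limit are all made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table snippet; real publisher HTML is much messier
SAMPLE = """
<table>
  <tr><th>Region</th><th>x</th><th>y</th><th>z</th><th>t</th></tr>
  <tr><td>L inferior frontal gyrus</td><td>-42</td><td>18</td><td>24</td><td>5.1</td></tr>
  <tr><td>R intraparietal sulcus</td><td>36</td><td>\u221258</td><td>50</td><td>4.7</td></tr>
  <tr><td>Note: n = 16</td><td></td><td></td><td></td><td></td></tr>
</table>
"""

INTEGER = re.compile(r"^-?\d{1,3}$")

def cell_text(cell):
    """Flatten a cell to text and normalize the typographic minus sign."""
    return "".join(cell.itertext()).strip().replace("\u2212", "-")

def extract_coordinates(table_html, limit=100):
    """Return (x, y, z) tuples from rows that look like brain activations."""
    table = ET.fromstring(table_html.strip())
    found = []
    for row in table.findall(".//tr"):
        cells = [cell_text(c) for c in row if c.tag in ("td", "th")]
        # Slide a window of three cells across the row
        for i in range(len(cells) - 2):
            triple = cells[i:i + 3]
            if all(INTEGER.match(v) for v in triple):
                xyz = tuple(int(v) for v in triple)
                if all(abs(v) <= limit for v in xyz):
                    found.append(xyz)
                    break  # take at most one coordinate per row
    return found

print(extract_coordinates(SAMPLE))
```

Give this an <img> tag pointing at a JPEG of the same table and it has nothing to work with, which is the whole problem.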

Anyway, on the remote chance that anyone reading this has any sway with people high up at Frontiers, could you please ask them to release their data? Pretty please? Lack of access to data in tables seems to be a pretty common complaint in the data mining community; I’ve talked to other people in the neuroinformatics world who’ve also expressed frustration about it, and I imagine the same is true of people in other disciplines. It’s particularly surprising given that Frontiers is, in theory, an open access publisher. I can see the data in your tables, Frontiers; why won’t you also let me read it?

Okay, I know this kind of stuff doesn’t really interest anyone; I’m just venting. The main point is, Neurosynth is going to be bigger and (very slightly) better in the near future.

what Ben Parker wants you to know about neuroimaging

I have a short opinion piece in the latest issue of The European Health Psychologist that discusses some of the caveats and limits of functional MRI. It’s a short and (I think) pretty readable piece; I touch on a couple of issues I’ve discussed frequently in other papers as well as here on the blog–namely, the relatively low power of most fMRI analyses and the difficulties inherent in drawing causal inferences from neuroimaging results.

More importantly, though, I’ve finally fulfilled my long held goal of sneaking a Spiderman reference into an academic article (though, granted, one that wasn’t peer-reviewed). It would be going too far to say I can die happy now, but at least I can have an extra large serving of ice cream for dessert tonight without feeling guilty*. And no, I’m not going to spoil the surprise by revealing what Spidey has to do with fMRI. Though I will say that if you actually fall for the hook and go read the article just to find that out, you’re likely to be sorely disappointed.

 

* So okay, the truth is, I never, ever feel guilty for eating ice cream, no matter the serving size.

see me flub my powerpoint slides on NIF tv!

 

UPDATE: the webcast is now archived here for posterity.

This is kind of late notice and probably of interest to few people, but I’m giving the NIF webinar tomorrow (or today, depending on your time zone–either way, we’re talking about November 1st). I’ll be talking about Neurosynth, and focusing in particular on the methods and data, since that’s what NIF (which stands for Neuroscience Information Framework) is all about. Assuming all goes well, the webinar should start at 11 am PST. But since I haven’t done a webcast of any kind before, and have a surprising knack for breaking audiovisual equipment at a distance, all may not go well. Which I suppose could make for a more interesting presentation. In any case, here’s the abstract:

The explosive growth of the human neuroimaging literature has led to major advances in understanding of human brain function, but has also made aggregation and synthesis of neuroimaging findings increasingly difficult. In this webinar, I will describe a highly automated brain mapping framework called NeuroSynth that uses text mining, meta-analysis and machine learning techniques to generate a large database of mappings between neural and cognitive states. The NeuroSynth framework can be used to automatically conduct large-scale, high-quality neuroimaging meta-analyses, address long-standing inferential problems in the neuroimaging literature (e.g., how to infer cognitive states from distributed activity patterns), and support accurate ‘decoding’ of broad cognitive states from brain activity in both entire studies and individual human subjects. This webinar will focus on (a) the methods used to extract the data, (b) the structure of the resulting (publicly available) datasets, and (c) some major limitations of the current implementation. If time allows, I’ll also provide a walk-through of the associated web interface (http://neurosynth.org) and will provide concrete examples of some potential applications of the framework.

There’s some more info (including details about how to connect, which might be important) here. And now I’m off to prepare my slides. And script some evasive and totally non-committal answers to deploy in case of difficult questions from the peanut gallery (er, respected audience).

brain-based prediction of ADHD–now with 100% fewer brains!

UPDATE 10/13: a number of commenters left interesting comments below addressing some of the issues raised in this post. I expand on some of them here.

The ADHD-200 Global Competition, announced earlier this year, was designed to encourage researchers to develop better tools for diagnosing mental health disorders on the basis of neuroimaging data:

The competition invited participants to develop diagnostic classification tools for ADHD diagnosis based on functional and structural magnetic resonance imaging (MRI) of the brain. Applying their tools, participants provided diagnostic labels for previously unlabeled datasets. The competition assessed diagnostic accuracy of each submission and invited research papers describing novel, neuroscientific ideas related to ADHD diagnosis. Twenty-one international teams, from a mix of disciplines, including statistics, mathematics, and computer science, submitted diagnostic labels, with some trying their hand at imaging analysis and psychiatric diagnosis for the first time.

Data for the contest came from several research labs around the world, who donated brain scans from participants with ADHD (both inattentive and hyperactive subtypes) as well as healthy controls. The data were made openly available through the International Neuroimaging Data-sharing Initiative, and nicely illustrate the growing movement towards openly sharing large neuroimaging datasets and promoting their use in applied settings. It is, in virtually every respect, a commendable project.

Well, the results of the contest are now in–and they’re quite interesting. The winning team, from Johns Hopkins, came up with a method that performed substantially above chance and showed particularly high specificity (i.e., it made few false diagnoses, though it missed a lot of true ADHD cases). And all but one team performed above chance, demonstrating that the imaging data have at least some utility (though currently not a huge amount) in diagnosing ADHD and ADHD subtype. There are some other interesting results on the page worth checking out.

But here’s hands-down the most entertaining part of the results, culled from the “Interesting Observations” section:

The team from the University of Alberta did not use imaging data for their prediction model. This was not consistent with intent of the competition. Instead they used only age, sex, handedness, and IQ. However, in doing so they obtained the most points, outscoring the team from Johns Hopkins University by 5 points, as well as obtaining the highest prediction accuracy (62.52%).

…or to put it differently, if you want to predict ADHD status using the ADHD-200 data, your best bet is to not really use the ADHD-200 data! At least, not the brain part of it.

I say this with tongue embedded firmly in cheek, of course; the fact that the Alberta team didn’t use the imaging data doesn’t mean imaging data won’t ultimately be useful for diagnosing mental health disorders. It remains quite plausible that ten or twenty years from now, structural or functional MRI scans (or some successor technology) will be the primary modality used to make such diagnoses. And the way we get from here to there is precisely by releasing these kinds of datasets and promoting this type of competition. So on the whole, I think this should actually be seen as a success story for the field of human neuroimaging–especially since virtually all of the teams performed above chance using the imaging data.

That said, there’s no question this result also serves as an important and timely reminder that we’re still in the very early days of brain-based prediction. Right now anyone who claims they can predict complex real-world behaviors better using brain imaging data than using (much cheaper) behavioral data has a lot of ‘splainin to do. And there’s a good chance that they’re trying to sell you something (like, cough, neuromarketing ‘technology’).

the New York Times blows it big time on brain imaging

The New York Times has a terrible, terrible Op-Ed piece today by Martin Lindstrom (who I’m not going to link to, because I don’t want to throw any more bones his way). If you believe Lindstrom, you don’t just like your iPhone a lot; you love it. Literally. And the reason you love it, shockingly, is your brain:

Earlier this year, I carried out an fMRI experiment to find out whether iPhones were really, truly addictive, no less so than alcohol, cocaine, shopping or video games. In conjunction with the San Diego-based firm MindSign Neuromarketing, I enlisted eight men and eight women between the ages of 18 and 25. Our 16 subjects were exposed separately to audio and to video of a ringing and vibrating iPhone.

But most striking of all was the flurry of activation in the insular cortex of the brain, which is associated with feelings of love and compassion. The subjects’ brains responded to the sound of their phones as they would respond to the presence or proximity of a girlfriend, boyfriend or family member.

In short, the subjects didn’t demonstrate the classic brain-based signs of addiction. Instead, they loved their iPhones.

There’s so much wrong with just these three short paragraphs (to say nothing of the rest of the article, which features plenty of other whoppers) that it’s hard to know where to begin. But let’s try. Take first the central premise–that an fMRI experiment could help determine whether iPhones are no less addictive than alcohol or cocaine. The tacit assumption here is that all the behavioral evidence you could muster–say, from people’s reports about how they use their iPhones, or clinicians’ observations about how iPhones affect their users–isn’t sufficient to make that determination; to “really, truly” know if something’s addictive, you need to look at what the brain is doing when people think about their iPhones. This idea is absurd inasmuch as addiction is defined on the basis of its behavioral consequences, not (right now, anyway) by the presence or absence of some biomarker. What makes someone an alcoholic is the fact that they’re dependent on alcohol, have trouble going without it, find that their alcohol use interferes with multiple aspects of their day-to-day life, and generally suffer functional impairment because of it–not the fact that their brain lights up when they look at pictures of Johnnie Walker Red. If someone couldn’t stop drinking–to the point where they lost their job, family, and friends–but their brain failed to display a putative biomarker for addiction, it would be strange indeed to say “well, you show all the signs, but I guess you’re not really addicted to alcohol after all.”

Now, there may come a day (and it will be a great one) when we have biomarkers sufficiently accurate that they can stand in for the much more tedious process of diagnosing someone’s addiction the conventional way. But that day is, to put it gently, a long way off. Right now, if you want to know if iPhones are addictive, the best way to do that is to, well, spend some time observing and interviewing iPhone users (and some quantitative analysis would be helpful).

Of course, it’s not clear what Lindstrom thinks an appropriate biomarker for addiction would be in any case. Presumably it would have something to do with the reward system; but what? Suppose Lindstrom had seen robust activation in the ventral striatum–a critical component of the brain’s reward system–when participants gazed upon the iPhone: what then? Would this have implied people are addicted to iPhones? But people also show striatal activity when gazing on food, money, beautiful faces, and any number of other stimuli. Does that mean the average person is addicted to all of the above? A marker of pleasure or reward, maybe (though even that’s not certain), but addiction? How could a single fMRI experiment with 16 subjects viewing pictures of iPhones confirm or disconfirm the presence of addiction? Lindstrom doesn’t say. I suppose he has good reason not to say: if he really did have access to an accurate fMRI-based biomarker for addiction, he’d be in a position to make millions (billions?) off the technology. To date, no one else has come close to identifying a clinically accurate fMRI biomarker for any kind of addiction (for more technical readers, I’m talking here about cross-validated methods that have both sensitivity and specificity comparable to traditional approaches when applied to new subjects–not individual studies that claim 90% within-sample classification accuracy based on simple regression models). So we should, to put it mildly, be very skeptical that Lindstrom’s study was ever in a position to do what he says it was designed to do.
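For the more technical readers, here is a rough sketch of why accuracy measured on the same subjects used to build a classifier says very little. Everything in it is made up for illustration: the "brain data" are pure noise, the labels are coin flips, and I've swapped in a 1-nearest-neighbor classifier (rather than a regression model) because it makes the point with very little code.

```python
import random


def nearest_label(x, data, skip=None):
    """Label of the training point closest to x, optionally ignoring one index."""
    candidates = (i for i in range(len(data)) if i != skip)
    best = min(
        candidates,
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, data[i][0])),
    )
    return data[best][1]


def accuracies(n_subjects=200, n_features=10, seed=0):
    """Return (within-sample, leave-one-out) accuracy on pure noise."""
    rng = random.Random(seed)
    # Features and labels are independent, so there is nothing real to learn.
    data = [
        ([rng.gauss(0.0, 1.0) for _ in range(n_features)], rng.choice([0, 1]))
        for _ in range(n_subjects)
    ]
    # Within-sample: each subject is allowed to match itself.
    within = sum(nearest_label(x, data) == y for x, y in data) / n_subjects
    # Leave-one-out: each subject is classified using everyone else only.
    held_out = sum(
        nearest_label(x, data, skip=i) == y for i, (x, y) in enumerate(data)
    ) / n_subjects
    return within, held_out


within, held_out = accuracies()
print(f"within-sample accuracy: {within:.2f}")
print(f"leave-one-out accuracy: {held_out:.2f}")
```

The within-sample number is perfect by construction, since every subject is its own nearest neighbor, while the leave-one-out number should hover around chance. Only the second kind of number tells you anything about how a biomarker would do on new people.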

We should also ask all sorts of salient and important questions about who the people are who are supposedly in love with their iPhones. Who’s the “You” in the “You Love Your iPhone” of the title? We don’t know, because we don’t know who the participants in Lindstrom’s sample were, aside from the fact that they were eight men and eight women aged 18 to 25. But we’d like to know some other important things. For instance, were they selected for specific characteristics? Were they, say, already avid iPhone users? Did they report loving, or being addicted to, their iPhones? If so, would it surprise us that people chosen for their close attachment to their iPhones also showed brain activity patterns typical of close attachment? (Which, incidentally, they actually don’t–but more on that below.) And if not, are we to believe that the average person pulled off the street–who probably has limited experience with iPhones–really responds to the sound of their phones “as they would respond to the presence or proximity of a girlfriend, boyfriend or family member”? Is the takeaway message of Lindstrom’s Op-Ed that iPhones are actually people, as far as our brains are concerned?

In fairness, space in the Times is limited, so maybe it’s not fair to demand this level of detail in the Op-Ed itself. But the bigger problem is that we have no way of evaluating Lindstrom’s claims, period, because (as far as I can tell), his study hasn’t been published or peer-reviewed anywhere. Presumably, it’s proprietary information that belongs to the neuromarketing firm in question. Which is to say, the NYT is basically giving Lindstrom license to talk freely about scientific-sounding findings that can’t actually be independently confirmed, disputed, or critiqued by members of the scientific community with expertise in the very methods Lindstrom is applying (expertise which, one might add, he himself lacks). For all we know, he could have made everything up. To be clear, I don’t really think he did make everything up–but surely, somewhere in the editorial process someone at the NYT should have stepped in and said, “hey, these are pretty strong scientific claims; is there any way we can make your results–on which your whole article hangs–available for other experts to examine?”

This brings us to what might be the biggest whopper of all, and the real driver of the article title: the claim that “most striking of all was the flurry of activation in the insular cortex of the brain, which is associated with feelings of love and compassion”. Russ Poldrack already tore this statement to shreds earlier this morning:

Insular cortex may well be associated with feelings of love and compassion, but this hardly proves that we are in love with our iPhones.  In Tal Yarkoni’s recent paper in Nature Methods, we found that the anterior insula was one of the most highly activated part of the brain, showing activation in nearly 1/3 of all imaging studies!  Further, the well-known studies of love by Helen Fisher and colleagues don’t even show activation in the insula related to love, but instead in classic reward system areas.  So far as I can tell, this particular reverse inference was simply fabricated from whole cloth.  I would have hoped that the NY Times would have learned its lesson from the last episode.

But you don’t have to take Russ’s word for it; if you surf for a few terms on our Neurosynth website, making sure to select “forward inference” under image type, you’ll notice that the insula shows up for almost everything. That’s not an accident; it’s because the insula (or at least the anterior part of the insula) plays a very broad role in goal-directed cognition. It really is activated when you’re doing almost anything that involves, say, following instructions an experimenter gave you, or attending to external stimuli, or mulling over something salient in the environment. You can see this pretty clearly in this modified figure from our Nature Methods paper (I’ve circled the right insula):

Proportion of studies reporting activation at each voxel

The insula is one of a few ‘hotspots’ where activation is reported very frequently in neuroimaging articles (the other major one being the dorsal medial frontal cortex). So, by definition, there can’t be all that much specificity to what the insula is doing, since it pops up so often. To put it differently, as Russ and others have repeatedly pointed out, the fact that a given region activates when people are in a particular psychological state (e.g., love) doesn’t give you license to conclude that that state is present just because you see activity in the region in question. If language, working memory, physical pain, anger, visual perception, motor sequencing, and memory retrieval all activate the insula, then knowing that the insula is active is of very little diagnostic value. That’s not to say that some psychological states might not be more strongly associated with insula activity (again, you can see this on Neurosynth if you switch the image type to ‘reverse inference’ and browse around); it’s just that, probabilistically speaking, the mere fact that the insula is active gives you very little basis for saying anything concrete about what people are experiencing.
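If you want to see the probabilistic argument in numbers, here is a back-of-the-envelope sketch using Bayes’ rule. The function name and every probability in it are hypothetical; I picked a base rate of roughly one in three for activation because that’s about how often the anterior insula shows up across studies, and invented the rest.

```python
def posterior(p_act_given_state, p_state, p_act_given_other):
    """P(state | region active), computed with Bayes' rule."""
    p_active = (
        p_act_given_state * p_state
        + p_act_given_other * (1.0 - p_state)
    )
    return p_act_given_state * p_state / p_active


# Hypothetical prior: 5% of experimental contexts involve the state (e.g., love).
prior = 0.05

# A region like the insula that is active in about a third of everything else.
promiscuous = posterior(
    p_act_given_state=0.40, p_state=prior, p_act_given_other=0.33
)

# An imaginary, highly selective region that almost never activates otherwise.
selective = posterior(
    p_act_given_state=0.40, p_state=prior, p_act_given_other=0.02
)

print(f"prior:              {prior:.2f}")
print(f"promiscuous region: {promiscuous:.2f}")
print(f"selective region:   {selective:.2f}")
```

With these made-up inputs, seeing activity in the promiscuous region nudges the probability from 0.05 to about 0.06, whereas the same observation in the selective region pushes it to about 0.51. The inference is only as good as the region’s selectivity, and the insula doesn’t have much.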

In fact, to account for Lindstrom’s findings, you don’t have to appeal to love or addiction at all. There’s a much simpler way to explain why seeing or hearing an iPhone might elicit insula activation. For most people, the onset of visual or auditory stimulation is a salient event that causes redirection of attention to the stimulated channel. I’d be pretty surprised, actually, if you could present any picture or sound to participants in an fMRI scanner and not elicit robust insula activity. Orienting and sustaining attention to salient things seems to be a big part of what the anterior insula is doing (whether or not that’s ultimately its ‘core’ function). So the most appropriate conclusion to draw from the fact that viewing iPhone pictures produces increased insula activity is something vague like “people are paying more attention to iPhones”, or “iPhones are particularly salient and interesting objects to humans living in 2011.” Not something like “no, really, you love your iPhone!”

In sum, the NYT screwed up. Lindstrom appears to have a habit of making overblown claims about neuroimaging evidence, so it’s not surprising he would write this type of piece; but the NYT editorial staff is supposedly there to filter out precisely this kind of pseudoscientific advertorial. And they screwed up. It’s a particularly big screw-up given that (a) as of right now, Lindstrom’s Op-Ed is the single most emailed article on the NYT site, and (b) this incident almost perfectly recapitulates another NYT article 4 years ago in which some neuroscientists and neuromarketers wrote a grossly overblown Op-Ed claiming to be able to infer, in detail, people’s opinions about presidential candidates. That time, Russ Poldrack and a bunch of other big names in cognitive neuroscience wrote a concise rebuttal that appeared in the NYT (but unfortunately, isn’t linked to from the original Op-Ed, so anyone who stumbles across the original now has no way of knowing how ridiculous it is). One hopes the NYT follows up in similar fashion this time around. They certainly owe it to their readers–some of whom, if you believe Lindstrom, are now in danger of dumping their current partners for their iPhones.

h/t: Molly Crockett

does functional specialization exist in the language system?

One of the central questions in cognitive neuroscience–according to some people, at least–is how selective different chunks of cortex are for specific cognitive functions. The paradigmatic examples of functional selectivity are pretty much all located in sensory cortical regions or adjacent association cortices. For instance, the fusiform face area (FFA) is so named because it (allegedly) responds selectively to faces but not to other stimuli. Other regions with varying selectivity profiles are similarly named: the visual word form area (VWFA), parahippocampal place area (PPA), extrastriate body area (EBA), and so on.

In a recent review paper, Fedorenko and Kanwisher (2009) sought to apply insights from the study of functionally selective visual regions to the study of language. They posed the following question with respect to the neuroimaging of language in the title of their paper: Why hasn’t a clearer picture emerged? And they gave the following answer: it’s because brains differ from one another, stupid.

Admittedly, I’m paraphrasing; they don’t use exactly those words. But the basic point they make is that it’s difficult to identify functionally selective regions when you’re averaging over a bunch of very different brains. And the solution they propose–again, imported from the study of visual areas–is to identify potentially selective language regions-of-interest (ROIs) on a subject-specific basis rather than relying on group-level analyses.

The Fedorenko and Kanwisher paper apparently didn’t please Greg Hickok of Talking Brains, who’s done a lot of very elegant work on the neurobiology of language.  A summary of Hickok’s take:

What I found a bit on the irritating side though was the extremely dim and distressingly myopic view of progress in the field of the neural basis of language.

He objects to Fedorenko and Kanwisher on several grounds, and the post is well worth reading. But since I’m very lazy (er, tired), I’ll just summarize his points as follows:

  • There’s more functional specialization in the language system than F&K give the field credit for
  • The use of subject-specific analyses in the domain of language isn’t new, and many researchers (including Hickok) have used procedures similar to those F&K recommend in the past
  • Functional selectivity is not necessarily a criterion we should care about all that much anyway

As you might expect, F&K disagree with Hickok on these points, and Hickok was kind enough to post their response. He then responded to their response in the comments (which are also worth reading), which in turn spawned a back-and-forth with F&K, a cameo by Brad Buchsbaum (who posted his own excellent thoughts on the matter here), and eventually, an intervention by a team of professional arbitrators. Okay, I made that last bit up; it was a very civil disagreement, and is exactly what scientific debates on the internet should look like, in my opinion.

Anyway, rather than revisit the entire thread, which you can read for yourself, I’ll just summarize my thoughts:

  • On the whole, I think my view lines up pretty closely with Hickok’s and Buchsbaum’s. Although I’m very far from an expert on the neurobiology of language (is there a word in English for someone who’s the diametric opposite of an expert–i.e., someone who consistently and confidently asserts exactly the wrong thing? Cause that’s what I am), I agree with Hickok’s argument that the temporal poles show a response profile that looks suspiciously like sentence- or narrative-specific processing (I have a paper on the neural mechanisms of narrative comprehension that supports that claim to some extent), and think F&K’s review of the literature is probably not as balanced as it could have been.
  • More generally, I agree with Hickok that demonstrating functional specialization isn’t necessarily that important to the study of language (or most other domains). This seems to be a major point of contention for F&K, but I don’t think they make a very strong case for their view. They suggest that they “are not sure what other goals (besides understanding a region’s computations) could drive studies aimed at understanding how functionally specialized a region is,” which I think is reasonable, but affirms the consequent. Hickok isn’t saying there’s no reason to search for functional specialization in the F&K sense; as I read him, he’s simply saying that you can study the nature of neural computation in lots of interesting ways that don’t require you to demonstrate functional specialization to the degree F&K seem to require. Seems hard to disagree with that.
  • Buchsbaum points out that it’s questionable whether there are any brain regions that meet the criteria F&K set out for functional specialization–namely that “A brain region R is specialized for cognitive function x if this region (i) is engaged in tasks that rely on cognitive function x, and (ii) is not engaged in tasks that do not rely on cognitive function x.” Buchsbaum and Hickok both point out that the two examples F&K give of putatively specialized regions (the FFA and the temporo-parietal junction, which some people believe is selectively involved in theory of mind) are hardly uncontroversial. Plenty of people have argued that the FFA isn’t really selective to faces, and even more people have argued that the TPJ isn’t selective to theory of mind. As far as I can tell, F&K don’t really address this issue in the comments. They do refer to a recent paper of Kanwisher’s that discusses the evidence for functional specificity in the FFA, but I’m not sure the argument made in that paper is itself uncontroversial, and in any case, Kanwisher does concede that there’s good evidence for at least some representation of non-preferred stimuli (i.e., non-faces in the FFA). In any case, the central question here is whether or not F&K really unequivocally believe that FFA and TPJ aren’t engaged by any tasks that don’t involve face or theory of mind processing. If not, then it’s unfair to demand or expect the same of regions implicated in language.
  • Although I think there’s a good deal to be said for subject-specific analyses, I’m not as sanguine as F&K that a subject-specific approach offers a remedy to the problems that they perceive afflict the study of the neural mechanisms of language. While there’s no denying that group analyses suffer from a number of limitations, subject-specific analyses have their own weaknesses, which F&K don’t really mention in their paper. One is that such analyses typically require the assumption that two clusters located in slightly different places for different subjects must be carrying out the same cognitive operations if they respond similarly to a localizer task. That’s a very strong assumption for which there’s very little evidence (at least in the language domain)–especially because the localizer task F&K promote in this paper involves a rather strong manipulation that may confound several different aspects of language processing.
    Another problem is that it’s not at all obvious how you determine which regions are the “same” (in their 2010 paper, F&K argue for an algorithmic parcellation approach, but the fact that you get sensible-looking results is no guarantee that your parcellation actually reflects meaningful functional divisions in individual subjects). And yet another is that serious statistical problems can arise in cases where one or more subjects fail to show activation in a putative region (which is generally the norm rather than the exception). Say you have 25 subjects in your sample, and 7 don’t show activation anywhere in a region that can broadly be called Broca’s area. What do you do? You can’t just throw those subjects out of the analysis, because that would grossly and misleadingly inflate your effect sizes. Conversely, you can’t just identify any old region that does activate and lump it in with the regions identified in all the other subjects. This is a very serious problem, but it’s one that group analyses, for all their weaknesses, don’t have to contend with.
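To put a rough number on the point about throwing out subjects who don’t show activation, here is a small simulation sketch. All of the parameters are invented for illustration (25 subjects per sample, a true standardized effect of 0.5, and "showing activation" defined as having an effect above zero); it is not a model of any particular localizer or dataset.

```python
import random
import statistics


def cohens_d(values):
    """One-sample standardized effect size: mean divided by sample SD."""
    return statistics.mean(values) / statistics.stdev(values)


def simulate(n_subjects=25, true_mean=0.5, threshold=0.0, n_sims=2000, seed=0):
    """Average effect size with all subjects vs. 'activators' only."""
    rng = random.Random(seed)
    d_all, d_kept = [], []
    for _ in range(n_sims):
        # Each subject's effect in the region, in SD units.
        effects = [rng.gauss(true_mean, 1.0) for _ in range(n_subjects)]
        # Keep only subjects who "show activation" in the region.
        kept = [e for e in effects if e > threshold]
        if len(kept) < 2:
            continue  # cannot compute an SD from fewer than two subjects
        d_all.append(cohens_d(effects))
        d_kept.append(cohens_d(kept))
    return statistics.mean(d_all), statistics.mean(d_kept)


full_sample, activators_only = simulate()
print(f"mean d, all subjects:    {full_sample:.2f}")
print(f"mean d, activators only: {activators_only:.2f}")
```

Under these assumptions the activators-only estimate comes out far larger than the full-sample one, because dropping the subjects below threshold both raises the mean and shrinks the variance. That is the inflation you buy when non-activating subjects are quietly excluded.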

Disagreements aside, I think it’s really great to see serious scientific discussion taking place in this type of forum. In principle, this is the kind of debate that should be resolved (or not) in the peer-reviewed literature; in practice, peer review is slow, writing full-blown articles takes time, and journal space is limited. So I think blogs have a really important role to play in scientific communication, and frankly, I envy Hickok and Poeppel for the excellent discussion they consistently manage to stimulate over at Talking Brains!