Several people left enlightening comments on my last post about the ADHD-200 Global Competition results, so I thought I’d bump some of them up and save you the trip back there (though the others are worth reading too!), since they’re salient to some of the issues raised in the last post.
Matthew Brown, the project manager on the Alberta team that was disqualified on a minor technicality (they cough didn’t use any of the imaging data), pointed out that they actually did initially use the imaging data (answering Sanjay’s question in another comment)–it just didn’t work very well:
For the record, we tried a pile of imaging-based approaches. As a control, we also did classification with age, gender, etc. but no imaging data. It was actually very frustrating for us that none of our imaging-based methods did better than the no imaging results. It does raise some very interesting issues.
He also pointed out that the (relatively) poor performance of the imaging-based classifiers isn’t cause for alarm:
I second your statement that we’ve only scratched the surface with fMRI-based diagnosis (and prognosis!). There’s a lot of unexplored potential here. For example, the ADHD-200 fMRI scans are resting state scans. I suspect that fMRI using an attention task could work better for diagnosing ADHD. Resting state fMRI has also shown promise for diagnosing other conditions (eg: schizophrenia – see Shen et al.).
I think big, multi-centre datasets like the ADHD-200 are the future though the organizational and political issues with this approach are non-trivial. I’m extremely impressed with the ADHD-200 organizers and collaborators for having the guts and perseverance to put this data out there. I intend to follow their example with the MR data that we collect in future.
I couldn’t agree more!
Jesse Brown explains why his team didn’t use the demographics alone:
I was on the UCLA/Yale team, we came in 4th place with 54.87% accuracy. We did include all the demographic measures in our classifier along with a boatload of neuroimaging measures. The breakdown of demographics by site in the training data did show some pretty strong differences in ADHD vs. TD. These differences were somewhat site dependent, eg girls from OHSU with high IQ are very likely TD. We even considered using only demographics at some point (or at least one of our wise team members did) but I thought that was preposterous. I think we ultimately figured that the imaging data may generalize better to unseen examples, particularly for sites that only had data in the test dataset (Brown). I guess one lesson is to listen to the data and not to fall in love with your method. Not yet anyway.
I imagine some of the other groups probably had a similar experience of trying to use the demographic measures alone and realizing they did better than the imaging data, but sticking with the latter anyway. Seems like a reasonable decision, though ultimately, I still think it’s a good thing the Alberta team used only the demographic variables, since their results provided an excellent benchmark against which to compare the performance of the imaging-based models. Sanjay Srivastava captured this sentiment nicely:
Two words: incremental validity. This kind of contest is valuable, but I’d like to see imaging pitted against behavioral data routinely. The fact that they couldn’t beat a prediction model built on such basic information should be humbling to advocates of neurodiagnosis (and shows what a low bar “better than chance” is). The real question is how an imaging-based diagnosis compares to a clinician using standard diagnostic procedures. Both “which is better” (which accounts for more variance as a standalone prediction) and “do they contain non-overlapping information” (if you put both predictions into a regression, does one or both contribute unique variance).
And Russ Poldrack raised a related question about what it is that the imaging-based models are actually doing:
What is amazing is that the Alberta team only used age, sex, handedness, and IQ.Â That suggests to me that any successful imaging-based decoding could have been relying upon correlates of those variables rather than truly decoding a correlate of the disease.
This seems quite plausible inasmuch as age, sex, and IQ are pretty powerful variables, and there are enormous literatures on their structural and functional correlates. While there probably is at least some incremental information added by the imaging data (and very possibly a lot of it), it’s currently unclear just how much–and whether that incremental variance might also be picked up by (different) behavioral variables. Ultimately, time (and more competitions like this one!) will tell.