trouble with biomarkers and press releases

The latest issue of the Journal of Neuroscience contains an interesting article by Ecker et al in which the authors attempted to classify people with autism spectrum disorder (ASD) and health controls based on their brain anatomy, and report achieving “a sensitivity and specificity of up to 90% and 80%, respectively.” Before unpacking what that means, and why you probably shouldn’t get too excited (about the clinical implications, at any rate; the science is pretty cool), here’s a snippet from the decidedly optimistic press release that accompanied the study:

“Scientists funded by the Medical Research Council (MRC) have developed a pioneering new method of diagnosing autism in adults. For the first time, a quick brain scan that takes just 15 minutes can identify adults with autism with over 90% accuracy. The method could lead to the screening for autism spectrum disorders in children in the future.”

If you think this sounds too good to be true, that’s because it is. Carl Heneghan explains why in an excellent article in the Guardian:

How the brain scans results are portrayed is one of the simplest mistakes in interpreting diagnostic test accuracy to make. What has happened is, the sensitivity has been taken to be the positive predictive value, which is what you want to know: if I have a positive test do I have the disease? Not, if I have the disease, do I have a positive test? It would help if the results included a measure called the likelihood ratio (LR), which is the likelihood that a given test result would be expected in a patient with the target disorder compared to the likelihood that the same result would be expected in a patient without that disorder. In this case the LR is 4.5. We’ve put up an article if you want to know more on how to calculate the LR.

In the general population the prevalence of autism is 1 in 100; the actual chances of having the disease are 4.5 times more likely given a positive test. This gives a positive predictive value of 4.5%; about 5 in every 100 with a positive test would have autism.

For those still feeling confused and not convinced, let’s think of 10,000 children. Of these 100 (1%) will have autism, 90 of these 100 would have a positive test, 10 are missed as they have a negative test: there’s the 90% reported accuracy by the media.

But what about the 9,900 who don’t have the disease? 7,920 of these will test negative (the specificity3 in the Ecker paper is 80%). But, the real worry though, is the numbers without the disease who test positive. This will be substantial: 1,980 of the 9,900 without the disease. This is what happens at very low prevalences, the numbers falsely misdiagnosed rockets. Alarmingly, of the 2,070 with a positive test, only 90 will have the disease, which is roughly 4.5%.

In other words, if you screened everyone in the population for autism, and assume the best about the classifier reported in the JNeuro article (e.g., that the sample of 20 ASD participants they used is perfectly representative of the broader ASD population, which seems unlikely), only about 1 in 20 people who receive a positive diagnosis would actually deserve one.

Ecker et al object to this characterization, and reply to Heneghan in the comments (through the MRC PR office):

Our test was never designed to screen the entire population of the UK. This is simply not practical in terms of costs and effort, and besides totally  unjustified- why would we screen everybody in the UK for autism if there is no evidence whatsoever that an individual is affected?. The same case applies to other diagnostic tests. Not every single individual in the UK is tested for HIV. Clearly this would be too costly and unnecessary. However, in the group of individuals that are test for the virus, we can be very confident that if the test is positive that means a patient is infected. The same goes for our approach.

Essentially, the argument is that, since people would presumably be sent for an MRI scan because they were already under consideration for an ASD diagnosis, and not at random, the false positive rate would in fact be much lower than 95%, and closer to the 20% reported in the article.

One response to this reply–which is in fact Heneghan’s response in the comments–is to point out that the pre-test probability of ASD would need to be pretty high already in order for the classifier to add much. For instance, even if fully 30% of people who were sent for a scan actually had ASD, the posterior probability of ASD given a positive result would still be only 66% (Heneghan’s numbers, which I haven’t checked). Heneghan nicely contrasts these results with the standard for HIV testing, which “reports sensitivity of 99.7% and specificity of 98.5% for enzyme immunoassay.” Clearly, we have a long way to go before doctors can order MRI-based tests for ASD and feel reasonably confident that a positive result is sufficient grounds for an ASD diagnosis.

Setting Heneghan’s concerns about base rates aside, there’s a more general issue that he doesn’t touch on. It’s one that’s not specific to this particular study, and applies to nearly all studies that attempt to develop “biomarkers” for existing disorders. The problem is that the sensitivity and specificity values that people report for their new diagnostic procedure in these types of studies generally aren’t the true parameters of the procedure. Rather, they’re the sensitivity and specificity under the assumption that the diagnostic procedures used to classify patients and controls in the first place are themselves correct. In other words, in order to believe the results, you have to assume that the researchers correctly classified the subjects into patient and control groups using other procedures. In cases where the gold standard test used to make the initial classification is known to have near 100% sensitivity and specificity (e.g., for the aforementioned HIV tests), one can reasonably ignore this concern. But when we’re talking about mental health disorders, where diagnoses are fuzzy and borderline cases abound, it’s very likely that the “gold standard” isn’t really all that great to begin with.

Concretely,  studies that attempt to develop biomarkers for mental health disorders face two substantial problems. One is that it’s extremely unlikely that the clinical diagnoses are ever perfect; after all, if they were perfect, there’d be little point in trying to develop other diagnostic procedures! In this particular case, the authors selected subjects into the ASD group based on standard clinical instruments and structured interviews. I don’t know that there are many clinicians who’d claim with a straight face that the current diagnostic criteria for ASD (and there are multiple sets to choose from!) are perfect. From my limited knowledge, the criteria for ASD seem to be even more controversial than those for most other mental health disorders (which is saying something, if you’ve been following the ongoing DSM-V saga). So really, the accuracy of the classifier in the present study, even if you put the best face on it and ignore the base rate issue Heneghan brings up, is undoubtedly south of the 90% sensitivity / 80% specificity the authors report. How much south, we just don’t know, because we don’t really have any independent, objective way to determine who “really” should get an ASD diagnosis and who shouldn’t (assuming you think it makes sense to make that kind of dichotomous distinction at all). But 90% accuracy is probably a pipe dream, if for no other reason than it’s hard to imagine that level of consensus about autism spectrum diagnoses.

The second problem is that, because the researchers are using the MRI-based classifier to predict the clinician-based diagnosis, it simply isn’t possible for the former to exceed the accuracy of the latter. That bears repeating, because it’s important: no matter how good the MRI-based classifier is, it can only be as good as the procedures used to make the original diagnosis, and no better. It cannot, by definition, make diagnoses that are any more accurate than the clinicians who screened the participants in the authors’ ASD sample. So when you see the press release say this:

For the first time, a quick brain scan that takes just 15 minutes can identify adults with autism with over 90% accuracy.

You should really read it as this:

The method relies on structural (MRI) brain scans and has an accuracy rate approaching that of conventional clinical diagnosis.

That’s not quite as exciting, obviously, but it’s more accurate.

To be fair, there’s something of a catch-22 here, in that the authors didn’t really have a choice about whether or not to diagnose the ASD group using conventional criteria. If they hadn’t, reviewers and other researchers would have complained that we can’t tell if the ASD group is really an ASD group, because they authors used non-standard criteria. Under the circumstances, they did the only thing they could do. But that doesn’t change the fact that it’s misleading to intimate, as the press release does, that the new procedure might be any better than the old ones. It can’t be, by definition.

Ultimately, if we want to develop brain-based diagnostic tools that are more accurate than conventional clinical diagnoses, we’re going to need to show that these tools are capable of predicting meaningful outcomes that clinician diagnoses can’t. This isn’t an impossible task, but it’s a very difficult one. One approach you could take, for instance, would be to compare the ability of clinician diagnosis and MRI-based diagnosis to predict functional outcomes among subjects at a later point in time. If you could show that MRI-based classification of subjects at an early age was a stronger predictor of receiving an ASD diagnosis later in life than conventional criteria, that would make a really strong case for using the former approach in the real world. Short of that type of demonstration though, the only reason I can imagine wanting to use a procedure that was developed by trying to duplicate the results of an existing procedure is in the event that the new procedure is substantially cheaper or more efficient than the old one. Meaning, it would be reasonable enough to say “well, look, we don’t do quite as well with this approach as we do with a full clinical evaluation, but at least this new approach costs much less.” Unfortunately, that’s not really true in this case, since the price of even a short MRI scan is generally going to outweigh that of a comprehensive evaluation by a psychiatrist or psychotherapist. And while it could theoretically be much faster to get an MRI scan than an appointment with a mental health professional, I suspect that that’s not generally going to be true in practice either.

Having said all that, I hasten to note that all this is really a critique of the MRC press release and subsequently lousy science reporting, and not of the science itself. I actually think the science itself is very cool (but the Neuroskeptic just wrote a great rundown of the methods and results, so there’s not much point in me describing them here). People have been doing really interesting work with pattern-based classifiers for several years now in the neuroimaging literature, but relatively few studies have applied this kind of technique to try and discriminate between different groups of individuals in a clinical setting. While I’m not really optimistic that the technique the authors introduce in this paper is going to change the way diagnosis happens any time soon (or at least, I’d argue that it shouldn’t), there’s no question that the general approach will be an important piece of future efforts to improve clinical diagnoses by integrating biological data with existing approaches. But that’s not going to happen overnight, and in the meantime, I think it’s pretty irresponsible of the MRC to be issuing press releases claiming that its researchers can diagnose autism in adults with 90% accuracy.

ResearchBlogging.orgEcker C, Marquand A, Mourão-Miranda J, Johnston P, Daly EM, Brammer MJ, Maltezos S, Murphy CM, Robertson D, Williams SC, & Murphy DG (2010). Describing the brain in autism in five dimensions–magnetic resonance imaging-assisted diagnosis of autism spectrum disorder using a multiparameter classification approach. The Journal of neuroscience : the official journal of the Society for Neuroscience, 30 (32), 10612-23 PMID: 20702694

elsewhere on the internets…

Some stuff I’ve found interesting in the last week or two:

Nicholas Felton released his annual report of… himself. It’s a personal annual report on Felton, as seen through the eyes of a bunch of friends, family, and strangers:

Each day in 2009, I asked every person with whom I had a meaningful encounter to submit a record of this meeting through an online survey. These reports form the heart of the 2009 Annual Report. From parents to old friends, to people I met for the first time, to my dentist… any time I felt that someone had discerned enough of my personality and activities, they were given a card with a URL and unique number to record their experience.

You probably don’t much care about Nicholas Felton’s relationships, moods, or diet, but it’s a neat idea that’s really well executed. And it looks great [via Flowing Data].

Hackademe is a serialized novel about a man, with an axe, who dislikes professors enough to take them out behind the wood shed and… alright, no, it’s actually “a website devoted to sharing clever uses of technology, software, or modified items to solve problems related to information overload, time management, organization, productivity, and other challenges faced by academics on a daily basis.” Which is pretty cool, except that I have trouble seeing the word “hackademic” in a positive light…

The UK’s General Medical Council finally laid the smack down on the ethically-challenged Andrew Wakefield–he of “vaccines cause autism, and here’s a terrible and possibly fraudulent study to prove it” fame. There’s a very long but very good write-up of the whole debacle here. Unfortunately, the reprimand is really just symbolic at this point, because Wakefield now lives in the US, and isn’t (officially) practicing medicine any more. Instead, he spends his days pumping autistic children full of laxatives. I wish I were joking.

The Neuroskeptic has had a string of great posts in the last couple of weeks. I particularly enjoyed this one, wherein he exposes reveals himself to be an expert on all matters sexual, dopaminergic, and British.

According to a study in Nature, running barefoot may be better for our feet than running in shoes. Turns out that barefoot runners strike the ground with the middle or ball of the foot, greatly reducing the force of impact. This may explain why so many (shod) runners get injured every year, and is supposed to make sense to you if you’re one of those evolutionist folks who think humans evolved to run long distances over the course of millions of years. But since you and I both know god created shoes around the same time he was borrowing Adam’s ribs, we can dispense with that sort of silliness.

The Census Bureau has some ‘splaining to do. Over at Freakonomics, Justin Wolfers discusses a new paper that uncovers massive (and inadvertent) problems with large chunks of census data. The fact that the census bureau screwed up isn’t terribly surprising (though it does call a number of published findings into question); everyone who works with data makes mistakes now and then, and the Census Bureau works with most data than most people. What is surprising is that Census has apparently refused to correct the problem, which is going to leave a lot of people hanging.

Slime mold has evolved the capacity to plan metropolitan transit systems! So claims a study in last week’s issue of Science. Ok, that’s not exactly what the article shows. What Tero et al. do show is that slime mold naturally forms networks that have a structure with comparable efficiency to the Tokyo rail system. Which, if you think about it, kind of does mean that slime mold has the capacity to plan metropolitan transit systems.

Projection Point is a neat website that measures something its creators term your “Risk Intelligence Quotient”. What’s interesting is that the site measures meta-cognitive judgments about risk rather than risk attitudes. In other words, it measures how much you know about how much you know, rather than how much you know. If that sounds confusing, spend 5 minutes answering 50 questions, and all will be made clear.

Pete Warden wants to divide up the US into 7 distinct chunks. Or at least, he wants to tell you how FaceBook thinks the US should be divided up, based on social connections between people in different geographic locations. There’s Stayathomia, Mormonia, and Socialistan. (Names have been deliberately altered to protect the guilty states.)

Each day in 2009, I asked every person with whom I had a meaningful encounter to submit a record of this meeting through an online survey. These reports form the heart of the 2009 Annual Report. From parents to old friends, to people I met for the first time, to my dentist… any time I felt that someone had discerned enough of my personality and activities, they were given a card with a URL and unique number to record their experience.