fMRI: coming soon to a courtroom near you?

Monday, May 17th, 2010

Science magazine has a series of three (1, 2, 3) articles by Greg Miller over the past few days covering an interesting trial in Tennessee. The case itself seems like garden variety fraud, but the novel twist is that the defense is trying to introduce fMRI scans into the courtroom in order to establish the defendant’s innocence. As far as I can tell from Miller’s articles, the only scientists defending the use of fMRI as a lie detector are those employed by Cephos (the company that provides the scanning service); the other expert witnesses (including Marc Raichle!) seem pretty adamant that admitting fMRI scans as evidence would be a colossal mistake. Personally, I think there are several good reasons why it’d be a terrible, terrible idea to let fMRI scans into the courtroom. In one way or another, they all boil down to the fact that there just isn’t a shred of evidence to support the use of fMRI as a lie detector in real-world (i.e., non-contrived) situations. Greg Miller has a quote from Martha Farah (who’s a spectator at the trial) that sums it up eloquently:

Farah sounds like she would have liked to chime in at this point about some things that weren’t getting enough attention. “No one asked me, but the thing we have not a drop of data on is [the situation] where people have their liberty at stake and have been living with a lie for a long time,” she says. She notes that the only published studies on fMRI lie detection involve people telling trivial lies with no threat of consequences. No peer-reviewed studies exist on real world situations like the case before the Tennessee court. Moreover, subjects in the published studies typically had their brains scanned within a few days of lying about a fake crime, whereas Semrau’s alleged crimes began nearly 10 years before he was scanned.

I’d go even further than this, and point out that even if there were studies that looked at ecologically valid lying, it’s unlikely that we’d be able to make any reasonable determination as to whether or not a particular individual was lying about a particular event. For one thing, most studies deal with group averages and not single-subject prediction; you might think that a highly statistically significant difference between two conditions (e.g., lying and not lying) necessarily implies a reasonable ability to make predictions at the single-subject level, but you’d be surprised. Prediction intervals for individual observations are typically extremely wide even when there’s a clear pattern at the group level. It’s just easier to make general statements about differences between conditions or groups than it is about what state a particular person is likely to be in given a certain set of conditions.
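To make the gap between group-level significance and individual-level prediction concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration (a made-up "activation" score, a half-standard-deviation difference between lying and truth-telling, 500 observations per condition); none of it comes from the studies discussed in the trial.

```python
import math
import random
import statistics


def group_vs_individual(n_per_condition=500, effect_size=0.5, seed=0):
    """Simulate one hypothetical 'activation' score per observation.

    Assumed setup: scores are standard normal when telling the truth and
    shifted up by `effect_size` standard deviations when lying.
    Returns (Welch t statistic, single-observation classification accuracy).
    """
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, 1.0) for _ in range(n_per_condition)]
    lie = [rng.gauss(effect_size, 1.0) for _ in range(n_per_condition)]

    # Group-level question: do the two condition means differ?
    mean_truth, mean_lie = statistics.fmean(truth), statistics.fmean(lie)
    standard_error = math.sqrt(
        statistics.variance(truth) / n_per_condition
        + statistics.variance(lie) / n_per_condition
    )
    t_statistic = (mean_lie - mean_truth) / standard_error

    # Individual-level question: given one score, was this person lying?
    # Classify each observation by which condition mean it falls closer to.
    cutoff = (mean_truth + mean_lie) / 2
    correct = sum(x > cutoff for x in lie) + sum(x <= cutoff for x in truth)
    accuracy = correct / (2 * n_per_condition)
    return t_statistic, accuracy


t_statistic, accuracy = group_vs_individual()
print(f"group-level t = {t_statistic:.1f}, single-observation accuracy = {accuracy:.2f}")
```

Under these assumptions the group difference is wildly significant, yet the expected accuracy for classifying any single observation is only about 60%, not much better than a coin flip.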

There is, admittedly, an emerging body of literature that uses pattern classification to make predictions about mental states at the level of individual subjects, and accuracy in these types of application can sometimes be quite high. But these studies invariably operate on relatively restrictive sets of stimuli within well-characterized domains (e.g., predicting which word out of a set of 60 subjects are looking at). This really isn’t “mind reading” in the sense that most people (including most judges and jurors) tend to think of it. And of course, even if you could make individual-level predictions reasonably accurately, it’s not clear that that’s good enough for the courtroom. As a scientist, I might be thrilled if I could predict which of 10 words you’re looking at with 80% accuracy (which, to be clear, is currently a pipe dream in the context of studies of ecologically valid lying). But as a lawyer, I’d probably be very skeptical of another lawyer who claimed my predictions vindicated their client. The fact that increased anterior cingulate activation tends to accompany lying on average isn’t a good reason to convict someone unless you can be reasonably certain that increased ACC activation accompanies lying for that person in that context when presented with that bit of information. At the moment, that’s a pretty hard sell.

As an aside, the thing I find perhaps most curious about the whole movement to use fMRI scanners as lie detectors is that there are very few studies that directly pit fMRI against more conventional lie detection techniques–namely, the polygraph. You can say what you like about the polygraph–and many people don’t think polygraph evidence should be admissible in court either–but at least it’s been around for a long time, and people know more or less what to expect from it. It’s easy to forget that it only makes sense to introduce fMRI scans (which are decidedly costly) as evidence if they do substantially better than polygraphs. Otherwise you’re just wasting a lot of money for a fancy brain image, and you could have gotten just as much information by simply measuring someone’s arousal level as you yell at them about that bloodstained Cadillac that was found parked in their driveway on the night of January 7th. But then, maybe that’s the whole point of trying to introduce fMRI to the courtroom; maybe lawyers know that the polygraph has a tainted reputation, and are hoping that fancy new brain scanning techniques that come with pretty pictures don’t carry the same baggage. I hope that’s not true, but I’ve learned to be cynical about these things.

At any rate, the Science articles are well worth a read, and since the judge hasn’t yet decided whether to allow the fMRI evidence, the next couple of weeks should be interesting…

[hat-tip: Thomas Nadelhoffer]

everything we know about the neural bases of cognitive control, in 20 review articles or less

Sunday, May 2nd, 2010

Okay, not everything. But a lot of what we know. The current issue of Current Opinion in Neurobiology, which features a special focus on cognitive neuroscience, contains almost 20 short review papers, most of which focus on the neural mechanisms of cognitive control in one guise or another. As the Editors of the special issue (Earl Miller and Liz Phelps) explain in their introduction:

Our goal with this special issue was to highlight integrative approaches to brain function. To this end, we focused on the most integrative of brain functions, cognitive control. Cognitive, or executive, control is the ability to coordinate thought and action by directing them toward goals, often far removed goals.

I’ve only skimmed a couple of articles so far, but it’s a pretty impressive table of contents, and I’m looking forward to reading a lot of the reviews. The nice thing about the Current Opinion series, like the Trends series, is that the reviews are short and focused, so they’re well-suited to people who are very busy and don’t have enough hours in their day (like you), or people who just have a short attention span (like me).

Admittedly, I also have an ulterior motive for mentioning this issue: Todd Braver, Mike Cole and I contributed one of the articles, in which we review the neural bases of individual differences in executive control. I think it’s a really nice paper, the credit for which really goes to Todd and Mike–I mostly just contributed the section on methodological considerations (which is basically a precis of a much longer chapter I wrote with Todd a couple of years ago). Todd and Mike somehow managed to review work on everything from reward and motivation to emotion regulation to working memory capacity to dopamine genes, all in the space of eight pages. It’s a nice review highlighting the importance of modeling not only the central tendency of people’s behavior and brain activation in cognitive neuroscience studies, but also the variation between individuals. Aside from the fact that many people (including me!) find individual differences in cognitive abilities intrinsically interesting, an individual differences approach can provide insights that naturally complement those identified by more common within-subject analyses.

For instance, there’s a giant literature on the critical role the neurotransmitter dopamine plays in maintaining and updating goal representations. Most process models of dopamine function make either explicit or tacit predictions about how individual differences in dopamine function should manifest behaviorally, and recent studies have sought to test some of these predictions using both neuroimaging and molecular genetic techniques. A lot of work has focused on a common polymorphism in the COMT gene, variants of which dramatically alter the efficiency of dopamine degradation in the prefrontal cortex. An (admittedly simplistic) prediction that follows from one standard view of prefrontal dopamine function (that tonic dopamine serves to stabilize active representations) is that people who possess the low-activity met allele (and consequently have higher dopamine levels in PFC) should have a greater capacity to maintain goal representations and sustain attention, which may manifest as improved performance on many working memory tasks. Conversely, people with the val allele, which is associated with lower tonic dopamine levels in PFC, should do worse at tasks requiring sustained attention, but may have greater cognitive flexibility (due to the capacity to switch between goal representations more easily).

This prediction, which is borne out by a number of studies we review, is fundamentally about individual differences, since we typically can’t manipulate people’s COMT genes in the lab (though I know some people who probably really wish we could!). But the point is, even if you’re not intrinsically interested in what makes people different from one another, studying individual variation at a genetic, neural, or behavioral level can often tell you something useful about the models you’re developing. Particularly when it comes to the domain of executive control, where differences between individuals can be quite striking. Almost any mechanistic model of executive control is going to have ‘joints’ that could theoretically vary systematically across individuals, so it makes sense to capitalize on natural variability between people to test some of the predictions that fall out of the model, instead of just treating between-subject variability as the error term in your one-sample t-test.

Anyway, our article is here, and the full issue is here (though it’s behind a paywall, unfortunately).

functional MRI and the many varieties of reliability

Friday, March 5th, 2010

Craig Bennett and Mike Miller have a new paper on the reliability of fMRI. It’s a nice review that I think most people who work with fMRI will want to read. Bennett and Miller discuss a number of issues related to reliability, including why we should care about the reliability of fMRI, what factors influence reliability, how to obtain estimates of fMRI reliability, and what previous studies suggest about the reliability of fMRI. Their bottom line is that the reliability of fMRI often leaves something to be desired:

One thing is abundantly clear: fMRI is an effective research tool that has opened broad new horizons of investigation to scientists around the world. However, the results from fMRI research may be somewhat less reliable than many researchers implicitly believe. While it may be frustrating to know that fMRI results are not perfectly replicable, it is beneficial to take a longer-term view regarding the scientific impact of these studies. In neuroimaging, as in other scientific fields, errors will be made and some results will not replicate.

I think this is a wholly appropriate conclusion, and strongly recommend reading the entire article. Because there’s already a nice write-up of the paper over at Mind Hacks, I’ll content myself with adding a number of points to B&M’s discussion (I talk about some of these same issues in a chapter I wrote with Todd Braver).

First, even though I agree enthusiastically with the gist of B&M’s conclusion, it’s worth noting that, strictly speaking, there’s actually no such thing as “the reliability of fMRI”. Reliability isn’t a property of a technique or instrument, it’s a property of a specific measurement. Because every measurement is made under slightly different conditions, reliability will inevitably vary on a case-by-case basis. But since it’s not really practical (or even possible) to estimate reliability for every single analysis, researchers take necessary short-cuts. The standard in the psychometric literature is to establish reliability on a per-measure (not per-method!) basis, so long as conditions don’t vary too dramatically across samples. For example, once someone “validates” a given self-report measure, it’s generally taken for granted that that measure is “reliable”, and most people feel comfortable administering it to new samples without having to go to the trouble of estimating reliability themselves. That’s a perfectly reasonable approach, but the critical point is that it’s done on a relatively specific basis. Supposing you made up a new self-report measure of depression from a set of items you cobbled together yourself, you wouldn’t be entitled to conclude that your measure was reliable simply because some other self-report measure of depression had already been psychometrically validated. You’d be using an entirely new set of items, so you’d have to go to the trouble of validating your instrument anew.

By the same token, the reliability of any given fMRI measurement is going to fluctuate wildly depending on the task used, the timing of events, and many other factors. That’s not just because some estimates of reliability are better than others; it’s because there just isn’t a fact of the matter about what the “true” reliability of fMRI is. Rather, there are facts about how reliable fMRI is for specific types of tasks with specific acquisition parameters and preprocessing streams in specific scanners, and so on (which can then be summarized by talking about the general distribution of fMRI reliabilities). B&M are well aware of this point, and discuss it in some detail, but I think it’s worth emphasizing that when they say that “the results from fMRI research may be somewhat less reliable than many researchers implicitly believe,” what they mean isn’t that the “true” reliability of fMRI is likely to be around .5; rather, it’s that if you look at reliability estimates across a bunch of different studies and analyses, the estimated reliability is often low. But it’s not really possible to generalize from this overall estimate to any particular study; ultimately, if you want to know whether your data were measured reliably, you need to quantify that yourself. So the take-away message shouldn’t be that fMRI is an inherently unreliable method (and I really hope that isn’t how B&M’s findings get reported by the mainstream media should they get picked up), but rather, that there’s a very good chance that the reliability of fMRI in any given situation is not particularly high. It’s a subtle difference, but an important one.

Second, there’s a common misconception that reliability estimates impose an upper bound on the true detectable effect size. B&M make this point in their review, Vul et al. made it in their “voodoo correlations” paper, and in fact, I’ve made it myself before. But it’s actually not quite correct. It’s true that, for any given test, the true reliability of the variables involved limits the potential size of the true effect. But there are many different types of reliability, and most will generally only be appropriate and informative for a subset of statistical procedures. Virtually all types of reliability estimate will underestimate the true reliability in some cases and overestimate it in others. And in extreme cases, there may be close to zero relationship between the estimate and the truth.

To see this, take the following example, which focuses on internal consistency. Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable. This problem of underestimating reliability when items are heterogeneous has been discussed in the psychometric literature for at least 70 years, and yet you still very commonly see people do questionable things like “correcting for attenuation” based on dubious internal consistency estimates.
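The shoelace-and-blue example is easy to simulate. The sketch below is illustrative only: the sample size, the small amount of measurement noise, and the function names are all my assumptions. It builds two uncorrelated items that are each measured almost perfectly, then compares Cronbach's alpha for the two-item scale against the correlation of the total score across two administrations.

```python
import math
import random
import statistics


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


def shoelace_blue_scale(n_people=2000, noise_sd=0.2, seed=1):
    """Return (Cronbach's alpha, test-retest r) for a two-item sum score.

    Assumed setup: the two true scores are independent, and each item is
    measured twice with a small amount of independent noise.
    """
    rng = random.Random(seed)
    shoelace_true = [rng.gauss(0, 1) for _ in range(n_people)]
    blue_true = [rng.gauss(0, 1) for _ in range(n_people)]

    def measure(true_scores):
        return [score + rng.gauss(0, noise_sd) for score in true_scores]

    shoelace_t1, blue_t1 = measure(shoelace_true), measure(blue_true)
    shoelace_t2, blue_t2 = measure(shoelace_true), measure(blue_true)
    total_t1 = [a + b for a, b in zip(shoelace_t1, blue_t1)]
    total_t2 = [a + b for a, b in zip(shoelace_t2, blue_t2)]

    # Cronbach's alpha for k items: (k / (k - 1)) * (1 - sum(item var) / total var)
    k = 2
    item_variance = statistics.variance(shoelace_t1) + statistics.variance(blue_t1)
    alpha = (k / (k - 1)) * (1 - item_variance / statistics.variance(total_t1))

    retest_r = pearson(total_t1, total_t2)
    return alpha, retest_r


alpha, retest_r = shoelace_blue_scale()
print(f"internal consistency (alpha) = {alpha:.2f}, test-retest r = {retest_r:.2f}")
```

With these assumed settings alpha hovers around zero while the retest correlation of the very same total score comes out above .9, which is the whole point: internal consistency can badly underestimate reliability when the items are heterogeneous.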

In their review, B&M mostly focus on test-retest reliability rather than internal consistency, but the same general point applies. Test-retest reliability is the degree to which people’s scores on some variable are consistent across multiple testing occasions. The intuition is that, if the rank-ordering of scores varies substantially across occasions (e.g., if the people who show the highest activation of visual cortex at Time 1 aren’t the same ones who show the highest activation at Time 2), the measurement must not have been reliable, so you can’t trust any effects that are larger than the estimated test-retest reliability coefficient. The problem with this intuition is that there can be any number of systematic yet session-specific influences on a person’s score on some variable (e.g., activation level). For example, let’s say you’re doing a study looking at the relation between performance on a difficult working memory task and frontoparietal activation during the same task. Suppose you do the exact same experiment with the same subjects on two separate occasions three weeks apart, and it turns out that the correlation between DLPFC activation across the two occasions is only .3. A simplistic view would be that this means that the reliability of DLPFC activation is only .3, so you couldn’t possibly detect any correlations between performance level and activation greater than .3 in DLPFC. But that’s simply not true. It could, for example, be that the DLPFC response during WM performance is perfectly reliable, but is heavily dependent on session-specific factors such as baseline fatigue levels, motivation, and so on. In other words, there might be a very strong and perfectly “real” correlation between WM performance and DLPFC activation on each of the two testing occasions, even though there’s very little consistency across the two occasions. Test-retest reliability estimates only tell you how much of the signal is reliably due to temporally stable variables, and not how much of the signal is reliable, period.
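Here is a toy version of the DLPFC scenario, as a sketch rather than a model of real data. I'm assuming that 30% of the variance in activation comes from a stable trait and 70% from a session-specific state (think motivation or fatigue), that activation is measured without error, and that task performance tracks the session-specific state plus some noise. All of those numbers and names are made up for illustration.

```python
import math
import random
import statistics


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


def session_specific_signal(n_people=2000, seed=2):
    """Return (test-retest r of activation, within-session r with performance)."""
    rng = random.Random(seed)
    trait = [rng.gauss(0, 1) for _ in range(n_people)]  # stable across sessions

    sessions = []
    for _ in range(2):
        # A fresh state for every person in every session.
        state = [rng.gauss(0, 1) for _ in range(n_people)]
        # Activation is error-free here: 30% trait variance, 70% state variance.
        activation = [
            math.sqrt(0.3) * t + math.sqrt(0.7) * s for t, s in zip(trait, state)
        ]
        # Performance follows the session-specific state, plus noise.
        performance = [s + rng.gauss(0, 0.5) for s in state]
        sessions.append((activation, performance))

    (activation_1, performance_1), (activation_2, _) = sessions
    retest_r = pearson(activation_1, activation_2)
    within_session_r = pearson(activation_1, performance_1)
    return retest_r, within_session_r


retest_r, within_session_r = session_specific_signal()
print(f"test-retest r = {retest_r:.2f}, within-session brain-behavior r = {within_session_r:.2f}")
```

In this setup the test-retest correlation lands near .3, but the within-session correlation between activation and performance is far larger (roughly .75 in expectation), so the retest coefficient is clearly not acting as a ceiling.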

The general point is that you can’t just report any estimate of reliability that you like (or that’s easy to calculate) and assume that tells you anything meaningful about the likelihood of your analyses succeeding. You have to think hard about exactly what kind of reliability you care about, and then come up with an estimate to match that. There’s a reasonable argument to be made that most of the estimates of fMRI reliability reported to date are actually not all that relevant to many people’s analyses, because the majority of reliability analyses have focused on test-retest reliability, which is only an appropriate way to estimate reliability if you’re trying to relate fMRI activation to stable trait measures (e.g., personality or cognitive ability). If you’re interested in relating in-scanner task performance or state-dependent variables (e.g., mood) to brain activation (arguably the more common approach), or if you’re conducting within-subject analyses that focus on comparisons between conditions, using test-retest reliability isn’t particularly informative, and you really need to focus on other types of reliability (or reproducibility).

Third, and related to the above point, between-subject and within-subject reliability are often in statistical tension with one another. B&M don’t talk about this, as far as I can tell, but it’s an important point to remember when designing studies and/or conducting analyses. Essentially, the issue is that what counts as error depends on what effects you’re interested in. If you’re interested in individual differences, it’s within-subject variance that counts as error, so you want to minimize that. Conversely, if you’re interested in within-subject effects (the norm in fMRI), you want to minimize between-subject variance. But you generally can’t do both of these at the same time. If you use a very “strong” experimental manipulation (i.e., a task that produces a very large difference between conditions for virtually all subjects), you’re going to reduce the variability between individuals, and you may very well end up with very low test-retest reliability estimates. And that would actually be a good thing! Conversely, if you use a “weak” experimental manipulation, you might get no mean effect at all, because there’ll be much more variability between individuals. There’s no right or wrong here; the trick is to pick a design that matches the focus of your study. In the context of reliability, the essential point is that if all you’re interested in is the contrast between high and low working memory load, it shouldn’t necessarily bother you if someone tells you that the test-retest reliability of induced activation in your study is close to zero. Conversely, if you care about individual differences, it shouldn’t worry you if activations aren’t reproducible across studies at the group level. In some ways, those are actually the ideal situations for each of those two types of studies.

Lastly, B&M raise a question as to what level of reliability we should consider “acceptable” for fMRI research:

There is no consensus value regarding what constitutes an acceptable level of reliability in fMRI. Is an ICC value of 0.50 enough? Should studies be required to achieve an ICC of 0.70? All of the studies in the review simply reported what the reliability values were. Few studies proposed any kind of criteria to be considered a ‘reliable’ result. Cicchetti and Sparrow did propose some qualitative descriptions of data based on the ICC-derived reliability of results (1981). They proposed that results with an ICC above 0.75 be considered ‘excellent’, results between 0.59 and 0.75 be considered ‘good’, results between .40 and .58 be considered ‘fair’, and results lower than 0.40 be considered ‘poor’. More specifically to neuroimaging, Eaton et al. (2008) used a threshold of ICC > 0.4 as the mask value for their study while Aron et al. (2006) used an ICC cutoff of ICC > 0.5 as the mask value.

On this point, I don’t really see any reason to depart from psychometric convention just because we’re using fMRI rather than some other technique. Conventionally, reliability estimates of around .8 (or maybe .7, if you’re feeling generous) are considered adequate. Any lower and you start to run into problems, because effect sizes will shrivel up. So I think we should be striving to attain the same levels of reliability with fMRI as with any other measure. If it turns out that that’s not possible, we’ll have to live with that, but I don’t think the solution is to conclude that reliability estimates on the order of .5 are ok “for fMRI” (I’m not saying that’s what B&M say, just that that’s what we should be careful not to conclude). Rather, we should just accept that the odds of detecting certain kinds of effects with fMRI are probably going to be lower than with other techniques. And maybe we should minimize the use of fMRI for those types of analyses where reliability is generally not so good (e.g., using brain activation to predict trait variables over long intervals).
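For a sense of how quickly effect sizes shrivel, here is the classical (Spearman) attenuation formula as a few lines of code. This is a sketch under classical test theory assumptions, and it takes the true reliabilities as given, which, as discussed above, is exactly the part that is hard to get right in practice. The function name and the example values are mine.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation given the true correlation and the
    (true) reliabilities of the two measures, per the classical formula
    r_observed = r_true * sqrt(reliability_x * reliability_y)."""
    return true_r * math.sqrt(reliability_x * reliability_y)


# A hypothetical true correlation of .5 between a brain measure and a
# behavioral measure, both measured with the same reliability.
for reliability in (0.8, 0.7, 0.5):
    observed = attenuated_correlation(0.5, reliability, reliability)
    print(f"reliability {reliability:.1f}: expected observed r = {observed:.2f}")
```

A true correlation of .5 is expected to show up as .40 when both measures have reliability .8, as .35 at .7, and as just .25 at .5.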

I hasten to point out that none of this should be taken as a criticism of B&M’s paper; I think all of these points complement B&M’s discussion, and don’t detract in any way from its overall importance. Reliability is a big topic, and there’s no way Bennett and Miller could say everything there is to be said about it in one paper. I think they’ve done the field of cognitive neuroscience an important service by raising awareness and providing an accessible overview of some of the issues surrounding reliability, and it’s certainly a paper that’s going on my “essential readings in fMRI methods” list.

Bennett, C. M., & Miller, M. B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences

on the limitations of psychiatry, or why bad drugs can be good too

Saturday, January 9th, 2010

The Neuroskeptic offers a scathing indictment of the notion, editorialized in Nature this week, that the next decade is going to revolutionize the understanding and treatment of psychiatric disorders:

The 2010s is not the decade for psychiatric disorders. Clinically, that decade was the 1950s. The 50s was when the first generation of psychiatric drugs were discovered – neuroleptics for psychosis (1952), MAOis (1952) and tricyclics (1957) for depression, and lithium for mania (1949, although it took a while to catch on).

Since then, there have been plenty of new drugs invented, but not a single one has proven more effective than those available in 1959. New antidepressants like Prozac are safer in overdose, and have milder side effects, than older ones. New “atypical” antipsychotics have different side effects to older ones. But they work no better. Compared to lithium, newer “mood stabilizers” probably aren’t even as good. (The only exception is clozapine, a powerful antipsychotic, but dangerous side-effects limit its use.)

Those are pretty strong claims–especially the assertion that not a single psychiatric drug has proven more effective than those available in 1959. Are they true? I’m not in a position to know for certain, having had only fleeting contacts here and there with psychiatric research. But I guess I’d be surprised if many basic researchers in psychiatry concurred with that assessment. (I’m sure many clinicians wouldn’t, but that wouldn’t be very surprising.) Still, even if you suppose that present-day drugs are no more effective than those available in 1959 on the average (which may or may not be true), it doesn’t follow that there haven’t been major advances in psychiatric treatment. For one thing, the side effects of many modern drugs do tend to be less severe. The Neuroskeptic is right that atypical antipsychotics aren’t as side effect-free as was once hoped; but consider, in contrast, drugs like lamotrigine or valproate–anticonvulsants nowadays widely prescribed for bipolar disorder–which are undeniably less toxic than lithium (though also no more, and possibly less, effective). If you’re diagnosed with bipolar disorder in 2010, there’s still a good chance that you’ll eventually end up being prescribed lithium, but (in most cases) it’s unlikely that that’ll be the first line of treatment. And on the bright side, you could end up with a well-managed case of bipolar disorder that never requires you to take drugs with frequent and severe side effects–something that frankly wouldn’t have been an option for almost anyone in 1959.

That last point gets to what I think is the bigger reason for optimism: choice. Even if new drugs aren’t any better than old drugs on average, they’re probably going to work for different groups of people. One of the things that’s problematic about the way the results of clinical trials are typically interpreted is that if a new drug doesn’t outperform an old one, it’s often dismissed as unhelpful. The trouble with this worldview is that even if drug A helps 60% of people on average and drug B helps 54% of people on average (and the difference is statistically and clinically significant), it may well be that drug B helps people who don’t benefit from drug A. The unfortunate reality is that even relatively stable psychiatric patients usually take a while to find an effective treatment regimen; most patients try several treatments before settling on one that works. Simply in virtue of there being dozens more drugs available in 2009 than in 1959, it follows that psychiatric patients are much better off living today than fifty years ago. If an atypical antipsychotic controls your schizophrenia without causing motor symptoms or metabolic syndrome, you never have to try a typical antipsychotic; if valproate works well for your bipolar disorder, there’s no reason for you to ever go on lithium. These aren’t small advances; when you’re talking about millions of people who suffer from each of these disorders worldwide, the introduction of any drug that might help even just a fraction of patients who weren’t helped by older medication is a big deal, translating into huge improvements in quality of life and many tens of thousands of lives saved. That’s not to say we shouldn’t strive to develop drugs that are also better on average than the older treatments; it’s just that it shouldn’t be the only (and perhaps not even the main) criterion we use to gauge efficacy.
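The arithmetic behind the drug A / drug B example is worth spelling out. The sketch below takes the 60% and 54% response rates from the example and asks what share of patients would be helped by at least one of the two drugs. The answer depends entirely on how much the two groups of responders overlap, which response rates alone can't tell you, so the function (a name I made up) returns three reference points rather than one number.

```python
def helped_by_at_least_one(rate_a, rate_b):
    """Share of patients helped by drug A, drug B, or both.

    Returns (full_overlap, independent, least_overlap):
    - full_overlap: everyone helped by the weaker drug is also helped by
      the stronger one, so the second drug adds nothing.
    - independent: responding to one drug says nothing about the other.
    - least_overlap: the two groups of responders overlap as little as
      the rates allow.
    """
    full_overlap = max(rate_a, rate_b)
    independent = 1 - (1 - rate_a) * (1 - rate_b)
    least_overlap = min(1.0, rate_a + rate_b)
    return full_overlap, independent, least_overlap


full, independent, least = helped_by_at_least_one(0.60, 0.54)
print(f"full overlap: {full:.1%}, independent: {independent:.1%}, least overlap: {least:.1%}")
```

With these rates the combined coverage ranges from 60% (drug B is redundant) up to 100%, and if responses to the two drugs were independent it would be 81.6%. Where real drugs fall in that range is an empirical question, but the "worse" drug only adds nothing in the extreme case of complete overlap.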

Having said that, I do agree with the Neuroskeptic’s assessment as to why psychiatric research and treatment seems to proceed more slowly than research in other areas of neuroscience or medicine:

Why? That’s an excellent question. But if you ask me, and judging by the academic literature I’m not alone, the answer is: diagnosis. The weak link in psychiatry research is the diagnoses we are forced to use: “major depressive disorder”, “schizophrenia”, etc.

There are all sorts of methodological reasons why it’s not a great idea to use discrete diagnostic categories when studying (or developing treatments for) mental health disorders. But perhaps the biggest one is that, in cases where a disorder has multiple contributing factors (which is to say, virtually always), drawing a distinction between people with the disorder and those without it severely restricts the range of expression of various related phenotypes, and may even assign people with positive symptomatology to the wrong half of the divide simply because they don’t have some other (relatively) arbitrary symptoms.

For example, take bipolar disorder. If you classify the population into people with bipolar disorder and people without it, you’re doing two rather unfortunate things. One is that you’re lumping together a group of people who have only a partial overlap of symptomatology, and treating them as though they have identical status. One person’s disorder might be characterized by persistent severe depression punctuated by short-lived bouts of mania every few months; another person might cycle rapidly between a variety of moods multiple times per month, week, or even day. Assigning both people the same diagnosis in a clinical study is potentially problematic in that there may be very different underlying organic disorders, which means you’re basically averaging over multiple discrete mechanisms in your analysis, resulting in a loss of both sensitivity and specificity.

The other problem, which I think is less widely appreciated, is that you’ll invariably have many “control” subjects who don’t receive the diagnosis but share many features with people who do. This problem is analogous to the injunction against using median splits: you almost never want to turn an interval-level variable into an ordinal one if you don’t have to, because you lose a tremendous amount of information. When you contrast a sample of people with a bipolar diagnosis with a group of “healthy” controls, you’re inadvertently weakening your comparison by including in the control group people who would be best characterized as falling somewhere in between the extremes of pathological and healthy. For example, most of us probably know people who we would characterize as “functionally manic” (sometimes also known as “extraverts”)–that is, people who seem to reap the benefits of the stereotypical bipolar syndrome in the manic phase (high energy, confidence, and activity level) but have none of the downside of the depressive phase. And we certainly know people who seem to have trouble regulating their moods, and oscillate between periods of highs and lows–but perhaps just not to quite the extent necessary to obtain a DSM-IV diagnosis. We do ourselves a tremendous disservice if we call these people “controls”. Sure, they might be controls for some aspects of bipolar symptomatology (e.g., people who are consistently energetic serve as a good contrast to the dysphoria of the depressive phase); but in other respects, they may actually be closer to the prototypical patient than to most other people.
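The median-split analogy can be checked with a quick simulation. In this sketch I assume a continuous symptom score and an outcome that correlate .5 in the population, both normally distributed; the variable names and numbers are illustrative and not taken from any dataset.

```python
import math
import random
import statistics


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


def median_split_cost(n_people=5000, true_r=0.5, seed=3):
    """Return (r using the continuous score, r using a median split of it)."""
    rng = random.Random(seed)
    symptom = [rng.gauss(0, 1) for _ in range(n_people)]
    # Build an outcome that correlates `true_r` with the symptom score.
    outcome = [
        true_r * s + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1) for s in symptom
    ]

    # Dichotomize at the median: "high" group versus "low" group.
    cutoff = statistics.median(symptom)
    symptom_split = [1.0 if s > cutoff else 0.0 for s in symptom]

    return pearson(symptom, outcome), pearson(symptom_split, outcome)


r_continuous, r_split = median_split_cost()
print(f"continuous r = {r_continuous:.2f}, median-split r = {r_split:.2f}")
```

For normally distributed scores, splitting at the median is expected to shrink the correlation to about 80% of its original size, so a sizeable chunk of the signal is thrown away before the analysis even starts. A diagnostic cutoff far from the median would be expected to lose even more.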

From a methodological standpoint, there’s no question we’d be much better off focusing on symptoms rather than classifications. If you want to understand the many different factors that contribute to bipolar disorder or schizophrenia, you shouldn’t start from the diagnosis and work backwards; you should start by asking what symptom constellations are associated with specific mechanisms. And those symptoms may well be present (to varying extents) in people both with and without the disorder in question. That’s precisely the motivation behind the current “endophenotype” movement, where the rationale is that you’re better off trying to figure out what biological and (eventually) behavioral changes a given genetic polymorphism is associated with, and then using that information to reshape taxonomies of mental health disorders, than trying to go directly from diagnosis to genetic mechanisms.

Of course, it’s easy to talk about the problems associated with the way psychiatric diagnoses are applied, and not so easy to fix them. Part of the problem is that, while researchers in the lab have the luxury of using large samples that are defined on the basis of symptomatology rather than classification (a luxury that, as the Neuroskeptic and others have astutely observed, many researchers fail to take advantage of), clinicians generally don’t. When you see a patient come in complaining of dysphoria and mood swings, it’s not particularly useful to say “you seem to be in the 96th percentile for negative affect, and have unusual trouble controlling your mood; let’s study this some more, mmmkay?” What you need is some systematic way of going from symptoms to treatment, and the DSM-IV offers a relatively straightforward (though wildly imperfect) way to do that. And then too, the reality is that most clinicians (at least, the ones I’ve talked to) don’t just rely on some algorithmic scheme for picking out drugs; they instead rely on a mix of professional guidelines, implicit theories, and (occasionally) scientific literature when making decisions about what types of symptom constellations have, in their experience, benefited more or less from specific drugs. The problem is that those decisions often fail to achieve their intended goal, and so you end up with a process of trial-and-error, where most patients might try half a dozen medications before they find one that works (if they’re lucky). But that only takes us back to why it’s actually a good thing that we have so many more medications in 2009 than in 1959, even if they’re not necessarily individually more effective. So, yes, psychiatric research has some major failings compared to other areas of biomedical research–though I do think that’s partly (though certainly not entirely) because the problems are harder.
But I don’t think it’s fair to suggest we haven’t made any solid advances in the treatment or understanding of psychiatric disorders in the last half-century. We have; it’s just that we could do much better.