Ioannidis on effect size inflation, with guest appearance by Bozo the Clown

Andrew Gelman posted a link on his blog today to a paper by John Ioannidis I hadn’t seen before. In many respects, it’s basically the same paper I wrote earlier this year as a commentary on the Vul et al “voodoo correlations” paper (the commentary was itself based largely on an earlier chapter I wrote with my PhD advisor, Todd Braver). Well, except that the Ioannidis paper came out a year earlier than mine, and is also much better in just about every respect (more on this below).

What really surprises me is that I never came across Ioannidis’ paper when I was doing a lit search for my commentary. The basic point I made in the commentary–which can be summarized as the observation that low power coupled with selection bias almost invariably inflates significant effect sizes–is a pretty straightforward statistical point, so I figured that many people, and probably most statisticians, were well aware of it. But no amount of Google Scholar-ing helped me find an authoritative article that made the same point succinctly; I just kept coming across articles that made the point tangentially, in an off-hand “but of course we all know we shouldn’t trust these effect sizes, because…” kind of way. So I chalked it up as one of those statistical factoids (of which there are surprisingly many) that live in the unhappy land of too-obvious-for-statisticians-to-write-an-article-about-but-not-obvious-enough-for-most-psychologists-to-know-about. And so I just went ahead and wrote the commentary in a non-technical way that I hoped would get the point across intuitively.

Anyway, after the commentary was accepted, I sent a copy to Andrew Gelman, who had written several posts about the Vul et al controversy. He promptly sent me back a link to this paper of his, which basically makes the same point about sampling error, but with much more detail and much better examples than I did. His paper also cites an earlier article in American Scientist by Wainer, which I also recommend, and again expresses very similar ideas. So then I felt a bit like a fool for not stumbling across either Gelman’s paper or Wainer’s earlier. And now that I’ve read Ioannidis’ paper, I feel even dumber, seeing as I could have saved myself a lot of trouble by writing two or three paragraphs and then essentially pointing to Ioannidis’ work. Oh well.

That all said, it wasn’t a complete loss; I still think the basic point is important enough that it’s worth repeating loudly and often, no matter how many times it’s been said before. And I’m skeptical that many fMRI researchers would have appreciated the point otherwise, given that none of the papers I’ve mentioned were published in venues fMRI researchers are likely to read regularly (which is presumably part of the reason I never came across them!). Of course, I don’t think that many people who do fMRI research actually bothered to read my commentary, so it’s questionable whether it had much impact anyway.

At any rate, the Ioannidis paper makes a number of points that my paper didn’t, so I figured I’d talk about them a bit. I’ll start by revisiting what I said in my commentary, and then I’ll tell you why you should read Ioannidis’ paper instead of mine.

The basic intuition can be captured as follows. Suppose you’re interested in the following question: Do clowns suffer depression at a higher rate than us non-comical folk do? You might think this is a contrived (to put it delicately) question, but I can assure you it has all sorts of important real-world implications. For instance, you wouldn’t be so quick to book a clown for your child’s next birthday party if you knew that The Great Mancini was going to be out in the parking lot half an hour later drinking cheap gin out of a top hat. If that example makes you feel guilty, congratulations: you’ve just discovered the translational value of basic science.

Anyway, back to the question, and how we’re going to answer it. You can’t just throw a bunch of clowns and non-clowns in a room and give them a depression measure. There’s nothing comical about that. What you need to do, if you’re rigorous about it, is give them multiple measures of depression, because we all know how finicky individual questionnaires can be. So the clowns and non-clowns each get to fill out the Beck Depression Inventory (BDI), the Center for Epidemiologic Studies Depression Scale, the Depression Adjective Checklist, the Zung Self-Rating Depression Scale (ZSRDS), and, let’s say, six other measures. Ten measures in all. And let’s say we have 20 individuals in each group, because that’s all a cash-strapped but enthusiastic investigator like me can afford. After collecting the data, we score the questionnaires and run a bunch of t-tests to determine whether clowns and non-clowns have different levels of depression. Being scrupulous researchers who care a lot about multiple comparisons correction, we decide to divide our critical p-value by 10 (the dreaded Bonferroni correction, for 10 tests in this case) and test at p < .005. That’s a conservative analysis, of course; but better safe than sorry!
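For concreteness, here’s roughly what that analysis might look like in code. This is just a sketch with made-up data (the group difference, seed, and variable names are all my own inventions, not real clown data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_per_group, n_measures = 20, 10

# Hypothetical scores: 10 depression measures for 20 clowns and 20
# non-clowns, with clowns scoring somewhat higher on every measure
clowns = rng.normal(0.6, 1, size=(n_per_group, n_measures))
non_clowns = rng.normal(0.0, 1, size=(n_per_group, n_measures))

# Bonferroni correction: test each measure at .05 / 10 = .005
alpha = 0.05 / n_measures

# One independent-samples t-test per measure (scipy tests along axis 0)
pvals = stats.ttest_ind(clowns, non_clowns).pvalue

print(alpha)                   # 0.005
print((pvals < alpha).sum())   # how many measures survive the threshold
```

Run this a few times with different seeds and you’ll see the count of “significant” measures bounce around, which is the whole problem this post is about.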

So we run our tests and get what look like mixed results. Meaning, we get statistically significant positive correlations between clown-dom status and depression for 2 measures–the BDI and Zung inventories–but not for the other 8 measures. So that’s admittedly not great; it would have been better if all 10 had come out right. Still, it at least partially supports our hypothesis: Clowns are fucking miserable! And because we’re already thinking ahead to how we’re going to present these results when they (inevitably) get published in Psychological Science, we go ahead and compute the effect sizes for the two significant correlations, because, after all, it’s important to know not only that there is a “real” effect, but also how big that effect is. When we do that, it turns out that the point-biserial correlation is huge! It’s .75 for the BDI and .68 for the ZSRDS. In other words, about half of the variance in clowndom can be explained by depression levels. And of course, because we’re well aware that correlation does not imply causation, we get to interpret the correlation both ways! So we quickly issue a press release claiming that we’ve discovered that it’s possible to conclusively diagnose depression just by knowing whether or not someone’s a clown! (We’re not going to worry about silly little things like base rates in a press release.)

Now, this may all seem great. And it’s probably not an unrealistic depiction of how much of psychology works (well, minus the colorful scarves, big hair, and face paint). That is, very often people report interesting findings that were selectively reported from amongst a larger pool of potential findings on the basis of the fact that the former but not the latter surpassed some predetermined criterion for statistical significance. For example, in our hypothetical in-press clown paper, we don’t bother to report results for the correlation between clownhood and the Center for Epidemiologic Studies Depression Scale (r = .12, p > .1). Why should we? It’d be silly to report a whole pile of additional correlations only to turn around and say “null effect, null effect, null effect, null effect, null effect, null effect, null effect, and null effect” (see how boring it was to read that?). Nobody cares about variables that don’t predict other variables; we care about variables that do predict other variables. And we’re not really doing anything wrong, we think; it’s not like the act of selective reporting is inflating our Type I error (i.e., the false positive rate), because we’ve already taken care of that up front by deliberately being overconservative in our analyses.

Unfortunately, while it’s true that our Type I error doesn’t suffer, the act of choosing which findings to report based on the results of a statistical test does have another unwelcome consequence. Specifically, there’s a very good chance that the effect sizes we end up reporting for statistically significant results will be artificially inflated–perhaps dramatically so.

Why would this happen? It’s actually entailed by the selection procedure. To see this, let’s take the classical measurement model, under which the variance in any measured variable reflects the sum of two components: the “true” scores (i.e., the scores we would get if our measurements were always completely accurate) and some random error. The error term can in turn be broken down into many more specific sources of error; but we’ll ignore that and just focus on one source of error–namely, sampling error. Sampling error refers to the fact that we can never select a perfectly representative group of subjects when we collect a sample; there’s always some (ideally small) way in which the sample characteristics differ from the population. This error term can artificially inflate an effect or artificially deflate it, and it can inflate or deflate it more or less, but it’s going to have an effect one way or the other. You can take that to the bank as sure as my name’s Bozo the Clown.

To put this in context, let’s go back to our BDI scores. Recall that what we observed is that clowns have higher BDI scores than non-clowns. But what we’re now saying is that that difference in scores is going to be affected by sampling error. That is, just by chance, we may have selected a group of clowns that are particularly depressed, or a group of non-clowns who are particularly jolly. Maybe if we could measure depression in all clowns and all non-clowns, we would actually find no difference between groups.

Now, if we allow that sampling error really is random, and that we’re not actively trying to pre-determine the outcome of our study by going out of our way to recruit The Great Depressed Mancini and his extended dysthymic clown family, then in theory we have no reason to think that sampling error is going to introduce any particular bias into our results. It’s true that the observed correlations in our sample may not be perfectly representative of the true correlations in the population; but that’s not a big deal so long as there’s no systematic bias (i.e., that we have no reason to think that our sample will systematically inflate correlations or deflate them). But here’s the problem: the act of choosing to report some correlations but not others on the basis of their statistical significance (or lack thereof) introduces precisely such a bias. The reason is that, when you go looking for correlations that are of a certain size or greater, you’re inevitably going to be more likely to select those correlations that happen to have been helped by chance than hurt by it.

Here’s a series of figures that should make the point even clearer. Let’s pretend for a moment that the truth of the matter is that there is in fact a positive correlation between clown status and all 10 depression measures. Except, we’ll make it 100 measures, because it’ll be easier to illustrate the point that way. Moreover, let’s suppose that the correlation is exactly the same for all 100 measures, at .3. Here’s what that would look like if we just plotted the correlations for all 100 measures, 1 through 100:

[Figure: all 100 population correlations plotted as a flat line at r = .3]
It’s just a horizontal red line, because all the individual correlations have the same value (0.3). So that’s not very exciting. But remember, these are the population correlations. They’re not what we’re going to observe in our sample of 20 clowns and 20 non-clowns, because depression scores in our sample aren’t a perfect representation of the population. There’s also error to worry about. And error–or at least, sampling error–is going to be greater for smaller samples than for bigger ones. (The reason for this can be expressed intuitively: other things being equal, the more observations you have, the more representative your sample must be of the population as a whole, because deviations in any given direction will tend to cancel each other out the more data you collect. And if you keep collecting, at the limit, your sample will constitute the whole population, and must therefore by definition be perfectly representative). With only 20 subjects in each group, our estimates of each group’s depression level are not going to be terrifically stable. You can see this in the following figure, which shows the results of a simulation on 100 different variables, assuming that all have an identical underlying correlation of .3:

[Figure: 100 simulated sample correlations (n = 40 each) scattered around the true value of .3]
Notice how much variability there is in the correlations! The weakest correlation is actually negative, at -.18; the strongest is much larger than .3, at .63. (Caveat for more technical readers: this assumes that the above variables are completely independent, which in practice is unlikely to be true when dealing with 100 measures of the same construct.) So even though the true correlation is .3 in all cases, the magic of sampling will necessarily produce some values that are below .3, and some that are above .3. In some cases, the deviations will be substantial.
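You can reproduce something very much like this simulation in a few lines of Python. This is a sketch under my own assumptions (seed, parameterization), not the code behind the actual figure; the one non-obvious step is converting the population point-biserial correlation of .3 into an equivalent standardized mean difference between the two groups:

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_group, n_measures, true_r = 20, 100, 0.3

# For two equal-sized groups, a point-biserial correlation r corresponds
# to a standardized mean difference d = 2r / sqrt(1 - r^2)
d = 2 * true_r / np.sqrt(1 - true_r**2)
group = np.repeat([1.0, 0.0], n_per_group)  # 20 clowns, 20 non-clowns

def observed_r():
    """Sample correlation for one depression measure (true r = .3)."""
    scores = np.concatenate([rng.normal(d, 1, n_per_group),
                             rng.normal(0, 1, n_per_group)])
    return np.corrcoef(group, scores)[0, 1]

rs = np.array([observed_r() for _ in range(n_measures)])
print(rs.min(), rs.max())  # a surprisingly wide spread around .3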

By now you can probably see where this is going. Here we have a distribution of effect sizes that to some extent may reflect underlying variability in population effect sizes, but is also almost certainly influenced by sampling error. And now we come along and decide that, hey, it doesn’t really make sense to report all 100 of these correlations in a paper; that’s too messy. Really, for the sake of brevity and clarity, we should only report those correlations that are in some sense more important and “real”. And we do that by calculating p-values and only reporting the results of tests that are significant at some predetermined level (in our case, p < .005). Well, here’s what that would look like:

[Figure: the same 100 sample correlations, with the non-significant ones grayed out and the significance threshold drawn as a horizontal black line]
This is exactly the same figure as the previous one, except we’ve now grayed out all the non-significant correlations. And in the process, we’ve made Bozo the Clown cry:
[Image: Bozo the Clown, crying]
Why? Because unfortunately, the criterion that we’ve chosen is an extremely conservative one. In order to detect a significant difference in means between two groups of 20 subjects at p < .005, the observed correlation (depicted as the horizontal black line above) needs to be .42 or greater! That’s substantially larger than the actual population effect size of .3. Effects of this magnitude don’t occur very frequently in our sample; in fact, they only occur 16 times. As a result, we’re going to end up failing to detect 84 of 100 correlations, and will walk away thinking they’re null results–even though the truth is that, in the population, they’re actually all pretty strong, at .3. This quantity–the proportion of “real” effects that we’re likely to end up calling statistically significant given the constraints of our sample–is formally called statistical power. If you do a power analysis for a two-sample t-test on a correlation of .3 at p < .005, it turns out that power is only .17 (which is essentially what we see above; the slight discrepancy is due to chance). In other words, even when there are real and relatively strong associations between depression and clownhood, our sample would only identify those associations 17% of the time, on average.
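If you want to check these numbers yourself, here’s a rough sketch (my own code, not the author’s; the exact critical value shifts a little depending on how the test is parameterized, which is why it may come out closer to .43 than .42):

```python
import numpy as np
from scipy import stats

n_per_group, alpha, true_r = 20, 0.005, 0.3
df = 2 * n_per_group - 2  # two-sample t-test with 20 per group

# Smallest correlation that reaches p < .005 (two-tailed) with n = 40,
# by inverting the relationship t = r * sqrt(df / (1 - r^2))
t_crit = stats.t.ppf(1 - alpha / 2, df)
r_crit = t_crit / np.sqrt(t_crit**2 + df)
print(round(r_crit, 2))  # roughly .42-.44

# Monte Carlo estimate of power: how often does a true r of .3 clear it?
rng = np.random.default_rng(0)
d = 2 * true_r / np.sqrt(1 - true_r**2)  # equivalent mean difference
n_sims = 20000
sig = sum(stats.ttest_ind(rng.normal(d, 1, n_per_group),
                          rng.normal(0, 1, n_per_group)).pvalue < alpha
          for _ in range(n_sims))
print(sig / n_sims)  # in the neighborhood of .17
```

The Monte Carlo route is overkill here (an analytic power calculation gives the same answer), but it makes the logic transparent: simulate the study thousands of times and count how often it “works”.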

That’s not good, obviously, but there’s more. Now the other shoe drops, because not only have we systematically missed out on most of the effects we’re interested in (in virtue of using small samples and overly conservative statistical thresholds), but notice what we’ve also done to the effect sizes of those correlations that we do end up identifying. What is in reality a .3 correlation spuriously appears, on average, as a .51 correlation in the 16 tests that surpass our threshold. So, through the combined magic of low power and selection bias, we’ve turned what may in reality be a relatively diffuse association between two variables (say, clownhood and depression) into a seemingly selective and extremely strong association. After all the excitement about getting a high-profile publication, it might ultimately turn out that clowns aren’t really so depressed after all–it’s all an illusion induced by the sampling apparatus. So you might say that the clowns get the last laugh. Or that the joke’s on us. Or maybe just that this whole clown example is no longer funny and it’s now time for it to go bury itself in a hole somewhere.
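You can verify the inflation directly by simulating many such studies and averaging only the correlations that survive the threshold (again a sketch under my own assumptions, with a made-up seed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, alpha, true_r = 20, 0.005, 0.3
d = 2 * true_r / np.sqrt(1 - true_r**2)  # mean difference matching r = .3
group = np.repeat([1.0, 0.0], n_per_group)
df = 2 * n_per_group - 2

sig_rs = []  # observed correlations that clear the significance bar
for _ in range(20000):
    scores = np.concatenate([rng.normal(d, 1, n_per_group),
                             rng.normal(0, 1, n_per_group)])
    r = np.corrcoef(group, scores)[0, 1]
    t = r * np.sqrt(df / (1 - r**2))  # equivalent two-sample t statistic
    p = 2 * stats.t.sf(abs(t), df)
    if p < alpha:
        sig_rs.append(r)

# The true effect is .3; the average *reported* effect is much larger
print(np.mean(sig_rs))  # around .5
```

Selecting on significance is effectively truncating the sampling distribution from below, so the surviving estimates are, on average, the lucky ones.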

Anyway, that, in a nutshell, was the point my commentary on the Vul et al paper made, and it’s the same point the Gelman and Wainer papers make too, in one way or another. While it’s a very general point that really applies in any domain where (a) power is less than 100% (which is just about always) and (b) there is some selection bias (which is also just about always), there were some considerations that were particularly applicable to fMRI research. The basic issue is that, in fMRI research, we often want to conduct analyses that span the entire brain, which means we’re usually faced with conducting many more statistical comparisons than researchers in other domains generally deal with (though not, say, molecular geneticists conducting genome-wide association studies). As a result, there is a very strong emphasis in imaging research on controlling Type I error rates by using very conservative statistical thresholds. You can agree or disagree with this general advice (for the record, I personally think there’s much too great an emphasis in imaging on Type I error, and not nearly enough emphasis on Type II error), but there’s no avoiding the fact that following it will tend to produce highly inflated significant effect sizes, because in the act of reducing p-value thresholds, we’re also driving down power dramatically, and making the selection bias more powerful.

While it’d be nice if there was an easy fix for this problem, there really isn’t one. In behavioral domains, there’s often a relatively simple prescription: report all effect sizes, both significant and non-significant. This doesn’t entirely solve the problem, because people are still likely to overemphasize statistically significant results relative to non-significant ones; but at least at that point you can say you’ve done what you can. In the fMRI literature, this course of action isn’t really available, because most journal editors are not going to be very happy with you when you send them a 25-page table that reports effect sizes and p-values for each of the 100,000 voxels you tested. So we’re forced to adopt other strategies. The one I’ve argued for most strongly is to increase sample size (which increases power and decreases the uncertainty of resulting estimates). But that’s understandably difficult in a field where scanning each additional subject can cost $1,000 or more. There are a number of other things you can do, but I won’t talk about them much here, partly because this is already much too long a post, but mostly because I’m currently working on a paper that discusses this problem, and potential solutions, in much more detail.

So now finally I get to the Ioannidis article. As I said, the basic point is the same one made in my paper and Gelman’s and others, and the one I’ve described above in excruciating clownish detail. But there are a number of things about the Ioannidis paper that are particularly nice. One is that Ioannidis considers not only inflation due to selection of statistically significant results coupled with low power, but also inflation due to the use of flexible analyses (or, as he puts it, “vibration” of effects–also known as massaging the data). Another is that he considers cultural aspects of the phenomenon, e.g., the fact that investigators tend to be rewarded for reporting large effects, even if they subsequently fail to replicate. He also discusses conditions under which you might actually get deflation of effect sizes–something I didn’t touch on in my commentary, and hadn’t really thought about. Finally, he makes some interesting recommendations for minimizing effect size inflation. Whereas my commentary focused primarily on concrete steps researchers could take in individual studies to encourage clearer evaluation of results (e.g., reporting confidence intervals, including power calculations, etc.), Ioannidis focuses on longer-term solutions and the possibility that we’ll need to dramatically change the way we do science (at least in some fields).

Anyway, this whole issue of inflated effect sizes is a critical one to appreciate if you do any kind of social or biomedical science research, because it almost certainly affects your findings on a regular basis, and has all sorts of implications for what kind of research we conduct and how we interpret our findings. (To give just one trivial example, if you’ve ever been tempted to attribute your failure to replicate a previous finding to some minute experimental difference between studies, you should seriously consider the possibility that the original effect size may have been grossly inflated, and that your own study consequently has insufficient power to replicate the effect.) If you only have time to read one article that deals with this issue, read the Ioannidis paper. And remember it when you write your next Discussion section. Bozo the Clown will thank you for it.

Ioannidis, J. (2008). Why Most Discovered True Associations Are Inflated. Epidemiology, 19(5), 640-648. DOI: 10.1097/EDE.0b013e31818131e7

Yarkoni, T. (2009). Big Correlations in Little Studies: Inflated fMRI Correlations Reflect Low Statistical Power - Commentary on Vul et al. (2009). Perspectives on Psychological Science, 4(3), 294-298. DOI: 10.1111/j.1745-6924.2009.01127.x

more pretty pictures of brains

Google Reader’s new recommendation engine is pretty nifty, and I find it gets it right most of the time. It just suggested this blog, which looks to be a nice (and growing) collection of neuro-related images. It’s an interesting set of pictures that go beyond the usual combination of brain slices and tractography images to include paintings of brains (and their owners) in strange poses, psychedelic posters, and abandoned Russian brain labs. For example:

In a similar vein, there’s also this, which seems to be the CNS-related incarnation of another earlier favorite.

tuesday at 3 pm works for me

Apparently, Tuesday at 3 pm is the best time to suggest as a meeting time–that’s when people have the most flexibility available in their schedule. At least, that’s the conclusion drawn by a study based on data from WhenIsGood, a free service that helps with meeting scheduling. There’s not much to the study beyond the conclusion I just gave away; not surprisingly, people don’t like to meet before 10 or 11 am or after 4 pm, and there’s very little difference in availability across different days of the week.

What I find neat about this isn’t so much the results of the study itself as the fact that it was done at all. I’m a big proponent of using commercial website data for research purposes–I’m about to submit a paper that relies almost entirely on content pulled using the Blogger API, and am working on another project that makes extensive use of the Twitter API. The scope of the datasets one can assemble via these APIs is simply unparalleled; for example, there’s no way I could ever realistically collect writing samples of 50,000+ words from 500+ participants in a laboratory setting, yet the ability to programmatically access blog contents makes the task trivial. And of course, many websites collect data of a kind that just isn’t available off-line. For example, the folks at OKCupid are able to continuously pump out interesting data on people’s online dating habits because they have comprehensive data on interactions between literally millions of prospective dating partners. If you want to try to generate that sort of data off-line, I hope you have a really large lab.

Of course, I recognize that in this case, the WhenIsGood study really just amounts to a glorified press release. You can tell that’s what it is from the URL, which literally includes the “press/” directory in its path. So I’m certainly not naive enough to think that Web 2.0 companies are publishing interesting research based on their proprietary data solely out of the goodness of their hearts. Quite the opposite. But I think in this case the desire for publicity works in researchers’ favor: It’s precisely because virtually any press is considered good press that many of these websites would probably be happy to let researchers play with their massive (de-identified) datasets. It’s just that, so far, hardly anyone’s asked. The Web 2.0 world is a largely untapped resource that researchers (or at least, psychologists) are only just beginning to take advantage of.

I suspect that this will change in the relatively near future. Five or ten years from now, I imagine that a relatively large chunk of the research conducted in many area of psychology (particularly social and personality psychology) will rely heavily on massive datasets derived from commercial websites. And then we’ll all wonder in amazement at how we ever put up with the tediousness of collecting real-world data from two or three hundred college students at a time, when all of this online data was just lying around waiting for someone to come take a peek at it.

not a day over six

I was born twenty-nine years ago today. This isn’t particularly noteworthy–after all, there are few things as predictable as birthdays–except that all day today, people have been trying to scare me into thinking I’m old. Like somehow twenty-nine is the big one. Well, it isn’t the big one, and I’m not old. Telling me that I have one more year left before it all goes to hell doesn’t make me feel nervous, it just makes you a dirty rotten liar. If my eyesight wasn’t completely shot and my rotator cuff muscles hadn’t degenerated from disuse, I’d probably try to punch anyone insinuating that I’m on the downward slope. I’m not on the downward slope; I feel sprightly! So sprightly that I think I’ll go for a walk. Right now. In the dark. Even though it’s midnight and about negative one zillion degrees outside. I may look twenty-nine on the outside, but I can assure you that on the inside, I’m not a day over six years old.

the brain, in pictures, in newsweek

Newsweek has a beautiful set of graphics illustrating some of the things we’ve learned about the brain in recent years. One or two of the graphics are a bit hokey (e.g., the PET slides showing the effects of a seizure don’t show the same brain slices, and it’s unclear whether the color scales are equivalent), but the arteriograph and MRI slides showing the cerebral vasculature are really amazing.

The images are excerpted from Rita Carter’s new Human Brain Book, which I’d buy in a heartbeat if I wasn’t broke right now. If you aren’t so broke, and happen to buy a copy, you should invite me over some time. We can sit around drinking hot chocolate and staring at illustrations of the fusiform gyrus. Or something.

[via Mind Hacks]

i hate learning new things

Well, I don’t really hate learning new things. I actually quite like learning new things; what I don’t like is having to spend time learning new things. I find my tolerance for the unique kind of frustration associated with learning a new skill (you know, the kind that manifests itself in a series of “crap, now I have to Google that” moments) increases roughly in proportion to my age.

As an undergraduate, I didn’t find learning frustrating at all; quite the opposite, actually. I routinely ignored all the work that I was supposed to be doing (e.g., writing term papers, studying for exams, etc.), and would spend hours piddling around with things that were completely irrelevant to my actual progress through college. In hindsight, a lot of the skills I picked up have actually been quite useful, career-wise (e.g., I spent a lot of my spare time playing around with websites, which has paid off–I now collect large chunks of my data online). But I can’t pretend I had any special foresight at the time. I was just procrastinating by doing stuff that felt like work but really wasn’t.

In my first couple of years in graduate school, when I started accumulating obligations I couldn’t (or didn’t want to) put off, I developed a sort of compromise with myself, where I would spend about fifteen minutes of every hour doing what I was supposed to, and the rest of the hour messing around learning new things. Some of those things were work-related–for instance, learning to use a new software package for analyzing fMRI data, or writing a script that reinvented the wheel just to get a better understanding of the wheel. That arrangement seemed to work pretty well, but strangely, with every year of grad school, I found myself working less and less on so-called “personal development” projects and more and more on supposedly important things like writing journal articles and reviewing other people’s journal articles and just generally acting like someone who has some sort of overarching purpose.

Now that I’m a worldly post-doc in a new lab, I frankly find the thought of having to spend time learning to do new things quite distressing. For example, my new PI’s lab uses a different set of analysis packages than I used in graduate school. So I have to learn to use those packages before I can do much of anything. They’re really great tools, and I don’t have any doubt that I will in fact learn to use them (probably sooner rather than later); I just find it incredibly annoying to have to spend the time doing that. It feels like it’s taking time away from my real work, which is writing. Whereas five years ago, I would have gleefully thrown myself at any opportunity to learn to use a new tool, precisely because it would have allowed me to avoid nasty, icky activities like writing.

In the grand scheme of things, I suppose the transition is for the best. It’s hard to be productive as an academic when you spend all your time learning new things; at some point, you have to turn the things you learn into a product you can communicate to other people. I like the fact that I’ve become more conscientious with age (which, it turns out, is a robust phenomenon); I just wish I didn’t feel so guilty ‘wasting’ my time learning new things. And it’s not like I feel I know everything I need to know. More than ever, I can identify all sorts of tools and skills that would help me work more efficiently if I just took the time to learn them. But learning things often seems like a luxury in this new grown-up world where you do the things you’re supposed to do before the things you actually enjoy most. I fully expect this trend to continue, so that 5 years from now, when someone suggests a new tool or technique I should look into, I’ll just run for the door with my hands covering my ears…

the genetics of dog hair

Aside from containing about eleventy hundred papers on Ardi–our new 4.4-million-year-old ancestor–this week’s issue of Science has an interesting article on the genetics of dog hair. What is there to know about dog hair, you ask? Well, it turns out that nearly all of the phenotypic variation in dog coats (curly, shaggy, short-haired, etc.) is explained by recent mutations in just three genes. It’s another beautiful example of how complex phenotypes can emerge from relatively small genotypic differences. I’d tell you much more about it, but I’m very busy right now. For more explanation, see here, here, and here (you’re free to ignore the silly headline of that last article). Oh, and here’s a key figure from the paper. I’ve heard that a picture is worth a thousand words, which effectively makes this a 1200-word post. All this writing is hurting my brain, so I’ll stop now.

a tale of dogs, their coats, and three genetic mutations

diamonds, beer, bars, and pandas: the 2009 Ig Nobel prizes

Apparently I missed this, but the 2009 Ig Nobel prizes were awarded a couple of days ago. There’s a lot of good stuff this year, so it’s hard to pick a favorite; you have people making diamonds from tequila, demonstrating that beer bottles can crack human skulls, turning bras into facemasks, and reducing garbage mass by 90% using… wait for it… panda poop. That said, I think my favorite is this one right here–the winners of the Economics prize:

The directors, executives, and auditors of four Icelandic Banks — Kaupthing Bank, Landsbanki, Glitnir Bank, and Central Bank of Iceland — for demonstrating that tiny banks can be rapidly transformed into huge banks, and vice versa — and for demonstrating that similar things can be done to an entire national economy.

And yes, I do feel bad about myself for finding that funny.

[h/t: Language Log]

younger and wiser?

Peer reviewers get worse as they age, not better. That’s the conclusion drawn by a study discussed in the latest issue of Nature. The study isn’t published yet, and it’s based on an analysis of 1,400 reviews in just one biomedical journal (The Annals of Emergency Medicine), but there’s no obvious reason why these findings shouldn’t generalize to other areas of research. From the article:

The most surprising result, however, was how individual reviewers’ scores changed over time: 93% of them went down, which was balanced by fresh young reviewers coming on board and keeping the average score up. The average decline was 0.04 points per year.

That 0.04/year is, I presume, on a scale of 5, and the quality of reviews was rated by the editors of the journal. This turns the dogma of experience on its head, in that it suggests editors are better off asking more junior academics for reviews (though whether this data actually affects editorial policy remains to be seen). Of course, the key question–and one that unfortunately isn’t answered in the study–is why more senior academics give worse reviews. It’s unlikely that experience makes you a poorer scientist, so the most likely explanation is that “older reviewers tend to cut corners,” as the article puts it. Anecdotally, I’ve noticed this myself in the dozen or so reviews I’ve completed; my reviews often tend to be relatively long compared to those of the other reviewers, most of whom are presumably more senior. I imagine length of review is (very) loosely used as a proxy for quality of review by editors, since a longer review will generally be more comprehensive. But this probably says more about constraints on reviewers’ time than anything else. I don’t have grants to write and committees to sit on; my job consists largely of writing papers, collecting data, and playing the occasional video game (er, keeping up with the literature).

Aside from time constraints, senior researchers probably also have less riding on a review than junior researchers do. A superficial review from an established researcher is unlikely to affect one’s standing in the field, but as someone with no reputation to speak of, I usually feel a modicum of pressure to do at least a passable job reviewing a paper. Not that reviews make a big difference (they are, after all, anonymous to all but the editors, and occasionally, the authors), but at this point in my career they seem like something of an opportunity, whereas I’m sure twenty or thirty years from now they’ll feel much more like an obligation.

Anyway, that’s all idle speculation. The real highlight of the Nature article is actually this gem:

Others are not so convinced that older reviewers aren’t wiser. “This is a quantitative review, which is fine, but maybe a qualitative study would show something different,” says Paul Hébert, editor of the Canadian Medical Association Journal in Ottawa. A thorough review might score highly on the Annals scale, whereas a less thorough but more insightful review might not, he says. “When you’re young you spend more time on it and write better reports. But I don’t want a young person on a panel when making a multi-million-dollar decision.”

I think the second quote is on the verge of being reasonable (though DrugMonkey disagrees), but the first is, frankly, silly. Qualitative studies can show almost anything you want them to show; I thought that was precisely why we do quantitative studies…

[h/t: DrugMonkey]