Big Pitch or Big Lottery? The unenviable task of evaluating the grant review system

This week’s issue of Science has an interesting article on The Big Pitch–a pilot NSF initiative to determine whether anonymizing proposals and dramatically cutting down their length (from 15 pages to 2) has a substantial impact on the results of the review process. The answer appears to be an unequivocal yes. From the article:

What happens is a lot, according to the first two rounds of the Big Pitch. NSF’s grant reviewers who evaluated short, anonymized proposals picked a largely different set of projects to fund compared with those chosen by reviewers presented with standard, full-length versions of the same proposals.

Not surprisingly, the researchers who did well under the abbreviated format are pretty pleased:

Shirley Taylor, an awardee during the evolution round of the Big Pitch, says a comparison of the reviews she got on the two versions of her proposal convinced her that anonymity had worked in her favor. An associate professor of microbiology at Virginia Commonwealth University in Richmond, Taylor had failed twice to win funding from the National Institutes of Health to study the role of an enzyme in modifying mitochondrial DNA.

Both times, she says, reviewers questioned the validity of her preliminary results because she had few publications to her credit. Some reviews of her full proposal to NSF expressed the same concern. Without a biographical sketch, Taylor says, reviewers of the anonymous proposal could “focus on the novelty of the science, and this is what allowed my proposal to be funded.”

Broadly speaking, there are two ways to interpret the divergent results of the standard and abbreviated review. The charitable interpretation is that the change in format is, in fact, beneficial, inasmuch as it eliminates prior reputation as one source of bias and forces reviewers to focus on the big picture rather than on small methodological details. Of course, as Prof-Like Substance points out in an excellent post, one could mount a pretty reasonable argument that this isn’t necessarily a good thing. After all, a scientist’s past publication record is likely to be a good predictor of their future success, so it’s not clear that proposals should be anonymous when large amounts of money are on the line (and there are other ways to counteract the bias against newbies–e.g., NIH’s approach of explicitly giving New Investigators a payline boost until they get their first R01). And similarly, some scientists might be good at coming up with big ideas that sound plausible at first blush and not so good at actually carrying out the research program required to bring those big ideas to fruition. Still, at the very least, if we’re being charitable, The Big Pitch certainly does seem like a very different kind of approach to review.

The less charitable interpretation is that the reason the ratings of the standard and abbreviated proposals showed very little correlation is that the latter approach is just fundamentally unreliable. If you suppose that it’s just not possible to reliably distinguish a very good proposal from a somewhat good one on the basis of just 2 pages, it makes perfect sense that 2-page and 15-page proposal ratings don’t correlate much–since you’re basically selecting at random in the 2-page case. Understandably, researchers who happen to fare well under the 2-page format are unlikely to see it that way; they’ll probably come up with many plausible-sounding reasons why a shorter format just makes more sense (just like most researchers who tend to do well with the 15-page format probably think it’s the only sensible way for NSF to conduct its business). We humans are all very good at finding self-serving rationalizations for things, after all.

Personally I don’t have very strong feelings about the substantive merits of short versus long-format review–though I guess I do find it hard to believe that 2-page proposals could be ranked very reliably given that some very strange things seem to happen with alarming frequency even with 12- and 15-page proposals. But it’s an empirical question, and I’d love to see relevant data. In principle, the NSF could have obtained that data by having two parallel review panels rate all of the 2-page proposals (or even 4 panels, since one would also like to know how reliable the normal review process is). That would allow the agency to directly quantify the reliability of the ratings by looking at their cross-panel consistency. Absent that kind of data, it’s very hard to know whether the results Science reports on are different because 2-page review emphasizes different (but important) things, or because a rating process based on an extended 2-page abstract just amounts to a glorified lottery.
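
To make the cross-panel consistency idea concrete, here's a minimal sketch (in Python, with entirely made-up numbers) of the kind of check that would be possible if every 2-page proposal were rated by two independent panels; this is just an illustration of the logic, not anything NSF actually did:

```python
import numpy as np

# Hypothetical data: two independent panels each rate the same set of
# 2-page proposals, modeled here as noisy readings of some underlying
# "true quality" signal.
rng = np.random.default_rng(0)
n_proposals = 60
true_quality = rng.normal(size=n_proposals)
panel_a = true_quality + rng.normal(scale=1.5, size=n_proposals)
panel_b = true_quality + rng.normal(scale=1.5, size=n_proposals)

# Cross-panel consistency: if 2-page review were essentially a lottery,
# this correlation would hover near zero.
print(f"cross-panel correlation: {np.corrcoef(panel_a, panel_b)[0, 1]:.2f}")
```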

Alternatively, and perhaps more pragmatically, NSF could just wait a few years to see how the projects funded under the pilot program turn out (and I’m guessing this is part of their plan). I.e., do the researchers who do well under the 2-page format end up producing science as good as (or better than) the researchers who do well under the current system? This sounds like a reasonable approach in principle, but the major problem is that we’re only talking about a total of ~25 funded proposals (across two different review panels), so it’s unclear that there will be enough data to draw any firm conclusions. Certainly many scientists (including me) are likely to feel a bit uneasy at the thought that NSF might end up making major decisions about how to allocate billions of dollars on the basis of two dozen grants.
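
A quick back-of-the-envelope power calculation illustrates the worry (the effect size here is a made-up assumption, and statsmodels is used purely for illustration): with only ~25 projects per group, the power to detect even a fairly large difference in outcomes comes out in the neighborhood of 40%, well below the conventional 80% target.

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical comparison: ~25 projects funded under the 2-page format
# vs. ~25 funded under standard review, on some later outcome measure.
# Cohen's d = 0.5 is an assumption, not a known effect size.
power = TTestIndPower().solve_power(effect_size=0.5, nobs1=25,
                                    alpha=0.05, ratio=1.0)
print(f"power to detect d = 0.5 with 25 projects per group: {power:.2f}")
```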

Anyway, skepticism aside, this isn’t really meant as a criticism of NSF so much as an acknowledgment of the fact that the problem in question is a really, really difficult one. The task of continually evaluating and improving the grant review process is not one anyone should take on lightly. If time and money were no object, every proposed change (like dramatically shortened proposals) would be extensively tested on a large scale and directly compared to the current approach before being implemented. Unfortunately, flying thousands of scientists to Washington D.C. is a very expensive business (to say nothing of all the surrounding costs), and I imagine that testing out a substantively different kind of review process on a large scale could easily run into the tens of millions of dollars. In a sense, the funding agencies can’t really win. On the one hand, if they only ever pilot new approaches on a small scale, they never get enough empirical data to confidently back major changes in policy. On the other hand, if they pilot new approaches on a large scale and those approaches end up failing to improve on the current system (as is the fate of most innovative new ideas), the funding agencies get hammered by politicians and scientists alike for wasting taxpayer money in an already-harsh funding climate.

I don’t know what the solution is (or if there is one), but if nothing else, I do think it’s a good thing that NSF and NIH continue to actively tinker with their various processes. After all, if there’s anything most researchers can agree on, it’s that the current system is very far from perfect.

Too much p = .048? Towards partial automation of scientific evaluation

Distinguishing good science from bad science isn’t an easy thing to do. One big problem is that what constitutes ‘good’ work is, to a large extent, subjective; I might love a paper you hate, or vice versa. Another problem is that science is a cumulative enterprise, and the value of each discovery is, in some sense, determined by how much of an impact that discovery has on subsequent work–something that often only becomes apparent years or even decades after the fact. So, to an uncomfortable extent, evaluating scientific work involves a good deal of guesswork and personal preference, which is probably why scientists tend to fall back on things like citation counts and journal impact factors as tools for assessing the quality of someone’s work. We know it’s not a great way to do things, but it’s not always clear how else we could do better.

Fortunately, there are many aspects of scientific research that don’t depend on subjective preferences or require us to suspend judgment for ten or fifteen years. In particular, methodological aspects of a paper can often be evaluated in a (relatively) objective way, and strengths or weaknesses of particular experimental designs are often readily discernible. For instance, in psychology, pretty much everyone agrees that large samples are generally better than small samples, reliable measures are better than unreliable measures, representative samples are better than WEIRD ones, and so on. The trouble when it comes to evaluating the methodological quality of most work isn’t so much that there’s rampant disagreement between reviewers (though it does happen); it’s that research articles are complicated products, and the odds of any individual reviewer having the expertise, motivation, and attention span to catch every major methodological concern in a paper are exceedingly small. Since only two or three people typically review a paper pre-publication, it’s not surprising that in many cases, whether or not a paper makes it through the review process depends as much on who happened to review it as on the paper itself.

A nice example of this is the Bem paper on ESP I discussed here a few weeks ago. I think most people would agree that things like data peeking, lumping and splitting studies, and post-hoc hypothesis testing–all of which are apparent in Bem’s paper–are generally not good research practices. And no doubt many potential reviewers would have noted these and other problems with Bem’s paper had they been asked to review it. But as it happens, the actual reviewers didn’t note those problems (or at least, not enough of them), so the paper was accepted for publication.

I’m not saying this to criticize Bem’s reviewers, who I’m sure all had a million other things to do besides pore over the minutiae of a paper on ESP (and for all we know, they could have already caught many other problems with the paper that were subsequently addressed before publication). The problem is a much more general one: the pre-publication peer review process in psychology, and many other areas of science, is pretty inefficient and unreliable, in the sense that it draws on the intense efforts of a very few, semi-randomly selected individuals, as opposed to relying on a much broader evaluation by the community of researchers at large.

In the long term, the best solution to this problem may be to fundamentally rethink the way we evaluate scientific papers–e.g., by designing new platforms for post-publication review of papers (e.g., see this post for more on efforts towards that end). I think that’s far and away the most important thing the scientific community could do to improve the quality of scientific assessment, and I hope we ultimately will collectively move towards alternative models of review that look a lot more like the collaborative filtering systems found on, say, reddit or Stack Overflow than like peer review as we now know it. But that’s a process that’s likely to take a long time, and I don’t profess to have much of an idea as to how one would go about kickstarting it.

What I want to focus on here is something much less ambitious, but potentially still useful–namely, the possibility of automating the assessment of at least some aspects of research methodology. As I alluded to above, many of the factors that help us determine how believable a particular scientific finding is are readily quantifiable. In fact, in many cases, they’re already quantified for us. Sample sizes, p values, effect sizes, coefficient alphas… all of these things are, in one sense or another, indices of the quality of a paper (however indirect), and are easy to capture and code. And many other things we care about can be captured with only slightly more work. For instance, if we want to know whether the authors of a paper corrected for multiple comparisons, we could search for strings like “multiple comparisons”, “uncorrected”, “Bonferroni”, and “FDR”, and probably come away with a pretty decent idea of what the authors did or didn’t do to correct for multiple comparisons. It might require a small dose of technical wizardry to do this kind of thing in a sensible and reasonably accurate way, but it’s clearly feasible–at least for some types of variables.
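
Just to make the idea concrete, here’s a minimal sketch of what that kind of extraction might look like (the regular expression and keyword lists are illustrative, not a vetted implementation):

```python
import re

TEXT = """We tested 24 undergraduates. The effect was significant,
t(23) = 2.1, p = .047 (uncorrected); a second test gave p = .03.
No correction for multiple comparisons was applied."""

# Pull out reported p values.
p_values = [float(m) for m in re.findall(r"p\s*[=<]\s*(\.\d+)", TEXT)]

# Crude keyword flags for multiple-comparison handling.
good_signs = ["bonferroni", "fdr", "corrected for multiple comparisons"]
bad_signs = ["uncorrected", "exploratory", "marginal"]
lower = TEXT.lower()
flags = {term: term in lower for term in good_signs + bad_signs}

print(p_values)   # [0.047, 0.03]
print(flags)
```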

Once we extracted a bunch of data about the distribution of p values and sample sizes from many different papers, we could then start to do some interesting (and potentially useful) things, like generating automated metrics of research quality. For instance:

  • In multi-study articles, the variance in sample size across studies could tell us something useful about the likelihood that data peeking is going on (for an explanation as to why, see this). Other things being equal, an article containing 9 studies with identical sample sizes is less likely to be capitalizing on chance than one containing 9 studies that range in sample size between 50 and 200 subjects (as the Bem paper does), so high variance in sample size could be used as a rough index of the proclivity to peek at the data (a toy version of this and the other metrics below is sketched right after this list).
  • Quantifying the distribution of p values found in an individual article or an author’s entire body of work might be a reasonable first-pass measure of the amount of fudging (usually inadvertent) going on. As I pointed out in my earlier post, it’s interesting to note that with only one or two exceptions, virtually all of Bem’s statistically significant results come very close to p = .05. That’s not what you expect to see when hypothesis testing is done in a really principled way, because it’s exceedingly unlikely that a researcher would be so lucky as to always just barely obtain the expected result. But a bunch of p = .03 and p = .048 results are exactly what you expect to find when researchers test multiple hypotheses and report only the ones that produce significant results.
  • The presence or absence of certain terms or phrases is probably at least slightly predictive of the rigorousness of the article as a whole. For instance, the frequent use of phrases like “cross-validated”, “statistical power”, “corrected for multiple comparisons”, and “unbiased” is probably a good sign (though not necessarily a strong one); conversely, terms like “exploratory”, “marginal”, and “small sample” might provide at least some indication that the reported findings are, well, exploratory.
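
Purely for illustration, here’s a toy version of those three heuristics (the thresholds and keyword lists are arbitrary assumptions, not validated choices):

```python
import statistics

def first_pass_metrics(sample_sizes, p_values, text):
    """Toy heuristics; none of these are validated quality measures."""
    # 1. Variability in sample size across studies (coefficient of variation).
    cv = statistics.stdev(sample_sizes) / statistics.mean(sample_sizes)

    # 2. Proportion of significant p values that fall just under .05.
    sig = [p for p in p_values if p < .05]
    near_05 = sum(.04 <= p < .05 for p in sig) / len(sig) if sig else 0.0

    # 3. Crude keyword counts.
    lower = text.lower()
    reassuring = sum(lower.count(t) for t in
                     ("cross-validated", "statistical power", "corrected"))
    worrying = sum(lower.count(t) for t in
                   ("exploratory", "marginal", "one-tailed"))

    return {"sample_size_cv": cv, "prop_p_near_05": near_05,
            "reassuring_terms": reassuring, "worrying_terms": worrying}

# Hypothetical multi-study article.
print(first_pass_metrics(sample_sizes=[50, 100, 150, 200],
                         p_values=[.048, .03, .012, .049],
                         text="An exploratory, one-tailed test..."))
```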

These are just the first examples that come to mind; you can probably think of other better ones. Of course, these would all be pretty weak indicators of paper (or researcher) quality, and none of them are in any sense unambiguous measures. There are all sorts of situations in which such numbers wouldn’t mean much of anything. For instance, high variance in sample sizes would be perfectly justifiable in a case where researchers were testing for effects expected to have very different sizes, or conducting different kinds of statistical tests (e.g., detecting interactions is much harder than detecting main effects, and so necessitates larger samples). Similarly, p values close to .05 aren’t necessarily a marker of data snooping and fishing expeditions; it’s conceivable that some researchers might be so good at what they do that they can consistently design experiments that just barely manage to show what they’re intended to (though it’s not very plausible). And a failure to use terms like “corrected”, “power”, and “cross-validated” in a paper doesn’t necessarily mean the authors failed to consider important methodological issues, since such issues aren’t necessarily relevant to every single paper. So there’s no question that you’d want to take these kinds of metrics with a giant lump of salt.

Still, there are several good reasons to think that even relatively flawed automated quality metrics could serve an important purpose. First, many of the problems could be overcome to some extent through aggregation. You might not want to conclude that a particular study was poorly done simply because most of the reported p values were very close to .05; but if you were to look at a researcher’s entire body of, say, thirty or forty published articles, and noticed the same trend relative to other researchers, you might start to wonder. Similarly, we could think about composite metrics that combine many different first-order metrics to generate a summary estimate of a paper’s quality that may not be so susceptible to contextual factors or noise. For instance, in the case of the Bem ESP article, a measure that took into account the variance in sample size across studies, the closeness of the reported p values to .05, the mention of terms like ‘one-tailed test’, and so on, would likely not have assigned Bem’s article a glowing score, even if each individual component of the measure was not very reliable.
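
For what it’s worth, the simplest version of such a composite might just standardize each first-order metric against a reference corpus and average the results (the equal weights and the numbers below are placeholders, nothing more):

```python
import statistics

def composite_score(paper_metrics, reference_metrics):
    """Average of z-scores relative to a reference set of papers.
    Higher = more potential red flags. Real weights would have to be
    learned and validated against some external criterion."""
    zs = []
    for name, value in paper_metrics.items():
        ref = reference_metrics[name]
        zs.append((value - statistics.mean(ref)) / statistics.stdev(ref))
    return sum(zs) / len(zs)

# Hypothetical example: one paper compared against a small reference set.
paper = {"prop_p_near_05": 0.8, "sample_size_cv": 0.6}
reference = {"prop_p_near_05": [0.10, 0.20, 0.30, 0.15],
             "sample_size_cv": [0.10, 0.20, 0.05, 0.30]}
print(f"composite red-flag score: {composite_score(paper, reference):.2f}")
```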

Second, I’m not suggesting that crude automated metrics would replace current evaluation practices; rather, they’d be used strictly as a complement. Essentially, you’d have some additional numbers to look at, and you could choose to use them or not, as you saw fit, when evaluating a paper. If nothing else, they could help flag potential issues that reviewers might not be spontaneously attuned to. For instance, a report might note the fact that the term “interaction” was used several times in a paper in the absence of “main effect,” which might then cue a reviewer to ask, hey, why you no report main effects? — but only if they deemed it a relevant concern after looking at the issue more closely.

Third, automated metrics could be continually updated and improved using machine learning techniques. Given some criterion measure of research quality, one could systematically train and refine an algorithm capable of doing a decent job of recapturing that criterion. Of course, it’s not clear that we really have any unobjectionable standard to use as a criterion in this kind of training exercise (which only underscores why it’s important to come up with better ways to evaluate scientific research). But a reasonable starting point might be to try to predict replication likelihood for a small set of well-studied effects based on the features of the original report. Could you, for instance, show in an automated way that initial effects reported in studies that failed to correct for multiple comparisons, or that reported p values closer to .05, were less likely to be subsequently replicated?
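
Here’s a minimal sketch of what that training exercise might look like, assuming one had a (hypothetical) table of extracted features plus a binary replicated/not-replicated label for each original finding (scikit-learn is used purely for illustration, and the data below are randomly generated stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrix: one row per original finding, with columns
# like [proportion of p values near .05, sample-size CV, corrected-for-
# multiple-comparisons flag]. Labels: 1 = later replicated, 0 = not.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (-X[:, 0] + rng.normal(size=200) > 0).astype(int)  # fake labels

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)  # out-of-sample accuracy
print(f"cross-validated accuracy: {scores.mean():.2f}")
```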

Of course, as always with this kind of stuff, the rub is that it’s easy to talk the talk and not so easy to walk the walk. In principle, we can make up all sorts of clever metrics, but in practice, it’s not trivial to automatically extract even a piece of information as seemingly simple as sample size from many papers (consider the difference between “Undergraduates (N = 15) participated…” and “Forty-two individuals diagnosed with depression and an equal number of healthy controls took part…”), let alone build sophisticated composite measures that could reasonably well approximate human judgments. It’s all well and good to write long blog posts about how fancy automated metrics could help separate good research from bad, but I’m pretty sure I don’t want to actually do any work to develop them, and you probably don’t either. Still, the potential benefits are clear, and it’s not like this is science fiction–it’s clearly viable on at least a modest scale. So someone should do it… Maybe Elsevier? Jorge Hirsch? Anyone? Bueller? Bueller?
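
As a concrete footnote to the extraction problem above: a naive regex handles the first phrasing just fine and silently fails on the second (the function and sentences here are purely illustrative, not a robust parser):

```python
import re

def extract_n(sentence):
    """Naive sample-size extraction; only handles 'N = 15'-style reports."""
    match = re.search(r"\bN\s*=\s*(\d+)", sentence, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

print(extract_n("Undergraduates (N = 15) participated..."))            # 15
print(extract_n("Forty-two individuals diagnosed with depression "
                "and an equal number of healthy controls took part"))  # None
```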