Category Archives: measurement

The reviewer’s dilemma, or why you shouldn’t get too meta when you’re supposed to be writing a review that’s already overdue

When I review papers for journals, I often find myself facing something of a tension between two competing motives. On the one hand, I’d like to evaluate each manuscript as an independent contribution to the scientific literature–i.e., without having to worry about how the manuscript stacks up against other potential manuscripts I could be reading. The rationale being that the plausibility of the findings reported in a manuscript shouldn’t really depend on what else is being published in the same journal, or in the field as a whole: if there are methodological problems that threaten the conclusions, they shouldn’t become magically more or less problematic just because some other manuscript has (or doesn’t have) gaping holes. Reviewing should simply be a matter of documenting one’s major concerns and suggestions and sending them back to the Editor for infallible judgment.

The trouble with this idea is that if you’re of a fairly critical bent, you probably don’t believe the majority of the findings reported in the manuscripts sent to you to review. Empirically, this actually appears to be the right attitude to hold, because as a good deal of careful work by biostatisticians like John Ioannidis shows, most published research findings are false, and most true associations are inflated. So, in some ideal world, where the job of a reviewer is simply to assess the likelihood that the findings reported in a paper provide an accurate representation of reality, and/or to identify ways of bringing those findings closer in line with reality, skepticism is the appropriate default attitude. Meaning, if you keep the question “why don’t I believe these results?” firmly in mind as you read through a paper and write your review, you probably aren’t going to go wrong all that often.

The problem is that, for better or worse, one’s job as a reviewer isn’t really–or at least, solely–to evaluate the plausibility of other people’s findings. In large part, it’s to evaluate the plausibility of reported findings in relation to the other stuff that routinely gets published in the same journal. For instance, if you regularly review papers for a very low-tier journal, the editor is probably not going to be very thrilled to hear you say “well, Ms. Editor, none of the last 15 papers you’ve sent me are very good, so you should probably just shut down the journal.” So a tension arises between writing a comprehensive review that accurately captures what the reviewer really thinks about the results–which is often (at least in my case) something along the lines of “pffft, there’s no fucking way this is true”–and writing a review that weighs the merits of the reviewed manuscript relative to the other candidates for publication in the same journal.

To illustrate, suppose I review a paper and decide that, in my estimation, there’s only a 20% chance the key results reported in the paper would successfully replicate (for the sake of argument, we’ll pretend I’m capable of this level of precision). Should I recommend outright rejection? Maybe, since 1 in 5 odds of long-term replication don’t seem very good. But then again, what if 20% is actually better than average? What if I think the average article I’m sent to review only has a 10% chance of holding up over time? In that case, if I recommend rejection of the 20% article, and the editor follows my recommendation, most of the time I’ll actually be contributing to the journal publishing poorer quality articles than if I’d recommended accepting the manuscript, even if I’m pretty sure the findings reported in the manuscript are false.

Lest this sound like I’m needlessly overanalyzing the review process instead of buckling down and writing my own overdue reviews (okay, you’re right, now stop being a jerk), consider what happens when you scale the problem up. When journal editors send reviewers manuscripts to look over, the question they really want an answer to is, “how good is this paper compared to everything else that crosses my desk?” But most reviewers naturally incline to answer a somewhat different–and easier–question, namely, “in the grand scheme of life, the universe, and everything, how good is this paper?” The problem, then, is that if the variance in curmudgeonliness between reviewers exceeds the (reliable) variance within reviewers, then arguably the biggest factor in determining whether or not a given paper gets rejected is simply who happens to review it. Not how much expertise the reviewer has, or even how ‘good’ they are (in the sense that some reviewers are presumably better than others at identifying serious problems and overlooking trivial ones), but simply how critical they are on average. Which is to say, if I’m Reviewer 2 on your manuscript, you’ll probably have a better chance of rejection than if Reviewer 2 is someone who characteristically writes one-paragraph reviews that begin with the words “this is an outstanding and important piece of work…”
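To make the variance argument concrete, here’s a toy simulation (in Python; every number in it is made up purely for illustration) of what happens when between-reviewer differences in harshness are large relative to the noise within any single reviewer’s judgments:

```python
import numpy as np

rng = np.random.default_rng(0)
n_reviewers, n_papers = 200, 1000

# Made-up world: every paper has a true quality, every reviewer has a stable
# harshness bias, and each individual review adds a little noise on top.
paper_quality = rng.normal(0, 1.0, n_papers)
reviewer_bias = rng.normal(0, 1.5, n_reviewers)   # between-reviewer variance
review_noise = rng.normal(0, 0.5, n_papers)       # within-reviewer noise

# Each paper gets one randomly assigned reviewer (a crude stand-in for peer review).
assigned = rng.integers(0, n_reviewers, n_papers)
scores = paper_quality - reviewer_bias[assigned] + review_noise

# When reviewer bias dominates, the score (and hence the accept/reject decision)
# tracks who reviewed the paper more closely than how good the paper actually is.
print("r(score, paper quality):  %.2f" % np.corrcoef(scores, paper_quality)[0, 1])
print("r(score, reviewer bias): %.2f" % np.corrcoef(scores, reviewer_bias[assigned])[0, 1])
```

With the (arbitrary) numbers above, the review score correlates around .5 with paper quality and around –.8 with reviewer harshness, which is exactly the worry: the single biggest determinant of the outcome is who got assigned to review.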

Anyway, on some level this is a pretty trivial observation; after all, we all know that the outcome of the peer review process is, to a large extent, tantamount to a roll of the dice. We know that there are cranky reviewers and friendly reviewers, and we often even have a sense of who they are, which is why we often suggest people to include or exclude as reviewers in our cover letters. The practical question though–and the reason for bringing this up here–is this: given that we have this obvious and ubiquitous problem of reviewers having different standards for what’s publishable, and that this undeniably impacts the outcome of peer review, are there any simple steps we could take to improve the reliability of the review process?

The way I’ve personally made peace between my desire to provide the most comprehensive and accurate review I can and the pragmatic need to evaluate each manuscript in relation to other manuscripts is to use the “comments to the Editor” box to provide some additional comments about my review. Usually what I end up doing is writing my review with little or no thought for practical considerations such as “how prestigious is this journal” or “am I a particularly harsh reviewer” or “is this a better or worse paper than most others in this journal”. Instead, I just write my review, and then when I’m done, I use the comments to the editor to say things like “I’m usually a pretty critical reviewer, so don’t take the length of my review as an indication I don’t like the manuscript, because I do,” or, “this may seem like a negative review, but it’s actually more positive than most of my reviews, because I’m a huge jerk.” That way I can appease my conscience by writing the review I want to while still giving the editor some indication as to where I fit in the distribution of reviewers they’re likely to encounter.

I don’t know if this approach makes any difference at all, and maybe editors just routinely ignore this kind of thing; it’s just the best solution I’ve come up with that I can implement all by myself, without asking anyone else to change their behavior. But if we allow ourselves to contemplate alternative approaches that include changes to the review process itself (while still adhering to the standard pre-publication review model, which, like many other people, I’ve argued is fundamentally dysfunctional), then there are many other possibilities.

One idea, for instance, would be to include calibration questions that could be used to estimate (and correct for) individual differences in curmudgeonliness. For instance, in addition to questions about the merit of the manuscript itself, the review form could have a question like “what proportion of articles you review do you estimate end up being rejected?” or “do you consider yourself a more critical or less critical reviewer than most of your peers?”

Another, logistically more difficult, idea would be to develop a centralized database of review outcomes, so that editors could see what proportion of each reviewer’s assignments ultimately end up being rejected (though they couldn’t see the actual content of the reviews). I don’t know if this type of approach would improve matters at all; it’s quite possible that the review process is fundamentally so inefficient and slow that editors just don’t have the time to spend worrying about this kind of thing. But it’s hard to believe that there aren’t some simple calibration steps we could take to bring reviewers into closer alignment with one another–even if we’re confined to working within the standard pre-publication model of peer review. And given the abysmally low reliability of peer review, even small improvements could potentially produce large benefits in the aggregate.
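For what it’s worth, the kind of calibration I have in mind needn’t be fancy. Here’s a minimal sketch (in Python, with entirely made-up numbers and a hypothetical data structure) of how an editor could discount or amplify a reject recommendation based on a reviewer’s historical rejection rate:

```python
# Hypothetical reviewer histories (made up); in practice these would come from
# something like the centralized database of review outcomes described above.
historical_rejection_rate = {"reviewer_a": 0.85, "reviewer_b": 0.40}
overall_rejection_rate = 0.60

def calibrated_reject_signal(reviewer: str, recommends_reject: bool) -> float:
    """Weight a reject recommendation by how much more (or less) often this
    reviewer rejects papers than the average reviewer does."""
    harshness = historical_rejection_rate[reviewer] - overall_rejection_rate
    return (1.0 if recommends_reject else 0.0) - harshness

print(calibrated_reject_signal("reviewer_a", True))  # 0.75: a known curmudgeon, discounted
print(calibrated_reject_signal("reviewer_b", True))  # 1.20: a friendly reviewer, amplified
```

Obviously no editor is going to plug recommendations into a formula like this one, but even an informal version of the same adjustment would go some way toward putting reviewers on a common scale.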

what aspirin can tell us about the value of antidepressants

There’s a nice post on Science-Based Medicine by Harriet Hall pushing back (kind of) against the increasingly popular idea that antidepressants don’t work. For context, there have been a couple of large recent meta-analyses that used comprehensive FDA data on clinical trials of antidepressants (rather than only published studies, which are biased towards larger, statistically significant, effects) to argue that antidepressants are of little or no use in mild or moderately-depressed people, and achieve a clinically meaningful benefit only in the severely depressed.

Hall points out that whether you think antidepressants have a clinically meaningful benefit or not depends on how you define clinically meaningful (okay, this sounds vacuous, but bear with me). Most meta-analyses of antidepressant efficacy reveal an effect size of somewhere between 0.3 and 0.5 standard deviations. Historically, psychologists have considered effect sizes of 0.2, 0.5, and 0.8 standard deviations to be small, medium, and large, respectively. But as Hall points out:

The psychologist who proposed these landmarks [Jacob Cohen] admitted that he had picked them arbitrarily and that they had “no more reliable a basis than my own intuition.” Later, without providing any justification, the UK’s National Institute for Health and Clinical Excellence (NICE) decided to turn the 0.5 landmark (why not the 0.2 or the 0.8 value?) into a one-size-fits-all cut-off for clinical significance.

She goes on to explain why this ultimately leaves the efficacy of antidepressants open to interpretation:

In an editorial published in the British Medical Journal (BMJ), Turner explains with an elegant metaphor: journal articles had sold us a glass of juice advertised to contain 0.41 liters (0.41 being the effect size Turner, et al. derived from the journal articles); but the truth was that the “glass” of efficacy contained only 0.31 liters. Because these amounts were lower than the (arbitrary) 0.5 liter cut-off, NICE standards (and Kirsch) consider the glass to be empty. Turner correctly concludes that the glass is far from full, but it is also far from empty. He also points out that patients’ responses are not all-or-none and that partial responses can be meaningful.

I think this pretty much hits the nail on the head; no one really doubts that antidepressants work at this point; the question is whether they work well enough to justify their side effects and the social and economic costs they impose. I don’t have much to add to Hall’s argument, except that I think she doesn’t sufficiently emphasize how big a role scale plays when trying to evaluate the utility of antidepressants (or any other treatment). At the level of a single individual, a change of one-third of a standard deviation may not seem very big (then again, if you’re currently depressed, it might!). But on a societal scale, even canonically ‘small’ effects can have very large consequences in the aggregate.

The example I’m most fond of here is Robert Rosenthal’s famous illustration of the effects of aspirin on heart attack. The correlation between taking aspirin daily and decreased risk of heart attack is, at best, .03 (I say at best because the estimate is based on a large 1988 study, but my understanding is that more recent studies have moderated even this small effect). In most domains of psychology, a correlation of .03 is so small as to be completely uninteresting. Most psychologists would never seriously contemplate running a study to try to detect an effect of that size. And yet, at a population level, even an r of .03 can have serious implications. Cast in a different light, what this effect means is that 3% of people who would be expected to have a heart attack without aspirin would be saved from that heart attack given a daily aspirin regimen. Needless to say, this isn’t trivial. It amounts to a potentially life-saving intervention for 30 out of every 1,000 people. At a public policy level, you’d be crazy to ignore something like that (which is why, for a long time, many doctors recommended that people take an aspirin a day). And yet, by the standards of experimental psychology, this is a tiny, tiny effect that probably isn’t worth getting out of bed for.
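For the curious, the arithmetic behind that claim is Rosenthal and Rubin’s binomial effect size display (BESD), which recasts a correlation as the difference in outcome rates between two groups with 50% base rates. A quick sketch:

```python
def besd_rates(r):
    """Binomial effect size display: express a correlation r as outcome rates
    in a hypothetical 2x2 table with 50% base rates (Rosenthal & Rubin)."""
    return 0.5 + r / 2, 0.5 - r / 2

with_aspirin, without_aspirin = besd_rates(0.03)
print(with_aspirin, without_aspirin)             # 0.515 vs 0.485 'no heart attack' rates
print((with_aspirin - without_aspirin) * 1000)   # ~30 fewer heart attacks per 1,000 people
```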

The point of course is that when you consider how many people are currently on antidepressants (millions), even small effects–and certainly an effect of one-third of a standard deviation–are going to be compounded many times over. Given that antidepressants demonstrably reduce the risk of suicide (according to Hall, by about 20%), there’s little doubt that tens of thousands of lives have been saved by antidepressants. That doesn’t necessarily justify their routine use, of course, because the side effects and costs also scale up to the societal level (just imagine how many millions of bouts of nausea could be prevented by eliminating antidepressants from the market!). The point is just that, if you think the benefits of antidepressants outweigh their costs even slightly at the level of the average depressed individual, you’re probably committing yourself to thinking that they have a hugely beneficial impact at a societal level–and that holds true irrespective of whether the effects are ‘clinically meaningful’ by conventional standards.

some people are irritable, but everyone likes to visit museums: what personality inventories tell us about how we’re all just like one another

I’ve recently started recruiting participants for online experiments via Mechanical Turk. In the past I’ve always either relied on directory listings (like this one) or targeted specific populations (e.g., bloggers and twitterers) via email solicitation. But recently I’ve started running a very large-sample decision-making study (it’s here, if you care to contribute to the sample), and waiting for participants to trickle in via directories isn’t cutting it. So I’ve started paying people (very) small amounts of money for participation.

One challenge I’ve had to deal with is figuring out how to filter out participants who aren’t really interested in contributing to science, and are strictly in it for the money. 20 or 30 cents is a pittance to most people in the developed world, but as I’ve found out the hard way, gaming MTurk appears to be a thriving business in some developing countries (some of which I’ve unfortunately had to resort to banning entirely). Cheaters aren’t so much of an issue for very quick tasks like providing individual ratings of faces, because (a) the time it takes to give a fake rating isn’t substantially greater than giving one’s actual opinion, and (b) the standards for what counts as accurate performance are clear, so it’s easy to train workers and weed out the bad apples. Unfortunately, my studies generally involve fairly long personality questionnaires combined with other cognitive tasks (e.g., in the current study, you get to repeatedly allocate hypothetical money between yourself and a computer partner, and rate some faces). They often take around half an hour, and involve 20+ questions per screen, so there’s a pretty big incentive for workers who are only in it for the cash to produce random responses and try to increase their effective wage. And the obvious question then is how to detect cheating in the data.

One of the techniques I’ve found works surprisingly well is to simply compare each person’s pattern of responses across items with the mean for the entire sample. In other words, you just compute the correlation between each individual’s item scores and the means for all the items scores across everyone who’s filled out the same measure. I know that there’s an entire literature on this stuff full of much more sophisticated ways to detect random responding, but I find this crude approach really does quite well (I’ve verified this by comparing it with a bunch of other similar metrics), and has the benefit of being trivial to implement.
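In case it’s useful to anyone, here’s roughly what that looks like in code (a minimal Python/pandas sketch; `responses` is a hypothetical respondents-by-items DataFrame of item scores, and the 0.2 cutoff is arbitrary):

```python
import pandas as pd

def conformity_scores(responses: pd.DataFrame) -> pd.Series:
    """Correlate each respondent's item scores with the sample's item means."""
    item_means = responses.mean(axis=0)
    return responses.apply(lambda row: row.corr(item_means), axis=1)

# Respondents whose response patterns are essentially uncorrelated with everyone
# else's are good candidates for closer inspection (or exclusion).
# flagged = conformity_scores(responses) < 0.2
```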

Anyway, one of the things that surprised me when I first computed these correlations is just how strong the relationship between the sample mean and most individuals’ responses is. Here’s what the distribution looks like for one particular inventory, the 181-item Analog to Multiple Broadband Inventories (AMBI, which I introduced in this paper, and discuss further here):

[Figure: distribution of correlations between individual respondents’ item scores and the sample item means, for the 181-item AMBI]

This is based on a sample of about 600 internet respondents, which actually turns out to be pretty representative of the broader population, as Sam Gosling, Simine Vazire, and Sanjay Srivastava will tell you (for what it’s worth, I’ve done the exact same analysis on a similar-sized off-line dataset from Lew Goldberg’s Eugene-Springfield Community Sample (check out that URL!) and obtained essentially the same results). In this sample, the median correlation is .48; so, in effect, you can predict a quarter of the variance in a typical participant’s responses without knowing anything at all about them. Human beings, it turns out, have some things in common with one another (who knew?). What you think you’re like is probably not very dissimilar to what I think I’m like. Which is kind of surprising, considering you’re a well-adjusted, friendly human being, and I’m a real freakshow (okay, fine: a somewhat eccentric, paranoid kind of guy).

What drives that similarity? Much of it probably has to do with social desirability–i.e., many of the AMBI items (and those on virtually all personality inventories) are evaluatively positive or negative statements that most people are inclined to strongly agree or disagree with. But it seems to be a particular kind of social desirability–one that has to do with openness to new experiences, and particularly intellectual ones. For instance, here are the top 10 most endorsed items (based on mean Likert scores across the entire sample; scores are in parentheses):

  1. like to read (4.62)
  2. like to visit new places (4.39)
  3. was a better than average student when I was in school (4.28)
  4. am a good listener (4.25)
  5. would love to explore strange places (4.22)
  6. am concerned about others (4.2)
  7. am open to new experiences (4.18)
  8. amuse my friends (4.16)
  9. love excitement (4.08)
  10. spend a lot of time reading (4.07)

And conversely, here are the 10 least-endorsed items:

  1. was a slow learner in school (1.52)
  2. don’t think that laws apply to me (1.8)
  3. do not like to visit museums (1.83)
  4. have difficulty imagining things (1.84)
  5. have no special urge to do something original (1.87)
  6. do not like art (1.95)
  7. feel little concern for others (1.97)
  8. don’t try to figure myself out (2.01)
  9. break my promises (2.01)
  10. make enemies (2.06)

You can see a clear evaluative component in both lists: almost everyone believes that they’re concerned about others and thinks that they’re smarter than average. But social desirability and positive illusions aren’t enough to explain these patterns, because there are plenty of other items on the AMBI that have an equally strong evaluative component–for instance, “don’t have much energy”, “cannot imagine lying or cheating”, “see myself as a good leader”, and “am easily annoyed”–yet have mean scores pretty close to the midpoint (in fact, the item ‘am easily annoyed’ is endorsed more highly than 107 of the 181 items!). So it isn’t just that we like to think and say nice things about ourselves; we’re willing to concede that we have some bad traits, but maybe not the ones that have to do with disliking cultural and intellectual experiences. I don’t have much of an idea as to why that might be, but it does introspectively feel to me like there’s more of a stigma about, say, not liking to visit new places or experience new things than admitting that you’re kind of an irritable person. Or maybe it’s just that many of the openness items can be interpreted more broadly than the other evaluative items–e.g., there are lots of different art forms, so almost everyone can endorse a generic “I like art” statement. I don’t really know.

Anyway, there’s nothing the least bit profound about any of this; if anything, it’s just a nice reminder that most of us are not really very good at evaluating where we stand in relation to other people, at least for many traits (for more on that, go read Simine Vazire’s work). The nominal midpoint on most personality scales is usually quite far from the actual median in the general population. This is a pretty big challenge for personality psychology, and if we could figure out how to get people to rank themselves more accurately relative to other people on self-report measures, that would be a pretty huge advance. But it seems quite likely that you just can’t do it, because people simply may not have introspective access to that kind of information.

Fortunately for our ability to measure individual differences in personality, there are plenty of items that do show considerable variance across individuals (actually, in fairness, even items with relatively low variance like the ones above can be highly discriminative if used properly–that’s what item response theory is for). Just for kicks, here are the 10 AMBI items with the largest standard deviations (in parentheses):

  1. disliked math in school (1.56)
  2. wanted to run away from home when I was a child (1.56)
  3. believe in a universal power or god (1.53)
  4. have felt contact with a divine power (1.51)
  5. rarely cry during sad movies (1.46)
  6. am able to fix electrical-wiring problems (1.46)
  7. am devoted to religion (1.44)
  8. shout or scream when I’m angry (1.43)
  9. love large parties (1.42)
  10. felt close to my parents when I was a child (1.42)

So now finally we come to the real moral of this post… that which you’ve read all this long way for. And the moral is this, grasshopper: if you want to successfully pick a fight at a large party, all you need to do is angrily yell at everyone that God told you math sucks.

Too much p = .048? Towards partial automation of scientific evaluation

Distinguishing good science from bad science isn’t an easy thing to do. One big problem is that what constitutes ‘good’ work is, to a large extent, subjective; I might love a paper you hate, or vice versa. Another problem is that science is a cumulative enterprise, and the value of each discovery is, in some sense, determined by how much of an impact that discovery has on subsequent work–something that often only becomes apparent years or even decades after the fact. So, to an uncomfortable extent, evaluating scientific work involves a good deal of guesswork and personal preference, which is probably why scientists tend to fall back on things like citation counts and journal impact factors as tools for assessing the quality of someone’s work. We know it’s not a great way to do things, but it’s not always clear how else we could do better.

Fortunately, there are many aspects of scientific research that don’t depend on subjective preferences or require us to suspend judgment for ten or fifteen years. In particular, methodological aspects of a paper can often be evaluated in a (relatively) objective way, and strengths or weaknesses of particular experimental designs are often readily discernible. For instance, in psychology, pretty much everyone agrees that large samples are generally better than small samples, reliable measures are better than unreliable measures, representative samples are better than WEIRD ones, and so on. The trouble when it comes to evaluating the methodological quality of most work isn’t so much that there’s rampant disagreement between reviewers (though it does happen), it’s that research articles are complicated products, and the odds of any individual reviewer having the expertise, motivation, and attention span to catch every major methodological concern in a paper are exceedingly small. Since only two or three people typically review a paper pre-publication, it’s not surprising that in many cases, whether or not a paper makes it through the review process depends as much on who happened to review it as on the paper itself.

A nice example of this is the Bem paper on ESP I discussed here a few weeks ago. I think most people would agree that things like data peeking, lumping and splitting studies, and post-hoc hypothesis testing–all of which are apparent in Bem’s paper–are generally not good research practices. And no doubt many potential reviewers would have noted these and other problems with Bem’s paper had they been asked to review it. But as it happens, the actual reviewers didn’t note those problems (or at least, not enough of them), so the paper was accepted for publication.

I’m not saying this to criticize Bem’s reviewers, who I’m sure all had a million other things to do besides pore over the minutiae of a paper on ESP (and for all we know, they could have already caught many other problems with the paper that were subsequently addressed before publication). The problem is a much more general one: the pre-publication peer review process in psychology, and many other areas of science, is pretty inefficient and unreliable, in the sense that it draws on the intense efforts of a very few, semi-randomly selected, individuals, as opposed to relying on a much broader evaluation by the community of researchers at large.

In the long term, the best solution to this problem may be to fundamentally rethink the way we evaluate scientific papers–e.g., by designing new platforms for post-publication review of papers (e.g., see this post for more on efforts towards that end). I think that’s far and away the most important thing the scientific community could do to improve the quality of scientific assessment, and I hope we ultimately will collectively move towards alternative models of review that look a lot more like the collaborative filtering systems found on, say, reddit or Stack Overflow than like peer review as we now know it. But that’s a process that’s likely to take a long time, and I don’t profess to have much of an idea as to how one would go about kickstarting it.

What I want to focus on here is something much less ambitious, but potentially still useful–namely, the possibility of automating the assessment of at least some aspects of research methodology. As I alluded to above, many of the factors that help us determine how believable a particular scientific finding is are readily quantifiable. In fact, in many cases, they’re already quantified for us. Sample sizes, p values, effect sizes,  coefficient alphas… all of these things are, in one sense or another, indices of the quality of a paper (however indirect), and are easy to capture and code. And many other things we care about can be captured with only slightly more work. For instance, if we want to know whether the authors of a paper corrected for multiple comparisons, we could search for strings like “multiple comparisons”, “uncorrected”, “Bonferroni”, and “FDR”, and probably come away with a pretty decent idea of what the authors did or didn’t do to correct for multiple comparisons. It might require a small dose of technical wizardry to do this kind of thing in a sensible and reasonably accurate way, but it’s clearly feasible–at least for some types of variables.
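To give a flavor of how little wizardry some of this would actually take, here’s a crude sketch of the string-matching idea (Python; the term list and the example sentence are purely illustrative):

```python
import re

CORRECTION_TERMS = ["multiple comparisons", "uncorrected", "bonferroni", "fdr"]

def correction_term_counts(paper_text: str) -> dict:
    """Count mentions of multiple-comparison-related terms in a paper's text."""
    text = paper_text.lower()
    return {term: len(re.findall(re.escape(term), text)) for term in CORRECTION_TERMS}

print(correction_term_counts("All p values were Bonferroni-corrected for multiple comparisons."))
# {'multiple comparisons': 1, 'uncorrected': 0, 'bonferroni': 1, 'fdr': 0}
```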

Once we extracted a bunch of data about the distribution of p values and sample sizes from many different papers, we could then start to do some interesting (and potentially useful) things, like generating automated metrics of research quality. For instance:

  • In multi-study articles, the variance in sample size across studies could tell us something useful about the likelihood that data peeking is going on (for an explanation as to why, see this). Other things being equal, an article with 9 studies with identical sample sizes is less likely to be capitalizing on chance than one containing 9 studies that range in sample size between 50 and 200 subjects (as the Bem paper does), so high variance in sample size could be used as a rough index for proclivity to peek at the data.
  • Quantifying the distribution of p values found in an individual article or an author’s entire body of work might be a reasonable first-pass measure of the amount of fudging (usually inadvertent) going on. As I pointed out in my earlier post, it’s interesting to note that with only one or two exceptions, virtually all of Bem’s statistically significant results come very close to p = .05. That’s not what you expect to see when hypothesis testing is done in a really principled way, because it’s exceedingly unlikely that a researcher would be so lucky as to always just barely obtain the expected result. But a bunch of p = .03 and p = .048 results are exactly what you expect to find when researchers test multiple hypotheses and report only the ones that produce significant results. (A rough sketch of how this metric might be computed follows this list.)
  • The presence or absence of certain terms or phrases is probably at least slightly predictive of the rigorousness of the article as a whole. For instance, the frequent use of phrases like “cross-validated”, “statistical power”, “corrected for multiple comparisons”, and “unbiased” is probably a good sign (though not necessarily a strong one); conversely, terms like “exploratory”, “marginal”, and “small sample” might provide at least some indication that the reported findings are, well, exploratory.
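Here’s the kind of thing I have in mind for the p-value metric from the second bullet (a rough Python sketch; the regex is deliberately naive and the ‘suspicious’ window is arbitrary):

```python
import re

P_VALUE_RE = re.compile(r"p\s*[=<]\s*(0?\.\d+)", re.IGNORECASE)

def near_threshold_fraction(paper_text: str, lo: float = 0.025, hi: float = 0.05) -> float:
    """Fraction of reported p values falling just under the .05 threshold.
    A high value is a red flag worth a closer look, not proof of anything."""
    ps = [float(m) for m in P_VALUE_RE.findall(paper_text)]
    return sum(lo < p <= hi for p in ps) / len(ps) if ps else 0.0

print(near_threshold_fraction("Effects emerged at p = .048, p = .03, and p < .001."))  # ~0.67 (2 of 3)
```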

These are just the first examples that come to mind; you can probably think of other better ones. Of course, these would all be pretty weak indicators of paper (or researcher) quality, and none of them are in any sense unambiguous measures. There are all sorts of situations in which such numbers wouldn’t mean much of anything. For instance, high variance in sample sizes would be perfectly justifiable in a case where researchers were testing for effects expected to have very different sizes, or conducting different kinds of statistical tests (e.g., detecting interactions is much harder than detecting main effects, and so necessitates larger samples). Similarly, p values close to .05 aren’t necessarily a marker of data snooping and fishing expeditions; it’s conceivable that some researchers might be so good at what they do that they can consistently design experiments that just barely manage to show what they’re intended to (though it’s not very plausible). And a failure to use terms like “corrected”, “power”, and “cross-validated” in a paper doesn’t necessarily mean the authors failed to consider important methodological issues, since such issues aren’t necessarily relevant to every single paper. So there’s no question that you’d want to take these kinds of metrics with a giant lump of salt.

Still, there are several good reasons to think that even relatively flawed automated quality metrics could serve an important purpose. First, many of the problems could be overcome to some extent through aggregation. You might not want to conclude that a particular study was poorly done simply because most of the reported p values were very close to .05; but if you were to look at a researcher’s entire body of, say, thirty or forty published articles, and noticed the same trend relative to other researchers, you might start to wonder. Similarly, we could think about composite metrics that combine many different first-order metrics to generate a summary estimate of a paper’s quality that may not be so susceptible to contextual factors or noise. For instance, in the case of the Bem ESP article, a measure that took into account the variance in sample size across studies, the closeness of the reported p values to .05, the mention of terms like ‘one-tailed test’, and so on, would likely not have assigned Bem’s article a glowing score, even if each individual component of the measure was not very reliable.

Second, I’m not suggesting that crude automated metrics would replace current evaluation practices; rather, they’d be used strictly as a complement. Essentially, you’d have some additional numbers to look at, and you could choose to use them or not, as you saw fit, when evaluating a paper. If nothing else, they could help flag potential issues that reviewers might not be spontaneously attuned to. For instance, a report might note the fact that the term “interaction” was used several times in a paper in the absence of “main effect,” which might then cue a reviewer to ask, hey, why you no report main effects? — but only if they deemed it a relevant concern after looking at the issue more closely.

Third, automated metrics could be continually updated and improved using machine learning techniques. Given some criterion measure of research quality, one could systematically train and refine an algorithm capable of doing a decent job recapturing that criterion. Of course, it’s not clear that we really have any unobjectionable standard to use as a criterion in this kind of training exercise (which only underscores why it’s important to come up with better ways to evaluate scientific research). But a reasonable starting point might be to try to predict replication likelihood for a small set of well-studied effects based on the features of the original report. Could you for instance show, in an automated way, that initial effects reported in studies that failed to correct for multiple comparisons or reported p values closer to .05 were less likely to be subsequently replicated?
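Here’s a sketch of what that training exercise might look like, assuming you already had a table of features extracted from original reports and some record of whether each effect later replicated (everything below, including the numbers, is made up purely to show the shape of the analysis):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical features per original finding: [median reported p value,
# variance in sample size across studies, corrected-for-multiple-comparisons flag].
X = np.array([[0.048, 0.90, 0], [0.001, 0.10, 1], [0.040, 0.70, 0],
              [0.003, 0.20, 1], [0.049, 0.80, 0], [0.010, 0.30, 1]])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = effect subsequently replicated (made up)

# Cross-validated accuracy on a real (and much larger) dataset would tell you
# whether these crude features carry any signal about replication at all.
print(cross_val_score(LogisticRegression(), X, y, cv=3).mean())
```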

Of course, as always with this kind of stuff, the rub is that it’s easy to talk the talk and not so easy to walk the walk. In principle, we can make up all sorts of clever metrics, but in practice, it’s not trivial to automatically extract even a piece of information as seemingly simple as sample size from many papers (consider the difference between “Undergraduates (N = 15) participated…” and “Forty-two individuals diagnosed with depression and an equal number of healthy controls took part…”), let alone build sophisticated composite measures that could reasonably well approximate human judgments. It’s all well and good to write long blog posts about how fancy automated metrics could help separate good research from bad, but I’m pretty sure I don’t want to actually do any work to develop them, and you probably don’t either. Still, the potential benefits are clear, and it’s not like this is science fiction–it’s clearly viable on at least a modest scale. So someone should do it… Maybe Elsevier? Jorge Hirsch? Anyone? Bueller? Bueller?
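Just to illustrate how quickly the ‘simple’ cases give way to hard ones, here’s what a naive sample-size extractor looks like, and where it falls over (Python sketch):

```python
import re

def extract_sample_size(text: str):
    """Naive sample-size extraction: handles 'N = 15'-style reporting,
    but is completely defeated by spelled-out numbers."""
    m = re.search(r"\bN\s*=\s*(\d+)", text, re.IGNORECASE)
    return int(m.group(1)) if m else None

print(extract_sample_size("Undergraduates (N = 15) participated..."))             # 15
print(extract_sample_size("Forty-two individuals diagnosed with depression..."))  # None
```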

some thoughtful comments on automatic measure abbreviation

In the comments on my last post, Sanjay Srivastava had some excellent thoughts/concerns about the general approach of automating measure abbreviation using a genetic algorithm. They’re valid concerns that might come up for other people too, so I thought I’d discuss them here in more detail. Here’s Sanjay:

Lew Goldberg emailed me a copy of your paper a while back and asked what I thought of it. I’m pasting my response below — I’d be curious to hear your take on it. (In this email “he” is you and “you” is he because I was writing to Lew…)

::

1. So this is what it feels like to be replaced by a machine.

I’m not sure if Sanjay thinks this is a good or a bad thing? I guess my own feeling is that it’s a good thing to the extent that it makes personality measurement more efficient and frees researchers up to use that time (both during data collection and measure development) for other productive things like eating M&M’s on the couch and devising the most diabolically clever April Fool’s joke for next year to make up for the fact that you forgot to do it this year (or, more realistically, writing papers), and a bad one to the extent that people take this as a license to stop thinking carefully about what they’re doing when they’re shortening or administering questionnaire measures. But provided people retain a measure of skepticism and cautiousness in applying this type of approach, I’m optimistic that the result will be a large net gain.

2. The convergent correlations were a little low in studies 2 and 3. You’d expect shortened scales to have less reliability and validity, of course, but that didn’t go all the way in covering the difference. He explained that this was because the AMBI scales draw on a different item pool than the proprietary measures, which makes sense. However, that makes it hard to evaluate the utility of the approach. If you compare how the full IPIP facet scales correlate with the proprietary NEO (which you’ve published here: http://ipip.ori.org/newNEO_FacetsTable.htm) against his Table 2, for example, it looks like the shortening algorithm is losing some information. Whether that’s better or worse than a rationally shortened scale is hard to say.

This is an excellent point, and I do want to reiterate that the abbreviation process isn’t magic; you can’t get something for free, and you’re almost invariably going to lose some fidelity in your measurement when you shorten any measure. That said, I actually feel pretty good about the degree of convergence I report in the paper. Sanjay already mentions one reason the convergent correlations seem lower than what you might expect: the new measures are composed of different items than the old ones, so they’re not going to share many of the same sources of error. That means the convergent correlations will necessarily be lower, but that isn’t necessarily a problem in a broader sense. But I think there are also two other, arguably more important, reasons why the convergence might seem deceptively low.

One is that the degree of convergence is bounded by the test-retest reliability of the original measures. Because the items in the IPIP pools were administered in batches spanning about a decade, whereas each of the proprietary measures (e.g., the NEO-PI-R) was administered on one occasion, the net result is that many of the items being used to predict personality traits were actually filled out several years before or after the personality measures in question. If you look at the long-term test-retest reliability of some of the measures I abbreviated (and there actually isn’t all that much test-retest data of that sort out there), it’s not clear that it’s much higher than what I report, even for the original measures. In other words, if you don’t generally see test-retest correlations across several years greater than .6 – .8 for the real NEO-PI-R scales, you can’t really expect to do any better with an abbreviated measure. But that probably says more about the reliability of narrowly-defined personality traits than about the abbreviation process.
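In classical test theory terms, this is just attenuation: the correlation you can observe between two imperfectly reliable measures is the true correlation scaled down by the square root of the product of their reliabilities, so the reliabilities (here, effectively the long-term retest reliabilities) put a hard ceiling on convergence:

$$r_{xy}^{\mathrm{obs}} \;=\; r_{xy}^{\mathrm{true}}\,\sqrt{r_{xx}\,r_{yy}} \;\le\; \sqrt{r_{xx}\,r_{yy}}$$

With retest reliabilities in the .6 – .8 range, the ceiling sits in roughly that same range, which is about where the convergent correlations I report end up.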

The other reason the convergent correlations seem lower than you might expect, which I actually think is the big one, is that I reported only the cross-validated coefficients in the paper. In other words, I used only half of the data to abbreviate measures like the NEO-PI-R and HEXACO-PI, and then used the other half to obtain unbiased estimates of the true degree of convergence. This is technically the right way to do things, because if you don’t cross-validate, you’re inevitably going to capitalize on chance. If you fit a model to a particular set of data, and then use the very same data to ask the question “how well does the model fit the data?” you’re essentially cheating–or, to put it more mildly, your estimates are going to be decidedly “optimistic”. You could argue it’s a relatively benign kind of cheating, because almost everyone does it, but that doesn’t make it okay from a technical standpoint.

When you look at it this way, the comparison of the IPIP representation of the NEO-PI-R with the abbreviated representation of the NEO-PI-R I generated in my paper isn’t really a fair one, because the IPIP measure Lew Goldberg came up with wasn’t cross-validated. Lew simply took the ten items that most strongly predicted each NEO-PI-R scale and grouped them together (with some careful rational inspection and modification, to be sure). That doesn’t mean there’s anything wrong with the IPIP measures; I’ve used them on multiple occasions myself, and have no complaints. They’re perfectly good measures that I think stand in really well for the (proprietary) originals. My point is just that the convergent correlations reported on the IPIP website are likely to be somewhat inflated relative to the truth.

The nice thing is that we can directly compare the AMBI (the measure I developed in my paper) with the IPIP version of the NEO-PI-R on a level footing by looking at the convergent correlations for the AMBI using only the training data. If you look at the validation (i.e., unbiased) estimates for the AMBI, which is what Sanjay’s talking about here, the mean convergent correlation for the 30 scales of the NEO-PI-R is .63, which is indeed much lower than the .73 reported for the IPIP version of the NEO-PI-R. Personally I’d still probably argue that .63 with 108 items is better than .73 with 300 items, but it’s a subjective question, and I wouldn’t disagree with anyone who preferred the latter. But again, the critical point is that this isn’t a fair comparison. If you make a fair comparison and look at the mean convergent correlation in the training data, it’s .69 for the AMBI, which is much closer to the IPIP data. Given that the AMBI version is just over 1/3rd the length of the IPIP version, I think the choice here becomes more clear-cut, and I doubt that there are many contexts where the (mean) difference between .69 and .73 would have meaningful practical implications.

It’s also worth remembering that nothing says you have to go with the 108-item measure I reported in the paper. The beauty of the GA approach is that you can quite easily generate a NEO-PI-R analog of any length you like. So if your goal isn’t so much to abbreviate the NEO-PI-R as to obtain a non-proprietary analog (and indeed, the IPIP version of the NEO-PI-R is actually longer than the NEO-PI-R, which contains 240 items), I think there’s a very good chance you could do better than the IPIP measure using substantially fewer than 300 items (but more than 108).

In fact, if you really had a lot of time on your hands, and wanted to test this question more thoroughly, what I think you’d want to do is run the GA with systematically varying item costs (i.e., you run the exact same procedure on the same data, but change the itemCost parameter a little bit each time). That way, you could actually plot out a curve showing you the degree of convergence with the original measure as a function of the length of the new measure (this is functionality I’d like to add to the GA code I released when I have the time, but probably not in the near future). I don’t really know what the sweet spot would be, but I can tell you from extensive experimentation that you get diminishing returns pretty quickly. In other words, I just don’t think you’re going to be able to get convergent correlations much higher than .7 on average (this only holds for the IPIP data, obviously; you might do much better using data collected over shorter timespans, or using subsets of items from the original measures). So in that sense, I like where I ended up (i.e., 108 items that still recapture the original quite well).

3. Ultimately I’d like to see a few substantive studies that run the GA-shortened scales alongside the original scales. The column-vector correlations that he reported were hard to evaluate — I’d like to see the actual predictions of behavior, not just summaries. But this seems like a promising approach.

[BTW, that last sentence is the key one. I'm looking forward to seeing more of what you and others can do with this approach.]

When I was writing the paper, I did initially want to include a supplementary figure showing the full-blown matrix of traits predicting the low-level behaviors Sanjay is alluding to (which are part of Goldberg’s massive dataset), but it seemed kind of daunting to present because there are 60 behavioral variables, and most of the correlations were very weak (not just for the AMBI measure–I mean they were weak for the original NEO-PI-R). So you would be looking at a 30 x 60 matrix full of mostly near-zero correlations, which seemed pretty uninformative. So to answer basically the same concern, what I did instead was include a supplementary figure showing a 30 x 5 matrix that captures the relation between the 30 facets of the NEO-PI-R and the Big Five as rated by participants’ peers (i.e., an independent measure of personality). Here’s that figure:

[Figure (big_five_peer): correlations between the 30 NEO-PI-R facets and peer-rated Big Five scores, for the AMBI version and for the original NEO-PI-R in the training and validation samples]

What I’m presenting is the same correlation matrix for three different versions of the NEO-PI-R: the AMBI version I generated (on the left), and the original (i.e., real) NEO-PI-R, for both the training and validation samples. The important point to note is that the pattern of correlations with an external set of criterion variables is very similar for all three measures. It isn’t identical of course, but you shouldn’t expect it to be. (In fact, if you look at the rightmost two columns, that gives you a sense of how you can get relatively different correlations even for exactly the same measure and subjects when the sample is randomly divided in two. That’s just sampling variability.) There are, in fairness, one or two blips where the AMBI version does something quite different (e.g., impulsiveness predicts peer-rated Conscientiousness for the AMBI version but not the other two). But overall, I feel pretty good about the AMBI measure when I look at this figure. I don’t think you’re losing very much in terms of predictive power or specificity, whereas I think you’re gaining a lot in time savings.

Having said all that, I couldn’t agree more with Sanjay’s final point, which is that the proof is really in the pudding (who came up with that expression? Bill Cosby?). I’ve learned the hard way that it’s really easy to come up with excellent theoretical and logical reasons for why something should or shouldn’t work, yet when you actually do the study to test your impeccable reasoning, the empirical results often surprise you, and then you’re forced to confront the reality that you’re actually quite dumb (and wrong). So it’s certainly possible that, for reasons I haven’t anticipated, something will go profoundly awry when people actually try to use these abbreviated measures in practice. And then I’ll have to delete this blog, change my name, and go into hiding. But I really don’t think that’s very likely. And I’m willing to stake a substantial chunk of my own time and energy on it (I’d gladly stake my reputation on it too, but I don’t really have one!); I’ve already started using these measures in my own studies–e.g., in a blogging study I’m conducting online here–with promising preliminary results. Ultimately, as with everything else, time will tell whether or not the effort is worth it.