large-scale data exploration, MIC-style

Saturday, December 17th, 2011

Real-world data are messy. Relationships between two variables can take on an infinite number of forms, and while one doesn’t see, say, umbrella-shaped data very often, strange things can happen. When scientists talk about correlations or associations between variables, they’re usually referring to one very specific form of relationship–namely, a linear one. The assumption is that most associations between pairs of variables are reasonably well captured by positing that one variable increases in proportion to the other, with some added noise. In reality, of course, many associations aren’t linear, or even approximately so. For instance, many associations are cyclical (e.g., hours at work versus day of week), or curvilinear (e.g., heart attacks become precipitously more frequent past middle age), and so on.

Detecting a non-linear association is potentially just as easy as detecting a linear relationship if we know the form of that association up front. But there, of course, lies the rub: we generally don’t have strong intuitions about how most variables are likely to be non-linearly related. A more typical situation in many ‘big data’ scientific disciplines is that we have a giant dataset full of thousands or millions of observations and hundreds or thousands of variables, and we want to determine which of the many associations between different variables are potentially important–without knowing anything about their potential shape. The problem, then, is that traditional measures of association don’t work very well; they’re only likely to detect associations to the extent that those associations approximate a linear fit.

A new paper in Science by David Reshef and colleagues (and as a friend pointed out, it’s a feat in and of itself just to get a statistics paper into Science) directly targets this data mining problem by introducing an elegant new measure of association called the Maximal Information Coefficient (MIC; see also the authors’ project website).  The clever insight at the core of the paper is that one can detect a systematic (i.e., non-random) relationship between two variables by quantifying and normalizing their maximal mutual information. Mutual information (MI) is an information theory measure of how much information you have about one variable given knowledge of the other. You have high MI when you can accurately predict the level of one variable given knowledge of the other, and low MI when knowledge of one variable is unhelpful in predicting the other. Importantly, unlike other measures (e.g., the correlation coefficient), MI makes no assumptions about the form of the relationship between the variables; one can have high mutual information for non-linear associations as well as linear ones.

MI and various derivative measures have been around for a long time now; what’s innovative about the Reshef et al paper is that the authors figured out a way to efficiently estimate and normalize the maximal MI one can obtain for any two variables. The very clever approach the authors use is to overlay a series of grids on top of the data, and to keep altering the resolution of the grid and moving its lines around until one obtains the maximum possible MI. In essence, it’s like dropping a wire mesh on top of a scatterplot and playing with it until you’ve boxed in all of the data points in the most informative way possible. And the neat thing is, you can apply the technique to any kind of data at all, and capture a very broad range of systematic relationships, not just linear ones.

To give you an intuitive sense of how this works, consider this Figure from the supplemental material:

The underlying function here is sinusoidal. This is a potentially common type of association in many domains–e.g., it might explain the cyclical relationship between, say, coffee intake and hour of day (more coffee in the early morning and afternoon; less in between). But the linear correlation is essentially zero, so a typical analysis wouldn’t pick it up at all. On the other hand, the relationship itself is perfectly deterministic; if we can correctly identify the generative function in this case, we would have perfect information about Y given X. The question is how to capture this intuition algorithmically–especially given that real data are noisy.

This is where Reshef et al’s grid-based approach comes in. In the left panel above, you have a 2 x 8 grid overlaid on a sinusoidal function (the use of a 2 x 8 resolution here is just illustrative; the algorithm actually produces estimates for a wide range of grid resolutions). Even though it’s the optimal grid of that particular resolution, it still isn’t very good: knowing which row a particular point along the line falls into doesn’t tell you a whole lot about which column it falls into, and vice versa. In other words, mutual information is low. By contrast, the optimal 8 x 2 grid on the right side of the figure has a (perfect) MIC of 1: if you know which row in the grid a point on the line falls into, you can also determine which column it falls into with perfect accuracy. So the MIC approach will detect that there’s a perfectly systematic relationship between these two variables without any trouble, whereas the standard Pearson correlation would be 0 (i.e., no relation at all). There are a couple of other steps involved (e.g., one needs to normalize the MIC to account for differences in grid resolution), but that’s the gist of it.
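
To make the grid intuition a bit more concrete, here is a minimal Python sketch. It is emphatically not the authors’ algorithm: it uses evenly spaced grid lines that never move, whereas the real method also searches over where the lines go. The function name, the cosine test signal, the sample size, and the columns-by-rows convention are all my own assumptions for illustration.

```python
import numpy as np

def grid_mutual_information(x, y, n_cols, n_rows):
    """Normalized mutual information of x and y on an equal-width grid.

    Simplification: grid lines are evenly spaced and fixed; the method
    discussed above also optimizes their placement.
    """
    counts, _, _ = np.histogram2d(x, y, bins=(n_cols, n_rows))
    p_xy = counts / counts.sum()             # joint probability of each cell
    p_x = p_xy.sum(axis=1, keepdims=True)    # column marginals, shape (n_cols, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)    # row marginals, shape (1, n_rows)
    nonzero = p_xy > 0                       # empty cells contribute nothing
    mi = np.sum(p_xy[nonzero] * np.log2(p_xy[nonzero] / (p_x @ p_y)[nonzero]))
    return mi / np.log2(min(n_cols, n_rows))  # rescale so the maximum is 1

# Noiseless sinusoid (two full periods), sampled at bin midpoints so that no
# point sits exactly on a grid line.
n_points = 2000
x = (np.arange(n_points) + 0.5) / n_points
y = np.cos(4 * np.pi * x)

pearson_r = np.corrcoef(x, y)[0, 1]
coarse = grid_mutual_information(x, y, n_cols=2, n_rows=8)
fine = grid_mutual_information(x, y, n_cols=8, n_rows=2)
print(f"Pearson r = {pearson_r:.3f}, 2 x 8 grid = {coarse:.3f}, 8 x 2 grid = {fine:.3f}")
```

With two columns, each half of the x range contains one full period, so the column tells you nothing about the row. With eight columns, every column sits entirely above or entirely below zero, so the column pins down the row exactly, even though the linear correlation is essentially nil.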

If the idea seems surprisingly simple, it is. But as with many very good ideas, hindsight is 20/20; it’s an idea that seems obvious once you hear it, but clearly wasn’t trivial to come up with (or someone would have done it a long time ago!). And of course, the simplicity of the core idea also shouldn’t blind us to the fact that there was undoubtedly a lot of very sophisticated work involved in figuring out how to normalize and bound the measure, proving that the approach works, and implementing a dynamic algorithm capable of computing good MIC estimates in a reasonable amount of time (this Harvard Gazette article suggests Reshef and colleagues worked on the various problems for three years).

The utility of MIC and its improvement over existing measures is probably best captured in Figure 2 from the paper:

Panel A shows the values one obtains with different measures when trying to capture different kinds of noiseless relationships (e.g., linear, exponential, and sinusoidal ones). The key point is that MIC assigns a value of 1 (the maximum) to every kind of association, whereas no other measure is capable of detecting the same range of associations with the same degree of sensitivity (and most fail horribly). By contrast, when given random data, MIC produces a value that tends towards zero (though it’s still not quite zero, a point I’ll come back to later). So what you effectively have is a measure that, with some caveats, can capture a very broad range of associations and place them on the same metric. The latter aspect is nicely captured in Panel G, which gives one a sense of what real (i.e., noisy) data corresponding to different MIC levels would look like. The main point is that, unlike other measures, a given value can correspond to very different types of associations. Admittedly, this may be a mixed blessing, since the flip side is that knowing the MIC value tells you almost nothing about what the association actually looks like (though Anscombe’s Quartet famously demonstrates that even a linear correlation can be misleading in this respect). But on the whole, I think it represents a potentially big advance in our ability to detect novel associations in a data-driven way.

Having introduced and explained the method, Reshef et al then go on to apply it to 4 very different datasets. I’ll just focus on one here–a set of global indicators from the World Health Organization (WHO). The data set contains 357 variables, or 63,546 variable pairs. When plotting MIC against the Pearson correlation coefficient the data look like this (panel A; click to blow up the figure):

The main point to note is that while MIC detects most strong linear effects (e.g., panel D), it also detects quite a few associations that have low linear correlations (e.g., E, F, and G). Reshef et al note that many of these effects have sensible interpretations (e.g., they argue that the left trend line in panel F reflects predominantly Pacific Island nations where obesity is culturally valued, and hence increases with income), but would be completely overlooked by an automated data mining approach that focuses only on linear correlations. They go on to report a number of other interesting examples ranging from analyses of gut bacteria to baseball statistics. All in all, it’s a compelling demonstration of a new metric that could potentially play an important role in large-scale data mining analyses going forward.

That said, while the paper clearly represents an important advance for large-scale data mining efforts, it’s also quite light on caveats and limitations (even for a length-constrained Science paper). Some potential concerns that come to mind:

  • Reshef et al are understandably going to put their best foot forward, so we can expect that the ‘representative’ examples they display (e.g., the WHO scatter plots above) are among the cleanest effects in the data, and aren’t necessarily typical. There’s nothing wrong with this, but it’s worth keeping in mind that much (and perhaps most) of the time, the associations MIC identifies aren’t going to be quite so clear-cut. Reshef et al’s approach can help identify potentially interesting associations, but once they’re identified, it’s still up to the investigator to figure out how to characterize them.
  • MIC is a (potentially quite heavily) biased measure. While it’s true, as the authors suggest, that it will “tend to 0 for statistically independent variables”, in most situations, the observed value will be substantially larger than 0 even when variables are completely uncorrelated. This falls directly out of the ‘M’ in MIC, because when you take the maximal value from some larger search space as your estimate, you’re almost invariably going to end up capitalizing on chance to some degree. MIC will only tend to 0 when the sample size is very large; as this figure (from the supplemental material) shows, even with a sample size of n = 204, the MIC for uncorrelated variables will tend to hover somewhere around .15 for the parameterization used throughout the paper (the red line):
    This isn’t a huge deal, but it does mean that interpretation of small MIC values is going to be very difficult in practice, since the lower end of the distribution is going to depend heavily on sample size. And it’s quite unpleasant to have a putatively standardized metric of effect size whose interpretation depends to some extent on sample parameters.
  • Reshef et al don’t report any analyses quantifying the sensitivity of MIC compared to conventional metrics like Pearson’s correlation coefficient. Obviously, MIC can pick up on effects Pearson can’t; but a crucial question is whether MIC shows comparable sensitivity when effects are linear. Similarly, we don’t know how well MIC performs when sample sizes are substantially smaller than those Reshef et al use in their simulations and empirical analyses. If it breaks down with n’s on the order of, say, 50 – 100, that would be important to know. So it would be great to see follow-up work characterizing performance under such circumstances–preferably before a flood of papers is published that all use MIC to do data mining in relatively small data sets.
  • As Andrew Gelman points out here, it’s not entirely clear that one wants a measure that gives a high r-square-like value for pretty much any non-random association between variables. For instance, a perfect circle would get an MIC of 1 at the limit, which is potentially weird given that you can never deterministically predict y from x. I don’t have a strong feeling about this one way or the other, but can see why this might bother someone.
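
The bias point above (taking a maximum over many grids capitalizes on chance) is easy to see in a toy simulation. The sketch below is a rough stand-in rather than the authors’ estimator: it only tries evenly spaced grids, and the cell budget of n ** 0.6, the sample sizes, and the number of repetitions are assumptions I picked for illustration.

```python
import numpy as np

def normalized_grid_mi(x, y, n_cols, n_rows):
    """Normalized mutual information on a fixed, equal-width grid."""
    counts, _, _ = np.histogram2d(x, y, bins=(n_cols, n_rows))
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nonzero = p_xy > 0
    mi = np.sum(p_xy[nonzero] * np.log2(p_xy[nonzero] / (p_x @ p_y)[nonzero]))
    return mi / np.log2(min(n_cols, n_rows))

def max_grid_mi(x, y, max_cells):
    """Largest normalized MI over all grids with n_cols * n_rows <= max_cells."""
    best = 0.0
    for n_cols in range(2, max_cells // 2 + 1):
        for n_rows in range(2, max_cells // n_cols + 1):
            best = max(best, normalized_grid_mi(x, y, n_cols, n_rows))
    return best

rng = np.random.default_rng(0)
mean_score = {}
for n in (50, 200, 2000):
    scores = []
    for _ in range(10):
        # x and y are independent by construction, so the "true" value is 0.
        x, y = rng.normal(size=n), rng.normal(size=n)
        scores.append(max_grid_mi(x, y, max_cells=int(n ** 0.6)))
    mean_score[n] = float(np.mean(scores))
    print(f"n = {n:5d}: mean maximal score = {mean_score[n]:.3f}")
```

The exact numbers depend on the seed and on the crude grid search, so they shouldn’t be compared to the paper’s figure; the thing to look for is that the score for independent variables stays above zero and only shrinks as n grows.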

Caveats aside though, from my perspective–as someone who likes to play with very large datasets but isn’t terribly statistically savvy–the Reshef et al paper seems like a really impressive piece of work that could have a big impact on at least some kinds of data mining analyses. I’d be curious to hear what more quantitatively sophisticated folks have to say.

Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, & Sabeti PC (2011). Detecting novel associations in large data sets. Science (New York, N.Y.), 334 (6062), 1518-24 PMID: 22174245

see me flub my powerpoint slides on NIF tv!

Monday, October 31st, 2011

 

UPDATE: the webcast is now archived here for posterity.

This is kind of late notice and probably of interest to few people, but I’m giving the NIF webinar tomorrow (or today, depending on your time zone–either way, we’re talking about November 1st). I’ll be talking about Neurosynth, and focusing in particular on the methods and data, since that’s what NIF (which stands for Neuroscience Information Framework) is all about. Assuming all goes well, the webinar should start at 11 am PST. But since I haven’t done a webcast of any kind before, and have a surprising knack for breaking audiovisual equipment at a distance, all may not go well. Which I suppose could make for a more interesting presentation. In any case, here’s the abstract:

The explosive growth of the human neuroimaging literature has led to major advances in understanding of human brain function, but has also made aggregation and synthesis of neuroimaging findings increasingly difficult. In this webinar, I will describe a highly automated brain mapping framework called NeuroSynth that uses text mining, meta-analysis and machine learning techniques to generate a large database of mappings between neural and cognitive states. The NeuroSynth framework can be used to automatically conduct large-scale, high-quality neuroimaging meta-analyses, address long-standing inferential problems in the neuroimaging literature (e.g., how to infer cognitive states from distributed activity patterns), and support accurate ‘decoding’ of broad cognitive states from brain activity in both entire studies and individual human subjects. This webinar will focus on (a) the methods used to extract the data, (b) the structure of the resulting (publicly available) datasets, and (c) some major limitations of the current implementation. If time allows, I’ll also provide a walk-through of the associated web interface (http://neurosynth.org) and will provide concrete examples of some potential applications of the framework.

There’s some more info (including details about how to connect, which might be important) here. And now I’m off to prepare my slides. And script some evasive and totally non-committal answers to deploy in case of difficult questions from the peanut gallery (er, respected audience).

Too much p = .048? Towards partial automation of scientific evaluation

Saturday, February 12th, 2011

Distinguishing good science from bad science isn’t an easy thing to do. One big problem is that what constitutes ‘good’ work is, to a large extent, subjective; I might love a paper you hate, or vice versa. Another problem is that science is a cumulative enterprise, and the value of each discovery is, in some sense, determined by how much of an impact that discovery has on subsequent work–something that often only becomes apparent years or even decades after the fact. So, to an uncomfortable extent, evaluating scientific work involves a good deal of guesswork and personal preference, which is probably why scientists tend to fall back on things like citation counts and journal impact factors as tools for assessing the quality of someone’s work. We know it’s not a great way to do things, but it’s not always clear how else we could do better.

Fortunately, there are many aspects of scientific research that don’t depend on subjective preferences or require us to suspend judgment for ten or fifteen years. In particular, methodological aspects of a paper can often be evaluated in a (relatively) objective way, and strengths or weaknesses of particular experimental designs are often readily discernible. For instance, in psychology, pretty much everyone agrees that large samples are generally better than small samples, reliable measures are better than unreliable measures, representative samples are better than WEIRD ones, and so on. The trouble when it comes to evaluating the methodological quality of most work isn’t so much that there’s rampant disagreement between reviewers (though it does happen), it’s that research articles are complicated products, and the odds of any individual reviewer having the expertise, motivation, and attention span to catch every major methodological concern in a paper are exceedingly small. Since only two or three people typically review a paper pre-publication, it’s not surprising that in many cases, whether or not a paper makes it through the review process depends as much on who happened to review it as on the paper itself.

A nice example of this is the Bem paper on ESP I discussed here a few weeks ago. I think most people would agree that things like data peeking, lumping and splitting studies, and post-hoc hypothesis testing–all of which are apparent in Bem’s paper–are generally not good research practices. And no doubt many potential reviewers would have noted these and other problems with Bem’s paper had they been asked to review it. But as it happens, the actual reviewers didn’t note those problems (or at least, not enough of them), so the paper was accepted for publication.

I’m not saying this to criticize Bem’s reviewers, who I’m sure all had a million other things to do besides pore over the minutiae of a paper on ESP (and for all we know, they could have already caught many other problems with the paper that were subsequently addressed before publication). The problem is a much more general one: the pre-publication peer review process in psychology, and many other areas of science, is pretty inefficient and unreliable, in the sense that it draws on the intense efforts of a very few, semi-randomly selected, individuals, as opposed to relying on a much broader evaluation by the community of researchers at large.

In the long term, the best solution to this problem may be to fundamentally rethink the way we evaluate scientific papers–e.g., by designing new platforms for post-publication review of papers (e.g., see this post for more on efforts towards that end). I think that’s far and away the most important thing the scientific community could do to improve the quality of scientific assessment, and I hope we ultimately will collectively move towards alternative models of review that look a lot more like the collaborative filtering systems found on, say, reddit or Stack Overflow than like peer review as we now know it. But that’s a process that’s likely to take a long time, and I don’t profess to have much of an idea as to how one would go about kickstarting it.

What I want to focus on here is something much less ambitious, but potentially still useful–namely, the possibility of automating the assessment of at least some aspects of research methodology. As I alluded to above, many of the factors that help us determine how believable a particular scientific finding is are readily quantifiable. In fact, in many cases, they’re already quantified for us. Sample sizes, p values, effect sizes,  coefficient alphas… all of these things are, in one sense or another, indices of the quality of a paper (however indirect), and are easy to capture and code. And many other things we care about can be captured with only slightly more work. For instance, if we want to know whether the authors of a paper corrected for multiple comparisons, we could search for strings like “multiple comparisons”, “uncorrected”, “Bonferroni”, and “FDR”, and probably come away with a pretty decent idea of what the authors did or didn’t do to correct for multiple comparisons. It might require a small dose of technical wizardry to do this kind of thing in a sensible and reasonably accurate way, but it’s clearly feasible–at least for some types of variables.
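
As a rough illustration of the string-search idea, here’s a minimal Python sketch. The term list comes straight from the paragraph above; the function name and the sample sentence are made up, and a serious version would obviously need to handle negation, context, and PDF extraction quirks.

```python
import re

# Terms taken from the paragraph above; extend as needed.
CORRECTION_TERMS = ["multiple comparisons", "uncorrected", "Bonferroni", "FDR"]

def count_terms(text, terms=CORRECTION_TERMS):
    """Count case-insensitive whole-word matches of each term in `text`."""
    counts = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Hypothetical snippet of a methods section.
sample = ("All p values were Bonferroni-corrected for multiple comparisons. "
          "Uncorrected results are reported in the supplement.")
print(count_terms(sample))
```

Note that raw counts can’t tell “we corrected for multiple comparisons” apart from “we did not correct for multiple comparisons”, which is exactly the kind of thing the small dose of technical wizardry would have to deal with.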

Once we extracted a bunch of data about the distribution of p values and sample sizes from many different papers, we could then start to do some interesting (and potentially useful) things, like generating automated metrics of research quality. For instance:

  • In multi-study articles, the variance in sample size across studies could tell us something useful about the likelihood that data peeking is going on (for an explanation as to why, see this). Other things being equal, an article with 9 studies with identical sample sizes is less likely to be capitalizing on chance than one containing 9 studies that range in sample size between 50 and 200 subjects (as the Bem paper does), so high variance in sample size could be used as a rough index for proclivity to peek at the data.
  • Quantifying the distribution of p values found in an individual article or an author’s entire body of work might be a reasonable first-pass measure of the amount of fudging (usually inadvertent) going on. As I pointed out in my earlier post, it’s interesting to note that with only one or two exceptions, virtually all of Bem’s statistically significant results come very close to p = .05. That’s not what you expect to see when hypothesis testing is done in a really principled way, because it’s exceedingly unlikely that a researcher would be so lucky as to always just barely obtain the expected result. But a bunch of p = .03 and p = .048 results are exactly what you expect to find when researchers test multiple hypotheses and report only the ones that produce significant results.
  • The presence or absence of certain terms or phrases is probably at least slightly predictive of the rigorousness of the article as a whole. For instance, the frequent use of phrases like “cross-validated”, “statistical power”, “corrected for multiple comparisons”, and “unbiased” is probably a good sign (though not necessarily a strong one); conversely, terms like “exploratory”, “marginal”, and “small sample” might provide at least some indication that the reported findings are, well, exploratory.
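
The first two metrics in the list above could be sketched along these lines. This is only one of many ways to operationalize them: the function names, the choice of the coefficient of variation, the .025 cutoff, and the input numbers are all assumptions for illustration, not values extracted from any actual paper.

```python
from statistics import mean, pstdev

def sample_size_cv(sample_sizes):
    """Coefficient of variation of per-study sample sizes (0 means identical)."""
    return pstdev(sample_sizes) / mean(sample_sizes)

def share_just_significant(p_values, lower=0.025, alpha=0.05):
    """Fraction of the significant p values that fall in [lower, alpha)."""
    significant = [p for p in p_values if p < alpha]
    if not significant:
        return 0.0
    return sum(p >= lower for p in significant) / len(significant)

# Hypothetical inputs, invented purely to exercise the functions.
print(sample_size_cv([100, 100, 100]))        # identical sample sizes
print(sample_size_cv([50, 100, 150, 200]))    # widely varying sample sizes
print(share_just_significant([0.048, 0.03, 0.041, 0.001, 0.2]))
```

For reference, if there were no effect at all and nothing selective going on, significant p values would be spread evenly below .05, so about half would land in the upper half of that range; a real effect should push the share lower, not higher.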

These are just the first examples that come to mind; you can probably think of other better ones. Of course, these would all be pretty weak indicators of paper (or researcher) quality, and none of them are in any sense unambiguous measures. There are all sorts of situations in which such numbers wouldn’t mean much of anything. For instance, high variance in sample sizes would be perfectly justifiable in a case where researchers were testing for effects expected to have very different sizes, or conducting different kinds of statistical tests (e.g., detecting interactions is much harder than detecting main effects, and so necessitates larger samples). Similarly, p values close to .05 aren’t necessarily a marker of data snooping and fishing expeditions; it’s conceivable that some researchers might be so good at what they do that they can consistently design experiments that just barely manage to show what they’re intended to (though it’s not very plausible). And a failure to use terms like “corrected”, “power”, and “cross-validated” in a paper doesn’t necessarily mean the authors failed to consider important methodological issues, since such issues aren’t necessarily relevant to every single paper. So there’s no question that you’d want to take these kinds of metrics with a giant lump of salt.

Still, there are several good reasons to think that even relatively flawed automated quality metrics could serve an important purpose. First, many of the problems could be overcome to some extent through aggregation. You might not want to conclude that a particular study was poorly done simply because most of the reported p values were very close to .05; but if you were to look at a researcher’s entire body of, say, thirty or forty published articles, and noticed the same trend relative to other researchers, you might start to wonder. Similarly, we could think about composite metrics that combine many different first-order metrics to generate a summary estimate of a paper’s quality that may not be so susceptible to contextual factors or noise. For instance, in the case of the Bem ESP article, a measure that took into account the variance in sample size across studies, the closeness of the reported p values to .05, the mention of terms like ‘one-tailed test’, and so on, would likely not have assigned Bem’s article a glowing score, even if each individual component of the measure was not very reliable.

Second, I’m not suggesting that crude automated metrics would replace current evaluation practices; rather, they’d be used strictly as a complement. Essentially, you’d have some additional numbers to look at, and you could choose to use them or not, as you saw fit, when evaluating a paper. If nothing else, they could help flag potential issues that reviewers might not be spontaneously attuned to. For instance, a report might note the fact that the term “interaction” was used several times in a paper in the absence of “main effect,” which might then cue a reviewer to ask, hey, why you no report main effects? — but only if they deemed it a relevant concern after looking at the issue more closely.

Third, automated metrics could be continually updated and improved using machine learning techniques. Given some criterion measure of research quality, one could systematically train and refine an algorithm capable of doing a decent job recapturing that criterion. Of course, it’s not clear that we really have any unobjectionable standard to use as a criterion in this kind of training exercise (which only underscores why it’s important to come up with better ways to evaluate scientific research). But a reasonable starting point might be to try to predict replication likelihood for a small set of well-studied effects based on the features of the original report. Could you for instance show, in an automated way, that initial effects reported in studies that failed to correct for multiple comparisons or reported p values closer to .05 were less likely to be subsequently replicated?

Of course, as always with this kind of stuff, the rub is that it’s easy to talk the talk and not so easy to walk the walk. In principle, we can make up all sorts of clever metrics, but in practice, it’s not trivial to automatically extract even a piece of information as seemingly simple as sample size from many papers (consider the difference between “Undergraduates (N = 15) participated…” and “Forty-two individuals diagnosed with depression and an equal number of healthy controls took part…”), let alone build sophisticated composite measures that could reasonably well approximate human judgments. It’s all well and good to write long blog posts about how fancy automated metrics could help separate good research from bad, but I’m pretty sure I don’t want to actually do any work to develop them, and you probably don’t either. Still, the potential benefits are clear, and it’s not like this is science fiction–it’s clearly viable on at least a modest scale. So someone should do it… Maybe Elsevier? Jorge Hirsch? Anyone? Bueller? Bueller?
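
To see why even sample size is annoying to extract, consider the most obvious approach, sketched below in Python. The regular expression and function name are my own naive first attempt, not a recommended solution; the two test strings are the examples quoted above.

```python
import re

# Naive pattern: an "N" followed by "=" and an integer, e.g. "N = 15".
N_PATTERN = re.compile(r"\bN\s*=\s*(\d+)", flags=re.IGNORECASE)

def extract_sample_sizes(text):
    """Return every integer that follows an 'N =' style marker in `text`."""
    return [int(match) for match in N_PATTERN.findall(text)]

easy = "Undergraduates (N = 15) participated…"
hard = ("Forty-two individuals diagnosed with depression and an equal number "
        "of healthy controls took part…")

print(extract_sample_sizes(easy))  # [15]
print(extract_sample_sizes(hard))  # []
```

The second sentence describes 84 participants without containing a single digit, so a usable extractor would need to parse number words and phrases like “an equal number”, and this is supposed to be the easy variable.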

The psychology of parapsychology, or why good researchers publishing good articles in good journals can still get it totally wrong

Monday, January 10th, 2011

Unless you’ve been pleasantly napping under a rock for the last couple of months, there’s a good chance you’ve heard about a forthcoming article in the Journal of Personality and Social Psychology (JPSP) purporting to provide strong evidence for the existence of some ESP-like phenomenon. (If you’ve been napping, see here, here, here, here, here, or this comprehensive list). In the article–appropriately titled Feeling the Future–Daryl Bem reports the results of 9 (yes, 9!) separate experiments that catch ordinary college students doing things they’re not supposed to be able to do–things like detecting the on-screen location of erotic images that haven’t actually been presented yet, or being primed by stimuli that won’t be displayed until after a response has already been made.

As you might expect, Bem’s article’s causing quite a stir in the scientific community. The controversy isn’t over whether or not ESP exists, mind you; scientists haven’t lost their collective senses, and most of us still take it as self-evident that college students just can’t peer into the future and determine where as-yet-unrevealed porn is going to soon be hidden (as handy as that ability might be). The real question on many people’s minds is: what went wrong? If there’s obviously no such thing as ESP, how could a leading social psychologist publish an article containing a seemingly huge amount of evidence in favor of ESP in the leading social psychology journal, after being peer reviewed by four other psychologists? Or, to put it in more colloquial terms–what the fuck?

Many critiques of Bem’s article have tried to dismiss it by searching for the smoking gun–the single critical methodological flaw that dooms the paper. For instance, one critique that’s been making the rounds, by Wagenmakers et al, argues that Bem should have done a Bayesian analysis, and that his failure to adjust his findings for the infinitesimally low prior probability of ESP (essentially, the strength of subjective belief against ESP) means that the evidence for ESP is vastly overestimated. I think these types of argument have a kernel of truth, but also suffer from some problems (for the record, I don’t really agree with the Wagenmakers critique, for reasons Andrew Gelman has articulated here). Having read the paper pretty closely twice, I really don’t think there’s any single overwhelming flaw in Bem’s paper (actually, in many ways, it’s a nice paper). Instead, there are a lot of little problems that collectively add up to produce a conclusion you just can’t really trust. Below is a decidedly non-exhaustive list of some of these problems. I’ll warn you now that, unless you care about methodological minutiae, you’ll probably find this very boring reading. But that’s kind of the point: attending to this stuff is so boring that we tend not to do it, with potentially serious consequences. Anyway:

  • Bem reports 9 different studies, which sounds (and is!) impressive. But a noteworthy feature of these studies is that they have grossly uneven sample sizes, ranging all the way from N = 50 to N = 200, in blocks of 50. As far as I can tell, no justification for these differences is provided anywhere in the article, which raises red flags, because the most common explanation for differing sample sizes–especially on this order of magnitude–is data peeking. That is, what often happens is that researchers periodically peek at their data, and halt data collection as soon as they obtain a statistically significant result. This may seem like a harmless little foible, but as I’ve discussed elsewhere, it’s actually a very bad thing, as it can substantially inflate Type I error rates (i.e., false positives). To his credit, Bem was at least being systematic about his data peeking, since his sample sizes always increase in increments of 50. But even in steps of 50, false positives can be grossly inflated. For instance, for a one-sample t-test, a researcher who peeks at her data in increments of 50 subjects and terminates data collection when a significant result is obtained (or N = 200, if no such result is obtained) can expect an actual Type I error rate of about 13%–nearly 3 times the nominal rate of 5%!
  • There’s some reason to think that the 9 experiments Bem reports weren’t necessarily designed as such. Meaning that they appear to have been ‘lumped’ or ‘splitted’ post hoc based on the results. For instance, Experiment 2 had 150 subjects, but the experimental design for the first 100 differed from the final 50 in several respects. They were minor respects, to be sure (e.g., pictures were presented randomly in one study, but in a fixed sequence in the other), but were still comparable in scope to those that differentiated Experiment 8 from Experiment 9 (which had the same sample size splits of 100 and 50, but were presented as two separate experiments). There’s no obvious reason why a researcher would plan to run 150 subjects up front, then decide to change the design after 100 subjects, and still call it the same study. A more plausible explanation is that Experiment 2 was actually supposed to be two separate experiments (a successful first experiment with N = 100 followed by an intended replication with N = 50) that was collapsed into one large study when the second experiment failed–preserving the statistically significant result in the full sample. Needless to say, this kind of lumping and splitting is liable to additionally inflate the false positive rate.
  • Most of Bem’s experiments allow for multiple plausible hypotheses, and it’s rarely clear why Bem would have chosen, up front, the hypotheses he presents in the paper. For instance, in Experiment 1, Bem finds that college students are able to predict the future location of erotic images that haven’t yet been presented (essentially a form of precognition), yet show no ability to predict the location of negative, positive, or romantic pictures. Bem’s explanation for this selective result is that “… such anticipation would be evolutionarily advantageous for reproduction and survival if the organism could act instrumentally to approach erotic stimuli …”. But this seems kind of silly on several levels. For one thing, it’s really hard to imagine that there’s an adaptive benefit to keeping an eye out for potential mates, but not for other potential positive signals (represented by non-erotic positive images). For another, it’s not like we’re talking about actual people or events here; we’re talking about digital images on an LCD. What Bem is effectively saying is that, somehow, someway, our ancestors evolved the extrasensory capacity to read digital bits from the future–but only pornographic ones. Not very compelling, and one could easily have come up with a similar explanation in the event that any of the other picture categories had selectively produced statistically significant results. Of course, if you get to test 4 or 5 different categories at p < .05, and pretend that you called it ahead of time, your false positive rate isn’t really 5%–it’s closer to 20%.
  • I say p < .05, but really, it’s more like p < .1, because the vast majority of tests Bem reports use one-tailed tests–effectively instantaneously doubling the false positive rate. There’s a long-standing debate in the literature, going back at least 60 years, as to whether it’s ever appropriate to use one-tailed tests, but even proponents of one-tailed tests will concede that you should only use them if you really truly have a directional hypothesis in mind before you look at your data. That seems exceedingly unlikely in this case, at least for many of the hypotheses Bem reports testing.
  • Nearly all of Bem’s statistically significant p values are very close to the critical threshold of .05. That’s usually a marker of selection bias, particularly given the aforementioned unevenness of sample sizes. When experiments are conducted in a principled way (i.e., with minimal selection bias or peeking), researchers will often get very low p values, since it’s very difficult to know up front exactly how large effect sizes will be. But in Bem’s 9 experiments, he almost invariably collects just enough subjects to detect a statistically significant effect. There are really only two explanations for that: either Bem is (consciously or unconsciously) deciding what his hypotheses are based on which results attain significance (which is not good), or he’s actually a master of ESP himself, and is able to peer into the future and identify the critical sample size he’ll need in each experiment (which is great, but unlikely).
  • Some of the correlational effects Bem reports–e.g., that people with high stimulus seeking scores are better at ESP–appear to be based on measures constructed post hoc. For instance, Bem uses a non-standard, two-item measure of boredom susceptibility, with no real justification provided for this unusual item selection, and no reporting of results for the presumably many other items and questionnaires that were administered alongside these items (except to parenthetically note that some measures produced non-significant results and hence weren’t reported). Again, the ability to select from among different questionnaires–and to construct custom questionnaires from different combinations of items–can easily inflate Type I error.
  • It’s not entirely clear how many studies Bem ran. In the Discussion section, he notes that he could “identify three sets of findings omitted from this report so far that should be mentioned lest they continue to languish in the file drawer”, but it’s not clear from the description that follows exactly how many studies these “three sets of findings” comprised (or how many ‘pilot’ experiments were involved). What we’d really like to know is the exact number of (a) experiments and (b) subjects Bem ran, without qualification, and including all putative pilot sessions.
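
To make the data peeking and multiple-hypothesis points above a bit more concrete, here’s a minimal simulation sketch in Python. It isn’t Bem’s procedure or anyone’s published analysis; it just assumes normally distributed scores with a true effect of exactly zero, a two-tailed one-sample t-test, and a fixed critical value of 1.96 (a normal approximation, which is slightly liberal at N = 50). The function names and default settings are made up for illustration.

```python
import random

def peeking_false_positive_rate(looks=(50, 100, 150, 200), n_sims=2000,
                                crit=1.96, seed=0):
    """Share of null experiments that reach |t| > crit at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        n, total, total_sq = 0, 0.0, 0.0
        for look in looks:
            while n < look:
                x = rng.gauss(0.0, 1.0)  # the true effect is exactly zero
                n += 1
                total += x
                total_sq += x * x
            mean = total / n
            var = (total_sq - n * mean * mean) / (n - 1)
            t = mean / (var / n) ** 0.5
            if abs(t) > crit:  # "significant": stop collecting and report
                hits += 1
                break
    return hits / n_sims

def familywise_rate(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1.0 - (1.0 - alpha) ** k

peeking = peeking_false_positive_rate()
fixed_n = peeking_false_positive_rate(looks=(200,))
print(f"peeking every 50 subjects: {peeking:.3f}")
print(f"single look at N = 200:    {fixed_n:.3f}")
print(f"4 categories: {familywise_rate(4):.3f}, 5 categories: {familywise_rate(5):.3f}")
```

Under these assumptions the peeking rate should land in the neighborhood of the 13% figure quoted above, while the single-look rate stays near the nominal 5%. The familywise calculation assumes the tests are independent, which is only roughly true for picture categories within one experiment.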

It’s important to note that none of these concerns is really terrible individually. Sure, it’s bad to peek at your data, but data peeking alone probably isn’t going to produce 9 different false positives. Nor is using one-tailed tests, or constructing measures on the fly, etc. But when you combine data peeking, liberal thresholds, study recombination, flexible hypotheses, and selective measures, you have a perfect recipe for spurious results. And the fact that there are 9 different studies isn’t any guard against false positives when fudging is at work; if anything, it may make it easier to produce a seemingly consistent story, because reviewers and readers have a natural tendency to relax the standards for each individual experiment. So when Bem argues that “…across all nine experiments, Stouffer’s z = 6.66, p = 1.34 × 10⁻¹¹,” that statement that the cumulative p value is 1.34 × 10⁻¹¹ is close to meaningless. Combining p values that way would only be appropriate under the assumption that Bem conducted exactly 9 tests, and without any influence of selection bias. But that’s clearly not the case here.
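
Here’s a rough sketch of why the combined p value is so sensitive to selection. Stouffer’s method converts each one-tailed p value to a z score and divides the sum by the square root of the number of studies. The p values below are hypothetical (they are not Bem’s), and the “file drawer” function is just one simple way of modeling selective reporting: it draws p values under the null and keeps only the ones that happen to fall below .05.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def stouffer(p_values):
    """Combine one-tailed p values: z = sum(z_i) / sqrt(k)."""
    zs = [STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    z = sum(zs) / len(zs) ** 0.5
    return z, 1.0 - STD_NORMAL.cdf(z)

def first_significant(k, alpha=0.05, seed=0):
    """Draw null (uniform) p values, reporting only those below alpha."""
    rng = random.Random(seed)
    reported = []
    while len(reported) < k:
        p = rng.random()
        if 0.0 < p < alpha:
            reported.append(p)
    return reported

# Nine hypothetical studies that each barely clear the threshold.
z_marginal, p_marginal = stouffer([0.03] * 9)

# Nine "studies" with no true effect, selected only because p < .05.
z_selected, p_selected = stouffer(first_significant(9))

print(f"nine p = .03 results:   z = {z_marginal:.2f}, p = {p_marginal:.1e}")
print(f"nine selected nulls:    z = {z_selected:.2f}, p = {p_selected:.1e}")
```

The point of the sketch is that nine individually unimpressive results combine into an astronomically small p value, even when every one of them is a selected false positive. The combined figure only means what it appears to mean if the nine tests are the only ones that were ever run.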

What would it take to make the results more convincing?

Admittedly, there are quite a few assumptions involved in the above analysis. I don’t know for a fact that Bem was peeking at his data; that just seems like a reasonable assumption given that no justification was provided anywhere for the use of uneven samples. It’s conceivable that Bem had perfectly good, totally principled, reasons for conducting the experiments exactly as he did. But if that’s the case, defusing these criticisms should be simple enough. All it would take for Bem to make me (and presumably many other people) feel much more comfortable with the results is an affirmation of the following statements:

  • That the sample sizes of the different experiments were determined a priori, and not based on data snooping;
  • That the distinction between pilot studies and ‘real’ studies was clearly defined up front–i.e., there weren’t any studies that started out as pilots but eventually ended up in the paper, or studies that were supposed to end up in the paper but that were disqualified as pilots based on the (lack of) results;
  • That there was a clear one-to-one mapping between intended studies and reported studies; i.e., Bem didn’t ‘lump’ together two different studies in cases where one produced no effect, or split one study into two in cases where different subsets of the data both showed an effect;
  • That the predictions reported in the paper were truly made a priori, and not on the basis of the results (e.g., that the hypothesis that sexually arousing stimuli would be the only ones to show an effect was actually written down in one of Bem’s notebooks somewhere);
  • That the various transformations applied to the RT and memory performance measures in some Experiments weren’t selected only after inspecting the raw, untransformed values and failing to identify significant results;
  • That the individual differences measures reported in the paper were selected a priori and not based on post-hoc inspection of the full pattern of correlations across studies;
  • That Bem didn’t run dozens of other statistical tests that failed to produce statistically significant results and hence weren’t reported in the paper.

Endorsing this list of statements (or perhaps a somewhat more complete version, as there are other concerns I didn’t mention here) would be sufficient to cast Bem’s results in an entirely new light, and I’d go so far as to say that I’d even be willing to suspend judgment on his conclusions pending additional data (which would be a big deal for me, since I don’t have a shred of a belief in ESP). But I confess that I’m not holding my breath, if only because I imagine that Bem would have already addressed these concerns in his paper if there were indeed principled justifications for the design choices in question.

It isn’t a bad paper

If you’ve read this far (why??), this might seem like a pretty damning review, and you might be thinking, boy, this is really a terrible paper. But I don’t think that’s true at all. In many ways, I think Bem’s actually been relatively careful. The thing to remember is that this type of fudging isn’t unusual; to the contrary, it’s rampant–everyone does it. And that’s because it’s very difficult, and often outright impossible, to avoid. The reality is that scientists are human, and like all humans, have a deep-seated tendency to work to confirm what they already believe. In Bem’s case, there are all sorts of reasons why someone who’s been working for the better part of a decade to demonstrate the existence of psychic phenomena isn’t necessarily the most objective judge of the relevant evidence. I don’t say that to impugn Bem’s motives in any way; I think the same is true of virtually all scientists–including myself. I’m pretty sure that if someone went over my own work with a fine-toothed comb, as I’ve gone over Bem’s above, they’d identify similar problems. Put differently, I don’t doubt that, despite my best efforts, I’ve reported some findings that aren’t true, because I wasn’t as careful as a completely disinterested observer would have been. That’s not to condone fudging, of course, but simply to recognize that it’s an inevitable reality in science, and it isn’t fair to hold Bem to a higher standard than we’d hold anyone else.

If you set aside the controversial nature of Bem’s research, and evaluate the quality of his paper purely on methodological grounds, I don’t think it’s any worse than the average paper published in JPSP, and actually probably better. For all of the concerns I raised above, there are many things Bem is careful to do that many other researchers don’t. For instance, he clearly makes at least a partial effort to avoid data peeking by collecting samples in increments of 50 subjects (I suspect he simply underestimated the degree to which Type I error rates can be inflated by peeking, even with steps that large); he corrects for multiple comparisons in many places (though not in some places where it matters); and he devotes an entire section of the discussion to considering the possibility that he might be inadvertently capitalizing on chance by falling prey to certain biases. Most studies–including most of those published in JPSP, the premier social psychology journal–don’t do any of these things, even though the underlying problems are just as applicable. So while you can confidently conclude that Bem’s article is wrong, I don’t think it’s fair to say that it’s a bad article–at least, not by the standards that currently hold in much of psychology.

Should the study have been published?

Interestingly, much of the scientific debate surrounding Bem’s article has actually had very little to do with the veracity of the reported findings, because the vast majority of scientists take it for granted that ESP is bunk. Much of the debate centers instead over whether the article should have ever been published in a journal as prestigious as JPSP (or any other peer-reviewed journal, for that matter). For the most part, I think the answer is yes. I don’t think it’s the place of editors and reviewers to reject a paper based solely on the desirability of its conclusions; if we take the scientific method–and the process of peer review–seriously, that commits us to occasionally (or even frequently) publishing work that we believe time will eventually prove wrong. The metrics I think reviewers should (and do) use are whether (a) the paper is as good as most of the papers that get published in the journal in question, and (b) the methods used live up to the standards of the field. I think that’s true in this case, so I don’t fault the editorial decision. Of course, it sucks to see something published that’s virtually certain to be false… but that’s the price we pay for doing science. As long as they play by the rules, we have to engage with even patently ridiculous views, because sometimes (though very rarely) it later turns out that those views weren’t so ridiculous after all.

That said, believing that it’s appropriate to publish Bem’s article given current publishing standards doesn’t preclude us from questioning those standards themselves. On a pretty basic level, the idea that Bem’s article might be par for the course, quality-wise, yet still be completely and utterly wrong, should surely raise some uncomfortable questions about whether psychology journals are getting the balance between scientific novelty and methodological rigor right. I think that’s a complicated issue, and I’m not going to try to tackle it here, though I will say that personally I do think that more stringent standards would be a good thing for psychology, on the whole. (It’s worth pointing out that the problem of (arguably) lax standards is hardly unique to psychology; as John Ioannidis has famously pointed out, most published findings in the biomedical sciences are false.)

Conclusion

The controversy surrounding the Bem paper is fascinating for many reasons, but it’s arguably most instructive in underscoring the central tension in scientific publishing between rapid discovery and innovation on the one hand, and methodological rigor and cautiousness on the other. Both values are important, but it’s important to recognize the tradeoff that pursuing either one implies. Many of the people who are now complaining that JPSP should never have published Bem’s article seem to overlook the fact that they’ve probably benefited themselves from the prevalence of the same relaxed standards (note that by ‘relaxed’ I don’t mean to suggest that journals like JPSP are non-selective about what they publish, just that methodological rigor is only one among many selection criteria–and often not the most important one). Conversely, maintaining editorial standards that would have precluded Bem’s article from being published would almost certainly also make it much more difficult to publish most other, much less controversial, findings. A world in which fewer spurious results are published is a world in which fewer studies are published, period. You can reasonably debate whether that would be a good or bad thing, but you can’t have it both ways. It’s wishful thinking to imagine that reviewers could somehow grow a magic truth-o-meter that applies lax standards to veridical findings and stringent ones to false positives.

From a bird’s eye view, there’s something undeniably strange about the idea that a well-respected, relatively careful researcher could publish an above-average article in a top psychology journal, yet have virtually everyone instantly recognize that the reported findings are totally, irredeemably false. You could read that as a sign that something’s gone horribly wrong somewhere in the machine; that the reviewers and editors of academic journals have fallen down and can’t get up, or that there’s something deeply flawed about the way scientists–or at least psychologists–practice their trade. But I think that’s wrong. I think we can look at it much more optimistically. We can actually see it as a testament to the success and self-corrective nature of the scientific enterprise that we actually allow articles that virtually nobody agrees with to get published. And that’s because, as scientists, we take seriously the possibility, however vanishingly small, that we might be wrong about even our strongest beliefs. Most of us don’t really believe that Cornell undergraduates have a sixth sense for future porn… but if they did, wouldn’t you want to know about it?

Bem, D. J. (2011). Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. Journal of Personality and Social Psychology

does functional specialization exist in the language system?

Friday, September 3rd, 2010

One of the central questions in cognitive neuroscience–according to some people, at least–is how selective different chunks of cortex are for specific cognitive functions. The paradigmatic examples of functional selectivity are pretty much all located in sensory cortical regions or adjacent association cortices. For instance, the fusiform face area (FFA), is so named because it (allegedly) responds selectively to faces but not to other stimuli. Other regions with varying selectivity profiles are similarly named: the visual word form area (VWFA), parahippocampal place area (PPA), extrastriate body area (EBA), and so on.

In a recent review paper, Fedorenko and Kanwisher (2009) sought to apply insights from the study of functionally selective visual regions to the study of language. They posed the following question with respect to the neuroimaging of language in the title of their paper: Why hasn’t a clearer picture emerged? And they gave the following answer: it’s because brains differ from one another, stupid.

Admittedly, I’m paraphrasing; they don’t use exactly those words. But the basic point they make is that it’s difficult to identify functionally selective regions when you’re averaging over a bunch of very different brains. And the solution they propose–again, imported from the study of visual areas–is to identify potentially selective language regions-of-interest (ROIs) on a subject-specific basis rather than relying on group-level analyses.

The Fedorenko and Kanwisher paper apparently didn’t please Greg Hickok of Talking Brains, who’s done a lot of very elegant work on the neurobiology of language.  A summary of Hickok’s take:

What I found a bit on the irritating side though was the extremely dim and distressingly myopic view of progress in the field of the neural basis of language.

He objects to Fedorenko and Kanwisher on several grounds, and the post is well worth reading. But since I’m very lazy tired, I’ll just summarize his points as follows:

  • There’s more functional specialization in the language system than F&K give the field credit for
  • The use of subject-specific analyses in the domain of language isn’t new, and many researchers (including Hickok) have used procedures similar to those F&K recommend in the past
  • Functional selectivity is not necessarily a criterion we should care about all that much anyway

As you might expect, F&K disagree with Hickok on these points, and Hickok was kind enough to post their response. He then responded to their response in the comments (which are also worth reading), which in turn spawned a back-and-forth with F&K, a cameo by Brad Buchsbaum (who posted his own excellent thoughts on the matter here), and eventually, an intervention by a team of professional arbitrators. Okay, I made that last bit up; it was a very civil disagreement, and is exactly what scientific debates on the internet should look like, in my opinion.

Anyway, rather than revisit the entire thread, which you can read for yourself, I’ll just summarize my thoughts:

  • On the whole, I think my view lines up pretty closely with Hickok’s and Buchsbaum’s. Although I’m very far from an expert on the neurobiology of language (is there a word in English for someone who’s the diametric opposite of an expert–i.e., someone who consistently and confidently asserts exactly the wrong thing? Cause that’s what I am), I agree with Hickok’s argument that the temporal poles show a response profile that looks suspiciously like sentence- or narrative-specific processing (I have a paper on the neural mechanisms of narrative comprehension that supports that claim to some extent), and think F&K’s review of the literature is probably not as balanced as it could have been.
  • More generally, I agree with Hickok that demonstrating functional specialization isn’t necessarily that important to the study of language (or most other domains). This seems to be a major point of contention for F&K, but I don’t think they make a very strong case for their view. They suggest that they “are not sure what other goals (besides understanding a region’s computations) could drive studies aimed at understanding how functionally specialized a region is,” which I think is reasonable, but affirms the consequent. Hickok isn’t saying there’s no reason to search for functional specialization in the F&K sense; as I read him, he’s simply saying that you can study the nature of neural computation in lots of interesting ways that don’t require you to demonstrate functional specialization to the degree F&K seem to require. Seems hard to disagree with that.
  • Buchsbaum points out that it’s questionable whether there are any brain regions that meet the criteria F&K set out for functional specialization–namely that “A brain region R is specialized for cognitive function x if this region (i) is engaged in tasks that rely on cognitive function x, and (ii) is not engaged in tasks that do not rely on cognitive function x.” Buchsbaum and Hickok both point out that the two examples F&K give of putatively specialized regions (the FFA and the temporo-parietal junction, which some people believe is selectively involved in theory of mind) are hardly uncontroversial. Plenty of people have argued that the FFA isn’t really selective to faces, and even more people have argued that the TPJ isn’t selective to theory of mind. As far as I can tell, F&K don’t really address this issue in the comments. They do refer to a recent paper of Kanwisher’s that discusses the evidence for functional specificity in the FFA, but I’m not sure the argument made in that paper is itself uncontroversial, and in any case, Kanwisher does concede that there’s good evidence for at least some representation of non-preferred stimuli (i.e., non-faces) in the FFA. In any case, the central question here is whether or not F&K really unequivocally believe that FFA and TPJ aren’t engaged by any tasks that don’t involve face or theory of mind processing. If not, then it’s unfair to demand or expect the same of regions implicated in language.
  • Although I think there’s a good deal to be said for subject-specific analyses, I’m not as sanguine as F&K that a subject-specific approach offers a remedy to the problems that they perceive afflict the study of the neural mechanisms of language. While there’s no denying that group analyses suffer from a number of limitations, subject-specific analyses have their own weaknesses, which F&K don’t really mention in their paper. One is that such analyses typically require the assumption that two clusters located in slightly different places for different subjects must be carrying out the same cognitive operations if they respond similarly to a localizer task. That’s a very strong assumption for which there’s very little evidence (at least in the language domain)–especially because the localizer task F&K promote in this paper involves a rather strong manipulation that may confound several different aspects of language processing.
    Another problem is that it’s not at all obvious how you determine which regions are the “same” (in their 2010 paper, F&K argue for an algorithmic parcellation approach, but the fact that you get sensible-looking results is no guarantee that your parcellation actually reflects meaningful functional divisions in individual subjects). And yet another is that serious statistical problems can arise in cases where one or more subjects fail to show activation in a putative region (which is generally the norm rather than the exception). Say you have 25 subjects in your sample, and 7 don’t show activation anywhere in a region that can broadly be called Broca’s area. What do you do? You can’t just throw those subjects out of the analysis, because that would grossly and misleadingly inflate your effect sizes. Conversely, you can’t just identify any old region that does activate and lump it in with the regions identified in all the other subjects. This is a very serious problem, but it’s one that group analyses, for all their weaknesses, don’t have to contend with.
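
As a toy illustration of the exclusion problem in the last bullet, here’s a small simulation sketch. It rests on a deliberately simple assumption that may not match any real localizer pipeline: each subject’s observed effect is a noisy draw around one shared true effect, and a subject counts as “not showing activation” when that same observed value falls below a threshold. The true effect, noise level, and threshold are invented, with the threshold picked so that roughly 7 of 25 subjects get dropped.

```python
import random
from statistics import fmean

def mean_effects(n_subjects=25, true_effect=0.3, noise_sd=1.0,
                 threshold=-0.3, n_sims=2000, seed=0):
    """Average group effect with all subjects vs. only 'activating' ones."""
    rng = random.Random(seed)
    all_means, kept_means = [], []
    for _ in range(n_sims):
        observed = [rng.gauss(true_effect, noise_sd) for _ in range(n_subjects)]
        kept = [x for x in observed if x > threshold]  # drop 'non-activators'
        if not kept:
            continue  # nothing left to average in this simulated sample
        all_means.append(fmean(observed))
        kept_means.append(fmean(kept))
    return fmean(all_means), fmean(kept_means)

full_mean, kept_mean = mean_effects()
print(f"all 25 subjects:        {full_mean:.2f}")
print(f"after dropping low end: {kept_mean:.2f}")
```

With all subjects included the average recovers the assumed true effect; once the low end is thrown out, the estimate is biased upward, simply because the exclusion rule and the effect estimate depend on the same noisy measurement.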

Disagreements aside, I think it’s really great to see serious scientific discussion taking place in this type of forum. In principle, this is the kind of debate that should be resolved (or not) in the peer-reviewed literature; in practice, peer review is slow, writing full-blown articles takes time, and journal space is limited. So I think blogs have a really important role to play in scientific communication, and frankly, I envy Hickok and Poeppel for the excellent discussion they consistently manage to stimulate over at Talking Brains!

estimating bias in text with Ruby

Friday, June 25th, 2010

Over the past couple of months, I’ve been working on and off on a collaboration with my good friend Nick Holtzman and some other folks that focuses on ways to automatically extract bias from text using a vector space model. The paper is still in progress, so I won’t give much away here, except to say that Nick’s figured out what I think is a pretty clever way to show that, yes, Fox likes Republicans more than Democrats, and MSNBC likes Democrats more than Republicans. It’s not meant to be a surprising result, but simply a nice validation of the underlying method, which can be flexibly applied to all sorts of interesting questions.

The model we’re using is a simplified variant of Jones and Mewhort’s (2007) BEAGLE model. Essentially, similarity between words is quantified by looking at the degree to which words have similar co-occurrence patterns with other words. This basic idea is actually common to pretty much all vector space models, so in that sense, there’s not much new here (there’s plenty that’s new in Jones and Mewhort (2007), but we’re mostly leaving those features out for the sake of simplicity and computational speed). The novel aspect is the contrast coding of similarity terms in order to produce bias estimates. But you’ll have to wait for the paper to read more about that.
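
For readers who haven’t run into vector space models before, here’s a bare-bones sketch of the basic co-occurrence idea in Python. To be clear, this is not BEAGLE, not our simplified variant of it, and not the Ruby tools described below; it says nothing about the contrast coding either. It’s just the generic recipe (count the words that appear near each word, then compare the count vectors with cosine similarity), with a made-up toy corpus and an arbitrary window size.

```python
from collections import Counter, defaultdict
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words appearing within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

corpus = [
    "the senator praised the budget",
    "the governor praised the budget",
    "the cat chased the mouse",
]
vectors = cooccurrence_vectors(corpus)
sim_politicians = cosine(vectors["senator"], vectors["governor"])
sim_unrelated = cosine(vectors["senator"], vectors["cat"])
print(f"senator vs governor: {sim_politicians:.2f}")
print(f"senator vs cat:      {sim_unrelated:.2f}")
```

Words that keep the same company ("senator" and "governor") end up more similar than words that don’t, even though they never appear in the same sentence. Real models work with far larger corpora and usually weight or reduce the raw counts in some way.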

In the meantime, one thing we’ve tried to do is develop software that can be used to easily implement the kind of analyses we describe in the paper. With plenty of input from Nick and Mike Jones, I’ve written a set of tools in Ruby that’s now freely available for download here. The tools are actually bundled as a Ruby gem, so installation should be a snap on most platforms. We’re still working on documentation, so there’s no full-blown manual yet, but the quick-start guide should be sufficient to get many users up and running. And for people who share my love of Ruby and are interested in using the tools programmatically, there’s a fairly well-commented RDoc.

The code should really be considered an alpha release at the moment; I’m sure there are plenty of bugs (if you find any, email me!), and the feature set is currently pretty limited. Hopefully it’ll grow over time. I also plan to throw the code up on GitHub at some point in the near future so that anyone who’s interested can help out with the development. In the meantime, if you’re interested in semantic space models and want to play around with a crude (but relatively fast) implementation of one, there’s a (very) small chance you might find these tools useful.

time-on-task effects in fMRI research: why you should care

Wednesday, June 16th, 2010

There’s a ubiquitous problem in experimental psychology studies that use behavioral measures that require participants to make speeded responses. The problem is that, in general, the longer people take to do something, the more likely they are to do it correctly. If I have you do a visual search task and ask you to tell me whether or not a display full of letters contains a red ‘X’, I’m not going to be very impressed that you can give me the right answer if I let you stare at the screen for five minutes before responding. In most experimental situations, the only way we can learn something meaningful about people’s capacity to perform a task is by imposing some restriction on how long people can take to respond. And the problem that then presents is that any changes we observe in the resulting variable we care about (say, the proportion of times you successfully detect the red ‘X’) are going to be confounded with the time people took to respond. Raise the response deadline and performance goes up; shorten it and performance goes down.

This fundamental fact about human performance is commonly referred to as the speed-accuracy tradeoff. The speed-accuracy tradeoff isn’t a law in any sense; it allows for violations, and there certainly are situations in which responding quickly can actually promote accuracy. But as a general rule, when researchers run psychology experiments involving response deadlines, they usually work hard to rule out the speed-accuracy tradeoff as an explanation for any observed results. For instance, if I have a group of adolescents with ADHD do a task requiring inhibitory control, and compare their performance to a group of adolescents without ADHD, I may very well find that the ADHD group performs more poorly, as reflected by lower accuracy rates. But the interpretation of that result depends heavily on whether or not there are also any differences in reaction times (RT). If the ADHD group took about as long on average to respond as the non-ADHD group, it might be reasonable to conclude that the ADHD group suffers a deficit in inhibitory control: they take as long as the control group to do the task, but they still do worse. On the other hand, if the ADHD group responded much faster than the control group on average, the interpretation would become more complicated. For instance, one possibility would be that the accuracy difference reflects differences in motivation rather than capacity per se. That is, maybe the ADHD group just doesn’t care as much about being accurate as about responding quickly. Maybe if you motivated the ADHD group appropriately (e.g., by giving them a task that was intrinsically interesting), you’d find that performance was actually equivalent across groups. Without explicitly considering the role of reaction time–and ideally, controlling for it statistically–the types of inferences you can draw about underlying cognitive processes are somewhat limited.

An important point to note about the speed-accuracy tradeoff is that it isn’t just a tradeoff between speed and accuracy; in principle, any variable that bears some systematic relation to how long people take to respond is going to be confounded with reaction time. In the world of behavioral studies, there aren’t that many other variables we need to worry about. But when we move to the realm of brain imaging, the game changes considerably. Nearly all fMRI studies measure something known as the blood-oxygen-level-dependent (BOLD) signal. I’m not going to bother explaining exactly what the BOLD signal is (there are plenty of other excellent explanations at varying levels of technical detail, e.g., here, here, or here); for present purposes, we can just pretend that the BOLD signal is basically a proxy for the amount of neural activity going on in different parts of the brain (that’s actually a pretty reasonable assumption, as emerging studies continue to demonstrate). In other words, a simplistic but not terribly inaccurate model is that when neurons in region X increase their firing rate, blood flow in region X also increases, and so in turn does the BOLD signal that fMRI scanners detect.

A critical question that naturally arises is just how strong the temporal relation is between the BOLD signal and underlying neuronal processes. From a modeling perspective, what we’d really like is a system that’s completely linear and time-invariant–meaning that if you double the duration of a stimulus presented to the brain, the BOLD response elicited by that stimulus also doubles, and it doesn’t matter when the stimulus is presented (i.e., there aren’t any funny interactions between different phases of the response, or with the responses to other stimuli). As it turns out, the BOLD response isn’t perfectly linear, but it’s pretty close. In a seminal series of studies in the mid-90s, Randy Buckner, Anders Dale and others showed that, at least for stimuli that aren’t presented extremely rapidly (i.e., a minimum of 1 – 2 seconds apart), we can reasonably pretend that the BOLD response sums linearly over time without suffering any serious ill effects. And that’s extremely fortunate, because it makes modeling brain activation with fMRI much easier to do. In fact, the vast majority of fMRI studies, which employ what are known as rapid event-related designs, implicitly assume linearity. If the hemodynamic response wasn’t approximately linear, we would have to throw out a very large chunk of the existing literature–or at least seriously question its conclusions.

Aside from the fact that it lets us model things nicely, the assumption of linearity has another critical, but underappreciated, ramification for the way we do fMRI research. Which is this: if the BOLD response sums approximately linearly over time, it follows that two neural responses that have the same amplitude but differ in duration will produce BOLD responses with different amplitudes. To characterize that visually, here’s a figure from a paper I published with Deanna Barch, Jeremy Gray, Tom Conturo, and Todd Braver last year:

plos_one_figure1

Each of these panels shows you the firing rates and durations of two hypothetical populations of neurons (on the left), along with the (observable) BOLD response that would result (on the right). Focus your attention on panel C first. What this panel shows you is what, I would argue, most people intuitively think of when they come across a difference in activation between two conditions. When you see time courses that clearly differ in their amplitude, it’s very natural to attribute a similar difference to the underlying neuronal mechanisms, and suppose that there must just be more firing going on in one condition than the other–where ‘more’ is taken to mean something like “firing at a higher rate”.

The problem, though, is that this inference isn’t justified. If you look at panel B, you can see that you get exactly the same pattern of observed differences in the BOLD response even when the amplitude of neuronal activation is identical, simply because there’s a difference in duration. In other words, if someone shows you a plot of two BOLD time courses for different experimental conditions, and one has a higher amplitude than the other, you don’t know whether that’s because there’s more neuronal activation in one condition than the other, or if processing is identical in both conditions but simply lasts longer in one than in the other. (As a technical aside, this equivalence only holds for short trials, when the BOLD response doesn’t have time to saturate. If you’re using longer trials–say 4 seconds or more–then it becomes fairly easy to tell apart changes in duration from changes in amplitude. But the vast majority of fMRI studies use much shorter trials, in which case the problem I describe holds.)
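To make the duration/amplitude ambiguity concrete, here’s a minimal simulation sketch. It is not the code behind the figure; it simply assumes a perfectly linear, time-invariant system and convolves a boxcar of “neural activity” with a generic double-gamma hemodynamic response. The HRF parameters, the time step, and the function names (`hrf`, `bold`) are all illustrative assumptions of mine.

```python
import math

DT = 0.1  # simulation time step in seconds (an arbitrary choice)

def gamma_pdf(x, shape):
    """Gamma density with unit scale."""
    return x ** (shape - 1) * math.exp(-x) / math.gamma(shape)

def hrf(t):
    """Illustrative double-gamma response: early peak minus a small late undershoot."""
    if t <= 0:
        return 0.0
    return gamma_pdf(t, 6.0) - gamma_pdf(t, 16.0) / 6.0

def bold(amplitude, duration, total=30.0):
    """Predicted BOLD time course: boxcar of neural activity convolved with the HRF."""
    n = int(round(total / DT))
    neural = [amplitude if i * DT < duration else 0.0 for i in range(n)]
    kernel = [hrf(i * DT) for i in range(n)]
    return [DT * sum(neural[j] * kernel[i - j] for j in range(i + 1))
            for i in range(n)]

short_trial = bold(amplitude=1.0, duration=0.5)
long_trial = bold(amplitude=1.0, duration=1.0)    # same firing rate, twice as long
strong_trial = bold(amplitude=2.0, duration=0.5)  # twice the firing rate, same duration

print("peak, short trial:       ", max(short_trial))
print("peak, doubled duration:  ", max(long_trial))
print("peak, doubled amplitude: ", max(strong_trial))

# With long trials the response has time to saturate, so doubling duration
# no longer comes close to doubling the peak.
print("peak ratio, 8 s vs 4 s:  ", max(bold(1.0, 8.0)) / max(bold(1.0, 4.0)))
```

Under these assumptions, doubling the duration of a short trial raises the BOLD peak by nearly as much as doubling the firing rate does, so the two scenarios are very hard to tell apart from the time courses alone. The last line illustrates the technical aside: with trials of several seconds, the ratio falls well short of 2.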

Now, functionally, this has some potentially very serious implications for the inferences we can draw about psychological processes based on observed differences in the BOLD response. What we would usually like to conclude when we report “more” activation for condition X than condition Y is that there’s some fundamental difference in the nature of the processes involved in the two conditions that’s reflected at the neuronal level. If it turns out that the reason we see more activation in one condition than the other is simply that people took longer to respond in one condition than in the other, and so were sustaining attention for longer, that can potentially undermine that conclusion.

For instance, if you’re contrasting a feature search condition with a conjunction search condition, you’re quite likely to observe greater activation in regions known to support visual attention. But since a central feature of conjunction search is that it takes longer than a feature search, it could theoretically be that the same general regions support both types of search, and what we’re seeing is purely a time-on-task effect: visual attention regions are activated for longer because it takes longer to complete the conjunction search, but these regions aren’t doing anything fundamentally different in the two conditions (at least at the level we can see with fMRI). So this raises an issue similar to the speed-accuracy tradeoff we started with. Other things being equal, the longer it takes you to respond, the more activation you’ll tend to see in a given region. Unless you explicitly control for differences in reaction time, your ability to draw conclusions about underlying neuronal processes on the basis of observed BOLD differences may be severely hampered.

It turns out that very few fMRI studies actually control for differences in RT. In an elegant 2008 study discussing different ways of modeling time-varying signals, Jack Grinband and colleagues reviewed a random sample of 170 studies and found that, “Although response times were recorded in 82% of event-related studies with a decision component, only 9% actually used this information to construct a regression model for detecting brain activity”. Here’s what that looks like (Panel C), along with some other interesting information about the procedures used in fMRI studies:

grinband_figure
So only one in ten studies made any effort to control for RT differences; and Grinband et al argue in their paper that most of those papers didn’t model RT the right way anyway (personally I’m not sure I agree; I think there are tradeoffs associated with every approach to modeling RT–but that’s a topic for another post).

The relative lack of attention to RT differences is particularly striking when you consider what cognitive neuroscientists do care a lot about: differences in response accuracy. The majority of researchers nowadays make a habit of discarding all trials on which participants made errors. The justification we give for this approach–which is an entirely reasonable one–is that if we analyzed correct and incorrect trials together, we’d be confounding the processes we care about (e.g., differences between conditions) with activation that simply reflects error-related processes. So we drop trials with errors, and that gives us cleaner results.

I suspect that the reasons for our concern with accuracy effects but not RT effects in fMRI research are largely historical. In the mid-90s, when a lot of formative cognitive neuroscience was being done, people (most of them then located in Pittsburgh, working in Jonathan Cohen’s group) discovered that the brain doesn’t like to make errors. When people make mistakes during task performance, they tend to recognize that fact; on a neural level, frontoparietal regions implicated in goal-directed processing–and particularly the anterior cingulate cortex–ramp up activation substantially. The interpretation of this basic finding has been a source of much contention among cognitive neuroscientists for the past 15 years, and remains a hot area of investigation. For present purposes though, we don’t really care why error-related activation arises; the point is simply that it does arise, and so we do the obvious thing and try to eliminate it as a source of error from our analyses. I suspect we don’t do the same for RT not because we lack principled reasons to, but because there haven’t historically been clear-cut demonstrations of the effects of RT differences on brain activation.

The goal of the 2009 study I mentioned earlier was precisely to try to quantify those effects. The hypothesis my co-authors and I tested was straightforward: if brain activity scales approximately linearly with RT (as standard assumptions would seem to entail), we should see a strong “time-on-task” effect in brain areas that are associated with the general capacity to engage in goal-directed processing. In other words, on trials when people take longer to respond, activation in frontal and parietal regions implicated in goal-directed processing and cognitive control should increase. These regions are often collectively referred to as the “task-positive” network (Fox et al., 2005), in reference to the fact that they tend to show activation increases any time people are engaging in goal-directed processing, irrespective of the precise demands of the task. We figured that identifying a time-on-task effect in the task-positive network would provide a nice demonstration of the relation between RT differences and the BOLD response, since it would underscore the generality of the problem.

Concretely, what we did was take five datasets that were lying around from previous studies, and do a multi-study analysis focusing specifically on RT-related activation. We deliberately selected studies that employed very different tasks, designs, and even scanners, with the aim of ensuring the generalizability of the results. Then, we identified regions in each study in which activation covaried with RT on a trial-by-trial basis. When we put all of the resulting maps together and picked out only those regions that showed an association with RT in all five studies, here’s the map we got:

plos_one_figure2

There’s a lot of stuff going on here, but in the interest of keeping this post slightly less excruciatingly long, I’ll stick to the frontal areas. What we found, when we looked at the timecourse of activation in those regions, was the predicted time-on-task effect. Here’s a plot of the timecourses from all five studies for selected regions:

plos_one_figure4

If you focus on the left time course plot for the medial frontal cortex (labeled R1, in row B), you can see that increases in RT are associated with increased activation in medial frontal cortex in all five studies (the way RT effects are plotted here is not completely intuitive, so you may want to read the paper for a clearer explanation). It’s worth pointing out that while these regions were all defined based on the presence of an RT effect in all five studies, the precise shape of that RT effect wasn’t constrained; in principle, RT could have exerted very different effects across the five studies (e.g., positive in some, negative in others; early in some, later in others; etc.). So the fact that the timecourses look very similar in all five studies isn’t entailed by the analysis, and it’s an independent indicator that there’s something important going on here.

The clear-cut implication of these findings is that a good deal of BOLD activation in most studies can be explained simply as a time-on-task effect. The longer you spend sustaining goal-directed attention to an on-screen stimulus, the more activation you’ll show in frontal regions. It doesn’t much matter what it is that you’re doing; these are ubiquitous effects (since this study, I’ve analyzed many other datasets in the same way, and never fail to find the same basic relationship). And it’s worth keeping in mind that these are just the regions that show common RT-related activation across multiple studies; what you’re not seeing are regions that covary with RT only within one (or for that matter, four) studies. I’d argue that most regions that show involvement in a task are probably going to show variations with RT. After all, that’s just what falls out of the assumption of linearity–an assumption we all depend on in order to do our analyses in the first place.

Exactly what proportion of results can be explained away as time-on-task effects? That’s impossible to determine, unfortunately. I suspect that if you could go back through the entire fMRI literature and magically control for trial-by-trial RT differences in every study, a very large number of published differences between experimental conditions would disappear. That doesn’t mean those findings were wrong or unimportant, I hasten to note; there are many cases in which it’s perfectly appropriate to argue that differences between conditions should reflect a difference in quantity rather than quality. Still, it’s clear that in many cases that isn’t the preferred interpretation, and controlling for RT differences probably would have changed the conclusions. As just one example, much of what we think of as a “conflict” effect in the medial frontal cortex/anterior cingulate could simply reflect prolonged attention on high-conflict trials. When you’re experiencing cognitive difficulty or conflict, you tend to slow down and take longer to respond, which is naturally going to produce BOLD increases that scale with reaction time. The question as to what remains of the putative conflict signal after you control for RT differences is one that hasn’t really been adequately addressed yet.

The practical question, of course, is what we should do about this. How can we minimize the impact of the time-on-task effect on our results, and, in turn, on the conclusions we draw? I think the most general suggestion is to always control for reaction time differences. That’s really the only way to rule out the possibility that any observed differences between conditions simply reflect differences in how long it took people to respond. This leaves aside the question of exactly how one should model out the effect of RT, which is a topic for another time (though I discuss it at length in the paper, and the Grinband paper goes into even more detail). Unfortunately, there isn’t any perfect solution; as with most things, there are tradeoffs inherent in pretty much any choice you make. But my personal feeling is that almost any approach one could take to modeling RT explicitly is a big step in the right direction.

A second, and nearly as important, suggestion is to not only control for RT differences, but to do it both ways. Meaning, you should run your model both with and without an RT covariate, and carefully inspect both sets of results. Comparing the results across the two models is what really lets you draw the strongest conclusions about whether activation differences between two conditions reflect a difference of quality or quantity. This point applies regardless of which hypothesis you favor: if you think two conditions draw on very similar neural processes that differ only in degree, your prediction is that controlling for RT should make effects disappear. Conversely, if you think that a difference in activation reflects the recruitment of qualitatively different processes, you’re making the prediction that the difference will remain largely unchanged after controlling for RT. Either way, you gain important information by comparing the two models.
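Here’s a toy sketch of the “do it both ways” comparison. It is a deliberately stripped-down stand-in for a real fMRI GLM: one number per trial rather than a full time series, simulated data in which the condition difference is carried entirely by RT, and made-up effect sizes. Rather than fitting a multiple regression directly, it uses the Frisch-Waugh-Lovell trick (regress RT out of both the response and the condition regressor, then regress residual on residual), which yields the same condition coefficient as including RT as a covariate.

```python
import random

def slope(x, y):
    """OLS slope of y on x (the model includes an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """What is left of y after regressing out x and an intercept."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

rng = random.Random(0)
n_trials = 400

# Hypothetical data: condition 1 is slower, and activation depends ONLY on RT.
condition = [i % 2 for i in range(n_trials)]
rt = [0.6 + 0.2 * c + rng.gauss(0, 0.1) for c in condition]
activation = [2.0 * r + rng.gauss(0, 0.2) for r in rt]

# Model 1: condition only (no RT covariate).
effect_without_rt = slope(condition, activation)

# Model 2: condition plus RT covariate, via residualization.
effect_with_rt = slope(residuals(rt, condition), residuals(rt, activation))

print("condition effect, no RT covariate:  ", round(effect_without_rt, 3))
print("condition effect, with RT covariate:", round(effect_with_rt, 3))
```

In this simulation the condition effect is sizable in the first model and shrinks toward zero in the second, which is the “difference of quantity” pattern. If I had instead built in a direct condition effect on activation, the estimate would survive the covariate, which is what the “difference of quality” account predicts.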

The last suggestion I have to offer is probably obvious, and not very helpful, but for what it’s worth: be cautious about how you interpret differences in activation any time there are sizable differences in task difficulty and/or mean response time. It’s tempting to think that if you always analyze only trials with correct responses and follow the suggestions above to explicitly model RT, you’ve done all you need in order to perfectly control for the various tradeoffs and relationships between speed, accuracy, and cognitive effort. It really would be nice if we could all sleep well knowing that our data have unambiguous interpretations. But the truth is that all of these techniques for “controlling” for confounds like difficulty and reaction time are imperfect, and in some cases have known deficiencies (for instance, it’s not really true that throwing out error trials eliminates all error-related activation from analysis–sometimes when people don’t know the answer, they guess right!). That’s not to say we should stop using the tools we have–which offer an incredibly powerful way to peer inside our gourds–just that we should use them carefully.


Yarkoni T, Barch DM, Gray JR, Conturo TE, & Braver TS (2009). BOLD correlates of trial-by-trial reaction time variability in gray and white matter: a multi-study fMRI analysis. PLoS ONE, 4 (1) PMID: 19165335

Grinband J, Wager TD, Lindquist M, Ferrera VP, & Hirsch J (2008). Detection of time-varying signals in event-related fMRI designs. NeuroImage, 43 (3), 509-20 PMID: 18775784

the capricious nature of p < .05, or why data peeking is evil

Thursday, May 6th, 2010

There’s a time-honored tradition in the social sciences–or at least psychology–that goes something like this. You decide on some provisional number of subjects you’d like to run in your study; usually it’s a nice round number like twenty or sixty, or some number that just happens to coincide with the sample size of the last successful study you ran. Or maybe it just happens to be your favorite number (which of course is forty-four). You get your graduate student to start running the study, and promptly forget about it for a couple of weeks while you go about writing up journal reviews that are three weeks overdue and chapters that are six months overdue.

A few weeks later, you decide you’d like to know how that Amazing New Experiment you’re running is going. You summon your RA and ask him, in magisterial tones, “how’s that Amazing New Experiment we’re running going?” To which he falteringly replies that he’s been very busy with all the other data entry and analysis chores you assigned him, so he’s only managed to collect data from eighteen subjects so far. But he promises to have the other eighty-two subjects done any day now.

“Not to worry,” you say. “We’ll just take a peek at the data now and see what it looks like; with any luck, you won’t even need to run any more subjects! By the way, here are my car keys; see if you can’t have it washed by 5 pm. Your job depends on it. Ha ha.”

Once your RA’s gone to soil himself somewhere, you gleefully plunge into the task of peeking at your data. You pivot your tables, plyr your data frame, and bravely sort your columns. Then you extract two of the more juicy variables for analysis, and after a t-test or six, you arrive at the conclusion that your hypothesis is… “marginally” supported. Which is to say, the magical p value is somewhere north of .05 and somewhere south of .10, and now it’s just parked by the curb waiting for you to give it better directions.

You briefly contemplate reporting your result as a one-tailed test–since it’s in the direction you predicted, right?–but ultimately decide against that. You recall the way your old Research Methods professor used to rail at length against the evils of one-tailed tests, and even if you don’t remember exactly why they’re so evil, you’re not willing to take any chances. So you decide it can’t be helped; you need to collect some more data.

You summon your RA again. “Is my car washed yet?” you ask.

“No,” says your RA in a squeaky voice. “You just asked me to do that fifteen minutes ago.”

“Right, right,” you say. “I knew that.”

You then explain to your RA that he should suspend all other assigned duties for the next few days and prioritize running subjects in the Amazing New Experiment. “Abandon all other tasks!” you decree. “If it doesn’t involve collecting new data, it’s unimportant! Your job is to eat, sleep, and breathe new subjects! But not literally!”

Being quite clever, your RA sees an opening. “I guess you’ll want your car keys back, then,” he suggests.

“Nice try, Poindexter,” you say. “Abandon all other tasks… starting tomorrow.”

You also give your RA very careful instructions to email you the new data after every single subject, so that you can toss it into your spreadsheet and inspect the p value at every step. After all, there’s no sense in wasting perfectly good data; once your p value is below .05, you can just funnel the rest of the participants over to the Equally Amazing And Even Newer Experiment you’ve been planning to run as a follow-up. It’s a win-win proposition for everyone involved. Except maybe your RA, who’s still expected to return triumphant with a squeaky clean vehicle by 5 pm.

Twenty-six months and four rounds of review later, you publish the results of the Amazing New Experiment as Study 2 in a six-study paper in the Journal of Ambiguous Results. The reviewers raked you over the coals for everything from the suggested running head of the paper to the ratio between the abscissa and the ordinate in Figure 3. But what they couldn’t argue with was the p value in Study 2, which clocked in at just under .05, with only 21 subjects’ worth of data (compare that to the 80 you had to run in Study 4 to get a statistically significant result!). Suck on that, Reviewers!, you think to yourself pleasantly while driving yourself home from work in your shiny, shiny Honda Civic.

So ends our short parable, which has at least two subtle points to teach us. One is that it takes a really long time to publish anything; who has time to wait twenty-six months and go through four rounds of review?

The other, more important point, is that the desire to peek at one’s data, which often seems innocuous enough–and possibly even advisable (quality control is important, right?)–can actually be quite harmful. At least if you believe that the goal of doing research is to arrive at the truth, and not necessarily to publish statistically significant results.

The basic problem is that peeking at your data is rarely a passive process; most often, it’s done in the context of a decision-making process, where the goal is to determine whether or not you need to keep collecting data. There are two possible peeking outcomes that might lead you to decide to halt data collection: a very low p value (i.e., p < .05), in which case your hypothesis is supported and you may as well stop gathering evidence; or a very high p value, in which case you might decide that it’s unlikely you’re ever going to successfully reject the null, so you may as well throw in the towel. Either way, you’re making the decision to terminate the study based on the results you find in a provisional sample.

A complementary situation, which also happens not infrequently, occurs when you collect data from exactly as many participants as you decided ahead of time, only to find that your results aren’t quite what you’d like them to be (e.g., a marginally significant hypothesis test). In that case, it may be quite tempting to keep collecting data even though you’ve already hit your predetermined target. I can count on more than one hand the number of times I’ve overheard people say (often without any hint of guilt) something to the effect of “my p value’s at .06 right now, so I just need to collect data from a few more subjects.”

Here’s the problem with either (a) collecting more data in an effort to turn p < .06 into p < .05, or (b) ceasing data collection because you’ve already hit p < .05: any time you add another subject to your sample, there’s a fairly large probability the p value will go down purely by chance, even if there’s no effect. So there you are sitting at p < .06 with twenty-four subjects, and you decide to run a twenty-fifth subject. Well, let’s suppose that there actually isn’t a meaningful effect in the population, and that p < .06 value you’ve got is a (near) false positive. Adding that twenty-fifth subject can only do one of two things: it can raise your p value, or it can lower it. The exact probabilities of these two outcomes depend on the current effect size in your sample before adding the new subject; but generally speaking, they’ll rarely be very far from 50-50. So now you can see the problem: if you stop collecting data as soon as you get a significant result, you may well be capitalizing on chance. It could be that if you’d collected data from a twenty-sixth and twenty-seventh subject, the p value would reverse its trajectory and start rising. It could even be that if you’d collected data from two hundred subjects, the effect size would stabilize near zero. But you’d never know that if you stopped the study as soon as you got the results you were looking for.

Lest you think I’m exaggerating, and think that this problem falls into the famous class of things-statisticians-and-methodologists-get-all-anal-about-but-that-don’t-really-matter-in-the-real-world, here’s a sobering figure (taken from this chapter):

data_peeking

The figure shows the results of a simulation quantifying the increase in false positives associated with data peeking. The assumptions here are that (a) data peeking begins after about 10 subjects (starting earlier would further increase false positives, and starting later would decrease false positives somewhat), (b) the researcher stops as soon as a peek at the data reveals a result significant at p < .05, and (c) data peeking occurs at incremental steps of either 1 or 5 subjects. Given these assumptions, you can see that there’s a fairly monstrous rise in the actual Type I error rate (relative to the nominal rate of 5%). For instance, if the researcher initially plans to collect 60 subjects, but peeks at the data after every 5 subjects, there’s approximately a 17% chance that the threshold of p < .05 will be reached before the full sample of 60 subjects is collected. When data peeking occurs even more frequently (as might happen if a researcher is actively trying to turn p < .07 into p < .05, and is monitoring the results after each incremental participant), Type I error inflation is even worse. So unless you think there’s no practical difference between a 5% false positive rate and a 15 – 20% false positive rate, you should be concerned about data peeking; it’s not the kind of thing you just brush off as needless pedantry.
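If you’d like to see the inflation for yourself, here’s a rough simulation sketch in the same spirit. To be clear, it is not the code behind the figure, and it makes simplifying assumptions of its own: the null hypothesis is true, the data are standard normal, and the test at each peek is a two-sided z-test with known variance rather than a t-test. The function name and defaults are mine.

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(max_n=60, first_peek=10, step=5,
                                alpha=0.05, n_sims=4000, seed=1):
    """Share of null experiments declared 'significant' at any peek."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    peeks = set(range(first_peek, max_n + 1, step))
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0, 1)  # the null is true: mean 0, sd 1
            # z statistic for the running sample mean is total / sqrt(n)
            if n in peeks and abs(total) / n ** 0.5 > z_crit:
                hits += 1  # stop the "study" at the first significant peek
                break
    return hits / n_sims

no_peeking = peeking_false_positive_rate(first_peek=60)  # one test at n = 60
every_five = peeking_false_positive_rate(step=5)
every_subject = peeking_false_positive_rate(step=1)

print("single test at n = 60:      ", no_peeking)
print("peek every 5 subjects:      ", every_five)
print("peek after every subject:   ", every_subject)
```

The single-test rate should hover around the nominal 5%, while both peeking schedules should land well above it, with the every-subject schedule doing worst. The exact values will differ somewhat from the figure because of the z-test simplification and simulation noise.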

How do we stop ourselves from capitalizing on chance by looking at the data? Broadly speaking, there are two reasonable solutions. One is to just pick a number up front and stick with it. If you commit yourself to collecting data from exactly as many subjects as you said you would (you can proclaim the exact number loudly to anyone who’ll listen, if you find it helps), you’re then free to peek at the data all you want. After all, it’s not the act of observing the data that creates the problem; it’s the decision to terminate data collection based on your observation that matters.

The other alternative is to explicitly correct for data peeking. This is a common approach in large clinical trials, where data peeking is often ethically mandated, because you don’t want to either (a) harm people in the treatment group if the treatment turns out to have clear and dangerous side effects, or (b) prevent the control group from capitalizing on the treatment too if it seems very efficacious. In either event, you’d want to terminate the trial early. What researchers often do, then, is pick predetermined intervals at which to peek at the data, and then apply a correction to the p values that takes into account the number of, and interval between, peeking occasions. Provided you do things systematically in that way, peeking then becomes perfectly legitimate. Of course, the downside is that having to account for those extra inspections of the data makes your statistical tests more conservative. So if there aren’t any ethical issues that necessitate peeking, and you’re not worried about quality control issues that might be revealed by eyeballing the data, your best bet is usually to just pick a reasonable sample size (ideally, one based on power calculations) and stick with it.
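As a rough illustration of what such a correction buys you, here’s a sketch that compares three planned looks at the data with and without a Bonferroni adjustment (dividing the alpha level by the number of looks). Bonferroni is my stand-in here because it’s easy to state; actual trials more often use purpose-built group-sequential boundaries such as Pocock or O’Brien-Fleming, which are less conservative. As before, this assumes a true null and a z-test with known variance, and the look schedule is hypothetical.

```python
import random
from statistics import NormalDist

def sequential_error_rate(looks, per_look_alpha, n_sims=4000, seed=2):
    """Share of null simulations that cross the threshold at any planned look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    look_set, max_n = set(looks), max(looks)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0, 1)  # the null is true: mean 0, sd 1
            if n in look_set and abs(total) / n ** 0.5 > z_crit:
                hits += 1  # trial stopped early at this look
                break
    return hits / n_sims

looks = [20, 40, 60]  # hypothetical pre-planned interim analyses

uncorrected = sequential_error_rate(looks, per_look_alpha=0.05)
corrected = sequential_error_rate(looks, per_look_alpha=0.05 / len(looks))

print("three looks, alpha = .05 at each look:", uncorrected)
print("three looks, Bonferroni-corrected:    ", corrected)
```

The corrected version should keep the overall false positive rate at or below 5%, but only by demanding a stricter threshold at every look, which is exactly the loss of power described above.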

Oh, and also, don’t make your RAs wash your car for you; that’s not their job.

undergraduates are WEIRD

Tuesday, April 27th, 2010

This month’s issue of Nature Neuroscience contains an editorial lambasting the excessive reliance of psychologists on undergraduate college samples, which, it turns out, are pretty unrepresentative of humanity at large. The impetus for the editorial is a mammoth in-press review of cross-cultural studies by Joseph Henrich and colleagues, which, the authors suggest, collectively indicate that “samples drawn from Western, Educated, Industrialized, Rich and Democratic (WEIRD) societies … are among the least representative populations one could find for generalizing about humans.” I’ve only skimmed the article, but aside from the clever acronym, you could do a lot worse than these (rather graphic) opening paragraphs:

In the tropical forests of New Guinea the Etoro believe that for a boy to achieve manhood he must ingest the semen of his elders. This is accomplished through ritualized rites of passage that require young male initiates to fellate a senior member (Herdt, 1984; Kelley, 1980). In contrast, the nearby Kaluli maintain that male initiation is only properly done by ritually delivering the semen through the initiate’s anus, not his mouth. The Etoro revile these Kaluli practices, finding them disgusting. To become a man in these societies, and eventually take a wife, every boy undergoes these initiations. Such boy-inseminating practices, which are enmeshed in rich systems of meaning and imbued with local cultural values, were not uncommon among the traditional societies of Melanesia and Aboriginal Australia (Herdt, 1993), as well as in Ancient Greece and Tokugawa Japan.

Such in-depth studies of seemingly “exotic” societies, historically the province of anthropology, are crucial for understanding human behavioral and psychological variation. However, this paper is not about these peoples. It’s about a truly unusual group: people from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies. In particular, it’s about the Western, and more specifically American, undergraduates who form the bulk of the database in the experimental branches of psychology, cognitive science, and economics, as well as allied fields (hereafter collectively labeled the “behavioral sciences”). Given that scientific knowledge about human psychology is largely based on findings from this subpopulation, we ask just how representative are these typical subjects in light of the available comparative database. How justified are researchers in assuming a species-level generality for their findings? Here, we review the evidence regarding how WEIRD people compare to other populations.

Anyway, it looks like a good paper. Based on a cursory read, the conclusions the authors draw seem pretty reasonable, if a bit strong. I think most researchers do already recognize that our dependence on undergraduates is unhealthy in many respects; it’s just that it’s difficult to break the habit, because the alternative is to spend a lot more time and money chasing down participants (and there are limits to that too; it just isn’t feasible for most researchers to conduct research with Etoro populations in New Guinea). Then again, just because it’s hard to do science the right way doesn’t really make it OK to do it the wrong way. So, to the extent that we care about our results generalizing across the entire human species (which, in many cases, we don’t), we should probably be investing more energy in weaning ourselves off undergraduates and trying to recruit more diverse samples.

some thoughtful comments on automatic measure abbreviation

Thursday, April 1st, 2010

In the comments on my last post, Sanjay Srivastava had some excellent thoughts/concerns about the general approach of automating measure abbreviation using a genetic algorithm. They’re valid concerns that might come up for other people too, so I thought I’d discuss them here in more detail. Here’s Sanjay:

Lew Goldberg emailed me a copy of your paper a while back and asked what I thought of it. I’m pasting my response below — I’d be curious to hear your take on it. (In this email “he” is you and “you” is he because I was writing to Lew…)

::

1. So this is what it feels like to be replaced by a machine.

I’m not sure if Sanjay thinks this is a good or a bad thing? I guess my own feeling is that it’s a good thing to the extent that it makes personality measurement more efficient and frees researchers up to use that time (both during data collection and measure development) for other productive things like writing papers (or eating M&M’s on the couch and devising the most diabolically clever April Fool’s joke for next year to make up for the fact that you forgot to do it this year), and a bad one to the extent that people take this as a license to stop thinking carefully about what they’re doing when they’re shortening or administering questionnaire measures. But provided people retain a measure of skepticism and cautiousness in applying this type of approach, I’m optimistic that the result will be a large net gain.

2. The convergent correlations were a little low in studies 2 and 3. You’d expect shortened scales to have less reliability and validity, of course, but that didn’t go all the way in covering the difference. He explained that this was because the AMBI scales draw on a different item pool than the proprietary measures, which makes sense. However, that makes it hard to evaluate the utility of the approach. If you compare how the full IPIP facet scales correlate with the proprietary NEO (which you’ve published here: http://ipip.ori.org/newNEO_FacetsTable.htm) against his Table 2, for example, it looks like the shortening algorithm is losing some information. Whether that’s better or worse than a rationally shortened scale is hard to say.

This is an excellent point, and I do want to reiterate that the abbreviation process isn’t magic; you can’t get something for free, and you’re almost invariably going to lose some fidelity in your measurement when you shorten any measure. That said, I actually feel pretty good about the degree of convergence I report in the paper. Sanjay already mentions one reason the convergent correlations seem lower than what you might expect: the new measures are composed of different items than the old ones, so they’re not going to share many of the same sources of error. That means the convergent correlations will necessarily be lower, but that isn’t necessarily a problem in a broader sense. But I think there are also two other, arguably more important, reasons why the convergence might seem deceptively low.

One is that the degree of convergence is bounded by the test-retest reliability of the original measures. Because the items in the IPIP pools were administered in batches spanning about a decade, whereas each of the proprietary measures (e.g., the NEO-PI-R) was administered on one occasion, the net result is that many of the items being used to predict personality traits were actually filled out several years before or after the personality measures in question. If you look at the long-term test-retest reliability of some of the measures I abbreviated (and there actually isn’t all that much test-retest data of that sort out there), it’s not clear that it’s much higher than what I report, even for the original measures. In other words, if you don’t generally see test-retest correlations across several years greater than .6 – .8 for the real NEO-PI-R scales, you can’t really expect to do any better with an abbreviated measure. But that probably says more about the reliability of narrowly-defined personality traits than about the abbreviation process.
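
To make that ceiling a bit more concrete, here is a minimal Python sketch of the classical attenuation formula, under which the observed correlation between two measures can't exceed the square root of the product of their reliabilities. This is a textbook classical-test-theory result, not something computed in the paper, and the function name and the reliability values are purely illustrative assumptions.

```python
import math


def max_observed_correlation(reliability_x, reliability_y, true_correlation=1.0):
    """Classical attenuation: observed r = true r * sqrt(rel_x * rel_y).

    Hypothetical helper for illustration; reliabilities are assumed to
    lie between 0 and 1.
    """
    return true_correlation * math.sqrt(reliability_x * reliability_y)


if __name__ == "__main__":
    # Illustrative values only: long-term retest reliabilities in the
    # .6 to .8 range mentioned above, for both measures.
    for rel in (0.6, 0.7, 0.8):
        ceiling = max_observed_correlation(rel, rel)
        print(f"reliability {rel:.1f} -> ceiling on convergence {ceiling:.2f}")
```

Under these assumptions, even two measures of exactly the same trait would top out at a convergent correlation equal to the shared reliability, which is the sense in which long-term stability caps what any abbreviated measure can achieve.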

The other reason the convergent correlations seem lower than you might expect, which I actually think is the big one, is that I reported only the cross-validated coefficients in the paper. In other words, I used only half of the data to abbreviate measures like the NEO-PI-R and HEXACO-PI, and then used the other half to obtain unbiased estimates of the true degree of convergence. This is technically the right way to do things, because if you don’t cross-validate, you’re inevitably going to capitalize on chance. If you fit a model to a particular set of data, and then use the very same data to ask the question “how well does the model fit the data?” you’re essentially cheating–or, to put it more mildly, your estimates are going to be decidedly “optimistic”. You could argue it’s a relatively benign kind of cheating, because almost everyone does it, but that doesn’t make it okay from a technical standpoint.
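
For anyone who wants to see the capitalization-on-chance problem in action, here is a small Python sketch. It is a toy simulation, not the procedure from the paper: the data are pure noise (no item is actually related to the criterion), items are picked by their correlation with the criterion in one half of the sample, and the resulting scale is then scored in both halves. All names and sample sizes are illustrative assumptions.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def train_vs_validation(n_people=400, n_items=200, n_keep=10, seed=0):
    """Select items in the training half, then score both halves."""
    rng = random.Random(seed)
    # Pure noise: the criterion and every item are independent draws.
    criterion = [rng.gauss(0, 1) for _ in range(n_people)]
    items = [[rng.gauss(0, 1) for _ in range(n_people)] for _ in range(n_items)]
    half = n_people // 2

    # Rank items by |r| with the criterion in the training half only,
    # remembering the sign so that each item can be keyed accordingly.
    ranked = []
    for item in items:
        r = pearson(item[:half], criterion[:half])
        ranked.append((abs(r), 1 if r >= 0 else -1, item))
    ranked.sort(key=lambda entry: entry[0], reverse=True)
    chosen = ranked[:n_keep]

    def scale_scores(rows):
        return [sum(sign * item[i] for _, sign, item in chosen) for i in rows]

    r_train = pearson(scale_scores(range(half)), criterion[:half])
    r_valid = pearson(scale_scores(range(half, n_people)), criterion[half:])
    return r_train, r_valid


if __name__ == "__main__":
    r_train, r_valid = train_vs_validation()
    print(f"training r = {r_train:.2f}, validation r = {r_valid:.2f}")
```

The training-half correlation comes out looking respectable even though there is no signal at all, while the validation-half correlation hovers around zero. With real data the gap is smaller, but it points in the same direction.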

When you look at it this way, the comparison of the IPIP representation of the NEO-PI-R with the abbreviated representation of the NEO-PI-R I generated in my paper isn’t really a fair one, because the IPIP measure Lew Goldberg came up with wasn’t cross-validated. Lew simply took the ten items that most strongly predicted each NEO-PI-R scale and grouped them together (with some careful rational inspection and modification, to be sure). That doesn’t mean there’s anything wrong with the IPIP measures; I’ve used them on multiple occasions myself, and have no complaints. They’re perfectly good measures that I think stand in really well for the (proprietary) originals. My point is just that the convergent correlations reported on the IPIP website are likely to be somewhat inflated relative to the truth.

The nice thing is that we can directly compare the AMBI (the measure I developed in my paper) with the IPIP version of the NEO-PI-R on a level footing by looking at the convergent correlations for the AMBI using only the training data. If you look at the validation (i.e., unbiased) estimates for the AMBI, which is what Sanjay’s talking about here, the mean convergent correlation for the 30 scales of the NEO-PI-R is .63, which is indeed much lower than the .73 reported for the IPIP version of the NEO-PI-R. Personally I’d still probably argue that .63 with 108 items is better than .73 with 300 items, but it’s a subjective question, and I wouldn’t disagree with anyone who preferred the latter. But again, the critical point is that this isn’t a fair comparison. If you make a fair comparison and look at the mean convergent correlation in the training data, it’s .69 for the AMBI, which is much closer to the IPIP data. Given that the AMBI version is just over 1/3rd the length of the IPIP version, I think the choice here becomes more clear-cut, and I doubt that there are many contexts where the (mean) difference between .69 and .73 would have meaningful practical implications.

It’s also worth remembering that nothing says you have to go with the 108-item measure I reported in the paper. The beauty of the GA approach is that you can quite easily generate a NEO-PI-R analog of any length you like. So if your goal isn’t so much to abbreviate the NEO-PI-R as to obtain a non-proprietary analog (and indeed, the IPIP version of the NEO-PI-R is actually longer than the NEO-PI-R, which contains 240 items), I think there’s a very good chance you could do better than the IPIP measure using substantially fewer than 300 items (but more than 108).

In fact, if you really had a lot of time on your hands, and wanted to test this question more thoroughly, what I think you’d want to do is run the GA with systematically varying item costs (i.e., you run the exact same procedure on the same data, but change the itemCost parameter a little bit each time). That way, you could actually plot out a curve showing you the degree of convergence with the original measure as a function of the length of the new measure (this is functionality I’d like to add to the GA code I released when I have the time, but probably not in the near future). I don’t really know what the sweet spot would be, but I can tell you from extensive experimentation that you get diminishing returns pretty quickly. In other words, I just don’t think you’re going to be able to get convergent correlations much higher than .7 on average (this only holds for the IPIP data, obviously; you might do much better using data collected over shorter timespans, or using subsets of items from the original measures). So in that sense, I like where I ended up (i.e., 108 items that still recapture the original quite well).
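
Here is a rough Python sketch of what that kind of sweep might look like. To be clear about the assumptions: this is not the released GA code. A simple greedy forward selection stands in for the genetic algorithm, the cost function (item cost times number of items, plus the unexplained variance 1 - r^2) is an assumed form, the data are simulated, and every function name is hypothetical. Only the idea of varying an item cost parameter comes from the text above.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_people=300, n_items=40, seed=1):
    """Hypothetical data: one latent trait, a noisy 'original' scale, noisy items."""
    rng = random.Random(seed)
    trait = [rng.gauss(0, 1) for _ in range(n_people)]
    target = [t + rng.gauss(0, 0.5) for t in trait]
    items = [[0.6 * t + rng.gauss(0, 0.8) for t in trait] for _ in range(n_items)]
    return items, target


def greedy_abbreviate(items, target, item_cost):
    """Add items one at a time while item_cost * k + (1 - r**2) keeps falling."""
    chosen, total = [], [0.0] * len(target)
    best_cost = 1.0  # cost of the empty measure: no items, nothing explained
    while True:
        best = None
        for j, item in enumerate(items):
            if j in chosen:
                continue
            candidate = [s + v for s, v in zip(total, item)]
            r = pearson(candidate, target)
            cost = item_cost * (len(chosen) + 1) + (1 - r * r)
            if cost < best_cost:
                best_cost, best = cost, (j, candidate)
        if best is None:  # no remaining item lowers the cost, so stop
            return chosen
        chosen.append(best[0])
        total = best[1]


def sweep_item_costs(item_costs, seed=1):
    """Return (item_cost, n_items, validation r) for each item cost."""
    items, target = simulate(seed=seed)
    half = len(target) // 2
    train_items = [item[:half] for item in items]
    curve = []
    for item_cost in item_costs:
        chosen = greedy_abbreviate(train_items, target[:half], item_cost)
        # Score the selected items in the held-out half only.
        scores = [sum(items[j][i] for j in chosen) for i in range(half, len(target))]
        r_valid = pearson(scores, target[half:]) if chosen else 0.0
        curve.append((item_cost, len(chosen), r_valid))
    return curve


if __name__ == "__main__":
    for item_cost, n_items, r_valid in sweep_item_costs([0.002, 0.01, 0.05, 0.2]):
        print(f"item cost {item_cost:<6} items kept {n_items:>2}  validation r {r_valid:.2f}")
```

Plotting the number of items kept against the validation correlation gives the length-versus-convergence curve described above. In a simulation like this one, raising the item cost can only shorten the measure, and the gains from each additional item shrink quickly.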

3. Ultimately I’d like to see a few substantive studies that run the GA-shortened scales alongside the original scales. The column-vector correlations that he reported were hard to evaluate — I’d like to see the actual predictions of behavior, not just summaries. But this seems like a promising approach.

[BTW, that last sentence is the key one. I'm looking forward to seeing more of what you and others can do with this approach.]

When I was writing the paper, I did initially want to include a supplementary figure showing the full-blown matrix of traits predicting the low-level behaviors Sanjay is alluding to (which are part of Goldberg’s massive dataset), but it seemed kind of daunting to present because there are 60 behavioral variables, and most of the correlations were very weak (not just for the AMBI measure–I mean they were weak for the original NEO-PI-R). So you would be looking at a 30 x 60 matrix full of mostly near-zero correlations, which seemed pretty uninformative. So to address basically the same concern, what I did instead was include a supplementary figure showing a 30 x 5 matrix that captures the relation between the 30 facets of the NEO-PI-R and the Big Five as rated by participants’ peers (i.e., an independent measure of personality). Here’s that figure:

[Figure: big_five_peer]

What I’m presenting is the same correlation matrix for three different versions of the NEO-PI-R: the AMBI version I generated (on the left), and the original (i.e., real) NEO-PI-R, for both the training and validation samples. The important point to note is that the pattern of correlations with an external set of criterion variables is very similar for all three measures. It isn’t identical of course, but you shouldn’t expect it to be. (In fact, if you look at the rightmost two columns, that gives you a sense of how you can get relatively different correlations even for exactly the same measure and subjects when the sample is randomly divided in two. That’s just sampling variability.) There are, in fairness, one or two blips where the AMBI version does something quite different (e.g., impulsiveness predicts peer-rated Conscientiousness for the AMBI version but not the other two). But overall, I feel pretty good about the AMBI measure when I look at this figure. I don’t think you’re losing very much in terms of predictive power or specificity, whereas I think you’re gaining a lot in time savings.

Having said all that, I couldn’t agree more with Sanjay’s final point, which is that the proof is really in the pudding (who came up with that expression? Bill Cosby?). I’ve learned the hard way that it’s really easy to come up with excellent theoretical and logical reasons for why something should or shouldn’t work, yet when you actually do the study to test your impeccable reasoning, the empirical results often surprise you, and then you’re forced to confront the reality that you’re actually quite dumb (and wrong). So it’s certainly possible that, for reasons I haven’t anticipated, something will go profoundly awry when people actually try to use these abbreviated measures in practice. And then I’ll have to delete this blog, change my name, and go into hiding. But I really don’t think that’s very likely. And I’m willing to stake a substantial chunk of my own time and energy on it (I’d gladly stake my reputation on it too, but I don’t really have one!); I’ve already started using these measures in my own studies–e.g., in a blogging study I’m conducting online here–with promising preliminary results. Ultimately, as with everything else, time will tell whether or not the effort is worth it.