Too much p = .048? Towards partial automation of scientific evaluation

Distinguishing good science from bad science isn’t an easy thing to do. One big problem is that what constitutes ‘good’ work is, to a large extent, subjective; I might love a paper you hate, or vice versa. Another problem is that science is a cumulative enterprise, and the value of each discovery is, in some sense, determined by how much of an impact that discovery has on subsequent work–something that often only becomes apparent years or even decades after the fact. So, to an uncomfortable extent, evaluating scientific work involves a good deal of guesswork and personal preference, which is probably why scientists tend to fall back on things like citation counts and journal impact factors as tools for assessing the quality of someone’s work. We know it’s not a great way to do things, but it’s not always clear how else we could do better.

Fortunately, there are many aspects of scientific research that don’t depend on subjective preferences or require us to suspend judgment for ten or fifteen years. In particular, methodological aspects of a paper can often be evaluated in a (relatively) objective way, and strengths or weaknesses of particular experimental designs are often readily discernible. For instance, in psychology, pretty much everyone agrees that large samples are generally better than small samples, reliable measures are better than unreliable measures, representative samples are better than WEIRD ones, and so on. The trouble when it comes to evaluating the methodological quality of most work isn’t so much that there’s rampant disagreement between reviewers (though it does happen), it’s that research articles are complicated products, and the odds of any individual reviewer having the expertise, motivation, and attention span to catch every major methodological concern in a paper are exceedingly small. Since only two or three people typically review a paper pre-publication, it’s not surprising that in many cases, whether or not a paper makes it through the review process depends as much on who happened to review it as on the paper itself.

A nice example of this is the Bem paper on ESP I discussed here a few weeks ago. I think most people would agree that things like data peeking, lumping and splitting studies, and post-hoc hypothesis testing–all of which are apparent in Bem’s paper–are generally not good research practices. And no doubt many potential reviewers would have noted these and other problems with Bem’s paper had they been asked to review. But as it happens, the actual reviewers didn’t note those problems (or at least, not enough of them), so the paper was accepted for publication.

I’m not saying this to criticize Bem’s reviewers, who I’m sure all had a million other things to do besides pore over the minutiae of a paper on ESP (and for all we know, they could have already caught many other problems with the paper that were subsequently addressed before publication). The problem is a much more general one: the pre-publication peer review process in psychology, and many other areas of science, is pretty inefficient and unreliable, in the sense that it draws on the intense efforts of a very few, semi-randomly selected, individuals, as opposed to relying on a much broader evaluation by the community of researchers at large.

In the long term, the best solution to this problem may be to fundamentally rethink the way we evaluate scientific papers–e.g., by designing new platforms for post-publication review of papers (e.g., see this post for more on efforts towards that end). I think that’s far and away the most important thing the scientific community could do to improve the quality of scientific assessment, and I hope we ultimately will collectively move towards alternative models of review that look a lot more like the collaborative filtering systems found on, say, reddit or Stack Overflow than like peer review as we now know it. But that’s a process that’s likely to take a long time, and I don’t profess to have much of an idea as to how one would go about kickstarting it.

What I want to focus on here is something much less ambitious, but potentially still useful–namely, the possibility of automating the assessment of at least some aspects of research methodology. As I alluded to above, many of the factors that help us determine how believable a particular scientific finding is are readily quantifiable. In fact, in many cases, they’re already quantified for us. Sample sizes, p values, effect sizes, coefficient alphas… all of these things are, in one sense or another, indices of the quality of a paper (however indirect), and are easy to capture and code. And many other things we care about can be captured with only slightly more work. For instance, if we want to know whether the authors of a paper corrected for multiple comparisons, we could search for strings like “multiple comparisons”, “uncorrected”, “Bonferroni”, and “FDR”, and probably come away with a pretty decent idea of what the authors did or didn’t do to correct for multiple comparisons. It might require a small dose of technical wizardry to do this kind of thing in a sensible and reasonably accurate way, but it’s clearly feasible–at least for some types of variables.
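Just to make that concrete, here’s roughly what the string-matching approach might look like in code. This is a toy sketch, not a validated instrument–the term list and the example snippet are mine, and a real implementation would obviously need to be much smarter about context:

```python
import re

# Illustrative (not validated) terms that suggest the authors addressed
# correction for multiple comparisons somewhere in the manuscript.
CORRECTION_TERMS = [
    r"multiple comparisons?",
    r"uncorrected",
    r"bonferroni",
    r"false discovery rate",
    r"\bFDR\b",
    r"family-?wise",
]

def correction_mentions(paper_text):
    """Count occurrences of each correction-related term in the full text."""
    return {term: len(re.findall(term, paper_text, flags=re.IGNORECASE))
            for term in CORRECTION_TERMS}

# Toy usage on a snippet of a Methods section:
snippet = ("All pairwise tests were Bonferroni corrected; "
           "exploratory analyses are reported uncorrected.")
print(correction_mentions(snippet))
# {'multiple comparisons?': 0, 'uncorrected': 1, 'bonferroni': 1, ...}
```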

Once we extracted a bunch of data about the distribution of p values and sample sizes from many different papers, we could then start to do some interesting (and potentially useful) things, like generating automated metrics of research quality. For instance:

  • In multi-study articles, the variance in sample size across studies could tell us something useful about the likelihood that data peeking is going on (for an explanation as to why, see this). Other things being equal, an article with 9 studies with identical sample sizes is less likely to be capitalizing on chance than one containing 9 studies that range in sample size between 50 and 200 subjects (as the Bem paper does), so high variance in sample size could be used as a rough index for proclivity to peek at the data.
  • Quantifying the distribution of p values found in an individual article or an author’s entire body of work might be a reasonable first-pass measure of the amount of fudging (usually inadvertent) going on. As I pointed out in my earlier post, it’s interesting to note that with only one or two exceptions, virtually all of Bem’s statistically significant results come very close to p = .05. That’s not what you expect to see when hypothesis testing is done in a really principled way, because it’s exceedingly unlikely that a researcher would be so lucky as to always just barely obtain the expected result. But a bunch of p = .03 and p = .048 results are exactly what you expect to find when researchers test multiple hypotheses and report only the ones that produce significant results. (For a sense of what such crude metrics might look like in practice, see the sketch right after this list.)
  • The presence or absence of certain terms or phrases is probably at least slightly predictive of the rigorousness of the article as a whole. For instance, the frequent use of phrases like “cross-validated”, “statistical power”, “corrected for multiple comparisons”, and “unbiased” is probably a good sign (though not necessarily a strong one); conversely, terms like “exploratory”, “marginal”, and “small sample” might provide at least some indication that the reported findings are, well, exploratory.
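Here’s a crude sketch of the first two metrics above. The inputs (per-study sample sizes and p values) and the thresholds (e.g., treating significant p values at or above .04 as ‘near-threshold’) are arbitrary choices of mine for illustration, not anything validated:

```python
import statistics

def peeking_index(sample_sizes):
    """Crude index of possible data peeking: relative spread of per-study Ns.
    High spread is only suggestive -- see the caveats discussed below."""
    return statistics.pstdev(sample_sizes) / statistics.mean(sample_sizes)

def near_threshold_rate(p_values, lo=0.04, hi=0.05):
    """Fraction of nominally significant p values that land just under .05."""
    sig = [p for p in p_values if p < hi]
    if not sig:
        return 0.0
    return sum(p >= lo for p in sig) / len(sig)

# Toy usage with made-up numbers (not Bem's actual values):
print(peeking_index([50, 200, 100, 150, 50, 200, 100, 150, 50]))
print(near_threshold_rate([0.048, 0.030, 0.009, 0.049, 0.041]))  # 0.6
```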

These are just the first examples that come to mind; you can probably think of other better ones. Of course, these would all be pretty weak indicators of paper (or researcher) quality, and none of them are in any sense unambiguous measures. There are all sorts of situations in which such numbers wouldn’t mean much of anything. For instance, high variance in sample sizes would be perfectly justifiable in a case where researchers were testing for effects expected to have very different sizes, or conducting different kinds of statistical tests (e.g., detecting interactions is much harder than detecting main effects, and so necessitates larger samples). Similarly, p values close to .05 aren’t necessarily a marker of data snooping and fishing expeditions; it’s conceivable that some researchers might be so good at what they do that they can consistently design experiments that just barely manage to show what they’re intended to (though it’s not very plausible). And a failure to use terms like “corrected”, “power”, and “cross-validated” in a paper doesn’t necessarily mean the authors failed to consider important methodological issues, since such issues aren’t necessarily relevant to every single paper. So there’s no question that you’d want to take these kinds of metrics with a giant lump of salt.

Still, there are several good reasons to think that even relatively flawed automated quality metrics could serve an important purpose. First, many of the problems could be overcome to some extent through aggregation. You might not want to conclude that a particular study was poorly done simply because most of the reported p values were very close to .05; but if you were to look at a researcher’s entire body of, say, thirty or forty published articles, and noticed the same trend relative to other researchers, you might start to wonder. Similarly, we could think about composite metrics that combine many different first-order metrics to generate a summary estimate of a paper’s quality that may not be so susceptible to contextual factors or noise. For instance, in the case of the Bem ESP article, a measure that took into account the variance in sample size across studies, the closeness of the reported p values to .05, the mention of terms like ‘one-tailed test’, and so on, would likely not have assigned Bem’s article a glowing score, even if each individual component of the measure was not very reliable.

Second, I’m not suggesting that crude automated metrics would replace current evaluation practices; rather, they’d be used strictly as a complement. Essentially, you’d have some additional numbers to look at, and you could choose to use them or not, as you saw fit, when evaluating a paper. If nothing else, they could help flag potential issues that reviewers might not be spontaneously attuned to. For instance, a report might note the fact that the term “interaction” was used several times in a paper in the absence of “main effect,” which might then cue a reviewer to ask, hey, why you no report main effects? — but only if they deemed it a relevant concern after looking at the issue more closely.

Third, automated metrics could be continually updated and improved using machine learning techniques. Given some criterion measure of research quality, one could systematically train and refine an algorithm capable of doing a decent job recapturing that criterion. Of course, it’s not clear that we really have any unobjectionable standard to use as a criterion in this kind of training exercise (which only underscores why it’s important to come up with better ways to evaluate scientific research). But a reasonable starting point might be to try to predict replication likelihood for a small set of well-studied effects based on the features of the original report. Could you for instance show, in an automated way, that initial effects reported in studies that failed to correct for multiple comparisons or reported p values closer to .05 were less likely to be subsequently replicated?
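For what it’s worth, the training exercise itself wouldn’t be hard to set up, assuming you already had a table of extracted features and a binary replication outcome for each original finding. Here’s a minimal sketch using logistic regression; the feature names and the data are placeholders, not real values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder feature matrix: one row per original finding. Columns:
# [proportion of reported p values in (.04, .05), corrected for multiple
#  comparisons (0/1), log of total sample size]
X = np.array([
    [0.8, 0, np.log(40)],
    [0.1, 1, np.log(250)],
    [0.6, 0, np.log(60)],
    [0.0, 1, np.log(500)],
    # ...many more rows in a real exercise
])
y = np.array([0, 1, 0, 1])  # 1 = the effect was subsequently replicated

model = LogisticRegression().fit(X, y)
print(dict(zip(["near_threshold_p", "corrected", "log_n"], model.coef_[0])))

# For a new paper's extracted features, the model returns a replication
# probability -- i.e., a crude automated quality score:
print(model.predict_proba([[0.7, 0, np.log(80)]])[0, 1])
```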

Of course, as always with this kind of stuff, the rub is that it’s easy to talk the talk and not so easy to walk the walk. In principle, we can make up all sorts of clever metrics, but in practice, it’s not trivial to automatically extract even a piece of information as seemingly simple as sample size from many papers (consider the difference between “Undergraduates (N = 15) participated…” and “Forty-two individuals diagnosed with depression and an equal number of healthy controls took part…”), let alone build sophisticated composite measures that could reasonably well approximate human judgments. It’s all well and good to write long blog posts about how fancy automated metrics could help separate good research from bad, but I’m pretty sure I don’t want to actually do any work to develop them, and you probably don’t either. Still, the potential benefits are clear, and it’s not like this is science fiction–it’s clearly viable on at least a modest scale. So someone should do it… Maybe Elsevier? Jorge Hirsch? Anyone? Bueller? Bueller?
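As a quick postscript on the extraction problem: here’s the kind of naive pattern one might start with. It handles the first phrasing in the parenthetical above just fine, and silently whiffs on the second–which is exactly the difficulty I’m pointing at (the regex is purely illustrative):

```python
import re

def extract_n(sentence):
    """Naive sample-size extractor: looks for an explicit 'N = <number>'."""
    match = re.search(r"\bN\s*=\s*(\d+)", sentence)
    return int(match.group(1)) if match else None

print(extract_n("Undergraduates (N = 15) participated in exchange for credit."))
# 15
print(extract_n("Forty-two individuals diagnosed with depression and an "
                "equal number of healthy controls took part..."))
# None -- spelled-out numbers and implicit totals defeat the simple pattern
```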

The psychology of parapsychology, or why good researchers publishing good articles in good journals can still get it totally wrong

Unless you’ve been pleasantly napping under a rock for the last couple of months, there’s a good chance you’ve heard about a forthcoming article in the Journal of Personality and Social Psychology (JPSP) purporting to provide strong evidence for the existence of some ESP-like phenomenon. (If you’ve been napping, see here, here, here, here, here, or this comprehensive list). In the article–appropriately titled Feeling the Future–Daryl Bem reports the results of 9 (yes, 9!) separate experiments that catch ordinary college students doing things they’re not supposed to be able to do–things like detecting the on-screen location of erotic images that haven’t actually been presented yet, or being primed by stimuli that won’t be displayed until after a response has already been made.

As you might expect, Bem’s article’s causing quite a stir in the scientific community. The controversy isn’t over whether or not ESP exists, mind you; scientists haven’t lost their collective senses, and most of us still take it as self-evident that college students just can’t peer into the future and determine where as-yet-unrevealed porn is going to soon be hidden (as handy as that ability might be). The real question on many people’s minds is: what went wrong? If there’s obviously no such thing as ESP, how could a leading social psychologist publish an article containing a seemingly huge amount of evidence in favor of ESP in the leading social psychology journal, after being peer reviewed by four other psychologists? Or, to put it in more colloquial terms–what the fuck?

What the fuck?

Many critiques of Bem’s article have tried to dismiss it by searching for the smoking gun–the single critical methodological flaw that dooms the paper. For instance, one critique that’s been making the rounds, by Wagenmakers et al, argues that Bem should have done a Bayesian analysis, and that his failure to adjust his findings for the infinitesimally low prior probability of ESP (essentially, the strength of subjective belief against ESP) means that the evidence for ESP is vastly overestimated. I think these types of argument have a kernel of truth, but also suffer from some problems (for the record, I don’t really agree with the Wagenmakers critique, for reasons Andrew Gelman has articulated here). Having read the paper pretty closely twice, I really don’t think there’s any single overwhelming flaw in Bem’s paper (actually, in many ways, it’s a nice paper). Instead, there are a lot of little problems that collectively add up to produce a conclusion you just can’t really trust. Below is a decidedly non-exhaustive list of some of these problems. I’ll warn you now that, unless you care about methodological minutiae, you’ll probably find this very boring reading. But that’s kind of the point: attending to this stuff is so boring that we tend not to do it, with potentially serious consequences. Anyway:

  • Bem reports 9 different studies, which sounds (and is!) impressive. But a noteworthy feature of these studies is that they have grossly uneven sample sizes, ranging all the way from N = 50 to N = 200, in blocks of 50. As far as I can tell, no justification for these differences is provided anywhere in the article, which raises red flags, because the most common explanation for differing sample sizes–especially on this order of magnitude–is data peeking. That is, what often happens is that researchers periodically peek at their data, and halt data collection as soon as they obtain a statistically significant result. This may seem like a harmless little foible, but as I’ve discussed elsewhere, is actually a very bad thing, as it can substantially inflate Type I error rates (i.e., false positives). To his credit, Bem was at least being systematic about his data peeking, since his sample sizes always increase in increments of 50. But even in steps of 50, false positives can be grossly inflated. For instance, for a one-sample t-test, a researcher who peeks at her data in increments of 50 subjects and terminates data collection when a significant result is obtained (or N = 200, if no such result is obtained) can expect an actual Type I error rate of about 13%–nearly 3 times the nominal rate of 5%! (A quick simulation of this scenario follows this list.)
  • There’s some reason to think that the 9 experiments Bem reports weren’t necessarily designed as such. Meaning that they appear to have been ‘lumped’ or ‘split’ post hoc based on the results. For instance, Experiment 2 had 150 subjects, but the experimental design for the first 100 differed from the final 50 in several respects. They were minor respects, to be sure (e.g., pictures were presented randomly in one study, but in a fixed sequence in the other), but were still comparable in scope to those that differentiated Experiment 8 from Experiment 9 (which had the same sample size splits of 100 and 50, but were presented as two separate experiments). There’s no obvious reason why a researcher would plan to run 150 subjects up front, then decide to change the design after 100 subjects, and still call it the same study. A more plausible explanation is that Experiment 2 was actually supposed to be two separate experiments (a successful first experiment with N = 100 followed by an intended replication with N = 50) that were collapsed into one large study when the second experiment failed–preserving the statistically significant result in the full sample. Needless to say, this kind of lumping and splitting is liable to additionally inflate the false positive rate.
  • Most of Bem’s experiments allow for multiple plausible hypotheses, and it’s rarely clear why Bem would have chosen, up front, the hypotheses he presents in the paper. For instance, in Experiment 1, Bem finds that college students are able to predict the future location of erotic images that haven’t yet been presented (essentially a form of precognition), yet show no ability to predict the location of negative, positive, or romantic pictures. Bem’s explanation for this selective result is that “… such anticipation would be evolutionarily advantageous for reproduction and survival if the organism could act instrumentally to approach erotic stimuli …”. But this seems kind of silly on several levels. For one thing, it’s really hard to imagine that there’s an adaptive benefit to keeping an eye out for potential mates, but not for other potential positive signals (represented by non-erotic positive images). For another, it’s not like we’re talking about actual people or events here; we’re talking about digital images on an LCD. What Bem is effectively saying is that, somehow, someway, our ancestors evolved the extrasensory capacity to read digital bits from the future–but only pornographic ones. Not very compelling, and one could easily have come up with a similar explanation in the event that any of the other picture categories had selectively produced statistically significant results. Of course, if you get to test 4 or 5 different categories at p < .05, and pretend that you called it ahead of time, your false positive rate isn’t really 5%–it’s closer to 20%.
  • I say p < .05, but really, it’s more like p < .1, because the vast majority of tests Bem reports use one-tailed tests–effectively instantaneously doubling the false positive rate. There’s a long-standing debate in the literature, going back at least 60 years, as to whether it’s ever appropriate to use one-tailed tests, but even proponents of one-tailed tests will concede that you should only use them if you really truly have a directional hypothesis in mind before you look at your data. That seems exceedingly unlikely in this case, at least for many of the hypotheses Bem reports testing.
  • Nearly all of Bem’s statistically significant p values are very close to the critical threshold of .05. That’s usually a marker of selection bias, particularly given the aforementioned unevenness of sample sizes. When experiments are conducted in a principled way (i.e., with minimal selection bias or peeking), researchers will often get very low p values, since it’s very difficult to know up front exactly how large effect sizes will be. But in Bem’s 9 experiments, he almost invariably collects just enough subjects to detect a statistically significant effect. There are really only two explanations for that: either Bem is (consciously or unconsciously) deciding what his hypotheses are based on which results attain significance (which is not good), or he’s actually a master of ESP himself, and is able to peer into the future and identify the critical sample size he’ll need in each experiment (which is great, but unlikely).
  • Some of the correlational effects Bem reports–e.g., that people with high stimulus seeking scores are better at ESP–appear to be based on measures constructed post hoc. For instance, Bem uses a non-standard, two-item measure of boredom susceptibility, with no real justification provided for this unusual item selection, and no reporting of results for the presumably many other items and questionnaires that were administered alongside these items (except to parenthetically note that some measures produced non-significant results and hence weren’t reported). Again, the ability to select from among different questionnaires–and to construct custom questionnaires from different combinations of items–can easily inflate Type I error.
  • It’s not entirely clear how many studies Bem ran. In the Discussion section, he notes that he could “identify three sets of findings omitted from this report so far that should be mentioned lest they continue to languish in the file drawer”, but it’s not clear from the description that follows exactly how many studies these “three sets of findings” comprised (or how many ‘pilot’ experiments were involved). What we’d really like to know is the exact number of (a) experiments and (b) subjects Bem ran, without qualification, and including all putative pilot sessions.
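Since the 13% figure in the first bullet above is easy to verify by simulation, here’s a quick sketch of that peeking scenario (one-sample t-test, true effect of exactly zero, looks at N = 50, 100, 150, and 200, stopping at the first p < .05). The exact number will wobble a bit with simulation error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, looks, alpha = 20_000, [50, 100, 150, 200], 0.05

false_positives = 0
for _ in range(n_sims):
    data = rng.standard_normal(looks[-1])  # the true effect is exactly zero
    for n in looks:
        # Peek after every 50 subjects; stop as soon as p < .05.
        if stats.ttest_1samp(data[:n], 0).pvalue < alpha:
            false_positives += 1
            break

print(false_positives / n_sims)  # ~0.13 rather than the nominal 0.05
```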

It’s important to note that none of these concerns is really terrible individually. Sure, it’s bad to peek at your data, but data peeking alone probably isn’t going to produce 9 different false positives. Nor is using one-tailed tests, or constructing measures on the fly, etc. But when you combine data peeking, liberal thresholds, study recombination, flexible hypotheses, and selective measures, you have a perfect recipe for spurious results. And the fact that there are 9 different studies isn’t any guard against false positives when fudging is at work; if anything, it may make it easier to produce a seemingly consistent story, because reviewers and readers have a natural tendency to relax the standards for each individual experiment. So when Bem argues that “…across all nine experiments, Stouffer’s z = 6.66, p = 1.34 × 10^-11,” the claim that the cumulative p value is 1.34 × 10^-11 is close to meaningless. Combining p values that way would only be appropriate under the assumption that Bem conducted exactly 9 tests, and without any influence of selection bias. But that’s clearly not the case here.
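If you want to see just how badly selection breaks this kind of aggregation, here’s a small simulation: run a pile of experiments under the null, keep only the ones that happen to come out ‘significant’ on a one-tailed test, and combine the survivors with Stouffer’s method as if they were the only tests ever run. The setup is obviously a caricature for illustration, not a claim about what Bem actually did:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Run small experiments under the null (no effect at all), using one-tailed
# one-sample t-tests, and keep only the ones that come out 'significant'.
p_values = []
while len(p_values) < 9:
    data = rng.standard_normal(100)
    p = stats.ttest_1samp(data, 0, alternative='greater').pvalue
    if p < 0.05:
        p_values.append(p)

# Stouffer's method: combine the surviving p values as if they were
# the only tests ever conducted.
z = stats.norm.isf(p_values).sum() / np.sqrt(len(p_values))
print(stats.norm.sf(z))  # a tiny combined p, despite a true effect of zero
```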

What would it take to make the results more convincing?

Admittedly, there are quite a few assumptions involved in the above analysis. I don’t know for a fact that Bem was peeking at his data; that just seems like a reasonable assumption given that no justification was provided anywhere for the use of uneven samples. It’s conceivable that Bem had perfectly good, totally principled, reasons for conducting the experiments exactly as he did. But if that’s the case, defusing these criticisms should be simple enough. All it would take for Bem to make me (and presumably many other people) feel much more comfortable with the results is an affirmation of the following statements:

  • That the sample sizes of the different experiments were determined a priori, and not based on data snooping;
  • That the distinction between pilot studies and ‘real’ studies was clearly defined up front–i.e., there weren’t any studies that started out as pilots but eventually ended up in the paper, or studies that were supposed to end up in the paper but that were disqualified as pilots based on the (lack of) results;
  • That there was a clear one-to-one mapping between intended studies and reported studies; i.e., Bem didn’t ‘lump’ together two different studies in cases where one produced no effect, or split one study into two in cases where different subsets of the data both showed an effect;
  • That the predictions reported in the paper were truly made a priori, and not on the basis of the results (e.g., that the hypothesis that sexually arousing stimuli would be the only ones to show an effect was actually written down in one of Bem’s notebooks somewhere);
  • That the various transformations applied to the RT and memory performance measures in some Experiments weren’t selected only after inspecting the raw, untransformed values and failing to identify significant results;
  • That the individual differences measures reported in the paper were selected a priori and not based on post-hoc inspection of the full pattern of correlations across studies;
  • That Bem didn’t run dozens of other statistical tests that failed to produce statistically significant results and hence weren’t reported in the paper.

Endorsing this list of statements (or perhaps a somewhat more complete version, as there are other concerns I didn’t mention here) would be sufficient to cast Bem’s results in an entirely new light, and I’d go so far as to say that I’d even be willing to suspend judgment on his conclusions pending additional data (which would be a big deal for me, since I don’t have a shred of a belief in ESP). But I confess that I’m not holding my breath, if only because I imagine that Bem would have already addressed these concerns in his paper if there were indeed principled justifications for the design choices in question.

It isn’t a bad paper

If you’ve read this far (why??), this might seem like a pretty damning review, and you might be thinking, boy, this is really a terrible paper. But I don’t think that’s true at all. In many ways, I think Bem’s actually been relatively careful. The thing to remember is that this type of fudging isn’t unusual; to the contrary, it’s rampant–everyone does it. And that’s because it’s very difficult, and often outright impossible, to avoid. The reality is that scientists are human, and like all humans, have a deep-seated tendency to work to confirm what they already believe. In Bem’s case, there are all sorts of reasons why someone who’s been working for the better part of a decade to demonstrate the existence of psychic phenomena isn’t necessarily the most objective judge of the relevant evidence. I don’t say that to impugn Bem’s motives in any way; I think the same is true of virtually all scientists–including myself. I’m pretty sure that if someone went over my own work with a fine-toothed comb, as I’ve gone over Bem’s above, they’d identify similar problems. Put differently, I don’t doubt that, despite my best efforts, I’ve reported some findings that aren’t true, because I wasn’t as careful as a completely disinterested observer would have been. That’s not to condone fudging, of course, but simply to recognize that it’s an inevitable reality in science, and it isn’t fair to hold Bem to a higher standard than we’d hold anyone else.

If you set aside the controversial nature of Bem’s research, and evaluate the quality of his paper purely on methodological grounds, I don’t think it’s any worse than the average paper published in JPSP, and actually probably better. For all of the concerns I raised above, there are many things Bem is careful to do that many other researchers don’t. For instance, he clearly makes at least a partial effort to avoid data peeking by collecting samples in increments of 50 subjects (I suspect he simply underestimated the degree to which Type I error rates can be inflated by peeking, even with steps that large); he corrects for multiple comparisons in many places (though not in some places where it matters); and he devotes an entire section of the discussion to considering the possibility that he might be inadvertently capitalizing on chance by falling prey to certain biases. Most studies–including most of those published in JPSP, the premier social psychology journal–don’t do any of these things, even though the underlying problems are just as applicable. So while you can confidently conclude that Bem’s article is wrong, I don’t think it’s fair to say that it’s a bad article–at least, not by the standards that currently hold in much of psychology.

Should the study have been published?

Interestingly, much of the scientific debate surrounding Bem’s article has actually had very little to do with the veracity of the reported findings, because the vast majority of scientists take it for granted that ESP is bunk. Much of the debate centers instead on whether the article should have ever been published in a journal as prestigious as JPSP (or any other peer-reviewed journal, for that matter). For the most part, I think the answer is yes. I don’t think it’s the place of editors and reviewers to reject a paper based solely on the desirability of its conclusions; if we take the scientific method–and the process of peer review–seriously, that commits us to occasionally (or even frequently) publishing work that we believe time will eventually prove wrong. The metrics I think reviewers should (and do) use are whether (a) the paper is as good as most of the papers that get published in the journal in question, and (b) the methods used live up to the standards of the field. I think that’s true in this case, so I don’t fault the editorial decision. Of course, it sucks to see something published that’s virtually certain to be false… but that’s the price we pay for doing science. As long as their proponents play by the rules, we have to engage with even patently ridiculous views, because sometimes (though very rarely) it later turns out that those views weren’t so ridiculous after all.

That said, believing that it’s appropriate to publish Bem’s article given current publishing standards doesn’t preclude us from questioning those standards themselves. On a pretty basic level, the idea that Bem’s article might be par for the course, quality-wise, yet still be completely and utterly wrong, should surely raise some uncomfortable questions about whether psychology journals are getting the balance between scientific novelty and methodological rigor right. I think that’s a complicated issue, and I’m not going to try to tackle it here, though I will say that personally I do think that more stringent standards would be a good thing for psychology, on the whole. (It’s worth pointing out that the problem of (arguably) lax standards is hardly unique to psychology; as John Ioannidis has famously pointed out, most published findings in the biomedical sciences are false.)

Conclusion

The controversy surrounding the Bem paper is fascinating for many reasons, but it’s arguably most instructive in underscoring the central tension in scientific publishing between rapid discovery and innovation on the one hand, and methodological rigor and cautiousness on the other. Both values are important, but it’s important to recognize the tradeoff that pursuing either one implies. Many of the people who are now complaining that JPSP should never have published Bem’s article seem to overlook the fact that they’ve probably benefited themselves from the prevalence of the same relaxed standards (note that by ‘relaxed’ I don’t mean to suggest that journals like JPSP are non-selective about what they publish, just that methodological rigor is only one among many selection criteria–and often not the most important one). Conversely, maintaining editorial standards that would have precluded Bem’s article from being published would almost certainly also make it much more difficult to publish most other, much less controversial, findings. A world in which fewer spurious results are published is a world in which fewer studies are published, period. You can reasonably debate whether that would be a good or bad thing, but you can’t have it both ways. It’s wishful thinking to imagine that reviewers could somehow grow a magic truth-o-meter that applies lax standards to veridical findings and stringent ones to false positives.

From a bird’s eye view, there’s something undeniably strange about the idea that a well-respected, relatively careful researcher could publish an above-average article in a top psychology journal, yet have virtually everyone instantly recognize that the reported findings are totally, irredeemably false. You could read that as a sign that something’s gone horribly wrong somewhere in the machine; that the reviewers and editors of academic journals have fallen down and can’t get up, or that there’s something deeply flawed about the way scientists–or at least psychologists–practice their trade. But I think that’s wrong. I think we can look at it much more optimistically. We can actually see it as a testament to the success and self-corrective nature of the scientific enterprise that we actually allow articles that virtually nobody agrees with to get published. And that’s because, as scientists, we take seriously the possibility, however vanishingly small, that we might be wrong about even our strongest beliefs. Most of us don’t really believe that Cornell undergraduates have a sixth sense for future porn… but if they did, wouldn’t you want to know about it?

Bem, D. J. (2011). Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. Journal of Personality and Social Psychology.

no one really cares about anything-but-zero

Tangentially related to the last post, Games With Words has a post up soliciting opinions about the merit of effect sizes. The impetus is a discussion we had in the comments on his last post about Jonah Lehrer’s New Yorker article. It started with an obnoxious comment (mine, of course) and then rapidly devolved into a civil debate about the importance (or lack thereof) of effect sizes in psychology. What I argued is that consideration of effect sizes is absolutely central to most everything psychologists do, even if that consideration is usually implicit rather than explicit. GWW thinks effect sizes aren’t that important, or at least, don’t have to be.

The basic observation in support of thinking in terms of effect sizes rather than (or in addition to) p values is simply that the null hypothesis is nearly always false. (I think I said “always” in the comments, but I can live with “nearly always”). There are exceedingly few testable associations between two or more variables that could plausibly have an effect size of exactly zero. Which means that if all you care about is rejecting the null hypothesis by reaching p < .05, all you really need to do is keep collecting data–you will get there eventually.
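If you want to put a number on ‘eventually’, a quick power calculation does the trick. The d = 0.02 below is just an arbitrary stand-in for ‘tiny but not exactly zero’; the sketch uses statsmodels’ power routines:

```python
from statsmodels.stats.power import TTestPower

tiny_effect = 0.02  # arbitrary stand-in for "not exactly zero"
analysis = TTestPower()

# How many subjects until a one-sample t-test detects d = 0.02 with 80% power?
n_needed = analysis.solve_power(effect_size=tiny_effect, alpha=0.05, power=0.8)
print(round(n_needed))  # on the order of 20,000 subjects

# Power keeps climbing toward 1 as N grows -- p < .05 is just a matter of time.
for n in (1_000, 10_000, 50_000, 200_000):
    print(n, round(analysis.power(effect_size=tiny_effect, nobs=n, alpha=0.05), 3))
```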

I don’t think this is a controversial point, and my sense is that it’s the received wisdom among (most) statisticians. That doesn’t mean that the hypothesis testing framework isn’t useful, just that it’s fundamentally rooted in ideas that turn out to be kind of silly upon examination. (For the record, I use significance tests all the time in my own work, and do all sorts of other things I know on some level to be silly, so I’m not saying that we should abandon hypothesis testing wholesale).

Anyway, GWW’s argument is that, at least in some areas of psychology, people don’t really care about effect sizes, and simply want to know if there’s a real effect or not. I disagree for at least two reasons. First, when people say they don’t care about effect sizes, I think what they really mean is that they don’t feel a need to explicitly think about effect sizes, because they can just rely on a decision criterion of p < .05 to determine whether or not an effect is ‘real’. The problem is that, since the null hypothesis is always false (i.e., effects are never exactly zero in the population), if we just keep collecting data, eventually all effects become statistically significant, rendering the decision criterion completely useless. At that point, we’d presumably have to rely on effect sizes to decide what’s important. So it may look like you can get away without considering effect sizes, but that’s only because, for the kind of sample sizes we usually work with, p values basically end up being (poor) proxies for effect sizes.

Second, I think it’s simply not true that we care about any effect at all. GWW makes a seemingly reasonable suggestion that even if it’s not sensible to care about a null of exactly zero, it’s quite sensible to care about nothing but the direction of an effect. But I don’t think that really works either. The problem is that, in practice, we don’t really just care about the direction of the effect; we also want to know that it’s meaningfully large (where ‘meaningfully’ is intentionally vague, and can vary from person to person or question to question). GWW gives a priming example: if a theoretical model predicts the presence of a priming effect, isn’t it enough just to demonstrate a statistically significant priming effect in the predicted direction? Does it really matter how big the effect is?

Yes. To see this, suppose that I go out and collect priming data online from 100,000 subjects, and happily reject the null at p < .05 based on a priming effect of a quarter of a millisecond (where the mean response time is, say, on the order of a second). Does that result really provide any useful support for my theory, just because I was able to reject the null? Surely not. For one thing, a quarter of a millisecond is so tiny that any reviewer worth his or her salt is going to point out that any number of confounding factors could be responsible for that tiny association. An effect that small is essentially uninterpretable. But there is, presumably, some minimum size for every putative effect which would lead us to say: “okay, that’s interesting. It’s a pretty small effect, but I can’t just dismiss it out of hand, because it’s big enough that it can’t be attributed to utterly trivial confounds.” So yes, we do care about effect sizes.

The problem, of course, is that what constitutes a ‘meaningful’ effect is largely subjective. No doubt that’s why null hypothesis testing holds such an appeal for most of us (myself included)–it may be silly, but it’s at least objectively silly. It doesn’t require you to put your subjective beliefs down on paper. Still, at the end of the day, that apprehensiveness we feel about it doesn’t change the fact that you can’t get away from consideration of effect sizes, whether explicitly or implicitly. Saying that you don’t care about effect sizes doesn’t actually make it so; it just means that you’re implicitly saying that you literally care about any effect that isn’t exactly zero–which is, on its face, absurd. Had you picked any other null to test against (e.g., a range of standardized effect sizes between -0.1 and 0.1), you wouldn’t have that problem.

To reiterate, I’m emphatically not saying that anyone who doesn’t explicitly report, or even think about, effect sizes when running a study is doing something terribly wrong. I think it’s a very good idea to (a) run power calculations before starting a study, (b) frequently pause to reflect on what kinds of effects one considers big enough to be worth pursuing, and (c) report effect size measures and confidence intervals for all key tests in one’s papers. But I’m certainly not suggesting that if you don’t do these things, you’re a bad person, or even a bad researcher. All I’m saying is that the importance of effect sizes doesn’t go away just because you’re not thinking about them. A decision about what constitutes a meaningful effect size is made every single time you test your data against the null hypothesis; so you may as well be the one making that decision explicitly, instead of having it done for you implicitly in a silly way. No one really cares about anything-but-zero.
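For anyone who wants the recipe, (a) and (c) take only a few lines of code. This is just a sketch–the d = 0.5 target and the simulated data are placeholders, and the confidence interval uses the standard large-sample approximation for the standard error of d:

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

# (a) A priori power analysis: subjects per group needed to detect a
# "medium" effect (d = 0.5) with 80% power in a two-group design.
print(TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8))
# ~64 per group

# (c) Report the effect size and a confidence interval, not just p.
def cohens_d(x, y):
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

rng = np.random.default_rng(2)
x, y = rng.normal(0.4, 1, 64), rng.normal(0.0, 1, 64)
d = cohens_d(x, y)
n_total = len(x) + len(y)
se_d = np.sqrt(n_total / (len(x) * len(y)) + d**2 / (2 * n_total))
print(d, (d - 1.96 * se_d, d + 1.96 * se_d))  # approximate 95% CI for d
```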

the ‘decline effect’ doesn’t work that way

Over the last four or five years, there’s been a growing awareness in the scientific community that science is an imperfect process. Not that everyone used to think science was a crystal ball with a direct line to the universe or anything, but there does seem to be a growing recognition that scientists are human beings with human flaws, and are susceptible to common biases that can make it more difficult to fully trust any single finding reported in the literature. For instance, scientists like interesting results more than boring results; we’d rather keep our jobs than lose them; and we have a tendency to see what we want to see, even when it’s only sort-of-kind-of there, and sometimes not there at all. All of these things conspire to produce systematic biases in the kinds of findings that get reported.

The single biggest contributor to this zeitgeist shift is undoubtedly John Ioannidis (recently profiled in an excellent Atlantic article), whose work I can’t say enough good things about (though I’ve tried). But lots of other people have had a hand in popularizing the same or similar ideas–many of which actually go back several decades. I’ve written a bit about these issues myself in a number of papers (1, 2, 3) and blog posts (1, 2, 3, 4, 5), so I’m partial to such concerns. Still, important as the role of the various selection and publication biases is in charting the course of science, virtually all of the discussions of these issues have had a relatively limited audience. Even Ioannidis’ work, influential as it’s been, has probably been read by no more than a few thousand scientists.

Last week, the debate hit the mainstream when the New Yorker (circulation: ~ 1 million) published an article by Jonah Lehrer suggesting–or at least strongly raising the possibility–that something might be wrong with the scientific method. The full article is behind a paywall, but I can helpfully tell you that some people seem to have un-paywalled it against the New Yorker’s wishes, so if you search for it online, you will find it.

The crux of Lehrer’s argument is that many, and perhaps most, scientific findings fall prey to something called the “decline effect”: initial positive reports of relatively large effects are subsequently followed by gradually decreasing effect sizes, in some cases culminating in a complete absence of an effect in the largest, most recent studies. Lehrer gives a number of colorful anecdotes illustrating this process, and ends on a decidedly skeptical (and frankly, terribly misleading) note:

The decline effect is troubling because it reminds us how difficult it is to prove anything. We like to pretend that our experiments define the truth for us. But that’s often not the case. Just because an idea is true doesn’t mean it can be proved. And just because an idea can be proved doesn’t mean it’s true. When the experiments are done, we still have to choose what to believe.

While Lehrer’s article received pretty positive reviews from many non-scientist bloggers (many of whom, dismayingly, seemed to think the take-home message was that since scientists always change their minds, we shouldn’t trust anything they say), science bloggers were generally not very happy with it. Within days, angry mobs of Scientopians and Nature Networkers started murdering unicorns; by the end of the week, the New Yorker offices were reduced to rubble, and the scientists and statisticians who’d given Lehrer quotes were all rumored to be in hiding.

Okay, none of that happened. I’m just trying to keep things interesting. Anyway, because I’ve been characteristically slow on the uptake, by the time I got around to writing this post you’re now reading, about eight hundred and sixty thousand bloggers had already weighed in on Lehrer’s article. That’s good, because it means I can just direct you to other people’s blogs instead of having to do any thinking myself. So here you go: good posts by Games With Words (whose post tipped me off to the article), Jerry Coyne, Steven Novella, Charlie Petit, and Andrew Gelman, among many others.

Since I’ve blogged about these issues before, and agree with most of what’s been said elsewhere, I’ll only make one point about the article. Which is that about half of the examples Lehrer talks about don’t actually seem to me to qualify as instances of the decline effect–at least as Lehrer defines it. The best example of this comes when Lehrer discusses Jonathan Schooler’s attempt to demonstrate the existence of the decline effect by running a series of ESP experiments:

In 2004, Schooler embarked on an ironic imitation of Rhine’s research: he tried to replicate this failure to replicate. In homage to Rhine’s interests, he decided to test for a parapsychological phenomenon known as precognition. The experiment itself was straightforward: he flashed a set of images to a subject and asked him or her to identify each one. Most of the time, the response was negative—the images were displayed too quickly to register. Then Schooler randomly selected half of the images to be shown again. What he wanted to know was whether the images that got a second showing were more likely to have been identified the first time around. Could subsequent exposure have somehow influenced the initial results? Could the effect become the cause?

The craziness of the hypothesis was the point: Schooler knows that precognition lacks a scientific explanation. But he wasn’t testing extrasensory powers; he was testing the decline effect. “At first, the data looked amazing, just as we’d expected,” Schooler says. “I couldn’t believe the amount of precognition we were finding. But then, as we kept on running subjects, the effect size”–a standard statistical measure–“kept on getting smaller and smaller.” The scientists eventually tested more than two thousand undergraduates. “In the end, our results looked just like Rhine’s,” Schooler said. “We found this strong paranormal effect, but it disappeared on us.”

This is a pretty bad way to describe what’s going on, because it makes it sound like it’s a general principle of data collection that effects systematically get smaller. It isn’t. The variance around the point estimate of effect size certainly gets smaller as samples get larger, but the likelihood of an effect increasing is just as high as the likelihood of it decreasing. The absolutely critical point Lehrer left out is that you would only get the decline effect to show up if you intervened in the data collection or reporting process based on the results you were getting. Instead, most of Lehrer’s article presents the decline effect as if it’s some sort of mystery, rather than the well-understood process that it is. It’s as though Lehrer believes that scientific data has the magical property of telling you less about the world the more of it you have. Which isn’t true, of course; the problem isn’t that science is malfunctioning, it’s that scientists are still (kind of!) human, and are susceptible to typical human biases. The unfortunate net effect is that Lehrer’s article, while tremendously entertaining, achieves exactly the opposite of what good science journalism should do: it sows confusion about the scientific process and makes it easier for people to dismiss the results of good scientific work, instead of helping people develop a critical appreciation for the amazing power science has to tell us about the world.
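To put a finer point on that ‘intervening in the data collection or reporting process’ bit, here’s a toy simulation: hold the true effect fixed, generate lots of small initial studies, ‘publish’ only the ones that hit p < .05, and then run larger unselected follow-ups. The initial published estimates come out inflated and the follow-ups regress back to the truth–which looks exactly like a ‘decline’, even though nothing about the underlying effect changed. The particular effect size and sample sizes below are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_d, n_initial, n_followup, n_sims = 0.3, 20, 100, 5_000

published_initial, followups = [], []
for _ in range(n_sims):
    first = rng.normal(true_d, 1, n_initial)
    if stats.ttest_1samp(first, 0).pvalue < 0.05:      # the selection step
        published_initial.append(first.mean())
        second = rng.normal(true_d, 1, n_followup)     # unselected replication
        followups.append(second.mean())

print(np.mean(published_initial))  # inflated, well above 0.3
print(np.mean(followups))          # ~0.3 -- the 'decline' is just bias washing out
```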

trouble with biomarkers and press releases

The latest issue of the Journal of Neuroscience contains an interesting article by Ecker et al in which the authors attempted to classify people with autism spectrum disorder (ASD) and healthy controls based on their brain anatomy, and report achieving “a sensitivity and specificity of up to 90% and 80%, respectively.” Before unpacking what that means, and why you probably shouldn’t get too excited (about the clinical implications, at any rate; the science is pretty cool), here’s a snippet from the decidedly optimistic press release that accompanied the study:

“Scientists funded by the Medical Research Council (MRC) have developed a pioneering new method of diagnosing autism in adults. For the first time, a quick brain scan that takes just 15 minutes can identify adults with autism with over 90% accuracy. The method could lead to the screening for autism spectrum disorders in children in the future.”

If you think this sounds too good to be true, that’s because it is. Carl Heneghan explains why in an excellent article in the Guardian:

How the brain scans results are portrayed is one of the simplest mistakes in interpreting diagnostic test accuracy to make. What has happened is, the sensitivity has been taken to be the positive predictive value, which is what you want to know: if I have a positive test do I have the disease? Not, if I have the disease, do I have a positive test? It would help if the results included a measure called the likelihood ratio (LR), which is the likelihood that a given test result would be expected in a patient with the target disorder compared to the likelihood that the same result would be expected in a patient without that disorder. In this case the LR is 4.5. We’ve put up an article if you want to know more on how to calculate the LR.

In the general population the prevalence of autism is 1 in 100; the actual chances of having the disease are 4.5 times more likely given a positive test. This gives a positive predictive value of 4.5%; about 5 in every 100 with a positive test would have autism.

For those still feeling confused and not convinced, let’s think of 10,000 children. Of these 100 (1%) will have autism, 90 of these 100 would have a positive test, 10 are missed as they have a negative test: there’s the 90% reported accuracy by the media.

But what about the 9,900 who don’t have the disease? 7,920 of these will test negative (the specificity in the Ecker paper is 80%). But, the real worry though, is the numbers without the disease who test positive. This will be substantial: 1,980 of the 9,900 without the disease. This is what happens at very low prevalences, the numbers falsely misdiagnosed rockets. Alarmingly, of the 2,070 with a positive test, only 90 will have the disease, which is roughly 4.5%.

In other words, if you screened everyone in the population for autism, and assume the best about the classifier reported in the JNeuro article (e.g., that the sample of 20 ASD participants they used is perfectly representative of the broader ASD population, which seems unlikely), only about 1 in 20 people who receive a positive diagnosis would actually deserve one.
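Heneghan’s arithmetic is easy to check for yourself. Here’s the calculation, using the paper’s reported 90% sensitivity and 80% specificity; the 30% pre-test probability case that comes up a bit later in the post is included for comparison:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disorder | positive test), via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Screening the general population (prevalence ~1%):
print(positive_predictive_value(0.90, 0.80, 0.01))   # ~0.043 -- Heneghan's ~4.5%
# A much more enriched referral sample (pre-test probability of 30%):
print(positive_predictive_value(0.90, 0.80, 0.30))   # ~0.66
```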

Ecker et al object to this characterization, and reply to Heneghan in the comments (through the MRC PR office):

Our test was never designed to screen the entire population of the UK. This is simply not practical in terms of costs and effort, and besides totally  unjustified- why would we screen everybody in the UK for autism if there is no evidence whatsoever that an individual is affected?. The same case applies to other diagnostic tests. Not every single individual in the UK is tested for HIV. Clearly this would be too costly and unnecessary. However, in the group of individuals that are test for the virus, we can be very confident that if the test is positive that means a patient is infected. The same goes for our approach.

Essentially, the argument is that, since people would presumably be sent for an MRI scan because they were already under consideration for an ASD diagnosis, and not at random, the proportion of positive results that are false would in fact be much lower than the ~95% implied above, and closer to the 20% false positive rate (i.e., 1 – specificity) reported in the article.

One response to this reply–which is in fact Heneghan’s response in the comments–is to point out that the pre-test probability of ASD would need to be pretty high already in order for the classifier to add much. For instance, even if fully 30% of people who were sent for a scan actually had ASD, the posterior probability of ASD given a positive result would still be only 66% (Heneghan’s numbers, which I haven’t checked). Heneghan nicely contrasts these results with the standard for HIV testing, which “reports sensitivity of 99.7% and specificity of 98.5% for enzyme immunoassay.” Clearly, we have a long way to go before doctors can order MRI-based tests for ASD and feel reasonably confident that a positive result is sufficient grounds for an ASD diagnosis.

Setting Heneghan’s concerns about base rates aside, there’s a more general issue that he doesn’t touch on. It’s one that’s not specific to this particular study, and applies to nearly all studies that attempt to develop “biomarkers” for existing disorders. The problem is that the sensitivity and specificity values that people report for their new diagnostic procedure in these types of studies generally aren’t the true parameters of the procedure. Rather, they’re the sensitivity and specificity under the assumption that the diagnostic procedures used to classify patients and controls in the first place are themselves correct. In other words, in order to believe the results, you have to assume that the researchers correctly classified the subjects into patient and control groups using other procedures. In cases where the gold standard test used to make the initial classification is known to have near 100% sensitivity and specificity (e.g., for the aforementioned HIV tests), one can reasonably ignore this concern. But when we’re talking about mental health disorders, where diagnoses are fuzzy and borderline cases abound, it’s very likely that the “gold standard” isn’t really all that great to begin with.

Concretely, studies that attempt to develop biomarkers for mental health disorders face two substantial problems. One is that it’s extremely unlikely that the clinical diagnoses are ever perfect; after all, if they were perfect, there’d be little point in trying to develop other diagnostic procedures! In this particular case, the authors selected subjects into the ASD group based on standard clinical instruments and structured interviews. I don’t know that there are many clinicians who’d claim with a straight face that the current diagnostic criteria for ASD (and there are multiple sets to choose from!) are perfect. From my limited knowledge, the criteria for ASD seem to be even more controversial than those for most other mental health disorders (which is saying something, if you’ve been following the ongoing DSM-V saga). So really, the accuracy of the classifier in the present study, even if you put the best face on it and ignore the base rate issue Heneghan brings up, is undoubtedly south of the 90% sensitivity / 80% specificity the authors report. How much south, we just don’t know, because we don’t really have any independent, objective way to determine who “really” should get an ASD diagnosis and who shouldn’t (assuming you think it makes sense to make that kind of dichotomous distinction at all). But 90% accuracy is probably a pipe dream, if for no other reason than it’s hard to imagine that level of consensus about autism spectrum diagnoses.

The second problem is that, because the researchers are using the MRI-based classifier to predict the clinician-based diagnosis, it simply isn’t possible for the former to exceed the accuracy of the latter. That bears repeating, because it’s important: no matter how good the MRI-based classifier is, it can only be as good as the procedures used to make the original diagnosis, and no better. It cannot, by definition, make diagnoses that are any more accurate than the clinicians who screened the participants in the authors’ ASD sample. So when you see the press release say this:

For the first time, a quick brain scan that takes just 15 minutes can identify adults with autism with over 90% accuracy.

You should really read it as this:

The method relies on structural (MRI) brain scans and has an accuracy rate approaching that of conventional clinical diagnosis.

That’s not quite as exciting, obviously, but it’s more accurate.

To be fair, there’s something of a catch-22 here, in that the authors didn’t really have a choice about whether or not to diagnose the ASD group using conventional criteria. If they hadn’t, reviewers and other researchers would have complained that we can’t tell if the ASD group is really an ASD group, because the authors used non-standard criteria. Under the circumstances, they did the only thing they could do. But that doesn’t change the fact that it’s misleading to intimate, as the press release does, that the new procedure might be any better than the old ones. It can’t be, by definition.

Ultimately, if we want to develop brain-based diagnostic tools that are more accurate than conventional clinical diagnoses, we’re going to need to show that these tools are capable of predicting meaningful outcomes that clinician diagnoses can’t. This isn’t an impossible task, but it’s a very difficult one. One approach you could take, for instance, would be to compare the ability of clinician diagnosis and MRI-based diagnosis to predict functional outcomes among subjects at a later point in time. If you could show that MRI-based classification of subjects at an early age was a stronger predictor of receiving an ASD diagnosis later in life than conventional criteria, that would make a really strong case for using the former approach in the real world. Short of that type of demonstration though, the only reason I can imagine wanting to use a procedure that was developed by trying to duplicate the results of an existing procedure is in the event that the new procedure is substantially cheaper or more efficient than the old one. Meaning, it would be reasonable enough to say “well, look, we don’t do quite as well with this approach as we do with a full clinical evaluation, but at least this new approach costs much less.” Unfortunately, that’s not really true in this case, since the price of even a short MRI scan is generally going to outweigh that of a comprehensive evaluation by a psychiatrist or psychotherapist. And while it could theoretically be much faster to get an MRI scan than an appointment with a mental health professional, I suspect that that’s not generally going to be true in practice either.

Having said all that, I hasten to note that all this is really a critique of the MRC press release and subsequent lousy science reporting, and not of the science itself. I actually think the science itself is very cool (the Neuroskeptic just wrote a great rundown of the methods and results, so there’s not much point in me describing them here). People have been doing really interesting work with pattern-based classifiers for several years now in the neuroimaging literature, but relatively few studies have applied this kind of technique to try and discriminate between different groups of individuals in a clinical setting. While I’m not really optimistic that the technique the authors introduce in this paper is going to change the way diagnosis happens any time soon (or at least, I’d argue that it shouldn’t), there’s no question that the general approach will be an important piece of future efforts to improve clinical diagnoses by integrating biological data with existing approaches. But that’s not going to happen overnight, and in the meantime, I think it’s pretty irresponsible of the MRC to be issuing press releases claiming that its researchers can diagnose autism in adults with 90% accuracy.

Ecker C, Marquand A, Mourão-Miranda J, Johnston P, Daly EM, Brammer MJ, Maltezos S, Murphy CM, Robertson D, Williams SC, & Murphy DG (2010). Describing the brain in autism in five dimensions–magnetic resonance imaging-assisted diagnosis of autism spectrum disorder using a multiparameter classification approach. The Journal of Neuroscience, 30(32), 10612-10623. PMID: 20702694

fourteen questions about selection bias, circularity, nonindependence, etc.

A new paper published online this week in the Journal of Cerebral Blood Flow & Metabolism discusses the infamous problem of circular analysis in fMRI research. The paper is aptly titled “Everything you never wanted to know about circular analysis, but were afraid to ask,” and is authored by several well-known biostatisticians and cognitive neuroscientists–to wit, Niko Kriegeskorte, Martin Lindquist, Tom Nichols, Russ Poldrack, and Ed Vul. The paper has an interesting format, and one that I really like: it’s set up as a series of fourteen questions related to circular analysis, and each author answers each question in 100 words or less.

I won’t bother going over the gist of the paper, because the Neuroskeptic already beat me to the punch in an excellent post a couple of days ago (actually, that’s how I found out about the paper); instead, I’ll just give my own answers to the same set of questions raised in it. And since blog posts don’t have the same length constraints as NPG journals, I’m going to be characteristically long-winded and ignore the 100 word limit…

(1) Is circular analysis a problem in systems and cognitive neuroscience?

Yes, it’s a huge problem. That said, I think the term ‘circular’ is somewhat misleading here, because it has the connotation that an analysis is completely vacuous. Truly circular analyses–i.e., those where an initial analysis is performed, and the researchers then conduct a “follow-up” analysis that literally adds no new information–are relatively rare in fMRI research. Much more common are cases where there’s some dependency between two different analyses, but the second one still adds some novel information.

(2) How widespread are slight distortions and serious errors caused by circularity in the neuroscience literature?

I think Nichols sums it up nicely here:

TN: False positives due to circularity are minimal; biased estimates of effect size are common. False positives due to brushing off the multiple testing problem (e.g., ‘P<0.001 uncorrected’ and crossing your fingers) remain pervasive.

The only thing I’d add to this is that the bias in effect size estimates is not only common, but, in most cases, is probably very large.

(3) Are circular estimates useful measures of effect size?

Yes and no. They’re less useful than unbiased measures of effect size. But given that the vast majority of effects reported in whole-brain fMRI analyses (and, more generally, analyses in most fields) are likely to be inflated to some extent, the only way to ensure we don’t rely on circular estimates of effect size would be to disregard effect size estimates entirely, which doesn’t seem prudent.

(4) Should circular estimates of effect size be presented in papers and, if so, how?

Yes, because the only principled alternatives are to either (a) never report effect sizes (which seems much too drastic), or (b) report the results of every single test performed, irrespective of the result (i.e., to never give selection bias an opportunity to rear its head). Neither of these is reasonable. We should generally report effect sizes for all key effects, but they should be accompanied by appropriate confidence intervals. As Lindquist notes:

In general, it may be useful to present any effect size estimate as confidence intervals, so that readers can see for themselves how much uncertainty is related to the point estimate.

A key point I’d add is that the width of the reported CIs should match the threshold used to identify results in the first place. In other words, if you conduct a whole brain analysis at p < .001, you should report all resulting effects with 99.9% CIs, and not 95% CIs. I think this simple step would go a considerable ways towards conveying the true uncertainty surrounding most point estimates in fMRI studies.
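To make that concrete, here's a minimal sketch of one way to do it for a correlation, using the standard Fisher r-to-z approximation (the r and n values are hypothetical):

```r
# 99.9% CI for a correlation, to match a p < .001 selection threshold
# (hypothetical numbers: r = 0.6 observed in a whole-brain analysis with n = 20)
r <- 0.6
n <- 20
alpha <- 0.001

z    <- atanh(r)                 # Fisher r-to-z transform
se   <- 1 / sqrt(n - 3)          # approximate standard error of z
crit <- qnorm(1 - alpha / 2)     # ~3.29 for a two-sided 99.9% interval

tanh(z + c(-1, 1) * crit * se)   # back-transformed CI: roughly (-0.10, 0.90)
```

For comparison, the conventional 95% interval for the same numbers is roughly (0.21, 0.82); the wider band gives a much more honest picture of the uncertainty attached to an effect that was selected at p < .001.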

(5) Are effect size estimates important/useful for neuroscience research, and why?

I think my view here is closest to Ed Vul’s:

Yes, very much so. Null-hypothesis testing is insufficient for most goals of neuroscience because it can only indicate that a brain region is involved to some nonzero degree in some task contrast. This is likely to be true of most combinations of task contrasts and brain regions when measured with sufficient power.

I’d go further than Ed does though, and say that in a sense, effect size estimates are the only things that matter. As Ed notes, there are few if any cases where it’s plausible to suppose that the effect of some manipulation on brain activation is really zero. The brain is a very dense causal system–almost any change in one variable is going to have downstream effects on many, and perhaps most, others. So the real question we care about is almost never “is there or isn’t there an effect,” it’s whether there’s an effect that’s big enough to actually care about. (This problem isn’t specific to fMRI research, of course; it’s been a persistent source of criticism of null hypothesis significance testing for many decades.)

People sometimes try to deflect this concern by saying that they’re not trying to make any claims about how big an effect is, but only about whether or not one can reject the null–i.e., whether any kind of effect is present or not. I’ve never found this argument convincing, because whether or not you own up to it, you’re always making an effect size claim whenever you conduct a hypothesis test. Testing against a null of zero is equivalent to saying that you care about any effect that isn’t exactly zero, which is simply false. No one in fMRI research cares about r or d values of 0.0001, yet we routinely conduct tests whose results could be consistent with those types of effect sizes.

Since we’re always making implicit claims about effect sizes when we conduct hypothesis tests, we may as well make them explicit so that they can be evaluated properly. If you only care about correlations greater than 0.1, there’s no sense in hiding that fact; why not explicitly test against a null range of -0.1 to 0.1, instead of a meaningless null of zero?
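Here's a rough sketch of what such a minimum-effect test might look like for a correlation, again using the Fisher approximation (the observed r, the n, and the 0.1 cutoff are all hypothetical choices of mine, not a standard package routine):

```r
# Minimum-effect test: is |rho| credibly greater than 0.1?
# (hypothetical data: observed r = 0.35 with n = 100; 0.1 is the smallest
# correlation we have decided we actually care about)
r <- 0.35
n <- 100
rho_min <- 0.1

z_obs <- atanh(r)
se    <- 1 / sqrt(n - 3)

# one-sided test against the nearer boundary of the null range [-0.1, 0.1],
# in the direction of the observed effect
z_bound <- atanh(sign(r) * rho_min)
pnorm(abs(z_obs - z_bound) / se, lower.tail = FALSE)   # ~0.005

# for comparison, the conventional two-sided test against rho = 0
2 * pnorm(abs(z_obs) / se, lower.tail = FALSE)         # ~0.0003
```

The point isn't this particular recipe; it's that the null you test against should reflect the smallest effect you'd actually care about.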

(6) What is the best way to accurately estimate effect sizes from imaging data?

Use large samples, conduct multivariate analyses, report results comprehensively, use meta-analysis… I don’t think there’s any single way to ensure accurate effect size estimates, but plenty of things help. Maybe the most general recommendation is to ensure adequate power (see below), which will naturally minimize effect size inflation.

(7) What makes data sets independent? Are different sets of subjects required?

Most of the authors think (as I do too) that different sets of subjects are indeed required in order to ensure independence. Here’s Nichols:

Only data sets collected on distinct individuals can be assured to be independent. Splitting an individual’s data (e.g., using run 1 and run 2 to create two data sets) does not yield independence at the group level, as each subject’s true random effect will correlate the data sets.

Put differently, splitting data within subjects only eliminates measurement error, and not sampling error. You could in theory measure activation perfectly reliably (in which case the two halves of subjects’ data would be perfectly correlated) and still have grossly inflated effects, simply because the multivariate distribution of scores in your sample doesn’t accurately reflect the distribution in the population. So, as Nichols points out, you always need new subjects if you want to be absolutely certain your analyses are independent. But since this generally isn’t feasible, I’d argue we should worry less about whether or not our data sets are completely independent, and more about reporting results in a way that makes the presence of any bias as clear as possible.
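To see why splitting within subjects doesn't buy you independence at the group level, here's a small simulation sketch (parameter values are arbitrary): every voxel's true group-level effect is exactly zero, yet run-2 estimates at voxels selected on run 1 are still biased away from zero, because each subject's random effect is shared across runs.

```r
set.seed(1)
n_sub <- 20
n_vox <- 20000
tau   <- 1   # SD of each subject's true (random) effect; the group mean is exactly 0
sigma <- 1   # within-run measurement noise

# each subject's true effect is shared by both halves of their data
b    <- matrix(rnorm(n_sub * n_vox, 0, tau), n_sub, n_vox)
run1 <- b + matrix(rnorm(n_sub * n_vox, 0, sigma), n_sub, n_vox)
run2 <- b + matrix(rnorm(n_sub * n_vox, 0, sigma), n_sub, n_vox)

# group-level one-sample t-test at each voxel on run 1, selecting at p < .001
t1  <- colMeans(run1) / (apply(run1, 2, sd) / sqrt(n_sub))
p1  <- 2 * pt(abs(t1), df = n_sub - 1, lower.tail = FALSE)
sel <- p1 < .001

# run-2 estimates at the selected voxels are still biased away from zero,
# because the subjects' random effects carry over from run 1
mean(abs(colMeans(run2)[sel]))   # noticeably larger than...
mean(abs(colMeans(run2)))        # ...the baseline across all voxels
```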

(8) What information can one glean from data selected for a certain effect?

I think this is kind of a moot question, since virtually all data are susceptible to some form of selection bias (scientists generally don’t write papers detailing all the analyses they conducted that didn’t pan out!). As I note above, I think it’s a bad idea to disregard effect sizes entirely; they’re actually what we should be focusing most of our attention on. Better to report confidence intervals that accurately reflect the selection procedure and make the uncertainty around the point estimate clear.

(9) Are visualizations of nonindependent data helpful to illustrate the claims of a paper?

Not in cases where there’s an extremely strong dependency between the selection criteria and the effect size estimate. In cases of weak to moderate dependency, visualization is fine so long as confidence bands are plotted alongside the best fit. Again, the key is to always be explicit about the limitations of the analysis and provide some indication of the uncertainty involved.

(10) Should data exploration be discouraged in favor of valid confirmatory analyses?

No. I agree with Poldrack’s sentiment here:

Our understanding of brain function remains incredibly crude, and limiting research to the current set of models and methods would virtually guarantee scientific failure. Exploration of new approaches is thus critical, but the findings must be confirmed using new samples and convergent methods.

(11) Is a confirmatory analysis safer than an exploratory analysis in terms of drawing neuroscientific conclusions?

In principle, sure, but in practice, it’s virtually impossible to determine which reported analyses really started out their lives as confirmatory analyses and which started out as exploratory analyses and then mysteriously evolved into “a priori” predictions once the paper was written. I’m not saying there’s anything wrong with this–everyone reports results strategically to some extent–just that I don’t know that the distinction between confirmatory and exploratory analyses is all that meaningful in practice. Also, as the previous point makes clear, safety isn’t the only criterion we care about; we also want to discover new and unexpected findings, which requires exploration.

(12) What makes a whole-brain mapping analysis valid? What constitutes sufficient adjustment for multiple testing?

From a hypothesis testing standpoint, you need to ensure adequate control of the family-wise error (FWE) rate or false discovery rate (FDR). But as I suggested above, I think this only ensures validity in a limited sense; it doesn’t ensure that the results are actually going to be worth caring about. If you want to feel confident that any effects that survive are meaningfully large, you need to do the extra work up front and define what constitutes a meaningful effect size (and then test against that).

(13) How much power should a brain-mapping analysis have to be useful?

As much as possible! Concretely, the conventional target of 80% seems like a good place to start. But as I’ve argued before (e.g., here), that would require more than doubling conventional sample sizes in most cases. The reality is that fMRI studies are expensive, so we’re probably stuck with underpowered analyses for the foreseeable future. So we need to find other ways to compensate for that (e.g., relying more heavily on meta-analytic effect size estimates).
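For a rough sense of the numbers, here's what base R's power.t.test gives for a group-level (one-sample) analysis, using an arbitrary "medium" standardized effect of 0.5 purely for illustration:

```r
# Sample size for 80% power in a group-level (one-sample) analysis,
# assuming a standardized effect of d = 0.5 (an arbitrary illustration value)

# at a typical voxelwise threshold of p < .001:
power.t.test(delta = 0.5, sd = 1, sig.level = 0.001,
             power = 0.80, type = "one.sample")   # n comes out in the low 70s

# at a conventional p < .05 (e.g., a single ROI test):
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,
             power = 0.80, type = "one.sample")   # n comes out in the low-to-mid 30s
```

Either way, the required n is a long way from the sample sizes most fMRI studies actually run.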

(14) In which circumstances are nonindependent selective analyses acceptable for scientific publication?

It depends on exactly what’s problematic about the analysis. Analyses that are truly circular and provide no new information should never be reported, but those constitute only a small fraction of all analyses. More commonly, the nonindependence simply amounts to selection bias: researchers tend to report only those results that achieve statistical significance, thereby inflating apparent effect sizes. I think the solution to this is to still report all key effect sizes, but to ensure they’re accompanied by confidence intervals and appropriate qualifiers.

Kriegeskorte N, Lindquist MA, Nichols TE, Poldrack RA, & Vul E (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. Journal of Cerebral Blood Flow & Metabolism. PMID: 20571517

the perils of digging too deep

Another in a series of posts supposedly at the intersection of fiction and research methods, but mostly just an excuse to write ridiculous stories and pretend they have some sort of moral.


Dr. Rickles the postdoc looked a bit startled when I walked into his office. He was eating a cheese sandwich and watching a chimp on a motorbike on his laptop screen.

“YouTube again?” I asked.

“Yes,” he said. “It’s lunch.”

“It’s 2:30 pm,” I said, pointing to my watch.

“Still my lunch hours.”

Lunch hours for Rickles were anywhere from 11 am to 4 pm. It depended on exactly when you walked in on him doing something he wasn’t supposed to; that was the event that marked the onset of Lunch.

“Fair enough,” I said. “I just stopped by to see how things were going.”

“Oh, quite well.” said Rickles. “Things are going well. I just found a video of a chimp and a squirrel riding a motorbike together. They aren’t even wearing helmets! I’ll send you the link.”

“Please don’t. I don’t like squirrels. But I meant with work. How’s the data looking.”

He shot me a pained look, like I’d just caught him stealing video game money from his grandmother.

“The data are TERRIBLE,” he said in all capital letters.

I wasn’t terribly surprised at the revelation; I’d handed Rickles the dataset only three days prior, taking care not to  tell him it was the dataset from hell. Rickles was the fourth or fifth person in the line of succession; the data had been handed down from postdoc to graduate student to postdoc for several years now. Everyone in the lab wanted to take a crack at it when they first heard about it, and no one in the lab wanted anything to do with it once they’d taken a peek. I’d given it to Rickles in part to teach him a lesson; he’d been in the lab for several weeks now and somehow still seemed happy and self-assured.

“Haven’t found anything interesting yet?” I asked. “I thought maybe if you ran the Flimflan test on the A-trax, you might get an effect. Or maybe if you jimmied the cryptos on the Borgatron…”

“No, no,” Rickles interrupted, waved me off. “The problem isn’t that there’s nothing interesting in the data; it’s that there’s too MUCH stuff. There are too MANY results. The story is too COMPLEX.”

That didn’t compute for me, so I just stared at him blankly. No one ever found COMPLEX effects in my lab. We usually stopped once we found SIMPLE effects.

Rickles was unimpressed.

“You follow what I’m saying, Guy? There are TOO-MANY-EFFECTS. There’s too much going on in the data.”

“I don’t see how that’s possible,” I said. “Keith, Maria, and Lakshmi each spent weeks on this data and found nothing.”

“That,” said Rickles, “is because Keith, Maria, and Lakshmi never thought to apply the Epistocene Zulu transform to the data.”

The Epistocene Zulu transform! It made perfect sense when you thought about it; so why hadn’t I ever thought about it? Who was Rickles cribbing analysis notes from?

“Pull up the data,” I said excitedly. “I want to see what you’re talking about.”

“Alright, alright. Lunch hours are over now anyway.”

He grudgingly clicked on the little X on his browser. Then he pulled up a spreadsheet that must have had a million columns in it. I don’t know where they’d all come from; it had only had sixteen thousand or so when I’d had the hard drives delivered to his office.

“Here,” said Rickles, showing me the output of the Pear-sampled Tea test. “There’s the A-trax, and there’s its Nuffton index, and there’s the Zimming Range. Look at that effect. It’s bigger than the zifflon correlation Yehudah’s group reported in Nature last year.”

“Impressive,” I said, trying to look calm and collected. But in my head, I was already trying to figure out how I’d ask the department chair for a raise once this finding was published. Each point on that Zimming Range is worth at least $500, I thought.

“Are there any secondary analyses we could publish alongside that,” I asked.

“Oh, I don’t think you want to publish that,” Rickles laughed.

“Why the hell not? It could be big! You just said yourself it was a giant effect!”

“Oh sure. It’s a big effect. But I don’t believe it for one second.”

“Why not? What’s not to like? This finding makes Yehudah’s paper look like a corn dog!”

I recognized, in the course of uttering those words, that they did not constitute the finest simile ever produced.

“Well, there are two massive outliers, for one. If you eliminate them, the effect is much smaller. And if you take into consideration the Gupta skew because the data were collected with the old reverberator, there’s nothing left at all.”

“Okay, fine,” I muttered. “Is there anything else in the data?”

“Sure, tons of things. Like, for example, there’s a statistically significant gamma reduction.”

“A gamma reduction? Are you sure? Or do you mean beta,” I asked.

“Definitely gamma,” said Rickles. “There’s nothing in the betas, deltas, or thetas. I checked.”

“Okay. That sounds potentially interesting and publishable. But I bet you’re going to tell me why we shouldn’t believe that result, either, right?”

“Well,” said Rickles, looking a bit self-conscious, “it’s just that it’s a pretty fine-grained analysis; you’re not really leaving a lot of observations when you slice it up that thin. And the weird thing about the gamma reduction is that it is essentially tantamount to accepting a null effect; this was Jayaraman’s point in that article in Statistica Splenda last month.”

“Sure, the Gerryman article, right. I read that. Forget the gamma reduction. What else?”

“There are quite a few schweizels,” Rickles offered, twisting the cap off a beer that had appeared out of the minibar under his desk.

I looked at him suspiciously. I suspected it was a trap; Rickles knew how much I loved Schweizel units. But I still couldn’t resist. I had to know.

“How many schweizels are there,” I asked, my hand clutching at the back of a nearby chair to help keep me steady.

“Fourteen,” Rickles said matter-of-factly.

“Fourteen!” I gasped. “That’s a lot of schweizels!”

“It’s not bad,” said Rickles. “But the problem is, if you look at the B-trax, they also have a lot of schweizels. Seventeen of them, actually.”

“Seventeen schweizels!” I exclaimed. “That’s impossible! How can there be so many Schweizel units in one dataset!”

“I’m not sure. But… I can tell you that if you normalize the variables based on the Smith-Gill ratio, the effect goes away completely.”

There it was; the sound of the other shoe dropping. My heart gave a little cough–not unlike the sound your car engine makes in the morning when it’s cold and it wants you to stop provoking it and go back to bed. It was aggravating, but I understood what Rickles was saying. You couldn’t really say much about the Zimming Range unless your schweizel count was properly weighted. Still, I didn’t want to just give up on the schweizels entirely. I’d spent too much of my career delicately massaging schweizels to give up without one last tug.

“Maybe we can just say that the A-trax/Nuffton relationship is non-linear?” I suggested.

“Non-linear?” Rickles snorted. “Only if by non-linear you mean non-real! If it doesn’t survive Smith-Gill, it’s not worth reporting!”

I grudgingly conceded the point.

“What about the zifflons? Have you looked at them at all? It wouldn’t be so novel given Yehudah’s work, but we might still be able to get it into some place like Acta Ziffletica if there was an effect…”

“Tried it. There isn’t really any A-trax influence on zifflons. Or a B-trax effect, for that matter. There is a modest effect if you generate the Mish component for all the trax combined and look only at that. But that’s a lot of trax, and we’re not correcting for multiple Mishing, so I don’t really trust it…”

I saw that point too, and was now nearing despondency. Rickles had shot down all my best ideas one after the other. I wondered how I’d convince the department chair to let me keep my job.

Then it came to me in a near-blinding flash of insight. Near blinding, because I smashed my forehead on the overhead chandelier jumping out of my chair. An inch lower, and I’d have lost both eyes.

“We need to get that chandelier replaced,” I said, clutching my head in my hands. “It has no business hanging around in an office like this.”

“We need to get it replaced,” Rickles agreed. “I’ll do it tomorrow during my lunch hours.”

I knew that meant the chandelier would be there forever–or at least as long as Rickles inhabited the office.

“Have you tried counting the Dunams,” I suggested, rubbing my forehead delicately and getting back to my brilliant idea.

“No,” he said, leaning forward in his chair slightly. “I didn’t count Dunams.”

Ah-hah! I thought to myself. Not so smart are we now! The old boy’s still got some tricks up his sleeve.

“I think you should count the Dunams,” I offered sagely. “That always works for me. I do believe it might shed some light on this problem.”

“Well…” said Rickles, shaking his head slightly, “maaaaaybe. But Li published a paper in Psychometrika last year showing that Dunam counting is just a special case of Klein’s occidental protrusion method. And Klein’s method is more robust to violations of normality. So I used that. But I don’t really know how to interpret the results, because the residual is negative.”

I really had no idea either. I’d never come across a negative Dunam residual, and I’d never even heard of occidental protrusion. As far as I was concerned, it sounded like a made-up method.

“Okay,” I said, sinking back into my chair, ready to give up. “You’re right. This data… I don’t know. I don’t know what it means.”

I should have expected it, really; it was, after all, the dataset from hell. I was pretty sure my old RA had taken a quick jaunt through purgatory every morning before settling into the bench to run some experiments.

“I told you so,” said Rickles, putting his feet up on the desk and handing me a beer I didn’t ask for. “But don’t worry about it too much. I’m sure we’ll figure it out eventually. We probably just haven’t picked the right transformation yet. There’s Nordstrom, El-Kabir, inverse Zulu…”

He turned to his laptop and double-clicked an icon on the desktop that said “YouTube”.

“…or maybe you can just give the data to your new graduate student when she starts in a couple of weeks,” he said as an afterthought.

In the background, a video of a chimp and a puppy driving a Jeep started playing on a discolored laptop screen.

I mulled it over. Should I give the data to Josephine? Well, why not? She couldn’t really do any worse with it, and it would be a good way to break her will quickly.

“That’s not a bad idea, Rickles,” I said. “In fact, I think it might be the best idea you’ve had all week. Boy, that chimp is a really aggressive driver. Don’t drive angry, chimp! You’ll have an accid–ouch, that can’t be good.”


the capricious nature of p < .05, or why data peeking is evil

There’s a time-honored tradition in the social sciences–or at least psychology–that goes something like this. You decide on some provisional number of subjects you’d like to run in your study; usually it’s a nice round number like twenty or sixty, or some number that just happens to coincide with the sample size of the last successful study you ran. Or maybe it just happens to be your favorite number (which of course is forty-four). You get your graduate student to start running the study, and promptly forget about it for a couple of weeks while you go about writing up journal reviews that are three weeks overdue and chapters that are six months overdue.

A few weeks later, you decide you’d like to know how that Amazing New Experiment you’re running is going. You summon your RA and ask him, in magisterial tones, “how’s that Amazing New Experiment we’re running going?” To which he falteringly replies that he’s been very busy with all the other data entry and analysis chores you assigned him, so he’s only managed to collect data from eighteen subjects so far. But he promises to have the other eighty-two subjects done any day now.

“Not to worry,” you say. “We’ll just take a peek at the data now and see what it looks like; with any luck, you won’t even need to run any more subjects! By the way, here are my car keys; see if you can’t have it washed by 5 pm. Your job depends on it. Ha ha.”

Once your RA’s gone to soil himself somewhere, you gleefully plunge into the task of peeking at your data. You pivot your tables, plyr your data frame, and bravely sort your columns. Then you extract two of the more juicy variables for analysis, and after some careful surgery and a t-test or six, you arrive at the conclusion that your hypothesis is… “marginally” supported. Which is to say, the magical p value is somewhere north of .05 and somewhere south of .10, and now it’s just parked by the curb waiting for you to give it better directions.

You briefly contemplate reporting your result as a one-tailed test–since it’s in the direction you predicted, right?–but ultimately decide against that. You recall the way your old Research Methods professor used to rail at length against the evils of one-tailed tests, and even if you don’t remember exactly why they’re so evil, you’re not willing to take any chances. So you decide it can’t be helped; you need to collect some more data.

You summon your RA again. “Is my car washed yet?” you ask.

“No,” says your RA in a squeaky voice. “You just asked me to do that fifteen minutes ago.”

“Right, right,” you say. “I knew that.”

You then explain to your RA that he should suspend all other assigned duties for the next few days and prioritize running subjects in the Amazing New Experiment. “Abandon all other tasks!” you decree. “If it doesn’t involve collecting new data, it’s unimportant! Your job is to eat, sleep, and breathe new subjects! But not literally!”

Being quite clever, your RA sees an opening. “I guess you’ll want your car keys back, then,” he suggests.

“Nice try, Poindexter,” you say. “Abandon all other tasks… starting tomorrow.”

You also give your RA very careful instructions to email you the new data after every single subject, so that you can toss it into your spreadsheet and inspect the p value at every step. After all, there’s no sense in wasting perfectly good data; once your p value is below .05, you can just funnel the rest of the participants over to the Equally Amazing And Even Newer Experiment you’ve been planning to run as a follow-up. It’s a win-win proposition for everyone involved. Except maybe your RA, who’s still expected to return triumphant with a squeaky clean vehicle by 5 pm.

Twenty-six months and four rounds of review later, you publish the results of the Amazing New Experiment as Study 2 in a six-study paper in the Journal of Ambiguous Results. The reviewers raked you over the coals for everything from the suggested running head of the paper to the ratio between the abscissa and the ordinate in Figure 3. But what they couldn’t argue with was the p value in Study 2, which clocked in at just under p < .05, with only 21 subjects’ worth of data (compare that to the 80 you had to run in Study 4 to get a statistically significant result!). Suck on that, Reviewers!, you think to yourself pleasantly while driving yourself home from work in your shiny, shiny Honda Civic.

So ends our short parable, which has at least two subtle points to teach us. One is that it takes a really long time to publish anything; who has time to wait twenty-six months and go through four rounds of review?

The other, more important point, is that the desire to peek at one’s data, which often seems innocuous enough–and possibly even advisable (quality control is important, right?)–can actually be quite harmful. At least if you believe that the goal of doing research is to arrive at the truth, and not necessarily to publish statistically significant results.

The basic problem is that peeking at your data is rarely a passive process; most often, it’s done in the context of a decision-making process, where the goal is to determine whether or not you need to keep collecting data. There are two possible peeking outcomes that might lead you to decide to halt data collection: a very low p value (i.e., p < .05), in which case your hypothesis is supported and you may as well stop gathering evidence; or a very high p value, in which case you might decide that it’s unlikely you’re ever going to successfully reject the null, so you may as well throw in the towel. Either way, you’re making the decision to terminate the study based on the results you find in a provisional sample.

A complementary situation, which also happens not infrequently, occurs when you collect data from exactly as many participants as you decided ahead of time, only to find that your results aren’t quite what you’d like them to be (e.g., a marginally significant hypothesis test). In that case, it may be quite tempting to keep collecting data even though you’ve already hit your predetermined target. I can count on more than one hand the number of times I’ve overheard people say (often without any hint of guilt) something to the effect of “my p value’s at .06 right now, so I just need to collect data from a few more subjects.”

Here’s the problem with either (a) collecting more data in an effort to turn p < .06 into p < .05, or (b) ceasing data collection because you’ve already hit p < .05: any time you add another subject to your sample, there’s a fairly large probability the p value will go down purely by chance, even if there’s no effect. So there you are sitting at p < .06 with twenty-four subjects, and you decide to run a twenty-fifth subject. Well, let’s suppose that there actually isn’t a meaningful effect in the population, and that p < .06 value you’ve got is a (near) false positive. Adding that twenty-fifth subject can only do one of two things: it can raise your p value, or it can lower it. The exact probabilities of these two outcomes depend on the current effect size in your sample before adding the new subject; but generally speaking, they’ll rarely be very far from 50-50. So now you can see the problem: if you stop collecting data as soon as you get a significant result, you may well be capitalizing on chance. It could be that if you’d collected data from a twenty-sixth and twenty-seventh subject, the p value would reverse its trajectory and start rising. It could even be that if you’d collected data from two hundred subjects, the effect size would stabilize near zero. But you’d never know that if you stopped the study as soon as you got the results you were looking for.

Lest you think I’m exaggerating, and think that this problem falls into the famous class of things-statisticians-and-methodologists-get-all-anal-about-but-that-don’t-really-matter-in-the-real-world, here’s a sobering figure (taken from this chapter):

[Figure: simulated false positive (Type I error) rates under different data peeking schedules]

The figure shows the results of a simulation quantifying the increase in false positives associated with data peeking. The assumptions here are that (a) data peeking begins after about 10 subjects (starting earlier would further increase false positives, and starting later would decrease false positives somewhat), (b) the researcher stops as soon as a peek at the data reveals a result significant at p < .05, and (c) data peeking occurs at incremental steps of either 1 or 5 subjects. Given these assumptions, you can see that there’s a fairly monstrous rise in the actual Type I error rate (relative to the nominal rate of 5%). For instance, if the researcher initially plans to collect 60 subjects, but peeks at the data after every 5 subjects, there’s approximately a 17% chance that the threshold of p < .05 will be reached before the full sample of 60 subjects is collected. When data peeking occurs even more frequently (as might happen if a researcher is actively trying to turn p < .07 into p < .05, and is monitoring the results after each incremental participant), Type I error inflation is even worse. So unless you think there’s no practical difference between a 5% false positive rate and a 15 – 20% false positive rate, you should be concerned about data peeking; it’s not the kind of thing you just brush off as needless pedantry.
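If you'd rather not take the figure's word for it, here's a quick simulation sketch along the same lines (the parameter choices are my own, not the chapter's exact settings):

```r
set.seed(1)
n_sims <- 5000
n_max  <- 60
looks  <- seq(10, n_max, by = 5)   # peek at 10, 15, ..., 60 subjects

false_positive <- replicate(n_sims, {
  x <- rnorm(n_max)   # pure noise: the null hypothesis is true
  any(sapply(looks, function(n) t.test(x[1:n])$p.value < .05))
})

mean(false_positive)   # typically lands somewhere around 0.15-0.20, not 0.05
```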

How do we stop ourselves from capitalizing on chance by looking at the data? Broadly speaking, there are two reasonable solutions. One is to just pick a number up front and stick with it. If you commit yourself to collecting data from exactly as many subjects as you said you would (you can proclaim the exact number loudly to anyone who’ll listen, if you find it helps), you’re then free to peek at the data all you want. After all, it’s not the act of observing the data that creates the problem; it’s the decision to terminate data collection based on your observation that matters.

The other alternative is to explicitly correct for data peeking. This is a common approach in large clinical trials, where data peeking is often ethically mandated, because you don’t want to either (a) harm people in the treatment group if the treatment turns out to have clear and dangerous side effects, or (b) prevent the control group from capitalizing on the treatment too if it seems very efficacious. In either event, you’d want to terminate the trial early. What researchers often do, then, is pick predetermined intervals at which to peek at the data, and then apply a correction to the p values that takes into account the number of, and interval between, peeking occasions. Provided you do things systematically in that way, peeking then becomes perfectly legitimate. Of course, the downside is that having to account for those extra inspections of the data makes your statistical tests more conservative. So if there aren’t any ethical issues that necessitate peeking, and you’re not worried about quality control issues that might be revealed by eyeballing the data, your best bet is usually to just pick a reasonable sample size (ideally, one based on power calculations) and stick with it.

Oh, and also, don’t make your RAs wash your car for you; that’s not their job.

Shalizi on the confounding of contagion and homophily in social network studies

Cosma Shalizi has a post up today discussing a new paper he wrote with Andrew C. Thomas arguing that it’s pretty much impossible to distinguish the effects of social contagion from homophily in observational studies.

That’s probably pretty cryptic without context, so here’s the background. A number of high-profile studies have been published in the past few years suggesting that everything from obesity to loneliness to pot smoking is socially contagious. The basic argument is that when you look at the diffusion of certain traits within social networks, you find that having obese friends makes you more likely to become obese, having happy friends makes you more likely to be happy, and so on. These effects (it’s been argued) persist even after you control for homophily–that is, the tendency of people to know and like other people who are similar to them–and can be indirect, so that you’re more likely to be obese even if your friends’ friends (whom you may not even know) are obese.

Needless to say, the work has been controversial. A few weeks ago, Dave Johns wrote an excellent pair of articles in Slate describing the original research, as well as the recent critical backlash (see also Andrew Gelman’s post here). Much of the criticism has focused on the question of whether it’s really possible to distinguish homophily from contagion using the kind of observational data and methods that contagion researchers have relied on. That is, if the probability that you’ll become obese (or lonely, or selfish, etc.) increases as a function of the number of obese people you know, is that because your acquaintance with obese people exerts a causal influence on your own body weight (e.g., by shaping your perception of body norms, eating habits, etc.), or is it simply that people with a disposition to become obese tend to seek out other people with the same disposition, and there’s no direct causal influence at all? It’s an important question, but one that’s difficult to answer conclusively.

In their new paper, Shalizi and Thomas use an elegant combination of logical argumentation, graphical causal models, and simulation to show that, in general, contagion effects are unidentifiable: you simply can’t tell whether like begets like because of a direct causal influence (“real” contagion), or because of homophily (birds of a feather flocking together). The only way out of the bind is to make unreasonably strong assumptions–e.g., that the covariates explicitly included in one’s model capture all of the influence of latent traits on observable behaviors. In his post Shalizi sums up the conclusions of the paper this way:

What the statistician or social scientist sees is that bridge-jumping is correlated across the social network. In this it resembles many, many, many behaviors and conditions, such as prescribing new antibiotics (one of the classic examples), adopting other new products, adopting political ideologies, attaching tags to pictures on flickr, attaching mis-spelled jokes to pictures of cats, smoking, drinking, using other drugs, suicide, literary tastes, coming down with infectious diseases, becoming obese, and having bad acne or being tall for your age. For almost all of these conditions or behaviors, our data is purely observational, meaning we cannot, for one reason or another, just push Joey off the bridge and see how Irene reacts. Can we nonetheless tell whether bridge-jumping spreads by (some form) of contagion, or rather is due to homophily, or, if it is both, say how much each mechanism contributes?

A lot of people have thought so, and have tried to come at it in the usual way, by doing regression. Most readers can probably guess what I think about that, so I will just say: don’t you wish. More sophisticated ideas, like propensity score matching, have also been tried, but people have pretty much assumed that it was possible to do this sort of decomposition. What Andrew and I showed is that in fact it isn’t, unless you are willing to make very strong, and generally untestable, assumptions.

It’s a very clear and compelling paper, and definitely worth reading if you have any interest at all in the question of whether and when it’s okay to apply causal modeling techniques to observational data. The answer Shalizi’s argued for on many occasions–and an unfortunate one from many scientists’ perspective–seems to be: very rarely if ever.
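To get a feel for why the problem is so stubborn, here's a deliberately crude toy simulation (entirely my own construction, far simpler than anything in the paper): there is no contagion whatsoever, yet a naive regression recovers what looks like a contagion effect, because friendships form on a latent trait that also drives the outcome.

```r
set.seed(1)
n <- 2000

latent <- rnorm(n)   # unobserved trait that drives both friendship and the outcome

# homophily: each person befriends one of the 10 people most similar to them
friend <- sapply(seq_len(n), function(i) {
  d <- abs(latent - latent[i])
  d[i] <- Inf
  sample(which(rank(d) <= 10), 1)
})

# outcomes at two time points depend only on one's own latent trait (no contagion)
y1 <- latent + rnorm(n)
y2 <- latent + rnorm(n)

# yet the friend's earlier outcome "predicts" your later outcome,
# even after controlling for your own earlier outcome
friend_y1 <- y1[friend]
summary(lm(y2 ~ y1 + friend_y1))   # friend_y1 gets a positive, significant coefficient
```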

correlograms are correlicious

In the last year or so, I’ve been experimenting with different ways of displaying correlation matrices, and have gotten very fond of color-coded correlograms. Here’s one from a paper I wrote investigating the relationship between personality and word use among bloggers (click to enlarge):

[Figure: color-coded correlogram of correlations between LIWC language categories and Extraversion and its facets]

The rows reflect language categories from Jamie Pennebaker’s Linguistic Inquiry and Word Count (LIWC) dictionary; the columns reflect Extraversion scores (first column) or scores on the lower-order “facets” of Extraversion (as measured by the IPIP version of the NEO-PI-R). The plot was generated in R using code adapted from the corrgram package (R really does have contributed packages for everything). Positive correlations are in blue, negative ones are in red.

The thing I really like about these figures is that the colors instantly orient you to the most important features of the correlation matrix, instead of having to inspect every cell for the all-important ***magical***asterisks***of***statistical***significance***. For instance, a cursory glance tells you that even though Excitement-Seeking and Cheerfulness are both nominally facets of Extraversion, they’re associated with very different patterns of word use. And then a slightly less cursory glance tells you that that’s because people with high Excitement-Seeking scores like to swear a lot and use negative emotion words, while Cheerful people like to talk about friends, music, and use positive emotional language. You’d get the same information without the color, of course, but it’d take much longer to extract,  and then you’d have to struggle to keep all of the relevant numbers in mind while you mull them over. The colors do a lot to reduce cognitive load, and also have the secondary benefit of looking pretty.

If you’re interested in using correlograms, a good place to start is the Quick-R tutorial on correlograms in R. The documentation for the corrgram package is here, and there’s a nice discussion of the principles behind the visual display of correlation matrices in this article.
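If you just want to see one in action, here's a minimal sketch using the corrgram package on R's built-in mtcars data (not the blogger dataset from the figure above, obviously):

```r
# install.packages("corrgram")   # if not already installed
library(corrgram)

corrgram(mtcars,
         order = TRUE,                 # reorder variables so correlated ones sit together
         lower.panel = panel.shade,    # color-coded cells below the diagonal
         upper.panel = panel.pie,      # pie-chart rendering above the diagonal
         main = "Correlogram of the mtcars variables")
```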

p.s. I’m aware this post has the worst title ever; the sign-up sheet for copy editing duties is in the comment box (hint hint).