Category Archives: statistics

The homogenization of scientific computing, or why Python is steadily eating other languages’ lunch

Over the past two years, my scientific computing toolbox has been steadily homogenizing. Around 2010 or 2011, my toolbox looked something like this:

  • Ruby for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development;
  • Python/Numpy (mostly) and MATLAB (occasionally) for numerical computing;
  • MATLAB for neuroimaging data analysis;
  • R for statistical analysis;
  • R for plotting and visualization;
  • Occasional excursions into other languages/environments for other stuff.

In 2013, my toolbox looks like this:

  • Python for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development, except for an occasional date with Django or Flask (Python frameworks);
  • Python (NumPy/SciPy) for numerical computing;
  • Python (Neurosynth, NiPy etc.) for neuroimaging data analysis;
  • Python (NumPy/SciPy/pandas/statsmodels) for statistical analysis;
  • Python (MatPlotLib) for plotting and visualization, except for web-based visualizations (JavaScript/d3.js);
  • Python (scikit-learn) for machine learning;
  • Excursions into other languages have dropped markedly.

You may notice a theme here.

The increasing homogenization (Pythonification?) of the tools I use on a regular basis primarily reflects the spectacular recent growth of the Python ecosystem. A few years ago, you couldn’t really do statistics in Python unless you wanted to spend most of your time pulling your hair out and wishing Python were more like R (which is a pretty remarkable confession, considering what R is like). Neuroimaging data could be analyzed in SPM (MATLAB-based), FSL, or a variety of other packages, but there was no viable full-featured, free, open-source Python alternative. Packages for machine learning, natural language processing, and web application development were only just starting to emerge.

These days, tools for almost every aspect of scientific computing are readily available in Python. And in a growing number of cases, they’re eating the competition’s lunch.

Take R, for example. R’s out-of-the-box performance with out-of-memory datasets has long been recognized as its Achilles heel (yes, I’m aware you can get around that if you’re willing to invest the time–but not many scientists have the time). But even people who hated the way R chokes on large datasets, and its general clunkiness as a language, often couldn’t help running back to R as soon as any kind of serious data manipulation was required. You could always laboriously write code in Python or some other high-level language to pivot, aggregate, reshape, and otherwise pulverize your data, but why would you want to? The beauty of packages like plyr in R was that you could, in a matter of 2 – 3 lines of code, perform enormously powerful operations that could take hours to duplicate in other languages. The downside was the steep learning curve associated with each package’s often quite complicated API (e.g., ggplot2 is incredibly expressive, but every time I stop using ggplot2 for 3 months, I have to completely re-learn it), and having to contend with R’s general awkwardness. But still, on the whole, it was clearly worth it.

Flash forward to The Now. Last week, someone asked me for some simulation code I’d written in R a couple of years ago. As I was firing up RStudio to dig around for it, I realized that I hadn’t actually fired up RStudio for a very long time prior to that moment–probably not for about 6 months. The combination of NumPy/SciPy, MatPlotLib, pandas and statsmodels had effectively replaced R for me, and I hadn’t even noticed. At some point I just stopped dropping out of Python and into R whenever I had to do the “real” data analysis. Instead, I just started importing pandas and statsmodels into my code. The same goes for machine learning (scikit-learn), natural language processing (nltk), document parsing (BeautifulSoup), and many other things I used to do outside Python.
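To give a concrete (if invented) flavor of what that looks like, here's a minimal sketch; the file name, column names, and model formula are all hypothetical, but the point is that the aggregation and the regression never leave Python:

import pandas as pd
import statsmodels.formula.api as smf
df = pd.read_csv("study_data.csv")  # hypothetical file with 'condition', 'age', and 'rt' columns
means = df.groupby("condition")["rt"].agg(["mean", "std", "count"])  # the plyr-style aggregation step
model = smf.ols("rt ~ age + C(condition)", data=df).fit()  # the part I used to hand off to R
print(model.summary())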

It turns out that the benefits of doing all of your development and analysis in one language are quite substantial. For one thing, when you can do everything in the same language, you don’t have to suffer the constant cognitive switch costs of reminding yourself, say, that Ruby uses blocks instead of comprehensions, or that you need to call len(array) instead of array.length to get the size of an array in Python; you can just keep solving the problem you’re trying to solve with as little cognitive overhead as possible. Also, you no longer need to worry about interfacing between different languages used for different parts of a project. Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses*. In isolation, this kind of thing is not a big deal. It doesn’t take very long to write out a CSV or JSON file from Python and then read it into R. But it does add up. It makes integrated development more complicated, because you end up with more code scattered around your drive in more locations (well, at least if you have my organizational skills). It means you spend a non-negligible portion of your “analysis” time writing trivial little wrappers for all that interface stuff, instead of thinking deeply about how to actually transform and manipulate your data. And it means that your beautiful analytics code is marred by all sorts of ugly open() and read() I/O calls. All of this overhead vanishes as soon as you move to a single language.
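For what it's worth, the offending glue code is rarely difficult, just tedious. A hypothetical version of the handoff (file names and contents invented purely for illustration) might be nothing more than this:

import json
import pandas as pd
df = pd.DataFrame({"doc_id": [1, 2, 3], "word_count": [204, 188, 251]})  # stand-in for data you already have in memory, in exactly the form you want
df.to_csv("features_for_R.csv", index=False)  # dumped to disk purely so another language can pick it up
with open("metadata.json", "w") as f:
    json.dump({"n_documents": len(df)}, f)  # plus a sidecar file describing what you wrote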

Convenience aside, another thing that’s impressive about the Python scientific computing ecosystem is that a surprising number of Python-based tools are now best-in-class (or close to it) in terms of scope and ease of use–and, by virtue of C bindings, often even in terms of performance. It’s hard to imagine an easier-to-use machine learning package than scikit-learn, even before you factor in the breadth of implemented algorithms, excellent documentation, and outstanding performance. Similarly, I haven’t missed any of the data manipulation functionality in R since I switched to pandas. Actually, I’ve discovered many new tricks in pandas I didn’t know in R (some of which I’ll describe in an upcoming post). Given that pandas considerably outperforms R for many common operations, the reasons for me to switch back to R or other tools–even occasionally–have dwindled.
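To illustrate the ease-of-use point, here's roughly what a cross-validated classifier looks like in scikit-learn. The data are random stand-ins, and the module paths reflect current scikit-learn releases rather than the 2013-era API, so treat it as a sketch:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in feature matrix: 200 observations, 10 features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)  # stand-in binary labels
scores = cross_val_score(LogisticRegression(), X, y, cv=5)  # 5-fold cross-validated accuracy
print(scores.mean())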

Mind you, I don’t mean to imply that Python can now do everything anyone could ever do in other languages. That’s obviously not true. For instance, there are currently no viable replacements for many of the thousands of statistical packages users have contributed to R (if there’s a good analog for lme4 in Python, I’d love to know about it). In signal processing, I gather that many people are wedded to various MATLAB toolboxes and packages that don’t have good analogs within the Python ecosystem. And for people who need serious performance and work with very, very large datasets, there’s often still no substitute for writing highly optimized code in a low-level compiled language. So, clearly, what I’m saying here won’t apply to everyone. But I suspect it applies to the majority of scientists.

Speaking only for myself, I’ve now arrived at the point where around 90 – 95% of what I do can be done comfortably in Python. So the major consideration for me, when determining what language to use for a new project, has shifted from “what’s the best tool for the job that I’m willing to learn and/or tolerate using?” to “is there really no way to do this in Python?” By and large, this mentality is a good thing, though I won’t deny that it occasionally has its downsides. For example, back when I did most of my data analysis in R, I would frequently play around with random statistics packages just to see what they did. I don’t do that much any more, because the pain of having to refresh my R knowledge and deal with that thing again usually outweighs the perceived benefits of aimless statistical exploration. Conversely, sometimes I end up using Python packages that I don’t like quite as much as comparable packages in other languages, simply for the sake of preserving language purity. For example, I prefer Rails’ ActiveRecord ORM to the much more explicit SQLAlchemy ORM for Python–but I don’t prefer it enough to justify mixing Ruby and Python objects in the same application. So, clearly, there are costs. But they’re pretty small costs, and for me personally, the scales have now clearly tipped in favor of using Python for almost everything. I know many other researchers who’ve had the same experience, and I don’t think it’s entirely unfair to suggest that, at this point, Python has become the de facto language of scientific computing in many domains. If you’re reading this and haven’t had much prior exposure to Python, now’s a great time to come on board!

Postscript: In the period of time between starting this post and finishing it (two sessions spread about two weeks apart), I discovered not one but two new Python-based packages for data visualization: Michael Waskom’s seaborn package–which provides very high-level wrappers for complex plots, with a beautiful ggplot2-like aesthetic–and Continuum Analytics’ bokeh, which looks like a potential game-changer for web-based visualization**. At the rate the Python ecosystem is moving, there’s a non-zero chance that by the time you read this, I’ll be using some new Python package that directly transliterates my thoughts into analytics code.

 

* I’m aware that there are various interfaces between Python, R, etc. that allow you to internally pass objects between these languages. My experience with these has not been overwhelmingly positive, and in any case they still introduce all the overhead of writing extra lines of code and having to deal with multiple languages.

** Yes, you heard right: web-based visualization in Python. Bokeh generates static JavaScript and JSON for you from Python code, so your users are magically able to interact with your plots on a webpage without you having to write a single line of native JS code.
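For the curious, the basic workflow looks something like the sketch below. Bokeh's API has evolved considerably since 2013, so take this as the general shape rather than the exact interface:

from bokeh.io import output_file
from bokeh.plotting import figure, show
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]
output_file("lines.html")  # everything Bokeh produces is a static HTML/JS/JSON bundle
p = figure(title="A simple interactive line plot")
p.line(x, y, line_width=2)
show(p)  # opens the interactive plot in a browser; no hand-written JavaScript anywhere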

the truth is not optional: five bad reasons (and one mediocre one) for defending the status quo

You could be forgiven for thinking that academic psychologists have all suddenly turned into professional whistleblowers. Everywhere you look, interesting new papers are cropping up purporting to describe this or that common-yet-shady methodological practice, and telling us what we can collectively do to solve the problem and improve the quality of the published literature. In just the last year or so, Uri Simonsohn introduced new techniques for detecting fraud, and used those tools to identify at least 3 cases of high-profile, unabashed data forgery. Simmons and colleagues reported simulations demonstrating that standard exploitation of research degrees of freedom in analysis can produce extremely high rates of false positive findings. Pashler and colleagues developed the “PsychFileDrawer” repository for tracking replication attempts. Several researchers raised trenchant questions about the veracity and/or magnitude of many high-profile psychological findings such as John Bargh’s famous social priming effects. Wicherts and colleagues showed that authors of psychology articles who are less willing to share their data upon request are more likely to make basic statistical errors in their papers. And so on and so forth. The flood shows no signs of abating; just last week, the APS journal Perspectives on Psychological Science announced that it’s introducing a new “Registered Replication Report” section that will commit to publishing pre-registered high-quality replication attempts, irrespective of their outcome.

Personally, I think these are all very welcome developments for psychological science. They’re solid indications that we psychologists are going to be able to police ourselves successfully in the face of some pretty serious problems, and they bode well for the long-term health of our discipline. My sense is that the majority of other researchers–perhaps the vast majority–share this sentiment. Still, as with any zeitgeist shift, there are always naysayers. In discussing these various developments and initiatives with other people, I’ve found myself arguing, with somewhat surprising frequency, with people who for various reasons think it’s not such a good thing that Uri Simonsohn is trying to catch fraudsters, or that social priming findings are being questioned, or that the consequences of flexible analyses are being exposed. Since many of the arguments I’ve come across tend to recur, I thought I’d summarize the most common ones here–along with the rebuttals I usually offer for why, with one possible exception, the arguments for giving a pass to sloppy-but-common methodological practices are not very compelling.

“But everyone does it, so how bad can it be?”

We typically assume that long-standing conventions must exist for some good reason, so when someone raises doubts about some widespread practice, it’s quite natural to question the person raising the doubts rather than the practice itself. Could it really, truly be (we say) that there’s something deeply strange and misguided about using p values? Is it really possible that the reporting practices converged on by thousands of researchers in tens of thousands of neuroimaging articles might leave something to be desired? Could failing to correct for the many researcher degrees of freedom associated with most datasets really inflate the false positive rate so dramatically?

The answer to all these questions, of course, is yes–or at least, we should allow that it could be yes. It is, in principle, entirely possible for an entire scientific field to regularly do things in a way that isn’t very good. There are domains where appeals to convention or consensus make perfect sense, because there are few good reasons to do things a certain way except inasmuch as other people do them the same way. If everyone else in your country drives on the right side of the road, you may want to consider driving on the right side of the road too. But science is not one of those domains. In science, there is no intrinsic benefit to doing things just for the sake of convention. In fact, almost by definition, major scientific advances are ones that tend to buck convention and suggest things that other researchers may not have considered possible or likely.

In the context of common methodological practice, it’s no defense at all to say but everyone does it this way, because there are usually relatively objective standards by which we can gauge the quality of our methods, and it’s readily apparent that there are many cases where the consensus approach leaves something to be desired. For instance, you can’t really justify failing to correct for multiple comparisons when you report a single test that’s just barely significant at p < .05 on the grounds that nobody else corrects for multiple comparisons in your field. That may be a valid explanation for why your paper successfully got published (i.e., reviewers didn’t want to hold your feet to the fire for something they themselves are guilty of in their own work), but it’s not a valid defense of the actual science. If you run a t-test on randomly generated data 20 times, you will, on average, get a significant result, p < .05, once. It does no one any good to argue that because the convention in a field is to allow multiple testing–or to ignore statistical power, or to report only p values and not effect sizes, or to omit mention of conditions that didn’t ‘work’, and so on–it’s okay to ignore the issue. There’s a perfectly reasonable question as to whether it’s a smart career move to start imposing methodological rigor on your work unilaterally (see below), but there’s no question that the mere presence of consensus or convention surrounding a methodological practice does not make that practice okay from a scientific standpoint.
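If the arithmetic there seems too glib, it's easy to check by simulation. Here's a quick sketch (in Python, purely for illustration; the group sizes and number of repetitions are arbitrary) of what happens when you repeatedly run 20 independent t-tests on pure noise:

import numpy as np
from scipy import stats
rng = np.random.default_rng(1)
n_experiments, n_tests, alpha = 10000, 20, 0.05
counts = np.empty(n_experiments)
for i in range(n_experiments):
    pvals = [stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue for _ in range(n_tests)]
    counts[i] = np.sum(np.array(pvals) < alpha)  # number of 'significant' results among the 20 tests
print(counts.mean())  # comes out very close to 1: one false positive per 20 tests, on average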

“But psychology would break if we could only report results that were truly predicted a priori!”

This is a defense that has some plausibility at first blush. It’s certainly true that if you force researchers to correct for multiple comparisons properly, and report the many analyses they actually conducted–and not just those that “worked”–a lot of stuff that used to get through the filter will now get caught in the net. So, by definition, it would be harder to detect unexpected effects in one’s data–even when those unexpected effects are, in some sense, ‘real’. But the important thing to keep in mind is that raising the bar for what constitutes a believable finding doesn’t actually prevent researchers from discovering unexpected new effects; all it means is that it becomes harder to report post-hoc results as pre-hoc results. It’s not at all clear why forcing researchers to put in more effort validating their own unexpected finding is a bad thing.

In fact, forcing researchers to go the extra mile in this way would have one exceedingly important benefit for the field as a whole: it would shift the onus of determining whether an unexpected result is plausible enough to warrant pursuing away from the community as a whole, and towards the individual researcher who discovered the result in the first place. As it stands right now, if I discover an unexpected result (p < .05!) that I can make up a compelling story for, there’s a reasonable chance I might be able to get that single result into a short paper in, say, Psychological Science. And reap all the benefits that attend getting a paper into a “high-impact” journal. So in practice there’s very little penalty to publishing questionable results, even if I myself am not entirely (or even mostly) convinced that those results are reliable. This state of affairs is, to put it mildly, not A Good Thing.

In contrast, if you as an editor or reviewer start insisting that I run another study that directly tests and replicates my unexpected finding before you’re willing to publish my result, I now actually have something at stake. Because it takes time and money to run new studies, I’m probably not going to bother to follow up on my unexpected finding unless I really believe it. Which is exactly as it should be: I’m the guy who discovered the effect, and I know about all the corners I have or haven’t cut in order to produce it; so if anyone should make the decision about whether to spend more taxpayer money chasing the result, it should be me. You, as the reviewer, are not in a great position to know how plausible the effect truly is, because you have no idea how many different types of analyses I attempted before I got something to ‘work’, or how many failed studies I ran that I didn’t tell you about. Given the huge asymmetry in information, it seems perfectly reasonable for reviewers to say, You think you have a really cool and unexpected effect that you found a compelling story for? Great; go and directly replicate it yourself and then we’ll talk.

“But mistakes happen, and people could get falsely accused!”

Some people don’t like the idea of a guy like Simonsohn running around and busting people’s data fabrication operations for the simple reason that they worry that the kind of approach Simonsohn used to detect fraud is just not that well-tested, and that if we’re not careful, innocent people could get swept up in the net. I think this concern stems from fundamentally good intentions, but once again, I think it’s also misguided.

For one thing, it’s important to note that, despite all the press, Simonsohn hasn’t actually done anything qualitatively different from what other whistleblowers or skeptics have done in the past. He may have suggested new techniques that improve the efficiency with which cheating can be detected, but it’s not as though he invented the ability to report or investigate other researchers for suspected misconduct. Researchers suspicious of other researchers’ findings have always used qualitatively similar arguments to raise concerns. They’ve said things like, hey, look, this is a pattern of data that just couldn’t arise by chance, or, the numbers are too similar across different conditions.

More to the point, perhaps, no one is seriously suggesting that independent observers shouldn’t be allowed to raise their concerns about possible misconduct with journal editors, professional organizations, and universities. There really isn’t any viable alternative. Naysayers who worry that innocent people might end up ensnared by false accusations presumably aren’t suggesting that we do away with all of the existing mechanisms for ensuring accountability; but since the role of people like Simonsohn is only to raise suspicion and provide evidence (and not to do the actual investigating or firing), it’s clear that there’s no way to regulate this type of behavior even if we wanted to (which I would argue we don’t). If I wanted to spend the rest of my life scanning the statistical minutiae of psychology articles for evidence of misconduct and reporting it to the appropriate authorities (and I can assure you that I most certainly don’t), there would be nothing anyone could do to stop me, nor should there be. Remember that accusing someone of misconduct is something anyone can do, but establishing that misconduct has actually occurred is a serious task that requires careful internal investigation. No one–certainly not Simonsohn–is suggesting that a routine statistical test should be all it takes to end someone’s career. In fact, Simonsohn himself has noted that he identified a 4th case of likely fraud that he dutifully reported to the appropriate authorities only to be met with complete silence. Given all the incentives universities and journals have to look the other way when accusations of fraud are made, I suspect we should be much more concerned about the false negative rate than the false positive rate when it comes to fraud.

“But it hurts the public’s perception of our field!”

Sometimes people argue that even if the field does have some serious methodological problems, we still shouldn’t discuss them publicly, because doing so is likely to instill a somewhat negative view of psychological research in the public at large. The unspoken implication being that, if the public starts to lose confidence in psychology, fewer students will enroll in psychology courses, fewer faculty positions will be created to teach students, and grant funding to psychologists will decrease. So, by airing our dirty laundry in public, we’re only hurting ourselves. I had an email exchange with a well-known researcher to exactly this effect a few years back in the aftermath of the Vul et al “voodoo correlations” paper–a paper I commented on to the effect that the problem was even worse than suggested. The argument my correspondent raised was, in effect, that we (i.e., neuroimaging researchers) are all at the mercy of agencies like NIH to keep us employed, and if it starts to look like we’re clowning around, the unemployment rate for people with PhDs in cognitive neuroscience might start to rise precipitously.

While I obviously wouldn’t want anyone to lose their job or their funding solely because of a change in public perception, I can’t say I’m very sympathetic to this kind of argument. The problem is that it places short-term preservation of the status quo above both the long-term health of the field and the public’s interest. For one thing, I think you have to be quite optimistic to believe that some of the questionable methodological practices that are relatively widespread in psychology (data snooping, selective reporting, etc.) are going to sort themselves out naturally if we just look the other way and let nature run its course. The obvious reason for skepticism in this regard is that many of the same criticisms have been around for decades, and it’s not clear that anything much has improved. Maybe the best example of this is Sedlmeier and Gigerenzer’s 1989 paper entitled “Do studies of statistical power have an effect on the power of studies?“, in which the authors convincingly showed that despite three decades of work by luminaries like Jacob Cohen advocating power analyses, statistical power had not risen appreciably in psychology studies. The presence of such unwelcome demonstrations suggests that sweeping our problems under the rug in the hopes that someone (the mice?) will unobtrusively take care of them for us is wishful thinking.

In any case, even if problems did tend to solve themselves when hidden away from the prying eyes of the media and public, the bigger problem with what we might call the “saving face” defense is that it is, fundamentally, an abuse of taxpayers’ trust. As with so many other things, Richard Feynman summed up the issue eloquently in his famous Cargo Cult Science commencement speech:

For example, I was a little surprised when I was talking to a friend who was going to go on the radio. He does work on cosmology and astronomy, and he wondered how he would explain what the applications of this work were. “Well,” I said, “there aren’t any.” He said, “Yes, but then we won’t get support for more research of this kind.” I think that’s kind of dishonest. If you’re representing yourself as a scientist, then you should explain to the layman what you’re doing–and if they don’t want to support you under those circumstances, then that’s their decision.

The fact of the matter is that our livelihoods as researchers depend directly on the goodwill of the public. And the taxpayers are not funding our research so that we can “discover” interesting-sounding but ultimately unreplicable effects. They’re funding our research so that we can learn more about the human mind and hopefully be able to fix it when it breaks. If a large part of the profession is routinely employing practices that are at odds with those goals, it’s not clear why taxpayers should be footing the bill. From this perspective, it might actually be a good thing for the field to revise its standards, even if (in the worst-case scenario) that causes a short-term contraction in employment.

“But unreliable effects will just fail to replicate, so what’s the big deal?”

This is a surprisingly common defense of sloppy methodology, maybe the single most common one. It’s also an enormous cop-out, since it pre-empts the need to think seriously about what you’re doing in the short term. The idea is that, since no single study is definitive, and a consensus about the reality or magnitude of most effects usually doesn’t develop until many studies have been conducted, it’s reasonable to impose a fairly low bar on initial reports and then wait and see what happens in subsequent replication efforts.

I think this is a nice ideal, but things just don’t seem to work out that way in practice. For one thing, there doesn’t seem to be much of a penalty for publishing high-profile results that later fail to replicate. The reason, I suspect, is that we’re inclined to give researchers the benefit of the doubt: surely (we say to ourselves), Jane Doe did her best, and we like Jane, so why should we question the work she produces? If we’re really so skeptical about her findings, shouldn’t we go replicate them ourselves, or wait for someone else to do it?

While this seems like an agreeable and fair-minded attitude, it isn’t actually a terribly good way to look at things. Granted, if you really did put in your best effort–dotted all your i’s and crossed all your t’s–and still ended up reporting a false result, we shouldn’t punish you for it. I don’t think anyone is seriously suggesting that researchers who inadvertently publish false findings should be ostracized or shunned. On the other hand, it’s not clear why we should continue to celebrate scientists who ‘discover’ interesting effects that later turn out not to replicate. If someone builds a career on the discovery of one or more seemingly important findings, and those findings later turn out to be wrong, the appropriate attitude is to update our beliefs about the merit of that person’s work. As it stands, we rarely seem to do this.

In any case, the bigger problem with appeals to replication is that the delay between initial publication of an exciting finding and subsequent consensus disconfirmation can be very long, and often spans entire careers. Waiting decades for history to prove an influential idea wrong is a very bad idea if the available alternative is to nip the idea in the bud by requiring stronger evidence up front.

There are many notable examples of this in the literature. A well-publicized recent one is John Bargh’s work on the motor effects of priming people with elderly stereotypes–namely, that priming people with words related to old age makes them walk away from the experiment more slowly. Bargh’s original paper was published in 1996, and according to Google Scholar, has now been cited over 2,000 times. It has undoubtedly been hugely influential in directing many psychologists’ research programs in certain directions (in many cases, in directions that are equally counterintuitive and also now seem open to question). And yet it’s taken over 15 years for a consensus to develop that the original effect is at the very least much smaller in magnitude than originally reported, and potentially so small as to be, for all intents and purposes, “not real”. I don’t know who reviewed Bargh’s paper back in 1996, but I suspect that if they ever considered the seemingly implausible size of the effect being reported, they might well have thought to themselves, well, I’m not sure I believe it, but that’s okay–time will tell. Time did tell, of course; but time is kind of lazy, so it took fifteen years for it to tell. In an alternate universe, a reviewer might have said, well, this is a striking finding, but the effect seems implausibly large; I would like you to try to directly replicate it in your lab with a much larger sample first. I recognize that this is onerous and annoying, but my primary responsibility is to ensure that only reliable findings get into the literature, and inconveniencing you seems like a small price to pay. Plus, if the effect is really what you say it is, people will be all the more likely to believe you later on.

Or take the actor-observer asymmetry, which appears in just about every introductory psychology textbook written in the last 20 – 30 years. It states that people are relatively more likely to attribute their own behavior to situational factors, and relatively more likely to attribute other agents’ behaviors to those agents’ dispositions. When I slip and fall, it’s because the floor was wet; when you slip and fall, it’s because you’re dumb and clumsy. This putative asymmetry was introduced and discussed at length in a book by Jones and Nisbett in 1971, and hundreds of studies have investigated it at this point. And yet a 2006 meta-analysis by Malle suggested that the cumulative evidence for the actor-observer asymmetry is actually very weak. There are some specific circumstances under which you might see something like the postulated effect, but what is quite clear is that it’s nowhere near strong enough an effect to justify being routinely invoked by psychologists and even laypeople to explain individual episodes of behavior. Unfortunately, at this point it’s almost impossible to dislodge the actor-observer asymmetry from the psyche of most researchers–a reality underscored by the fact that the Jones and Nisbett book has been cited nearly 3,000 times, whereas the 2006 meta-analysis has been cited only 96 times (a very low rate for an important and well-executed meta-analysis published in Psychological Bulletin).

The fact that it can take many years–whether 15 or 45–for a literature to build up to the point where we’re even in a position to suggest with any confidence that an initially exciting finding could be wrong means that we should be very hesitant to appeal to long-term replication as an arbiter of truth. Replication may be the gold standard in the very long term, but in the short and medium term, appealing to replication is a huge cop-out. If you can see problems with an analysis right now that cast aspersions on a study’s results, it’s an abdication of responsibility to downplay your concerns and wait for someone else to come along and spend a lot more time and money trying to replicate the study. You should point out now why you have concerns. If the authors can address them, the results will look all the better for it. And if the authors can’t address your concerns, well, then, you’ve just done science a service. If it helps, don’t think of it as a matter of saying mean things about someone else’s work, or of asserting your own ego; think of it as potentially preventing a lot of very smart people from wasting a lot of time chasing down garden paths–and also saving a lot of taxpayer money. Remember that our job as scientists is not to make other scientists’ lives easy in the hopes they’ll repay the favor when we submit our own papers; it’s to establish and apply standards that produce convergence on the truth in the shortest amount of time possible.

“But it would hurt my career to be meticulously honest about everything I do!”

Unlike the other considerations listed above, I think the concern that being honest carries a price when it comes to doing research has a good deal of merit to it. Given the aforementioned delay between initial publication and later disconfirmation of findings (which even in the best case is usually longer than the delay between obtaining a tenure-track position and coming up for tenure), researchers have many incentives to emphasize expediency and good story-telling over accuracy, and it would be disingenuous to suggest otherwise. No malevolence or outright fraud is implied here, mind you; the point is just that if you keep second-guessing and double-checking your analyses, or insist on routinely collecting more data than other researchers might think is necessary, you will very often find that results that could have made a bit of a splash given less rigor are actually not particularly interesting upon careful cross-examination. Which means that researchers who have, shall we say, less of a natural inclination to second-guess, double-check, and cross-examine their own work will, to some degree, be more likely to publish results that make a bit of a splash (it would be nice to believe that pre-publication peer review filters out sloppy work, but empirically, it just ain’t so). So this is a classic tragedy of the commons: what’s good for a given individual, career-wise, is clearly bad for the community as a whole.

I wish I had a good solution to this problem, but I don’t think there are any quick fixes. The long-term solution, as many people have observed, is to restructure the incentives governing scientific research in such a way that individual and communal benefits are directly aligned. Unfortunately, that’s easier said than done. I’ve written a lot both in papers (1, 2, 3) and on this blog (see posts linked here) about various ways we might achieve this kind of realignment, but what’s clear is that it will be a long and difficult process. For the foreseeable future, it will continue to be an understandable though highly lamentable defense to say that the cost of maintaining a career in science is that one sometimes has to play the game the same way everyone else plays the game, even if it’s clear that the rules everyone plays by are detrimental to the communal good.

 

Anyway, this may all sound a bit depressing, but I really don’t think it should be taken as such. Personally I’m actually very optimistic about the prospects for large-scale changes in the way we produce and evaluate science within the next few years. I do think we’re going to collectively figure out how to do science in a way that directly rewards people for employing research practices that are maximally beneficial to the scientific community as a whole. But I also think that for this kind of change to take place, we first need to accept that many of the defenses we routinely give for using iffy methodological practices are just not all that compelling.

R, the master troll of statistical languages

Warning: what follows is a somewhat technical discussion of my love-hate relationship with the R statistical language, in which I somehow manage to waste 2,400 words talking about a single line of code. Reader discretion is advised.

I’ve been using R to do most of my statistical analysis for about 7 or 8 years now–ever since I was a newbie grad student and one of the senior grad students in my lab introduced me to it. Despite having spent hundreds (thousands?) of hours in R, I have to confess that I’ve never set aside much time to really learn it very well; what basic competence I’ve developed has been acquired almost entirely by reading the inline help and consulting the Oracle of Bacon (okay, Google) when I run into problems. I’m not very good at setting aside time for reading articles or books or working my way through other people’s code (probably the best way to learn), so the net result is that I don’t know R nearly as well as I should.

That said, if I’ve learned one thing about R, it’s that R is all about flexibility: almost any task can be accomplished in a dozen different ways. I don’t mean that in the trivial sense that pretty much any substantive programming problem can be solved in any number of ways in just about any language; I mean that for even very simple and well-defined tasks involving just one or two lines of code there are often many different approaches.

To illustrate, consider the simple task of selecting a column from a data frame (data frames in R are basically just fancy tables). Suppose you have a dataset that looks like this:

In most languages, there would be one standard way of pulling columns out of this table. Just one unambiguous way: if you don’t know it, you won’t be able to work with data at all, so odds are you’re going to learn it pretty quickly. R doesn’t work that way. In R there are many ways to do almost everything, including selecting a column from a data frame (one of the most basic operations imaginable!). Here are four of them:
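For instance (a representative four; R offers still more):

ice.cream[, 1]
ice.cream[, 'flavor']
ice.cream$flavor
ice.cream[['flavor']]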

 

I won’t bother to explain all of these; the point is that, as you can see, they all return the same result (namely, the first column of the ice.cream data frame, named ‘flavor’).

This type of flexibility enables incredibly powerful, terse code once you know R reasonably well; unfortunately, it also makes for an extremely steep learning curve. You might wonder why that would be–after all, at its core, R still lets you do things the way most other languages do them. In the above example, you don’t have to use anything other than the simple index-based approach (i.e., data[,1]), which is the way most other languages that have some kind of data table or matrix object (e.g., MATLAB, Python/NumPy, etc.) would prefer you to do it. So why should the extra flexibility present any problems?

The answer is that when you’re trying to learn a new programming language, you typically do it in large part by reading other people’s code–and nothing is more frustrating to a newbie when learning a language than trying to figure out why sometimes people select columns in a data frame by index and other times they select them by name, or why sometimes people refer to named properties with a dollar sign and other times they wrap them in a vector or double square brackets. There are good reasons to have all of these different idioms, but you wouldn’t know that if you’re new to R and your expectation, quite reasonably, is that if two expressions look very different, they should do very different things. The flexibility that experienced R users love is very confusing to a newcomer. Most other languages don’t have that problem, because there’s only one way to do everything (or at least, far fewer ways than in R).

Thankfully, I’m long past the point where R syntax is perpetually confusing. I’m now well into the phase where it’s only frequently confusing, and I even have high hopes of one day making it to the point where it barely confuses me at all. But I was reminded of the steepness of that initial learning curve the other day while helping my wife use R to do some regression analyses for her thesis. Rather than explaining what she was doing, suffice it to say that she needed to write a function that, among other things, takes a data frame as input and retains only the numeric columns for subsequent analysis. Data frames in R are actually lists under the hood, so they can have mixed types (i.e., you can have string columns and numeric columns and factors all in the same data frame; R lists basically work like hashes or dictionaries in other loosely-typed languages like Python or Ruby). So you can run into problems if you haphazardly try to perform numerical computations on non-numerical columns (e.g., good luck computing the mean of ‘cat’, ‘dog’, and ‘giraffe’), and hence, pre-emptive selection of only the valid numeric columns is required.

Now, in most languages (including R), you can solve this problem very easily using a loop. In fact, in many languages, you would have to use an explicit for-loop; there wouldn’t be any other way to do it. In R, you might do it like this*:

numeric_cols = rep(FALSE, ncol(ice.cream))
for (i in 1:ncol(ice.cream)) numeric_cols[i] = is.numeric(ice.cream[,i])

We allocate memory for the result, then loop over each column and check whether or not it’s numeric, saving the result. Once we’ve done that, we can select only the numeric columns from our data frame with data[,numeric_cols].

This is a perfectly sensible way to solve the problem, and as you can see, it’s not particularly onerous to write out. But of course, no self-respecting R user would write an explicit loop that way, because R provides you with any number of other tools to do the job more efficiently. So instead of saying “just loop over the columns and check if is.numeric() is true for each one,” when my wife asked me how to solve her problem, I cleverly said “use apply(), of course!”

apply() is an incredibly useful built-in function that implicitly loops over one or more margins of a matrix; in theory, you should be able to do the same work as the above two lines of code with just the following one line:

apply(ice.cream, 2, is.numeric)

Here the first argument is the data we’re passing in, the third argument is the function we want to apply to the data (is.numeric()), and the second argument is the margin over which we want to apply that function (1 = rows, 2 = columns, etc.). And just like that, we’ve cut the length of our code in half!

Unfortunately, when my wife tried to use apply(), her script broke. It didn’t break in any obvious way, mind you (i.e., with a crash and an error message); instead, the apply() call returned a perfectly good vector. It’s just that all of the values in that vector were FALSE. Meaning, R had decided that none of the columns in my wife’s data frame were numeric–which was most certainly incorrect. And because the code wasn’t throwing an error, and the apply() call was embedded within a longer function, it wasn’t obvious to my wife–as an R newbie and a novice programmer–what had gone wrong. From her perspective, the regression analyses she was trying to run with lm() were breaking with strange messages. So she spent a couple of hours trying to debug her code before asking me for help.

Anyway, I took a look at the help documentation, and the source of the problem turned out to be the following: apply() only operates over matrices or vectors, and not on data frames. So when you pass a data frame to apply() as the input, it’s implicitly converted to a matrix. Unfortunately, because matrices can only contain values of one data type, any data frame that has at least one string column will end up being converted to a string (or, in R’s nomenclature, character) matrix. And so now when we apply the is.numeric() function to each column of the matrix, the answer is always going to be FALSE, because all of the columns have been converted to character vectors. So apply() is actually doing exactly what it’s supposed to; it’s just that it doesn’t deign to tell you that it’s implicitly casting your data frame to a matrix before doing anything else. The upshot is that unless you carefully read the apply() documentation and have a basic understanding of data types (which, if you’ve just started dabbling in R, you may well not), you’re hosed.

At this point I could have–and probably should have–thrown in the towel and just suggested to my wife that she use an explicit loop. But that would have dealt a mortal blow to my pride as an experienced-if-not-yet-guru-level R user. So of course I did what any self-respecting programmer does: I went and googled it. And the first thing I came across was the all.is.numeric() function in the Hmisc package which has the following description:

Tests, without issuing warnings, whether all elements of a character vector are legal numeric values.

Perfect! So now the solution to my wife’s problem became this:

library(Hmisc)
apply(ice.cream, 2, all.is.numeric)

…which had the desirable property of actually working. But it still wasn’t very satisfactory, because it requires loading a pretty large library (Hmisc) with a bunch of dependencies just to do something very simple that should really be doable in the base R distribution. So I googled some more. And came across a relevant Stack Exchange answer, which had the following simple solution to my wife’s exact problem:

sapply(ice.cream, is.numeric)

You’ll notice that this is virtually identical to the apply() approach that failed. That’s no coincidence; it turns out that sapply() is just a variant of apply() that works on lists. And since data frames are actually lists, there’s no problem passing in a data frame and iterating over its columns. So just like that, we have an elegant one-line solution to the original problem that doesn’t invoke any loops or third-party packages.

Now, having used apply() a million times, I probably should have known about sapply(). And actually, it turns out I did know about sapply–in 2009. A Spotlight search reveals that I used it in some code I wrote for my dissertation analyses. But that was 2009, back when I was smart. In 2012, I’m the kind of person who uses apply() a dozen times a day, and is vaguely aware that R has a million related built-in functions like sapply(), tapply(), lapply(), and vapply(), yet still has absolutely no idea what all of those actually do. In other words, in 2012, I’m the kind of experienced R user that you might generously call “not very good at R”, and, less generously, “dumb”.

On the plus side, the end product is undeniably cool, right? There are very few languages in which you could achieve so much functionality so compactly right out of the box. And this isn’t an isolated case; base R includes a zillion high-level functions to do similarly complex things with data in a fraction of the code you’d need to write in most other languages. Once you throw in the thousands of high-quality user-contributed packages, there’s nothing else like it in the world of statistical computing.

Anyway, this inordinately long story does have a point to it, I promise, so let me sum up:

  • If I had just ignored the desire to be efficient and clever, and had told my wife to solve the problem the way she’d solve it in most other languages–with a simple for-loop–it would have taken her a couple of minutes to figure out, and she’d probably never have run into any problems.
  • If I’d known R slightly better, I would have told my wife to use sapply(). This would have taken her 10 seconds and she’d definitely never have run into any problems.
  • BUT: because I knew enough R to be clever but not enough R to avoid being stupid, I created an entirely avoidable problem that consumed a couple of hours of my wife’s time. Of course, now she knows about both apply() and sapply(), so you could argue that in the long run, I’ve probably still saved her time. (I’d say she also learned something about her husband’s stubborn insistence on pretending he knows what he’s doing, but she’s already the world-leading expert on that topic.)

Anyway, this anecdote is basically a microcosm of my entire experience with R. I suspect many other people will relate. Basically what it boils down to is that R gives you a certain amount of rope to work with. If you don’t know what you’re doing at all, you will most likely end up accidentally hanging yourself with that rope. If, on the other hand, you’re a veritable R guru, you will most likely use that rope to tie some really fancy knots, scale tall buildings, fashion yourself a space tuxedo, and, eventually, colonize brave new statistical worlds. For everyone in between novice and guru (e.g., me), using R on a regular basis is a continual exercise in alternately thinking “this is fucking awesome” and banging your head against the wall in frustration at the sheer stupidity (either your own, or that of the people who designed this awful language). But the good news is that the longer you use R, the more of the former and the fewer of the latter experiences you have. And at the end of the day, it’s totally worth it: the language is powerful enough to make you forget all of the weird syntax, strange naming conventions, choking on large datasets, and issues with data type conversions.

Oh, except when your wife is yelling at (sorry, gently reprimanding) you for wasting several hours of her time on a problem she could have solved herself in 5 minutes if you hadn’t insisted that she do it the idiomatic R way. Then you remember exactly why R is the master troll of statistical languages.

 

 

* R users will probably notice that I use the = operator for assignment instead of the <- operator even though the latter is the officially prescribed way to do it in R (i.e., a <- 2 is favored over a = 2). That’s because these two idioms are interchangeable in all but one (rare) use case, and personally I prefer to avoid extra keystrokes whenever possible. But the fact that you can do even basic assignment in two completely different ways in R drives home the point about how pathologically flexible–and, to a new user, confusing–the language is.

A very classy reply from Karl Friston

After writing my last post critiquing Karl Friston’s commentary in NeuroImage, I emailed him the link, figuring he might want the opportunity to respond, and also to make sure he knew my commentary wasn’t intended as a personal attack (I have enormous respect for his seminal contributions to the field of neuroimaging). Here’s his very classy reply (posted with permission):

Many thanks for your kind e-mail and link to your blog. I thought your review and deconstruction of the issues were excellent and I concur with the points that you make.

You are absolutely right that I ignored the use of high (corrected) thresholds when controlling for multiple comparisons – and was focusing on the simple case of a single test. I also agree that, ideally, one would report confidence intervals on effect sizes – indeed the original version of my article concluded with this recommendation (now the last line of appendix 1). I remember – at the inception of SPM – discussing with Andrew Holmes the reporting of confidence intervals using statistical maps – however, the closest we ever got was posterior probability maps (PPM), many years later.

My agenda was probably a bit simpler than you might have supposed – it was to point out that significant p-values from small sample studies are valid and will – on average – detect effects whose sizes are bigger than the equivalent effects with larger sample sizes. I did not mean to imply that large studies are useless – although I do believe that unqualified reports of significant p-values from large sample sizes should be treated with caution. Although my agenda was fairly simple, the issues raised may well require more serious consideration – of the sort that you have offered. I submitted the article as a ‘comments and controversy’, anticipating that it would elicit a thoughtful response of the sort in your blog. If you have not done so already; you could prepare your blog for peer-reviewed submission – perhaps as a response to the ‘comments and controversy’ at NeuroImage?

I will not respond to your blog directly; largely because I have never blogged before and prefer to restrict myself to peer-reviewed formats. However, please feel free to use this e-mail in any way you see fit.

With very best wishes,

Karl

PS: although you may have difficulty believing it – all the critiques I caricatured I have actually seen in one form or another – even the retinotopic mapping critique!

Seeing as my optimistic thought when I sent Friston the link was “I hope he doesn’t eat me alive” (not because he has that kind of reputation, but because, frankly, if someone obnoxiously sent me a link to an abrasive article criticizing my work at length, I might not be very happy either), I was very happy to read that. I wrote back:

Thanks very much for your gracious reply–especially since the tone of my commentary was probably a bit abrasive. If I’m being honest with myself, I’m pretty sure I’d have a hard time setting my ego aside long enough to respond this constructively if someone criticized me like this (no matter how I felt about the substantive issues), so it’s very much appreciated.

I won’t take up any of the substantive issues here, since it sounds like we’re in reasonable agreement on most of the issues. As far as submitting a formal response to NeuroImage, I’d normally be happy to do that, but I’m currently boycotting Elsevier journals as part of the Cost of Knowledge campaign, and feel pretty strongly about that, so I won’t be submitting anything to NeuroImage for the foreseeable future. This isn’t meant as an indictment of the NeuroImage editorial board or staff in any way; it’s strictly out of frustration at Elsevier’s policies and recent actions.

Also, while I like the comments and controversies format at NeuroImage a lot, there’s no question that the process is considerably slower than what online communication affords. The reality is that by the time my comment ever came out (probably in a much abridged form), much of the community will have moved on and lost interest in the issue, and I’ve found in the past that the kind of interactive and rapid engagement it’s possible to get online is very hard to approximate in a print forum. But I can completely understand your hesitation to respond this way; it could quickly become unmanageable. For what it’s worth, I don’t really think blogs are the right medium for this kind of thing in the long term anyway, but until we get publisher-independent evaluation platforms that centralize the debate in one place (which I’m hopeful will happen relatively soon), I think they play a useful role.

Anyway, whatever your opinion of the original commentary and/or my post, I think Friston deserves a lot of credit for his response, which, I’ll just reiterate again, is much more civil and tactful than mine would probably have been in his situation. I can’t think of many cases either in print or online when someone has responded so constructively to criticism.

One other thing I forgot to mention in my reply to Friston, but is worth bringing up here: I think SPM confidence interval maps would be a great idea! It would be fantastic if fMRI analysis packages by default produced 3 effect size maps for every analysis–respectively giving the observed, lower bound, and upper bound estimates of effect size at every voxel. This would naturally discourage researchers from making excessively strong claims (since one imagines almost everyone would at least glance at the lower-bound map) while providing reviewers a very easy way to frame concerns about power and sample size (“can the authors please present the confidence interval maps in the appendix?”). Anyone want to write an SPM plug-in to do this?
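None of this would even be hard to prototype outside the standard packages. Here's a rough sketch in plain NumPy/SciPy (not an SPM plug-in, and the subject-level maps are simulated stand-ins) of what producing the three maps might look like:

import numpy as np
from scipy import stats
maps = np.random.default_rng(0).normal(size=(16, 64, 64, 32))  # stand-in: one effect map per subject, 16 subjects
n = maps.shape[0]
mean_map = maps.mean(axis=0)  # observed effect estimate at every voxel
sem_map = maps.std(axis=0, ddof=1) / np.sqrt(n)  # standard error at every voxel
crit = stats.t.ppf(0.975, df=n - 1)  # two-tailed 95% critical value
lower_map = mean_map - crit * sem_map  # the lower-bound map everyone would glance at first
upper_map = mean_map + crit * sem_map  # the upper-bound map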

Sixteen is not magic: Comment on Friston (2012)

UPDATE: I’ve posted a very classy email response from Friston here.

In a “comments and controversies” piece published in NeuroImage last week, Karl Friston describes “Ten ironic rules for non-statistical reviewers”. As the title suggests, the piece is presented ironically; Friston frames it as a series of guidelines reviewers can follow in order to ensure successful rejection of any neuroimaging paper. But of course, Friston’s real goal is to convince you that the practices described in the commentary are bad ones, and that reviewers should stop picking on papers for such things as having too little power, not cross-validating results, and not being important enough to warrant publication.

Friston’s piece is, simultaneously, an entertaining satire of some lamentable reviewer practices, and—in my view, at least—a frustratingly misplaced commentary on the relationship between sample size, effect size, and inference in neuroimaging. While it’s easy to laugh at some of the examples Friston gives, many of the positions Friston presents and then skewers aren’t just humorous portrayals of common criticisms; they’re simply bad caricatures of comments that I suspect only a small fraction of reviewers ever make. Moreover, the cures Friston proposes—most notably, the recommendation that sample sizes on the order of 16 to 32 are just fine for neuroimaging studies—are, I’ll argue, much worse than the diseases he diagnoses.

Before taking up the objectionable parts of Friston’s commentary, I’ll just touch on the parts I don’t think are particularly problematic. Of the ten rules Friston discusses, seven seem palatable, if not always helpful:

  • Rule 6 seems reasonable; there does seem to be excessive concern about the violation of assumptions of standard parametric tests. It’s not that this type of thing isn’t worth worrying about at some point, just that there are usually much more egregious things to worry about, and it’s been demonstrated that the most common parametric tests are (relatively) insensitive to violations of normality under realistic conditions.
  • Rule 10 is also on point; given that we know the reliability of peer review is very low, it’s problematic when reviewers make the subjective assertion that a paper just isn’t important enough to be published in such-and-such journal, even as they accept that it’s technically sound. Subjective judgments about importance and innovation should be left to the community to decide. That’s the philosophy espoused by open-access venues like PLoS ONE and Frontiers, and I think it’s a good one.
  • Rules 7 and 9—criticizing a lack of validation or a failure to run certain procedures—aren’t wrong, but seem to me much too broad to support blanket pronouncements. Surely much of the time when reviewers highlight missing procedures, or complain about a lack of validation, there are perfectly good reasons for doing so. I don’t imagine Friston is really suggesting that reviewers should stop asking authors for more information or for additional controls when they think it’s appropriate, so it’s not clear what the point of including this here is. The example Friston gives in Rule 9 (of requesting retinotopic mapping in an olfactory study), while humorous, is so absurd as to be worthless as an indictment of actual reviewer practices. In fact, I suspect it’s so absurd precisely because anything less extreme Friston could have come up with would have caused readers to think, “but wait, that could actually be a reasonable concern…”
  • Rules 1, 2, and 3 seem reasonable as far as they go; it’s just common sense to avoid overconfidence, arguments from emotion, and tardiness. Still, I’m not sure what’s really accomplished by pointing this out; I doubt there are very many reviewers who will read Friston’s commentary and say “you know what, I’m an overconfident, emotional jerk, and I’m always late with my reviews–I never realized this before.” I suspect the people who fit that description—and for all I know, I may be one of them—will be nodding and chuckling along with everyone else.

This leaves Rules 4, 5, and 8, which, conveniently, all focus on a set of interrelated issues surrounding low power, effect size estimation, and sample size. Because Friston’s treatment of these issues strikes me as dangerously wrong, and liable to send a very bad message to the neuroimaging community, I’ve laid out some of these issues in considerably more detail than you might be interested in. If you just want the direct rebuttal, skip to the “Reprising the rules” section below; otherwise the next two sections sketch Friston’s argument for using small sample sizes in fMRI studies, and then describe some of the things wrong with it.

Friston’s argument

Friston’s argument is based on three central claims:

  1. Classical inference (i.e., the null hypothesis testing framework) suffers from a critical flaw, which is that the null is always false: no effects (at least in psychology) are ever truly zero. Collect enough data and you will always end up rejecting the null hypothesis with probability 1.
  2. Researchers care more about large effects than about small ones. In particular, there is some size of effect that any given researcher will call ‘trivial’, below which that researcher is uninterested in the effect.
  3. If the null hypothesis is always false, and if some effects are not worth caring about in practical terms, then researchers who collect very large samples will invariably end up identifying many effects that are statistically significant but completely uninteresting.

I think it would be hard to dispute any of these claims. The first one is the source of persistent statistical criticism of the null hypothesis testing framework, and the second one is self-evidently true (if you doubt it, ask yourself whether you would really care to continue your research if you knew with 100% confidence that all of your effects would never be any larger than one one-thousandth of a standard deviation). The third one follows directly from the first two.

Where Friston’s commentary starts to depart from conventional wisdom is in the implications he thinks these premises have for the sample sizes researchers should use in neuroimaging studies. Specifically, he argues that since large samples will invariably end up identifying trivial effects, whereas small samples will generally only have power to detect large effects, it’s actually in neuroimaging researchers’ best interest not to collect a lot of data. In other words, Friston turns what most commentators have long considered a weakness of fMRI studies—their small sample size—into a virtue.

Here’s how he characterizes an imaginary reviewer’s misguided concern about low power:

Reviewer: Unfortunately, this paper cannot be accepted due to the small number of subjects. The significant results reported by the authors are unsafe because the small sample size renders their design insufficiently powered. It may be appropriate to reconsider this work if the authors recruit more subjects.

Friston suggests that the appropriate response from a clever author would be something like the following:

Response: We would like to thank the reviewer for his or her comments on sample size; however, his or her conclusions are statistically misplaced. This is because a significant result (properly controlled for false positives), based on a small sample indicates the treatment effect is actually larger than the equivalent result with a large sample. In short, not only is our result statistically valid. It is quantitatively more significant than the same result with a larger number of subjects.

This is supported by an extensive appendix (written non-ironically), where Friston presents a series of nice sensitivity and classification analyses intended to give the reader an intuitive sense of what different standardized effect sizes mean, and what the implications are for the detection of statistically significant effects using a classical inference (i.e., hypothesis testing) approach. The centerpiece of the appendix is a loss-function analysis where Friston pits the benefit of successfully detecting a large effect (which he defines as a Cohen’s d of 1, i.e., an effect of one standard deviation) against the cost of rejecting the null when the effect is actually trivial (defined as a d of 0.125 or less). Friston notes that the loss function is minimized (i.e., the gap between the detection rate for large effects and the detection rate for trivial effects is maximized) when n = 16, which is where the number he repeatedly quotes as a reasonable sample size for fMRI studies comes from. (Actually, as I discuss in my Appendix I below, I think Friston’s power calculations are off, and the right number, even given his assumptions, is more like 22. But the point is, it’s a small number either way.)
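
To make that trade-off tangible, here’s one simple way to set it up in Python using statsmodels’ power routines: for each candidate n, compute the detection rate for a ‘large’ effect of d = 1 minus the detection rate for a ‘trivial’ effect of d = 0.125, assuming a one-sample t-test at an uncorrected p < .05, and find the n that maximizes the gap. This is a simplification of Friston’s actual loss function, so treat the output as illustrative rather than definitive:

```python
import numpy as np
from statsmodels.stats.power import TTestPower

tt = TTestPower()                      # one-sample (or paired) t-test power
ns = np.arange(5, 101)                 # candidate sample sizes

# Detection rate for d = 1 minus detection rate for d = 0.125 at each n.
gap = [tt.power(effect_size=1.0, nobs=n, alpha=0.05)
       - tt.power(effect_size=0.125, nobs=n, alpha=0.05) for n in ns]

print("gap maximized at n =", ns[int(np.argmax(gap))])
```

Under these simplified assumptions the optimum lands in the low twenties rather than at exactly 16 (the same neighborhood as the n of 22 discussed in Appendix I below), and swapping in the two-sample analogue (statsmodels’ TTestIndPower) or a stricter alpha pushes it up very quickly.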

It’s important to note that Friston is not shy about asserting his conclusion that small samples are just fine for neuroimaging studies—especially in the Appendices, which are not intended to be ironic. He makes claims like the following:

The first appendix presents an analysis of effect size in classical inference that suggests the optimum sample size for a study is between 16 and 32 subjects. Crucially, this analysis suggests significant results from small samples should be taken more seriously than the equivalent results in oversized studies.

And:

In short, if we wanted to optimise the sensitivity to large effects but not expose ourselves to trivial effects, sixteen subjects would be the optimum number.

And:

In short, if you cannot demonstrate a significant effect with sixteen subjects, it is probably not worth demonstrating.

These are very strong claims delivered with minimal qualification, and given Friston’s influence, they could lead many reviewers to discount their own prior concerns about small sample size and low power—which would be disastrous for the field. So I think it’s important to explain exactly why Friston is wrong, and why his recommendations regarding sample size shouldn’t be taken seriously.

What’s wrong with the argument

Broadly speaking, there are three problems with Friston’s argument. The first one is that Friston presents the absolute best-case scenario as if it were typical. Specifically, the recommendation that a sample of 16 – 32 subjects is generally adequate for fMRI studies assumes that  fMRI researchers are conducting single-sample t-tests at an uncorrected threshold of p < .05; that they only care about effects on the order of 1 sd in size; and that any effect smaller than d = .125 is trivially small and is to be avoided. If all of this were true, an n of 16 (or rather, 22—see Appendix I below) might be reasonable. But it doesn’t really matter, because if you make even slightly less optimistic assumptions, you end up in a very different place. For example, for a two-sample t-test at p < .001 (a very common scenario in group difference studies), the optimal sample size, according to Friston’s own loss-function analysis, turns out to be 87 per group, or 174 subjects in total.

I discuss the problems with the loss-function analysis in much more detail in Appendix I below; the main point here is that even if you take Friston’s argument at face value, his own numbers put the lie to the notion that a sample size of 16 – 32 is sufficient for the majority of cases. It flatly isn’t. There’s nothing magic about 16, and it’s very bad advice to suggest that authors should routinely shoot for sample sizes this small when conducting their studies given that Friston’s own analysis would seem to demand a much larger sample size the vast majority of the time.

What about uncertainty?

The second problem is that Friston’s argument entirely ignores the role of uncertainty in drawing inferences about effect sizes. The notion that an effect that comes from a small study is likely to be bigger than one that comes from a larger study may be strictly true in the sense that, for any fixed p value, the observed effect size necessarily varies inversely with sample size. It’s true, but it’s also not very helpful. The reason it’s not helpful is that while the point estimate of statistically significant effects obtained from a small study will tend to be larger, the uncertainty around that estimate is also greater—and with sample sizes in the neighborhood of 16 – 20, will typically be so large as to be nearly worthless. For example, a correlation of r = .75 sounds huge, right? But when that correlation is detected at a threshold of p < .001 in a sample of 16 subjects, the corresponding 99.9% confidence interval is .06 – .95—a range so wide as to be almost completely uninformative.
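
For anyone who wants to check that interval, the standard Fisher z-transform approximation reproduces it in a few lines of Python (the numbers are the ones from the example above):

```python
import numpy as np
from scipy import stats

def r_confidence_interval(r, n, level=0.999):
    """Fisher z-transform confidence interval for a Pearson correlation."""
    z = np.arctanh(r)                        # Fisher z of the observed r
    se = 1.0 / np.sqrt(n - 3)                # standard error of z
    crit = stats.norm.ppf(1 - (1 - level) / 2)
    lo, hi = np.tanh([z - crit * se, z + crit * se])
    return lo, hi

print(r_confidence_interval(0.75, 16))       # roughly (.06, .95)
```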

Fortunately, what Friston argues small samples can do for us indirectly—namely, establish that effect sizes are big enough to care about—can be done much more directly, simply by looking at the uncertainty associated with our estimates. That’s exactly what confidence intervals are for. If our goal is to ensure that we only end up talking about results big enough to care about, it’s surely better to answer the question “how big is the effect?” by saying, “d = 1.1, with a 95% confidence interval of 0.2 – 2.1” than by saying “well it’s statistically significant at p < .001 in a sample of 16 subjects, so it’s probably pretty big”. In fact, if you take the latter approach, you’ll be wrong quite often, for the simple reason that p values will generally be closer to the statistical threshold with small samples than with big ones. Remember that, by definition, the point at which one is allowed to reject the null hypothesis is also the point at which the relevant confidence interval borders on zero. So it doesn’t really matter whether your sample is small or large; if you only just barely managed to reject the null hypothesis, you cannot possibly be in a good position to conclude that the effect is likely to be a big one.

As far as I can tell, Friston completely ignores the role of uncertainty in his commentary. For example, he gives the following example, which is supposed to convince you that you don’t really need large samples:

Imagine we compared the intelligence quotient (IQ) between the pupils of two schools. When comparing two groups of 800 pupils, we found mean IQs of 107.1 and 108.2, with a difference of 1.1. Given that the standard deviation of IQ is 15, this would be a trivial effect size … In short, although the differential IQ may be extremely significant, it is scientifically uninteresting … Now imagine that your research assistant had the bright idea of comparing the IQ of students who had and had not recently changed schools. On selecting 16 students who had changed schools within the past five years and 16 matched pupils who had not, she found an IQ difference of 11.6, where this medium effect size just reached significance. This example highlights the difference between an uninformed overpowered hypothesis test that gives very significant, but uninformative results and a more mechanistically grounded hypothesis that can only be significant with a meaningful effect size.

But the example highlights no such thing. One is not entitled to conclude, in the latter case, that the true effect must be medium-sized just because it came from a small sample. If the effect only just reached significance, the confidence interval by definition just barely excludes zero, and we can’t say anything meaningful about the size of the effect, but only about its sign (i.e., that it was in the expected direction)—which is (in most cases) not nearly as useful.

In fact, we will generally be in a much worse position with a small sample than a large one, because with a large sample, we at least stand a chance of being able to distinguish small effects from large ones. Recall that Friston advises against collecting very large samples for the very reason that they are likely to produce a wealth of statistically-significant-but-trivially-small effects. Well, maybe so, but so what? Why would it be a bad thing to detect trivial effects so long as we were also in an excellent position to know that those effects were trivial? Nothing about the hypothesis-testing framework commits us to treating all of our statistically significant results as though they’re equally important. If we have a very large sample, and some of our effects have confidence intervals from 0.02 to 0.15 while others have CIs from 0.42 to 0.52, we would be wise to focus most of our attention on the latter rather than the former. At the very least this seems like a more reasonable approach than deliberately collecting samples so small that they will rarely be able to tell us anything meaningful about the size of our effects.

What about the prior?

The third, and arguably biggest, problem with Friston’s argument is that it completely ignores the prior—i.e., the expected distribution of effect sizes across the brain. Friston’s commentary assumes a uniform prior everywhere; for the analysis to go through, one has to believe that trivial effects and very large effects are equally likely to occur. But this is patently absurd; while that might be true in select situations, by and large, we should expect small effects to be much more common than large ones. In a previous commentary (on the Vul et al “voodoo correlations” paper), I discussed several reasons for this; rather than go into detail here, I’ll just summarize them:

  • It’s frankly just not plausible to suppose that effects are really as big as they would have to be in order to support adequately powered analyses with small samples. For example, a correlational analysis with 20 subjects at p < .001 would require a population effect size of r = .77 to have 80% power. If you think it’s plausible that focal activation in a single brain region can explain 60% of the variance in a complex trait like fluid intelligence or extraversion, I have some property under a bridge I’d like you to come by and look at. (A quick way to check that power figure for yourself appears just after this list.)
  • The low-hanging fruit get picked off first. Back when fMRI was in its infancy in the mid-1990s, people could indeed publish findings based on samples of 4 or 5 subjects. I’m not knocking those studies; they taught us a huge amount about brain function. In fact, it’s precisely because they taught us so much about the brain that researchers can no longer stick 5 people in a scanner and report that doing a working memory task robustly activates the frontal cortex. Nowadays, identifying an interesting effect is more difficult—and if that effect were really enormous, odds are someone would have found it years ago. But this shouldn’t surprise us; neuroimaging is now a relatively mature discipline, and effects on the order of 1 sd or more are extremely rare in most mature fields (for a nice review, see Meyer et al (2001)).
  • fMRI studies with very large samples invariably seem to report much smaller effects than fMRI studies with small samples. This can only mean one of two things: (a) large studies are done much more poorly than small studies (implausible—if anything, the opposite should be true); or (b) the true effects are actually quite small in both small and large fMRI studies, but they’re inflated by selection bias in small studies, whereas large studies give an accurate estimate of their magnitude (very plausible).
  • Individual differences or between-group analyses, which have much less power than within-subject analyses, tend to report much sparser activations. Again, this is consistent with the true population effects being on the small side.
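
As promised, here’s a quick way to check the power claim in the first bullet, using the Fisher z approximation to the sampling distribution of a correlation. This is only an approximation; it lands a shade below the .77 quoted above, which presumably comes from a slightly more exact calculation, but the substantive point is identical:

```python
import numpy as np
from scipy import stats, optimize

def corr_power(r, n, alpha=0.001):
    """Approximate two-sided power for a test of a Pearson r, via Fisher's z."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ncp = np.arctanh(r) * np.sqrt(n - 3)
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

# Population r needed for 80% power with n = 20 at p < .001:
r_needed = optimize.brentq(lambda r: corr_power(r, 20) - 0.80, 0.05, 0.99)
print(round(r_needed, 2))   # ~.76 with this approximation
```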

To be clear, I’m not saying there are never any large effects in fMRI studies. Under the right circumstances, there certainly will be. What I’m saying is that, in the absence of very good reasons to suppose that a particular experimental manipulation is going to produce a large effect, our default assumption should be that the vast majority of (interesting) experimental contrasts are going to produce diffuse and relatively weak effects.

Note that Friston’s assertion that “if one finds a significant effect with a small sample size, it is likely to have been caused by a large effect size” depends entirely on the prior effect size distribution. If the brain maps we look at are actually dominated by truly small effects, then it’s simply not true that a statistically significant effect obtained from a small sample is likely to have been caused by a large effect size. We can see this easily by thinking of a situation in which an experiment has a weak but very diffuse effect on brain activity. Suppose that the entire brain showed ‘trivial’ effects of d = 0.125 in the population, and that there were actually no large effects at all. A one-sample t-test at p < .001 with a small sample (say, 16 subjects) has less than 1% power to detect this effect, so you might suppose, as Friston does, that we could discount the possibility that a significant effect would have come from a trivial effect size. And yet, because a whole-brain analysis typically involves tens of thousands of tests, there’s a very good chance such an analysis will end up identifying statistically significant effects somewhere in the brain. Unfortunately, because the only way to identify a trivial effect with a small sample is to capitalize on chance (Friston discusses this point in his Appendix II, and additional treatments can be found in Ioannidis (2008), or in my 2009 commentary), that tiny effect won’t look tiny when we examine it; it will in all likelihood look enormous.
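
That scenario takes about ten lines to simulate. The sketch below (my own toy setup: independent voxels, no spatial smoothing, no correction for multiple comparisons) plants a true effect of d = 0.125 at every one of 50,000 voxels, tests each voxel with 16 subjects at p < .001, and then looks at the observed effect size in the voxels that survive:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_voxels, d_true = 16, 50_000, 0.125

# Every voxel has the same trivial true effect; test each one separately.
data = rng.normal(d_true, 1.0, size=(n_subjects, n_voxels))
t, p = stats.ttest_1samp(data, 0.0, axis=0)
survivors = p < .001
d_observed = data.mean(axis=0) / data.std(axis=0, ddof=1)

print(survivors.sum(), "voxels significant out of", n_voxels)
print("mean observed d among survivors:",
      round(float(d_observed[survivors].mean()), 2))
```

Some voxels reliably survive, and every one of them is guaranteed to look huge: with n = 16 and p < .001, a voxel can only cross the threshold if its observed d exceeds roughly t_crit/√n ≈ 4.07/4 ≈ 1.0 in absolute value, eight times the true effect.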

Since they say a picture is worth a thousand words, here’s one (from an unpublished paper in progress):

The top panel shows you a hypothetical distribution of effects (Pearson’s r) in a 2-dimensional ‘brain’ in the population. Note that there aren’t any astronomically strong effects (though the white circles indicate correlations of .5 or greater, which are certainly very large). The bottom panel shows what happens when you draw random samples of various sizes from the population and use different correction thresholds/approaches. You can see that the conclusion you’d draw if you followed Friston’s advice—i.e., that any effect you observe with n = 20 must be pretty robust to survive correction—is wrong; the isolated region that survives correction at FDR = .05, while ‘real’ in a trivial sense, is not in fact very strong in the true map—it just happens to be grossly inflated by sampling error. This is to be expected; when power is very low but the number of tests you’re performing is very large, the odds are good that you’ll end up identifying some real effect somewhere in the brain–and the estimated effect size within that region will be grossly distorted because of the selection process.

Encouraging people to use small samples is a sure way to ensure that researchers continue to publish highly biased findings that lead other researchers down garden paths trying unsuccessfully to replicate ‘huge’ effects. It may make for an interesting, more publishable story (who wouldn’t rather talk about the single cluster that supports human intelligence than about the complex, highly distributed pattern of relatively weak effects?), but it’s bad science. It’s exactly the same problem geneticists confronted ten or fifteen years ago when the first candidate gene and genome-wide association studies (GWAS) seemed to reveal remarkably strong effects of single genetic variants that subsequently failed to replicate. And it’s the same reason geneticists now run association studies with 10,000+ subjects and not 300.

Unfortunately, the costs of fMRI scanning haven’t come down the same way the costs of genotyping have, so there’s tremendous resistance at present to the idea that we really do need to routinely acquire much larger samples if we want to get a clear picture of how big effects really are. Be that as it may, we shouldn’t indulge in wishful thinking just because of logistical constraints. The fact that it’s difficult to get good estimates doesn’t mean we should pretend our bad estimates are actually good ones.

What’s right with the argument

Having criticized much of Friston’s commentary, I should note that there’s one part I like a lot, and that’s the section on protected inference in Appendix I. The point Friston makes here is that you can still use a standard hypothesis testing approach fruitfully—i.e., without falling prey to the problem of classical inference—so long as you explicitly protect against the possibility of identifying trivial effects. Friston’s treatment is mathematical, but all he’s really saying here is that it makes sense to test against an interval null (a range of values) rather than a point null. I’ve advocated the same approach before (e.g., here), as I’m sure many other people have. The point is simple: if you think an effect of, say, 1/8th of a standard deviation is too small to care about, then you should define a ‘pseudonull’ hypothesis of d = -.125 to .125 instead of a null of exactly zero.

Once you do that, any time you reject the null, you’re entitled to conclude with reasonable certainty that your effects are in fact non-trivial in size. So I completely agree with Friston when he observes in the conclusion to Appendix I that:

…the adage ‘you can never have enough data’ is also true, provided one takes care to protect against inference on trivial effect sizes – for example using protected inference as described above.

Of course, the reason I agree with it is precisely because it directly contradicts Friston’s dominant recommendation to use small samples. In fact, since ruling out a whole range of non-zero values is harder than rejecting a point null of zero, when you actually perform power calculations based on protected inference, it becomes immediately apparent just how inadequate samples on the order of 16 – 32 subjects will be most of the time (e.g., detecting an effect of d = 0.5 with 80% power using a one-sample t-test at p < .05 requires 33 subjects when testing against a null of zero, but if you also want to rule out all ‘trivial’ effect sizes of |d| <= .125, that n climbs to upwards of 50).
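
A crude way to see that arithmetic in Python is to power the test for the distance between the effect you expect and the edge of the trivial band, rather than for the effect itself. This is only an approximation to a proper protected, interval-null test, but it reproduces the ballpark:

```python
from statsmodels.stats.power import TTestPower

tt = TTestPower()

# Ordinary point null of zero: n needed to detect d = 0.5 at p < .05, 80% power.
n_zero_null = tt.solve_power(effect_size=0.5, alpha=0.05, power=0.8)

# Crude protected-inference stand-in: shift the null to the edge of the
# 'trivial' band (|d| <= .125), so the detectable distance shrinks to 0.375.
n_protected = tt.solve_power(effect_size=0.5 - 0.125, alpha=0.05, power=0.8)

print(round(n_zero_null), round(n_protected))   # first ~33; second well north of 50
```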

Reprising the rules

With the above considerations in mind, we can now turn back to Friston’s rules 4, 5, and 8, and see why his admonitions to reviewers are uncharitable at best and insensible at worst. First, Rule 4 (the under-sampled study). Here’s the kind of comment Friston (ironically) argues reviewers should avoid:

 Reviewer: Unfortunately, this paper cannot be accepted due to the small number of subjects. The significant results reported by the authors are unsafe because the small sample size renders their design insufficiently powered. It may be appropriate to reconsider this work if the authors recruit more subjects.

Perhaps many reviewers make exactly this argument; I haven’t been an editor, so I don’t know (though I can say that I’ve read many reviews of papers I’ve co-reviewed and have never actually seen this particular variant). But even if we give Friston the benefit of the doubt and accept that one shouldn’t question the validity of a finding on the basis of small samples (i.e., we accept that p values mean the same thing in large and small samples), that doesn’t mean the more general critique from low power is itself a bad one. To the contrary, a much better form of the same criticism–and one that I’ve raised frequently myself in my own reviews–is the following:

 Reviewer: the authors draw some very strong conclusions in their Discussion about the implications of their main finding. But their finding issues from a sample of only 16 subjects, and the confidence interval around the effect is consequently very large, and nearly includes zero. In other words, the authors’ findings are entirely consistent with the effect they report actually being very small–quite possibly too small to care about. The authors should either weaken their assertions considerably, or provide additional evidence for the importance of the effect.

Or another closely related one, which I’ve also raised frequently:

 Reviewer: the authors tout their results as evidence that region R is ‘selectively’ activated by task T. However, this claim is based entirely on the fact that region R was the only part of the brain to survive correction for multiple comparisons. Given that the sample size in question is very small, and power to detect all but the very largest effects is consequently very low, the authors are in no position to conclude that the absence of significant effects elsewhere in the brain suggests selectivity in region R. With this small a sample, the authors’ data are entirely consistent with the possibility that many other brain regions are just as strongly activated by task T, but failed to attain significance due to sampling error. The authors should either avoid making any claim that the activity they observed is selective, or provide direct statistical support for their assertion of selectivity.

Neither of these criticisms can be defused by suggesting that effect sizes from smaller samples are likely to be larger than effect sizes from large studies. And it would be disastrous for the field of neuroimaging if Friston’s commentary succeeded in convincing reviewers to stop criticizing studies on the basis of low power. If anything, we collectively need to focus far greater attention on issues surrounding statistical power.

Next, Rule 5 (the over-sampled study):

Reviewer: I would like to commend the authors for studying such a large number of subjects; however, I suspect they have not heard of the fallacy of classical inference. Put simply, when a study is overpowered (with too many subjects), even the smallest treatment effect will appear significant. In this case, although I am sure the population effects reported by the authors are significant; they are probably trivial in quantitative terms. It would have been much more compelling had the authors been able to show a significant effect without resorting to large sample sizes. However, this was not the case and I cannot recommend publication.

I’ve already addressed this above; the problem with this line of reasoning is that nothing says you have to care equally about every statistically significant effect you detect. If you ever run into a reviewer who insists that your sample is overpowered and has consequently produced too many statistically significant effects, you can simply respond like this:

 Response: we appreciate the reviewer’s concern that our sample is potentially overpowered. However, this strikes us as a limitation of classical inference rather than a problem with our study. To the contrary, the benefit of having a large sample is that we are able to focus on effect sizes rather than on rejecting a null hypothesis that we would argue is meaningless to begin with. To this end, we now display a second, more conservative, brain activation map alongside our original one that raises the statistical threshold to the point where the confidence intervals around all surviving voxels exclude effects smaller than d = .125. The reviewer can now rest assured that our results protect against trivial effects. We would also note that this stronger inference would not have been possible if our study had had a much smaller sample.

There is rarely if ever a good reason to criticize authors for having a large sample after it’s already collected. You can always raise the statistical threshold to protect against trivial effects if you need to; what you can’t easily do is magic more data into existence in order to shrink your confidence intervals.

Lastly, Rule 8 (exploiting ‘superstitious’ thinking about effect sizes):

 Reviewer: It appears that the authors are unaware of the dangers of voodoo correlations and double dipping. For example, they report effect sizes based upon data (regions of interest) previously identified as significant in their whole brain analysis. This is not valid and represents a pernicious form of double dipping (biased sampling or non-independence problem). I would urge the authors to read Vul et al. (2009) and Kriegeskorte et al. (2009) and present unbiased estimates of their effect size using independent data or some form of cross validation.

Friston’s recommended response is to point out that concerns about double-dipping are misplaced, because the authors are typically not making any claims that the reported effect size is an accurate representation of the population value, but only following standard best-practice guidelines to include effect size measures alongside p values. This would be a fair recommendation if it were true that reviewers frequently object to the mere act of reporting effect sizes based on the specter of double-dipping; but I simply don’t think this is an accurate characterization. In my experience, the impetus for bringing up double-dipping is almost always one of two things: (a) authors getting overly excited about the magnitude of the effects they have obtained, or (b) authors conducting non-independent tests and treating them as though they were independent (e.g., when identifying an ROI based on a comparison of conditions A and B, and then reporting a comparison of A and C without considering the bias inherent in this second test). Both of these concerns are valid and important, and it’s a very good thing that reviewers bring them up.

The right way to determine sample size

If we can’t rely on blanket recommendations to guide our choice of sample size, then what? Simple: perform a power calculation. There’s no mystery to this; both brief and extended treatises on statistical power are all over the place, and power calculators for most standard statistical tests are available online as well as in most off-line statistical packages (e.g., I use the pwr package for R). For more complicated statistical tests for which analytical solutions aren’t readily available (e.g., fancy interactions involving multiple within- and between-subject variables), you can get reasonably good power estimates through simulation.
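
For those working in Python, statsmodels offers pwr-style calculators. The target numbers below (d = 0.5, alpha = .001, 80% power) are placeholders you would replace with values appropriate to your own design:

```python
from statsmodels.stats.power import TTestPower, TTestIndPower

# n for a one-sample (or paired) t-test: detect d = 0.5 at p < .001 with 80% power.
n_one_sample = TTestPower().solve_power(effect_size=0.5, alpha=0.001, power=0.8)

# n per group for a two-sample t-test under the same assumptions.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.001, power=0.8)

print(round(n_one_sample), "subjects;", round(n_per_group), "per group")
```

For designs too complicated for closed-form calculators, the simulation approach mentioned above is sketched in Appendix I below.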

Of course, there’s no guarantee you’ll like the answers you get. Actually, in most cases, if you’re honest about the numbers you plug in, you probably won’t like the answer you get. But that’s life; nature doesn’t care about making things convenient for us. If it turns out that it takes 80 subjects to have adequate power to detect the effects we care about and expect, we can (a) suck it up and go for n = 80, (b) decide not to run the study, or (c) accept that logistical constraints mean our study will have less power than we’d like (which implies that any results we obtain will offer only a fractional view of what’s really going on). What we don’t get to do is look the other way and pretend that it’s just fine to go with 16 subjects simply because the last time we did that, we got this amazingly strong, highly selective activation that successfully made it into a good journal. That’s the same logic that repeatedly produced unreplicable candidate gene findings in the 1990s, and, if it continues to go unchecked in fMRI research, risks turning the field into a laughing stock among other scientific disciplines.

Conclusion

The point of all this is not to convince you that it’s impossible to do good fMRI research with just 16 subjects, or that reviewers don’t sometimes say silly things. There are many questions that can be answered with 16 or even fewer subjects, and reviewers most certainly do say silly things (I sometimes cringe when re-reading my own older reviews). The point is that blanket pronouncements, particularly when made ironically and with minimal qualification, are not helpful in advancing the field, and can be very damaging. It simply isn’t true that there’s some magic sample size range like 16 to 32 that researchers can bank on reflexively. If there’s any generalization that we can allow ourselves, it’s probably that, under reasonable assumptions, Friston’s recommendations are much too conservative. Typical effect sizes and analysis procedures will generally require much larger samples than neuroimaging researchers are used to collecting. But again, there’s no substitute for careful case-by-case consideration.

In the natural course of things, there will be cases where n = 4 is enough to detect an effect, and others where the effort is questionable even with 100 subjects; unfortunately, we won’t know which situation we’re in unless we take the time to think carefully and dispassionately about what we’re doing. It would be nice to believe otherwise; certainly, it would make life easier for the neuroimaging community in the short term. But since the point of doing science is to discover what’s true about the world, and not to publish an endless series of findings that sound exciting but don’t replicate, I think we have an obligation to both ourselves and to the taxpayers that fund our research to take the exercise more seriously.

Appendix I: Evaluating Friston’s loss-function analysis

In this appendix I review a number of weaknesses in Friston’s loss-function analysis, and show that under realistic assumptions, the recommendation to use sample sizes of 16 – 32 subjects is far too optimistic.

First, the numbers don’t seem to be right. I say this with a good deal of hesitation, because I have very poor mathematical skills, and I’m sure Friston is much smarter than I am. That said, I’ve tried several different power packages in R and finally resorted to empirically estimating power with simulated draws, and all approaches converge on numbers quite different from Friston’s. Even the sensitivity plots seem off by a good deal (for instance, Friston’s Figure 3 suggests around 30% sensitivity with n = 80 and d = 0.125, whereas all the sources I’ve consulted produce a value around 20%). In my analysis, the loss function is minimized at n = 22 rather than n = 16. I suspect the problem is with Friston’s approximation, but I’m open to the possibility that I’ve done something very wrong, and confirmations or disconfirmations are welcome in the comments below. In what follows, I’ll report the numbers I get rather than Friston’s (mine are somewhat more pessimistic, but the overarching point doesn’t change either way).
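
For anyone who wants to audit these numbers without trusting any particular package, here is the kind of brute-force Monte Carlo check I have in mind, in Python rather than R: simulate many datasets at a given n and true d, run the test on each, and count rejections. The example call estimates sensitivity at n = 80 and d = 0.125, the case mentioned above:

```python
import numpy as np
from scipy import stats

def simulated_power(n, d, alpha=0.05, n_sims=20_000, seed=0):
    """Monte Carlo power of a two-sided one-sample t-test against zero."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(d, 1.0, size=(n_sims, n))
    _, p = stats.ttest_1samp(samples, 0.0, axis=1)
    return (p < alpha).mean()

print(simulated_power(80, 0.125))   # ~.19-.20, in line with the ~20% figure above
```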

Second, there’s the statistical threshold. Friston’s analysis assumes that all of our tests are conducted without correction for multiple comparisons (i.e., at p < .05), but this clearly doesn’t apply to the vast majority of neuroimaging studies, which are either conducting massive univariate (whole-brain) analyses, or testing at least a few different ROIs or networks. As soon as you lower the threshold, the optimal sample size returned by the loss-function analysis increases dramatically. If the threshold is a still-relatively-liberal (for whole-brain analysis) p < .001, the loss function is now minimized at 48 subjects–hardly a welcome conclusion, and a far cry from 16 subjects. Since this is probably still the modal fMRI threshold, one could argue Friston should have been trumpeting a sample size of 48 all along—not exactly a ‘small’ sample size given the associated costs.

Third, the n = 16 (or 22) figure only holds for the simplest of within-subject tests (e.g., a one-sample t-test)–again, a best-case scenario (though certainly a common one). It doesn’t apply to many other kinds of tests that are the primary focus of a huge proportion of neuroimaging studies–for instance, two-sample t-tests, or interactions between multiple within-subject factors. In fact, if you apply the same analysis to a two-sample t-test (or equivalently, a correlation test), the optimal sample size turns out to be 82 (41 per group) at a threshold of p < .05, and a whopping 174 (87 per group) at a threshold of p < .001. In other words, if we were to follow Friston’s own guidelines, the typical fMRI researcher who aims to conduct a (liberal) whole-brain individual differences analysis should be collecting 174 subjects a pop. For other kinds of tests (e.g., 3-way interactions), even larger samples might be required.

Fourth, the claim that only large effects–i.e., those that can be readily detected with a sample size of 16–are worth worrying about is likely to annoy and perhaps offend any number of researchers who have perfectly good reasons for caring about effects much smaller than half a standard deviation. A cursory look at most literatures suggests that effects of 1 sd are not the norm; they’re actually highly unusual in mature fields. For perspective, the standardized difference in height between genders is about 1.5 sd; the validity of job interviews for predicting success is about .4 sd; and the effect of gender on risk-taking (men take more risks) is about .2 sd—what Friston would call a very small effect (for other examples, see Meyer et al., 2001). Against this backdrop, suggesting that only effects greater than 1 sd (about the strength of the association between height and weight in adults) are of interest would seem to preclude many, and perhaps most, questions that researchers currently use fMRI to address. Imaging genetics studies are immediately out of the picture; so too, in all likelihood, are cognitive training studies, most investigations of individual differences, and pretty much any experimental contrast that claims to very carefully isolate a relatively subtle cognitive difference. Put simply, if the field were to take Friston’s analysis seriously, the majority of its practitioners would have to pack up their bags and go home. Entire domains of inquiry would shutter overnight.

To be fair, Friston briefly considers the possibility that small effect sizes could be important. But he doesn’t seem to take it very seriously:

Can true but trivial effect sizes can ever be interesting? It could be that a very small effect size may have important implications for understanding the mechanisms behind a treatment effect and that one should maximise sensitivity by using large numbers of subjects. The argument against this is that reporting a significant but trivial effect size is equivalent to saying that one can be fairly confident the treatment effect exists but its contribution to the outcome measure is trivial in relation to other unknown effects…

The problem with the latter argument is that the real world is a complicated place, and most interesting phenomena have many causes. A priori, it is reasonable to expect that the vast majority of effects will be small. We probably shouldn’t expect any single genetic variant to account for more than a small fraction of the variation in brain activity, but that doesn’t mean we should give up entirely on imaging genetics. And of course, it’s worth remembering that, in the context of fMRI studies, when Friston talks about ‘very small effect sizes,’ that’s a bit misleading; even medium-sized effects (which Friston presumably agrees are interesting) could be almost impossible to detect at the sample sizes he recommends. For example, a one-sample t-test with n = 16 subjects detects an effect of d = 0.5 only 46% or 5% of the time at p < .05 and p < .001, respectively. Applying Friston’s own loss function analysis to detection of d = 0.5 returns an optimal sample size of n = 63 at p < .05 and n = 139 at p < .001—a message not entirely consistent with the recommendations elsewhere in his commentary.

Friston, K. (2012). Ten ironic rules for non-statistical reviewers. NeuroImage. DOI: 10.1016/j.neuroimage.2012.04.018

no free lunch in statistics

Simon and Tibshirani recently posted a short comment on the Reshef et al MIC data mining paper I blogged about a while back:

The proposal of Reshef et. al. (“MIC”) is an interesting new approach for discovering non-linear dependencies among pairs of measurements in exploratory data mining. However, it has a potentially serious drawback. The authors laud the fact that MIC has no preference for some alternatives over others, but as the authors know, there is no free lunch in Statistics: tests which strive to have high power against all alternatives can have low power in many important situations.

They then report some simulation results clearly demonstrating that MIC is (very) underpowered relative to Pearson correlation in most situations, and performs even worse relative to Székely & Rizzo’s distance correlation (which I hadn’t heard about, but will have to look into now). I mentioned low power as a potential concern in my own post, but figured it would be an issue under relatively specific circumstances (i.e., only for certain kinds of associations in relatively small samples). Simon & Tibshirani’s simulations pretty clearly demonstrate that isn’t so. Which, needless to say, rather dampens the enthusiasm for the MIC statistic.

large-scale data exploration, MIC-style

UPDATE 2/8/2012: Simon & Tibshirani posted a critical commentary on this paper here. See additional thoughts here.

Real-world data are messy. Relationships between two variables can take on an infinite number of forms, and while one doesn’t see, say, umbrella-shaped data very often, strange things can happen. When scientists talk about correlations or associations between variables, they’re usually referring to one very specific form of relationship–namely, a linear one. The assumption is that most associations between pairs of variables are reasonably well captured by positing that one variable increases in proportion to the other, with some added noise. In reality, of course, many associations aren’t linear, or even approximately so. For instance, many associations are cyclical (e.g., hours at work versus day of week), or curvilinear (e.g., heart attacks become precipitously more frequent past middle age), and so on.

Detecting a non-linear association is potentially just as easy as detecting a linear relationship if we know the form of that association up front. But there, of course, lies the rub: we generally don’t have strong intuitions about how most variables are likely to be non-linearly related. A more typical situation in many ‘big data’ scientific disciplines is that we have a giant dataset full of thousands or millions of observations and hundreds or thousands of variables, and we want to determine which of the many associations between different variables are potentially important–without knowing anything about their potential shape. The problem, then, is that traditional measures of association don’t work very well; they’re only likely to detect associations to the extent that those associations approximate a linear fit.

A new paper in Science by David Reshef and colleagues (and as a friend pointed out, it’s a feat in and of itself just to get a statistics paper into Science) directly targets this data mining problem by introducing an elegant new measure of association called the Maximal Information Coefficient (MIC; see also the authors’ project website).  The clever insight at the core of the paper is that one can detect a systematic (i.e., non-random) relationship between two variables by quantifying and normalizing their maximal mutual information. Mutual information (MI) is an information theory measure of how much information you have about one variable given knowledge of the other. You have high MI when you can accurately predict the level of one variable given knowledge of the other, and low MI when knowledge of one variable is unhelpful in predicting the other. Importantly, unlike other measures (e.g., the correlation coefficient), MI makes no assumptions about the form of the relationship between the variables; one can have high mutual information for non-linear associations as well as linear ones.

MI and various derivative measures have been around for a long time now; what’s innovative about the Reshef et al paper is that the authors figured out a way to efficiently estimate and normalize the maximal MI one can obtain for any two variables. The very clever approach the authors use is to overlay a series of grids on top of the data, and to keep altering the resolution of the grid and moving its lines around until one obtains the maximum possible MI. In essence, it’s like dropping a wire mesh on top of a scatterplot and playing with it until you’ve boxed in all of the data points in the most informative way possible. And the neat thing is, you can apply the technique to any kind of data at all, and capture a very broad range of systematic relationships, not just linear ones.

To give you an intuitive sense of how this works, consider this Figure from the supplemental material:

The underlying function here is sinusoidal. This is a potentially common type of association in many domains–e.g., it might explain the cyclical relationship between, say, coffee intake and hour of day (more coffee in the early morning and afternoon; less in between). But the linear correlation is essentially zero, so a typical analysis wouldn’t pick it up at all. On the other hand, the relationship itself is perfectly deterministic; if we can correctly identify the generative function in this case, we would have perfect information about Y given X. The question is how to capture this intuition algorithmically–especially given that real data are noisy.

This is where Reshef et al’s grid-based approach comes in. In the left panel above, you have a 2 x 8 grid overlaid on a sinusoidal function (the use of a 2 x 8 resolution here is just illustrative; the algorithm actually produces estimates for a wide range of grid resolutions). Even though it’s the optimal grid of that particular resolution, it still isn’t very good: knowing which row a particular point along the line falls into doesn’t tell you a whole lot about which column it falls into, and vice versa. In other words, mutual information is low. By contrast, the optimal 8 x 2 grid on the right side of the figure has a (perfect) MIC of 1: if you know which row in the grid a point on the line falls into, you can also determine which column it falls into with perfect accuracy. So the MIC approach will detect that there’s a perfectly systematic relationship between these two variables without any trouble, whereas the standard Pearson correlation would be 0 (i.e., no relation at all). There are a couple of other steps involved (e.g., one needs to normalize the MIC to account for differences in grid resolution), but that’s the gist of it.
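
If you want to play with the idea yourself, here’s a deliberately stripped-down toy version in Python. It only searches over a handful of equal-width grid resolutions, whereas the real MIC also optimizes where the grid lines fall and caps the total number of cells, so the numbers it produces are not true MIC values; the function names and the cosine example (chosen so the linear correlation is exactly zero by symmetry) are mine:

```python
import numpy as np

def grid_mi(x, y, x_bins, y_bins):
    """Mutual information (in bits) of x and y after binning onto a fixed grid."""
    joint, _, _ = np.histogram2d(x, y, bins=(x_bins, y_bins))
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)          # marginal over x bins
    py = p.sum(axis=0, keepdims=True)          # marginal over y bins
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

def toy_mic(x, y, max_bins=8):
    """Toy MIC-like score: best normalized MI over a few grid resolutions."""
    return max(grid_mi(x, y, bx, by) / np.log2(min(bx, by))
               for bx in range(2, max_bins + 1)
               for by in range(2, max_bins + 1))

x = np.linspace(-2 * np.pi, 2 * np.pi, 1000)
y_signal = np.cos(x)                                       # deterministic, non-linear
y_noise = np.random.default_rng(1).normal(size=x.size)     # unrelated to x

print(round(float(np.corrcoef(x, y_signal)[0, 1]), 3))     # Pearson r: essentially 0
print(round(toy_mic(x, y_signal), 2), round(toy_mic(x, y_noise), 2))
```

The noiseless cosine scores close to 1 despite its zero linear correlation, while the pure noise scores much lower (though, tellingly, not exactly zero; more on that bias below).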

If the idea seems surprisingly simple, it is. But as with many very good ideas, hindsight is 20/20; it’s an idea that seems obvious once you hear it, but clearly wasn’t trivial to come up with (or someone would have done it a long time ago!). And of course, the simplicity of the core idea also shouldn’t blind us to the fact that there was undoubtedly a lot of very sophisticated work involved in figuring out how to normalize and bound the measure, proving that the approach works, and implementing a dynamic algorithm capable of computing good MIC estimates in a reasonable amount of time (this Harvard Gazette article suggests Reshef and colleagues worked on the various problems for three years).

The utility of MIC and its improvement over existing measures is probably best captured in Figure 2 from the paper:

Panel A shows the values one obtains with different measures when trying to capture different kinds of noiseless relationships (e.g., linear, exponential, and sinusoidal ones). The key point is that MIC assigns a value of 1 (the maximum) to every kind of association, whereas no other measure is capable of detecting the same range of associations with the same degree of sensitivity (and most fail horribly). By contrast, when given random data, MIC produces a value that tends towards zero (though it’s still not quite zero, a point I’ll come back to later). So what you effectively have is a measure that, with some caveats, can capture a very broad range of associations and place them on the same metric. The latter aspect is nicely captured in Panel G, which gives one a sense of what real (i.e., noisy) data corresponding to different MIC levels would look like. The main point is that, unlike other measures, a given value can correspond to very different types of associations. Admittedly, this may be a mixed blessing, since the flip side is that knowing the MIC value tells you almost nothing about what the association actually looks like (though Anscombe’s Quartet famously demonstrates that even a linear correlation can be misleading in this respect). But on the whole, I think it represents a potentially big advance in our ability to detect novel associations in a data-driven way.

Having introduced and explained the method, Reshef et al then go on to apply it to 4 very different datasets. I’ll just focus on one here–a set of global indicators from the World Health Organization (WHO). The data set contains 357 variables, or 63,546 variable pairs. When plotting MIC against the Pearson correlation coefficient the data look like this (panel A; click to blow up the figure):

The main point to note is that while MIC detects most strong linear effects (e.g., panel D), it also detects quite a few associations that have low linear correlations (e.g., E, F, and G). Reshef et al note that many of these effects have sensible interpretations (e.g., they argue that the left trend line in panel F reflects predominantly Pacific Island nations where obesity is culturally valued, and hence increases with income), but would be completely overlooked by an automated data mining approach that focuses only on linear correlations. They go on to report a number of other interesting examples ranging from analyses of gut bacteria to baseball statistics. All in all, it’s a compelling demonstration of a new metric that could potentially play an important role in large-scale data mining analyses going forward.

That said, while the paper clearly represents an important advance for large-scale data mining efforts, it’s also quite light on caveats and limitations (even for a length-constrained Science paper). Some potential concerns that come to mind:

  • Reshef et al are understandably going to put their best foot forward, so we can expect that the ‘representative’ examples they display (e.g., the WHO scatter plots above) are among the cleanest effects in the data, and aren’t necessarily typical. There’s nothing wrong with this, but it’s worth keeping in mind that much (and perhaps most) of the time, the associations MIC identifies aren’t going to be quite so clear-cut. Reshef et al’s approach can help identify potentially interesting associations, but once they’re identified, it’s still up to the investigator to figure out how to characterize them.
  • MIC is a (potentially quite heavily) biased measure. While it’s true, as the authors suggest, that it will “tend to 0 for statistically independent variables”, in most situations, the observed value will be substantially larger than 0 even when variables are completely uncorrelated. This falls directly out of the ‘M’ in MIC, because when you take the maximal value from some larger search space as your estimate, you’re almost invariably going to end up capitalizing on chance to some degree. MIC will only tend to 0 when the sample size is very large; as this figure (from the supplemental material) shows, even with a sample size of n = 204, the MIC for uncorrelated variables will tend to hover somewhere around .15 for the parameterization used throughout the paper (the red line):
    This isn’t a huge deal, but it does mean that interpretation of small MIC values is going to be very difficult in practice, since the lower end of the distribution is going to depend heavily on sample size. And it’s quite unpleasant to have a putatively standardized metric of effect size whose interpretation depends to some extent on sample parameters.
  • Reshef et al don’t report any analyses quantifying the sensitivity of MIC compared to conventional metrics like Pearson’s correlation coefficient. Obviously, MIC can pick up on effects Pearson can’t; but a crucial question is whether MIC shows comparable sensitivity when effects are linear. Similarly, we don’t know how well MIC performs when sample sizes are substantially smaller than those Reshef et al use in their simulations and empirical analyses. If it breaks down with n’s on the order of, say, 50 – 100, that would be important to know. So it would be great to see follow-up work characterizing performance under such circumstances–preferably before a flood of papers is published that all use MIC to do data mining in relatively small data sets.
  • As Andrew Gelman points out here, it’s not entirely clear that one wants a measure that gives a high r-square-like value for pretty much any non-random association between variables. For instance, a perfect circle would get an MIC of 1 at the limit, which is potentially weird given that you can never deterministically predict y from x. I don’t have a strong feeling about this one way or the other, but can see why this might bother someone.

Caveats aside though, from my perspective–as someone who likes to play with very large datasets but isn’t terribly statistically savvy–the Reshef et al paper seems like a really impressive piece of work that could have a big impact on at least some kinds of data mining analyses. I’d be curious to hear what more quantitatively sophisticated folks have to say.

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., & Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062), 1518-1524. PMID: 22174245

what aspirin can tell us about the value of antidepressants

There’s a nice post on Science-Based Medicine by Harriet Hall pushing back (kind of) against the increasingly popular idea that antidepressants don’t work. For context, there have been a couple of large recent meta-analyses that used comprehensive FDA data on clinical trials of antidepressants (rather than only published studies, which are biased towards larger, statistically significant, effects) to argue that antidepressants are of little or no use in mildly or moderately depressed people, and achieve a clinically meaningful benefit only in the severely depressed.

Hall points out that whether you think antidepressants have a clinically meaningful benefit or not depends on how you define clinically meaningful (okay, this sounds vacuous, but bear with me). Most meta-analyses of antidepressant efficacy reveal an effect size of somewhere between 0.3 and 0.5 standard deviations. Historically, psychologists have considered effect sizes of 0.2, 0.5, and 0.8 standard deviations to be small, medium, and large, respectively. But as Hall points out:

The psychologist who proposed these landmarks [Jacob Cohen] admitted that he had picked them arbitrarily and that they had “no more reliable a basis than my own intuition.” Later, without providing any justification, the UK’s National Institute for Health and Clinical Excellence (NICE) decided to turn the 0.5 landmark (why not the 0.2 or the 0.8 value?) into a one-size-fits-all cut-off for clinical significance.

She goes on to explain why this ultimately leaves the efficacy of antidepressants open to interpretation:

In an editorial published in the British Medical Journal (BMJ), Turner explains with an elegant metaphor: journal articles had sold us a glass of juice advertised to contain 0.41 liters (0.41 being the effect size Turner, et al. derived from the journal articles); but the truth was that the “glass” of efficacy contained only 0.31 liters. Because these amounts were lower than the (arbitrary) 0.5 liter cut-off, NICE standards (and Kirsch) consider the glass to be empty. Turner correctly concludes that the glass is far from full, but it is also far from empty. He also points out that patients’ responses are not all-or-none and that partial responses can be meaningful.

I think this pretty much hits the nail on the head; no one really doubts that antidepressants work at this point; the question is whether they work well enough to justify their side effects and the social and economic costs they impose. I don’t have much to add to Hall’s argument, except that I think she doesn’t sufficiently emphasize how big a role scale plays when trying to evaluate the utility of antidepressants (or any other treatment). At the level of a single individual, a change of one-third of a standard deviation may not seem very big (then again, if you’re currently depressed, it might!). But on a societal scale, even canonically ‘small’ effects can have very large consequences in the aggregate.

The example I’m most fond of here is Robert Rosenthal’s famous illustration of the effects of aspirin on heart attack. The correlation between taking aspirin daily and decreased risk of heart attack is, at best, .03 (I say at best because the estimate is based on a large 1988 study, but my understanding is that more recent studies have moderated even this small effect). In most domains of psychology, a correlation of .03 is so small as to be completely uninteresting. Most psychologists would never seriously contemplate running a study to try to detect an effect of that size. And yet, at a population level, even an r of .03 can have serious implications. Cast in a different light, what this effect means is that 3% of people who would be expected to have a heart attack without aspirin would be saved from that heart attack given a daily aspirin regimen. Needless to say, this isn’t trivial. It amounts to a potentially life-saving intervention for 30 out of every 1,000 people. At a public policy level, you’d be crazy to ignore something like that (which is why, for a long time, many doctors recommended that people take an aspirin a day). And yet, by the standards of experimental psychology, this is a tiny, tiny effect that probably isn’t worth getting out of bed for.
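
For the record, the translation from a correlation of .03 to “30 out of every 1,000 people” is, if I’m remembering the original illustration correctly, essentially Rosenthal’s binomial effect size display (BESD), which recasts a correlation as a difference in outcome rates between two groups. A minimal sketch of that arithmetic, using the aspirin numbers purely as illustration:

```python
def besd(r):
    """Rosenthal's binomial effect size display: convert a correlation
    into equivalent 'success' rates for two groups."""
    treatment_rate = 0.5 + r / 2
    control_rate = 0.5 - r / 2
    return treatment_rate, control_rate

treatment, control = besd(0.03)
print(treatment, control)              # 0.515 vs. 0.485
print((treatment - control) * 1000)    # ~30 per 1,000 people
```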

The point of course is that when you consider how many people are currently on antidepressants (millions), even small effects–and certainly an effect of one-third of a standard deviation–are going to be compounded many times over. Given that antidepressants demonstrably reduce the risk of suicide (according to Hall, by about 20%), there’s little doubt that tens of thousands of lives have been saved by antidepressants. That doesn’t necessarily justify their routine use, of course, because the side effects and costs also scale up to the societal level (just imagine how many millions of bouts of nausea could be prevented by eliminating antidepressants from the market!). The point is just that, if you think the benefits of antidepressants outweigh their costs even slightly at the level of the average depressed individual, you’re probably committing yourself to thinking that they have a hugely beneficial impact at a societal level–and that holds true irrespective of whether the effects are ‘clinically meaningful’ by conventional standards.

Too much p = .048? Towards partial automation of scientific evaluation

Distinguishing good science from bad science isn’t an easy thing to do. One big problem is that what constitutes ‘good’ work is, to a large extent, subjective; I might love a paper you hate, or vice versa. Another problem is that science is a cumulative enterprise, and the value of each discovery is, in some sense, determined by how much of an impact that discovery has on subsequent work–something that often only becomes apparent years or even decades after the fact. So, to an uncomfortable extent, evaluating scientific work involves a good deal of guesswork and personal preference, which is probably why scientists tend to fall back on things like citation counts and journal impact factors as tools for assessing the quality of someone’s work. We know it’s not a great way to do things, but it’s not always clear how else we could do better.

Fortunately, there are many aspects of scientific research that don’t depend on subjective preferences or require us to suspend judgment for ten or fifteen years. In particular, methodological aspects of a paper can often be evaluated in a (relatively) objective way, and strengths or weaknesses of particular experimental designs are often readily discernible. For instance, in psychology, pretty much everyone agrees that large samples are generally better than small samples, reliable measures are better than unreliable measures, representative samples are better than WEIRD ones, and so on. The trouble when it comes to evaluating the methodological quality of most work isn’t so much that there’s rampant disagreement between reviewers (though it does happen), it’s that research articles are complicated products, and the odds of any individual reviewer having the expertise, motivation, and attention span to catch every major methodological concern in a paper are exceedingly small. Since only two or three people typically review a paper pre-publication, it’s not surprising that in many cases, whether or not a paper makes it through the review process depends as much on who happened to review it as on the paper itself.

A nice example of this is the Bem paper on ESP I discussed here a few weeks ago. I think most people would agree that things like data peeking, lumping and splitting studies, and post-hoc hypothesis testing–all of which are apparent in Bem’s paper–are generally not good research practices. And no doubt many potential reviewers would have noted these and other problems with Bem’s paper had they been asked to review it. But as it happens, the actual reviewers didn’t note those problems (or at least, not enough of them), so the paper was accepted for publication.

I’m not saying this to criticize Bem’s reviewers, who I’m sure all had a million other things to do besides pore over the minutiae of a paper on ESP (and for all we know, they could have already caught many other problems with the paper that were subsequently addressed before publication). The problem is a much more general one: the pre-publication peer review process in psychology, and many other areas of science, is pretty inefficient and unreliable, in the sense that it draws on the intense efforts of a very few, semi-randomly selected, individuals, as opposed to relying on a much broader evaluation by the community of researchers at large.

In the long term, the best solution to this problem may be to fundamentally rethink the way we evaluate scientific papers–e.g., by designing new platforms for post-publication review of papers (e.g., see this post for more on efforts towards that end). I think that’s far and away the most important thing the scientific community could do to improve the quality of scientific assessment, and I hope we ultimately will collectively move towards alternative models of review that look a lot more like the collaborative filtering systems found on, say, reddit or Stack Overflow than like peer review as we now know it. But that’s a process that’s likely to take a long time, and I don’t profess to have much of an idea as to how one would go about kickstarting it.

What I want to focus on here is something much less ambitious, but potentially still useful–namely, the possibility of automating the assessment of at least some aspects of research methodology. As I alluded to above, many of the factors that help us determine how believable a particular scientific finding is are readily quantifiable. In fact, in many cases, they’re already quantified for us. Sample sizes, p values, effect sizes,  coefficient alphas… all of these things are, in one sense or another, indices of the quality of a paper (however indirect), and are easy to capture and code. And many other things we care about can be captured with only slightly more work. For instance, if we want to know whether the authors of a paper corrected for multiple comparisons, we could search for strings like “multiple comparisons”, “uncorrected”, “Bonferroni”, and “FDR”, and probably come away with a pretty decent idea of what the authors did or didn’t do to correct for multiple comparisons. It might require a small dose of technical wizardry to do this kind of thing in a sensible and reasonably accurate way, but it’s clearly feasible–at least for some types of variables.
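
To give a sense of how crude this string matching could be and still be informative, here’s a minimal sketch of the kind of thing I have in mind; the keyword list and regular expression are purely illustrative, and a real implementation would obviously need a much richer vocabulary (and some way of handling negation).

```python
import re

# Illustrative keyword list only; not an exhaustive or validated vocabulary.
CORRECTION_TERMS = ["multiple comparisons", "bonferroni", "fdr",
                    "false discovery rate", "familywise", "family-wise"]

def mentions_correction(text):
    """True if the text contains any multiple-comparison-related terms."""
    text = text.lower()
    return any(term in text for term in CORRECTION_TERMS)

def extract_p_values(text):
    """Pull out reported p values of the form 'p = .032' or 'p < 0.05'."""
    matches = re.findall(r"p\s*[<=]\s*(0?\.\d+)", text, flags=re.IGNORECASE)
    return [float(m) for m in matches]

sample = "Results were significant (p = .048), uncorrected for multiple comparisons."
print(mentions_correction(sample))   # True -- note it can't tell 'uncorrected' apart
print(extract_p_values(sample))      # [0.048]
```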

Once we’d extracted a bunch of data about the distribution of p values and sample sizes from many different papers, we could then start to do some interesting (and potentially useful) things, like generating automated metrics of research quality. For instance:

  • In multi-study articles, the variance in sample size across studies could tell us something useful about the likelihood that data peeking is going on (for an explanation as to why, see this). Other things being equal, an article with 9 studies with identical sample sizes is less likely to be capitalizing on chance than one containing 9 studies that range in sample size between 50 and 200 subjects (as the Bem paper does), so high variance in sample size could be used as a rough index for proclivity to peek at the data.
  • Quantifying the distribution of p values found in an individual article or an author’s entire body of work might be a reasonable first-pass measure of the amount of fudging (usually inadvertent) going on. As I pointed out in my earlier post, it’s interesting to note that with only one or two exceptions, virtually all of Bem’s statistically significant results come very close to p = .05. That’s not what you expect to see when hypothesis testing is done in a really principled way, because it’s exceedingly unlikely that a researcher would be so lucky as to always just barely obtain the expected result. But a bunch of p = .03 and p = .048 results are exactly what you expect to find when researchers test multiple hypotheses and report only the ones that produce significant results. (A minimal sketch of this metric, and the previous one, appears just after this list.)
  • The presence or absence of certain terms or phrases is probably at least slightly predictive of the rigorousness of the article as a whole. For instance, the frequent use of phrases like “cross-validated”, “statistical power”, “corrected for multiple comparisons”, and “unbiased” is probably a good sign (though not necessarily a strong one); conversely, terms like “exploratory”, “marginal”, and “small sample” might provide at least some indication that the reported findings are, well, exploratory.
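
To make the first two ideas concrete, here’s a minimal sketch, assuming the per-study sample sizes and p values have already been scraped out of an article somehow; the function names, thresholds, and toy numbers are all mine and purely illustrative.

```python
import numpy as np

def sample_size_spread(ns):
    """Coefficient of variation of per-study sample sizes; higher values
    are (weakly) more consistent with data peeking."""
    ns = np.asarray(ns, dtype=float)
    return ns.std() / ns.mean()

def near_threshold_rate(p_values, threshold=0.05, window=0.02):
    """Proportion of significant p values falling just under the threshold."""
    p = np.asarray(p_values, dtype=float)
    sig = p[p < threshold]
    return np.mean(sig > threshold - window) if len(sig) else np.nan

# Made-up numbers, not taken from any actual paper
print(sample_size_spread([100, 150, 100, 100, 100, 150, 200, 100, 50]))
print(near_threshold_rate([0.031, 0.048, 0.011, 0.049, 0.042]))
```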

These are just the first examples that come to mind; you can probably think of other better ones. Of course, these would all be pretty weak indicators of paper (or researcher) quality, and none of them are in any sense unambiguous measures. There are all sorts of situations in which such numbers wouldn’t mean much of anything. For instance, high variance in sample sizes would be perfectly justifiable in a case where researchers were testing for effects expected to have very different sizes, or conducting different kinds of statistical tests (e.g., detecting interactions is much harder than detecting main effects, and so necessitates larger samples). Similarly, p values close to .05 aren’t necessarily a marker of data snooping and fishing expeditions; it’s conceivable that some researchers might be so good at what they do that they can consistently design experiments that just barely manage to show what they’re intended to (though it’s not very plausible). And a failure to use terms like “corrected”, “power”, and “cross-validated” in a paper doesn’t necessarily mean the authors failed to consider important methodological issues, since such issues aren’t necessarily relevant to every single paper. So there’s no question that you’d want to take these kinds of metrics with a giant lump of salt.

Still, there are several good reasons to think that even relatively flawed automated quality metrics could serve an important purpose. First, many of the problems could be overcome to some extent through aggregation. You might not want to conclude that a particular study was poorly done simply because most of the reported p values were very close to .05; but if you were to look at a researcher’s entire body of, say, thirty or forty published articles, and noticed the same trend relative to other researchers, you might start to wonder. Similarly, we could think about composite metrics that combine many different first-order metrics to generate a summary estimate of a paper’s quality that may not be so susceptible to contextual factors or noise. For instance, in the case of the Bem ESP article, a measure that took into account the variance in sample size across studies, the closeness of the reported p values to .05, the mention of terms like ‘one-tailed test’, and so on, would likely not have assigned Bem’s article a glowing score, even if each individual component of the measure was not very reliable.

Second, I’m not suggesting that crude automated metrics would replace current evaluation practices; rather, they’d be used strictly as a complement. Essentially, you’d have some additional numbers to look at, and you could choose to use them or not, as you saw fit, when evaluating a paper. If nothing else, they could help flag potential issues that reviewers might not be spontaneously attuned to. For instance, a report might note the fact that the term “interaction” was used several times in a paper in the absence of “main effect,” which might then cue a reviewer to ask, hey, why you no report main effects? — but only if they deemed it a relevant concern after looking at the issue more closely.

Third, automated metrics could be continually updated and improved using machine learning techniques. Given some criterion measure of research quality, one could systematically train and refine an algorithm capable of doing a decent job recapturing that criterion. Of course, it’s not clear that we really have any unobjectionable standard to use as a criterion in this kind of training exercise (which only underscores why it’s important to come up with better ways to evaluate scientific research). But a reasonable starting point might be to try to predict replication likelihood for a small set of well-studied effects based on the features of the original report. Could you for instance show, in an automated way, that initial effects reported in studies that failed to correct for multiple comparisons or reported p values closer to .05 were less likely to be subsequently replicated?
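
In scikit-learn terms, that training exercise might look roughly like the sketch below, assuming we had a table of extracted features for a set of original reports plus a binary indicator of whether each was later replicated. Everything here (the feature choices, the data, even the existence of such a dataset) is hypothetical; with random inputs the classifier should hover around chance, which is the point of the scaffold rather than the result.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_studies = 200

# Hypothetical extracted features for each original report
X = np.column_stack([
    rng.uniform(0, 0.05, n_studies),   # smallest reported p value
    rng.uniform(0, 1, n_studies),      # coefficient of variation of sample sizes
    rng.integers(0, 2, n_studies),     # mentions multiple-comparison correction?
])
# Hypothetical outcome: did an independent replication succeed?
y = rng.integers(0, 2, n_studies)

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(scores.mean())   # ~0.5 here, since these toy data are pure noise
```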

Of course, as always with this kind of stuff, the rub is that it’s easy to talk the talk and not so easy to walk the walk. In principle, we can make up all sorts of clever metrics, but in practice, it’s not trivial to automatically extract even a piece of information as seemingly simple as sample size from many papers (consider the difference between “Undergraduates (N = 15) participated…” and “Forty-two individuals diagnosed with depression and an equal number of healthy controls took part…”), let alone build sophisticated composite measures that could reasonably well approximate human judgments. It’s all well and good to write long blog posts about how fancy automated metrics could help separate good research from bad, but I’m pretty sure I don’t want to actually do any work to develop them, and you probably don’t either. Still, the potential benefits are clear, and it’s not like this is science fiction–it’s clearly viable on at least a modest scale. So someone should do it… Maybe Elsevier? Jorge Hirsch? Anyone? Bueller? Bueller?
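
Just to illustrate how quickly even the “simple” extraction problem gets messy, here’s a toy attempt at handling those two phrasings; the number-word lookup is deliberately minimal, and the second example shows how easily the “equal number of healthy controls” gets lost.

```python
import re

# A tiny number-word lookup; a real system would need an actual parser.
NUMBER_WORDS = {"fifteen": 15, "forty": 40, "forty-two": 42,
                "fifty": 50, "one hundred": 100}

def extract_sample_size(sentence):
    """Best-effort guess at a reported sample size; returns None on failure."""
    m = re.search(r"N\s*=\s*(\d+)", sentence)
    if m:
        return int(m.group(1))
    # Check longer number words first so 'forty-two' beats 'forty'
    for word, value in sorted(NUMBER_WORDS.items(), key=lambda kv: -len(kv[0])):
        if re.search(rf"\b{word}\b", sentence, flags=re.IGNORECASE):
            return value
    return None

print(extract_sample_size("Undergraduates (N = 15) participated..."))   # 15
print(extract_sample_size("Forty-two individuals diagnosed with depression and "
                          "an equal number of healthy controls took part..."))  # 42, not 84
```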

The psychology of parapsychology, or why good researchers publishing good articles in good journals can still get it totally wrong

Unless you’ve been pleasantly napping under a rock for the last couple of months, there’s a good chance you’ve heard about a forthcoming article in the Journal of Personality and Social Psychology (JPSP) purporting to provide strong evidence for the existence of some ESP-like phenomenon. (If you’ve been napping, see here, here, here, here, here, or this comprehensive list). In the article, appropriately titled Feeling the Future, Daryl Bem reports the results of 9 (yes, 9!) separate experiments that catch ordinary college students doing things they’re not supposed to be able to do–things like detecting the on-screen location of erotic images that haven’t actually been presented yet, or being primed by stimuli that won’t be displayed until after a response has already been made.

As you might expect, Bem’s article is causing quite a stir in the scientific community. The controversy isn’t over whether or not ESP exists, mind you; scientists haven’t lost their collective senses, and most of us still take it as self-evident that college students just can’t peer into the future and determine where as-yet-unrevealed porn is going to soon be hidden (as handy as that ability might be). The real question on many people’s minds is: what went wrong? If there’s obviously no such thing as ESP, how could a leading social psychologist publish an article containing a seemingly huge amount of evidence in favor of ESP in the leading social psychology journal, after being peer reviewed by four other psychologists? Or, to put it in more colloquial terms–what the fuck?

What the fuck?

Many critiques of Bem’s article have tried to dismiss it by searching for the smoking gun–the single critical methodological flaw that dooms the paper. For instance, one critique that’s been making the rounds, by Wagenmakers et al, argues that Bem should have done a Bayesian analysis, and that his failure to adjust his findings for the infinitesimally low prior probability of ESP (essentially, the strength of subjective belief against ESP) means that the evidence for ESP is vastly overestimated. I think these types of argument have a kernel of truth, but also suffer from some problems (for the record, I don’t really agree with the Wagenmakers critique, for reasons Andrew Gelman has articulated here). Having read the paper pretty closely twice, I really don’t think there’s any single overwhelming flaw in Bem’s paper (actually, in many ways, it’s a nice paper). Instead, there are a lot of little problems that collectively add up to produce a conclusion you just can’t really trust. Below is a decidedly non-exhaustive list of some of these problems. I’ll warn you now that, unless you care about methodological minutiae, you’ll probably find this very boring reading. But that’s kind of the point: attending to this stuff is so boring that we tend not to do it, with potentially serious consequences. Anyway:

  • Bem reports 9 different studies, which sounds (and is!) impressive. But a noteworthy feature of these studies is that they have grossly uneven sample sizes, ranging all the way from N = 50 to N = 200, in blocks of 50. As far as I can tell, no justification for these differences is provided anywhere in the article, which raises red flags, because the most common explanation for differing sample sizes–especially on this order of magnitude–is data peeking. That is, what often happens is that researchers periodically peek at their data, and halt data collection as soon as they obtain a statistically significant result. This may seem like a harmless little foible, but as I’ve discussed elsewhere, is actually a very bad thing, as it can substantially inflate Type I error rates (i.e., false positives). To his credit, Bem was at least being systematic about his data peeking, since his sample sizes always increase in increments of 50. But even in steps of 50, false positives can be grossly inflated. For instance, for a one-sample t-test, a researcher who peeks at her data in increments of 50 subjects and terminates data collection when a significant result is obtained (or N = 200, if no such result is obtained) can expect an actual Type I error rate of about 13%–nearly 3 times the nominal rate of 5%! (A quick simulation of this scenario appears just after this list.)
  • There’s some reason to think that the 9 experiments Bem reports weren’t necessarily designed as such; that is, they appear to have been ‘lumped’ or ‘split’ post hoc based on the results. For instance, Experiment 2 had 150 subjects, but the experimental design for the first 100 differed from the final 50 in several respects. They were minor respects, to be sure (e.g., pictures were presented randomly in one study, but in a fixed sequence in the other), but were still comparable in scope to those that differentiated Experiment 8 from Experiment 9 (which had the same sample size splits of 100 and 50, but were presented as two separate experiments). There’s no obvious reason why a researcher would plan to run 150 subjects up front, then decide to change the design after 100 subjects, and still call it the same study. A more plausible explanation is that Experiment 2 was actually supposed to be two separate experiments (a successful first experiment with N = 100 followed by an intended replication with N = 50) that were collapsed into one large study when the second experiment failed–preserving the statistically significant result in the full sample. Needless to say, this kind of lumping and splitting is liable to additionally inflate the false positive rate.
  • Most of Bem’s experiments allow for multiple plausible hypotheses, and it’s rarely clear why Bem would have chosen, up front, the hypotheses he presents in the paper. For instance, in Experiment 1, Bem finds that college students are able to predict the future location of erotic images that haven’t yet been presented (essentially a form of precognition), yet show no ability to predict the location of negative, positive, or romantic pictures. Bem’s explanation for this selective result is that “… such anticipation would be evolutionarily advantageous for reproduction and survival if the organism could act instrumentally to approach erotic stimuli …”. But this seems kind of silly on several levels. For one thing, it’s really hard to imagine that there’s an adaptive benefit to keeping an eye out for potential mates, but not for other potential positive signals (represented by non-erotic positive images). For another, it’s not like we’re talking about actual people or events here; we’re talking about digital images on an LCD. What Bem is effectively saying is that, somehow, someway, our ancestors evolved the extrasensory capacity to read digital bits from the future–but only pornographic ones. Not very compelling, and one could easily have come up with a similar explanation in the event that any of the other picture categories had selectively produced statistically significant results. Of course, if you get to test 4 or 5 different categories at p < .05, and pretend that you called it ahead of time, your false positive rate isn’t really 5%–it’s closer to 20%.
  • I say p < .05, but really, it’s more like p < .1, because the vast majority of tests Bem reports use one-tailed tests–effectively instantaneously doubling the false positive rate. There’s a long-standing debate in the literature, going back at least 60 years, as to whether it’s ever appropriate to use one-tailed tests, but even proponents of one-tailed tests will concede that you should only use them if you really truly have a directional hypothesis in mind before you look at your data. That seems exceedingly unlikely in this case, at least for many of the hypotheses Bem reports testing.
  • Nearly all of Bem’s statistically significant p values are very close to the critical threshold of .05. That’s usually a marker of selection bias, particularly given the aforementioned unevenness of sample sizes. When experiments are conducted in a principled way (i.e., with minimal selection bias or peeking), researchers will often get very low p values, since it’s very difficult to know up front exactly how large effect sizes will be. But in Bem’s 9 experiments, he almost invariably collects just enough subjects to detect a statistically significant effect. There are really only two explanations for that: either Bem is (consciously or unconsciously) deciding what his hypotheses are based on which results attain significance (which is not good), or he’s actually a master of ESP himself, and is able to peer into the future and identify the critical sample size he’ll need in each experiment (which is great, but unlikely).
  • Some of the correlational effects Bem reports–e.g., that people with high stimulus seeking scores are better at ESP–appear to be based on measures constructed post hoc. For instance, Bem uses a non-standard, two-item measure of boredom susceptibility, with no real justification provided for this unusual item selection, and no reporting of results for the presumably many other items and questionnaires that were administered alongside these items (except to parenthetically note that some measures produced non-significant results and hence weren’t reported). Again, the ability to select from among different questionnaires–and to construct custom questionnaires from different combinations of items–can easily inflate Type I error.
  • It’s not entirely clear how many studies Bem ran. In the Discussion section, he notes that he could “identify three sets of findings omitted from this report so far that should be mentioned lest they continue to languish in the file drawer”, but it’s not clear from the description that follows exactly how many studies these “three sets of findings” comprised (or how many ‘pilot’ experiments were involved). What we’d really like to know is the exact number of (a) experiments and (b) subjects Bem ran, without qualification, and including all putative pilot sessions.
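
The ~13% figure in the first bullet is easy to sanity-check with a quick simulation. The sketch below is my own reconstruction of that sequential-testing scenario (peek after every 50 subjects, stop as soon as p < .05, or give up at N = 200), not code from the post linked above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def peeking_false_positive_rate(n_sims=20_000, looks=(50, 100, 150, 200), alpha=0.05):
    """Simulate a one-sample t-test under the null hypothesis, peeking at
    the data at each interim look and stopping as soon as p < alpha."""
    false_positives = 0
    for _ in range(n_sims):
        data = rng.normal(size=max(looks))
        for n in looks:
            if stats.ttest_1samp(data[:n], 0).pvalue < alpha:
                false_positives += 1
                break
    return false_positives / n_sims

print(peeking_false_positive_rate())   # lands in the neighborhood of .13, not .05
```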

It’s important to note that none of these concerns is really terrible individually. Sure, it’s bad to peek at your data, but data peeking alone probably isn’t going to produce 9 different false positives. Nor is using one-tailed tests, or constructing measures on the fly, etc. But when you combine data peeking, liberal thresholds, study recombination, flexible hypotheses, and selective measures, you have a perfect recipe for spurious results. And the fact that there are 9 different studies isn’t any guard against false positives when fudging is at work; if anything, it may make it easier to produce a seemingly consistent story, because reviewers and readers have a natural tendency to relax the standards for each individual experiment. So when Bem argues that “…across all nine experiments, Stouffer’s z = 6.66, p = 1.34 × 10^-11,” the claim that the cumulative p value is 1.34 × 10^-11 is close to meaningless. Combining p values that way would only be appropriate under the assumption that Bem conducted exactly 9 tests, and without any influence of selection bias. But that’s clearly not the case here.
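
For reference, Stouffer’s method just converts each (one-tailed) p value to a z score, sums the z scores, and divides by the square root of the number of tests. The sketch below uses nine made-up, just-barely-significant p values rather than Bem’s actual ones; the tiny combined p it produces is only meaningful if those nine tests were the only ones run and weren’t selected in any way.

```python
import numpy as np
from scipy import stats

def stouffer(p_values):
    """Combine one-tailed p values via Stouffer's method."""
    z = stats.norm.isf(p_values)             # convert each p to a z score
    z_combined = z.sum() / np.sqrt(len(z))
    return z_combined, stats.norm.sf(z_combined)

# Nine hypothetical just-under-threshold results (not Bem's actual p values)
z, p = stouffer(np.array([0.04] * 9))
print(z, p)   # a 'combined' p far below .05 -- valid only with no selection at all
```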

What would it take to make the results more convincing?

Admittedly, there are quite a few assumptions involved in the above analysis. I don’t know for a fact that Bem was peeking at his data; that just seems like a reasonable assumption given that no justification was provided anywhere for the use of uneven samples. It’s conceivable that Bem had perfectly good, totally principled, reasons for conducting the experiments exactly as he did. But if that’s the case, defusing these criticisms should be simple enough. All it would take for Bem to make me (and presumably many other people) feel much more comfortable with the results is an affirmation of the following statements:

  • That the sample sizes of the different experiments were determined a priori, and not based on data snooping;
  • That the distinction between pilot studies and ‘real’ studies was clearly defined up front–i.e., there weren’t any studies that started out as pilots but eventually ended up in the paper, or studies that were supposed to end up in the paper but that were disqualified as pilots based on the (lack of) results;
  • That there was a clear one-to-one mapping between intended studies and reported studies; i.e., Bem didn’t ‘lump’ together two different studies in cases where one produced no effect, or split one study into two in cases where different subsets of the data both showed an effect;
  • That the predictions reported in the paper were truly made a priori, and not on the basis of the results (e.g., that the hypothesis that sexually arousing stimuli would be the only ones to show an effect was actually written down in one of Bem’s notebooks somewhere);
  • That the various transformations applied to the RT and memory performance measures in some Experiments weren’t selected only after inspecting the raw, untransformed values and failing to identify significant results;
  • That the individual differences measures reported in the paper were selected a priori and not based on post-hoc inspection of the full pattern of correlations across studies;
  • That Bem didn’t run dozens of other statistical tests that failed to produce statistically significant results and hence weren’t reported in the paper.

Endorsing this list of statements (or perhaps a somewhat more complete version, as there are other concerns I didn’t mention here) would be sufficient to cast Bem’s results in an entirely new light, and I’d go so far as to say that I’d even be willing to suspend judgment on his conclusions pending additional data (which would be a big deal for me, since I don’t have a shred of a belief in ESP). But I confess that I’m not holding my breath, if only because I imagine that Bem would have already addressed these concerns in his paper if there were indeed principled justifications for the design choices in question.

It isn’t a bad paper

If you’ve read this far (why??), this might seem like a pretty damning review, and you might be thinking, boy, this is really a terrible paper. But I don’t think that’s true at all. In many ways, I think Bem’s actually been relatively careful. The thing to remember is that this type of fudging isn’t unusual; to the contrary, it’s rampant–everyone does it. And that’s because it’s very difficult, and often outright impossible, to avoid. The reality is that scientists are human, and like all humans, have a deep-seated tendency to work to confirm what they already believe. In Bem’s case, there are all sorts of reasons why someone who’s been working for the better part of a decade to demonstrate the existence of psychic phenomena isn’t necessarily the most objective judge of the relevant evidence. I don’t say that to impugn Bem’s motives in any way; I think the same is true of virtually all scientists–including myself. I’m pretty sure that if someone went over my own work with a fine-toothed comb, as I’ve gone over Bem’s above, they’d identify similar problems. Put differently, I don’t doubt that, despite my best efforts, I’ve reported some findings that aren’t true, because I wasn’t as careful as a completely disinterested observer would have been. That’s not to condone fudging, of course, but simply to recognize that it’s an inevitable reality in science, and it isn’t fair to hold Bem to a higher standard than we’d hold anyone else.

If you set aside the controversial nature of Bem’s research, and evaluate the quality of his paper purely on methodological grounds, I don’t think it’s any worse than the average paper published in JPSP, and actually probably better. For all of the concerns I raised above, there are many things Bem is careful to do that many other researchers don’t. For instance, he clearly makes at least a partial effort to avoid data peeking by collecting samples in increments of 50 subjects (I suspect he simply underestimated the degree to which Type I error rates can be inflated by peeking, even with steps that large); he corrects for multiple comparisons in many places (though not in some places where it matters); and he devotes an entire section of the discussion to considering the possibility that he might be inadvertently capitalizing on chance by falling prey to certain biases. Most studies–including most of those published in JPSP, the premier social psychology journal–don’t do any of these things, even though the underlying problems are just as applicable. So while you can confidently conclude that Bem’s article is wrong, I don’t think it’s fair to say that it’s a bad article–at least, not by the standards that currently hold in much of psychology.

Should the study have been published?

Interestingly, much of the scientific debate surrounding Bem’s article has actually had very little to do with the veracity of the reported findings, because the vast majority of scientists take it for granted that ESP is bunk. Much of the debate centers instead over whether the article should have ever been published in a journal as prestigious as JPSP (or any other peer-reviewed journal, for that matter). For the most part, I think the answer is yes. I don’t think it’s the place of editors and reviewers to reject a paper based solely on the desirability of its conclusions; if we take the scientific method–and the process of peer review–seriously, that commits us to occasionally (or even frequently) publishing work that we believe time will eventually prove wrong. The metrics I think reviewers should (and do) use are whether (a) the paper is as good as most of the papers that get published in the journal in question, and (b) the methods used live up to the standards of the field. I think that’s true in this case, so I don’t fault the editorial decision. Of course, it sucks to see something published that’s virtually certain to be false… but that’s the price we pay for doing science. As long as their proponents play by the rules, we have to engage with even patently ridiculous views, because sometimes (though very rarely) it later turns out that those views weren’t so ridiculous after all.

That said, believing that it’s appropriate to publish Bem’s article given current publishing standards doesn’t preclude us from questioning those standards themselves. On a pretty basic level, the idea that Bem’s article might be par for the course, quality-wise, yet still be completely and utterly wrong, should surely raise some uncomfortable questions about whether psychology journals are getting the balance between scientific novelty and methodological rigor right. I think that’s a complicated issue, and I’m not going to try to tackle it here, though I will say that personally I do think that more stringent standards would be a good thing for psychology, on the whole. (It’s worth pointing out that the problem of (arguably) lax standards is hardly unique to psychology; as John Ioannidis has famously pointed out, most published findings in the biomedical sciences are false.)

Conclusion

The controversy surrounding the Bem paper is fascinating for many reasons, but it’s arguably most instructive in underscoring the central tension in scientific publishing between rapid discovery and innovation on the one hand, and methodological rigor and cautiousness on the other. Both values are important, but it’s important to recognize the tradeoff that pursuing either one implies. Many of the people who are now complaining that JPSP should never have published Bem’s article seem to overlook the fact that they’ve probably benefited themselves from the prevalence of the same relaxed standards (note that by ‘relaxed’ I don’t mean to suggest that journals like JPSP are non-selective about what they publish, just that methodological rigor is only one among many selection criteria–and often not the most important one). Conversely, maintaining editorial standards that would have precluded Bem’s article from being published would almost certainly also make it much more difficult to publish most other, much less controversial, findings. A world in which fewer spurious results are published is a world in which fewer studies are published, period. You can reasonably debate whether that would be a good or bad thing, but you can’t have it both ways. It’s wishful thinking to imagine that reviewers could somehow grow a magic truth-o-meter that applies lax standards to veridical findings and stringent ones to false positives.

From a bird’s eye view, there’s something undeniably strange about the idea that a well-respected, relatively careful researcher could publish an above-average article in a top psychology journal, yet have virtually everyone instantly recognize that the reported findings are totally, irredeemably false. You could read that as a sign that something’s gone horribly wrong somewhere in the machine; that the reviewers and editors of academic journals have fallen down and can’t get up, or that there’s something deeply flawed about the way scientists–or at least psychologists–practice their trade. But I think that’s wrong. I think we can look at it much more optimistically. We can actually see it as a testament to the success and self-corrective nature of the scientific enterprise that we actually allow articles that virtually nobody agrees with to get published. And that’s because, as scientists, we take seriously the possibility, however vanishingly small, that we might be wrong about even our strongest beliefs. Most of us don’t really believe that Cornell undergraduates have a sixth sense for future porn… but if they did, wouldn’t you want to know about it?

Bem, D. J. (2011). Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. Journal of Personality and Social Psychology.