the parable of zoltan and his twelve sheep, or why a little skepticism goes a long way

What follows is a fictional piece about sheep and statistics. I wrote it about two years ago, intending it to serve as a preface to an article on the dangers of inadvertent data fudging. But then I decided that no journal editor in his or her right mind would accept an article that started out talking about thinking sheep. And anyway, the rest of the article wasn’t very good. So instead, I post this parable here for your ovine amusement. There’s a moral to the story, but I’m too lazy to write about it at the moment.

A shepherd named Zoltan lived in a small village in the foothills of the Carpathian Mountains. He tended to a flock of twelve sheep: Soffia, Krystyna, Anastasia, Orsolya, Marianna, Zigana, Julinka, Rozalia, Zsa Zsa, Franciska, Erzsebet, and Agi. Zoltan was a keen observer of animal nature, and would often point out the idiosyncrasies of his sheep’s behavior to other shepherds whenever they got together.

“Anastasia and Orsolya are BFFs. Whatever one does, the other one does too. If Anastasia starts licking her face, Orsolya will too; if Orsolya starts bleating, Anastasia will start harmonizing along with her.”

“Julinka has a limp in her left leg that makes her ornery. She doesn’t want your pity, only your delicious clovers.”

“Agi is stubborn but logical. You know that old saying, spare the rod and spoil the sheep? Well, it doesn’t work for Agi. You need calculus and rhetoric with Agi.”

Zoltan’s colleagues were so impressed by these insights that they began to encourage him to record his observations for posterity.

“Just think, Zoltan,” young Gergely once confided. “If something bad happened to you, the world would lose all of your knowledge. You should write a book about sheep and give it to the rest of us. I hear you only need to know six or seven related things to publish a book.”

On such occasions, Zoltan would hem and haw solemnly, mumbling that he didn’t know enough to write a book, and that anyway, nothing he said was really very important. It was false modesty of course; in reality, he was deeply flattered, and very much concerned that his vast body of sheep knowledge would disappear along with him one day. So one day, Zoltan packed up his knapsack, asked Gergely to look after his sheep for the day, and went off to consult with the wise old woman who lived in the next village.

The old woman listened to Zoltan’s story with a good deal of interest, nodding sagely at all the right moments. When Zoltan was done, the old woman mulled her thoughts over for a while.

“If you want to be taken seriously, you must publish your findings in a peer-reviewed journal,” she said finally.

“What’s Pier Evew?” asked Zoltan.

“One moment,” said the old woman, disappearing into her bedroom. She returned clutching a dusty magazine. “Here,” she said, handing the magazine to Zoltan. “This is peer review.”

That night, after his sheep had gone to bed, Zoltan stayed up late poring over Vol. IV, Issue 5 of Domesticated Animal Behavior Quarterly. Since he couldn’t understand the figures in the magazine, he read it purely for the articles. By the time he put the magazine down and leaned over to turn off the light, the first glimmerings of an empirical research program had begun to dance around in his head. Just like fireflies, he thought. No, wait, those really were fireflies. He swatted them away.

“I like this… science,” he mumbled to himself as he fell asleep.

In the morning, Zoltan went down to the local library to find a book or two about science. He checked out a volume entitled Principia Scientifica Buccolica—a masterful derivation from first principles of all of the most common research methods, with special applications to animal behavior. By lunchtime, Zoltan had covered t-tests, and by bedtime, he had mastered Mordenkainen’s correction for inestimable herds.

In the morning, Zoltan made his first real scientific decision.

“Today I’ll collect some pilot data,” he thought to himself, “and tomorrow I’ll apply for an R01.”

His first set of studies tested the provocative hypothesis that sheep communicate with one another by moving their ears back and forth in Morse code. Study 1 tested the idea observationally. Zoltan and two other raters (his younger cousins), both blind to the hypothesis, studied sheep in pairs, coding one sheep’s ear movements and the other sheep’s behavioral responses. Studies 2 through 4 manipulated the sheep’s behavior experimentally. In Study 2, Zoltan taped the sheep’s ears to their head; in Study 3, he covered their eyes with opaque goggles so that they couldn’t see each other’s ears moving. In Study 4, he split the twelve sheep into three groups of four in order to determine whether smaller groups might promote increased sociability.

That night, Zoltan minded the data. “It’s a lot like minding sheep,” Zoltan explained to his cousin Griga the next day. “You need to always be vigilant, so that a significant result doesn’t get away from you.”

Zoltan had been vigilant, and the first 4 studies produced a number of significant results. In Study 1, Zoltan found that sheep appeared to coordinate ear twitches: if one sheep twitched an ear several times in a row, it was a safe bet that other sheep would start to do the same shortly thereafter (p < .01). There was, however, no coordination of licking, headbutting, stamping, or bleating behaviors, no matter how you sliced and diced it. “It’s a highly selective effect,” Zoltan concluded happily. After all, when you thought about it, it made sense. If you were going to pick just one channel for sheep to communicate through, ear twitching was surely a good one. One could make a very good evolutionary argument that more obvious methods of communication (e.g., bleating loudly) would have been detected by humans long ago, and that would be no good at all for the sheep.

Studies 2 and 3 further supported Zoltan’s story. Study 2 demonstrated that when you taped sheep’s ears to their heads, they ceased to communicate entirely. You could put Rozalia and Erzsebet in adjacent enclosures and show Rozalia the Jack of Spades for three or four minutes at a time, and when you went to test Erzsebet, she still wouldn’t know the Jack of Spades from the Three of Diamonds. It was as if the sheep were blind! Except they weren’t blind, they were dumb. Zoltan knew; he had made them that way by taping their ears to their heads.

In Study 3, Zoltan found that when the sheep’s eyes were covered, they no longer coordinated ear twitching. Instead, they now coordinated their bleating—but only if you excluded bleats that were produced when the sheep’s heads were oriented downwards. “Fantastic,” he thought. “When you cover their eyes, they can’t see each other’s ears any more. So they use a vocal channel. This, again, makes good adaptive sense: communication is too important to eliminate entirely just because your eyes happen to be covered. Much better to incur a small risk of being detected and make yourself known in other, less subtle, ways.”

But the real clincher was Study 4, which confirmed that ear twitching occurred at a higher rate in smaller groups than larger groups, and was particularly common in dyads of well-adjusted sheep (like Anastasia and Orsolya, and definitely not like Zsa Zsa and Marianna).

“Sheep are like everyday people,” Zoltan told his sister on the phone. “They won’t say anything to your face in public, but get them one-on-one, and they won’t stop gossiping about each other.”

It was a compelling story, Zoltan conceded to himself. The only problem was the F test. The difference in twitch rates as a function of group size wasn’t quite statistically significant. Instead, it hovered around p = .07, which the textbooks told Zoltan meant that he was almost right. Almost right was the same thing as potentially wrong, which wasn’t good enough. So the next morning, Zoltan asked Gergely to lend him four sheep so he could increase his sample size.

“Absolutely not,” said Gergely. “I don’t want your sheep filling my sheep’s heads with all of your crazy new ideas.”

“Look,” said Zoltan. “If you lend me four sheep, I’ll let you drive my Cadillac down to the village on weekends after I get famous.”

“Deal,” said Gergely.

So Zoltan borrowed the sheep. But it turned out that four sheep weren’t quite enough; after adding Gergely’s sheep to the sample, the effect only went from p = .07 to p = .06. So Zoltan cut a deal with his other neighbor, Yuri: four of Yuri’s sheep for two days, in return for three days with Zoltan’s new Lexus (once he bought it). That did the trick. Once Zoltan repeated the experiment with Yuri’s sheep, the p-value for Study 4 now came to .046, which the textbooks assured Zoltan meant he was going to be famous.

Data in hand, Zoltan spent the next two weeks writing up his very first journal article. He titled it “Baa baa baa, or not: Sheep communicate via non-verbal channels”—a decidedly modest title for the first empirical work to demonstrate that sheep are capable of sophisticated propositional thought. The article was published to widespread media attention and scientific acclaim, and Zoltan went on to have a productive few years in animal behavioral research, studying topics as interesting and varied as giraffe calisthenics and displays of affection in the common leech.

Much later, it turned out that no one was able to directly replicate his original findings with sheep (though some other researchers did manage to come up with conceptual replications). But that didn’t really matter to Zoltan, because by then he’d decided science was too demanding a career anyway; it was way more fun to lie under trees counting his sheep. Counting sheep, and occasionally, on Saturdays, driving down to the village in his new Lexus, just to impress all the young cowgirls.

got R? get social science for R!

Drew Conway has a great list of 10 must-have R packages for social scientists. If you’re a social scientist (or really, any kind of scientist) who doesn’t use R, now is a great time to dive in and learn; there are tons of tutorials and guides out there (my favorite is Quick-R, which is incredibly useful incredibly often), and packages are available for just about any application you can think of. Best of all, R is completely free, and is available for just about every platform. Admittedly, there’s a fairly steep learning curve if you’re used to GUI-based packages like SPSS (R’s syntax can be pretty idiosyncratic), but it’s totally worth the time investment, and once you’re comfortable with R you’ll never look back.

Anyway, Drew’s list contains a number of packages I’ve found invaluable in my work, as well as several packages I haven’t used before and am pretty eager to try. I don’t have much to add to his excellent summaries, but I’ll gladly second the inclusion of ggplot2 (the easiest way in the world to make beautiful graphs?) and plyr and sqldf (great for sanitizing, organizing, and manipulating large data sets, which are often a source of frustration in R). Most of the other packages I haven’t had any reason to use personally, though a few seem really cool, and worth finding an excuse to play around with (e.g., Statnet and igraph).

Since Drew’s list focuses on packages useful to social scientists in general, I thought I’d mention a couple of others that I’ve found particularly useful for psychological applications. The most obvious one is William Revelle’s awesome psych package, which contains tons of useful functions for descriptive statistics, data reduction, simulation, and psychometrics. It’s saved me tons of time validating and scoring personality measures, though it probably isn’t quite as useful if you don’t deal with individual difference measures regularly. Other packages I’ve found useful are sem for structural equation modeling (which interfaces nicely with GraphViz to easily produce clean-looking path diagrams), genalg for genetic algorithms, MASS (mostly for sampling from multivariate distributions), reshape (similar functionality to plyr), and car, which contains a bunch of useful regression-related functions (e.g., for my dissertation, I needed to run SPSS-like repeated measures ANOVAs in R, which turns out to be a more difficult proposition than you’d imagine, but was handled by the Anova function in car). I’m sure there are others I’m forgetting, but those are the ones that I’ve relied on most heavily in recent work. No doubt there are tons of other packages out there that are handy for common psychology applications, so if there are any you use regularly, I’d love to hear about them in the comments!

specificity statistics for ROI analyses: a simple proposal

The brain is a big place. In the context of fMRI analysis, what that bigness means is that a typical 3D image of the brain might contain anywhere from 50,000 – 200,000 distinct voxels (3D pixels). Any of those voxels could theoretically show meaningful activation in relation to some contrast of interest, so the only way to be sure that you haven’t overlooked potentially interesting activations is to literally test every voxel (or, given some parcellation algorithm, every region).

Unfortunately, the problem that approach raises–which I’ve discussed in more detail here–is the familiar one of multiple comparisons: If you’re going to test 100,000 locations, it’s not really fair to test each one at the conventional level of p < .05, because on average, you’ll get about 5,000 statistically significant results just by chance that way. So you need to do something to correct for the fact that you’re running thousands of tests. The most common approach is to simply make the threshold for significance more conservative–for example, by testing at p < .0001 instead of p < .05, or by using some combination of intensity and cluster extent thresholds (e.g., you look for 20 contiguous voxels that are all significant at, say, p < .001) that’s supposed to guarantee a cluster-wise error rate of .05.
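To make the arithmetic concrete, here is a minimal sketch in Python. It assumes every voxel is pure noise, so that each p-value is uniformly distributed between 0 and 1; the function names are made up for illustration and don't come from any fMRI package.

```python
import random


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of significant results when every test is pure noise."""
    return n_tests * alpha


def simulate_false_positives(n_tests: int, alpha: float, seed: int = 0) -> int:
    """Count how many uniform random 'p-values' fall below alpha."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


if __name__ == "__main__":
    # 100,000 voxels tested at p < .05: about 5,000 false positives expected.
    print(expected_false_positives(100_000, 0.05))
    # The same voxels tested at p < .0001: about 10 expected.
    print(expected_false_positives(100_000, 0.0001))
    # A single simulated null "brain" lands close to the expected count.
    print(simulate_false_positives(100_000, 0.05))
```

The simulated count won't be exactly 5,000 on any given run, but it should sit within a couple hundred of it.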

There is, however, a natural tension between false positives and false negatives: When you make your analysis more conservative, you let fewer false positives through the filter, but you also keep more of the true positives out. A lot of fMRI analysis really just boils down to walking a very thin line between running overconservative analyses that can’t detect anything but the most monstrous effects, and running overly liberal analyses that lack any real ability to distinguish meaningful signals from noise. One very common approach that fMRI researchers have adopted in an effort to optimize this balance is to use complementary hypothesis-driven and whole-brain analyses. The idea is that you’re basically carving the brain up into two separate search spaces: One small space for which you have a priori hypotheses that can be tested using a small number of statistical comparisons, and one much larger space (containing everything but the a priori space) where you continue to use a much more conservative threshold.

For example, if I believe that there’s a very specific chunk of right inferotemporal cortex that’s specialized for detecting clown faces, I can focus my hypothesis-testing on that particular region, without having to pretend that all voxels are created equal. So I delineate the boundaries of a CRC (Clown Representation Cortex) region-of-interest (ROI) based on some prior criteria (e.g., anatomy, or CRC activation in previous studies), and then I can run a single test at p < .05 to test my hypothesis, no correction needed. But to ensure that I don’t miss out on potentially important clown-related activation elsewhere in the brain, I also go ahead and run an additional whole-brain analysis that’s fully corrected for multiple comparisons. By coupling these two analyses, I hopefully get the best of both worlds. That is, I combine one approach (the ROI analysis) that maximizes power to test a priori hypotheses at the cost of an inability to detect effects in unexpected places with another approach (the whole-brain analysis) that has a much more limited capacity to detect effects in both expected and unexpected locations.

This two-pronged strategy is generally a pretty successful one, and I’d go so far as to say that a very large minority, if not an outright majority, of fMRI studies currently use it. Used wisely, I think it’s really an invaluable strategy. There is, however, one fairly serious and largely unappreciated problem associated with the incautious application of this approach. It has to do with claims about the specificity of activation that often tend to accompany studies that use a complementary ROI/whole-brain strategy. Specifically, a pretty common pattern is for researchers to (a) confirm their theoretical predictions by successfully detecting activation in one or more a priori ROIs; (b) identify few if any whole-brain activations; and consequently, (c) conclude that not only were the theoretical predictions confirmed, but that the hypothesized effects in the a priori ROIs were spatially selective, because a complementary whole-brain analysis didn’t turn up much (if anything). Or, to put it in less formal terms, not only were we right, we were really right! There isn’t any other part of the brain that shows the effect we hypothesized we’d see in our a priori ROI!

The problem with this type of inference is that there’s usually a massive discrepancy in the level of power available to detect effects in a priori ROIs versus the rest of the brain. If you search at p < .05 within some predetermined space, but at only p < .0001 everywhere else, you’re naturally going to detect results at a much lower rate everywhere else. But that’s not necessarily because there wasn’t just as much to look at everywhere else; it could just be because you didn’t look very carefully. By way of analogy, if you’re out picking berries in the forest, and you decide to spend half your time on just one bush that (from a distance) seemed particularly berry-full, and the other half of your time divided between the other 40 bushes in the area, you’re not really entitled to conclude that you picked the best bush all along simply because you came away with a relatively full basket. Had you done a better job checking out the other bushes, you might well have found some that were even better, and then you’d have come away carrying two baskets full of delicious, sweet, sweet berries.
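The size of that discrepancy is easy to sketch. The snippet below is only an illustration under simplifying assumptions: a one-sided z-test, and a true effect whose expected z-score is 3 (a number I picked for the example, not one taken from any study). It uses nothing but Python's standard library.

```python
from statistics import NormalDist


def power_one_sided_z(expected_z: float, alpha: float) -> float:
    """Probability of detecting an effect with a one-sided z-test.

    expected_z is the z-score the effect would produce on average
    (an assumed value, chosen by whoever runs the example).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)  # z-score needed to reach significance
    return 1 - std_normal.cdf(z_crit - expected_z)


if __name__ == "__main__":
    for alpha in (0.05, 0.0001):
        power = power_one_sided_z(expected_z=3.0, alpha=alpha)
        print(f"alpha = {alpha}: power = {power:.2f}")
```

Under these assumptions, an identical effect is detected roughly nine times out of ten inside the ROI, but only about one time in four anywhere else. The berry bushes outside the ROI aren't necessarily emptier; we're just barely looking at them.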

Now, in an ideal world, we’d solve this problem by simply going around and carefully inspecting all the berry bushes, until we were berry, berry sure (sorry: really convinced) that we’d found all of the best bushes. Unfortunately, we can’t do that, because we’re out here collecting berries on our lunch break, and the boss isn’t paying us to dick around in the woods. Or, to return to fMRI World, we simply can’t carefully inspect every single voxel (say, by testing it at p < .05), because then we’re right back in mega-false-positive-land, which we’ve already established as a totally boring place we want to avoid at all costs.

Since an optimal solution isn’t likely, the next best thing is to figure out what we can do to guard against careless overinterpretation. Here I think there’s actually a very simple, and relatively elegant, solution. What I’ve suggested when I’ve given recent talks on this topic is that we mandate (or at least, encourage) the use of what you could call a specificity statistic (SS). The SS is a very simple measure of how specific a given ROI-level finding is; it’s just the proportion of voxels that are statistically significant when tested at the same level as the ROI-level effects. In most cases, that’s going to be p < .05, so the SS will usually just be the proportion of all voxels anywhere in the brain that are activated at p < .05.
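Here's a rough sketch of the calculation in Python, assuming you already have one p-value per voxel in a flat list. The toy_p_map helper is purely hypothetical: it fabricates a p-value map for demonstration (one-sided p-values from random z-scores, with an assumed effect size of 3 in the "active" voxels) and has nothing to do with real data.

```python
import random
from statistics import NormalDist


def specificity_statistic(p_values, alpha=0.05):
    """Proportion of all voxels that are significant at the ROI-level threshold."""
    p_values = list(p_values)
    if not p_values:
        raise ValueError("need at least one voxel")
    return sum(p < alpha for p in p_values) / len(p_values)


def toy_p_map(n_voxels, share_active, effect_z=3.0, seed=0):
    """Hypothetical p-value map: a share of voxels carries a true effect."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    n_active = int(n_voxels * share_active)
    p_map = []
    for i in range(n_voxels):
        mean = effect_z if i < n_active else 0.0
        z = rng.gauss(mean, 1.0)
        p_map.append(1 - std_normal.cdf(z))  # one-sided p-value
    return p_map


if __name__ == "__main__":
    focal = toy_p_map(50_000, share_active=0.01)    # effect confined to 1% of voxels
    diffuse = toy_p_map(50_000, share_active=0.25)  # effect spread over a quarter
    print(f"focal SS:   {specificity_statistic(focal):.3f}")
    print(f"diffuse SS: {specificity_statistic(diffuse):.3f}")
```

Note that even a brain full of nothing but noise will have an SS of about 5% at p < .05, so that's the baseline to compare against.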

To see why this is useful, consider what could no longer happen: Researchers would no longer be able to (inadvertently) capitalize on the fact that the one or two regions they happened to define as a priori ROIs turned up significant effects when no other regions did in a whole-brain analysis. Suppose that someone reports a finding that negative emotion activates the amygdala in an ROI analysis, but doesn’t activate any other region in a whole-brain analysis. (While I’m pulling this particular example out of a hat here, I feel pretty confident that if you went and did a thorough literature review, you’d find at least three or four studies that have made this exact claim.) This is a case where the SS would come in really handy. Because if the SS is, say, 26% (i.e., about a quarter of all voxels in the brain are active at p < .05, even if none survive full correction for multiple comparisons), you would want to draw a very different conclusion than if it was just 4%. If fully a quarter of the brain were to show greater activation for a negative-minus-neutral emotion contrast, you wouldn’t want to conclude that the amygdala was critically involved in negative emotion; a better interpretation would be that the researchers in question just happened to define an a priori region that fell within the right quarter of the brain. Perhaps all that’s happening is that negative emotion elicits a general increase in attention, and much of the brain (including, but by no means limited to, the amygdala) tends to increase activation correspondingly. So as a reviewer and reader, you’d want to know how specific the reported amygdala activation really is*. But in the vast majority of papers, you currently have no way of telling (and the researchers probably don’t even know the answer themselves!).

The principal beauty of this statistic lies in its simplicity: It’s easy to understand, easy to calculate, and easy to report. Ideally, researchers would report the SS any time ROI analyses are involved, and would do it for every reported contrast. But at minimum, I think we should all encourage each other (and ourselves) to report such a statistic any time we’re making a specificity claim about ROI-based results. In other words, if you want to argue that a particular cognitive function is relatively localized to the ROI(s) you happened to select, you should be required to show that there aren’t that many other voxels (or regions) that show the same effect when tested at the liberal threshold you used for the ROI analysis. There shouldn’t be an excuse for not doing this; it’s a very easy procedure for researchers to implement, and an even easier one for reviewers to demand.

* An alternative measure of specificity would be to report the percentile ranking of all of the voxels within the ROI mask relative to all other individual voxels. In the above example, you’d assign very different interpretations depending on whether the amygdala was in the 32nd or 87th percentile of all voxels, when ordered according to the strength of the effect for the negative – neutral contrast.
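One way to read that alternative measure, sketched in Python: rank each ROI voxel against all the voxels outside the ROI, then summarize the ROI by the median of those ranks. Using the median is my assumption here (the mean, or the full distribution, would be just as defensible), and the function name is invented for the example.

```python
from bisect import bisect_left
from statistics import median


def roi_percentile(effects, roi_indices):
    """Median percentile rank of ROI voxels relative to all non-ROI voxels.

    effects: one effect-size value per voxel (e.g., a contrast statistic).
    roi_indices: positions in `effects` that belong to the ROI mask.
    """
    roi = set(roi_indices)
    others = sorted(v for i, v in enumerate(effects) if i not in roi)
    if not roi or not others:
        raise ValueError("need at least one voxel inside and outside the ROI")
    # For each ROI voxel: percent of non-ROI voxels with a strictly weaker effect.
    ranks = [100 * bisect_left(others, effects[i]) / len(others) for i in roi]
    return median(ranks)


if __name__ == "__main__":
    effects = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy effect sizes, one per voxel
    print(roi_percentile(effects, roi_indices=[8]))  # 80.0
    print(roi_percentile(effects, roi_indices=[2]))  # 20.0
```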

a well-written mainstream article on fMRI?!

Craig Bennett, of prefrontal.org and dead salmon fame, links to a really great Science News article on the promises and pitfalls of fMRI. As Bennett points out, the real gem of the article is the “quote of the week” from Nikos Logothetis (which I won’t spoil for you here; you’ll have to do just a little more work to get to it). But the article is full of many other insightful quotes from fMRI researchers, and manages to succinctly and accurately describe a number of recent controversies in the fMRI literature without sacrificing too much detail. Usually when I come across a mainstream article on fMRI, I pre-emptively slap the screen a few times before I start reading, because I know I’m about to get angry. Well, I did that this time too, so my hand hurts per usual, but at least this time I feel pretty good about it. Kudos to Laura Sanders for writing one of the best non-technical accounts I’ve seen of the current state of fMRI research (and one that, unlike a number of other articles in this vein, actually ends on a balanced and optimistic note).

every day is national lab day

This week’s issue of Science has a news article about National Lab Day, a White House-supported initiative to pair up teachers and scientists in an effort to improve STEM education nation-wide. As the article notes, National Lab Day is a bit of a misnomer, seeing as the goal is to encourage a range of educational activities over the next year or so. That’s a sentiment I can appreciate; why pick just one national lab day when you can have ALL OF THEM.

In any case, if you’re a scientist, you can sign up simply by giving away all of your deepest secrets and best research ideas (er, by providing your contact information and describing your academic background). I’m not really sure what happens after that, but in theory, at some point you’re supposed to wind up in a K-12 classroom demonstrating what you do and why it’s cool, which I guess could involve activities like pulling french fries out of burning oil with your bare hands, or applying TMS to 3rd graders’ foreheads, or other things of that nature. Of course, you can’t really bring an fMRI scanner into a classroom (though I suppose you could bring a classroom to an fMRI scanner), so I’m not really sure what I’ll do if anyone actually contacts me and asks me to come visit their classroom. I guess there’s always videos of lesion patients and the Muller-Lyer illusion, right?

building a cumulative science of human brain function at CNS

Earlier today, I received an email saying that a symposium I submitted for the next CNS meeting was accepted for inclusion in the program. I’m pretty excited about this; I think the topic of the symposium is a really important one, and this will be a great venue to discuss some of the relevant issues. The symposium is titled “Toward a cumulative science of human brain function”, which is a pretty good description of its contents. Actually, I stole (er, borrowed) that title from one of the other speakers (Tor Wager); originally, the symposium was going to be called something like “Cognitive Neuroscience would Suck Less if we all Pooled our Findings Together Instead of Each Doing our own Thing.” In hindsight, I think title theft was the right course of action. Anyway, with the exception of my own talk, which is assured of being perfectly mediocre, the line-up is really stellar; the other speakers are David Van Essen, Tor Wager (my current post-doc advisor), and Russ Poldrack, all of whom do absolutely fantastic research, and give great talks to boot. Here’s the symposium abstract:

This symposium is designed to promote development of a cumulative science of human brain function that advances knowledge through formal synthesis of the rapidly growing functional neuroimaging literature. The first speaker (Tal Yarkoni) will motivate the need for a cumulative approach by highlighting several limitations of individual studies that can only be overcome by synthesizing the results of multiple studies. The second speaker (David Van Essen) will discuss the basic tools required in order to support formal synthesis of multiple studies, focusing particular attention on SumsDB, a massive database of functional neuroimaging data that can support sophisticated search and visualization queries. The third and fourth speakers will discuss two different approaches to combining and filtering results from multiple studies. Tor Wager will review state-of-the-art approaches to meta-analysis of fMRI data, providing empirical examples of the power of meta-analysis to both validate and disconfirm widely held views of brain organization. Russell Poldrack will discuss a novel taxonomic approach that uses collaboratively annotated meta-data to develop formal ontologies of brain function. Collectively, these four complementary talks will familiarize the audience with (a) the importance of adopting cumulative approaches to functional neuroimaging data; (b) currently available tools for accessing and retrieving information from multiple studies; and (c) state-of-the-art techniques for synthesizing the results of different functional neuroimaging studies into an integrated whole.

Anyway, I think it’ll be a really interesting set of talks, so if you’re at CNS next year, and find yourself hanging around at the convention center for half a day (though why you’d want to do that is beyond me, given that the conference is in MONTREAL), please check it out!