the parable of zoltan and his twelve sheep, or why a little skepticism goes a long way

What follows is a fictional piece about sheep and statistics. I wrote it about two years ago, intending it to serve as a preface to an article on the dangers of inadvertent data fudging. But then I decided that no journal editor in his or her right mind would accept an article that started out talking about thinking sheep. And anyway, the rest of the article wasn’t very good. So instead, I post this parable here for your ovine amusement. There’s a moral to the story, but I’m too lazy to write about it at the moment.

A shepherd named Zoltan lived in a small village in the foothills of the Carpathian Mountains. He tended to a flock of twelve sheep: Soffia, Krystyna, Anastasia, Orsolya, Marianna, Zigana, Julinka, Rozalia, Zsa Zsa, Franciska, Erzsebet, and Agi. Zoltan was a keen observer of animal nature, and would often point out the idiosyncrasies of his sheep’s behavior to other shepherds whenever they got together.

“Anastasia and Orsolya are BFFs. Whatever one does, the other one does too. If Anastasia starts licking her face, Orsolya will too; if Orsolya starts bleating, Anastasia will start harmonizing along with her.”

“Julinka has a limp in her left leg that makes her ornery. She doesn’t want your pity, only your delicious clovers.”

“Agi is stubborn but logical. You know that old saying, spare the rod and spoil the sheep? Well, it doesn’t work for Agi. You need calculus and rhetoric with Agi.”

Zoltan’s colleagues were so impressed by these insights that they began to encourage him to record his observations for posterity.

“Just think, Zoltan,” young Gergely once confided. “If something bad happened to you, the world would lose all of your knowledge. You should write a book about sheep and give it to the rest of us. I hear you only need to know six or seven related things to publish a book.”

On such occasions, Zoltan would hem and haw solemnly, mumbling that he didn’t know enough to write a book, and that anyway, nothing he said was really very important. It was false modesty, of course; in reality, he was deeply flattered, and very much concerned that his vast body of sheep knowledge would disappear along with him one day. So one day, Zoltan packed up his knapsack, asked Gergely to look after his sheep for the day, and went off to consult with the wise old woman who lived in the next village.

The old woman listened to Zoltan’s story with a good deal of interest, nodding sagely at all the right moments. When Zoltan was done, the old woman mulled her thoughts over for a while.

“If you want to be taken seriously, you must publish your findings in a peer-reviewed journal,” she said finally.

“What’s Pier Evew?” asked Zoltan.

“One moment,” said the old woman, disappearing into her bedroom. She returned clutching a dusty magazine. “Here,” she said, handing the magazine to Zoltan. “This is peer review.”

That night, after his sheep had gone to bed, Zoltan stayed up late poring over Vol. IV, Issue 5 of Domesticated Animal Behavior Quarterly. Since he couldn’t understand the figures in the magazine, he read it purely for the articles. By the time he put the magazine down and leaned over to turn off the light, the first glimmerings of an empirical research program had begun to dance around in his head. Just like fireflies, he thought. No, wait, those really were fireflies. He swatted them away.

“I like this… science,” he mumbled to himself as he fell asleep.

In the morning, Zoltan went down to the local library to find a book or two about science. He checked out a volume entitled Principia Scientifica Buccolica—a masterful derivation from first principles of all of the most common research methods, with special applications to animal behavior. By lunchtime, Zoltan had covered t-tests, and by bedtime, he had mastered Mordenkainen’s correction for inestimable herds.

In the morning, Zoltan made his first real scientific decision.

“Today I’ll collect some pilot data,” he thought to himself, “and tomorrow I’ll apply for an R01.”

His first set of studies tested the provocative hypothesis that sheep communicate with one another by moving their ears back and forth in Morse code. Study 1 tested the idea observationally. Zoltan and two other raters (his younger cousins), both blind to the hypothesis, studied sheep in pairs, coding one sheep’s ear movements and the other sheep’s behavioral responses. Studies 2 through 4 manipulated the sheep’s behavior experimentally. In Study 2, Zoltan taped the sheep’s ears to their head; in Study 3, he covered their eyes with opaque goggles so that they couldn’t see each other’s ears moving. In Study 4, he split the twelve sheep into three groups of four in order to determine whether smaller groups might promote increased sociability.

That night, Zoltan minded the data. “It’s a lot like minding sheep,” Zoltan explained to his cousin Griga the next day. “You need to always be vigilant, so that a significant result doesn’t get away from you.”

Zoltan had been vigilant, and the first four studies produced a number of significant results. In Study 1, Zoltan found that sheep appeared to coordinate ear twitches: if one sheep twitched an ear several times in a row, it was a safe bet that other sheep would start to do the same shortly thereafter (p < .01). There was, however, no coordination of licking, headbutting, stamping, or bleating behaviors, no matter how you sliced and diced it. “It’s a highly selective effect,” Zoltan concluded happily. After all, when you thought about it, it made sense. If you were going to pick just one channel for sheep to communicate through, ear twitching was surely a good one. One could make a very good evolutionary argument that more obvious methods of communication (e.g., bleating loudly) would have been detected by humans long ago, and that would be no good at all for the sheep.

Studies 2 and 3 further supported Zoltan’s story. Study 2 demonstrated that when you taped sheep’s ears to their heads, they ceased to communicate entirely. You could put Rozalia and Erzsebet in adjacent enclosures and show Rozalia the Jack of Spades for three or four minutes at a time, and when you went to test Erzsebet, she still wouldn’t know the Jack of Spades from the Three of Diamonds. It was as if the sheep were blind! Except they weren’t blind, they were dumb. Zoltan knew; he had made them that way by taping their ears to their heads.

In Study 3, Zoltan found that when the sheep’s eyes were covered, they no longer coordinated ear twitching. Instead, they now coordinated their bleating—but only if you excluded bleats that were produced when the sheep’s heads were oriented downwards. “Fantastic,” he thought. “When you cover their eyes, they can’t see each other’s ears any more. So they use a vocal channel. This, again, makes good adaptive sense: communication is too important to eliminate entirely just because your eyes happen to be covered. Much better to incur a small risk of being detected and make yourself known in other, less subtle, ways.”

But the real clincher was Study 4, which confirmed that ear twitching occurred at a higher rate in smaller groups than larger groups, and was particularly common in dyads of well-adjusted sheep (like Anastasia and Orsolya, and definitely not like Zsa Zsa and Marianna).

“Sheep are like everyday people,” Zoltan told his sister on the phone. “They won’t say anything to your face in public, but get them one-on-one, and they won’t stop gossiping about each other.”

It was a compelling story, Zoltan conceded to himself. The only problem was the F test. The difference in twitch rates as a function of group size wasn’t quite statistically significant. Instead, it hovered around p = .07, which the textbooks told Zoltan meant that he was almost right. Almost right was the same thing as potentially wrong, which wasn’t good enough. So the next morning, Zoltan asked Gergely to lend him four sheep so he could increase his sample size.

“Absolutely not,” said Gergely. “I don’t want your sheep filling my sheep’s heads with all of your crazy new ideas.”

“Look,“ said Zoltan. “If you lend me four sheep, I’ll let you drive my Cadillac down to the village on weekends after I get famous.“

“Deal,“ said Gergely.

So Zoltan borrowed the sheep. But it turned out that four sheep weren’t quite enough; after adding Gergely’s sheep to the sample, the effect only went from p = .07 to p = .06. So Zoltan cut a deal with his other neighbor, Yuri: four of Yuri’s sheep for two days, in return for three days with Zoltan’s new Lexus (once he bought it). That did the trick. Once Zoltan repeated the experiment with Yuri’s sheep, the p-value for Study 4 now came to .046, which the textbooks assured Zoltan meant he was going to be famous.

Data in hand, Zoltan spent the next two weeks writing up his very first journal article. He titled it “Baa baa baa, or not: Sheep communicate via non-verbal channels”—a decidedly modest title for the first empirical work to demonstrate that sheep are capable of sophisticated propositional thought. The article was published to widespread media attention and scientific acclaim, and Zoltan went on to have a productive few years in animal behavioral research, studying topics as interesting and varied as giraffe calisthenics and displays of affection in the common leech.

Much later, it turned out that no one was able to directly replicate his original findings with sheep (though some other researchers did manage to come up with conceptual replications). But that didn’t really matter to Zoltan, because by then he’d decided science was too demanding a career anyway; it was way more fun to lie under trees counting his sheep. Counting sheep, and occasionally, on Saturdays, driving down to the village in his new Lexus, just to impress all the young cowgirls.

specificity statistics for ROI analyses: a simple proposal

The brain is a big place. In the context of fMRI analysis, what that bigness means is that a typical 3D image of the brain might contain anywhere from 50,000 to 200,000 distinct voxels (3D pixels). Any of those voxels could theoretically show meaningful activation in relation to some contrast of interest, so the only way to be sure that you haven’t overlooked potentially interesting activations is to literally test every voxel (or, given some parcellation algorithm, every region).

Unfortunately, the problem that approach raises–which I’ve discussed in more detail here–is the familiar one of multiple comparisons: If you’re going to test 100,000 locations, it’s not really fair to test each one at the conventional level of p < .05, because on average, you’ll get about 5,000 statistically significant results just by chance that way. So you need to do something to correct for the fact that you’re running thousands of tests. The most common approach is to simply make the threshold for significance more conservative–for example, by testing at p < .0001 instead of p < .05, or by using some combination of intensity and cluster extent thresholds (e.g., you look for 20 contiguous voxels that are all significant at, say, p < .001) that’s supposed to guarantee a cluster-wise error rate of .05.
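
If you want to see just how bad the problem gets at that scale, a couple of lines of simulation make the point. This is only a toy sketch (the voxel count is made up, and real fMRI data are spatially correlated, which the simulation ignores):

    import numpy as np

    rng = np.random.default_rng(0)
    n_voxels = 100_000                  # a typical-ish whole-brain voxel count

    # Under the null hypothesis (no real effects anywhere), p-values are
    # uniformly distributed, so we can just draw them directly.
    p = rng.uniform(size=n_voxels)

    print((p < .05).sum())              # ~5,000 voxels "significant" by chance alone
    print((p < .0001).sum())            # ~10 false positives at the stricter threshold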

There is, however, a natural tension between false positives and false negatives: When you make your analysis more conservative, you let fewer false positives through the filter, but you also keep more of the true positives out. A lot of fMRI analysis really just boils down to walking a very thin line between running overconservative analyses that can’t detect anything but the most monstrous effects, and running overly liberal analyses that lack any real ability to distinguish meaningful signals from noise. One very common approach that fMRI researchers have adopted in an effort to optimize this balance is to use complementary hypothesis-driven and whole-brain analyses. The idea is that you’re basically carving the brain up into two separate search spaces: One small space for which you have a priori hypotheses that can be tested using a small number of statistical comparisons, and one much larger space (containing everything but the a priori space) where you continue to use a much more conservative threshold.

For example, if I believe that there’s a very specific chunk of right inferotemporal cortex that’s specialized for detecting clown faces, I can focus my hypothesis-testing on that particular region, without having to pretend that all voxels are created equal. So I delineate the boundaries of a CRC (Clown Representation Cortex) region-of-interest (ROI) based on some prior criteria (e.g., anatomy, or CRC activation in previous studies), and then I can run a single test at p < .05 to test my hypothesis, no correction needed. But to ensure that I don’t miss out on potentially important clown-related activation elsewhere in the brain, I also go ahead and run an additional whole-brain analysis that’s fully corrected for multiple comparisons. By coupling these two analyses, I hopefully get the best of both worlds. That is, I combine one approach (the ROI analysis) that maximizes power to test a priori hypotheses at the cost of an inability to detect effects in unexpected places with another approach (the whole-brain analysis) that has a much more limited capacity to detect effects in both expected and unexpected locations.
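
In code, the two-pronged approach might look something like the sketch below. It's purely schematic: I'm using a simple Bonferroni correction as a stand-in for whatever intensity/cluster-extent correction you actually prefer, and the variable names are invented for the example.

    import numpy as np
    from scipy import stats

    def two_pronged_analysis(roi_estimates, whole_brain_p, alpha=.05):
        """roi_estimates: per-subject mean contrast estimates within the a priori ROI.
        whole_brain_p: voxelwise group-level p-values across the whole brain."""
        # Prong 1: a single uncorrected test on the a priori ROI average.
        t, p_roi = stats.ttest_1samp(roi_estimates, 0)
        roi_significant = p_roi < alpha

        # Prong 2: whole-brain search, corrected for the number of voxels tested.
        corrected_alpha = alpha / whole_brain_p.size
        surviving_voxels = np.flatnonzero(whole_brain_p < corrected_alpha)

        return roi_significant, surviving_voxels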

This two-pronged strategy is generally a pretty successful one, and I’d go so far as to say that a very large minority, if not an outright majority, of fMRI studies currently use it. Used wisely, I think it’s really an invaluable strategy. There is, however, one fairly serious and largely unappreciated problem associated with the incautious application of this approach. It has to do with claims about the specificity of activation that often tend to accompany studies that use a complementary ROI/whole-brain strategy. Specifically, a pretty common pattern is for researchers to (a) confirm their theoretical predictions by successfully detecting activation in one or more a priori ROIs; (b) identify few if any whole-brain activations; and consequently, (c) conclude that not only were the theoretical predictions confirmed, but that the hypothesized effects in the a priori ROIs were spatially selective, because a complementary whole-brain analysis didn’t turn up much (if anything). Or, to put it in less formal terms, not only were we right, we were really right! There isn’t any other part of the brain that shows the effect we hypothesized we’d see in our a priori ROI!

The problem with this type of inference is that there’s usually a massive discrepancy in the level of power available to detect effects in a priori ROIs versus the rest of the brain. If you search at p < .05 within some predetermined space, but at only p < .0001 everywhere else, you’re naturally going to detect results at a much lower rate everywhere else. But that’s not necessarily because there wasn’t just as much to look at everywhere else; it could just be because you didn’t look very carefully. By way of analogy, if you’re out picking berries in the forest, and you decide to spend half your time on just one bush that (from a distance) seemed particularly berry-full, and the other half of your time divided between the other 40 bushes in the area, you’re not really entitled to conclude that you picked the best bush all along simply because you came away with a relatively full basket. Had you done a better job checking out the other bushes, you might well have found some that were even better, and then you’d have come away carrying two baskets full of delicious, sweet, sweet berries.

Now, in an ideal world, we’d solve this problem by simply going around and carefully inspecting all the berry bushes, until we were berry, berry sure that we’d found all of the best bushes. Unfortunately, we can’t do that, because we’re out here collecting berries on our lunch break, and the boss isn’t paying us to dick around in the woods. Or, to return to fMRI World, we simply can’t carefully inspect every single voxel (say, by testing it at p < .05), because then we’re right back in mega-false-positive-land, which we’ve already established as a totally boring place we want to avoid at all costs.

Since an optimal solution isn’t likely, the next best thing is to figure out what we can do to guard against careless overinterpretation. Here I think there’s actually a very simple, and relatively elegant, solution. What I’ve suggested when I’ve given recent talks on this topic is that we mandate (or at least, encourage) the use of what you could call a specificity statistic (SS). The SS is a very simple measure of how specific a given ROI-level finding is; it’s just the proportion of voxels that are statistically significant when tested at the same level as the ROI-level effects. In most cases, that’s going to be p < .05, so the SS will usually just be the proportion of all voxels anywhere in the brain that are activated at p < .05.
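
Computing it is about as simple as statistics get. Here's a minimal sketch, assuming you already have the whole-brain vector of uncorrected voxelwise p-values for the contrast in question (the variable names are just placeholders):

    import numpy as np

    def specificity_statistic(p_values, alpha=.05):
        """Proportion of all in-brain voxels significant at the same
        uncorrected threshold used for the ROI analysis."""
        p_values = np.asarray(p_values)
        return (p_values < alpha).mean()

    # e.g., specificity_statistic(p_map[brain_mask]) returning 0.26 would mean
    # that roughly a quarter of the brain shows the effect at p < .05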

To see why this is useful, consider what could no longer happen: Researchers would no longer be able to (inadvertently) capitalize on the fact that the one or two regions they happened to define as a priori ROIs turned up significant effects when no other regions did in a whole-brain analysis. Suppose that someone reports a finding that negative emotion activates the amygdala in an ROI analysis, but doesn’t activate any other region in a whole-brain analysis. (While I’m pulling this particular example out of a hat here, I feel pretty confident that if you went and did a thorough literature review, you’d find at least three or four studies that have made this exact claim.) This is a case where the SS would come in really handy. Because if the SS is, say, 26% (i.e., about a quarter of all voxels in the brain are active at p < .05, even if none survive full correction for multiple comparisons), you would want to draw a very different conclusion than if it was just 4%. If fully a quarter of the brain were to show greater activation for a negative-minus-neutral emotion contrast, you wouldn’t want to conclude that the amygdala was critically involved in negative emotion; a better interpretation would be that the researchers in question just happened to define an a priori region that fell within the right quarter of the brain. Perhaps all that’s happening is that negative emotion elicits a general increase in attention, and much of the brain (including, but by no means limited to, the amygdala) tends to increase activation correspondingly. So as a reviewer and reader, you’d want to know how specific the reported amygdala activation really is*. But in the vast majority of papers, you currently have no way of telling (and the researchers probably don’t even know the answer themselves!).

The principal beauty of this statistic lies in its simplicity: It’s easy to understand, easy to calculate, and easy to report. Ideally, researchers would report the SS any time ROI analyses are involved, and would do it for every reported contrast. But at minimum, I think we should all encourage each other (and ourselves) to report such a statistic any time we’re making a specificity claim about ROI-based results. In other words, if you want to argue that a particular cognitive function is relatively localized to the ROI(s) you happened to select, you should be required to show that there aren’t that many other voxels (or regions) that show the same effect when tested at the liberal threshold you used for the ROI analysis. There shouldn’t be an excuse for not doing this; it’s a very easy procedure for researchers to implement, and an even easier one for reviewers to demand.

* An alternative measure of specificity would be to report the percentile ranking of all of the voxels within the ROI mask relative to all other individual voxels. In the above example, you’d assign very different interpretations depending on whether the amygdala was in the 32nd or 87th percentile of all voxels, when ordered according to the strength of the effect for the negative – neutral contrast.
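
That variant is just as easy to compute; here's a rough sketch, assuming a whole-brain map of effect sizes (say, t-values) and a boolean ROI mask, and summarizing the ROI with its median percentile (that summary choice is mine, not part of the proposal):

    import numpy as np
    from scipy import stats

    def roi_percentile(effect_map, roi_mask):
        """Median percentile rank of ROI voxels relative to all in-brain voxels,
        when voxels are ordered by effect strength for the contrast of interest."""
        percentiles = stats.rankdata(effect_map) / effect_map.size * 100
        return np.median(percentiles[roi_mask])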

solving the file drawer problem by making the internet the drawer

UPDATE 11/22/2011 — Hal Pashler’s group at UCSD just introduced a new website called PsychFileDrawer that’s vastly superior in every way to the prototype I mention in the post below; be sure to check it out!

Science is a difficult enterprise, so scientists have many problems. One particularly nasty problem is the File Drawer Problem. The File Drawer Problem is actually related to another serious scientific problem known as the Desk Problem. The Desk Problem is that many scientists have messy desks covered with overflowing stacks of papers, which can make it very hard to find things on one’s desk–or, for that matter, to clear enough space to lay down another stack of papers. A common solution to the Desk Problem is to shove all of those papers into one’s file drawer. Which brings us to the File Drawer Problem. The File Drawer Problem refers to the fact that, eventually, even the best-funded of scientists run out of room in their file drawers.

Ok, so that’s not exactly right. What the file drawer problem–a term coined by Robert Rosenthal in a seminal 1979 article–really refers to is the fact that null results tend to go unreported in the scientific literature at a much higher rate than positive findings, because journals don’t like to publish papers that say “we didn’t find anything”, and as a direct consequence, authors don’t like to write papers that say “journals won’t want to publish this”.

Because of this systematic bias against null results, the eventual resting place of many a replication failure is its author’s file drawer. The reason this is a problem is that, over the long term, if only (or mostly) positive findings ever get published, researchers can get a very skewed picture of how strong an effect really is. To illustrate, let’s say that Joe X publishes a study showing that people with lawn gnomes in their front yards tend to be happier than people with no lawn gnomes in their yards. Intuitive as that result may be, someone is inevitably going to get the crazy idea that this effect is worth replicating once or twice before we all stampede toward Home Depot or the Container Store with our wallets out (can you tell I’ve never bought a lawn gnome before?). So let’s say Suzanna Y and Ramesh Z each independently try to replicate the effect in their labs (meaning, they command their graduate students to do it). And they find… nothing! No effect. Turns out, people with lawn gnomes are just as miserable as the rest of us. Well, you don’t need a PhD in lawn decoration to recognize that Suzanna Y and Ramesh Z are not going to have much luck publishing their findings in very prestigious journals–or for that matter, in any journals. So those findings get buried in their file drawers, where they will live out the rest of their days with very sad expressions on their numbers.

Now let’s iterate this process several times. Every couple of years, some enterprising young investigator will decide she’s going to try to replicate that cool effect from 2009, since no one else seems to have bothered to do it. This goes on for a while, with plenty of null results, until eventually, just by chance, someone gets lucky (if you can call a false positive lucky) and publishes a successful replication. And also, once in a blue moon, someone who gets a null result actually forces their graduate student to write it up, and successfully gets out a publication that very carefully explains that, no, Virginia, lawn gnomes don’t really make you happy. So, over time, a small literature on the hedonic effects of lawn gnomes accumulates.

Eventually, someone else comes across this small literature and notices that it contains “mixed findings”, with some studies finding an effect, and others finding no effect. So this special someone–let’s call them the Master of the Gnomes–decides to do a formal meta-analysis. (A meta-analysis is basically just a fancy way of taking a bunch of other people’s studies, throwing them in a blender, and pouring out the resulting soup into a publication of your very own.) Now you can see why the failure to publish null results is going to be problematic: What the Master of the Gnomes doesn’t know about, the Master of the Gnomes can’t publish about. So any resulting meta-analytic estimate of the association between lawn gnomes and subjective well-being is going to be biased in the positive direction. That is, there’s a good chance that the meta-analysis will end up saying lawn gnomes make people very happy, when in reality lawn gnomes only make people a little happy, or don’t make people happy at all.

There are lots of ways to try to get around the file drawer problem, of course. One approach is to call up everyone you know who you think might have ever done any research on lawn gnomes and ask if you could take a brief peek into their file drawer. But meta-analysts are often very introverted people with no friends, so they may not know any other researchers. Or they might be too shy to ask other people for their data. And then too, some researchers are very protective of their file drawers, because in some cases, they’re hiding more than just papers in there. Bottom line, it’s not always easy to identify all of the null results that are out there.

A very different way to deal with the file drawer problem, and one suggested by Rosenthal in his 1979 article, is to compute a file drawer number, which is basically a number that tells you how many null results that you don’t know about would have to exist in people’s file drawers before the meta-analytic effect size estimate was itself rendered null. So, for example, let’s say you do a meta-analysis of 28 studies, and find that your best estimate, taking all studies into account, is that the standardized effect size (Cohen’s d) is 0.63, which is quite a large effect, and is statistically different from 0 at, say, the p < .00000001 level. Intuitively, that may seem like a lot of zeros, but being a careful methodologist, you decide you’d like a more precise definition of “a lot”. So you compute the file drawer number (in one of its many permutations), and it turns out that there would have to be 4,640,204 null results out there in people’s file drawers before the meta-analytic effect size became statistically non-significant. That’s a lot of studies, and it’s doubtful that there are even that many people studying lawn gnomes, so you can probably feel comfortable that there really is an association there, and that it’s fairly large.
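
For the curious, Rosenthal's version of the calculation is simple enough to do on a napkin: sum the Z-scores of your k studies, and the fail-safe number is (ΣZ)²/2.706 − k, where 2.706 is just 1.645 (the one-tailed .05 cutoff) squared. A quick sketch, with made-up numbers rather than the ones from the lawn gnome example above:

    import numpy as np

    def fail_safe_n(z_scores):
        """Rosenthal's (1979) fail-safe N: the number of unpublished studies
        averaging Z = 0 needed to drag the combined one-tailed p-value above .05."""
        z = np.asarray(z_scores, dtype=float)
        k = z.size
        return (z.sum() ** 2) / 2.706 - k

    # e.g., 28 studies that each came in around z = 2.3:
    # fail_safe_n(np.full(28, 2.3))  ->  roughly 1,500 hypothetical null results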

The problem, of course, is that it doesn’t always turn out that way. Sometimes you do the meta-analysis and find that your meta-analytic effect is cutting it pretty close, and that it would only take, say, 12 null results to render the effect non-significant. At that point, the file drawer N is no help; no amount of statistical cleverness is going to give you the extrasensory ability to peer into people’s file drawers at a distance. Moreover, even in cases where you can feel relatively confident that there couldn’t possibly be enough null results out there to make your effect go away entirely, it’s still possible that there are enough null results out there to substantially weaken it. Generally speaking, the file drawer N is a number you compute because you have to, not because you want to. In an ideal world, you’d always have all the information readily available at your fingertips, and all that would be left for you to do is toss it all in the blender and hit “meta-analyze”. But of course, we don’t live in an ideal world; we live in a horrible world full of things like tsunamis, lip syncing, and publication bias.

This brings me, in a characteristically long-winded way, to the point of this post. The fact that researchers often don’t have access to other researchers’ findings–null result or not–is in many ways a vestige of the fact that, until recently, there was no good way to rapidly and easily communicate one’s findings to others in an informal way. Of course, the telephone has been around for a long time, and the postal service has been around even longer. But the problem with telling other people what you found on the telephone is that they have to be listening, and you don’t really know ahead of time who’s going to want to hear about your findings. When Rosenthal was writing about file drawers in the 80s, there wasn’t any bulletin board where people could post their findings for all to see without going to the trouble of actually publishing them, so it made sense to focus on ways to work around the file drawer problem instead of through it.

These days, we do have a bulletin board where researchers can post their null results: The internet. In theory, an online database of null results presents an ideal solution to the file drawer problem: Instead of tossing their replication failures into a folder somewhere, researchers could spend a minute or two entering just a minimal amount of information into an online database, and that information would then live on in perpetuity, accessible to anyone else who cared to come along and enter the right keyword into the search box. Such a system could benefit everyone involved: researchers who ended up with unpublishable results could salvage at least some credit for their efforts, and ensure that their work wasn’t entirely lost to the sands of time; prospective meta-analysts could simplify the task of hunting down relevant findings in unlikely places; and scientists contemplating embarking on a new line of research that built heavily on an older finding could do a cursory search to see if other people had already tried (and failed) to replicate the foundational effect.

Sounds good, right? At least, that was my thought process last year, when I spent some time building an online database that could serve as this type of repository for null (and, occasionally, not-null) results. I got a working version up and running at failuretoreplicate.com, and was hoping to spend some time trying to write it up as a short paper, but then I started sinking into the quicksand of my dissertation, and promptly forgot about it. What jogged my memory was this post a couple of days ago, which describes a database, called the Negatome, that contains “a collection of protein and domain (functional units of proteins) pairs that are unlikely to be engaged in direct physical interactions”. This isn’t exactly the same thing as a database of null results, and is in a completely different field, but it was close enough to rekindle my interest and motivate me to dust off the site I built last year. So now the site is here, and it’s effectively open for business.

I should confess up front that I don’t harbor any great hopes of this working; I suspect it will be quite difficult to build the critical mass needed to make something like this work. Still, I’d like to try. The site is officially in beta, so stuff will probably still break occasionally, but it’s basically functional. You can create an account instantly and immediately start adding studies; it only takes a minute or two per study. There’s no need to enter much in the way of detail; the point isn’t to provide an alternative to peer-reviewed publication, but rather to provide a kind of directory service that researchers could use as a cursory tool for locating relevant information. All you have to do is enter a brief description of the effect you tried to replicate, an indication of whether or not you succeeded, and what branch of psychology the effect falls under. There are plenty of other fields you can enter (e.g., searchable tags, sample sizes, description of procedures, etc.), but they’re almost all optional. The goal is really to make this as effortless as possible for people to use, so that there is virtually no cost to contributing.
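
To give a concrete sense of how lightweight an entry is, here's a hypothetical example of the kind of record I have in mind (the field names are illustrative, not the site's literal schema):

    # a hypothetical minimal entry; only the first three fields would be required
    entry = {
        "effect_description": "Lawn gnome ownership predicts higher subjective well-being",
        "replicated": False,                 # did the effect replicate?
        "subfield": "social psychology",
        # everything below is optional
        "tags": ["lawn gnomes", "well-being"],
        "sample_size": 120,
        "procedure_notes": "Direct replication of the original survey procedure",
    }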

Anyway, right now there’s nothing on the site except a single lonely record I added in order to get things started. I’d be very grateful to anyone who wants to help get this project off the ground by adding a study or two. There are full editing and deletion capabilities, so you can always delete anything you add later on if you decide you don’t want to share after all. My hope is that, given enough community involvement and a large enough user base, this could eventually become a valuable resource psychologists could rely on when trying to establish how likely a finding is to replicate, or when trying to identify relevant studies to include in meta-analyses. You do want to help figure out what effect those sneaky, sneaky lawn gnomes have on our collective mental health, right?

tuesday at 3 pm works for me

Apparently, Tuesday at 3 pm is the best time to suggest as a meeting time–that’s when people have the most flexibility available in their schedule. At least, that’s the conclusion drawn by a study based on data from WhenIsGood, a free service that helps with meeting scheduling. There’s not much to the study beyond the conclusion I just gave away; not surprisingly, people don’t like to meet before 10 or 11 am or after 4 pm, and there’s very little difference in availability across different days of the week.

What I find neat about this isn’t so much the results of the study itself as the fact that it was done at all. I’m a big proponent of using commercial website data for research purposes–I’m about to submit a paper that relies almost entirely on content pulled using the Blogger API, and am working on another project that makes extensive use of the Twitter API. The scope of the datasets one can assemble via these APIs is simply unparalleled; for example, there’s no way I could ever realistically collect writing samples of 50,000+ words from 500+ participants in a laboratory setting, yet the ability to programmatically access blogspot.com blog contents makes the task trivial. And of course, many websites collect data of a kind that just isn’t available off-line. For example, the folks at OKCupid are able to continuously pump out interesting data on people’s online dating habits because they have comprehensive data on interactions between literally millions of prospective dating partners. If you want to try to generate that sort of data off-line, I hope you have a really large lab.
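
To give a flavor of how little code this kind of harvesting takes, here's a deliberately generic sketch; the endpoint and response format are stand-ins (any real API, Blogger and Twitter included, has its own URL scheme, authentication, and rate limits):

    import requests

    def fetch_posts(api_url, max_pages=10):
        """Pull paginated JSON posts from a (hypothetical) blog API and return
        a list of raw text samples for later analysis."""
        texts = []
        for page in range(1, max_pages + 1):
            resp = requests.get(api_url, params={"page": page})
            resp.raise_for_status()
            for post in resp.json().get("posts", []):
                texts.append(post.get("content", ""))
        return texts

    # e.g., texts = fetch_posts("https://example.com/api/posts"), followed by
    # whatever word counts or linguistic analyses you actually care about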

Of course, I recognize that in this case, the WhenIsGood study really just amounts to a glorified press release. You can tell that’s what it is from the URL, which literally includes the “press/” directory in its path. So I’m certainly not naive enough to think that Web 2.0 companies are publishing interesting research based on their proprietary data solely out of the goodness of their hearts. Quite the opposite. But I think in this case the desire for publicity works in researchers’ favor: It’s precisely because virtually any press is considered good press that many of these websites would probably be happy to let researchers play with their massive (de-identified) datasets. It’s just that, so far, hardly anyone’s asked. The Web 2.0 world is a largely untapped resource that researchers (or at least, psychologists) are only just beginning to take advantage of.

I suspect that this will change in the relatively near future. Five or ten years from now, I imagine that a relatively large chunk of the research conducted in many areas of psychology (particularly social and personality psychology) will rely heavily on massive datasets derived from commercial websites. And then we’ll all wonder in amazement at how we ever put up with the tediousness of collecting real-world data from two or three hundred college students at a time, when all of this online data was just lying around waiting for someone to come take a peek at it.