
In defense of In Defense of Facebook

A long, long time ago (in social media terms), I wrote a post defending Facebook against accusations of ethical misconduct related to a newly-published study in PNAS. I won’t rehash the study, or the accusations, or my comments in any detail here; for that, you can read the original post (I also recommend reading this or this for added context). While I stand by most of what I wrote, as is the nature of things, sometimes new information comes to light, and sometimes people say things that make me change my mind. So I thought I’d post my updated thoughts and reactions. I also left some additional thoughts in a comment on my last post, which I won’t rehash here.

Anyway, in no particular order…

I’m not arguing for a lawless world where companies can do as they like with your data

Some people apparently interpreted my last post as a defense of Facebook’s data use policy in general. It wasn’t. I probably brought this on myself in part by titling the post “In Defense of Facebook”. Maybe I should have called it something like “In Defense of this one particular study done by one Facebook employee”. In any case, I’ll reiterate: I’m categorically not saying that Facebook–or any other company, for that matter–should be allowed to do whatever it likes with its users’ data. There are plenty of valid concerns one could raise about the way companies like Facebook store, manage, and use their users’ data. And for what it’s worth, I’m generally in favor of passing new rules regulating the use of personal data in the private sector. So, contrary to what some posts suggested, I was categorically not advocating for a laissez-faire world in which large corporations get to do as they please with your information, and there’s nothing us little people can do about it.

The point I made in my last post was much narrower than that–namely, that picking on the PNAS study as an example of ethically questionable practices at Facebook was a bad idea, because (a) there aren’t any new risks introduced by this manipulation that aren’t already dwarfed by the risks associated with using Facebook itself (which is not exactly a high-risk enterprise to begin with), and (b) there are literally thousands of experiments just like this being conducted every day by large companies intent on figuring out how best to market their products and services–so Facebook’s study doesn’t stand out in any respect. My point was not that you shouldn’t be concerned about who has your data and how they’re using it, but that it’s deeply counterproductive to go after Facebook for this particular experiment when Facebook is one of the few companies in this arena that actually (occasionally) publish their findings in the scientific literature, instead of hiding them entirely from the light, as almost everyone else does. Of course, that will probably change as a result of this controversy.

I was wrong–A/B testing edition

One claim I made in my last post that was very clearly wrong is this (emphasis added):

What makes the backlash on this issue particularly strange is that I’m pretty sure most people do actually realize that their experience on Facebook (and on other websites, and on TV, and in restaurants, and in museums, and pretty much everywhere else) is constantly being manipulated. I expect that most of the people who’ve been complaining about the Facebook study on Twitter are perfectly well aware that Facebook constantly alters its user experience–I mean, they even see it happen in a noticeable way once in a while, whenever Facebook introduces a new interface.

After watching the commentary over the past two days, I think it’s pretty clear I was wrong about this. A surprisingly large number of people clearly were genuinely unaware that Facebook, Twitter, Google, and other major players in every major industry (not just tech–also banks, groceries, department stores, you name it) are constantly running large-scale, controlled experiments on their users and customers. For instance, here’s a telling comment left on my last post:

The main issue I have with the experiment is that they conducted it without telling us. Given, that would have been counterproductive, but even a small adverse affect is still an adverse affect. I just don’t like the idea that corporations can do stuff to me without my consent. Just my opinion.

Similar sentiments are all over the place. Clearly, the revelation that Facebook regularly experiments on its users without their knowledge was indeed just that to many people–a revelation. I suppose in this sense, there’s potentially a considerable upside to this controversy, inasmuch as it has clearly served to raise awareness of industry-standard practices.

Questions about the ethics of the PNAS paper’s publication

My post focused largely on the question of whether the experiment Facebook conducted was itself illegal or unethical. I took this to be the primary worry of most lay people who have expressed concern about the episode. As I discussed in my post, I think it’s quite clear that (a) the experiment itself is entirely legal and (b) any ethical objections one could raise are actually much broader objections about the way we regulate data use and consumer privacy, and have nothing to do with Facebook in particular. However, there’s a separate question that does specifically concern Facebook–or really, the authors of the PNAS paper–which is whether the authors, in their efforts to publish their findings, violated any laws or regulations.

When I wrote my post, I was under the impression–based largely on reports of an interview with the PNAS editor, Susan Fiske–that the authors had in fact obtained approval to conduct the study from an IRB, and had simply neglected to include that information in the text (which would have been an editorial lapse, but not an unethical act). I wrote as much in a comment on my post. I was not suggesting–as some seemed to take away–that Facebook doesn’t need to get IRB approval. I was operating on the assumption that it had obtained IRB approval, based on the information available at the time.

In any case, it now appears that may not be exactly what happened. Unfortunately, it’s not yet clear exactly what did happen. One version of events people have suggested is that the study’s authors exploited a loophole in the rules by having Facebook conduct and analyze the experiment without the involvement of the other authors–who only contributed to the genesis of the idea and the writing of the manuscript. However, this interpretation is not unambiguous, and risks maligning the authors’ reputations unfairly, because Adam Kramer’s post explaining the motivation for the experiment suggests that the idea for the experiment originated entirely at Facebook, and was related to internal needs:

The reason we did this research is because we care about the emotional impact of Facebook and the people that use our product. We felt that it was important to investigate the common worry that seeing friends post positive content leads to people feeling negative or left out. At the same time, we were concerned that exposure to friends’ negativity might lead people to avoid visiting Facebook. We didn’t clearly state our motivations in the paper.

How you interpret the ethics of the study thus depends largely on what you believe actually happened. If you believe that the genesis and design of the experiment were driven by Facebook’s internal decision-making, and the decision to publish an interesting finding came only later, then there’s nothing at all ethically questionable about the authors’ behavior. It would have made no more sense to seek out IRB approval for this one experiment than for any of the other in-house experiments Facebook regularly conducts. And there is, again, no question whatsoever that Facebook does not have to get approval from anyone to do experiments that are not for the purpose of systematic, generalizable research.

Moreover, since the non-Facebook authors did in fact ask the IRB to review their proposal to use archival data–and the IRB exempted them from review, as is routinely done for this kind of analysis–there would be no legitimacy to the claim that the authors acted unethically. About the only claim one could raise an eyebrow at is that the authors “didn’t clearly state” their motivations. But since presenting a post-hoc justification for one’s studies that has nothing to do with the original intention is extremely common in psychology (though it shouldn’t be), it’s not really fair to fault Kramer et al for doing something that is standard practice.

If, on the other hand, the idea for the study did originate outside of Facebook, and the authors deliberately attempted to avoid prospective IRB review, then I think it’s fair to say that their behavior was unethical. However, given that the authors were following the letter of the law (if clearly not the spirit), it’s not clear that PNAS should have, or could have, rejected the paper. It certainly should have demanded that information regarding interactions with the IRB be included in the manuscript, and perhaps it could have published some kind of expression of concern alongside the paper. But I agree with Michelle Meyer’s analysis that, in taking the steps they took, the authors are almost certainly operating within the rules, because (a) Facebook itself is not subject to HHS rules, (b) the non-Facebook authors were not technically “engaged in research”, and (c) the archival use of already-collected data by the non-Facebook authors was approved by the Cornell IRB (or rather, the study was exempted from further review).

Absent clear evidence of what exactly happened in the lead-up to publication, I think the appropriate course of action is to withhold judgment. In the interim, what the episode clearly does do is lay bare how ill-prepared the existing HHS regulations are for dealing with the research use of data collected online–particularly when the data was acquired by private entities. Actually, it’s not just research use that’s problematic; it’s clear that many people complaining about Facebook’s conduct this week don’t really give a hoot about the “generalizable knowledge” side of things, and are fundamentally just upset that Facebook is allowed to run these kinds of experiments at all without providing any notification.

In my view, what’s desperately called for is a new set of regulations that provide a unitary code for dealing with consumer data across the board–i.e., in both research and non-research contexts. This leaves aside exactly what such regulations would look like, of course. My personal view is that the right direction to move in is to tighten consumer protection laws to better regulate management and use of private citizens’ data, while simultaneously liberalizing the research use of private datasets that have already been acquired. For example, I would favor a law that (a) forced Facebook and other companies to more clearly and explicitly state how they use their users’ data, (b) provided opt-out options when possible, along with the ability for users to obtain a report of how their data has been used in the past, and (c) gave blanket approval to use data acquired under these conditions for any and all academic research purposes so long as the data are deidentified. Many people will disagree with this, of course, and have very different ideas. That’s fine; the key point is that the conversation we should be having is about how to update and revise the rules governing research vs. non-research uses of data in such a way that situations like the PNAS study don’t come up again.

What Facebook does is not research–until they try to publish it

Much of the outrage over the Facebook experiment is centered around the perception that Facebook shouldn’t be allowed to conduct research on its users without their consent. What many people mean by this, I think, is that Facebook shouldn’t be allowed to conduct any experiments on its users for purposes of learning things about user experience and behavior unless Facebook explicitly asks for permission. A point that I should have clarified in my original post is that Facebook users are, in the normal course of things, not considered participants in a research study, no matter how or how much their emotions are manipulated. That’s because the HHS’s definition of research includes, as a necessary component, that there be an active intention to contribute to generalizable new knowledge.

Now, to my mind, this isn’t a great way to define “research”–I think it’s a good idea to avoid definitions that depend on knowing what people’s intentions were when they did something. But that’s the definition we’re stuck with, and there’s really no ambiguity over whether Facebook’s normal operations–which include constant randomized, controlled experimentation on its users–constitute research in this sense. They clearly don’t. Put simply, if Facebook were to eschew disseminating its results to the broader community, the experiment in question would not have been subject to any HHS regulations whatsoever (though, as Michelle Meyer astutely pointed out, technically the experiment probably isn’t subject to HHS regulation even now, so the point is moot). To reiterate: it’s only the fact that Kramer et al wanted to publish their results in a scientific journal that opened them up to criticism of research misconduct in the first place.

This observation may not have any impact on your view if your concern is fundamentally about the publication process–i.e., you don’t object to Facebook doing the experiment; what you object to is Facebook trying to disseminate their findings as research. But it should have a strong impact on your views if you were previously under the impression that Facebook’s actions must have violated some existing human subjects regulation or consumer protection law. The laws in the United States–at least as I understand them, and I admittedly am not a lawyer–currently afford you no such protection.

Now, is it a good idea to have two very separate standards, one for research and one for everything else? Probably not. Should Facebook be allowed to do whatever it wants to your user experience so long as it’s covered under the Data Use policy in the user agreement you didn’t read? Probably not. But what’s unequivocally true is that, as it stands right now, your interactions with Facebook–no matter how your user experience, data, or emotions are manipulated–are not considered research unless Facebook manipulates your experience with the express intent of disseminating new knowledge to the world.

Informed consent is not mandatory for research studies

As a last point, there seems to be a very common misconception floating around among commentators that the Facebook experiment was unethical because it didn’t provide informed consent, which is a requirement for all research studies involving experimental manipulation. I addressed this in the comments on my last post in response to other comments:

[I]t’s simply not correct to suggest that all human subjects research requires informed consent. At least in the US (where Facebook is based), the rules governing research explicitly provide for a waiver of informed consent. Directly from the HHS website:

An IRB may approve a consent procedure which does not include, or which alters, some or all of the elements of informed consent set forth in this section, or waive the requirements to obtain informed consent provided the IRB finds and documents that:

(1) The research involves no more than minimal risk to the subjects;

(2) The waiver or alteration will not adversely affect the rights and welfare of the subjects;

(3) The research could not practicably be carried out without the waiver or alteration; and

(4) Whenever appropriate, the subjects will be provided with additional pertinent information after participation.

Granting such waivers is a commonplace occurrence; I myself have had online studies granted waivers before for precisely these reasons. In this particular context, it’s very clear that conditions (1) and (2) are met (because this easily passes the “not different from ordinary experience” test). Further, Facebook can also clearly argue that (3) is met, because explicitly asking for informed consent is likely not viable given internal policy, and would in any case render the experimental manipulation highly suspect (because it would no longer be random). The only point one could conceivably raise questions about is (4), but here again I think there’s a very strong case to be made that Facebook is not about to start providing debriefing information to users every time it changes some aspect of the news feed in pursuit of research, considering that its users have already agreed to its User Agreement, which authorizes this and much more.

Now, if you disagree with the above analysis, that’s fine, but what should be clear enough is that there are many IRBs (and I’ve personally interacted with some of them) that would have authorized a waiver of consent in this particular case without blinking. So this is clearly well within “reasonable people can disagree” territory, rather than “oh my god, this is clearly illegal and unethical!” territory.

I can understand the objection that Facebook should have applied for IRB approval prior to conducting the experiment (though, as I note above, that’s only true if the experiment was initially conducted as research, which is not clear right now). However, it’s important to note that there is no guarantee that an IRB would have insisted on informed consent at all in this case. There’s considerable heterogeneity in different IRBs’ interpretation of the HHS guidelines (and in fact, even across different reviewers within the same IRB), and I don’t doubt that many IRBs would have allowed Facebook’s application to sail through without any problems (see, e.g., this comment on my last post)–though I think there’s a general consensus that a debriefing of some kind would almost certainly be requested.

There is no ceiling effect in Johnson, Cheung, & Donnellan (2014)

This is not a blog post about bullying, negative psychology or replication studies in general. Those are important issues, and a lot of ink has been spilled over them in the past week or two. But this post isn’t about those issues (at least, not directly). This post is about ceiling effects. Specifically, the ceiling effect purportedly present in a paper in Social Psychology, in which Johnson, Cheung, and Donnellan report the results of two experiments that failed to replicate an earlier pair of experiments by Schnall, Benton, and Harvey.

If you’re not up to date on recent events, I recommend reading Vasudevan Mukunth’s post, which provides a nice summary. If you still want to know more after that, you should probably take a gander at the original paper by Schnall, Benton, & Harvey and the replication paper. Still want more? Go read Schnall’s rebuttal. Then read the rejoinder to the rebuttal. Then read Schnall’s first and second blog posts. And maybe a number of other blog posts (here, here, here, and here). Oh, and then, if you still haven’t had enough, you might want to skim the collected email communications between most of the parties in question, which Brian Nosek has been kind enough to curate.

I’m pointing you to all those other sources primarily so that I don’t have to wade very deeply into the overarching issues myself–because (a) they’re complicated, (b) they’re delicate, and (c) I’m still not entirely sure exactly how I feel about them. However, I do have a fairly well-formed opinion about the substantive issue at the center of Schnall’s published rebuttal–namely, the purported ceiling effect that invalidates Johnson et al’s conclusions. So I thought I’d lay that out here in excruciating detail. I’ll warn you right now that if your interests lie somewhere other than the intersection of psychology and statistics (which they probably should), you probably won’t enjoy this post very much. (If your interests do lie at the intersection of psychology and statistics, you’ll probably give this post a solid “meh”.)

Okay, with all the self-handicapping out of the way, let’s get to it. Here’s what I take to be…

Schnall’s argument

The crux of Schnall’s criticism of the Johnson et al replication is a purported ceiling effect. What, you ask, is a ceiling effect? Here’s Schnall’s definition:

A ceiling effect means that responses on a scale are truncated toward the top end of the scale. For example, if the scale had a range from 1-7, but most people selected “7”, this suggests that they might have given a higher response (e.g., “8” or “9”) had the scale allowed them to do so. Importantly, a ceiling effect compromises the ability to detect the hypothesized influence of an experimental manipulation. Simply put: With a ceiling effect it will look like the manipulation has no effect, when in reality it was unable to test for such an effects in the first place. When a ceiling effect is present no conclusions can be drawn regarding possible group differences.

This definition has some subtle-but-important problems we’ll come back to, but it’s reasonable as a first approximation. With this definition in mind, here’s how Schnall describes her core analysis, which she uses to argue that Johnson et al’s results are invalid:

Because a ceiling effect on a dependent variable can wash out potential effects of an independent variable (Hessling, Traxel & Schmidt, 2004), the relationship between the percentage of extreme responses and the effect of the cleanliness manipulation was examined. First, using all 24 item means from original and replication studies, the effect of the manipulation on each item was quantified. … Second, for each dilemma the percentage of extreme responses averaged across neutral and clean conditions was computed. This takes into account the extremity of both conditions, and therefore provides an unbiased indicator of ceiling per dilemma. … Ceiling for each dilemma was then plotted relative to the effect of the cleanliness manipulation (Figure 1).

We can (and will) quibble with these analysis choices, but the net result of the analysis is this:

schnall_figure

Here, we see normalized effect size (y-axis) plotted against extremity of item response (x-axis). Schnall’s basic argument is that there’s a strong inverse relationship between the extremity of responses to an item and the size of the experimental effect on that item. In other words, items with extreme responses don’t show an effect, whereas items with non-extreme responses do show an effect. She goes on to note that this pattern is fully accounted for by her own original experiments, and that there is no such relationship in Johnson et al’s data. On the basis of this finding, Schnall concludes that:

Scores are compressed toward the top end of the scale and therefore show limited determinate variance near ceiling. Because a significance test compares variance due to a manipulation to variance due to error, an observed lack of effect can result merely from a lack in variance that would normally be associated with a manipulation. Given the observed ceiling effect, a statistical artefact, the analyses reported by Johnson et al. (2014a) are invalid and allow no conclusions about the reproducibility of the original findings.
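To make the procedure in the quoted passages concrete, here is a minimal Python sketch of the per-item computation: the share of top-of-scale responses averaged across the two conditions, a normalized effect of the form (M1 – M2)/(M1 + M2), and the correlation between the two across items. The ratings, the item names, and the 9-point ceiling are all made up for illustration; this is not Schnall's code or data, just one plausible reading of the description.

```python
import math

SCALE_MAX = 9  # assumption: the top of the rating scale


def share_extreme(ratings, top=SCALE_MAX):
    """Proportion of responses sitting at the very top of the scale."""
    return sum(r == top for r in ratings) / len(ratings)


def normalized_effect(m1, m2):
    """Difference between condition means divided by their sum."""
    return (m1 - m2) / (m1 + m2)


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def summarize(items):
    """items maps a name to (neutral_ratings, clean_ratings)."""
    extremity, effects = [], []
    for neutral, clean in items.values():
        # Extremity is averaged across the two conditions.
        extremity.append((share_extreme(neutral) + share_extreme(clean)) / 2)
        m1 = sum(neutral) / len(neutral)
        m2 = sum(clean) / len(clean)
        effects.append(normalized_effect(m1, m2))
    return extremity, effects


# Hypothetical ratings, invented purely to exercise the functions.
items = {
    "item_a": ([9, 9, 9, 8, 9], [9, 9, 8, 9, 9]),
    "item_b": ([7, 9, 6, 8, 5], [6, 7, 5, 9, 4]),
    "item_c": ([5, 6, 4, 7, 3], [3, 4, 2, 5, 3]),
}
extremity, effects = summarize(items)
print(extremity, effects, pearson_r(extremity, effects))
```

With these invented numbers the correlation comes out negative, which is the shape of the relationship in Figure 1; the toy data say nothing about whether that relationship holds in the real studies.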

Problems with the argument

One can certainly debate what the implications would be even if Schnall’s argument were correct; for instance, it’s debatable whether the presence of a ceiling effect would actually invalidate Johnson et al’s conclusion that they had failed to replicate Schnall et al. An alternative and reasonable interpretation is that Johnson et al would have simply identified important boundary conditions under which the original effect doesn’t work (e.g., that it doesn’t hold in Michigan residents), since they were using Schnall’s original measures. But we don’t have to worry about that in any case, because there are several serious problems with Schnall’s argument. Some of them have to do with the statistical analysis she performs to make her point; some of them have to do with subtle mischaracterizations of what ceiling effects are and where they come from; and some of them have to do with the fact that Schnall’s data actually directly contradict her own argument. Let’s take each of these in turn.

Problems with the analysis

A first problem with Schnall’s analysis is that the normalization procedure she uses to make her point is biased. Schnall computes the normalized effect size for each item as:

(M1 – M2)/(M1 + M2)

Where M1 and M2 are the means for each item in the two experimental conditions (neutral and clean). This transformation is supposed to account for the fact that scores are compressed at the upper end of the scale, near the ceiling.

What Schnall fails to note, however, is that compression should also occur at the bottom of the scale, near the floor. For example, suppose an individual item has means of 1.2 and 1.4. Then Schnall’s normalized effect size estimate would be 0.2/2.6 ≈ 0.08. But if the means had been 4.0 and 4.2–the same absolute difference–then the adjusted estimate would actually be much smaller (around 0.02). So Schnall’s analysis is actually biased in favor of detecting the negative correlation she takes as evidence of a ceiling effect, because she’s not accounting for floor effects simultaneously. A true “clipping” or compression of scores shouldn’t occur at only one extreme of the scale; what should matter is how far from the midpoint a response happens to be. What should happen, if Schnall were to recompute the scores in Figure 1 using a modified criterion (e.g., relative deviation from the scale’s midpoint, rather than absolute score), is that the points at the top left of the figure should pull towards the y-axis to some degree, effectively reducing the slope she takes as evidence of a problem. If there’s any pattern that would suggest a measurement problem, it’s actually an inverted u-shape, where normalized effects are greatest for items with means nearest the midpoint, and smallest for items at both extremes, not just near ceiling. But that’s not what we’re shown.
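The arithmetic is easy to check. The short Python sketch below applies the (M1 – M2)/(M1 + M2) normalization to the same raw difference of 0.2 at three places on the scale; the third pair of means (8.0 and 8.2) is an extra hypothetical case added here to show what happens near the top.

```python
def normalized_effect(m1, m2):
    """Difference between two condition means divided by their sum."""
    return (m1 - m2) / (m1 + m2)


# Same raw difference (0.2) near the floor, mid-scale, and near the ceiling.
pairs = [(1.4, 1.2), (4.2, 4.0), (8.2, 8.0)]
results = [round(normalized_effect(hi, lo), 3) for hi, lo in pairs]
print(results)  # [0.077, 0.024, 0.012]
```

The normalized value shrinks steadily as the means move up the scale even though the raw difference never changes, so the transformation by itself pushes high-mean items toward smaller "effects".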

A second problem is that Schnall’s data actually contradict her own conclusion. She writes:

Across the 24 dilemmas from all 4 experiments, dilemmas with a greater percentage of extreme responses were associated with lower effect sizes (r = -.50, p = .01, two-tailed). This negative correlation was entirely driven by the 12 original items, indicating that the closer responses were to ceiling, the smaller was the effect of the manipulation (r = -.49, p = .10). In contrast, across the 12 replication items there was no correlation (r = .11, p = .74).

But if anything, these results provide evidence of a ceiling effect only in Schnall’s original study, and not in the Johnson et al replications. Recall that Schnall’s argument rests on two claims: (a) effects are harder to detect the more extreme responding on an item gets, and (b) responding is so extreme on the items in the Johnson et al experiments that nothing can be detected. But the results she presents blatantly contradict the second claim. Had there been no variability in item means in the Johnson et al studies, Schnall could have perhaps argued that restriction of range is so extreme that it is impossible to detect any kind of effect. In practice, however, that’s not the case. There is considerable variability along the x-axis, and in particular, one can clearly see that there are two items in Johnson et al that are nowhere near ceiling and yet show no discernible normalized effect of experimental condition at all. Note that these are the very same items that show some of the strongest effects in Schnall’s original study. In other words, the data Schnall presents in support of her argument actually directly contradict her argument. If one is to believe that a ceiling effect is preventing Schnall’s effect from emerging in Johnson et al’s replication studies, then there is no reasonable explanation for the fact that those two leftmost red squares in the figure above are close to the y = 0 line. They should be behaving exactly like they did in Schnall’s study–which is to say, they should be showing very large normalized effects–even if items at the very far right show no effects at all.
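For what it's worth, the p-values in the quoted passage follow from r and the number of items alone, via t = r·sqrt((n – 2)/(1 – r²)) on n – 2 degrees of freedom. Here is a standard-library Python sketch of that check. The Simpson's-rule integration of the t density is just a convenience chosen here to avoid a SciPy dependency, and the function names are invented for this example.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(r, n, steps=2000):
    """Two-tailed p-value for a Pearson r over n points (requires |r| < 1)."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the area under the density between 0 and t
    # (steps must be even).
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 1 - 2 * area


for r, n in [(-0.50, 24), (-0.49, 12), (0.11, 12)]:
    print(r, n, round(two_tailed_p(r, n), 3))
```

The values land close to the reported ones (small differences are expected, since the quoted r values are rounded). Note that the correlation across the 12 original items, at p of roughly .10, does not reach the conventional .05 threshold.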

Third, Schnall’s argument that a ceiling effect completely invalidates Johnson et al’s conclusions is a gross exaggeration. Ceiling effects are not all-or-none; the degree of score compression into the upper end of a measure will vary continuously (unless there is literally no variance at all in the responses, which is clearly not the case here). Even if we took at face value Schnall’s finding that there’s an inverse relationship between effect size and extremity in her original data (r = -0.5), all this would tell us is that there’s some compression of scores. Schnall’s suggestion that “given the observed ceiling effect, a statistical artefact, the analyses reported by Johnson et al. (2014a) are invalid and allow no conclusions about the reproducibility of the original findings” is simply false. Even in the very best case scenario (which this obviously isn’t), the very strongest claim Schnall could comfortably make is that there may be some compression of scores, with unknown impact on the detectable effect size. It is simply not credible for Schnall to suggest that the mere presence of something that looks vaguely like a ceiling effect is sufficient to completely rule out detection of group differences in the Johnson et al experiments. And we know this with 100% certainty, because…

There are robust group differences in the replication experiments

Perhaps the clearest refutation of Schnall’s argument for a ceiling effect is that, as Johnson et al noted in their rejoinder, the Johnson et al experiments did in fact successfully identify some very clear group differences (and, ironically, ones that were also present in Schnall’s original experiments). Specifically, Johnson et al showed a robust effect of gender on vignette ratings. Here’s what the results look like:

We can see clearly that, in both replication experiments, there’s a large effect of gender but no discernible effect of experimental condition. This pattern directly refutes Schnall’s argument. She cannot have it both ways: if a ceiling effect precludes the presence of group differences, then there cannot be a ceiling effect in the replication studies, or else the gender effect could not have emerged repeatedly. Conversely, if ceiling effects don’t preclude detection of effects, then there is no principled reason why Johnson et al would fail to detect Schnall’s original effect.
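A toy simulation illustrates the general point that score compression shrinks a group difference without erasing it. In the Python sketch below, two groups differ by one point on a latent scale, and responses are rounded and clipped to a 1–9 range. Every number in it (the latent means, the spread, the sample size, the seed) is an assumption chosen for illustration, not an estimate from either set of experiments.

```python
import random


def rate(latent, low=1, high=9):
    """Turn a latent score into a rating: round, then clip to the scale."""
    return max(low, min(high, round(latent)))


def mean_difference(latent_mean, shift=1.0, sd=1.5, n=20000, seed=0):
    """Observed difference in mean ratings between two simulated groups."""
    rng = random.Random(seed)
    group_a = [rate(rng.gauss(latent_mean, sd)) for _ in range(n)]
    group_b = [rate(rng.gauss(latent_mean + shift, sd)) for _ in range(n)]
    return sum(group_b) / n - sum(group_a) / n


mid_scale = mean_difference(5.0)     # responses centered mid-scale
near_ceiling = mean_difference(8.5)  # responses bunched up near the top
print(round(mid_scale, 2), round(near_ceiling, 2))
```

Under these assumptions the observed difference near the ceiling is roughly half the mid-scale one: attenuated, but still far from zero and easy to detect with a sample of any reasonable size.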

Interestingly, it’s not just the overall means that tell the story quite clearly. Here’s what happens if we plot the gender effects in Johnson et al’s experiments in the same way as Schnall’s Figure 1 above:

gender_fx_by_extremity

Notice that we see here the same negative relationship between effect size and extremity that Schnall observed in her own data, and whose absence in Johnson et al’s data she (erroneously) took as evidence of a ceiling effect.

There’s a ceiling effect in Schnall’s own data

Yet another flaw in Schnall’s argument is that taking the ceiling effect charge seriously would actually invalidate at least one of her own experiments. Consider that the only vignette in Schnall et al’s original Experiment 1 that showed a statistically significant effect also had the highest rate of extreme responding in that study (mean rating of 8.25 / 9). Even more strikingly, the proportion of participants who gave the most extreme response possible on that vignette (70%) was higher than for any of the vignettes in either of Johnson et al’s experiments. In other words, Schnall’s core argument is that her effect could not possibly be replicated in Johnson et al’s experiments because of the presence of a ceiling effect, yet the only vignette to show a significant effect in Schnall’s original Experiment 1 had an even more pronounced ceiling effect. Once again, she cannot have it both ways. Either ceiling effects don’t preclude detection of effects, or, by Schnall’s own logic, the original Experiment 1 effect was probably a false positive.

When pressed on this point by Daniel Lakens in the email thread, Schnall gave the following response:

Note for the original studies we reported that the effect was seen on aggregate data, not necessarily for individual dilemmas. Such results will always show statistical fluctuations at the item level, hence it is important to not focus on any individual dilemma but on the overall pattern.

I confess that I’m not entirely clear on what Schnall means here. One way to read this is that she is conceding that the significant effect in the vignette in question (the “kitten” dilemma) was simply due to random fluctuations. Note that since the effect in Schnall’s Experiment 1 was only barely significant when averaging across all vignettes (in fact, it wasn’t quite significant even so), eliminating this vignette from consideration would actually have produced a null result. But suppose we overlook that and instead agree with Schnall that strange things can happen to individual items, and that what we should focus on is the aggregate moral judgment, averaged across vignettes. That would be perfectly reasonable, except that it’s directly at odds with Schnall’s more general argument. To see this, we need only look at the aggregate distribution of scores in Johnson et al’s Experiments 1 and 2:

[Figure: aggregate score distributions in Johnson et al’s Experiments 1 and 2]

There’s clearly no ceiling effect here; the mode in both experiments is nowhere near the maximum. So once again, Schnall can’t have it both ways. If her argument is that what matters is the aggregate measure (which seems right to me, since many reputable measures have multiple individual items with skewed distributions, and this can even be a desirable property in certain cases), then there’s nothing objectionable about the scores in the Johnson et al experiments. Conversely, if Schnall’s argument is that it’s fair to pick on individual items, then there is effectively no reason to believe Schnall’s own original Experiment 1 (and for all I know, her Experiment 2 as well–I haven’t looked).

What should we conclude?

What can we conclude from all this? A couple of things. First, Schnall has no basis for arguing that there was a fundamental statistical flaw that completely invalidates Johnson et al’s conclusions. From where I’m sitting, there doesn’t seem to be any meaningful ceiling effect in Johnson et al’s data, and that’s attested to by the fact that Johnson et al had no trouble detecting gender differences in both experiments (successfully replicating Schnall’s earlier findings). Moreover, the arguments Schnall makes in support of the postulated ceiling effects suffer from serious flaws. At best, what Schnall could reasonably argue is that there might be some restriction of range in the ratings, which would artificially reduce the effect size. However, given that Johnson et al’s sample sizes were 3 – 5 times larger than Schnall’s, it is highly implausible to suppose that effects as big as Schnall’s completely disappeared–especially given that robust gender effects were detected. Moreover, given that the skew in Johnson et al’s aggregate distributions is not very extreme at all, and that many individual items on many questionnaire measures show ceiling or floor effects (e.g., go look at individual Big Five item distributions some time), taking Schnall’s claims seriously would in effect invalidate not just Johnson et al’s results, but also a huge proportion of the more general psychology literature.

Second, while Schnall has raised a number of legitimate and serious concerns about the tone of the debate and comments surrounding Johnson et al’s replication, she’s also made a number of serious charges of her own that depend on the validity of her argument about ceiling effects, and not on the civility (or lack thereof) of commentators on various sides of the debate. Schnall has (incorrectly) argued that Johnson et al have committed a basic statistical error that most peer reviewers would have caught–effectively accusing them of incompetence. She has argued that Johnson et al’s claim of replication failure is unwarranted, and constitutes defamation of her scientific reputation. And she has suggested that the editors of the special issue (Daniel Lakens and Brian Nosek) behaved unethically by first not seeking independent peer review of the replication paper, and then actively trying to suppress her own penetrating criticisms. In my view, none of these accusations are warranted, because they depend largely on Schnall’s presumption of a critical flaw in Johnson et al’s work that is in fact nonexistent. I understand that Schnall has been under a lot of stress recently, and I sympathize with her concerns over unfair comments made by various people (most of whom have now issued formal apologies). But given the acrimonious tone of the more general ongoing debate over replication, it’s essential that we distinguish the legitimate issues from the illegitimate ones so that we can focus exclusively on the former, and don’t end up needlessly generating more hostility on both sides.

Lastly, there is the question of what conclusions we should draw from the Johnson et al replication studies. Personally, I see no reason to question Johnson et al’s conclusions, which are actually very modest:

In short, the current results suggest that the underlying effect size estimates from these replication experiments are substantially smaller than the estimates generated from the original SBH studies. One possibility is that there are unknown moderators that account for these apparent discrepancies. Perhaps the most salient difference between the current studies and the original SBH studies is the student population. Our participants were undergraduates in United States whereas participants in SBH’s studies were undergraduates in the United Kingdom. It is possible that cultural differences in moral judgments or in the meaning and importance of cleanliness may explain any differences.

Note that Johnson et al did not assert or intimate in any way that Schnall et al’s effects were “not real”. They did not suggest that Schnall et al had committed any errors in their original study. They explicitly acknowledged that unknown moderators might explain the difference in results (though they also noted that this was unlikely considering the magnitude of the differences). Effectively, Johnson et al stuck very close to their data and refrained from any kind of unfounded speculation.

In sum, unless Schnall has other concerns about Johnson’s data besides the purported ceiling effect (and she hasn’t raised any that I’ve seen), I think Johnson et al’s paper should enter the record exactly as its authors intended. Johnson, Cheung, & Donnellan (2014) is, quite simply, a direct preregistered replication of Schnall, Benton, & Harvey (2008) that failed to detect the effects reported in the original study, and there should be nothing at all controversial about this. There are certainly worthwhile discussions to be had about why the replication failed, and what that means for the original effect, but this doesn’t change the fundamental fact that the replication did fail, and we shouldn’t pretend otherwise.

What we can and can’t learn from the Many Labs Replication Project

By now you will most likely have heard about the “Many Labs” Replication Project (MLRP)–a 36-site, 12-country, 6,344-subject effort to try to replicate a variety of classical and not-so-classical findings in psychology. You probably already know that the authors tested a variety of different effects–some recent, some not so recent (the oldest one dates back to 1941!); some well-replicated, others not so much–and reported successful replications of 10 out of 13 effects (though with widely varying effect sizes).

By and large, the reception of the MLRP paper has been overwhelmingly positive. Setting aside for the moment what the findings actually mean (see also Rolf Zwaan’s earlier take), my sense is that most psychologists are united in agreement that the mere fact that researchers at 36 different sites were able to get together and run a common protocol testing 13 different effects is a pretty big deal, and bodes well for the field in light of recent concerns about iffy results and questionable research practices.

But not everyone’s convinced. There now seems to be something of an incipient backlash against replication. Or perhaps not so much against replication itself as against the notion that the ongoing replication efforts have any special significance. An in press paper by Joseph Cesario makes a case for deferring independent efforts to replicate an effect until the original effect is theoretically well understood (a suggestion I disagree with quite strongly, and plan to follow up on in a separate post). And a number of people have questioned, in blog comments and tweets, what the big deal is. A case in point:

I think the charitable way to interpret this sentiment is that Gilbert and others are concerned that some people might read too much into the fact that the MLRP successfully replicated 10 out of 13 effects. And clearly, at least some journalists have; for instance, Science News rather irresponsibly reported that the MLRP “offers reassurance” to psychologists. That said, I don’t think it’s fair to characterize this as anything close to a dominant reaction, and I don’t think I’ve seen any researchers react to the MLRP findings as if the 10/13 number means anything special. The piece Dan Gilbert linked to in his tweet, far from promoting “hysteria” about replication, is a Nature News article by the inimitable Ed Yong, and is characteristically careful and balanced. Far from trumpeting the fact that 10 out of 13 findings replicated, here’s a direct quote from the article:

Project co-leader Brian Nosek, a psychologist at the Center of Open Science in Charlottesville, Virginia, finds the outcomes encouraging. “It demonstrates that there are important effects in our field that are replicable, and consistently so,” he says. “But that doesn’t mean that 10 out of every 13 effects will replicate.”

Kahneman agrees. The study “appears to be extremely well done and entirely convincing”, he says, “although it is surely too early to draw extreme conclusions about entire fields of research from this single effort”.

Clearly, the mere fact that 10 out of 13 effects replicated is not in and of itself very interesting. For one thing (and as Ed Yong also noted in his article), a number of the effects were selected for inclusion in the project precisely because they had already been repeatedly replicated. Had the MLRP failed to replicate these effects–including, for instance, the seminal anchoring effect discovered by Kahneman and Tversky in the 1970s–the conclusion would likely have been that something was wrong with the methodology, and not that the anchoring effect doesn’t exist. So I think pretty much everyone can agree with Gilbert that we have most assuredly not learned, as a result of the MLRP, that there’s no replication crisis in psychology after all, and that roughly 76.9% of effects are replicable. Strictly speaking, all we know is that there are at least 10 effects in all of psychology that can be replicated. But that’s not exactly what one would call an earth-shaking revelation. What’s important to appreciate, however, is that the utility of the MLRP was never supposed to be about the number of successfully replicated effects. Rather, its value is tied to a number of other findings and demonstrations–some of which are very important, and have potentially big implications for the field at large. To wit:

1. The variance between effects is greater than the variance within effects.

Here’s the primary figure from the MLRP paper:

[Figure: Many Labs Replication Project results]

Notice that the range of meta-analytic estimates for the different effect sizes (i.e., the solid green circles) is considerably larger than the range of individual estimates within a given effect. In other words, if you want to know how big a given estimate is likely to be, it’s more informative to know what effect is being studied than to know which of the 36 sites is doing the study. This may seem like a rather esoteric point, but it has important implications. Most notably, it speaks directly to the question of how much one should expect effect sizes to fluctuate from lab to lab when direct replications are attempted. If you’ve been following the controversy over the relative (non-)replicability of a number of high-profile social priming studies, you’ve probably noticed that a common defense researchers use when their findings fail to replicate is to claim that the underlying effect is very fragile, and can’t be expected to work in other researchers’ hands. What the MLRP shows, for a reasonable set of studies, is that there does not in fact appear to be a huge amount of site-to-site variability in effects. Take currency priming, for example–an effect in which priming participants with money supposedly leads them to express capitalistic beliefs and behaviors more strongly. Given a single failure to replicate the effect, one could plausibly argue that perhaps the effect was simply too fragile to reproduce consistently. But when 36 different sites all produce effects within a very narrow range–with a mean that is effectively zero–it becomes much harder to argue that the problem is that the effect is highly variable. To the contrary, the effect size estimates are remarkably consistent–it’s just that they’re consistently close to zero.
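For readers who like to see the between-versus-within comparison spelled out, here is a minimal sketch in Python. The effect names and site estimates are entirely made up for illustration (they are not MLRP data), and comparing the variance of per-effect means against the average within-effect variance is just one simple way to express the idea, not the analysis used in the MLRP paper.

```python
from statistics import fmean, variance

# Hypothetical standardized effect size estimates, one list per effect,
# one number per site. These values are invented for illustration only.
site_estimates = {
    "effect_A": [1.9, 2.1, 2.3, 2.0, 2.2],
    "effect_B": [0.5, 0.6, 0.7, 0.55, 0.65],
    "effect_C": [-0.05, 0.02, 0.0, 0.04, -0.01],
}

# Mean estimate per effect (a crude stand-in for a meta-analytic estimate).
effect_means = {name: fmean(vals) for name, vals in site_estimates.items()}

# Between-effect variance: how much the per-effect means differ from each other.
between_effects = variance(list(effect_means.values()))

# Within-effect variance: how much sites disagree about the same effect,
# averaged over effects.
within_effects = fmean([variance(vals) for vals in site_estimates.values()])

print(f"between-effect variance: {between_effects:.4f}")
print(f"mean within-effect variance: {within_effects:.4f}")
```

With numbers like these, knowing which effect is being studied tells you far more about the likely estimate than knowing which site ran the study, which is the pattern described above.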

2. Larger effects show systematically greater variability.

You can see in the above figure that the larger an effect is, the more individual estimates appear to vary across sites. In one sense, this is not terribly surprising–you might already have the statistical intuition that the larger an effect is, the more reliable variance should be available to interact with other moderating variables. Conversely, if an effect is very small to begin with, it’s probably less likely that it could turn into a very large effect under certain circumstances–or that it might reverse direction entirely. But in another sense, this finding is actually quite unexpected, because, as noted above, there’s a general sense in the field that it’s the smaller effects that tend to be more fragile and heterogeneous. To the extent we can generalize from these 13 studies, these findings should give researchers some pause before attributing replication failures to invisible moderators that somehow manage to turn very robust effects (e.g., the original currency priming effect was nearly a full standard deviation in size) into nonexistent ones.

3. A number of seemingly important variables don’t systematically moderate effects.

There have long been expressions of concern over the potential impact of cultural and population differences on psychological effects. For instance, despite repeated demonstrations that internet samples typically provide data that are as good as conventional lab samples, many researchers continue to display a deep (and in my view, completely unwarranted) skepticism of findings obtained online. More reasonably, many researchers have worried that effects obtained using university students in Western nations–the so-called WEIRD samples–may not generalize to other social groups, cultures and countries. While the MLRP results are obviously not the last word on this debate, it’s instructive to note that factors like data acquisition approach (online vs. offline) and cultural background (US vs. non-US) didn’t appear to exert a systematic effect on results. This doesn’t mean that there are no culture-specific effects in psychology of course (there undoubtedly are), but simply that our default expectation should probably be that most basic effects will generalize across cultures to at least some extent.

4. Researchers have pretty good intuitions about which findings will replicate and which ones won’t.

At the risk of offending some researchers, I submit that the likelihood that a published finding will successfully replicate is correlated to some extent with (a) the field of study it falls under and (b) the journal in which it was originally published. For example, I don’t think it’s crazy to suggest that if one were to try to replicate all of the social priming studies and all of the vision studies published in Psychological Science in the last decade, the vision studies would replicate at a consistently higher rate. Anecdotal support for this intuition comes from a string of high-profile failures to replicate famous findings–e.g., John Bargh’s demonstration that priming participants with elderly concepts leads them to walk away from an experiment more slowly. However, the MLRP goes one better than anecdote, as it included a range of effects that clearly differ in their a priori plausibility. Fortuitously, just prior to publicly releasing the MLRP results, Brian Nosek asked the following question on Twitter:

Several researchers, including me, took Brian up on his offer; here are the responses:

As you can see, pretty much everyone who replied to Brian expressed skepticism about the two priming studies (#9 and #10 in Hal Pashler’s reply). There was less consensus on the third effect. (As it happens, there were ultimately only 2 failures to replicate–the third effect became statistically significant when samples were weighted properly.) Nonetheless, most of us picked Imagined Contact as number 3, which did in fact emerge as the smallest of the statistically significant effects. (It’s probably worth mentioning that I’d personally only heard of 4 or 5 of the 13 effects prior to reading their descriptions, so it’s not as though my response was based on a deep knowledge of prior work on these effects–I simply read the descriptions of the findings and gauged their plausibility accordingly.)

Admittedly, these are just two (or three) studies. It’s possible that the MLRP researchers just happened to pick two of the only high-profile priming studies that both seem highly counterintuitive and happen to be false positives. That said, I don’t really think these findings stand out from the mass of other counterintuitive priming studies in social psychology in any way. While we obviously shouldn’t conclude from this that no high-profile, counterintuitive priming studies will successfully replicate, the fact that a number of researchers were able to prospectively determine, with a high degree of accuracy, which effects would fail to replicate (and, among those that replicated, which were rather weak), is a pretty good sign that researchers’ intuitions about plausibility and replicability are pretty decent.

Personally, I’d love to see this principle pushed further, and formalized as a much broader tool for evaluating research findings. For example, one can imagine a website where researchers could publicly (and perhaps anonymously) register their degree of confidence in the likely replicability of any finding associated with a doi or PubMed ID. I think such a service would be hugely valuable–not only because it would help calibrate individual researchers’ intuitions and provide a sense of the field’s overall belief in an effect, but because it would provide a useful index of a finding’s importance in the event of successful replication (i.e., the authors of a well-replicated finding should probably receive more credit if the finding was initially viewed with great skepticism than if it was universally deemed rather obvious).

There are other potentially important findings in the MLRP paper that I haven’t mentioned here (see Rolf Zwaan’s blog post for additional points), but if nothing else, I hope this will help convince any remaining skeptics that this is indeed a landmark paper for psychology–even though the number of successful replications is itself largely meaningless.

Oh, there’s one last point worth mentioning, in light of the rather disagreeable tone of the debate surrounding previous replication efforts. If your findings are ever called into question by a multinational consortium of 36 research groups, this is exactly how you should respond:

Social psychologist Travis Carter of Colby College in Waterville, Maine, who led the original flag-priming study, says that he is disappointed but trusts Nosek’s team wholeheartedly, although he wants to review their data before commenting further. Behavioural scientist Eugene Caruso at the University of Chicago in Illinois, who led the original currency-priming study, says, “We should use this lack of replication to update our beliefs about the reliability and generalizability of this effect”, given the “vastly larger and more diverse sample” of the MLRP. Both researchers praised the initiative.

Carter and Caruso’s attitude towards the MLRP is really exemplary; people make mistakes all the time when doing research, and shouldn’t be held responsible for the mere act of publishing incorrect findings (excepting cases of deliberate misconduct or clear negligence). What matters is, as Caruso notes, whether and to what extent one shows a willingness to update one’s beliefs in response to countervailing evidence. That’s one mark of a good scientist.

what do you get when you put 1,000 psychologists together in one journal?

I’m working on a TOP SEKKRIT* project involving large-scale data mining of the psychology literature. I don’t have anything to say about the TOP SEKKRIT* project just yet, but I will say that in the process of extracting certain information I needed in order to do certain things I won’t talk about, I ended up with certain kinds of data that are useful for certain other tangential analyses. Just for fun, I threw some co-authorship data from 2,000+ Psychological Science articles into the d3.js blender, and out popped an interactive network graph of all researchers who have published at least 2 papers in Psych Science in the last 10 years**. It looks like this:

[Figure: co-authorship network graph for Psychological Science]

You can click on the image to take a closer (and interactive) look.

I don’t think this is very useful for anything right now, but if nothing else, it’s fun to drag Adam Galinsky around the screen and watch half of the field come along for the ride. There are plenty of other more interesting things one could do with this, though, and it’s also quite easy to generate the same graph for other journals, so I expect to have more to say about this later on.

 

* It’s not really TOP SEKKRIT at all–it just sounds more exciting that way.

** Or, more accurately, researchers who have co-authored at least 2 Psych Science papers with other researchers who meet the same criterion. Otherwise we’d have even more nodes in the graph, and as you can see, it’s already pretty messy.
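For the curious, here is a rough sketch in Python of how a filter like the one in the second footnote could work. This is one possible reading of the criterion, not the code behind the graph: the function names and the toy paper list are hypothetical, and the repeated pruning is my assumption about how to handle the "who meet the same criterion" part.

```python
from itertools import combinations

def prune_authors(papers, min_papers=2):
    """Keep authors with at least `min_papers` papers co-authored with other kept authors.

    `papers` is a list of author-name lists. Pruning repeats until stable,
    because dropping one author can push a co-author below the threshold.
    """
    keep = {author for authors in papers for author in authors}
    while True:
        counts = {author: 0 for author in keep}
        for authors in papers:
            present = [a for a in set(authors) if a in keep]
            if len(present) >= 2:  # only counts if another kept author is on the paper
                for a in present:
                    counts[a] += 1
        dropped = {a for a, c in counts.items() if c < min_papers}
        if not dropped:
            return keep
        keep -= dropped

def coauthor_edges(papers, keep):
    """Count shared papers for each pair of kept authors (the graph's weighted edges)."""
    edges = {}
    for authors in papers:
        present = sorted(a for a in set(authors) if a in keep)
        for pair in combinations(present, 2):
            edges[pair] = edges.get(pair, 0) + 1
    return edges

# Toy input: each inner list is the author list of one (made-up) paper.
toy_papers = [["A", "B"], ["A", "B", "C"], ["C", "D"], ["D", "E"], ["A", "C"]]
kept = prune_authors(toy_papers)
edges = coauthor_edges(toy_papers, kept)
print(sorted(kept))
print(edges)
```

In the toy example, E drops out first (only one qualifying paper), which in turn knocks out D, leaving a small triangle of A, B, and C. The resulting node set and edge weights are the sort of thing one could hand to a d3.js force layout.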

the truth is not optional: five bad reasons (and one mediocre one) for defending the status quo

You could be forgiven for thinking that academic psychologists have all suddenly turned into professional whistleblowers. Everywhere you look, interesting new papers are cropping up purporting to describe this or that common-yet-shady methodological practice, and telling us what we can collectively do to solve the problem and improve the quality of the published literature. In just the last year or so, Uri Simonsohn introduced new techniques for detecting fraud, and used those tools to identify at least 3 cases of high-profile, unabashed data forgery. Simmons and colleagues reported simulations demonstrating that standard exploitation of research degrees of freedom in analysis can produce extremely high rates of false positive findings. Pashler and colleagues developed a “Psych file drawer” repository for tracking replication attempts. Several researchers raised trenchant questions about the veracity and/or magnitude of many high-profile psychological findings such as John Bargh’s famous social priming effects. Wicherts and colleagues showed that authors of psychology articles who are less willing to share their data upon request are more likely to make basic statistical errors in their papers. And so on and so forth. The flood shows no signs of abating; just last week, the APS journal Perspectives on Psychological Science announced that it’s introducing a new “Registered Replication Report” section that will commit to publishing pre-registered high-quality replication attempts, irrespective of their outcome.

Personally, I think these are all very welcome developments for psychological science. They’re solid indications that we psychologists are going to be able to police ourselves successfully in the face of some pretty serious problems, and they bode well for the long-term health of our discipline. My sense is that the majority of other researchers–perhaps the vast majority–share this sentiment. Still, as with any zeitgeist shift, there are always naysayers. In discussing these various developments and initiatives with other people, I’ve found myself arguing, with somewhat surprising frequency, with people who for various reasons think it’s not such a good thing that Uri Simonsohn is trying to catch fraudsters, or that social priming findings are being questioned, or that the consequences of flexible analyses are being exposed. Since many of the arguments I’ve come across tend to recur, I thought I’d summarize the most common ones here–along with the rebuttals I usually offer for why, with one possible exception, the arguments for giving a pass to sloppy-but-common methodological practices are not very compelling.

“But everyone does it, so how bad can it be?”

We typically assume that long-standing conventions must exist for some good reason, so when someone raises doubts about some widespread practice, it’s quite natural to question the person raising the doubts rather than the practice itself. Could it really, truly be (we say) that there’s something deeply strange and misguided about using p values? Is it really possible that the reporting practices converged on by thousands of researchers in tens of thousands of neuroimaging articles might leave something to be desired? Could failing to correct for the many researcher degrees of freedom associated with most datasets really inflate the false positive rate so dramatically?

The answer to all these questions, of course, is yes–or at least, we should allow that it could be yes. It is, in principle, entirely possible for an entire scientific field to regularly do things in a way that isn’t very good. There are domains where appeals to convention or consensus make perfect sense, because there are few good reasons to do things a certain way except inasmuch as other people do them the same way. If everyone else in your country drives on the right side of the road, you may want to consider driving on the right side of the road too. But science is not one of those domains. In science, there is no intrinsic benefit to doing things just for the sake of convention. In fact, almost by definition, major scientific advances are ones that tend to buck convention and suggest things that other researchers may not have considered possible or likely.

In the context of common methodological practice, it’s no defense at all to say “but everyone does it this way”, because there are usually relatively objective standards by which we can gauge the quality of our methods, and it’s readily apparent that there are many cases where the consensus approach leaves something to be desired. For instance, you can’t really justify failing to correct for multiple comparisons when you report a single test that’s just barely significant at p < .05 on the grounds that nobody else corrects for multiple comparisons in your field. That may be a valid explanation for why your paper successfully got published (i.e., reviewers didn’t want to hold your feet to the fire for something they themselves are guilty of in their own work), but it’s not a valid defense of the actual science. If you run a t-test on randomly generated data 20 times, you will, on average, get a significant result, p < .05, once. It does no one any good to argue that because the convention in a field is to allow multiple testing–or to ignore statistical power, or to report only p values and not effect sizes, or to omit mention of conditions that didn’t ‘work’, and so on–it’s okay to ignore the issue. There’s a perfectly reasonable question as to whether it’s a smart career move to start imposing methodological rigor on your work unilaterally (see below), but there’s no question that the mere presence of consensus or convention surrounding a methodological practice does not make that practice okay from a scientific standpoint.
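If the “20 tests, one hit” arithmetic seems too convenient, it’s easy to check by simulation. The sketch below is a rough illustration in Python using only the standard library; to avoid extra dependencies it swaps the t-test for a two-sided z-test on noise with a known standard deviation of 1, and the sample size, number of simulated studies, and seed are arbitrary choices of mine.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
N_TESTS = 20      # tests run per simulated "study"
N_PER_TEST = 20   # observations per test (arbitrary choice)

def null_p_value(rng, n=N_PER_TEST):
    """Two-sided p value for a z-test on pure noise (true mean 0, known SD 1)."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = fmean(sample) * n ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def simulate(n_studies=2000, seed=0):
    """Return (mean significant tests per study, share of studies with any)."""
    rng = random.Random(seed)
    total_hits = 0
    studies_with_hit = 0
    for _ in range(n_studies):
        hits = sum(null_p_value(rng) < ALPHA for _ in range(N_TESTS))
        total_hits += hits
        studies_with_hit += hits > 0
    return total_hits / n_studies, studies_with_hit / n_studies

# What the arithmetic says should happen when there is no real effect anywhere.
expected_hits = N_TESTS * ALPHA               # one false positive per 20 tests
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS   # roughly 0.64

mean_hits, frac_any = simulate()
print(f"expected false positives per study: {expected_hits:.2f}, simulated: {mean_hits:.2f}")
print(f"P(at least one) analytic: {p_at_least_one:.3f}, simulated: {frac_any:.3f}")
```

The second number is the more sobering one: with 20 uncorrected tests on pure noise, the chance of at least one “significant” result is about 64%, so reporting only the test that worked is not a minor omission.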

“But psychology would break if we could only report results that were truly predicted a priori!”

This is a defense that has some plausibility at first blush. It’s certainly true that if you force researchers to correct for multiple comparisons properly, and report the many analyses they actually conducted–and not just those that “worked”–a lot of stuff that used to get through the filter will now get caught in the net. So, by definition, it would be harder to detect unexpected effects in one’s data–even when those unexpected effects are, in some sense, ‘real’. But the important thing to keep in mind is that raising the bar for what constitutes a believable finding doesn’t actually prevent researchers from discovering unexpected new effects; all it means is that it becomes harder to report post-hoc results as pre-hoc results. It’s not at all clear why forcing researchers to put in more effort validating their own unexpected finding is a bad thing.

In fact, forcing researchers to go the extra mile in this way would have one exceedingly important benefit for the field as a whole: it would shift the onus of determining whether an unexpected result is plausible enough to warrant pursuing away from the community as a whole, and towards the individual researcher who discovered the result in the first place. As it stands right now, if I discover an unexpected result (p < .05!) that I can make up a compelling story for, there’s a reasonable chance I might be able to get that single result into a short paper in, say, Psychological Science. And reap all the benefits that attend getting a paper into a “high-impact” journal. So in practice there’s very little penalty to publishing questionable results, even if I myself am not entirely (or even mostly) convinced that those results are reliable. This state of affairs is, to put it mildly, not A Good Thing.

In contrast, if you as an editor or reviewer start insisting that I run another study that directly tests and replicates my unexpected finding before you’re willing to publish my result, I now actually have something at stake. Because it takes time and money to run new studies, I’m probably not going to bother to follow up on my unexpected finding unless I really believe it. Which is exactly as it should be: I’m the guy who discovered the effect, and I know about all the corners I have or haven’t cut in order to produce it; so if anyone should make the decision about whether to spend more taxpayer money chasing the result, it should be me. You, as the reviewer, are not in a great position to know how plausible the effect truly is, because you have no idea how many different types of analyses I attempted before I got something to ‘work’, or how many failed studies I ran that I didn’t tell you about. Given the huge asymmetry in information, it seems perfectly reasonable for reviewers to say, You think you have a really cool and unexpected effect that you found a compelling story for? Great; go and directly replicate it yourself and then we’ll talk.

“But mistakes happen, and people could get falsely accused!”

Some people don’t like the idea of a guy like Simonsohn running around and busting people’s data fabrication operations for the simple reason that they worry that the kind of approach Simonsohn used to detect fraud is just not that well-tested, and that if we’re not careful, innocent people could get swept up in the net. I think this concern stems from fundamentally good intentions, but once again, I think it’s also misguided.

For one thing, it’s important to note that, despite all the press, Simonsohn hasn’t actually done anything qualitatively different from what other whistleblowers or skeptics have done in the past. He may have suggested new techniques that improve the efficiency with which cheating can be detected, but it’s not as though he invented the ability to report or investigate other researchers for suspected misconduct. Researchers suspicious of other researchers’ findings have always used qualitatively similar arguments to raise concerns. They’ve said things like, hey, look, this is a pattern of data that just couldn’t arise by chance, or, the numbers are too similar across different conditions.

More to the point, perhaps, no one is seriously suggesting that independent observers shouldn’t be allowed to raise their concerns about possible misconduct with journal editors, professional organizations, and universities. There really isn’t any viable alternative. Naysayers who worry that innocent people might end up ensnared by false accusations presumably aren’t suggesting that we do away with all of the existing mechanisms for ensuring accountability; but since the role of people like Simonsohn is only to raise suspicion and provide evidence (and not to do the actual investigating or firing), it’s clear that there’s no way to regulate this type of behavior even if we wanted to (which I would argue we don’t). If I wanted to spend the rest of my life scanning the statistical minutiae of psychology articles for evidence of misconduct and reporting it to the appropriate authorities (and I can assure you that I most certainly don’t), there would be nothing anyone could do to stop me, nor should there be. Remember that accusing someone of misconduct is something anyone can do, but establishing that misconduct has actually occurred is a serious task that requires careful internal investigation. No one–certainly not Simonsohn–is suggesting that a routine statistical test should be all it takes to end someone’s career. In fact, Simonsohn himself has noted that he identified a 4th case of likely fraud that he dutifully reported to the appropriate authorities only to be met with complete silence. Given all the incentives universities and journals have to look the other way when accusations of fraud are made, I suspect we should be much more concerned about the false negative rate than the false positive rate when it comes to fraud.

“But it hurts the public’s perception of our field!”

Sometimes people argue that even if the field does have some serious methodological problems, we still shouldn’t discuss them publicly, because doing so is likely to instill a somewhat negative view of psychological research in the public at large. The unspoken implication being that, if the public starts to lose confidence in psychology, fewer students will enroll in psychology courses, fewer faculty positions will be created to teach students, and grant funding to psychologists will decrease. So, by airing our dirty laundry in public, we’re only hurting ourselves. I had an email exchange with a well-known researcher to exactly this effect a few years back in the aftermath of the Vul et al “voodoo correlations” paper–a paper I commented on to the effect that the problem was even worse than suggested. The argument my correspondent raised was, in effect, that we (i.e., neuroimaging researchers) are all at the mercy of agencies like NIH to keep us employed, and if it starts to look like we’re clowning around, the unemployment rate for people with PhDs in cognitive neuroscience might start to rise precipitously.

While I obviously wouldn’t want anyone to lose their job or their funding solely because of a change in public perception, I can’t say I’m very sympathetic to this kind of argument. The problem is that it places short-term preservation of the status quo above both the long-term health of the field and the public’s interest. For one thing, I think you have to be quite optimistic to believe that some of the questionable methodological practices that are relatively widespread in psychology (data snooping, selective reporting, etc.) are going to sort themselves out naturally if we just look the other way and let nature run its course. The obvious reason for skepticism in this regard is that many of the same criticisms have been around for decades, and it’s not clear that anything much has improved. Maybe the best example of this is Sedlmeier and Gigerenzer’s 1989 paper entitled “Do studies of statistical power have an effect on the power of studies?”, in which the authors convincingly showed that despite three decades of work by luminaries like Jacob Cohen advocating power analyses, statistical power had not risen appreciably in psychology studies. The presence of such unwelcome demonstrations suggests that sweeping our problems under the rug in the hopes that someone (the mice?) will unobtrusively take care of them for us is wishful thinking.

In any case, even if problems did tend to solve themselves when hidden away from the prying eyes of the media and public, the bigger problem with what we might call the “saving face” defense is that it is, fundamentally, an abuse of taxpayers’ trust. As with so many other things, Richard Feynman summed up the issue eloquently in his famous Cargo Cult science commencement speech:

For example, I was a little surprised when I was talking to a friend who was going to go on the radio. He does work on cosmology and astronomy, and he wondered how he would explain what the applications of this work were. “Well,” I said, “there aren’t any.” He said, “Yes, but then we won’t get support for more research of this kind.” I think that’s kind of dishonest. If you’re representing yourself as a scientist, then you should explain to the layman what you’re doing–and if they don’t want to support you under those circumstances, then that’s their decision.

The fact of the matter is that our livelihoods as researchers depend directly on the goodwill of the public. And the taxpayers are not funding our research so that we can “discover” interesting-sounding but ultimately unreplicable effects. They’re funding our research so that we can learn more about the human mind and hopefully be able to fix it when it breaks. If a large part of the profession is routinely employing practices that are at odds with those goals, it’s not clear why taxpayers should be footing the bill. From this perspective, it might actually be a good thing for the field to revise its standards, even if (in the worst-case scenario) that causes a short-term contraction in employment.

“But unreliable effects will just fail to replicate, so what’s the big deal?”

This is a surprisingly common defense of sloppy methodology, maybe the single most common one. It’s also an enormous cop-out, since it pre-empts the need to think seriously about what you’re doing in the short term. The idea is that, since no single study is definitive, and a consensus about the reality or magnitude of most effects usually doesn’t develop until many studies have been conducted, it’s reasonable to impose a fairly low bar on initial reports and then wait and see what happens in subsequent replication efforts.

I think this is a nice ideal, but things just don’t seem to work out that way in practice. For one thing, there doesn’t seem to be much of a penalty for publishing high-profile results that later fail to replicate. The reason, I suspect, is that we incline to give researchers the benefit of the doubt: surely (we say to ourselves), Jane Doe did her best, and we like Jane, so why should we question the work she produces? If we’re really so skeptical about her findings, shouldn’t we go replicate them ourselves, or wait for someone else to do it?

While this seems like an agreeable and fair-minded attitude, it isn’t actually a terribly good way to look at things. Granted, if you really did put in your best effort–dotted all your i’s and crossed all your t’s–and still ended up reporting a false result, we shouldn’t punish you for it. I don’t think anyone is seriously suggesting that researchers who inadvertently publish false findings should be ostracized or shunned. On the other hand, it’s not clear why we should continue to celebrate scientists who ‘discover’ interesting effects that later turn out not to replicate. If someone builds a career on the discovery of one or more seemingly important findings, and those findings later turn out to be wrong, the appropriate attitude is to update our beliefs about the merit of that person’s work. As it stands, we rarely seem to do this.

In any case, the bigger problem with appeals to replication is that the delay between initial publication of an exciting finding and subsequent consensus disconfirmation can be very long, and often spans entire careers. Waiting decades for history to prove an influential idea wrong is a very bad idea if the available alternative is to nip the idea in the bud by requiring stronger evidence up front.

There are many notable examples of this in the literature. A well-publicized recent one is John Bargh’s work on the motor effects of priming people with elderly stereotypes–namely, that priming people with words related to old age makes them walk away from the experiment more slowly. Bargh’s original paper was published in 1996, and according to Google Scholar, has now been cited over 2,000 times. It has undoubtedly been hugely influential in directing many psychologists’ research programs in certain directions (in many cases, in directions that are equally counterintuitive and also now seem open to question). And yet it’s taken over 15 years for a consensus to develop that the original effect is at the very least much smaller in magnitude than originally reported, and potentially so small as to be, for all intents and purposes, “not real”. I don’t know who reviewed Bargh’s paper back in 1996, but I suspect that if they ever considered the seemingly implausible size of the effect being reported, they might well have thought to themselves, well, I’m not sure I believe it, but that’s okay–time will tell. Time did tell, of course; but time is kind of lazy, so it took fifteen years for it to tell. In an alternate universe, a reviewer might have said, well, this is a striking finding, but the effect seems implausibly large; I would like you to try to directly replicate it in your lab with a much larger sample first. I recognize that this is onerous and annoying, but my primary responsibility is to ensure that only reliable findings get into the literature, and inconveniencing you seems like a small price to pay. Plus, if the effect is really what you say it is, people will be all the more likely to believe you later on.

Or take the actor-observer asymmetry, which appears in just about every introductory psychology textbook written in the last 20 – 30 years. It states that people are relatively more likely to attribute their own behavior to situational factors, and relatively more likely to attribute other agents’ behaviors to those agents’ dispositions. When I slip and fall, it’s because the floor was wet; when you slip and fall, it’s because you’re dumb and clumsy. This putative asymmetry was introduced and discussed at length in a book by Jones and Nisbett in 1971, and hundreds of studies have investigated it at this point. And yet a 2006 meta-analysis by Malle suggested that the cumulative evidence for the actor-observer asymmetry is actually very weak. There are some specific circumstances under which you might see something like the postulated effect, but what is quite clear is that it’s nowhere near strong enough an effect to justify being routinely invoked by psychologists and even laypeople to explain individual episodes of behavior. Unfortunately, at this point it’s almost impossible to dislodge the actor-observer asymmetry from the psyche of most researchers–a reality underscored by the fact that the Jones and Nisbett book has been cited nearly 3,000 times, whereas the 2006 meta-analysis has been cited only 96 times (a very low rate for an important and well-executed meta-analysis published in Psychological Bulletin).

The fact that it can take many years–whether 15 or 45–for a literature to build up to the point where we’re even in a position to suggest with any confidence that an initially exciting finding could be wrong means that we should be very hesitant to appeal to long-term replication as an arbiter of truth. Replication may be the gold standard in the very long term, but in the short and medium term, appealing to replication is a huge cop-out. If you can see problems with an analysis right now that cast aspersions on a study’s results, it’s an abdication of responsibility to downplay your concerns and wait for someone else to come along and spend a lot more time and money trying to replicate the study. You should point out now why you have concerns. If the authors can address them, the results will look all the better for it. And if the authors can’t address your concerns, well, then, you’ve just done science a service. If it helps, don’t think of it as a matter of saying mean things about someone else’s work, or of asserting your own ego; think of it as potentially preventing a lot of very smart people from wasting a lot of time chasing down garden paths–and also saving a lot of taxpayer money. Remember that our job as scientists is not to make other scientists’ lives easy in the hopes they’ll repay the favor when we submit our own papers; it’s to establish and apply standards that produce convergence on the truth in the shortest amount of time possible.

“But it would hurt my career to be meticulously honest about everything I do!”

Unlike the other considerations listed above, I think the concern that being honest carries a price when it comes to doing research has a good deal of merit to it. Given the aforementioned delay between initial publication and later disconfirmation of findings (which even in the best case is usually longer than the delay between obtaining a tenure-track position and coming up for tenure), researchers have many incentives to emphasize expediency and good story-telling over accuracy, and it would be disingenuous to suggest otherwise. No malevolence or outright fraud is implied here, mind you; the point is just that if you keep second-guessing and double-checking your analyses, or insist on routinely collecting more data than other researchers might think is necessary, you will very often find that results that could have made a bit of a splash given less rigor are actually not particularly interesting upon careful cross-examination. Which means that researchers who have, shall we say, less of a natural inclination to second-guess, double-check, and cross-examine their own work will, to some degree, be more likely to publish results that make a bit of a splash (it would be nice to believe that pre-publication peer review filters out sloppy work, but empirically, it just ain’t so). So this is a classic tragedy of the commons: what’s good for a given individual, career-wise, is clearly bad for the community as a whole.

I wish I had a good solution to this problem, but I don’t think there are any quick fixes. The long-term solution, as many people have observed, is to restructure the incentives governing scientific research in such a way that individual and communal benefits are directly aligned. Unfortunately, that’s easier said than done. I’ve written a lot both in papers (1, 2, 3) and on this blog (see posts linked here) about various ways we might achieve this kind of realignment, but what’s clear is that it will be a long and difficult process. For the foreseeable future, it will continue to be an understandable though highly lamentable defense to say that the cost of maintaining a career in science is that one sometimes has to play the game the same way everyone else plays the game, even if it’s clear that the rules everyone plays by are detrimental to the communal good.

 

Anyway, this may all sound a bit depressing, but I really don’t think it should be taken as such. Personally I’m actually very optimistic about the prospects for large-scale changes in the way we produce and evaluate science within the next few years. I do think we’re going to collectively figure out how to do science in a way that directly rewards people for employing research practices that are maximally beneficial to the scientific community as a whole. But I also think that for this kind of change to take place, we first need to accept that many of the defenses we routinely give for using iffy methodological practices are just not all that compelling.

bio-, chemo-, neuro-, eco-informatics… why no psycho-?

The latest issue of the APS Observer features a special section on methods. I contributed a piece discussing the need for a full-fledged discipline of psychoinformatics:

Scientific progress depends on our ability to harness and apply modern information technology. Many advances in the biological and social sciences now emerge directly from advances in the large-scale acquisition, management, and synthesis of scientific data. The application of information technology to science isn’t just a happy accident; it’s also a field in its own right — one commonly referred to as informatics. Prefix that term with a Greek root or two and you get other terms like bioinformatics, neuroinformatics, and ecoinformatics — all well-established fields responsible for many of the most exciting recent discoveries in their parent disciplines.

Curiously, following the same convention also gives us a field called psychoinformatics — which, if you believe Google, doesn’t exist at all (a search for the term returns only 500 hits as of this writing; Figure 1). The discrepancy is surprising, because labels aside, it’s clear that psychological scientists are already harnessing information technology in powerful and creative ways — often reshaping the very way we collect, organize, and synthesize our data.

Here’s the picture that’s worth, oh, at least ten or fifteen words:

Figure 1. Number of Google search hits for informatics-related terms, by prefix.

You can read the rest of the piece here if you’re so inclined. Check out some of the other articles too; I particularly like Denny Borsboom’s piece on network analysis. EDIT: and Anna Mikulak’s piece on optogenetics! I forgot the piece on optogenetics! How can you not love optogenetics!

we, the people, who make mistakes–economists included

Andrew Gelman discusses a “puzzle that’s been bugging [him] for a while”:

Pop economists (or, at least, pop micro-economists) are often making one of two arguments:

1. People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist.

2. People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

Argument 1 is associated with “why do they do that?” sorts of puzzles. Why do they charge so much for candy at the movie theater, why are airline ticket prices such a mess, why are people drug addicts, etc. The usual answer is that there’s some rational reason for what seems like silly or self-destructive behavior.

Argument 2 is associated with “we can do better” claims such as why we should fire 80% of public-schools teachers or Moneyball-style stories about how some clever entrepreneur has made a zillion dollars by exploiting some inefficiency in the market.

The trick is knowing whether you’re gonna get 1 or 2 above. They’re complete opposites!

Personally what I find puzzling isn’t really how to reconcile these two strands (which do seem to somehow coexist quite peacefully in pop economists’ writings); it’s how anyone–economist or otherwise–still manages to believe people are rational in any meaningful sense (and I’m not saying Andrew does; in fact, see below).

There are at least two non-trivial ways to define rationality. One is in terms of an ideal agent’s actions–i.e., rationality is what a decision-maker would choose to do if she had unlimited cognitive resources and knew all the information relevant to a given decision. Well, okay, maybe not an ideal agent, but at the very least a very smart one. This is the sense of rationality in which you might colloquially remark to your neighbor that buying lottery tickets is an irrational thing to do, because the odds are stacked against you. The expected value of buying a lottery ticket (i.e., the amount you would expect to end up with in the long run) is generally negative, so in some normative sense, you could say it’s irrational to buy lottery tickets.

This definition of irrationality is probably quite close to the colloquial usage of the term, but it’s not really interesting from an academic standpoint, because nobody (economists included) really believes we’re rational in this sense. It’s blatantly obvious to everyone that none of us really make normatively correct choices much of the time. If for no other reason than we are all somewhat lacking in the omniscience department.

What economists mean when they talk about rationality is something more technical; specifically, it’s that people manifest stationary preferences. That is, given any set of preferences an individual happens to have (which may seem completely crazy to everyone else), rationality implies that that person expresses those preferences in a consistent manner. If you like dark chocolate more than milk chocolate, and milk chocolate more than Skittles, you shouldn’t like Skittles more than dark chocolate. If you do, you’re violating the principle of transitivity, which would effectively make it impossible to model your preferences formally (since we’d have no way of telling what you’d prefer in any given situation). And that would be a problem for standard economic theory, which is based on the assumption that people are fundamentally rational agents (in this particular sense).
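The chocolate example can be made concrete with a small sketch that looks for a preference cycle among pairwise choices. The data structure (a set of "X is preferred to Y" pairs) and the function name are assumptions made up for this illustration, not anything taken from the economics literature.

```python
from itertools import permutations


def find_preference_cycle(prefers):
    """Return a triple (a, b, c) with a > b, b > c and c > a, or None.

    `prefers` is a set of (better, worse) pairs; a cycle like this is a
    violation of transitivity.
    """
    items = sorted({item for pair in prefers for item in pair})
    for a, b, c in permutations(items, 3):
        if (a, b) in prefers and (b, c) in prefers and (c, a) in prefers:
            return (a, b, c)
    return None


consistent = {("dark", "milk"), ("milk", "skittles"), ("dark", "skittles")}
inconsistent = {("dark", "milk"), ("milk", "skittles"), ("skittles", "dark")}

print(find_preference_cycle(consistent))    # no cycle found
print(find_preference_cycle(inconsistent))  # a dark > milk > skittles > dark cycle
```

With the cyclic set there is no ranking of the three items that agrees with every stated pair, which is exactly why such preferences can't be modeled formally.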

The reason I say it’s puzzling that anyone still believes people are rational in even this narrower sense is that decades of behavioral economics and psychology research have repeatedly demonstrated that people just don’t have consistent preferences. You can radically influence and alter decision-makers’ behavior in all sorts of ways that simply aren’t predicted or accounted for by Rational Choice Theory (RCT). I’ll give just two examples here, but there are any number of others, as many excellent books attest (e.g., Dan Ariely‘s Predictably Irrational, or Thaler and Sunstein’s Nudge).

The first example stems from famous work by Madrian and Shea (2001) investigating the effects of savings plan designs on employees’ 401(k) choices. By pretty much anyone’s account, decisions about savings plans should be a pretty big deal for most employees. The difference between opting into a 401(k) and opting out of one can easily amount to several hundred thousand dollars over the course of a lifetime, so you would expect people to have a huge incentive to make the choice that’s most consistent with their personal preferences (whether those preferences happen to be for splurging now or saving for later). Yet what Madrian and Shea convincingly showed was that most employees simply go with the default plan option. When companies switch from opt-in to opt-out (i.e., instead of calling up HR and saying you want to join the plan, you’re enrolled by default, and have to fill out a form if you want to opt out), nearly 50% more employees end up enrolled in the 401(k).

This result (and any number of others along similar lines) makes no sense under rational choice theory, because it’s virtually impossible to conceive of a consistent set of preferences that would explain this type of behavior. Many of the same employees who won’t take ten minutes out of their day to opt in or out of their 401(k) will undoubtedly drive across town to save a few dollars on their groceries; like most people, they’ll look for bargains, buy cheaper goods rather than more expensive ones, worry about leaving something for their children after they’re gone, and so on and so forth. And one can’t simply attribute the discrepancy in behavior to ignorance (i.e., “no one reads the fine print!”), because the whole point of massive incentives is that they’re supposed to incentivize you to do things like look up information that could be relevant to, oh, say, having hundreds of thousands of extra dollars in your bank account in forty years. If you’re willing to look for coupons in the Sunday paper to save a few dollars, but aren’t willing to call up HR and ask about your savings plan, there is, to put it frankly, something mildly inconsistent about your preferences.

The other example stems from the enormous literature on risk aversion. The classic risk aversion finding is that most people require a higher nominal payoff on risky prospects than on safe ones before they’re willing to accept the risky prospect. For instance, most people would rather have $10 for sure than $50 with 25% probability, even though the expected value of the latter is 25% higher (an amazing return!). Risk aversion is a pervasive phenomenon, and crops up everywhere, including in financial investments, where it underlies what is known as the equity premium puzzle (the puzzle being that many investors prefer bonds to stocks even though the historical record suggests a massively higher rate of return for stocks over the long term).
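For anyone who wants to check the arithmetic, here's a minimal sketch of the expected value comparison. Representing a prospect as a list of (probability, payoff) pairs is just a convenient assumption for this example.

```python
def expected_value(prospect):
    """Probability-weighted average payoff of a list of (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in prospect)


sure_thing = [(1.0, 10.0)]             # $10 for sure
gamble = [(0.25, 50.0), (0.75, 0.0)]   # $50 with 25% probability, otherwise nothing

print(expected_value(sure_thing))                           # 10.0
print(expected_value(gamble))                               # 12.5
print(expected_value(gamble) / expected_value(sure_thing))  # 1.25
```

The gamble is worth $12.50 on average against $10 for the sure thing, which is where the 25% figure comes from.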

From a naive standpoint, you might think the challenge risk aversion poses to rational choice theory is that risk aversion is just, you know, stupid. Meaning, if someone keeps offering you $10 with 100% probability or $50 with 25% probability, it’s stupid to keep making the former choice (which is what most people do when you ask them) when you’re going to make much more money by making the latter choice. But again, remember, economic rationality isn’t about preferences per se, it’s about consistency of preferences. Risk aversion may violate a simplistic theory under which people are supposed to simply maximize expected value at all times; but then, no one’s really believed that for several hundred years. The standard economist’s response to the observation that people are risk averse is to observe that people aren’t maximizing expected value, they’re maximizing utility. Utility has a non-linear relationship with monetary value, so that people assign different weight to the (N+1)th dollar earned than to the Nth dollar earned. For instance, the classical value function identified by Kahneman and Tversky in their seminal work (for which Kahneman won the Nobel prize in part) looks like this:

The idea here is that the average person overvalues small gains relative to larger gains; i.e., you may be more satisfied when you receive $200 than when you receive $100, but you’re not going to be twice as satisfied.
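Here's a rough sketch of a value function with that shape. It uses a simple power form, and the curvature and loss-aversion parameters (0.88 and 2.25) are commonly cited estimates from Tversky and Kahneman's later work, assumed here purely for illustration; any exponent below 1 gives the same qualitative picture.

```python
ALPHA = 0.88    # assumed curvature; values below 1 mean diminishing sensitivity
LAMBDA = 2.25   # assumed loss-aversion coefficient; losses loom larger than gains


def value(x):
    """Subjective value of gaining (x >= 0) or losing (x < 0) x dollars."""
    if x >= 0:
        return x ** ALPHA
    return -LAMBDA * (-x) ** ALPHA


# How much more satisfying is $200 than $100 under these assumptions?
print(round(value(200) / value(100), 2))  # 1.84, i.e. less than twice
```

Under these assumed parameters, doubling the payoff from $100 to $200 raises its subjective value by a factor of about 1.84 rather than 2.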

This seemed like a sufficient response for a while, since it appears to preserve consistency as the hallmark of rationality. The idea is that you can have people who have more or less curvature in their value and probability weighting functions (i.e., some people are more risk averse than others), and that’s just fine as long as those preferences are consistent. Meaning, it’s okay if you prefer $50 with 25% probability to $10 with 100% probability just as long as you also prefer $50 with 25% probability to $8 with 100% probability, or to $7 with 100% probability, and so on. So long as your preferences are consistent, your behavior can be explained by RCT.

The problem, as many people have noted, is that in actuality there isn’t any set of consistent preferences that can explain most people’s risk averse behavior. A succinct and influential summary of the problem was provided by Rabin (2000), who showed formally that the choices people make when dealing with small amounts of money imply such an absurd level of risk aversion that the only way for them to be consistent would be to reject uncertain prospects with an infinitely large payoff even when the certain payoff was only modestly larger. Put differently,

if a person always turns down a 50-50 lose $100/gain $110 gamble, she will always turn down a 50-50 lose $800/gain $2,090 gamble. … Somebody who always turns down 50-50 lose $100/gain $125 gambles will turn down any gamble with a 50% chance of losing $600.

The reason for this is simply that any concave function that crosses the points expressed by the low-magnitude prospects (e.g., a refusal to take a 50-50 bet with lose $100/gain $110 outcomes) will have to asymptote fairly quickly. So for people to have internally consistent preferences, they would literally have to be turning down infinite but uncertain payoffs for certain but modest ones. Which of course is absurd; in practice, you would have a hard time finding many people who would refuse a coin toss where they lose $600 on heads and win $$$infinity dollarz$$$ on tails. Though you might have a very difficult time convincing them you’re serious about the bet. And an even more difficult time finding infinity trucks with which to haul in those infinity dollarz in the event you lose.
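To see why the curve has to flatten so fast, here's a back-of-the-envelope sketch. It is a deliberately cruder bound than Rabin's, so the numbers won't match the ones he reports. The assumption is that someone with a concave utility function U turns down the 50-50 lose $100/gain $110 bet at every wealth level w. Concavity then forces 110·U'(w+110) ≤ 100·U'(w−100), meaning marginal utility shrinks by a factor of at least 10/11 for every $210 of extra wealth. Summing the resulting geometric series caps the value of any gain, however large. The function names are made up for this example.

```python
def upside_cap(loss=100.0, gain=110.0):
    """Upper bound on the utility of ANY gain, measured in units of the
    marginal utility of a dollar at current wealth."""
    span = loss + gain            # marginal utility is compared across this width
    ratio = loss / gain           # it shrinks by at least this factor per span
    return span / (1.0 - ratio)   # geometric series: span * (1 + ratio + ratio**2 + ...)


def decisive_loss(loss=100.0, gain=110.0):
    """Smallest loss (in whole spans) whose utility cost is at least the upside
    cap, so that a 50-50 bet risking it is refused whatever the prize."""
    span = loss + gain
    ratio = loss / gain
    cap = upside_cap(loss, gain)
    cost, spans = 0.0, 0
    while cost < cap:
        # Moving down in wealth, marginal utility grows by at least 1/ratio per span.
        cost += span * ratio ** (-spans)
        spans += 1
    return spans * span


print(upside_cap())     # about 2310
print(decisive_loss())  # 1680.0
```

So even this loose version of the argument implies that such a person would refuse a coin flip that risks $1,680 no matter how large the prize, and Rabin's tighter analysis pushes the threshold lower still.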

Anyway, these are just two prominent examples; there are literally hundreds of other similar examples in the behavioral economics literature of supposedly rational people displaying wildly inconsistent behavior. And not just a minority of people; it’s pretty much all of us. Presumably including economists. Irrationality, as it turns out, is the norm and not the exception. In some ways, what’s surprising is not that we’re inconsistent, but that we manage to do so well despite our many biases and failings.

To return to the puzzle Andrew Gelman posed, though, I suspect Andrew’s being facetious, and doesn’t really see this as much of a puzzle at all. Here’s his solution:

The key, I believe, is that “rationality” is a good thing. We all like to associate with good things, right? Argument 1 has a populist feel (people are rational!) and argument 2 has an elitist feel (economists are special!). But both are ways of associating oneself with rationality. It’s almost like the important thing is to be in the same room with rationality; it hardly matters whether you yourself are the exemplar of rationality, or whether you’re celebrating the rationality of others.

This seems like a somewhat more tactful way of saying what I suspect Andrew and many other people (and probably most academic psychologists, myself included) already believe, which is that there isn’t really any reason to think that people are rational in the sense demanded by RCT. That’s not to say economics is bunk, or that it doesn’t make sense to think about incentives as a means of altering behavior. Obviously, in a great many situations, pretending that people are rational is a reasonable approximation to the truth. For instance, in general, if you offer more money to have a job done, more people will be willing to do that job. But the fact that the tenets of standard economics often work shouldn’t blind us to the fact that they also often don’t, and that they fail in many systematic and predictable ways. For instance, sometimes paying people more money makes them perform worse, not better. And sometimes it saps them of the motivation to work at all. Faced with overwhelming empirical evidence that people don’t behave as the theory predicts, the appropriate response should be to revisit the theory, or at least to recognize which situations it should be applied in and which it shouldn’t.

Anyway, that’s a long-winded way of saying I don’t think Andrew’s puzzle is really a puzzle. Economists simply don’t express their own preferences and views about consistency consistently, and it’s not surprising, because neither does anyone else. That doesn’t make them (or us) bad people; it just makes us all people.

the APS likes me!

Somehow I wound up profiled in this month’s issue of the APS Observer as a “Rising Star”. I’d like to believe this means I’m a really big deal now, but I suspect what it actually means is that someone on the nominating committee at APS has extraordinarily bad judgment. I say this in no small part because I know some of the other people who were named Rising Stars quite well (congrats to Karl Szpunar, Jason Chan, and Alan Castel, among many other people!), so I’m pretty sure I can distinguish people who actually deserve this from, say, me.

Of course, I’m not going to look a gift horse in the mouth. And I’m certainly thrilled to be picked for this. I know these things are kind of a crapshoot, but it still feels really nice. So while the part of my brain that understands measurement error is saying “meh, luck of the draw,” that other part of my brain that likes to be told it’s awesome is in the middle of a three-day coke bender right now*. The only regret both parts of the brain have is that there isn’t any money attached to the award–or even a token prize like, say, a free statistician for a year. But I don’t think I’m going to push my luck by complaining to APS about it.

One thing I like a lot about the format of the Rising Star awards is they give you a full page to talk about yourself and your research. If there’s one thing I like to talk about, it’s myself. Usually, you can’t talk about yourself for very long before people start giving you dirty looks. But in this case, it’s sanctioned, so I guess it’s okay. In any case, the kind folks at the Observer sent me a series of seven questions to answer. And being an upstanding gentleman who likes to be given fancy awards, I promptly obliged. I figured they would just run what I sent them with minor edits… but I WAS VERY WRONG. They promptly disassembled nearly all of my brilliant observations and advice and replaced them with some very tame ramblings. So if you actually bother to read my responses, and happen to fall asleep halfway through, you’ll know who to blame. But just to set the record straight, I figured I would run through each of the boilerplate questions I was asked, and show you the answer that was printed in the Observer as compared to what I actually wrote**:

What does your research focus on?

What they printed: Most of my current research focuses on what you might call psychoinformatics: the application of information technology to psychology, with the aim of advancing our ability to study the human mind and brain. I’m interested in developing new ways to acquire, synthesize, and share data in psychology and cognitive neuroscience. Some of the projects I’ve worked on include developing new ways to measure personality more efficiently, adapting computer science metrics of string similarity to visual word recognition, modeling fMRI data on extremely short timescales, and conducting large-scale automated synthesis of published neuroimaging findings. The common theme that binds these disparate projects together is the desire to develop new ways of conceptualizing and addressing psychological problems; I believe very strongly in the transformative power of good methods.

What I actually said: I don’t know! There’s so much interesting stuff to think about! I can’t choose!

What drew you to this line of research? Why is it exciting to you?

What they printed: Technology enriches and improves our lives in every domain, and science is no exception. In the biomedical sciences in particular, many revolutionary discoveries would have been impossible without substantial advances in information technology. Entire subfields of research in molecular biology and genetics are now synonymous with bioinformatics, and neuroscience is currently also experiencing something of a neuroinformatics revolution. The same trend is only just beginning to emerge in psychology, but we’re already able to do amazing things that would have been unthinkable 10 or 20 years ago. For instance, we can now collect data from thousands of people all over the world online, sample people’s inner thoughts and feelings in real time via their phones, harness enormous datasets released by governments and corporations to study everything from how people navigate their spatial world to how they interact with their friends, and use high-performance computing platforms to solve previously intractable problems through large-scale simulation. Over the next few years, I think we’re going to see transformative changes in the way we study the human mind and brain, and I find that a tremendously exciting thing to be involved in.

What I actually said: I like psychology a lot, and I like technology a lot. Why not combine them!

Who were/are your mentors or psychological influences?

What they printed: I’ve been fortunate to have outstanding teachers and mentors at every stage of my training. I actually started my academic career quite disinterested in science and owe my career trajectory in no small part to two stellar philosophy professors (Rob Stainton and Chris Viger) who convinced me as an undergraduate that engaging with empirical data was a surprisingly good way to discover how the world really works. I can’t possibly do justice to all the valuable lessons my graduate and postdoctoral mentors have taught me, so let me just pick a few out of a hat. Among many other things, Todd Braver taught me how to talk through problems collaboratively and keep recursively questioning the answers to problems until a clear understanding materializes. Randy Larsen taught me that patience really is a virtue, despite my frequent misgivings. Tor Wager has taught me to think more programmatically about my research and to challenge myself to learn new skills. All of these people are living proof that you can be an ambitious, hard-working, and productive scientist and still be extraordinarily kind and generous with your time. I don’t think I embody those qualities myself right now, but at least I know what to shoot for.

What I actually said: Richard Feynman, Richard Hamming, and my mother. Not necessarily in that order.

To what do you attribute your success in science?

What they printed: Mostly to blind luck. So far I’ve managed to stumble from one great research and mentoring situation to another. I’ve been fortunate to have exceptional advisors who’ve provided me with the perfect balance of freedom and guidance and amazing colleagues and friends who’ve been happy to help me out with ideas and resources whenever I’m completely out of my depth — which is most of the time.

To the extent that I can take personal credit for anything, I think I’ve been good about pursuing ideas I’m passionate about and believe in, even when they seem unlikely to pay off at first. I’m also a big proponent of exploratory research; I think pure exploration is tremendously undervalued in psychology. Many of my projects have developed serendipitously, as a result of asking, “What happens if we try doing it this way?”

What I actually said: Mostly to blind luck.

What’s your future research agenda?

What they printed: I’d like to develop technology-based research platforms that improve psychologists’ ability to answer existing questions while simultaneously opening up entirely new avenues of research. That includes things like developing ways to collect large amounts of data more efficiently, tracking research participants over time, automatically synthesizing the results of published studies, building online data repositories and collaboration tools, and more. I know that all sounds incredibly vague, and if you have some ideas about how to go about any of it, I’d love to collaborate! And by collaborate, I mean that I’ll brew the coffee and you’ll do the work.

What I actually said: Trading coffee for publications?

Any advice for even younger psychological scientists? What would you tell someone just now entering graduate school or getting their PhD?

What they printed: The responsible thing would probably be to say “Don’t go to graduate school.” But if it’s too late for that, I’d recommend finding brilliant mentors and colleagues and serving them coffee exactly the way they like it. Failing that, find projects you’re passionate about, work with people you enjoy being around, develop good technical skills, and don’t be afraid to try out crazy ideas. Leave your office door open, and talk to everyone you can about the research they’re doing, even if it doesn’t seem immediately relevant. Good ideas can come from anywhere and often do.

What I actually said: “Don’t go to graduate school.”

What publication are you most proud of or feel has been most important to your career?

What they printed: Yarkoni, T., Poldrack, R. A., Nichols, T. E., Van Essen, D. C., & Wager, T. D. (2011). Large-scale automated synthesis of human functional neuroimaging data. Manuscript submitted for publication.

In this paper, we introduce a highly automated platform for synthesizing data from thousands of published functional neuroimaging studies. We used a combination of text mining, meta-analysis, and machine learning to automatically generate maps of brain activity for hundreds of different psychological concepts, and we showed that these results could be used to “decode” cognitive states from brain activity in individual human subjects in a relatively open-ended way. I’m very proud of this work, and I’m quite glad that my co-authors agreed to make me first author in return for getting their coffee just right. Unfortunately, the paper isn’t published yet, so you’ll just have to take my word for it that it’s really neat stuff. And if you’re thinking, “Isn’t it awfully convenient that his best paper is unpublished?”… why, yes. Yes it is.

What I actually said: …actually, that’s almost exactly what I said. Except they inserted that bit about trading coffee for co-authorship. Really all I had to do was ask my co-authors nicely.

Anyway, like I said, it’s really nice to be honored in this way, even if I don’t really deserve it (and that’s not false modesty–I’m generally the first to tell other people when I think I’ve done something awesome). But I’m a firm believer in regression to the mean, so I suspect the run of good luck won’t last. In a few years, when I’ve done almost no new original work, failed to land a tenure-track job, and dropped out of academia to ride horses around the racetrack***, you can tell people that you knew me back when I was a Rising Star. Right before you tell them you don’t know what the hell happened.

———————————-

* But not really.

** Totally lying. Pretty much every word is as I wrote it. And the Observer staff were great.

*** Hopefully none of these things will happen. Except the jockey thing; that would be awesome.

how many Cortex publications in the hand is a Nature publication in the bush worth?

A provocative and very short Opinion piece by Julien Mayor (Are scientists nearsighted gamblers? The misleading nature of impact factors) was recently posted on the Frontiers in Psychology website (open access! yay!). Mayor’s argument is summed up nicely in this figure:

The left panel plots the mean versus median number of citations per article in a given year (each year is a separate point) for 3 journals: Nature (solid circles), Psych Review (squares), and Psych Science (triangles). The right panel plots the number of citations each paper receives in each of the first 15 years following its publication. What you can clearly see is that (a) the mean and median are very strongly related for the psychology journals, but completely unrelated for Nature, implying that a very small number of articles account for the vast majority of Nature citations (Mayor cites data indicating that up to 40% of Nature papers are never cited); and (b) Nature papers tend to get cited heavily for a year or two, and then disappear, whereas Psych Science, and particularly Psych Review, tend to have much longer shelf lives. Based on these trends, Mayor concludes that:

From this perspective, the IF, commonly accepted as golden standard for performance metrics seems to reward high-risk strategies (after all your Nature article has only slightly over 50% chance of being ever cited!), and short-lived outbursts. Are scientists then nearsighted gamblers?

I’d very much like to believe this, in that I think the massive emphasis scientists collectively place on publishing work in broad-interest, short-format journals like Nature and Science is often quite detrimental to the scientific enterprise as a whole. But I don’t actually believe it, because I think that, for any individual paper, researchers generally do have good incentives to try to publish in the glamor mags rather than in more specialized journals. Mayor’s figure, while informative, doesn’t take a number of factors into account:

  • The types of papers that get published in Psych Review and Nature are very different. Review papers, in general, tend to get cited more often, and for a longer time. A better comparison would be between Psych Review papers and only review papers in Nature (there’s not many of them, unfortunately). My guess is that that difference alone probably explains much of the difference in citation rates later on in an article’s life. That would also explain why the temporal profile of Psych Science articles (which are also overwhelmingly short empirical reports) is similar to that of Nature. Major theoretical syntheses stay relevant for decades; individual empirical papers, no matter how exciting, tend to stop being cited as frequently once (a) the finding fails to replicate, or (b) a literature builds up around the original report, and researchers stop citing individual studies and start citing review articles (e.g., in Psych Review).
  • Scientists don’t just care about citation counts, they also care about reputation. The reality is that much of the appeal of having a Nature or Science publication isn’t necessarily that you expect the work to be cited much more heavily, but that you get to tell everyone else how great you must be because you have a publication in Nature. Now, on some level, we know that it’s silly to hold glamor mags in such high esteem, and Mayor’s data are consistent with that idea. In an ideal world, we’d read all papers ultra-carefully before making judgments about their quality, rather than using simple but flawed heuristics like what journal those papers happen to be published in. But this isn’t an ideal world, and the reality is that people do use such heuristics. So it’s to each scientist’s individual advantage (but to the field’s detriment) to take advantage of that knowledge.
  • Different fields have very different citation rates. And articles in different fields have very different shelf lives. For instance, I’ve heard that in many areas of physics, the field moves so fast that articles are basically out of date within a year or two (I have no way to verify if this is true or not). That’s certainly not true of most areas of psychology. For instance, in cognitive neuroscience, the current state of the field in many areas is still reasonably well captured by highly-cited publications that are 5 – 10 years old. Most behavioral areas of psychology seem to advance even more slowly. So one might well expect articles in psychology journals to peak later in time than the average Nature article, because Nature contains a high proportion of articles in the natural sciences.
  • Articles are probably selected for publication in Nature, Psych Science, and Psych Review for different reasons. In particular, there’s no denying the fact that Nature selects articles in large part based on the perceived novelty and unexpectedness of the result. That’s not to say that methodological rigor doesn’t play a role, just that, other things being equal, unexpected findings are less likely to be replicated. Since Nature and Science overwhelmingly publish articles with new and surprising findings, it shouldn’t be surprising if the articles in these journals have a lower rate of replication several years on (and hence, stop being cited). That’s presumably going to be less true of articles in specialist journals, where novelty factor and appeal to a broad audience are usually less important criteria.

Addressing these points would probably go a long way towards closing, and perhaps even reversing, the gap implied by Mayor’s figure. I suspect that if you could do a controlled experiment and publish the exact same article in Nature and Psych Science, it would tend to get cited more heavily in Nature over the long run. So in that sense, if citations were all anyone cared about, I think it would be perfectly reasonable for scientists to try to publish in the most prestigious journals–even though, again, I think the pressure to publish in such journals actually hurts the field as a whole.

Of course, in reality, we don’t just care about citation counts anyway; lots of other things matter. For one thing, we also need to factor in the opportunity cost associated with writing a paper up in a very specific format for submission to Nature or Science, knowing that we’ll probably have to rewrite much or all of it before it gets published. All that effort could probably have been spent on other projects, so one way to put the question is: how many lower-tier publications in the hand is a top-tier publication in the bush worth?
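To make that question a little more concrete, here is a toy break-even calculation. It is only a sketch under made-up assumptions: the acceptance probability, the effort figures (in weeks), and the function name `break_even_value` are all hypothetical and not estimated from any data. The idea is to ask how many lower-tier papers a single top-tier paper would have to be worth before the expected payoff of aiming high matches what the same time would buy if spent on lower-tier papers alone.

```python
def break_even_value(p_accept, weeks_low, weeks_extra_top, weeks_rewrite):
    """Toy model: how many lower-tier papers must one top-tier paper be
    worth for a top-tier attempt to break even?

    All inputs are hypothetical illustration values:
      p_accept        -- assumed chance the top-tier journal accepts
      weeks_low       -- weeks needed to write one lower-tier paper
      weeks_extra_top -- extra weeks spent on the top-tier format
      weeks_rewrite   -- weeks to rewrite for a lower tier after rejection
    """
    if not 0 < p_accept <= 1:
        raise ValueError("p_accept must be in (0, 1]")
    if weeks_low <= 0:
        raise ValueError("weeks_low must be positive")

    # Expected time cost of the top-tier attempt; the rewrite is only
    # paid when the paper is rejected.
    expected_weeks = weeks_low + weeks_extra_top + (1 - p_accept) * weeks_rewrite

    # Lower-tier papers the same amount of time would have produced.
    papers_forgone = expected_weeks / weeks_low

    # Break-even condition: p * V + (1 - p) * 1 = papers_forgone,
    # where a rejected paper still ends up as one lower-tier paper.
    return (papers_forgone - (1 - p_accept)) / p_accept


if __name__ == "__main__":
    # Assumed numbers: 10% acceptance, 10 weeks per ordinary paper,
    # 4 extra weeks for the glamor format, 3 weeks to rewrite.
    print(round(break_even_value(0.1, 10, 4, 3), 2))
```

With those invented numbers, the top-tier paper would need to be worth about 7.7 ordinary papers to justify the attempt. The answer is extremely sensitive to the assumed acceptance rate, which is exactly the kind of quantity nobody knows for their own manuscript.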

Ultimately, it’s an empirical matter; I imagine if you were willing to make some strong assumptions, and collect the right kind of data, you could come up with a meaningful estimate of the actual value of a Nature publication, as a function of important variables like the number of other publications the authors had, the amount of work invested in rewriting the paper after rejection, the authors’ career stage, etc. But I don’t know of any published work to that effect; it seems like it would probably be more trouble than it was worth (or, to get meta: how many Nature manuscripts can you write in the time it takes you to write a manuscript about how many Nature manuscripts you should write?). And, to be honest, I suspect that any estimate you obtained that way would have little or no impact on the actual decisions scientists make about where to submit their manuscripts anyway, because, in practice, such decisions are driven as much by guesswork and wishful thinking as by any well-reasoned analysis. And on that last point, I speak from extensive personal experience…

the naming of things

Let’s suppose you were charged with the important task of naming all the various subdisciplines of neuroscience that have anything to do with the field of research we now know as psychology. You might come up with some or all of the following terms, in no particular order:

  • Neuropsychology
  • Biological psychology
  • Neurology
  • Cognitive neuroscience
  • Cognitive science
  • Systems neuroscience
  • Behavioral neuroscience
  • Psychiatry

That’s just a partial list; you’re resourceful, so there are probably others (biopsychology? psychobiology? psychoneuroimmunology?). But it’s a good start. Now suppose you decided to make a game out of it, and threw a dinner party where each guest received a copy of your list (discipline names only–no descriptions!) and had to guess what they thought people in that field study. If your nomenclature made any sense at all, and tried to respect the meanings of the individual words used to generate the compound words or phrases in your list, your guests might hazard something like the following guesses:

  • Neuropsychology: “That’s the intersection of neuroscience and psychology. Meaning, the study of the neural mechanisms underlying cognitive function.”
  • Biological psychology: “Similar to neuropsychology, but probably broader. Like, it includes the role of genes and hormones and kidneys in cognitive function.”
  • Neurology: “The pure study of the brain, without worrying about all of that associated psychological stuff.”
  • Cognitive neuroscience: “Well if it doesn’t mean the same thing as neuropsychology and biological psychology, then it probably refers to the branch of neuroscience that deals with how we think and reason. Kind of like cognitive psychology, only with brains!”
  • Cognitive science: “Like cognitive neuroscience, but not just for brains. It’s the study of human cognition in general.”
  • Systems neuroscience: “Mmm… I don’t really know. The study of how the brain functions as a whole system?”
  • Behavioral neuroscience: “Easy: it’s the study of the relationship between brain and behavior. For example, how we voluntarily generate actions.”
  • Psychiatry: “That’s the branch of medicine that concerns itself with handing out multicolored pills that do funny things to your thoughts and feelings. Of course.”

If this list seems sort of sensible to you, you probably live in a wonderful world where compound words mean what you intuitively think they mean, the subject matter of scientific disciplines can be transparently discerned, and terms that sound extremely similar have extremely similar referents rather than referring to completely different fields of study. Unfortunately, that world is not the world we happen to actually inhabit. In our world, most of the disciplines at the intersection of psychology and neuroscience have funny names that reflect accidents of history, and tell you very little about what the people in that field actually study.

Here’s the list your guests might hand back in this world, if you ever made the terrible, terrible mistake of inviting a bunch of working scientists to dinner:

  • Neuropsychology: The study of how brain damage affects cognition and behavior. Most often focusing on the effects of brain lesions in humans, and typically relying primarily on behavioral evaluations (i.e., no large magnetic devices that take photographs of the space inside people’s skulls). People who call themselves neuropsychologists are overwhelmingly trained as clinical psychologists, and many of them work in big white buildings with a red cross on the front. Note that this isn’t the definition of neuropsychology that Wikipedia gives you; Wikipedia seems to think that neuropsychology is “the basic scientific discipline that studies the structure and function of the brain related to specific psychological processes and overt behaviors.” Nice try, Wikipedia, but that’s much too general. You didn’t even use the words ‘brain damage’, ‘lesion’, or ‘patient’ in the first sentence.
  • Biological psychology: To be perfectly honest, I’m going to have to step out of dinner-guest character for a moment and admit I don’t really have a clue what biological psychologists study. I can’t remember the last time I heard someone refer to themselves as a biological psychologist. To an approximation, I think biological psychology differs from, say, cognitive neuroscience in placing greater emphasis on everything outside of higher cognitive processes (sensory systems, autonomic processes, the four F’s, etc.). But that’s just idle speculation based largely on skimming through the chapter names of my old “Biological Psychology” textbook. What I can recklessly assert is that you really don’t want to trust the Wikipedia definition here, because when you type ‘biological psychology’ into that little box that says ‘search’ on Wikipedia, it redirects you to the behavioral neuroscience entry. And that can’t be right, because, as we’ll see in a moment, behavioral neuroscience refers to something very different…
  • Neurology: Hey, look! A Wikipedia entry that doesn’t lie to our face! It says neurology is “a medical specialty dealing with disorders of the nervous system. Specifically, it deals with the diagnosis and treatment of all categories of disease involving the central, peripheral, and autonomic nervous systems, including their coverings, blood vessels, and all effector tissue, such as muscle.” That’s a definition I can get behind, and I think 9 out of 10 dinner guests would probably agree (the tenth is probably drunk). But then, I’m not (that kind of) doctor, so who knows.
  • Cognitive neuroscience: In principle, cognitive neuroscience actually means more or less what it sounds like it means. It’s the study of the neural mechanisms underlying cognitive function. In practice, it all goes to hell in a handbasket when you consider that you can prefix ‘cognitive neuroscience’ with pretty much any adjective you like and end up with a valid subdiscipline. Developmental cognitive neuroscience? Check. Computational cognitive neuroscience? Check. Industrial/organizational cognitive neuroscience? Amazingly, no; until just now, that phrase did not exist on the internet. But by the time you read this, Google will probably have a record of this post, which is really all it takes to legitimate I/OCN as a valid field of inquiry. It’s just that easy to create a new scientific discipline, so be very afraid–things are only going to get messier.
  • Cognitive science: A field that, by most accounts, lives up to its name. Well, kind of. Cognitive science sounds like a blanket term for pretty much everything that has to do with cognition, and it sort of is. You have psychology and linguistics and neuroscience and philosophy and artificial intelligence all represented. I’ve never been to the annual CogSci conference, but I hear it’s a veritable orgy of interdisciplinary activity. Still, I think there’s a definite bias towards some fields at the expense of others. Neuroscientists (of any stripe), for instance, rarely call themselves cognitive scientists. Conversely, philosophers of mind or language love to call themselves cognitive scientists, and the cynic in me says it’s because it means they get to call themselves scientists. Also, in terms of content and coverage, there seems to be a definite emphasis among self-professed cognitive scientists on computational and mathematical modeling, and not so much emphasis on developing neuroscience-based models (though neural network models are popular). Still, if you’re scoring terms based on clarity of usage, cognitive science should score at least an 8.5 / 10.
  • Systems neuroscience: The study of neural circuits and the dynamics of information flow in the central nervous system (note: I stole part of that definition from MIT’s BCS website, because MIT people are SMART). Systems neuroscience doesn’t overlap much with psychology; you can’t defensibly argue that the temporal dynamics of neuronal assemblies in sensory cortex have anything to do with human cognition, right? I just threw this in to make things even more confusing.
  • Behavioral neuroscience: This one’s really great, because it has almost nothing to do with what you think it does. Well, okay, it does have something to do with behavior. But it’s almost exclusively animal behavior. People who refer to themselves as behavioral neuroscientists are generally in the business of poking rats in the brain with very small, sharp, glass objects; they typically don’t care much for human beings (professionally, that is). I guess that kind of makes sense when you consider that you can have rats swim and jump and eat and run while electrodes are implanted in their heads, whereas most of the time when we study human brains, they’re sitting motionless in (a) a giant magnet, (b) a chair, or (c) a jar full of formaldehyde. So maybe you could make an argument that since humans don’t get to BEHAVE very much in our studies, people who study humans can’t call themselves behavioral neuroscientists. But that would be a very bad argument to make, and many of the people who work in the so-called “behavioral sciences” and do nothing but study human behavior would probably be waiting to thump you in the hall the next time they saw you.
  • Psychiatry: The branch of medicine that concerns itself with handing out multicolored pills that do funny things to your thoughts and feelings. Of course.

Anyway, the basic point of all this long-winded nonsense is just that, for all that stuff we tell undergraduates about how science is such a wonderful way to achieve clarity about the way the world works, scientists–or at least, neuroscientists and psychologists–tend to carve up their disciplines in pretty insensible ways. That doesn’t mean we’re dumb, of course; to the people who work in a field, the clarity (or lack thereof) of the terminology makes little difference, because you only need to acquire it once (usually in your first nine years of grad school), and after that you always know what people are talking about. Come to think of it, I’m pretty sure the whole point of learning big words is that once you’ve successfully learned them, you can stop thinking deeply about what they actually mean.

It is kind of annoying, though, to have to explain to undergraduates that, DUH, the class they really want to take given their interests is OBVIOUSLY cognitive neuroscience and NOT neuropsychology or biological psychology. I mean, can’t they read? Or to pedantically point out to someone you just met at a party that saying “the neurological mechanisms of such-and-such” makes them sound hopelessly unsophisticated, and what they should really be saying is “the neural mechanisms,” or “the neurobiological mechanisms”, or (for bonus points) “the neurophysiological substrates”. Or, you know, to try (unsuccessfully) to convince your mother on the phone that even though it’s true that you study the relationship between brains and behavior, the field you work in has very little to do with behavioral neuroscience, and so you really aren’t an expert on that new study reported in that article she just read in the paper the other day about that interesting thing that’s relevant to all that stuff we all do all the time.

The point is, the world would be a slightly better place if cognitive science, neuropsychology, and behavioral neuroscience all meant what they seem like they should mean. But only very slightly better.

Anyway, aside from my burning need to complain about trivial things, I bring these ugly terminological matters up partly out of idle curiosity. And what I’m idly curious about is this: does this kind of confusion feature prominently in other disciplines too, or is psychology-slash-neuroscience just, you know, “special”? My intuition is that it’s the latter; subdiscipline names in other areas just seem so sensible to me whenever I hear them. For instance, I’m fairly confident that organic chemists study the chemistry of Orgas, and I assume condensed matter physicists spend their days modeling the dynamics of teapots. Right? Yes? No? Perhaps my three regular readers can enlighten me in the comments…