There is no ceiling effect in Johnson, Cheung, & Donnellan (2014)

This is not a blog post about bullying, negative psychology or replication studies in general. Those are important issues, and a lot of ink has been spilled over them in the past week or two. But this post isn’t about those issues (at least, not directly). This post is about ceiling effects. Specifically, the ceiling effect purportedly present in a paper in Social Psychology, in which Johnson, Cheung, and Donnellan report the results of two experiments that failed to replicate an earlier pair of experiments by Schnall, Benton, and Harvey.

If you’re not up to date on recent events, I recommend reading Vasudevan Mukunth’s post, which provides a nice summary. If you still want to know more after that, you should probably take a gander at the original paper by Schnall, Benton, & Harvey and the replication paper. Still want more? Go read Schnall’s rebuttal. Then read the rejoinder to the rebuttal. Then read Schnall’s first and second blog posts. And maybe a number of other blog posts (here, here, here, and here). Oh, and then, if you still haven’t had enough, you might want to skim the collected email communications between most of the parties in question, which Brian Nosek has been kind enough to curate.

I’m pointing you to all those other sources primarily so that I don’t have to wade very deeply into the overarching issues myself–because (a) they’re complicated, (b) they’re delicate, and (c) I’m still not entirely sure exactly how I feel about them. However, I do have a fairly well-formed opinion about the substantive issue at the center of Schnall’s published rebuttal–namely, the purported ceiling effect that invalidates Johnson et al’s conclusions. So I thought I’d lay that out here in excruciating detail. I’ll warn you right now that if your interests lie somewhere other than the intersection of psychology and statistics (which they probably should), you probably won’t enjoy this post very much. (If your interests do lie at the intersection of psychology and statistics, you’ll probably give this post a solid “meh”.)

Okay, with all the self-handicapping out of the way, let’s get to it. Here’s what I take to be…

Schnall’s argument

The crux of Schnall’s criticism of the Johnson et al replication is a purported ceiling effect. What, you ask, is a ceiling effect? Here’s Schnall’s definition:

A ceiling effect means that responses on a scale are truncated toward the top end of the scale. For example, if the scale had a range from 1-7, but most people selected “7”, this suggests that they might have given a higher response (e.g., “8” or “9”) had the scale allowed them to do so. Importantly, a ceiling effect compromises the ability to detect the hypothesized influence of an experimental manipulation. Simply put: With a ceiling effect it will look like the manipulation has no effect, when in reality it was unable to test for such an effect in the first place. When a ceiling effect is present no conclusions can be drawn regarding possible group differences.
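To make the truncation mechanism concrete, here is a minimal simulation sketch; the means, standard deviation, and 1-9 scale below are illustrative assumptions of mine, not anyone's actual data. The closer the latent means sit to the top of the scale, the more the observed difference between conditions gets squashed:

```python
import numpy as np

# Toy demonstration: a true 0.5-point difference between conditions is attenuated
# once responses are forced onto a bounded 1-9 scale, and the attenuation grows
# as the latent means approach the ceiling. All numbers are made up for illustration.
rng = np.random.default_rng(42)
true_diff, sd, n = 0.5, 1.5, 200_000

for clean_mean in (5.0, 6.5, 7.5, 8.5):
    neutral = np.clip(rng.normal(clean_mean + true_diff, sd, n), 1, 9)
    clean = np.clip(rng.normal(clean_mean, sd, n), 1, 9)
    print(f"latent clean mean = {clean_mean}: "
          f"observed difference = {neutral.mean() - clean.mean():.2f}, "
          f"share of responses at ceiling = {(neutral >= 9).mean():.2f}")
```

Note that even in the most extreme row the observed difference shrinks rather than disappearing outright; that distinction becomes important later in the post.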

Schnall’s definition has some subtle-but-important problems we’ll come back to, but it’s reasonable as a first approximation. With this definition in mind, here’s how Schnall describes her core analysis, which she uses to argue that Johnson et al’s results are invalid:

Because a ceiling effect on a dependent variable can wash out potential effects of an independent variable (Hessling, Traxel & Schmidt, 2004), the relationship between the percentage of extreme responses and the effect of the cleanliness manipulation was examined. First, using all 24 item means from original and replication studies, the effect of the manipulation on each item was quantified. … Second, for each dilemma the percentage of extreme responses averaged across neutral and clean conditions was computed. This takes into account the extremity of both conditions, and therefore provides an unbiased indicator of ceiling per dilemma. … Ceiling for each dilemma was then plotted relative to the effect of the cleanliness manipulation (Figure 1).
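In code form, the procedure Schnall describes amounts to something like the following sketch. The per-dilemma numbers here are invented solely to show the mechanics (the normalization she actually used is discussed further below):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-dilemma summaries (made-up numbers, not Schnall's or Johnson et al's data):
# the mean rating in each condition and the proportion of maximally extreme responses.
neutral_means = np.array([6.1, 7.4, 8.2, 5.9, 8.6, 7.0])
clean_means = np.array([5.6, 7.1, 8.1, 5.3, 8.5, 6.9])
extreme_neutral = np.array([0.10, 0.25, 0.55, 0.08, 0.70, 0.20])
extreme_clean = np.array([0.08, 0.22, 0.52, 0.06, 0.68, 0.18])

# Step 1: quantify the effect of the manipulation on each dilemma
# (Schnall used a normalized difference; see the formula later in the post).
effect_per_item = (neutral_means - clean_means) / (neutral_means + clean_means)

# Step 2: percentage of extreme responses per dilemma, averaged across conditions.
extremity_per_item = (extreme_neutral + extreme_clean) / 2

# Step 3: relate "ceiling" (extremity) to the size of the manipulation effect.
r, p = pearsonr(extremity_per_item, effect_per_item)
print(f"r = {r:.2f}, p = {p:.3f}")
```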

We can (and will) quibble with these analysis choices, but the net result of the analysis is this:

[Figure: schnall_figure — Schnall’s Figure 1, plotting normalized effect size against the percentage of extreme responses for each item]

Here, we see normalized effect size (y-axis) plotted against extremity of item response (x-axis). Schnall’s basic argument is that there’s a strong inverse relationship between the extremity of responses to an item and the size of the experimental effect on that item. In other words, items with extreme responses don’t show an effect, whereas items with non-extreme responses do show an effect. She goes on to note that this pattern is driven entirely by her own original experiments, and that there is no such relationship in Johnson et al’s data. On the basis of this finding, Schnall concludes that:

Scores are compressed toward the top end of the scale and therefore show limited determinate variance near ceiling. Because a significance test compares variance due to a manipulation to variance due to error, an observed lack of effect can result merely from a lack in variance that would normally be associated with a manipulation. Given the observed ceiling effect, a statistical artefact, the analyses reported by Johnson et al. (2014a) are invalid and allow no conclusions about the reproducibility of the original findings.

Problems with the argument

One can certainly debate over what the implications would be even if Schnall’s argument were correct; for instance, it’s debatable whether the presence of a ceiling effect would actually invalidate Johnson et al’s conclusions that they had failed to replicate Schnall et al. An alternative and reasonable interpretation is that Johnson et al would have simply identified important boundary conditions under which the original effect doesn’t work (e.g., that it doesn’t hold in Michigan residents), since they were using Schnall’s original measures. But we don’t have to worry about that in any case, because there are several serious problems with Schnall’s argument. Some of them have to do with the statistical analysis she performs to make her point; some of them have to do with subtle mischaracterizations of what ceiling effects are and where they come from; and some of them have to do with the fact that Schnall’s data actually directly contradict her own argument. Let’s take each of these in turn.

Problems with the analysis

A first problem with Schnall’s analysis is that the normalization procedure she uses to make her point is biased. Schnall computes the normalized effect size for each item as:

(M1 – M2)/(M1 + M2)

Where M1 and M2 are the means for each item in the two experimental conditions (neutral and clean). This transformation is supposed to account for the fact that scores are compressed at the upper end of the scale, near the ceiling.

What Schnall fails to note, however, is that compression should also occur at the bottom of the scale, near the floor. For example, suppose an individual item has means of 1.2 and 1.4. Then Schnall’s normalized effect size estimate would be 0.2/2.6 ≈ 0.077. But if the means had been 4.0 and 4.2–the same absolute difference–then the adjusted estimate would actually be much smaller (around 0.02). So Schnall’s analysis is actually biased in favor of detecting the negative correlation she takes as evidence of a ceiling effect, because she’s not accounting for floor effects simultaneously. A true “clipping” or compression of scores shouldn’t occur at only one extreme of the scale; what should matter is how far from the midpoint a response happens to be. What should happen, if Schnall were to recompute the scores in Figure 1 using a modified criterion (e.g., relative deviation from the scale’s midpoint, rather than absolute score), is that the points at the top left of the figure should pull towards the y-axis to some degree, effectively reducing the slope she takes as evidence of a problem. If there’s any pattern that would suggest a measurement problem, it’s actually an inverted U-shape, where normalized effects are greatest for items with means nearest the midpoint, and smallest for items at both extremes, not just near ceiling. But that’s not what we’re shown.
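To see the asymmetry numerically, here is a small sketch using made-up item means. Under Schnall's normalization, the very same 0.2-point raw difference looks large near the floor and small near the ceiling; a denominator based on distance from the nearer end of the scale (one hypothetical way of implementing a symmetric criterion of the kind described above) treats both extremes alike:

```python
# Toy illustration of the asymmetry in the (M1 - M2) / (M1 + M2) normalization.
# All item means are hypothetical.

def schnall_normalized_effect(m1, m2):
    """Normalization used in Schnall's rebuttal: difference scaled by the sum of the means."""
    return (m1 - m2) / (m1 + m2)

def symmetric_normalized_effect(m1, m2, scale_min=1, scale_max=9):
    """One hypothetical symmetric alternative: scale the difference by the distance
    between the items' average response and the nearer end of the scale, so that
    compression near the floor and near the ceiling are treated the same way."""
    avg = (m1 + m2) / 2.0
    room = min(avg - scale_min, scale_max - avg)
    return (m1 - m2) / room

# Three items with the *same* 0.2-point raw difference at different scale locations:
for m1, m2 in [(1.4, 1.2), (4.2, 4.0), (8.2, 8.0)]:
    print(f"means ({m2}, {m1}): "
          f"Schnall-normalized = {schnall_normalized_effect(m1, m2):.3f}, "
          f"symmetric = {symmetric_normalized_effect(m1, m2):.3f}")
```

The first column shrinks steadily as the means climb even though the raw difference never changes, which is part of the built-in trend described above; the second column at least treats the two ends of the scale the same way.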

A second problem is that Schnall’s data actually contradict her own conclusion. She writes:

Across the 24 dilemmas from all 4 experiments, dilemmas with a greater percentage of extreme responses were associated with lower effect sizes (r = -.50, p = .01, two-tailed). This negative correlation was entirely driven by the 12 original items, indicating that the closer responses were to ceiling, the smaller was the effect of the manipulation (r = -.49, p = .10). In contrast, across the 12 replication items there was no correlation (r = .11, p = .74).

But if anything, these results provide evidence of a ceiling effect only in Schnall’s original study, and not in the Johnson et al replications. Recall that Schnall’s argument rests on two claims: (a) effects are harder to detect the more extreme responding on an item gets, and (b) responding is so extreme on the items in the Johnson et al experiments that nothing can be detected. But the results she presents blatantly contradict the second claim. Had there been no variability in item means in the Johnson et al studies, Schnall could have perhaps argued that restriction of range is so extreme that it is impossible to detect any kind of effect. In practice, however, that’s not the case. There is considerable variability along the x-axis, and in particular, one can clearly see that there are two items in Johnson et al that are nowhere near ceiling and yet show no discernible normalized effect of experimental condition at all. Note that these are the very same items that show some of the strongest effects in Schnall’s original study. In other words, the data Schnall presents in support of her argument actually directly contradict her argument. If one is to believe that a ceiling effect is preventing Schnall’s effect from emerging in Johnson et al’s replication studies, then there is no reasonable explanation for the fact that those two leftmost red squares in the figure above are close to the y = 0 line. They should be behaving exactly like they did in Schnall’s study–which is to say, they should be showing very large normalized effects–even if items at the very far right show no effects at all.

Third, Schnall’s argument that a ceiling effect completely invalidates Johnson et al’s conclusions is a gross exaggeration. Ceiling effects are not all-or-none; the degree of score compression into the upper end of a measure will vary continuously (unless there is literally no variance at all in the responses, which is clearly not the case here). Even if we took at face value Schnall’s finding that there’s an inverse relationship between effect size and extremity in her original data (r = -0.5), all this would tell us is that there’s some compression of scores. Schnall’s suggestion that “given the observed ceiling effect, a statistical artefact, the analyses reported by Johnson et al. (2014a) are invalid and allow no conclusions about the reproducibility of the original findings” is simply false. Even in the very best case scenario (which this obviously isn’t), the very strongest claim Schnall could comfortably make is that there may be some compression of scores, with unknown impact on the detectable effect size. It is simply not credible for Schnall to suggest that the mere presence of something that looks vaguely like a ceiling effect is sufficient to completely rule out detection of group differences in the Johnson et al experiments. And we know this with 100% certainty, because…

There are robust group differences in the replication experiments

Perhaps the clearest refutation of Schnall’s argument for a ceiling effect is that, as Johnson et al noted in their rejoinder, the Johnson et al experiments did in fact successfully identify some very clear group differences (and, ironically, ones that were also present in Schnall’s original experiments). Specifically, Johnson et al showed a robust effect of gender on vignette ratings. Here’s what the results look like:

[Figure: mean vignette ratings by gender and experimental condition in Johnson et al’s Experiments 1 and 2]

We can see clearly that, in both replication experiments, there’s a large effect of gender but no discernible effect of experimental condition. This pattern directly refutes Schnall’s argument. She cannot have it both ways: if a ceiling effect precludes the presence of group differences, then there cannot be a ceiling effect in the replication studies, or else the gender effect could not have emerged repeatedly. Conversely, if ceiling effects don’t preclude detection of effects, then there is no principled reason why Johnson et al would fail to detect Schnall’s original effect.

Interestingly, it’s not just the overall means that tell this story. Here’s what happens if we plot the gender effects in Johnson et al’s experiments in the same way as Schnall’s Figure 1 above:

[Figure: gender_fx_by_extremity — gender effect sizes in Johnson et al’s experiments plotted against extremity of item responses, in the style of Schnall’s Figure 1]

Notice that we see here the same negative relationship between effect size and extremity that Schnall observed in her own data, and whose absence in Johnson et al’s data she (erroneously) took as evidence of a ceiling effect.

There’s a ceiling effect in Schnall’s own data

Yet another flaw in Schnall’s argument is that taking the ceiling effect charge seriously would actually invalidate at least one of her own experiments. Consider that the only vignette in Schnall et al’s original Experiment 1 that showed a statistically significant effect also had the highest rate of extreme responding in that study (mean rating of 8.25 / 9). Even more strikingly, the proportion of participants who gave the most extreme response possible on that vignette (70%) was higher than for any of the vignettes in either of Johnson et al’s experiments. In other words, Schnall’s core argument is that her effect could not possibly be replicated in Johnson et al’s experiments because of the presence of a ceiling effect, yet the only vignette to show a significant effect in Schnall’s original Experiment 1 had an even more pronounced ceiling effect. Once again, she cannot have it both ways. Either ceiling effects don’t preclude detection of effects, or, by Schnall’s own logic, the original Study 1 effect was probably a false positive.

When pressed on this point by Daniel Lakens in the email thread, Schnall gave the following response:

Note for the original studies we reported that the effect was seen on aggregate data, not necessarily for individual dilemmas. Such results will always show statistical fluctuations at the item level, hence it is important to not focus on any individual dilemma but on the overall pattern.

I confess that I’m not entirely clear on what Schnall means here. One way to read this is that she is conceding that the significant effect in the vignette in question (the “kitten” dilemma) was simply due to random fluctuations. Note that since the effect in Schnall’s Experiment 1 was at best marginal when averaging across all vignettes (it didn’t quite reach conventional significance even then), eliminating this vignette from consideration would actually have produced a null result. But suppose we overlook that and instead agree with Schnall that strange things can happen to individual items, and that what we should focus on is the aggregate moral judgment, averaged across vignettes. That would be perfectly reasonable, except that it’s directly at odds with Schnall’s more general argument. To see this, we need only look at the aggregate distribution of scores in Johnson et al’s Experiments 1 and 2:

[Figure: johnson_distributions — aggregate distributions of moral judgment scores in Johnson et al’s Experiments 1 and 2]

There’s clearly no ceiling effect here; the mode in both experiments is nowhere near the maximum. So once again, Schnall can’t have it both ways. If her argument is that what matters is the aggregate measure (which seems right to me, since many reputable measures have multiple individual items with skewed distributions, and this can even be a desirable property in certain cases), then there’s nothing objectionable about the scores in the Johnson et al experiments. Conversely, if Schnall’s argument is that it’s fair to pick on individual items, then there is effectively no reason to believe Schnall’s own original Experiment 1 (and for all I know, her Experiment 2 as well–I haven’t looked).

What should we conclude?

What can we conclude from all this? A couple of things. First, Schnall has no basis for arguing that there was a fundamental statistical flaw that completely invalidates Johnson et al’s conclusions. From where I’m sitting, there doesn’t seem to be any meaningful ceiling effect in Johnson et al’s data, and that’s attested to by the fact that Johnson et al had no trouble detecting gender differences in both experiments (successfully replicating Schnall’s earlier findings). Moreover, the arguments Schnall makes in support of the postulated ceiling effects suffer from serious flaws. At best, what Schnall could reasonably argue is that there might be some restriction of range in the ratings, which would artificially reduce the effect size. However, given that Johnson et al’s sample sizes were 3 – 5 times larger than Schnall’s, it is highly implausible to suppose that effects as big as Schnall’s completely disappeared–especially given that robust gender effects were detected. What’s more, given that the skew in Johnson et al’s aggregate distributions is not very extreme at all, and that many individual items on many questionnaire measures show ceiling or floor effects (e.g., go look at individual Big Five item distributions some time), taking Schnall’s claims seriously would in effect invalidate not just Johnson et al’s results, but also a huge proportion of the broader psychology literature.
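To put the sample-size point in rough quantitative terms, here is a back-of-the-envelope power sketch. The effect sizes and cell sizes are illustrative assumptions of mine, not the values reported in either paper; the point is only directional:

```python
# Back-of-the-envelope power sketch (illustrative numbers only, not the actual
# effect sizes or sample sizes from Schnall et al. or Johnson et al.).
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

for d in (0.6, 0.4):                  # a hypothetical original-sized effect, and an attenuated one
    for n_per_cell in (20, 60, 100):  # hypothetical cell sizes; the replication cells were several times larger
        p = power_calc.power(effect_size=d, nobs1=n_per_cell, alpha=0.05, ratio=1.0)
        print(f"d = {d}, n per cell = {n_per_cell:3d}: power = {p:.2f}")
```

Even under the attenuated scenario, samples several times larger than the original retain substantial power, which is why “some compression of scores” and “no conclusions can be drawn” are very different claims.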

Second, while Schnall has raised a number of legitimate and serious concerns about the tone of the debate and comments surrounding Johnson et al’s replication, she’s also made a number of serious charges of her own that depend on the validity of her argument about ceiling effects, and not on the civility (or lack thereof) of commentators on various sides of the debate. Schnall has (incorrectly) argued that Johnson et al have committed a basic statistical error that most peer reviewers would have caught–effectively accusing them of incompetence. She has argued that Johnson et al’s claim of replication failure is unwarranted, and constitutes defamation of her scientific reputation. And she has suggested that the editors of the special issue (Daniel Lakens and Brian Nosek) behaved unethically by first not seeking independent peer review of the replication paper, and then actively trying to suppress her own penetrating criticisms. In my view, none of these accusations are warranted, because they depend largely on Schnall’s presumption of a critical flaw in Johnson et al’s work that is in fact nonexistent. I understand that Schnall has been under a lot of stress recently, and I sympathize with her concerns over unfair comments made by various people (most of whom have now issued formal apologies). But given the acrimonious tone of the more general ongoing debate over replication, it’s essential that we distinguish the legitimate issues from the illegitimate ones so that we can focus exclusively on the former, and don’t end up needlessly generating more hostility on both sides.

Lastly, there is the question of what conclusions we should draw from the Johnson et al replication studies. Personally, I see no reason to question Johnson et al’s conclusions, which are actually very modest:

In short, the current results suggest that the underlying effect size estimates from these replication experiments are substantially smaller than the estimates generated from the original SBH studies. One possibility is that there are unknown moderators that account for these apparent discrepancies. Perhaps the most salient difference between the current studies and the original SBH studies is the student population. Our participants were undergraduates in the United States whereas participants in SBH’s studies were undergraduates in the United Kingdom. It is possible that cultural differences in moral judgments or in the meaning and importance of cleanliness may explain any differences.

Note that Johnson et al did not assert or intimate in any way that Schnall et al’s effects were “not real”. They did not suggest that Schnall et al had committed any errors in their original study. They explicitly acknowledged that unknown moderators might explain the difference in results (though they also noted that this was unlikely considering the magnitude of the differences). Effectively, Johnson et al stuck very close to their data and refrained from any kind of unfounded speculation.

In sum, unless Schnall has other concerns about Johnson’s data besides the purported ceiling effect (and she hasn’t raised any that I’ve seen), I think Johnson et al’s paper should enter the record exactly as its authors intended. Johnson, Cheung, & Donnellan (2014) is, quite simply, a direct preregistered replication of Schnall, Benton, & Harvey (2008) that failed to detect the effects reported in the original study, and there should be nothing at all controversial about this. There are certainly worthwhile discussions to be had about why the replication failed, and what that means for the original effect, but this doesn’t change the fundamental fact that the replication did fail, and we shouldn’t pretend otherwise.

12 thoughts on “There is no ceiling effect in Johnson, Cheung, & Donnellan (2014)”

  1. Thank you for these very detailed analyses that refute the ceiling effect defense of the original study. It seems much more plausible that the replication study provided a better estimate of the true effect size.

  2. Very nice post.

    What I find so fascinating about this whole situation is how much the process and procedures change when the exchanges take place in public. Let me elaborate.

    I, probably like yourself and many of the readers of your blog, have had similar experiences during the review process. Essentially what has happened to me (rather frequently, in fact) is that: 1) my submitted manuscript would be assigned to a reviewer (always anonymous, but sometimes one has a good idea of who it is) who has a vested interest in the theory or effects that are addressed in my paper; 2) if the data in the paper did not 100% support previous research, the reviewer would concoct a long and involved argument that purported to show some damning flaw (as in the case of Schnall, often statistical, though not always); which would lead to 3) the paper being rejected without the possibility to address the comments of the reviewer.

    When I see what Schnall has written (both “privately” in her email exchanges with Nosek and Lakens and publicly on her site and on the SPSP blog), I see strong echoes of this process that typically occurs in private. Reviewers with personal investment in a literature will devise these types of sophisticated-sounding (though not always so) arguments to prevent the publication of papers that do not match the results or vision of the original work. Typically, they get their way. In a “normal” peer-review process (e.g. not a pre-registered replication study, when all exchanges about the manuscript and data take place privately), Schnall would have had a good chance at shooting down the replication manuscript.

    What I see Schnall saying goes something like this: “As the original author of these studies, I’m the expert and I deserve a chance to have my say on the publication of this paper.” What I see beneath the surface is anxiety about the shifting power structures in the way research is evaluated and published (though let’s not get ahead of ourselves- for most papers it’s still “business as usual”). Schnall, and her “famous” public defenders like Gilbert, Schwarz and Reis, are used to having their say, are used to having their opinions heard, are used to the process going their way and in private. From this perspective, their comments about “bullying” and “replication police” or “false-positive police” make some sort of sense. If you are used to being in charge of the system, attempts to modify or change it can be perceived as threatening.

    But in this case, much of the discourse has taken place publicly. The sophisticated-sounding arguments Schnall has put forth have been subjected to scrutiny by many people (your excellent and thorough analysis among them), and have been found lacking. It’s wonderful to see, for me at least, because I’ve been on the receiving end of these types of comments many times over, and quite often I don’t get the chance to respond.

    I would hope that there’s a lesson here for the field in general, about the value of open discourse, open materials and data, etc. So many more people are getting a say on the value of these results than would normally be the case. Consensus is developing based on careful, thoughtful and open examination of the arguments and counter-arguments (and the data!). Isn’t this what the goal of “peer reviewing” is all about?

    It’s hard to say what will happen from here, especially given the entrenched nature of the existing power structure surrounding reviewing. Some might take my comment as overly negative about the peer-review process. But as Uri Simonsohn pointed out at SPSP this year, even the “best” JPSP papers (e.g. those that made no mention of including covariates in their models) have only around 33% statistical power to detect the effects reported in the articles. These papers are the best that the field has to offer according to current standards, and have been subject to intense scrutiny through peer-review. So I would argue that the current peer-review system isn’t doing a great job at selecting robust results, and that pre-registration and open everything (materials, data, analysis syntax) is a possible way forward. It’s perhaps unlikely to happen, but I’d like to think that this entire episode gives us a vision of what psychological science could be like.

    1. I was with you until you said:

      What I see Schnall saying goes something like this: “As the original author of these studies, I’m the expert and I deserve a chance to have my say on the publication of this paper.” What I see beneath the surface is anxiety about the shifting power structures in the way research is evaluated and published (though let’s not get ahead of ourselves- for most papers it’s still “business as usual”).

      Though I do like what you say about opening the system.

    2. Good point, but thanks to the internet and blogging the power structure is shifting. As always the people in power who benefited from the old system don’t like it, but they have no control over the power of the situation.

  3. Dear MM, you don’t want papers to be peer reviewed because there is often a reviewer who has an agenda to reject the paper, and editors often do a poor job and accept the reviewers’ comments at face value. On the other hand, you trust editors to evaluate a paper without peer review? Would this be your view for all papers, or only replications? Or pre-registered papers?

    You write: “pre-registration and open everything (materials, data, analysis syntax) is a possible way forward.” That makes a lot of sense, but it is not inconsistent with peer review and a new appreciation of the need for published replication studies. And also post-publication reviews linked to the published work. But I think peer review also has an important role in this, and I would be worried having a single editor making the call as he/she may not have the expertise to make the most informed judgment.

    With regards to the current paper, Tal Yarkoni writes that peer review of the replication paper is not necessary because there was no flaw in the replication study. Is the implication that peer review would have been appropriate if there was a flaw? How would that work?

    Jeff Bowers

    1. Hi Professor Bowers,

      Thanks for your comment on my comment.

      I don’t think I said that I don’t want papers to be peer-reviewed. What I tried to say was that “the current peer-review system isn’t doing a great job at selecting robust results,” and I mentioned experiences that I and my colleagues have had which resemble the social dynamics unfolding in the current case.

      I think that the idea of peer-review is good, but the way in which it is currently implemented is not resulting in a particularly solid empirical literature (see also Asendorpf et al., 2013, European Journal of Personality; Giner-Sorolla, 2012, Perspectives on Psychological Science; Nosek & Bar-Anan, 2012, Psychological Inquiry, etc.; for a biologist’s perspective see http://www.michaeleisen.org/blog/?p=694).

      There are many possible ways to implement a peer-review system, and perhaps it’s time to explore some other models. PLOS One has peer review, but it is focused on the methods and not on the results. One could make the argument that the papers in the special issue of Social Psychology were also peer-reviewed. Indeed, as Nosek and Lakens state in their intro editorial “Proposals that passed initial editorial review went out for peer review. Reviewers evaluated the importance of conducting a replication and the quality of the methodology. At least one author of the original article was invited to be a reviewer if any were still alive.” Furthermore, the scrutiny that the replication study has been given post-publication is more than I have ever seen for any other paper in social psychology (I guess one could call this “post-publication peer review”).

      As for the specific issues that I mentioned (powerful reviewers with vested interests), other scholars have addressed this issue and proposed some solutions, for example making the review process more public and transparent (e.g. Asendorpf et al., 2013; Giner-Sorolla, 2012; Nosek & Bar-Anan, 2012). It seems like we’re getting something like that in the current case, as I mentioned in my original comment, and I was just trying to say that I thought that aspect is encouraging and points to one possible way to proceed.

      With regards to the current paper, is/was peer-review appropriate? As I mention above, one could argue that there was peer-review in the current case, both before publication (on the methods and analysis plans) and afterwards (by many members of the community). If, however, one believes that peer-review occurs only when reviewers get an up/down vote on the theory, methods, and results, one would come to a different conclusion about whether this manuscript was peer-reviewed.

      For me the important question is: which type of reviewing system will lead to the most robust literature going forward? I think it’s worth experimenting with other models (including the one used for the Social Psychology special issue), because it’s becoming increasingly clear that our current model does not select the most stable or replicable findings for publication in the premier publication outlets.

  4. Thanks for all the comments. MM, I think the shift toward post-publication review is well underway; the main question is how long it will take to become the norm.

    Jeff Bowers, I’m certainly not arguing that peer review is unnecessary. My post is peer review, as are the other blog posts that have appeared in the last week or two. What I would argue is that pre-publication peer review is unnecessary, and can and should eventually be superseded by post-publication review. (I would also go a step further than that and say that conventional journals are probably not going to be around very much longer, as we now have the tools to filter and evaluate papers much more efficiently. You can read my take on that here if you care to.) The local point I was making was simply that I can’t imagine there being a difference in outcome if anyone other than Schnall had been invited to review the paper, so Schnall’s suggestion that it was an egregious injustice to publish the paper is not credible. From my perspective, however, I’m all for publishing everything and then letting post-publication review separate the good from the bad. I think MM is sketching out a similar kind of view, as have many, many others.

  5. For more discussion of statistical concerns arising in some of the replications in the special issue of Social Psychology (and the studies on which they are based), see the posts with titles starting “Beyond the Buzz …” at http://www.ma.utexas.edu/blogs/mks/. I have been trying to make them not too technical, and will be posting two or three more in the next week or so.
