The psychology of parapsychology, or why good researchers publishing good articles in good journals can still get it totally wrong

Unless you’ve been pleasantly napping under a rock for the last couple of months, there’s a good chance you’ve heard about a forthcoming article in the Journal of Personality and Social Psychology (JPSP) purporting to provide strong evidence for the existence of some ESP-like phenomenon. (If you’ve been napping, see here, here, here, here, here, or this comprehensive list). In the article–appropriately titled Feeling the Future–Daryl Bem reports the results of 9 (yes, 9!) separate experiments that catch ordinary college students doing things they’re not supposed to be able to do–things like detecting the on-screen location of erotic images that haven’t actually been presented yet, or being primed by stimuli that won’t be displayed until after a response has already been made.

As you might expect, Bem’s article’s causing quite a stir in the scientific community. The controversy isn’t over whether or not ESP exists, mind you; scientists haven’t lost their collective senses, and most of us still take it as self-evident that college students just can’t peer into the future and determine where as-yet-unrevealed porn is going to soon be hidden (as handy as that ability might be). The real question on many people’s minds is: what went wrong? If there’s obviously no such thing as ESP, how could a leading social psychologist publish an article containing a seemingly huge amount of evidence in favor of ESP in the leading social psychology journal, after being peer reviewed by four other psychologists? Or, to put it in more colloquial terms–what the fuck?

What the fuck?

Many critiques of Bem’s article have tried to dismiss it by searching for the smoking gun–the single critical methodological flaw that dooms the paper. For instance, one critique that’s been making the rounds, by Wagenmakers et al, argues that Bem should have done a Bayesian analysis, and that his failure to adjust his findings for the infinitesimally low prior probability of ESP (essentially, the strength of subjective belief against ESP) means that the evidence for ESP is vastly overestimated. I think these types of argument have a kernel of truth, but also suffer from some problems (for the record, I don’t really agree with the Wagenmakers critique, for reasons Andrew Gelman has articulated here). Having read the paper pretty closely twice, I really don’t think there’s any single overwhelming flaw in Bem’s paper (actually, in many ways, it’s a nice paper). Instead, there are a lot of little problems that collectively add up to produce a conclusion you just can’t really trust. Below is a decidedly non-exhaustive list of some of these problems. I’ll warn you now that, unless you care about methodological minutiae, you’ll probably find this very boring reading. But that’s kind of the point: attending to this stuff is so boring that we tend not to do it, with potentially serious consequences. Anyway:

  • Bem reports 9 different studies, which sounds (and is!) impressive. But a noteworthy feature of these studies is that they have grossly uneven sample sizes, ranging all the way from N = 50 to N = 200, in blocks of 50. As far as I can tell, no justification for these differences is provided anywhere in the article, which raises red flags, because the most common explanation for differing sample sizes–especially on this order of magnitude–is data peeking. That is, what often happens is that researchers periodically peek at their data, and halt data collection as soon as they obtain a statistically significant result. This may seem like a harmless little foible, but as I’ve discussed elsewhere, is actually a very bad thing, as it can substantially inflate Type I error rates (i.e., false positives). To his credit, Bem was at least being systematic about his data peeking, since his sample sizes always increase in increments of 50. But even in steps of 50, false positives can be grossly inflated. For instance, for a one-sample t-test, a researcher who peeks at her data in increments of 50 subjects and terminates data collection when a significant result is obtained (or N = 200, if no such result is obtained) can expect an actual Type I error rate of about 13%–nearly 3 times the nominal rate of 5%!
  • There’s some reason to think that the 9 experiments Bem reports weren’t necessarily designed as such. That is, they appear to have been ‘lumped’ or ‘split’ post hoc based on the results. For instance, Experiment 2 had 150 subjects, but the experimental design for the first 100 differed from the final 50 in several respects. They were minor respects, to be sure (e.g., pictures were presented randomly in one study, but in a fixed sequence in the other), but were still comparable in scope to those that differentiated Experiment 8 from Experiment 9 (which had the same sample size splits of 100 and 50, but were presented as two separate experiments). There’s no obvious reason why a researcher would plan to run 150 subjects up front, then decide to change the design after 100 subjects, and still call it the same study. A more plausible explanation is that Experiment 2 was actually supposed to be two separate experiments (a successful first experiment with N = 100 followed by an intended replication with N = 50) that were collapsed into one large study when the second experiment failed–preserving the statistically significant result in the full sample. Needless to say, this kind of lumping and splitting is liable to additionally inflate the false positive rate.
  • Most of Bem’s experiments allow for multiple plausible hypotheses, and it’s rarely clear why Bem would have chosen, up front, the hypotheses he presents in the paper. For instance, in Experiment 1, Bem finds that college students are able to predict the future location of erotic images that haven’t yet been presented (essentially a form of precognition), yet show no ability to predict the location of negative, positive, or romantic pictures. Bem’s explanation for this selective result is that “… such anticipation would be evolutionarily advantageous for reproduction and survival if the organism could act instrumentally to approach erotic stimuli …”. But this seems kind of silly on several levels. For one thing, it’s really hard to imagine that there’s an adaptive benefit to keeping an eye out for potential mates, but not for other potential positive signals (represented by non-erotic positive images). For another, it’s not like we’re talking about actual people or events here; we’re talking about digital images on an LCD. What Bem is effectively saying is that, somehow, someway, our ancestors evolved the extrasensory capacity to read digital bits from the future–but only pornographic ones. Not very compelling, and one could easily have come up with a similar explanation in the event that any of the other picture categories had selectively produced statistically significant results. Of course, if you get to test 4 or 5 different categories at p < .05, and pretend that you called it ahead of time, your false positive rate isn’t really 5%–it’s closer to 20%.
  • I say p < .05, but really, it’s more like p < .1, because the vast majority of tests Bem reports use one-tailed tests–effectively doubling the false positive rate. There’s a long-standing debate in the literature, going back at least 60 years, as to whether it’s ever appropriate to use one-tailed tests, but even proponents of one-tailed tests will concede that you should only use them if you really truly have a directional hypothesis in mind before you look at your data. That seems exceedingly unlikely in this case, at least for many of the hypotheses Bem reports testing.
  • Nearly all of Bem’s statistically significant p values are very close to the critical threshold of .05. That’s usually a marker of selection bias, particularly given the aforementioned unevenness of sample sizes. When experiments are conducted in a principled way (i.e., with minimal selection bias or peeking), researchers will often get very low p values, since it’s very difficult to know up front exactly how large effect sizes will be. But in Bem’s 9 experiments, he almost invariably collects just enough subjects to detect a statistically significant effect. There are really only two explanations for that: either Bem is (consciously or unconsciously) deciding what his hypotheses are based on which results attain significance (which is not good), or he’s actually a master of ESP himself, and is able to peer into the future and identify the critical sample size he’ll need in each experiment (which is great, but unlikely).
  • Some of the correlational effects Bem reports–e.g., that people with high stimulus seeking scores are better at ESP–appear to be based on measures constructed post hoc. For instance, Bem uses a non-standard, two-item measure of boredom susceptibility, with no real justification provided for this unusual item selection, and no reporting of results for the presumably many other items and questionnaires that were administered alongside these items (except to parenthetically note that some measures produced non-significant results and hence weren’t reported). Again, the ability to select from among different questionnaires–and to construct custom questionnaires from different combinations of items–can easily inflate Type I error.
  • It’s not entirely clear how many studies Bem ran. In the Discussion section, he notes that he could “identify three sets of findings omitted from this report so far that should be mentioned lest they continue to languish in the file drawer”, but it’s not clear from the description that follows exactly how many studies these “three sets of findings” comprised (or how many ‘pilot’ experiments were involved). What we’d really like to know is the exact number of (a) experiments and (b) subjects Bem ran, without qualification, and including all putative pilot sessions.
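To see how much damage peeking in increments of 50 can actually do, here’s a quick simulation (my own sketch, not anything from Bem’s paper) of a researcher who runs a one-sample t-test after every 50 subjects and stops at the first p < .05, up to a maximum of N = 200:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_experiment(looks=(50, 100, 150, 200), alpha=0.05):
    """Simulate one null experiment with optional stopping.

    The true effect is exactly zero; we 'peek' after each increment
    of 50 subjects and stop as soon as p < alpha.
    """
    data = rng.standard_normal(max(looks))
    for n in looks:
        _, p = stats.ttest_1samp(data[:n], 0.0)
        if p < alpha:
            return True  # declare a (false) positive and stop early
    return False

n_sims = 20_000
rate = sum(peeking_experiment() for _ in range(n_sims)) / n_sims
print(f"Type I error with peeking: {rate:.3f}")  # lands around 0.13, not 0.05
```

Each individual test is run at the nominal 5% level, but because the researcher gets four chances to cross the threshold (on overlapping data, no less), the overall false positive rate comes out around 13%, consistent with the figure quoted above.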
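The inflation from flexible hypotheses is just arithmetic. Assuming, purely for illustration, that the picture-category tests are independent, the chance of at least one of k tests crossing p < .05 under the null is 1 − (1 − .05)^k:

```python
def familywise_rate(k, alpha=0.05):
    # Probability of at least one false positive across k independent
    # tests, each conducted at the nominal alpha level.
    return 1 - (1 - alpha) ** k

for k in (1, 2, 4, 5):
    print(f"{k} tests: {familywise_rate(k):.3f}")
# 4 tests -> 0.185, 5 tests -> 0.226: the "closer to 20%" figure
```

And a one-tailed test at p < .05 rejects in the same region as a two-tailed test at p < .10, which is the sense in which switching to one-tailed tests without a genuinely a priori directional hypothesis doubles the effective false positive rate.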

It’s important to note that none of these concerns is really terrible individually. Sure, it’s bad to peek at your data, but data peeking alone probably isn’t going to produce 9 different false positives. Nor is using one-tailed tests, or constructing measures on the fly, etc. But when you combine data peeking, liberal thresholds, study recombination, flexible hypotheses, and selective measures, you have a perfect recipe for spurious results. And the fact that there are 9 different studies isn’t any guard against false positives when fudging is at work; if anything, it may make it easier to produce a seemingly consistent story, because reviewers and readers have a natural tendency to relax the standards for each individual experiment. So when Bem argues that “…across all nine experiments, Stouffer’s z = 6.66, p = 1.34 × 10⁻¹¹,” the claim that the cumulative p value is 1.34 × 10⁻¹¹ is close to meaningless. Combining p values that way would only be appropriate under the assumption that Bem conducted exactly 9 tests, and without any influence of selection bias. But that’s clearly not the case here.
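For the curious, Stouffer’s method just sums the standard-normal equivalents of the individual results and rescales: Z = Σzᵢ/√k. A short simulation (mine, and deliberately cartoonish) shows why the combined p is meaningless once selection enters: nine ‘experiments’ cherry-picked from a stream of pure noise produce an astronomically significant combined z.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def stouffer(zs):
    # Stouffer's method: sum the z-scores and divide by sqrt(k).
    zs = np.asarray(zs, dtype=float)
    return zs.sum() / np.sqrt(len(zs))

# Keep generating null results until nine of them individually clear
# the one-tailed .05 threshold (z > 1.645), discarding the rest,
# as a caricature of the file drawer plus flexible analysis.
selected = []
while len(selected) < 9:
    z = rng.standard_normal()
    if z > 1.645:
        selected.append(z)

combined = stouffer(selected)
print(f"combined z = {combined:.2f}, p = {stats.norm.sf(combined):.2e}")
# By construction the combined z exceeds 4.9 (= 9 * 1.645 / 3), even
# though every underlying "effect" is exactly zero.
```

Nothing about this sketch is specific to ESP; it’s just what happens when you combine only the results that survived a significance filter.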

What would it take to make the results more convincing?

Admittedly, there are quite a few assumptions involved in the above analysis. I don’t know for a fact that Bem was peeking at his data; that just seems like a reasonable assumption given that no justification was provided anywhere for the use of uneven samples. It’s conceivable that Bem had perfectly good, totally principled, reasons for conducting the experiments exactly as he did. But if that’s the case, defusing these criticisms should be simple enough. All it would take for Bem to make me (and presumably many other people) feel much more comfortable with the results is an affirmation of the following statements:

  • That the sample sizes of the different experiments were determined a priori, and not based on data snooping;
  • That the distinction between pilot studies and ‘real’ studies was clearly defined up front–i.e., there weren’t any studies that started out as pilots but eventually ended up in the paper, or studies that were supposed to end up in the paper but that were disqualified as pilots based on the (lack of) results;
  • That there was a clear one-to-one mapping between intended studies and reported studies; i.e., Bem didn’t ‘lump’ together two different studies in cases where one produced no effect, or split one study into two in cases where different subsets of the data both showed an effect;
  • That the predictions reported in the paper were truly made a priori, and not on the basis of the results (e.g., that the hypothesis that sexually arousing stimuli would be the only ones to show an effect was actually written down in one of Bem’s notebooks somewhere);
  • That the various transformations applied to the RT and memory performance measures in some experiments weren’t selected only after inspecting the raw, untransformed values and failing to identify significant results;
  • That the individual differences measures reported in the paper were selected a priori and not based on post-hoc inspection of the full pattern of correlations across studies;
  • That Bem didn’t run dozens of other statistical tests that produced non-significant results and hence weren’t reported in the paper.

Endorsing this list of statements (or perhaps a somewhat more complete version, as there are other concerns I didn’t mention here) would be sufficient to cast Bem’s results in an entirely new light, and I’d go so far as to say that I’d even be willing to suspend judgment on his conclusions pending additional data (which would be a big deal for me, since I don’t have a shred of a belief in ESP). But I confess that I’m not holding my breath, if only because I imagine that Bem would have already addressed these concerns in his paper if there were indeed principled justifications for the design choices in question.

It isn’t a bad paper

If you’ve read this far (why??), this might seem like a pretty damning review, and you might be thinking, boy, this is really a terrible paper. But I don’t think that’s true at all. In many ways, I think Bem’s actually been relatively careful. The thing to remember is that this type of fudging isn’t unusual; to the contrary, it’s rampant–everyone does it. And that’s because it’s very difficult, and often outright impossible, to avoid. The reality is that scientists are human, and like all humans, have a deep-seated tendency to work to confirm what they already believe. In Bem’s case, there are all sorts of reasons why someone who’s been working for the better part of a decade to demonstrate the existence of psychic phenomena isn’t necessarily the most objective judge of the relevant evidence. I don’t say that to impugn Bem’s motives in any way; I think the same is true of virtually all scientists–including myself. I’m pretty sure that if someone went over my own work with a fine-toothed comb, as I’ve gone over Bem’s above, they’d identify similar problems. Put differently, I don’t doubt that, despite my best efforts, I’ve reported some findings that aren’t true, because I wasn’t as careful as a completely disinterested observer would have been. That’s not to condone fudging, of course, but simply to recognize that it’s an inevitable reality in science, and it isn’t fair to hold Bem to a higher standard than we’d hold anyone else.

If you set aside the controversial nature of Bem’s research, and evaluate the quality of his paper purely on methodological grounds, I don’t think it’s any worse than the average paper published in JPSP, and actually probably better. For all of the concerns I raised above, there are many things Bem is careful to do that many other researchers don’t. For instance, he clearly makes at least a partial effort to avoid data peeking by collecting samples in increments of 50 subjects (I suspect he simply underestimated the degree to which Type I error rates can be inflated by peeking, even with steps that large); he corrects for multiple comparisons in many places (though not in some places where it matters); and he devotes an entire section of the discussion to considering the possibility that he might be inadvertently capitalizing on chance by falling prey to certain biases. Most studies–including most of those published in JPSP, the premier social psychology journal–don’t do any of these things, even though the underlying problems are just as applicable. So while you can confidently conclude that Bem’s article is wrong, I don’t think it’s fair to say that it’s a bad article–at least, not by the standards that currently hold in much of psychology.

Should the study have been published?

Interestingly, much of the scientific debate surrounding Bem’s article has actually had very little to do with the veracity of the reported findings, because the vast majority of scientists take it for granted that ESP is bunk. Much of the debate centers instead on whether the article should have ever been published in a journal as prestigious as JPSP (or any other peer-reviewed journal, for that matter). For the most part, I think the answer is yes. I don’t think it’s the place of editors and reviewers to reject a paper based solely on the desirability of its conclusions; if we take the scientific method–and the process of peer review–seriously, that commits us to occasionally (or even frequently) publishing work that we believe time will eventually prove wrong. The metrics I think reviewers should (and do) use are whether (a) the paper is as good as most of the papers that get published in the journal in question, and (b) the methods used live up to the standards of the field. I think that’s true in this case, so I don’t fault the editorial decision. Of course, it sucks to see something published that’s virtually certain to be false… but that’s the price we pay for doing science. As long as its proponents play by the rules, we have to engage with even patently ridiculous views, because sometimes (though very rarely) it later turns out that those views weren’t so ridiculous after all.

That said, believing that it’s appropriate to publish Bem’s article given current publishing standards doesn’t preclude us from questioning those standards themselves. On a pretty basic level, the idea that Bem’s article might be par for the course, quality-wise, yet still be completely and utterly wrong, should surely raise some uncomfortable questions about whether psychology journals are getting the balance between scientific novelty and methodological rigor right. I think that’s a complicated issue, and I’m not going to try to tackle it here, though I will say that personally I do think that more stringent standards would be a good thing for psychology, on the whole. (It’s worth pointing out that the problem of (arguably) lax standards is hardly unique to psychology; as John Ioannidis has famously pointed out, most published findings in the biomedical sciences are false.)

Conclusion

The controversy surrounding the Bem paper is fascinating for many reasons, but it’s arguably most instructive in underscoring the central tension in scientific publishing between rapid discovery and innovation on the one hand, and methodological rigor and cautiousness on the other. Both values matter, but it’s important to recognize the tradeoff that pursuing either one implies. Many of the people who are now complaining that JPSP should never have published Bem’s article seem to overlook the fact that they’ve probably benefited themselves from the prevalence of the same relaxed standards (note that by ‘relaxed’ I don’t mean to suggest that journals like JPSP are non-selective about what they publish, just that methodological rigor is only one among many selection criteria–and often not the most important one). Conversely, maintaining editorial standards that would have precluded Bem’s article from being published would almost certainly also make it much more difficult to publish most other, much less controversial, findings. A world in which fewer spurious results are published is a world in which fewer studies are published, period. You can reasonably debate whether that would be a good or bad thing, but you can’t have it both ways. It’s wishful thinking to imagine that reviewers could somehow grow a magic truth-o-meter that applies lax standards to veridical findings and stringent ones to false positives.

From a bird’s eye view, there’s something undeniably strange about the idea that a well-respected, relatively careful researcher could publish an above-average article in a top psychology journal, yet have virtually everyone instantly recognize that the reported findings are totally, irredeemably false. You could read that as a sign that something’s gone horribly wrong somewhere in the machine; that the reviewers and editors of academic journals have fallen down and can’t get up, or that there’s something deeply flawed about the way scientists–or at least psychologists–practice their trade. But I think that’s wrong. I think we can look at it much more optimistically. We can see it as a testament to the success and self-corrective nature of the scientific enterprise that we allow articles that virtually nobody agrees with to get published. And that’s because, as scientists, we take seriously the possibility, however vanishingly small, that we might be wrong about even our strongest beliefs. Most of us don’t really believe that Cornell undergraduates have a sixth sense for future porn… but if they did, wouldn’t you want to know about it?

Bem, D. J. (2011). Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. Journal of Personality and Social Psychology.

33 thoughts on “The psychology of parapsychology, or why good researchers publishing good articles in good journals can still get it totally wrong”

  1. Excellent post! I agree with you that it is a paper of impressive scope, especially given the current state of high-output of minimum-publishable-units. If anything, it should force us to seriously consider the statistical rigor necessary for publication and the general lack of strong statistical knowledge in the field.

    I overheard the following recently between colleagues “Yes, I know I’m not correcting for multiple comparisons, but the reviewers didn’t care on the last paper!”

    This may be the shining example of what happens when we don’t take small statistical sins seriously.

  2. Excellent post.

    I’m not sure that if I were Bem, I’d have gone about these studies in this way. Supposing he does 1 or 2 studies and finds the “ESP” effect. He’s intrigued and wants to do more studies.

    He decided to do all the studies himself. However this leaves him open to various charges, such as yours, because he is clearly trying to establish his hypothesis.

    Perhaps a better way would have been to write to other people who have used these paradigms – there are loads of them, because they’re common paradigms – and, under the pretext of some meta-analysis or whatever unrelated to ESP, get them to send you the raw data, and see if the ESP effect is there. After making sure it’s a good pretext, so that if it turns out there’s no ESP, you still have something to publish.

    Ethically a bit iffy, but not very, and it would make life a lot harder for the critics if you were using their own data…

  3. It seems likely that the major impact of Bem’s paper will be methodological — i.e. to show standard methods have pitfalls in actual use.

    If that’s what’s being demonstrated, then the paper would be a methodological artifact paper, and pretty much condemned to a side journal (particularly since the artifacts are a result of choices by the principal investigator).

    If you just showed these artifacts by a monte carlo study, you’d be more likely to make a blog post than a publication in a good journal.

    The cynic in me suggests that JPSP figured they will get a lot of citations out of the article, improving their impact rating.

  4. Thanks all for the comments!

    Michelle, that’s an unfortunate anecdote… I think people sometimes view reviewers as an externalized conscience–i.e., it’s okay for me as an author to tell the strongest story I can come up with, because it’s your job as reviewer to crap on my paper… and then the ‘truth’ is an emergent property of our interaction.

    Neuroskeptic, actually, that is kind of how Bem got involved in ESP research. He published a meta-analysis in Psych Bulletin nearly 20 years ago claiming to show that there was good evidence for Psi, even after worrying about the file drawer problem. Interestingly, he claims to have started the meta-analysis with the expectation that he would debunk Psi, and changed his mind after seeing the evidence.

    I suspect that the reason he decided to run his own studies is precisely because nobody took the meta-analysis seriously, so maybe he felt people would react more positively to a series of empirical studies. Of course, I doubt most people would take any purported evidence for ESP seriously (at least not without examining that evidence very, very carefully, at which point the case usually falls apart).

    zbicyclist, I very much doubt the editor who accepted the article is motivated by citations. For one thing, it was probably clear to the editor that there would be a large contingent of readers who view the article’s publication as something of a joke, so I suspect the pressure not to publish was much stronger than the pressure to publish. But I think we can take JPSP’s stance at face value: they published the article because they followed their standard review process and the reviewers recommended publication. I think that’s perfectly reasonable. But your point about the article highlighting the problems of standard procedures stands. Whether or not anything will actually change in practice, I don’t know. It’s hard to see how that would happen, as it would require most journals to scale back quite dramatically on the number of articles published, and would require most reviewers to think deeply about methodological issues they are not usually concerned with. So I’m not very optimistic…

  5. Great job discussing this- several of the issues you brought up had bothered me, but none seem dramatic enough to explain the paper. I agree the aggregate effect may be much greater than our intuitions say should be the case.

    I have a very dear friend who is involved in Psi research. She has never been what I consider rigorous in her understanding of reality, but she is incredibly bright. Seeing this article has helped me understand how she can believe it.

    Personally, I think it not only possible but highly likely that the ‘file drawer’ effect is larger than we tend to give it credit for. Bem correctly noted that there would have to be something like 46 failed studies for each successfully published one for this to explain the entire phenomenon… he states this like it would be difficult to believe. On the contrary, 46 experiments that don’t work to one that does sounds about par for the course (based on either what my friend has described with her work in psychology or even what I’ve done myself in molecular biology – it’s easy to pretend the natural sciences are superior but hard to reconcile that with ‘well, the effect seems small, but if I use statistics…’). Science is a good field for skeptics, obviously, but at some point, science can’t progress unless we accept conclusions enough to do more studies, but finding the balance can indeed be a challenge. If this paper did nothing other than prompt honest discussions over that issue, it will have done a tremendous service to the scientific community.

  6. This is the best discussion of the Bem paper I’ve yet seen – and I’ve seen a lot of them, given I’m attempting to run a replication of one of his reported experiments at the University of Edinburgh.

    I’ve almost finished collecting data for an exact replication of Experiment 9, the ‘Retroactive Facilitation of Recall’ one, chosen because it had the largest effect size of all (d = .4). I figure that, with all those statistical issues potentially inflating the results, if we (myself, Richard Wiseman, and Chris French, who are all working on the project) still find null results, it’ll be particularly devastating. If we find positive results, then we’ll have a nice dataset to pick apart and look for faults in the computer program itself – Bem says he’ll give out his raw data, but none is forthcoming yet.

    Will let you know how we get on, once everything is written up and published.

  7. “Interestingly, most of the scientific debate surrounding Bem’s article has actually had very little to do with the veracity of the reported findings, because the vast majority of scientists take it for granted that ESP is bunk.”

    If you were to look into actual numbers (the few that are available), you’d probably want to strike this sentence – it’s flat out not true. A helpful critique otherwise.

  8. Hey Tal — This was the best critique on the Bem article I’ve read yet. I’ve actually been surprised by how misguided some of the other critiques have been.

    I like the idea that a lot of small problems contributed to the finding, but even so I feel like there are probably some more problems that we have not yet discovered. But when we do it will be informative, which is yet another reason it was good and useful to publish this (almost certainly false) paper.

  9. I enjoyed your analysis, but I would like to add that current developments in modern physics and the nascent field of quantum biology render these phenomena not so outlandish. There is evidence for quantum entanglement in processes such as photosynthesis or DNA control. Is it such a heretical idea to postulate entanglement between humans (telepathy) or with future events (precognition)?
    Not many readers here probably know that backward time referral effects (aka, precognition) have been found by a growing band of mainstream researchers. Please check here:

    http://www.consciousness.arizona.edu/TSC2011PlenaryTimePrecog.htm

    It should also be mentioned that accumulating evidence for a whole range of parapsychological effects has been found over the last 70 – 80 years. See Entangled Minds by Dean Radin for a good overview.
    By all means, let’s be skeptical of claims of psi phenomena, but let’s also give serious consideration to the psi hypothesis.

  10. Personally, I think it not only possible but highly likely that the ‘file drawer’ effect is larger than we tend to give it credit for. Bem correctly noted that there would have to be something like 46 failed studies for each successfully published one for this to explain the entire phenomena… he states this like it would be difficult to believe.

    becca, I agree with you that 46 failed studies for each successful one isn’t inconceivable for ESP (though you really shouldn’t need that many, since if it takes more than 20, you’re already below chance!), but I don’t think we need to grant Bem that number in the first place. It only makes sense if you think that there was no fudging going on in those studies that did report a statistically significant result. But that seems very unlikely. In view of many of the kinds of issues I raised above, it seems almost certain that many of the pro-ESP findings in Bem’s meta-analysis were capitalizing on chance in various ways, and were grossly overstating their case.

    I suspect that there aren’t really that many unpublished ESP replication failures out there… but that’s not because ESP is real–it’s because the people who run ESP studies are generally motivated to play with the numbers until something is publishable. I’m not saying there’s deliberate fraud (except in a very small minority of cases), just that, as Feynman put it, you are the easiest person to fool.
    ———-
    Stuart, nice to hear you (and others) are trying to replicate Bem’s results! I hope you’re collecting a large sample though, because, in fairness to Bem, if you pick the largest effect size of the 9 studies, you’re probably capitalizing on some element of chance yourself… the true effect size is likely to be somewhat smaller.

    In addition to running empirical replications, what I think would be very useful, for exegetical purposes, is to take a dataset with (seemingly) null results–or for that matter, even randomly generated data–and show how, through the magic of selection, one can obtain statistically significant results out of nothing. I.e., take a dataset like Bem’s, and then show that if you allow yourself to test a dozen different hypotheses, keep sampling till significance, use one-tailed tests, apply transformations to your measures, and so on, it’s not hard to turn nothing into something. I think that (in addition to straightforward replication failures) would be a very valuable service.
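    For anyone tempted to try this, here’s a minimal sketch of just one such selection strategy, optional stopping, applied to nothing but simulated coin flips (the 5% threshold, peeking interval, and maximum N are all arbitrary choices of mine):

    ```python
    import math
    import random

    random.seed(42)

    def p_one_tailed(hits, n):
        """One-tailed normal-approximation p-value for a hit rate above 50%."""
        z = (hits - n / 2) / math.sqrt(n / 4)
        return 0.5 * math.erfc(z / math.sqrt(2))

    def optional_stopping(max_n=1000, peek_every=10):
        """Run one null 'ESP' experiment, peeking at the data every 10 subjects
        and stopping as soon as p < .05. Returns True if 'significance' is found."""
        hits = 0
        for n in range(1, max_n + 1):
            hits += random.random() < 0.5  # true hit rate: exactly chance
            if n % peek_every == 0 and p_one_tailed(hits, n) < 0.05:
                return True
        return False

    experiments = 2000
    sig = sum(optional_stopping() for _ in range(experiments))
    print(sig / experiments)  # several times the nominal 5% false-positive rate
    ```

    And that’s just one of the degrees of freedom on the list; stack a few of them and turning nothing into something gets easier still.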
    ———-
    “Interestingly, most of the scientific debate surrounding Bem’s article has actually had very little to do with the veracity of the reported findings, because the vast majority of scientists take it for granted that ESP is bunk.”

    If you were to look into actual numbers (the few that are available), you’d probably want to strike this sentence – it’s flat out not true. A helpful critique otherwise.

    Alex, do you mean that the debate is about the veracity of the findings, or that most scientists don’t think ESP is bunk? I take the first point, and have changed it to read ‘much’ instead of ‘most’ (though nothing really hangs on it). On the second point, I’m willing to believe you if you point me to the numbers…
    ———-
    I like the idea that a lot of small problems contributed to the finding, but even so I feel like there are probably some more problems that we have not yet discovered. But when we do it will be informative, which is yet another reason it was good and useful to publish this (almost certainly false) paper.

    Chris, yeah, I’m sure there are other problems with the various studies (I flagged some when I was taking notes, but the post was long enough already). I think you’re right that this paper provides an unusual opportunity to discuss important issues–a lot of papers suffer from similar weaknesses, but people sometimes get upset when you raise them, as though methodological problems are only problems when they support conclusions we don’t like!
    ———-
    There is a lot to recommend to your view on the Bem article, but have you considered the alternative epistemic approach towards claims of E.S.P. outlined in this informative video: http://www.youtube.com/watch?v=Q5nNemCZVds

    I have no words. No, wait, I have some: I enjoyed that; thanks!
    ———-
    By all means, let’s be skeptical of claims of psi phenomena, but let’s also give serious consideration to the psi hypothesis.

    Michael, I don’t think the skepticism is so much about the idea that in principle there might be Psi-like phenomena we can’t easily explain; I think it’s more directed at the idea (which, to me personally, seems ludicrous) that of all the ways ESP could possibly manifest (think of all the amazing potential uses!), the only way we can detect it is in very tiny perturbations in things like people’s abilities to detect the spatial location of very specific kinds of pictures. If you think you can demonstrate ESP reliably under controlled conditions, the James Randi Foundation has a check for $1,000,000 with your name on it. $1,000,000 will pay for a lot of participants! Don’t you think anyone who could really detect the location of an erotic image on screen 53% of the time would have picked up that check by now if they could do it reliably?

  11. In Bem’s Experiment 9, he has 50 participants – we’ll (hopefully) have 150 by the end of our testing.

    You’re right that getting a dataset and seeing how it could be manipulated would be a nice idea. Though wouldn’t it be even nicer to get raw data, and see just what’s been done with it?

    Once you actually get into replicating the experiment, by the way, you find there are a host of other little problems with the setup (i.e., ones that you can’t glean from the paper). I’ll be writing an article on my blog about these, once we’ve finished the experiment.

  12. Regarding the million dollar challenge, this is a purely rhetorical device for skeptical propaganda purposes and has nothing to do with science. Randi himself has more or less admitted this. In order to get a successful experiment with precognition (at 3% effect size) with odds of a million to one (Randi’s requirement for passing), you would need to run the experiment for over two years! However, Dick Bierman from the University of Amsterdam has proposed this to Randi, but alas, Randi hasn’t replied. The Million Dollar Challenge is not science.
    Regarding your comment about the weakness of the effect size, and trivialisation of the phenomenon, yes, parapsychological effects are generally weak (though not all, as in the telepathy Ganzfeld paradigm). This is because they are hyper-susceptible to a multitude of psychological and situational variables. Given parapsychology’s chronic underfunding (it has been calculated that ALL the funding that has ever been received by parapsychologists would fund academic psychology for ONE month), it is surprising how much has been learnt so far.
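    For what it’s worth, here is a rough back-of-envelope sketch of the raw trial count that million-to-one odds would demand, assuming a true hit rate of 53% against 50% chance and the usual normal approximation (all the figures are illustrative, not from any actual protocol):

    ```python
    import math
    from statistics import NormalDist

    def trials_needed(p_true=0.53, p_null=0.50, alpha=1e-6, power=0.90):
        """Normal-approximation sample size for a one-tailed binomial test.
        alpha = 1e-6 encodes the 'million to one' evidence requirement."""
        nd = NormalDist()
        z_alpha = nd.inv_cdf(1 - alpha)   # ~4.75 for p = 1e-6
        z_power = nd.inv_cdf(power)       # ~1.28 for 90% power
        se_null = math.sqrt(p_null * (1 - p_null))
        se_true = math.sqrt(p_true * (1 - p_true))
        n = ((z_alpha * se_null + z_power * se_true) / (p_true - p_null)) ** 2
        return math.ceil(n)

    print(trials_needed())  # ~10,000 trials
    ```

    Whether roughly 10,000 trials translates into two years depends entirely on the session structure of the protocol, so treat the number as a lower bound on effort rather than a schedule.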

  13. And this is to Stuart: it appears you are intent on debunking the phenomenon. This is unfortunate, as the skeptical attitude of the experimenters has a HUGE impact on the performance of subjects. Proposal: let’s approach this subject with an OPEN MIND.

  14. Excellent article and review of what can go wrong.

    I particularly liked the admission that certain biases or small-time fudging are not just weaknesses of a small percentage of unethical scientists. Lots of writers try to weasel out, attributing almost all of the “Ioannidis effect” to reporting biases, sample size effects, or other problems that don’t require us to do much soul-searching. “We wouldn’t want to admit there’s any real problem in public” seems to be the pattern. I read mostly about big data (think arrays) in biomedical fields and am ready to declare a crisis, with most papers having some flavor of over-cooking, circular demonstrations, or biased demonstrations. That holds even in the very best journals. Review used to be better, I think. Perhaps “everyone is doing it” was less true too. It really burns when you see your competition getting away with sly deceptions.

  15. Michael –

    I’m not ‘intent’ on anything, except testing what I find to be a very surprising claim. But if I was so dismissive of it, do you think I’d waste my time with all the data collection for it? This is, after all, nothing at all to do with my PhD, which I should really be working on!

    If you want to explain away any future negative results with ‘the experimenter’s skeptical attitude meant his psi interfered with the psi of the participants’ then fair enough, but it seems rather like special pleading to me…

  16. This is the best critique so far. None of the issues raised is fundamentally new, but the presentation is far better than in any previous one.

    Bem does go some way toward giving the requested affirmations in a reply to Alcock at http://www.csicop.org/specialarticles/show/response_to_alcocks_back_from_the_future_comments_on_bem

    Bem has presented some of those experiments previously at conferences, one of which he cites in his paper. Looking at the one he cites, it becomes an inescapable conclusion that he has indeed lumped together different experiments.

    Objectively this is a terrible paper. The cumulative effect of the misused statistics renders the results meaningless. He might as well have fabricated the data and achieved the same end-result.
    If this is a good paper by the standards of a field then that only means that the field as a whole is terrible. It doesn’t make the paper any better.
    I don’t see any grounds for optimism either. Of course, it’s great to know that psychologists still are open to the possibility that they have been quite massively wrong for a long time but what are they going to do with that openness when they have lost the tools to tell the wheat from the chaff?

  17. Stuart said: ‘the experimenter’s skeptical attitude meant his psi interfered with the psi of the participants’ – it doesn’t take experimenter psi to interfere with the subjects’ performance. Simple experimenter attitudes and other subtle cues are picked up by the subjects, and this affects their performance. Skeptic Wiseman ran a study with Marilyn Schlitz investigating distant intention. He got nothing, she found significant evidence. Exactly the same set-up. Attitudes make a difference, and no special pleading to experimenter psi is required.
    I can empathise with you on having to work away from your doctoral project. I was always being hassled during my Ph.D.!
    Good luck with your future endeavors.

  18. nice post. a friend of mine directed me toward this, and i like your balanced, thoughtful critique. keep up the good work!

  19. Stuart said,

    “If we find positive results, then we’ll have a nice dataset to pick apart and look for faults in the computer program itself”

    I find your comment here a bit worrying. Are you implying that if you get positive results, then you are going to see whether methodological problems could have been responsible?

    Wouldn’t it have been better to sort out these problems before you started the experiment?

    It seems like you are giving yourself a get-out clause just in case you find positive results. What psi research needs are replications that can make definitive conclusions rather than this constant bickering over whether the results are legitimate *after* the data has been collected. In other words, design an experiment that everyone’s happy with, do the experiment, and then stand by the results whatever they are!

  20. Well, no, first we’re using Bem’s exact setup and seeing if we find the same result. If we do, then it’s simply good practice to rule out any other explanations before attempting to publish. For instance, when we look at our completed dataset, do we find that some particular words are remembered 100% of the time, unbalancing the randomisation? This is stuff you can’t really do unless you have a dataset, and as I stated above, Bem hasn’t released his raw data yet.

    You guys need to lay off the ‘get-out clause’ stuff. I’ve taken time out to replicate this study – I’m not just dismissing it out of hand like some skeptics – and I’m perfectly willing to follow the data wherever it leads.

  21. Thank you for this reasonable, thorough, and playful investigation. I have read the Bem paper carefully in my lab group and we discussed it for hours. We agree that yours is the most convincing critique we have seen.

  22. Thanks for the pointer to the paper, Bill, hadn’t seen it before. I agree 100% with that; that was the gist of my response to Becca’s comment above: there don’t have to be 46 unpublished studies for every published one, because there was almost certainly strong selection bias in the first place.

  23. Bem’s studies were reversals of previously done psychology experiments. The results of these original experiments are well known, and therefore their reversals have a directional hypothesis. It then follows that using a one-tailed test is suitable.

    Finally, take the combined odds of the 9 studies and you’d realize that blaming it on chance effects is pretty ridiculous. You are left with ad hominem arguments that are completely unwarranted when you’re talking about Daryl Bem.

  24. There are some neat simulations showing the aggregate effect of combining even a few of these effects, even to produce bogus retrocausal effects. I discuss one paper showing such, and the failed replications of Bem’s article in this post.

  25. Hi,
    I made a website dedicated to telepathy.
    The site lets two users test telepathic connections online. The database accumulates information about the number of sessions, the number of attempts to send telepathic messages, and the number of successful connections.
    Please take a look:
    http://yyoouurrii.6te.net/tp2
    Yury
