what exactly is it that 53% of neuroscience articles fail to do?

[UPDATE: Jake Westfall points out in the comments that the paper discussed here appears to have made a pretty fundamental mistake that I then carried over to my post. I’ve updated the post accordingly.]

[UPDATE 2: the lead author has now responded and answered my initial question and some follow-up concerns.]

A new paper in Nature Neuroscience by Emmeke Aarts and colleagues argues that neuroscientists should start using hierarchical (or multilevel) models in their work in order to account for the nested structure of their data. From the abstract:

In neuroscience, experimental designs in which multiple observations are collected from a single research object (for example, multiple neurons from one animal) are common: 53% of 314 reviewed papers from five renowned journals included this type of data. These so-called ‘nested designs’ yield data that cannot be considered to be independent, and so violate the independency assumption of conventional statistical methods such as the t test. Ignoring this dependency results in a probability of incorrectly concluding that an effect is statistically significant that is far higher (up to 80%) than the nominal α level (usually set at 5%). We discuss the factors affecting the type I error rate and the statistical power in nested data, methods that accommodate dependency between observations and ways to determine the optimal study design when data are nested. Notably, optimization of experimental designs nearly always concerns collection of more truly independent observations, rather than more observations from one research object.

I don’t have any objection to the advocacy for hierarchical models; that much seems perfectly reasonable. If you have nested data, where each subject (or petri dish or animal or whatever) provides multiple samples, it’s sensible to try to account for as many systematic sources of variance as you can. That point may have been made many times before, but it never hurts to make it again.

What I do find surprising, though–and, frankly, have a hard time believing–is the idea that 53% of neuroscience articles are at serious risk of Type I error inflation because they fail to account for nesting. This seems to me to be what the abstract implies, yet it’s a much stronger claim that doesn’t actually follow just from the observation that virtually no studies that have reported nested data have used hierarchical models for analysis. What it also requires is for all of those studies that use “conventional” (i.e., non-hierarchical) analyses to have actively ignored the nesting structure and treated repeated measurements as if they in fact came from entirely different subjects or clusters.

To make this concrete, suppose we have a dataset made up of 400 observations, consisting of 20 subjects who each provided 10 trials in 2 different experimental conditions (i.e., 20 x 2 x 10 = 400). And suppose the thing we ultimately want to know is whether or not there’s a statistical difference in outcome between the two conditions. There are at least three ways we could set up our comparison (a simulation sketch contrasting them follows the list):

  1. Ignore the grouping variable (i.e., subject) entirely, effectively giving us 200 observations in each condition, and conduct the test as if those 200 observations per condition were independent.
  2. Average the 10 trials in each condition within each subject first, then conduct the test on the subject means. In this case, we effectively have 20 observations in each condition (1 per subject).
  3. Explicitly include the effects of both subject and trial in our model. In this case we have 400 observations, but we’re explicitly accounting for the correlation between trials within a given subject, so that the statistical comparison of conditions effectively has somewhere between 20 and 400 “observations” (or degrees of freedom).
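
Here is a minimal simulation sketch of the toy example above (my own illustration in Python with scipy and statsmodels, not code from Aarts et al; the effect size, variance components, and random seed are arbitrary):

```python
# Illustrative sketch only: 20 subjects x 2 conditions x 10 trials, with a made-up
# condition effect of 0.3 and a between-subject (intercept) SD of 1.0.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj, n_trials, effect = 20, 10, 0.3

rows = []
for s in range(n_subj):
    subj_intercept = rng.normal(0, 1.0)  # between-subject variability
    for cond in (0, 1):
        for trial_y in subj_intercept + effect * cond + rng.normal(0, 1.0, n_trials):
            rows.append({"subject": s, "condition": cond, "y": trial_y})
df = pd.DataFrame(rows)

# Approach 1: ignore subjects and treat all 200 trials per condition as independent.
p1 = stats.ttest_ind(df.y[df.condition == 1], df.y[df.condition == 0]).pvalue

# Approach 2: average the trials within each subject and condition, then test the
# 20 subject means per condition (paired, since every subject is in both conditions).
means = df.groupby(["subject", "condition"]).y.mean().unstack()
p2 = stats.ttest_rel(means[1], means[0]).pvalue

# Approach 3: mixed model with a random intercept per subject, fit to all 400 trials.
p3 = smf.mixedlm("y ~ condition", df, groups=df["subject"]).fit().pvalues["condition"]

print(p1, p2, p3)
```

With settings like these, approaches (2) and (3) typically give very similar answers, whereas the answer from approach (1) can deviate in either direction depending on how large the between-subject variance is relative to the within-subject variance.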

Now, none of these approaches is strictly “wrong”, in that there could be specific situations in which any one of them would be called for. But as a general rule, the first approach is almost never appropriate. The reason is that we typically want to draw conclusions that generalize across the cases in the higher level of the hierarchy, and don’t have any intrinsic interest in the individual trials themselves. In the above example, we’re asking whether people, on average, behave differently in the two conditions. If we treat our data as if we had 200 subjects in each condition, effectively concatenating trials across all subjects, we’re ignoring the fact that the responses acquired from each subject will tend to be correlated (i.e., Jane Doe’s behavior on Trial 2 will tend to be more similar to her own behavior on Trial 1 than to another subject’s behavior on Trial 1). So we’re pretending that we know something about 200 different individuals sampled at random from the population, when in fact we only know something about 20 different individuals. The upshot, if we use approach (1), is that ~~we do indeed run a high risk of producing false positives~~ we’re going to end up answering a question quite different from the one we think we’re answering. [Update: Jake Westfall points out in the comments below that we won’t necessarily inflate the Type I error rate. Rather, the net effect of failing to model the nesting structure properly will depend on the relative amount of within-cluster vs. between-cluster variance. The answer we get will, however, usually deviate considerably from the answer we would get using approaches (2) or (3).]

By contrast, approaches (2) and (3) will, in most cases, produce pretty similar results. It’s true that the hierarchical approach is generally a more sensible thing to do, and will tend to provide a better estimate of the true population difference between the two conditions. However, it’s probably better to describe approach (2) as suboptimal, and not as wrong. So long as the subjects in our toy example above are in fact sampled at random, it’s pretty reasonable to assume that we have exactly 20 independent observations, and analyze our data accordingly. Our resulting estimates might not be quite as good as they could have been, but we’re unlikely to miss the mark by much.

To return to the Aarts et al paper, the key question is what exactly the authors mean when they say in their abstract that:

In neuroscience, experimental designs in which multiple observations are collected from a single research object (for example, multiple neurons from one animal) are common: 53% of 314 reviewed papers from five renowned journals included this type of data. These so-called ‘nested designs’ yield data that cannot be considered to be independent, and so violate the independency assumption of conventional statistical methods such as the t test. Ignoring this dependency results in a probability of incorrectly concluding that an effect is statistically significant that is far higher (up to 80%) than the nominal α level (usually set at 5%).

It seems to me that the implication the reader is supposed to draw from this is that roughly 53% of the neuroscience literature is at high risk of reporting spurious results. But in reality this depends entirely on whether the authors mean that 53% of studies are modeling trial-level data but ignoring the nesting structure (as in approach 1 above), or that 53% of studies in the literature aren’t using hierarchical models, even though they may be doing nothing terribly wrong otherwise (e.g., because they’re using approach (2) above).

Unfortunately, the rest of the manuscript doesn’t really clarify the matter. Here’s the section in which the authors report how they obtained that 53% number:

To assess the prevalence of nested data and the ensuing problem of inflated type I error rate in neuroscience, we scrutinized all molecular, cellular and developmental neuroscience research articles published in five renowned journals (Science, Nature, Cell, Nature Neuroscience and every month’s first issue of Neuron) in 2012 and the first six months of 2013. Unfortunately, precise evaluation of the prevalence of nesting in the literature is hampered by incomplete reporting: not all studies report whether multiple measurements were taken from each research object and, if so, how many. Still, at least 53% of the 314 examined articles clearly concerned nested data, of which 44% specifically reported the number of observations per cluster with a minimum of five observations per cluster (that is, for robust multilevel analysis a minimum of five observations per cluster is required [refs. 11, 12]). The median number of observations per cluster, as reported in literature, was 13 (Fig. 1a), yet conventional analysis methods were used in all of these reports.

This is, as far as I can see, still ambiguous. The only additional information provided here is that 44% of studies specifically reported the number of observations per cluster. Unfortunately this still doesn’t tell us whether the effective degrees of freedom used in the statistical tests in those papers included nested observations, or instead averaged over nested observations within each group or subject prior to analysis.

Lest this seem like a rather pedantic statistical point, I hasten to emphasize that a lot hangs on it. The potential implications for the neuroscience literature are very different under each of these two scenarios. If it is in fact true that 53% of studies are inappropriately using a “fixed-effects” model (approach 1)–which seems to me to be what the Aarts et al abstract implies–the upshot is that a good deal of neuroscience research is in very bad statistical shape, and the authors will have done the community a great service by drawing attention to the problem. On the other hand, if the vast majority of the studies in that 53% are actually doing their analyses in a perfectly reasonable–if perhaps suboptimal–way, then the Aarts et al article seems rather alarmist. It would, of course, still be true that hierarchical models should be used more widely, but the cost of failing to switch would be much lower than seems to be implied.

I’ve emailed the corresponding author to ask for a clarification. I’ll update this post if I get a reply. In the meantime, I’m interested in others’ thoughts as to the likelihood that around half of the neuroscience literature involves inappropriate reporting of fixed-effects analyses. I guess personally I would be very surprised if this were the case, though it wouldn’t be unprecedented–e.g., I gather that in the early days of neuroimaging, the SPM analysis package used a fixed-effects model by default, resulting in quite a few publications reporting grossly inflated t/z/F statistics. But that was many years ago, and in the literatures I read regularly (in psychology and cognitive neuroscience), this problem rarely arises any more. A priori, I would have expected the same to be true in cellular and molecular neuroscience.


UPDATE 04/01 (no, not an April Fool’s joke)

The lead author, Emmeke Aarts, responded to my email. Here’s her reply in full:

Thank you for your interest in our paper. As the first author of the paper, I will answer the question you send to Sophie van der Sluis. Indeed we report that 53% of the papers include nested data using conventional statistics, meaning that they did not use multilevel analysis but an analysis method that assumes independent observations like a students t-test or ANOVA.

As you also note, the data can be analyzed at two levels, at the level of the individual observations, or at the subject/animal level. Unfortunately, with the information the papers provided us, we could not extract this information for all papers. However, as described in the section ‘The prevalence of nesting in neuroscience studies’, 44% of these 53% of papers including nested data, used conventional statistics on the individual observations, with at least a mean of 5 observations per subject/animal. Another 7% of these 53% of papers including nested data used conventional statistics at the subject/animal level. So this leaves 49% unknown. Of this 49%, there is a small percentage of papers which analyzed their data at the level of individual observations, but had a mean less than 5 observations per subject/animal (I would say 10 to 20% out of the top of my head), the remaining percentage is truly unknown. Note that with a high level of dependency, using conventional statistics on nested data with 2 observations per subject/animal is already undesirable. Also note that not only analyzing nested data at the individual level is undesirable, analyzing nested data at the subject/animal level is unattractive as well, as it reduces the statistical power to detect the experimental effect of interest (see fig. 1b in the paper), in a field in which a decent level of power is already hard to achieve (e.g., Button 2013).

I think this definitively answers my original question: according to Aarts, of the 53% of studies that used nested data, at least 44% performed conventional (i.e., non-hierarchical) statistical analyses on the individual observations. (I would dispute the suggestion that this was already stated in the paper; the key phrase is “on the individual observations”, and the wording in the manuscript was much more ambiguous.) Aarts suggests that ~50% of the studies couldn’t be readily classified, so in reality that proportion could be much higher. But we can say that at least 23% of the literature surveyed (i.e., 44% of the 53% of papers with nested data) committed what would, in most domains, constitute a fairly serious statistical error.

I then sent Aarts another email following up on Jake Westfall’s comment (i.e., how nested vs. crossed designs were handled). She replied:

As Jake Westfall points out, it indeed depends on the design if ignoring intercept variance (so variance in the mean observation per subject/animal) leads to an inflated type I error. There are two types of designs we need to distinguish here, design type I, where the experimental variable (for example control or experimental group) does not vary within the subjects/animals but only over the subjects/animals, and design Type II, where the experimental variable does vary within the subject/animal. Only in design type I, the type I error is increased by intercept variance. As pointed out in the discussion section of the paper, the paper only focuses on design Type I (“Here we focused on the most common design, that is, data that span two levels (for example, cells in mice) and an experimental variable that does not vary within clusters (for example, in comparing cell characteristic X between mutants and wild types, all cells from one mouse have the same genotype)”), to keep this already complicated matter accessible to a broad readership. Moreover, design type I is what is most frequently seen in biological neuroscience, taking multiple observations from one animal and subsequently comparing genotypes automatically results in a type I research design.

When dealing with a research design II, it is actually the variation in effect within subject/animals that increases the type I error rate (the so-called slope variance), but I will not elaborate too much on this since it is outside the scope of this paper and a completely different story.

Again, this all seems very straightforward and sound to me. So after both of these emails, here’s my (hopefully?) final take on the paper:

  • Work in molecular, cellular, and developmental neuroscience–or at least, the parts of those fields well-represented in five prominent journals–does indeed appear to suffer from some systemic statistical problems. While the proportion of studies at high risk of Type I error is smaller than the 53% figure suggested by Aarts et al’s abstract, the more accurate estimate (at least 23% of the literature) is still shockingly high. This doesn’t mean that a quarter or more of the literature can’t be trusted–as some of the commenters point out below, most conclusions aren’t based on just a single p value from a single analysis–but it does raise some very serious concerns. The Aarts et al paper is an important piece of work that will help improve statistical practice going forward.
  • The comments on this post, and on Twitter, have been interesting to read. There appear to be two broad camps of people who were sympathetic to my original concern about the paper. One camp consists of people who were similarly concerned about technical aspects of the paper, and in most cases were tripped up by the same confusion surrounding what the authors meant when they said 53% of studies used “conventional statistical analyses”. That point has now been addressed. The other camp consists of people who appear to work in the areas of neuroscience Aarts et al focused on, and were reacting not so much to the specific statistical concern raised by Aarts et al as to the broader suggestion that something might be deeply wrong with the neuroscience literature because of this. I confess that my initial knee-jerk reaction to the Aarts et al paper was driven in large part by the intuition that surely it wasn’t possible for so large a fraction of the literature to be routinely modeling subjects/clusters/groups as fixed effects. But since it appears that that is in fact the case, I’m not sure what to say with respect to the broader question of whether it is or isn’t appropriate to ignore nesting in animal studies. I will say that in the domains I personally work in, it seems very clear that collapsing across all subjects for analysis purposes is nearly always (if not always) a bad idea. Beyond that, I don’t really have any further opinion other than what I said in this response to a comment below.
  • While the claims made in the paper appear to be fundamentally sound, the presentation leaves something to be desired. It’s unclear to me why the authors relegated some of the most important technical points to the Discussion, or didn’t state them explicitly at all. The abstract also seems to me to be overly sensational–though, in hindsight, not nearly as much as I initially suspected. And it also seems questionable to tar all of neuroscience with a single brush when the analyses reported only applied to a few specific domains (and we know for a fact that in, say, neuroimaging, this problem is almost nonexistent). I guess to be charitable, one could pick the same bone with a very large proportion of published work, and this kind of thing is hardly unique to this study. Then again, the fact that a practice is widespread surely isn’t sufficient to justify that practice–or else there would be little point in Aarts et al criticizing a practice that so many people clearly engage in routinely.
  • Given my last post, I can’t help pointing out that this is a nice example of how mandatory data sharing (or failing that, a culture of strong expectations of preemptive sharing) could have made evaluation of scientific claims far easier. If the authors had attached the data file coding the 314 studies they reviewed as a supplement, I (and others) would have been able to clarify the ambiguity I originally raised much more quickly. I did send a follow up email to Aarts to ask if she and her colleagues would consider putting the data online, but haven’t heard back yet.

28 thoughts on “what exactly is it that 53% of neuroscience articles fail to do?”

  1. The thing that bothers me is the example the author uses. It’s a weird claim that multiple neurons from a single animal violate the assumption of independence. Using multiple observations from a single subject is the standard practice in primate neurophysiology, for example, so the author is implicitly declaring that all primate recording studies are flawed. This is not the view of the primate recording community, which deals with this criticism often. The standard response is that it’s reasonable to believe that neurons are independent, and so a nested design adds no real benefit.

    1. I think the critical question is what kind of inference researchers are trying to draw. If the goal of a neurophysiology study is to draw inferences about what kinds of properties neurons have in monkeys in general, then it seems perfectly reasonable to me to object to the aggregation of neurons across different animals. If you record from 200 neurons in each of 3 animals, you don’t really have 600 independent observations, and you’re not entitled to conclude that your results generalize to the broader population of monkeys unless your statistical methods can account for the covariance of individual neurons within monkeys. Whether it’s standard practice or not is immaterial; it’s possible for most of the researchers in a field to routinely engage in a practice that is clearly not valid statistically–and this has been documented any number of times in other cases (e.g., researchers also routinely accept the null with underpowered samples, and appeal to convention doesn’t mitigate the problem in any way). For what it’s worth, I find it hard to see how different neurons within a given animal could fail to show some dependency. Is it really conceivable that knowing what animal a neuron came from is of no use in predicting the functional form of responses in that neuron? Is there empirical work demonstrating that this is true across a broad range of contexts?

      On the other hand, if the goal of a particular neurophysiology study is something closer to an existence proof–i.e., the claim is along the lines of “neurons clearly work this way in at least some monkeys”, and not “neurons work this way in all monkeys, on average”, then there’s nothing wrong with using a small number of monkeys, or even a single monkey. To my mind, some of the most convincing studies in all of neuroscience are recording studies that demonstrate striking dissociations in the response properties observed in different neuronal populations within the same general part of space. Very often these effects are so strong that it seems all but certain that even a random-effects analysis with, say, 3 animals would produce a statistically significant effect. But even when that’s not true, there can be considerable scientific utility in demonstrating that a particular pattern of results is possible, even if you can’t yet say that it holds in the entire population. To go back to the 3 approaches I mentioned in my post, this may well be a case where neurophysiologists can sensibly argue that they are using a fixed-effects model (approach 1), and that’s perfectly fine. And from a statistical standpoint, that is fine, just as long as no one pretends that the observed results have been shown to generalize to a population beyond the specific animals studied.

      1. This is the key point. Fixed-effects analysis is not inherently incorrect. It addresses a different question: Whether the effect exists in the group of animals under study. The Aarts et al. paper and also your remarks above, in the context of fixed-effects analysis in earlier versions of SPM, appear to suggest that it is incorrect.

        Human psychophysics and monkey recording studies traditionally have too few subjects per study to attempt generalisation to the population. An inference to the population then cannot be justified on the basis of the data. But we may make that inference based on the *assumption* that what holds for the group studied holds for the population. The latter assumption (often implicit) may be justified for certain questions. It is the reason why psychophysicists and monkey electrophysiologists care about their peers’ results.

        So I would argue that it is *incorrect* to say that type-1 error is inflated in fixed-effects analyses, because “type-1 error” in this statement refers to a hypothesis test that hasn’t been attempted (i.e. a population-level hypothesis test).

        The question then becomes whether the interpretation in the papers goes beyond the animals studied, without the proper caveat that this inference is not supported by the statistics (but requires the prior belief that what goes for this group holds in general).

  2. Regarding your final points, I’ve found that my phd student colleagues in psychology and cognitive science (who have taken those departments’ stats courses) have a much firmer grasp of these issues than the neuroscience students (who are largely doing cellular and molecular work) do.

    However, in these subfields, one’s conclusions rarely rest on a single analysis. Rather, you often have about a dozen unique experiments, including specific control experiments, each with relatively simple (and often inappropriate, no doubt) statistical tests. If one of these twelve tests should be using a nested analysis and it’s not, that seems less risky to me than reporting a single complex experiment that fails to do the most appropriate statistics. At this point, the reviewers should be able to judge whether the risk of a false positive is accounted for by the other analyses and experiments.

    1. That sounds reasonable, although a counterpoint is that it’s very easy to put more faith in a multiplicity of analyses than might be warranted. It’s not clear to me that compounding 10 individually questionable results will generally produce 1 well-supported overarching result–or else Daryl Bem’s ESP paper with 9 studies would have made us all believers.

      I may also have less faith in reviewers’ ability to detect problems (pre-publication) than you do…

      1. True. And relying on multiplicity might be particularly problematic if there are 10 “failed” experiments in the file drawer for each that makes it into the publication.

  3. Based on the paper itself (pg 493), it is quite obvious that the authors consider way 2 in your post as valid but potentially underpowered. The way I see it, if the ICC (intraclass correlation) is effectively 0, then hierarchical models (way 3) should give you the same result as way 1, while way 2 throws out much of the usable data, significantly reducing the available statistical power. If the ICC is 1, then way 3 should give the same result as way 2, while way 1 is overinflating the statistical significance. In essence, doing a second-level analysis on subject means seems to be a conservative choice, and hierarchical modeling could provide a boost in power. Going the other way, treating the measurements within the same subject as independent can be dangerous if one is not really sure that the ICC is 0, or at least damn close to it. It seems that in some areas of research that would be a valid assumption, or at least is considered as such. I’d be interested in knowing how much it holds up.
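
A minimal sketch of estimating the ICC discussed in the comment above from a random-intercept model (my own illustration using statsmodels, not the commenter’s code; the simulated effect size and variance components are arbitrary). The ICC here is the between-subject variance divided by the sum of the between-subject and residual variances:

```python
# Illustrative only: simulate nested data with known variance components, then
# recover the ICC from a random-intercept mixed model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame([
    {"subject": s, "condition": c, "y": b + 0.3 * c + rng.normal(0, 1.0)}
    for s, b in enumerate(rng.normal(0, 1.0, 20))  # 20 subject intercepts, SD = 1
    for c in (0, 1) for _ in range(10)             # 2 conditions x 10 trials each
])

fit = smf.mixedlm("y ~ condition", df, groups=df["subject"]).fit()
between_var = float(fit.cov_re.iloc[0, 0])  # estimated random-intercept variance
within_var = fit.scale                      # estimated residual (within-subject) variance
print("estimated ICC:", between_var / (between_var + within_var))  # true value here is 0.5
```

Shrinking the subject-intercept SD toward zero or making it very large pushes the estimated ICC toward the two extremes described in the comment.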

    1. I agree with you that the authors clearly don’t see (2) as hugely problematic, but it’s still not clear where that leaves those 53% of studies. I.e., are the authors saying that everyone is doing (1), so it’s immaterial whether (2) is a problem or not, or are they saying that everyone is doing something other than (3), including both (1) and (2)? I think what they mean is the latter. But if that’s true, then the way the abstract is worded is quite strange, because presumably it’s a much smaller proportion than 53% of studies that are actually at high risk of Type I error–and I don’t see why they wouldn’t have reported that number (which they could easily estimate) anywhere in the paper.

  4. I noticed the paper a couple days ago and read (and was confused by) the abstract, but have not read the paper itself… I guess that is by way of disclaimer.

    Anyway, maybe I missed something, but it seems that the assumption here, both in the blog post and the paper, is that ignoring the dependent structure of the data always leads to a null rejection rate that is greater than or equal to that in the analyses that take into account dependence. (I gather this assumption from, e.g.: “The upshot, if we use approach (1), is that we do indeed run a high risk of producing false positives.”) But this is not really true… it totally depends on the design. Note that in the example you give, analysis #1 “illegally” uses more degrees of freedom than is probably appropriate, BUT it also fails to remove the participant main effect variance (i.e., some subjects are “high responders” while others are “low responders”) from the denominator of the test statistic, leading to a lower test statistic. In fact, in the large majority of cases, this far outweighs the effect of simply increasing the degrees of freedom (e.g., going from df=20 to df=400 doesn’t really do much, as you know).

    In theory there could be close to 0 participant variance, in which case yes, the rejection rate for analysis #1 will tend to be slightly higher than appropriate. But in practice this is almost never the case… in most contexts there is a substantial amount of variance in the participant main effects, so the rejection rate for analysis #1 will actually be far lower than it should be, in particular lower than in analyses #2 and #3. A pretty common rule of thumb is that if the clusters (i.e., subjects) are crossed with the predictor of interest, then ignoring dependence tends to deflate the test statistic, but if the clusters are nested in the levels of the predictor, then ignoring the dependence tends to inflate the test statistic (see the simulation sketch following this comment). I do not routinely read the neuroscience literature, but my impression is that the former case (corresponding to “within-subjects” designs) is overwhelmingly more common than the latter case in neuroimaging studies, for example.

    Maybe the authors do not actually make this mistake. Like I said, I haven’t read the paper beyond the abstract. But if their conclusions are based simply on the observation that neuroscience researchers very often analyze dependent data as if it were independent, then it seems that the conclusion (“alpha inflation is rampant”) is fundamentally wrong. Tal, maybe you can clarify the logic of their argument (or, judging from what you say in the post, maybe not).
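
The rule of thumb described in the comment above is easy to check by simulation. Below is a rough sketch (mine, not Jake Westfall’s or the authors’; all counts and variances are made up) that generates null data with substantial subject-intercept variance and applies a naive independent-samples t-test under both a nested and a crossed design:

```python
# Illustrative only: under the null hypothesis, a "flat" t-test that ignores subjects
# is anti-conservative when subjects are nested in conditions, but conservative when
# subjects are crossed with conditions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subj, n_trials, sd_subj, sd_err, n_sims = 20, 10, 1.0, 1.0, 2000

def flat_p(nested):
    if nested:
        # half the subjects contribute trials to condition 0 only, half to condition 1 only
        y0 = np.concatenate([rng.normal(b, sd_err, n_trials)
                             for b in rng.normal(0, sd_subj, n_subj // 2)])
        y1 = np.concatenate([rng.normal(b, sd_err, n_trials)
                             for b in rng.normal(0, sd_subj, n_subj // 2)])
    else:
        # every subject contributes trials to both conditions (crossed / within-subject)
        b = rng.normal(0, sd_subj, n_subj)
        y0 = np.concatenate([rng.normal(bi, sd_err, n_trials) for bi in b])
        y1 = np.concatenate([rng.normal(bi, sd_err, n_trials) for bi in b])
    return stats.ttest_ind(y1, y0).pvalue

for design, nested in (("nested", True), ("crossed", False)):
    rejection_rate = np.mean([flat_p(nested) < 0.05 for _ in range(n_sims)])
    print(design, "false positive rate at alpha = .05:", rejection_rate)
```

With these settings the nested rejection rate should come out far above the nominal 5%, and the crossed rate well below it.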

    1. You’re absolutely right; thanks for the correction. I’ve updated my post accordingly. Any insight into where the authors went wrong with their simulations? Looking at their supplement, the data generation process seems a bit odd–it looks like they’re holding the total variance constant, so that as they increase the within-cluster variance, they actually decrease the residual error (rather than letting the total variance grow).

      With respect to the neuroimaging literature, things are complicated by the fact that the baseline usually doesn’t mean anything, so back when people were (erroneously) conducting fixed-effects analyses by concatenating subjects, they also used to first standardize each subject’s data (hence removing the between-subject variance in means). Hence the inflation of the test statistic. But you’re right that in the broader neuroscience literature the effect of failing to model clusters is probably much less predictable, and could very well make analyses more conservative most of the time.

      You should really consider submitting a comment/letter to Nature Neuroscience. If you’re right (and I don’t see how you couldn’t be), this is a pretty egregious error that invalidates much of the paper’s argument (though obviously not the more general case for using multilevel models). If nothing else, submit a comment to the PubPeer page for the article.

    2. Jake, any idea what the distributions of within/between subject variance tend to be in different fields? Seems to me like most of the time you are actually hurting your chance of finding something by not accounting for intercepts.

      I’m guessing your answer is going to be that it totally depends on the design but I wanted to see if you had any idea of these distributions generally speaking.

      1. It’s hard to make general statements about the variance components from designed experiments that will apply reasonably well across different research areas and different study designs. One rule of thumb that seems to be approximately true in many cases is the principle of “hierarchical ordering,” which holds that “lower-order” variance components (e.g., variance in main effects) are often greater than “higher-order” variance components (e.g., variance in three-way interaction effects). I attempted to take up this issue in a stats.stackexchange.com question and answer a while ago here:
        http://stats.stackexchange.com/questions/72819/relative-variances-of-higher-order-vs-lower-order-random-terms-in-mixed-models

        There are (at least) two complications in applying the hierarchical ordering principle. First, it is not clear where in the hierarchy we should consider residual variance to be. (Note that the “within-subject variance” in this case = residual variance.) Are the errors the lowest-order effects or the highest-order effects? I usually consider them to be the lowest-order effects, but that’s just because empirically it seems that residual variance is often the largest variance component in the kinds of studies that I personally have experience with. It’s not clear whether we should expect this to be true in other areas. Second, different study designs confound sources of variation in different ways. In the design considered by Aarts et al., if the “true” or data-generating regression process does in fact include random slope variance (e.g., some rats “would have been” high responders in the other condition, had we observed them in both conditions, while for other rats the reverse is true), then this unobserved variation will be absorbed into the random intercept variance. So even if we are willing to rely on hierarchical ordering and posit that within-subject variance is greater than between-subject variance (implying intraclass correlation is less than 0.5), it still might not be true that within-subject variance is greater than between-subject variance + subject-by-condition interaction variance. So we could still be wrong about the variance components even if hierarchical ordering is actually correct, just because of the study design.

        All of that is just to say that a priori statements are problematic; we need data, but the data don’t seem to exist in any systematic form (no one seems to be interested in meta-analyzing random variance components).

  5. Really nice write-up on what the article glosses over. You’ve captured many of my own concerns elegantly and informatively. I’ll just speak to your comments on “monkey data” as someone who did his Ph.D. recording from monkeys. When you say “Is it really conceivable that knowing what animal a neuron came from is of no use in predicting the functional form of responses in that neuron?” I’m quite comfortable answering “yes,” for the “functional form of responses” that is commonly studied.

    In brief, the entire enterprise of primate neurophysiology is predicated on the observation that, time and again, the functional form of responses in a given neuron are determined by the stimulus presented and the area in which that neuron is to be found. I expect that anyone that maps receptive fields, be they in a fish, a bird, a rodent, monkey, or even human (say, visual fields in occipital cortex) will be sympathetic to this notion. I don’t know of a review that contains the “empirical data” that you speak of, because almost every systems-level neuroscience study builds on almost a century of research: that showing the same set of stimuli to a neuron in a given area will produce comparable responses INDEPENDENT OF THE ANIMAL. This is why we say that neurons in MT are “motion sensitive,” for example, or why motoneurons in the brainstem tend to be tuned to move the eyes in a given direction. The deep question is, should we be all that surprised that this is so?

    First let’s look at studies characterizing sensory areas, where the experimenter attempts to define the sensitivity of neurons in a given animal. Note that maybe 100 out of millions of responsive neurons are sampled. And the space of sensory stimuli is often restricted to that which is relevant to the animal and easily presented. Is it really surprising that, say, 360 deg of motion is represented more or less equivalently by neurons in area MT across animals? Not to me. Similar phenomena are to be expected (and found) among motor neurons, or brainstem supranuclear populations. The muscles are largely the same across animals, and you’ve got to actuate them similarly if you want to move in a coordinated fashion. So at the other “end” of the nervous system, again, I don’t find it surprising.

    Now, consider a more complicated task, where monkeys are asked to do a behaviorally complex task. Often, years are devoted to training the animal until behavioral performance is excellent, at which point recording takes place over additional years. Estimates are made as to what aspects of the response properties of recorded neurons matter. And then? It’s time for the “double-or-nothing” monkey. Training is repeated, recordings performed again, and, in order to be publishable data, the neurons from the second animal MUST look indistinguishable from the first (along the relevant axes). Few people truly appreciate this. Monkey physiologists worry about the “nesting issue,” more profoundly than most, because you really only get two shots. When things are “different” (i.e. when the identity of the monkey might actually matter) then tragically, there’s nothing to write about. Note that the title of most of these papers is “Neurons in This Area Encode (i.e. are correlated with) Something Interesting.” If recordings from similar areas in the two monkeys look different, it’s game over; there’s little hope for the dataset. In my experience, your approach 3 is the default, so deeply that it is often not reported explicitly. To stick with your terminology, if you aren’t a lot closer to 400 observations than 20, you probably wouldn’t have seen the study.

    So when you say, “For what it’s worth, I find it hard to see how different neurons within a given animal could fail to show some dependency,” you’re overlooking the fact that 1) data are sampled really sparsely, 2) the responses of these neurons are almost always driven predominantly by the stimulus or behavior under study, 3) genetic constraints on wiring the brain almost certainly guarantee that stereotypy in these response properties will emerge across animals, 4) neurons are biophysical machines, and they are largely built comparably across mature animals, and 5) the recordings take place across months, if not years, so the animal at the beginning of the study isn’t really “the same” as at the end.

    There’s lots more about the origin of variability in individual neurons, but for the majority of published monkey studies, it shouldn’t surprise you that it doesn’t matter what animal the neurons came from. A well-designed study (i.e. most that get published) is doing its best to ask a question that will overcome the statistical issues that arise from the necessity of nesting.

    1. David, I think maybe I wasn’t clear in my earlier comment. I don’t doubt at all that there are motion-sensitive neurons (or orientation-sensitive neurons, or color-sensitive neurons, etc.). Or that these properties are broadly conserved across just about every individual of a given species. When I say that it’s almost inconceivable to me that there isn’t a dependency between different neurons, that’s a quantitative, not a qualitative claim. I’m not saying that the kind of response observed is conditional on the animal; I’m saying it would be very odd if neurons from different animals were truly exchangeable.

      Another way to put this is that if neurophysiologists are right that neuronal responses are effectively independent of one another, they have nothing to lose by using multilevel models, since, as others have pointed out, at the limit, where the ICC is 0, the multilevel model will give the same answer as the “flat” model. Conversely, if it empirically turns out that there is within-animal clustering of neuronal responses, the resulting test statistic will be more conservative than what is typically reported in the literature. So I don’t see any reason for neurophysiologists to be defensive about the suggestion that maybe they should explicitly account for the obvious multilevel structure of their data.

  6. Forgive me if this question is horribly naive, but doesn’t simply using a repeated measures ANOVA address the issue of hierarchical nested data? Maybe not in all cases, but in large part. This seems like a conventional statistical test to me.

    1. Tim,

      Yes. Depending on how the repeated measures component of the ANOVA is handled, that approach is going to end up looking like either approach (2) or (3) above. But the question at hand here is whether the 53% of studies reviewed in the Nature Neuro paper are even doing that much, or if they’re just modeling the repeated measures using a “flat” model, i.e., as if repeated measures were actually independent clusters or subjects of their own.

      1. Thanks, Tal. I did find it hard to believe that so many studies would fail to recognize a repeated measures design and set up an ANOVA accordingly. I guess this goes to your point.

    2. A standard RM ANOVA is a fixed effects model. E.g., given polynomial contrasts, it includes a single parameter for the intercept, linear effect, quadratic effect, etc. These parameters do not vary over subjects. If you make them random over subjects (e.g. the linear mixed option in SPSS, or lme in R) then the parameters vary over subjects. In this case, each effect includes a mean value (mean intercept, mean linear term, etc.) and a variance (variance of the intercept, variance of the linear term, etc.). An RM ANOVA does not address the issue of hierarchical nested data. There are many good books that explain this (e.g., Singer and Willett, “Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence”; please try Google for the publication details).
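
To make the distinction concrete, here is a short sketch (my own, under assumptions not in the comment: simulated data with a random intercept only) contrasting a repeated-measures ANOVA on subject-by-condition means with a mixed model fit to the trial-level data, both in statsmodels:

```python
# Illustrative only: contrast a repeated-measures ANOVA on subject/condition means with
# a mixed model (random intercept and random condition slope) fit to trial-level data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(3)
df = pd.DataFrame([
    {"subject": s, "condition": c, "y": b + 0.3 * c + rng.normal(0, 1.0)}
    for s, b in enumerate(rng.normal(0, 1.0, 20))  # 20 subjects with random intercepts
    for c in (0, 1) for _ in range(10)             # 2 conditions x 10 trials each
])

# Repeated-measures ANOVA: trials are first collapsed to one mean per subject/condition cell.
rm = AnovaRM(df, depvar="y", subject="subject", within=["condition"],
             aggregate_func="mean").fit()
print(rm)

# Mixed model: the intercept and the condition effect are both allowed to vary over subjects.
mm = smf.mixedlm("y ~ condition", df, groups=df["subject"],
                 re_formula="~condition").fit()
print(mm.summary())
```

The RM ANOVA necessarily collapses the 10 trials per cell, whereas the mixed model uses all 400 observations and explicitly estimates the between-subject variance components.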

  7. About Jake Westfall’s comment – unless I’m missing something, the simulations in this paper don’t address conditions in which subjects are crossed with predictors (within-subjects designs) in which ignoring dependence will deflate the test statistic as you suggest. Their examples are all cases in which clusters are nested within conditions – for example genotype (you cannot measure the same neuron from both a wildtype and a knockout mouse). The first example in their supplementary data shows quite large within-cluster relative to between-cluster variance, and they get a significant result with a two-sample t-test treating all observations as independent but the mixed models analysis doesn’t reach statistical significance for effect of genotype.

    It’s definitely not clear from the paper how many studies are actually vulnerable to problems with test statistics because they are taking too many or too few degrees of freedom with nested data (treating all observations as independent or collapsing data down to the cluster level). I do see this problem relatively commonly, although not as often as inappropriate analysis of (or failure to analyze) interactions.

    1. Thanks Mark, you raise a good point; it would be good to know if the 53% of studies the authors report using nested data are nested in the strict sense (i.e., no crossing), or also include studies with repeated measures embedded in within-subject designs. I would guess the latter, but I’m not sure. In any case, I don’t think this changes Jake’s point in principle (i.e., it will still presumably be the case that the fixed-effects analysis reduces power in some within-subject studies), though the magnitude of the problem may certainly diminish somewhat (at least in the sense that trading Type I error for Type II error can be considered a good thing).

  8. In my experience as a stats/computational TA, the problem is very real in behavioral research. I had the displeasure of having to explain to someone that their results were no longer significant after accounting for dependence. To make matters worse, I’ve seen analyses of the form (conditions x time) being analyzed independently (e.g. t-test) at every time point. When I tried to do the analysis correctly (fitting a mixed effects model), I was unable to do so in the statistical package that they were using (GraphPad Prism). This package seems to be preferred by a lot of neuroscientists for its simplicity and (AFAIK) has some severe limitations doing repeated measures ANOVAs.
