the weeble distribution: a love story

“I’m a statistician,” she wrote. “By day, I work for the census bureau. By night, I use my statistical skills to build the perfect profile. I’ve mastered the mysterious headline, the alluring photo, and the humorous description that comes off as playful but with a hint of an edge. I’m pretty much irresistible at this point.”

“Really?” I wrote back. “That sounds pretty amazing. The stuff about building the perfect profile, I mean. Not the stuff about working at the census bureau. Working at the census bureau sounds decent, I guess, but not amazing. How do you build the perfect profile? What kind of statistical analysis do you do? I have a bit of programming experience, but I don’t know any statistics. Maybe we can meet some time and you can teach me a bit of statistics.”

I am, as you can tell, a smooth operator.

A reply arrived in my inbox a day later:

No, of course I don’t really spend all my time constructing the perfect profile. What are you, some kind of idiot?

And so was born our brief relationship; it was love at first insult.


“This probably isn’t going to work out,” she told me within five minutes of meeting me in person for the first time. We were sitting in the lobby of the Chateau Laurier downtown. Her choice of venue. It’s an excellent place to meet an internet date; if you don’t like the way they look across the lobby, you just back out quietly and then email the other person to say sorry, something unexpected came up.

“That fast?” I asked. “You can already tell you don’t like me? I’ve barely introduced myself.”

“Oh, no, no. It’s not that. So far I like you okay. I’m just going by the numbers here. It probably isn’t going to work out. It rarely does.”

“That’s a reasonable statement,” I said, “but a terrible thing to say on a first date. How do you ever get a second date with anyone, making that kind of conversation?”

“It helps to be smoking hot,” she said. “Did I offend you terribly?”

“Not really, no. But I’m not a very sentimental kind of guy.”

“Well, that’s good.”


Later, in bed, I awoke to a shooting pain in my leg. It felt like I’d been kicked in the shin.

“Did you just kick me in the shin,” I asked.

“Yes.”

“Any particular reason?”

“You were a little bit on my side of the bed. I don’t like that.”

“Oh. Okay. Sorry.”

“I still don’t think this will work,” she said, then rolled over and went back to sleep.


She was right. We dated for several months, but it never really worked. We had terrific fights, and reasonable make-up sex, but our interactions never had very much substance. We related to one another like two people who were pretty sure something better was going to come along any day now, but in the meantime, why not keep what we had going, because it was better than eating dinner alone.

I never really learned what she liked; I did learn that she disliked most things. Mostly our conversations revolved around statistics and food. I’ll give you some examples.


“Beer is the reason for statistics,” she informed me one night while we were sitting at Cicero’s and sharing a lasagna.

“I imagine beer might be the reason for a lot of bad statistics,” I said.

“No, no. Not just bad statistics. All statistics. The discipline of statistics as we know it exists in large part because of beer.”

“Pray, do go on,” I said, knowing it would have been futile to ask her to shut up.

“Well,” she said, “there once was a man named Student…”

I won’t bore you with all the details; the gist of it is that there once was a man by the name of William Gosset, who worked for Guinness as a brewer in the early 1900s. Like a lot of other people, Gosset was interested in figuring out how to make Guinness taste better, so he invented a bunch of statistical tests to help him quantify the differences in quality between different batches of beer. Guinness didn’t want Gosset to publish his statistical work under his real name, for fear he might somehow give away their trade secrets, so they made him use the pseudonym “Student”. As a result, modern-day statisticians often work with something called Student’s t distribution, which is apparently kind of a big deal. And all because of beer.

“That’s a nice story,” I said. “But clearly, if Student—or Gosset or whatever his real name was—hadn’t been working for Guinness, someone else would have invented the same tests shortly afterwards, right? It’s not like he was so brilliant no one else would have ever thought of the same thing. I mean, if Edison hadn’t invented the light bulb, someone else would have. I take it you’re not really saying that without beer, there would be no statistics.”

“No, that is what I’m saying. No beer, no stats. Simple.”

“Yeah, okay. I don’t believe you.”

“Oh no?”

“No. What’s that thing about lies, damned lies, and stat—”

“Statistics?”

“No. Statisticians.”

“No idea,” she said. “Never heard that saying.”

“It’s that they lie. The saying is that statisticians lie. Repeatedly and often. About anything at all. It’s that they have no moral compass.”

“Sounds about right.”


“I don’t get this whole accurate to within 3 percent 19 times out of 20 business,” I whispered into her ear late one night after we’d had sex all over her apartment. “I mean, either you’re accurate or you’re not, right? If you’re accurate, you’re accurate. And if you’re not accurate, I guess maybe then you could be within 3 percent or 7 percent or whatever. But what the hell does it mean to be accurate X times out of Y? And how would you even know how many times you’re accurate? And why is it always 19 out of 20?”

She turned on the lamp on the nightstand and rolled over to face me. Her hair covered half of her face; the other half was staring at me with those pale blue eyes that always looked like they wanted to either jump you or murder you, and you never knew which.

“You really want me to explain confidence intervals to you at 11:30 pm on a Thursday night?”

“Absolutely.”

“How much time do you have?”

“All. Night. Long,” I said, channeling Lionel Richie.

“Wonderful. Let me put my spectacles on.”

She fumbled around on the nightstand looking for them.

“What do you need your glasses for,” I asked. “We’re just talking.”

“Well, I need to be able to see you clearly. I use the amount of confusion on your face to gauge how much I need to dumb down my explanations.”


Frankly, most of the time she was as cold as ice. The only time she really came alive—other than in the bedroom—was when she talked about statistics. Then she was a different person: excited and exciting, full of energy. She looked like a giant Tesla coil, mid-discharge.

“Why do you like statistics so much,” I asked her over a bento box at ZuNama one day.

“Because,” she said, “without statistics, you don’t really know anything.”

“I thought you said statistics was all about uncertainty.”

“Right. Without statistics, you don’t know anything… and with statistics, you still don’t know anything. But with statistics, we can at least get a sense of how much we know or don’t know.”

“Sounds very… Rumsfeldian,” I said. “Known knowns… unknown unknowns… is that right?”

“It’s kind of right,” she said. “But the error bars are pretty huge.”

“I’m going to pretend I know what that means. If I admit I have no idea, you’ll think I wasn’t listening to you in bed the other night.”

“No,” she said. “I know you were listening. You were listening very well. It’s just that you were understanding very poorly.”


Uncertainty was a big theme for her. Once, to make a point, she asked me how many nostrils a person breathes through at any given time. And then, after I experimented on myself and discovered that the answer was one and not two, she pushed me on it:

“Well, how do you know you’re not the only freak in the world who breathes through one nostril?”

“Easily demonstrated,” I said, and stuck my hand right in front of her face, practically covering her nose.

“Breathe out!”

She did.

“And now breathe in! And then repeat several times!”

She did.

“You see,” I said, retracting my hand once I was satisfied. “It’s not just me. You also breathe through one nostril at a time. Right now it’s your left.”

“That proves nothing,” she said. “We’re not independent observations; I live with you. You probably just gave me your terrible mononarial disease. All you’ve shown is that we’re both sick.”

I realized then that I wasn’t going to win this round—or any other round.

“Try the unagi,” I said, waving at the sushi in a heroic effort to change the topic.

“You know I don’t like to try new things. It’s bad enough I’m eating sushi.”

“Try the unagi,” I suggested again.

So she did.

“It’s not bad,” she said after chewing on it very carefully for a very long time. “But it could use some ketchup.”

“Don’t you dare ask them for ketchup,” I said. “I will get up and leave if you ask them for ketchup.”

She waved her hand at the server.


“There once was a gentleman named Bayes,” she said over coffee at Starbucks one morning. I was running late for work, but so what? Who’s going to pass up the chance to hear about a gentleman named Bayes when the alternative is spending the morning refactoring enterprise code and filing progress reports?

“Oh yes, I’ve heard about him,” I said. “He’s the guy who came up with Bayes’ theorem.” I’d heard of Bayes’ theorem in some distant class somewhere, and knew it had something to do with statistics, though I had not one clue what it actually referred to.

“No, the Bayes I’m talking about is John Bayes—my mechanic. He’s working on my car right now.”

“Really?”

“No, not really, you idiot. Yes, Bayes as in Bayes’ theorem.”

“Thought so. Well, go ahead and tell me all about him. What is John Bayes famous for?”

“Bayes’ theorem.”

“Huh. How about that.”

She launched into a very dry explanation of conditional probabilities and prior distributions and a bunch of other terms I’d never heard of before and haven’t remembered since. I stopped her about three minutes in.

“You know none of this helps me, right? I mean, really, I’m going to forget anything you tell me. You know what might help, is maybe if instead of giving me these long, dry explanations, you could put things in a way I can remember. Like, if you, I don’t know, made up a limerick. I bet I could remember your explanations that way.”

“Oh, a limerick. You want a Bayesian limerick. Okay.”

She scrunched up her forehead like she was thinking very deeply. Held the pose for a few seconds.

“There once was a man named John Bayes,” she began, and then stopped.

“Yes,” I said. “Go on.”

“Who spent most of his days… calculating the posterior probability of go fuck yourself.”

“Very memorable,” I said, waving for the check.


“Suppose I wanted to estimate how much I love you,” I said over asparagus and leek salad at home one night. “How would I do that?”

“You love me?” she arched an eyebrow.

“Good lord no,” I laughed hysterically. “It’s a completely and utterly hypothetical question. But answer it anyway. How would I do it?”

She shrugged.

“That’s a measurement problem. I’m a statistician, not a psychometrician. I develop and test statistical models. I don’t build psychological instruments. I haven’t the faintest idea how you’d measure love. As I’m sure you’ve observed, it’s something I don’t know or care very much about.”

I nodded. I had observed that.

“You act like there’s a difference between all these things there’s really no difference between,” I said. “Models, measures… what the hell do I care? I asked a simple question, and I want a simple answer.”

“Well, my friend, in that case, the answer is that you must look deep into your own heart and say, heart, how much do I love this woman, and then your heart will surely whisper the answer delicately into your oversized ear.”

“That’s the dumbest thing I’ve ever heard,” I said, tugging self-consciously at my left earlobe. It wasn’t that big.

“Right?” she said. “You said you wanted a simple answer. I gave you a simple answer. It also happens to be a very dumb answer. Well, great, now you know one of the fundamental principles of statistical analysis.”

“That simple answers tend to be bad answers?”

“No,” she said. “That when you’re asking a statistician for help, you need to operationalize your question very carefully, or the statistician is going to give you a sensible answer to a completely different question than the one you actually care about.”


“How come you never ask me about my work,” I asked her one night as we were eating dinner at Chez Margarite. She was devouring lemon-infused pork chops; I was eating a green papaya salad with mint chutney and mango salsa dressing.

“Because I don’t really care about your work,” she said.

“Oh. That’s… kind of blunt.”

“Sorry. I figured I should be honest. That’s what you say you want in a relationship, right? Honesty?”

“Sure,” I said, as the server refilled our water glasses.

“Well,” I offered. “Maybe not that much honesty.”

“Would you like me to feign interest?”

“Maybe just for a bit. That might be nice.”

“Okay,” she sighed, giving me the green light with a hand wave. “Tell me about your work.”

It was a new experience for me; I didn’t want to waste the opportunity, so I tried to choose my words carefully.

“Well, for the last month or so, I’ve been working on re-architecting our site’s database back-end. We’ve never had to worry about scaling before. Our DB can handle a few dozen queries per second, even with some pretty complicated joins. But then someone posts a product page to reddit because of a funny typo, and suddenly we’re getting hundreds of requests a second, and all hell breaks loose.”

I went on to tell her about normal forms and multivalued dependencies and different ways of modeling inheritance in databases. She listened along, nodding intermittently and at roughly appropriate intervals. But I could tell her heart wasn’t in it. She kept looking over with curiosity at the group of middle-aged Japanese businessmen seated at the next table over from us. Or out the window at the homeless man trying to sell rhododendrons to passers-by. Really, she looked everywhere but at me. Finally, I gave up.

“Look,” I said, “I know you’re not into this. I guess I don’t really need to tell you about what I do. Do you want to tell me more about the Weeble distribution?”

Her face lit up with excitement; for a moment, she looked like the moon. A cold, heartless, beautiful moon, full of numbers and error bars and mascara.

“Weibull,” she said.

“Fine,” I said. “You tell me about the Weibull distribution, and I’ll feign interest. Then we’ll have crème brulee for dessert, and then I’ll buy you a rhododendron from that guy out there on the way out.”

“Rhododendrons,” she snorted. “What a ridiculous choice of flower.”


“How long do you think this relationship is going to last,” I asked her one brisk evening as we stood outside Gordon’s Gourmets with oversized hot dogs in hand.

I was fully aware our relationship was a transient thing—like two people hanging out on a ferry for a couple of hours, both perfectly willing to have a reasonably good time together until the boat hits the far side of the lake, but neither having any real interest in trading numbers or full names.

I was in it for—let’s be honest—the sex and the conversation. As for her, I’m not really sure what she got out of it; I’m not very good at either of those things. I suppose she probably had a hard time finding anyone willing to tolerate her for more than a couple of days.

“About another month,” she said. “We should take a trip to Europe and break up there. That way it won’t be messy when we come back. You book your plane ticket, I’ll book mine. We’ll go together, but come back separately. I’ve always wanted to end a relationship that way—in a planned fashion where there are no weird expectations and no hurt feelings.”

“You think planning to break up in Europe a month from now is a good way to avoid hurt feelings?”

“Correct.”

“Okay, I guess I can see that.”


And that’s pretty much how it went. About a month later, we were sitting in a graveyard in a small village in southern France, winding our relationship down. Wine was involved, and had been involved for most of the day; we were both quite drunk.

We’d gone to see this documentary film about homeless magicians who made their living doing card tricks for tourists on the beaches of the French Riviera, and then we stumbled around town until we came across the graveyard, and then, having had a lot of wine, we decided, why not sit on the graves and talk. And so we sat on graves and talked for a while until we finally ran out of steam and affection for each other.

“How do you want to end it,” I asked her when we were completely out of meaningful words, which took less time than you might imagine.

“You sound so sinister,” she said. “Like we’re talking about a suicide pact. When really we’re just two people sitting on graves in a quiet cemetery in France, about to break up forever.”

“Yeah, that. How do you want to end it.”

“Well, I like endings like in Sex, Lies, and Videotape, you know? Endings that don’t really mean anything.”

“You like endings that don’t mean anything.”

“They don’t have to literally mean nothing. I just mean they don’t have to have any deep meaning. I don’t like movies that end on some fake bullshit dramatic note just to further the plot line or provide a sense of closure. I like the ending of Sex, Lies, and Videotape because it doesn’t follow from anything; it just happens.”

“Remind me how it ends?”

“They’re sitting on the steps outside, and Ann—Andie MacDowell’s character—says, ‘I think it’s going to rain.’ Then Graham says, ‘It is raining.’ And that’s it. Fade to black.”

“So that’s what you like.”

“Yes.”

“And you want to end our relationship like that.”

“Yes.”

“Okay,” I said. “I guess I can do that.”

I looked around. It was almost dark, and the bottle of wine was empty. Well, why not.

“I think it’s going to rain,” I said.

“Jesus,” she said incredulously, leaning back against a headstone belonging to some guy named Jean-Francois. “I meant we should end it like that. That kind of thing. Not that actual thing. What are you, some kind of moron?”

“Oh. Okay. And yes.”

I thought about it for a while.

“I think I got this,” I finally said.

“Ok, go,” she smiled. One of the last—and only—times I saw her smile. It was devastating.

“Okay. I’m going to say: I have some unfinished business to attend to at home. I should really get back to my life. And then you should say something equally tangential and vacuous. Something like: ‘yes, you really should get back there. Your life must be lonely without you.’”

“Your life must be lonely without you…” she tried the words out.

“That’s perfect,” she smiled. “That’s exactly what I wanted.”


Internal consistency is overrated, or How I learned to stop worrying and love shorter measures, Part I

[This is the first of a two-part series motivating and introducing precis, a Python package for automated abbreviation of psychometric measures. In part I, I motivate the search for shorter measures by arguing that internal consistency is highly overrated. In part II, I describe some software that makes it relatively easy to act on this newly-acquired disregard by gleefully sacrificing internal consistency at the altar of automated abbreviation. If you’re interested in this general topic but would prefer a slightly less ridiculous (read: more academic) treatment, read this paper with Hedwig Eisenbarth and Scott Lilienfeld, or take a look at the demo IPython notebook.]

Developing a new questionnaire measure is a tricky business. There are multiple objectives one needs to satisfy simultaneously. Two important ones are:

  • The measure should be reliable. Validity is bounded by reliability; a highly unreliable measure cannot support valid inferences, and is largely useless as a research instrument.
  • The measure should be as short as is practically possible. Time is money, and nobody wants to sit around filling out a 300-item measure if a 60-item version will do.

Unfortunately, these two objectives are in tension with one another to some degree. Random error averages out as one adds more measurements, so in practice, one of the easiest ways to increase the reliability of a measure is to simply add more items. From a reliability standpoint, it’s often better to have many shitty indicators of a latent construct than a few moderately reliable ones*. For example, Cronbach’s alpha–an index of the internal consistency of a measure–is higher for a 20-item measure with a mean inter-item correlation of 0.1 than for a 5-item measure with a mean inter-item correlation of 0.3.
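If you want to check that claim yourself, the standardized form of alpha makes it a two-line calculation. Here’s a quick sketch in Python (the standardized formula assumes equal inter-item correlations, which is all we need for this comparison):

def standardized_alpha(k, mean_r):
    # Standardized Cronbach's alpha for k items with mean inter-item correlation mean_r.
    return (k * mean_r) / (1 + (k - 1) * mean_r)

print(standardized_alpha(20, 0.1))  # ~0.69: twenty weakly correlated items
print(standardized_alpha(5, 0.3))   # ~0.68: five moderately correlated items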

Because it’s so easy to increase reliability just by adding items, reporting a certain level of internal consistency is now practically a requirement in order for a measure to be taken seriously. There’s a reasonably widespread view that an adequate level of reliability is somewhere around .8, and that anything below around .6 is just unacceptable. Perhaps as a consequence of this convention, researchers developing new questionnaires will typically include as many items as it takes to hit a “good” level of internal consistency. In practice, relatively few measures use fewer than 8 to 10 items to score each scale (though there are certainly exceptions, e.g., the Ten Item Personality Inventory). Not surprisingly, one practical implication of this policy is that researchers are usually unable to administer more than a handful of questionnaires to participants, because nobody has time to sit around filling out a dozen 100+ item questionnaires.

While understandable from one perspective, the insistence on attaining a certain level of internal consistency is also problematic. It’s easy to forget that while reliability may be necessary for validity, high internal consistency is not. One can have an extremely reliable measure that possesses little or no internal consistency. This is trivial to demonstrate by way of thought experiment. As I wrote in this post a few years ago:

Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable.
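To make the thought experiment concrete, here’s a small simulation sketch (the sample size and noise level are arbitrary choices, not drawn from any real dataset):

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
noise_sd = 0.05  # each item measures its trait almost perfectly

# Two uncorrelated traits, each measured twice (e.g., test and retest).
shoelace = rng.normal(size=n)
blue_pref = rng.normal(size=n)

def measure(trait):
    return trait + rng.normal(scale=noise_sd, size=n)

item1_t1, item1_t2 = measure(shoelace), measure(shoelace)
item2_t1, item2_t2 = measure(blue_pref), measure(blue_pref)

# Internal consistency of the 2-item composite is ~0...
r12 = np.corrcoef(item1_t1, item2_t1)[0, 1]
print("alpha:", 2 * r12 / (1 + r12))

# ...but the test-retest reliability of the summed score is ~1.
total_t1, total_t2 = item1_t1 + item2_t1, item1_t2 + item2_t2
print("retest r:", np.corrcoef(total_t1, total_t2)[0, 1])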

In fact, we can push this line of thought even further, and say that the perfect measure—in the sense of maximizing both reliability and brevity—should actually have an internal consistency of exactly zero. A value any higher than zero would imply the presence of redundancy between items, which in turn would suggest that we could (at least in theory, though typically not in practice) get rid of one or more items without reducing the amount of variance captured by the measure as a whole.

To use a spatial analogy, suppose we think of each of our measure’s items as a circle in a 2-dimensional space:

[Figure: 20 circles scattered across a 2D space, with substantial overlap between them]

Here, our goal is to cover the maximum amount of territory using the smallest number of circles (analogous to capturing as much variance in participant responses as possible using the fewest items). By this light, the solution in the above figure is kind of crummy, because it fails to cover much of the space despite having 20 circles to work with. The obvious problem is that there’s a lot of redundancy between the circles—many of them overlap in space. A more sensible arrangement, assuming we insisted on keeping all 20 circles, would look like this:

[Figure: the same 20 circles rearranged to minimize overlap and cover the full space]

In this case we get complete coverage of the target space just by realigning the circles to minimize overlap.

Alternatively, we could opt to cover more or less the same territory as the first arrangement, but using many fewer circles (in this case, 10):

[Figure: roughly the same coverage as the first arrangement, achieved with only 10 circles]

It turns out that what goes for our toy example in 2D space also holds for self-report measurement of psychological constructs that exist in much higher dimensions. For example, suppose we’re interested in developing a new measure of Extraversion, broadly construed. We want to make sure our measure covers multiple aspects of Extraversion—including sociability, increased sensitivity to reward, assertiveness, talkativeness, and so on. So we develop a fairly large item pool, and then we iteratively select groups of items that (a) have good face validity as Extraversion measures, (b) predict external criteria we think Extraversion should predict (predictive validity), and (c) tend to correlate with each other modestly-to-moderately. At some point we end up with a measure that satisfies all of these criteria, and then presumably we can publish our measure and go on to achieve great fame and fortune.

So far, so good—we’ve done everything by the book. But notice something peculiar about the way the book would have us do things: the very fact that we strive to maintain reasonably solid correlations between our items actually makes our measurement approach much less efficient. To return to our spatial analogy, it amounts to insisting that our circles have to have a high degree of overlap, so that we know for sure that we’re actually measuring what we think we’re measuring. And to be fair, we do gain something for our trouble, in the sense that we can look at our little plot above and say, a-yup, we’re definitely covering that part of the space. But we also lose something, in that we waste a lot of items (or circles) trying to cover parts of the space that have already been covered by other items.

Why would we do something so inefficient? Well, the problem is that in the real world—unlike in our simple little 2D world—we don’t usually know ahead of time exactly what territory we need to cover. We probably have a fuzzy idea of our Extraversion construct, and we might have a general sense that, you know, we should include both reward-related and sociability-related items. But it’s not as if there’s a definitive and unambiguous answer to the question “what behaviors are part of the Extraversion construct?”. There’s a good deal of variation in human behavior that could in principle be construed as part of the latent Extraversion construct, but that in practice is likely to be overlooked (or deliberately omitted) by any particular measure of Extraversion. So we have to carefully explore the space. And one reasonable way to determine whether any given item within that space is still measuring Extraversion is to inspect its correlations with other items that we consider to be unambiguous Extraversion items. If an item correlates, say, 0.5 with items like “I love big parties” and “I constantly seek out social interactions”, there’s a reasonable case to be made that it measures at least some aspects of Extraversion. So we might decide to keep it in our measure. Conversely, if an item shows very low correlations with other putative Extraversion items, we might be inclined to throw it out.
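In code, that retention heuristic is just an item-anchor correlation check. A minimal sketch (the data layout, column indices, and threshold here are made up for illustration):

import numpy as np

def keep_item(responses, candidate_col, anchor_cols, threshold=0.3):
    # responses: an (n_subjects x n_items) array of item scores.
    # Correlate the candidate item with the mean of the trusted anchor items.
    anchor_score = responses[:, anchor_cols].mean(axis=1)
    r = np.corrcoef(responses[:, candidate_col], anchor_score)[0, 1]
    return r >= threshold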

Now, there’s nothing intrinsically wrong with this strategy. But what’s important to realize is that, once we’ve settled on a measure we’re happy with, there’s no longer a good reason to keep all of that redundancy hanging around. It may be useful when we first explore the territory, but as soon as we yell out FIN! and put down our protractors and levels (or whatever it is the kids are using to create new measures these days), it’s now just costing us time and money by making data collection less efficient. We would be better off saying something like, hey, now that we know what we’re trying to measure, let’s see if we can measure it equally well with fewer items. And at that point, we’re in the land of criterion-based measure development, where the primary goal is to predict some target criterion as accurately as possible, foggy notions of internal consistency be damned.

Unfortunately, committing ourselves fully to the noble and just cause of more efficient measurement still leaves open the question of just how we should go about eliminating items from our overly long measures. For that, you’ll have to stay tuned for Part II, wherein I use many flowery words and some concise Python code to try to convince you that this piece of software provides one reasonable way to go about it.

* On a tangential note, this is why traditional pre-publication peer review isn’t very effective, and is in dire need of replacement. Meta-analytic estimates put the inter-reviewer reliability across fields at around .2 to .3, and it’s rare to have more than two or three reviewers on a paper. No psychometrician would recommend evaluating people’s performance in high-stakes situations with just two items that have a ~.3 correlation, yet that’s how we evaluate nearly all of the scientific literature!

yet another Python state machine (and why you might care)

TL;DR: I wrote a minimalistic state machine implementation in Python. You can find the code on GitHub. The rest of this post explains what a state machine is and why you might (or might not) care. The post is slanted towards scientists who are technically inclined but lack formal training in computer science or software development. If you just want some documentation or examples, see the README.

A common problem that arises in many software applications is the need to manage an application’s trajectory through a set of discrete states. This problem will be familiar, for instance, to almost every researcher who has ever had to program an experiment for a study involving human subjects: there are typically a number of different states your study can be in (informed consent, demographic information, stimulus presentation, response collection, etc.), and these states are governed by a set of rules that determine the valid progression of your participants from one state to another. For example, a participant can proceed from informed consent to a cognitive task, but never the reverse (on pain of entering IRB hell!).

In the best possible case, the transition rules are straightforward. For example, given states [A, B, C, D], life would be simple if the only valid transitions were A –> B, B –> C, and C –> D. Unfortunately, the real world is more complicated, and state transitions are rarely completely sequential. More commonly, at least some states have multiple potential destinations. Sometimes the identity of the next state depends on meeting certain conditions while in the current state (e.g., if the subject responded incorrectly, the study may transition to a different state than if they had responded correctly); other times the rules may be probabilistic, or depend on the recent trajectory through state space (e.g., a slot machine transitions to a winning or losing state with some fixed probability that may also depend on its current position, recent history, etc.).

In software development, a standard method for dealing with this kind of problem is to use something called a finite-state machine (FSM). FSMs have been around a relatively long time (at least since Mealy and Moore’s work in the 1950s), and have all kinds of useful applications. In a nutshell, what a good state machine implementation does is represent much of the messy logic governing state transitions in a more abstract, formal and clean way. Rather than having to write a lot of complicated nested logic to direct the flow of the application through state space, one can usually get away with a terse description of (a) the possible states of the machine and (b) a list of possible transitions, including a specification of the source and destination states for each transition, what conditions must be met in order for the transition to execute, etc.

For example, suppose you need to write some code to transition between different phases in an online experiment. Your naive implementation might look vaguely like this (leaving out a lot of supporting code and focusing just on the core logic):
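Something along these lines, say (a minimal sketch; helpers like signed_consent_form() and save_survey_responses() are hypothetical stand-ins for real application code):

def advance(session):
    # One big chain of conditionals deciding which phase comes next.
    if session.phase == 'consent':
        if signed_consent_form(session):
            session.phase = 'demographics'
    elif session.phase == 'demographics':
        if validate_demographics(session):
            save_demographics(session)
            session.phase = 'personality'
    elif session.phase == 'personality':
        save_survey_responses(session)
        session.phase = 'task'
    elif session.phase == 'task':
        session.phase = 'debriefing'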

This is a minimalistic example, but already, it illustrates several common scenarios–e.g., that the transition from one state to another often depends on meeting some specified condition (we don’t advance beyond the informed consent stage until the user signs the document), and that there may be some actions we want to issue immediately before or after a particular kind of transition (e.g., we save survey responses before we move onto the next phase).

The above code is still quite manageable, so if things never get any more complex than this, there may be no reason to abandon a (potentially lengthy) chain of conditionals in favor of a fundamentally different approach. But trouble tends to arise when the complexity does increase–e.g., you need to throw a few more states into the mix later on–or when you need to move stuff around (e.g., you decide to administer the task before the demographic survey). If you’ve ever had the frustrating experience of tracing the flow of your app through convoluted logic scattered across several files, and being unable to figure out why your code is entering the wrong state in response to some triggered event, the state machine pattern may be right for you.

I’ve made extensive use of state machines in the past when building online studies, and finding a suitable implementation has never been a problem. For example, in Rails–which is what most of my apps have been built in–there are a number of excellent options, including the state_machine plugin and (more recently) Statesman. In the last year or two, though, I’ve begun to transition all of my web development to Python (if you want to know why, read this). Python is a very common language, and the basic FSM pattern is very simple, so there are dozens of Python FSM implementations out there. But for some reason, very few of the Python implementations are as elegant and usable as their Ruby analogs. This isn’t to say there aren’t some nice ones (I’m partial to Fysom, for instance)–just that none of them quite meet my needs (in particular, there are very few fully object-oriented implementations, and I like to have my state machine tightly coupled with the model it’s managing). So I decided to write one. It’s called Transitions, and you can find the code on GitHub, or install it directly from the command prompt (“pip install transitions”, assuming you have pip installed). It’s very lightweight–fewer than 200 lines of code (the documentation is about 10 times as long!)–but still turns out to be quite functional.

For example, here’s some code that does almost exactly the same thing as what we saw above (there are much more extensive examples and documentation in the GitHub README):
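A sketch of what that might look like (the state names and callbacks here simply mirror the conditional version above; consult the GitHub README for the actual, complete examples):

from transitions import Machine

class Study(object):
    # Condition checks and callbacks, referenced by name in the transitions below.
    def signed_consent_form(self):
        return True  # stand-in: verify that the consent form was signed
    def validate_demographics(self):
        return True  # stand-in: validate the demographic responses
    def save_demographics(self):
        pass         # stand-in: persist demographic responses
    def save_survey_responses(self):
        pass         # stand-in: persist personality survey responses

states = ['consent', 'demographics', 'personality', 'task', 'debriefing']

transitions = [
    {'trigger': 'advance', 'source': 'consent', 'dest': 'demographics',
     'conditions': 'signed_consent_form'},
    {'trigger': 'advance', 'source': 'demographics', 'dest': 'personality',
     'conditions': 'validate_demographics', 'before': 'save_demographics'},
    {'trigger': 'advance', 'source': 'personality', 'dest': 'task',
     'before': 'save_survey_responses'},
    {'trigger': 'advance', 'source': 'task', 'dest': 'debriefing'},
]

study = Study()
machine = Machine(model=study, states=states, transitions=transitions,
                  initial='consent')

# Calling study.advance() now moves the model through the phases,
# checking conditions and firing callbacks along the way.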

That’s it! And now we have a nice object-oriented state machine that elegantly transitions between phases of the study, triggers callback functions as needed, and supports conditional transitions, branching, and various other nice features, all without ever having to write a single explicit conditional or for-loop. Understanding what’s going on is as simple as looking at the specification of the states and transitions. For example, we can tell at a glance from the second transition that if the model is currently in the ‘demographics’ state, calling advance() will effect a transition to the ‘personality’ state–conditional on the validate_demographics() function returning True. Also, right before the transition executes, the save_demographics() callback will be called.

As I noted above, given the simplicity of the example, this may not seem like a huge win. If anything, the second snippet is slightly longer than the first. But it’s also much clearer (once you’re familiar with the semantics of Transitions), scales much better as complexity increases, and will be vastly easier to modify when you need to change anything.

Anyway, I mention all of this here for two reasons. First, as small and simple a project as this is, I think it ended up being one of the more elegant and functional minimalistic Python FSMs–so I imagine a few other people might find it useful (yes, I’m basically just exploiting my PageRank on Google to drive traffic to GitHub). And second, I know many people who read this blog are researchers who regularly program experiments, but probably haven’t encountered state machines before. So, Python implementation aside, the general idea that there’s a better way to manage complex state transitions than writing a lot of ugly logic seems worth spreading.

In defense of In Defense of Facebook

A long, long time ago (in social media terms), I wrote a post defending Facebook against accusations of ethical misconduct related to a newly-published study in PNAS. I won’t rehash the study, or the accusations, or my comments in any detail here; for that, you can read the original post (I also recommend reading this or this for added context). While I stand by most of what I wrote, as is the nature of things, sometimes new information comes to light, and sometimes people say things that make me change my mind. So I thought I’d post my updated thoughts and reactions. I also left some additional thoughts in a comment on my last post, which I won’t rehash here.

Anyway, in no particular order…

I’m not arguing for a lawless world where companies can do as they like with your data

Some people apparently interpreted my last post as a defense of Facebook’s data use policy in general. It wasn’t. I probably brought this on myself in part by titling the post “In Defense of Facebook”. Maybe I should have called it something like “In Defense of this one particular study done by one Facebook employee”. In any case, I’ll reiterate: I’m categorically not saying that Facebook–or any other company, for that matter–should be allowed to do whatever it likes with its users’ data. There are plenty of valid concerns one could raise about the way companies like Facebook store, manage, and use their users’ data. And for what it’s worth, I’m generally in favor of passing new rules regulating the use of personal data in the private sector. So, contrary to what some posts suggested, I was categorically not advocating for a laissez-faire world in which large corporations get to do as they please with your information, and there’s nothing us little people can do about it.

The point I made in my last post was much narrower than that–namely, that picking on the PNAS study as an example of ethically questionable practices at Facebook was a bad idea, because (a) there aren’t any new risks introduced by this manipulation that aren’t already dwarfed by the risks associated with using Facebook itself (which is not exactly a high-risk enterprise to begin with), and (b) there are literally thousands of experiments just like this being conducted every day by large companies intent on figuring out how best to market their products and services–so Facebook’s study doesn’t stand out in any respect. My point was not that you shouldn’t be concerned about who has your data and how they’re using it, but that it’s deeply counterproductive to go after Facebook for this particular experiment when Facebook is one of the few companies in this arena who actually (occasionally) publish the results of their findings in the scientific literature, instead of hiding them entirely from the light, as almost everyone else does. Of course, that will probably change as a result of this controversy.

I Was Wrong–A/B Testing Edition.

One claim I made in my last post that was very clearly wrong is this (emphasis added):

What makes the backlash on this issue particularly strange is that I’m pretty sure most people do actually realize that their experience on Facebook (and on other websites, and on TV, and in restaurants, and in museums, and pretty much everywhere else) is constantly being manipulated. I expect that most of the people who’ve been complaining about the Facebook study on Twitter are perfectly well aware that Facebook constantly alters its user experience–I mean, they even see it happen in a noticeable way once in a while, whenever Facebook introduces a new interface.

After watching the commentary over the past two days, I think it’s pretty clear I was wrong about this. A surprisingly large number of people clearly were genuinely unaware that Facebook, Twitter, Google, and other major players in every major industry (not just tech–also banks, groceries, department stores, you name it) are constantly running large-scale, controlled experiments on their users and customers. For instance, here’s a telling comment left on my last post:

The main issue I have with the experiment is that they conducted it without telling us. Given, that would have been counterproductive, but even a small adverse affect is still an adverse affect. I just don’t like the idea that corporations can do stuff to me without my consent. Just my opinion.

Similar sentiments are all over the place. Clearly, the revelation that Facebook regularly experiments on its users without their knowledge was indeed just that to many people–a revelation. I suppose in this sense, there’s potentially a considerable upside to this controversy, inasmuch as it has clearly served to raise awareness of industry-standard practices.

Questions about the ethics of the PNAS paper’s publication

My post focused largely on the question of whether the experiment Facebook conducted was itself illegal or unethical. I took this to be the primary concern of most lay people who have expressed concern about the episode. As I discussed in my post, I think it’s quite clear that the experiment itself is (a) entirely legal and that (b) any ethical objections one could raise are actually much broader objections about the way we regulate data use and consumer privacy, and have nothing to do with Facebook in particular. However, there’s a separate question that does specifically concern Facebook–or really, the authors of the PNAS paper–which is whether the authors, in their efforts to publish their findings, violated any laws or regulations.

When I wrote my post, I was under the impression–based largely on reports of an interview with the PNAS editor, Susan Fiske–that the authors had in fact obtained approval to conduct the study from an IRB, and had simply neglected to include that information in the text (which would have been an editorial lapse, but not an unethical act). I wrote as much in a comment on my post. I was not suggesting–as some seemed to take away–that Facebook doesn’t need to get IRB approval. I was operating on the assumption that it had obtained IRB approval, based on the information available at the time.

In any case, it now appears that may not be exactly what happened. Unfortunately, it’s not yet clear exactly what did happen. One version of events people have suggested is that the study’s authors exploited a loophole in the rules by having Facebook conduct and analyze the experiment without the involvement of the other authors–who only contributed to the genesis of the idea and the writing of the manuscript. However, this interpretation is not unambiguous, and risks maligning the authors’ reputations unfairly, because Adam Kramer’s post explaining the motivation for the experiment suggests that the idea for the experiment originated entirely at Facebook, and was related to internal needs:

The reason we did this research is because we care about the emotional impact of Facebook and the people that use our product. We felt that it was important to investigate the common worry that seeing friends post positive content leads to people feeling negative or left out. At the same time, we were concerned that exposure to friends’ negativity might lead people to avoid visiting Facebook. We didn’t clearly state our motivations in the paper.

How you interpret the ethics of the study thus depends largely on what you believe actually happened. If you believe that the genesis and design of the experiment were driven by Facebook’s internal decision-making, and the decision to publish an interesting finding came only later, then there’s nothing at all ethically questionable about the authors’ behavior. It would have made no more sense to seek out IRB approval for this one experiment than for any of the other in-house experiments Facebook regularly conducts. And there is, again, no question whatsoever that Facebook does not have to get approval from anyone to do experiments that are not for the purpose of systematic, generalizable research.

Moreover, since the non-Facebook authors did in fact ask the IRB to review their proposal to use archival data–and the IRB exempted them from review, as is routinely done for this kind of analysis–there would be no legitimacy to the claim that the authors acted unethically. About the only claim one could raise an eyebrow at is that the authors “didn’t clearly state” their motivations. But since presenting a post-hoc justification for one’s studies that has nothing to do with the original intention is extremely common in psychology (though it shouldn’t be), it’s not really fair to fault Kramer et al for doing something that is standard practice.

If, on the other hand, the idea for the study did originate outside of Facebook, and the authors deliberately attempted to avoid prospective IRB review, then I think it’s fair to say that their behavior was unethical. However, given that the authors were following the letter of the law (if clearly not the spirit), it’s not clear that PNAS should have, or could have, rejected the paper. It certainly should have demanded that information regarding interactions with the IRB be included in the manuscript, and perhaps it could have published some kind of expression of concern alongside the paper. But I agree with Michelle Meyer’s analysis that, in taking the steps they took, the authors are almost certainly operating within the rules, because (a) Facebook itself is not subject to HHS rules, (b) the non-Facebook authors were not technically “engaged in research”, and (c) the archival use of already-collected data by the non-Facebook authors was approved by the Cornell IRB (or rather, the study was exempted from further review).

Absent clear evidence of what exactly happened in the lead-up to publication, I think the appropriate course of action is to withhold judgment. In the interim, what the episode clearly does do is lay bare how ill-prepared the existing HHS regulations are for dealing with the research use of data collected online–particularly when the data was acquired by private entities. Actually, it’s not just research use that’s problematic; it’s clear that many people complaining about Facebook’s conduct this week don’t really give a hoot about the “generalizable knowledge” side of things, and are fundamentally just upset that Facebook is allowed to run these kinds of experiments at all without providing any notification.

In my view, what’s desperately called for is a new set of regulations that provide a unitary code for dealing with consumer data across the board–i.e., in both research and non-research contexts. This leaves aside exactly what such regulations would look like, of course. My personal view is that the right direction to move in is to tighten consumer protection laws to better regulate management and use of private citizens’ data, while simultaneously liberalizing the research use of private datasets that have already been acquired. For example, I would favor a law that (a) forced Facebook and other companies to more clearly and explicitly state how they use their users’ data, (b) provided opt-out options when possible, along with the ability for users to obtain a report of how their data has been used in the past, and (c) gave blanket approval to use data acquired under these conditions for any and all academic research purposes so long as the data are deidentified. Many people will disagree with this, of course, and have very different ideas. That’s fine; the key point is that the conversation we should be having is about how to update and revise the rules governing research vs. non-research uses of data in such a way that situations like the PNAS study don’t come up again.

What Facebook does is not research–until they try to publish it

Much of the outrage over the Facebook experiment is centered around the perception that Facebook shouldn’t be allowed to conduct research on its users without their consent. What many people mean by this, I think, is that Facebook shouldn’t be allowed to conduct any experiments on its users for purposes of learning things about user experience and behavior unless Facebook explicitly asks for permission. A point that I should have clarified in my original post is that Facebook users are, in the normal course of things, not considered participants in a research study, no matter how or how much their emotions are manipulated. That’s because the HHS’s definition of research includes, as a necessary component, that there be an active intention to contribute to generalizable new knowledge.

Now, to my mind, this isn’t a great way to define “research”–I think it’s a good idea to avoid definitions that depend on knowing what people’s intentions were when they did something. But that’s the definition we’re stuck with, and there’s really no ambiguity over whether Facebook’s normal operations–which include constant randomized, controlled experimentation on its users–constitute research in this sense. They clearly don’t. Put simply, if Facebook were to eschew disseminating its results to the broader community, the experiment in question would not have been subject to any HHS regulations whatsoever (though, as Michelle Meyer astutely pointed out, technically the experiment probably isn’t subject to HHS regulation even now, so the point is moot). Again, to reiterate: it’s only the fact that Kramer et al wanted to publish their results in a scientific journal that opened them up to criticism of research misconduct in the first place.

This observation may not have any impact on your view if your concern is fundamentally about the publication process–i.e., you don’t object to Facebook doing the experiment; what you object to is Facebook trying to disseminate their findings as research. But it should have a strong impact on your views if you were previously under the impression that Facebook’s actions must have violated some existing human subjects regulation or consumer protection law. The laws in the United States–at least as I understand them, and I admittedly am not a lawyer–currently afford you no such protection.

Now, is it a good idea to have two very separate standards, one for research and one for everything else? Probably not. Should Facebook be allowed to do whatever it wants to your user experience so long as it’s covered under the Data Use policy in the user agreement you didn’t read? Probably not. But what’s unequivocally true is that, as it stands right now, your interactions with Facebook–no matter how your user experience, data, or emotions are manipulated–are not considered research unless Facebook manipulates your experience with the express intent of disseminating new knowledge to the world.

Informed consent is not mandatory for research studies

As a last point, there seems to be a very common misconception floating around among commentators that the Facebook experiment was unethical because it didn’t provide informed consent, which is a requirement for all research studies involving experimental manipulation. I addressed this in the comments on my last post in response to other comments:

[I]t’s simply not correct to suggest that all human subjects research requires informed consent. At least in the US (where Facebook is based), the rules governing research explicitly provide for a waiver of informed consent. Directly from the HHS website:

An IRB may approve a consent procedure which does not include, or which alters, some or all of the elements of informed consent set forth in this section, or waive the requirements to obtain informed consent provided the IRB finds and documents that:

(1) The research involves no more than minimal risk to the subjects;

(2) The waiver or alteration will not adversely affect the rights and welfare of the subjects;

(3) The research could not practicably be carried out without the waiver or alteration; and

(4) Whenever appropriate, the subjects will be provided with additional pertinent information after participation.

Granting such waivers is a commonplace occurrence; I myself have had online studies granted waivers before for precisely these reasons. In this particular context, it’s very clear that conditions (1) and (2) are met (because this easily passes the “not different from ordinary experience” test). Further, Facebook can also clearly argue that (3) is met, because explicitly asking for informed consent is likely not viable given internal policy, and would in any case render the experimental manipulation highly suspect (because it would no longer be random). The only point one could conceivably raise questions about is (4), but here again I think there’s a very strong case to be made that Facebook is not about to start providing debriefing information to users every time it changes some aspect of the news feed in pursuit of research, considering that its users have already agreed to its User Agreement, which authorizes this and much more.

Now, if you disagree with the above analysis, that’s fine, but what should be clear enough is that there are many IRBs (and I’ve personally interacted with some of them) that would have authorized a waiver of consent in this particular case without blinking. So this is clearly well within “reasonable people can disagree” territory, rather than “oh my god, this is clearly illegal and unethical!” territory.

I can understand the objection that Facebook should have applied for IRB approval prior to conducting the experiment (though, as I note above, that’s only true if the experiment was initially conducted as research, which is not clear right now). However, it’s important to note that there is no guarantee that an IRB would have insisted on informed consent at all in this case. There’s considerable heterogeneity in different IRBs’ interpretation of the HHS guidelines (and in fact, even across different reviewers within the same IRB), and I don’t doubt that many IRBs would have allowed Facebook’s application to sail through without any problems (see, e.g., this comment on my last post)–though I think there’s a general consensus that a debriefing of some kind would almost certainly be requested.

In defense of Facebook

[UPDATE July 1st: I’ve now posted some additional thoughts in a second post here.]

It feels a bit strange to write this post’s title, because I don’t find myself defending Facebook very often. But there seems to be some discontent in the socialmediaverse at the moment over a new study in which Facebook data scientists conducted a large-scale–over half a million participants!–experimental manipulation on Facebook in order to show that emotional contagion occurs on social networks. The news that Facebook has been actively manipulating its users’ emotions has, apparently, enraged a lot of people.

The study

Before getting into the sources of that rage–and why I think it’s misplaced–though, it’s worth describing the study and its results. Here’s a description of the basic procedure, from the paper:

The experiment manipulated the extent to which people (N = 689,003) were exposed to emotional expressions in their News Feed. This tested whether exposure to emotions led people to change their own posting behaviors, in particular whether exposure to emotional content led people to post content that was consistent with the exposure—thereby testing whether exposure to verbal affective expressions leads to similar verbal expressions, a form of emotional contagion. People who viewed Facebook in English were qualified for selection into the experiment. Two parallel experiments were conducted for positive and negative emotion: One in which exposure to friends’ positive emotional content in their News Feed was reduced, and one in which exposure to negative emotional content in their News Feed was reduced. In these conditions, when a person loaded their News Feed, posts that contained emotional content of the relevant emotional valence, each emotional post had between a 10% and 90% chance (based on their User ID) of being omitted from their News Feed for that specific viewing.

And here’s their central finding:

What the figure shows is that, in the experimental conditions, where negative or positive emotional posts are censored, users produce correspondingly more positive or negative emotional words in their own status updates. Reducing the number of negative emotional posts users saw led those users to produce more positive, and fewer negative words (relative to the unmodified control condition); conversely, reducing the number of presented positive posts led users to produce more negative and fewer positive words of their own.

Taken at face value, these results are interesting and informative. For the sake of contextualizing the concerns I discuss below, though, two points are worth noting. First, these effects, while highly statistically significant, are tiny. The largest effect size reported had a Cohen’s d of 0.02–meaning that eliminating a substantial proportion of emotional content from a user’s feed had the monumental effect of shifting that user’s own emotional word use by two hundredths of a standard deviation. In other words, the manipulation had a negligible real-world impact on users’ behavior. To put it in intuitive terms, the effect of condition in the Facebook study is roughly comparable to a hypothetical treatment that increased the average height of the male population in the United States by about one twentieth of an inch (given a standard deviation of ~2.8 inches). Theoretically interesting, perhaps, but not very meaningful in practice.
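
For the record, here's that back-of-the-envelope conversion spelled out as a tiny sketch (the ~2.8-inch SD for adult male height is the approximation used above, not a number from the paper):

```python
# Back-of-the-envelope: what a Cohen's d of 0.02 means in intuitive units.

d = 0.02                 # largest effect size reported in the Facebook study
height_sd_inches = 2.8   # approximate SD of adult US male height (assumption)

shift_in_sd_units = d                    # d is already expressed in SD units
shift_in_inches = d * height_sd_inches   # translate into the height analogy

print(f"Shift: {shift_in_sd_units:.2f} SD, or about {shift_in_inches:.3f} inches")
# -> Shift: 0.02 SD, or about 0.056 inches (roughly one twentieth of an inch)
```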

Second, the fact that users in the experimental conditions produced content with very slightly more positive or negative emotional content doesn’t mean that those users actually felt any differently. It’s entirely possible–and I would argue, even probable–that much of the effect was driven by changes in the expression of ideas or feelings that were already on users’ minds. For example, suppose I log onto Facebook intending to write a status update to the effect that I had an “awesome day today at the beach with my besties!” Now imagine that, as soon as I log in, I see in my news feed that an acquaintance’s father just passed away. I might very well think twice about posting my own message–not necessarily because the news has made me feel sad myself, but because it surely seems a bit unseemly to celebrate one’s own good fortune around people who are currently grieving. I would argue that such subtle behavioral changes, while certainly responsive to others’ emotions, shouldn’t really be considered genuine cases of emotional contagion. Yet given how small the effects were, one wouldn’t need very many such changes to occur in order to produce the observed results. So, at the very least, the jury should still be out on the extent to which Facebook users actually feel differently as a result of this manipulation.

The concerns

Setting aside the rather modest (though still interesting!) results, let's turn to look at the criticism. Here's what Katy Waldman, writing in a Slate piece titled “Facebook's Unethical Experiment”, had to say:


The researchers, who are affiliated with Facebook, Cornell, and the University of California–San Francisco, tested whether reducing the number of positive messages people saw made those people less likely to post positive content themselves. The same went for negative messages: Would scrubbing posts with sad or angry words from someone’s Facebook feed make that person write fewer gloomy updates?

The upshot? Yes, verily, social networks can propagate positive and negative feelings!

The other upshot: Facebook intentionally made thousands upon thousands of people sad.

Or consider an article in The Wire, quoting Jacob Silverman:

“What’s disturbing about how Facebook went about this, though, is that they essentially manipulated the sentiments of hundreds of thousands of users without asking permission (blame the terms of service agreements we all opt into). This research may tell us something about online behavior, but it’s undoubtedly more useful for, and more revealing of, Facebook’s own practices.”

On Twitter, the reaction to the study has been similarly negative. A lot of people appear to be very upset at the revelation that Facebook would actively manipulate its users' news feeds in a way that could potentially influence their emotions.

Why the concerns are misplaced

To my mind, the concerns expressed in the Slate piece and elsewhere are misplaced, for several reasons. First, they largely mischaracterize the study’s experimental procedures–to the point that I suspect most of the critics haven’t actually bothered to read the paper. In particular, the suggestion that Facebook “manipulated users’ emotions” is quite misleading. Framing it that way tacitly implies that Facebook must have done something specifically designed to induce a different emotional experience in its users. In reality, for users assigned to the experimental condition, Facebook simply removed a variable proportion of status messages that were automatically detected as containing positive or negative emotional words. Let me repeat that: Facebook removed emotional messages for some users. It did not, as many people seem to be assuming, add content specifically intended to induce specific emotions. Now, given that a large amount of content on Facebook is already highly emotional in nature–think about all the people sharing their news of births, deaths, break-ups, etc.–it seems very hard to argue that Facebook would have been introducing new risks to its users even if it had presented some of them with more emotional content. But it’s certainly not credible to suggest that replacing 10% – 90% of emotional content with neutral content constitutes a potentially dangerous manipulation of people’s subjective experience.
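
To make the mechanics concrete, here's a rough sketch of the kind of filtering the paper describes. This is emphatically not Facebook's code: the hash-based mapping from user ID to omission probability is my own invented stand-in for whatever "based on their User ID" actually means, and the post structure is made up for illustration.

```python
import hashlib
import random

def omission_probability(user_id, low=0.10, high=0.90):
    """Map a user ID to a fixed omission probability between 10% and 90%.
    (Hypothetical: the paper says only that the chance was 'based on their User ID'.)"""
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return low + ((digest % 1000) / 999.0) * (high - low)

def filter_feed(posts, user_id, reduced_valence):
    """Drop each post of the targeted valence with the user's omission probability;
    everything else passes through untouched. The post dicts are invented for illustration."""
    p_omit = omission_probability(user_id)
    return [post for post in posts
            if post["valence"] != reduced_valence or random.random() >= p_omit]

# Toy usage: a user assigned to the negativity-reduced condition.
feed = [{"text": "great day at the beach!", "valence": "positive"},
        {"text": "terrible news today...", "valence": "negative"},
        {"text": "had a sandwich.", "valence": "neutral"}]
print(filter_feed(feed, user_id="12345", reduced_valence="negative"))
```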

Second, it’s not clear what the notion that Facebook users’ experience is being “manipulated” really even means, because the Facebook news feed is, and has always been, a completely contrived environment. I hope that people who are concerned about Facebook “manipulating” user experience in support of research realize that Facebook is constantly manipulating its users’ experience. In fact, by definition, every single change Facebook makes to the site alters the user experience, since there simply isn’t any experience to be had on Facebook that isn’t entirely constructed by Facebook. When you log onto Facebook, you’re not seeing a comprehensive list of everything your friends are doing, nor are you seeing a completely random subset of events. In the former case, you would be overwhelmed with information, and in the latter case, you’d get bored of Facebook very quickly. Instead, what you’re presented with is a carefully curated experience that is, from the outset, crafted in such a way as to create a more engaging experience (read: keeps you spending more time on the site, and coming back more often). The items you get to see are determined by a complex and ever-changing algorithm that you make only a partial contribution to (by indicating what you like, what you want hidden, etc.). It has always been this way, and it’s not clear that it could be any other way. So I don’t really understand what people mean when they sarcastically suggest–as Katy Waldman does in her Slate piece–that “Facebook reserves the right to seriously bum you out by cutting all that is positive and beautiful from your news feed”. Where does Waldman think all that positive and beautiful stuff comes from in the first place? Does she think it spontaneously grows wild in her news feed, free from the meddling and unnatural influence of Facebook engineers?

Third, if you were to construct a scale of possible motives for manipulating users' behavior–with the global betterment of society at one end, and something really bad at the other end–I submit that conducting basic scientific research would almost certainly be much closer to the former end than would the other standard motives we find on the web–like trying to get people to click on more ads. The reality is that Facebook–and virtually every other large company with a major web presence–is constantly conducting large controlled experiments on user behavior. Data scientists and user experience researchers at Facebook, Twitter, Google, etc. routinely run dozens, hundreds, or thousands of experiments a day, all of which involve random assignment of users to different conditions. Typically, these manipulations aren't conducted in order to test basic questions about emotional contagion; they're conducted with the explicit goal of helping to increase revenue. In other words, if the idea that Facebook would actively try to manipulate your behavior bothers you, you should probably stop reading this right now and go close your account. You also should definitely not read this paper suggesting that a single social message on Facebook prior to the last US presidential election may have single-handedly increased national voter turnout by as much as 0.6%. Oh, and you should probably also stop using Google, YouTube, Yahoo, Twitter, Amazon, and pretty much every other major website–because I can assure you that, in every single case, there are people out there who get paid a good salary to… yes, manipulate your emotions and behavior! For better or worse, this is the world we live in. If you don't like it, you can abandon the internet, or at the very least close all of your social media accounts. But the suggestion that Facebook is doing something unethical simply by publishing the results of one particular experiment among thousands–and in this case, an experiment featuring a completely innocuous design that, if anything, is probably less driven by a profit motive than most of what Facebook does–seems kind of absurd.

Fourth, it’s worth keeping in mind that there’s nothing intrinsically evil about the idea that large corporations might be trying to manipulate your experience and behavior. Everybody you interact with–including every one of your friends, family, and colleagues–is constantly trying to manipulate your behavior in various ways. Your mother wants you to eat more broccoli; your friends want you to come get smashed with them at a bar; your boss wants you to stay at work longer and take fewer breaks. We are always trying to get other people to feel, think, and do certain things that they would not otherwise have felt, thought, or done. So the meaningful question is not whether people are trying to manipulate your experience and behavior, but whether they’re trying to manipulate you in a way that aligns with or contradicts your own best interests. The mere fact that Facebook, Google, and Amazon run experiments intended to alter your emotional experience in a revenue-increasing way is not necessarily a bad thing if in the process of making more money off you, those companies also improve your quality of life. I’m not taking a stand one way or the other, mind you, but simply pointing out that without controlled experimentation, the user experience on Facebook, Google, Twitter, etc. would probably be very, very different–and most likely less pleasant. So before we lament the perceived loss of all those “positive and beautiful” items in our Facebook news feeds, we should probably remind ourselves that Facebook’s ability to identify and display those items consistently is itself in no small part a product of its continual effort to experimentally test its offering by, yes, experimentally manipulating its users’ feelings and thoughts.

What makes the backlash on this issue particularly strange is that I’m pretty sure most people do actually realize that their experience on Facebook (and on other websites, and on TV, and in restaurants, and in museums, and pretty much everywhere else) is constantly being manipulated. I expect that most of the people who’ve been complaining about the Facebook study on Twitter are perfectly well aware that Facebook constantly alters its user experience–I mean, they even see it happen in a noticeable way once in a while, whenever Facebook introduces a new interface. Given that Facebook has over half a billion users, it’s a foregone conclusion that every tiny change Facebook makes to the news feed or any other part of its websites induces a change in millions of people’s emotions. Yet nobody seems to complain about this much–presumably because, when you put it this way, it seems kind of silly to suggest that a company whose business model is predicated on getting its users to use its product more would do anything other than try to manipulate its users into, you know, using its product more.

Why the backlash is deeply counterproductive

Now, none of this is meant to suggest that there aren’t legitimate concerns one could raise about Facebook’s more general behavior–or about the immense and growing social and political influence that social media companies like Facebook wield. One can certainly question whether it’s really fair to expect users signing up for a service like Facebook’s to read and understand user agreements containing dozens of pages of dense legalese, or whether it would make sense to introduce new regulations on companies like Facebook to ensure that they don’t acquire or exert undue influence on their users’ behavior (though personally I think that would be unenforceable and kind of silly). So I’m certainly not suggesting that we give Facebook, or any other large web company, a free pass to do as it pleases. What I am suggesting, however, is that even if your real concerns are, at bottom, about the broader social and political context Facebook operates in, using this particular study as a lightning rod for criticism of Facebook is an extremely counterproductive, and potentially very damaging, strategy.

Consider: by far the most likely outcome of the backlash Facebook is currently experiencing is that, in future, its leadership will be less likely to allow its data scientists to publish their findings in the scientific literature. Remember, Facebook is not a research institute expressly designed to further understanding of the human condition; it’s a publicly-traded corporation that exists to create wealth for its shareholders. Facebook doesn’t have to share any of its data or findings with the rest of the world if it doesn’t want to; it could comfortably hoard all of its knowledge and use it for its own ends, and no one else would ever be any wiser for it. The fact that Facebook is willing to allow its data science team to spend at least some of its time publishing basic scientific research that draws on Facebook’s unparalleled resources is something to be commended, not criticized.

There is little doubt that the present backlash will do absolutely nothing to deter Facebook from actually conducting controlled experiments on its users, because A/B testing is a central component of pretty much every major web company’s business strategy at this point–and frankly, Facebook would be crazy not to try to empirically determine how to improve user experience. What criticism of the Kramer et al article will almost certainly do is decrease the scientific community’s access to, and interaction with, one of the largest and richest sources of data on human behavior in existence. You can certainly take a dim view of Facebook as a company if you like, and you’re free to critique the way they do business to your heart’s content. But haranguing Facebook and other companies like it for publicly disclosing scientifically interesting results of experiments that it is already constantly conducting anyway–and that are directly responsible for many of the positive aspects of the user experience–is not likely to accomplish anything useful. If anything, it’ll only ensure that, going forward, all of Facebook’s societally relevant experimental research is done in the dark, where nobody outside the company can ever find out–or complain–about it.

[UPDATE July 1st: I’ve posted some additional thoughts in a second post here.]

There is no ceiling effect in Johnson, Cheung, & Donnellan (2014)

This is not a blog post about bullying, negative psychology or replication studies in general. Those are important issues, and a lot of ink has been spilled over them in the past week or two. But this post isn’t about those issues (at least, not directly). This post is about ceiling effects. Specifically, the ceiling effect purportedly present in a paper in Social Psychology, in which Johnson, Cheung, and Donnellan report the results of two experiments that failed to replicate an earlier pair of experiments by Schnall, Benton, and Harvey.

If you’re not up to date on recent events, I recommend reading Vasudevan Mukunth’s post, which provides a nice summary. If you still want to know more after that, you should probably take a gander at the original paper by Schnall, Benton, & Harvey and the replication paper. Still want more? Go read Schnall’s rebuttal. Then read the rejoinder to the rebuttal. Then read Schnall’s first and second blog posts. And maybe a number of other blog posts (here, here, here, and here). Oh, and then, if you still haven’t had enough, you might want to skim the collected email communications between most of the parties in question, which Brian Nosek has been kind enough to curate.

I’m pointing you to all those other sources primarily so that I don’t have to wade very deeply into the overarching issues myself–because (a) they’re complicated, (b) they’re delicate, and (c) I’m still not entirely sure exactly how I feel about them. However, I do have a fairly well-formed opinion about the substantive issue at the center of Schnall’s published rebuttal–namely, the purported ceiling effect that invalidates Johnson et al’s conclusions. So I thought I’d lay that out here in excruciating detail. I’ll warn you right now that if your interests lie somewhere other than the intersection of psychology and statistics (which they probably should), you probably won’t enjoy this post very much. (If your interests do lie at the intersection of psychology and statistics, you’ll probably give this post a solid “meh”.)

Okay, with all the self-handicapping out of the way, let’s get to it. Here’s what I take to be…

Schnall’s argument

The crux of Schnall’s criticism of the Johnson et al replication is a purported ceiling effect. What, you ask, is a ceiling effect? Here’s Schnall’s definition:

A ceiling effect means that responses on a scale are truncated toward the top end of the scale. For example, if the scale had a range from 1-7, but most people selected “7”, this suggests that they might have given a higher response (e.g., “8” or “9”) had the scale allowed them to do so. Importantly, a ceiling effect compromises the ability to detect the hypothesized influence of an experimental manipulation. Simply put: With a ceiling effect it will look like the manipulation has no effect, when in reality it was unable to test for such an effects in the first place. When a ceiling effect is present no conclusions can be drawn regarding possible group differences.

This definition has some subtle-but-important problems we’ll come back to, but it’s reasonable as a first approximation. With this definition in mind, here’s how Schnall describes her core analysis, which she uses to argue that Johnson et al’s results are invalid:

Because a ceiling effect on a dependent variable can wash out potential effects of an independent variable (Hessling, Traxel & Schmidt, 2004), the relationship between the percentage of extreme responses and the effect of the cleanliness manipulation was examined. First, using all 24 item means from original and replication studies, the effect of the manipulation on each item was quantified. … Second, for each dilemma the percentage of extreme responses averaged across neutral and clean conditions was computed. This takes into account the extremity of both conditions, and therefore provides an unbiased indicator of ceiling per dilemma. … Ceiling for each dilemma was then plotted relative to the effect of the cleanliness manipulation (Figure 1).

We can (and will) quibble with these analysis choices, but the net result of the analysis is this:

[Figure: Schnall's Figure 1, plotting normalized effect size against extremity of item response for the original and replication items]

Here, we see normalized effect size (y-axis) plotted against extremity of item response (x-axis). Schnall's basic argument is that there's a strong inverse relationship between the extremity of responses to an item and the size of the experimental effect on that item. In other words, items with extreme responses don't show an effect, whereas items with non-extreme responses do show an effect. She goes on to note that this pattern is fully accounted for by her own original experiments, and that there is no such relationship in Johnson et al's data. On the basis of this finding, Schnall concludes that:

Scores are compressed toward the top end of the scale and therefore show limited determinate variance near ceiling. Because a significance test compares variance due to a manipulation to variance due to error, an observed lack of effect can result merely from a lack in variance that would normally be associated with a manipulation. Given the observed ceiling effect, a statistical artefact, the analyses reported by Johnson et al. (2014a) are invalid and allow no conclusions about the reproducibility of the original findings.

Problems with the argument

One can certainly debate over what the implications would be even if Schnall’s argument were correct; for instance, it’s debatable whether the presence of a ceiling effect would actually invalidate Johnson et al’s conclusions that they had failed to replicate Schnall et al. An alternative and reasonable interpretation is that Johnson et al would have simply identified important boundary conditions under which the original effect doesn’t work (e.g., that it doesn’t hold in Michigan residents), since they were using Schnall’s original measures. But we don’t have to worry about that in any case, because there are several serious problems with Schnall’s argument. Some of them have to do with the statistical analysis she performs to make her point; some of them have to do with subtle mischaracterizations of what ceiling effects are and where they come from; and some of them have to do with the fact that Schnall’s data actually directly contradict her own argument. Let’s take each of these in turn.

Problems with the analysis

A first problem with Schnall’s analysis is that the normalization procedure she uses to make her point is biased. Schnall computes the normalized effect size for each item as:

(M1 – M2)/(M1 + M2)

Where M1 and M2 are the means for each item in the two experimental conditions (neutral and clean). This transformation is supposed to account for the fact that scores are compressed at the upper end of the scale, near the ceiling.

What Schnall fails to note, however, is that compression should also occur at the bottom of the scale, near the floor. For example, suppose an individual item has means of 1.2 and 1.4. Then Schnall's normalized effect size estimate would be 0.2/2.6 ≈ 0.08. But if the means had been 4.0 and 4.2–the same absolute difference–then the adjusted estimate would actually be much smaller (around 0.02). So Schnall's analysis is actually biased in favor of detecting the negative correlation she takes as evidence of a ceiling effect, because she's not accounting for floor effects simultaneously. A true “clipping” or compression of scores shouldn't occur at only one extreme of the scale; what should matter is how far from the midpoint a response happens to be. What should happen, if Schnall were to recompute the scores in Figure 1 using a modified criterion (e.g., relative deviation from the scale's midpoint, rather than absolute score), is that the points at the top left of the figure should pull towards the y-axis to some degree, effectively reducing the slope she takes as evidence of a problem. If there's any pattern that would suggest a measurement problem, it's actually an inverted U-shape, where normalized effects are greatest for items with means nearest the midpoint, and smallest for items at both extremes, not just near ceiling. But that's not what we're shown.
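
Here's that arithmetic as a quick sketch (the item means are the hypothetical ones just mentioned, not values from either paper):

```python
def schnall_normalized(m1, m2):
    """Schnall's per-item normalization: (M1 - M2) / (M1 + M2)."""
    return (m1 - m2) / (m1 + m2)

# The same absolute difference of 0.2 between condition means...
print(schnall_normalized(1.4, 1.2))  # ~0.077: item means near the bottom of the scale
print(schnall_normalized(4.2, 4.0))  # ~0.024: item means nearer the middle
# The denominator scales with the item means, so identical raw differences look
# systematically larger for low-mean (floor-compressed) items -- which is exactly
# the kind of bias that favors the negative slope Schnall reports.
```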

A second problem is that Schnall’s data actually contradict her own conclusion. She writes:

Across the 24 dilemmas from all 4 experiments, dilemmas with a greater percentage of extreme responses were associated with lower effect sizes (r = -.50, p = .01, two-tailed). This negative correlation was entirely driven by the 12 original items, indicating that the closer responses were to ceiling, the smaller was the effect of the manipulation (r = -.49, p = .10). In contrast, across the 12 replication items there was no correlation (r = .11, p = .74).

But if anything, these results provide evidence of a ceiling effect only in Schnall’s original study, and not in the Johnson et al replications. Recall that Schnall’s argument rests on two claims: (a) effects are harder to detect the more extreme responding on an item gets, and (b) responding is so extreme on the items in the Johnson et al experiments that nothing can be detected. But the results she presents blatantly contradict the second claim. Had there been no variability in item means in the Johnson et al studies, Schnall could have perhaps argued that restriction of range is so extreme that it is impossible to detect any kind of effect. In practice, however, that’s not the case. There is considerable variability along the x-axis, and in particular, one can clearly see that there are two items in Johnson et al that are nowhere near ceiling and yet show no discernible normalized effect of experimental condition at all. Note that these are the very same items that show some of the strongest effects in Schnall’s original study. In other words, the data Schnall presents in support of her argument actually directly contradict her argument. If one is to believe that a ceiling effect is preventing Schnall’s effect from emerging in Johnson et al’s replication studies, then there is no reasonable explanation for the fact that those two leftmost red squares in the figure above are close to the y = 0 line. They should be behaving exactly like they did in Schnall’s study–which is to say, they should be showing very large normalized effects–even if items at the very far right show no effects at all.

Third, Schnall's argument that a ceiling effect completely invalidates Johnson et al's conclusions is a gross exaggeration. Ceiling effects are not all-or-none; the degree of score compression into the upper end of a measure will vary continuously (unless there is literally no variance at all in the responses, which is clearly not the case here). Even if we took at face value Schnall's finding that there's an inverse relationship between effect size and extremity in her original data (r = -0.5), all this would tell us is that there's some compression of scores. Schnall's suggestion that "given the observed ceiling effect, a statistical artifact, the analyses reported in Johnson et al (2014a) are invalid and allow no conclusions about the reproducibility of the original findings" is simply false. Even in the very best case scenario (which this obviously isn't), the very strongest claim Schnall could comfortably make is that there may be some compression of scores, with unknown impact on the detectable effect size. It is simply not credible for Schnall to suggest that the mere presence of something that looks vaguely like a ceiling effect is sufficient to completely rule out detection of group differences in the Johnson et al experiments. And we know this with 100% certainty, because…

There are robust group differences in the replication experiments

Perhaps the clearest refutation of Schnall’s argument for a ceiling effect is that, as Johnson et al noted in their rejoinder, the Johnson et al experiments did in fact successfully identify some very clear group differences (and, ironically, ones that were also present in Schnall’s original experiments). Specifically, Johnson et al showed a robust effect of gender on vignette ratings. Here’s what the results look like:

We can see clearly that, in both replication experiments, there’s a large effect of gender but no discernible effect of experimental condition. This pattern directly refutes Schnall’s argument. She cannot have it both ways: if a ceiling effect precludes the presence of group differences, then there cannot be a ceiling effect in the replication studies, or else the gender effect could not have emerged repeatedly. Conversely, if ceiling effects don’t preclude detection of effects, then there is no principled reason why Johnson et al would fail to detect Schnall’s original effect.
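
To see why compression is a matter of degree rather than an on/off switch, here's a toy simulation (every number in it is invented for illustration; nothing is estimated from either paper). A real group difference on a latent 1-9 rating survives clipping at the scale maximum and remains detectable at ordinary sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200  # per group; toy numbers only

# Latent ratings with a real group difference, on a nominal 1-9 scale...
group_a = rng.normal(loc=7.8, scale=1.5, size=n)
group_b = rng.normal(loc=8.4, scale=1.5, size=n)

# ...then clipped at the scale maximum, producing visible "ceiling" compression.
group_a_obs = np.clip(group_a, 1, 9)
group_b_obs = np.clip(group_b, 1, 9)

t, p = stats.ttest_ind(group_a_obs, group_b_obs)
print(f"observed means: {group_a_obs.mean():.2f} vs {group_b_obs.mean():.2f}, p = {p:.4f}")
# The observed difference is attenuated relative to the latent one, but with a
# reasonable sample it is still detectable -- compression reduces power by degrees;
# it does not act as an all-or-none barrier to finding group differences.
```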

Interestingly, it’s not just the overall means that tell the story quite clearly. Here’s what happens if we plot the gender effects in Johnson et al’s experiments in the same way as Schnall’s Figure 1 above:

[Figure: gender effects in Johnson et al's experiments plotted against extremity of item response, in the same format as Schnall's Figure 1]

Notice that we see here the same negative relationship between effect size and extremity that Schnall observed in her own data, and whose absence in Johnson et al’s data she (erroneously) took as evidence of a ceiling effect.

There’s a ceiling effect in Schnall’s own data

Yet another flaw in Schnall’s argument is that taking the ceiling effect charge seriously would actually invalidate at least one of her own experiments. Consider that the only vignette in Schnall et al’s original Experiment 1 that showed a statistically significant effect also had the highest rate of extreme responding in that study (mean rating of 8.25 / 9). Even more strikingly, the proportion of participants who gave the most extreme response possible on that vignette (70%) was higher than for any of the vignettes in either of Johnson et al’s experiments. In other words, Schnall’s core argument is that her effect could not possibly be replicated in Johnson et al’s experiments because of the presence of a ceiling effect, yet the only vignette to show a significant effect in Schnall’s original Experiment 1 had an even more pronounced ceiling effect. Once again, she cannot have it both ways. Either ceiling effects don’t preclude detection of effects, or, by Schnall’s own logic, the original Study 1 effect was probably a false positive.

When pressed on this point by Daniel Lakens in the email thread, Schnall gave the following response:

Note for the original studies we reported that the effect was seen on aggregate data, not necessarily for individual dilemmas. Such results will always show statistical fluctuations at the item level, hence it is important to not focus on any individual dilemma but on the overall pattern.

I confess that I'm not entirely clear on what Schnall means here. One way to read this is that she is conceding that the significant effect in the vignette in question (the "kitten" dilemma) was simply due to random fluctuations. Note that since the effect in Schnall's Experiment 1 was only marginally significant when averaging across all vignettes (in fact, it fell just short of conventional significance even then), eliminating this vignette from consideration would actually have produced a null result. But suppose we overlook that and instead agree with Schnall that strange things can happen to individual items, and that what we should focus on is the aggregate moral judgment, averaged across vignettes. That would be perfectly reasonable, except that it's directly at odds with Schnall's more general argument. To see this, we need only look at the aggregate distribution of scores in Johnson et al's Experiments 1 and 2:

[Figure: aggregate distributions of moral judgment scores in Johnson et al's Experiments 1 and 2]

There’s clearly no ceiling effect here; the mode in both experiments is nowhere near the maximum. So once again, Schnall can’t have it both ways. If her argument is that what matters is the aggregate measure (which seems right to me, since many reputable measures have multiple individual items with skewed distributions, and this can even be a desirable property in certain cases), then there’s nothing objectionable about the scores in the Johnson et al experiments. Conversely, if Schnall’s argument is that it’s fair to pick on individual items, then there is effectively no reason to believe Schnall’s own original Experiment 1 (and for all I know, her experiment 2 as well–I haven’t looked).

What should we conclude?

What can we conclude from all this? A couple of things. First, Schnall has no basis for arguing that there was a fundamental statistical flaw that completely invalidates Johnson et al's conclusions. From where I'm sitting, there doesn't seem to be any meaningful ceiling effect in Johnson et al's data, and that's attested to by the fact that Johnson et al had no trouble detecting gender differences in both experiments (successfully replicating Schnall's earlier findings). Moreover, the arguments Schnall makes in support of the postulated ceiling effects suffer from serious flaws. At best, what Schnall could reasonably argue is that there might be some restriction of range in the ratings, which would artificially reduce the effect size. However, given that Johnson et al's sample sizes were 3 – 5 times larger than Schnall's, it is highly implausible to suppose that effects as big as Schnall's completely disappeared–especially given that robust gender effects were detected. Moreover, given that the skew in Johnson et al's aggregate distributions is not very extreme at all, and that many individual items on many questionnaire measures show ceiling or floor effects (e.g., go look at individual Big Five item distributions some time), taking Schnall's claims seriously would in effect invalidate not just Johnson et al's results, but also a huge proportion of the more general psychology literature.

Second, while Schnall has raised a number of legitimate and serious concerns about the tone of the debate and comments surrounding Johnson et al's replication, she's also made a number of serious charges of her own that depend on the validity of her argument about ceiling effects, and not on the civility (or lack thereof) of commentators on various sides of the debate. Schnall has (incorrectly) argued that Johnson et al have committed a basic statistical error that most peer reviewers would have caught–effectively accusing them of incompetence. She has argued that Johnson et al's claim of replication failure is unwarranted, and constitutes defamation of her scientific reputation. And she has suggested that the editors of the special issue (Daniel Lakens and Brian Nosek) behaved unethically by first not seeking independent peer review of the replication paper, and then actively trying to suppress her own penetrating criticisms. In my view, none of these accusations are warranted, because they depend largely on Schnall's presumption of a critical flaw in Johnson et al's work that is in fact nonexistent. I understand that Schnall has been under a lot of stress recently, and I sympathize with her concerns over unfair comments made by various people (most of whom have now issued formal apologies). But given the acrimonious tone of the more general ongoing debate over replication, it's essential that we distinguish the legitimate issues from the illegitimate ones so that we can focus exclusively on the former, and don't end up needlessly generating more hostility on both sides.

Lastly, there is the question of what conclusions we should draw from the Johnson et al replication studies. Personally, I see no reason to question Johnson et al’s conclusions, which are actually very modest:

In short, the current results suggest that the underlying effect size estimates from these replication experiments are substantially smaller than the estimates generated from the original SBH studies. One possibility is that there are unknown moderators that account for these apparent discrepancies. Perhaps the most salient difference between the current studies and the original SBH studies is the student population. Our participants were undergraduates in the United States whereas participants in SBH's studies were undergraduates in the United Kingdom. It is possible that cultural differences in moral judgments or in the meaning and importance of cleanliness may explain any differences.

Note that Johnson et al did not assert or intimate in any way that Schnall et al’s effects were “not real”. They did not suggest that Schnall et al had committed any errors in their original study. They explicitly acknowledged that unknown moderators might explain the difference in results (though they also noted that this was unlikely considering the magnitude of the differences). Effectively, Johnson et al stuck very close to their data and refrained from any kind of unfounded speculation.

In sum, unless Schnall has other concerns about Johnson’s data besides the purported ceiling effect (and she hasn’t raised any that I’ve seen), I think Johnson et al’s paper should enter the record exactly as its authors intended. Johnson, Cheung, & Donnellan (2014) is, quite simply, a direct preregistered replication of Schnall, Benton, & Harvey (2008) that failed to detect the effects reported in the original study, and there should be nothing at all controversial about this. There are certainly worthwhile discussions to be had about why the replication failed, and what that means for the original effect, but this doesn’t change the fundamental fact that the replication did fail, and we shouldn’t pretend otherwise.

Big Data, n. A kind of black magic

The annual Association for Psychological Science meeting is coming up in San Francisco this week. One of the cross-cutting themes this year is “Big Data: Understanding Patterns of Human Behavior”. Since I’m giving two Big Data-related talks (1, 2), and serving as discussant on a related symposium, I’ve been spending some time recently trying to come up with a sensible definition of Big Data within the context of psychological science. This has, in turn, led me to ponder the meaning of Big Data more generally.

After a few sleepless nights spent mulling it over, I've concluded that producing a unitary, comprehensive, domain-general definition of Big Data is probably not possible, for the simple reason that different communities have adopted and co-opted the term for decidedly different purposes. For example, in my own field of psychology, the very largest datasets that most researchers currently work with contain, at most, tens of thousands of cases and a few hundred variables (there are exceptions, of course). Such datasets fit comfortably into memory on any modern laptop; you'd have a hard time finding (m)any data scientists willing to call a dataset of this scale "Big". Yet here we are, heading into APS, with multiple sessions focusing on the role of Big Data in psychological science. And psychology's not unusual in this respect; we're seeing similar calls for Big Data this and Big Data that in pretty much all branches of science and every area of the business world. I mean, even the humanities are getting in on the action.

You could take a cynical view of this and argue that all this really goes to show is that people like buzzwords. And there’s probably some truth to that. More pragmatically, though, we should acknowledge that language is this flexible kind of thing that likes to reshape itself from time to time. Words don’t have any intrinsic meaning above and beyond what we do with them, and it’s certainly not like anyone has a monopoly on a term that only really exploded into the lexicon circa 2011. So instead of trying to come up with a single, all-inclusive definition of Big Data, I’ve instead opted to try and make sense of the different usages we’re seeing in different communities. Below I suggest three distinct, but overlapping, definitions–corresponding to three different ways of thinking about what makes data “Big”. They are, roughly, (1) the kind of infrastructure required to support data processing, (2) the size of the dataset relative to the norm in a field, and (3) the complexity of the models required to make sense out of the data. To a first approximation, one can think of these as engineering, scientific, and statistical perspectives on Big Data, respectively.

The engineering perspective

One way to define Big Data is in terms of the infrastructure required to analyze the data. This is the closest thing we have to a classical definition. In fact, this way of thinking about what makes data “big” arguably predates the term Big Data itself. Take this figure, courtesy of Google Trends:

Notice that searches for Hadoop (a framework for massively distributed data-intensive computing) actually precede the widespread use of the term “Big Data” by a couple of years. If you’re the kind of person who likes to base their arguments entirely on search-based line graphs from Google (and I am!), you have here a rather powerful Exhibit A.

Alternatively, if you're a more serious kind of person who privileges reason over pretty line plots, consider the following, rather simple, argument for Big Data qua infrastructure problem: any dataset that keeps growing is eventually going to get too big–meaning, it will inevitably reach a point at which it no longer fits into memory, or even onto local storage–and instead requires a fundamentally different, massively parallel architecture to process. If you can solve your alleged "big data" problems by installing a new hard drive or some more RAM, you don't really have a Big Data problem, you have an I'm-too-lazy-to-deal-with-this-right-now problem.

A real Big Data problem, from an engineering standpoint, is what happens once you’ve installed all the RAM your system can handle, maxed out your RAID array, and heavily optimized your analysis code, yet still find yourself unable to process your data in any reasonable amount of time. If you then complain to your IT staff about your computing problems and they start ranting to you about Hadoop and Hive and how you need to hire a bunch of engineers so you can build out a cluster and do Big Data the way Big Data is supposed to be done, well, congratulations–you now have a Big Data problem in the engineering sense. You now need to figure out how to build a highly distributed computing platform capable of handling really, really, large datasets.

Once the hungry wolves of Big Data have been temporarily pacified by building a new data center (or, you know, paying for an AWS account), you may have to rewrite at least part of your analysis code to take advantage of the massive parallelization your new architecture affords. But conceptually, you can probably keep asking and answering the same kinds of questions with your data. In this sense, Big Data isn't directly about the data itself, but about what the data makes you do: a dataset counts as "Big" whenever it causes you to start whispering sweet nothings in Hadoop's ear at night. Exactly when that happens will depend on your existing infrastructure, the demands imposed by your data, and so on. On modern hardware, some people have suggested that the transition tends to happen fairly consistently when datasets get to around 5 – 10 TB in size. But of course, that's just a loose generalization, and we all know that loose generalizations are always a terrible idea.

The scientific perspective

Defining Big Data in terms of architecture and infrastructure is all well and good in domains where normal operations regularly generate terabytes (or even–gasp–petabytes!) of data. But the reality is that most people–and even, I would argue, many people whose job title currently includes the word “data” in it–will rarely need to run analyses distributed across hundreds or thousands of nodes. If we stick with the engineering definition of Big Data, this means someone like me–a lowly social or biomedical scientist who frequently deals with “large” datasets, but almost never with gigantic ones–doesn’t get to say they do Big Data. And that seems kind of unfair. I mean, Big Data is totally in right now, so why should corporate data science teams and particle physicists get to have all the fun? If I want to say I work with Big Data, I should be able to say I work with Big Data! There’s no way I can go to APS and give talks about Big Data unless I can unashamedly look myself in the mirror and say, look at that handsome, confident man getting ready to go to APS and talk about Big Data. So it’s imperative that we find a definition of Big Data that’s compatible with the kind of work people like me do.

Hey, here’s one that works:

Big Data, n. The minimum amount of data required to make one’s peers uncomfortable with the size of one’s data.

This definition is mostly facetious–but it’s a special kind of facetiousness that’s delicately overlaid on top of an earnest, well-intentioned core. The earnest core is that, in practice, many people who think of themselves as Big Data types but don’t own a timeshare condo in Hadoop Land implicitly seem to define Big Data as any dataset large enough to enable new kinds of analyses that weren’t previously possible with smaller datasets. Exactly what dimensionality of data is sufficient to attain this magical status will vary by field, because conventional dataset sizes vary by field. For instance, in human vision research, many researchers can get away with collecting a few hundred trials from three subjects in one afternoon and calling it a study. In contrast, if you’re a population geneticist working with raw sequence data, you probably deal with fuhgeddaboudit amounts of data on a regular basis. So clearly, what it means to be in possession of a “big” dataset depends on who you are. But the point is that in every field there are going to be people who look around and say, you know what? Mine’s bigger than everyone else’s. And those are the people who have Big Data.

I don't mean that pejoratively, mind you. Quite the contrary: an arms race towards ever-larger datasets strikes me as a good thing for most scientific fields to have, regardless of whether or not the motives for the data embiggening are perfectly cromulent. Having more data often lets you do things that you simply couldn't do with smaller datasets. With more data, confidence intervals shrink, so effect size estimates become more accurate; it becomes easier to detect and characterize higher-order interactions between variables; you can stratify and segment the data in various ways, explore relationships with variables that may not have been of a priori interest; and so on and so forth. Scientists, by and large, seem to be prone to thinking of Big Data in these relativistic terms, so that a "Big" dataset is, roughly, a dataset that's large enough and rich enough that you can do all kinds of novel and interesting things with it that you might not have necessarily anticipated up front. And that's refreshing, because if you've spent much time hanging around science departments, you'll know that the answers to about 20% of all questions during Q&A periods end with the words well, that's a great idea, but we just don't have enough data to answer that. Big Data, in a scientific sense, is when that answer changes to: hey, that's a great idea, and I'll try that as soon as I get back to my office. (Or perhaps more realistically: hey that's a great idea, and I'll be sure to try that–as soon as I can get my one tech-savvy grad student to wrangle the data into the right format.)
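
The confidence-interval point is just the familiar square-root-of-n arithmetic, but here it is as a quick simulation anyway (all values invented):

```python
import numpy as np

rng = np.random.default_rng(42)
true_effect, sd = 0.3, 1.0  # made-up effect and noise level

for n in (50, 500, 5000, 50000):
    sample = rng.normal(true_effect, sd, size=n)
    half_width = 1.96 * sample.std(ddof=1) / np.sqrt(n)  # 95% CI half-width
    print(f"n = {n:>6}: estimate = {sample.mean():.3f} +/- {half_width:.3f}")
# Each 100-fold increase in n buys roughly a 10-fold narrower interval, which is
# why questions that are hopeless at n = 50 become answerable at n = 50,000.
```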

It’s probably worth noting in passing that this relativistic, application-centered definition of Big Data also seems to be picking up cultural steam far beyond the scientific community. Most of the recent criticisms of Big Data seem to have something vaguely like this definition in mind. (Actually, I would argue pretty strenuously that most of these criticisms aren’t really even about Big Data in this sense, and are actually just objections to mindless and uncritical exploratory analysis of any dataset, however big or small. But that’s a post for another day.)

The statistical perspective

A third way to think about Big Data is to focus on the kinds of statistical methods required in order to make sense of a dataset. On this view, what matters isn’t the size of the dataset, or the infrastructure demands it imposes, but how you use it. Once again, we can appeal to a largely facetious definition clinging for dear life onto a half-hearted effort at pithy insight:

Big Data, n. The minimal amount of data that allows you to set aside a quarter of your dataset as a hold-out and still train a model that performs reasonably well when tested out-of-sample.

The nugget of would-be insight in this case is this: the world is usually a more complicated place than it appears to be at first glance. It’s generally much harder to make reliable predictions about new (i.e., previously unseen) cases than one might suppose given conventional analysis practices in many fields of science. For example, in psychology, it’s very common to see papers report extremely large R2 values from fitted models–often accompanied by claims to the effect that the researchers were able to “predict” most of the variance in the outcome. But such claims are rarely actually supported by the data presented, because the studies in question overwhelmingly tend to overfit their models by using the same data for training and testing (to say nothing of p-hacking and other Questionable Research Practices). Fitting a model that can capably generalize to entirely new data often requires considerably more data than one might expect. The precise amount depends on the problem in question, but I think it’s fair to say that there are many domains in which problems that researchers routinely try to tackle with sample sizes of 20 – 100 cases would in reality require samples two or three orders of magnitude larger to really get a good grip on.
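
Here's a toy sketch of that overfitting problem (all simulated noise, so there is nothing real to find; the interesting part is the gap between the two R2 values, not their exact magnitudes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# 60 "participants", 30 noisy predictors, and an outcome that is pure noise.
X = rng.normal(size=(60, 30))
y = rng.normal(size=60)

# The common (bad) practice: train and evaluate on the same data.
full_fit = LinearRegression().fit(X, y)
print(f"in-sample R2:     {full_fit.score(X, y):.2f}")  # impressively large, and meaningless

# The honest version: hold out a quarter of the data and test out-of-sample.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
held_out_fit = LinearRegression().fit(X_train, y_train)
print(f"out-of-sample R2: {held_out_fit.score(X_test, y_test):.2f}")  # near zero, or negative
```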

The key point is that when we don't have a lot of data to work with, it's difficult to say much of anything about how big an effect is (unless we're willing to adopt strong priors). Instead, we tend to fall back on the crutch of null hypothesis significance testing and start babbling on about whether there is or isn't a "statistically significant effect". I don't really want to get into the question of whether the latter kind of thinking is ever useful (see Krantz (1999) for a review of its long and sordid history). What I do hope is not controversial is this: if your conclusions are ever in danger of changing radically depending on whether the coefficients in your model are on this side of p = .05 versus that side of p = .05, those conclusions are, by definition, not going to be terribly reliable over the long haul. Anything that helps move us away from that decision boundary and puts us in a position where we can worry more about what our conclusions ought to be than about whether we should be saying anything at all is a good thing. And since the single thing that matters most in that regard is the size of our dataset, it follows that we should want to have datasets that are as Big as possible. If we can fit complex models using lots of features and show that those models still perform well when tested out-of-sample, we can feel much more confident about whatever else we feel inclined to say.

From a statistical perspective, then, one might say that a dataset is “Big” when it’s sufficiently large that we can spend most of our time thinking about what kinds of models to fit and what kinds of features to include so as to maximize predictive power and/or understanding, rather than worrying about what we can and can’t do with the data for fear of everything immediately collapsing into a giant multicollinear mess. Admittedly, this is more of a theoretical ideal than a practical goal, because as Andrew Gelman points out, in practice “N is never large”. As soon as we get our hands on enough data to stabilize the estimates from one kind of model, we immediately go on to ask more fine-grained questions that require even more data. And we don’t stop until we’re right back where we started, hovering at the very edge of our ability to produce sensible estimates, staring down the precipice of uncertainty. But hey, that’s okay. Nobody said these definitions have to be useful; it’s hard enough just trying to make them semi-coherent.

Conclusion

So there you have it: three ways to define Big Data. All three of these definitions are fuzzy, and will bleed into one another if you push on them a little bit. In particular, you could argue that, extensionally, the engineering definition of Big Data is a superset of the other two definitions, as it’s very likely that any dataset big enough to require a fundamentally different architecture is also big enough to handle complex statistical models and to do interesting and novel things with. So the point of all this is not to describe three completely separate communities with totally different practices; it’s simply to distinguish between three different uses of the term Big Data, all of which I think are perfectly sensible in different contexts, but that can cause communication problems when people from different backgrounds interact.

Of course, this isn’t meant to be an exhaustive catalog. I don’t doubt that there are many other potential definitions of Big Data that would each elicit enthusiastic head nods from various communities. For example, within the less technical sectors of the corporate world, there appears to be yet another fairly distinctive definition of Big Data. It goes something like this:

Big Data, n. A kind of black magic practiced by sorcerers known as quants. Nobody knows how it works, but it’s capable of doing anything.

In any case, the bottom line here is really just that context matters. If you go to APS this week, there’s a good chance you’ll stumble across many psychologists earnestly throwing the term “Big Data” around, even though they’re mostly discussing datasets that would fit snugly into a sliver of memory on a modern phone. If your day job involves crunching data at CERN or Google, this might amuse you. But the correct response, once you’re done smiling on the inside, is not, Hah! That’s not Big Data, you idiot! It should probably be something more like Hey, you talk kind of funny. You must come from a different part of the world than I do. We should get together some time and compare notes.

estimating the influence of a tweet–now with 33% more causal inference!

Twitter is kind of a big deal. Not just out there in the world at large, but also in the research community, which loves the kind of structured metadata you can retrieve for every tweet. A lot of researchers rely heavily on Twitter to model social networks, information propagation, persuasion, and all kinds of interesting things. For example, here's the abstract of a nice recent paper on arXiv that aims to predict successful memes using network and community structure:

We investigate the predictability of successful memes using their early spreading patterns in the underlying social networks. We propose and analyze a comprehensive set of features and develop an accurate model to predict future popularity of a meme given its early spreading patterns. Our paper provides the first comprehensive comparison of existing predictive frameworks. We categorize our features into three groups: influence of early adopters, community concentration, and characteristics of adoption time series. We find that features based on community structure are the most powerful predictors of future success. We also find that early popularity of a meme is not a good predictor of its future popularity, contrary to common belief. Our methods outperform other approaches, particularly in the task of detecting very popular or unpopular memes.

One limitation of much of this body of research is that the data are almost invariably observational. We can build sophisticated models that do a good job predicting some future outcome (like meme success), but we don’t necessarily know that the “important” features we identify carry any causal influence. In principle, they could be completely epiphenomenal–for example, in the study I linked to, maybe the community structure features are just a proxy for some other, causally important, factor (e.g., whether the content of a meme has sufficiently broad appeal to attract attention from many different kinds of people). From a predictive standpoint, this may not matter much; if your goal is just to passively predict whether a meme is going to be successful or not, it’s irrelevant whether or not the features you’re using are doing causal work. On the other hand, if you want to actively design memes in such a way as to maximize their spread, the ability to get a handle on causation starts to look pretty important.

How can we estimate the direct causal influence of a tweet on the downstream popularity of a meme? Here’s a simple and (I suspect) very feasible idea in two steps:

  1. Create a small web app that allows any existing Twitter user to register via Twitter authentication. On signing up, a user has to specify just one (optional) setting: the proportion of their intended retweets they’re willing to withhold. Let’s call this the Withholding Fraction (WF).
  2. Every time (or at least some of the time) a registered user wants to retweet a particular tweet*, they do so via the new web app’s interface (which has permission to post to the user’s Twitter account) instead of whatever interface they’re currently using. The key is that the retweet isn’t just obediently passed along; instead, the target tweet is retweeted successfully with probability (1 – WF), and randomly suppressed from the user’s stream with probability (WF). A minimal sketch of this withholding step follows below.
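Here’s roughly what that withholding step could look like in code. This is just a minimal sketch under my own assumptions, not a spec: the names (RegisteredUser, post_retweet, record_outcome) are hypothetical placeholders, and a real app would wire them up to Twitter’s OAuth flow and a persistent database.

import random
from dataclasses import dataclass

@dataclass
class RegisteredUser:
    handle: str
    withholding_fraction: float  # WF: proportion of intended retweets to suppress

def post_retweet(user: RegisteredUser, tweet_id: str) -> None:
    # Placeholder for the real call that posts the retweet via Twitter's API.
    print(f"{user.handle} retweets {tweet_id}")

def record_outcome(user: RegisteredUser, tweet_id: str, posted: bool) -> None:
    # Placeholder for logging the (user, tweet, posted-or-withheld) triple;
    # this log is the dataset researchers would eventually analyze.
    print(f"logged: {user.handle}, {tweet_id}, posted={posted}")

def handle_retweet_request(user: RegisteredUser, tweet_id: str) -> None:
    """Pass the retweet through with probability (1 - WF); withhold it with probability WF."""
    posted = random.random() >= user.withholding_fraction
    if posted:
        post_retweet(user, tweet_id)
    record_outcome(user, tweet_id, posted)

handle_retweet_request(RegisteredUser("@example", withholding_fraction=0.2), "1234567890")

The important part is just the coin flip: every intended retweet gets logged whether or not it’s actually posted, which is what turns the withheld tweets into a randomized control condition.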

Doing this would allow the community to very quickly (assuming rapid adoption, which seems reasonably likely) build up an enormous database of tweets that were targeted for retweeting by an active user, but randomly assigned to fail with some known probability. Researchers would then be able to directly quantify the causal impact of individual retweets on downstream popularity–and to estimate that influence conditional on all of the other standard variables, like the retweeter’s number of followers, the content of the tweet, etc. Of course, this still wouldn’t get us to true experimental manipulation of such features (i.e., we wouldn’t be manipulating users’ follower networks, just randomly omitting tweets from users with different followers), but it seems like a step in the right direction**.
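To make the estimation step concrete, here’s a hedged sketch of what the downstream analysis might look like, assuming a table with one row per intended retweet. Everything in it is a toy illustration: the column names, the simulated numbers, and the choice of a plain OLS regression are my own assumptions, not anything from a real dataset.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000

# One row per intended retweet: "posted" is the randomized treatment
# (1 = passed through, 0 = withheld); "downstream" is the meme's later popularity.
posted = rng.binomial(1, 0.8, n)             # e.g., most users pick a small WF
followers = rng.lognormal(5, 1, n)           # follower count of the would-be retweeter
downstream = rng.poisson(1 + 0.5 * posted)   # toy outcome with a small causal bump built in

df = pd.DataFrame({"posted": posted, "followers": followers, "downstream": downstream})

# Because posting vs. withholding is randomized, the coefficient on "posted"
# estimates the average causal effect of a single retweet on downstream popularity;
# covariates like follower count are there for precision or effect heterogeneity.
model = smf.ols("downstream ~ posted + np.log(followers)", data=df).fit()
print(model.params["posted"])

In the real version you’d presumably want a model better suited to count data and to the network structure, but the basic logic of the comparison wouldn’t change.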

I figure building a barebones app like this would take an experienced developer familiar with the Twitter OAuth API just a day or two. And I suspect many people (myself included!) would be happy to contribute to this kind of experiment, provided that all of the resulting data were made public. (I’m aware that there are all kinds of restrictions on sharing assembled Twitter datasets, but we’re not talking about sharing firehose dumps here, just a restricted set of retweets from users who’ve explicitly given their consent to have the data used in this way.)

Has this kind of thing already been done? If not, does anyone want to build it?

 

* It doesn’t just have to be retweets, of course; the same principle would work just as well for withholding a random fraction of original tweets. But I suspect not many users would be willing to randomly eliminate a proportion of their original content from the firehose.

** If we really wanted to get close to true random assignment, we could potentially inject selected tweets into random users’ streams based on selected criteria. But I’m not sure how many tweeps would consent to have entirely random retweets published in their name (I probably wouldn’t), so this probably isn’t viable.

then gravity let go

This is fiction.


My grandmother’s stroke destroyed most of Nuremberg and all of Wurzburg. She was sailing down the Danube on a boat when it happened. I won’t tell you who she was with and what they were doing at the time, because you’ll think less of her for it, and anyway it’s not relevant to the story. But she was in the boat, and she was alive and happy, and then the next thing you know, she was unhappy and barely breathing. They were so far out in the water that she would have been dead if the other person she was with had had to row all the way back. So a medical helicopter was sent out, and they strapped her to the sky with hooks and carried her to the hospital dangling sixty feet below a tangle of blades.

All of her life, my grandmother was afraid of heights. She never got on a plane; never even went up a high-rise viewing deck to see the city unfold below her like a tourist map. “No amount of money or gratitude you could give me is worth the vertigo that I’d get when I felt my life rushing away below me,” she told me once. She was very melodramatic, my grandmother. It figures that the one time her feet actually refused gravity long enough for it to count, she was out like a light. That her life started to rush away from her not in an airplane over the sea, as she’d always feared, but in a boat on the water. That it took a trip into the same sky she loathed so much just to keep her alive.

*          *          *

Bavaria occupies the southeast corner of Germany; by area, it makes up one-fifth of the country. It’s the largest state, and pretty densely populated, but for all that, I don’t remember there being very much to do there. When I was a child, we used to visit my grandmother in Nuremberg in the summers. I remember the front of her brown and white house, coated in green vines, gently hugging the street the way the houses do in Europe. In America, we place our homes a modest distance away from the road, safely detached in their own little fiefdoms. I’ll just be back here, doing my own thing, our houses say. You just keep walking along there, sir—and don’t try to look through my windows. When Columbus discovered all that land, what he was really discovering was the driveway.

When we visited my grandmother, I’d slam the car door shut, run up to the steps, and knock repeatedly until she answered. She’d open the door, look all around, and then, finally seeing me, ask, “Who is this? Who are you?” That was the joke when I was very young. Who Are You was the joke, and after I yelled “grandma, it’s me!” several times, she’d always suddenly remember me, and invite me in to feed me schnitzel. “Why didn’t you say it was you,” she’d say. “Are you trying to give an old lady a heart attack? Do you think that’s funny?”

After her stroke, Who Are You was no longer funny. The words had a different meaning, and when I said, “grandma, it’s me,” she’d look at me sadly, with no recognition, as if she was wondering what could have happened to her beloved Bavaria; how the world could have gotten so bad that every person who knocked on her door now was a scoundrel claiming to be her grandson, lying to an old lady just so he could get inside and steal all of her valuable belongings.

Not that she really had any. Those last few years of her life, the inside of her house changed, until it was all newspapers and gift wrap, wooden soldiers and plastic souvenir cups, spent batteries and change from other countries. She never threw anything away, but there was nothing in there you would have wanted except memories. And by the end, I couldn’t even find the memories for all of the junk. So I just stopped going. Eventually, all of the burglars stopped coming by.

*          *          *

When my grandfather got to the hospital, he was beside himself. He kept running from doctor to doctor, asking them all the same two questions:

“Who was she with,” he asked, “and what were they doing on that boat?”

The doctors all calmly told him the same thing: it’s not really relevant to her condition, and anyway, you’d think less of her. Just go sit in the waiting room. We’ll tell you when you can see her.

Inside the operating room, they weren’t so calm.

“She’s still hemorrhaging,” a doctor said over the din of scalpels and foam alcohol. They unfolded her cortex like a map, laid tangles of blood clot and old memories down to soak against fresh bandages. But there was no stopping the flood.

“We need to save Wurzburg,” said another doctor, tracing his cold finger through the cortical geography on the table. He moved delicately, as if folding and unfolding a series of very small, very fragile secrets; a surgical scalpel carefully traced a path through gyri and sulci, the hills and valleys of my grandmother’s mnemonic Bavaria. Behind it, red blood crashed through arteries to fill new cavities, like flood water racing through inundated forest spillways, desperately looking for some exit, any exit, its urgent crossing shattering windows and homes, obliterating impressions of people and towns that took decades to form, entire histories vanishing from memory in a single cataclysmic moment on the river.

*          *          *

They moved around a lot. My grandfather had trouble holding down a job. The Wurzburg years were the hardest. We stopped visiting my grandmother for a while; she wouldn’t let anyone see her. My grandfather had started out a decent man, but he drank frequently. He suffered his alcohol poorly, and when he became violent, he wouldn’t stop until everyone around him suffered with him. Often, my grandmother was the only person around him.

I remember once—I think it was the only time we saw them in Wurzburg—when we visited, and my grandmother was sporting a black eye she’d inherited from somewhere. “I got it playing tennis,” she said, winking at me. “Your grandfather went for the ball, and accidentally threw the racket. Went right over the net; hit me right in the eye. Tach, just like that.”

My grandmother could always make the best of the worst situation. I used to think that kind of optimism was a good trait—as long as she had a twinkle in her eye, how bad could things be? But after her stroke, I decided that maybe that was exactly the thing that had kept her from leaving him for so many years. A less optimistic person would have long ago lost hope that he would ever change; a less happy person might have run down to the courthouse and annulled him forever. But not her; she kept her good humor, a racket on the wall, and always had that long-running excuse for the black eyes and bruised arms.

Years later, I found out from my mother that she’d never even played tennis.

*          *          *

My grandfather never found out who my grandmother was with on the boat, or what they were doing out on the river. A week after she was admitted, a doctor finally offered to tell him—if you think it’ll make you feel better to have closure. But by then, my grandfather had decided he didn’t want to know. What was the point? There was no one to blame any more, nowhere to point the finger. He wouldn’t be able to yell at her and make her feel guilty about what she’d done, yell at her until she agreed she’d do better next time, and then they could get into bed and read newspapers together, pretending it was all suddenly alright. After my grandmother came home, my grandfather stopped talking to anyone at all, including my grandmother.

I never told my grandfather that I knew what had happened on the boat. I’d found out almost immediately. A friend of mine from the army was a paramedic, and he knew the guy on the chopper who strapped my grandmother to the sky that night. He said the circumstances were such that the chopper had had to come down much closer to the water than it was supposed to, and even then, there was some uncertainty about whether they’d actually be able to lift my grandmother out of the boat. The guy who was on the chopper had been scared. “It was like she had an anvil in her chest,” he told my friend. “And for a moment, I thought it would take us all down into the water with it. But then gravity let go, and we lifted her up above the river.”

*          *          *

In the winter, parts of the Danube freeze, but the current keeps most of the water going. It rushes from the Black Forest in the West to the Ukraine in the East, with temporary stops in Vienna, Budapest, and Belgrade. If the waters ever rise too high, they’ll flood a large part of Europe, a large part of Germany. Ingolstadt, Regensburg, Passau; they’d all be underwater. It would be St. Mary Magdalene all over again, and it would tear away beautiful places, places full of memories and laughter. All the places that I visited as a kid, where my grandmother lived, before the stroke that took away her Bavaria.

what exactly is it that 53% of neuroscience articles fail to do?

[UPDATE: Jake Westfall points out in the comments that the paper discussed here appears to have made a pretty fundamental mistake that I then carried over to my post. I’ve updated the post accordingly.]

[UPDATE 2: the lead author has now responded and answered my initial question and some follow-up concerns.]

A new paper in Nature Neuroscience by Emmeke Aarts and colleagues argues that neuroscientists should start using hierarchical (or multilevel) models in their work in order to account for the nested structure of their data. From the abstract:

In neuroscience, experimental designs in which multiple observations are collected from a single research object (for example, multiple neurons from one animal) are common: 53% of 314 reviewed papers from five renowned journals included this type of data. These so-called ‘nested designs’ yield data that cannot be considered to be independent, and so violate the independency assumption of conventional statistical methods such as the t test. Ignoring this dependency results in a probability of incorrectly concluding that an effect is statistically significant that is far higher (up to 80%) than the nominal α level (usually set at 5%). We discuss the factors affecting the type I error rate and the statistical power in nested data, methods that accommodate dependency between observations and ways to determine the optimal study design when data are nested. Notably, optimization of experimental designs nearly always concerns collection of more truly independent observations, rather than more observations from one research object.

I don’t have any objection to the advocacy for hierarchical models; that much seems perfectly reasonable. If you have nested data, where each subject (or petri dish or animal or whatever) provides multiple samples, it’s sensible to try to account for as many systematic sources of variance as you can. That point may have been made many times before, but it never hurts to make it again.

What I do find surprising though–and frankly, have a hard time believing–is the idea that 53% of neuroscience articles are at serious risk of Type I error inflation because they fail to account for nesting. This seems to me to be what the abstract implies, yet it’s a much stronger claim that doesn’t actually follow just from the observation that virtually no studies that have reported nested data have used hierarchical models for analysis. What it also requires is for all of those studies that use “conventional” (i.e., non-hierarchical) analyses to have actively ignored the nesting structure and treated repeated measurements as if they in fact came from entirely different subjects or clusters.

To make this concrete, suppose we have a dataset made up of 400 observations, consisting of 20 subjects who each provided 10 trials in 2 different experimental conditions (i.e., 20 x 2 x 10 = 400). And suppose the thing we ultimately want to know is whether or not there’s a statistical difference in outcome between the two conditions. There are at least three ways we could set up our comparison:

  1. Ignore the grouping variable (i.e., subject) entirely, and conduct the test as if we have 200 independent observations in each condition.
  2. Average the 10 trials in each condition within each subject first, then conduct the test on the subject means. In this case, we effectively have 20 observations in each condition (1 per subject).
  3. Explicitly include the effects of both subject and trial in our model. In this case we have 400 observations, but we’re explicitly accounting for the correlation between trials within a given subject, so that the statistical comparison of conditions effectively has somewhere between 20 and 400 “observations” (or degrees of freedom).

Now, none of these approaches is strictly “wrong”, in that there could be specific situations in which any one of them would be called for. But as a general rule, the first approach is almost never appropriate. The reason is that we typically want to draw conclusions that generalize across the cases in the higher level of the hierarchy, and don’t have any intrinsic interest in the individual trials themselves. In the above example, we’re asking whether people, on average, behave differently in the two conditions. If we treat our data as if we had 200 subjects in each condition, effectively concatenating trials across all subjects, we’re ignoring the fact that the responses acquired from each subject will tend to be correlated (i.e., Jane Doe’s behavior on Trial 2 will tend to be more similar to her own behavior on Trial 1 than to another subject’s behavior on Trial 1). So we’re pretending that we know something about 200 different individuals sampled at random from the population, when in fact we only know something about 20 different individuals. The upshot, if we use approach (1), is that we’re going to end up answering a question quite different from the one we think we’re answering. [Update: Jake Westfall points out in the comments below that we won’t necessarily inflate the Type I error rate. Rather, the net effect of failing to model the nesting structure properly will depend on the relative amount of within-cluster vs. between-cluster variance. The answer we get will, however, usually deviate considerably from the answer we would get using approaches (2) or (3).]

By contrast, approaches (2) and (3) will, in most cases, produce pretty similar results. It’s true that the hierarchical approach is generally a more sensible thing to do, and will tend to provide a better estimate of the true population difference between the two conditions. However, it’s probably better to describe approach (2) as suboptimal, and not as wrong. So long as the subjects in our toy example above are in fact sampled at random, it’s pretty reasonable to assume that we have exactly 20 independent observations, and analyze our data accordingly. Our resulting estimates might not be quite as good as they could have been, but we’re unlikely to miss the mark by much.
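To see how these three approaches play out, here’s a minimal simulation sketch of the toy example above. It’s my own illustration, not anything from the paper: the variance parameters are arbitrary, and it assumes numpy, pandas, scipy, and statsmodels are available.

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subjects, n_trials = 20, 10
subject_sd, trial_sd, true_effect = 1.0, 1.0, 0.0  # null condition effect

rows = []
for s in range(n_subjects):
    intercept = rng.normal(0, subject_sd)  # subject-level random intercept
    for cond in (0, 1):
        for t in range(n_trials):
            rows.append((s, cond, intercept + true_effect * cond + rng.normal(0, trial_sd)))
df = pd.DataFrame(rows, columns=["subject", "condition", "y"])

# Approach 1: ignore subjects and treat all 400 trials as independent observations
p1 = stats.ttest_ind(df.y[df.condition == 1], df.y[df.condition == 0]).pvalue

# Approach 2: average trials within each subject and condition, then run a paired test
means = df.groupby(["subject", "condition"]).y.mean().unstack()
p2 = stats.ttest_rel(means[1], means[0]).pvalue

# Approach 3: mixed model with a random intercept for each subject
p3 = smf.mixedlm("y ~ condition", df, groups=df["subject"]).fit().pvalues["condition"]

print(p1, p2, p3)

Run this repeatedly and the second and third p-values track each other closely, while the first one is, as described above, answering a somewhat different question.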

To return to the Aarts et al paper, the key question is what exactly the authors mean when they say in their abstract that:

In neuroscience, experimental designs in which multiple observations are collected from a single research object (for example, multiple neurons from one animal) are common: 53% of 314 reviewed papers from five renowned journals included this type of data. These so-called ‘nested designs’ yield data that cannot be considered to be independent, and so violate the independency assumption of conventional statistical methods such as the t test. Ignoring this dependency results in a probability of incorrectly concluding that an effect is statistically significant that is far higher (up to 80%) than the nominal α level (usually set at 5%).

Note the key phrases here. It seems to me that the implication the reader is supposed to draw from this is that roughly 53% of the neuroscience literature is at high risk of reporting spurious results. But in reality this depends entirely on whether the authors mean that 53% of studies are modeling trial-level data but ignoring the nesting structure (as in approach 1 above), or that 53% of studies in the literature aren’t using hierarchical models, even though they may be doing nothing terribly wrong otherwise (e.g., because they’re using approach (2) above).

Unfortunately, the rest of the manuscript doesn’t really clarify the matter. Here’s the section in which the authors report how they obtained that 53% number:

To assess the prevalence of nested data and the ensuing problem of inflated type I error rate in neuroscience, we scrutinized all molecular, cellular and developmental neuroscience research articles published in five renowned journals (Science, Nature, Cell, Nature Neuroscience and every month’s first issue of Neuron) in 2012 and the first six months of 2013. Unfortunately, precise evaluation of the prevalence of nesting in the literature is hampered by incomplete reporting: not all studies report whether multiple measurements were taken from each research object and, if so, how many. Still, at least 53% of the 314 examined articles clearly concerned nested data, of which 44% specifically reported the number of observations per cluster with a minimum of five observations per cluster (that is, for robust multilevel analysis a minimum of five observations per cluster is required [11, 12]). The median number of observations per cluster, as reported in literature, was 13 (Fig. 1a), yet conventional analysis methods were used in all of these reports.

This is, as far as I can see, still ambiguous. The only additional information provided here is that 44% of studies specifically reported the number of observations per cluster. Unfortunately this still doesn’t tell us whether the effective degrees of freedom used in the statistical tests in those papers included nested observations, or instead averaged over nested observations within each group or subject prior to analysis.

Lest this seem like a rather pedantic statistical point, I hasten to emphasize that a lot hangs on it. The potential implications for the neuroscience literature are very different under each of these two scenarios. If it is in fact true that 53% of studies are inappropriately using a “fixed-effects” model (approach 1)–which seems to me to be what the Aarts et al abstract implies–the upshot is that a good deal of neuroscience research is in very bad statistical shape, and the authors will have done the community a great service by drawing attention to the problem. On the other hand, if the vast majority of the studies in that 53% are actually doing their analyses in a perfectly reasonable–if perhaps suboptimal–way, then the Aarts et al article seems rather alarmist. It would, of course, still be true that hierarchical models should be used more widely, but the cost of failing to switch would be much lower than seems to be implied.

I’ve emailed the corresponding author to ask for a clarification. I’ll update this post if I get a reply. In the meantime, I’m interested in others’ thoughts as to the likelihood that around half of the neuroscience literature involves inappropriate reporting of fixed-effects analyses. I guess personally I would be very surprised if this were the case, though it wouldn’t be unprecedented–e.g., I gather that in the early days of neuroimaging, the SPM analysis package used a fixed-effects model by default, resulting in quite a few publications reporting grossly inflated t/z/F statistics. But that was many years ago, and in the literatures I read regularly (in psychology and cognitive neuroscience), this problem rarely arises any more. A priori, I would have expected the same to be true in cellular and molecular neuroscience.


UPDATE 04/01 (no, not an April Fool’s joke)

The lead author, Emmeke Aarts, responded to my email. Here’s her reply in full:

Thank you for your interest in our paper. As the first author of the paper, I will answer the question you send to Sophie van der Sluis. Indeed we report that 53% of the papers include nested data using conventional statistics, meaning that they did not use multilevel analysis but an analysis method that assumes independent observations like a students t-test or ANOVA.

As you also note, the data can be analyzed at two levels, at the level of the individual observations, or at the subject/animal level. Unfortunately, with the information the papers provided us, we could not extract this information for all papers. However, as described in the section ‘The prevalence of nesting in neuroscience studies’, 44% of these 53% of papers including nested data, used conventional statistics on the individual observations, with at least a mean of 5 observations per subject/animal. Another 7% of these 53% of papers including nested data used conventional statistics at the subject/animal level. So this leaves 49% unknown. Of this 49%, there is a small percentage of papers which analyzed their data at the level of individual observations, but had a mean less than 5 observations per subject/animal (I would say 10 to 20% out of the top of my head), the remaining percentage is truly unknown. Note that with a high level of dependency, using conventional statistics on nested data with 2 observations per subject/animal is already undesirable. Also note that not only analyzing nested data at the individual level is undesirable, analyzing nested data at the subject/animal level is unattractive as well, as it reduces the statistical power to detect the experimental effect of interest (see fig. 1b in the paper), in a field in which a decent level of power is already hard to achieve (e.g., Button 2013).

I think this definitively answers my original question: according to Aarts, of the 53% of studies that used nested data, at least 44% performed conventional (i.e., non-hierarchical) statistical analyses on the individual observations. (I would dispute the suggestion that this was already stated in the paper; the key phrase is “on the individual observations”, and the wording in the manuscript was much more ambiguous.) Aarts suggests that ~50% of the studies couldn’t be readily classified, so in reality that proportion could be much higher. But we can say that at least 23% of the literature surveyed (i.e., 44% of the 53% of papers with nested data) committed what would, in most domains, constitute a fairly serious statistical error.

I then sent Aarts another email following up on Jake Westfall’s comment (i.e., asking how nested vs. crossed designs were handled). She replied:

As Jake Westfall points out, it indeed depends on the design if ignoring intercept variance (so variance in the mean observation per subject/animal) leads to an inflated type I error. There are two types of designs we need to distinguish here, design type I, where the experimental variable (for example control or experimental group) does not vary within the subjects/animals but only over the subjects/animals, and design Type II, where the experimental variable does vary within the subject/animal. Only in design type I, the type I error is increased by intercept variance. As pointed out in the discussion section of the paper, the paper only focuses on design Type I (“Here we focused on the most common design, that is, data that span two levels (for example, cells in mice) and an experimental variable that does not vary within clusters (for example, in comparing cell characteristic X between mutants and wild types, all cells from one mouse have the same genotype)”), to keep this already complicated matter accessible to a broad readership. Moreover, design type I is what is most frequently seen in biological neuroscience, taking multiple observations from one animal and subsequently comparing genotypes automatically results in a type I research design.

When dealing with a research design II, it is actually the variation in effect within subject/animals that increases the type I error rate (the so-called slope variance), but I will not elaborate too much on this since it is outside the scope of this paper and a completely different story.
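To make the design Type I case concrete, here’s a quick simulation sketch, my own illustration rather than anything from the paper: a between-animal comparison (say, two genotypes) with multiple cells per animal, analyzed either by treating every cell as an independent observation or by averaging cells within each animal first. The variance parameters are arbitrary.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_animals, n_cells = 2000, 5, 10
animal_sd, cell_sd = 1.0, 1.0
false_pos = {"cells_as_independent": 0, "animal_means": 0}

for _ in range(n_sims):
    # Two genotype groups with no true difference; each animal contributes n_cells cells,
    # and animals differ from one another via a random intercept.
    a = rng.normal(0, animal_sd, (n_animals, 1)) + rng.normal(0, cell_sd, (n_animals, n_cells))
    b = rng.normal(0, animal_sd, (n_animals, 1)) + rng.normal(0, cell_sd, (n_animals, n_cells))
    false_pos["cells_as_independent"] += stats.ttest_ind(a.ravel(), b.ravel()).pvalue < 0.05
    false_pos["animal_means"] += stats.ttest_ind(a.mean(axis=1), b.mean(axis=1)).pvalue < 0.05

print({k: v / n_sims for k, v in false_pos.items()})

With these settings the cell-level test rejects the true null far more often than the nominal 5%, while the animal-means test stays close to it, which is exactly the intercept-variance problem Aarts describes for design Type I.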

Again, Aarts’s explanation seems very straightforward and sound to me. So after both of these emails, here’s my (hopefully?) final take on the paper:

  • Work in molecular, cellular, and developmental neuroscience–or at least, the parts of those fields well-represented in five prominent journals–does indeed appear to suffer from some systemic statistical problems. While the proportion of studies at high risk of Type I error is smaller than the number Aarts et al’s abstract suggests (53%), the more accurate estimate (at least 23% of the literature) is still shockingly high. This doesn’t mean that a quarter or more of the literature can’t be trusted–as some of the commenters point out below, most conclusions aren’t based on just a single p value from a single analysis–but it does raise some very serious concerns. The Aarts et al paper is an important piece of work that will help improve statistical practice going forward.
  • The comments on this post, and on Twitter, have been interesting to read. There appear to be two broad camps of people who were sympathetic to my original concern about the paper. One camp consists of people who were similarly concerned about technical aspects of the paper, and in most cases were tripped up by the same confusion surrounding what the authors meant when they said 53% of studies used “conventional statistical analyses”. That point has now been addressed. The other camp consists of people who appear to work in the areas of neuroscience Aarts et al focused on, and were reacting not so much to the specific statistical concern raised by Aarts et al as to the broader suggestion that something might be deeply wrong with the neuroscience literature because of this. I confess that my initial knee-jerk reaction to the Aarts et al paper was driven in large part by the intuition that surely it wasn’t possible for so large a fraction of the literature to be routinely modeling subjects/clusters/groups as fixed effects. But since it appears that that is in fact the case, I’m not sure what to say with respect to the broader question over whether it is or isn’t appropriate to ignore nesting in animal studies. I will say that in the domains I personally work in, it seems very clear that collapsing across all subjects for analysis purposes is nearly always (if not always) a bad idea. Beyond that, I don’t really have any further opinion other than what I said in this response to a comment below.
  • While the claims made in the paper appear to be fundamentally sound, the presentation leaves something to be desired. It’s unclear to me why the authors relegated some of the most important technical points to the Discussion, or didn’t explicitly state them at all. The abstract also seems to me to be overly sensational–though, in hindsight, not nearly as much as I initially suspected. And it also seems questionable to tar all of neuroscience with a single brush when the analyses reported only applied to a few specific domains (and we know for a fact that in, say, neuroimaging, this problem is almost nonexistent). I guess to be charitable, one could pick the same bone with a very large proportion of published work, and this kind of thing is hardly unique to this study. Then again, the fact that a practice is widespread surely isn’t sufficient to justify that practice–or else there would be little point in Aarts et al criticizing a practice that so many people clearly engage in routinely.
  • Given my last post, I can’t help pointing out that this is a nice example of how mandatory data sharing (or failing that, a culture of strong expectations of preemptive sharing) could have made evaluation of scientific claims far easier. If the authors had attached the data file coding the 314 studies they reviewed as a supplement, I (and others) would have been able to clarify the ambiguity I originally raised much more quickly. I did send a follow up email to Aarts to ask if she and her colleagues would consider putting the data online, but haven’t heard back yet.
