Big Data, n. A kind of black magic

The annual Association for Psychological Science meeting is coming up in San Francisco this week. One of the cross-cutting themes this year is “Big Data: Understanding Patterns of Human Behavior”. Since I’m giving two Big Data-related talks (1, 2), and serving as discussant on a related symposium, I’ve been spending some time recently trying to come up with a sensible definition of Big Data within the context of psychological science. This has, in turn, led me to ponder the meaning of Big Data more generally.

After a few sleepless nights mulling it over, I’ve concluded that producing a unitary, comprehensive, domain-general definition of Big Data is probably not possible, for the simple reason that different communities have adopted and co-opted the term for decidedly different purposes. For example, in my own field of psychology, the very largest datasets that most researchers currently work with contain, at most, tens of thousands of cases and a few hundred variables (there are exceptions, of course). Such datasets fit comfortably into memory on any modern laptop; you’d have a hard time finding (m)any data scientists willing to call a dataset of this scale “Big”. Yet here we are, heading into APS, with multiple sessions focusing on the role of Big Data in psychological science. And psychology’s not unusual in this respect; we’re seeing similar calls for Big Data this and Big Data that in pretty much all branches of science and every area of the business world. I mean, even the humanities are getting in on the action.

You could take a cynical view of this and argue that all this really goes to show is that people like buzzwords. And there’s probably some truth to that. More pragmatically, though, we should acknowledge that language is this flexible kind of thing that likes to reshape itself from time to time. Words don’t have any intrinsic meaning above and beyond what we do with them, and it’s certainly not like anyone has a monopoly on a term that only really exploded into the lexicon circa 2011. So instead of trying to come up with a single, all-inclusive definition of Big Data, I’ve instead opted to try and make sense of the different usages we’re seeing in different communities. Below I suggest three distinct, but overlapping, definitions–corresponding to three different ways of thinking about what makes data “Big”. They are, roughly, (1) the kind of infrastructure required to support data processing, (2) the size of the dataset relative to the norm in a field, and (3) the complexity of the models required to make sense out of the data. To a first approximation, one can think of these as engineering, scientific, and statistical perspectives on Big Data, respectively.

The engineering perspective

One way to define Big Data is in terms of the infrastructure required to analyze the data. This is the closest thing we have to a classical definition. In fact, this way of thinking about what makes data “big” arguably predates the term Big Data itself. Take this figure, courtesy of Google Trends:

Notice that searches for Hadoop (a framework for massively distributed data-intensive computing) actually precede the widespread use of the term “Big Data” by a couple of years. If you’re the kind of person who likes to base their arguments entirely on search-based line graphs from Google (and I am!), you have here a rather powerful Exhibit A.

Alternatively, if you’re a more serious kind of person who privileges reason over pretty line plots, consider the following, rather simple, argument for Big Data qua infrastructure problem: any dataset that keeps growing is eventually going to get too big–meaning, it will inevitably reach a point at which it no longer fits into memory, or even onto local storage–and now requires a fundamentally different, massively parallel architecture to process. If you can solve your alleged “big data” problems by installing a new hard drive or some more RAM, you don’t really have a Big Data problem, you have an I’m-too-lazy-to-deal-with-this-right-now problem.
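The “just buy more RAM” stage also has a software analogue worth noting: long before you need a cluster, you can often sidestep memory limits entirely by streaming over the data instead of loading it all at once. A minimal sketch in Python (the filename and column position are hypothetical placeholders, not a reference to any real dataset):

```python
# Out-of-core aggregation: compute a column mean over a CSV file that
# may be too large to load into memory, by streaming one line at a time.
# Memory use stays constant no matter how big the file gets.

def streaming_mean(path, col=0):
    """Return the mean of one numeric column, read line by line."""
    total, count = 0.0, 0
    with open(path) as f:
        next(f)  # skip the header row
        for line in f:
            total += float(line.split(",")[col])
            count += 1
    return total / count if count else float("nan")
```

Tricks like this are exactly why the engineering threshold sits where it does: the Big Data problem, in this sense, only begins once even a full sequential pass over a single machine’s storage is too slow, and you have no choice but to spread the work across many machines.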

A real Big Data problem, from an engineering standpoint, is what happens once you’ve installed all the RAM your system can handle, maxed out your RAID array, and heavily optimized your analysis code, yet still find yourself unable to process your data in any reasonable amount of time. If you then complain to your IT staff about your computing problems and they start ranting to you about Hadoop and Hive and how you need to hire a bunch of engineers so you can build out a cluster and do Big Data the way Big Data is supposed to be done, well, congratulations–you now have a Big Data problem in the engineering sense. You now need to figure out how to build a highly distributed computing platform capable of handling really, really, large datasets.

Once the hungry wolves of Big Data have been temporarily pacified by building a new data center (or, you know, paying for an AWS account), you may have to rewrite at least part of your analysis code to take advantage of the massive parallelization your new architecture affords. But conceptually, you can probably keep asking and answering the same kinds of questions with your data. In this sense, Big Data isn’t directly about the data itself, but about what the data makes you do: a dataset counts as “Big” whenever it causes you to start whispering sweet nothings in Hadoop’s ear at night. Exactly when that happens will depend on your existing infrastructure, the demands imposed by your data, and so on. On modern hardware, some people have suggested that the transition tends to happen fairly consistently when datasets get to around 5 – 10 TB in size. But of course, that’s just a loose generalization, and we all know that loose generalizations are always a terrible idea.

The scientific perspective

Defining Big Data in terms of architecture and infrastructure is all well and good in domains where normal operations regularly generate terabytes (or even–gasp–petabytes!) of data. But the reality is that most people–and even, I would argue, many people whose job title currently includes the word “data” in it–will rarely need to run analyses distributed across hundreds or thousands of nodes. If we stick with the engineering definition of Big Data, this means someone like me–a lowly social or biomedical scientist who frequently deals with “large” datasets, but almost never with gigantic ones–doesn’t get to say they do Big Data. And that seems kind of unfair. I mean, Big Data is totally in right now, so why should corporate data science teams and particle physicists get to have all the fun? If I want to say I work with Big Data, I should be able to say I work with Big Data! There’s no way I can go to APS and give talks about Big Data unless I can unashamedly look myself in the mirror and say, look at that handsome, confident man getting ready to go to APS and talk about Big Data. So it’s imperative that we find a definition of Big Data that’s compatible with the kind of work people like me do.

Hey, here’s one that works:

Big Data, n. The minimum amount of data required to make one’s peers uncomfortable with the size of one’s data.

This definition is mostly facetious–but it’s a special kind of facetiousness that’s delicately overlaid on top of an earnest, well-intentioned core. The earnest core is that, in practice, many people who think of themselves as Big Data types but don’t own a timeshare condo in Hadoop Land implicitly seem to define Big Data as any dataset large enough to enable new kinds of analyses that weren’t previously possible with smaller datasets. Exactly what dimensionality of data is sufficient to attain this magical status will vary by field, because conventional dataset sizes vary by field. For instance, in human vision research, many researchers can get away with collecting a few hundred trials from three subjects in one afternoon and calling it a study. In contrast, if you’re a population geneticist working with raw sequence data, you probably deal with fuhgeddaboudit amounts of data on a regular basis. So clearly, what it means to be in possession of a “big” dataset depends on who you are. But the point is that in every field there are going to be people who look around and say, you know what? Mine’s bigger than everyone else’s. And those are the people who have Big Data.

I don’t mean that pejoratively, mind you. Quite the contrary: an arms race towards ever-larger datasets strikes me as a good thing for most scientific fields to have, regardless of whether or not the motives for the data embiggening are perfectly cromulent. Having more data often lets you do things that you simply couldn’t do with smaller datasets. With more data, confidence intervals shrink, so effect size estimates become more accurate; it becomes easier to detect and characterize higher-order interactions between variables; you can stratify and segment the data in various ways, explore relationships with variables that may not have been of a priori interest; and so on and so forth. Scientists, by and large, seem to be prone to thinking of Big Data in these relativistic terms, so that a “Big” dataset is, roughly, a dataset that’s large enough and rich enough that you can do all kinds of novel and interesting things with it that you might not have necessarily anticipated up front. And that’s refreshing, because if you’ve spent much time hanging around science departments, you’ll know that the answers to about 20% of all questions during Q&A periods end with the words well, that’s a great idea, but we just don’t have enough data to answer that. Big Data, in a scientific sense, is when that answer changes to: hey, that’s a great idea, and I’ll try that as soon as I get back to my office. (Or perhaps more realistically: hey that’s a great idea, and I’ll be sure to try that–as soon as I can get my one tech-savvy grad student to wrangle the data into the right format.)
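That confidence-interval shrinkage, by the way, is just the 1/√n behavior of the standard error, which you can see on the back of an envelope. A quick sketch (the standard deviation of 15 is an arbitrary, IQ-style example, not drawn from any particular study):

```python
# The half-width of an approximate 95% CI for a mean scales as
# 1/sqrt(n): quadrupling the sample size halves the interval width.
import math

def ci_halfwidth(sd, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a mean."""
    return z * sd / math.sqrt(n)

for n in (25, 100, 400):
    print(n, round(ci_halfwidth(sd=15, n=n), 2))
# 25 -> 5.88, 100 -> 2.94, 400 -> 1.47
```

The flip side of that square-root law is less cheerful: every additional halving of your uncertainty costs four times as much data, which is part of why the arms race never really ends.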

It’s probably worth noting in passing that this relativistic, application-centered definition of Big Data also seems to be picking up cultural steam far beyond the scientific community. Most of the recent criticisms of Big Data seem to have something vaguely like this definition in mind. (Actually, I would argue pretty strenuously that most of these criticisms aren’t really even about Big Data in this sense, and are actually just objections to mindless and uncritical exploratory analysis of any dataset, however big or small. But that’s a post for another day.)

The statistical perspective

A third way to think about Big Data is to focus on the kinds of statistical methods required in order to make sense of a dataset. On this view, what matters isn’t the size of the dataset, or the infrastructure demands it imposes, but how you use it. Once again, we can appeal to a largely facetious definition clinging for dear life onto a half-hearted effort at pithy insight:

Big Data, n. The minimal amount of data that allows you to set aside a quarter of your dataset as a hold-out and still train a model that performs reasonably well when tested out-of-sample.

The nugget of would-be insight in this case is this: the world is usually a more complicated place than it appears to be at first glance. It’s generally much harder to make reliable predictions about new (i.e., previously unseen) cases than one might suppose given conventional analysis practices in many fields of science. For example, in psychology, it’s very common to see papers report extremely large R2 values from fitted models–often accompanied by claims to the effect that the researchers were able to “predict” most of the variance in the outcome. But such claims are rarely actually supported by the data presented, because the studies in question overwhelmingly tend to overfit their models by using the same data for training and testing (to say nothing of p-hacking and other Questionable Research Practices). Fitting a model that can capably generalize to entirely new data often requires considerably more data than one might expect. The precise amount depends on the problem in question, but I think it’s fair to say that there are many domains in which problems that researchers routinely try to tackle with sample sizes of 20 – 100 cases would in reality require samples two or three orders of magnitude larger to really get a good grip on.
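The train-on-your-test-set problem is easy to demonstrate with a few lines of simulation: fit an ordinary least-squares model with many predictors to pure noise, and the in-sample R² looks impressive while the held-out R² collapses. A sketch (entirely simulated data; the sample and predictor counts are illustrative, not taken from any real study):

```python
# Overfitting demo: regress pure noise on 15 random predictors using
# only 20 "training" cases, then evaluate on 20 held-out cases.
# In-sample R^2 is large even though there is no signal at all.
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 15
X = rng.normal(size=(n, p))
y = rng.normal(size=n)            # outcome is pure noise

train, test = slice(0, 20), slice(20, 40)   # 50/50 split
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

def r2(X, y, beta):
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print(round(r2(X[train], y[train], beta), 2))  # large, despite no signal
print(round(r2(X[test], y[test], beta), 2))    # near zero or negative
```

With 15 free parameters and only 20 cases, the model can “explain” most of the training variance by fitting noise; the held-out cases expose the illusion immediately, which is exactly what a paper reporting only in-sample R² never has to confront.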

The key point is that when we don’t have a lot of data to work with, it’s difficult to say much of anything about how big an effect is (unless we’re willing to adopt strong priors). Instead, we tend to fall back on the crutch of null hypothesis significance testing and start babbling on about whether there is or isn’t a “statistically significant effect”. I don’t really want to get into the question of whether the latter kind of thinking is ever useful (see Krantz (1999) for a review of its long and sordid history). What I do hope is not controversial is this: if your conclusions are ever in danger of changing radically depending on whether the coefficients in your model are on this side of p = .05 versus that side of p = .05, those conclusions are, by definition, not going to be terribly reliable over the long haul. Anything that helps move us away from that decision boundary and puts us in a position where we can worry more about what our conclusions ought to be than about whether we should be saying anything at all is a good thing. And since the single thing that matters most in that regard is the size of our dataset, it follows that we should want to have datasets that are as Big as possible. If we can fit complex models using lots of features and show that those models still perform well when tested out-of-sample, we can feel much more confident about whatever else we feel inclined to say.

From a statistical perspective, then, one might say that a dataset is “Big” when it’s sufficiently large that we can spend most of our time thinking about what kinds of models to fit and what kinds of features to include so as to maximize predictive power and/or understanding, rather than worrying about what we can and can’t do with the data for fear of everything immediately collapsing into a giant multicollinear mess. Admittedly, this is more of a theoretical ideal than a practical goal, because as Andrew Gelman points out, in practice “N is never large”. As soon as we get our hands on enough data to stabilize the estimates from one kind of model, we immediately go on to ask more fine-grained questions that require even more data. And we don’t stop until we’re right back where we started, hovering at the very edge of our ability to produce sensible estimates, staring down the precipice of uncertainty. But hey, that’s okay. Nobody said these definitions have to be useful; it’s hard enough just trying to make them semi-coherent.

Conclusion

So there you have it: three ways to define Big Data. All three of these definitions are fuzzy, and will bleed into one another if you push on them a little bit. In particular, you could argue that, extensionally, the engineering definition of Big Data is a superset of the other two definitions, as it’s very likely that any dataset big enough to require a fundamentally different architecture is also big enough to handle complex statistical models and to do interesting and novel things with. So the point of all this is not to describe three completely separate communities with totally different practices; it’s simply to distinguish between three different uses of the term Big Data, all of which I think are perfectly sensible in different contexts, but that can cause communication problems when people from different backgrounds interact.

Of course, this isn’t meant to be an exhaustive catalog. I don’t doubt that there are many other potential definitions of Big Data that would each elicit enthusiastic head nods from various communities. For example, within the less technical sectors of the corporate world, there appears to be yet another fairly distinctive definition of Big Data. It goes something like this:

Big Data, n. A kind of black magic practiced by sorcerers known as quants. Nobody knows how it works, but it’s capable of doing anything.

In any case, the bottom line here is really just that context matters. If you go to APS this week, there’s a good chance you’ll stumble across many psychologists earnestly throwing the term “Big Data” around, even though they’re mostly discussing datasets that would fit snugly into a sliver of memory on a modern phone. If your day job involves crunching data at CERN or Google, this might amuse you. But the correct response, once you’re done smiling on the inside, is not, Hah! That’s not Big Data, you idiot! It should probably be something more like Hey, you talk kind of funny. You must come from a different part of the world than I do. We should get together some time and compare notes.

the APS likes me!

Somehow I wound up profiled in this month’s issue of the APS Observer as a “Rising Star”. I’d like to believe this means I’m a really big deal now, but I suspect what it actually means is that someone on the nominating committee at APS has extraordinarily bad judgment. I say this in no small part because I know some of the other people who were named Rising Stars quite well (congrats to Karl Szpunar, Jason Chan, and Alan Castel, among many other people!), so I’m pretty sure I can distinguish people who actually deserve this from, say, me.

Of course, I’m not going to look a gift horse in the mouth. And I’m certainly thrilled to be picked for this. I know these things are kind of a crapshoot, but it still feels really nice. So while the part of my brain that understands measurement error is saying “meh, luck of the draw,” that other part of my brain that likes to be told it’s awesome is in the middle of a three day coke bender right now*. The only regret both parts of the brain have is that there isn’t any money attached to the award–or even a token prize like, say, a free statistician for a year. But I don’t think I’m going to push my luck by complaining to APS about it.

One thing I like a lot about the format of the Rising Star awards is that they give you a full page to talk about yourself and your research. If there’s one thing I like to talk about, it’s myself. Usually, you can’t talk about yourself for very long before people start giving you dirty looks. But in this case, it’s sanctioned, so I guess it’s okay. In any case, the kind folks at the Observer sent me a series of seven questions to answer. And being an upstanding gentleman who likes to be given fancy awards, I promptly obliged. I figured they would just run what I sent them with minor edits… but I WAS VERY WRONG. They promptly disassembled nearly all of my brilliant observations and advice and replaced them with some very tame ramblings. So if you actually bother to read my responses, and happen to fall asleep halfway through, you’ll know who to blame. But just to set the record straight, I figured I would run through each of the boilerplate questions I was asked, and show you the answer that was printed in the Observer as compared to what I actually wrote**:

What does your research focus on?

What they printed: Most of my current research focuses on what you might call psychoinformatics: the application of information technology to psychology, with the aim of advancing our ability to study the human mind and brain. I’m interested in developing new ways to acquire, synthesize, and share data in psychology and cognitive neuroscience. Some of the projects I’ve worked on include developing new ways to measure personality more efficiently, adapting computer science metrics of string similarity to visual word recognition, modeling fMRI data on extremely short timescales, and conducting large-scale automated synthesis of published neuroimaging findings. The common theme that binds these disparate projects together is the desire to develop new ways of conceptualizing and addressing psychological problems; I believe very strongly in the transformative power of good methods.

What I actually said: I don’t know! There’s so much interesting stuff to think about! I can’t choose!

What drew you to this line of research? Why is it exciting to you?

What they printed: Technology enriches and improves our lives in every domain, and science is no exception. In the biomedical sciences in particular, many revolutionary discoveries would have been impossible without substantial advances in information technology. Entire subfields of research in molecular biology and genetics are now synonymous with bioinformatics, and neuroscience is currently also experiencing something of a neuroinformatics revolution. The same trend is only just beginning to emerge in psychology, but we’re already able to do amazing things that would have been unthinkable 10 or 20 years ago. For instance, we can now collect data from thousands of people all over the world online, sample people’s inner thoughts and feelings in real time via their phones, harness enormous datasets released by governments and corporations to study everything from how people navigate their spatial world to how they interact with their friends, and use high-performance computing platforms to solve previously intractable problems through large-scale simulation. Over the next few years, I think we’re going to see transformative changes in the way we study the human mind and brain, and I find that a tremendously exciting thing to be involved in.

What I actually said: I like psychology a lot, and I like technology a lot. Why not combine them!

Who were/are your mentors or psychological influences?

What they printed: I’ve been fortunate to have outstanding teachers and mentors at every stage of my training. I actually started my academic career quite disinterested in science and owe my career trajectory in no small part to two stellar philosophy professors (Rob Stainton and Chris Viger) who convinced me as an undergraduate that engaging with empirical data was a surprisingly good way to discover how the world really works. I can’t possibly do justice to all the valuable lessons my graduate and postdoctoral mentors have taught me, so let me just pick a few out of a hat. Among many other things, Todd Braver taught me how to talk through problems collaboratively and keep recursively questioning the answers to problems until a clear understanding materializes. Randy Larsen taught me that patience really is a virtue, despite my frequent misgivings. Tor Wager has taught me to think more programmatically about my research and to challenge myself to learn new skills. All of these people are living proof that you can be an ambitious, hard-working, and productive scientist and still be extraordinarily kind and generous with your time. I don’t think I embody those qualities myself right now, but at least I know what to shoot for.

What I actually said: Richard Feynman, Richard Hamming, and my mother. Not necessarily in that order.

To what do you attribute your success in the science?

What they printed: Mostly to blind luck. So far I’ve managed to stumble from one great research and mentoring situation to another. I’ve been fortunate to have exceptional advisors who’ve provided me with the perfect balance of freedom and guidance and amazing colleagues and friends who’ve been happy to help me out with ideas and resources whenever I’m completely out of my depth — which is most of the time.

To the extent that I can take personal credit for anything, I think I’ve been good about pursuing ideas I’m passionate about and believe in, even when they seem unlikely to pay off at first. I’m also a big proponent of exploratory research; I think pure exploration is tremendously undervalued in psychology. Many of my projects have developed serendipitously, as a result of asking, “What happens if we try doing it this way?”

What I actually said: Mostly to blind luck.

What’s your future research agenda?

What they printed: I’d like to develop technology-based research platforms that improve psychologists’ ability to answer existing questions while simultaneously opening up entirely new avenues of research. That includes things like developing ways to collect large amounts of data more efficiently, tracking research participants over time, automatically synthesizing the results of published studies, building online data repositories and collaboration tools, and more. I know that all sounds incredibly vague, and if you have some ideas about how to go about any of it, I’d love to collaborate! And by collaborate, I mean that I’ll brew the coffee and you’ll do the work.

What I actually said: Trading coffee for publications?

Any advice for even younger psychological scientists? What would you tell someone just now entering graduate school or getting their PhD?

What they printed: The responsible thing would probably be to say “Don’t go to graduate school.” But if it’s too late for that, I’d recommend finding brilliant mentors and colleagues and serving them coffee exactly the way they like it. Failing that, find projects you’re passionate about, work with people you enjoy being around, develop good technical skills, and don’t be afraid to try out crazy ideas. Leave your office door open, and talk to everyone you can about the research they’re doing, even if it doesn’t seem immediately relevant. Good ideas can come from anywhere and often do.

What I actually said: “Don’t go to graduate school.”

What publication are you most proud of, or feel has been most important to your career?

What they printed: Yarkoni, T., Poldrack, R. A., Nichols, T. E., Van Essen, D. C., & Wager, T. D. (2011). Large-scale automated synthesis of human functional neuroimaging data. Manuscript submitted for publication.

In this paper, we introduce a highly automated platform for synthesizing data from thousands of published functional neuroimaging studies. We used a combination of text mining, meta-analysis, and machine learning to automatically generate maps of brain activity for hundreds of different psychological concepts, and we showed that these results could be used to “decode” cognitive states from brain activity in individual human subjects in a relatively open-ended way. I’m very proud of this work, and I’m quite glad that my co-authors agreed to make me first author in return for getting their coffee just right. Unfortunately, the paper isn’t published yet, so you’ll just have to take my word for it that it’s really neat stuff. And if you’re thinking, “Isn’t it awfully convenient that his best paper is unpublished?”… why, yes. Yes it is.

What I actually said: …actually, that’s almost exactly what I said. Except they inserted that bit about trading coffee for co-authorship. Really all I had to do was ask my co-authors nicely.

Anyway, like I said, it’s really nice to be honored in this way, even if I don’t really deserve it (and that’s not false modesty–I’m generally the first to tell other people when I think I’ve done something awesome). But I’m a firm believer in regression to the mean, so I suspect the run of good luck won’t last. In a few years, when I’ve done almost no new original work, failed to land a tenure-track job, and dropped out of academia to ride horses around the racetrack***, you can tell people that you knew me back when I was a Rising Star. Right before you tell them you don’t know what the hell happened.

———————————-

* But not really.

** Totally lying. Pretty much every word is as I wrote it. And the Observer staff were great.

*** Hopefully none of these things will happen. Except the jockey thing; that would be awesome.