Big Data, n. A kind of black magic

The annual Association for Psychological Science meeting is coming up in San Francisco this week. One of the cross-cutting themes this year is “Big Data: Understanding Patterns of Human Behavior”. Since I’m giving two Big Data-related talks (1, 2), and serving as discussant on a related symposium, I’ve been spending some time recently trying to come up with a sensible definition of Big Data within the context of psychological science. This has, in turn, led me to ponder the meaning of Big Data more generally.

After a few sleepless nights spent mulling it over, I’ve concluded that producing a unitary, comprehensive, domain-general definition of Big Data is probably not possible, for the simple reason that different communities have adopted and co-opted the term for decidedly different purposes. For example, in said field of psychology, the very largest datasets that most researchers currently work with contain, at most, tens of thousands of cases and a few hundred variables (there are exceptions, of course). Such datasets fit comfortably into memory on any modern laptop; you’d have a hard time finding (m)any data scientists willing to call a dataset of this scale “Big”. Yet here we are, heading into APS, with multiple sessions focusing on the role of Big Data in psychological science. And psychology’s not unusual in this respect; we’re seeing similar calls for Big Data this and Big Data that in pretty much all branches of science and every area of the business world. I mean, even the humanities are getting in on the action.

You could take a cynical view of this and argue that all this really shows is that people like buzzwords. And there’s probably some truth to that. More pragmatically, though, we should acknowledge that language is this flexible kind of thing that likes to reshape itself from time to time. Words don’t have any intrinsic meaning above and beyond what we do with them, and it’s certainly not like anyone has a monopoly on a term that only really exploded into the lexicon circa 2011. So instead of trying to come up with a single, all-inclusive definition of Big Data, I’ve opted to try to make sense of the different usages we’re seeing in different communities. Below I suggest three distinct, but overlapping, definitions–corresponding to three different ways of thinking about what makes data “Big”. They are, roughly, (1) the kind of infrastructure required to support data processing, (2) the size of the dataset relative to the norm in a field, and (3) the complexity of the models required to make sense out of the data. To a first approximation, one can think of these as engineering, scientific, and statistical perspectives on Big Data, respectively.

The engineering perspective

One way to define Big Data is in terms of the infrastructure required to analyze the data. This is the closest thing we have to a classical definition. In fact, this way of thinking about what makes data “big” arguably predates the term Big Data itself. Take this figure, courtesy of Google Trends:

Notice that searches for Hadoop (a framework for massively distributed data-intensive computing) actually precede the widespread use of the term “Big Data” by a couple of years. If you’re the kind of person who likes to base their arguments entirely on search-based line graphs from Google (and I am!), you have here a rather powerful Exhibit A.

Alternatively, if you’re a more serious kind of person who privileges reason over pretty line plots, consider the following, rather simple, argument for Big Data qua infrastructure problem: any dataset that keeps growing is eventually going to get too big–meaning, it will inevitably reach a point at which it no longer fits into memory, or even onto local storage–and will require a fundamentally different, massively parallel architecture to process. If you can solve your alleged “big data” problems by installing a new hard drive or some more RAM, you don’t really have a Big Data problem, you have an I’m-too-lazy-to-deal-with-this-right-now problem.

A real Big Data problem, from an engineering standpoint, is what happens once you’ve installed all the RAM your system can handle, maxed out your RAID array, and heavily optimized your analysis code, yet still find yourself unable to process your data in any reasonable amount of time. If you then complain to your IT staff about your computing problems and they start ranting to you about Hadoop and Hive and how you need to hire a bunch of engineers so you can build out a cluster and do Big Data the way Big Data is supposed to be done, well, congratulations–you now have a Big Data problem in the engineering sense. You now need to figure out how to build a highly distributed computing platform capable of handling really, really, large datasets.

Once the hungry wolves of Big Data have been killed off–er, temporarily pacified–by building a new data center (or, you know, paying for an AWS account), you may have to rewrite at least part of your analysis code to take advantage of the massive parallelization your new architecture affords. But conceptually, you can probably keep asking and answering the same kinds of questions with your data. In this sense, Big Data isn’t directly about the data itself, but about what the data makes you do: a dataset counts as “Big” whenever it causes you to start whispering sweet nothings in Hadoop’s ear at night. Exactly when that happens will depend on your existing infrastructure, the demands imposed by your data, and so on. On modern hardware, some people have suggested that the transition tends to happen fairly consistently when datasets get to around 5 – 10 TB in size. But of course, that’s just a loose generalization, and we all know that loose generalizations are always a terrible idea.
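
To make this concrete, here’s a minimal sketch of the kind of stopgap you reach for just before the wolves arrive: streaming a file through memory in manageable chunks rather than loading it all at once. The file name and column here are made up for illustration; the point is that once even this gets too slow on a single machine, the same sum-and-count logic is what a distributed framework ends up farming out across a cluster for you.

    # Hypothetical example: computing a column mean when the file won't fit in RAM.
    # "ratings.csv" and the "score" column are placeholders, not real data.
    import pandas as pd

    total, count = 0.0, 0

    # Stream the file in 1-million-row chunks instead of reading it all at once.
    for chunk in pd.read_csv("ratings.csv", usecols=["score"], chunksize=1_000_000):
        total += chunk["score"].sum()
        count += len(chunk)

    print("mean score:", total / count)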

The scientific perspective

Defining Big Data in terms of architecture and infrastructure is all well and good in domains where normal operations regularly generate terabytes (or even–gasp–petabytes!) of data. But the reality is that most people–and even, I would argue, many people whose job title currently includes the word “data” in it–will rarely need to run analyses distributed across hundreds or thousands of nodes. If we stick with the engineering definition of Big Data, this means someone like me–a lowly social or biomedical scientist who frequently deals with “large” datasets, but almost never with gigantic ones–doesn’t get to say they do Big Data. And that seems kind of unfair. I mean, Big Data is totally in right now, so why should corporate data science teams and particle physicists get to have all the fun? If I want to say I work with Big Data, I should be able to say I work with Big Data! There’s no way I can go to APS and give talks about Big Data unless I can unashamedly look myself in the mirror and say, look at that handsome, confident man getting ready to go to APS and talk about Big Data. So it’s imperative that we find a definition of Big Data that’s compatible with the kind of work people like me do.

Hey, here’s one that works:

Big Data, n. The minimum amount of data required to make one’s peers uncomfortable with the size of one’s data.

This definition is mostly facetious–but it’s a special kind of facetiousness that’s delicately overlaid on top of an earnest, well-intentioned core. The earnest core is that, in practice, many people who think of themselves as Big Data types but don’t own a timeshare condo in Hadoop Land implicitly seem to define Big Data as any dataset large enough to enable new kinds of analyses that weren’t previously possible with smaller datasets. Exactly what dimensionality of data is sufficient to attain this magical status will vary by field, because conventional dataset sizes vary by field. For instance, in human vision research, many researchers can get away with collecting a few hundred trials from three subjects in one afternoon and calling it a study. In contrast, if you’re a population geneticist working with raw sequence data, you probably deal with fuhgeddaboudit amounts of data on a regular basis. So clearly, what it means to be in possession of a “big” dataset depends on who you are. But the point is that in every field there are going to be people who look around and say, you know what? Mine’s bigger than everyone else’s. And those are the people who have Big Data.

I don’t mean that pejoratively, mind you. Quite the contrary: an arms race towards ever-larger datasets strikes me as a good thing for most scientific fields to have, regardless of whether or not the motives for the data embiggening are perfectly cromulent. Having more data often lets you do things that you simply couldn’t do with smaller datasets. With more data, confidence intervals shrink, so effect size estimates become more accurate; it becomes easier to detect and characterize higher-order interactions between variables; you can stratify and segment the data in various ways, explore relationships with variables that may not have been of a priori interest; and so on and so forth. Scientists, by and large, seem to be prone to thinking of Big Data in these relativistic terms, so that a “Big” dataset is, roughly, a dataset that’s large enough and rich enough that you can do all kinds of novel and interesting things with it that you might not have necessarily anticipated up front. And that’s refreshing, because if you’ve spent much time hanging around science departments, you’ll know that the answer to about 20% of all questions during Q&A periods ends with the words well, that’s a great idea, but we just don’t have enough data to answer that. Big Data, in a scientific sense, is when that answer changes to: hey, that’s a great idea, and I’ll try that as soon as I get back to my office. (Or perhaps more realistically: hey that’s a great idea, and I’ll be sure to try that–as soon as I can get my one tech-savvy grad student to wrangle the data into the right format.)
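
To put a rough number on the “confidence intervals shrink” point, here’s a toy simulation–no real data involved, just an invented 0.3 SD effect–showing how the width of a 95% confidence interval around a mean narrows roughly as 1/sqrt(N):

    # Toy illustration with made-up numbers: the 95% CI for a mean narrows
    # roughly as 1/sqrt(N), so quadrupling the sample about halves the interval.
    import numpy as np

    rng = np.random.default_rng(0)

    for n in [50, 500, 5_000, 50_000]:
        sample = rng.normal(loc=0.3, scale=1.0, size=n)  # invented true effect = 0.3 SD
        se = sample.std(ddof=1) / np.sqrt(n)
        lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
        print(f"N = {n:>6}: 95% CI width = {hi - lo:.3f}")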

It’s probably worth noting in passing that this relativistic, application-centered definition of Big Data also seems to be picking up cultural steam far beyond the scientific community. Most of the recent criticisms of Big Data seem to have something vaguely like this definition in mind. (Actually, I would argue pretty strenuously that most of these criticisms aren’t really even about Big Data in this sense, and are actually just objections to mindless and uncritical exploratory analysis of any dataset, however big or small. But that’s a post for another day.)

The statistical perspective

A third way to think about Big Data is to focus on the kinds of statistical methods required in order to make sense of a dataset. On this view, what matters isn’t the size of the dataset, or the infrastructure demands it imposes, but how you use it. Once again, we can appeal to a largely facetious definition clinging for dear life onto a half-hearted effort at pithy insight:

Big Data, n: the minimal amount of data that allows you to set aside a quarter of your dataset as a hold-out and still train a model that performs reasonably well when tested out-of-sample.

The nugget of would-be insight in this case is this: the world is usually a more complicated place than it appears to be at first glance. It’s generally much harder to make reliable predictions about new (i.e., previously unseen) cases than one might suppose given conventional analysis practices in many fields of science. For example, in psychology, it’s very common to see papers report extremely large R2 values from fitted models–often accompanied by claims to the effect that the researchers were able to “predict” most of the variance in the outcome. But such claims are rarely actually supported by the data presented, because the studies in question overwhelmingly tend to overfit their models by using the same data for training and testing (to say nothing of p-hacking and other Questionable Research Practices). Fitting a model that can capably generalize to entirely new data often requires considerably more data than one might expect. The precise amount depends on the problem in question, but I think it’s fair to say that there are many domains in which problems that researchers routinely try to tackle with sample sizes of 20 – 100 cases would in reality require samples two or three orders of magnitude larger to really get a good grip on.
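
To see how this plays out, here’s a deliberately contrived sketch (the effect sizes, predictor counts, and sample sizes are all invented) that fits an ordinary linear regression with 20 mostly-noise predictors and compares in-sample R2 to performance on a held-out quarter of the data:

    # Contrived demonstration with made-up numbers: with few cases and many noisy
    # predictors, in-sample R^2 looks impressive while held-out R^2 collapses.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)

    def in_vs_out_of_sample_r2(n_cases, n_predictors=20):
        X = rng.normal(size=(n_cases, n_predictors))
        # Only the first predictor carries any real signal; the rest are noise.
        y = 0.3 * X[:, 0] + rng.normal(size=n_cases)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=0)
        model = LinearRegression().fit(X_train, y_train)
        return model.score(X_train, y_train), model.score(X_test, y_test)

    for n in [50, 500, 5_000]:
        r2_train, r2_test = in_vs_out_of_sample_r2(n)
        print(f"N = {n:>5}: in-sample R^2 = {r2_train:.2f}, held-out R^2 = {r2_test:.2f}")

With a few dozen cases, the in-sample R2 will typically look great while the held-out R2 hovers around zero (or worse); only once the sample grows into the thousands do the two numbers start telling the same story.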

The key point is that when we don’t have a lot of data to work with, it’s difficult to say much of anything about how big an effect is (unless we’re willing to adopt strong priors). Instead, we tend to fall back on the crutch of null hypothesis significance testing and start babbling on about whether there is or isn’t a “statistically significant effect”. I don’t really want to get into the question of whether the latter kind of thinking is ever useful (see Krantz (1999) for a review of its long and sordid history). What I do hope is not controversial is this: if your conclusions are ever in danger of changing radically depending on whether the coefficients in your model are on this side of p = .05 versus that side of p = .05, those conclusions are, by definition, not going to be terribly reliable over the long haul. Anything that helps move us away from that decision boundary and puts us in a position where we can worry more about what our conclusions ought to be than about whether we should be saying anything at all is a good thing. And since the single thing that matters most in that regard is the size of our dataset, it follows that we should want to have datasets that are as Big as possible. If we can fit complex models using lots of features and show that those models still perform well when tested out-of-sample, we can feel much more confident about whatever else we feel inclined to say.
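
The flip side is easy to simulate too (again, with purely illustrative numbers): hand a bunch of studies the exact same true effect, and at small sample sizes whether any particular study lands on the “significant” side of p = .05 is largely a matter of luck:

    # Purely illustrative simulation: the same true 0.3 SD effect is "significant"
    # in only a fraction of small-N replications, but almost always at large N.
    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(2)

    for n in [20, 100, 1_000]:
        n_significant = 0
        for _ in range(1_000):  # 1,000 simulated replications
            sample = rng.normal(loc=0.3, scale=1.0, size=n)
            if ttest_1samp(sample, 0.0).pvalue < .05:
                n_significant += 1
        print(f"N = {n:>4}: p < .05 in {n_significant / 10:.0f}% of replications")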

From a statistical perspective, then, one might say that a dataset is “Big” when it’s sufficiently large that we can spend most of our time thinking about what kinds of models to fit and what kinds of features to include so as to maximize predictive power and/or understanding, rather than worrying about what we can and can’t do with the data for fear of everything immediately collapsing into a giant multicollinear mess. Admittedly, this is more of a theoretical ideal than a practical goal, because as Andrew Gelman points out, in practice “N is never large”. As soon as we get our hands on enough data to stabilize the estimates from one kind of model, we immediately go on to ask more fine-grained questions that require even more data. And we don’t stop until we’re right back where we started, hovering at the very edge of our ability to produce sensible estimates, staring down the precipice of uncertainty. But hey, that’s okay. Nobody said these definitions have to be useful; it’s hard enough just trying to make them semi-coherent.

Conclusion

So there you have it: three ways to define Big Data. All three of these definitions are fuzzy, and will bleed into one another if you push on them a little bit. In particular, you could argue that, extensionally, the engineering definition of Big Data picks out a subset of the datasets covered by the other two definitions, as it’s very likely that any dataset big enough to require a fundamentally different architecture is also big enough to handle complex statistical models and to do interesting and novel things with. So the point of all this is not to describe three completely separate communities with totally different practices; it’s simply to distinguish between three different uses of the term Big Data, all of which I think are perfectly sensible in different contexts, but that can cause communication problems when people from different backgrounds interact.

Of course, this isn’t meant to be an exhaustive catalog. I don’t doubt that there are many other potential definitions of Big Data that would each elicit enthusiastic head nods from various communities. For example, within the less technical sectors of the corporate world, there appears to be yet another fairly distinctive definition of Big Data. It goes something like this:

Big Data, n. A kind of black magic practiced by sorcerers known as quants. Nobody knows how it works, but it’s capable of doing anything.

In any case, the bottom line here is really just that context matters. If you go to APS this week, there’s a good chance you’ll stumble across many psychologists earnestly throwing the term “Big Data” around, even though they’re mostly discussing datasets that would fit snugly into a sliver of memory on a modern phone. If your day job involves crunching data at CERN or Google, this might amuse you. But the correct response, once you’re done smiling on the inside, is not, Hah! That’s not Big Data, you idiot! It should probably be something more like Hey, you talk kind of funny. You must come from a different part of the world than I do. We should get together some time and compare notes.

I’m moving to Austin!

The title pretty much says it. After spending four great years in Colorado, I’m happy to say that I’ll be moving to Austin at the end of the month. I’ll be joining the Department of Psychology at UT-Austin as a Research Associate, where I plan to continue dabbling in all things psychological and informatic, but with less snow and more air conditioning.

While my new position nominally has the same title as my old one, the new one’s a bit unusual in that the funding is coming from two quite different sources. Half of it comes from my existing NIH grant for development of the Neurosynth framework, which means that half of my time will be spent more or less the same way I’m spending it now–namely, on building tools to improve and automate the large-scale synthesis of functional MRI data. (Incidentally, I’ll be hiring a software developer and/or postdoc in the very near future, so drop me a line if you think you might be interested.)

The other half of the funding is tied to the PsyHorns course developed by Jamie Pennebaker and Sam Gosling over the past few years. PsyHorns is a synchronous massive online course (SMOC) that lets anyone in the world with an internet connection (okay, and $550 in loose change lying around) take an introductory psychology class via the internet and officially receive credit for it from the University of Texas (this recent WSJ article on PsyHorns provides some more details). My role will be to serve as a bridge between the psychologists and the developers–which means I’ll have an eclectic assortment of duties like writing algorithms to detect cheating, developing tools to predict how well people are doing in the class, mining the gigantic reams of data we’re acquiring, developing ideas for new course features, and, of course, publishing papers.

Naturally, the PILab will be joining me in my southern adventure. Since the PILab currently only has one permanent member (guess who?), and otherwise consists of a single Mac Pro workstation, this latter move involves much less effort than you might think (though it does mean I’ll have to change the lab website’s URL, logo, and–horror of horrors–color scheme). Unfortunately, all the wonderful people of the PILab will be staying behind, as they all have various much more important ties to Boulder (by which I mean that I’m not actually currently paying any of their salaries, and none of them were willing to subsist on the stipend of baked beans, love, and high-speed internet I offered them).

While I’m super excited about moving to Austin, I’m not at all excited to leave Colorado. Boulder is a wonderful place to live*–it’s sunny all the time, has a compact, walkable core, a surprising amount of stuff to do, and these gigantic mountain things you can walk all over. My wife and I have made many incredible friends here, and after four years in Colorado, it’s come to feel very much like home. So leaving will be difficult. Still, I’m excited to move on to new things. As great as the past four years have been, a number of factors precipitated this move:

  • The research fit is better. This isn’t in any way a knock against the environment here at Colorado, which has been great (hey, they’re hiring! If you do computational cognitive neuroscience, you should apply!). I have great colleagues here who work on some really interesting questions–particularly Tor Wager, my postdoc advisor for my first 3 years here, who’s an exceptional scientist and stellar human being. But every department necessarily has to focus on some areas at the expense of others, and much of the research I do (or would ideally like to do) wasn’t well-represented here. In particular, my interests in personality and individual differences have languished during my time in Boulder, as I’ve had trouble finding collaborators for most of the project ideas I’ve had. UT-Austin, by contrast, has one of the premier personality and individual differences groups anywhere. I’m delighted to be working a few doors down from people like Sam Gosling, Jamie Pennebaker, Elliot Tucker-Drob, and David Buss. On top of that, UT-Austin still has major strengths in most of my other areas of interest, most notably neuroimaging (I expect to continue to collaborate frequently with Russ Poldrack) and data mining (a world-class CS department with an expanding focus on Big Data). So, purely in terms of fit, it’s hard for me to imagine a better place than UT.
  • I’m excited to work on a project with immediate real-world impact. While I’d love to believe that most of the work I currently do is making the world better in some very small way, the reality most scientists engaged in basic research face is that at the end of the day, we don’t actually know what impact we’re having. There’s nothing inherently wrong with that, mind you; as a general rule, I’m a big believer in the idea of doing science just because it’s interesting and exciting, without worrying about the consequences (or lack thereof). You know, knowledge for its own sake and all that. Still, on a personal level, I find myself increasingly wanting to do something that I feel confers some clear and measurable benefit on the world right now–however small. In that respect, online education strikes me as an excellent area to pour my energy into. And PsyHorns is a particularly unusual (and, to my mind, promising) experiment in online education. The preliminary data from previous iterations of the course suggests that students who take the course synchronously online do better academically–not just in this particular class (as compared to an in-class section), but in other courses as well. While I’m not hugely optimistic about the malleability of the human mind as a general rule–meaning, I don’t think there are as-yet undiscovered teaching approaches that are going to radically improve the learning experience–I do believe strongly in the cumulative impact of many small nudges in the right direction. I think this is the right platform for that kind of nudging.
  • Data. Lots and lots of data. Enrollment in PsyHorns this year is about 1,500 students, and previous iterations have seen comparable numbers. As part of their introduction to psychology, the students engage in a wide range of activities: they have group chats about the material they’re learning; they write essays about a range of topics; they fill out questionnaires and attitude surveys; and, for the first time this year, they use a mobile app that assesses various aspects of their daily experience. Aside from the feedback we provide to the students (some of which is potentially actionable right away), the data we’re collecting provides a unique opportunity to address many questions at the intersection of personality and individual differences, health and subjective well-being, and education. It’s not Big Data by, say, Google or Amazon standards (we’re talking thousands of rows rather than billions), but it’s a dataset with few parallels in psychology, and I’m thrilled to be able to work on it.
  • I like doing research more than I like teaching** or doing service work. Like my current position, the position I’m assuming at UT-Austin is 100% research-focused, with very little administrative or teaching overhead. Obviously, it doesn’t have the long-term security of a tenure-track position, but I’m okay with that. I’m still selectively applying for tenure-track positions (and turned one down this year in favor of the UT position), so it’s not as though I have any principled objections to the tenure stream. But short of a really amazing opportunity, I’m very happy with my current arrangement.
  • Austin seems like a pretty awesome place to live. Boulder is too, but after four years of living in a relatively small place (population: ~100,000), my wife and I are looking forward to living somewhere more city-like. We’ve opted to take the (expensive) plunge and live downtown–where we’ll be within walking distance of just about everything we need. By which of course I mean the chocolate fountain at the Whole Foods mothership.
  • The tech community in Austin is booming. Given that most of my work these days lies at the interface of psychology and informatics, and there are unprecedented opportunities for psychology-related data mining in industry these days, I’m hoping to develop better collaborations with people in industry–at both startups and established companies. While I have no intention of leaving academia in the near future, I do think psychologists have collectively failed to take advantage of the many opportunities to collaborate with folks in industry on interesting questions about human behavior–often at an unprecedented scale. I’ve done a terrible job of that myself, and fixing that is near the top of my agenda. So, hey, if you work at a tech company in Austin and have some data lying around that you think might shed new insights on what people feel, think, and do, let’s chat!
  • I guess sometimes you just get the itch to move on to something new. For me, this is that.

[Image: University of Texas Austin campus at sunset/dusk, aerial view]

* Okay, it was an amazing place to live until the massive floods this past week rearranged rivers, roads, and lives. My wife and I were fortunate enough to escape any personal or material damage, but many others were not so lucky. If you’d like to help, please consider making a donation.

** Actually, I love teaching. What I don’t love is all the stuff surrounding teaching.

elsewhere on the net

Some neat links from the past few weeks:

  • You Are Not So Smart: A celebration of self-delusion. An excellent blog by journalist David McRaney that deconstructs common myths about the way the mind works.
  • NPR has a great story by Jon Hamilton about the famous saga of Einstein’s brain and what it’s helped teach us about brain function. [via Carl Zimmer]
  • The Neuroskeptic has a characteristically excellent 1,000-word explanation of how fMRI works.
  • David Rock has an interesting post on some recent work from Baumeister’s group purportedly showing that it’s good to believe in free will (whether or not it exists). My own feeling about this is that Baumeister’s not really studying people’s philosophical views about free will, but rather a construct closely related to self-efficacy and locus of control. But it’s certainly an interesting line of research.
  • The Prodigal Academic is a great new blog about all things academic. I’ve found it particularly interesting since several of the posts so far have been about job searches and job-seeking–something I’ll be experiencing my fill of over the next few months.
  • Prof-like Substance has a great 5-part series (1, 2, 3, 4, 5) on how blogging helps him as an academic. My own (much less eloquent) thoughts on that are here.
  • Cameron Neylon makes a nice case for the development of social webs for data mining.
  • Speaking of data mining, Michael Driscoll of Dataspora has an interesting pair of posts extolling the virtues of Big Data.
  • And just to balance things out, there’s this article in the New York Times by John Allen Paulos that offers some cautionary words about the challenges of using empirical data to support policy decisions.
  • On a totally science-less note, some nifty drawings (or is that photos?) by Ben Heine (via Crooked Brains):