Category Archives: general silliness

Big Data, n. A kind of black magic

The annual Association for Psychological Science meeting is coming up in San Francisco this week. One of the cross-cutting themes this year is “Big Data: Understanding Patterns of Human Behavior”. Since I’m giving two Big Data-related talks (1, 2), and serving as discussant on a related symposium, I’ve been spending some time recently trying to come up with a sensible definition of Big Data within the context of psychological science. This has, in turn, led me to ponder the meaning of Big Data more generally.

After a few sleepless nights spent mulling it over, I’ve concluded that producing a unitary, comprehensive, domain-general definition of Big Data is probably not possible, for the simple reason that different communities have adopted and co-opted the term for decidedly different purposes. For example, in said field of psychology, the very largest datasets that most researchers currently work with contain, at most, tens of thousands of cases and a few hundred variables (there are exceptions, of course). Such datasets fit comfortably into memory on any modern laptop; you’d have a hard time finding (m)any data scientists willing to call a dataset of this scale “Big”. Yet here we are, heading into APS, with multiple sessions focusing on the role of Big Data in psychological science. And psychology’s not unusual in this respect; we’re seeing similar calls for Big Data this and Big Data that in pretty much all branches of science and every area of the business world. I mean, even the humanities are getting in on the action.

You could take a cynical view of this and argue that all this really goes to show is that people like buzzwords. And there’s probably some truth to that. More pragmatically, though, we should acknowledge that language is this flexible kind of thing that likes to reshape itself from time to time. Words don’t have any intrinsic meaning above and beyond what we do with them, and it’s certainly not like anyone has a monopoly on a term that only really exploded into the lexicon circa 2011. So instead of trying to come up with a single, all-inclusive definition of Big Data, I’ve opted to try to make sense of the different usages we’re seeing in different communities. Below I suggest three distinct, but overlapping, definitions–corresponding to three different ways of thinking about what makes data “Big”. They are, roughly, (1) the kind of infrastructure required to support data processing, (2) the size of the dataset relative to the norm in a field, and (3) the complexity of the models required to make sense out of the data. To a first approximation, one can think of these as engineering, scientific, and statistical perspectives on Big Data, respectively.

The engineering perspective

One way to define Big Data is in terms of the infrastructure required to analyze the data. This is the closest thing we have to a classical definition. In fact, this way of thinking about what makes data “big” arguably predates the term Big Data itself. Take this figure, courtesy of Google Trends:

Notice that searches for Hadoop (a framework for massively distributed data-intensive computing) actually precede the widespread use of the term “Big Data” by a couple of years. If you’re the kind of person who likes to base their arguments entirely on search-based line graphs from Google (and I am!), you have here a rather powerful Exhibit A.

Alternatively, if you’re a more serious kind of person who privileges reason over pretty line plots, consider the following, rather simple, argument for Big Data qua infrastructure problem: any dataset that keeps growing is eventually going to get too big–meaning, it will inevitably reach a point at which it no longer fits into memory, or even onto local storage–and now requires a fundamentally different, massively parallel architecture to process. If you can solve your alleged “big data” problems by installing a new hard drive or some more RAM, you don’t really have a Big Data problem, you have an I’m-too-lazy-to-deal-with-this-right-now problem.

A real Big Data problem, from an engineering standpoint, is what happens once you’ve installed all the RAM your system can handle, maxed out your RAID array, and heavily optimized your analysis code, yet still find yourself unable to process your data in any reasonable amount of time. If you then complain to your IT staff about your computing problems and they start ranting to you about Hadoop and Hive and how you need to hire a bunch of engineers so you can build out a cluster and do Big Data the way Big Data is supposed to be done, well, congratulations–you now have a Big Data problem in the engineering sense. You now need to figure out how to build a highly distributed computing platform capable of handling really, really, large datasets.

Once the hungry wolves of Big Data have been killed off–er, temporarily pacified–by building a new data center (or, you know, paying for an AWS account), you may have to rewrite at least part of your analysis code to take advantage of the massive parallelization your new architecture affords. But conceptually, you can probably keep asking and answering the same kinds of questions with your data. In this sense, Big Data isn’t directly about the data itself, but about what the data makes you do: a dataset counts as “Big” whenever it causes you to start whispering sweet nothings in Hadoop’s ear at night. Exactly when that happens will depend on your existing infrastructure, the demands imposed by your data, and so on. On modern hardware, some people have suggested that the transition tends to happen fairly consistently when datasets get to around 5 – 10 TB in size. But of course, that’s just a loose generalization, and we all know that loose generalizations are always a terrible idea.

The scientific perspective

Defining Big Data in terms of architecture and infrastructure is all well and good in domains where normal operations regularly generate terabytes (or even–gasp–petabytes!) of data. But the reality is that most people–and even, I would argue, many people whose job title currently includes the word “data” in it–will rarely need to run analyses distributed across hundreds or thousands of nodes. If we stick with the engineering definition of Big Data, this means someone like me–a lowly social or biomedical scientist who frequently deals with “large” datasets, but almost never with gigantic ones–doesn’t get to say they do Big Data. And that seems kind of unfair. I mean, Big Data is totally in right now, so why should corporate data science teams and particle physicists get to have all the fun? If I want to say I work with Big Data, I should be able to say I work with Big Data! There’s no way I can go to APS and give talks about Big Data unless I can unashamedly look myself in the mirror and say, look at that handsome, confident man getting ready to go to APS and talk about Big Data. So it’s imperative that we find a definition of Big Data that’s compatible with the kind of work people like me do.

Hey, here’s one that works:

Big Data, n. The minimum amount of data required to make one’s peers uncomfortable with the size of one’s data.

This definition is mostly facetious–but it’s a special kind of facetiousness that’s delicately overlaid on top of an earnest, well-intentioned core. The earnest core is that, in practice, many people who think of themselves as Big Data types but don’t own a timeshare condo in Hadoop Land implicitly seem to define Big Data as any dataset large enough to enable new kinds of analyses that weren’t previously possible with smaller datasets. Exactly what dimensionality of data is sufficient to attain this magical status will vary by field, because conventional dataset sizes vary by field. For instance, in human vision research, many researchers can get away with collecting a few hundred trials from three subjects in one afternoon and calling it a study. In contrast, if you’re a population geneticist working with raw sequence data, you probably deal with fuhgeddaboudit amounts of data on a regular basis. So clearly, what it means to be in possession of a “big” dataset depends on who you are. But the point is that in every field there are going to be people who look around and say, you know what? Mine’s bigger than everyone else’s. And those are the people who have Big Data.

I don’t mean that pejoratively, mind you. Quite the contrary: an arms race towards ever-larger datasets strikes me as a good thing for most scientific fields to have, regardless of whether or not the motives for the data embiggening are perfectly cromulent. Having more data often lets you do things that you simply couldn’t do with smaller datasets. With more data, confidence intervals shrink, so effect size estimates become more accurate; it becomes easier to detect and characterize higher-order interactions between variables; you can stratify and segment the data in various ways, explore relationships with variables that may not have been of a priori interest; and so on and so forth. Scientists, by and large, seem to be prone to thinking of Big Data in these relativistic terms, so that a “Big” dataset is, roughly, a dataset that’s large enough and rich enough that you can do all kinds of novel and interesting things with it that you might not have necessarily anticipated up front. And that’s refreshing, because if you’ve spent much time hanging around science departments, you’ll know that the answer to about 20% of all questions during Q&A periods ends with the words well, that’s a great idea, but we just don’t have enough data to answer that. Big Data, in a scientific sense, is when that answer changes to: hey, that’s a great idea, and I’ll try that as soon as I get back to my office. (Or perhaps more realistically: hey that’s a great idea, and I’ll be sure to try that–as soon as I can get my one tech-savvy grad student to wrangle the data into the right format.)

It’s probably worth noting in passing that this relativistic, application-centered definition of Big Data also seems to be picking up cultural steam far beyond the scientific community. Most of the recent criticisms of Big Data seem to have something vaguely like this definition in mind. (Actually, I would argue pretty strenuously that most of these criticisms aren’t really even about Big Data in this sense, and are actually just objections to mindless and uncritical exploratory analysis of any dataset, however big or small. But that’s a post for another day.)

The statistical perspective

A third way to think about Big Data is to focus on the kinds of statistical methods required in order to make sense of a dataset. On this view, what matters isn’t the size of the dataset, or the infrastructure demands it imposes, but how you use it. Once again, we can appeal to a largely facetious definition clinging for dear life onto a half-hearted effort at pithy insight:

Big Data, n. The minimal amount of data that allows you to set aside a quarter of your dataset as a hold-out and still train a model that performs reasonably well when tested out-of-sample.

The nugget of would-be insight in this case is this: the world is usually a more complicated place than it appears to be at first glance. It’s generally much harder to make reliable predictions about new (i.e., previously unseen) cases than one might suppose given conventional analysis practices in many fields of science. For example, in psychology, it’s very common to see papers report extremely large R2 values from fitted models–often accompanied by claims to the effect that the researchers were able to “predict” most of the variance in the outcome. But such claims are rarely actually supported by the data presented, because the studies in question overwhelmingly tend to overfit their models by using the same data for training and testing (to say nothing of p-hacking and other Questionable Research Practices). Fitting a model that can capably generalize to entirely new data often requires considerably more data than one might expect. The precise amount depends on the problem in question, but I think it’s fair to say that there are many domains in which problems that researchers routinely try to tackle with sample sizes of 20 – 100 cases would in reality require samples two or three orders of magnitude larger to really get a good grip on.
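To make the overfitting point concrete, here’s a minimal NumPy sketch (not from the original post, just an illustration): fit an ordinary least squares model with a couple dozen predictors to pure noise, and the in-sample R² looks respectably large, while the out-of-sample R² on fresh data collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(y, y_hat):
    """Proportion of variance in y 'explained' by predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

n_train, n_test, n_features = 50, 50, 20

# Pure noise: none of the 20 predictors has any real relationship to y.
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.normal(size=n_test)

# Ordinary least squares fit on the training data only.
b, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

r2_in = r_squared(y_train, X_train @ b)   # flattering: roughly p/n by chance alone
r2_out = r_squared(y_test, X_test @ b)    # sobering: near zero, often negative
print(f"in-sample R^2:     {r2_in:.2f}")
print(f"out-of-sample R^2: {r2_out:.2f}")
```

With 20 predictors and 50 cases, the model “explains” a sizable chunk of training variance purely by capitalizing on chance; only the held-out test set reveals that there was never anything to predict.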

The key point is that when we don’t have a lot of data to work with, it’s difficult to say much of anything about how big an effect is (unless we’re willing to adopt strong priors). Instead, we tend to fall back on the crutch of null hypothesis significance testing and start babbling on about whether there is or isn’t a “statistically significant effect”. I don’t really want to get into the question of whether the latter kind of thinking is ever useful (see Krantz (1999) for a review of its long and sordid history). What I do hope is not controversial is this: if your conclusions are ever in danger of changing radically depending on whether the coefficients in your model are on this side of p = .05 versus that side of p = .05, those conclusions are, by definition, not going to be terribly reliable over the long haul. Anything that helps move us away from that decision boundary and puts us in a position where we can worry more about what our conclusions ought to be than about whether we should be saying anything at all is a good thing. And since the single thing that matters most in that regard is the size of our dataset, it follows that we should want to have datasets that are as Big as possible. If we can fit complex models using lots of features and show that those models still perform well when tested out-of-sample, we can feel much more confident about whatever else we feel inclined to say.

From a statistical perspective, then, one might say that a dataset is “Big” when it’s sufficiently large that we can spend most of our time thinking about what kinds of models to fit and what kinds of features to include so as to maximize predictive power and/or understanding, rather than worrying about what we can and can’t do with the data for fear of everything immediately collapsing into a giant multicollinear mess. Admittedly, this is more of a theoretical ideal than a practical goal, because as Andrew Gelman points out, in practice “N is never large”. As soon as we get our hands on enough data to stabilize the estimates from one kind of model, we immediately go on to ask more fine-grained questions that require even more data. And we don’t stop until we’re right back where we started, hovering at the very edge of our ability to produce sensible estimates, staring down the precipice of uncertainty. But hey, that’s okay. Nobody said these definitions have to be useful; it’s hard enough just trying to make them semi-coherent.

Conclusion

So there you have it: three ways to define Big Data. All three of these definitions are fuzzy, and will bleed into one another if you push on them a little bit. In particular, you could argue that, extensionally, the engineering definition of Big Data is a superset of the other two definitions, as it’s very likely that any dataset big enough to require a fundamentally different architecture is also big enough to handle complex statistical models and to do interesting and novel things with. So the point of all this is not to describe three completely separate communities with totally different practices; it’s simply to distinguish between three different uses of the term Big Data, all of which I think are perfectly sensible in different contexts, but that can cause communication problems when people from different backgrounds interact.

Of course, this isn’t meant to be an exhaustive catalog. I don’t doubt that there are many other potential definitions of Big Data that would each elicit enthusiastic head nods from various communities. For example, within the less technical sectors of the corporate world, there appears to be yet another fairly distinctive definition of Big Data. It goes something like this:

Big Data, n. A kind of black magic practiced by sorcerers known as quants. Nobody knows how it works, but it’s capable of doing anything.

In any case, the bottom line here is really just that context matters. If you go to APS this week, there’s a good chance you’ll stumble across many psychologists earnestly throwing the term “Big Data” around, even though they’re mostly discussing datasets that would fit snugly into a sliver of memory on a modern phone. If your day job involves crunching data at CERN or Google, this might amuse you. But the correct response, once you’re done smiling on the inside, is not, Hah! That’s not Big Data, you idiot! It should probably be something more like Hey, you talk kind of funny. You must come from a different part of the world than I do. We should get together some time and compare notes.

estimating the influence of a tweet–now with 33% more causal inference!

Twitter is kind of a big deal. Not just out there in the world at large, but also in the research community, which loves the kind of structured metadata you can retrieve for every tweet. A lot of researchers rely heavily on Twitter to model social networks, information propagation, persuasion, and all kinds of interesting things. For example, here’s the abstract of a nice recent paper on arXiv that aims to predict successful memes using network and community structure:

We investigate the predictability of successful memes using their early spreading patterns in the underlying social networks. We propose and analyze a comprehensive set of features and develop an accurate model to predict future popularity of a meme given its early spreading patterns. Our paper provides the first comprehensive comparison of existing predictive frameworks. We categorize our features into three groups: influence of early adopters, community concentration, and characteristics of adoption time series. We find that features based on community structure are the most powerful predictors of future success. We also find that early popularity of a meme is not a good predictor of its future popularity, contrary to common belief. Our methods outperform other approaches, particularly in the task of detecting very popular or unpopular memes.

One limitation of much of this body of research is that the data are almost invariably observational. We can build sophisticated models that do a good job predicting some future outcome (like meme success), but we don’t necessarily know that the “important” features we identify carry any causal influence. In principle, they could be completely epiphenomenal–for example, in the study I linked to, maybe the community structure features are just a proxy for some other, causally important, factor (e.g., whether the content of a meme has sufficiently broad appeal to attract attention from many different kinds of people). From a predictive standpoint, this may not matter much; if your goal is just to passively predict whether a meme is going to be successful or not, it’s irrelevant whether or not the features you’re using are doing causal work. On the other hand, if you want to actively design memes in such a way as to maximize their spread, the ability to get a handle on causation starts to look pretty important.

How can we estimate the direct causal influence of a tweet on the downstream popularity of a meme? Here’s a simple and (I suspect) very feasible idea in two steps:

  1. Create a small web app that allows any existing Twitter user to register via Twitter authentication. On signing up, a user has to specify just one (optional) setting: the proportion of their intended retweets they’re willing to withhold. Let’s call this the Withholding Fraction (WF).
  2. Every time (or at least some of the time) a registered user wants to retweet a particular tweet*, they do so via the new web app’s interface (which has permission to post to the user’s Twitter account) instead of whatever interface they’re currently using. The key is that the retweet isn’t just obediently passed along; instead, the target tweet is retweeted successfully with probability (1 – WF), and randomly suppressed from the user’s stream with probability (WF).
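The randomized suppression step at the heart of the proposal is simple enough to sketch in a few lines of Python. Everything here is hypothetical–`post_retweet` stands in for whatever call the real app would make to Twitter’s API on the user’s behalf, and `EVENT_LOG` stands in for its database–but the logic is just a Bernoulli coin flip per intended retweet, with the assignment logged either way:

```python
import random
from dataclasses import dataclass

@dataclass
class User:
    handle: str
    withholding_fraction: float  # WF: probability an intended retweet is withheld

EVENT_LOG = []  # stands in for the app's database of logged intents

def post_retweet(user, tweet_id):
    # Placeholder for the real (hypothetical) call to Twitter's API
    # posting the retweet on the user's behalf.
    pass

def handle_retweet(user, tweet_id, rng):
    """Pass the retweet through with probability (1 - WF); suppress it
    with probability WF. Either way, log the intent plus the random
    assignment -- the logged assignment is what enables causal analysis."""
    suppressed = rng.random() < user.withholding_fraction
    if not suppressed:
        post_retweet(user, tweet_id)
    record = {"user": user.handle, "tweet_id": tweet_id, "suppressed": suppressed}
    EVENT_LOG.append(record)
    return record

# Over many intended retweets, roughly WF of them should be withheld.
alice = User("alice", withholding_fraction=0.25)
rng = random.Random(42)
results = [handle_retweet(alice, i, rng) for i in range(10_000)]
rate = sum(r["suppressed"] for r in results) / len(results)
print(f"observed suppression rate: {rate:.3f}")
```

Because suppression is random conditional on a known WF, comparing downstream meme popularity for suppressed versus posted retweets gives an unbiased estimate of a single retweet’s causal effect.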

Doing this would allow the community to very quickly (assuming rapid adoption, which seems reasonably likely) build up an enormous database of tweets that were targeted for retweeting by an active user, but randomly assigned to fail with some known probability. Researchers would then be able to directly quantify the causal impact of individual retweets on downstream popularity–and to estimate that influence conditional on all of the other standard variables, like the retweeter’s number of followers, the content of the tweet, etc. Of course, this still wouldn’t get us to true experimental manipulation of such features (i.e., we wouldn’t be manipulating users’ follower networks, just randomly omitting tweets from users with different followers), but it seems like a step in the right direction**.

I figure building a barebones app like this would take an experienced developer familiar with the Twitter OAuth API just a day or two. And I suspect many people (myself included!) would be happy to contribute to this kind of experiment, provided that all of the resulting data were made public. (I’m aware that there are all kinds of restrictions on sharing assembled Twitter datasets, but we’re not talking about sharing firehose dumps here, just a restricted set of retweets from users who’ve explicitly given their consent to have the data used in this way.)

Has this kind of thing already been done? If not, does anyone want to build it?

 

* It doesn’t just have to be retweets, of course; the same principle would work just as well for withholding a random fraction of original tweets. But I suspect not many users would be willing to randomly eliminate a proportion of their original content from the firehose.

** If we really wanted to get close to true random assignment, we could potentially inject selected tweets into random users’ streams based on selected criteria. But I’m not sure how many tweeps would consent to have entirely random retweets published in their name (I probably wouldn’t), so this probably isn’t viable.

Jirafas

This is fiction.

The party is supposed to start at 7 pm, but of course, no one shows up before 8:45. When the guests finally do arrive, I randomly assign each of them to one of four groups–A through D–as they enter. Each assignment comes with an adhesive 2″ color patch, a nametag, and a sharpie.

“The labels are not for the dinner,” I say, “they’re for the orgy that follows the dinner. The bedrooms are all color-coded; there are strict rules governing inter-cubicular transitions. Please read the manual on the table.”

Nobody moves to pick up the manual. There’s a long and uncomfortable silence, made longer and more uncomfortable by the fact that we can all hear the upstairs neighbors loudly having sex on their kitchen counter.

“Turn on the music,” my wife says. “It masks the sex.”

I put on some music. Something soft, by Elton John, followed by something angry—a duet by Tenacious D and Lynyrd Skynyrd. One of the guests—unsoothed by the music, and noticing the random collection of chairs scattered around the living room—grows restless and asks whether we will all be playing musical chairs this fine evening.

“No,” I reply; “this fine night, we all play Mafia.” Then I shoot him dead as everyone else pretends to stare out the window.

In the kitchen, my wife uncorks the last bottle of wine. As trendy wines go, this one wears its pretention with pride: Jugo de Jirafas, the label proclaims in vermilion Helvetica Neue overtones.

“What does jirafas mean,” I ask my Spanish friend. “Giraffes?”

“No,” she says. “Jirafas was a famous rebel general who came out of hiding during the Spanish Civil War to challenge Franco to a fight to the death. They brawled in the streets for hours, and just when it looked like Jirafas was about to snap Franco’s neck, Franco screamed for his deputies, who immediately pumped several rounds straight through Jirafas’s heart. They say the body continued to bleed courage into the street for several weeks.”

Jugo de Jirafas, I enunciate out loud.

There’s an awkward silence in the living room as the assembled guests all hold an involuntary thirty-second vigil for the dearly departed General Jirafas, who was taken from us much too soon. Poor man—we barely knew him.

Then the vigil is broken up by the arrival of my Brazilian friend João, who lives across the way. Our housing complex is nominally open to all faculty and staff affiliated with the university, but in practice it more or less operates as a kind of hippie commune for expatriate scientists. On any given day you can hear forty different languages being spoken, and stumble across marauding groups of eight-year old children all babbling away at each other in mutual incomprehension. Walking through our apartment complex is like taking a simultaneous trip through every foreign-language channel on extended cable.

It does have its perks, though. For example, if you want to experience other cultures, you don’t need to travel anywhere. When people suggest that I’ve been working too hard and need a vacation, I yell at João through the bedroom window: how’s Rio this time of year?

Exceptional, he’ll yell back. The cannonball trees are in full bloom. You should come for a visit.

Then I usually take a bottle of wine over—nothing of Jugo de Jirafas caliber, just a basic Zinfandel from Whole Foods—and we sit around and talk about the strange places we’ve lived: Rio and Istanbul for him; Mombasa and Ottawa for me. After dinner we usually play a few games of backgammon, which is not a Brazilian game at all, but is acceptable to play because João spent three years of his life doing a postdoc in Turkey. Thus begins and ends my cosmetic Latin American vacation, punctuated by a detour to the Near East.

Tonight, João shows up with a German lady on his arm. She’s a newly arrived faculty member in the Department of Earth Sciences.

“This is the bad Jew I was telling you about,” he says to the lady by way of introduction.

“It’s true,” I say; “I’m a very bad Jew. Even by Jewish standards.”

She wants to know what makes a Jew a bad Jew. I tell her I eat bacon on the Sabbath and wrap myself in cheeseburgers before bed. And that I make sure to drink the blood of goyim at least four times a year. And that I’m so money-hungry and cunning, I’ve been banned from lending money even to other Jews.

My joke doesn’t go over so well. Germans have had, for obvious reasons, a lot of trouble putting the war behind them. When you make Jew jokes in Germany, people give you a look that’s made up of one part contempt, one part cognitive dissonance. They don’t know what to do; it’s like you’ve lit a warehouse full of bottle rockets up inside their heads all at once. As an American, I don’t mind this, of course. In America, it’s your god-given birthright to make ethnic jokes at your own expense. As long as you’re making fun only of your own in-group and nobody else, no one is allowed to come between you and your chuckles.

The German lady doesn’t see it this way.

“You should not make fun of the Jews,” she says in over-articled English. “Even if you are a one yourself.”

“Well,” says I. “If you can’t laugh at yourself, who can you laugh at?”

She shrugs her shoulders.

“Other people,” offers João.

So I laugh at João, because he’s another person. There’s an uncomfortable pause, but then the earth scientist–whose name turns out to be Brunhilde–laughs too. A moment later, we’re all making small talk again, and I feel pretty confident that any budding crisis in diplomatic relations has been averted.

“Speaking of making fun of others,” João says, “what happened to your lip? It looks like you have the herpes.”

“I damaged myself while flossing,” I tell him.

It’s true: I have a persistent cut on my lip caused by aggressive flossing. It refuses to heal. And now, after several days of incubation, it looks exactly like a cold sore. So I have to walk around my life constantly putting up with herpes jokes.

“I’ll go put something on it,” I say, self-consciously rubbing at the wound. “You just stand here and keep laughing at me, you anti-semite.”

Turns out, I’ve forgotten the name of the lip balm my wife buys. So I walk around the party with a chafed, bloody lip, asking everyone I know if they’ve seen my Tampax. The guests mostly demur quietly, but one particularly mercurial friend looks slightly alarmed, and slowly starts to edge towards the door.

He means Carmex, my wife yells from the kitchen.

Eventually, all of the wine is drunk and the conversation is spent. The guests begin to leave, each one curling his or her self carefully through the doorway in sequence. For some reason, they remind me of ants circling around a drain—but I don’t tell anyone that. There is no longer any music; there was never an orgy. There are no more Jew jokes. I turn the phonograph off—by which I mean I press the stop button on my iTunes playlist—and dim the lights. My wife stays downstairs.

“To do some research,” she says.

Much later, just as I’m making the delicate nightly transition from restless leg syndrome to stage 1 sleep, I’m suddenly jarred wide awake by the sound of someone cursing loudly and repeatedly as they get into bed next to me. I vaguely recognize my wife’s voice, though it sounds different over the haze of near-sleep and a not-insignificant amount of wine.

What’s going on, I ask her.

She mutters that she’s just spent the last hour and a half exhausting the infinite wisdom of Google, circumnavigating the information superhighway, and consulting with various technical support workers scattered all around the Indian subcontinent. And the clear consensus among all sources is that there is not now, and never was, any General Jirafas.

“It just means giraffes,” she says.

…and then there were two!

Last year when I launched my lab (which, full disclosure, is really just me, plus some of my friends who were kind enough to let me plaster their names and faces on my website), I decided to call it the Psychoinformatics Lab (or PILab for short and pretentious), because, well, why not. It seemed to nicely capture what my research is about: psychology and informatics. But it wasn’t an entirely comfortable decision, because a non-trivial portion of my brain was quite convinced that everyone was going to laugh at me. And even now, after more than a year of saying I’m a “psychoinformatician” whenever anyone asks me what I do, I still feel a little bit fraudulent each time–as if I’d just said I was a member of the Estonian Cosmonaut program, or the president of the Build-a-Bear fan club*.

But then… just last week… everything suddenly changed! All in one fell swoop–in one tiny little nudge of a shove-this-on-the-internet button, things became magically better. And now colors are vibrating**, birds are chirping merry chirping songs–no, wait, those are actually cicadas–and the world is basking in a pleasant red glow of humming monitors and five-star Amazon reviews. Or something like that. I’m not so good with the metaphors.

Why so upbeat, you ask? Well, because as of this writing, there is no longer just the one lone Psychoinformatics Lab. No! Now there are not one, not three, not seven Psychoinformatics Labs, but… two! There are two Psychoinformatics Labs. The good Dr. Michael Hanke (of PyMVPA and NeuroDebian fame) has just finished putting the last coat of paint on the inside of his brand new cage Psychoinformatics Lab at the Otto-von-Guericke University Magdeburg in Magdeburg, Germany. No, really***: his startup package didn’t include any money for paint, so he had to barter his considerable programming skills for three buckets of Going to the Chapel (yes, that’s a real paint color).

The good Dr. Hanke drifts through interstellar space in search of new psychoinformatic horizons.

Anyway, in case you can’t tell, I’m quite excited about this. Not because it’s a sign that informatics approaches are making headway in psychology, or that pretty soon every psychology lab will have a high-performance computing cluster hiding in its closet (one can dream, right?). No sir. I’m excited for two much more pedestrian reasons. First, because from now on, any time anyone makes fun of me for calling myself a psychoinformatician, I’ll be able to say, with a straight face, well it’s not just me, you know–there are multiple ones of us doing this here research-type thing with the data and the psychology and the computers. And secondly, because Michael is such a smart and hardworking guy that I’m pretty sure he’s going to legitimize this whole enterprise and drag me along for the ride with him, so I won’t have to do anything else myself. Which is good, because if laziness were an Olympic sport, I’d never leave the starting block.

No, but in all seriousness, Michael is an excellent scientist and an exceptional human being, and I couldn’t be happier for him in his new job as Lord Director of All Things Psychoinformatic (Eastern Division). You might think I’m only saying this because he just launched the world’s second PILab, complete with quote from yours truly on said lab’s website front page. Well, you’d be right. But still. He’s a pretty good guy, and I’m sure we’re going to see amazing things coming out of Magdeburg.

Now if anyone wants to launch PILab #3 (maybe in Asia or South America?), just let me know, and I’ll make you the same offer I made Michael: an envelope full of $1 bills (well, you know, I’m an academic–I can’t afford Benjamins just yet) and a blog post full of ridiculous superlatives.

 

* Perhaps that’s not a good analogy, because that one may actually exist.

** But seriously, in real life, colors should not vibrate. If you ever notice colors vibrating, drive to the nearest emergency room and tell them you’re seeing colors vibrating.

*** No, not really.

the seedy underbelly

This is fiction. Science will return shortly.


Cornelius Kipling doesn’t take No for an answer. He usually takes several of them–several No’s strung together in rapid sequence, each one louder and more adamant than the last one.

“No,” I told him over dinner at the Rhubarb Club one foggy evening. “No, no, no. I won’t bankroll your efforts to build a new warp drive.”

“But the last one almost worked,” Kip said pleadingly. “I almost had it down before the hull gave way.”

I conceded that it was a clever idea; everyone before Kip had always thought of warp drives as something you put on spaceships. Kip decided to break the mold by placing one on a hydrofoil. Which, naturally, made the boat too heavy to rise above the surface of the water. In fact, it made the boat too heavy to do anything but sink.

“Admittedly, the sinking thing is a small problem,” he said, as if reading my thoughts. “But I’m working on a way to adjust for the extra weight and get it to rise clear out of the water.”

“Good,” I said. “Because lifting the boat out of the water seems like a pretty important step on the road to getting it to travel through space at light speed.”

“Actually, it’s the only remaining technical hurdle,” said Kip. “Once it’s out of the water, everything’s already taken care of. I’ve got onboard fission reactors for power, and a tentative deal to use the International Space Station for supplies. Virgin Galactic is ready to license the technology as soon as we pull off a successful trial run. And there’s an arrangement with James Cameron’s new asteroid mining company to supply us with fuel as we boldly go where… well, you know.”

“Right,” I said, punching my spoon into my crème brûlée in frustration. The crème brûlée retaliated by splattering itself all over my face and jacket.

“See, this kind of thing wouldn’t happen to you if you invested in my company,” Kip helpfully suggested as he passed me an extra napkin. “You’d have so much money other people would feed you. People with ten or fifteen years of experience wielding dessert spoons.”


After dinner we headed downtown. Kip said there was a new bar called Zygote he wanted to show me.

“Actually, it’s not a new bar per se,” he explained as we were leaving the Rhubarb. “It’s new to me. Turns out it’s been here for several years, but you have to know someone to get in. And that someone has to be willing to sponsor you. They review your biography, look up your criminal record, make sure you’re the kind of person they want at the bar, and so on.”

“Sounds like an arranged marriage.”

“You’re not too far off. When you’re first accepted as a member, you’re supposed to give Zygote a dowry of $2,000.”

“That’s a joke, right?” I asked.

“Yes. There’s no dowry. Just the fee.”

“Two thousand dollars? Really?”

“Well, more like fifty a year. But same principle.”

We walked down the mall in silence. I could feel the insoles of my shoes wrapping themselves around my feet, and I knew they were desperately warning me to get away from Kip while I still had a limited amount of sobriety and dignity left.

“How would anyone manage to keep a place like that secret?” I asked. “Especially on the mall.”

“They hire hit men,” Kip said solemnly.

I suspected he was joking, but couldn’t swear to it. I mean, if you didn’t know Kip, you would probably have thought that the idea of putting a warp drive on a hydrofoil was also a big joke.

Kip led us into one of the alleys off Pearl Street, where he quickly located an unobtrusive metal panel set into the wall just below eye level. The panel opened inwards when we pushed it. Behind the panel, we found a faint smell of old candles and a flight of stairs. At the bottom of the stairs–which turned out to run three stories down–we came to another door. This one didn’t open when we pushed it. Instead, Kip knocked on it three times. Then twice more. Then four times.

“Secret code?” I asked.

“No. Obsessive-compulsive disorder.”

The door swung open.

“Evening, Ashraf,” Kip said to the doorman as we stepped through. Ashraf was a tiny Middle Eastern man, very well dressed. Suede pants, cashmere scarf, fedora on his head. Feather in the fedora. The works. I guess when your bar is located behind a false wall three stories below grade, you don’t really need a lot of muscle to keep the peasants out; you knock them out with panache.

“Welcome to Zygote,” Ashraf said. His bland tone made it clear that, truthfully, he wasn’t at all interested in welcoming anyone anywhere. Which made him exactly the kind of person an establishment like this would want as its doorman.

Inside, the bar was mostly empty. There were twelve or fifteen patrons scattered across various booths and animal-print couches. They all took great care not to make eye contact with us as we entered.

“I have to confess,” I whispered to Kip as we made our way to the bar. “Until about three seconds ago, I didn’t really believe you that this place existed.”

“No worries,” he said. “Until about three seconds ago, it had no idea you existed either.”

He looked around.

“Actually, I’m still not sure it knows you exist,” he added apologetically.

“I feel like I’m giving everyone the flu just by standing here,” I told him.

We took a seat at the end of the bar and motioned to the bartender, who looked to be high on a designer drug chemically related to apathy. She eventually wandered over to us–but not before stopping to inspect the countertop, a stack of coasters with pictures of archaeological sites on them, a rack of brandy snifters, and the water running from the faucet.

“Two mojitos and a strawberry daiquiri,” Kip said when she finally got close enough to yell at.

“Who’s the strawberry daiquiri for?” I asked.

“Me. They’re all for me. Why, did you want a drink too?”

I did, so I ordered the special–a pink cocktail called a Flamingo. Each Flamingo came in a tall Flamingo-shaped glass that couldn’t stand up by itself, so you had to keep holding it until you finished it. Once you were done, you could lay the glass on its side on the counter and watch it leak its remaining pink guts out onto the tile. This act was, I gathered from Kip, a kind of rite of passage at Zygote.

“This is a very fancy place,” I said to no one in particular.

“You should have seen it before the gang fights,” the bartender said before walking back to the snifter rack. I had high hopes she would eventually get around to filling our order.

“Gang fights?”

“Yes,” Kip said. “Gang fights. Used to be big old gang fights in here every other week. They trashed the place several times.”

“It’s like there’s this whole seedy underbelly to Boulder that I never knew existed.”

“Oh, this is nothing. It goes much deeper than this. You haven’t seen the seedy underbelly of this place until you’ve tried to convince a bunch of old money hippies to finance your mass-produced elevator-sized vaporizer. You haven’t squinted into the sun or tasted the shadow of death on your shoulder until you’ve taken on the Bicycle Triads of North Boulder single-file in a dark alley. And you haven’t tried to scratch the dirt off your soul–unsuccessfully, mind you–until you’ve held all-night bargaining sessions with local black hat hacker groups to negotiate the purchase of mission-critical zero-day exploits.”

“Well, that may all be true,” I said. “But I don’t think you’ve done any of those things either.”

I should have known better than to question Kip’s credibility; he spent the next fifteen minutes reminding me of the many times he’d risked his life, liberty, and (nonexistent) fortune fighting to suppress the darkest forces in Northern Colorado in the service of the greater good of mankind.

After that, he launched into his standard routine of trying to get me to buy into the latest round of his inane startup ideas. He told me, in no particular order, about his plans to import, bottle and sell the finest grade Kazakh sand as a replacement for the substandard stuff currently found on American kindergarten sandlots; to run a “reverse tourism” operation that would fly in members of distant cultures to visit disabled would-be travelers in the comfort of their own living rooms (tentative slogan: if the customer can’t come to Muhammad, Muhammad must come to the customer); and to create giant grappling hooks that could pull Australia closer to the West Coast so that Kip could speculate in airline stocks and make billions of dollars once shorter flights inevitably caused Los Angeles-Sydney routes to triple in passenger volume.

I freely confess that my recollection of the finer points of the various revenue enhancement plans Kip proposed that night is not the best. I was a little bit distracted by a woman at the far end of the bar who kept gesturing towards me the whole time Kip was talking. Actually, she wasn’t so much gesturing towards me as gently massaging her neck. But she only did it when I happened to look at her. At one point, she licked her index finger and rubbed it on her neck, giving me a pointed look.

After about forty-five minutes of this, I finally worked up the courage to interrupt Kip’s explanation of how and why the federal government could solve all of America’s economic problems overnight by convincing Balinese children to invest in discarded high school football uniforms.

“Look,” I told him, pointing down to the other side of the bar. “You see? This is why I don’t go to bars any more now that I’m married. Attractive women hit on me, and I hate to disappoint them.”

I raised my left hand and deliberately stroked my wedding band in full view.

The lady at the far end didn’t take the hint. Quite the opposite; she pushed back her bar stool and came over to us.

“Christ,” I whispered.

Kip smirked quietly.

“Hi,” said the woman. “I’m Suzanne.”

“Hi,” I said. “I’m flattered. And also married.”

“I see that. I also see that you have some food in your… neckbeard. It looks like whipped cream. At least I hope that’s what it is. I was trying to let you know from down there, so you could wipe it off without embarrassing yourself any further. But apparently you’d rather embarrass yourself.”

“It’s crème brûlée,” I mumbled.

“Weak,” said Suzanne, turning around. “Very weak.”

After she’d left, I wiped my neck on my sleeve and looked at Kip. He looked back at me with a big grin on his face.

“I don’t suppose the thought crossed your mind, at any point in the last hour, to tell me I had crème brûlée in my beard.”

“You mean your neckbeard?”

“Yes,” I sighed, making a mental note to shave more often. “That.”

“It certainly crossed my mind,” Kip said. “Actually, it crossed my mind several times. But each time it crossed, it just waved hello and kept right on going.”

“You know you’re an asshole, right?”

“Whatever you say, Captain Neckbeard.”

“Alright then,” I sighed. “Let’s get out of here. It’s past my curfew anyway. Do you remember where I left my car?”

“No need,” said Kip, putting on his jacket and clapping his hand to my shoulder. “My hydrofoil’s parked in the Spruce lot around the block. The new warp drive is in. Walk with me and I’ll give you a ride. As long as you don’t mind pushing for the first fifty yards.”

several half-truths, and one blatant, unrepentant lie about my recent whereabouts

Apparently time does a thing that is much like flying. Seems like just yesterday I was sitting here in this chair, sipping on martinis, and pleasantly humming old show tunes while cranking out several high-quality blog posts an hour a mediocre blog post every week or two. But then! Then I got distracted! And blinked! And fell asleep in my chair! And then when I looked up again, 8 months had passed! With no blog posts!

Granted, on the Badness Scale, which runs from 1 to Imminent Apocalypse, this one clocks in at a solid 1.04. But still, eight months is a long time to be gone–about 3,000 internet years. So I figured I’d write a short post about the events of the past eight months before setting about the business of trying (and perhaps failing) to post here more regularly. Also, to keep things interesting, I’ve thrown in one fake bullet. See if you can spot the impostor.

  • I started my own lab! You can tell it’s a completely legitimate scientific operation because it has (a) a fancy new website, (b) other members besides me (some of whom I admittedly had to coerce into ‘joining’), and (c) weekly meetings. (As far as I can tell, these are all the necessary requirements for official labhood.) I decided to call my very legitimate scientific lab the Psychoinformatics Lab. Partly because I like how it sounds, and partly because it’s vaguely descriptive of the research I do. But mostly because it results in a catchy abbreviation: PILab. (It’s pronounced Pieeeeeeeeeee lab–the last 10 e’s are silent.)
  • I’ve been slowly writing and re-writing the Neurosynth codebase. Neurosynth is a thing made out of software that lets neuroimaging researchers very crudely stitch together one giant brain image out of other smaller brain images. It’s kind of like a collage, except that unlike most collages, in this case the sum is usually not more than its parts. In fact, the sum tends to look a lot like its parts. In any case, with some hard work and a very large serving of good luck, I managed to land an R01 grant from the NIH last summer, which will allow me to continue stitching images for a few more years. From my perspective, this is a very good thing, for two reasons. First, because it means I’m not unemployed right now (I’m a big fan of employment, you see); and secondly, because I’m finding the stitching surprisingly enjoyable. If you enjoy stitching software into brain images, please help out.
  • I published a bunch of papers in 2012, so, according to my CV at least, it was a good year for me professionally. Actually, I think it was a deceptively good year–meaning, I don’t think I did any more work than I did in previous years, but various factors (old projects coming to fruition, a bunch of papers all getting accepted at the same time, etc.) conspired to produce more publications in 2012. This kind of stuff has a tendency to balance out in fairly short order though, so I fully expect to rack up a grand total of zero publications in 2013.
  • I went to Iceland! And England! And France! And Germany! And the Netherlands! And Canada! And Austin, Texas! Plus some other places. I know many people spend a lot of their time on the road and think hopping across various oceans is no big deal, but, well, it is to me, so BACK OFF. Anyway, it’s been nice to have the opportunity to travel more. And to combine business and pleasure. I am not one of those people–I think you call them ‘sane’–who prefer to keep their work life and their personal life cleanly compartmentalized, and try to cram all their work into specific parts of the year and then save a few days or weeks here and there to do nothing but roll around on the beach or ski down frighteningly tall mountains. I find I’m happiest when I get to spend one part of the day giving a talk or meeting with some people to discuss the way the edges of the brain blur when you shake your head, and then another part of the day roaming around De Jordaan asking passers-by, in a stilted Dutch, “where can I find some more of those baby cheeses?”
  • On a more personal note (as the archives of this blog will attest, I have no shame when it comes to publicly divulging embarrassing personal details), my wife and I celebrated our fifth anniversary a few weeks ago. I think this one is called the congratulations, you haven’t killed each other yet! anniversary. Next up: the ten year anniversary, also known as the but seriously, how are you both still alive? decennial. Fortunately we’re not particularly sentimental people, so we celebrated our wooden achievement with some sushi, some sake, and only 500 of our closest friends an early bedtime (no, seriously–we went to bed early; that’s not a euphemism for anything).
  • I contracted a bad case of vampirism while doing some prospecting work in the Yukon last summer. The details are a little bit sketchy, but I have a vague suspicion it happened on that one occasion when I was out gold panning in the middle of the night under a full moon and was brutally attacked by a man-sized bat that bit me several times on the neck. At least, that’s my best guess. But, whatever–now that my disease is in full bloom, it’s not so bad any more. I’ve become mostly nocturnal, and I have to snack on the blood of an unsuspecting undergraduate student once every month or two to keep from wasting away. But it seems like a small price to pay in return for eternal life, superhuman strength, and really pasty skin.
  • Overall, I’m enjoying myself quite a bit. I recently read somewhere that people are, on average, happiest in their 30s. I also recently read somewhere else that people are, on average, least happy in their 30s. I resolve this apparent contradiction by simply opting to believe the first thing, because in my estimation, I am, on average, happiest in my 30s.

Ok, enough self-indulgent rambling. Looking over this list, it wasn’t even a very eventful eight months, so I really have no excuse for dropping the ball on this blogging thing. I will now attempt to resume posting one to two posts a month about brain imaging, correlograms, and schweizel units. This might be a good cue for you to hit the UNSUBSCRIBE button.

R, the master troll of statistical languages

Warning: what follows is a somewhat technical discussion of my love-hate relationship with the R statistical language, in which I somehow manage to waste 2,400 words talking about a single line of code. Reader discretion is advised.

I’ve been using R to do most of my statistical analysis for about 7 or 8 years now–ever since I was a newbie grad student and one of the senior grad students in my lab introduced me to it. Despite having spent hundreds (thousands?) of hours in R, I have to confess that I’ve never set aside much time to really learn it very well; what basic competence I’ve developed has been acquired almost entirely by reading the inline help and consulting the Oracle of Bacon Google when I run into problems. I’m not very good at setting aside time for reading articles or books or working my way through other people’s code (probably the best way to learn), so the net result is that I don’t know R nearly as well as I should.

That said, if I’ve learned one thing about R, it’s that R is all about flexibility: almost any task can be accomplished in a dozen different ways. I don’t mean that in the trivial sense that pretty much any substantive programming problem can be solved in any number of ways in just about any language; I mean that for even very simple and well-defined tasks involving just one or two lines of code there are often many different approaches.

To illustrate, consider the simple task of selecting a column from a data frame (data frames in R are basically just fancy tables). Suppose you have a dataset that looks like this:
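Something like the following, say (a made-up stand-in; the exact values don’t matter, only that the first column, ‘flavor’, holds strings and the others hold numbers):

```r
# A hypothetical stand-in for the ice.cream dataset: one character
# column ('flavor') plus a couple of invented numeric columns.
ice.cream = data.frame(
  flavor   = c("chocolate", "vanilla", "strawberry"),
  rating   = c(9, 7, 8),
  calories = c(270, 210, 230),
  stringsAsFactors = FALSE   # keep 'flavor' a plain character column
)
```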

In most languages, there would be one standard way of pulling columns out of this table. Just one unambiguous way: if you don’t know it, you won’t be able to work with data at all, so odds are you’re going to learn it pretty quickly. R doesn’t work that way. In R there are many ways to do almost everything, including selecting a column from a data frame (one of the most basic operations imaginable!). Here are four of them:
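For instance (sketched with a minimal stand-in for the ice.cream data frame; these are the index, name, dollar-sign, and double-bracket idioms referred to below):

```r
# (ice.cream is a stand-in data frame whose first column is 'flavor'.)
ice.cream = data.frame(flavor = c("chocolate", "vanilla", "strawberry"),
                       rating = c(9, 7, 8), stringsAsFactors = FALSE)

ice.cream[, 1]           # by column index
ice.cream[, "flavor"]    # by column name
ice.cream$flavor         # dollar-sign access
ice.cream[["flavor"]]    # double-bracket (list-style) indexing
```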

 

I won’t bother to explain all of these; the point is that, as you can see, they all return the same result (namely, the first column of the ice.cream data frame, named ‘flavor’).

This type of flexibility enables incredibly powerful, terse code once you know R reasonably well; unfortunately, it also makes for an extremely steep learning curve. You might wonder why that would be–after all, at its core, R still lets you do things the way most other languages do them. In the above example, you don’t have to use anything other than the simple index-based approach (i.e., data[,1]), which is the way most other languages that have some kind of data table or matrix object (e.g., MATLAB, Python/NumPy, etc.) would prefer you to do it. So why should the extra flexibility present any problems?

The answer is that when you’re trying to learn a new programming language, you typically do it in large part by reading other people’s code–and nothing is more frustrating to a newbie when learning a language than trying to figure out why sometimes people select columns in a data frame by index and other times they select them by name, or why sometimes people refer to named properties with a dollar sign and other times they wrap them in a vector or double square brackets. There are good reasons to have all of these different idioms, but you wouldn’t know that if you’re new to R and your expectation, quite reasonably, is that if two expressions look very different, they should do very different things. The flexibility that experienced R users love is very confusing to a newcomer. Most other languages don’t have that problem, because there’s only one way to do everything (or at least, far fewer ways than in R).

Thankfully, I’m long past the point where R syntax is perpetually confusing. I’m now well into the phase where it’s only frequently confusing, and I even have high hopes of one day making it to the point where it barely confuses me at all. But I was reminded of the steepness of that initial learning curve the other day while helping my wife use R to do some regression analyses for her thesis. Rather than explaining what she was doing, suffice it to say that she needed to write a function that, among other things, takes a data frame as input and retains only the numeric columns for subsequent analysis. Data frames in R are actually lists under the hood, so they can have mixed types (i.e., you can have string columns and numeric columns and factors all in the same data frame; R lists basically work like hashes or dictionaries in other loosely-typed languages like Python or Ruby). So you can run into problems if you haphazardly try to perform numerical computations on non-numerical columns (e.g., good luck computing the mean of ‘cat’, ‘dog’, and ‘giraffe’), and hence, pre-emptive selection of only the valid numeric columns is required.
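A quick sketch of the problem (with an invented frame; the column names are arbitrary):

```r
# Data frames are lists of columns under the hood, so types can mix freely.
animals = data.frame(name   = c("cat", "dog", "giraffe"),
                     weight = c(4, 20, 800),
                     stringsAsFactors = FALSE)
is.list(animals)        # TRUE: it's a list under the hood
mean(animals$weight)    # works fine on the numeric column
mean(animals$name)      # NA, plus a warning that the argument isn't numeric
```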

Now, in most languages (including R), you can solve this problem very easily using a loop. In fact, in many languages, you would have to use an explicit for-loop; there wouldn’t be any other way to do it. In R, you might do it like this*:

numeric_cols = rep(FALSE, ncol(ice.cream))
for (i in 1:ncol(ice.cream)) numeric_cols[i] = is.numeric(ice.cream[,i])

We allocate memory for the result, then loop over each column and check whether or not it’s numeric, saving the result. Once we’ve done that, we can select only the numeric columns from our data frame with ice.cream[,numeric_cols].

This is a perfectly sensible way to solve the problem, and as you can see, it’s not particularly onerous to write out. But of course, no self-respecting R user would write an explicit loop that way, because R provides you with any number of other tools to do the job more efficiently. So instead of saying “just loop over the columns and check if is.numeric() is true for each one,” when my wife asked me how to solve her problem, I cleverly said “use apply(), of course!”

apply() is an incredibly useful built-in function that implicitly loops over one or more margins of a matrix; in theory, you should be able to do the same work as the above two lines of code with just the following one line:

apply(ice.cream, 2, is.numeric)

Here the first argument is the data we’re passing in, the third argument is the function we want to apply to the data (is.numeric()), and the second argument is the margin over which we want to apply that function (1 = rows, 2 = columns, etc.). And just like that, we’ve cut the length of our code in half!

Unfortunately, when my wife tried to use apply(), her script broke. It didn’t break in any obvious way, mind you (i.e., with a crash and an error message); instead, the apply() call returned a perfectly good vector. It’s just that all of the values in that vector were FALSE. Meaning, R had decided that none of the columns in my wife’s data frame were numeric–which was most certainly incorrect. And because the code wasn’t throwing an error, and the apply() call was embedded within a longer function, it wasn’t obvious to my wife–as an R newbie and a novice programmer–what had gone wrong. From her perspective, the regression analyses she was trying to run with lm() were breaking with strange messages. So she spent a couple of hours trying to debug her code before asking me for help.

Anyway, I took a look at the help documentation, and the source of the problem turned out to be the following: apply() only operates over matrices or vectors, and not on data frames. So when you pass a data frame to apply() as the input, it’s implicitly converted to a matrix. Unfortunately, because matrices can only contain values of one data type, any data frame that has at least one string column will end up being converted to a string (or, in R’s nomenclature, character) matrix. And so now when we apply the is.numeric() function to each column of the matrix, the answer is always going to be FALSE, because all of the columns have been converted to character vectors. So apply() is actually doing exactly what it’s supposed to; it’s just that it doesn’t deign to tell you that it’s implicitly casting your data frame to a matrix before doing anything else. The upshot is that unless you carefully read the apply() documentation and have a basic understanding of data types (which, if you’ve just started dabbling in R, you may well not), you’re hosed.
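The coercion is easy to see in action (again with an invented stand-in frame):

```r
# What apply() quietly does to a mixed-type data frame:
ice.cream = data.frame(flavor = c("chocolate", "vanilla"),
                       rating = c(9, 7), stringsAsFactors = FALSE)
m = as.matrix(ice.cream)         # the implicit conversion step
typeof(m)                        # "character": every cell is now a string
apply(ice.cream, 2, is.numeric)  # FALSE for both columns, 'rating' included
```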

At this point I could have–and probably should have–thrown in the towel and just suggested to my wife that she use an explicit loop. But that would have dealt a mortal blow to my pride as an experienced-if-not-yet-guru-level R user. So of course I did what any self-respecting programmer does: I went and googled it. And the first thing I came across was the all.is.numeric() function in the Hmisc package, which has the following description:

Tests, without issuing warnings, whether all elements of a character vector are legal numeric values.

Perfect! So now the solution to my wife’s problem became this:

library(Hmisc)
apply(ice.cream, 2, all.is.numeric)

…which had the desirable property of actually working. But it still wasn’t very satisfactory, because it requires loading a pretty large library (Hmisc) with a bunch of dependencies just to do something very simple that should really be doable in the base R distribution. So I googled some more. And came across a relevant Stack Exchange answer, which had the following simple solution to my wife’s exact problem:

sapply(ice.cream, is.numeric)

You’ll notice that this is virtually identical to the apply() approach that crashed. That’s no coincidence; it turns out that sapply() is just a variant of apply() that works on lists. And since data frames are actually lists, there’s no problem passing in a data frame and iterating over its columns. So just like that, we have an elegant one-line solution to the original problem that doesn’t invoke any loops or third-party packages.
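And the logical vector sapply() returns plugs straight back into column selection, so the whole fix really is a one-liner (sketched, as before, with an invented frame):

```r
ice.cream = data.frame(flavor   = c("chocolate", "vanilla"),
                       rating   = c(9, 7),
                       calories = c(270, 210), stringsAsFactors = FALSE)
numeric_cols = sapply(ice.cream, is.numeric)  # flavor=FALSE, rating=TRUE, calories=TRUE
ice.cream[, numeric_cols]                     # keeps only the numeric columns
```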

Now, having used apply() a million times, I probably should have known about sapply(). And actually, it turns out I did know about sapply–in 2009. A Spotlight search reveals that I used it in some code I wrote for my dissertation analyses. But that was 2009, back when I was smart. In 2012, I’m the kind of person who uses apply() a dozen times a day, and is vaguely aware that R has a million related built-in functions like sapply(), tapply(), lapply(), and vapply(), yet still has absolutely no idea what all of those actually do. In other words, in 2012, I’m the kind of experienced R user that you might generously call “not very good at R”, and, less generously, “dumb”.

On the plus side, the end product is undeniably cool, right? There are very few languages in which you could achieve so much functionality so compactly right out of the box. And this isn’t an isolated case; base R includes a zillion high-level functions to do similarly complex things with data in a fraction of the code you’d need to write in most other languages. Once you throw in the thousands of high-quality user-contributed packages, there’s nothing else like it in the world of statistical computing.

Anyway, this inordinately long story does have a point to it, I promise, so let me sum up:

  • If I had just ignored the desire to be efficient and clever, and had told my wife to solve the problem the way she’d solve it in most other languages–with a simple for-loop–it would have taken her a couple of minutes to figure out, and she’d probably never have run into any problems.
  • If I’d known R slightly better, I would have told my wife to use sapply(). This would have taken her 10 seconds and she’d definitely never have run into any problems.
  • BUT: because I knew enough R to be clever but not enough R to avoid being stupid, I created an entirely avoidable problem that consumed a couple of hours of my wife’s time. Of course, now she knows about both apply() and sapply(), so you could argue that in the long run, I’ve probably still saved her time. (I’d say she also learned something about her husband’s stubborn insistence on pretending he knows what he’s doing, but she’s already the world-leading expert on that topic.)

Anyway, this anecdote is basically a microcosm of my entire experience with R. I suspect many other people will relate. Basically what it boils down to is that R gives you a certain amount of rope to work with. If you don’t know what you’re doing at all, you will most likely end up accidentally hanging yourself with that rope. If, on the other hand, you’re a veritable R guru, you will most likely use that rope to tie some really fancy knots, scale tall buildings, fashion yourself a space tuxedo, and, eventually, colonize brave new statistical worlds. For everyone in between novice and guru (e.g., me), using R on a regular basis is a continual exercise in alternately thinking “this is fucking awesome” and banging your head against the wall in frustration at the sheer stupidity (either your own, or that of the people who designed this awful language). But the good news is that the longer you use R, the more of the former and the fewer of the latter experiences you have. And at the end of the day, it’s totally worth it: the language is powerful enough to make you forget all of the weird syntax, strange naming conventions, choking on large datasets, and issues with data type conversions.

Oh, except when your wife is yelling at (er, gently reprimanding) you for wasting several hours of her time on a problem she could have solved herself in 5 minutes if you hadn’t insisted that she do it the idiomatic R way. Then you remember exactly why R is the master troll of statistical languages.

* R users will probably notice that I use the = operator for assignment instead of the <- operator even though the latter is the officially prescribed way to do it in R (i.e., a <- 2 is favored over a = 2). That’s because these two idioms are interchangeable in all but one (rare) use case, and personally I prefer to avoid extra keystrokes whenever possible. But the fact that you can do even basic assignment in two completely different ways in R drives home the point about how pathologically flexible–and, to a new user, confusing–the language is.
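For the curious, the one (rare) case where the two idioms diverge is inside a function call:

```r
# Inside a function call, `=` names an argument, while `<-` performs a
# real assignment in the calling environment and passes the value
# along positionally.
mean(x = 1:5)   # 'x' here is just mean()'s argument name; returns 3
mean(x <- 1:5)  # also returns 3, but...
exists("x")     # ...TRUE: a variable x now exists in your workspace
```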

in which I apologize for my laziness, but not really

I got back from the Cognitive Neuroscience Society meeting last week. I was planning to write a post-CNS wrap-up thing like I did last year and the year before that, but I seem to have misplaced the energy that’s supposed to fuel such an exercise. So instead I’ll just say I had a great time and leave it at that. What happens in Chicago stays in Chicago, etc. etc.

Also, I really appreciate all the people who came up to me at CNS and said nice things about this blog–it’s nice to know that someone actually reads this (puzzling, mind you, because I’m not sure why anyone reads this, but nice nonetheless). A couple of people encouraged me to blog more often, so I’m making an effort to do that, though the most likely outcome will be miserable failure. Either that or I’ll just start pasting random YouTube videos in this space. Like this one:

p.s. on re-reading that, it kind of makes it sound like I was swarmed by adoring fans at CNS. To clarify: “all the people” means, like, four people, and the “nice things” were really more like lukewarm “oh yeah, your blog’s not totally awful” sentiments.

p.p.s. I’ve noticed that a lot of my shorter posts take the form of “I was going to write about X, but I’m not actually going to write about X.” I think this is because I’m very lazy but still want partial credit for having good intentions. Which is kind of ridiculous.

on writing: some anecdotal observations, in no particular order

  • Early on in graduate school, I invested in the book “How to Write a Lot”. I enjoyed reading it–mostly because I (mistakenly) enjoyed thinking to myself, “hey, I bet as soon as I finish this book, I’m going to start being super productive!” But I can save you the $9 and tell you there’s really only one take-home point: schedule writing like any other activity, and stick to your schedule no matter what. Though, having said that, I don’t really do that myself. I find I tend to write about 20 hours a week on average. On a very good day, I manage to get a couple of thousand words written, but much more often, I get 200 words written that I then proceed to rewrite furiously and finally trash in frustration. But it all adds up in the long run, I guess.
  • Some people are good at writing one thing at a time; they can sit down for a week and crank out a solid draft of a paper without ever looking sideways at another project. Personally, unless I have a looming deadline (and I mean a real deadline–more on that below), I find that impossible to do; my general tendency is to work on one writing project for an hour or two, and then switch to something else. Otherwise I pretty much lose my mind. I also find it helps to reward myself–i.e., I’ll work on something I really don’t want to do for an hour, and then play video games for a while (er, switch to writing something more pleasant).
  • I can rarely get any ‘real’ writing (i.e., stuff that leads to publications) done after around 6 pm; late mornings (i.e., right after I wake up) are usually my most productive writing time. And I generally only write for fun (blogging, writing fiction, etc.) after 9 pm. There are exceptions, but by and large that’s my system.
  • I don’t write many drafts. I don’t mean that I never revise papers, because I do–obsessively. But I don’t sit down thinking “I’m going to write a very rough draft, and then I’ll go back and clean up the language.” I sit down thinking “I’m going to write a perfect paper the first time around,” and then I very slowly crank out a draft that’s remarkably far from being perfect. I suspect the former approach is actually the more efficient one, but I can’t bring myself to do it. I hate seeing malformed sentences on the page, even if I know I’m only going to delete them later. It always amazes and impresses me when I get Word documents from collaborators with titles like “AmazingNatureSubmissionVersion18”. I just give all my documents the title “paper_draft”. There might be a V2 or a V3, but there will never, ever be a V18.
  • Papers are not meant to be written linearly. I don’t know anyone who starts with the Introduction, then does the Methods and Results, and then finishes with the Discussion. Personally I don’t even write papers one section at a time. I usually start out by frantically writing down ideas as they pop into my head, and jumping around the document as I think of other things I want to say. I frequently write half a sentence down and then finish it with a bunch of question marks (like so: ???) to indicate I need to come back later and patch it up. Incidentally, this is also why I’m terrified to ever show anyone any of my unfinished paper drafts: an unsuspecting reader would surely come away thinking I suffer from a serious thought disorder. (I suppose they might be right.)
  • Okay, that last point is not entirely true. I don’t write papers completely haphazardly; I do tend to write Methods and Results before Intro and Discussion. I gather that this is a pretty common approach. On the rare occasions when I’ve started writing the Introduction first, I’ve invariably ended up having to completely rewrite it, because it usually turns out the results aren’t actually what I thought they were.
  • My sense is that most academics get more comfortable writing as time goes on. Relatively few grad students have the perseverance to rapidly crank out publication-worthy papers from day 1 (I was definitely not one of them). I don’t think this is just a matter of practice; I suspect part of it is a natural maturation process. People generally get more conscientious as they age; it stands to reason that writing (as an activity most people find unpleasant) should get easier too. I’m better at motivating myself to write papers now, but I’m also much better about doing the dishes and laundry–and I’m pretty sure that’s not because practice makes dishwashing perfect.
  • When I started grad school, I was pretty sure I’d never publish anything, let alone graduate, because I’d never handed in a paper as an undergraduate that wasn’t written at the last minute, whereas in academia, there are virtually no hard deadlines (see below). I’m not sure exactly what changed. I’m still continually surprised every time something I wrote gets published. And I often catch myself telling myself, “hey, self, how the hell did you ever manage to pay attention long enough to write 5,000 words?” And then I reply to myself, “well, self, since you ask, I took a lot of stimulants.”
  • I pace around a lot when I write. A lot. To the point where my labmates–who are all uncommonly nice people–start shooting death glares my way. It’s a heritable tendency, I guess (the pacing, not the death glare attraction); my father also used to pace obsessively. I’m not sure what the biological explanation for it is. My best guess is it’s an arousal-mediated effect: I can think pretty well when I’m around other people, or when I’m in motion, but if I’m sitting at a desk and I don’t already know exactly what I want to say, I can’t get anything done. I generally pace around the lab or house for a while figuring out what I want to say, and then I sit down and write until I’ve forgotten what I want to say, or decide I didn’t really want to say that after all. In practice this usually works out to 10 minutes of pacing for every 5 minutes of writing. I envy people who can just sit down and calmly write for two or three hours without interruption (though I don’t think there are that many of them). At the same time, I’m pretty sure I burn a lot of calories this way.
  • I’ve been pleasantly surprised to discover that I much prefer writing grant proposals to writing papers–to the point where I actually enjoy writing grant proposals. I suspect the main reason for this is that grant proposals have a kind of openness that papers don’t; with a paper, you’re constrained to telling the story the data actually support, whereas a grant proposal is as good as your vision of what’s possible (okay, and plausible). A second part of it is probably the novelty of discovery: once you conduct your analyses, all that’s left is to tell other people what you found, which (to me) isn’t so exciting. I mean, I already think I know what’s going on; what do I care if you know? Whereas when writing a grant, a big part of the appeal for me is that I could actually go out and discover new stuff–just as long as I can convince someone to give me some money first.
  • At a departmental seminar attended by about 30 people, I once heard a student express concern about an in-progress review article that he and several of the other people at the seminar were collaboratively working on. The concern was that if all of the collaborators couldn’t agree on what was going to go in the paper (and they didn’t seem to be able to at that point), the paper wouldn’t get written in time to make the rapidly approaching deadline dictated by the journal editor. A senior and very brilliant professor responded to the student’s concern by pointing out that this couldn’t possibly be a real problem, seeing as in reality there is actually no such thing as a hard writing deadline. This observation didn’t go over so well with some of the other senior professors, who weren’t thrilled that their students were being handed the key to the kingdom of academic procrastination so early in their careers. But it was true, of course: with the major exception of grant proposals (EDIT: and as Garrett points out in the comments below, conference publications in disciplines like Computer Science), most of the things academics write (journal articles, reviews, commentaries, book chapters, etc.) operate on a very flexible schedule. Usually when someone asks you to write something for them, there is some vague mention somewhere of some theoretical deadline, which is typically a date that seems so amazingly far off into the future that you wonder if you’ll even be the same person when it rolls around. And then, much to your surprise, the deadline rolls around and you realize that you must in fact really be a different person, because you don’t seem to have any real desire to work on this thing you signed up for, and instead of writing it, why don’t you just ask the editor for an extension while you go rustle up some motivation.
So you send a polite email, and the editor grudgingly says, “well, hmm, okay, you can have another two weeks,” to which you smile and nod sagely, and then, two weeks later, you send another similarly worded but even more obsequious email that starts with the words “so, about that extension…”

    The basic point here is that there’s an interesting dilemma: even though there rarely are any strict writing deadlines, it’s to almost everyone’s benefit to pretend they exist. If I ever find out that the true deadline (insofar as such a thing exists) for the chapter I’m working on right now is 6 months from now and not 3 months ago (which is what they told me), I’ll probably relax and stop working on it for, say, the next 5 and a half months. I sometimes think that the most productive academics are the ones who are just really really good at repeatedly lying to themselves.

  • I’m a big believer in structured procrastination when it comes to writing. I try to always have a really unpleasant but not-so-important task in the background, which then forces me to work on only-slightly-unpleasant-but-often-more-important tasks. Except it often turns out that the unpleasant-but-not-so-important task is actually an unpleasant-but-really-important task after all, and then I wake up in a cold sweat in the middle of the night thinking of all the ways I’ve screwed myself over. No, just kidding. I just bitch about it to my wife for a while and then drown my sorrows in an extra helping of ice cream.
  • I’m really, really, bad at restarting projects I’ve put on the back burner for a while. Right now there are 3 or 4 papers I’ve been working on on-and-off for 3 or 4 years, and every time I pick them up, I write a couple of hundred words and then put them away for a couple of months. I guess what I’m saying is that if you ever have the misfortune of collaborating on a paper with me, you should make sure to nag me several times a week until I get so fed up with you I sit down and write the damn paper. Otherwise it may never see the light of day.
  • I like writing fiction in my spare time. I also occasionally write whiny songs. I’m pretty terrible at both of these things, but I enjoy them, and I’m told (though I don’t believe it for a second) that that’s the important thing.

the neuroinformatics of Neopets

In the process of writing a short piece for the APS Observer, I was fiddling around with Google Correlate earlier this evening. It’s a very neat toy, but if you think neuroimaging or genetics have a big multiple comparisons problem, playing with Google Correlate for a few minutes will put things in perspective. Here’s a line graph displaying the search term most strongly correlated (over time) with searches for “neuroinformatics”:

That’s right, the search term that covaries most strongly with “neuroinformatics” is none other than “Illinois film office” (which, to be fair, has a pretty appealing website). Other top matches include “wma support”, “sim codes”, “bed-in-a-bag”, “neopets secret”, “neopets guild”, and “neopets secret avatars”.

I may not have learned much about neuroinformatics from this exercise, but I did get a pretty good sense of how neuroinformaticians like to spend their free time…

p.s. I was pretty surprised to find that normalized search volume for just about every informatics-related term has fallen sharply in the last 10 years. I went in expecting the opposite! Maybe all the informaticians were early search adopters, and the rest of the world caught up? No, probably not. Anyway, enough of this; Neopia is calling me!

p.p.s. Seriously though, this is why data fishing expeditions are dangerous. Any one of these correlations is significant at p-less-than-point-whatever-you-like. And if your publication record depended on it, you could probably tell yourself a convincing story about why neuroinformaticians need to look up Garmin eMaps…
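If you want to convince yourself of just how bad the fishing problem is, a few lines of R will do it (the series lengths and counts below are made up; the phenomenon isn’t):

```r
# Fish for the best correlate of one random "search volume" series among
# tens of thousands of other random series. Everything here is pure noise.
set.seed(1)
n_weeks <- 100    # length of each fake weekly series
n_terms <- 50000  # number of candidate "search terms" to fish through

target <- rnorm(n_weeks)
candidates <- matrix(rnorm(n_weeks * n_terms), nrow = n_weeks)

rs <- cor(target, candidates)  # target vs. every candidate at once
max(abs(rs))  # the "winning" correlation is typically around 0.4 --
              # nominally very "significant", as long as you politely
              # ignore the other 49,999 comparisons you just ran
```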