Shalizi on the confounding of contagion and homophily in social network studies

Cosma Shalizi has a post up today discussing a new paper he wrote with Andrew C. Thomas arguing that it’s pretty much impossible to distinguish the effects of social contagion from homophily in observational studies.

That’s probably pretty cryptic without context, so here’s the background. A number of high-profile studies have been published in the past few years suggesting that everything from obesity to loneliness to pot smoking is socially contagious. The basic argument is that when you look at the diffusion of certain traits within social networks, you find that having friends who are obese is more likely to make you obese, having happy friends is more likely to make you happy, and so on. These effects (it’s been argued) persist even after you control for homophily–that is, the tendency of people to know and like other people who are similar to them–and can be indirect, so that you’re more likely to be obese even if your friends’ friends (who you may not even know) are obese.

Needless to say, the work has been controversial. A few weeks ago, Dave Johns wrote an excellent pair of articles in Slate describing the original research, as well as the recent critical backlash (see also Andrew Gelman’s post here). Much of the criticism has focused on the question of whether it’s really possible to distinguish homophily from contagion using the kind of observational data and methods that contagion researchers have relied on. That is, if the probability that you’ll become obese (or lonely, or selfish, etc.) increases as a function of the number of obese people you know, is that because your acquaintance with obese people exerts a causal influence on your own body weight (e.g., by shaping your perception of body norms, eating habits, etc.), or is it simply that people with a disposition to become obese tend to seek out other people with the same disposition, and there’s no direct causal influence at all? It’s an important question, but one that’s difficult to answer conclusively.

In their new paper, Shalizi and Thomas use an elegant combination of logical argumentation, graphical causal models, and simulation to show that, in general, contagion effects are unidentifiable: you simply can’t tell whether like begets like because of a direct causal influence (“real” contagion), or because of homophily (birds of a feather flocking together). The only way out of the bind is to make unreasonably strong assumptions–e.g., that the covariates explicitly included in one’s model capture all of the influence of latent traits on observable behaviors. In his post Shalizi sums up the conclusions of the paper this way:

What the statistician or social scientist sees is that bridge-jumping is correlated across the social network. In this it resembles many, many, many behaviors and conditions, such as prescribing new antibiotics (one of the classic examples), adopting other new products, adopting political ideologies, attaching tags to pictures on flickr, attaching mis-spelled jokes to pictures of cats, smoking, drinking, using other drugs, suicide, literary tastes, coming down with infectious diseases, becoming obese, and having bad acne or being tall for your age. For almost all of these conditions or behaviors, our data is purely observational, meaning we cannot, for one reason or another, just push Joey off the bridge and see how Irene reacts. Can we nonetheless tell whether bridge-jumping spreads by (some form) of contagion, or rather is due to homophily, or, if it is both, say how much each mechanism contributes?

A lot of people have thought so, and have tried to come at it in the usual way, by doing regression. Most readers can probably guess what I think about that, so I will just say: don’t you wish. More sophisticated ideas, like propensity score matching, have also been tried, but people have pretty much assumed that it was possible to do this sort of decomposition. What Andrew and I showed is that in fact it isn’t, unless you are willing to make very strong, and generally untestable, assumptions.

It’s a very clear and compelling paper, and definitely worth reading if you have any interest at all in the question of whether and when it’s okay to apply causal modeling techniques to observational data. The answer Shalizi’s argued for on many occasions–and an unfortunate one from many scientists’ perspective–seems to be: very rarely if ever.
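
To make the homophily problem concrete, here’s a toy simulation. It is not taken from the Shalizi and Thomas paper, and everything in it (the pairing rule, the noise levels, the function names) is an assumption chosen for illustration. Each person has a latent trait, behavior is just that trait plus noise, and there is no contagion anywhere in the model. Friends are either paired at random or paired with the person closest to them on the latent trait.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to keep the sketch stdlib-only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def friend_correlation(homophily, n_pairs=5000, seed=0):
    """Correlation between a person's behavior and their friend's behavior.

    Hypothetical setup: behavior = latent trait + independent noise.
    Nobody influences anybody, so any correlation comes from who befriends whom.
    """
    rng = random.Random(seed)
    latent = [rng.gauss(0, 1) for _ in range(2 * n_pairs)]
    if homophily:
        # Sorting puts people with similar latent traits next to each other,
        # so pairing neighbors mimics strong homophily.
        latent.sort()
    pairs = [(latent[2 * i], latent[2 * i + 1]) for i in range(n_pairs)]
    ego = [a + rng.gauss(0, 1) for a, _ in pairs]
    friend = [b + rng.gauss(0, 1) for _, b in pairs]
    return pearson(ego, friend)


if __name__ == "__main__":
    print("random friendships:     ", round(friend_correlation(False), 3))
    print("homophilous friendships:", round(friend_correlation(True), 3))
```

With random friendships the correlation hovers around zero; with homophilous friendships it comes out at roughly one half, even though the model contains no causal influence between friends at all. An observer who only sees behavior and the friendship graph can’t tell this apart from contagion without assumptions about the latent trait.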

undergraduates are WEIRD

This month’s issue of Nature Neuroscience contains an editorial lambasting the excessive reliance of psychologists on undergraduate college samples, which, it turns out, are pretty unrepresentative of humanity at large. The impetus for the editorial is a mammoth in-press review of cross-cultural studies by Joseph Henrich and colleagues, which, the authors suggest, collectively indicate that “samples drawn from Western, Educated, Industrialized, Rich and Democratic (WEIRD) societies … are among the least representative populations one could find for generalizing about humans.” I’ve only skimmed the article, but aside from the clever acronym, you could do a lot worse than these (rather graphic) opening paragraphs:

In the tropical forests of New Guinea the Etoro believe that for a boy to achieve manhood he must ingest the semen of his elders. This is accomplished through ritualized rites of passage that require young male initiates to fellate a senior member (Herdt, 1984; Kelley, 1980). In contrast, the nearby Kaluli maintain that male initiation is only properly done by ritually delivering the semen through the initiate’s anus, not his mouth. The Etoro revile these Kaluli practices, finding them disgusting. To become a man in these societies, and eventually take a wife, every boy undergoes these initiations. Such boy-inseminating practices, which are enmeshed in rich systems of meaning and imbued with local cultural values, were not uncommon among the traditional societies of Melanesia and Aboriginal Australia (Herdt, 1993), as well as in Ancient Greece and Tokugawa Japan.

Such in-depth studies of seemingly “exotic” societies, historically the province of anthropology, are crucial for understanding human behavioral and psychological variation. However, this paper is not about these peoples. It’s about a truly unusual group: people from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies. In particular, it’s about the Western, and more specifically American, undergraduates who form the bulk of the database in the experimental branches of psychology, cognitive science, and economics, as well as allied fields (hereafter collectively labeled the “behavioral sciences”). Given that scientific knowledge about human psychology is largely based on findings from this subpopulation, we ask just how representative are these typical subjects in light of the available comparative database. How justified are researchers in assuming a species-level generality for their findings? Here, we review the evidence regarding how WEIRD people compare to other

Anyway, it looks like a good paper. Based on a cursory read, the conclusions the authors draw seem pretty reasonable, if a bit strong. I think most researchers do already recognize that our dependence on undergraduates is unhealthy in many respects; it’s just that it’s difficult to break the habit, because the alternative is to spend a lot more time and money chasing down participants (and there are limits to that too; it just isn’t feasible for most researchers to conduct research with Etoro populations in New Guinea). Then again, just because it’s hard to do science the right way doesn’t really make it OK to do it the wrong way. So, to the extent that we care about our results generalizing across the entire human species (which, in many cases, we don’t), we should probably be investing more energy in weaning ourselves off undergraduates and trying to recruit more diverse samples.

cognitive training doesn’t work (much, if at all)

There’s a beautiful paper in Nature this week by Adrian Owen and colleagues that provides what’s probably as close to definitive evidence as you can get in any single study that “brain training” programs don’t work. Or at least, to the extent that they do work, the effects are so weak they’re probably not worth caring about.

Owen et al used a very clever approach to demonstrate their point. Rather than spending their time running small-sample studies that require people to come into the lab over multiple sessions (an expensive and very time-intensive effort that’s ultimately still usually underpowered), they teamed up with the BBC program ‘Bang Goes The Theory’. Participants were recruited via the TV show, and were directed to an experimental website where they created accounts, engaged in “pre-training” cognitive testing, and then could repeatedly log on over the course of six weeks to perform a series of cognitive tasks supposedly capable of training executive abilities. After the training period, participants again performed the same battery of cognitive tests, enabling the researchers to compare performance pre- and post-training.

Of course, you expect robust practice effects with this kind of thing (i.e., participants would almost certainly do better on the post-training battery than on the pre-training battery solely because they’d been exposed to the tasks and had some practice). So Owen et al randomly assigned participants logging on to the website to two different training programs (involving different types of training tasks) or to a control condition in which participants answered obscure trivia questions rather than doing any sort of intensive cognitive training per se. The beauty of doing this all online was that the authors were able to obtain gargantuan sample sizes (several thousand in each condition), ensuring that statistical power wasn’t going to be an issue. Indeed, Owen et al focus almost exclusively on effect sizes rather than p values, because, as they point out, once you have several thousand participants in each group, almost everything is going to be statistically significant, so it’s really the effect sizes that matter.
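
The point about effect sizes versus p values is easy to check with a back-of-the-envelope calculation. The sketch below assumes two equal-sized groups, a common standard deviation, and a normal approximation to the test statistic; the function name and the example numbers are illustrative and are not taken from the paper.

```python
import math


def two_sided_p(effect_size_d, n_per_group):
    """Approximate two-sided p value for a standardized mean difference.

    Assumes equal group sizes and a normal (z) approximation, so the
    standard error of the difference in SD units is sqrt(2 / n).
    """
    z = effect_size_d * math.sqrt(n_per_group / 2)
    # 2 * (1 - Phi(|z|)) expressed via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Same tiny effect (a tenth of a standard deviation), very different n.
    for n in (50, 500, 5000):
        print(n, two_sided_p(0.1, n))
```

A difference of a tenth of a standard deviation is nowhere near significant with 50 people per group (p is about 0.62), but with 5,000 per group the p value drops below one in a million. The effect hasn’t changed; only the sample size has, which is why the effect size is the number worth looking at.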

The critical comparison was whether the experimental groups showed greater improvements in performance post-training than the control group did. And the answer, generally speaking, was no. Across four different tasks, the differences in training-related gains in the experimental group relative to the control group were always either very small (no larger than about a fifth of a standard deviation), or even nonexistent (to the extent that for some comparisons, the control group improved more than the experimental groups!). So the upshot is that if there is any benefit of cognitive training (and it’s not at all clear that there is, based on the data), it’s so small that it’s probably not worth caring about. Here’s the key figure:


You could argue that the fact the y-axis spans the full range of possible values (rather than fitting the range of observed variation) is a bit misleading, since it’s only going to make any effects seem even smaller. But even so, it’s pretty clear these are not exactly large effects (and note that the key comparison is not the difference between light and dark bars, but the relative change from light to dark across the different groups).

Now, people who are invested (either intellectually or financially) in the efficacy of cognitive training programs might disagree, arguing that an effect of one-fifth of a standard deviation isn’t actually a tiny effect, and that there are arguably many situations in which that would be a meaningful boost in performance. But that’s the best possible estimate, and probably overstates the actual benefit. And there’s also the opportunity cost to consider: the average participant completed 20 – 30 training sessions, which, even at just 20 minutes a session (an estimate based on the description of the length of each of the training tasks), would take about 7 – 10 hours to complete (and some participants no doubt spent many more hours in training). That’s a lot of time that could have been invested in other much more pleasant things, some of which might also conceivably improve cognitive ability (e.g., doing Sudoku puzzles, which many people actually seem to enjoy). Owen et al put it nicely:

To illustrate the size of the transfer effects observed in this study, consider the following representative example from the data. The increase in the number of digits that could be remembered following training on tests designed, at least in part, to improve memory (for example, in experimental group 2) was three-hundredth of a digit. Assuming a linear relationship between time spent training and improvement, it would take almost four years of training to remember one extra digit. Moreover, the control group improved by two-tenths of a digit, with no formal memory training at all.

If someone asked you if you wanted to spend six weeks doing a “brain training” program that would provide those kinds of returns, you’d probably politely (or impolitely) refuse. Especially since it’s not like most of us spend much of our time doing digit span tasks anyway; odds are that the kinds of real-world problems we’d like to perform a little better at (say, something trivial like figuring out what to buy or not to buy at the grocery store) are even further removed from the tasks Owen et al (and other groups) have used to test for transfer, so any observable benefits in the real world would presumably be even smaller.
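
The “almost four years” figure in the quote follows directly from the numbers given, under the linearity assumption the authors state. Here’s the arithmetic as a small sketch; the six-week block length comes from the study description above, while the 52-weeks-per-year conversion and the function name are assumptions added here.

```python
def years_to_gain_one_digit(gain_per_block, weeks_per_block=6, weeks_per_year=52):
    """Years of training needed to gain one digit of span.

    Assumes improvement is linear in time spent training, as in the quote.
    """
    blocks_needed = 1 / gain_per_block
    return blocks_needed * weeks_per_block / weeks_per_year


if __name__ == "__main__":
    # Experimental group 2: three-hundredths of a digit per six-week block.
    print(round(years_to_gain_one_digit(0.03), 2))  # 3.85
    # Control group: two-tenths of a digit over the same period.
    print(round(years_to_gain_one_digit(0.2), 2))   # 0.58
```

At 0.03 digits per six weeks you need about 33 blocks, or 200 weeks, which is a little under four years. At the control group’s rate the same gain would take about seven months.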

Of course, no study is perfect, and there are three potential concerns I can see. The first is that it’s possible that there are subgroups within the tested population who do benefit much more from the cognitive training. That is, the minuscule overall effect could be masking heterogeneity within the sample, such that some people (say, maybe men above 60 with poor diets who don’t like intellectual activities) benefit much more. The trouble with this line of reasoning, though, is that the overall effects in the entire sample are so small that you’re pretty much forced to conclude that either (a) any group that benefits substantially from the training is a very small proportion of the total sample, or (b) that there are actually some people who suffer as a result of cognitive training, effectively balancing out the gains seen by other people. Neither of these possibilities seems particularly attractive.

The second concern is that it’s conceivable that the control group isn’t perfectly matched to the experimental group, because, by the authors’ own admission, the retention rate was much lower in the control group. Participants were randomly assigned to the three groups, but only about two-thirds as many control participants completed the study. The higher drop-out rate was apparently due to the fact that the obscure trivia questions used as a control task were pretty boring. The reason that’s a potential problem is that attrition wasn’t random, so there may be a systematic difference between participants in the experimental conditions and those in the control conditions. In particular, it’s possible that the remaining control participants had a higher tolerance for boredom and/or were somewhat smarter or more intellectual on average (answering obscure trivia questions clearly isn’t everyone’s cup of tea). If that were true, the lack of any difference between experimental and control conditions might be due to participant differences rather than an absence of a true training effect. Unfortunately, it’s hard to determine whether this might be true, because (as far as I can tell) Owen et al don’t provide the raw mean performance scores on the pre- and post-training testing for each group, but only report the changes in performance. What you’d want to know is that the control participants didn’t do substantially better or worse on the pre-training testing than the experimental participants (due to selective attrition of low-performing subjects), which might make changes in performance difficult to interpret. But at face value, it doesn’t seem very plausible that this would be a serious issue.
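
To see why the pre-training means would be worth checking, here’s a toy simulation of selective attrition. Every number in it is hypothetical: the retention probabilities are chosen only so that about two-thirds as many control participants finish, and the rule that low scorers are the ones who quit the trivia condition is an assumption, not something reported by Owen et al.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def simulate_attrition(n_per_arm=5000, seed=1):
    """Return baseline scores of the participants who complete each arm.

    Hypothetical dropout rule: training participants stay with probability
    0.6 regardless of ability; control participants stay with probability
    0.6 if their baseline is above average and 0.2 otherwise.
    """
    rng = random.Random(seed)
    completers = {"training": [], "control": []}
    for arm in completers:
        for _ in range(n_per_arm):
            baseline = rng.gauss(0, 1)  # standardized pre-training score
            if arm == "training":
                p_stay = 0.6
            else:
                p_stay = 0.6 if baseline > 0 else 0.2
            if rng.random() < p_stay:
                completers[arm].append(baseline)
    return completers


if __name__ == "__main__":
    done = simulate_attrition()
    for arm, scores in done.items():
        print(arm, len(scores), round(mean(scores), 2))
```

Random assignment makes the arms equivalent at enrollment, but under this dropout rule the control completers start out around 0.4 standard deviations above the training completers. That’s the kind of imbalance that reporting raw pre-training means would reveal, and that change scores alone can hide.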

Lastly, Owen et al do report a small positive correlation between number of training sessions performed (which was under participants’ control) and gains in performance on the post-training test. Now, this effect was, as the authors note, very small (a maximal Spearman’s rho of .06), so that it’s also not really likely to have practical implications. Still, it does suggest that performance increases as a function of practice. So if we’re being pedantic, we should say that intensive cognitive training may improve cognitive performance in a generalized way, but that the effect is really minuscule and probably not worth the time and effort required to do the training in the first place. Which isn’t exactly the type of careful and measured claim that the people who sell brain training programs are generally interested in making.

At any rate, setting aside the debate over whether cognitive training works or not, one thing that’s perplexed me for a long time about the training literature is why people focus to such an extent on cognitive training rather than other training regimens that produce demonstrably larger transfer effects. I’m thinking in particular of aerobic exercise, which produces much more robust and replicable effects on cognitive performance. There’s a nice meta-analysis by Colcombe and colleagues that found effect sizes on the order of half a standard deviation and up for physical exercise in older adults–and effects were particularly large for the most heavily g-loaded tasks. Now, even if you allow for publication bias and other manifestations of the fudge factor, it’s almost certain that the true effect of physical exercise on cognitive performance is substantially larger than the (very small) effects of cognitive training as reported by Owen et al and others.

The bottom line is that, based on everything we know at the moment, the evidence seems to pretty strongly suggest that if your goal is to improve cognitive function, you’re more likely to see meaningful results by jogging or swimming regularly than by doing crossword puzzles or N-back tasks–particularly if you’re older. And of course, a pleasant side effect is that exercise also improves your health and (for at least some people) mood, which I don’t think N-back tasks do. Actually, many of the participants I’ve tested will tell you that doing the N-back is a distinctly dysphoric experience.

On a completely unrelated note, it’s kind of neat to see a journal like Nature publish what is essentially a null result. It goes to show that people do care about replication failures in some cases–namely, in those cases when the replication failure contradicts a relatively large existing literature, and is sufficiently highly powered to actually say something interesting about the likely effect sizes in question.
Owen AM, Hampshire A, Grahn JA, Stenton R, Dajani S, Burns AS, Howard RJ, & Ballard CG (2010). Putting brain training to the test. Nature PMID: 20407435

links and slides from the CNS symposium

After the CNS symposium on building a cumulative cognitive neuroscience, several people I talked to said it was a pity there wasn’t an online repository where all the sites that the speakers discussed could be accessed. I should have thought of that ahead of time, because even if we made one now, no one would ever find it. So, belatedly, the best I can do is put together a list here, where I’m pretty sure no one’s ever going to read it.

Anyway, this is mostly from memory, so I may be forgetting some of the things people talked about, but here’s what I can remember:

Let me know if there’s anything I’m leaving out.

On a related note, several people at the conference asked me for my slides, but I promptly forgot who they were, so here they are.

UPDATED: Russ Poldrack’s slides are now also on the web here.

CNS wrap-up

I’m back from CNS in Montreal (actually, I’m not quite back; I’m in Ottawa for a few days–but close enough). Some thoughts about the experience, in no particular order, and with very little sense:

  • A huge number of registered attendees (basically, everyone from Europe who didn’t leave for Montreal early) couldn’t make it to the meeting because of that evil, evil Icelandic volcano. As a result, large swaths of posterboard were left blank–or would have been left blank, if not for the clever “Holy Smokes! So-and-so can’t be here…” notes taped to them. So that was really too bad; aside from the fact that the Europeans missed out on the meeting, which kind of sucks, there was a fair amount of chaos during the slide and symposium sessions as speakers were randomly shuffled around. I guess it’s a testament to the organizers that the conference went off relatively smoothly despite the loss of a large chunk of the attendance.
  • The symposium I chaired went well, as far as I can tell. Which is to say, no one streaked naked through the hall, no one went grossly over time, the audience hall was full, and the three talks I got to watch from the audience were all great. I think my talk went well too, but it’s harder to say. In theory, you should be able to tell how these things go based on the ratio of positive to negative feedback you get. But since people generally won’t tell you if they thought your talk sucked, you’re usually stuck trying to determine whether people are giving you well-I-didn’t-really-like-it-but-I-don’t-want-you-to-feel-bad compliments, or I-really-liked-it-and-I’m-not-even-lying-to-your-face compliments. In any case, good or bad reception, I think the topic is a really important one, and I’m glad the symposium was well attended.
  • I love Montreal. As far as I’m concerned they could have CNS in Montreal every year and I wouldn’t complain. Well, maybe I’d complain a little. But only about unimportant things like the interior decoration of the hotel lobby.
  • Speaking of which, I liked the Hilton Bonaventure and all, but the place did remind me a lot of a 70s porn set. All it’s missing are some giant ferns in the lobby and a table lined with cocaine next to the elevators. (You can probably tell that my knowledge of 70s porn is based entirely on watching two-thirds of Boogie Nights once). Also, what the hell is on floors 2 through 12 of Place Bonaventure? And how can a hotel have nearly 400 rooms, all on the same (13th) floor!?
  • That Vietnamese place we had lunch at on Tuesday, which apparently just opened up, isn’t going to last long. When someone asks you for “brown rice”, they don’t mean “white rice with some red food dye stirred in”.
  • Apparently, Mike X. Cohen is not only the most productive man in cognitive neuroscience, but also a master of the neuroimaging haiku (admittedly, a niche specialty).
  • Sushi and baklava at a conference reception? Yes please!
  • The MDRS party on Monday night was a lot of fun, though the downstairs room at the bar was double-booked. I’m sure the 20-odd people at salsa dancing night were a bit surprised, and probably not entirely appreciative, when 100 or so drunken neuroscientists collectively stumbled downstairs for a free drink, hung out for fifteen minutes, then disappeared upstairs again. Other than that–and the $8 beers–a good time was had.
  • Turns out that assortment of vegetables that Afghans call an Afghan salad is exactly what Turks call a Turkish salad and Israelis call an Israeli salad. I guess I’m not surprised that everyone in that part of the world uses the same four or five ingredients in their salad, but let’s not all rush to take credit for what is basically some cucumber, tomato, and parsley in a bowl. That aside, dinner was awesome. And I wish there were more cities full of restaurants that let you bring your own wine.

  • The talks and posters were great this year. ALL OF THEM. If I had to pick favorites, I guess I really liked the symposium on perceptual decision-making, and several of the posters in the reward/motivation session on Sunday or Monday afternoon. But really, ALL OF THEM WERE GREAT. So let’s all give ourselves giant gold medals with pictures of brains on them. And then… let’s melt down those medals, sell the gold, and buy some scanners with the money.

the grand canada tour, 2010 edition

Blogging will be slow(er than normal) for the next couple of weeks. On Wednesday I’m off on a long-awaited Grand Tour of Canada, 2010 edition. The official purpose of the trip is the CNS meeting in Montreal, but seeing as I’m from Canada and most of my family is in Toronto and Ottawa, I’ll be tacking on a few days of R&R at either end of the trip, so I’ll be gone for 10 days. By R&R I mean that I’ll be spending most of my time in Toronto at cheap all-you-can-eat sushi restaurants, and most of my time in Ottawa sleeping in till noon in my mom’s basement. So really, I guess my plan for the next two weeks is to turn seventeen again.

While I’m in Ottawa, I’ll also be giving a talk at Carleton University. I’d like to lump this under the “invited talks” section of my vita–you know, just to make myself seem slightly more important (being invited somewhere means people actually want to hear you say stuff!)–but I’m not sure it counts as “invited” if you invite yourself to give a talk somewhere else. Which is basically what happened; I did my undergraduate degree at Carleton, so when I emailed my honors thesis advisor to ask if I could give a talk when I was in town, he probably felt compelled to humor me, much as I know he’d secretly like to say no (sorry John!). At any rate, the talk will be closely based on this paper on the relation between personality and word use among bloggers. Amazingly enough, it turns out you can learn something (but not that much) about people from what they write on their blogs. It’s not the most exciting conclusion in the world, but I think there are some interesting results hidden away in there somewhere. If you happen to come across any of them, let me know.

correlograms are correlicious

In the last year or so, I’ve been experimenting with different ways of displaying correlation matrices, and have gotten very fond of color-coded correlograms. Here’s one from a paper I wrote investigating the relationship between personality and word use among bloggers (click to enlarge):

Figure S2 Extraversion

The rows reflect language categories from Jamie Pennebaker’s Linguistic Inquiry and Word Count (LIWC) dictionary; the columns reflect Extraversion scores (first column) or scores on the lower-order “facets” of Extraversion (as measured by the IPIP version of the NEO-PI-R). The plot was generated in R using code adapted from the corrgram package (R really does have contributed packages for everything). Positive correlations are in blue, negative ones are in red.

The thing I really like about these figures is that the colors instantly orient you to the most important features of the correlation matrix, instead of having to inspect every cell for the all-important ***magical***asterisks***of***statistical***significance***. For instance, a cursory glance tells you that even though Excitement-Seeking and Cheerfulness are both nominally facets of Extraversion, they’re associated with very different patterns of word use. And then a slightly less cursory glance tells you that that’s because people with high Excitement-Seeking scores like to swear a lot and use negative emotion words, while Cheerful people like to talk about friends, music, and use positive emotional language. You’d get the same information without the color, of course, but it’d take much longer to extract, and then you’d have to struggle to keep all of the relevant numbers in mind while you mull them over. The colors do a lot to reduce cognitive load, and also have the secondary benefit of looking pretty.

If you’re interested in using correlograms, a good place to start is the Quick-R tutorial on correlograms in R. The documentation for the corrgram package is here, and there’s a nice discussion of the principles behind the visual display of correlation matrices in this article.
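
The figure above was made in R with code adapted from corrgram, and that’s the route to take for real plots. Just to show the underlying idea, here’s a text-only stand-in in Python that replaces each correlation with a glyph for its sign and rough size. The cutoffs (0.2 and 0.5), the glyphs, and the toy data are arbitrary choices for illustration and have nothing to do with the corrgram package.

```python
def pearson(xs, ys):
    """Pearson correlation; assumes neither column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def glyph(r):
    """Map a correlation to a coarse sign-and-strength symbol."""
    if r >= 0.5:
        return "++"
    if r >= 0.2:
        return "+"
    if r <= -0.5:
        return "--"
    if r <= -0.2:
        return "-"
    return "."


def text_correlogram(columns):
    """Return {(row_name, col_name): glyph} for every pair of columns."""
    names = list(columns)
    return {(a, b): glyph(pearson(columns[a], columns[b]))
            for a in names for b in names}


if __name__ == "__main__":
    # Made-up data: y rises with x, z falls with x, w is unrelated to x.
    data = {
        "x": [1, 2, 3, 4, 5],
        "y": [2, 4, 6, 8, 10],
        "z": [5, 4, 3, 2, 1],
        "w": [1, -1, 0, -1, 1],
    }
    grid = text_correlogram(data)
    names = list(data)
    print("    " + "".join(f"{n:>4}" for n in names))
    for a in names:
        print(f"{a:>4}" + "".join(f"{grid[(a, b)]:>4}" for b in names))
```

Even in this crude form you can pick out the strong positive and negative cells at a glance without reading a single number, which is the whole appeal of the color-coded version.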

p.s. I’m aware this post has the worst title ever; the sign-up sheet for copy editing duties is in the comment box (hint hint).

my new favorite blog

…teaches you How To Write Badly Well. For instance, if you want to write badly well, you must Refuse to leave the present tense:

I sit at my desk and remember how, years ago, I wonder what my life will be like when I am fifty, which I am now. I’m imagining that I’m living in a big house, I remember as I sit in my one-bedroom apartment. Now I pour myself a drink and cast my mind back to a time when I’m full of hope and passion which is never to be extinguished, as it is now.

‘What am I doing?’ I mutter to myself, taking a sip of my drink. In my memory, I’m seven years old, sitting in the highest branches of a tree which is being planted a hundred years before I am born. Now, though, the tree is long dead. I’m chopping it down at the age of twenty and thinking about when it is supporting my weight at the age of seven. I look at my watch.

‘Late,’ I mutter to myself. It is eight; the retrospective is just starting, half an hour ago.

i’m not dropping out after all

I forgot about April Fool’s day. All week, I’ve been going on and on to my wife about how I was going to orchestrate a monumental prank on April Fool’s day–something like making a police car appear on top of the MIT dome, or making the statue of liberty disappear. But then my wife convinced me these weren’t great ideas, because they would require a lot more intelligence resources than I possess. So I settled for something more pedestrian, namely, pretending I was dropping out of academia because I’d gotten fed up with the poor hours and lack of M&Ms.

Well, while Google was busy renaming itself, and Andrew Gelman was disavowing any relationship with multilevel modeling, I dropped the ball and forgot to pull off my epic prank this morning. Turns out that may not have been such a bad thing: around noon, I found out that pretty much every other academic blogger on the planet had had exactly the same idea. Here’s Professor in Training:

After talking to my postdoc mentor last week, I’ve decided to resign from my position as assistant professor at Really Big U. Postdoc Mentor convinced me that I was a much better postdoc than I am PI and has generously offered me a place in his lab. He can’t afford to pay me as much as I was earning in his lab a couple of years ago and I won’t have my own computer or desk but I’m sure it’ll all work out for the best.

And here’s Prof-like Substance:

After all of the discussions how good postdoc life is and how teaching is sucking the life out of me I have decided to bail on this job and take a postdoc position in another country that I’ve always wanted to live in. My department and Dean are understandably upset and it took some time to make sure that all of my trainees can find PIs to work with so that they can finish their degrees, but sometimes you just have to do what’s right for yourself.

What the hell, people. Do we all share a brain? Are you all listening in on my conversations with my wife? Or is it just that all academics secretly harbor fantasies of dropping out in favor of a less stressful life featuring sunny beaches, cocktails, and afternoon sessions of Jai Alai?

Anyway, long story short, there won’t be an April Fool’s joke this year. I’ve decided to stay in academia. Just to be different.

some thoughtful comments on automatic measure abbreviation

In the comments on my last post, Sanjay Srivastava had some excellent thoughts/concerns about the general approach of automating measure abbreviation using a genetic algorithm. They’re valid concerns that might come up for other people too, so I thought I’d discuss them here in more detail. Here’s Sanjay:

Lew Goldberg emailed me a copy of your paper a while back and asked what I thought of it. I’m pasting my response below — I’d be curious to hear your take on it. (In this email “he” is you and “you” is he because I was writing to Lew…)


1. So this is what it feels like to be replaced by a machine.

I’m not sure if Sanjay thinks this is a good or a bad thing? My own feeling is that it’s a good thing to the extent that it makes personality measurement more efficient and frees researchers up to use that time (both during data collection and measure development) for other productive things like writing papers (or, failing that, eating M&M’s on the couch and devising the most diabolically clever April Fool’s joke for next year to make up for the one you forgot to pull off this year). It’s a bad thing to the extent that people take this as a license to stop thinking carefully about what they’re doing when they’re shortening or administering questionnaire measures. But provided people retain a measure of skepticism and cautiousness in applying this type of approach, I’m optimistic that the result will be a large net gain.

2. The convergent correlations were a little low in studies 2 and 3. You’d expect shortened scales to have less reliability and validity, of course, but that didn’t go all the way in covering the difference. He explained that this was because the AMBI scales draw on a different item pool than the proprietary measures, which makes sense. However, that makes it hard to evaluate the utility of the approach. If you compare how the full IPIP facet scales correlate with the proprietary NEO (which you’ve published here) against his Table 2, for example, it looks like the shortening algorithm is losing some information. Whether that’s better or worse than a rationally shortened scale is hard to say.

This is an excellent point, and I do want to reiterate that the abbreviation process isn’t magic; you can’t get something for free, and you’re almost invariably going to lose some fidelity in your measurement when you shorten any measure. That said, I actually feel pretty good about the degree of convergence I report in the paper. Sanjay already mentions one reason the convergent correlations seem lower than what you might expect: the new measures are composed of different items than the old ones, so they’re not going to share many of the same sources of error. That means the convergent correlations will necessarily be lower, but that isn’t necessarily a problem in a broader sense. But I think there are also two other, arguably more important, reasons why the convergence might seem deceptively low.

One is that the degree of convergence is bounded by the test-retest reliability of the original measures. Because the items in the IPIP pools were administered in batches spanning about a decade, whereas each of the proprietary measures (e.g., the NEO-PI-R) was administered on one occasion, the net result is that many of the items being used to predict personality traits were actually filled out several years before or after the personality measures in question. If you look at the long-term test-retest reliability of some of the measures I abbreviated (and there actually isn’t all that much test-retest data of that sort out there), it’s not clear that it’s much higher than what I report, even for the original measures. In other words, if you don’t generally see test-retest correlations across several years greater than .6 – .8 for the real NEO-PI-R scales, you can’t really expect to do any better with an abbreviated measure. But that probably says more about the reliability of narrowly-defined personality traits than about the abbreviation process.
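To put rough numbers on that ceiling, here is a back-of-the-envelope sketch using the classical attenuation formula: the expected correlation between two error-prone measures of the same trait is the trait's stability over the interval times the square root of the product of the two reliabilities. The function name and all three input values are made up for illustration; none of them are estimates from the paper or from the IPIP data.

```python
import math


def convergence_ceiling(stability, reliability_a, reliability_b):
    """Expected correlation between two noisy measures of one trait.

    stability:      correlation of the underlying trait across the time gap
    reliability_a/b: share of each measure's variance that is not error
    (classical attenuation formula; all inputs here are hypothetical)
    """
    return stability * math.sqrt(reliability_a * reliability_b)


# Hypothetical values, chosen only to show the arithmetic.
same_occasion = convergence_ceiling(1.0, 0.8, 0.8)
years_apart = convergence_ceiling(0.7, 0.8, 0.8)
print(f"same occasion: {same_occasion:.2f}   several years apart: {years_apart:.2f}")
```

With these invented inputs, the time gap alone pulls the ceiling well below what the same two measures would show if given on the same day, which is the sense in which the IPIP administration schedule caps the convergent correlations.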

The other reason the convergent correlations seem lower than you might expect, which I actually think is the big one, is that I reported only the cross-validated coefficients in the paper. In other words, I used only half of the data to abbreviate measures like the NEO-PI-R and HEXACO-PI, and then used the other half to obtain unbiased estimates of the true degree of convergence. This is technically the right way to do things, because if you don’t cross-validate, you’re inevitably going to capitalize on chance. If you fit a model to a particular set of data, and then use the very same data to ask the question “how well does the model fit the data?” you’re essentially cheating–or, to put it more mildly, your estimates are going to be decidedly “optimistic”. You could argue it’s a relatively benign kind of cheating, because almost everyone does it, but that doesn’t make it okay from a technical standpoint.
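If the capitalizing-on-chance point feels abstract, a small simulation makes it visible. The sketch below is not the GA from the paper: it uses simulated data in which every item is an equally weak indicator of one trait, picks the ten items that look best in a training half, and then scores that same ten-item scale in both halves. The sample sizes, the 0.3 loading, and all function names are assumptions made up for the example.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_people, n_items, rng):
    """Hypothetical data: each item is a weak, noisy indicator of one trait."""
    target, items = [], [[] for _ in range(n_items)]
    for _ in range(n_people):
        trait = rng.gauss(0, 1)
        target.append(trait + rng.gauss(0, 1))  # stands in for the original scale
        for col in items:
            col.append(0.3 * trait + rng.gauss(0, 1))
    return target, items


def sum_score(items, chosen, rows):
    """Unit-weighted sum of the chosen items for the given people."""
    return [sum(items[j][i] for j in chosen) for i in rows]


rng = random.Random(0)
target, items = simulate(n_people=200, n_items=200, rng=rng)
train, valid = range(0, 100), range(100, 200)

# Choose the 10 items that best predict the target in the training half only.
train_target = [target[i] for i in train]
item_r = [pearson([col[i] for i in train], train_target) for col in items]
chosen = sorted(range(len(items)), key=lambda j: item_r[j], reverse=True)[:10]

# Score the same 10-item scale in the data it was built on and in fresh data.
train_r = pearson(sum_score(items, chosen, train), train_target)
valid_r = pearson(sum_score(items, chosen, valid), [target[i] for i in valid])
print(f"same-data estimate: {train_r:.2f}   held-out estimate: {valid_r:.2f}")
```

Because all 200 simulated items are equally good by construction, whatever made the chosen ten look special in the training half is noise, and the held-out number is the one to believe.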

When you look at it this way, the comparison of the IPIP representation of the NEO-PI-R with the abbreviated representation of the NEO-PI-R I generated in my paper isn’t really a fair one, because the IPIP measure Lew Goldberg came up with wasn’t cross-validated. Lew simply took the ten items that most strongly predicted each NEO-PI-R scale and grouped them together (with some careful rational inspection and modification, to be sure). That doesn’t mean there’s anything wrong with the IPIP measures; I’ve used them on multiple occasions myself, and have no complaints. They’re perfectly good measures that I think stand in really well for the (proprietary) originals. My point is just that the convergent correlations reported on the IPIP website are likely to be somewhat inflated relative to the truth.

The nice thing is that we can directly compare the AMBI (the measure I developed in my paper) with the IPIP version of the NEO-PI-R on a level footing by looking at the convergent correlations for the AMBI using only the training data. If you look at the validation (i.e., unbiased) estimates for the AMBI, which is what Sanjay’s talking about here, the mean convergent correlation for the 30 scales of the NEO-PI-R is .63, which is indeed much lower than the .73 reported for the IPIP version of the NEO-PI-R. Personally I’d still probably argue that .63 with 108 items is better than .73 with 300 items, but it’s a subjective question, and I wouldn’t disagree with anyone who preferred the latter. But again, the critical point is that this isn’t a fair comparison. If you make a fair comparison and look at the mean convergent correlation in the training data, it’s .69 for the AMBI, which is much closer to the IPIP data. Given that the AMBI version is just over 1/3rd the length of the IPIP version, I think the choice here becomes more clear-cut, and I doubt that there are many contexts where the (mean) difference between .69 and .73 would have meaningful practical implications.

It’s also worth remembering that nothing says you have to go with the 108-item measure I reported in the paper. The beauty of the GA approach is that you can quite easily generate a NEO-PI-R analog of any length you like. So if your goal isn’t so much to abbreviate the NEO-PI-R as to obtain a non-proprietary analog (and indeed, the IPIP version of the NEO-PI-R is actually longer than the NEO-PI-R, which contains 240 items), I think there’s a very good chance you could do better than the IPIP measure using substantially fewer than 300 items (but more than 108).

In fact, if you really had a lot of time on your hands, and wanted to test this question more thoroughly, what I think you’d want to do is run the GA with systematically varying item costs (i.e., you run the exact same procedure on the same data, but change the itemCost parameter a little bit each time). That way, you could actually plot out a curve showing you the degree of convergence with the original measure as a function of the length of the new measure (this is functionality I’d like to add to the GA code I released when I have the time, but probably not in the near future). I don’t really know what the sweet spot would be, but I can tell you from extensive experimentation that you get diminishing returns pretty quickly. In other words, I just don’t think you’re going to be able to get convergent correlations much higher than .7 on average (this only holds for the IPIP data, obviously; you might do much better using data collected over shorter timespans, or using subsets of items from the original measures). So in that sense, I like where I ended up (i.e., 108 items that still recapture the original quite well).
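For anyone who wants to try the item-cost sweep, here is roughly what the loop could look like. To keep it short and self-contained, this sketch swaps the GA for a plain greedy forward selection on simulated data, and its stopping rule (add an item only if it raises the training correlation by more than item_cost) is an assumption made for the example, not the cost function from the released code. All names here are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def greedy_abbreviate(items, target, rows, item_cost):
    """Greedy stand-in for the GA: keep adding the most helpful item while
    it raises the training correlation by more than item_cost."""
    y = [target[i] for i in rows]
    chosen, totals, best_r = [], [0.0] * len(y), 0.0
    remaining = list(range(len(items)))
    while remaining:
        trials = {j: [t + items[j][i] for t, i in zip(totals, rows)]
                  for j in remaining}
        fits = {j: pearson(trials[j], y) for j in remaining}
        best = max(remaining, key=lambda j: fits[j])
        if fits[best] - best_r <= item_cost:
            break  # the next item is not worth its cost
        chosen.append(best)
        remaining.remove(best)
        totals, best_r = trials[best], fits[best]
    return chosen


# Hypothetical data: 40 weak indicators of one trait, 300 simulated people.
rng = random.Random(1)
n_people, n_items = 300, 40
target, items = [], [[] for _ in range(n_items)]
for _ in range(n_people):
    trait = rng.gauss(0, 1)
    target.append(trait + rng.gauss(0, 1))
    for col in items:
        col.append(0.3 * trait + rng.gauss(0, 1))
train, valid = range(0, 150), range(150, 300)
valid_target = [target[i] for i in valid]

# Same data, same procedure; only the per-item cost changes between runs.
curve = []
for item_cost in (0.005, 0.01, 0.02, 0.05):
    chosen = greedy_abbreviate(items, target, train, item_cost)
    scores = [sum(items[j][i] for j in chosen) for i in valid]
    held_out_r = pearson(scores, valid_target) if chosen else 0.0
    curve.append((item_cost, len(chosen), held_out_r))

for item_cost, length, held_out_r in curve:
    print(f"item_cost={item_cost:<6} items={length:<3} held-out r={held_out_r:.2f}")
```

Plotting the second column against the third gives the length-versus-convergence curve described above. Evaluating each run on the held-out half keeps the curve from being flattered by the selection step.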

3. Ultimately I’d like to see a few substantive studies that run the GA-shortened scales alongside the original scales. The column-vector correlations that he reported were hard to evaluate — I’d like to see the actual predictions of behavior, not just summaries. But this seems like a promising approach.

[BTW, that last sentence is the key one. I'm looking forward to seeing more of what you and others can do with this approach.]

When I was writing the paper, I did initially want to include a supplementary figure showing the full-blown matrix of traits predicting the low-level behaviors Sanjay is alluding to (which are part of Goldberg’s massive dataset), but it seemed kind of daunting to present because there are 60 behavioral variables, and most of the correlations were very weak (not just for the AMBI measure–I mean they were weak for the original NEO-PI-R). So you would be looking at a 30 x 60 matrix full of mostly near-zero correlations, which seemed pretty uninformative. So to answer basically the same concern, what I did instead was include a supplementary figure with a 30 x 5 matrix that captures the relation between the 30 facets of the NEO-PI-R and the Big Five as rated by participants’ peers (i.e., an independent measure of personality). Here’s that figure:

[Figure: correlations between the 30 NEO-PI-R facets and the peer-rated Big Five, for the AMBI version and for the original NEO-PI-R in the training and validation samples]

What I’m presenting is the same correlation matrix for three different versions of the NEO-PI-R: the AMBI version I generated (on the left), and the original (i.e., real) NEO-PI-R, for both the training and validation samples. The important point to note is that the pattern of correlations with an external set of criterion variables is very similar for all three measures. It isn’t identical of course, but you shouldn’t expect it to be. (In fact, if you look at the rightmost two columns, that gives you a sense of how you can get relatively different correlations even for exactly the same measure and subjects when the sample is randomly divided in two. That’s just sampling variability.) There are, in fairness, one or two blips where the AMBI version does something quite different (e.g., impulsiveness predicts peer-rated Conscientiousness for the AMBI version but not the other two). But overall, I feel pretty good about the AMBI measure when I look at this figure. I don’t think you’re losing very much in terms of predictive power or specificity, whereas I think you’re gaining a lot in time savings.

Having said all that, I couldn’t agree more with Sanjay’s final point, which is that the proof is really in the pudding (who came up with that expression? Bill Cosby?). I’ve learned the hard way that it’s really easy to come up with excellent theoretical and logical reasons for why something should or shouldn’t work, yet when you actually do the study to test your impeccable reasoning, the empirical results often surprise you, and then you’re forced to confront the reality that you’re actually quite dumb (and wrong). So it’s certainly possible that, for reasons I haven’t anticipated, something will go profoundly awry when people actually try to use these abbreviated measures in practice. And then I’ll have to delete this blog, change my name, and go into hiding. But I really don’t think that’s very likely. And I’m willing to stake a substantial chunk of my own time and energy on it (I’d gladly stake my reputation on it too, but I don’t really have one!); I’ve already started using these measures in my own studies–e.g., in a blogging study I’m conducting online here–with promising preliminary results. Ultimately, as with everything else, time will tell whether or not the effort is worth it.