Archive for the ‘methods’ Category

some thoughtful comments on automatic measure abbreviation

Thursday, April 1st, 2010

In the comments on my last post, Sanjay Srivastava had some excellent thoughts/concerns about the general approach of automating measure abbreviation using a genetic algorithm. They’re valid concerns that might come up for other people too, so I thought I’d discuss them here in more detail. Here’s Sanjay:

Lew Goldberg emailed me a copy of your paper a while back and asked what I thought of it. I’m pasting my response below — I’d be curious to hear your take on it. (In this email “he” is you and “you” is he because I was writing to Lew…)

::

1. So this is what it feels like to be replaced by a machine.

I’m not sure if Sanjay thinks this is a good or a bad thing? I guess my own feeling is that it’s a good thing to the extent that it makes personality measurement more efficient and frees researchers up to use that time (both during data collection and measure development) for other productive things like eating M&M’s on the couch and devising the most diabolically clever April Fool’s joke for next year to make up for the fact that you forgot to do it this year writing papers, and a bad one to the extent that people take this as a license to stop thinking carefully about what they’re doing when they’re shortening or administering questionnaire measures. But provided people retain a measure of skepticism and cautiousness in applying this type of approach, I’m optimistic that the result will be a large net gain.

2. The convergent correlations were a little low in studies 2 and 3. You’d expect shortened scales to have less reliability and validity, of course, but that didn’t go all the way in covering the difference. He explained that this was because the AMBI scales draw on a different item pool than the proprietary measures, which makes sense. wever, that makes it hard to evaluate the utility of the approach. If you compare how the full IPIP facet scales correlate with the proprietary NEO (which you’ve published here: http://ipip.ori.org/newNEO_FacetsTable.htm) against his Table 2, for example, it looks like the shortening algorithm is losing some information. Whether that’s better or worse than a rationally shortened scale is hard to say.

This is an excellent point, and I do want to reiterate that the abbreviation process isn’t magic; you can’t get something for free, and you’re almost invariably going to lose some fidelity in your measurement when you shorten any measure. That said, I actually feel pretty good about the degree of convergence I report in the paper. Sanjay already mentions one reason the convergent correlations seem lower than what you might expect: the new measures are composed of  different items than the old ones, so they’re not going to share many of the same sources of error. That means the convergent correlations will necessarily be lower, but isn’t necessarily a problem in a broader sense. But I think there are also two other, arguably more important, reasons why the convergence might seem deceptively low.

One is that the degree of convergence is bounded by the test-retest reliability of the original measures. Because the items in the IPIP pools were administered in batches spanning about a decade, whereas each of the proprietary measures (e.g., the NEO-PI-R) were administered on one occasion, the net result is that many of the items being used to predict personality traits were actually filled out several years before or after the personality measures in question. If you look at the long-term test-retest reliability of some of the measures I abbreviated (and there actually isn’t all that much test-retest data of that sort out there), it’s not clear that it’s much higher than what I report, even for the original measures. In other words, if you don’t generally see test-retest correlations across several years greater than .6 – .8 for the real NEO-PI-R scales, you can’t really expect to do any better with an abbreviated measure. But that probably says more about the reliability of narrowly-defined personality traits than about the abbreviation process.

The other reason the convergent correlations seem lower than you might expect, which I actually think is the big one, is that I reported only the cross-validated coefficients in the paper. In other words, I used only half of the data to abbreviate measures like the NEO-PI-R and HEXACO-PI, and then used the other half to obtain unbiased estimates of the true degree of convergence. This is technically the right way to do things, because if you don’t cross-validate, you’re inevitably going to capitalize on chance. If you use fit a model to a particular set of data, and then use the very same data to ask the question “how well does the model fit the data?” you’re essentially cheating–or, to put it more mildly, your estimates are going to be decidedly “optimistic”. You could argue it’s a relatively benign kind of cheating, because almost everyone does it, but that doesn’t make it okay from a technical standpoint.

When you look at it this way, the comparison of the IPIP representation of the NEO-PI-R with the abbreviated representation of the NEO-PI-R I generated in my paper isn’t really a fair one, because the IPIP measure Lew Goldberg came up with wasn’t cross-validated. Lew simply took the ten items that most strongly predicted each NEO-PI-R scale and grouped them together (with some careful rational inspection and modification, to be sure). That doesn’t mean there’s anything wrong with the IPIP measures; I’ve used them on multiple occasions myself, and have no complaints. They’re perfectly good measures that I think stand in really well for the (proprietary) originals. My point is just that the convergent correlations reported on the IPIP website are likely to be somewhat inflated relative to the truth.

The nice thing is that we can directly compare the AMBI (the measure I developed in my paper) with the IPIP version of the NEO-PI-R on a level footing by looking at the convergent correlations for the AMBI using only the training data. If you look at the validation (i.e., unbiased) estimates for the AMBI, which is what Sanjay’s talking about here, the mean convergent correlation for the 30 scales of the NEO-PI-R is .63, which is indeed much lower than the .73 reported for the IPIP version of the NEO-PI-R. Personally I’d still probably argue that .63 with 108 items is better than .73 with 300 items, but it’s a subjective question, and I wouldn’t disagree with anyone who preferred the latter. But again, the critical point is that this isn’t a fair comparison. If you make a fair comparison and look at the mean convergent correlation in the training data, it’s .69 for the AMBI, which is much closer to the IPIP data. Given that the AMBI version is just over 1/3rd the length of the IPIP version, I think the choice here becomes more clear-cut, and I doubt that there are many contexts where the (mean) difference between .69 and .73 would have meaningful practical implications.

It’s also worth remembering that nothing says you have to go with the 108-item measure I reported in the paper. The beauty of the GA approach is that you can quite easily generate a NEO-PI-R analog of any length you like. So if your goal isn’t so much to abbreviate the NEO-PI-R as to obtain a non-proprietary analog (and indeed, the IPIP version of the NEO-PI-R is actually longer than the NEO-PI-R, which contains 240 items), I think there’s a very good chance you could do better than the IPIP measure using substantially fewer than 300 items (but more than 108).

In fact, if you really had a lot of time on your hands, and wanted to test this question more thoroughly, what I think you’d want to do is run the GA with systematically varying item costs (i.e., you run the exact same procedure on the same data, but change the itemCost parameter a little bit each time). That way, you could actually plot out a curve showing you the degree of convergence with the original measure as a function of the length of the new measure (this is functionality I’d like to add to the GA code I released when I have the time, but probably not in the near future). I don’t really know what the sweet spot would be, but I can tell you from extensive experimentation that you get diminishing returns pretty quickly. In other words, I just don’t think you’re going to be able to get convergent correlations much higher than .7 on average (this only holds for the IPIP data, obviously; you might do much better using data collected over shorter timespans, or using subsets of items from the original measures). So in that sense, I like where I ended up (i.e., 108 items that still recapture the original quite well).

3. Ultimately I’d like to see a few substantive studies that run the GA-shortened scales alongside the original scales. The column-vector correlations that he reported were hard to evaluate — I’d like to see the actual predictions of behavior, not just summaries. But this seems like a promising approach.

[BTW, that last sentence is the key one. I'm looking forward to seeing more of what you and others can do with this approach.]

When I was writing the paper, I did initially want to include a supplementary figure showing the full-blown matrix of traits predicting the low-level behaviors Sanjay is alluding to (which are part of Goldberg’s massive dataset), but it seemed kind of daunting to present because there are 60 behavioral variables, and most of the correlations were very weak (not just for the AMBI measure–I mean they were weak for the original NEO-PI-R). So you would be looking at a 30 x 60 matrix full of mostly near-zero correlations, which seemed pretty uninformative. So to answer basically the same concern, what I did instead was show a supplementary figure showing a 30 x 5 matrix that captures the relation between the 30 facets of the NEO-PI-R and the Big Five as rated by participants’ peers (i.e., an independent measure of personality). Here’s that figure (click to enlarge):

big_five_peer

What I’m presenting is the same correlation matrix for three different versions of the NEO-PI-R: the AMBI version I generated (on the left), and the original (i.e., real) NEO-PI-R, for both the training and validation samples. The important point to note is that the pattern of correlations with an external set of criterion variables is very similar for all three measures. It isn’t identical of course, but you shouldn’t expect it to be. (In fact, if you look at the rightmost two columns, that gives you a sense of how you can get relatively different correlations even for exactly the same measure and subjects when the sample is randomly divided in two. That’s just sampling variability.) There are, in fairness, one or two blips where the AMBI version does something quite different (e..g, impulsiveness predicts peer-rated Conscientiousness for the AMBI version but not the other two). But overall, I feel pretty good about the AMBI measure when I look at this figure. I don’t think you’re losing very much in terms of predictive power or specificity, whereas I think you’re gaining a lot in time savings.

Having said all that, I couldn’t agree more with Sanjay’s final point, which is that the proof is really in the pudding (who came up with that expression? Bill Cosby?). I’ve learned the hard way that it’s really easy to come up with excellent theoretical and logical reasons for why something should or shouldn’t work, yet when you actually do the study to test your impeccable reasoning, the empirical results often surprise you, and then you’re forced to confront the reality that you’re actually quite dumb (and wrong). So it’s certainly possible that, for reasons I haven’t anticipated, something will go profoundly awry when people actually try to use these abbreviated measures in practice. And then I’ll have to delete this blog, change my name, and go into hiding. But I really don’t think that’s very likely. And I’m willing to stake a substantial chunk of my own time and energy on it (I’d gladly stake my reputation on it too, but I don’t really have one!); I’ve already started using these measures in my own studies–e.g., in a blogging study I’m conducting online here–with promising preliminary results. Ultimately, as with everything else, time will tell whether or not the effort is worth it.

abbreviating personality measures in R: a tutorial

Wednesday, March 31st, 2010

A while back I blogged about a paper I wrote that uses genetic algorithms to abbreviate personality measures with minimal human intervention. In the paper, I promised to put the R code I used online, so that other people could download and use it. I put off doing that for a long time, because the code was pretty much spaghetti by the time the paper got accepted, and there are any number of things I’d rather do than spend a weekend rewriting my own code. But one of the unfortunate things about publicly saying that you’re going to do something is that you eventually have to do that something. So, since the paper was published in JRP last week, and several people have emailed me to ask for the code, I spent much of the weekend making the code presentable. It’s not a fully-formed R package yet, but it’s mostly legible, and seems to work more or less ok. You can download the file (gaabbreviate.R) here. The rest of this (very long) post is basically a tutorial on how to use the code, so you probably want to stop reading this now unless you have a burning interest in personality measurement.

Prerequisites and installation

Although you won’t need to know much R to follow this tutorial, you will need to have R installed on your system. Fortunately, R is freely available for all major operating systems. You’ll also need the genalg and psych packages for R, because gaabbreviate won’t run without them. Once you have R installed, you can download and install those packages like so:

install.packages(c(‘genalg’, ‘psych’))

Once that’s all done, you’re ready to load gaabbreviate.R:

source(“/path/to/the/file/gaabbreviate.R”)

…where you make sure to specify the right path to the location where you saved the file. And that’s it! Now you’re ready to abbreviate measures.

Reading in data

The file contains several interrelated functions, but the workhorse is gaa.abbreviate(), which takes a set of item scores and scale scores for a given personality measure as input and produces an abbreviated version of the measure, along with a bunch of other useful information. In theory, you can go from old data to new measure in a single line of R code, with almost no knowledge of R required (though I think it’s a much better idea to do it step-by-step and inspect the results at every stage to make sure you know what’s going on).

The abbreviation function is pretty particular about the format of the input it expects. It takes two separate matrices, one with item scores, the other with scale scores (a scale here just refers to any set of one or more items used to generate a composite score). Subjects are in rows, item or scale scores are in columns. So for example, let’s say you have data from 3 subjects, who filled out a personality measure that has two separate scales, each composed of two items. Your item score matrix might look like this:

3 5 1 1

2 2 4 1

2 4 5 5

…which you could assign in R like so:

items = matrix(c(3,2,2,5,2,2,1,4,5,1,1,5), ncol=3)

I.e., the first subject had scores of 3, 5, 1, and 1 on the four items, respectively; the second subject had scores of 2, 2, 4, and 1… and so on.

Based on the above, if you assume items 1 and 2 constitute one scale, and items 3 and 4 constitute the other, the scale score matrix would be:

8 2

4 5

6 10

Of course, real data will probably have hundreds of subjects, dozens of items, and a bunch of different scales, but that’s the basic format. Assuming you can get your data into an R matrix or data frame, you can feed it directly to gaa.abbreviate() and it will hopefully crunch your data without complaining. But if you don’t want to import your data into R before passing it to the code, you can also pass filenames as arguments instead of matrices. For example:

gaa = gaa.abbreviate(items=”someFileWithItemScores.txt”, scales=”someFileWithScaleScores.txt”, iters=100)

If you pass files instead of data, the referenced text files must be tab-delimited, with subjects in rows, item/scale scores in columns, and a header row that gives the names of the columns (i.e., item names and scale names; these can just be numbers if you like, but they have to be there). Subject identifiers should not be in the files.

Key parameters: stuff you should set every time

Assuming you can get gaabbreviate to read in your data, you can then set about getting it to abbreviate your measure by selecting a subset of items that retain as much of the variance in the original scales as possible. There are a few parameters you’ll need to set; some are mandatory, others aren’t, but should really be specified anyway since the defaults aren’t likely to work well for different applications.

The most important (and mandatory) argument is iters, which is the number of iterations you want the GA to run for. If you pick too high a number, the GA may take a very long time to run if you have a very long measure; if you pick too low a number, you’re going to get a crappy solution. I think iters=100 is a reasonable place to start, though in practice, obtaining a stable solution tends to require several hundred iterations. The good news (which I cover in more detail below) is that you can take the output you get from the abbreviation function and feed it right back in as many times as you want, so it’s not like you need to choose the number of iterations carefully or anything.

The other two key parameters are itemCost and maxItems. The itemCost is what determines the degree to which your measure is compressed. If you want a detailed explanation of how this works, see the definition of the cost function in the paper. Very briefly, the GA tries to optimize the trade-off between number of items and amount of variance explained. Generally speaking, the point of abbreviating a measure is to maximize the amount of explained variance (in the original scale scores) while minimizing the number of items retained. Unfortunately, you can’t do both very well at the same time, because any time you drop an item, you’re also losing its variance. So the trick is to pick a reasonable compromise: a measure that’s relatively short and still does a decent job recapturing the original. The itemCost parameter is what determines the length of that measure. When you set it high, the GA will place a premium on brevity, resulting in a shorter (but less accurate) measure; when you set it low, it’ll allow a longer measure that maximizes fidelity. The optimal itemCost will vary depending on your data, but I find 0.05 is a good place to start, and then you can tweak it to get measures with more or fewer items as you see fit.

The maxItems parameter sets the upper bound on the number of items that will be used to score each scale. The default is 5, but you may find this number too small if you’re trying to abbreviate scales comprised of a large number of items. Again, it’s worth playing around with this to see what happens. Generally speaks, the same trade-off between brevity and fidelity discussed above holds here too.

Given reasonable values for the above arguments, you should be able to feed in raw data and get out an abbreviated measure with minimal work. Assuming you’re reading your data from a file, the entire stream can be as simple as:

gaa = gaa.abbreviate(items=”someFileWithItemScores.txt”, scales=”someFileWithScaleScores.txt”, iters=100, itemCost=0.05, maxItems=5, writeFile=’outputfile.txt’)

That’s it! Assuming your data are in the correct format (and if they’re not, the script will probably crash with a nasty error message), gaabbreviate will do its thing and produce your new, shorter measure within a few minutes or hours, depending on the size of the initial measure. The writeFile argument is optional, and gives the name of an output file you want the measure saved to. If you don’t specify it, the output will be assigned to the gaa object in the above call (note the “gaa = ” part of the call), but won’t be written to file. But that’s not a problem, because you can always achieve the same effect later by calling the gaa.writeMeasure function (e.g., in the above example, gaa.writeMeasure(gaa, file=”outputfile.txt”) would achieve exactly the same thing).

Other important options

Although you don’t really need to do anything else to produce abbreviated measures, I strongly recommend reading the rest of this document and exploring some of the other options if you’re planning to use the code, because some features are non-obvious. Also, the code isn’t foolproof, and it can do weird things with your data if you’re not paying attention. For one thing, by default, gaabbreviate will choke on missing values (i.e., NAs). You can do two things to get around this: either enable pairwise processing (pairwise=T), or turn on mean imputation (impute=T). I say you can do these things, but I strongly recommend against using either option. If you have missing values in your data, it’s really a much better idea to figure out how to deal with those missing values before you run the abbreviation function, because the abbreviation function is dumb, and it isn’t going to tell you whether pairwise analysis or imputation is a sensible thing to do. For example, if you have 100 subjects with varying degrees of missing data, and only have, say, 20 subjects’ scores for some scales, the resulting abbreviated measure is going to be based on only 20 subjects’ worth of data for some scales if you turn pairwise processing on. Similarly, imputing the mean for missing values is a pretty crude way to handle missing data, and I only put it in so that people who just wanted to experiment with the code wouldn’t have to go to the trouble of doing it themselves. But in general, you’re much better off reading your item and scale scores into R (or SPSS, or any other package), processing any missing values in some reasonable way, and then feeding gaabbreviate the processed data.

Another important point to note is that, by default, gaabbreviate will cross-validate its results. What that means is that only half of your data will be used to generate an abbreviated measure; the other half will be used to provide unbiased estimates of how well the abbreviation process worked. There’s an obvious trade-off here. If you use the split-half cross-validation approach, you’re going to get more accurate estimates of how well the abbreviation process is really working, but the fit itself might be slightly poorer because you have less data. Conversely, if you turn cross-validation off (crossVal=F), you’re going to be using all of your data in the abbreviation process, but the resulting estimates of the quality of the solution will inevitably be biased because you’re going to be capitalizing on chance to some extent.

In practice, I recommend always leaving cross-validation enabled, unless you either (a) really don’t care about quality control (which makes you a bad person), or (b) have a very small sample size, and can’t afford to leave out half of the data in the abbreviation process (in which case you should consider collecting more data). My experience has been that with 200+ subjects, you generally tend to see stable solutions even when leaving cross-validation on, though that’s really just a crude rule of thumb that I’m pulling out of my ass, and larger samples are always better.

Other less important options

There are a bunch other less important options that I won’t cover in any detail here, but that are reasonably well-covered in the comments in the source file if you’re so inclined. Some of these are used to control the genetic algorithm used in the abbreviation process. The gaa.abbreviate function doesn’t actually do the heavy lifting itself; instead, it relies on the genalg library to run the actual genetic algorithm. Although the default genalg parameters will work fine 95% of the time, if you really want to manually set the size of the population or the ratio of initial zeros to ones, you can pass those arguments directly. But there’s relatively little reason to play with these parameters, because you can always achieve more or less the same ends simply by adding iterations.

Two other potentially useful options I won’t touch on, though they’re there if you want them, give you the ability to (a) set a minimum bound on the correlation required in order for an item to be included in the scoring equation for a scale (the minR argument), and (b) apply non-unit weightings to the scales (the sWeights argument), in cases where you want to emphasize some scales at the cost of others (i.e., because you want to measure some scales more accurately).

Two examples

The following two examples assume you’re feeding in item and scale matrices named myItems and myScales, respectively:

example1

This will run a genetic algorithm for 500 generations on mean-imputed data with cross-validation turned off, and assign the result to a variable named my.new.shorter.measure. It will probably produce an only slightly shorter measure, because the itemCost is low and up to 10 items are allowed to load on each scale.

example2

This will run 100 iterations with cross-validation enabled (the default, so we don’t need to specify it explicitly) and write the result to a file named shortMeasure.txt. It’ll probably produce a highly abbreviated measure, because the itemCost is relatively high. It also assigns more weight (twice as much, in fact) to the fourth and fifth scales in the measure relative to the first three, as reflected in the sWeights argument (a vector where the ith element indicates the weight of the ith scale in the measure, so presumably there are five scales in this case).

The gaa object

Assuming you’ve read this far, you’re probably wondering what you get for your trouble once you’ve run the abbreviation function. The answer is that you get… a gaa (which stands for GA Abbreviate) object. The gaa object contains almost all the information that was used at any point in the processing, which you can peruse at your leisure. If you’re familiar with R, you’ll know that you can see what’s in the object with the attributes function. For example, if you assigned the result of the abbreviation function to a variable named ‘myMeasure’, here’s what you’d see:

attributes

The gaa object has several internal lists (data, settings, results, etc.), each of which in turn contains several other variables. I’ve tried to give these sensible names. In brief:

  • data contains all the data used to create the measure (i.e., the item and scale scores you fed in)
  • settings contains all the arguments you specified when you called the abbreviation function (e.g., iters, maxItems, etc.)
  • results contains variables summarizing the results of the GA run, including information about each previous iteration of the GA
  • best contains information about the single best measure produced (this is generally not useful, and is for internal purposes)
  • rbga is the rbga.bin object produced by the genetic library (for more information, see the genalg library documentation)
  • measure is what you’ll probably find most important, as it contains the details of the final measure that was produced

To see the contents of each of these lists in turn, you can easily inspect them:

measure

So the ‘measure’ attribute in the gaa object contains a bunch of other variables with information about the resulting measure. And here’s a brief summary:

  • items: a vector containing the numerical ID of items retained in the final measure relative to the original list (e.g., if you fed in 100 items, and the ‘items’ variable contained the numbers 4, 10, 14… that’s the GA telling you that it decided to keep items no. 4, 10, 14, etc., from the original set of 100 items).
  • nItems: the number of items in the final measure.
  • key: a scoring key for the new measure, where the rows are items on the new measure, and the columns are the scales. This key is compatible with score.items() in Bill Revelle’s excellent psych package, which means that once you’ve got the key, you can automatically score data for the new measure simply by calling score.items() (see the documentation for more details), and don’t need to do any manual calculating or programming yourself.
  • ccTraining and ccValidation: convergent correlations for the training and validation halves of the data, respectively. The convergent correlation is the correlation between the new scale scores (i.e., those that you get using the newly-generated measure) and the “real” scale scores. The ith element in the vector gives you the convergent correlation for the ith scale in your original measure. The validation coefficients will almost invariably be lower than the training coefficients, and the validation numbers are the ones you should trust as an unbiased estimate of the quality of the measure.
  • alpha: coefficient alpha for each scale. Note that you should expect to get lower internal consistency estimates for GA-produced measures than you’re used to, and this is actually a good thing. If you want to know why, read the discussion in the paper.
  • nScaleItems: a vector containing the number of items used to score each scale. If you left minR set to 0, this will always be identical to maxItems for all items. If you raised minR, the number of items will sometimes be lower (i.e., in cases where there were very few items that showed a strong enough correlation to be retained).

Just give me the measure already!

Supposing you’re not really interested in plumbing the depths of the gaa object or working within R more than is necessary, you might just be wondering what the quickest way to get an abbreviated measure you can work with is. In that case, all you really need to do is pass a filename in the writeFile argument when you call gaa.abbreviate (see the examples given above), and you’ll get out a plain text file that contains all the essential details of the new measure. Specifically you’ll get (a) a mapping from old items to new, so that you can figure out which items are included in the new measure (e.g., a line like “4 45″ means that the 4th item on the new measure is no. 45 in the original set of items), and (b) a human-readable scoring key for each scale (the only thing to note here is that an “R” next to an item indicates the item is reverse-keyed), along with key statistics (coefficient alpha and convergent correlations for the training and validation halves). So if all goes well, you really won’t need to do anything else in R beyond call that one line that makes the measure. But again, I’d strongly encourage you to carefully inspect the gaa object in R to make sure everything looks right. The fact that the abbreviation process is fully automated isn’t a reason to completely suspend all rational criteria you’d normally use when developing a scale; it just means you probably have to do substantially less work to get a measure you’re happy with.

Killing time…

Depending on how big your dataset is (actually, mainly the number of items in the original measure), how many iterations you’ve requested, and how fast your computer is, you could be waiting a long time for the abbreviation function to finish its work. Because you probably want to know what the hell is going on internally during that time, I’ve provided a rudimentary monitoring display that will show you the current state of the genetic algorithm after every iteration. It looks like this (click for a larger version of the image):

display

This is admittedly a pretty confusing display, and Edward Tufte would probably murder several kittens if he saw it, but it’s not supposed to be a work of art, just to provide some basic information while you’re sitting there twiddling your thumbs (ok, ok, I promise I’ll label the panels better when I have the time to work on it). But basically, it shows you three things. The leftmost three panels show you the basic information about the best measure produced by the GA as it evolves across generations. Respectively, the top, middle,and bottom panels show you the total cost, measure length, and mean variance explained (R^2) as a function of iteration. The total cost can only ever go down, but the length and R^2 can go up or down (though there will tend to be a consistent trajectory for measure length that depends largely on what itemCost you specified).

The middle panel shows you detailed information about how well the GA-produced measure captures variance in each of the scales in the original measure. In this case, I’m abbreviating the 30 facets of the NEO-PI-R. The red dot displays the amount of variance explained in each trait, as of the current iteration.

Finally, the rightmost panel shows you a graphical representation of which items are included in the best measure identified by the GA at each iteration.Each row represents one iteration (i.e., you’re seeing the display as it appears after 200 iterations of a 250-iteration run); black bars represent items that weren’t included, white bars represent items that were included. The point of this display isn’t to actually tell you which items are being kept (you can’t possibly glean that level of information at this resolution), but rather, to give you a sense of how stable the solution is. If you look at the the first few (i.e., topmost) iterations, you’ll see that the solution is very unstable: the GA is choosing very different items as the “best” measure on each iteration. But after a while, as the GA “settles” into a neighborhood, the solution stabilizes and you see only relatively small (though still meaningful) changes from generation to generation. Basically, once the line in the top left panel (total cost) has asymptoted, and the solution in the rightmost panel is no longer changing much if at all, you know that you’ve probably arrived at as good a solution as you’re going to get.

Incidentally, if you use the generic plot() method on a completed gaa object (e.g., plot(myMeasure)), you’ll get exactly the same figure you see here, with the exception that the middle figure will also have black points plotted alongside the red ones.  The black points show you the amount of variance explained in each trait for the cross-validated results. If you’re lucky, the red and black points will be almost on top of each other; if you’re not, the black ones will be considerably to the left of the red ones .

Consider recycling

The last thing I’ll mention, which I already alluded to earlier, is that you can recycle gaa objects. That’s to say, suppose you ran the abbreviation for 100 iterations, only to get back a solution that’s still clearly suboptimal (i.e., the cost function is still dropping rapidly). Rather than having to start all over again, you can simply feed the gaa object back into the abbreviation function in order to run further iterations. And you don’t need to specify any additional parameters (assuming you want to run the same number of iterations you did last time; otherwise you’ll need to specify iters); all of the settings are contained within the gaa object itself. So, assuming you ran the abbreviation function and stored the result in ‘myMeasure’, you can simply do:

myMeasure = gaa.abbreviate(myMeasure, iters=200)

and you’ll get an updated version of the measure that’s had the benefit of an extra 200 iterations. And of course, you can save and load R objects to/from files, so that you don’t need to worry about all of your work disappearing next time you start R. So save(myMeasure, ‘filename.txt’) will save your gaa object for future use, and the next time you need it, you can call myMeasure = load(‘filename.txt’) to get it back (alternatively, you can just save the entire workspace).

Anyway, I think that covers all of the important stuff. There are a few other things I haven’t documented here, but if you’ve read this far, and have gotten the code to work in R, you should be able to start abbreviating your own measures relatively painlessly. If you do use the code to generate shorter measures, and end up with measures you’re happy with, I’d love to hear about it. And if you can’t get the code to work, or can get it to work but are finding issues with the code or the results, I guess I’ll grudgingly accept those emails too. In general, I’m happy to provide support for the code via email provided I have the time. The caveat is that, if you’re new to R, and are having problems with basic things like installing packages or loading files from source, you should really read a tutorial or reference that introduces you to R (Quick-R is my favorite place to start) before emailing me with problems. But if you’re having problems that are specific to the gaabbreviate code (e.g., you’re getting a weird error message, or aren’t sure what something means), feel free to drop me a line and I’ll try to respond as soon as I can.

green chile muffins and brains in a truck: weekend in albuquerque

Monday, March 22nd, 2010

I spent the better part of last week in Albuquerque for the Mind Research Network fMRI course. It’s a really well-organized 3-day course, and while it’s geared toward people without much background in fMRI, I found a lot of the lectures really helpful. It’s hard impossible to get everything right when you run an fMRI study; the magnet is very fickle and doesn’t like to do what you ask it to–and that assumes you’re asking it to do the right thing, which is also not so common. So I find I learn something interesting from almost every fMRI talk I attend, even when it’s stuff I thought I already knew.

Of course, since I know very little, there’s also almost always stuff that’s completely new to me. In this case, it was a series of lectures on independent components analysis (ICA) of fMRI data, focusing on Vince Calhoun‘s group’s implementation of ICA in the GIFT toolbox. It’s a beautifully implemented set of tools that offer a really powerful alternative to standard univariate analysis, and I’m pretty sure I’ll be using it regularly from now on. So the ICA lectures alone were worth the price of admission. (In the interest of full disclosure, I should note that my post-doc mentor, Tor Wager, is one of the organizers of the MRN course, and I wasn’t paying the $700 tab out of pocket. But I’m not getting any kickbacks to say nice things about the course, I promise.)

Between the lectures and the green chile corn muffins, I didn’t get to see much of Albuquerque (except from the air, where the urban sprawl makes the city seem much larger than its actual population of 800k people would suggest), so I’ll reserve judgment for another time. But the MRN itself is a pretty spectacular facility. Aside from a 3T Siemens Trio magnet, they also have a 1.5T mobile scanner built into a truck. It’s mostly used to scan inmates in the New Mexico prison system (you’ll probably be surprised to learn that they don’t let hardened criminals out of jail to participate in scientific experiments–so the scanner has to go to jail instead). We got a brief tour of the mobile scanner and it was pretty awesome. Which is to say, it beats the pants off my Honda.

There are also some parts of the course I don’t remember so well. Here’s a (blurry) summary of those parts, courtesy of Alex Shackman:

Scott, Tor, and me in Albuquerque

BlurryScott, BlurryTor, and BlurryTal: The Boulder branch of the lab, Albuquerque 2010 edition

functional MRI and the many varieties of reliability

Friday, March 5th, 2010

Craig Bennett and Mike Miller have a new paper on the reliability of fMRI. It’s a nice review that I think most people who work with fMRI will want to read. Bennett and Miller discuss a number of issues related to reliability, including why we should care about the reliability of fMRI, what factors influence reliability, how to obtain estimates of fMRI reliability, and what previous studies suggest about the reliability of fMRI. Their bottom line is that the reliability of fMRI often leaves something to be desired:

One thing is abundantly clear: fMRI is an effective research tool that has opened broad new horizons of investigation to scientists around the world. However, the results from fMRI research may be somewhat less reliable than many researchers implicitly believe. While it may be frustrating to know that fMRI results are not perfectly replicable, it is beneficial to take a longer-term view regarding the scientific impact of these studies. In neuroimaging, as in other scientific fields, errors will be made and some results will not replicate.

I think this is a wholly appropriate conclusion, and strongly recommend reading the entire article. Because there’s already a nice write-up of the paper over at Mind Hacks, I’ll content myself to adding a number of points to B&M’s discussion (I talk about some of these same issues in a chapter I wrote with Todd Braver).

First, even though I agree enthusiastically with the gist of B&M’s conclusion, it’s worth noting that, strictly speaking, there’s actually no such thing as “the reliability of fMRI”. Reliability isn’t a property of a technique or instrument, it’s a property of a specific measurement. Because every measurement is made under slightly different conditions, reliability will inevitably vary on a case-by-case basis. But since it’s not really practical (or even possible) to estimate reliability for every single analysis, researchers take necessary short-cuts. The standard in the psychometric literature is to establish reliability on a per-measure (not per-method!) basis, so long as conditions don’t vary too dramatically across samples. For example, once someone “validates” a given self-report measure, it’s generally taken for granted that that measure is “reliable”, and most people feel comfortable administering it to new samples without having to go to the trouble of estimating reliability themselves. That’s a perfectly reasonable approach, but the critical point is that it’s done on a relatively specific basis. Supposing you made up a new self-report measure of depression from a set of items you cobbled together yourself, you wouldn’t be entitled to conclude that your measure was reliable simply because some other self-report measure of depression had already been psychometrically validated. You’d be using an entirely new set of items, so you’d have to go to the trouble of validating your instrument anew.

By the same token, the reliability of any given fMRI measurement is going to fluctuate wildly depending on the task used, the timing of events, and many other factors. That’s not just because some estimates of reliability are better than others; it’s because there just isn’t a fact of the matter about what the “true” reliability of fMRI is. Rather, there are facts about how reliable fMRI is for specific types of tasks with specific acquisition parameters and preprocessing streams in specific scanners, and so on (which can then be summarized by talking about the general distribution of fMRI reliabilities). B&M are well aware of this point, and discuss it in some detail, but I think it’s worth emphasizing that when they say that “the results from fMRI research may be somewhat less reliable than many researchers implicitly believe,” what they mean isn’t that the “true” reliability of fMRI is likely to be around .5; rather, it’s that if you look at reliability estimates across a bunch of different studies and analyses, the estimated reliability is often low. But it’s not really possible to generalize from this overall estimate to any particular study; ultimately, if you want to know whether your data were measured reliably, you need to quantify that yourself. So the take-away message shouldn’t be that fMRI is an inherently unreliable method (and I really hope that isn’t how B&M’s findings get reported by the mainstream media should they get picked up), but rather, that there’s a very good chance that the reliability of fMRI in any given situation is not particularly high. It’s a subtle difference, but an important one.

Second, there’s a common misconception that reliability estimates impose an upper bound on the true detectable effect size. B&M make this point in their review, Vul et al made it in their “voodoo correlations”" paper, and in fact, I’ve made it myself before. But it’s actually not quite correct. It’s true that, for any given test, the true reliability of the variables involved limits the potential size of the true effect. But there are many different types of reliability, and most will generally only be appropriate and informative for a subset of statistical procedures. Virtually all types of reliability estimate will underestimate the true reliability in some cases and overestimate it in others. And in extreme cases, there may be close to zero relationship between the estimate and the truth.

To see this, take the following example, which focuses on internal consistency. Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable. This problem of underestimating reliability when items are heterogeneous has been discussed in the psychometric literature for at least 70 years, and yet you still very commonly see people do questionable things like “correcting for attenuation” based on dubious internal consistency estimates.

In their review, B&M mostly focus on test-retest reliability rather than internal consistency, but the same general point applies. Test-retest reliability is the degree to which people’s scores on some variable are consistent across multiple testing occasions. The intuition is that, if the rank-ordering of scores varies substantially across occasions (e.g., if the people who show the highest activation of visual cortex at Time 1 aren’t the same ones who show the highest activation at Time 2), the measurement must not have been reliable, so you can’t trust any effects that are larger than the estimated test-retest reliability coefficient. The problem with this intuition is that there can be any number of systematic yet session-specific influences on a person’s score on some variable (e.g., activation level). For example, let’s say you’re doing a study looking at the relation between performance on a difficult working memory task and frontoparietal activation during the same task. Suppose you do the exact same experiment with the same subjects on two separate occasions three weeks apart, and it turns out that the correlation between DLPFC activation across the two occasions is only .3. A simplistic view would be that this means that the reliability of DLPFC activation is only .3, so you couldn’t possibly detect any correlations between performance level and activation greater than .3 in DLPFC. But that’s simply not true. It could, for example, be that the DLPFC response during WM performance is perfectly reliable, but is heavily dependent on session-specific factors such as baseline fatigue levels, motivation, and so on. In other words, there might be a very strong and perfectly “real” correlation between WM performance and DLPFC activation on each of the two testing occasions, even though there’s very little consistency across the two occasions. Test-retest reliability estimates only tell you how much of the signal is reliably due to temporally stable variables, and not how much of the signal is reliable, period.

The general point is that you can’t just report any estimate of reliability that you like (or that’s easy to calculate) and assume that tells you anything meaningful about the likelihood of your analyses succeeding. You have to think hard about exactly what kind of reliability you care about, and then come up with an estimate to match that. There’s a reasonable argument to be made that most of the estimates of fMRI reliability reported to date are actually not all that relevant to many people’s analyses, because the majority of reliability analyses have focused on test-retest reliability, which is only an appropriate way to estimate reliability if you’re trying to relate fMRI activation to stable trait measures (e.g., personality or cognitive ability). If you’re interested in relating in-scanner task performance or state-dependent variables (e.g., mood) to brain activation (arguably the more common approach), or if you’re conducting within-subject analyses that focus on comparisons between conditions, using test-retest reliability isn’t particularly informative, and you really need to focus on other types of reliability (or reproducibility).

Third, and related to the above point, between-subject and within-subject reliability are often in statistical tension with one another. B&M don’t talk about this, as far as I can tell, but it’s an important point to remember when designing studies and/or conducting analyses. Essentially, the issue is that what counts as error depends on what effects you’re interested in. If you’re interested in individual differences, it’s within-subject variance that counts as error, so you want to minimize that. Conversely, if you’re interested in within-subject effects (the norm in fMRI), you want to minimize between-subject variance. But you generally can’t do both of these at the same time. If you use a very “strong” experimental manipulation (i.e., a task that produces a very large difference between conditions for virtually all subjects), you’re going to reduce the variability between individuals, and you may very well end up with very low test-retest reliability estimates. And that would actually be a good thing! Conversely, if you use a “weak” experimental manipulation, you might get no mean effect at all, because there’ll be much more variability between individuals. There’s no right or wrong here; the trick is to pick a design that matches the focus of your study. In the context of reliability, the essential point is that if all you’re interested in is the contrast between high and low working memory load, it shouldn’t necessarily bother you if someone tells you that the test-retest reliability of induced activation in your study is close to zero. Conversely, if you care about individual differences, it shouldn’t worry you if activations aren’t reproducible across studies at the group level. In some ways, those are actual the ideal situations for each of those two types of studies.

Lastly, B&M raise a question as to what level of reliability we should consider “acceptable” for fMRI research:

There is no consensus value regarding what constitutes an acceptable level of reliability in fMRI. Is an ICC value of 0.50 enough? Should studies be required to achieve an ICC of 0.70? All of the studies in the review simply reported what the reliability values were. Few studies proposed any kind of criteria to be considered a ‘reliable’ result. Cicchetti and Sparrow did propose some qualitative descriptions of data based on the ICC-derived reliability of results (1981). They proposed that results with an ICC above 0.75 be considered ‘excellent’, results between 0.59 and 0.75 be considered ‘good’, results between .40 and .58 be considered ‘fair’, and results lower than 0.40 be considered ‘poor’. More specifically to neuroimaging, Eaton et al. (2008) used a threshold of ICC > 0.4 as the mask value for their study while Aron et al. (2006) used an ICC cutoff of ICC > 0.5 as the mask value.

On this point, I don’t really see any reason to depart from psychometric convention just because we’re using fMRI rather than some other technique. Conventionally, reliability estimates of around .8 (or maybe .7, if you’re feeling generous) are considered adequate. Any lower and you start to run into problems, because effect sizes will shrivel up. So I think we should be striving to attain the same levels of reliability with fMRI as with any other measure. If it turns out that that’s not possible, we’ll have to live with that, but I don’t think the solution is to conclude that reliability estimates on the order of .5 are ok “for fMRI” (I’m not saying that’s what B&M say, just that that’s what we should be careful not to conclude). Rather, we should just accept that the odds of detecting certain kinds of effects with fMRI are probably going to be lower than with other techniques. And maybe we should minimize the use of fMRI for those types of analyses where reliability is generally not so good (e.g., using brain activation to predict trait variables over long intervals).

I hasten to point out that none of this should be taken as a criticism of B&M’s paper; I think all of these points complement B&M’s discussion, and don’t detract in any way from its overall importance. Reliability is a big topic, and there’s no way Bennett and Miller could say everything there is to be said about it in one paper. I think they’ve done the field of cognitive neuroscience an important service by raising awareness and providing an accessible overview of some of the issues surrounding reliability, and it’s certainly a paper that’s going on my “essential readings in fMRI methods” list.

ResearchBlogging.org
Bennett, C. M., & Miller, M. B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences

better tools for mining the scientific literature

Sunday, January 31st, 2010

Freethinker’s Asylum has a great post reviewing a number of tools designed to help researchers mine the scientific literature–an increasingly daunting task. The impetus for the post is this article in the latest issue of Nature (note: restricted access), but the FA post discusses a lot of tools that the Nature article doesn’t, and focuses in particular on websites that are currently active and publicly accessible, rather than on proprietary tools currently under development in dark basement labs and warehouses. I hadn’t seen most of these before, but am looking forward to trying them out–e.g., pubget:

When you create an account, pubget signs in to your institution and allows you to search the subscribed resources. When you find a reference you want, just click the pdf icon and there it is. No clicking through to content provider websites. You can tag references as “keepers” to come back to them later, or search for the newest articles from a particular journal.

Sounds pretty handy…

Many of the other sites–as well as most of those discussed in the Nature article–focus on data and literature mining in specific fields, e.g., PubGene and PubAnatomy. These services, which allow you to use specific keywords or topics (e.g., specific genes) to constrain literature searches, aren’t very useful to me personally. But it’s worth pointing out that there are some emerging services that fill much the same niche in the world of cognitive neuroscience that I’m more familiar with. The one that currently looks most promising, in my opinion, is the Cognitive Atlas project led by Russ Poldrack, which is “a collaborative knowledge building project that aims to develop a knowledge base (or ontology) that characterizes the state of current thought in cognitive science. … The Cognitive Atlas aims to capture knowledge from users with expertise in psychology, cognitive science, and neuroscience.”

The Cognitive Atlas is officially still in beta, and you need to have a background in cognitive neuroscience in order to sign up to contribute. But there’s already some content you can navigate, and the site, despite being in the early stages of development, is already pretty impressive. In the interest of full disclosure, as well as shameless plugging, I should note that Russ will be giving a talk about the Cognitive Atlas project as part of a symposium I’m chairing at CNS in Montreal this year. So if you want to learn more about it, stop by! Meantime, check out the Freethinker’s Asylum post for links to all sorts of other interesting tools…

how to measure 200 personality scales in 200 items

Sunday, January 17th, 2010

One of the frustrating things about personality research–for both researchers and participants–is that personality is usually measured using self-report questionnaires, and filling out self-report questionnaires can take a very long time. It doesn’t have to take a very long time, mind you; some questionnaires are very short, like the widely-used Ten-Item Personality Inventory (TIPI), which might take you a whole 2 minutes to fill out on a bad day. So you can measure personality quickly if you have to. But more often than not, researchers want to reliably measure a broad range of different personality traits, and that typically requires administering one or more long-ish questionnaires. For example, in my studies, I often give participants a battery of measures to fill out that includes some combination of the NEO-PI-R, EPQ-R, BIS/BAS scales, UPPS, GRAPES, BDI, TMAS, STAI, and a number of others. That’s a large set of acronyms, and yet it’s just a small fraction of what’s out there; every personality psychologist has his or her own set of favorite measures, and at personality conferences, duels-to-the-death often break out over silly things like whether measure X is better than measure Y, or whether measures A and B can be used interchangeably when no one’s looking. Personality measurement is a pretty intense sport.

The trouble with the way we usually measure personality is that it’s wildly inefficient, for two reasons. One is that many measures are much longer than they need to be. It’s not uncommon to see measures that score each personality trait using a dozen or more different items. In theory, the benefit of this type of redundancy is that you get a more reliable measure, because the error terms associated with individual items tends to cancel out. For example, if you want to know if I’m a depressive kind of guy, you shouldn’t just ask me, “hey, are you depressed?”, because lots of random factors could influence my answer to that one question. Instead, you should ask me a bunch of different questions, like: “hey, are you depressed?” and “why so glum, chum?”, and “does somebody need a hug?”. Adding up responses from multiple items is generally going to give you a more reliable measure. But in practice, it turns out that you typically don’t need more than a handful of items to measure most traits reliably. When people develop “short forms” of measures, the abbreviated scales often have just 4 – 5 items per trait, usually with relatively little loss of reliability and validity. So the fact that most of the measures we use have so many items on them is sort of a waste of both researchers’ and participants’ time.

The other reason personality measurement is inefficient is that most researchers recognize that different personality measures tend to measure related aspects of personality, and yet we persist in administering a whole bunch of questionnaires with similar content to our participants. If you’ve ever participated in a psychology experiment that involved filling out personality questionnaires, there’s a good chance you’ve wondered whether you’re just filling out the same questionnaire over and over. Well you are–kind of. Because the space of personality variation is limited (people can only differ from one another in so many ways), and because many personality constructs have complex interrelationships with one another, personality measures usually end up asking similarly-worded questions. So for example, one measure might give you Extraversion and Agreeableness scores whereas another gives you Dominance and Affiliation scores. But then it turns out that the former pair of dimensions can be “rotated” into the latter two; it’s just a matter of how you partition (or label) the variance. So really, when a researcher gives his or her participants a dozen measures to fill out, that’s not because anyone thinks that there are really a dozen completely different sets of traits to measures; it’s more because we recognize that each instrument gives you a slightly different take on personality, and we tend to think that having multiple potential viewpoints is generally a good thing.

Inefficient personality measurement isn’t inevitable; as I’ve already alluded to above, a number of researchers have developed abbreviated versions of common inventories that capture most of the same variance as much longer instruments. Probably the best-known example is the aforementioned TIPI, developed by Sam Gosling and colleagues, which gives you a workable index of people’s relative standing on the so-called Big Five dimensions of personality. But there are relatively few such abbreviated measures. And to the best of my knowledge, the ones that do exist are all focused on abbreviating a single personality measure. That’s unfortunate, because if you believe that most personality inventories have a substantial amount of overlap, it follows that you should be able to recapture scores on multiple different personality inventories using just one set of (non-redundant) items.

That’s exactly what I try to demonstrate in a paper to be published in the Journal of Research in Personality. The article’s entitled “The abbreviation of personality: How to measure 200 personality scales in 200 items“, which is a pretty accurate, if admittedly somewhat grandiose, description of the contents. The basic goal of the paper is two-fold. First, I develop an automated method for abbreviating personality inventories (or really, any kind of measure with multiple items and/or dimensions). The idea here is to shorten the time and effort required in order to generate shorter versions of existing measures, which should hopefully encourage more researchers to create such short forms. The approach I develop relies heavily on genetic algorithms, which are tools for programmatically obtaining high-quality solutions to high-dimensional problems using simple evolutionary principles. I won’t go into the details (read the paper if you want them!), but I think it works quite well. In the first two studies reported in the paper (data for which were very generously provided by Sam Gosling and Lew Goldberg, respectively), I show that you can reduce the length of existing measures (using the Big Five Inventory and the NEO-PI-R as two examples) quite dramatically with minimal loss of validity. It only takes a few minutes to generate the abbreviated measures, so in theory, it should be possible to build up a database of abbreviated versions of many different measures. I’ve started to put together a site that might eventually serve that purpose (shortermeasures.com), but it’s still in the preliminary stages of development, and may or may not get off the ground.

The other main goal of the paper is to show that the same general approach can be applied to simultaneously abbreviate more than one different measure. To make the strongest case I could think of, I took 8 different broadband personality inventories (“broadband” here just means they each measure a relatively large number of personality traits) that collectively comprise 203 different personality scales and 2,091 different items. Using the same genetic algorithm-based approach, I then reduce these 8 measures down to a single inventory that contains only 181 items (hence the title of the paper). I named the inventory the AMBI (Analog to Multiple Broadband Inventories), and it’s now freely available for use (items and scoring keys are provided both in the paper and at shortermeasures.com). It’s certainly not perfect–it does a much better job capturing some scales than others–but if you have limited time available for personality measures, and still want a reasonably comprehensive survey of different traits, I think it does a really nice job. Certainly, I’d argue it’s better than having to administer many hundreds (if not thousands) of different items to achieve the same effect. So if you have about 15 – 20 minutes to spare in a study and want some personality data, please consider trying out the AMBI!

ResearchBlogging.org

Yarkoni, T. (2010). The Abbreviation of Personality, or how to Measure 200 Personality Scales with 200 Items Journal of Research in Personality DOI: 10.1016/j.jrp.2010.01.002

the parable of zoltan and his twelve sheep, or why a little skepticism goes a long way

Tuesday, December 22nd, 2009

What follows is a fictional piece about sheep and statistics. I wrote it about two years ago, intending it to serve as a preface to an article on the dangers of inadvertent data fudging. But then I decided that no journal editor in his or her right mind would accept an article that started out talking about thinking sheep. And anyway, the rest of the article wasn’t very good. So instead, I post this parable here for your ovine amusement. There’s a moral to the story, but I’m too lazy to write about it at the moment.

A shepherd named Zoltan lived in a small village in the foothills of the Carpathian Mountains. He tended to a flock of twelve sheep: Soffia, Krystyna, Anastasia, Orsolya, Marianna, Zigana, Julinka, Rozalia, Zsa Zsa, Franciska, Erzsebet, and Agi. Zoltan was a keen observer of animal nature, and would often point out the idiosyncracies of his sheep’s behavior to other shepherds whenever they got together.

“Anastasia and Orsolya are BFFs. Whatever one does, the other one does too. If Anastasia starts licking her face, Orsolya will too; if Orsolya starts bleating, Anastasia will start harmonizing along with her.”

“Julinka has a limp in her left leg that makes her ornery. She doesn’t want your pity, only your delicious clovers.”

“Agi is stubborn but logical. You know that old saying, spare the rod and spoil the sheep? Well, it doesn’t work for Agi. You need calculus and rhetoric with Agi.”

Zoltan’s colleagues were so impressed by these insights that they began to encourage him to record his observations for posterity.

“Just think, Zoltan,” young Gergely once confided. “If something bad happened to you, the world would lose all of your knowledge. You should write a book about sheep and give it to the rest of us. I hear you only need to know six or seven related things to publish a book.”

On such occasions, Zoltan would hem and haw solemnly, mumbling that he didn’t know enough to write a book, and that anyway, nothing he said was really very important. It was false modestly of course; in reality, he was deeply flattered, and very much concerned that his vast body of sheep knowledge would disappear along with him one day. So one day, Zoltan packed up his knapsack, asked Gergely to look after his sheep for the day, and went off to consult with the wise old woman who lived in the next village.

The old woman listened to Zoltan’s story with a good deal of interest, nodding sagely at all the right moments. When Zoltan was done, the old woman mulled her thoughts over for a while.

“If you want to be taken seriously, you must publish your findings in a peer-reviewed journal,” she said finally.

“What’s Pier Evew?” asked Zoltan.

“One moment,” said the old woman, disappearing into her bedroom. She returned clutching a dusty magazine. “Here,” she said, handing the magazine to Zoltan. “This is peer review.”

That night, after his sheep had gone to bed, Zoltan stayed up late poring over Vol. IV, Issue 5 of Domesticated Animal Behavior Quarterly. Since he couldn’t understand the figures in the magazine, he read it purely for the articles. By the time he put the magazine down and leaned over to turn off the light, the first glimmerings of an empirical research program had begun to dance around in his head. Just like fireflies, he thought. No, wait, those really were fireflies. He swatted them away.

“I like this… science,” he mumbled to himself as he fell asleep.

In the morning, Zoltan went down to the local library to find a book or two about science. He checked out a volume entitled Principia Scientifica Buccolica—a masterful derivation from first principles of all of the most common research methods, with special applications to animal behavior. By lunchtime, Zoltan had covered t-tests, and by bedtime, he had mastered Mordenkainen’s correction for inestimable herds.

In the morning, Zoltan made his first real scientific decision.

“Today I’ll collect some pilot data,” he thought to himself, “and tomorrow I’ll apply for an R01.”

His first set of studies tested the provocative hypothesis that sheep communicate with one another by moving their ears back and forth in Morse code. Study 1 tested the idea observationally. Zoltan and two other raters (his younger cousins), both blind to the hypothesis, studied sheep in pairs, coding one sheep’s ear movements and the other sheep’s behavioral responses. Studies 2 through 4 manipulated the sheep’s behavior experimentally. In Study 2, Zoltan taped the sheep’s ears to their head; in Study 3, he covered their eyes with opaque goggles so that they couldn’t see each other’s ears moving. In Study 4, he split the twelve sheep into three groups of four in order to determine whether smaller groups might promote increased sociability.

That night, Zoltan minded the data. “It’s a lot like minding sheep,” Zoltan explained to his cousin Griga the next day. “You need to always be vigilant, so that a significant result doesn’t get away from you.”

Zoltan had been vigilant, and the first 4 studies produced a number of significant results. In Study 1, Zoltan found that sheep appeared to coordinate ear twitches: if one sheep twitched an ear several times in a row, it was a safe bet that other sheep would start to do the same shortly thereafter (p < .01). There was, however, no coordination of licking, headbutting, stamping, or bleating behaviors, no matter how you sliced and diced it. “It’s a highly selective effect,” Zoltan concluded happily. After all, when you thought about it, it made sense. If you were going to pick just one channel for sheep to communicate through, ear twitching was surely a good one. One could make a very good evolutionary argument that more obvious methods of communication (e.g., bleating loudly) would have been detected by humans long ago, and that would be no good at all for the sheep.

Studies 2 and 3 further supported Zoltan’s story. Study 2 demonstrated that when you taped sheep’s ears to their heads, they ceased to communicate entirely. You could put Rozalia and Erzsebet in adjacent enclosures and show Rozalia the Jack of Spades for three or four minutes at a time, and when you went to test Erzsebet, she still wouldn’t know the Jack of Spades from the Three of Diamonds. It was as if the sheep were blind! Except they weren’t blind, they were dumb. Zoltan knew; he had made them that way by taping their ears to their heads.

In Study 3, Zoltan found that when the sheep’s eyes were covered, they no longer coordinated ear twitching. Instead, they now coordinated their bleating—but only if you excluded bleats that were produced when the sheep’s heads were oriented downwards. “Fantastic,” he thought. “When you cover their eyes, they can’t see each other’s ears any more. So they use a vocal channel. This, again, makes good adaptive sense: communication is too important to eliminate entirely just because your eyes happen to be covered. Much better to incur a small risk of being detected and make yourself known in other, less subtle, ways.”

But the real clincher was Study 4, which confirmed that ear twitching occurred at a higher rate in smaller groups than larger groups, and was particularly common in dyads of well-adjusted sheep (like Anastasia and Orsolya, and definitely not like Zsa Zsa and Marianna).

“Sheep are like everyday people,” Zoltan told his sister on the phone. “They won’t say anything to your face in public, but get them one-on-one, and they won’t stop gossiping about each other.”

It was a compelling story, Zoltan conceded to himself. The only problem was the F test. The difference in twitch rates as a function of group size wasn’t quite statistically significant. Instead, it hovered around p = .07, which the textbooks told Zoltan meant that he was almost right. Almost right was the same thing as potentially wrong, which wasn’t good enough. So the next morning, Zoltan asked Gergely to lend him four sheep so he could increase his sample size.

“Absolutely not,” said Gergely. “I don’t want your sheep filling my sheep’s heads with all of your crazy new ideas.”

“Look,” said Zoltan. “If you lend me four sheep, I’ll let you drive my Cadillac down to the village on weekends after I get famous.”

“Deal,” said Gergely.

So Zoltan borrowed the sheep. But it turned out that four sheep weren’t quite enough; after adding Gergely’s sheep to the sample, the effect only went from p < .07 to p < .06. So Zoltan cut a deal with his other neighbor, Yuri: four of Yuri’s sheep for two days, in return for three days with Zoltan’s new Lexus (once he bought it). That did the trick. Once Zoltan repeated the experiment with Yuri’s sheep, the p-value for Study 2 now came to .046, which the textbooks assured Zoltan meant he was going to be famous.

Data in hand, Zoltan spent the next two weeks writing up his very first journal article. He titled it “Baa baa baa, or not: Sheep communicate via non-verbal channels”—a decidedly modest title for the first empirical work to demonstrate that sheep are capable of sophisticated propositional thought. The article was published to widespread media attention and scientific acclaim, and Zoltan went on to have a productive few years in animal behavioral research, studying topics as interesting and varied as giraffe calisthenics and displays of affection in the common leech.

Much later, it turned out that no one was able to directly replicate his original findings with sheep (though some other researchers did manage to come up with conceptual replications). But that didn’t really matter to Zoltan, because by then he’d decided science was too demanding a career anyway; it was way more fun to lay under trees counting his sheep. Counting sheep, and occasionally, on Saturdays, driving down to the village in his new Lexus,  just to impress all the young cowgirls.

specificity statistics for ROI analyses: a simple proposal

Sunday, December 13th, 2009

The brain is a big place. In the context of fMRI analysis, what that bigness means is that a typical 3D image of the brain might contain anywhere from 50,000 – 200,000 distinct voxels (3D pixels). Any of those voxels could theoretically show meaningful activation in relation to some contrast of interest, so the only way to be sure that you haven’t overlooked potentially interesting activations is to literally test every voxel (or, given some parcellation algorithm, every region).

Unfortunately, the problem that approach raises–which I’ve discussed in more detail here–is the familiar one of multiple comparisons: If you’re going to test 100,000 locations, it’s not really fair to test each one at the conventional level of p < .05, because on average, you’ll get about 5,000 statistically significant results just by chance that way. So you need to do something to correct for the fact that you’re running thousands of tests. The most common approach is to simply make the threshold for significance more conservative–for example, by testing at p < .0001 instead of p < .05, or by using some combination of intensity and cluster extent thresholds (e.g., you look for 20 contiguous voxels that are all significant at, say, p < .001) that’s supposed to guarantee a cluster-wise error rate of .05.

There is, however, a natural tension between false positives and false negatives: When you make your analysis more conservative, you let fewer false positives through the filter, but you also keep more of the true positives out. A lot of fMRI analysis really just boils down to walking a very thin line between running overconservative analyses that can’t detect anything but the most monstrous effects, and running overly liberal analyses that lack any real ability to distinguish meaningful signals from noise. One very common approach that fMRI researchers have adopted in an effort to optimize this balance is to use complementary hypothesis-driven and whole-brain analyses. The idea is that you’re basically carving the brain up into two separate search spaces: One small space for which you have a priori hypotheses that can be tested using a small number of statistical comparisons, and one much larger space (containing everything but the a priori space) where you continue to use a much more conservative threshold.

For example, if I believe that there’s a very specific chunk of right inferotemporal cortex that’s specialized for detecting clown faces, I can focus my hypothesis-testing on that particular region, without having to pretend that all voxels are created equal. So I delineate the boundaries of a CRC (Clown Representation Cortex) region-of-interest (ROI) based on some prior criteria (e.g., anatomy, or CRC activation in previous studies), and then I can run a single test at p < .05 to test my hypothesis, no correction needed. But to ensure that I don’t miss out on potentially important clown-related activation elsewhere in the brain, I also go ahead and run an additional whole-brain analysis that’s fully corrected for multiple comparisons. By coupling these two analyses, I hopefully get the best of both worlds. That is, I combine one approach (the ROI analysis) that maximizes power to test a priori hypotheses at the cost of an inability to detect effects in unexpected places with another approach (the whole-brain analysis) that has a much more limited capacity to detect effects in both expected and unexpected locations.

This two-pronged strategy is generally a pretty successful one, and I’d go so far as to say that a very large minority, if not an outright majority, of fMRI studies currently use it. Used wisely, I think it’s really an invaluable strategy. There is, however, one fairly serious and largely unappreciated problem associated with the incautious application of this approach. It has to do with claims about the specificity of activation that often tend to accompany studies that use a complementary ROI/whole-brain strategy. Specifically, a pretty common pattern is for researchers to (a) confirm their theoretical predictions by successfully detecting activation in one or more a priori ROIs; (b) identify few if any whole-brain activations; and consequently, (c) conclude that not only were the theoretical predictions confirmed, but that the hypothesized effects in the a priori ROIs were spatially selective, because a complementary whole-brain analysis didn’t turn up much (if anything). Or, to put it in less formal terms, not only were we right, we were really right! There isn’t any other part of the brain that shows the effect we hypothesized we’d see in our a priori ROI!

The problem with this type of inference is that there’s usually a massive discrepancy in the level of power available to detect effects in a priori ROIs versus the rest of the brain. If you search at p < .05 within some predetermined space, but at only p < .0001 everywhere else, you’re naturally going to detect results at a much lower rate everywhere else. But that’s not necessarily because there wasn’t just as much to look at everywhere else; it could just be because you didn’t look very carefully. By way of analogy, if you’re out picking berries in the forest, and you decide to spend half your time on just one bush that (from a distance) seemed particularly berry-full, and the other half of your time divided between the other 40 bushes in the area, you’re not really entitled to conclude that you picked the best bush all along simply because you came away with a relatively full basket. Had you done a better job checking out the other bushes, you might well have found some that were even better, and then you’d have come away carrying two baskets full of delicious, sweet, sweet berries.

Now, in an ideal world, we’d solve this problem by simply going around and carefully inspecting all the berry bushes, until we were berry, berry sure really convinced that we’d found all of the best bushes. Unfortunately, we can’t do that, because we’re out here collecting berries on our lunch break, and the boss isn’t paying us to dick around in the woods. Or, to return to fMRI World, we simply can’t carefully inspect every single voxel (say, by testing it at p < .05), because then we’re right back in mega-false-positive-land, which we’ve already established as a totally boring place we want to avoid at all costs.

Since an optimal solution isn’t likely, the next best thing is to figure out what we can do to guard against careless overinterpretation. Here I think there’s actually a very simple, and relatively elegant, solution. What I’ve suggested when I’ve given recent talks on this topic is that we mandate (or at least, encourage) the use of what you could call a specificity statistic (SS). The SS is a very simple measure of how specific a given ROI-level finding is; it’s just the proportion of voxels that are statistically significant when tested at the same level as the ROI-level effects. In most cases, that’s going to be p < .05, so the SS will usually just be the proportion of all voxels anywhere in the brain that are activated at p < .05.

To see why this is useful, consider what could no longer happen: Researchers would no longer be able to (inadvertently) capitalize on the fact that the one or two regions they happened to define as a priori ROIs turned up significant effects when no other regions did in a whole-brain analysis. Suppose that someone reports a finding that negative emotion activates the amygdala in an ROI analysis, but doesn’t activate any other region in a whole-brain analysis. (While I’m pulling this particular example out of a hat here, I feel pretty confident that if you went and did a thorough literature review, you’d find at least three or four studies that have made this exact claim.) This is a case where the SS would come in really handy. Because if the SS is, say, 26% (i.e., about a quarter of all voxels in the brain are active at p < .05, even if none survive full correction for multiple comparisons), you would want to draw a very different conclusion than if it was just 4%. If fully a quarter of the brain were to show greater activation for a negative-minus-neutral emotion contrast, you wouldn’t want to conclude that the amygdala was critically involved in negative emotion; a better interpretation would be that the researchers in question just happened to define an a priori region that fell within the right quarter of the brain. Perhaps all that’s happening is that negative emotion elicits a general increase in attention, and much of the brain (including, but by no means limited to, the amygdala) tends to increase activation correspondingly. So as a reviewer and reader, you’d want to know how specific the reported amygdala activation really is*. But in the vast majority of papers, you currently have no way of telling (and the researchers probably don’t even know the answer themselves!).

The principal beauty of this statistic lies in its simplicity: It’s easy to understand, easy to calculate, and easy to report. Ideally, researchers would report the SS any time ROI analyses are involved, and would do it for every reported contrast. But at minimum, I think we should all encourage each other (and ourselves) to report such a statistic any time we’re making a specificity claim about ROI-based results. In other words,if you want to argue that a particular cognitive function is relatively localized to the ROI(s) you happened to select, you should be required to show that there aren’t that many other voxels (or regions) that show the same effect when tested at the liberal threshold you used for the ROI analysis. There shouldn’t be an excuse for not doing this; it’s a very easy procedure for researchers to implement, and an even easier one for reviewers to demand.

* An alternative measure of specificity would be to report the percentile ranking of all of the voxels within the ROI mask relative to all other individual voxels. In the above example, you’d assign very different interpretations depending on whether the amygdala was in the 32nd or 87th percentile of all voxels, when ordered according to the strength of the effect for the negative – neutral contrast.

solving the file drawer problem by making the internet the drawer

Thursday, November 26th, 2009

UPDATE 11/22/2011 — Hal Pashler’s group at UCSD just introduced a new website called PsychFileDrawer that’s vastly superior in every way to the prototype I mention in the post below; be sure to check it out!

Science is a difficult enterprise, so scientists have many problems. One particularly nasty problem is the File Drawer Problem. The File Drawer Problem is actually related to another serious scientific problem known as the Desk Problem. The Desk Problem is that many scientists have messy desks covered with overflowing stacks of papers, which can make it very hard to find things on one’s desk–or, for that matter, to clear enough space to lay down another stack of papers.  A common solution to the Desk Problem is to shove all of those papers into one’s file drawer. Which brings us to the the File Drawer Problem. The File Drawer Problem refers to the fact that, eventually, even the best-funded of scientists run out of room in their file drawers.

Ok, so that’s not exactly right. What the file drawer problem–a term coined by Robert Rosenthal in a seminal 1979 article–really refers to is the fact that null results tend to go unreported in the scientific literature at a much higher rate than positive findings, because journals don’t like to publish papers that say “we didn’t find anything”, and as a direct consequence, authors don’t like to write papers that say “journals won’t want to publish this”.

Because of this blatant prejudice systematic bias against null results, the eventual resting place of many a replication failure is its author’s file drawer. The reason this is a problem is that, over the long term, if only (or mostly) positive findings ever get published, researchers can get a very skewed picture of how strong an effect really is. To illustrate, let’s say that Joe X publishes a study showing that people with lawn gnomes in their front yards tend to be happier than people with no lawn gnomes in their yards. Intuitive as that result may be, someone is inevitably going to get the crazy idea that this effect is worth replicating once or twice before we all stampede toward Home Depot or the Container Store with our wallets out (can you tell I’ve never bought a lawn gnome before?). So let’s say Suzanna Y and Ramesh Z each independently try to replicate the effect in their labs (meaning, they command their graduate students to do it). And they find… nothing! No effect. Turns out, people with lawn gnomes are just as miserable as the rest of us. Well, you don’t need a PhD in lawn decoration to recognize that Suzanna Y and Ramesh Z are not going to have much luck publishing their findings in very prestigious journals–or for that matter, in any journals. So those findings get buried into their file drawers, where they will live out the rest of their days with very sad expressions on their numbers.

Now let’s iterate this process several times. Every couple of years, some enterprising young investigator will decide she’s going to try to replicate that cool effect from 2009, since no one else seems to have bothered to do it. This goes on for a while, with plenty of null results, until eventually, just by chance, someone gets lucky (if you can call a false positive lucky) and publishes a successful replication. And also, once in a blue moon, someone who gets a null result actually bothers to forces their graduate student to write it up, and successfully gets out a publication that very carefully explains that, no, Virginia, lawn gnomes don’t really make you happy. So, over time, a small literature on the hedonic effects of lawn gnomes accumulates.

Eventually, someone else comes across this small literature and notices that it contains “mixed findings”, with some studies finding an effect, and others finding no effect. So this special someone–let’s call them the Master of the Gnomes–decides to do a formal meta-analysis. (A meta-analysis is basically just a fancy way of taking a bunch of other people’s studies, throwing them in a blender, and pouring out the resulting soup into a publication of your very own.) Now you can see why the failure to publish null results is going to be problematic: What the Master of the Gnomes doesn’t know about, the Master of the Gnomes can’t publish about. So any resulting meta-analytic estimate of the association between lawn gnomes and subjective well-being is going to be biased in the positive directio. That is, there’s a good chance that the meta-analysis will end up saying lawn gnomes make people very happy,when in reality lawn gnomes only make people a little happy, or don’t make people happy at all.

There are lots of ways to try to get around the file drawer problem, of course. One approach is to call up everyone you know who you think might have ever done any research on lawn gnomes and ask if you could take a brief peek into their file drawer. But meta-analysts are often very introverted people with no friends, so they may not know any other researchers. Or they might be too shy to ask other people for their data. And then too, some researchers are very protective of their file drawers, because in some cases, they’re hiding more than just papers in there. Bottom line, it’s not always easy to identify all of the null results that are out there.

A very different way to deal with the file drawer problem, and one suggested by Rosenthal in his 1979 article, is to compute a file drawer number, which is basically a number that tells you how many null results that you don’t know about would have to exist in people’s file drawers before the meta-analytic effect size estimate was itself rendered null. So, for example, let’s say you do a meta-analysis of 28 studies, and find that your best estimate, taking all studies into account, is that the standardized effect size (Cohen’s d) is 0.63, which is quite a large effect, and is statistically different from 0 at, say, the p < .00000001 level. Intuitively, that may seem like a lot of zeros, but being a careful methodologist, you decide you’d like a more precise definition of “a lot”. So you compute the file drawer number (in one of its many permutations), and it turns out that there would have to be 4,640,204 null results out there in people’s file drawers before the meta-analytic effect size became statistically non-significant. That’s a lot of studies, and it’s doubtful that there are even that many people studying lawn gnomes, so you can probably feel comfortable that there really is an association there, and that it’s fairly large.

The problem, of course, is that it doesn’t always turn out that way. Sometimes you do the meta-analysis and find that your meta-analytic effect is cutting it pretty close, and that it would only take, say, 12 null results to render the effect non-significant. At that point, the file drawer N is no help; no amount of statistical cleverness is going to give you the extrasensory ability to peer into people’s file drawers at a distance. Moreover, even in cases where you can feel relatively confident that there couldn’t possibly be enough null results out there to make your effect go away entirely, it’s still possible that there are enough null results out there to substantially weaken it. Generally speaking, the file drawer N is a number you compute because you have to, not because you want to. In an ideal world, you’d always have all the information readily available at your fingertips, and all that would be left for you to do is toss it all in the blender and hit “liquify” “meta-analyze”. But of course, we don’t live in an ideal world; we live in a horrible world full of things like tsunamis, lip syncing, and publication bias.

This brings me, in a characteristically long-winded way, to the point of this post. The fact that researchers often don’t have access to other researchers’ findings–null result or not–is in many ways a vestige of the fact that, until recently, there was no good way to rapidly and easily communicate one’s findings to others in an informal way. Of course, the telephone has been around for a long time, and the postal service has been around even longer. But the problem with telling other people what you found on the telephone is that they have to be listening, and you don’t really know ahead of time who’s going to want to hear about your findings. When Rosenthal was writing about file drawers in the 80s, there wasn’t any bulletin board where people could post their findings for all to see without going to the trouble of actually publishing them, so it made sense to focus on ways to work around the file drawer problem instead of through it.

These days, we do have a bulletin board where researchers can post their null results: The internet. In theory, an online database of null results presents an ideal solution to the file drawer problem: Instead of tossing their replication failures into a folder somewhere, researchers could spend a minute or two entering just a minimal amount of information into an online database, and that information would then live on in perpetuity, accessible to anyone else who cared to come along and enter the right keyword into the search box. Such a system could benefit everyone involved: researchers who ended up with unpublishable results could salvage at least some credit for their efforts, and ensure that their work wasn’t entirely lost to the sands of time; prospective meta-analysts could simplify the task of hunting down relevant findings in unlikely places; and scientists contemplating embarking on a new line of research that built heavily on an older finding could do a cursory search to see if other people had already tried (and failed) to replicate the foundational effect.

Sounds good, right? At least, that was my thought process last year, when I spent some time building an online database that could serve as this type of repository for null (and, occasionally, not-null) results. I got a working version up and running at failuretoreplicate.com, and was hoping to spend some time begging people to use it trying to write it up as a short paper, but then I started sinking into the quicksand of my dissertation, and promptly forgot about it. What jogged my memory was this post a couple of days ago, which describes a database, called the Negatome, that contains “a collection of protein and domain (functional units of proteins) pairs thatare unlikely to be engaged in direct physical interactions”. This isn’t exactly the same thing as a database of null results, and is in a completely different field, but it was close enough to rekindle my interest and motivate me to dust off the site I built last year. So now the site is here, and it’s effectively open for business.

I should confess up front that I don’t harbor any great hopes of this working; I suspect it will be quite difficult to build the critical mass needed to make something like this work. Still, I’d like to try. The site is officially in beta, so stuff will probably still break occasionally, but it’s basically functional. You can create an account instantly and immediately start adding studies; it only takes a minute or two per study. There’s no need to enter much in the way of detail; the point isn’t to provide an alternative to peer-reviewed publication, but rather to provide a kind of directory service that researchers could use as a cursory tool for locating relevant information. All you have to do is enter a brief description of the effect you tried to replicate, an indication of whether or not you succeeded, and what branch of psychology the effect falls under. There are plenty of other fields you can enter (e.g., searchable tags, sample sizes, description of procedures, etc.), but they’re almost all optional. The goal is really to make this as effortless as possible for people to use, so that there is no virtually no cost to contributing.

Anyway, right now there’s nothing on the site except a single lonely record I added in order to get things started. I’d be very grateful to anyone who wants to help this project off the ground by adding a study or two. There are full editing and deletion capabilities, so you can always delete anything you add later on if you decide you don’t want to share after all. My hope is that, given enough community involvement and a large enough userbase, this could eventually become a valuable resource psychologists could rely on when trying to establish how likely a finding is to replicate, or when trying to identify relevant studies to include in meta-analyses. You do want to help figure out what effect those sneaky, sneaky lawn gnomes have on our collective mental health, right?

tuesday at 3 pm works for me

Wednesday, October 21st, 2009

Apparently, Tuesday at 3 pm is the best time to suggest as a meeting time–that’s when people have the most flexibility available in their schedule. At least, that’s the conclusion drawn by a study based on data from WhenIsGood, a free service that helps with meeting scheduling. There’s not much to the study beyond the conclusion I just gave away; not surprisingly, people don’t like to meet before 10 or 11 am or after 4 pm, and there’s very little difference in availability across different days of the week.

What I find neat about this isn’t so much the results of the study itself as the fact that it was done at all. I’m a big proponent of using commercial website data for research purposes–I’m about to submit a paper that relies almost entirely on content pulled using the Blogger API, and am working on another project that makes extensive use of the Twitter API. The scope of the datasets one can assemble via these APIs is simply unparalleled; for example, there’s no way I could ever realistically collect writing samples of 50,000+ words from 500+ participants in a laboratory setting, yet the ability to programmatically access blogspot.com blog contents makes the task trivial. And of course, many websites collect data of a kind that just isn’t available off-line. For example, the folks at OKCupid are able to continuously pump out interesting data on people’s online dating habits because they have comprehensive data on interactions between literally millions of prospective dating partners. If you want to try to generate that sort of data off-line, I hope you have a really large lab.

Of course, I recognize that in this case, the WhenIsGood study really just amounts to a glorified press release. You can tell that’s what it is from the URL, which literally includes the “press/” directory in its path. So I’m certainly not naive enough to think that Web 2.0 companies are publishing interesting research based on their proprietary data solely out of the goodness of their hearts. Quite the opposite. But I think in this case the desire for publicity works in researchers’ favor: It’s precisely because virtually any press is considered good press that many of these websites would probably be happy to let researchers play with their massive (de-identified) datasets. It’s just that, so far, hardly anyone’s asked. The Web 2.0 world is a largely untapped resource that researchers (or at least, psychologists) are only just beginning to take advantage of.

I suspect that this will change in the relatively near future. Five or ten years from now, I imagine that a relatively large chunk of the research conducted in many area of psychology (particularly social and personality psychology) will rely heavily on massive datasets derived from commercial websites. And then we’ll all wonder in amazement at how we ever put up with the tediousness of collecting real-world data from two or three hundred college students at a time, when all of this online data was just lying around waiting for someone to come take a peek at it.