the Neurosynth viewer goes modular and open source

If you’ve visited the Neurosynth website lately, you may have noticed that it looks… the same way it’s always looked. It hasn’t really changed in the last ~20 months, despite the vague promise on the front page that in the next few months, we’re going to do X, Y, Z to improve the functionality. The lack of updates is not by design; it’s because until recently I didn’t have much time to work on Neurosynth. Now that much of my time is committed to the project, things are moving ahead pretty nicely, though the changes behind the scenes aren’t reflected in any user-end improvements yet.

The github repo is now regularly updated and even gets the occasional contribution from someone other than myself; I expect that to ramp up considerably in the coming months. You can already use the code to run your own automated meta-analyses fairly easily; e.g., with everything set up right (follow the Readme and examples in the repo), the following lines of code:

dataset = cPickle.load(open('dataset.pkl', 'rb'))
studies = get_ids_by_expression("memory* &~ ("wm|working|episod*"), threshold=0.001)
ma = meta.MetaAnalysis(dataset, studies)
ma.save_results('memory')

…will perform an automated meta-analysis of all studies in the Neurosynth database that use the term ‘memory’ at a frequency of 1 in 1,000 words or greater, but don’t use the terms wm or working, or words that start with ‘episod’ (e.g., episodic). You can perform queries that nest to arbitrary depths, so it’s a pretty powerful engine for quickly generating customized meta-analyses, subject to all of the usual caveats surrounding Neurosynth (i.e., that the underlying data are very noisy, that terms aren’t mental states, etc.).

Anyway, with the core tools coming along, I’ve started to turn back to other elements of the project, starting with the image viewer. Yesterday I pushed the first commit of a new version of the viewer that’s currently on the Neurosynth website. In the next few weeks, this new version will be replacing the current version of the viewer, along with a bunch of other changes to the website.

A live demo of the new viewer is available here. It’s not much to look at right now, but behind the scenes, it’s actually a huge improvement on the old viewer in a number of ways:

  • The code is completely refactored and is all nice and object-oriented now. It’s also in CoffeeScript, which is an alternative and (if you’re coming from a Python or Ruby background) much more readable syntax for JavaScript. The source code is on github and contributions are very much encouraged. Like most scientists, I’m generally loathe to share my code publicly because I think it sucks most of the time. But I actually feel pretty good about this code. It’s not good code by any stretch, but I think it rises to the level of ‘mostly sensible’, which is about as much as I can hope for.
  • The viewer now handles multiple layers simultaneously, with the ability to hide and show layers, reorder them by dragging, vary the transparency, assign different color palettes, etc. These features have been staples of offline viewers pretty much since the prehistoric beginnings of fMRI time, but they aren’t available in the current Neurosynth viewer or most other online viewers I’m aware of, so this is a nice addition.
  • The architecture is modular, so that it should be quite easy in future to drop in other alternative views onto the data without having to muck about with the app logic. E.g., adding a 3D WebGL-based view to complement the current 2D slice-based HTML5 canvas approach is on the near-term agenda.
  • The resolution of the viewer is now higher–up from 4 mm to 2 mm (which is the most common native resolution used in packages like SPM and FSL). The original motivation for downsampling to 4 mm in the prior viewer was to keep filesize to a minimum and speed up the initial loading of images. But at some point I realized, hey, we’re living in the 21st century; people have fast internet connections now. So now the files are all in 2 mm resolution, which has the unpleasant effect of increasing file sizes by a factor of about 8, but also has the pleasant effect of making it so that you can actually tell what the hell you’re looking at.

Most importantly, there’s now a clean, and near-complete, separation between the HTML/CSS content and the JavaScript code. Which means that you can now effectively drop the viewer into just about any HTML page with just a few lines of code. So in theory, you can have basically the same viewer you see in the demo just by sticking something like the following into your page:

 viewer = Viewer.get('#layer_list', '.layer_settings')
 viewer.addView('#view_axial', 2);
 viewer.addView('#view_coronal', 1);
 viewer.addView('#view_sagittal', 0);
 viewer.addSlider('opacity', '.slider#opacity', 'horizontal', 'false', 0, 1, 1, 0.05);
 viewer.addSlider('pos-threshold', '.slider#pos-threshold', 'horizontal', 'false', 0, 1, 0, 0.01);
 viewer.addSlider('neg-threshold', '.slider#neg-threshold', 'horizontal', 'false', 0, 1, 0, 0.01);
 viewer.addColorSelect('#color_palette');
 viewer.addDataField('voxelValue', '#data_current_value')
 viewer.addDataField('currentCoords', '#data_current_coords')
 viewer.loadImageFromJSON('data/MNI152.json', 'MNI152 2mm', 'gray')
 viewer.loadImageFromJSON('data/emotion_meta.json', 'emotion meta-analysis', 'bright lights')
 viewer.loadImageFromJSON('data/language_meta.json', 'language meta-analysis', 'hot and cold')
 viewer.paint()

Well, okay, there are some other dependencies and styling stuff you’re not seeing. But all of that stuff is included in the example folder here. And of course, you can modify any of the HTML/CSS you see in the example; the whole point is that you can now easily style the viewer however you want it, without having to worry about any of the app logic.

What’s also nice about this is that you can easily pick and choose which of the viewer’s features you want to include in your page; nothing will (or at least, should) break no matter what you do. So, for example, you could decide you only want to display a single view showing only axial slices; or to allow users to manipulate the threshold of layers but not their opacity; or to show the current position of the crosshairs but not the corresponding voxel value; and so on. All you have to do is include or exclude the various addSlider() and addData() lines you see above.

Of course, it wouldn’t be a mediocre open source project if it didn’t have some important limitations I’ve been hiding from you until near the very end of this post (hoping, of course, that you wouldn’t bother to read this far down). The biggest limitation is that the viewer expects images to be in JSON format rather than a binary format like NIFTI or Analyze. This is a temporary headache until I or someone else can find the time and motivation to adapt one of the JavaScript NIFTI readers that are already out there (e.g., Satra Ghosh‘s parser for xtk), but for now, if you want to load your own images, you’re going to have to take the extra step of first converting them to JSON. Fortunately, the core Neurosynth Python package has a img_to_json() method in the imageutils module that will read in a NIFTI or Analyze volume and produce a JSON string in the expected format. Although I’m pretty sure it doesn’t handle orientation properly for some images, so don’t be surprised if your images look wonky. (And more importantly, if you fix the orientation issue, please commit your changes to the repo.)

In any case, as long as you’re comfortable with a bit of HTML/CSS/JavaScript hacking, the example/ folder in the github repo has everything you need to drop the viewer into your own pages. If you do use this code internally, please let me know! Partly for my own edification, but mostly because when I write my annual progress reports to the NIH, it’s nice to be able to truthfully say, “hey, look, people are actually using this neat thing we built with taxpayer money.”

estimating bias in text with Ruby

Over the past couple of months, I’ve been working on and off on a collaboration with my good friend Nick Holtzman and some other folks that focuses on ways to automatically extract bias from text using a vector space model. The paper is still in progress, so I won’t give much away here, except to say that Nick’s figured out what I think is a pretty clever way to show that, yes, Fox likes Republicans more than Democrats, and MSNBC likes Democrats more than Republicans. It’s not meant to be a surprising result, but simply a nice validation of the underlying method, which can be flexibly applied to all sorts of interesting questions.

The model we’re using is a simplified variant of Jones and Mewhort’s (2007) BEAGLE model. Essentially, similarity between words is quantified by looking at the degree to which words have similar co-occurrence patterns with other words. This basic idea is actually common to pretty much all vector space models, so in that sense, there’s not much new here (there’s plenty that’s new in Jones and Mewhort (2007), but we’re mostly leaving those features out for the sake of simplicity and computational speed). The novel aspect is the contrast coding of similarity terms in order to produce bias estimates. But you’ll have to wait for the paper to read more about that.

In the meantime, one thing we’ve tried to do is develop software that can be used to easily implement the kind of analyses we describe in the paper. With plenty of input from Nick and Mike Jones, I’ve written a set of tools in Ruby that’s now freely available for download here. The tools are actually bundled as a Ruby gem, so installation should be a snap on most platforms. We’re still working on documentation, so there’s no full-blown manual yet, but the quick-start guide should be sufficient to get many users up and running. And for people who share my love of Ruby and are interested in using the tools programmatically, there’s a fairly well-commented RDoc.

The code should really be considered an alpha release at the moment; I’m sure there are plenty of bugs (if you find any, email me!), and the feature set is currently pretty limited. Hopefully it’ll grow over time. I also plan to throw the code up on GitHub at some point in the near future so that anyone who’s interested can help out with the development. In the meantime, if you’re interested in semantic space models and want to play around with a crude (but relatively fast) implementation of one, there’s a (very) small chance you might find these tools useful.

links and slides from the CNS symposium

After the CNS symposium on building a cumulative cognitive neuroscience, several people I talked to said it was a pity there wasn’t an online repository where all the sites that the speakers discussed could be accessed. I should have thought of that ahead of time, because even if we made one now, no one would ever find it. So, belatedly, the best I can do is put together a list here, where I’m pretty sure no one’s ever going to read it.

Anyway, this is mostly from memory, so I may be forgetting some of the things people talked about, but here’s what I can remember:

Let me know if there’s anything I’m leaving out.

On a related note, several people at the conference asked me for my slides, but I promptly forgot who they were, so here they are.

UPDATED: Russ Poldrack’s slides are now also on the web here.

some thoughtful comments on automatic measure abbreviation

In the comments on my last post, Sanjay Srivastava had some excellent thoughts/concerns about the general approach of automating measure abbreviation using a genetic algorithm. They’re valid concerns that might come up for other people too, so I thought I’d discuss them here in more detail. Here’s Sanjay:

Lew Goldberg emailed me a copy of your paper a while back and asked what I thought of it. I’m pasting my response below — I’d be curious to hear your take on it. (In this email “he” is you and “you” is he because I was writing to Lew…)

::

1. So this is what it feels like to be replaced by a machine.

I’m not sure if Sanjay thinks this is a good or a bad thing? I guess my own feeling is that it’s a good thing to the extent that it makes personality measurement more efficient and frees researchers up to use that time (both during data collection and measure development) for other productive things like eating M&M’s on the couch and devising the most diabolically clever April Fool’s joke for next year to make up for the fact that you forgot to do it this year writing papers, and a bad one to the extent that people take this as a license to stop thinking carefully about what they’re doing when they’re shortening or administering questionnaire measures. But provided people retain a measure of skepticism and cautiousness in applying this type of approach, I’m optimistic that the result will be a large net gain.

2. The convergent correlations were a little low in studies 2 and 3. You’d expect shortened scales to have less reliability and validity, of course, but that didn’t go all the way in covering the difference. He explained that this was because the AMBI scales draw on a different item pool than the proprietary measures, which makes sense. wever, that makes it hard to evaluate the utility of the approach. If you compare how the full IPIP facet scales correlate with the proprietary NEO (which you’ve published here: http://ipip.ori.org/newNEO_FacetsTable.htm) against his Table 2, for example, it looks like the shortening algorithm is losing some information. Whether that’s better or worse than a rationally shortened scale is hard to say.

This is an excellent point, and I do want to reiterate that the abbreviation process isn’t magic; you can’t get something for free, and you’re almost invariably going to lose some fidelity in your measurement when you shorten any measure. That said, I actually feel pretty good about the degree of convergence I report in the paper. Sanjay already mentions one reason the convergent correlations seem lower than what you might expect: the new measures are composed of  different items than the old ones, so they’re not going to share many of the same sources of error. That means the convergent correlations will necessarily be lower, but isn’t necessarily a problem in a broader sense. But I think there are also two other, arguably more important, reasons why the convergence might seem deceptively low.

One is that the degree of convergence is bounded by the test-retest reliability of the original measures. Because the items in the IPIP pools were administered in batches spanning about a decade, whereas each of the proprietary measures (e.g., the NEO-PI-R) were administered on one occasion, the net result is that many of the items being used to predict personality traits were actually filled out several years before or after the personality measures in question. If you look at the long-term test-retest reliability of some of the measures I abbreviated (and there actually isn’t all that much test-retest data of that sort out there), it’s not clear that it’s much higher than what I report, even for the original measures. In other words, if you don’t generally see test-retest correlations across several years greater than .6 – .8 for the real NEO-PI-R scales, you can’t really expect to do any better with an abbreviated measure. But that probably says more about the reliability of narrowly-defined personality traits than about the abbreviation process.

The other reason the convergent correlations seem lower than you might expect, which I actually think is the big one, is that I reported only the cross-validated coefficients in the paper. In other words, I used only half of the data to abbreviate measures like the NEO-PI-R and HEXACO-PI, and then used the other half to obtain unbiased estimates of the true degree of convergence. This is technically the right way to do things, because if you don’t cross-validate, you’re inevitably going to capitalize on chance. If you use fit a model to a particular set of data, and then use the very same data to ask the question “how well does the model fit the data?” you’re essentially cheating–or, to put it more mildly, your estimates are going to be decidedly “optimistic”. You could argue it’s a relatively benign kind of cheating, because almost everyone does it, but that doesn’t make it okay from a technical standpoint.

When you look at it this way, the comparison of the IPIP representation of the NEO-PI-R with the abbreviated representation of the NEO-PI-R I generated in my paper isn’t really a fair one, because the IPIP measure Lew Goldberg came up with wasn’t cross-validated. Lew simply took the ten items that most strongly predicted each NEO-PI-R scale and grouped them together (with some careful rational inspection and modification, to be sure). That doesn’t mean there’s anything wrong with the IPIP measures; I’ve used them on multiple occasions myself, and have no complaints. They’re perfectly good measures that I think stand in really well for the (proprietary) originals. My point is just that the convergent correlations reported on the IPIP website are likely to be somewhat inflated relative to the truth.

The nice thing is that we can directly compare the AMBI (the measure I developed in my paper) with the IPIP version of the NEO-PI-R on a level footing by looking at the convergent correlations for the AMBI using only the training data. If you look at the validation (i.e., unbiased) estimates for the AMBI, which is what Sanjay’s talking about here, the mean convergent correlation for the 30 scales of the NEO-PI-R is .63, which is indeed much lower than the .73 reported for the IPIP version of the NEO-PI-R. Personally I’d still probably argue that .63 with 108 items is better than .73 with 300 items, but it’s a subjective question, and I wouldn’t disagree with anyone who preferred the latter. But again, the critical point is that this isn’t a fair comparison. If you make a fair comparison and look at the mean convergent correlation in the training data, it’s .69 for the AMBI, which is much closer to the IPIP data. Given that the AMBI version is just over 1/3rd the length of the IPIP version, I think the choice here becomes more clear-cut, and I doubt that there are many contexts where the (mean) difference between .69 and .73 would have meaningful practical implications.

It’s also worth remembering that nothing says you have to go with the 108-item measure I reported in the paper. The beauty of the GA approach is that you can quite easily generate a NEO-PI-R analog of any length you like. So if your goal isn’t so much to abbreviate the NEO-PI-R as to obtain a non-proprietary analog (and indeed, the IPIP version of the NEO-PI-R is actually longer than the NEO-PI-R, which contains 240 items), I think there’s a very good chance you could do better than the IPIP measure using substantially fewer than 300 items (but more than 108).

In fact, if you really had a lot of time on your hands, and wanted to test this question more thoroughly, what I think you’d want to do is run the GA with systematically varying item costs (i.e., you run the exact same procedure on the same data, but change the itemCost parameter a little bit each time). That way, you could actually plot out a curve showing you the degree of convergence with the original measure as a function of the length of the new measure (this is functionality I’d like to add to the GA code I released when I have the time, but probably not in the near future). I don’t really know what the sweet spot would be, but I can tell you from extensive experimentation that you get diminishing returns pretty quickly. In other words, I just don’t think you’re going to be able to get convergent correlations much higher than .7 on average (this only holds for the IPIP data, obviously; you might do much better using data collected over shorter timespans, or using subsets of items from the original measures). So in that sense, I like where I ended up (i.e., 108 items that still recapture the original quite well).

3. Ultimately I’d like to see a few substantive studies that run the GA-shortened scales alongside the original scales. The column-vector correlations that he reported were hard to evaluate — I’d like to see the actual predictions of behavior, not just summaries. But this seems like a promising approach.

[BTW, that last sentence is the key one. I’m looking forward to seeing more of what you and others can do with this approach.]

When I was writing the paper, I did initially want to include a supplementary figure showing the full-blown matrix of traits predicting the low-level behaviors Sanjay is alluding to (which are part of Goldberg’s massive dataset), but it seemed kind of daunting to present because there are 60 behavioral variables, and most of the correlations were very weak (not just for the AMBI measure–I mean they were weak for the original NEO-PI-R). So you would be looking at a 30 x 60 matrix full of mostly near-zero correlations, which seemed pretty uninformative. So to answer basically the same concern, what I did instead was show a supplementary figure showing a 30 x 5 matrix that captures the relation between the 30 facets of the NEO-PI-R and the Big Five as rated by participants’ peers (i.e., an independent measure of personality). Here’s that figure (click to enlarge):

big_five_peer

What I’m presenting is the same correlation matrix for three different versions of the NEO-PI-R: the AMBI version I generated (on the left), and the original (i.e., real) NEO-PI-R, for both the training and validation samples. The important point to note is that the pattern of correlations with an external set of criterion variables is very similar for all three measures. It isn’t identical of course, but you shouldn’t expect it to be. (In fact, if you look at the rightmost two columns, that gives you a sense of how you can get relatively different correlations even for exactly the same measure and subjects when the sample is randomly divided in two. That’s just sampling variability.) There are, in fairness, one or two blips where the AMBI version does something quite different (e..g, impulsiveness predicts peer-rated Conscientiousness for the AMBI version but not the other two). But overall, I feel pretty good about the AMBI measure when I look at this figure. I don’t think you’re losing very much in terms of predictive power or specificity, whereas I think you’re gaining a lot in time savings.

Having said all that, I couldn’t agree more with Sanjay’s final point, which is that the proof is really in the pudding (who came up with that expression? Bill Cosby?). I’ve learned the hard way that it’s really easy to come up with excellent theoretical and logical reasons for why something should or shouldn’t work, yet when you actually do the study to test your impeccable reasoning, the empirical results often surprise you, and then you’re forced to confront the reality that you’re actually quite dumb (and wrong). So it’s certainly possible that, for reasons I haven’t anticipated, something will go profoundly awry when people actually try to use these abbreviated measures in practice. And then I’ll have to delete this blog, change my name, and go into hiding. But I really don’t think that’s very likely. And I’m willing to stake a substantial chunk of my own time and energy on it (I’d gladly stake my reputation on it too, but I don’t really have one!); I’ve already started using these measures in my own studies–e.g., in a blogging study I’m conducting online here–with promising preliminary results. Ultimately, as with everything else, time will tell whether or not the effort is worth it.

abbreviating personality measures in R: a tutorial

A while back I blogged about a paper I wrote that uses genetic algorithms to abbreviate personality measures with minimal human intervention. In the paper, I promised to put the R code I used online, so that other people could download and use it. I put off doing that for a long time, because the code was pretty much spaghetti by the time the paper got accepted, and there are any number of things I’d rather do than spend a weekend rewriting my own code. But one of the unfortunate things about publicly saying that you’re going to do something is that you eventually have to do that something. So, since the paper was published in JRP last week, and several people have emailed me to ask for the code, I spent much of the weekend making the code presentable. It’s not a fully-formed R package yet, but it’s mostly legible, and seems to work more or less ok. You can download the file (gaabbreviate.R) here. The rest of this (very long) post is basically a tutorial on how to use the code, so you probably want to stop reading this now unless you have a burning interest in personality measurement.

Prerequisites and installation

Although you won’t need to know much R to follow this tutorial, you will need to have R installed on your system. Fortunately, R is freely available for all major operating systems. You’ll also need the genalg and psych packages for R, because gaabbreviate won’t run without them. Once you have R installed, you can download and install those packages like so:

install.packages(c(‘genalg’, ‘psych’))

Once that’s all done, you’re ready to load gaabbreviate.R:

source(“/path/to/the/file/gaabbreviate.R”)

…where you make sure to specify the right path to the location where you saved the file. And that’s it! Now you’re ready to abbreviate measures.

Reading in data

The file contains several interrelated functions, but the workhorse is gaa.abbreviate(), which takes a set of item scores and scale scores for a given personality measure as input and produces an abbreviated version of the measure, along with a bunch of other useful information. In theory, you can go from old data to new measure in a single line of R code, with almost no knowledge of R required (though I think it’s a much better idea to do it step-by-step and inspect the results at every stage to make sure you know what’s going on).

The abbreviation function is pretty particular about the format of the input it expects. It takes two separate matrices, one with item scores, the other with scale scores (a scale here just refers to any set of one or more items used to generate a composite score). Subjects are in rows, item or scale scores are in columns. So for example, let’s say you have data from 3 subjects, who filled out a personality measure that has two separate scales, each composed of two items. Your item score matrix might look like this:

3 5 1 1

2 2 4 1

2 4 5 5

…which you could assign in R like so:

items = matrix(c(3,2,2,5,2,2,1,4,5,1,1,5), ncol=3)

I.e., the first subject had scores of 3, 5, 1, and 1 on the four items, respectively; the second subject had scores of 2, 2, 4, and 1… and so on.

Based on the above, if you assume items 1 and 2 constitute one scale, and items 3 and 4 constitute the other, the scale score matrix would be:

8 2

4 5

6 10

Of course, real data will probably have hundreds of subjects, dozens of items, and a bunch of different scales, but that’s the basic format. Assuming you can get your data into an R matrix or data frame, you can feed it directly to gaa.abbreviate() and it will hopefully crunch your data without complaining. But if you don’t want to import your data into R before passing it to the code, you can also pass filenames as arguments instead of matrices. For example:

gaa = gaa.abbreviate(items=”someFileWithItemScores.txt”, scales=”someFileWithScaleScores.txt”, iters=100)

If you pass files instead of data, the referenced text files must be tab-delimited, with subjects in rows, item/scale scores in columns, and a header row that gives the names of the columns (i.e., item names and scale names; these can just be numbers if you like, but they have to be there). Subject identifiers should not be in the files.

Key parameters: stuff you should set every time

Assuming you can get gaabbreviate to read in your data, you can then set about getting it to abbreviate your measure by selecting a subset of items that retain as much of the variance in the original scales as possible. There are a few parameters you’ll need to set; some are mandatory, others aren’t, but should really be specified anyway since the defaults aren’t likely to work well for different applications.

The most important (and mandatory) argument is iters, which is the number of iterations you want the GA to run for. If you pick too high a number, the GA may take a very long time to run if you have a very long measure; if you pick too low a number, you’re going to get a crappy solution. I think iters=100 is a reasonable place to start, though in practice, obtaining a stable solution tends to require several hundred iterations. The good news (which I cover in more detail below) is that you can take the output you get from the abbreviation function and feed it right back in as many times as you want, so it’s not like you need to choose the number of iterations carefully or anything.

The other two key parameters are itemCost and maxItems. The itemCost is what determines the degree to which your measure is compressed. If you want a detailed explanation of how this works, see the definition of the cost function in the paper. Very briefly, the GA tries to optimize the trade-off between number of items and amount of variance explained. Generally speaking, the point of abbreviating a measure is to maximize the amount of explained variance (in the original scale scores) while minimizing the number of items retained. Unfortunately, you can’t do both very well at the same time, because any time you drop an item, you’re also losing its variance. So the trick is to pick a reasonable compromise: a measure that’s relatively short and still does a decent job recapturing the original. The itemCost parameter is what determines the length of that measure. When you set it high, the GA will place a premium on brevity, resulting in a shorter (but less accurate) measure; when you set it low, it’ll allow a longer measure that maximizes fidelity. The optimal itemCost will vary depending on your data, but I find 0.05 is a good place to start, and then you can tweak it to get measures with more or fewer items as you see fit.

The maxItems parameter sets the upper bound on the number of items that will be used to score each scale. The default is 5, but you may find this number too small if you’re trying to abbreviate scales comprised of a large number of items. Again, it’s worth playing around with this to see what happens. Generally speaks, the same trade-off between brevity and fidelity discussed above holds here too.

Given reasonable values for the above arguments, you should be able to feed in raw data and get out an abbreviated measure with minimal work. Assuming you’re reading your data from a file, the entire stream can be as simple as:

gaa = gaa.abbreviate(items=”someFileWithItemScores.txt”, scales=”someFileWithScaleScores.txt”, iters=100, itemCost=0.05, maxItems=5, writeFile=’outputfile.txt’)

That’s it! Assuming your data are in the correct format (and if they’re not, the script will probably crash with a nasty error message), gaabbreviate will do its thing and produce your new, shorter measure within a few minutes or hours, depending on the size of the initial measure. The writeFile argument is optional, and gives the name of an output file you want the measure saved to. If you don’t specify it, the output will be assigned to the gaa object in the above call (note the “gaa = ” part of the call), but won’t be written to file. But that’s not a problem, because you can always achieve the same effect later by calling the gaa.writeMeasure function (e.g., in the above example, gaa.writeMeasure(gaa, file=”outputfile.txt”) would achieve exactly the same thing).

Other important options

Although you don’t really need to do anything else to produce abbreviated measures, I strongly recommend reading the rest of this document and exploring some of the other options if you’re planning to use the code, because some features are non-obvious. Also, the code isn’t foolproof, and it can do weird things with your data if you’re not paying attention. For one thing, by default, gaabbreviate will choke on missing values (i.e., NAs). You can do two things to get around this: either enable pairwise processing (pairwise=T), or turn on mean imputation (impute=T). I say you can do these things, but I strongly recommend against using either option. If you have missing values in your data, it’s really a much better idea to figure out how to deal with those missing values before you run the abbreviation function, because the abbreviation function is dumb, and it isn’t going to tell you whether pairwise analysis or imputation is a sensible thing to do. For example, if you have 100 subjects with varying degrees of missing data, and only have, say, 20 subjects’ scores for some scales, the resulting abbreviated measure is going to be based on only 20 subjects’ worth of data for some scales if you turn pairwise processing on. Similarly, imputing the mean for missing values is a pretty crude way to handle missing data, and I only put it in so that people who just wanted to experiment with the code wouldn’t have to go to the trouble of doing it themselves. But in general, you’re much better off reading your item and scale scores into R (or SPSS, or any other package), processing any missing values in some reasonable way, and then feeding gaabbreviate the processed data.

Another important point to note is that, by default, gaabbreviate will cross-validate its results. What that means is that only half of your data will be used to generate an abbreviated measure; the other half will be used to provide unbiased estimates of how well the abbreviation process worked. There’s an obvious trade-off here. If you use the split-half cross-validation approach, you’re going to get more accurate estimates of how well the abbreviation process is really working, but the fit itself might be slightly poorer because you have less data. Conversely, if you turn cross-validation off (crossVal=F), you’re going to be using all of your data in the abbreviation process, but the resulting estimates of the quality of the solution will inevitably be biased because you’re going to be capitalizing on chance to some extent.

In practice, I recommend always leaving cross-validation enabled, unless you either (a) really don’t care about quality control (which makes you a bad person), or (b) have a very small sample size, and can’t afford to leave out half of the data in the abbreviation process (in which case you should consider collecting more data). My experience has been that with 200+ subjects, you generally tend to see stable solutions even when leaving cross-validation on, though that’s really just a crude rule of thumb that I’m pulling out of my ass, and larger samples are always better.

Other less important options

There are a bunch other less important options that I won’t cover in any detail here, but that are reasonably well-covered in the comments in the source file if you’re so inclined. Some of these are used to control the genetic algorithm used in the abbreviation process. The gaa.abbreviate function doesn’t actually do the heavy lifting itself; instead, it relies on the genalg library to run the actual genetic algorithm. Although the default genalg parameters will work fine 95% of the time, if you really want to manually set the size of the population or the ratio of initial zeros to ones, you can pass those arguments directly. But there’s relatively little reason to play with these parameters, because you can always achieve more or less the same ends simply by adding iterations.

Two other potentially useful options I won’t touch on, though they’re there if you want them, give you the ability to (a) set a minimum bound on the correlation required in order for an item to be included in the scoring equation for a scale (the minR argument), and (b) apply non-unit weightings to the scales (the sWeights argument), in cases where you want to emphasize some scales at the cost of others (i.e., because you want to measure some scales more accurately).

Two examples

The following two examples assume you’re feeding in item and scale matrices named myItems and myScales, respectively:

example1

This will run a genetic algorithm for 500 generations on mean-imputed data with cross-validation turned off, and assign the result to a variable named my.new.shorter.measure. It will probably produce an only slightly shorter measure, because the itemCost is low and up to 10 items are allowed to load on each scale.

example2

This will run 100 iterations with cross-validation enabled (the default, so we don’t need to specify it explicitly) and write the result to a file named shortMeasure.txt. It’ll probably produce a highly abbreviated measure, because the itemCost is relatively high. It also assigns more weight (twice as much, in fact) to the fourth and fifth scales in the measure relative to the first three, as reflected in the sWeights argument (a vector where the ith element indicates the weight of the ith scale in the measure, so presumably there are five scales in this case).

The gaa object

Assuming you’ve read this far, you’re probably wondering what you get for your trouble once you’ve run the abbreviation function. The answer is that you get… a gaa (which stands for GA Abbreviate) object. The gaa object contains almost all the information that was used at any point in the processing, which you can peruse at your leisure. If you’re familiar with R, you’ll know that you can see what’s in the object with the attributes function. For example, if you assigned the result of the abbreviation function to a variable named ‘myMeasure’, here’s what you’d see:

attributes

The gaa object has several internal lists (data, settings, results, etc.), each of which in turn contains several other variables. I’ve tried to give these sensible names. In brief:

  • data contains all the data used to create the measure (i.e., the item and scale scores you fed in)
  • settings contains all the arguments you specified when you called the abbreviation function (e.g., iters, maxItems, etc.)
  • results contains variables summarizing the results of the GA run, including information about each previous iteration of the GA
  • best contains information about the single best measure produced (this is generally not useful, and is for internal purposes)
  • rbga is the rbga.bin object produced by the genetic library (for more information, see the genalg library documentation)
  • measure is what you’ll probably find most important, as it contains the details of the final measure that was produced

To see the contents of each of these lists in turn, you can easily inspect them:

measure

So the ‘measure’ attribute in the gaa object contains a bunch of other variables with information about the resulting measure. And here’s a brief summary:

  • items: a vector containing the numerical ID of items retained in the final measure relative to the original list (e.g., if you fed in 100 items, and the ‘items’ variable contained the numbers 4, 10, 14… that’s the GA telling you that it decided to keep items no. 4, 10, 14, etc., from the original set of 100 items).
  • nItems: the number of items in the final measure.
  • key: a scoring key for the new measure, where the rows are items on the new measure, and the columns are the scales. This key is compatible with score.items() in Bill Revelle’s excellent psych package, which means that once you’ve got the key, you can automatically score data for the new measure simply by calling score.items() (see the documentation for more details), and don’t need to do any manual calculating or programming yourself.
  • ccTraining and ccValidation: convergent correlations for the training and validation halves of the data, respectively. The convergent correlation is the correlation between the new scale scores (i.e., those that you get using the newly-generated measure) and the “real” scale scores. The ith element in the vector gives you the convergent correlation for the ith scale in your original measure. The validation coefficients will almost invariably be lower than the training coefficients, and the validation numbers are the ones you should trust as an unbiased estimate of the quality of the measure.
  • alpha: coefficient alpha for each scale. Note that you should expect to get lower internal consistency estimates for GA-produced measures than you’re used to, and this is actually a good thing. If you want to know why, read the discussion in the paper.
  • nScaleItems: a vector containing the number of items used to score each scale. If you left minR set to 0, this will always be identical to maxItems for all items. If you raised minR, the number of items will sometimes be lower (i.e., in cases where there were very few items that showed a strong enough correlation to be retained).

Just give me the measure already!

Supposing you’re not really interested in plumbing the depths of the gaa object or working within R more than is necessary, you might just be wondering what the quickest way to get an abbreviated measure you can work with is. In that case, all you really need to do is pass a filename in the writeFile argument when you call gaa.abbreviate (see the examples given above), and you’ll get out a plain text file that contains all the essential details of the new measure. Specifically you’ll get (a) a mapping from old items to new, so that you can figure out which items are included in the new measure (e.g., a line like “4 45” means that the 4th item on the new measure is no. 45 in the original set of items), and (b) a human-readable scoring key for each scale (the only thing to note here is that an “R” next to an item indicates the item is reverse-keyed), along with key statistics (coefficient alpha and convergent correlations for the training and validation halves). So if all goes well, you really won’t need to do anything else in R beyond call that one line that makes the measure. But again, I’d strongly encourage you to carefully inspect the gaa object in R to make sure everything looks right. The fact that the abbreviation process is fully automated isn’t a reason to completely suspend all rational criteria you’d normally use when developing a scale; it just means you probably have to do substantially less work to get a measure you’re happy with.

Killing time…

Depending on how big your dataset is (actually, mainly the number of items in the original measure), how many iterations you’ve requested, and how fast your computer is, you could be waiting a long time for the abbreviation function to finish its work. Because you probably want to know what the hell is going on internally during that time, I’ve provided a rudimentary monitoring display that will show you the current state of the genetic algorithm after every iteration. It looks like this (click for a larger version of the image):

display

This is admittedly a pretty confusing display, and Edward Tufte would probably murder several kittens if he saw it, but it’s not supposed to be a work of art, just to provide some basic information while you’re sitting there twiddling your thumbs (ok, ok, I promise I’ll label the panels better when I have the time to work on it). But basically, it shows you three things. The leftmost three panels show you the basic information about the best measure produced by the GA as it evolves across generations. Respectively, the top, middle,and bottom panels show you the total cost, measure length, and mean variance explained (R^2) as a function of iteration. The total cost can only ever go down, but the length and R^2 can go up or down (though there will tend to be a consistent trajectory for measure length that depends largely on what itemCost you specified).

The middle panel shows you detailed information about how well the GA-produced measure captures variance in each of the scales in the original measure. In this case, I’m abbreviating the 30 facets of the NEO-PI-R. The red dot displays the amount of variance explained in each trait, as of the current iteration.

Finally, the rightmost panel shows you a graphical representation of which items are included in the best measure identified by the GA at each iteration.Each row represents one iteration (i.e., you’re seeing the display as it appears after 200 iterations of a 250-iteration run); black bars represent items that weren’t included, white bars represent items that were included. The point of this display isn’t to actually tell you which items are being kept (you can’t possibly glean that level of information at this resolution), but rather, to give you a sense of how stable the solution is. If you look at the the first few (i.e., topmost) iterations, you’ll see that the solution is very unstable: the GA is choosing very different items as the “best” measure on each iteration. But after a while, as the GA “settles” into a neighborhood, the solution stabilizes and you see only relatively small (though still meaningful) changes from generation to generation. Basically, once the line in the top left panel (total cost) has asymptoted, and the solution in the rightmost panel is no longer changing much if at all, you know that you’ve probably arrived at as good a solution as you’re going to get.

Incidentally, if you use the generic plot() method on a completed gaa object (e.g., plot(myMeasure)), you’ll get exactly the same figure you see here, with the exception that the middle figure will also have black points plotted alongside the red ones.  The black points show you the amount of variance explained in each trait for the cross-validated results. If you’re lucky, the red and black points will be almost on top of each other; if you’re not, the black ones will be considerably to the left of the red ones .

Consider recycling

The last thing I’ll mention, which I already alluded to earlier, is that you can recycle gaa objects. That’s to say, suppose you ran the abbreviation for 100 iterations, only to get back a solution that’s still clearly suboptimal (i.e., the cost function is still dropping rapidly). Rather than having to start all over again, you can simply feed the gaa object back into the abbreviation function in order to run further iterations. And you don’t need to specify any additional parameters (assuming you want to run the same number of iterations you did last time; otherwise you’ll need to specify iters); all of the settings are contained within the gaa object itself. So, assuming you ran the abbreviation function and stored the result in ‘myMeasure’, you can simply do:

myMeasure = gaa.abbreviate(myMeasure, iters=200)

and you’ll get an updated version of the measure that’s had the benefit of an extra 200 iterations. And of course, you can save and load R objects to/from files, so that you don’t need to worry about all of your work disappearing next time you start R. So save(myMeasure, ‘filename.txt’) will save your gaa object for future use, and the next time you need it, you can call myMeasure = load(‘filename.txt’) to get it back (alternatively, you can just save the entire workspace).

Anyway, I think that covers all of the important stuff. There are a few other things I haven’t documented here, but if you’ve read this far, and have gotten the code to work in R, you should be able to start abbreviating your own measures relatively painlessly. If you do use the code to generate shorter measures, and end up with measures you’re happy with, I’d love to hear about it. And if you can’t get the code to work, or can get it to work but are finding issues with the code or the results, I guess I’ll grudgingly accept those emails too. In general, I’m happy to provide support for the code via email provided I have the time. The caveat is that, if you’re new to R, and are having problems with basic things like installing packages or loading files from source, you should really read a tutorial or reference that introduces you to R (Quick-R is my favorite place to start) before emailing me with problems. But if you’re having problems that are specific to the gaabbreviate code (e.g., you’re getting a weird error message, or aren’t sure what something means), feel free to drop me a line and I’ll try to respond as soon as I can.

fMRI becomes big, big science

There are probably lots of criteria you could use to determine the relative importance of different scientific disciplines, but the one I like best is the Largest Number of Authors on a Paper. Physicists have long had their hundred-authored papers (see for example this individual here; be sure to click on the “show all authors/affiliations” link), and with the initial sequencing and analysis of the human genome, which involved contributions from 452 different persons, molecular geneticists also joined the ranks of Officially Big Science. Meanwhile, us cognitive neuroscientists have long had to content ourselves with silly little papers that have only four to seven authors (maybe a dozen on a really good day). Which means, despite the pretty pictures we get to put in our papers, we’ve long had this inferiority complex about our work, and a nagging suspicion that it doesn’t really qualify as big science (full disclosure: so when I say “we”, I probably just mean “I”).

UNTIL NOW.

Thanks to the efforts of Bharat Biswal and 53 collaborators (yes, I counted) reported in a recent paper in PNAS, fMRI is now officially Big, Big Science. Granted, 54 authors is still small potatoes in physics-and-biology-land. And for all I know, there could be other fMRI papers with even larger author lists out there that I’ve missed.  BUT THAT’S NOT THE POINT. The point is, people like me now get to run around and say we do something important.

You might think I’m being insincere here, and that I’m really poking fun at ridiculously long author lists that couldn’t possibly reflect meaningful contributions from that many people. Well, I’m not. While I’m not seriously suggesting that the mark of good science is how many authors are on the paper, I really do think that the prevalence of long author lists in a discipline are an important sign of a discipline’s maturity, and that the fact that you can get several dozen contributors to a single paper means you’re seeing a level of collaboration across different labs that previously didn’t exist.

The importance of large-scale collaboration is one of the central elements of the new PNAS article, which is appropriately entitled Toward discovery science of human brain function. What Biswal et al have done is compile the largest publicly-accessible fMRI dataset on the planet, consisting of over 1,400 scans from 35 different centers. All of the data, along with some tools for analysis, are freely available for download from NITRC. Be warned though: you’re probably going to need a couple of terabytes of free space if you want to download the entire dataset.

You might be wondering why no one’s assembled an fMRI dataset of this scope until now; after all, fMRI isn’t that new a technique, having been around for about 20 years now. The answer (or at least, one answer) is that it’s not so easy–and often flatly impossible–to combine raw fMRI datasets in any straightforward way. The problem is that the results of any given fMRI study only really make sense in the context of a particular experimental design. Functional MRI typically measures the change in signal associated with some particular task, which means that you can’t really go about combining the results of studies of phonological processing with those of thermal pain and obtain anything meaningful (actually, this isn’t entirely true; there’s a movement afoot to create image-based centralized databases that will afford meta-analyses on an even more massive scale,  but that’s a post for another time). You need to ensure that the tasks people performed across different sites are at least roughly in the same ballpark.

What allowed Biswal et al  to consolidate datasets to such a degree is that they focused exclusively on one particular kind of cognitive task. Or rather, they focused on a non-task: all 1400+ scans in the 1000 Functional Connectomes Project (as they’re calling it) are from participants being scanned during the “resting state”. The resting state is just what it sounds like: participants are scanned while they’re just resting; usually they’re given no specific instructions other than to lie still, relax, and not fall asleep. The typical finding is that, when you contrast this resting state with activation during virtually any kind of goal-directed processing, you get widespread activation increases in a network that’s come to be referred to as the “default” or “task-negative” network (in reference to the fact that it’s maximally active when people are in their “default” state).

One of the main (and increasingly important) applications of resting state fMRI data is in functional connectivity analyses, which aim to identify patterns of coactivation across different regions rather than mean-level changes associated with some task. The fundamental idea is that you can get a lot of traction on how the brain operates by studying how different brain regions interact with one another spontaneously over time, without having to impose an external task set. The newly released data is ideal for this kind of exploration, since you have a simply massive dataset that includes participants from all over the world scanned in a range of different settings using different scanners. So if you want to explore the functional architecture of the human brain during the resting state, this should really be your one-stop shop. (In fact, I’m tempted to say that there’s going to be much less incentive for people to collect resting-state data from now on, since there really isn’t much you’re going to learn from one sample of 20 – 30 people that you can’t learn from 1,400 people from 35+ combined samples).

Aside from introducing the dataset to the literature, Biswal et al also report a number of new findings. One neat finding is that functional parcellation of the brain using seed-based connectivity (i.e., identifying brain regions that coactivate with a particular “seed” or target region) shows marked consistency across different sites, revealing what Biswal et al call a “universal architecture”. This type of approach by itself isn’t particularly novel, as similar techniques have been used before. Bt no one’s done it on anything approaching this scale. Here’s what the results look like:

You can see that different seeds produce difference functional parcellations across the brain (the brighter areas denote ostensive boundaries).

Another interesting finding is the presence of gender and age differences in functional connectivity:

What this image shows is differences in functional connectivity with specific seed regions (the black dots) as a function of age (left) or gender (right). (The three rows reflect different techniques for producing the maps, with the upshot being that the results are very similar regardless of exactly how you do the analysis.) It isn’t often you get to see scatterplots with 1,400+ points in cognitive neuroscience, so this is a welcome sight. Although it’s also worth pointing out the inevitable downside of having huge sample sizes, which is that even tiny effects attain statistical significance. Which is to say, while the above findings are undoubtedly more representative of gender and age differences in functional connectivity than anything else you’re going to see for a long time, notice that they’re they’re very small effects (e.g., in the right panels, you can see that the differences between men and women are only a fraction of a standard deviation in size, despite the fact that these regions are probably selected because they show some of the “strongest” effects). That’s not meant as a criticism; it’s actually a very good thing, in that these modest effects are probably much closer to the truth than what previous studies have reported. Such findings should serve as an important reminder that most of the effects identified by fMRI studies are almost certainly massively inflated by small sample size (as I’ve discussed before here and in this paper).

Anyway, the bottom line is that if you’ve ever thought to yourself, “gee, I wish I could do cutting-edge fMRI research, but I really don’t want to leave my house to get a PhD; it’s almost lunchtime,” this is your big chance. You can download the data, rejoice in the magic that is the resting state, and bathe yourself freely in functional connectivity. The Biswal et al paper bills itself as “a watershed event in functional imaging,” and it’s hard to argue otherwise. Researchers now have a definitive data set to use for analyses of functional connectivity and the resting state, as well as a model for what other similar data sets might look like in the future.

More importantly, with 54 authors on the paper, fMRI is now officially big science. Prepare to suck it, Human Genome Project!

ResearchBlogging.orgBiswal, B., Mennes, M., Zuo, X., Gohel, S., Kelly, C., Smith, S., Beckmann, C., Adelstein, J., Buckner, R., Colcombe, S., Dogonowski, A., Ernst, M., Fair, D., Hampson, M., Hoptman, M., Hyde, J., Kiviniemi, V., Kotter, R., Li, S., Lin, C., Lowe, M., Mackay, C., Madden, D., Madsen, K., Margulies, D., Mayberg, H., McMahon, K., Monk, C., Mostofsky, S., Nagel, B., Pekar, J., Peltier, S., Petersen, S., Riedl, V., Rombouts, S., Rypma, B., Schlaggar, B., Schmidt, S., Seidler, R., Siegle, G., Sorg, C., Teng, G., Veijola, J., Villringer, A., Walter, M., Wang, L., Weng, X., Whitfield-Gabrieli, S., Williamson, P., Windischberger, C., Zang, Y., Zhang, H., Castellanos, F., & Milham, M. (2010). Toward discovery science of human brain function Proceedings of the National Academy of Sciences, 107 (10), 4734-4739 DOI: 10.1073/pnas.0911855107

better tools for mining the scientific literature

Freethinker’s Asylum has a great post reviewing a number of tools designed to help researchers mine the scientific literature–an increasingly daunting task. The impetus for the post is this article in the latest issue of Nature (note: restricted access), but the FA post discusses a lot of tools that the Nature article doesn’t, and focuses in particular on websites that are currently active and publicly accessible, rather than on proprietary tools currently under development in dark basement labs and warehouses. I hadn’t seen most of these before, but am looking forward to trying them out–e.g., pubget:

When you create an account, pubget signs in to your institution and allows you to search the subscribed resources. When you find a reference you want, just click the pdf icon and there it is. No clicking through to content provider websites. You can tag references as “keepers” to come back to them later, or search for the newest articles from a particular journal.

Sounds pretty handy…

Many of the other sites–as well as most of those discussed in the Nature article–focus on data and literature mining in specific fields, e.g., PubGene and PubAnatomy. These services, which allow you to use specific keywords or topics (e.g., specific genes) to constrain literature searches, aren’t very useful to me personally. But it’s worth pointing out that there are some emerging services that fill much the same niche in the world of cognitive neuroscience that I’m more familiar with. The one that currently looks most promising, in my opinion, is the Cognitive Atlas project led by Russ Poldrack, which is “a collaborative knowledge building project that aims to develop a knowledge base (or ontology) that characterizes the state of current thought in cognitive science. … The Cognitive Atlas aims to capture knowledge from users with expertise in psychology, cognitive science, and neuroscience.”

The Cognitive Atlas is officially still in beta, and you need to have a background in cognitive neuroscience in order to sign up to contribute. But there’s already some content you can navigate, and the site, despite being in the early stages of development, is already pretty impressive. In the interest of full disclosure, as well as shameless plugging, I should note that Russ will be giving a talk about the Cognitive Atlas project as part of a symposium I’m chairing at CNS in Montreal this year. So if you want to learn more about it, stop by! Meantime, check out the Freethinker’s Asylum post for links to all sorts of other interesting tools…