estimating bias in text with Ruby

Over the past couple of months, I’ve been working on and off on a collaboration with my good friend Nick Holtzman and some other folks that focuses on ways to automatically extract bias from text using a vector space model. The paper is still in progress, so I won’t give much away here, except to say that Nick’s figured out what I think is a pretty clever way to show that, yes, Fox likes Republicans more than Democrats, and MSNBC likes Democrats more than Republicans. It’s not meant to be a surprising result, but simply a nice validation of the underlying method, which can be flexibly applied to all sorts of interesting questions.

The model we’re using is a simplified variant of Jones and Mewhort’s (2007) BEAGLE model. Essentially, similarity between words is quantified by looking at the degree to which words have similar co-occurrence patterns with other words. This basic idea is actually common to pretty much all vector space models, so in that sense, there’s not much new here (there’s plenty that’s new in Jones and Mewhort (2007), but we’re mostly leaving those features out for the sake of simplicity and computational speed). The novel aspect is the contrast coding of similarity terms in order to produce bias estimates. But you’ll have to wait for the paper to read more about that.

In the meantime, one thing we’ve tried to do is develop software that can be used to easily implement the kind of analyses we describe in the paper. With plenty of input from Nick and Mike Jones, I’ve written a set of tools in Ruby that’s now freely available for download here. The tools are actually bundled as a Ruby gem, so installation should be a snap on most platforms. We’re still working on documentation, so there’s no full-blown manual yet, but the quick-start guide should be sufficient to get many users up and running. And for people who share my love of Ruby and are interested in using the tools programmatically, there’s a fairly well-commented RDoc.

The code should really be considered an alpha release at the moment; I’m sure there are plenty of bugs (if you find any, email me!), and the feature set is currently pretty limited. Hopefully it’ll grow over time. I also plan to throw the code up on GitHub at some point in the near future so that anyone who’s interested can help out with the development. In the meantime, if you’re interested in semantic space models and want to play around with a crude (but relatively fast) implementation of one, there’s a (very) small chance you might find these tools useful.