I’ve recently started recruiting participants for online experiments via Mechanical Turk. In the past I’ve always either relied on directory listings (like this one) or targeted specific populations (e.g., bloggers and twitterers) via email solicitation. But recently I’ve started running a very large-sample decision-making study (it’s here, if you care to contribute to the sample), and waiting for participants to trickle in via directories isn’t cutting it. So I’ve started paying people (very) small amounts of money for participation.
One challenge I’ve had to deal with is figuring out how to filter out participants who aren’t really interested in contributing to science, and are strictly in it for the money. 20 or 30 cents is a pittance to most people in the developed world, but as I’ve found out the hard way, gaming MTurk appears to be a thriving business in some developing countries (some of which I’ve unfortunately had to resort to banning entirely). Cheaters aren’t so much of an issue for very quick tasks like providing individual ratings of faces, because (a) the time it takes to give a fake rating isn’t substantially greater than giving one’s actual opinion, and (b) the standards for what counts as accurate performance are clear, so it’s easy to train workers and weed out the bad apples. Unfortunately, my studies generally involve fairly long personality questionnaires combined with other cognitive tasks (e.g., in the current study, you get to repeatedly allocate hypothetical money between yourself and a computer partner, and rate some faces). They often take around half an hour, and involve 20+ questions per screen, so there’s a pretty big incentive for workers who are only in it for the cash to produce random responses and try to increase their effective wage. And the obvious question then is how to detect cheating in the data.
One of the techniques I’ve found works surprisingly well is to simply compare each person’s pattern of responses across items with the mean for the entire sample. In other words, you just compute the correlation between each individual’s item scores and the item means computed across everyone who’s filled out the same measure. I know that there’s an entire literature on this stuff full of much more sophisticated ways to detect random responding, but I find this crude approach really does quite well (I’ve verified this by comparing it with a bunch of other similar metrics), and it has the benefit of being trivial to implement.
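Just to make that concrete, here’s a minimal sketch of the whole check in R (the variable names and the flagging cutoff are mine, purely for illustration):

# 'responses' is assumed to be a subjects x items matrix of item scores
item.means = colMeans(responses, na.rm = TRUE)
# correlate each person's responses with the sample's item means
person.cors = apply(responses, 1, function(r) cor(r, item.means, use = "pairwise.complete.obs"))
# flag potential random responders; the 0.2 cutoff is arbitrary
suspects = which(person.cors < 0.2)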
Anyway, one of the things that surprised me when I first computed these correlations is just how strong the relationship between the sample mean and most individuals’ responses is. Here’s what the distribution looks like for one particular inventory, the 181-item Analog to Multiple Broadband Inventories (AMBI, which I introduced in this paper, and discuss further here):
This is based on a sample of about 600 internet respondents, which actually turns out to be pretty representative of the broader population, as Sam Gosling, Simine Vazire, and Sanjay Srivastava will tell you (for what it’s worth, I’ve done the exact same analysis on a similar-sized off-line dataset from Lew Goldberg’s Eugene-Springfield Community Sample (check out that URL!) and obtained essentially the same results). In this sample, the median correlation is .48; so, in effect, you can predict a quarter of the variance in a typical participant’s responses without knowing anything at all about them. Human beings, it turns out, have some things in common with one another (who knew?). What you think you’re like is probably not very dissimilar to what I think I’m like. Which is kind of surprising, considering you’re a well-adjusted, friendly human being, and I’m a somewhat eccentric, paranoid kind of guy.
What drives that similarity? Much of it probably has to do with social desirability–i.e., many of the AMBI items (and those on virtually all personality inventories) are evaluatively positive or negative statements that most people are inclined to strongly agree or disagree with. But it seems to be a particular kind of social desirability–one that has to do with openness to new experiences, particularly intellectual ones. For instance, here are the top 10 most endorsed items (based on mean Likert scores across the entire sample; scores are in parentheses):
like to read (4.62)
like to visit new places (4.39)
was a better than average student when I was in school (4.28)
am a good listener (4.25)
would love to explore strange places (4.22)
am concerned about others (4.20)
am open to new experiences (4.18)
amuse my friends (4.16)
love excitement (4.08)
spend a lot of time reading (4.07)
And conversely, here are the 10 least-endorsed items:
was a slow learner in school (1.52)
don’t think that laws apply to me (1.80)
do not like to visit museums (1.83)
have difficulty imagining things (1.84)
have no special urge to do something original (1.87)
do not like art (1.95)
feel little concern for others (1.97)
don’t try to figure myself out (2.01)
break my promises (2.01)
make enemies (2.06)
You can see a clear evaluative component in both lists: almost everyone believes that they’re concerned about others and thinks that they’re smarter than average. But social desirability and positive illusions aren’t enough to explain these patterns, because there are plenty of other items on the AMBI that have an equally strong evaluative component–for instance, “don’t have much energy”, “cannot imagine lying or cheating”, “see myself as a good leader”, and “am easily annoyed”–yet have mean scores pretty close to the midpoint (in fact, the item ‘am easily annoyed’ is endorsed more highly than 107 of the 181 items!). So it isn’t just that we like to think and say nice things about ourselves; we’re willing to concede that we have some bad traits, but maybe not the ones that have to do with disliking cultural and intellectual experiences. I don’t have much of an idea as to why that might be, but it does introspectively feel to me like there’s more of a stigma about, say, not liking to visit new places or experience new things than admitting that you’re kind of an irritable person. Or maybe it’s just that many of the openness items can be interpreted more broadly than the other evaluative items–e.g., there are lots of different art forms, so almost everyone can endorse a generic “I like art” statement. I don’t really know.
Anyway, there’s nothing the least bit profound about any of this; if anything, it’s just a nice reminder that most of us are not really very good at evaluating where we stand in relation to other people, at least for many traits (for more on that, go read Simine Vazire’s work). The nominal midpoint on most personality scales is usually quite far from the actual median in the general population. This is a pretty big challenge for personality psychology, and if we could figure out how to get people to rank themselves more accurately relative to other people on self-report measures, that would be a pretty huge advance. But it seems quite likely that you just can’t do it, because people simply may not have introspective access to that kind of information.
Fortunately for our ability to measure individual differences in personality, there are plenty of items that do show considerable variance across individuals (actually, in fairness, even items with relatively low variance like the ones above can be highly discriminative if used properly–that’s what item response theory is for). Just for kicks, here are the 10 AMBI items with the largest standard deviations (in parentheses):
disliked math in school (1.56)
wanted to run away from home when I was a child (1.56)
believe in a universal power or god (1.53)
have felt contact with a divine power (1.51)
rarely cry during sad movies (1.46)
am able to fix electrical-wiring problems (1.46)
am devoted to religion (1.44)
shout or scream when I’m angry (1.43)
love large parties (1.42)
felt close to my parents when I was a child (1.42)
So now finally we come to the real moral of this post… that which you’ve read all this long way for. And the moral is this, grasshopper: if you want to successfully pick a fight at a large party, all you need to do is angrily yell at everyone that God told you math sucks.
Blogging will be slow(er than normal) for the next couple of weeks. On Wednesday I’m off on a long-awaited Grand Tour of Canada, 2010 edition. The official purpose of the trip is the CNS meeting in Montreal, but seeing as I’m from Canada and most of my family is in Toronto and Ottawa, I’ll be tacking on a few days of R&R at either end of the trip, so I’ll be gone for 10 days. By R&R I mean that I’ll be spending most of my time in Toronto at cheap all-you-can-eat sushi restaurants, and most of my time in Ottawa sleeping in till noon in my mom’s basement. So really, I guess my plan for the next two weeks is to turn seventeen again.
While I’m in Ottawa, I’ll also be giving a talk at Carleton University. I’d like to lump this under the “invited talks” section of my vita–you know, just to make myself seem slightly more important (being invited somewhere means people actually want to hear you say stuff!)–but I’m not sure it counts as “invited” if you invite yourself to give a talk somewhere else. Which is basically what happened; I did my undergraduate degree at Carleton, so when I emailed my honors thesis advisor to ask if I could give a talk when I was in town, he probably felt compelled to humor me, much as I know he’d secretly like to say no (sorry John!). At any rate, the talk will be closely based on this paper on the relation between personality and word use among bloggers. Amazingly enough, it turns out you can learn something (but not that much) about people from what they write on their blogs. It’s not the most exciting conclusion in the world, but I think there are some interesting results hidden away in there somewhere. If you happen to come across any of them, let me know.
In the comments on my last post, Sanjay Srivastava had some excellent thoughts/concerns about the general approach of automating measure abbreviation using a genetic algorithm. They’re valid concerns that might come up for other people too, so I thought I’d discuss them here in more detail. Here’s Sanjay:
Lew Goldberg emailed me a copy of your paper a while back and asked what I thought of it. I’m pasting my response below — I’d be curious to hear your take on it. (In this email “he” is you and “you” is he because I was writing to Lew…)
1. So this is what it feels like to be replaced by a machine.
I’m not sure if Sanjay thinks this is a good or a bad thing? I guess my own feeling is that it’s a good thing to the extent that it makes personality measurement more efficient and frees researchers up to use that time (both during data collection and measure development) for other productive things, like writing papers (or eating M&M’s on the couch and devising a diabolically clever April Fools’ joke for next year to make up for the one you forgot to pull this year), and a bad one to the extent that people take this as a license to stop thinking carefully about what they’re doing when they’re shortening or administering questionnaire measures. But provided people retain a measure of skepticism and cautiousness in applying this type of approach, I’m optimistic that the result will be a large net gain.
2. The convergent correlations were a little low in studies 2 and 3. You’d expect shortened scales to have less reliability and validity, of course, but that didn’t go all the way in covering the difference. He explained that this was because the AMBI scales draw on a different item pool than the proprietary measures, which makes sense. However, that makes it hard to evaluate the utility of the approach. If you compare how the full IPIP facet scales correlate with the proprietary NEO (which you’ve published here: http://ipip.ori.org/newNEO_FacetsTable.htm) against his Table 2, for example, it looks like the shortening algorithm is losing some information. Whether that’s better or worse than a rationally shortened scale is hard to say.
This is an excellent point, and I do want to reiterate that the abbreviation process isn’t magic; you can’t get something for free, and you’re almost invariably going to lose some fidelity in your measurement when you shorten any measure. That said, I actually feel pretty good about the degree of convergence I report in the paper. Sanjay already mentions one reason the convergent correlations seem lower than what you might expect: the new measures are composed of different items than the old ones, so they’re not going to share many of the same sources of error. That means the convergent correlations will necessarily be lower, but isn’t necessarily a problem in a broader sense. But I think there are also two other, arguably more important, reasons why the convergence might seem deceptively low.
One is that the degree of convergence is bounded by the test-retest reliability of the original measures. Because the items in the IPIP pools were administered in batches spanning about a decade, whereas each of the proprietary measures (e.g., the NEO-PI-R) was administered on a single occasion, the net result is that many of the items being used to predict personality traits were actually filled out several years before or after the personality measures in question. If you look at the long-term test-retest reliability of some of the measures I abbreviated (and there actually isn’t all that much test-retest data of that sort out there), it’s not clear that it’s much higher than what I report, even for the original measures. In other words, if you don’t generally see test-retest correlations across several years greater than .6 – .8 for the real NEO-PI-R scales, you can’t really expect to do any better with an abbreviated measure. But that probably says more about the reliability of narrowly-defined personality traits than about the abbreviation process.
The other reason the convergent correlations seem lower than you might expect, which I actually think is the big one, is that I reported only the cross-validated coefficients in the paper. In other words, I used only half of the data to abbreviate measures like the NEO-PI-R and HEXACO-PI, and then used the other half to obtain unbiased estimates of the true degree of convergence. This is technically the right way to do things, because if you don’t cross-validate, you’re inevitably going to capitalize on chance. If you fit a model to a particular set of data, and then use the very same data to ask the question “how well does the model fit the data?” you’re essentially cheating–or, to put it more mildly, your estimates are going to be decidedly “optimistic”. You could argue it’s a relatively benign kind of cheating, because almost everyone does it, but that doesn’t make it okay from a technical standpoint.
When you look at it this way, the comparison of the IPIP representation of the NEO-PI-R with the abbreviated representation of the NEO-PI-R I generated in my paper isn’t really a fair one, because the IPIP measure Lew Goldberg came up with wasn’t cross-validated. Lew simply took the ten items that most strongly predicted each NEO-PI-R scale and grouped them together (with some careful rational inspection and modification, to be sure). That doesn’t mean there’s anything wrong with the IPIP measures; I’ve used them on multiple occasions myself, and have no complaints. They’re perfectly good measures that I think stand in really well for the (proprietary) originals. My point is just that the convergent correlations reported on the IPIP website are likely to be somewhat inflated relative to the truth.
The nice thing is that we can directly compare the AMBI (the measure I developed in my paper) with the IPIP version of the NEO-PI-R on a level footing by looking at the convergent correlations for the AMBI using only the training data. If you look at the validation (i.e., unbiased) estimates for the AMBI, which is what Sanjay’s talking about here, the mean convergent correlation for the 30 scales of the NEO-PI-R is .63, which is indeed much lower than the .73 reported for the IPIP version of the NEO-PI-R. Personally I’d still probably argue that .63 with 108 items is better than .73 with 300 items, but it’s a subjective question, and I wouldn’t disagree with anyone who preferred the latter. But again, the critical point is that this isn’t a fair comparison. If you make a fair comparison and look at the mean convergent correlation in the training data, it’s .69 for the AMBI, which is much closer to the IPIP data. Given that the AMBI version is just over 1/3rd the length of the IPIP version, I think the choice here becomes more clear-cut, and I doubt that there are many contexts where the (mean) difference between .69 and .73 would have meaningful practical implications.
It’s also worth remembering that nothing says you have to go with the 108-item measure I reported in the paper. The beauty of the GA approach is that you can quite easily generate a NEO-PI-R analog of any length you like. So if your goal isn’t so much to abbreviate the NEO-PI-R as to obtain a non-proprietary analog (and indeed, the IPIP version of the NEO-PI-R is actually longer than the NEO-PI-R, which contains 240 items), I think there’s a very good chance you could do better than the IPIP measure using substantially fewer than 300 items (but more than 108).
In fact, if you really had a lot of time on your hands, and wanted to test this question more thoroughly, what I think you’d want to do is run the GA with systematically varying item costs (i.e., you run the exact same procedure on the same data, but change the itemCost parameter a little bit each time). That way, you could actually plot out a curve showing you the degree of convergence with the original measure as a function of the length of the new measure (this is functionality I’d like to add to the GA code I released when I have the time, but probably not in the near future). I don’t really know what the sweet spot would be, but I can tell you from extensive experimentation that you get diminishing returns pretty quickly. In other words, I just don’t think you’re going to be able to get convergent correlations much higher than .7 on average (this only holds for the IPIP data, obviously; you might do much better using data collected over shorter timespans, or using subsets of items from the original measures). So in that sense, I like where I ended up (i.e., 108 items that still recapture the original quite well).
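(If you wanted to try this yourself with the code described in the tutorial post below, a rough sketch might look like the following; the data-argument names and the range of cost values are my own illustration, not a prescription.)

# run the same abbreviation repeatedly while sweeping the item cost
costs = seq(0.02, 0.20, by = 0.02)
runs = lapply(costs, function(cost)
    gaa.abbreviate(items = myItems, scales = myScales, iters = 300, itemCost = cost))
# then, for each run, extract the measure length and mean convergent
# correlation, and plot one against the other to find the sweet spot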
3. Ultimately I’d like to see a few substantive studies that run the GA-shortened scales alongside the original scales. The column-vector correlations that he reported were hard to evaluate — I’d like to see the actual predictions of behavior, not just summaries. But this seems like a promising approach.
[BTW, that last sentence is the key one. I’m looking forward to seeing more of what you and others can do with this approach.]
When I was writing the paper, I did initially want to include a supplementary figure showing the full-blown matrix of traits predicting the low-level behaviors Sanjay is alluding to (which are part of Goldberg’s massive dataset), but it seemed kind of daunting to present because there are 60 behavioral variables, and most of the correlations were very weak (not just for the AMBI measure–I mean they were weak for the original NEO-PI-R). So you would be looking at a 30 x 60 matrix full of mostly near-zero correlations, which seemed pretty uninformative. To address basically the same concern, what I did instead was present a supplementary figure showing a 30 x 5 matrix that captures the relation between the 30 facets of the NEO-PI-R and the Big Five as rated by participants’ peers (i.e., an independent measure of personality). Here’s that figure:
What I’m presenting is the same correlation matrix for three different versions of the NEO-PI-R: the AMBI version I generated (on the left), and the original (i.e., real) NEO-PI-R, for both the training and validation samples. The important point to note is that the pattern of correlations with an external set of criterion variables is very similar for all three measures. It isn’t identical of course, but you shouldn’t expect it to be. (In fact, if you look at the rightmost two columns, that gives you a sense of how you can get relatively different correlations even for exactly the same measure and subjects when the sample is randomly divided in two. That’s just sampling variability.) There are, in fairness, one or two blips where the AMBI version does something quite different (e.g., impulsiveness predicts peer-rated Conscientiousness for the AMBI version but not the other two). But overall, I feel pretty good about the AMBI measure when I look at this figure. I don’t think you’re losing very much in terms of predictive power or specificity, whereas I think you’re gaining a lot in time savings.
Having said all that, I couldn’t agree more with Sanjay’s final point, which is that the proof is really in the pudding (who came up with that expression? Bill Cosby?). I’ve learned the hard way that it’s really easy to come up with excellent theoretical and logical reasons for why something should or shouldn’t work, yet when you actually do the study to test your impeccable reasoning, the empirical results often surprise you, and then you’re forced to confront the reality that you’re actually quite dumb (and wrong). So it’s certainly possible that, for reasons I haven’t anticipated, something will go profoundly awry when people actually try to use these abbreviated measures in practice. And then I’ll have to delete this blog, change my name, and go into hiding. But I really don’t think that’s very likely. And I’m willing to stake a substantial chunk of my own time and energy on it (I’d gladly stake my reputation on it too, but I don’t really have one!); I’ve already started using these measures in my own studies–e.g., in a blogging study I’m conducting online here–with promising preliminary results. Ultimately, as with everything else, time will tell whether or not the effort is worth it.
A while back I blogged about a paper I wrote that uses genetic algorithms to abbreviate personality measures with minimal human intervention. In the paper, I promised to put the R code I used online, so that other people could download and use it. I put off doing that for a long time, because the code was pretty much spaghetti by the time the paper got accepted, and there are any number of things I’d rather do than spend a weekend rewriting my own code. But one of the unfortunate things about publicly saying that you’re going to do something is that you eventually have to do that something. So, since the paper was published in JRP last week, and several people have emailed me to ask for the code, I spent much of the weekend making the code presentable. It’s not a fully-formed R package yet, but it’s mostly legible, and seems to work more or less ok. You can download the file (gaabbreviate.R) here. The rest of this (very long) post is basically a tutorial on how to use the code, so you probably want to stop reading this now unless you have a burning interest in personality measurement.
Prerequisites and installation
Although you won’t need to know much R to follow this tutorial, you will need to have R installed on your system. Fortunately, R is freely available for all major operating systems. You’ll also need the genalg and psych packages for R, because gaabbreviate won’t run without them. Once you have R installed, you can download and install those packages like so:
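install.packages(c("genalg", "psych"))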
Once that’s all done, you’re ready to load gaabbreviate.R:
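source("/path/to/gaabbreviate.R")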
…where you make sure to specify the right path to the location where you saved the file. And that’s it! Now you’re ready to abbreviate measures.
Reading in data
The file contains several interrelated functions, but the workhorse is gaa.abbreviate(), which takes a set of item scores and scale scores for a given personality measure as input and produces an abbreviated version of the measure, along with a bunch of other useful information. In theory, you can go from old data to new measure in a single line of R code, with almost no knowledge of R required (though I think it’s a much better idea to do it step-by-step and inspect the results at every stage to make sure you know what’s going on).
The abbreviation function is pretty particular about the format of the input it expects. It takes two separate matrices, one with item scores, the other with scale scores (a scale here just refers to any set of one or more items used to generate a composite score). Subjects are in rows, item or scale scores are in columns. So for example, let’s say you have data from 3 subjects, who filled out a personality measure that has two separate scales, each composed of two items. Your item score matrix might look like this:
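# 3 subjects x 4 items; the third row is invented for illustration,
# since only the first two subjects' scores are spelled out below
myItems = rbind(c(3, 5, 1, 1),
                c(2, 2, 4, 1),
                c(4, 3, 2, 5))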
I.e., the first subject had scores of 3, 5, 1, and 1 on the four items, respectively; the second subject had scores of 2, 2, 4, and 1… and so on.
Based on the above, if you assume items 1 and 2 constitute one scale, and items 3 and 4 constitute the other, the scale score matrix would be:
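# assuming, for illustration, that scales are scored as simple sums of their items
myScales = cbind(rowSums(myItems[, 1:2]), rowSums(myItems[, 3:4]))
# with the example data above, this gives rows of (8, 2), (4, 5), and (7, 7)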
Of course, real data will probably have hundreds of subjects, dozens of items, and a bunch of different scales, but that’s the basic format. Assuming you can get your data into an R matrix or data frame, you can feed it directly to gaa.abbreviate() and it will hopefully crunch your data without complaining. But if you don’t want to import your data into R before passing it to the code, you can also pass filenames as arguments instead of matrices. For example:
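# note: the names of the data-file arguments here are my own guess, purely
# for illustration; check the comments in gaabbreviate.R for the real ones
gaa = gaa.abbreviate(items = "myItems.txt", scales = "myScales.txt", iters = 100)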
If you pass files instead of data, the referenced text files must be tab-delimited, with subjects in rows, item/scale scores in columns, and a header row that gives the names of the columns (i.e., item names and scale names; these can just be numbers if you like, but they have to be there). Subject identifiers should not be in the files.
Key parameters: stuff you should set every time
Assuming you can get gaabbreviate to read in your data, you can then set about getting it to abbreviate your measure by selecting a subset of items that retain as much of the variance in the original scales as possible. There are a few parameters you’ll need to set; some are mandatory, others aren’t, but should really be specified anyway since the defaults aren’t likely to work well for different applications.
The most important (and mandatory) argument is iters, which is the number of iterations you want the GA to run for. If you pick too high a number, the GA may take a very long time to run if you have a very long measure; if you pick too low a number, you’re going to get a crappy solution. I think iters=100 is a reasonable place to start, though in practice, obtaining a stable solution tends to require several hundred iterations. The good news (which I cover in more detail below) is that you can take the output you get from the abbreviation function and feed it right back in as many times as you want, so it’s not like you need to choose the number of iterations carefully or anything.
The other two key parameters are itemCost and maxItems. The itemCost is what determines the degree to which your measure is compressed. If you want a detailed explanation of how this works, see the definition of the cost function in the paper. Very briefly, the GA tries to optimize the trade-off between number of items and amount of variance explained. Generally speaking, the point of abbreviating a measure is to maximize the amount of explained variance (in the original scale scores) while minimizing the number of items retained. Unfortunately, you can’t do both very well at the same time, because any time you drop an item, you’re also losing its variance. So the trick is to pick a reasonable compromise: a measure that’s relatively short and still does a decent job recapturing the original. The itemCost parameter is what determines the length of that measure. When you set it high, the GA will place a premium on brevity, resulting in a shorter (but less accurate) measure; when you set it low, it’ll allow a longer measure that maximizes fidelity. The optimal itemCost will vary depending on your data, but I find 0.05 is a good place to start, and then you can tweak it to get measures with more or fewer items as you see fit.
The maxItems parameter sets the upper bound on the number of items that will be used to score each scale. The default is 5, but you may find this number too small if you’re trying to abbreviate scales composed of a large number of items. Again, it’s worth playing around with this to see what happens. Generally speaking, the same trade-off between brevity and fidelity discussed above holds here too.
Given reasonable values for the above arguments, you should be able to feed in raw data and get out an abbreviated measure with minimal work. Assuming you’re reading your data from a file, the entire stream can be as simple as:
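# a single call from raw data files to finished measure (again, the
# data-argument names are my own illustration; iters, itemCost, and
# maxItems use the starting values suggested above)
gaa = gaa.abbreviate(items = "items.txt", scales = "scales.txt", iters = 100,
    itemCost = 0.05, maxItems = 5, writeFile = "outputfile.txt")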
That’s it! Assuming your data are in the correct format (and if they’re not, the script will probably crash with a nasty error message), gaabbreviate will do its thing and produce your new, shorter measure within a few minutes or hours, depending on the size of the initial measure. The writeFile argument is optional, and gives the name of an output file you want the measure saved to. If you don’t specify it, the output will be assigned to the gaa object in the above call (note the “gaa = ” part of the call), but won’t be written to file. But that’s not a problem, because you can always achieve the same effect later by calling the gaa.writeMeasure function (e.g., in the above example, gaa.writeMeasure(gaa, file="outputfile.txt") would achieve exactly the same thing).
Other important options
Although you don’t really need to do anything else to produce abbreviated measures, I strongly recommend reading the rest of this document and exploring some of the other options if you’re planning to use the code, because some features are non-obvious. Also, the code isn’t foolproof, and it can do weird things with your data if you’re not paying attention. For one thing, by default, gaabbreviate will choke on missing values (i.e., NAs). You can do two things to get around this: either enable pairwise processing (pairwise=T), or turn on mean imputation (impute=T). I say you can do these things, but I strongly recommend against using either option. If you have missing values in your data, it’s really a much better idea to figure out how to deal with those missing values before you run the abbreviation function, because the abbreviation function is dumb, and it isn’t going to tell you whether pairwise analysis or imputation is a sensible thing to do. For example, if you have 100 subjects with varying degrees of missing data, and only have, say, 20 subjects’ scores for some scales, the resulting abbreviated measure is going to be based on only 20 subjects’ worth of data for some scales if you turn pairwise processing on. Similarly, imputing the mean for missing values is a pretty crude way to handle missing data, and I only put it in so that people who just wanted to experiment with the code wouldn’t have to go to the trouble of doing it themselves. But in general, you’re much better off reading your item and scale scores into R (or SPSS, or any other package), processing any missing values in some reasonable way, and then feeding gaabbreviate the processed data.
Another important point to note is that, by default, gaabbreviate will cross-validate its results. What that means is that only half of your data will be used to generate an abbreviated measure; the other half will be used to provide unbiased estimates of how well the abbreviation process worked. There’s an obvious trade-off here. If you use the split-half cross-validation approach, you’re going to get more accurate estimates of how well the abbreviation process is really working, but the fit itself might be slightly poorer because you have less data. Conversely, if you turn cross-validation off (crossVal=F), you’re going to be using all of your data in the abbreviation process, but the resulting estimates of the quality of the solution will inevitably be biased because you’re going to be capitalizing on chance to some extent.
In practice, I recommend always leaving cross-validation enabled, unless you either (a) really don’t care about quality control (which makes you a bad person), or (b) have a very small sample size, and can’t afford to leave out half of the data in the abbreviation process (in which case you should consider collecting more data). My experience has been that with 200+ subjects, you generally tend to see stable solutions even when leaving cross-validation on, though that’s really just a crude rule of thumb that I’m pulling out of my ass, and larger samples are always better.
Other less important options
There are a bunch of other less important options that I won’t cover in any detail here, but that are reasonably well-covered in the comments in the source file if you’re so inclined. Some of these are used to control the genetic algorithm used in the abbreviation process. The gaa.abbreviate function doesn’t actually do the heavy lifting itself; instead, it relies on the genalg library to run the actual genetic algorithm. Although the default genalg parameters will work fine 95% of the time, if you really want to manually set the size of the population or the ratio of initial zeros to ones, you can pass those arguments directly. But there’s relatively little reason to play with these parameters, because you can always achieve more or less the same ends simply by adding iterations.
Two other potentially useful options I won’t touch on, though they’re there if you want them, give you the ability to (a) set a minimum bound on the correlation required in order for an item to be included in the scoring equation for a scale (the minR argument), and (b) apply non-unit weightings to the scales (the sWeights argument), in cases where you want to emphasize some scales at the cost of others (i.e., because you want to measure some scales more accurately).
The following two examples assume you’re feeding in item and scale matrices named myItems and myScales, respectively:
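# Example 1 (the specific itemCost value is just an illustrative "low" setting):
my.new.shorter.measure = gaa.abbreviate(items = myItems, scales = myScales,
    iters = 500, impute = T, crossVal = F, itemCost = 0.01, maxItems = 10)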
This will run a genetic algorithm for 500 generations on mean-imputed data with cross-validation turned off, and assign the result to a variable named my.new.shorter.measure. It will probably produce an only slightly shorter measure, because the itemCost is low and up to 10 items are allowed to load on each scale.
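# Example 2 (the specific itemCost value is just an illustrative "high" setting):
gaa = gaa.abbreviate(items = myItems, scales = myScales, iters = 100,
    itemCost = 0.25, sWeights = c(1, 1, 1, 2, 2), writeFile = "shortMeasure.txt")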
This will run 100 iterations with cross-validation enabled (the default, so we don’t need to specify it explicitly) and write the result to a file named shortMeasure.txt. It’ll probably produce a highly abbreviated measure, because the itemCost is relatively high. It also assigns more weight (twice as much, in fact) to the fourth and fifth scales in the measure relative to the first three, as reflected in the sWeights argument (a vector where the ith element indicates the weight of the ith scale in the measure, so presumably there are five scales in this case).
The gaa object
Assuming you’ve read this far, you’re probably wondering what you get for your trouble once you’ve run the abbreviation function. The answer is that you get… a gaa (which stands for GA Abbreviate) object. The gaa object contains almost all the information that was used at any point in the processing, which you can peruse at your leisure. If you’re familiar with R, you’ll know that you can see what’s in the object with the attributes function. For example, if you assigned the result of the abbreviation function to a variable named ‘myMeasure’, here’s what you’d see:
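attributes(myMeasure)
# $names
# [1] "data"     "settings" "results"  "best"     "rbga"     "measure"
# (output reconstructed from the component descriptions below;
# the exact formatting on your screen may differ)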
The gaa object has several internal lists (data, settings, results, etc.), each of which in turn contains several other variables. I’ve tried to give these sensible names. In brief:
data contains all the data used to create the measure (i.e., the item and scale scores you fed in)
settings contains all the arguments you specified when you called the abbreviation function (e.g., iters, maxItems, etc.)
results contains variables summarizing the results of the GA run, including information about each previous iteration of the GA
best contains information about the single best measure produced (this is generally not useful, and is for internal purposes)
rbga is the rbga.bin object produced by the genalg library (for more information, see the genalg documentation)
measure is what you’ll probably find most important, as it contains the details of the final measure that was produced
To see the contents of each of these lists in turn, you can easily inspect them:
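attributes(myMeasure$measure)
# $names
# [1] "items"        "nItems"       "key"          "ccTraining"
# [5] "ccValidation" "alpha"        "nScaleItems"
# (again, reconstructed from the descriptions that follow; formatting may differ)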
So the ‘measure’ attribute in the gaa object contains a bunch of other variables with information about the resulting measure. And here’s a brief summary:
items: a vector containing the numerical ID of items retained in the final measure relative to the original list (e.g., if you fed in 100 items, and the ‘items’ variable contained the numbers 4, 10, 14… that’s the GA telling you that it decided to keep items no. 4, 10, 14, etc., from the original set of 100 items).
nItems: the number of items in the final measure.
key: a scoring key for the new measure, where the rows are items on the new measure, and the columns are the scales. This key is compatible with score.items() in Bill Revelle’s excellent psych package, which means that once you’ve got the key, you can automatically score data for the new measure simply by calling score.items() (see the documentation for more details), and don’t need to do any manual calculating or programming yourself (there’s a brief scoring sketch just after this list).
ccTraining and ccValidation: convergent correlations for the training and validation halves of the data, respectively. The convergent correlation is the correlation between the new scale scores (i.e., those that you get using the newly-generated measure) and the “real” scale scores. The ith element in the vector gives you the convergent correlation for the ith scale in your original measure. The validation coefficients will almost invariably be lower than the training coefficients, and the validation numbers are the ones you should trust as an unbiased estimate of the quality of the measure.
alpha: coefficient alpha for each scale. Note that you should expect to get lower internal consistency estimates for GA-produced measures than you’re used to, and this is actually a good thing. If you want to know why, read the discussion in the paper.
nScaleItems: a vector containing the number of items used to score each scale. If you left minR set to 0, this will always be identical to maxItems for all scales. If you raised minR, the number of items will sometimes be lower (i.e., in cases where there were very few items that showed a strong enough correlation to be retained).
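As promised above, here’s a minimal scoring sketch using the key (newData is a placeholder for whatever item-score matrix you’ve collected with the new measure):

library(psych)
scored = score.items(myMeasure$measure$key, newData)
scored$scores  # subjects x scales matrix of composite scale scores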
Just give me the measure already!
Supposing you’re not really interested in plumbing the depths of the gaa object or working within R more than is necessary, you might just be wondering what the quickest way to get an abbreviated measure you can work with is. In that case, all you really need to do is pass a filename in the writeFile argument when you call gaa.abbreviate (see the examples given above), and you’ll get out a plain text file that contains all the essential details of the new measure. Specifically you’ll get (a) a mapping from old items to new, so that you can figure out which items are included in the new measure (e.g., a line like “4 45” means that the 4th item on the new measure is no. 45 in the original set of items), and (b) a human-readable scoring key for each scale (the only thing to note here is that an “R” next to an item indicates the item is reverse-keyed), along with key statistics (coefficient alpha and convergent correlations for the training and validation halves). So if all goes well, you really won’t need to do anything else in R beyond call that one line that makes the measure. But again, I’d strongly encourage you to carefully inspect the gaa object in R to make sure everything looks right. The fact that the abbreviation process is fully automated isn’t a reason to completely suspend all rational criteria you’d normally use when developing a scale; it just means you probably have to do substantially less work to get a measure you’re happy with.
Depending on how big your dataset is (actually, mainly the number of items in the original measure), how many iterations you’ve requested, and how fast your computer is, you could be waiting a long time for the abbreviation function to finish its work. Because you probably want to know what the hell is going on internally during that time, I’ve provided a rudimentary monitoring display that will show you the current state of the genetic algorithm after every iteration. It looks like this:
This is admittedly a pretty confusing display, and Edward Tufte would probably murder several kittens if he saw it, but it’s not supposed to be a work of art, just to provide some basic information while you’re sitting there twiddling your thumbs (ok, ok, I promise I’ll label the panels better when I have the time to work on it). But basically, it shows you three things. The leftmost three panels show you the basic information about the best measure produced by the GA as it evolves across generations. Respectively, the top, middle, and bottom panels show you the total cost, measure length, and mean variance explained (R^2) as a function of iteration. The total cost can only ever go down, but the length and R^2 can go up or down (though there will tend to be a consistent trajectory for measure length that depends largely on what itemCost you specified).
The middle panel shows you detailed information about how well the GA-produced measure captures variance in each of the scales in the original measure. In this case, I’m abbreviating the 30 facets of the NEO-PI-R. The red dot displays the amount of variance explained in each trait, as of the current iteration.
Finally, the rightmost panel shows you a graphical representation of which items are included in the best measure identified by the GA at each iteration. Each row represents one iteration (i.e., you’re seeing the display as it appears after 200 iterations of a 250-iteration run); black bars represent items that weren’t included, white bars represent items that were included. The point of this display isn’t to actually tell you which items are being kept (you can’t possibly glean that level of information at this resolution), but rather, to give you a sense of how stable the solution is. If you look at the first few (i.e., topmost) iterations, you’ll see that the solution is very unstable: the GA is choosing very different items as the “best” measure on each iteration. But after a while, as the GA “settles” into a neighborhood, the solution stabilizes and you see only relatively small (though still meaningful) changes from generation to generation. Basically, once the line in the top left panel (total cost) has asymptoted, and the solution in the rightmost panel is no longer changing much if at all, you know that you’ve probably arrived at as good a solution as you’re going to get.
Incidentally, if you use the generic plot() method on a completed gaa object (e.g., plot(myMeasure)), you’ll get exactly the same figure you see here, with the exception that the middle figure will also have black points plotted alongside the red ones. The black points show you the amount of variance explained in each trait for the cross-validated results. If you’re lucky, the red and black points will be almost on top of each other; if you’re not, the black ones will be considerably to the left of the red ones.
The last thing I’ll mention, which I already alluded to earlier, is that you can recycle gaa objects. That’s to say, suppose you ran the abbreviation for 100 iterations, only to get back a solution that’s still clearly suboptimal (i.e., the cost function is still dropping rapidly). Rather than having to start all over again, you can simply feed the gaa object back into the abbreviation function in order to run further iterations. And you don’t need to specify any additional parameters (assuming you want to run the same number of iterations you did last time; otherwise you’ll need to specify iters); all of the settings are contained within the gaa object itself. So, assuming you ran the abbreviation function and stored the result in ‘myMeasure’, you can simply do:
myMeasure = gaa.abbreviate(myMeasure, iters=200)
and you’ll get an updated version of the measure that’s had the benefit of an extra 200 iterations. And of course, you can save and load R objects to/from files, so that you don’t need to worry about all of your work disappearing next time you start R. So save(myMeasure, file='filename.txt') will save your gaa object for future use, and the next time you need it, calling load('filename.txt') will restore it to your workspace under its original name (alternatively, you can just save the entire workspace).
Anyway, I think that covers all of the important stuff. There are a few other things I haven’t documented here, but if you’ve read this far, and have gotten the code to work in R, you should be able to start abbreviating your own measures relatively painlessly. If you do use the code to generate shorter measures, and end up with measures you’re happy with, I’d love to hear about it. And if you can’t get the code to work, or can get it to work but are finding issues with the code or the results, I guess I’ll grudgingly accept those emails too. In general, I’m happy to provide support for the code via email provided I have the time. The caveat is that, if you’re new to R, and are having problems with basic things like installing packages or loading files from source, you should really read a tutorial or reference that introduces you to R (Quick-R is my favorite place to start) before emailing me with problems. But if you’re having problems that are specific to the gaabbreviate code (e.g., you’re getting a weird error message, or aren’t sure what something means), feel free to drop me a line and I’ll try to respond as soon as I can.
One of the frustrating things about personality research–for both researchers and participants–is that personality is usually measured using self-report questionnaires, and filling out self-report questionnaires can take a very long time. It doesn’t have to take a very long time, mind you; some questionnaires are very short, like the widely-used Ten-Item Personality Inventory (TIPI), which might take you a whole 2 minutes to fill out on a bad day. So you can measure personality quickly if you have to. But more often than not, researchers want to reliably measure a broad range of different personality traits, and that typically requires administering one or more long-ish questionnaires. For example, in my studies, I often give participants a battery of measures to fill out that includes some combination of the NEO-PI-R, EPQ-R, BIS/BAS scales, UPPS, GRAPES, BDI, TMAS, STAI, and a number of others. That’s a large set of acronyms, and yet it’s just a small fraction of what’s out there; every personality psychologist has his or her own set of favorite measures, and at personality conferences, duels-to-the-death often break out over silly things like whether measure X is better than measure Y, or whether measures A and B can be used interchangeably when no one’s looking. Personality measurement is a pretty intense sport.
The trouble with the way we usually measure personality is that it’s wildly inefficient, for two reasons. One is that many measures are much longer than they need to be. It’s not uncommon to see measures that score each personality trait using a dozen or more different items. In theory, the benefit of this type of redundancy is that you get a more reliable measure, because the error terms associated with individual items tend to cancel out. For example, if you want to know if I’m a depressive kind of guy, you shouldn’t just ask me, “hey, are you depressed?”, because lots of random factors could influence my answer to that one question. Instead, you should ask me a bunch of different questions, like: “hey, are you depressed?” and “why so glum, chum?”, and “does somebody need a hug?”. Adding up responses from multiple items is generally going to give you a more reliable measure. But in practice, it turns out that you typically don’t need more than a handful of items to measure most traits reliably. When people develop “short forms” of measures, the abbreviated scales often have just 4 – 5 items per trait, usually with relatively little loss of reliability and validity. So the fact that most of the measures we use have so many items on them is sort of a waste of both researchers’ and participants’ time.
The other reason personality measurement is inefficient is that most researchers recognize that different personality measures tend to measure related aspects of personality, and yet we persist in administering a whole bunch of questionnaires with similar content to our participants. If you’ve ever participated in a psychology experiment that involved filling out personality questionnaires, there’s a good chance you’ve wondered whether you’re just filling out the same questionnaire over and over. Well you are–kind of. Because the space of personality variation is limited (people can only differ from one another in so many ways), and because many personality constructs have complex interrelationships with one another, personality measures usually end up asking similarly-worded questions. So for example, one measure might give you Extraversion and Agreeableness scores whereas another gives you Dominance and Affiliation scores. But then it turns out that the former pair of dimensions can be “rotated” into the latter two; it’s just a matter of how you partition (or label) the variance. So really, when a researcher gives his or her participants a dozen measures to fill out, that’s not because anyone thinks that there are really a dozen completely different sets of traits to measure; it’s more because we recognize that each instrument gives you a slightly different take on personality, and we tend to think that having multiple potential viewpoints is generally a good thing.
Inefficient personality measurement isn’t inevitable; as I’ve already alluded to above, a number of researchers have developed abbreviated versions of common inventories that capture most of the same variance as much longer instruments. Probably the best-known example is the aforementioned TIPI, developed by Sam Gosling and colleagues, which gives you a workable index of people’s relative standing on the so-called Big Five dimensions of personality. But there are relatively few such abbreviated measures. And to the best of my knowledge, the ones that do exist are all focused on abbreviating a single personality measure. That’s unfortunate, because if you believe that most personality inventories have a substantial amount of overlap, it follows that you should be able to recapture scores on multiple different personality inventories using just one set of (non-redundant) items.
That’s exactly what I try to demonstrate in a paper to be published in the Journal of Research in Personality. The article’s entitled “The abbreviation of personality: How to measure 200 personality scales in 200 items”, which is a pretty accurate, if admittedly somewhat grandiose, description of the contents. The basic goal of the paper is two-fold. First, I develop an automated method for abbreviating personality inventories (or really, any kind of measure with multiple items and/or dimensions). The idea here is to shorten the time and effort required in order to generate shorter versions of existing measures, which should hopefully encourage more researchers to create such short forms. The approach I develop relies heavily on genetic algorithms, which are tools for programmatically obtaining high-quality solutions to high-dimensional problems using simple evolutionary principles. I won’t go into the details (read the paper if you want them!), but I think it works quite well. In the first two studies reported in the paper (data for which were very generously provided by Sam Gosling and Lew Goldberg, respectively), I show that you can reduce the length of existing measures (using the Big Five Inventory and the NEO-PI-R as two examples) quite dramatically with minimal loss of validity. It only takes a few minutes to generate the abbreviated measures, so in theory, it should be possible to build up a database of abbreviated versions of many different measures. I’ve started to put together a site that might eventually serve that purpose (shortermeasures.com), but it’s still in the preliminary stages of development, and may or may not get off the ground.
The other main goal of the paper is to show that the same general approach can be applied to simultaneously abbreviate more than one different measure. To make the strongest case I could think of, I took 8 different broadband personality inventories (“broadband” here just means they each measure a relatively large number of personality traits) that collectively comprise 203 different personality scales and 2,091 different items. Using the same genetic algorithm-based approach, I then reduce these 8 measures down to a single inventory that contains only 181 items (hence the title of the paper). I named the inventory the AMBI (Analog to Multiple Broadband Inventories), and it’s now freely available for use (items and scoring keys are provided both in the paper and at shortermeasures.com). It’s certainly not perfect–it does a much better job capturing some scales than others–but if you have limited time available for personality measures, and still want a reasonably comprehensive survey of different traits, I think it does a really nice job. Certainly, I’d argue it’s better than having to administer many hundreds (if not thousands) of different items to achieve the same effect. So if you have about 15 – 20 minutes to spare in a study and want some personality data, please consider trying out the AMBI!
Yarkoni, T. (2010). The abbreviation of personality, or how to measure 200 personality scales with 200 items. Journal of Research in Personality. DOI: 10.1016/j.jrp.2010.01.002