Category Archives: fun with data

estimating the influence of a tweet–now with 33% more causal inference!

Twitter is kind of a big deal. Not just out there in the world at large, but also in the research community, which loves the kind of structured metadata you can retrieve for every tweet. A lot of researchers rely heavily on twitter to model social networks, information propagation, persuasion, and all kinds of interesting things. For example, here’s the abstract of a nice recent paper on arXiv that aims to  predict successful memes using network and community structure:

We investigate the predictability of successful memes using their early spreading patterns in the underlying social networks. We propose and analyze a comprehensive set of features and develop an accurate model to predict future popularity of a meme given its early spreading patterns. Our paper provides the first comprehensive comparison of existing predictive frameworks. We categorize our features into three groups: influence of early adopters, community concentration, and characteristics of adoption time series. We find that features based on community structure are the most powerful predictors of future success. We also find that early popularity of a meme is not a good predictor of its future popularity, contrary to common belief. Our methods outperform other approaches, particularly in the task of detecting very popular or unpopular memes.

One limitation of much of this body of research is that the data are almost invariably observational. We can build sophisticated models that do a good job predicting some future outcome (like meme success), but we don’t necessarily know that the “important” features we identify carry any causal influence. In principle, they could be completely epiphenomenal–for example, in the study I linked to, maybe the community structure features are just a proxy for some other, causally important, factor (e.g., whether the content of a meme has sufficiently broad appeal to attract attention from many different kinds of people). From a predictive standpoint, this may not matter much; if your goal is just to passively predict whether a meme is going to be successful or not, it’s irrelevant whether or not the features you’re using are doing causal work. On the other hand, if you want to actively design memes in such a way as to maximize their spread, the ability to get a handle on causation starts to look pretty important.

How can we estimate the direct causal influence of a tweet on the downstream popularity of a meme? Here’s a simple and (I suspect) very feasible idea in two steps:

  1. Create a small web app that allows any existing Twitter user to register via Twitter authentication. On signing up, a user has to specify just one (optional) setting: the proportion of their intended retweets they’re willing to withhold. Let’s this the Withholding Fraction (WF).
  2. Every time (or at least some of the time) a registered user wants to retweet a particular tweet*, they do so via the new web app’s interface (which has permission to post to the user’s Twitter account) instead of whatever interface they’re currently using. The key is that the retweet isn’t just obediently passed along; instead, the target tweet is retweeted successfully with probability (1 – WF), and randomly suppressed from the user’s stream with probability (WF).

Doing this  would allow the community to very quickly (assuming rapid adoption, which seems reasonably likely) build up an enormous database of tweets that were targeted for retweeting by an active user, but randomly assigned to fail with some known probability. Researchers would then be able to directly quantify the causal impact of individual retweets on downstream popularity–and to estimate that influence conditional on all of the other standard variables, like the retweeter’s number of followers, the content of the tweet, etc. Of course, this still wouldn’t get us to true experimental manipulation of such features (i.e., we wouldn’t be manipulating users’ follower networks, just randomly omitting tweets from users with different followers), but it seems like a step in the right direction**.

I figure building a barebones app like this would take an experienced developer familiar with the Twitter OAuth API just a day or two. And I suspect many people (myself included!) would be happy to contribute to this kind of experiment, provided that all of the resulting data were made public. (I’m aware that there are all kinds of restrictions on sharing assembled Twitter datasets, but we’re not talking about sharing firehose dumps here, just a restricted set of retweets from users who’ve explicitly given their consent to have the data used in this way.)

Has this kind of thing already been done? If not, does anyone want to build it?

 

* It doesn’t just have to be retweets, of course; the same principle would work just as well for withholding a random fraction of original tweets. But I suspect not many users would be willing to randomly eliminate a proportion of their original content from the firehose.

** If we really wanted to get close to true random assignment, we could potentially inject selected tweets into random users streams based on selected criteria. But I’m not sure how many tweeps would consent to have entirely random retweets published in their name (I probably wouldn’t), so this probably isn’t viable.

what do you get when you put 1,000 psychologists together in one journal?

I’m working on a TOP SEKKRIT* project involving large-scale data mining of the psychology literature. I don’t have anything to say about the TOP SEKKRIT* project just yet, but I will say that in the process of extracting certain information I needed in order to do certain things I won’t talk about, I ended up with certain kinds of data that are useful for certain other tangential analyses. Just for fun, I threw some co-authorship data from 2,000+ Psychological Science articles into the d3.js blender, and out popped an interactive network graph of all researchers who have published at least 2 papers in Psych Science in the last 10 years**. It looks like this:

coauthorship_graph

You can click on the image to take a closer (and interactive) look.

I don’t think this is very useful for anything right now, but if nothing else, it’s fun to drag Adam Galinsky around the screen and watch half of the field come along for the ride. There are plenty of other more interesting things one could do with this, though, and it’s also quite easy to generate the same graph for other journals, so I expect to have more to say about this later on.

 

* It’s not really TOP SEKKRIT at all–it just sounds more exciting that way.

** Or, more accurately, researchers who have co-authored at least 2 Psych Science papers with other researchers who meet the same criterion. Otherwise we’d have even more nodes in the graph, and as you can see, it’s already pretty messy.

unconference in Leipzig! no bathroom breaks!

Südfriedhof von Leipzig [HDR]

Many (most?) regular readers of this blog have probably been to at least one academic conference. Some of you even have the misfortune of attending conferences regularly. And a still-smaller fraction of you scholarly deviants might conceivably even enjoy the freakish experience. You know, that whole thing where you get to roam around the streets of some fancy city for a few days seeing old friends, learning about exciting new scientific findings, and completely ignoring the manuscripts and reviews piling up on your desk in your absence. It’s a loathsome, soul-scorching experience. Unfortunately it’s part of the job description for most scientists, so we shoulder the burden without complaining too loudly to the government agencies that force us to go to these things.

This post, thankfully, isn’t about a conference. In fact, it’s about the opposite of a conference, which is… an UNCONFERENCE. An unconference is a social event type of thing that strips away all of the unpleasant features of a regular conference–you know, the fancy dinners, free drinks, and stimulating conversation–and replaces them with a much more authentic academic experience. An authentic experience in which you spend the bulk of your time situated in a 10′ x 10′ room (3 m x 3 m for non-Imperialists) with 10 – 12 other academics, and no one’s allowed to leave the room, eat anything, or take bathroom breaks until someone in the room comes up with a brilliant discovery and wins a Nobel prize. This lasts for 3 days (plus however long it takes for the Nobel to be awarded), and you pay $1200 for the privilege ($1160 if you’re a post-doc or graduate student). Believe me when I tell you that it’s a life-changing experience.

Okay, I exaggerate a bit. Most of those things aren’t true. Here’s one explanation of what an unconference actually is:

An unconference is a participant-driven meeting. The term “unconference” has been applied, or self-applied, to a wide range of gatherings that try to avoid one or more aspects of a conventional conference, such as high fees, sponsored presentations, and top-down organization. For example, in 2006, CNNMoney applied the term to diverse events including Foo Camp, BarCamp, Bloggercon, and Mashup Camp.

So basically, my description was accurate up until the part where I said there were no bathroom breaks.

Anyway, I’m going somewhere with this, I promise. Specifically, I’m going to Leipzig, Germany! In September! And you should come too!

The happy occasion is Brainhack 2012, an unconference organized by the creative minds over at the Neuro Bureau–coordinators of such fine projects as the Brain Art Competition at OHBM (2012 incarnation going on in Beijing right now!) and the admittedly less memorable CNS 2007 Surplus Brain Yard Sale (guess what–turns out selling human brains out of the back of an unmarked van violates all kinds of New York City ordinances!).

Okay, as you can probably tell, I don’t quite have this event promotion thing down yet. So in the interest of ensuring that more than 3 people actually attend this thing, I’ll just shut up now and paste the official description from the Brainhack website:

The Neuro Bureau is proud to announce the 2012 Brainhack, to be held from September 1-4 at the Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany.

Brainhack 2012 is a unique workshop with the goals of fostering interdisciplinary collaboration and open neuroscience. The structure builds from the concepts of an unconference and a hackathon: The term “unconference” refers to the fact that most of the content will be dynamically created by the participants — a hackathon is an event where participants collaborate intensively on science-related projects.

Participants from all disciplines related to neuroimaging are welcome. Ideal participants span in range from graduate students to professors across any disciplines willing to contribute (e.g., mathematics, computer science, engineering, neuroscience, psychology, psychiatry, neurology, medicine, art, etc…). The primary requirement is a desire to work in close collaborations with researchers outside of your specialization in order to address neuroscience questions that are beyond the expertise of a single discipline.

In all seriousness though, I think this will be a blast, and I’m really looking forward to it. I’m contributing the full Neurosynth dataset as one of the resources participants will have access to (more on that in a later post), and I’m excited to see what we collectively come up with. I bet it’ll be at least three times as awesome as the Surplus Brain Yard Sale–though maybe not quite as lucrative.

 

 

p.s. I’ll probably also be in Amsterdam, Paris, and Geneva in late August/early September; if you live in one of these fine places and want to show me around, drop me an email. I’ll buy you lunch! Well, except in Geneva. If you live in Geneva, I won’t buy you lunch, because I can’t afford lunch in Geneva. You’ll buy yourself a nice Swiss lunch made of clockwork and gold, and then maybe I’ll buy you a toothpick.

the neuroinformatics of Neopets

In the process of writing a short piece for the APS Observer, I was fiddling around with Google Correlate earlier this evening. It’s a very neat toy, but if you think neuroimaging or genetics have a big multiple comparisons problem, playing with Google Correlate for a few minutes will put things in perspective. Here’s a line graph displaying the search term most strongly correlated (over time) with searches for “neuroinformatics”:

That’s right, the search term that covaries most strongly with “neuroinformatics” is none other than “Illinois film office” (which, to be fair, has a pretty appealing website). Other top matches include “wma support”, “sim codes”, “bed-in-a-bag”, “neopets secret”, “neopets guild”, and “neopets secret avatars”.

I may not have learned much about neuroinformatics from this exercise, but I did get a pretty good sense of how neuroinformaticians like to spend their free time…

 

p.s. I was pretty surprised to find that normalized search volume for just about every informatics-related term has fallen sharply in the last 10 years. I went in expecting the opposite! Maybe all the informaticians were early search adopters, and the rest of the world caught up? No, probably not. Anyway, enough of this; Neopia is calling me!

p.p.s. Seriously though, this is why data fishing expeditions are dangerous. Any one of these correlations is significant at p-less-than-point-whatever-you-like. And if your publication record depended on it, you could probably tell yourself a convincing story about why neuroinformaticians need to look up Garmin eMaps…

large-scale data exploration, MIC-style

UPDATE 2/8/2012: Simon & Tibshirani posted a critical commentary on this paper here. See additional thoughts here.

Real-world data are messy. Relationships between two variables can take on an infinite number of forms, and while one doesn’t see, say, umbrella-shaped data very often, strange things can happen. When scientists talk about correlations or associations between variables, they’re usually referring to one very specific form of relationship–namely, a linear one. The assumption is that most associations between pairs of variables are reasonably well captured by positing that one variable increases in proportion to the other, with some added noise. In reality, of course, many associations aren’t linear, or even approximately so. For instance, many associations are cyclical (e.g., hours at work versus day of week), or curvilinear (e.g., heart attacks become precipitously more frequent past middle age), and so on.

Detecting a non-linear association is potentially just as easy as detecting a linear relationship if we know the form of that association up front. But there, of course, lies the rub: we generally don’t have strong intuitions about how most variables are likely to be non-linearly related. A more typical situation in many ‘big data’ scientific disciplines is that we have a giant dataset full of thousands or millions of observations and hundreds or thousands of variables, and we want to determine which of the many associations between different variables are potentially important–without knowing anything about their potential shape. The problem, then, is that traditional measures of association don’t work very well; they’re only likely to detect associations to the extent that those associations approximate a linear fit.

A new paper in Science by David Reshef and colleagues (and as a friend pointed out, it’s a feat in and of itself just to get a statistics paper into Science) directly targets this data mining problem by introducing an elegant new measure of association called the Maximal Information Coefficient (MIC; see also the authors’ project website).  The clever insight at the core of the paper is that one can detect a systematic (i.e., non-random) relationship between two variables by quantifying and normalizing their maximal mutual information. Mutual information (MI) is an information theory measure of how much information you have about one variable given knowledge of the other. You have high MI when you can accurately predict the level of one variable given knowledge of the other, and low MI when knowledge of one variable is unhelpful in predicting the other. Importantly, unlike other measures (e.g., the correlation coefficient), MI makes no assumptions about the form of the relationship between the variables; one can have high mutual information for non-linear associations as well as linear ones.

MI and various derivative measures have been around for a long time now; what’s innovative about the Reshef et al paper is that the authors figured out a way to efficiently estimate and normalize the maximal MI one can obtain for any two variables. The very clever approach the authors use is to overlay a series of grids on top of the data, and to keep altering the resolution of the grid and moving its lines around until one obtains the maximum possible MI. In essence, it’s like dropping a wire mesh on top of a scatterplot and playing with it until you’ve boxed in all of the data points in the most informative way possible. And the neat thing is, you can apply the technique to any kind of data at all, and capture a very broad range of systematic relationships, not just linear ones.

To give you an intuitive sense of how this works, consider this Figure from the supplemental material:

The underlying function here is sinusoidal. This is a potentially common type of association in many domains–e.g., it might explain the cyclical relationship between, say, coffee intake and hour of day (more coffee in the early morning and afternoon; less in between). But the linear correlation is essentially zero, so a typical analysis wouldn’t pick it up at all. On the other hand, the relationship itself is perfectly deterministic; if we can correctly identify the generative function in this case, we would have perfect information about Y given X. The question is how to capture this intuition algorithmically–especially given that real data are noisy.

This is where Reshef et al’s grid-based approach comes in. In the left panel above, you have a 2 x 8 grid overlaid on a sinusoidal function (the use of a 2 x 8 resolution here is just illustrative; the algorithm actually produces estimates for a wide range of grid resolutions). Even though it’s the optimal grid of that particular resolution, it still isn’t very good: knowing which row a particular point along the line falls into doesn’t tell you a whole lot about which column it falls into, and vice versa. In other words, mutual information is low. By contrast, the optimal 8 x 2 grid on the right side of the figure has a (perfect) MIC of 1: if you know which row in the grid a point on the line falls into, you can also determine which column it falls into with perfect accuracy. So the MIC approach will detect that there’s a perfectly systematic relationship between these two variables without any trouble, whereas the standard pearson correlation would be 0 (i.e., no relation at all). There are a couple of other steps involved (e.g., one needs to normalize the MIC to account for differences in grid resolution), but that’s the gist of it.

If the idea seems surprisingly simple, it is. But as with many very good ideas, hindsight is 20/20; it’s an idea that seems obvious once you hear it, but clearly wasn’t trivial to come up with (or someone would have done it a long time ago!). And of course, the simplicity of the core idea also shouldn’t blind us to the fact that there was undoubtedly a lot of very sophisticated work involved in figuring out how to normalize and bound the measure, provin that the approach works and implementing a dynamic algorithm capable of computing good MIC estimates in a reasonable amount of time (this Harvard Gazette article suggests Reshef and colleagues worked on the various problems for three years).

The utility of MIC and its improvement over existing measures is probably best captured in Figure 2 from the paper:

Panel A shows the values one obtains with different measures when trying to capture different kinds of noiseless relationships (e.g., linear, exponential, and sinusoidal ones). The key point is that MIC assigns a value of 1 (the maximum) to every kind of association, whereas no other measure is capable of detecting the same range of associations with the same degree of sensitivity (and most fail horribly). By contrast, when given random data, MIC produces a value that tends towards zero (though it’s still not quite zero, a point I’ll come back to later). So what you effectively have is a measure that, with some caveats, can capture a very broad range of associations and place them on the same metric. The latter aspect is nicely captured in Panel G, which gives one a sense of what real (i.e., noisy) data corresponding to different MIC levels would look like. The main point is that, unlike other measures, a given value can correspond to very different types of associations. Admittedly, this may be a mixed blessing, since the flip side is that knowing the MIC value tells you almost nothing about what the association actually looks like (though Anscombe’s Quartet famously demonstrates that even a linear correlation can be misleading in this respect). But on the whole, I think it represents a potentially big advance in our ability to detect novel associations in a data-driven way.

Having introduced and explained the method, Reshef et al then go on to apply it to 4 very different datasets. I’ll just focus on one here–a set of global indicators from the World Health Organization (WHO). The data set contains 357 variables, or 63,546 variable pairs. When plotting MIC against the Pearson correlation coefficient the data look like this (panel A; click to blow up the figure):

The main point to note is that while MIC detects most strong linear effects (e.g., panel D), it also detects quite a few associations that have low linear correlations (e.g., E, F, and G). Reshef et al note that many of these effects have sensible interpretations (e.g., they argue that the left trend line in panel F reflects predominantly Pacific Island nations where obesity is culturally valued, and hence increases with income), but would be completely overlooked by an automated data mining approach that focuses only on linear correlations. They go on to report a number of other interesting examples ranging from analyses of gut bacteria to baseball statistics. All in all, it’s a compelling demonstration of a new metric that could potentially play an important role in large-scale data mining analyses going forward.

That said, while the paper clearly represents an important advance for large-scale data mining efforts, it’s also quite light on caveats and limitations (even for a length-constrained Science paper). Some potential concerns that come to mind:

  • Reshef et al are understandably going to put their best foot forward, so we can expect that the ‘representative’ examples they display (e.g., the WHO scatter plots above) are among the cleanest effects in the data, and aren’t necessarily typical. There’s nothing wrong with this, but it’s worth keeping in mind that much (and perhaps most) of the time, the associations MIC identifies aren’t going to be quite so clear-cut. Reshef’s et al approach can help identify potentially interesting associations, but once they’re identified, it’s still up to the investigator to figure out how to characterize them.
  • MIC is a (potentially quite heavily) biased measure. While it’s true, as the authors suggest, that it will “tend to 0 for statistically independent variables”, in most situations, the observed value will be substantially larger than 0 even when variables are completely uncorrelated. This falls directly out of the ‘M’ in MIC, because when you take the maximal value from some larger search space as your estimate, you’re almost invariably going to end up capitalizing on chance to some degree. MIC will only tend to 0 when the sample size is very large; as this figure (from the supplemental material) shows, even with a sample size of n = 204, the MIC for uncorrelated variables will tend to hover somewhere around .15 for the parameterization used throughout the paper (the red line):
    This isn’t a huge deal, but it does mean that interpretation of small MIC values is going to be very difficult in practice, since the lower end of the distribution is going to depend heavily on sample size. And it’s quite unpleasant to have a putatively standardized metric of effect size whose interpretation depends to some extent on sample parameters.
  • Reshef et al don’t report any analyses quantifying the sensitivity of MIC compared to conventional metrics like Pearson’s correlation coefficient. Obviously, MIC can pick up on effects Pearson can’t; but a crucial question is whether MIC shows comparable sensitivity when effects are linear. Similarly, we don’t know how well MIC performs when sample sizes are substantially smaller than those Reshef et al use in their simulations and empirical analyses. If it breaks down with n’s on the order of, say, 50 – 100, that would be important to know. So it would be great to see follow-up work characterizing performance under such circumstances–preferably before a flood of papers is published that all use MIC to do data mining in relatively small data sets.
  • As Andrew Gelman points out here, it’s not entirely clear that one wants a measure that gives a high r-square-like value for pretty much any non-random association between variables. For instance, a perfect circle would get an MIC of 1 at the limit, which is potentially weird given that you can’t never deterministically predict y from x. I don’t have a strong feeling about this one way or the other, but can see why this might bother someone.

Caveats aside though, from my perspective–as someone who likes to play with very large datasets but isn’t terribly statistically savvy–the Reshef et al paper seems like a really impressive piece of work that could have a big impact on at least some kinds of data mining analyses. I’d be curious to hear what more quantitatively sophisticated folks have to say.

ResearchBlogging.org
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, & Sabeti PC (2011). Detecting novel associations in large data sets. Science (New York, N.Y.), 334 (6062), 1518-24 PMID: 22174245

more on the ADHD-200 competition results

Several people left enlightening comments on my last post about the ADHD-200 Global Competition results, so I thought I’d bump some of them up and save you the trip back there (though the others are worth reading too!), since they’re salient to some of the issues raised in the last post.

Matthew Brown, the project manager on the Alberta team that was disqualified on a minor technicality (they cough didn’t use any of the imaging data), pointed out that they actually did initially use the imaging data (answering Sanjay’s question in another comment)–it just didn’t work very well:

For the record, we tried a pile of imaging-based approaches. As a control, we also did classification with age, gender, etc. but no imaging data. It was actually very frustrating for us that none of our imaging-based methods did better than the no imaging results. It does raise some very interesting issues.

He also pointed out that the (relatively) poor performance of the imaging-based classifiers isn’t cause for alarm:

I second your statement that we’ve only scratched the surface with fMRI-based diagnosis (and prognosis!). There’s a lot of unexplored potential here. For example, the ADHD-200 fMRI scans are resting state scans. I suspect that fMRI using an attention task could work better for diagnosing ADHD. Resting state fMRI has also shown promise for diagnosing other conditions (eg: schizophrenia – see Shen et al.).

I think big, multi-centre datasets like the ADHD-200 are the future though the organizational and political issues with this approach are non-trivial. I’m extremely impressed with the ADHD-200 organizers and collaborators for having the guts and perseverance to put this data out there. I intend to follow their example with the MR data that we collect in future.

I couldn’t agree more!

Jesse Brown explains why his team didn’t use the demographics alone:

I was on the UCLA/Yale team, we came in 4th place with 54.87% accuracy. We did include all the demographic measures in our classifier along with a boatload of neuroimaging measures. The breakdown of demographics by site in the training data did show some pretty strong differences in ADHD vs. TD. These differences were somewhat site dependent, eg girls from OHSU with high IQ are very likely TD. We even considered using only demographics at some point (or at least one of our wise team members did) but I thought that was preposterous. I think we ultimately figured that the imaging data may generalize better to unseen examples, particularly for sites that only had data in the test dataset (Brown). I guess one lesson is to listen to the data and not to fall in love with your method. Not yet anyway.

I imagine some of the other groups probably had a similar experience of trying to use the demographic measures alone and realizing they did better than the imaging data, but sticking with the latter anyway. Seems like a reasonable decision, though ultimately, I still think it’s a good thing the Alberta team used only the demographic variables, since their results provided an excellent benchmark against which to compare the performance of the imaging-based models. Sanjay Srivastava captured this sentiment nicely:

Two words: incremental validity. This kind of contest is valuable, but I’d like to see imaging pitted against behavioral data routinely. The fact that they couldn’t beat a prediction model built on such basic information should be humbling to advocates of neurodiagnosis (and shows what a low bar “better than chance” is). The real question is how an imaging-based diagnosis compares to a clinician using standard diagnostic procedures. Both “which is better” (which accounts for more variance as a standalone prediction) and “do they contain non-overlapping information” (if you put both predictions into a regression, does one or both contribute unique variance).

And Russ Poldrack raised a related question about what it is that the imaging-based models are actually doing:

What is amazing is that the Alberta team only used age, sex, handedness, and IQ.  That suggests to me that any successful imaging-based decoding could have been relying upon correlates of those variables rather than truly decoding a correlate of the disease.

This seems quite plausible inasmuch as age, sex, and IQ are pretty powerful variables, and there are enormous literatures on their structural and functional correlates. While there probably is at least some incremental information added by the imaging data (and very possibly a lot of it), it’s currently unclear just how much–and whether that incremental variance might also be picked up by (different) behavioral variables. Ultimately, time (and more competitions like this one!) will tell.

brain-based prediction of ADHD–now with 100% fewer brains!

UPDATE 10/13: a number of commenters left interesting comments below addressing some of the issues raised in this post. I expand on some of them here.

The ADHD-200 Global Competition, announced earlier this year, was designed to encourage researchers to develop better tools for diagnosing mental health disorders on the basis of neuroimaging data:

The competition invited participants to develop diagnostic classification tools for ADHD diagnosis based on functional and structural magnetic resonance imaging (MRI) of the brain. Applying their tools, participants provided diagnostic labels for previously unlabeled datasets. The competition assessed diagnostic accuracy of each submission and invited research papers describing novel, neuroscientific ideas related to ADHD diagnosis. Twenty-one international teams, from a mix of disciplines, including statistics, mathematics, and computer science, submitted diagnostic labels, with some trying their hand at imaging analysis and psychiatric diagnosis for the first time.

Data for the contest came from several research labs around the world, who donated brain scans from participants with ADHD (both inattentive and hyperactive subtypes) as well as healthy controls. The data were made openly available through the International Neuroimaging Data-sharing Initiative, and nicely illustrate the growing movement towards openly sharing large neuroimaging datasets and promoting their use in applied settings. It is, in virtually every respect, a commendable project.

Well, the results of the contest are now in–and they’re quite interesting. The winning team, from Johns Hopkins, came up with a method that performed substantially above chance and showed particularly high specificity (i.e., it made few false diagnoses, though it missed a lot of true ADHD cases). And all but one team performed above chance, demonstrating that the imaging data has at least some (though currently not a huge amount) of utility in diagnosing ADHD and ADHD subtype. There are some other interesting results on the page worth checking out.

But here’s hands-down the most entertaining part of the results, culled from the “Interesting Observations” section:

The team from the University of Alberta did not use imaging data for their prediction model. This was not consistent with intent of the competition. Instead they used only age, sex, handedness, and IQ. However, in doing so they obtained the most points, outscoring the team from Johns Hopkins University by 5 points, as well as obtaining the highest prediction accuracy (62.52%).

…or to put it differently, if you want to predict ADHD status using the ADHD-200 data, your best bet is to not really use the ADHD-200 data! At least, not the brain part of it.

I say this with tongue embedded firmly in cheek, of course; the fact that the Alberta team didn’t use the imaging data doesn’t mean imaging data won’t ultimately be useful for diagnosing mental health disorders. It remains quite plausible that ten or twenty years from now, structural or functional MRI scans (or some successor technology) will be the primary modality used to make such diagnoses. And the way we get from here to there is precisely by releasing these kinds of datasets and promoting this type of competition. So on the whole, I think this should actually be seen as a success story for the field of human neuroimaging–especially since virtually all of the teams performed above chance using the imaging data.

That said, there’s no question this result also serves as an important and timely reminder that we’re still in the very early days of brain-based prediction. Right now anyone who claims they can predict complex real-world behaviors better using brain imaging data than using (much cheaper) behavioral data has a lot of ‘splainin to do. And there’s a good chance that they’re trying to sell you something (like, cough, neuromarketing ‘technology’).

elsewhere on the net, vacation edition

I’m hanging out in Boston for a few days, so blogging will probably be sporadic or nonexistent. Which is to say, you probably won’t notice any difference.

The last post on the Dunning-Kruger effect somehow managed to rack up 10,000 hits in 48 hours; but that was last week. Today I looked at my stats again, and the blog is back to a more normal 300 hits, so I feel like it’s safe to blog again. Here are some neat (and totally unrelated) links from the past week:

  • OKCupid has another one of those nifty posts showing off all the cool things they can learn from their gigantic userbase (who else gets to say things like “this analysis includes 1.51 million users’ data”???). Apparently, tall people (claim to) have more sex, attractive photos are more likely to be out of date, and most people who claim to be bisexual aren’t really bisexual.
  • After a few months off, my department-mate Chris Chatham is posting furiously again over at Developing Intelligence, with a series of excellent posts reviewing recent work on cognitive control and the perils of fMRI research. I’m not really sure what Chris spent his blogging break doing, but given the frequency with which he’s been posting lately, my suspicion is that he spent it secretly writing blog posts.
  • Mark Liberman points out a fundamental inconsistency in the way we view attributions of authorship: we get appropriately angry at academics who pass someone else’s work off as their own, but think it’s just fine for politicians to pay speechwriters to write for them. It’s an interesting question, and leads to an intimately related, and even more important question–namely, will anyone get mad at me if I pay someone else to write a blog post for me about someone else’s blog post discussing people getting angry at people paying or not paying other people to write material for other people that they do or don’t own the copyright on?
  • I like oohing and aahing over large datasets, and the Guardian’s Data Blog provides a nice interface to some of the most ooh- and aah-able datasets out there. [via R-Chart]
  • Ed Yong has a characteristically excellent write-up about recent work on the magnetic vision of birds. Yong also does link dump posts better than anyone else, so you should probably stop reading this one right now and read his instead.
  • You’ve probably heard about this already, but some time last week, the brain trust at ScienceBlogs made the amazingly clever decision to throw away their integrity by selling PepsiCo its very own “science” blog. Predictably, a lot of the bloggers weren’t happy with the decision, and many have now moved onto greener pastures; Carl Zimmer’s keeping score. Personally, I don’t have anything intelligent to add to everything that’s already been said; I’m literally dumbfounded.
  • Andrew Gelman takes apart an obnoxious letter from pollster John Zogby to Nate Silver of fivethirtyeight.com. I guess now we know that Zogby didn’t get where he is by not being an ass to other people.
  • Vaughan Bell of Mind Hacks points out that neuroplasticity isn’t a new concept, and was discussed seriously in the literature as far back as the 1800s. Apparently our collective views about the malleability of mind are not, themselves, very plastic.
  • NPR ran a three-part story by Barbara Bradley Hagerty on the emerging and somewhat uneasy relationship between neuroscience and the law. The articles are pretty good, but much better, in my opinion, was the Talk of the Nation episode that featured Hagerty as a guest alongside Joshua Greene, Kent Kiehl, and Stephen Morse–people who’ve all contributed in various ways to the emerging discipline of NeuroLaw. It’s a really interesting set of interviews and discussions. For what it’s worth, I think I agree with just about everything Greene has to say about these issues–except that he says things much more eloquently than I think them.
  • Okay, this one’s totally frivolous, but does anyone want to buy me one of these things? I don’t even like dried food; I just think it would be fun to stick random things in there and watch them come out pale, dried husks of their former selves. Is it morbid to enjoy watching the life slowly being sucked out of apples and mushrooms?

in brief…

Some neat stuff from the past week or so:

  • If you’ve ever wondered how to go about getting a commentary on an article published in a peer-reviewed journal, wonder no longer… you can’t. Or rather, you can, but it may not be worth your trouble. Rick Trebino explains. [new to me via A.C. Thomas, though apparently this one's been around for a while.]
  • The data-driven life: A great article in the NYT magazine discusses the growing number of people who’re quantitatively recording the details of every aspect of their lives, from mood to glucose levels to movement patterns. I dabbled with this a few years ago, recording my mood, diet, and exercise levels for about 6 months. I’m not sure how much I learned that was actually useful, but if nothing else, it’s a fun exercise to play aroundwith a giant matrix of correlations that are all about YOU.
  • Cameron Neylon has an excellent post up defending the viability (and superiority) of the author-pays model of publication.
  • In typical fashion, Carl Zimmer has a wonderful blog up post explaining why tapeworms in Madagascar tell us something important about human evolution.
  • The World Bank, as you might expect, has accumulated a lot of economic data. For years, they’ve been selling it at a premium, but as of 2010 the World Development Indicators are completely free to access. via [via Flowing Data]
  • Every tried Jew’s Ear Juice? No? In China, you can–but not for long, if the government has its way. The NYT reports on efforts to eradicate Chinglish in public. Money quote:

“The purpose of signage is to be useful, not to be amusing,” said Zhao Huimin, the former Chinese ambassador to the United States who, as director general of the capital’s Foreign Affairs Office, has been leading the fight for linguistic standardization and sobriety.

the OKCupid guide to dating older women

Continuing along on their guided tour of Data I Wish I Had Access To, the OKCupid folks have posted another set of interesting figures on their blog. This time, they make the case for dating older women, suggesting that men might get more bang for their buck (in a literal sense, I suppose) by trying to contact women their age or older, rather than trying to hit on the young ‘uns. Men, it turns out, are creepy. Here’s how creepy:

Actually, that’s not so creepy. All it says is that men say they prefer to date younger women. That’s not going to shock anyone. This one is creepier:

The reason it’s creepy is that it basically says that, irrespective of what age ranges men say they find acceptable in a potential match, they’re actually all indiscriminately messaging 18-year old women. So basically, if you’re a woman on OKCupid who’s searching for that one special, non-creepy guy, be warned: they don’t exist. They’re pretty much all going to be eying 18-year olds for the rest of their lives. (To be fair, women also show a tendency to contact men below their lowest reported acceptable age. But it’s a much weaker effect; 40-year old women only occasionally try to hit on 24-year old guys, and tend to stay the hell away from the not-yet-of-drinking-age male population.)

Anyway, using this type of data, the OKCupid folks then generate this figure:

…which also will probably surprise no one, as it basically says women are most desirable when they’re young, and men when they’re (somewhat) older. But what the OKCupid folks then suggest is that it would be to men’s great advantage to broaden their horizons, because older women (which, in their range-restricted population, basically means anything over 30) self-report being much more interested in having sex more often, having casual sex, and using protection. I won’t bother hotlinking to all of those images, but here’s where they’re ultimately going with this:

I’m not going to comment on the appropriateness of trying to nudge one’s male userbase in the direction of more readily available casual sex (though I suspect they don’t need much nudging anyway). What I do wonder is to what extent these results reflect selection effects rather than a genuine age difference. The OKCupid folks suggest that women’s sexual interest increases as they age, which seems plausible given the conventional wisdom that women peak sexually in their 30s. But the effects in this case look pretty huge (unless the color scheme is misleading, which it might be; you’ll have to check out the post for the neat interactive flash animations), and it seems pretty plausible that much of the age effect could be driven by selection bias. Women with a more monogamous orientation are probably much more likely to be in committed, stable relationships by the time they turn 30 or 35, and probably aren’t scanning OKCupid for potential mates. Women who are in their 30s and 40s and still using online dating services are probably those who weren’t as interested in monogamous relationships to begin with. (Of course, the same is probably true of older men. Except that since men of all ages appear to be pretty interested in casual sex, there’s unlikely to be an obvious age differential.)

The other thing I’m not clear on is whether these analyses control for the fact that the userbase is heavily skewed toward younger users:

The people behind OKCupid are all mathematicians by training, so I’d be surprised if they hadn’t taken the underlying age distribution into consideration. But they don’t say anything about it in their post. The worry is that, if the base rate of different age groups isn’t taken into consideration, the heat map displayed above could be quite misleading. Given that there are many, many more 25-year old women on OKCupid than 35-year old women, failing to normalize properly would almost invariably make it look like there’s a heavy skew for men to message relatively younger women, irrespective of the male sender’s age. By the same token, it’s not clear that it’d be good advice to tell men to seek out older women, given that there are many fewer older women in the pool to begin with. As a thought experiment, suppose that the entire OKCupid male population suddenly started messaging women 5 years older than them, and entirely ignored their usual younger targets. The hit rate wouldn’t go up; it would probably actually fall precipitously, since there wouldn’t be enough older women to keep all the younger men entertained (at least, I certainly hope there wouldn’t). No doubt there’s a stable equilibrium point somewhere, where men and women are each targeting exactly the right age range to maximize their respective chances. I’m just not sure that it’s in OKCupid’s proposed “zone of greatness” for the men.

It’s also a bit surprising that OKCupid didn’t break down the response rate to people of the opposite gender as a function of the sender and receiver’s age. They’ve done this in the past, and it seems like the most direct way of testing whether men are more likely to get lucky by messaging older or younger women. Without knowing whether older women are actually responding to younger men’s overtures, it’s kind of hard to say what it all means. Except that I’d still kill to have their data.