UPDATE 2/8/2012: Simon & Tibshirani posted a critical commentary on this paper here. See additional thoughts here.
Real-world data are messy. Relationships between two variables can take on an infinite number of forms, and while one doesn’t see, say, umbrella-shaped data very often, strange things can happen. When scientists talk about correlations or associations between variables, they’re usually referring to one very specific form of relationship–namely, a linear one. The assumption is that most associations between pairs of variables are reasonably well captured by positing that one variable increases in proportion to the other, with some added noise. In reality, of course, many associations aren’t linear, or even approximately so. For instance, many associations are cyclical (e.g., hours at work versus day of week), or curvilinear (e.g., heart attacks become precipitously more frequent past middle age), and so on.
Detecting a non-linear association is potentially just as easy as detecting a linear relationship if we know the form of that association up front. But there, of course, lies the rub: we generally don’t have strong intuitions about how most variables are likely to be non-linearly related. A more typical situation in many ‘big data’ scientific disciplines is that we have a giant dataset full of thousands or millions of observations and hundreds or thousands of variables, and we want to determine which of the many associations between different variables are potentially important–without knowing anything about their potential shape. The problem, then, is that traditional measures of association don’t work very well; they’re only likely to detect associations to the extent that those associations approximate a linear fit.
A new paper in Science by David Reshef and colleagues (and as a friend pointed out, it’s a feat in and of itself just to get a statistics paper into Science) directly targets this data mining problem by introducing an elegant new measure of association called the Maximal Information Coefficient (MIC; see also the authors’ project website). The clever insight at the core of the paper is that one can detect a systematic (i.e., non-random) relationship between two variables by quantifying and normalizing their maximal mutual information. Mutual information (MI) is an information theory measure of how much information you have about one variable given knowledge of the other. You have high MI when you can accurately predict the level of one variable given knowledge of the other, and low MI when knowledge of one variable is unhelpful in predicting the other. Importantly, unlike other measures (e.g., the correlation coefficient), MI makes no assumptions about the form of the relationship between the variables; one can have high mutual information for non-linear associations as well as linear ones.
MI and various derivative measures have been around for a long time now; what’s innovative about the Reshef et al paper is that the authors figured out a way to efficiently estimate and normalize the maximal MI one can obtain for any two variables. The very clever approach the authors use is to overlay a series of grids on top of the data, and to keep altering the resolution of the grid and moving its lines around until one obtains the maximum possible MI. In essence, it’s like dropping a wire mesh on top of a scatterplot and playing with it until you’ve boxed in all of the data points in the most informative way possible. And the neat thing is, you can apply the technique to any kind of data at all, and capture a very broad range of systematic relationships, not just linear ones.
To give you an intuitive sense of how this works, consider this Figure from the supplemental material:
The underlying function here is sinusoidal. This is a potentially common type of association in many domains–e.g., it might explain the cyclical relationship between, say, coffee intake and hour of day (more coffee in the early morning and afternoon; less in between). But the linear correlation is essentially zero, so a typical analysis wouldn’t pick it up at all. On the other hand, the relationship itself is perfectly deterministic; if we can correctly identify the generative function in this case, we would have perfect information about Y given X. The question is how to capture this intuition algorithmically–especially given that real data are noisy.
This is where Reshef et al’s grid-based approach comes in. In the left panel above, you have a 2 x 8 grid overlaid on a sinusoidal function (the use of a 2 x 8 resolution here is just illustrative; the algorithm actually produces estimates for a wide range of grid resolutions). Even though it’s the optimal grid of that particular resolution, it still isn’t very good: knowing which row a particular point along the line falls into doesn’t tell you a whole lot about which column it falls into, and vice versa. In other words, mutual information is low. By contrast, the optimal 8 x 2 grid on the right side of the figure has a (perfect) MIC of 1: if you know which row in the grid a point on the line falls into, you can also determine which column it falls into with perfect accuracy. So the MIC approach will detect that there’s a perfectly systematic relationship between these two variables without any trouble, whereas the standard pearson correlation would be 0 (i.e., no relation at all). There are a couple of other steps involved (e.g., one needs to normalize the MIC to account for differences in grid resolution), but that’s the gist of it.
If the idea seems surprisingly simple, it is. But as with many very good ideas, hindsight is 20/20; it’s an idea that seems obvious once you hear it, but clearly wasn’t trivial to come up with (or someone would have done it a long time ago!). And of course, the simplicity of the core idea also shouldn’t blind us to the fact that there was undoubtedly a lot of very sophisticated work involved in figuring out how to normalize and bound the measure, provin that the approach works and implementing a dynamic algorithm capable of computing good MIC estimates in a reasonable amount of time (this Harvard Gazette article suggests Reshef and colleagues worked on the various problems for three years).
The utility of MIC and its improvement over existing measures is probably best captured in Figure 2 from the paper:
Panel A shows the values one obtains with different measures when trying to capture different kinds of noiseless relationships (e.g., linear, exponential, and sinusoidal ones). The key point is that MIC assigns a value of 1 (the maximum) to every kind of association, whereas no other measure is capable of detecting the same range of associations with the same degree of sensitivity (and most fail horribly). By contrast, when given random data, MIC produces a value that tends towards zero (though it’s still not quite zero, a point I’ll come back to later). So what you effectively have is a measure that, with some caveats, can capture a very broad range of associations and place them on the same metric. The latter aspect is nicely captured in Panel G, which gives one a sense of what real (i.e., noisy) data corresponding to different MIC levels would look like. The main point is that, unlike other measures, a given value can correspond to very different types of associations. Admittedly, this may be a mixed blessing, since the flip side is that knowing the MIC value tells you almost nothing about what the association actually looks like (though Anscombe’s Quartet famously demonstrates that even a linear correlation can be misleading in this respect). But on the whole, I think it represents a potentially big advance in our ability to detect novel associations in a data-driven way.
Having introduced and explained the method, Reshef et al then go on to apply it to 4 very different datasets. I’ll just focus on one here–a set of global indicators from the World Health Organization (WHO). The data set contains 357 variables, or 63,546 variable pairs. When plotting MIC against the Pearson correlation coefficient the data look like this (panel A; click to blow up the figure):
The main point to note is that while MIC detects most strong linear effects (e.g., panel D), it also detects quite a few associations that have low linear correlations (e.g., E, F, and G). Reshef et al note that many of these effects have sensible interpretations (e.g., they argue that the left trend line in panel F reflects predominantly Pacific Island nations where obesity is culturally valued, and hence increases with income), but would be completely overlooked by an automated data mining approach that focuses only on linear correlations. They go on to report a number of other interesting examples ranging from analyses of gut bacteria to baseball statistics. All in all, it’s a compelling demonstration of a new metric that could potentially play an important role in large-scale data mining analyses going forward.
That said, while the paper clearly represents an important advance for large-scale data mining efforts, it’s also quite light on caveats and limitations (even for a length-constrained Science paper). Some potential concerns that come to mind:
- Reshef et al are understandably going to put their best foot forward, so we can expect that the ‘representative’ examples they display (e.g., the WHO scatter plots above) are among the cleanest effects in the data, and aren’t necessarily typical. There’s nothing wrong with this, but it’s worth keeping in mind that much (and perhaps most) of the time, the associations MIC identifies aren’t going to be quite so clear-cut. Reshef’s et al approach can help identify potentially interesting associations, but once they’re identified, it’s still up to the investigator to figure out how to characterize them.
- MIC is a (potentially quite heavily) biased measure. While it’s true, as the authors suggest, that it will “tend to 0 for statistically independent variables”, in most situations, the observed value will be substantially larger than 0 even when variables are completely uncorrelated. This falls directly out of the ‘M’ in MIC, because when you take the maximal value from some larger search space as your estimate, you’re almost invariably going to end up capitalizing on chance to some degree. MIC will only tend to 0 when the sample size is very large; as this figure (from the supplemental material) shows, even with a sample size of n = 204, the MIC for uncorrelated variables will tend to hover somewhere around .15 for the parameterization used throughout the paper (the red line):
This isn’t a huge deal, but it does mean that interpretation of small MIC values is going to be very difficult in practice, since the lower end of the distribution is going to depend heavily on sample size. And it’s quite unpleasant to have a putatively standardized metric of effect size whose interpretation depends to some extent on sample parameters.
- Reshef et al don’t report any analyses quantifying the sensitivity of MIC compared to conventional metrics like Pearson’s correlation coefficient. Obviously, MIC can pick up on effects Pearson can’t; but a crucial question is whether MIC shows comparable sensitivity when effects are linear. Similarly, we don’t know how well MIC performs when sample sizes are substantially smaller than those Reshef et al use in their simulations and empirical analyses. If it breaks down with n’s on the order of, say, 50 – 100, that would be important to know. So it would be great to see follow-up work characterizing performance under such circumstances–preferably before a flood of papers is published that all use MIC to do data mining in relatively small data sets.
- As Andrew Gelman points out here, it’s not entirely clear that one wants a measure that gives a high r-square-like value for pretty much any non-random association between variables. For instance, a perfect circle would get an MIC of 1 at the limit, which is potentially weird given that you can’t never deterministically predict y from x. I don’t have a strong feeling about this one way or the other, but can see why this might bother someone.
Caveats aside though, from my perspective–as someone who likes to play with very large datasets but isn’t terribly statistically savvy–the Reshef et al paper seems like a really impressive piece of work that could have a big impact on at least some kinds of data mining analyses. I’d be curious to hear what more quantitatively sophisticated folks have to say.
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, & Sabeti PC (2011). Detecting novel associations in large data sets. Science (New York, N.Y.), 334 (6062), 1518-24 PMID: 22174245