Category Archives: data mining

The homogenization of scientific computing, or why Python is steadily eating other languages’ lunch

Over the past two years, my scientific computing toolbox has been steadily homogenizing. Around 2010 or 2011, my toolbox looked something like this:

  • Ruby for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development;
  • Python/Numpy (mostly) and MATLAB (occasionally) for numerical computing;
  • MATLAB for neuroimaging data analysis;
  • R for statistical analysis;
  • R for plotting and visualization;
  • Occasional excursions into other languages/environments for other stuff.

In 2013, my toolbox looks like this:

  • Python for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development, except for an occasional date with Django or Flask (Python frameworks);
  • Python (NumPy/SciPy) for numerical computing;
  • Python (Neurosynth, NiPy etc.) for neuroimaging data analysis;
  • Python (NumPy/SciPy/pandas/statsmodels) for statistical analysis;
  • Python (matplotlib) for plotting and visualization, except for web-based visualizations (JavaScript/d3.js);
  • Python (scikit-learn) for machine learning;
  • Excursions into other languages have dropped markedly.

You may notice a theme here.

The increasing homogenization (Pythonification?) of the tools I use on a regular basis primarily reflects the spectacular recent growth of the Python ecosystem. A few years ago, you couldn’t really do statistics in Python unless you wanted to spend most of your time pulling your hair out and wishing Python were more like R (which is a pretty remarkable confession considering what R is like). Neuroimaging data could be analyzed in SPM (MATLAB-based), FSL, or a variety of other packages, but there was no viable full-featured, free, open-source Python alternative. Packages for machine learning, natural language processing, and web application development were only just starting to emerge.

These days, tools for almost every aspect of scientific computing are readily available in Python. And in a growing number of cases, they’re eating the competition’s lunch.

Take R, for example. R’s out-of-the-box performance with out-of-memory datasets has long been recognized as its Achilles’ heel (yes, I’m aware you can get around that if you’re willing to invest the time–but not many scientists have the time). But even people who hated the way R chokes on large datasets, and its general clunkiness as a language, often couldn’t help running back to R as soon as any kind of serious data manipulation was required. You could always laboriously write code in Python or some other high-level language to pivot, aggregate, reshape, and otherwise pulverize your data, but why would you want to? The beauty of packages like plyr in R was that you could, in a matter of 2 – 3 lines of code, perform enormously powerful operations that could take hours to duplicate in other languages. The downside was the steep learning curve associated with each package’s often quite complicated API (e.g., ggplot2 is incredibly expressive, but every time I stop using ggplot2 for 3 months, I have to completely re-learn it), and having to contend with R’s general awkwardness. But still, on the whole, it was clearly worth it.

Flash forward to The Now. Last week, someone asked me for some simulation code I’d written in R a couple of years ago. As I was firing up RStudio to dig around for it, I realized that I hadn’t actually fired up RStudio for a very long time prior to that moment–probably not in about 6 months. The combination of NumPy/SciPy, matplotlib, pandas and statsmodels had effectively replaced R for me, and I hadn’t even noticed. At some point I just stopped dropping out of Python and into R whenever I had to do the “real” data analysis. Instead, I just started importing pandas and statsmodels into my code. The same goes for machine learning (scikit-learn), natural language processing (nltk), document parsing (BeautifulSoup), and many other things I used to do outside Python.

It turns out that the benefits of doing all of your development and analysis in one language are quite substantial. For one thing, when you can do everything in the same language, you don’t have to suffer the constant cognitive switch costs of reminding yourself, say, that Ruby uses blocks instead of comprehensions, or that you need to call len(array) instead of array.length to get the size of an array in Python; you can just keep solving the problem you’re trying to solve with as little cognitive overhead as possible. Also, you no longer need to worry about interfacing between different languages used for different parts of a project. Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses*. In isolation, this kind of thing is not a big deal. It doesn’t take very long to write out a CSV or JSON file from Python and then read it into R. But it does add up. It makes integrated development more complicated, because you end up with more code scattered around your drive in more locations (well, at least if you have my organizational skills). It means you spend a non-negligible portion of your “analysis” time writing trivial little wrappers for all that interface stuff, instead of thinking deeply about how to actually transform and manipulate your data. And it means that your beautiful analytics code is marred by all sorts of ugly open() and read() I/O calls. All of this overhead vanishes as soon as you move to a single language.
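To make that hand-off overhead concrete, here’s a minimal sketch of the kind of glue code involved, using only Python’s standard library. The record fields, the write_handoff_csv helper, and the file name are all made up for illustration; on the R side, the file would then be read back in with something like read.csv().

```python
import csv
import os
import tempfile

# Hypothetical parsed records, already in a convenient in-memory form.
records = [
    {"doc_id": 1, "n_words": 532, "n_sentences": 31},
    {"doc_id": 2, "n_words": 1208, "n_sentences": 77},
]


def write_handoff_csv(rows, path):
    """Flatten a list of dicts to CSV so another language can read it."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Illustrative location only; a real project would pick its own path.
handoff_path = os.path.join(tempfile.gettempdir(), "handoff.csv")
write_handoff_csv(records, handoff_path)
```

None of this is hard, but none of it is analysis either, and every such wrapper is one more file format and one more path to keep in sync across two languages.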

Convenience aside, another thing that’s impressive about the Python scientific computing ecosystem is that a surprising number of Python-based tools are now best-in-class (or close to it) in terms of scope and ease of use–and, in virtue of C bindings, often even in terms of performance. It’s hard to imagine an easier-to-use machine learning package than scikit-learn, even before you factor in the breadth of implemented algorithms, excellent documentation, and outstanding performance. Similarly, I haven’t missed any of the data manipulation functionality in R since I switched to pandas. Actually, I’ve discovered many new tricks in pandas I didn’t know in R (some of which I’ll describe in an upcoming post). Considering that pandas considerably outperforms R for many common operations, the reasons for me to switch back to R or other tools–even occasionally–have dwindled.
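For a flavor of what the aggregate-and-reshape workflow looks like in pandas, here’s a minimal sketch. The data and column names (subject, condition, rt) are invented for illustration, and this is just one of several ways to express the same operation.

```python
import pandas as pd

# Hypothetical long-format data: one row per subject x condition trial.
df = pd.DataFrame({
    "subject": ["s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2"],
    "condition": ["easy", "easy", "hard", "hard",
                  "easy", "easy", "hard", "hard"],
    "rt": [410, 430, 520, 560, 390, 410, 600, 640],
})

# Aggregate: mean reaction time per subject and condition.
means = df.groupby(["subject", "condition"], as_index=False)["rt"].mean()

# Reshape: pivot from long to wide, one column per condition.
wide = means.pivot(index="subject", columns="condition", values="rt")

# Derive a new column from the reshaped table.
wide["difference"] = wide["hard"] - wide["easy"]
print(wide)
```

The groupby/pivot pair here plays roughly the role that a couple of lines of plyr or reshape code would play in R.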

Mind you, I don’t mean to imply that Python can now do everything anyone could ever do in other languages. That’s obviously not true. For instance, there are currently no viable replacements for many of the thousands of statistical packages users have contributed to R (if there’s a good analog for lme4 in Python, I’d love to know about it). In signal processing, I gather that many people are wedded to various MATLAB toolboxes and packages that don’t have good analogs within the Python ecosystem. And for people who need serious performance and work with very, very large datasets, there’s often still no substitute for writing highly optimized code in a low-level compiled language. So, clearly, what I’m saying here won’t apply to everyone. But I suspect it applies to the majority of scientists.

Speaking only for myself, I’ve now arrived at the point where around 90 – 95% of what I do can be done comfortably in Python. So the major consideration for me, when determining what language to use for a new project, has shifted from “what’s the best tool for the job that I’m willing to learn and/or tolerate using?” to “is there really no way to do this in Python?” By and large, this mentality is a good thing, though I won’t deny that it occasionally has its downsides. For example, back when I did most of my data analysis in R, I would frequently play around with random statistics packages just to see what they did. I don’t do that much any more, because the pain of having to refresh my R knowledge and deal with that thing again usually outweighs the perceived benefits of aimless statistical exploration. Conversely, sometimes I end up using Python packages that I don’t like quite as much as comparable packages in other languages, simply for the sake of preserving language purity. For example, I prefer Rails’ ActiveRecord ORM to the much more explicit SQLAlchemy ORM for Python–but I don’t prefer it enough to justify mixing Ruby and Python objects in the same application. So, clearly, there are costs. But they’re pretty small costs, and for me personally, the scales have now clearly tipped in favor of using Python for almost everything. I know many other researchers who’ve had the same experience, and I don’t think it’s entirely unfair to suggest that, at this point, Python has become the de facto language of scientific computing in many domains. If you’re reading this and haven’t had much prior exposure to Python, now’s a great time to come on board!

Postscript: In the period of time between starting this post and finishing it (two sessions spread about two weeks apart), I discovered not one but two new Python-based packages for data visualization: Michael Waskom’s seaborn package–which provides very high-level wrappers for complex plots, with a beautiful ggplot2-like aesthetic–and Continuum Analytics’ bokeh, which looks like a potential game-changer for web-based visualization**. At the rate the Python ecosystem is moving, there’s a non-zero chance that by the time you read this, I’ll be using some new Python package that directly transliterates my thoughts into analytics code.

 

* I’m aware that there are various interfaces between Python, R, etc. that allow you to internally pass objects between these languages. My experience with these has not been overwhelmingly positive, and in any case they still introduce all the overhead of writing extra lines of code and having to deal with multiple languages.

** Yes, you heard right: web-based visualization in Python. Bokeh generates static JavaScript and JSON for you from Python code, so your users are magically able to interact with your plots on a webpage without you having to write a single line of native JS code.

I’m moving to Austin!

The title pretty much says it. After spending four great years in Colorado, I’m happy to say that I’ll be moving to Austin at the end of the month. I’ll be joining the Department of Psychology at UT-Austin as a Research Associate, where I plan to continue dabbling in all things psychological and informatic, but with less snow and more air conditioning.

While my new position nominally has the same title as my old one, the new one’s a bit unusual in that the funding is coming from two quite different sources. Half of it comes from my existing NIH grant for development of the Neurosynth framework, which means that half of my time will be spent more or less the same way I’m spending it now–namely, on building tools to improve and automate the large-scale synthesis of functional MRI data. (Incidentally, I’ll be hiring a software developer and/or postdoc in the very near future, so drop me a line if you think you might be interested.)

The other half of the funding is tied to the PsyHorns course developed by Jamie Pennebaker and Sam Gosling over the past few years. PsyHorns is a synchronous massive online course (SMOC) that lets anyone in the world with an internet connection (okay, and $550 in loose change lying around) take an introductory psychology class via the internet and officially receive credit for it from the University of Texas (this recent WSJ article on PsyHorns provides some more details). My role will be to serve as a bridge between the psychologists and the developers–which means I’ll have an eclectic assortment of duties like writing algorithms to detect cheating, developing tools to predict how well people are doing in the class, mining the gigantic reams of data we’re acquiring, developing ideas for new course features, and, of course, publishing papers.

Naturally, the PILab will be joining me in my southern adventure. Since the PILab currently only has one permanent member (guess who?), and otherwise consists of a single Mac Pro workstation, this latter move involves much less effort than you might think (though it does mean I’ll have to change the lab website’s URL, logo, and–horror of horrors–color scheme). Unfortunately, all the wonderful people of the PILab will be staying behind, as they all have various much more important ties to Boulder (by which I mean that I’m not actually currently paying any of their salaries, and none of them were willing to subsist on the stipend of baked beans, love, and high-speed internet I offered them).

While I’m super excited about moving to Austin, I’m not at all excited to leave Colorado. Boulder is a wonderful place to live*–it’s sunny all the time, has a compact, walkable core, a surprising amount of stuff to do, and these gigantic mountain things you can walk all over. My wife and I have made many incredible friends here, and after four years in Colorado, it’s come to feel very much like home. So leaving will be difficult. Still, I’m excited to move on to new things. As great as the past four years have been, a number of factors precipitated this move:

  • The research fit is better. This isn’t in any way a knock against the environment here at Colorado, which has been great (hey, they’re hiring! If you do computational cognitive neuroscience, you should apply!). I had great colleagues here who work on some really interesting questions–particularly Tor Wager, my postdoc advisor for my first 3 years here, who’s an exceptional scientist and stellar human being. But every department necessarily has to focus on some areas at the expense of others, and much of the research I do (or would ideally like to do) wasn’t well-represented here. In particular, my interests in personality and individual differences have languished during my time in Boulder, as I’ve had trouble finding collaborators for most of the project ideas I’ve had. UT-Austin, by contrast, has one of the premier personality and individual differences groups anywhere. I’m delighted to be working a few doors down from people like Sam Gosling, Jamie Pennebaker, Elliot Tucker-Drob, and David Buss. On top of that, UT-Austin still has major strengths in most of my other areas of interest, most notably neuroimaging (I expect to continue to collaborate frequently with Russ Poldrack) and data mining (a world-class CS department with an expanding focus on Big Data). So, purely in terms of fit, it’s hard for me to imagine a better place than UT.
  • I’m excited to work on a project with immediate real-world impact. While I’d love to believe that most of the work I currently do is making the world better in some very small way, the reality most scientists engaged in basic research face is that at the end of the day, we don’t actually know what impact we’re having. There’s nothing inherently wrong with that, mind you; as a general rule, I’m a big believer in the idea of doing science just because it’s interesting and exciting, without worrying about the consequences (or lack thereof). You know, knowledge for its own sake and all that. Still, on a personal level, I find myself increasingly wanting to do something that I feel confers some clear and measurable benefit on the world right now–however small. In that respect, online education strikes me as an excellent area to pour my energy into. And PsyHorns is a particularly unusual (and, to my mind, promising) experiment in online education. The preliminary data from previous iterations of the course suggests that students who take the course synchronously online do better academically–not just in this particular class (as compared to an in-class section), but in other courses as well. While I’m not hugely optimistic about the malleability of the human mind as a general rule–meaning, I don’t think there are as-yet undiscovered teaching approaches that are going to radically improve the learning experience–I do believe strongly in the cumulative impact of many small nudges in the right direction. I think this is the right platform for that kind of nudging.
  • Data. Lots and lots of data. Enrollment in PsyHorns this year is about 1,500 students, and previous iterations have seen comparable numbers. As part of their introduction to psychology, the students engage in a wide range of activities: they have group chats about the material they’re learning; they write essays about a range of topics; they fill out questionnaires and attitude surveys; and, for the first time this year, they use a mobile app that assesses various aspects of their daily experience. Aside from the feedback we provide to the students (some of which is potentially actionable right away), the data we’re collecting provides a unique opportunity to address many questions at the intersection of personality and individual differences, health and subjective well-being, and education. It’s not Big Data by, say, Google or Amazon standards (we’re talking thousands of rows rather than billions), but it’s a dataset with few parallels in psychology, and I’m thrilled to be able to work on it.
  • I like doing research more than I like teaching** or doing service work. Like my current position, the position I’m assuming at UT-Austin is 100% research-focused, with very little administrative or teaching overhead. Obviously, it doesn’t have the long-term security of a tenure-track position, but I’m okay with that. I’m still selectively applying for tenure-track positions (and turned one down this year in favor of the UT position), so it’s not as though I have any principled objections to the tenure stream. But short of a really amazing opportunity, I’m very happy with my current arrangement.
  • mmm, chocolatey Austin goodness...
    Austin seems like a pretty awesome place to live. Boulder is too, but after four years of living in a relatively small place (population: ~100,000), my wife and I are looking forward to living somewhere more city-like. We’ve opted to take the (expensive) plunge and live downtown–where we’ll be within walking distance of just about everything we need. By which of course I mean the chocolate fountain at the Whole Foods mothership.
  • The tech community in Austin is booming. Given that most of my work these days lies at the interface of psychology and informatics, and there are unprecedented opportunities for psychology-related data mining in industry these days, I’m hoping to develop better collaborations with people in industry–at both startups and established companies. While I have no intention of leaving academia in the near future, I do think psychologists have collectively failed to take advantage of the many opportunities to collaborate with folks in industry on interesting questions about human behavior–often at an unprecedented scale. I’ve done a terrible job of that myself, and fixing that is near the top of my agenda. So, hey, if you work at a tech company in Austin and have some data lying around that you think might shed new insights on what people feel, think, and do, let’s chat!
  • I guess sometimes you just get the itch to move on to something new. For me, this is that.

University of Texas Austin campus at sunset-dusk - aerial view

 

 

* Okay, it was an amazing place to live until the massive floods this past week rearranged rivers, roads, and lives. My wife and I were fortunate enough to escape any personal or material damage, but many others were not so lucky. If you’d like to help, please consider making a donation.

** Actually, I love teaching. What I don’t love is all the stuff surrounding teaching.

what do you get when you put 1,000 psychologists together in one journal?

I’m working on a TOP SEKKRIT* project involving large-scale data mining of the psychology literature. I don’t have anything to say about the TOP SEKKRIT* project just yet, but I will say that in the process of extracting certain information I needed in order to do certain things I won’t talk about, I ended up with certain kinds of data that are useful for certain other tangential analyses. Just for fun, I threw some co-authorship data from 2,000+ Psychological Science articles into the d3.js blender, and out popped an interactive network graph of all researchers who have published at least 2 papers in Psych Science in the last 10 years**. It looks like this:

coauthorship_graph

You can click on the image to take a closer (and interactive) look.

I don’t think this is very useful for anything right now, but if nothing else, it’s fun to drag Adam Galinsky around the screen and watch half of the field come along for the ride. There are plenty of other more interesting things one could do with this, though, and it’s also quite easy to generate the same graph for other journals, so I expect to have more to say about this later on.
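For anyone curious how a graph like this might be assembled, here’s a rough sketch in plain Python. It is not the code behind the figure: the author lists are invented, the coauthorship_graph function is hypothetical, the repeated pruning is just one reading of the "at least 2 papers with other researchers who meet the same criterion" rule, and the nodes/links output is simply a shape commonly used in d3.js force-layout examples.

```python
import json
from collections import Counter
from itertools import combinations

# Hypothetical input: one author list per article.
papers = [
    ["A. Author", "B. Author"],
    ["A. Author", "C. Author"],
    ["B. Author", "C. Author", "D. Author"],
    ["E. Author"],
]


def coauthorship_graph(papers, min_papers=2):
    """Build a nodes/links dict, keeping only well-connected authors."""
    kept = {a for authors in papers for a in authors}
    # Repeatedly drop authors with fewer than min_papers papers
    # co-authored with at least one other still-kept author.
    while True:
        counts = Counter()
        for authors in papers:
            present = [a for a in set(authors) if a in kept]
            if len(present) >= 2:
                counts.update(present)
        survivors = {a for a in kept if counts[a] >= min_papers}
        if survivors == kept:
            break
        kept = survivors
    # Count shared papers for each pair of surviving authors.
    links = Counter()
    for authors in papers:
        present = sorted(a for a in set(authors) if a in kept)
        links.update(combinations(present, 2))
    return {
        "nodes": [{"id": a} for a in sorted(kept)],
        "links": [
            {"source": s, "target": t, "value": n}
            for (s, t), n in sorted(links.items())
        ],
    }


graph = coauthorship_graph(papers)
print(json.dumps(graph, indent=2))
```

Swapping in author lists from a different journal is all it would take to produce the equivalent graph there, which is the sense in which this generalizes easily.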

 

* It’s not really TOP SEKKRIT at all–it just sounds more exciting that way.

** Or, more accurately, researchers who have co-authored at least 2 Psych Science papers with other researchers who meet the same criterion. Otherwise we’d have even more nodes in the graph, and as you can see, it’s already pretty messy.

unconference in Leipzig! no bathroom breaks!

Südfriedhof von Leipzig [HDR]

Many (most?) regular readers of this blog have probably been to at least one academic conference. Some of you even have the misfortune of attending conferences regularly. And a still-smaller fraction of you scholarly deviants might conceivably even enjoy the freakish experience. You know, that whole thing where you get to roam around the streets of some fancy city for a few days seeing old friends, learning about exciting new scientific findings, and completely ignoring the manuscripts and reviews piling up on your desk in your absence. It’s a loathsome, soul-scorching experience. Unfortunately it’s part of the job description for most scientists, so we shoulder the burden without complaining too loudly to the government agencies that force us to go to these things.

This post, thankfully, isn’t about a conference. In fact, it’s about the opposite of a conference, which is… an UNCONFERENCE. An unconference is a social event type of thing that strips away all of the unpleasant features of a regular conference–you know, the fancy dinners, free drinks, and stimulating conversation–and replaces them with a much more authentic academic experience. An authentic experience in which you spend the bulk of your time situated in a 10′ x 10′ room (3 m x 3 m for non-Imperialists) with 10 – 12 other academics, and no one’s allowed to leave the room, eat anything, or take bathroom breaks until someone in the room comes up with a brilliant discovery and wins a Nobel prize. This lasts for 3 days (plus however long it takes for the Nobel to be awarded), and you pay $1200 for the privilege ($1160 if you’re a post-doc or graduate student). Believe me when I tell you that it’s a life-changing experience.

Okay, I exaggerate a bit. Most of those things aren’t true. Here’s one explanation of what an unconference actually is:

An unconference is a participant-driven meeting. The term “unconference” has been applied, or self-applied, to a wide range of gatherings that try to avoid one or more aspects of a conventional conference, such as high fees, sponsored presentations, and top-down organization. For example, in 2006, CNNMoney applied the term to diverse events including Foo Camp, BarCamp, Bloggercon, and Mashup Camp.

So basically, my description was accurate up until the part where I said there were no bathroom breaks.

Anyway, I’m going somewhere with this, I promise. Specifically, I’m going to Leipzig, Germany! In September! And you should come too!

The happy occasion is Brainhack 2012, an unconference organized by the creative minds over at the Neuro Bureau–coordinators of such fine projects as the Brain Art Competition at OHBM (2012 incarnation going on in Beijing right now!) and the admittedly less memorable CNS 2007 Surplus Brain Yard Sale (guess what–turns out selling human brains out of the back of an unmarked van violates all kinds of New York City ordinances!).

Okay, as you can probably tell, I don’t quite have this event promotion thing down yet. So in the interest of ensuring that more than 3 people actually attend this thing, I’ll just shut up now and paste the official description from the Brainhack website:

The Neuro Bureau is proud to announce the 2012 Brainhack, to be held from September 1-4 at the Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany.

Brainhack 2012 is a unique workshop with the goals of fostering interdisciplinary collaboration and open neuroscience. The structure builds from the concepts of an unconference and a hackathon: The term “unconference” refers to the fact that most of the content will be dynamically created by the participants — a hackathon is an event where participants collaborate intensively on science-related projects.

Participants from all disciplines related to neuroimaging are welcome. Ideal participants span in range from graduate students to professors across any disciplines willing to contribute (e.g., mathematics, computer science, engineering, neuroscience, psychology, psychiatry, neurology, medicine, art, etc…). The primary requirement is a desire to work in close collaborations with researchers outside of your specialization in order to address neuroscience questions that are beyond the expertise of a single discipline.

In all seriousness though, I think this will be a blast, and I’m really looking forward to it. I’m contributing the full Neurosynth dataset as one of the resources participants will have access to (more on that in a later post), and I’m excited to see what we collectively come up with. I bet it’ll be at least three times as awesome as the Surplus Brain Yard Sale–though maybe not quite as lucrative.

 

 

p.s. I’ll probably also be in Amsterdam, Paris, and Geneva in late August/early September; if you live in one of these fine places and want to show me around, drop me an email. I’ll buy you lunch! Well, except in Geneva. If you live in Geneva, I won’t buy you lunch, because I can’t afford lunch in Geneva. You’ll buy yourself a nice Swiss lunch made of clockwork and gold, and then maybe I’ll buy you a toothpick.

the neuroinformatics of Neopets

In the process of writing a short piece for the APS Observer, I was fiddling around with Google Correlate earlier this evening. It’s a very neat toy, but if you think neuroimaging or genetics have a big multiple comparisons problem, playing with Google Correlate for a few minutes will put things in perspective. Here’s a line graph displaying the search term most strongly correlated (over time) with searches for “neuroinformatics”:

That’s right, the search term that covaries most strongly with “neuroinformatics” is none other than “Illinois film office” (which, to be fair, has a pretty appealing website). Other top matches include “wma support”, “sim codes”, “bed-in-a-bag”, “neopets secret”, “neopets guild”, and “neopets secret avatars”.

I may not have learned much about neuroinformatics from this exercise, but I did get a pretty good sense of how neuroinformaticians like to spend their free time…

 

p.s. I was pretty surprised to find that normalized search volume for just about every informatics-related term has fallen sharply in the last 10 years. I went in expecting the opposite! Maybe all the informaticians were early search adopters, and the rest of the world caught up? No, probably not. Anyway, enough of this; Neopia is calling me!

p.p.s. Seriously though, this is why data fishing expeditions are dangerous. Any one of these correlations is significant at p-less-than-point-whatever-you-like. And if your publication record depended on it, you could probably tell yourself a convincing story about why neuroinformaticians need to look up Garmin eMaps…

Attention publishers: the data in your tables want to be free! Free!

The Neurosynth database is getting an upgrade over the next couple of weeks; it’s going to go from 4,393 neuroimaging studies to around 5,800. Unfortunately, updating the database is kind of a pain, because academic publishers like to change the format of their full-text HTML articles, which has a nasty habit of breaking the publisher-specific HTML parsers I’ve written. When you expect ScienceDirect to give you <table cellspacing=10>, but you get <table> with no cellspacing attribute (the horror!), bad things happen in XPath land. And then those bad things need to be repaired. And I hate repairing stuff! So I don’t do it very often. Like, once every 6 to 9 months.
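The cellspacing breakage is easy to reproduce in miniature. The sketch below uses the standard library’s ElementTree, which supports only a small XPath subset and needs well-formed markup, so the two snippets are simplified, made-up stand-ins for real publisher HTML (which typically needs a more forgiving parser). The expressions and the count_tables helper are illustrative, not the actual Neurosynth parsers.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified (and well-formed) snippets of publisher markup.
old_markup = (
    "<html><body><table cellspacing='10'>"
    "<tr><td>-42</td></tr></table></body></html>"
)
new_markup = (
    "<html><body><table>"
    "<tr><td>-42</td></tr></table></body></html>"
)

# Brittle: only matches tables carrying the exact attribute value.
BRITTLE = ".//table[@cellspacing='10']"
# More tolerant: matches any table, leaving filtering to later steps.
TOLERANT = ".//table"


def count_tables(markup, xpath):
    """Count the elements in the markup matched by the given expression."""
    root = ET.fromstring(markup)
    return len(root.findall(xpath))


for label, markup in [("old", old_markup), ("new", new_markup)]:
    print(label, count_tables(markup, BRITTLE), count_tables(markup, TOLERANT))
```

The trade-off is that the tolerant expression matches more than you want, so the burden shifts to whatever inspects the rows afterwards.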

In an ideal world, there would be no need to write (and fix) custom filters for different publishers, because the publishers would all simultaneously make XML representations of their articles available (in addition to HTML, PDF, etc.), and then people who have legitimate data mining reasons for regularly downloading hundreds of articles at a time wouldn’t have to cry themselves to sleep every night. But as it stands, only one major publisher of neuroimaging articles (PLoS) provides XML versions of all articles. A minority of articles from other publishers are available in XML from BioMed Central, but that’s still just a fraction of the existing literature.

Anyway, the HTML thing is annoying, but it’s possible to work around it. What’s much more problematic is that some publishers lock up the data in the tables of their articles. To make Neurosynth work, I have to be able to identify rows in tables that look like brain activations. That is, things that look roughly like this:

Most publishers are nice enough to format article tables as HTML tables; which is to say, I can look for tags like <table> and then work down the XPath tree to identify all the rows, and then scan each row for values that look activation-like. Then those values go into the database, and poof, next thing you know, you have meta-analytic brain activation maps from hundreds of studies. But some publishers–most notably, Frontiers–throw a wrench in the works by failing to format tables in HTML; instead, they present the tables as images (see for instance this JPEG table, pulled from this article). Which means I can’t really extract any data from them, and as a result, you’re not going to see activations from articles published in Frontiers journals in Neurosynth any time soon. So if you publish fMRI articles in Frontiers in Human Neuroscience regularly, and are wondering why I’ve been ignoring you (I like you! I promise!), now you know.
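The row-scanning step might look something like the sketch below. It assumes the table has already been reduced to lists of cell strings, and it uses an invented heuristic: the first run of three consecutive numeric cells, each within ±100, is treated as an x/y/z coordinate. The function names, the threshold, and the sample rows are illustrative assumptions, not the rules Neurosynth actually applies.

```python
import re

# Plain signed integers or decimals only (no exponents, no "nan").
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_number(cell):
    """Return the cell as a float, or None if it isn't a plain number."""
    text = cell.strip().replace("\u2212", "-")  # normalize Unicode minus
    if NUMBER.fullmatch(text):
        return float(text)
    return None


def find_coordinates(row, limit=100.0):
    """Return the first run of three in-range numeric cells, else None."""
    values = [parse_number(cell) for cell in row]
    for i in range(len(values) - 2):
        triple = values[i:i + 3]
        if all(v is not None and abs(v) <= limit for v in triple):
            return tuple(triple)
    return None


# Hypothetical table rows, already extracted as cell text.
rows = [
    ["Region", "x", "y", "z", "t"],
    ["L amygdala", "-22", "-4", "-18", "5.31"],
    ["R insula", "36", "18", "\u22126", "4.87"],
    ["Cluster size", "1200", "", "", ""],
]

for row in rows:
    print(row[0], find_coordinates(row))
```

A heuristic like this only works when the cells exist as text in the first place, which is exactly what a table delivered as a JPEG takes away.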

Anyway, on the remote chance that anyone reading this has any sway with people high up at Frontiers, could you please ask them to release their data? Pretty please? Lack of access to data in tables seems to be a pretty common complaint in the data mining community; I’ve talked to other people in the neuroinformatics world who’ve also expressed frustration about it, and I imagine the same is true of people in other disciplines. It’s particularly surprising given that Frontiers is, in theory, an open access publisher. I can see the data in your tables, Frontiers; why won’t you also let me read it?

Okay, I know this kind of stuff doesn’t really interest anyone; I’m just venting. The main point is, Neurosynth is going to be bigger and (very slightly) better in the near future.