Archive for the ‘data mining’ Category

what do you get when you put 1,000 psychologists together in one journal?

Friday, April 5th, 2013

I’m working on a TOP SEKKRIT* project involving large-scale data mining of the psychology literature. I don’t have anything to say about the TOP SEKKRIT* project just yet, but I will say that in the process of extracting certain information I needed in order to do certain things I won’t talk about, I ended up with certain kinds of data that are useful for certain other tangential analyses. Just for fun, I threw some co-authorship data from 2,000+ Psychological Science articles into the d3.js blender, and out popped an interactive network graph of all researchers who have published at least 2 papers in Psych Science in the last 10 years**. It looks like this:

coauthorship_graph

You can click on the image to take a closer (and interactive) look.

I don’t think this is very useful for anything right now, but if nothing else, it’s fun to drag Adam Galinsky around the screen and watch half of the field come along for the ride. There are plenty of other more interesting things one could do with this, though, and it’s also quite easy to generate the same graph for other journals, so I expect to have more to say about this later on.

 

* It’s not really TOP SEKKRIT at all–it just sounds more exciting that way.

** Or, more accurately, researchers who have co-authored at least 2 Psych Science papers with other researchers who meet the same criterion. Otherwise we’d have even more nodes in the graph, and as you can see, it’s already pretty messy.

unconference in Leipzig! no bathroom breaks!

Monday, June 11th, 2012

Südfriedhof von Leipzig [HDR]

Many (most?) regular readers of this blog have probably been to at least one academic conference. Some of you even have the misfortune of attending conferences regularly. And a still-smaller fraction of you scholarly deviants might conceivably even enjoy the freakish experience. You know, that whole thing where you get to roam around the streets of some fancy city for a few days seeing old friends, learning about exciting new scientific findings, and completely ignoring the manuscripts and reviews piling up on your desk in your absence. It’s a loathsome, soul-scorching experience. Unfortunately it’s part of the job description for most scientists, so we shoulder the burden without complaining too loudly to the government agencies that force us to go to these things.

This post, thankfully, isn’t about a conference. In fact, it’s about the opposite of a conference, which is… an UNCONFERENCE. An unconference is a social event type of thing that strips away all of the unpleasant features of a regular conference–you know, the fancy dinners, free drinks, and stimulating conversation–and replaces them with a much more authentic academic experience. An authentic experience in which you spend the bulk of your time situated in a 10′ x 10′ room (3 m x 3 m for non-Imperialists) with 10 – 12 other academics, and no one’s allowed to leave the room, eat anything, or take bathroom breaks until someone in the room comes up with a brilliant discovery and wins a Nobel prize. This lasts for 3 days (plus however long it takes for the Nobel to be awarded), and you pay $1200 for the privilege ($1160 if you’re a post-doc or graduate student). Believe me when I tell you that it’s a life-changing experience.

Okay, I exaggerate a bit. Most of those things aren’t true. Here’s one explanation of what an unconference actually is:

An unconference is a participant-driven meeting. The term “unconference” has been applied, or self-applied, to a wide range of gatherings that try to avoid one or more aspects of a conventional conference, such as high fees, sponsored presentations, and top-down organization. For example, in 2006, CNNMoney applied the term to diverse events including Foo Camp, BarCamp, Bloggercon, and Mashup Camp.

So basically, my description was accurate up until the part where I said there were no bathroom breaks.

Anyway, I’m going somewhere with this, I promise. Specifically, I’m going to Leipzig, Germany! In September! And you should come too!

The happy occasion is Brainhack 2012, an unconference organized by the creative minds over at the Neuro Bureau–coordinators of such fine projects as the Brain Art Competition at OHBM (2012 incarnation going on in Beijing right now!) and the admittedly less memorable CNS 2007 Surplus Brain Yard Sale (guess what–turns out selling human brains out of the back of an unmarked van violates all kinds of New York City ordinances!).

Okay, as you can probably tell, I don’t quite have this event promotion thing down yet. So in the interest of ensuring that more than 3 people actually attend this thing, I’ll just shut up now and paste the official description from the Brainhack website:

The Neuro Bureau is proud to announce the 2012 Brainhack, to be held from September 1-4 at the Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany.

Brainhack 2012 is a unique workshop with the goals of fostering interdisciplinary collaboration and open neuroscience. The structure builds from the concepts of an unconference and a hackathon: The term “unconference” refers to the fact that most of the content will be dynamically created by the participants — a hackathon is an event where participants collaborate intensively on science-related projects.

Participants from all disciplines related to neuroimaging are welcome. Ideal participants span in range from graduate students to professors across any disciplines willing to contribute (e.g., mathematics, computer science, engineering, neuroscience, psychology, psychiatry, neurology, medicine, art, etc…). The primary requirement is a desire to work in close collaborations with researchers outside of your specialization in order to address neuroscience questions that are beyond the expertise of a single discipline.

In all seriousness though, I think this will be a blast, and I’m really looking forward to it. I’m contributing the full Neurosynth dataset as one of the resources participants will have access to (more on that in a later post), and I’m excited to see what we collectively come up with. I bet it’ll be at least three times as awesome as the Surplus Brain Yard Sale–though maybe not quite as lucrative.

 

 

p.s. I’ll probably also be in Amsterdam, Paris, and Geneva in late August/early September; if you live in one of these fine places and want to show me around, drop me an email. I’ll buy you lunch! Well, except in Geneva. If you live in Geneva, I won’t buy you lunch, because I can’t afford lunch in Geneva. You’ll buy yourself a nice Swiss lunch made of clockwork and gold, and then maybe I’ll buy you a toothpick.

the neuroinformatics of Neopets

Thursday, January 26th, 2012

In the process of writing a short piece for the APS Observer, I was fiddling around with Google Correlate earlier this evening. It’s a very neat toy, but if you think neuroimaging or genetics have a big multiple comparisons problem, playing with Google Correlate for a few minutes will put things in perspective. Here’s a line graph displaying the search term most strongly correlated (over time) with searches for “neuroinformatics”:

That’s right, the search term that covaries most strongly with “neuroinformatics” is none other than “Illinois film office” (which, to be fair, has a pretty appealing website). Other top matches include “wma support”, “sim codes”, “bed-in-a-bag”, “neopets secret”, “neopets guild”, and “neopets secret avatars”.

I may not have learned much about neuroinformatics from this exercise, but I did get a pretty good sense of how neuroinformaticians like to spend their free time…

 

p.s. I was pretty surprised to find that normalized search volume for just about every informatics-related term has fallen sharply in the last 10 years. I went in expecting the opposite! Maybe all the informaticians were early search adopters, and the rest of the world caught up? No, probably not. Anyway, enough of this; Neopia is calling me!

p.p.s. Seriously though, this is why data fishing expeditions are dangerous. Any one of these correlations is significant at p-less-than-point-whatever-you-like. And if your publication record depended on it, you could probably tell yourself a convincing story about why neuroinformaticians need to look up Garmin eMaps…

Attention publishers: the data in your tables want to be free! Free!

Saturday, January 7th, 2012

The Neurosynth database is getting an upgrade over the next couple of weeks; it’s going to go from 4,393 neuroimaging studies to around 5,800. Unfortunately, updating the database is kind of a pain, because academic publishers like to change the format of their full-text HTML articles, which has a nasty habit of breaking the publisher-specific HTML parsers I’ve written. When you expect ScienceDirect to give you <table cellspacing=10>, but you get <table> with no cellspacing attribute (the horror!), bad things happen in XPath land. And then those bad things need to be repaired. And I hate repairing stuff! So I don’t do it very often. Like, once every 6 to 9 months.

In an ideal world, there would be no need to write (and fix) custom filters for different publishers, because the publishers would all simultaneously make XML representations of their articles available (in addition to HTML, PDF, etc.), and then people who have legitimate data mining reasons for regularly downloading hundreds of articles at a time wouldn’t have to cry themselves to sleep every night. But as it stands, only one major publisher of neuroimaging articles (PLoS) provides XML versions of all articles. A minority of articles from other publishers are available in XML from BioMed Central, but that’s still just a fraction of the existing literature.

Anyway, the HTML thing is annoying, but it’s possible to work around it. What’s much more problematic is that some publishers lock up the data in the tables of their articles. To make Neurosynth work, I have to be able to identify rows in tables that look like brain activations. That is, things that look roughly like this:

Most publishers are nice enough to format article tables as HTML tables; which is to say, I can look for tags like <table> and then work down the XPath tree to identify all the the rows, and then scan each rows for values that look activation-like. Then those values go into the database, and poof, next thing you know, you have meta-analytic brain activation maps from hundreds of studies. But some publishers–most notably, Frontiers–throw a wrench in the works by failing to format tables in HTML; instead, they present the tables as images (see for instance this JPEG table, pulled from this article). Which means I can’t really extract any data from them, and as a result, you’re not going to see activations from articles published in Frontiers journals in Neurosynth any time soon. So if you publish fMRI articles in Frontiers in Human Neuroscience regularly, and are wondering why I’ve been ignoring you (I like you! I promise!), now you know.

Anyway, on the remote chance that anyone reading this has any sway with people high up at Frontiers, could you please ask them to release their data? Pretty please? Lack of access to data in tables seems to be a pretty common complaint in the data mining community; I’ve talked to other people in the neuroinformatics world who’ve also expressed frustration about it, and I imagine the same is true of people in other disciplines. It’s particularly surprising given that Frontiers is, in theory, an open access publisher. I can see the data in your tables, Frontiers; why won’t you also let me read it?

Okay, I know this kind of stuff doesn’t really interest anyone; I’m just venting. The main point is, Neurosynth is going to be bigger and (very slightly) better in the near future.