Neurohackademy 2018: A wrap-up

It’s become something of a truism in recent years that scientists in many fields find themselves drowning in data. This is certainly the case in neuroimaging, where even small functional MRI datasets typically consist of billions of observations (e.g., 100,000 points in the brain, each measured at 1,000 distinct timepoints, in each of 20 subjects, comes to 2 billion observations). Figuring out how to store, manage, analyze, and interpret data on this scale is a monumental challenge–and one that arguably requires a healthy marriage between traditional neuroimaging and neuroscience expertise, and computational skills more commonly found in data science, statistics, or computer science departments.
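To make that scale concrete, here is a back-of-the-envelope sketch of the example above. The three counts come straight from the text; the 4-byte (32-bit float) size per observation is purely an assumption for illustration, and real datasets will differ.

```python
# Back-of-the-envelope size of the example fMRI dataset described above.
n_voxels = 100_000      # points in the brain (from the text)
n_timepoints = 1_000    # measurements per point (from the text)
n_subjects = 20         # subjects (from the text)

# Assumption: each observation is stored as a 32-bit float (4 bytes).
bytes_per_observation = 4

n_observations = n_voxels * n_timepoints * n_subjects
size_gb = n_observations * bytes_per_observation / 1e9

print(f"{n_observations:,} observations")    # 2,000,000,000 observations
print(f"~{size_gb:.0f} GB uncompressed")     # ~8 GB uncompressed
```

Even this "small" example is too large to hold comfortably in memory on many laptops, and that is before any intermediate results are written out.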

In an effort to help bridge this gap, Ariel Rokem and I have spent part of our summer each of the last three years organizing a summer institute at the intersection of neuroimaging and data science. The most recent edition of the institute–Neurohackademy 2018–just wrapped up last week, so I thought this would be a good time to write up a summary of the course: what the course is about, who attended and instructed, what everyone did, and what lessons we’ve learned.

What is Neurohackademy?

Neurohackademy started its life in Summer 2016 as the somewhat more modestly-named Neurohackweek–a one-week program for 40 participants modeled on Astrohackweek, a course organized by the eScience Institute in collaboration with data science initiatives at Berkeley and NYU. The course was (and continues to be) held on the University of Washington’s beautiful campus in Seattle, where Ariel is based (I make the trip from Austin, Texas every year–which, as you can imagine, is a terrible sacrifice on my part given the two locales’ respective summer climates). The first two editions were supported by UW’s eScience Institute (and indirectly, by grants from the Moore and Sloan foundations). Thanks to generous support from the National Institute of Mental Health (NIMH), this year the course expanded to two weeks, 60 participants, and over 20 instructors (our funding continues through 2021, so there will be at least 3 more editions).

The overarching goal of the course is to give neuroimaging researchers the scientific computing and data science skills they need in order to get the most out of their data. Over the course of two weeks, we cover a variety of introductory and (occasionally) advanced topics in data science, and demonstrate how they can be productively used in a range of neuroimaging applications. The course is loosely structured into three phases (see the full schedule here): the first few days feature domain-general data science tutorials; the next few days focus on sample neuroimaging applications; and the last few days consist of a full-blown hackathon in which participants pitch potential projects, self-organize into groups, and spend their time collaboratively working on a variety of software, analysis, and documentation projects.

Who attended?

Admission to Neurohackademy 2018 was extremely competitive: we received nearly 400 applications for just 60 spots. This was a very large increase from the previous two years, presumably reflecting the longer duration of the course and/or our increased efforts to publicize it. While we were delighted by the deluge of applications, it also meant we had to be far more selective about admissions than in previous years. The highly interactive nature of the course, coupled with the high per-participant costs (we provide two weeks of accommodations and meals), makes it unlikely that Neurohackademy will grow beyond 60 participants in future editions, despite the clear demand. Our rough sense is that somewhere between half and two-thirds of all applicants were fully qualified and could have easily been admitted, so there’s no question that, for many applicants, blind luck played a large role in determining whether or not they were accepted. I mention this mainly for the benefit of people who applied for the 2018 course and didn’t make it in: don’t take it personally! There’s always next year. (And, for that matter, there are also a number of other related summer schools we encourage people to apply to, including the Methods in Neuroscience at Dartmouth Computational Summer School, Allen Institute Summer Workshop on the Dynamic Brain, Summer School in Computational Sensory-Motor Neuroscience, and many others.)

The 60 participants who ended up joining us came from a diverse range of demographic backgrounds, academic disciplines, and skill levels. Most of our participants were trainees in academic programs (40 graduate students, 12 postdocs), but we also had 2 faculty members, 6 research staff, and 2 medical residents (note that all of these counts include 4 participants who were admitted to the course but declined to, or could not, attend). We had nearly equal numbers of male and female participants (30F, 33M), and 11 participants came from traditionally underrepresented backgrounds. 43 participants were from institutions or organizations based in the United States, with the remainder coming from 14 different countries around the world.

The disciplinary backgrounds and expertise levels of participants are a bit harder to estimate for various reasons, but our sense is that the majority (perhaps two-thirds) of participants received their primary training in non-computational fields (psychology, neuroscience, etc.). This was not necessarily by design–i.e., we didn’t deliberately favor applicants from biomedical fields over applicants from computational fields–and primarily mirrored the properties of the initial applicant pool. We did impose a hard requirement that participants should have at least some prior expertise in both programming and neuroimaging, but subject to that constraint, there was enormous variation in previous experience along both dimensions–something that we see as a desirable feature of the course (more on this below).

We intend to continue to emphasize and encourage diversity at Neurohackademy, and we hope that all of our participants experienced the 2018 edition as a truly inclusive, welcoming event.

Who taught?

We were fortunate to be able to bring together more than 20 instructors with world-class expertise in a diverse range of areas related to neuroimaging and data science. “Instructor” is a fairly loose term at Neurohackademy: we deliberately try to keep the course non-hierarchical, so that for the most part, instructors are just participants who happen to fall on the high-experience tail of the experience distribution. That said, someone does have to teach the tutorials and lectures, and we were lucky to have a stellar cast of experts on hand. Many of the data science tutorials during the first phase of the course were taught by eScience staff and UW faculty kind enough to take time out of their other duties to help teach participants a range of core computing skills: Git and GitHub (Bernease Herman), R (Valentina Staneva and Tara Madhyastha), web development (Anisha Keshavan), and machine learning (Jake Vanderplas), among others.

In addition to the local instructors, we were joined for the tutorial phase by Kirstie Whitaker (Turing Institute), Chris Gorgolewski (Stanford), Satra Ghosh (MIT), and JB Poline (McGill)–all veterans of the course from previous years (Kirstie was a participant at the first edition!). We’re particularly indebted to Kirstie and Chris for their immense help. Kirstie was instrumental in helping a number of participants bridge the (large!) gap between using Git privately and using it to actively collaborate on a public project. As one of the participants elegantly put it:

Chris shouldered a herculean teaching load, covering Docker, software testing, BIDS and BIDS-Apps, and also leading an open science panel. I’m told he even sleeps on occasion.

We were also extremely lucky to have Fernando Perez (Berkeley)–the creator of IPython and leader of the Jupyter team–join us for several days; his presentation on Jupyter (videos: part 1 and part 2) was one of the highlights of the course for me personally, and I heard many other instructors and participants share the same sentiment. Jupyter was a critical part of our course infrastructure (more on that below), so it was fantastic to have Fernando join us and share his insights on the fascinating history of Jupyter, and on reproducible science more generally.

As the course went on, we transitioned from tutorials focused on core data science skills to more traditional lectures focusing on sample applications of data science methods to neuroimaging data. Instructors during this phase of the course included Tor Wager (Colorado), Eva Dyer (Georgia Tech), Gael Varoquaux (INRIA), Tara Madhyastha (UW), Sanmi Koyejo (UIUC), and Nick Cain and Justin Kiggins (Allen Institute for Brain Science). We continued to emphasize hands-on interaction with data; many of the presenters during this phase spent much of their time showing participants how to work with programmatic tools to generate the kinds of results one might find in papers they’ve authored (e.g., Tor Wager and Gael Varoquaux demonstrated tools for neuroimaging data analysis written in Matlab and Python, respectively).

The fact that so many leading experts were willing to take large chunks of time out of their schedules (most of the instructors hung around for several days, facilitating extended interactions with participants) to visit with us at Neurohackademy speaks volumes about the kind of people who make up the neuroimaging data science community. We’re tremendously grateful to these folks for their contributions, and hope they’ll return to teach at future editions of the institute.

What did we cover?

The short answer is: see for yourself! We’ve put most of the slides, code, and videos from the course online, and encourage people to interact with, learn from, and reuse these materials.

Now the long(er) answer. One of the challenges in organizing scientific training courses that focus on technical skill development is that participants almost invariably arrive with a wide range of backgrounds and expertise levels. At Neurohackademy, some of the participants were effectively interchangeable with instructors, while others were relatively new to programming and/or neuroimaging. The large variance in technical skill is a feature of the course, not a bug: while we require all admitted participants to have some prior programming background, we’ve found that having a range of skill levels is an excellent way to make sure that everyone is surrounded by people who they can alternately learn from, help out, and collaborate with.

That said, the wide range of backgrounds does present some organizational challenges: introductory sessions often bore more advanced participants, while advanced sessions tend to frustrate newcomers. To accommodate the range of skill levels, we tried to design the course in a way that benefits as many people as possible (though we don’t pretend to think it worked great for everyone). During the first two days, we featured two tracks of tutorials at most times, with simultaneously-held presentations generally differing in topic and/or difficulty (e.g., Git/GitHub opposite Docker; introduction to Python opposite introduction to R; basic data visualization opposite computer vision).

Throughout Neurohackademy, we deliberately placed heavy emphasis on the Python programming language. We think Python has a lot going for it as a lingua franca of data science and scientific computing. The language is free, performant, relatively easy to learn, and very widely used within the data science, neuroimaging, and software development communities. It also helps that many of our instructors (e.g., Fernando Perez, Jake Vanderplas, and Gael Varoquaux) are major contributors to the scientific Python ecosystem, so there was a very high concentration of local Python expertise to draw on. That said, while most of our instruction was done in Python, we were careful to emphasize that participants were free to work in whatever language(s) they liked. We deliberately included tutorials and lectures that featured R, Matlab, or JavaScript, and a number of participant projects (see below) were written partly or entirely in other languages, including R, Matlab, JavaScript, and C.

We’ve also found that the tooling we provide to participants matters–a lot. A robust, common computing platform can spell the difference between endless installation problems that eat into valuable course time, and a nearly seamless experience that participants can dive into right away. At Neurohackademy, we made extensive use of the Jupyter suite of tools for interactive computing. In particular, thanks to Ariel’s heroic efforts (which built on some very helpful docs and on similarly heroic efforts by Chris Holdgraf, Yuvi Panda, and Satra Ghosh last year), we were able to conduct a huge portion of our instruction and collaborative hacking using a course-wide JupyterHub allocation, deployed via Kubernetes, running on Google Cloud. This setup allowed Ariel to create a common web-accessible environment for all course participants, so that, at the push of a button, each participant was dropped into a JupyterLab environment containing many of the software dependencies, notebooks, and datasets we used throughout the course. While we did run into occasional scaling bottlenecks (usually when an instructor demoed a computationally intensive method, prompting dozens of people to launch the same process in their pods), for the most part, our participants were able to drop into a running JupyterLab instance within seconds and immediately start interactively playing with the code being presented by instructors.

Surprisingly (at least to us), our total Google Cloud computing costs for the entire two-week, 60-participant course came to just $425. Obviously, that number could have easily skyrocketed had we scaled up our allocation dramatically and allowed our participants to execute arbitrarily large jobs (e.g., preprocessing data from all ~1,200 HCP subjects). But we thought the limits we imposed were pretty reasonable, and our experience suggests that not only is JupyterHub an excellent platform from a pedagogical standpoint, but it can also be an extremely cost-effective one.
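For a sense of what that total means per person, here is a rough calculation. The dollar figure, headcount, and duration come from the text above; dividing evenly across the nominal 60 participants is a simplifying assumption (instructors presumably used the hub as well, which would push the per-user figure even lower).

```python
# Rough per-participant cloud cost for the course, using figures from the text.
total_cost_usd = 425     # total Google Cloud cost reported above
n_participants = 60      # nominal course size (assumes cost is split evenly)
n_weeks = 2              # course duration

cost_per_participant = total_cost_usd / n_participants
cost_per_participant_week = cost_per_participant / n_weeks

print(f"${cost_per_participant:.2f} per participant")                 # $7.08 per participant
print(f"${cost_per_participant_week:.2f} per participant per week")   # $3.54 per participant per week
```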

What did we produce?

Had Neurohackademy produced nothing at all besides the tutorials, slides, and videos generated by instructors, I think it’s fair to say that participants would still have come away feeling that they learned a lot (more on that below). But a major focus of the institute was on actively hacking on the brain–or at least, on data related to the brain. To that end, the last 3.5 days of the course were dedicated exclusively to a full-blown hackathon in which participants pitched potential projects, self-organized into groups, and then spent their time collaboratively working on a variety of software, analysis, and documentation projects. You can find a list of most of the projects on the course projects repository (most link out to additional code or resources).

As one might expect given the large variation in participant experience, project group size, and time investment (some people stuck to one project for all three days, while others moved around), the scope of projects varied widely. From our perspective–and we tried to emphasize this point throughout the hackathon–the important thing was not what participants’ final product looked like, but how much they learned along the way. There’s always a tension between exploitation and exploration at hackathons, with some people choosing to spend most of their time expanding on existing projects using technologies they’re already familiar with, and others deciding to start something completely new, or to try out a new language–and then having to grapple with the attendant learning curve. While some of the projects were based on packages that predated Neurohackademy, most participants ended up working on projects they came up with de novo at the institute, often based on tools or resources they first learned about during the course. I’ll highlight just three projects here that provide a representative cross-section of the range of things people worked on:

1. Peer Herholz and Rita Ludwig created a new BIDS-App called Bidsonym for automated de-identification of neuroimaging data. The app is available from Docker Hub, and features not one, not two, but three different de-identification algorithms. If you want to shave the faces off of your MRI participants with minimal fuss, make friends with Bidsonym.

2. A group of eight participants ambitiously set out to develop a new “O-Factor” metric intended to serve as a relative measure of the openness of articles published in different neuroscience-related journals. The project involved a variety of very different tasks, including scraping (public) data from the PubMed Central API, computing new metrics of code and data sharing, and interactively visualizing the results using a d3 dashboard. While the group was quick to note that their work is preliminary, and has a bunch of current limitations, the results look pretty great–though some disappointment was (facetiously) expressed during the project presentations that the journal Nature is not, as some might have imagined, a safe house where scientific datasets can hide from the prying public.

3. Emily Wood, Rebecca Martin, and Rosa Li worked on tools to facilitate mixed-model analysis of fMRI data using R. Following a talk by Tara Madhyastha on her Neuropointillist R framework for fMRI data analysis, the group decided to create a new series of fully reproducible Markdown-based tutorials for the package (the original documentation was based on non-public datasets). The group expanded on the existing installation instructions (discovering some problems in the process), created several tutorials and examples, and also ended up patching the neuropointillist code to work around a very heavy dependency (FSL).

You can read more about these 3 projects and 14 others on the project repository, and in some cases, you can even start using the tools right away in your own work. Or you could just click through and stare at some of the lovely images participants produced.

So, how did it go?

It went great!

Admittedly, Ariel and I aren’t exactly impartial parties–we wouldn’t keep doing this if we didn’t think participants get a lot out of it. But our assessment isn’t based just on our personal impressions; we have participants fill out a detailed (and anonymous) survey every year, and go out of our way to encourage additional constructive criticism from the participants (which a majority provide). So I don’t think we’re being hyperbolic when we say that most people who participated in the course had an extremely educational and enjoyable experience. Exhibit A is this set of unsolicited public testimonials, courtesy of Twitter:

The organizers and instructors all worked hard to build an event that would bring people together as a collaborative and productive (if temporary) community, and it’s very gratifying to see those goals reflected in participants’ experiences.

Of course, that’s not to say there weren’t things we could do better; there were plenty, and we’ve already made plans to adjust and improve the course next year based on feedback we received. For example, suggestions we received from multiple participants included adding more ice-breaking activities early on in the course; reducing the intensity of the tutorial/lecture schedule the first week (we went 9 am to 6 pm every day, stopping only for an hour-long lunch and a few short breaks); and adding designated periods for interaction with instructors and other participants. We intend to address these (and several other) recommendations in next year’s edition, and expect it to look slightly different from (and hopefully better than!) Neurohackademy 2018.

Thank you!

I think that’s a reasonable summary of what went on at Neurohackademy 2018. We’re delighted at how the event turned out, and are happy to answer questions (feel free to leave them in the comments below, or to email Ariel and/or me).

We’d like to end by thanking all of the people and organizations who helped make Neurohackademy 2018 a success: NIMH for providing the funding that makes Neurohackademy possible; the eScience Institute and staff for throwing their wholehearted support behind the course (particularly our awesome course coordinator, Rachael Murray); and the many instructors who each generously took several days (and in a few cases, more than a week!) out of their schedules, unpaid, to come to Seattle and share their knowledge with a bunch of enthusiastic strangers. On a personal note, I’d also like to thank Ariel, who did the lion’s share of the actual course directing. I mostly just get to show up in Seattle, teach some stuff, hang out with great people, and write a blog post about it.

Lastly, and above all else, we’d like to thank our participants. It’s a huge source of inspiration and joy to us each year to see what a group of bright, enthusiastic, motivated researchers can achieve when given time, space, and freedom (and, okay, maybe also a large dollop of cloud computing credits). We’re looking forward to at least three more years of collaborative, productive neurohacking!