The homogenization of scientific computing, or why Python is steadily eating other languages’ lunch

Over the past two years, my scientific computing toolbox has been steadily homogenizing. Around 2010 or 2011, my toolbox looked something like this:

  • Ruby for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development;
  • Python/NumPy (mostly) and MATLAB (occasionally) for numerical computing;
  • MATLAB for neuroimaging data analysis;
  • R for statistical analysis;
  • R for plotting and visualization;
  • Occasional excursions into other languages/environments for other stuff.

In 2013, my toolbox looks like this:

  • Python for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development, except for an occasional date with Django or Flask (Python frameworks);
  • Python (NumPy/SciPy) for numerical computing;
  • Python (Neurosynth, NiPy etc.) for neuroimaging data analysis;
  • Python (NumPy/SciPy/pandas/statsmodels) for statistical analysis;
  • Python (Matplotlib) for plotting and visualization, except for web-based visualizations (JavaScript/d3.js);
  • Python (scikit-learn) for machine learning;
  • Excursions into other languages have dropped markedly.

You may notice a theme here.

The increasing homogenization (Pythonification?) of the tools I use on a regular basis primarily reflects the spectacular recent growth of the Python ecosystem. A few years ago, you couldn’t really do statistics in Python unless you wanted to spend most of your time pulling your hair out and wishing Python were more like R (which is a pretty remarkable confession considering what R is like). Neuroimaging data could be analyzed in SPM (MATLAB-based), FSL, or a variety of other packages, but there was no viable full-featured, free, open-source Python alternative. Packages for machine learning, natural language processing, and web application development were only just starting to emerge.

These days, tools for almost every aspect of scientific computing are readily available in Python. And in a growing number of cases, they’re eating the competition’s lunch.

Take R, for example. R’s out-of-the-box performance with out-of-memory datasets has long been recognized as its Achilles’ heel (yes, I’m aware you can get around that if you’re willing to invest the time–but not many scientists have the time). But even people who hated the way R chokes on large datasets, and its general clunkiness as a language, often couldn’t help running back to R as soon as any kind of serious data manipulation was required. You could always laboriously write code in Python or some other high-level language to pivot, aggregate, reshape, and otherwise pulverize your data, but why would you want to? The beauty of packages like plyr in R was that you could, in a matter of 2–3 lines of code, perform enormously powerful operations that could take hours to duplicate in other languages. The downside was the steep learning curve associated with each package’s often quite complicated API (e.g., ggplot2 is incredibly expressive, but every time I stop using ggplot2 for 3 months, I have to completely re-learn it), and having to contend with R’s general awkwardness. But still, on the whole, it was clearly worth it.

Flash forward to The Now. Last week, someone asked me for some simulation code I’d written in R a couple of years ago. As I was firing up RStudio to dig around for it, I realized that I hadn’t actually fired up RStudio for a very long time prior to that moment–probably not in about 6 months. The combination of NumPy/SciPy, Matplotlib, pandas and statsmodels had effectively replaced R for me, and I hadn’t even noticed. At some point I just stopped dropping out of Python and into R whenever I had to do the “real” data analysis. Instead, I just started importing pandas and statsmodels into my code. The same goes for machine learning (scikit-learn), natural language processing (nltk), document parsing (BeautifulSoup), and many other things I used to do outside Python.
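To give a flavor of the pivot, aggregate, and reshape operations mentioned above, here is a minimal sketch in pandas. The data frame, its column names (subject, condition, rt), and its values are all made up for illustration; only the pandas calls themselves (groupby, agg, pivot_table, melt) are real.

```python
import pandas as pd

# Hypothetical long-format data: one row per subject x condition x trial.
df = pd.DataFrame({
    "subject":   ["s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2"],
    "condition": ["easy", "easy", "hard", "hard", "easy", "easy", "hard", "hard"],
    "rt":        [410, 430, 610, 590, 380, 400, 560, 600],
})

# Aggregate: mean and trial count of rt for each subject/condition pair.
summary = df.groupby(["subject", "condition"])["rt"].agg(["mean", "count"])

# Pivot: one row per subject, one column per condition, mean rt in the cells.
wide = df.pivot_table(index="subject", columns="condition",
                      values="rt", aggfunc="mean")

# Reshape: melt the wide table back into long format.
long_again = wide.reset_index().melt(id_vars="subject",
                                     var_name="condition",
                                     value_name="mean_rt")

print(summary)
print(wide)
print(long_again)
```

Each step is a single expression, which is roughly the economy that used to require a trip to plyr.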

It turns out that the benefits of doing all of your development and analysis in one language are quite substantial. For one thing, when you can do everything in the same language, you don’t have to suffer the constant cognitive switch costs of reminding yourself, say, that Ruby uses blocks instead of comprehensions, or that you need to call len(array) instead of array.length to get the size of an array in Python; you can just keep solving the problem you’re trying to solve with as little cognitive overhead as possible. Also, you no longer need to worry about interfacing between different languages used for different parts of a project. Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses*. In isolation, this kind of thing is not a big deal. It doesn’t take very long to write out a CSV or JSON file from Python and then read it into R. But it does add up. It makes integrated development more complicated, because you end up with more code scattered around your drive in more locations (well, at least if you have my organizational skills). It means you spend a non-negligible portion of your “analysis” time writing trivial little wrappers for all that interface stuff, instead of thinking deeply about how to actually transform and manipulate your data. And it means that your beautiful analytics code is marred by all sorts of ugly open() and read() I/O calls. All of this overhead vanishes as soon as you move to a single language.
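As a rough illustration of the single-language workflow described above, the sketch below parses some raw text, loads it into a data frame, and fits a simple line, all in one process with no intermediate CSV or JSON file. The text format, the field names (subject, load, rt), and the numbers are invented for the example, and the least-squares fit via np.polyfit is just a stand-in for whatever analysis would otherwise have been handed off to another tool.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical raw text, standing in for a log file or scraped document.
raw = """\
subject=01 load=1 rt=450
subject=01 load=2 rt=500
subject=01 load=3 rt=550
subject=02 load=1 rt=470
subject=02 load=2 rt=520
subject=02 load=3 rt=570
"""

# Step 1: text processing. Pull the fields out of each line with a regex.
pattern = re.compile(r"subject=(\d+) load=(\d+) rt=(\d+)")
records = [
    {"subject": m.group(1), "load": int(m.group(2)), "rt": int(m.group(3))}
    for m in pattern.finditer(raw)
]

# Step 2: the parsed records go straight into a DataFrame, with no file on disk.
df = pd.DataFrame(records)

# Step 3: analysis in the same session. Fit rt = slope * load + intercept.
slope, intercept = np.polyfit(df["load"].to_numpy(), df["rt"].to_numpy(), deg=1)

print(df.groupby("load")["rt"].mean())
print(f"slope={slope:.1f}, intercept={intercept:.1f}")
```

The parsed objects never leave memory, so there is no export format to agree on and no reader to write on the other side.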

Convenience aside, another thing that’s impressive about the Python scientific computing ecosystem is that a surprising number of Python-based tools are now best-in-class (or close to it) in terms of scope and ease of use–and, by virtue of C bindings, often even in terms of performance. It’s hard to imagine an easier-to-use machine learning package than scikit-learn, even before you factor in the breadth of implemented algorithms, excellent documentation, and outstanding performance. Similarly, I haven’t missed any of the data manipulation functionality in R since I switched to pandas. Actually, I’ve discovered many new tricks in pandas I didn’t know in R (some of which I’ll describe in an upcoming post). Considering that pandas considerably outperforms R for many common operations, the reasons for me to switch back to R or other tools–even occasionally–have dwindled.
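For readers who haven’t seen scikit-learn, a minimal sketch of its fit/predict interface may help show what “easy to use” means here. The choice of the bundled iris dataset, logistic regression, and 5-fold cross-validation is mine, purely for illustration; any other estimator in the library could be swapped in with the same few calls.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Every estimator exposes the same interface: construct, fit, predict.
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validated accuracy, one score per fold.
scores = cross_val_score(model, X, y, cv=5)

# Fit on the full dataset and predict labels for the first three samples.
model.fit(X, y)
predictions = model.predict(X[:3])

print(scores.mean())
print(predictions)
```

Because the interface is uniform, trying a different model is usually a one-line change to the constructor.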

Mind you, I don’t mean to imply that Python can now do everything anyone could ever do in other languages. That’s obviously not true. For instance, there are currently no viable replacements for many of the thousands of statistical packages users have contributed to R (if there’s a good analog for lme4 in Python, I’d love to know about it). In signal processing, I gather that many people are wedded to various MATLAB toolboxes and packages that don’t have good analogs within the Python ecosystem. And for people who need serious performance and work with very, very large datasets, there’s often still no substitute for writing highly optimized code in a low-level compiled language. So, clearly, what I’m saying here won’t apply to everyone. But I suspect it applies to the majority of scientists.

Speaking only for myself, I’ve now arrived at the point where around 90–95% of what I do can be done comfortably in Python. So the major consideration for me, when determining what language to use for a new project, has shifted from “what’s the best tool for the job that I’m willing to learn and/or tolerate using?” to “is there really no way to do this in Python?” By and large, this mentality is a good thing, though I won’t deny that it occasionally has its downsides. For example, back when I did most of my data analysis in R, I would frequently play around with random statistics packages just to see what they did. I don’t do that much any more, because the pain of having to refresh my R knowledge and deal with that thing again usually outweighs the perceived benefits of aimless statistical exploration. Conversely, sometimes I end up using Python packages that I don’t like quite as much as comparable packages in other languages, simply for the sake of preserving language purity. For example, I prefer Rails’ ActiveRecord ORM to the much more explicit SQLAlchemy ORM for Python–but I don’t prefer it enough to justify mixing Ruby and Python objects in the same application. So, clearly, there are costs. But they’re pretty small costs, and for me personally, the scales have now clearly tipped in favor of using Python for almost everything. I know many other researchers who’ve had the same experience, and I don’t think it’s entirely unfair to suggest that, at this point, Python has become the de facto language of scientific computing in many domains. If you’re reading this and haven’t had much prior exposure to Python, now’s a great time to come on board!

Postscript: In the period of time between starting this post and finishing it (two sessions spread about two weeks apart), I discovered not one but two new Python-based packages for data visualization: Michael Waskom’s seaborn package–which provides very high-level wrappers for complex plots, with a beautiful ggplot2-like aesthetic–and Continuum Analytics’ bokeh, which looks like a potential game-changer for web-based visualization**. At the rate the Python ecosystem is moving, there’s a non-zero chance that by the time you read this, I’ll be using some new Python package that directly transliterates my thoughts into analytics code.

 

* I’m aware that there are various interfaces between Python, R, etc. that allow you to internally pass objects between these languages. My experience with these has not been overwhelmingly positive, and in any case they still introduce all the overhead of writing extra lines of code and having to deal with multiple languages.

** Yes, you heard right: web-based visualization in Python. Bokeh generates static JavaScript and JSON for you from Python code, so your users are magically able to interact with your plots on a webpage without you having to write a single line of native JS code.

96 thoughts on “The homogenization of scientific computing, or why Python is steadily eating other languages’ lunch”

  1. Hi, Tal!

    Speaking of ORM, I suggest you take a look at PonyORM (disclaimer – I’m one of the authors): http://ponyorm.com

    PonyORM allows writing queries with a minimum of boilerplate code, in the form of Python generators. PonyORM takes a generator expression, decompiles its bytecode into an abstract syntax tree, and then translates this AST into an equivalent SQL query. So, you can write something like this:

    select(p for p in Product if p.price == max(p.price for p in Product))

    And then Pony will translate this Python generator into something like this:

    SELECT "p"."id", "p"."name", "p"."price", "p"."quantity"
    FROM "Product" "p"
    WHERE "p"."price" = (
        SELECT MAX("p"."price")
        FROM "Product" "p"
    )

    We also have a visual diagram editor, which can be used to design entities and auto-generate the corresponding Python code, like this: https://editor.ponyorm.com/user/pony/eStore

    You can see how some Pony users use it for scientific computing. There was some discussion about PonyORM at the PyData NYC 2013 conference (along with other technologies); you can see the video here: http://vimeo.com/79532571
    and here are the slides: https://github.com/ihaque/pydata_nyc_2013

  2. Alexander, thanks, hadn’t come across PonyORM. It does look awesome–I’ll definitely try it out in the near future. And that visual diagram editor is fantastic!

  3. Cool article, thanks. I don’t know if it’s been mentioned yet, but you can do ActiveRecord-like setups in SQLAlchemy very easily now using Declarative Base. But the killer feature is that when you discover you need to rejig your schema under some already running code, you can still drop into the more explicit style to glue stuff together. If you ever need to interact with a big hairy legacy db or make two dbs pretend to be one, SQLAlchemy is really the only game in town.

  4. Interesting post. What are you using as an IDE for your Python development?
    One thing I love about R is the convenience of using it within RStudio. There is also Shiny for building web interfaces and dashboards. I would be interested to know what alternatives you recommend in the Python ecosystem.

  5. If cognitive overhead is your concern then Python shouldn’t be your choice in spite of what language might be most convenient given access to certain languages’ libraries. There certainly is cognitive overhead when using Python as it has poorly grafted syntax and semantics for object orientation. If you really need to access what Python or any other language affords by way of libraries, then use a bridge and choose whichever language takes your fancy.

  6. Still, it’s a shame that the Python VM is so terribly engineered and an even bigger shame that PyPy is laughably impotent compared to other efforts like LuaJIT.

  7. Which version of Python do you use? A big concern I have with Python is that it seems to have forked into two languages. Most business developers I have met use v2.7, which is not, so far as I know, being evolved. And then there’s v3.x which seems to be mainly used by academics. One good thing about R is that it’s one language.

    1. It’s a little late, but here it is. Python 2 is still updated, and new features are also implemented in it.
      Latest stable releases:
      3.5.1 / 7 December 2015
      2.7.11 / 5 December 2015
      Every programming language breaks compatibility with its older versions, and does so more than once; that means the community is alive and the language is evolving, which is very important, since you don’t want to use obsolete technology nobody uses. Also, the changes between Python 2 and Python 3 are not dramatic. There is a bit more consistency in Python 3: for example, `print` is now a real function you can pass around like any other, which means you have to put () to call it, whereas in Python 2 it’s only a statement. That said, all major libraries support both versions, and do so at runtime. That means you don’t have to download a separate lib for each; you can use the same modules, and they will detect your version and execute the proper code when needed.
      About 2.7 being used more in business and 3 more in research and academia, I’d say you were right 3 years ago, or at least for existing projects. Now Python 3 is taking over Python 2’s territory, and at this point new projects tend to be started in Python 3 rather than Python 2.

      Python has a lot of nice libraries that R doesn’t have, except when it comes to statistics or molecular biology. At least, that was true when I started to work in those fields 3 years ago: I was struggling to find libraries, had to hand-write a lot of code to fit in with other work already done in Python, and also had to fall back to R (which has a terrible syntax and structure compared to Python) to do a nicer job. But that time is past now; there are really nice libs for Python that outshine most libs for R, except Shiny from RStudio. Some of the Python ones come really close, though. I believe that in the next 2 years R will be dead (hallelujah) for those fields.

  8. In bioinformatics you still have to use R, and also Python for some tasks. Most of the analyses rely on specialist packages. In the early phases of an experiment (e.g., RNA-seq) there are huge amounts of textual data. Python is used to process those data down to a more manageable size. But nothing beats R/Bioconductor for the downstream and integrative analysis. It is probably true that either language can be used equally well for machine learning tasks (maybe a slight edge to Python) or for visualization (probably a slight edge to R). For making a web service of the analysis Python is more useful, but specialist R code would be fiendishly difficult to translate. So maybe there you would either use R, or do a simplified, less sophisticated version of the analysis tool for online work. You can of course run Bioconductor via Python, to some extent. But I’m not sure that really helps.

  9. Python in scientific programming is a peer reviewed, open access journal which provides ground for research and practical experience with software engineering environments, tools, languages, scientific and engineering computing…Thanks for sharing these types of informative…
