R, the master troll of statistical languages

Warning: what follows is a somewhat technical discussion of my love-hate relationship with the R statistical language, in which I somehow manage to waste 2,400 words talking about a single line of code. Reader discretion is advised.

I’ve been using R to do most of my statistical analysis for about 7 or 8 years now–ever since I was a newbie grad student and one of the senior grad students in my lab introduced me to it. Despite having spent hundreds (thousands?) of hours in R, I have to confess that I’ve never set aside much time to really learn it very well; what basic competence I’ve developed has been acquired almost entirely by reading the inline help and consulting the Oracle of Google when I run into problems. I’m not very good at setting aside time for reading articles or books or working my way through other people’s code (probably the best way to learn), so the net result is that I don’t know R nearly as well as I should.

That said, if I’ve learned one thing about R, it’s that R is all about flexibility: almost any task can be accomplished in a dozen different ways. I don’t mean that in the trivial sense that pretty much any substantive programming problem can be solved in any number of ways in just about any language; I mean that for even very simple and well-defined tasks involving just one or two lines of code there are often many different approaches.

To illustrate, consider the simple task of selecting a column from a data frame (data frames in R are basically just fancy tables). Suppose you have a dataset that looks like this:

In most languages, there would be one standard way of pulling columns out of this table. Just one unambiguous way: if you don’t know it, you won’t be able to work with data at all, so odds are you’re going to learn it pretty quickly. R doesn’t work that way. In R there are many ways to do almost everything, including selecting a column from a data frame (one of the most basic operations imaginable!). Here are four of them:

 

I won’t bother to explain all of these; the point is that, as you can see, they all return the same result (namely, the first column of the ice.cream data frame, named ‘flavor’).
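For concreteness, here is a minimal sketch of that comparison. Only the name ice.cream and its first column, flavor, come from the discussion above; the other columns and all of the values are invented, and the four expressions are common idioms rather than a definitive list:

```r
# Hypothetical stand-in data: only `ice.cream` and `flavor` come from the text;
# the other columns and all values are invented for illustration.
ice.cream <- data.frame(
  flavor   = c("chocolate", "vanilla", "strawberry"),
  calories = c(250, 210, 230),
  rating   = c(9, 7, 8),
  stringsAsFactors = FALSE
)

by_dollar <- ice.cream$flavor        # named element, dollar-sign syntax
by_double <- ice.cream[["flavor"]]   # named element, double square brackets
by_index  <- ice.cream[, 1]          # column position
by_name   <- ice.cream[, "flavor"]   # column name inside single brackets

# All four expressions produce the same character vector
identical(by_dollar, by_double)
identical(by_index, by_name)
identical(by_dollar, by_index)
```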

This type of flexibility enables incredibly powerful, terse code once you know R reasonably well; unfortunately, it also makes for an extremely steep learning curve. You might wonder why that would be–after all, at its core, R still lets you do things the way most other languages do them. In the above example, you don’t have to use anything other than the simple index-based approach (i.e., data[,1]), which is the way most other languages that have some kind of data table or matrix object (e.g., MATLAB, Python/NumPy, etc.) would prefer you to do it. So why should the extra flexibility present any problems?

The answer is that when you’re trying to learn a new programming language, you typically do it in large part by reading other people’s code–and nothing is more frustrating to a newbie when learning a language than trying to figure out why sometimes people select columns in a data frame by index and other times they select them by name, or why sometimes people refer to named properties with a dollar sign and other times they wrap them in a vector or double square brackets. There are good reasons to have all of these different idioms, but you wouldn’t know that if you’re new to R and your expectation, quite reasonably, is that if two expressions look very different, they should do very different things. The flexibility that experienced R users love is very confusing to a newcomer. Most other languages don’t have that problem, because there’s only one way to do everything (or at least, far fewer ways than in R).

Thankfully, I’m long past the point where R syntax is perpetually confusing. I’m now well into the phase where it’s only frequently confusing, and I even have high hopes of one day making it to the point where it barely confuses me at all. But I was reminded of the steepness of that initial learning curve the other day while helping my wife use R to do some regression analyses for her thesis. Rather than explaining what she was doing, suffice it to say that she needed to write a function that, among other things, takes a data frame as input and retains only the numeric columns for subsequent analysis. Data frames in R are actually lists under the hood, so they can have mixed types (i.e., you can have string columns and numeric columns and factors all in the same data frame; R lists basically work like hashes or dictionaries in other loosely-typed languages like Python or Ruby). So you can run into problems if you haphazardly try to perform numerical computations on non-numerical columns (e.g., good luck computing the mean of ‘cat’, ‘dog’, and ‘giraffe’), and hence, pre-emptive selection of only the valid numeric columns is required.
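A small sketch of both points, using a made-up pets data frame (the name and values are purely illustrative): a data frame really is a list of columns with possibly different types, and asking for the mean of a character column does not stop with an error but quietly hands back NA along with a warning.

```r
# Hypothetical example data; names and values are made up.
pets <- data.frame(
  animal = c("cat", "dog", "giraffe"),
  weight = c(4.2, 30.5, 800),
  stringsAsFactors = FALSE
)

is.list(pets)        # TRUE: a data frame is a list of equal-length columns
class(pets$animal)   # "character"
class(pets$weight)   # "numeric"

good <- mean(pets$weight)                    # works as expected
# mean() on a character vector returns NA with a warning rather than failing
bad <- suppressWarnings(mean(pets$animal))
is.na(bad)                                   # TRUE
```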

Now, in most languages (including R), you can solve this problem very easily using a loop. In fact, in many languages, you would have to use an explicit for-loop; there wouldn’t be any other way to do it. In R, you might do it like this*:

numeric_cols = rep(FALSE, ncol(ice.cream))
for (i in 1:ncol(ice.cream)) numeric_cols[i] = is.numeric(ice.cream[,i])

We allocate memory for the result, then loop over each column and check whether or not it’s numeric, saving the result. Once we’ve done that, we can select only the numeric columns from our data frame with ice.cream[,numeric_cols].

This is a perfectly sensible way to solve the problem, and as you can see, it’s not particularly onerous to write out. But of course, no self-respecting R user would write an explicit loop that way, because R provides you with any number of other tools to do the job more efficiently. So instead of saying “just loop over the columns and check if is.numeric() is true for each one,” when my wife asked me how to solve her problem, I cleverly said “use apply(), of course!”

apply() is an incredibly useful built-in function that implicitly loops over one or more margins of a matrix; in theory, you should be able to do the same work as the above two lines of code with just the following one line:

apply(ice.cream, 2, is.numeric)

Here the first argument is the data we’re passing in, the third argument is the function we want to apply to the data (is.numeric()), and the second argument is the margin over which we want to apply that function (1 = rows, 2 = columns, etc.). And just like that, we’ve cut the length of our code in half!

Unfortunately, when my wife tried to use apply(), her script broke. It didn’t break in any obvious way, mind you (i.e., with a crash and an error message); instead, the apply() call returned a perfectly good vector. It’s just that all of the values in that vector were FALSE. Meaning, R had decided that none of the columns in my wife’s data frame were numeric–which was most certainly incorrect. And because the code wasn’t throwing an error, and the apply() call was embedded within a longer function, it wasn’t obvious to my wife–as an R newbie and a novice programmer–what had gone wrong. From her perspective, the regression analyses she was trying to run with lm() were breaking with strange messages. So she spent a couple of hours trying to debug her code before asking me for help.

Anyway, I took a look at the help documentation, and the source of the problem turned out to be the following: apply() only operates over arrays (matrices being the two-dimensional case), and not on data frames. So when you pass a data frame to apply() as the input, it’s implicitly converted to a matrix. Unfortunately, because matrices can only contain values of one data type, any data frame that has at least one string (or factor) column will end up being converted to a string (or, in R’s nomenclature, character) matrix. And so now when we apply the is.numeric() function to each column of the matrix, the answer is always going to be FALSE, because all of the columns have been converted to character vectors. So apply() is actually doing exactly what it’s supposed to; it’s just that it doesn’t deign to tell you that it’s implicitly casting your data frame to a matrix before doing anything else. The upshot is that unless you carefully read the apply() documentation and have a basic understanding of data types (which, if you’ve just started dabbling in R, you may well not), you’re hosed.
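The coercion is easy to see on a toy example. This is only a sketch; the ice.cream data frame here is an invented stand-in with one string column and one numeric column:

```r
# Hypothetical stand-in for the ice.cream data frame; values are invented.
ice.cream <- data.frame(
  flavor   = c("chocolate", "vanilla", "strawberry"),
  calories = c(250, 210, 230),
  stringsAsFactors = FALSE
)

# Roughly what apply() works on: the data frame coerced to a matrix
m <- as.matrix(ice.cream)
typeof(m)     # "character", because one column holds strings

# So every column of that matrix fails the is.numeric() test
via_apply <- apply(ice.cream, 2, is.numeric)
via_apply     # flavor: FALSE, calories: FALSE

# With only numeric columns the coercion is harmless
numeric_only <- apply(ice.cream["calories"], 2, is.numeric)
numeric_only  # calories: TRUE
```

Printing typeof(as.matrix(df)) is a quick way to check what apply() will actually see before trusting its output.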

At this point I could have–and probably should have–thrown in the towel and just suggested to my wife that she use an explicit loop. But that would have dealt a mortal blow to my pride as an experienced-if-not-yet-guru-level R user. So of course I did what any self-respecting programmer does: I went and googled it. And the first thing I came across was the all.is.numeric() function in the Hmisc package which has the following description:

Tests, without issuing warnings, whether all elements of a character vector are legal numeric values.

Perfect! So now the solution to my wife’s problem became this:

library(Hmisc)
apply(ice.cream, 2, all.is.numeric)

…which had the desirable property of actually working. But it still wasn’t very satisfactory, because it requires loading a pretty large library (Hmisc) with a bunch of dependencies just to do something very simple that should really be doable in the base R distribution. So I googled some more. And came across a relevant Stack Exchange answer, which had the following simple solution to my wife’s exact problem:

sapply(ice.cream, is.numeric)

You’ll notice that this is virtually identical to the apply() approach that silently failed. That’s no coincidence; it turns out that sapply() is just a variant of apply() that works on lists. And since data frames are actually lists, there’s no problem passing in a data frame and iterating over its columns. So just like that, we have an elegant one-line solution to the original problem that doesn’t invoke any loops or third-party packages.
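Put together, the whole task might look something like the sketch below. The data are invented, and keep_numeric is a hypothetical helper name used only for illustration, not the function my wife actually wrote:

```r
# Hypothetical stand-in for the ice.cream data frame; values are invented.
ice.cream <- data.frame(
  flavor   = c("chocolate", "vanilla", "strawberry"),
  calories = c(250, 210, 230),
  rating   = c(9, 7, 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper (the name is made up): keep only the numeric columns
keep_numeric <- function(df) {
  numeric_cols <- sapply(df, is.numeric)  # named logical vector, one entry per column
  df[, numeric_cols, drop = FALSE]        # drop = FALSE keeps a data frame even for one column
}

flags <- sapply(ice.cream, is.numeric)
flags                  # flavor: FALSE, calories: TRUE, rating: TRUE

numeric_only <- keep_numeric(ice.cream)
names(numeric_only)    # "calories" "rating"
```

The drop = FALSE is a defensive choice: without it, a data frame with exactly one numeric column would come back as a bare vector rather than a data frame.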

Now, having used apply() a million times, I probably should have known about sapply(). And actually, it turns out I did know about sapply–in 2009. A Spotlight search reveals that I used it in some code I wrote for my dissertation analyses. But that was 2009, back when I was smart. In 2012, I’m the kind of person who uses apply() a dozen times a day, and is vaguely aware that R has a million related built-in functions like sapply(), tapply(), lapply(), and vapply(), yet still has absolutely no idea what all of those actually do. In other words, in 2012, I’m the kind of experienced R user that you might generously call “not very good at R”, and, less generously, “dumb”.
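For the record, here is a rough sketch of what those siblings do in their most basic usage. The data are made up, and each function has more options than shown here:

```r
x <- list(a = 1:3, b = c(2.5, 4.5))

as_list   <- lapply(x, mean)              # always returns a list
as_vector <- sapply(x, mean)              # simplifies to a named vector where it can
checked   <- vapply(x, mean, numeric(1))  # like sapply(), but the result type is declared up front

# tapply() applies a function to groups of a vector defined by a second vector
scores <- c(9, 7, 8, 6)
group  <- c("dairy", "dairy", "sorbet", "sorbet")
by_group <- tapply(scores, group, mean)   # dairy: 8, sorbet: 7
```

The third argument to vapply() is a template for what each call must return, so a function that unexpectedly returns the wrong type or length raises an error instead of silently changing the shape of the result.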

On the plus side, the end product is undeniably cool, right? There are very few languages in which you could achieve so much functionality so compactly right out of the box. And this isn’t an isolated case; base R includes a zillion high-level functions to do similarly complex things with data in a fraction of the code you’d need to write in most other languages. Once you throw in the thousands of high-quality user-contributed packages, there’s nothing else like it in the world of statistical computing.

Anyway, this inordinately long story does have a point to it, I promise, so let me sum up:

  • If I had just ignored the desire to be efficient and clever, and had told my wife to solve the problem the way she’d solve it in most other languages–with a simple for-loop–it would have taken her a couple of minutes to figure out, and she’d probably never have run into any problems.
  • If I’d known R slightly better, I would have told my wife to use sapply(). This would have taken her 10 seconds and she’d definitely never have run into any problems.
  • BUT: because I knew enough R to be clever but not enough R to avoid being stupid, I created an entirely avoidable problem that consumed a couple of hours of my wife’s time. Of course, now she knows about both apply() and sapply(), so you could argue that in the long run, I’ve probably still saved her time. (I’d say she also learned something about her husband’s stubborn insistence on pretending he knows what he’s doing, but she’s already the world-leading expert on that topic.)

Anyway, this anecdote is basically a microcosm of my entire experience with R. I suspect many other people will relate. Basically what it boils down to is that R gives you a certain amount of rope to work with. If you don’t know what you’re doing at all, you will most likely end up accidentally hanging yourself with that rope. If, on the other hand, you’re a veritable R guru, you will most likely use that rope to tie some really fancy knots, scale tall buildings, fashion yourself a space tuxedo, and, eventually, colonize brave new statistical worlds. For everyone in between novice and guru (e.g., me), using R on a regular basis is a continual exercise in alternately thinking “this is fucking awesome” and banging your head against the wall in frustration at the sheer stupidity (either your own, or that of the people who designed this awful language). But the good news is that the longer you use R, the more of the former and the fewer of the latter experiences you have. And at the end of the day, it’s totally worth it: the language is powerful enough to make you forget all of the weird syntax, strange naming conventions, choking on large datasets, and issues with data type conversions.

Oh, except when your wife is gently reprimanding you for wasting several hours of her time on a problem she could have solved herself in 5 minutes if you hadn’t insisted that she do it the idiomatic R way. Then you remember exactly why R is the master troll of statistical languages.

 

 

* R users will probably notice that I use the = operator for assignment instead of the <- operator even though the latter is the officially prescribed way to do it in R (i.e., a <- 2 is favored over a = 2). That’s because these two idioms are interchangeable in all but one (rare) use case, and personally I prefer to avoid extra keystrokes whenever possible. But the fact that you can do even basic assignment in two completely different ways in R drives home the point about how pathologically flexible–and, to a new user, confusing–the language is.
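The footnote doesn’t say which case it has in mind. One well-known place where the two operators do behave differently is inside a function call, sketched below with a made-up helper called total (the name and the values are purely illustrative):

```r
# Hypothetical helper for illustration only
total <- function(vals) sum(vals)

# At the top level the two forms do the same thing
a <- 2
b = 2
same_at_top_level <- identical(a, b)

# Inside a call, `=` names an argument and creates no variable...
r1 <- total(vals = 1:3)
created_by_equals <- exists("vals", inherits = FALSE)

# ...whereas `<-` assigns in the calling environment and then passes the value
r2 <- total(vals <- 1:3)
created_by_arrow <- exists("vals", inherits = FALSE)
```

Both calls return the same total; the only difference is that the second one leaves a variable named vals behind in the workspace.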

23 thoughts on “R, the master troll of statistical languages”

  1. I am, of course, going to share this story with my students at the end of my “programming through R” class this fall.

    But why get rid of the non-numeric columns at all? If you’re doing regression, just give the appropriate column names in the formulas, and pass the whole data frame to lm.

  2. Of course you’ll share it at the end… because you’re the master troll of statistics instructors. :)

    There was a bunch of other stuff involved in processing the data; e.g., most of the variables were continuous but the people who collected it had taken the odd step of coding missing responses as a 6 (on an otherwise 5-point scale), so recoding was necessary for numerical columns. Plus she didn’t just want to drop the non-numeric columns; there were a bunch of factors (race, gender, etc.) that went into the regression as well as nominal covariates. So there were principled reasons for having to identify all and only the numeric columns.

  3. I’m glad it isn’t just me … I love R but the apply() family always catches me. My heuristic is never to use apply() type functions without checking ?apply, ?sapply etc. first

    I don’t do enough R programming to know the differences automatically … I’ve also recently realized they can often be avoided by rowMeans() and colMeans() …

  4. Hilarious. And so true. I laughed a lot. Especially at the “stubborn insistence on pretending that he knows what he’s doing”. I’m with you 100% … unfortunately. The point that using a (gasp) loop is not actually going to cause a thunderbolt to strike us is a good one. We’re never going to be Shalizi or Wickham or Ihaka, so get over it and go for the simple route.

  5. Thom, to be honest, I’m a bit glad to hear you have this problem too; makes me feel better about myself if you still struggle with this kind of stuff! But on the other hand, maybe that means the road to gurudom is even longer than I thought…

    Alan, I hear you and agree… but then again, at least one of the people who’ve commented in this thread is going to be Shalizi or Wickham or Ihaka, so… ;)

    Jake, sure, but unless you’re doing something that really places a premium on CPU cycles, I think most programmers’ energies (well, mine at least) are directed at minimizing the amount of code they write, not the total amount that gets executed. Personally if I need to do anything that involves more than 10 – 20 lines of code and isn’t completely specific to my own project, I’ll usually spend a few minutes searching for existing packages that could save me the trouble. But it’s quite possible I’m just a particularly lazy programmer!

  6. Great post. As a beginner R user, I will take it as a cautionary tale, especially when dispensing advice to a significant other.

  7. “actually, it turns out I did know about sapply–in 2009. A Spotlight search reveals that I used it in some code I wrote for my dissertation analyses. But that was 2009, back when I was smart.”

    I hate to tell you this, but this problem gets worse over time as your own history gets longer (I wrote my first stat code in 1969, just to give you some perspective).

    Luckily, all these “desktop search” functions can help you find your own answers, assuming you digitized them at some point.

    Final word of advice: Yeah, I know it’s better (more efficient) to use other structures besides loops in R, but it’s also better to get the code working quickly and accurately. A loop is a wonderful thing.

  8. “That’s no coincidence; it turns out that sapply() is just a variant of apply() that works on lists. ”

    Actually, sapply and lapply are much more basic than apply. Just look at the code. Apply does looping in R whereas lapply is an internal (C-level) function and sapply is just lapply plus some simplifying. But there is also vapply that is much better than any of these and that your wife should have used. (But I have never figured out how it works).

  9. Hi, I read your blog. I’m currently looking for Business Analysts who are specialists in the R language. I work at the American software company Micros Fidelio, where I’m the HR Manager. Can you please send me candidates, or names of people who are professionals in this?

    Thanks

    Mezei Andrea

  10. R gives the world its ten-thousand-and-first computer language. However, I have found that using R as a standalone language is a bad idea. It’s much, much better to prepare data for R, and to receive data from R, from a scripting language like Perl or Python or Ruby. The extraordinarily limited number of data types, the lack of pointers (references), and a host of other things make this tough sledding for people who are used to languages that can stand on their own.

    R’s convenience functions for textual data are hilariously underpowered. It’s nice that R circumvents the bloatedness of old SPSS or SAS programs and it’s also nice that R is so easy to call from all the major scripting languages. However, I can only see R fanatics insisting that this is a full tool all by itself because most apps in the world need statistics as ONE of the outputs of a piece of software. That’s why your, for example, Perl script calls R as a kind of Perl convenience function … and it is VERY convenient for that.

  11. John P. : “However, I can only see R fanatics insisting that this is a full tool all by itself because most apps in the world need statistics as ONE of the outputs of a piece of software.”

    That’s a bit of a straw man – even R fanatics don’t tend to use or advocate R for general programming – just for statistics. Most R fanatics (that I’m aware of) will happily use other languages to call R (e.g., Python) or use R to manage other software (JAGS etc.).

    Mind you, lots of R users probably over-use R in the sense that some other language would be more efficient for their task, but that’s because of switching and other costs. However, that’s a general rule of programming (or indeed technology).

  12. Thom: “even R fanatics don’t tend to use or advocate R for general programming – just for statistics.”

    Not disputing your general argument, but I just finished reading the book, “Quantitative Corpus Linguistics with R”, in which R is promoted for text processing. I think it was the most mind-numbing misapplication of a programming language I have ever seen.

  13. Great post. I particularly liked the bit about the multiple ways to select a column from a data frame.

    Honestly, though, “there is more than one way to do it” doesn’t make for great language design (IMHO); it makes for steeper learning curves. (Cf perl.)

  14. I feel your pain regarding reading R code from other people… but I am going to flip it… I fear the day someone tries to figure out my code. Take for example my function NameClass which I wrote a couple of years ago and I use almost every time I open RStudio.
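    NameClass <- function(df) {
      # Build a table with one row per column of df: its name and its class
      nc <- as.data.frame(names(df))
      for (i in seq_len(nrow(nc))) {
        if (class(df[,i])[1] == "labelled") {
          # Labelled columns carry their underlying class in second position
          nc[i,2] <- class(df[,i])[2]
        } else {
          # Take only the first class so multi-class columns still fit in one cell
          nc[i,2] <- class(df[,i])[1]
        }
      }
      names(nc) <- c("var.name","var.class")
      nc$var.name <- as.character(nc$var.name)
      nc$var.class <- as.factor(nc$var.class)
      message("Dataframe contains two variables var.name & var.class")
      return(nc)
    }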

  15. What a coincidence! I ran into this exact same problem earlier today. I realized I could use `sapply`, but I didn’t investigate why `apply` hadn’t worked. Thanks for the explanation!

  16. As to using <- for assignment. The reason I do it is to avoid confusing assignments with function arguments, where = is the only operator allowed. It just means when I look back on my code I can easily separate the two.

  17. I have the same experience; God knows how many hours I’ve spent trying to debug an R program. I think 13 years of Matlab programming has had its effects on me.

  18. A little late to the party here, but thank you for letting me know it’s not just me. I’ve been programming in various languages for decades, I think I’m pretty good at it, but R wrong-foots me every time. Expectations: violated!
