Apply is not for dataframes

Last night, I wrote some crappy R code and remarked that it was definitely ugly, probably contained bugs, and would likely give me trouble in the morning. I was right on all counts. This is the story of the trouble it gave me, not because it was surprising but because it produced such weird symptoms.

A small spoiler to start things off

> format(8:12) == 8:12
[1] FALSE FALSE  TRUE  TRUE  TRUE

If this makes perfect sense to you and you already see how it relates to the title, congratulations. You can leave now.

The problem

I’m processing minirhizotron images and I needed to check that we’d used the correct calibration constants for each tube. The real data are large, ugly, and not public yet, so here’s a minimal equivalent demo:

calib = data.frame(
	num=5:15, 
	date=as.Date(c(
		rep("2014-01-01", 3), 
		rep("2014-01-02", 3), 
		rep("2014-01-03", 3), 
		rep("2014-01-04", 2))))

data = expand.grid(
	num=1:20, 
	date=as.Date(16070:16075, origin="1970-01-01"),
	value1=1,  # These vary in the real data, but that doesn't matter for the demo.
	value2=2,
	KEEP.OUT.ATTRS=FALSE)

Since calibrations don’t usually change within a day, I wanted to cross-reference one dataframe of canonical calibrations (calib) against the calibrations recorded in a second dataframe (data); multiple values for one number/day combination indicate trouble.

whichval1 = function(x){
	# Given one row (tube number, date) of calibration, 
	# return all distinct value1 from the dataset.
	unique(data$value1[data$num == x[1] & data$date == x[2] ])
}
whichval2 = function(x){
	# Given one row (tube number, date) of calibration, 
	# return all distinct value2 from the dataset.
	unique(data$value2[data$num == x[1] & data$date == x[2] ])
}

chkvals = function(df){
	# Given a calibration table, pass one row at a time to whichval,
	# and display the result added to the calibration table.
	df$val1 = apply(df, 1, whichval1)
	df$val2 = apply(df, 1, whichval2)
	return(df)
}

The result:

> chkvals(calib)
   num       date val1 val2
1    5 2014-01-01         2
2    6 2014-01-01         2
3    7 2014-01-01         2
4    8 2014-01-02         2
5    9 2014-01-02         2
6   10 2014-01-02    1    2
7   11 2014-01-03    1    2
8   12 2014-01-03    1    2
9   13 2014-01-03    1    2
10  14 2014-01-04    1    2
11  15 2014-01-04    1    2

We know that value1 equals 1 everywhere, so why the empty spaces? And weirder, why doesn’t whichval2 ever fail in the same way? The functions are identical! Let’s test with just one row…

> chkvals(calib[1,])
  num       date val1 val2
1   5 2014-01-01    1    2

Wait, but that failed just a second ago…

> chkvals(calib[4,])
  num       date val1 val2
4   8 2014-01-02    1    2
> chkvals(calib[5,])
  num       date val1 val2
5   9 2014-01-02    1    2
> chkvals(calib[6,])
  num       date val1 val2
6  10 2014-01-02    1    2
> chkvals(calib[4:6,])
  num       date val1 val2
4   8 2014-01-02         2
5   9 2014-01-02         2
6  10 2014-01-02    1    2

wut.

I’ll skip the rest of the debugging except to say it involved a lot of str() and cursing. Here’s what was happening.

Dataframes are not matrices

The basic problem is that apply is intended for use on arrays, not dataframes. It expects to operate on a single datatype, and converts its input to achieve that. For a dataframe, this is done with a call to as.matrix, which checks the type of each column, finds a non-numeric type (in our case, dates) and coerces every column to strings by calling format() on it… and format pads its output with whitespace so that every element of a column comes out the same width!

> format(c(1,2,3,4,5))
[1] "1" "2" "3" "4" "5"
> format(c(1,10,100,1000,10000))
[1] "    1" "   10" "  100" " 1000" "10000"

When these formatted no-longer-numbers get passed in to whichval1(), R’s type coercion rules do their thing again and we learn that "1" == 1 but " 9" != 9.
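To see the coercion without the rest of the pipeline, here is a minimal sketch. The toy data frame is made up for illustration (it just mimics rows 4 to 6 of calib), so treat the names as assumptions rather than part of the real analysis:

```r
# Hypothetical toy data: one integer column plus one Date column.
toy <- data.frame(num = 8:10, date = as.Date("2014-01-02"))

# This is the conversion apply() performs before calling our function.
m <- as.matrix(toy)

typeof(m)          # "character": the Date column forced everything to strings
m[, "num"]         # " 8" " 9" "10": padded to a common width by format()
m[1, "num"] == 8   # FALSE, because " 8" is compared against "8"
m[3, "num"] == 10  # TRUE, because "10" needed no padding
```

Only the values that are already as wide as the widest one in their column survive the round trip, which is why the failures depend on which rows get passed in together.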

Type conversion is complicated

But it gets weirder! Why doesn’t the same thing happen when we call whichval2 a moment later? Because the first apply call is actually returning a list (the unmatched rows give zero-length results, so apply can’t simplify them into a vector), and it’s still a list after it’s added to the data frame! I had to go read the definition of as.matrix.data.frame to learn that when as.matrix reads this new list-bearing data frame, it flags the whole matrix as “non-atomic”, skips the format() conversions, and returns a list-mode matrix whose cells still hold the original numbers. 1==1 and 9==9, and the matching works as intended.
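Both halves of that can be checked in isolation. This is a sketch on made-up inputs (the little matrix, the toy data frame, and the val column are all hypothetical stand-ins for what chkvals builds):

```r
# 1. apply() falls back to a list when the per-row results differ in length.
res <- apply(matrix(1:4, nrow = 2), 1,
             function(r) if (r[1] == 1) numeric(0) else 1)
is.list(res)       # TRUE: one zero-length result, one length-1 result

# 2. A list column changes what as.matrix() does to the whole data frame.
toy <- data.frame(num = 8:10, date = as.Date("2014-01-02"))
toy$val <- list(numeric(0), numeric(0), 1)  # same shape as the val1 column
m <- as.matrix(toy)

typeof(m)          # "list": no format() call, so no padding
m[[1, "num"]] == 8 # TRUE this time
```

So the second apply call only behaves because the first one misbehaved, which is not a property I'd want to rely on.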

“But wait!” you say. “What about the dates? The things that made us go down this whole coercion-to-strings path in the first place?” Well, they played along happily and survived the conversion just fine because… because… because Dates are stored as plain numbers (days since 1970-01-01) in the first place.
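A quick way to convince yourself of that, using one of the demo dates:

```r
d <- as.Date("2014-01-02")

typeof(d)                  # "double": a number wearing a class attribute
unclass(d)                 # 16072, i.e. days since 1970-01-01
unclass(d) == 16072        # TRUE

# Going the other way, as in the expand.grid() call above:
as.Date(16072, origin = "1970-01-01") == d  # TRUE
```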

Grrrrr.

The fix

Don’t use apply. Apply is for matrices, and dataframes are lists not matrices.

Dataframes are lists, not matrices.

Dataframes are lists. Not matrices.

Dataframes are lists! Not matrices!
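In case the chanting isn’t persuasive, R will confirm it. Again a sketch on a made-up toy data frame, not the real data:

```r
toy <- data.frame(num = 5:7, date = as.Date("2014-01-01"))

is.list(toy)        # TRUE
is.matrix(toy)      # FALSE
length(toy)         # 2: the length of a data frame is its number of columns
sapply(toy, class)  # num is "integer", date is "Date": one type per column
```

Each column keeps its own type for as long as you work column-wise, which is exactly what apply throws away.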

So after all this, I rewrote my cross-indexing functions:

whichval.new = function(var, n, d){
	unique(data[data$num == n & data$date == d, var ])
}

chkvals.new = function(df){
	df$val1 = mapply(whichval.new, "value1", df$num, df$date)
	df$val2 = mapply(whichval.new, "value2", df$num, df$date)
	return(df)
}

I don’t claim it’s brilliant, but it’s less ugly than last night’s version. Also, it works right.

> chkvals.new(calib)
   num       date val1 val2
1    5 2014-01-01    1    2
2    6 2014-01-01    1    2
3    7 2014-01-01    1    2
4    8 2014-01-02    1    2
5    9 2014-01-02    1    2
6   10 2014-01-02    1    2
7   11 2014-01-03    1    2
8   12 2014-01-03    1    2
9   13 2014-01-03    1    2
10  14 2014-01-04    1    2
11  15 2014-01-04    1    2
