Re-cycling

I was assigned to clear out a bunch of old paper from the lab, including some of my advisor’s teaching notes from the late 90s. I couldn’t resist decorating my window with these transparencies of the carbon and nitrogen cycles, but the building administrators really don’t like tape on the windows, so I got creative with the other contents of the file drawer.



Soon: Photos of mid-80s scientific equipment catalogs.

Apply is not for dataframes

Last night, I wrote some crappy R code and remarked that it was definitely ugly, probably contained bugs, and would likely give me trouble in the morning. I was right on all counts. This is the story of the trouble it gave me, not because it was surprising but because it produced such weird symptoms.

A small spoiler to start things off

> format(8:12) == 8:12
[1] FALSE FALSE TRUE TRUE TRUE

If this makes perfect sense to you and you already see how it relates to the title, congratulations. You can leave now.

The problem

I’m processing minirhizotron images and I needed to check that we’d used the correct calibration constants for each tube. The real data are large, ugly, and not public yet, so here’s a minimal equivalent demo:

calib = data.frame(
	num=5:15, 
	date=as.Date(c(
		rep("2014-01-01", 3), 
		rep("2014-01-02", 3), 
		rep("2014-01-03", 3), 
		rep("2014-01-04", 2))))

data = expand.grid(
	num=1:20, 
	date=as.Date(16070:16075, origin="1970-01-01"),
	value1=1,  # These vary in the real data, but that doesn't matter for the demo.
	value2=2,
	KEEP.OUT.ATTRS=FALSE)

Since calibrations don’t usually change within a day, I wanted to cross-reference one dataframe of canonical calibrations (calib) against the calibrations recorded in a second dataframe (data); multiple values for one number/day combination indicate trouble.

whichval1 = function(x){
	# Given one row (tube number, date) of calibration, 
	# return all distinct value1 from the dataset.
	unique(data$value1[data$num == x[1] & data$date == x[2] ])
}
whichval2 = function(x){
	# Given one row (tube number, date) of calibration, 
	# return all distinct value2 from the dataset.
	unique(data$value2[data$num == x[1] & data$date == x[2] ])
}

chkvals = function(df){
	# Given a calibration table, pass one row at a time to whichval,
	# and display the result added to the calibration table.
	df$val1 = apply(df, 1, whichval1)
	df$val2 = apply(df, 1, whichval2)
	return(df)
}

The result:

> chkvals(calib)
   num       date val1 val2
1    5 2014-01-01         2
2    6 2014-01-01         2
3    7 2014-01-01         2
4    8 2014-01-02         2
5    9 2014-01-02         2
6   10 2014-01-02    1    2
7   11 2014-01-03    1    2
8   12 2014-01-03    1    2
9   13 2014-01-03    1    2
10  14 2014-01-04    1    2
11  15 2014-01-04    1    2

We know that value1 equals 1 everywhere, so why the empty spaces? And weirder, why doesn’t whichval2 ever fail in the same way? The functions are identical! Let’s test with just one row…

> chkvals(calib[1,])
  num       date val1 val2
1   5 2014-01-01    1    2

Wait, but that failed just a second ago…

> chkvals(calib[4,])
  num       date val1 val2
4   8 2014-01-02    1    2
> chkvals(calib[5,])
  num       date val1 val2
5   9 2014-01-02    1    2
> chkvals(calib[6,])
  num       date val1 val2
6  10 2014-01-02    1    2
> chkvals(calib[4:6,])
  num       date val1 val2
4   8 2014-01-02         2
5   9 2014-01-02         2
6  10 2014-01-02    1    2

wut.

I’ll skip the rest of the debugging except to say it involved a lot of str() and cursing. Here’s what was happening.

Dataframes are not matrices

The basic problem is that apply is intended for use on arrays, not dataframes. It expects to operate on a single datatype, and converts its input to achieve that. For a dataframe, this is done with a call to as.matrix, which checks the type of each column, finds a non-numeric type (in our case, dates) and coerces every column to strings by calling format() on it… and format pads its output with whitespace so that every element of a column comes out the same width!

> format(c(1,2,3,4,5))
[1] "1" "2" "3" "4" "5"
> format(c(1,10,100,1000,10000))
[1] "    1" "   10" "  100" " 1000" "10000"
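To see the padding on the demo table itself, here is a minimal sketch. It rebuilds calib with a shorter rep() call (the times argument is my shorthand, not the original code) and looks at what as.matrix would hand to apply.

```r
# Same contents as the demo calib table, built more compactly.
calib <- data.frame(
	num = 5:15,
	date = as.Date(rep(
		c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-04"),
		times = c(3, 3, 3, 2))))

m <- as.matrix(calib)   # the conversion apply() performs on a data frame
typeof(m)               # "character": the Date column forces everything to strings
m[1:6, "num"]           # " 5" " 6" " 7" " 8" " 9" "10"
m[1, "num"] == 5        # FALSE: " 5" is not "5"
m[6, "num"] == 10       # TRUE:  "10" needed no padding
```

Only the single-digit tube numbers pick up a leading space, which is why rows 1 through 5 of the full table were the ones that failed.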

When these formatted no-longer-numbers get passed in to whichval1(), R’s type coercion rules do their thing again and we learn that "1" == 1 but " 9" != 9.
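The comparison rule at work can be checked directly at the console. This is just a sketch of the coercion, not part of the original script: when a string meets a number in ==, the number is converted to a string and the two are compared as text.

```r
a <- "1" == 1                  # 1 becomes "1", so the strings match
b <- " 9" == 9                 # 9 becomes "9", which is not " 9"
cc <- as.numeric(" 9") == 9    # converting the string back to a number ignores the padding
c(a, b, cc)                    # TRUE FALSE TRUE
```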

Type conversion is complicated

But it gets weirder! Why doesn’t the same thing happen when we call whichval2 a moment later? Because the failed lookups in whichval1 return zero-length vectors, so apply can’t simplify the results and hands back a list, and it’s still a list after it’s added to the data frame! I had to go read the definition of as.matrix.data.frame to learn that when as.matrix reads this new list-bearing data frame, it flags the whole matrix as “non-atomic”, skips the format() conversion, and returns a matrix of list cells that still hold the original, unpadded numbers. 1==1 and 9==9, and the matching works as intended.
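A small sketch of that non-atomic branch, using a made-up three-row data frame (df and its val column are stand-ins for illustration, not the real calibration table). Once one column is a list, as.matrix gives back a list-mode matrix whose cells hold the raw numbers, with no trip through format().

```r
df <- data.frame(num = 8:10)
# A list column like the one apply() produced: two empty results and one hit.
df$val <- list(numeric(0), numeric(0), 1)

m <- as.matrix(df)
typeof(m)          # "list", not "character"
m[[1, "num"]]      # 8, a number with no padding
```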

“But wait!” you say. “What about the dates? The things that made us go down this whole coercion-to-strings path in the first place?” Well, they played along happily and survived the conversion just fine because… because… because Dates are stored as plain numbers (a count of days since 1970-01-01) in the first place.

Grrrrr.
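For the record, the number hiding inside a Date is easy to expose. A quick sketch:

```r
d <- as.Date("2014-01-01")
n <- unclass(d)     # strip the class to see the stored value
n                   # 16071 days since 1970-01-01
is.numeric(n)       # TRUE
d == 16071          # TRUE: a Date compares happily against the bare number
```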

The fix

Don’t use apply. Apply is for matrices, and dataframes are lists not matrices.

Dataframes are lists, not matrices.

Dataframes are lists. Not matrices.

Dataframes are lists! Not matrices!

So after all this, I rewrote my cross-indexing functions:

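whichval.new = function(var, n, d){
	# Given a column name, a tube number, and a date,
	# return all distinct values of that column from the dataset.
	unique(data[data$num == n & data$date == d, var ])
}

chkvals.new = function(df){
	# Walk the num and date columns in parallel with mapply,
	# so neither one is ever coerced to a string.
	df$val1 = mapply(whichval.new, "value1", df$num, df$date)
	df$val2 = mapply(whichval.new, "value2", df$num, df$date)
	return(df)
}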

I don’t claim it’s brilliant, but it’s less ugly than last night’s version. Also, it works right.

> chkvals.new(calib)
   num       date val1 val2
1    5 2014-01-01    1    2
2    6 2014-01-01    1    2
3    7 2014-01-01    1    2
4    8 2014-01-02    1    2
5    9 2014-01-02    1    2
6   10 2014-01-02    1    2
7   11 2014-01-03    1    2
8   12 2014-01-03    1    2
9   13 2014-01-03    1    2
10  14 2014-01-04    1    2
11  15 2014-01-04    1    2
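Another way to stay clear of apply, sketched here as an alternative rather than as the rewrite I actually used, is to loop over row indices and subset the data frame inside the loop. Each row then arrives as a one-row data frame with its column types intact. The rowclasses helper and the three-row calib below are made up for the illustration.

```r
calib <- data.frame(
	num = 8:10,
	date = as.Date(rep("2014-01-02", 3)))

# Hypothetical helper: report the class of each field in row i.
rowclasses <- function(i) {
	row <- calib[i, ]   # still a data frame, so nothing is converted
	c(num = class(row$num), date = class(row$date))
}

res <- lapply(seq_len(nrow(calib)), rowclasses)
res[[1]]   # num is "integer", date is "Date"
```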

Where I’ve been

Via approximately everybody on Facebook: here’s a map of the states & provinces I’ve lived in (green), know well (blue), have visited (amber), have passed through (red), or have never knowingly set foot in (white).

I defined everything conservatively: MN, NY, and VA are all blue. I grew up on the Minnesota border and couldn’t count the number of days and nights I’ve spent there, but it was never home. I lived in VA and NY for one summer each, but only saw about as many sights in each summer as I’d now see in a week of vacation.

States and provinces Chris Black has visited, as of November 2013.

Next, of course, you’ll want to make your own.

Kale and chard leaves in water, showing a silver color from the reflective hydrophobic interface

When you wash a fresh kale leaf, the epicuticular waxes on the underside make a smooth hydrophobic interface with the water, which reflects light in a kind of dancing, fast-changing way that makes it look silver and is also very hard to photograph.

This is one of my favorite colors.

Vote of Confidence

I work in a fishbowl: My desk is right next to a glass wall that separates the lab from the hallway. Every week or two I look up to find some gaggle of visiting dignitaries scrutinizing the back of my head as they get the We’ll Just Walk By The Lab tour.

Today I quit for lunch, turned around, and found my advisor sitting on the hall bench right next to my desk. He was ignoring me and looking at his phone, but sitting in the exact spot that has the best view of my screen.

I stepped into the hallway. “Watching me work, are you?”

My advisor didn’t even look up from his phone. “No way, dude. That’s like watching paint dry.”