TomorrowSoon: Photos of mid-80s scientific equipment catalogs.
> format(8:12) == 8:12
[1] FALSE FALSE TRUE TRUE TRUE
If this makes perfect sense to you and you already see how it relates to the title, congratulations. You can leave now.
I’m processing minirhizotron images and I needed to check that we’d used the correct calibration constants for each tube. The real data are large, ugly, and not public yet, so here’s a minimal equivalent demo:
calib = data.frame(
    num=5:15,
    date=as.Date(c(
        rep("2014-01-01", 3),
        rep("2014-01-02", 3),
        rep("2014-01-03", 3),
        rep("2014-01-04", 2))))
data = expand.grid(
    num=1:20,
    date=as.Date(16070:16075, origin="1970-01-01"),
    value1=1, # These vary in real data, but doesn't matter for demo.
    value2=2,
    KEEP.OUT.ATTRS=FALSE)
Since calibrations don’t usually change within a day, I wanted to cross-reference one dataframe of canonical calibrations (calib) against the calibrations recorded in a second dataframe (data); multiple values for one number/day combination indicate trouble.
whichval1 = function(x){
    # Given one row (tube number, date) of calibration,
    # return all distinct value1 from the dataset.
    unique(data$value1[data$num == x[1] & data$date == x[2]])
}
whichval2 = function(x){
    # Given one row (tube number, date) of calibration,
    # return all distinct value2 from the dataset.
    unique(data$value2[data$num == x[1] & data$date == x[2]])
}
chkvals = function(df){
    # Given a calibration table, pass one row at a time to whichval,
    # and display the result added to the calibration table.
    df$val1 = apply(df, 1, whichval1)
    df$val2 = apply(df, 1, whichval2)
    return(df)
}
The result:
> chkvals(calib)
   num       date val1 val2
1    5 2014-01-01         2
2    6 2014-01-01         2
3    7 2014-01-01         2
4    8 2014-01-02         2
5    9 2014-01-02         2
6   10 2014-01-02    1    2
7   11 2014-01-03    1    2
8   12 2014-01-03    1    2
9   13 2014-01-03    1    2
10  14 2014-01-04    1    2
11  15 2014-01-04    1    2
We know that value1 equals 1 everywhere, so why the empty spaces? And weirder, why doesn’t whichval2 ever fail in the same way? The functions are identical! Let’s test with just one row…
> chkvals(calib[1,])
  num       date val1 val2
1   5 2014-01-01    1    2
Wait, but that failed just a second ago…
> chkvals(calib[4,])
  num       date val1 val2
4   8 2014-01-02    1    2
> chkvals(calib[5,])
  num       date val1 val2
5   9 2014-01-02    1    2
> chkvals(calib[6,])
  num       date val1 val2
6  10 2014-01-02    1    2
> chkvals(calib[4:6,])
  num       date val1 val2
4   8 2014-01-02         2
5   9 2014-01-02         2
6  10 2014-01-02    1    2
…wut.
I’ll skip the rest of the debugging except to say it involved a lot of str() and cursing. Here’s what was happening.
The basic problem is that apply is intended for use on arrays, not dataframes. It expects to operate on a single datatype, and converts its input to achieve that. For a dataframe, this is done with a call to as.matrix, which checks the type of each column, finds a non-numeric type (in our case, dates) and coerces everything to a string by calling format() on it… and format pads its output with whitespace!
> format(c(1,2,3,4,5))
[1] "1" "2" "3" "4" "5"
> format(c(1,10,100,1000,10000))
[1] "    1" "   10" "  100" " 1000" "10000"
When these formatted no-longer-numbers get passed in to whichval1(), R’s type coercion rules do their thing again and we learn that "1" == 1 but " 9" != 9.
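Here is a small sketch of that conversion in isolation. The data frame toy is a made-up stand-in for calib, not part of the real analysis; as.matrix is the same conversion apply performs on a dataframe before it hands rows to your function.

```r
# Hypothetical toy data frame: one numeric column, one Date column.
toy <- data.frame(num = 8:11,
                  date = as.Date("2014-01-01") + 0:3)

m <- as.matrix(toy)        # what apply() does to a dataframe first
typeof(m)                  # "character": the Date column drags everything to strings
num_strings <- unname(m[, "num"])
num_strings                # " 8" " 9" "10" "11", padded to a common width

# Comparing a string with a number coerces the number to a string,
# so the padding decides whether the match succeeds.
padded_match   <- " 9" == 9    # FALSE
unpadded_match <- "10" == 10   # TRUE
```

The one-row case works for the same reason: format() of a single number has nothing to pad against.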
But it gets weirder! Why doesn’t the same thing happen when we call whichval2 a moment later? Because the apply over whichval1 is actually returning a list (the matches have different lengths, so it can’t simplify them to a vector), and it’s still a list after it’s added to the data frame! I had to go read the definition of as.matrix.data.frame to learn that when as.matrix reads this new list-bearing data frame, it flags the whole matrix as “non-atomic”, skips the non-numeric conversions, and returns a matrix whose cells are still numbers. 1==1 and 9==9, and the matching works as intended.
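A rough sketch of the list-column effect, again on a made-up toy data frame. The val1 column here is built by hand to imitate what apply returned; it is an assumption for illustration, not output from the real data.

```r
# Hypothetical toy data frame with a Date column, as before.
toy <- data.frame(num = 8:11,
                  date = as.Date("2014-01-01") + 0:3)

# Hand-built list column: two empty results and two matches.
toy$val1 <- list(numeric(0), numeric(0), 1, 1)

m <- as.matrix(toy)
matrix_type <- typeof(m)    # "list" rather than "character"
cell <- m[[2, "num"]]       # the tube number is still a number, no padding
cell_matches <- cell == 9   # TRUE
```

Same dataframe, same dates, one extra column, and a completely different conversion path.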
“But wait!” you say. “What about the dates? The things that made us go down this whole coercion-to-strings path in the first place?” Well, they played along happily and survived the conversion just fine because… because… because Dates are stored as plain numbers (days since 1970-01-01) in the first place.
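A quick way to see this for yourself, as a minimal sketch (the variable names are mine):

```r
d <- as.Date("2014-01-02")

storage  <- typeof(d)    # "double": a number wearing a class attribute
day_num  <- unclass(d)   # 16072, days since 1970-01-01
same_day <- d == 16072   # TRUE: comparing against the bare number works
```

That is also why the demo could build its dates from 16070:16075 with origin="1970-01-01".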
Grrrrr.
Don’t use apply. Apply is for matrices, and dataframes are lists, not matrices.
Dataframes are lists, not matrices.
Dataframes are lists. Not matrices.
Dataframes are lists! Not matrices!
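If you don’t believe me, ask R. A minimal check on a made-up data frame (toy is hypothetical):

```r
toy <- data.frame(num = 1:3,
                  date = as.Date("2014-01-01") + 0:2)

df_is_list   <- is.list(toy)       # TRUE
df_is_matrix <- is.matrix(toy)     # FALSE
col_classes  <- sapply(toy, class) # each column keeps its own type

col_classes
```

A matrix has one type for every cell; a dataframe has one type per column, and apply has to flatten that away before it can do anything.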
So after all this, I rewrote my cross-indexing functions:
whichval.new = function(var, n, d){
    unique(data[data$num == n & data$date == d, var])
}
chkvals.new = function(df){
    df$val1 = mapply(whichval.new, "value1", df$num, df$date)
    df$val2 = mapply(whichval.new, "value2", df$num, df$date)
    return(df)
}
I don’t claim it’s brilliant, but it’s less ugly than last night’s version. Also, it works right.
> chkvals.new(calib)
   num       date val1 val2
1    5 2014-01-01    1    2
2    6 2014-01-01    1    2
3    7 2014-01-01    1    2
4    8 2014-01-02    1    2
5    9 2014-01-02    1    2
6   10 2014-01-02    1    2
7   11 2014-01-03    1    2
8   12 2014-01-03    1    2
9   13 2014-01-03    1    2
10  14 2014-01-04    1    2
11  15 2014-01-04    1    2
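For comparison, here is a sketch of a different approach that avoids row-wise looping altogether: count the distinct values per tube/day with aggregate, then merge the counts onto the calibration table. This is not what I actually ran; the shrunken calib and data below and the names n_distinct and checked are made up for illustration.

```r
# Hypothetical, smaller versions of the demo tables.
calib <- data.frame(num = 5:7, date = as.Date("2014-01-01"))
data <- expand.grid(num = 1:10,
                    date = as.Date("2014-01-01") + 0:1,
                    value1 = 1,
                    value2 = 2,
                    KEEP.OUT.ATTRS = FALSE)

# Count distinct value1 and value2 for every (num, date) combination.
n_distinct <- aggregate(cbind(value1, value2) ~ num + date,
                        data = data,
                        FUN = function(v) length(unique(v)))

# Attach the counts to the calibration rows; anything above 1 is trouble.
checked <- merge(calib, n_distinct, by = c("num", "date"), all.x = TRUE)
checked
```

Here the value1 and value2 columns of checked hold counts rather than the values themselves, so a clean dataset shows 1 everywhere.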
I defined everything conservatively: MN, NY, and VA are all blue. I grew up on the Minnesota border and couldn’t count the number of days and nights I’ve spent there, but it was never home. I lived in VA and NY for one summer each, but only saw about as many sights in each summer as I’d now see in a week of vacation.
Next, of course, you’ll want to make your own.
When you wash a fresh kale leaf, the epicuticular waxes on the underside make a smooth hydrophobic interface with the water, which reflects light in a kind of dancing, fast-changing way that makes it look silver and is also very hard to photograph.
This is one of my favorite colors.
Today I quit for lunch, turned around, and found my advisor sitting on the hall bench right next to my desk. He was ignoring me and looking at his phone, but sitting in the exact spot that has the best view of my screen.
I stepped into the hallway. “Watching me work, are you?”
My advisor didn’t even look up from his phone. “No way, dude. That’s like watching paint dry.”