<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Infotrophing &#187; R</title>
	<atom:link href="http://bodger.org/blog/category/r/feed/" rel="self" type="application/rss+xml" />
	<link>http://bodger.org/blog</link>
	<description>A pile of things</description>
	<lastBuildDate>Fri, 21 Mar 2014 02:15:19 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Apply is not for dataframes</title>
		<link>http://bodger.org/blog/2014/02/apply-is-not-for-dataframes/</link>
		<comments>http://bodger.org/blog/2014/02/apply-is-not-for-dataframes/#comments</comments>
		<pubDate>Tue, 25 Feb 2014 04:12:36 +0000</pubDate>
		<dc:creator>bodger</dc:creator>
				<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://bodger.org/blog/?p=138</guid>
		<description><![CDATA[Last night, I wrote some crappy R code and remarked that it was definitely ugly, probably contained bugs, and would likely give me trouble in the morning. I was right on all counts. This is the story of the trouble it gave me, not because it was surprising but because it produced such weird symptoms. [...]]]></description>
				<content:encoded><![CDATA[<p>Last night, I wrote some crappy R code and remarked that it was definitely ugly, probably contained bugs, and would likely give me trouble in the morning. I was right on all counts. This is the story of the trouble it gave me, not because it was surprising but because it produced such weird symptoms.</p>
<h2>A small spoiler to start things off</h2>
<p><code>&gt; format(8:12) == 8:12<br />
[1] FALSE FALSE TRUE TRUE TRUE</code><br />
If this makes perfect sense to you and you already see how it relates to the title, congratulations. You can leave now.</p>
<h2>The problem</h2>
<p>I&#8217;m processing <a href="https://www.msu.edu/~ajay/Roots%20of%20Energy.html">minirhizotron</a> images and I needed to check that we&#8217;d used the correct calibration constants for each tube. The real data are large, ugly, and not public yet, so here&#8217;s a minimal equivalent demo:</p>
<pre>calib = data.frame(
	num=5:15, 
	date=as.Date(c(
		rep("2014-01-01", 3), 
		rep("2014-01-02", 3), 
		rep("2014-01-03", 3), 
		rep("2014-01-04", 2))))

data = expand.grid(
	num=1:20, 
	date=as.Date(16070:16075, origin="1970-01-01"),
	value1=1,  # These vary in the real data, but that doesn't matter for the demo.
	value2=2,
	KEEP.OUT.ATTRS=FALSE)
</pre>
<p>Since calibrations don&#8217;t usually change within a day, I wanted to cross-reference one dataframe of canonical calibrations (<code>calib</code>) against the calibrations recorded in a second dataframe (<code>data</code>); multiple values for one number/day combination indicate trouble.</p>
<pre>
whichval1 = function(x){
	# Given one row (tube number, date) of calibration, 
	# return all distinct value1 from the dataset.
	unique(data$value1[data$num == x[1] &#038; data$date == x[2] ])
}
whichval2 = function(x){
	# Given one row (tube number, date) of calibration, 
	# return all distinct value2 from the dataset.
	unique(data$value2[data$num == x[1] &#038; data$date == x[2] ])
}

chkvals = function(df){
	# Given a calibration table, pass one row at a time to whichval,
	# and display the result added to the calibration table.
	df$val1 = apply(df, 1, whichval1)
	df$val2 = apply(df, 1, whichval2)
	return(df)
}
</pre>
<p>The result:</p>
<pre>
> chkvals(calib)
   num       date val1 val2
1    5 2014-01-01         2
2    6 2014-01-01         2
3    7 2014-01-01         2
4    8 2014-01-02         2
5    9 2014-01-02         2
6   10 2014-01-02    1    2
7   11 2014-01-03    1    2
8   12 2014-01-03    1    2
9   13 2014-01-03    1    2
10  14 2014-01-04    1    2
11  15 2014-01-04    1    2
</pre>
<p>We know that <code>value1</code> equals 1 everywhere, so why the empty spaces? And weirder, why doesn&#8217;t <code>whichval2</code> ever fail in the same way? The two functions are identical apart from the column they read! Let&#8217;s test with just one row&#8230;</p>
<pre>
> chkvals(calib[1,])
  num       date val1 val2
1   5 2014-01-01    1    2
</pre>
<p>Wait, but that failed just a second ago&#8230;</p>
<pre>
> chkvals(calib[4,])
  num       date val1 val2
4   8 2014-01-02    1    2
> chkvals(calib[5,])
  num       date val1 val2
5   9 2014-01-02    1    2
> chkvals(calib[6,])
  num       date val1 val2
6  10 2014-01-02    1    2
> chkvals(calib[4:6,])
  num       date val1 val2
4   8 2014-01-02         2
5   9 2014-01-02         2
6  10 2014-01-02    1    2
</pre>
<p>&#8230;<em>wut</em>.</p>
<p>I&#8217;ll skip the rest of the debugging except to say it involved a lot of <code>str()</code> and cursing. Here&#8217;s what was happening.</p>
<h2>Dataframes are not matrices</h2>
<p>The basic problem is that <code>apply</code> is intended for use on arrays, not dataframes. It expects to operate on a single datatype, and converts its input to achieve that. For a dataframe, this is done with a call to <code>as.matrix</code>, which checks the type of each column, finds a non-numeric type (in our case, dates) and coerces everything to a string <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/matrix.html">by calling <code>format()</code> on it</a>&#8230; and <code>format</code> pads its output with whitespace!</p>
<pre>
> format(c(1,2,3,4,5))
[1] "1" "2" "3" "4" "5"
> format(c(1,10,100,1000,10000))
[1] "    1" "   10" "  100" " 1000" "10000"
</pre>
<p>When these formatted no-longer-numbers get passed in to <code>whichval1()</code>, R&#8217;s type coercion rules do their thing again and we learn that <code>"1" == 1</code> but <code>" 9" != 9</code>.</p>
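<p>Here&#8217;s a minimal sketch of that coercion in isolation, using a hypothetical three-row stand-in for <code>calib</code> (the name <code>small</code> is mine, not from the real code):</p>
<pre>
small = data.frame(num=8:10, date=as.Date("2014-01-02"))

m = as.matrix(small)   # the conversion apply() performs on a data frame
is.character(m)        # TRUE: the date column forced every column to strings
m[, "num"]             # " 8" " 9" "10", padded to a common width
m[, "num"] == 8:10     # FALSE FALSE TRUE, same pattern as the spoiler
</pre>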
<h2>Type conversion is complicated</h2>
<p>But it gets weirder! Why doesn&#8217;t the same thing happen when we call <code>whichval2</code> a moment later? Because <code>whichval1</code> is actually returning a list (the failed rows return zero-length vectors, the rest return one value each, and <code>apply</code> can&#8217;t simplify results of unequal length into a vector), and it&#8217;s still a list after it&#8217;s added to the data frame! I had to go read <a href="https://github.com/wch/r-source/blob/776708efe6003e36f02587ad47b2eaaaa19e2f69/src/library/base/R/dataframe.R#L1405">the definition of <code>as.matrix.data.frame</code></a> to learn that when <code>as.matrix</code> reads this new list-bearing data frame, it flags the whole thing as &#8220;non-atomic&#8221;, <em>skips the non-numeric conversions</em>, and returns a matrix whose cells still hold the plain, unpadded numbers (a list matrix rather than a character one). <code>1==1</code> and <code>9==9</code>, and the matching works as intended.</p>
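<p>A rough sketch of both halves of that, again on a hypothetical three-row frame (<code>small</code> and <code>pick</code> are made-up names for illustration, with <code>pick</code> standing in for <code>whichval1</code>):</p>
<pre>
small = data.frame(num=8:10, date=as.Date("2014-01-02"))

# Stand-in for whichval1: return the tube numbers that match this row.
pick = function(x) (8:10)[8:10 == x[1]]

res = apply(small, 1, pick)
is.list(res)       # TRUE: results of unequal length can't become a vector
lengths(res)       # 0 0 1: the padded " 8" and " 9" matched nothing

small$val1 = res   # the list goes into the data frame as a list column
m = as.matrix(small)
is.character(m)    # FALSE: nothing was run through format() this time
m[[1, "num"]]      # 8, a plain number with no padding
</pre>
<p>So the second <code>apply</code> call sees unformatted values, which is why it quietly gets the right answer.</p>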
<p>&#8220;But wait!&#8221; you say. &#8220;What about the dates? The things that made us go down this whole coercion-to-strings path in the first place?&#8221; Well, they played along happily and survived the conversion just fine because&#8230; because&#8230; because <code>Date</code>s are <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html">stored as plain numbers (days since 1970-01-01) in the first place</a>.</p>
<p>Grrrrr.</p>
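<p>For the record, a quick way to see the number hiding inside a <code>Date</code> (just a sketch; <code>d</code> is an arbitrary example value):</p>
<pre>
d = as.Date("2014-01-02")
unclass(d)    # 16072, days since 1970-01-01
d == 16072    # TRUE: the comparison works on the underlying number
</pre>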
<h2>The fix</h2>
<p>Don&#8217;t use <code>apply</code>. Apply is for matrices, and dataframes are lists not matrices.</p>
<p>Dataframes are lists, not matrices.</p>
<p>Dataframes are lists. Not matrices.</p>
<p>Dataframes are lists! Not matrices!</p>
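<p>If you need convincing, here&#8217;s a small check on a hypothetical three-row frame (<code>small</code> is a made-up name):</p>
<pre>
small = data.frame(num=8:10, date=as.Date("2014-01-02"))

is.list(small)        # TRUE
is.matrix(small)      # FALSE
sapply(small, class)  # "integer" and "Date": each column keeps its own type
</pre>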
<p>So after all this, I rewrote my cross-indexing functions:</p>
<pre>
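whichval.new = function(var, n, d){
	# Given a column name, tube number, and date,
	# return all distinct values of that column from the dataset.
	unique(data[data$num == n &#038; data$date == d, var ])
}

chkvals.new = function(df){
	# mapply passes the columns element by element,
	# so the data frame is never converted to a matrix.
	df$val1 = mapply(whichval.new, "value1", df$num, df$date)
	df$val2 = mapply(whichval.new, "value2", df$num, df$date)
	return(df)
}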
</pre>
<p>I don&#8217;t claim it&#8217;s brilliant, but it&#8217;s less ugly than last night&#8217;s version. Also, it works right.</p>
<pre>
> chkvals.new(calib)
   num       date val1 val2
1    5 2014-01-01    1    2
2    6 2014-01-01    1    2
3    7 2014-01-01    1    2
4    8 2014-01-02    1    2
5    9 2014-01-02    1    2
6   10 2014-01-02    1    2
7   11 2014-01-03    1    2
8   12 2014-01-03    1    2
9   13 2014-01-03    1    2
10  14 2014-01-04    1    2
11  15 2014-01-04    1    2

</pre>
]]></content:encoded>
			<wfw:commentRss>http://bodger.org/blog/2014/02/apply-is-not-for-dataframes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
