character handling: mean() vs sd()

Here’s a weird R error/bug/nuance I came across today:

What would you expect the following lines of code to return?

x = c('1', '2', '3')

Well, apparently it is:

> mean(x)
[1] NA
Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA

> sd(x)
[1] 1

So, sd() ends up silently coercing its input to numeric, while mean() refuses and returns NA. There's a hint in the source:

> mean
function (x, ...) 
UseMethod("mean")
<bytecode: 0x000000001165e790>
<environment: namespace:base>

> sd
function (x, na.rm = FALSE) 
sqrt(var(if (is.vector(x)) x else as.double(x), na.rm = na.rm))
<bytecode: 0x000000001158eb00>
<environment: namespace:stats>

One hour of my work day was spent sorting this out. You've been warned.
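If character vectors can sneak into your data (a common side effect of reading ragged CSVs), an explicit coercion up front removes the ambiguity for both functions:

```r
x = c('1', '2', '3')

# coerce once, explicitly, so mean() and sd() see the same numeric input
xn = as.numeric(x)

mean(xn)  # 2
sd(xn)    # 1
```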

Written with StackEdit.


RegEx: Named Capture in R (Round 2)

Previously, I came up with a solution to R's less than ideal handling of named capture in regular expressions with my re.capture() function. A little more than a year later, the problem is rearing its ugly - albeit subtly different - head again.

I now have a single character string:

x = '`a` + `[b]` + `[1c]` + `[d] e`'

from which I need to pull all matches - in this case, anything encapsulated in backticks. Since my original re.capture() function was based on R's regexpr() function, it would only return the first match:

> re.capture('`(?<tok>.*?)`', x)$names
[1] "a"

Simply switching the underlying regexpr() to gregexpr() wasn't straightforward, as gregexpr() returns a list:

> str(gregexpr('`(?<tok>.*?)`', x, perl=T))
List of 1
 $ : atomic [1:4] 1 7 15 24
  ..- attr(*, "match.length")= int [1:4] 3 5 6 7
  ..- attr(*, "useBytes")= logi TRUE
  ..- attr(*, "capture.start")= int [1:4, 1] 2 8 16 25
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.length")= int [1:4, 1] 1 3 4 5
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.names")= chr "tok"

which happens to be as long as the input character vector against which the regex pattern is matched:

> x = '`a` + `[b]` + `[1c]` + `[d] e`'
> z = '`f` + `[g]` + `[1h]` + `[i] j`'
> str(gregexpr('`(?<tok>.*?)`', c(x,z) , perl=T), max.level=0)
List of 2

each element of which is a regex match object with its own set of attributes. Thus the solution was to write a function that walks the list() generated by gregexpr() looking for named capture tokens:

gregexcap = function(pattern, x, ...) {
  args = list(...)
  args[['perl']] = T

  re = do.call(gregexpr, c(list(pattern, x), args))

  mapply(function(re, x){

    cap = sapply(attr(re, 'capture.names'), function(n, re, x){
      start = attr(re, 'capture.start')[, n]
      len   = attr(re, 'capture.length')[, n]
      end   = start + len - 1

      substr(rep(x, length(start)), start, end)
    }, re, x, simplify=F, USE.NAMES=T)

    cap
  }, re, x, SIMPLIFY=F)
}


thereby returning my R coding universe to one-liner bliss:

> gregexcap('`(?<tok>.*?)`', x)
[1] "a"     "[b]"   "[1c]"  "[d] e"

> gregexcap('`(?<tok>.*?)`', c(x,z))
[[1]]
[1] "a"     "[b]"   "[1c]"  "[d] e"

[[2]]
[1] "f"     "[g]"   "[1h]"  "[i] j"
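As an aside, base R can get close to this without a helper if you don't actually need the capture names: regmatches() extracts every match found by gregexpr(), and the delimiters can be stripped afterward. It sidesteps named capture entirely, so it's a workaround rather than a replacement:

```r
x = '`a` + `[b]` + `[1c]` + `[d] e`'

# full matches include the backtick delimiters; strip them after extraction
m    = gregexpr('`(?<tok>.*?)`', x, perl=TRUE)
toks = gsub('`', '', regmatches(x, m)[[1]])

toks  # "a" "[b]" "[1c]" "[d] e"
```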



mfg, mfcol, mfrow, and layout() - secret friends

I was working on an issue (enhancement) today in my groan R-package that required adding additional plotting elements via lines() and points() to a device that had already been partitioned by layout(). The code I wanted to use was essentially:

# Y and S are lists of xy.coords() objects of the same length

lyt = matrix(1:length(Y), ncol=10)

# function A
# plot Y first as points
lapply(Y, function(x) {
  plot(x, ...)
})

# function B
# overlay S as lines on the grid of plots for Y
lapply(S, function(x){
  lines(x, ...)
})

However, I would only get all of the above lines in one subplot. For a brief moment, I considered rewriting my whole set of plotting methods to use split.screen() or par(mfcol). Ugh!

On a whim, I decided to check what par('mfg') would return after a device had been partitioned and plotted to with:

layout(matrix(1:9, nrow=3))

I was pleasantly surprised to find:

> par('mfg')
[1] 1 1 3 3

indicating that I could potentially direct the next plot in a layout()'ed device by setting the value of mfg= to the next plot id:

lyt = matrix(1:9, nrow=3)
par(mfg=which(lyt == NextPlotID, arr.ind=TRUE)[1,])

Unicorns and rainbows, this works! (despite all the dire warnings in the documentation regarding incompatibilities)
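A minimal, self-contained illustration of the trick (the panel count and data are made up; it relies on every panel sharing the same user coordinates, as mine did):

```r
# partition a device into a 2x2 grid and fill it
lyt = matrix(1:4, nrow=2)
layout(lyt)
for (i in 1:4) plot(1:10, main=paste('panel', i))

# jump back to panel 3 and overlay a line on it
# (works because all panels here share the same user coordinates)
par(mfg = which(lyt == 3, arr.ind=TRUE)[1, ])
lines(1:10, rep(5, 10), col='red')
```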

Thus, the resulting code:

# Y and S are lists of xy.coords() objects of the same length

lyt = matrix(1:length(Y), ncol=10)

# function A
# plot Y first as points
lapply(Y, function(x) {
  plot(x, ...)
})

# function B
# overlay S as lines on the grid of plots for Y
lapply(seq_along(S), function(n){
  par(mfg=which(lyt == n, arr.ind=TRUE)[1,]) # sets next plot in grid!
  lines(S[[n]], ...)
})

and issue resolved.



Groan - my first R package

Being one of two R experts at my current job, I figured I should be familiar with package development. Frankly, I've been procrastinating on this topic since I started using R in 2007 - I was doing just fine with source(), and the section of the R manual on package development fell into the category of TL;DR.

What finally drove me to learn the esoteric details of R packaging were the following two events at work:

  1. A coworker sent out an email announcing a new R analysis script which included a few algorithms I wrote and passed on. It was accompanied by documentation in a README.txt file and employed console menus and user dialogs to ease use. Otherwise, details of what the code was doing were left to code comments.

  2. I read an email thread between coworkers - after being out of the office and disconnected for a day - asking about the correct set of parameters to use in a function I wrote. Fortunately, they figured it out thanks to my excessive commenting habits.

Lesson learned: I no longer write R code just for myself and using code comments as documentation just wasn't cutting it anymore. I needed an efficient way to:

  • distribute code
  • provide documentation that uses the in-built help system

Unlike Matlab or Python, R does not have an effective way to provide simple documentation for code - functions, objects, etc. There is this post on StackOverflow, but I expect such functionality to be built into the environment, not hacked on.

Long winded introduction over, I finally dove in. Thankfully, my entry wasn't too rough. A couple months ago I read a couple posts submitted to R-bloggers regarding "easy" package development using roxygen2, devtools, and RStudio (my R IDE of choice).

My problem to solve: get parameters from biological growth curves
My package: groan

Package: groan
Type: Package
Title: Utilities for biological GROwth curve ANalysis
Version: 1.0
Date: 2013-05-14
Author: W. Lee Pang
Description: groan is a set of tools to assist in the analysis of biological
    growth curves.  It provides functions to smooth input data and extract key
    parameters such as specific growth rate (mu), carrying capacity (A), and lag
    time (tau).

Working with microorganisms, a common task is determining a culture's specific growth rate - e.g. how many times the population will double in an hour. While not a hard task, it can be tedious if numerous cultures are involved, or if the underlying data is noisy.
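To make that concrete, here's a hedged sketch (synthetic data; the rate constant and starting density are made up) of how a specific growth rate is typically pulled from exponential-phase readings - it's just the slope of log-transformed density over time:

```r
# synthetic exponential-phase OD600 readings generated at mu = 0.6 per hour
t  = c(0, 1, 2, 3, 4)    # hours
od = 0.05 * exp(0.6 * t)

# specific growth rate = slope of log(OD) vs time
mu = coef(lm(log(od) ~ t))[['t']]

mu           # 0.6
mu / log(2)  # doublings per hour (~0.87)
```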

groan is essentially the R package I wish I had as a grad student and postdoc, but was too occupied otherwise to write.

Yes, the name groan is a pun:

  • "grown" : as in yay the cells grew
  • "groan" : as in ugh, I have to process yet another growth curve

Humor aside, it reduces a CSV of multiple growth curve data points into a table of growth parameters and a summary plot in under 10 lines of code.

From this ...

Via ...

Y = read.csv('path/to/your/data.csv', stringsAsFactors=F)
Y = groan.init(Y)
Y.s = groan.smooth(Y, adaptive=T, method='loess')

U = groan.mu(Y.s)
U.s = groan.smooth(U, adaptive=T, method='loess')
U.f = groan.fit(U.s, method='pulse')

stats = data.frame(mumax = max(U.f),
                   t.lag = groan.tlag(U.f),
                   gen   = groan.generations(U.f))

plot(Y)   # plot thumbnail grid of raw growth curves

To ...

For more information, examples, or to test out the code yourself head to the groan repository on Github.


A new R trick ... for me at least

What we're going to talk about today is dynamic argument lists for functions - specifically, how to unpack and prepare them in R using ..., list(), and do.call().

Biased by Matlab and varargin

Initially, I based my use of ... in R on my experience with Matlab's varargin. Using varargin, Matlab functions can have a signature of:

function f(varargin)
% do stuff here

Functions that use varargin are responsible for processing its contents, which is easy since it is simply a cell array. Thus, it can be "unpacked" and modified using cell array methods.

function f(varargin)
arg1 = varargin{1}
arg2 = varargin{2}

At call, arguments captured by varargin can be specified as an expanded cell array:

args = {foo, bar}

As a matter of fact, functions that do not use varargin can also be called this way, since Matlab effectively interprets an expanded cell array as a comma-separated list.

This comes in handy when you have a mixture of required and optional arguments for a function.

f(arg, opts{:})

Back to R ...

I used to think ... was analogous to varargin since:

  • it captures all function arguments not explicitly defined by the call signature
  • the number of arguments it captures can vary

However, unlike varargin:

  • ... is a special R language expression/object
  • it needs to be converted to a list to access the arguments (names and/or values) that it captures

The former point is a strength and a quirk of R, as it allows arguments encapsulated in ... to be passed on to additional functions:

f = function(x, ...) {
  y = g(x, ...)
}

The latter point above (unpacking ...) is actually easy to do:

f = function(x, ...) {
  args = list(...) # contains a=1, b=2
  return(args$a * args$b)
}

Where confusion arises for many is that ... is essentially immutable (it cannot be changed). While conceptually a list(), you can't modify it directly using list accessors:

f = function(x, ...) {
  ...[[1]] = 3 # this produces an error, as would ...$var and ...[1]
  y = g(x, ...)
}

So, what if I wanted to unpack arguments in ..., check/change their values, and repackage it for another function call? Since ... is immutable the code below would throw an error.

f = function(x, ...) {
  args = list(...) # unpack, contains a='foo'
  args$a = 'bar'

  ... = args # ERROR!

  y = g(x, ...)
}

Also, there isn't a way (that I've found yet) to unroll a list() object in R into a comma-separated list like you can with a cell array in Matlab.

# this totally doesn't work
args = list(a=1, b='foo')
result = f(args[*]) # making up syntax here. would be nice, no?

As it turns out, ... doesn't even come into play here. In fact, you need to use a rather deep R concept - calls.

Whenever a function is used in R, a call is produced, which is an unprocessed expression that is then interpreted by the underlying engine. Why the delay? Only the creators/developers of R can fully detail why, but it does allow for some neat effects - e.g. the automatic labeling of plots.

To package a programmatically generated argument list one uses the do.call() function:

result = do.call('fun', list(arg1, arg2, etc, etc))

where the first argument is the name of the function to call, and the second argument is a list of arguments to pass along. For all intents and purposes, the R statement above is equivalent to the Matlab statement below.

results = fun(args{:}) % where args = {arg1, arg2, etc, etc}

Thus, the process to unpack ..., check/modify an argument, and repack for another function call becomes:

f = function(x, ...) {
  args = list(...) # unpack, contains a='foo'
  args$a = 'bar'   # change argument "a"

  y = do.call(g, c(list(x), args)) # repack arguments for call to g()
}
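A tiny end-to-end example (g() and the argument names here are placeholders I made up, not anything from a real API):

```r
g = function(x, a, b) paste(x, a, b)

f = function(x, ...) {
  args = list(...)  # unpack, contains a='foo', b='qux'
  args$a = 'bar'    # change argument "a"
  do.call(g, c(list(x), args))  # repack and call g()
}

f('hello', a='foo', b='qux')  # "hello bar qux"
```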

I must credit this epiphany to the following StackOverflow question and answer: http://stackoverflow.com/questions/3414078/unpacking-argument-lists-for-ellipsis-in-r



Non-Linear Curve Fitting is Nature Publication Worthy?

Someone needs to explain to me why this is a Nature publication ...

Granted, it is Nature Protocols, which started its publication run when I was in grad school and is likely less high-impact than full-blown Nature. Seriously though - using the Solver Add-in in Excel to do non-linear curve fitting gets published? If a younger version of myself had known about this, I'd have had a Nature publication as a junior in undergrad.

It should be noted that such Nature-worthy protocols are accomplished in the following 6 lines of R code (in this case, for fitting a Michaelis-Menten curve):

f = formula(v ~ vmax*s/(km + s))
data = data.frame(
   s = c(1250, 625, 313, 156, 78, 39, 20, 10),
   v = c(148.68, 123.66, 96.84, 69.76, 45.54, 27.84, 16.38, 9.24))

fit = nls(f, data=data, start=list(vmax=1, km=1),
          algorithm='port', trace=F, lower=0)

(Note: I'm including the blank line in my 6-line count.) On a similar thread, does that mean I can get a publication for this?
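And since the fit lives in a first-class R object rather than spreadsheet cells, pulling the estimates back out for reuse takes one more line. The whole thing, runnable end to end:

```r
# Michaelis-Menten fit via nls(), same data as above
f = formula(v ~ vmax*s/(km + s))
data = data.frame(
   s = c(1250, 625, 313, 156, 78, 39, 20, 10),
   v = c(148.68, 123.66, 96.84, 69.76, 45.54, 27.84, 16.38, 9.24))

fit = nls(f, data=data, start=list(vmax=1, km=1),
          algorithm='port', trace=F, lower=0)

coef(fit)  # named vector of the fitted vmax and km
```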


PID Control-R

On a whim, I thought it might be fun to try to implement a PID control algorithm ... in R.
Yes, I know I have a strange idea of fun.


PID control is a heuristic method of automatically controlling processes as wide-ranging as water levels in tanks and the direction of ships running against a current.  What's important is that you don't need to know much about a process to control it with a PID controller - hence why it is so broadly used.  I studied the theory behind it in college and even built a LabVIEW UI for use in a student teaching lab.

For more detailed information, the obligatory Wikipedia link should be a good starting point.  In fact, that's where I got all the information I needed since my control theory text is currently buried in a closet somewhere.

A PID controller produces an output (aka a "Manipulated Variable", MV(t) or u(t)) - usually from 0 to 100% - based on a weighted sum of three terms:
  • Error between where you want the process to be (a setpoint) and where the process is
  • How much error has accumulated over time
  • How rapidly the error is changing
which naturally leads to the P, I, and D in the name and the three weighting parameters (gains) that need to be tuned:
  • Proportional
  • Integral
  • Derivative
The math for this looks like:

u(t) = Kp * [ e(t) + (1/Ti) * ∫ e(t) dt + Td * de(t)/dt ]

Don't be scared, it isn't as gnarly as it looks.

For the moment, let's ignore the Kp on the outside while we sort out the individual parts being added together.

The main variable in the equation above is e(t).  This is the error between the process variable PV(t) - the thing you want to control - and the setpoint SP(t) - the value you want the process to be at.  So this makes e(t):

e(t) = PV(t) - SP(t)

This happens to also be the first term - the Proportional response.  This should make sense - move the control by about as much error that you currently see.

The second term - the Integral response - isn't as ugly as it looks.  Integrals can be thought of as the area under a curve - in this case e(t).  Numerically, this simplifies to the sum of e(t)*dt, where dt is the amount of time between readings of the process variable.  To complete the Integral response, this sum is divided by Ti, the integral time, which is an estimate of how long it will take to abolish all the accumulated error.  In terms of the overall sum, this adds more control output to account for error that was previously "missed".

Last is the third and final term - the Derivative response.  Here the derivative is of e(t) with respect to time - or how fast is e(t) changing at time t.  To compute this, take the difference between the current value of e(t) and its last value e(t-dt) and divide it by dt.  To complete the Derivative response, multiply the derivative value by Td, the derivative time, which is an estimate of how far in the future the error can be reliably predicted.  This term adds control output to account for error that might happen.

All of the above is multiplied by Kp, the proportional gain, which scales the whole control response.



First things first, we need a process variable.

The process I simulated is a simple function that, uncontrolled, will grow exponentially over time.  For added realism, I've included some uniform noise in the process growth.  The value of the control variable (in this case `u`) decreases the process value.
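A minimal stand-in for that process (the function name, growth rate, noise amplitude, and control gain are all my assumptions) might look like:

```r
# toy process: exponential growth plus uniform noise,
# pushed back down by the control input u
process = function(pv, u, dt) {
  growth = 0.1 * pv * dt               # uncontrolled exponential growth
  noise  = runif(1, -0.01, 0.01) * pv  # uniform process noise
  pv + growth + noise - 0.1 * u * dt   # u decreases the process value
}
```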


To code the controller for the above process, we first need to initialize a few variables - the controller gains, the time step, and some book-keeping vectors.

In the interest of book-keeping, I'm storing the values of the Integral and Derivative responses in vectors EI and ED, respectively.  This isn't necessary, and indeed if this were a continuously operating controller, I wouldn't want to for the sake of conserving memory space.

Next, a setpoint.  I used a setpoint profile that has a few step changes in time.
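Something like this captures the idea (the time base and step values are made up):

```r
dt = 0.1
tt = seq(0, 50, by=dt)   # simulation time, arbitrary units

# setpoint profile with a few step changes
sp = rep(20, length(tt))
sp[tt >= 15] = 30
sp[tt >= 35] = 25
```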

From here on, it's a matter of running a simulation loop.
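Putting the pieces together - process, gains, setpoint, and loop. Every constant below is mine, hand-tuned for this toy process rather than taken from the original post:

```r
# process model: exponential growth, pushed down by the control input u
# (noise disabled for this first run)
process = function(pv, u, dt) {
  pv + 0.1 * pv * dt - 0.1 * u * dt
}

# controller gains
Kp = 4; Ti = 5; Td = 0.5

# time base and setpoint profile with step changes
dt = 0.1
tt = seq(0, 50, by=dt)
n  = length(tt)
sp = rep(20, n); sp[tt >= 15] = 30; sp[tt >= 35] = 25

# book-keeping vectors for the error terms and outputs
pv = numeric(n); u = numeric(n); e = numeric(n)
EI = numeric(n); ED = numeric(n)
pv[1] = 20

for (i in 2:n) {
  e[i]  = pv[i-1] - sp[i]       # error, as defined above (PV - SP)
  EI[i] = EI[i-1] + e[i] * dt   # accumulated (integral) error
  ED[i] = (e[i] - e[i-1]) / dt  # rate of change (derivative) of error

  u[i] = Kp * (e[i] + EI[i] / Ti + Td * ED[i])
  u[i] = min(max(u[i], 0), 100) # clamp controller output to 0-100%

  pv[i] = process(pv[i-1], u[i], dt)
}
```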

Folks savvy in calculus may note that my implementation of the integral isn't entirely textbook - meaning, I'm not using the Trapezoid rule or Simpson's rule to compute a more accurate area for each time step.  This is on purpose, making the algorithm easier to implement.  It also assumes that my time step, dt, is small enough that this doesn't matter much.

Running the simulation and plotting the necessary variables, shows that the controller works reasonably well. Note, I disabled the noise in the process variable for this first run.

Adding noise to the process doesn't affect the results much, just a noisier output in U (the manipulated variable):

Using a more complicated setpoint profile:


From here, one can have some fun (or do some work) playing with the parameters, process variable, and setpoint profile.


A PID controller is based on a reasonably simple algorithm that can be quickly implemented in R, or any other programming language for that matter.


Set operations on more than two sets in R


Set operations are a commonplace thing to do in R, and the enabling functions in base R are:

  • intersect(x, y)
  • union(x, y)
  • setdiff(x, y)
  • setequal(x, y)
That said, you'll note that each ONLY takes two arguments - i.e. set X and set Y - which was a limitation I needed to overcome recently.

Googling around I found that you can apply set operations to multiple (>2) sets using the Reduce() function.  For example, an intersection of sets stored in the list x would be achieved with:
Reduce(intersect, x)

However, things get trickier if you want to do a setdiff() in this manner - that is, find the elements that are only in set a, but not in sets b, c, d, ..., etc.  Since I'm not a master of set theory, I decided to write my own brute-force method to get intersections, unions, and setdiffs for an arbitrary number of sets.


The function I wrote uses a truth table to do this where:

  • rows are the union of all elements in all sets
  • columns are sets
So each "cell" of the table is TRUE if the element represented by the row is in the set represented by the column.

To find an element that intersects all sets, the values across the row need to be all TRUE.
To find an element that is only in one set, only one row value is TRUE.  To determine which set that element is in, numeric values are applied to each column such that:

  • col1 = 1
  • col2 = 2
  • col3 = 4
  • col4 = 8
  • ... and so on
The values above are multiplied by each row's TRUE/FALSE (ergo: 1/0) values and summed across to produce a row's inclusion value.

For example, if an element is only in set 2 (that is column 2) the corresponding row inclusion value is 2.  If an element is in both sets 2 and 3 (columns 2 and 3) the corresponding row inclusion value is 6.



Thus, determining intersections or setdiffs becomes a simple matter of filtering by row sums.
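The approach described above can be sketched as follows (the function and variable names are mine, not the original code):

```r
# truth table: rows are elements of the union of all sets, columns are sets
multiset = function(sets) {
  elements = Reduce(union, sets)
  tt = sapply(sets, function(s) elements %in% s)  # TRUE if element is in set
  rownames(tt) = elements
  tt
}

sets = list(a=c(1,2,3), b=c(2,3,4), c=c(3,4,5))
tt = multiset(sets)

# intersection of all sets: rows where every column is TRUE
rownames(tt)[rowSums(tt) == length(sets)]  # "3"

# inclusion values: col1=1, col2=2, col3=4, ...
incl = as.vector(tt %*% 2^(seq_along(sets) - 1))

# elements only in set a, i.e. a setdiff: inclusion value == 1
rownames(tt)[incl == 1]                    # "1"
```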



A copper toned publication!

At long last (1.5 years since the first submission attempt, to be exact), the research I worked on as a post-doctoral fellow has been published!

Head over to the article for some light reading.  A lot of work went into this paper, and there are many to thank who don't appear in the author list - basically everyone in my former lab. On a fun note, the model we developed and just about all the figures were produced in R.  Quite the conversion, since I started this work as a big proponent of Matlab.


Overly Honest Methods

A friend of mine shared this on Facebook - A curated list of tweets tagged #overlyhonestmethods.  Basically, how science is really done behind that sparkly-smarty-pants curtain.