Non-Linear Curve Fitting is Nature Publication Worthy?

Someone needs to explain to me why this is a Nature publication ...

Granted, it is Nature Protocols, which started its publication run when I was in grad school and is likely less high-impact than full-blown Nature.  Seriously though - using the Solver Add-in in Excel to do non-linear curve fitting gets published?  If a younger version of myself had known about this, I'd have had a Nature publication as a junior in undergrad.

It should be noted that such Nature-worthy protocols can be accomplished in the following 6 lines of R code (in this case for fitting a Michaelis-Menten curve):

f = formula(v ~ vmax*s/(km + s))
data = data.frame(
   s = c(1250, 625, 313, 156, 78, 39, 20, 10),
   v = c(148.68, 123.66, 96.84, 69.76, 45.54, 27.84, 16.38, 9.24))

fit = nls(f, data=data, start=list(vmax=1, km=1),
          algorithm='port', trace=F, lower=0)

(Note: I'm including the blank line in my 6-line count.)  On a similar note, does that mean I can get a publication for this?


PID Control-R

On a whim, I thought it might be fun to try to implement a PID control algorithm ... in R.
Yes, I know I have a strange idea of fun.


PID control is a heuristic method of automatically controlling processes as wide-ranging as the water level in a tank and the heading of a ship running against a current.  What's important is that you don't need to know much about a process to control it with a PID controller - hence its broad use.  I studied the theory behind it in college and even built a LabVIEW UI for use in a student teaching lab.

For more detailed information, the obligatory Wikipedia link should be a good starting point.  In fact, that's where I got all the information I needed since my control theory text is currently buried in a closet somewhere.

A PID controller produces an output (aka a "Manipulated Variable", MV(t) or u(t)) - usually from 0 to 100% - based on a weighted sum of three terms:
  • Error between where you want the process to be (a setpoint) and where the process is
  • How much error has accumulated over time
  • How rapidly the error is changing
which naturally leads to the P, I, and D in the name and the three weighting parameters (gains) that need to be tuned:
  • Proportional
  • Integral
  • Derivative
The math for this looks like:

u(t) = Kp * [ e(t) + (1/Ti) * ∫ e(τ) dτ + Td * de(t)/dt ]

Don't be scared, it isn't as gnarly as it looks.

For the moment, let's ignore the Kp on the outside while we sort out the individual parts being added together.

The main variable in the equation above is e(t).  This is the error between the process variable PV(t) - the thing you want to control - and the setpoint SP(t) - the value you want the process to be at.  So this makes e(t):

e(t) = PV(t) - SP(t)

This happens to also be the first term - the Proportional response.  This should make sense - move the control output by about as much as the error you currently see.

The second term - the Integral response - isn't as ugly as it looks.  Integrals can be thought of as the area under a curve - in this case e(t).  Numerically, this simplifies to the sum of e(t)*dt, where dt is the amount of time between readings of the process variable.  To complete the Integral response, this sum is divided by Ti, the integral time, which is an estimate of how long it will take to abolish all the accumulated error.  In terms of the overall sum, this adds more control output to account for error that was previously "missed".

Last is the third and final term - the Derivative response.  Here the derivative is of e(t) with respect to time - that is, how fast e(t) is changing at time t.  To compute it, take the difference between the current value e(t) and the previous value e(t-dt), and divide by dt.  To complete the Derivative response, multiply the derivative by Td, the derivative time, which is an estimate of how far into the future the error can be reliably predicted.  This term adds control output to account for error that hasn't happened yet.

All of the above is multiplied by Kp, the proportional gain, which scales the whole control response.



First things first, we need a process variable.
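A function along those lines might look like this (a sketch - the growth rate, noise range, and control coefficient are my own assumptions):

```r
# Simulate one time step of a process that grows exponentially when
# uncontrolled; the control input u pulls the process value back down.
process = function(pv, u, dt, noise = TRUE) {
  growth = 0.5 * pv * dt                              # exponential growth term
  if (noise) growth = growth + runif(1, -0.1, 0.1)    # uniform noise on the growth
  pv + growth - 0.5 * u * dt                          # u decreases the process value
}
```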

The code above is a simple function that simulates a process that, left uncontrolled, grows exponentially over time.  For added realism, I've included some uniform noise in the process growth.  The value of the control variable (in this case `u`) decreases the process value.


To code the controller for the above process, we first need to initialize a few variables:

In the interest of bookkeeping, I'm storing the values of the Integral and Derivative responses in the vectors EI and ED, respectively.  This isn't necessary - indeed, if this were a continuously operating controller, I wouldn't store them, to conserve memory.
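A minimal initialization along these lines (EI and ED as described; the other names and tuning values are placeholders of my own):

```r
n  = 1000                          # number of simulation steps
dt = 0.1                           # time between readings of the process variable
Kp = 2; Ti = 5; Td = 0.5           # gains to be tuned (arbitrary starting values)
PV = numeric(n); U = numeric(n)    # process variable and controller output
EI = numeric(n); ED = numeric(n)   # stored Integral and Derivative responses
PV[1] = 1                          # initial process value
```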

Next a setpoint.  Below is code to create a setpoint profile that has a few step changes in time.
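A sketch of such a profile (the step times and levels here are arbitrary):

```r
n  = 1000
SP = rep(10, n)        # baseline setpoint
SP[301:600] = 20       # step up partway through
SP[601:n]   = 5        # step back down
```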

From here on, it's a matter of running a simulation loop.

Folks savvy in calculus may note that my implementation of the integral isn't entirely textbook - meaning I'm not using the trapezoid rule or Simpson's rule to compute a more accurate area for each time step.  This is on purpose, to keep the algorithm easy to implement.  It also assumes that my time step, dt, is small enough that the difference doesn't matter much.
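Put together, the loop might look like this self-contained sketch (the process model, tuning values, and output clamping are my own assumptions; the integral is the plain rectangle-rule sum just described):

```r
# Deterministic process model for the sketch: exponential growth, reduced by u
process = function(pv, u, dt) pv + 0.5 * pv * dt - 0.5 * u * dt

n  = 500; dt = 0.1
Kp = 2; Ti = 5; Td = 0.5
SP = c(rep(10, 250), rep(20, 250))                  # setpoint with one step change
PV = numeric(n); U = numeric(n)
EI = numeric(n); ED = numeric(n)
PV[1] = 1
esum = 0; e_prev = 0

for (t in 2:n) {
  e = PV[t - 1] - SP[t]                             # error, per the sign convention above
  esum = esum + e * dt                              # rectangle-rule accumulated error
  EI[t] = esum / Ti                                 # Integral response
  ED[t] = Td * (e - e_prev) / dt                    # Derivative response
  U[t] = min(max(Kp * (e + EI[t] + ED[t]), 0), 100) # weighted sum, clamped to 0-100%
  PV[t] = process(PV[t - 1], U[t], dt)
  e_prev = e
}
```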

Running the simulation and plotting the relevant variables shows that the controller works reasonably well.  Note that I disabled the noise in the process variable for this first run.

Adding noise to the process doesn't affect the results much, just a noisier output in U (the manipulated variable):

Using a more complicated setpoint profile:


From here, one can have some fun (or do some work) playing with the parameters, process variable, and setpoint profile.


A PID controller is based on a reasonably simple algorithm that can be quickly implemented in R, or any other programming language for that matter.


Set operations on more than two sets in R


Set operations are a commonplace thing to do in R, and the enabling functions in the base package are:

  • intersect(x, y)
  • union(x, y)
  • setdiff(x, y)
  • setequal(x, y)
That said, you'll note that each ONLY takes two arguments - i.e. set X and set Y - which was a limitation I needed to overcome recently.

Googling around I found that you can apply set operations to multiple (>2) sets using the Reduce() function.  For example, an intersection of sets stored in the list x would be achieved with:
Reduce(intersect, x)
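For instance, with three small sets in a list:

```r
x = list(a = c(1, 2, 3, 4), b = c(2, 3, 4, 5), c = c(3, 4, 6))
Reduce(intersect, x)   # 3 4 - elements common to all three sets
Reduce(union, x)       # 1 2 3 4 5 6 - every distinct element across the sets
```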

However, things get trickier if you want to do a setdiff() in this manner - that is, find the elements that are only in set a but not in sets b, c, d, etc.  Since I'm not a master of set theory, I decided to write my own brute-force method to get intersections, unions, and setdiffs for an arbitrary number of sets.


The function I wrote uses a truth table to do this where:

  • rows are the union of all elements in all sets
  • columns are sets
So each "cell" of the table is TRUE if the element represented by the row is in the set represented by the column.

To find an element that intersects all sets, the values across the row need to be all TRUE.
To find an element that is only in one set, only one row value is TRUE.  To determine which set that element is in, numeric values are applied to each column such that:

  • col1 = 1
  • col2 = 2
  • col3 = 4
  • col4 = 8
  • ... and so on
The values above are multiplied by each row's TRUE/FALSE (ergo: 1/0) values and summed across to produce a row's inclusion value.

For example, if an element is only in set 2 (that is column 2) the corresponding row inclusion value is 2.  If an element is in both sets 2 and 3 (columns 2 and 3) the corresponding row inclusion value is 6.

Visually ...


Thus, determining intersections or setdiffs becomes a simple matter of filtering by row sums.
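A sketch of the approach (my reconstruction from the description above, not the author's exact code):

```r
# Build the truth table: rows are the union of all elements, columns are sets
set_table = function(sets) {
  elements = Reduce(union, sets)
  tab = sapply(sets, function(s) elements %in% s)   # TRUE if row-element is in column-set
  weights = 2 ^ (seq_along(sets) - 1)               # col1 = 1, col2 = 2, col3 = 4, ...
  inclusion = as.vector(tab %*% weights)            # row-wise weighted sums
  list(table = tab, inclusion = inclusion, elements = elements)
}

# Intersection of all sets: rows that are TRUE in every column
multi_intersect = function(sets) {
  st = set_table(sets)
  st$elements[rowSums(st$table) == length(sets)]
}

# setdiff-style query - elements in set i and no other set: inclusion value 2^(i-1)
only_in = function(sets, i) {
  st = set_table(sets)
  st$elements[st$inclusion == 2 ^ (i - 1)]
}
```

Filtering on other inclusion values generalizes this to any combination of sets.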