In my ongoing quest to webappify various R scripts I discovered that rApache cannot load any R packages that depend on rJava. For several of the scripts that I've written that grab data out of MS Excel files, and therein use the xlsx package, this is a serious brick wall.
In my current workaround, I've resorted to using a shell script to do the xls(x) to .RData conversion. Then I stumbled upon the gdata package. Buried deep deep deep within the documentation it is a function called
read.xls()
that relies on Perl rather than Java to do the heavy lifting of crawling both of Microsoft's proprietary binary and xml based formats.
Testing is currently underway and a comparative write-up is planned.
I consider myself a decent RegEx user. References to famous quotes about RegEx aside, I find it intuitive, like its speed and that it makes my code simple (more so than the alternative anyhow). Thus, I use RegEx where I can in the growing grab bag of languages I consider myself proficient in:
*nix command line / shell scripts
Javascript
PHP
Matlab
Python
R
Now we arrive to the point of disappointment - R. You see, more often than not, I use 'named capture' to extract parts from a RegEx match. It's way easier than keeping array indices straight (especially after the code has collected a couple cobwebs). Unlike its counterparts above (i.e. Matlab and Python), R does not implement named capture all that intuitively. In fact, named capture is a new feature in R's generic RegEx functions (regexpr, gregexpr) as of version 2.14.0 (released sometime late 2011) and hasn't changed in 2.15 (released 2012-03-30).
To get a sense of R's named capture inadequacy, here's a simple scenario ...
The Problem:
You are given a list of files with names like:
chA_0001
chA_0002
chA_0003
chB_0001
chB_0002
chB_0003
Your task is to separate identify the channel (either 'A' or 'B') and file ID (0001, 0002, ..., etc).
The regular expression with named capture to do this is quite simple:
ch(?[A-Z])\_(?[0-9]{4})
which, given the list of file names, should return some structure with a property:value pairs of the sort:
ch : A, A, A, B, B, B
id : 0001, 0002, 0003, 0001, 0002, 0003
The Solutions:
Here's some Matlab code that basically does this in one line:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
which would result in the following console output:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
There is a lot of work here! To help explain what's going on, here's the corresponding console output:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The above steps are encapsulated into a much easier to use function re.capture() that allows for one-line-ish extraction:
> src
[1] "chA_0001" "chA_0002" "chA_0003" "chB_0001" "chB_0002" "chB_0003"
> pat
[1] "ch(?[A-Z])\\_(?[0-9]{4})"
> re.capture(pat, src)$names$ch
[1] "A" "A" "A" "B" "B" "B"
> re.capture(pat, src)$names$id
[1] "0001" "0002" "0003" "0001" "0002" "0003"
Summary
All told, it takes three functions and a for loop to get a user friendly named capture result! While I was able to make a one-liner function out of the ordeal, it's a shame that someone on the R development team couldn't build this into the return values for regexpr() and gregexpr(). Granted, I'm not the first to wish for something better. Perhaps this is something to look forward to in R 2.16?