Odd Hypothesis: regex

2013/10/09

RegEx: Named Capture in R (Round 2)

Previously, I came up with a solution to R's less than ideal handling of named capture in regular expressions with my re.capture() function. A little more than a year later, the problem is rearing its ugly - albeit subtly different - head again.

I now have a single character string:

x = '`a` + `[b]` + `[1c]` + `[d] e`'

from which I need to pull every match: in the case above, anything encapsulated in backticks. Since my original re.capture() function was based on R's regexpr() function, it would only return the first match:

> re.capture('`(?<tok>.*?)`', x)$names
$tok
[1] "a"

Simply switching the underlying regexpr() to gregexpr() wasn't straightforward, as gregexpr() returns a list:

> str(gregexpr('`(?<tok>.*?)`', x, perl=T))
List of 1
 $ : atomic [1:4] 1 7 15 24
  ..- attr(*, "match.length")= int [1:4] 3 5 6 7
  ..- attr(*, "useBytes")= logi TRUE
  ..- attr(*, "capture.start")= int [1:4, 1] 2 8 16 25
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.length")= int [1:4, 1] 1 3 4 5
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.names")= chr "tok"

which happens to be as long as the input character vector against which the regex pattern is matched:

> x = '`a` + `[b]` + `[1c]` + `[d] e`'
> z = '`f` + `[g]` + `[1h]` + `[i] j`'
> str(gregexpr('`(?<tok>.*?)`', c(x,z) , perl=T), max.level=0)
List of 2

each element of which is a regex match object with its own set of attributes. Thus the new solution was to write a new function that walks the list() generated by gregexpr() looking for named-capture tokens:

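gregexcap = function(pattern, x, ...) {
  # force perl=TRUE, which named capture groups require
  args = list(...)
  args[['perl']] = TRUE

  # gregexpr() returns one match object per element of x
  re = do.call(gregexpr, c(list(pattern, x), args))

  # walk the match objects and their source strings in parallel
  mapply(function(re, x){

    # for each capture group name, cut the captured tokens out of x
    cap = sapply(attr(re, 'capture.names'), function(n, re, x){
      start = attr(re, 'capture.start')[, n]
      len   = attr(re, 'capture.length')[, n]
      end   = start + len - 1
      tok   = substr(rep(x, length(start)), start, end)

      return(tok)
    }, re, x, simplify=FALSE, USE.NAMES=TRUE)

    return(cap)
  }, re, x, SIMPLIFY=FALSE)

}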

thereby returning my R coding universe to one-liner bliss:

> gregexcap('`(?<tok>.*?)`', x)
[[1]]
[[1]]$tok
[1] "a"     "[b]"   "[1c]"  "[d] e"

> gregexcap('`(?<tok>.*?)`', c(x,z))
[[1]]
[[1]]$tok
[1] "a"     "[b]"   "[1c]"  "[d] e"

[[2]]
[[2]]$tok
[1] "f"     "[g]"   "[1h]"  "[i] j"
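
To make the inner step of gregexcap() concrete, here is the same extraction done by hand for the single string x. This is only a sketch of the idea; it uses substring() rather than the substr(rep(...)) call in the function, which should amount to the same thing here since substring() recycles its text argument.

```r
# Same input string and pattern as above
x <- '`a` + `[b]` + `[1c]` + `[d] e`'

# gregexpr() returns a list; take the match object for the first (only) string
m <- gregexpr('`(?<tok>.*?)`', x, perl = TRUE)[[1]]

# capture.start / capture.length are matrices:
# one row per match, one column per capture group
start <- attr(m, 'capture.start')[, 'tok']
len   <- attr(m, 'capture.length')[, 'tok']

# end positions are not stored, so derive them from start + length - 1
tok <- substring(x, start, start + len - 1)
print(tok)
```

The start and length values are the same ones shown in the str() output earlier (2 8 16 25 and 1 3 4 5).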


2012/05/03

RegEx: Named Capture in R

I consider myself a decent RegEx user. References to famous quotes about RegEx aside, I find it intuitive, I like its speed, and it keeps my code simple (more so than the alternative anyhow). Thus, I use RegEx where I can in the growing grab bag of languages I consider myself proficient in:

*nix command line / shell scripts
Javascript
PHP
Matlab
Python
R

Now we arrive at the point of disappointment: R. You see, more often than not, I use 'named capture' to extract parts from a RegEx match. It's way easier than keeping array indices straight (especially after the code has collected a couple of cobwebs). Unlike its counterparts above (e.g. Matlab and Python), R does not implement named capture all that intuitively. In fact, named capture is a new feature in R's generic RegEx functions (regexpr, gregexpr) as of version 2.14.0 (released 2011-10-31) and hasn't changed in 2.15 (released 2012-03-30).

To get a sense of R's named capture inadequacy, here's a simple scenario ...

The Problem:

You are given a list of files with names like:

chA_0001
chA_0002
chA_0003
chB_0001
chB_0002
chB_0003

Your task is to separately identify the channel (either 'A' or 'B') and file ID (0001, 0002, etc.).

The regular expression with named capture to do this is quite simple:

ch(?<ch>[A-Z])\_(?<id>[0-9]{4})

which, given the list of file names, should return a structure with property:value pairs of the sort:

ch : A, A, A, B, B, B
id : 0001, 0002, 0003, 0001, 0002, 0003

The Solutions:

Here's some Matlab code that basically does this in one line:

which would result in the following console output:

Now here's the equivalent R code:

There is a lot of work here! To help explain what's going on, here's the corresponding console output:

Here's what's happening:

regexpr(..., perl=T) is used to create a regular expression result with named capture which is placed in the $result item of the output list.

$result
[1] 1 1 1 1 1 1
attr(,"match.length")
[1] 8 8 8 8 8 8
attr(,"useBytes")
[1] TRUE
attr(,"capture.start")
     ch id
[1,]  3  5
[2,]  3  5
[3,]  3  5
[4,]  3  5
[5,]  3  5
[6,]  3  5
attr(,"capture.length")
     ch id
[1,]  1  4
[2,]  1  4
[3,]  1  4
[4,]  1  4
[5,]  1  4
[6,]  1  4
attr(,"capture.names")
[1] "ch" "id"

This result is pretty unusable since all of the important captured information is buried in attribute settings.
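
To poke at this yourself, the result above can be reproduced with a few lines. This is a minimal sketch that assumes the capture groups are named ch and id, as the capture.names attribute in the output indicates; the variable names src, pat and result are just for illustration.

```r
# File names from the problem statement
src <- c("chA_0001", "chA_0002", "chA_0003",
         "chB_0001", "chB_0002", "chB_0003")

# Named groups 'ch' and 'id'; a bare underscore is used here instead of \\_
pat <- "ch(?<ch>[A-Z])_(?<id>[0-9]{4})"

# perl = TRUE is required for named capture
result <- regexpr(pat, src, perl = TRUE)

# The captured information lives in attributes of the match object
cap_names  <- attr(result, "capture.names")
cap_start  <- attr(result, "capture.start")
cap_length <- attr(result, "capture.length")

print(cap_names)
print(cap_start[1, ])   # start positions of both groups for the first file name
print(cap_length[1, ])  # lengths of both groups for the first file name
```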

To do anything with the output from regexpr(), its attributes have to be probed using attr() (via a for loop) to get:
- captured group names
- start locations within the strings of the captured groups
- length of the captured groups (oddly/depressingly, end positions are not returned)
The combination of the above is used by substr() to extract the actual match strings from the input list:
```
rex$names[[.name]] = substr(rex$src,
                            attr(rex$result, 'capture.start')[,.name],
                            attr(rex$result, 'capture.start')[,.name]
                            + attr(rex$result, 'capture.length')[,.name]
                            - 1)
```

The above steps are encapsulated into a much easier-to-use function, re.capture(), that allows for one-line-ish extraction:

> src
[1] "chA_0001" "chA_0002" "chA_0003" "chB_0001" "chB_0002" "chB_0003"
> pat
[1] "ch(?<ch>[A-Z])\\_(?<id>[0-9]{4})"
> re.capture(pat, src)$names$ch
[1] "A" "A" "A" "B" "B" "B"
> re.capture(pat, src)$names$id
[1] "0001" "0002" "0003" "0001" "0002" "0003"
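
For readers who want something to run, here is a rough sketch of what a function along these lines might look like, pieced together from the steps described above. It is not the original re.capture(): the name re_capture_sketch is hypothetical, and the $src / $result / $names layout is an assumption based on the fragments shown in this post.

```r
# Hypothetical reconstruction of a regexpr()-based named capture helper
re_capture_sketch <- function(pattern, src, ...) {
  # keep the source strings and the raw match object side by side
  rex <- list(src    = src,
              result = regexpr(pattern, src, perl = TRUE, ...),
              names  = list())

  # one pass per capture group name
  for (.name in attr(rex$result, 'capture.names')) {
    start <- attr(rex$result, 'capture.start')[, .name]
    len   <- attr(rex$result, 'capture.length')[, .name]

    # end positions are not returned, so compute start + length - 1
    rex$names[[.name]] <- substr(rex$src, start, start + len - 1)
  }

  rex
}

src <- c("chA_0001", "chA_0002", "chA_0003",
         "chB_0001", "chB_0002", "chB_0003")
pat <- "ch(?<ch>[A-Z])_(?<id>[0-9]{4})"  # bare underscore instead of \\_

out <- re_capture_sketch(pat, src)
print(out$names$ch)
print(out$names$id)
```

Because it sits on top of regexpr(), a sketch like this only ever sees the first match in each string.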

Summary

All told, it takes three functions and a for loop to get a user-friendly named capture result! While I was able to make a one-liner function out of the ordeal, it's a shame that someone on the R development team couldn't build this into the return values for regexpr() and gregexpr(). Granted, I'm not the first to wish for something better. Perhaps this is something to look forward to in R 2.16?