2013/10/09

RegEx: Named Capture in R (Round 2)

Previously, I worked around R's less-than-ideal handling of named capture in regular expressions with my re.capture() function. A little more than a year later, the problem is rearing its ugly, albeit subtly different, head again.

I now have a single character string:

x = '`a` + `[b]` + `[1c]` + `[d] e`'

from which I need to pull every match, in this case anything encapsulated in backticks. Since my original re.capture() function was based on R's regexpr() function, it would only return the first match:

> re.capture('`(?<tok>.*?)`', x)$names
$tok
[1] "a"
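
The first-match-only behaviour comes from regexpr() itself and can be reproduced in base R without re.capture(). A minimal sketch, assuming regmatches() is used to pull out the matched text (note that it returns whole matches, backticks included, rather than the named capture; the variable names are just for illustration):

```r
x <- '`a` + `[b]` + `[1c]` + `[d] e`'
pattern <- '`(?<tok>.*?)`'

# regexpr() reports only the first match in each input string
first <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# gregexpr() reports every match, wrapped in a list with
# one element per input string
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

print(first)
print(all_matches[[1]])
```

Here first holds a single match, while all_matches[[1]] holds all four.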

Simply switching the underlying regexpr() to gregexpr() wasn't straightforward, as gregexpr() returns a list:

> str(gregexpr('`(?<tok>.*?)`', x, perl=T))
List of 1
 $ : atomic [1:4] 1 7 15 24
  ..- attr(*, "match.length")= int [1:4] 3 5 6 7
  ..- attr(*, "useBytes")= logi TRUE
  ..- attr(*, "capture.start")= int [1:4, 1] 2 8 16 25
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.length")= int [1:4, 1] 1 3 4 5
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr "tok"
  ..- attr(*, "capture.names")= chr "tok"

which happens to be as long as the input character vector against which the regex pattern is matched:

> x = '`a` + `[b]` + `[1c]` + `[d] e`'
> z = '`f` + `[g]` + `[1h]` + `[i] j`'
> str(gregexpr('`(?<tok>.*?)`', c(x,z) , perl=T), max.level=0)
List of 2

each element of which is a regex match object with its own set of attributes. Thus the solution was to write a new function that walks the list generated by gregexpr(), extracting the named capture tokens from each element:

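gregexcap = function(pattern, x, ...) {
  # named capture groups need PCRE, so always force perl=TRUE
  args = list(...)
  args[['perl']] = TRUE

  # one match object per element of x
  re = do.call(gregexpr, c(list(pattern, x), args))

  # pair each match object with the string it was matched against
  mapply(function(re, x){

    # for each named group, cut every captured token out of x
    cap = sapply(attr(re, 'capture.names'), function(n, re, x){
      start = attr(re, 'capture.start')[, n]
      len   = attr(re, 'capture.length')[, n]
      end   = start + len - 1
      tok   = substr(rep(x, length(start)), start, end)

      return(tok)
    }, re, x, simplify=FALSE, USE.NAMES=TRUE)

    return(cap)
  }, re, x, SIMPLIFY=FALSE)

}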

thereby returning my R coding universe to one-liner bliss:

> gregexcap('`(?<tok>.*?)`', x)
[[1]]
[[1]]$tok
[1] "a"     "[b]"   "[1c]"  "[d] e"

> gregexcap('`(?<tok>.*?)`', c(x,z))
[[1]]
[[1]]$tok
[1] "a"     "[b]"   "[1c]"  "[d] e"

[[2]]
[[2]]$tok
[1] "f"     "[g]"   "[1h]"  "[i] j"
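
The core step inside gregexcap(), turning the capture.start and capture.length attributes into substrings, can also be tried in isolation on a single string. This is only a sketch of that one step, not a replacement for the function; it uses substring() instead of substr() with rep(), and the variable names are illustrative:

```r
x <- '`a` + `[b]` + `[1c]` + `[d] e`'

# gregexpr() returns a list; take the match object for the first (only) string
m <- gregexpr('`(?<tok>.*?)`', x, perl = TRUE)[[1]]

# Both attributes are matrices: one row per match, one column per capture group
start <- attr(m, "capture.start")[, "tok"]
len   <- attr(m, "capture.length")[, "tok"]

# substring() is vectorised over the start and end positions
tok <- substring(x, start, start + len - 1)
print(tok)
```

With several named groups in the pattern, the same indexing by column name would apply to each group in turn, which is what the sapply() over capture.names does in the function.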
