2012/05/03

RegEx: Named Capture in R

I consider myself a decent RegEx user.  References to famous quotes about RegEx aside, I find it intuitive, like its speed and that it makes my code simple (more so than the alternative anyhow). Thus, I use RegEx where I can in the growing grab bag of languages I consider myself proficient in:
  • *nix command line / shell scripts
  • Javascript
  • PHP
  • Matlab
  • Python
  • R
Now we arrive to the point of disappointment - R.  You see, more often than not, I use 'named capture' to extract parts from a RegEx match.  It's way easier than keeping array indices straight (especially after the code has collected a couple cobwebs).  Unlike its counterparts above (i.e. Matlab and Python), R does not implement named capture all that intuitively.  In fact, named capture is a new feature in R's generic RegEx functions (regexpr, gregexpr) as of version 2.14.0 (released sometime late 2011) and hasn't changed in 2.15 (released 2012-03-30).

To get a sense of R's named capture inadequacy, here's a simple scenario ...

The Problem:

You are given a list of files with names like:
  • chA_0001
  • chA_0002
  • chA_0003
  • chB_0001
  • chB_0002
  • chB_0003
Your task is to separate identify the channel (either 'A' or 'B') and file ID (0001, 0002, ..., etc).

The regular expression with named capture to do this is quite simple:
ch(?[A-Z])\_(?[0-9]{4})

which, given the list of file names, should return some structure with a property:value pairs of the sort:
  • ch : A, A, A, B, B, B
  • id : 0001, 0002, 0003, 0001, 0002, 0003

The Solutions:

Here's some Matlab code that basically does this in one line:


which would result in the following console output:

Now here's the equivalent R code:

There is a lot of work here! To help explain what's going on, here's the corresponding console output:

Here's what's happening:
  1. regexpr(..., perl=T) is used to create a regular expression result with named capture which is placed in the $result item of the output list.
    $result
    [1] 1 1 1 1 1 1
    attr(,"match.length")
    [1] 8 8 8 8 8 8
    attr(,"useBytes")
    [1] TRUE
    attr(,"capture.start")
         ch id
    [1,]  3  5
    [2,]  3  5
    [3,]  3  5
    [4,]  3  5
    [5,]  3  5
    [6,]  3  5
    attr(,"capture.length")
         ch id
    [1,]  1  4
    [2,]  1  4
    [3,]  1  4
    [4,]  1  4
    [5,]  1  4
    [6,]  1  4
    attr(,"capture.names")
    [1] "ch" "id"
    
    This result is pretty unusable since all of the important captured information is buried in attribute settings.
  2. To do anything with the output from regexpr(), the result from #1 has to have its attributes probed using attr() (via a for loop) to get:
    • captured group names
    • start locations within the strings of the captured groups
    • length of the captured groups (oddly/depressingly, end positions are not returned)
    The combination of the above is used by substr() to extract the actual match strings from the input list:
    rex$names[[.name]] = substr(rex$src,
                                attr(rex$result, 'capture.start')[,.name],
                                attr(rex$result, 'capture.start')[,.name]
                                + attr(rex$result, 'capture.length')[,.name]
                                - 1)
    
  3. The above steps are encapsulated into a much easier to use function re.capture() that allows for one-line-ish extraction:
    > src
    [1] "chA_0001" "chA_0002" "chA_0003" "chB_0001" "chB_0002" "chB_0003"
    > pat
    [1] "ch(?[A-Z])\\_(?[0-9]{4})"
    > re.capture(pat, src)$names$ch
    [1] "A" "A" "A" "B" "B" "B"
    > re.capture(pat, src)$names$id
    [1] "0001" "0002" "0003" "0001" "0002" "0003"
    

Summary

All told, it takes three functions and a for loop to get a user friendly named capture result! While I was able to make a one-liner function out of the ordeal, it's a shame that someone on the R development team couldn't build this into the return values for regexpr() and gregexpr(). Granted, I'm not the first to wish for something better. Perhaps this is something to look forward to in R 2.16?

2 comments:

  1. That would be pretty neat. Probably due to the lack of a clean function, I've never even tried to use named capture in [R] and just use sub/gsub instead.

    > src = c('chA_0001', 'chA_0002', 'chA_0003', 'chB_0001', 'chB_0002', 'chB_0003')
    > ch = sub('^ch','',sub('_[0-9]*$','',src))
    > ids = sub('^ch.*_','',src)
    > rv = data.frame(ch=ch, ids=ids)

    > rv
    ch ids
    1 A 0001
    2 A 0002
    3 A 0003
    4 B 0001
    5 B 0002
    6 B 0003
    > rv$ch
    [1] A A A B B B
    Levels: A B
    > rv$ids
    [1] 0001 0002 0003 0001 0002 0003
    Levels: 0001 0002 0003

    -Sam

    ReplyDelete
    Replies
    1. I agree, not having a clear cut solution as is found in other languages like Matlab or Python makes the sub/gsub method an obvious and often suggested choice. However, it has the potential of getting pretty hairy when the number of matches to capture and their context grows.

      Delete