- *nix command line / shell scripts
- Javascript
- PHP
- Matlab
- Python
- R
To get a sense of R's named capture inadequacy, here's a simple scenario ...
The Problem:
You are given a list of files with names like:
- chA_0001
- chA_0002
- chA_0003
- chB_0001
- chB_0002
- chB_0003
The task is to pull the channel letter and the four-digit id out of each name. The regular expression with named capture to do this is quite simple:
ch(?<ch>[A-Z])\_(?<id>[0-9]{4})
which, given the list of file names, should return some structure with property:value pairs of the sort:
- ch : A, A, A, B, B, B
- id : 0001, 0002, 0003, 0001, 0002, 0003
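For comparison, here is a minimal sketch of that target structure in Python, one of the languages listed at the top. It is only an illustration, not code from the original post: the helper name `named_capture` is hypothetical, and Python spells named groups `(?P<name>...)` rather than `(?<name>...)`.

```python
import re

# File names from the example above.
src = ["chA_0001", "chA_0002", "chA_0003", "chB_0001", "chB_0002", "chB_0003"]

# Same pattern as above, written with Python's (?P<name>...) group syntax.
pat = re.compile(r"ch(?P<ch>[A-Z])_(?P<id>[0-9]{4})")


def named_capture(pattern, strings):
    """Collect each named group into a list, one entry per input string.

    Hypothetical helper: strings that do not match contribute None.
    """
    result = {name: [] for name in pattern.groupindex}
    for s in strings:
        m = pattern.search(s)
        for name in pattern.groupindex:
            result[name].append(m.group(name) if m else None)
    return result


captured = named_capture(pat, src)
print(captured["ch"])  # ['A', 'A', 'A', 'B', 'B', 'B']
print(captured["id"])  # ['0001', '0002', '0003', '0001', '0002', '0003']
```

The result is a dictionary keyed by group name, which is the property:value shape described above.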
The Solutions:
Here's some Matlab code that basically does this in one line:
which would result in the following console output:
Now here's the equivalent R code:
There is a lot of work here! To help explain what's going on, here's the corresponding console output:
Here's what's happening:
- regexpr(..., perl=T) is used to create a regular expression result with named capture, which is placed in the $result item of the output list.
  $result
  [1] 1 1 1 1 1 1
  attr(,"match.length")
  [1] 8 8 8 8 8 8
  attr(,"useBytes")
  [1] TRUE
  attr(,"capture.start")
       ch id
  [1,]  3  5
  [2,]  3  5
  [3,]  3  5
  [4,]  3  5
  [5,]  3  5
  [6,]  3  5
  attr(,"capture.length")
       ch id
  [1,]  1  4
  [2,]  1  4
  [3,]  1  4
  [4,]  1  4
  [5,]  1  4
  [6,]  1  4
  attr(,"capture.names")
  [1] "ch" "id"
  This result is pretty unusable since all of the important captured information is buried in attribute settings.
- To do anything with the output from regexpr(), the result from the first step has to have its attributes probed using attr() (via a for loop) to get:
  - captured group names
  - start locations within the strings of the captured groups
  - length of the captured groups (oddly/depressingly, end positions are not returned)
- These are then passed to substr() to extract the actual match strings from the input list:
  rex$names[[.name]] = substr(rex$src,
                              attr(rex$result, 'capture.start')[,.name],
                              attr(rex$result, 'capture.start')[,.name] + attr(rex$result, 'capture.length')[,.name] - 1)
- The above steps are encapsulated into a much easier to use function re.capture() that allows for one-line-ish extraction:
  > src
  [1] "chA_0001" "chA_0002" "chA_0003" "chB_0001" "chB_0002" "chB_0003"
  > pat
  [1] "ch(?<ch>[A-Z])\\_(?<id>[0-9]{4})"
  > re.capture(pat, src)$names$ch
  [1] "A" "A" "A" "B" "B" "B"
  > re.capture(pat, src)$names$id
  [1] "0001" "0002" "0003" "0001" "0002" "0003"
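To make the start/length arithmetic in these steps concrete, here is a small sketch in Python (used only because it is easy to run; it is not the post's R code). It mimics the 1-based start positions and lengths that regexpr() reports and rebuilds each captured string the way the substr() call does, as start through start + length - 1. The function name `capture_from_offsets` is hypothetical.

```python
import re


def capture_from_offsets(pattern, strings):
    """Rebuild named captures from 1-based start positions and lengths."""
    regex = re.compile(pattern)
    names = list(regex.groupindex)
    out = {name: [] for name in names}
    for s in strings:
        m = regex.search(s)
        for name in names:
            # No match at all, or this group did not participate in the match.
            if m is None or m.start(name) < 0:
                out[name].append(None)
                continue
            start = m.start(name) + 1             # 1-based, like capture.start
            length = m.end(name) - m.start(name)  # like capture.length
            end = start + length - 1              # inclusive end, as substr() expects
            out[name].append(s[start - 1:end])    # convert back to 0-based slicing
    return out


src = ["chA_0001", "chB_0002"]
print(capture_from_offsets(r"ch(?P<ch>[A-Z])_(?P<id>[0-9]{4})", src))
# {'ch': ['A', 'B'], 'id': ['0001', '0002']}
```

For "chA_0001" the ch group starts at position 3 with length 1 and the id group at position 5 with length 4, the same values shown in the capture.start and capture.length attributes above.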
Summary
All told, it takes three functions and a for loop to get a user-friendly named capture result! While I was able to make a one-liner function out of the ordeal, it's a shame that someone on the R development team couldn't build this into the return values for regexpr() and gregexpr(). Granted, I'm not the first to wish for something better. Perhaps this is something to look forward to in R 2.16?
That would be pretty neat. Probably due to the lack of a clean function, I've never even tried to use named capture in [R] and just use sub/gsub instead.
> src = c('chA_0001', 'chA_0002', 'chA_0003', 'chB_0001', 'chB_0002', 'chB_0003')
> ch = sub('^ch','',sub('_[0-9]*$','',src))
> ids = sub('^ch.*_','',src)
> rv = data.frame(ch=ch, ids=ids)
> rv
  ch  ids
1  A 0001
2  A 0002
3  A 0003
4  B 0001
5  B 0002
6  B 0003
> rv$ch
[1] A A A B B B
Levels: A B
> rv$ids
[1] 0001 0002 0003 0001 0002 0003
Levels: 0001 0002 0003
-Sam
I agree, not having a clear cut solution as is found in other languages like Matlab or Python makes the sub/gsub method an obvious and often suggested choice. However, it has the potential of getting pretty hairy when the number of matches to capture and their context grows.