- *nix command line / shell scripts
- Javascript
- PHP
- Matlab
- Python
- R
To get a sense of R's named capture inadequacy, here's a simple scenario ...
The Problem:
You are given a list of files with names like:
- chA_0001
- chA_0002
- chA_0003
- chB_0001
- chB_0002
- chB_0003
A regular expression with named capture that pulls the channel letter and the id number out of each name is quite simple:
ch(?<ch>[A-Z])\_(?<id>[0-9]{4})
which, given the list of file names, should return some structure with property:value pairs of the sort:
- ch : A, A, A, B, B, B
- id : 0001, 0002, 0003, 0001, 0002, 0003
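For comparison, here is a minimal sketch of that target structure in Python, one of the languages listed above. Two things are assumptions on my part rather than anything from the examples that follow: Python spells named groups as (?P<name>...), and capture_columns is a hypothetical helper name, not part of the re module.

```python
import re

# File names from the problem statement.
src = ["chA_0001", "chA_0002", "chA_0003", "chB_0001", "chB_0002", "chB_0003"]

# Same pattern as above, written with Python's (?P<name>...) group syntax.
pat = re.compile(r"ch(?P<ch>[A-Z])_(?P<id>[0-9]{4})")


def capture_columns(pattern, strings):
    """Collect each named group into a list, one entry per input string.

    Hypothetical helper for illustration; strings that do not match
    contribute None to every column.
    """
    columns = {name: [] for name in pattern.groupindex}
    for s in strings:
        m = pattern.search(s)
        for name in columns:
            columns[name].append(m.group(name) if m else None)
    return columns


result = capture_columns(pat, src)
print(result["ch"])  # ['A', 'A', 'A', 'B', 'B', 'B']
print(result["id"])  # ['0001', '0002', '0003', '0001', '0002', '0003']
```

The point is only the shape of the result: one named column per capture group, with one value per file name.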
The Solutions:
Here's some Matlab code that basically does this in one line:
src = {'chA_0001', 'chA_0002', 'chA_0003', 'chB_0001', 'chB_0002', 'chB_0003'};
pat = 'ch(?<ch>[A-Z])\_(?<id>[0-9]{4})';
rex = regexp(src, pat, 'names')
rex{1}
rex{1}.id
which would result in the following console output:
>> rex = regexp(src, pat, 'names')
rex =
    [1x1 struct]    [1x1 struct]    [1x1 struct]    [1x1 struct]    [1x1 struct]    [1x1 struct]
>> rex{1}
ans =
    ch: 'A'
    id: '0001'
>> rex{1}.id
ans =
0001
Now here's the equivalent R code:
# regular expressions with named capture in R
src = c('chA_0001', 'chA_0002', 'chA_0003', 'chB_0001', 'chB_0002', 'chB_0003')
pat = 'ch(?<ch>[A-Z])\\_(?<id>[0-9]{4})'

re.capture = function(pattern, string, ...) {
  # keep the input, the raw regexpr() result, and a slot for the named captures
  rex = list(src=string,
             result=regexpr(pattern, string, perl=TRUE, ...),
             names=list())

  # for each named group, cut the captured text out of the input strings
  for (.name in attr(rex$result, 'capture.names')) {
    rex$names[[.name]] = substr(rex$src,
                                attr(rex$result, 'capture.start')[,.name],
                                attr(rex$result, 'capture.start')[,.name]
                                + attr(rex$result, 'capture.length')[,.name]
                                - 1)
  }

  return(rex)
}

print(re.capture(pat, src))
There is a lot of work here! To help explain what's going on, here's the corresponding console output:
$src
[1] "chA_0001" "chA_0002" "chA_0003" "chB_0001" "chB_0002" "chB_0003"

$result
[1] 1 1 1 1 1 1
attr(,"match.length")
[1] 8 8 8 8 8 8
attr(,"useBytes")
[1] TRUE
attr(,"capture.start")
     ch id
[1,]  3  5
[2,]  3  5
[3,]  3  5
[4,]  3  5
[5,]  3  5
[6,]  3  5
attr(,"capture.length")
     ch id
[1,]  1  4
[2,]  1  4
[3,]  1  4
[4,]  1  4
[5,]  1  4
[6,]  1  4
attr(,"capture.names")
[1] "ch" "id"

$names
$names$ch
[1] "A" "A" "A" "B" "B" "B"

$names$id
[1] "0001" "0002" "0003" "0001" "0002" "0003"
Here's what's happening:
1. regexpr(..., perl=T) is used to create a regular expression result with named capture, which is placed in the $result item of the output list:

   $result
   [1] 1 1 1 1 1 1
   attr(,"match.length")
   [1] 8 8 8 8 8 8
   attr(,"useBytes")
   [1] TRUE
   attr(,"capture.start")
        ch id
   [1,]  3  5
   [2,]  3  5
   [3,]  3  5
   [4,]  3  5
   [5,]  3  5
   [6,]  3  5
   attr(,"capture.length")
        ch id
   [1,]  1  4
   [2,]  1  4
   [3,]  1  4
   [4,]  1  4
   [5,]  1  4
   [6,]  1  4
   attr(,"capture.names")
   [1] "ch" "id"

   This result is pretty unusable since all of the important captured information is buried in attribute settings.
2. To do anything with the output from regexpr(), the result from #1 has to have its attributes probed using attr() (via a for loop) to get:
   - captured group names
   - start locations within the strings of the captured groups
   - lengths of the captured groups (oddly/depressingly, end positions are not returned)
3. substr() is then used to extract the actual match strings from the input list:

   rex$names[[.name]] = substr(rex$src,
                               attr(rex$result, 'capture.start')[,.name],
                               attr(rex$result, 'capture.start')[,.name]
                               + attr(rex$result, 'capture.length')[,.name]
                               - 1)
4. The above steps are encapsulated into a much easier-to-use function, re.capture(), that allows for one-line-ish extraction:

   > src
   [1] "chA_0001" "chA_0002" "chA_0003" "chB_0001" "chB_0002" "chB_0003"
   > pat
   [1] "ch(?<ch>[A-Z])\\_(?<id>[0-9]{4})"
   > re.capture(pat, src)$names$ch
   [1] "A" "A" "A" "B" "B" "B"
   > re.capture(pat, src)$names$id
   [1] "0001" "0002" "0003" "0001" "0002" "0003"
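The start + length - 1 arithmetic in the substr() step is easy to get off by one, so here is a small sanity check sketched in Python rather than R. Both r_style_offsets and r_substr are hypothetical helpers of my own that mimic how R reports captures (1-based start plus length) and how R's substr() slices (1-based, inclusive at both ends); they are not library functions.

```python
import re

src = ["chA_0001", "chA_0002", "chA_0003", "chB_0001", "chB_0002", "chB_0003"]
pat = re.compile(r"ch(?P<ch>[A-Z])_(?P<id>[0-9]{4})")


def r_style_offsets(pattern, string):
    """Return {group name: (start, length)} using 1-based starts.

    Python reports 0-based start and exclusive end, so the start is
    shifted by one and the length is end minus start.
    """
    m = pattern.search(string)
    if m is None:
        return {}
    return {
        name: (m.start(name) + 1, m.end(name) - m.start(name))
        for name in pattern.groupindex
    }


def r_substr(string, start, stop):
    """Mimic R's substr(): 1-based indices, inclusive on both ends."""
    return string[start - 1:stop]


offsets = r_style_offsets(pat, src[0])
print(offsets)  # {'ch': (3, 1), 'id': (5, 4)}

# Same arithmetic as the R code: stop = start + length - 1.
extracted = {
    name: r_substr(src[0], start, start + length - 1)
    for name, (start, length) in offsets.items()
}
print(extracted)  # {'ch': 'A', 'id': '0001'}
```

The offsets agree with the first row of the capture.start and capture.length matrices above (3 and 5, 1 and 4), and the inclusive end position that R leaves out is just start + length - 1.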
Summary
All told, it takes three functions and a for loop to get a user-friendly named capture result! While I was able to make a one-liner function out of the ordeal, it's a shame that someone on the R development team couldn't build this into the return values for regexpr() and gregexpr(). Granted, I'm not the first to wish for something better. Perhaps this is something to look forward to in R 2.16?
That would be pretty neat. Probably due to the lack of a clean function, I've never even tried to use named capture in [R] and just use sub/gsub instead.
> src = c('chA_0001', 'chA_0002', 'chA_0003', 'chB_0001', 'chB_0002', 'chB_0003')
> ch = sub('^ch','',sub('_[0-9]*$','',src))
> ids = sub('^ch.*_','',src)
> rv = data.frame(ch=ch, ids=ids)
> rv
  ch  ids
1  A 0001
2  A 0002
3  A 0003
4  B 0001
5  B 0002
6  B 0003
> rv$ch
[1] A A A B B B
Levels: A B
> rv$ids
[1] 0001 0002 0003 0001 0002 0003
Levels: 0001 0002 0003
-Sam
I agree, not having a clear-cut solution like those found in other languages such as Matlab or Python makes the sub/gsub method an obvious and often-suggested choice. However, it can get pretty hairy as the number of matches to capture, and the context around them, grows.