One way of doing complex mappings of data is to use a sequence of s///
substitutions. The problem with this is that each regular expression has to
take into account the changes made by the previous s///. This is
particularly problematic if you start with an 8-bit coding and are heading
towards Unicode, for example: half way through the s/// series, half your
data may be in 8-bit and half in UTF-8, and it all gets very confusing.
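
To make the interference concrete, here is a minimal sketch using a made-up mapping (swap every "a" and "b"); the sub name swap_sequential is purely illustrative. The second s/// cannot tell original characters from ones the first s/// has just produced:

```perl
use strict;
use warnings;

# Intended mapping (hypothetical): every "a" becomes "b", every "b" becomes "a".
sub swap_sequential {
    my ($str) = @_;
    $str =~ s/a/b/g;    # "abba" is now "bbbb"
    $str =~ s/b/a/g;    # this pass also remaps the "b"s created by the first one
    return $str;
}

print swap_sequential('abba'), "\n";
```

This prints "aaaa" rather than the intended "baab", because the output of the first pass is fed straight into the second.
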
A better approach is to try to match all the regular expressions, in some
priority order, at each position. Here the good old alternation comes into
play. You can list your regular expressions in priority order and then do
the substitution as needed, for example a|b|c. The problem here is that you
really would like to know which regular expression matched, even more than
which characters were matched. I am therefore making a request for an
addition to the regular expression language of something like (?|
which would return the index of the alternative that matched rather than
the characters matched.
For example:
s/(?|a|b|c)/$outputs[$1]/og;
where a, b & c are regular expressions.
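
For comparison, something close to this can be approximated without new syntax, under one assumption: none of the individual patterns contains capture groups of its own. The sketch below wraps each alternative in its own group and uses the @- array to see which group took part in the match. The rule table and the name remap are hypothetical, and the (?| in my example above is proposed syntax only (Perl 5.10 and later use that sequence for branch reset, which is a different feature).

```perl
use strict;
use warnings;

# Hypothetical rule table: patterns in priority order, one output per pattern.
my @regexps = ('ab', 'a', 'b');
my @outputs = ('X',  'Y', 'Z');

# Build (ab)|(a)|(b): one capture group per alternative.
# Assumption: the patterns themselves contain no capture groups.
my $combined = join '|', map { "($_)" } @regexps;

sub remap {
    my ($str) = @_;
    $str =~ s{$combined}{
        # Group $i + 1 belongs to alternative $i; only the group that
        # took part in the match has a defined start offset in @-.
        my ($i) = grep { defined $-[$_ + 1] } 0 .. $#regexps;
        $outputs[$i];
    }ge;
    return $str;
}

print remap('abba'), "\n";
```

On "abba" this gives "XZY": "ab" wins at the first position because it is listed first, then "b" and "a" are mapped on their own. The bookkeeping with @- is exactly the part a built-in index would make unnecessary.
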
Or perhaps, something a touch more powerful and exciting:
s/((?|a|b|c))/&{$outputs[$2]}($1)/og;
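
The handler form can be sketched the same way, again assuming the patterns contain no capture groups of their own. An outer group plays the part of $1 (the matched text), so the per-alternative groups start at 2. The table contents and the name remap_with_handlers are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical rule table: each pattern is paired with a handler that
# receives the matched text and returns its replacement.
my @regexps = ('[0-9]+', '[a-z]+');
my @outputs = (
    sub { my ($text) = @_; return "<$text>" },    # digits get bracketed
    sub { my ($text) = @_; return uc $text },     # lowercase runs get upcased
);

# Outer group 1 holds the whole match; groups 2, 3, ... mark the alternatives.
my $combined = '(' . join('|', map { "($_)" } @regexps) . ')';

sub remap_with_handlers {
    my ($str) = @_;
    $str =~ s{$combined}{
        # Find the alternative whose group took part in the match.
        my ($i) = grep { defined $-[$_ + 2] } 0 .. $#regexps;
        $outputs[$i]->($1);
    }ge;
    return $str;
}

print remap_with_handlers('abc 123 X'), "\n";
```

This prints "ABC <123> X": each run is handed to the handler of the first pattern that matches at that position, and unmatched text is left alone.
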
Of course, you can write this out in Perl, but it is very painful:
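while (pos($str) < length($str))
{
    $found = 0;
    # Try each pattern, in priority order, at the current position.
    for ($i = 0; $i <= $#regexps; $i++)
    {
        if ($str =~ m/\G($regexps[$i])/gc)
        {
            $out .= $outputs[$i];    # or $out .= &{$outputs[$i]}($1);
            $found = 1;
            last;
        }
    }
    # Nothing matched here: copy one character (newlines included) unchanged.
    unless ($found)
    {
        $str =~ m/\G(.)/gcs;
        $out .= $1;
    }
}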
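
To try the loop out, it can be wrapped in a sub and given a small rule table. This is only a sketch: the name remap_scan and the sample patterns are hypothetical, and it assumes no pattern can match the empty string (otherwise the position never advances and the loop does not terminate).

```perl
use strict;
use warnings;

# Scan $str left to right, trying each pattern in priority order at the
# current position. Assumption: no pattern matches the empty string.
sub remap_scan {
    my ($str, $regexps, $outputs) = @_;
    my $out = '';
    pos($str) = 0;
    POSITION: while (pos($str) < length $str) {
        for my $i (0 .. $#{$regexps}) {
            my $re = $regexps->[$i];
            if ($str =~ m/\G(?:$re)/gc) {
                $out .= $outputs->[$i];
                next POSITION;
            }
        }
        # No rule matched here: copy one character through unchanged.
        $str =~ m/\G(.)/gcs;
        $out .= $1;
    }
    return $out;
}

print remap_scan("abba\n", ['ab', 'a', 'b'], ['X', 'Y', 'Z']);
```

With these rules "abba\n" becomes "XZY\n". Every position costs up to one match attempt per pattern, which is where a single combined regular expression ought to win.
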
And the more complicated your setup, the greater the savings in code and
speed that (?| might bring.
I just mention this since the regular expression engine seems to have its
hood open at the moment, and I can't see the addition of (?| as being
immensely difficult. But then it is a new idea, and new ideas may have to
wait for V6 :(
Personally, I might make the return value from (?| 1-based, so that 0
indicates no match, but I have no strong views on that one.
Martin Hosken