perl-unicode

Fast conversions

2000-12-11 20:42:32
One way of doing complex mappings of data is to use a sequence of s/// 
substitutions. The problem is that each regular expression has to take 
into account the changes made by the earlier ones. This is particularly 
awkward if you start with an 8-bit encoding and are heading towards 
Unicode, for example: half way through the series, half your data may be 
in 8-bit and half in UTF-8, and it all gets very confusing.
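
To make the ordering problem concrete, here is a minimal sketch. The 
mapping (a to b, b to c) and the name convert_sequentially are made up 
purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical mapping: a -> b and b -> c, applied as two separate passes.
sub convert_sequentially {
    my ($str) = @_;
    $str =~ s/a/b/g;    # pass 1: every "a" becomes "b"
    $str =~ s/b/c/g;    # pass 2: also rewrites the "b"s created by pass 1
    return $str;
}

print convert_sequentially("ab"), "\n";   # prints "cc", not the intended "bc"
```

The second substitution cannot tell an original "b" from one produced by 
the first, which is exactly the bookkeeping a single pass avoids.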

A better approach is to try to match all the regular expressions, in some 
priority order, at each position. Here the good old alternation comes into 
play: you list your regular expressions in priority order, as in a|b|c, 
and do the substitution in a single pass. The problem here is that you 
really want to know which regular expression matched, even more than 
which characters were matched. I am therefore requesting an addition to 
the regular expression language, something like (?|, which would return 
the index of the matching alternative in the list rather than the 
characters matched.

For example:

        s/(?|a|b|c)/$outputs[$1]/og;

where a, b & c are regular expressions.

Or perhaps, something a touch more powerful and exciting:

        s/((?|a|b|c))/&{$outputs[$2]}($1)/og;

Of course, you can write this out in Perl, but it is very painful:

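while ((pos($str) || 0) < length($str))
{
        $found = 0;
        # try each regexp, in priority order, at the current position
        for ($i = 0; $i <= $#regexps; $i++)
        {
                if ($str =~ m/\G($regexps[$i])/gc)
                {
                        $out .= $outputs[$i];   # or $out .= &{$outputs[$i]}($1)
                        $found = 1;
                        last;
                }
        }
        unless ($found)
        {
                # nothing matched: copy one character (newline included) through
                $str =~ m/\G(.)/gcs;
                $out .= $1;
        }
}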

And the more complicated your setup, the greater the savings in code and 
speed that (?| might bring.
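
For comparison, the index can be approximated without a new construct by 
giving each alternative its own capture group and asking which group took 
part in the match. The following is only a sketch under some assumptions: 
the sample patterns and outputs are invented, the patterns must not 
contain capture groups of their own (that would shift the numbering), and 
the @- array has to be available.

```perl
use strict;
use warnings;

# Hypothetical sample data: patterns and their outputs, in priority order.
my @regexps = ('ab', 'a', 'c+');
my @outputs = ('X', 'Y', 'Z');

# Wrap each alternative in its own capture group: (ab)|(a)|(c+)
my $combined = join '|', map { "($_)" } @regexps;

sub convert {
    my ($str) = @_;
    # $-[N] is defined only for the group that took part in the match,
    # so the first defined entry gives the index of the alternative.
    $str =~ s{$combined}{
        my ($i) = grep { defined $-[$_ + 1] } 0 .. $#regexps;
        $outputs[$i];
    }ge;
    return $str;
}

print convert("abacccd"), "\n";   # prints "XYZd"
```

This does the job in one pass, but it still searches the groups by hand 
on every match, which is the work (?| would take over.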

I just mention this since the regular expression engine seems to have its 
hood open at the moment, and I can't see the addition of (?| as being 
immensely difficult. But then it is a new idea, and new ideas may have to 
wait for V6 :(

Personally, I might make the value returned by (?| 1-based, so that 0 
indicates no match, but I have no strong views on that one.
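
A small sketch of what the 1-based convention buys: the result can be 
tested directly for truth. The helper alternative_index below is 
hypothetical and only imitates the proposed return value in plain Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: 1-based index of the first alternative that matches
# at the start of $str, or 0 when none of them does.
sub alternative_index {
    my ($str, @regexps) = @_;
    for my $i (0 .. $#regexps) {
        return $i + 1 if $str =~ m/\A(?:$regexps[$i])/;
    }
    return 0;
}

my @regexps = ('ab', 'a', 'c+');
print alternative_index("abc", @regexps), "\n";   # prints 1
print alternative_index("ccc", @regexps), "\n";   # prints 3
print alternative_index("xyz", @regexps), "\n";   # prints 0
```

With a 0-based index, a match on the first alternative would be 
indistinguishable from failure in a simple boolean test.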

Martin Hosken
