On 8/9/05, SADAHIRO Tomoyuki <bqw10602(_at_)nifty(_dot_)com> wrote:
Hello,
On Tue, 9 Aug 2005 15:09:42 +0530, Sastry <ravisastryk(_at_)gmail(_dot_)com>
wrote
Hi
As suggested by you, I ran the following script which resulted in
substituting all the characters with X irrespective of the "special
case" [i-j].
($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
is($a, "XXXXXXXX");
Right, that behavior of ranges in character classes [ ] is what one
would expect from literal_endpoint, which was introduced by Change 16556.
cf. http://public.activestate.com/cgi-bin/perlbrowse?patch=16556
I have also observed that whenever there are any gapped characters, e.g.
[r-s] as in the following script, it translates only 'r' and 's' to X!
($a = "\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2") =~ tr/\x99-\xa2/X/;
is($a, "XXXXXXXXXX");
a) Why is it mentioned that when [i-j] is included [\x89-\x91] should
not be included?
b) Do you think there is a bug in the tr// implementation as a
consequence of the above?
-Sastry
The answer to a) is given in perlebcdic.pod.
The last sentence ("This works in...") seems to have been added there
along with Change 16556, mentioned above.
+++quote begin
REGULAR EXPRESSION DIFFERENCES
As of perl 5.005_03 the letter range regular expression such as [A-Z]
and [a-z] have been especially coded to not pick up gap characters.
For example, characters such as o WITH CIRCUMFLEX that lie between I
and J would not be matched by the regular expression range /[H-K]/.
This works in the other direction, too, if either of the range end
points is explicitly numeric: [\x89-\x91] will match \x8e, even though
\x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic
viewpoint.
If I specify [\x89-\x91] it matches only the end characters (i, j)
and doesn't match any of the gapped characters (including \x8e),
unlike what you had mentioned.
Is this correct?
-Sastry
----quote end
Here is some additional explanation from the viewpoint of portability.
A letter range [h-k] always means [hijk], even on EBCDIC platforms,
and not [hi\x8A-\x90jk], because the string "h" is always the small
letter 'h' whether its code value is 0x68 or 0x88.
Likewise, a numeric range [\x89-\x91] should always mean
[\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91], even on EBCDIC platforms,
and not [\x89\x91], because the string "\x89" always stands for
the code value 0x89 whether it encodes a certain C1 control character
or the letter 'i'.
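
To make the two readings concrete, here is a rough sketch that models them
on any platform. It does not call into perl's regex engine; expand_range()
and the lookup table are hypothetical names made up for illustration, and
the table assumes the usual EBCDIC layout of the small letters
(a-i = 0x81-0x89, j-r = 0x91-0x99, s-z = 0xA2-0xA9).

```perl
use strict;
use warnings;

# Assumed EBCDIC code points of the small letters a-z.
my @ebcdic_lower = (0x81 .. 0x89, 0x91 .. 0x99, 0xA2 .. 0xA9);
my %is_letter = map { $_ => 1 } @ebcdic_lower;

# Hypothetical helper, not part of perl: expand a range of code points.
# $literal is true when the end points were written as letters ([i-j])
# and false when they were written numerically ([\x89-\x91]).
sub expand_range {
    my ($lo, $hi, $literal) = @_;
    my @all = ($lo .. $hi);
    if ($literal) {
        # Letter range: keep only real letters, skip the gap characters.
        return grep { $is_letter{$_} } @all;
    }
    # Numeric range: every code value between the end points.
    return @all;
}

my @letters = expand_range(0x89, 0x91, 1);   # models [i-j]
my @numeric = expand_range(0x89, 0x91, 0);   # models [\x89-\x91]

printf "[i-j]       -> %s\n", join ' ', map { sprintf '%02X', $_ } @letters;
printf "[\\x89-\\x91] -> %s\n", join ' ', map { sprintf '%02X', $_ } @numeric;
```

The letter range yields only 89 and 91, while the numeric range yields all
nine code values from 89 through 91, including the gap character 8E.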
b): In my opinion the above change in [ ] for regular expressions
is an improvement, and a similar change in tr/// is also advisable.
I hesitate to use the word "bug" because of
the following statement on tr/// in perlop.pod, esp. the last sentence:
+++quote begin
Note also that the whole range idea is rather unportable between
character sets--and even within character sets they may cause results
you probably didn't expect. A sound principle is to use only ranges
that begin from and end at either alphabets of equal case (a-e, A-E),
or digits (0-4). Anything else is unsafe. If in doubt, spell out
the character sets in full.
----quote end
By that statement, numeric ranges such as \x89-\x91 are not among the
ranges declared safe; they fall under "Anything else is unsafe".
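
As a small illustration of "spell out the character sets in full", the
sketch below avoids the range operator altogether. The variable names are
only illustrative, and building the class from a list of code points is
just one possible way to do it.

```perl
use strict;
use warnings;

# A letter range, and the same letters spelled out in full.
(my $ranged  = "ghijkl") =~ tr/h-k/X/;
(my $spelled = "ghijkl") =~ tr/hijk/X/;

# A numeric span listed code point by code point, so that no range
# appears inside the character class. $class holds the text
# \x89\x8A...\x91, which the regex compiler then reads as escapes.
my $class = join '', map { sprintf '\\x%02X', $_ } 0x89 .. 0x91;
(my $bytes = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[$class]/X/g;

print "$ranged $spelled $bytes\n";
```

Because every member is listed explicitly, the result does not depend on
how a given perl treats gap characters inside a range.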
Regards,
SADAHIRO Tomoyuki