perl-unicode

Re: Transliteration operator (tr//) on EBCDIC platform

2005-08-10 23:56:50
Hi,
This is Rajarshi, writing on Sastry's behalf while he is on vacation.

SADAHIRO Tomoyuki <bqw10602(_at_)nifty(_dot_)com> wrote:

On Wed, 10 Aug 2005 14:06:56 +0530, Sastry wrote

As you suggested, I ran the following script, which substituted
all the characters with X regardless of the "special case" [i-j].

($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
is($a, "XXXXXXXX");

+++quote begin
REGULAR EXPRESSION DIFFERENCES
As of perl 5.005_03 the letter range regular expression such as [A-Z]
and [a-z] have been especially coded to not pick up gap characters.
For example, characters such as o WITH CIRCUMFLEX that lie between I
and J would not be matched by the regular expression range /[H-K]/.
This works in the other direction, too, if either of the range end
points is explicitly numeric: [\x89-\x91] will match \x8e, even though
\x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic
viewpoint.
---quote end

If I specify [\x89-\x91], it matches only the end characters (i, j)
and doesn't match any of the gap characters (including \x8e),
unlike what you had mentioned.
Is this correct?
-Sastry

According to the above statement in perlebcdic.pod,
s/[\x89-\x91]/X/g must substitute \x8e with X.
But that statement says nothing about whether tr/\x89-\x91/X/ would
substitute \x8e with X, since tr/// does not use brackets, [ ].

Though I think ranges in [ ] and ranges in tr/// should coincide
and agree that tr/\x89-\x91/X/ should substitute \x8e with X,
that is just my opinion.
I don't know whether it is true and correct.
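
One way to compare the two constructs directly is to apply both to the
same string and print the results side by side. The following is only a
minimal sketch in core Perl; the variable names are illustrative. On an
ASCII-based platform both lines print XXXXXXXX and the script reports
that the ranges agree. What it prints on an EBCDIC platform is exactly
the open question in this thread, so no EBCDIC output is claimed here.

```perl
use strict;
use warnings;

# Same byte string as in the s/// example above (note: no \x8e in it).
my $src = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91";

# Bracketed character class with explicit numeric end points.
(my $by_subst = $src) =~ s/[\x89-\x91]/X/g;

# The same range given to tr///; the single replacement character X
# is reused for every character in the search range.
(my $by_tr = $src) =~ tr/\x89-\x91/X/;

print "s///: $by_subst\n";
print "tr//: $by_tr\n";
print $by_subst eq $by_tr ? "ranges agree\n" : "ranges differ\n";
```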
Is there some way we can confirm whether this is the correct (and expected)
behaviour, since there isn't any explicit documentation for the tr operator?


By the way, when you say "If I specify [\x89-\x91]", does it
mean s/[\x89-\x91]/X/g or tr/\x89-\x91/X/ ? I'm confused.
We mean tr/\x89-\x91/X/.


You first informed us that gap characters are not
substituted with X by tr/\x89-\x91/X/.
And you said s/[\x89-\x91]/X/g substituted all the characters,
including gap characters, with X, didn't you?

Yes.
If so, I assume that your [\x89-\x91], which doesn't match any of
the gap characters, means tr/\x89-\x91/X/.
That's correct. We mean tr/\x89-\x91/X/.
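
To pin down exactly which characters tr/\x89-\x91/X/ leaves alone, each
code point in the range can be checked on its own. This is a sketch
rather than a definitive test; the %translated hash is a name made up for
this example. It relies only on tr/// returning the number of characters
it translated.

```perl
use strict;
use warnings;

my %translated;

# Check each code point in the numeric range one at a time.
for my $code (0x89 .. 0x91) {
    my $ch = chr $code;

    # tr/// returns the number of characters it translated,
    # so a count of 0 means this character was not in the range.
    my $hits = ($ch =~ tr/\x89-\x91/X/);
    $translated{$code} = $hits;

    printf "\\x%02x %s\n", $code, $hits ? "translated" : "left alone";
}
```

Any line reporting "left alone" identifies a character that tr/// treats
as lying outside the range.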


The following is part of the current core tests from op/pat.t.
I believe they should pass.
Yes, all the following tests pass. I think they are in the context of
the s/[]/X/ operator and hence pass.

Thanks,

Rajarshi.


Regards,
SADAHIRO Tomoyuki

+++begin
# The 242 and 243 go with the 244 and 245.
# The trick is that in EBCDIC the explicit numeric range should match
# (as also in non-EBCDIC) but the explicit alphabetic range should not match.

if ("\x8e" =~ /[\x89-\x91]/) {
    print "ok 242\n";
} else {
    print "not ok 242\n";
}

if ("\xce" =~ /[\xc9-\xd1]/) {
    print "ok 243\n";
} else {
    print "not ok 243\n";
}

# In most places these tests would succeed since \x8e does not
# in most character sets match 'i' or 'j' nor would \xce match
# 'I' or 'J', but strictly speaking these tests are here for
# the good of EBCDIC, so let's test these only there.
if (ord('i') == 0x89 && ord('J') == 0xd1) { # EBCDIC
    if ("\x8e" !~ /[i-j]/) {
        print "ok 244\n";
    } else {
        print "not ok 244\n";
    }
    if ("\xce" !~ /[I-J]/) {
        print "ok 245\n";
    } else {
        print "not ok 245\n";
    }
} else {
    for (244..245) {
        print "ok $_ # Skip: only in EBCDIC\n";
    }
}
---end
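
For comparison, tr/// counterparts of tests 242 and 243 might look like
the sketch below. These are hypothetical and are not part of op/pat.t.
They encode only the opinion expressed above that an explicit numeric
range in tr/// should include the gap characters. No counterpart of tests
244 and 245 is given, because the thread does not establish how tr/i-j/X/
should behave on EBCDIC.

```perl
use strict;
use warnings;
use Test::More tests => 2;

# Hypothetical tr/// analogues of tests 242 and 243 (not in op/pat.t).
# Assumption: an explicit numeric range should cover gap characters.
(my $lower = "\x8e") =~ tr/\x89-\x91/X/;
is($lower, "X", 'tr/\x89-\x91/X/ translates \x8e');

(my $upper = "\xce") =~ tr/\xc9-\xd1/X/;
is($upper, "X", 'tr/\xc9-\xd1/X/ translates \xce');
```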







