perl-unicode

Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-10 17:48:20

On Wed, 10 Aug 2005 14:06:56 +0530, Sastry <ravisastryk(_at_)gmail(_dot_)com> 
wrote

As you suggested, I ran the following script, which substituted
every character with X regardless of the "special case" [i-j].

($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
is($a, "XXXXXXXX");

+++quote begin
REGULAR EXPRESSION DIFFERENCES
As of perl 5.005_03 the letter range regular expression such as [A-Z]
and [a-z] have been especially coded to not pick up gap characters.
For example, characters such as o WITH CIRCUMFLEX that lie between I
and J would not be matched by the regular expression range /[H-K]/.
This works in the other direction, too, if either of the range end
points is explicitly numeric: [\x89-\x91] will match \x8e, even though
\x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic
viewpoint.
---quote end
If I specify [\x89-\x91], it matches only the end characters (i, j)
and none of the gap characters (including \x8e),
unlike what you described.
Is this correct?
-Sastry

According to the above statement in perlebcdic.pod,
s/[\x89-\x91]/X/g must substitute \x8e with X.
But that statement says nothing about whether tr/\x89-\x91/X/ would
substitute \x8e with X, since tr/// does not use brackets, [ ].

I think ranges in [ ] and ranges in tr/// should behave the same,
and I agree that tr/\x89-\x91/X/ should substitute \x8e with X,
but that is only my opinion.
I don't know whether it is actually correct.
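
To make the comparison concrete, here is a minimal sketch (not taken from
the core tests) that applies both operators to the same nine bytes,
\x89 through \x91 including \x8e, and reports which bytes each one left
unchanged. The variable names are only illustrative. On an ASCII platform
both lines should report "(none)"; the result on EBCDIC is exactly what is
in question here.

```perl
use strict;
use warnings;

# Nine bytes covering the whole numeric range, including \x8e
# (a gap character between i and j on EBCDIC).
my $src = join '', map { chr($_) } 0x89 .. 0x91;

# Substitution with a bracketed character class.
(my $by_subst = $src) =~ s/[\x89-\x91]/X/g;

# Transliteration with the same numeric range (no brackets).
# The replacement list is shorter, so its last character (X) is reused.
(my $by_tr = $src) =~ tr/\x89-\x91/X/;

# Report any bytes that were left untouched, as hex escapes.
for my $pair (['s///' => $by_subst], ['tr///' => $by_tr]) {
    my ($name, $result) = @$pair;
    my @left = map  { sprintf '\\x%02x', ord($_) }
               grep { $_ ne 'X' } split //, $result;
    printf "%-5s left unchanged: %s\n", $name, @left ? "@left" : '(none)';
}
```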

By the way, when you say "If I specify  [\x89-\x91]", does it
mean s/[\x89-\x91]/X/g or tr/\x89-\x91/X/ ?  I'm confused.

You first told us that tr/\x89-\x91/X/ does not substitute the
gapped characters with X.
And you said that s/[\x89-\x91]/X/g substituted all the characters,
including the gapped ones, with X, didn't you?
If so, I assume that the [\x89-\x91] which doesn't match any of
the gapped characters refers to tr/\x89-\x91/X/.

The following is part of the current core tests in op/pat.t.
I believe they should pass.

Regards,
SADAHIRO Tomoyuki

+++begin
# The 242 and 243 go with the 244 and 245.
# The trick is that in EBCDIC the explicit numeric range should match
# (as also in non-EBCDIC) but the explicit alphabetic range should not match.

if ("\x8e" =~ /[\x89-\x91]/) {
  print "ok 242\n";
} else {
  print "not ok 242\n";
}

if ("\xce" =~ /[\xc9-\xd1]/) {
  print "ok 243\n";
} else {
  print "not ok 243\n";
}

# In most places these tests would succeed since \x8e does not
# in most character sets match 'i' or 'j' nor would \xce match
# 'I' or 'J', but strictly speaking these tests are here for
# the good of EBCDIC, so let's test these only there.
if (ord('i') == 0x89 && ord('J') == 0xd1) { # EBCDIC
  if ("\x8e" !~ /[i-j]/) {
    print "ok 244\n";
  } else {
    print "not ok 244\n";
  }
  if ("\xce" !~ /[I-J]/) {
    print "ok 245\n";
  } else {
    print "not ok 245\n";
  }
} else {
  for (244..245) {
    print "ok $_ # Skip: only in EBCDIC\n";
  }
}
---end
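
As a rough companion to the tests above (a sketch only, not part of
op/pat.t), the following lists which code points in 0x89..0x91 are matched
by the explicit numeric range and which by the alphabetic range [i-j] on
the platform where it is run. The helper name matched_in_range is
hypothetical. On an ASCII platform the numeric range should match all nine
code points and [i-j] none of them. According to the perlebcdic passage
quoted earlier, on EBCDIC [i-j] should match only the end points.

```perl
use strict;
use warnings;

# Return the code points in 0x89..0x91 matched by a compiled pattern.
# (Hypothetical helper, for illustration only.)
sub matched_in_range {
    my ($re) = @_;
    return grep { chr($_) =~ $re } 0x89 .. 0x91;
}

my @numeric = matched_in_range(qr/[\x89-\x91]/);  # explicit numeric range
my @alpha   = matched_in_range(qr/[i-j]/);        # explicit alphabetic range

printf "numeric range matches:    %s\n",
    join ' ', map { sprintf '%02x', $_ } @numeric;
printf "alphabetic range matches: %s\n",
    join ' ', map { sprintf '%02x', $_ } @alpha;
```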