perl-unicode

Handling MacArabic in perl 5.8.0

2003-01-24 03:30:04

I'm having trouble using 5.8.0 Encode with the MacArabic code table.
(It took a long time to figure out the cause, and I still don't 
understand where Encode gets/keeps its info about character mappings.)

The problem affects all the points in the MacArabic table whose Unicode 
correlates include the "<LR>+" or "<RL>+" indicators -- e.g. (quoting 
from the MAC/ARABIC.TXT listing available from www.unicode.org):

#=======================================================================
#   FTP file name:  ARABIC.TXT
#
#   Contents:       Map (external version) from Mac OS Arabic
#                   character set to Unicode 2.1
#
#   Copyright:      (c) 1994-1999 by Apple Computer, Inc., all rights
#                   reserved.
...
0x20    <LR>+0x0020     # SPACE, left-right
0x21    <LR>+0x0021     # EXCLAMATION MARK, left-right
0x22    <LR>+0x0022     # QUOTATION MARK, left-right
...
0x81    <RL>+0x00A0     # NO-BREAK SPACE, right-left
...
0x8C    <RL>+0x00AB     # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK, right-left
...
0xA0    <RL>+0x0020     # SPACE, right-left
0xA1    <RL>+0x0021     # EXCLAMATION MARK, right-left
0xA2    <RL>+0x0022     # QUOTATION MARK, right-left
0xA3    <RL>+0x0023     # NUMBER SIGN, right-left
0xA4    <RL>+0x0024     # DOLLAR SIGN, right-left
...
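
As a minimal sketch of what the "<LR>+" and "<RL>+" entries look like to a program, the lines quoted above can be split into a Mac byte, an optional direction tag, and a Unicode code point. The helper name parse_mapping_line and the hashes %lossy and %direction are hypothetical, not part of Encode, and the pattern assumes one four-digit code point per entry, as in the excerpt:

```perl
use strict;
use warnings;

# Sample lines copied from the ARABIC.TXT excerpt quoted above.
my @lines = (
    '#   FTP file name:  ARABIC.TXT',
    '0x20    <LR>+0x0020     # SPACE, left-right',
    '0x81    <RL>+0x00A0     # NO-BREAK SPACE, right-left',
    '0xA0    <RL>+0x0020     # SPACE, right-left',
    '0xA4    <RL>+0x0024     # DOLLAR SIGN, right-left',
);

# Hypothetical helper: returns (byte, direction tag or undef, code point),
# or an empty list for comment lines and anything else it cannot parse.
sub parse_mapping_line {
    my ($line) = @_;
    return unless $line =~ /^0x([0-9A-Fa-f]{2})\s+(?:<(LR|RL)>\+)?0x([0-9A-Fa-f]{4})\b/;
    return ( hex($1), $2, hex($3) );
}

my ( %lossy, %direction );
for my $line (@lines) {
    my @fields = parse_mapping_line($line) or next;
    my ( $byte, $dir, $cp ) = @fields;
    $lossy{$byte}     = $cp;                              # direction discarded
    $direction{$byte} = defined $dir ? $dir : 'none';     # kept only for reporting
}

for my $byte ( sort { $a <=> $b } keys %lossy ) {
    printf "0x%02X -> U+%04X (direction tag: %s)\n",
        $byte, $lossy{$byte}, $direction{$byte};
}
```

Dropping the tag leaves 0x20 and 0xA0 both pointing at U+0020, which is why such a mapping can be decoded but not round-tripped.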

I'll attach a code snippet below to demonstrate (it can operate as a
self-contained program), together with the output of "perl -V" on my
system (in case that helps).

I understand that Mac developers would consider a conversion to Unicode
"lossy" or "non-reversible" if the directionality indicators are not
preserved somehow (using RLE/LRE or RLO/LRO), and that preserving them
might require an "algorithmic" approach that 'enc2xs' would not support.

Is there a work-around that will allow all the MacArabic code points to
be converted successfully, given that their respective character
semantics are all well established in Unicode?  Even a "lossy"
conversion (ditching the directionality specs) would be better than the
failures I'm getting now.
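
One possible shape for such a "lossy" work-around, as a sketch only: decode one octet at a time, and where Encode croaks, fall back to a hand-built table with the direction indicator dropped. The names %lossy_fallback and decode_macarabic_lossy are hypothetical, the table lists only the entries quoted from ARABIC.TXT above (a real one would need every affected line), and on an Encode whose MacArabic table already covers these bytes the fallback branch may never be taken:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical fallback table: Mac byte => code point, <LR>/<RL> dropped.
# Only the entries quoted from ARABIC.TXT are listed here.
my %lossy_fallback = (
    0x20 => 0x0020, 0x21 => 0x0021, 0x22 => 0x0022,
    0x81 => 0x00A0, 0x8C => 0x00AB,
    0xA0 => 0x0020, 0xA1 => 0x0021, 0xA2 => 0x0022,
    0xA3 => 0x0023, 0xA4 => 0x0024,
);

# Hypothetical helper, not an Encode function.
sub decode_macarabic_lossy {
    my ($octets) = @_;
    my $out = '';
    for my $byte ( split //, $octets ) {
        my $copy = $byte;    # decode() may modify its input when CHECK is set
        my $char = eval { decode( 'MacArabic', $copy, Encode::FB_CROAK ) };
        if ( !defined $char ) {
            # Encode could not map this byte: use the table, or U+FFFD
            # (REPLACEMENT CHARACTER) if the table has no entry either.
            my $cp = $lossy_fallback{ ord $byte };
            $char = defined $cp ? chr($cp) : "\x{FFFD}";
        }
        $out .= $char;
    }
    return $out;
}

my $text = decode_macarabic_lossy("\xA1\x20\xA4");
printf "U+%04X\n", ord($_) for split //, $text;
```

Decoding octet by octet is slow, but it keeps one unmapped byte from aborting the whole string; the directionality information is still discarded.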

-----------
David Graff                     Linguistic Data Consortium
graff(_at_)ldc(_dot_)upenn(_dot_)edu            3600 Market St., Suite 810
voice: (215) 898-0887           University of Pennsylvania
fax:   (215) 573-2175           Philadelphia, PA 19104
                http://www.ldc.upenn.edu


--------------- perl -V output:
Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=solaris, osvers=2.8, archname=sun4-solaris
    uname='sunos follicle.seas.upenn.edu 5.8 generic_108528-09 sun4u sparc sunw,sun-blade-1000 '
    config_args='-Dcc=gcc -Dprefix=/pkg/p/perl-5.8.0'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-fno-strict-aliasing -I/usr/local/include -I/pkg/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O',
    cppflags='-fno-strict-aliasing -I/usr/local/include -I/pkg/include'
    ccversion='', gccversion='2.95.2 19991024 (release)', gccosandvers='solaris2.7'
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-L/usr/local/lib -R/usr/local/lib '
    libpth=/usr/local/lib /usr/lib /usr/ccs/lib /pkg/lib
    libs=-lsocket -lnsl -lgdbm -ldl -lm -lc
    perllibs=-lsocket -lnsl -ldl -lm -lc
    libc=, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-fPIC -L/pkg/lib -R/pkg/lib -I/pkg/include', lddlflags='-G -L/usr/local/lib -R/usr/local/lib -I/pkg/lib'


Characteristics of this binary (from libperl): 
  Compile-time options: USE_LARGE_FILES
  Built under solaris
  Compiled at Sep 23 2002 15:26:38
  @INC:
    /pkg/p/perl-5.8.0/lib/5.8.0/sun4-solaris
    /pkg/p/perl-5.8.0/lib/5.8.0
    /pkg/p/perl-5.8.0/lib/site_perl/5.8.0/sun4-solaris
    /pkg/p/perl-5.8.0/lib/site_perl/5.8.0
    /pkg/p/perl-5.8.0/lib/site_perl
    .



--------------- test script:
use strict;
use warnings;
use Encode;

# Let STDOUT accept the decoded (wide) characters.
binmode STDOUT, ':utf8';

# Every octet value to test: 0x20-0x7E and 0x80-0xFF.
my @octet_in = map { chr } ( 0x20 .. 0x7E, 0x80 .. 0xFF );

# Show that Encode functions are working for some vendor tables:

foreach my $table ( qw/cp1256 MacArabic/ ) {
    my ( @fail, @succ, @msgs );
    foreach my $octet ( @octet_in ) {
        # decode() may modify its input when CHECK is set, so work on a copy.
        my $char = $octet;
        my $utf8_out = eval { decode( $table, $char, Encode::FB_CROAK ) };
        if ( defined $utf8_out ) {
            push @succ, $utf8_out;
        } else {
            # Report failing octets as hex so they are readable on any terminal.
            push @fail, sprintf( '0x%02X', ord $octet );
            push @msgs, $@;
        }
    }
    print join( ' ', "decoding via $table succeeds on:", @succ ? @succ : "nothing" ), "\n";
    print join( ' ', "decoding via $table fails on:", @fail ? @fail : "nothing" ), "\n";
    print STDERR @msgs;
}