I'm having trouble using the Encode module in Perl 5.8.0 with the
MacArabic code table. (It took a long time to figure out the cause, and
I still don't understand where Encode gets or keeps its character-mapping
information.) The problem affects every code point in the MacArabic
table whose Unicode mapping includes the "<LR>+" or "<RL>+" indicator,
e.g. (quoting from the MAC/ARABIC.TXT listing available from
www.unicode.org):
#=======================================================================
# FTP file name: ARABIC.TXT
#
# Contents: Map (external version) from Mac OS Arabic
# character set to Unicode 2.1
#
# Copyright: (c) 1994-1999 by Apple Computer, Inc., all rights
# reserved.
...
0x20 <LR>+0x0020 # SPACE, left-right
0x21 <LR>+0x0021 # EXCLAMATION MARK, left-right
0x22 <LR>+0x0022 # QUOTATION MARK, left-right
...
0x81 <RL>+0x00A0 # NO-BREAK SPACE, right-left
...
0x8C <RL>+0x00AB # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK, right-left
...
0xA0 <RL>+0x0020 # SPACE, right-left
0xA1 <RL>+0x0021 # EXCLAMATION MARK, right-left
0xA2 <RL>+0x0022 # QUOTATION MARK, right-left
0xA3 <RL>+0x0023 # NUMBER SIGN, right-left
0xA4 <RL>+0x0024 # DOLLAR SIGN, right-left
...
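To make the format of these entries concrete, here is a minimal sketch
that splits one mapping line into the Mac byte, the optional direction
indicator, and the Unicode code point. It is not part of Encode or
enc2xs; parse_mapping_line is a hypothetical helper, and the pattern
covers only the line shapes quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one line of the Apple mapping table.
# Returns (mac_byte, direction, unicode_code_point); direction is 'LR',
# 'RL', or '' when no indicator is present. Returns an empty list for
# comment lines and anything else that does not look like a mapping.
sub parse_mapping_line {
    my ($line) = @_;
    return unless $line =~
        /^0x([0-9A-Fa-f]{2})\s+(?:<(LR|RL)>\+)?0x([0-9A-Fa-f]{4})/;
    my $dir = defined $2 ? $2 : '';
    return ( hex($1), $dir, hex($3) );
}

my @sample = (
    '0x20 <LR>+0x0020 # SPACE, left-right',
    '0xA0 <RL>+0x0020 # SPACE, right-left',
    '# Copyright: (c) 1994-1999 by Apple Computer, Inc.',
);

for my $line (@sample) {
    my @fields = parse_mapping_line($line);
    next unless @fields;    # skip comments
    my ( $byte, $dir, $ucs ) = @fields;
    printf "0x%02X -> U+%04X%s\n", $byte, $ucs, $dir ? " ($dir)" : '';
}
```

Note that 0x20 and 0xA0 come out with the same code point and differ
only in the direction field, which is exactly the information at issue.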
I'll attach a code snippet below to demonstrate (it can operate as a
self-contained program), together with the output of "perl -V" on my
system (in case that helps).
I understand that Mac developers would consider a conversion to Unicode
"lossy" or "non-reversible" if the directionality indicators are not
preserved somehow (using RLE/LRE or RLO/LRO), and that preserving them
might require an "algorithmic" approach that 'enc2xs' would not support.
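As an illustration of what "preserved somehow" could mean, the sketch
below wraps a character in an explicit override (LRO U+202D or RLO
U+202E) closed by PDF U+202C. This is only one possible reading, not
what Apple's table prescribes or what Encode does; with_direction is a
hypothetical helper.

```perl
use strict;
use warnings;

# Unicode explicit directional overrides and their terminator.
my %override = (
    LR => "\x{202D}",    # LEFT-TO-RIGHT OVERRIDE (LRO)
    RL => "\x{202E}",    # RIGHT-TO-LEFT OVERRIDE (RLO)
);
my $PDF = "\x{202C}";    # POP DIRECTIONAL FORMATTING

# Hypothetical helper: return the character for $code_point wrapped in
# the override that matches the table's <LR> or <RL> indicator.
sub with_direction {
    my ( $dir, $code_point ) = @_;
    die "unknown direction '$dir'\n" unless exists $override{$dir};
    return $override{$dir} . chr($code_point) . $PDF;
}

# Mac byte 0xA1 is listed as <RL>+0x0021 (EXCLAMATION MARK, right-left).
my $wrapped = with_direction( 'RL', 0x0021 );
print join( ' ', map { sprintf 'U+%04X', ord } split //, $wrapped ), "\n";
```

One table byte becomes three code points here, which is why a plain
one-to-one mapping table could not express it.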
Is there a work-around that will allow all the MacArabic code points to
be converted successfully, given that their respective character
semantics are all well established in Unicode? Even a "lossy"
conversion (ditching the directionality specs) would be better than the
failures I'm getting now.
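To be clear about the kind of "lossy" result I would accept, here is a
rough sketch that bypasses Encode and uses a hand-built lookup with the
direction indicators dropped. The table holds only the rows quoted
above, lossy_decode is a hypothetical name, and substituting U+FFFD for
unlisted bytes is my own assumption.

```perl
use strict;
use warnings;

# Partial table: only the rows quoted in this message, with the
# <LR>/<RL> indicators discarded.
my %mac_to_ucs = (
    0x20 => 0x0020, 0x21 => 0x0021, 0x22 => 0x0022,
    0x81 => 0x00A0, 0x8C => 0x00AB,
    0xA0 => 0x0020, 0xA1 => 0x0021, 0xA2 => 0x0022,
    0xA3 => 0x0023, 0xA4 => 0x0024,
);

# Hypothetical helper: map each octet through the table; octets that
# are not listed become U+FFFD REPLACEMENT CHARACTER.
sub lossy_decode {
    my ($octets) = @_;
    return join '', map {
        my $byte = ord($_);
        exists $mac_to_ucs{$byte} ? chr( $mac_to_ucs{$byte} ) : "\x{FFFD}";
    } split //, $octets;
}

my $text = lossy_decode("\xA1\x20\x8C");
print join( ' ', map { sprintf 'U+%04X', ord } split //, $text ), "\n";
```

Since 0x21 and 0xA1 both end up as U+0021, the result cannot be
converted back to the original bytes, which is the loss I am willing
to accept.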
-----------
David Graff Linguistic Data Consortium
graff(_at_)ldc(_dot_)upenn(_dot_)edu 3600 Market St., Suite 810
voice: (215) 898-0887 University of Pennsylvania
fax: (215) 573-2175 Philadelphia, PA 19104
http://www.ldc.upenn.edu
--------------- perl -V output:
Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
Platform:
osname=solaris, osvers=2.8, archname=sun4-solaris
uname='sunos follicle.seas.upenn.edu 5.8 generic_108528-09 sun4u sparc
sunw,sun-blade-1000 '
config_args='-Dcc=gcc -Dprefix=/pkg/p/perl-5.8.0'
hint=recommended, useposix=true, d_sigaction=define
usethreads=undef use5005threads=undef useithreads=undef
usemultiplicity=undef
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=y, bincompat5005=undef
Compiler:
cc='gcc', ccflags ='-fno-strict-aliasing -I/usr/local/include
-I/pkg/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O',
cppflags='-fno-strict-aliasing -I/usr/local/include -I/pkg/include'
ccversion='', gccversion='2.95.2 19991024 (release)',
gccosandvers='solaris2.7'
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='gcc', ldflags ='-L/usr/local/lib -R/usr/local/lib '
libpth=/usr/local/lib /usr/lib /usr/ccs/lib /pkg/lib
libs=-lsocket -lnsl -lgdbm -ldl -lm -lc
perllibs=-lsocket -lnsl -ldl -lm -lc
libc=, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
cccdlflags='-fPIC -L/pkg/lib -R/pkg/lib -I/pkg/include', lddlflags='-G
-L/usr/local/lib -R/usr/local/lib -I/pkg/lib'
Characteristics of this binary (from libperl):
Compile-time options: USE_LARGE_FILES
Built under solaris
Compiled at Sep 23 2002 15:26:38
@INC:
/pkg/p/perl-5.8.0/lib/5.8.0/sun4-solaris
/pkg/p/perl-5.8.0/lib/5.8.0
/pkg/p/perl-5.8.0/lib/site_perl/5.8.0/sun4-solaris
/pkg/p/perl-5.8.0/lib/site_perl/5.8.0
/pkg/p/perl-5.8.0/lib/site_perl
.
--------------- code snippet:
use strict;
use Encode;

# Emit decoded characters as UTF-8 instead of raw wide characters.
binmode STDOUT, ':utf8';

my $utf8_out;
my @octet_in;
push @octet_in, chr($_) for (0x20 .. 0x7E, 0x80 .. 0xFF);

# Show that Encode functions are working for some vendor tables:
foreach my $table ( qw/cp1256 MacArabic/ ) {
    my @fail = ();
    my @succ = ();
    my @msgs = ();
    foreach ( @octet_in ) {
        # Work on a copy: decode() with a CHECK value may modify its source.
        my $char = $_;
        $utf8_out = eval { decode( $table, $char, Encode::FB_CROAK ) };
        if ( $@ ) {
            # Report the failing octet in hex so it stays readable.
            push @fail, sprintf( '0x%02X', ord($_) );
            push @msgs, $@;
        } else {
            push @succ, $utf8_out;
        }
    }
    print join( ' ', "decoding via $table succeeds on:",
                (@succ) ? @succ : "nothing" ), $/;
    print join( ' ', "decoding via $table fails on:",
                (@fail) ? @fail : "nothing" ), $/;
    print STDERR @msgs;
}