perl-unicode

digits in iso-8859-6 to utf8 conversion

2003-05-20 15:30:06

When I use the Encode module that ships with Perl 5.8.0 for this sort
of operation:

   $utf8str = decode( 'iso-8859-6', $octets );

I find that ASCII digits in $octets ([0-9]) end up being converted to 
Arabic-Indic digits in $utf8str ([\x{0660}-\x{0669}]).
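
For anyone who wants to check what their own installation does, here is
a minimal diagnostic sketch. It only uses decode() from Encode; the
variable names are mine, and the output depends on the Encode version
installed, so I have not listed an expected result.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes 0x30..0x39 are the ASCII digits '0'..'9' in the ISO-8859-6 input.
my $octets = join '', map { chr($_) } 0x30 .. 0x39;
my $str    = decode('iso-8859-6', $octets);

# Collect the code point that each input byte was decoded to.
my @codepoints = map { sprintf 'U+%04X', ord($_) } split //, $str;
print "@codepoints\n";
```

If the digits are left alone, the line printed runs from U+0030 to
U+0039; anything else shows the remapping described above.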

Why should this be?  I don't believe readers of the language need it --
I've been told by native speakers that all readers of Arabic are
familiar with the ASCII digits and use them comfortably.  (I'd be
interested if anyone has evidence to the contrary.)

Moreover, it seems counter-productive to recode all the digits this way
when transcoding things like (X|HT)ML, "flat table" text files, etc.,
where a wide range of applications would expect to do string matches
and arithmetic on ASCII digits, but not on Arabic-Indic digits.

Three different work-arounds I can think of are:

 - define a "special" Encode module, e.g. "iso-8859-6-nd", that leaves
the ASCII digits alone (I'll do this anyway, because I want to learn the
process, but I'd want my special module to always override the default
iso-8859-6; how would I do that?)

 - be very scrupulous about using "decode('iso-8859-6', ...)", making
sure that I never pass it strings that contain digits (this seems like a
complicated and error-prone approach)

 - use the default module in the normal, simple way, then convert all
the Arabic-Indic digits back to their ASCII equivalents (silly and
wasteful).
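
The third work-around can be sketched roughly as follows. The helper
name decode_8859_6_ascii_digits is hypothetical (it is not part of
Encode), and I am assuming the remapped digits land in either the
Arabic-Indic block (U+0660..U+0669) or the Extended Arabic-Indic block
(U+06F0..U+06F9), so the tr/// covers both.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode ISO-8859-6 octets with the stock module,
# then fold any Arabic-Indic or Extended Arabic-Indic digits back to
# the ASCII digits '0'..'9'.
sub decode_8859_6_ascii_digits {
    my ($octets) = @_;
    my $str = decode('iso-8859-6', $octets);
    $str =~ tr/\x{0660}-\x{0669}\x{06F0}-\x{06F9}/0-90-9/;
    return $str;
}

# 0xC7 is ARABIC LETTER ALEF in ISO-8859-6; the digits follow a space.
my $utf8str = decode_8859_6_ascii_digits("\xC7 123");
```

Since ISO-8859-6 has no code points of its own for Arabic-Indic digits,
any such digit in the decoded string must have come from an ASCII digit
byte, so folding them back should not lose information for this
encoding.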

In general, though, when a non-Unicode character set contains ASCII code
points, is there _ever_ a good reason for an Encode module to replace
those ASCII codes with non-ASCII (multi-byte in UTF-8) counterparts?
IMO, this sort of remapping is unmotivated, and just obfuscates the data.

        Dave G.