perl-unicode

Re: digits in iso-8859-6 to utf8 conversion

2003-05-23 10:30:05

Thanks for the amazingly quick and thorough response.

While waiting for my local sys-admins to catch up with the new Encode
distribution, I installed my own local mapping table for iso-8859-6
(which was remarkably simple to do -- tremendous praise for the folks
who developed and documented enc2xs).

In the process, I chose to set up the digit-related entries in my
"iso-8859-6-nd.ucm" file as follows ("|0" marks a round-trip mapping,
"|1" a fallback that is used only when encoding from Unicode):

...
<U0030> \x30 |0 #       DIGIT ZERO
<U0031> \x31 |0 #       DIGIT ONE
<U0032> \x32 |0 #       DIGIT TWO
<U0033> \x33 |0 #       DIGIT THREE
<U0034> \x34 |0 #       DIGIT FOUR
<U0035> \x35 |0 #       DIGIT FIVE
<U0036> \x36 |0 #       DIGIT SIX
<U0037> \x37 |0 #       DIGIT SEVEN
<U0038> \x38 |0 #       DIGIT EIGHT
<U0039> \x39 |0 #       DIGIT NINE
...
<U0660> \x30 |1 #  ARABIC-INDIC DIGIT ZERO
<U0661> \x31 |1 #  ARABIC-INDIC DIGIT ONE
<U0662> \x32 |1 #  ARABIC-INDIC DIGIT TWO
<U0663> \x33 |1 #  ARABIC-INDIC DIGIT THREE
<U0664> \x34 |1 #  ARABIC-INDIC DIGIT FOUR
<U0665> \x35 |1 #  ARABIC-INDIC DIGIT FIVE
<U0666> \x36 |1 #  ARABIC-INDIC DIGIT SIX
<U0667> \x37 |1 #  ARABIC-INDIC DIGIT SEVEN
<U0668> \x38 |1 #  ARABIC-INDIC DIGIT EIGHT
<U0669> \x39 |1 #  ARABIC-INDIC DIGIT NINE
<U06F0> \x30 |1 #  EXTENDED ARABIC-INDIC DIGIT ZERO
<U06F1> \x31 |1 #  EXTENDED ARABIC-INDIC DIGIT ONE
<U06F2> \x32 |1 #  EXTENDED ARABIC-INDIC DIGIT TWO
<U06F3> \x33 |1 #  EXTENDED ARABIC-INDIC DIGIT THREE
<U06F4> \x34 |1 #  EXTENDED ARABIC-INDIC DIGIT FOUR
<U06F5> \x35 |1 #  EXTENDED ARABIC-INDIC DIGIT FIVE
<U06F6> \x36 |1 #  EXTENDED ARABIC-INDIC DIGIT SIX
<U06F7> \x37 |1 #  EXTENDED ARABIC-INDIC DIGIT SEVEN
<U06F8> \x38 |1 #  EXTENDED ARABIC-INDIC DIGIT EIGHT
<U06F9> \x39 |1 #  EXTENDED ARABIC-INDIC DIGIT NINE

The point here is that when Arabic text in Unicode happens to contain
Arabic-Indic digit characters, and we want to convert to iso-8859-6, it
would seem a good idea for these non-ASCII digit characters to be
translated into their ASCII equivalents, rather than being treated as
unmappable (replaced by "?", or raising an error if encode's CHECK
argument is set to do that).
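
For anyone who cannot install a custom table, roughly the same effect
can be had by folding the digits in plain Perl before calling encode().
The following is only a sketch: "fold_arabic_digits" is a name made up
for illustration (it is not part of Encode), and it assumes an Encode
release whose iso-8859-6 table maps the ASCII digits to \x30-\x39.

```perl
use strict;
use warnings;
use Encode qw(encode);

# fold_arabic_digits is a hypothetical helper, not part of Encode.
# It maps U+0660..U+0669 and U+06F0..U+06F9 onto ASCII 0..9 and
# leaves every other character untouched.
sub fold_arabic_digits {
    my ($text) = @_;
    $text =~ tr/\x{0660}-\x{0669}/0-9/;    # ARABIC-INDIC DIGIT ZERO..NINE
    $text =~ tr/\x{06F0}-\x{06F9}/0-9/;    # EXTENDED ARABIC-INDIC DIGIT ZERO..NINE
    return $text;
}

my $unicode = "\x{0661}\x{0662}\x{06F3}";
my $bytes   = encode('iso-8859-6', fold_arabic_digits($unicode));
print "$bytes\n";    # prints 123
```

Unlike the "|1" entries in the .ucm file, this has to be called
explicitly wherever text is encoded, which is part of why having it
available in the mapping table itself seems attractive.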

Some people might object to this being a "default" behavior for encoding
into 8859-6, but if it were available as an alternative, I think a lot
of people could find it useful.  (Personally, I'd vote for this to be
the default behavior.)

        Dave G.