perl-unicode

Re: Encode from XS

2003-08-08 12:30:10

simon(_at_)simon-cozens(_dot_)org said:
Can someone give me a few quick examples of creating Encode::XS
objects to do simple transcoding, from XS? 

Have you read the "enc2xs" man page that comes with perl 5.8?

I've used it myself (having never done this before), and potentially 
the part that takes longest is preparing the code-point mapping table 
that enc2xs uses as input to create a character-set module for Encode.

In my case, I noticed that the 5.8.0 release installed on our local
server had an "iso-8859-6" (Arabic) module that converted ASCII digits
into Arabic-Indic digits when converting to Unicode.  Rather than do
extra scripting every time I use this module (in order to undo the
digit conversion), I created an alternate version, "iso-8859-6-nd".
Not only does this version leave ASCII digits alone when converting
from 8859-6 to Unicode, but when I convert Unicode back to 8859-6, any
Arabic-Indic digits in the Unicode data are converted to ASCII.
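
For comparison, the "extra scripting" that the alternate module avoids
might look roughly like the sketch below.  This is only an illustration,
not code from my actual scripts: the helper name
decode_keep_ascii_digits is made up for this example, and it assumes
nothing beyond the stock "iso-8859-6" encoding that ships with Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode ISO-8859-6 bytes with the stock module,
# then fold any Arabic-Indic digits (U+0660..U+0669) back to ASCII.
sub decode_keep_ascii_digits {
    my ($bytes) = @_;
    my $text = decode('iso-8859-6', $bytes);
    $text =~ tr/\x{0660}-\x{0669}/0-9/;
    return $text;
}

# 0xC7 is ARABIC LETTER ALEF (U+0627) in ISO-8859-6.
my $text = decode_keep_ascii_digits("\xC7 2003");

# Show the resulting code points so the digits can be inspected.
printf "U+%04X ", ord($_) for split //, $text;
print "\n";
```

The tr/// step is harmless if the installed "iso-8859-6" module already
leaves digits alone, so the helper should behave the same either way.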

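First I needed a ucm file, which was simply the Unicode/iso-8859-6
character map with an extra data column, where I set the digit
character correspondences the way I wanted (and left everything else
as-is).  In ucm notation, |0 marks a round-trip mapping and |1 marks a
fallback that is used only when converting from Unicode to the
encoding: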

#
# iso-8859-6-nd.ucm : Unicode Character Map for 8-bit Arabic 
#
# This version differs from the iso-8859-6.ucm provided with the
# standard Perl-5.8 Encode module by virtue of the way it treats
# digit characters.  8859-6 does not include Arabic-Indic digits
# and instead uses ASCII digits for all numeric strings; the
# standard Encode module for this character set maps all ASCII
# digits to Arabic-Indic numerals (\x{0660} - \x{0669}).  The
# following table leaves all ASCII digits unmodified.
#
<code_set_name> "iso-8859-6-nd"
<code_set_alias> "iso-arabic"
<mb_cur_min> 1
<mb_cur_max> 1
<subchar> \x3f
#
CHARMAP
...
<U0030> \x30 |0 #       DIGIT ZERO
<U0031> \x31 |0 #       DIGIT ONE
<U0032> \x32 |0 #       DIGIT TWO
<U0033> \x33 |0 #       DIGIT THREE
<U0034> \x34 |0 #       DIGIT FOUR
<U0035> \x35 |0 #       DIGIT FIVE
<U0036> \x36 |0 #       DIGIT SIX
<U0037> \x37 |0 #       DIGIT SEVEN
<U0038> \x38 |0 #       DIGIT EIGHT
<U0039> \x39 |0 #       DIGIT NINE
...
<U0660> \x30 |1 #  ARABIC-INDIC DIGIT ZERO
<U0661> \x31 |1 #  ARABIC-INDIC DIGIT ONE
<U0662> \x32 |1 #  ARABIC-INDIC DIGIT TWO
<U0663> \x33 |1 #  ARABIC-INDIC DIGIT THREE
<U0664> \x34 |1 #  ARABIC-INDIC DIGIT FOUR
<U0665> \x35 |1 #  ARABIC-INDIC DIGIT FIVE
<U0666> \x36 |1 #  ARABIC-INDIC DIGIT SIX
<U0667> \x37 |1 #  ARABIC-INDIC DIGIT SEVEN
<U0668> \x38 |1 #  ARABIC-INDIC DIGIT EIGHT
<U0669> \x39 |1 #  ARABIC-INDIC DIGIT NINE

The enc2xs docs explain how to set up this file, and then give a simple
cook-book sequence of operations to process it and produce a module that
you install into your "@INC" path (or into some path you can address
with "-I/path").
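
Once the module is built and installed, a quick check of the digit
handling could look something like the sketch below.  The module name
Encode::Arabic_nd is purely hypothetical (substitute whatever name you
gave to enc2xs), and the sub name roundtrip_nd is invented for this
example; the script simply reports that the encoding is missing if the
module cannot be loaded.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding);

# Hypothetical module name; use the name you passed to enc2xs.
my $module = 'Encode::Arabic_nd';

# Returns undef if "iso-8859-6-nd" is unavailable, otherwise a hash
# reference holding one decoded string and one encoded byte string.
sub roundtrip_nd {
    # Loading the generated module is what registers its encodings.
    my $loaded = eval "require $module; 1";
    return undef unless $loaded && find_encoding('iso-8859-6-nd');

    return {
        # ASCII digits in 8859-6 input should stay ASCII in Unicode.
        decoded => decode('iso-8859-6-nd', "\xC7 2003"),
        # Arabic-Indic digits should fall back to ASCII digit bytes.
        encoded => encode('iso-8859-6-nd',
                          "\x{0627} \x{0662}\x{0660}\x{0660}\x{0663}"),
    };
}

my $result = roundtrip_nd();
if ($result) {
    printf "decoded: %s\n",
        join ' ', map { sprintf 'U+%04X', ord } split //, $result->{decoded};
    printf "encoded: %s\n",
        join ' ', map { sprintf '%02X', ord } split //, $result->{encoded};
} else {
    print "iso-8859-6-nd is not available; build and install it first\n";
}
```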

Hope that helps.

        Dave Graff

