perl-unicode

Re: Unicode::Map

1999-11-16 22:35:00
On 16 Nov, Martin Schwartz wrote:
-} James,
-} 
-} > I am very interested in all aspects of Perl Unicode support, and
-} > am willing to take over your module.
-} 
-} Great!
-} 
-} > Any other volunteers?
-} 

I'd be glad to collaborate, but don't want to take the lead on it.
I seem to be included in this august body of correspondents as a
result of mailing Martin about the potential for documentation
of the tables in Unicode::Map (rather than digging it out myself,
shudder), so that I could at least compare them with the rules
being used by Tcl.

I'm VERY interested in preserving Tcl<=>Perl compatibility where
possible, and of course Tcl 8.2 has internal Unicode processing,
with a series of translation tables for other encodings (tables
that are incompatible with Unicode::Map).  Tcl 8.x includes these
native commands:

        encoding convertfrom ?encoding? data
           (converts based on bundled .enc tables FROM specified
            or default encoding TO Unicode)
        encoding convertto ?encoding? string
           (converts based on same bundled .enc tables FROM Unicode
            to specified (or default) encoding).
        encoding names
           (returns a list of known .enc tables)
        encoding system ?encoding?
           (sets/returns the "default" encoding).

     Example:
        set s [encoding convertfrom euc-jp "\xA4\xCF"]
        returns "\u306F" (Hiragana HA) in $s, after consulting
        a built-in table named "euc-jp.enc".

     Encoding tables handle 8-bit encodings, escaped encodings
     (with inclusion by reference of other tables when needed)
     and multi-byte encodings.  The table files themselves are
     plain ASCII text.
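
The euc-jp example above can be cross-checked outside Tcl.  What follows
is only a minimal sketch in Python, on the assumption that Python's
bundled "euc_jp" codec agrees with Tcl's euc-jp.enc table for this
character; it says nothing about how either implementation stores its
tables.

```python
# Cross-check of the euc-jp example.
# Assumption: Python's "euc_jp" codec stands in for Tcl's euc-jp.enc table.
raw = b"\xA4\xCF"

decoded = raw.decode("euc_jp")      # analogous to: encoding convertfrom euc-jp
encoded = decoded.encode("euc_jp")  # analogous to: encoding convertto euc-jp

print(f"U+{ord(decoded):04X}")      # U+306F (Hiragana HA)
print(encoded == raw)               # True: the conversion round-trips
```

A round trip of this kind is the simplest way to compare two sets of
tables: any byte sequence where the results differ marks a place
where the tables disagree.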

I'm still not sure that FreeType is fully handling some of the
TTF Unicode tables.  I know that ttf2bdf for platform 3, encoding 1
(Windows/Unicode) is generating some grungy fonts at times,
which has made me cautious about installing a FreeType-enabled
font server.  I'm also not sure about Mac/Unicode in-font tables,
but most Fontographer fonts floating around don't have them.

Also, with some (e.g. Indic) languages, a common entry method
is romanized syllables.  This type of input should also be
parsable for autoconversion to Unicode, just as it already is
for the many encodings supported by "itrans" and its successor
(in development), "iscript".

At the same time, TSCII is growing pretty fast in Malaysia, but "TAB"
encoding has been endorsed by the Malaysian government as the standard
Tamil encoding - RATHER than the assigned Unicode block.  I haven't
looked (yet) to see whether this is merely a truncation to the
low-order byte of the Unicode code point or something else.
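
To make the "truncation to the low-order byte" question concrete, here
is a minimal sketch in Python of how such a check might look.  The
function names and both sample tables are hypothetical placeholders;
neither table is real TAB or TSCII data, and an actual byte-to-code-point
table would have to be supplied before this says anything about TAB.

```python
# The Unicode Tamil block is U+0B80..U+0BFF.
TAMIL_BLOCK = range(0x0B80, 0x0C00)

def low_byte(code_point):
    """Return the low-order byte of a code point."""
    return code_point & 0xFF

def is_low_byte_truncation(table):
    """True if every (byte -> code point) pair maps a byte to the
    Tamil-block code point that has that byte as its low-order byte."""
    return all(cp in TAMIL_BLOCK and low_byte(cp) == byte
               for byte, cp in table.items())

# Hypothetical placeholder tables, NOT real encoding data:
truncated = {low_byte(cp): cp for cp in TAMIL_BLOCK}  # built by truncation
shifted = {0xAB: 0x0B85}                              # deliberately different

print(hex(low_byte(0x0B85)))              # 0x85 (TAMIL LETTER A, truncated)
print(is_low_byte_truncation(truncated))  # True
print(is_low_byte_truncation(shifted))    # False
```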

So to do multilingual processing, some integration of these
encodings needs to be possible, and Perl and Tcl are the best
candidates for it.

Comparing a charmap of

    -freetype-jagran-medium-r-normal--19-140-100-100-p-83-macroman-0

generated from jagran.ttf's Platform-1 table, with 

    -*-devnagari-medium-o-*-*-*-120-*-*-*-*-*-fontspecific
             (note BDF was BUILT with splats in the name!) 
which is distributed with dvedit, there is an enormous overlap in the
glyphs, but no relationship I can see in the encodings.  Yet as I
recall, the Jagran encoding is being used in some Indic on-line
newspapers at the moment.

        Bruce Gingery   <bgingery(_at_)gtcs(_dot_)com>


