perl-unicode

Re: .enc docs comments [was Re: Encode's .enc files and a question]

2000-10-26 12:31:21


On Thu, 26 Oct 2000, Mark Leisher wrote:

[fine suggestions snipped]

Again, UCS-2 is implied by the restriction to 256 two-byte values and should
be stated as such.

Uncomfortable, to say the least.  Could a surrogate scalar encoding be
done as an escaped encoding where the high and low halves of each pair are
put into the .enc files as HHHHLLLL, where both H and L =~ /[0-9A-F]/?
That would necessitate a shift to reading 8 characters (possibly
implemented using the "E" mechanism?).
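
To make the HHHHLLLL idea concrete, here is a rough sketch in Python of how
such an 8-character cell might be read and written.  This is only an
illustration of the surrogate arithmetic, not anything Encode or Tcl
actually does; the names PAIR, decode_pair and encode_pair are made up, and
the cell layout (high half first, uppercase hex only) is an assumption.

```python
import re

# Assumed cell layout: four hex digits of high surrogate, then four of low.
PAIR = re.compile(r"([0-9A-F]{4})([0-9A-F]{4})")

def decode_pair(cell):
    """Turn an 'HHHHLLLL' cell into the scalar value it stands for."""
    m = PAIR.fullmatch(cell)
    if m is None:
        raise ValueError("expected 8 uppercase hex digits: %r" % (cell,))
    high, low = int(m.group(1), 16), int(m.group(2), 16)
    # Reject cells whose halves are not in the surrogate ranges.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair: %r" % (cell,))
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def encode_pair(scalar):
    """Inverse: write a scalar above 0xFFFF as an 'HHHHLLLL' cell."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("scalar outside the surrogate range")
    v = scalar - 0x10000
    return "%04X%04X" % (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))
```

With these definitions, decode_pair("D801DC00") gives 0x10400 and
encode_pair(0x10400) gives the string "D801DC00" back.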

A couple of final notes:

1. The Unicode Consortium has deprecated UCS-2 in favor of UTF-16.

Yes, and the Mormon Deseret alphabet will be in the next release of Unicode
(I don't know whether that will be called 3.0 or 3.1).  Also on the
approval track (though not necessarily going to be in the next release)
are more than 4000 additional Han ideographs, Klingon, and Ancient Egyptian
hieroglyphs (more than 7300 characters).  All of these scripts will be
implemented via the surrogate expansion mechanism.

2. The syntax of the .enc files should be modified to include an "unknown"
   mapping other than 0x0000 when converting from some encoding to Unicode.
   The two most obvious options are to add another value to the third line for
   unknown characters in the source text or change the 0x0000's to 0xFFFF.

Or one could use the source text "GGGG" to *really* indicate a "hex" value
that does not map to a character (I am being more than a little facetious
here ;-).
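
More seriously, the trouble with 0x0000 as the "unknown" marker is that it
is also the perfectly good mapping for NUL.  A small Python sketch of the
difference follows; parse_row and its unknown parameter are hypothetical,
and the row layout (4-hex-digit cells run together with no separator) is my
assumption about the format, not a statement of what Tcl specifies.

```python
def parse_row(row, unknown=0xFFFF):
    """Split a row of concatenated 4-hex-digit cells into code points.

    Cells equal to the `unknown` sentinel are returned as None, meaning
    "no mapping".  Both the sentinel and the row layout are assumptions.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    cells = [int(row[i:i + 4], 16) for i in range(0, len(row), 4)]
    return [None if c == unknown else c for c in cells]
```

Given the row "00000041FFFF", a sentinel of 0xFFFF yields [0, 0x41, None],
so NUL survives; with unknown=0x0000 the same row yields
[None, 0x41, 0xFFFF] and a genuine NUL mapping can no longer be told apart
from "unmapped".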

How firmly established is the Tcl scheme?  Is it still being hammered out?
I do think that it would be nice to avoid yet another gratuitous file
format incompatibility if possible.  So how do the Tcl folks plan to
handle surrogates or truly unrecognized characters?

Peter Prymmer