perl-unicode

.enc docs comments [was Re: Encode's .enc files and a question]

2000-10-26 11:35:00
Out of the documentation that Nick sent, the following three paragraphs need
changing (reasons below each paragraph):

  The third line of the file is three numbers.  The first number is the
  fallback character (in base 16) to use when converting from Unicode to this
  encoding.  The second number is a B<1> if this file represents the encoding
  for a symbol font, or B<0> otherwise.  The last number (in base 10) is how
  many pages of data follow.

"UTF-8" was changed to "Unicode" because the Unicode text being converted back
to this encoding can just as easily be encoded as UCS-2, UTF-16, UTF-32, or
even Java character constant form, \uXXXX.
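
For anyone who wants to check their reading of that third line, here is a
minimal Python sketch of parsing it.  The function name, the dictionary keys,
and the sample line "003F 0 1" are made up for illustration; only the meaning
of the three fields comes from the paragraph above.

```python
def parse_third_line(line):
    """Split the three whitespace-separated fields of the third line."""
    fallback_hex, symbol_flag, page_count = line.split()
    return {
        "fallback": int(fallback_hex, 16),  # fallback character, base 16
        "is_symbol": symbol_flag == "1",    # 1 = symbol font, 0 = otherwise
        "pages": int(page_count, 10),       # number of pages, base 10
    }


# Hypothetical sample line, not taken from a real .enc file.
info = parse_third_line("003F 0 1")
```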

  Subsequent lines in the example above are pages that describe how to map
  from the encoding into double-byte Unicode (UCS-2).  The first line in a
  page identifies the page number.  Following it are 256 double-byte numbers,
  arranged as 16 rows of 16 numbers.  Given a character in the encoding, the
  high byte of that character is used to select which page, and the low byte
  of that character is used as an index to select one of the double-byte
  numbers in that page - the value obtained being the corresponding Unicode
  character.  By examination of the example above, one can see that the
  characters 0x7E and 0x8163 in B<shiftjis> map to 203E and 2026 in Unicode,
  respectively.
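
The high-byte/low-byte lookup described in that paragraph can be sketched in
Python as follows.  This assumes the pages have already been read into a
dictionary keyed by page number; the names `pages` and `to_unicode` are
hypothetical, and the two pages are filled in only with the two B<shiftjis>
values quoted above.

```python
def to_unicode(pages, code):
    """Map a character code in the encoding to its UCS-2 value."""
    page = pages[code >> 8]      # high byte selects the page
    return page[code & 0xFF]     # low byte indexes into the 256 entries


# Two sparse pages holding only the values quoted in the paragraph.
page_00 = [0x0000] * 256
page_00[0x7E] = 0x203E
page_81 = [0x0000] * 256
page_81[0x63] = 0x2026
pages = {0x00: page_00, 0x81: page_81}
```

With these pages, to_unicode(pages, 0x7E) selects page 0x00 and entry 0x7E,
and to_unicode(pages, 0x8163) selects page 0x81 and entry 0x63.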

Because the .enc representation allows only a single two-byte Unicode/ISO10646
character code per entry, it is implicitly UCS-2, and the text should say so.
UTF-16 would require one or two two-byte Unicode character codes, and UTF-32
would require one four-byte ISO10646 character code (or one three-byte Unicode
character code if you want to get technical).

Note that I am making a distinction between ISO10646 and Unicode here; the
curious can be pointed at the Unicode and ISO10646 docs if they care about it.
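
To make the size argument concrete, here is a small Python check of how many
two-byte units UTF-16 needs and how many bytes UTF-32 needs per character.
The helper names are made up; the codecs are Python's standard ones.  A
character outside the Basic Multilingual Plane needs two 16-bit units in
UTF-16, which cannot fit in a single two-byte .enc entry.

```python
def utf16_units(ch):
    """Number of two-byte code units needed for ch in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2


def utf32_bytes(ch):
    """Number of bytes needed for ch in UTF-32."""
    return len(ch.encode("utf-32-be"))


bmp_char = "\u2026"             # fits in one two-byte unit
supplementary = "\U00010000"    # first code point beyond U+FFFF
```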

  Following the first page will be all the other pages, each in the same
  format as the first: one number identifying the page followed by 256
  double-byte Unicode (UCS-2) characters.  If a character in the encoding maps
  to the Unicode character 0000, it means that the character doesn't actually
  exist.  If all characters on a page would map to 0000, that page can be
  omitted.

Again, UCS-2 is implied by the restriction to 256 two-byte values per page and
should be stated explicitly.
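
A rough Python sketch of reading one page and of honoring the 0000 and
omitted-page rules follows.  Several things here are assumptions, not part of
the documentation: that the page number is written in hexadecimal, that each
value is four hex digits, and that whitespace inside a row can be ignored.
The names `parse_page` and `decode` are hypothetical.

```python
def parse_page(lines):
    """Parse one page: a page-number line followed by 16 rows of values."""
    page_number = int(lines[0], 16)          # assumed to be hexadecimal
    # Join the 16 rows, dropping any whitespace, then cut into 4-digit values.
    digits = "".join("".join(row.split()) for row in lines[1:17])
    values = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    if len(values) != 256:
        raise ValueError("a page must hold exactly 256 values")
    return page_number, values


def decode(pages, code):
    """Return the UCS-2 value for code, or None if it does not exist."""
    page = pages.get(code >> 8)              # an omitted page is all 0000
    value = page[code & 0xFF] if page is not None else 0x0000
    return None if value == 0x0000 else value


# Hypothetical page 00 with a single mapped entry at index 0x7E.
rows = ["0000" * 16 for _ in range(16)]
rows[7] = "0000" * 14 + "203E" + "0000"      # row 7, column 14 is 0x7E
number, values = parse_page(["00"] + rows)
pages = {number: values}
```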

A couple of final notes:

1. The Unicode Consortium has deprecated UCS-2 in favor of UTF-16.

2. The syntax of the .enc files should be modified to include an "unknown"
   mapping other than 0x0000 when converting from some encoding to Unicode,
   since 0x0000 is itself a valid character (NUL).  The two most obvious
   options are to add another value to the third line for unknown characters
   in the source text, or to change the 0x0000's to 0xFFFF.
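
As an illustration of the second option only, here is a Python sketch in which
0xFFFF (a Unicode noncharacter) marks an unmapped entry, so that a real
mapping to 0x0000 is no longer ambiguous.  Nothing here reflects the current
.enc format; the names `UNMAPPED` and `decode_with_sentinel` are made up.

```python
UNMAPPED = 0xFFFF   # proposed "unknown" marker instead of 0x0000


def decode_with_sentinel(page, index, unmapped=UNMAPPED):
    """Return the mapped value, or None if the entry is marked unknown."""
    value = page[index]
    return None if value == unmapped else value


# Hypothetical page: everything unknown except NUL and 'A'.
page = [UNMAPPED] * 256
page[0x00] = 0x0000   # NUL can now map to U+0000 without being "missing"
page[0x41] = 0x0041
```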
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            Cinema, radio, television, magazines are a
New Mexico State University       school of inattention: people look without
Box 30001, Dept. 3CRL             seeing, listen without hearing.
Las Cruces, NM  88003                            -- Robert Bresson