Out of the documentation that Nick sent, the following three paragraphs need
changing (reasons below each paragraph):
The third line of the file is three numbers. The first number is the
fallback character (in base 16) to use when converting from Unicode to this
encoding. The second number is a B<1> if this file represents the encoding
for a symbol font, or B<0> otherwise. The last number (in base 10) is how
many pages of data follow.
"UTF-8" was changed to "Unicode" because the Unicode text being converted back
to this encoding can just as easily be encoded as UCS-2, UTF-16, UTF-32, or
even Java character constant form, \uXXXX.
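As a rough illustration of the third-line layout described above, here is a minimal Python sketch. The names EncHeader and parse_header and the sample line "003F 0 1" are made up for this example and are not taken from any real .enc file or library.

```python
from typing import NamedTuple


class EncHeader(NamedTuple):
    fallback: int      # fallback character, written in base 16 in the file
    is_symbol: bool    # True when the file describes a symbol font
    page_count: int    # number of pages that follow, written in base 10


def parse_header(line: str) -> EncHeader:
    """Parse the third line of a .enc file: '<hex> <0|1> <decimal>'."""
    fallback_hex, symbol_flag, pages = line.split()
    if symbol_flag not in ("0", "1"):
        raise ValueError(f"symbol flag must be 0 or 1, got {symbol_flag!r}")
    return EncHeader(int(fallback_hex, 16), symbol_flag == "1", int(pages, 10))


# Hypothetical header line, used only to exercise the parser.
header = parse_header("003F 0 1")
```

With this sample line, header.fallback is 0x3F, header.is_symbol is False, and header.page_count is 1.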
Subsequent lines in the example above are pages that describe how to map
from the encoding into double-byte Unicode (UCS-2). The first line in a
page identifies the page number. Following it are 256 double-byte numbers,
arranged as 16 rows of 16 numbers. Given a character in the encoding, the
high byte of that character is used to select which page, and the low byte
of that character is used as an index to select one of the double-byte
numbers in that page - the value obtained being the corresponding Unicode
character. By examination of the example above, one can see that the
characters 0x7E and 0x8163 in B<shiftjis> map to 203E and 2026 in Unicode,
respectively.
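The page lookup described in that paragraph can be sketched in a few lines of Python. The function name to_unicode and the sparse pages dictionary are assumptions made for illustration; only the two mappings quoted above (0x7E and 0x8163) come from the documentation.

```python
def to_unicode(pages: dict[int, list[int]], char: int) -> int:
    """Map a one- or two-byte character in the encoding to a UCS-2 value."""
    page = pages[char >> 8]      # high byte selects the page
    return page[char & 0xFF]     # low byte indexes into its 256 entries


# Two hypothetical pages that hold only the entries quoted above;
# every other slot is left at 0000.
pages = {0x00: [0] * 256, 0x81: [0] * 256}
pages[0x00][0x7E] = 0x203E
pages[0x81][0x63] = 0x2026

print(f"{to_unicode(pages, 0x7E):04X}")    # 203E
print(f"{to_unicode(pages, 0x8163):04X}")  # 2026
```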
Because the .enc representation allows only a single two-byte Unicode/ISO10646
character code per entry, it is implicitly UCS-2, which should be stated.
UTF-16 would require one or two two-byte code units per character, and UTF-32
would require one four-byte ISO10646 character code (or one three-byte Unicode
character code if you want to get technical).
Note that there is a distinction between ISO10646 and Unicode made here, but
the curious can be pointed at the Unicode and ISO10646 docs if they care about
it.
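To make the code-unit point concrete, the following Python sketch counts how many UTF-16 and UTF-32 units a character needs. The helper name code_units is invented for this example, and the two sample characters are my own choice, not part of the documentation.

```python
def code_units(text: str) -> dict[str, int]:
    """Count the code units needed to store text in UTF-16 and UTF-32."""
    return {
        "utf-16": len(text.encode("utf-16-be")) // 2,  # two-byte units
        "utf-32": len(text.encode("utf-32-be")) // 4,  # four-byte units
    }


bmp = code_units("\u2026")         # U+2026, inside the two-byte range
beyond = code_units("\U0001D11E")  # U+1D11E, outside the two-byte range

# A single two-byte slot in a page cannot hold the second character.
fits_in_enc_slot = ord("\U0001D11E") <= 0xFFFF
```

Here bmp needs one unit in either form, while beyond needs two UTF-16 units, so fits_in_enc_slot is False: a page entry of one two-byte number has no room for it.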
Following the first page will be all the other pages, each in the same
format as the first: one number identifying the page followed by 256
double-byte Unicode (UCS-2) characters. If a character in the encoding maps
to the Unicode character 0000, it means that the character doesn't actually
exist. If all characters on a page would map to 0000, that page can be
omitted.
Again, UCS-2 is implicit by the restriction of 256 two-byte values and should
be stated as such.
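A page in this format could be read roughly as follows. This is a sketch under assumptions: the names parse_page and is_omittable are hypothetical, the page number is assumed to be written in base 16, and each value is assumed to be four hex digits, with or without whitespace between values.

```python
def parse_page(lines: list[str]) -> tuple[int, list[int]]:
    """Parse one page: a page-number line followed by 16 rows of 16 values."""
    page_number = int(lines[0], 16)  # assumption: page number is in base 16
    # Drop any whitespace, then cut the digits into four-character values.
    digits = "".join("".join(row.split()) for row in lines[1:17])
    values = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    if len(values) != 256:
        raise ValueError(f"expected 256 values, got {len(values)}")
    return page_number, values


def is_omittable(values: list[int]) -> bool:
    """A page whose entries are all 0000 carries no mappings."""
    return not any(values)


# Hypothetical page 81 with a single mapping at low byte 0x63.
rows = [["0000"] * 16 for _ in range(16)]
rows[0x6][0x3] = "2026"
page_lines = ["81"] + [" ".join(row) for row in rows]
number, values = parse_page(page_lines)
```

Row 6, column 3 corresponds to low byte 0x63, so values[0x63] is 0x2026 and the page is not omittable; a page of 256 zeros would be.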
A couple of final notes:
1. The Unicode Consortium has deprecated UCS-2 in favor of UTF-16.
2. The syntax of the .enc files should be modified to include an "unknown"
mapping other than 0x0000 when converting from some encoding to Unicode.
The two most obvious options are to add another value to the third line for
unknown characters in the source text or change the 0x0000's to 0xFFFF.
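The second option in note 2 might look something like the Python sketch below. Everything here is hypothetical: the names UNMAPPED and decode_char, the sample page, and the choice of 0xFFFD as the substitute returned to the caller are assumptions of mine, not part of the .enc format. One possible benefit, on my reading, is that a character that genuinely maps to 0000 would no longer be confused with a missing one.

```python
UNMAPPED = 0xFFFF  # proposed marker; U+FFFF is a Unicode noncharacter


def decode_char(page: list[int], low_byte: int, unknown: int = 0xFFFD) -> int:
    """Return the mapped value, or `unknown` when the slot is marked unmapped."""
    value = page[low_byte]
    return unknown if value == UNMAPPED else value


# Hypothetical page: everything unmapped except two entries.
page = [UNMAPPED] * 256
page[0x00] = 0x0000   # maps to 0000 without being read as "missing"
page[0x41] = 0x0041
```

With this page, decode_char(page, 0x00) returns 0, while decode_char(page, 0x42) returns the substitute value 0xFFFD.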
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

Cinema, radio, television, magazines are a school of inattention: people look
without seeing, listen without hearing.
  -- Robert Bresson