On Thu, 26 Oct 2000, Philip Newton wrote:

> On Wed, 25 Oct 2000, Mark Leisher wrote:
>
> > There may some day be a use for the Unicode codepoint 0x0000. It might
> > be better to make this 0xFFFF, which is a guaranteed non-character in
> > Unicode and probably in ISO10646.
>
> Isn't that the natural character to use for null-terminated strings? For
> example, if I'm processing UTF-8 text in C, "foo" is equivalent to 0066
> 006F 006F 0000. In which case, it's very much in use already.

Mark Leisher then replied:

> If the converted string contains 0xFFFF, it will be pretty clear the
> source text had bogus characters the moment you display it.
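To make the trade-off concrete, here is a minimal C sketch of my own (the
buffers are made up, and len16, repl_nul, and repl_ffff are invented names);
it shows a 16-bit string in which an unmappable character was replaced by
0x0000 in one copy and by 0xFFFF in the other:

#include <stdio.h>

/* Length of a 0x0000-terminated array of 16-bit units,
 * analogous to strlen() for ordinary C strings. */
static size_t len16(const unsigned short *s)
{
    size_t n = 0;
    while (s[n] != 0x0000)
        n++;
    return n;
}

int main(void)
{
    /* "fo_o": the '_' marks an unmappable character, replaced
     * by 0x0000 in one buffer and by 0xFFFF in the other. */
    unsigned short repl_nul[]  = { 0x0066, 0x006F, 0x0000, 0x006F, 0x0000 };
    unsigned short repl_ffff[] = { 0x0066, 0x006F, 0xFFFF, 0x006F, 0x0000 };

    printf("0x0000 replacement: length %zu\n", len16(repl_nul));  /* 2: tail silently lost */
    printf("0xFFFF replacement: length %zu\n", len16(repl_ffff)); /* 4: bogus spot survives */
    return 0;
}

With 0x0000 the tail of the string silently vanishes; with 0xFFFF the bogus
character stays in the string and is visible the moment you display it,
which is exactly Mark's point.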
According to Nick's translated doc, the first character on the third line
of the .enc file is the one to be displayed if the Encode module cannot
figure out what to do with a given character. In iso8859-1.enc we see:
# Encoding file: iso8859-1, single-byte
S
003F 0 1
00
which maps to '?' (0x3F being '?' in iso8859-1). In the last rendition of my
proposal for cp1047.enc I had left that line as is, whereas to be compatible
with iso8859-1.enc I ought to have written:
# Encoding file: cp1047, single-byte
S
006F 0 1
00
(0x6F is where '?' sits in these EBCDIC code pages), and similar headers
would be needed for cp37.enc and posix-bc.enc.
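As a rough illustration (a sketch of my own, not code from Encode or Tcl;
read_fallback is an invented helper), pulling the fallback character out of
such a header could look like this, assuming exactly the four-line layout
shown above:

#include <stdio.h>
#include <stdlib.h>

/* Read the fallback character from the third line of a
 * Tcl-style .enc file.  Returns -1 on error. */
static long read_fallback(const char *path)
{
    char line[256];
    int i;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;

    /* Read the first three lines; after this loop `line` holds
     * line 3 ("003F 0 1" in iso8859-1.enc).  Line 1 is the
     * "# Encoding file: ..." comment, line 2 the type ("S"). */
    for (i = 0; i < 3; i++) {
        if (fgets(line, sizeof line, fp) == NULL) {
            fclose(fp);
            return -1;
        }
    }
    fclose(fp);

    /* Line 3 begins with the fallback character in hex:
     * 003F ('?') for iso8859-1, 006F ('?') for cp1047. */
    return strtol(line, NULL, 16);
}

int main(void)
{
    long fb = read_fallback("iso8859-1.enc");
    if (fb >= 0)
        printf("fallback character: 0x%04lX\n", fb);
    return 0;
}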
Although I am quite hard pressed to find an example of a double-byte
character encoding that does make use of 0xFFFF, I do think that there
could be a problem with the syllogism: "Unicode(tm) guarantees that 0xFFFF
is not a character. All encodings can be mapped to Unicode(tm).
Therefore all coded character sets must reserve 0xFFFF as a
non-character." That only holds if Encode's .enc files are to be used
solely as to/from-Unicode maps, with Unicode as the intermediary coding.
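To spell out that "only if": when the tables are consulted purely to map to
or from Unicode, 0xFFFF is safe as a "no mapping here" sentinel on the
Unicode side, since no valid Unicode character can ever collide with it.
A sketch of mine (the table fragment is invented, not a real cp1047
mapping):

#include <stdio.h>

#define NO_MAP 0xFFFF  /* guaranteed non-character in Unicode */

/* Native byte -> Unicode table fragment; entries not listed
 * default to 0 here, though a real table would fill all 256 slots. */
static const unsigned short to_unicode[256] = {
    [0x81] = 0x0061,  /* maps to 'a' */
    [0x82] = NO_MAP,  /* no Unicode mapping for this byte */
};

int main(void)
{
    unsigned char byte = 0x82;
    if (to_unicode[byte] == NO_MAP)
        printf("byte 0x%02X has no mapping\n", byte);
    return 0;
}

But if the same trick were applied on the native side of a double-byte
table, any charset that really did assign a character to 0xFFFF would
collide with the sentinel, which is where the syllogism breaks down.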
Apparently(?) another problem with abandoning the (admittedly awkward)
specialness of 0x0000 for this purpose is that perl .enc files would then
become incompatible with Tcl .enc files.
Peter Prymmer