ietf-822

[no subject]

1993-02-04 09:12:10
dear ietf-822ers,

        i am not sure i can clarify what to do about character sets;
everyone else grappling with 10646 is having similar problems.
i think the best thing to do in these circumstances is to try to figure
out what you need/want to do with these pesky characters, and then do
your definition.

        in other words, you have to specify the underlying model first,
and then how octets and character sets figure into that. To make this concrete,
let me state what MY model for text interchange was (i am unsure of what the
old mail model was). The (simple) model is that the text i supply is drawn
left to right across the display with NL (0xA) causing the start of a new line.
On some devices, X <backspace> Y would appear as Y, on others as X and Y
superimposed. printing characters from ASCII are assumed to be displayed
correctly; all others were pure luck. The text was a sequence of characters
(one-to-one with bytes) which were formed into words separated by
white space, punctuation, or 0xA.
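
        a rough sketch of that display model, for illustration only. the
function name render, the overstrike flag, and the "?" placeholder for
non-ASCII bytes are assumptions of this sketch, not part of any standard
or mailer:

```python
def render(data: bytes, overstrike: bool = False) -> list[list[str]]:
    """Lay bytes out left to right; return rows of display cells.

    Each cell holds the glyph(s) shown at that column. With
    overstrike=False a later character replaces an earlier one in the
    same cell; with overstrike=True the two are superimposed.
    """
    rows: list[list[str]] = [[]]
    col = 0
    for b in data:
        if b == 0x0A:            # NL starts a new line
            rows.append([])
            col = 0
        elif b == 0x08:          # backspace moves one cell left
            col = max(col - 1, 0)
        else:
            # printing ASCII displays correctly; anything else is
            # "pure luck", shown here as "?" (an assumption)
            ch = chr(b) if 0x20 <= b <= 0x7E else "?"
            row = rows[-1]
            if col < len(row):
                row[col] = row[col] + ch if overstrike else ch
            else:
                row.append(ch)
            col += 1
    return rows
```

with this sketch, render(b"X\x08Y") leaves a single cell holding "Y",
while render(b"X\x08Y", overstrike=True) leaves a single cell holding
both "X" and "Y".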

        that's it. i know it's a mishmash of stuff but that's what my view was.
i know that the rather low expectations embedded in this model weren't even
met with existing mailers (on VMS, you have to do some crap to get the LFs
interpreted as newlines). but i do not wish to revisit the sins of the past.

        the difficulty (i think) with the new 10646 era is that the concept
of text element has appeared formally. (we used to have it -- A backspace _
is a single text element -- but we could safely ignore it.) from your
(ietf) point of view, text ought to be a sequence of text elements. all the
definitions, formats, header lines ought to be done with text elements
(and not characters). 10646 defines what a text element is: essentially,
a (base) character followed by an arbitrarily long sequence of combining marks.
Other standards can have a simple definition assumed (say,
a character/glyph == text element). Once this is done, you can then sort out
how text elements are represented (encoded) as an independent issue
(much as you have done).
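
        as an illustration of "base character plus combining marks", here
is a minimal sketch. it assumes that the combining classes in Python's
unicodedata module are an acceptable stand-in for 10646 combining marks,
and the name text_elements is made up for this example. it is a
simplification, not a full segmentation algorithm:

```python
import unicodedata

def text_elements(s: str) -> list[str]:
    """Group a string into base-character-plus-combining-marks units."""
    elements: list[str] = []
    for ch in s:
        # A nonzero combining class marks a combining character;
        # attach it to the element that precedes it.
        if elements and unicodedata.combining(ch):
            elements[-1] += ch
        else:
            # A base character (or a stray leading mark) starts a
            # new text element.
            elements.append(ch)
    return elements
```

so "e" followed by U+0301 (combining acute) comes out as one text
element, even though it is two characters.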

        I claim this, plus a small display model, is all you need for text.
Anything more complicated needs to be done via some more elaborate
mechanism (richtext or sgml or ...). The simple display model i used
above will do, although i note that it won't do for Ohta-san as, for
example, sequences of Han characters will probably be displayed in the
same font and not in separate fonts for the Japanese/Chinese text.

        coming back to the original point, given this model, i can now
tell what the definition of a character set ought to look like, namely a
way to unambiguously map octets to text elements (or if glyphs == text
elements, then octets to glyphs).
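
        to make "a way to unambiguously map octets to text elements"
concrete, here is a hedged sketch. it treats a Python codec name as the
character set label and reuses the combining-class grouping idea; the
function name octets_to_elements is hypothetical, and strict decoding is
this sketch's way of refusing octets the character set does not define:

```python
import unicodedata

def octets_to_elements(octets: bytes, charset: str) -> list[str]:
    """Map octets to text elements under a named character set."""
    # errors="strict" raises UnicodeDecodeError on octets the charset
    # cannot map, so every accepted input has exactly one reading.
    chars = octets.decode(charset, errors="strict")
    elements: list[str] = []
    for ch in chars:
        if elements and unicodedata.combining(ch):
            elements[-1] += ch   # combining mark joins previous element
        else:
            elements.append(ch)  # base character starts a new element
    return elements
```

the same octet 0xE9 yields different text elements under "latin-1" and
"iso8859_5", which is why the octets alone are not enough without the
character set and the model behind it.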

        just understand that to define a character set without reference
to a model of how it's used is futile, and will inevitably lead to confusion.
and that's why this seemingly interminable esoteric discussion of character
sets keeps reappearing in this newsgroup. (the same discussion will be replayed
in many other venues over the next few years as everyone else comes to grips
with 10646.)

                andrew