[no subject]

andrew(_at_)research(_dot_)att(_dot_)com writes:

      in other words, you have to specify the underlying model first,
and then how octets and character sets figure into that. To make this
concrete, let me state MY model for text interchange (i am unsure of
what the old mail model was). The (simple) model is that the text i
supply is drawn left to right across the display with NL (0xA) causing
the start of a new line.


Well, that is too simple: we also need to address right-to-left writing,
e.g. in Arabic, though I am unsure whether that has implications for your model.

      the difficulty (i think) with the new 10646 era, is that the concept
of text element has appeared formally. (we used to have it -- A backspace _
is a single text element -- but we could safely ignore it.) from your
(ietf) point of view, text ought to be a sequence of text elements. all the
definitions, formats, header lines ought to be done with text elements
(and not characters). 10646 defines what a text element is: essentially,
a (base) character followed by an infinite sequence of combining marks.
Other standards can have a simple definition assumed (say,
a character/glyph == text element). Once this is done, you can then sort out
how text elements are represented (encoded) as an independent issue
(much as you have done).


The text elements, or "combining sequences" as ISO 10646 calls them,
have indeed now been formally introduced, but there exists no specification
of their contents. You do not know how they are encoded or how to
use them; they are simply not defined. I would advise the IETF to avoid this
unstandardised area like the plague, and just stick to the character
level of ISO 10646/Unicode, which is well defined and capable of
doing the job with respect to the requirements of the IETF, namely communication.
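
To make the difference between the character level and the combining
sequence level concrete, here is a minimal sketch in Python. It is only an
illustration under stated assumptions: it uses Python's unicodedata module
as a stand-in for character properties, the function name
split_combining_sequences is made up for this example, and it treats any
code point with a nonzero canonical combining class as a combining mark,
which is a simplification and not a definition taken from ISO 10646.

```python
import unicodedata


def split_combining_sequences(text):
    """Group code points into a base character plus trailing combining marks.

    Hypothetical helper: a code point counts as a combining mark when
    unicodedata.combining() returns a nonzero class (a simplification).
    """
    sequences = []
    for ch in text:
        if unicodedata.combining(ch) and sequences:
            # Attach the mark to the sequence that is already open.
            sequences[-1] += ch
        else:
            # A base character (or a stray leading mark) starts a new sequence.
            sequences.append(ch)
    return sequences


text = "e\u0301a"  # 'e', COMBINING ACUTE ACCENT, 'a'
chars = list(text)                      # character level: one entry per code point
seqs = split_combining_sequences(text)  # combining sequence level
print(len(chars), len(seqs))
```

At the character level every unit has a fixed, defined code; at the
combining sequence level the number of possible units is open-ended, which
is the part I suggest leaving alone.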

      just understand that to define a character set without reference
to a model of how it's used is futile, and will inevitably lead to confusion.
and that's why this seemingly interminable esoteric discussion of character
sets keeps reappearing in this newsgroup. (the same discussion will be
replayed in many other venues over the next few years as everyone else
comes to grips with 10646.)


I agree that we should look at the use of "charsets" in the IETF,
first in MIME, and then clearly spell out what we mean by the term.
The precise definition will be a little different
from ISO terminology, as the prose in RFC 1341 and RFC 1345 shows.
Some qualities of IETF charsets:

1. Unambiguous. No code at a given place and in a given state
   may define more than one character. Generic ISO 646 cannot be used.
2. Defines the encoding (not the repertoire).
3. Covers all of the information in the bit stream, including
   control characters. This is in contrast to e.g. ISO 8859-1,
   which does not define control characters (e.g. LF, CR and TAB are not
   defined in 8859-1).
4. Defines the encoding of characters, not text elements/combining sequences.
5. Allows stateful encoding, e.g. iso-2022-jp is one charset.
6. Allows mnemonic charsets as charsets.
7. Byte ordering is defined, i.e. no confusion over big/little endian.
8. Allows ISO 10646 and Unicode (in some form).
9. Limited number of characters (the number may be significant, but
   not unlimited).
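
Points 5 and 7 can be illustrated with a short sketch in Python. The
assumptions are mine: it relies on Python's "iso2022_jp" codec for the
stateful case, and it uses the "utf-16-be" and "utf-16-le" codecs purely as
a convenient two-octet stand-in for a 10646-style encoding; neither codec
name comes from the RFCs mentioned above.

```python
ESC = b"\x1b"

# Point 5: a stateful charset. The escape sequences switch the state, and
# the same octets mean different things depending on that state.
kanji = "\u65e5"                          # one JIS X 0208 character
framed = kanji.encode("iso2022_jp")       # shift-in escape, 2 octets, shift-out escape
payload = framed[3:-3]                    # the 2 octets without the escapes
in_kanji_state = framed.decode("iso2022_jp")
in_ascii_state = payload.decode("iso2022_jp")  # same octets, initial (ASCII) state

print(framed.startswith(ESC), len(payload))
print(in_kanji_state == kanji, in_ascii_state == kanji)

# Point 7: byte ordering must be part of the charset definition.
big = "A".encode("utf-16-be")             # high octet first
little = "A".encode("utf-16-le")          # low octet first
misread = big.decode("utf-16-le")         # wrong order gives a different character

print(big, little, misread == "A")
```

The whole escape-switched stream is treated as one charset, so the
receiver never has to guess the state, and a charset name that fixes the
byte order removes the ambiguity shown in the last line.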

keld