Late comments from KTH

First we would like to stress that it is now time for compromise
and consensus.  For our part we can certainly live with the
current draft and those changes proposed on this list that
haven't met opposition.  Our following specific comments are not
of the "show stopper" kind, but suggestions for the further
improvement of RFC-XXXX (or arguments for retaining good things
in it).


CHARACTER SET PROBLEM: Contrary to some other persons on this
list we are not worried about the consequences of including
support for different character codes (or CODED CHARACTER SETS,
which is the proper ISO term) in RFC-XXXX.  (Maybe I, Olle
Jarnefors, can contribute some expertise in this field.  Since
May 1990 I'm the Swedish representative to the
ISO&IEC/JTC1/SC2/WG2 working group, responsible for the
development of ISO 10646.  I have been to its three most recent
meetings.  I'm also a member of the European, Nordic, and
Swedish counterparts to JTC1/SC2.)

We assume that nobody has a problem with the inclusion of
ISO-8859-1 through ISO-8859-9.  (RFC-XXXX should explicitly
enumerate all supported parts of ISO 8859.  Of the 9 parts
currently published, all are useful except part 4 (Northern
Europe), which to our knowledge has never been used in our part
of Europe due to technical deficiencies.  It could and should be
left out of RFC-XXXX.  It will be succeeded by a new part 10 in
early 1992,
which may be added at a later stage.)

The standard ISO 2022 does not in itself specify any coded
character set, but it (or, in practice, a subset of it) may be
used to combine two or more different coded character sets into
a new combined character set, which can represent all characters
of any of its elementary coded character sets.  (For this reason
the standard talks about "code extension techniques", not "code
switching".  See also the discussion of the concept of coded
character sets in the appendix at the end of this message.)  For
this reason the value "ISO-2022" shouldn't be used for a
character set in RFC-XXXX.

On the other hand, the Japanese use of ISO 2022 is limited to
four well-defined elementary character sets and a selected
subset of its code extension techniques, as shown in Masahiro
Sekiguchi's message (13 Nov 91 11:31:42 JST).  This therefore
_is_ a coded character set, as well defined and usable as e.g.
US-ASCII.  It probably can be given the name "ISO-2022-JP".
There is no reason to treat it in any other way than e.g.
ISO-8859-1.  If that coded character set is indicated by a
"Content-Type: Text/Plain; charset=ISO-8859-1", then the
Japanese character set should be indicated by
"Content-Type: Text/Plain; charset=ISO-2022-JP" and _not_
treated as a new subtype of Text.
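
To illustrate the uniform treatment argued for above, here is a
minimal Python sketch (not part of any draft) that reads the charset
parameter from a Content-Type field with the standard "email"
package.  The two sample messages are made up for the example, and
note that the library reports type and charset in lower case.

```python
from email import message_from_string

def charset_of(raw_message: str):
    """Return (content type, charset) as reported by the email package."""
    msg = message_from_string(raw_message)
    return msg.get_content_type(), msg.get_content_charset()

# Hypothetical sample messages; only the charset parameter differs.
latin1 = "Content-Type: Text/Plain; charset=ISO-8859-1\n\nbody\n"
japanese = "Content-Type: Text/Plain; charset=ISO-2022-JP\n\nbody\n"

print(charset_of(latin1))    # ('text/plain', 'iso-8859-1')
print(charset_of(japanese))  # ('text/plain', 'iso-2022-jp')
```

In both cases a reader dispatches on the same type and subtype and
only has to look up a different charset value.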

Masahiro Sekiguchi's specification of ISO-2022-JP is precise and
concise and may very well be included as the definition of
ISO-2022-JP in RFC-XXXX.  There is only one ambiguity (which
probably reflects different practices):  The initial state may
be either ASCII (ESC 2/8 4/2) or the left half of JIS X0201 (ESC
2/8 4/10).  These two character sets are identical except for
5/12 (reverse solidus; yen sign) and 7/14 (tilde; overline), but
probably RFC-XXXX should choose ASCII for the initial state.
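
For readers unused to the column/row notation, the following Python
sketch converts the positions mentioned above to byte values.  The
helper names are ours, and the last lines use Python's own
"iso2022_jp" codec only as one existing implementation, not as a
definition of ISO-2022-JP.

```python
ESC = 0x1B

def code_position(column: int, row: int) -> int:
    """Convert ISO column/row notation (e.g. 2/8) to a byte value."""
    return column * 16 + row

def escape_sequence(*positions) -> bytes:
    """Build ESC followed by the bytes at the given column/row positions."""
    return bytes([ESC] + [code_position(c, r) for c, r in positions])

DESIGNATE_ASCII = escape_sequence((2, 8), (4, 2))       # ESC ( B
DESIGNATE_JIS_ROMAN = escape_sequence((2, 8), (4, 10))  # ESC ( J

# The two positions where ASCII and the left half of JIS X0201 differ.
REVERSE_SOLIDUS_OR_YEN = code_position(5, 12)  # 0x5C
TILDE_OR_OVERLINE = code_position(7, 14)       # 0x7E

# Python's codec assumes ASCII as the initial state: plain ASCII text
# is emitted without any escape sequence, and Japanese text is followed
# by a return to ASCII.
plain = "abc".encode("iso2022_jp")
kana = "あ".encode("iso2022_jp")
print(plain, kana)
```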

Some people on this list still seem to be very skeptical about the
future of ISO 10646.  After the astonishingly successful
so-called merger process involving JTC1/SC2/WG2 and the Unicode
consortium, which

+ started at the informal meeting in San Francisco in May,
+ was finalized at WG2's meeting in Paris in October,
+ was accepted by JTC1/SC2 in Rennes in October, and
+ will make Unicode version 1.1 a proper subset of 2nd DIS 10646,

actually everybody involved is very optimistic about the outcome
of the second DIS ballot.  The DIS will be published in December
or January, the ballot period will end in May, and any remaining
difficulties will be resolved at WG2's meeting in Korea 29 June
- 3 July 1992.

This of course doesn't mean that a specification of an ISO-10646
character set can be included in RFC-XXXX now (the standard is
not completely frozen yet, and it actually defines three
different coded character sets: the 16-bit form, the 32-bit
form, and the UTF), but we think it is appropriate to include
text saying that ISO 10646 will probably be the future way to go
and that another RFC specifying the proper use of it in the
RFC-XXXX framework will be published as soon as possible.

Like ISO 2022 and ISO 10646 the MNEMONIC text format -- as
specified in RFC-MNEM, "Mnemonic text format" (Draft, 15 July
1991) -- isn't a coded character set itself.  But this
specification provides parameters specific to the Text/Mnemonic
type (as defined in the October draft of RFC-XXXX) and for each
pair of values for the first two parameters (Charset and Intro),
the character set is indeed well-defined.  E.g. when the Charset
parameter is US-ASCII and the Intro parameter is 38, the Yen
mark character is represented by the three bytes corresponding
to "&Ye", but with Charset=ISO-8859-1 it is represented by the
byte value 165 (irrespective of the value of Intro).  MNEMONIC
thus may be regarded as a coded-character-set-valued function of
the two variables Charset and Intro.  With an "attribute=value"
syntax for character sets, the proper syntax for MNEMONIC
probably should be something like
   Content-Type: Text/Plain; Charset=Mnemonic-US-ASCII-38
Here, the actual value for the Charset attribute is constructed
by appending to the string "Mnemonic-" first the character set
name from the Mnemonic specification that in the current
RFC-MNEM constitutes the Charset parameter, then a "-", and
finally the decimal value of the Intro character (presently the
Intro parameter of RFC-MNEM).
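
The construction just described can be sketched in a few lines of
Python.  The function names are hypothetical and the "Mnemonic-"
naming scheme is only the suggestion made above, not anything taken
from RFC-MNEM itself.

```python
PREFIX = "Mnemonic-"

def mnemonic_charset_name(charset: str, intro: str) -> str:
    """Build the suggested Charset value from the two RFC-MNEM parameters."""
    return f"{PREFIX}{charset}-{ord(intro)}"

def split_mnemonic_charset_name(name: str):
    """Recover (charset, intro character) from a suggested Charset value."""
    if not name.startswith(PREFIX):
        raise ValueError("not a Mnemonic charset name: " + name)
    # The charset name may itself contain "-", so split at the last one.
    charset, _, intro_code = name[len(PREFIX):].rpartition("-")
    return charset, chr(int(intro_code))

name = mnemonic_charset_name("US-ASCII", "&")
print(name)                               # Mnemonic-US-ASCII-38
print(split_mnemonic_charset_name(name))  # ('US-ASCII', '&')

# The Yen example from the text: three bytes with US-ASCII and Intro 38,
# one byte (value 165) with ISO-8859-1.
yen_in_ascii = (chr(38) + "Ye").encode("ascii")
yen_in_latin1 = "\u00a5".encode("iso-8859-1")
print(yen_in_ascii, list(yen_in_latin1))  # b'&Ye' [165]
```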


HEADER PROBLEM:  We sincerely think it would be a big mistake to
publish an ambitious upgrade of RFC-822 like RFC-XXXX without a
corresponding upgrade of the expressive power of header fields.
We think the best would be to include it in RFC-XXXX, but can
accept that it is put in another RFC, if they are released
simultaneously and reference each other.

We are generally quite happy with Keith Moore's proposal for an
RFC-HEADERS.  We have found one thing in it to be inadequately
specified.  In sections III and VII sequences of encoded-words
are explicitly allowed in the same header.  Section VIII then
prohibits the use of multiple encoded-words in "text", "ctext"
and "qtext". This would e.g. limit the length of the Subject:
line to 67 characters, as one wouldn't be allowed to split it
into two encoded-words.

We would like to generalize the use of encoded-words by not only
allowing several of them in the same header field but also making
it possible to intersperse ASCII text with encoded-words.

Specifically, we suggest the following wordings. (">" indicates
old text, "+" new text to replace it.)

In section VII:

> A sequence of one or more encoded-words is used to represent non-ASCII
> textual data within a message header.  An encoded-word must be separated
> from any adjacent encoded-words, "word"s, "text", "ctext", or "qtext"
> by a linear white-space character or an end-of-line.


+ One or more encoded-words may be used to represent non-ASCII textual
+ data within a message header. In that case, each encoded-word must
+ be separated from any adjacent part of the text by a linear white-space
+ character or an end-of-line. If this is not the case, the whole header
+ field-body will be interpreted as ASCII text.

> When multiple encoded-words appear in the same header, separated only
> by ends-of-lines or linear white space, the ends-of-lines or white space
> are not displayed.


+ When an encoded-word appears in a header field-body, any preceding or
+ following ends-of-lines or linear white space are not displayed.

In section VIII:

> An encoded-word may be distinguished from an ordinary "word", "text",
> "ctext", or "qtext" as follows:


+ An encoded-word may be recognized as forming a part of a "text", "ctext"
+ or "qtext" or being a "word", "text", "ctext", or "qtext" as follows:

> 1.  An encoded-word begins with "=?"
>
> 2.  An encoded-word ends with "?="
>
> 3.  An encoded-word contains exactly four "?" characters, including the
>     beginning "=?" and ending "?=" delimiters.


Extra condition:

+ 4.  An encoded-word is preceded and followed by linear white space,
+     ends-of-lines, or nothing.

> If the "word", "text", "ctext", or "qtext" does not meet the above
> tests, it should be displayed as it appears in the message header.


+ If not all alleged encoded-words meet the above tests, the whole
+ header field-body should be displayed as it appears.
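
The four tests and the all-or-nothing rule proposed above can be
sketched in Python as follows.  This is only an illustration under
stated assumptions: the function names are ours, "alleged
encoded-word" is read as any white-space-delimited token containing
"=?" or "?=", and the sample field-bodies are made up.  No decoding
of the encoded text is attempted.

```python
import re

# Linear white space and ends-of-lines both act as separators here.
SEPARATOR = re.compile(r"[ \t\r\n]+")

def is_encoded_word(token: str) -> bool:
    """Tests 1-3; test 4 is implied because tokens come from splitting
    the field-body on white space."""
    return (token.startswith("=?")
            and token.endswith("?=")
            and token.count("?") == 4)

def looks_alleged(token: str) -> bool:
    """Assumption: a token claims to be an encoded-word if it contains
    either delimiter anywhere."""
    return "=?" in token or "?=" in token

def classify_field_body(body: str):
    """Return [(token, is_encoded)] or None if the whole field-body
    must be displayed as it appears."""
    tokens = [t for t in SEPARATOR.split(body) if t]
    alleged = [t for t in tokens if looks_alleged(t)]
    if not all(is_encoded_word(t) for t in alleged):
        return None
    return [(t, is_encoded_word(t)) for t in tokens]

mixed = classify_field_body("Hello =?ISO-8859-1?Q?Sm=F6rg=E5sbord?= again")
glued = classify_field_body("Re:=?ISO-8859-1?Q?abc?=")
print(mixed)
print(glued)  # None: the alleged encoded-word is not set off by white space
```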


--
Olle Jarnefors  <ojarnef(_at_)admin(_dot_)kth(_dot_)se>
Peter Svanberg  <psv(_at_)nada(_dot_)kth(_dot_)se>
Royal Institute of Technology (KTH), Stockholm, Sweden




Appendix: The concept of a coded character set
----------------------------------------------

The concept "coded character set" isn't very difficult to
define:

   A coded character set is a set of rules that unambiguously
   and completely determines which sequence of characters is
   represented by each possible sequence of n-bit bytes.
   (Nowadays, n = 8 for most coded character sets.)

   The (mathematical) set of characters that can be represented
   in a coded character set is called its "repertoire".

These definitions deserve a few comments:

o The definition provides a quite _abstract_ concept of "coded
   character set".  A character code standard may define 0, 1 or
   several coded character sets, but there are also a lot of
   proprietary coded character sets and de-facto standardized
   coded character sets.  Also an unsuccessful standard proposal
   like the 1st DIS 10646, which may never have been
   implemented, can be a coded character set in this sense.

o The simplest kind of coded character set is a (mathematical)
   function from the set of n-bit bytes to a set of characters
   (which then is its repertoire).  US-ASCII and the ISO-8859
   codes belong to this kind.  They have the nice property that
   any byte, regardless of its context, represents the same
   character.

o Different structural complexities are present in other coded
   character sets, such as ISO 6937 and CCITT T.61, ISO 2nd DIS
   10646, and combined character sets using ISO 2022, e.g.
   ISO-2022-JP.

o Certain byte sequences may be "not allowed" in a coded
   character set, e.g. any string starting with BS in US-ASCII.

o A _good_ coded character set should fulfill certain
   requirements, e.g. all possible sequences of characters of
   its repertoire should be codable, and no character should
   have redundant encodings.  Such requirements are not a part
   of the _definition_ of the concept, though.

o The definition implies that only text can be coded by a coded
   character set.  In other kinds of data a coded character set
   may be utilized only in textual parts, and in principle
   different codes can be used in different parts of the same
   file.  These parts are called _CC-data-elements_ (character
   coded data elements) in more recent ISO character coding
   standards.

o  The only really problematic concept used in the definition of
   "coded character set" is that of a "character", but these
   problems probably don't have any specific implications in
   email.  Some of them have to do with "character" being an
   abstract concept which is completely determined neither by
   the graphical shape of the symbol (the glyph), nor by its
   meaning, nor by its historical origin.  In some cases it is
   also difficult to decide if a certain symbol is one character
   or a composition of two or sometimes more distinct parts,
   each of which is a character in its own right.  (Another,
   quite unnecessary problem stems from the fact that the word
   "character" is used in the Unicode community with a meaning
   quite different from the one used in ISO standards, and
   here.)
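
The difference between the simplest kind of coded character set and
a structurally more complex one can be seen in a short Python
sketch.  It relies on Python's own "iso-8859-1" and "iso2022_jp"
codecs as convenient existing implementations; the variable names
are ours.

```python
# Context-free: in ISO-8859-1 the byte 0xE5 represents the same
# character wherever it occurs.
alone = bytes([0xE5]).decode("iso-8859-1")
embedded = bytes([0x41, 0xE5, 0x42]).decode("iso-8859-1")
print(alone, embedded)  # å AåB

# Context-dependent: in ISO-2022-JP the same two bytes represent
# different characters depending on the preceding escape sequence.
pair = b'$"'
in_ascii_state = pair.decode("iso2022_jp")
in_kanji_state = (b"\x1b$B" + pair + b"\x1b(B").decode("iso2022_jp")
print(in_ascii_state, in_kanji_state)  # $" あ
```

Both are coded character sets in the sense of the definition above,
since each byte sequence still determines a single sequence of
characters.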