Minutes of the June 21 meeting (Message format)


Below are the minutes of the 822 extensions meeting held June 21st.
There was one meeting; however, the minutes have been split according
to the interests of the two mailing lists.  Please send corrections
to me personally.

Greg Vaudreuil





                    Minutes of the June 21
           Message Format Extensions Working Group.


Attendees
---------

Phill Gross              pgross(_at_)nis(_dot_)ans(_dot_)net
Peter Svanberg           psu(_at_)nada(_dot_)kth(_dot_)se
Byungnam Chung           bnchung.sokri.etra.re.kr
Bob Kummerfeld           bob(_at_)ca(_dot_)pn(_dot_)oz(_dot_)au
Jonny Eriksson           bygg(_at_)sunet(_dot_)se
Jan Michael Rynning      jmr(_at_)nada(_dot_)kth(_dot_)se
Keld Simonsen            keld(_dot_)simonsen(_at_)dkuug(_dot_)dk
Greg Vaudreuil           gvaudre(_at_)nri(_dot_)reston(_dot_)va(_dot_)us

Agenda
------

1) Character Set Selection

   - Status and Input to the ISO 10646 process
     o Unicode <=> ISO 10646 Union?
     o Use of C0 and C1 codespace

   - Selection of "Common" character sets or schemes
     o ISO 8859-1, ISO 8859-n, Profiles for the use of ISO 2022?
     o Specifying "requiredness"

   - Specification of 8 bit character sets in headers

Minutes
-------

1) Character Set Issues

   a) Unified character set

     1) Administrative

At last word, ISO DIS 10646 had received at least one negative vote,
and work is proceeding to resolve the remaining issues.  An unofficial
but promising effort is the work underway to unify ISO DIS 10646 and
Unicode, another scheme for a global character set.  This working
group was asked to discuss this effort and endorse it if possible.
The working group did so, and agreed that the efforts to combine
Unicode and 10646 were in fact positive.

     2) Technical

The unification of ISO DIS 10646 and Unicode requires the resolution
of several technical issues.  The primary issue, tentatively resolved,
involves "Han unification", a scheme that re-uses many of the graphics
of the various Kanji character sets.  Other issues involve the use of
C0 and C1 codespace.  Because the use of C0 and C1 codespace raises
transport issues, this working group was asked for its input.

C0 codespace consists of positions 0 through 31, traditionally
used for control characters.  There is a proposal to use this space in
the second octet of a multi-byte character for graphic characters.
The working group discussed this and rejected the use of this space.
A graphic character in the C0 space will likely be interpreted by a
transport protocol as a control character.  Many transport protocols
which interpret in-band data, such as SMTP, may behave unpredictably in
this situation.  One example is where a sequence of graphics is
misinterpreted as a cr-lf-.-cr-lf sequence, terminating the session
prematurely.  Other related anomalies were envisioned.  Unless all
transport protocols are made aware of the multi-byte nature of the
data, an unlikely occurrence any time soon, reuse of C0 space is not
recommended.
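
The hazard can be illustrated with a short Python sketch.  The
two-octet character values below are hypothetical, chosen only so
that the second octet falls in C0; they are not taken from any 10646
or Unicode draft.

```python
def c0_offsets(data: bytes) -> list[int]:
    """Return the offsets of all octets in the C0 range (0-31)."""
    return [i for i, octet in enumerate(data) if octet < 0x20]


# Two hypothetical two-octet graphic characters (assumed values):
# the second octet of each lands on CR (0x0D) or LF (0x0A).
payload = bytes([0x41, 0x0D, 0x42, 0x0A])

# Offsets that a transport may treat as control characters.
print(c0_offsets(payload))

# A line-oriented transport that splits on LF ends the line in the
# middle of the second character, because it cannot tell that the
# octet is part of a graphic.
print(payload.split(b"\n"))
```

A transport that is unaware of the multi-byte structure sees only
the individual octets, which is why the scan above works octet by
octet rather than character by character.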

C1 codespace consists of positions 128 through 159, space that may be
interpreted as control characters if the high order bit is stripped.
The ISO 8859-n character sets and the current 10646 proposal reserve
this space for control characters only, with an eye toward backward
compatibility with 7 bit systems.  The working group discussed this
and concluded that C1 codespace could be used for graphics if
transport protocols could be relied upon never to strip the high order
bit and interpret the resulting characters as control characters.  The
working group did not make a specific recommendation, noting only that
the use of C1 space to compact a character set was a positive thing,
and that the future evolution of transport protocols should support
the use of this space for graphics.
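
The effect of stripping the high order bit can be shown with a small
Python sketch.  The function name and the sample octets are
illustrative assumptions, not part of any proposal discussed.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear the high order bit of every octet, as a 7 bit path might."""
    return bytes(octet & 0x7F for octet in data)


C1 = range(128, 160)  # positions 128 through 159

# Every C1 position collapses onto a C0 control position (0-31).
assert all(0 <= (octet & 0x7F) <= 31 for octet in C1)

# Hypothetical graphic octets 0x8D 0x8A turn into CR LF once stripped.
print(strip_high_bit(bytes([0x8D, 0x8A])))
```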


   b) Common Character Sets

In the absence of a single international standard character set, the
working group needs to profile the use of a limited number of the 200+
character sets in use worldwide to facilitate interoperation.  Keld S.
gave an overview of the character sets currently in use.

ISO 7 bit family:
     ASCII
     National Versions
       10 National use
       2 Alternate rep # $
     ECMA registry
       7, 8, 16 bit
       ISO 2022 shifts

ISO 8 bit 8859 family:
     1 char = 1 octet
     ASCII in pos 0-127
     Pos 160-255
       Latin sets (5)
       Cyrillic
       Greek
       Arabic
       Hebrew

ISO 6937-2 family 8/16 bit:
     6937-2, T.61
     Non-Spacing accents
     1 char = 1 or 2 bytes
     about 330 graphical chars

Vendor 8 bit sets
     DEC-MCS
     HP Roman8
     IBM PC codepages (5)
       Uses also 128-159 (C1)
     IBM EBCDIC
       Many versions
       Not ASCII Compatible

16 bit char sets
     Japanese: JIS 0208, 0212
     Chinese: GB 1980
     Korean:
     Japanese 8/16 bit: Shift JIS
     Unicode: New vendor charset unifies CN, JP, KO sets
        Incompatible with ISO

Multi-byte:
     EUC: Extended UNIX code
       ISO 2022 shifting
       SS1 SS2 SS3
       4 char sets
       8/16/24 bits   

32 Bits:
     ISO 10646
       Also usable in 8, 16, or 24 bit compaction methods
       Proper encoding subsets: ASCII and ISO 8859-1

Control Character Sets:
     ISO 646: 0-31, 127
     ISO 6429: 0-31, 127-159
     EBCDIC: as ISO 646  
     
Several ideas were batted around, including strict use of ISO 2022,
profiling language to character set mapping, and the use of
"preferred" character sets.  The working group felt that the best
approach was to codify existing practice in the interim, pending
adoption of an "international" character set.  This existing practice
was reduced to the following.

If possible, use ISO 8859, with the lowest version number possible,
i.e., use 8859-1 (Latin 1) over 8859-10? (Latin 5?). If the characters
needed are not in the 8859 sets (e.g., Kanji), use the 2022 character
switching standard, declaring 2022 in the header of the document.
While this may lead to the use of any of the many character sets in
the ECMA registry, the WG felt that in practice only the current
Oriental mail systems will use the 2022 system, and only with limited
character sets.
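
The selection rule can be sketched in Python as follows.  This is
only an illustration under stated assumptions: the codec names are
those of the Python standard library, the range of 8859 parts tried
(1 through 9) is an assumption, and "iso2022_jp" stands in for
whichever ISO 2022 based scheme a given mail system would use.

```python
# Parts 1 through 9 of ISO 8859, lowest number first (assumed range).
ISO_8859_PARTS = [f"iso8859_{n}" for n in range(1, 10)]


def pick_charset(text: str, fallback: str = "iso2022_jp") -> str:
    """Return the lowest-numbered ISO 8859 part that can encode the
    text, or the ISO 2022 based fallback when none can."""
    for name in ISO_8859_PARTS:
        try:
            text.encode(name)
            return name
        except UnicodeEncodeError:
            continue
    return fallback


print(pick_charset("caf\u00e9"))                           # Latin text
print(pick_charset("\u041f\u0440\u0438\u0432\u0435\u0442"))  # Cyrillic
print(pick_charset("\u65e5\u672c\u8a9e"))                  # Kanji
```

The sketch does not check that the fallback can actually encode the
text; a real implementation would need to.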

   c) Use of Non-ASCII character sets in headers


What a mess!  The attendees of this meeting spent over an hour working
on various schemes for indicating character sets other than ASCII in
the headers of a message.  It was identified as a requirement that the
fields defined as TEXT be able to have variable character sets.  While
this goal was stated, no mechanism for its implementation was agreed
upon.

A modification of the BNF notation was suggested by Keld S.     

CHAR-EIGHT     = <any Eight-bit character>     ; (0-377, 0.-255.)

qtext          = <any CHAR-EIGHT excepting <">, "\" & CR, and
               including linear-white-space>

quoted-pair    = "\" CHAR-EIGHT

text           = <any CHAR-EIGHT, including bare CR & bare LF but
               NOT including CRLF>
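
As a rough illustration of the modified rules, the checks below test
a byte string against the `text` and `qtext` productions.  The
function names are hypothetical, and the sketch ignores header
folding.

```python
def is_text(data: bytes) -> bool:
    """text: any eight-bit octets; bare CR and bare LF are allowed,
    but the CRLF pair is not."""
    return b"\r\n" not in data


def is_qtext(data: bytes) -> bool:
    """qtext: any eight-bit octets except <">, "\\" and CR."""
    return not any(octet in b'"\\\r' for octet in data)


print(is_text(b"Bj\xf6rn\rX"))    # 8 bit octet and bare CR
print(is_text(b"a\r\nb"))         # contains CRLF
print(is_qtext(b"Bj\xf6rn"))      # 8 bit octet only
print(is_qtext(b'say "hi"'))      # contains a double quote
```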


This notation was accepted by the attendees of the meeting; however,
several problems were identified and not resolved: 1) identification
of the header character set and the need for conversion, and 2)
encoding the header character sets in 7 bit transport format.

It was not clear how a conversion gateway would know that the header
was 8 bit and needed encoding.  A suggestion accepted by the group was
that the use of the new BNF requires the use of a header-charset
header line.  This additional header adds complexity to user agents
and conversion gateways by requiring two passes over the header to
determine the character set and convert the header into a passable or
readable form.  It was felt that this was inelegant but do-able.
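
The two passes might look like the following Python sketch.  The
field name "Header-Charset" and its value syntax are assumptions for
illustration; the meeting did not define them.  Header folding is
ignored.

```python
def decode_headers(raw: bytes) -> list[tuple[str, str]]:
    """Decode 8 bit header lines using a hypothetical Header-Charset
    field that may appear anywhere in the header."""
    lines = raw.split(b"\r\n")

    # Pass 1: find the declared character set (default is ASCII).
    charset = "ascii"
    for line in lines:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"header-charset":
            charset = value.strip().decode("ascii")

    # Pass 2: decode every field value with that character set.
    fields = []
    for line in lines:
        name, _, value = line.partition(b":")
        fields.append((name.decode("ascii"),
                       value.strip().decode(charset)))
    return fields


raw = (b"From: Bj\xf6rn <b@example.org>\r\n"
       b"Header-Charset: iso-8859-1")
print(decode_headers(raw))
```

Because the declaring line can follow the 8 bit field, as it does in
this example, a single pass is not enough.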

Several proposals were discussed for encoding the 8 bit text strings
when 7 bit transport was required.  It was accepted that this was a
hard requirement.

1) Variable Substitution

One proposal for the insertion of 8 bit text was to substitute a
variable name in the header for each text string needing 8 bit
characters. The variable could then be defined elsewhere in the
header, including the encoded actual string and a token indicating the
character set.  This was rejected as messy and difficult to implement
in current user agents.

2) Message Encapsulation

Encapsulate the mail message using the message type body part and a
suitable transport encoding, preferably quoted-printable.  At least one
implementor of the message format standard objects to this proposal as
imposing excessive complexity on the user agent.  It is also not clear
that the encapsulated message will be permitted to have a transport
encoding.

3) Encoded Text Fields

This proposal would specify a standard encoding for the header fields,
possibly quoted-readable or quoted-printable, and identify this fact in
a header-transport-encoding header or the header-character-set
header.
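
A minimal Python sketch of this proposal, assuming quoted-printable
as the encoding, is shown below.  It uses the standard library
`quopri` module; the "Header-Character-Set" line and the address are
hypothetical, since no syntax was agreed.

```python
import quopri

# 8 bit text in ISO 8859-1: the octets b"Bj\xf6rn".
value = "Bj\u00f6rn".encode("iso8859_1")
encoded = quopri.encodestring(value)

# The encoded form is safe for 7 bit transport and is reversible.
assert all(octet < 128 for octet in encoded)
assert quopri.decodestring(encoded) == value

# Hypothetical header lines carrying the encoded field.
print("Header-Character-Set: iso-8859-1")
print("From: " + encoded.decode("ascii") + " <user@example.org>")
```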

Conclusions

While no one was happy, the group tentatively agreed to not permit 8
bit text in the headers.  The only reasonable way to carry such text
over 7 bit transport was to encode the text fields and insert a new
header line.  Given this overhead, the group agreed that, while not
ideal, extended character sets should always be required to be
encoded, eliminating the need for intermediate gateways to parse and
convert the headers.