Below are the minutes of the 822 extensions meeting held June 21st.
There was one meeting; however, the minutes have been split according
to the interests of the two mailing lists. Please send corrections
to me personally.
Greg Vaudreuil
Minutes of the June 21
Message Format Extensions Working Group.
Attendees
---------
Phill Gross pgross(_at_)nis(_dot_)ans(_dot_)net
Peter Svanberg psu(_at_)nada(_dot_)kth(_dot_)se
Byungnam Chung bnchung.sokri.etra.re.kr
Bob Kummerfeld bob(_at_)ca(_dot_)pn(_dot_)oz(_dot_)au
Jonny Eriksson bygg(_at_)sunet(_dot_)se
Jan Michael Rynning jmr(_at_)nada(_dot_)kth(_dot_)se
Keld Simonsen keld(_dot_)simonsen(_at_)dkuug(_dot_)dk
Greg Vaudreuil gvaudre(_at_)nri(_dot_)reston(_dot_)va(_dot_)us
Agenda
------
1) Character Set Selection
- Status and Input to the ISO 10646 process
o Unicode <=> ISO 10646 Union?
o Use of C0 and C1 codespace
- Selection of "Common" character sets or schemes
o ISO 8859-1, ISO 8859-n, Profiles for the use of ISO 2022?
o Specifying "requiredness"
- Specification of 8 bit character sets in headers
Minutes
-------
1) Character Set Issues
a) Unified character set
1) Administrative
At last word, the ISO DIS 10646 received at least one negative vote,
and work is proceeding to resolve the remaining issues. An unofficial
but promising effort is the work underway to unify ISO DIS 10646 and
Unicode, another scheme for a global character set. This working
group was asked to discuss this effort and endorse it if possible.
The working group discussed this effort, and agreed that the efforts
to combine Unicode and 10646 were in fact positive.
2) Technical
The unification of ISO DIS 10646 and Unicode requires the resolution
of several technical issues. The primary issue, tentatively resolved,
involves "Han unification", a scheme that re-uses many of the graphics
of the various Kanji character sets. Other issues involve the use of
C0 and C1 codespace. The use of C0 and C1 codespace involves
transport issues, and this working group was asked for its input.
C0 codespace consists of the code positions 0 through 31, traditionally
used for control characters. There is a proposal to use this space in
the second octet of a multi-byte character for graphic characters.
The working group discussed this and rejected the use of this space.
A graphic character in the C0 space will likely be interpreted by a
transport protocol as a control character. Many transport protocols
that interpret in-band data, such as SMTP, may behave unpredictably in
this situation. One example is where a sequence of graphics is
mis-interpreted as a cr-lf-.-cr-lf sequence, terminating the session
prematurely. Other related anomalies were envisioned. Unless all
transport protocols are made aware of the multi-byte nature of the
data, an unlikely occurrence any time soon, reuse of C0 space is not
recommended.
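The hazard described above can be illustrated with a small Python
sketch. The two-octet values below are invented for illustration and
are not taken from any 10646 draft; for simplicity the sketch also
assumes that C0 values may turn up in either octet of a character.

```python
# Hypothetical two-octet characters (first octet, second octet).
# The values are made up; several of the octets fall in C0 space (0-31).
chars = [(0x41, 0x0D), (0x0A, 0x2E), (0x0D, 0x0A)]

# Serialize the characters as the raw octet stream a transport would see.
stream = bytes(octet for pair in chars for octet in pair)

# A byte-oriented SMTP receiver treats CR LF "." CR LF as end of data.
END_OF_DATA = b"\r\n.\r\n"


def truncated_by_smtp(data: bytes) -> bool:
    """Return True if the octet stream contains the end-of-data marker."""
    return END_OF_DATA in data


print(stream)                     # b'A\r\n.\r\n'
print(truncated_by_smtp(stream))  # True
```

A transport that is unaware of the two-octet structure has no way to
tell these three graphic characters from a genuine terminator.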
C1 codespace consists of the code positions 128 through 159, space
that may be interpreted as control characters if the high-order bit is
stripped. ISO 8859-n character sets and the current 10646 proposal
reserve this space for control characters only, with an eye toward
backward compatibility with 7 bit systems. The working group discussed
this and concluded that C1 codespace could be used for graphics if
transport protocols could be relied upon never to strip the high-order
bit and interpret the resulting characters as control characters. The
working group did not make a specific recommendation, only that the
use of C1 space to compact a character set was a positive thing, and
that the future evolution of transport protocols should support the
use of this space for graphics.
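The concern about bit stripping can be checked with a few lines of
Python. This is only a sketch; the function names are made up for the
example.

```python
def strip_high_bit(octet: int) -> int:
    """Clear the high-order bit, as a 7 bit transport path might."""
    return octet & 0x7F


def is_c0_control(octet: int) -> bool:
    """True for code positions 0 through 31."""
    return 0 <= octet <= 31


# Every C1 position (128-159) lands on a C0 control once stripped.
c1_aliases = {octet: strip_high_bit(octet) for octet in range(128, 160)}
print(all(is_c0_control(v) for v in c1_aliases.values()))  # True

# For example, 0x8D becomes 0x0D (carriage return).
print(hex(strip_high_bit(0x8D)))  # 0xd

# A position above C1, such as 0xE9, lands on an ordinary graphic.
print(chr(strip_high_bit(0xE9)))  # i
```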
b) Common Character Sets
In the absence of a single international standard character set, the
working group needs to profile the use of a limited number of the 200+
character sets in use worldwide to facilitate interoperation. Keld S.
gave an overview of the character sets currently in use.
ISO 7 bit family:
   ASCII
   National Versions
      10 National use
      2 Alternate rep # $
   ECMA registry
      7, 8, 16 bit
      ISO 2022 shifts
ISO 8 bit 8859 family:
   1 char = 1 octet
   ASCII in pos 0-127
   Pos 160-255
      Latin sets (5)
      Cyrillic
      Greek
      Arabic
      Hebrew
ISO 6937-2 family 8/16 bit:
   6937-2, T.61
   Non-Spacing accents
   1 char = 1 or 2 bytes
   about 330 graphical chars
Vendor 8 bit sets
   DEC-MCS
   HP Roman8
   IBM PC codepages (5)
      Uses also 128-159 (C1)
   IBM EBCDIC
      Many versions
      Not ASCII Compatible
16 bit char sets
   Japanese: JIS 0208, 0212
   Chinese: GB 1980
   Korean:
   Japanese 8/16 bit: Shift JIS
   Unicode: New vendor charset unifies CN, JP, KO sets
      Incompatible with ISO
Multi-byte:
   EUC: Extended UNIX code
      ISO 2022 shifting
      SS1 SS2 SS3
      4 char sets
      8/16/24 bits
32 Bits:
   ISO 10646
      Also usable in 8, 16, or 24 bit compaction methods
      Proper encoding subsets: ASCII and ISO 8859-1
Control Character Sets:
   ISO 646: 0-31, 127
   ISO 6429: 0-31, 127-159
   EBCDIC: as ISO 646
Several ideas were batted around, including strict use of ISO 2022,
profiling language to character set mapping, and the use of
"preferred" character sets. The working group felt that the best
approach was to codify existing practice in the interim, pending
adoption of an "international" character set. This existing practice
was reduced to the following.
If possible, use ISO 8859, with the lowest version number possible,
i.e., use 8859-1 (Latin 1) over 8859-10? (Latin 5?). If the characters
needed are not in the 8859 sets (e.g., Kanji), use the ISO 2022
character switching standard, declaring 2022 in the header of the
document. While this may lead to the use of any of the many character
sets in the ECMA registry, the WG felt that in practice only the
current Oriental mail systems will use the 2022 system, and only with
limited character sets.
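The selection rule above could be sketched roughly as follows in
Python. The function name, the choice of parts 1 through 10, and the
short list of ISO 2022 based fallbacks are assumptions made for the
example; the working group did not specify any of them.

```python
# Candidate ISO 8859 parts, lowest number first. Parts 1-10 are an
# assumption based on the "8859-10?" mentioned in the minutes.
ISO_8859_PARTS = [f"iso8859_{n}" for n in range(1, 11)]

# Hypothetical fallback list of ISO 2022 based encodings.
ISO_2022_FALLBACKS = ["iso2022_jp", "iso2022_kr"]


def pick_charset(text: str) -> str:
    """Return the first candidate codec that can represent the text."""
    for name in ISO_8859_PARTS + ISO_2022_FALLBACKS:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # this set lacks a needed character, try the next
        return name
    raise ValueError("no candidate character set covers this text")


print(pick_charset("café"))    # iso8859_1
print(pick_charset("Привет"))  # iso8859_5
print(pick_charset("日本語"))   # iso2022_jp
```

The value returned for ISO 2022 text is what a sender would then have
to declare in the header of the document.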
c) Use of Non-ASCII character sets in headers.
What a mess! The attendees of this meeting spent over an hour working
on various schemes for indicating character sets other than ASCII in
the headers of a message. It was identified as a requirement that the
fields defined as TEXT be able to have variable character sets. While
this goal was stated, no mechanism for the implementation was agreed
upon.
A modification of the BNF notation was suggested by Keld S.
   CHAR-EIGHT  =  <any Eight-bit character>     ; (0-377, 0.-255.)
   qtext       =  <any CHAR-EIGHT excepting <">,
                   "\" & CR, and including
                   linear-white-space>
   quoted-pair =  "\" CHAR-EIGHT
   text        =  <any CHAR-EIGHT, including bare CR &
                   bare LF, but NOT including CRLF>
This notation was accepted by the attendees of the meeting; however,
several problems were identified and not resolved: 1) identification
of the header character set and the need for conversion, and
2) encoding the header character sets in 7 bit transport format.
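To make the proposed 8 bit rules concrete, here is a minimal Python
sketch of the text and qtext constraints. The function names are
invented for the example, and folding through linear-white-space is
ignored for brevity.

```python
def is_text(octets: bytes) -> bool:
    """text: any CHAR-EIGHT, bare CR or bare LF allowed, CRLF is not."""
    return b"\r\n" not in octets


def is_qtext_octet(octet: int) -> bool:
    """qtext excludes the double quote, the backslash, and CR."""
    return octet not in (0x22, 0x5C, 0x0D)


def quote(octets: bytes) -> bytes:
    """Build a quoted-string, turning excluded octets into quoted-pairs."""
    out = bytearray(b'"')
    for octet in octets:
        if not is_qtext_octet(octet):
            out.append(0x5C)  # a backslash introduces a quoted-pair
        out.append(octet)
    out.append(0x22)
    return bytes(out)


print(is_text(b"Gr\xfc\xdfe"))   # True, 8 bit octets are allowed
print(is_text(b"line\r\nline"))  # False, CRLF is excluded
print(quote(b'a"b'))             # b'"a\\"b"'
```

Nothing in these rules says which character set the 8 bit octets
belong to, which is the first unresolved problem listed above.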
It was not clear how a conversion gateway would know that the header
was 8 bit and needed encoding. A suggestion accepted by the group was
that the use of the new BNF requires the use of a header-charset
header line. This additional header adds complexity to user agents
and conversion gateways by requiring two passes over the header to
determine the character set and convert the header into a passable or
readable form. It was felt that this was inelegant but do-able.
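The two passes might look roughly like this in Python. The field name
"Header-Charset" and its syntax are assumptions, since the minutes do
not define them, and the sketch ignores folded lines and repeated
fields.

```python
def decode_header_block(raw: bytes) -> dict[str, str]:
    """Decode 8 bit header fields using a declared header character set."""
    fields = [line.partition(b":") for line in raw.split(b"\r\n") if line]

    # Pass 1: locate the (hypothetical) Header-Charset field.
    charset = "ascii"
    for name, _, value in fields:
        if name.strip().lower() == b"header-charset":
            charset = value.strip().decode("ascii")

    # Pass 2: decode every field body with the declared character set.
    return {
        name.decode("ascii"): value.strip().decode(charset)
        for name, _, value in fields
    }


raw = b"Subject: Gr\xfc\xdfe\r\nHeader-Charset: iso-8859-1\r\n"
print(decode_header_block(raw))
# {'Subject': 'Grüße', 'Header-Charset': 'iso-8859-1'}
```

Because the declaring field may appear after the fields it describes,
the first pass cannot be folded into the second.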
Several proposals were discussed for encoding the 8 bit text strings
when 7 bit transport was required. It was accepted that this was a
hard requirement.
1) Variable Substitution
One proposal for the insertion of 8 bit text was to substitute a
variable name in the header for each text string needing 8 bit
characters. The variable could then be defined elsewhere in the
header, including the encoded actual string and a token indicating the
character set. This was rejected as messy and difficult to implement
in current user agents.
2) Message Encapsulation
Encapsulate the mail message using the message type body part and a
suitable transport encoding, preferably quoted-printable. This
proposal is controversial with at least one implementor of the
message format standard, who sees it as adding excessive complexity to
the user agent. It is not clear that the encapsulated message will be
permitted to have a transport encoding.
3) Encoded Text Fields
This proposal would specify a standard encoding for the header fields,
possibly quoted-readable or quoted-printable, and identify this fact in
a header-transport-encoding header or the header-character-set
header.
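As a rough illustration of the encoded text field idea, the Python
sketch below applies a quoted-printable style "=XX" escape to a field
body. The function names are invented for the example, and the exact
rule for which octets to escape is an assumption, not something the
group agreed on.

```python
def encode_field(text: str, charset: str) -> str:
    """Escape every octet outside printable ASCII (and "=") as =XX."""
    out = []
    for octet in text.encode(charset):
        if 33 <= octet <= 126 and octet != ord("="):
            out.append(chr(octet))
        else:
            out.append(f"={octet:02X}")
    return "".join(out)


def decode_field(encoded: str, charset: str) -> str:
    """Reverse encode_field, assuming well-formed =XX escapes."""
    raw = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == "=":
            raw.append(int(encoded[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(encoded[i]))
            i += 1
    return raw.decode(charset)


encoded = encode_field("Grüße aus Kiel", "iso-8859-1")
print(encoded)                              # Gr=FC=DFe=20aus=20Kiel
print(decode_field(encoded, "iso-8859-1"))  # Grüße aus Kiel
```

The encoded form survives 7 bit transport unchanged, but a reader
still needs the accompanying header line to learn which character set
to decode it with.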
Conclusions
While no one was happy, the group tentatively agreed not to permit 8
bit text in the headers. The only reasonable way to carry 8 bit text
over 7 bit transport was to encode the text fields and insert a new
header line. Given this overhead, the group agreed that, while not
ideal, extended character sets in headers should always be required to
be encoded, eliminating the need for intermediate gateways to parse
and convert the headers.