Charles Lindsey wrote:
In Usefor, where headers in Netnews will be in UTF-8, we have to say how
to gateway into email.
"Headers" is somewhat ambiguous; HTML has "headers", but those are not
Internet text message (RFC 822 / 2822) headers. RFC 822 / 2822 preclude
8th-bit-high octets implied by UTF-8 in Internet text message header fields.
So presumably the new Usefor format is completely separate from the RFC 822
Internet text message format, and "headers" in the Usefor draft are not the
same thing as "headers" in RFC 822 / 2822 Internet text messages. That would
be unlike RFC 1036, whose Usenet messages used the same message format as
RFC 822 Internet text messages, subject only to some additional restrictions
imposed by RFC 1036.
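As a minimal illustration of the 7-bit restriction, the following Python
sketch checks a raw header field for octets with the high bit set. The
function name has_8bit_octets and the sample fields are hypothetical, chosen
only for this example.

```python
def has_8bit_octets(raw_header: bytes) -> bool:
    """Return True if any octet in the raw header field has its high bit set."""
    return any(octet > 0x7F for octet in raw_header)


# A field restricted to 7-bit ASCII, as RFC 822 / 2822 header fields must be.
ascii_field = "Subject: hello".encode("utf-8")

# The same field with one non-ASCII character sent as raw UTF-8.
utf8_field = "Subject: h\u00e9llo".encode("utf-8")

print(has_8bit_octets(ascii_field))
print(has_8bit_octets(utf8_field))
```

Any field for which this check returns True is outside what RFC 822 / 2822
permit in a header field, regardless of which charset produced the octets.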
So I am here showing you the text that is currently proposed, primarily
because I want the RFC 2047 experts here to check that what I say is
correct, or at least as correct as it is possible to be with RFC 2047.
Reference to RFC 2047 presumes that one is dealing with RFC 822 Internet
text messages; if that is not the case (as seems to be so for the Usefor
draft), all bets are off.
Your comments would be appreciated.
8.8.1.1. Gatewaying into email
Although headers containing non-ASCII characters may well be conveyed
"headers containing non-ASCII characters" in the context of Internet text
messages (RFC 822 / 2822) is analogous to "cows with testicles" or "bulls
with udders"; there's no such critter.
[...]
2. If the header is unstructured, any word(s) which is contained
within a comment and is delimited by FWS or by the "(" or ")"
delimiting that comment
Unstructured RFC 822 / 2822 header fields cannot be said to contain comments;
comments imply structure, and cannot exist in an unstructured field (or in
unstructured parts of mixed structured / unstructured fields). In
RFC 822 / 2822 Internet text messages, the Subject header field is
unstructured;
Subject: foo (bar)
does not contain a comment, because the header field is unstructured.
Date: Fri, 13 Dec 2002 12:34:56 -0000 (no triskaidekaphobia here)
is a structured RFC 822 / 2822 header field which contains a comment.
If Usefor draft "header" means something different from RFC 822 / 2822
"header", then whether or not an "unstructured" Usefor "header" can be
said to contain a "comment" depends strongly on the Usefor definitions
of "header", "unstructured", and "comment".
[...]
5. If the header is not one defined by this standard or by any Email
standard known to the gateway (so that it cannot be determined
whether it is unstructured, or otherwise where comments and
phrases occur within it), then it is not possible to encode it
according to a strict interpretation of [RFC 2047]. Nevertheless,
it is preferable to attempt an encoding than to discard that
header or to allow the gatewaying to fail. It is therefore
suggested that, outside of regions contained within properly
matched DQUOTEs, <...> or [...], any word(s) contained within
properly nested "(" and ")" be treated as being within a comment
and any other word(s) be treated as being within a phrase.
Likewise, following any ";", anything of the syntactic form of a
parameter should be treated as such.
This has recently been discussed here. For display, such an error-prone
heuristic may be marginally acceptable (though user cut-and-paste is likely
to result in problems), and it will fail on a number of not-uncommon
constructs. Use of such an unreliable mechanism for *gateways* is highly
inadvisable, to say the least.
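To make the objection concrete, here is a rough Python sketch of the
suggested heuristic. It is not taken from the draft: the name classify_words,
the "protected" label, and the sample field values are hypothetical, and the
sketch assumes balanced delimiters and ignores quoted-pairs.

```python
# Regions the heuristic leaves alone: opening delimiter -> closing delimiter.
PROTECTED = {'"': '"', "<": ">", "[": "]"}


def classify_words(value):
    """Label each word of a field body as 'phrase', 'comment' or 'protected'."""
    words = []
    depth = 0        # nesting depth of "(" ... ")"
    closer = None    # closing delimiter while inside a protected region
    current = ""     # word being accumulated
    context = None   # label assigned at the first character of the word

    def flush():
        nonlocal current, context
        if current:
            words.append((current, context))
        current, context = "", None

    for ch in value:
        if closer is not None:
            # Inside "...", <...> or [...]: parentheses are not special here.
            if ch == closer:
                flush()
                closer = None
            elif ch.isspace():
                flush()
            else:
                if not current:
                    context = "protected"
                current += ch
        elif ch in PROTECTED:
            flush()
            closer = PROTECTED[ch]
        elif ch == "(":
            flush()
            depth += 1
        elif ch == ")":
            flush()
            depth = max(depth - 1, 0)
        elif ch.isspace():
            flush()
        else:
            if not current:
                context = "comment" if depth > 0 else "phrase"
            current += ch
    flush()
    return words


print(classify_words("foo (bar)"))
print(classify_words("a <b (c)> (d)"))
```

For the field body "foo (bar)" the sketch labels "bar" as comment text. If
the unknown field is in fact unstructured, that label is wrong, since an
unstructured field has no comments; the gateway has no way to tell.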
In all cases, there are additional restrictions imposed by [RFC 2047]
regarding the size, placement and contents of encoded-words which
MUST be observed. Moreover, these transformations MUST be applied
both within the header of the article and within any body part
headers (including the headers of any message/rfc822). It is
generally preferable for encodings to use the charset UTF-8, although
it might be wise first to confirm that that is indeed the charset
which had been used (see 4.4.1).
That raises the issue of round-trip gateways; what mechanism is used to
convey charset (and language) information through a mail -> Usefor -> mail
transformation to ensure that the charset (and language) information is
maintained? If no such mechanism is provided, one cannot "confirm that that
is [...] the charset which had been used".
Of course, with RFC 1036, which uses RFC 822 Internet text messages, one would
simply use the Internet text message mechanisms, including RFC 2047, verbatim
and there would be no problem. The problems with the Usefor draft proposal all
relate to the attempt to force untagged (w.r.t. both charset and language) raw
utf-8 into a message format which has evolved to use charset tags and a robust
encoding mechanism for anything other than the historical default 7-bit ascii
charset, and which provides for language tags as well (since charset is
insufficient to provide language context).
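For illustration, the sketch below uses Python's standard email.header module
to show an RFC 2047 encoded-word carrying its charset tag through a 7-bit
header field and back. The sample text is hypothetical, and language tagging
(RFC 2231) is not shown.

```python
from email.header import Header, decode_header

# Hypothetical non-ASCII header text.
original = "Gr\u00fc\u00dfe"

# Encode as an RFC 2047 encoded-word tagged with the charset UTF-8.
encoded = Header(original, charset="utf-8").encode()

# The encoded form contains only 7-bit characters.
is_seven_bit = all(ord(ch) < 128 for ch in encoded)

# Decoding yields the raw bytes together with the charset tag, so the
# receiver does not have to guess which charset was used.
parts = decode_header(encoded)
charsets = {charset for _, charset in parts}
roundtrip = "".join(data.decode(charset) for data, charset in parts)

print(encoded)
print(is_seven_bit, charsets, roundtrip == original)
```

The charset travels inside the encoded-word itself, which is what raw,
untagged UTF-8 octets in a header field cannot provide.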
This has been discussed on the Usefor mailing list; RFC 1036 has the
considerable advantage of using the same Internet text message format as
email -- making combined mail/news user agents and gateways practical. The
news part of such combined applications and gateways merely needs to be able
to deal with news-specific concepts such as newsgroups, distributions, and
follow-ups. If the common Internet text message format is maintained (viz. no
unencoded 8-bit-high octets), that can continue to be the case. Conversely,
if the new Usefor format is to deviate from 822 / 2822, gateways become
considerably more difficult to implement, one cannot strictly apply
RFC 2047 / 2231 directly (as those are based on 822), combined mail/news user
agents become impractical, etc.