Sam Roberts wrote:
Quoting blilly@erols.com, on Mon, Feb 10, 2003 at 02:57:26PM -0500:
Because there are old messages, all of the existing methods need to
continue to be supported indefinitely so that those old messages can
still be read. A transition requires backward compatibility so that
the infrastructure doesn't suddenly break (as that would not constitute
a transition), and it requires a feasible plan. The Usefor draft
breaks backward compatibility and provides no feasible transition plan.
That's not "working on solutions", that's *compounding* the real problem.
All currently valid email is 7bit/ASCII. Its meaning will not change if
future email defines a meaning to 8bit message headers, and assigns that
meaning to be some character set, such as utf-8.
So, it is backwards compatible in this sense, is it not?
No, because the existing infrastructure is designed according to
the current standards, where 8-bit content in message and MIME-part
header fields is illegal. Presenting that infrastructure with
illegal content is not backwards compatible. A transition plan
could address change in stages. The first stage must include the
proviso that content generators only generate content which is
backwards compatible, i.e. compliant with the prior specification
to which the existing infrastructure has been designed. What you
describe is non-obsolescence of existing legal content, which is
an often important, but different, issue.
In theory, it is not backwards compatible with the SMTP transport, since
that expects messages to be 7bit ASCII.
No theory about it; it's not compliant with the current (or prior)
SMTP standard.
In practice, I get 8bit messages (mostly spam, but some from native
French speakers) very frequently.
It's not standards-compliant, is it? Good way to identify spam.
So, a sender of a message with utf-8 in the headers may find it not
delivered. This doesn't sound like a catastrophic break in the current
messaging system. It actually sounds like the only people who will
notice are the senders and receivers, and nobody else.
Standards compliance doesn't have a Richter scale; compliance,
like pregnancy, is a binary condition -- either content presented
to the infrastructure is standards-compliant or it is not.
Compliance is not measured by delivery failure or software crashes,
etc.; it is measured by the totality of the requirements presented
in the relevant standard (which might include, among other
requirements, reliable delivery and/or no crashes on illegal input).
The main objections to utf-8 becoming the "native" charset of internet
messages seem to be:
[...]
- It is incompatible with RFC[2]822
In my mind, it is a "compatible" extension of RFC2822. It does not
change the meaning of any currently valid messages.
But it presents illegal content to 2822-compliant software and
networks.
A utf-8 message, of course, does NOT have a defined meaning to an
RFC822 UA. One could argue that neither do RFC2047 encoded messages.
Seeing =?iso-8859-1?b?45;lakdfj322lkdkd?= as a subject isn't much
better than how my UA displays Korean.
[ignoring the errors in what I assume to be intended as an encoded-word]
It is better in a number of respects:
1. it is fully standards-compliant
2. the charset is clearly indicated. Not all systems support all
charsets, and at least with an indication of charset, software
can determine if the requested charset is supported, and if it
is not, the encoded form can be displayed.
3. the encoded form, if and when displayed, at least is legible. It
is so precisely because it consists entirely of characters which
are common to (essentially) all systems, viz. a specific subset
of the graphic characters of ANSI X3.4.
4. it passes the "telephone test"; if you need to read it aloud
(or transcribe it to paper, etc.) you can do so unambiguously
(see RFC 2396 for a description of these issues, which were also
considered in the design of URI syntax).
5. it can convey language information:
Subject: =?ibm367*en-us?q?boot?=
Subject: =?ibm367*en-gb?q?boot?=
and
Subject: =?ibm367*de?q?boot?=
mean different things.
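To make the charset/language point concrete, here is a rough Python sketch of
how such an encoded-word is built. The helper name encode_word is mine, not
anything from a standard or library, and the stdlib does not itself emit RFC
2231 language tags, so the token is assembled by hand:

    import base64

    def encode_word(text, charset="us-ascii", language=None):
        """Return text as one B-encoded encoded-word (charset, optional language)."""
        label = charset if language is None else "%s*%s" % (charset, language)
        b64 = base64.b64encode(text.encode(charset)).decode("ascii")
        return "=?%s?b?%s?=" % (label, b64)

    # The charset (and, when given, the language) travels with the text,
    # so the receiver never has to guess either one.
    print(encode_word("boot", "us-ascii", "en-us"))  # =?us-ascii*en-us?b?Ym9vdA==?=
    print(encode_word("boot", "us-ascii", "de"))     # =?us-ascii*de?b?Ym9vdA==?=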
Returning to your first point in the above quote, some clarification
is in order. It is true that neither text strings nor protocol
elements containing any 8th-bit-high octet have any meaning
(other than illegal content) to any software (not only UAs) compliant
with RFCs 561, 680, 724, 733, 821, 822, 2821, or 2822 (and all of
the MIME RFCs as far as MIME-part header fields and many media
types' body content are concerned). RFC 2047 encoded-words *never*
apply to protocol elements; they are used solely for human-readable
text, and as such have meaning only to humans, never to software.
But the protocol elements in a text message *do* have prescribed
meaning to software, whether or not encoded-words are used (in
legal context) anywhere in the same message, whether or not those
protocol elements happen to appear to be similar to the form of an
encoded-word. E.g.
Message-ID: (=?ibm367*en?q?encoded_comment?=)
  <=?us?b?c===?=@foo.net>
has a legal left-hand side in the msg-id which is *not* an
encoded-word, and that msg-id is used exactly like any other msg-id.
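A quick illustration of that distinction, using Python's stdlib (the name and
address below are made up): email.utils.formataddr will 2047-encode a
non-ASCII display name, which is phrase text meant for humans, but it never
touches the addr-spec, which is a protocol element.

    from email.utils import formataddr, parseaddr

    folded = formataddr(("Käthe Müller", "km@example.org"))
    print(folded)             # e.g. =?utf-8?b?...?= <km@example.org>
                              # phrase encoded; addr-spec left alone
    print(parseaddr(folded))  # the addr-spec comes back intact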
[...]
The interesting questions seem to be:
1 - does this mean that it can't be standardized?
It WILL be transported by some SMTP implementations, and by all NNTP
ones. But, I can see a strong objection to allowing a message format
that "may or may not be" transportable.
"some" isn't adequate; a sender expects his message to be delivered.
That can be handled with negotiation and fallback. And don't forget IMAP,
which is used for both news and mail.
2 - can a utf-8 encoded message be down-coded during transport?
This is the real problem, it seems, and it seems to be a fundamental
property of the RFC822 format: the header field formats aren't
self-describing. It's not possible to know whether a header field is
unstructured, structured
It is possible for all standard fields because the standard
indicates whether or not the field is structured, and if structured,
what the specific structure is. It is true that user-defined
("x-" prefixed) fields, non-standard fields, and extension fields
designed after a given piece of software was implemented cannot be
identified as structured or unstructured. Note that that is not a
problem for the intended purpose, viz. display, but it is a problem
with any attempt to bend, fold, spindle, or otherwise mutilate
messages in transport.
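As a sketch of what that means for software (the field table and helper name
are illustrative, and nowhere near complete): standard fields can be looked
up, unknown ones cannot.

    STANDARD_FIELDS = {
        "subject":    "unstructured",
        "comments":   "unstructured",
        "from":       "structured",
        "to":         "structured",
        "date":       "structured",
        "message-id": "structured",
    }

    def field_kind(name):
        # None means the standard is silent, so a gateway cannot safely rewrite it
        return STANDARD_FIELDS.get(name.lower())

    print(field_kind("Subject"))        # 'unstructured'
    print(field_kind("X-Loop-Detect"))  # None -- unknown extension field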
> and if structured, whether words are allowed
> to be encoded.
The rules for encoded-words in header fields are simple and clear:
1. never anywhere in any Received field (it's not clear *why*, but
that's the rule). Of course, that's not a news issue.
2. never in a MIME parameter (2231 extended parameters may be used there)
3. any "word" in a "phrase" (RFC 822 definitions), separated from
adjacent graphic characters by whitespace
4. in a comment, adjacent to an unquoted parenthesis or separated
from other adjacent graphic characters by whitespace
5. in unstructured text ("*text" per RFC 822 definitions)
The last three rules can be compressed to one; they comprise all of
the places where human-readable text (vs. protocol elements) are
located.
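Those rules are simple enough to state as a short check; the sketch below uses
my own labels for the field and context, purely as illustration:

    def encoded_word_allowed(field, context):
        if field.lower() == "received":   # rule 1: never in Received
            return False
        if context == "mime-parameter":   # rule 2: RFC 2231 is used there instead
            return False
        # rules 3-5: only where human-readable text lives
        return context in ("phrase-word", "comment", "unstructured-text")

    print(encoded_word_allowed("Subject", "unstructured-text"))   # True
    print(encoded_word_allowed("Received", "comment"))            # False
    print(encoded_word_allowed("Content-Type", "mime-parameter")) # False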
> Because of that, it's not possible to encode/decode
> without knowing the field definition, and an automated grep of
> all RFCs to determine it would be a little much to ask.
[...]
So, you can only transform some fields. Like the ones that you know
are allowed to contain utf-8, because they are in the USEFOR draft.
What about the others? What about throwing experimental headers that
have binary in them away? Or leaving them, at the gateway admin's
option, raw.
What are the problems with this approach, operationally?
One fundamental problem is that language-tagging, which is an
essential part of internationalization, is ignored. Another problem
is that text in an untagged charset conveys no information about its
charset (it is not self-describing, to use your words; how many of
those messages with 8-bit content that you have received were in
utf-8?). An encoded-word (or extended parameter) has provision for
both charset and language and that information can't be pulled out of
thin air. Note that charset is not optional in an encoded-word, and
as detailed elsewhere, language information MUST be carried if
desired by the originating user. Throwing away message content is
undesirable; transports shouldn't do that (and header fields,
experimental or not, are part of the message content). Passing
8-bit data to SMTP or IMAP is not standards-compliant. Finally,
current gateways, injection agents, etc. are not expected to
handle 8-bit content and do not perform any such transformations --
so even if it were possible to perform such a transformation in
the absence of charset or language information and without knowledge
of the field syntax, a transition period would be required *before*
any content with raw utf-8 was generated so that all of the injection
agents, gateways, etc. could be modified.
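A two-line demonstration of the "no information about its charset" point: the
same pair of octets is valid, with different meanings, in both utf-8 and
iso-8859-1, and nothing in the raw data says which was intended.

    raw = b"\xc3\xa9"
    print(raw.decode("utf-8"))       # 'é'   -- one character
    print(raw.decode("iso-8859-1"))  # 'Ã©'  -- two characters, equally "valid"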
One of the design principles (if I may temporarily jump ahead to
another topic you raised) is that of end-to-end communications.
So if a message-originating user chooses a valid charset and
specifies a language for some text, the charset and language
information should be carried from the point of origination,
maintained intact through transport, and presented to the
receiving users in an appropriate way (by using the specified
charset or indicating that it is unavailable, and by using the
language information if appropriate, as for a screen reader for
the visually impaired). It should not be immediately discarded
by the UA and then a half-assed attempt made to recreate it from
thin air in mid-transit. That is a fundamental issue; charset
and language information must be carried with the message from
the moment of origination, and that precludes any transformation
approach (barring some explicit means to preserve charset and
language information). In any event, with a proper transition
plan such a transformation would not be necessary.
This seems to be a really important issue, and speaking as an
implementor, if the mail standards HAVE to be as baroque and difficult
as they are, fine, I can keep dealing with them, but, I would
really, really, like to know the design rationale, because utf-8 sure
does seem like it would solve a whole lot of problems.
UTF-8 per se doesn't really solve any problems. As a charset, it is
subject to the same limitations as all charsets (e.g. support is
not necessarily universal; while a sender may have an implementation
of a particular charset, his recipient(s) might not). Considered as
an (8-bit) encoding of (32-bit) ISO 10646, there is another, not
entirely unrelated, issue, namely that there are different versions
of the underlying character set (in the ISO 10646/Unicode sense,
which is not the same as charset). One can, of course, use utf-8
as the charset specified with MIME (RFC 2047/2231) methods, with
all of the advantages of those methods (standards-compliance, clear
indication of charset, legible fallback display, transcribability,
language-tagging, etc.) and with exactly the same characteristics
inherent in utf-8 noted in the earlier part of this paragraph.
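For example (the sample string is arbitrary), Python's stdlib will happily use
utf-8 as the charset inside an RFC 2047 encoded-word; the result is pure ASCII
on the wire, labels its charset, and round-trips:

    from email.header import Header, decode_header

    wire = Header("Grüße aus Köln", charset="utf-8").encode()
    print(wire)   # =?utf-8?b?...?=  -- ASCII only, charset labelled

    (payload, charset), = decode_header(wire)
    print(payload.decode(charset))   # Grüße aus Köln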
The RFCs are fairly lacking in any "design and architecture of the IETF
text messaging system" section,
It's mostly about interoperability. While not specific to text
messaging, see RFC 1958. Text messaging specific information does
appear in the various relevant RFCs, often the older, obsoleted
ones more so than the current ones -- see RFCs 724 and 2045-7 for
examples of old and current ones which do discuss design choices.