Re: Dual names, IDN and ASCII, in e-mail addresses?


Jacob Palme wrote:


IDN will soon be a practical reality, and non-ASCII
characters in e-mail addresses will probably soon be common
also in the localpart of the e-mail address.

This means that lots of people are going to have e-mail
addresses like Göran(_at_)Müller(_dot_)de. And one of the main
problems with such names is that people in other countries
will have difficulty typing them. My guess is that many
Americans will have problems typing Göran(_at_)Müller(_dot_)de, and
typing names in totally different alphabets like Greek,
Arabic, Cyrillic, Chinese, Korean, Japanese will be even
more difficult.

Because of this, many people will have business cards with
English information on one side, and information in their
own language on the other side. They will then also have
two e-mail addresses, since the ASCII version of IDN should
obviously be hidden from humans as much as possible.


Actually, the "ASCII version", i.e. the one that actually is
used with DNS and other protocols, as opposed to the display
version, probably shouldn't be hidden (w.r.t. business cards
etc.) -- there are in fact many reasons to prefer it over the
display version:

1. It can be typed unambiguously, unlike the display version,
   for which, in the example given, there may be a multitude
   of ways to type lower-case Latin o with dieresis and/or
   lower-case Latin u with dieresis.  Of course when using
   IDN, there is *supposed* to be a normalization process,
   but it is not at all clear how many implementations will
   fail to implement IDN correctly, including the quite complex
   normalization process.
2. For conventional paper-and-ink business cards, as well as
   for handwritten communications, it can be written and read
   unambiguously. The difference between lower-case Latin o
   with dieresis and lower-case Latin o with double acute is
   subtle. Latin upper-case A and Cyrillic upper-case A are
   for all practical purposes indistinguishable (likewise for
   K, M, O, T, with the added possibility of confusing H, C, P
   and possibly a few others).
3. It *can* be displayed (or printed, etc.). Many display
   and print devices lack some characters. Aside from the obvious
   cases of non-Latin scripts, consider the lower-case Latin c
   with caron, which is not unusual in some eastern European
   names.

And so on. These are not new issues; the same issues arose with URLs.
See the discussion in RFC 2396, section 1.5.
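
To make points 1 and 2 concrete, here is a rough Python sketch. It assumes
Python's built-in "idna" codec (which implements the older IDNA 2003 rules)
as a stand-in for whatever IDN processing a UA would actually use; the
variable and function names are purely illustrative.

```python
import unicodedata

# Two ways to "type" the same visible domain name.
precomposed = "m\u00fcller.de"   # U+00FC, a single precomposed character
decomposed = "mu\u0308ller.de"   # 'u' followed by U+0308 COMBINING DIAERESIS


def to_ascii(name):
    """ASCII-compatible form, using Python's IDNA 2003 codec (an assumption)."""
    return name.encode("idna")


def describe(ch):
    """Code point and Unicode character name, to tell look-alikes apart."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# As raw strings the two spellings differ...
print(precomposed == decomposed)
# ...but they agree once normalized, which every implementation must get right.
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# The ASCII form is the same for both, and can be typed on any keyboard.
print(to_ascii(precomposed), to_ascii(decomposed))

# Look-alikes from point 2: Latin vs. Cyrillic A, dieresis vs. double acute.
for ch in ("A", "\u0410", "\u00f6", "\u0151"):
    print(describe(ch))
```

An implementation that skips or botches the normalization step would treat
the two spellings as different names, which is exactly the risk noted in
point 1.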

They might then use their national e-mail address for
national mail, and their international e-mail address for
international mail. And certainly mail intended for one
region may get into another region, so that e-mail with
national names in some heading field will often get resent
internationally.


I think you may be confusing the display version with the actual
protocol exchange -- the protocol exchange is always in a subset
of US-ASCII, and UAs and other software MUST use that form in
protocol exchanges and not some (device-specific!) display codes.
This is not really new either; the display name (phrase) associated
with an email address may well be encoded via RFC 2047 methods
(and MUST always be used as such in message header fields), but
a UA might display such an encoded phrase by decoding into the
specified charset (and possibly language in the case of text-to-
speech conversion for the visually impaired).  Fortunately, RFC 2047
(as amended by RFC 2231) provides charset and language-tagging
facilities in compliance with RFC 2277. I don't know of any provision
for language tagging for IDN...[*]
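
As an illustration of what a UA does with such a phrase, the following
Python sketch splits a single encoded-word into text, charset and language.
The regular expression is my own simplification of the RFC 2047/2231 syntax,
and the function name is hypothetical; it is not taken from any particular
UA.

```python
import base64
import quopri
import re

# encoded-word = "=?" charset ["*" language] "?" encoding "?" encoded-text "?="
# (simplified; real header parsers must also handle folding, adjacent
# encoded-words, and length limits)
ENCODED_WORD = re.compile(
    r"=\?(?P<charset>[^?*]+)(?:\*(?P<lang>[^?]+))?"
    r"\?(?P<enc>[QqBb])\?(?P<text>[^?]*)\?="
)


def decode_encoded_word(word):
    """Return (text, charset, language) for one encoded-word."""
    m = ENCODED_WORD.fullmatch(word)
    if m is None:
        raise ValueError(f"not an encoded-word: {word!r}")
    raw = m["text"].encode("ascii")
    if m["enc"].upper() == "Q":
        # header=True makes "_" decode to a space, as the Q encoding requires
        octets = quopri.decodestring(raw, header=True)
    else:
        octets = base64.b64decode(raw)
    return octets.decode(m["charset"]), m["charset"], m["lang"]


print(decode_encoded_word("=?ISO-8859-1*de?Q?Claus_F=E4rber?="))
```

The language tag (here "de") comes back alongside the text, so a
text-to-speech UA would have what it needs to choose a pronunciation; when
no tag is present the third element is None.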

One of the areas where MIME still, ten years after its
introduction, often fails is when people copy e-mail
messages into the bodies of new e-mail messages. Quite
often, you see text encoded according to the e-mail heading
rules for non-ASCII characters in the bodies of such
messages.


There is indeed one limitation of MIME w.r.t. message bodies, viz.
there is no convenient mechanism for switching charset or language
within a text message body.  One could of course use HTML (suitably
tagged via MIME) for the message body, using HTML's capability for
charset and language tagging. But not all UAs are capable of rendering
HTML.

The key issue in any case is implementation. It is certainly possible
for a UA to take an RFC 2047/2231 Q-encoded display name from a header
field and, by performing a few simple transformations, generate quoted-
printable encoded text suitable for a message body (and appropriately
tag the message body as quoted-printable). Since B encoding can be
transformed to Q encoding, the same can be done if the display name is
B encoded.  Likewise, it is possible to generate HTML for the body.
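
A minimal sketch of that transformation, in Python, might look like the
following. It goes via the raw octets rather than rewriting the Q text in
place (which also handles the B case), and it assumes the body is tagged
with the same charset as the encoded-word; the function name and the
returned header dictionary are illustrative only.

```python
import base64
import quopri
import re

# Simplified encoded-word pattern; an RFC 2231 "*language" suffix is skipped.
ENCODED_WORD = re.compile(r"=\?([^?*]+)(?:\*[^?]+)?\?([QqBb])\?([^?]*)\?=")


def encoded_word_to_body(word):
    """Turn a header encoded-word into (MIME header fields, body text)."""
    m = ENCODED_WORD.fullmatch(word)
    if m is None:
        raise ValueError(f"not an encoded-word: {word!r}")
    charset, enc, text = m.groups()
    raw = text.encode("ascii")
    if enc.upper() == "B":
        octets = base64.b64decode(raw)
    else:
        # In Q encoding "_" stands for a space; header=True handles that
        octets = quopri.decodestring(raw, header=True)
    # Re-encode the octets as body-style quoted-printable
    body = quopri.encodestring(octets).decode("ascii")
    headers = {
        "Content-Type": f"text/plain; charset={charset}",
        "Content-Transfer-Encoding": "quoted-printable",
    }
    return headers, body


headers, body = encoded_word_to_body("=?ISO-8859-1*de?Q?Claus_F=E4rber?=")
print(headers)
print(body)
```

Note that the language tag is simply dropped here, since the body
part has no equivalent in-line mechanism.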

The conclusion of what I have written above is that there
may be a need to extend the existing e-mail
standards, so that the e-mail address, especially in the From
and Sender fields, can be specified in a dual format, with
both the national and the international (ASCII) name
specified at the same time. Is this something IETF should
begin thinking about? Or is it something IETF has already
started thinking about?


It is a fait accompli.  The display form can be provided as a phrase
using RFC 2047/2231 encoding.  It could also be provided in an RFC 822
comment (though the proposed standard RFC 2822 deprecates comments in
address fields).  The display name is just that -- for display, not for
protocol exchanges.

It is not a simple problem to solve, especially if you
want both backwards compatibility with existing mail,
and something user-friendly, preferably user-friendly
even for people using mail programs which do not support
dual e-mail addresses.


Problem? What is the problem with
From: =?ISO-8859-1*de?Q?Claus_F=E4rber?=
 <list-ietf-wg-apps-usefor(_at_)faerber(_dot_)muc(_dot_)de>
which is legal (i.e. compatible with all existing text message clients),
includes language tagging, and is displayed appropriately in MIME-conforming
UAs?  With metamail it can also be displayed by pre-MIME UAs, so it is
user-friendly even for their users, and for those using
text-to-speech.  None of that changes even if the domain
name becomes something like qz-blurfl.de -- with a display name provided,
many UAs do not even display the raw address.  Indeed, if an IDN-ized domain
name is used, how is a text-to-speech processor supposed to know what
language to use for the domain name if it is "displayed" (i.e. converted to
speech)?

* Unicode 3.something introduced a set of codes for language tagging,
but they do not use standard language tags (which are specified using
a subset of US-ASCII), they of course are not available in earlier
versions of Unicode implementations, and they may well have been
withdrawn in Unicode 4.0.