One addition to Harald's comments...
--On Sunday, 23 March, 2008 20:43 +0100 Harald Tveit Alvestrand wrote:
Because internationalized local parts may cause email
addresses to be longer, processes which parse, store, or
handle email addresses or local parts must take extra care
not to overflow buffers, truncate addresses, exceed storage
allotments, or, when comparing, fail to use the entire
address.
Technical: this is great advice, but I don't understand how
UTF-8 changes the situation. If you aren't changing the
998-octet requirement, software that breaks for UTF-8 would
also break for ASCII headers with the same octet count.
If someone uses another representation internally (for
instance UTF-16) and has a 998-character buffer, its contents
will sometimes fit into 998 octets of UTF-8 and sometimes
not. The same goes in the other direction... I'm sure others
will think of other cases.
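A minimal Python sketch of that mismatch (the particular strings are
my own illustration, not from the thread): three local parts of
exactly 998 characters each can need very different octet counts in
UTF-8.

```python
# Each string is exactly 998 characters, yet their UTF-8 encodings
# differ widely in octet count.
ascii_part = "a" * 998      # 1 octet per character in UTF-8
cyrillic_part = "я" * 998   # 2 octets per character in UTF-8
cjk_part = "语" * 998        # 3 octets per character in UTF-8

for s in (ascii_part, cyrillic_part, cjk_part):
    print(len(s), len(s.encode("utf-8")))
# prints: 998 998 / 998 1996 / 998 2994
# A buffer sized as 998 *octets* holds the first string but overflows
# on the other two; a 998-character UTF-16 buffer has the converse
# problem when filled from 998 octets of UTF-8 input.
```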
Spencer, I'm a little confused by your even asking the question,
so let me try for a slightly different answer in case you were
asking a different question. Two of the advantages we have
with ASCII (and the closely-related ISO 8859 coded character
sets) are that every character is the same length as every other
character and that every character is exactly one octet. As a
consequence of that relationship, we have clutter in many places
in the RFC space, and probably in implementations, in which
"character" and "octet" are used interchangeably when referring
to lengths.
I note that you carefully, and correctly, said "same octet
length" above and not the "same length in characters". But RFC
821 talks about lengths in characters and, to my astonishment
and shame, so does section 184.108.40.206 of rfc2821bis (I've just
flagged that to the relevant ADs and will try to get it fixed
before the thing is published). But that is the definitional
problem, and perhaps the new risk, in a nutshell.
Now, if one goes to UTF-32, the characters are all the same
length, but four octets instead of one. An implementation that
counts characters, but allocates buffers in octets (assuming
that they are the same thing) is obviously headed for trouble,
but computing the length from the character count or vice versa
is pretty straightforward.
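That straightforward conversion can be shown in a couple of lines of
Python (the example string is arbitrary, my own choice):

```python
s = "Ångström-语"  # arbitrary example string, 10 code points
# UTF-32 is fixed-width: every code point occupies exactly 4 octets.
# (utf-32-be avoids the 4-octet BOM that plain "utf-32" prepends.)
assert len(s.encode("utf-32-be")) == 4 * len(s)
print(len(s), len(s.encode("utf-32-be")))  # prints: 10 40
```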
UTF-8 (and technically UTF-16) break both of those original
assumptions. The characters may be more than one octet long and
one cannot compute the number of octets from the number of
characters (UTF-8 is aggressively variable-length; UTF-16
occupies either two or four octets per character depending on
whether the character has a high enough code point that
surrogate pairs are needed).
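The variable widths are easy to see directly; a small sketch (the
sample characters are my own, chosen to hit each width):

```python
# Octets per character: UTF-8 uses 1 to 4; UTF-16 uses 2 or 4
# (4 only when a surrogate pair is needed, as for the emoji).
for ch in ("a", "é", "语", "😀"):
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 3, 4 octets
          len(ch.encode("utf-16-be")))  # 2, 2, 2, 4 octets
```

So neither octet count can be computed from the character count
without inspecting every character.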
9.2. Informative References
Hoffman, P., "SMTP Service Extensions for
Transmission of Headers in UTF-8 Encoding",
draft-hoffman-utf8headers-00.txt (work in
progress), December 2003.
Technical: I know this is how we refer to Internet Drafts, but
"2003" isn't "work in progress". You might
s/work in progress/expired Internet Draft/, or (probably better)
simply move the rest of the full citation to the
Acknowledgements section - it didn't seem like you really
expected anyone to actually refer to this reference, anyway :-)
It's a part of the history, but we can probably safely lose it.
It is referenced, and its historical role mentioned, in RFC
4952, so the utf8headers reference can almost certainly be
dropped.
On the more general subject, I've tried raising the issue of
these documents that are referenced for historical reasons and
hence, IMO, should not say "work in progress" and should include
the exact file name so that people can find them if interested.
I've gotten nowhere, so it is someone else's turn. What is
really needed, I think, is a policy on these sorts of things,
corresponding modifications to tools like xml2rfc, etc. I don't
think hiding the references in inline text is the right answer,
but that is just my opinion.
IETF mailing list