Re: The transition to UTF-8 header fields

1999-02-11 00:36:03
Your argument is irrelevant because it assumes that there is some
value in the type 2 approach.  There is not.
  [ ... ]
It is pointless to choose a strategy that creates an extra header 
which adds no value.

Do you want another disaster like 8BITMIME? 

No, but I fail to see how having an extra header (or not) would
affect that outcome.  The header doesn't help the user agent
distinguish UTF-8 from other 8bit text because the text in the
message header comes from a variety of sources. The header
doesn't help the MTA because the MTA can just as easily scan
for UTF-8 as it can scan for the new header.
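
As a rough illustration of how cheap such a scan is, here is a minimal
Python sketch. It only flags octets with the high bit set in the header
block, which is a simplification of "scanning for UTF-8"; the function
names are hypothetical and do not come from any MTA.

```python
def header_block(message: bytes) -> bytes:
    """Return everything before the first empty line (the message header)."""
    for sep in (b"\r\n\r\n", b"\n\n"):
        head, found, _ = message.partition(sep)
        if found:
            return head
    # No empty line found: treat the whole message as header.
    return message


def has_8bit_header(message: bytes) -> bool:
    """True if any octet in the header block has its high bit set."""
    return any(octet >= 0x80 for octet in header_block(message))
```

The scan is a single pass over the header and needs no knowledge of
which fields are present.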

We will almost certainly need a UTF8 SMTP option anyway, because
the vast majority of deployed MTAs cannot handle 8bit headers,
some of them failing miserably.  I don't like this conclusion 
because adding complexity to MTAs makes them less reliable.  
I would like to find a better way, but adding another field
to the message header doesn't solve the problem of MTAs that
break when you feed them 8bit headers.

If not, you should be more
careful in your cost-benefit analyses.

Reader strategy 1 interferes with many existing messages. Reader
strategy 2 does not. 

I disagree.  Strategy 1 does not "interfere" with the operation
of anything, though in rare cases existing text may be mis-displayed.  

Such cases are very rare with 8859/1 text, because UTF-8 characters
for values greater than 127 are represented by sequences of 2 to 6
octets, each of which has the most significant bit set.
All but the first octet must be in the range 128-191.
In 8859/1 these correspond mostly to special characters and
upper case vowels with accents or diacritical marks.  In all of 
the languages that I know of that use the Latin alphabet, a string 
of more than two or three characters which consists entirely of 
special characters and upper case vowels is very unlikely to 
occur in practice in the human-readable portion of a message header.
Nobody has a name which contains 1/2 or the copyright symbol, and 
very few words in the Subject field can be spelled entirely with upper
case accented letters.
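
The structural rule described above can be sketched in Python as
follows. This is only an illustration under stated assumptions: it
checks the 2-to-6-octet form described in this message, not overlong
encodings or the later restriction of UTF-8 to 4 octets, and the
function names are hypothetical.

```python
def lead_length(octet: int) -> int:
    """Sequence length announced by a lead octet, or 0 if it cannot lead."""
    if octet < 0x80:
        return 1            # plain ASCII
    if 0xC0 <= octet <= 0xDF:
        return 2
    if 0xE0 <= octet <= 0xEF:
        return 3
    if 0xF0 <= octet <= 0xF7:
        return 4
    if 0xF8 <= octet <= 0xFB:
        return 5
    if 0xFC <= octet <= 0xFD:
        return 6
    return 0                # 0x80-0xBF, 0xFE and 0xFF cannot start a sequence


def looks_like_utf8(data: bytes) -> bool:
    """True if every octet belongs to a structurally complete sequence."""
    i = 0
    while i < len(data):
        n = lead_length(data[i])
        if n == 0:
            return False
        tail = data[i + 1:i + n]
        # Every trailing octet must be present and in the range 128-191.
        if len(tail) != n - 1 or any(not 0x80 <= b <= 0xBF for b in tail):
            return False
        i += n
    return True


# 8859/1 names fail the check; the same name in UTF-8 passes.
print(looks_like_utf8(b"Jos\xe9"))      # "Jose" with e-acute in 8859/1
print(looks_like_utf8(b"Jos\xc3\xa9"))  # the same name in UTF-8
# The rare false positive: A-tilde followed by the copyright sign in 8859/1.
print(looks_like_utf8(b"\xc3\xa9"))
```

With typical 8859/1 text, an accented letter is followed by an ASCII
letter or by nothing at all, so the trailing-octet test fails at once.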

Text in other 8-bit character sets, notably those with a Latin alphabet 
in the 0-127 range and a non-Latin alphabet in the 128-255 range, is 
somewhat more likely to produce a valid UTF-8 string.  But even then 
this is not terribly likely, because the number of octets in a character
is encoded in the upper bits of the first octet, and all subsequent 
octets in that character must be in the range 128-191.  And I am
under the impression that the largest portion of the installed base
that uses 8-bit characters in message headers uses 8859/1.

You're advocating another strategy. Your strategy entails certain risks
that aren't present in strategy 2. 

Yes, but strategy 2 also entails certain risks.  At best, adding
an extra header will tell a user agent that a particular sequence
of 8-bit characters in the header *might* be UTF-8; in general, 
it cannot guarantee that it *is* UTF-8.  Trusting the header alone 
is less reliable than trusting just the validity of the UTF-8 sequence.
Trusting both the header and the sequence validity (which you would
do anyway, since how do you display an invalid UTF-8 sequence?)
doesn't improve the reliability much, and the additional reliability 
comes at the cost of having to scan the entire message header before 
displaying anything.  I sincerely doubt that many implementors would
bother doing this.
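
For comparison, a reader that trusts sequence validity alone can decide
field by field, without looking at the rest of the header. A minimal
Python sketch, assuming 8859/1 as the fallback character set (the
function name and the fallback choice are illustrative assumptions, and
Python's decoder applies stricter rules than the 1999 definition of
UTF-8):

```python
def display_text(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode header text as UTF-8 if it is valid, else use the fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the legacy 8-bit character set.
        return raw.decode(fallback)
```

The only text this mis-displays is legacy 8-bit text that happens to be
valid UTF-8, such as the two 8859/1 octets 0xC3 0xA9, which would be
shown as a single accented letter.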

It's _possible_ that the resulting
damage is so small that it is outweighed by the tiny costs of an extra
header field; but the necessary studies have not yet been performed.

I understand from I18N experts that this property (that text in other
8-bit character sets rarely forms a valid UTF-8 sequence) was a 
design goal of UTF-8, and that the "necessary studies" have indeed 
been performed.  I haven't tried to look up those studies myself,
because I trust their judgement and because my back-of-the-envelope 
analysis seems sufficient.