Re: RFC 2047 and gatewaying


D. J. Bernstein <djb(_at_)cr(_dot_)yp(_dot_)to> wrote:

Bruce Lilly writes:

and in fact Usenet abounds with untagged charsets

Obviously we can't make all of them work simultaneously. The way out of
this mess is for message readers to support UTF-8---as many implementors
have already done---so that message writers can safely use UTF-8.


Remember that UTF-8 and RFC 2047 are completely orthogonal features
that solve different problems.

UTF-8 (or any other Unicode transformation) solves the multiple charset
problem: Instead of a large number of different charsets that have to be
supported, you only have one.
RFC 2047 solves two other problems: charset (and language) tagging and
encoding 8bit data in 7bit.
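
To make the distinction concrete, here is a minimal sketch using Python's
standard email.header module. The sample name is made up for illustration;
the point is only that RFC 2047 wraps whatever charset you pick in a tagged,
7bit-safe encoded-word, and that a reader can recover the text from the tag
alone.

```python
from email.header import Header, decode_header, make_header

name = "Färber"  # illustrative header text, not taken from any real message

# Legacy 8bit charset, tagged and made 7bit-safe by RFC 2047.
legacy = Header(name, charset="iso-8859-1").encode()
# UTF-8, tagged and made 7bit-safe in exactly the same way.
unicode_ = Header(name, charset="utf-8").encode()


def decode(value: str) -> str:
    """Decode encoded-words back to text using the charset tag they carry."""
    return str(make_header(decode_header(value)))


print(legacy)    # pure ASCII, carries the iso-8859-1 tag
print(unicode_)  # pure ASCII, carries the utf-8 tag
print(decode(legacy), decode(unicode_))
```

Both encoded forms travel as plain ASCII and decode to the same text; only
the charset named in the tag differs.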

This gives us four different possibilities:

1. Unencoded legacy 8bit charsets: Clearly a mess because of the
missing charset tagging. Users have to set the charset on a
per-newsgroup basis; this is unworkable for email, where contacts come
from different locales.

2. RFC 2047-encoded legacy 8bit charsets: An improvement over #1
because user agents can detect the charset automatically. (I assume that
most current user agents already support RFC 2047, whose ancestor is
nearly a decade old.) However, user agents that don't know a charset
can't display an encoded character even if that character is available
on the current system.

3. RFC 2047-encoded UTF-8: Similar to #2 but it solves the problem of
having multiple charsets. It can coexist with #2. Most recent user
agents already support both UTF-8 and RFC 2047, so they don't have a
problem with this.

4. Unencoded UTF-8: Granted, the most appealing solution. It can coexist
with #2 and #3. It could also coexist with #1 because UTF-8 has a
distinctive structure that could be described as self-tagging.
The problem is that this is not currently implemented: Current
newsreaders have a setting for a 'default' or fallback charset (i.e. the
legacy charset from #1), but they can't automatically distinguish UTF-8
from legacy encodings. This also rules out MIME- and charset-ignorant
user agents: These can't handle two different charsets -- legacy and
UTF-8 -- at the same time. In mail, unencoded UTF-8 will also cause more
problems than other 8bit charsets because it makes use of the range
0x80..0x9F (which the ISO 8859 charsets reserve for control codes).
Then, of course, it's incompatible with current standards, and there's
no way to define a confined environment for it (like the 8BITMIME
world), because translating at the boundary without potential data loss
is impossible.
Further, the Unicode language tags are deprecated in favour of other
means of language tagging.
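
As a rough sketch of what "self-tagging" could mean in practice: try a
strict UTF-8 decode first and fall back to the user's configured legacy
charset only if that fails. The function names and the iso-8859-1 default
are my own assumptions for illustration, not something existing newsreaders
are known to do. The second helper simply shows that UTF-8 puts bytes into
0x80..0x9F.

```python
from typing import Tuple


def decode_untagged(raw: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Guess the charset of untagged 8bit text (hypothetical helper).

    Text written in a legacy 8bit charset is rarely valid UTF-8, so a
    strict UTF-8 decode that succeeds is taken as the tag.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not well-formed UTF-8: use the configured fallback charset.
        return raw.decode(fallback), fallback


def uses_c1_range(raw: bytes) -> bool:
    """Report whether any byte falls into 0x80..0x9F (hypothetical helper)."""
    return any(0x80 <= b <= 0x9F for b in raw)


print(decode_untagged("Färber".encode("utf-8")))
print(decode_untagged("Färber".encode("iso-8859-1")))
print(uses_c1_range("Ä".encode("utf-8")))       # bytes C3 84
print(uses_c1_range("Ä".encode("iso-8859-1")))  # byte C4
```

This is only a heuristic: pure ASCII passes as UTF-8 trivially, and a short
legacy string can happen to be well-formed UTF-8 by accident.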

So #3 finally wins over #4: It's already standardised, it's implemented
and it does not cause problems with software that can't handle the full
8bit range.

Claus
-- 
------------------------ http://www.faerber.muc.de/ ------------------------
OpenPGP: DSS 1024/639680F0 E7A8 AADB 6C8A 2450 67EA AF68 48A5 0E63 6396 80F0