Re: 3028bis open issue #3: require 2047 decoding?

2005-07-01 15:20:09

Michael Haardt <michael(_at_)freenet-ag(_dot_)de> writes:
On Thu, Jun 30, 2005 at 05:43:53PM -0700, Philip Guenther wrote:
   Comparisons are performed in Unicode.  Implementations convert
   text from header fields in all charsets [HEADER-CHARSET] to
   Unicode as input to the comparator (see 2.7.3).  Implementations
   must be capable of decoding US-ASCII, ISO-8859-1, the US-ASCII
   subset of ISO-8859-* character sets, and UTF-8.

That sounds good.

Hmm, I think the paragraph needs to also specify that text in unknown
charsets never matches, no?

Either that, or not decoding it if it cannot be converted.

So, if the implementation didn't understand the charset, it would
instead convert the raw ASCII (e.g., "=?charset?Q?blah?=") to Unicode
(an identity function, yes) and feed that to the comparator?  I guess I
can see _a_ use to that (scripts could match "=?charset?" to check for
use of particular charsets that the implementation doesn't support).

Hmm, I don't see any guidance in RFC 2047 or 2048 that could apply to
this.  Oh well...

I think leaving the encoded word intact in this case is slightly better. But
only slightly. The bottom line is that if there's a problem decoding (either
because you don't know the encoding or the encoded material is gronked somehow)
or converting (either because you don't recognize the charset or the material
doesn't actually match the charset), getting totally consistent and reasonable
match behavior is next to impossible. As such, any preference we express should
be just that: A preference. There's no best practice here as far as I can tell.
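
To make the "leave the encoded word intact" preference concrete, here is a minimal Python sketch. The names ENCODED_WORD, decode_word and decode_header_text are invented for this illustration and do not come from any Sieve implementation. It also ignores RFC 2047 details such as dropping whitespace between adjacent encoded words, so treat it as an outline of the fallback behavior, not a complete decoder.

```python
import base64
import binascii
import quopri
import re

# Rough pattern for an RFC 2047 encoded word: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def decode_word(match):
    """Decode one encoded word; return it unchanged on any failure."""
    charset = match.group(1)
    encoding = match.group(2).upper()
    payload = match.group(3)
    try:
        if encoding == "B":
            raw = base64.b64decode(payload, validate=True)
        else:
            # header=True makes "_" decode to a space, as the Q encoding requires.
            raw = quopri.decodestring(payload.encode("ascii"), header=True)
        # LookupError: charset not known to this implementation.
        # UnicodeError: bytes do not actually match the declared charset.
        return raw.decode(charset)
    except (LookupError, UnicodeError, binascii.Error, ValueError):
        # Problem decoding or converting: leave the raw ASCII form intact,
        # so a script can still match on "=?charset?" if it wants to.
        return match.group(0)


def decode_header_text(value):
    """Produce the Unicode text that would be fed to the comparator."""
    return ENCODED_WORD.sub(decode_word, value)
```

With this, "=?iso-8859-1?Q?caf=E9?=" is handed to the comparator as "café", while "=?x-unknown?Q?blah?=" passes through unchanged. The other policy raised above (text in unknown charsets never matches) would instead have decode_word signal the failure to the caller, so the test can be made to fail.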

But now that you cited the scope of 3028bis, can we do that?

Good question.  RFC 3028 did not specify how an implementation should
handle charsets that it doesn't understand, effectively leaving it
implementation defined.  Is that causing interoperability problems?

I haven't seen actual problems in this space, but others may have
different experiences to report.

So, then my understanding is that fixing it is in scope.

It's in scope if a fix is possible. Aside from having a minimal set of charsets
you have to support (which I believe we already have), I don't see how to
fix this.