
Re: Matching NUL characters

2003-04-04 06:12:09

I'm not sure about that.  I wasn't here when the draft was discussed,
but it seems to me "Implementations decode header charsets to UTF-8"
is as normative as a MUST.  "Implementations should decode charsets
..." is a cop-out since it is _very_ hard to support every charset in
existence.  The required charsets are therefore listed explicitly
instead.

That may have been the intention, but there should be no room for
interpretation, so let's fix it.  I see three issues:

o  Must MIME decoding of correct words succeed?
o  How are broken words treated?
o  How are unknown character sets treated?

The current RFC says the following, although not very clearly, and
perhaps not as it was intended when written:

o  MIME decoding need not be implemented, in which case everything is
   (legally) treated as literal, although implementing it is a strong
   SHOULD.  If not implemented, comparison works unless 8-bit
   characters are encountered.
o  Behaviour for broken words is not specified.
o  Unknown character sets are not converted, but assumed to be the
   one-byte code US-ASCII.  Comparison fails if 8-bit characters are
   encountered.

You don't seem to agree with that, and neither do I, so I suggest:

o  MIME decoding MUST be implemented.
o  Broken words are treated as literal strings.  (When their parser
   fails, MUAs either do that or decode them to junk.)
o  Correct words with unknown character sets are treated as literal
   strings.  The assumption that all unknown character sets are one-byte
   codes and identical to US-ASCII in their lower 128 octets is not sound.

Rationale: I would like the following test to be true:

   Subject: abc =?iso-8859-1?q?=c3abc?= =?unknown?q?def?=

   header :contains ["Subject"] ["abc"]

The header cannot be decoded entirely, so Sieve scripts should view it
as "abc", followed by the UTF-8 character for capital A with tilde
(octet C3 in ISO-8859-1), then "abc" and the literal
"=?unknown?q?def?=".

RFC 3028 would let the test fail, because the _whole_ header could not
be converted and there is an 8-bit character.
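
To make the proposed rules concrete, here is a rough Python sketch of
them.  It is only an illustration, not text for the draft: the names
decode_header, decode_word and q_decode are made up, "known charset"
is assumed to mean whatever Python's codec registry knows, and it does
not remove the whitespace between adjacent encoded words as RFC 2047
asks.

```python
import base64
import codecs
import re

# RFC 2047 encoded-word: =?charset?encoding?text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def q_decode(text):
    """Decode the "Q" encoding; raise ValueError on a broken escape."""
    raw = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "=":
            pair = text[i + 1:i + 3]
            if len(pair) != 2:
                raise ValueError("truncated escape")
            raw += bytes.fromhex(pair)  # ValueError if not two hex digits
            i += 3
        elif ch == "_":
            raw.append(0x20)  # underscore stands for a space
            i += 1
        else:
            raw.append(ord(ch))  # ValueError if not a single octet
            i += 1
    return bytes(raw)


def decode_word(match):
    """Decode one encoded word, or return it unchanged as a literal."""
    charset, encoding, text = match.groups()
    try:
        codecs.lookup(charset)
    except LookupError:
        return match.group(0)  # unknown character set: keep literal
    try:
        if encoding.lower() == "q":
            raw = q_decode(text)
        else:
            raw = base64.b64decode(text, validate=True)
        return raw.decode(charset)
    except ValueError:
        return match.group(0)  # broken word: keep literal


def decode_header(value):
    """Decode what can be decoded and leave the rest as it is."""
    return ENCODED_WORD.sub(decode_word, value)


subject = decode_header("abc =?iso-8859-1?q?=c3abc?= =?unknown?q?def?=")
contains_abc = "abc" in subject
```

With these rules the unknown word stays in the header as a literal
string, and the ":contains" test is applied to the partially decoded
result rather than failing for the whole header.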

