Re: Matching NUL characters

  o MIME decoding may not be implemented and everything is (legally)
    treated as literal, although implementing it is a strong SHOULD.
    If not implemented, comparison works unless 8-bit characters are
    encountered.


no, MIME decoding is required.


It is not, because decoding etc. in section 2.7.2 is only a SHOULD and
is followed by:

   If implementations fail to support the above behavior, they MUST
   conform to the following:

      No two strings can be considered equal if one contains octets
      greater than 127.

So it is not *required*, but it's not clear what is required.  That's
subject to interpretation.
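
To make the quoted fallback concrete, here is a minimal Python sketch of
one possible reading of it.  The name `fallback_equal` is made up for
illustration, and comparing raw octets (instead of applying a Sieve
comparator such as i;ascii-casemap) is a simplifying assumption.

```python
def fallback_equal(a: bytes, b: bytes) -> bool:
    """Equality under the quoted fallback rule (hypothetical helper).

    Assumption: strings are compared octet by octet, with no comparator
    or case folding applied.
    """
    # "No two strings can be considered equal if one contains octets
    # greater than 127."
    if any(octet > 127 for octet in a + b):
        return False
    return a == b


ascii_match = fallback_equal(b"abc", b"abc")           # plain ASCII
eight_bit = fallback_equal(b"\xc3\x84bc", b"\xc3\x84bc")  # identical, but 8-bit
```

Under this reading, `ascii_match` is true while `eight_bit` is false, even
though the two 8-bit strings are octet-for-octet identical.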

  o  Broken words are treated as literal strings (MUAs either do that or
     decode them to junk, when their parser fails)


words that don't match the grammar are treated as literals.  how to
treat words that are broken for other reasons is specified in RFC 2047
section 6.


I see the following in section 6:

   A mail reader need not attempt to display the text associated with an
   'encoded-word' that is incorrectly formed.  However, a mail reader
   MUST NOT prevent the display or handling of a message because an
   'encoded-word' is incorrectly formed.

So what is Sieve supposed to do with incorrectly formed words and where
is that defined?

   If the mail reader does not support the character set used, it may
   (a) display the 'encoded-word' as ordinary text (i.e., as it appears
   in the header), (b) make a "best effort" to display using such
   characters as are available, or (c) substitute an appropriate message
   indicating that the decoded text could not be displayed.

That's fine for a human interface, but an automated filter should
have a determined behaviour.

  o  Correct words with unknown character sets are treated as literal
     strings.  The assumption that all unknown character sets are
     one-byte codes and identical to US-ASCII in their lower 128
     octets is not sound.


remember that you can only compare against literal UTF-8 strings
(well, at least until we have a variables extension), so the
alternative is to make _all_ matches against strings in an unknown
character set fail.  that is not very useful.


The alternative is not to decode and convert what can't be decoded
and converted.  If the word is broken, or the character set is unknown,
then don't decode and convert it.

  Rationale: I would like the following test to be true:
  
     Subject: abc =?iso-8859-1?q?=c3abc?= =?unknown?q?def?=
  
     header :contains ["Subject"] ["abc"]
  
  The header can not be decoded entirely, so Sieve scripts should
  view it as "abc ", followed by the UTF-8 character for capital A
  with tilde, "abc" and "=?unknown?q?def?=".
  
  RFC 3028 would let the test fail, because the _whole_ header could
  not be converted and there is an 8-bit character.


actually, this isn't clear in the RFC.  the string "abc" doesn't
contain 8-bit characters, and neither does the matching substring
"abc"...  I think the RFC can be read either way.  I favour that the
test is allowed to fail, though.


I favour well determined behaviour.
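
The per-word policy argued for above (decode what can be decoded, leave
the rest literal) could look roughly like the following Python sketch.
The names `decode_word` and `decode_header_lenient` are hypothetical, and
the sketch deliberately ignores details such as dropping white space
between adjacent encoded-words.

```python
import base64
import re

# RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")
HEX_ESCAPE = re.compile(rb"=([0-9A-Fa-f]{2})")


def _q_decode(text: str) -> bytes:
    """Decode the "Q" encoding: "_" means space, "=XX" is a hex octet."""
    data = text.replace("_", " ").encode("ascii")
    return HEX_ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")), data
    )


def decode_word(match: re.Match) -> str:
    """Decode one encoded-word, or return it unchanged if that fails."""
    charset, encoding, text = match.groups()
    try:
        if encoding in "qQ":
            raw = _q_decode(text)
        else:
            raw = base64.b64decode(text, validate=True)
        return raw.decode(charset)
    except (LookupError, ValueError):
        # Unknown character set (LookupError) or broken word (ValueError):
        # do not decode, keep the word as a literal string.
        return match.group(0)


def decode_header_lenient(value: str) -> str:
    """Decode each encoded-word independently of its neighbours."""
    return ENCODED_WORD.sub(decode_word, value)


subject = "abc =?iso-8859-1?q?=c3abc?= =?unknown?q?def?="
decoded = decode_header_lenient(subject)
# 0xC3 in ISO-8859-1 is "Ã"; the word in the unknown charset stays literal:
# decoded == "abc Ãabc =?unknown?q?def?="
```

With this policy the `:contains "abc"` test is true for the example
header, and the outcome does not depend on whether other words in the
same header could be decoded.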

Concerning my original point of NUL characters, I read RFC 2047 again:

   Only printable and white space character data should be encoded using
   this scheme.  However, since these encoding schemes allow the
   encoding of arbitrary octet values, mail readers that implement this
   decoding should also ensure that display of the decoded data on the
   recipient's terminal will not cause unwanted side-effects.

First, the requirement of printable and white space character data is
a SHOULD.  Second, an embedded NUL causing a string to be truncated
is an "unwanted side-effect" to me, but I have to accept it's only a
SHOULD, too.

This can probably be interpreted to mean that a Sieve implementation
failing to match the substring "def" in "=?iso-8859-1?q?abc=00def?="
is correct.  Amazing, isn't it?
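
The effect can be illustrated with a short Python sketch.  It uses the
standard library's `email.header.decode_header` for the decoding step;
the truncation at the first NUL is only a simulation of what an
implementation based on C strings might do, not a description of any
particular Sieve implementation.

```python
from email.header import decode_header

# Decode the single encoded-word; the result is raw octets plus a charset.
raw, charset = decode_header("=?iso-8859-1?q?abc=00def?=")[0]
text = raw.decode(charset)          # "abc\x00def", NUL included

# Assumption: a C-string based implementation stops at the first NUL.
c_string_view = text.split("\x00", 1)[0]   # "abc"

match_full = "def" in text                 # True
match_truncated = "def" in c_string_view   # False
```

Both outcomes are arguably permitted by the wording quoted above, which
is exactly the problem.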

Again, I favour well determined behaviour.  All those SHOULDs make it
very hard to implement Sieve, which is not helpful.  Converting some to
MUSTs would make it much easier.

Michael