Re: Matching NUL characters


[michael(_at_)freenet-ag(_dot_)de]:


  > I'm not sure about that.  I wasn't here when the draft was discussed,
  > but it seems to me "Implementations decode header charsets to UTF-8"
  > is as normative as a MUST.  "Implementations should decode charsets
  > ..." is a cop-out since it is _very_ hard to support every charset in
  > existence.  the list of required charsets is therefore listed
  > explicitly instead.
  
  That may have been the intention, but there should be no room for
  interpretation, so let's fix it.  I see three issues:
  
  o  Must MIME decoding of correct words succeed?
  o  How are broken words treated?


this is specified in RFC 2047.

  o  How are unknown character sets treated?


this is specified in the Sieve RFC.

  The current RFC says, although not in a way that is overly clear,
  and which may not even reflect the intention when it was written:
  
  o MIME decoding may not be implemented and everything is (legally)
    treated as literal, although implementing it is a strong SHOULD.
    If not implemented, comparison works unless 8-bit characters are
    encountered.


no, MIME decoding is required.

  o  Behaviour for broken words is not specified.


RFC 2047.

  o  Unknown character sets are not converted, but assumed to be the
     one-byte code US-ASCII.  Comparison fails if 8-bit characters are
     encountered.


yes.

  You don't seem to agree with that, and neither do I, so I suggest:
  
  o  MIME decoding MUST be implemented.
  o  Broken words are treated as literal strings (MUAs either do that or
     decode them to junk, when their parser fails)


words that don't match the grammar are treated as literals.  how to
treat words that are broken for other reasons is specified in RFC 2047
section 6.

  o  Correct words with unknown character sets are treated as literal
     strings.  The assumption that all unknown character sets are
     one-byte codes and identical to US-ASCII in their lower 128
     octets is not sound.


remember that you can only compare against literal UTF-8 strings
(well, at least until we have a variables extension), so the
alternative is to make _all_ matches against strings in an unknown
character set fail.  that is not very useful.

  Rationale: I would like the following test to be true:
  
     Subject: abc =?iso-8859-1?q?=c3abc?= =?unknown?q?def?=
  
     header : contains ["Subject"] ["abc"]
  
  The header cannot be decoded entirely, so Sieve scripts should
  view it as the UTF-8 character for capital A with diaeresis,
  followed by "abc" and "=?unknown?q??=".
  
  RFC 3028 would let the test fail, because the _whole_ header could
  not be converted and there is an 8-bit character.


actually, this isn't clear in the RFC.  the string "abc" doesn't
contain 8-bit characters, and neither does the matching substring
"abc"...  I think the RFC can be read either way.  I favour allowing
the test to fail, though.
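
for illustration only, here is a rough Python sketch of the
"unknown charsets stay literal" treatment proposed in the quoted text,
applied to the example Subject line.  it is not taken from RFC 3028 or
any Sieve implementation.  the names ENCODED_WORD, q_decode,
decode_word and decode_header_value are made up for this sketch,
"known charset" is assumed to mean "whatever Python's codecs module
can look up", and the RFC 2047 rule about dropping whitespace between
adjacent encoded-words is deliberately left out to keep it short.

```python
import base64
import codecs
import re

# Rough shape of an RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def q_decode(text: str) -> bytes:
    """Decode the "Q" encoding: "_" is a space, "=XX" is a hex octet."""
    data = text.replace("_", " ").encode("ascii")
    return re.sub(
        rb"=([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def decode_word(match: re.Match) -> str:
    """Decode one encoded-word, or return it unchanged as a literal."""
    charset, encoding, payload = match.groups()
    try:
        codecs.lookup(charset)
    except LookupError:
        # Assumption of this sketch: unknown charset means keep the literal.
        return match.group(0)
    try:
        if encoding.lower() == "q":
            raw = q_decode(payload)
        else:
            raw = base64.b64decode(payload, validate=True)
        return raw.decode(charset)
    except (ValueError, LookupError):
        # Undecodable payload: also kept as a literal in this sketch.
        return match.group(0)


def decode_header_value(value: str) -> str:
    """Replace every decodable encoded-word; leave everything else alone."""
    return ENCODED_WORD.sub(decode_word, value)


subject = "abc =?iso-8859-1?q?=c3abc?= =?unknown?q?def?="
decoded = decode_header_value(subject)
# Octet 0xC3 in ISO-8859-1 is U+00C3; the second word stays literal.
contains_abc = "abc" in decoded
```

with this treatment a ":contains" test for "abc" on the decoded value
succeeds, because the unknown-charset word is simply carried along as
literal text instead of making the whole header undecodable.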

-- 
Kjetil T.                       |  read and make up your own mind
                                |  http://www.cactus48.com/truth.html