
Re: Matching NUL characters

2003-04-04 04:31:25


  > I don't believe this is true.  every implementation must understand
  > that the sequence of N octets making up one UTF-8 character is _one_
  > character, not N.
  Where would it make a difference, if the implementation could not
  decode headers to UTF-8, which is allowed? Quoting section 2.7.2:
     Implementations decode header charsets to UTF-8.  Two strings are
     considered equal if their UTF-8 representations are identical.
     Implementations should decode charsets represented in the forms
     specified by [MIME] for both message headers and bodies.
     Implementations must be capable of decoding US-ASCII, ISO-8859-1,
     the ASCII subset of ISO-8859-* character sets, and UTF-8.
  If implementations fail to support the above behavior, they MUST
  conform to the following:
     No two strings can be considered equal if one contains octets
     greater than 127.
  To me, that means an implementation could entirely forget about UTF-8.
  If someone used a script that contains UTF-8 characters, it makes no
  difference except for comparisons, and those are always false if
  the string contains the UTF-8 encoding of Unicode characters >127.

I'm not sure about that.  I wasn't here when the draft was discussed,
but it seems to me "Implementations decode header charsets to UTF-8"
is as normative as a MUST.  "Implementations should decode charsets
..." is a cop-out since it is _very_ hard to support every charset in
existence.  The required charsets are therefore listed explicitly
instead.

So the fallback rule only applies to headers in unknown charsets.
In that case, if the decoded header contains octets with the 8th bit
set, it can never match anything.
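
A minimal sketch of this reading in Python, not taken from any real
Sieve implementation.  The function names to_utf8 and header_contains
are made up for illustration, and the input is assumed to be the
already Q-decoded octets of one encoded word plus its charset label.
Python's codec registry stands in for "the charsets this
implementation knows".

```python
from typing import Optional


def to_utf8(octets: bytes, charset: str) -> Optional[bytes]:
    """Return the UTF-8 form of octets, or None if decoding is impossible."""
    try:
        return octets.decode(charset).encode("utf-8")
    except (LookupError, UnicodeDecodeError):
        # Unknown charset label, or octets that are invalid in that charset.
        return None


def header_contains(octets: bytes, charset: str, key: str) -> bool:
    """Hypothetical ':contains' test following the interpretation above."""
    key_utf8 = key.encode("utf-8")
    decoded = to_utf8(octets, charset)
    if decoded is not None:
        # Normal case: compare the UTF-8 representations.
        return key_utf8 in decoded
    # Fallback rule: a string with octets greater than 127 never matches.
    if any(o > 127 for o in octets) or any(o > 127 for o in key_utf8):
        return False
    return key_utf8 in octets


print(header_contains(b"abc\x80def", "iso-8859-1", "abc"))
print(header_contains(b"abc\x80def", "x-unknown-charset", "abc"))
```

With a known charset the substring test succeeds.  With an unknown
charset label the same octets fall under the fallback rule and the
test is false.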

  Personally, I am surprised that UTF-8 aware string comparisons do
  not require an extension, since RFC conforming Sieve implementations
  do not absolutely have to support it.  Depending on the (conforming)
  implementation, the following test may be true or false
     Subject: =?iso-8859-1?q?abc=80def?=
     header :contains ["Subject"] ["abc"]
  "Fail to support the above behaviour" means "fail to decode MIME words"
  (no MIME support) or "fail to decode MIME words to UTF-8" (MIME support,
  but no character set translation).  If the intention was different,
  the specification should be, too.

The specification may not be completely clear, but let's not postulate
that broken behaviour is allowed if we don't have to.

BTW, if you change the header to

  Subject: =?iso-8859-15?q?abc=a0def?=

an implementation can fail to decode it to UTF-8, since only the ASCII
subset of ISO-8859-15 is required, and then the fallback rule applies.
I'm not completely sure, but I think the test MUST fail due to the
presence of the octet 0xA0 (U+00A0) in the subject string.
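
For reference, here is what an implementation that does know
ISO-8859-15 would see, sketched with Python's standard email.header
module.  This only illustrates the decoding step.  The helper name
subject_to_utf8 is made up, and nothing here is mandated by the draft.

```python
from email.header import decode_header


def subject_to_utf8(value: str) -> bytes:
    """Decode MIME encoded words in a header value and return UTF-8 octets."""
    out = []
    for payload, charset in decode_header(value):
        if isinstance(payload, bytes):
            # Encoded words come back as raw octets plus a charset label;
            # unlabelled fragments are treated as US-ASCII here.
            payload = payload.decode(charset or "us-ascii")
        out.append(payload)
    return "".join(out).encode("utf-8")


utf8 = subject_to_utf8("=?iso-8859-15?q?abc=a0def?=")
print(utf8)                           # b'abc\xc2\xa0def'
print(any(o > 127 for o in utf8))     # True
```

The octet 0xA0 becomes the two UTF-8 octets C2 A0, so the string still
contains octets above 127 either way.  The difference is only whether
the implementation is allowed to compare them.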

Kjetil T.                       |  read and make up your own mind
