o MIME decoding may not be implemented and everything is (legally)
treated as literal, although implementing it is a strong SHOULD.
If not implemented, comparison works unless 8-bit characters are
encountered.
no, MIME decoding is required.
It is not, because decoding etc. in section 2.7.2 is only a SHOULD and
is followed by:
If implementations fail to support the above behavior, they MUST
conform to the following:
No two strings can be considered equal if one contains octets
greater than 127.
So it is not *required*, but it's not clear what is required. That's
subject to interpretation.
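To make the fallback rule concrete, here is a minimal Python sketch of
how I read it. The function name is made up, and the octet-wise
comparison for pure ASCII input is my assumption, not something the
quoted text spells out:

```python
def fallback_equal(a: bytes, b: bytes) -> bool:
    """Compare two raw (undecoded) strings under the quoted fallback rule."""
    # If either string contains an octet greater than 127,
    # the two can never be considered equal.
    if any(octet > 127 for octet in a + b):
        return False
    # Assumption: plain octet-wise comparison for 7-bit input.
    return a == b


print(fallback_equal(b"abc", b"abc"))            # True
print(fallback_equal(b"caf\xe9", b"caf\xe9"))    # False
```

Note that even two byte-identical strings compare unequal once an 8-bit
octet is involved, which is what makes the rule surprising in practice.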
o Broken words are treated as literal strings (MUAs either do that or
decode them to junk, when their parser fails)
words that don't match the grammar are treated as literals. how to
treat words that are broken for other reasons is specified in RFC 2047
section 6.
I see the following in section 6:
A mail reader need not attempt to display the text associated with an
'encoded-word' that is incorrectly formed. However, a mail reader
MUST NOT prevent the display or handling of a message because an
'encoded-word' is incorrectly formed.
So what is Sieve supposed to do with incorrectly formed words, and
where is that defined?
If the mail reader does not support the character set used, it may
(a) display the 'encoded-word' as ordinary text (i.e., as it appears
in the header), (b) make a "best effort" to display using such
characters as are available, or (c) substitute an appropriate message
indicating that the decoded text could not be displayed.
That's fine for a human interface, but an automated filter should
have a determined behaviour.
o Correct words with unknown character sets are treated as literal
strings. The assumption that all unknown character sets are
one-byte codes and identical to US-ASCII in their lower 128
octets is not sound.
remember that you can only compare against literal UTF-8 strings
(well, at least until we have a variables extension), so the
alternative is to make _all_ matches against strings in an unknown
character set fail. that is not very useful.
The alternative is not to decode and convert what can't be decoded
and converted. If the word is broken, or the character set is unknown,
then don't decode and convert it.
Rationale: I would like the following test to be true:
Subject: abc =?iso-8859-1?q?=c3abc?= =?unknown?q?def?=
header :contains ["Subject"] ["abc"]
The header cannot be decoded entirely, so Sieve scripts should
view it as the UTF-8 character for capital A with tilde (0xC3 in
ISO-8859-1), followed by "abc" and "=?unknown?q?def?=".
RFC 3028 would let the test fail, because the _whole_ header could
not be converted and there is an 8-bit character.
actually, this isn't clear in the RFC. the string "abc" doesn't
contain 8-bit characters, and neither does the matching substring
"abc"... I think the RFC can be read either way. I favour that the
test is allowed to fail, though.
I favour well determined behaviour.
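The "decode what can be decoded, leave the rest literal" approach can be
sketched in Python roughly as follows. This is only an illustration
under assumptions: the helper names are made up, the pattern is a
simplified form of the RFC 2047 encoded-word grammar, and the sketch
does not drop the white space between adjacent encoded-words as RFC 2047
asks for display.

```python
import base64
import re

# Simplified encoded-word pattern: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def q_decode(payload: str) -> bytes:
    """Decode the "Q" encoding: "_" is a space, "=XX" is a hex octet."""
    data = payload.replace("_", " ").encode("ascii")
    return re.sub(rb"=([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)


def decode_word(match):
    charset, encoding, payload = match.groups()
    try:
        if encoding.lower() == "q":
            raw = q_decode(payload)
        else:
            raw = base64.b64decode(payload, validate=True)
        return raw.decode(charset)
    except (LookupError, ValueError):
        # Unknown charset or undecodable payload: keep the word literal.
        return match.group(0)


def decode_partially(header_value: str) -> str:
    """Decode each encoded-word on its own; failures stay as they are."""
    return ENCODED_WORD.sub(decode_word, header_value)


subject = "abc =?iso-8859-1?q?=c3abc?= =?unknown?q?def?="
decoded = decode_partially(subject)
print("abc" in decoded)                       # True
print(decoded.endswith("=?unknown?q?def?="))  # True
```

With per-word decoding the outcome of the test no longer depends on
whether the whole header could be converted.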
Concerning my original point of NUL characters, I read RFC 2047 again:
Only printable and white space character data should be encoded using
this scheme. However, since these encoding schemes allow the
encoding of arbitrary octet values, mail readers that implement this
decoding should also ensure that display of the decoded data on the
recipient's terminal will not cause unwanted side-effects.
First, the requirement of printable and white space character data is
a SHOULD. Second, an embedded NUL causing a string to be truncated is
an "unwanted side-effect" to me, but I have to accept it's only a
SHOULD, too.
This can probably be interpreted to mean that a Sieve implementation
which fails to match the substring "def" in
"=?iso-8859-1?q?abc=00def?=" is still correct.
Amazing, isn't it?
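A small Python sketch of the two outcomes, with made-up helper names.
The truncation step only models what an implementation built on
NUL-terminated strings might see; it is an assumption for illustration,
not a description of any particular implementation:

```python
import re


def q_decode(payload: str) -> bytes:
    """Decode the "Q" encoding: "_" is a space, "=XX" is a hex octet."""
    data = payload.replace("_", " ").encode("ascii")
    return re.sub(rb"=([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)


def as_c_string(text: str) -> str:
    """Model a NUL-terminated string: everything after the NUL is lost."""
    return text.split("\x00", 1)[0]


# Payload of the encoded-word "=?iso-8859-1?q?abc=00def?="
decoded = q_decode("abc=00def").decode("iso-8859-1")

print("def" in decoded)               # True
print("def" in as_c_string(decoded))  # False
```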
Again, I favour well determined behaviour. All those SHOULDs make it
very hard to implement Sieve, which is not helpful. Converting some to
MUSTs would make it much easier.
Michael