Currently, section 2.7.2 of 3028bis is unchanged from RFC 3028:
----
2.7.2. Comparisons Across Character Sets
All Sieve scripts are represented in UTF-8, but messages may involve
a number of character sets. In order for comparisons to work across
character sets, implementations SHOULD implement the following
behavior:
Implementations decode header charsets to UTF-8. Two strings are
considered equal if their UTF-8 representations are identical.
Implementations should decode charsets represented in the forms
specified by [MIME] for both message headers and bodies.
Implementations must be capable of decoding US-ASCII, ISO-8859-1,
the ASCII subset of ISO-8859-* character sets, and UTF-8.
If implementations fail to support the above behavior, they MUST
conform to the following:
No two strings can be considered equal if one contains octets
greater than 127.
----
That is, support for RFC 2047 is only a SHOULD and not a MUST. Do
we want to leave that as is, or should it be made stricter: MUST
support RFC 2047, MUST support conversion of US-ASCII and UTF-8,
and SHOULD support conversion of ISO-8859-1 and the US-ASCII subset
of ISO-8859-*?
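
To make the two behaviours in 2.7.2 concrete, here is a minimal
Python sketch. It is only an illustration under assumptions, not
text from either document: the function names are hypothetical, and
RFC 2047 decoding is delegated to Python's standard email.header
module rather than to anything a Sieve implementation must use.

```python
from email.header import decode_header


def decode_header_value(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a header value to Unicode text."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Unlabelled octets are treated as US-ASCII (an assumption
            # of this sketch).
            parts.append(chunk.decode(charset or "us-ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)


def utf8_equal(header_raw: str, key: str) -> bool:
    """SHOULD behaviour: equal if the UTF-8 representations are identical."""
    return decode_header_value(header_raw).encode("utf-8") == key.encode("utf-8")


def fallback_equal(a: bytes, b: bytes) -> bool:
    """Fallback behaviour: never equal if either side has an octet > 127."""
    if any(octet > 127 for octet in a) or any(octet > 127 for octet in b):
        return False
    return a == b
```

With the first behaviour, an ISO-8859-1 encoded-word can match a
UTF-8 key from the script; with the fallback, even two byte-identical
UTF-8 strings never compare equal once they contain non-ASCII octets.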
This question arose from a query from Bob Johannessen during the
WGLC of the 'body' extension on that extension's requirements for
charset conversion. Those requirements currently read:
MIME parts identified as using charsets other than UTF-8 as
defined in [UTF-8] SHOULD be converted to UTF-8 prior to the match.
A conversion from US-ASCII to UTF-8 MUST be supported.
If an implementation does not support conversion of a given
charset to UTF-8, it MAY compare against the US-ASCII subset
of the transfer-decoded character data instead. Characters from
documents tagged with charsets that the local implementation
cannot convert to UTF-8 and text from mistagged documents MAY
be omitted or processed according to local conventions.
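
One possible reading of the 'body' text quoted above, as a minimal
Python sketch. The function name is hypothetical, the input is
assumed to be the already transfer-decoded octets of a text part,
and dropping non-ASCII octets is just one of the behaviours the MAY
clauses would seem to permit.

```python
def body_text_for_match(payload: bytes, charset: str) -> str:
    """Return the text of a body part to be matched against UTF-8 keys."""
    try:
        # Convert the labelled charset to Unicode. US-ASCII and UTF-8
        # always decode here when the part is labelled correctly.
        return payload.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unsupported charset or mistagged part: compare against the
        # US-ASCII subset only, omitting every octet above 127.
        ascii_only = bytes(octet for octet in payload if octet < 128)
        return ascii_only.decode("ascii")
```

For example, a part labelled with a charset the implementation does
not know is matched on its ASCII characters alone, as is a part
labelled UTF-8 whose octets are not valid UTF-8.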
Now, I had always read that paragraph as requiring support for
matching UTF-8 against text parts labelled as being in UTF-8 or
US-ASCII; treating a UTF-8 part as if it were US-ASCII with some
unmatching octets was banned. On reflection, I now think that
paragraph should be read as giving practically the same choice
as section 2.7.2 in the base-spec and that it should therefore be
simply replaced with a reference to that section in the base-spec,
a la:
Implementations MUST use the same rules for comparisons
against body parts in charsets other than UTF-8 as they use
for comparisons against header fields in such charsets (cf.
[SIEVE] section 2.7.2).
What are people's opinions on the base-spec and body-extension
requirements in this area?
Philip Guenther