There are three issues to address:
(1) Which charsets are permitted in the scripting language (currently
only UTF-8).
(2) Which charsets constitute minimal support for RFC 2047 in incoming
message headers.
(3) What to do when the charsets from (1) and (2) don't match.
I have a strong preference to keep (1) unchanged -- UTF-8 only. Allowing
a script to contain embedded data in multiple charsets makes script
viewing and composition _much_ more complex as well as making the
cross-charset comparison problem even worse.
As for (2), I'd say if RFC 2047 decoding is done, support for UTF-8,
ISO-8859-1, and the ASCII subset of the ISO-8859-* charsets should be the
minimum required. UTF-8 is required by RFC 2277 and is easy since it
matches the charset for the scripting language. ISO-8859-1 is easy since
its 256 characters map one-to-one onto the first 256 Unicode code points,
so conversion to UTF-8 is trivial. And the other rule (at least the ASCII
subset of ISO-8859-*) comes directly from RFC 2047.
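To make that minimum concrete, here is a rough sketch in Python. It assumes the standard library's email.header.decode_header does the encoded-word parsing; the names FULL_CHARSETS and decode_min are made up for illustration, and charset aliases (e.g. "latin1") are deliberately not handled.

```python
from email.header import decode_header
from typing import Optional

# Hypothetical minimum set: charsets that are decoded in full.
FULL_CHARSETS = {"utf-8", "iso-8859-1", "us-ascii"}

def decode_min(raw: str) -> Optional[str]:
    """Return the header text as a str, or None if it can't be
    decoded under the minimum rules sketched above."""
    parts = []
    for data, charset in decode_header(raw):
        if isinstance(data, str):
            # No encoded-words were present; text is returned as-is.
            parts.append(data)
            continue
        name = (charset or "us-ascii").lower()
        if name in FULL_CHARSETS:
            try:
                parts.append(data.decode(name))
            except UnicodeDecodeError:
                return None
        elif name.startswith("iso-8859-") and all(b < 0x80 for b in data):
            # Other ISO-8859-* charsets: accept the ASCII subset only.
            parts.append(data.decode("ascii"))
        else:
            return None
    return "".join(parts)
```

With this, an ISO-8859-2 encoded-word containing only ASCII decodes fine, while one containing an octet such as 0xB1 is reported as undecodable rather than guessed at.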
In the interest of getting Sieve deployed faster, it may be desirable to
permit implementations which don't support RFC 2047 to be compliant
(possibly under a SHOULD support RFC 2047 clause). When RFC 2047 isn't
supported, I'd say that a comparison string in the script with 8-bit
content MUST fail to match (we have to hold the line on just-send-8 in
headers now so we can allow UTF-8 in headers down the road).
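The fail-to-match rule for an implementation without encoded-word decoding could look roughly like this in Python. The function name is hypothetical and only an exact octet comparison is shown; other match types would need the same guard.

```python
def match_without_decoding(key: bytes, header_value: bytes) -> bool:
    """Exact match for a server that does no encoded-word decoding.

    `key` is the comparison string from the script (UTF-8 octets),
    `header_value` is the raw header field body.
    """
    # Any 8-bit octet in the script string means "never matches",
    # even if the raw header happens to carry the same octets.
    if any(b >= 0x80 for b in key):
        return False
    return key == header_value
```

Note the first check is what rules out just-send-8: an unlabelled 8-bit header never matches, even when its octets are identical to the script string.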
(3) is nasty and creates the behavior change you noted as servers are
upgraded to support more charsets. As long as (1) is fixed at UTF-8, that
effectively requires translation to UTF-8 to do comparisons.
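A small Python sketch of that translate-then-compare step, which also shows where the behavior change comes from. Everything here (header_matches, the two "server" charset sets, the sample word) is invented for illustration; the header value is assumed to be already extracted from its encoded-word.

```python
def header_matches(key_utf8: bytes, value: bytes, charset: str,
                   supported: set) -> bool:
    """Compare a UTF-8 script string against header octets labelled
    with `charset`, by translating both to Unicode first."""
    name = charset.lower()
    if name not in supported:
        # Can't translate, so the comparison can't succeed.
        return False
    try:
        return value.decode(name) == key_utf8.decode("utf-8")
    except (UnicodeDecodeError, LookupError):
        return False

key = "тест".encode("utf-8")        # string as stored in the script
value = "тест".encode("koi8-r")     # same text as sent in a header

old_server = {"utf-8", "iso-8859-1"}
new_server = old_server | {"koi8-r"}

before = header_matches(key, value, "koi8-r", old_server)
after = header_matches(key, value, "koi8-r", new_server)
```

The same script and the same message give a different answer once the server learns the extra charset.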
The alternative would be to allow scripts to embed octet strings labelled
with a charset for comparison purposes. While that would make Japanese or
Chinese localization easier, it makes the international problem harder in
addition to the other drawbacks mentioned above.
- Chris