ietf-mta-filters

Charset sensitive compares (was Re: List of expected changes ...)

1998-11-10 10:50:32
There are three issues to address:

(1) Which charsets are permitted in the scripting language (currently
only UTF-8).

(2) Which charsets constitute minimal support for RFC 2047 in incoming
message headers.

(3) What to do when charsets from (1) & (2) mismatch.

I have a strong preference to keep (1) unchanged -- UTF-8 only.  Allowing
a script to contain embedded data in multiple charsets makes script
viewing and composition _much_ more complex as well as making the
cross-charset comparison problem even worse.

As for (2), I'd say if RFC 2047 decoding is done, support for UTF-8,
ISO-8859-1, and the ASCII subset of the ISO-8859-* charsets should be the
minimum required.  UTF-8 is required by RFC 2277 and is easy since it
matches the charset for the scripting language.  ISO-8859-1 is easy since
its characters are exactly the first 256 code points of Unicode, so
conversion to UTF-8 is trivial.  And the other rule (at least the ASCII
subset of ISO-8859-*) comes directly from RFC 2047.
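
To make the two conversion rules concrete, here is a rough sketch in
Python.  The function names are made up for illustration, and returning
None for "can't convert" is just one assumed way to signal failure; none
of this comes from the draft.

```python
from typing import Optional


def latin1_to_utf8(raw: bytes) -> bytes:
    """Convert ISO-8859-1 octets to UTF-8.

    Each ISO-8859-1 octet value equals the Unicode code point of the
    same character, so no lookup table is needed.
    """
    return "".join(chr(octet) for octet in raw).encode("utf-8")


def ascii_subset_to_utf8(raw: bytes) -> Optional[bytes]:
    """Handle text in some other ISO-8859-* charset, ASCII subset only.

    If every octet is 7-bit, the text is already valid UTF-8.
    Otherwise return None (assumed convention for "unsupported").
    """
    if all(octet < 0x80 for octet in raw):
        return raw
    return None
```

For example, latin1_to_utf8(b"caf\xe9") yields the octets 63 61 66 C3 A9,
while ascii_subset_to_utf8(b"caf\xe9") gives up and returns None.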

In the interest of getting Sieve deployed faster, it may be desirable to
permit implementations which don't support RFC 2047 to be compliant
(possibly under a "SHOULD support RFC 2047" clause).  When RFC 2047 isn't
supported, I'd say that a comparison string in the script with 8-bit
content MUST fail to match (we have to hold the line on just-send-8 in
headers now so we can allow UTF-8 in headers down the road).
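
A sketch of that fail-to-match rule, in Python.  The function name is
hypothetical, and the plain octet-for-octet equality test is an assumption
standing in for whatever comparison the script actually asked for.

```python
def matches_without_header_decoding(script_value: str, raw_header: bytes) -> bool:
    """Comparison for an implementation that does no header decoding."""
    key = script_value.encode("utf-8")
    # Any 8-bit octet in the script string means the comparison MUST
    # fail, even if the raw header happens to hold the same octets.
    if any(octet >= 0x80 for octet in key):
        return False
    # Assumed for illustration: exact octet comparison.
    return key == raw_header
```

Note that a header containing the raw UTF-8 octets for "café" still does
not match the script string "café" here; that is the point of holding the
line on just-send-8.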

(3) is nasty and creates the behavior change you noted as servers are
upgraded to support more charsets.  As long as (1) is fixed at UTF-8, that
effectively requires translation to UTF-8 to do comparisons.
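
Roughly, translate-then-compare could look like the following Python
sketch.  It uses the standard library's email.header.decode_header to
split out the encoded-words; the SUPPORTED set, the function names, and
the choice to treat an unsupported charset as "no match" are assumptions
for illustration only.

```python
from email.header import decode_header
from typing import Optional

# Assumed minimal charset set for this sketch.
SUPPORTED = {"us-ascii", "utf-8", "iso-8859-1"}


def header_to_text(value: str) -> Optional[str]:
    """Decode a header value to Unicode text, or None if a chunk uses
    a charset outside SUPPORTED or fails to decode."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, str):
            # No encoded-words present; already plain text.
            parts.append(chunk)
            continue
        name = (charset or "us-ascii").lower()
        if name not in SUPPORTED:
            return None
        try:
            parts.append(chunk.decode(name))
        except UnicodeDecodeError:
            return None
    return "".join(parts)


def header_matches(script_value: str, header_value: str) -> bool:
    """Compare a script string (UTF-8 in the script) to a header."""
    decoded = header_to_text(header_value)
    return decoded is not None and decoded == script_value
```

With this, the script string "café" matches both "=?ISO-8859-1?Q?caf=E9?="
and "=?UTF-8?B?Y2Fmw6k=?=", but a header in a charset outside SUPPORTED
never matches until the server adds that charset, which is exactly the
behavior change on upgrade.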


The alternative would be to allow scripts to embed octet strings labelled
with a charset for comparison purposes.  While that would make Japanese or
Chinese localization easier, it makes the international problem harder in
addition to the other drawbacks mentioned above.

                - Chris