Re: Charset sensitive compares (was Re: List of expected changes ...)

At 09:53 10.11.98 -0800, Chris Newman wrote:

There are three issues to address:

(1) Which charsets are permitted in the scripting language (currently
only UTF-8).

(2) Which charsets constitute minimal support for RFC 2049 in incoming
message headers.

(3) What to do when charsets from (1) & (2) mismatch.

I have a strong preference to keep (1) unchanged -- UTF-8 only.  Allowing
a script to contain embedded data in multiple charsets makes script
viewing and composition _much_ more complex as well as making the
cross-charset comparison problem even worse.


Agreed.

As for (2), I'd say if RFC 2049 decoding is done, support for UTF-8,
ISO-8859-1, and the ASCII subset of the ISO-8859-* charsets should be the
minimum required.  UTF-8 is required by RFC 2277 and is easy since it
matches the charset for the scripting language.  ISO-8859-1 is easy since
it's a proper subset of UTF-8.  And the other rule (at least the ASCII
subset of ISO-8859-*) comes directly from RFC 2049.


Makes sense.

In the interest of getting Sieve deployed faster, it may be desirable to
permit implementations which don't support RFC 2049 to be compliant
(possibly under a SHOULD support RFC 2049 clause).  When RFC 2049 isn't
supported, I'd say that a comparison string in the script with 8-bit
content MUST fail to match (we have to hold the line on just-send-8 in
headers now so we can allow UTF-8 in headers down the road).


Agreed.
Do you have anything like a "feature test macro" in the language now?
I'm thinking of something like

   require charset iso-2022-jp

to let a script say that unless the server is able to decode this
charset into UTF-8 for comparison, the script should go Boink at once
instead of behaving randomly.
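
To illustrate the kind of up-front check I have in mind, here is a rough
sketch in Python. It is not Sieve syntax and not a proposal for the spec;
the names check_required_charsets, load_script and UnsupportedCharset are
made up for this example, and Python's codec registry simply stands in for
"the set of charsets the server can decode into UTF-8".

```python
import codecs


class UnsupportedCharset(Exception):
    """Hypothetical error: the script needs a charset we cannot decode."""


def check_required_charsets(required):
    """Return the charsets from `required` that cannot be decoded here."""
    missing = []
    for name in required:
        try:
            codecs.lookup(name)  # raises LookupError for unknown charsets
        except LookupError:
            missing.append(name)
    return missing


def load_script(required):
    """Refuse the whole script at load time if any required charset is missing."""
    missing = check_required_charsets(required)
    if missing:
        raise UnsupportedCharset(", ".join(missing))
    return True
```

The point is only that the failure happens once, when the script is
loaded, rather than as a silent mismatch on each message.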

(3) is nasty and creates the behavior change you noted as servers are
upgraded to support more charsets.  As long as (1) is fixed at UTF-8, that
effectively requires translation to UTF-8 to do comparisons.


The alternative would be to allow scripts to embed octet strings labelled
with a charset for comparison purposes.  While that would make Japanese or
Chinese localization easier, it makes the international problem harder in
addition to the other drawbacks mentioned above.


OK, let's require that "all actions have the same result as if all
character data was converted to UTF-8 before comparisons are done".
And 2049 strings in unknown charsets don't match anything at all.

Seems to make sense to me.
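
A minimal sketch of that rule, again in Python and only as an
illustration: header_matches is a hypothetical helper, the standard
email.header.decode_header function stands in for the server's
encoded-word decoder, and an exact equality test stands in for whatever
comparison the script actually asks for.

```python
from email.header import decode_header


def header_matches(raw_header, key):
    """Compare a raw header value with a script string, both taken to UTF-8.

    Returns False if any encoded word uses a charset we cannot decode,
    so strings in unknown charsets never match anything.
    """
    parts = []
    for chunk, charset in decode_header(raw_header):
        if isinstance(chunk, str):
            # No encoded words were present; the value is already text.
            parts.append(chunk)
            continue
        try:
            parts.append(chunk.decode(charset or "ascii"))
        except (LookupError, UnicodeDecodeError):
            # Unknown charset or undecodable octets: no match at all.
            return False
    # Compare the UTF-8 forms, mirroring "as if converted to UTF-8".
    return "".join(parts).encode("utf-8") == key.encode("utf-8")
```

With this, an ISO-8859-1 encoded word and a UTF-8 encoded word carrying
the same text compare equal to the same script string, while an encoded
word in a charset the implementation does not know simply fails to match.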

                  Harald

-- 
Harald Tveit Alvestrand, Maxware, Norway
Harald(_dot_)Alvestrand(_at_)maxware(_dot_)no