Re: status of 3028bis

On Fri, Oct 21, 2005 at 02:38:29AM +0200, Kjetil Torgrim Homme wrote:

   The "en;ascii-casemap" collation is a simple collation intended for
   use with English language text in pure US-ASCII.  It provides
   equality, substring and ordering functions.  The algorithm first
   applies a canonicalization algorithm to both input strings which
   subtracts 32 (0x20) from all octet values between 97 (0x61) and 122
   (0x7A) inclusive.  The result of the collation is then the same as
   the result of the "i;octet" collation for the canonicalized strings.

(the algorithm in RFC 2244 is essentially the same.)
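
For concreteness, here is a minimal Python sketch of the canonicalization step as the quoted text describes it. The function names are made up for illustration and do not come from any draft or implementation.

```python
def ascii_casemap_canonicalize(octets: bytes) -> bytes:
    """Subtract 32 from every octet in 0x61..0x7A; leave all other octets alone."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in octets)


def ascii_casemap_equal(a: bytes, b: bytes) -> bool:
    """Equality is plain octet equality ("i;octet") on the canonicalized strings."""
    return ascii_casemap_canonicalize(a) == ascii_casemap_canonicalize(b)
```

Note that octets above 0x7F pass through untouched, so the UTF-8 encodings of "é" and "É" do not compare equal under this scheme.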

this was surprising and interesting to me, since it means that with
"abc" :matches "ab?", ${1} will hold the uppercase "C"!  I wonder how
many users would expect that one, or how many implementations get it
right.
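
A rough Python sketch of that reading, assuming the match runs on the canonicalized strings and the captured text is taken from them. The helper names are hypothetical, and the glob handling is simplified: no backslash escapes, and "*" is treated as non-greedy.

```python
import re


def canonicalize(octets: bytes) -> bytes:
    # Uppercase a-z by subtracting 32, as in the quoted collation text.
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in octets)


def glob_captures(value: bytes, pattern: bytes):
    """Octet-wise :matches; returns the wildcard captures, or None if no match."""
    parts = []
    for i in range(len(pattern)):
        ch = pattern[i:i + 1]
        if ch == b"?":
            parts.append(b"(.)")      # exactly one octet
        elif ch == b"*":
            parts.append(b"(.*?)")    # zero or more octets, non-greedy
        else:
            parts.append(re.escape(ch))
    m = re.fullmatch(b"".join(parts), value, re.DOTALL)
    return None if m is None else list(m.groups())


def casemap_captures(value: bytes, pattern: bytes):
    # Assumption: captures come from the canonicalized value, not the original.
    return glob_captures(canonicalize(value), canonicalize(pattern))
```

Under these assumptions, casemap_captures(b"abc", b"ab?") yields the uppercase octet, while glob_captures(b"abc", b"ab?") on the raw strings keeps the original case.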

Oh my god.  Basically, this means a Sieve implementation does not
know anything about its input being UTF8.


Quite correct. IMO this is a feature, not a bug.

All it does is convert headers to UTF8; other than that, it works on
octets, not characters.


Well, it has always been possible to specify a comparator that operates
on characters rather than octets. I've had the hooks in place for this
in our implementation waiting for the comparator stuff to be completed.

If so, we must not talk about unicode, but UTF8, and s/character/octet/g.
And I have to change my implementation, which works on characters,
crippling it.


There are advantages as well as drawbacks. The ability to match illegal gunk is
something I've used more than once, and I know that some of our customers have
used it too.

Do all other implementations work on octets instead of characters?


Ours does.

Are you saying that even using "en;ascii-casemap", the wildcard "?"
does not match a single character outside US-ASCII?


since the spec defines it in terms of i;octet, the "?" wildcard is
essentially broken.  it gets really interesting with "*", though, since
you will probably get doubly encoded UTF-8 :-(

Worse, matched octets can be invalid (incomplete) UTF8 octet-sequences.


The underlying structure of UTF-8 itself tends to minimize the likelihood of
this happening. As long as you're dealing with well formed UTF-8 inputs there
is no way for part of one character to match part of another. (This is one of
the real advantages of UTF-8 over, say, UTF-16 - you can perform byte by byte
comparisons and most things end up working properly. :matches and ? are an
obvious exception, of course.)

Of course once you leave the realm of well formed utf-8 this no longer
applies.
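
A small Python check of this property, as a sketch only: the sample string and helper names are made up for illustration. It searches for the encoding of one character in an encoded string and compares the hit offsets with the real character boundaries.

```python
def find_all(haystack: bytes, needle: bytes) -> list:
    """Return every offset at which needle occurs in haystack."""
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


def char_boundaries(text: str) -> set:
    """Octet offsets at which a character starts in the UTF-8 encoding of text."""
    offsets = set()
    pos = 0
    for ch in text:
        offsets.add(pos)
        pos += len(ch.encode("utf-8"))
    return offsets


text = "naïve café €5"
data = text.encode("utf-8")
boundaries = char_boundaries(text)

# A well-formed needle only ever matches at a character boundary.
well_formed_hits = find_all(data, "é".encode("utf-8"))

# A lone continuation octet (malformed on its own) matches mid-character.
malformed_hits = find_all(data, b"\xa9")
```

The well-formed search lands only on a boundary because lead octets and continuation octets occupy disjoint ranges; the lone 0xA9 needle has no such protection.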

You gave an example how variables extract a single octet of a sequence,
and you can't even control if ? matches an incomplete UTF8 octet-sequence
or a US-ASCII character: You cannot use the matched part to modify or
generate mail without risking a corrupted result.


Assuming:

(1) An octet-based comparator.
(2) A single ? used in isolation with no adjacent *s or ?s.
(3) Well formed UTF-8 as input.

The somewhat surprising result is that ? can only match an ASCII character. Of
course something like ???? can get really interesting and match anything that
encodes down to four octets.
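
To illustrate, a minimal Python sketch of an octet-based match restricted to literals and "?" (no "*", no escapes). The function name is hypothetical.

```python
def question_mark_match(value: bytes, pattern: bytes) -> bool:
    """Octet-wise match for patterns made only of literal octets and '?'."""
    return len(value) == len(pattern) and all(
        p == 0x3F or p == v          # 0x3F is '?', which accepts any one octet
        for v, p in zip(value, pattern)
    )


ascii_case = question_mark_match("xay".encode("utf-8"), b"x?y")     # one octet
two_octet_case = question_mark_match("xéy".encode("utf-8"), b"x?y")  # é is two octets
four_octets = [
    question_mark_match(s.encode("utf-8"), b"????")
    for s in ("abcd", "éé", "😀")
]
```

With a literal on each side, a lone "?" has exactly one octet to fill, and in well-formed UTF-8 a one-octet character is always ASCII. The "????" pattern, by contrast, accepts four ASCII characters, two two-octet characters, or one four-octet character alike.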

this is truly a mess.  we need a comparator which makes sense, a
comparator which operates case-fully on Unicode characters, but without
the difficult bits (e.g. normalising combining characters).


Well, we have quite a large user base for this stuff and most of these
issues have proved to be theoretical. Variables may change this, however.
A construct like:

    require "variables";
    if header :matches "subject" "*" {set "subject" "${1}";}
    else {set "subject" "";}

ends up storing the subject in all caps, which likely isn't what was intended.
For this to work properly you need to say:

    require "variables";
    if header :comparator "i;octet" :matches "subject" "*"
        {set "subject" "${1}";}
    else {set "subject" "";}

I believe that some of the examples in the variables draft may have similar
mistakes in them.

                                Ned