Re: status of 3028bis


On Fri Oct 21 11:29:13 2005, Michael Haardt wrote:

On Fri, Oct 21, 2005 at 02:38:29AM +0200, Kjetil Torgrim Homme wrote:
> The "en;ascii-casemap" collation is a simple collation intended for
> use with English language text in pure US-ASCII.  It provides
> equality, substring and ordering functions. The algorithm first
> applies a canonicalization algorithm to both input strings which
> subtracts 32 (0x20) from all octet values between 97 (0x61) and 122
> (0x7A) inclusive. The result of the collation is then the same as
> the result of the "i;octet" collation for the canonicalized strings.
>
> (the algorithm in RFC 2244 is essentially the same.)
>
> this was surprising and interesting to me, since it means that with
> "abc" :matches "ab?", ${1} will hold the uppercase "C"! I wonder how
> many users would expect that one, or how many implementations get it
> right.
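
To make the surprise concrete, here is a minimal sketch of the canonicalization quoted above, followed by a glob match on the canonicalized octets. The helper names (ascii_casemap, glob_to_regex, matches) are made up for illustration, and the glob handling is a simplification (no backslash escapes; "*" is taken as non-greedy). It is not how any particular Sieve implementation works.

```python
import re

def ascii_casemap(octets: bytes) -> bytes:
    """Subtract 32 from every octet in 97..122 (a-z), as the quoted text describes."""
    return bytes(b - 32 if 97 <= b <= 122 else b for b in octets)

def glob_to_regex(pattern: bytes):
    """Translate a simplified glob ('?' and '*' only) into an octet regex."""
    parts = []
    for b in pattern:
        ch = bytes([b])
        if ch == b"?":
            parts.append(b"(.)")      # exactly one octet
        elif ch == b"*":
            parts.append(b"(.*?)")    # zero or more octets
        else:
            parts.append(re.escape(ch))
    return re.compile(b"".join(parts), re.DOTALL)

def matches(value: bytes, pattern: bytes):
    """Canonicalize both sides, then match; captures come from the canonical form."""
    m = glob_to_regex(ascii_casemap(pattern)).fullmatch(ascii_casemap(value))
    return None if m is None else list(m.groups())

print(matches(b"abc", b"ab?"))  # the capture is taken from the upper-cased input
```

Because the match runs on the canonicalized strings, the captured octet is the upper-cased one, not what was in the original header.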

Oh my god.  Basically, this means a Sieve implementation does not
know anything about its input being UTF8.  All it does is converting
headers to UTF8, but other than that, it works on octets, not characters.

Yep. But don't be too alarmed, this only breaks :matches.

Incidentally, I really suspect that :matches in this case should be returning "c", not "C", and it depends heavily on a lot of specification that I suspect doesn't exist.

If so, we must not talk about unicode, but UTF8, and s/character/octet/g.
And I have to change my implementation, which works on characters,
crippling it.

Do all implementations else work on octets instead of characters?

It actually doesn't matter. All the comparators so far defined at RFC level (ie, by ACAP) operate on octets, not characters. But assuming your implementation (modulo :matches) behaves as it would *if* all comparators operated on UTF-8 encoded octet strings rather than Unicode character strings, this is easily rectifiable.


:matches is the problem.

> > Are you saying that even using "en;ascii-casemap", the wildcard "?"
> > does not match a single character outside US-ASCII?
>
> since the spec defines it in terms of i;octet, the "?" wildcard is
> essentially broken. it gets really interesting with "*", though, since
> you will probably get doubly encoded UTF-8 :-(

Worse, matched octets can be invalid (incomplete) UTF8 octet-sequences.
You gave an example how variables extract a single octet of a sequence,
and you can't even control if ? matches an incomplete UTF8 octet-sequence
or a US-ASCII character: You can not use the matched part to modify or
generate mails or you risk to ruin the result.

Yes, true. Fun, isn't it?
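
A small illustration of the incomplete-sequence problem, as a sketch only. The sample string, the pattern "caf?*", and the helper names are assumptions chosen for the example, and Python's re module stands in for a glob matcher.

```python
import re

def octet_captures(text: str):
    """Match the glob "caf?*" octet by octet against the UTF-8 encoding."""
    m = re.fullmatch(rb"caf(.)(.*)", text.encode("utf-8"), re.DOTALL)
    return list(m.groups())

def character_captures(text: str):
    """Match the same glob character by character."""
    m = re.fullmatch(r"caf(.)(.*)", text, re.DOTALL)
    return list(m.groups())

def is_valid_utf8(octets: bytes) -> bool:
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

word = "caf\u00e9"  # e-acute encodes as the two octets C3 A9
for capture in octet_captures(word):
    # "?" took only the lead octet, "*" got the orphaned continuation octet
    print(capture, is_valid_utf8(capture))
print(character_captures(word))
```

With octet matching, neither capture can be decoded on its own, so pasting either one into a generated message produces invalid UTF-8. With character matching, "?" captures the whole character.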

> this is truly a mess.  we need a comparator which makes sense, a
> comparator which operates case-fully on Unicode characters, but without
> the difficult bits (e.g. normalising combining characters).

Indeed!

This is a very, very quick proposal that I've had in the back of my mind, but the use of :matches convinces me that this is the way to go.

The current comparator draft changed the name of comparators to collations - I thought that was wrong, because I'd have liked to see comparators that performed pattern matching, which precludes their ability to collate.


I'd like to propose:

1) A family of comparators for both UTF-8 and octet matching, the matches themselves being globs (as Sieve :matches) and regular expressions (perhaps Basic and Extended). This is off the top of my head, and it's a Saturday, so I can't recall the conventions for comparator naming, but I'll use the example of "i;utf-8;glob".

This comparator would perform the "EQUAL" operation such that if the left hand side were matched by the pattern contained in the right-hand side, it returned true, otherwise false.

"SUBSTRING" would match if the pattern occurred anywhere in the string, and "PREFIX" if it occurred at the beginning of the string.

We need at least a UTF-8 glob to replace :matches, and a case-folding variant, too.

2) A new API requirement that "submatches" may be extracted from the result of a successful EQUAL, PREFIX, or SUBSTRING match.

3) In Sieve, :matches becomes deprecated, and for backwards compatibility, it should be rewritten such that:

a) The comparator is changed from "i;octet" to "i;utf-8;glob", and from "i;ascii-casemap" to "en;utf-8;casemap;glob".


b) The operation is changed from :matches to :is

The net result is not only that :matches now works as people expect, but that regular expression searching and glob pattern matching become available in ACAP and IMAP, too.
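
As a rough sketch of what such a comparator's API might look like: everything below is hypothetical. The class name, the method names, and the convention of returning a list of submatches (or None on failure) are inventions for illustration, and "casemap" is read here as ASCII-only case folding. It works on characters, and captures are taken from the original input rather than a canonicalized copy.

```python
import re

def glob_regex(pattern: str) -> str:
    """Translate a simplified glob ('?' and '*' only) into a character regex."""
    out = []
    for ch in pattern:
        if ch == "?":
            out.append("(.)")       # exactly one character
        elif ch == "*":
            out.append("(.*?)")     # zero or more characters
        else:
            out.append(re.escape(ch))
    return "".join(out)

class Utf8GlobComparator:
    """Hypothetical stand-in for "i;utf-8;glob" and a casemap variant."""

    def __init__(self, casemap: bool = False):
        self.flags = re.DOTALL
        if casemap:
            # re.ASCII restricts IGNORECASE to the ASCII letters
            self.flags |= re.IGNORECASE | re.ASCII

    def _compile(self, pattern: str):
        return re.compile(glob_regex(pattern), self.flags)

    def _submatches(self, m):
        return None if m is None else list(m.groups())

    def equal(self, value: str, pattern: str):
        return self._submatches(self._compile(pattern).fullmatch(value))

    def prefix(self, value: str, pattern: str):
        return self._submatches(self._compile(pattern).match(value))

    def substring(self, value: str, pattern: str):
        return self._submatches(self._compile(pattern).search(value))

folding = Utf8GlobComparator(casemap=True)
print(folding.equal("abc", "AB?"))        # submatch keeps its original case
print(folding.substring("xxabcxx", "B?"))
```

Under these assumptions, the Sieve rewrite in (3) amounts to calling equal() with the glob comparator and reading ${1} and friends from the returned submatches.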

So I can do something like EQUAL "addressbook.Email" "en;utf-8;casemap;regex" "dave(\\+[^@]*)?@cridland.net(\0[a-z]*)?" when searching my addressbook, which'd keep me happy.


Dave.
--
          You see things; and you say "Why?"
  But I dream things that never were; and I say "Why not?"
   - George Bernard Shaw