ietf-mta-filters

Re: status of 3028bis

2005-10-22 05:02:45

On Fri Oct 21 11:29:13 2005, Michael Haardt wrote:

On Fri, Oct 21, 2005 at 02:38:29AM +0200, Kjetil Torgrim Homme wrote:
>    The "en;ascii-casemap" collation is a simple collation intended for
>    use with English language text in pure US-ASCII.  It provides
>    equality, substring and ordering functions.  The algorithm first
>    applies a canonicalization algorithm to both input strings which
>    subtracts 32 (0x20) from all octet values between 97 (0x61) and 122
>    (0x7A) inclusive.  The result of the collation is then the same as
>    the result of the "i;octet" collation for the canonicalized strings.
>
> (the algorithm in RFC 2244 is essentially the same.)
>
> this was surprising and interesting to me, since it means that with
> "abc" :matches "ab?", ${1} will hold the uppercase "C"!  I wonder how
> many users would expect that one, or how many implementations get it
> right.

Oh my god.  Basically, this means a Sieve implementation does not
know anything about its input being UTF-8.  All it does is convert
headers to UTF-8, but other than that, it works on octets, not characters.


Yep. But don't be too alarmed, this only breaks :matches.

Incidentally, I really suspect that :matches in this case should be returning "c", not "C", and it depends heavily on a lot of specification that I suspect doesn't exist.
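
To make the "C" versus "c" question concrete, here is a rough Python sketch. It is an illustration only, not taken from any specification or existing implementation. The helper names (ascii_casemap, glob_to_regex, naive_matches, span_matches) are hypothetical, and the glob translation handles only "?" and "*", with no backslash escapes. It assumes one implementation extracts the wildcard text from the canonicalized string, while another reuses the match offsets against the original string (possible because the casemap never changes string length).

```python
import re

def ascii_casemap(data: bytes) -> bytes:
    """Subtract 32 from every octet in 0x61..0x7A, per the quoted description."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)

def glob_to_regex(pattern: bytes):
    """Translate a glob into an octet-level regex (assumption: only ? and *)."""
    parts = []
    for b in pattern:
        ch = bytes([b])
        if ch == b"?":
            parts.append(b"(.)")       # exactly one octet
        elif ch == b"*":
            parts.append(b"(.*?)")     # zero or more octets, non-greedy
        else:
            parts.append(re.escape(ch))
    return re.compile(b"".join(parts), re.DOTALL)

def naive_matches(value: bytes, pattern: bytes):
    """Capture wildcard text from the canonicalized value."""
    m = glob_to_regex(ascii_casemap(pattern)).fullmatch(ascii_casemap(value))
    return None if m is None else list(m.groups())

def span_matches(value: bytes, pattern: bytes):
    """Match on canonicalized strings, but capture from the original value."""
    m = glob_to_regex(ascii_casemap(pattern)).fullmatch(ascii_casemap(value))
    if m is None:
        return None
    return [value[m.start(i):m.end(i)] for i in range(1, m.re.groups + 1)]

print(naive_matches(b"abc", b"ab?"))  # capture taken from canonicalized text
print(span_matches(b"abc", b"ab?"))   # capture taken from original text
```

The first call yields the uppercase octet and the second the lowercase one, which is exactly the ambiguity at issue.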


If so, we must not talk about unicode, but UTF8, and s/character/octet/g.
And I have to change my implementation, which works on characters,
crippling it.

Do all other implementations work on octets instead of characters?


It actually doesn't matter. All the comparators defined so far at RFC level (i.e., by ACAP) operate on octets, not characters. But as long as your implementation (modulo :matches) produces the same results it would if all comparators operated on UTF-8 encoded octet strings rather than Unicode character strings, this is easily rectifiable.

:matches is the problem.


> > Are you saying that even using "en;ascii-casemap", the wildcard "?"
> > does not match a single character outside US-ASCII?
>
> since the spec defines it in terms of i;octet, the "?" wildcard is
> essentially broken.  it gets really interesting with "*", though, since
> you will probably get doubly encoded UTF-8 :-(

Worse, matched octets can be invalid (incomplete) UTF-8 octet sequences.
You gave an example of how variables extract a single octet of a sequence,
and you can't even control whether ? matches an incomplete UTF-8 octet
sequence or a US-ASCII character: you cannot use the matched part to
modify or generate mails, or you risk ruining the result.


Yes, true. Fun, isn't it?
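
A small Python sketch of the failure, assuming a matcher where "?" consumes exactly one octet. The function names (octet_glob, is_valid_utf8, split_example) are hypothetical and the glob handling covers only "?" and "*".

```python
import re

def octet_glob(pattern: bytes):
    """Translate a glob to an octet-level regex; '?' matches exactly one octet."""
    parts = []
    for b in pattern:
        ch = bytes([b])
        if ch == b"?":
            parts.append(b"(.)")
        elif ch == b"*":
            parts.append(b"(.*?)")
        else:
            parts.append(re.escape(ch))
    return re.compile(b"".join(parts), re.DOTALL)

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def split_example():
    """Match 'caf?*' against UTF-8 'cafe' with an acute accent on the e."""
    value = "caf\u00e9".encode("utf-8")   # five octets: c a f 0xC3 0xA9
    # 'caf?' fails outright: one octet cannot cover the two-octet character.
    no_match = octet_glob(b"caf?").fullmatch(value) is None
    # 'caf?*' succeeds, but splits the character across the two captures.
    first, rest = octet_glob(b"caf?*").fullmatch(value).groups()
    return no_match, first, rest

no_match, first, rest = split_example()
print(no_match, first, rest, is_valid_utf8(first), is_valid_utf8(rest))
```

Neither captured piece is valid UTF-8 on its own, so copying either into a generated message would corrupt it.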


> this is truly a mess.  we need a comparator which makes sense, a
> comparator which operates case-fully on Unicode characters, but without
> the difficult bits (e.g. normalising combining characters).

Indeed!


This is a very, very quick proposal that I've had in the back of my mind, but the use of :matches convinces me that this is the way to go.

The current comparator draft changed the name of comparators to collations - I thought that was wrong, because I'd have liked to see comparators that performed pattern matching, which precludes their ability to collate.

I'd like to propose:

1) A family of comparators for both UTF-8 and octet matching, the matches themselves being globs (as in Sieve :matches) and regular expressions (perhaps Basic and Extended). This is off the top of my head, and it's a Saturday, so I can't recall the conventions for comparator naming, but I'll use the example of "i;utf-8;glob".

This comparator would perform the "EQUAL" operation such that if the left hand side were matched by the pattern contained in the right-hand side, it returned true, otherwise false.

"SUBSTRING" would match if the pattern occurred anywhere in the string, and "PREFIX" if it occurred at the beginning of the string.

We need at least a UTF-8 glob to replace :matches, and a case-folding variant, too.

2) A new API requirement that "submatches" may be extracted from the result of a successful EQUAL, PREFIX, or SUBSTRING match.

3) In Sieve, :matches becomes deprecated, and for backwards compatibility, it should be rewritten such that:

a) The comparator is changed from "i;octet" to "i;utf-8;glob", and from "i;ascii-casemap" to "en;utf-8;casemap;glob".

b) The operation is changed from :matches to :is
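
The rewrite in 3a) and 3b) could be sketched roughly as follows. The function name rewrite_matches and the table name are hypothetical, and the comparator names are only the placeholder examples used above, not registered identifiers.

```python
from typing import Tuple

# Mapping from 3a); both target names are placeholders from this proposal.
COMPARATOR_REWRITE = {
    "i;octet": "i;utf-8;glob",
    "i;ascii-casemap": "en;utf-8;casemap;glob",
}

def rewrite_matches(match_type: str, comparator: str) -> Tuple[str, str]:
    """Rewrite a deprecated :matches test into :is with a glob comparator."""
    if match_type != ":matches":
        # Other match types are left untouched.
        return match_type, comparator
    if comparator not in COMPARATOR_REWRITE:
        raise ValueError("no glob equivalent known for " + comparator)
    return ":is", COMPARATOR_REWRITE[comparator]

print(rewrite_matches(":matches", "i;ascii-casemap"))
```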

The net result is not only that :matches now works as people expect, but that regular expression searching and glob pattern matching become available in ACAP and IMAP, too.
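
For illustration, a comparator along the lines of 1) and 2) might look like the Python sketch below. Everything here is an assumption: the class name Utf8GlobComparator, the method names, the choice to return submatches as a list, the non-greedy "*", and the use of Python's Unicode-aware re.IGNORECASE for the case-folding variant. Input is assumed to be valid UTF-8.

```python
import re
from typing import List, Optional

class Utf8GlobComparator:
    """Hypothetical "i;utf-8;glob": matches on characters, not octets."""

    def __init__(self, casemap: bool = False):
        # casemap=True stands in for the case-folding variant (assumption).
        self.flags = re.DOTALL | (re.IGNORECASE if casemap else 0)

    def _compile(self, pattern: bytes):
        parts = []
        for ch in pattern.decode("utf-8"):
            if ch == "?":
                parts.append("(.)")      # exactly one character
            elif ch == "*":
                parts.append("(.*?)")    # zero or more characters
            else:
                parts.append(re.escape(ch))
        return re.compile("".join(parts), self.flags)

    @staticmethod
    def _submatches(m) -> Optional[List[bytes]]:
        # None means "no match"; an empty list is a match without wildcards.
        return None if m is None else [g.encode("utf-8") for g in m.groups()]

    def equal(self, value: bytes, pattern: bytes) -> Optional[List[bytes]]:
        """Pattern must match the whole value."""
        return self._submatches(self._compile(pattern).fullmatch(value.decode("utf-8")))

    def prefix(self, value: bytes, pattern: bytes) -> Optional[List[bytes]]:
        """Pattern must match at the beginning of the value."""
        return self._submatches(self._compile(pattern).match(value.decode("utf-8")))

    def substring(self, value: bytes, pattern: bytes) -> Optional[List[bytes]]:
        """Pattern may match anywhere in the value."""
        return self._submatches(self._compile(pattern).search(value.decode("utf-8")))

cmp = Utf8GlobComparator()
print(cmp.equal("caf\u00e9".encode("utf-8"), b"caf?"))
print(Utf8GlobComparator(casemap=True).equal(b"abc", b"AB?"))
```

Because matching happens on decoded characters, each submatch re-encodes to complete UTF-8, and the case-folding variant returns the original text rather than a canonicalized copy. Callers should test the result against None, since a successful match with no wildcards returns an empty list.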

So I can do something like EQUAL "addressbook.Email" "en;utf-8;casemap;regex" "dave(\\+[^(_at_)]*)?(_at_)cridland(_dot_)net(\0[a-z]*)?" when searching my addressbook, which'd keep me happy.

Dave.
--
          You see things; and you say "Why?"
  But I dream things that never were; and I say "Why not?"
   - George Bernard Shaw
