Re: status of 3028bis


On Sun Oct 23 22:02:38 2005, Ned Freed wrote:

> On Fri, Oct 21, 2005 at 02:38:29AM +0200, Kjetil Torgrim Homme wrote:
> > The "en;ascii-casemap" collation is a simple collation intended for
> >    use with English language text in pure US-ASCII.  It provides
> >    equality, substring and ordering functions.  The algorithm first
> >    applies a canonicalization algorithm to both input strings which
> >    subtracts 32 (0x20) from all octet values between 97 (0x61) and 122
> >    (0x7A) inclusive.  The result of the collation is then the same as
> >    the result of the "i;octet" collation for the canonicalized strings.
> >
> > (the algorithm in RFC 2244 is essentially the same.)
> >
> > this was surprising and interesting to me, since it means that with
> > "abc" :matches "ab?", ${1} will hold the uppercase "C"!  I wonder how
> > many users would expect that one, or how many implementations get it
> > right.

> Oh my god.  Basically, this means a Sieve implementation does not
> know anything about its input being UTF8.

Quite correct. IMO this is a feature, not a bug.
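
For concreteness, here is a minimal Python sketch of the canonicalization as the quoted draft text describes it. The function names are hypothetical, and this is only an illustration of the quoted paragraph, not any implementation's code. It shows why the comparator never needs to know the input is UTF-8: every octet of a multi-octet UTF-8 sequence is 0x80 or above, so it falls outside the 97..122 range and passes through untouched.

```python
def ascii_casemap(octets: bytes) -> bytes:
    """Canonicalize as the quoted draft text describes: subtract 32 from octets 97..122."""
    return bytes(b - 32 if 97 <= b <= 122 else b for b in octets)


def casemap_equal(a: bytes, b: bytes) -> bool:
    # After canonicalization the comparison is plain octet equality ("i;octet").
    return ascii_casemap(a) == ascii_casemap(b)


print(ascii_casemap(b"abc"))              # ASCII letters are upcased
e_lower = "\u00e9".encode("utf-8")        # e-acute: octets 0xC3 0xA9
e_upper = "\u00c9".encode("utf-8")        # E-acute: octets 0xC3 0x89
print(ascii_casemap(e_lower) == e_lower)  # octets >= 0x80 are left alone
print(casemap_equal(e_lower, e_upper))    # so non-ASCII case is not folded
```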

Hmmm... I've been reading about :matches, and I'm not so sure there isn't a bug here.

:matches has a problem - its specification talks about characters, and in particular, "?" matches 'a single character'.

This is all very nice, but comparators don't operate on characters (or if they do, they get to define what a character *is*), so exactly what this means is vague. It really doesn't help that neither ACAP nor the new "collations" draft mentions any kind of matching operation - its sole definition is in one paragraph in RFC 3028 or the new draft.

> You gave an example how variables extract a single octet of a sequence,
> and you can't even control if ? matches an incomplete UTF8 octet-sequence
> or a US-ASCII character: You can not use the matched part to modify or
> generate mails or you risk to ruin the result.

Assuming:

(1) An octet-based comparator.
(2) A single ? used in isolation with no adjacent *s or ?s.
(3) Well formed UTF-8 as input.

The somewhat surprising result is that ? can only match an ASCII character. Of course something like ???? can get really interesting and match anything that encodes down to four octets.

I think you intended to say that "?" can only match a character if it is within ASCII - or more generally, if it happens to encode to a single octet in UTF-8. But it'll match any octet, of course, whatever character it might happen to be part of the encoding for.
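
To make the octet-versus-character point concrete, here is a rough Python sketch. The helper octet_match is hypothetical and handles only "?" (no "*"); it assumes an octet-based comparator where "?" consumes exactly one octet, which is the reading under discussion rather than anything the spec states outright.

```python
import re


def octet_match(pattern: bytes, value: bytes):
    """Hypothetical helper: '?' matches exactly one octet, all else is literal."""
    regex = b"".join(
        b"(.)" if octet == ord("?") else re.escape(bytes([octet]))
        for octet in pattern
    )
    # DOTALL so that '?' really matches any octet, including 0x0A.
    return re.fullmatch(regex, value, flags=re.DOTALL)


ascii_value = "abc".encode("utf-8")        # 3 octets
accent_value = "ab\u00e9".encode("utf-8")  # 4 octets: e-acute is 0xC3 0xA9

m_ascii = octet_match(b"ab?", ascii_value)     # matches; ${1} would be "c"
m_single = octet_match(b"ab?", accent_value)   # no match: one octet too many
m_double = octet_match(b"ab??", accent_value)  # matches, splitting the character

print(m_ascii.group(1), m_single, m_double.groups())
```

In the last case each captured value is a lone octet (0xC3, then 0xA9), and neither is well-formed UTF-8 on its own, which is exactly the hazard for a script that reuses the matched part.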

> > this is truly a mess.  we need a comparator which makes sense, a
> > comparator which operates case-fully on Unicode characters, but without
> > the difficult bits (e.g. normalising combining characters).

I'm certainly in full agreement there.

Well, we have quite a large user base on this stuff and most of these issues have proved to be theoretical. Variables may change this, however.
A construct like:

    require "variables";
    if header :matches "subject" "*" {set "subject" "${1}";}
    else {set "subject" "";}

ends up storing the subject in all caps, which likely isn't what was intended.


I think that's a matter of interpretation.

Variables says, in section 3.2, that the list variables expand to what the wildcard matched.

I see nothing saying that this must be in the internal transformation of the string by a comparator (if such a thing exists), nor that it should be those matching portions of the original string, but my gut feeling is that a comparator should be essentially a black box - that is, the internal transformations of the comparator shouldn't be visible to the script.
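
As a sketch of the two readings, the Python below matches against the canonicalized octets but records both what the internal form captured and the same span sliced out of the original value. The names casemap and match_captures are hypothetical, "*" is treated as a greedy "any octets" wildcard purely for simplicity, and slicing by offset relies on the assumption that the canonicalization preserves length (true for the ASCII casemap, not for comparators in general).

```python
import re


def casemap(octets: bytes) -> bytes:
    # Canonicalization as quoted earlier in the thread: a-z become A-Z.
    return bytes(b - 32 if 97 <= b <= 122 else b for b in octets)


def match_captures(pattern: bytes, value: bytes):
    """Hypothetical helper: returns (internal captures, original captures) or None."""
    parts = casemap(pattern).split(b"*")
    regex = b"(.*)".join(re.escape(part) for part in parts)
    m = re.fullmatch(regex, casemap(value), flags=re.DOTALL)
    if m is None:
        return None
    groups = range(1, m.re.groups + 1)
    internal = [m.group(i) for i in groups]            # comparator's internal form
    original = [value[slice(*m.span(i))] for i in groups]  # black-box reading
    return internal, original


internal, original = match_captures(b"[foo] *", b"[Foo] Status of 3028bis")
print(internal)  # what ${1} holds if the transformation leaks out
print(original)  # what ${1} holds if the comparator is a black box
```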

This is especially true since some comparators don't transform to a string internally - not that we should be trying matches with "i;ascii-numeric" (and why is this remaining "i;" if we have to change "i;ascii-casemap"? ASCII digits aren't available everywhere. Perhaps I'm being naïve again.)


Moreover:

require "variables";
if header :comparator "i;ascii-casemap" :matches "subject" "[foo] *" {
        set "subject" "${1}";
} else {
        set "subject" "${1}";
}

would be a whole load more useful if ${subject} was a substring of the original subject, rather than an upper-case variant of it.

I believe that some of the examples in the variables draft have similar
mistakes in them.


Different interpretations, rather than mistakes.

Dave.
--
          You see things; and you say "Why?"
  But I dream things that never were; and I say "Why not?"
   - George Bernard Shaw