ietf-mta-filters

Re: status of 3028bis

2005-10-23 15:42:57

On Sun Oct 23 22:02:38 2005, Ned Freed wrote:


> On Fri, Oct 21, 2005 at 02:38:29AM +0200, Kjetil Torgrim Homme wrote:
> >    The "en;ascii-casemap" collation is a simple collation intended for
> >    use with English language text in pure US-ASCII.  It provides
> >    equality, substring and ordering functions.  The algorithm first
> >    applies a canonicalization algorithm to both input strings which
> >    subtracts 32 (0x20) from all octet values between 97 (0x61) and 122
> >    (0x7A) inclusive.  The result of the collation is then the same as
> >    the result of the "i;octet" collation for the canonicalized strings.
> >
> > (the algorithm in RFC 2244 is essentially the same.)
> >
> > this was surprising and interesting to me, since it means that with
> > "abc" :matches "ab?", ${1} will hold the uppercase "C"!  I wonder how
> > many users would expect that one, or how many implementations get it
> > right.
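
For readers following along, here is a minimal sketch of the canonicalization quoted above, written in Python purely for illustration. The function names (ascii_casemap, casemap_equal, casemap_compare) are hypothetical and come from no draft or implementation; the only rule taken from the quoted text is "subtract 32 from octets 97..122, then behave like i;octet".

```python
def ascii_casemap(octets: bytes) -> bytes:
    """Canonicalize as the quoted draft text describes: subtract 32 (0x20)
    from every octet in the range 97 (0x61) .. 122 (0x7A) inclusive."""
    return bytes(b - 32 if 97 <= b <= 122 else b for b in octets)


def casemap_equal(a: bytes, b: bytes) -> bool:
    """Equality: plain octet equality on the canonicalized strings."""
    return ascii_casemap(a) == ascii_casemap(b)


def casemap_compare(a: bytes, b: bytes) -> int:
    """Ordering: octet-wise ordering of the canonicalized strings.
    Returns -1, 0 or 1."""
    ca, cb = ascii_casemap(a), ascii_casemap(b)
    return (ca > cb) - (ca < cb)


# ascii_casemap(b"abc") == b"ABC"
# Octets outside 0x61..0x7A pass through untouched, so the two octets of a
# UTF-8 encoded "é" (0xC3 0xA9) are unchanged, and "é" never equals "É".
```

Nothing in this sketch looks at characters: it sees octets only, which is the property the rest of the thread argues about.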

> Oh my god.  Basically, this means a Sieve implementation does not
> know anything about its input being UTF8.

Quite correct. IMO this is a feature, not a bug.


Hmmm... I've been reading about :matches, and I'm not so sure there isn't a bug here.

:matches has a problem - its specification talks about characters, and in particular, "?" matches 'a single character'.

This is all very nice, but comparators don't operate on characters (or if they do, they get to define what a character *is*), so exactly what this means is vague. It really doesn't help that neither ACAP nor the new "collations" draft mentions any kind of matching operation; the sole definition of :matches is one paragraph in RFC3028 or the new draft.

> You gave an example how variables extract a single octet of a sequence,
> and you can't even control if ? matches an incomplete UTF8 octet-sequence
> or a US-ASCII character: You can not use the matched part to modify or
> generate mails or you risk to ruin the result.

Assuming:

(1) An octet-based comparator.
(2) A single ? used in isolation with no adjacent *s or ?s.
(3) Well formed UTF-8 as input.

The somewhat surprising result is that ? can only match an ASCII character. Of course something like ???? can get really interesting and match anything that encodes down to four octets.

I think you intended to say that "?" can only match a character if it is within ASCII - or more generally, if it happens to encode to a single octet in UTF-8. But it'll match any octet, of course, whatever character it might happen to be part of the encoding for.
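
To make the octet-versus-character point concrete, here is a rough Python sketch of an octet-based wildcard match. It assumes "?" means exactly one octet and "*" means any run of octets, matched as shortly as possible; octet_matches is a hypothetical helper, and it ignores backslash escaping in patterns.

```python
import re


def octet_matches(pattern: bytes, value: bytes):
    """Match value against a pattern one octet at a time.

    Returns the list of octet strings captured by each wildcard,
    or None when the pattern does not match the whole value.
    """
    parts = []
    for b in pattern:
        ch = bytes([b])
        if ch == b"?":
            parts.append(b"(.)")      # exactly one octet
        elif ch == b"*":
            parts.append(b"(.*?)")    # any run of octets, shortest first
        else:
            parts.append(re.escape(ch))
    m = re.fullmatch(b"".join(parts), value, re.DOTALL)
    return None if m is None else list(m.groups())


e_acute = "é".encode("utf-8")         # two octets: 0xC3 0xA9

# A lone "?" cannot match "é", because "é" is two octets:
#   octet_matches(b"?", e_acute) is None
# With an adjacent wildcard, "?" captures only the first octet of "é",
# which is not valid UTF-8 on its own:
#   octet_matches(b"?*", e_acute + b"a") == [b"\xc3", b"\xa9a"]
```

The second case is the one that would damage a message if the captured value were pasted back into a header.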


> > this is truly a mess.  we need a comparator which makes sense, a
> > comparator which operates case-fully on Unicode characters, but without
> > the difficult bits (e.g. normalising combining characters).


I'm certainly in full agreement there.


Well, we have quite a large user base on this stuff and most of these issues have proved to be theoretical. Variables may change this, however.
A construct like:

    require "variables";
    if header :matches "subject" "*" {set "subject" "${1}";}
    else {set "subject" "";}

ends up storing the subject in all caps, which likely isn't what was intended.

I think that's a matter of interpretation.

Variables says, in section 3.2, that the list variables expand to what the wildcard matched.

I see nothing saying that this must be in the internal transformation of the string by a comparator (if such a thing exists), nor that it should be those matching portions of the original string, but my gut feeling is that a comparator should be essentially a black box - that is, the internal transformations of the comparator shouldn't be visible to the script.

This is especially true since some comparators don't transform to a string internally - not that we should be trying matches with "i;ascii-numeric" (and why is this remaining "i;" if we have to change "i;ascii-casemap"? ASCII digits aren't available everywhere. Perhaps I'm being naïve again.)

Moreover:

require "variables";
if header :comparator "i;ascii-casemap" :matches "subject" "[foo] *" {
        set "subject" "${1}";
} else {
        set "subject" "${1}";
}

would be a whole load more useful if ${subject} was a substring of the original subject, rather than an upper-case variant of it.
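
The two readings can be put side by side in a short Python sketch. It assumes the "black box" reading can be implemented by matching on the canonicalized octets and then slicing the original string at the same offsets, which works here only because the casemap canonicalization preserves length. All names (casemap, compile_glob, match_both) are hypothetical.

```python
import re


def casemap(octets: bytes) -> bytes:
    """Subtract 32 from octets 97..122, as in the casemap collation."""
    return bytes(b - 32 if 97 <= b <= 122 else b for b in octets)


def compile_glob(pattern: bytes):
    """Turn a "?" / "*" pattern into an octet regex (no escape handling)."""
    parts = []
    for b in pattern:
        ch = bytes([b])
        if ch == b"?":
            parts.append(b"(.)")
        elif ch == b"*":
            parts.append(b"(.*?)")
        else:
            parts.append(re.escape(ch))
    return re.compile(b"".join(parts), re.DOTALL)


def match_both(pattern: bytes, value: bytes):
    """Return (canonical_captures, original_captures), or None on no match."""
    m = compile_glob(casemap(pattern)).fullmatch(casemap(value))
    if m is None:
        return None
    groups = range(1, m.re.groups + 1)
    # Reading 1: expose what the comparator saw internally.
    canonical = [m.group(i) for i in groups]
    # Reading 2: treat the comparator as a black box and slice the
    # original value at the offsets where the wildcards matched.
    original = [value[m.start(i):m.end(i)] for i in groups]
    return canonical, original


# match_both(b"[foo] *", b"[Foo] Hello world")
#   == ([b"HELLO WORLD"], [b"Hello world"])
```

A comparator whose canonical form changes string length would need an explicit offset mapping instead of the direct slicing used here.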

I believe that some of the examples in the variables draft may have similar
mistakes in them.

Different interpretations, rather than mistakes.

Dave.
--
          You see things; and you say "Why?"
  But I dream things that never were; and I say "Why not?"
   - George Bernard Shaw
