ietf-mta-filters

Re: status of 3028bis

2005-10-23 17:12:40

> > Oh my god.  Basically, this means a Sieve implementation does not
> > know anything about its input being UTF8.
>
> Quite correct. IMO this is a feature, not a bug.
>
>
Hmmm... I've been reading about :matches, and I'm not so sure there
isn't a bug here.

A specification bug, perhaps.

:matches has a problem - its specification talks about characters,
and in particular, "?" matches 'a single character'.

This is all very nice, but comparators don't operate on characters
(or if they do, they get to define what a character *is*), so exactly
what this means is vague. It really doesn't help that neither ACAP
nor the new "collations" draft mentions any kind of matching operation;
its sole definition is in one paragraph in RFC3028 or the new draft.

Some comparators do octets, others may group those octets in some way,
including but not limited to grouping them into things we call characters.
So it's a single something, with "something" being defined by the underlying
comparator.

I don't know the best way to write this, but that's in effect what it needs to
say.
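
To make the "single something" point concrete, here is a minimal Python sketch,
not any implementation's actual code. The names glob_to_regex and matches and
the "unit" parameter are made up for illustration, and the Sieve backslash
escapes (\* and \?) are ignored. The same pattern is applied once with code
points as the unit and once with UTF-8 octets as the unit:

```python
import re

def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a :matches-style pattern into a regex, one unit per '?'."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("(.*?)")       # zero or more units
        elif ch == "?":
            parts.append("(.)")         # exactly one unit
        else:
            parts.append(re.escape(ch))  # literal unit
    return re.compile("".join(parts), re.DOTALL)

def matches(text: str, pattern: str, unit: str) -> bool:
    """unit is 'character' (code points) or 'octet' (UTF-8 octets)."""
    if unit == "octet":
        # View every UTF-8 octet as one unit by mapping it to the
        # Latin-1 code point with the same value.
        text = text.encode("utf-8").decode("latin-1")
        pattern = pattern.encode("utf-8").decode("latin-1")
    return glob_to_regex(pattern).fullmatch(text) is not None

e_acute = "\u00e9"  # one character, two octets in UTF-8
print(matches(e_acute, "?", "character"))  # True
print(matches(e_acute, "?", "octet"))      # False
print(matches(e_acute, "??", "octet"))     # True
```

The only thing that changes between the two runs is what counts as one unit,
which is exactly the part the specification text leaves to the comparator.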

> > You gave an example how variables extract a single octet of a sequence,
> > and you can't even control if ? matches an incomplete UTF8 octet-sequence
> > or a US-ASCII character: You can not use the matched part to modify or
> > generate mails or you risk to ruin the result.
>
> Assuming:
>
> (1) An octet-based comparator.
> (2) A single ? used in isolation with no adjacent *s or ?s.
> (3) Well formed UTF-8 as input.
>
> The somewhat surprising result is that ? can only match an ASCII character.
> Of course something like ???? can get really interesting and match anything
> that encodes down to four octets.
>
I think you intended to say that "?" can only match a character if it
is within ASCII - or more generally, if it happens to encode to a
single octet in UTF-8. But it'll match any octet, of course, whatever
character it might happen to be part of the encoding for.

Yes, but this fails to take the context into account. There are three subcases
where ? appears at the beginning, middle, or end of the pattern.

If ? appears at the beginning and the target begins with a non-ASCII character,
the ? will match the first octet of that character. But since this is well
formed UTF-8 there has to be a second byte and that byte cannot possibly
match the next thing in the pattern. So the match always fails unless
the first character of the target is ASCII.

I won't bother to walk through the remaining cases, but they all end up the
same way: The ? may match a non-ASCII character in isolation, but unless it is
adjacent to some other wildcard there is no way the entire pattern can
subsequently match.
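
The three subcases can be checked mechanically. The following Python sketch
assumes an octet-based comparator and well-formed UTF-8 input, as in the list
above; octet_match is a hypothetical helper, not part of any Sieve
implementation, and it ignores the \* and \? escapes. It returns the octet
runs captured by the wildcards, or None when the whole pattern fails:

```python
import re

def octet_match(target: str, pattern: str):
    """Match a :matches-style pattern octet by octet.

    Returns a tuple of captured octet runs, or None if the pattern
    does not match the whole target.
    """
    regex = b"".join(
        b"(.*?)" if b == ord("*")
        else b"(.)" if b == ord("?")
        else re.escape(bytes([b]))
        for b in pattern.encode("utf-8")
    )
    m = re.fullmatch(regex, target.encode("utf-8"), re.DOTALL)
    return None if m is None else m.groups()

e = "\u00e9"  # encodes to the two octets C3 A9

print(octet_match("abc", "?bc"))        # (b'a',)
print(octet_match(e + "bc", "?bc"))     # None: ? at the beginning
print(octet_match("a" + e + "c", "a?c"))  # None: ? in the middle
print(octet_match("ab" + e, "ab?"))     # None: ? at the end
print(octet_match(e + e, "????"))       # four single-octet captures
print(octet_match(e + "bc", "?*bc"))    # (b'\xc3', b'\xa9')
```

The last call is the troublesome one: with an adjacent wildcard the pattern
does match, and each captured piece is half of a character, so neither is
valid UTF-8 on its own.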

> Well, we have quite a large user base on this stuff and most of
> these issues have proved to be theoretical. Variables may change this,
> however.
> A construct like:
>
>     require "variables";
>     if header :matches "subject" "*" {set "subject" "${1}";}
>     else {set "subject" "";}
>
> ends up storing the subject in all caps, which likely isn't what
> was intended.

I think that's a matter of interpretation.

Variables says, in section 3.2, that the list variables expand to
what the wildcard matched.

But what the wildcard matched was transformed text. Try as I might, I cannot
read what's there the way you do.

I see nothing saying that this must be in the internal transformation
of the string by a comparator (if such a thing exists), nor that it
should be those matching portions of the original string, but my gut
feeling is that a comparator should be essentially a black box - that
is, the internal transformations of the comparator shouldn't be
visible to the script.

Although it will be some work for me to change my implementation to match
this, I would be happy to make such a change because I think the resulting
behavior is better. However, if this is what people want it needs to be
made explicit in the variables specification. I think your reading of
what's there now is a huge stretch.

This is especially true since some comparators don't transform to a
string internally - not that we should be trying matches with
"i;ascii-numeric" (and why is this remaining "i;" if we have to
change "i;ascii-casemap"? ASCII digits aren't available everywhere.
Perhaps I'm being naïve again.)

IMO we have much too much deployed code to change the names of any of these
things. At the very most we can add new names and make the new names the
preferred names. The old names have to remain and be recognized indefinitely.

Moreover:

require "variables";
if header :comparator "i;ascii-casemap" :matches "subject" "[foo] *" {
        set "subject" "${1}";
} else {
        set "subject" "${1}";
}

would be a whole load more useful if ${subject} was a substring of
the original subject, rather than an upper-case variant of it.

Yes, that's the case that argues strongly for your approach. It is also worth
noting that those who want to force things into a particular case can do so
with the various modifiers to set.
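
For what it's worth, the two readings differ only in where the captured text
is taken from. Here is a rough Python sketch under the assumption that the
comparator is i;ascii-casemap, which upper-cases a-z and so preserves offsets;
ascii_casemap, match_with_captures and the from_original flag are names
invented for this illustration, and a comparator whose transformation changes
lengths would need an explicit offset map instead:

```python
import re

def ascii_casemap(s: str) -> str:
    """Upper-case only a-z, leaving everything else alone."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)

def match_with_captures(original: str, pattern: str, from_original: bool):
    """Match on the case-mapped text; return wildcard captures or None.

    from_original=True slices the captures out of the untouched input
    (the "black box" reading); False returns the transformed text.
    """
    regex = "".join(
        "(.*?)" if c == "*" else "(.)" if c == "?" else re.escape(c)
        for c in ascii_casemap(pattern)
    )
    folded = ascii_casemap(original)
    m = re.fullmatch(regex, folded, re.DOTALL)
    if m is None:
        return None
    source = original if from_original else folded
    # The case mapping preserves length, so match offsets are valid
    # in both the folded and the original string.
    return [source[m.start(i):m.end(i)] for i in range(1, m.re.groups + 1)]

subject = "[Foo] Hello World"
print(match_with_captures(subject, "[foo] *", from_original=False))
# ['HELLO WORLD']
print(match_with_captures(subject, "[foo] *", from_original=True))
# ['Hello World']
```

The match itself is identical in both cases; only the final slice changes,
which is why the cost of switching is mostly bookkeeping of offsets.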

                                Ned
