On Sun Oct 23 22:02:38 2005, Ned Freed wrote:
> On Fri, Oct 21, 2005 at 02:38:29AM +0200, Kjetil Torgrim Homme wrote:
> > The "en;ascii-casemap" collation is a simple collation intended for
> > use with English language text in pure US-ASCII. It provides
> > equality, substring and ordering functions. The algorithm first
> > applies a canonicalization algorithm to both input strings which
> > subtracts 32 (0x20) from all octet values between 97 (0x61) and 122
> > (0x7A) inclusive. The result of the collation is then the same as
> > the result of the "i;octet" collation for the canonicalized strings.
> >
> > (the algorithm in RFC 2244 is essentially the same.)
> >
> > this was surprising and interesting to me, since it means that with
> > "abc" :matches "ab?", ${1} will hold the uppercase "C"! I wonder how
> > many users would expect that one, or how many implementations get it
> > right.
> Oh my god. Basically, this means a Sieve implementation does not
> know anything about its input being UTF8.
Quite correct. IMO this is a feature, not a bug.
Hmmm... I've been reading about :matches, and I'm not so sure there
isn't a bug here.
:matches has a problem - its specification talks about characters,
and in particular, "?" matches 'a single character'.
This is all very nice, but comparators don't operate on characters
(or if they do, they get to define what a character *is*), so exactly
what this means is vague. It really doesn't help that neither ACAP
nor the new "collations" draft mentions any kind of matching operation;
its sole definition is one paragraph in RFC 3028 or the new draft.
> You gave an example how variables extract a single octet of a sequence,
> and you can't even control if ? matches an incomplete UTF8 octet-sequence
> or a US-ASCII character: You can not use the matched part to modify or
> generate mails or you risk to ruin the result.
Assuming:
(1) An octet-based comparator.
(2) A single ? used in isolation with no adjacent *s or ?s.
(3) Well formed UTF-8 as input.
The somewhat surprising result is that ? can only match an ASCII character. Of
course something like ???? can get really interesting and match anything that
encodes down to four octets.
I think you intended to say that "?" can only match a character if it
is within ASCII - or more generally, if it happens to encode to a
single octet in UTF-8. But it'll match any octet, of course, whatever
character it might happen to be part of the encoding for.
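
To make the octet point concrete, here is a minimal Python sketch. It is
not taken from any Sieve implementation: octet_glob_match is a
hypothetical helper, and it assumes "?" means exactly one octet, "*"
means zero or more octets (non-greedy), and that backslash escapes can
be ignored for brevity.

```python
import re


def octet_glob_match(pattern: str, value: str):
    """Match a ':matches'-style pattern against the UTF-8 octets of value.

    Assumptions (illustration only): '?' is one octet, '*' is zero or
    more octets matched non-greedily, backslash escapes are not handled.
    Returns the captured octet strings, or None when there is no match.
    """
    regex = b""
    for ch in pattern:
        if ch == "?":
            regex += b"(.)"
        elif ch == "*":
            regex += b"(.*?)"
        else:
            regex += re.escape(ch.encode("utf-8"))
    m = re.fullmatch(regex, value.encode("utf-8"), re.DOTALL)
    return list(m.groups()) if m else None


# A lone '?' matches the single-octet (ASCII) character 'e' ...
ascii_hit = octet_glob_match("caf?", "cafe")
# ... but not U+00E9, which is two octets in UTF-8.
two_octet_miss = octet_glob_match("caf?", "caf\u00e9")
# Two '?' do match it, and each capture is half a character.
split_char = octet_glob_match("caf??", "caf\u00e9")
```

Here ascii_hit is [b"e"], two_octet_miss is None, and split_char is
[b"\xc3", b"\xa9"]. Neither of those last two octets is well-formed
UTF-8 on its own, which is the risk raised above about reusing the
matched part.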
> > this is truly a mess. we need a comparator which makes sense, a
> > comparator which operates case-fully on Unicode characters, but without
> > the difficult bits (e.g. normalising combining characters).
I'm certainly in full agreement there.
Well, we have quite a large user base on this stuff and most of these
issues have proved to be theoretical. Variables may change this, however.
A construct like:
require "variables";
# No :comparator is given, so the default "i;ascii-casemap" applies.
if header :matches "subject" "*" {set "subject" "${1}";}
else {set "subject" "";}
ends up storing the subject in all caps, which likely isn't what
was intended.
I think that's a matter of interpretation.
Variables says, in section 3.2, that the list variables expand to
what the wildcard matched.
I see nothing saying that this must be in the internal transformation
of the string by a comparator (if such a thing exists), nor that it
should be those matching portions of the original string, but my gut
feeling is that a comparator should be essentially a black box - that
is, the internal transformations of the comparator shouldn't be
visible to the script.
This is especially true since some comparators don't transform to a
string internally - not that we should be trying matches with
"i;ascii-numeric" (and why is this remaining "i;" if we have to
change "i;ascii-casemap"? ASCII digits aren't available everywhere.
Perhaps I'm being naïve again.)
Moreover:
require "variables";
if header :comparator "i;ascii-casemap" :matches "subject" "[foo] *" {
    set "subject" "${1}";
} else {
    set "subject" "${1}";
}
would be a whole load more useful if ${subject} was a substring of
the original subject, rather than an upper-case variant of it.
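
A minimal Python sketch of the two readings, under the same assumptions
as the quoted collation text (canonicalize by subtracting 32 from octets
0x61 to 0x7A, then compare octets). The names ascii_casemap and
casemap_matches, and the from_original flag, are hypothetical and exist
only for illustration. Because this canonicalization never changes the
length of the string, the match offsets found in the canonical form can
be applied directly to the original.

```python
import re


def ascii_casemap(octets: bytes) -> bytes:
    """Subtract 32 from every octet in 0x61..0x7A, leave the rest alone."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in octets)


def casemap_matches(pattern: str, value: str, from_original: bool = True):
    """Glob-match on canonicalized octets and return the wildcard captures.

    Assumptions (illustration only): '?' is one octet, '*' is zero or
    more octets (non-greedy), backslash escapes are not handled.
    from_original=True slices the captures out of the untouched input;
    from_original=False exposes the comparator's internal form instead.
    """
    original = value.encode("utf-8")
    canonical = ascii_casemap(original)
    regex = b""
    for ch in pattern:
        if ch == "?":
            regex += b"(.)"
        elif ch == "*":
            regex += b"(.*?)"
        else:
            regex += re.escape(ascii_casemap(ch.encode("utf-8")))
    m = re.fullmatch(regex, canonical, re.DOTALL)
    if m is None:
        return None
    source = original if from_original else canonical
    # Offsets are valid in both forms because the lengths are identical.
    return [source[m.start(i):m.end(i)] for i in range(1, m.re.groups + 1)]


black_box = casemap_matches("[foo] *", "[Foo] Hello World")
internal = casemap_matches("[foo] *", "[Foo] Hello World", from_original=False)
```

With the black-box reading, black_box is [b"Hello World"]; exposing the
internal form gives [b"HELLO WORLD"] in internal. The same applies to
the "abc" :matches "ab?" case: the capture is b"c" from the original and
b"C" from the canonical form.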
I believe that some of the examples in the variables draft may have similar
mistakes in them.
Different interpretations, rather than mistakes.
Dave.
--
You see things; and you say "Why?"
But I dream things that never were; and I say "Why not?"
- George Bernard Shaw