Re: status of 3028bis
2005-10-22 05:02:45
On Fri Oct 21 11:29:13 2005, Michael Haardt wrote:
On Fri, Oct 21, 2005 at 02:38:29AM +0200, Kjetil Torgrim Homme wrote:
> The "en;ascii-casemap" collation is a simple collation intended for
> use with English language text in pure US-ASCII. It provides
> equality, substring and ordering functions. The algorithm first
> applies a canonicalization algorithm to both input strings which
> subtracts 32 (0x20) from all octet values between 97 (0x61) and 122
> (0x7A) inclusive. The result of the collation is then the same as
> the result of the "i;octet" collation for the canonicalized strings.
>
> (the algorithm in RFC 2244 is essentially the same.)
>
> this was surprising and interesting to me, since it means that with
> "abc" :matches "ab?", ${1} will hold the uppercase "C"! I wonder how
> many users would expect that one, or how many implementations get it
> right.
Oh my god. Basically, this means a Sieve implementation does not
know anything about its input being UTF8. All it does is convert
headers to UTF8, but other than that, it works on octets, not
characters.
Yep. But don't be too alarmed, this only breaks :matches.
Incidentally, I really suspect that :matches in this case should be
returning "c", not "C", and it depends heavily on a lot of
specification that I suspect doesn't exist.
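To make the "C" versus "c" point concrete, here is a minimal Python sketch. It assumes the canonicalization is exactly what the quoted draft text says (subtract 32 from octets 0x61 to 0x7A) and that :matches is then evaluated octet-wise on the canonicalized strings. The helper names (ascii_casemap, glob_to_regex and the two match functions) are made up for illustration, and Sieve's backslash escaping of wildcards is left out.

```python
import re

def ascii_casemap(octets: bytes) -> bytes:
    """Canonicalize as the quoted draft describes: octets 0x61..0x7A minus 32."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in octets)

def glob_to_regex(pattern: bytes) -> "re.Pattern[bytes]":
    """Hypothetical helper: '?' and '*' become capturing groups over octets."""
    parts = []
    for b in pattern:
        ch = bytes([b])
        if ch == b"?":
            parts.append(b"(.)")      # exactly one octet
        elif ch == b"*":
            parts.append(b"(.*?)")    # zero or more octets
        else:
            parts.append(re.escape(ch))
    return re.compile(b"".join(parts), re.DOTALL)

def matches_literal_reading(value: bytes, pattern: bytes):
    """Captures are taken from the canonicalized value, so they come out uppercase."""
    m = glob_to_regex(ascii_casemap(pattern)).fullmatch(ascii_casemap(value))
    return list(m.groups()) if m else None

def matches_preserving_case(value: bytes, pattern: bytes):
    """Match on the canonical form, but slice captures out of the original value.

    This works because the canonicalization never changes the length of the
    string, so match offsets line up with the original octets.
    """
    m = glob_to_regex(ascii_casemap(pattern)).fullmatch(ascii_casemap(value))
    if m is None:
        return None
    return [value[m.start(i):m.end(i)] for i in range(1, m.re.groups + 1)]

print(matches_literal_reading(b"abc", b"ab?"))   # [b'C']
print(matches_preserving_case(b"abc", b"ab?"))   # [b'c']
```

The second function is only one possible reading of what "should be returning c" could mean; nothing in the quoted text specifies it.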
If so, we must not talk about unicode, but UTF8, and
s/character/octet/g.
And I have to change my implementation, which works on characters,
crippling it.
Do all other implementations work on octets instead of characters?
It actually doesn't matter. All the comparators so far defined at RFC
level (i.e., by ACAP) operate on octets, not characters. But assuming
your implementation (modulo :matches) gives the same results it would
*if* all comparators operated on UTF-8 encoded octet strings rather
than Unicode character strings, this is easily rectifiable.
:matches is the problem.
> > Are you saying that even using "en;ascii-casemap", the wildcard "?"
> > does not match a single character outside US-ASCII?
>
> since the spec defines it in terms of i;octet, the "?" wildcard is
> essentially broken. it gets really interesting with "*", though, since
> you will probably get doubly encoded UTF-8 :-(
Worse, matched octets can be invalid (incomplete) UTF8 octet-sequences.
You gave an example of how variables extract a single octet of a
sequence, and you can't even control whether ? matches an incomplete
UTF8 octet-sequence or a US-ASCII character: you cannot use the matched
part to modify or generate mails, or you risk ruining the result.
Yes, true. Fun, isn't it?
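A small Python sketch of the failure, assuming "?" and "*" are evaluated strictly octet by octet on UTF-8 input. The sample string and the variable names are made up for illustration, and the last step is only one plausible way the double encoding mentioned above could arise (octets reinterpreted as ISO 8859-1 and encoded again).

```python
import re

# "né" in UTF-8 is three octets: 'n', then 0xC3 0xA9 for 'é'.
value = "né".encode("utf-8")

# Octet-wise reading of the glob "n?*": '?' consumes exactly one octet.
octet_match = re.fullmatch(rb"n(.)(.*)", value, re.DOTALL)
question_mark, star = octet_match.group(1), octet_match.group(2)

def is_valid_utf8(octets: bytes) -> bool:
    """True if the octets decode as UTF-8 on their own."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(question_mark, is_valid_utf8(question_mark))  # b'\xc3' False
print(star, is_valid_utf8(star))                    # b'\xa9' False

# Character-wise reading of the same glob: '?' takes the whole 'é'.
char_match = re.fullmatch(r"n(.)(.*)", value.decode("utf-8"), re.DOTALL)
print(char_match.group(1), repr(char_match.group(2)))  # é ''

# Assumed route to double encoding: treat the octets as ISO 8859-1
# characters and encode them as UTF-8 a second time.
doubly_encoded = value.decode("latin-1").encode("utf-8")
print(doubly_encoded)  # b'n\xc3\x83\xc2\xa9'
```

Neither captured octet can be written back into a header or a new message without producing invalid UTF-8.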
> this is truly a mess. we need a comparator which makes sense, a
> comparator which operates case-fully on Unicode characters, but without
> the difficult bits (e.g. normalising combining characters).
Indeed!
This is a very, very quick proposal that I've had in the back of my
mind, but the use of :matches convinces me that this is the way to go.
The current comparator draft changed the name of comparators to
collations - I thought that was wrong, because I'd have liked to see
comparators that performed pattern matching, which precludes their
ability to collate.
I'd like to propose:
1) A family of comparators for both UTF-8 and Octet matching, the
matches themselves being Globs (as Sieve :matches), and regular
expressions (perhaps Basic and Extended). This is off the top of my
head, and it's a Saturday, so I can't recall the conventions for
comparator naming, but I'll use the example of "i;utf-8;glob".
This comparator would perform the "EQUAL" operation such that if the
left hand side were matched by the pattern contained in the
right-hand side, it returned true, otherwise false.
"SUBSTRING" would match if the pattern occurred anywhere in the
string, and "PREFIX" if it occurred at the beginning of the string.
We need at least a UTF-8 glob to replace :matches, and a case-folding
variant, too.
2) A new API requirement that "submatches" may be extracted from the
result of a successful EQUAL, PREFIX, or SUBSTRING match.
3) In Sieve, :matches becomes deprecated, and for backwards
compatibility, it should be rewritten such that:
a) The comparator is changed from "i;octet" to "i;utf-8;glob", and
from "i;ascii-casemap" to "en;utf-8;casemap;glob".
b) The operation is changed from :matches to :is
The net result is not only that :matches now works as people expect,
but that regular expression searching and glob pattern matching
become available in ACAP and IMAP, too.
So I can do something like

EQUAL "addressbook.Email" "en;utf-8;casemap;regex" "dave(\\+[^@]*)?@cridland.net(\0[a-z]*)?"

when searching my addressbook, which'd keep me happy.
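As a rough illustration of points 1) and 2), here is a Python sketch of what a character-based glob comparator with submatch extraction might look like. Everything in it is an assumption made for the example: the class name Utf8Glob, the method names, returning the submatches as a tuple, and using Python's re.IGNORECASE as a stand-in for the case-mapping variant (it is simple case-insensitive matching, not a defined collation). Backslash escaping of wildcards is omitted.

```python
import re
from typing import Optional, Tuple

def glob_to_regex(pattern: str, casemap: bool) -> "re.Pattern[str]":
    """Translate a glob into a regex whose groups are the wildcards."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("(.)")      # exactly one character
        elif ch == "*":
            parts.append("(.*?)")    # zero or more characters
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (re.IGNORECASE if casemap else 0)
    return re.compile("".join(parts), flags)

class Utf8Glob:
    """Hypothetical stand-in for a comparator such as "i;utf-8;glob"."""

    def __init__(self, casemap: bool = False) -> None:
        self.casemap = casemap

    def _run(self, how: str, value: bytes, pattern: bytes) -> Optional[Tuple[str, ...]]:
        # Decoding first means wildcards see characters, never partial
        # octet sequences. Invalid UTF-8 raises UnicodeDecodeError.
        regex = glob_to_regex(pattern.decode("utf-8"), self.casemap)
        m = getattr(regex, how)(value.decode("utf-8"))
        return m.groups() if m else None

    def equal(self, value: bytes, pattern: bytes):
        """Whole value must be matched by the pattern."""
        return self._run("fullmatch", value, pattern)

    def prefix(self, value: bytes, pattern: bytes):
        """Pattern must match at the beginning of the value."""
        return self._run("match", value, pattern)

    def substring(self, value: bytes, pattern: bytes):
        """Pattern may match anywhere in the value."""
        return self._run("search", value, pattern)

glob = Utf8Glob()
print(glob.equal("né".encode("utf-8"), b"n?"))          # ('é',)
print(Utf8Glob(casemap=True).equal(b"abc", b"AB?"))     # ('c',)
```

Each method returns None when there is no match and a tuple of submatches otherwise. A pattern without wildcards yields an empty tuple on success, so callers should test the result with "is not None" rather than truthiness.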
Dave.
--
You see things; and you say "Why?"
But I dream things that never were; and I say "Why not?"
- George Bernard Shaw