
Re: status of 3028bis

2005-10-21 03:43:40

On Fri, Oct 21, 2005 at 02:38:29AM +0200, Kjetil Torgrim Homme wrote:
   The "en;ascii-casemap" collation is a simple collation intended for
   use with English language text in pure US-ASCII.  It provides
   equality, substring and ordering functions.  The algorithm first
   applies a canonicalization algorithm to both input strings which
   subtracts 32 (0x20) from all octet values between 97 (0x61) and 122
   (0x7A) inclusive.  The result of the collation is then the same as
   the result of the "i;octet" collation for the canonicalized strings.

(the algorithm in RFC 2244 is essentially the same.)
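
The quoted definition can be sketched in a few lines of Python. This is only an illustration of the text above, not code from any Sieve implementation; the function names (ascii_casemap, casemap_equal, casemap_compare, casemap_contains) are made up for this example.

```python
def ascii_casemap(data: bytes) -> bytes:
    """Canonicalize: subtract 32 (0x20) from every octet in 0x61..0x7A.

    All other octets, including every octet of a multi-byte UTF-8
    sequence, are passed through unchanged.
    """
    return bytes(b - 0x20 if 0x61 <= b <= 0x7A else b for b in data)


def casemap_equal(a: bytes, b: bytes) -> bool:
    # Equality: "i;octet" equality on the canonicalized strings.
    return ascii_casemap(a) == ascii_casemap(b)


def casemap_compare(a: bytes, b: bytes) -> int:
    # Ordering: octet-wise comparison of the canonicalized strings.
    ca, cb = ascii_casemap(a), ascii_casemap(b)
    return (ca > cb) - (ca < cb)


def casemap_contains(haystack: bytes, needle: bytes) -> bool:
    # Substring: octet-wise search in the canonicalized strings.
    return ascii_casemap(needle) in ascii_casemap(haystack)
```

Note that the functions take octets, not characters: nothing in the definition looks at the encoding of the input.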

this was surprising and interesting to me, since it means that with
"abc" :matches "ab?", ${1} will hold the uppercase "C"!  I wonder how
many users would expect that one, or how many implementations get it right.
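
To make the surprise concrete, here is a rough sketch of :matches under the literal reading discussed here: canonicalize both strings first, then match octet-wise and take the captures from the canonicalized value. The helper names are hypothetical, "*" is assumed to be non-greedy, and backslash escapes in the pattern are ignored for brevity.

```python
import re


def casemap(data: bytes) -> bytes:
    # Same canonicalization as in the quoted definition.
    return bytes(b - 0x20 if 0x61 <= b <= 0x7A else b for b in data)


def glob_to_regex(pattern: bytes):
    # Translate "?" and "*" into capturing groups over single octets.
    parts = []
    for b in pattern:
        octet = bytes([b])
        if octet == b"?":
            parts.append(b"(.)")       # exactly one octet
        elif octet == b"*":
            parts.append(b"(.*?)")     # zero or more octets, non-greedy (assumption)
        else:
            parts.append(re.escape(octet))
    return re.compile(b"".join(parts), re.DOTALL)


def matches(value: bytes, pattern: bytes):
    """Return the wildcard captures, or None if there is no match."""
    m = glob_to_regex(casemap(pattern)).fullmatch(casemap(value))
    return None if m is None else list(m.groups())


captures = matches(b"abc", b"ab?")
print(captures)  # [b'C']
```

The capture is b"C" rather than b"c" because the match runs on the canonicalized copy. An implementation that wants to return the original text has to map the match positions back onto the untouched input.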

Oh my god.  Basically, this means a Sieve implementation does not
know anything about its input being UTF-8.  All it does is convert
headers to UTF-8; other than that, it works on octets, not characters.

If so, we must not talk about Unicode but about UTF-8, and s/character/octet/g.
And I would have to change my implementation, which works on characters,
crippling it.

Do all other implementations work on octets instead of characters?

Are you saying that even using "en;ascii-casemap", the wildcard "?"
does not match a single character outside US-ASCII?

since the spec defines it in terms of i;octet, the "?" wildcard is
essentially broken.  it gets really interesting with "*", though, since
you will probably get doubly encoded UTF-8 :-(

Worse, matched octets can be invalid (incomplete) UTF-8 octet sequences.
You gave an example of how variables extract a single octet of a sequence,
and you can't even control whether ? matches part of a UTF-8 octet sequence
or a US-ASCII character.  You cannot use the matched part to modify or
generate mail without risking a corrupted result.
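
A small illustration of the problem, with a made-up input value: if "?" consumes one octet, a pattern like "??*" applied to a string that starts with a two-octet character splits that character in half, and neither capture is valid UTF-8 on its own. A character-wise "?" would not have this problem.

```python
def is_valid_utf8(data: bytes) -> bool:
    # True if the octets decode cleanly as UTF-8.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


value = "\u00e9!".encode("utf-8")   # "é!" is the three octets C3 A9 21

# Octet-wise "?": the first two wildcards each take one octet.
octet_captures = [value[0:1], value[1:2]]

# Character-wise "?": the first two wildcards each take one character.
char_captures = [ch.encode("utf-8") for ch in value.decode("utf-8")[:2]]

print([is_valid_utf8(c) for c in octet_captures])  # [False, False]
print([is_valid_utf8(c) for c in char_captures])   # [True, True]
```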

this is truly a mess.  we need a comparator which makes sense, a
comparator which operates case-fully on Unicode characters, but without
the difficult bits (e.g. normalising combining characters).


