On Fri, 2005-10-21 at 00:03 +0200, Michael Haardt wrote:
On Thu, Oct 20, 2005 at 07:44:05PM +0100, Dave Cridland wrote:
We need to restrict this discussion to just the one mailing list,
really, but I've posted a message saying that actually, the reverse
is true - comparators match on octet strings, and happen to have a
decode built in - hence i;octet doesn't decode, and i;ascii-* both
decode using ASCII.
What exactly do you mean by "decode"? Removing the MIME encoding or
converting the character set?
I'm not Dave, but I would mean "decode UTF-8 octet sequences to Unicode
characters". RFC 2047 and RFC 2231 decoding will always be done. BTW,
the draft does not mention RFC 2231 at all. I think it should do so
explicitly, since it already has explicit references to RFC 2047.
The notion that comparators work on character strings is a notion
that comes pre-flawed - ACAP does not operate on character strings,
but octet strings, which might on a good day happen to be UTF-8
encoded text, but might be anything.
That explains why we have that mess. Over here, users certainly expect
"en;ascii-casemap" to match characters, and will be confused if the first
test is true and the second is not, and even more so if both are false:
Subject: =?utf8?q?A=c3=a4?=
:comparator "en;ascii-casemap" :matches "a?"
no. if you change the pattern to "a?*" and have variables, ${1} would
hold U+00C3 (the first octet of the UTF-8 sequence), not U+00E4 (the
decoded character).
:comparator "i;octet" :matches "A?"
no. same as above.
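
to make the difference concrete, here is a minimal Python sketch of
octet-wise versus character-wise wildcard matching on that subject. the
helpers glob_to_regex, match_octets and match_chars are made up for
illustration and come from no Sieve implementation. backslash escapes are
ignored, and "*" is assumed to match as little as possible.

```python
import re

def glob_to_regex(pattern: str) -> str:
    # Hypothetical helper: "?" is one unit, "*" is the shortest run of units.
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append("(.)")
        elif ch == "*":
            parts.append("(.*?)")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def match_octets(value: bytes, pattern: str):
    # Octet-wise matching: every unit is a single byte.
    regex = re.compile(glob_to_regex(pattern).encode("utf-8"), re.DOTALL)
    return regex.fullmatch(value)

def match_chars(value: str, pattern: str):
    # Character-wise matching: every unit is a Unicode code point.
    regex = re.compile(glob_to_regex(pattern), re.DOTALL)
    return regex.fullmatch(value)

# The subject after RFC 2047 decoding: "A" followed by U+00E4.
subject_text = "A\u00e4"
subject_octets = subject_text.encode("utf-8")   # b"A\xc3\xa4"

octet_single = match_octets(subject_octets, "A?")    # no match: two octets follow "A"
octet_split = match_octets(subject_octets, "A?*")    # "?" takes 0xC3, "*" takes 0xA4
char_single = match_chars(subject_text, "A?")        # "?" takes the whole U+00E4
```

the octet-wise "?" only ever sees 0xC3, which is half of the character the
user wrote.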
If "i;octet" operates on octets, we can't talk of Unicode, but need
to talk about UTF-8 for comparisons, and users will instantly ask:
How can I match characters case-sensitively? The base spec makes me think
"i;octet" is just that, operating on characters despite the name.
well, to me it is quite obvious that when it says "octet", it doesn't
support multibyte encodings, but rather operates on raw byte values.
RFC 2244 is quite explicit:
For collation, the i;octet comparator interprets the value of
an attribute as a series of unsigned octets with ordinal
values from 0 to 255.
the base spec does not support caseful matching where "?" makes sense for
non-ASCII text. in fact, the latest comparator draft I can find doesn't
provide such a comparator either. the algorithm for en;ascii-casemap is
quite amusing:
The "en;ascii-casemap" collation is a simple collation intended for
use with English language text in pure US-ASCII. It provides
equality, substring and ordering functions. The algorithm first
applies a canonicalization algorithm to both input strings which
subtracts 32 (0x20) from all octet values between 97 (0x61) and 122
(0x7A) inclusive. The result of the collation is then the same as
the result of the "i;octet" collation for the canonicalized strings.
(the algorithm in RFC 2244 is essentially the same.)
this was surprising and interesting to me, since it means that with
"abc" :matches "ab?", ${1} will hold the uppercase "C"! I wonder how
many users would expect that one, or how many implementations get it
right.
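
a minimal Python sketch of that literal reading, for illustration only:
ascii_casemap is a made-up name, and the sketch assumes the wildcard match
runs on the canonicalized strings, which is what the quoted text implies.
an implementation could just as well capture from the original value.

```python
import re

def ascii_casemap(octets: bytes) -> bytes:
    # Subtract 32 from every octet in 97..122 (a-z); leave all others alone.
    return bytes(b - 32 if 97 <= b <= 122 else b for b in octets)

# Canonicalize both the value and the pattern, then match octet-wise.
value = ascii_casemap(b"abc")      # b"ABC"
pattern = ascii_casemap(b"ab?")    # b"AB?"

# Translate the trailing "?" into a one-octet capture group.
regex = re.escape(pattern[:-1]) + b"(.)"
captured = re.fullmatch(regex, value, re.DOTALL).group(1)

# Octets outside a-z, such as the UTF-8 encoding of U+00E4, pass through unchanged.
non_ascii = ascii_casemap("\u00e4".encode("utf-8"))
```

here captured is b"C", not the b"c" the user typed, and non_ascii is still
the two octets 0xC3 0xA4.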
Section 2.7.1, Match Type, does not mention octets anywhere.
I've also suggested that where all the protocol has is a character
string, then the semantics of a comparator must behave as though the
string were encoded using UTF-8 (possibly by actually doing so).
Are you saying that even using "en;ascii-casemap", the wildcard "?"
does not match a single character outside US-ASCII?
since the spec defines it in terms of i;octet, the "?" wildcard is
essentially broken. it gets really interesting with "*", though, since
you will probably get doubly encoded UTF-8 :-(
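
one way this could happen, sketched in Python. it assumes an implementation
that treats each captured octet as a separate character (here via Latin-1)
and later writes the result out as UTF-8. that mechanism is an assumption
made for illustration, not something the spec describes.

```python
# The character the user actually had in the header.
original = "\u00e4"

# Its UTF-8 form, which is what an octet-based "*" would capture.
utf8_octets = original.encode("utf-8")            # two octets: 0xC3 0xA4

# Assumed step: each octet is taken to be one character.
as_characters = utf8_octets.decode("latin-1")     # U+00C3 U+00A4

# Writing those characters out as UTF-8 encodes the text a second time.
double_encoded = as_characters.encode("utf-8")    # four octets
```

the single character has turned into four octets that decode to two
unrelated characters.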
this is truly a mess. we need a comparator which makes sense, a
comparator which operates case-fully on Unicode characters, but without
the difficult bits (e.g. normalising combining characters).
--
Kjetil T.