Re: document status: 3028bis, body, editheader


On Wed, Mar 22, 2006 at 05:05:20PM +0100, Arnt Gulbrandsen wrote:

> Could you please elaborate?
>
> Do you mean that ascii-casemap should consider two octet strings to be
> equal if they are the same after decoding UTF-8?


That is indeed what I would like, but as I said, it would break all
implementations.  For that reason, and because the majority (read:
everybody but me) interprets RFC 3028 differently, I doubt the change
can happen.

> Should it do Unicode
> composition/decomposition?


I don't know what that is, as I don't know much about Unicode at all.
Are you talking about encoding/decoding Unicode characters in octets
or something different?

The way you ask implies that there are multiple ways to encode a Unicode
character in UTF-8.  Is that correct?  I always assumed UTF-8 was a 1:1
encoding of Unicode.

> As it is, I'm suggesting that 0x80 0x82 is equal to precisely one
> string, namely 0x80 0x20, and that independently of whether 0x80 0x82
> is meant to be a character encoded in UTF-8 or something else.


You raise a very interesting point.  How do I match this?

Subject: =?utf-8?q?=80=82?=

That sequence is not valid UTF-8.  Sieve scripts must be valid UTF-8,
so you cannot match the above.  That is fine for comparators operating
on characters, but it cripples a true octet-wise comparator.  It has
been suggested to encode each of the code points 0x80 and 0x82 in UTF-8
in the script and to have the comparator decode the literal.  Very ugly,
and I don't think existing implementations work that way.
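
To make the suggested workaround concrete, here is a rough sketch in
Python.  It is not how any implementation I know of works; the helper
names is_valid_utf8 and literal_to_octets are made up for illustration,
and mapping each code point to one octet is my reading of the
suggestion, nothing more.

```python
# The two octets from the Subject example above.
raw = bytes([0x80, 0x82])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# What the script would have to contain under the suggestion:
# the code points U+0080 and U+0082, each encoded in UTF-8.
literal = "\u0080\u0082".encode("utf-8")


def literal_to_octets(script_literal: bytes) -> bytes:
    """Hypothetical comparator step: decode the UTF-8 literal and map
    each code point to a single octet (assumption: code points above
    U+00FF are simply rejected, here with UnicodeEncodeError)."""
    return script_literal.decode("utf-8").encode("latin-1")
```

The raw octets fail is_valid_utf8, the literal passes it, and
literal_to_octets(literal) gives back the two raw octets.  The price is
that the literal in the script no longer looks like the value it matches.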

As far as I remember, the consensus was: The above is a problem not
worth discussing.  Let Sieve scripts continue to be valid UTF-8, and
let's be unable to match certain values.  Anybody who has a problem
with that had better invent some kind of "raw" header test.  I agree.

Thus, we can't talk about the octet sequence 0x80 0x82 in Sieve. :-)

But I got distracted.  As long as we consider ":is" or ":contains",
there is no problem really, because both behave the same, no matter
if they decode UTF-8 or just work on octets, as long as we have valid
UTF-8 encoding.  The "?" wildcard is the problem: Should it match an
octet or a character?

Since all (most?) implementations ignore Unicode and just work on octets,
"?" matches an octet, too.  Now that is weird to users, because if their
Unicode-aware client displays a one-character subject, "?" will only
match it if it is a US-ASCII character.  If it is a German umlaut, "??"
matches it instead, because umlauts are encoded in two octets in UTF-8.
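
A small Python sketch of the difference, in case it helps.  The function
glob_match is hypothetical and only understands "*" and "?"; it is not
the :matches algorithm of any real implementation, just enough to show
what changes when the unit is an octet instead of a character.

```python
import re


def glob_match(pattern, value):
    """Match value against a glob that supports only '*' and '?'.

    With str arguments the unit is a character; with bytes arguments
    the unit is an octet.
    """
    if isinstance(pattern, bytes):
        # Latin-1 maps every octet to exactly one code point, so
        # matching the decoded strings is matching octet by octet.
        return glob_match(pattern.decode("latin-1"),
                          value.decode("latin-1"))
    regex = "".join(
        ".*" if c == "*" else "." if c == "?" else re.escape(c)
        for c in pattern
    )
    return re.fullmatch(regex, value, re.DOTALL) is not None


subject = "\u00fc"                 # u-umlaut, one character
octets = subject.encode("utf-8")   # two octets: 0xC3 0xBC

on_characters = glob_match("?", subject)      # character-wise
on_octets_one = glob_match(b"?", octets)      # octet-wise, one "?"
on_octets_two = glob_match(b"??", octets)     # octet-wise, two "?"
```

Character-wise, "?" matches the umlaut.  Octet-wise, "?" does not match
it and "??" does, which is exactly what surprises the user.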

That's where it becomes obvious whether a comparator works on octets or
characters.  RFC 3028 does not specify that as clearly as required, and
the installed base unfortunately decided on octets, not characters.
The latest 3028bis draft follows that decision, and clearly so.

Of course the comparator specification has the last word on this.  I guess
you see the problem now, and should you decide that "en;ascii-case"
works on characters and "?" matches a Unicode character, you would have
my vote and I would change my implementation within 5 minutes.
*ducks and hides*

Michael