ietf-mta-filters
Re: document status: 3028bis, body, editheader

2006-03-27 21:00:27
On Sat, 2006-03-25 at 09:26 -0800, Ned Freed wrote:

I thought Dave Cridland's suggestion to specify matching behaviour in
the comparator itself was intriguing:

http://permalink.gmane.org/gmane.ietf.mta-filters/2689

unfortunately, [draft-newman-i18n-comparator-08] says «the equality test
MUST be reflexive, symmetric and transitive», so "EQUAL" can't be used.
I must admit I don't quite understand how :matches and :regex work with
comparators, though.


I think of it this way: A comparator has as one of its components a
normalization operation. Pull that operation out, apply it, and then
perform the glob or regex operation on the result. Note that the
output of the normalization is best seen as a series of nonnegative
integers or something similar, not octets or characters.
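
A minimal sketch of the "normalize, then glob" reading described above. It
is an illustration only: the normalization is a toy ASCII case fold standing
in for a real comparator's normalization step, every function name is
hypothetical, and backslash escapes in the pattern are left out.

```python
from typing import List

# Wildcard markers are unique objects, so they can never be confused with a
# normalized integer. All names in this sketch are hypothetical.
STAR = object()
QMARK = object()


def normalize(text: str) -> List[int]:
    """Toy normalization: one integer per character, folding ASCII a-z to A-Z."""
    return [ord(ch) - 32 if "a" <= ch <= "z" else ord(ch) for ch in text]


def compile_pattern(pattern: str) -> list:
    """Normalize the literal parts of a pattern; keep '*' and '?' as markers."""
    compiled = []
    for ch in pattern:
        if ch == "*":
            compiled.append(STAR)
        elif ch == "?":
            compiled.append(QMARK)
        else:
            compiled.extend(normalize(ch))
    return compiled


def glob_match(pattern: list, subject: List[int]) -> bool:
    """Glob over integer sequences, backtracking to the most recent STAR."""
    p = s = 0
    star_p, star_s = -1, 0
    while s < len(subject):
        if p < len(pattern) and pattern[p] is STAR:
            # Remember where the star is and what it has consumed so far.
            star_p, star_s = p, s
            p += 1
        elif p < len(pattern) and (pattern[p] is QMARK or pattern[p] == subject[s]):
            p += 1
            s += 1
        elif star_p >= 0:
            # Mismatch: let the last star swallow one more element and retry.
            star_s += 1
            s = star_s
            p = star_p + 1
        else:
            return False
    # Only trailing stars may remain once the subject is exhausted.
    while p < len(pattern) and pattern[p] is STAR:
        p += 1
    return p == len(pattern)


def matches(pattern: str, subject: str) -> bool:
    return glob_match(compile_pattern(pattern), normalize(subject))
```

The glob step never looks at characters or octets, only at whatever the
normalization emitted, which is the point of the description above.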


thanks.

so you split your match pattern into constant strings without the
wildcards, send each to the comparator which returns a list of start and
end points for each matched constant string, and then see if you can
find a sequence of (start, end) for each constant string where the
intervals between the constant strings match the wildcards.  *phew*

that will work for :matches (although I doubt anyone will actually
implement it that way -- so the pluggable comparator idea goes out the
window), but not for :regex.  unless I'm missing something again :-)
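
for illustration, a rough sketch of the interval approach for :matches. it
assumes the comparator only offers a substring operation returning
(start, end) pairs; the ASCII case fold is a stand-in for a real comparator,
all names are hypothetical, and backslash escapes are ignored.

```python
from functools import lru_cache
from typing import List, Tuple

Span = Tuple[int, int]
Gap = Tuple[int, bool]  # (number of '?', whether a '*' is present)


def fold(text: str) -> str:
    """Toy ASCII-only case fold; never changes the length of the string."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)


def find_all(needle: str, haystack: str) -> List[Span]:
    """Stand-in for the comparator's substring operation."""
    n, h = fold(needle), fold(haystack)
    return [(i, i + len(n)) for i in range(len(h) - len(n) + 1) if h.startswith(n, i)]


def split_pattern(pattern: str) -> Tuple[List[str], List[Gap]]:
    """Split into constant strings and the wildcard gaps around them.

    gaps[i] precedes literals[i]; the final gap follows the last literal.
    """
    literals: List[str] = []
    gaps: List[Gap] = []
    current, min_len, unbounded = "", 0, False
    for ch in pattern:
        if ch in "*?":
            if current:
                gaps.append((min_len, unbounded))
                literals.append(current)
                current, min_len, unbounded = "", 0, False
            if ch == "*":
                unbounded = True
            else:
                min_len += 1
        else:
            current += ch
    if current:
        gaps.append((min_len, unbounded))
        literals.append(current)
        min_len, unbounded = 0, False
    gaps.append((min_len, unbounded))
    return literals, gaps


def gap_ok(gap: Gap, length: int) -> bool:
    """A gap of '?' only has a fixed length; a '*' makes it a minimum."""
    min_len, unbounded = gap
    return length >= min_len if unbounded else length == min_len


def matches_by_intervals(pattern: str, subject: str) -> bool:
    literals, gaps = split_pattern(pattern)
    spans = [find_all(lit, subject) for lit in literals]

    @lru_cache(maxsize=None)
    def chain(i: int, pos: int) -> bool:
        # pos is where the previous constant string ended.
        if i == len(literals):
            return gap_ok(gaps[i], len(subject) - pos)
        return any(
            gap_ok(gaps[i], start - pos) and chain(i + 1, end)
            for start, end in spans[i]
        )

    return chain(0, 0)
```

the comparator is only asked for substring positions; the wildcard logic
lives entirely in the chaining step. nothing comparable falls out for
:regex, since a regex does not decompose into constant strings and gaps.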


another possibility is to have a capability which adds an action which
changes the default comparator to reduce the verbosity.


It makes scripts a bit shorter, but at the expense of having something
with far-reaching impact specified at the top and not where the impact
is felt. I'm far from convinced this is a good tradeoff.


what do you mean by impact?  server load?


another possibility is to allow the wildcard comparator, so
that :comparator "*" «[selects] the collation with the broadest scope
(preferably international scope), the most recent table versions and the
greatest number of supported operations.»  (the comparators the server
chooses from would have to be "require"d in advance, I think, although
«require "comparator-*"» is a possibility)


I really don't like this - now you have scripts working in subtly
different ways in different places. This is a big step backwards, I
think.


you can reduce the problem by not doing require on more comparators than
you need.  e.g., if you do

  require "comparator-i;basic;uca=3.1.1;uv=3.2";
  require "comparator-i;ascii-numeric";
  require "comparator-*";

it's quite obvious which comparator has the «broadest international
scope» (just to make it clear, this is the wording about the behaviour
of "*" from [COLLATION]).


And good luck using i;basic;uca=3.1.1;uv=3.2 to trap specific sequences of
illegal 8bit in headers. Such stuff is rarely if ever in UTF-8, in my
experience at least.

this is impossible today, isn't it?

It is done all the time and the base specification allows it, more or
less.


well, that's in contention.


how do you specify the string to compare with?


You just do it. Nothing in the current sieve specification says that
it is an error to specify material that isn't valid utf-8.


8.1.     Lexical Tokens

   Sieve scripts are encoded in UTF-8.  The following assumes a valid
   UTF-8 encoding; special characters in Sieve scripts are all ASCII.

"are" is considered equivalent with a MUST, isn't it?


(This is now prohibited by the ABNF in the revised base specification,
and IMO this restriction needs to be removed for string constants.


I strongly disagree.  let's not make a charset soup on purpose.


It is at a minimum in direct conflict with the vacation
specification.)



not relevant, since vacation hasn't left draft status.



in any case, if you want to trap raw non-UTF8 in headers,
you should use i;octet.


Agreed, but in practice scripts often don't specify this.


such scripts are broken and rely on undocumented behaviour IMO.



but then again:

5.7.  Octet Collation

   The i;octet (Section 9.5) collation is only usable with protocols
   based on octet-strings.  Clients and servers MUST NOT use i;octet
   with other protocols.

which would disqualify the use of i;octet with Sieve, since 3028bis says:

   The language is represented in UTF-8, as specified in [UTF-8].


"Represented" doesn't mean things are restricted to UTF-8.


yes, this is fine.  the message can contain octet strings even if the
script itself can not.  sorry about the red herring.


If it did it would be in direct conflict with later language in the
same document, which quite specifically allows material in other
charsets in strings.


what language are you thinking of?


the collation-draft exempts RFC 3028 from this "MUST NOT", but it's not
clear to me that a 3028bis can get the same exemption.


It had better, or backwards compatibility goes down the toilet, which is
not acceptable as far as I'm concerned.


how about implementations where "?" matches a Unicode character even for
the default comparator?  which implementation gets to define backwards
compatibility?


notice that
"represented in UTF-8" only means constant strings can't contain raw
octets which are illegal UTF-8 sequences.


I disagree 100% that that is what it means. And since the
specification later talks about putting non-UTF-8 material in constant
strings in section 2.4.2.4, it seems it agrees with me.


it's not clear that 2.4.2.4 expects the Sieve implementation to decode
the embedded MIME document into raw octets.  I would expect the
opposite, that the MIME headers and encoded data supplied in the script
are copied verbatim.


which brings us back to the
discussion on character escapes from a year ago.

http://comments.gmane.org/gmane.ietf.mta-filters/2030

I'd like to suggest we implement (2), but with the extension defined in
the base spec.


I have no problem with this and would like to see it happen,


I'm glad to hear it.  I guess I should submit text for consideration :-)


Regardless of whether you write some sequence that's illegal in UTF-8
directly as a series of octets in a string constant or indirectly
using some sort of encoding, you're still presenting the sieve
interpreter with text that isn't in utf-8 at some level.


the Sieve interpreter needs to handle non-UTF-8 already. that is, header
bodies which are encoded in an unknown charset must be handled as
octet-strings.

looking at the current draft, I think this excerpt from 2.7.2 needs a
little tweak:


   If implementations fail to support the above behavior, they MUST
   conform to the following:

      No two strings can be considered equal if one contains octets
      greater than 127.


(this is unchanged since 3028).  I believe it should be allowed to
compare such strings using i;octet.  I don't have suitable replacement
text, though.
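
to make the difference concrete, a small sketch that reads the quoted
fallback rule literally and sets it beside a plain octet-wise comparison of
the kind i;octet performs. the function names are hypothetical and this is
not proposed replacement text.

```python
def has_high_octet(data: bytes) -> bool:
    """True if any octet is greater than 127."""
    return any(octet > 127 for octet in data)


def fallback_equal(a: bytes, b: bytes) -> bool:
    """Literal reading of the fallback rule: strings containing an octet
    greater than 127 are never considered equal, not even to themselves."""
    if has_high_octet(a) or has_high_octet(b):
        return False
    return a == b


def octet_equal(a: bytes, b: bytes) -> bool:
    """Octet-by-octet equality, as an i;octet style comparison would do."""
    return a == b


# A header value in some unknown 8-bit charset (0xE9 is not valid UTF-8 here).
raw = b"caf\xe9"
print(fallback_equal(raw, raw))  # False under the literal fallback rule
print(octet_equal(raw, raw))     # True when compared as octets
```

under the literal reading, a script cannot test for a known 8-bit octet
sequence at all, even when it explicitly asks for i;octet.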


Any attempt to enforce some sort of rule that nothing but utf-8 can be
present is still going to fail. And any +Unnnn sort of scheme is
inherently incapable of representing UTF-8 anyway.


well, yes, it represents a Unicode code point, not an octet sequence.
but as we have established, with the correct comparator, you can use "?"
to see that the *client visible* representation of Sieve strings is in
fact UTF-8.
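
a tiny sketch of that last point: how many "?" wildcards a string needs
depends on whether the comparator's unit is an octet or a character. the
helper name and the two unit labels are made up for the example.

```python
def question_marks_needed(text: str, unit: str) -> int:
    """Number of '?' wildcards needed to match text exactly."""
    if unit == "octet":
        # An octet-based comparator sees the UTF-8 encoding of the string.
        return len(text.encode("utf-8"))
    if unit == "character":
        # A Unicode-aware comparator sees code points.
        return len(text)
    raise ValueError("unit must be 'octet' or 'character'")


e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, two octets in UTF-8
print(question_marks_needed(e_acute, "octet"))      # 2
print(question_marks_needed(e_acute, "character"))  # 1
```

so a script that matches a single non-ASCII character against "??" with an
octet-based comparator is observing the UTF-8 encoding directly.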
-- 
Kjetil T.