Re: I-D ACTION:draft-degener-sieve-body-00.txt

Jutta Degener <jutta(_at_)sendmail(_dot_)com> writes:

[ This is a much revised version in response to the calls for more
  text processing before a match on this list.  --Jutta]

A New Internet-Draft is available from the on-line Internet-Drafts 
directories.

    Title           : Sieve -- 'body' extension

I like this version much better, but still some small
comments/nits/thoughts:

0) No way to split based on MIME structure or MIME headers.  Compare
   the complex MIME-parsing features of the IMAP FETCH BODY[] command.
   It may take the sieve body extension too far to delve into a
   full-blown MIME parser, but it might be useful some day.  E.g.,
   split all OpenPGP multipart/encrypted mail into one folder, all
   S/MIME multipart/encrypted mail into another folder.


This is starting to get complicated, and complexity can be expensive...

Two issues regarding the text:

,----
|    MIME parts encoded in a content transfer encoding must be decoded,
|    and text MIME parts in charsets other than UTF-8 MUST be converted
|    to UTF-8 prior to the match.
`----

1) Converting into UTF-8 is non-trivial.  Standards-wise it is not
even defined what this means, since I don't know of any standards body
that publishes official transcoding tables.  It is possible for
transcoding tables "out there" to even conflict (e.g. CP437 used in a
German environment may transcode differently than CP437 used in a
Greek environment).  I suggest at least noting that this is
non-trivial, but that such efforts should nevertheless be made, and
that the exact implementation is left to local policy.


This sounds like a reasonable thing to do, but there's also the issue of how
unknown charsets are handled. It may be that searching unconverted text is a
reasonable fallback or it may not. I'd suggest a compromise: A search involving
8bit characters on text with an unknown charset always fails, while a search
involving only the ASCII subset proceeds on the unconverted text. I base this
on the observation that ASCII-compatible charsets are pretty common. (Although
perhaps a further heuristic that any unknown charset that has 2022 in the name
shouldn't receive such treatment would also be useful...)
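
To make the proposed compromise concrete, here is a minimal sketch in Python.
It is only an illustration under stated assumptions: "known charset" is taken
to mean "a name Python's codec registry can resolve", the function names are
hypothetical, and a plain substring test stands in for whatever match type the
script actually requested.

```python
import codecs


def charset_is_known(charset: str) -> bool:
    # Assumption: "known" means Python's codec registry resolves the name.
    try:
        codecs.lookup(charset)
    except LookupError:
        return False
    return True


def body_contains(raw: bytes, charset: str, key: str) -> bool:
    """Substring search over one decoded text part, following the compromise."""
    if charset_is_known(charset):
        try:
            # Normal case: convert to Unicode text, then search.
            return key in raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Undecodable bytes or a non-text codec: treat as no match.
            return False
    # Unknown charset: a key with non-ASCII characters always fails.
    if not key.isascii():
        return False
    # Further heuristic: ISO-2022 style charsets are not ASCII-compatible
    # in the useful sense, so the ASCII fallback is not applied to them.
    if "2022" in charset:
        return False
    # ASCII-only key: search the unconverted bytes.
    return key.encode("ascii") in raw
```

With this sketch, an ASCII key is still found in a part labelled with an
unrecognized charset, while a key containing 8bit characters never matches
there.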

2) CTE "decoding" is a bit loose.  I suggest specifying that
   implementations must support some set of common CTEs, such as
   base64, qp, 8bit.  It may make sense to discuss what should happen
   if CTE is syntactically incorrect, or contains a value which the
   implementation does not support (e.g., "x-yenc").


I agree that this should be discussed. I'm not completely sure how to handle
it, but I'm leaning towards saying that the search should simply fail.
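
A rough Python sketch of the "search simply fails" option follows. The
function name and the use of None to signal "no searchable content" are
assumptions made for illustration. The set of supported encodings is the one
defined by MIME itself (7bit, 8bit, binary, quoted-printable, base64), which
is somewhat broader than the list suggested above.

```python
import base64
import binascii
import quopri

# Encodings that leave the payload unchanged.
IDENTITY_ENCODINGS = {"7bit", "8bit", "binary"}


def decode_cte(payload: bytes, cte: str):
    """Return the decoded payload, or None if the search should fail."""
    name = cte.strip().lower()
    if name in IDENTITY_ENCODINGS:
        return payload
    if name == "base64":
        try:
            return base64.b64decode(payload)
        except binascii.Error:
            # Malformed base64 data (for example bad padding).
            return None
    if name == "quoted-printable":
        return quopri.decodestring(payload)
    # Unsupported or syntactically incorrect value, e.g. "x-yenc".
    return None
```

A caller would treat a None result as "the body test does not match" rather
than as an error that aborts the script.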

3) What about multipart/partial?


Hmm. Tricky. I think any search involving non-ASCII should fail. I'm less
clear on whether or not an ASCII search should proceed.

Searches of encrypted data are also likely to produce less than useful
results.

4) (Revealing my ignorance:) Are sieve scripts binary clean?


No. Sieves are defined to be in UTF-8, with the restrictions that implies.

   If I want to implement something like the Unix "file" command in a sieve
   script to split some data into special folders depending on magic
   values in the file, I might need to express binary data.


I think searching for patterns in binary data goes beyond the scope of this
particular extension. In the specific case of a magic number test, rather than
attempting to code such tests into sieve, I'd suggest having a "magic" test
that applies a set of system-provided magic number tests to the file and
returns a result that the sieve can then check.
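
As a sketch of what the system side of such a "magic" test might look like,
the Python below keeps the binary signatures in a system-provided table and
hands back a plain text label, so the script itself never has to contain
binary data. The table contents, the function name, and the choice of media
type strings as labels are all assumptions for illustration; no such test is
defined anywhere.

```python
# Hypothetical system-provided table: (leading bytes, label).
MAGIC_TABLE = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def magic_type(data: bytes):
    """Return the label of the first matching signature, or None."""
    for prefix, label in MAGIC_TABLE:
        if data.startswith(prefix):
            return label
    return None
```

The script would then compare the returned label against an ordinary UTF-8
string, which keeps scripts within the existing character set restrictions.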

                                Ned