
Re: draft-degener-sieve-body-02.txt

2004-01-07 18:50:05

On Wed, Jan 07, 2004 at 06:56:30PM -0500, Mark E. Mallett wrote:

In the new :binary stuff:

Unlike in :content, the charset of the :binary MIME content is
disregarded.   Instead, the match against the keys provided in
the "body" statement proceeds as if the file's content data had
been translated into space-separated hex bytes of the form
[0-9a-f][0-9a-f] prior to matching

It would be nice to consider whitespace in the pattern as

At first, I figured it just would be binary strings.  But that
doesn't work with everything being UTF-8.  There are lots of 
binary strings that aren't valid UTF-8.

I then tried pretending the binary is ISO-8859-1 and encoding
it in UTF-8.  But that's really too silly to explain: there
are binary sequences that aren't valid ISO-8859-1, and most
humans are bad at doing ISO-8859-1-to-UTF-8 conversions in
their head.

Okay, so we're matching against a hexdump.  Easy.  Everybody
likes hexdumps.  But that doesn't give you a way around
nybble-shift -- a key of f5 matching inside "oW" (6f57).

Finally, the space-separated bytes were a way of anchoring the
nybbles and still having a readable string that can be used
with existing string match mechanisms like :contains and :matches.
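
To make the nybble-shift point concrete, here is a minimal Python sketch.
It is an illustration only, not text from the draft; hexdump() is a
hypothetical helper, and the plain substring test merely stands in for
:contains.  It uses the two bytes 6f 57.

```python
def hexdump(data: bytes, sep: str = " ") -> str:
    """Render bytes as lowercase two-digit hex values joined by sep."""
    return sep.join(f"{b:02x}" for b in data)


body = bytes([0x6F, 0x57])  # the two bytes 6f 57

# Unseparated hexdump: the key "f5" matches across the byte boundary.
packed = hexdump(body, sep="")
assert packed == "6f57"
assert "f5" in packed

# Space-separated hexdump: the separator anchors each byte, so the
# shifted key no longer matches while the real byte sequence still does.
spaced = hexdump(body)
assert spaced == "6f 57"
assert "f5" not in spaced
assert "6f 57" in spaced
```

The separator costs one character per byte, but the result is still an
ordinary string that an unmodified substring or wildcard match can use.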

If you want to send a white-space-normalizer into the world,
you could do that as a generic comparator and make a lot of sense
in a lot of different contexts.  But by completely disregarding
white space, you'd do more harm than good in the context of
matching a hexdump.
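
As a rough illustration of that harm, the Python sketch below compares
an ordinary substring test with a whitespace-insensitive one.  Both
hexdump() and contains_ignoring_ws() are hypothetical names invented for
this example; neither comes from the draft or from any Sieve
implementation.

```python
def hexdump(data: bytes) -> str:
    """Space-separated lowercase hex, one two-digit value per byte."""
    return " ".join(f"{b:02x}" for b in data)


def contains_ignoring_ws(haystack: str, key: str) -> bool:
    """Hypothetical substring test that drops all whitespace first."""
    def squeeze(s: str) -> str:
        return "".join(s.split())
    return squeeze(key) in squeeze(haystack)


dump = hexdump(bytes([0x6F, 0x57]))  # "6f 57"

# The exact match respects byte boundaries.
assert "f5" not in dump

# Dropping whitespace removes the anchors, so the shifted key matches again.
assert contains_ignoring_ws(dump, "f5")

# A key that really is in the data matches either way.
assert "6f 57" in dump
assert contains_ignoring_ws(dump, "6f57")
```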

especially since one has to transform it anyway in order
to do the comparison.

Given the ability to match against that white space and individual
nybbles (I'm not happy about that, but restricting it added more
warts than it fixed), the transformation to binary that you may
have in mind doesn't quite work in the general case anyway.
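
A small Python sketch of why such keys resist conversion, assuming keys
are plain strings compared against the space-separated hexdump.  The
helper names are hypothetical and the substring test only approximates
:contains.

```python
HEX_DIGITS = set("0123456789abcdef")


def hexdump(data: bytes) -> str:
    """Space-separated lowercase hex, one two-digit value per byte."""
    return " ".join(f"{b:02x}" for b in data)


def is_whole_bytes(key: str) -> bool:
    """True if the key is only two-digit hex values split by single spaces."""
    return all(
        len(token) == 2 and set(token) <= HEX_DIGITS
        for token in key.split(" ")
    )


dump = hexdump(bytes([0x6F, 0x57]))  # "6f 57"

# A whole-byte key could be turned back into bytes before comparing.
assert is_whole_bytes("6f 57")
assert bytes.fromhex("6f 57") == bytes([0x6F, 0x57])

# "f 5" means: low nybble f, then a byte whose high nybble is 5.
# It matches the hexdump, yet corresponds to no byte string.
assert not is_whole_bytes("f 5")
assert "f 5" in dump
```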

For another, if whitespace is allowed at all, why
not let the script writer feel free to use it the way they
want to..

I could do that, but then I'd have to toss use of the existing match
types and define my own just for binary.  I didn't think the extra
mechanism was worth it.

And speaking of variables, is it reasonable to make some note about
whether the matched strings are available to the variables extension?

Given how hard that is to implement, it probably should!
What do you think it should say?

Jutta <jutta(_at_)sendmail(_dot_)com>
