Re: 'header' test and whitespace

On Fri, 2005-11-25 at 08:41 -0800, Ned Freed wrote:

Yes. I don't view this as part of the unfolding algorithm, however. But now
that you mention it, a better way to describe this is to do the unfolding
part first and then remove leading and trailing spaces. So the steps become:

(1) Remove all CRLFs.

(2) Remove leading and trailing spaces.

(3) Decode RFC 2047 and convert to utf-8.

if we're going to mention 2047, we should also mention that RFC 2231
applies for some headers.  although it seems to me that using a
bastardised 2047 is the deployed norm for Content-Disposition and
friends, that's not correct according to the specifications, and being
more explicit about it might help conformance to spec.

The last two steps should be skipped in :raw mode. (The first step is retained
because folding points aren't supposed to have any semantics and are known to
change unpredictably. The same cannot be said for spaces or encoded words.)
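
As a rough illustration only, the three steps and the proposed :raw behaviour
could be sketched in Python as below. The function name and the `raw` flag are
hypothetical, Python's standard `email.header.decode_header` stands in for
"decode RFC 2047", and RFC 2231 parameter decoding is not attempted.

```python
from email.header import decode_header


def normalize_header_value(value: str, raw: bool = False) -> str:
    """Hypothetical helper: apply steps (1)-(3) to one header field body."""
    # Step 1: unfold by removing every CRLF. This is done even in raw mode.
    unfolded = value.replace("\r\n", "")
    if raw:
        # Proposed :raw behaviour: skip steps 2 and 3.
        return unfolded

    # Step 2: remove leading and trailing spaces (only spaces, as in the text).
    stripped = unfolded.strip(" ")

    # Step 3: decode RFC 2047 encoded-words and convert to Unicode text.
    parts = []
    for data, charset in decode_header(stripped):
        if isinstance(data, bytes):
            try:
                parts.append(data.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset label: fall back to ASCII with replacement.
                parts.append(data.decode("ascii", errors="replace"))
        else:
            # No encoded-words were present; the value is already text.
            parts.append(data)
    return "".join(parts)
```

With this sketch, a folded value such as " hello\r\n world " becomes
"hello world" in the default mode and " hello world " in raw mode.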

the exact method of RFC 2047 encoding used doesn't have any semantics,
either.


In the abstract, maybe. But in practice it does. For example, it happens to
make perfect sense for me to filter out all messages that use the koi8-r 
charset. I receive maybe 300 such messages every day, and they are without
exception spam.

There are also cases where language can be inferred with reasonable reliability
from charset, and this can be very useful in some applications. (There often
isn't enough text in a header to analyze to determine the language being used.)
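
To make the charset point concrete, here is a minimal Python sketch that
collects the charset labels of the encoded-words in an undecoded header value,
which is the information a charset-based filter would need. The function name
is hypothetical; `decode_header` is from Python's standard library and returns
charset labels in lower case.

```python
from email.header import decode_header


def header_charsets(raw_value: str) -> set[str]:
    """Hypothetical helper: charset labels used by encoded-words in a value."""
    # decode_header yields (data, charset) pairs; charset is None for
    # text that was not inside an encoded-word.
    return {charset for _, charset in decode_header(raw_value) if charset}


def uses_koi8r(raw_value: str) -> bool:
    # The kind of check described above: does any encoded-word use koi8-r?
    return "koi8-r" in header_charsets(raw_value)
```

This check only works on the value before step (3); once the value has been
converted to utf-8 the original charset label is gone.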

a discussion thread might have a Subject which switches back and forth
between Q and B encoding, or which varies in the length of its
encoded-words and hence in the number of lines.  this is very apparent
for us non-USans :-)


The encoding probably doesn't matter, especially since it sometimes changes in
transit. Charset changes are much rarer.

so I don't think it's appropriate to make :raw be :halfraw.  the CR LF
is actually a part of the raw header.


I disagree. Folding points change all the time, they are specifically defined
not to have semantics, and I don't believe I've ever seen a case where I wanted
a test that is sensitive to CRLFs. The same cannot be said for encoded words
and trailing spaces.

while we're at it, what do we do about headers which can have multiple
values, e.g. "Cc"?  (multiple headers is deprecated in 2822, but must be
supported.)  I don't have a good suggestion.  the naive approach is to
concatenate them as if they were simply separated by CR LF, but for "Cc"
you would really want to include a comma in the delimiter as well.  the
other option is to say that the first matching header is used.  this
sits well with short-circuiting logic, but means it's impossible to
capture the complete value of the header.


Simple: You test all the values as separate fields. Sieve tests already allow
for a list of fields but there's an implicit list even when there's only a
single field name specified.
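
A small Python sketch of that idea, assuming a simple substring match: every
occurrence of every named field is tested on its own, and the test succeeds as
soon as one matches. The function name is hypothetical and this is not taken
from any Sieve implementation; `Message.get_all` is from Python's standard
`email` package.

```python
from email import message_from_string
from email.message import Message


def header_contains(msg: Message, field_names: list[str], needle: str) -> bool:
    """Hypothetical test: true if any value of any named field has needle."""
    for name in field_names:
        # get_all returns every occurrence of the field, e.g. two Cc lines.
        for value in msg.get_all(name, []):
            if needle in value:
                return True
    return False


# Usage: a message with two separate Cc fields.
example = message_from_string(
    "Cc: alice@example.com\n"
    "Cc: bob@example.com\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
found_bob = header_contains(example, ["Cc"], "bob@")
```

No concatenation or delimiter choice is needed, because the values are never
joined.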

I am personally Ok with adding :raw to the base spec, but do we need a
new capability?


Yes, I'm afraid so. Perhaps something like rawheader? (I believe the header
test is the only one for which :raw makes sense.)

what happens if you add a :raw argument and upload to today's
implementations?  will they reject during upload?  will they ignore it
during runtime?  will they bomb during runtime?


Our implementation will return a runtime error. I believe this is the correct
behavior. We don't check during upload - our sieves are typically provisioned
via LDAP and we have no control over what tools are used to insert them into
the directory.

And yes, this makes error reporting a real challenge.

IMHO it's okay as long as it doesn't cause a runtime error.  (Cyrus 2.2
will reject upload, I haven't checked others.)


This is just another extension, so I don't see why causing a runtime
error is a problem.

the behaviour of "header" is under-specified in the current spec, and
deployed implementations probably deviate in behaviour.  the more
detailed specs we're discussing here might make many of them obviously
non-conforming where they were only arguably non-conforming before (i.e.
not performing according to intent).  but is it important?  the number
of capabilities makes life harder for both server and client
implementors, and we should try to limit the number of them as much as
possible.


I disagree with this as well. Additional new capabilities are nowhere near as big
a problem as nailing things down in ways that break current behavior. To be
blunt, if this specification evolves to a point where our deployed sieve
scripts are incompatible with it, I'll have no choice but to stop supporting
sieve as defined by the IETF.

                                Ned