Re: 'header' test and whitespace

On Fri, 2005-11-25 at 15:53 -0800, Ned Freed wrote:

the exact method used of RFC 2047 encoding doesn't have any semantics,
either.


In the abstract, maybe. But in practice it does. For example, it happens to
make perfect sense for me to filter out all messages that use the koi8-r
charset. I receive maybe 300 such messages every day, and they are without
exception spam.

that will work until the spammers are savvy enough to switch to UTF-8 to
avoid such filtering.  the correct test would check for code points, not
encodings.
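
To make the difference between the two kinds of test concrete, here is a rough Python sketch (not Sieve, and not taken from any implementation discussed in this thread). `declared_charsets` looks only at the charset labels of the RFC 2047 encoded-words, while `has_cyrillic` decodes them and looks at code points. The helper names, including `encoded_word`, are made up for illustration.

```python
import base64
from email.header import decode_header


def encoded_word(text, charset):
    """Hypothetical helper: build a B-encoded RFC 2047 word for the demo."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return "=?%s?B?%s?=" % (charset, payload)


def declared_charsets(raw_value):
    """Charset labels used by the encoded-words in a raw header value."""
    return {cs for _, cs in decode_header(raw_value) if cs}


def has_cyrillic(raw_value):
    """True if the decoded value contains code points in U+0400..U+04FF."""
    text = "".join(
        part.decode(cs or "ascii", errors="replace")
        if isinstance(part, bytes) else part
        for part, cs in decode_header(raw_value)
    )
    return any("\u0400" <= ch <= "\u04ff" for ch in text)


koi = encoded_word("Привет", "koi8-r")
utf = encoded_word("Привет", "utf-8")
# A charset-label test sees two different values here ...
print(declared_charsets(koi), declared_charsets(utf))
# ... while a code-point test treats both subjects alike.
print(has_cyrillic(koi), has_cyrillic(utf))
```

The charset test is cheap and needs no decoding; the code-point test survives a switch to UTF-8 but requires the implementation to decode the encoded-words first.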


The same can be said for every other check you can think of - eventually
the spammers will adapt. That doesn't mean it isn't useful now, and it doesn't
mean other similar tests won't be useful in the future.

This is not, and should not be, a search for the FUSSP.

There are also cases where language can be inferred with reasonable reliability
from charset, and this can be very useful in some applications. (There often
isn't enough text in a header to analyze to determine the language being used.)

yes, although I have received messages in Norwegian using GB2312
encoding and messages in English using KOI8-R encoding, it's a safe bet
that the senders of these messages understand Chinese and Russian
respectively since they've set up their e-mail program to use these
encodings.

a discussion thread might have a Subject which switches back
and forth between Q and B encoding, or which varies in the length of
encoded-words and hence the number of lines.  this is very apparent for us
non-USans :-)


The encoding probably doesn't matter, especially since it sometimes changes in
transit. Charset changes are much rarer.

not in this neck of the woods.  different users will prefer UTF-8, ISO
8859-1 and ISO 8859-15, and the Subject will switch back and forth.


That's quite true here but beside the point. I'm talking about changes in
transit, not the use of various different charsets when messages are composed.
Indeed, the reason that you see such mixtures regularly (as do I) is because
such changes _aren't_ being made by composing and transfer agents - they leave
the original source alone.

so I don't think it's appropriate to make :raw be :halfraw.  the CR LF
is actually a part of the raw header.


I disagree. Folding points change all the time, they are specifically defined
not to have semantics, and I don't believe I've ever seen a case where I wanted
a test that is sensitive to CRLFs. The same cannot be said for encoded words
and trailing spaces.
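
As an illustration of a comparison that ignores folding points but leaves everything else in the raw value alone, here is a small Python sketch. It assumes the RFC 2822 rule that unfolding removes any CRLF immediately followed by a space or tab; the function name `unfold` is made up for this example.

```python
import re


def unfold(raw_value):
    """Remove each CRLF that is immediately followed by a space or tab."""
    return re.sub(r"\r\n(?=[ \t])", "", raw_value)


# The same subject folded at two different points by two different agents.
a = "Re: 'header' test and\r\n whitespace"
b = "Re: 'header'\r\n test and whitespace"
print(unfold(a) == unfold(b))
```

Encoded-words and trailing spaces pass through unchanged, so a test on the unfolded value can still be sensitive to them.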

okay, fine.  I was thinking about cases where you want to repeat it back
at the sender, e.g.

  vacation :subject text:
Auto: I'm away from office
 (was: ${subject} )
.
;

where the fact that the CRLFs are included in ${subject} makes it
unnecessary to worry about folding of overlong subjects.  the vacation
draft should perhaps be explicit in that Subject must be folded
appropriately?


Wow, I had never even considered putting explicit folding points into a
vacation subject argument. Given that there can be substitutions, encoded-word
generation, and who knows what else happening, it seems like a really bad
idea for a script to do this explicitly. But I suppose the vacation draft needs
to say that (a) the subjects and other headers in generated vacation responses
have to be properly folded and (b) CRLFs in subject arguments need to be
handled sensibly. In particular, the case of
<text-with-no-space><CRLF><text-with-no-space> has to be handled in a way that
doesn't cause an illegal header to be generated. I think I'll say something
like "CRLFs in :subject arguments MAY simply be removed but MUST be handled
in a fashion that prevents an illegal header field from being produced."
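
A minimal Python sketch of the "MAY simply be removed" option, purely as an illustration of the proposed wording and not of any existing implementation. The name `sanitize_subject` is hypothetical, and folding of the resulting header is assumed to happen later, when the response is generated.

```python
def sanitize_subject(arg):
    """Drop every CR and LF so the argument can never inject a line break."""
    return arg.replace("\r", "").replace("\n", "")


# <text-with-no-space><CRLF><text-with-no-space>: emitting the CRLF as is
# would start a new, malformed header line; removing it keeps one field.
print(sanitize_subject("Auto:away\r\n(was:lunch)"))
# A properly folded argument keeps the whitespace that followed the CRLF.
print(sanitize_subject("Auto: I'm away from office\r\n (was: lunch)"))
```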

while we're at it, what do we do about headers which can have multiple
values, e.g. "Cc"?  (multiple headers are deprecated in RFC 2822, but must be
supported.)  I don't have a good suggestion.  the naive approach is to
concatenate them as if they were simply separated by CR LF, but for "Cc"
you would really want to include a comma in the delimiter as well.  the
other option is to say that the first matching header is used.  this
sits well with short-circuiting logic, but means it's impossible to
capture the complete value of the header.


Simple: You test all the values as separate fields. Sieve tests already allow
for a list of fields but there's an implicit list even when there's only a
single field name specified.
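
Roughly, in Python terms (a sketch of the matching rule only, not of how any Sieve engine is written): every occurrence of every named field is tested on its own, and the test is true as soon as one of them matches. `header_contains` is a made-up name, and the lowercasing stands in for a case-insensitive comparator.

```python
from email import message_from_string


def header_contains(msg, names, keys):
    """True if any occurrence of any named field contains any key."""
    for name in names:
        # get_all() returns one string per occurrence of the field.
        for value in msg.get_all(name, []):
            if any(key.lower() in value.lower() for key in keys):
                return True
    return False


raw = (
    "To: someone@example.org\n"
    "Cc: other@example.org\n"
    "Cc: me@example.com\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
print(header_contains(msg, ["To", "Cc"], ["me@example.com"]))
```

Nothing is concatenated, so no delimiter between the two Cc fields has to be chosen; the price is that no single string holds the complete value.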

yes, "but [this] means it's impossible to capture the complete value of
the header", "*" will only fetch the contents of the first one.


Yep.

perhaps
a loop construct for the address test would be a more appropriate
solution to that problem?

  for.every.address ["To", "Cc"] { block }


Yes, seems reasonable. In fact I once argued that the concept of a result
generator might make sense in sieve - the Icon language uses this to great
advantage in a bunch of different ways.

this can of course wait until we see the need.  for the base
specification, I would like a tiny change to make this a little more
apparent:

   For instance, the test `header :contains ["To", "Cc"]
-  ["me@example.com", "me00@landru.example.edu"]' is true if either the
+  ["me@example.com", "me00@landru.example.edu"]' is true if either a
   To header or Cc header of the input message contains either of the
   email addresses "me@example.com" or "me00@landru.example.edu".


Seems like a good idea to me.

Yes, I'm afraid so. Perhaps something like rawheader? (I believe the header
test is the only one for which :raw makes sense.)

I'd like it for address, too.  this is the difference between
"xn--srlandslaget-vjb.no" and "sørlandslaget.no".  okay, so this isn't
in the spec today, but we might as well be ready for it.
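
The two forms can be converted into each other; a :raw address test would see the first, a decoding test the second. A short Python sketch using the standard library's built-in "idna" codec (which implements IDNA 2003, so newer IDNA rules may differ); the function names are only illustrative.

```python
def to_ace(domain):
    """Unicode domain -> ASCII-compatible (xn--) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain):
    """ASCII-compatible (xn--) form -> Unicode domain."""
    return domain.encode("ascii").decode("idna")


print(to_ace("sørlandslaget.no"))
print(to_unicode("xn--srlandslaget-vjb.no"))
```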


Agreed, and it may be closer than you think given how the meeting at the last
IETF on i18n stuff in email went.

what happens if you add a :raw argument and upload to today's
implementations?  will they reject during upload?  will they ignore it
during runtime?  will they bomb during runtime?


Our implementation will return a runtime error. I believe this is the correct
behavior. We don't check during upload - our sieves are typically provisioned
via LDAP and we have no control over what tools are used to insert them into
the directory.

And yes, this makes error reporting a real challenge.

IMHO it's okay as long as it doesn't cause a runtime error.  (Cyrus 2.2
will reject upload, I haven't checked others.)


This is just another extension, so I don't see why causing a runtime
error is a problem.

right, that was my point.  if you didn't raise a runtime error, we might
get away with not declaring it an extension.  but you do, so we don't.


OK, seems like we're in agreement then.

                                Ned