
Re: Proposal for escaping on non-UTF-8 sequences in Sieve

2006-10-20 05:13:19

If you're referring to RFC 2047's Q encoding, 3028bis currently says:

       <...>  An encoded NUL octet
       (character zero) SHOULD NOT cause early termination of the header
       content being compared against.

So an implementation that refused to match beyond the =00 would still be 
"conditionally compliant".

So how about handling an encoded NUL in strings just the same? SHOULD NOT
is reasonable.
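
To illustrate what the SHOULD NOT means in practice, here is a minimal
Python sketch.  It is not taken from any Sieve implementation; decode_q,
contains_full and contains_c_style are names made up for this example,
and the Q decoding covers only "_" and "=XX", not the surrounding
=?charset?Q?...?= framing.

```python
import re

_HEX_PAIR = re.compile(rb"=([0-9A-Fa-f]{2})")


def decode_q(payload: bytes) -> bytes:
    """Decode the payload of an RFC 2047 Q-encoded word into raw octets."""
    # In Q encoding "_" stands for a space and "=XX" for the octet 0xXX.
    payload = payload.replace(b"_", b" ")
    return _HEX_PAIR.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")), payload
    )


def contains_full(header_octets: bytes, key: bytes) -> bool:
    """Compare against the whole decoded content, NUL included."""
    return key in header_octets


def contains_c_style(header_octets: bytes, key: bytes) -> bool:
    """Mimic an implementation that treats NUL as a string terminator."""
    return key in header_octets.split(b"\x00", 1)[0]


decoded = decode_q(b"abc=00def")
full_match = contains_full(decoded, b"def")        # sees text after the NUL
c_style_match = contains_c_style(decoded, b"def")  # stops at the NUL
```

The second comparison is the "conditionally compliant" behaviour: it
never sees anything after the =00.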

The original MIME documents didn't make an exception for NUL...and that 
had to be changed when they were revised as part of moving to Draft 
Standard.  RFC 2049 says:
     (17)  The definitions of "7bit" and "8bit" have been
           tightened so that use of bare CR, LF can only be used
           as end-of-line sequences.  The document also no longer
           requires that NUL characters be preserved, which brings
           MIME into alignment with real-world implementations.

Without looking it up, it sounds as if that applies to unencoded NUL
characters.  I agree there is a real-world problem with them and the
above makes perfect sense.  I would not like it if quoted-printable were
allowed to drop =00.

1) Given that variables are *explicitly* not handled that way, why should
    that be true of these?

2) Why would pulling in the variable capability change the behavior of the

3) The behavior you describe would be useful how?

The variable capability changes how argument strings are interpreted.
Arguments to string functions are strings, too.  That's why it makes
sense to me that variables cause recursive evaluation, but that may
just be me.

Implementations that do not implement variables can scan strings
with linear effort and without the need for recursion or a parser.
Side-effect-free functions of constant arguments can be evaluated until
you reach the top level, whose result cannot be represented in UTF-8;
hence we need the top level to encode the result, but no more.
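
As a rough sketch of the non-recursive case, the following Python
scans a string once and replaces each escape by its octets.  The exact
syntax accepted here (pairs of hex digits, optionally separated by
whitespace, inside ${hex:...}) is an assumption for illustration, and
expand_hex is a made-up name, not part of any draft.

```python
import re

# Assumed syntax: ${hex:XX XX ...} with two hex digits per octet.
_HEX_ESCAPE = re.compile(r"\$\{hex:\s*((?:[0-9A-Fa-f]{2}\s*)+)\}")


def expand_hex(text: str) -> bytes:
    """Expand ${hex:...} escapes in a single left-to-right pass."""
    out = bytearray()
    pos = 0
    for m in _HEX_ESCAPE.finditer(text):
        # Copy the literal text before the escape as UTF-8.
        out += text[pos:m.start()].encode("utf-8")
        # Append the raw octets; they need not form valid UTF-8.
        out += bytes.fromhex("".join(m.group(1).split()))
        pos = m.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


plain = expand_hex("a${hex:ff 00}b")
# The expansion is never scanned again, so an escape that produces "$"
# does not combine with the text that follows it.
not_rescanned = expand_hex("${hex:24}{hex:41}")
```

The result is returned as octets rather than as a string because, as
said above, it may not be representable in UTF-8.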

I don't insist on this behaviour, it was just a suggestion.  Using
${hex:...} opens the door to other string functions, and we are really
talking about introducing string functions here and the question is:
Which extension introduces recursive evaluation? I suggested variables,
but it may as well be some not-yet-specified extension.

My interactions with Sendmail's Japanese customers tell me that 
ISO-2022-JP is still heavily preferred over UTF-8 in email in Japan.  Are 
you claiming otherwise?  Or do you think that it doesn't matter?

Actually, my feedback from customers is that national character sets
are still preferred over UTF-8 - independent of their location.  I am
REAL GLAD Sieve went the UTF-8 route to begin with, because national
character sets are a royal pain and I hope that historic crap disappears
some day.  Each day Unicode is used a little more, we are a little closer
to that point.

Personally, I have no stuff capable of displaying UTF-8, and looking at
other (Linux) installations, the claim of being able to switch completely
to Unicode is not YET true.  iconv(1) and similar tools help a lot
with existing installations that are tied to national character sets,
e.g. when viewing a Sieve script stored in UTF-8 on my ISO-8859-15
display.  Japanese may be worse, but German has a non-US-ASCII letter
that only exists in lower case.  Translating it to upper case results
in two letters and you cannot convert it back without a dictionary.
It's not like I would be ignorant of non-US-ASCII text.
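
For those who have not met that letter, a small Python illustration
using the language's built-in case mapping.  The two example words are
my own choice, and the letter is written as an escape so that the
snippet does not depend on the encoding it is viewed in.

```python
sharp_s = "\u00df"               # the German sharp s

measures = "Ma" + sharp_s + "e"  # "measures", spelled with the sharp s
mass = "Masse"                   # "mass", spelled with a double s

upper_measures = measures.upper()
upper_mass = mass.upper()

# Lower-casing cannot tell which of the two words was meant.
round_trip = upper_measures.lower()

# The same letter in two encodings, which is what iconv(1) converts.
as_utf8 = sharp_s.encode("utf-8")
as_latin9 = sharp_s.encode("iso-8859-15")
```

Both words end up as "MASSE", and lower-casing that gives "masse" for
both, so the original spelling is lost.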

I had hordes of customers yell and scream at me that they will cease to
exist without their open SMTP relay.  Some probably still hate me for
having to close it.  I don't always listen to their wishes.

