Re: non-UTF-8 sequences in Sieve scripts


On Mon, 2006-09-18 at 11:07 -0400, Cyrus Daboo wrote:

> I cannot see any restriction that would prevent use of
> Content-Transfer-Encoding: 8-bit, in which case the MIME parts are not
> necessarily US-ASCII and not necessarily UTF-8, e.g. a text/plain part with
> charset=iso8859-1.


in my opinion, it is quite clear that the intent of RFC 3028 is that a
Sieve script is pure Unicode.  the RFC is actually quite explicit about it:

[RFC 3028: 2.1. Form of the Language]:
|   [...]
|   The language is represented in UTF-8, as specified in [UTF-8].

[RFC 3028: 8.1. Lexical Tokens]:
|
|   Sieve scripts are encoded in UTF-8.  The following assumes a valid
|   UTF-8 encoding; special characters in Sieve scripts are all ASCII.

it then goes on to define the grammar, but it takes a shortcut and says:

|   CHAR-NOT-STAR = (%x00-51 / %x53-ff)
|   quoted-string = DQUOTE *CHAR DQUOTE

etc., which may seem to allow arbitrary octets, but the grammar is
constrained by the introductory paragraph of 8.1, which assumes a valid
UTF-8 encoding.

the embedded MIME part is no different from the rest of the script, and
needs to conform to this.  it shouldn't be necessary to state the
restriction explicitly.  these kinds of restrictions are very familiar
in the context of MIME, and several options are available to make a MIME
part which conforms to a UTF-8 transport.
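
to illustrate two of those options, here is a minimal Python sketch which
re-encodes an iso-8859-1 text part as quoted-printable or base64, so that
the part consists of US-ASCII octets only and is therefore also valid
UTF-8.  the helper name ascii_safe_part and the sample text are
assumptions made for the example; they do not come from RFC 3028 or the
draft.  (a third option, not shown, is to convert the text itself to
UTF-8 and label it charset=utf-8.)

```python
import base64
import quopri


def ascii_safe_part(text, charset="iso-8859-1", cte="quoted-printable"):
    """Build a text/plain MIME part (hypothetical helper) from US-ASCII only."""
    # These are the octets an 8bit part would have carried verbatim.
    raw = text.encode(charset)
    if cte == "quoted-printable":
        body = quopri.encodestring(raw).decode("ascii")
    elif cte == "base64":
        # Line wrapping at 76 characters is omitted for brevity.
        body = base64.b64encode(raw).decode("ascii")
    else:
        raise ValueError("unsupported Content-Transfer-Encoding: " + cte)
    lines = [
        "Content-Type: text/plain; charset=" + charset,
        "Content-Transfer-Encoding: " + cte,
        "",
        body,
    ]
    return "\r\n".join(lines)


part = ascii_safe_part("bl\u00e5b\u00e6r")
# Raises UnicodeEncodeError if any non-ASCII character slipped through.
part.encode("ascii")
```

the charset label stays iso-8859-1, so nothing is lost for the recipient;
only the transfer encoding changes.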

okay, fine, we don't have to make 3028bis more explicit than 3028 about
disallowing arbitrary octets.  but we should definitely not add explicit
text which allows it either!  let's stick to the original wording, but
keep some of the clarifications from the draft:

[3028bis-09: 2.1. Form of the Language (§2)]:
|   With the exceptions of strings and comments, the language is limited
|   to US-ASCII characters.  Strings and comments may contain octets
|   outside the US-ASCII range.  Specifically, they will normally be in
|   UTF-8, as specified in [UTF-8].  NUL (US-ASCII 0) is never permitted
|   in scripts, while CR and LF can only appear as the CRLF line ending.

[my suggestion]:
|   With the exceptions of strings and comments, the language is limited
|   to US-ASCII characters.  Strings and comments are encoded in
|   UTF-8, as specified in [UTF-8].  NUL (US-ASCII 0) is never permitted
|   in scripts, while CR and LF can only appear as the CRLF line ending.
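
as a rough sketch of what the suggested wording asks of a script at the
octet level, the Python function below checks the three script-wide
constraints: valid UTF-8, no NUL, and CR/LF only as CRLF.  the name
check_script_octets is an assumption for the example, and the sketch does
not check that text outside strings and comments is US-ASCII, since that
would need a real lexer.

```python
def check_script_octets(script):
    """Return a list of problems found in the raw octets of a Sieve script."""
    problems = []
    try:
        script.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append("invalid UTF-8 at octet %d" % exc.start)
    if b"\x00" in script:
        problems.append("NUL octet present")
    # Any CR or LF left after removing CRLF pairs is a bare CR or LF.
    leftover = script.replace(b"\r\n", b"")
    if b"\r" in leftover or b"\n" in leftover:
        problems.append("bare CR or LF outside a CRLF pair")
    return problems


good = 'if header :contains "Subject" "bl\u00e5b\u00e6r" { keep; }\r\n'
print(check_script_octets(good.encode("utf-8")))
print(check_script_octets(good.encode("iso-8859-1")))
```

the same script text passes when encoded as UTF-8 and is reported as
invalid when encoded as iso-8859-1, which is the distinction the suggested
wording draws.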

[3028bis-09: 2.4.2. Strings (§6)]:
|   As message header data is converted to [UTF-8] for comparison (see
|   section 2.7.2), most strings will use the UTF-8 encoding.  However,
|   implementations MUST accept all strings that match the grammar in
|   section 8.  The ability to use non-UTF-8 encoded strings matches
|   existing practice and has proven to be useful both in tests for
|   invalid data and in arguments containing raw MIME parts for extension
|   actions that generate outgoing messages.

[my suggestion]:
|   [strike paragraph 6 in its entirety]

the text from 8.1 is unchanged in 3028bis-09, so this is all we need to
maintain status quo.  we'll also keep the accurate UTF-8 definitions of
characters out of the ABNF, but may decide to change that later, in the
Standard revision of the document.

I hope this proposal is agreeable to all concerned.
-- 
Kjetil T.