Re: Invalid syntax to cause runtime error in encoded-character


On Wed, 8 Nov 2006, Kjetil Torgrim Homme wrote:

hmm, runtime isn't too appealing, but compile time would be nice.  this
actually depends on the order we process encoded-character and variables.
if we do variables first (so "${hex:${var}}" works), it must be runtime.
if we do encoded-characters first (so "${v${hex:61}r}" works), it can be
compile time.

Hmm, I thought it was settled that encoded-characters were to always be expanded before variables, such that encoded-characters could be implemented purely in an implementation's lexer, or even via pre-substitution. This should be stated in the variables draft, to avoid creating a normative reference from 3028bis to variables. Inserting that shouldn't be a huge process issue.
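
To make the ordering question concrete, here is a rough Python sketch of the two passes. It is illustration only and assumes a much simplified model: the helper names (expand_encoded, expand_vars) and the variable-name pattern are made up for this example, only the ${hex:...} form is handled, and unknown variables are simply replaced by the empty string.

```python
import re

# Hypothetical helpers for illustration; not taken from any draft or implementation.
HEX = re.compile(r"\$\{hex:([0-9A-Fa-f]{1,2}(?:[ \t][0-9A-Fa-f]{1,2})*)\}", re.IGNORECASE)
VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")  # simplified variable-name syntax


def expand_encoded(s: str) -> str:
    """Lexer-style pass: replace each ${hex:...} by the characters it names."""
    return HEX.sub(lambda m: "".join(chr(int(p, 16)) for p in m.group(1).split()), s)


def expand_vars(s: str, env: dict) -> str:
    """Runtime-style pass: replace ${name} by its value (empty if unset)."""
    return VAR.sub(lambda m: env.get(m.group(1).lower(), ""), s)


# Encoded-characters first: "${v${hex:61}r}" becomes "${var}" before variables run.
encoded_first = expand_vars(expand_encoded("${v${hex:61}r}"), {"var": "x"})

# Variables first: "${hex:${var}}" becomes "${hex:41}" before the hex pass runs.
vars_first = expand_encoded(expand_vars("${hex:${var}}", {"var": "41"}))
```

With encoded-characters first, the hex pass never depends on runtime values, which is what lets it live in the lexer.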

I don't have any strong feelings about it.  it's good to catch as many
errors as possible at upload time, but it may be better to leave the
task of finding probable typos to a Sieve lint program.

Okay. Given that and the opinions expressed in the meeting, I propose we go with the following text, mostly based on a previous diff Kjetil posted:


1) the following addition to section 2.1 "Form of the Language"

      While this specification permits arbitrary octets to appear in
      sieve scripts inside strings and comments, this has made it
      difficult to robustly handle sieve scripts that are concerned with
      the encoding of data.  The "encoded-character" capability
      (section 2.4.2.4) provides an alternative means of representing
      such octets in strings using just US-ASCII characters.  As such,
      the use of non-UTF-8 text in scripts should be considered a
      deprecated feature that may be abandoned.

2) the addition of section 2.4.2.4, as follows:

2.4.2.4. Encoding characters using "encoded-character"

   When the "encoded-character" extension is in effect, certain
   character sequences in strings are replaced by the unencoded value.
   This happens after escape sequences are interpreted and dot-
   unstuffing has been done.  Implementations SHOULD support "encoded-
   character".

   Arbitrary octets can be embedded in strings by using the syntax
   encoded-arb-octets.  The sequence is replaced by the octets with the
   hexadecimal values given by each hex-pair.

   encoded-arb-octets   = "${hex:" hex-pair-seq "}"
   hex-pair-seq         = hex-pair *(WSP hex-pair)
   hex-pair             = 1*2HEXDIG
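
(Illustration only, not proposed spec text.) A minimal Python sketch of how an implementation might do the encoded-arb-octets replacement as a pre-substitution over the string's octets. The names HEX_RE and expand_hex are invented for this example. The pattern mirrors the ABNF above: one or two hex digits per pair, exactly one space or tab between pairs. The prefix is matched case-insensitively on the assumption that the usual ABNF rule for quoted literals applies. Sequences that do not match the grammar are left untouched.

```python
import re

# Hypothetical pattern mirroring: "${hex:" hex-pair *(WSP hex-pair) "}"
HEX_RE = re.compile(
    rb"\$\{hex:([0-9A-Fa-f]{1,2}(?:[ \t][0-9A-Fa-f]{1,2})*)\}",
    re.IGNORECASE,
)


def expand_hex(s: bytes) -> bytes:
    """Replace each ${hex:...} sequence by the octets its hex-pairs name."""
    def repl(m):
        # Each whitespace-separated hex-pair becomes one octet.
        return bytes(int(pair, 16) for pair in m.group(1).split())
    return HEX_RE.sub(repl, s)


result = expand_hex(b"$${hex:24 24}")
```

Here the leading "$" is not part of any encoded sequence and is copied through unchanged, so result holds three "$" octets.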

   It may be inconvenient or undesirable to enter Unicode characters
   verbatim, and in these cases the syntax encoded-unicode-char can be
   used.  The sequence is replaced by the UTF-8 encoding of the
   specified Unicode characters, which are identified by the hexadecimal
   value of unicode-hex.

   encoded-unicode-char = "${unicode:" unicode-hex-seq "}"
   unicode-hex-seq      = unicode-hex *(WSP unicode-hex)
   unicode-hex          = 1*6HEXDIG

   It is an error for a script to use a hexadecimal value that isn't in
   either the range 0 to D7FF or the range E000 to 10FFFF.  (The range
   D800 to DFFF is excluded as those character numbers are only used as
   part of the UTF-16 encoding form and are not applicable to the UTF-8
   encoding that the syntax here represents.)
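
(Illustration only, not proposed spec text.) A minimal Python sketch of the encoded-unicode-char replacement together with the range check described above. The names UNICODE_RE and expand_unicode are invented for this example, and raising ValueError is just one assumed way for an implementation to report the error. The result is a Python string; encoding it as UTF-8 gives the octets the text describes.

```python
import re

# Hypothetical pattern mirroring: "${unicode:" unicode-hex *(WSP unicode-hex) "}"
UNICODE_RE = re.compile(
    r"\$\{unicode:([0-9A-Fa-f]{1,6}(?:[ \t][0-9A-Fa-f]{1,6})*)\}",
    re.IGNORECASE,
)


def expand_unicode(s: str) -> str:
    """Replace each ${unicode:...} sequence by the characters it names."""
    def repl(m):
        chars = []
        for tok in m.group(1).split():
            cp = int(tok, 16)
            # Allowed ranges: 0..D7FF and E000..10FFFF (surrogates excluded).
            if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                raise ValueError("invalid Unicode value in script: " + tok)
            chars.append(chr(cp))
        return "".join(chars)
    return UNICODE_RE.sub(repl, s)


octets = expand_unicode("caf${unicode:E9}").encode("utf-8")
```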

   The capability string for use with the require command is "encoded-
   character".

   In the following script, message A is discarded, since the specified
   test string is equivalent to "$$$".

   Example:  require "encoded-character";
             if header :contains "Subject" "$${hex:24 24}" {
                discard;
             }


3) the following addition to section 6.2.3 "Initial Capability Registrations"

   Capability name: encoded-character
   Description:     changes the interpretation of strings to allow
                    arbitrary octets and Unicode characters to be
                    represented using US-ASCII
   RFC number:      this RFC (Sieve base spec)
   Contact address: The Sieve discussion list
                    <ietf-mta-filters(_at_)imc(_dot_)org>



Philip Guenther