Re: Proposal for escaping on non-UTF-8 sequences in Sieve

On Sun, 2006-10-01 at 16:04 +0200, Kjetil Torgrim Homme wrote:

> variables don't allow a reference which acts as a function with
> arbitrary input (e.g., "${hex:e6 f8 e5}"), the tail end has to be an
> identifier or numbered variable.  unfortunately, this means ${hex:7e} is
> disallowed, since "7e" is neither.


I'm sorry, I was very confused.  the "variables" syntax uses period, not
colon, to separate namespaces, so the suggested syntax would _resemble_
it, but be completely independent of it ("variables" only affects
substrings which match the specified syntax exactly, so there's
definitely no conflict).

in other words, we can do whatever we like with ${keyword:data}.  I
prefer an extensible syntax over a compact one (Alexey's $%xx
suggestion), so my vote is for ${hex:7e}.  please see suggested patch
below.

a couple of notes:

a) the extension name, "quoted-character", is off the top of my head.
feel free to use a different one if you prefer ("encode-char",
perhaps?).

b) the syntax requires spaces between the items.  it's possible to allow
${hex:ABCD} if we use this syntax instead:

   quoted-arb-octets   = "${hex:" hex-pair-seq "}"
   hex-pair-seq        = hex-pair *(*WSP hex-pair)
   hex-pair            = 2HEXDIG

on the other hand, some Unicode code points may need five hex digits
(such as U+1D10A, which is 𝄊, the musical symbol "da capo"), but these
are quite rare, so most people will probably want to write just four
digits (e.g. U+4e2d U+56fd, which is 中国, the name of China).  this
means a sequence of Unicode characters can't be written unambiguously and
conveniently as one long string of hex digits.  therefore, to make them
consistent, both encodings require the values to be split by whitespace.
I think this improves readability, anyway.

c) it may be presumptuous of me to add this extension to the "SHOULD be
implemented" list; I won't be offended if it's listed as "MAY" ;-)

d) I'm no expert on ABNF, so please review.
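
for reviewers who want to try the grammar out, here is a minimal Python
sketch of how an implementation might expand the two forms in the patch
below.  the function and pattern names are hypothetical and not part of
the draft.  it assumes strings are handled as octets, that exactly one
SP or HTAB separates items (as in the patch), that the keywords are
matched case-sensitively, and that sequences which don't match the
grammar are left untouched.  leaving surrogate code points unexpanded is
also an assumption; the patch does not address them.

```python
import re

# Mirrors quoted-arb-octets and quoted-unicode-char from the patch:
# 1*2HEXDIG items for "hex", 1*5HEXDIG items for "unicode",
# separated by exactly one WSP (SP or HTAB).
_QUOTED = re.compile(
    rb"\$\{(?:hex:(?P<hex>[0-9A-Fa-f]{1,2}(?:[ \t][0-9A-Fa-f]{1,2})*)"
    rb"|unicode:(?P<uni>[0-9A-Fa-f]{1,5}(?:[ \t][0-9A-Fa-f]{1,5})*))\}"
)


def _expand(match):
    """Return the octets that one matched sequence stands for."""
    hex_items = match.group("hex")
    if hex_items is not None:
        # Each hex-pair becomes a single octet.
        return bytes(int(item, 16) for item in hex_items.split())
    chars = "".join(chr(int(item, 16)) for item in match.group("uni").split())
    try:
        # Each unicode-hex becomes the UTF-8 encoding of that code point.
        return chars.encode("utf-8")
    except UnicodeEncodeError:
        # Assumption: surrogates (U+D800..U+DFFF) are left unexpanded.
        return match.group(0)


def expand_quoted_characters(data: bytes) -> bytes:
    """Expand ${hex:...} and ${unicode:...} in a single left-to-right pass.

    A single pass means octets produced by one expansion are never
    re-scanned as the start of another sequence.
    """
    return _QUOTED.sub(_expand, data)


if __name__ == "__main__":
    print(expand_quoted_characters(b"$${hex:24 24}"))
    print(expand_quoted_characters(b"${unicode:4e2d 56fd}").decode("utf-8"))
```

with this sketch, "$${hex:24 24}" expands to "$$$" as in the example in
the patch, while ${hex:ABCD} is left as-is because the items are not
separated by whitespace.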

-- 
Kjetil T.

--- draft-ietf-sieve-3028bis-09.txt     2006-10-06 01:47:33.989869000 +0200
+++ draft-ietf-sieve-3028bis-kjetilho.txt       2006-10-06 03:22:35.025194000 +0200
@@ -385,13 +385,9 @@
    are permitted in quoted strings.  Quoted strings MAY span multiple
    lines.  NUL (US-ASCII 0) is not allowed in strings.
 
-   As message header data is converted to [UTF-8] for comparison (see
-   section 2.7.2), most strings will use the UTF-8 encoding.  However,
-   implementations MUST accept all strings that match the grammar in
-   section 8.  The ability to use non-UTF-8 encoded strings matches
-   existing practice and has proven to be useful both in tests for
-   invalid data and in arguments containing raw MIME parts for extension
-   actions that generate outgoing messages.
+   The extension "quoted-character" may be used to encode arbitrary
+   characters as a sequence of US-ASCII characters (see 2.4.2.4 for
+   details).
 
    For entering larger amounts of text, such as an email message, a
    multi-line form is allowed.  It starts with the keyword "text:",
@@ -470,6 +466,42 @@
    valid, but need not ensure that they actually identify an email
    recipient.
 
+2.4.2.4. Encoding characters using "quoted-character"
+
+   When the "quoted-character" extension is in effect, certain
+   character sequences in strings are replaced by the unencoded value.
+   This happens after escape sequences are interpreted and
+   dot-unstuffing has been done.
+
+   Arbitrary octets can be embedded in strings by using the syntax
+   quoted-arb-octets.  The sequence is replaced by the octets with the
+   hexadecimal values given by each hex-pair.
+
+   quoted-arb-octets   = "${hex:" hex-pair-seq "}"
+   hex-pair-seq        = hex-pair *(WSP hex-pair)
+   hex-pair            = 1*2HEXDIG
+
+   It may be inconvenient or undesirable to enter Unicode characters
+   verbatim, and in these cases the syntax quoted-unicode-char can be
+   used. The sequence is replaced by the UTF-8 encoding of the
+   specified Unicode characters, which are identified by the
+   hexadecimal value of unicode-hex.
+
+   quoted-unicode-char = "${unicode:" unicode-hex-seq "}"
+   unicode-hex-seq     = unicode-hex *(WSP unicode-hex)
+   unicode-hex         = 1*5HEXDIG
+
+   The capability string for use with the require command is
+   "quoted-character".
+
+   In the following script, message A is discarded, since the
+   specified test string is equivalent to "$$$".
+
+   Example:   require "quoted-character";
+              if header :contains "Subject" "$${hex:24 24}" {
+                    discard;
+              }
+
 2.5.     Tests
 
    Tests are given as arguments to commands in order to control their
@@ -1075,7 +1107,7 @@
    Implementations MUST support the "keep", "discard", and "redirect"
    actions.
 
-   Implementations SHOULD support "fileinto".
+   Implementations SHOULD support "fileinto" and "quoted-character".
 
    Implementations MAY limit the number of certain actions taken (see
    section 2.10.4).
@@ -1561,6 +1593,12 @@
    RFC number:      this RFC (Sieve base spec)
    Contact address: The Sieve discussion list <ietf-mta-filters(_at_)imc(_dot_)org>
 
+   Capability name: quoted-character
+   Description:     changes the parsing of strings to allow arbitrary
+                    characters to be embedded
+   RFC number:      this RFC (Sieve base spec)
+   Contact address: The Sieve discussion list <ietf-mta-filters(_at_)imc(_dot_)org>
+
    Capability name: comparator-* (anything starting with "comparator-")
    Description:     adds the indicated comparator for use with the
                     :comparator argument