ietf-mta-filters

Re: Proposal for escaping on non-UTF-8 sequences in Sieve

2006-10-06 07:50:15

Kjetil Torgrim Homme wrote:

On Sun, 2006-10-01 at 16:04 +0200, Kjetil Torgrim Homme wrote:
variables don't allow a reference which acts as a function with
arbitrary input (e.g., "${hex:e6 f8 e5}"), the tail end has to be an
identifier or numbered variable.  unfortunately, this means ${hex:7e} is
disallowed, since "7e" is neither.

I'm sorry, I was very confused.  the "variables" syntax uses period, not
colon, to separate namespaces, so the suggested syntax would _resemble_
it, but be completely independent of it ("variables" only affects
substrings which match the specified syntax exactly, so there's
definitely no conflict).

in other words, we can do whatever we like with ${keyword:data}.  I
prefer an extensible syntax over a compact one (Alexey's $%xx
suggestion), so my vote is for ${hex:7e}.  please see suggested patch
below.
I would prefer that we pick a more distinctive prefix. Something starting with '$' but not followed by '{' would be great. However, if others feel strongly in favor of your variant, that would be fine too. Apart from that, your proposal is fine with me.

a couple of notes:

a) the extension name, "quoted-character", is off the top of my head.
feel free to use a different one if you prefer ("encode-char",
perhaps?).
"encode-char" is slightly better, IMHO. quoted-* has a strong association with quoted strings.

b) the syntax requires spaces between the items.  it's possible to allow
${hex:ABCD} if we use this syntax instead:

  quoted-arb-octets   = "${hex:" hex-pair-seq "}"
  hex-pair-seq        = hex-pair *(*WSP hex-pair)
  hex-pair            = 2HEXDIG

on the other hand, some Unicode code points may need five hex digits
(such as U+1D10A, which is 𝄊, the musical symbol "da capo"), but these
are quite rare, so most people will probably want to write just four
digits (e.g. U+4e2d U+56fd, which is 中国, the name of China).  this
means a sequence of Unicode characters can't be written unambiguously and
conveniently as one long string of hex digits.  therefore, to make them
consistent, both encodings require the values to be split by whitespace.
I think this improves readability, anyway.
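
To make the trade-off in b) concrete, here is a minimal Python sketch (not part of the proposed patch). It assumes the compact grammar quoted above, with hex-pair = 2HEXDIG and optional whitespace between pairs, and it assumes code points are written with either four or five hex digits. The names decode_hex_pairs and split_code_points are made up for this illustration.

```python
import re

# hex-pair-seq = hex-pair *(*WSP hex-pair), hex-pair = 2HEXDIG
HEX_PAIR_SEQ = re.compile(r"[0-9A-Fa-f]{2}(?:[ \t]*[0-9A-Fa-f]{2})*")


def decode_hex_pairs(data):
    """Decode a hex-pair-seq into octets; whitespace between pairs is optional."""
    if not HEX_PAIR_SEQ.fullmatch(data):
        raise ValueError("not a valid hex-pair-seq: %r" % data)
    # Every item is exactly two digits, so dropping the whitespace is lossless.
    return bytes.fromhex(re.sub(r"[ \t]", "", data))


def split_code_points(digits, widths=(4, 5)):
    """List every way to cut a run of hex digits into code points.

    Assumption for this sketch: each code point uses 4 or 5 digits.
    """
    if not digits:
        return [[]]
    results = []
    for width in widths:
        if len(digits) >= width:
            head = digits[:width]
            for rest in split_code_points(digits[width:], widths):
                results.append([head] + rest)
    return results


# Octets: fixed-width pairs mean "ABCD" has exactly one reading.
print(decode_hex_pairs("ABCD"))
print(decode_hex_pairs("e6 f8 e5"))

# Code points: a 20-digit run can be read as five 4-digit values
# or as four 5-digit values.
for reading in split_code_points("4e2d56fd4e2d56fd4e2d"):
    print(reading)
```

With fixed-width pairs the octet form never needs separators, whereas the code point form does as soon as the widths can differ, which is the reason given above for requiring whitespace in both encodings.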

c) it may be presumptuous of me to add this extension to the "SHOULD be
implemented" list, I won't be offended if it's listed as "MAY" ;-)
Speaking personally, SHOULD is fine with me.

d) I'm no expert on ABNF, so please review.
====

--- draft-ietf-sieve-3028bis-09.txt     2006-10-06 01:47:33.989869000 +0200
+++ draft-ietf-sieve-3028bis-kjetilho.txt       2006-10-06 03:22:35.025194000 +0200
@@ -385,13 +385,9 @@
   are permitted in quoted strings.  Quoted strings MAY span multiple
   lines.  NUL (US-ASCII 0) is not allowed in strings.

-   As message header data is converted to [UTF-8] for comparison (see
-   section 2.7.2), most strings will use the UTF-8 encoding.  However,
-   implementations MUST accept all strings that match the grammar in
-   section 8.  The ability to use non-UTF-8 encoded strings matches
-   existing practice and has proven to be useful both in tests for
-   invalid data and in arguments containing raw MIME parts for extension
-   actions that generate outgoing messages.
+   The extension "quoted-character" may be used to encode arbitrary
+   characters as a sequence of US-ASCII characters (see 2.4.2.4 for
+   details).

   For entering larger amounts of text, such as an email message, a
   multi-line form is allowed.  It starts with the keyword "text:",
I am against this change, as it doesn't agree with the rough consensus in the group, which is to try to keep existing implementations compliant.

However, your added text is fine, as long as the original text is not deleted.

@@ -470,6 +466,42 @@
   valid, but need not ensure that they actually identify an email
   recipient.

+2.4.2.4. Encoding characters using "quoted-character"
+
+   When the "quoted-character" extension is in effect, certain
+   character sequences in strings are replaced by the unencoded value.
+   This happens after escape sequences are interpreted and
+   dot-unstuffing has been done.
+
+   Arbitrary octets can be embedded in strings by using the syntax
+   quoted-arb-octets.  The sequence is replaced by the octets with the
+   hexadecimal values given by each hex-pair.
+
+   quoted-arb-octets   = "${hex:" hex-pair-seq "}"
+   hex-pair-seq        = hex-pair *(WSP hex-pair)
+   hex-pair            = 1*2HEXDIG
Did you really want to allow for
${hex: 7 8 9}
?
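
For reference, a rough Python sketch of what the grammar in the patch accepts, next to a hypothetical stricter variant with hex-pair = 2HEXDIG. The regular expressions are my own transcription of the ABNF, and expand() is an illustrative name, not anything from the draft.

```python
import re

# Transcription of the grammar in the patch:
#   quoted-arb-octets = "${hex:" hex-pair-seq "}"
#   hex-pair-seq      = hex-pair *(WSP hex-pair)
#   hex-pair          = 1*2HEXDIG
DRAFT = re.compile(rb"\$\{hex:([0-9A-Fa-f]{1,2}(?:[ \t][0-9A-Fa-f]{1,2})*)\}")

# Hypothetical stricter variant: hex-pair = 2HEXDIG.
STRICT = re.compile(rb"\$\{hex:([0-9A-Fa-f]{2}(?:[ \t][0-9A-Fa-f]{2})*)\}")


def expand(text, pattern=DRAFT):
    """Replace each matching ${hex:...} sequence with the octets it names."""
    def to_octets(match):
        # Each item is one or two hex digits naming a single octet.
        return bytes(int(item, 16) for item in match.group(1).split())
    return pattern.sub(to_octets, text)


print(expand(b"a${hex:7e}b"))            # both variants give b'a~b'
print(expand(b"${hex:7 8 9}"))           # accepted by 1*2HEXDIG
print(expand(b"${hex:7 8 9}", STRICT))   # left untouched by 2HEXDIG
print(expand(b"${hex: 7 8 9}"))          # leading WSP is not in the grammar
```

Read literally, the grammar does accept single digits such as "7 8 9", while a space directly after the colon matches neither variant, so that form would be left in the string unchanged.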

The rest looks fine to me.