On Wed, 8 Nov 2006, Kjetil Torgrim Homme wrote:
> hmm, runtime isn't too appealing, but compile time would be nice. this
> actually depends on the order we process encode-character and variables.
> if we do variables first (so "${hex:${var}}" works), it must be runtime.
> if we do encoded-characters first (so "${v${hex:61}r}" works), it can be
> compile time.
Hmm, I thought it was settled that encoded-characters were to always be
expanded before variables, such that encoded-characters could be
implemented purely in an implementation's lexer, or even via
pre-substitution. This should be stated in the variables draft, to avoid
creating a normative reference from 3028bis to variables. Inserting that
shouldn't be a huge process issue.
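To make the ordering question concrete, here is a rough Python sketch of the encoded-characters-first order. The names expand_encoded and expand_variables, the dict of variable values, and the simplified variable-name pattern are all assumptions for illustration, not text from either draft. It works on str and maps each hex pair to a single code point, which a real implementation operating on octets would not do.

```python
import re

# "${hex:" hex-pair *(WSP hex-pair) "}" -- simplified, str-based for illustration
_HEX = re.compile(r"\$\{hex:([0-9a-f]{1,2}(?:[ \t][0-9a-f]{1,2})*)\}", re.IGNORECASE)
# Hypothetical, simplified variable reference: "${" identifier "}"
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_encoded(s: str) -> str:
    """Lexer-time pass: replace well-formed ${hex:...} sequences."""
    return _HEX.sub(
        lambda m: "".join(chr(int(pair, 16)) for pair in m.group(1).split()), s
    )


def expand_variables(s: str, env: dict) -> str:
    """Run-time pass: replace ${name} with its value.

    Assumption: unknown variables expand to the empty string.
    """
    return _VAR.sub(lambda m: env.get(m.group(1), ""), s)


def expand(s: str, env: dict) -> str:
    # Encoded characters first, variables second.
    return expand_variables(expand_encoded(s), env)
```

With this order, "${v${hex:61}r}" first becomes "${var}" and is then looked up as a variable, while "${hex:${var}}" is not a well-formed encoded-character sequence at lexing time, so its hex part is never decoded.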
> I don't have any strong feelings about it. it's good to catch as many
> errors as possible at upload time, but it may be better to leave the
> task of finding probable typos to a Sieve lint program.
Okay. Given that and the opinions expressed in the meeting, I propose we
go with the following text, mostly based on a previous diff Kjetil posted:
1) the following addition to section 2.1 "Form of the Language":
While this specification permits arbitrary octets to appear in
sieve scripts inside strings and comments, this has made it
difficult to robustly handle sieve scripts that are concerned with
the encoding of data. The "encoded-character" capability
(section 2.4.2.4) provides an alternative means of representing
such octets in strings using just US-ASCII characters. As such,
the use of non-UTF-8 text in scripts should be considered a
deprecated feature that may be abandoned.
2) the addition of section 2.4.2.4, as follows:
2.4.2.4. Encoding characters using "encoded-character"
When the "encoded-character" extension is in effect, certain
character sequences in strings are replaced by the unencoded value.
This happens after escape sequences are interpreted and dot-
unstuffing has been done. Implementations SHOULD support "encoded-
character".
Arbitrary octets can be embedded in strings by using the syntax
encoded-arb-octets. The sequence is replaced by the octets with the
hexadecimal values given by each hex-pair.
   encoded-arb-octets = "${hex:" hex-pair-seq "}"
   hex-pair-seq       = hex-pair *(WSP hex-pair)
   hex-pair           = 1*2HEXDIG
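As an illustration of the replacement described above, a minimal Python sketch follows. The function name expand_hex and the choice to operate on bytes are assumptions, not part of the proposed text. Only sequences that match the grammar are replaced; anything else is left untouched.

```python
import re

# encoded-arb-octets = "${hex:" hex-pair *(WSP hex-pair) "}"
# ABNF string literals are case-insensitive, hence re.IGNORECASE.
# WSP is a single space or horizontal tab.
_HEX_SEQ = re.compile(
    rb"\$\{hex:([0-9a-f]{1,2}(?:[ \t][0-9a-f]{1,2})*)\}", re.IGNORECASE
)


def expand_hex(s: bytes) -> bytes:
    """Replace each well-formed ${hex:...} sequence with the octets it names."""

    def _sub(m):
        # Each hex-pair is one or two hex digits naming a single octet.
        return bytes(int(pair, 16) for pair in m.group(1).split())

    return _HEX_SEQ.sub(_sub, s)
```

Under these assumptions, expand_hex(b"$${hex:24 24}") leaves the leading "$" alone and replaces the rest with two "$" octets.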
It may be inconvenient or undesirable to enter Unicode characters
verbatim, and in these cases the syntax encoded-unicode-char can be
used. The sequence is replaced by the UTF-8 encoding of the
specified Unicode characters, which are identified by the hexadecimal
value of unicode-hex.
   encoded-unicode-char = "${unicode:" unicode-hex-seq "}"
   unicode-hex-seq      = unicode-hex *(WSP unicode-hex)
   unicode-hex          = 1*6HEXDIG
It is an error for a script to use a hexadecimal value that isn't in
either the range 0 to D7FF or the range E000 to 10FFFF. (The range
D800 to DFFF is excluded as those character numbers are only used as
part of the UTF-16 encoding form and are not applicable to the UTF-8
encoding that the syntax here represents.)
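The range check and the UTF-8 replacement could be sketched in Python roughly as follows. The name expand_unicode and the use of ValueError to signal the error are assumptions; the proposed text only says that an out-of-range value is an error, not how an implementation reports it.

```python
import re

# encoded-unicode-char = "${unicode:" unicode-hex *(WSP unicode-hex) "}"
_UNI_SEQ = re.compile(
    rb"\$\{unicode:([0-9a-f]{1,6}(?:[ \t][0-9a-f]{1,6})*)\}", re.IGNORECASE
)


def expand_unicode(s: bytes) -> bytes:
    """Replace each ${unicode:...} sequence with the UTF-8 encoding it names."""

    def _sub(m):
        out = bytearray()
        for tok in m.group(1).split():
            cp = int(tok, 16)
            # Valid values: 0..D7FF and E000..10FFFF (surrogates excluded).
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError(f"invalid Unicode value U+{cp:04X}")
            out += chr(cp).encode("utf-8")
        return bytes(out)

    return _UNI_SEQ.sub(_sub, s)
```

Reporting the error when the script is parsed, rather than when it runs, fits the lexer-time processing discussed earlier in this message.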
The capability string for use with the require command is "encoded-
character".
In the following script, message A is discarded, since the specified
test string is equivalent to "$$$".
   Example:  require "encoded-character";
             if header :contains "Subject" "$${hex:24 24}" {
                 discard;
             }
3) the following addition to section 6.2.3 "Initial Capability Registrations":
Capability name: encoded-character
Description:     changes the interpretation of strings to allow
                 arbitrary octets and Unicode characters to be
                 represented using US-ASCII
RFC number:      this RFC (Sieve base spec)
Contact address: The Sieve discussion list
                 <ietf-mta-filters(_at_)imc(_dot_)org>
Philip Guenther