In article
<199903161516(_dot_)PAA19750(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk>,
Charles Lindsey
<chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk> writes
Nit-picking on the canonicalization algorithm.
1. The field-name (header-name) at the start of the header is
converted to lowercase.
2. If the header is unstructured, all instances of FWS are
replaced by a single SPACE; otherwise (the header is
structured and) all instances of FWS are omitted, except
within comments where they are replaced by a single SPACE (the
header has now been unfolded into a single line). Any
whitespace at the end of the header is removed, and it is
ensured that the header ends with a single CRLF.
In RFC 2234 the space character is named SP, not SPACE.
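To make steps 1 and 2 concrete, here is a minimal Python sketch of the unstructured-header case only. The function name and the FWS regular expression are my own assumptions, not part of the draft, and structured headers (where FWS is dropped outside comments) are deliberately not handled.

```python
import re

# Assumed reading of FWS: an optional line fold (WSP* CRLF) followed by
# at least one WSP character.
FWS = re.compile(r"(?:[ \t]*\r\n)?[ \t]+")

def canonicalize_unstructured(header: str) -> str:
    """Hypothetical helper: steps 1 and 2 for an unstructured header."""
    name, _, body = header.partition(":")
    body = FWS.sub(" ", body)          # every FWS becomes one SP
    body = body.rstrip(" \t\r\n")      # drop trailing whitespace
    return name.lower() + ":" + body + "\r\n"   # lowercase name, single CRLF
```

For example, a folded "Subject: Hello", CRLF, "  world" comes out as the single line "subject: Hello world" followed by CRLF.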
3. All instances of DQUOTE (ASCII '"') are removed, except when
they occur between properly matched pairs of "<" and ">"
(thus, in particular, they are not removed within msg-ids).
"<\">"@domain
Are these < > properly matched? Should this be canonicalized to
<\">@domain
or
<\>@domain
I can see what you are trying to do but I think that there may
be several pitfalls here.
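To illustrate the ambiguity, here is a rough Python sketch of two plausible readings of "properly matched". Both function names are hypothetical and neither is claimed to be what the draft intends. The first treats every "<" and ">" as a bracket; the second ignores brackets that occur inside a quoted-string and honours backslash escapes.

```python
def strip_dquotes_naive(header: str) -> str:
    """Reading A (assumed): every '<' ... '>' pair counts as a bracket."""
    out, depth = [], 0
    for ch in header:
        if ch == "<":
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
        if ch == '"' and depth == 0:
            continue                    # DQUOTE outside brackets is removed
        out.append(ch)
    return "".join(out)

def strip_dquotes_quote_aware(header: str) -> str:
    """Reading B (assumed): brackets inside a quoted-string do not count."""
    out, depth = [], 0
    in_quotes = escaped = False
    for ch in header:
        if in_quotes:
            if escaped:
                escaped = False         # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
        elif ch == "<":
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
        if ch == '"' and depth == 0:
            continue
        out.append(ch)
    return "".join(out)

example = '"<\\">"@domain'
print(strip_dquotes_naive(example))        # <\">@domain
print(strip_dquotes_quote_aware(example))  # <\>@domain
```

The two readings give exactly the two candidate results above, so the draft needs to say which one is meant.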
4. Any date-time occurring in a Date, Resent-Date or Expires
header (but not in any other header) is converted into the
number of seconds since the start of January 1st 1970 UTC,
expressed as a decimal number without leading zeroes.
As phrased, the number of seconds since the start of January 1st
1970 UTC includes leap seconds. But this gives software a
problem: how can it be written to cope with messages dated in the
future, when future leap seconds are as yet undecided?
Better to exclude leap seconds, in which case you might need a
note about what to do with a seconds value of 60.
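As a sketch of the leap-second-free alternative, the following Python uses calendar.timegm, which counts every day as exactly 86400 seconds. The function name, the offset_minutes parameter and, in particular, the choice to clamp a seconds value of 60 down to 59 are assumptions of mine; the draft could equally well choose to roll over into the next minute.

```python
import calendar

def canonical_seconds(year, month, day, hour, minute, second, offset_minutes=0):
    """Hypothetical helper: seconds since 1970-01-01 00:00:00 UTC, ignoring leap seconds."""
    # Assumption: a leap second (seconds == 60) is clamped to 59.
    second = min(second, 59)
    # timegm treats the tuple as UTC and knows nothing about leap seconds.
    local = calendar.timegm((year, month, day, hour, minute, second, 0, 0, 0))
    # Subtract the zone offset (minutes east of UTC) to get back to UTC.
    return str(local - offset_minutes * 60)

print(canonical_seconds(1998, 12, 31, 23, 59, 60))  # 915148799
```

Because no leap-second table is consulted, the same code gives a stable answer for dates arbitrarily far in the future.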
5. Any sequence of octets of length not more than 75 and not
including any SPACE (and hence presumed present in the same
line prior to Step 2), and which satisfies the syntax for an
encoded-word [RFC2047], and which is not enclosed between
properly matched pairs of "<" and ">" is replaced by the
sequence of octets obtained by decoding it. This is done
irrespective of whether that encoded-word was syntactically
allowed to be present at that position in the header according
to [RFC2047] or any extension thereof.
(\<)=?utf-8?q?=E2=82=A0?=(\>)
Are these < > properly matched? Should this be canonicalized to
(\<)₠(\>)
or not?
(For those without utf-8 it's a euro symbol).
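For anyone wanting to check the decoding itself, here is a small Python sketch using the standard email.header.decode_header function on the bare encoded-word. It deliberately says nothing about whether the surrounding (\<) and (\>) count as matched brackets, which is the open question.

```python
from email.header import decode_header
import unicodedata

word = "=?utf-8?q?=E2=82=A0?="

# decode_header returns a list of (bytes, charset) pairs for encoded-words.
raw, charset = decode_header(word)[0]
decoded = raw.decode(charset)

print(hex(ord(decoded)))          # 0x20a0
print(unicodedata.name(decoded))  # EURO-CURRENCY SIGN
```

Note that the octets E2 82 A0 are U+20A0, which Unicode names EURO-CURRENCY SIGN; that is a different code point from U+20AC EURO SIGN.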
Regards
--
Paul Overell T U R N P I K E Ltd