There is one thing which troubles me with the "BINARY" transport,
and either you gentlemen (I mean the group, not anybody personally)
don't consider it an issue, or have missed it:
It is an issue, but it has always been dealt with.
What happens when I want to process email with MIME-structures
in it (with "--ContentBoundaryString"s in it), and there is a
body-part with UNICODE 16-bit chars in it containing an explicit
16-bit CRLF: 000D 000A ?
The CRLF in front of the boundary marker is part of the marker itself. It is
not part of the preceding part, and as such is not subject to the conventions
that apply to the preceding part, whatever they may be.
It is incorrect to code a MIME CRLF sequence in any fashion other than 0D0A.
This is quite clearly specified, and it applies to CRLF in front of boundaries,
in message headers, and various other places as well.
Now how is my scanner supposed to recognize:
CRLF --ContentBoundaryString CRLF
so that it can continue processing on the other bodyparts.
(These are in "8-bit" US-ASCII byte sequences, after all..)
Are those boundary-related CRLF's to be always in 8-bit bytes ?
Yes. This is how the current specification defines them.
Isn't there some unicode encoded value 0D0A, which could cause
problems ?
Not unless your MIME object is illegal.
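
To make the point concrete, here is a minimal sketch of an octet-level scanner. It assumes the object is legal, meaning the boundary string never occurs inside any part. The name split_parts and the sample data are hypothetical, and the sketch ignores details such as whitespace after the boundary. The payload is UTF-16 (big-endian) text holding both a 16-bit CRLF (00 0D 00 0A) and the character U+0D0A, whose two octets are 0D 0A.

```python
def split_parts(body: bytes, boundary: bytes) -> list[bytes]:
    """Split a multipart body on the octets CR LF "--" boundary."""
    delimiter = b"\r\n--" + boundary
    # Prepend CRLF so a delimiter at the very start of the body also matches.
    chunks = (b"\r\n" + body).split(delimiter)
    parts = []
    for chunk in chunks[1:]:           # chunks[0] is the preamble
        if chunk.startswith(b"--"):    # closing delimiter "--boundary--"
            break
        # Drop the remainder of the delimiter line, up to and including CRLF.
        _, _, content = chunk.partition(b"\r\n")
        parts.append(content)
    return parts


boundary = b"ContentBoundaryString"
# 16-bit CRLF in the middle, U+0D0A (octets 0D 0A) at the end.
payload = "line1\r\nline2\u0d0a".encode("utf-16-be")
body = (
    b"--" + boundary + b"\r\n"
    + b"Content-Transfer-Encoding: binary\r\n\r\n"
    + payload
    + b"\r\n--" + boundary + b"--\r\n"
)

parts = split_parts(body, boundary)
headers, _, data = parts[0].partition(b"\r\n\r\n")
print(len(parts), data == payload)
```

The stray 0D 0A inside the payload is harmless because the scanner only reacts to the full sequence CRLF, two hyphens, and the boundary string.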
Before there was MIME, I lived in a "just-send-8" universe, and did (once)
a BINARY transmission of a couple of MB of TeX dvi-file. I was apparently
lucky, or else the UNIX->UNIX transport did encoding and decoding of
LF -> CRLF -> LF on line ends exactly symmetrically (likely it did).
That file did traverse thru, and after a cleanup (head & tail), it worked!
Sometimes it works. And sometimes it doesn't. Suppose I have a document that
contains CR, LF, LFCR, and CRLF sequences that all mean different things. (This
is quite possible in many binary formats.) Any sort of canonicalization of this
will destroy it.
I've received unencoded binary material through email hundreds of times because
people mistakenly sent something without encoding it first. I recall it being
recoverable about twice.
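
A small sketch of why this is so, assuming a text-style canonicalization that rewrites every CR, LF, or CRLF as CRLF. The function name canonicalize and the sample records are hypothetical. Two different binary inputs collapse to the same output, so the original cannot be recovered afterwards.

```python
import re


def canonicalize(data: bytes) -> bytes:
    """Rewrite every CR, LF, or CRLF as CRLF (assumed text canonicalization)."""
    return re.sub(rb"\r\n|\r|\n", b"\r\n", data)


# Two distinct inputs in which CR, LF, LFCR, and CRLF carry different meanings.
a = b"rec1\rrec2\nrec3\n\rrec4\r\nend"
b = b"rec1\nrec2\r\nrec3\r\rrec4\rend"

print(a != b, canonicalize(a) == canonicalize(b))
```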
Now with MIME "BINARY" I understand that sometimes the UNIX-LF terminated
lines can become rather long, though for state-machine scanners such
matters aren't a problem.. (Coding line-wise scanners is a lot simpler,
though..)
However what WILL BE a problem is the treatment of the binary UNICODE CRLF.
When UNIX sends such, it conventionally assumes that any LF is a valid place
to convert to CRLF on the SMTP output (+- dot-insert/-removal).
This is only true if the sending system is confused and tries to canonicalize
binary material for transport when it should not. There are only two ways to
get around this:
(1) Make the canonicalization/encoding process on your system much much
smarter, so it understands when to canonicalize and when not to. This
implies an understanding of MIME structure among other things.
(2) Always use the network canonical format so you don't have to convert
in transit.
If similarly implemented UNIX SMTP is the receiver, such data will likely
be converted back to the original, but when the receiving SMTP is on any
system with different end-of-line convention, it will get corrupted data!
Of course it will. This is what makes binary transport so tricky.
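
The failure can be sketched as follows, under simplified assumptions: the sender blindly turns every LF into CRLF, a like-minded receiver turns every CRLF back into LF, and a receiver on a CRLF-native system stores the stream untouched. All function names are hypothetical, and dot-insertion is left out.

```python
def unix_send(data: bytes) -> bytes:
    """Sender that treats every LF octet as a line end (assumed behavior)."""
    return data.replace(b"\n", b"\r\n")


def unix_receive(wire: bytes) -> bytes:
    """Receiver that applies the exact inverse conversion."""
    return wire.replace(b"\r\n", b"\n")


def crlf_native_receive(wire: bytes) -> bytes:
    """Receiver whose local line end is already CRLF, so it converts nothing."""
    return wire


# UTF-16 (big-endian) text ending in a 16-bit CRLF: 00 0D 00 0A.
original = "AB\r\n".encode("utf-16-be")
wire = unix_send(original)

print(unix_receive(wire) == original)          # symmetric pair restores it
print(crlf_native_receive(wire) == original)   # the other system does not
print(len(crlf_native_receive(wire)))          # odd length, no longer 16-bit units
```

The symmetric pair only works by accident of both ends making the same mistake; the stream on the wire is already not the sender's data.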
On a hypothetical folder format capable of storing binary material along
with its length-tag I would be glad to tag the part content as CTE: BINARY,
however while living in the non-perfect world with previous implementations
(*), I guess I have to store such material in BASE64, don't I ?
Yes you do.
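
As a rough illustration of why BASE64 is safe here, the sketch below uses Python's standard base64 module on hypothetical sample data. The encoded form is short lines drawn from a restricted alphabet, so a line-oriented folder format or an LF -> CRLF conversion in transit does not damage the content.

```python
import base64

# Arbitrary binary data, including every octet value and some 16-bit text.
data = bytes(range(256)) + "From here\r\n".encode("utf-16-be")

encoded = base64.encodebytes(data)

# Every encoded line is at most 76 characters and contains no space,
# so no line can begin with "From ".
lines = encoded.splitlines()
print(all(len(line) <= 76 and b" " not in line for line in lines))

# Simulate a transport that rewrites LF line ends as CRLF.
rewritten = encoded.replace(b"\n", b"\r\n")
print(base64.decodebytes(rewritten) == data)
```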
(*): The UNIX "mail folder"-format is an awful kludge, but it is one of
the things with which we have to cope...
Yes it is.
I hope you can provide an easy answer, like RFC NNNN chapter xx.yy.zz.aa ...
(Somehow I doubt it..)
This has been discussed endlessly on this list as well as many others and it is
fairly well understood what all the issues are.
Ned