If I could add my two cents worth to this discussion --
I am concerned about two things -- the efficiency of transmission of PEM and/or
PEM/MIME messages over high-speed modems, and the understanding of the semantic
content of complex objects that are signed and then validated in a different
environment than the one they were created in.
The new breed of modems that are available can achieve a substantial improvement
in the effective transmission rate through the use of compression, often 50 to
100kbps using a 28kbps modem. The popular disk storage compression routines
provide similar savings. However, the unintelligent use of encryption will
prevent any compression from occurring, and in fact a moderate degree of message
expansion may occur. As a result, encrypted PEM or PEM/MIME messages will incur
a substantial performance penalty in transmission and storage.
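To see the effect for yourself, feed a compressor something that looks like
ciphertext. (A minimal sketch in Python; os.urandom stands in for well-encrypted
output, which is statistically close to random noise.)

    import os
    import zlib

    # Ordinary, redundant text compresses very well.
    plaintext = b"The quick brown fox jumps over the lazy dog.\r\n" * 1000
    print(len(plaintext), len(zlib.compress(plaintext)))

    # Good ciphertext is indistinguishable from random bytes, so the
    # compressor finds nothing to squeeze and adds its own overhead --
    # the "compressed" result is typically slightly LARGER than the input.
    pseudo_ciphertext = os.urandom(len(plaintext))
    print(len(pseudo_ciphertext), len(zlib.compress(pseudo_ciphertext)))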
Therefore, the appropriate sequence of steps should be:
1. Canonicalize the text. (This isn't an entirely straightforward process,
even for simple RFC822 messages, as I will discuss below.)
2. Digitally sign the canonical text. Ideally, a WYSIWYG editor would display
the text in the canonical form for approval before the signature is applied.
3. Compress the text, using Lempel-Ziv or some other highly efficient
algorithm.
The order of steps 2 and 3 isn't especially important. You end up using both
orders in practice with MIME/PEM.
4. Encrypt the output, with 8 bits in and 8 bits out.
5. Expand from 8-bit to a 7-bit or other code as required for transmission.
(My preference would be to have the definition of the encapsulated object that
is being transmitted stop at the encryption boundary, and have the
mail/transmission system cope with the vagaries of 7-bit, 6-bit, or tom-tom
encoding. I hate to see a lowest common denominator perpetuated in an
encapsulated object definition.)
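For concreteness, the ordering I'm arguing for amounts to something like the
following. (A sketch only -- the sign and encrypt callables are placeholders,
not anything defined by the PEM or MIME/PEM specifications, and the
canonicalization shown is just the CRLF part.)

    import base64
    import zlib

    def prepare_for_transmission(text, sign, encrypt):
        # 1. Canonicalize -- shown here only as CRLF line endings, the easy part.
        canonical = text.replace("\r\n", "\n").replace("\n", "\r\n").encode("ascii")

        # 2. Sign the canonical form (placeholder callable).
        signature = sign(canonical)

        # 3. Compress BEFORE encrypting, while the data still has redundancy.
        compressed = zlib.compress(canonical)

        # 4. Encrypt, 8 bits in and 8 bits out (placeholder callable).
        ciphertext = encrypt(compressed)

        # 5. Expand to a 7-bit-safe form only at the very end, for the
        #    benefit of the transport.
        return signature, base64.b64encode(ciphertext)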
Every time that I have raised this issue, I have been told that it is no
problem, that MIME can handle compression. So now I'll ask it again -- DOES it,
within the current PEM/MIME spec, and if so, how?
MIME/PEM doesn't deal with this, nor is it MIME/PEM's job to do so. The issue
of compression is purely a MIME issue -- you want to be able to compress
message content regardless of whether or not you sign or encrypt it or both.
As such, MIME/PEM's *only* responsibility here is to make sure that use of
security services doesn't break the ability to compress things. And MIME/PEM
has in fact been designed so that it doesn't. (This is not much of a feat, in
that I don't see how MIME/PEM could have compromised this aspect of MIME even
had it wanted to.)
Now, as for MIME facilities for compression. First of all, you very breezily
say stuff like, "a highly efficient algorithm like Lempel-Ziv". There is no
such thing given the immense variety of types of data MIME deals with. One of
the conclusions we've come to as part of the MIME work is that there's just
no way to separate compression from the type of data you're working with.
This argues for including the compression in the data type, and this is in fact
what has been done. Types like image/tiff, video/mpeg, and
application/postscript include their own compression, and it often offers 10 to
100-fold improvement, whereas you'll actually get data size growth with
something like Lempel-Ziv on some of this stuff.
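If you want to see the growth for yourself, run any already-compressed body
through a general-purpose compressor. (A rough sketch; zlib output stands in
here for the internally compressed data of a JPEG or MPEG body.)

    import zlib

    # A first pass over redundant data pays off handsomely...
    original = b"Lempel-Ziv loves repetition. " * 2000
    first_pass = zlib.compress(original)

    # ...but a second general-purpose pass over already-compressed data
    # typically makes it slightly larger: there is no redundancy left to
    # exploit, only new framing overhead to add.
    second_pass = zlib.compress(first_pass)
    print(len(original), len(first_pass), len(second_pass))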
Unfortunately this work hasn't extended to text yet. There are several
problems in this area:
(1) Should it be done with a content type or with an encoding? I prefer using
an encoding since there are lots of "text-like" objects around. (Both options
are sketched just after this list.)
(2) Patent issues. This effectively means that gzip is probably the most
viable.
(3) Specifications. Nobody has ever written a precise, detailed description of
the algorithm suitable for publication as an RFC.
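To make the choice in (1) concrete, here is roughly what the two options would
look like for a compressed text body. (A sketch only; the "gzip"
content-transfer-encoding token is hypothetical -- that is exactly the piece of
specification work that hasn't been done -- and the content-type labels are
merely illustrative.)

    import base64
    import gzip
    from email.message import Message

    body = ("All work and no play makes Jack a dull boy.\r\n" * 500).encode("ascii")
    compressed = gzip.compress(body)

    # Option A: express the compression as a content type. The receiver sees
    # an opaque gzip'ed object and loses the fact that it is really text.
    as_type = Message()
    as_type["Content-Type"] = "application/gzip"          # illustrative label
    as_type["Content-Transfer-Encoding"] = "base64"
    as_type.set_payload(base64.b64encode(compressed).decode("ascii"))

    # Option B: keep the text/plain label and express the compression as an
    # encoding. "gzip" is NOT a registered content-transfer-encoding; it is
    # shown here only to illustrate what would have to be specified.
    as_encoding = Message()
    as_encoding["Content-Type"] = "text/plain; charset=us-ascii"
    as_encoding["Content-Transfer-Encoding"] = "gzip"     # hypothetical token
    as_encoding.set_payload(compressed)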
In short, all that's needed is a little more work. But nobody has done it,
nor is it likely to get done until some of us get some other tasks off of the
to-do list. (There's a little hint here...)
With regard to the canonicalization problem, there is much more to
canonicalization than just the ASCII/EBCDIC and CR/LF issues. If the message is
straight text, then I would really like to see all nonprintable /
printer-transparent characters eliminated. This would include, but is not
limited to, the elimination of any backspace characters, trailing blanks
before a CR, and any trailing blank lines before a page eject character. This
would greatly simplify the revalidation of a digital signature by re-scanning
the printed document, assuming a straight message format is used.
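Roughly the kind of normalization I have in mind is sketched below -- a wish
list only, covering plain US-ASCII text and nothing in any current
specification:

    def drop_backspaces(line):
        # Resolve backspaces by erasing the character that precedes each one.
        out = []
        for ch in line:
            if ch == "\b":
                if out:
                    out.pop()
            else:
                out.append(ch)
        return "".join(out)

    def printable_canonical_form(text):
        # Strip trailing blanks and tabs from each line, then drop trailing
        # blank lines (blank lines before a page eject are left out of this
        # sketch).
        lines = [drop_backspaces(line).rstrip(" \t") for line in text.split("\r\n")]
        while lines and lines[-1] == "":
            lines.pop()
        return "\r\n".join(lines) + "\r\n"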
Sorry, these are NOT canonicalization issues. Spaces at the ends of lines can
be significant, as can backspaces and all sorts of other stuff. Tabs and
spaces aren't equivalent either.
Once we get outside the straight RFC822 environment, then the issue of foreign
alphabets arises. Even within the Roman alphabets, how do we handle the
relatively simple case of umlauts, eszett, c-cedilla, n-tilde, etc.? In many
cases, the way that this is handled is by defining nonspacing characters which
PRECEDE the characters that they modify. Needless to say, this plays hob with
sorting and searching. Once you start addressing non-Roman alphabets, the
problem becomes much worse, with 16-bit codes having to be used in many cases.
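A small illustration of why this matters for a byte-for-byte signature -- using
Unicode here, where the combining mark happens to FOLLOW the base character
rather than precede it, but the problem is the same:

    import unicodedata

    # The same visible word, encoded two different ways.
    composed = "\u00fcber"      # LATIN SMALL LETTER U WITH DIAERESIS + "ber"
    decomposed = "u\u0308ber"   # "u" + COMBINING DIAERESIS + "ber"

    print(composed == decomposed)     # False: different code points, so a
                                      # signature over one won't verify the other
    # A canonicalization step has to pick one form before signing.
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True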
Here you are attempting to repeat work that many of us have devoted literally
thousands of hours of effort to in the past few years. Please study the
ietf-822 list archives and read the MIME specifications carefully before you
get into all this! It has been dealt with already!
More complex objects have their own special set of problems. For example, what
is the canonical form of a PostScript file?
See RFC1521.
Are the definitions of all of the
fonts supposed to be included? If so, Adobe and others may sue for copyright
violations of their fonts. If not, then how does the PostScript interpreter
know how to translate a given code into the corresponding glyph? Over the last
several years there has been a reduction in the variability of font encodings,
but a number of variations still remain, especially with the so-called expert
fonts. If you don't think this is important, consider substituting the yen sign
for the dollar sign in your next paycheck.
This is a non-issue for correct PostScript -- they are different characters.
Again, you are attempting to repeat work that has already been done.
As if the problem of fonts weren't enough, what about the header files that can
be downloaded in advance? Windows uses one type, the Mac uses another. And
remember that the PostScript language is a quite powerful PROGRAMMING LANGUAGE,
and in many cases it has access to your screen, the hard disk in your computer,
and certainly the memory and hard disk (if any) of your printer. (I haven't yet
seen a PostScript virus, but I expect to see one any day.) From the standpoint
of nonrepudiation (and why else would we be using a digital signature?), if the
complete header and all of the fonts are not included in the PostScript file,
the results are indeterminate. Surely the minimum that canonicalization should
do is assure that the results are completely determinate, but it beats me how
to accomplish this in general. Maybe we should transform them into Acrobat
before signing them.
Most of this isn't canonicalization.
I'm not familiar with the internal workings of JPEG, MPEG, and GIF files, nor
with the various sound files, especially MIDI files that are being used these
days. But I suspect that those formats bear some close similarities to
PostScript files, and may have the same set of problems.
You are extending the concept of canonicalization much too far here.
Assuming that the purpose of canonicalization is to ensure that the same
results, i.e., the same semantic content, will be implied by a signature across
various platforms, then I think we have to at least stop and think a bit about
what the semantic content of a complex MIME object really IS, and what we are
implying when we sign it.
We have stopped and thought about this, in considerable detail, for several
years now. I cannot help the fact that it wasn't done on this list so you
could see it, however.
Suppose that I sign a JPEG-encoded photo. What does that mean? Is it a picture
of me? Did I take the picture? Is the picture a faithful representation of some
real-world object? All anyone really knows is that the encoded photo hasn't been
modified since I ran it through my fingers (metaphorically). Of course, if I
add some explanatory text and bind it to the object by signing the complex
object, that will help; but now, presumably, we have to canonicalize the
complex object as a whole.
In summary, I am very concerned that we understand the implications of signing
a bucket of bits. I'm confident that the PEM/MIME spec does a reasonably good
job of describing the syntax of these complex objects. I have much less
confidence that we have a good handle on the semantics.
Now you're dragging out the old issue of what a signature means. Please stop!
Ned