Carl,
You wrote:
> This one point I contest. The fastest hardware you can do (all on one VLSI
> chip) will show the 3 loop case to be 3x faster than the 1 loop case.
> There is no way to design it the other way without artificially slowing
> down the individual DES operations in the 3-loop case, beyond what each
> does in the 1-loop case.

With respect to keeping three DES processors busy while using the one
loop algorithm, why not simply demultiplex the input into three or four
parallel streams? (Three seems like the obvious factor, but four
might be better. I'll continue in terms of four.)
The plaintext is a sequence of 64-bit words, D[1], ..., D[n]. Demultiplex
the stream into four streams:
Da[i] = D[4*i-3]
Db[i] = D[4*i-2]
Dc[i] = D[4*i-1]
Dd[i] = D[4*i]
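
As a minimal sketch of the demultiplexing step above, the following Python splits a list of blocks into four streams and interleaves them back. The function names demultiplex and multiplex are hypothetical, chosen only for illustration, and the blocks are represented as labels rather than real 64-bit words.

```python
def demultiplex(blocks, k=4):
    """Split D[1..n] into k streams; stream j (0-based) gets D[j+1], D[j+1+k], ..."""
    return [blocks[j::k] for j in range(k)]


def multiplex(streams):
    """Inverse of demultiplex: interleave the streams back into one sequence."""
    out = []
    longest = max((len(s) for s in streams), default=0)
    for i in range(longest):
        for s in streams:
            if i < len(s):  # later streams may be one block shorter
                out.append(s[i])
    return out


# Label the blocks D[1]..D[8] so the index mapping is visible.
D = [f"D[{i}]" for i in range(1, 9)]
Da, Db, Dc, Dd = demultiplex(D)
print(Da)  # ['D[1]', 'D[5]'], i.e. Da[i] = D[4*i-3]
print(Dd)  # ['D[4]', 'D[8]'], i.e. Dd[i] = D[4*i]
```

The receiver would apply multiplex after decrypting each stream to restore the original order.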
Use EDE-CBC on each stream. This makes it possible to keep three
processors, P1 through P3, busy full time after the first couple of
words. The schedule looks like this:

                    Processor
   Time step   P1(E)    P2(D)    P3(E)
       1       Da[1]
       2       Db[1]    Da[1]
       3       Dc[1]    Db[1]    Da[1]
       4       Dd[1]    Dc[1]    Db[1]
       5       Da[2]    Dd[1]    Dc[1]
      etc.

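
The schedule above can be reproduced with a short simulation. This is only a sketch: it assumes each DES operation takes exactly one time step and that blocks enter P1 in round-robin order across the streams. The function name schedule is hypothetical.

```python
def schedule(steps, k=4, stages=3):
    """Return, for each time step, the block occupying each pipeline stage.

    Blocks enter stage 0 (P1) in round-robin order over k streams and
    advance one stage per time step. An empty string marks an idle stage.
    """
    names = "abcdefgh"[:k]
    rows = []
    for t in range(1, steps + 1):
        row = []
        for s in range(stages):
            n = t - s  # arrival number of the block in stage s at time t
            if n >= 1:
                stream = names[(n - 1) % k]
                i = (n - 1) // k + 1
                row.append(f"D{stream}[{i}]")
            else:
                row.append("")
        rows.append(row)
    return rows


print("step  P1(E)  P2(D)  P3(E)")
for t, row in enumerate(schedule(5), start=1):
    print(f"{t:>4}  " + "  ".join(f"{cell:<5}" for cell in row))
```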
I chose four streams instead of three in case some time is needed for
the chaining, but I don't feel strongly about this.
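
To make the chaining dependency concrete: in CBC, Da[2] cannot enter P1 until the ciphertext of Da[1] has left P3. With four streams, Da[1] leaves P3 at step 3 and Da[2] enters P1 at step 5, so one step is spare. With three streams Da[2] would enter at step 4, with no spare step. The sketch below shows EDE-CBC applied per stream. Because the Python standard library has no DES, toy_encrypt and toy_decrypt are placeholder 64-bit permutations with no security value. The per-stream IVs and the key values are likewise assumptions made for illustration.

```python
MASK = (1 << 64) - 1  # work on 64-bit words, like a DES block


def toy_encrypt(key, block):
    """Placeholder for DES encryption: an invertible 64-bit mixing step. NOT secure."""
    x = (block ^ key) & MASK
    x = ((x << 13) | (x >> 51)) & MASK  # rotate left by 13
    return (x + key) & MASK


def toy_decrypt(key, block):
    """Inverse of toy_encrypt."""
    x = (block - key) & MASK
    x = ((x >> 13) | (x << 51)) & MASK  # rotate right by 13
    return x ^ key


def ede(keys, block):
    """Encrypt with k1, decrypt with k2, encrypt with k3."""
    k1, k2, k3 = keys
    return toy_encrypt(k3, toy_decrypt(k2, toy_encrypt(k1, block)))


def ede_inverse(keys, block):
    k1, k2, k3 = keys
    return toy_decrypt(k1, toy_encrypt(k2, toy_decrypt(k3, block)))


def ede_cbc_encrypt(keys, iv, blocks):
    """One-loop (outer) CBC: chain once around the whole EDE operation."""
    out, prev = [], iv
    for p in blocks:
        prev = ede(keys, p ^ prev)  # needs the previous ciphertext of this stream
        out.append(prev)
    return out


def ede_cbc_decrypt(keys, iv, blocks):
    out, prev = [], iv
    for c in blocks:
        out.append(ede_inverse(keys, c) ^ prev)
        prev = c
    return out


# Hypothetical keys (two-key variant, k3 == k1) and one IV per stream.
keys = (0x0123456789ABCDEF, 0xFEDCBA9876543210, 0x0123456789ABCDEF)
ivs = [1, 2, 3, 4]
plaintext = list(range(100, 112))  # twelve 64-bit words
streams = [plaintext[j::4] for j in range(4)]
cipher_streams = [ede_cbc_encrypt(keys, iv, s) for iv, s in zip(ivs, streams)]
recovered = [ede_cbc_decrypt(keys, iv, c) for iv, c in zip(ivs, cipher_streams)]
print(recovered == streams)
```

Each stream is chained independently, which is what lets the three processors work on different streams at the same time.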
Steve Kent pointed out that schemes similar to this have been used in the
past to keep up with the data speeds of satellite connections. He
also pointed out that the average padding will increase from four
bytes to twelve bytes. Other than the additional padding, I don't see
anything about this scheme which is inferior to the existing
proposals. As best I can tell, this scheme should be as fast as the
three loop algorithm and also fits onto existing one loop hardware.
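
The padding figures can be sanity-checked with a little arithmetic. This sketch rests on two assumptions not stated above: message lengths are uniformly distributed, and the message is padded to a multiple of 8 bytes times the number of streams so that every stream carries the same number of blocks. Under that reading, the twelve-byte figure corresponds to three streams, and four streams would come out somewhat higher.

```python
def average_padding(unit_bytes):
    """Mean pad length when the length modulo unit_bytes is uniform
    and the pad fills up to the next multiple of unit_bytes (0 if aligned)."""
    pads = [(-length) % unit_bytes for length in range(unit_bytes)]
    return sum(pads) / unit_bytes


BLOCK = 8  # DES block size in bytes
print(average_padding(BLOCK))      # single stream: 3.5
print(average_padding(3 * BLOCK))  # three streams: 11.5
print(average_padding(4 * BLOCK))  # four streams: 15.5
```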
Analysis of the crypto properties of one loop versus three loops is a
separate matter. I'm willing to accept that EDE in ECB mode has been
adequately analyzed and documented to be roughly 112 bits strong. (I
do want to see the literature and have it cited and summarized in any
document we produce.) It's not immediately obvious to me that this
result extrapolates without further ado to the CBC mode, so I'd like
to see some documentation of this point. If one loop EDE-CBC doesn't
hold up under this analysis, then speed or convenience won't matter.
Steve