You're right -- we were making completely different base assumptions.
I was assuming that with hardware DES most of the delay is in the
application software, the operating system and the machine's bus; that is,
the bottleneck is in getting the data to the DES chip in the first place
rather than inside the chip.
I agree that if the speed of the silicon is the bottleneck, 3 feedback
loops can be 3 times faster. In situations where we're really pushing the
silicon (Asynchronous Transfer Mode link encryptors come immediately to mind),
then you have a persuasive argument for 3 feedback loops.
I would still claim that for the sort of machines people are running PEM
on (e.g. Un*x boxes and PCs), it isn't going to make a lot of difference.
However, I may be wrong; some real measurements would be useful here!
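
To make the two sets of assumptions concrete, here is a rough toy model
of the steady-state time per block. It is only a sketch: the function
names and the parameters t_des (time for one DES operation in the chip)
and t_io (time to get one block to the chip) are hypothetical, the numbers
are made up rather than measured, and it assumes three DES engines that
can be perfectly pipelined with I/O fully overlapped.

```python
def per_block_time(t_des, t_io, feedback_loops):
    """Steady-state time per block for triple DES (toy model).

    t_des: assumed time for one DES operation inside the chip
    t_io:  assumed time to deliver one block to the chip
    feedback_loops: 1 (single outer loop) or 3 (one loop per DES stage)
    """
    if feedback_loops == 3:
        # Each stage feeds back only its own output, so the three
        # stages can work on different blocks at once (pipelined).
        crypto = t_des
    elif feedback_loops == 1:
        # The next block cannot start until the previous block has
        # passed through all three DES stages.
        crypto = 3 * t_des
    else:
        raise ValueError("feedback_loops must be 1 or 3")
    # Whichever is slower, the chip or the data path, sets the pace.
    return max(crypto, t_io)


def speedup(t_des, t_io):
    """How much faster 3 feedback loops are than 1 in this model."""
    return per_block_time(t_des, t_io, 1) / per_block_time(t_des, t_io, 3)


# Silicon-bound (e.g. a fast link encryptor): the full factor of 3.
print(speedup(t_des=1.0, t_io=0.1))   # 3.0
# I/O-bound (software, OS and bus dominate): no difference.
print(speedup(t_des=1.0, t_io=10.0))  # 1.0
```

In this model the answer depends entirely on the ratio of t_io to t_des,
which is exactly the number that real measurements would have to supply.
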
Mike