
Re: [openpgp] [Cfrg] streamable AEAD construct for stored data?

2015-11-06 19:47:09
To return to this thread - DKG brought up one important potential functionality 
goal for the next OpenPGP message format (streaming-mode integrity protection); 
then the thread diverged into a different and I think orthogonal - though 
equally interesting - potential functionality goal (namely random-access 
capability via Merkle trees as in Tahoe-LAFS).

I included a slide on this topic in my OpenPGP WG presentation 
(https://drive.google.com/file/d/0BwK1bcoczINtMEI2Y3A1Rm52SXc/view?usp=sharing, 
slide 10) and was hoping to solicit discussion, but there wasn’t time, so 
perhaps we can continue here?

To be clear, there are two separate use-cases, each of which makes sense 
without the other and requires a different technical solution (but they could 
also make sense together):

1. Streaming-mode integrity protection:  We want to make sure OpenPGP can be 
used Unix filter-style on both encryption and decryption sides, to process 
arbitrarily large files (e.g., huge backup tarballs), while satisfying the 
following joint requirements:

        (a) Ensure that neither the encryptor nor decryptor ever has to buffer 
the entire stream in memory or any other intermediate storage.
        (b) Ensure that the decryptor integrity-checks everything it decrypts 
BEFORE passing it on to the next pipeline stage (e.g., un-tar).

2. Random-access: Once a potentially-huge OpenPGP-encrypted file has been 
written to some random-access-capable medium, allow a reader to decrypt and 
integrity-check parts of that encrypted file without (re-)processing the whole 
thing: i.e., support integrity-protected random-access reads.

Let’s call these goals #1 and #2, respectively.

Achieving either goal will require dividing encrypted files into chunks of some 
kind, but the exact metadata those chunks need to carry will vary depending on 
which of these goals (or both) we want to achieve.

To achieve goal #1 properly, it appears that we need not only a MAC per chunk 
but a signature per chunk.  If the encryptor only signs a single aggregate MAC 
at the end, then the decryptor must process its input all the way to that 
trailing signature before it can be certain that any of the decrypted data 
(even the first byte) is valid.  If the encryptor instead produces a Merkle 
tree at the end and signs its root as in Tahoe-LAFS (e.g., in pursuit of goal 
#2), the decryptor still needs to read to the end of its input before being 
able to integrity-check anything, and hence still fails to achieve goal #1.
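
To make requirement (b) concrete, here is a minimal sketch of what per-chunk 
signing could look like on the decryption side.  None of this is proposed 
OpenPGP syntax: the framing (4-byte big-endian ciphertext length, ciphertext, 
64-byte Ed25519 signature over chunk-index || ciphertext), the per-chunk 
AES-GCM, and the use of Python’s `cryptography` package are all illustrative 
assumptions.

```python
import struct

from cryptography.exceptions import InvalidSignature, InvalidTag
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def stream_decrypt(src, dst, signer_pub: Ed25519PublicKey, key: bytes) -> None:
    """Decrypt a (hypothetical) framed chunk stream, releasing only
    plaintext that has already been signature- and AEAD-verified."""
    aead = AESGCM(key)
    index = 0
    while True:
        header = src.read(4)
        if not header:                      # end of stream
            break
        (ct_len,) = struct.unpack(">I", header)
        ciphertext = src.read(ct_len)
        signature = src.read(64)
        # Bind the chunk's position into both the signature and the
        # nonce, so chunks cannot be reordered or replayed undetected.
        nonce = index.to_bytes(12, "big")
        try:
            signer_pub.verify(signature, index.to_bytes(8, "big") + ciphertext)
            plaintext = aead.decrypt(nonce, ciphertext, None)
        except (InvalidSignature, InvalidTag):
            raise ValueError(f"chunk {index} failed integrity check") from None
        dst.write(plaintext)                # requirement (b): verified first
        index += 1
    # A real format would also need an explicit final-chunk marker:
    # otherwise truncation at a chunk boundary looks like a clean EOF.
```

Note that buffering never exceeds one chunk, so requirement (a) is satisfied 
on the decryption side as well.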

To achieve goal #2, the Merkle tree approach used in Tahoe-LAFS seems 
well-suited.  But it assumes the decryptor has a complete, randomly seekable 
copy of the encrypted file sitting somewhere.  That is certainly a reasonable 
assumption in the context of a file system like Tahoe-LAFS, but it violates 
requirement (a) of goal #1, and in particular may not hold when OpenPGP is 
used to filter-decrypt from unseekable media or a network stream into an 
un-tar type process.
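
For a sense of what the reader’s side of goal #2 involves: given a chunk, its 
index, and the log2(N) sibling hashes along its path, the reader recomputes 
the root and compares it against the signed root.  A minimal sketch follows; 
the SHA-256 choice and the leaf/node domain separation are my own assumptions, 
not Tahoe-LAFS’s actual encoding.

```python
import hashlib


def leaf_hash(chunk: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + chunk).digest()    # domain-separated leaf


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def verify_chunk(chunk: bytes, index: int, siblings: list[bytes],
                 signed_root: bytes) -> bool:
    """siblings[i] is the sibling of the chunk's ancestor at level i."""
    h = leaf_hash(chunk)
    for sibling in siblings:
        # The low bit of the index says whether we are the left or
        # right child at this level of the tree.
        h = node_hash(h, sibling) if index % 2 == 0 else node_hash(sibling, h)
        index //= 2
    return h == signed_root
```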

We could very well design an OpenPGP format that addresses both goals together, 
if we decide both goals are valuable.  The encryptor would need to build its 
Merkle tree incrementally - without initially knowing its total size - and, 
after encrypting and outputting each chunk, generate and sign a new Merkle 
tree root.  We wouldn’t necessarily have to store an entirely new Merkle tree 
after each chunk, only a log2(N)-size list of hashes comprising the “update 
path” from the new chunk back to the root, reusing the un-updated parts of the 
Merkle tree already stored in earlier parts of the encrypted file.  (But this 
requires that earlier Merkle trees be “incrementally updatable”, which might 
violate the properties of some of the Merkle tree proposals suggested in other 
threads.)
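
One way to realize this “update path” idea is a binary-counter-style 
append-only tree, as in Certificate Transparency: the encryptor keeps only the 
peaks of complete subtrees (O(log N) state), and each append emits just the 
newly created hashes.  A sketch under the same hashing assumptions as the 
previous snippet; this is one possible construction, not a settled design.

```python
import hashlib


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


class IncrementalMerkle:
    """Append-only Merkle tree holding only O(log N) subtree peaks."""

    def __init__(self) -> None:
        self.peaks: list[bytes | None] = []    # peaks[i] covers 2**i leaves

    def append(self, leaf: bytes) -> list[bytes]:
        """Add one leaf hash; return the newly created hashes (the
        chunk's update path), at most log2(N)+1 of them."""
        emitted = [leaf]
        h, level = leaf, 0
        # Merge equal-sized complete subtrees, like binary addition.
        while level < len(self.peaks) and self.peaks[level] is not None:
            h = node_hash(self.peaks[level], h)
            self.peaks[level] = None
            emitted.append(h)
            level += 1
        if level == len(self.peaks):
            self.peaks.append(None)
        self.peaks[level] = h
        return emitted

    def root(self) -> bytes | None:
        """Fold the current peaks into a root to sign after each chunk;
        earlier peaks are reused, never recomputed."""
        acc = None
        for peak in self.peaks:                # low level to high
            if peak is not None:
                acc = peak if acc is None else node_hash(peak, acc)
        return acc
```

The idea would be that the encryptor signs root() after each chunk: a 
streaming decryptor maintains the same structure and checks each successive 
signed root (goal #1), while a random-access reader uses the stored hashes 
(goal #2).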

There are some obvious tradeoffs here, both in storage and complexity costs.  
I’m not that worried about the storage efficiency costs, because even the costs 
of a log2(N)-size list of hashes per chunk can be amortized fairly effectively 
by making the chunks moderately large, e.g., 65KB or even 1MB.  There’s also 
the compute cost of performing a public-key signature rather than just a MAC 
per chunk, but that can be similarly amortized by using relatively large 
chunks.  On the other hand, large chunks might not be ideal for 
embedded/constrained devices.  And the implementation complexity is certainly 
an issue regardless.
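
Some back-of-the-envelope numbers on that amortization claim, assuming 32-byte 
hashes and one 64-byte signature per chunk (both assumptions; the real sizes 
depend on the algorithms chosen):

```python
import math


def per_chunk_overhead(file_size: int, chunk_size: int) -> float:
    """Fraction of each chunk spent on update-path hashes + signature."""
    n = math.ceil(file_size / chunk_size)       # number of chunks
    path = math.ceil(math.log2(n)) + 1          # update-path hashes, incl. leaf
    return (path * 32 + 64) / chunk_size


print(f"{per_chunk_overhead(2**30, 2**20):.3%}")   # 1 GiB, 1 MiB chunks: ~0.040%
print(f"{per_chunk_overhead(2**30, 2**16):.3%}")   # 1 GiB, 64 KiB chunks: ~0.830%
```

So even at 64 KiB chunks the storage overhead stays below 1% under these 
assumptions.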

So some questions about this:

1. How important is the ability to achieve goal #1 above in the OpenPGP format 
(streaming-mode integrity-checking)?

2. How important is the ability to achieve goal #2 above in the OpenPGP format 
(random-access integrity-checking)? 

3. For whichever goal(s) we wish to be able to achieve, should those be 
*mandatory* or *optional* in the format?  That is, should *every* 
OpenPGPv5-encrypted file satisfy either or both of these goals, or should they 
be configurable or user-selectable (such that some encrypted files might 
contain per-chunk signatures and/or Merkle trees while others do not)?  Making 
either of these goals “supported but optional” might help mitigate the 
performance/storage cost concerns, but it would further increase the 
complexity of the overall OpenPGP spec and raise the “usability risk” of a 
user accidentally failing to enable a relevant option when they really should 
have (e.g., streaming-mode protection for backups).

4. What are reasonable upper- and lower-bounds for chunk sizes, and what are 
the considerations behind them?  

Thanks
Bryan


