[openpgp] Algorithm-specific data: problems with Simple Octet Strings, a

Hi OpenPGP folks--

I spent a good part of today trying to write a patch for including
Simple Octet Strings in the crypto-refresh draft, pursuant to gniibe's
presentation at IETF 110 and his public memo at:

   https://www.gniibe.org/memo/standard/openpgp/ecc-in-openpgp-sos.html

This message doesn't include the patch i wanted to make because I think
i ran into a problem with it.  I'm trying to explain the problem here,
and how my thinking has shifted in the course of trying to write the
thing down.  Please correct me where I've erred.

IIUC, SOS is basically intended as a genericization of MPI; it
represents its length in a two-vector scalar as (8*octet-count) instead
of as (bit-count).  It "fits" where MPI fits, but doesn't require
leading-zero removal, or require interpretation as an integer at all.

The problem comes with the different length representations, in
particular for existing ECC mechanisms.

Because existing ECC points are currently represented with a leading
prefix byte (0x04 for NIST curves, 0x40 for 25519), and that byte has
the leading bit cleared, the MPI version of the length scalar is
*different* on the wire from the SOS version of the length scalar.

As a concrete example, a NIST p256 pubkey looks today like this (using
visualization from "sq packet dump --hex"):

```
Public-Key Packet, old CTB, 2 header bytes + 82 bytes
    Version: 4
    Creation time: 2021-03-24 22:06:14 UTC
    Pk algo: ECDSA public key algorithm
    Pk size: 256 bits
    Fingerprint: D4732B93D4E82D068EE197A63A73ED20925926A6
    KeyID: 3A73ED20925926A6
  
   00000000  98                                                 CTB
   00000001     52                                              length
   00000002        04                                           version
   00000003           60 5b b7 d6                               creation_time
   00000007                       13                            pk_algo
   00000008                           08                        curve_len
   00000009                              2a 86 48 ce 3d 03 01   curve
   00000010  07
   00000011     02 03                                           ecdsa_public_len
   00000013           04 29 75 bb 53  55 49 31 80 ab 03 54 15   ecdsa_public
   00000020  a0 6c 29 10 a8 85 b7 2f  e5 93 81 90 48 08 b8 0e
   00000030  c7 02 55 aa 51 b2 b7 6a  00 7a a1 2b cf 25 a7 e7
   00000040  9c 95 b1 fa 79 87 c2 cb  94 53 a8 da c2 00 58 45
   00000050  79 a0 96 8e
```

but if it was represented in SOS form, we'd see this change:

```text/x-diff
    00000010  07
-   00000011     02 03                                           
ecdsa_public_len
+   00000011     02 08                                           
ecdsa_public_len
    00000013           04 29 75 bb 53  55 49 31 80 ab 03 54 15   ecdsa_public
```

The result is that the fingerprint would also change (to
C9EAB3BE6B20CAEA568BE93F9BA3D29CBB873842), since the bytestream that
feeds into the fingerprint calculation changes.  This means that we can
represent a single EC public key (with the exact same creation date) in
at least two different ways just by varying whether the public point is
seen as an SOS or an MPI.

(note: this concern is not necessarily just for ECC; it's also relevant
for traditional RSA, DSA, and ElGamal key MPI components that might have
the highest set bit somewhere other than just below an octet boundary.
For example, if we were to "retcon" RSA's e component as an SOS instead
of an MPI, the most common value for e (65537) could change on the wire
from [00 11 01 00 01] to [00 18 11 00 01], resulting in a similar change
to the fingerprint)

This leaves me wondering about the utility of the SOS abstraction.  I
can imagine a few distinct options:

 a) Introduce SOS and apply it to existing ECC pubkeys.  This is what I
    understand to be gniibe's proposal.  It looks like this means an
    implementation will have to cope with multiple distinct wire
    representations and fingerprints of the exact same public key.

 b) Introduce the SOS abstraction, but only use it for *new* pubkey
    algorithms.  All existing ECC (including NIST and 25519) would still
    be this weird pseudo-MPI, representing the count based on the
    highest set bit, not 8*octet-count, and requiring clearing of
    leading zeros.

 c) Introduce a variant of SOS that indicates count based on highest set
    bit (not 8*octet-count) and use that for everything, but without
    requiring removal of leading all-zero octets.

 d) Not introduce the SOS abstraction at all, but instead explicitly say
    that new algorithms can define their own fixed algorithm-specific
    bytestream for public key material, secret key material, and
    signatures -- it does not need to be MPIs (gniibe's memo calls this
    approach "Each data format definition by each curve").

 e) No SOS abstraction, and just keep on doing the kind of pseudo-MPI
    we've been doing for any new curves (gniibe's memo calls this
    "Practically easiest").

 f) Introduce a new distinct data type that uses an actual octet count
    for each bytestring (gniibe's memo calls this "Non-strange but Just
    an Octet String")

Of these choices, I'm leaning toward (d) but i'd love to hear what other
people think.

(a) Seems problematic because of the multiple acceptable representations
    outlined above.  It's already pretty tricky that the fingerprint of
    a piece of given public key material can vary depending on the
    creation timestamp.  Some implementations might try to normalize an
    SOS into a "compliant" MPI, thereby affecting their ability to
    calculate the correct fingerprint.  I don't think we currently have
    any tests that demonstrate interoperability in the face of this kind
    of confusion.

(b) Seems sort of pointless because it doesn't let us clean up any of
    the existing mess.  If we're going to keep all the existing mess, we
    might as well go with (d).

(c) Introduces a new problem: When encoding an SOS string, how many
    leading zero-octets should be used?  If the answer is
    "implementation gets to decide" then we share the
    multiple-representations problem with (a); And if the answer is
    "each algorithm has a mandated size" then we might as well go with
    (d).  Also, when consuming such a string, an implementation would
    need to first scan through all the leading zero-octets before it
    could use the length count to jump to the next field.  That's a new
    piece of ugliness for OpenPGP parsing.

(e) Bothers me because it means that each new algorithm introduced has
    to declare not only its data formats, but how to transform those
    data formats into the semblance of an OpenPGP "MPI".  This just
    seems like extra work and new ways for both implementers and
    specifiers to get things wrong.

(f) Seems excessive: why declare a standard for encoding lengths of
    fields in those cases where the lengths are pre-determined by the
    algorithm in question (i expect that to be most cases for new
    algorithms).  For specific algorithms which have variable length
    parameters, *those algorithms* can declare how they want to delimit
    the fields.  If the goal is to be able to skip over
    algorithm-specific details of an algorithm that we don't know, what
    are we skipping *to*?  For public key packets and signature packets,
    the only thing that comes next is the end of the packet, whose size
    we already know.  For secret key packets, we *might* try to skip
    from the public key material to the secret key material, but again,
    if we don't know how many MPIs (or SOSes or JOSes) to skip for
    public keys we might not even be able to tell when the s2k usage
    octet starts, so we can't do that today anyway.

I will probably try to draw up a concrete patch for (d) unless other
folks in the WG think it would be useful for consideration.  The patch
will likely ask IANA to add new columns to the "public key algorithm"
registry (e.g. "public key representation", "secret key representation"
and "signature representation") and to formally include an Elliptic
Curves registry with similar columns, so that we have a compact way to
view these distinctions, and a place for new proposals to include
updates.

If any of my analysis above is wrong, please help me understand it
better!

      --dkg

signature.asc
Description: PGP signature

_______________________________________________
openpgp mailing list
openpgp(_at_)ietf(_dot_)org
https://www.ietf.org/mailman/listinfo/openpgp

[openpgp] Algorithm-specific data: problems with Simple Octet Strings, and possible alternatives