
Re: Last Call: <draft-ietf-sidr-rpki-rtr-19.txt> (The RPKI/Router Protocol) to Proposed Standard

2012-01-28 16:12:04
At Wed, 21 Dec 2011 17:43:23 -0800, Terry Manderson wrote:

Apologies for my lack of attention to date on this topic, so speaking only
for myself here.

Similar apologies for not having answered this more promptly.  Somehow
we missed seeing this until our AD asked us about it.

Please see draft-ietf-sidr-rpki-rtr-25, just posted, which we hope
addresses most of your concerns (there are a few points on which I
think we're just going to have to agree to disagree).

Starting with the document structure, I see no reference to a set of
requirements. The introduction is rather vague, and if anywhere that is
where I would expect to see such a requirements description. This means for
the rest of document I found myself asking "why" on many levels.

Motivation was discussed at some length in the SIDR WG.  We hadn't
thought it necessary to discuss this in the draft, but -25 adds a bit
of text on this.

In brief, though, for ietf@ietf.org readers who weren't tracking the
WG: the primary goal of this protocol was to make it possible to run
BGP origin authentication based on RPKI data on currently shipping
router hardware, rather than having to wait for bigger processors,
crypto accelerators, etcetera.  So we wanted a simple protocol that
would let us do all the heavy lifting (RPKI data collection,
certificate checking, computation of deltas from previous data, etc)
somewhere off the router, and feed the router only what it needs.

When I got to the end of the document I felt that the protocol borders on a
wheel re-invention exercise. When you think about a router simply being a
client to a cache that is providing RIB access tokens for a route using a
mechanism that is a secure, stable, scalable, known (by both vendors and
operators), and is extensible, I'm more likely to swing to RADIUS in doing
such a service with nicely structured AV-Pairs and sane timers for
reauth/retry etc. Even the SME's know radius for their WPA enterprise kit.

RADIUS doesn't have a bulk transfer operation, and bulk transfer of
data is the main task of this protocol, particularly at start-up.

You are certainly entitled to your opinion, but it comes a bit late.
This work was done in the public view, with regular progress reports
to the SIDR WG, and we have multiple interoperable implementations,
including ones from several of the major router vendors.  So, with all due
respect, I don't think the folks who have put work into this will be
all that interested in abandoning running code at this point.

Glossary:

Global RPKI: 
I disagree with this definition for two reasons. 1) I'm not aware of a
unified definition for 'distributed system' so this is all rather vague.

The term has been used to describe DNS for decades.  Also see:

  http://en.wikipedia.org/wiki/Distributed_computing

Perhaps you could say 'published at a disparate set of systems'.

I don't find that any clearer.  Readers who can't understand the words
"distributed set" aren't likely to understand "disparate set" either.

2) Limiting
the servers to be "at" the "IANA, RIRs, NIRs, and ISPs" is also premature.
It's not clear to me that these entities will run their own repositories,
nor are they going to be the only repository operators in the lifecycle of
the RPKI.

This is essentially the same list as appears in section 1.1 of
draft-ietf-sidr-arch, with the term "LIR" replaced by "ISP".

I suppose we could add "or other service providers".

Cache:
The words surrounding the fetch/refresh mechanisms of the RPKI is limiting.
Both draft-ietf-sidr-repos-struct and draft-ietf-sidr-res-certs allow for
other (future) retrieval mechanisms as defined by the repository operator
beyond RSYNC (loosely documented in RFC5781).

Terry, you've made it quite clear that you disagree with the SIDR WG's
decision to make rsync the mandatory-to-implement RPKI retrieval
protocol, but you lost that argument a long time ago, and I fail to
see the point of bringing it up here yet again.

Last sentence. "Trusting this cache further is a matter between the provider
of the cache and a relying party". In my mind the Relying Party was the one
that did the RPKI validation - would this not be better stated as "Trusting
this cache further is a matter between the provider of the cache and the
router operator".

If a router is making decisions based on data given to it by a server,
the router is the relying party in that relationship.  That the server
in question was itself the relying party in another relationship does
not change this.

The picture here is not all that different from the way that some
vendors have chosen to implement DNSSEC.  It's a two-tier security
relationship: an end-to-end relationship between the publisher of
signed objects and the validator of those signed objects, then a
separate security relationship between the entity that validated the
signed objects and the end entity that actually uses the data.

Deployment Structure:

Why repeat the definition of "Global RPKI"? It's superfluous.

Because it's not a definition?

I agree that the text here is similar to the definition, but this
section is trying to describe the roles in the system.

Local Cache: Again. 'Relying party' seems to be borrowed from the
CA/identity world. Unless you redefine that term here it seems as if the
"router" is making RPKI validation decisions. Which it is not. The router is
acting more like a NAS (See Radius, 2865) when talking to a local cache.

The definition of "routers" seems to get this right - eg "a client of the
cache".

See above.  "Relying party" is a security relationship term, not just
a PKI term.

Operational Overview

when you first use "ROA", please expand the TLA, and provide a reference.

Done.

Serial Query

I don't remember seeing a recommendation for how often a client (router)
sends a serial query. Is there a Min/Max? Surely doing it every second would
be excessive..

Maximum is covered in section 6.2: the router must send a Serial or
Reset Query no less frequently than once per hour.

Minimum is a good question.  We had been assuming that, as this is an
in-POP relationship with cache and router operated by the same party,
there would likely be a knob in the router (router guys live for
knobs) and setting it would be a matter of local policy.  If you want
your router to beat up your cache server every minute, who am I to
stop you?

We needed to set a maximum because that affects the architecture of
the cache (how long does it need to hold onto old data -- given the
potential size of the data sets involved, one might implement the
cache very differently if one needed to hold old data for a week
rather than an hour).
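
As a rough illustration of the timing constraint described above, here is a
minimal Python sketch of how a router-side polling knob might be clamped.
The names MAX_QUERY_INTERVAL_SECONDS and effective_poll_interval are
hypothetical and do not come from the draft; the only figure taken from the
text is the one-hour ceiling from section 6.2. The lower bound is left to
local policy, as discussed above.

```python
# Hypothetical sketch: clamp an operator-configured polling interval so the
# router still sends a Serial or Reset Query at least once per hour.
MAX_QUERY_INTERVAL_SECONDS = 3600  # "no less frequently than once per hour"


def effective_poll_interval(configured_seconds):
    """Return the interval (in seconds) the router would actually use.

    No minimum is enforced: how hard a router polls its own cache is
    treated as local policy. Only nonsensical values are rejected.
    """
    if configured_seconds <= 0:
        raise ValueError("polling interval must be positive")
    return min(configured_seconds, MAX_QUERY_INTERVAL_SECONDS)
```

With this sketch, a knob set to 60 seconds is honored as-is, while a knob
set to two hours is reduced to the one-hour ceiling.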

IPv4 Prefix:

"and nothing prohibits the existence of two identical
   route: or route6: objects in the IRR."

Why even mention the IRR here? It just doesn't seem at all relevant. (and
isn't defined)

Good catch.  Done.

" IPvX PDUs" expand to IPv4 or IPv6. Globing into one is a misdirection
under a heading of 'IPv4 Prefix'

IPv6 Prefix

Some text here to say that the IPv6 data structure follows the same
semantics as the IPv4 data structure would be good.. or alternatively
restructure the document to Semantics, then describe the IPv4 and IPv6 data
structures as subheadings to Prefix PDUs.

Done.

Error Report

What is "excessive length" of a PDU? at what point do you say "o.k, now I
can truncate".

Too long to be any valid PDU other than an Error Report.  Done.

Fields of a PDU

For all types, instead of using "ordinal" can you use the exact description
of the number? eg unsigned integer? For me I always relate ordinals to set
theory.

Done.

PDU type, the e,g is incomplete shouldn't it be "IPv4 Prefix = 4" with a
forward reference to the IANA Considerations section?

I think this is a matter of stylistic preference.

Serial Number. "for example via rcynic", Is not defined and implementation
specific!

Please read the words "for example".

I suppose we could add a reference, but the last time we did that
somebody objected to having a reference pointing to the source code
for a particular implementation.

and there is a typo "completing an rigorously validated"..while
there, consider why you use the term 'rigorously'..

Sigh.  Next time, please be explicit about the typo you're seeing; our
eyes repeatedly bounced off the "an" here until after we'd posted
version -25.  It's not worth yet another rev just to fix that.

are there situations when a validation is less rigorous? If so
explain.

I suspect that my co-author was trying to say that one can't just
retrieve the data, pull the ASNs and prefixes out of the ROAs, and
feed them into the router, one has to do the RPKI validation first.

I guess we can remove the word if it offends you, but it seems
harmless.

Session ID

What is the risk of a cache server starting/restarting with the same session
ID and serial number as before, but with different cache contents? Is this
an entropy concern? Just thinking of a potential scenario where a router is
cache-wedged. Is this at all probable? and why not - some words here to
cover this would be good.

We added several paragraphs on exactly this topic sometime around IETF
Last Call, I suspect the version you reviewed did not have that text.
I think we've addressed this point, please check the current text and
let us know if there's a further issue here.

Flags

Can you reword the binary choice here? Do you actually need to delve into
'right to announce'? This is really about RIB entry behaviors yeah?

The semantics here are closely related to ROAs, which, as you no doubt
recall, are Route Origin Authorizations, so the text here follows that
model.

With all due respect, I do not think that a discussion of RIB entry
behavior here would be simpler.

Expand "IPvX".

Done.

Start or Restart:

I think the terms in when a router needs to send a serial query or a reset
query need to be tighter. Saying MAY here is too loose. I would much prefer
to see a structure where if the router does not have a recorded serial for a
cache from a previous session, the router MUST send a reset query. Logically
you assume that to be the case, so be specific.

I think this is a stylistic matter again.  The router MAY do two
things here, one of which is only applicable if it has data from a
previous broken session.

The only real difference I see here between the current formulation
and the MUST formulation you prefer is that, as currently written, the
router could choose not to send anything at all initially; this option
doesn't seem particularly useful, so I don't mind removing it, but
neither do I see the difference between the current text and your
suggested change as a big deal.
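
To make the start-up choice discussed here concrete, the following is a
minimal Python sketch of the stricter reading (no saved state means a Reset
Query). It is only one way a router might decide; the function name
choose_initial_query, the dictionary layout, and the idea of persisting a
(session_id, serial) pair are all assumptions made for illustration, not
text from the draft. A router holding saved state could equally well choose
a Reset Query.

```python
# Hypothetical sketch of the router's first query on (re)connecting to a cache.
def choose_initial_query(saved_state):
    """Pick the first query to send to a cache.

    saved_state is None when the router has nothing from a previous
    session, otherwise a (session_id, serial) tuple it kept around.
    """
    if saved_state is None:
        # Nothing to resume from: ask for the full data set.
        return {"type": "reset_query"}
    session_id, serial = saved_state
    # Data from a previous session exists: ask only for changes since then.
    return {"type": "serial_query", "session_id": session_id, "serial": serial}
```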

Thereafter the router MAY send a reset query, and SHOULD send a serial
query. I suspect this is what the vendors (who have chimed in on the list)
have coded.

This then corroborates section 4 where you suggest the router only send
serial queries for efficiency.

Section 6.2 already says that the typical exchange is for the cache to
send a Serial Notify, in the expectation that the router will schedule
an immediate Serial Query.  We didn't make it any stronger than that
because the folks implementing the router side of this expressed
concern at the notion that the cache could tell them to do something
(read: they understand that the notification mechanism will help speed
convergence, but they're worried that the dinky CPUs they're stuck
with in some of the relevant hardware will be swamped if they try too
hard, which is why routers are allowed to ignore notifications and
caches are rate-limited in sending them).

Transport:

MiTM is Man in the middle as I and many others know it. 'Monkey/piggy/pickle
in the middle' is a child's ball game.

Monkey-in-the-middle is a common non-sexist variant of this term.
Welcome to the 21st century.

" Therefore, as of this document, there is no mandatory to
   implement transport which provides authentication and integrity
   protection."

if this is the case.. then why? what is the gain?

OK, this is the elephant in the living room.

The basic problem is that the implementers and the IETF live on
different planets.  As discussed in section 7, it is pretty much
impossible to find any channel security technology which is
implemented on conventional servers (Linux, BSD, ...), is implemented
on routers, and is acceptable to the IETF security folks.

As further discussed in Section 7, the long term plan is TCP-AO, and
there are people out there implementing that now, but it's going to be
a few years before that's usable.  In the meantime, we're stuck with
ad-hoc pairings of what particular platforms support.  Some routers
support SSH clients, some don't.  Some server platforms support
TCP-MD5, some don't, the IETF doesn't like TCP-MD5, and at least one
that sort-of-supports TCP-MD5 only supports it for incoming
connections.  Some routers support IPsec transit but can't terminate
it.  And so forth.  It's a horrible mess.

So, as discussed at some length in the SIDR WG, after talking both to
people who knew the current router and server platforms and to the
SIDR WG's security advisor, we came up with the compromise you see in
the draft: the path forward is TCP-AO, but since we don't have that
yet, there's this raft of other channel security mechanisms one is
allowed to use for now.  We expect to deprecate everything but TCP-AO
once TCP-AO is readily available.

Nobody is happy with this, but it's the least bad compromise we could
find between what the IETF would prefer and reality in the field.

why not then make the router fetch the signed objects and do the
validation internal - this again seems to be the 'missing
requirements' problem.

See "currently shipping router hardware", above.

SSH Transport

State up front that you MUST use SSHv2. (instead hinting in the third
paragraph)

Done.

TLS Transport
"Man in The Middle (MiTM)" please.

Above.

Router Cache setup

"When a more preferred cache becomes available, if resources allow, it
   would be prudent for the client to start fetching from that cache."

How does the client (I assume router) know when to do this as cache's are
not synchronized?? How does a router tell if any particular cache has more
current data over another cache? what if two caches contradict each other?

The document repeatedly states that the router has an ordered
preference list of the caches it uses.  The text you quote here
doesn't say "has more current data", it says "becomes available", ie,
it stops rejecting connection attempts, signalling errors, or
otherwise failing to be useful.
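
The selection rule described here depends only on preference order and
availability, not on data freshness. A minimal Python sketch, assuming a
hypothetical pick_cache helper and a caller-supplied is_available predicate
(neither is defined by the draft):

```python
# Hypothetical sketch: choose the most preferred cache that is currently usable.
def pick_cache(preference_order, is_available):
    """Return the first available cache in preference order, or None.

    preference_order: cache identifiers, most preferred first.
    is_available: callable returning True if a cache currently accepts
    connections and is not signalling errors.
    """
    for cache in preference_order:
        if is_available(cache):
            return cache
    return None


# Example: the most preferred cache is down, so the second one is used.
up = {"cache-a": False, "cache-b": True, "cache-c": True}
chosen = pick_cache(["cache-a", "cache-b", "cache-c"], lambda c: up[c])
```

When cache-a later becomes available again, re-running the same selection
would return it, which matches the "more preferred cache becomes available"
wording quoted above.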

Error codes

6: Withdrawal of Unknown Record (fatal), why drop the session? (which
presumably causes a restart) to a cache, assuming the cache is corrupt,
which will then send another Unknown Record, which is fatal... (repeat)??

Why not mark the cache as corrupt at the client?

This is one of several loss-of-synchronization problems.  The
assumption is that the router may have (somehow) lost synchronization
with the cache.  We don't really know which party is confused at this
point, all we know is that the session itself is no longer useful
because the router and cache are not communicating clearly.  So the
router's data isn't necessarily corrupt.

The router won't necessarily restart with this cache right away
either.  It has several options: it might try another cache, it might
switch to another set of data it has already loaded, or it might try a
reset query to this cache.

Security Considerations:

Transport Security. There are multiple valid options for a root trust anchor
including the structure from the IAB aligning it to the IANA. Perhaps
instead of saying " the IANA root trust anchor" say "Global RPKI root trust
anchor". Otherwise you might accidently find your validated cache only
covers unallocated and reserved blocks.

I think you're saying that using the term IANA here is politically
incorrect.


Thanks for the review!