RE: [GROW] [Idr] draft-ietf-grow-ops-reqs-for-bgp-error-handling-05

Rob Shakir wrote (on Fri 31-Aug-2012 at 19:00 +0100):
...

Thanks for this detailed analysis. It is akin to something that
Alton Lo and I worked through whilst defining the critical and
semantic error types (and suggested inclusions).

If you'll forgive me for responding to some particular points, I
feel this might aid the discussion and positioning. I have added
comments in-line marked [rjs].

On 31 Aug 2012, at 09:02, Chris Hall wrote:

This is all pretty low level stuff.  I can hear an argument that
the requirements document is not the place for this level of
detail.  However, without a more precise understanding of how
broken attributes may be parsed, requirements for how to deal
with them are hard to specify and to interpret.

[rjs]: What this draft intends to do is provide expectations,
requirements and context for error handling in BGP-4, based on
current deployments (and operators' experience). It also puts
forward requirements for how each type of error is reacted to in a
broad sense. Essentially, where it came from is defining why
amending the error handling behaviour is required, and providing a
framework against which we can hang the different developments that
are being discussed in IDR, such that they meet the operational
challenges that come from amending this behaviour and form a
complete set of solutions to meet the problem space.

[rjs]: I think the error handling solutions draft (draft-ietf-idr-
error-handling) should take the work that we have done within IDR
and GROW in this draft and build the next level of detail, on which I
think you've made a great start. I would like to try and
keep the requirements draft such that it can be referred to by both
existing attributes, and future ones.


I agree that the Requirements want to be at as high a level as
possible.  Requirements based on experience are good, too.

The issue I have is that the draft seems to go into too much detail in
some areas, and not enough in others.

Looking at Section 2.1.1: at a high level, a Critical error is one
for which tearing down the session is unavoidable.  That is deemed to
be when the receiver cannot be sure that they have extracted from a
broken UPDATE all the NLRI to which it refers (yes ?).  If you don't
have the NLRI, you are stuck.  If you have the NLRI, it may be
possible to contain the error so that it affects only those NLRI (for
example, by "treat-as-withdraw").  If you have the NLRI and enough
(however defined) valid (however defined) attributes, it may be
possible to update the NLRI, perhaps partially or temporarily.
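
As a rough sketch of that three-way split (Python; the names Action
and choose_action are illustrative and appear nowhere in the draft,
and the two boolean inputs stand in for the "however defined" tests
above):

```python
from enum import Enum


class Action(Enum):
    RESET_SESSION = "tear down the session"          # Critical error
    TREAT_AS_WITHDRAW = "treat-as-withdraw"          # contained to the NLRI
    UPDATE_PARTIAL = "update with usable attributes"


def choose_action(nlri_extracted: bool, attributes_sufficient: bool) -> Action:
    """Pick the narrowest response the extracted information allows."""
    if not nlri_extracted:
        # Without the NLRI the error cannot be contained at all.
        return Action.RESET_SESSION
    if not attributes_sufficient:
        # NLRI known, attributes not trustworthy: invalidate those NLRI only.
        return Action.TREAT_AS_WITHDRAW
    # NLRI known and enough valid attributes: proceed, perhaps partially.
    return Action.UPDATE_PARTIAL
```

How the two inputs are computed is exactly the part that needs
defining; the sketch only shows that the NLRI test comes first.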

[Actually, I'm not sure that is complete.  As observed elsewhere in
the draft, a given BGP session may be carrying a number of AFI/SAFI
and possibly a number of separate VPNs.  So for some purposes perhaps
it's not necessary to be able to identify all the NLRI to which the
UPDATE applies, only the AFI/SAFI or VPN to which the UPDATE applies.
If so, then perhaps there is a Requirement for Critical errors to be
handled on a per AFI/SAFI and per VPN basis -- that is, a
"semi-Critical" error which tears down a self-contained part of the
session.  I'm not sure whether the definition of a Critical BGP Error
allows for this or not, depending on how one interprets "Errors
Parsing the NLRI".] 

Staying with the high level requirements, if an error is to not be
Critical, it appears one needs:

  (a) to be able to extract the NLRI (or AFI/SAFI etc ?) with 
      some degree of certainty.

  (b) to have ways of dealing with that NLRI in ways that do not
      affect other NLRI learned in the session unnecessarily, and
      which do not cause unacceptable side effects.

  (c) to be able to extract some attributes with some degree of
      certainty, and be able to judge when proceeding to process
      an UPDATE with an incomplete or damaged set of attributes
      will yield sufficiently valid routes.

  (d) to have mechanisms to signal the problem so that the root
      cause(s) can be addressed and possibly to trigger other
      (e.g. operational) responses.

for which there could usefully be some discussion of "degree of
certainty", "affect...unnecessarily", "unacceptable side effects",
"sufficiently valid" and so on -- from a routing information and an
operational perspective.  For example, if "treat as withdraw" is
performed, but the receiver has not (for whatever reason) been able to
extract all the NLRI sent, the receiver is left with some stale
(possibly invalid) routes; that may be acceptable because the
alternative (tearing down the session) is worse, or because other
(operational/protocol) mechanisms will kick in to clean up... and so
on.

The draft appears to say that anything which is not a Critical Error
is a Semantic one -- or vice versa.  This appears to assume that in
the parsing of attributes, an error in one attribute does not affect
any other attribute -- in particular, that an error in a not-NLRI
attribute does not affect the ability to reliably (enough ?) extract
the NLRI.  To support that, I think the document would need to go down
into the nuts and bolts of the parsing mechanics.  (Section 2.1.2
starts with "Where a BGP message is correctly formed"... I assume that
means that the Message, Withdrawn Routes and Total Path Attribute
Lengths are consistent, and the Marker is 'all-ones' ?) 
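
To show what I take "correctly formed" to mean, here is a minimal
sketch in Python.  It assumes the RFC 4271 layout (19 octet header,
UPDATE type code 2, message length between 23 and 4096 octets); the
function name update_is_well_framed is made up for illustration, and
the draft may intend something narrower or wider than this check.

```python
import struct

MARKER = b"\xff" * 16   # Marker field must be all ones
UPDATE = 2              # message type code for UPDATE


def update_is_well_framed(msg: bytes) -> bool:
    """Check Marker and the three length fields of one UPDATE message."""
    if len(msg) < 19 or msg[:16] != MARKER:
        return False
    length, msg_type = struct.unpack("!HB", msg[16:19])
    if msg_type != UPDATE or length != len(msg) or not 23 <= length <= 4096:
        return False
    # Withdrawn Routes Length sits immediately after the header.
    (withdrawn_len,) = struct.unpack("!H", msg[19:21])
    attr_len_at = 21 + withdrawn_len
    if attr_len_at + 2 > length:
        return False
    # Total Path Attribute Length follows the withdrawn routes.
    (attr_len,) = struct.unpack("!H", msg[attr_len_at:attr_len_at + 2])
    # Whatever remains after the path attributes is the NLRI field.
    return attr_len_at + 2 + attr_len <= length
```

Note that this says nothing about the contents of the Path Attributes
or NLRI; it only establishes where each part starts and ends.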

This is what I mean by both too much and too little detail.  The high
level requirement is to be able to extract NLRI (etc); whatever the
issues in doing so are, they are perhaps at too great a level of
detail for this document.  On the other hand, the discussion of
Semantic Errors does not go into sufficient detail to support the
requirements which flow from (the apparently assumed) ability to parse
attributes separately.

Where an UPDATE does not contain a Critical error, the receiver has
(by definition) the NLRI (which it believes it has received correctly)
and perhaps some Attributes.  What the receiver then does may depend
on its confidence in what it has managed to extract from the broken
message.  All of that can be left as an exercise for implementers.
The requirements should focus on the implications from a
routing/operational perspective and offer some criteria for acceptable
behaviour.  The current standard requires (for safety) that any error
invalidates everything learned in the Session.  One step from there is
that some errors only invalidate all the NLRI referred to in the
erroneous UPDATE message -- which (for safety) discards all attributes
in the message.  A further step is that some errors do not invalidate
the NLRI in the erroneous UPDATE message, but processing proceeds with
some subset of the attributes.  For my money, that further step is a
giant step, and deserves to be covered at the Requirements level.
[Another step is that some errors invalidate everything learned in the
Session about a given AFI/SAFI or (possibly) VPN.]

...

[rjs]: Please note that the requirements draft does not present
distinctions such as recoverable and ignorable. We went around this
loop previously. I think that in some cases, some specific errors
may be handled by 'patching' or 'ignoring' specific errors. But
generically, these are exceptions - the requirements try and define
broader categories; if a particular attribute needs something else
(e.g., AS4_PATH may have information it can recover from other
attributes) then this can be handled in error handling solution
considerations of these attributes or as it is defined going
forward.


I'm sorry to have missed the previous discussion.  Suffice it to say
that, as above, I think that Ignoring or Recovering (patching up) some
errors is materially different to examining each attribute carefully
and dumping the whole lot on the floor if any one is invalid.

But this touches on the incompleteness (IMHO) of the classification.
For me, one can consider a semantic error in an attribute only after
establishing that it is (in my terms) correctly "framed".  Once one
has parsed a set of attributes, and concluded there is no reason to
believe that some invalid attribute length has thrown the parser off
track, then one can get into what to do with the contents of each
attribute.

I think the difficulty is (repeating myself, sorry) exactly
illustrated by the question of what to do with an ATOMIC_AGGREGATE
attribute which is apparently 421 octets long.

At a requirements level, you may not wish to get into the detail of
this.  But as it stands, the draft classifies pretty much every way in
which an attribute can be broken as a Semantic error.  This seems to
me to miss the important fact that an attribute is only an attribute
once the Path Attributes part of the message has been parsed
satisfactorily -- up to that point, the entire Path Attributes are a
pretty random looking collection of octets.
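
A small sketch of the "framing" step I have in mind, in Python.  It
assumes the RFC 4271 attribute encoding (flags octet, type code octet,
then a one or two octet length selected by the Extended Length flag
bit 0x10); walk_attributes is a hypothetical helper, not anything
specified in the draft.

```python
EXTENDED_LENGTH = 0x10   # flag bit: attribute length is two octets


def walk_attributes(path_attrs: bytes):
    """Split Path Attributes into (flags, type_code, value) tuples.

    Returns (attributes, framing_ok).  framing_ok is False as soon as
    an attribute header or value would run past the end of the field.
    """
    attrs, pos = [], 0
    while pos < len(path_attrs):
        if pos + 2 > len(path_attrs):
            return attrs, False
        flags, type_code = path_attrs[pos], path_attrs[pos + 1]
        pos += 2
        len_size = 2 if flags & EXTENDED_LENGTH else 1
        if pos + len_size > len(path_attrs):
            return attrs, False
        length = int.from_bytes(path_attrs[pos:pos + len_size], "big")
        pos += len_size
        if pos + length > len(path_attrs):
            return attrs, False
        attrs.append((flags, type_code, path_attrs[pos:pos + length]))
        pos += length
    return attrs, True
```

Feed it an ATOMIC_AGGREGATE (type code 6) claiming 421 octets, e.g.
the header bytes 0x50 0x06 0x01 0xA5.  If fewer than 421 octets
follow, the walk fails.  If 421 or more follow, the walk "succeeds"
and whatever attributes really came next are swallowed as the value.
Only a later check of the attribute's contents can notice, and by
then the parser cannot say which of the type, flags or length was
wrong.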

...

[rjs]: The requirement the document makes is explicitly that not all
errors are defined as critical (if they were, the requirement
specified by section 3 would not be met, and we would stick with the
behaviour we have right now). The reason for a distinction between
critical and semantic is that there are certain errors that
cannot be localised to certain NLRI.


OK.  Sure.  We have Critical and Not-Critical errors.  Not-Critical
errors are those for which we can extract the NLRI.

A key reason for not being able to extract the NLRI is encountering an
error when parsing the Attributes.  Some errors may suggest that the
sender has gone barking mad, and it is not possible to say whether
there are NLRI there to extract or not.  Other errors may be less
alarming.  The given definition of Semantic errors does not
distinguish.

As above, the requirements could step back from the parsing issue, and
specify only the need to reliably extract NLRI.  And, if it is a
requirement to proceed to process (as opposed to just invalidate) NLRI
from a broken message, the requirements should specify the need to
reliably extract a good enough subset of attributes to proceed with.

At the very least, I suggest that Critical (severity of error) and
Semantic (form of error) are orthogonal notions.

[rjs]: I hope you do not see these comments as dismissive of what
you have put together - I think that this is where operational and
implementation views diverge. My view is that I need to understand
what the impact to a service, the device and the network is during
these error conditions (and balance the risk of incorrectness
against the correctness of the protocol). From an implementation
perspective, clearly, one needs to understand exactly in which
circumstances one can extract the NLRI, and the particulars of how
this is achieved. I would encourage discussion that falls into the
latter category such that we define the solutions draft to have the
relevant guidance where required. Comments on the former should
absolutely live in the requirements draft.


Sure.  I am trying to make the case that where the Requirements touch
on the implementation issues, they are going too deep.  And, in lumping
all kinds of errors together and deeming them to be Semantic errors,
the issues there are obscured rather than brought into focus.

If the Requirements were written without reference to the internal
organisation of the BGP UPDATE message, that would be fine.

As you say, what really matters is the operational impact of changes
to the protocol which may include, inter alia:

  1. some routes will be treated as "good" while others from the
     same source have been deemed invalid.

     This is the effect of, for example, "treat-as-withdraw".

     How much confidence can one have in the "good" routes, if the
     peer is sending a mixture of apparently valid and invalid
     stuff ?

     If a peer who sent a bunch of valid routes last week now sends a
     number of invalid ones, what do we think about the ones which
     remain "good" ?

     Should there be a mechanism to de-preference remaining "good"
     routes ?

     Is a response that has this effect required to be just the first
     step in a longer process, in which the cause of the error is
     dealt with ?  In which case, can more risks be taken when
     selecting such a response ?

  2. some routes will be treated as "good" which should be treated as
     invalid.

     This is the effect of not treating an error as Critical, but not
     identifying all affected NLRI.

     If this is not acceptable, then an implementation must take some
     care when deciding whether it has extracted all NLRI from a
     broken message.

     If there are degrees of acceptability, then an implementation
     would need to take a view... presumably based on some
     understanding of the likely operational impact ?

     Or there should be configuration knobs to twiddle ?

  3. some routes will be treated as valid which would previously have
     been treated as invalid.

     If the rules for validating attributes are changed, then some
     routes might be accepted with a variety of issues with their
     attributes.

     If processing proceeds with a partial set of attributes, routing
     may be affected.  If this is not acceptable, then an
     implementation must take great care here.

     While the question of which attributes and under what conditions
     this might happen is clearly an implementation issue, the
     acceptability of the result must be judged by its operational
     impact.

Also important from an operational perspective may be how any new
features to support better error handling are deployed.  Clearly new
code is involved.  But if new capabilities and new code at both ends
of an eBGP conversation are required, is that an issue ?  Should new
behaviour in BGP be required to be enabled by configuration, or
enabled by default but with suitable override configuration options ?

I am in violent agreement with you.  The high level requirements are
essentially operational ones.

My suggestion is that the requirements would be improved by backing
out of the discussion of the internal structure of BGP UPDATE
messages.  On the other hand, classifying errors in terms of what
information is lost/preserved would improve things.  Such analysis
might lead to requirements which can only be met by changes at the
protocol/implementation level, and if so, more power to it, say I.

[Those changes might be: (a) a greater separation of NLRI from
Attributes, (b) more redundancy in the framing of attributes so that
the parser can have greater confidence that it has identified all
attributes sent, (c) a means to identify AFI/SAFI or VPN independent
of the NLRI, even ?, (d) etc... in addition to the various other
features mentioned in the draft for tearing down parts of a session,
recovering parts of the RIB, signalling errors, monitoring etc. etc.]

Thanks,

Chris

PS:  section 2 refers to "analysis of incidents".  Is there a
collection somewhere one could take into consideration when making
implementation decisions ?  Is there evidence, for example, that well
known attributes with invalid flags and/or lengths have been a problem
in practice ?