FW: [Idr] draft-ietf-grow-ops-reqs-for-bgp-error-handling-05

John G. Scudder wrote (on Fri 31-Aug-2012 at 17:24 +0100):

By the way, since this is now in *IETF* last call, if you want your
comments to be considered you should send them to ietf(_at_)ietf(_dot_)org.
Feel free to cc IDR and GROW if you like.


OK.  Message previously posted to idr(_at_)ietf(_dot_)org follows:
----------------------------------------------------------------------

In trying to classify Errors with BGP-4 UPDATE Messages I think it
would be useful to distinguish between the form of an error and the
severity of that error and how BGP should respond.

It seems to me that there are four severities/responses:

  1) "Critical Error" -> drop/restart session or AFI/SAFI

     So the overall response then depends on how gracefully the
     session drop and/or restart can be handled.

  2) "Serious Error" -> do something with NLRI, short of
     dropping the session.

     The "treat-as-withdraw" mechanism is mentioned.

     The requirements obviously do not wish to specify mechanisms.

     But I think that the requirements should address what outcome
     is expected if errors in an individual UPDATE message are to
     be limited to that message.  I think what that means is:

       * it must be possible to identify all NLRI that the message
         could be carrying.

       * whatever is done with those NLRI must reflect the fact
         that the recipient has an incomplete, possibly empty,
         set of attributes for those NLRI.

  3) "Ignorable Error" -> process the UPDATE message as if the
     ignored attributes had never existed.

     Some errors in some trivial attributes may be ignorable.
     The requirements could cover the criteria for being deemed
     trivial.

     Some errors in Optional Transitive attributes may be dealt
     with by ignoring the attribute altogether.  The requirements
     mention this, but do not specify criteria for being
     ignorable.

  4) "Recoverable Error" -> process the UPDATE message which has
     had errors "patched up".

     The draft-ietf-idr-error-handling, for example, suggests
     that invalid Attribute Flags may simply be overwritten
     by the expected value.
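
The four severities above can be sketched as a small lookup.  This is
only an illustration, not anything taken from the draft: the names
Severity, RESPONSE, overall_severity and response_for are made up, and
the ranking (Recoverable < Ignorable < Serious < Critical, with the
worst error in a message deciding the response) is an assumption.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Numeric order is an assumption for this sketch: higher is worse.
    RECOVERABLE = 1
    IGNORABLE = 2
    SERIOUS = 3
    CRITICAL = 4


# Responses paraphrased from the list above.
RESPONSE = {
    Severity.CRITICAL: "drop/restart session or AFI/SAFI",
    Severity.SERIOUS: "do something with NLRI, short of dropping the session",
    Severity.IGNORABLE: "process UPDATE as if the ignored attributes never existed",
    Severity.RECOVERABLE: "process UPDATE after patching up the errors",
}


def overall_severity(errors):
    """Return the worst severity found in one UPDATE, or None if no errors."""
    return max(errors, default=None)


def response_for(errors):
    """Map the errors found in one UPDATE message to a single response."""
    worst = overall_severity(errors)
    if worst is None:
        return "process the UPDATE message normally"
    return RESPONSE[worst]
```

For example, response_for([Severity.IGNORABLE, Severity.SERIOUS])
picks the "Serious Error" response, because one Serious error is
taken to outweigh any number of Ignorable ones.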

I would then divide the forms of error into (1) "framing" and (2)
"content" (or "semantic").

A BGP UPDATE message has three levels of framing:

  * Level 1 -- the 16 octet "Marker" + Message Length
                                     + Withdrawn Routes Length
                                     + Total Path Attributes Length

    If the Message Length is broken, it is extremely likely that the
    "Marker" on the next message will be invalid.

  * Level 2(a) -- the Withdrawn Routes

    Each prefix must have a valid prefix length, and the last
    must run exactly to the end of this part of the message.

  * Level 2(b) -- the Attributes

    Each attribute must be correctly framed, and the last one must
    run exactly to the end of the attribute part of the message.

  * Level 2(c) -- the Network Layer Reachability Information.

    Same as 2(a).

  * Level 3 -- various Attributes

    Some attributes have internal framing.
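
A minimal sketch of the Level 1 and Level 2(a)/2(c) checks, assuming
the caller hands over exactly one whole message as bytes and that the
prefixes are plain IPv4.  The names FramingError, split_update and
check_prefixes are hypothetical; the field layout is the RFC 4271 one
(which also has a one octet Type after the Message Length).

```python
import struct

MARKER = b"\xff" * 16
HEADER_LEN = 19   # 16-octet Marker + 2-octet Length + 1-octet Type
UPDATE = 2        # Type code of an UPDATE message


class FramingError(Exception):
    pass


def split_update(msg: bytes):
    """Level 1: return (withdrawn, attributes, nlri) as raw byte strings."""
    if len(msg) < HEADER_LEN + 4:
        raise FramingError("too short to hold the two length fields")
    if msg[:16] != MARKER:
        raise FramingError("bad Marker")
    length, msg_type = struct.unpack("!HB", msg[16:19])
    if length != len(msg):
        raise FramingError("Message Length does not match the message")
    if msg_type != UPDATE:
        raise FramingError("not an UPDATE message")
    (wlen,) = struct.unpack("!H", msg[19:21])
    alen_at = 21 + wlen
    if alen_at + 2 > length:
        raise FramingError("Withdrawn Routes Length overruns the message")
    (alen,) = struct.unpack("!H", msg[alen_at:alen_at + 2])
    attrs_at = alen_at + 2
    if attrs_at + alen > length:
        raise FramingError("Total Path Attributes Length overruns the message")
    return msg[21:alen_at], msg[attrs_at:attrs_at + alen], msg[attrs_at + alen:]


def check_prefixes(data: bytes, max_bits: int = 32):
    """Level 2(a)/2(c): walk the prefixes; the last must end exactly at the end."""
    prefixes = []
    i = 0
    while i < len(data):
        bits = data[i]
        if bits > max_bits:
            raise FramingError(f"invalid prefix length {bits}")
        nbytes = (bits + 7) // 8
        if i + 1 + nbytes > len(data):
            raise FramingError("last prefix runs past the end of the field")
        prefixes.append((bits, data[i + 1:i + 1 + nbytes]))
        i += 1 + nbytes
    return prefixes
```

The same check_prefixes() serves for both the Withdrawn Routes and the
NLRI part, which is the sense in which 2(c) is "same as 2(a)".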

So far, so obvious.  To judge if an individual attribute is properly
framed, we need to consider the red-tape:

  * the Flags octet has a limited set of valid values, depending
    on the Type.

  * the Type may be more or less anything, but repeats are not
    valid.
 
  * the Length is constrained for some Types
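
The three red-tape checks might look something like this.  It is a
sketch only: the table covers just a handful of the RFC 4271
attributes, and the names KNOWN and red_tape_problems are made up for
the illustration.

```python
# Bits of the attribute Flags octet (RFC 4271, section 4.3).
OPTIONAL, TRANSITIVE, PARTIAL = 0x80, 0x40, 0x20

# Type code -> (name, expected Optional/Transitive/Partial bits,
#               fixed Length or None if the Length is variable).
KNOWN = {
    1: ("ORIGIN", TRANSITIVE, 1),
    2: ("AS_PATH", TRANSITIVE, None),
    3: ("NEXT_HOP", TRANSITIVE, 4),
    4: ("MULTI_EXIT_DISC", OPTIONAL, 4),
    5: ("LOCAL_PREF", TRANSITIVE, 4),
    6: ("ATOMIC_AGGREGATE", TRANSITIVE, 0),
}


def red_tape_problems(flags, type_code, length, seen_types):
    """Return a list of red-tape complaints about one attribute header.

    seen_types is the set of Type codes already met in this UPDATE;
    the caller is expected to add type_code to it afterwards.
    """
    problems = []
    if type_code in seen_types:
        problems.append(f"Type {type_code} is repeated")
    category = flags & (OPTIONAL | TRANSITIVE | PARTIAL)
    if type_code in KNOWN:
        name, want_flags, want_len = KNOWN[type_code]
        if category != want_flags:
            problems.append(f"{name}: Flags {flags:#04x} not valid for this Type")
        if want_len is not None and length != want_len:
            problems.append(f"{name}: Length {length}, expected {want_len}")
    elif not flags & OPTIONAL:
        # A well-known attribute that this table does not recognise.
        problems.append(f"Type {type_code}: unrecognised well-known attribute")
    elif not flags & TRANSITIVE and flags & PARTIAL:
        problems.append(f"Type {type_code}: Partial set on Optional Non-transitive")
    return problems
```

An ATOMIC_AGGREGATE with Length 421, i.e. red_tape_problems(0x40, 6,
421, set()), yields exactly one complaint, about the Length.  For an
unknown Optional Transitive Type there is very little to check, which
is the "more for known types than unknown ones" point.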

There is some redundancy here, more for known types than unknown ones,
which helps.  The Total Path Attributes Length is, effectively, a
checksum for all the Lengths of all the Attributes.  It would be
possible to specify that a set of attributes should be deemed
correctly framed solely on the basis of passing that test.  However,
my feeling is that all the available redundancy (such as it is) should
be used to minimise the possibility of accepting broken attributes
-- *particularly* where an error is going to be treated as Ignorable.
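
To illustrate how weak the Total Path Attributes Length test is on its
own, here is a sketch of a walker that applies only that test.  The
name walk_attributes is hypothetical, and the input is assumed to be
exactly the attribute part of one UPDATE message.

```python
import struct

EXTENDED_LENGTH = 0x10  # Flags bit: the Length field is two octets


def walk_attributes(data: bytes):
    """Split the attribute part into (flags, type_code, value) tuples.

    The only test applied is that each attribute fits and that the last
    one ends exactly at the end of the data.
    """
    attrs = []
    i = 0
    while i < len(data):
        hdr = 4 if data[i] & EXTENDED_LENGTH else 3
        if i + hdr > len(data):
            raise ValueError(f"attribute header at offset {i} is truncated")
        flags, type_code = data[i], data[i + 1]
        if hdr == 4:
            (length,) = struct.unpack("!H", data[i + 2:i + 4])
        else:
            length = data[i + 2]
        end = i + hdr + length
        if end > len(data):
            raise ValueError(f"attribute at offset {i} overruns the attributes")
        attrs.append((flags, type_code, data[i + hdr:end]))
        i = end
    return attrs


# ORIGIN (Length 1) followed by ATOMIC_AGGREGATE (Length 0).
good = bytes.fromhex("40010100" "400600")
# Same octets, but the ORIGIN Length has been corrupted from 1 to 4.
bad = bytes.fromhex("40010400" "400600")
```

Both walk_attributes(good) and walk_attributes(bad) succeed.  In the
bad case the ORIGIN simply swallows the ATOMIC_AGGREGATE and still
ends exactly on the boundary, so only the per-Type Length constraint
on ORIGIN reveals that anything is wrong.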

Once attributes are correctly framed, then one can consider their
content.  Wherever the line between framing and content is drawn, I
think it helps to be clear about the distinction between them --
"framing" errors affect the attribute and the attributes around it,
"content" errors affect only the attribute.

The framing of an Optional Transitive attribute is a special case.  If
the parser recognises an Optional Transitive attribute, but its Length
is not valid, what should the receiver do ?  If the sender did not
understand the
Attribute, then the broken Length is a "content" issue.  If the sender
did understand it, then the broken Length is a "framing" issue.  (It
is a serious disappointment to me that the Partial bit does not help
here.  But even if it did, what if the sender made a mess of
setting/clearing it !?)

In section 2.1.2 the draft specifies a number of "Semantic BGP
Errors", which includes many things which I would class as "framing"
errors.

This is all pretty low level stuff.  I can hear an argument that the
requirements document is not the place for this level of detail.
However, without a more precise understanding of how broken attributes
may be parsed, requirements for how to deal with them are hard to
specify and to interpret.

If NLRI were explicitly separate from the attributes, then whenever a
set of attributes failed a strict "framing" check, "treat-as-withdraw"
(or equivalent) could be applied, reliably.  This seems to me to be as
safe as possible, short of dropping the session (which has its own
safety issues).

With NLRI mixed up in the attributes, either one plays safe and treats
all attribute errors as Critical, or a much more detailed analysis of
attribute parsing is required.  What is the cost of missing some NLRI
which were sent, but were obscured by some other broken attribute ?
What is the risk ?  What degree of broken-ness of an attribute can be
deemed not to invalidate the parsing of the attributes before and/or
after it ?  Is that different for different attributes ?

In order to contemplate classifying some attribute errors as
"Ignorable" or "Recoverable", a more detailed analysis of attribute
parsing is also required.  An ATOMIC_AGGREGATE attribute is arguably
trivial and Ignorable.  But is an ATOMIC_AGGREGATE attribute with a
length of 421 (say) likely to be a momentary lapse of concentration at
the sender end, or more likely to be a symptom of a badly broken set
of attributes ? 

Chris