RE: Tsvart telechat review of draft-ietf-trill-mtu-negotiation-06

Hi Magnus,

Thanks for your careful review. Please see the responses inline below.

________________________________________
From: Magnus Westerlund [magnus(_dot_)westerlund(_at_)ericsson(_dot_)com]
Sent: Wednesday, July 05, 2017 21:02
To: tsv-art(_at_)ietf(_dot_)org
Cc: draft-ietf-trill-mtu-negotiation(_dot_)all(_at_)ietf(_dot_)org; 
ietf(_at_)ietf(_dot_)org; trill(_at_)ietf(_dot_)org
Subject: Tsvart telechat review of draft-ietf-trill-mtu-negotiation-06

Reviewer: Magnus Westerlund
Review result: Not Ready

This TSV-ART review is influenced by the fact that I also reviewed
draft-ietf-trill-over-ip.

1. So draft-ietf-trill-over-ip-10 has MTU discovery needs for
determining if the UDP encapsulation will work or not. It references in
Section 8.4 the old RFC, i.e. RFC 6325, which is updated by
draft-ietf-trill-mtu-negotiation.

    TRILL IS-IS MTU PDUs, as specified in Section 5 of [RFC6325] and in
    [RFC7177], can be used to obtain added assurance of the MTU of a
    link.

However, this is not quite true: if the IP path MTU is below 1470
bytes, which is not unheard of, the algorithm in the MTU negotiation
draft can't determine it. It will only report the IP path as having an
MTU that is too small when the 1470-byte probe fails.

[Mingui]   I copied the relevant text from RFC 6325. 
    "The desired minimum acceptable inter-RBridge link MTU for the
      campus, that is, originatingLSPBufferSize.  This is a 16-bit
      unsigned integer number of octets that defaults to 1470 bytes,
      which is the minimum valid value.  Any lower value being
      advertised by an RBridge is ignored."
So the minimum value of Sz is 1470. In other words, an IP path with an MTU
below 1470 does not qualify as an adjacency in the TRILL network topology.


So, if the trill-over-ip authors want to use this as a mechanism, then the MTU
negotiation draft needs to be expanded to have more flexible lower
boundaries. However, that appears to affect MTU negotiation quite
significantly, as it needs to separate the algorithm for finding the MTU from
the different usages of the algorithm with different starting points. The
normal usage will have a lower bound of 1470 and be more tightly coupled
to Sz when finding Lz, while TRILL-over-IP has a different usage.

I think the TRILL WG needs to decide on how to slice this: either the
MTU-negotiation draft only targets the explicit targets in the current draft
and goes forward now, or the WG wants to meet trill-over-ip's goals, which
will require restructuring.

2. Another issue is that I think the algorithm is a bit short on
transmission scheduling recommendations:

    1) If RB1 successfully receives the MTU-ack from RB2 to the probe of
       the value of link-wide Lz within k tries (where k is a
       configurable parameter whose default is 3), link MTU size is set
       to the size of link-wide Lz and stop.

If I do this test with all three packets back to back at line rate, I could
potentially lose all probes in the same burst loss in a router queue or
switch fabric. What I think is needed here is a specification of how these
probes are transmitted: are they spaced in a particular way, or at least with
a minimal distance, and are the additional probes only sent after the previous
one has been judged lost? The latter interacts with the next issue.

[Mingui] This seems to be an implementation matter. However, the document may
offer recommendations. The recommended minimum interval between two successive
probes would affect the boot-up speed of a TRILL campus. One RTT is a
reasonable value.

3. It is also unclear what the criterion is for determining that something
is lost:

      a) If RB1 fails to receive an MTU-ack from RB2 after k tries, RB1
          sets the "failed minimum MTU test" flag for RB2 in RB1's Hello
          and stop.

I fail to see any specification of the criteria for when an MTU-ack should be
considered to have failed to reach the probing entity. So this appears to
require a timeout, and thus a timeout interval. Is the RTT known so that one
can define something as lost after N*RTT? Are there possible delays in sending
the MTU-ack that are considered okay that can affect this?

[Mingui] Yes, this makes sense. An MTU-ack should be considered lost if it has
not been received within two RTTs after the probe is sent out.
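
The timing rules discussed in items 2 and 3 could be sketched as follows. This
is only an illustration, not text from the draft: the function name
probe_with_retries and the callables send_probe and wait_for_ack are
hypothetical, and the sketch assumes that the RTT is known and that a new
probe is sent only after the previous one has been judged lost.

```python
def probe_with_retries(send_probe, wait_for_ack, rtt, k=3):
    """Send up to k MTU-probes; return the attempt number that was acked, or None.

    send_probe():          hypothetical callable that transmits one MTU-probe.
    wait_for_ack(timeout): hypothetical callable that returns True if an
                           MTU-ack arrives within `timeout` seconds.
    rtt:                   assumed known round-trip time, in seconds.
    k:                     number of tries (default 3, as in the draft text).
    """
    for attempt in range(1, k + 1):
        send_probe()
        # A probe is judged lost if no MTU-ack arrives within two RTTs.
        if wait_for_ack(2 * rtt):
            return attempt
        # The next probe is sent only after this timeout, so successive
        # probes are at least one RTT apart and cannot share a burst loss
        # caused by back-to-back transmission.
    return None
```

Waiting out the two-RTT timeout before retrying means a failed test costs at
most 2*k RTTs per probed size, which bounds the impact on campus boot-up time.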

4. Section 3, the algorithm in Step 1 is unable to reach the first termination
condition (3) "If lowerBound >= upperBound" in some cases.

[Mingui] This algorithm has been updated through a few rounds of revisions. Let 
me insert a few minor updates to the cited text as below.

  Step 1: RB1 tries to send an MTU-probe padded to the size x.

   1) If RB1 fails to receive an MTU-ack from RB2 after k tries:

         upperBound is set to x and x is set to [(lowerBound +
         upperBound)/2], rounded up to the nearest integer.

[Mingui] s/upperBound is set to x/upperBound is set to x-1/
[Mingui] s/rounded up to the nearest integer./rounded down to the nearest 
integer./


   2) If RB1 receives an MTU-ack to a probe of size x from RB2:

         link MTU size is set to x, lowerBound is set to x and x is set
         to [(lowerBound + upperBound)/2], rounded up to the nearest
         integer.

[Mingui] s/rounded up to the nearest integer./rounded down to the nearest 
integer./
[Mingui] Append one condition to this step 2): If lowerBound equals 
upperBound-1 then x is set to upperBound.

   3) If lowerBound >= upperBound or Step 1 has been repeated n times
      (where n is a configurable parameter whose default value is 5),
      stop.

   4) Repeat Step 1.

I ran this on the input data: lower bound = 1470, upper bound = 9216, with
an MTU of 7935, and got the following sequence:

Lower   Upper   X
1470    9216    5343
5343    9216    7280
7280    9216    8248
7280    8248    7764
7764    8248    8006
7764    8006    7885
7885    8006    7946
7885    7946    7916
7916    7946    7931
7931    7946    7939
7931    7939    7935
7935    7939    7937
7935    7937    7936
7935    7936    7936
7935    7936    7936

Thus, the termination condition needs to change. 
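
The non-terminating sequence above can be reproduced with a small simulation.
This is a sketch under assumptions: the function name original_search is
hypothetical, a probe of size x is modelled as succeeding exactly when
x <= path_mtu, the initial x is taken as the rounded-up midpoint (matching the
first row of the table), and max_steps is an artificial cap added only so the
simulation itself stops.

```python
def original_search(lower, upper, path_mtu, max_steps=50):
    """Simulate Step 1 as originally written (x rounded up, upperBound = x).

    Returns (rows, terminated), where rows is a list of (lower, upper, x)
    tuples and terminated says whether lowerBound >= upperBound was reached.
    """
    x = (lower + upper + 1) // 2          # midpoint, rounded up
    rows = []
    for _ in range(max_steps):
        rows.append((lower, upper, x))
        if lower >= upper:                # termination condition 3)
            return rows, True
        if x <= path_mtu:                 # 2) MTU-ack received
            lower = x
        else:                             # 1) no MTU-ack after k tries
            upper = x
        x = (lower + upper + 1) // 2      # rounded up
    return rows, False                    # cap hit: condition never became true


rows, terminated = original_search(1470, 9216, 7935)
for row in rows[:15]:
    print("%d    %d    %d" % row)
print("terminated:", terminated)
```

Once the state reaches lower = 7935, upper = 7936, x = 7936, a failed probe
sets upperBound back to 7936, so the state repeats and only the repeat limit n
ends the loop.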

[Mingui] After the update of the text, the sequence would become:
Lower   Upper   X
1470    9216    5343
5343    9216    7279
7279    9216    8247
7279    8246    7762
7762    8246    8004
7762    8003    7882
7882    8003    7942
7882    7941    7911
7911    7941    7926
7926    7941    7933
7933    7941    7937
7933    7936    7934
7934    7936    7935
7935    7936    7935
7935    7936    7936
7935    7935    7935
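
For readers who want to check the updated text, it can be simulated as below.
This is a minimal sketch, not the draft's normative algorithm: the function
name search_link_mtu is hypothetical, a probe of size x is modelled as
succeeding exactly when x <= path_mtu, and the initial x is assumed to be the
rounded-down midpoint, matching the first row of the table.

```python
def search_link_mtu(lower, upper, path_mtu, n=None):
    """Simulate the updated Step 1 loop.

    n is the repeat limit (None means no limit, to show full convergence).
    Returns (link_mtu, lower, upper, probes); link_mtu is None if no probe
    was acknowledged.
    """
    x = (lower + upper) // 2              # midpoint, rounded down
    link_mtu = None
    probes = []
    while True:
        probes.append(x)
        if x <= path_mtu:                 # 2) MTU-ack received
            link_mtu = x
            lower = x
            x = (lower + upper) // 2      # rounded down
            if lower == upper - 1:        # appended condition
                x = upper
        else:                             # 1) no MTU-ack after k tries
            upper = x - 1
            x = (lower + upper) // 2      # rounded down
        # 3) stop when the bounds meet or the repeat limit is reached
        if lower >= upper or (n is not None and len(probes) >= n):
            return link_mtu, lower, upper, probes


print(search_link_mtu(1470, 9216, 7935))        # runs until the bounds meet
print(search_link_mtu(1470, 9216, 7935, n=5))   # stops after 5 repeats
```

Under these assumptions the unlimited run ends with lowerBound = upperBound =
7935 after 14 probes. With n = 5 it stops with link MTU size 7762 and bounds
7762 and 8003.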

The second thing I notice is that limiting the number of steps to 5
results in quite a large gap between the upper and lower bounds within
which the MTU lies.

[Mingui] Since the testing might be too resource consuming, implementors 
suggested this limitation. After all, the purpose of testing an Lz value is to 
improve the efficiency (if Lz > Sz) rather than to reach the optimal efficiency.

5. I frankly get confused by the application of the binary search. First, it
will in many cases not be run to termination, where the actual MTU is determined.
Then the resulting upper and lower bounds are just used to confirm the Sz
value. There is no discussion about using the MTU search to determine a new
possible value for Sz.

[Mingui] That is because the MTU search is NOT used to determine a new
possible value for Sz. It is only applicable to Lz.

The text is not even explicit that the lower bound is the highest
transmission unit size known to work at the time of testing. I think
Section 3 should conclude by determining some TU value, and whether that is
Sz or something else appears quite relevant for what to do in the later
sections.

[Mingui] As specified in Section 3, “link MTU size” is already set to the lower 
bound. This tested “link MTU size” is the “TU” value. This value is potentially 
larger than Sz as explained in the introduction and Section 2.

Thanks,
Mingui