Re: Tsvart early review of draft-ietf-trill-over-ip-10

Hi Magnus,

On Tue, Jun 27, 2017 at 1:13 PM, Magnus Westerlund <
magnus(_dot_)westerlund(_at_)ericsson(_dot_)com> wrote:

Hi Donald,

After having read your response I think there is an important question
about the applicability of this document that affects several of the issues
below and what solution you need. That is the question of what type of
paths one expects to get TRILL over IP working over. If the target is the
general Internet, including paths through middleboxes such as NATs and
firewalls (ones not intending to block), then there is a lot more work to
ensure this. If you, for example, change the applicability to require any
on-path middleboxes to fulfill certain requirements, things can be more
easily addressed.


The use cases in the document support communication over the general
Internet in the branch office case but that does not necessarily imply
NATs/Firewalls.

Den 2017-06-26 kl. 02:07, skrev Donald Eastlake:


Hi Magnus,

Thanks for the extensive review. See my responses below.

On Thu, Jun 15, 2017 at 1:32 PM, Magnus Westerlund
<magnus(_dot_)westerlund(_at_)ericsson(_dot_)com> wrote:


Diffserv usage
--------------

Section 4.3:

   TRILL over IP implementations MUST support setting the DSCP value in
   the outer IP Header of TRILL packets they send by mapping the TRILL
   priority and DEI to the DSCP. They MAY support, for a TRILL Data
   packet where the native frame payload is an IP packet, mapping the
   DSCP in this inner IP packet to the outer IP Header with the default
   for that mapping being to copy the DSCP without change.

I think it is fine to require that implementations are capable of setting
DSCP values on the outer IP header. However, I fail to see any discussion of
the potential issues with actually setting the DSCP values. It is one thing to
do this in an IP backbone use case where one can know and have control over
the PHB that the DSCP values map to. But otherwise, over the general Internet,
the behavior is not that predictable. One can easily be subject to policers or
remapping. Also, as the actual DSCP code point usage is domain specific, this
is difficult. Priority reversal is likely the least of the problems that this
can run into over the general Internet.

It sounds like appropriate discussion and warnings about these issues
would resolve the above comment.

I would note that the choice of encapsulation here does become important.
Take, for example, your and Joe Touch's observation that with TCP you can
only have a single DSCP marking per TCP connection. For others, see the
discussion in Section 5.1 of https://datatracker.ietf.org/doc/rfc7657/ on
this issue.


Well, if a TRILL over IP implementation using TCP transport wants to have
more than one priority category for traffic where there might be one or
more intervening IP routers, which would be the normal case, it would just
need a TCP connection per priority category. Mapping the 8 priority levels
into a smaller number of categories is a routine thing to do. Note that the
base TRILL protocol specification (RFC 6325) says:

   RBridges are not required to implement any
   particular number of distinct priority levels but may treat one or
   more adjacent priority levels in the same fashion.

David Black also raised an important question of whether one should treat
this as a tunnel with a single predictable behavior or let the inner
network's marking show through. Establishing a tunnel with a single PHB has
less risk of running into issues than multiple different markings.


It is an implementation choice whether to have a single PHB or eight, one
per priority level, or something in between.

Section 4.3:

   The default TRILL priority and DEI to DSCP mapping, which may be
   configured per TRILL over IP port, is as follows. Note that the DEI
   value does not affect the default mapping and, to provide a
   potentially lower priority service than the default priority 0,
   priority 1 is considered lower priority than 0. So the priority
   sequence from lower to higher priority is 1, 0, 2, 3, 4, 5, 6, 7.

      TRILL Priority  DEI  DSCP Field (Binary/decimal)
      --------------  ---  -----------------------------
                  0   0/1  001000 / 8
                  1   0/1  000000 / 0
                  2   0/1  010000 / 16
                  3   0/1  011000 / 24
                  4   0/1  100000 / 32
                  5   0/1  101000 / 40
                  6   0/1  110000 / 48
                  7   0/1  111000 / 56
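
The default mapping quoted above can be sketched in Python as follows. This
is only an illustration of the table, not code from the draft; the names
DEFAULT_PRIORITY_TO_DSCP, dscp_for and tos_byte are hypothetical.

```python
# Default TRILL priority -> DSCP mapping, copied from the quoted table.
# The DEI bit does not affect the default mapping.
DEFAULT_PRIORITY_TO_DSCP = {
    0: 0b001000,  # 8
    1: 0b000000,  # 0
    2: 0b010000,  # 16
    3: 0b011000,  # 24
    4: 0b100000,  # 32
    5: 0b101000,  # 40
    6: 0b110000,  # 48
    7: 0b111000,  # 56
}


def dscp_for(priority, dei=0, mapping=DEFAULT_PRIORITY_TO_DSCP):
    """Return the outer-header DSCP for a TRILL priority (0-7) and DEI bit."""
    if priority not in mapping:
        raise ValueError("TRILL priority must be in 0..7")
    if dei not in (0, 1):
        raise ValueError("DEI must be 0 or 1")
    # DEI is accepted but ignored, as in the default mapping.
    return mapping[priority]


def tos_byte(dscp, ecn=0):
    """Place a 6-bit DSCP in the upper bits of the 8-bit DS/ECN octet."""
    return (dscp << 2) | ecn


# Sorting priorities by their default DSCP value reproduces the
# low-to-high sequence given in the draft text: 1, 0, 2, 3, 4, 5, 6, 7.
ORDER_LOW_TO_HIGH = sorted(DEFAULT_PRIORITY_TO_DSCP,
                           key=DEFAULT_PRIORITY_TO_DSCP.get)
```

Because the mapping is configurable per port, a deployment that does not
want priority 0 on '001000' would pass its own table as the mapping argument.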

This appears to be a problematic mapping, at least for priorities 0 and 1. As
priority 0 appears to be intended to be higher than priority 1, it is
interesting that it is mapped to CS1, which, to quote
https://datatracker.ietf.org/doc/rfc7657/:

      CS1 ('001000') was subsequently designated as the recommended
      codepoint for the Lower Effort (LE) PHB [RFC3662].

So what is proposed can, in a network using the default mapping, result in
priority 0 being treated as lower priority than 1. In addition, in some
networks this can also result in strange remapping that gives CS1 a different
PHB than expected.

The intent in the draft is to reflect the default relative priority of
the different priority code points in IEEE Std 802.1Q where priority 1
is lower than priority 0. At a quick look, it appears to me that RFC
2474 requires that '001000' be handled as being of a priority not
lower than the priority with which '000000' is handled. Yet RFC 3662,
which you point to, seems to suggest using '001000' as a lower
priority code point than '000000'. Given that RFC 3662 not only does not
update RFC 2474 but is only Informational while RFC 2474 is Standards
Track, I would say that RFC 2474 dominates and that this draft makes the
best assumptions it can about default behavior...


David Black provided a good answer on this.


I'll reply to him.

MTU and Fragmentation
---------------------

I think there are two main issues here. The first one is discovery of the
actual IP path MTU between the ports. That will be needed to prevent a lot
of traffic going into MTU black holes, especially as TRILL requires 1470
byte support, which is likely above what a lot of paths provide.

It seems like it would depend on the environments where TRILL is used.
For example, I do not think 1470 would be a problem in most Data
Center or Internet Exchange Point uses. Data Centers sometimes support
9K jumbo frames and the like.

In fact, it is probably bad to focus too much on 1470 -- that is a
required minimum to be sure that reasonable size link state PDUs can
be successfully flooded through the TRILL campus so that routing will
work. However, it would commonly be the case that, for the TRILL
campus to be useful in a particular case, links need to be able to
carry the expected size TRILL Data packets. For example, if there were
two parts of a TRILL campus connected by one or a few TRILL over IP
links and the end stations in each part were assuming they could use
1500 byte Ethernet packets, then the TRILL over IP links would need to
support an MTU based on 1500 + TRILL Header + IP and TRILL over IP
encapsulation. And more if security was being used or there were any
other reasons for additional headers/encapsulation...
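
As a rough worked example of that sum, the sketch below adds up header sizes
for one assumed layout: a native UDP encapsulation where the TRILL Header
directly follows the UDP header, a VLAN-tagged inner frame, no IP or TRILL
header options, and no security. The function name required_ip_mtu and the
exact set of headers counted are assumptions for illustration, not taken from
the draft.

```python
# Well-known fixed header sizes in bytes (no options in any of them).
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
TRILL_HEADER = 6       # TRILL Header without options
INNER_ETHERNET = 14    # Inner.MacDA + Inner.MacSA + Ethertype
INNER_VLAN_TAG = 4     # Inner.VLAN tag


def required_ip_mtu(end_station_payload=1500, ip_header=IPV4_HEADER, extra=0):
    """IP MTU needed to carry one native frame without fragmentation.

    extra covers anything this sketch does not model, for example
    security headers or additional tunnel encapsulation.
    """
    return (end_station_payload + INNER_ETHERNET + INNER_VLAN_TAG
            + TRILL_HEADER + UDP_HEADER + ip_header + extra)


print(required_ip_mtu())                       # 1552 over IPv4
print(required_ip_mtu(ip_header=IPV6_HEADER))  # 1572 over IPv6
```

Under these assumptions, end stations using 1500 byte Ethernet payloads
already need more than a 1500 byte IP MTU on the TRILL over IP link.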


Yes, and over the general Internet you should be happy if you get 1500 bytes
of IP MTU; it may easily be lower with a couple of additional tunnel
headers. Thus, what you say is the goal is not feasible without a solution
that supports fragmentation and reassembly, enabling one TRILL packet to be
sent in multiple IP packets. Reassembly does require buffering and is not
something easily performed on a router fast path. And attempting to use IP
fragmentation is likely doomed if you have any type of NAT or firewall in
the way.

This points to a dedicated solution or using a transport protocol that
supports carrying arbitrary data sizes, like TCP or SCTP. And you need to
use the byte-stream API of TCP to achieve this.

OK.

Section 8.4:

   Path MTU discovery [RFC4821] should be useful
   in determining the IP MTU between a pair of RBridge ports with IP
   connectivity.

The issue with RFC 4821 is that it has requirements on the packetization layer.
TRILL appears to have several components that are useful. However, it will
require a specification of the procedure to result in a useful tool.

See below.


Section 8.4:

   TRILL IS-IS MTU PDUs, as specified in Section 5 of [RFC6325] and in
   [RFC7177], can be used to obtain added assurance of the MTU of a
   link.

Yes, that can confirm working MTUs that are at 1470 or above, but appears
prevented from working below 1470?

While there is a minimum size for TRILL IS-IS MTU PDUs, determined by
header size, it is well below 1470, probably (depending on whether
security is in use, etc.) below 150 bytes.


Okay, if you say so. It was not obvious from the spec that it was allowed
to probe for paths with lesser MTUs than 1470.

Thus, it appears that there is a lack of mechanism here to actually get a
valid and functional MTU from TRILL in the cases where the Path MTU is below
1470. If I am wrong, good, but I think this is an important piece for how to
handle the next main issue.

How about referencing Section 3 of
https://tools.ietf.org/html/draft-ietf-trill-mtu-negotiation-05
which is currently in IETF Last Call? (The wording of that section is
probably going to be improved based on an OPS review by Brian
Carpenter.)

I looked at this, and it appears to have the same issue, that it can't
probe for MTU values below 1470.


I think the thing was that, before TRILL over IP, it would not have been
useful to determine an MTU below 1470. But there is no particular problem
in constructing smaller MTU-probe PDUs, and an RBridge receiving such a
PDU is generally required to respond with an equal length MTU-ack.

   2) RB1 tries to send an MTU-probe padded to the size 1470.

      a) If RB1 fails to receive an MTU-ack from RB2 after k tries, RB1
         sets the "failed minimum MTU test" flag for RB2 in RB1's Hello
         and stop.


But the algorithm clearly performs a binary search for the MTU. If one
looks at RFC 4821 one will notice that there are some additional
considerations there on how to make the probing better and more robust. But
clearly TRILL has some other criteria for what is a success. Verification
that Sz works appears sufficient, and there is no need to probe further
upwards.
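
To make the binary search idea concrete, here is a minimal sketch that also
allows the lower bound to sit below 1470. It is not the procedure from
draft-ietf-trill-mtu-negotiation or RFC 4821. The probe callable is
hypothetical: it is assumed to send an MTU-probe padded to the given size and
return True only if a matching MTU-ack arrives within the allowed number of
tries, and success is assumed to be monotonic in size.

```python
def find_link_mtu(probe, low, high):
    """Return the largest size in [low, high] for which probe(size) is True.

    Returns None if even the smallest size fails. Assumes that if a given
    size gets through, every smaller size gets through as well.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    if not probe(low):
        return None
    best = low
    lo, hi = low + 1, high
    while lo <= hi:
        mid = (lo + hi) // 2
        if probe(mid):
            best = mid       # mid works, look for something larger
            lo = mid + 1
        else:
            hi = mid - 1     # mid fails, look for something smaller
    return best


# Simulated path that passes probes of up to 1400 bytes.
def simulated_probe(size):
    return size <= 1400


print(find_link_mtu(simulated_probe, 150, 9000))   # 1400
print(find_link_mtu(simulated_probe, 1470, 9000))  # None
```

The second call shows the concern raised above: a search that starts at 1470
reports only failure on a path whose usable MTU is lower.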

UDP encapsulation and IP fragments
----------------------------------

I see it as a big issue that UDP encapsulation is the native one, and that it
relies on IP fragmentation despite the need for reliable fragmentation. With
the requirement to support a 1470 MTU at the TRILL level, some packets will be
fragmented in many environments. That will lead to a lot of losses and, as
discussed below, a very big problem with middleboxes. The main problem here is
that if one tries to rely on IP fragments one will have issues with packets
ending up in black holes, with different problems depending on IPv4 or IPv6.
IPv6 is likely the lesser problem, assuming that one has working PMTUD.

There are several ways out of this.

1. Detect issues and use TCP encapsulation with a correctly set MSS to not
   get IP fragments.
2. Determine the MTU and implement a fragmentation mechanism on top of UDP.

So, I don't see that much of a problem with UDP being the general default,
consistent with the TRILL philosophy of defaulting to needing zero or
minimal configuration. The default should be to use multicast Hellos
for discovery of neighbors which sure points at UDP to me. Having to
traverse a NAT should be a rare case. Since, in the NAT case, you have
to configure things related to the static binding and the IP
address(es) of peer(s) anyway you can also configure to use a
different encapsulation than UDP, such as TCP, at the same time. I
don't see it as much of a problem if, by default, TRILL won't operate
through a NAT. If you are using UDP and it fragments and fragments are
dropped at a NAT, probably you can't exchange Hellos so you will not
form an adjacency and anything on the other side of the NAT will not
be visible.


Yes, but this is the issue of applicability and documenting that
applicability. I don't know what goals and requirements exist for
TRILL. If the WG is fine with some restrictions, then document them and
focus on solving the issues that must be solved.

You can clearly choose to require TCP for cases where the IP MTU is
insufficient for carrying the Sz sized TRILL packets between the RBs using
UDP.

OK.

Zero Checksum:
--------------

Section 5.4:

UDP Checksum - as specified in [RFC0768]

Considering the fast path encapsulation desire, I am surprised not to see any
mention of the use of zero checksum here. Mentioning the zero checksum with a
forward reference would be good, I think.

And then Section 8.5:

   The requirements for the usage of the zero UDP Checksum in a UDP
   tunnel protocol are detailed in [RFC6936]. These requirements apply
   to the UDP based TRILL over IP encapsulations specified herein
   (native and VXLAN), which are applications of UDP tunnel.

If you actually intended to allow zero checksum, then you should document
that TRILL fulfills the requirements that the applicability statement
raises. I have not analyzed how well it meets these requirements.

Please review Section 6.2 of RFC 8086 for an example of how that can be done.

OK. We'll look into it.


TCP Encapsulation issue
-----------------------

Section 5.6:

The TCP encapsulation appears to be missing a delimiter format allowing each
individual TRILL packet/payload to be read out of TCP's byte stream. In
other words, a normal implementation has no way of ensuring that the TCP
payload starts with the start of a new TRILL payload. Multiple small TRILL
payloads may be included in the same TCP payload, and also only parts of one,
as TCP is one way of dealing with TRILL packets that are larger than the
IP+Encapsulation MTU that actually will work.

This comment is based on there appearing to be no length fields included in
the TRILL header. The most straightforward delimiter is a 2-byte length field
for the TRILL payload to be encapsulated.

Right. It might also be useful to include some sort of check field, as
is done in BGP, to detect if you are out of sync in parsing the TCP
stream.

As you need to actually perform re-assembly, the solution is to use the
byte stream semantics the TCP API provides and have a framing for each
packet.


Of course.

My point was that the framing might usefully have some sort of flag field,
like the BGP framing has, so that there was a good chance of detecting if
the parsing of the byte stream into frames has gotten out of synch.
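
A minimal sketch of such framing, assuming a 2-byte marker followed by the
2-byte length field suggested above. The marker value, the header layout, and
the names MAGIC, frame and Deframer are hypothetical; nothing here is defined
by the draft or by BGP.

```python
import struct

MAGIC = 0x5452                   # hypothetical sync marker, ASCII "TR"
HEADER = struct.Struct("!HH")    # marker, payload length (network order)


def frame(payload: bytes) -> bytes:
    """Prefix one TRILL payload with the marker and its length."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length field")
    return HEADER.pack(MAGIC, len(payload)) + payload


class Deframer:
    """Rebuild TRILL payloads from arbitrary chunks of a TCP byte stream."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data: bytes):
        """Add bytes read from the socket; return the completed payloads."""
        self._buf += data
        done = []
        while len(self._buf) >= HEADER.size:
            magic, length = HEADER.unpack_from(self._buf)
            if magic != MAGIC:
                # Parsing has drifted off a frame boundary.
                raise ValueError("lost framing synchronization")
            end = HEADER.size + length
            if len(self._buf) < end:
                break            # wait for the rest of this payload
            done.append(bytes(self._buf[HEADER.size:end]))
            del self._buf[:end]
        return done
```

The receiver never assumes that a TCP segment starts at a frame boundary. It
only consumes a payload once the whole frame is buffered, so one TRILL packet
may span several segments and several small ones may share a segment.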

Another point is that, while with UDP it seems fine to send packets
with assorted QoS, you don't want to encourage re-ordering of TCP
packets in a stream. So if TCP encapsulation is being used, you want
to use the same DSCP value for the packets in a particular TCP stream.
So, generally, you need to have a TCP connection per priority handling
category. Mapping the 8 priority levels into a smaller number of
handling categories is a normal thing to do so you certainly don't
necessarily need 8 TCP connections. Adding material on this should not
be too hard.
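
As an illustration of one connection per handling category, the sketch below
groups the eight priorities into four categories and sends each packet on the
connection for its category. The grouping and the names CATEGORY_OF_PRIORITY,
open_category_socket and PriorityMux are hypothetical. Setting the DSCP
through the IP_TOS socket option is assumed to be available on the platform,
and the DSCP chosen per category is left to the caller.

```python
import socket

# Hypothetical grouping of the eight TRILL priorities into four handling
# categories; adjacent levels in the order 1, 0, 2, ..., 7 share a category.
CATEGORY_OF_PRIORITY = {1: 0, 0: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}


def open_category_socket(dscp: int) -> socket.socket:
    """Create an unconnected IPv4 TCP socket whose packets carry one DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The DSCP occupies the upper six bits of the former TOS octet.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
    return sock


class PriorityMux:
    """Send each TRILL packet on the connection for its handling category."""

    def __init__(self, connections):
        # connections: dict of category number -> object with sendall()
        missing = set(CATEGORY_OF_PRIORITY.values()) - set(connections)
        if missing:
            raise ValueError(f"no connection for categories {sorted(missing)}")
        self._connections = connections

    def send(self, priority: int, framed_packet: bytes) -> None:
        category = CATEGORY_OF_PRIORITY[priority]
        self._connections[category].sendall(framed_packet)
```

Packets within one category stay on one TCP stream with one DSCP, so the
network has no reason to reorder them relative to each other.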


Yes, agreed it is a possibility and points into possible considerations
that David raised.

Section 5.6:

TCP endpoint requirements. I do wonder if an application like TRILL actually
would need to discuss performance-impacting implementation choices or
limitations. For example, use of Nagle, and the requirements on buffer sizes
in relation to bandwidth-delay products, as buffer memory in an RBridge will
impact performance.

Well, I'm not sure how deeply this document should get into such
performance issues. What about just saying something about
consideration being given to tuning TCP for performance and pointing
to one or a few other RFCs that talk about this?


As Joe said, these are important considerations. If your intention is to
enable this to run at substantial fractions of the line rates of the
interfaces, then this does require consideration.


I see.

Congestion Control
------------------
First, thanks for the effort here.

You're welcome.


8.1.2 In Other Environments

   Where UDP based encapsulation headers are used in TRILL over IP in
   environments other than those discussed in Section 8.1.1, specific
   congestion control mechanisms are commonly needed.  However, if the
   traffic being carried by the TRILL over IP link is already congestion
   controlled and the size and volatility of the TRILL IS-IS link state
   database is limited, then specific congestion control may not be
   needed. See [RFC8085] Section 3.1.11 for further guidance.

This is correct; however, my question is whether the RBridges have any way of
knowing which traffic is actually congestion controlled, considering that
TRILL provides a layer 2 abstraction. I wonder if there should be any type of
white list of the types of layer 2 payloads that can be assumed to be
congestion controlled, and thus okay to forward over IP paths? I am worried
that, without any recommendation to prevent uncontrolled traffic from being
forwarded, this can lead to congestion issues.

The other issue I think may exist is the one that serial unicast emulation of
broadcast/multicast creates. As this amplifies the outgoing packet rate by
a factor equal to the number of addresses configured for serial unicast, it
can be a significant traffic expansion. Thus, I think additional
considerations are needed here, and maybe rate limiting of the amount of
traffic to be multicast.

OK. We can think about those issues.


Flow and ECMP
-------------

Section 8.3:

For example, for TRILL
   Data, this entropy field could be based on some hash of the
   Inner.MacDA, Inner.MacSA, and Inner.VLAN or Inner.FGL.

I would appreciate clearer references to what these fields are.

In a TRILL Data packet, the payload after the TRILL Header looks like
an Ethernet frame except that there is always either a VLAN tag or,
alternatively, where the VLAN tag would be, a Fine Grained Label
[RFC7172]. (The preceding is the view in the TRILL RFCs, but there is
an equivalent and equally valid view in which all the fields through
and including the VLAN or FGL tag are part of the TRILL Header.) The
TRILL base protocol specification focuses on Ethernet as a link
technology between TRILL switches, in which case there will be a link
header including an Outer.MacDA and Outer.MacSA fields and possibly an
Outer.VLAN, all before the TRILL Header. See Figure 1 and Figure 2 in
RFC 7172.

Some of the above could be added to the draft for clarity.


If I understand this correctly, the idea here is to look into the inner
layer 2 frames, use the flow equivalents that exist on that level, and
hash them into a value that maps the flows onto the source port range.

Yes.


I think this text should include a summary of the principle and be sure to
note the important requirement that what is considered a flow in the inner
layer must not be striped over multiple source ports, as this may lead to
reordering issues due to packets taking different paths.

Well, we can add some text. But when would the relative ordering
matter for two TRILL Data packets where the two inner native payloads
have different values for any one or more of these three fields
(Inner.MacDA, Inner.MacSA, and inner VLAN/FGL tag) ? If any of those
fields are different, you are talking about different streams.
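
A minimal sketch of that principle: hash the three inner fields into a UDP
source port so that all packets of one inner stream get the same port. The
choice of CRC-32 as the hash, the 49152-65535 port range, and the name
entropy_source_port are assumptions for illustration; the draft only says the
entropy could be based on "some hash" of these fields.

```python
import zlib


def entropy_source_port(inner_mac_da: bytes, inner_mac_sa: bytes, label: int,
                        port_min: int = 49152, port_max: int = 65535) -> int:
    """Map (Inner.MacDA, Inner.MacSA, VLAN or FGL) to a UDP source port.

    label is the 12-bit VLAN ID or the 24-bit Fine Grained Label.
    The same inputs always give the same port, so one inner stream is
    never spread over several source ports.
    """
    if len(inner_mac_da) != 6 or len(inner_mac_sa) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    if not 0 <= label < 2 ** 24:
        raise ValueError("label must fit in 24 bits")
    key = inner_mac_da + inner_mac_sa + label.to_bytes(3, "big")
    return port_min + zlib.crc32(key) % (port_max - port_min + 1)


da = bytes.fromhex("0200deadbeef")
sa = bytes.fromhex("020000000001")
port = entropy_source_port(da, sa, 100)
```

Two packets that differ in any of the three fields may or may not land on
different ports, which is harmless since they belong to different streams.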


Okay, then this is very straightforward.

NAT and TRILL over IP
---------------------

Section 8.5:

If one would like to use TRILL over IP through a NAT, then there are some very
important considerations that are missing. First, the need for static binding
configurations, or the need for determining one's external address(es) and
being able to communicate them to the peer RBridges, and in addition ensuring
that one has keep-alives so that the NAT binding never times out.

I think those are good points. There is an additional problem that
TRILL Hellos detect neighbors with which they have 2-way connectivity
by indicating, inside the Hellos that are sent, from what neighbors
Hellos have been received on that port. If a NAT is involved, these
neighbor addresses inside Hellos need to be mapped.

Yes, and the question is how that can be handled, by the receiver of the
packet, or if the sender needs to determine what address it uses and
provide that in the HELLOs. If the first is possible that can simplify a
lot.


I'm not sure. This would require a little detailed design work.

Next is the issue that there is almost zero chance of getting an IP/UDP
encapsulated TRILL payload through the NAT if it results in IP fragmentation,
as NATs don't defragment and refragment on the internal side, and an IP
fragment lacks a UDP port and thus can't be matched to a binding.

So perhaps the recommendation should be to configure the port to use
TCP if there will be fragmentation.

Yes, I think that is likely the simplest solution for you.


OK

Thanks,
Donald
=============================
 Donald E. Eastlake 3rd   +1-508-333-2270 (cell)
 155 Beaver Street, Milford, MA 01757 USA
 d3e3e3(_at_)gmail(_dot_)com

Also, if you would like to run IP/ESP through a NAT, then you most likely need
the IP/UDP/ESP encapsulation (https://tools.ietf.org/html/rfc3948). Note that
this will restrict the MTU even further and thus ensure that the 1470
requirement cannot be fulfilled, even without additional tunnels, over a
1500-byte MTU Ethernet infrastructure.

I would note that firewalls also likely have issues with IP fragments for the
same reason: they require a significant amount of state to verify whether
fragments should be let through.

In general I think you should create a configuration that has a chance to work
through most middleboxes, but I think you should require static bindings. I
think that configuration is, and don't laugh now, IP/UDP/ESP/TCP/TRILL;
otherwise you will not be able to have both security and reliable
fragmentation of TRILL packets.

OK. Thanks again for this review. It has pointed out a number of
problems and in thinking about those, I believe a couple of further
problems have come to mind that I mentioned above. We'll work on a
revised draft.



Cheers

Magnus Westerlund

----------------------------------------------------------------------
Media Technologies, Ericsson Research
----------------------------------------------------------------------
Ericsson AB                 | Phone  +46 10 7148287
Torshamnsgatan 23           | Mobile +46 73 0949079
SE-164 80 Stockholm, Sweden | mailto: magnus(_dot_)westerlund(_at_)ericsson(_dot_)com
----------------------------------------------------------------------