Thursday, 24 September 2020

Detecting and Mitigating Loops in VXLAN Networks

The Problem with Looping

First-generation Layer-2 Ethernet networks could not natively detect or mitigate looped topologies, while modern Layer-2 Overlays implicitly build loop-free topologies. Overlays need no loop detection and mitigation as long as no first-gen Layer-2 network is attached, but such attachments are common in complex data center networks. When loops occur, data frames can circulate indefinitely, disrupting network stability and degrading performance. Loops introduce broadcast radiation, which increases CPU and network bandwidth utilization and degrades the experience of users accessing applications. In multi-site networks a loop can span multiple data centers, causing disruptions that are difficult to pinpoint. In other words, loops are bad news. Before we look at how a modern network fabric minimizes looping, let’s examine previous attempts at preventing loops in topologies.

Spanning Tree Protocols (STP) counteract the loop problem in first-gen Layer-2 Ethernet networks. Over time, other approaches evolved by moving networks from “looped topologies” to “loop-free topologies”. This evolution reduced the dependence on Loop Prevention protocols, so they are now employed mostly as a failsafe mechanism. Today, with Network Virtualization Overlays, the dependency on Loop Prevention protocols is almost entirely eliminated. However, even though virtualized overlay networks such as VXLAN EVPN are loop free, a failsafe loop detection and mitigation method is still desirable because loops can be introduced by topologies connected to the overlay network.

Loop-free VXLAN overlays may be connected to an Ethernet segment that can result in network loops, requiring detection and mitigation in conjunction with the overlay.

Many Solutions to Loop Prevention, But Which is the Best?


The Spanning Tree Protocol enables network designs that include redundant links for fault tolerance while avoiding bridging loops. STP calculates a single tree that spans the nodes and bridges of a layer 2 network and blocks the redundant links that would otherwise create loops.
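To make the idea concrete, here is a minimal Python sketch. It is not the real STP state machine: it assumes all links cost the same, elects the bridge with the lowest ID as root, and keeps only the links on a breadth-first tree from that root. The bridge names and the `spanning_tree` helper are made up for illustration.

```python
from collections import deque

def spanning_tree(links):
    """Return (forwarding, blocked) link sets for an undirected L2 topology.

    Simplified stand-in for STP: the lowest bridge ID becomes root, all
    links have equal cost, and ties are broken by sorted neighbor order.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    root = min(neighbors)              # lowest bridge ID wins the election
    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        bridge = queue.popleft()
        for peer in sorted(neighbors[bridge]):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((bridge, peer)))
                queue.append(peer)

    all_links = {frozenset(link) for link in links}
    return forwarding, all_links - forwarding

# Three bridges cabled in a triangle: one redundant link has to be blocked.
triangle = [("sw1", "sw2"), ("sw2", "sw3"), ("sw1", "sw3")]
forwarding, blocked = spanning_tree(triangle)
```

In the triangle, the two links that touch the root stay in the forwarding set and the remaining link is blocked, which leaves a redundant but loop-free topology.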

An alternate approach to preventing loops in layer 2 networks uses link bundles between two neighboring bridges. This technique improves performance (Link Aggregation – LAG) and provides link redundancy (a member link of a LAG can fail without losing the connection). When multiple bridges exist, link bundles are extended to provide peering between multiple bridges (Multi-Chassis Link Aggregation – MLAG), increasing bridge node resiliency along with link redundancy and performance. In both cases, STP treats the link bundle as a single logical link, so no loop is created and STP acts only as a failsafe.

While LAG and MLAG were in use for many years, other approaches for building loop-free topologies arose by using ECMP (Equal Cost Multi-Path), either at the MAC layer or the IP layer. FabricPath and TRILL (Transparent Interconnect of Lots of Links) are MAC layer ECMP approaches that emerged in the last decade. More recently, Network Virtualization Overlays that build loop-free topologies on top of IP layer ECMP became the state of the art. VXLAN is the most prevalent network virtualization protocol in use today that builds loop-free topologies.

A loop-free VXLAN overlay network.

While a VXLAN Overlay provides a loop-free layer 2 service over IP ECMP, a layer 2 loop may still be introduced by connecting an L2 Ethernet network. VXLAN Edge-Devices act as bridges between VXLAN and Ethernet, known as Layer 2 Gateways (L2GW). A loop on the Ethernet network side can still introduce harmful broadcast radiation to the loop-free overlay network. If a loop is accidentally created, physically or logically, nothing in VXLAN removes it, because VXLAN has no Loop Prevention protocol of its own. The layer 2 service in the VXLAN overlay network does not participate in the Spanning Tree Protocol. Even if it could, blocking a link in a loop-free overlay network would not prevent a loop and might cause additional harm, such as loss of service.

Proposals exist to integrate the overlay network with STP by treating all Edge-Devices as a single STP root bridge: Layer 2 Gateway STP (L2G-STP). While this approach is valid, it introduces rigidity into the deployment of modern overlay networks. With L2G-STP or similar approaches, the location of the STP root is predefined and hence can’t adjust to network designs that require a different location for this function. While L2G-STP can be used as a separate feature, the same functionality can be achieved by configuring a common STP root priority on the Edge-Devices and using STP Root Guard.

In order to maintain the flexibility of overlay network deployments with VXLAN but have the ability to detect and protect against potential loops, Cisco provides an innovation: VXLAN EVPN Southbound Loop Detection and Mitigation.

Southbound Loop Detection and Mitigation


Let’s look at a VXLAN network in a spine/leaf topology to define “southbound looping”. The leaf acts as a Network Virtualization Edge-Device that hosts the VXLAN Tunnel Endpoint (VTEP) function. In this topology, the VXLAN network represents the “northbound” portion of the network. The network to the “south” of the leaf or Edge-Device is most commonly the Ethernet network. Because loops can form in this “southbound” network, the goal is to detect and mitigate the loops it introduces.

North and south network topology.

Operations, Administration, and Maintenance (OAM) provides a framework for Connectivity Fault Management (CFM), defined in IEEE 802.1ag. Within this protocol framework and its specifications, a continuity check message traverses intermediate bridges. This is a key criterion for enabling uninterrupted transfer of signaling across north-south borders. Based on well-defined triggers that range from initial port-up to duplicate MAC detection (RFC 7432, Section 15.1), check message probes are sent in a focused manner to detect if and where loops exist.

Loop detection is provided exclusively by the Edge-Devices that form the “northbound” VXLAN and bridge to the “southbound” Ethernet network. If the probe is not returned to the sending Edge-Device, then no southbound loop exists. If a southbound probe is returned, the existence of a loop is confirmed. As Edge-Devices become aware of a detected loop, notifications are sent to network operators and mitigation actions are initiated.
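The probe logic can be illustrated with a small Python simulation. This is only a sketch of the idea, not the NX-OS implementation: it assumes every southbound switch floods the probe out of all links except the one it arrived on, and it reports a loop if the flooded probe ever arrives back at the Edge-Device. The `probe_returns` function, the node names, and the link list are all hypothetical.

```python
from collections import deque

def probe_returns(links, edge_device, egress_link):
    """Flood a probe out of one Edge-Device link and report if it comes back.

    links is a list of (node_a, node_b) pairs; a link's index is its ID.
    egress_link must be the ID of a link attached to edge_device.
    """
    adjacency = {}
    for link_id, (a, b) in enumerate(links):
        adjacency.setdefault(a, []).append((link_id, b))
        adjacency.setdefault(b, []).append((link_id, a))

    a, b = links[egress_link]
    first_hop = b if a == edge_device else a
    seen = {(egress_link, first_hop)}   # (link the probe arrived on, node)
    queue = deque(seen)
    while queue:
        ingress, node = queue.popleft()
        if node == edge_device:
            return True                 # probe came back: a loop exists
        for link_id, peer in adjacency[node]:
            if link_id == ingress:
                continue                # never flood back out the ingress link
            if (link_id, peer) not in seen:
                seen.add((link_id, peer))
                queue.append((link_id, peer))
    return False                        # probe died out: no southbound loop

# A plain tree below the leaf versus a triangle of southbound switches.
tree = [("leaf", "sw1"), ("sw1", "sw2"), ("sw1", "sw3")]
ring = [("leaf", "sw1"), ("sw1", "sw2"), ("sw2", "sw3"), ("sw3", "sw1")]
tree_has_loop = probe_returns(tree, "leaf", egress_link=0)
ring_has_loop = probe_returns(ring, "leaf", egress_link=0)
```

In the tree, the probe reaches the outer switches and stops. In the ring, it travels around the three switches and is flooded back up the link to the leaf, which is the signal that a southbound loop exists.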

A probe uncovers a loop in a southbound Ethernet network.

Loop Mitigation and Recovery


As part of the mitigation, the “southbound” Ethernet interfaces that participate in a loop are identified. Because loops can exist in some VLANs but not in others, control at the granularity of a Port,VLAN combination is significant. During mitigation, only the specific offending combination of VLAN and port is suspended, which breaks the detected loop and stops traffic radiation without disrupting other traffic on the port. Breaking the loop changes the topology, which can leave stale entries in the MAC address table. Therefore, a MAC flush is initiated in the VLAN with the detected loop so that addresses are re-learned and forwarded correctly after the loop mitigation.
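A toy Python model shows why the Port,VLAN granularity matters. The `EdgeDevice` class, its fields, and the interface and MAC values are illustrative assumptions, not an actual device API: suspending one pair leaves the same port forwarding in other VLANs, and the MAC flush only touches the affected VLAN.

```python
from dataclasses import dataclass, field

@dataclass
class EdgeDevice:
    # MAC table keyed by (vlan, mac) -> port the address was learned on.
    mac_table: dict = field(default_factory=dict)
    # (port, vlan) pairs currently suspended by loop mitigation.
    suspended: set = field(default_factory=set)

    def mitigate_loop(self, port, vlan):
        """Suspend one (port, vlan) pair and flush that VLAN's MAC entries."""
        self.suspended.add((port, vlan))
        self.mac_table = {
            key: learned_port
            for key, learned_port in self.mac_table.items()
            if key[0] != vlan           # keep entries from every other VLAN
        }

    def forwards(self, port, vlan):
        """True if traffic for this VLAN is still forwarded on this port."""
        return (port, vlan) not in self.suspended

leaf = EdgeDevice(mac_table={
    (10, "aa:aa"): "Eth1/1",
    (20, "bb:bb"): "Eth1/1",
})
leaf.mitigate_loop("Eth1/1", vlan=10)   # loop detected in VLAN 10 only
```

After the call, VLAN 10 is suspended on Eth1/1 and its learned addresses are gone, while VLAN 20 on the same port keeps forwarding with its MAC entry intact.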

Once a loop has been mitigated, it can be difficult to know whether the recovery (the unsuspending of a Port,VLAN combination) will reintroduce the loop. To prevent a false recovery and loop reintroduction, a probe is sent before the recovery is initiated, while the Port,VLAN combination stays suspended and forwards no traffic. If the probe still indicates an existing southbound loop, the recovery is stopped and the Port,VLAN stays suspended. After a given interval, loop detection is reinitiated. This cycle repeats until no loop is detected. Appropriate configuration, notification, and override commands are available to the Network Operator.
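The probe-before-recovery cycle can be sketched in a few lines of Python. The function and parameter names are hypothetical, and `max_attempts` is an assumption added only so the example terminates; the process described above keeps retrying until no loop is detected.

```python
import time

def recover(port, vlan, loop_still_present,
            retry_interval=0.0, max_attempts=5, sleep=time.sleep):
    """Return True once the (port, vlan) pair is safe to unsuspend.

    loop_still_present(port, vlan) sends a probe while the pair remains
    suspended and returns True if the probe came back.
    """
    for _ in range(max_attempts):
        if not loop_still_present(port, vlan):
            return True                 # no probe returned: resume forwarding
        sleep(retry_interval)           # stay suspended, then probe again
    return False                        # still looped: leave the pair suspended

# Simulated southbound network: the loop is fixed after two failed probes.
probe_results = iter([True, True, False])
probes_sent = []

def fake_probe(port, vlan):
    probes_sent.append((port, vlan))
    return next(probe_results)

recovered = recover("Eth1/1", 10, fake_probe)
```

Because the probe runs while the pair is still suspended, a loop that is still present never gets the chance to carry traffic again during a recovery attempt.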

VXLAN EVPN with Built-In Southbound Loop Detection and Mitigation


Cisco NX-OS 9.3(5) provides native southbound loop detection and mitigation for VXLAN EVPN fabrics. The functionality preserves the loop-free behavior of VXLAN EVPN’s Network Virtualization Overlay when existing Ethernet networks are attached. While there are many use cases that require loop detection and mitigation in a single fabric, the same functionality is available for VXLAN EVPN Multi-Site deployments. For these Multi-Site deployments, loop detection and mitigation supports the detection of backdoor links, the most prevalent cause of multi-site outages during extensions or migrations.

While many loop protection solutions detect loops in the overall topology and shut down the offending ports, VXLAN EVPN Loop Detection and Mitigation defines the topology at the “VLAN-level”. It acts with a granularity comparable to the Per-VLAN Spanning Tree variations (PVST+ and PVRST/802.1w). Unlike Spanning Tree, it does not proactively calculate a forwarding tree. Instead, it takes precautions against loops existing in the southbound network and being introduced into the Overlay. VXLAN EVPN southbound loop detection and mitigation aims to ensure network uptime and avoid unnecessary risks due to loop creation, whether within a single fabric or across multiple fabrics with VXLAN EVPN Multi-Site.

Looping can be accidentally introduced into multi-site fabrics through backdoor links.

Innovative Solutions for Increasing Data Center Resiliency


Increasing the stability of data center fabrics is key to supporting business resiliency, whether for a single on-premises brownfield fabric or when adding new multi-site greenfield fabrics. To optimize application performance and network stability, modern networks need to build upon a consistent, up-to-date platform instead of relying on a patchwork of technologies that can cause more conflicts than resolutions.

Even though modern VXLAN EVPN overlays prevent most looping scenarios natively, combining them with older network topologies can still introduce the risk of corrosive loops. Even carefully designed multi-site VXLAN EVPN data center fabrics can accidentally end up with backdoor links, leading to looping-related performance issues. The NX-OS VXLAN implementation on Cisco Nexus 9000 Series switches addresses the most prevalent loop scenarios within and among multi-site data centers to build and maintain a stable and resilient network architecture for your organization.
