4.5.1. Fabric Unicast Routing
Unicast routing is a critical aspect of CN5000 Omni-Path fabric configuration. When establishing routing paths throughout the network, the SM must carefully balance several competing requirements, optimizing for:
Performance across diverse workloads and applications.
Resilience against network disruptions and component failures.
Prevention of credit loops.
Efficient use of limited network resources, including LIDs and routing table entries.
These requirements often present trade-offs, as optimizing for one factor may negatively impact another. For example, maximizing performance might require more routing table entries, while enhancing resilience could introduce additional hops that affect latency. To address these challenges, the SM provides administrators with configurable routing options. The routing algorithm selection and parameters can be tailored to specific network topologies and workload characteristics, allowing administrators to prioritize the factors most important for their environment. Understanding these routing considerations is essential for administrators seeking to optimize their CN5000 Omni-Path fabric deployment for specific application requirements and operational constraints.
4.5.1.1. Credit Loops
Because the Omni-Path Architecture uses credit-based link-layer flow control, credit loops are possible. Under high stress, a credit loop can become a fabric deadlock, which forces switch timers to discard packets. These deadlocks and discards can cause significant performance impacts.
Credit loop avoidance is a focus of many of the routing algorithms. Credit loops are avoidable in all the popular fabric topologies, and the SM uses algorithms that are designed to avoid them.
Enforcing up/down routing is one method used to avoid credit loops in tree topologies. With up/down routing, when equal-cost paths exist that go up or down the tree, preference is given to the up links. In fabrics whose backbone is made up of Director Class Switches (DCSes), this method is applied with the shortest path algorithm by enabling SpineFirstRouting in the SM configuration. When SpineFirstRouting is enabled, for equal-length paths the SM gives preference to routing traffic from a DCS leaf through the DCS spine, as opposed to using an edge switch external to the DCS. This treats the spine as "up" relative to the leaf and results in clean up/down routing for the fabric. Because all traffic uses consistent routes, credit loops are prevented.
The fat tree routing algorithm also avoids credit loops in fabrics whose backbone is not made up of DCSes; it uses spine-first routing for any type of switch at the core of the fabric.
If a credit loop is suspected, use the CLI command opareport -o validatecreditloop to check the fabric.
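A credit loop exists exactly when the fabric's channel dependency graph contains a cycle: treat each directed link as a "channel," and draw an edge from channel A to channel B whenever some route enters a switch on A and leaves on B, so that A's credits can end up waiting on B's. The following is a minimal sketch of that check; the graph model and function name are illustrative and not the logic of opareport itself:

```python
# Sketch: a credit loop exists iff the channel dependency graph is cyclic.
# Nodes are directed links ("channels"); an edge A -> B means some route
# enters a switch on channel A and leaves on channel B, so A's credits
# can wait on B's. (Illustrative model, not the opareport implementation.)

def has_credit_loop(dependencies):
    """dependencies: dict mapping channel -> set of channels it waits on."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {ch: WHITE for ch in dependencies}

    def dfs(ch):
        color[ch] = GRAY
        for nxt in dependencies.get(ch, ()):
            if color.get(nxt, WHITE) == GRAY:      # back edge: cycle found
                return True
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[ch] = BLACK
        return False

    return any(color[ch] == WHITE and dfs(ch) for ch in dependencies)
```

In practice, opareport performs this kind of analysis against the live fabric's actual routing tables.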
4.5.1.2. Routing Algorithms
For every pair of nodes in the fabric, the routing algorithm selects the path to be used for their communications. The actual paths chosen may be based on the path hop count and the link width/speed for each hop. In the context of all the various node-to-node communication paths possible, the SM statically load balances the number of communication paths being routed over each link in the fabric. A variety of strategies for routing are available in the form of different, selectable routing algorithms.
The SM supports the following routing algorithms:
Shortest Path: The default option; it works very well for most fabrics.
Fat Tree: An optimized, balanced routing specifically for fat tree topologies with credit loop avoidance.
Device Group Shortest Path: A variation of shortest path that can result in better balanced fabrics in some conditions.
The routing algorithm is selectable using the RoutingAlgorithm parameter in the opafm.xml file.
4.5.1.2.1. Shortest Path
The Shortest Path algorithm is the default and works very well for most fabrics. It routes traffic using a least-cost path. In most fabrics, there are many equal-cost paths, in which case, the SM statically balances the number of paths using each ISL.
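The static balancing step can be pictured as a greedy assignment: for each destination, among the equal-cost candidate egress ports, choose the one carrying the fewest routes so far. A minimal sketch (illustrative only; the SM's actual balancing is more sophisticated):

```python
# Sketch of static balancing over equal-cost paths (illustrative; the SM's
# actual algorithm is more sophisticated). For each destination LID, pick
# the least-loaded egress port among the equal-cost candidates.

def balance_routes(destinations, candidates):
    """candidates: dict mapping destination -> list of equal-cost egress ports."""
    load = {}            # egress port -> number of routes assigned so far
    lft = {}             # destination -> chosen egress port
    for dest in destinations:
        ports = candidates[dest]
        best = min(ports, key=lambda p: (load.get(p, 0), p))
        lft[dest] = best
        load[best] = load.get(best, 0) + 1
    return lft
```

With four destinations that all have the same two equal-cost egress ports, this assigns two destinations to each port, spreading the paths evenly across the ISLs.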
SpineFirstRouting is an optional feature of Shortest Path routing. When it is enabled, for equal-length paths the SM always gives preference to routing traffic from a DCS leaf through the DCS spine, as opposed to using a CN5000 Switch external to the DCS. This treats the spine as "up" relative to the leaf and results in clean up/down routing for the fabric. Credit loops are prevented when all traffic uses consistent routes.
Note
If you want to use spine-first routing in fabrics where DCSs are not at the core of the fabric, you can use the fat tree routing algorithm.
SpineFirstRouting is enabled by default and has no ill side effects. Unlike simpler algorithms in other SMs, the SM's Shortest Path algorithm has sophisticated traffic balancing and routing algorithms that allow it to provide high performance for a wide variety of topologies.
4.5.1.2.2. Fat Tree
The Fat Tree algorithm generally provides better balancing of ISL traffic for fat tree topologies than the Shortest Path algorithm, and it provides up/down routing for deadlock avoidance. The Fabric Manager balances ISL traffic by identifying the fat tree topology layout, and determines up/down routing by calculating the tier at which each switch resides in the fabric.
To determine the switch tier, the Fabric Manager needs to understand how many tiers of switch chips there are in the fabric. If all SuperNICs and Target Fabric Interfaces (TFIs) are at the same tier, the Fabric Manager automatically determines the fat tree topology layout. If they are not on the same tier, you can specify which switches are at the core/root of the tree by configuring a CoreSwitches device group as an alternate method of topology identification. Only devices that do not communicate with each other should be connected to the core switches. If devices connected to the core switches communicate with each other they could potentially form a credit loop.
The Fat Tree algorithm also provides the capability of balancing traffic across device groups. You can configure a RouteLast device group to ensure that the devices in this group receive balanced routing; this mechanism can be used to load balance over compute and I/O nodes. For instance, if all compute nodes are in the device group, routing is calculated for the compute nodes first, followed by the I/O nodes, with load balancing across each set of nodes, yielding better traffic dispersion for these devices.
4.5.1.2.3. Device Group Shortest Path
Overall, Device Group Shortest Path (dgshortestpath) routing is a form of Min-Hop or Shortest Path (shortestpath) routing, except that you can control the order in which routes to end nodes are assigned. This can be used to ensure diversity of routing within groups of devices, as well as across the fabric overall. End nodes that are not members of any listed group are routed last.
When the routing algorithm is set to dgshortestpath, the following section in the opafm.xml file is used to configure the algorithm.
<DGShortestPathTopology>
  <!-- RoutingOrder lists the device groups in the order they should -->
  <!-- be handled. Each device group must have been declared in the -->
  <!-- DeviceGroups section. -->
  <!-- <RoutingOrder> -->
  <!--   <DeviceGroup>Compute</DeviceGroup> -->
  <!--   <DeviceGroup>All</DeviceGroup> -->
  <!--   <DeviceGroup>Storage</DeviceGroup> -->
  <!-- </RoutingOrder> -->
</DGShortestPathTopology>
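The effect of a RoutingOrder like the commented example above can be sketched as follows. The device-group contents here are hypothetical, and per the text above, end nodes in no listed group are routed last:

```python
# Sketch of how a RoutingOrder shapes the order in which end nodes are
# routed (hypothetical device-group contents; illustrative only).

def routing_order(groups, order, all_nodes):
    """Return end nodes in the order their routes would be assigned."""
    seen = []
    for name in order:
        members = all_nodes if name == "All" else groups.get(name, [])
        for node in members:
            if node not in seen:
                seen.append(node)
    for node in all_nodes:          # nodes in no listed group are routed last
        if node not in seen:
            seen.append(node)
    return seen
```

With RoutingOrder of Compute, All, Storage, the compute nodes are assigned first, then every remaining node, so the compute group's routes get first choice of the balanced paths.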
4.5.1.3. Adaptive Routing
A limitation of static routing is that it must be done before traffic begins to flow. Static routes are balanced using best guesses by the Fabric Manager and sysadmin of potential application traffic patterns. However, once applications start to run, those routes may not be ideal. Adaptive routing allows the switches to adjust their routes while the applications are running to balance the routes based on actual traffic patterns.
The Cornelis adaptive routing solution is highly scalable because it allows the Fabric Manager to provide the topology awareness and program the switches with the rules for adaptive routing. Thereafter, switches can dynamically and rapidly adjust the routes based on actual traffic patterns.
This approach ensures a scalable solution because as switches are added, each new switch works in parallel with others to dynamically adapt to traffic. Additionally, this approach removes the Fabric Manager as a bottleneck.
Adaptive routing provides a few important capabilities and options:
Adaptive routing can rapidly route around fabric disruptions and lost ISLs.
When adaptive routing is enabled, this capability automatically occurs and limits the amount of lag time between an ISL going down and the traffic being redirected to alternate routes.
Adaptive routing can automatically balance and re-balance the fabric routes based on traffic patterns.
Adaptive routing has the ability to handle changing traffic patterns that may occur due to different computational phases or the impacts of starting or completing multiple applications that are running on the same fabric.
When Fine Grained Adaptive Routing (FGAR) is enabled, the SM programs each switch with a list of alternate, equal-cost routes for each destination in the fabric. When a switch detects that a route is impaired (whether by congestion or by the complete failure of a link or switch), it selects from the list of alternates and updates its linear forwarding table (LFT) with the alternate route.
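The alternate selection supports three policies: Random, Greedy, and GreedyRandom. A minimal sketch of the three choices; this is an illustrative model, and the occupancy metric here stands in for the switch's own congestion measurements:

```python
import random

# Sketch of the three egress-selection policies for FGAR alternates
# (illustrative; occupancy values stand in for the switch's own
# congestion measurements).

def select_egress(alternates, occupancy, algorithm, rng=random):
    """alternates: candidate egress ports; occupancy: port -> busyness metric."""
    if algorithm == 0:                                   # Random
        return rng.choice(alternates)
    least = min(occupancy[p] for p in alternates)
    best = [p for p in alternates if occupancy[p] == least]
    if algorithm == 1:                                   # Greedy: least busy
        return best[0]
    return rng.choice(best)                              # GreedyRandom (default)
```

GreedyRandom combines the other two: it restricts the random choice to the least-busy candidates, avoiding both the herd behavior of pure Greedy and the congestion-blindness of pure Random.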
The following is the AdaptiveRouting section of the opafm.xml file:
<!-- Configures support for AdaptiveRouting in Cornelis Switches -->
<!-- Adaptive Routing monitors the performance of the possible paths between -->
<!-- fabric endpoints and periodically rebalances the routes to reduce -->
<!-- congestion and achieve a more balanced packet load -->
<AdaptiveRouting>
<!-- 1 = Enable, 0 = Disable -->
<Enable>0</Enable>
<!-- When set, Fine Grained routing algorithm will be used to route packets -->
<!-- If set using fat tree, LMC will be replaced by SDR ignoring LMC tags -->
<FineGrained>0</FineGrained>
<!-- When set, only adjust routes when they are lost. -->
<!-- If not set, adjust routes when they are lost and -->
<!-- when congestion is indicated. -->
<LostRouteOnly>0</LostRouteOnly>
<!-- Algorithm the switch should use when selecting an egress port -->
<!-- Algorithms are currently 0 = Random, 1 = Greedy and -->
<!-- 2 = GreedyRandom. Default is 2. -->
<Algorithm>2</Algorithm>
<!-- Update Frequency: Specifies the minimum time between -->
<!-- AR adjustments. Values range from 0 to 7 and are read as 2^n -->
<!-- times 64 ms. Default is 0. -->
<ARFrequency>0</ARFrequency>
<!-- Congestion threshold above which switch uses adaptive routing. -->
<!-- Congestion threshold is per-VL and measured by tag consumption percentage. -->
<!-- Values range from 0 to 7. -->
<!-- 7, 6, 5, 4 correspond to 55%, 60%, 65%, and 70%, respectively. -->
<!-- 3, 2, 1 correspond to 80%, 90%, and 100%, respectively. -->
<!-- 0 means "Use firmware default". Default is 3. -->
<!-- Higher Percentage means higher congestion is required before Adaptive -->
<!-- routing takes control. Higher Percentage is less sensitive, less adaptive. -->
<Threshold>3</Threshold>
</AdaptiveRouting>
The following describes the settings for AdaptiveRouting:
The Algorithm setting selects how the switch chooses an alternate route: Random (choose an alternate at random), Greedy (choose the least busy alternate), or GreedyRandom (if multiple alternates are not busy, choose among them at random).
The Threshold setting determines how busy a route must be before Adaptive Routing reroutes it.
The ARFrequency setting determines how frequently the switch checks for congestion.
If the LostRouteOnly setting is enabled, traffic is rerouted only when a route completely fails. Such rerouting is done only on the switch with the port that has gone down.
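The Threshold encoding described in the configuration comments can be captured as a small lookup. This is a sketch; the trigger condition below is a simplified model of the switch behavior, not its firmware logic:

```python
# Threshold value -> per-VL tag-consumption percentage, per the comments in
# the AdaptiveRouting section above (0 defers to the firmware default).
AR_THRESHOLD_PCT = {7: 55, 6: 60, 5: 65, 4: 70, 3: 80, 2: 90, 1: 100, 0: None}

def is_congested(tag_consumption_pct, threshold):
    """Sketch: would this tag consumption trigger adaptive routing?
    (Simplified model; value 0 is modeled here as never tripping.)"""
    pct = AR_THRESHOLD_PCT[threshold]
    return pct is not None and tag_consumption_pct >= pct
```

Note the inverse relationship: a higher Threshold value maps to a lower percentage, making adaptive routing more sensitive.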
Note
To enable FGAR, both the AdaptiveRouting and FineGrained settings must be enabled.
4.5.1.4. LMC, Dispersive Routing, and Fabric Resiliency
The SM supports LID Mask Control (LMC), which allows more than one LID to be assigned to each end node port in the fabric (specifically, 2^LMC LIDs are assigned). With this, the SM can configure the fabric with multiple routes between each pair of end node ports, allowing applications to load balance traffic across multiple routes or to provide rapid failover using techniques like Alternate Path Migration (APM).
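The LID arithmetic implied by LMC can be sketched as follows: a port with base LID B and LMC = L owns the 2^L consecutive LIDs starting at B, and clearing the low L bits of any of those LIDs recovers the base LID:

```python
# With LID Mask Control, a port with base LID B and LMC = L owns the
# 2**L LIDs B .. B + 2**L - 1; the low L bits of a DLID select the path.

def lid_range(base_lid, lmc):
    return list(range(base_lid, base_lid + (1 << lmc)))

def base_of(lid, lmc):
    return lid & ~((1 << lmc) - 1)     # clear the low LMC bits
```

Each LID in the range can be routed differently through the fabric, which is what gives applications distinct paths to choose from.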
When LMC is configured with a non-zero value, the SM assigns routes with the following goals in priority order:
Each LID for a given destination port is assigned a unique route through the fabric, using the following (in order of preference):
A different switch where possible.
Different ASICs when possible and when a different switch is not possible.
Different ports on the same switch when a different ASIC or switch is not possible.
This approach provides optimal resiliency so that fabric disruptions can be recovered by using APM and other rapid failover techniques.
The overall assignment of Base LIDs to ISLs is statically balanced, such that applications that only use the Base LID will see balanced use of the fabric.
The assignment of alternate LIDs to ISLs is statically balanced, such that applications that use multiple LIDs for load balancing may see additional available bandwidth through the fabric core.
4.5.1.4.1. PathRecord Path Selection
When a non-zero LMC value is used, the SM will have multiple paths available between pairs of nodes. The Fabric Manager permits the configuration of the SM/SA to specify which combinations of paths should be returned and in what order. Most multi-path applications will use the paths in the order given, so the first few returned are typically used for various failover and dispersive routing techniques.
Most applications use only the first path or the first few paths. When LMC != 0, there can be N = 2^LMC addresses per port. This means there are N^2 possible combinations of SLID and DLID that the SA could return in the Path Records. However, only N combinations represent distinct outbound and return paths; all other combinations are different mixtures of those N outbound and N return paths.
Also important to note is that LMC for all SuperNICs is typically the same, while LMC for switches will usually be less. Generally, redundant paths or having a variety of paths is not critical for paths to switches (which are mainly used for management traffic), but can be important for applications communicating SuperNIC to SuperNIC.
The FM Path Selection parameter controls which combinations are returned and in what order. For the examples below, assume SGID LMC=1 (2 LIDs) and DGID LMC=2 (4 LIDs):
Minimal – Return no more than one path per LID: SLID1/DLID1, SLID2/DLID2 (since the SGID has only 2 LIDs, stop there).
Pairwise – Cover every LID on both sides at least once: SLID1/DLID1, SLID2/DLID2, SLID1/DLID3, SLID2/DLID4.
OrderAll – Cover every combination, starting with the Pairwise set: SLID1/DLID1, SLID2/DLID2, SLID1/DLID3, SLID2/DLID4, SLID1/DLID2, SLID1/DLID4, SLID2/DLID1, SLID2/DLID3.
SrcDstAll – Cover every combination, simply all sources by all destinations: SLID1/DLID1, SLID1/DLID2, SLID1/DLID3, SLID1/DLID4, SLID2/DLID1, SLID2/DLID2, SLID2/DLID3, SLID2/DLID4.
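The four orderings can be sketched as follows (illustrative; the numeric LIDs in the usage below stand in for SLID1/SLID2 and DLID1..DLID4 from the example):

```python
# Sketch of the four Path Selection orderings (illustrative). slids and
# dlids are the source and destination LID lists (lengths 2**LMC).

def pairwise(slids, dlids):
    """Cover every LID on both sides at least once."""
    n = max(len(slids), len(dlids))
    return [(slids[i % len(slids)], dlids[i % len(dlids)]) for i in range(n)]

def minimal(slids, dlids):
    """No more than one path per LID: stop at the shorter side."""
    return pairwise(slids, dlids)[:min(len(slids), len(dlids))]

def order_all(slids, dlids):
    """Every combination, but starting with the pairwise set."""
    first = pairwise(slids, dlids)
    rest = [(s, d) for s in slids for d in dlids if (s, d) not in first]
    return first + rest

def src_dst_all(slids, dlids):
    """Every combination: all sources by all destinations."""
    return [(s, d) for s in slids for d in dlids]
```

Running these with 2 SLIDs and 4 DLIDs reproduces the four listings shown above.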
4.5.1.4.2. Handling Fabric Changes When Using Dispersive Routing
Programming using Directed Route (DR) Subnet Management Packets (SMPs) is less performant than LID Routing (LR) because significant latency is introduced at each intermediate hop on the path to the destination. LR SMPs are therefore used whenever possible to improve performance at scale. When programming linear forwarding tables (LFTs) on switches, a mixed LR-DR approach is used: LR routing up to the last hop in the path, then DR for the last hop. This programs the minimal amount of LFT route data (a maximum of two LFT blocks) needed to enable the LR path from the SM to the destination switch. The SM then switches to pure LR SMPs to program the remaining LFT data.
A wave model is used to program the routing tables. The first wave of switches includes the set of switches connected to the SM. The next wave starts with the SM doing LID-based routing to these connected sets of switches and initializing the next set of switches that are connected to these switches with one hop DR from those switches and so on. This wave order also ensures that each new wave is fully LID routable from the SM. That is, the routes to the new wave are such that they will always pass through switches that already have been fully initialized.
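The wave ordering itself is a breadth-first traversal outward from the SM. A minimal sketch (illustrative topology model; the adjacency lists in the usage below are hypothetical):

```python
# Sketch of the wave ordering (illustrative): wave 0 is the set of switches
# directly connected to the SM node; each later wave is the set of
# not-yet-visited switches adjacent to the previous wave.

def waves(adjacency, sm_node):
    """adjacency: dict mapping node -> list of neighboring switches."""
    visited = {sm_node}
    frontier = sorted(n for n in adjacency[sm_node] if n not in visited)
    result = []
    while frontier:
        result.append(frontier)
        visited.update(frontier)
        frontier = sorted({nbr for n in frontier for nbr in adjacency[n]} - visited)
    return result
```

Because each new wave is reached only through already-visited switches, every wave is fully LID routable from the SM by construction.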
To minimize disruption while programming the LFT data, full LFT blocks are written in parallel for the set of switches in each wave by striping LFT blocks across the switches. For example, block N is written for all switches, then block N+1, and so on.
When handling fabric changes, the fabric is analyzed to reduce the number of changes to the routing tables in an effort to minimize fabric disruptions. If SuperNICs are added or deleted, the LFT changes will be limited to the affected routes. Instead of recalculating routing tables, the loss of a SuperNIC results in the removal of the route for the specific SuperNIC. The addition of a SuperNIC results in the insertion of the route into the existing LFT blocks. Only those LFT blocks with added or deleted SuperNIC routes will be programmed on the switches. This reduces fabric programming time and minimizes fabric disruption due to unnecessary routing recalculations.
4.5.1.4.3. LMC Best Practices
Cornelis recommends that dispersive routing be enabled to take advantage of the potentially significant performance improvements, especially in highly oversubscribed fabrics. The recommended LMC value is either 2 or 3, depending on the size of the fabric.
When LMC=3, eight LIDs will be assigned to each end node port that could, for larger fabrics (over 4k nodes), result in the Fabric Manager using an excessive number of LIDs. Cornelis recommends using an LMC value of 3 for fabrics less than 4k nodes, and 2 for fabrics larger than 4k nodes.
The performance benefits realized when using dispersive routing will depend on many factors, including message sizes of the application and contention with other applications running on the fabric.