4.5.2. Fabric Multicast Routing

In addition to unicast routing, Omni-Path Architecture supports multicast routing, which allows a single packet sent by an application to be delivered to many recipients. Omni-Path multicast routing is used for IPoIB broadcast for the TCP/IP Address Resolution Protocol (ARP) and for IPoIB multicast for UDP/IP multicast applications. It can also be used directly by other applications.

CN5000 supports separate routes per multicast group. Each multicast group is identified by a unique 128-bit Multicast GID. Within the fabric, the SM assigns each active multicast group a 24-bit Multicast LID.

To implement multicast routing, the SM must construct spanning tree routes throughout the fabric that will deliver exactly one copy of each sent packet to every interested node. The SM must configure such routes within the limitations of the hardware. Specifically:

  • Different switches and ISLs may have varied MTU and speed capabilities.

  • The fabric topology and hardware may change after applications have joined a multicast group.

To support efficient yet dependable multicast routing, the SM allows you to configure and control how multicast routes are built. The root selection algorithm gives you the ability to influence the choice of the root of the spanning tree. The default is to choose a switch at the core of the fabric, that is, one with the least total cost to all other switches.

4.5.2.1. Handling Fabric Changes When Using Multicast Groups

At the time an application creates or joins a multicast group, the SM determines if there is a path with the appropriate speed and MTU to meet the requested capabilities of the multicast group.

However, later fabric changes (such as removal or downgrade of high-speed links, loss of switches, changes to switch MTU, or speed configuration) could make the multicast group unusable. Unfortunately, for these changes, there are no OpenFabrics Alliance APIs to notify end nodes that the multicast group is no longer viable.

To address this situation, the SM performs stricter multicast checking at Join/Create time. This means a multicast Join/Create is rejected if any switch-to-switch link does not have at least the MTU or rate requested for the multicast group, reducing the chance that a simple fabric failure or change (for example, loss of one link) could make the group unusable.

The DisableStrictCheck parameter controls this capability. When set to 1, this parameter disables the strict checking and accepts Join/Create requests for which at least one viable fabric path exists. By default, the parameter is set to 0, which enables the stricter checking.

In addition, the MLIDTableCap parameter is used to configure the maximum number of Multicast LIDs available in the fabric. This must be set to a value less than or equal to the smallest Multicast Forwarding Table (MFT) size of all the switches that may be in the fabric. It defaults to 1024, which is below the capability of all current Omni-Path switches. Using a value below the capability of the switches can prevent errant applications from creating an excessive number of multicast groups.
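
For reference, the following is a minimal sketch of how these two parameters might appear in the Multicast portion of the Sm section of the FM configuration file. The element nesting follows the Sm.Multicast parameter naming used in this chapter, but the surrounding layout and the values shown are illustrative assumptions; consult the sample configuration file shipped with the FM for the authoritative syntax.

    <!-- Illustrative sketch only; verify the element layout against the
         sample FM configuration file before use. -->
    <Sm>
      <Multicast>
        <!-- 0 (default) = strict Join/Create checking of all ISLs;
             1 = accept a Join/Create when at least one viable path exists -->
        <DisableStrictCheck>0</DisableStrictCheck>
        <!-- Maximum number of Multicast LIDs the SM may assign; must be less
             than or equal to the smallest switch MFT size in the fabric -->
        <MLIDTableCap>1024</MLIDTableCap>
      </Multicast>
    </Sm>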

4.5.2.2. Conserving Multicast LIDs

IPv6 (and possibly other applications) can create numerous multicast groups. In the case of IPv6, there is one Solicited-Node multicast group per SuperNIC/TFI port, which can result in an excessively large number of multicast groups. In large fabrics, this quickly exceeds MLIDTableCap; for example, a 10,000-node fabric with IPv6 would need over 10,000 multicast groups.

To address this situation, the SM can share a single MLID among multiple Multicast groups. Such sharing means that both the routes and destinations are shared. This may deliver some unrequested multicast packets to end nodes. However, unneeded packets are silently discarded by the transport layer in the SuperNIC/TFI and have no impact on applications.

The SM allows you to configure sets of multicast groups that will share a given pool of Multicast LIDs. This is accomplished using the MLIDShare sections in the configuration file.

MLID sharing conserves entries in the hardware MLID tables so that other uses of multicast remain efficient.

By default, the SM shares a pool of 500 LIDs among all IPv6 Solicited-Node multicast groups. Thus, in fabrics of 500 nodes or fewer, a unique LID is used for every such multicast group. In larger fabrics, LIDs are shared so that there are still over 500 unique LIDs available for other multicast groups, such as the IPoIB broadcast group and other multicast groups that may be used by applications.

It is also possible to specify the maximum number of MLIDs within an MLIDShare that a single PKey can consume. In configurations with many vFabrics, this ensures no vFabric exhausts the entire MLID pool.
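
As an illustration, an MLIDShare section along the lines of the following sketch defines one shared pool. The child element names (MGIDMask, MGIDValue, MaxMLIDs, MaxMLIDsPerPKey) and the mask/value pair intended to match IPv6 Solicited-Node MGIDs are assumptions for illustration; check the sample configuration file shipped with the FM for the exact names and the default IPv6 Solicited-Node entry.

    <!-- Illustrative MLIDShare sketch: groups whose MGIDs match MGIDValue
         under MGIDMask share a pool of 500 MLIDs. Element names and the
         mask/value pair are assumptions for illustration. -->
    <MLIDShare>
      <Enable>1</Enable>
      <MGIDMask>0xFF12601BFFFF0000:0x00000001FF000000</MGIDMask>
      <MGIDValue>0xFF12601BFFFF0000:0x00000001FF000000</MGIDValue>
      <!-- Size of the shared MLID pool -->
      <MaxMLIDs>500</MaxMLIDs>
      <!-- Optional cap on how many MLIDs of this pool a single PKey
           (vFabric) may consume -->
      <MaxMLIDsPerPKey>100</MaxMLIDsPerPKey>
    </MLIDShare>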

4.5.2.3. Pre-Created Multicast Groups

The first end node that joins a multicast group also creates the multicast group. When a multicast group is created, critical parameters such as the MTU and speed of the multicast group are also established. The selection of these values must carefully balance the performance of the multicast group against the capabilities of the hardware, which may need to participate in the group in the future. For example, if an application on a SuperNIC with a 4K MTU creates a 4K multicast group, it prevents subsequent joins of the group by 2K MTU SuperNICs.

Some ULPs and applications, such as IPoIB, require key multicast groups, such as the IPv4 broadcast group, to be pre-created by the SM.

Pre-created multicast group configurations are specified in the MulticastGroup sections of the SM configuration files. When multicast groups are pre-created, their MTU and speed are defined by the SM configuration file, allowing you to account for anticipated hardware capabilities and required performance.

To simplify typical configurations, a MulticastGroup section that does not explicitly specify any MGIDs implicitly includes all the multicast groups specified by the IPv4 and IPv6 standards for IPoIB. As such, typical clusters using IPoIB do not need to explicitly list the MGIDs for IPoIB.
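
For illustration, a pre-created group might be defined along the lines of the following sketch. The child elements shown (Create, PKey, MTU, Rate, SL) and their values are assumptions chosen to mirror the parameters discussed above, not a recommendation; the sample configuration file shipped with the FM documents the exact syntax and defaults.

    <!-- Illustrative MulticastGroup sketch. With no MGID listed, the
         standard IPv4/IPv6 IPoIB groups are implicitly included, as
         described above. Values are placeholders for illustration. -->
    <MulticastGroup>
      <Create>1</Create>    <!-- pre-create the group(s) at SM startup -->
      <PKey>0x7fff</PKey>   <!-- partition the group belongs to -->
      <MTU>2048</MTU>       <!-- 2K MTU so 2K-capable SuperNICs can still join -->
      <Rate>100g</Rate>     <!-- group rate; choose the lowest anticipated link rate -->
      <SL>0</SL>            <!-- service level used by the group -->
    </MulticastGroup>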

4.5.2.4. Multicast Spanning Tree Root

Multicast routing is performed by computing a spanning tree for the fabric. When constructing a spanning tree for a multicast group, MTU and rate must be considered. The SM first constructs a spanning tree for the largest MTU and rate found in the fabric. If that spanning tree is complete (that is, it includes all switches), it is sufficient for all groups because it also supports all smaller MTUs and rates. If it is not complete, the SM tries MTUs and rates smaller than the maximum until a complete spanning tree is computed. Using a common spanning tree reduces the computational time required for fabric programming.

The spanning tree has a root switch and spans the fabric to reach all of the switches and SuperNICs that are members of the multicast group. Because this tree is common across multicast groups, the SM computes the MFT data associated with each switch only once. This common data is augmented with the ports participating in a given multicast group at each switch when programming the MFT.

The FM allows the root of the spanning tree to be configured using the Sm.Multicast.RootSelectionAlgorithm parameter. A goal of the spanning tree calculation is to select a switch at the center of the fabric as the root of the spanning tree. A good option is a switch that has the least total cost to all other switches. Another option is a switch that has the least worst-case cost to other switches.

When the fabric changes, the root switch selected by either of these cost options can change, resulting in reconstruction of the spanning tree and potential disruption due to reprogramming of the MFTs. The SM’s MinCostImprovement parameter determines how much improvement is needed before a new spanning tree root is selected. Using these parameters, disruption to in-flight multicast traffic can be avoided or limited to cases where the fabric has changed significantly enough that the benefit justifies a change.
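
As a sketch, both parameters live under the Sm.Multicast section of the configuration file (the parameter path Sm.Multicast.RootSelectionAlgorithm is given above). The algorithm name string and the MinCostImprovement value below are illustrative assumptions describing the least-total-cost option; consult the FM configuration reference for the accepted values.

    <!-- Illustrative sketch of spanning tree root selection tuning.
         The algorithm name string and the improvement value are
         assumptions for illustration. -->
    <Sm>
      <Multicast>
        <!-- Root selection: least total cost to all other switches
             (a least worst-case cost option is also described above) -->
        <RootSelectionAlgorithm>LeastTotalCost</RootSelectionAlgorithm>
        <!-- Minimum improvement required before the SM moves the root
             and reprograms the spanning tree / MFTs -->
        <MinCostImprovement>25</MinCostImprovement>
      </Multicast>
    </Sm>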

The SM’s DB Sync capability synchronizes the multicast root between the primary and standby SMs. During SM failover, the multicast root can therefore be retained, limiting the disruption to multicast traffic in flight.

4.5.2.5. Multicast Spanning Tree Pruning

A complete tree unconditionally includes all switches. When SuperNICs request to join or leave a multicast group, the SM needs to program only the switch immediately adjacent to the SuperNIC. This is optimized to program only those MFT blocks that have changed since the last sweep.

A pruned tree omits switches that do not have SuperNICs as members of the group, as well as intermediate switches that are not needed to reach members. A pruned tree reduces multicast traffic internal to the fabric when only a small subset of nodes is part of a given multicast group. However, the time to add or remove SuperNICs from the group can be significantly higher because many intermediate switches may also need to be programmed for the group.

The default is a complete tree, which has been found to work very well in HPC environments. Such environments typically have very little multicast traffic, with the vast majority being IPoIB ARP packets that need to be broadcast to all nodes running IPoIB. The default also allows IPoIB hosts to come up and down more quickly, as may be common in environments that restart compute nodes between jobs.