Cornelis Technical Documentation

4.3.4. Oversubscription

The preceding examples of 2-tier fat trees (Figure 64 and Figure 65) assume a non-blocking configuration with a 1:1 subscription. Subscription is the ratio of the total bandwidth below a given tier (toward the endpoints) to the total bandwidth above it (toward the next tier). In the examples, every switch within the leaf tier supports 9.6 Tbps worth of SuperNIC links and 9.6 Tbps worth of ISLs to the spine tier, which results in a ratio of 1:1. If the SuperNIC link bandwidth were to exceed the ISL bandwidth, the ratio would rise above 1:1, resulting in an oversubscribed fabric.
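As a minimal sketch of the definition above, the ratio for a single switch can be computed from its downlink and uplink bandwidth. The function names `subscription_ratio` and `describe` are illustrative only and are not part of any Cornelis tool; the 9600 Gbps figures are the 9.6 Tbps values from the text.

```python
from fractions import Fraction


def subscription_ratio(downlink_gbps: int, uplink_gbps: int) -> Fraction:
    """Return downlink:uplink bandwidth for one switch as a reduced fraction."""
    if uplink_gbps <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return Fraction(downlink_gbps, uplink_gbps)


def describe(ratio: Fraction) -> str:
    """Format a ratio as 'down:up' and say whether it is oversubscribed."""
    label = "oversubscribed" if ratio > 1 else "not oversubscribed"
    return f"{ratio.numerator}:{ratio.denominator} ({label})"


# Leaf switch from the text: 9.6 Tbps of SuperNIC links, 9.6 Tbps of ISLs.
print(describe(subscription_ratio(9600, 9600)))
# Hypothetical leaf with twice as much downlink as uplink bandwidth.
print(describe(subscription_ratio(12800, 6400)))
```

The first call reports a 1:1 ratio; the second, hypothetical, case reports 2:1.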

While a wide range of subscription ratios could be considered valid from a topological standpoint, the desire to fully utilize equipment and the fact that switches generally offer even port counts tend to limit practical oversubscription ratios to integer values.
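To illustrate which integer ratios use every port on a switch, the following sketch enumerates the downlink/uplink splits whose ratio is a whole number. It assumes all links have equal bandwidth and uses a 48-port switch, a port count inferred from the 32 + 16 split in the example that follows; `integer_ratio_splits` is a hypothetical helper name.

```python
def integer_ratio_splits(ports: int) -> list[tuple[int, int, int]]:
    """List (ratio, downlinks, uplinks) splits that use every port and
    have an integer downlink:uplink ratio of at least 1."""
    splits = []
    for uplinks in range(1, ports // 2 + 1):
        downlinks = ports - uplinks
        if downlinks % uplinks == 0:
            splits.append((downlinks // uplinks, downlinks, uplinks))
    return sorted(splits)


for ratio, down, up in integer_ratio_splits(48):
    print(f"{ratio}:1 -> {down} downlinks, {up} uplinks")
```

Under these assumptions the list starts with 1:1 (24 and 24) and 2:1 (32 and 16). Not every integer appears: a 4:1 ratio, for example, cannot use all 48 ports.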

Consider a 2:1 oversubscribed 2-tier fat tree using only native links as shown in the following figure. The switches in the leaf tier provide 32 SuperNIC links and only 16 ISLs to the spine tier. Since all links are of equivalent bandwidth, this results in a 2:1 ratio of bandwidth across the leaf tier. Though we have introduced a blocking configuration, and therefore a potential performance penalty, the number of endpoints connected to each leaf switch has increased by a third, allowing for a maximum of 1536 total endpoints. If the SuperNIC links were also subdivided, the maximum allowable endpoints would once again double, allowing up to 3072 endpoints.

Figure 66. 2:1 Oversubscribed 2-Tier Fat Tree
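The endpoint counts quoted for the 2:1 example can be reproduced with a short calculation. This is a sketch under stated assumptions rather than a sizing tool: it assumes 48-port switches (inferred from 32 + 16), one ISL between each leaf and each spine so that a spine can serve at most 48 leaf switches, and a hypothetical `endpoints_per_port` parameter standing in for subdivided SuperNIC links.

```python
def max_endpoints(radix: int, downlinks_per_leaf: int,
                  endpoints_per_port: int = 1) -> int:
    """Maximum endpoints of a 2-tier fat tree under the assumptions above."""
    uplinks_per_leaf = radix - downlinks_per_leaf
    if not 0 < uplinks_per_leaf < radix:
        raise ValueError("each leaf needs at least one downlink and one uplink")
    # Each spine port connects to a different leaf, so a spine with
    # `radix` ports can serve at most `radix` leaf switches.
    max_leaves = radix
    return max_leaves * downlinks_per_leaf * endpoints_per_port


print(max_endpoints(48, 24))     # 1:1 split (24 down, 24 up)
print(max_endpoints(48, 32))     # 2:1 split (32 down, 16 up)
print(max_endpoints(48, 32, 2))  # 2:1 split with subdivided SuperNIC links
```

With these inputs the sketch yields 1152, 1536, and 3072 endpoints. The last two match the figures in the text, and 1536 is one third more than 1152.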

In practice, the cost-benefit analysis of oversubscription is not as simple as trading more endpoints for more bottlenecks and congestion. The traffic patterns of many applications exhibit large streams between adjacent or nearby endpoints, resulting in a higher concentration of intra-switch messages than inter-switch messages. Depending on the traffic pattern of the applications in question, a 2:1 oversubscription ratio may not meaningfully impact performance through congestion at the leaf switches. Such an oversubscribed fabric also uses less equipment per endpoint than a 1:1 subscribed configuration, which lowers the cost per endpoint.
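The equipment-per-endpoint claim can be made concrete with a rough count of switches and ISLs. The sketch below assumes fully populated fabrics built from 48-port switches: 48 leaves and 24 spines for the 1:1 case, and 48 leaves and 16 spines for the 2:1 case. These switch counts are inferred from the port splits, not taken from the source, and `per_endpoint_cost` is a hypothetical helper.

```python
from fractions import Fraction


def per_endpoint_cost(leaves: int, spines: int, isls_per_leaf: int,
                      endpoints_per_leaf: int) -> dict:
    """Return endpoint count plus switches and ISLs needed per endpoint."""
    endpoints = leaves * endpoints_per_leaf
    return {
        "endpoints": endpoints,
        "switches_per_endpoint": Fraction(leaves + spines, endpoints),
        "isls_per_endpoint": Fraction(leaves * isls_per_leaf, endpoints),
    }


non_blocking = per_endpoint_cost(leaves=48, spines=24,
                                 isls_per_leaf=24, endpoints_per_leaf=24)
oversubscribed = per_endpoint_cost(leaves=48, spines=16,
                                   isls_per_leaf=16, endpoints_per_leaf=32)

for name, cost in (("1:1", non_blocking), ("2:1", oversubscribed)):
    print(name, cost["endpoints"], cost["switches_per_endpoint"],
          cost["isls_per_endpoint"])
```

Under these assumptions the 1:1 fabric needs one switch per 16 endpoints and one ISL per endpoint, while the 2:1 fabric needs one switch per 24 endpoints and one ISL per two endpoints.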

On the other hand, if the traffic pattern of the applications in question results in frequent messages between all endpoints, for example when using MPI collective operations, oversubscription can significantly impact performance. In such cases, the bottlenecks created at the leaf tier can lead to increased latency and reduced effective bandwidth as messages queue for transmission across the limited ISLs to the spine tier. This congestion can be particularly problematic for latency-sensitive applications where even small delays can cascade into substantial performance degradation.
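The contrast between localized and all-to-all traffic can be sketched with a deliberately simple load model. It assumes every endpoint injects at full link rate, that a fixed fraction of that traffic leaves the leaf switch, and that links run at 400 Gbps (inferred from 9.6 Tbps across 24 links). It ignores routing, adaptive load balancing, and congestion control, so it indicates a trend rather than predicting performance. The function names are hypothetical.

```python
def leaf_uplink_load(endpoints_per_leaf: int, isls_per_leaf: int,
                     link_gbps: float, inter_switch_fraction: float) -> float:
    """Offered inter-switch traffic divided by ISL capacity for one leaf.

    A value above 1.0 means the ISLs are a bottleneck in this model.
    """
    offered = endpoints_per_leaf * link_gbps * inter_switch_fraction
    capacity = isls_per_leaf * link_gbps
    return offered / capacity


def uniform_inter_switch_fraction(total_endpoints: int,
                                  endpoints_per_leaf: int) -> float:
    """Fraction of peers on other leaves when traffic is spread evenly."""
    return (total_endpoints - endpoints_per_leaf) / (total_endpoints - 1)


# Mostly local traffic: only a quarter of the bytes leave the leaf switch.
local = leaf_uplink_load(32, 16, 400, 0.25)

# Uniform all-to-all traffic across the 1536-endpoint 2:1 fabric.
spread = uniform_inter_switch_fraction(1536, 32)
all_to_all = leaf_uplink_load(32, 16, 400, spread)

print(f"local traffic load: {local:.2f}")
print(f"all-to-all load:    {all_to_all:.2f}")
```

In this model the localized pattern uses half of the ISL capacity, while the uniform pattern offers nearly twice what the ISLs can carry. The break-even point for a 2:1 leaf is half of the traffic leaving the switch.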

The decision to implement an oversubscribed fabric requires careful consideration of the expected workloads, the performance requirements, and the cost constraints of the deployment. Many high-performance computing environments settle on a moderate level of oversubscription to balance performance needs against practical resource limitations.