Cornelis Technical Documentation

4.2.3. Switching, Hops, and Latency

In a fabric, one-way latency is defined as the time between a message being sent by the CN5000 SuperNIC of one host and it being received by the SuperNIC of the target host (generally measured in nanoseconds). Though there are several sources of latency, the following are most prominent.

  • Switching latency is the most severe and most discussed form of latency. This is the time between a switch receiving a message and that switch transmitting it from the required port. Its causes are varied, but in general the operations required to read a message header, determine the required outlet port, and forward the message to that port all take some amount of time. CN5000 switches keep this form of latency extremely low, at less than 100 ns per switch. Because switching latency is not negligible, cannot be eliminated, and is incurred at every switch a message crosses, it can affect the performance of a topology.

    Though the exact cause of switching latency may be difficult to determine, latency can be discussed in terms of hops. A hop is one traversal of a switch by a message. If a SuperNIC is connected directly to another SuperNIC, a so-called back-to-back configuration, the path is a 0-hop path. If a message is sent from a SuperNIC to a CN5000 Switch and then to another SuperNIC connected to that same switch, the path is a 1-hop path. If that same CN5000 Switch instead sends the message to a second CN5000 Switch, which then sends it to a third CN5000 Switch to which the target SuperNIC is connected, the path is a 3-hop path. In this way, switching latency can be discussed as a simple integer rather than an exact value, which allows for easy comparisons of paths and topologies.

  • Propagation delay is another major source of latency. This is the time it takes for a message to travel the length of a given cable between devices and is dependent on the cable material.

    • For Omni-Path copper cables, the delay is 4.33 ns per meter.

    • For Omni-Path optical cables, the delay is 5.48 ns per meter.

    As copper cables are generally only a few meters long and used for intra-rack connectivity, it is generally acceptable to neglect this source of latency for copper. Long optical cables, however, may have non-negligible latencies. For example, a 50-meter optical cable has a propagation delay of 50 × 5.48 ns = 274 ns. This exceeds the switching delay of a 2-hop path, which is under 200 ns at less than 100 ns per switch.

    The minimization of latency is vital to the performance of many applications. Many engineering applications, for example, split a given problem into time-steps or sweeps of a physical domain. These applications are unable to progress to the next time-step or the next sweep until all processes in the fabric have confirmed to the controller process that they have completed their work. Assuming ideal processes and a constant latency throughout the fabric, this means that the processes are idle during the latency period, thus wasting computational time.

    In most applications, the processes must also share data during the computation itself, and in many cases must wait for that data sharing to be completed. The amount of idle time can be dramatic in a high-latency fabric. In some cases, the latency of the path to a single node can also affect overall performance. Even if no processes need to share any data, but do need to wait for each other to finish before moving on in the computation, one process that is sending data over a path with an exceptionally high latency can cause high idle times in all of the other processes.

  • Tail latency refers to the longest latency experienced by messages in a fabric. The longer the tail latency, both as an absolute value and with respect to the average fabric latency, the more likely and more severe this type of performance bottleneck becomes.
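
The figures above can be combined into a rough estimate. The following Python sketch is illustrative only: it assumes that one-way latency is simply the sum of per-hop switching latency and cable propagation delay, uses 100 ns per hop as an upper bound, and ignores every other source of latency. The function names `path_latency_ns` and `idle_time_ns` are hypothetical and are not part of any Cornelis tool or API.

```python
# Figures quoted in this section, used here as illustrative inputs.
SWITCH_LATENCY_NS = 100.0   # upper bound per hop ("< 100 ns")
COPPER_NS_PER_M = 4.33      # Omni-Path copper cable
OPTICAL_NS_PER_M = 5.48     # Omni-Path optical cable


def path_latency_ns(hops, copper_m=0.0, optical_m=0.0):
    """Estimate one-way latency for a path (assumption: switching
    latency plus propagation delay only, nothing else modeled)."""
    switching = hops * SWITCH_LATENCY_NS
    propagation = copper_m * COPPER_NS_PER_M + optical_m * OPTICAL_NS_PER_M
    return switching + propagation


def idle_time_ns(latencies_ns):
    """Return how long each process waits if every process must wait
    for the slowest path (the tail latency) before continuing."""
    tail = max(latencies_ns)
    return [tail - latency for latency in latencies_ns]


if __name__ == "__main__":
    # Hypothetical paths: total cable length is summed over all links.
    paths = {
        "same switch, copper": path_latency_ns(hops=1, copper_m=4),
        "3 hops, copper": path_latency_ns(hops=3, copper_m=8),
        "3 hops, long optical": path_latency_ns(hops=3, copper_m=4,
                                                optical_m=100),
    }
    waits = idle_time_ns(list(paths.values()))
    for (name, latency), wait in zip(paths.items(), waits):
        print(f"{name}: {latency:.1f} ns one-way, idle {wait:.1f} ns")
```

In this model, the path with the long optical run sets the tail latency, and every process on a shorter path idles for the difference at each synchronization point.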