7.4.1.3. Diagnosing Bad Cables
Once a suspect cable is identified, several automated and manual processes can help you determine whether it is bad and needs to be replaced.
7.4.1.3.1. Intermittent Link Quality Issues and Port Bounces
The Link Quality Indicator (LQI) provides a simple way to identify links with poor signal integrity. Links with poor LQI will be visible in PM logs, opatop, opainfo, opalinkanalysis, opareport, and other tools. Be aware that low LQI can occur when links are intentionally going down, such as during a device reboot or cable maintenance (replacement, reseating, and more). For more information on LQI, refer to Link Quality Indicator (LQI).
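As a rough illustration of the filtering described above, the following sketch flags links with poor LQI while skipping links that were taken down on purpose. The `LinkSample` record, the `in_maintenance` flag, and the threshold value are all assumptions made for the example; they are not fields or values produced by any of the tools named above.

```python
from dataclasses import dataclass

# Assumption: LQI is collected as a small integer where lower means worse.
# The threshold is illustrative only, not a documented value.
POOR_LQI_THRESHOLD = 3


@dataclass
class LinkSample:
    link: str             # hypothetical link label
    lqi: int              # LQI value gathered from your own tooling
    in_maintenance: bool  # True if the link was intentionally taken down


def suspect_links(samples, threshold=POOR_LQI_THRESHOLD):
    """Return labels of links with LQI below threshold, ignoring planned downtime."""
    return [
        s.link
        for s in samples
        if s.lqi < threshold and not s.in_maintenance
    ]


samples = [
    LinkSample("sw1:p1 <-> node01:p1", lqi=5, in_maintenance=False),
    LinkSample("sw1:p2 <-> node02:p1", lqi=1, in_maintenance=False),
    LinkSample("sw1:p3 <-> node03:p1", lqi=0, in_maintenance=True),
]
print(suspect_links(samples))
```

Only the second link is reported: the third also has a low LQI, but it is excluded because it was down for maintenance.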
Intermittent link quality issues often lead to random port bounces: when the link exceeds a Bit Error Threshold, it attempts to retrain. This may indicate that a port or cable is in the process of failing. You can perform the Cable Swap Test to quickly determine whether the issue is the port or the cable.
If the failure coincides with physical maintenance in the area (or at the port), reseating the cable, card, or blade may fix the issue.
7.4.1.3.2. High Error Counts
Once a link is identified as possibly bad, you can observe its performance counters while the ports are under various levels of load to narrow down the issue. For instance, a port with a significantly high error rate may point to a link whose cable needs to be replaced. For a description of the various port counters, refer to Port Counters.
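One way to compare a port under different loads is to turn two counter snapshots into an error rate. The sketch below assumes you have already collected cumulative counter values into dictionaries; the counter name `errors` is a placeholder, not a real port counter name, and the reset handling is an assumption about how a cleared counter would appear.

```python
def error_rate(before, after, interval_s):
    """Return errors per second for each counter between two snapshots.

    before, after: dicts mapping a counter name to its cumulative count.
    interval_s: seconds elapsed between the two snapshots.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for name, end in after.items():
        start = before.get(name, 0)
        # Assumption: a value lower than before means the counter was
        # cleared in between, so count from zero.
        delta = end - start if end >= start else end
        rates[name] = delta / interval_s
    return rates


# Hypothetical snapshots taken 60 seconds apart, idle and under load.
idle = error_rate({"errors": 10}, {"errors": 12}, 60)
loaded = error_rate({"errors": 12}, {"errors": 612}, 60)
print(idle["errors"], loaded["errors"])
```

A rate that rises sharply under load, as in the second pair of snapshots, is the kind of signal the paragraph above describes.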
7.4.1.3.3. Slow Links (LinkWidthDowngrade)
Slow links are links that are not operating at the maximum supported bandwidth. For Omni-Path 400G, normal link operation is a 4x lane width with a lane speed of 100G.
Slow links usually indicate that a LinkWidthDowngrade event has occurred: one or more lanes experienced enough link issues that they could not continue to run, and the affected lanes were dropped to avoid bouncing the whole link.
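The arithmetic behind a slow link can be sketched as follows, using the 4x width and 100G lane speed given above. The function name and return shape are hypothetical, and the bandwidth figure is nominal (active lanes times lane speed); it is a simplification that treats both directions of the link as having the same active width.

```python
FULL_LANES = 4         # normal 4x lane width, per the text above
LANE_SPEED_GBPS = 100  # lane speed of 100G, per the text above


def link_status(active_lanes):
    """Describe a link by its number of active lanes (hypothetical helper)."""
    if not 0 <= active_lanes <= FULL_LANES:
        raise ValueError("active_lanes must be between 0 and FULL_LANES")
    return {
        "active_lanes": active_lanes,
        "bandwidth_gbps": active_lanes * LANE_SPEED_GBPS,
        # Zero lanes is treated as a down link, not as a downgraded one.
        "downgraded": 0 < active_lanes < FULL_LANES,
    }


for lanes in (4, 3, 2, 1):
    print(link_status(lanes))
```

A link that has dropped to two lanes, for example, is reported as downgraded with a nominal 200G instead of 400G.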
7.4.1.3.4. Cable Swap Test
The Cable Swap Test is a simple way to determine whether a link problem is caused by the cable.
Replace the suspect cable with a known-good cable and observe the results:
If the problem goes away, then the old cable was bad.
If the problem does not go away, then the problem resides elsewhere, such as in the port hardware on either end of the link.
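The decision in the Cable Swap Test can be written down as a small helper. This is only a sketch: the function name is hypothetical, and the idea of recording several observations after the swap (rather than a single one) is an assumption added to guard against intermittent problems.

```python
def cable_swap_verdict(problem_seen_after_swap):
    """Interpret a Cable Swap Test.

    problem_seen_after_swap: list of booleans, one per observation made
    after the known-good cable was installed (True = problem seen again).
    """
    if not problem_seen_after_swap:
        raise ValueError("at least one observation is required")
    if any(problem_seen_after_swap):
        return "problem resides elsewhere, such as the port hardware on either end"
    return "old cable was bad"


print(cable_swap_verdict([False, False, False]))
print(cable_swap_verdict([False, True, False]))
```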
7.4.1.3.5. Port Issues
When a link problem lies with the ports, you need to determine which side is causing it. Connecting one of the ports to a different node can help identify which of the two ports is at fault.
If the issue is on the switch side, you can move to an unused port or swap out the switch (or Director Class leaf or spine). Reseating the leaf or spine may also help.
If the issue is on the SuperNIC port, reseating the SuperNIC card may fix the issue. If the issue persists, the SuperNIC card may need further investigation. It is also possible that the issue is in the server itself. If so, you can move the SuperNIC card to another slot or server to see whether the issue is resolved.
If the issue neither moved to the new ports nor stayed with the old ports, it is possible that reseating the cable(s) fixed it. Several repetitions might be needed to verify.
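The possible outcomes of moving a connection can be summarized in a short sketch. The function and its labels are hypothetical, and the "seen on both" case is an assumption added for completeness; the text above does not cover it.

```python
def interpret_port_test(seen_on_old_ports, seen_on_new_ports):
    """Classify where the issue appears after a connection is moved."""
    if seen_on_old_ports and seen_on_new_ports:
        # Assumption: not covered by the text; treat as inconclusive.
        return "inconclusive: issue seen on both, repeat the test"
    if seen_on_old_ports:
        return "issue stayed with the old ports"
    if seen_on_new_ports:
        return "issue moved to the new ports"
    return "issue gone: possibly fixed by reseating, repeat to verify"


print(interpret_port_test(False, True))
print(interpret_port_test(False, False))
```

Because an intermittent fault can hide for a while, an "issue gone" result is best confirmed over several repetitions, as noted above.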