7.2.3. Common Fabric Monitoring Tools
This section describes the common fabric monitoring tools and commands to check cluster and software status.
7.2.3.1. opafabricinfo
Use opafabricinfo to get a summary of the fabric components. The command performs its analysis through the first active port on the local host.
From this command, you can determine information such as which host is running the primary FM and how many inter-switch links (ISLs) there are.
# opafabricinfo
Fabric 0:0 Information:
SM: opahsx151 hfi1_0 Guid: 0x0011750101671d2c State: Master
Number of HFIs: 30
Number of Switches: 2
Number of Links: 33
Number of HFI Links: 30         (Internal: 0   External: 30)
Number of ISLs: 3               (Internal: 0   External: 3)
Number of Degraded Links: 0     (HFI Links: 0   ISLs: 0)
Number of Omitted Links: 0      (HFI Links: 0   ISLs: 0)
-------------------------------------------------------------------------------
For additional details, refer to CN5000 Commands Guide, opafabricinfo.
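When the fabric summary is checked regularly, it can help to extract the counts programmatically. The following is a minimal Python sketch, not part of the product tooling. It assumes the output layout matches the example above (the layout may differ between releases), and the function name parse_fabric_counts is hypothetical.

```python
import re

def parse_fabric_counts(text):
    """Return the 'Number of ...' totals from opafabricinfo output as a dict.

    Hypothetical helper: assumes each total appears on its own line in the
    form 'Number of <name>: <integer>', as in the example output.
    """
    counts = {}
    for line in text.splitlines():
        # Matches lines such as "Number of ISLs: 3 (Internal: 0 External: 3)".
        match = re.match(r"\s*Number of ([A-Za-z ]+):\s*(\d+)", line)
        if match:
            counts[match.group(1).strip()] = int(match.group(2))
    return counts

# Sample text copied from the example output in this section.
SAMPLE = """\
Fabric 0:0 Information:
SM: opahsx151 hfi1_0 Guid: 0x0011750101671d2c State: Master
Number of HFIs: 30
Number of Switches: 2
Number of Links: 33
Number of HFI Links: 30 (Internal: 0 External: 30)
Number of ISLs: 3 (Internal: 0 External: 3)
Number of Degraded Links: 0 (HFI Links: 0 ISLs: 0)
Number of Omitted Links: 0 (HFI Links: 0 ISLs: 0)
"""

counts = parse_fabric_counts(SAMPLE)
print("ISLs:", counts["ISLs"], "Degraded links:", counts["Degraded Links"])
```

In practice, the text would come from capturing the command output rather than from a hard-coded sample. A nonzero "Degraded Links" value is a reasonable trigger for running the opareport checks described in the next section.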
7.2.3.2. opareport
Use opareport to generate reports about the current state of the fabric and output snapshot files for later use in debugging.
Common issues to analyze are:
Cable health, including link quality
Nodes that are Inactive, Priority, or Elevated Priority when controlling failover for the SM and PM
The following examples show how to check cable health by reporting link errors and slow links.
Note
Before running opareport to look for errors (-o errors), clear all the counters using opareport -o none --clearall or opareport -o none -Ca.
opahsx41:~ # opareport -o none --clearall
Getting All Node Records...
Done Getting All Node Records
Done Getting All Link Records
Done Getting All Cable Info Records
Done Getting All SM Info Records
Done Getting vFabric Records
Clearing Port Counters
Configured Counters to Clear:
    XmitData
    RcvData
    XmitPkts
    RcvPkts
    MulticastXmitPkts
    MulticastRcvPkts
    UncorrectableErrors
    LinkDowned
    RcvErrors
    ExcessiveBufferOverruns
    FMConfigErrors
    LinkErrorRecovery
    LocalLinkIntegrityErrors
    RcvRemotePhysicalErrors
    XmitConstraintErrors
    RcvConstraintErrors
    RcvSwitchRelayErrors
    XmitDiscards
    CongDiscards
    RcvFECN
    RcvBECN
    MarkFECN
    XmitTimeCong
    XmitWait
    XmitWastedBW
    XmitWaitData
    RcvBubble
Clearing Port Counters...
Done Clearing Port Counters
Cleared 68 Ports on 32 Nodes
To check for cable errors and slow links, use opareport -o errors -o slowlinks. Cables with a LinkQualityIndicator (LQI) of "3" or less should be troubleshot or replaced. For more information on LQI, refer to Link Quality Indicator (LQI).
opareport -o errors
Getting All Node Records...
Done Getting All Node Records
Done Getting All Link Records
Done Getting All Cable Info Records
Done Getting All SM Info Records
Done Getting vFabric Records
Getting All Port Counters...
Done Getting All Port Counters
Links with errors > threshold Summary
Configured Thresholds:
    LinkQualityIndicator 3
    UncorrectableErrors 1
    LinkDowned 3
    RcvErrors 1
    ExcessiveBufferOverruns 1
    FMConfigErrors 1
    XmitConstraintErrors 10
    RcvConstraintErrors 10
    CongDiscards 100
Rate NodeGUID           Port Type Name
100g 0x001175010170c572    1 FI   ime06 hfi1_0
     LinkQualityIndicator: 3 Below Threshold: 4
<->  0x00117501020c4bdb   10 SW   sw_ddn_r25u41
17688 of 17688 Links Checked, 1 Errors found
To check for slow links (links running with fewer than "4" active lanes in either the RX or TX direction), use opareport -o slowlinks. The following example shows the two ends of a cable, where "3" indicates one degraded lane in the TX or RX direction.
opareport -o slowlinks
Getting All Node Records...
Done Getting All Node Records
Done Getting All Link Records
Done Getting All Cable Info Records
Done Getting All SM Info Records
Done Getting vFabric Records
Links running slower than expected Summary
Links running slower than expected:
Rate NodeGUID           Port Type Name
     Active                            Enabled
     Lanes, Used(Tx), Used(Rx), Rate,  Lanes,   DownTo, Rates
-------------------------------------------------------------------------------
100g 0x001175010170b5a9    1 FI   ddn-mon02 hfi1_0
     4      3         4         25Gb   4        3,4     25Gb
<->  0x00117501020c4b3e   15 SW   sw_ddn_r25u42
     4      4         3         25Gb   1,2,3,4  3,4     25Gb
For additional details and the list of available report types, refer to CN5000 Commands Guide, opareport.
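For scripted health checks, the final summary line of the errors report is often the only part that needs to be inspected. The following Python sketch shows one way to extract it. It assumes the summary line has the form shown in the example above, and the function name parse_error_summary is hypothetical and not part of opareport.

```python
import re

def parse_error_summary(text):
    """Return (links_checked, total_links, errors_found) from the report text.

    Hypothetical helper: assumes a summary line of the form
    '<N> of <M> Links Checked, <K> Errors found'. Returns None if no such
    line is present.
    """
    match = re.search(r"(\d+) of (\d+) Links Checked, (\d+) Errors found", text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())

# Summary line copied from the example output in this section.
summary = parse_error_summary("17688 of 17688 Links Checked, 1 Errors found")
if summary is not None:
    checked, total, errors = summary
    print(f"{checked}/{total} links checked, {errors} with errors")
```

A script could treat any nonzero error count as a prompt to review the per-link entries in the full report, since those entries identify the node, port, and counter that exceeded its threshold.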
7.2.3.3. opainfo
Use opainfo to report on the status of the local SuperNICs.
The following is example output for a single-port SuperNIC.
$ opainfo
hfi1_0:1 PortGID:0xfe80000000000000:0011750101575fec
   PortState:     Active
   LinkSpeed      Act: 25Gb         En: 25Gb
   LinkWidth      Act: 4            En: 4
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 3,4
   LCRC           Act: 14-bit       En: 14-bit,16-bit,48-bit   Mgmt: True
   LID:           0x00000001-0x00000001   SM LID: 0x00000001 SL: 0
   QSFP: PassiveCu, 2m   TE Connectivity   P/N 2821076-2   Rev B
   Xmit Data:     5 MB  Pkts: 28742
   Recv Data:     17 MB Pkts: 28969
   Link Quality: 5 (Excellent)
For additional details, refer to CN5000 Commands Guide, opainfo.
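The two fields most relevant to cable health in this output are Link Quality and LinkWidthDnGrd. The following Python sketch extracts them from captured opainfo text. It assumes the field labels match the example above, and the function name parse_port_health and the dictionary keys are hypothetical.

```python
import re

def parse_port_health(text):
    """Extract link quality and active lane counts from opainfo output.

    Hypothetical helper: assumes 'Link Quality: <n> (<label>)' and
    'LinkWidthDnGrd ActTx: <n> Rx: <n>' appear as in the example output.
    Fields that are not found are returned as None.
    """
    quality = re.search(r"Link Quality:\s*(\d+)\s*\(([^)]*)\)", text)
    width = re.search(r"LinkWidthDnGrd\s+ActTx:\s*(\d+)\s+Rx:\s*(\d+)", text)
    return {
        "link_quality": int(quality.group(1)) if quality else None,
        "quality_label": quality.group(2) if quality else None,
        "active_tx_lanes": int(width.group(1)) if width else None,
        "active_rx_lanes": int(width.group(2)) if width else None,
    }

# Lines copied from the example output in this section.
SAMPLE = """\
   LinkWidth      Act: 4            En: 4
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 3,4
   Link Quality: 5 (Excellent)
"""

health = parse_port_health(SAMPLE)
print(health)
```

A port reporting a link quality of 3 or less, or fewer than 4 active lanes in either direction, corresponds to the conditions that opareport flags with -o errors and -o slowlinks.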
7.2.3.4. opatop
Use the Performance Monitor Tool opatop to drill down from a high-level, fabric-wide view to an individual port view.
From this tool, you can determine at a high level when an issue occurred, and then drill down to find the offending port.
The following is an example Summary screen.
opatop: Img: 10s @ Wed Mar 12 16:38:13 2025, Live
Summary: SW: 1 Ports: SW: 5 HFI: 4 Link: 4
SM: 1 Node NRsp: 0 Skip: 0 Port NRsp: 0 Skip: 0
AvgMBps MinMBps MaxMBps AvgKPps MinKPps MaxKPps
0 All Int 0 0 0 0 0 0
Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min
1 HFIs Snd 0 0 0 0 0 0
Rcv 0 0 0 0 0 0
Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min
2 SWs Int 0 0 0 0 0 0
Snd 0 0 0 0 0 0
Rcv 0 0 0 0 0 0
Integ:min Congst:min SmaCong:min Bubble:min Secure:min Routing:min
Master-SM: LID: 0x0001 Port: 1 Priority: 0 State: Master
Name: hds1fnb1051 hfi1_0
PortGUID: 0x00117501010AA4D8
Secondary-SM: none
Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
sS Pmcfg Imginfo View 0-n: