Cornelis Technical Documentation

7.2.3. Common Fabric Monitoring Tools

This section describes the common fabric monitoring tools and commands to check cluster and software status.

7.2.3.1. opafabricinfo

Use opafabricinfo to get a summary of the fabric components. The command performs its analysis through the first active port on the local host.

From this command, you can determine information such as which host is running the primary FM and how many inter-switch links (ISLs) there are.

# opafabricinfo
Fabric 0:0 Information:
SM: opahsx151 hfi1_0 Guid: 0x0011750101671d2c
State: Master
Number of HFIs: 30
Number of Switches: 2
Number of Links: 33
Number of HFI Links: 30             (Internal: 0   External: 30)
Number of ISLs: 3                   (Internal: 0   External: 3)
Number of Degraded Links: 0         (HFI Links: 0   ISLs: 0)
Number of Omitted Links: 0          (HFI Links: 0   ISLs: 0)
-------------------------------------------------------------------------------

For additional details, refer to CN5000 Commands Guide, opafabricinfo.
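
When opafabricinfo runs from a health-check script, it can help to turn the summary into structured data. The following Python sketch parses output in the format of the example above. The function name parse_fabricinfo and the dictionary keys are illustrative and not part of any Cornelis tool, and the sketch assumes a single SM entry, as in the sample.

```python
import re

# Sample text copied from the opafabricinfo example in this section.
SAMPLE = """\
Fabric 0:0 Information:
SM: opahsx151 hfi1_0 Guid: 0x0011750101671d2c
State: Master
Number of HFIs: 30
Number of Switches: 2
Number of Links: 33
Number of HFI Links: 30             (Internal: 0   External: 30)
Number of ISLs: 3                   (Internal: 0   External: 3)
Number of Degraded Links: 0         (HFI Links: 0   ISLs: 0)
Number of Omitted Links: 0          (HFI Links: 0   ISLs: 0)
"""


def parse_fabricinfo(text):
    """Return the 'Number of ...' counts plus the SM name, GUID, and state."""
    info = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Number of ([A-Za-z ]+):\s*(\d+)", line)
        if m:
            info[m.group(1)] = int(m.group(2))
            continue
        m = re.match(r"SM:\s*(.+?)\s+Guid:\s*(0x[0-9a-fA-F]+)", line)
        if m:
            info["SM"] = m.group(1)
            info["SM Guid"] = m.group(2)
            continue
        m = re.match(r"State:\s*(\S+)", line)
        if m:
            # Assumption: only one SM is listed; a later entry would overwrite this.
            info["State"] = m.group(1)
    return info


info = parse_fabricinfo(SAMPLE)
print(info["SM"], info["State"], "ISLs:", info["ISLs"])
if info["Degraded Links"] > 0:
    print("Degraded links present; consider running opareport")
```

In practice the text would come from a saved file or from capturing the command's output. A nonzero Degraded Links count is a natural trigger for the opareport checks described in the next section.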

7.2.3.2. opareport

Use opareport to generate reports about the current state of the fabric and output snapshot files for later use in debugging.

Common issues to analyze are:

  • Cable health, including link quality

  • Nodes that are Inactive, Priority, or Elevated Priority when controlling failover for the SM and PM

The following examples examine cable health by checking for link errors and slow links.

Note

Before running opareport -o errors to look for errors, clear all the counters using opareport -o none --clearall or opareport -o none -Ca.

opahsx41:~ # opareport -o none --clearall
Getting All Node Records...
Done Getting All Node Records
Done Getting All Link Records
Done Getting All Cable Info Records
Done Getting All SM Info Records
Done Getting vFabric Records
Clearing Port Counters

Configured Counters to Clear:
XmitData
RcvData
XmitPkts
RcvPkts
MulticastXmitPkts
MulticastRcvPkts
UncorrectableErrors
LinkDowned
RcvErrors
ExcessiveBufferOverruns
FMConfigErrors
LinkErrorRecovery
LocalLinkIntegrityErrors
RcvRemotePhysicalErrors
XmitConstraintErrors
RcvConstraintErrors
RcvSwitchRelayErrors
XmitDiscards
CongDiscards
RcvFECN
RcvBECN
MarkFECN
XmitTimeCong
XmitWait
XmitWastedBW
XmitWaitData
RcvBubble
Clearing Port Counters...
Done Clearing Port Counters
Cleared 68 Ports on 32 Nodes
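
The clear-then-check sequence can be scripted. The following Python sketch is one possible wrapper, not a Cornelis-provided tool: the function name clear_then_check, the wait interval, and the injectable run and sleep parameters are all assumptions made for illustration. Only the two opareport command lines come from this section.

```python
import subprocess
import time

# Command lines taken from this section.
CLEAR_CMD = ["opareport", "-o", "none", "--clearall"]
CHECK_CMD = ["opareport", "-o", "errors", "-o", "slowlinks"]


def clear_then_check(wait_seconds, run=subprocess.run, sleep=time.sleep):
    """Clear all port counters, wait, then return the report text.

    wait_seconds is a site-specific choice (an assumption of this sketch):
    it gives the counters time to accumulate under normal traffic.
    run and sleep are parameters so the function can be exercised
    without a fabric attached.
    """
    run(CLEAR_CMD, check=True, capture_output=True, text=True)
    sleep(wait_seconds)
    result = run(CHECK_CMD, check=True, capture_output=True, text=True)
    return result.stdout
```

On a host with the fabric tools installed, a call such as clear_then_check(600) would clear the counters, wait ten minutes, and return the combined errors and slowlinks report as a string.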

  • To check for cable errors and slow links in a single run, use opareport -o errors -o slowlinks. The following example shows the errors report on its own.

    Troubleshoot or replace cables with a LinkQualityIndicator (LQI) of 3 or less. For more information on LQI, refer to Link Quality Indicator (LQI).

    opareport -o errors
    Getting All Node Records...
    Done Getting All Node Records
    Done Getting All Link Records
    Done Getting All Cable Info Records
    Done Getting All SM Info Records
    Done Getting vFabric Records
    Getting All Port Counters...
    Done Getting All Port Counters
    Links with errors > threshold Summary
    
    Configured Thresholds:
       LinkQualityIndicator           3
       UncorrectableErrors            1
       LinkDowned                     3
       RcvErrors                      1
       ExcessiveBufferOverruns        1
       FMConfigErrors                 1
       XmitConstraintErrors           10
       RcvConstraintErrors            10
       CongDiscards                   100
    Rate NodeGUID          Port Type Name
    100g 0x001175010170c572   1 FI   ime06 hfi1_0
       LinkQualityIndicator: 3 Below Threshold: 4
    <->  0x00117501020c4bdb  10 SW  sw_ddn_r25u41
    17688 of 17688 Links Checked, 1 Errors found
  • To check for slow links (links on which fewer than 4 lanes are in use in either the TX or RX direction), use opareport -o slowlinks.

    The following example shows both ends of one cable. The value "3" under Used(Tx) on the HFI end and under Used(Rx) on the switch end indicates one degraded lane in that direction.

    opareport -o slowlinks
    Getting All Node Records...
    Done Getting All Node Records
    Done Getting All Link Records
    Done Getting All Cable Info Records
    Done Getting All SM Info Records
    Done Getting vFabric Records
    Links running slower than expected Summary
    
    Links running slower than expected:
    Rate NodeGUID          Port Type Name
        Active                              Enabled
        Lanes, Used(Tx), Used(Rx), Rate,    Lanes,   DownTo,  Rates
    -------------------------------------------------------------------------------
    100g 0x001175010170b5a9   1 FI   ddn-mon02 hfi1_0
        4     3         4         25Gb     4        3,4      25Gb
    <->  0x00117501020c4b3e  15 SW  sw_ddn_r25u42
        4     4         3         25Gb     1,2,3,4  3,4      25Gb
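
The slowlinks output above can be scanned programmatically to list which direction is degraded on each port. The following Python sketch assumes the two-line-per-port layout shown in the example. The names parse_slowlinks and degraded_directions are illustrative, and the rule "used lanes fewer than active lanes means degraded" is an interpretation of the example rather than a documented definition.

```python
import re

# Sample text copied from the slowlinks example above (progress lines omitted).
SAMPLE = """\
Links running slower than expected:
Rate NodeGUID          Port Type Name
    Active                              Enabled
    Lanes, Used(Tx), Used(Rx), Rate,    Lanes,   DownTo,  Rates
100g 0x001175010170b5a9   1 FI   ddn-mon02 hfi1_0
    4     3         4         25Gb     4        3,4      25Gb
<->  0x00117501020c4b3e  15 SW  sw_ddn_r25u42
    4     4         3         25Gb     1,2,3,4  3,4      25Gb
"""

# Port line: rate (or "<->"), node GUID, port number, node type, node name.
PORT_RE = re.compile(r"^(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\S+)\s+(.+)$")
# Width line: active lanes, used Tx, used Rx, then four columns not needed here.
WIDTH_RE = re.compile(r"^(\d+)\s+(\d+)\s+(\d+)\s+\S+\s+\S+\s+\S+\s+\S+$")


def parse_slowlinks(text):
    """Return one dict per port with its lane usage."""
    ports, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        m = PORT_RE.match(line)
        if m:
            current = {"guid": m.group(2), "port": int(m.group(3)),
                       "type": m.group(4), "name": m.group(5)}
            continue
        m = WIDTH_RE.match(line)
        if m and current is not None:
            current.update(active_lanes=int(m.group(1)),
                           used_tx=int(m.group(2)),
                           used_rx=int(m.group(3)))
            ports.append(current)
            current = None
    return ports


def degraded_directions(port):
    """List the directions using fewer lanes than are active (assumed rule)."""
    return [name for name, used in (("tx", port["used_tx"]),
                                    ("rx", port["used_rx"]))
            if used < port["active_lanes"]]


for p in parse_slowlinks(SAMPLE):
    print(p["name"], "port", p["port"], "degraded:", degraded_directions(p))
```

For the sample, the HFI port is reported as degraded in the tx direction and the switch port in the rx direction, which matches the two ends of a single cable.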

For additional details and the list of available report types, refer to CN5000 Commands Guide, opareport.
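
The errors report shown earlier in this section can be reduced to a list of affected links plus the final summary line. The following Python sketch assumes the layout of that example: a port line, optional counter lines beneath it, and a "<->" line for the far end. The names parse_errors_report and the dictionary keys are illustrative only.

```python
import re

# Sample text copied from the errors example in this section (abridged).
SAMPLE = """\
Links with errors > threshold Summary

Configured Thresholds:
   LinkQualityIndicator           3
   UncorrectableErrors            1
Rate NodeGUID          Port Type Name
100g 0x001175010170c572   1 FI   ime06 hfi1_0
   LinkQualityIndicator: 3 Below Threshold: 4
<->  0x00117501020c4bdb  10 SW  sw_ddn_r25u41
17688 of 17688 Links Checked, 1 Errors found
"""

PORT_RE = re.compile(r"^(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\S+)\s+(.+)$")
COUNTER_RE = re.compile(r"^(\w+):\s*(\d+)")
SUMMARY_RE = re.compile(r"^(\d+) of (\d+) Links Checked, (\d+) Errors found")


def parse_errors_report(text):
    """Return (links, summary); each link holds one or two ports."""
    links, summary, port = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        m = SUMMARY_RE.match(line)
        if m:
            summary = {"checked": int(m.group(1)),
                       "total": int(m.group(2)),
                       "errors": int(m.group(3))}
            continue
        m = PORT_RE.match(line)
        if m:
            port = {"guid": m.group(2), "port": int(m.group(3)),
                    "type": m.group(4), "name": m.group(5), "counters": {}}
            if m.group(1) == "<->":
                # Far end of the link started by the previous port line.
                if links:
                    links[-1]["ports"].append(port)
            else:
                links.append({"rate": m.group(1), "ports": [port]})
            continue
        m = COUNTER_RE.match(line)
        if m and port is not None:
            # Counter lines belong to the most recent port line.
            port["counters"][m.group(1)] = int(m.group(2))
    return links, summary


links, summary = parse_errors_report(SAMPLE)
for link in links:
    for p in link["ports"]:
        if p["counters"]:
            print(p["name"], "port", p["port"], p["counters"])
print("errors found:", summary["errors"])
```

Threshold lines have no colon after the counter name, so the sketch ignores them and records only the per-port values that exceeded a threshold.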

7.2.3.3. opainfo

Use opainfo to report on the status of the local SuperNICs.

The following is example output for a single-port SuperNIC.

$ opainfo
hfi1_0:1                           PortGID:0xfe80000000000000:0011750101575fec
   PortState:     Active
   LinkSpeed      Act: 25Gb         En: 25Gb
   LinkWidth      Act: 4            En: 4
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 3,4
   LCRC           Act: 14-bit       En: 14-bit,16-bit,48-bit       Mgmt: True
   LID: 0x00000001-0x00000001       SM LID: 0x00000001 SL: 0
   QSFP: PassiveCu,   2m TE Connectivity   P/N 2821076-2         Rev B
   Xmit Data:                  5 MB Pkts:                28742
   Recv Data:                 17 MB Pkts:                28969
   Link Quality: 5 (Excellent)

For additional details, refer to CN5000 Commands Guide, opainfo.
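
A per-host check can be built on the opainfo output. The following Python sketch extracts a few fields from output in the format of the example above and flags conditions worth a closer look. The function names and the warning rules are assumptions for illustration: the link quality rule reuses the "3 or less" guidance from the opareport section, and the lane rule compares the LinkWidthDnGrd values against the active LinkWidth. The sketch handles one port per input.

```python
import re

# Sample text copied from the opainfo example in this section.
SAMPLE = """\
hfi1_0:1                           PortGID:0xfe80000000000000:0011750101575fec
   PortState:     Active
   LinkSpeed      Act: 25Gb         En: 25Gb
   LinkWidth      Act: 4            En: 4
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 3,4
   LCRC           Act: 14-bit       En: 14-bit,16-bit,48-bit       Mgmt: True
   LID: 0x00000001-0x00000001       SM LID: 0x00000001 SL: 0
   Link Quality: 5 (Excellent)
"""


def parse_opainfo(text):
    """Extract selected fields for a single port; missing fields become None."""
    def grab(pattern, cast=str):
        m = re.search(pattern, text)
        return cast(m.group(1)) if m else None

    return {
        "port": grab(r"(\S+:\d+)\s+PortGID:"),
        "state": grab(r"PortState:\s*(\S+)"),
        "width": grab(r"LinkWidth\s+Act:\s*(\d+)", int),
        "used_tx": grab(r"LinkWidthDnGrd\s+ActTx:\s*(\d+)", int),
        "used_rx": grab(r"LinkWidthDnGrd\s+ActTx:\s*\d+\s+Rx:\s*(\d+)", int),
        "link_quality": grab(r"Link Quality:\s*(\d+)", int),
    }


def port_warnings(p):
    """Return a list of human-readable warnings (rules are assumptions)."""
    warnings = []
    if p["state"] != "Active":
        warnings.append(f"port state is {p['state']}")
    if p["link_quality"] is not None and p["link_quality"] <= 3:
        warnings.append(f"link quality {p['link_quality']} is 3 or less")
    for key in ("used_tx", "used_rx"):
        if p[key] is not None and p["width"] is not None and p[key] < p["width"]:
            warnings.append(f"{key} lanes {p[key]} below active width {p['width']}")
    return warnings


port = parse_opainfo(SAMPLE)
print(port["port"], port_warnings(port) or "no warnings")
```

For the sample port, no warnings are produced because the port is Active, all 4 lanes are in use in both directions, and the link quality is 5.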

7.2.3.4. opatop

Use the Performance Monitor Tool opatop to drill down from a high-level, fabric-wide view to an individual port view.

With this tool, you can determine when an issue occurred at a high level and then drill down to find the offending port.

The following provides an example Summary screen.

opatop: Img: 10s @ Wed Mar 12 16:38:13 2025, Live
Summary:  SW:     1 Ports: SW:     5  HFI:     4       Link:     4
          SM:     1 Node NRsp:     0 Skip:     0 Port NRsp:     0 Skip:     0
                    AvgMBps   MinMBps   MaxMBps   AvgKPps   MinKPps   MaxKPps
0 All         Int         0         0         0         0         0         0
      Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
1 HFIs        Snd         0         0         0         0         0         0
              Rcv         0         0         0         0         0         0
      Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min
2 SWs         Int         0         0         0         0         0         0
              Snd         0         0         0         0         0         0
              Rcv         0         0         0         0         0         0
      Integ:min  Congst:min  SmaCong:min  Bubble:min  Secure:min  Routing:min

    Master-SM: LID: 0x0001 Port: 1   Priority: 0  State: Master
               Name: hds1fnb1051 hfi1_0
               PortGUID: 0x00117501010AA4D8
 Secondary-SM: none



Quit up Live/rRev/fFwd/time/bookmrked Bookmrk Unbookmrk ?help |
sS Pmcfg Imginfo View 0-n: