Cornelis Technical Documentation

7.2.2. Monitoring Logs and Events

Logs and events provide valuable insights into system behavior, security events, and performance. They can enable faster troubleshooting and problem resolution.

7.2.2.1. Fabric Manager Monitoring

The Fabric Manager logs events to assist with fabric debugging.

Common indicators to look for in the Fabric Manager message log include:

  • Normal Sweeps

  • Component counts that stay constant between sweeps

  • No Link Quality Indicator (LQI) errors

The following steps provide high-level instructions for monitoring the fabric/system using the Fabric Manager.

  1. Ensure /var/log/messages is clean.

    Note

    If opafm logs have been redirected elsewhere, check that location instead.

    A healthy fabric should sweep every five minutes unless an event occurs in the fabric, such as a node reboot or the addition or removal of connections.

    The “retries” counter may be nonzero, but the number should stay in the single or low double digits. The sweep time should be 0.2 to 0.5 seconds, depending on the size of the fabric; in large fabrics it may exceed 1 second.

    Example output:

    Feb 11 08:45:57 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Scheduled sweep interval
    Feb 11 08:45:57 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: sm_set_local_port_pkey: sm pkey table already set
    Feb 11 08:45:57 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 2 SWs, 30 HFIs, 30 end ports, 68 total ports, 1 SM(s), 148 packets, 0 retries, 0.238 sec sweep
    
    <after 5 minutes>
    
    Feb 11 08:50:17 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Multicast group Membership change.
    Feb 11 08:50:17 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: sm_set_local_port_pkey: sm pkey table 
    Feb 11 08:50:17 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 2 SWs, 30 HFIs, 30 end ports, 68 total ports, 1 SM(s), 149 packets, 0 retries, 0.217 sec sweep

    Any messages with WARN or ERROR should be investigated.

  2. Show only the most recent Fabric Manager log entries, and continuously print new entries as they are appended to the journal.

    journalctl -f -u opafm
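
The sweep checks in step 1 can also be scripted. The following is a minimal sketch, not a Cornelis-provided tool: the function name check_sweeps and both default thresholds (20 retries, 1.0 second) are illustrative assumptions, and the parsing assumes the "DISCOVERY CYCLE END" line layout and syslog timestamp shown in the example output above.

```shell
# Sketch: summarize Fabric Manager sweep results from log lines on stdin.
# check_sweeps and its default thresholds are assumptions for illustration,
# not documented limits. Adjust them for the size of your fabric.
check_sweeps() {
    max_retries="${1:-20}"    # assumed ceiling for "low double digits"
    max_seconds="${2:-1.0}"   # assumed ceiling for sweep time in seconds
    awk -v max_r="$max_retries" -v max_s="$max_seconds" '
        /DISCOVERY CYCLE END/ {
            retries = ""; secs = ""
            # The value precedes its label: "0 retries," and "0.238 sec sweep".
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^retries,?$/) retries = $(i - 1)
                if ($i == "sec" && $(i + 1) == "sweep") secs = $(i - 1)
            }
            if (retries == "" || secs == "") next
            status = (retries + 0 > max_r + 0 || secs + 0 > max_s + 0) ? "CHECK" : "OK"
            # $3 is the time of day in the syslog-style timestamp.
            printf "%s %s retries=%d sweep=%.3fs\n", status, $3, retries, secs
        }'
}

# Example invocations (not run here):
#   check_sweeps < /var/log/messages
#   journalctl -u opafm --no-pager | check_sweeps 20 1.0
#   journalctl -u opafm --no-pager | grep -E 'WARN|ERROR'
```

Each sweep is printed on one line prefixed with OK or CHECK, so a long log can be scanned for sweeps that exceed the chosen limits. The final grep simply lists lines containing WARN or ERROR for follow-up.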

7.2.2.2. Monitoring SuperNIC and AOC Temperatures

The SuperNIC reports its current temperature as well as the temperature of any active optical cables (AOCs) present. Read these values from the tempsense file:

[root@cn5kGenoa165 ~]# cat /sys/class/infiniband/hfi1_0/tempsense 
ASIC 55.250
QSFP1 none
QSFP2 52.625

The temperatures listed here are in degrees Celsius.
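
For periodic checks, the tempsense output can be compared against a limit. The following is a minimal sketch: the function name check_temps and the 70 degree default are placeholder assumptions, not documented thresholds, and the parsing assumes the two-column "name value" layout shown above. Consult the hardware specifications for the actual limits.

```shell
# Sketch: flag sensors at or above a temperature limit (degrees Celsius).
# check_temps and the default limit of 70 are assumptions for illustration.
check_temps() {
    limit="${1:-70}"
    awk -v limit="$limit" '
        # A sensor with no value is reported as "none".
        NF == 2 && $2 == "none" { printf "%s: no reading\n", $1; next }
        # Numeric readings such as "ASIC 55.250".
        NF == 2 && $2 ~ /^[0-9]+(\.[0-9]+)?$/ {
            state = ($2 + 0 >= limit + 0) ? "HOT" : "ok"
            printf "%s: %.3f C %s\n", $1, $2, state
        }'
}

# Example invocation (not run here):
#   check_temps 70 < /sys/class/infiniband/hfi1_0/tempsense
```

Reading from standard input keeps the function independent of the device name, so the same check can be applied to each hfi1 device on a node.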