7.2.2. Monitoring Logs and Events
Logs and events provide valuable insights into system behavior, security events, and performance. They can enable faster troubleshooting and problem resolution.
7.2.2.1. Fabric Manager Monitoring
The Fabric Manager logs events to assist with fabric debugging.
In the Fabric Manager message log, look for the following signs of a healthy fabric:
Normal Sweeps
Component counts that remain constant from sweep to sweep
No Link Quality Indicator (LQI) errors
The following steps provide high-level instructions for monitoring the fabric/system using the Fabric Manager.
Ensure /var/log/messages is clean.
Note: If opafm logs have been redirected elsewhere, check that location instead.
A healthy fabric should sweep every five minutes unless there is an event in the fabric, such as a node reboot or the addition or removal of connections.
You may see some retries in the "retries" counter, but the number should be in the single or low double digits. The sweep time should be 0.2 to 0.5 seconds, depending on the size of the fabric; in large setups, it may exceed 1 second.
Example output:
Feb 11 08:45:57 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Scheduled sweep interval
Feb 11 08:45:57 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: sm_set_local_port_pkey: sm pkey table already set
Feb 11 08:45:57 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 2 SWs, 30 HFIs, 30 end ports, 68 total ports, 1 SM(s), 148 packets, 0 retries, 0.238 sec sweep
<after 5 minutes>
Feb 11 08:50:17 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: topology_main: TT: DISCOVERY CYCLE START - REASON: Multicast group Membership change.
Feb 11 08:50:17 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: sm_set_local_port_pkey: sm pkey table
Feb 11 08:50:17 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: topology_main: DISCOVERY CYCLE END. 2 SWs, 30 HFIs, 30 end ports, 68 total ports, 1 SM(s), 149 packets, 0 retries, 0.217 sec sweep
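The retry and sweep-time guidance above can be checked mechanically. The following is a minimal Python sketch, not part of the Fabric Manager tooling. It assumes the sweep summary always has the form shown in the example output ("DISCOVERY CYCLE END. ... N retries, T sec sweep"). The function name check_sweeps and both default thresholds are hypothetical and should be tuned to the size of your fabric.

```python
import re

# Matches the end-of-sweep summary in the form shown in the example output.
SWEEP_END = re.compile(
    r"DISCOVERY CYCLE END\. .*?(?P<retries>\d+) retries, "
    r"(?P<seconds>\d+(?:\.\d+)?) sec sweep"
)


def check_sweeps(lines, max_retries=20, max_seconds=1.0):
    """Return (sweep_count, findings).

    findings holds (retries, seconds, line) for every sweep whose retry
    count or sweep time exceeds the given (hypothetical) thresholds.
    """
    count = 0
    findings = []
    for line in lines:
        match = SWEEP_END.search(line)
        if match is None:
            continue  # not a sweep summary line
        count += 1
        retries = int(match.group("retries"))
        seconds = float(match.group("seconds"))
        if retries > max_retries or seconds > max_seconds:
            findings.append((retries, seconds, line.rstrip("\n")))
    return count, findings


# The two sweep summaries from the example output.
sample = [
    "Feb 11 08:45:57 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: topology_main: "
    "DISCOVERY CYCLE END. 2 SWs, 30 HFIs, 30 end ports, 68 total ports, 1 SM(s), "
    "148 packets, 0 retries, 0.238 sec sweep",
    "Feb 11 08:50:17 opahsx151 fm0_sm[46537]: PROGR[topology]: SM: topology_main: "
    "DISCOVERY CYCLE END. 2 SWs, 30 HFIs, 30 end ports, 68 total ports, 1 SM(s), "
    "149 packets, 0 retries, 0.217 sec sweep",
]

count, findings = check_sweeps(sample)
print(count, len(findings))  # prints: 2 0
```

To check a real log, pass an open file instead of the sample list, for example check_sweeps(open("/var/log/messages", errors="replace")).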
Any messages with WARN or ERROR should be investigated.
Show only the most recent Fabric Manager log entries, and continuously print new entries as they are appended to the journal:
journalctl -f -u opafm
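To surface only the messages that need investigation, the journal stream can be filtered for WARN and ERROR. The following Python sketch is one possible approach under stated assumptions: it runs the journalctl command shown above and assumes the severity appears as the bare word WARN or ERROR somewhere in the line. The function names needs_attention and watch_opafm are hypothetical.

```python
import re
import subprocess

# Assumption: severities appear as the standalone words WARN or ERROR.
SEVERITY = re.compile(r"\b(WARN|ERROR)\b")


def needs_attention(line):
    """Return True if the log line carries a WARN or ERROR severity."""
    return SEVERITY.search(line) is not None


def watch_opafm():
    """Follow the opafm journal and print only WARN/ERROR lines.

    Blocks until journalctl exits or the process is interrupted.
    """
    cmd = ["journalctl", "-f", "-u", "opafm"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            if needs_attention(line):
                print(line, end="", flush=True)
```

Call watch_opafm() on the node running the Fabric Manager. Matching on the bare word is deliberately broad, so a line that merely mentions ERROR in its message text is also shown.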
7.2.2.2. Monitoring SuperNIC and AOC Temperatures
The SuperNIC reports its current temperature as well as the temperature of any active optical cables (AOCs) present. Read these values from the tempsense file:
[root@cn5kGenoa165 ~]# cat /sys/class/infiniband/hfi1_0/tempsense
ASIC 55.250 QSFP1 none QSFP2 52.625
The temperatures listed here are in degrees Celsius.
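For scripted monitoring, the tempsense output can be parsed into named readings. The following Python sketch assumes the file contains whitespace-separated name/value pairs as in the example, and it treats "none" as "no reading available", which is an interpretation and not something the example states. The function names and the 70.0 limit are hypothetical and do not represent a vendor threshold.

```python
def parse_tempsense(text):
    """Parse 'NAME value NAME value ...' into {name: float or None}."""
    tokens = text.split()
    readings = {}
    # Pair each sensor name with the value that follows it.
    for name, value in zip(tokens[0::2], tokens[1::2]):
        readings[name] = None if value == "none" else float(value)
    return readings


def over_limit(readings, limit_c):
    """Return the sorted names of sensors reading above limit_c (Celsius)."""
    return sorted(
        name
        for name, temp in readings.items()
        if temp is not None and temp > limit_c
    )


sample = "ASIC 55.250 QSFP1 none QSFP2 52.625"
readings = parse_tempsense(sample)
print(readings)  # {'ASIC': 55.25, 'QSFP1': None, 'QSFP2': 52.625}
print(over_limit(readings, limit_c=70.0))  # []
```

On a live system, read the file contents first, for example parse_tempsense(open("/sys/class/infiniband/hfi1_0/tempsense").read()).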