
Cornelis Technical Documentation

7.4.3. Software Troubleshooting

7.4.3.1. Kernel and Initialization Issues

This section describes issues that may prevent the system from coming up properly.

7.4.3.1.1. Driver Load Fails Due to Unsupported Kernel

If you try to load the Omni-Path driver on a kernel that the OPX Software does not support, the load fails with error messages that point to hfi1.ko.

To correct this problem, install a supported Linux kernel version, then reload the driver.

7.4.3.1.2. Rebuild or Reinstall Drivers if Different Kernel Installed

If you upgrade the kernel, you must reboot and then rebuild or reinstall the Omni-Path kernel modules (drivers). Refer to the CN5000 Fabric Installation Guide for more information.
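One way to confirm that a driver module is present for the kernel you booted is to look for it under that kernel's module directory. The following is a minimal sketch, assuming the conventional /lib/modules/<release> layout and the hfi1.ko file name mentioned earlier in this section. The function name hfi1_module_path is hypothetical and is not part of the OPX Software.

```shell
#!/bin/sh
# Print the path of the first hfi1 module file found for a kernel release.
# $1: kernel release (defaults to the running kernel, from uname -r)
# $2: modules root (defaults to /lib/modules; an assumption about the layout)
# Prints nothing if no module file is found.
hfi1_module_path() {
    rel=${1:-$(uname -r)}
    root=${2:-/lib/modules}
    # Match hfi1.ko as well as compressed variants such as hfi1.ko.xz.
    find "$root/$rel" -name 'hfi1.ko*' -print 2>/dev/null | head -n 1
}
```

Calling hfi1_module_path with no arguments after a kernel upgrade and reboot prints a path if a module exists for the running kernel. Empty output suggests that the modules still need to be rebuilt or reinstalled.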

7.4.3.1.3. Omni-Path Interrupts Not Working

The driver cannot configure the Omni-Path link to a usable state unless interrupts are working. Check for this problem with the command:

$ grep hfi1 /proc/interrupts

Note

The output you see may vary depending on the board type, the distribution or update level, and the number of CPUs in the system.

If there is no output at all, the driver initialization failed. For more information on driver problems, see Driver Load Fails Due to Unsupported Kernel or CN5000 Omni-Path SuperNIC Initialization Failure.

If the output contains lines like the sdma lines in the following example, with a zero count in every CPU column, then those interrupts are not being delivered to the driver.

-MSI-edge    hfi1_0 sdma6
177:    0    0   0    PCI-MSI-edge    hfi1_0 sdma7
178:    0    0   0    PCI-MSI-edge    hfi1_0 sdma8
179:    0    0   0    PCI-MSI-edge    hfi1_0 sdma9
180:    0    0   0    PCI-MSI-edge    hfi1_0 sdma10
181:    0    0   0    PCI-MSI-edge    hfi1_0 sdma11
182:    0    0   0    PCI-MSI-edge    hfi1_0 sdma12
183:    0    0   0    PCI-MSI-edge    hfi1_0 sdma13
184:    0    0   0    PCI-MSI-edge    hfi1_0 sdma14
185:    0    0   0    PCI-MSI-edge    hfi1_0 sdma15
186:   39    0   0    PCI-MSI-edge    hfi1_0 kctxt0
187:    1   77   0    PCI-MSI-edge    hfi1_0 kctxt1
188:    0    0   0    PCI-MSI-edge    hfi1_0 kctxt2

A zero count in all CPU columns means that no Omni-Path interrupts have been delivered to the processor.
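On systems with many CPUs, scanning the columns by eye is error-prone. The following sketch prints only the hfi1 lines whose per-CPU counts are all zero. It assumes the /proc/interrupts layout shown in the example (an IRQ label followed by one numeric column per CPU). The function name hfi1_silent_irqs is hypothetical.

```shell
#!/bin/sh
# Print hfi1 interrupt lines whose per-CPU counts are all zero.
# $1: file to read (defaults to /proc/interrupts)
hfi1_silent_irqs() {
    awk '
        /hfi1/ {
            n = 0; total = 0
            # Field 1 is the IRQ label; the counts are the run of purely
            # numeric fields that follows it.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) {
                total += $i
                n++
            }
            # Skip lines that have no count columns at all.
            if (n > 0 && total == 0) print
        }
    ' "${1:-/proc/interrupts}"
}
```

Run against the example output, this would list the sdma lines and the kctxt2 line, but not kctxt0 or kctxt1, which have nonzero counts.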

The possible causes of this problem are:

  • Booting the Linux kernel with ACPI disabled, either on the boot command line or in the BIOS configuration.

  • Other Omni-Path initialization failures.

To check if the kernel was booted with the noacpi or pci=noacpi option, use this command:

$ grep -i acpi /proc/cmdline

If the output shows one of these options, edit the kernel boot command line to remove it so that ACPI is enabled. How the command line is set depends on your OS distribution. If no output is displayed, check that ACPI is enabled in your BIOS settings.
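Because grep prints the whole command line when it matches, it can help to isolate the individual options that mention ACPI. The following is a small sketch of that idea. The function name acpi_boot_options is hypothetical, and the sketch only lists matching options; deciding whether a listed option actually disables ACPI is left to the reader.

```shell
#!/bin/sh
# List each kernel command-line option that mentions ACPI, one per line.
# $1: file to read (defaults to /proc/cmdline)
# Exits nonzero when no option matches, mirroring grep.
acpi_boot_options() {
    # Split the space-separated command line into one option per line,
    # then keep only the options containing "acpi" (case-insensitive).
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -i acpi
}
```

For example, on a system booted with pci=noacpi, the function would print that option alone rather than the full command line.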

To track down other initialization failures, see CN5000 Omni-Path SuperNIC Initialization Failure.

7.4.3.1.4. OpenFabrics Load Errors if SuperNIC Driver Load Fails

When the SuperNIC driver fails to load, the other OpenFabrics drivers and modules are still loaded and shown by lsmod. However, commands such as ibv_devinfo fail, as shown in the following example:

$ ibv_devinfo
libibverbs: Fatal: couldn't read uverbs ABI version.
No Omni-Path devices found

7.4.3.1.5. CN5000 Omni-Path SuperNIC Initialization Failure

There may be cases where the SuperNIC driver was not properly initialized. Symptoms of this may show up in error messages from an MPI job or another program.

Here is a sample command and error message:

$ mpirun -np 2 -m ~/tmp/mbu13 osu_latency
<nodename>:hfi_userinit: assign_port command failed: Network is down
<nodename>:can't open /dev/hfi1, network down

This is followed by messages of this type after 60 seconds:

MPIRUN<node_where_started>: 1 rank has not yet exited 60 seconds after rank 0 (node 
<nodename>) exited without reaching MPI_Finalize().
MPIRUN<node_where_started>:Waiting at most another 60 seconds for the remaining 
ranks to do a clean shutdown before terminating 1 node processes.

If this error appears, check to see if the CN5000 Omni-Path SuperNIC driver is loaded with the command:

$ lsmod | grep hfi

If no output is displayed, the driver did not load. In this case, run the following commands as root:

modprobe -v hfi1
lsmod | grep hfi1
dmesg | grep -i hfi1 | tail -25

The lsmod output shows whether the driver is now loaded, and the dmesg output may help locate any problems with the SuperNIC driver.
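For use in scripts, the loaded check can be expressed as an exit status instead of text to read. The following sketch reads /proc/modules, which is the kernel file that lsmod formats, and matches only a module named exactly hfi1. The function name hfi1_loaded is hypothetical.

```shell
#!/bin/sh
# Return 0 if a module named hfi1 is listed, nonzero otherwise.
# $1: file to read (defaults to /proc/modules)
hfi1_loaded() {
    # Each line of /proc/modules starts with the module name followed by
    # a space, so anchoring the pattern avoids matching lines for other
    # modules that only list hfi1 as a user.
    grep -q '^hfi1 ' "${1:-/proc/modules}"
}
```

A typical use would be: hfi1_loaded || echo "hfi1 is not loaded". Unlike a plain grep for hfi, this does not succeed just because another module's line mentions hfi1.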

If the driver loaded, but MPI or other programs are not working, check to see if problems were detected during the driver and hardware initialization with the command:

$ dmesg | grep -i hfi1

This command may generate more than one screen of output.
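When the output is long, a second filter can narrow it to lines that look like problems. The following is a sketch only: the keyword list (fail, error, timed out, unable) is an assumption, not a list of messages the hfi1 driver is known to emit, so an empty result does not prove that initialization succeeded. The function name hfi1_problem_lines is hypothetical.

```shell
#!/bin/sh
# Filter kernel log text on standard input down to hfi1 lines that
# contain a likely problem keyword.
# The keyword list is an illustrative assumption; adjust it as needed.
hfi1_problem_lines() {
    grep -i 'hfi1' | grep -iE 'fail|error|timed out|unable'
}
```

It can be used as: dmesg | hfi1_problem_lines. If nothing is printed, fall back to reading the full dmesg output for hfi1.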

Also, check the link status with the command:

$ hfi1_control -iv