7.4.3. Software Troubleshooting
7.4.3.1. Kernel and Initialization Issues
This section describes issues that may prevent the system from coming up properly.
7.4.3.1.1. Driver Load Fails Due to Unsupported Kernel
If you try to load the Omni-Path driver on a kernel that the OPX Software does not support, the load fails with error messages that point to hfi1.ko.
To correct this problem, install a supported Linux kernel version, then reload the driver.
7.4.3.1.2. Rebuild or Reinstall Drivers if Different Kernel Installed
If you upgrade the kernel, you must reboot and then rebuild or reinstall the Omni-Path kernel modules (drivers). Refer to the CN5000 Fabric Installation Guide for more information.
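As a rough way to spot a module that was built for a different kernel, the sketch below compares a module's vermagic string with the running kernel release. The function name vermagic_matches is hypothetical, and the sketch assumes that the first word of the vermagic string is the kernel release the module was built for. A mismatch is a hint rather than proof, because some distributions can load modules built for a compatible kernel.

```shell
# Succeeds when the vermagic string ($1) names the kernel release ($2).
# Assumption: the first space-separated word of vermagic is the release.
vermagic_matches() {
    [ "${1%% *}" = "$2" ]
}

# Possible usage on a live system:
#   vermagic_matches "$(modinfo -F vermagic hfi1)" "$(uname -r)" \
#       || echo "hfi1 appears to be built for a different kernel"
```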
7.4.3.1.3. Omni-Path Interrupts Not Working
The driver cannot configure the Omni-Path link to a usable state unless interrupts are working. Check for this problem with the command:
$ grep hfi1 /proc/interrupts
Note
The output varies with board type, distribution, update level, and the number of CPUs in the system.
If there is no output at all, the driver initialization failed. For more information on driver problems, see Driver Load Fails Due to Unsupported Kernel or CN5000 Omni-Path SuperNIC Initialization Failure.
If the output is similar to one of these lines, then interrupts are not being delivered to the driver.
-MSI-edge hfi1_0 sdma6
177: 0 0 0 PCI-MSI-edge hfi1_0 sdma7
178: 0 0 0 PCI-MSI-edge hfi1_0 sdma8
179: 0 0 0 PCI-MSI-edge hfi1_0 sdma9
180: 0 0 0 PCI-MSI-edge hfi1_0 sdma10
181: 0 0 0 PCI-MSI-edge hfi1_0 sdma11
182: 0 0 0 PCI-MSI-edge hfi1_0 sdma12
183: 0 0 0 PCI-MSI-edge hfi1_0 sdma13
184: 0 0 0 PCI-MSI-edge hfi1_0 sdma14
185: 0 0 0 PCI-MSI-edge hfi1_0 sdma15
186: 39 0 0 PCI-MSI-edge hfi1_0 kctxt0
187: 1 77 0 PCI-MSI-edge hfi1_0 kctxt1
188: 0 0 0 PCI-MSI-edge hfi1_0 kctxt2
A zero count in all CPU columns means that no Omni-Path interrupts have been delivered to the processor.
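Scanning many CPU columns by eye is error-prone, so the per-interrupt counts can be summed with a short script. The following is a minimal sketch, not part of the OPX Software. The function names hfi1_irq_totals and hfi1_irq_status are hypothetical, and the sketch assumes that the first line of the input is the CPU header row and that the last field of each line is the interrupt name.

```shell
# Print "<interrupt name> <total count>" for each hfi1 line in the file
# given as $1 (defaults to /proc/interrupts).
hfi1_irq_totals() {
    awk '
        NR == 1 { ncpu = NF; next }      # header row: one field per CPU
        /hfi1/ {
            total = 0
            for (i = 2; i <= ncpu + 1; i++) total += $i
            print $NF, total
        }
    ' "${1:-/proc/interrupts}"
}

# Summarize the totals as one word:
#   no-hfi1-lines   no hfi1 interrupts are registered
#   none-delivered  every hfi1 interrupt has a zero count on all CPUs
#   delivered       at least one hfi1 interrupt has a nonzero count
hfi1_irq_status() {
    hfi1_irq_totals "$@" | awk '
        { seen = 1; if ($2 > 0) nonzero = 1 }
        END {
            if (!seen) print "no-hfi1-lines"
            else if (nonzero) print "delivered"
            else print "none-delivered"
        }'
}
```

Calling hfi1_irq_status with no argument reads /proc/interrupts. A result of none-delivered corresponds to the all-zero case described above.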
The possible causes of this problem are:
Booting the Linux kernel with ACPI disabled, either on the boot command line or in the BIOS configuration.
Other Omni-Path initialization failures.
To check if the kernel was booted with the noacpi or pci=noacpi option, use this command:
$ grep -i acpi /proc/cmdline
If output is displayed, fix the kernel boot command line so that ACPI is enabled. This command line can be set in various ways, depending on your OS distribution. If no output is displayed, check that ACPI is enabled in your BIOS settings.
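Because the kernel command line is a single long line, it can help to print only the parameters that mention ACPI. The sketch below does this. The function name acpi_boot_params is hypothetical. The match is deliberately broad, so it also reports ACPI-related parameters that do not disable ACPI, and each reported parameter still needs to be checked by hand.

```shell
# Print each boot parameter that mentions ACPI, one per line.
# Reads the file given as $1 (defaults to /proc/cmdline).
# The exit status is nonzero when no parameter matches.
acpi_boot_params() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -i acpi
}
```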
To track down other initialization failures, see CN5000 Omni-Path SuperNIC Initialization Failure.
7.4.3.1.4. OpenFabrics Load Errors if SuperNIC Driver Load Fails
When the SuperNIC driver fails to load, the other OpenFabrics drivers and modules still load and appear in the lsmod output. However, commands such as ibv_devinfo fail, as shown in the following example:
$ ibv_devinfo
libibverbs: Fatal: couldn't read uverbs ABI version.
No Omni-Path devices found
7.4.3.1.5. CN5000 Omni-Path SuperNIC Initialization Failure
In some cases, the SuperNIC driver is not properly initialized. Symptoms may show up in error messages from an MPI job or another program.
Here is a sample command and error message:
$ mpirun -np 2 -m ~/tmp/mbu13 osu_latency
<nodename>:hfi_userinit: assign_port command failed: Network is down
<nodename>:can't open /dev/hfi1, network down
This is followed by messages of this type after 60 seconds:
MPIRUN<node_where_started>: 1 rank has not yet exited 60 seconds after rank 0 (node <nodename>) exited without reaching MPI_Finalize().
MPIRUN<node_where_started>:Waiting at most another 60 seconds for the remaining ranks to do a clean shutdown before terminating 1 node processes.
If this error appears, check to see if the CN5000 Omni-Path SuperNIC driver is loaded with the command:
$ lsmod | grep hfi
If no output is displayed, the driver did not load. In this case, try the following commands (as root):
modprobe -v hfi1
lsmod | grep hfi1
dmesg | grep -i hfi1 | tail -25
The output indicates whether the driver has loaded or not. Printing out messages using dmesg may help to locate any problems with the SuperNIC driver.
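When the load check is scripted, matching the module name exactly avoids false positives from other modules that merely list hfi1 as a dependent. The following is a minimal sketch. The function name hfi1_loaded is hypothetical, and the sketch assumes the usual lsmod layout in which the module name is the first column.

```shell
# Succeeds when the lsmod-format module list on stdin contains a module
# whose name is exactly "hfi1".
hfi1_loaded() {
    awk '$1 == "hfi1" { found = 1 } END { exit !found }'
}

# Possible usage on a live system:
#   lsmod | hfi1_loaded && echo "hfi1 is loaded" || echo "hfi1 is not loaded"
```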
If the driver loaded, but MPI or other programs are not working, check to see if problems were detected during the driver and hardware initialization with the command:
$ dmesg | grep -i hfi1
This command may generate more than one screen of output.
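To cut that output down, the hfi1 messages can be filtered for words that commonly signal a problem. The sketch below is one way to do this. The function name hfi1_log_problems and the keyword list are assumptions, not a list of messages the driver is known to emit, so an empty result does not prove that initialization succeeded.

```shell
# Read kernel log text on stdin and print only the hfi1 lines that contain
# one of a few problem keywords. The keyword list is an illustrative guess.
hfi1_log_problems() {
    grep -i hfi1 | grep -iE 'error|fail|timeout|unsupported'
}

# Possible usage on a live system:
#   dmesg | hfi1_log_problems
```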
Also, check the link status with the command:
$ hfi1_control -iv