3.6.1. Verify Servers
Perform the following steps to verify servers:
Use lspci to verify the SuperNIC PCIe cards' operating speed and bus width.
Note: Omni-Path supports PCIe cards of different widths, including dual SuperNIC cards using two x8 slices of an x16 physical connector.
Possible sources for narrow PCIe width are:
Partial insertion of the SuperNIC card into an x16 slot. This initially looks like a narrow width issue, but re-inserting the card often resolves it. Partial insertion may occur after a server is shipped. Re-inserting the card has resolved most width issues.
Server physical configuration. Many servers support different PCIe logical widths based on riser card configuration. The slot may be physically x16 but internally limited to x8. Check other servers of the same configuration in the fabric. Check the server configuration. This is also a common issue.
Swap the SuperNIC to another server to determine if the problem follows the card or the server.
Use the Linux top command to identify the key CPU load processes.
Note: opatop may be useful for checking for loads that vary over time. Use the r (rev), f (forward), and L (live) options to look through PM snapshots of system activity. This is also helpful for monitoring application startup versus run time loads. The PM captures high resolution statistics, with very low system overhead, over periods up to two days.
Check for high CPU percent processes. For example:
Screen savers - when a Linux GUI is enabled on hosts, the screen saver that runs when the interface is idle may have a high CPU load.
Test applications - look for MPI jobs or similar applications running in the background. This is a common issue, particularly in a shared fabric bring-up environment. Use kill with the process ID (kill PID) to stop orphan applications, or reboot the server to debug the issue.
Review BIOS settings to isolate nodes with different or incorrect settings.
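The CPU load check in the steps above can also be scripted when many hosts need to be screened. The following is a minimal sketch, not part of the OPX tooling: it assumes the text comes from ps -eo pid,pcpu,comm --sort=-pcpu, and the function name high_cpu_processes and the 50 percent threshold are hypothetical choices made for illustration. Note that ps reports CPU use averaged over the life of the process, so it may differ from the instantaneous figures shown by top.

```python
def high_cpu_processes(ps_output, threshold=50.0):
    """Return (pid, cpu_percent, command) tuples at or above threshold.

    ps_output is assumed to be the text printed by:
        ps -eo pid,pcpu,comm --sort=-pcpu
    The threshold is an illustrative value, not a documented limit.
    """
    hits = []
    for line in ps_output.splitlines():
        # Split into at most three fields so commands with spaces stay whole.
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header line and blank lines
        pid, cpu, command = int(fields[0]), float(fields[1]), fields[2].strip()
        if cpu >= threshold:
            hits.append((pid, cpu, command))
    # Highest load first, regardless of the input order.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample data; replace with real ps output from the host.
    sample = (
        "  PID %CPU COMMAND\n"
        " 4242 99.5 mpi_bench\n"
        " 1311 62.0 xscreensaver\n"
        "    1  0.0 systemd\n"
    )
    for pid, cpu, command in high_cpu_processes(sample):
        print(pid, cpu, command)
```

On a live host, the input text could be captured with subprocess.run(["ps", "-eo", "pid,pcpu,comm", "--sort=-pcpu"], capture_output=True, text=True).stdout and passed to the function unchanged.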
3.6.1.1. Verifying SuperNIC Speed and Bus Width Using lspci
After the OPX Software installation, verify that the CN5000 Omni-Path SuperNIC is configured and visible to the host OS as an x16 link at the expected speed. The output should be similar to the following; the Speed and Width values are the ones to check:
lspci -vv -d 434e: |grep LnkSta
LnkSta: Speed 32GT/s (ok), Width x16 (ok)
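When the same check has to be repeated on many nodes, the LnkSta line can be parsed rather than read by eye. The sketch below is an illustration only: the helper names parse_lnksta and find_degraded are hypothetical, and the expected values of 32GT/s and x16 are taken from the sample output above as defaults. As noted earlier, some dual SuperNIC configurations legitimately run at x8, so the expected width should be adjusted to match the card in use.

```python
import re

# Matches lines such as:
#   LnkSta: Speed 32GT/s (ok), Width x16 (ok)
# The parenthesised status after the speed is optional because older
# lspci versions do not print it.
LNKSTA_RE = re.compile(
    r"LnkSta:\s*Speed\s+([\d.]+GT/s)(?:\s*\([^)]*\))?,\s*Width\s+(x\d+)"
)


def parse_lnksta(text):
    """Return a list of (speed, width) tuples, one per LnkSta line found."""
    return LNKSTA_RE.findall(text)


def find_degraded(text, expected_speed="32GT/s", expected_width="x16"):
    """Return the (speed, width) pairs that differ from the expected link."""
    return [
        (speed, width)
        for speed, width in parse_lnksta(text)
        if speed != expected_speed or width != expected_width
    ]


if __name__ == "__main__":
    # Made-up sample: one healthy link and one narrow, slower link.
    sample = (
        "\t\tLnkSta:\tSpeed 32GT/s (ok), Width x16 (ok)\n"
        "\t\tLnkSta:\tSpeed 16GT/s (downgraded), Width x8 (downgraded)\n"
    )
    for speed, width in find_degraded(sample):
        print("Unexpected link:", speed, width)
```

The input text could be collected on each host with subprocess.run(["lspci", "-vv", "-d", "434e:"], capture_output=True, text=True).stdout. On many systems lspci needs root privileges to display the link capability fields, so an empty result may mean missing privileges rather than a missing card.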