6.9.3. AMD GPUs with ROCm
6.9.3.1. Prerequisites
To achieve optimal bandwidth and the lowest latency for ROCm-enabled GPU devices, ensure the following:
Hardware configuration
Connect the AMD GPU and CN5000 SuperNIC to the same CPU socket or behind a PCIe switch.
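One way to confirm this placement (a quick check, assuming the hwloc and pciutils packages are installed) is to inspect the node topology and PCIe tree:

lstopo-no-graphics   # shows which NUMA node / socket each GPU and NIC attaches to
lspci -tv            # shows the PCIe device tree if hwloc is not available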
MPI configuration
Use an MPI implementation compatible with ROCm, such as Open MPI.
Compile Open MPI with ROCm support
Include the ROCm path during Open MPI compilation. Example configure command:

./configure --with-rocm=/opt/rocm \
    --with-rccl=/opt/rocm \
    --with-cuda=no \
    --enable-orterun-prefix-by-default \
    LDFLAGS=-Wl,--enable-new-dtags \
    --with-ofi=/usr

After installation, ensure the following paths are updated:
Add openmpi/bin to your PATH environment variable.
Include openmpi/lib and rocm/lib in your LD_LIBRARY_PATH.
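For example, assuming Open MPI was installed under /opt/openmpi and ROCm under /opt/rocm (adjust the paths to match your installation), the environment could be set as follows:

export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:/opt/rocm/lib:$LD_LIBRARY_PATH

You can then confirm that ROCm support was compiled in by checking the configure line reported by ompi_info, for example with ompi_info | grep -i rocm.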
6.9.3.2. Application and Benchmark Setup
Use a ROCm-enabled benchmark or application, such as OMB (OSU Micro-Benchmarks) version 7.5.
Ensure the application is built with --enable-rocm and any additional settings specified in the application’s README documentation. A sketch of such a build is shown below.
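As an illustration, a typical OMB 7.5 build with ROCm support might look like the following, assuming the ROCm-aware Open MPI compiler wrappers (mpicc/mpicxx) are on your PATH and ROCm is installed under /opt/rocm; refer to the OMB README for the authoritative list of options:

./configure CC=mpicc CXX=mpicxx --enable-rocm --with-rocm=/opt/rocm
make
make install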
Example run command:
mpirun -x FI_OPX_HFI_SELECT=0 -x HIP_VISIBLE_DEVICES=0 \
-mca mtl ofi -x FI_PROVIDER=opx -mca btl self,vader \
-np 2 -host node1,node2 ./osu_bw D D

As of the OPX Software 12.1.1 release, higher bandwidth can be achieved by enabling BTS. If the hfi1 driver is loaded with use_bulksvc=Y, you can enable BTS by setting FI_OPX_HFISVC=1 and FI_HMEM_ROCR_USE_DMABUF=1.
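For example, if the driver reports bulk services enabled, the run command above can be extended with these variables. The sysfs path below follows the standard Linux module-parameter layout and may differ on your system:

cat /sys/module/hfi1/parameters/use_bulksvc   # expect Y if bulk services are enabled

mpirun -x FI_OPX_HFI_SELECT=0 -x HIP_VISIBLE_DEVICES=0 \
    -x FI_OPX_HFISVC=1 -x FI_HMEM_ROCR_USE_DMABUF=1 \
    -mca mtl ofi -x FI_PROVIDER=opx -mca btl self,vader \
    -np 2 -host node1,node2 ./osu_bw D D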