
6.9.4. NVIDIA GPUs with CUDA

6.9.4.1. Prerequisites

To achieve optimal bandwidth and the lowest latency for CUDA-enabled GPU devices, ensure the following:

  • Hardware configuration

    Connect the NVIDIA GPU and CN5000 SuperNIC to the same CPU socket or behind a PCIe switch.

    In typical systems where both the GPU and SuperNIC are attached to the same NUMA socket (for example, socket 0), pin the MPI processes to that same socket. Use options such as I_MPI_PIN_PROCESSOR_LIST (Intel MPI) or the equivalent binding option in your MPI, and select the appropriate GPU with CUDA_VISIBLE_DEVICES; see the sketch after this list.

  • MPI configuration

    Use an MPI implementation compatible with CUDA, such as Open MPI.
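
A minimal pinning sketch, assuming the GPU and SuperNIC share NUMA socket 0 and that cores 0-15 belong to that socket (both are assumptions; verify your system's topology first):

# Inspect GPU/NIC/CPU locality before pinning.
nvidia-smi topo -m

# Expose only the socket-0 GPU and pin ranks to socket-0 cores (Intel MPI variable shown;
# substitute your actual socket-0 core list, or use mpirun --bind-to socket with Open MPI).
export CUDA_VISIBLE_DEVICES=0
export I_MPI_PIN_PROCESSOR_LIST=0-15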

6.9.4.2. Open MPI with CUDA

Compile Open MPI with CUDA support

Include the CUDA path during Open MPI compilation. Example configure command:

./configure --with-cuda=/usr/local/cuda-<cudaVers> \
            --enable-orterun-prefix-by-default \
            LDFLAGS=-Wl,--enable-new-dtags \
            --with-ofi=/usr
make -j && make install
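
To confirm that the resulting build includes CUDA support, ompi_info can be queried; the parameter below should report true:

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value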

After installation, ensure the following paths are updated (see the example after this list):

  • Add openmpi/bin to your PATH environment variable.

  • Include openmpi/lib in your LD_LIBRARY_PATH.
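
For example, assuming an installation prefix of /opt/openmpi (a hypothetical path; substitute your actual prefix):

export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH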

Application and benchmark setup

  • Use a CUDA-enabled benchmark or application, such as OMB (OSU Micro-Benchmarks) version 7.5.

  • Ensure the application is built with --enable-cuda and any additional settings specified in the application’s README documentation.
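
For example, OSU Micro-Benchmarks can typically be configured for CUDA along these lines (a sketch; the CUDA paths are assumptions, and the OMB README remains the authoritative reference):

./configure CC=mpicc CXX=mpicxx --enable-cuda \
            --with-cuda-include=/usr/local/cuda/include \
            --with-cuda-libpath=/usr/local/cuda/lib64
make -j && make install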

Note

OFI HMEM/CUDA support (FI_HMEM_CUDA_ENABLE=1) requires Open MPI 5.0 or later. Applications that use OFI with earlier Open MPI versions and require GPU support must fall back to a host-staged data path.

In the most common setup, the SuperNIC and GPU are connected to the same socket. For example, if the compute nodes have the SuperNIC on socket 0, the following command can be used:

mpirun -mca btl self,vader -mca mtl ofi \
       -x FI_PROVIDER=opx \
       -np 2 -host node1,node2 ./osu_bw D D

As of the OPX Software 12.1.1 release, higher bandwidth can be achieved by enabling BTS. If the hfi1 driver is loaded with use_bulksvc=Y, enable BTS by setting FI_OPX_HFISVC=1 and FI_HMEM_CUDA_USE_DMABUF=1.
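
A sketch of the resulting command, building on the example above (the sysfs path follows the standard module-parameter layout and is an assumption):

# Confirm the hfi1 driver was loaded with bulk services enabled (should print Y).
cat /sys/module/hfi1/parameters/use_bulksvc

mpirun -mca btl self,vader -mca mtl ofi \
       -x FI_PROVIDER=opx \
       -x FI_OPX_HFISVC=1 -x FI_HMEM_CUDA_USE_DMABUF=1 \
       -np 2 -host node1,node2 ./osu_bw D D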

6.9.4.3. Intel MPI with CUDA

This section describes how to enable device-to-device MPI communication using Intel MPI with GPUDirect on NVIDIA GPUs, using the OPX libfabric provider. Intel MPI 2021.16 and later provides the best support for CUDA-enabled GPUs. For additional documentation, refer to Intel MPI Library GPU Support.

Follow the environment setup instructions provided in Intel MPI Library Settings to properly source Intel MPI and required libraries.

Building GPU-Aware IMB (IMB-MPI1-GPU)

You can either use the prebuilt oneAPI version of IMB-MPI1-GPU or build it manually as follows:

Load Intel MPI environment:

source /path/to/intel/oneapi/setvars.sh

Build GPU-aware version:

make CC=icx IMB-MPI1-GPU CUDA_INCLUDE_DIR=/usr/local/cuda/include/

Once built, IMB-MPI1-GPU can be used to benchmark MPI performance over GPU buffers using OPX.

Running GPU-GPU Benchmarking with OPX and Intel MPI

mpirun -np 2 -ppn 1 -hosts hostA,hostB \
       -genv I_MPI_OFFLOAD=1 -genv I_MPI_OFFLOAD_RDMA=1 -genv I_MPI_OFI_MR_HMEM=1 \
       -genv FI_PROVIDER=opx -genv FI_OPX_HFISVC=1 -genv FI_HMEM_CUDA_USE_DMABUF=1 \
       ./IMB-MPI1-GPU -mem_alloc_type device

This command configures Intel MPI to offload communication directly to GPU buffers using the OPX provider with GPUDirect RDMA enabled.
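
To verify up front that the OPX provider is visible and advertises HMEM support, the libfabric fi_info utility can be used (a diagnostic sketch; the grep filter is only a convenience):

# List the OPX provider in verbose mode and look for FI_HMEM among its capabilities.
fi_info -p opx -v | grep -i hmem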