6.9.5. NCCL Benchmarks – Configuration and Tuning
6.9.5.1. Running NCCL Benchmarks with Open MPI
NVIDIA NCCL provides highly optimized collective and point-to-point (P2P) communication primitives tailored for multi-GPU systems. It is designed to deliver low-latency and high-bandwidth performance over high-speed interconnects such as PCIe and NVLink.
For more information, visit the official NCCL page: NVIDIA NCCL.
6.9.5.1.1. Example: NCCL all_reduce_perf with Open MPI and OPX on a 2-Node System, with 4 GPUs per Node
mpirun -n 8 --map-by ppr:4:node \
-mca btl self,vader \
-mca mtl ofi \
-x FI_PROVIDER=opx \
-x NCCL_IB_HCA=hfi1_0:2 \
-x NCCL_SOCKET_IFNAME=eth0 \
`which all_reduce_perf` -b 4 -e 1073741824 -f 2 -g 1

Users can choose between the aws-ofi-nccl plugin (which requires separate installation) and NCCL's built-in ib plugin. By default, NCCL looks for an available external plugin in LD_LIBRARY_PATH and uses it if found. If no external plugin is detected, NCCL automatically falls back to the built-in Verbs-based plugin. To explicitly specify which plugin to use, set the NCCL_NET environment variable, for example NCCL_NET=ib to use the built-in Verbs plugin, or NCCL_NET="AWS Libfabric" to use the aws-ofi-nccl plugin, which calls the Libfabric API.
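As a minimal sketch, assuming the aws-ofi-nccl plugin is installed under /opt/aws-ofi-nccl (a hypothetical path; adjust to the actual installation), the plugin can be selected explicitly along these lines:

export LD_LIBRARY_PATH=/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH
mpirun -n 8 --map-by ppr:4:node \
-mca btl self,vader \
-mca mtl ofi \
-x LD_LIBRARY_PATH \
-x FI_PROVIDER=opx \
-x NCCL_NET="AWS Libfabric" \
`which all_reduce_perf` -b 4 -e 1073741824 -f 2 -g 1

Passing -x NCCL_NET=ib instead forces the built-in Verbs plugin, which requires no separate installation.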
6.9.5.1.2. Key Performance Variables
The following variables may have a performance impact and are platform dependent.
-x NCCL_NET_GDR_LEVEL=4: Ensures the GDR (GPUDirect RDMA) level is set to the maximum supported value.

-x NCCL_NET_GDR_READ=0: Disables GPUDirect RDMA on the send path (data is staged through host memory before reaching the NIC), which is optimal on most platforms. Exception: when using multiple GPUs per node and the topology between them is classified as NODE (that is, the connection traverses PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node), it is recommended to set NCCL_NET_GDR_READ=1.

-x NCCL_PROTO=SIMPLE: Uses the SIMPLE protocol, which is recommended for benchmarking.

-x FI_HMEM_CUDA_USE_GDRCOPY=1: Enables GDRCopy support for low-latency CUDA memory registration.
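As an illustrative sketch only, these variables can be added to the earlier all_reduce_perf invocation as additional -x options; the values shown are the ones discussed above and should be validated per platform:

mpirun -n 8 --map-by ppr:4:node \
-mca btl self,vader \
-mca mtl ofi \
-x FI_PROVIDER=opx \
-x NCCL_NET_GDR_LEVEL=4 \
-x NCCL_NET_GDR_READ=0 \
-x NCCL_PROTO=SIMPLE \
-x FI_HMEM_CUDA_USE_GDRCOPY=1 \
`which all_reduce_perf` -b 4 -e 1073741824 -f 2 -g 1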
6.9.5.1.3. GPUs and NICs Locality
For optimal NCCL performance, it is highly recommended that the GPUs and the NIC used for communication are located under the same NUMA node or CPU socket. Otherwise, RDMA communication (such as GPUDirect RDMA) might not be fully effective. To fully benefit from this configuration when using OPX with the aws-ofi-nccl plugin, it is also recommended to bind processes to the same NUMA node as the GPUs and NIC by applying CPU affinity settings with taskset or the Open MPI pinning options, for example --map-by pe-list=32,33:ordered.
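As a starting point for choosing the core list, the NUMA placement of the NIC and GPUs can be checked with standard tools before launching the benchmark (the device name hfi1_0 matches the example above; outputs vary by system):

cat /sys/class/infiniband/hfi1_0/device/numa_node   # NUMA node to which the NIC is attached
nvidia-smi topo -m                                  # GPU/NIC topology and CPU affinity per GPU
numactl --hardware                                  # CPU core ranges per NUMA node

The cores passed to taskset or --map-by pe-list should fall within the CPU range reported for the NUMA node that hosts both the GPUs and the NIC.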