6.5.6. MPI Collectives
MPI collective operations, such as Allreduce, are critical for many HPC workloads. Their performance can be evaluated with benchmarks such as the Intel MPI Benchmarks (IMB) run across large node counts.
For example, to measure Allreduce performance on 128 nodes with 32 ranks per node (a total of 4096 ranks), use the following command:
mpirun -np $((128*32)) -ppn 32 -hostfile 128hosts \
    -genv FI_PROVIDER=opx \
    IMB-MPI1 Allreduce -npmin $((128*32))
Sample output:
# Benchmarking Allreduce
# #processes = 4096
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
...
t_avg provides a general sense of the collective operation's average latency. A large gap between t_min and t_max can indicate performance variability, often caused by system-level jitter or non-uniform resource utilization.
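As a rough illustration of this interpretation, the following sketch parses IMB-MPI1-style output rows and flags message sizes whose t_max/t_min ratio exceeds a threshold. The column layout follows the sample above; the 3x threshold and the numbers in the embedded sample are invented for illustration, not taken from a real run.

```python
# Illustrative sketch: flag rows with a large t_max/t_min spread in
# IMB-MPI1-style output. Column order (#bytes, #repetitions, t_min,
# t_max, t_avg) matches the sample output shown above; the 3x ratio
# threshold is an arbitrary choice for this example.

def flag_jitter(imb_output: str, ratio_threshold: float = 3.0):
    """Return (bytes, t_min, t_max, ratio) tuples for data rows whose
    t_max/t_min ratio exceeds ratio_threshold."""
    flagged = []
    for line in imb_output.splitlines():
        fields = line.split()
        # Skip headers, separators, and anything without five columns.
        if len(fields) < 5 or line.lstrip().startswith("#"):
            continue
        try:
            nbytes = int(fields[0])
            t_min, t_max = float(fields[2]), float(fields[3])
        except ValueError:
            continue
        if t_min > 0 and t_max / t_min > ratio_threshold:
            flagged.append((nbytes, t_min, t_max, t_max / t_min))
    return flagged

# Invented sample data for demonstration only.
sample = """\
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.05         0.06         0.05
            8         1000         4.10        19.80         6.30
"""
print(flag_jitter(sample))  # only the 8-byte row exceeds the 3x ratio
```

A wide t_max/t_min spread on a few message sizes, rather than across the board, can help narrow the search to size-dependent effects such as algorithm switch points or buffer-related contention.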
Monitoring collective performance at scale helps identify bottlenecks that may not be visible in point-to-point benchmarks.