6.5.6. MPI Collectives
MPI collective operations, such as Allreduce, are critical for many HPC workloads. Their performance can be evaluated with benchmarks such as the Intel MPI Benchmarks (IMB) run across large node counts.
For example, to measure Allreduce performance on 128 nodes with 32 ranks per node (a total of 4096 ranks), use the following command:
mpirun -np $((128*32)) -ppn 32 -hostfile 128hosts \
    -genv FI_PROVIDER=opx \
    IMB-MPI1 Allreduce -npmin $((128*32))
Sample output:
# Benchmarking Allreduce
# #processes = 4096
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
...
t_avg provides a general sense of the collective operation's average latency. A large gap between t_min and t_max can indicate performance variability, often caused by system-level jitter or non-uniform resource utilization.
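As a rough illustration of this interpretation, the following sketch parses IMB-MPI1-style output rows and flags message sizes whose t_max/t_min ratio exceeds a threshold. The column layout follows the sample above; the 3x threshold and the numbers in the embedded sample are invented for illustration, not taken from a real run.

```python
# Illustrative sketch: flag rows with a large t_max/t_min spread in
# IMB-MPI1-style output. Column order (#bytes, #repetitions, t_min,
# t_max, t_avg) matches the sample output shown above; the 3x ratio
# threshold is an arbitrary choice for this example.

def flag_jitter(imb_output: str, ratio_threshold: float = 3.0):
    """Return (bytes, t_min, t_max, ratio) tuples for data rows whose
    t_max/t_min ratio exceeds ratio_threshold."""
    flagged = []
    for line in imb_output.splitlines():
        fields = line.split()
        # Skip headers, separators, and anything without five columns.
        if len(fields) < 5 or line.lstrip().startswith("#"):
            continue
        try:
            nbytes = int(fields[0])
            t_min, t_max = float(fields[2]), float(fields[3])
        except ValueError:
            continue
        if t_min > 0 and t_max / t_min > ratio_threshold:
            flagged.append((nbytes, t_min, t_max, t_max / t_min))
    return flagged

# Invented sample data for demonstration only.
sample = """\
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.05         0.06         0.05
            8         1000         4.10        19.80         6.30
"""
print(flag_jitter(sample))  # only the 8-byte row exceeds the 3x ratio
```

A wide t_max/t_min spread on a few message sizes, rather than across the board, can help narrow the search to size-dependent effects such as algorithm switch points or buffer-related contention.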
Monitoring collective performance at scale helps identify bottlenecks that may not be visible in point-to-point benchmarks.