We have n ranks, each with a tensor. We want all ranks to have the sum (or average) of all tensors.
Notation: n = number of ranks/GPUs, S = size of the tensor (in bytes or elements).
Rank 0: [1, 2, 3]
Rank 1: [4, 5, 6]
Rank 2: [7, 8, 9]
After all-reduce SUM:
All ranks: [12, 15, 18]
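The semantics of all-reduce SUM can be checked with a single-process sketch (this only shows what every rank ends up with, not how the communication happens; `allreduce_sum` is a name made up here):

```python
def allreduce_sum(tensors):
    """Return the elementwise sum that every rank holds after all-reduce SUM."""
    result = [0] * len(tensors[0])
    for t in tensors:
        for i, x in enumerate(t):
            result[i] += x
    return result

ranks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(allreduce_sum(ranks))  # [12, 15, 18] — matches the example above
```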
Communication:
Phase 1 – Scatter-Reduce: the tensor is split into n chunks; after n − 1 steps around the ring, each rank holds one chunk that is fully reduced (the sum of that chunk across all ranks).
Phase 2 – All-Gather: the reduced chunks circulate around the ring for another n − 1 steps, so every rank ends up with the complete reduced tensor.
Communication: each rank sends 2(n − 1) chunks of size S/n, so about 2S(n − 1)/n per rank — nearly independent of n.
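Assuming the tensor is split evenly into n chunks of size S/n, the per-rank ring all-reduce traffic works out as follows (the function name is just for illustration):

```python
def ring_allreduce_bytes_per_rank(n, S):
    """Bytes sent by each rank in a chunked ring all-reduce.

    Scatter-reduce: n - 1 sends of a chunk of size S / n.
    All-gather:     n - 1 sends of a chunk of size S / n.
    """
    return 2 * (n - 1) * S / n

# For n = 3 ranks and a 12-byte tensor: 2 * 2 * 12 / 3 = 16.0 bytes per rank.
print(ring_allreduce_bytes_per_rank(3, 12))  # 16.0
```

As n grows, the cost approaches 2S per rank, which is why the ring algorithm scales well.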
See src/allreduce.py
dist.isend is non-blocking: when the call returns, the data in send_buff has not necessarily been sent yet. If you overwrite send_buff before the send completes, you risk corrupting the outgoing data. To avoid this, ring all-reduce alternates between two buffers (send_buff and recv_buff).
Each rank starts with its tensor: $T_0$, $T_1$, $T_2$.
Start:
R0: send_buff = T_0, accum = T_0
R1: send_buff = T_1, accum = T_1
R2: send_buff = T_2, accum = T_2
Step 0 (even):
Each rank sends its send_buff to the right, receives into recv_buff from the left, and adds recv_buff to accum.
(e.g., R0 now has $T_0 + T_2$ in accum)
Step 1 (odd):
Each rank sends its recv_buff to the right, receives into send_buff from the left, and adds send_buff to accum.
(e.g., R0’s accum is now $T_0 + T_2 + T_1$ — the total sum.)
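The walkthrough above can be simulated in a single process. This is only a model of the buffer-swapping ring — real code would use dist.isend/dist.irecv as in src/allreduce.py; here each "rank" is a dict and communication is modelled by copying lists:

```python
def ring_sum(tensors):
    """Simulate the two-buffer ring reduce: n - 1 steps, whole tensor per step."""
    n = len(tensors)
    ranks = [{"send": list(t), "recv": None, "accum": list(t)} for t in tensors]
    for step in range(n - 1):
        # Even steps forward send_buff, odd steps forward recv_buff, so the
        # buffer written by the previous receive is the one sent next.
        out_key = "send" if step % 2 == 0 else "recv"
        in_key = "recv" if step % 2 == 0 else "send"
        # Snapshot outgoing buffers first to model simultaneous send/receive.
        outgoing = [r[out_key] for r in ranks]
        for rank in range(n):
            received = outgoing[(rank - 1) % n]  # from the left neighbour
            ranks[rank][in_key] = list(received)  # receive into the idle buffer
            ranks[rank]["accum"] = [a + b for a, b in zip(ranks[rank]["accum"], received)]
    return [r["accum"] for r in ranks]

print(ring_sum([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# [[12, 15, 18], [12, 15, 18], [12, 15, 18]]
```

After step 0, R0's accum is $T_0 + T_2$; after step 1 it is $T_0 + T_2 + T_1$, matching the trace above.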