microddp

All-Reduce Algorithms: Communication Overhead

Overview

We have n ranks, each with a tensor. We want all ranks to have the sum (or average) of all tensors.

Notation: n = number of ranks/GPUs, S = size of the tensor (in bytes or elements).

Rank 0: [1, 2, 3]
Rank 1: [4, 5, 6]
Rank 2: [7, 8, 9]

After all-reduce SUM:
All ranks: [12, 15, 18]
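The semantics above can be sketched in plain Python. This is a single-process simulation, not a distributed implementation: a list of lists stands in for the ranks.

```python
def allreduce_sum(tensors):
    # Element-wise sum across ranks; every rank gets its own copy of the result.
    total = [sum(vals) for vals in zip(*tensors)]
    return [list(total) for _ in tensors]

ranks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(allreduce_sum(ranks))  # [[12, 15, 18], [12, 15, 18], [12, 15, 18]]
```

The algorithms below all compute exactly this result; they differ only in how the data moves between ranks.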

Naive All-Reduce

  1. All ranks send their tensors to rank 0.
  2. Rank 0 sums the tensors.
  3. Rank 0 broadcasts the result back to all ranks.

Communication:

Each of the n - 1 non-root ranks sends S bytes to rank 0, and rank 0 sends S bytes back to each of them, so 2 * (n - 1) * S bytes flow through rank 0 in total. Rank 0 is a bandwidth bottleneck that scales linearly with n, while every other rank only moves 2S bytes.
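The three steps can be sketched as a single-process simulation that also counts the traffic through rank 0 (elements rather than bytes, for simplicity):

```python
def naive_allreduce(tensors):
    n, S = len(tensors), len(tensors[0])
    traffic = 0
    # 1. Ranks 1..n-1 send their tensors to rank 0.
    traffic += (n - 1) * S
    # 2. Rank 0 sums the tensors element-wise.
    root_sum = [sum(vals) for vals in zip(*tensors)]
    # 3. Rank 0 broadcasts the result back to ranks 1..n-1.
    traffic += (n - 1) * S
    return [list(root_sum) for _ in tensors], traffic

results, traffic = naive_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(traffic)  # 2 * (n-1) * S = 2 * 2 * 3 = 12 elements through rank 0
```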

Tree All-Reduce

Ranks are arranged in a binary tree. In the reduce phase, partial sums flow up the tree: each parent receives its children's tensors, adds them to its own, and passes the result upward until the root holds the full sum. In the broadcast phase, the result flows back down the tree to every rank.

Communication:

Both phases take ceil(log2(n)) steps, and every step transfers the full tensor (S bytes) over each active link, so total time scales as O(S * log n). The single-rank bottleneck of the naive scheme is gone, but each step still moves the entire tensor.
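A single-process sketch of the tree pattern, counting steps (again a simulation; real code would use point-to-point sends between processes):

```python
def tree_allreduce(tensors):
    n = len(tensors)
    t = [list(x) for x in tensors]
    steps = 0
    # Reduce phase: partial sums flow up a binary tree toward rank 0.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            if i + d < n:
                t[i] = [a + b for a, b in zip(t[i], t[i + d])]
        d *= 2
        steps += 1
    # Broadcast phase: the result flows back down the same tree.
    while d > 1:
        d //= 2
        for i in range(0, n, 2 * d):
            if i + d < n:
                t[i + d] = list(t[i])
        steps += 1
    return t, steps  # steps = 2 * ceil(log2(n))

out, steps = tree_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(out, steps)  # all ranks [12, 15, 18], 4 steps
```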

Ring All-Reduce

Ranks form a logical ring: rank i only ever talks to its neighbors (i - 1) mod n and (i + 1) mod n. Each tensor is split into n equal chunks.

Phase 1 – Scatter-Reduce

For n - 1 steps, every rank sends one chunk to its right neighbor and receives one chunk from its left neighbor, adding the incoming chunk to its own copy. After n - 1 steps, each rank holds exactly one chunk that contains the full sum across all ranks.

Phase 2 – All-Gather

For another n - 1 steps, every rank forwards a fully reduced chunk to its right neighbor, so the completed chunks travel around the ring until all ranks hold all n of them.

Communication:

Each rank sends and receives (n - 1) * S / n bytes in each phase, i.e. 2 * (n - 1) / n * S ≈ 2S bytes total, nearly independent of n. The load is spread evenly across all ranks and links, which is why ring all-reduce is widely used for large tensors.
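Plugging numbers into the formulas makes the contrast concrete; the helper names below are ours, for illustration only:

```python
def naive_root_bytes(n, S):
    # Naive: rank 0 receives (n-1)*S and sends (n-1)*S.
    return 2 * (n - 1) * S

def ring_per_rank_bytes(n, S):
    # Ring: each rank sends (n-1)*S/n per phase, over two phases.
    return 2 * (n - 1) * S / n

S = 1 << 30  # 1 GiB tensor
for n in (2, 8, 64):
    # Traffic in units of S: naive grows linearly with n, ring stays below 2S.
    print(n, naive_root_bytes(n, S) // S, ring_per_rank_bytes(n, S) / S)
```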

Implementation

See src/allreduce.py

Double Buffering

When using dist.isend, the data in send_buff isn’t sent instantly. If you overwrite send_buff before the send finishes, you risk corrupting the outgoing data. To avoid this, ring all-reduce alternates between two buffers (send_buff and recv_buff).
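Why the swap matters can be shown without any distributed setup. The `DelayedSend` class below is a hypothetical mock of an async send: like `dist.isend`, it keeps only a reference to the buffer and reads the data when the transfer "completes" at `wait()`.

```python
class DelayedSend:
    """Mock async send: holds a reference to the buffer and reads it at wait()."""
    def __init__(self, buff):
        self.buff = buff
        self.sent = None

    def wait(self):
        self.sent = list(self.buff)  # data is read only when the send completes
        return self.sent

# Unsafe: receiving into the same buffer while the send is still in flight.
send_buff = [1, 2, 3]
req = DelayedSend(send_buff)
send_buff[:] = [9, 9, 9]   # incoming data overwrites the outgoing data
print(req.wait())          # [9, 9, 9] -- corrupted!

# Safe: double buffering. Incoming data lands in a second buffer.
send_buff, recv_buff = [1, 2, 3], [0, 0, 0]
req = DelayedSend(send_buff)
recv_buff[:] = [9, 9, 9]   # receive does not touch the in-flight send buffer
print(req.wait())          # [1, 2, 3] -- intact
send_buff, recv_buff = recv_buff, send_buff  # roles swap for the next step
```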

3-Rank Example (R0, R1, R2)

Each rank starts with its tensor: $T_0$, $T_1$, $T_2$, each split into 3 chunks, so rank $i$ holds chunks $T_i^{(0)}, T_i^{(1)}, T_i^{(2)}$.

Scatter-reduce (2 steps):

  1. R0 sends chunk 0 to R1, R1 sends chunk 1 to R2, R2 sends chunk 2 to R0; each receiver adds the incoming chunk to its own.
  2. R0 sends chunk 2 to R1, R1 sends chunk 0 to R2, R2 sends chunk 1 to R0. After this step, R0 holds the full sum of chunk 1, R1 of chunk 2, and R2 of chunk 0.

All-gather (2 steps): each rank forwards its fully reduced chunk (then the one it just received) to its right neighbor, so after 2 more steps every rank holds all 3 reduced chunks.
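This trace can be checked end-to-end with a small pure-Python simulation (single process, list indices stand in for ranks; at step s of scatter-reduce, rank i sends chunk (i - s) mod n):

```python
def ring_allreduce(tensors):
    n = len(tensors)
    chunk = len(tensors[0]) // n  # tensor length must be divisible by n here
    t = [list(x) for x in tensors]

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: scatter-reduce. Rank i sends chunk (i - s) % n to its right
    # neighbor, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i - s) % n, (i + 1) % n
            t[dst][span(c)] = [a + b for a, b in zip(t[dst][span(c)], t[i][span(c)])]

    # Phase 2: all-gather. Each rank forwards a fully reduced chunk onward.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i + 1 - s) % n, (i + 1) % n
            t[dst][span(c)] = list(t[i][span(c)])
    return t

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))  # all ranks [12, 15, 18]
```

With the overview's inputs this reproduces the trace above: after scatter-reduce, R0 holds 15 (chunk 1), R1 holds 18 (chunk 2), R2 holds 12 (chunk 0), and all-gather spreads them to every rank.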

Further Reading