## Communication Overhead
DDP adds communication overhead: on every step, gradients are all-reduced across ranks. Understanding when that cost is worth paying is crucial.
Overhead = Communication Time / Total Time
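This ratio can be computed directly from measured per-step timings. A minimal sketch, assuming no overlap between communication and computation (the timing numbers are made up for illustration):

```python
def ddp_overhead(comm_ms: float, compute_ms: float) -> float:
    """Fraction of a training step spent communicating.

    Assumes no overlap between communication and computation,
    so total step time is simply the sum of the two.
    """
    return comm_ms / (comm_ms + compute_ms)

# Hypothetical step: 20 ms all-reduce, 80 ms forward+backward
print(ddp_overhead(20, 80))  # → 0.2
```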
## When DDP Breaks Down
DDP becomes inefficient when:
- Model too small: the all-reduce costs more than the computation it could overlap with
- Network too slow: high latency and low bandwidth inflate communication time
- Batch size per rank too small: there is not enough computation to hide communication behind
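A back-of-envelope estimate can flag these cases before a run. The sketch below uses the standard ring all-reduce cost model (each rank transfers roughly 2(N−1)/N of the gradient bytes); the model size, bandwidth, and compute time are hypothetical:

```python
def ring_allreduce_seconds(grad_bytes: float, bandwidth_Bps: float,
                           world_size: int) -> float:
    """Approximate ring all-reduce time: each rank sends and receives
    ~2*(N-1)/N of the gradient bytes over the interconnect."""
    return 2 * (world_size - 1) / world_size * grad_bytes / bandwidth_Bps

# Hypothetical setup: 350 MB of fp32 gradients, 10 GB/s link, 8 ranks
t_comm = ring_allreduce_seconds(350e6, 10e9, 8)
t_compute = 0.050  # assumed measured forward+backward time per step

print(t_comm)              # ≈ 0.061 s
print(t_comm > t_compute)  # True → communication-bound at this batch size
```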
## Bottleneck Analysis

### Computation-Bound
Computation time ≫ communication time:
- DDP scales near-linearly (ideal speedup)
- Scaling: Add more GPUs
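A simple throughput model makes the near-linear claim concrete. With a fixed per-rank batch (weak scaling) and a roughly constant all-reduce cost, the speedup over one GPU looks like this (the timings below are assumed, not measured):

```python
def weak_scaling_speedup(compute_s: float, comm_s: float, n_gpus: int) -> float:
    """Throughput on n GPUs relative to one GPU, assuming each rank keeps
    the same batch size and the all-reduce cost stays constant with n."""
    return n_gpus * compute_s / (compute_s + comm_s)

# Compute-bound: 100 ms compute vs 5 ms comm on 8 GPUs → near-ideal scaling
print(round(weak_scaling_speedup(0.100, 0.005, 8), 2))  # ≈ 7.62 of an ideal 8
```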
### Communication-Bound
Communication time ≫ computation time:
- DDP scaling plateaus
- Solutions: Gradient compression, larger batch size per GPU, faster interconnect (InfiniBand/NVLink)
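Of these, enlarging the per-GPU batch is often the cheapest lever: compute time grows with the batch while the all-reduce cost is fixed, because gradient size does not depend on batch size. A sketch with made-up timings:

```python
def overhead_fraction(comm_s: float, compute_s_per_sample: float,
                      batch_per_gpu: int) -> float:
    """Communication overhead as a fraction of step time; the all-reduce
    cost is fixed since gradient size is independent of batch size."""
    compute_s = compute_s_per_sample * batch_per_gpu
    return comm_s / (comm_s + compute_s)

print(overhead_fraction(0.02, 0.001, 32))   # small batch: high overhead
print(overhead_fraction(0.02, 0.001, 128))  # 4x batch: overhead shrinks
```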
### Memory-Bound
GPU memory limits batch size:
- Solution: Model parallelism (FSDP, pipeline, or tensor parallelism) instead of pure DDP
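The per-GPU memory difference is easy to see: DDP replicates the full model on every rank, while FSDP shards parameters across ranks. A sketch with illustrative numbers (optimizer state and activations are ignored here):

```python
def per_gpu_param_gb(n_params: float, bytes_per_param: int,
                     world_size: int, sharded: bool) -> float:
    """Parameter memory per GPU: a full replica under DDP,
    a 1/world_size shard under FSDP."""
    full_gb = n_params * bytes_per_param / 1e9
    return full_gb / world_size if sharded else full_gb

# Hypothetical 7B-parameter model in fp16 (2 bytes/param) on 8 GPUs
print(per_gpu_param_gb(7e9, 2, 8, sharded=False))  # DDP:  14.0 GB per GPU
print(per_gpu_param_gb(7e9, 2, 8, sharded=True))   # FSDP: 1.75 GB per GPU
```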
## Further Reading