Training large models on a single GPU quickly hits memory and compute limits, which motivates parallelizing training across multiple devices. PyTorch offers two data-parallel approaches:

| Aspect | DataParallel (DP) | DistributedDataParallel (DDP) |
|---|---|---|
| Process Model | Single-process, multi-threaded | Multi-process, typically one process per device (GPU) |
| Machine Support | Only works on a single machine | Supports both single-machine and multi-machine setups |
| Model Replication | Replicated to all devices on every forward pass (high overhead) | Model is replicated once at startup; each process has its own replica |
| Communication | Via threads; master process gathers grads (GIL bottleneck) | Collectives (e.g. all-reduce) run asynchronously outside the GIL |
| Performance | Generally slower due to replication and GIL | Much faster; enables computation/communication overlap |

A process is an independent program with its own memory space; a thread is a lightweight unit of work within a process that shares that memory with the process's other threads. Processes are isolated from one another, while threads are not.
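The distinction can be seen directly in Python. A small illustrative script (not from the source; it assumes the POSIX-only `fork` start method so the child inherits a copy of the parent's memory):

```python
import threading
import multiprocessing

counter = {"value": 0}

def bump(store):
    store["value"] += 1

# A thread mutates the dict owned by the main process: the change is visible.
t = threading.Thread(target=bump, args=(counter,))
t.start(); t.join()
print(counter["value"])  # 1 -- threads share memory

# A forked process gets its own *copy* of the dict: the parent's is untouched.
ctx = multiprocessing.get_context("fork")  # POSIX-only assumption
p = ctx.Process(target=bump, args=(counter,))
p.start(); p.join()
print(counter["value"])  # still 1 -- processes are isolated
```

This isolation is exactly why DDP's one-process-per-GPU design sidesteps the GIL: each process runs Python independently, at the cost of explicit communication for gradients.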

Broadcast: Initialize model weights on one node, send to all nodes.

Forward/Backward: Each node trains on different data chunk, computes local gradients.

All-Reduce: Sum gradients across all nodes and divide by the node count, so every node receives the same averaged gradient.

Update: Each node updates its model using the averaged gradients.
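The four steps above can be sketched as a pure-Python simulation (no real framework; helper names like `all_reduce_mean` are illustrative, not a library API):

```python
# "Nodes" are plain lists; the model is 1-D least squares y = w * x.
def broadcast(weights, world_size):
    # Step 1: every node starts from the same weights.
    return [list(weights) for _ in range(world_size)]

def local_gradient(weights, batch):
    # Step 2: gradient of mean squared error on this node's data chunk.
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in batch) / len(batch)]

def all_reduce_mean(grads):
    # Step 3: sum gradients across nodes, divide by node count,
    # and hand every node the same averaged result.
    world_size = len(grads)
    avg = [sum(g[i] for g in grads) / world_size for i in range(len(grads[0]))]
    return [list(avg) for _ in range(world_size)]

def step(replicas, avg_grads, lr):
    # Step 4: the identical update on every node keeps replicas in sync.
    return [[w - lr * g for w, g in zip(rep, grad)]
            for rep, grad in zip(replicas, avg_grads)]

# Two nodes, each training on its own shard of data drawn from y = 3 * x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
replicas = broadcast([0.0], world_size=2)
for _ in range(50):
    grads = all_reduce_mean([local_gradient(r, s)
                             for r, s in zip(replicas, shards)])
    replicas = step(replicas, grads, lr=0.02)
print(replicas[0])  # both replicas converge toward w = 3
```

Because every node applies the same averaged gradient, the replicas never drift apart, which is what lets DDP keep one independent copy per process.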


Point-to-Point: Communication between two specific processes, e.g. one rank sends a tensor and exactly one other rank receives it (send/recv).
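As a sketch, point-to-point send/recv between two ranks can be emulated with queue-backed channels; the `send`/`recv` names here only mirror the style of a distributed API and are defined locally:

```python
import queue
import threading

# One channel per (src, dst) pair of ranks.
channels = {(0, 1): queue.Queue()}

def send(data, src, dst):
    channels[(src, dst)].put(data)

def recv(src, dst):
    return channels[(src, dst)].get()  # blocks until data arrives

result = {}

def rank0():
    send([1.0, 2.0, 3.0], src=0, dst=1)  # rank 0 ships its data to rank 1

def rank1():
    result["data"] = recv(src=0, dst=1)  # rank 1 waits for rank 0

t0, t1 = threading.Thread(target=rank0), threading.Thread(target=rank1)
t1.start(); t0.start(); t0.join(); t1.join()
print(result["data"])  # [1.0, 2.0, 3.0]
```

Point-to-point primitives like these are the building blocks out of which collectives are composed.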

Collective Communication: Operations in which every process in a group participates, e.g. broadcast, all-gather, all-reduce. Efficient implementations such as ring all-reduce avoid funneling all data through a single master node.
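A hedged sketch of one such collective, ring all-reduce, in pure Python (not a real library API): each rank's vector is split into one chunk per rank, and partial sums travel around the ring, so no single node ever holds all the traffic.

```python
def ring_all_reduce(vectors):
    n = len(vectors)                       # n ranks, n chunks per vector
    bufs = [list(v) for v in vectors]
    # Phase 1, reduce-scatter: after n-1 hops, rank r holds the fully
    # summed chunk (r + 1) % n.
    for s in range(n - 1):
        sends = [(r, (r - s) % n) for r in range(n)]
        vals = [bufs[r][c] for r, c in sends]  # snapshot before applying
        for (r, c), val in zip(sends, vals):
            bufs[(r + 1) % n][c] += val        # next rank accumulates
    # Phase 2, all-gather: circulate the finished chunks so every rank
    # ends up with the complete sum.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        vals = [bufs[r][c] for r, c in sends]
        for (r, c), val in zip(sends, vals):
            bufs[(r + 1) % n][c] = val         # next rank overwrites
    return bufs

print(ring_all_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# every rank: [12, 15, 18]
```

The snapshot-before-apply step models the fact that, on real hardware, all ranks exchange chunks simultaneously; each rank only ever sends one chunk per hop, which is what makes the ring bandwidth-optimal compared with a naive gather at a master.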

