

By looking at the pebble graph, we can observe the main inefficiency of naive model parallelism: at any given moment only one device is doing useful work, while every other device sits idle waiting for activations or gradients.
Why requires_grad = True?

Case 1: requires_grad = True. In this case, we allow the "chain" of math to stay connected across the two devices. When the received tensor has C.requires_grad = True and you call .backward() on Rank 1, PyTorch calculates the gradient for Rank 1's weights (D) and for the input C, storing it in C.grad. Rank 1 then calls send_backward(C.grad) to Rank 0. Rank 0 receives this gradient and can now calculate the gradients for its own weights (B).

Case 2: requires_grad = False. In this case, Rank 1 treats the incoming data as a "constant" rather than a variable. When you call .backward() on Rank 1, PyTorch calculates the gradient for the weights (D), but because C was marked as a constant (no grad required), the engine stops there: C.grad is None, and Rank 1 has nothing to send back. Rank 0 never receives a gradient, so its weights (B) never move. Only half the model learns.

Essentially, requires_grad = True creates a "hook" at the very edge of the device's memory. Without that hook, the backward pass has nothing to grab onto to pull the information back across the network to the other device.
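The two cases above can be reproduced on a single process. This is a minimal sketch, not a real distributed run: the names B, C, and D follow the text, and the send_forward/send_backward calls are simulated by plain variable hand-offs instead of torch.distributed point-to-point ops.

```python
import torch

# "Rank 0" stage: weight B, producing activation C
B = torch.randn(4, 4, requires_grad=True)
x = torch.randn(2, 4)
C_local = x @ B            # the graph for this stage stays on "Rank 0"

# send_forward: hand a detached copy across the boundary (Case 1)
C = C_local.detach()
C.requires_grad = True     # the "hook" at the edge of Rank 1's memory

# "Rank 1" stage: weight D
D = torch.randn(4, 4, requires_grad=True)
loss = (C @ D).sum()
loss.backward()            # fills D.grad and, via the hook, C.grad

# send_backward: Rank 0 resumes its half of the backward pass
C_local.backward(C.grad)   # fills B.grad

# Case 2 for contrast: the received tensor stays a "constant"
C_const = C_local.detach()          # requires_grad remains False
(C_const @ D).sum().backward()      # D.grad accumulates; C_const.grad is None
```

Running this, B.grad, C.grad, and D.grad are all populated in Case 1, while in Case 2 the autograd engine stops at C_const and leaves its .grad as None.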
Why detach()?

If you do not call .detach(), the output tensor remains "hooked" to the computational graph on the current GPU. PyTorch will keep all the activations of all previous layers in memory because it thinks you might call .backward() on that specific output variable later. This leads to a massive memory leak.
By detaching, you are explicitly saying: “I am done with this forward pass locally. I am handing off a static copy of the data to the next device”.
The detached copy then gets requires_grad = True on the receiving side, while the sender holds on to its original output so that, once the gradient arrives, it can call output.backward(received_grad). You don't want the next GPU to have "ghost" references to memory that it cannot access: you give it the activations and keep the graph locally for when the backward pass eventually returns.
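The sender-side half of this hand-off can be sketched as follows. send_forward and recv_backward are placeholders for real point-to-point communication (e.g. torch.distributed send/recv); here the "received" gradient is just a stand-in tensor of the right shape.

```python
import torch

layer = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

output = layer(x)             # local graph (x -> layer -> output) stays here
to_send = output.detach()     # static copy for the next device
# send_forward(to_send)       # placeholder: ship activations downstream

# ... later, recv_backward() delivers the gradient w.r.t. our output:
received_grad = torch.ones_like(output)   # stand-in for the real gradient

output.backward(received_grad)  # resume backward through the kept local graph
```

After output.backward(received_grad), layer.weight.grad and layer.bias.grad are populated locally, while to_send carried no graph across the boundary.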