



batch_size × N data per forward and backward, totaling approximately 2 × (num_GPUs – 1) × batch_size × N floats per batch for a model of hidden size N.
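As a concrete instance of the formula above, here is a quick estimate using illustrative values (4 GPUs, batch size 32, hidden size 1024 are assumptions, not values from the text):

```python
# Illustrative communication-volume estimate for the formula above.
# Assumed values: 4 GPUs, batch size 32, hidden size N = 1024.
num_gpus = 4
batch_size = 32
hidden_size = 1024  # N

# Each pipeline boundary carries batch_size * N floats forward and the
# same amount backward: 2 * (num_GPUs - 1) * batch_size * N in total.
floats_per_batch = 2 * (num_gpus - 1) * batch_size * hidden_size
bytes_per_batch = floats_per_batch * 4  # float32

print(floats_per_batch)  # 196608
print(bytes_per_batch)   # 786432 bytes, i.e. 0.75 MiB
```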
- Async send (`isend_forward`) - prevents deadlock
- Saving the request (`async_requests.append(req)`) - prevents buffer deallocation

**Blocking `send_forward` (DEADLOCK)**

Rank 2:

- Blocked in `send_forward()` for microbatch 1, waiting for Rank 3 to call `recv_forward()`
- Cannot call `recv_backward()` for microbatch 0 because it's stuck in the blocking send

Rank 3:

- Blocked in `send_backward()` for microbatch 0, waiting for Rank 2 to call `recv_backward()`
- Cannot call `recv_forward()` for microbatch 1 because it already finished forward(0) and moved to backward(0)

Result: both ranks are blocked sending, and neither can receive, because each receive can only run after the other rank's blocking send returns - a circular wait.
This is why the async `isend_forward()` is needed: Rank 2 can start the send and immediately proceed to `run_backward(0)` → `recv_backward()`, which unblocks Rank 3, allowing Rank 3 to eventually receive the forward(mb1) that Rank 2 already started.
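The hand-off described above can be simulated with threads. In this sketch, two "ranks" exchange data through rendezvous channels whose `send()` blocks until the peer receives; the `isend()` variant performs the blocking send on a helper thread, exactly mirroring why the async send breaks the circular wait. All names (`Channel`, `isend`, the microbatch labels) are illustrative, not the pipeline's actual API:

```python
# Thread-based simulation of the 1F1B hand-off between two adjacent ranks.
import queue
import threading

class Channel:
    """Rendezvous channel: send() blocks until the peer calls recv()."""
    def __init__(self):
        self._q = queue.Queue()
        self._ack = threading.Event()

    def send(self, item):
        self._q.put(item)
        self._ack.wait()   # block until the receiver picks it up
        self._ack.clear()

    def recv(self):
        item = self._q.get()
        self._ack.set()
        return item

    def isend(self, item):
        """Async send: run the blocking send on a helper thread."""
        t = threading.Thread(target=self.send, args=(item,))
        t.start()
        return t           # "request" handle; join() plays the role of wait()

fwd_ch = Channel()  # rank 2 -> rank 3 activations
bwd_ch = Channel()  # rank 3 -> rank 2 gradients
log = []

def rank2():
    req = fwd_ch.isend("act_mb1")                  # async: do NOT block here
    log.append(("rank2 recv_backward", bwd_ch.recv()))
    req.join()                                     # drain the send afterwards

def rank3():
    bwd_ch.send("grad_mb0")                        # blocking send to rank 2
    log.append(("rank3 recv_forward", fwd_ch.recv()))

t2 = threading.Thread(target=rank2)
t3 = threading.Thread(target=rank3)
t2.start(); t3.start(); t2.join(); t3.join()
print(log)
```

If `rank2` used the blocking `fwd_ch.send(...)` instead of `isend`, both threads would wait on each other's acknowledgement forever, reproducing the deadlock.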
**`isend_forward` WITHOUT saving the request (CRASH)**

```python
# Rank 0:
def run_forward(micro_batch_idx):
    output = model(input_data)
    req = comms.isend_forward(output.detach())  # Start async send
    # req is NOT saved anywhere
    # Function returns, req goes out of scope
    # Python GC may collect the req object
```

What happens inside Gloo:

1. `isend()` creates an internal buffer reference to `output.detach()`
2. The send happens in a background thread
3. The req object is GC'd → Gloo loses its reference
4. Gloo tries to access the buffer → "Cannot lock pointer to unbound buffer"
5. CRASH
Why saving the request fixes it:

```python
async_requests.append(req)  # Keep the reference alive!
# Now req stays in scope until the pipeline step returns,
# so Gloo can safely access the buffer during the async send.
```
**`isend_forward` WITH saving the request (WORKS)**

```python
# Rank 0:
def run_forward(micro_batch_idx):
    output = model(input_data)
    req = comms.isend_forward(output.detach())
    async_requests.append(req)  # ✅ Keep alive
    # req stays in the async_requests list, so the buffer
    # reference stays valid and Gloo can complete the async
    # send safely. The function returns, but async_requests
    # is a closure variable that lives until
    # onef_oneb_pipeline_step returns, so all sends complete
    # before the pipeline step exits.
```
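The "all sends complete before the step exits" guarantee comes down to draining the request list before returning. A minimal sketch of that drain pattern, with thread-pool futures standing in for Gloo work handles (a real handle would expose `wait()` rather than `result()`; `fake_send` is a placeholder, not the pipeline's API):

```python
# End-of-step drain: block on every outstanding async send before returning.
from concurrent.futures import ThreadPoolExecutor

def fake_send(microbatch_idx):
    # Stands in for the background transfer of one microbatch's output.
    return f"sent {microbatch_idx}"

pool = ThreadPoolExecutor(max_workers=2)
async_requests = []

for mb in range(4):  # one pretend isend per microbatch
    async_requests.append(pool.submit(fake_send, mb))

# Drain: only after every send has completed is it safe to return
# (and to let the output buffers be freed).
results = [req.result() for req in async_requests]
pool.shutdown()
print(results)  # ['sent 0', 'sent 1', 'sent 2', 'sent 3']
```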
The request object returned by `isend()` contains a reference to the tensor buffer. If the request is garbage collected, Gloo loses that reference and crashes. Saving it in `async_requests` keeps the reference chain alive:

```
output.detach() → req object → Gloo internal buffer reference
      ↑               ↑
      |               |
output_buffers   async_requests     (both keep things alive)
```

Both references are needed:

- `output_buffers` keeps the original output tensor alive (needed for backward)
- `async_requests` keeps the `req` object alive (needed for Gloo's buffer reference)
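The lifetime argument can be observed directly in plain Python. Here a trivial `FakeRequest` class stands in for a Gloo request (it is not the real object), and `weakref` shows that the request saved in a list survives garbage collection while the unsaved one does not:

```python
# Why async_requests keeps the request alive: observe collection via weakref.
import gc
import weakref

class FakeRequest:
    """Stand-in for a Gloo isend request holding a buffer reference."""
    def __init__(self, buffer):
        self.buffer = buffer  # the internal buffer reference Gloo holds

def send_without_saving():
    req = FakeRequest(bytearray(16))
    return weakref.ref(req)   # req goes out of scope on return

def send_with_saving(async_requests):
    req = FakeRequest(bytearray(16))
    async_requests.append(req)  # keep the reference chain alive
    return weakref.ref(req)

dead = send_without_saving()
gc.collect()
print(dead() is None)   # True: the request was collected

async_requests = []
alive = send_with_saving(async_requests)
gc.collect()
print(alive() is None)  # False: the list still holds the request
```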