micropp

PipeDream: 1F1B Scheduling

Communication Volume

Hypothetical Scenario

Why 1F1B Needs Async Forward + Request Tracking

  1. Async forward sends (isend_forward) - prevents deadlock
  2. Save request handles (async_requests.append(req)) - prevents buffer deallocation

Case 1: Blocking send_forward (DEADLOCK)

Rank 2:

Rank 3:

Result: both ranks are blocked in a send, and neither can receive, because each rank only posts its receive after its own blocking send returns.

This is why the async isend_forward() is needed: Rank 2 can start the send and immediately proceed to run_backward(0) → recv_backward(). That receive unblocks Rank 3, which can then receive the forward(mb1) that Rank 2 started.
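The deadlock and its fix can be reproduced without any distributed backend. The sketch below is a toy model, not micropp code: `RendezvousChannel`, `simulate`, and the `BLOCKED` timeout are hypothetical names, and it assumes that Rank 2 sends forward(mb1) at the same moment Rank 3 sends backward(mb0), and that a blocking send only returns once the peer has posted the matching receive.

```python
import queue
import threading

BLOCKED = 0.5  # seconds of waiting that this sketch treats as "blocked forever"


class RendezvousChannel:
    """Toy one-way channel: send() returns only after the peer has called recv()."""

    def __init__(self):
        self._data = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, item, timeout=BLOCKED):
        self._data.put(item, timeout=timeout)
        self._ack.get(timeout=timeout)  # wait until the receiver acknowledges

    def recv(self, timeout=BLOCKED):
        item = self._data.get(timeout=timeout)
        self._ack.put(True)
        return item


def simulate(async_forward):
    forward = RendezvousChannel()   # Rank 2 -> Rank 3 (activations)
    backward = RendezvousChannel()  # Rank 3 -> Rank 2 (gradients)
    outcome = {}

    def rank2():
        try:
            if async_forward:
                # Async send: runs in the background, the thread acts as the request handle.
                req = threading.Thread(target=forward.send, args=("forward(mb1)", 5.0))
                req.start()
            else:
                forward.send("forward(mb1)")  # blocking send
            backward.recv()                   # recv_backward for mb0
            if async_forward:
                req.join()                    # wait for the async send to finish
            outcome["rank2"] = "done"
        except (queue.Empty, queue.Full):
            outcome["rank2"] = "blocked"

    def rank3():
        try:
            backward.send("backward(mb0)")    # blocking send of gradients
            forward.recv()                    # recv_forward for mb1
            outcome["rank3"] = "done"
        except (queue.Empty, queue.Full):
            outcome["rank3"] = "blocked"

    threads = [threading.Thread(target=rank2), threading.Thread(target=rank3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome
```

Calling `simulate(async_forward=False)` leaves both ranks blocked in their sends, while `simulate(async_forward=True)` lets both finish, because Rank 2 reaches its receive while the forward send is still in flight.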


Case 2: Async isend_forward WITHOUT saving request (CRASH)

# Rank 0:
def run_forward(micro_batch_idx):
    output = model(input_data)
    req = comms.isend_forward(output.detach())  # Start async send
    # req is NOT saved anywhere
    # Function returns, req goes out of scope
    # Python GC may collect req object

# What happens inside Gloo:
# 1. isend() creates internal buffer reference to output.detach()
# 2. Send happens in background thread
# 3. req object is GC'd → Gloo loses reference
# 4. Gloo tries to access buffer → "Cannot lock pointer to unbound buffer"
# 5. CRASH

Why saving the request fixes it:

async_requests.append(req)  # Keep reference alive!
# Now req stays referenced for as long as async_requests holds it
# Gloo can safely access buffer during async send
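The Python side of this lifetime argument can be observed directly with weak references. The following is a minimal sketch that models only object lifetime in CPython, not Gloo's internals: `ToyRequest`, `send_without_saving`, and `send_and_save` are hypothetical stand-ins for the request handle and for the two versions of run_forward.

```python
import gc
import weakref


class ToyRequest:
    """Hypothetical stand-in for the handle returned by isend_forward."""

    def __init__(self, buffer):
        self.buffer = buffer  # the handle holds the reference to the send buffer


def send_without_saving(buffer):
    req = ToyRequest(buffer)
    # A weak reference lets us observe req later without keeping it alive.
    return weakref.ref(req)


def send_and_save(buffer, async_requests):
    req = ToyRequest(buffer)
    async_requests.append(req)  # strong reference outlives this function call
    return weakref.ref(req)


async_requests = []
dropped = send_without_saving([1.0, 2.0])
kept = send_and_save([1.0, 2.0], async_requests)
gc.collect()  # not strictly needed in CPython, but makes the intent explicit

dropped_is_gone = dropped() is None    # handle was collected after the function returned
kept_is_alive = kept() is not None     # handle survives because the list still holds it
```

The unsaved handle disappears as soon as its function returns, taking its buffer reference with it, while the handle stored in the list remains reachable.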

Case 3: Async isend_forward WITH saving request (WORKS)

# Rank 0:
def run_forward(micro_batch_idx):
    output = model(input_data)
    req = comms.isend_forward(output.detach())
    async_requests.append(req)  # ✅ Keep alive
    # req stays in async_requests list
    # Buffer reference stays valid
    # Gloo can complete async send safely

# Function returns, but async_requests is still in scope
# (it's a closure variable, lives until onef_oneb_pipeline_step returns)
# All sends complete before function exits
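One way to make "all sends complete before the function exits" explicit is to wait on every saved handle at the end of the step. The sketch below is an assumption about how such a step could be structured, not the actual onef_oneb_pipeline_step: `ToyComms`, `ToyRequest`, and `pipeline_step` are hypothetical, and a thread pool stands in for the communication backend.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ToyRequest:
    """Hypothetical request handle exposing wait() and is_completed()."""

    def __init__(self, future):
        self._future = future

    def wait(self):
        self._future.result()  # block until the background send has finished

    def is_completed(self):
        return self._future.done()


class ToyComms:
    """Hypothetical stand-in for `comms`; sends run on one background thread."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self.delivered = []

    def _deliver(self, payload):
        time.sleep(0.01)  # pretend the transfer takes a while
        self.delivered.append(payload)

    def isend_forward(self, payload):
        return ToyRequest(self._pool.submit(self._deliver, payload))


def pipeline_step(comms, num_micro_batches):
    async_requests = []  # closure variable shared with run_forward

    def run_forward(micro_batch_idx):
        output = f"activations(mb{micro_batch_idx})"
        req = comms.isend_forward(output)
        async_requests.append(req)  # keep the handle alive past this call

    for mb in range(num_micro_batches):
        run_forward(mb)

    # Drain every outstanding send before the step returns.
    for req in async_requests:
        req.wait()
    return all(req.is_completed() for req in async_requests)
```

With this structure the list serves two purposes: it keeps each handle referenced while its send is in flight, and it gives the step a single place to wait for completion.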

Key Insight

The request object returned by isend() contains a reference to the tensor buffer. If the request is garbage collected, Gloo loses that reference and crashes. Saving it in async_requests keeps the reference chain alive:

output.detach() → req object → Gloo internal buffer reference
     ↑                ↑
     |                |
  output_buffers  async_requests (both keep things alive)

Both references are needed: