Sandbox Stuck

Overview

This runbook provides procedures for detecting, diagnosing, and resolving stuck or leaked sandbox instances in MonoTask agent execution.

Alert: sandbox_timeout_high or sandbox_leak_detected
Severity: Warning (> 5% timeout rate) or Critical (> 20 active sandboxes)
SLO Impact: Affects agent execution latency and sandbox provisioning SLO

Symptoms and Detection

How to Detect

Alert: "High Sandbox Timeout Rate" or "Sandbox Leak Detected"
Dashboard: Active sandboxes count not decreasing
Logs: Sandbox timeout errors or orphaned sandbox warnings
User Impact: Agent execution failures, slow task processing

Observable Symptoms

Active sandbox count > 20 sustained
Sandboxes in "running" state for > 5 minutes
Sandbox timeout errors > 5% of executions
Memory/resource warnings in sandbox logs
Sandbox cleanup jobs failing

Investigation Steps

1. Check Active Sandbox Count (ETA: 2 minutes)

Determine current sandbox state:

bash

# Query sandbox stats via Durable Object
curl https://monotask-agent-worker.workers.dev/api/sandbox/stats

# Or use wrangler to check DO state
# (Requires DO inspection endpoint)

# Expected response:
# {
#   "active": 3,
#   "provisioning": 0,
#   "terminating": 0,
#   "total_created": 150,
#   "total_destroyed": 147
# }

Questions to Answer:

How many sandboxes are currently active?
Are any sandboxes stuck in "provisioning" state?
How long have active sandboxes been running?
Is the count growing or stable?
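Some of these questions can be answered directly from the stats payload. The sketch below is a hypothetical helper (`summarizeStats` is not part of the MonoTask API). It assumes the response has the shape shown in the expected response above, and that every created sandbox is either destroyed or still counted in one of the three live states. The threshold of 20 is the Critical alert condition from this runbook.

```typescript
// Shape of the /api/sandbox/stats response, as shown in the example above.
interface SandboxStats {
  active: number;
  provisioning: number;
  terminating: number;
  total_created: number;
  total_destroyed: number;
}

interface StatsSummary {
  live: number;        // sandboxes counted in any live state
  unaccounted: number; // created, not destroyed, and not counted as live
  critical: boolean;   // active count exceeds the Critical alert threshold
}

// Critical alert threshold from this runbook (> 20 active sandboxes).
const CRITICAL_ACTIVE_THRESHOLD = 20;

// Hypothetical helper: condenses a stats payload into leak indicators.
function summarizeStats(stats: SandboxStats): StatsSummary {
  const live = stats.active + stats.provisioning + stats.terminating;
  // Assumption: created - destroyed should equal the live count.
  const unaccounted = stats.total_created - stats.total_destroyed - live;
  return {
    live,
    unaccounted,
    critical: stats.active > CRITICAL_ACTIVE_THRESHOLD,
  };
}

console.log(
  summarizeStats({
    active: 3,
    provisioning: 0,
    terminating: 0,
    total_created: 150,
    total_destroyed: 147,
  })
);
```

A non-zero `unaccounted` value suggests the counters and the live states disagree, which points toward the state corruption case (Cause 4) rather than a simple backlog.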

2. Identify Stuck Sandboxes (ETA: 3 minutes)

List sandboxes and their states:

bash

# Get list of active sandboxes
curl https://monotask-agent-worker.workers.dev/api/sandbox/list

# Expected response:
# [
#   {
#     "id": "sandbox_123",
#     "status": "running",
#     "created_at": "2025-10-26T10:00:00Z",
#     "age_minutes": 45,  ← Stuck if > 10 minutes
#     "agent_type": "implementation",
#     "task_id": "task_456"
#   }
# ]

Red Flags:

Sandboxes running > 10 minutes
Status stuck in "provisioning"
No associated task or agent
Repeated timeout patterns for specific sandbox
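The first three red flags can be checked mechanically against the list response. The following is a minimal sketch, assuming entries have the fields shown in the expected response above; `findRedFlags` is a hypothetical helper, not an existing API. The runbook does not say how long a sandbox may stay in "provisioning", so every provisioning entry is surfaced for manual review. Repeated timeout patterns need history and are not covered here.

```typescript
// One entry of the /api/sandbox/list response, as shown in the example above.
interface SandboxListEntry {
  id: string;
  status: string;
  created_at: string;
  age_minutes: number;
  agent_type?: string | null;
  task_id?: string | null;
}

// Age threshold from this runbook: running > 10 minutes is treated as stuck.
const STUCK_AGE_MINUTES = 10;

// Hypothetical helper: returns the red flags that apply to one sandbox.
function findRedFlags(sandbox: SandboxListEntry): string[] {
  const flags: string[] = [];
  if (sandbox.age_minutes > STUCK_AGE_MINUTES) {
    flags.push(`age ${sandbox.age_minutes} min exceeds ${STUCK_AGE_MINUTES} min`);
  }
  if (sandbox.status === "provisioning") {
    flags.push("status is provisioning");
  }
  if (!sandbox.task_id || !sandbox.agent_type) {
    flags.push("no associated task or agent");
  }
  return flags;
}

const listResponse: SandboxListEntry[] = [
  {
    id: "sandbox_123",
    status: "running",
    created_at: "2025-10-26T10:00:00Z",
    age_minutes: 45,
    agent_type: "implementation",
    task_id: "task_456",
  },
];

// Print only the sandboxes that raise at least one flag.
for (const sandbox of listResponse) {
  const flags = findRedFlags(sandbox);
  if (flags.length > 0) {
    console.log(`${sandbox.id}: ${flags.join("; ")}`);
  }
}
```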

3. Examine Sandbox Logs (ETA: 5 minutes)

Review logs for stuck sandbox:

bash

# Get logs for specific sandbox
curl https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123/logs

# Or tail worker logs filtering for sandbox ID
wrangler tail monotask-agent-worker | grep "sandbox_123"

# Look for:
# - Last activity timestamp
# - Error messages
# - Infinite loops or hangs
# - Resource exhaustion warnings

Common Log Patterns:

"Operation timed out" - External service hang
"Maximum call stack size exceeded" - Infinite recursion
"Out of memory" - Memory leak
No recent logs - Process hung
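When many log lines have to be scanned, the pattern list above can be applied programmatically. The sketch below is illustrative only: `classifyLogs` is a hypothetical helper, the match is a case-insensitive substring test, and the caller is assumed to pass only the lines from the recent time window, so an empty array stands for "no recent logs".

```typescript
// Substring to likely cause, taken from the pattern list above.
// Needles are lowercase because lines are lowercased before matching.
const LOG_PATTERNS: Array<{ needle: string; cause: string }> = [
  { needle: "operation timed out", cause: "External service hang" },
  { needle: "maximum call stack", cause: "Infinite recursion" },
  { needle: "out of memory", cause: "Memory leak" },
];

// Hypothetical helper: returns the distinct likely causes found in the lines.
function classifyLogs(recentLines: string[]): string[] {
  if (recentLines.length === 0) {
    return ["Process hung (no recent logs)"];
  }
  const causes = new Set<string>();
  for (const line of recentLines) {
    const lower = line.toLowerCase();
    for (const pattern of LOG_PATTERNS) {
      if (lower.includes(pattern.needle)) {
        causes.add(pattern.cause);
      }
    }
  }
  return Array.from(causes);
}

console.log(
  classifyLogs([
    "[sandbox_123] fetch https://api.example.com",
    "[sandbox_123] Error: Operation timed out",
  ])
);
```

An empty result for a non-empty input means none of the known patterns matched, and the logs need to be read by hand.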

4. Check Sandbox Resource Usage (ETA: 3 minutes)

Monitor resource consumption:

bash

# Get sandbox resource stats
curl https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123/resources

# Expected response:
# {
#   "cpu_percent": 95,  ← High CPU indicates busy loop
#   "memory_mb": 120,
#   "duration_ms": 45000,
#   "operations": 150000  ← High count indicates activity
# }

Resource Patterns:

Pattern	Likely Cause
CPU 100%	Infinite loop or heavy computation
CPU 0%	Hung waiting for I/O or deadlock
Memory growing	Memory leak
No operations	Process stuck or waiting
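The table can be applied to two consecutive samples from the resources endpoint. This is a rough sketch under stated assumptions: `diagnoseResources` is a hypothetical helper, the 95% and 1% CPU cut-offs are illustrative stand-ins for "100%" and "0%", `operations` is assumed to be a cumulative counter, and growth between only two samples is weak evidence of a leak.

```typescript
// One /api/sandbox/{id}/resources response, as shown in the example above.
interface ResourceSample {
  cpu_percent: number;
  memory_mb: number;
  duration_ms: number;
  operations: number;
}

// Illustrative thresholds (assumptions, not values defined by the platform).
const HIGH_CPU_PERCENT = 95;
const IDLE_CPU_PERCENT = 1;

// Hypothetical helper: maps two consecutive samples onto the table's causes.
function diagnoseResources(prev: ResourceSample, curr: ResourceSample): string[] {
  const findings: string[] = [];
  if (curr.cpu_percent >= HIGH_CPU_PERCENT) {
    findings.push("Infinite loop or heavy computation");
  }
  if (curr.cpu_percent <= IDLE_CPU_PERCENT) {
    findings.push("Hung waiting for I/O or deadlock");
  }
  if (curr.memory_mb > prev.memory_mb) {
    findings.push("Memory leak");
  }
  // A cumulative counter that did not advance means no work was done.
  if (curr.operations === prev.operations) {
    findings.push("Process stuck or waiting");
  }
  return findings;
}

console.log(
  diagnoseResources(
    { cpu_percent: 95, memory_mb: 120, duration_ms: 30000, operations: 100000 },
    { cpu_percent: 97, memory_mb: 120, duration_ms: 45000, operations: 150000 }
  )
);
```

Taking more than two samples before concluding "memory leak" is advisable, since short-lived allocations also show up as growth.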

5. Review Cleanup Job Status (ETA: 2 minutes)

Check if cleanup automation is working:

bash

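# Check cleanup job logs (case-insensitive, so "Sandbox cleanup started",
# "Cleaned up ..." and "Cleanup completed" all match)
wrangler tail monotask-agent-worker | grep -iE "cleanup|cleaned up"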

# Should see periodic cleanup runs (every 5 minutes):
# "Sandbox cleanup started"
# "Found 2 sandboxes to cleanup"
# "Cleaned up sandbox_123"
# "Cleanup completed"

Cleanup Issues:

No cleanup logs - Job not running
Cleanup errors - Job failing
Sandboxes not being cleaned - Detection logic broken
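To decide whether "no cleanup logs" really means the job stopped, compare the timestamp of the last "Sandbox cleanup started" line with the expected 5-minute cadence. The sketch below is hypothetical (`cleanupIsOverdue` is not an existing function) and assumes ISO 8601 timestamps; the one-minute tolerance is an arbitrary allowance for scheduling jitter.

```typescript
// Expected cadence from this runbook: a cleanup run every 5 minutes.
const CLEANUP_INTERVAL_MS = 5 * 60 * 1000;

// Hypothetical helper: true when the last cleanup run is older than one
// interval plus a tolerance, or when either timestamp cannot be parsed.
function cleanupIsOverdue(
  lastRunIso: string,
  nowIso: string,
  toleranceMs: number = 60 * 1000
): boolean {
  const lastRun = Date.parse(lastRunIso);
  const now = Date.parse(nowIso);
  if (isNaN(lastRun) || isNaN(now)) {
    return true; // Treat unreadable timestamps as a problem worth checking
  }
  return now - lastRun > CLEANUP_INTERVAL_MS + toleranceMs;
}

console.log(cleanupIsOverdue("2025-10-26T10:00:00Z", "2025-10-26T10:04:00Z"));
console.log(cleanupIsOverdue("2025-10-26T10:00:00Z", "2025-10-26T10:12:00Z"));
```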

Common Causes and Resolutions

Cause 1: Agent Code Infinite Loop

Symptoms:

CPU at 100%
Sandbox never completes
No error messages
High operation count

Resolution:

Immediate (5 minutes):

Force terminate stuck sandbox:

bash

# Terminate via API
curl -X DELETE https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123

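Or terminate through the Durable Object stub from inside Worker code:

typescript

// Runs inside the Worker, where env.SANDBOX_LIFECYCLE is bound
const id = env.SANDBOX_LIFECYCLE.idFromName(sandboxId);
const stub = env.SANDBOX_LIFECYCLE.get(id);
await stub.terminate();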

Add iteration limits to agent code:

typescript

// In agent execution code
let iterations = 0;
const MAX_ITERATIONS = 1000;

while (condition) {
  if (++iterations > MAX_ITERATIONS) {
    throw new Error('Maximum iterations exceeded - possible infinite loop');
  }

  // Agent logic
}

Add execution timeout:

typescript

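// Wrap agent execution with a timeout.
// Note: Promise.race only stops waiting; it does not cancel executeAgent,
// so the sandbox still has to be terminated after a timeout.
const AGENT_TIMEOUT = 60000; // 60 seconds

let timer: ReturnType<typeof setTimeout> | undefined;
const timeout = new Promise<never>((_, reject) => {
  timer = setTimeout(() => reject(new Error('Agent execution timeout')), AGENT_TIMEOUT);
});

// Clear the timer either way so it is not left pending
const result = await Promise.race([executeAgent(agentCode), timeout])
  .finally(() => clearTimeout(timer));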

Long-term:

Add static analysis to detect infinite loops
Implement step counting in agent runtime
Add circuit breakers for agent operations
Improve agent code validation

Cause 2: External API Hang

Symptoms:

CPU near 0%
Sandbox waiting indefinitely
Logs show external API call
No response from API

Resolution:

Immediate (5 minutes):

Terminate hung sandbox:

bash

curl -X DELETE https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123

Add request timeouts to agent API calls:

typescript

// In agent runtime
async function fetchWithTimeout(url: string, timeout = 10000) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeout);

  try {
    const response = await fetch(url, {
      signal: controller.signal,
    });
    return response;
  } catch (error) {
    if ((error as Error)?.name === 'AbortError') {
      throw new Error(`Request timeout after ${timeout}ms`);
    }
    throw error;
  } finally {
    clearTimeout(timeoutId);
  }
}

Implement retry with timeout:

typescript

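// Resolve after the given number of milliseconds
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function callAPIWithRetry(url: string, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetchWithTimeout(url, 10000);
    } catch (error) {
      if (i === maxRetries - 1) throw error; // Out of attempts
      await sleep(Math.pow(2, i) * 1000); // Exponential backoff: 1s, 2s, ...
    }
  }
  throw new Error('callAPIWithRetry requires maxRetries >= 1');
}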
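For sizing timeouts, it helps to know how long the retry helper above can take in the worst case. The sketch below works through that arithmetic; `backoffDelays` and `worstCaseMs` are hypothetical helpers, and the calculation assumes every attempt runs to its full 10-second timeout.

```typescript
// Delays slept between attempts: 2^i * baseMs for i = 0 .. maxRetries - 2.
// No delay follows the final attempt, so there are maxRetries - 1 entries.
function backoffDelays(maxRetries: number, baseMs: number = 1000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < maxRetries - 1; i++) {
    delays.push(Math.pow(2, i) * baseMs);
  }
  return delays;
}

// Upper bound on total time: every attempt times out, plus all backoff sleeps.
function worstCaseMs(
  maxRetries: number,
  timeoutMs: number = 10000,
  baseMs: number = 1000
): number {
  const totalBackoff = backoffDelays(maxRetries, baseMs).reduce(
    (sum, delay) => sum + delay,
    0
  );
  return maxRetries * timeoutMs + totalBackoff;
}

console.log(backoffDelays(3)); // [1000, 2000]
console.log(worstCaseMs(3));   // 33000
```

With the defaults, three attempts take at most 33 seconds, which fits inside the 60-second agent execution timeout used under Cause 1. Raising `maxRetries` to 5 would push the bound to 65 seconds and exceed it.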

Long-term:

Add timeout configuration to agent runtime
Implement circuit breakers for external APIs
Add fallback responses for API failures
Monitor external API health

Cause 3: Cleanup Job Not Running

Symptoms:

Old sandboxes accumulating
No cleanup logs
Active count growing over time

Resolution:

Immediate (10 minutes):

Manually trigger cleanup:

typescript

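// cleanup-sandboxes.ts
// One-off script; expects ADMIN_TOKEN in the environment
async function cleanupSandboxes(adminToken: string) {
  const response = await fetch('https://monotask-agent-worker.workers.dev/api/sandbox/cleanup', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer ' + adminToken,
    },
  });

  if (!response.ok) {
    throw new Error(`Cleanup request failed: HTTP ${response.status}`);
  }

  const result = await response.json();
  console.log('Cleanup result:', result);
}

// Run the cleanup when the script is executed
const adminToken = process.env.ADMIN_TOKEN;
if (!adminToken) throw new Error('ADMIN_TOKEN is not set');
await cleanupSandboxes(adminToken);

// Execute:
// ADMIN_TOKEN=... bun run cleanup-sandboxes.ts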

Verify cleanup job configuration:

typescript

// In agent-worker/src/index.ts
export default {
  async scheduled(event: ScheduledEvent, env: Env, ctx: ExecutionContext) {
    // Cleanup job should run every 5 minutes
    ctx.waitUntil(runSandboxCleanup(env));
  }
};

// Check wrangler.toml has cron trigger:
// [triggers]
// crons = ["*/5 * * * *"]  # Every 5 minutes

Fix cleanup logic if broken:

typescript

async function runSandboxCleanup(env: Env) {
  try {
    console.log('Starting sandbox cleanup');

    // Get all sandboxes
    const sandboxes = await listAllSandboxes(env);

    const now = Date.now();
    const MAX_AGE_MS = 10 * 60 * 1000; // 10 minutes

    let cleaned = 0;
    for (const sandbox of sandboxes) {
      const age = now - new Date(sandbox.created_at).getTime();

      if (age > MAX_AGE_MS || sandbox.status === 'failed') {
        try {
          await terminateSandbox(env, sandbox.id);
          cleaned++;
          console.log(`Cleaned up sandbox ${sandbox.id}`);
        } catch (error) {
          console.error(`Failed to cleanup ${sandbox.id}:`, error);
        }
      }
    }

    console.log(`Cleanup completed: ${cleaned} sandboxes removed`);
  } catch (error) {
    console.error('Cleanup job failed:', error);
  }
}

Long-term:

Add monitoring for cleanup job execution
Alert if cleanup job fails
Implement defensive cleanup (multiple methods)
Add cleanup job health checks

Cause 4: Sandbox Lifecycle State Corruption

Symptoms:

Sandbox shows as "running" but doesn't exist
Inconsistent state between DO and actual process
Cleanup fails with "sandbox not found"

Resolution:

Immediate (15 minutes):

Audit sandbox states:

typescript

// Check consistency between DO state and actual sandboxes
async function auditSandboxStates(env: Env) {
  const doSandboxes = await listSandboxesFromDO(env);
  const actualSandboxes = await listActualSandboxes(env);

  const orphanedDOs = doSandboxes.filter(
    s => !actualSandboxes.find(a => a.id === s.id)
  );

  const orphanedActual = actualSandboxes.filter(
    s => !doSandboxes.find(d => d.id === s.id)
  );

  console.log('Orphaned DO records:', orphanedDOs.length);
  console.log('Orphaned actual sandboxes:', orphanedActual.length);

  return { orphanedDOs, orphanedActual };
}
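The audit above compares the two lists with nested `find` calls, which is fine for a handful of sandboxes. If the lists grow large, a set-based comparison does the same work in linear time. The following is a sketch only; `diffById` is a hypothetical helper and assumes both sides expose a string `id`.

```typescript
interface HasId {
  id: string;
}

// Hypothetical helper: splits two record lists into the entries that exist
// only in the Durable Object state and the entries that exist only as
// actual sandboxes.
function diffById<T extends HasId, U extends HasId>(
  doRecords: T[],
  actualSandboxes: U[]
): { orphanedDOs: T[]; orphanedActual: U[] } {
  const doIds = new Set(doRecords.map(record => record.id));
  const actualIds = new Set(actualSandboxes.map(sandbox => sandbox.id));
  return {
    orphanedDOs: doRecords.filter(record => !actualIds.has(record.id)),
    orphanedActual: actualSandboxes.filter(sandbox => !doIds.has(sandbox.id)),
  };
}

const diff = diffById(
  [{ id: "sandbox_1" }, { id: "sandbox_2" }],
  [{ id: "sandbox_2" }, { id: "sandbox_3" }]
);
console.log(diff.orphanedDOs.map(r => r.id));    // ["sandbox_1"]
console.log(diff.orphanedActual.map(s => s.id)); // ["sandbox_3"]
```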

Clean up orphaned records:

typescript

async function cleanupOrphanedRecords(env: Env) {
  const { orphanedDOs, orphanedActual } = await auditSandboxStates(env);

  // Remove orphaned DO records
  for (const sandbox of orphanedDOs) {
    const id = env.SANDBOX_LIFECYCLE.idFromName(sandbox.id);
    const stub = env.SANDBOX_LIFECYCLE.get(id);
    await stub.delete(); // Remove from DO
    console.log(`Removed orphaned DO record: ${sandbox.id}`);
  }

  // Terminate orphaned actual sandboxes
  for (const sandbox of orphanedActual) {
    await terminateSandbox(env, sandbox.id);
    console.log(`Terminated orphaned sandbox: ${sandbox.id}`);
  }
}

Add state reconciliation:

typescript

// Run reconciliation periodically
async function reconcileSandboxStates(env: Env) {
  const sandboxes = await listAllSandboxes(env);

  for (const sandbox of sandboxes) {
    // Verify sandbox actually exists
    const exists = await checkSandboxExists(env, sandbox.id);

    if (!exists && sandbox.status === 'running') {
      // Update state to reflect reality
      const id = env.SANDBOX_LIFECYCLE.idFromName(sandbox.id);
      const stub = env.SANDBOX_LIFECYCLE.get(id);
      await stub.updateStatus('terminated');
      console.log(`Reconciled state for ${sandbox.id}`);
    }
  }
}

Long-term:

Implement state verification before operations
Add transaction logging for state changes
Periodic state reconciliation job
Improve error handling in state transitions

Cause 5: Resource Exhaustion

Symptoms:

Cannot provision new sandboxes
"Resource limit exceeded" errors
All sandboxes stuck in "provisioning"

Resolution:

Immediate (5 minutes):

Check resource quotas:

bash

# Check Durable Object limits
# Cloudflare Dashboard > Workers > Durable Objects > Usage

# Check for:
# - Active DO instances near limit
# - Storage near quota
# - CPU time consumption

Emergency cleanup:

bash

# Terminate all non-critical sandboxes
curl -X POST https://monotask-agent-worker.workers.dev/api/sandbox/cleanup/emergency

Cap the number of concurrent sandboxes:

typescript

// Limit maximum concurrent sandboxes
const MAX_CONCURRENT_SANDBOXES = 10;

async function provisionSandbox(env: Env, config: SandboxConfig) {
  const active = await getActiveSandboxCount(env);

  if (active >= MAX_CONCURRENT_SANDBOXES) {
    throw new Error('Maximum sandbox limit reached');
  }

  return await createSandbox(env, config);
}

Long-term:

Implement sandbox pooling and reuse
Add resource monitoring and alerts
Request quota increases if needed
Optimize sandbox resource usage

Resolution Procedures

Manual Sandbox Termination (ETA: 5 minutes)

bash

# Step 1: Identify stuck sandbox
curl https://monotask-agent-worker.workers.dev/api/sandbox/list

# Step 2: Get sandbox details
curl https://monotask-agent-worker.workers.dev/api/sandbox/{sandbox_id}

# Step 3: Terminate sandbox
curl -X DELETE https://monotask-agent-worker.workers.dev/api/sandbox/{sandbox_id}

# Step 4: Verify termination
curl https://monotask-agent-worker.workers.dev/api/sandbox/stats
# Active count should decrease

Bulk Cleanup (ETA: 10 minutes)

For multiple stuck sandboxes:

typescript

// cleanup-script.ts
async function bulkCleanup(env: Env) {
  const sandboxes = await listAllSandboxes(env);
  const now = Date.now();
  const STUCK_THRESHOLD = 10 * 60 * 1000; // 10 minutes

  const stuckSandboxes = sandboxes.filter(s => {
    const age = now - new Date(s.created_at).getTime();
    return age > STUCK_THRESHOLD;
  });

  console.log(`Found ${stuckSandboxes.length} stuck sandboxes`);

  for (const sandbox of stuckSandboxes) {
    try {
      await terminateSandbox(env, sandbox.id);
      console.log(`✓ Terminated ${sandbox.id}`);
    } catch (error) {
      console.error(`✗ Failed to terminate ${sandbox.id}:`, error);
    }
  }

  console.log('Bulk cleanup completed');
}

// Execute:
// bun run cleanup-script.ts

State Reconciliation (ETA: 15 minutes)

Fix state inconsistencies:

typescript

// reconcile-states.ts
async function reconcileAllStates(env: Env) {
  console.log('Starting state reconciliation...');

  // Step 1: Audit current state (logs the orphan counts)
  await auditSandboxStates(env);

  // Step 2: Clean up orphaned records (repeats the audit internally)
  await cleanupOrphanedRecords(env);

  // Step 3: Verify all active sandboxes
  await reconcileSandboxStates(env);

  // Step 4: Generate report
  const finalStats = await getSandboxStats(env);
  console.log('Reconciliation complete:', finalStats);
}

Verification Steps

1. Active Sandbox Count Normal (ETA: 5 minutes)

bash

# Check sandbox stats
curl https://monotask-agent-worker.workers.dev/api/sandbox/stats

# Target: < 5 active sandboxes under normal load

2. No Stuck Sandboxes (ETA: 5 minutes)

bash

# List active sandboxes
curl https://monotask-agent-worker.workers.dev/api/sandbox/list

# Verify: All sandboxes < 10 minutes old

3. Cleanup Job Running (ETA: 5 minutes)

bash

# Check cleanup logs
wrangler tail monotask-agent-worker | grep -iE "cleanup|cleaned up"

# Should see cleanup runs every 5 minutes

4. Sandbox Timeout Rate Normal (ETA: 5 minutes)

bash

# Check timeout metrics in dashboard
# Target: < 2% timeout rate
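If the dashboard only exposes raw counts, the rate can be computed by hand. The sketch below is illustrative: `timeoutRate` and `classifyTimeoutRate` are hypothetical helpers, the 2% target and 5% Warning threshold come from this runbook, and the "elevated" label for the band in between is an assumption.

```typescript
type TimeoutStatus = "healthy" | "elevated" | "warning";

// Fraction of executions that timed out; 0 when nothing has executed.
function timeoutRate(timeouts: number, executions: number): number {
  return executions === 0 ? 0 : timeouts / executions;
}

// Below 2% meets the target; above 5% matches the Warning alert condition.
function classifyTimeoutRate(rate: number): TimeoutStatus {
  if (rate > 0.05) return "warning";
  if (rate >= 0.02) return "elevated";
  return "healthy";
}

console.log(classifyTimeoutRate(timeoutRate(3, 200)));  // 1.5% of executions
console.log(classifyTimeoutRate(timeoutRate(12, 200))); // 6% of executions
```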

5. Resource Usage Healthy (ETA: 5 minutes)

bash

# Check Durable Object metrics
# Cloudflare Dashboard > Workers > Durable Objects

# Verify: Well below quotas

Prevention Measures

1. Improved Timeout Handling

typescript

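// Wrap all sandbox operations with timeout
class SandboxManager {
  async execute<T>(sandboxId: string, operation: () => Promise<T>): Promise<T> {
    const OPERATION_TIMEOUT = 5 * 60 * 1000; // 5 minutes

    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeoutPromise = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error('Sandbox operation timeout')),
        OPERATION_TIMEOUT
      );
    });

    try {
      return await Promise.race([
        operation(),
        timeoutPromise,
      ]);
    } catch (error) {
      // Terminate the sandbox on timeout or any other failure
      // (terminateSandbox is assumed to be defined elsewhere on this class)
      await this.terminateSandbox(sandboxId);
      throw error;
    } finally {
      clearTimeout(timer); // Do not leave the timer pending after completion
    }
  }
}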

2. Enhanced Monitoring

typescript

// Add sandbox lifecycle metrics
async function trackSandboxLifecycle(
  env: Env,
  sandboxId: string,
  event: string
) {
  await env.ANALYTICS.writeDataPoint({
    blobs: [sandboxId, event], // Filter by event through the second blob
    doubles: [Date.now()],
    indexes: ['sandbox:lifecycle'], // Analytics Engine accepts a single index
  });

  // Alert on anomalies
  if (event === 'timeout' || event === 'stuck') {
    await alerter.sendAlert({
      severity: 'warning',
      message: `Sandbox ${event}: ${sandboxId}`,
      context: { sandboxId, event },
    });
  }
}

3. Automatic Cleanup Improvements

typescript

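// Multi-layer cleanup approach.
// The cadences noted below are targets; as written, every layer except
// reconciliation runs on each invocation of this function.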
async function enhancedCleanup(env: Env) {
  // Layer 1: Scheduled cleanup (every 5 min)
  await scheduledCleanup(env);

  // Layer 2: Age-based cleanup (every 15 min)
  await ageBasedCleanup(env, 10 * 60 * 1000);

  // Layer 3: Resource-based cleanup (when near limits)
  const usage = await getResourceUsage(env);
  if (usage > 0.8) {
    await emergencyCleanup(env);
  }

  // Layer 4: State reconciliation (hourly)
  if (shouldReconcile()) {
    await reconcileSandboxStates(env);
  }
}

4. Sandbox Health Checks

typescript

// Periodic health checks for active sandboxes
async function healthCheckSandboxes(env: Env) {
  const sandboxes = await listActiveSandboxes(env);

  for (const sandbox of sandboxes) {
    const health = await checkSandboxHealth(env, sandbox.id);

    if (!health.healthy) {
      console.warn(`Unhealthy sandbox detected: ${sandbox.id}`, health);

      if (health.age > 10 * 60 * 1000) {
        // Terminate if old and unhealthy
        await terminateSandbox(env, sandbox.id);
      }
    }
  }
}

Escalation Path

When to Escalate

Escalate if:

Active sandboxes > 50 for more than 30 minutes
Cleanup jobs consistently failing
Unable to provision new sandboxes
State corruption affecting multiple sandboxes
Resource quotas exceeded
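The first criterion is the only one that needs a time series. One way to evaluate it from periodic stats samples is sketched below; `sustainedAbove` is a hypothetical helper, samples are assumed to be sorted oldest to newest, and the check only considers the unbroken run of high readings that ends at the newest sample.

```typescript
// One reading of the active sandbox count at a point in time.
interface ActiveSample {
  atMs: number;   // sample time in epoch milliseconds
  active: number; // active sandbox count at that time
}

// Hypothetical helper: true when the count has stayed above the threshold
// for longer than windowMs, measured back from the newest sample.
function sustainedAbove(
  samples: ActiveSample[],
  threshold: number = 50,
  windowMs: number = 30 * 60 * 1000
): boolean {
  let newest: number | null = null;
  let oldest: number | null = null;
  // Walk from the newest sample backwards until the count dips.
  for (const sample of samples.slice().reverse()) {
    if (sample.active <= threshold) break;
    if (newest === null) newest = sample.atMs;
    oldest = sample.atMs;
  }
  return newest !== null && oldest !== null && newest - oldest > windowMs;
}

const MINUTE_MS = 60 * 1000;
console.log(
  sustainedAbove([
    { atMs: 0, active: 60 },
    { atMs: 10 * MINUTE_MS, active: 62 },
    { atMs: 20 * MINUTE_MS, active: 58 },
    { atMs: 31 * MINUTE_MS, active: 61 },
  ])
);
```

A single dip to 50 or below resets the window, so sparse sampling can hide a short recovery; sampling at least as often as the 5-minute cleanup cadence keeps the check meaningful.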

Escalation Contacts

Level 1 - Agent Team

Slack: #agent-team
For agent code issues causing hangs

Level 2 - Infrastructure Team

For Durable Object issues
For resource quota concerns

Level 3 - Cloudflare Support

For platform-level sandbox issues
For quota increase requests

Post-Incident

Required Actions

Root Cause Analysis:
- What caused sandboxes to get stuck?
- Were there code changes involved?
- Was cleanup working properly?
- Are there resource constraints?
Improve Detection:
- Add alerts for stuck sandboxes
- Monitor sandbox age distribution
- Track cleanup job success rate
- Add resource usage alerts
Code Improvements:
- Fix agent code causing hangs
- Improve timeout handling
- Enhance cleanup robustness
- Add state verification
Documentation:
- Document sandbox limits and quotas
- Update agent development guidelines
- Share lessons learned
- Update this runbook

Related Runbooks

Worker Timeout - For sandbox timeout issues
High Error Rate - For sandbox-related errors
Queue Backup - If sandbox issues cause queue backup

Last Updated: 2025-10-26
Owner: Agent Team
Reviewers: SRE Team, Backend Team
