
Sandbox Stuck

Overview

This runbook provides procedures for detecting, diagnosing, and resolving stuck or leaked sandbox instances in MonoTask agent execution.

Alert: sandbox_timeout_high or sandbox_leak_detected
Severity: Warning (> 5% timeout rate) or Critical (> 20 active sandboxes)
SLO Impact: Affects agent execution latency and sandbox provisioning SLO


Symptoms and Detection

How to Detect

  • Alert: "High Sandbox Timeout Rate" or "Sandbox Leak Detected"
  • Dashboard: Active sandboxes count not decreasing
  • Logs: Sandbox timeout errors or orphaned sandbox warnings
  • User Impact: Agent execution failures, slow task processing

Observable Symptoms

  • Active sandbox count > 20 sustained
  • Sandboxes in "running" state for > 5 minutes
  • Sandbox timeout errors > 5% of executions
  • Memory/resource warnings in sandbox logs
  • Sandbox cleanup jobs failing
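
The thresholds above can be expressed as a small check. This is a minimal sketch, not the alerting rule itself: the `SandboxMetrics` shape and the `classifySeverity` name are assumptions made for illustration, and only the 5% and 20 thresholds come from this runbook.

```typescript
// Thresholds taken from the alert definition in this runbook.
const TIMEOUT_RATE_WARNING = 0.05; // more than 5% of executions timing out
const ACTIVE_SANDBOX_CRITICAL = 20; // more than 20 active sandboxes

type Severity = "ok" | "warning" | "critical";

// Hypothetical input shape; the field names are assumptions.
interface SandboxMetrics {
  executions: number;
  timeouts: number;
  active: number;
}

function classifySeverity(m: SandboxMetrics): Severity {
  // Critical takes precedence over warning.
  if (m.active > ACTIVE_SANDBOX_CRITICAL) return "critical";
  // Guard against division by zero when nothing has executed yet.
  const timeoutRate = m.executions > 0 ? m.timeouts / m.executions : 0;
  if (timeoutRate > TIMEOUT_RATE_WARNING) return "warning";
  return "ok";
}
```

Both comparisons are strict, so exactly 5% timeouts or exactly 20 active sandboxes still counts as healthy, matching the ">" wording of the alert.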

Investigation Steps

1. Check Active Sandbox Count (ETA: 2 minutes)

Determine current sandbox state:

bash
# Query sandbox stats via Durable Object
curl https://monotask-agent-worker.workers.dev/api/sandbox/stats

# Or use wrangler to check DO state
# (Requires DO inspection endpoint)

# Expected response:
# {
#   "active": 3,
#   "provisioning": 0,
#   "terminating": 0,
#   "total_created": 150,
#   "total_destroyed": 147
# }

Questions to Answer:

  • How many sandboxes are currently active?
  • Are any sandboxes stuck in "provisioning" state?
  • How long have active sandboxes been running?
  • Is the count growing or stable?
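
The questions above can be answered from two stats snapshots taken a few minutes apart. The sketch below assumes the response shape shown in the example and assumes that every created sandbox is either destroyed or counted in one of the live states; the function names are hypothetical.

```typescript
// Shape of the /api/sandbox/stats response shown in the example above.
interface SandboxStats {
  active: number;
  provisioning: number;
  terminating: number;
  total_created: number;
  total_destroyed: number;
}

// Sandboxes that were created but are neither destroyed nor tracked in a live
// state. A non-zero value suggests a leak (under the assumption stated above).
function unaccountedSandboxes(stats: SandboxStats): number {
  const live = stats.active + stats.provisioning + stats.terminating;
  return stats.total_created - stats.total_destroyed - live;
}

// Compare the active count of two snapshots.
function activeTrend(
  earlier: SandboxStats,
  later: SandboxStats
): "growing" | "stable" | "shrinking" {
  if (later.active > earlier.active) return "growing";
  if (later.active < earlier.active) return "shrinking";
  return "stable";
}

const sample: SandboxStats = {
  active: 3,
  provisioning: 0,
  terminating: 0,
  total_created: 150,
  total_destroyed: 147,
};
console.log(unaccountedSandboxes(sample)); // 0 for the example response
```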

2. Identify Stuck Sandboxes (ETA: 3 minutes)

List sandboxes and their states:

bash
# Get list of active sandboxes
curl https://monotask-agent-worker.workers.dev/api/sandbox/list

# Expected response:
# [
#   {
#     "id": "sandbox_123",
#     "status": "running",
#     "created_at": "2025-10-26T10:00:00Z",
#     "age_minutes": 45,  ← Stuck if > 10 minutes
#     "agent_type": "implementation",
#     "task_id": "task_456"
#   }
# ]

Red Flags:

  • Sandboxes running > 10 minutes
  • Status stuck in "provisioning"
  • No associated task or agent
  • Repeated timeout patterns for specific sandbox
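
A minimal sketch of applying these red flags to the list response. The 10-minute threshold comes from this runbook; `PROVISIONING_GRACE_MINUTES` is an assumed value, since the runbook does not say how long provisioning may take, and `redFlags` is a hypothetical helper name.

```typescript
// One entry of the /api/sandbox/list response shown above.
interface SandboxInfo {
  id: string;
  status: string;
  created_at: string;
  age_minutes: number;
  agent_type?: string;
  task_id?: string;
}

const STUCK_AGE_MINUTES = 10; // from this runbook
const PROVISIONING_GRACE_MINUTES = 2; // assumption, tune to your provisioning time

function redFlags(s: SandboxInfo): string[] {
  const flags: string[] = [];
  if (s.age_minutes > STUCK_AGE_MINUTES) {
    flags.push("running longer than 10 minutes");
  }
  if (s.status === "provisioning" && s.age_minutes > PROVISIONING_GRACE_MINUTES) {
    flags.push("stuck in provisioning");
  }
  if (!s.task_id || !s.agent_type) {
    flags.push("no associated task or agent");
  }
  return flags;
}

const example: SandboxInfo = {
  id: "sandbox_123",
  status: "running",
  created_at: "2025-10-26T10:00:00Z",
  age_minutes: 45,
  agent_type: "implementation",
  task_id: "task_456",
};
console.log(redFlags(example)); // [ 'running longer than 10 minutes' ]
```

Repeated timeouts for one sandbox need history across several polls, so that flag is left out of this single-snapshot check.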

3. Examine Sandbox Logs (ETA: 5 minutes)

Review logs for stuck sandbox:

bash
# Get logs for specific sandbox
curl https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123/logs

# Or tail worker logs filtering for sandbox ID
wrangler tail monotask-agent-worker | grep "sandbox_123"

# Look for:
# - Last activity timestamp
# - Error messages
# - Infinite loops or hangs
# - Resource exhaustion warnings

Common Log Patterns:

  • "Operation timed out" - External service hang
  • "Maximum call stack size exceeded" - Infinite recursion
  • "Out of memory" - Memory leak
  • No recent logs - Process hung
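
The pattern list above lends itself to a small classifier for fetched log lines. This is a sketch under the assumption that logs arrive as an array of plain strings; `diagnoseLogs` is a hypothetical name and the patterns are only the ones listed in this runbook.

```typescript
// Log pattern to likely cause, as listed above.
const LOG_PATTERNS: Array<{ pattern: RegExp; cause: string }> = [
  { pattern: /Operation timed out/i, cause: "External service hang" },
  { pattern: /Maximum call stack/i, cause: "Infinite recursion" },
  { pattern: /Out of memory/i, cause: "Memory leak" },
];

// Returns the distinct likely causes found in the given log lines.
function diagnoseLogs(lines: string[]): string[] {
  // No recent logs at all points to a hung process.
  if (lines.length === 0) return ["Process hung"];

  const causes = new Set<string>();
  for (const line of lines) {
    for (const { pattern, cause } of LOG_PATTERNS) {
      if (pattern.test(line)) causes.add(cause);
    }
  }
  return Array.from(causes);
}
```

An empty result for a non-empty log means none of the known patterns matched, which is a cue to read the log manually.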

4. Check Sandbox Resource Usage (ETA: 3 minutes)

Monitor resource consumption:

bash
# Get sandbox resource stats
curl https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123/resources

# Expected response:
# {
#   "cpu_percent": 95,  ← High CPU indicates busy loop
#   "memory_mb": 120,
#   "duration_ms": 45000,
#   "operations": 150000  ← High count indicates activity
# }

Resource Patterns:

| Pattern | Likely Cause |
|---|---|
| CPU 100% | Infinite loop or heavy computation |
| CPU 0% | Hung waiting for I/O or deadlock |
| Memory growing | Memory leak |
| No operations | Process stuck or waiting |
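
The resource patterns can be checked by comparing two samples of the resources endpoint. The sketch below is illustrative only: the cut-offs of 95% and 1% CPU are assumptions standing in for "100%" and "0%", and a single memory increase between two samples is a hint, not proof, of a leak.

```typescript
// Subset of the /resources response shown above.
interface ResourceSample {
  cpu_percent: number;
  memory_mb: number;
  operations: number;
}

// Assumed cut-offs for "CPU 100%" and "CPU 0%".
const CPU_BUSY_PERCENT = 95;
const CPU_IDLE_PERCENT = 1;

// Compare an earlier and a later sample of the same sandbox.
function classifyResources(prev: ResourceSample, curr: ResourceSample): string[] {
  const findings: string[] = [];
  if (curr.cpu_percent >= CPU_BUSY_PERCENT) {
    findings.push("Infinite loop or heavy computation");
  }
  if (curr.cpu_percent <= CPU_IDLE_PERCENT) {
    findings.push("Hung waiting for I/O or deadlock");
  }
  if (curr.memory_mb > prev.memory_mb) {
    findings.push("Memory leak");
  }
  if (curr.operations === prev.operations) {
    findings.push("Process stuck or waiting");
  }
  return findings;
}
```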

5. Review Cleanup Job Status (ETA: 2 minutes)

Check if cleanup automation is working:

bash
# Check cleanup job logs
wrangler tail monotask-agent-worker | grep "sandbox.*cleanup"

# Should see periodic cleanup runs (every 5 minutes):
# "Sandbox cleanup started"
# "Found 2 sandboxes to cleanup"
# "Cleaned up sandbox_123"
# "Cleanup completed"

Cleanup Issues:

  • No cleanup logs - Job not running
  • Cleanup errors - Job failing
  • Sandboxes not being cleaned - Detection logic broken
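
To tell "job not running" apart from a quiet period, compare the timestamp of the last "Sandbox cleanup started" log line with the expected 5-minute interval. This is a sketch; `cleanupOverdue` and the tolerance of two missed intervals are assumptions.

```typescript
const CLEANUP_INTERVAL_MS = 5 * 60 * 1000; // cleanup runs every 5 minutes

// lastRunIso: ISO timestamp of the most recent "Sandbox cleanup started" log
// line, or null if none was found. Returns true when the job looks stalled.
function cleanupOverdue(
  lastRunIso: string | null,
  nowMs: number,
  toleranceIntervals = 2 // assumption: tolerate up to two missed runs
): boolean {
  if (lastRunIso === null) return true;
  const last = Date.parse(lastRunIso);
  if (Number.isNaN(last)) return true; // unparseable timestamp, treat as missing
  return nowMs - last > toleranceIntervals * CLEANUP_INTERVAL_MS;
}
```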

Common Causes and Resolutions

Cause 1: Agent Code Infinite Loop

Symptoms:

  • CPU at 100%
  • Sandbox never completes
  • No error messages
  • High operation count

Resolution:

Immediate (5 minutes):

  1. Force terminate stuck sandbox:
bash
# Terminate via API
curl -X DELETE https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123

typescript
// Or via the Durable Object, from code running inside the Worker
const id = env.SANDBOX_LIFECYCLE.idFromName(sandboxId);
const stub = env.SANDBOX_LIFECYCLE.get(id);
await stub.terminate();
  2. Add iteration limits to agent code:
typescript
// In agent execution code
let iterations = 0;
const MAX_ITERATIONS = 1000;

while (condition) {
  // Pre-increment so the loop body runs at most MAX_ITERATIONS times
  if (++iterations > MAX_ITERATIONS) {
    throw new Error('Maximum iterations exceeded - possible infinite loop');
  }

  // Agent logic
}
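  3. Add execution timeout:
typescript
// Wrap agent execution with timeout
const AGENT_TIMEOUT = 60000; // 60 seconds

let timer: ReturnType<typeof setTimeout> | undefined;

// Note: the race only stops waiting for the agent. It does not stop the agent
// itself, so the sandbox still has to be terminated after a timeout.
const result = await Promise.race([
  executeAgent(agentCode),
  new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('Agent execution timeout')), AGENT_TIMEOUT);
  }),
]).finally(() => clearTimeout(timer)); // avoid a stray timer once the race settles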

Long-term:

  • Add static analysis to detect infinite loops
  • Implement step counting in agent runtime
  • Add circuit breakers for agent operations
  • Improve agent code validation

Cause 2: External API Hang

Symptoms:

  • CPU near 0%
  • Sandbox waiting indefinitely
  • Logs show external API call
  • No response from API

Resolution:

Immediate (5 minutes):

  1. Terminate hung sandbox:
bash
curl -X DELETE https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123
  2. Add request timeouts to agent API calls:
typescript
// In agent runtime
async function fetchWithTimeout(url: string, timeout = 10000) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeout);

  try {
    const response = await fetch(url, {
      signal: controller.signal,
    });
    return response;
  } catch (error) {
    // fetch rejects with an AbortError when the controller aborts
    if (error instanceof Error && error.name === 'AbortError') {
      throw new Error(`Request timeout after ${timeout}ms`);
    }
    throw error;
  } finally {
    clearTimeout(timeoutId);
  }
}
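  3. Implement retry with timeout:
typescript
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function callAPIWithRetry(url: string, maxRetries = 3): Promise<Response> {
  let lastError: unknown;
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetchWithTimeout(url, 10000);
    } catch (error) {
      lastError = error;
      // Back off before the next attempt, but not after the last one
      if (i < maxRetries - 1) {
        await sleep(Math.pow(2, i) * 1000); // Exponential backoff: 1s, 2s, ...
      }
    }
  }
  throw lastError;
}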

Long-term:

  • Add timeout configuration to agent runtime
  • Implement circuit breakers for external APIs
  • Add fallback responses for API failures
  • Monitor external API health
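
A circuit breaker, as suggested in the list above, stops calling an external API after repeated failures so that sandboxes fail fast instead of hanging. The class below is a minimal sketch, not existing MonoTask code: the class name, default thresholds, and injectable clock are all assumptions.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold = 3, // consecutive failures before opening (assumed default)
    resetAfterMs = 30000, // wait before allowing a trial request (assumed default)
    now: () => number = () => Date.now() // injectable clock for testing
  ) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  // Call before each request. Returns false while the breaker is open.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetAfterMs) {
      this.state = "half-open"; // allow a trial request
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures++;
    // A failed trial request reopens the breaker immediately.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

In use, the agent runtime would check `canRequest()` before calling `fetchWithTimeout`, then call `recordSuccess()` or `recordFailure()` depending on the outcome.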

Cause 3: Cleanup Job Not Running

Symptoms:

  • Old sandboxes accumulating
  • No cleanup logs
  • Active count growing over time

Resolution:

Immediate (10 minutes):

  1. Manually trigger cleanup:
typescript
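// cleanup-sandboxes.ts
// Standalone script: reads the admin token from the ADMIN_TOKEN environment variable
export async function cleanupSandboxes(adminToken: string) {
  const response = await fetch('https://monotask-agent-worker.workers.dev/api/sandbox/cleanup', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer ' + adminToken,
    },
  });

  if (!response.ok) {
    throw new Error(`Cleanup request failed with status ${response.status}`);
  }

  const result = await response.json();
  console.log('Cleanup result:', result);
}

const token = process.env.ADMIN_TOKEN;
if (!token) throw new Error('ADMIN_TOKEN is not set');
await cleanupSandboxes(token);

// Execute:
// ADMIN_TOKEN=<token> bun run cleanup-sandboxes.ts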
  2. Verify cleanup job configuration:
typescript
// In agent-worker/src/index.ts
export default {
  async scheduled(event: ScheduledEvent, env: Env, ctx: ExecutionContext) {
    // Cleanup job should run every 5 minutes
    ctx.waitUntil(runSandboxCleanup(env));
  }
};

// Check wrangler.toml has cron trigger:
// [triggers]
// crons = ["*/5 * * * *"]  # Every 5 minutes
  3. Fix cleanup logic if broken:
typescript
async function runSandboxCleanup(env: Env) {
  try {
    console.log('Starting sandbox cleanup');

    // Get all sandboxes
    const sandboxes = await listAllSandboxes(env);

    const now = Date.now();
    const MAX_AGE_MS = 10 * 60 * 1000; // 10 minutes

    let cleaned = 0;
    for (const sandbox of sandboxes) {
      const age = now - new Date(sandbox.created_at).getTime();

      if (age > MAX_AGE_MS || sandbox.status === 'failed') {
        try {
          await terminateSandbox(env, sandbox.id);
          cleaned++;
          console.log(`Cleaned up sandbox ${sandbox.id}`);
        } catch (error) {
          console.error(`Failed to cleanup ${sandbox.id}:`, error);
        }
      }
    }

    console.log(`Cleanup completed: ${cleaned} sandboxes removed`);
  } catch (error) {
    console.error('Cleanup job failed:', error);
  }
}

Long-term:

  • Add monitoring for cleanup job execution
  • Alert if cleanup job fails
  • Implement defensive cleanup (multiple methods)
  • Add cleanup job health checks

Cause 4: Sandbox Lifecycle State Corruption

Symptoms:

  • Sandbox shows as "running" but doesn't exist
  • Inconsistent state between DO and actual process
  • Cleanup fails with "sandbox not found"

Resolution:

Immediate (15 minutes):

  1. Audit sandbox states:
typescript
// Check consistency between DO state and actual sandboxes
async function auditSandboxStates(env: Env) {
  const doSandboxes = await listSandboxesFromDO(env);
  const actualSandboxes = await listActualSandboxes(env);

  const orphanedDOs = doSandboxes.filter(
    s => !actualSandboxes.find(a => a.id === s.id)
  );

  const orphanedActual = actualSandboxes.filter(
    s => !doSandboxes.find(d => d.id === s.id)
  );

  console.log('Orphaned DO records:', orphanedDOs.length);
  console.log('Orphaned actual sandboxes:', orphanedActual.length);

  return { orphanedDOs, orphanedActual };
}
  2. Clean up orphaned records:
typescript
async function cleanupOrphanedRecords(env: Env) {
  const { orphanedDOs, orphanedActual } = await auditSandboxStates(env);

  // Remove orphaned DO records
  for (const sandbox of orphanedDOs) {
    const id = env.SANDBOX_LIFECYCLE.idFromName(sandbox.id);
    const stub = env.SANDBOX_LIFECYCLE.get(id);
    await stub.delete(); // Remove from DO
    console.log(`Removed orphaned DO record: ${sandbox.id}`);
  }

  // Terminate orphaned actual sandboxes
  for (const sandbox of orphanedActual) {
    await terminateSandbox(env, sandbox.id);
    console.log(`Terminated orphaned sandbox: ${sandbox.id}`);
  }
}
  3. Add state reconciliation:
typescript
// Run reconciliation periodically
async function reconcileSandboxStates(env: Env) {
  const sandboxes = await listAllSandboxes(env);

  for (const sandbox of sandboxes) {
    // Verify sandbox actually exists
    const exists = await checkSandboxExists(env, sandbox.id);

    if (!exists && sandbox.status === 'running') {
      // Update state to reflect reality
      const id = env.SANDBOX_LIFECYCLE.idFromName(sandbox.id);
      const stub = env.SANDBOX_LIFECYCLE.get(id);
      await stub.updateStatus('terminated');
      console.log(`Reconciled state for ${sandbox.id}`);
    }
  }
}

Long-term:

  • Implement state verification before operations
  • Add transaction logging for state changes
  • Periodic state reconciliation job
  • Improve error handling in state transitions
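
State verification before operations can start with an explicit table of allowed status transitions. The sketch below uses the status values that appear in this runbook; which transitions are legal is an assumption and should be checked against the actual lifecycle Durable Object.

```typescript
type SandboxStatus =
  | "provisioning"
  | "running"
  | "terminating"
  | "terminated"
  | "failed";

// Assumed transition table; "terminated" is final.
const ALLOWED_TRANSITIONS: Record<SandboxStatus, SandboxStatus[]> = {
  provisioning: ["running", "terminating", "failed"],
  running: ["terminating", "terminated", "failed"],
  terminating: ["terminated", "failed"],
  failed: ["terminating", "terminated"],
  terminated: [],
};

function canTransition(from: SandboxStatus, to: SandboxStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Returns the new status, or throws so the caller can log the rejected change.
function transition(current: SandboxStatus, next: SandboxStatus): SandboxStatus {
  if (!canTransition(current, next)) {
    throw new Error(`Invalid sandbox state transition: ${current} -> ${next}`);
  }
  return next;
}
```

The table allows `running -> terminated` directly because the reconciliation step above sets that status when the underlying sandbox no longer exists.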

Cause 5: Resource Exhaustion

Symptoms:

  • Cannot provision new sandboxes
  • "Resource limit exceeded" errors
  • All sandboxes stuck in "provisioning"

Resolution:

Immediate (5 minutes):

  1. Check resource quotas:
bash
# Check Durable Object limits
# Cloudflare Dashboard > Workers > Durable Objects > Usage

# Check for:
# - Active DO instances near limit
# - Storage near quota
# - CPU time consumption
  2. Emergency cleanup:
bash
# Terminate all non-critical sandboxes
curl -X POST https://monotask-agent-worker.workers.dev/api/sandbox/cleanup/emergency
  3. Limit concurrent sandboxes:
typescript
// Limit maximum concurrent sandboxes
const MAX_CONCURRENT_SANDBOXES = 10;

async function provisionSandbox(env: Env, config: SandboxConfig) {
  const active = await getActiveSandboxCount(env);

  if (active >= MAX_CONCURRENT_SANDBOXES) {
    throw new Error('Maximum sandbox limit reached');
  }

  return await createSandbox(env, config);
}

Long-term:

  • Implement sandbox pooling and reuse
  • Add resource monitoring and alerts
  • Request quota increases if needed
  • Optimize sandbox resource usage

Resolution Procedures

Manual Sandbox Termination (ETA: 5 minutes)

bash
# Step 1: Identify stuck sandbox
curl https://monotask-agent-worker.workers.dev/api/sandbox/list

# Step 2: Get sandbox details
curl https://monotask-agent-worker.workers.dev/api/sandbox/{sandbox_id}

# Step 3: Terminate sandbox
curl -X DELETE https://monotask-agent-worker.workers.dev/api/sandbox/{sandbox_id}

# Step 4: Verify termination
curl https://monotask-agent-worker.workers.dev/api/sandbox/stats
# Active count should decrease

Bulk Cleanup (ETA: 10 minutes)

For multiple stuck sandboxes:

typescript
// cleanup-script.ts
async function bulkCleanup(env: Env) {
  const sandboxes = await listAllSandboxes(env);
  const now = Date.now();
  const STUCK_THRESHOLD = 10 * 60 * 1000; // 10 minutes

  const stuckSandboxes = sandboxes.filter(s => {
    const age = now - new Date(s.created_at).getTime();
    return age > STUCK_THRESHOLD;
  });

  console.log(`Found ${stuckSandboxes.length} stuck sandboxes`);

  for (const sandbox of stuckSandboxes) {
    try {
      await terminateSandbox(env, sandbox.id);
      console.log(`✓ Terminated ${sandbox.id}`);
    } catch (error) {
      console.error(`✗ Failed to terminate ${sandbox.id}:`, error);
    }
  }

  console.log('Bulk cleanup completed');
}

// Execute:
// bun run cleanup-script.ts

State Reconciliation (ETA: 15 minutes)

Fix state inconsistencies:

typescript
// reconcile-states.ts
async function reconcileAllStates(env: Env) {
  console.log('Starting state reconciliation...');

  // Step 1: Audit current state
  const { orphanedDOs, orphanedActual } = await auditSandboxStates(env);

  // Step 2: Clean up orphaned records
  await cleanupOrphanedRecords(env);

  // Step 3: Verify all active sandboxes
  await reconcileSandboxStates(env);

  // Step 4: Generate report
  const finalStats = await getSandboxStats(env);
  console.log('Reconciliation complete:', finalStats);
}

Verification Steps

1. Active Sandbox Count Normal (ETA: 5 minutes)

bash
# Check sandbox stats
curl https://monotask-agent-worker.workers.dev/api/sandbox/stats

# Target: < 5 active sandboxes under normal load

2. No Stuck Sandboxes (ETA: 5 minutes)

bash
# List active sandboxes
curl https://monotask-agent-worker.workers.dev/api/sandbox/list

# Verify: All sandboxes < 10 minutes old

3. Cleanup Job Running (ETA: 5 minutes)

bash
# Check cleanup logs
wrangler tail monotask-agent-worker | grep cleanup

# Should see cleanup runs every 5 minutes

4. Sandbox Timeout Rate Normal (ETA: 5 minutes)

bash
# Check timeout metrics in dashboard
# Target: < 2% timeout rate

5. Resource Usage Healthy (ETA: 5 minutes)

bash
# Check Durable Object metrics
# Cloudflare Dashboard > Workers > Durable Objects

# Verify: Well below quotas

Prevention Measures

1. Improved Timeout Handling

typescript
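// Wrap all sandbox operations with a timeout
// (terminateSandbox is assumed to be defined elsewhere on this class)
class SandboxManager {
  async execute<T>(sandboxId: string, operation: () => Promise<T>): Promise<T> {
    const OPERATION_TIMEOUT = 5 * 60 * 1000; // 5 minutes
    let timer: ReturnType<typeof setTimeout> | undefined;

    const timeoutPromise = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error('Sandbox operation timeout')),
        OPERATION_TIMEOUT
      );
    });

    try {
      return await Promise.race([
        operation(),
        timeoutPromise,
      ]);
    } catch (error) {
      // Terminate the sandbox on timeout or any other failure
      await this.terminateSandbox(sandboxId);
      throw error;
    } finally {
      // Clear the timer so it does not fire after the operation settles
      clearTimeout(timer);
    }
  }
}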

2. Enhanced Monitoring

typescript
// Add sandbox lifecycle metrics
async function trackSandboxLifecycle(
  env: Env,
  sandboxId: string,
  event: string
) {
  // Analytics Engine currently accepts at most one index per data point;
  // the event name is already recorded in blobs.
  env.ANALYTICS.writeDataPoint({
    blobs: [sandboxId, event],
    doubles: [Date.now()],
    indexes: ['sandbox:lifecycle'],
  });

  // Alert on anomalies
  if (event === 'timeout' || event === 'stuck') {
    await alerter.sendAlert({
      severity: 'warning',
      message: `Sandbox ${event}: ${sandboxId}`,
      context: { sandboxId, event },
    });
  }
}

3. Automatic Cleanup Improvements

typescript
// Multi-layer cleanup approach
async function enhancedCleanup(env: Env) {
  // Layer 1: Scheduled cleanup (every 5 min)
  await scheduledCleanup(env);

  // Layer 2: Age-based cleanup (every 15 min)
  await ageBasedCleanup(env, 10 * 60 * 1000);

  // Layer 3: Resource-based cleanup (when near limits)
  const usage = await getResourceUsage(env);
  if (usage > 0.8) {
    await emergencyCleanup(env);
  }

  // Layer 4: State reconciliation (hourly)
  if (shouldReconcile()) {
    await reconcileSandboxStates(env);
  }
}
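
The layers above are meant to run at different intervals, so a single cron tick needs to decide which ones are due. The sketch below shows one way to do that; the `Layer` type, `dueLayers`, and the idea of storing last-run timestamps in a map are assumptions. The resource-based layer is omitted because it is triggered by usage, not by time.

```typescript
interface Layer {
  name: string;
  intervalMs: number;
}

const MINUTE_MS = 60 * 1000;

// Intervals as described in the comments of the cleanup layers above.
const LAYERS: Layer[] = [
  { name: "scheduled", intervalMs: 5 * MINUTE_MS },
  { name: "age-based", intervalMs: 15 * MINUTE_MS },
  { name: "reconcile", intervalMs: 60 * MINUTE_MS },
];

// lastRun maps a layer name to the time (ms since epoch) it last ran.
// A layer with no recorded run is always due.
function dueLayers(lastRun: Map<string, number>, nowMs: number): string[] {
  return LAYERS.filter((layer) => {
    const last = lastRun.get(layer.name);
    return last === undefined || nowMs - last >= layer.intervalMs;
  }).map((layer) => layer.name);
}
```

In a Worker, the last-run timestamps would have to live in durable storage (for example in the lifecycle Durable Object), since module-level variables do not reliably survive between scheduled invocations.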

4. Sandbox Health Checks

typescript
// Periodic health checks for active sandboxes
async function healthCheckSandboxes(env: Env) {
  const sandboxes = await listActiveSandboxes(env);

  for (const sandbox of sandboxes) {
    const health = await checkSandboxHealth(env, sandbox.id);

    if (!health.healthy) {
      console.warn(`Unhealthy sandbox detected: ${sandbox.id}`, health);

      if (health.age > 10 * 60 * 1000) {
        // Terminate if old and unhealthy
        await terminateSandbox(env, sandbox.id);
      }
    }
  }
}

Escalation Path

When to Escalate

Escalate if:

  • Active sandboxes > 50 for more than 30 minutes
  • Cleanup jobs consistently failing
  • Unable to provision new sandboxes
  • State corruption affecting multiple sandboxes
  • Resource quotas exceeded
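
The first criterion is about a sustained breach, not a single reading. A minimal sketch of that check, assuming active-count samples are collected periodically and sorted by time; the sample shape and function name are hypothetical, the 50 and 30-minute values come from the list above.

```typescript
interface CountSample {
  timestampMs: number;
  active: number;
}

const ESCALATION_ACTIVE_THRESHOLD = 50;
const ESCALATION_WINDOW_MS = 30 * 60 * 1000;

// samples must be sorted by timestamp, oldest first.
function sustainedBreach(samples: CountSample[], nowMs: number): boolean {
  // Walk back from the newest sample to find where the current breach began.
  let breachStart: number | null = null;
  for (const sample of samples.slice().reverse()) {
    if (sample.active <= ESCALATION_ACTIVE_THRESHOLD) break;
    breachStart = sample.timestampMs;
  }
  return breachStart !== null && nowMs - breachStart > ESCALATION_WINDOW_MS;
}
```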

Escalation Contacts

Level 1 - Agent Team

  • Slack: #agent-team
  • For agent code issues causing hangs

Level 2 - Infrastructure Team

  • For Durable Object issues
  • For resource quota concerns

Level 3 - Cloudflare Support

  • For platform-level sandbox issues
  • For quota increase requests

Post-Incident

Required Actions

  1. Root Cause Analysis:

    • What caused sandboxes to get stuck?
    • Were there code changes involved?
    • Was cleanup working properly?
    • Are there resource constraints?
  2. Improve Detection:

    • Add alerts for stuck sandboxes
    • Monitor sandbox age distribution
    • Track cleanup job success rate
    • Add resource usage alerts
  3. Code Improvements:

    • Fix agent code causing hangs
    • Improve timeout handling
    • Enhance cleanup robustness
    • Add state verification
  4. Documentation:

    • Document sandbox limits and quotas
    • Update agent development guidelines
    • Share lessons learned
    • Update this runbook


Last Updated: 2025-10-26
Owner: Agent Team
Reviewers: SRE Team, Backend Team

MonoKernel MonoTask Documentation