Sandbox Stuck
Overview
This runbook provides procedures for detecting, diagnosing, and resolving stuck or leaked sandbox instances in MonoTask agent execution.
Alert: sandbox_timeout_high or sandbox_leak_detected
Severity: Warning (> 5% timeout rate) or Critical (> 20 active sandboxes)
SLO Impact: Affects agent execution latency and sandbox provisioning SLO
Symptoms and Detection
How to Detect
- Alert: "High Sandbox Timeout Rate" or "Sandbox Leak Detected"
- Dashboard: Active sandboxes count not decreasing
- Logs: Sandbox timeout errors or orphaned sandbox warnings
- User Impact: Agent execution failures, slow task processing
Observable Symptoms
- Active sandbox count > 20 sustained
- Sandboxes in "running" state for > 5 minutes
- Sandbox timeout errors > 5% of executions
- Memory/resource warnings in sandbox logs
- Sandbox cleanup jobs failing
Investigation Steps
1. Check Active Sandbox Count (ETA: 2 minutes)
Determine current sandbox state:
```bash
# Query sandbox stats via Durable Object
curl https://monotask-agent-worker.workers.dev/api/sandbox/stats

# Or use wrangler to check DO state
# (Requires DO inspection endpoint)

# Expected response:
# {
#   "active": 3,
#   "provisioning": 0,
#   "terminating": 0,
#   "total_created": 150,
#   "total_destroyed": 147
# }
```

Questions to Answer:
- How many sandboxes are currently active?
- Are any sandboxes stuck in "provisioning" state?
- How long have active sandboxes been running?
- Is the count growing or stable?
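The leak arithmetic implied by these counters can be checked mechanically: created minus destroyed should equal the sandboxes still tracked in a live state. A minimal sketch, assuming the stats payload shape shown above (the `SandboxStats` interface and `leakedCount` helper are illustrative, not part of the real API):

```typescript
// Hypothetical shape matching the example stats response above.
interface SandboxStats {
  active: number;
  provisioning: number;
  terminating: number;
  total_created: number;
  total_destroyed: number;
}

// Sandboxes unaccounted for: created, not destroyed, and not tracked
// in any live state. A positive result suggests a leak.
function leakedCount(stats: SandboxStats): number {
  const tracked = stats.active + stats.provisioning + stats.terminating;
  return stats.total_created - stats.total_destroyed - tracked;
}
```

In the example response, 150 created minus 147 destroyed matches the 3 active sandboxes, so nothing is leaked.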
2. Identify Stuck Sandboxes (ETA: 3 minutes)
List sandboxes and their states:
```bash
# Get list of active sandboxes
curl https://monotask-agent-worker.workers.dev/api/sandbox/list

# Expected response:
# [
#   {
#     "id": "sandbox_123",
#     "status": "running",
#     "created_at": "2025-10-26T10:00:00Z",
#     "age_minutes": 45,  ← Stuck if > 10 minutes
#     "agent_type": "implementation",
#     "task_id": "task_456"
#   }
# ]
```

Red Flags:
- Sandboxes running > 10 minutes
- Status stuck in "provisioning"
- No associated task or agent
- Repeated timeout patterns for specific sandbox
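The age-based red flags above can be screened for in one pass over the list response. A sketch, assuming the payload shape shown in step 2; `findStuckSandboxes` and both thresholds are illustrative, not part of the real API:

```typescript
// Hypothetical shape matching the example list response above.
interface SandboxInfo {
  id: string;
  status: string;
  created_at: string; // ISO 8601 timestamp
}

const STUCK_AFTER_MINUTES = 10;
const PROVISIONING_GRACE_MINUTES = 2;

// Flag sandboxes older than the stuck threshold, plus any that have
// sat in "provisioning" beyond a short grace period.
function findStuckSandboxes(
  sandboxes: SandboxInfo[],
  now: number = Date.now()
): SandboxInfo[] {
  return sandboxes.filter((s) => {
    const ageMinutes = (now - new Date(s.created_at).getTime()) / 60_000;
    return (
      ageMinutes > STUCK_AFTER_MINUTES ||
      (s.status === 'provisioning' && ageMinutes > PROVISIONING_GRACE_MINUTES)
    );
  });
}
```

The `now` parameter is injected so the check is deterministic in tests; in production the default is fine.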
3. Examine Sandbox Logs (ETA: 5 minutes)
Review logs for stuck sandbox:
```bash
# Get logs for specific sandbox
curl https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123/logs

# Or tail worker logs filtering for sandbox ID
wrangler tail monotask-agent-worker | grep "sandbox_123"

# Look for:
# - Last activity timestamp
# - Error messages
# - Infinite loops or hangs
# - Resource exhaustion warnings
```

Common Log Patterns:
- "Operation timed out" - External service hang
- "Maximum call stack exceeded" - Infinite recursion
- "Out of memory" - Memory leak
- No recent logs - Process hung
4. Check Sandbox Resource Usage (ETA: 3 minutes)
Monitor resource consumption:
```bash
# Get sandbox resource stats
curl https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123/resources

# Expected response:
# {
#   "cpu_percent": 95,    ← High CPU indicates busy loop
#   "memory_mb": 120,
#   "duration_ms": 45000,
#   "operations": 150000  ← High count indicates activity
# }
```

Resource Patterns:
| Pattern | Likely Cause |
|---|---|
| CPU 100% | Infinite loop or heavy computation |
| CPU 0% | Hung waiting for I/O or deadlock |
| Memory growing | Memory leak |
| No operations | Process stuck or waiting |
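The pattern table lends itself to a simple triage heuristic. A sketch, assuming the resource payload above; the thresholds and the `likelyCause` helper are illustrative, and since memory growth needs at least two samples, an optional earlier reading is taken as a parameter:

```typescript
// Hypothetical shape matching the example resource response above.
interface ResourceSample {
  cpu_percent: number;
  memory_mb: number;
  operations: number;
}

// Triage heuristic derived from the pattern table. Order matters:
// memory growth is checked before the CPU-based patterns.
function likelyCause(sample: ResourceSample, previousMemoryMb?: number): string {
  if (previousMemoryMb !== undefined && sample.memory_mb > previousMemoryMb * 1.5) {
    return 'memory leak';
  }
  if (sample.cpu_percent >= 90) return 'infinite loop or heavy computation';
  if (sample.operations === 0) return 'process stuck or waiting';
  if (sample.cpu_percent <= 5) return 'hung waiting for I/O or deadlock';
  return 'no anomaly detected';
}
```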
5. Review Cleanup Job Status (ETA: 2 minutes)
Check if cleanup automation is working:
```bash
# Check cleanup job logs
wrangler tail monotask-agent-worker | grep "sandbox.*cleanup"

# Should see periodic cleanup runs (every 5 minutes):
# "Sandbox cleanup started"
# "Found 2 sandboxes to cleanup"
# "Cleaned up sandbox_123"
# "Cleanup completed"
```

Cleanup Issues:
- No cleanup logs - Job not running
- Cleanup errors - Job failing
- Sandboxes not being cleaned - Detection logic broken
Common Causes and Resolutions
Cause 1: Agent Code Infinite Loop
Symptoms:
- CPU at 100%
- Sandbox never completes
- No error messages
- High operation count
Resolution:
Immediate (5 minutes):
- Force terminate stuck sandbox:
```bash
# Terminate via API
curl -X DELETE https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123
```

Or via the Durable Object stub:

```typescript
const id = env.SANDBOX_LIFECYCLE.idFromName(sandboxId);
const stub = env.SANDBOX_LIFECYCLE.get(id);
await stub.terminate();
```

- Add iteration limits to agent code:
```typescript
// In agent execution code
let iterations = 0;
const MAX_ITERATIONS = 1000;

while (condition) {
  if (iterations++ > MAX_ITERATIONS) {
    throw new Error('Maximum iterations exceeded - possible infinite loop');
  }
  // Agent logic
}
```

- Add execution timeout:
```typescript
// Wrap agent execution with timeout
const AGENT_TIMEOUT = 60000; // 60 seconds

const result = await Promise.race([
  executeAgent(agentCode),
  new Promise((_, reject) =>
    setTimeout(() => reject(new Error('Agent execution timeout')), AGENT_TIMEOUT)
  ),
]);
```

Long-term:
- Add static analysis to detect infinite loops
- Implement step counting in agent runtime
- Add circuit breakers for agent operations
- Improve agent code validation
Cause 2: External API Hang
Symptoms:
- CPU near 0%
- Sandbox waiting indefinitely
- Logs show external API call
- No response from API
Resolution:
Immediate (5 minutes):
- Terminate hung sandbox:
```bash
curl -X DELETE https://monotask-agent-worker.workers.dev/api/sandbox/sandbox_123
```

- Add request timeouts to agent API calls:
```typescript
// In agent runtime
async function fetchWithTimeout(url: string, timeout = 10000) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeout);
  try {
    const response = await fetch(url, {
      signal: controller.signal,
    });
    return response;
  } catch (error) {
    if (error.name === 'AbortError') {
      throw new Error(`Request timeout after ${timeout}ms`);
    }
    throw error;
  } finally {
    clearTimeout(timeoutId);
  }
}
```

- Implement retry with timeout:
```typescript
async function callAPIWithRetry(url: string, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fetchWithTimeout(url, 10000);
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // Exponential backoff
    }
  }
}
```

Long-term:
- Add timeout configuration to agent runtime
- Implement circuit breakers for external APIs
- Add fallback responses for API failures
- Monitor external API health
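One of the long-term items, a circuit breaker for external APIs, can be sketched minimally: after a run of consecutive failures the breaker opens and rejects calls outright until a cooldown elapses, so a dead API stops pinning sandboxes on timeouts. The class below is a simplified illustration (the `now` parameter is injected for testability), not production code:

```typescript
// Minimal circuit breaker: after `threshold` consecutive failures the
// breaker opens and rejects calls until `cooldownMs` has elapsed, then
// lets the next call through as a trial.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(private threshold = 3, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>, now: number = Date.now()): Promise<T> {
    if (this.openedAt !== null && now - this.openedAt < this.cooldownMs) {
      throw new Error('Circuit open - skipping external call');
    }
    try {
      const result = await fn();
      this.failures = 0;   // Any success resets the breaker
      this.openedAt = null;
      return result;
    } catch (error) {
      this.failures++;
      if (this.failures >= this.threshold) {
        this.openedAt = now; // Trip the breaker
      }
      throw error;
    }
  }
}
```

Wrapping `fetchWithTimeout` calls in `breaker.call(...)` combines fast failure with the timeout above.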
Cause 3: Cleanup Job Not Running
Symptoms:
- Old sandboxes accumulating
- No cleanup logs
- Active count growing over time
Resolution:
Immediate (10 minutes):
- Manually trigger cleanup:
```typescript
// Create cleanup script
// cleanup-sandboxes.ts
import { Env } from './types';

export async function cleanupSandboxes(env: Env) {
  const response = await fetch('https://monotask-agent-worker.workers.dev/api/sandbox/cleanup', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer ' + env.ADMIN_TOKEN,
    },
  });
  const result = await response.json();
  console.log('Cleanup result:', result);
}

// Execute:
// bun run cleanup-sandboxes.ts
```

- Verify cleanup job configuration:
```typescript
// In agent-worker/src/index.ts
export default {
  async scheduled(event: ScheduledEvent, env: Env, ctx: ExecutionContext) {
    // Cleanup job should run every 5 minutes
    ctx.waitUntil(runSandboxCleanup(env));
  }
};

// Check wrangler.toml has cron trigger:
// [triggers]
// crons = ["*/5 * * * *"] # Every 5 minutes
```

- Fix cleanup logic if broken:
```typescript
async function runSandboxCleanup(env: Env) {
  try {
    console.log('Starting sandbox cleanup');

    // Get all sandboxes
    const sandboxes = await listAllSandboxes(env);
    const now = Date.now();
    const MAX_AGE_MS = 10 * 60 * 1000; // 10 minutes
    let cleaned = 0;

    for (const sandbox of sandboxes) {
      const age = now - new Date(sandbox.created_at).getTime();
      if (age > MAX_AGE_MS || sandbox.status === 'failed') {
        try {
          await terminateSandbox(env, sandbox.id);
          cleaned++;
          console.log(`Cleaned up sandbox ${sandbox.id}`);
        } catch (error) {
          console.error(`Failed to cleanup ${sandbox.id}:`, error);
        }
      }
    }

    console.log(`Cleanup completed: ${cleaned} sandboxes removed`);
  } catch (error) {
    console.error('Cleanup job failed:', error);
  }
}
```

Long-term:
- Add monitoring for cleanup job execution
- Alert if cleanup job fails
- Implement defensive cleanup (multiple methods)
- Add cleanup job health checks
Cause 4: Sandbox Lifecycle State Corruption
Symptoms:
- Sandbox shows as "running" but doesn't exist
- Inconsistent state between DO and actual process
- Cleanup fails with "sandbox not found"
Resolution:
Immediate (15 minutes):
- Audit sandbox states:
```typescript
// Check consistency between DO state and actual sandboxes
async function auditSandboxStates(env: Env) {
  const doSandboxes = await listSandboxesFromDO(env);
  const actualSandboxes = await listActualSandboxes(env);

  const orphanedDOs = doSandboxes.filter(
    s => !actualSandboxes.find(a => a.id === s.id)
  );
  const orphanedActual = actualSandboxes.filter(
    s => !doSandboxes.find(d => d.id === s.id)
  );

  console.log('Orphaned DO records:', orphanedDOs.length);
  console.log('Orphaned actual sandboxes:', orphanedActual.length);
  return { orphanedDOs, orphanedActual };
}
```

- Clean up orphaned records:
```typescript
async function cleanupOrphanedRecords(env: Env) {
  const { orphanedDOs, orphanedActual } = await auditSandboxStates(env);

  // Remove orphaned DO records
  for (const sandbox of orphanedDOs) {
    const id = env.SANDBOX_LIFECYCLE.idFromName(sandbox.id);
    const stub = env.SANDBOX_LIFECYCLE.get(id);
    await stub.delete(); // Remove from DO
    console.log(`Removed orphaned DO record: ${sandbox.id}`);
  }

  // Terminate orphaned actual sandboxes
  for (const sandbox of orphanedActual) {
    await terminateSandbox(env, sandbox.id);
    console.log(`Terminated orphaned sandbox: ${sandbox.id}`);
  }
}
```

- Add state reconciliation:
```typescript
// Run reconciliation periodically
async function reconcileSandboxStates(env: Env) {
  const sandboxes = await listAllSandboxes(env);
  for (const sandbox of sandboxes) {
    // Verify sandbox actually exists
    const exists = await checkSandboxExists(env, sandbox.id);
    if (!exists && sandbox.status === 'running') {
      // Update state to reflect reality
      const id = env.SANDBOX_LIFECYCLE.idFromName(sandbox.id);
      const stub = env.SANDBOX_LIFECYCLE.get(id);
      await stub.updateStatus('terminated');
      console.log(`Reconciled state for ${sandbox.id}`);
    }
  }
}
```

Long-term:
- Implement state verification before operations
- Add transaction logging for state changes
- Periodic state reconciliation job
- Improve error handling in state transitions
Cause 5: Resource Exhaustion
Symptoms:
- Cannot provision new sandboxes
- "Resource limit exceeded" errors
- All sandboxes stuck in "provisioning"
Resolution:
Immediate (5 minutes):
- Check resource quotas:
```bash
# Check Durable Object limits
# Cloudflare Dashboard > Workers > Durable Objects > Usage

# Check for:
# - Active DO instances near limit
# - Storage near quota
# - CPU time consumption
```

- Emergency cleanup:
```bash
# Terminate all non-critical sandboxes
curl -X POST https://monotask-agent-worker.workers.dev/api/sandbox/cleanup/emergency
```

- Implement sandbox pooling:
```typescript
// Limit maximum concurrent sandboxes
const MAX_CONCURRENT_SANDBOXES = 10;

async function provisionSandbox(env: Env, config: SandboxConfig) {
  const active = await getActiveSandboxCount(env);
  if (active >= MAX_CONCURRENT_SANDBOXES) {
    throw new Error('Maximum sandbox limit reached');
  }
  return await createSandbox(env, config);
}
```

Long-term:
- Implement sandbox pooling and reuse
- Add resource monitoring and alerts
- Request quota increases if needed
- Optimize sandbox resource usage
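The pooling-and-reuse item can be sketched as a small pool that hands back released sandboxes before creating new ones, which both enforces the concurrency cap and cuts provisioning churn. `SandboxPool` and its id scheme are illustrative assumptions, not the real API:

```typescript
// Hand out released sandboxes before creating new ones; creation stops
// once `max` sandboxes exist, matching the concurrency cap above.
class SandboxPool {
  private idle: string[] = [];
  private created = 0;

  constructor(private max = 10) {}

  acquire(): string {
    const reused = this.idle.pop();
    if (reused !== undefined) return reused; // Reuse beats provisioning
    if (this.created >= this.max) {
      throw new Error('Maximum sandbox limit reached');
    }
    this.created++;
    return `sandbox_${this.created}`; // Illustrative id scheme
  }

  release(id: string): void {
    // A real pool would reset/sanitize the sandbox before reuse.
    this.idle.push(id);
  }
}
```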
Resolution Procedures
Manual Sandbox Termination (ETA: 5 minutes)
```bash
# Step 1: Identify stuck sandbox
curl https://monotask-agent-worker.workers.dev/api/sandbox/list

# Step 2: Get sandbox details
curl https://monotask-agent-worker.workers.dev/api/sandbox/{sandbox_id}

# Step 3: Terminate sandbox
curl -X DELETE https://monotask-agent-worker.workers.dev/api/sandbox/{sandbox_id}

# Step 4: Verify termination
curl https://monotask-agent-worker.workers.dev/api/sandbox/stats
# Active count should decrease
```

Bulk Cleanup (ETA: 10 minutes)
For multiple stuck sandboxes:
```typescript
// cleanup-script.ts
async function bulkCleanup(env: Env) {
  const sandboxes = await listAllSandboxes(env);
  const now = Date.now();
  const STUCK_THRESHOLD = 10 * 60 * 1000; // 10 minutes

  const stuckSandboxes = sandboxes.filter(s => {
    const age = now - new Date(s.created_at).getTime();
    return age > STUCK_THRESHOLD;
  });

  console.log(`Found ${stuckSandboxes.length} stuck sandboxes`);

  for (const sandbox of stuckSandboxes) {
    try {
      await terminateSandbox(env, sandbox.id);
      console.log(`✓ Terminated ${sandbox.id}`);
    } catch (error) {
      console.error(`✗ Failed to terminate ${sandbox.id}:`, error);
    }
  }

  console.log('Bulk cleanup completed');
}

// Execute:
// bun run cleanup-script.ts
```

State Reconciliation (ETA: 15 minutes)
Fix state inconsistencies:
```typescript
// reconcile-states.ts
async function reconcileAllStates(env: Env) {
  console.log('Starting state reconciliation...');

  // Step 1: Audit current state
  const { orphanedDOs, orphanedActual } = await auditSandboxStates(env);

  // Step 2: Clean up orphaned records
  await cleanupOrphanedRecords(env);

  // Step 3: Verify all active sandboxes
  await reconcileSandboxStates(env);

  // Step 4: Generate report
  const finalStats = await getSandboxStats(env);
  console.log('Reconciliation complete:', finalStats);
}
```

Verification Steps
1. Active Sandbox Count Normal (ETA: 5 minutes)
```bash
# Check sandbox stats
curl https://monotask-agent-worker.workers.dev/api/sandbox/stats
# Target: < 5 active sandboxes under normal load
```

2. No Stuck Sandboxes (ETA: 5 minutes)
```bash
# List active sandboxes
curl https://monotask-agent-worker.workers.dev/api/sandbox/list
# Verify: All sandboxes < 10 minutes old
```

3. Cleanup Job Running (ETA: 5 minutes)
```bash
# Check cleanup logs
wrangler tail monotask-agent-worker | grep cleanup
# Should see cleanup runs every 5 minutes
```

4. Sandbox Timeout Rate Normal (ETA: 5 minutes)
```bash
# Check timeout metrics in dashboard
# Target: < 2% timeout rate
```

5. Resource Usage Healthy (ETA: 5 minutes)
```bash
# Check Durable Object metrics
# Cloudflare Dashboard > Workers > Durable Objects
# Verify: Well below quotas
```

Prevention Measures
1. Improved Timeout Handling
```typescript
// Wrap all sandbox operations with timeout
class SandboxManager {
  async execute(sandboxId: string, operation: () => Promise<any>) {
    const OPERATION_TIMEOUT = 5 * 60 * 1000; // 5 minutes

    const timeoutPromise = new Promise((_, reject) =>
      setTimeout(
        () => reject(new Error('Sandbox operation timeout')),
        OPERATION_TIMEOUT
      )
    );

    try {
      return await Promise.race([
        operation(),
        timeoutPromise,
      ]);
    } catch (error) {
      // Cleanup on timeout
      await this.terminateSandbox(sandboxId);
      throw error;
    }
  }
}
```

2. Enhanced Monitoring
```typescript
// Add sandbox lifecycle metrics
async function trackSandboxLifecycle(
  env: Env,
  sandboxId: string,
  event: string
) {
  await env.ANALYTICS.writeDataPoint({
    blobs: [sandboxId, event],
    doubles: [Date.now()],
    indexes: [`sandbox:lifecycle`, `event:${event}`],
  });

  // Alert on anomalies
  if (event === 'timeout' || event === 'stuck') {
    await alerter.sendAlert({
      severity: 'warning',
      message: `Sandbox ${event}: ${sandboxId}`,
      context: { sandboxId, event },
    });
  }
}
```

3. Automatic Cleanup Improvements
```typescript
// Multi-layer cleanup approach
async function enhancedCleanup(env: Env) {
  // Layer 1: Scheduled cleanup (every 5 min)
  await scheduledCleanup(env);

  // Layer 2: Age-based cleanup (every 15 min)
  await ageBasedCleanup(env, 10 * 60 * 1000);

  // Layer 3: Resource-based cleanup (when near limits)
  const usage = await getResourceUsage(env);
  if (usage > 0.8) {
    await emergencyCleanup(env);
  }

  // Layer 4: State reconciliation (hourly)
  if (shouldReconcile()) {
    await reconcileSandboxStates(env);
  }
}
```

4. Sandbox Health Checks
```typescript
// Periodic health checks for active sandboxes
async function healthCheckSandboxes(env: Env) {
  const sandboxes = await listActiveSandboxes(env);
  for (const sandbox of sandboxes) {
    const health = await checkSandboxHealth(env, sandbox.id);
    if (!health.healthy) {
      console.warn(`Unhealthy sandbox detected: ${sandbox.id}`, health);
      if (health.age > 10 * 60 * 1000) {
        // Terminate if old and unhealthy
        await terminateSandbox(env, sandbox.id);
      }
    }
  }
}
```

Escalation Path
When to Escalate
Escalate if:
- Active sandboxes > 50 for more than 30 minutes
- Cleanup jobs consistently failing
- Unable to provision new sandboxes
- State corruption affecting multiple sandboxes
- Resource quotas exceeded
Escalation Contacts
Level 1 - Agent Team
- Slack: #agent-team
- For agent code issues causing hangs
Level 2 - Infrastructure Team
- For Durable Object issues
- For resource quota concerns
Level 3 - Cloudflare Support
- For platform-level sandbox issues
- For quota increase requests
Post-Incident
Required Actions
Root Cause Analysis:
- What caused sandboxes to get stuck?
- Were there code changes involved?
- Was cleanup working properly?
- Are there resource constraints?
Improve Detection:
- Add alerts for stuck sandboxes
- Monitor sandbox age distribution
- Track cleanup job success rate
- Add resource usage alerts
Code Improvements:
- Fix agent code causing hangs
- Improve timeout handling
- Enhance cleanup robustness
- Add state verification
Documentation:
- Document sandbox limits and quotas
- Update agent development guidelines
- Share lessons learned
- Update this runbook
Related Runbooks
- Worker Timeout - For sandbox timeout issues
- High Error Rate - For sandbox-related errors
- Queue Backup - If sandbox issues cause queue backup
Last Updated: 2025-10-26
Owner: Agent Team
Reviewers: SRE Team, Backend Team