
Partial Failure

Overview

This playbook addresses scenarios where one or more system components fail while others remain operational. It provides procedures for isolated component recovery while maintaining service continuity.

RTO Target: Varies by component (15 minutes - 2 hours)
Goal: Minimize total system downtime


When to Use This Playbook

Use this playbook when experiencing:

  • Single Worker Failure: One worker down, others operational
  • Database Read/Write Issues: D1 queries failing intermittently
  • KV Namespace Unavailable: Specific namespace inaccessible
  • R2 Bucket Issues: One bucket failing, others working
  • Durable Object Errors: Specific DO class experiencing issues
  • Queue Processing Failure: Messages not being processed in one queue
  • API Endpoint Degradation: Specific endpoints returning errors

Key Characteristic: System partially functional, not total outage


Component Failure Identification

Quick Diagnosis

  1. Check Component Health:

    bash
    # Test each worker
    curl https://monotask-api-gateway.workers.dev/health
    curl https://monotask-task-worker.workers.dev/health
    curl https://monotask-agent-worker.workers.dev/health
    curl https://monotask-github-worker.workers.dev/health
    curl https://monotask-auth-worker.workers.dev/health
    curl https://monotask-websocket-worker.workers.dev/health
  2. Check Cloudflare Status:

    • Cloudflare status page
    • Account-specific service alerts
    • Regional availability
  3. Check Resource Limits:

    • D1 query limits
    • Worker CPU time
    • KV operation limits
    • R2 bandwidth limits
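The per-worker checks in step 1 can be scripted as a single sweep. A minimal sketch, assuming each worker exposes a `/health` route that returns HTTP 200 when healthy:

```bash
# Sweep every worker's /health endpoint and flag failures.
# Worker names come from the checklist above; "200 means healthy" is an assumption.

status_label() {
  # Pure helper: map an HTTP status code to a report label.
  if [ "$1" = "200" ]; then echo "OK"; else echo "FAIL ($1)"; fi
}

for w in api-gateway task-worker agent-worker github-worker auth-worker websocket-worker; do
  code=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" \
    "https://monotask-$w.workers.dev/health")
  echo "monotask-$w: $(status_label "$code")"
done
```

Any line printing `FAIL` points you at the component-specific recovery section below.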

Component-Specific Recovery

API Gateway Worker Failure

Symptoms:

  • All API requests failing
  • 502/503 errors at gateway
  • No traffic reaching service workers

Impact: Critical - entire API unavailable

Recovery Steps:

  1. Identify Issue:

    bash
    bunx wrangler tail monotask-api-gateway
    # Watch live logs for errors
  2. Quick Fix Options:

    bash
    # Option 1: Rollback
    bunx wrangler rollback --name monotask-api-gateway
    
    # Option 2: Redeploy current version
    cd packages/cloudflare-workers/api-gateway
    bunx wrangler deploy
  3. Verify Recovery:

    bash
    curl https://monotask-api-gateway.workers.dev/health

RTO: 5-10 minutes


Task Worker Failure

Symptoms:

  • Task operations failing
  • State transitions not processing
  • Task queries returning errors

Impact: High - core functionality impaired

Recovery Steps:

  1. Check Worker Status:

    bash
    bunx wrangler tail monotask-task-worker
  2. Check D1 Connectivity:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "SELECT COUNT(*) FROM tasks" \
      --json
  3. Recovery Options:

    bash
    # If worker issue
    bunx wrangler rollback --name monotask-task-worker
    
    # If D1 issue, see the D1 Recovery Playbook (./d1-recovery.md)

Workaround: Route traffic through backup worker instance (if configured)

RTO: 10-15 minutes


Agent Worker Failure

Symptoms:

  • AI agent executions failing
  • Queue messages backing up
  • Sandbox provisioning errors

Impact: Medium - automation delayed, manual operations still work

Recovery Steps:

  1. Check Queue Status:

    bash
    # Check queue depth via the Cloudflare dashboard or custom monitoring;
    # recent wrangler versions can also list the configured queues:
    bunx wrangler queues list
  2. Identify Failure Point:

    • Claude API connectivity?
    • Sandbox lifecycle issues?
    • Queue processing errors?
  3. Recovery Actions:

    bash
    # Rollback worker
    bunx wrangler rollback --name monotask-agent-worker
    
    # Or fix and redeploy
    cd packages/cloudflare-workers/agent-worker
    bunx wrangler deploy
  4. Process Backlog:

    • Queue will automatically process backlog
    • Monitor processing rate
    • Increase worker instances if needed (Cloudflare dashboard)
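One rough way to watch the processing rate is to sample the worker's live logs for a fixed window and normalize to events per minute. A sketch, assuming one log line per processed message (the `--format json` flag is real `wrangler tail` syntax; the one-line-per-message log shape is an assumption about your logging):

```bash
# Estimate the queue consumer's processing rate from a short log sample.

rate_per_min() {
  # Pure helper: events observed over a window (seconds) -> events/minute.
  echo $(( $1 * 60 / $2 ))
}

sample_rate() {
  local window="${1:-60}" log=/tmp/agent-tail.jsonl
  bunx wrangler tail monotask-agent-worker --format json > "$log" &
  local pid=$!
  sleep "$window"
  kill "$pid" 2>/dev/null
  rate_per_min "$(wc -l < "$log")" "$window"
}

# Usage: sample_rate 60   # approx. messages/minute over a 60s window
```

If the rate stays near zero while queue depth grows, the consumer is stuck rather than merely slow.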

Workaround: Manual task execution while worker recovers

RTO: 20-30 minutes


GitHub Worker Failure

Symptoms:

  • Webhook events not processing
  • Repository syncs failing
  • GitHub API calls timing out

Impact: Medium - GitHub integration down, core features unaffected

Recovery Steps:

  1. Verify GitHub API Status:

    • Check GitHub status page
    • Test API connectivity
  2. Check Webhook Configuration:

    bash
    # Verify webhook endpoint accessible
    curl https://monotask-github-worker.workers.dev/webhook
  3. Recovery:

    bash
    bunx wrangler rollback --name monotask-github-worker
  4. Replay Missed Events:

    • GitHub webhooks have retry logic
    • Or manually trigger sync:
    bash
    # Via API call to sync endpoint
    curl -X POST https://monotask-api-gateway.workers.dev/api/github/sync

Workaround: Manual GitHub operations via UI

RTO: 15-20 minutes


D1 Database Partial Failure

Symptoms:

  • Intermittent query failures
  • Slow response times
  • Connection timeout errors

Impact: Critical - affects all data operations

Diagnosis:

  1. Check D1 Health:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "PRAGMA integrity_check" \
      --json
  2. Check Query Performance:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "EXPLAIN QUERY PLAN SELECT * FROM tasks LIMIT 10" \
      --json
  3. Monitor Rate Limits:

    • Check if hitting D1 query limits
    • Review request patterns

Recovery Options:

  1. If Rate Limited:

    • Reduce query frequency
    • Implement caching
    • Batch operations
  2. If Database Corruption:

    • Follow the D1 Recovery Playbook (./d1-recovery.md)

  3. If Connectivity Issues:

    • Wait for Cloudflare infrastructure recovery
    • Monitor Cloudflare status page
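When rate limiting is the cause, the biggest win is usually collapsing N single-row statements into one batched statement. A sketch of building a batched `INSERT` from the shell (the table and column names are illustrative, not from the real schema, and the quoting is naive, so don't feed it untrusted input):

```bash
# Build a multi-row VALUES clause so N rows cost one D1 request instead of N.
build_values() {
  local out="" v
  for v in "$@"; do
    out="$out,('$v')"
  done
  echo "${out#,}"   # strip the leading comma
}

sql="INSERT INTO audit_log (note) VALUES $(build_values a b c)"
echo "$sql"
# One request instead of three:
# bunx wrangler d1 execute monotask-production --command "$sql"
```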

RTO: 15-30 minutes (infrastructure dependent)


KV Namespace Failure

Symptoms:

  • Specific KV namespace returning errors
  • Keys not accessible
  • Write operations failing

Impact: Varies by namespace:

  • SESSIONS: High (authentication issues)
  • CACHE: Low (performance degradation)
  • RATE_LIMITS: Medium (rate limiting disabled)
  • FEATURE_FLAGS: Medium (flags unavailable)
  • API_KEYS: High (API auth fails)

Recovery Steps:

  1. Identify Affected Namespace:

    bash
    # Test each namespace
    bunx wrangler kv:key list --namespace-id 9fb88e98a937493e93fa6930f4506302
    bunx wrangler kv:key list --namespace-id 32f1ed0a552e453da465bb36df6f1644
    # ... etc for each namespace
  2. Check Cloudflare KV Status:

    • KV service status
    • Regional availability
  3. Recovery Options:

    Option A: Wait for Service Recovery

    • If Cloudflare infrastructure issue
    • Monitor status page

    Option B: Restore from Backup

    bash
    # Restore specific namespace
    bun run scripts/recovery/kv-restore.ts <backup-id> \
      --namespace SESSIONS

    Option C: Failover Strategy

    • Use fallback mechanisms in code
    • Degrade gracefully without KV
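The namespace checks from step 1 can be folded into one probe loop. A sketch using the two namespace IDs shown above (extend the list with the remaining IDs):

```bash
# Probe KV namespaces; a failing list call marks the namespace unavailable.
probe_ns() {
  if bunx wrangler kv:key list --namespace-id "$1" > /dev/null 2>&1; then
    echo "OK"
  else
    echo "UNAVAILABLE"
  fi
}

for ns in 9fb88e98a937493e93fa6930f4506302 32f1ed0a552e453da465bb36df6f1644; do
  echo "$ns: $(probe_ns "$ns")"
done
```

An `UNAVAILABLE` result tells you which namespace's workaround (listed below) to activate.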

Workarounds by Namespace:

  • SESSIONS: Force re-authentication
  • CACHE: Accept cache misses
  • RATE_LIMITS: Disable rate limiting temporarily
  • FEATURE_FLAGS: Use default values
  • API_KEYS: Manual verification

RTO: 30-45 minutes


R2 Bucket Failure

Symptoms:

  • Cannot upload/download objects
  • 404 errors for existing objects
  • Slow object access

Impact: Medium - file storage affected

Recovery Steps:

  1. Identify Affected Bucket:

    bash
    bunx wrangler r2 object list evidence-storage
    bunx wrangler r2 object list screenshots
    bunx wrangler r2 object list agent-artifacts
  2. Check R2 Status:

    • Cloudflare R2 service status
    • Regional availability
  3. Recovery Options:

    Option A: Use Secondary Bucket

    bash
    # Failover to backup bucket
    # (Requires code change to use backup bucket binding)

    Option B: Restore from Replication

    bash
    bun run scripts/backup/r2-backup.ts sync <bucket-name>

Workaround: Temporarily disable file uploads, queue for later

RTO: 1-2 hours


Durable Object Failure

Symptoms:

  • Specific DO operations failing
  • Connection errors to DO
  • State inconsistencies

Impact: Varies by DO type

Recovery by DO Type:

  1. QueueManager DO:

    • Queue processing stops
    • Messages accumulate
    • Recovery: Redeploy agent worker
  2. SandboxLifecycle DO:

    • Sandbox provisioning fails
    • Agent execution blocked
    • Recovery: Redeploy agent worker, cleanup stuck sandboxes
  3. TaskCoordinator DO:

    • Task coordination fails
    • State transitions delayed
    • Recovery: Redeploy task worker
  4. WebSocketRoom DO:

    • Real-time updates stop
    • Clients disconnected
    • Recovery: Redeploy websocket worker, clients auto-reconnect

General Recovery:

bash
# Redeploy worker containing DO
bunx wrangler deploy --name <worker-with-do>

RTO: 10-15 minutes


Gradual Restoration Strategy

When multiple components affected:

Phase 1: Core Services (0-15 minutes)

Priority: Get basic functionality working

  1. API Gateway (entry point)
  2. Auth Worker (authentication)
  3. D1 Database (data layer)

Phase 2: Primary Features (15-45 minutes)

Priority: Restore main user-facing features

  1. Task Worker (task operations)
  2. KV Namespaces (sessions, cache)
  3. WebSocket Worker (real-time updates)

Phase 3: Advanced Features (45-120 minutes)

Priority: Full functionality

  1. Agent Worker (AI automation)
  2. GitHub Worker (integrations)
  3. R2 Buckets (file storage)
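The three phases above can be driven by a small script that redeploys each worker in priority order and gates on its health check before continuing. A sketch, assuming the directory layout from the recovery steps (`packages/cloudflare-workers/<name>`) and a `/health` route per worker:

```bash
# Redeploy a worker, then gate on its health endpoint before moving on.

worker_dir() {
  # Pure helper: strip the "monotask-" prefix to get the package directory.
  echo "packages/cloudflare-workers/${1#monotask-}"
}

restore() {
  ( cd "$(worker_dir "$1")" && bunx wrangler deploy ) || return 1
  curl -sf --max-time 10 "https://$1.workers.dev/health" > /dev/null
}

# Phase 1: core services; stop the phase on the first failure.
for w in monotask-api-gateway monotask-auth-worker; do
  restore "$w" || { echo "$w failed to restore; halting phase"; break; }
done
# Phases 2 and 3 follow the same pattern once Phase 1 is green.
```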

Service Continuity During Recovery

Maintain Available Services

  1. Communicate Partial Availability:

    ⚠️ PARTIAL SERVICE DISRUPTION
    
    Available:
    ✅ API access
    ✅ Task viewing
    ✅ Manual operations
    
    Unavailable:
    ❌ AI agent execution
    ❌ GitHub sync
    ❌ File uploads
    
    ETA: [time]
  2. Redirect Traffic:

    • Route around failed components
    • Use fallback mechanisms
    • Graceful degradation
  3. User Guidance:

    • Status banner in UI
    • API error messages
    • Workaround instructions

Isolated Component Testing

Before declaring recovery complete:

  1. Test Failed Component:

    bash
    # Health check
    curl https://<worker>.workers.dev/health
    
    # Functional test
    # (Component-specific test calls)
  2. Test Integration:

    • Verify component communicates with dependencies
    • Check data flow end-to-end
    • Validate cross-component operations
  3. Monitor Metrics:

    • Error rates normal
    • Response times baseline
    • Resource usage expected

Common Partial Failure Patterns

Pattern 1: Cascading Failure

Symptoms:

  • Initial component failure
  • Dependent components start failing
  • Errors propagating through system

Response:

  1. Identify root cause component
  2. Recover root component first
  3. Then recover dependent components
  4. Implement circuit breakers to prevent future cascades

Pattern 2: Intermittent Failure

Symptoms:

  • Component works sometimes
  • Errors appear randomly
  • No clear pattern

Response:

  1. Check rate limits
  2. Review resource constraints
  3. Monitor for external dependencies
  4. Increase logging/monitoring
  5. Implement retry logic
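For the retry-logic step, a generic wrapper with exponential backoff is often enough at the CLI level. A minimal sketch (attempt counts and delays are illustrative):

```bash
# Retry a command up to N times with exponential backoff (1s, 2s, 4s, ...).
retry() {
  local attempts="$1" delay=1 i=1
  shift
  while ! "$@"; do
    [ "$i" -ge "$attempts" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    i=$((i + 1))
  done
}

# Usage: retry the flaky health check a few times before declaring failure.
# retry 5 curl -sf https://monotask-api-gateway.workers.dev/health
```

Backoff smooths over transient rate-limit or connectivity blips without hammering the component while it recovers.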

Pattern 3: Silent Failure

Symptoms:

  • Component appears healthy
  • But operations not completing
  • No obvious errors

Response:

  1. Check data flow end-to-end
  2. Review queue depths
  3. Verify downstream processing
  4. Check for stuck operations
  5. Review timeout configurations

Recovery Verification Checklist

After component recovery:

  • [ ] Health checks passing
  • [ ] Functional tests passing
  • [ ] Error rates normal
  • [ ] Response times baseline
  • [ ] Dependent components working
  • [ ] End-to-end flows working
  • [ ] User-facing features operational
  • [ ] Monitoring showing green
  • [ ] Logs clean of errors
  • [ ] Backlog processing normally

Post-Recovery Monitoring

Monitor for 24-48 hours after recovery:

  1. Metrics to Watch:

    • Error rates
    • Response times
    • Request volumes
    • Resource usage
  2. Early Warning Signs:

    • Gradual performance degradation
    • Increasing error rates
    • Resource usage trending up
    • User complaints
  3. Alerting:

    • Ensure alerts configured
    • Test alert triggers
    • Verify notification delivery

Prevention Strategies

  1. Circuit Breakers:

    • Prevent cascade failures
    • Automatic degradation
  2. Health Checks:

    • Proactive monitoring
    • Early failure detection
  3. Graceful Degradation:

    • Fallback mechanisms
    • Reduced functionality vs. total failure
  4. Redundancy:

    • Multiple worker instances
    • Backup data stores
    • Alternative execution paths


Revision History

Date       | Version | Changes          | Author
2025-10-26 | 1.0     | Initial playbook | System

MonoKernel MonoTask Documentation