Stage 3 Implementation Plan

Date: October 26, 2025
Status: Ready for Implementation
Priority: High (P1/P2)

Executive Summary

This document outlines the implementation plan for completing Stage 3 of the Cloudflare migration, focusing on integration, monitoring, and operational readiness. Four GitHub issues (#99 through #102) must be closed to reach production-ready status.

GitHub Issues Overview

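| Issue | Title | Priority | Status |
|-------|-------|----------|--------|
| #99 | Implement Queue handlers | P1 | Open |
| #100 | Integrate sandbox provisioning | P2 | Open |
| #101 | Set up monitoring and alerting | P1 | Open |
| #102 | Implement backup and recovery | P1 | Open |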

Current State Analysis

✅ Infrastructure Ready

  • QueueManager Durable Object: Fully implemented with job tracking, retry logic, and stats
  • SandboxLifecycle Durable Object: Complete lifecycle management, timeout handling, cleanup
  • Worker Services: Agent, Task, and GitHub workers deployed with basic queue consumers
  • Database: D1 database configured and accessible from all workers
  • Storage: R2 and KV bindings configured

❌ Gaps Identified

  • Queue Handlers: Basic implementation lacks comprehensive error handling and monitoring
  • Sandbox Integration: SandboxLifecycle exists but not integrated with agent execution flow
  • Monitoring: No metrics collection, alerting, or dashboards configured
  • Backup/Recovery: No automated backup system or recovery procedures

Issue #99: Implement Queue Handlers (P1)

Acceptance Criteria

  • ✅ All handlers implemented
  • ✅ Retry logic configured
  • ✅ Dead letter queues set up
  • ✅ Error handling complete
  • ✅ Performance optimized

Current Implementation Status

Agent Worker Queue Handler (packages/cloudflare-workers/agent-worker/src/index.ts:179-207)

  • ✅ Basic message processing
  • ✅ Simple retry logic (max 3 attempts)
  • ❌ Limited error categorization
  • ❌ No metrics collection
  • ❌ DLQ handling is basic (no full error context)

Task Worker Queue Handler (packages/cloudflare-workers/task-worker/src/index.ts:222-255)

  • ✅ Multiple message types supported
  • ✅ Basic error handling
  • ❌ No batch optimization
  • ❌ Missing queue depth monitoring

GitHub Worker Queue Handler (packages/cloudflare-workers/github-worker/src/index.ts:88-135)

  • ✅ Webhook event routing
  • ✅ Message acknowledgment
  • ❌ No deduplication
  • ❌ No rate limit handling

Implementation Tasks

1. Create Shared Queue Utilities

File: packages/cloudflare-shared/src/queue-utils.ts

```typescript
// Error classification
export enum ErrorSeverity {
  RETRYABLE = 'retryable',     // Transient errors, retry
  FATAL = 'fatal',              // Permanent errors, DLQ
  RATE_LIMITED = 'rate_limited' // Rate limit, backoff
}

export interface QueueError {
  severity: ErrorSeverity;
  message: string;
  retryAfter?: number;
  category: string;
}

// Retry strategies
export interface RetryStrategy {
  maxAttempts: number;
  backoffMs: number[];
  backoffMultiplier: number;
}

// Metrics collection
export interface QueueMetrics {
  messagesProcessed: number;
  messagesSucceeded: number;
  messagesFailed: number;
  avgProcessingTimeMs: number;
  dlqCount: number;
}
```

Tasks:

  • [ ] Implement error classification logic
  • [ ] Create exponential backoff calculator
  • [ ] Build retry strategy evaluator
  • [ ] Add metrics aggregation helpers
  • [ ] Create DLQ error context builder
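
As a starting point for the first three tasks, here is a minimal sketch of error classification, an exponential backoff calculator, and a retry evaluator. The function names, the HTTP-status rule set, and the 60-second cap are assumptions made for illustration; they are not part of the existing codebase. The severity strings mirror the ErrorSeverity values so the block stands alone.

```typescript
// Severity values mirror the ErrorSeverity enum so the sketch stands alone.
type Severity = "retryable" | "fatal" | "rate_limited";

interface ClassifiedError {
  severity: Severity;
  category: string;
  retryAfterMs?: number;
}

// Hypothetical rule set: classify by HTTP status returned from an upstream API.
function classifyHttpStatus(status: number, retryAfterSec?: number): ClassifiedError {
  if (status === 429) {
    return { severity: "rate_limited", category: "rate_limit", retryAfterMs: (retryAfterSec ?? 1) * 1000 };
  }
  if (status === 408 || status >= 500) {
    return { severity: "retryable", category: "upstream" };
  }
  return { severity: "fatal", category: "client_error" };
}

// Exponential backoff: baseMs * multiplier^(attempt - 1), capped at maxMs (assumed cap).
function backoffDelayMs(attempt: number, baseMs = 1000, multiplier = 2, maxMs = 60000): number {
  const exponent = Math.max(0, attempt - 1);
  return Math.min(maxMs, baseMs * Math.pow(multiplier, exponent));
}

// Retry strategy evaluator: retry with a delay, or hand the message to the DLQ.
function nextAction(
  err: ClassifiedError,
  attempt: number,
  maxAttempts: number
): { action: "retry" | "dlq"; delayMs: number } {
  if (err.severity === "fatal" || attempt >= maxAttempts) {
    return { action: "dlq", delayMs: 0 };
  }
  // A server-provided retry hint takes precedence over the computed backoff.
  const delayMs = err.retryAfterMs ?? backoffDelayMs(attempt);
  return { action: "retry", delayMs };
}
```

With the defaults, attempts 1 through 4 yield delays of 1s, 2s, 4s, and 8s. In a queue consumer the resulting delay could be passed to the message's retry call, if the handler uses delayed retries.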

2. Enhance Agent Queue Handler

File: packages/cloudflare-workers/agent-worker/src/index.ts

Tasks:

  • [ ] Add error categorization (classify Claude API errors, timeout errors, validation errors)
  • [ ] Implement exponential backoff (1s, 2s, 4s, 8s pattern)
  • [ ] Add processing time metrics tracking
  • [ ] Enhance DLQ handling with full error context
  • [ ] Add circuit breaker for repeated failures
  • [ ] Implement queue depth alerts
  • [ ] Add success/failure rate tracking
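
The circuit breaker task could look roughly like the sketch below. The class name, the three-state model, and the threshold and cooldown parameters are assumptions; time is passed in explicitly so the logic stays deterministic and easy to test.

```typescript
type BreakerState = "closed" | "open" | "half_open";

// Minimal circuit breaker: opens after N consecutive failures,
// then allows a trial call once the cooldown has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAtMs = 0;
  private state: BreakerState = "closed";
  private failureThreshold: number;
  private cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Returns true when a call may proceed at time nowMs.
  canAttempt(nowMs: number): boolean {
    if (this.state === "open" && nowMs - this.openedAtMs >= this.cooldownMs) {
      this.state = "half_open"; // cooldown over: let one trial call through
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(nowMs: number): void {
    this.failures += 1;
    // A failed trial call re-opens immediately; otherwise wait for the threshold.
    if (this.state === "half_open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAtMs = nowMs;
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

A handler would check canAttempt before calling the Claude API and skip or delay the message while the breaker is open. Because Worker isolates do not share memory, a production version would likely keep this state in a Durable Object.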

3. Enhance Task Queue Handler

File: packages/cloudflare-workers/task-worker/src/index.ts

Tasks:

  • [ ] Add batch processing optimization
  • [ ] Implement message deduplication
  • [ ] Add state transition validation
  • [ ] Track queue depth per message type
  • [ ] Add priority-based processing
  • [ ] Implement graceful degradation
  • [ ] Add validation result processing
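
Deduplication, priority ordering, and per-type depth tracking can be prototyped as pure functions before wiring them into the handler. The TaskMessage shape below is hypothetical (the plan does not define the message schema), and a lower priority number is assumed to mean more urgent.

```typescript
// Hypothetical message shape; lower priority number = more urgent (assumption).
interface TaskMessage {
  id: string;
  type: string;
  priority: number;
}

// Drop duplicate IDs (keeping the first occurrence), then order by priority.
function prepareBatch(messages: TaskMessage[]): TaskMessage[] {
  const seen = new Set<string>();
  const unique: TaskMessage[] = [];
  for (const m of messages) {
    if (seen.has(m.id)) continue;
    seen.add(m.id);
    unique.push(m);
  }
  // Array.prototype.sort is stable, so equal priorities keep arrival order.
  return unique.sort((a, b) => a.priority - b.priority);
}

// Count messages per type, as a simple per-type queue depth snapshot.
function depthByType(messages: TaskMessage[]): Record<string, number> {
  const depth: Record<string, number> = {};
  for (const m of messages) {
    depth[m.type] = (depth[m.type] ?? 0) + 1;
  }
  return depth;
}
```

This only deduplicates within a single batch. Deduplicating across batches would need shared state such as KV or a Durable Object.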

4. Enhance GitHub Queue Handler

File: packages/cloudflare-workers/github-worker/src/index.ts

Tasks:

  • [ ] Implement webhook event deduplication (by delivery ID)
  • [ ] Add GitHub API rate limit handling
  • [ ] Implement event prioritization (critical events first)
  • [ ] Add sync operation progress tracking
  • [ ] Handle webhook replay scenarios
  • [ ] Add API error retry logic
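
The deduplication and rate limit tasks above can be sketched as follows. The in-memory Map is a stand-in for whatever shared store the worker ends up using (an assumption), and both the class and function names are illustrative. The header names are the ones GitHub's REST API documents for rate limiting; the delivery ID would come from the webhook's X-GitHub-Delivery header.

```typescript
// In-memory stand-in for a shared "seen deliveries" store (assumption).
class DeliveryDeduplicator {
  private seen = new Map<string, number>(); // deliveryId -> expiry time (ms)
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns true the first time a delivery ID is seen within the TTL window.
  isFirstDelivery(deliveryId: string, nowMs: number): boolean {
    const expiry = this.seen.get(deliveryId);
    if (expiry !== undefined && expiry > nowMs) return false;
    this.seen.set(deliveryId, nowMs + this.ttlMs);
    return true;
  }
}

// How long to wait before calling the GitHub API again, based on response headers.
// Header keys are expected in lowercase.
function rateLimitDelayMs(headers: Record<string, string>, nowMs: number): number {
  // retry-after is expressed in seconds and takes precedence when present.
  const retryAfterSec = Number(headers["retry-after"]);
  if (Number.isFinite(retryAfterSec) && retryAfterSec > 0) {
    return retryAfterSec * 1000;
  }
  // x-ratelimit-reset is a UTC epoch time in seconds.
  if (headers["x-ratelimit-remaining"] === "0") {
    const resetMs = Number(headers["x-ratelimit-reset"]) * 1000;
    if (Number.isFinite(resetMs)) return Math.max(0, resetMs - nowMs);
  }
  return 0; // no rate limit signal
}
```

A replayed webhook carries the same delivery ID as the original, so the same check also covers the replay scenario within the TTL window.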

Issue #100: Integrate Sandbox Provisioning (P2)

Acceptance Criteria

  • ✅ Provisioning automated
  • ✅ Resource limits enforced
  • ✅ Cleanup automated
  • ✅ Security hardened
  • ✅ Monitoring enabled

Current Implementation Status

SandboxLifecycle DO (packages/cloudflare-workers/agent-worker/src/durable-objects/SandboxLifecycle.ts)

  • ✅ Complete CRUD operations
  • ✅ Status tracking
  • ✅ Timeout management
  • ✅ Log collection
  • ❌ Not integrated with agent execution

Implementation Tasks

1. Create Execution Environment Wrapper

File: packages/cloudflare-workers/agent-worker/src/sandbox/execution-env.ts

Tasks:

  • [ ] Define resource limits configuration
  • [ ] Implement sandbox isolation wrapper
  • [ ] Add environment variable injection
  • [ ] Create security context setup
  • [ ] Add stdout/stderr capture
  • [ ] Implement timeout enforcement
  • [ ] Add resource usage tracking
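
A possible shape for the resource limits configuration and the timeout check is sketched below. The plan does not specify concrete limits, so every number here is a placeholder, and the names are hypothetical.

```typescript
interface SandboxLimits {
  cpuMs: number;
  memoryMb: number;
  timeoutMs: number;
}

// Placeholder values: the plan does not specify concrete limits.
const DEFAULT_LIMITS: SandboxLimits = { cpuMs: 30000, memoryMb: 128, timeoutMs: 60000 };
const MAX_LIMITS: SandboxLimits = { cpuMs: 300000, memoryMb: 512, timeoutMs: 600000 };

// Merge requested limits over the defaults and clamp each one to its ceiling.
function resolveLimits(requested: Partial<SandboxLimits>): SandboxLimits {
  const keys: (keyof SandboxLimits)[] = ["cpuMs", "memoryMb", "timeoutMs"];
  const resolved: SandboxLimits = { ...DEFAULT_LIMITS };
  for (const key of keys) {
    const value = requested[key];
    if (value === undefined) continue;
    if (!Number.isFinite(value) || value <= 0) {
      throw new RangeError(`${key} must be a positive number`);
    }
    resolved[key] = Math.min(value, MAX_LIMITS[key]);
  }
  return resolved;
}

// Timeout enforcement check: has the sandbox used up its time budget?
function hasTimedOut(startedAtMs: number, nowMs: number, limits: SandboxLimits): boolean {
  return nowMs - startedAtMs >= limits.timeoutMs;
}
```

Clamping rather than rejecting oversized requests is a design choice; rejecting them outright would surface misconfigured agents earlier.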

2. Create Sandbox Integration Service

File: packages/cloudflare-workers/agent-worker/src/sandbox/integration.ts

Tasks:

  • [ ] Create sandbox provisioning helper
  • [ ] Implement agent-to-sandbox binding
  • [ ] Add log streaming integration
  • [ ] Create cleanup scheduler
  • [ ] Add failure recovery logic
  • [ ] Implement sandbox pooling (optional optimization)

3. Integrate with Agent Execution Flow

File: packages/cloudflare-workers/agent-worker/src/index.ts

Tasks:

  • [ ] Add sandbox provisioning before agent execution
  • [ ] Wrap agent execution in sandbox context
  • [ ] Capture and store sandbox logs
  • [ ] Handle sandbox timeout failures
  • [ ] Trigger cleanup on completion
  • [ ] Add sandbox metrics to agent execution response

4. Implement Resource Cleanup

File: packages/cloudflare-workers/agent-worker/src/sandbox/cleanup.ts

Tasks:

  • [ ] Create scheduled cleanup task (runs every 5 minutes)
  • [ ] Implement orphaned sandbox detection
  • [ ] Add forced termination for stuck sandboxes
  • [ ] Create cleanup metrics dashboard
  • [ ] Add cleanup failure alerting
  • [ ] Implement graceful shutdown handling
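
Orphan detection and forced termination reduce to a decision over the current sandbox records, which the scheduled task can then act on. The record shape and status names below are assumptions; the real SandboxLifecycle state may differ.

```typescript
// Hypothetical record shape; the real SandboxLifecycle state may differ.
interface SandboxRecord {
  id: string;
  status: "provisioning" | "running" | "completed" | "failed";
  startedAtMs: number;
  timeoutMs: number;
}

interface CleanupPlan {
  forceTerminate: string[]; // still active but past timeout plus grace period
  remove: string[]; // finished sandboxes whose resources can be released
}

// Decide what a cleanup pass should do, without performing any side effects.
function planCleanup(records: SandboxRecord[], nowMs: number, graceMs: number): CleanupPlan {
  const plan: CleanupPlan = { forceTerminate: [], remove: [] };
  for (const r of records) {
    const active = r.status === "provisioning" || r.status === "running";
    if (!active) {
      plan.remove.push(r.id);
    } else if (nowMs - r.startedAtMs > r.timeoutMs + graceMs) {
      plan.forceTerminate.push(r.id);
    }
  }
  return plan;
}
```

Keeping the decision separate from the termination calls makes the pass easy to unit test and to dry-run. One way to run it every 5 minutes would be a Worker scheduled handler driven by a cron trigger.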

5. Add Sandbox Monitoring

File: packages/cloudflare-workers/agent-worker/src/sandbox/monitoring.ts

Tasks:

  • [ ] Track active sandbox count
  • [ ] Monitor resource utilization (CPU, memory)
  • [ ] Track failure rates by agent type
  • [ ] Monitor timeout frequency
  • [ ] Add sandbox lifecycle duration metrics
  • [ ] Create sandbox health endpoint

Issue #101: Set Up Monitoring and Alerting (P1)

Acceptance Criteria

  • ✅ Dashboards created
  • ✅ Alert rules configured
  • ✅ Logging pipeline set up
  • ✅ SLIs/SLOs defined
  • ✅ Runbooks created

Implementation Tasks

1. Configure Cloudflare Analytics

Files: All packages/cloudflare-workers/*/wrangler.toml

Tasks:

  • [ ] Enable Workers Analytics Engine in all wrangler.toml files
  • [ ] Add analytics_engine_datasets configuration
  • [ ] Configure custom metrics collection points
  • [ ] Set up logpush to R2 bucket for log retention
  • [ ] Configure log sampling rates
  • [ ] Add structured logging format
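
For the custom metrics collection points, a thin helper around the Analytics Engine binding keeps the data point layout consistent across workers. The interfaces below are local mirrors of the binding's data point shape (indexes, blobs, doubles) so the sketch compiles on its own. The field layout and the helper name are assumptions.

```typescript
// Local mirror of the data point shape accepted by an Analytics Engine binding.
interface MetricDataPoint {
  indexes?: string[];
  blobs?: string[];
  doubles?: number[];
}

interface MetricsDataset {
  writeDataPoint(point: MetricDataPoint): void;
}

type Outcome = "success" | "failure" | "dlq";

// Assumed layout: index = worker name, blobs = [queue, outcome],
// doubles = [durationMs, count].
function recordQueueMetric(
  dataset: MetricsDataset,
  worker: string,
  queue: string,
  outcome: Outcome,
  durationMs: number
): void {
  dataset.writeDataPoint({
    indexes: [worker],
    blobs: [queue, outcome],
    doubles: [durationMs, 1],
  });
}
```

In a worker, the dataset argument would be the binding declared under analytics_engine_datasets in wrangler.toml. Fixing the positions of blobs and doubles in one helper matters because queries address those fields by position.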

2. Implement Error Alerting

File: packages/cloudflare-workers/monitoring/error-alerter.ts

Tasks:

  • [ ] Create error categorization system (CRITICAL, WARNING, INFO)
  • [ ] Implement alert routing logic
  • [ ] Add email notification integration
  • [ ] Add Slack webhook integration
  • [ ] Add PagerDuty integration (optional)
  • [ ] Configure alert thresholds per worker
  • [ ] Implement alert deduplication
  • [ ] Add alert acknowledgment tracking
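
Alert routing and deduplication can be combined in one small component, sketched below. The routing table (which severity goes to which channel) and the suppression window are assumptions for illustration; the actual thresholds per worker are still to be defined.

```typescript
type AlertLevel = "CRITICAL" | "WARNING" | "INFO";
type Channel = "pagerduty" | "slack" | "email";

// Assumed routing table; adjust to the team's on-call setup.
const ROUTES: Record<AlertLevel, Channel[]> = {
  CRITICAL: ["pagerduty", "slack", "email"],
  WARNING: ["slack"],
  INFO: [],
};

class AlertRouter {
  private lastSent = new Map<string, number>(); // alert key -> last send time (ms)
  private windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Returns the channels to notify, or [] when the alert is a recent duplicate.
  route(level: AlertLevel, worker: string, errorType: string, nowMs: number): Channel[] {
    const key = `${level}:${worker}:${errorType}`;
    const last = this.lastSent.get(key);
    if (last !== undefined && nowMs - last < this.windowMs) return [];
    this.lastSent.set(key, nowMs);
    return ROUTES[level];
  }
}
```

Keying on level, worker, and error type means a new error type on the same worker still alerts immediately, while repeats of the same failure are suppressed for the window.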

File: packages/cloudflare-workers/monitoring/middleware/error-tracker.ts

Tasks:

  • [ ] Create error tracking middleware
  • [ ] Add to all worker fetch handlers
  • [ ] Capture stack traces and context
  • [ ] Track error frequency by type
  • [ ] Implement error rate calculation

3. Add Performance Monitoring

File: packages/cloudflare-workers/monitoring/performance-tracker.ts

Tasks:

  • [ ] Implement request duration tracking
  • [ ] Add queue processing time metrics
  • [ ] Track D1 query performance
  • [ ] Monitor external API latency (Claude API, GitHub API)
  • [ ] Calculate P50, P95, P99 percentiles
  • [ ] Implement performance sampling (10% of requests)
  • [ ] Add performance budget alerts
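
Percentile calculation and request sampling are small enough to sketch directly. This version uses the nearest-rank method over an in-memory sample array, which is an assumption; a production tracker might instead compute quantiles in the analytics query layer. The function names are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. p is in the range (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] as number;
}

// Convenience wrapper for the three percentiles named in the task list.
function latencySummary(samples: number[]): { p50: number; p95: number; p99: number } {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
  };
}

// Sampling decision: rate 0.1 keeps roughly 10% of requests.
// The random value is injectable so the decision can be tested.
function shouldSample(rate: number, random: number = Math.random()): boolean {
  return random < rate;
}
```

For example, with the durations 1 through 100 ms as samples, the summary reports a P50 of 50, a P95 of 95, and a P99 of 99.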

File: packages/cloudflare-workers/monitoring/middleware/performance-middleware.ts

Tasks:

  • [ ] Create performance tracking middleware
  • [ ] Add to all worker fetch handlers
  • [ ] Capture request/response timing
  • [ ] Track resource usage
  • [ ] Log slow queries

4. Create Dashboards

File: monitoring/cloudflare-dashboard.json

Dashboard Sections:

  • [ ] Worker Health:

    • Uptime percentage
    • Error rates by worker
    • Active request count
    • Request rate (req/sec)
  • [ ] Queue Metrics:

    • Queue depth by queue
    • Messages processed/sec
    • Processing time distribution
    • DLQ message count
    • Retry rate
  • [ ] Performance Metrics:

    • P50/P95/P99 latency by endpoint
    • Database query time
    • External API latency
    • Cache hit rates
  • [ ] Resource Usage:

    • CPU time by worker
    • Memory usage
    • D1 query count
    • R2 operation count
    • KV operation count
  • [ ] Sandbox Metrics:

    • Active sandboxes
    • Sandbox provision time
    • Timeout frequency
    • Resource utilization

5. Define SLIs/SLOs

File: monitoring/slos.yaml

Tasks:

  • [ ] Define API endpoint availability SLO (99.9% uptime)
  • [ ] Set P95 response time targets:
    • API Gateway: < 200ms
    • Task operations: < 500ms
    • Agent execution: < 30s
  • [ ] Define queue processing latency SLO (P95 < 5s)
  • [ ] Set error rate threshold (< 1% of requests)
  • [ ] Define D1 query performance targets
  • [ ] Configure SLO violation alerts
  • [ ] Create SLO compliance reports
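
The SLO figures above translate into concrete budgets with a little arithmetic, sketched below. The latency targets are copied from the list above; the 30-day reporting window and the function names are assumptions.

```typescript
// P95 targets in milliseconds, taken from the SLO task list.
const P95_TARGETS_MS = {
  apiGateway: 200,
  taskOperations: 500,
  agentExecution: 30000,
  queueProcessing: 5000,
};

// Downtime allowed by an availability SLO over a reporting window.
function allowedDowntimeMinutes(availabilityPercent: number, windowDays: number): number {
  return windowDays * 24 * 60 * (1 - availabilityPercent / 100);
}

// Fraction of the error budget still unspent (1 = untouched, 0 or below = exhausted).
function errorBudgetRemaining(totalRequests: number, failedRequests: number, maxErrorRate: number): number {
  const allowedFailures = totalRequests * maxErrorRate;
  if (allowedFailures === 0) return failedRequests === 0 ? 1 : 0;
  return 1 - failedRequests / allowedFailures;
}

// The targets are strict upper bounds ("< 200ms"), so equality counts as a miss.
function meetsP95Target(service: keyof typeof P95_TARGETS_MS, observedP95Ms: number): boolean {
  return observedP95Ms < P95_TARGETS_MS[service];
}
```

Under these assumptions, 99.9% availability over 30 days allows about 43.2 minutes of downtime, and 25 failures out of 10,000 requests against a 1% threshold leaves 75% of the error budget.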

6. Create Runbooks

Directory: monitoring/runbooks/

Runbooks to Create:

  • [ ] high-error-rate.md:

    • Symptoms and detection
    • Investigation steps
    • Common causes
    • Resolution procedures
    • Escalation path
  • [ ] queue-backup.md:

    • Queue congestion detection
    • Impact assessment
    • Mitigation steps
    • Worker scaling
    • DLQ processing
  • [ ] database-slow.md:

    • Slow query identification
    • Index analysis
    • Query optimization
    • D1 capacity scaling
    • Temporary mitigations
  • [ ] worker-timeout.md:

    • Timeout cause analysis
    • Code profiling
    • External dependency checks
    • Timeout limit adjustment
    • Performance optimization
  • [ ] sandbox-stuck.md:

    • Stuck sandbox detection
    • Manual cleanup procedure
    • Root cause analysis
    • Prevention measures
    • Monitoring improvements

Issue #102: Implement Backup and Recovery (P1)

Acceptance Criteria

  • ✅ Automated backups scheduled
  • ✅ Recovery procedures tested
  • ✅ RTO/RPO targets met
  • ✅ Documentation complete
  • ✅ Disaster recovery plan approved

RTO/RPO Targets

  • D1 Database: RTO 1 hour, RPO 24 hours
  • KV Namespaces: RTO 30 minutes, RPO 1 hour
  • R2 Artifacts: RTO 2 hours, RPO 24 hours

Implementation Tasks

1. D1 Database Backup

File: scripts/backup/d1-backup.ts

Tasks:

  • [ ] Implement D1 export using wrangler CLI
  • [ ] Upload backup to R2 bucket with timestamp
  • [ ] Implement retention policy:
    • Daily backups: 30 days
    • Weekly backups: 90 days
    • Monthly backups: 1 year
  • [ ] Add backup integrity verification
  • [ ] Generate backup manifest file
  • [ ] Add backup size tracking
  • [ ] Implement backup encryption
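
The retention policy above can be expressed as a small pure function that a pruning step calls for each stored backup. The retention periods come from the list above (with one year taken as 365 days); the R2 key layout and function names are assumptions. The export step itself, for example through the wrangler CLI, is outside this sketch.

```typescript
type BackupTier = "daily" | "weekly" | "monthly";

// Retention periods from the policy above; one year is taken as 365 days.
const RETENTION_DAYS: Record<BackupTier, number> = {
  daily: 30,
  weekly: 90,
  monthly: 365,
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when a backup is older than its tier's retention period.
function isExpired(createdAt: Date, tier: BackupTier, now: Date): boolean {
  const ageDays = (now.getTime() - createdAt.getTime()) / MS_PER_DAY;
  return ageDays > RETENTION_DAYS[tier];
}

// Hypothetical key layout: d1/<database>/<tier>/<UTC timestamp>.sql
function backupKey(databaseName: string, createdAt: Date, tier: BackupTier): string {
  // Replace ":" and "." so the timestamp is safe in file names and URLs.
  const stamp = createdAt.toISOString().replace(/[:.]/g, "-");
  return `d1/${databaseName}/${tier}/${stamp}.sql`;
}
```

Putting the tier in the key lets a pruning job list one prefix at a time and apply a single retention period to everything under it.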

File: .github/workflows/d1-backup.yml

Tasks:

  • [ ] Create GitHub Action workflow
  • [ ] Schedule daily backups (2 AM UTC)
  • [ ] Add backup success/failure notifications
  • [ ] Store backup metadata
  • [ ] Monitor backup duration

2. KV/R2 Data Backup

File: scripts/backup/kv-backup.ts

Tasks:

  • [ ] Export all KV namespaces to R2
  • [ ] Include key metadata (expiration, etc.)
  • [ ] Generate KV backup manifest
  • [ ] Implement incremental backups
  • [ ] Add backup verification

File: scripts/backup/r2-backup.ts

Tasks:

  • [ ] Configure cross-region replication for critical buckets
  • [ ] Set up secondary backup bucket
  • [ ] Implement object versioning
  • [ ] Add lifecycle policies
  • [ ] Monitor replication lag

3. Recovery Procedures

File: scripts/recovery/d1-restore.ts

Tasks:

  • [ ] Implement database drop and recreate
  • [ ] Import from backup file
  • [ ] Validate data integrity after restore
  • [ ] Generate restoration report
  • [ ] Add rollback capability
  • [ ] Test on staging environment

File: scripts/recovery/kv-restore.ts

Tasks:

  • [ ] Implement namespace clearing
  • [ ] Bulk key restoration
  • [ ] Individual key restoration option
  • [ ] Validate restored data
  • [ ] Handle key conflicts

File: scripts/recovery/disaster-recovery.ts

Tasks:

  • [ ] Orchestrate full system restoration
  • [ ] Restore D1, KV, and R2 in correct order
  • [ ] Verify service health after each step
  • [ ] Run smoke tests
  • [ ] Generate recovery timeline report
  • [ ] Document actual RTO achieved

4. Recovery Playbooks

Directory: docs/recovery-playbooks/

Playbooks to Create:

  • [ ] d1-recovery.md:

    • When to use
    • Prerequisites
    • Step-by-step restoration
    • Verification steps
    • Common issues and solutions
    • RTO/RPO expectations
  • [ ] worker-rollback.md:

    • Rollback triggers
    • Version identification
    • Rollback procedure
    • Traffic switching
    • Verification
    • Communication plan
  • [ ] data-corruption.md:

    • Corruption detection
    • Impact assessment
    • Point-in-time recovery
    • Data validation
    • Preventing recurrence
  • [ ] partial-failure.md:

    • Component failure identification
    • Isolated component recovery
    • Service continuity
    • Gradual restoration
    • Full system validation

5. Automated Recovery Testing

File: scripts/backup/test-recovery.ts

Tasks:

  • [ ] Create staging environment for tests
  • [ ] Automate backup creation
  • [ ] Perform automated restore
  • [ ] Run data validation tests
  • [ ] Measure actual RTO
  • [ ] Compare against target RTO/RPO
  • [ ] Generate test report
  • [ ] Schedule monthly recovery drills
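
Comparing measured results against the targets can be automated with a small evaluator like the one below. The target values are taken from the RTO/RPO Targets section of this plan; the timestamp fields and function name are assumptions. RTO is measured here from incident to completed restore, and RPO from last good backup to incident.

```typescript
type Component = "d1" | "kv" | "r2";

interface RecoveryTarget {
  rtoMinutes: number;
  rpoMinutes: number;
}

// Targets from the RTO/RPO Targets section of this plan.
const TARGETS: Record<Component, RecoveryTarget> = {
  d1: { rtoMinutes: 60, rpoMinutes: 24 * 60 },
  kv: { rtoMinutes: 30, rpoMinutes: 60 },
  r2: { rtoMinutes: 120, rpoMinutes: 24 * 60 },
};

// Assumed drill measurements, all in milliseconds since epoch.
interface DrillTimestamps {
  lastBackupAtMs: number;
  incidentAtMs: number;
  restoredAtMs: number;
}

interface DrillResult {
  actualRtoMinutes: number;
  actualRpoMinutes: number;
  rtoMet: boolean;
  rpoMet: boolean;
}

function evaluateDrill(component: Component, t: DrillTimestamps): DrillResult {
  const actualRtoMinutes = (t.restoredAtMs - t.incidentAtMs) / 60000;
  const actualRpoMinutes = (t.incidentAtMs - t.lastBackupAtMs) / 60000;
  const target = TARGETS[component];
  return {
    actualRtoMinutes,
    actualRpoMinutes,
    rtoMet: actualRtoMinutes <= target.rtoMinutes,
    rpoMet: actualRpoMinutes <= target.rpoMinutes,
  };
}
```

The result object can feed the test report directly, with one entry per component per drill.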

Implementation Timeline

Week 1: Monitoring and Queue Enhancement

Days 1-2: Monitoring Foundation

  • [ ] Configure Cloudflare Analytics in all workers
  • [ ] Implement error tracking middleware
  • [ ] Set up basic alerting
  • [ ] Create initial dashboard

Days 3-4: Queue Enhancement

  • [ ] Create shared queue utilities
  • [ ] Enhance agent queue handler
  • [ ] Enhance task queue handler
  • [ ] Enhance GitHub queue handler
  • [ ] Add queue metrics collection

Week 2: Sandbox and Backup

Days 5-6: Sandbox Integration

  • [ ] Create execution environment wrapper
  • [ ] Build sandbox integration service
  • [ ] Integrate with agent execution flow
  • [ ] Implement resource cleanup
  • [ ] Add sandbox monitoring

Days 7-8: Backup & Recovery

  • [ ] Implement D1 backup script
  • [ ] Create KV/R2 backup scripts
  • [ ] Build recovery procedures
  • [ ] Write recovery playbooks
  • [ ] Create GitHub Actions workflows

Days 9-10: Testing and Documentation

  • [ ] Run full backup/restore tests
  • [ ] Measure actual RTO/RPO
  • [ ] Complete all runbooks
  • [ ] Finalize SLI/SLO definitions
  • [ ] Update system documentation
  • [ ] Close all GitHub issues

Testing Strategy

Queue Handler Testing

File: packages/cloudflare-shared/tests/queue-handlers.test.ts

  • [ ] Unit tests for error classification
  • [ ] Retry logic tests
  • [ ] DLQ processing tests
  • [ ] Load tests (1000+ messages/sec)
  • [ ] Failure scenario tests

Sandbox Integration Testing

File: packages/e2e/tests/sandbox-integration.spec.ts

  • [ ] E2E sandbox lifecycle test
  • [ ] Resource limit enforcement test
  • [ ] Timeout handling test
  • [ ] Cleanup automation test
  • [ ] Concurrent sandbox test

Monitoring Testing

  • [ ] Verify all metrics are collected
  • [ ] Trigger test alerts
  • [ ] Validate dashboard accuracy
  • [ ] Test alert notification delivery
  • [ ] Verify SLO calculations

Backup/Recovery Testing

File: scripts/backup/__tests__/recovery.test.ts

  • [ ] Automated backup validation
  • [ ] Full restore to staging
  • [ ] Partial restore test
  • [ ] Corruption recovery test
  • [ ] Measure actual RTO/RPO

Success Criteria Checklist

Issue #99: Queue Handlers ✓

  • [ ] All queue handlers have comprehensive error handling
  • [ ] Exponential backoff implemented
  • [ ] Dead letter queues configured with error context
  • [ ] Queue metrics collected and displayed
  • [ ] Performance optimizations applied
  • [ ] Load tested at 1000+ messages/sec

Issue #100: Sandbox Provisioning ✓

  • [ ] Sandbox provisioning integrated with agent execution
  • [ ] Resource limits enforced (CPU, memory, timeout)
  • [ ] Automated cleanup running every 5 minutes
  • [ ] Security hardened with isolation
  • [ ] Monitoring dashboard shows sandbox metrics
  • [ ] E2E tests passing

Issue #101: Monitoring and Alerting ✓

  • [ ] Cloudflare Analytics configured in all workers
  • [ ] Error alerting functional with notifications
  • [ ] Performance monitoring tracking P95/P99
  • [ ] Dashboard created with all key metrics
  • [ ] SLIs/SLOs defined and monitored
  • [ ] All 5 runbooks completed

Issue #102: Backup and Recovery ✓

  • [ ] D1 automated backups running daily
  • [ ] KV/R2 backups configured
  • [ ] All recovery scripts tested
  • [ ] Recovery playbooks written
  • [ ] RTO/RPO targets validated
  • [ ] Monthly recovery drills scheduled

Overall Completion ✓

  • [ ] All 4 Stage 3 issues closed
  • [ ] System ready for Stage 4 deployment
  • [ ] Documentation complete
  • [ ] Team trained on operations

Risk Mitigation

High Risk Areas

  1. Queue Processing Under Load

    • Mitigation: Load testing with 2x expected capacity
    • Fallback: Circuit breaker to prevent cascading failures
  2. Backup Window Exceeds RTO

    • Mitigation: Parallel backup processes
    • Fallback: Incremental backup strategy
  3. Alert Fatigue from False Positives

    • Mitigation: Careful threshold tuning
    • Fallback: Alert aggregation and smart routing
  4. Sandbox Resource Exhaustion

    • Mitigation: Strict resource limits and quotas
    • Fallback: Automatic sandbox pool scaling

Dependencies

External Services

  • Cloudflare Workers (runtime)
  • Cloudflare D1 (database)
  • Cloudflare R2 (object storage)
  • Cloudflare KV (key-value store)
  • Cloudflare Analytics Engine
  • GitHub Actions (automation)

Internal Packages

  • @monotask/cloudflare-shared (utilities)
  • @monotask/core (business logic)
  • @monotask/shared (types and constants)

Post-Implementation

Monitoring and Maintenance

  • Daily review of error logs
  • Weekly SLO compliance review
  • Monthly backup recovery drill
  • Quarterly runbook updates

Documentation Updates

  • Update CLAUDE.md with new operational procedures
  • Document monitoring dashboard usage
  • Create video walkthroughs for runbooks
  • Update team wiki with recovery procedures

Next Steps (Stage 4)

After completing Stage 3:

  1. Configure production DNS and routes (#105)
  2. Perform data migration (#104)
  3. Execute staged rollout (#103)

Document Version: 1.0
Last Updated: October 26, 2025
Owner: Development Team
Reviewers: DevOps, SRE
