Stage 3 Implementation Plan

Date: October 26, 2025
Status: Ready for Implementation
Priority: High (P1/P2)

Executive Summary

This document outlines the implementation plan for completing Stage 3 of the Cloudflare migration, focusing on integration, monitoring, and operational readiness. Four GitHub issues need to be addressed to achieve production-ready status.

GitHub Issues Overview

Issue	Title	Priority	Status
#99	Implement Queue handlers	P1	Open
#100	Integrate sandbox provisioning	P2	Open
#101	Set up monitoring and alerting	P1	Open
#102	Implement backup and recovery	P1	Open

Current State Analysis

✅ Infrastructure Ready

QueueManager Durable Object: Fully implemented with job tracking, retry logic, and stats
SandboxLifecycle Durable Object: Complete lifecycle management, timeout handling, cleanup
Worker Services: Agent, Task, and GitHub workers deployed with basic queue consumers
Database: D1 database configured and accessible from all workers
Storage: R2 and KV bindings configured

❌ Gaps Identified

Queue Handlers: Basic implementation lacks comprehensive error handling and monitoring
Sandbox Integration: SandboxLifecycle exists but not integrated with agent execution flow
Monitoring: No metrics collection, alerting, or dashboards configured
Backup/Recovery: No automated backup system or recovery procedures

Issue #99: Implement Queue Handlers (P1)

Acceptance Criteria

✅ All handlers implemented
✅ Retry logic configured
✅ Dead letter queues set up
✅ Error handling complete
✅ Performance optimized

Current Implementation Status

Agent Worker Queue Handler (packages/cloudflare-workers/agent-worker/src/index.ts:179-207)

✅ Basic message processing
✅ Simple retry logic (max 3 attempts)
❌ Limited error categorization
❌ No metrics collection
❌ Basic DLQ handling

Task Worker Queue Handler (packages/cloudflare-workers/task-worker/src/index.ts:222-255)

✅ Multiple message types supported
✅ Basic error handling
❌ No batch optimization
❌ Missing queue depth monitoring

GitHub Worker Queue Handler (packages/cloudflare-workers/github-worker/src/index.ts:88-135)

✅ Webhook event routing
✅ Message acknowledgment
❌ No deduplication
❌ No rate limit handling

Implementation Tasks

1. Create Shared Queue Utilities

File: packages/cloudflare-shared/src/queue-utils.ts

typescript

// Error classification
export enum ErrorSeverity {
  RETRYABLE = 'retryable',     // Transient errors, retry
  FATAL = 'fatal',              // Permanent errors, DLQ
  RATE_LIMITED = 'rate_limited' // Rate limit, backoff
}

export interface QueueError {
  severity: ErrorSeverity;
  message: string;
  retryAfter?: number;
  category: string;
}

// Retry strategies
export interface RetryStrategy {
  maxAttempts: number;
  backoffMs: number[];
  backoffMultiplier: number;
}

// Metrics collection
export interface QueueMetrics {
  messagesProcessed: number;
  messagesSucceeded: number;
  messagesFailed: number;
  avgProcessingTimeMs: number;
  dlqCount: number;
}

Tasks:

[ ] Implement error classification logic
[ ] Create exponential backoff calculator
[ ] Build retry strategy evaluator
[ ] Add metrics aggregation helpers
[ ] Create DLQ error context builder
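
The backoff calculator and retry evaluator from the task list could look roughly like the sketch below. It is illustrative only: `computeBackoffMs`, `classifyStatus`, and `shouldRetry` are hypothetical names, the HTTP status mapping is an assumption, and the severity values are written as string literals that mirror the `ErrorSeverity` enum so the block stands alone.

```typescript
// RetryStrategy is repeated from the interfaces above so this sketch runs on its own.
interface RetryStrategy {
  maxAttempts: number;
  backoffMs: number[];
  backoffMultiplier: number;
}

// String-literal mirror of the ErrorSeverity enum values.
type Severity = 'retryable' | 'fatal' | 'rate_limited';

// Hypothetical helper: delay before retry number `attempt` (1-based).
// Uses the explicit table while entries remain, then multiplies the last
// entry by backoffMultiplier once per attempt beyond the table.
export function computeBackoffMs(strategy: RetryStrategy, attempt: number): number {
  const table = strategy.backoffMs;
  const clamped = Math.min(Math.max(attempt, 1), table.length);
  const base = table[clamped - 1] ?? 0; // empty table yields no delay
  const overflow = Math.max(attempt - table.length, 0);
  return base * Math.pow(strategy.backoffMultiplier, overflow);
}

// Hypothetical classifier keyed on an HTTP status code (assumed mapping).
export function classifyStatus(status: number): Severity {
  if (status === 429) return 'rate_limited';
  if (status === 408 || status >= 500) return 'retryable';
  return 'fatal';
}

// Retry only non-fatal errors while attempts remain.
export function shouldRetry(strategy: RetryStrategy, attempt: number, severity: Severity): boolean {
  return severity !== 'fatal' && attempt < strategy.maxAttempts;
}
```

With `backoffMs: [1000, 2000, 4000, 8000]` this reproduces the 1s, 2s, 4s, 8s pattern planned for the agent handler; the multiplier only applies once the table is exhausted.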

2. Enhance Agent Queue Handler

File: packages/cloudflare-workers/agent-worker/src/index.ts

Tasks:

[ ] Add error categorization (classify Claude API errors, timeout errors, validation errors)
[ ] Implement exponential backoff (1s, 2s, 4s, 8s pattern)
[ ] Add processing time metrics tracking
[ ] Enhance DLQ handling with full error context
[ ] Add circuit breaker for repeated failures
[ ] Implement queue depth alerts
[ ] Add success/failure rate tracking
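
A minimal sketch of the circuit breaker task, assuming a simple consecutive-failure threshold and a fixed cooldown. The class name, thresholds, and the three-state model are assumptions, not the existing agent-worker code.

```typescript
type BreakerState = 'closed' | 'open' | 'half_open';

// Hypothetical circuit breaker: opens after `failureThreshold` consecutive
// failures, then allows trial calls again once `cooldownMs` has elapsed.
export class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // True when a call may proceed at time `now` (milliseconds).
  canAttempt(now: number): boolean {
    if (this.state === 'open' && now - this.openedAt >= this.cooldownMs) {
      this.state = 'half_open'; // calls allowed until a result is recorded
    }
    return this.state !== 'open';
  }

  // Any success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  // A failure during half_open, or reaching the threshold, opens the breaker.
  recordFailure(now: number): void {
    this.failures += 1;
    if (this.state === 'half_open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = now;
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

In-memory state like this is not shared between Worker isolates, so a real deployment would probably keep the counters in a Durable Object such as QueueManager.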

3. Enhance Task Queue Handler

File: packages/cloudflare-workers/task-worker/src/index.ts

Tasks:

[ ] Add batch processing optimization
[ ] Implement message deduplication
[ ] Add state transition validation
[ ] Track queue depth per message type
[ ] Add priority-based processing
[ ] Implement graceful degradation
[ ] Add validation result processing

4. Enhance GitHub Queue Handler

File: packages/cloudflare-workers/github-worker/src/index.ts

Tasks:

[ ] Implement webhook event deduplication (by delivery ID)
[ ] Add GitHub API rate limit handling
[ ] Implement event prioritization (critical events first)
[ ] Add sync operation progress tracking
[ ] Handle webhook replay scenarios
[ ] Add API error retry logic
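
Deduplication by delivery ID could be sketched as follows. GitHub sends a unique ID per delivery in the `X-GitHub-Delivery` header; the class name, the TTL window, and the in-memory map are assumptions made to keep the example standalone.

```typescript
// Hypothetical deduplicator: remembers delivery IDs for `ttlMs` milliseconds.
export class DeliveryDeduplicator {
  private readonly seenAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns true if `deliveryId` was already recorded within the TTL window;
  // otherwise records it and returns false.
  isDuplicate(deliveryId: string, now: number): boolean {
    this.evictExpired(now);
    if (this.seenAt.has(deliveryId)) return true;
    this.seenAt.set(deliveryId, now);
    return false;
  }

  // Drops entries whose age has reached the TTL.
  private evictExpired(now: number): void {
    this.seenAt.forEach((recordedAt, id) => {
      if (now - recordedAt >= this.ttlMs) this.seenAt.delete(id);
    });
  }
}
```

A production version would need shared storage (KV or a Durable Object) rather than a per-isolate map. KV is eventually consistent, so two deliveries arriving close together could both pass the check there.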

Issue #100: Integrate Sandbox Provisioning (P2)

Acceptance Criteria

✅ Provisioning automated
✅ Resource limits enforced
✅ Cleanup automated
✅ Security hardened
✅ Monitoring enabled

Current Implementation Status

SandboxLifecycle DO (packages/cloudflare-workers/agent-worker/src/durable-objects/SandboxLifecycle.ts)

✅ Complete CRUD operations
✅ Status tracking
✅ Timeout management
✅ Log collection
❌ Not integrated with agent execution

Implementation Tasks

1. Create Execution Environment Wrapper

File: packages/cloudflare-workers/agent-worker/src/sandbox/execution-env.ts

Tasks:

[ ] Define resource limits configuration
[ ] Implement sandbox isolation wrapper
[ ] Add environment variable injection
[ ] Create security context setup
[ ] Add stdout/stderr capture
[ ] Implement timeout enforcement
[ ] Add resource usage tracking
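
One possible shape for the resource limits configuration and its enforcement check is sketched below. The field names and the example limit values are assumptions; the plan only states that CPU, memory, and timeout limits are enforced.

```typescript
// Hypothetical limits for one sandbox.
interface ResourceLimits {
  maxCpuMs: number;
  maxMemoryMb: number;
  timeoutMs: number;
}

// Hypothetical usage snapshot reported by the execution wrapper.
interface ResourceUsage {
  cpuMs: number;
  memoryMb: number;
  elapsedMs: number;
}

type Violation = 'cpu' | 'memory' | 'timeout';

// Returns every limit the usage snapshot exceeds (empty array when within limits).
export function findViolations(limits: ResourceLimits, usage: ResourceUsage): Violation[] {
  const violations: Violation[] = [];
  if (usage.cpuMs > limits.maxCpuMs) violations.push('cpu');
  if (usage.memoryMb > limits.maxMemoryMb) violations.push('memory');
  if (usage.elapsedMs > limits.timeoutMs) violations.push('timeout');
  return violations;
}
```

Returning all violations instead of the first one gives the sandbox logs a full picture of why an execution was terminated.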

2. Create Sandbox Integration Service

File: packages/cloudflare-workers/agent-worker/src/sandbox/integration.ts

Tasks:

[ ] Create sandbox provisioning helper
[ ] Implement agent-to-sandbox binding
[ ] Add log streaming integration
[ ] Create cleanup scheduler
[ ] Add failure recovery logic
[ ] Implement sandbox pooling (optional optimization)

3. Integrate with Agent Execution Flow

File: packages/cloudflare-workers/agent-worker/src/index.ts

Tasks:

[ ] Add sandbox provisioning before agent execution
[ ] Wrap agent execution in sandbox context
[ ] Capture and store sandbox logs
[ ] Handle sandbox timeout failures
[ ] Trigger cleanup on completion
[ ] Add sandbox metrics to agent execution response

4. Implement Resource Cleanup

File: packages/cloudflare-workers/agent-worker/src/sandbox/cleanup.ts

Tasks:

[ ] Create scheduled cleanup task (runs every 5 minutes)
[ ] Implement orphaned sandbox detection
[ ] Add forced termination for stuck sandboxes
[ ] Create cleanup metrics dashboard
[ ] Add cleanup failure alerting
[ ] Implement graceful shutdown handling
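
Orphaned sandbox detection could be a pure function that the scheduled cleanup task calls on each run. The record fields below (`lastHeartbeatAt`, `ownerRunActive`) are hypothetical and are not taken from the SandboxLifecycle DO.

```typescript
type SandboxStatus = 'provisioning' | 'running' | 'completed' | 'failed';

// Hypothetical view of a sandbox as seen by the cleanup task.
interface SandboxRecord {
  id: string;
  status: SandboxStatus;
  lastHeartbeatAt: number; // ms timestamp of the last sign of life
  ownerRunActive: boolean; // whether the owning agent run still exists
}

// A sandbox counts as orphaned when it is still live (provisioning or running)
// and either its owning run is gone or its heartbeat is older than staleAfterMs.
export function findOrphanedSandboxes(
  records: SandboxRecord[],
  now: number,
  staleAfterMs: number,
): string[] {
  return records
    .filter((r) => r.status === 'provisioning' || r.status === 'running')
    .filter((r) => !r.ownerRunActive || now - r.lastHeartbeatAt > staleAfterMs)
    .map((r) => r.id);
}
```

A Cron Trigger with the expression `*/5 * * * *` would match the five-minute schedule in the task list.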

5. Add Sandbox Monitoring

File: packages/cloudflare-workers/agent-worker/src/sandbox/monitoring.ts

Tasks:

[ ] Track active sandbox count
[ ] Monitor resource utilization (CPU, memory)
[ ] Track failure rates by agent type
[ ] Monitor timeout frequency
[ ] Add sandbox lifecycle duration metrics
[ ] Create sandbox health endpoint

Issue #101: Set Up Monitoring and Alerting (P1)

Acceptance Criteria

✅ Dashboards created
✅ Alert rules configured
✅ Logging pipeline set up
✅ SLIs/SLOs defined
✅ Runbooks created

Implementation Tasks

1. Configure Cloudflare Analytics

Files: All packages/cloudflare-workers/*/wrangler.toml

Tasks:

[ ] Enable Workers Analytics Engine in all wrangler.toml files
[ ] Add analytics_engine_datasets configuration
[ ] Configure custom metrics collection points
[ ] Set up logpush to R2 bucket for log retention
[ ] Configure log sampling rates
[ ] Add structured logging format
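
A custom metrics collection point might write to Analytics Engine as sketched below. `writeDataPoint` with `blobs`, `doubles`, and `indexes` is the Workers Analytics Engine binding call; the local `AnalyticsDataset` interface, the `QueueSample` shape, and the column layout are assumptions.

```typescript
// Minimal local stand-in for the Analytics Engine binding type, so the sketch
// compiles without the Workers type definitions.
interface AnalyticsDataset {
  writeDataPoint(point: { blobs?: string[]; doubles?: number[]; indexes?: string[] }): void;
}

// Hypothetical measurement for one processed queue message.
interface QueueSample {
  queue: string;
  outcome: 'success' | 'failure' | 'dlq';
  durationMs: number;
}

// Writes one data point per processed message.
export function recordQueueSample(dataset: AnalyticsDataset, sample: QueueSample): void {
  dataset.writeDataPoint({
    indexes: [sample.queue], // sampling key
    blobs: [sample.queue, sample.outcome], // string dimensions
    doubles: [sample.durationMs, 1], // numeric values: duration and a count of 1
  });
}
```

Inside a handler this would be called with the binding declared under `analytics_engine_datasets`, for example `recordQueueSample(env.QUEUE_METRICS, sample)`, where `QUEUE_METRICS` is a made-up binding name.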

2. Implement Error Alerting

File: packages/cloudflare-workers/monitoring/error-alerter.ts

Tasks:

[ ] Create error categorization system (CRITICAL, WARNING, INFO)
[ ] Implement alert routing logic
[ ] Add email notification integration
[ ] Add Slack webhook integration
[ ] Add PagerDuty integration (optional)
[ ] Configure alert thresholds per worker
[ ] Implement alert deduplication
[ ] Add alert acknowledgment tracking
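
Alert routing by severity could start as a static table. The mapping of CRITICAL, WARNING, and INFO to channels below is an assumption for illustration; the actual routing rules are still to be decided.

```typescript
type AlertSeverity = 'CRITICAL' | 'WARNING' | 'INFO';
type Channel = 'pagerduty' | 'slack' | 'email';

// Assumed routing table: critical alerts fan out widely, info alerts are only logged.
const ROUTES: Record<AlertSeverity, readonly Channel[]> = {
  CRITICAL: ['pagerduty', 'slack', 'email'],
  WARNING: ['slack'],
  INFO: [],
};

// Returns the channels to notify. PagerDuty is optional in the plan, so it is
// dropped from the result unless explicitly enabled.
export function routeAlert(severity: AlertSeverity, pagerDutyEnabled: boolean): Channel[] {
  return ROUTES[severity].filter((c) => c !== 'pagerduty' || pagerDutyEnabled);
}
```

Keeping the table as data makes per-worker threshold and routing overrides a configuration change instead of a code change.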

File: packages/cloudflare-workers/monitoring/middleware/error-tracker.ts

Tasks:

[ ] Create error tracking middleware
[ ] Add to all worker fetch handlers
[ ] Capture stack traces and context
[ ] Track error frequency by type
[ ] Implement error rate calculation

3. Add Performance Monitoring

File: packages/cloudflare-workers/monitoring/performance-tracker.ts

Tasks:

[ ] Implement request duration tracking
[ ] Add queue processing time metrics
[ ] Track D1 query performance
[ ] Monitor external API latency (Claude API, GitHub API)
[ ] Calculate P50, P95, P99 percentiles
[ ] Implement performance sampling (10% of requests)
[ ] Add performance budget alerts
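
The percentile calculation can be sketched with the nearest-rank method, which is one of several common definitions; the function name is hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('percentile requires at least one sample');
  if (p <= 0 || p > 100) throw new RangeError('p must be in the range (0, 100]');
  const sorted = samples.slice().sort((a, b) => a - b); // copy, then numeric sort
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1] as number;
}

// Convenience wrapper for the three percentiles named in the task list.
export function latencySummary(samples: number[]): { p50: number; p95: number; p99: number } {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
  };
}
```

With 10% sampling, these values are estimates over the sampled requests only, so low-traffic endpoints may need a longer aggregation window before P99 is meaningful.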

File: packages/cloudflare-workers/monitoring/middleware/performance-middleware.ts

Tasks:

[ ] Create performance tracking middleware
[ ] Add to all worker fetch handlers
[ ] Capture request/response timing
[ ] Track resource usage
[ ] Log slow queries

4. Create Dashboards

File: monitoring/cloudflare-dashboard.json

Dashboard Sections:

[ ] Worker Health:
- Uptime percentage
- Error rates by worker
- Active request count
- Request rate (req/sec)
[ ] Queue Metrics:
- Queue depth by queue
- Messages processed/sec
- Processing time distribution
- DLQ message count
- Retry rate
[ ] Performance Metrics:
- P50/P95/P99 latency by endpoint
- Database query time
- External API latency
- Cache hit rates
[ ] Resource Usage:
- CPU time by worker
- Memory usage
- D1 query count
- R2 operation count
- KV operation count
[ ] Sandbox Metrics:
- Active sandboxes
- Sandbox provision time
- Timeout frequency
- Resource utilization

5. Define SLIs/SLOs

File: monitoring/slos.yaml

Tasks:

[ ] Define API endpoint availability SLO (99.9% uptime)
[ ] Set P95 response time targets:
- API Gateway: < 200ms
- Task operations: < 500ms
- Agent execution: < 30s
[ ] Define queue processing latency SLO (P95 < 5s)
[ ] Set error rate threshold (< 1% of requests)
[ ] Define D1 query performance targets
[ ] Configure SLO violation alerts
[ ] Create SLO compliance reports
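
As a worked example of the availability target: assuming a 30-day measurement window (the plan does not fix a window), there are 30 × 24 × 60 = 43,200 minutes, so a 99.9% SLO allows roughly 43.2 minutes of downtime. The helper names below are hypothetical.

```typescript
// Allowed downtime in minutes for a given SLO percentage and window length.
export function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (windowMinutes * (100 - sloPercent)) / 100;
}

// True when the observed error rate is strictly below maxRate (0.01 for the 1% target).
export function errorRateWithinSlo(failed: number, total: number, maxRate: number): boolean {
  if (total === 0) return true; // no traffic, so nothing was violated
  return failed / total < maxRate;
}
```

`errorBudgetMinutes(99.9, 30)` evaluates to approximately 43.2, subject to floating-point rounding.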

6. Create Runbooks

Directory: monitoring/runbooks/

Runbooks to Create:

[ ] high-error-rate.md:
- Symptoms and detection
- Investigation steps
- Common causes
- Resolution procedures
- Escalation path
[ ] queue-backup.md:
- Queue congestion detection
- Impact assessment
- Mitigation steps
- Worker scaling
- DLQ processing
[ ] database-slow.md:
- Slow query identification
- Index analysis
- Query optimization
- D1 capacity scaling
- Temporary mitigations
[ ] worker-timeout.md:
- Timeout cause analysis
- Code profiling
- External dependency checks
- Timeout limit adjustment
- Performance optimization
[ ] sandbox-stuck.md:
- Stuck sandbox detection
- Manual cleanup procedure
- Root cause analysis
- Prevention measures
- Monitoring improvements

Issue #102: Implement Backup and Recovery (P1)

Acceptance Criteria

✅ Automated backups scheduled
✅ Recovery procedures tested
✅ RTO/RPO targets met
✅ Documentation complete
✅ Disaster recovery plan approved

RTO/RPO Targets

D1 Database: RTO 1 hour, RPO 24 hours
KV Namespaces: RTO 30 minutes, RPO 1 hour
R2 Artifacts: RTO 2 hours, RPO 24 hours

Implementation Tasks

1. D1 Database Backup

File: scripts/backup/d1-backup.ts

Tasks:

[ ] Implement D1 export using wrangler CLI
[ ] Upload backup to R2 bucket with timestamp
[ ] Implement retention policy:
- Daily backups: 30 days
- Weekly backups: 90 days
- Monthly backups: 1 year
[ ] Add backup integrity verification
[ ] Generate backup manifest file
[ ] Add backup size tracking
[ ] Implement backup encryption
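
The retention policy could be evaluated per backup as sketched below. How a backup qualifies as weekly or monthly is not specified in the plan; this sketch assumes that a backup taken on a Sunday (UTC) counts as weekly and one taken on the first day of the month counts as monthly, and that one year means 365 days.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Decides whether a backup taken at `backupDate` should still be kept at `now`.
// Daily backups are kept 30 days, weekly 90 days, monthly 365 days.
export function shouldRetain(backupDate: Date, now: Date): boolean {
  const ageDays = Math.floor((now.getTime() - backupDate.getTime()) / DAY_MS);
  if (ageDays <= 30) return true; // every backup counts as a daily backup
  const isWeekly = backupDate.getUTCDay() === 0; // assumption: Sunday backups
  if (isWeekly && ageDays <= 90) return true;
  const isMonthly = backupDate.getUTCDate() === 1; // assumption: first of month
  return isMonthly && ageDays <= 365;
}
```

The backup script could run this check over the manifest entries and delete the R2 objects for which it returns false.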

File: .github/workflows/d1-backup.yml

Tasks:

[ ] Create GitHub Action workflow
[ ] Schedule daily backups (2 AM UTC)
[ ] Add backup success/failure notifications
[ ] Store backup metadata
[ ] Monitor backup duration

2. KV/R2 Data Backup

File: scripts/backup/kv-backup.ts

Tasks:

[ ] Export all KV namespaces to R2
[ ] Include key metadata (expiration, etc.)
[ ] Generate KV backup manifest
[ ] Implement incremental backups
[ ] Add backup verification

File: scripts/backup/r2-backup.ts

Tasks:

[ ] Configure cross-region replication for critical buckets
[ ] Set up secondary backup bucket
[ ] Implement object versioning
[ ] Add lifecycle policies
[ ] Monitor replication lag

3. Recovery Procedures

File: scripts/recovery/d1-restore.ts

Tasks:

[ ] Implement database drop and recreate
[ ] Import from backup file
[ ] Validate data integrity after restore
[ ] Generate restoration report
[ ] Add rollback capability
[ ] Test on staging environment

File: scripts/recovery/kv-restore.ts

Tasks:

[ ] Implement namespace clearing
[ ] Bulk key restoration
[ ] Individual key restoration option
[ ] Validate restored data
[ ] Handle key conflicts

File: scripts/recovery/disaster-recovery.ts

Tasks:

[ ] Orchestrate full system restoration
[ ] Restore D1, KV, and R2 in correct order
[ ] Verify service health after each step
[ ] Run smoke tests
[ ] Generate recovery timeline report
[ ] Document actual RTO achieved

4. Recovery Playbooks

Directory: docs/recovery-playbooks/

Playbooks to Create:

[ ] d1-recovery.md:
- When to use
- Prerequisites
- Step-by-step restoration
- Verification steps
- Common issues and solutions
- RTO/RPO expectations
[ ] worker-rollback.md:
- Rollback triggers
- Version identification
- Rollback procedure
- Traffic switching
- Verification
- Communication plan
[ ] data-corruption.md:
- Corruption detection
- Impact assessment
- Point-in-time recovery
- Data validation
- Preventing recurrence
[ ] partial-failure.md:
- Component failure identification
- Isolated component recovery
- Service continuity
- Gradual restoration
- Full system validation

5. Automated Recovery Testing

File: scripts/backup/test-recovery.ts

Tasks:

[ ] Create staging environment for tests
[ ] Automate backup creation
[ ] Perform automated restore
[ ] Run data validation tests
[ ] Measure actual RTO
[ ] Compare against target RTO/RPO
[ ] Generate test report
[ ] Schedule monthly recovery drills
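
Comparing measured restore times against the RTO targets listed earlier (D1 1 hour, KV 30 minutes, R2 2 hours) could be sketched as follows. The function and type names are hypothetical.

```typescript
// RTO targets in minutes, taken from the RTO/RPO Targets section.
const RTO_TARGET_MINUTES = { d1: 60, kv: 30, r2: 120 } as const;
type StoreName = keyof typeof RTO_TARGET_MINUTES;

interface DrillResult {
  store: StoreName;
  targetMinutes: number;
  measuredMinutes: number;
  met: boolean;
}

// Produces one report row per store; `met` is true when the measured restore
// time is at or under the target.
export function evaluateDrill(measured: Record<StoreName, number>): DrillResult[] {
  const stores: StoreName[] = ['d1', 'kv', 'r2'];
  return stores.map((store) => ({
    store,
    targetMinutes: RTO_TARGET_MINUTES[store],
    measuredMinutes: measured[store],
    met: measured[store] <= RTO_TARGET_MINUTES[store],
  }));
}
```

The resulting rows can feed the test report directly, and any row with `met: false` is a candidate for the monthly drill follow-up.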

Implementation Timeline

Week 1: Monitoring and Queue Enhancement

Days 1-2: Monitoring Foundation

[ ] Configure Cloudflare Analytics in all workers
[ ] Implement error tracking middleware
[ ] Set up basic alerting
[ ] Create initial dashboard

Days 3-4: Queue Enhancement

[ ] Create shared queue utilities
[ ] Enhance agent queue handler
[ ] Enhance task queue handler
[ ] Enhance GitHub queue handler
[ ] Add queue metrics collection

Week 2: Sandbox and Backup

Days 5-6: Sandbox Integration

[ ] Create execution environment wrapper
[ ] Build sandbox integration service
[ ] Integrate with agent execution flow
[ ] Implement resource cleanup
[ ] Add sandbox monitoring

Days 7-8: Backup & Recovery

[ ] Implement D1 backup script
[ ] Create KV/R2 backup scripts
[ ] Build recovery procedures
[ ] Write recovery playbooks
[ ] Create GitHub Actions workflows

Days 9-10: Testing and Documentation

[ ] Run full backup/restore tests
[ ] Measure actual RTO/RPO
[ ] Complete all runbooks
[ ] Finalize SLI/SLO definitions
[ ] Update system documentation
[ ] Close all GitHub issues

Testing Strategy

Queue Handler Testing

File: packages/cloudflare-shared/tests/queue-handlers.test.ts

[ ] Unit tests for error classification
[ ] Retry logic tests
[ ] DLQ processing tests
[ ] Load tests (1000+ messages/sec)
[ ] Failure scenario tests

Sandbox Integration Testing

File: packages/e2e/tests/sandbox-integration.spec.ts

[ ] E2E sandbox lifecycle test
[ ] Resource limit enforcement test
[ ] Timeout handling test
[ ] Cleanup automation test
[ ] Concurrent sandbox test

Monitoring Testing

[ ] Verify all metrics are collected
[ ] Trigger test alerts
[ ] Validate dashboard accuracy
[ ] Test alert notification delivery
[ ] Verify SLO calculations

Backup/Recovery Testing

File: scripts/backup/__tests__/recovery.test.ts

[ ] Automated backup validation
[ ] Full restore to staging
[ ] Partial restore test
[ ] Corruption recovery test
[ ] Measure actual RTO/RPO

Success Criteria Checklist

Issue #99: Queue Handlers ✓

[ ] All queue handlers have comprehensive error handling
[ ] Exponential backoff implemented
[ ] Dead letter queues configured with error context
[ ] Queue metrics collected and displayed
[ ] Performance optimizations applied
[ ] Load tested at 1000+ messages/sec

Issue #100: Sandbox Provisioning ✓

[ ] Sandbox provisioning integrated with agent execution
[ ] Resource limits enforced (CPU, memory, timeout)
[ ] Automated cleanup running every 5 minutes
[ ] Security hardened with isolation
[ ] Monitoring dashboard shows sandbox metrics
[ ] E2E tests passing

Issue #101: Monitoring and Alerting ✓

[ ] Cloudflare Analytics configured in all workers
[ ] Error alerting functional with notifications
[ ] Performance monitoring tracking P95/P99
[ ] Dashboard created with all key metrics
[ ] SLIs/SLOs defined and monitored
[ ] All 5 runbooks completed

Issue #102: Backup and Recovery ✓

[ ] D1 automated backups running daily
[ ] KV/R2 backups configured
[ ] All recovery scripts tested
[ ] Recovery playbooks written
[ ] RTO/RPO targets validated
[ ] Monthly recovery drills scheduled

Overall Completion ✓

[ ] All 4 Stage 3 issues closed
[ ] System ready for Stage 4 deployment
[ ] Documentation complete
[ ] Team trained on operations

Risk Mitigation

High Risk Areas

Queue Processing Under Load
- Mitigation: Load testing with 2x expected capacity
- Fallback: Circuit breaker to prevent cascading failures
Backup Window Exceeds RTO
- Mitigation: Parallel backup processes
- Fallback: Incremental backup strategy
Alert Fatigue from False Positives
- Mitigation: Careful threshold tuning
- Fallback: Alert aggregation and smart routing
Sandbox Resource Exhaustion
- Mitigation: Strict resource limits and quotas
- Fallback: Automatic sandbox pool scaling

Dependencies

External Services

Cloudflare Workers (runtime)
Cloudflare D1 (database)
Cloudflare R2 (object storage)
Cloudflare KV (key-value store)
Cloudflare Analytics Engine
GitHub Actions (automation)

Internal Packages

@monotask/cloudflare-shared (utilities)
@monotask/core (business logic)
@monotask/shared (types and constants)

Post-Implementation

Monitoring and Maintenance

Daily review of error logs
Weekly SLO compliance review
Monthly backup recovery drill
Quarterly runbook updates

Documentation Updates

Update CLAUDE.md with new operational procedures
Document monitoring dashboard usage
Create video walkthroughs for runbooks
Update team wiki with recovery procedures

Next Steps (Stage 4)

After completing Stage 3:

Configure production DNS and routes (#105)
Perform data migration (#104)
Execute staged rollout (#103)

Document Version: 1.0
Last Updated: October 26, 2025
Owner: Development Team
Reviewers: DevOps, SRE
