Stage 3 Implementation Plan
Date: October 26, 2025
Status: Ready for Implementation
Priority: High (P1/P2)
Executive Summary
This document outlines the implementation plan for completing Stage 3 of the Cloudflare migration, focusing on integration, monitoring, and operational readiness. Four GitHub issues need to be addressed to achieve production-ready status.
GitHub Issues Overview
| Issue | Title | Priority | Status |
|---|---|---|---|
| #99 | Implement Queue handlers | P1 | Open |
| #100 | Integrate sandbox provisioning | P2 | Open |
| #101 | Set up monitoring and alerting | P1 | Open |
| #102 | Implement backup and recovery | P1 | Open |
Current State Analysis
✅ Infrastructure Ready
- QueueManager Durable Object: Fully implemented with job tracking, retry logic, and stats
- SandboxLifecycle Durable Object: Complete lifecycle management, timeout handling, cleanup
- Worker Services: Agent, Task, and GitHub workers deployed with basic queue consumers
- Database: D1 database configured and accessible from all workers
- Storage: R2 and KV bindings configured
❌ Gaps Identified
- Queue Handlers: Basic implementation lacks comprehensive error handling and monitoring
- Sandbox Integration: SandboxLifecycle exists but not integrated with agent execution flow
- Monitoring: No metrics collection, alerting, or dashboards configured
- Backup/Recovery: No automated backup system or recovery procedures
Issue #99: Implement Queue Handlers (P1)
Acceptance Criteria
- ✅ All handlers implemented
- ✅ Retry logic configured
- ✅ Dead letter queues set up
- ✅ Error handling complete
- ✅ Performance optimized
Current Implementation Status
Agent Worker Queue Handler (packages/cloudflare-workers/agent-worker/src/index.ts:179-207)
- ✅ Basic message processing
- ✅ Simple retry logic (max 3 attempts)
- ❌ Limited error categorization
- ❌ No metrics collection
- ❌ Basic DLQ handling
Task Worker Queue Handler (packages/cloudflare-workers/task-worker/src/index.ts:222-255)
- ✅ Multiple message types supported
- ✅ Basic error handling
- ❌ No batch optimization
- ❌ Missing queue depth monitoring
GitHub Worker Queue Handler (packages/cloudflare-workers/github-worker/src/index.ts:88-135)
- ✅ Webhook event routing
- ✅ Message acknowledgment
- ❌ No deduplication
- ❌ No rate limit handling
Implementation Tasks
1. Create Shared Queue Utilities
File: packages/cloudflare-shared/src/queue-utils.ts
```typescript
// Error classification
export enum ErrorSeverity {
  RETRYABLE = 'retryable',      // Transient errors, retry
  FATAL = 'fatal',              // Permanent errors, DLQ
  RATE_LIMITED = 'rate_limited' // Rate limit, backoff
}

export interface QueueError {
  severity: ErrorSeverity;
  message: string;
  retryAfter?: number;
  category: string;
}

// Retry strategies
export interface RetryStrategy {
  maxAttempts: number;
  backoffMs: number[];
  backoffMultiplier: number;
}

// Metrics collection
export interface QueueMetrics {
  messagesProcessed: number;
  messagesSucceeded: number;
  messagesFailed: number;
  avgProcessingTimeMs: number;
  dlqCount: number;
}
```
Tasks:
- [ ] Implement error classification logic
- [ ] Create exponential backoff calculator
- [ ] Build retry strategy evaluator
- [ ] Add metrics aggregation helpers
- [ ] Create DLQ error context builder
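The first three tasks could be sketched roughly as follows. This is a minimal sketch, not the planned implementation: the HTTP-status mapping and the function names `classifyStatus`, `nextDelayMs`, and `shouldRetry` are assumptions made for illustration, and the severity values are written as a string union that mirrors `ErrorSeverity` so the block stands alone.

```typescript
// String union mirroring the ErrorSeverity values proposed above.
type Severity = 'retryable' | 'fatal' | 'rate_limited';

interface RetryStrategy {
  maxAttempts: number;
  backoffMs: number[];
  backoffMultiplier: number;
}

// Assumed mapping from an upstream HTTP status to a severity.
function classifyStatus(status: number): Severity {
  if (status === 429) return 'rate_limited';
  if (status === 408 || status >= 500) return 'retryable';
  return 'fatal';
}

// Delay before the retry that follows failed attempt `attempt` (1-based).
// Reads the explicit table first, then grows the last entry geometrically.
function nextDelayMs(strategy: RetryStrategy, attempt: number): number {
  const table = strategy.backoffMs;
  const last = table[table.length - 1];
  if (last === undefined) return 0;
  const fromTable = table[attempt - 1];
  if (fromTable !== undefined) return fromTable;
  return last * Math.pow(strategy.backoffMultiplier, attempt - table.length);
}

// Fatal errors go straight to the DLQ; others retry until attempts run out.
function shouldRetry(strategy: RetryStrategy, severity: Severity, attempt: number): boolean {
  return severity !== 'fatal' && attempt < strategy.maxAttempts;
}
```

With `backoffMs: [1000, 2000]` and `backoffMultiplier: 2`, the third and fourth delays come out as 4000 ms and 8000 ms, which matches the 1s, 2s, 4s, 8s pattern named for the agent handler.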
2. Enhance Agent Queue Handler
File: packages/cloudflare-workers/agent-worker/src/index.ts
Tasks:
- [ ] Add error categorization (classify Claude API errors, timeout errors, validation errors)
- [ ] Implement exponential backoff (1s, 2s, 4s, 8s pattern)
- [ ] Add processing time metrics tracking
- [ ] Enhance DLQ handling with full error context
- [ ] Add circuit breaker for repeated failures
- [ ] Implement queue depth alerts
- [ ] Add success/failure rate tracking
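One possible shape for the circuit breaker task is sketched below. The class name, the failure threshold, and the cooldown are illustrative assumptions. State held in a plain object like this lives only as long as the Worker isolate, so a real version would probably keep it in a Durable Object.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // Returns true when a call may proceed at time `nowMs`.
  // After the cooldown the breaker moves to half-open and lets probe calls through.
  allow(nowMs: number): boolean {
    if (this.state === 'open' && nowMs - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open';
    }
    return this.state !== 'open';
  }

  // Any success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  // A failed probe, or reaching the threshold, opens the breaker.
  recordFailure(nowMs: number): void {
    this.failures += 1;
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';
      this.openedAt = nowMs;
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

The queue handler would call `allow()` before invoking the Claude API and leave the message for a later retry when it returns false.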
3. Enhance Task Queue Handler
File: packages/cloudflare-workers/task-worker/src/index.ts
Tasks:
- [ ] Add batch processing optimization
- [ ] Implement message deduplication
- [ ] Add state transition validation
- [ ] Track queue depth per message type
- [ ] Add priority-based processing
- [ ] Implement graceful degradation
- [ ] Add validation result processing
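State transition validation can be expressed as a lookup table. The state names and allowed transitions below are hypothetical placeholders, since this plan does not list the task states; only the table-driven technique is the point.

```typescript
// Hypothetical task states; replace with the real ones from the task model.
type TaskState = 'pending' | 'in_progress' | 'validating' | 'completed' | 'failed';

// For each state, the states a task may move to next.
const ALLOWED_TRANSITIONS: Record<TaskState, readonly TaskState[]> = {
  pending: ['in_progress', 'failed'],
  in_progress: ['validating', 'failed'],
  validating: ['completed', 'in_progress', 'failed'],
  completed: [],
  failed: ['pending'],
};

// True when moving from `from` to `to` is listed in the table.
function canTransition(from: TaskState, to: TaskState): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Throws on an illegal transition so the queue handler can classify it as fatal.
function assertTransition(from: TaskState, to: TaskState): void {
  if (!canTransition(from, to)) {
    throw new Error(`illegal task transition: ${from} -> ${to}`);
  }
}
```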
4. Enhance GitHub Queue Handler
File: packages/cloudflare-workers/github-worker/src/index.ts
Tasks:
- [ ] Implement webhook event deduplication (by delivery ID)
- [ ] Add GitHub API rate limit handling
- [ ] Implement event prioritization (critical events first)
- [ ] Add sync operation progress tracking
- [ ] Handle webhook replay scenarios
- [ ] Add API error retry logic
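Deduplication by delivery ID might look like the sketch below. GitHub sends a unique ID per delivery in the `X-GitHub-Delivery` header. The in-memory map and the `DeliveryDeduplicator` name are assumptions for illustration; a deployed worker would more likely store seen IDs in KV or a Durable Object so that every isolate shares them.

```typescript
class DeliveryDeduplicator {
  // deliveryId -> expiry timestamp in ms
  private readonly seen = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns true the first time an ID is seen within the TTL window,
  // false for a replayed or duplicate delivery.
  markIfNew(deliveryId: string, nowMs: number): boolean {
    const expiry = this.seen.get(deliveryId);
    if (expiry !== undefined && expiry > nowMs) return false;
    this.seen.set(deliveryId, nowMs + this.ttlMs);
    return true;
  }

  // Drops expired entries so the map does not grow without bound.
  prune(nowMs: number): number {
    let removed = 0;
    this.seen.forEach((expiry, id) => {
      if (expiry <= nowMs) {
        this.seen.delete(id);
        removed += 1;
      }
    });
    return removed;
  }
}
```

A handler would acknowledge and skip any message for which `markIfNew` returns false.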
Issue #100: Integrate Sandbox Provisioning (P2)
Acceptance Criteria
- ✅ Provisioning automated
- ✅ Resource limits enforced
- ✅ Cleanup automated
- ✅ Security hardened
- ✅ Monitoring enabled
Current Implementation Status
SandboxLifecycle DO (packages/cloudflare-workers/agent-worker/src/durable-objects/SandboxLifecycle.ts)
- ✅ Complete CRUD operations
- ✅ Status tracking
- ✅ Timeout management
- ✅ Log collection
- ❌ Not integrated with agent execution
Implementation Tasks
1. Create Execution Environment Wrapper
File: packages/cloudflare-workers/agent-worker/src/sandbox/execution-env.ts
Tasks:
- [ ] Define resource limits configuration
- [ ] Implement sandbox isolation wrapper
- [ ] Add environment variable injection
- [ ] Create security context setup
- [ ] Add stdout/stderr capture
- [ ] Implement timeout enforcement
- [ ] Add resource usage tracking
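Timeout enforcement is commonly built from `Promise.race` plus an `AbortController`. The sketch below assumes the sandboxed work is exposed as a function that accepts an `AbortSignal`; `withTimeout` and `SandboxTimeoutError` are hypothetical names.

```typescript
class SandboxTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`sandbox execution exceeded ${timeoutMs} ms`);
    this.name = 'SandboxTimeoutError';
  }
}

// Runs `run` and rejects with SandboxTimeoutError if it has not settled
// within `timeoutMs`. The signal is aborted so cooperative work can stop.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new SandboxTimeoutError(timeoutMs));
    }, timeoutMs);
  });
  try {
    return await Promise.race([run(controller.signal), timeout]);
  } finally {
    // Always clear the timer so a fast result does not leave it pending.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Aborting only signals the work to stop. Code that ignores the signal keeps running, so the forced-termination path in the cleanup task is still needed.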
2. Create Sandbox Integration Service
File: packages/cloudflare-workers/agent-worker/src/sandbox/integration.ts
Tasks:
- [ ] Create sandbox provisioning helper
- [ ] Implement agent-to-sandbox binding
- [ ] Add log streaming integration
- [ ] Create cleanup scheduler
- [ ] Add failure recovery logic
- [ ] Implement sandbox pooling (optional optimization)
3. Integrate with Agent Execution Flow
File: packages/cloudflare-workers/agent-worker/src/index.ts
Tasks:
- [ ] Add sandbox provisioning before agent execution
- [ ] Wrap agent execution in sandbox context
- [ ] Capture and store sandbox logs
- [ ] Handle sandbox timeout failures
- [ ] Trigger cleanup on completion
- [ ] Add sandbox metrics to agent execution response
4. Implement Resource Cleanup
File: packages/cloudflare-workers/agent-worker/src/sandbox/cleanup.ts
Tasks:
- [ ] Create scheduled cleanup task (runs every 5 minutes)
- [ ] Implement orphaned sandbox detection
- [ ] Add forced termination for stuck sandboxes
- [ ] Create cleanup metrics dashboard
- [ ] Add cleanup failure alerting
- [ ] Implement graceful shutdown handling
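Orphan and stuck-sandbox detection reduces to a pure function over sandbox records, which makes it easy to test apart from the scheduler. The record fields and the heartbeat grace period below are assumptions; the real `SandboxLifecycle` state may be shaped differently.

```typescript
// Hypothetical view of a sandbox as the cleanup task would see it.
interface SandboxRecord {
  id: string;
  status: 'provisioning' | 'running' | 'completed' | 'failed';
  startedAtMs: number;
  lastHeartbeatMs: number;
  timeoutMs: number;
}

type SandboxVerdict = 'healthy' | 'orphaned' | 'stuck' | 'finished';

// Classifies one sandbox at time `nowMs`.
function inspectSandbox(sb: SandboxRecord, nowMs: number, heartbeatGraceMs: number): SandboxVerdict {
  if (sb.status === 'completed' || sb.status === 'failed') return 'finished';
  if (nowMs - sb.startedAtMs > sb.timeoutMs) return 'stuck'; // ran past its own timeout
  if (nowMs - sb.lastHeartbeatMs > heartbeatGraceMs) return 'orphaned'; // owner went silent
  return 'healthy';
}

// IDs the scheduled cleanup should force-terminate on this run.
function selectForTermination(all: SandboxRecord[], nowMs: number, heartbeatGraceMs: number): string[] {
  return all
    .filter((sb) => {
      const verdict = inspectSandbox(sb, nowMs, heartbeatGraceMs);
      return verdict === 'stuck' || verdict === 'orphaned';
    })
    .map((sb) => sb.id);
}
```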
5. Add Sandbox Monitoring
File: packages/cloudflare-workers/agent-worker/src/sandbox/monitoring.ts
Tasks:
- [ ] Track active sandbox count
- [ ] Monitor resource utilization (CPU, memory)
- [ ] Track failure rates by agent type
- [ ] Monitor timeout frequency
- [ ] Add sandbox lifecycle duration metrics
- [ ] Create sandbox health endpoint
Issue #101: Set Up Monitoring and Alerting (P1)
Acceptance Criteria
- ✅ Dashboards created
- ✅ Alert rules configured
- ✅ Logging pipeline set up
- ✅ SLIs/SLOs defined
- ✅ Runbooks created
Implementation Tasks
1. Configure Cloudflare Analytics
Files: All packages/cloudflare-workers/*/wrangler.toml
Tasks:
- [ ] Enable Workers Analytics Engine in all wrangler.toml files
- [ ] Add analytics_engine_datasets configuration
- [ ] Configure custom metrics collection points
- [ ] Set up logpush to R2 bucket for log retention
- [ ] Configure log sampling rates
- [ ] Add structured logging format
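Once a dataset binding is declared under `analytics_engine_datasets`, a collection point is a single `writeDataPoint` call. The sketch below declares only the part of the binding it uses, and the field layout (index = worker, blobs = queue and outcome, doubles = duration and a count of 1) is an assumption chosen for illustration.

```typescript
// Minimal shape of the Workers Analytics Engine binding used by this sketch.
interface AnalyticsDataset {
  writeDataPoint(point: { indexes?: string[]; blobs?: string[]; doubles?: number[] }): void;
}

interface QueueEvent {
  worker: string;
  queue: string;
  outcome: 'success' | 'failure' | 'dlq';
  durationMs: number;
}

// Writes one data point per processed message.
// Assumed layout: index = worker, blobs = [queue, outcome], doubles = [durationMs, 1].
function recordQueueEvent(dataset: AnalyticsDataset, event: QueueEvent): void {
  dataset.writeDataPoint({
    indexes: [event.worker],
    blobs: [event.queue, event.outcome],
    doubles: [event.durationMs, 1],
  });
}
```

Keeping the layout in one helper means dashboard queries can rely on fixed column positions across all workers.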
2. Implement Error Alerting
File: packages/cloudflare-workers/monitoring/error-alerter.ts
Tasks:
- [ ] Create error categorization system (CRITICAL, WARNING, INFO)
- [ ] Implement alert routing logic
- [ ] Add email notification integration
- [ ] Add Slack webhook integration
- [ ] Add PagerDuty integration (optional)
- [ ] Configure alert thresholds per worker
- [ ] Implement alert deduplication
- [ ] Add alert acknowledgment tracking
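Severity classification and routing can be kept as data rather than branching logic. The channel assignments and the threshold field names below are assumptions; the plan names the three severities and the three integrations but not how they map to each other.

```typescript
type AlertSeverity = 'CRITICAL' | 'WARNING' | 'INFO';
type Channel = 'pagerduty' | 'slack' | 'email';

// Per-worker error-rate thresholds, as fractions of requests (0.05 = 5%).
interface WorkerThresholds {
  warningRate: number;
  criticalRate: number;
}

// Assumed routing table: which channels receive each severity.
const ROUTES: Record<AlertSeverity, Channel[]> = {
  CRITICAL: ['pagerduty', 'slack', 'email'],
  WARNING: ['slack'],
  INFO: [],
};

// Maps an observed error rate to a severity using the worker's thresholds.
function severityFor(errorRate: number, t: WorkerThresholds): AlertSeverity {
  if (errorRate >= t.criticalRate) return 'CRITICAL';
  if (errorRate >= t.warningRate) return 'WARNING';
  return 'INFO';
}

// Channels to notify for an observed error rate.
function channelsFor(errorRate: number, t: WorkerThresholds): Channel[] {
  return ROUTES[severityFor(errorRate, t)];
}
```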
File: packages/cloudflare-workers/monitoring/middleware/error-tracker.ts
Tasks:
- [ ] Create error tracking middleware
- [ ] Add to all worker fetch handlers
- [ ] Capture stack traces and context
- [ ] Track error frequency by type
- [ ] Implement error rate calculation
3. Add Performance Monitoring
File: packages/cloudflare-workers/monitoring/performance-tracker.ts
Tasks:
- [ ] Implement request duration tracking
- [ ] Add queue processing time metrics
- [ ] Track D1 query performance
- [ ] Monitor external API latency (Claude API, GitHub API)
- [ ] Calculate P50, P95, P99 percentiles
- [ ] Implement performance sampling (10% of requests)
- [ ] Add performance budget alerts
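The percentile and sampling tasks can be sketched with the nearest-rank method. This is one of several percentile definitions and is an assumption here, as are the function names. The sampling roll is injectable so the 10% rule can be tested deterministically.

```typescript
// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN;
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  const index = Math.min(sortedAsc.length - 1, Math.max(0, rank - 1));
  return sortedAsc[index] ?? NaN;
}

// Sorts a copy of the samples and reports the three percentiles the plan tracks.
function summarize(durationsMs: number[]): { p50: number; p95: number; p99: number } {
  const sorted = durationsMs.slice().sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}

// Keeps roughly `rate` of requests (0.1 = 10%); `roll` is injectable for tests.
function shouldSample(rate: number, roll: number = Math.random()): boolean {
  return roll < rate;
}
```

Percentiles computed from a 10% sample are estimates, so low-traffic endpoints may need a higher rate to produce a stable P99.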
File: packages/cloudflare-workers/monitoring/middleware/performance-middleware.ts
Tasks:
- [ ] Create performance tracking middleware
- [ ] Add to all worker fetch handlers
- [ ] Capture request/response timing
- [ ] Track resource usage
- [ ] Log slow queries
4. Create Dashboards
File: monitoring/cloudflare-dashboard.json
Dashboard Sections:
- [ ] Worker Health:
  - Uptime percentage
  - Error rates by worker
  - Active request count
  - Request rate (req/sec)
- [ ] Queue Metrics:
  - Queue depth by queue
  - Messages processed/sec
  - Processing time distribution
  - DLQ message count
  - Retry rate
- [ ] Performance Metrics:
  - P50/P95/P99 latency by endpoint
  - Database query time
  - External API latency
  - Cache hit rates
- [ ] Resource Usage:
  - CPU time by worker
  - Memory usage
  - D1 query count
  - R2 operation count
  - KV operation count
- [ ] Sandbox Metrics:
  - Active sandboxes
  - Sandbox provision time
  - Timeout frequency
  - Resource utilization
5. Define SLIs/SLOs
File: monitoring/slos.yaml
Tasks:
- [ ] Define API endpoint availability SLO (99.9% uptime)
- [ ] Set P95 response time targets:
  - API Gateway: < 200ms
  - Task operations: < 500ms
  - Agent execution: < 30s
- [ ] Define queue processing latency SLO (P95 < 5s)
- [ ] Set error rate threshold (< 1% of requests)
- [ ] Define D1 query performance targets
- [ ] Configure SLO violation alerts
- [ ] Create SLO compliance reports
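The availability target translates directly into an error budget, which is what the violation alerts would track. A minimal sketch of that arithmetic follows; the 30-day window and the function names are assumptions.

```typescript
// Minutes of allowed downtime for an availability SLO over a window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  return windowDays * 24 * 60 * (1 - sloPercent / 100);
}

// Fraction of the error budget already consumed by `badMinutes` of downtime.
function budgetConsumed(sloPercent: number, windowDays: number, badMinutes: number): number {
  return badMinutes / errorBudgetMinutes(sloPercent, windowDays);
}

// The plan's error-rate threshold: strictly under 1% of requests.
function meetsErrorRate(failed: number, total: number, maxRate: number = 0.01): boolean {
  return total === 0 || failed / total < maxRate;
}
```

For the 99.9% target over 30 days, the budget is 43,200 minutes × 0.001, or about 43.2 minutes of downtime.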
6. Create Runbooks
Directory: monitoring/runbooks/
Runbooks to Create:
- [ ] high-error-rate.md:
  - Symptoms and detection
  - Investigation steps
  - Common causes
  - Resolution procedures
  - Escalation path
- [ ] queue-backup.md:
  - Queue congestion detection
  - Impact assessment
  - Mitigation steps
  - Worker scaling
  - DLQ processing
- [ ] database-slow.md:
  - Slow query identification
  - Index analysis
  - Query optimization
  - D1 capacity scaling
  - Temporary mitigations
- [ ] worker-timeout.md:
  - Timeout cause analysis
  - Code profiling
  - External dependency checks
  - Timeout limit adjustment
  - Performance optimization
- [ ] sandbox-stuck.md:
  - Stuck sandbox detection
  - Manual cleanup procedure
  - Root cause analysis
  - Prevention measures
  - Monitoring improvements
Issue #102: Implement Backup and Recovery (P1)
Acceptance Criteria
- ✅ Automated backups scheduled
- ✅ Recovery procedures tested
- ✅ RTO/RPO targets met
- ✅ Documentation complete
- ✅ Disaster recovery plan approved
RTO/RPO Targets
- D1 Database: RTO 1 hour, RPO 24 hours
- KV Namespaces: RTO 30 minutes, RPO 1 hour
- R2 Artifacts: RTO 2 hours, RPO 24 hours
Implementation Tasks
1. D1 Database Backup
File: scripts/backup/d1-backup.ts
Tasks:
- [ ] Implement D1 export using wrangler CLI
- [ ] Upload backup to R2 bucket with timestamp
- [ ] Implement retention policy:
  - Daily backups: 30 days
  - Weekly backups: 90 days
  - Monthly backups: 1 year
- [ ] Add backup integrity verification
- [ ] Generate backup manifest file
- [ ] Add backup size tracking
- [ ] Implement backup encryption
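The retention policy can be sketched as a tier lookup plus an age check. How a backup is assigned to a tier is not specified in this plan, so the rule below (first of the month = monthly, Sunday = weekly, otherwise daily, all in UTC) is an assumption, as is treating 1 year as 365 days.

```typescript
type BackupTier = 'daily' | 'weekly' | 'monthly';

// Retention periods from the policy above, in days.
const RETENTION_DAYS: Record<BackupTier, number> = { daily: 30, weekly: 90, monthly: 365 };

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumed tiering rule: 1st of month = monthly, Sunday = weekly, else daily (UTC).
function tierOf(takenAt: Date): BackupTier {
  if (takenAt.getUTCDate() === 1) return 'monthly';
  if (takenAt.getUTCDay() === 0) return 'weekly';
  return 'daily';
}

// True when a backup is older than its tier's retention period and may be deleted.
function isExpired(takenAt: Date, now: Date): boolean {
  const ageDays = (now.getTime() - takenAt.getTime()) / MS_PER_DAY;
  return ageDays > RETENTION_DAYS[tierOf(takenAt)];
}
```

The backup script would list objects in the R2 bucket, parse the timestamp from each key, and delete those for which `isExpired` returns true.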
File: .github/workflows/d1-backup.yml
Tasks:
- [ ] Create GitHub Action workflow
- [ ] Schedule daily backups (2 AM UTC)
- [ ] Add backup success/failure notifications
- [ ] Store backup metadata
- [ ] Monitor backup duration
2. KV/R2 Data Backup
File: scripts/backup/kv-backup.ts
Tasks:
- [ ] Export all KV namespaces to R2
- [ ] Include key metadata (expiration, etc.)
- [ ] Generate KV backup manifest
- [ ] Implement incremental backups
- [ ] Add backup verification
File: scripts/backup/r2-backup.ts
Tasks:
- [ ] Configure cross-region replication for critical buckets
- [ ] Set up secondary backup bucket
- [ ] Implement object versioning
- [ ] Add lifecycle policies
- [ ] Monitor replication lag
3. Recovery Procedures
File: scripts/recovery/d1-restore.ts
Tasks:
- [ ] Implement database drop and recreate
- [ ] Import from backup file
- [ ] Validate data integrity after restore
- [ ] Generate restoration report
- [ ] Add rollback capability
- [ ] Test on staging environment
File: scripts/recovery/kv-restore.ts
Tasks:
- [ ] Implement namespace clearing
- [ ] Bulk key restoration
- [ ] Individual key restoration option
- [ ] Validate restored data
- [ ] Handle key conflicts
File: scripts/recovery/disaster-recovery.ts
Tasks:
- [ ] Orchestrate full system restoration
- [ ] Restore D1, KV, and R2 in correct order
- [ ] Verify service health after each step
- [ ] Run smoke tests
- [ ] Generate recovery timeline report
- [ ] Document actual RTO achieved
4. Recovery Playbooks
Directory: docs/recovery-playbooks/
Playbooks to Create:
- [ ] d1-recovery.md:
  - When to use
  - Prerequisites
  - Step-by-step restoration
  - Verification steps
  - Common issues and solutions
  - RTO/RPO expectations
- [ ] worker-rollback.md:
  - Rollback triggers
  - Version identification
  - Rollback procedure
  - Traffic switching
  - Verification
  - Communication plan
- [ ] data-corruption.md:
  - Corruption detection
  - Impact assessment
  - Point-in-time recovery
  - Data validation
  - Preventing recurrence
- [ ] partial-failure.md:
  - Component failure identification
  - Isolated component recovery
  - Service continuity
  - Gradual restoration
  - Full system validation
5. Automated Recovery Testing
File: scripts/backup/test-recovery.ts
Tasks:
- [ ] Create staging environment for tests
- [ ] Automate backup creation
- [ ] Perform automated restore
- [ ] Run data validation tests
- [ ] Measure actual RTO
- [ ] Compare against target RTO/RPO
- [ ] Generate test report
- [ ] Schedule monthly recovery drills
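Comparing a drill against the targets can be a small pure function. The target values below are copied from the RTO/RPO Targets section; the `DrillResult` shape and the function name are assumptions for illustration.

```typescript
type Component = 'd1' | 'kv' | 'r2';

interface RecoveryTarget {
  rtoMinutes: number;
  rpoMinutes: number;
}

// Targets copied from the RTO/RPO Targets section of this plan.
const TARGETS: Record<Component, RecoveryTarget> = {
  d1: { rtoMinutes: 60, rpoMinutes: 24 * 60 },
  kv: { rtoMinutes: 30, rpoMinutes: 60 },
  r2: { rtoMinutes: 120, rpoMinutes: 24 * 60 },
};

// Hypothetical measurements recorded by one recovery drill.
interface DrillResult {
  component: Component;
  restoreMinutes: number;   // measured time to restore service
  dataLossMinutes: number;  // age of the newest restorable data
}

// Reports whether the measured values fall within the component's targets.
function evaluateDrill(result: DrillResult): { rtoMet: boolean; rpoMet: boolean } {
  const target = TARGETS[result.component];
  return {
    rtoMet: result.restoreMinutes <= target.rtoMinutes,
    rpoMet: result.dataLossMinutes <= target.rpoMinutes,
  };
}
```

Worst-case data loss is roughly the interval between backups, so the 1-hour RPO for KV implies KV backups at least hourly, while a daily D1 backup fits its 24-hour RPO.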
Implementation Timeline
Week 1: Monitoring and Queue Enhancement
Days 1-2: Monitoring Foundation
- [ ] Configure Cloudflare Analytics in all workers
- [ ] Implement error tracking middleware
- [ ] Set up basic alerting
- [ ] Create initial dashboard
Days 3-4: Queue Enhancement
- [ ] Create shared queue utilities
- [ ] Enhance agent queue handler
- [ ] Enhance task queue handler
- [ ] Enhance GitHub queue handler
- [ ] Add queue metrics collection
Week 2: Sandbox and Backup
Days 5-6: Sandbox Integration
- [ ] Create execution environment wrapper
- [ ] Build sandbox integration service
- [ ] Integrate with agent execution flow
- [ ] Implement resource cleanup
- [ ] Add sandbox monitoring
Days 7-8: Backup & Recovery
- [ ] Implement D1 backup script
- [ ] Create KV/R2 backup scripts
- [ ] Build recovery procedures
- [ ] Write recovery playbooks
- [ ] Create GitHub Actions workflows
Days 9-10: Testing and Documentation
- [ ] Run full backup/restore tests
- [ ] Measure actual RTO/RPO
- [ ] Complete all runbooks
- [ ] Finalize SLI/SLO definitions
- [ ] Update system documentation
- [ ] Close all GitHub issues
Testing Strategy
Queue Handler Testing
File: packages/cloudflare-shared/tests/queue-handlers.test.ts
- [ ] Unit tests for error classification
- [ ] Retry logic tests
- [ ] DLQ processing tests
- [ ] Load tests (1000+ messages/sec)
- [ ] Failure scenario tests
Sandbox Integration Testing
File: packages/e2e/tests/sandbox-integration.spec.ts
- [ ] E2E sandbox lifecycle test
- [ ] Resource limit enforcement test
- [ ] Timeout handling test
- [ ] Cleanup automation test
- [ ] Concurrent sandbox test
Monitoring Testing
- [ ] Verify all metrics are collected
- [ ] Trigger test alerts
- [ ] Validate dashboard accuracy
- [ ] Test alert notification delivery
- [ ] Verify SLO calculations
Backup/Recovery Testing
File: scripts/backup/__tests__/recovery.test.ts
- [ ] Automated backup validation
- [ ] Full restore to staging
- [ ] Partial restore test
- [ ] Corruption recovery test
- [ ] Measure actual RTO/RPO
Success Criteria Checklist
Issue #99: Queue Handlers ✓
- [ ] All queue handlers have comprehensive error handling
- [ ] Exponential backoff implemented
- [ ] Dead letter queues configured with error context
- [ ] Queue metrics collected and displayed
- [ ] Performance optimizations applied
- [ ] Load tested at 1000+ messages/sec
Issue #100: Sandbox Provisioning ✓
- [ ] Sandbox provisioning integrated with agent execution
- [ ] Resource limits enforced (CPU, memory, timeout)
- [ ] Automated cleanup running every 5 minutes
- [ ] Security hardened with isolation
- [ ] Monitoring dashboard shows sandbox metrics
- [ ] E2E tests passing
Issue #101: Monitoring and Alerting ✓
- [ ] Cloudflare Analytics configured in all workers
- [ ] Error alerting functional with notifications
- [ ] Performance monitoring tracking P95/P99
- [ ] Dashboard created with all key metrics
- [ ] SLIs/SLOs defined and monitored
- [ ] All 5 runbooks completed
Issue #102: Backup and Recovery ✓
- [ ] D1 automated backups running daily
- [ ] KV/R2 backups configured
- [ ] All recovery scripts tested
- [ ] Recovery playbooks written
- [ ] RTO/RPO targets validated
- [ ] Monthly recovery drills scheduled
Overall Completion ✓
- [ ] All 4 Stage 3 issues closed
- [ ] System ready for Stage 4 deployment
- [ ] Documentation complete
- [ ] Team trained on operations
Risk Mitigation
High Risk Areas
Queue Processing Under Load
- Mitigation: Load testing with 2x expected capacity
- Fallback: Circuit breaker to prevent cascading failures
Backup Window Exceeds RTO
- Mitigation: Parallel backup processes
- Fallback: Incremental backup strategy
Alert Fatigue from False Positives
- Mitigation: Careful threshold tuning
- Fallback: Alert aggregation and smart routing
Sandbox Resource Exhaustion
- Mitigation: Strict resource limits and quotas
- Fallback: Automatic sandbox pool scaling
Dependencies
External Services
- Cloudflare Workers (runtime)
- Cloudflare D1 (database)
- Cloudflare R2 (object storage)
- Cloudflare KV (key-value store)
- Cloudflare Analytics Engine
- GitHub Actions (automation)
Internal Packages
- @monotask/cloudflare-shared (utilities)
- @monotask/core (business logic)
- @monotask/shared (types and constants)
Post-Implementation
Monitoring and Maintenance
- Daily review of error logs
- Weekly SLO compliance review
- Monthly backup recovery drill
- Quarterly runbook updates
Documentation Updates
- Update CLAUDE.md with new operational procedures
- Document monitoring dashboard usage
- Create video walkthroughs for runbooks
- Update team wiki with recovery procedures
Next Steps (Stage 4)
After completing Stage 3:
- Configure production DNS and routes (#105)
- Perform data migration (#104)
- Execute staged rollout (#103)
Document Version: 1.0
Last Updated: October 26, 2025
Owner: Development Team
Reviewers: DevOps, SRE