Partial Failure
Overview
This playbook addresses scenarios where one or more system components fail while others remain operational. It provides procedures for isolated component recovery while maintaining service continuity.
RTO Target: Varies by component (15m-2h)
Goal: Minimize total system downtime
When to Use This Playbook
Use this playbook when experiencing:
- Single Worker Failure: One worker down, others operational
- Database Read/Write Issues: D1 queries failing intermittently
- KV Namespace Unavailable: Specific namespace inaccessible
- R2 Bucket Issues: One bucket failing, others working
- Durable Object Errors: Specific DO class experiencing issues
- Queue Processing Failure: Messages not being processed in one queue
- API Endpoint Degradation: Specific endpoints returning errors
Key Characteristic: System partially functional, not total outage
Component Failure Identification
Quick Diagnosis
Check Component Health:
```bash
# Test each worker
curl https://monotask-api-gateway.workers.dev/health
curl https://monotask-task-worker.workers.dev/health
curl https://monotask-agent-worker.workers.dev/health
curl https://monotask-github-worker.workers.dev/health
curl https://monotask-auth-worker.workers.dev/health
curl https://monotask-websocket-worker.workers.dev/health
```
Check Cloudflare Status:
- Cloudflare status page
- Account-specific service alerts
- Regional availability
Check Resource Limits:
- D1 query limits
- Worker CPU time
- KV operation limits
- R2 bandwidth limits
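The quick-diagnosis sweep above can be scripted so one command reports every worker's status. A minimal sketch: the worker names come from this playbook, but `healthUrl` and `sweep` are hypothetical helper names, and `fetchFn` is injectable so the logic can be tested without network access.

```typescript
// Hypothetical health sweep: hit every worker's /health endpoint
// in parallel and report which ones are down.
const WORKERS = [
  "monotask-api-gateway",
  "monotask-task-worker",
  "monotask-agent-worker",
  "monotask-github-worker",
  "monotask-auth-worker",
  "monotask-websocket-worker",
];

export function healthUrl(worker: string): string {
  return `https://${worker}.workers.dev/health`;
}

export async function sweep(
  fetchFn: (url: string) => Promise<{ ok: boolean }> = fetch,
): Promise<{ worker: string; healthy: boolean }[]> {
  return Promise.all(
    WORKERS.map(async (worker) => {
      try {
        const res = await fetchFn(healthUrl(worker));
        return { worker, healthy: res.ok };
      } catch {
        // A network error or timeout counts as unhealthy
        return { worker, healthy: false };
      }
    }),
  );
}
```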
Component-Specific Recovery
API Gateway Worker Failure
Symptoms:
- All API requests failing
- 502/503 errors at gateway
- No traffic reaching service workers
Impact: Critical - entire API unavailable
Recovery Steps:
Identify Issue:
```bash
bunx wrangler tail monotask-api-gateway  # Watch live logs for errors
```
Quick Fix Options:
```bash
# Option 1: Rollback
bunx wrangler rollback --name monotask-api-gateway

# Option 2: Redeploy current version
cd packages/cloudflare-workers/api-gateway
bunx wrangler deploy
```
Verify Recovery:
```bash
curl https://monotask-api-gateway.workers.dev/health
```
RTO: 5-10 minutes
Task Worker Failure
Symptoms:
- Task operations failing
- State transitions not processing
- Task queries returning errors
Impact: High - core functionality impaired
Recovery Steps:
Check Worker Status:
```bash
bunx wrangler tail monotask-task-worker
```
Check D1 Connectivity:
```bash
bunx wrangler d1 execute monotask-production \
  --command "SELECT COUNT(*) FROM tasks" \
  --json
```
Recovery Options:
```bash
# If the worker itself is at fault
bunx wrangler rollback --name monotask-task-worker
```
If D1 is at fault, follow the [D1 Recovery Playbook](./d1-recovery.md).
Workaround: Route traffic through backup worker instance (if configured)
RTO: 10-15 minutes
Agent Worker Failure
Symptoms:
- AI agent executions failing
- Queue messages backing up
- Sandbox provisioning errors
Impact: Medium - automation delayed, manual operations still work
Recovery Steps:
Check Queue Status:
```bash
# Check queue depth
# (Via Cloudflare dashboard or custom monitoring)
```
Identify Failure Point:
- Claude API connectivity?
- Sandbox lifecycle issues?
- Queue processing errors?
Recovery Actions:
```bash
# Rollback worker
bunx wrangler rollback --name monotask-agent-worker

# Or fix and redeploy
cd packages/cloudflare-workers/agent-worker
bunx wrangler deploy
```
Process Backlog:
- Queue will automatically process backlog
- Monitor processing rate
- Increase worker instances if needed (Cloudflare dashboard)
Workaround: Manual task execution while worker recovers
RTO: 20-30 minutes
GitHub Worker Failure
Symptoms:
- Webhook events not processing
- Repository syncs failing
- GitHub API calls timing out
Impact: Medium - GitHub integration down, core features unaffected
Recovery Steps:
Verify GitHub API Status:
- Check GitHub status page
- Test API connectivity
Check Webhook Configuration:
```bash
# Verify webhook endpoint accessible
curl https://monotask-github-worker.workers.dev/webhook
```
Recovery:
```bash
bunx wrangler rollback --name monotask-github-worker
```
Replay Missed Events:
- GitHub webhooks have retry logic
- Or manually trigger sync:
```bash
# Via API call to sync endpoint
curl -X POST https://monotask-api-gateway.workers.dev/api/github/sync
```
Workaround: Manual GitHub operations via UI
RTO: 15-20 minutes
D1 Database Partial Failure
Symptoms:
- Intermittent query failures
- Slow response times
- Connection timeout errors
Impact: Critical - affects all data operations
Diagnosis:
Check D1 Health:
```bash
bunx wrangler d1 execute monotask-production \
  --command "PRAGMA integrity_check" \
  --json
```
Check Query Performance:
```bash
bunx wrangler d1 execute monotask-production \
  --command "EXPLAIN QUERY PLAN SELECT * FROM tasks LIMIT 10" \
  --json
```
Monitor Rate Limits:
- Check if hitting D1 query limits
- Review request patterns
Recovery Options:
If Rate Limited:
- Reduce query frequency
- Implement caching
- Batch operations
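One way to batch operations against D1 is to fold many single-row lookups into one `IN (...)` query, cutting query volume roughly N-fold. This is an illustrative sketch, not code from the repository; `batchSelect` is a hypothetical helper, and the table name must come from trusted code, never user input, since it is interpolated directly into the SQL.

```typescript
// Hypothetical helper: build one parameterized query for N ids instead
// of issuing N separate SELECTs. Pass the result to a D1 binding, e.g.
//   const { sql, params } = batchSelect("tasks", ids);
//   await env.DB.prepare(sql).bind(...params).all();
export function batchSelect(
  table: string, // must be a trusted identifier, not user input
  ids: string[],
): { sql: string; params: string[] } {
  const placeholders = ids.map(() => "?").join(", ");
  return {
    sql: `SELECT * FROM ${table} WHERE id IN (${placeholders})`,
    params: ids,
  };
}
```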
If Database Corruption:
- Follow Data Corruption Playbook
If Connectivity Issues:
- Wait for Cloudflare infrastructure recovery
- Monitor Cloudflare status page
RTO: 15-30 minutes (infrastructure dependent)
KV Namespace Failure
Symptoms:
- Specific KV namespace returning errors
- Keys not accessible
- Write operations failing
Impact: Varies by namespace:
- SESSIONS: High (authentication issues)
- CACHE: Low (performance degradation)
- RATE_LIMITS: Medium (rate limiting disabled)
- FEATURE_FLAGS: Medium (flags unavailable)
- API_KEYS: High (API auth fails)
Recovery Steps:
Identify Affected Namespace:
```bash
# Test each namespace
bunx wrangler kv:key list --namespace-id 9fb88e98a937493e93fa6930f4506302
bunx wrangler kv:key list --namespace-id 32f1ed0a552e453da465bb36df6f1644
# ... etc for each namespace
```
Check Cloudflare KV Status:
- KV service status
- Regional availability
Recovery Options:
Option A: Wait for Service Recovery
- If Cloudflare infrastructure issue
- Monitor status page
Option B: Restore from Backup
```bash
# Restore specific namespace
bun run scripts/recovery/kv-restore.ts <backup-id> \
  --namespace SESSIONS
```
Option C: Failover Strategy
- Use fallback mechanisms in code
- Degrade gracefully without KV
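A fallback for the FEATURE_FLAGS case might look like the sketch below: if the namespace errors, serve hard-coded defaults instead of failing the request. This is an assumed pattern, not code from the repository; `getFlag`, `DEFAULT_FLAGS`, and the flag names are hypothetical, and the `kv` parameter stands in for a KV namespace binding.

```typescript
// Hypothetical defaults used when the FEATURE_FLAGS namespace is down.
const DEFAULT_FLAGS: Record<string, boolean> = { agentExecution: true };

export async function getFlag(
  kv: { get(key: string): Promise<string | null> } | undefined,
  name: string,
): Promise<boolean> {
  try {
    const raw = await kv?.get(name);
    if (raw !== null && raw !== undefined) return raw === "true";
  } catch {
    // KV unavailable: fall through to the defaults below
  }
  return DEFAULT_FLAGS[name] ?? false;
}
```

The same shape works for CACHE (treat errors as cache misses) and RATE_LIMITS (fail open).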
Workarounds by Namespace:
- SESSIONS: Force re-authentication
- CACHE: Accept cache misses
- RATE_LIMITS: Disable rate limiting temporarily
- FEATURE_FLAGS: Use default values
- API_KEYS: Manual verification
RTO: 30-45 minutes
R2 Bucket Failure
Symptoms:
- Cannot upload/download objects
- 404 errors for existing objects
- Slow object access
Impact: Medium - file storage affected
Recovery Steps:
Identify Affected Bucket:
```bash
bunx wrangler r2 object list evidence-storage
bunx wrangler r2 object list screenshots
bunx wrangler r2 object list agent-artifacts
```
Check R2 Status:
- Cloudflare R2 service status
- Regional availability
Recovery Options:
Option A: Use Secondary Bucket
```bash
# Failover to backup bucket
# (Requires code change to use backup bucket binding)
```
Option B: Restore from Replication
```bash
bun run scripts/backup/r2-backup.ts sync <bucket-name>
```
Workaround: Temporarily disable file uploads, queue for later
RTO: 1-2 hours
Durable Object Failure
Symptoms:
- Specific DO operations failing
- Connection errors to DO
- State inconsistencies
Impact: Varies by DO type
Recovery by DO Type:
QueueManager DO:
- Queue processing stops
- Messages accumulate
- Recovery: Redeploy agent worker
SandboxLifecycle DO:
- Sandbox provisioning fails
- Agent execution blocked
- Recovery: Redeploy agent worker, cleanup stuck sandboxes
TaskCoordinator DO:
- Task coordination fails
- State transitions delayed
- Recovery: Redeploy task worker
WebSocketRoom DO:
- Real-time updates stop
- Clients disconnected
- Recovery: Redeploy websocket worker, clients auto-reconnect
General Recovery:
```bash
# Redeploy worker containing DO
bunx wrangler deploy --name <worker-with-do>
```
RTO: 10-15 minutes
Gradual Restoration Strategy
When multiple components affected:
Phase 1: Core Services (0-15 minutes)
Priority: Get basic functionality working
- API Gateway (entry point)
- Auth Worker (authentication)
- D1 Database (data layer)
Phase 2: Primary Features (15-45 minutes)
Priority: Restore main user-facing features
- Task Worker (task operations)
- KV Namespaces (sessions, cache)
- WebSocket Worker (real-time updates)
Phase 3: Advanced Features (45-120 minutes)
Priority: Full functionality
- Agent Worker (AI automation)
- GitHub Worker (integrations)
- R2 Buckets (file storage)
Service Continuity During Recovery
Maintain Available Services
Communicate Partial Availability:
```
⚠️ PARTIAL SERVICE DISRUPTION

Available:
✅ API access
✅ Task viewing
✅ Manual operations

Unavailable:
❌ AI agent execution
❌ GitHub sync
❌ File uploads

ETA: [time]
```
Redirect Traffic:
- Route around failed components
- Use fallback mechanisms
- Graceful degradation
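Routing around a failed component can be as simple as a degradation map at the gateway: requests for routes whose backing component is down get a 503 with a workaround hint instead of an opaque error. This is a hypothetical sketch; `degradedResponse`, `DEGRADED`, and the route prefixes are illustrative, not names from the codebase.

```typescript
// Hypothetical degradation map, edited during an incident to match the
// components that are actually down.
const DEGRADED: Record<string, string> = {
  "/api/agents": "AI agent execution is temporarily unavailable; run tasks manually.",
  "/api/github": "GitHub sync is temporarily unavailable; use the GitHub UI.",
};

// Returns a 503 payload for degraded routes, or null to pass the request
// through to the healthy service worker.
export function degradedResponse(
  path: string,
): { status: number; body: string } | null {
  for (const [prefix, hint] of Object.entries(DEGRADED)) {
    if (path.startsWith(prefix)) {
      return { status: 503, body: JSON.stringify({ error: "degraded", hint }) };
    }
  }
  return null;
}
```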
User Guidance:
- Status banner in UI
- API error messages
- Workaround instructions
Isolated Component Testing
Before declaring recovery complete:
Test Failed Component:
```bash
# Health check
curl https://<worker>.workers.dev/health

# Functional test
# (Component-specific test calls)
```
Test Integration:
- Verify component communicates with dependencies
- Check data flow end-to-end
- Validate cross-component operations
Monitor Metrics:
- Error rates normal
- Response times baseline
- Resource usage expected
Common Partial Failure Patterns
Pattern 1: Cascading Failure
Symptoms:
- Initial component failure
- Dependent components start failing
- Errors propagating through system
Response:
- Identify root cause component
- Recover root component first
- Then recover dependent components
- Implement circuit breakers to prevent future cascades
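A minimal circuit breaker that would stop a cascade like this: after a threshold of consecutive failures the breaker opens and calls fail fast until a cooldown elapses, then one trial call is allowed through. The class and parameter names are illustrative, not from the codebase, and the clock is injectable so the policy can be tested.

```typescript
export class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private threshold = 5,
    private cooldownMs = 30_000,
    private now: () => number = Date.now,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast");
      }
      this.openedAt = null; // half-open: allow one trial call
    }
    try {
      const result = await fn();
      this.failures = 0; // any success closes the circuit
      return result;
    } catch (err) {
      if (++this.failures >= this.threshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

Wrapping each downstream dependency in its own breaker keeps one failed component from consuming the CPU budget of its callers.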
Pattern 2: Intermittent Failure
Symptoms:
- Component works sometimes
- Errors appear randomly
- No clear pattern
Response:
- Check rate limits
- Review resource constraints
- Monitor for external dependencies
- Increase logging/monitoring
- Implement retry logic
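For intermittent failures, retry with exponential backoff is usually the right shape: the delay doubles each attempt so a struggling dependency gets room to recover. A sketch, with `withRetry` as a hypothetical helper name and the `sleep` function injectable for testing:

```typescript
export async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Back off 100ms, 200ms, 400ms, ... before the next attempt
      if (i < attempts - 1) await sleep(baseDelayMs * 2 ** i);
    }
  }
  throw lastErr;
}
```

In production you would also add jitter and cap the delay so synchronized clients do not retry in lockstep.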
Pattern 3: Silent Failure
Symptoms:
- Component appears healthy
- But operations not completing
- No obvious errors
Response:
- Check data flow end-to-end
- Review queue depths
- Verify downstream processing
- Check for stuck operations
- Review timeout configurations
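Stuck operations can be surfaced mechanically by flagging any in-flight work that has exceeded its expected duration. A sketch, with illustrative field and function names; `now` is a parameter so the check is deterministic in tests:

```typescript
interface InFlightOp {
  id: string;
  startedAt: number; // epoch ms when the operation began
}

// Returns the ids of operations older than timeoutMs, i.e. likely stuck.
export function findStuck(
  ops: InFlightOp[],
  timeoutMs: number,
  now: number = Date.now(),
): string[] {
  return ops.filter((op) => now - op.startedAt > timeoutMs).map((op) => op.id);
}
```

Run a check like this on a schedule and alert on a non-empty result; a silently failing component then shows up as a growing stuck list even when its health endpoint is green.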
Recovery Verification Checklist
After component recovery:
- [ ] Health checks passing
- [ ] Functional tests passing
- [ ] Error rates normal
- [ ] Response times baseline
- [ ] Dependent components working
- [ ] End-to-end flows working
- [ ] User-facing features operational
- [ ] Monitoring showing green
- [ ] Logs clean of errors
- [ ] Backlog processing normally
Post-Recovery Monitoring
Monitor for 24-48 hours after recovery:
Metrics to Watch:
- Error rates
- Response times
- Request volumes
- Resource usage
Early Warning Signs:
- Gradual performance degradation
- Increasing error rates
- Resource usage trending up
- User complaints
Alerting:
- Ensure alerts configured
- Test alert triggers
- Verify notification delivery
Prevention Strategies
Circuit Breakers:
- Prevent cascade failures
- Automatic degradation
Health Checks:
- Proactive monitoring
- Early failure detection
Graceful Degradation:
- Fallback mechanisms
- Reduced functionality vs. total failure
Redundancy:
- Multiple worker instances
- Backup data stores
- Alternative execution paths
Related Playbooks
- Worker Rollback - Worker deployment rollback
- D1 Recovery - Database recovery
- Data Corruption - Data integrity issues
- Disaster Recovery - Full system recovery
Revision History
| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-10-26 | 1.0 | Initial playbook | System |