
Partial Failure

Overview

This playbook addresses scenarios where one or more system components fail while others remain operational. It provides procedures for isolated component recovery while maintaining service continuity.

RTO Target: Varies by component (15 minutes - 2 hours)
Goal: Minimize total system downtime


When to Use This Playbook

Use this playbook when experiencing:

  • Single Worker Failure: One worker down, others operational
  • Database Read/Write Issues: D1 queries failing intermittently
  • KV Namespace Unavailable: Specific namespace inaccessible
  • R2 Bucket Issues: One bucket failing, others working
  • Durable Object Errors: Specific DO class experiencing issues
  • Queue Processing Failure: Messages not being processed in one queue
  • API Endpoint Degradation: Specific endpoints returning errors

Key Characteristic: System partially functional, not total outage


Component Failure Identification

Quick Diagnosis

  1. Check Component Health:

    bash
    # Test each worker
    curl https://monotask-api-gateway.workers.dev/health
    curl https://monotask-task-worker.workers.dev/health
    curl https://monotask-agent-worker.workers.dev/health
    curl https://monotask-github-worker.workers.dev/health
    curl https://monotask-auth-worker.workers.dev/health
    curl https://monotask-websocket-worker.workers.dev/health
  2. Check Cloudflare Status:

    • Cloudflare status page
    • Account-specific service alerts
    • Regional availability
  3. Check Resource Limits:

    • D1 query limits
    • Worker CPU time
    • KV operation limits
    • R2 bandwidth limits
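The per-worker checks in step 1 can be scripted as a single sweep. A minimal sketch, assuming each worker exposes a `/health` route that returns HTTP 200 when healthy:

```bash
# Sweep every worker's /health endpoint and flag failures.
# Worker names come from the checklist above; "200 means healthy" is an assumption.

status_label() {
  # Pure helper: map an HTTP status code to a report label.
  if [ "$1" = "200" ]; then echo "OK"; else echo "FAIL ($1)"; fi
}

for w in api-gateway task-worker agent-worker github-worker auth-worker websocket-worker; do
  code=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" \
    "https://monotask-$w.workers.dev/health")
  echo "monotask-$w: $(status_label "$code")"
done
```

Any line printing `FAIL` points you at the component-specific recovery section below.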

Component-Specific Recovery

API Gateway Worker Failure

Symptoms:

  • All API requests failing
  • 502/503 errors at gateway
  • No traffic reaching service workers

Impact: Critical - entire API unavailable

Recovery Steps:

  1. Identify Issue:

    bash
    bunx wrangler tail monotask-api-gateway
    # Watch live logs for errors
  2. Quick Fix Options:

    bash
    # Option 1: Rollback
    bunx wrangler rollback --name monotask-api-gateway
    
    # Option 2: Redeploy current version
    cd packages/cloudflare-workers/api-gateway
    bunx wrangler deploy
  3. Verify Recovery:

    bash
    curl https://monotask-api-gateway.workers.dev/health

RTO: 5-10 minutes


Task Worker Failure

Symptoms:

  • Task operations failing
  • State transitions not processing
  • Task queries returning errors

Impact: High - core functionality impaired

Recovery Steps:

  1. Check Worker Status:

    bash
    bunx wrangler tail monotask-task-worker
  2. Check D1 Connectivity:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "SELECT COUNT(*) FROM tasks" \
      --json
  3. Recovery Options:

    bash
    # If worker issue
    bunx wrangler rollback --name monotask-task-worker
    
    # If D1 issue, see the D1 Recovery Playbook (./d1-recovery.md)

Workaround: Route traffic through backup worker instance (if configured)

RTO: 10-15 minutes


Agent Worker Failure

Symptoms:

  • AI agent executions failing
  • Queue messages backing up
  • Sandbox provisioning errors

Impact: Medium - automation delayed, manual operations still work

Recovery Steps:

  1. Check Queue Status:

    bash
    # Check queue depth via the Cloudflare dashboard or custom monitoring;
    # recent wrangler versions can also list the configured queues:
    bunx wrangler queues list
  2. Identify Failure Point:

    • Claude API connectivity?
    • Sandbox lifecycle issues?
    • Queue processing errors?
  3. Recovery Actions:

    bash
    # Rollback worker
    bunx wrangler rollback --name monotask-agent-worker
    
    # Or fix and redeploy
    cd packages/cloudflare-workers/agent-worker
    bunx wrangler deploy
  4. Process Backlog:

    • Queue will automatically process backlog
    • Monitor processing rate
    • Increase worker instances if needed (Cloudflare dashboard)
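One rough way to watch the processing rate is to sample the worker's live logs for a fixed window and normalize to events per minute. A sketch, assuming one log line per processed message (the `--format json` flag is real `wrangler tail` syntax; the one-line-per-message log shape is an assumption about your logging):

```bash
# Estimate the queue consumer's processing rate from a short log sample.

rate_per_min() {
  # Pure helper: events observed over a window (seconds) -> events/minute.
  echo $(( $1 * 60 / $2 ))
}

sample_rate() {
  local window="${1:-60}" log=/tmp/agent-tail.jsonl
  bunx wrangler tail monotask-agent-worker --format json > "$log" &
  local pid=$!
  sleep "$window"
  kill "$pid" 2>/dev/null
  rate_per_min "$(wc -l < "$log")" "$window"
}

# Usage: sample_rate 60   # approx. messages/minute over a 60s window
```

If the rate stays near zero while queue depth grows, the consumer is stuck rather than merely slow.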

Workaround: Manual task execution while worker recovers

RTO: 20-30 minutes


GitHub Worker Failure

Symptoms:

  • Webhook events not processing
  • Repository syncs failing
  • GitHub API calls timing out

Impact: Medium - GitHub integration down, core features unaffected

Recovery Steps:

  1. Verify GitHub API Status:

    • Check GitHub status page
    • Test API connectivity
  2. Check Webhook Configuration:

    bash
    # Verify webhook endpoint accessible
    curl https://monotask-github-worker.workers.dev/webhook
  3. Recovery:

    bash
    bunx wrangler rollback --name monotask-github-worker
  4. Replay Missed Events:

    • GitHub webhooks have retry logic
    • Or manually trigger sync:
    bash
    # Via API call to sync endpoint
    curl -X POST https://monotask-api-gateway.workers.dev/api/github/sync

Workaround: Manual GitHub operations via UI

RTO: 15-20 minutes


D1 Database Partial Failure

Symptoms:

  • Intermittent query failures
  • Slow response times
  • Connection timeout errors

Impact: Critical - affects all data operations

Diagnosis:

  1. Check D1 Health:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "PRAGMA integrity_check" \
      --json
  2. Check Query Performance:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "EXPLAIN QUERY PLAN SELECT * FROM tasks LIMIT 10" \
      --json
  3. Monitor Rate Limits:

    • Check if hitting D1 query limits
    • Review request patterns

Recovery Options:

  1. If Rate Limited:

    • Reduce query frequency
    • Implement caching
    • Batch operations
  2. If Database Corruption:

    • Follow the D1 Recovery Playbook (./d1-recovery.md)

  3. If Connectivity Issues:

    • Wait for Cloudflare infrastructure recovery
    • Monitor Cloudflare status page
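When rate limiting is the cause, the biggest win is usually collapsing N single-row statements into one batched statement. A sketch of building a batched `INSERT` from the shell (the table and column names are illustrative, not from the real schema, and the quoting is naive, so don't feed it untrusted input):

```bash
# Build a multi-row VALUES clause so N rows cost one D1 request instead of N.
build_values() {
  local out="" v
  for v in "$@"; do
    out="$out,('$v')"
  done
  echo "${out#,}"   # strip the leading comma
}

sql="INSERT INTO audit_log (note) VALUES $(build_values a b c)"
echo "$sql"
# One request instead of three:
# bunx wrangler d1 execute monotask-production --command "$sql"
```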

RTO: 15-30 minutes (infrastructure dependent)


KV Namespace Failure

Symptoms:

  • Specific KV namespace returning errors
  • Keys not accessible
  • Write operations failing

Impact: Varies by namespace:

  • SESSIONS: High (authentication issues)
  • CACHE: Low (performance degradation)
  • RATE_LIMITS: Medium (rate limiting disabled)
  • FEATURE_FLAGS: Medium (flags unavailable)
  • API_KEYS: High (API auth fails)

Recovery Steps:

  1. Identify Affected Namespace:

    bash
    # Test each namespace
    bunx wrangler kv:key list --namespace-id 9fb88e98a937493e93fa6930f4506302
    bunx wrangler kv:key list --namespace-id 32f1ed0a552e453da465bb36df6f1644
    # ... etc for each namespace
  2. Check Cloudflare KV Status:

    • KV service status
    • Regional availability
  3. Recovery Options:

    Option A: Wait for Service Recovery

    • If Cloudflare infrastructure issue
    • Monitor status page

    Option B: Restore from Backup

    bash
    # Restore specific namespace
    bun run scripts/recovery/kv-restore.ts <backup-id> \
      --namespace SESSIONS

    Option C: Failover Strategy

    • Use fallback mechanisms in code
    • Degrade gracefully without KV
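The namespace checks from step 1 can be folded into one probe loop. A sketch using the two namespace IDs shown above (extend the list with the remaining IDs):

```bash
# Probe KV namespaces; a failing list call marks the namespace unavailable.
probe_ns() {
  if bunx wrangler kv:key list --namespace-id "$1" > /dev/null 2>&1; then
    echo "OK"
  else
    echo "UNAVAILABLE"
  fi
}

for ns in 9fb88e98a937493e93fa6930f4506302 32f1ed0a552e453da465bb36df6f1644; do
  echo "$ns: $(probe_ns "$ns")"
done
```

An `UNAVAILABLE` result tells you which namespace's workaround (listed below) to activate.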

Workarounds by Namespace:

  • SESSIONS: Force re-authentication
  • CACHE: Accept cache misses
  • RATE_LIMITS: Disable rate limiting temporarily
  • FEATURE_FLAGS: Use default values
  • API_KEYS: Manual verification

RTO: 30-45 minutes


R2 Bucket Failure

Symptoms:

  • Cannot upload/download objects
  • 404 errors for existing objects
  • Slow object access

Impact: Medium - file storage affected

Recovery Steps:

  1. Identify Affected Bucket:

    bash
    bunx wrangler r2 object list evidence-storage
    bunx wrangler r2 object list screenshots
    bunx wrangler r2 object list agent-artifacts
  2. Check R2 Status:

    • Cloudflare R2 service status
    • Regional availability
  3. Recovery Options:

    Option A: Use Secondary Bucket

    bash
    # Failover to backup bucket
    # (Requires code change to use backup bucket binding)

    Option B: Restore from Replication

    bash
    bun run scripts/backup/r2-backup.ts sync <bucket-name>

Workaround: Temporarily disable file uploads, queue for later

RTO: 1-2 hours


Durable Object Failure

Symptoms:

  • Specific DO operations failing
  • Connection errors to DO
  • State inconsistencies

Impact: Varies by DO type

Recovery by DO Type:

  1. QueueManager DO:

    • Queue processing stops
    • Messages accumulate
    • Recovery: Redeploy agent worker
  2. SandboxLifecycle DO:

    • Sandbox provisioning fails
    • Agent execution blocked
    • Recovery: Redeploy agent worker, cleanup stuck sandboxes
  3. TaskCoordinator DO:

    • Task coordination fails
    • State transitions delayed
    • Recovery: Redeploy task worker
  4. WebSocketRoom DO:

    • Real-time updates stop
    • Clients disconnected
    • Recovery: Redeploy websocket worker, clients auto-reconnect

General Recovery:

bash
# Redeploy worker containing DO
bunx wrangler deploy --name <worker-with-do>

RTO: 10-15 minutes


Gradual Restoration Strategy

When multiple components affected:

Phase 1: Core Services (0-15 minutes)

Priority: Get basic functionality working

  1. API Gateway (entry point)
  2. Auth Worker (authentication)
  3. D1 Database (data layer)

Phase 2: Primary Features (15-45 minutes)

Priority: Restore main user-facing features

  1. Task Worker (task operations)
  2. KV Namespaces (sessions, cache)
  3. WebSocket Worker (real-time updates)

Phase 3: Advanced Features (45-120 minutes)

Priority: Full functionality

  1. Agent Worker (AI automation)
  2. GitHub Worker (integrations)
  3. R2 Buckets (file storage)
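The three phases above can be driven by a small script that redeploys each worker in priority order and gates on its health check before continuing. A sketch, assuming the directory layout from the recovery steps (`packages/cloudflare-workers/<name>`) and a `/health` route per worker:

```bash
# Redeploy a worker, then gate on its health endpoint before moving on.

worker_dir() {
  # Pure helper: strip the "monotask-" prefix to get the package directory.
  echo "packages/cloudflare-workers/${1#monotask-}"
}

restore() {
  ( cd "$(worker_dir "$1")" && bunx wrangler deploy ) || return 1
  curl -sf --max-time 10 "https://$1.workers.dev/health" > /dev/null
}

# Phase 1: core services; stop the phase on the first failure.
for w in monotask-api-gateway monotask-auth-worker; do
  restore "$w" || { echo "$w failed to restore; halting phase"; break; }
done
# Phases 2 and 3 follow the same pattern once Phase 1 is green.
```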

Service Continuity During Recovery

Maintain Available Services

  1. Communicate Partial Availability:

    ⚠️ PARTIAL SERVICE DISRUPTION
    
    Available:
    ✅ API access
    ✅ Task viewing
    ✅ Manual operations
    
    Unavailable:
    ❌ AI agent execution
    ❌ GitHub sync
    ❌ File uploads
    
    ETA: [time]
  2. Redirect Traffic:

    • Route around failed components
    • Use fallback mechanisms
    • Graceful degradation
  3. User Guidance:

    • Status banner in UI
    • API error messages
    • Workaround instructions

Isolated Component Testing

Before declaring recovery complete:

  1. Test Failed Component:

    bash
    # Health check
    curl https://<worker>.workers.dev/health
    
    # Functional test
    # (Component-specific test calls)
  2. Test Integration:

    • Verify component communicates with dependencies
    • Check data flow end-to-end
    • Validate cross-component operations
  3. Monitor Metrics:

    • Error rates normal
    • Response times baseline
    • Resource usage expected

Common Partial Failure Patterns

Pattern 1: Cascading Failure

Symptoms:

  • Initial component failure
  • Dependent components start failing
  • Errors propagating through system

Response:

  1. Identify root cause component
  2. Recover root component first
  3. Then recover dependent components
  4. Implement circuit breakers to prevent future cascades

Pattern 2: Intermittent Failure

Symptoms:

  • Component works sometimes
  • Errors appear randomly
  • No clear pattern

Response:

  1. Check rate limits
  2. Review resource constraints
  3. Monitor for external dependencies
  4. Increase logging/monitoring
  5. Implement retry logic
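For the retry-logic step, a generic wrapper with exponential backoff is often enough at the CLI level. A minimal sketch (attempt counts and delays are illustrative):

```bash
# Retry a command up to N times with exponential backoff (1s, 2s, 4s, ...).
retry() {
  local attempts="$1" delay=1 i=1
  shift
  while ! "$@"; do
    [ "$i" -ge "$attempts" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    i=$((i + 1))
  done
}

# Usage: retry the flaky health check a few times before declaring failure.
# retry 5 curl -sf https://monotask-api-gateway.workers.dev/health
```

Backoff smooths over transient rate-limit or connectivity blips without hammering the component while it recovers.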

Pattern 3: Silent Failure

Symptoms:

  • Component appears healthy
  • But operations not completing
  • No obvious errors

Response:

  1. Check data flow end-to-end
  2. Review queue depths
  3. Verify downstream processing
  4. Check for stuck operations
  5. Review timeout configurations

Recovery Verification Checklist

After component recovery:

  • [ ] Health checks passing
  • [ ] Functional tests passing
  • [ ] Error rates normal
  • [ ] Response times baseline
  • [ ] Dependent components working
  • [ ] End-to-end flows working
  • [ ] User-facing features operational
  • [ ] Monitoring showing green
  • [ ] Logs clean of errors
  • [ ] Backlog processing normally

Post-Recovery Monitoring

Monitor for 24-48 hours after recovery:

  1. Metrics to Watch:

    • Error rates
    • Response times
    • Request volumes
    • Resource usage
  2. Early Warning Signs:

    • Gradual performance degradation
    • Increasing error rates
    • Resource usage trending up
    • User complaints
  3. Alerting:

    • Ensure alerts configured
    • Test alert triggers
    • Verify notification delivery

Prevention Strategies

  1. Circuit Breakers:

    • Prevent cascade failures
    • Automatic degradation
  2. Health Checks:

    • Proactive monitoring
    • Early failure detection
  3. Graceful Degradation:

    • Fallback mechanisms
    • Reduced functionality vs. total failure
  4. Redundancy:

    • Multiple worker instances
    • Backup data stores
    • Alternative execution paths


Revision History

Date       | Version | Changes          | Author
2025-10-26 | 1.0     | Initial playbook | System

MonoKernel MonoTask Documentation