Backup and Recovery System
Overview
This directory contains comprehensive backup, recovery, and disaster recovery procedures for the MonoTask Cloudflare infrastructure.
System Components:
- D1 Database (Primary data store)
- KV Namespaces (Sessions, cache, feature flags, rate limits, API keys)
- R2 Buckets (Evidence storage, screenshots, agent artifacts)
Quick Reference
RTO/RPO Targets
| Component | RTO (Recovery Time) | RPO (Data Loss) | Backup Frequency |
|---|---|---|---|
| D1 Database | 1 hour | 24 hours | Daily at 2 AM UTC |
| KV Namespaces | 30 minutes | 1 hour | Hourly |
| R2 Buckets | 2 hours | 24 hours | Continuous replication |
Backup System
Automated Backups
D1 Database Backups:
- Script: /scripts/backup/d1-backup.ts
- Schedule: Daily at 2 AM UTC (GitHub Actions)
- Features:
- AES-256-GCM encryption
- Retention: Daily (30d), Weekly (90d), Monthly (1y)
- Integrity verification via SHA-256 checksums
- Automatic cleanup of expired backups
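The retention tiers listed above could drive the cleanup step with a check along these lines. This is a minimal sketch and not the logic in d1-backup.ts: the `BackupTier` type, the `isExpired` helper, and reading "1y" as 365 days are assumptions made for illustration.

```typescript
// Hypothetical retention check; names and the tier mapping are illustrative.
type BackupTier = "daily" | "weekly" | "monthly";

// Retention windows in days, taken from the list above ("1y" assumed to mean 365 days).
const RETENTION_DAYS: Record<BackupTier, number> = {
  daily: 30,
  weekly: 90,
  monthly: 365,
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns true when a backup created at `createdAt` is older than its tier's retention window.
function isExpired(tier: BackupTier, createdAt: Date, now: Date = new Date()): boolean {
  const ageMs = now.getTime() - createdAt.getTime();
  return ageMs > RETENTION_DAYS[tier] * MS_PER_DAY;
}
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to exercise in a recovery drill or unit test.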
KV Namespace Backups:
- Script: /scripts/backup/kv-backup.ts
- Schedule: Hourly
- Features:
- Full and incremental backup modes
- Metadata preservation (expiration, custom metadata)
- Per-namespace backup manifests
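One way an incremental mode might decide what to write is to compare the current keys against the base backup's manifest. The sketch below assumes a manifest that records a content hash per key; the `ManifestEntry` shape and the `diffAgainstBase` helper are hypothetical and do not describe the actual format used by kv-backup.ts.

```typescript
// Hypothetical manifest shape: one entry per key in a namespace.
interface ManifestEntry {
  key: string;
  sha256: string;      // hash of the stored value
  expiration?: number; // KV expiration in seconds since epoch, if set
}

interface IncrementalDiff {
  changed: ManifestEntry[]; // new or modified since the base backup
  deleted: string[];        // keys present in the base backup but gone now
}

// Compares the current state with the base manifest and returns only the differences.
function diffAgainstBase(base: ManifestEntry[], current: ManifestEntry[]): IncrementalDiff {
  const baseByKey = new Map<string, ManifestEntry>(
    base.map((e): [string, ManifestEntry] => [e.key, e]),
  );
  const currentKeys = new Set(current.map((e) => e.key));

  const changed = current.filter((e) => {
    const prev = baseByKey.get(e.key);
    return !prev || prev.sha256 !== e.sha256 || prev.expiration !== e.expiration;
  });
  const deleted = base.filter((e) => !currentKeys.has(e.key)).map((e) => e.key);

  return { changed, deleted };
}
```

Tracking deletions separately matters on restore: replaying only the changed entries on top of a full backup would otherwise resurrect keys that were removed after the base was taken.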
R2 Bucket Backups:
- Script: /scripts/backup/r2-backup.ts
- Method: Cross-region replication to secondary bucket
- Features:
- Object versioning
- Lifecycle policies
- Replication lag monitoring
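Replication lag monitoring could be approximated by comparing object timestamps in the primary and secondary buckets. The sketch below is an assumption about how such a check might work, not a description of r2-backup.ts; the `ObjectTimestamps` shape and both helpers are hypothetical, and the 2-hour default mirrors the replication lag alert threshold listed under Alerts in this document.

```typescript
// Hypothetical per-object view combining listings from both buckets.
interface ObjectTimestamps {
  key: string;
  primaryUploaded: Date;    // last write time in the primary bucket
  secondaryUploaded?: Date; // undefined if the object has not been replicated yet
}

const TWO_HOURS_MS = 2 * 60 * 60 * 1000;

// Lag is zero once the secondary copy is at least as new as the primary;
// otherwise it is the time the latest primary write has been waiting.
function replicationLagMs(obj: ObjectTimestamps, now: Date): number {
  const caughtUp =
    obj.secondaryUploaded !== undefined &&
    obj.secondaryUploaded.getTime() >= obj.primaryUploaded.getTime();
  return caughtUp ? 0 : Math.max(0, now.getTime() - obj.primaryUploaded.getTime());
}

// Returns the keys whose lag exceeds the threshold.
function laggingObjects(
  objects: ObjectTimestamps[],
  now: Date,
  thresholdMs: number = TWO_HOURS_MS,
): string[] {
  return objects.filter((o) => replicationLagMs(o, now) > thresholdMs).map((o) => o.key);
}
```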
Manual Backup Commands
```bash
# D1 Database
bun run backup:d1 # Create backup
bun run backup:d1:list # List backups
bun run backup:d1:cleanup # Remove expired backups

# KV Namespaces
bun run backup:kv # Full backup
bun run backup:kv:incremental <id> # Incremental from base
bun run backup:kv:list # List backups

# R2 Buckets
bun run backup:r2:configure # Setup replication
bun run backup:r2:sync # Manual sync to secondary
bun run backup:r2:status # Check replication status
```
Recovery System
Recovery Scripts
D1 Database Recovery:
- Script: /scripts/recovery/d1-restore.ts
- Usage:
```bash
bun run recovery:d1 <backup-id>
bun run recovery:d1 <backup-id> --validate-only # Dry run
```
KV Namespace Recovery:
- Script: /scripts/recovery/kv-restore.ts
- Usage:
```bash
bun run recovery:kv <backup-id> # All namespaces
bun run recovery:kv <backup-id> --namespace SESSIONS # Single namespace
bun run recovery:kv <backup-id> --clear # Clear before restore
bun run recovery:kv <backup-id> --skip # Skip existing keys
```
Disaster Recovery Orchestration:
- Script: /scripts/recovery/disaster-recovery.ts
- Usage:
```bash
bun run recovery:disaster <d1-id> <kv-id> # Full recovery
bun run recovery:disaster <d1-id> <kv-id> --dry-run # Simulate
bun run recovery:disaster <d1-id> <kv-id> --components d1,kv # Partial
```
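To illustrate how the orchestration arguments shown above fit together, here is a minimal sketch of turning them into a recovery plan. It is not taken from disaster-recovery.ts: the `RecoveryPlan` type and `parseRecoveryArgs` function are hypothetical, and treating `d1`, `kv`, and `r2` as the default component set for a full recovery is an assumption.

```typescript
type Component = "d1" | "kv" | "r2";

// Assumed default: a full recovery covers every component.
const ALL_COMPONENTS: Component[] = ["d1", "kv", "r2"];

interface RecoveryPlan {
  d1BackupId: string;
  kvBackupId: string;
  components: Component[];
  dryRun: boolean;
}

// Parses `<d1-id> <kv-id> [--dry-run] [--components a,b]` into a plan.
function parseRecoveryArgs(argv: string[]): RecoveryPlan {
  const positional: string[] = [];
  let components: Component[] = [...ALL_COMPONENTS];
  let dryRun = false;

  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--dry-run") {
      dryRun = true;
    } else if (arg === "--components") {
      // The flag takes the next argument as a comma-separated list.
      const requested = (argv[++i] ?? "").split(",").map((c) => c.trim());
      const unknown = requested.filter((c) => !ALL_COMPONENTS.includes(c as Component));
      if (unknown.length > 0) {
        throw new Error(`Unknown component(s): ${unknown.join(", ")}`);
      }
      components = requested as Component[];
    } else {
      positional.push(arg);
    }
  }

  if (positional.length !== 2) {
    throw new Error("Expected exactly two backup IDs: <d1-id> <kv-id>");
  }
  const [d1BackupId, kvBackupId] = positional;
  return { d1BackupId, kvBackupId, components, dryRun };
}
```

Rejecting unknown component names up front avoids starting a partial recovery with a typo silently ignored.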
Recovery Playbooks
Detailed step-by-step procedures for different failure scenarios:
D1 Recovery Playbook
When to use:
- Database corruption
- Accidental data deletion
- Failed migration
- Need to restore to previous state
Key sections:
- Backup identification
- Validation procedures
- Step-by-step restoration
- Data integrity verification
- Rollback procedures
Worker Rollback Playbook
When to use:
- High error rates after deployment
- Performance degradation
- Broken functionality
- Security issues
Key sections:
- Quick rollback (< 5 minutes)
- Multi-worker rollback strategy
- Traffic switching
- Verification procedures
Data Corruption Playbook
When to use:
- Integrity violations
- Checksum mismatches
- Schema corruption
- Inconsistent data state
Key sections:
- Corruption detection
- Impact assessment
- Point-in-time recovery
- Selective data repair
- Prevention measures
Partial Failure Playbook
When to use:
- Single component failure
- Specific worker down
- KV namespace unavailable
- R2 bucket issues
Key sections:
- Component identification
- Isolated recovery
- Service continuity
- Gradual restoration
Automated Testing
Recovery Drill Script
Purpose: Monthly validation of backup/recovery procedures
Script: /scripts/backup/test-recovery.ts
Usage:
```bash
bun run recovery:test # Full test suite
bun run recovery:test --components d1 # Test D1 only
bun run recovery:test --skip-validation # Skip validation steps
```
What it tests:
- Backup creation and integrity
- Restoration procedures
- Data validation
- RTO measurements
- End-to-end integration
Schedule: First day of each month (automated via GitHub Actions)
Output:
- JSON test report with RTO/RPO measurements
- Pass/fail status for each component
- Recommendations for improvements
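The pass/fail decision in the report could be derived by comparing each measured restore time with the RTO targets from the table in this document. The sketch below is illustrative only: the `DrillMeasurement` and `DrillResult` shapes and the `evaluateDrill` function are assumptions, not the actual report format produced by test-recovery.ts.

```typescript
type ComponentName = "d1" | "kv" | "r2";

// RTO targets in minutes, copied from the RTO/RPO table in this document.
const RTO_TARGET_MINUTES: Record<ComponentName, number> = { d1: 60, kv: 30, r2: 120 };

interface DrillMeasurement {
  component: ComponentName;
  restoreMinutes: number; // measured time to restore during the drill
  dataValid: boolean;     // result of the data validation step
}

interface DrillResult extends DrillMeasurement {
  rtoTargetMinutes: number;
  passed: boolean;
}

// A component passes when its data validated and the restore finished within the RTO target.
function evaluateDrill(measurements: DrillMeasurement[]): { passed: boolean; results: DrillResult[] } {
  const results: DrillResult[] = measurements.map((m) => {
    const rtoTargetMinutes = RTO_TARGET_MINUTES[m.component];
    return { ...m, rtoTargetMinutes, passed: m.dataValid && m.restoreMinutes <= rtoTargetMinutes };
  });
  return { passed: results.every((r) => r.passed), results };
}
```

The returned object serializes directly with `JSON.stringify`, which fits a JSON report.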
Environment Variables
Required for backup/recovery operations:
```bash
# Cloudflare Authentication
CLOUDFLARE_API_TOKEN=<your-api-token>
CLOUDFLARE_ACCOUNT_ID=<your-account-id>

# Backup Configuration
BACKUP_R2_BUCKET=monotask-backups
SECONDARY_R2_BUCKET=monotask-backups-secondary
BACKUP_ENCRYPTION_KEY=<64-char-hex-string> # Generate: openssl rand -hex 32

# Database IDs (from wrangler.toml)
D1_DATABASE_NAME=monotask-production
D1_DATABASE_ID=<your-database-id>

# For testing
TEST_DATABASE_NAME=monotask-recovery-test
TEST_DATABASE_ID=<test-database-id>
```
GitHub Actions Workflows
D1 Backup Workflow
File: .github/workflows/d1-backup.yml
Schedule: Daily at 2 AM UTC
Steps:
- Create D1 backup
- Upload to R2 with encryption
- Cleanup expired backups
- Verify backup integrity
- Send notifications (success/failure)
Monthly recovery test: Runs full recovery drill on first day of month
Best Practices
Before Recovery
Identify the correct backup:
- Check timestamp
- Verify backup size
- Review manifest file
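The selection criteria above can be sketched as a small filter over the backup list. This is a hypothetical helper, not part of the recovery scripts: the `BackupInfo` shape and `selectBackup` function are assumptions, and a non-zero size is only a coarse sanity check that does not replace reviewing the manifest.

```typescript
// Hypothetical summary of one backup, e.g. as read from a backup listing.
interface BackupInfo {
  id: string;
  createdAt: Date;
  sizeBytes: number;
}

// Picks the newest non-empty backup taken strictly before the incident.
// Returns undefined when no backup qualifies.
function selectBackup(backups: BackupInfo[], incidentAt: Date): BackupInfo | undefined {
  return backups
    .filter((b) => b.sizeBytes > 0 && b.createdAt.getTime() < incidentAt.getTime())
    .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime())[0];
}
```

Choosing a backup from before the incident avoids restoring data that already contains the corruption or deletion being recovered from.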
Validate the backup:
- Use the --validate-only flag
- Verify checksum
- Check decryption works
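The checksum and decryption checks might look like the following round trip using Node's built-in crypto module, which Bun also provides. This is a sketch under stated assumptions rather than the behaviour of the restore scripts: the `EncryptedBackup` layout, the helper names, and computing the SHA-256 checksum over the plaintext are all hypothetical. The key format matches the 64-character hex string described for BACKUP_ENCRYPTION_KEY.

```typescript
import { Buffer } from "node:buffer";
import { createCipheriv, createDecipheriv, createHash, randomBytes } from "node:crypto";

// Hypothetical on-disk layout of one encrypted backup.
interface EncryptedBackup {
  iv: Buffer;         // 12-byte nonce, unique per backup
  authTag: Buffer;    // GCM authentication tag
  ciphertext: Buffer;
  sha256: string;     // hex checksum of the plaintext (assumed)
}

// Parses a key given as 64 hex characters (32 bytes), as AES-256 requires.
function parseKey(hex: string): Buffer {
  if (!/^[0-9a-fA-F]{64}$/.test(hex)) throw new Error("Key must be 64 hex characters");
  return Buffer.from(hex, "hex");
}

function encryptBackup(plaintext: Buffer, key: Buffer): EncryptedBackup {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  const sha256 = createHash("sha256").update(plaintext).digest("hex");
  return { iv, authTag: cipher.getAuthTag(), ciphertext, sha256 };
}

// Throws if the key is wrong, the data was modified, or the checksum differs.
function decryptAndVerify(backup: EncryptedBackup, key: Buffer): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, backup.iv);
  decipher.setAuthTag(backup.authTag);
  const plaintext = Buffer.concat([decipher.update(backup.ciphertext), decipher.final()]);
  const actual = createHash("sha256").update(plaintext).digest("hex");
  if (actual !== backup.sha256) throw new Error("Checksum mismatch");
  return plaintext;
}
```

Because GCM is authenticated, `decipher.final()` already fails on a wrong key or tampered ciphertext; the separate checksum guards against a mismatch between the backup and its manifest.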
Create pre-recovery backup:
- Always backup current state
- Note rollback backup ID
- Document current state
Communicate:
- Notify team
- Update status page
- Set expectations
During Recovery
Monitor progress:
- Watch script output
- Check for errors
- Note warnings
Verify each step:
- Don't skip validation
- Check integrity
- Test functionality
Document:
- Record backup IDs
- Note any issues
- Track timing
After Recovery
Validate thoroughly:
- Run smoke tests
- Check data integrity
- Verify all features work
Monitor closely:
- Watch error rates
- Check performance
- Monitor for 24-48 hours
Document incident:
- Create incident report
- Update playbooks
- Share learnings
Improve:
- Address root cause
- Update procedures
- Enhance monitoring
Troubleshooting
Common Issues
Backup Encryption Key Missing:
```bash
# Generate a new key (existing backups can only be decrypted with the key that encrypted them)
openssl rand -hex 32
# Set in environment
export BACKUP_ENCRYPTION_KEY=<generated-key>
```
Wrangler Authentication Fails:
```bash
# Login to Cloudflare
bunx wrangler login
# Or use API token
export CLOUDFLARE_API_TOKEN=<your-token>
```
Backup Download Fails:
- Check R2 bucket name
- Verify backup exists
- Check permissions
- Retry operation
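For the retry step, a bounded retry with exponential backoff is a common pattern. The sketch below is a generic illustration, not code from the backup scripts; `withRetry`, `backoffDelayMs`, and the default delays are assumptions.

```typescript
// Delay before the next try after failed attempt number `attempt` (1-based):
// baseMs, 2x, 4x, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

// Runs `operation` up to `maxAttempts` times, waiting between failures.
// Rethrows the last error if every attempt fails.
async function withRetry<T>(operation: () => Promise<T>, maxAttempts: number = 4): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
      }
    }
  }
  throw lastError;
}
```

Retrying only helps with transient failures; a wrong bucket name or missing permission will fail on every attempt, so those checks come first.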
Restoration Takes Too Long:
- Check Cloudflare service status
- Verify network connectivity
- Monitor rate limits
- Break into smaller batches if needed
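Breaking a restore into smaller batches could be done with a helper like the one below. It is a generic sketch with hypothetical names (`chunk`, `restoreInBatches`); the batch size and the optional pause between batches are tuning assumptions, not values from the recovery scripts.

```typescript
// Splits `items` into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Restores batches one at a time, optionally pausing between them to stay under rate limits.
// Returns the number of items restored.
async function restoreInBatches<T>(
  items: T[],
  size: number,
  restoreBatch: (batch: T[]) => Promise<void>,
  pauseMs: number = 0,
): Promise<number> {
  let restored = 0;
  for (const batch of chunk(items, size)) {
    await restoreBatch(batch);
    restored += batch.length;
    if (pauseMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }
  }
  return restored;
}
```

Processing batches sequentially also makes progress easy to report and lets a failed restore resume from the last completed batch.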
Support and Escalation
Self-Service Resources
- Check playbooks in this directory
- Review script help output
- Search error messages in docs
- Check Cloudflare status page
Escalation Path
Level 1 (< 30 minutes):
- Follow relevant playbook
- Attempt standard recovery
Level 2 (30-60 minutes):
- Consult DevOps team
- Review with senior engineer
Level 3 (> 60 minutes):
- Escalate to engineering manager
- Consider disaster recovery
Level 4 (Critical):
- Executive notification
- Invoke full disaster recovery plan
Metrics and Monitoring
Backup Health Metrics
- Backup success rate (target: 100%)
- Backup duration (track trends)
- Backup size (monitor growth)
- Retention compliance
- Encryption validation rate
Recovery Metrics
- RTO actual vs. target
- RPO actual vs. target
- Recovery success rate
- Recovery drill pass rate
- Mean time to recovery
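As an illustration of how two of these metrics could be computed from recorded recoveries, the sketch below derives the success rate and mean time to recovery. The `RecoveryRecord` shape and `summarizeRecoveries` function are hypothetical, and averaging only over successful recoveries is an assumption about how the metric is defined.

```typescript
// Hypothetical record of one recovery attempt (incident or drill).
interface RecoveryRecord {
  startedAt: Date;
  finishedAt: Date;
  succeeded: boolean;
}

interface RecoverySummary {
  successRate: number;                // fraction between 0 and 1
  meanTimeToRecoveryMinutes?: number; // undefined when nothing succeeded
}

// Returns undefined when there are no records to summarize.
function summarizeRecoveries(records: RecoveryRecord[]): RecoverySummary | undefined {
  if (records.length === 0) return undefined;

  const successes = records.filter((r) => r.succeeded);
  const totalMinutes = successes.reduce(
    (sum, r) => sum + (r.finishedAt.getTime() - r.startedAt.getTime()) / 60000,
    0,
  );

  return {
    successRate: successes.length / records.length,
    meanTimeToRecoveryMinutes: successes.length > 0 ? totalMinutes / successes.length : undefined,
  };
}
```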
Alerts
- Backup failure
- Backup duration exceeding baseline
- Expired backup cleanup failure
- RTO target exceeded in tests
- Replication lag > 2 hours
Maintenance
Daily
- Verify automated backups completed
- Check backup success notifications
Weekly
- Review backup retention
- Monitor storage usage
- Check replication status
Monthly
- Run recovery drill
- Review and update playbooks
- Analyze RTO/RPO trends
- Test disaster recovery
Quarterly
- Full playbook review
- Update documentation
- Team training refresh
- Process improvement review
Additional Resources
- Cloudflare D1 Documentation
- Cloudflare KV Documentation
- Cloudflare R2 Documentation
- Wrangler CLI Reference
Revision History
| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-10-26 | 1.0 | Initial backup/recovery system | System |
Feedback and Improvements
This backup and recovery system is continuously improved based on:
- Recovery drill results
- Actual incident experiences
- Team feedback
- Technology updates
To suggest improvements:
- Document issues during recovery
- Update playbooks with learnings
- Share with team
- Schedule regular reviews