
Backup and Recovery System

Overview

This directory contains comprehensive backup, recovery, and disaster recovery procedures for the MonoTask Cloudflare infrastructure.

System Components:

  • D1 Database (Primary data store)
  • KV Namespaces (Sessions, cache, feature flags, rate limits, API keys)
  • R2 Buckets (Evidence storage, screenshots, agent artifacts)

Quick Reference

RTO/RPO Targets

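| Component     | RTO (Recovery Time) | RPO (Data Loss) | Backup Frequency       |
|---------------|---------------------|-----------------|------------------------|
| D1 Database   | 1 hour              | 24 hours        | Daily at 2 AM UTC      |
| KV Namespaces | 30 minutes          | 1 hour          | Hourly                 |
| R2 Buckets    | 2 hours             | 24 hours        | Continuous replication |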
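
As an illustration of how the RPO targets can be checked in practice, the sketch below compares the age of the newest backup against the RPO for its component. The names `RPO_HOURS` and `withinRpo` are hypothetical and not part of the existing scripts; only the hour values come from the table.

```typescript
// RPO targets in hours, copied from the RTO/RPO table.
const RPO_HOURS = { d1: 24, kv: 1, r2: 24 } as const;

type RpoComponent = keyof typeof RPO_HOURS;

// Returns true when the newest backup is recent enough to satisfy the RPO.
function withinRpo(component: RpoComponent, lastBackup: Date, checkedAt: Date): boolean {
  const ageHours = (checkedAt.getTime() - lastBackup.getTime()) / (3600 * 1000);
  return ageHours <= RPO_HOURS[component];
}

const checkedAt = new Date("2025-10-26T12:00:00Z");
// D1 backup from 02:00 UTC is 10 hours old: within the 24-hour RPO.
console.log(withinRpo("d1", new Date("2025-10-26T02:00:00Z"), checkedAt)); // true
// KV backup from 10:30 UTC is 1.5 hours old: outside the 1-hour RPO.
console.log(withinRpo("kv", new Date("2025-10-26T10:30:00Z"), checkedAt)); // false
```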

Backup System

Automated Backups

D1 Database Backups:

  • Script: /scripts/backup/d1-backup.ts
  • Schedule: Daily at 2 AM UTC (GitHub Actions)
  • Features:
    • AES-256-GCM encryption
    • Retention: daily backups kept 30 days, weekly 90 days, monthly 1 year
    • Integrity verification via SHA-256 checksums
    • Automatic cleanup of expired backups
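
The encryption and checksum features listed above can be sketched roughly as follows, using Node's built-in `node:crypto` module (also available in Bun). The `EncryptedBackup` envelope and both function names are assumptions for illustration; the actual on-disk format used by `d1-backup.ts` may differ.

```typescript
import { createCipheriv, createDecipheriv, createHash, randomBytes } from "node:crypto";

// Hypothetical envelope; the real backup format may store these fields differently.
interface EncryptedBackup {
  iv: Buffer;         // 12-byte nonce, freshly generated for every backup
  authTag: Buffer;    // GCM authentication tag
  ciphertext: Buffer;
  sha256: string;     // hex SHA-256 of the ciphertext, for integrity checks
}

function encryptBackup(plaintext: Buffer, keyHex: string): EncryptedBackup {
  const key = Buffer.from(keyHex, "hex");
  if (key.length !== 32) throw new Error("key must be 64 hex characters (32 bytes)");
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  const sha256 = createHash("sha256").update(ciphertext).digest("hex");
  return { iv, authTag: cipher.getAuthTag(), ciphertext, sha256 };
}

function decryptBackup(backup: EncryptedBackup, keyHex: string): Buffer {
  // Verify the checksum first so corruption is reported before decryption.
  const actual = createHash("sha256").update(backup.ciphertext).digest("hex");
  if (actual !== backup.sha256) throw new Error("checksum mismatch");
  const decipher = createDecipheriv("aes-256-gcm", Buffer.from(keyHex, "hex"), backup.iv);
  decipher.setAuthTag(backup.authTag);
  // final() throws if the key is wrong or the data was tampered with.
  return Buffer.concat([decipher.update(backup.ciphertext), decipher.final()]);
}
```

A round trip such as `decryptBackup(encryptBackup(data, key), key)` returning the original bytes is a quick way to confirm that a key is usable before relying on it.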

KV Namespace Backups:

  • Script: /scripts/backup/kv-backup.ts
  • Schedule: Hourly
  • Features:
    • Full and incremental backup modes
    • Metadata preservation (expiration, custom metadata)
    • Per-namespace backup manifests
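
To make the incremental mode more concrete, here is a minimal sketch of how an incremental backup could decide what to store, assuming each per-namespace manifest maps key names to a content hash. The `KvManifest` shape and `planIncremental` are hypothetical; `kv-backup.ts` may track changes differently.

```typescript
// Hypothetical manifest shape: key name -> content hash of the stored value.
type KvManifest = Record<string, string>;

interface IncrementalPlan {
  upserts: string[]; // keys that are new or whose value changed since the base backup
  deletes: string[]; // keys present in the base backup but missing now
}

function planIncremental(base: KvManifest, current: KvManifest): IncrementalPlan {
  const upserts = Object.keys(current).filter((k) => base[k] !== current[k]).sort();
  const deletes = Object.keys(base).filter((k) => !(k in current)).sort();
  return { upserts, deletes };
}

const kvPlan = planIncremental(
  { "session:a": "h1", "session:b": "h2", "flag:x": "h3" },
  { "session:a": "h1", "session:b": "h9", "flag:y": "h4" },
);
// upserts: flag:y, session:b   deletes: flag:x
console.log(kvPlan);
```

Unchanged keys such as `session:a` are left out of the plan entirely, which is what keeps an incremental backup smaller than a full one.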

R2 Bucket Backups:

  • Script: /scripts/backup/r2-backup.ts
  • Method: Cross-region replication to secondary bucket
  • Features:
    • Object versioning
    • Lifecycle policies
    • Replication lag monitoring

Manual Backup Commands

bash
# D1 Database
bun run backup:d1                    # Create backup
bun run backup:d1:list               # List backups
bun run backup:d1:cleanup            # Remove expired backups

# KV Namespaces
bun run backup:kv                    # Full backup
bun run backup:kv:incremental <id>   # Incremental from base
bun run backup:kv:list               # List backups

# R2 Buckets
bun run backup:r2:configure          # Setup replication
bun run backup:r2:sync               # Manual sync to secondary
bun run backup:r2:status             # Check replication status

Recovery System

Recovery Scripts

D1 Database Recovery:

  • Script: /scripts/recovery/d1-restore.ts
  • Usage:
    bash
    bun run recovery:d1 <backup-id>
    bun run recovery:d1 <backup-id> --validate-only  # Dry run

KV Namespace Recovery:

  • Script: /scripts/recovery/kv-restore.ts
  • Usage:
    bash
    bun run recovery:kv <backup-id>                           # All namespaces
    bun run recovery:kv <backup-id> --namespace SESSIONS      # Single namespace
    bun run recovery:kv <backup-id> --clear                   # Clear before restore
    bun run recovery:kv <backup-id> --skip                    # Skip existing keys
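
The effect of the `--clear` and `--skip` flags can be pictured with the small simulation below. It is a sketch only: it assumes that a restore without flags overwrites keys found in the backup and leaves other keys untouched, which should be confirmed against `kv-restore.ts`. `simulateRestore` and `RestoreOptions` are hypothetical names.

```typescript
type KvEntries = Map<string, string>;

interface RestoreOptions {
  clear?: boolean; // mirrors --clear: empty the namespace before restoring
  skip?: boolean;  // mirrors --skip: keep keys that already exist
}

// Assumed default (no flags): backup values overwrite existing keys; other keys stay.
function simulateRestore(existing: KvEntries, backup: KvEntries, opts: RestoreOptions = {}): KvEntries {
  const result: KvEntries = opts.clear ? new Map() : new Map(existing);
  backup.forEach((value, key) => {
    if (opts.skip && result.has(key)) return; // leave the live value in place
    result.set(key, value);
  });
  return result;
}

const live: KvEntries = new Map([["a", "live"], ["c", "live"]]);
const saved: KvEntries = new Map([["a", "backup"], ["b", "backup"]]);

console.log(simulateRestore(live, saved));                  // a=backup, c=live, b=backup
console.log(simulateRestore(live, saved, { clear: true })); // a=backup, b=backup
console.log(simulateRestore(live, saved, { skip: true }));  // a=live, c=live, b=backup
```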

Disaster Recovery Orchestration:

  • Script: /scripts/recovery/disaster-recovery.ts
  • Usage:
    bash
    bun run recovery:disaster <d1-id> <kv-id>                 # Full recovery
    bun run recovery:disaster <d1-id> <kv-id> --dry-run       # Simulate
    bun run recovery:disaster <d1-id> <kv-id> --components d1,kv  # Partial

Recovery Playbooks

Detailed step-by-step procedures for different failure scenarios:

D1 Recovery Playbook

When to use:

  • Database corruption
  • Accidental data deletion
  • Failed migration
  • Need to restore to previous state

Key sections:

  • Backup identification
  • Validation procedures
  • Step-by-step restoration
  • Data integrity verification
  • Rollback procedures

Worker Rollback Playbook

When to use:

  • High error rates after deployment
  • Performance degradation
  • Broken functionality
  • Security issues

Key sections:

  • Quick rollback (< 5 minutes)
  • Multi-worker rollback strategy
  • Traffic switching
  • Verification procedures

Data Corruption Playbook

When to use:

  • Integrity violations
  • Checksum mismatches
  • Schema corruption
  • Inconsistent data state

Key sections:

  • Corruption detection
  • Impact assessment
  • Point-in-time recovery
  • Selective data repair
  • Prevention measures

Partial Failure Playbook

When to use:

  • Single component failure
  • Specific worker down
  • KV namespace unavailable
  • R2 bucket issues

Key sections:

  • Component identification
  • Isolated recovery
  • Service continuity
  • Gradual restoration

Automated Testing

Recovery Drill Script

Purpose: Monthly validation of backup/recovery procedures

Script: /scripts/backup/test-recovery.ts

Usage:

bash
bun run recovery:test                        # Full test suite
bun run recovery:test --components d1        # Test D1 only
bun run recovery:test --skip-validation      # Skip validation steps

What it tests:

  • Backup creation and integrity
  • Restoration procedures
  • Data validation
  • RTO measurements
  • End-to-end integration

Schedule: First day of each month (automated via GitHub Actions)

Output:

  • JSON test report with RTO/RPO measurements
  • Pass/fail status for each component
  • Recommendations for improvements
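
One possible shape for the JSON test report is sketched below. The field names and `buildReport` are assumptions for illustration, not the actual output of `test-recovery.ts`; the RTO targets are taken from the Quick Reference section.

```typescript
// RTO targets in minutes, from the Quick Reference section.
const RTO_TARGET_MIN = { d1: 60, kv: 30, r2: 120 } as const;
type DrillComponent = keyof typeof RTO_TARGET_MIN;

// Hypothetical measurement produced by one drill step.
interface DrillResult {
  component: DrillComponent;
  restoreMinutes: number; // measured restore duration
  dataValid: boolean;     // outcome of post-restore validation
}

interface DrillReport {
  generatedAt: string;
  results: Array<DrillResult & { rtoTargetMinutes: number; passed: boolean }>;
  allPassed: boolean;
}

function buildReport(results: DrillResult[], generatedAt: Date): DrillReport {
  const scored = results.map((r) => {
    const rtoTargetMinutes = RTO_TARGET_MIN[r.component];
    // A component passes only if its data validated and the restore met the RTO.
    const passed = r.dataValid && r.restoreMinutes <= rtoTargetMinutes;
    return { ...r, rtoTargetMinutes, passed };
  });
  return {
    generatedAt: generatedAt.toISOString(),
    results: scored,
    allPassed: scored.every((r) => r.passed),
  };
}

const drillReport = buildReport(
  [
    { component: "d1", restoreMinutes: 42, dataValid: true }, // within 60 min
    { component: "kv", restoreMinutes: 45, dataValid: true }, // exceeds 30 min
  ],
  new Date("2025-11-01T03:00:00Z"),
);
console.log(JSON.stringify(drillReport, null, 2));
```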

Environment Variables

Required for backup/recovery operations:

bash
# Cloudflare Authentication
CLOUDFLARE_API_TOKEN=<your-api-token>
CLOUDFLARE_ACCOUNT_ID=<your-account-id>

# Backup Configuration
BACKUP_R2_BUCKET=monotask-backups
SECONDARY_R2_BUCKET=monotask-backups-secondary
BACKUP_ENCRYPTION_KEY=<64-char-hex-string>  # Generate: openssl rand -hex 32

# Database IDs (from wrangler.toml)
D1_DATABASE_NAME=monotask-production
D1_DATABASE_ID=<your-database-id>

# For testing
TEST_DATABASE_NAME=monotask-recovery-test
TEST_DATABASE_ID=<test-database-id>
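
A script can fail fast when this configuration is incomplete. The following is a minimal sketch of such a check; the variable names come from the list above, while `validateBackupEnv` and its rules are assumptions (the test database variables are treated as optional here).

```typescript
// Variable names come from the list above; which ones are mandatory is an assumption.
const REQUIRED_VARS = [
  "CLOUDFLARE_API_TOKEN",
  "CLOUDFLARE_ACCOUNT_ID",
  "BACKUP_R2_BUCKET",
  "SECONDARY_R2_BUCKET",
  "BACKUP_ENCRYPTION_KEY",
  "D1_DATABASE_NAME",
  "D1_DATABASE_ID",
];

// Returns human-readable problems; an empty list means the config looks usable.
function validateBackupEnv(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  for (const varName of REQUIRED_VARS) {
    if (!env[varName]) problems.push(`${varName} is not set`);
  }
  const key = env["BACKUP_ENCRYPTION_KEY"];
  // `openssl rand -hex 32` yields 32 bytes, which is 64 hex characters.
  if (key && !/^[0-9a-fA-F]{64}$/.test(key)) {
    problems.push("BACKUP_ENCRYPTION_KEY must be 64 hex characters");
  }
  return problems;
}

const sampleEnv = {
  CLOUDFLARE_API_TOKEN: "token",
  CLOUDFLARE_ACCOUNT_ID: "account",
  BACKUP_R2_BUCKET: "monotask-backups",
  SECONDARY_R2_BUCKET: "monotask-backups-secondary",
  BACKUP_ENCRYPTION_KEY: "abc123", // too short on purpose
  D1_DATABASE_NAME: "monotask-production",
};
console.log(validateBackupEnv(sampleEnv));
// Reports the missing D1_DATABASE_ID and the malformed encryption key.
```

In a Bun or Node script, the same function could be called with `process.env` before any backup or recovery step runs.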

GitHub Actions Workflows

D1 Backup Workflow

File: .github/workflows/d1-backup.yml

Schedule: Daily at 2 AM UTC

Steps:

  1. Create D1 backup
  2. Upload to R2 with encryption
  3. Cleanup expired backups
  4. Verify backup integrity
  5. Send notifications (success/failure)
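
The cleanup step depends on deciding which backups have outlived their retention window. A minimal sketch follows, using the retention periods listed for D1 backups (30 days, 90 days, 1 year). The tiering rule is an assumption made for illustration: a backup taken on the 1st of the month counts as monthly, one taken on a Sunday as weekly, and anything else as daily. The real script may classify backups differently.

```typescript
type Tier = "daily" | "weekly" | "monthly";

// Retention windows in days (1 year taken as 365 days).
const RETENTION_DAYS: Record<Tier, number> = { daily: 30, weekly: 90, monthly: 365 };

// Assumed tiering rule: 1st of the month -> monthly, Sunday -> weekly, otherwise daily.
function tierFor(createdAt: Date): Tier {
  if (createdAt.getUTCDate() === 1) return "monthly";
  if (createdAt.getUTCDay() === 0) return "weekly";
  return "daily";
}

// A backup is expired once its age exceeds the window for its tier.
function isExpired(createdAt: Date, asOf: Date): boolean {
  const ageDays = (asOf.getTime() - createdAt.getTime()) / (24 * 3600 * 1000);
  return ageDays > RETENTION_DAYS[tierFor(createdAt)];
}

const asOf = new Date("2025-12-15T02:00:00Z");
console.log(isExpired(new Date("2025-10-21T02:00:00Z"), asOf)); // true: daily, 55 days old
console.log(isExpired(new Date("2025-10-26T02:00:00Z"), asOf)); // false: weekly (Sunday), 50 days old
console.log(isExpired(new Date("2025-10-01T02:00:00Z"), asOf)); // false: monthly, 75 days old
```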

Monthly recovery test: Runs full recovery drill on first day of month


Best Practices

Before Recovery

  1. Identify the correct backup:

    • Check timestamp
    • Verify backup size
    • Review manifest file
  2. Validate the backup:

    • Use --validate-only flag
    • Verify checksum
    • Check decryption works
  3. Create pre-recovery backup:

    • Always back up the current state
    • Note rollback backup ID
    • Document current state
  4. Communicate:

    • Notify team
    • Update status page
    • Set expectations

During Recovery

  1. Monitor progress:

    • Watch script output
    • Check for errors
    • Note warnings
  2. Verify each step:

    • Don't skip validation
    • Check integrity
    • Test functionality
  3. Document:

    • Record backup IDs
    • Note any issues
    • Track timing

After Recovery

  1. Validate thoroughly:

    • Run smoke tests
    • Check data integrity
    • Verify all features work
  2. Monitor closely:

    • Watch error rates
    • Check performance
    • Monitor for 24-48 hours
  3. Document incident:

    • Create incident report
    • Update playbooks
    • Share learnings
  4. Improve:

    • Address root cause
    • Update procedures
    • Enhance monitoring

Troubleshooting

Common Issues

Backup Encryption Key Missing:

bash
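# Generate a new key. Backups encrypted with an earlier key can only be
# decrypted with that key, so look for the original key first.
openssl rand -hex 32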

# Set in environment
export BACKUP_ENCRYPTION_KEY=<generated-key>

Wrangler Authentication Fails:

bash
# Login to Cloudflare
bunx wrangler login

# Or use API token
export CLOUDFLARE_API_TOKEN=<your-token>

Backup Download Fails:

  • Check R2 bucket name
  • Verify backup exists
  • Check permissions
  • Retry operation

Restoration Takes Too Long:

  • Check Cloudflare service status
  • Verify network connectivity
  • Monitor rate limits
  • Break into smaller batches if needed

Support and Escalation

Self-Service Resources

  1. Check playbooks in this directory
  2. Review script help output
  3. Search error messages in docs
  4. Check Cloudflare status page

Escalation Path

Level 1 (< 30 minutes):

  • Follow relevant playbook
  • Attempt standard recovery

Level 2 (30-60 minutes):

  • Consult DevOps team
  • Review with senior engineer

Level 3 (> 60 minutes):

  • Escalate to engineering manager
  • Consider disaster recovery

Level 4 (Critical):

  • Executive notification
  • Invoke full disaster recovery plan
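
The time thresholds above can be encoded so that tooling, such as an incident bot, suggests the current level. This is a sketch under assumptions: `escalationLevel` is a hypothetical helper, and whether an incident is critical is treated as a human judgment passed in as a flag.

```typescript
type EscalationLevel = 1 | 2 | 3 | 4;

// Maps elapsed recovery time to the escalation levels described above.
// `critical` is decided by the responder, not derived from the elapsed time.
function escalationLevel(minutesElapsed: number, critical: boolean): EscalationLevel {
  if (critical) return 4;              // Level 4: executive notification, full DR plan
  if (minutesElapsed > 60) return 3;   // Level 3: engineering manager
  if (minutesElapsed >= 30) return 2;  // Level 2: DevOps team, senior engineer
  return 1;                            // Level 1: follow the relevant playbook
}

console.log(escalationLevel(10, false)); // 1
console.log(escalationLevel(45, false)); // 2
console.log(escalationLevel(90, false)); // 3
console.log(escalationLevel(5, true));   // 4
```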

Metrics and Monitoring

Backup Health Metrics

  • Backup success rate (target: 100%)
  • Backup duration (track trends)
  • Backup size (monitor growth)
  • Retention compliance
  • Encryption validation rate

Recovery Metrics

  • RTO actual vs. target
  • RPO actual vs. target
  • Recovery success rate
  • Recovery drill pass rate
  • Mean time to recovery

Alerts

  • Backup failure
  • Backup duration exceeding baseline
  • Expired backup cleanup failure
  • RTO target exceeded in tests
  • Replication lag > 2 hours
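
A few of these alert conditions can be expressed as a simple check over the outcome of a backup run. The sketch below is illustrative only: the `BackupRun` shape, the alert names, and the 1.5x tolerance over the duration baseline are assumptions, while the 2-hour replication lag threshold comes from the list above.

```typescript
interface BackupRun {
  succeeded: boolean;
  durationSeconds: number;
  replicationLagMinutes: number; // lag of the secondary R2 bucket at check time
}

// `baselineSeconds` and `tolerance` are assumed inputs; the list above only
// says "exceeding baseline" without defining a margin.
function alertsFor(run: BackupRun, baselineSeconds: number, tolerance = 1.5): string[] {
  const alerts: string[] = [];
  if (!run.succeeded) alerts.push("backup-failure");
  if (run.durationSeconds > baselineSeconds * tolerance) alerts.push("backup-duration");
  if (run.replicationLagMinutes > 120) alerts.push("replication-lag"); // > 2 hours
  return alerts;
}

console.log(alertsFor({ succeeded: true, durationSeconds: 300, replicationLagMinutes: 15 }, 280));
// no alerts
console.log(alertsFor({ succeeded: false, durationSeconds: 900, replicationLagMinutes: 150 }, 280));
// backup-failure, backup-duration, replication-lag
```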

Maintenance

Daily

  • Verify automated backups completed
  • Check backup success notifications

Weekly

  • Review backup retention
  • Monitor storage usage
  • Check replication status

Monthly

  • Run recovery drill
  • Review and update playbooks
  • Analyze RTO/RPO trends
  • Test disaster recovery

Quarterly

  • Full playbook review
  • Update documentation
  • Team training refresh
  • Process improvement review

Revision History

| Date       | Version | Changes                        | Author |
|------------|---------|--------------------------------|--------|
| 2025-10-26 | 1.0     | Initial backup/recovery system | System |

Feedback and Improvements

This backup and recovery system is continuously improved based on:

  • Recovery drill results
  • Actual incident experiences
  • Team feedback
  • Technology updates

To suggest improvements:

  1. Document issues during recovery
  2. Update playbooks with learnings
  3. Share with team
  4. Schedule regular reviews
