D1 Recovery

Overview

This playbook provides step-by-step procedures for recovering the MonoTask D1 database from encrypted backups.

RTO Target: 1 hour
RPO Target: 24 hours (daily backups at 2 AM UTC)


When to Use This Playbook

Use this playbook in the following scenarios:

  • Data Corruption: Database corruption detected through integrity checks
  • Accidental Data Loss: Critical data accidentally deleted
  • Failed Migration: Database migration resulted in data loss or corruption
  • Disaster Recovery: Full system restoration required
  • Rollback Required: Need to restore to a previous state after failed deployment

Prerequisites

Before starting the recovery process, ensure:

  1. Access Requirements

    • Cloudflare account access with appropriate permissions
    • CLOUDFLARE_API_TOKEN environment variable set
    • BACKUP_ENCRYPTION_KEY environment variable set (64-character hex string)
  2. Tools Required

    • Bun runtime installed
    • Wrangler CLI (bunx wrangler)
    • Access to MonoTask repository
  3. Backup Information

    • Backup ID to restore from
    • Backup location in R2 (monotask-backups bucket)
    • Backup manifest file for verification
  4. Stakeholder Notification

    • Notify team that database recovery is in progress
    • Schedule a maintenance window if restoring production
    • Prepare rollback plan
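
The environment-variable requirements above can be checked before starting. The following is a minimal sketch, not part of the MonoTask scripts: the helper names `isValidBackupKey` and `checkPrerequisites` are hypothetical, and the only rules enforced are the ones stated in this playbook (token present, key is a 64-character hex string).

```typescript
// Hypothetical pre-flight check; not part of the MonoTask repository.
// Returns true only for a 64-character hexadecimal string.
function isValidBackupKey(key: string | undefined): boolean {
  return typeof key === "string" && /^[0-9a-fA-F]{64}$/.test(key);
}

// Lists the prerequisite environment variables that are missing or malformed.
function checkPrerequisites(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  if (!env["CLOUDFLARE_API_TOKEN"]) {
    problems.push("CLOUDFLARE_API_TOKEN is not set");
  }
  if (!isValidBackupKey(env["BACKUP_ENCRYPTION_KEY"])) {
    problems.push("BACKUP_ENCRYPTION_KEY must be a 64-character hex string");
  }
  return problems;
}
```

In a Bun script this could be called as `checkPrerequisites(process.env)`, aborting the recovery if the returned list is not empty.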

Recovery Procedure

Step 1: Identify the Backup

Estimated Time: 5 minutes

  1. Determine the backup to restore from:

    bash
    # List available backups
    bun run scripts/backup/d1-backup.ts list
  2. Review backup manifest:

    bash
    # Download manifest from R2
    bunx wrangler r2 object get monotask-backups/backups/d1/manifests/<backup-id>.manifest.json
  3. Verify backup details:

    • Timestamp: When was the backup created?
    • Size: Does the size look correct?
    • Checksum: Is integrity verification present?
    • Retention type: daily/weekly/monthly
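
The manifest review above can be partly automated. This is a sketch under stated assumptions: the `BackupManifest` field names are guesses at what the manifest JSON contains and should be adjusted to the real file, and `reviewManifest` is a hypothetical helper rather than part of the backup scripts.

```typescript
// Assumed manifest shape; the real manifest fields may differ.
interface BackupManifest {
  backupId: string;
  timestamp: string; // creation time, assumed to be ISO 8601
  sizeBytes: number;
  checksum?: string;
  retentionType: string; // expected: daily, weekly, or monthly
}

// Returns human-readable warnings; an empty list means nothing looked wrong.
function reviewManifest(m: BackupManifest, now: Date, maxAgeHours: number): string[] {
  const warnings: string[] = [];
  const created = Date.parse(m.timestamp);
  if (isNaN(created)) {
    warnings.push("timestamp is not a valid date");
  } else if ((now.getTime() - created) / 3600000 > maxAgeHours) {
    warnings.push("backup is older than " + maxAgeHours + " hours");
  }
  if (!(m.sizeBytes > 0)) {
    warnings.push("size is zero or missing");
  }
  if (!m.checksum) {
    warnings.push("checksum is missing");
  }
  if (["daily", "weekly", "monthly"].indexOf(m.retentionType) === -1) {
    warnings.push("unexpected retention type: " + m.retentionType);
  }
  return warnings;
}
```

Warnings are advisory only. Whether the size "looks correct" still needs a human comparison against recent backups.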

Decision Point: Confirm this is the correct backup before proceeding.


Step 2: Validate the Backup

Estimated Time: 10 minutes

  1. Run validation-only mode:

    bash
    bun run scripts/recovery/d1-restore.ts <backup-id> --validate-only
  2. Check validation results:

    • ✓ Backup downloaded successfully
    • ✓ Decryption successful
    • ✓ Checksum verification passed
    • ✓ SQL syntax validated
  3. If validation fails:

    • Check encryption key is correct
    • Verify backup file integrity
    • Try an alternate backup if available
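
To rule out a corrupted download independently of the restore script, the checksum can be recomputed by hand. The sketch below assumes the manifest stores a hex-encoded SHA-256 digest; this playbook does not state the algorithm, or whether the digest covers the encrypted or the decrypted file, so confirm both against the backup script first. The function names are illustrative.

```typescript
import { createHash } from "node:crypto";

// Assumption: the manifest checksum is a hex-encoded SHA-256 digest.
function sha256Hex(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Compares the recomputed digest with the expected value,
// ignoring surrounding whitespace and letter case.
function checksumMatches(data: Uint8Array | string, expectedHex: string): boolean {
  return sha256Hex(data) === expectedHex.trim().toLowerCase();
}
```

If the digest matches but decryption still fails, the encryption key is the more likely culprit than the file.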

Decision Point: Only proceed if validation passes.


Step 3: Create Pre-Restore Backup

Estimated Time: 10 minutes

IMPORTANT: Always create a backup of the current state before restoration.

bash
# Create manual backup
bun run scripts/backup/d1-backup.ts backup

This creates a rollback point if the restoration needs to be undone.

Note the Backup ID from the output for potential rollback.


Step 4: Execute Restoration

Estimated Time: 20-30 minutes

  1. Run the restoration script:

    bash
    bun run scripts/recovery/d1-restore.ts <backup-id>
  2. Monitor the restoration process:

    [D1 Restore] Step 1: Downloading backup...
    [D1 Restore] Step 2: Verifying backup integrity...
    [D1 Restore] Step 3: Decrypting backup...
    [D1 Restore] Step 4: Verifying checksum...
    [D1 Restore] Step 5: Validating SQL...
    [D1 Restore] Step 6: Creating pre-restore backup...
    [D1 Restore] Step 7: Dropping existing tables...
    [D1 Restore] Step 8: Importing data...
    [D1 Restore] Step 9: Validating restored data...
    [D1 Restore] Step 10: Cleaning up...
  3. Watch for errors or warnings during import

  4. Note the restoration report location
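
When the restore output is captured to a file or piped through another tool, progress can be tracked by parsing the step lines. A small sketch, assuming the log format is exactly as shown in the sample output above and that there are ten steps in total; `parseRestoreLine` is a hypothetical helper.

```typescript
const TOTAL_STEPS = 10; // taken from the sample output above

interface RestoreProgress {
  step: number;
  label: string;
  percent: number; // share of steps started, not time elapsed
}

// Parses one "[D1 Restore] Step N: ..." line; returns null for any other line.
function parseRestoreLine(line: string): RestoreProgress | null {
  const match = /^\[D1 Restore\] Step (\d+): (.+?)(?:\.\.\.)?$/.exec(line.trim());
  if (!match) {
    return null;
  }
  const step = Number(match[1]);
  return { step: step, label: match[2], percent: Math.round((step / TOTAL_STEPS) * 100) };
}
```

Lines that return `null` are the ones worth reading closely, since errors and warnings will not match the step pattern.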


Step 5: Verify Data Integrity

Estimated Time: 10 minutes

  1. Check table count:

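    bash
    # Note: recent Wrangler versions run `d1 execute` against the local database
    # by default; add --remote to the commands in this playbook to query the deployed database.
    bunx wrangler d1 execute monotask-production \
      --command "SELECT COUNT(*) FROM sqlite_master WHERE type='table'" \
      --json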
  2. Verify critical tables have data:

    bash
    # Check tasks table
    bunx wrangler d1 execute monotask-production \
      --command "SELECT COUNT(*) FROM tasks" \
      --json
    
    # Check projects table
    bunx wrangler d1 execute monotask-production \
      --command "SELECT COUNT(*) FROM projects" \
      --json
  3. Run sample queries to verify data integrity:

    bash
    # Get recent tasks
    bunx wrangler d1 execute monotask-production \
      --command "SELECT id, title, state FROM tasks ORDER BY created_at DESC LIMIT 5" \
      --json
  4. Compare row counts with backup manifest
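
The row-count comparison can be scripted once the counts are collected. This sketch assumes the manifest (or a note taken at backup time) records per-table row counts, which this playbook does not confirm; `compareRowCounts` and its types are hypothetical.

```typescript
type RowCounts = Record<string, number>;

interface CountMismatch {
  table: string;
  expected: number;
  actual: number | undefined; // undefined when the table was not found
}

// Returns one entry per table whose restored count differs from the expected count.
function compareRowCounts(expected: RowCounts, actual: RowCounts): CountMismatch[] {
  const mismatches: CountMismatch[] = [];
  for (const table of Object.keys(expected)) {
    if (actual[table] !== expected[table]) {
      mismatches.push({ table: table, expected: expected[table], actual: actual[table] });
    }
  }
  return mismatches;
}
```

For example, `compareRowCounts({ tasks: 10, projects: 2 }, { tasks: 10, projects: 1 })` reports a single mismatch for `projects`.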

Decision Point: If data looks incorrect, proceed to rollback procedure.


Step 6: Run Smoke Tests

Estimated Time: 5 minutes

  1. Test basic database operations:

    bash
    # Test read
    bunx wrangler d1 execute monotask-production \
      --command "SELECT 1" \
      --json
    
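    # Test write, then remove the test row (D1 rejects explicit BEGIN/ROLLBACK statements)
    bunx wrangler d1 execute monotask-production \
      --command "INSERT INTO tasks (id, title, state) VALUES ('restore-smoke-test', 'test', 'PENDING'); DELETE FROM tasks WHERE id = 'restore-smoke-test';" \
      --json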
  2. Verify indexes exist:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "SELECT name FROM sqlite_master WHERE type='index'" \
      --json
  3. Check foreign key constraints:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "PRAGMA foreign_key_check" \
      --json

Step 7: Restart Dependent Services

Estimated Time: 5 minutes

  1. In production, Cloudflare Workers use the restored database automatically; no restart is needed.

  2. For local development:

    bash
    # Restart daemon
    bun run daemon:stop
    bun run daemon:start
    
    # Restart dashboard
    bun run dashboard:stop
    bun run dashboard:start

Step 8: Monitor Post-Recovery

Estimated Time: Ongoing (first hour critical)

  1. Monitor error rates in Cloudflare dashboard

  2. Check application logs for database-related errors

  3. Monitor response times for database queries

  4. Verify no data inconsistencies reported by users


Verification Checklist

After recovery, verify:

  • [ ] All expected tables present
  • [ ] Row counts match expected values
  • [ ] Sample queries return correct data
  • [ ] Indexes and constraints intact
  • [ ] No foreign key violations
  • [ ] Application functioning normally
  • [ ] No error spikes in logs
  • [ ] Recovery report generated and saved
  • [ ] Stakeholders notified of completion
  • [ ] Post-mortem scheduled (if the recovery followed an incident)

Rollback Procedure

If the restored data is incorrect or incomplete:

  1. Identify the pre-restore backup ID (from Step 3)

  2. Run restoration with the rollback backup:

    bash
    bun run scripts/recovery/d1-restore.ts <rollback-backup-id>
  3. Verify data is back to pre-recovery state

  4. Investigate why the original restore failed


Common Issues and Solutions

Issue: Checksum Verification Failed

Symptoms:

  • Error message: "Backup integrity check failed"
  • Checksum mismatch during validation

Solution:

  1. Verify encryption key is correct
  2. Check backup file wasn't corrupted during download
  3. Try downloading the backup again
  4. Use an alternate backup from a different date

Issue: Import Fails with Constraint Violations

Symptoms:

  • Foreign key constraint errors during import
  • Unique constraint violations

Solution:

  1. Ensure --skip-validation is NOT used
  2. Check if database schema has changed since backup
  3. Review migration history
  4. If the schema has changed, restore the schema separately before importing data

Issue: Import Takes Too Long

Symptoms:

  • Import exceeds expected duration
  • Script appears to hang

Solution:

  1. Check Cloudflare D1 service status
  2. Verify network connectivity
  3. Monitor Cloudflare dashboard for rate limits
  4. Consider breaking import into smaller batches

Issue: Restored Data is Stale

Symptoms:

  • Data is older than expected
  • Recent changes missing

Solution:

  1. Verify correct backup ID was used
  2. Check backup timestamp in manifest
  3. Remember that the RPO is 24 hours (daily backups), so up to a day of changes can be missing
  4. If more recent data is needed, check for newer backups
  5. Consider KV restoration for session data

RTO/RPO Expectations

Recovery Time Objective (RTO)

Target: 1 hour from decision to restore

Breakdown:

  • Backup identification: 5 minutes
  • Validation: 10 minutes
  • Pre-restore backup: 10 minutes
  • Restoration: 20-30 minutes
  • Verification: 10 minutes
  • Service restart: 5 minutes

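Total: 60-70 minutes, depending on restoration time. The 5-minute smoke tests (Step 6) are not included in this breakdown.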
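
Summing the per-step estimates makes the time budget explicit. The sketch below uses only the figures listed in the breakdown; the type and function names are illustrative.

```typescript
interface StepEstimate {
  name: string;
  minMinutes: number;
  maxMinutes: number;
}

// Figures copied from the breakdown above.
const rtoSteps: StepEstimate[] = [
  { name: "Backup identification", minMinutes: 5, maxMinutes: 5 },
  { name: "Validation", minMinutes: 10, maxMinutes: 10 },
  { name: "Pre-restore backup", minMinutes: 10, maxMinutes: 10 },
  { name: "Restoration", minMinutes: 20, maxMinutes: 30 },
  { name: "Verification", minMinutes: 10, maxMinutes: 10 },
  { name: "Service restart", minMinutes: 5, maxMinutes: 5 },
];

// Adds up the best-case and worst-case durations across all steps.
function totalRange(steps: StepEstimate[]): { minMinutes: number; maxMinutes: number } {
  return steps.reduce(
    (acc, s) => ({
      minMinutes: acc.minMinutes + s.minMinutes,
      maxMinutes: acc.maxMinutes + s.maxMinutes,
    }),
    { minMinutes: 0, maxMinutes: 0 },
  );
}

console.log(totalRange(rtoSteps)); // { minMinutes: 60, maxMinutes: 70 }
```

The one-hour target is met only when restoration finishes at the low end of its 20-30 minute estimate, which leaves no slack for retries.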

Recovery Point Objective (RPO)

Target: 24 hours

Explanation:

  • Backups run daily at 2 AM UTC
  • Maximum data loss: 24 hours
  • For critical operations, use transaction logs if available
  • KV namespaces have 1-hour RPO for session data
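
The amount of data at risk at any moment follows from the schedule. A sketch that assumes the 2 AM UTC backup ran successfully and completed at its scheduled time; the function names are hypothetical.

```typescript
// Returns the most recent 02:00 UTC at or before `now`.
function lastScheduledBackup(now: Date): Date {
  const t = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), 2, 0, 0));
  if (t.getTime() > now.getTime()) {
    t.setUTCDate(t.getUTCDate() - 1); // before 02:00, so use yesterday's backup
  }
  return t;
}

// Hours of changes that would be lost if the database failed at `now`.
function hoursOfDataAtRisk(now: Date): number {
  return (now.getTime() - lastScheduledBackup(now).getTime()) / 3600000;
}
```

A failure at 01:00 UTC puts 23 hours of changes at risk, while a failure at 14:30 UTC puts 12.5 hours at risk. If a scheduled backup was skipped or failed, the real exposure exceeds 24 hours.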

Escalation Path

If recovery is taking longer than expected or issues arise:

  1. < 30 minutes: Continue with playbook, monitor closely
  2. 30-60 minutes: Notify DevOps lead, consider requesting assistance
  3. 60-90 minutes: Escalate to senior engineer, consider disaster recovery
  4. > 90 minutes: Executive notification, invoke disaster recovery plan
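
The thresholds above map directly onto a lookup that a timer or chat bot could use. A sketch with hypothetical level names; it treats exactly 30 and exactly 60 minutes as part of the "notify DevOps lead" band, which is one reasonable reading of the list.

```typescript
type Escalation =
  | "continue"
  | "notify-devops-lead"
  | "escalate-senior-engineer"
  | "executive-notification";

// Maps minutes elapsed since the decision to restore onto an escalation level.
function escalationFor(elapsedMinutes: number): Escalation {
  if (elapsedMinutes > 90) return "executive-notification";
  if (elapsedMinutes > 60) return "escalate-senior-engineer";
  if (elapsedMinutes >= 30) return "notify-devops-lead";
  return "continue";
}
```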

Emergency Contacts:

  • DevOps Lead: [Contact info]
  • Database Administrator: [Contact info]
  • On-call Engineer: [Contact info]

Post-Recovery Actions

After successful recovery:

  1. Document the Incident

    • What triggered the recovery?
    • Which backup was used?
    • Were there any issues during recovery?
    • Actual RTO achieved?
  2. Update Monitoring

    • Add alerts if a similar issue could occur
    • Review backup schedules if needed
    • Check whether more frequent backups are required
  3. Schedule Post-Mortem

    • Review incident timeline
    • Identify root cause
    • Document lessons learned
    • Update playbook based on experience
  4. Update Stakeholders

    • Send recovery completion notification
    • Document any data loss (if applicable)
    • Provide incident report


Revision History

Date        Version  Changes           Author
2025-10-26  1.0      Initial playbook  System

Feedback

If you find issues with this playbook or have suggestions for improvement:

  1. Document the issue during recovery
  2. Update the playbook after successful recovery
  3. Share learnings with the team
  4. Schedule playbook review quarterly
