D1 Recovery

Overview

This playbook provides step-by-step procedures for recovering the MonoTask D1 database from encrypted backups.

RTO Target: 1 hour
RPO Target: 24 hours (daily backups at 2 AM UTC)

When to Use This Playbook

Use this playbook in the following scenarios:

Data Corruption: Database corruption detected through integrity checks
Accidental Data Loss: Critical data accidentally deleted
Failed Migration: Database migration resulted in data loss or corruption
Disaster Recovery: Full system restoration required
Rollback Required: Need to restore to a previous state after failed deployment

Prerequisites

Before starting the recovery process, ensure:

Access Requirements
- Cloudflare account access with appropriate permissions
- CLOUDFLARE_API_TOKEN environment variable set
- BACKUP_ENCRYPTION_KEY environment variable set (64-character hex string)

Tools Required
- Bun runtime installed
- Wrangler CLI (bunx wrangler)
- Access to MonoTask repository

Backup Information
- Backup ID to restore from
- Backup location in R2 (monotask-backups bucket)
- Backup manifest file for verification

Stakeholder Notification
- Notify team that database recovery is in progress
- Schedule maintenance window if production restoration
- Prepare rollback plan
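
The access requirements can be checked before the recovery clock starts. The following is a minimal sketch, not part of the MonoTask repository: the function names `is_valid_backup_key` and `preflight_check` are hypothetical, and the only facts taken from this playbook are the two variable names and the 64-character hex format.

```shell
# Hypothetical preflight helpers (not part of the MonoTask scripts).

# Succeeds only if $1 is exactly 64 hexadecimal characters.
is_valid_backup_key() {
  case "$1" in
    *[!0-9a-fA-F]*) return 1 ;;  # contains a non-hex character
  esac
  [ "${#1}" -eq 64 ]
}

# Succeeds only if both required environment variables look usable.
preflight_check() {
  rc=0
  if [ -z "${CLOUDFLARE_API_TOKEN:-}" ]; then
    echo "CLOUDFLARE_API_TOKEN is not set" >&2
    rc=1
  fi
  if ! is_valid_backup_key "${BACKUP_ENCRYPTION_KEY:-}"; then
    echo "BACKUP_ENCRYPTION_KEY must be a 64-character hex string" >&2
    rc=1
  fi
  return "$rc"
}
```

Running `preflight_check && echo "ready"` in the shell that will run the restore confirms the environment without printing either secret.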

Recovery Procedure

Step 1: Identify the Backup

Estimated Time: 5 minutes

Determine the backup to restore from:

bash

# List available backups
bun run scripts/backup/d1-backup.ts list

Review backup manifest:

bash

# Download manifest from R2
bunx wrangler r2 object get monotask-backups/backups/d1/manifests/<backup-id>.manifest.json

Verify backup details:
- Timestamp: When was the backup created, and does it predate the incident?
- Size: Is the size consistent with other recent backups?
- Checksum: Is a checksum recorded so integrity can be verified?
- Retention type: Is it a daily, weekly, or monthly backup?
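
A quick presence check on the downloaded manifest can catch an incomplete file before any time is spent on validation. This is a rough sketch only: the manifest schema is not documented here, so the field names `timestamp`, `size`, `checksum`, and `retentionType` are assumptions, as is the helper name `manifest_has_fields`. Adjust the names to match the real manifest.

```shell
# Hypothetical manifest check; the field names below are assumptions,
# not the documented MonoTask manifest schema.
manifest_has_fields() {
  manifest_file=$1
  missing=0
  for field in timestamp size checksum retentionType; do
    # Look for the quoted key anywhere in the JSON text.
    if ! grep -q "\"$field\"" "$manifest_file"; then
      echo "manifest is missing field: $field" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

The check only confirms that the keys appear in the file; the values still need a human review as described in the list above.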

Decision Point: Confirm this is the correct backup before proceeding.

Step 2: Validate the Backup

Estimated Time: 10 minutes

Run validation-only mode:

bash

bun run scripts/recovery/d1-restore.ts <backup-id> --validate-only

Check validation results:
- ✓ Backup downloaded successfully
- ✓ Decryption successful
- ✓ Checksum verification passed
- ✓ SQL syntax validated

If validation fails:
- Check encryption key is correct
- Verify backup file integrity
- Try an alternate backup if available

Decision Point: Only proceed if validation passes.
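
When validation reports a checksum problem, it can help to recompute the digest by hand on a freshly downloaded copy. The sketch below assumes the manifest records a hex-encoded SHA-256 digest; the playbook does not state the algorithm or whether the digest covers the encrypted or the decrypted file, so treat both as assumptions. The helper names `sha256_of` and `verify_checksum` are hypothetical.

```shell
# Hypothetical helpers; SHA-256 is an assumption about the backup format.

# Print the SHA-256 digest of a file as lowercase hex.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1   # fallback where sha256sum is absent (e.g. macOS)
  fi
}

# Succeeds if the file's digest equals the expected hex digest (case-insensitive).
# Usage: verify_checksum <file> <expected-hex-digest>
verify_checksum() {
  expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  [ "$(sha256_of "$1")" = "$expected" ]
}
```

A mismatch on a second download points to a damaged object in R2 rather than a transfer problem, which is the cue to try an alternate backup.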

Step 3: Create Pre-Restore Backup

Estimated Time: 10 minutes

IMPORTANT: Always create a backup of the current state before restoration.

bash

# Create manual backup
bun run scripts/backup/d1-backup.ts backup

This creates a rollback point if the restoration needs to be undone.

Note the Backup ID from the output for potential rollback.

Step 4: Execute Restoration

Estimated Time: 20-30 minutes

Run the restoration script:

bash

bun run scripts/recovery/d1-restore.ts <backup-id>

Monitor the restoration process:

[D1 Restore] Step 1: Downloading backup...
[D1 Restore] Step 2: Verifying backup integrity...
[D1 Restore] Step 3: Decrypting backup...
[D1 Restore] Step 4: Verifying checksum...
[D1 Restore] Step 5: Validating SQL...
[D1 Restore] Step 6: Creating pre-restore backup...
[D1 Restore] Step 7: Dropping existing tables...
[D1 Restore] Step 8: Importing data...
[D1 Restore] Step 9: Validating restored data...
[D1 Restore] Step 10: Cleaning up...

Watch for errors or warnings during import
Note the restoration report location
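
Capturing the script output to a file makes it easier to spot problems after the fact. The helper below is a crude sketch under stated assumptions: the name `restore_log_is_clean` is hypothetical, and it assumes that problem lines contain the words "error" or "warn", which this playbook does not confirm.

```shell
# Hypothetical helper: succeeds when the captured log has no error/warning lines.
# Usage: restore_log_is_clean <log-file>
restore_log_is_clean() {
  log_file=$1
  [ -r "$log_file" ] || return 2          # unreadable log: report a distinct status
  if grep -inE 'error|warn' "$log_file" >&2; then
    return 1                              # grep already printed the matching lines with line numbers
  fi
  return 0
}
```

One way to use it is to run the restore as `bun run scripts/recovery/d1-restore.ts <backup-id> 2>&1 | tee restore.log` and then call `restore_log_is_clean restore.log`; the log file name here is only an example.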

Step 5: Verify Data Integrity

Estimated Time: 10 minutes

Check table count:

bash

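# --remote targets the deployed database; current Wrangler versions default to a local one
bunx wrangler d1 execute monotask-production \
  --remote \
  --command "SELECT COUNT(*) FROM sqlite_master WHERE type='table'" \
  --json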

Verify critical tables have data:

bash

# Check tasks table
bunx wrangler d1 execute monotask-production \
  --remote \
  --command "SELECT COUNT(*) FROM tasks" \
  --json

# Check projects table
bunx wrangler d1 execute monotask-production \
  --remote \
  --command "SELECT COUNT(*) FROM projects" \
  --json

Run sample queries to verify data integrity:

bash

# Get recent tasks
bunx wrangler d1 execute monotask-production \
  --remote \
  --command "SELECT id, title, state FROM tasks ORDER BY created_at DESC LIMIT 5" \
  --json

Compare row counts with backup manifest

Decision Point: If data looks incorrect, proceed to rollback procedure.
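
The row-count comparison is easy to get wrong by eye when several tables are involved. The helper below is a small sketch; `check_row_count` is a hypothetical name, and the expected values are assumed to be read manually from the backup manifest, since its exact layout is not described here.

```shell
# Hypothetical helper: compare an expected row count with the restored one.
# Usage: check_row_count <table> <expected> <actual>
check_row_count() {
  if [ "$2" -eq "$3" ]; then
    echo "OK $1: $3 rows"
  else
    echo "MISMATCH $1: expected $2, got $3"
    return 1
  fi
}
```

For example, `check_row_count tasks 1200 1200 && check_row_count projects 45 45` stops at the first table whose counts differ; the numbers are placeholders.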

Step 6: Run Smoke Tests

Estimated Time: 5 minutes

Test basic database operations:

bash

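# Test read
bunx wrangler d1 execute monotask-production \
  --remote \
  --command "SELECT 1" \
  --json

# Test write: insert a sentinel row, then delete it again.
# D1 does not accept explicit BEGIN/ROLLBACK statements, so the cleanup is a DELETE.
bunx wrangler d1 execute monotask-production \
  --remote \
  --command "INSERT INTO tasks (id, title, state) VALUES ('restore-smoke-test', 'test', 'PENDING'); DELETE FROM tasks WHERE id = 'restore-smoke-test';" \
  --json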

Verify indexes exist:

bash

bunx wrangler d1 execute monotask-production \
  --remote \
  --command "SELECT name FROM sqlite_master WHERE type='index'" \
  --json

Check foreign key constraints:

bash

bunx wrangler d1 execute monotask-production \
  --remote \
  --command "PRAGMA foreign_key_check" \
  --json

Step 7: Restart Dependent Services

Estimated Time: 5 minutes

Deployed Cloudflare Workers do not need a restart. They use the restored database automatically, because the restore rewrites the same database they already read from. Action is needed only for local development.

For local development:

bash

# Restart daemon
bun run daemon:stop
bun run daemon:start

# Restart dashboard
bun run dashboard:stop
bun run dashboard:start

Step 8: Monitor Post-Recovery

Estimated Time: Ongoing (first hour critical)

Monitor error rates in Cloudflare dashboard
Check application logs for database-related errors
Monitor response times for database queries
Verify no data inconsistencies reported by users

Verification Checklist

After recovery, verify:

[ ] All expected tables present
[ ] Row counts match expected values
[ ] Sample queries return correct data
[ ] Indexes and constraints intact
[ ] No foreign key violations
[ ] Application functioning normally
[ ] No error spikes in logs
[ ] Recovery report generated and saved
[ ] Stakeholders notified of completion
[ ] Post-mortem scheduled if incident

Rollback Procedure

If the restored data is incorrect or incomplete:

Identify the pre-restore backup ID (from Step 3)

Run restoration with the rollback backup:

bash

bun run scripts/recovery/d1-restore.ts <rollback-backup-id>

Verify data is back to pre-recovery state
Investigate why the original restore failed

Common Issues and Solutions

Issue: Checksum Verification Failed

Symptoms:

Error message: "Backup integrity check failed"
Checksum mismatch during validation

Solution:

Verify encryption key is correct
Check backup file wasn't corrupted during download
Try downloading the backup again
Use an alternate backup from a different date

Issue: Import Fails with Constraint Violations

Symptoms:

Foreign key constraint errors during import
Unique constraint violations

Solution:

Ensure --skip-validation is NOT used
Check if database schema has changed since backup
Review migration history
May need to restore schema separately first

Issue: Import Takes Too Long

Symptoms:

Import exceeds expected duration
Script appears to hang

Solution:

Check Cloudflare D1 service status
Verify network connectivity
Monitor Cloudflare dashboard for rate limits
Consider breaking import into smaller batches

Issue: Restored Data is Stale

Symptoms:

Data is older than expected
Recent changes missing

Solution:

Verify correct backup ID was used
Check backup timestamp in manifest
Remember RPO is 24 hours (daily backups)
If more recent data needed, check for newer backups
Consider KV restoration for session data

RTO/RPO Expectations

Recovery Time Objective (RTO)

Target: 1 hour from decision to restore

Breakdown:

Backup identification: 5 minutes
Validation: 10 minutes
Pre-restore backup: 10 minutes
Restoration: 20-30 minutes
Verification: 10 minutes
Service restart: 5 minutes

Total: 60-70 minutes, depending on restoration time (excludes the 5-minute smoke tests in Step 6)
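
The total can be reproduced by adding the step estimates from the breakdown. This is a small arithmetic sketch; the function name `rto_total_range` is hypothetical, and the minutes are copied from the list above.

```shell
# Step estimates in minutes, copied from the breakdown above.
# Restoration is a range (20-30), so the total is a range as well.
rto_total_range() {
  # identification + validation + pre-restore backup + verification + service restart
  fixed=$((5 + 10 + 10 + 10 + 5))
  echo "$((fixed + 20))-$((fixed + 30))"
}

rto_total_range   # prints 60-70
```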

Recovery Point Objective (RPO)

Target: 24 hours

Explanation:

Backups run daily at 2 AM UTC
Maximum data loss: 24 hours
For critical operations, use transaction logs if available
KV namespaces have 1-hour RPO for session data
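
The actual data-loss window for a given incident is the time elapsed since the most recent 2 AM UTC backup. The sketch below computes it from an incident time in UTC. It assumes the latest scheduled backup completed successfully, and `minutes_since_last_backup` is a hypothetical name.

```shell
# Hypothetical helper: minutes between the last 02:00 UTC backup and an incident.
# Usage: minutes_since_last_backup <HH> <MM>   (incident time in UTC, 24-hour clock)
minutes_since_last_backup() {
  h=${1#0}   # strip one leading zero so "08" is not parsed as octal
  m=${2#0}
  # 02:00 UTC is minute 120 of the day; add a full day before the modulo
  # so times earlier than 02:00 count back to the previous day's backup.
  echo $(( ( ${h:-0} * 60 + ${m:-0} - 120 + 1440 ) % 1440 ))
}

minutes_since_last_backup 14 30   # prints 750 (12.5 hours of changes at risk)
```

An incident at 01:00 UTC yields 1380 minutes, close to the 24-hour worst case.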

Escalation Path

If recovery is taking longer than expected or issues arise:

< 30 minutes: Continue with playbook, monitor closely
30-60 minutes: Notify DevOps lead, consider assistance
60-90 minutes: Escalate to senior engineer, consider disaster recovery
> 90 minutes: Executive notification, invoke disaster recovery plan
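
The escalation tiers can be expressed as a simple lookup on elapsed minutes, which is convenient for an incident timer or chat bot. This is a sketch with a hypothetical function name, `escalation_action`. It assumes the tier that starts after 60 minutes ends at 90 minutes, and that exactly 60 minutes still belongs to the 30-60 tier.

```shell
# Hypothetical helper: map elapsed recovery time to the escalation action.
# Usage: escalation_action <elapsed-minutes>
escalation_action() {
  elapsed=$1
  if [ "$elapsed" -lt 30 ]; then
    echo "Continue with playbook, monitor closely"
  elif [ "$elapsed" -le 60 ]; then
    echo "Notify DevOps lead, consider assistance"
  elif [ "$elapsed" -le 90 ]; then
    echo "Escalate to senior engineer, consider disaster recovery"
  else
    echo "Executive notification, invoke disaster recovery plan"
  fi
}
```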

Emergency Contacts:

DevOps Lead: [Contact info]
Database Administrator: [Contact info]
On-call Engineer: [Contact info]

Post-Recovery Actions

After successful recovery:

Document the Incident
- What triggered the recovery?
- Which backup was used?
- Were there any issues during recovery?
- Actual RTO achieved?

Update Monitoring
- Add alerts if similar issue could occur
- Review backup schedules if needed
- Check if more frequent backups required

Schedule Post-Mortem
- Review incident timeline
- Identify root cause
- Document lessons learned
- Update playbook based on experience

Update Stakeholders
- Send recovery completion notification
- Document any data loss (if applicable)
- Provide incident report

Related Playbooks

Worker Rollback - For rolling back Worker deployments
Data Corruption - For investigating data corruption
Partial Failure - For component-specific failures
Disaster Recovery - For full system restoration

Revision History

Date	Version	Changes	Author
2025-10-26	1.0	Initial playbook	System

Feedback

If you find issues with this playbook or have suggestions for improvement:

Document the issue during recovery
Update the playbook after successful recovery
Share learnings with the team
Schedule playbook review quarterly
