# D1 Recovery

## Overview
This playbook provides step-by-step procedures for recovering the MonoTask D1 database from encrypted backups.
**RTO Target:** 1 hour
**RPO Target:** 24 hours (daily backups at 2 AM UTC)
## When to Use This Playbook
Use this playbook in the following scenarios:
- **Data Corruption**: Database corruption detected through integrity checks
- **Accidental Data Loss**: Critical data accidentally deleted
- **Failed Migration**: Database migration resulted in data loss or corruption
- **Disaster Recovery**: Full system restoration required
- **Rollback Required**: Need to restore to a previous state after a failed deployment
## Prerequisites

Before starting the recovery process, ensure:

### Access Requirements
- Cloudflare account access with appropriate permissions
- `CLOUDFLARE_API_TOKEN` environment variable set
- `BACKUP_ENCRYPTION_KEY` environment variable set (64-character hex string)
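Both variables can be preflight-checked before touching the database. A minimal sketch — the helper names here are hypothetical, not functions from the repository's scripts:

```typescript
// Hypothetical preflight check for the recovery environment (a sketch, not
// part of the MonoTask scripts). Variable names follow this playbook.
function isValidEncryptionKey(key: string | undefined): boolean {
  // BACKUP_ENCRYPTION_KEY must be a 64-character hex string (32 bytes, AES-256)
  return key !== undefined && /^[0-9a-fA-F]{64}$/.test(key);
}

function checkRecoveryEnv(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  if (!env.CLOUDFLARE_API_TOKEN) {
    problems.push("CLOUDFLARE_API_TOKEN is not set");
  }
  if (!isValidEncryptionKey(env.BACKUP_ENCRYPTION_KEY)) {
    problems.push("BACKUP_ENCRYPTION_KEY is missing or not a 64-character hex string");
  }
  return problems;
}
```

Call `checkRecoveryEnv(process.env)` at the top of a wrapper script and abort if the returned list is non-empty.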
### Tools Required

- Bun runtime installed
- Wrangler CLI (`bunx wrangler`)
- Access to MonoTask repository
### Backup Information

- Backup ID to restore from
- Backup location in R2 (`monotask-backups` bucket)
- Backup manifest file for verification
### Stakeholder Notification
- Notify team that database recovery is in progress
- Schedule maintenance window if production restoration
- Prepare rollback plan
## Recovery Procedure

### Step 1: Identify the Backup

**Estimated Time:** 5 minutes
Determine the backup to restore from:
```bash
# List available backups
bun run scripts/backup/d1-backup.ts list
```

Review backup manifest:
```bash
# Download manifest from R2
bunx wrangler r2 object get monotask-backups/backups/d1/manifests/<backup-id>.manifest.json
```

Verify backup details:
- **Timestamp**: When was the backup created?
- **Size**: Does the size look correct?
- **Checksum**: Is integrity verification present?
- **Retention type**: daily/weekly/monthly
**Decision Point:** Confirm this is the correct backup before proceeding.
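The manifest checks above can be scripted. A sketch, assuming hypothetical field names (`timestamp`, `sizeBytes`, `checksum`, `retention`) — adjust to the real manifest schema:

```typescript
// Sanity-check a backup manifest before restoring.
// The field names below are assumptions about the manifest JSON, not the
// verified schema of MonoTask's manifests.
interface BackupManifest {
  backupId: string;
  timestamp: string; // ISO 8601
  sizeBytes: number;
  checksum?: string; // hex digest, if present
  retention: "daily" | "weekly" | "monthly";
}

function manifestIssues(m: BackupManifest, nowMs: number): string[] {
  const issues: string[] = [];
  const ageHours = (nowMs - Date.parse(m.timestamp)) / 3_600_000;
  if (Number.isNaN(ageHours) || ageHours < 0) issues.push("timestamp is invalid or in the future");
  if (m.sizeBytes <= 0) issues.push("size looks wrong");
  if (!m.checksum) issues.push("no checksum: integrity cannot be verified");
  return issues;
}
```

An empty result means the manifest passed the basic checks; anything else should stop you at this decision point.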
### Step 2: Validate the Backup

**Estimated Time:** 10 minutes
Run validation-only mode:
```bash
bun run scripts/recovery/d1-restore.ts <backup-id> --validate-only
```

Check validation results:
- ✓ Backup downloaded successfully
- ✓ Decryption successful
- ✓ Checksum verification passed
- ✓ SQL syntax validated
If validation fails:
- Check encryption key is correct
- Verify backup file integrity
- Try an alternate backup if available
**Decision Point:** Only proceed if validation passes.
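The checksum portion of validation can be illustrated in isolation. A sketch assuming checksums are hex-encoded SHA-256 digests of the decrypted dump (an assumption about the backup format):

```typescript
import { createHash } from "node:crypto";

// Sketch of the checksum-verification step. Assumes the manifest records a
// hex SHA-256 of the decrypted SQL dump; adjust if the real format differs.
function sha256Hex(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

function verifyChecksum(sql: string, expectedHex: string): boolean {
  // Compare case-insensitively; manifests may store upper- or lower-case hex
  return sha256Hex(sql) === expectedHex.toLowerCase();
}
```

A mismatch here means the dump was corrupted in transit or the wrong encryption key silently produced garbage — both reasons to stop before Step 3.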
### Step 3: Create Pre-Restore Backup

**Estimated Time:** 10 minutes

**IMPORTANT:** Always create a backup of the current state before restoration.
```bash
# Create manual backup
bun run scripts/backup/d1-backup.ts backup
```

This creates a rollback point if the restoration needs to be undone.
Note the Backup ID from the output for potential rollback.
### Step 4: Execute Restoration

**Estimated Time:** 20-30 minutes
Run the restoration script:
```bash
bun run scripts/recovery/d1-restore.ts <backup-id>
```

Monitor the restoration process:
```
[D1 Restore] Step 1: Downloading backup...
[D1 Restore] Step 2: Verifying backup integrity...
[D1 Restore] Step 3: Decrypting backup...
[D1 Restore] Step 4: Verifying checksum...
[D1 Restore] Step 5: Validating SQL...
[D1 Restore] Step 6: Creating pre-restore backup...
[D1 Restore] Step 7: Dropping existing tables...
[D1 Restore] Step 8: Importing data...
[D1 Restore] Step 9: Validating restored data...
[D1 Restore] Step 10: Cleaning up...
```

- Watch for errors or warnings during import
- Note the restoration report location
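If you wrap the restore in automation, the progress lines above can be parsed to report how far a run got before failing. A sketch based on the log format shown:

```typescript
// Parse "[D1 Restore] Step N: ..." progress lines from the restore script's
// output (format taken from the log shown above) and report the last step seen.
function lastCompletedStep(log: string): { step: number; label: string } | null {
  let last: { step: number; label: string } | null = null;
  for (const line of log.split("\n")) {
    const m = line.match(/^\[D1 Restore\] Step (\d+): (.+?)\.{3}$/);
    if (m) last = { step: Number(m[1]), label: m[2] };
  }
  return last;
}
```

A run that dies at step 7 or later ("Dropping existing tables" onward) left the database partially modified, which is exactly when the Step 3 rollback point matters.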
### Step 5: Verify Data Integrity

**Estimated Time:** 10 minutes
Check table count:
```bash
bunx wrangler d1 execute monotask-production \
  --command "SELECT COUNT(*) FROM sqlite_master WHERE type='table'" \
  --json
```

Verify critical tables have data:
```bash
# Check tasks table
bunx wrangler d1 execute monotask-production \
  --command "SELECT COUNT(*) FROM tasks" \
  --json

# Check projects table
bunx wrangler d1 execute monotask-production \
  --command "SELECT COUNT(*) FROM projects" \
  --json
```

Run sample queries to verify data integrity:
```bash
# Get recent tasks
bunx wrangler d1 execute monotask-production \
  --command "SELECT id, title, state FROM tasks ORDER BY created_at DESC LIMIT 5" \
  --json
```

Compare row counts with the backup manifest.
**Decision Point:** If data looks incorrect, proceed to the rollback procedure.
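The count comparison can be scripted once the per-table results are collected. A sketch — storing per-table counts in the manifest is an assumption about its schema:

```typescript
// Compare restored row counts against counts recorded at backup time.
// That the manifest records per-table counts is an assumption; if it does
// not, capture them from the pre-restore backup run instead.
function rowCountMismatches(
  expected: Record<string, number>,
  actual: Record<string, number>,
): string[] {
  const mismatches: string[] = [];
  for (const [table, want] of Object.entries(expected)) {
    const got = actual[table];
    if (got === undefined) mismatches.push(`${table}: missing from restored database`);
    else if (got !== want) mismatches.push(`${table}: expected ${want} rows, found ${got}`);
  }
  return mismatches;
}
```

Any non-empty result at this decision point is a signal to move to the rollback procedure rather than continue.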
### Step 6: Run Smoke Tests

**Estimated Time:** 5 minutes
Test basic database operations:
```bash
# Test read
bunx wrangler d1 execute monotask-production \
  --command "SELECT 1" \
  --json

# Test write (rollback after)
bunx wrangler d1 execute monotask-production \
  --command "BEGIN; INSERT INTO tasks (id, title, state) VALUES ('test', 'test', 'PENDING'); ROLLBACK;" \
  --json
```

Verify indexes exist:
```bash
bunx wrangler d1 execute monotask-production \
  --command "SELECT name FROM sqlite_master WHERE type='index'" \
  --json
```

Check foreign key constraints:
```bash
bunx wrangler d1 execute monotask-production \
  --command "PRAGMA foreign_key_check" \
  --json
```
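If these smoke tests are scripted, the repeated `wrangler d1 execute` invocation can come from one helper. A sketch that only builds the argument list; actually running it (e.g. via `Bun.spawn`) is left to the caller:

```typescript
// Build the repeated `bunx wrangler d1 execute` argument list used by the
// smoke tests above. Command construction only; execution and JSON parsing
// are left to the calling script.
function d1ExecuteArgs(database: string, sql: string): string[] {
  return ["bunx", "wrangler", "d1", "execute", database, "--command", sql, "--json"];
}

// The read-only smoke queries from this step
const smokeTests: string[] = [
  "SELECT 1",
  "SELECT name FROM sqlite_master WHERE type='index'",
  "PRAGMA foreign_key_check",
];
```

Keeping the queries in one list makes it easy to run the whole suite after every restore and diff the results against a known-good run.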
### Step 7: Restart Dependent Services

**Estimated Time:** 5 minutes
Restart Cloudflare Workers to pick up restored data:
```bash
# The workers will automatically use the restored database
# No action needed unless using local development
```

For local development:
```bash
# Restart daemon
bun run daemon:stop
bun run daemon:start

# Restart dashboard
bun run dashboard:stop
bun run dashboard:start
```
### Step 8: Monitor Post-Recovery

**Estimated Time:** Ongoing (first hour is critical)
- Monitor error rates in the Cloudflare dashboard
- Check application logs for database-related errors
- Monitor response times for database queries
- Verify no data inconsistencies are reported by users
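A simple spike heuristic can back the error-rate monitoring. The 2x-baseline multiplier and low-traffic floor below are assumptions to tune, not values from this system:

```typescript
// Flag a post-recovery error spike relative to a pre-incident baseline.
// Both the 2x multiplier and the minimum floor are assumed defaults; tune
// them to this system's normal error rates.
function isErrorSpike(baselinePerMin: number, currentPerMin: number): boolean {
  const minimumFloor = 5; // ignore noise when traffic is very low (assumption)
  return currentPerMin > minimumFloor && currentPerMin > 2 * baselinePerMin;
}
```

Run this against each minute of the first post-recovery hour; a sustained `true` result is a reason to revisit the verification checklist or roll back.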
## Verification Checklist
After recovery, verify:
- [ ] All expected tables present
- [ ] Row counts match expected values
- [ ] Sample queries return correct data
- [ ] Indexes and constraints intact
- [ ] No foreign key violations
- [ ] Application functioning normally
- [ ] No error spikes in logs
- [ ] Recovery report generated and saved
- [ ] Stakeholders notified of completion
- [ ] Post-mortem scheduled if this was an incident
## Rollback Procedure
If the restored data is incorrect or incomplete:
1. Identify the pre-restore backup ID (from Step 3)
2. Run restoration with the rollback backup:

   ```bash
   bun run scripts/recovery/d1-restore.ts <rollback-backup-id>
   ```

3. Verify data is back to the pre-recovery state
4. Investigate why the original restore failed
## Common Issues and Solutions
### Issue: Checksum Verification Failed

**Symptoms:**
- Error message: "Backup integrity check failed"
- Checksum mismatch during validation
**Solution:**
- Verify encryption key is correct
- Check backup file wasn't corrupted during download
- Try downloading the backup again
- Use an alternate backup from a different date
### Issue: Import Fails with Constraint Violations

**Symptoms:**
- Foreign key constraint errors during import
- Unique constraint violations
**Solution:**
- Ensure `--skip-validation` is NOT used
- Check if the database schema has changed since the backup
- Review migration history
- May need to restore schema separately first
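Schema drift between backup and live database is a common root cause of these constraint failures. A sketch of the diffing step, with table lists as would come from querying `sqlite_master` on each side:

```typescript
// Diff table names between the backup's schema and the live schema.
// The table lists would come from `SELECT name FROM sqlite_master WHERE
// type='table'`; only the comparison is sketched here.
function schemaDrift(backupTables: string[], liveTables: string[]) {
  const live = new Set(liveTables);
  const backup = new Set(backupTables);
  return {
    onlyInBackup: backupTables.filter((t) => !live.has(t)),
    onlyInLive: liveTables.filter((t) => !backup.has(t)),
  };
}
```

Non-empty drift in either direction suggests restoring the schema separately first, as the last bullet above recommends.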
### Issue: Import Takes Too Long

**Symptoms:**
- Import exceeds expected duration
- Script appears to hang
**Solution:**
- Check Cloudflare D1 service status
- Verify network connectivity
- Monitor Cloudflare dashboard for rate limits
- Consider breaking import into smaller batches
### Issue: Restored Data is Stale

**Symptoms:**
- Data is older than expected
- Recent changes missing
**Solution:**
- Verify correct backup ID was used
- Check backup timestamp in manifest
- Remember RPO is 24 hours (daily backups)
- If more recent data needed, check for newer backups
- Consider KV restoration for session data
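Before treating data as stale, confirm the backup's age is actually outside the 24-hour RPO rather than simply at the boundary. A small sketch:

```typescript
// Check a backup's age against the 24-hour RPO from this playbook.
function backupAgeHours(backupTimestamp: string, nowMs: number): number {
  return (nowMs - Date.parse(backupTimestamp)) / 3_600_000;
}

function withinRpo(backupTimestamp: string, nowMs: number, rpoHours = 24): boolean {
  return backupAgeHours(backupTimestamp, nowMs) <= rpoHours;
}
```

A backup within the RPO that still looks stale points at the wrong backup ID; one outside it points at a skipped or failed backup run.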
## RTO/RPO Expectations

### Recovery Time Objective (RTO)

**Target:** 1 hour from decision to restore

**Breakdown:**
- Backup identification: 5 minutes
- Validation: 10 minutes
- Pre-restore backup: 10 minutes
- Restoration: 20-30 minutes
- Verification: 10 minutes
- Service restart: 5 minutes
**Total:** ~60 minutes
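The breakdown can be kept as data and summed; only the restoration step carries a range, so the total spans 60 to 70 minutes:

```typescript
// The RTO breakdown above as data. Restoration is the only step with a
// range, so the total is itself a range.
const breakdownMinutes = {
  identification: 5,
  validation: 10,
  preRestoreBackup: 10,
  restoration: [20, 30] as [number, number],
  verification: 10,
  serviceRestart: 5,
};

function totalRange(b: typeof breakdownMinutes): [number, number] {
  const fixed =
    b.identification + b.validation + b.preRestoreBackup + b.verification + b.serviceRestart;
  return [fixed + b.restoration[0], fixed + b.restoration[1]];
}
```

The optimistic end of the range matches the 1-hour target; the pessimistic end is where the escalation path below starts to apply.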
### Recovery Point Objective (RPO)

**Target:** 24 hours

**Explanation:**
- Backups run daily at 2 AM UTC
- Maximum data loss: 24 hours
- For critical operations, use transaction logs if available
- KV namespaces have 1-hour RPO for session data
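The worst-case data loss at any moment follows directly from the 2 AM UTC schedule. A sketch computing the most recent scheduled backup time for a failure occurring "now":

```typescript
// Given the daily 2 AM UTC backup schedule, compute the most recent
// scheduled run before a given moment. Data written after that time would
// be lost (hence the 24-hour RPO).
function lastScheduledBackup(now: Date): Date {
  const candidate = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), 2, 0, 0),
  );
  if (candidate.getTime() > now.getTime()) {
    candidate.setUTCDate(candidate.getUTCDate() - 1); // before 2 AM: use yesterday's run
  }
  return candidate;
}
```

A failure just before 2 AM UTC is the worst case: the last backup is nearly 24 hours old.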
## Escalation Path
If recovery is taking longer than expected or issues arise:
- **< 30 minutes**: Continue with playbook, monitor closely
- **30-60 minutes**: Notify DevOps lead, consider assistance
- **> 60 minutes**: Escalate to senior engineer, consider disaster recovery
- **> 90 minutes**: Executive notification, invoke disaster recovery plan
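The thresholds above can be encoded for an incident timer or bot. A sketch following the playbook's minute boundaries:

```typescript
// Map elapsed recovery time to the escalation action defined above.
function escalationAction(elapsedMinutes: number): string {
  if (elapsedMinutes > 90) return "Executive notification, invoke disaster recovery plan";
  if (elapsedMinutes > 60) return "Escalate to senior engineer, consider disaster recovery";
  if (elapsedMinutes >= 30) return "Notify DevOps lead, consider assistance";
  return "Continue with playbook, monitor closely";
}
```

Starting the timer at the decision to restore (not at incident detection) keeps it aligned with the RTO target.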
**Emergency Contacts:**
- DevOps Lead: [Contact info]
- Database Administrator: [Contact info]
- On-call Engineer: [Contact info]
## Post-Recovery Actions

After successful recovery:

### Document the Incident
- What triggered the recovery?
- Which backup was used?
- Were there any issues during recovery?
- Actual RTO achieved?
### Update Monitoring
- Add alerts if similar issue could occur
- Review backup schedules if needed
- Check if more frequent backups required
### Schedule Post-Mortem
- Review incident timeline
- Identify root cause
- Document lessons learned
- Update playbook based on experience
### Update Stakeholders
- Send recovery completion notification
- Document any data loss (if applicable)
- Provide incident report
## Related Playbooks
- Worker Rollback - For rolling back Worker deployments
- Data Corruption - For investigating data corruption
- Partial Failure - For component-specific failures
- Disaster Recovery - For full system restoration
## Revision History
| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-10-26 | 1.0 | Initial playbook | System |
## Feedback
If you find issues with this playbook or have suggestions for improvement:
- Document the issue during recovery
- Update the playbook after successful recovery
- Share learnings with the team
- Schedule playbook review quarterly