Data Corruption
Overview
This playbook addresses detection, assessment, and recovery from data corruption in the MonoTask system, covering D1 database corruption, KV data inconsistencies, and R2 object corruption.
RTO Target: 2 hours
RPO Target: Varies by data type (D1: 24h, KV: 1h, R2: 24h)
When to Use This Playbook
Use this playbook when encountering:
- Data Integrity Violations: Foreign key constraints failing, orphaned records
- Checksum Mismatches: File integrity verification failures
- Schema Corruption: Table structure inconsistencies
- Index Corruption: Index scans returning incorrect results
- Encoding Issues: Character encoding problems, garbled text
- Inconsistent State: Data in conflicting states across systems
- Truncated Data: Incomplete or cut-off records
- Duplicate Records: Unexpected duplicate data
Corruption Detection Methods
Automated Detection
Database Integrity Checks:
```bash
# Run SQLite integrity check
bunx wrangler d1 execute monotask-production \
  --command "PRAGMA integrity_check" \
  --json
```
Foreign Key Validation:
```bash
bunx wrangler d1 execute monotask-production \
  --command "PRAGMA foreign_key_check" \
  --json
```
Monitoring Alerts:
- Constraint violation rate spike
- Failed transaction rate increase
- Data validation errors in logs
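To automate the integrity check above, the JSON that `wrangler d1 execute --json` prints can be inspected programmatically. A minimal sketch, assuming the output is an array of `{ results, success }` objects (verify against your wrangler version):

```typescript
// Hypothetical helper: decide whether a `--json` run of
// PRAGMA integrity_check came back clean. The output shape
// assumed here is an array of { results, success } objects.
interface D1ExecResult {
  results: Array<Record<string, unknown>>;
  success: boolean;
}

function integrityCheckPassed(rawJson: string): boolean {
  const parsed = JSON.parse(rawJson) as D1ExecResult[];
  // A clean check returns a single row whose value is the string "ok";
  // anything else is a list of corruption findings.
  return parsed.every(
    (r) =>
      r.success &&
      r.results.length === 1 &&
      Object.values(r.results[0])[0] === "ok",
  );
}
```

A scheduled job could run the PRAGMA, pipe the output through this check, and page when it returns false.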
Manual Detection
User Reports:
- Missing data complaints
- Incorrect data displayed
- Application behaving inconsistently
Data Audits:
- Regular data sampling
- Row count verification
- Critical record validation
Development Detection:
- Test failures
- QA reports
- Code review findings
Corruption Assessment Process
Step 1: Identify Scope
Estimated Time: 15 minutes
Determine Affected Components:
```bash
# Check D1 database
bunx wrangler d1 execute monotask-production \
  --command "SELECT name FROM sqlite_master WHERE type='table'" \
  --json

# For each table, check row counts
bunx wrangler d1 execute monotask-production \
  --command "SELECT COUNT(*) FROM tasks" \
  --json
```
Identify Corruption Extent:
- Single record vs. multiple records
- Single table vs. multiple tables
- Isolated vs. systemic corruption
- Point-in-time vs. ongoing
Check Related Systems:
- Are KV namespaces affected?
- Are R2 objects corrupted?
- Is corruption limited to one component?
Document findings for impact assessment.
Step 2: Assess Impact
Estimated Time: 10 minutes
User Impact:
- How many users affected?
- Which features are broken?
- Is data accessible but incorrect?
- Is data completely missing?
Data Loss Assessment:
- Can corruption be isolated?
- Is partial recovery possible?
- What's the extent of data loss?
Business Impact:
- Critical data affected?
- Regulatory/compliance implications?
- Financial impact?
Severity Levels:
- Critical: Core functionality broken, data loss, many users affected
- High: Important features broken, some data loss, subset of users affected
- Medium: Non-critical features affected, minimal data loss
- Low: Minor inconsistencies, no functional impact
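The severity table above can be encoded so assessments are applied consistently across incidents. This is an illustrative sketch only; the exact thresholds (user percentage cutoffs, how data loss weighs in) are assumptions, not policy:

```typescript
// Map the impact-assessment answers onto the four severity levels.
// Thresholds are illustrative assumptions.
type Severity = "Critical" | "High" | "Medium" | "Low";

interface Assessment {
  coreFunctionalityBroken: boolean;
  dataLost: boolean;
  usersAffectedPct: number; // rough estimate, 0-100
}

function classifySeverity(a: Assessment): Severity {
  if (a.coreFunctionalityBroken && a.dataLost && a.usersAffectedPct > 50) {
    return "Critical"; // core broken, data loss, many users
  }
  if (a.dataLost || a.usersAffectedPct > 10) {
    return "High"; // some data loss or a meaningful user subset
  }
  if (a.usersAffectedPct > 0) {
    return "Medium"; // non-critical features, minimal loss
  }
  return "Low"; // minor inconsistencies, no functional impact
}
```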
Step 3: Determine Root Cause
Estimated Time: 20 minutes
Common corruption causes:
Application Bugs:
- Race conditions
- Improper transaction handling
- Validation bypass
- Logic errors
Migration Issues:
- Failed schema changes
- Data transformation errors
- Incomplete migrations
Infrastructure Issues:
- Database crash during write
- Network interruption
- Disk I/O errors
- Resource exhaustion
Operational Errors:
- Manual data modifications
- Incorrect admin operations
- Deployment issues
Investigate:
```bash
# Check recent deployments
bunx wrangler deployments list --name monotask-api-gateway

# Review error logs
# Check Cloudflare Analytics for anomalies
# Review recent database migrations
```
Point-in-Time Recovery
Option 1: Full Database Restore
When to Use:
- Systemic corruption across database
- Cannot isolate corrupted data
- Rollback to before corruption event
Procedure:
Identify Recovery Point:
```bash
# List available backups
bun run scripts/backup/d1-backup.ts list
```
Select Backup Before Corruption:
- Check backup timestamp
- Verify backup predates corruption
- Consider data loss from RPO
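For the "consider data loss from RPO" step, the exposure window can be computed directly from the backup timestamp. A small sketch, assuming ISO-8601 timestamps and the 24h D1 RPO stated above:

```typescript
// Hours of writes that would be lost by restoring a given backup.
function dataLossHours(backupIso: string, nowIso: string): number {
  const ms = Date.parse(nowIso) - Date.parse(backupIso);
  return ms / 3_600_000; // ms per hour
}

// Flag backups whose restore would exceed the RPO (default 24h for D1).
function withinRpo(backupIso: string, nowIso: string, rpoHours = 24): boolean {
  return dataLossHours(backupIso, nowIso) <= rpoHours;
}
```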
Execute Restoration:
```bash
bun run scripts/recovery/d1-restore.ts <backup-id>
```
Follow the D1 Recovery Playbook for the complete procedure.
Data Loss: Everything since backup timestamp (up to 24 hours)
Option 2: Selective Data Restore
When to Use:
- Corruption isolated to specific tables/records
- Want to minimize data loss
- Can identify uncorrupted baseline
Procedure:
Export Uncorrupted Data:
```bash
# Export current state of uncorrupted tables
bunx wrangler d1 execute monotask-production \
  --command "SELECT * FROM projects" \
  --json > projects_current.json
```
Restore from Backup to Staging:
```bash
# Use staging database
D1_DATABASE_NAME=monotask-staging \
  bun run scripts/recovery/d1-restore.ts <backup-id>
```
Extract Needed Data:
```bash
# Export corrupted tables from backup
bunx wrangler d1 execute monotask-staging \
  --command "SELECT * FROM tasks WHERE id IN ('id1', 'id2')" \
  --json > tasks_from_backup.json
```
Merge Data:
- Manually merge uncorrupted current data with restored data
- Use custom script to reconcile differences
- Validate merged result
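One possible shape for the custom reconcile script mentioned above: keep every current row that is clean, and fall back to the backup copy only where the current row is missing or corrupted. The row shape and the `isCorrupted` predicate here are assumptions for illustration:

```typescript
// Merge current (partially corrupted) rows with rows restored from
// backup. Backup rows form the baseline; clean current rows win.
interface Row {
  id: string;
  [col: string]: unknown;
}

function mergeRows(
  current: Row[],
  backup: Row[],
  isCorrupted: (row: Row) => boolean,
): Row[] {
  const merged = new Map<string, Row>();
  for (const row of backup) merged.set(row.id, row); // baseline from backup
  for (const row of current) {
    if (!isCorrupted(row)) merged.set(row.id, row); // prefer live data when clean
  }
  return [...merged.values()];
}
```

The merged array can then be serialized into the `merged_data.sql` import file used in the next step.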
Import to Production:
```bash
# Import merged data
bunx wrangler d1 execute monotask-production \
  --file merged_data.sql
```
Data Loss: Minimal, only specific corrupted records
Option 3: Surgical Data Repair
When to Use:
- Corruption is well-understood
- Can identify and fix specific issues
- Pattern-based corruption (e.g., all tasks in INVALID state)
Procedure:
Identify Corruption Pattern:
```bash
# Find affected records
bunx wrangler d1 execute monotask-production \
  --command "SELECT * FROM tasks WHERE state NOT IN ('PENDING', 'IN_PROGRESS', 'COMPLETED', 'FAILED')" \
  --json
```
Create Repair Script:
```sql
-- Example: Fix invalid state values
UPDATE tasks
SET state = 'PENDING'
WHERE state NOT IN ('PENDING', 'IN_PROGRESS', 'COMPLETED', 'FAILED');
```
Test on Staging:
```bash
# Test repair on staging database first
bunx wrangler d1 execute monotask-staging --file repair.sql
```
Validate Repair:
```bash
# Verify fix worked
bunx wrangler d1 execute monotask-staging \
  --command "PRAGMA integrity_check" \
  --json
```
Apply to Production:
```bash
# Create backup first
bun run scripts/backup/d1-backup.ts backup

# Apply repair
bunx wrangler d1 execute monotask-production --file repair.sql
```
Data Loss: None if repair is correct
KV Data Corruption Recovery
Identifying KV Corruption
```bash
# List keys to check for anomalies
bunx wrangler kv:key list --namespace-id 9fb88e98a937493e93fa6930f4506302

# Get specific key to check value
bunx wrangler kv:key get "suspicious-key" --namespace-id 9fb88e98a937493e93fa6930f4506302
```
Recovery Options
Full Namespace Restore:
```bash
bun run scripts/recovery/kv-restore.ts <backup-id> \
  --namespace SESSIONS \
  --clear
```
Selective Key Restore:
```bash
bun run scripts/recovery/kv-restore.ts <backup-id> \
  --namespace SESSIONS \
  --key "specific-key"
```
Manual Key Deletion:
```bash
# Delete corrupted key
bunx wrangler kv:key delete "corrupted-key" \
  --namespace-id 9fb88e98a937493e93fa6930f4506302
```
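To decide which keys need restoring or deleting, a cheap corruption probe helps when KV values are expected to hold JSON (session blobs, for example): does the value parse, and does it carry the expected fields? The required-field list is an assumption for illustration:

```typescript
// Flag a KV value as suspect if it is not valid JSON, is not an
// object, or is missing any of the fields we expect it to carry.
function looksCorrupted(value: string, requiredFields: string[]): boolean {
  try {
    const parsed = JSON.parse(value);
    if (typeof parsed !== "object" || parsed === null) return true;
    return requiredFields.some((f) => !(f in parsed));
  } catch {
    return true; // not even valid JSON
  }
}
```

Running this over the output of `kv:key list` plus `kv:key get` yields a list of candidate keys for selective restore or deletion.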
R2 Object Corruption Recovery
Detecting R2 Corruption
Checksum Verification:
- Compare object checksums
- Verify file integrity on download
Access Errors:
- 404 errors for existing objects
- Corrupted file downloads
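The checksum comparison above can be sketched as follows: hash the downloaded bytes and compare against the digest recorded at upload time. SHA-256 is used here; whether your upload pipeline recorded SHA-256, MD5, or an R2 etag is an assumption to verify:

```typescript
import { createHash } from "node:crypto";

// Hex SHA-256 digest of the downloaded object bytes.
function sha256Hex(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Compare against the digest recorded when the object was uploaded.
function objectIntact(downloaded: Uint8Array | string, recordedDigest: string): boolean {
  return sha256Hex(downloaded) === recordedDigest;
}
```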
Recovery Options
Restore from Secondary Bucket:
```bash
# Copy from backup bucket
bunx wrangler r2 object get monotask-backups-secondary/evidence-storage/object.txt \
  --file temp.txt
bunx wrangler r2 object put evidence-storage/object.txt \
  --file temp.txt
```
Use Object Versioning (if enabled):
- Retrieve previous version of object
- Promote to current version
Restore from Backup:
- Use R2 sync from secondary bucket
- Follow R2 backup procedures
Data Validation After Recovery
Validation Checklist
[ ] Integrity Checks Pass:
```bash
bunx wrangler d1 execute monotask-production \
  --command "PRAGMA integrity_check" \
  --json
```
[ ] Foreign Keys Valid:
```bash
bunx wrangler d1 execute monotask-production \
  --command "PRAGMA foreign_key_check" \
  --json
```
[ ] Row Counts Match Expected:
```bash
# Compare with historical data
bunx wrangler d1 execute monotask-production \
  --command "SELECT COUNT(*) FROM tasks" \
  --json
```
[ ] Sample Data Correct:
- Query known records
- Verify values are as expected
- Check relationships intact
[ ] Application Functions:
- Test critical user flows
- Verify no errors in logs
- Check UI displays correctly
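The row-count item in the checklist can be scripted by diffing post-recovery counts against a pre-incident baseline. A minimal sketch, where the baseline table names and counts are assumptions standing in for whatever snapshot you keep:

```typescript
// Report tables whose post-recovery row count fell below the
// pre-incident baseline (a count that grew is not flagged).
function shrunkenTables(
  baseline: Record<string, number>,
  current: Record<string, number>,
): string[] {
  return Object.keys(baseline).filter(
    (table) => (current[table] ?? 0) < baseline[table],
  );
}
```

Any table this returns warrants a closer look before declaring the recovery complete.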
Prevention Measures
Immediate Prevention
Identify and Fix Root Cause:
- Fix application bug
- Correct migration script
- Update validation logic
Add Safeguards:
- Add data validation checks
- Implement transaction isolation
- Add constraint checks
Improve Monitoring:
- Add integrity check automation
- Set up corruption detection alerts
- Monitor constraint violations
Long-Term Prevention
Code Quality:
- Add integration tests
- Improve validation coverage
- Code review for data operations
Operational Excellence:
- Automate migrations
- Test migrations on staging
- Require backup before migrations
Architecture:
- Consider event sourcing
- Add data versioning
- Implement audit trails
Communication During Recovery
Status Update Template
🔧 DATA CORRUPTION RECOVERY
Scope: [affected component]
Severity: [Critical/High/Medium/Low]
Impact: [user/feature impact]
Root Cause: [if identified]
Recovery Method: [point-in-time/selective/repair]
ETA: [completion time]
Data Loss: [expected data loss]
Updates: Every 15 minutes
Escalation Criteria
Escalate if:
- [ ] Corruption cause unknown after 30 minutes
- [ ] Multiple recovery attempts failed
- [ ] Corruption is spreading
- [ ] Data loss exceeds acceptable threshold
- [ ] Critical business data affected
- [ ] Recovery taking longer than RTO
Escalation Path:
- Database Administrator
- Senior Engineer
- Engineering Manager
- CTO/VP Engineering
Post-Recovery Actions
Root Cause Analysis:
- Document corruption cause
- Create prevention plan
- Update code/processes
Incident Report:
- Timeline of events
- Recovery steps taken
- Data loss assessment
- Lessons learned
Process Improvements:
- Update validation logic
- Improve testing
- Enhance monitoring
- Update playbooks
Training:
- Share learnings with team
- Update documentation
- Practice recovery procedures
Related Playbooks
- D1 Recovery - Database restoration procedures
- Disaster Recovery - Full system recovery
- Worker Rollback - Application rollback
Revision History
| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-10-26 | 1.0 | Initial playbook | System |