Data Corruption

Overview

This playbook addresses detection, assessment, and recovery from data corruption in the MonoTask system, covering D1 database corruption, KV data inconsistencies, and R2 object corruption.

RTO Target: 2 hours
RPO Target: Varies by data type (D1: 24h, KV: 1h, R2: 24h)


When to Use This Playbook

Use this playbook when encountering:

  • Data Integrity Violations: Foreign key constraints failing, orphaned records
  • Checksum Mismatches: File integrity verification failures
  • Schema Corruption: Table structure inconsistencies
  • Index Corruption: Index scans returning incorrect results
  • Encoding Issues: Character encoding problems, garbled text
  • Inconsistent State: Data in conflicting states across systems
  • Truncated Data: Incomplete or cut-off records
  • Duplicate Records: Unexpected duplicate data

Corruption Detection Methods

Automated Detection

  1. Database Integrity Checks:

    bash
    # Run SQLite integrity check
    bunx wrangler d1 execute monotask-production \
      --command "PRAGMA integrity_check" \
      --json
  2. Foreign Key Validation:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "PRAGMA foreign_key_check" \
      --json
  3. Monitoring Alerts:

    • Constraint violation rate spike
    • Failed transaction rate increase
    • Data validation errors in logs
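The automated checks above can be wired into a scheduled job. A minimal sketch of the parsing step follows, assuming the `--json` output of `wrangler d1 execute` is an array of result sets, each carrying a `results` array of row objects (the helper name is illustrative):

```typescript
// Hypothetical helper for automated integrity-check parsing.
interface D1ResultSet {
  results: Record<string, unknown>[];
}

// PRAGMA integrity_check returns a single row { integrity_check: "ok" }
// when the database is healthy; any other rows describe corruption.
function integrityErrors(resultSets: D1ResultSet[]): string[] {
  const rows = resultSets.flatMap((set) => set.results);
  return rows
    .map((row) => String(row["integrity_check"] ?? ""))
    .filter((msg) => msg !== "ok");
}

// Example: healthy vs. corrupted output
const healthy = [{ results: [{ integrity_check: "ok" }] }];
const corrupted = [
  { results: [{ integrity_check: "row 12 missing from index idx_tasks_state" }] },
];

console.log(integrityErrors(healthy).length); // 0
console.log(integrityErrors(corrupted));     // one corruption message
```

A scheduler would run the PRAGMA, feed the JSON through this function, and page on any non-empty result.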

Manual Detection

  1. User Reports:

    • Missing data complaints
    • Incorrect data displayed
    • Application behaving inconsistently
  2. Data Audits:

    • Regular data sampling
    • Row count verification
    • Critical record validation
  3. Development Detection:

    • Test failures
    • QA reports
    • Code review findings

Corruption Assessment Process

Step 1: Identify Scope

Estimated Time: 15 minutes

  1. Determine Affected Components:

    bash
    # Check D1 database
    bunx wrangler d1 execute monotask-production \
      --command "SELECT name FROM sqlite_master WHERE type='table'" \
      --json
    
    # For each table, check row counts
    bunx wrangler d1 execute monotask-production \
      --command "SELECT COUNT(*) FROM tasks" \
      --json
  2. Identify Corruption Extent:

    • Single record vs. multiple records
    • Single table vs. multiple tables
    • Isolated vs. systemic corruption
    • Point-in-time vs. ongoing
  3. Check Related Systems:

    • Are KV namespaces affected?
    • Are R2 objects corrupted?
    • Is corruption limited to one component?

Document findings for impact assessment.
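The row-count comparison in step 1 can be automated against a known-good baseline. A sketch, with illustrative table names and a ±10% tolerance that would need tuning per table:

```typescript
// Hypothetical scope-assessment helper: flag tables whose current row
// count deviates from a known-good baseline beyond a tolerance.
type Counts = Record<string, number>;

function suspectTables(
  baseline: Counts,
  current: Counts,
  tolerance = 0.1, // flag deviations beyond ±10%
): string[] {
  return Object.keys(baseline).filter((table) => {
    const before = baseline[table];
    const after = current[table] ?? 0;
    if (before === 0) return after !== 0;
    return Math.abs(after - before) / before > tolerance;
  });
}

const baseline = { tasks: 1000, projects: 50, users: 200 };
const current = { tasks: 640, projects: 51, users: 200 }; // tasks dropped 36%

console.log(suspectTables(baseline, current)); // ["tasks"]
```

The flagged list narrows which tables need record-level inspection before moving to impact assessment.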


Step 2: Assess Impact

Estimated Time: 10 minutes

  1. User Impact:

    • How many users affected?
    • Which features are broken?
    • Is data accessible but incorrect?
    • Is data completely missing?
  2. Data Loss Assessment:

    • Can corruption be isolated?
    • Is partial recovery possible?
    • What's the extent of data loss?
  3. Business Impact:

    • Critical data affected?
    • Regulatory/compliance implications?
    • Financial impact?

Severity Levels:

  • Critical: Core functionality broken, data loss, many users affected
  • High: Important features broken, some data loss, subset of users affected
  • Medium: Non-critical features affected, minimal data loss
  • Low: Minor inconsistencies, no functional impact

Step 3: Determine Root Cause

Estimated Time: 20 minutes

Common corruption causes:

  1. Application Bugs:

    • Race conditions
    • Improper transaction handling
    • Validation bypass
    • Logic errors
  2. Migration Issues:

    • Failed schema changes
    • Data transformation errors
    • Incomplete migrations
  3. Infrastructure Issues:

    • Database crash during write
    • Network interruption
    • Disk I/O errors
    • Resource exhaustion
  4. Operational Errors:

    • Manual data modifications
    • Incorrect admin operations
    • Deployment issues

Investigate:

bash
# Check recent deployments
bunx wrangler deployments list --name monotask-api-gateway

# Review error logs
# Check Cloudflare Analytics for anomalies
# Review recent database migrations
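When correlating deployments with corruption onset, the usual heuristic is that the latest deployment before the first corrupted record is the prime suspect. A sketch of that lookup (timestamps are illustrative):

```typescript
// Hypothetical root-cause aid: given deployment timestamps and the
// earliest corrupted record's timestamp, return the latest deployment
// that happened at or before corruption began.
function suspectDeployment(
  deployTimes: string[], // ISO 8601 timestamps
  firstCorruption: string,
): string | undefined {
  const cutoff = Date.parse(firstCorruption);
  return deployTimes
    .filter((t) => Date.parse(t) <= cutoff)
    .sort()
    .at(-1); // ISO strings sort chronologically
}

const deploys = [
  "2025-10-24T09:00:00Z",
  "2025-10-25T14:30:00Z",
  "2025-10-26T08:00:00Z",
];

console.log(suspectDeployment(deploys, "2025-10-25T16:00:00Z"));
// "2025-10-25T14:30:00Z"
```

This only produces a candidate; confirm by reviewing that deployment's diff and migrations.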

Point-in-Time Recovery

Option 1: Full Database Restore

When to Use:

  • Systemic corruption across database
  • Cannot isolate corrupted data
  • Rollback to before corruption event

Procedure:

  1. Identify Recovery Point:

    bash
    # List available backups
    bun run scripts/backup/d1-backup.ts list
  2. Select Backup Before Corruption:

    • Check backup timestamp
    • Verify backup predates corruption
    • Consider data loss from RPO
  3. Execute Restoration:

    bash
    bun run scripts/recovery/d1-restore.ts <backup-id>
  4. Follow D1 Recovery Playbook for complete procedure

Data Loss: Everything since backup timestamp (up to 24 hours)


Option 2: Selective Data Restore

When to Use:

  • Corruption isolated to specific tables/records
  • Want to minimize data loss
  • Can identify uncorrupted baseline

Procedure:

  1. Export Uncorrupted Data:

    bash
    # Export current state of uncorrupted tables
    bunx wrangler d1 execute monotask-production \
      --command "SELECT * FROM projects" \
      --json > projects_current.json
  2. Restore from Backup to Staging:

    bash
    # Use staging database
    D1_DATABASE_NAME=monotask-staging \
    bun run scripts/recovery/d1-restore.ts <backup-id>
  3. Extract Needed Data:

    bash
    # Export corrupted tables from backup
    bunx wrangler d1 execute monotask-staging \
      --command "SELECT * FROM tasks WHERE id IN ('id1', 'id2')" \
      --json > tasks_from_backup.json
  4. Merge Data:

    • Manually merge uncorrupted current data with restored data
    • Use custom script to reconcile differences
    • Validate merged result
  5. Import to Production:

    bash
    # Import merged data
    bunx wrangler d1 execute monotask-production \
      --file merged_data.sql

Data Loss: Minimal, only specific corrupted records
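The "custom script to reconcile differences" in step 4 can be as simple as a keyed merge that prefers the current (uncorrupted) row and falls back to the backup row. A sketch, assuming rows are keyed by an `id` column:

```typescript
// Hypothetical merge step for Option 2: current rows win on key
// collision; backup-only rows are restored.
interface Row {
  id: string;
  [field: string]: unknown;
}

function mergeRows(current: Row[], fromBackup: Row[]): Row[] {
  const merged = new Map<string, Row>();
  // Backup rows first, so current rows overwrite them on collision.
  for (const row of fromBackup) merged.set(row.id, row);
  for (const row of current) merged.set(row.id, row);
  return [...merged.values()];
}

const current = [{ id: "t1", state: "COMPLETED" }];
const fromBackup = [
  { id: "t1", state: "PENDING" }, // superseded by current
  { id: "t2", state: "PENDING" }, // only in backup: restored
];

console.log(mergeRows(current, fromBackup).length); // 2
```

Real reconciliation may need per-column rules (e.g. newest `updated_at` wins); validate the merged set against foreign keys before generating `merged_data.sql`.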


Option 3: Surgical Data Repair

When to Use:

  • Corruption is well-understood
  • Can identify and fix specific issues
  • Pattern-based corruption (e.g., all tasks in INVALID state)

Procedure:

  1. Identify Corruption Pattern:

    bash
    # Find affected records
    bunx wrangler d1 execute monotask-production \
      --command "SELECT * FROM tasks WHERE state NOT IN ('PENDING', 'IN_PROGRESS', 'COMPLETED')" \
      --json
  2. Create Repair Script:

    sql
    -- Example: Fix invalid state values
    UPDATE tasks
    SET state = 'PENDING'
    WHERE state NOT IN ('PENDING', 'IN_PROGRESS', 'COMPLETED', 'FAILED');
  3. Test on Staging:

    bash
    # Test repair on staging database first
    bunx wrangler d1 execute monotask-staging --file repair.sql
  4. Validate Repair:

    bash
    # Verify fix worked
    bunx wrangler d1 execute monotask-staging \
      --command "PRAGMA integrity_check" \
      --json
  5. Apply to Production:

    bash
    # Create backup first
    bun run scripts/backup/d1-backup.ts backup
    
    # Apply repair
    bunx wrangler d1 execute monotask-production --file repair.sql

Data Loss: None if repair is correct


KV Data Corruption Recovery

Identifying KV Corruption

bash
# List keys to check for anomalies
bunx wrangler kv:key list --namespace-id 9fb88e98a937493e93fa6930f4506302

# Get specific key to check value
bunx wrangler kv:key get "suspicious-key" --namespace-id 9fb88e98a937493e93fa6930f4506302
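Spot-checking keys by hand does not scale; a bulk scan can validate each value against its expected shape. A minimal sketch, assuming session values are JSON objects with a `userId` field (the field name and key format are assumptions):

```typescript
// Hypothetical KV corruption scan: flag values that fail to parse as
// JSON or lack the expected userId field.
function corruptKeys(entries: Record<string, string>): string[] {
  return Object.entries(entries)
    .filter(([, value]) => {
      try {
        const parsed = JSON.parse(value);
        return typeof parsed?.userId !== "string";
      } catch {
        return true; // unparseable value is corrupt by definition
      }
    })
    .map(([key]) => key);
}

const sample = {
  "session:a": '{"userId":"u1","expires":1730000000}',
  "session:b": "{truncated...",          // corrupt: invalid JSON
  "session:c": '{"expires":1730000000}', // corrupt: missing userId
};

console.log(corruptKeys(sample)); // ["session:b", "session:c"]
```

The flagged keys then feed directly into the selective-restore or deletion options below.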

Recovery Options

  1. Full Namespace Restore:

    bash
    bun run scripts/recovery/kv-restore.ts <backup-id> \
      --namespace SESSIONS \
      --clear
  2. Selective Key Restore:

    bash
    bun run scripts/recovery/kv-restore.ts <backup-id> \
      --namespace SESSIONS \
      --key "specific-key"
  3. Manual Key Deletion:

    bash
    # Delete corrupted key
    bunx wrangler kv:key delete "corrupted-key" \
      --namespace-id 9fb88e98a937493e93fa6930f4506302

R2 Object Corruption Recovery

Detecting R2 Corruption

  1. Checksum Verification:

    • Compare object checksums
    • Verify file integrity on download
  2. Access Errors:

    • 404 errors for existing objects
    • Corrupted file downloads
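For checksum verification, R2 (like S3) exposes the MD5 of a single-part upload as the object's etag, so a local hash of the downloaded bytes can be compared against it. A sketch, noting that multipart uploads have composite etags this comparison does not cover:

```typescript
// Hypothetical integrity check: compare a downloaded body's MD5
// against the object's etag (valid for single-part uploads only).
import { createHash } from "node:crypto";

function md5Hex(data: Uint8Array | string): string {
  return createHash("md5").update(data).digest("hex");
}

function matchesEtag(body: Uint8Array | string, etag: string): boolean {
  // Etags are often returned quoted; strip quotes before comparing.
  return md5Hex(body) === etag.replace(/"/g, "");
}

const body = "evidence file contents";
const goodEtag = md5Hex(body);

console.log(matchesEtag(body, goodEtag));       // true
console.log(matchesEtag("tampered", goodEtag)); // false
```

A mismatch means the stored object or the download is corrupt; proceed to one of the recovery options below.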

Recovery Options

  1. Restore from Secondary Bucket:

    bash
    # Copy from backup bucket
    bunx wrangler r2 object get monotask-backups-secondary/evidence-storage/object.txt \
      --file temp.txt
    
    bunx wrangler r2 object put evidence-storage/object.txt \
      --file temp.txt
  2. Use Object Versioning (if enabled):

    • Retrieve previous version of object
    • Promote to current version
  3. Restore from Backup:


Data Validation After Recovery

Validation Checklist

  • [ ] Integrity Checks Pass:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "PRAGMA integrity_check" \
      --json
  • [ ] Foreign Keys Valid:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "PRAGMA foreign_key_check" \
      --json
  • [ ] Row Counts Match Expected:

    bash
    # Compare with historical data
    bunx wrangler d1 execute monotask-production \
      --command "SELECT COUNT(*) FROM tasks" \
      --json
  • [ ] Sample Data Correct:

    • Query known records
    • Verify values are as expected
    • Check relationships intact
  • [ ] Application Functions:

    • Test critical user flows
    • Verify no errors in logs
    • Check UI displays correctly
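The checklist above can be run in one pass by a small harness that names each check and reports failures. A minimal sketch; the check bodies here are stand-ins for shelling out to the wrangler commands listed above:

```typescript
// Hypothetical post-recovery validation harness: run named checks,
// aggregate failures so the whole checklist passes or fails at once.
type Check = { name: string; run: () => boolean };

function runChecklist(checks: Check[]): { passed: boolean; failed: string[] } {
  const failed = checks.filter((c) => !c.run()).map((c) => c.name);
  return { passed: failed.length === 0, failed };
}

// Illustrative checks; real ones would invoke wrangler and parse JSON.
const checks: Check[] = [
  { name: "integrity_check", run: () => true },
  { name: "foreign_key_check", run: () => true },
  { name: "row_counts", run: () => false }, // pretend counts diverged
];

console.log(runChecklist(checks)); // passed: false, failed: ["row_counts"]
```

Running the same harness before sign-off and again an hour later helps catch corruption that recurs after recovery.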

Prevention Measures

Immediate Prevention

  1. Identify and Fix Root Cause:

    • Fix application bug
    • Correct migration script
    • Update validation logic
  2. Add Safeguards:

    • Add data validation checks
    • Implement transaction isolation
    • Add constraint checks
  3. Improve Monitoring:

    • Add integrity check automation
    • Set up corruption detection alerts
    • Monitor constraint violations

Long-Term Prevention

  1. Code Quality:

    • Add integration tests
    • Improve validation coverage
    • Code review for data operations
  2. Operational Excellence:

    • Automate migrations
    • Test migrations on staging
    • Require backup before migrations
  3. Architecture:

    • Consider event sourcing
    • Add data versioning
    • Implement audit trails

Communication During Recovery

Status Update Template

🔧 DATA CORRUPTION RECOVERY

Scope: [affected component]
Severity: [Critical/High/Medium/Low]
Impact: [user/feature impact]
Root Cause: [if identified]
Recovery Method: [point-in-time/selective/repair]
ETA: [completion time]
Data Loss: [expected data loss]
Updates: Every 15 minutes

Escalation Criteria

Escalate if:

  • [ ] Corruption cause unknown after 30 minutes
  • [ ] Multiple recovery attempts failed
  • [ ] Corruption is spreading
  • [ ] Data loss exceeds acceptable threshold
  • [ ] Critical business data affected
  • [ ] Recovery taking longer than RTO

Escalation Path:

  1. Database Administrator
  2. Senior Engineer
  3. Engineering Manager
  4. CTO/VP Engineering

Post-Recovery Actions

  1. Root Cause Analysis:

    • Document corruption cause
    • Create prevention plan
    • Update code/processes
  2. Incident Report:

    • Timeline of events
    • Recovery steps taken
    • Data loss assessment
    • Lessons learned
  3. Process Improvements:

    • Update validation logic
    • Improve testing
    • Enhance monitoring
    • Update playbooks
  4. Training:

    • Share learnings with team
    • Update documentation
    • Practice recovery procedures


Revision History

Date        Version  Changes           Author
2025-10-26  1.0      Initial playbook  System

MonoKernel MonoTask Documentation