Data Corruption

Overview

This playbook addresses detection, assessment, and recovery from data corruption in the MonoTask system, covering D1 database corruption, KV data inconsistencies, and R2 object corruption.

RTO Target: 2 hours
RPO Target: Varies by data type (D1: 24h, KV: 1h, R2: 24h)


When to Use This Playbook

Use this playbook when encountering:

  • Data Integrity Violations: Foreign key constraints failing, orphaned records
  • Checksum Mismatches: File integrity verification failures
  • Schema Corruption: Table structure inconsistencies
  • Index Corruption: Index scans returning incorrect results
  • Encoding Issues: Character encoding problems, garbled text
  • Inconsistent State: Data in conflicting states across systems
  • Truncated Data: Incomplete or cut-off records
  • Duplicate Records: Unexpected duplicate data

Corruption Detection Methods

Automated Detection

  1. Database Integrity Checks:

    bash
    # Run SQLite integrity check
    bunx wrangler d1 execute monotask-production \
      --command "PRAGMA integrity_check" \
      --json
  2. Foreign Key Validation:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "PRAGMA foreign_key_check" \
      --json
  3. Monitoring Alerts:

    • Constraint violation rate spike
    • Failed transaction rate increase
    • Data validation errors in logs
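The automated checks above can be wired into a scheduled job. A minimal sketch of the parsing step follows, assuming the `--json` output of `wrangler d1 execute` is an array of result sets, each carrying a `results` array of row objects (the helper name is illustrative):

```typescript
// Hypothetical helper for automated integrity-check parsing.
interface D1ResultSet {
  results: Record<string, unknown>[];
}

// PRAGMA integrity_check returns a single row { integrity_check: "ok" }
// when the database is healthy; any other rows describe corruption.
function integrityErrors(resultSets: D1ResultSet[]): string[] {
  const rows = resultSets.flatMap((set) => set.results);
  return rows
    .map((row) => String(row["integrity_check"] ?? ""))
    .filter((msg) => msg !== "ok");
}

// Example: healthy vs. corrupted output
const healthy = [{ results: [{ integrity_check: "ok" }] }];
const corrupted = [
  { results: [{ integrity_check: "row 12 missing from index idx_tasks_state" }] },
];

console.log(integrityErrors(healthy).length); // 0
console.log(integrityErrors(corrupted));     // one corruption message
```

A scheduler would run the PRAGMA, feed the JSON through this function, and page on any non-empty result.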

Manual Detection

  1. User Reports:

    • Missing data complaints
    • Incorrect data displayed
    • Application behaving inconsistently
  2. Data Audits:

    • Regular data sampling
    • Row count verification
    • Critical record validation
  3. Development Detection:

    • Test failures
    • QA reports
    • Code review findings

Corruption Assessment Process

Step 1: Identify Scope

Estimated Time: 15 minutes

  1. Determine Affected Components:

    bash
    # Check D1 database
    bunx wrangler d1 execute monotask-production \
      --command "SELECT name FROM sqlite_master WHERE type='table'" \
      --json
    
    # For each table, check row counts
    bunx wrangler d1 execute monotask-production \
      --command "SELECT COUNT(*) FROM tasks" \
      --json
  2. Identify Corruption Extent:

    • Single record vs. multiple records
    • Single table vs. multiple tables
    • Isolated vs. systemic corruption
    • Point-in-time vs. ongoing
  3. Check Related Systems:

    • Are KV namespaces affected?
    • Are R2 objects corrupted?
    • Is corruption limited to one component?

Document findings for impact assessment.
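The row-count comparison in step 1 can be automated against a known-good baseline. A sketch, with illustrative table names and a ±10% tolerance that would need tuning per table:

```typescript
// Hypothetical scope-assessment helper: flag tables whose current row
// count deviates from a known-good baseline beyond a tolerance.
type Counts = Record<string, number>;

function suspectTables(
  baseline: Counts,
  current: Counts,
  tolerance = 0.1, // flag deviations beyond ±10%
): string[] {
  return Object.keys(baseline).filter((table) => {
    const before = baseline[table];
    const after = current[table] ?? 0;
    if (before === 0) return after !== 0;
    return Math.abs(after - before) / before > tolerance;
  });
}

const baseline = { tasks: 1000, projects: 50, users: 200 };
const current = { tasks: 640, projects: 51, users: 200 }; // tasks dropped 36%

console.log(suspectTables(baseline, current)); // ["tasks"]
```

The flagged list narrows which tables need record-level inspection before moving to impact assessment.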


Step 2: Assess Impact

Estimated Time: 10 minutes

  1. User Impact:

    • How many users affected?
    • Which features are broken?
    • Is data accessible but incorrect?
    • Is data completely missing?
  2. Data Loss Assessment:

    • Can corruption be isolated?
    • Is partial recovery possible?
    • What's the extent of data loss?
  3. Business Impact:

    • Critical data affected?
    • Regulatory/compliance implications?
    • Financial impact?

Severity Levels:

  • Critical: Core functionality broken, data loss, many users affected
  • High: Important features broken, some data loss, subset of users affected
  • Medium: Non-critical features affected, minimal data loss
  • Low: Minor inconsistencies, no functional impact

Step 3: Determine Root Cause

Estimated Time: 20 minutes

Common corruption causes:

  1. Application Bugs:

    • Race conditions
    • Improper transaction handling
    • Validation bypass
    • Logic errors
  2. Migration Issues:

    • Failed schema changes
    • Data transformation errors
    • Incomplete migrations
  3. Infrastructure Issues:

    • Database crash during write
    • Network interruption
    • Disk I/O errors
    • Resource exhaustion
  4. Operational Errors:

    • Manual data modifications
    • Incorrect admin operations
    • Deployment issues

Investigate:

bash
# Check recent deployments
bunx wrangler deployments list --name monotask-api-gateway

# Review error logs
# Check Cloudflare Analytics for anomalies
# Review recent database migrations
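When correlating deployments with corruption onset, the usual heuristic is that the latest deployment before the first corrupted record is the prime suspect. A sketch of that lookup (timestamps are illustrative):

```typescript
// Hypothetical root-cause aid: given deployment timestamps and the
// earliest corrupted record's timestamp, return the latest deployment
// that happened at or before corruption began.
function suspectDeployment(
  deployTimes: string[], // ISO 8601 timestamps
  firstCorruption: string,
): string | undefined {
  const cutoff = Date.parse(firstCorruption);
  return deployTimes
    .filter((t) => Date.parse(t) <= cutoff)
    .sort()
    .at(-1); // ISO strings sort chronologically
}

const deploys = [
  "2025-10-24T09:00:00Z",
  "2025-10-25T14:30:00Z",
  "2025-10-26T08:00:00Z",
];

console.log(suspectDeployment(deploys, "2025-10-25T16:00:00Z"));
// "2025-10-25T14:30:00Z"
```

This only produces a candidate; confirm by reviewing that deployment's diff and migrations.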

Point-in-Time Recovery

Option 1: Full Database Restore

When to Use:

  • Systemic corruption across database
  • Cannot isolate corrupted data
  • Rollback to before corruption event

Procedure:

  1. Identify Recovery Point:

    bash
    # List available backups
    bun run scripts/backup/d1-backup.ts list
  2. Select Backup Before Corruption:

    • Check backup timestamp
    • Verify backup predates corruption
    • Consider data loss from RPO
  3. Execute Restoration:

    bash
    bun run scripts/recovery/d1-restore.ts <backup-id>
  4. Follow D1 Recovery Playbook for complete procedure

Data Loss: Everything since backup timestamp (up to 24 hours)


Option 2: Selective Data Restore

When to Use:

  • Corruption isolated to specific tables/records
  • Want to minimize data loss
  • Can identify uncorrupted baseline

Procedure:

  1. Export Uncorrupted Data:

    bash
    # Export current state of uncorrupted tables
    bunx wrangler d1 execute monotask-production \
      --command "SELECT * FROM projects" \
      --json > projects_current.json
  2. Restore from Backup to Staging:

    bash
    # Use staging database
    D1_DATABASE_NAME=monotask-staging \
    bun run scripts/recovery/d1-restore.ts <backup-id>
  3. Extract Needed Data:

    bash
    # Export corrupted tables from backup
    bunx wrangler d1 execute monotask-staging \
      --command "SELECT * FROM tasks WHERE id IN ('id1', 'id2')" \
      --json > tasks_from_backup.json
  4. Merge Data:

    • Manually merge uncorrupted current data with restored data
    • Use custom script to reconcile differences
    • Validate merged result
  5. Import to Production:

    bash
    # Import merged data
    bunx wrangler d1 execute monotask-production \
      --file merged_data.sql

Data Loss: Minimal, only specific corrupted records
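The "custom script to reconcile differences" in step 4 can be as simple as a keyed merge that prefers the current (uncorrupted) row and falls back to the backup row. A sketch, assuming rows are keyed by an `id` column:

```typescript
// Hypothetical merge step for Option 2: current rows win on key
// collision; backup-only rows are restored.
interface Row {
  id: string;
  [field: string]: unknown;
}

function mergeRows(current: Row[], fromBackup: Row[]): Row[] {
  const merged = new Map<string, Row>();
  // Backup rows first, so current rows overwrite them on collision.
  for (const row of fromBackup) merged.set(row.id, row);
  for (const row of current) merged.set(row.id, row);
  return [...merged.values()];
}

const current = [{ id: "t1", state: "COMPLETED" }];
const fromBackup = [
  { id: "t1", state: "PENDING" }, // superseded by current
  { id: "t2", state: "PENDING" }, // only in backup: restored
];

console.log(mergeRows(current, fromBackup).length); // 2
```

Real reconciliation may need per-column rules (e.g. newest `updated_at` wins); validate the merged set against foreign keys before generating `merged_data.sql`.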


Option 3: Surgical Data Repair

When to Use:

  • Corruption is well-understood
  • Can identify and fix specific issues
  • Pattern-based corruption (e.g., all tasks in INVALID state)

Procedure:

  1. Identify Corruption Pattern:

    bash
    # Find affected records
    bunx wrangler d1 execute monotask-production \
      --command "SELECT * FROM tasks WHERE state NOT IN ('PENDING', 'IN_PROGRESS', 'COMPLETED')" \
      --json
  2. Create Repair Script:

    sql
    -- Example: Fix invalid state values
    UPDATE tasks
    SET state = 'PENDING'
    WHERE state NOT IN ('PENDING', 'IN_PROGRESS', 'COMPLETED', 'FAILED');
  3. Test on Staging:

    bash
    # Test repair on staging database first
    bunx wrangler d1 execute monotask-staging --file repair.sql
  4. Validate Repair:

    bash
    # Verify fix worked
    bunx wrangler d1 execute monotask-staging \
      --command "PRAGMA integrity_check" \
      --json
  5. Apply to Production:

    bash
    # Create backup first
    bun run scripts/backup/d1-backup.ts backup
    
    # Apply repair
    bunx wrangler d1 execute monotask-production --file repair.sql

Data Loss: None if repair is correct


KV Data Corruption Recovery

Identifying KV Corruption

bash
# List keys to check for anomalies
bunx wrangler kv:key list --namespace-id 9fb88e98a937493e93fa6930f4506302

# Get specific key to check value
bunx wrangler kv:key get "suspicious-key" --namespace-id 9fb88e98a937493e93fa6930f4506302
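Spot-checking keys by hand does not scale; a bulk scan can validate each value against its expected shape. A minimal sketch, assuming session values are JSON objects with a `userId` field (the field name and key format are assumptions):

```typescript
// Hypothetical KV corruption scan: flag values that fail to parse as
// JSON or lack the expected userId field.
function corruptKeys(entries: Record<string, string>): string[] {
  return Object.entries(entries)
    .filter(([, value]) => {
      try {
        const parsed = JSON.parse(value);
        return typeof parsed?.userId !== "string";
      } catch {
        return true; // unparseable value is corrupt by definition
      }
    })
    .map(([key]) => key);
}

const sample = {
  "session:a": '{"userId":"u1","expires":1730000000}',
  "session:b": "{truncated...",          // corrupt: invalid JSON
  "session:c": '{"expires":1730000000}', // corrupt: missing userId
};

console.log(corruptKeys(sample)); // ["session:b", "session:c"]
```

The flagged keys then feed directly into the selective-restore or deletion options below.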

Recovery Options

  1. Full Namespace Restore:

    bash
    bun run scripts/recovery/kv-restore.ts <backup-id> \
      --namespace SESSIONS \
      --clear
  2. Selective Key Restore:

    bash
    bun run scripts/recovery/kv-restore.ts <backup-id> \
      --namespace SESSIONS \
      --key "specific-key"
  3. Manual Key Deletion:

    bash
    # Delete corrupted key
    bunx wrangler kv:key delete "corrupted-key" \
      --namespace-id 9fb88e98a937493e93fa6930f4506302

R2 Object Corruption Recovery

Detecting R2 Corruption

  1. Checksum Verification:

    • Compare object checksums
    • Verify file integrity on download
  2. Access Errors:

    • 404 errors for existing objects
    • Corrupted file downloads
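For checksum verification, R2 (like S3) exposes the MD5 of a single-part upload as the object's etag, so a local hash of the downloaded bytes can be compared against it. A sketch, noting that multipart uploads have composite etags this comparison does not cover:

```typescript
// Hypothetical integrity check: compare a downloaded body's MD5
// against the object's etag (valid for single-part uploads only).
import { createHash } from "node:crypto";

function md5Hex(data: Uint8Array | string): string {
  return createHash("md5").update(data).digest("hex");
}

function matchesEtag(body: Uint8Array | string, etag: string): boolean {
  // Etags are often returned quoted; strip quotes before comparing.
  return md5Hex(body) === etag.replace(/"/g, "");
}

const body = "evidence file contents";
const goodEtag = md5Hex(body);

console.log(matchesEtag(body, goodEtag));       // true
console.log(matchesEtag("tampered", goodEtag)); // false
```

A mismatch means the stored object or the download is corrupt; proceed to one of the recovery options below.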

Recovery Options

  1. Restore from Secondary Bucket:

    bash
    # Copy from backup bucket
    bunx wrangler r2 object get monotask-backups-secondary/evidence-storage/object.txt \
      --file temp.txt
    
    bunx wrangler r2 object put evidence-storage/object.txt \
      --file temp.txt
  2. Use Object Versioning (if enabled):

    • Retrieve previous version of object
    • Promote to current version
  3. Restore from Backup:


Data Validation After Recovery

Validation Checklist

  • [ ] Integrity Checks Pass:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "PRAGMA integrity_check" \
      --json
  • [ ] Foreign Keys Valid:

    bash
    bunx wrangler d1 execute monotask-production \
      --command "PRAGMA foreign_key_check" \
      --json
  • [ ] Row Counts Match Expected:

    bash
    # Compare with historical data
    bunx wrangler d1 execute monotask-production \
      --command "SELECT COUNT(*) FROM tasks" \
      --json
  • [ ] Sample Data Correct:

    • Query known records
    • Verify values are as expected
    • Check relationships intact
  • [ ] Application Functions:

    • Test critical user flows
    • Verify no errors in logs
    • Check UI displays correctly
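The checklist above can be run in one pass by a small harness that names each check and reports failures. A minimal sketch; the check bodies here are stand-ins for shelling out to the wrangler commands listed above:

```typescript
// Hypothetical post-recovery validation harness: run named checks,
// aggregate failures so the whole checklist passes or fails at once.
type Check = { name: string; run: () => boolean };

function runChecklist(checks: Check[]): { passed: boolean; failed: string[] } {
  const failed = checks.filter((c) => !c.run()).map((c) => c.name);
  return { passed: failed.length === 0, failed };
}

// Illustrative checks; real ones would invoke wrangler and parse JSON.
const checks: Check[] = [
  { name: "integrity_check", run: () => true },
  { name: "foreign_key_check", run: () => true },
  { name: "row_counts", run: () => false }, // pretend counts diverged
];

console.log(runChecklist(checks)); // passed: false, failed: ["row_counts"]
```

Running the same harness before sign-off and again an hour later helps catch corruption that recurs after recovery.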

Prevention Measures

Immediate Prevention

  1. Identify and Fix Root Cause:

    • Fix application bug
    • Correct migration script
    • Update validation logic
  2. Add Safeguards:

    • Add data validation checks
    • Implement transaction isolation
    • Add constraint checks
  3. Improve Monitoring:

    • Add integrity check automation
    • Set up corruption detection alerts
    • Monitor constraint violations

Long-Term Prevention

  1. Code Quality:

    • Add integration tests
    • Improve validation coverage
    • Code review for data operations
  2. Operational Excellence:

    • Automate migrations
    • Test migrations on staging
    • Require backup before migrations
  3. Architecture:

    • Consider event sourcing
    • Add data versioning
    • Implement audit trails

Communication During Recovery

Status Update Template

🔧 DATA CORRUPTION RECOVERY

Scope: [affected component]
Severity: [Critical/High/Medium/Low]
Impact: [user/feature impact]
Root Cause: [if identified]
Recovery Method: [point-in-time/selective/repair]
ETA: [completion time]
Data Loss: [expected data loss]
Updates: Every 15 minutes

Escalation Criteria

Escalate if:

  • [ ] Corruption cause unknown after 30 minutes
  • [ ] Multiple recovery attempts failed
  • [ ] Corruption is spreading
  • [ ] Data loss exceeds acceptable threshold
  • [ ] Critical business data affected
  • [ ] Recovery taking longer than RTO

Escalation Path:

  1. Database Administrator
  2. Senior Engineer
  3. Engineering Manager
  4. CTO/VP Engineering

Post-Recovery Actions

  1. Root Cause Analysis:

    • Document corruption cause
    • Create prevention plan
    • Update code/processes
  2. Incident Report:

    • Timeline of events
    • Recovery steps taken
    • Data loss assessment
    • Lessons learned
  3. Process Improvements:

    • Update validation logic
    • Improve testing
    • Enhance monitoring
    • Update playbooks
  4. Training:

    • Share learnings with team
    • Update documentation
    • Practice recovery procedures


Revision History

Date        Version  Changes           Author
2025-10-26  1.0      Initial playbook  System

MonoKernel MonoTask Documentation