Worker Rollback

Overview

This playbook provides procedures for rolling back Cloudflare Worker deployments to a previous version when issues are detected in production.

RTO Target: 15 minutes
Impact: Minimal downtime with instant rollback


When to Use This Playbook

Execute a worker rollback when:

  • High Error Rates: Sudden spike in 5xx errors after deployment
  • Performance Degradation: Response times significantly increased
  • Functionality Broken: Critical features not working
  • Data Corruption Risk: Deployment could cause data integrity issues
  • Failed Canary: Canary deployment showing issues
  • Failed Health Checks: Post-deployment health checks failing

Rollback Triggers

Automatic Triggers

  • Error rate > 5% for 5 consecutive minutes
  • P95 response time > 2x baseline
  • Health check failures
  • Critical alert threshold exceeded
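
The first two triggers can be evaluated mechanically. The sketch below shows one way to do that, assuming per-minute error rates and P95 latencies are already exported by your monitoring system. The function names `error_rate_breached` and `p95_breached` are illustrative and not part of any Cloudflare tooling.

```bash
# Thresholds taken from the trigger list above.
ERROR_RATE_THRESHOLD=5   # percent
CONSECUTIVE_MINUTES=5    # minutes the error rate must stay above the threshold
P95_MULTIPLIER=2         # P95 must exceed the baseline by this factor

# error_rate_breached RATE...
# Takes per-minute error rates (percent, oldest first) and succeeds when the
# most recent CONSECUTIVE_MINUTES samples are all above the threshold.
error_rate_breached() {
  printf '%s\n' "$@" | awk -v t="$ERROR_RATE_THRESHOLD" -v n="$CONSECUTIVE_MINUTES" '
    { run = ($1 + 0 > t) ? run + 1 : 0 }   # length of the current breach streak
    END { exit !(run >= n) }'
}

# p95_breached CURRENT_MS BASELINE_MS
# Succeeds when the current P95 is more than P95_MULTIPLIER times the baseline.
p95_breached() {
  awk -v cur="$1" -v base="$2" -v m="$P95_MULTIPLIER" \
    'BEGIN { exit !(cur > m * base) }'
}
```

For example, `error_rate_breached 1 6 7 8 9 10` succeeds because the last five samples all exceed 5%, while a single sample at or below the threshold resets the streak.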

Manual Triggers

  • User reports of broken functionality
  • QA team identifies critical bugs
  • Security vulnerability discovered
  • Database migration incompatibility

Prerequisites

Before starting rollback:

  1. Access Requirements

    • Cloudflare account access
    • CLOUDFLARE_API_TOKEN set
    • Wrangler CLI access
  2. Information Needed

    • Current deployment version
    • Last known good version
    • Deployment timestamp
    • Which worker(s) to roll back
  3. Communication

    • Notify team in Slack/Teams
    • Update status page if external-facing
    • Prepare incident report template
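
The access requirements can be checked before an incident rather than during one. The following is a minimal sketch, assuming the token is exported in the environment and that Wrangler is run through `bunx` as elsewhere in this playbook. The helper names `require_env`, `require_cmd`, and `preflight` are hypothetical.

```bash
# require_env NAME: succeeds only if NAME is exported and non-empty.
require_env() {
  [ -n "$(printenv "$1")" ] || { echo "missing environment variable: $1" >&2; return 1; }
}

# require_cmd NAME: succeeds only if NAME is found on the PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing command: $1" >&2; return 1; }
}

# preflight: runs every check and reports all failures, not just the first.
preflight() {
  rc=0
  require_env CLOUDFLARE_API_TOKEN || rc=1
  require_cmd bunx || rc=1
  return "$rc"
}
```

Running `preflight && bunx wrangler whoami` additionally confirms that the token authenticates against the expected account.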

Quick Rollback Procedure

Fast Path (< 5 minutes)

For immediate rollback when urgent:

bash
# 1. Identify the worker to roll back
WORKER_NAME="monotask-api-gateway"

# 2. List recent deployments
bunx wrangler deployments list --name "$WORKER_NAME"

# 3. Roll back to the previous version
bunx wrangler rollback --message "Rollback due to [reason]" --name "$WORKER_NAME"

# 4. Verify the rollback
bunx wrangler deployments list --name "$WORKER_NAME"

Continue to verification steps after rollback.


Detailed Rollback Procedure

Step 1: Identify the Problem

Estimated Time: 2 minutes

  1. Gather incident information:

    • What symptoms are observed?
    • When did the issue start?
    • Which deployment caused it?
    • What's the impact scope?
  2. Check recent deployments:

    bash
    # List recent deployments for each worker
    bunx wrangler deployments list --name monotask-api-gateway
    bunx wrangler deployments list --name monotask-task-worker
    bunx wrangler deployments list --name monotask-agent-worker
    bunx wrangler deployments list --name monotask-github-worker
    bunx wrangler deployments list --name monotask-auth-worker
    bunx wrangler deployments list --name monotask-websocket-worker
  3. Check Cloudflare Analytics:

    • Error rate graph
    • Response time trends
    • Request volume patterns

Decision Point: Confirm that a rollback is necessary rather than a forward fix.


Step 2: Identify Target Version

Estimated Time: 3 minutes

  1. Determine last known good version:

    bash
    # View deployment history with timestamps as JSON
    # (recent Wrangler releases use --json; confirm with `deployments list --help`)
    bunx wrangler deployments list --name "$WORKER_NAME" --json
  2. Look for the deployment before the problematic one

  3. Verify that version was stable:

    • Check historical metrics
    • Review deployment notes
    • Confirm no issues reported during that time
  4. Note the deployment ID or version tag

Decision Point: Confirm target version is correct.


Step 3: Execute Rollback

Estimated Time: 2 minutes

  1. For Single Worker Rollback:

    bash
    # Roll back to the previous version
    bunx wrangler rollback \
      --name monotask-api-gateway \
      --message "Rollback: High error rate after v1.2.3 deployment"
  2. For Specific Version Rollback:

    bash
    # Roll back to a specific version by passing its ID (noted in Step 2)
    bunx wrangler rollback <version-id> \
      --name "$WORKER_NAME" \
      --message "Rollback to last known good version"
  3. For Multiple Workers (if coordinated deployment):

    bash
    # Roll back each worker in reverse order of deployment
    bunx wrangler rollback --name monotask-api-gateway
    bunx wrangler rollback --name monotask-task-worker
    bunx wrangler rollback --name monotask-agent-worker
  4. Monitor rollback execution:

    • Watch for confirmation messages
    • Note new deployment ID
    • Verify no errors during rollback

Step 4: Verify Rollback Success

Estimated Time: 5 minutes

  1. Confirm Version Change:

    bash
    bunx wrangler deployments list --name "$WORKER_NAME"

    • Verify current version matches target
    • Check deployment timestamp is recent
  2. Health Checks:

    bash
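    # Test API Gateway (--fail makes curl exit non-zero on HTTP errors)
    curl --fail --silent --show-error https://monotask-api-gateway.workers.dev/health
    
    # Test specific endpoints
    curl --fail --silent --show-error https://monotask-api-gateway.workers.dev/api/tasks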
  3. Monitor Metrics:

    • Check error rate returning to normal
    • Verify response times improved
    • Monitor request success rate
  4. Quick Smoke Tests:

    • Test critical user flows
    • Verify core functionality
    • Check data access working
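
A single health check right after a rollback can give a misleading result, so it helps to poll a few times before declaring success or failure. The sketch below assumes `curl` is available. The helpers `retry` and `check_health` are illustrative names, and the attempt count and delay are arbitrary starting points.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARG...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $i/$attempts failed" >&2
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# check_health URL
# Succeeds unless the request fails or returns an HTTP status of 400 or above.
check_health() {
  curl --fail --silent --show-error --max-time 10 --output /dev/null "$1"
}
```

Usage: `retry 5 3 check_health https://monotask-api-gateway.workers.dev/health` makes up to five attempts, three seconds apart.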

Step 5: Traffic Validation

Estimated Time: 3 minutes

  1. Monitor Real Traffic:

    • Watch Cloudflare Analytics dashboard
    • Check error logs for new issues
    • Monitor user reports/support tickets
  2. Validate Data Integrity:

    • Ensure the version that was rolled back left no corrupted data behind
    • Verify database state is consistent
    • Check KV/R2 operations working
  3. Performance Validation:

    • Response times back to baseline
    • CPU usage normal
    • No memory leaks or issues

Decision Point: If issues persist, consider:

  • Rolling back additional workers
  • Checking infrastructure issues
  • Escalating to disaster recovery

Rollback Verification Checklist

After rollback, verify:

  • [ ] Correct version deployed
  • [ ] Error rate returned to normal (< 1%)
  • [ ] Response times at baseline
  • [ ] Health checks passing
  • [ ] Critical functionality working
  • [ ] No new errors in logs
  • [ ] User reports stopped
  • [ ] Metrics trending positive
  • [ ] No data integrity issues
  • [ ] Rollback documented in incident log

Post-Rollback Actions

Immediate Actions (< 30 minutes)

  1. Update Status:

    • Update status page (if applicable)
    • Notify stakeholders of rollback
    • Post in team channels
  2. Document Incident:

    • Create incident ticket
    • Document timeline
    • Note symptoms and resolution
    • Record rollback details
  3. Monitor Closely:

    • Watch metrics for next hour
    • Check logs for anomalies
    • Stay available for escalations

Short-Term Actions (< 24 hours)

  1. Root Cause Analysis:

    • What caused the issue?
    • Why wasn't it caught in testing?
    • What can prevent recurrence?
  2. Fix Forward:

    • Create fix for the issue
    • Add tests to catch regression
    • Plan next deployment
  3. Update CI/CD:

    • Add automated checks if needed
    • Improve testing coverage
    • Update deployment procedures

Long-Term Actions (< 1 week)

  1. Post-Mortem:

    • Schedule blameless post-mortem
    • Document lessons learned
    • Share with team
    • Update playbooks
  2. Process Improvements:

    • Update deployment checklist
    • Improve monitoring/alerting
    • Add health checks if missing
    • Consider canary deployments

Communication Templates

Rollback Notification

🔴 ROLLBACK IN PROGRESS

Worker: [worker-name]
Reason: [brief description]
Started: [timestamp]
Expected completion: [timestamp + 5 minutes]
Impact: [description]
Updates: [channel/link]

Rollback Complete

✅ ROLLBACK COMPLETE

Worker: [worker-name]
Previous version: [version]
Current version: [version]
Duration: [X minutes]
Status: Services restored
Next steps: Root cause analysis in progress

Multi-Worker Rollback Strategy

When rolling back multiple interdependent workers:

  1. Determine Rollback Order:

    • Reverse of deployment order
    • Start with API gateway (outermost layer)
    • Then service workers
    • Finally backend workers
  2. Example Rollback Sequence:

    bash
    # 1. API Gateway (entry point)
    bunx wrangler rollback --name monotask-api-gateway
    
    # 2. Frontend-facing workers
    bunx wrangler rollback --name monotask-websocket-worker
    bunx wrangler rollback --name monotask-auth-worker
    
    # 3. Service workers
    bunx wrangler rollback --name monotask-task-worker
    bunx wrangler rollback --name monotask-github-worker
    
    # 4. Backend workers
    bunx wrangler rollback --name monotask-agent-worker
  3. Verify Each Step:

    • Check health after each rollback
    • Ensure dependencies working
    • Monitor error rates between steps
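
The sequence and the per-step verification can be combined into one loop that stops at the first failure. This is a sketch only: `verify_worker` is a hypothetical hook that always succeeds until you replace it with a real health check, and `DRY_RUN=1` prints the commands instead of running them so the order can be reviewed first.

```bash
# Workers in the rollback order listed above (entry point first).
ROLLBACK_ORDER="monotask-api-gateway monotask-websocket-worker monotask-auth-worker monotask-task-worker monotask-github-worker monotask-agent-worker"

# rollback_worker NAME REASON
# With DRY_RUN=1 the command is printed instead of executed.
rollback_worker() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "bunx wrangler rollback --name $1 --message \"$2\""
  else
    bunx wrangler rollback --name "$1" --message "$2"
  fi
}

# verify_worker NAME: hypothetical hook, replace with a real health check.
verify_worker() {
  return 0
}

# rollback_all REASON: rolls back each worker in order, stopping on failure.
rollback_all() {
  reason=$1
  for worker in $ROLLBACK_ORDER; do
    rollback_worker "$worker" "$reason" || { echo "rollback failed: $worker" >&2; return 1; }
    verify_worker "$worker" || { echo "verification failed: $worker" >&2; return 1; }
  done
}
```

Running `DRY_RUN=1 rollback_all "High error rate"` prints six commands, starting with the API gateway and ending with the agent worker.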

Traffic Switching (Alternative Approach)

If Cloudflare routes are configured:

  1. Update Route to Previous Version:

    bash
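    # Review routes in the Wrangler configuration file or the Cloudflare
    # dashboard; current Wrangler releases do not appear to provide a
    # `routes list` subcommand.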
    
    # Update route to point to stable version
    # (Manual in Cloudflare dashboard or API)
  2. Gradual Traffic Shift:

    • Move 10% traffic to old version
    • Monitor metrics
    • Gradually increase if stable
    • Roll back fully if issues persist
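
One way to express the gradual shift is a percentage split between two versions. The sketch below only builds the command string. It assumes your Wrangler release supports gradual deployments through `wrangler versions deploy` with `<version-id>@<percentage>%` arguments, which you should confirm with `bunx wrangler versions deploy --help` before relying on it. The function name `shift_command` is hypothetical.

```bash
# shift_command WORKER STABLE_VERSION_ID CURRENT_VERSION_ID STABLE_PERCENT
# Prints (does not run) a command that sends STABLE_PERCENT of traffic to the
# stable version and the remainder to the current one.
shift_command() {
  worker=$1; stable=$2; current=$3; pct=$4
  case $pct in
    # Reject empty values, non-digits, and leading zeros.
    ''|*[!0-9]*|0[0-9]*) echo "percent must be an integer 0-100" >&2; return 1 ;;
  esac
  [ "$pct" -le 100 ] || { echo "percent must be an integer 0-100" >&2; return 1; }
  echo "bunx wrangler versions deploy ${stable}@${pct}% ${current}@$((100 - pct))% --name ${worker}"
}
```

Printing the command first keeps a human in the loop: review the output of `shift_command monotask-api-gateway <stable-id> <current-id> 10`, run it, watch the metrics, then repeat with a higher percentage.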

Rollback Decision Matrix

| Scenario | Action | Urgency |
|---|---|---|
| Error rate 5-10% | Rollback | High - 5 minutes |
| Error rate 10-20% | Immediate rollback | Critical - 2 minutes |
| Error rate > 20% | Emergency rollback + escalate | Critical - 1 minute |
| Slow performance (2x baseline) | Rollback | Medium - 10 minutes |
| Critical feature broken | Rollback | High - 5 minutes |
| Minor feature issue | Consider forward fix | Low - evaluate |
| Security issue | Immediate rollback | Critical - 2 minutes |
| Data corruption risk | Emergency rollback + freeze | Critical - immediate |
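
The error-rate rows of the matrix can be encoded as a lookup for use in alerts or scripts. In this sketch the function name `rollback_action` is illustrative. The matrix does not say which row applies at exactly 10% or 20%, so the code assumes both fall into the "Immediate rollback" row.

```bash
# rollback_action ERROR_RATE_PERCENT
# Prints the action and urgency for the given error rate (decimals allowed).
rollback_action() {
  awk -v r="$1" 'BEGIN {
    if      (r > 20)  print "Emergency rollback + escalate (Critical - 1 minute)"
    else if (r >= 10) print "Immediate rollback (Critical - 2 minutes)"
    else if (r >= 5)  print "Rollback (High - 5 minutes)"
    else              print "No rollback indicated by error rate alone"
  }'
}
```

The remaining rows (performance, broken features, security, data corruption) depend on human judgment and are deliberately left out.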

Rollback Automation

For automated rollbacks based on metrics:

bash
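#!/bin/bash
# Example: Automated rollback script
# (Integrate with monitoring system)
set -euo pipefail

WORKER_NAME="monotask-api-gateway"
ERROR_THRESHOLD=5  # percent

# get_error_rate_from_cloudflare and send_alert are placeholders: define
# them for your monitoring system before running this script.
# The error rate must be a whole number, because [ -gt ] compares integers.
ERROR_RATE=$(get_error_rate_from_cloudflare)

if [ "$ERROR_RATE" -gt "$ERROR_THRESHOLD" ]; then
  echo "Error rate ${ERROR_RATE}% exceeds threshold. Rolling back..."
  bunx wrangler rollback --name "$WORKER_NAME" \
    --message "Auto-rollback: error rate ${ERROR_RATE}%"
  send_alert "Auto-rollback executed for $WORKER_NAME"
fi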

Common Issues During Rollback

Issue: Rollback Command Fails

Symptoms:

  • Wrangler returns error
  • Version not changing

Solution:

  1. Check API token permissions
  2. Verify worker name is correct
  3. Ensure network connectivity
  4. Try manual deployment via dashboard

Issue: Rollback Completes But Issues Persist

Symptoms:

  • Version shows as rolled back
  • But errors continue

Solution:

  1. Clear Cloudflare cache
  2. Check if issue is in database, not worker
  3. Verify correct version actually deployed
  4. Check if multiple workers need rollback
  5. Consider infrastructure issue

Issue: Can't Identify Last Good Version

Symptoms:

  • All recent versions have issues
  • No clear stable point

Solution:

  1. Review deployment history further back
  2. Check if infrastructure change caused issue
  3. Review database migrations
  4. Consider restoring from backup
  5. May need disaster recovery

Escalation Path

If rollback doesn't resolve the issue:

  1. < 5 minutes: Verify rollback completed
  2. 5-10 minutes: Check infrastructure (D1, KV, R2)
  3. 10-15 minutes: Escalate to senior engineer
  4. > 15 minutes: Invoke disaster recovery playbook
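
The time windows above can be turned into a small helper for an incident bot or a terminal prompt. This is a sketch with an illustrative function name, `escalation_step`. It takes whole minutes since the rollback finished and assumes the boundary values 10 and 15 both belong to the "Escalate to senior engineer" step.

```bash
# escalation_step ELAPSED_MINUTES
# Prints the escalation step that applies after the given number of minutes.
escalation_step() {
  minutes=$1
  if   [ "$minutes" -lt 5 ];  then echo "Verify rollback completed"
  elif [ "$minutes" -lt 10 ]; then echo "Check infrastructure (D1, KV, R2)"
  elif [ "$minutes" -le 15 ]; then echo "Escalate to senior engineer"
  else                             echo "Invoke disaster recovery playbook"
  fi
}
```

To track elapsed time, record `start=$(date +%s)` when the rollback finishes and call `escalation_step $(( ($(date +%s) - start) / 60 ))`.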


Revision History

| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-10-26 | 1.0 | Initial playbook | System |

MonoKernel MonoTask Documentation