Worker Rollback

Overview

This playbook provides procedures for rolling back Cloudflare Worker deployments to a previous version when issues are detected in production.

RTO Target: 15 minutes
Impact: Minimal downtime with instant rollback

When to Use This Playbook

Execute a worker rollback when:

High Error Rates: Sudden spike in 5xx errors after deployment
Performance Degradation: Response times well above baseline after deployment (the automatic trigger fires at P95 > 2x baseline)
Functionality Broken: Critical features not working
Data Corruption Risk: Deployment could cause data integrity issues
Failed Canary: Canary deployment showing issues
Failed Health Checks: Post-deployment health checks failing

Rollback Triggers

Automatic Triggers

Error rate > 5% for 5 consecutive minutes
P95 response time > 2x baseline
Health check failures
Critical alert threshold exceeded
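
The first two triggers above can be expressed as small checks. The following is a minimal sketch, assuming the monitoring system can supply per-minute error rates (in percent) and a current P95 value; the function names `error_rate_breached` and `latency_breached` are hypothetical and not part of any existing tooling.

```bash
#!/usr/bin/env bash
# Sketch only: evaluates the error-rate and latency triggers from sampled metrics.
# Function names are hypothetical; feed them values from your monitoring system.

# error_rate_breached THRESHOLD RATE...
# Succeeds when the 5 most recent per-minute error rates (percent) all exceed THRESHOLD.
error_rate_breached() {
  local threshold="$1"; shift
  [ "$#" -ge 5 ] || return 1          # fewer than five samples: cannot decide yet
  local recent=("${@: -5}")           # keep only the last five samples
  local rate
  for rate in "${recent[@]}"; do
    # awk compares fractional percentages, which [ -gt ] cannot do
    awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r > t) }' || return 1
  done
  return 0
}

# latency_breached P95 BASELINE
# Succeeds when the current P95 is more than twice the baseline (same unit for both).
latency_breached() {
  awk -v p="$1" -v b="$2" 'BEGIN { exit !(p > 2 * b) }'
}

# Usage: error_rate_breached 5 6.1 7.4 8.0 9.2 6.6 && echo "error-rate trigger fired"
```

A single sample at or below the threshold resets the condition, which matches the "5 consecutive minutes" wording.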

Manual Triggers

User reports of broken functionality
QA team identifies critical bugs
Security vulnerability discovered
Database migration incompatibility

Prerequisites

Before starting rollback:

Access Requirements
- Cloudflare account access
- CLOUDFLARE_API_TOKEN set
- Wrangler CLI access
Information Needed
- Current deployment version
- Last known good version
- Deployment timestamp
- Which worker(s) to rollback
Communication
- Notify team in Slack/Teams
- Update status page if external-facing
- Prepare incident report template
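
The access requirements above can be checked before anything is rolled back. This is a minimal sketch; `check_prereqs` is a hypothetical helper name, and it assumes the worker to roll back is held in a `WORKER_NAME` variable as in the Fast Path example.

```bash
#!/usr/bin/env bash
# Sketch only: pre-rollback checks. check_prereqs is a hypothetical helper.

check_prereqs() {
  local failed=0
  if [ -z "${CLOUDFLARE_API_TOKEN:-}" ]; then
    echo "missing: CLOUDFLARE_API_TOKEN is not set" >&2
    failed=1
  fi
  if ! command -v bunx >/dev/null 2>&1; then
    echo "missing: bunx not found on PATH (needed to run wrangler)" >&2
    failed=1
  fi
  if [ -z "${WORKER_NAME:-}" ]; then
    echo "missing: WORKER_NAME (which worker to roll back)" >&2
    failed=1
  fi
  return "$failed"
}

# Usage: check_prereqs && bunx wrangler whoami   # whoami confirms the token authenticates
```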

Quick Rollback Procedure

Fast Path (< 5 minutes)

For immediate rollback when urgent:

bash

# 1. Identify worker to rollback
WORKER_NAME="monotask-api-gateway"

# 2. List recent deployments
bunx wrangler deployments list --name $WORKER_NAME

# 3. Rollback to previous version
bunx wrangler rollback --message "Rollback due to [reason]" --name $WORKER_NAME

# 4. Verify rollback
bunx wrangler deployments list --name $WORKER_NAME

Continue to verification steps after rollback.

Detailed Rollback Procedure

Step 1: Identify the Problem

Estimated Time: 2 minutes

Gather incident information:
- What symptoms are observed?
- When did the issue start?
- Which deployment caused it?
- What's the impact scope?

Check recent deployments:

bash

# List all workers and their versions
bunx wrangler deployments list --name monotask-api-gateway
bunx wrangler deployments list --name monotask-task-worker
bunx wrangler deployments list --name monotask-agent-worker
bunx wrangler deployments list --name monotask-github-worker
bunx wrangler deployments list --name monotask-auth-worker
bunx wrangler deployments list --name monotask-websocket-worker

Check Cloudflare Analytics:
- Error rate graph
- Response time trends
- Request volume patterns

Decision Point: Confirm rollback is necessary vs. forward fix.

Step 2: Identify Target Version

Estimated Time: 3 minutes

Determine last known good version:

bash

# View deployment history with timestamps
# (--json prints machine-readable output; it requires a Wrangler release that supports the flag)
bunx wrangler deployments list --name "$WORKER_NAME" --json

Look for the deployment before the problematic one
Verify that version was stable:
- Check historical metrics
- Review deployment notes
- Confirm no issues reported during that time
Note the deployment ID or version tag

Decision Point: Confirm target version is correct.

Step 3: Execute Rollback

Estimated Time: 2 minutes

For Single Worker Rollback:

bash

# Rollback to previous version
bunx wrangler rollback \
  --name monotask-api-gateway \
  --message "Rollback: High error rate after v1.2.3 deployment"

For Specific Version Rollback:

bash

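# Inspect the candidate deployment
bunx wrangler deployments view <deployment-id> --name $WORKER_NAME
# Then roll back to it by passing its ID to the rollback command
# (newer Wrangler releases expect a version ID here rather than a deployment ID)
bunx wrangler rollback <deployment-id> --name $WORKER_NAME --message "Rollback to <deployment-id>"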

For Multiple Workers (if coordinated deployment):

bash

# Rollback each worker in reverse order of deployment
bunx wrangler rollback --name monotask-api-gateway
bunx wrangler rollback --name monotask-task-worker
bunx wrangler rollback --name monotask-agent-worker

Monitor rollback execution:
- Watch for confirmation messages
- Note new deployment ID
- Verify no errors during rollback

Step 4: Verify Rollback Success

Estimated Time: 5 minutes

Confirm Version Change:

bash

bunx wrangler deployments list --name $WORKER_NAME

- Verify current version matches target
- Check deployment timestamp is recent

Health Checks:

bash

# Test API Gateway
curl https://monotask-api-gateway.workers.dev/health

# Test specific endpoints
curl https://monotask-api-gateway.workers.dev/api/tasks

Monitor Metrics:
- Check error rate returning to normal
- Verify response times improved
- Monitor request success rate
Quick Smoke Tests:
- Test critical user flows
- Verify core functionality
- Check data access working
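
A single `curl` can miss a worker that is still settling after the rollback. The sketch below polls the health endpoint until it returns HTTP 200 or the attempts run out. `wait_healthy` is a hypothetical helper, and the attempt count, interval, and the assumption that a healthy endpoint answers with status 200 are illustrative choices rather than documented behaviour of these workers.

```bash
#!/usr/bin/env bash
# Sketch only: poll a health endpoint. wait_healthy is a hypothetical helper.

wait_healthy() {
  local url="$1" attempts="${2:-10}" interval="${3:-3}"
  local i status
  for ((i = 1; i <= attempts; i++)); do
    # -s silences progress, -o discards the body, -w prints only the status code
    status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)"
    if [ "$status" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: got status '${status:-none}'" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
  done
  return 1
}

# Usage: wait_healthy "https://monotask-api-gateway.workers.dev/health" 10 3
```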

Step 5: Traffic Validation

Estimated Time: 3 minutes

Monitor Real Traffic:
- Watch Cloudflare Analytics dashboard
- Check error logs for new issues
- Monitor user reports/support tickets
Validate Data Integrity:
- Ensure no data corruption from rolled-back version
- Verify database state is consistent
- Check KV/R2 operations working
Performance Validation:
- Response times back to baseline
- CPU usage normal
- No memory leaks or issues
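
To compare response times against the baseline, a rough P95 can be taken from a handful of sampled requests. This is a sketch under stated assumptions: `p95` and `sample_latency` are hypothetical helpers, the percentile uses the nearest-rank method, and client-side timings from one machine only approximate what Cloudflare Analytics reports.

```bash
#!/usr/bin/env bash
# Sketch only: nearest-rank P95 over latency samples (one number per line on stdin).
# p95 and sample_latency are hypothetical helpers.

p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                   # no samples: nothing to report
      rank = int((95 * NR + 99) / 100)      # ceiling of 0.95 * NR using integer math
      print v[rank]
    }'
}

# sample_latency URL COUNT: print the total request time in seconds for COUNT requests.
sample_latency() {
  local url="$1" count="${2:-20}" i
  for ((i = 0; i < count; i++)); do
    curl -s -o /dev/null -w '%{time_total}\n' "$url"
  done
}

# Usage: sample_latency "https://monotask-api-gateway.workers.dev/health" 20 | p95
```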

Decision Point: If issues persist, consider:

Rolling back additional workers
Checking infrastructure issues
Escalating to disaster recovery

Rollback Verification Checklist

After rollback, verify:

[ ] Correct version deployed
[ ] Error rate returned to normal (< 1%)
[ ] Response times at baseline
[ ] Health checks passing
[ ] Critical functionality working
[ ] No new errors in logs
[ ] User reports stopped
[ ] Metrics trending positive
[ ] No data integrity issues
[ ] Rollback documented in incident log

Post-Rollback Actions

Immediate Actions (< 30 minutes)

Update Status:
- Update status page (if applicable)
- Notify stakeholders of rollback
- Post in team channels
Document Incident:
- Create incident ticket
- Document timeline
- Note symptoms and resolution
- Record rollback details
Monitor Closely:
- Watch metrics for next hour
- Check logs for anomalies
- Stay available for escalations

Short-Term Actions (< 24 hours)

Root Cause Analysis:
- What caused the issue?
- Why wasn't it caught in testing?
- What can prevent recurrence?
Fix Forward:
- Create fix for the issue
- Add tests to catch regression
- Plan next deployment
Update CI/CD:
- Add automated checks if needed
- Improve testing coverage
- Update deployment procedures

Long-Term Actions (< 1 week)

Post-Mortem:
- Schedule blameless post-mortem
- Document lessons learned
- Share with team
- Update playbooks
Process Improvements:
- Update deployment checklist
- Improve monitoring/alerting
- Add health checks if missing
- Consider canary deployments

Communication Templates

Rollback Notification

🔴 ROLLBACK IN PROGRESS

Worker: [worker-name]
Reason: [brief description]
Started: [timestamp]
Expected completion: [timestamp + 5 minutes]
Impact: [description]
Updates: [channel/link]
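
Filling in the template by hand is easy to get wrong under pressure. The sketch below renders it from arguments; `render_rollback_notice` and `fmt_utc` are hypothetical helpers, timestamps are printed in UTC as an assumption, and posting the text to Slack/Teams is left to whatever tooling the team already uses.

```bash
#!/usr/bin/env bash
# Sketch only: render the "ROLLBACK IN PROGRESS" template.
# render_rollback_notice and fmt_utc are hypothetical helpers.

# fmt_utc EPOCH: format epoch seconds as UTC; tries GNU date first, then BSD/macOS date.
fmt_utc() {
  date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC' 2>/dev/null ||
    date -u -r "$1" '+%Y-%m-%d %H:%M:%S UTC'
}

# render_rollback_notice WORKER REASON IMPACT UPDATES [START_EPOCH]
render_rollback_notice() {
  local worker="$1" reason="$2" impact="$3" updates="$4"
  local now="${5:-$(date -u +%s)}"   # start time in epoch seconds (overridable)
  local eta=$((now + 300))           # the template expects completion within 5 minutes
  cat <<EOF
🔴 ROLLBACK IN PROGRESS

Worker: $worker
Reason: $reason
Started: $(fmt_utc "$now")
Expected completion: $(fmt_utc "$eta")
Impact: $impact
Updates: $updates
EOF
}

# Usage: render_rollback_notice monotask-api-gateway "5xx spike" "API errors" "#incidents"
```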

Rollback Complete

✅ ROLLBACK COMPLETE

Worker: [worker-name]
Previous version: [version]
Current version: [version]
Duration: [X minutes]
Status: Services restored
Next steps: Root cause analysis in progress

Multi-Worker Rollback Strategy

When rolling back multiple interdependent workers:

Determine Rollback Order:
- Reverse of deployment order
- Start with API gateway (outermost layer)
- Then service workers
- Finally backend workers

Example Rollback Sequence:

bash

# 1. API Gateway (entry point)
bunx wrangler rollback --name monotask-api-gateway

# 2. Frontend-facing workers
bunx wrangler rollback --name monotask-websocket-worker
bunx wrangler rollback --name monotask-auth-worker

# 3. Service workers
bunx wrangler rollback --name monotask-task-worker
bunx wrangler rollback --name monotask-github-worker

# 4. Backend workers
bunx wrangler rollback --name monotask-agent-worker

Verify Each Step:
- Check health after each rollback
- Ensure dependencies working
- Monitor error rates between steps
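
The sequence and the verify-each-step rule can be combined so that a failed step halts the run instead of rolling back the remaining workers blindly. This is a minimal sketch; `rollback_in_order` and `verify_worker` are hypothetical helpers, and `verify_worker` is a placeholder that must be replaced with a real health check per worker.

```bash
#!/usr/bin/env bash
# Sketch only: roll back workers in the given order, stopping at the first failure.
# rollback_in_order and verify_worker are hypothetical helpers.

verify_worker() {
  # Placeholder: always succeeds. Replace with a health check for worker "$1".
  return 0
}

# rollback_in_order REASON WORKER...
rollback_in_order() {
  local reason="$1"; shift
  local worker
  for worker in "$@"; do
    echo "rolling back $worker"
    bunx wrangler rollback --name "$worker" --message "$reason" || {
      echo "rollback failed for $worker; stopping" >&2
      return 1
    }
    verify_worker "$worker" || {
      echo "verification failed for $worker; stopping" >&2
      return 1
    }
  done
}

# Usage (order taken from the example sequence above):
# rollback_in_order "Rollback: elevated 5xx" \
#   monotask-api-gateway monotask-websocket-worker monotask-auth-worker \
#   monotask-task-worker monotask-github-worker monotask-agent-worker
```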

Traffic Switching (Alternative Approach)

If Cloudflare routes are configured:

Update Route to Previous Version:

bash

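# List routes
# (This subcommand may not be available in every Wrangler release; routes are also
# declared in the Wrangler configuration file and shown in the Cloudflare dashboard)
bunx wrangler routes list

# Update route to point to stable version
# (Manual in Cloudflare dashboard or API)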

Gradual Traffic Shift:
- Move 10% traffic to old version
- Monitor metrics
- Gradually increase if stable
- Roll back fully if issues persist
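
One way to script the gradual shift is Wrangler's gradual deployments feature, which splits traffic between two uploaded versions. The sketch below assumes a Wrangler release where `wrangler versions deploy` accepts `<version-id>@<percentage>%` specs; `shift_traffic` and the `DRY_RUN` variable are hypothetical, and the version IDs must come from your own deployment history.

```bash
#!/usr/bin/env bash
# Sketch only: split traffic between a stable version and the current one.
# Assumes `wrangler versions deploy` with <version-id>@<percentage>% specs is available.
# shift_traffic and DRY_RUN are hypothetical names.

# shift_traffic WORKER STABLE_ID CURRENT_ID STABLE_PERCENT
shift_traffic() {
  local worker="$1" stable_id="$2" current_id="$3" stable_pct="$4"
  if ! [[ "$stable_pct" =~ ^[0-9]+$ ]] || [ "$stable_pct" -gt 100 ]; then
    echo "percentage must be an integer from 0 to 100" >&2
    return 2
  fi
  local current_pct=$((100 - stable_pct))
  local cmd=(bunx wrangler versions deploy
    "${stable_id}@${stable_pct}%" "${current_id}@${current_pct}%"
    --name "$worker" --yes)
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "${cmd[*]}"      # print the command instead of running it
  else
    "${cmd[@]}"
  fi
}

# Usage: DRY_RUN=1 shift_traffic monotask-api-gateway <stable-version-id> <current-version-id> 10
```

Running with `DRY_RUN=1` first shows the exact command before any traffic moves.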

Rollback Decision Matrix

Scenario	Action	Urgency
Error rate 5-10%	Rollback	High - 5 minutes
Error rate 10-20%	Immediate rollback	Critical - 2 minutes
Error rate > 20%	Emergency rollback + escalate	Critical - 1 minute
Slow performance (2x baseline)	Rollback	Medium - 10 minutes
Critical feature broken	Rollback	High - 5 minutes
Minor feature issue	Consider forward fix	Low - evaluate
Security issue	Immediate rollback	Critical - 2 minutes
Data corruption risk	Emergency rollback + freeze	Critical - immediate
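
The error-rate rows of the matrix can be encoded for use in alert handlers. This is a minimal sketch; `classify_error_rate` is a hypothetical helper, and because the matrix bands overlap at their edges, the boundary handling (exactly 10% and 20% map to "Immediate rollback") is an assumption.

```bash
#!/usr/bin/env bash
# Sketch only: map an error rate (percent) to the matrix action and urgency.
# classify_error_rate is a hypothetical helper; boundary handling is an assumption.

classify_error_rate() {
  awk -v r="$1" 'BEGIN {
    if (r > 20)       print "Emergency rollback + escalate (Critical - 1 minute)"
    else if (r >= 10) print "Immediate rollback (Critical - 2 minutes)"
    else if (r >= 5)  print "Rollback (High - 5 minutes)"
    else              print "No error-rate action"   # below the lowest band in the matrix
  }'
}

# Usage: classify_error_rate 12.5
```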

Rollback Automation

For automated rollbacks based on metrics:

bash

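#!/bin/bash
# Example: Automated rollback script
# (Integrate with monitoring system)
set -euo pipefail

WORKER_NAME="monotask-api-gateway"
ERROR_THRESHOLD=5   # percent

# Placeholder: replace with a query to your monitoring system.
# Must print the current error rate as a whole-number percentage.
get_error_rate_from_cloudflare() {
  echo 0
}

# Placeholder: replace with your alerting integration (Slack/Teams).
send_alert() {
  echo "ALERT: $1"
}

# Get current error rate
ERROR_RATE=$(get_error_rate_from_cloudflare)

if [ "$ERROR_RATE" -gt "$ERROR_THRESHOLD" ]; then
  echo "Error rate ${ERROR_RATE}% exceeds threshold. Rolling back..."
  # --message supplies the rollback reason up front
  bunx wrangler rollback --name "$WORKER_NAME" --message "Auto-rollback: error rate ${ERROR_RATE}%"
  send_alert "Auto-rollback executed for $WORKER_NAME"
fi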

Common Issues During Rollback

Issue: Rollback Command Fails

Symptoms:

Wrangler returns error
Version not changing

Solution:

Check API token permissions
Verify worker name is correct
Ensure network connectivity
Try manual deployment via dashboard

Issue: Rollback Completes But Issues Persist

Symptoms:

Version shows as rolled back
But errors continue

Solution:

Clear Cloudflare cache
Check if issue is in database, not worker
Verify correct version actually deployed
Check if multiple workers need rollback
Consider infrastructure issue
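
For the "Clear Cloudflare cache" step, the cache of a zone can be purged through the Cloudflare API. This sketch assumes the worker is served through a zone whose ID is stored in `CLOUDFLARE_ZONE_ID` (a variable name chosen here for illustration) and that the API token is allowed to purge cache; `purge_zone_cache` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch only: purge the entire cache of one zone via the Cloudflare API.
# purge_zone_cache and CLOUDFLARE_ZONE_ID are names chosen for this example.

purge_zone_cache() {
  if [ -z "${CLOUDFLARE_API_TOKEN:-}" ] || [ -z "${CLOUDFLARE_ZONE_ID:-}" ]; then
    echo "CLOUDFLARE_API_TOKEN and CLOUDFLARE_ZONE_ID must both be set" >&2
    return 1
  fi
  # -sS hides the progress meter but still reports errors
  curl -sS -X POST \
    "https://api.cloudflare.com/client/v4/zones/${CLOUDFLARE_ZONE_ID}/purge_cache" \
    -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
    -H "Content-Type: application/json" \
    --data '{"purge_everything":true}'
}

# Usage: CLOUDFLARE_ZONE_ID=<zone-id> purge_zone_cache
```

Purging everything forces all cached assets to be refetched, so expect a short rise in origin load afterwards.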

Issue: Can't Identify Last Good Version

Symptoms:

All recent versions have issues
No clear stable point

Solution:

Review deployment history further back
Check if infrastructure change caused issue
Review database migrations
Consider restoring from backup
May need disaster recovery

Escalation Path

If rollback doesn't resolve the issue:

< 5 minutes: Verify rollback completed
5-10 minutes: Check infrastructure (D1, KV, R2)
10-15 minutes: Escalate to senior engineer
> 15 minutes: Invoke disaster recovery playbook

Related Playbooks

D1 Recovery - Database restoration
Disaster Recovery - Full system recovery
Partial Failure - Component failures

Revision History

Date	Version	Changes	Author
2025-10-26	1.0	Initial playbook	System
