Worker Rollback
Overview
This playbook provides procedures for rolling back Cloudflare Worker deployments to a previous version when issues are detected in production.
RTO Target: 15 minutes
Impact: Minimal downtime with instant rollback
When to Use This Playbook
Execute a worker rollback when:
- High Error Rates: Sudden spike in 5xx errors after deployment
- Performance Degradation: Response times significantly increased
- Functionality Broken: Critical features not working
- Data Corruption Risk: Deployment could cause data integrity issues
- Failed Canary: Canary deployment showing issues
- Failed Health Checks: Post-deployment health checks failing
Rollback Triggers
Automatic Triggers
- Error rate > 5% for 5 consecutive minutes
- P95 response time > 2x baseline
- Health check failures
- Critical alert threshold exceeded
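The first two automatic triggers can be evaluated with a few lines of shell. This is a minimal sketch, assuming your monitoring system can supply one error-rate sample per minute and a P95 latency figure. The function names and the way samples are passed in are illustrative, not part of Wrangler or any Cloudflare API.

```bash
# Returns 0 (true) when the error rate stayed above the threshold for the
# required number of consecutive one-minute samples.
# Usage: error_rate_breached <threshold_pct> <minutes> <sample>...
error_rate_breached() {
  local threshold="$1" required="$2" streak=0 sample
  shift 2
  for sample in "$@"; do
    # awk handles fractional percentages such as 5.5
    if awk -v s="$sample" -v t="$threshold" 'BEGIN { exit !(s > t) }'; then
      streak=$((streak + 1))
      [ "$streak" -ge "$required" ] && return 0
    else
      streak=0
    fi
  done
  return 1
}

# Returns 0 (true) when the P95 response time exceeds twice the baseline.
# Usage: latency_breached <p95_ms> <baseline_ms>
latency_breached() {
  awk -v p="$1" -v b="$2" 'BEGIN { exit !(p > 2 * b) }'
}
```

For example, `error_rate_breached 5 5 6 7 8 6 9` succeeds because all five samples exceed 5%, while a single sample at or below the threshold resets the streak.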
Manual Triggers
- User reports of broken functionality
- QA team identifies critical bugs
- Security vulnerability discovered
- Database migration incompatibility
Prerequisites
Before starting rollback:
Access Requirements
- Cloudflare account access
- CLOUDFLARE_API_TOKEN set
- Wrangler CLI access
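The access requirements can be checked before an incident rather than during one. The following is a minimal sketch; `preflight_check` is a hypothetical helper name, and it only confirms that the token variable is non-empty and that `bunx` is on the PATH, not that the token has the right permissions.

```bash
# Prints one line per missing prerequisite and returns non-zero if any is missing.
preflight_check() {
  local missing=0
  if [ -z "${CLOUDFLARE_API_TOKEN:-}" ]; then
    echo "missing: CLOUDFLARE_API_TOKEN is not set"
    missing=1
  fi
  if ! command -v bunx >/dev/null 2>&1; then
    echo "missing: bunx not found on PATH"
    missing=1
  fi
  return "$missing"
}
```

Once this passes, running `bunx wrangler whoami` is a reasonable way to confirm that the token actually authenticates against the expected account.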
Information Needed
- Current deployment version
- Last known good version
- Deployment timestamp
- Which worker(s) to rollback
Communication
- Notify team in Slack/Teams
- Update status page if external-facing
- Prepare incident report template
Quick Rollback Procedure
Fast Path (< 5 minutes)
For immediate rollback when urgent:
```bash
# 1. Identify worker to rollback
WORKER_NAME="monotask-api-gateway"

# 2. List recent deployments
bunx wrangler deployments list --name "$WORKER_NAME"

# 3. Rollback to previous version
bunx wrangler rollback --message "Rollback due to [reason]" --name "$WORKER_NAME"

# 4. Verify rollback
bunx wrangler deployments list --name "$WORKER_NAME"
```
Continue to verification steps after rollback.
Detailed Rollback Procedure
Step 1: Identify the Problem
Estimated Time: 2 minutes
Gather incident information:
- What symptoms are observed?
- When did the issue start?
- Which deployment caused it?
- What's the impact scope?
Check recent deployments:
```bash
# List recent deployments for each worker
bunx wrangler deployments list --name monotask-api-gateway
bunx wrangler deployments list --name monotask-task-worker
bunx wrangler deployments list --name monotask-agent-worker
bunx wrangler deployments list --name monotask-github-worker
bunx wrangler deployments list --name monotask-auth-worker
bunx wrangler deployments list --name monotask-websocket-worker
```
Check Cloudflare Analytics:
- Error rate graph
- Response time trends
- Request volume patterns
Decision Point: Confirm rollback is necessary vs. forward fix.
Step 2: Identify Target Version
Estimated Time: 3 minutes
Determine last known good version:
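```bash
# View deployment history with timestamps as JSON
# (--json is available in recent Wrangler releases; confirm with
# `bunx wrangler deployments list --help`)
bunx wrangler deployments list --name "$WORKER_NAME" --json
```
Look for the deployment immediately before the problematic one.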
Verify that version was stable:
- Check historical metrics
- Review deployment notes
- Confirm no issues reported during that time
Note the deployment ID or version tag
Decision Point: Confirm target version is correct.
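Picking "the deployment before the problematic one" is easy to get wrong under pressure, so it can help to script it. This sketch assumes you have already reduced the deployment history to lines of the form `<ISO-8601 timestamp> <deployment-id>`. That input format and the `previous_deployment` name are assumptions for illustration, not Wrangler's own output format.

```bash
# Reads "<timestamp> <deployment-id>" lines on stdin and prints the ID of the
# deployment created immediately before the given (problematic) deployment.
# Returns non-zero if the ID is unknown or has no predecessor.
# Usage: previous_deployment <bad_deployment_id> < history.txt
previous_deployment() {
  local bad_id="$1"
  # ISO-8601 timestamps in the same timezone sort chronologically as text
  sort | awk -v bad="$bad_id" '
    $2 == bad { found = 1; exit }
    { prev = $2 }
    END {
      if (found && prev != "") { print prev; exit 0 }
      exit 1
    }
  '
}
```

Treat the result as a candidate only: it still needs the stability checks listed above before you roll back to it.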
Step 3: Execute Rollback
Estimated Time: 2 minutes
For Single Worker Rollback:
```bash
# Rollback to previous version
bunx wrangler rollback \
  --name monotask-api-gateway \
  --message "Rollback: High error rate after v1.2.3 deployment"
```
For Specific Version Rollback:
```bash
# Inspect the target deployment, then roll back to it by ID
bunx wrangler deployments view <deployment-id> --name "$WORKER_NAME"
bunx wrangler rollback <deployment-id> --name "$WORKER_NAME" \
  --message "Rollback to known-good deployment"
```
For Multiple Workers (if coordinated deployment):
```bash
# Rollback each worker in reverse order of deployment
bunx wrangler rollback --name monotask-api-gateway
bunx wrangler rollback --name monotask-task-worker
bunx wrangler rollback --name monotask-agent-worker
```
Monitor rollback execution:
- Watch for confirmation messages
- Note new deployment ID
- Verify no errors during rollback
Step 4: Verify Rollback Success
Estimated Time: 5 minutes
Confirm Version Change:
```bash
bunx wrangler deployments list --name "$WORKER_NAME"
```
- Verify current version matches target
- Check deployment timestamp is recent
Health Checks:
```bash
# Test API Gateway
curl https://monotask-api-gateway.workers.dev/health

# Test specific endpoints
curl https://monotask-api-gateway.workers.dev/api/tasks
```
Monitor Metrics:
- Check error rate returning to normal
- Verify response times improved
- Monitor request success rate
Quick Smoke Tests:
- Test critical user flows
- Verify core functionality
- Check data access working
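A rollback can take a short time to propagate, so a single `curl` may give a misleading result. The sketch below polls a health endpoint a few times before declaring failure. The `wait_for_healthy` name, the retry defaults, and the assumption that a healthy endpoint returns HTTP 200 are illustrative choices, not something this playbook's workers are confirmed to guarantee.

```bash
# Polls a health endpoint until it returns HTTP 200 or attempts run out.
# Usage: wait_for_healthy <url> [attempts] [delay_seconds]
wait_for_healthy() {
  local url="$1" attempts="${2:-5}" delay="${3:-3}" i=1 status
  while [ "$i" -le "$attempts" ]; do
    # -s silent, -o discard body, -w print only the HTTP status code
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || status="000"
    if [ "$status" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP $status" >&2
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

A typical call would be `wait_for_healthy https://monotask-api-gateway.workers.dev/health 5 3`, using the health URL from the checks above.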
Step 5: Traffic Validation
Estimated Time: 3 minutes
Monitor Real Traffic:
- Watch Cloudflare Analytics dashboard
- Check error logs for new issues
- Monitor user reports/support tickets
Validate Data Integrity:
- Ensure no data corruption from rolled-back version
- Verify database state is consistent
- Check KV/R2 operations working
Performance Validation:
- Response times back to baseline
- CPU usage normal
- No memory leaks or issues
Decision Point: If issues persist, consider:
- Rolling back additional workers
- Checking infrastructure issues
- Escalating to disaster recovery
Rollback Verification Checklist
After rollback, verify:
- [ ] Correct version deployed
- [ ] Error rate returned to normal (< 1%)
- [ ] Response times at baseline
- [ ] Health checks passing
- [ ] Critical functionality working
- [ ] No new errors in logs
- [ ] User reports stopped
- [ ] Metrics trending positive
- [ ] No data integrity issues
- [ ] Rollback documented in incident log
Post-Rollback Actions
Immediate Actions (< 30 minutes)
Update Status:
- Update status page (if applicable)
- Notify stakeholders of rollback
- Post in team channels
Document Incident:
- Create incident ticket
- Document timeline
- Note symptoms and resolution
- Record rollback details
Monitor Closely:
- Watch metrics for next hour
- Check logs for anomalies
- Stay available for escalations
Short-Term Actions (< 24 hours)
Root Cause Analysis:
- What caused the issue?
- Why wasn't it caught in testing?
- What can prevent recurrence?
Fix Forward:
- Create fix for the issue
- Add tests to catch regression
- Plan next deployment
Update CI/CD:
- Add automated checks if needed
- Improve testing coverage
- Update deployment procedures
Long-Term Actions (< 1 week)
Post-Mortem:
- Schedule blameless post-mortem
- Document lessons learned
- Share with team
- Update playbooks
Process Improvements:
- Update deployment checklist
- Improve monitoring/alerting
- Add health checks if missing
- Consider canary deployments
Communication Templates
Rollback Notification
🔴 ROLLBACK IN PROGRESS
Worker: [worker-name]
Reason: [brief description]
Started: [timestamp]
Expected completion: [timestamp + 5 minutes]
Impact: [description]
Updates: [channel/link]
Rollback Complete
✅ ROLLBACK COMPLETE
Worker: [worker-name]
Previous version: [version]
Current version: [version]
Duration: [X minutes]
Status: Services restored
Next steps: Root cause analysis in progress
Multi-Worker Rollback Strategy
When rolling back multiple interdependent workers:
Determine Rollback Order:
- Reverse of deployment order
- Start with API gateway (outermost layer)
- Then service workers
- Finally backend workers
Example Rollback Sequence:
```bash
# 1. API Gateway (entry point)
bunx wrangler rollback --name monotask-api-gateway

# 2. Frontend-facing workers
bunx wrangler rollback --name monotask-websocket-worker
bunx wrangler rollback --name monotask-auth-worker

# 3. Service workers
bunx wrangler rollback --name monotask-task-worker
bunx wrangler rollback --name monotask-github-worker

# 4. Backend workers
bunx wrangler rollback --name monotask-agent-worker
```
Verify Each Step:
- Check health after each rollback
- Ensure dependencies working
- Monitor error rates between steps
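The "verify each step" rule can be enforced by a small wrapper that refuses to continue once a rollback or health check fails. This is a sketch under stated assumptions: `rollback_in_order` and `check_worker_health` are hypothetical names, and the health hook is a stub that you would replace with a real per-worker probe.

```bash
# Hypothetical hook: replace the body with a real per-worker health probe.
check_worker_health() { return 0; }

# Rolls back each named worker in the order given, stopping at the first
# rollback or health-check failure so a bad state is not compounded.
# Usage: rollback_in_order <message> <worker>...
rollback_in_order() {
  local message="$1" worker
  shift
  for worker in "$@"; do
    echo "rolling back $worker"
    if ! bunx wrangler rollback --name "$worker" --message "$message"; then
      echo "rollback failed for $worker; stopping" >&2
      return 1
    fi
    if ! check_worker_health "$worker"; then
      echo "health check failed for $worker; stopping" >&2
      return 1
    fi
  done
}
```

Pass the workers in the rollback order described above, for example `rollback_in_order "Coordinated rollback" monotask-api-gateway monotask-websocket-worker monotask-auth-worker`.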
Traffic Switching (Alternative Approach)
If Cloudflare routes are configured:
Update Route to Previous Version:
```bash
# List routes
bunx wrangler routes list

# Update route to point to stable version
# (Manual in Cloudflare dashboard or API)
```
Gradual Traffic Shift:
- Move 10% traffic to old version
- Monitor metrics
- Gradually increase if stable
- Roll back fully if issues persist
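One possible way to script the gradual shift is Wrangler's versioned deployments, which accept a traffic percentage per version. This sketch only prints the command for each step so it can be reviewed before running. It assumes your Wrangler release supports `versions deploy` with `<version-id>@<percentage>%` specs (check `bunx wrangler versions deploy --help`); `traffic_shift_command` and the version IDs passed to it are hypothetical.

```bash
# Prints (does not run) the command for one step of a gradual shift.
# Usage: traffic_shift_command <worker> <stable_version_id> <current_version_id> <stable_pct>
traffic_shift_command() {
  local worker="$1" stable="$2" current="$3" pct="$4"
  if [ "$pct" -lt 0 ] || [ "$pct" -gt 100 ]; then
    echo "percentage must be between 0 and 100" >&2
    return 1
  fi
  if [ "$pct" -eq 100 ]; then
    # Full cut-over: only the stable version receives traffic
    echo "bunx wrangler versions deploy ${stable}@100% --name ${worker}"
  else
    echo "bunx wrangler versions deploy ${stable}@${pct}% ${current}@$((100 - pct))% --name ${worker}"
  fi
}
```

Printing rather than executing keeps a human in the loop for each percentage change, which matches the "monitor, then increase" steps above.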
Rollback Decision Matrix
| Scenario | Action | Urgency |
|---|---|---|
| Error rate 5-10% | Rollback | High - 5 minutes |
| Error rate 10-20% | Immediate rollback | Critical - 2 minutes |
| Error rate > 20% | Emergency rollback + escalate | Critical - 1 minute |
| Slow performance (2x baseline) | Rollback | Medium - 10 minutes |
| Critical feature broken | Rollback | High - 5 minutes |
| Minor feature issue | Consider forward fix | Low - evaluate |
| Security issue | Immediate rollback | Critical - 2 minutes |
| Data corruption risk | Emergency rollback + freeze | Critical - immediate |
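The error-rate rows of the matrix can be encoded so that alerts carry the recommended action with them. This is a minimal sketch; `rollback_action` is a hypothetical name, and because the matrix ranges share boundaries, it assumes that exactly 10% falls into the more urgent 10-20% row and that exactly 20% stays in that row.

```bash
# Maps an error rate (percent, may be fractional) to "action|urgency"
# following the error-rate rows of the decision matrix.
# Usage: rollback_action <error_rate_pct>
rollback_action() {
  awk -v r="$1" 'BEGIN {
    if (r > 20)       print "Emergency rollback + escalate|Critical - 1 minute"
    else if (r >= 10) print "Immediate rollback|Critical - 2 minutes"
    else if (r >= 5)  print "Rollback|High - 5 minutes"
    else              print "No rollback required by error rate|n/a"
  }'
}
```

The non-numeric rows (security issues, data corruption risk, broken features) still need a human decision and are deliberately left out.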
Rollback Automation
For automated rollbacks based on metrics:
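```bash
#!/bin/bash
# Example: automated rollback script (integrate with your monitoring system).
# get_error_rate_from_cloudflare and send_alert are placeholders that you must
# implement; they are not Wrangler or Cloudflare commands.
WORKER_NAME="monotask-api-gateway"
ERROR_THRESHOLD=5  # percent

# Get current error rate; must be an integer percentage for the -gt test below
ERROR_RATE=$(get_error_rate_from_cloudflare)

if [ "$ERROR_RATE" -gt "$ERROR_THRESHOLD" ]; then
  echo "Error rate ${ERROR_RATE}% exceeds threshold. Rolling back..."
  # Supplying --message avoids an interactive prompt for the rollback reason
  bunx wrangler rollback --name "$WORKER_NAME" \
    --message "Auto-rollback: error rate ${ERROR_RATE}%"
  send_alert "Auto-rollback executed for $WORKER_NAME"
fi
```
Common Issues During Rollback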
Issue: Rollback Command Fails
Symptoms:
- Wrangler returns error
- Version not changing
Solution:
- Check API token permissions
- Verify worker name is correct
- Ensure network connectivity
- Try manual deployment via dashboard
Issue: Rollback Completes But Issues Persist
Symptoms:
- Version shows as rolled back
- But errors continue
Solution:
- Clear Cloudflare cache
- Check if issue is in database, not worker
- Verify correct version actually deployed
- Check if multiple workers need rollback
- Consider infrastructure issue
Issue: Can't Identify Last Good Version
Symptoms:
- All recent versions have issues
- No clear stable point
Solution:
- Review deployment history further back
- Check if infrastructure change caused issue
- Review database migrations
- Consider restoring from backup
- May need disaster recovery
Escalation Path
If rollback doesn't resolve the issue:
- < 5 minutes: Verify rollback completed
- 5-10 minutes: Check infrastructure (D1, KV, R2)
- 10-15 minutes: Escalate to senior engineer
- > 15 minutes: Invoke disaster recovery playbook
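If incident tooling tracks elapsed time, the escalation path can be looked up automatically. This sketch assumes elapsed time is given in whole minutes since the rollback was issued; `escalation_step` is a hypothetical name, and shared boundaries are resolved so that 5 and 10 move to the next step while 15 stays with the senior engineer.

```bash
# Prints the escalation step for the number of whole minutes elapsed
# since the rollback was issued without resolving the issue.
# Usage: escalation_step <elapsed_minutes>
escalation_step() {
  local elapsed="$1"
  if [ "$elapsed" -lt 5 ]; then
    echo "Verify rollback completed"
  elif [ "$elapsed" -lt 10 ]; then
    echo "Check infrastructure (D1, KV, R2)"
  elif [ "$elapsed" -le 15 ]; then
    echo "Escalate to senior engineer"
  else
    echo "Invoke disaster recovery playbook"
  fi
}
```

The 15-minute cut-off lines up with the RTO target stated in the overview, so crossing it means the target has been missed.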
Related Playbooks
- D1 Recovery - Database restoration
- Disaster Recovery - Full system recovery
- Partial Failure - Component failures
Revision History
| Date | Version | Changes | Author |
|---|---|---|---|
| 2025-10-26 | 1.0 | Initial playbook | System |