High Error Rate
Overview
This runbook provides step-by-step procedures for investigating and resolving high error rates in MonoTask Cloudflare Workers.
Alert: elevated_error_rate or critical_error_rate
Severity: Warning (> 1% errors) or Critical (> 5% errors)
SLO Impact: Affects overall availability SLO (99.9% target)
Symptoms and Detection
How to Detect
- Alert: Cloudflare alert fires with title "High Error Rate Detected"
- Dashboard: Error rate widget shows spike above threshold
- Logs: Increased error messages in worker logs
- User Reports: Increased support tickets about service issues
Observable Symptoms
- HTTP 5xx status codes increasing
- Error rate > 1% sustained for 5+ minutes
- Specific worker showing elevated errors
- Error patterns in logs (repeated error messages)
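The thresholds above can be expressed as a small helper for automated detection. This is a sketch, not the actual alerting code; `classifySeverity` and `isSustainedBreach` are hypothetical names, and the 1% / 5% cutoffs come from the alert definitions at the top of this runbook.

```typescript
// Hypothetical helper: classify an observed error rate against the
// runbook thresholds (Warning > 1%, Critical > 5%).
type Severity = "ok" | "warning" | "critical";

function classifySeverity(errorCount: number, totalCount: number): Severity {
  if (totalCount === 0) return "ok";
  const rate = errorCount / totalCount;
  if (rate > 0.05) return "critical";
  if (rate > 0.01) return "warning";
  return "ok";
}

// "Sustained" per the symptoms list: every sample in the trailing
// 5-minute window is above threshold; `samples` are per-minute rates.
function isSustainedBreach(samples: number[], threshold = 0.01): boolean {
  return (
    samples.length >= 5 &&
    samples.slice(-5).every((rate) => rate > threshold)
  );
}
```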
Investigation Steps
1. Identify Scope (ETA: 2 minutes)
Access the monitoring dashboard and determine:
```bash
# Check current error rate
curl https://monotask-api-gateway.workers.dev/health

# View recent errors by worker
# Navigate to: Cloudflare Dashboard > Analytics > Logs
# Filter: status >= 500, last 15 minutes
```
Questions to Answer:
- Which worker(s) are affected?
- What percentage of requests are failing?
- When did the error rate start increasing?
- Is it affecting all endpoints or specific ones?
2. Examine Error Types (ETA: 3 minutes)
Classify errors by type:
```bash
# SSH into logging system or use Cloudflare dashboard
# Group errors by:
# - Status code (500, 502, 503, 504)
# - Error category (database, timeout, validation, etc.)
# - Endpoint path
```
Common Error Patterns:
| Error Type | Status Code | Likely Cause |
|---|---|---|
| Database Errors | 503 | D1 database issues |
| Timeout Errors | 504 | Long-running operations |
| Internal Errors | 500 | Code bugs or exceptions |
| Bad Gateway | 502 | Service binding issues |
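The table above lends itself to a first-pass triage lookup. A minimal sketch (the mapping mirrors the table; `triage` is a hypothetical helper, and real triage still needs log context):

```typescript
// Map a status code to the "Likely Cause" column of the table above.
const likelyCause: Record<number, string> = {
  500: "Code bugs or exceptions",
  502: "Service binding issues",
  503: "D1 database issues",
  504: "Long-running operations",
};

function triage(status: number): string {
  return likelyCause[status] ?? "Unclassified - inspect logs";
}
```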
3. Check Recent Deployments (ETA: 2 minutes)
Verify if error rate correlates with recent deployments:
```bash
# Check GitHub Actions deployment history
gh run list --repo monotask/monotask --limit 10

# Check specific worker deployment time
wrangler deployments list --name monotask-api-gateway

# Check if error spike timing matches deployment
```
If deployment is the cause:
- Proceed to Rollback section below
- Identify problematic code changes
- Create incident report
4. Examine Dependency Health (ETA: 3 minutes)
Check health of external dependencies:
D1 Database:
```bash
# Check D1 query errors
curl https://api.cloudflare.com/client/v4/accounts/{account_id}/d1/database/{database_id}/metrics \
  -H "Authorization: Bearer {api_token}"
```
External APIs (GitHub, Claude):
```bash
# Check GitHub API status
curl https://www.githubstatus.com/api/v2/status.json

# Check Anthropic API status
curl https://status.anthropic.com/api/v2/status.json
```
Service Bindings:
- Verify all worker-to-worker bindings are healthy
- Check if dependent workers are experiencing issues
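One way to check binding health from the gateway itself is to fan out a probe to each dependency's health endpoint. A sketch, assuming each service binding exposes `fetch` (as Cloudflare service bindings do) and that dependent workers serve a `/health` route; the binding names in the test are hypothetical:

```typescript
// Probe each service binding's /health endpoint in parallel and
// report which dependencies answered successfully.
interface HealthFetcher {
  fetch(url: string): Promise<{ ok: boolean }>;
}

async function checkBindings(
  bindings: Record<string, HealthFetcher>,
): Promise<Record<string, boolean>> {
  const entries = await Promise.all(
    Object.entries(bindings).map(async ([name, binding]) => {
      try {
        const res = await binding.fetch("https://internal/health");
        return [name, res.ok] as const;
      } catch {
        // A thrown fetch counts as an unhealthy dependency.
        return [name, false] as const;
      }
    }),
  );
  return Object.fromEntries(entries);
}
```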
5. Review Error Logs (ETA: 5 minutes)
Deep dive into error logs:
```bash
# Tail live logs for specific worker
wrangler tail monotask-api-gateway --format pretty

# Filter for errors only
wrangler tail monotask-api-gateway --status error

# Search for specific error pattern
wrangler tail monotask-api-gateway | grep "database"
```
Look for:
- Stack traces indicating code bugs
- Repeated error messages (same error occurring frequently)
- Error context (request ID, user ID, endpoint)
- Timing patterns (errors during specific time periods)
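Spotting "the same error occurring frequently" is easier with a quick frequency count over a batch of captured log messages. A minimal sketch (`topErrors` is a hypothetical helper, not part of the worker code):

```typescript
// Count repeated error messages so the most frequent failure
// surfaces first; returns [message, count] pairs, highest first.
function topErrors(messages: string[], limit = 3): [string, number][] {
  const counts = new Map<string, number>();
  for (const msg of messages) {
    counts.set(msg, (counts.get(msg) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}
```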
Common Causes and Resolutions
Cause 1: Database Connectivity Issues
Symptoms:
- Error messages: "database connection failed", "D1 unavailable"
- Status codes: 503 Service Unavailable
Resolution:
- Check D1 database status in Cloudflare dashboard
- Verify database bindings are correct in wrangler.toml
- Test database connectivity:
```bash
# Execute test query
wrangler d1 execute monotask-production --command "SELECT 1"
```
- If the database is down, check the Cloudflare status page
- Consider implementing retry logic for transient failures
Mitigation:
- Enable database connection pooling
- Add circuit breaker for database calls
- Implement graceful degradation
Cause 2: Code Bug or Exception
Symptoms:
- Specific error message repeated
- Stack trace in logs
- Error rate started after deployment
Resolution:
- Identify problematic code from stack trace
- Rollback to previous version:
```bash
# Rollback to previous deployment
wrangler rollback monotask-api-gateway
```
- Create hotfix PR for the bug
- Deploy fix with monitoring:
```bash
# Test in staging first
wrangler deploy --env staging
wrangler deploy --env production
```
Cause 3: External API Failures
Symptoms:
- Timeout errors (504)
- "Failed to fetch" error messages
- External API status page shows incidents
Resolution:
- Implement retry logic with exponential backoff
- Enable circuit breaker to fail fast
- Return cached data if available
- Provide degraded service:
```typescript
try {
  result = await callExternalAPI();
} catch (error) {
  // Return cached result or partial data
  result = (await getCachedResult()) || getDefaultResult();
}
```
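The retry-with-exponential-backoff recommendation can be sketched as a generic wrapper. Illustrative only; `withRetries` and `backoffDelayMs` are hypothetical names, and the base/cap values are assumptions:

```typescript
// Exponential backoff: 200ms, 400ms, 800ms, ... capped at 5s.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry a transiently failing async call; `sleep` is injectable so
// tests can skip real delays.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```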
Cause 4: Rate Limiting
Symptoms:
- 429 Too Many Requests errors
- External API rate limit errors
Resolution:
- Check current request rate:
```bash
# View request rate metrics
# Cloudflare Dashboard > Analytics > Requests
```
- Implement request throttling:
```typescript
// Add rate limiting middleware
if (requestCount > threshold) {
  return new Response('Rate limit exceeded', { status: 429 });
}
```
- Distribute load across multiple workers
- Implement request queuing for burst traffic
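A common shape for the throttling middleware above is a token bucket, which absorbs short bursts while enforcing a steady rate. A sketch with an injectable clock (the class and parameter names are illustrative, not from the codebase):

```typescript
// Token bucket: refills at `ratePerSec`, holds at most `capacity`
// tokens; each admitted request removes one token.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(
    private readonly capacity: number,
    private readonly ratePerSec: number,
    private readonly now: () => number = Date.now,
  ) {
    this.tokens = capacity;
    this.last = now();
  }

  tryRemove(): boolean {
    const t = this.now();
    // Refill proportionally to elapsed time, capped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((t - this.last) / 1000) * this.ratePerSec,
    );
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

The middleware returns the 429 response whenever `tryRemove()` is false.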
Cause 5: Resource Exhaustion
Symptoms:
- Worker CPU time exceeded
- Memory limit errors
- Timeout errors on previously fast endpoints
Resolution:
- Check resource usage metrics
- Optimize heavy operations:
- Move CPU-intensive tasks to queues
- Implement pagination for large datasets
- Add caching for expensive computations
- Consider adjusting worker limits:
```toml
# Adjust worker settings in wrangler.toml
[limits]
cpu_ms = 50
```
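The "add caching for expensive computations" item can be sketched as a small TTL cache, so repeated requests within the window reuse a stored result instead of burning CPU time. Illustrative only; `TtlCache` is a hypothetical helper with an injectable clock:

```typescript
// In-memory TTL cache: entries expire `ttlMs` after being set.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      // Lazily evict expired entries on read.
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```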
Resolution Procedures
Immediate Mitigation (ETA: 5 minutes)
Option 1: Rollback Deployment
```bash
# List recent deployments
wrangler deployments list --name monotask-api-gateway

# Rollback to specific deployment
wrangler rollback monotask-api-gateway --deployment-id {deployment_id}

# Verify rollback success
curl https://monotask-api-gateway.workers.dev/health
```
Option 2: Enable Circuit Breaker
```typescript
// Temporarily disable problematic feature
if (featureFlag.isEnabled('problematic-feature')) {
  // Skip problematic code path
  return getCachedResult();
}
```
Option 3: Route Traffic Away
```bash
# Use Cloudflare Load Balancer to route traffic
# to backup worker or maintenance page
```
Long-term Fix (ETA: varies)
- Identify Root Cause: Complete investigation steps above
- Create Fix PR: Implement proper solution
- Test Thoroughly:
- Unit tests for bug fixes
- Load tests for performance issues
- Integration tests for dependency issues
- Deploy to Staging: Verify fix works
- Deploy to Production: Monitor during rollout
- Verify Resolution: Confirm error rate returns to normal
Verification Steps
After applying fix, verify resolution:
Check Error Rate (target: < 1%):
```bash
# Monitor dashboard for 15 minutes
# Verify error rate drops below threshold
```
Monitor Logs:
```bash
wrangler tail monotask-api-gateway --status error
# Should see minimal errors
```
Test Affected Endpoints:
```bash
# Execute smoke tests
curl https://monotask-api-gateway.workers.dev/api/tasks
curl https://monotask-api-gateway.workers.dev/api/agents
```
Check SLO Compliance:
- Verify availability SLO recovering
- Check error budget consumption
- Confirm alerts have cleared
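Error-budget consumption against the 99.9% target is simple arithmetic: the budget is the fraction of requests allowed to fail, and consumption is observed errors divided by that allowance. A sketch (`errorBudgetConsumed` is a hypothetical helper):

```typescript
// For a 99.9% SLO, 0.1% of requests may fail. Returns the fraction
// of that budget consumed (1.0 means fully spent).
function errorBudgetConsumed(
  errorCount: number,
  totalCount: number,
  sloTarget = 0.999,
): number {
  const budget = (1 - sloTarget) * totalCount; // allowed failures
  if (budget === 0) return errorCount > 0 ? Infinity : 0;
  return errorCount / budget;
}
```

For example, 10 errors across 100,000 requests consumes about 10% of the budget.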
Escalation Path
When to Escalate
Escalate if:
- Error rate > 10% for more than 15 minutes
- Unable to identify root cause within 30 minutes
- Fix attempts unsuccessful after 2 iterations
- Multiple workers affected simultaneously
- Data integrity concerns identified
Escalation Contacts
Level 1 - On-Call Engineer
- Slack: #monotask-oncall
- PagerDuty: Trigger incident escalation
Level 2 - Engineering Lead
- Slack: @engineering-lead
- Phone: [REDACTED]
Level 3 - CTO / VP Engineering
- For critical, ongoing incidents
- Business impact > $10K/hour
Post-Incident
Required Actions
Incident Report:
- Create post-mortem document
- Timeline of events
- Root cause analysis
- Action items to prevent recurrence
Update Monitoring:
- Add alerts for identified gap
- Improve error categorization
- Enhance logging for similar issues
Code Improvements:
- Add error handling
- Implement circuit breakers
- Add retry logic
- Improve observability
Documentation:
- Update runbook with lessons learned
- Document new error patterns
- Share knowledge with team
Related Runbooks
- Queue Backup - For queue-related errors
- Database Slow - For database performance issues
- Worker Timeout - For timeout errors
Last Updated: 2025-10-26
Owner: SRE Team
Reviewers: Engineering Team