High Error Rate
Overview
This runbook provides step-by-step procedures for investigating and resolving high error rates in MonoTask Cloudflare Workers.
Alert: elevated_error_rate or critical_error_rate
Severity: Warning (> 1% errors) or Critical (> 5% errors)
SLO Impact: Affects overall availability SLO (99.9% target)
Symptoms and Detection
How to Detect
- Alert: Cloudflare alert fires with title "High Error Rate Detected"
- Dashboard: Error rate widget shows spike above threshold
- Logs: Increased error messages in worker logs
- User Reports: Increased support tickets about service issues
Observable Symptoms
- HTTP 5xx status codes increasing
- Error rate > 1% sustained for 5+ minutes
- Specific worker showing elevated errors
- Error patterns in logs (repeated error messages)
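The thresholds above can be expressed as a small helper for automated detection. This is a sketch, not the actual alerting code; `classifySeverity` and `isSustainedBreach` are hypothetical names, and the 1% / 5% cutoffs come from the alert definitions at the top of this runbook.

```typescript
// Hypothetical helper: classify an observed error rate against the
// runbook thresholds (Warning > 1%, Critical > 5%).
type Severity = "ok" | "warning" | "critical";

function classifySeverity(errorCount: number, totalCount: number): Severity {
  if (totalCount === 0) return "ok";
  const rate = errorCount / totalCount;
  if (rate > 0.05) return "critical";
  if (rate > 0.01) return "warning";
  return "ok";
}

// "Sustained" per the symptoms list: every sample in the trailing
// 5-minute window is above threshold; `samples` are per-minute rates.
function isSustainedBreach(samples: number[], threshold = 0.01): boolean {
  return (
    samples.length >= 5 &&
    samples.slice(-5).every((rate) => rate > threshold)
  );
}
```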
Investigation Steps
1. Identify Scope (ETA: 2 minutes)
Access the monitoring dashboard and determine:
```bash
# Check current error rate
curl https://monotask-api-gateway.workers.dev/health

# View recent errors by worker
# Navigate to: Cloudflare Dashboard > Analytics > Logs
# Filter: status >= 500, last 15 minutes
```
Questions to Answer:
- Which worker(s) are affected?
- What percentage of requests are failing?
- When did the error rate start increasing?
- Is it affecting all endpoints or specific ones?
2. Examine Error Types (ETA: 3 minutes)
Classify errors by type:
```bash
# SSH into logging system or use Cloudflare dashboard
# Group errors by:
# - Status code (500, 502, 503, 504)
# - Error category (database, timeout, validation, etc.)
# - Endpoint path
```
Common Error Patterns:
| Error Type | Status Code | Likely Cause |
|---|---|---|
| Database Errors | 503 | D1 database issues |
| Timeout Errors | 504 | Long-running operations |
| Internal Errors | 500 | Code bugs or exceptions |
| Bad Gateway | 502 | Service binding issues |
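The table above lends itself to a first-pass triage lookup. A minimal sketch (the mapping mirrors the table; `triage` is a hypothetical helper, and real triage still needs log context):

```typescript
// Map a status code to the "Likely Cause" column of the table above.
const likelyCause: Record<number, string> = {
  500: "Code bugs or exceptions",
  502: "Service binding issues",
  503: "D1 database issues",
  504: "Long-running operations",
};

function triage(status: number): string {
  return likelyCause[status] ?? "Unclassified - inspect logs";
}
```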
3. Check Recent Deployments (ETA: 2 minutes)
Verify if error rate correlates with recent deployments:
```bash
# Check GitHub Actions deployment history
gh run list --repo monotask/monotask --limit 10

# Check specific worker deployment time
wrangler deployments list --name monotask-api-gateway

# Check if error spike timing matches deployment
```
If deployment is the cause:
- Proceed to Rollback section below
- Identify problematic code changes
- Create incident report
4. Examine Dependency Health (ETA: 3 minutes)
Check health of external dependencies:
D1 Database:
```bash
# Check D1 query errors
curl https://api.cloudflare.com/client/v4/accounts/{account_id}/d1/database/{database_id}/metrics \
  -H "Authorization: Bearer {api_token}"
```
External APIs (GitHub, Claude):
```bash
# Check GitHub API status
curl https://www.githubstatus.com/api/v2/status.json

# Check Anthropic API status
curl https://status.anthropic.com/api/v2/status.json
```
Service Bindings:
- Verify all worker-to-worker bindings are healthy
- Check if dependent workers are experiencing issues
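One way to check binding health from the gateway itself is to fan out a probe to each dependency's health endpoint. A sketch, assuming each service binding exposes `fetch` (as Cloudflare service bindings do) and that dependent workers serve a `/health` route; the binding names in the test are hypothetical:

```typescript
// Probe each service binding's /health endpoint in parallel and
// report which dependencies answered successfully.
interface HealthFetcher {
  fetch(url: string): Promise<{ ok: boolean }>;
}

async function checkBindings(
  bindings: Record<string, HealthFetcher>,
): Promise<Record<string, boolean>> {
  const entries = await Promise.all(
    Object.entries(bindings).map(async ([name, binding]) => {
      try {
        const res = await binding.fetch("https://internal/health");
        return [name, res.ok] as const;
      } catch {
        // A thrown fetch counts as an unhealthy dependency.
        return [name, false] as const;
      }
    }),
  );
  return Object.fromEntries(entries);
}
```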
5. Review Error Logs (ETA: 5 minutes)
Deep dive into error logs:
```bash
# Tail live logs for specific worker
wrangler tail monotask-api-gateway --format pretty

# Filter for errors only
wrangler tail monotask-api-gateway --status error

# Search for specific error pattern
wrangler tail monotask-api-gateway | grep "database"
```
Look for:
- Stack traces indicating code bugs
- Repeated error messages (same error occurring frequently)
- Error context (request ID, user ID, endpoint)
- Timing patterns (errors during specific time periods)
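Spotting "the same error occurring frequently" is easier with a quick frequency count over a batch of captured log messages. A minimal sketch (`topErrors` is a hypothetical helper, not part of the worker code):

```typescript
// Count repeated error messages so the most frequent failure
// surfaces first; returns [message, count] pairs, highest first.
function topErrors(messages: string[], limit = 3): [string, number][] {
  const counts = new Map<string, number>();
  for (const msg of messages) {
    counts.set(msg, (counts.get(msg) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}
```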
Common Causes and Resolutions
Cause 1: Database Connectivity Issues
Symptoms:
- Error messages: "database connection failed", "D1 unavailable"
- Status codes: 503 Service Unavailable
Resolution:
- Check D1 database status in Cloudflare dashboard
- Verify database bindings are correct in wrangler.toml
- Test database connectivity:
```bash
# Execute test query
wrangler d1 execute monotask-production --command "SELECT 1"
```
- If the database is down, check the Cloudflare status page
- Consider implementing retry logic for transient failures
Mitigation:
- Enable database connection pooling
- Add circuit breaker for database calls
- Implement graceful degradation
Cause 2: Code Bug or Exception
Symptoms:
- Specific error message repeated
- Stack trace in logs
- Error rate started after deployment
Resolution:
- Identify problematic code from stack trace
- Rollback to previous version:
```bash
# Rollback to previous deployment
wrangler rollback monotask-api-gateway
```
- Create hotfix PR for the bug
- Deploy fix with monitoring:
```bash
# Test in staging first
wrangler deploy --env staging
wrangler deploy --env production
```
Cause 3: External API Failures
Symptoms:
- Timeout errors (504)
- "Failed to fetch" error messages
- External API status page shows incidents
Resolution:
- Implement retry logic with exponential backoff
- Enable circuit breaker to fail fast
- Return cached data if available
- Provide degraded service:
```typescript
try {
  result = await callExternalAPI();
} catch (error) {
  // Return cached result or partial data
  result = (await getCachedResult()) || getDefaultResult();
}
```
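The retry-with-exponential-backoff recommendation can be sketched as a generic wrapper. Illustrative only; `withRetries` and `backoffDelayMs` are hypothetical names, and the base/cap values are assumptions:

```typescript
// Exponential backoff: 200ms, 400ms, 800ms, ... capped at 5s.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry a transiently failing async call; `sleep` is injectable so
// tests can skip real delays.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```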
Cause 4: Rate Limiting
Symptoms:
- 429 Too Many Requests errors
- External API rate limit errors
Resolution:
- Check current request rate:
```bash
# View request rate metrics
# Cloudflare Dashboard > Analytics > Requests
```
- Implement request throttling:
```typescript
// Add rate limiting middleware
if (requestCount > threshold) {
  return new Response('Rate limit exceeded', { status: 429 });
}
```
- Distribute load across multiple workers
- Implement request queuing for burst traffic
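A common shape for the throttling middleware above is a token bucket, which absorbs short bursts while enforcing a steady rate. A sketch with an injectable clock (the class and parameter names are illustrative, not from the codebase):

```typescript
// Token bucket: refills at `ratePerSec`, holds at most `capacity`
// tokens; each admitted request removes one token.
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(
    private readonly capacity: number,
    private readonly ratePerSec: number,
    private readonly now: () => number = Date.now,
  ) {
    this.tokens = capacity;
    this.last = now();
  }

  tryRemove(): boolean {
    const t = this.now();
    // Refill proportionally to elapsed time, capped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((t - this.last) / 1000) * this.ratePerSec,
    );
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

The middleware returns the 429 response whenever `tryRemove()` is false.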
Cause 5: Resource Exhaustion
Symptoms:
- Worker CPU time exceeded
- Memory limit errors
- Timeout errors on previously fast endpoints
Resolution:
- Check resource usage metrics
- Optimize heavy operations:
- Move CPU-intensive tasks to queues
- Implement pagination for large datasets
- Add caching for expensive computations
- Consider adjusting worker limits:
```toml
# Adjust worker settings in wrangler.toml
[limits]
cpu_ms = 50
```
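The "add caching for expensive computations" item can be sketched as a small TTL cache, so repeated requests within the window reuse a stored result instead of burning CPU time. Illustrative only; `TtlCache` is a hypothetical helper with an injectable clock:

```typescript
// In-memory TTL cache: entries expire `ttlMs` after being set.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      // Lazily evict expired entries on read.
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```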
Resolution Procedures
Immediate Mitigation (ETA: 5 minutes)
Option 1: Rollback Deployment
```bash
# List recent deployments
wrangler deployments list --name monotask-api-gateway

# Rollback to specific deployment
wrangler rollback monotask-api-gateway --deployment-id {deployment_id}

# Verify rollback success
curl https://monotask-api-gateway.workers.dev/health
```
Option 2: Enable Circuit Breaker
```typescript
// Temporarily disable problematic feature
if (featureFlag.isEnabled('problematic-feature')) {
  // Skip problematic code path
  return getCachedResult();
}
```
Option 3: Route Traffic Away
```bash
# Use Cloudflare Load Balancer to route traffic
# to backup worker or maintenance page
```
Long-term Fix (ETA: varies)
- Identify Root Cause: Complete investigation steps above
- Create Fix PR: Implement proper solution
- Test Thoroughly:
- Unit tests for bug fixes
- Load tests for performance issues
- Integration tests for dependency issues
- Deploy to Staging: Verify fix works
- Deploy to Production: Monitor during rollout
- Verify Resolution: Confirm error rate returns to normal
Verification Steps
After applying fix, verify resolution:
Check Error Rate (target: < 1%):
```bash
# Monitor dashboard for 15 minutes
# Verify error rate drops below threshold
```
Monitor Logs:
```bash
wrangler tail monotask-api-gateway --status error
# Should see minimal errors
```
Test Affected Endpoints:
```bash
# Execute smoke tests
curl https://monotask-api-gateway.workers.dev/api/tasks
curl https://monotask-api-gateway.workers.dev/api/agents
```
Check SLO Compliance:
- Verify availability SLO recovering
- Check error budget consumption
- Confirm alerts have cleared
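Error-budget consumption against the 99.9% target is simple arithmetic: the budget is the fraction of requests allowed to fail, and consumption is observed errors divided by that allowance. A sketch (`errorBudgetConsumed` is a hypothetical helper):

```typescript
// For a 99.9% SLO, 0.1% of requests may fail. Returns the fraction
// of that budget consumed (1.0 means fully spent).
function errorBudgetConsumed(
  errorCount: number,
  totalCount: number,
  sloTarget = 0.999,
): number {
  const budget = (1 - sloTarget) * totalCount; // allowed failures
  if (budget === 0) return errorCount > 0 ? Infinity : 0;
  return errorCount / budget;
}
```

For example, 10 errors across 100,000 requests consumes about 10% of the budget.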
Escalation Path
When to Escalate
Escalate if:
- Error rate > 10% for more than 15 minutes
- Unable to identify root cause within 30 minutes
- Fix attempts unsuccessful after 2 iterations
- Multiple workers affected simultaneously
- Data integrity concerns identified
Escalation Contacts
Level 1 - On-Call Engineer
- Slack: #monotask-oncall
- PagerDuty: Trigger incident escalation
Level 2 - Engineering Lead
- Slack: @engineering-lead
- Phone: [REDACTED]
Level 3 - CTO / VP Engineering
- For critical, ongoing incidents
- Business impact > $10K/hour
Post-Incident
Required Actions
Incident Report:
- Create post-mortem document
- Timeline of events
- Root cause analysis
- Action items to prevent recurrence
Update Monitoring:
- Add alerts for identified gap
- Improve error categorization
- Enhance logging for similar issues
Code Improvements:
- Add error handling
- Implement circuit breakers
- Add retry logic
- Improve observability
Documentation:
- Update runbook with lessons learned
- Document new error patterns
- Share knowledge with team
Related Runbooks
- Queue Backup - For queue-related errors
- Database Slow - For database performance issues
- Worker Timeout - For timeout errors
Last Updated: 2025-10-26
Owner: SRE Team
Reviewers: Engineering Team