
High Error Rate

Overview

This runbook provides step-by-step procedures for investigating and resolving high error rates in MonoTask Cloudflare Workers.

Alert: `elevated_error_rate` or `critical_error_rate`
Severity: Warning (> 1% errors) or Critical (> 5% errors)
SLO Impact: Affects overall availability SLO (99.9% target)


Symptoms and Detection

How to Detect

  • Alert: Cloudflare alert fires with title "High Error Rate Detected"
  • Dashboard: Error rate widget shows spike above threshold
  • Logs: Increased error messages in worker logs
  • User Reports: Increased support tickets about service issues

Observable Symptoms

  • HTTP 5xx status codes increasing
  • Error rate > 1% sustained for 5+ minutes
  • Specific worker showing elevated errors
  • Error patterns in logs (repeated error messages)

Investigation Steps

1. Identify Scope (ETA: 2 minutes)

Access the monitoring dashboard and determine:

```bash
# Check current error rate
curl https://monotask-api-gateway.workers.dev/health

# View recent errors by worker
# Navigate to: Cloudflare Dashboard > Analytics > Logs
# Filter: status >= 500, last 15 minutes
```

Questions to Answer:

  • Which worker(s) are affected?
  • What percentage of requests are failing?
  • When did the error rate start increasing?
  • Is it affecting all endpoints or specific ones?
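The first two questions can be answered mechanically from an exported log sample. A hypothetical sketch (the `LogEntry` shape is an assumption for illustration, not the actual Cloudflare log schema):

```typescript
// Hypothetical log entry shape; real Cloudflare logs carry more fields.
interface LogEntry {
  worker: string;
  status: number;
}

// Percentage of 5xx responses per worker, to spot which worker is failing.
function errorRateByWorker(entries: LogEntry[]): Map<string, number> {
  const totals = new Map<string, { total: number; errors: number }>();
  for (const e of entries) {
    const t = totals.get(e.worker) ?? { total: 0, errors: 0 };
    t.total += 1;
    if (e.status >= 500) t.errors += 1;
    totals.set(e.worker, t);
  }
  const rates = new Map<string, number>();
  for (const [worker, t] of totals) {
    rates.set(worker, (100 * t.errors) / t.total);
  }
  return rates;
}
```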

2. Examine Error Types (ETA: 3 minutes)

Classify errors by type:

```bash
# Use the Cloudflare dashboard (Analytics > Logs) or `wrangler tail`
# Group errors by:
# - Status code (500, 502, 503, 504)
# - Error category (database, timeout, validation, etc.)
# - Endpoint path
```

Common Error Patterns:

| Error Type      | Status Code | Likely Cause            |
|-----------------|-------------|-------------------------|
| Database Errors | 503         | D1 database issues      |
| Timeout Errors  | 504         | Long-running operations |
| Internal Errors | 500         | Code bugs or exceptions |
| Bad Gateway     | 502         | Service binding issues  |
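The mapping above can be encoded as a small triage helper, e.g. to tag exported log entries automatically. This is a sketch; the category strings are illustrative, not an existing MonoTask utility:

```typescript
// Map a 5xx status code to the likely cause category from the table above.
function likelyCause(status: number): string {
  switch (status) {
    case 500: return "Code bugs or exceptions";
    case 502: return "Service binding issues";
    case 503: return "D1 database issues";
    case 504: return "Long-running operations";
    default:  return "Unknown";
  }
}
```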

3. Check Recent Deployments (ETA: 2 minutes)

Verify if error rate correlates with recent deployments:

```bash
# Check GitHub Actions deployment history
gh run list --repo monotask/monotask --limit 10

# Check specific worker deployment time
wrangler deployments list --name monotask-api-gateway

# Check if error spike timing matches deployment
```

If deployment is the cause:

  • Proceed to Rollback section below
  • Identify problematic code changes
  • Create incident report
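The timing check can be expressed as a one-line comparison; a sketch, where the 15-minute default window is an arbitrary illustration rather than a MonoTask policy:

```typescript
// Returns true if the error spike began within `windowMinutes` after the
// deploy, which suggests the deployment is the likely trigger.
function spikeFollowsDeploy(
  deployTime: Date,
  spikeStart: Date,
  windowMinutes = 15
): boolean {
  const deltaMs = spikeStart.getTime() - deployTime.getTime();
  return deltaMs >= 0 && deltaMs <= windowMinutes * 60_000;
}
```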

4. Examine Dependency Health (ETA: 3 minutes)

Check health of external dependencies:

D1 Database:

```bash
# Check D1 query errors
curl https://api.cloudflare.com/client/v4/accounts/{account_id}/d1/database/{database_id}/metrics \
  -H "Authorization: Bearer {api_token}"
```

External APIs (GitHub, Claude):

```bash
# Check GitHub API status
curl https://www.githubstatus.com/api/v2/status.json

# Check Anthropic API status
curl https://status.anthropic.com/api/v2/status.json
```

Service Bindings:

  • Verify all worker-to-worker bindings are healthy
  • Check if dependent workers are experiencing issues

5. Review Error Logs (ETA: 5 minutes)

Deep dive into error logs:

```bash
# Tail live logs for specific worker
wrangler tail monotask-api-gateway --format pretty

# Filter for errors only
wrangler tail monotask-api-gateway --status error

# Search for specific error pattern
wrangler tail monotask-api-gateway | grep "database"
```

Look for:

  • Stack traces indicating code bugs
  • Repeated error messages (same error occurring frequently)
  • Error context (request ID, user ID, endpoint)
  • Timing patterns (errors during specific time periods)

Common Causes and Resolutions

Cause 1: Database Connectivity Issues

Symptoms:

  • Error messages: "database connection failed", "D1 unavailable"
  • Status codes: 503 Service Unavailable

Resolution:

  1. Check D1 database status in Cloudflare dashboard
  2. Verify database bindings are correct in wrangler.toml
  3. Test database connectivity:
    ```bash
    # Execute test query
    wrangler d1 execute monotask-production --command "SELECT 1"
    ```
  4. If database is down, check Cloudflare status page
  5. Consider implementing retry logic for transient failures

Mitigation:

  • Enable database connection pooling
  • Add circuit breaker for database calls
  • Implement graceful degradation
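The retry suggestion above can be sketched as a small wrapper around a D1 call. This is a minimal illustration, not existing MonoTask code; the attempt count and delay constants are assumptions:

```typescript
// Delay before retry attempt n (0-based): base * 2^n, capped at maxMs.
function backoffDelay(attempt: number, baseMs = 100, maxMs = 5_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Retry a transient operation (e.g. a D1 query) with exponential backoff.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await new Promise((r) => setTimeout(r, backoffDelay(i)));
    }
  }
  throw lastError;
}
```

Only retry errors you believe are transient (e.g. 503s), or a hard failure will simply take three times as long to surface.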

Cause 2: Code Bug or Exception

Symptoms:

  • Specific error message repeated
  • Stack trace in logs
  • Error rate started after deployment

Resolution:

  1. Identify problematic code from stack trace
  2. Rollback to previous version:
    ```bash
    # Rollback to previous deployment
    wrangler rollback monotask-api-gateway
    ```
  3. Create hotfix PR for bug
  4. Deploy fix with monitoring:
    ```bash
    wrangler deploy --env staging  # Test in staging first
    wrangler deploy --env production
    ```

Cause 3: External API Failures

Symptoms:

  • Timeout errors (504)
  • "Failed to fetch" error messages
  • External API status page shows incidents

Resolution:

  1. Implement retry logic with exponential backoff
  2. Enable circuit breaker to fail fast
  3. Return cached data if available
  4. Provide degraded service:
    ```typescript
    let result;
    try {
      result = await callExternalAPI();
    } catch (error) {
      // Return cached result or partial data
      result = (await getCachedResult()) || getDefaultResult();
    }
    ```
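The circuit-breaker suggestion in step 2 can be sketched as a small class. This is an illustrative minimal version (the threshold and cooldown values are assumptions); the injectable clock exists only to make it testable:

```typescript
// Minimal circuit breaker: after `threshold` consecutive failures, reject
// calls immediately until `cooldownMs` has passed, then allow one probe.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  constructor(
    private threshold = 3,
    private cooldownMs = 30_000,
    private now: () => number = Date.now
  ) {}
  allowRequest(): boolean {
    if (this.openedAt === null) return true;
    // Half-open: allow a probe request once the cooldown has elapsed.
    return this.now() - this.openedAt >= this.cooldownMs;
  }
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

Call `allowRequest()` before the external fetch; when it returns false, skip straight to the cached/default path from step 4.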

Cause 4: Rate Limiting

Symptoms:

  • 429 Too Many Requests errors
  • External API rate limit errors

Resolution:

  1. Check current request rate:
    ```bash
    # View request rate metrics
    # Cloudflare Dashboard > Analytics > Requests
    ```
  2. Implement request throttling:
    ```typescript
    // Add rate limiting middleware
    if (requestCount > threshold) {
      return new Response('Rate limit exceeded', { status: 429 });
    }
    ```
  3. Distribute load across multiple workers
  4. Implement request queuing for burst traffic
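One common way to implement the throttling in step 2 is a token bucket, which absorbs short bursts while capping the sustained rate. A sketch (capacity and refill rate are illustrative; the injectable clock is for testing only):

```typescript
// Token-bucket throttle: allow up to `capacity` requests at once, refilled
// at `refillPerSec` tokens per second.
class TokenBucket {
  private tokens: number;
  private last: number;
  constructor(
    private capacity: number,
    private refillPerSec: number,
    private now: () => number = Date.now
  ) {
    this.tokens = capacity;
    this.last = now();
  }
  // Returns true if the request may proceed, false if it should get a 429.
  tryRemove(): boolean {
    const t = this.now();
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((t - this.last) / 1000) * this.refillPerSec
    );
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Note that a plain in-memory bucket is per-isolate; for a global limit across Workers instances you would need shared state (e.g. a Durable Object).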

Cause 5: Resource Exhaustion

Symptoms:

  • Worker CPU time exceeded
  • Memory limit errors
  • Timeout errors on previously fast endpoints

Resolution:

  1. Check resource usage metrics
  2. Optimize heavy operations:
    • Move CPU-intensive tasks to queues
    • Implement pagination for large datasets
    • Add caching for expensive computations
  3. Adjust worker limits where the plan allows (Workers scale horizontally on their own; CPU time per request is what you configure):
    ```toml
    # wrangler.toml: cap CPU time per request (paid plans)
    [limits]
    cpu_ms = 50
    ```
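Moving CPU-intensive tasks to queues (step 2) typically means enqueueing a job and returning 202 instead of processing inline. A sketch against a minimal producer interface; only `send` is modeled here, and the binding name and job shape are assumptions:

```typescript
// Minimal shape of a queue producer binding (only `send` is modeled).
interface QueueLike<T> {
  send(message: T): Promise<void>;
}

// Offload heavy work: enqueue a job and return immediately rather than
// doing CPU-intensive processing inside the request handler.
async function offloadHeavyTask<T>(
  queue: QueueLike<T>,
  job: T
): Promise<Response> {
  await queue.send(job);
  return new Response("Accepted", { status: 202 });
}
```

A separate queue consumer then does the heavy processing outside the request's CPU budget.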

Resolution Procedures

Immediate Mitigation (ETA: 5 minutes)

Option 1: Rollback Deployment

```bash
# List recent deployments
wrangler deployments list --name monotask-api-gateway

# Rollback to specific deployment
wrangler rollback monotask-api-gateway --deployment-id {deployment_id}

# Verify rollback success
curl https://monotask-api-gateway.workers.dev/health
```

Option 2: Enable Circuit Breaker

```typescript
// Temporarily disable problematic feature
if (featureFlag.isEnabled('problematic-feature')) {
  // Skip problematic code path
  return getCachedResult();
}
```

Option 3: Route Traffic Away

```bash
# Use Cloudflare Load Balancer to route traffic
# to backup worker or maintenance page
```

Long-term Fix (ETA: varies)

  1. Identify Root Cause: Complete investigation steps above
  2. Create Fix PR: Implement proper solution
  3. Test Thoroughly:
    • Unit tests for bug fixes
    • Load tests for performance issues
    • Integration tests for dependency issues
  4. Deploy to Staging: Verify fix works
  5. Deploy to Production: Monitor during rollout
  6. Verify Resolution: Confirm error rate returns to normal

Verification Steps

After applying fix, verify resolution:

  1. Check Error Rate (target: < 1%):

    ```bash
    # Monitor dashboard for 15 minutes
    # Verify error rate drops below threshold
    ```
  2. Monitor Logs:

    ```bash
    wrangler tail monotask-api-gateway --status error
    # Should see minimal errors
    ```
  3. Test Affected Endpoints:

    ```bash
    # Execute smoke tests
    curl https://monotask-api-gateway.workers.dev/api/tasks
    curl https://monotask-api-gateway.workers.dev/api/agents
    ```
  4. Check SLO Compliance:

    • Verify availability SLO recovering
    • Check error budget consumption
    • Confirm alerts have cleared
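The error-budget check in step 4 is simple arithmetic: a 99.9% availability target over a 30-day window allows roughly 43.2 minutes of error budget. A sketch of the calculation:

```typescript
// Error budget for an availability SLO: allowed minutes of errors/downtime
// over the window. 99.9% over 30 days ≈ 43.2 minutes.
function errorBudgetMinutes(slo: number, windowDays: number): number {
  return (1 - slo) * windowDays * 24 * 60;
}
```

Compare the incident's duration (weighted by its error rate) against this figure to see how much of the budget the incident consumed.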

Escalation Path

When to Escalate

Escalate if:

  • Error rate > 10% for more than 15 minutes
  • Unable to identify root cause within 30 minutes
  • Fix attempts unsuccessful after 2 iterations
  • Multiple workers affected simultaneously
  • Data integrity concerns identified

Escalation Contacts

Level 1 - On-Call Engineer

  • Slack: #monotask-oncall
  • PagerDuty: Trigger incident escalation

Level 2 - Engineering Lead

  • Slack: @engineering-lead
  • Phone: [REDACTED]

Level 3 - CTO / VP Engineering

  • For critical, ongoing incidents
  • Business impact > $10K/hour

Post-Incident

Required Actions

  1. Incident Report:

    • Create post-mortem document
    • Timeline of events
    • Root cause analysis
    • Action items to prevent recurrence
  2. Update Monitoring:

    • Add alerts for identified gap
    • Improve error categorization
    • Enhance logging for similar issues
  3. Code Improvements:

    • Add error handling
    • Implement circuit breakers
    • Add retry logic
    • Improve observability
  4. Documentation:

    • Update runbook with lessons learned
    • Document new error patterns
    • Share knowledge with team


Last Updated: 2025-10-26
Owner: SRE Team
Reviewers: Engineering Team
