
Queue Backup

Overview

This runbook provides procedures for diagnosing and resolving queue congestion issues in MonoTask Cloudflare Workers.

Alert: queue_backup or queue_processing_slow
Severity: Critical (depth > 500) or Warning (depth > 100)
SLO Impact: Affects queue processing latency SLO (P95 < 5s)


Symptoms and Detection

How to Detect

  • Alert: "Queue Backup Detected" or "Queue Processing Slow"
  • Dashboard: Queue depth widget shows sustained elevation
  • Logs: Increasing queue depth metrics
  • User Impact: Delayed task processing, slow async operations

Observable Symptoms

  • Queue depth > 100 messages sustained for 5+ minutes
  • Messages taking longer to process
  • Dead Letter Queue (DLQ) accumulating messages
  • High retry rates
  • Worker timeout errors

Investigation Steps

1. Identify Affected Queue (ETA: 2 minutes)

Determine which queue(s) are experiencing congestion:

bash
# Check queue depths across all queues
# Via Cloudflare Dashboard > Queues > Overview

# Or use wrangler CLI
wrangler queues list

# Check specific queue stats
wrangler queues get agent-queue
wrangler queues get task-queue
wrangler queues get github-queue

Questions to Answer:

  • Which queue has elevated depth?
  • What is the current depth vs. normal baseline?
  • How fast is the queue growing?
  • Are multiple queues affected?
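To answer the growth-rate question, depth samples taken at intervals can be turned into a messages-per-minute figure. A sketch (the sampling source, e.g. parsed `wrangler queues get` output, is an assumption):

```typescript
// Sketch: estimate net queue growth from periodic depth samples (e.g. parsed
// from repeated `wrangler queues get` calls; the sampling source is an assumption).

interface DepthSample {
  timestampMs: number;
  depth: number;
}

// Net change in messages per minute across the sampled window.
// Positive => producers are outpacing consumers.
function growthRatePerMinute(samples: DepthSample[]): number {
  if (samples.length < 2) return 0;
  const first = samples[0];
  const last = samples[samples.length - 1];
  const elapsedMin = (last.timestampMs - first.timestampMs) / 60_000;
  if (elapsedMin <= 0) return 0;
  return (last.depth - first.depth) / elapsedMin;
}

// Example: depth rose from 120 to 180 over 3 minutes => +20 msgs/min.
const rate = growthRatePerMinute([
  { timestampMs: 0, depth: 120 },
  { timestampMs: 180_000, depth: 180 },
]);
```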

2. Analyze Queue Consumer Performance (ETA: 3 minutes)

Check consumer worker performance:

bash
# Tail consumer worker logs
wrangler tail monotask-agent-worker --format pretty

# Monitor for:
# - Processing time per message
# - Error rates
# - Timeout errors
# - Retry attempts

Key Metrics to Check:

  • Average processing time per message
  • Consumer error rate
  • Consumer throughput (messages/sec)
  • CPU time usage

3. Check Message Characteristics (ETA: 3 minutes)

Examine messages in the queue:

bash
# View sample messages (if available via Cloudflare API)
# Look for:
# - Message size (large payloads slow processing)
# - Message types (certain types slower than others)
# - Message age (old messages may indicate stuck processing)

Patterns to Identify:

  • Are specific message types causing slowness?
  • Are messages abnormally large?
  • Are messages being retried repeatedly?

4. Monitor Consumer Resources (ETA: 2 minutes)

Check if consumer worker has sufficient resources:

bash
# Check worker metrics in dashboard
# - CPU time usage
# - Request duration
# - Active requests
# - Error rate

Resource Constraints:

  • Worker CPU time limits being hit
  • Database connection limits
  • External API rate limits
  • Memory constraints

5. Examine Dead Letter Queue (ETA: 3 minutes)

Check DLQ for failed messages:

bash
# View DLQ depth
wrangler queues get agent-dlq

# Sample DLQ messages to identify failure patterns
# Look for:
# - Common error types
# - Retry count exceeded
# - Validation failures
# - External service failures

Common Causes and Resolutions

Cause 1: Traffic Spike / Increased Load

Symptoms:

  • Queue depth increasing steadily
  • Normal processing time per message
  • No errors, just high volume

Resolution:

Immediate (5 minutes):

bash
# Increase consumer batch size to process more messages per invocation
# Edit wrangler.toml for consumer worker:
[[queues.consumers]]
queue = "agent-queue"
max_batch_size = 10  # Increase from current value
max_batch_timeout = 5  # Flush partial batches sooner (allowed range: 0-30 seconds)

Deploy updated configuration:

bash
wrangler deploy  # Worker name is taken from the consumer's wrangler.toml

Long-term:

  • Implement auto-scaling based on queue depth
  • Add more consumer workers if needed
  • Optimize message processing code
  • Consider message batching
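The auto-scaling idea can be approximated by deriving a batch size from observed depth, reusing the runbook's own alert thresholds. A sketch (batch size is static wrangler.toml configuration, so applying the result still means a config change and redeploy):

```typescript
// Sketch: pick a max_batch_size from observed queue depth, using this
// runbook's alert thresholds (100 = warning, 500 = critical) and clamping
// to the Queues batch ceiling of 100.
function batchSizeForDepth(depth: number, base = 10): number {
  if (depth > 500) return 100;                     // critical: largest allowed batch
  if (depth > 100) return Math.min(100, base * 5); // warning: scale up
  return base;                                     // normal operation
}
```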

Cause 2: Slow Message Processing

Symptoms:

  • Normal queue depth but slow processing
  • High P95/P99 processing times
  • Worker timeout errors

Resolution:

Immediate (10 minutes):

  1. Identify slow operations in code:

    bash
    wrangler tail monotask-agent-worker --format pretty
    # Look for slow operations in logs
  2. Add performance tracking:

    typescript
    // Wrap slow operations with timing
    const start = Date.now();
    await slowOperation();
    console.log(`Operation took ${Date.now() - start}ms`);
  3. Optimize identified bottlenecks:

    • Cache database queries
    • Parallelize independent operations
    • Add pagination for large datasets
    • Reduce external API calls

Long-term:

  • Profile code with performance tools
  • Add circuit breakers for external services
  • Implement request timeouts
  • Optimize database queries

Cause 3: Database Contention

Symptoms:

  • Slow database queries
  • D1 query timeouts
  • Database connection errors

Resolution:

Immediate (5 minutes):

  1. Check D1 database performance:

    bash
    # Monitor D1 metrics in Cloudflare Dashboard
    # Look for slow queries
  2. Implement query optimization:

    typescript
    // Add indexes for frequently queried fields
    // Use prepared statements
    // Batch database operations
  3. Add connection pooling:

    typescript
    // Reuse database connections
    // Implement connection limits

Long-term:

  • Optimize database schema
  • Add database read replicas
  • Implement caching layer
  • Use batch writes
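Batch writes to D1 can go through db.batch() with statements grouped into bounded chunks; a sketch (the `env.DB` binding, table name, and chunk size of 50 are assumptions):

```typescript
// Hedged sketch: group INSERTs into bounded chunks and submit each chunk with
// D1's db.batch(), which runs the statements in a single round trip.
// The `env.DB` binding, table name, and chunk size of 50 are assumptions.

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Illustrative consumer usage (requires a D1 binding, so shown as a comment):
// for (const group of chunk(rows, 50)) {
//   await env.DB.batch(group.map(r =>
//     env.DB.prepare("INSERT INTO tasks (id, payload) VALUES (?, ?)")
//       .bind(r.id, r.payload)
//   ));
// }
```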

Cause 4: External API Rate Limiting

Symptoms:

  • 429 errors from external APIs
  • Timeouts calling external services
  • Specific message types failing

Resolution:

Immediate (5 minutes):

  1. Implement rate limiting:

    typescript
    // Add rate limiter before external API calls
    const rateLimiter = new RateLimiter({
      maxRequests: 100,
      windowMs: 60000, // 1 minute
    });
    
    await rateLimiter.acquire();
    await callExternalAPI();
  2. Add retry with backoff:

    typescript
    // Simple exponential backoff for 429s; sleep helper included
    const sleep = (ms: number) => new Promise<void>(res => setTimeout(res, ms));

    async function retryWithBackoff<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
      for (let i = 0; i < maxRetries; i++) {
        try {
          return await fn();
        } catch (error: any) {
          if (error.status === 429 && i < maxRetries - 1) {
            await sleep(Math.pow(2, i) * 1000); // Exponential backoff: 1s, 2s, 4s
            continue;
          }
          throw error;
        }
      }
      throw new Error('unreachable'); // satisfies the return type
    }
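The RateLimiter used above is not a built-in; a minimal sliding-window sketch of one possible implementation:

```typescript
// Minimal sliding-window limiter matching the usage above. Timestamps of
// recent requests are kept; a request is admitted only while fewer than
// maxRequests fall inside the window.
class RateLimiter {
  private timestamps: number[] = [];

  constructor(private opts: { maxRequests: number; windowMs: number }) {}

  // Admit and record the request if under the limit; `now` is injectable for testing.
  tryAcquire(now: number = Date.now()): boolean {
    this.timestamps = this.timestamps.filter(t => now - t < this.opts.windowMs);
    if (this.timestamps.length >= this.opts.maxRequests) return false;
    this.timestamps.push(now);
    return true;
  }

  // Awaitable form matching `await rateLimiter.acquire()` above.
  async acquire(): Promise<void> {
    while (!this.tryAcquire()) {
      await new Promise(res => setTimeout(res, 100)); // poll for a free slot
    }
  }
}
```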

Long-term:

  • Request higher rate limits from API providers
  • Implement request queuing
  • Add caching for API responses
  • Use multiple API keys for load distribution

Cause 5: Poison Messages

Symptoms:

  • Specific messages causing worker crashes
  • High retry rate
  • DLQ accumulating messages
  • Same messages retrying repeatedly

Resolution:

Immediate (10 minutes):

  1. Identify poison messages:

    bash
    # Check DLQ for common patterns
    wrangler queues get agent-dlq
  2. Add message validation:

    typescript
    const MAX_SIZE = 128 * 1024; // Queues per-message payload limit

    // Validate message before processing (assumes a string body)
    function isValidMessage(message: QueueMessage): boolean {
      if (!message.body) return false;
      if (message.body.length > MAX_SIZE) return false;
      // Add message-type-specific validation here
      return true;
    }
    
    // In consumer:
    if (!isValidMessage(message)) {
      console.error('Invalid message, skipping:', message.id);
      await message.ack(); // Remove from queue
      return;
    }
  3. Drain DLQ manually:

    bash
    # Process DLQ messages with special handling
    # Or delete if they're truly invalid

Long-term:

  • Improve message validation at producer
  • Add message schema validation
  • Implement message sanitization
  • Add telemetry for message patterns

Resolution Procedures

Immediate Mitigation (ETA: 10 minutes)

Step 1: Increase Consumer Capacity

bash
# Option A: Increase batch size
# Edit wrangler.toml for affected worker
[[queues.consumers]]
queue = "agent-queue"
max_batch_size = 20  # Increase capacity (limit: 100)
max_batch_timeout = 30  # Maximum allowed wait for a full batch (limit: 30s)

# Deploy changes
wrangler deploy

Step 2: Pause Message Production (if necessary)

bash
# Temporarily disable message producers to stop queue growth
# This gives consumers time to catch up

# Option: Use feature flag to pause non-critical operations
# Or: Add rate limiting to message producers

Step 3: Monitor Queue Drain

bash
# Watch queue depth decrease
while true; do
  wrangler queues get agent-queue | grep "Approximate message count"
  sleep 10
done

Worker Scaling (ETA: 15 minutes)

If consumer capacity is insufficient:

Option 1: Raise Consumer Concurrency

A queue has a single consumer worker, but Cloudflare auto-scales its concurrency; raising the ceiling lets more invocations run in parallel:

bash
# In the consumer worker's wrangler.toml:
#   [[queues.consumers]]
#   queue = "agent-queue"
#   max_concurrency = 10  # Allow up to 10 concurrent consumer invocations

# Deploy the updated consumer
wrangler deploy

Option 2: Optimize Processing Code

typescript
// Process messages in parallel
async function processMessages(batch: Message<QueueMessage>[]) {
  // Before: Sequential processing
  // for (const message of batch) {
  //   await processMessage(message);
  // }

  // After: Parallel processing
  await Promise.all(
    batch.map(message => processMessage(message))
  );
}
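One caveat with Promise.all: a single rejection rejects the whole batch, which can force every message to retry. A sketch using Promise.allSettled to ack or retry per message instead (ack() and retry() are the per-message controls in Workers Queues; the handler shape is an assumption):

```typescript
// Sketch: settle every handler, then ack successes and retry failures
// individually so one bad message does not re-queue the whole batch.
interface QueueLikeMessage<T> {
  body: T;
  ack(): void;
  retry(): void;
}

async function processBatch<T>(
  batch: QueueLikeMessage<T>[],
  handler: (body: T) => Promise<void>,
): Promise<{ acked: number; retried: number }> {
  const results = await Promise.allSettled(batch.map(m => handler(m.body)));
  let acked = 0;
  let retried = 0;
  results.forEach((r, i) => {
    if (r.status === "fulfilled") {
      batch[i].ack();
      acked++;
    } else {
      batch[i].retry();
      retried++;
    }
  });
  return { acked, retried };
}
```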

DLQ Processing (ETA: 20 minutes)

Handle messages in Dead Letter Queue:

Step 1: Analyze DLQ Messages

bash
# Sample DLQ to understand failure patterns
# Categorize by error type

Step 2: Fix Root Cause

typescript
// Add special handling for known failure cases
async function processDLQMessage(message: QueueMessage) {
  try {
    // Attempt reprocessing with fixes
    await processWithRetry(message);
  } catch (error) {
    // Log for manual investigation
    console.error('DLQ message failed again:', {
      messageId: message.id,
      error: error.message,
      metadata: message.metadata,
    });
    // Archive or discard
  }
}

Step 3: Replay DLQ

bash
# After fixing root cause, replay DLQ messages
# Manual script or scheduled job to reprocess
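A replay job can re-send DLQ bodies to the main queue through a producer binding's send(); a sketch with the send and failure handling injected so it can be tested in isolation (the binding name and body shape are assumptions):

```typescript
// Sketch: replay DLQ bodies to the main queue one at a time, keeping any
// message that fails again for manual inspection. In a worker, `send` would
// be a producer binding's send(), e.g. body => env.AGENT_QUEUE.send(body)
// (the AGENT_QUEUE binding name is an assumption).
async function replayDLQ<T>(
  bodies: T[],
  send: (body: T) => Promise<void>,
  onFailure: (body: T, err: unknown) => void,
): Promise<number> {
  let replayed = 0;
  for (const body of bodies) {
    try {
      await send(body);
      replayed++;
    } catch (err) {
      onFailure(body, err); // keep for manual investigation, do not drop
    }
  }
  return replayed;
}
```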

Verification Steps

After applying mitigations:

1. Queue Depth Decreasing (ETA: 5 minutes)

bash
# Monitor queue depth
watch -n 10 'wrangler queues get agent-queue'

# Target: Depth decreasing by at least 10 messages/minute

2. Processing Rate Improved (ETA: 5 minutes)

bash
# Check messages processed per minute
# Dashboard > Queues > agent-queue > Metrics

# Target: Processing rate > production rate

3. Error Rate Normal (ETA: 5 minutes)

bash
# Monitor consumer worker error rate
wrangler tail monotask-agent-worker --status error

# Target: < 1% error rate

4. DLQ Not Growing (ETA: 5 minutes)

bash
# Check DLQ depth stable or decreasing
wrangler queues get agent-dlq

# Target: DLQ depth not increasing

5. SLO Compliance (ETA: ongoing)

  • P95 processing latency < 5s
  • Queue depth returns to normal (< 50)
  • No timeout errors
  • Retry rate < 5%

Prevention Measures

Monitoring Improvements

  1. Add Queue Depth Alerts:

    yaml
    # In monitoring/slos.yaml
    - name: queue_depth_warning
      condition: queue_depth > 100
      window: 5m
      severity: warning
    
    - name: queue_depth_critical
      condition: queue_depth > 500
      window: 2m
      severity: critical
  2. Track Processing Rate:

    typescript
    // Add metric tracking in consumer
    await analytics.track('messages_processed', {
      queue: 'agent-queue',
      count: batch.length,
      avgProcessingTime: avgTime,
    });
  3. Monitor DLQ Growth:

    typescript
    // Alert if DLQ accumulating
    if (dlqDepth > 10) {
      await alerter.sendAlert({
        severity: 'warning',
        message: 'DLQ accumulating messages',
        context: { dlqDepth, queue: 'agent-queue' },
      });
    }

Code Improvements

  1. Implement Circuit Breakers:

    typescript
    // For external API calls
    const circuitBreaker = new CircuitBreaker({
      threshold: 5,
      timeout: 60000,
    });
  2. Add Message Prioritization:

    typescript
    // Process high-priority messages first
    const sorted = batch.sort((a, b) =>
      b.metadata.priority - a.metadata.priority
    );
  3. Optimize Batch Processing:

    typescript
    // Process messages in optimal batch sizes
    const optimalBatchSize = calculateOptimalBatch();
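calculateOptimalBatch() above is hypothetical; one way to derive such a value is to fit the batch size to a per-invocation CPU budget given the observed per-message cost. A sketch (the 30s budget and 20% headroom are assumptions):

```typescript
// Sketch: size the batch so its total CPU cost fits the invocation budget,
// with 20% headroom, clamped to the Queues batch ceiling of 100.
function calculateOptimalBatch(
  avgMsgCpuMs: number,
  cpuBudgetMs = 30_000, // assumed per-invocation CPU allowance
  maxBatchSize = 100,   // Queues batch ceiling
): number {
  if (avgMsgCpuMs <= 0) return maxBatchSize;
  const fit = Math.floor((cpuBudgetMs * 0.8) / avgMsgCpuMs); // keep 20% headroom
  return Math.max(1, Math.min(maxBatchSize, fit));
}
```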

Escalation Path

When to Escalate

Escalate if:

  • Queue depth > 1000 for more than 30 minutes
  • Processing completely stalled (0 messages/min)
  • Multiple queues backed up simultaneously
  • DLQ growing rapidly (> 100 messages in 10 minutes)
  • Unable to identify root cause within 20 minutes

Escalation Contacts

Level 1 - On-Call Engineer

  • Slack: #monotask-oncall
  • Investigate queue consumer issues

Level 2 - Backend Lead

  • Slack: @backend-lead
  • For worker optimization and scaling decisions

Level 3 - Infrastructure Team

  • For Cloudflare service issues
  • For capacity planning and scaling

Post-Incident

Required Actions

  1. Analyze Queue Patterns:

    • What caused the backup?
    • Were there warning signs?
    • How effective was the response?
  2. Update Consumer Configuration:

    • Adjust batch sizes based on findings
    • Optimize retry strategies
    • Improve error handling
  3. Improve Monitoring:

    • Add alerts for identified gaps
    • Track new metrics discovered during incident
    • Update dashboard with queue health widgets
  4. Document Lessons Learned:

    • Update this runbook
    • Share findings with team
    • Create post-mortem document


Last Updated: 2025-10-26
Owner: SRE Team
Reviewers: Backend Team
