Queue Backup
Overview
This runbook provides procedures for diagnosing and resolving queue congestion issues in MonoTask Cloudflare Workers.
Alert: queue_backup or queue_processing_slow
Severity: Critical (depth > 500) or Warning (depth > 100)
SLO Impact: Affects queue processing latency SLO (P95 < 5s)
Symptoms and Detection
How to Detect
- Alert: "Queue Backup Detected" or "Queue Processing Slow"
- Dashboard: Queue depth widget shows sustained elevation
- Logs: Increasing queue depth metrics
- User Impact: Delayed task processing, slow async operations
Observable Symptoms
- Queue depth > 100 messages sustained for 5+ minutes
- Messages taking longer to process
- Dead Letter Queue (DLQ) accumulating messages
- High retry rates
- Worker timeout errors
Investigation Steps
1. Identify Affected Queue (ETA: 2 minutes)
Determine which queue(s) are experiencing congestion:
```bash
# Check queue depths across all queues
# Via Cloudflare Dashboard > Queues > Overview
# Or use the wrangler CLI
wrangler queues list

# Check stats for a specific queue
wrangler queues info agent-queue
wrangler queues info task-queue
wrangler queues info github-queue
```

Questions to Answer:
- Which queue has elevated depth?
- What is the current depth vs. normal baseline?
- How fast is the queue growing?
- Are multiple queues affected?
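The growth-rate question translates directly into a drain estimate: the net drain rate is processing rate minus production rate. A small helper to sketch the arithmetic (the function name and rates are illustrative, not part of the codebase):

```typescript
// Estimate minutes until a queue drains, given observed rates
function drainEtaMinutes(
  depth: number,           // current queue depth (messages)
  processedPerMin: number, // consumer throughput
  producedPerMin: number   // producer rate
): number {
  const net = processedPerMin - producedPerMin;
  if (net <= 0) return Infinity; // queue is growing; it will not drain
  return depth / net;
}
```

For example, a 600-message backlog processed at 80/min while producers add 20/min drains in about 10 minutes.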
2. Analyze Queue Consumer Performance (ETA: 3 minutes)
Check consumer worker performance:
```bash
# Tail consumer worker logs
wrangler tail monotask-agent-worker --format pretty

# Monitor for:
# - Processing time per message
# - Error rates
# - Timeout errors
# - Retry attempts
```

Key Metrics to Check:
- Average processing time per message
- Consumer error rate
- Consumer throughput (messages/sec)
- CPU time usage
3. Check Message Characteristics (ETA: 3 minutes)
Examine messages in the queue:
```bash
# View sample messages (if available via the Cloudflare API)
# Look for:
# - Message size (large payloads slow processing)
# - Message types (certain types are slower than others)
# - Message age (old messages may indicate stuck processing)
```

Patterns to Identify:
- Are specific message types causing slowness?
- Are messages abnormally large?
- Are messages being retried repeatedly?
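If per-message sampling isn't available in the dashboard, the consumer can log these characteristics itself using the fields Cloudflare queue messages expose (`id`, `timestamp`, `attempts`, `body`). A sketch; the local `QueueMsg` interface stands in for the real Workers types:

```typescript
// Local stand-in for the Workers Message type (assumption: real code would
// import types from @cloudflare/workers-types)
interface QueueMsg {
  id: string;
  timestamp: Date;  // when the message was enqueued
  attempts: number; // delivery attempts so far
  body: unknown;
}

// Summarize the characteristics the patterns above look for
function summarize(message: QueueMsg) {
  return {
    id: message.id,
    sizeBytes: JSON.stringify(message.body).length,  // large payloads slow processing
    ageMs: Date.now() - message.timestamp.getTime(), // old messages suggest stuck processing
    attempts: message.attempts,                      // > 1 means the message is being retried
  };
}
```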
4. Monitor Consumer Resources (ETA: 2 minutes)
Check if consumer worker has sufficient resources:
```bash
# Check worker metrics in the dashboard:
# - CPU time usage
# - Request duration
# - Active requests
# - Error rate
```

Resource Constraints:
- Worker CPU time limits being hit
- Database connection limits
- External API rate limits
- Memory constraints
5. Examine Dead Letter Queue (ETA: 3 minutes)
Check DLQ for failed messages:
```bash
# View DLQ depth
wrangler queues info agent-dlq

# Sample DLQ messages to identify failure patterns
# Look for:
# - Common error types
# - Retry count exceeded
# - Validation failures
# - External service failures
```

Common Causes and Resolutions
Cause 1: Traffic Spike / Increased Load
Symptoms:
- Queue depth increasing steadily
- Normal processing time per message
- No errors, just high volume
Resolution:
Immediate (5 minutes):
```toml
# Increase consumer batch size to process more messages per invocation
# In wrangler.toml for the consumer worker:
[[queues.consumers]]
queue = "agent-queue"
max_batch_size = 10    # Increase from current value
max_batch_timeout = 30 # Seconds to wait before flushing a partial batch
```

Deploy the updated configuration:

```bash
# Run from the consumer worker's project directory
wrangler deploy
```

Long-term:
- Implement auto-scaling based on queue depth
- Add more consumer workers if needed
- Optimize message processing code
- Consider message batching
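For the message-batching item, producers can group outgoing messages and send each group with the queue binding's `sendBatch` instead of one `send` per message. A chunking helper; the commented usage assumes a hypothetical `QUEUE` binding:

```typescript
// Split an array into fixed-size chunks for batched sends
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Sketch of producer-side usage:
// for (const group of chunk(tasks, 100)) {
//   await env.QUEUE.sendBatch(group.map(body => ({ body })));
// }
```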
Cause 2: Slow Message Processing
Symptoms:
- Normal queue depth but slow processing
- High P95/P99 processing times
- Worker timeout errors
Resolution:
Immediate (10 minutes):
Identify slow operations in code:
```bash
# Look for slow operations in the logs
wrangler tail monotask-agent-worker --format pretty
```

Add performance tracking:
```typescript
// Wrap slow operations with timing
const start = Date.now();
await slowOperation();
console.log(`Operation took ${Date.now() - start}ms`);
```

Optimize identified bottlenecks:
- Cache database queries
- Parallelize independent operations
- Add pagination for large datasets
- Reduce external API calls
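The ad-hoc `Date.now()` timing shown above generalizes to a small helper, so any suspect operation can be instrumented the same way (the `withTiming` name is illustrative):

```typescript
// Time any async operation and log its duration, even on failure
async function withTiming<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    console.log(`${label} took ${Date.now() - start}ms`);
  }
}
```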
Long-term:
- Profile code with performance tools
- Add circuit breakers for external services
- Implement request timeouts
- Optimize database queries
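One of the long-term items above is circuit breakers for external services. The `CircuitBreaker` referenced later in this runbook is not a platform built-in, so here is a minimal sketch of one plausible shape: open after `threshold` consecutive failures, then allow a trial call after `timeout` ms:

```typescript
// Minimal circuit breaker: fail fast while "open", probe after a cooldown
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(private opts: { threshold: number; timeout: number }) {}

  async exec<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.opts.threshold) {
      if (Date.now() - this.openedAt < this.opts.timeout) {
        throw new Error('circuit open'); // fail fast, skip the external call
      }
      this.failures = 0; // half-open: permit one trial call
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.opts.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```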
Cause 3: Database Contention
Symptoms:
- Slow database queries
- D1 query timeouts
- Database connection errors
Resolution:
Immediate (5 minutes):
Check D1 database performance:
```bash
# Monitor D1 metrics in the Cloudflare Dashboard
# Look for slow queries
```

Implement query optimization:

```typescript
// Add indexes for frequently queried fields
// Use prepared statements
// Batch database operations
```

Add connection pooling:

```typescript
// Reuse database connections
// Implement connection limits
```
Long-term:
- Optimize database schema
- Add database read replicas
- Implement caching layer
- Use batch writes
Cause 4: External API Rate Limiting
Symptoms:
- 429 errors from external APIs
- Timeouts calling external services
- Specific message types failing
Resolution:
Immediate (5 minutes):
Implement rate limiting:
```typescript
// Add a rate limiter before external API calls
const rateLimiter = new RateLimiter({
  maxRequests: 100,
  windowMs: 60000, // 1 minute
});

await rateLimiter.acquire();
await callExternalAPI();
```

Add retry with backoff:
```typescript
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.status === 429 && i < maxRetries - 1) {
        await sleep(Math.pow(2, i) * 1000); // Exponential backoff
        continue;
      }
      throw error;
    }
  }
}
```
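The `RateLimiter` used in the rate-limiting snippet above is an application-level helper, not a platform API; a minimal sliding-window sketch of what it might look like:

```typescript
// Sliding-window rate limiter: acquire() resolves once a slot is free
class RateLimiter {
  private timestamps: number[] = [];
  constructor(private opts: { maxRequests: number; windowMs: number }) {}

  async acquire(): Promise<void> {
    for (;;) {
      const now = Date.now();
      // Drop timestamps that have left the window
      this.timestamps = this.timestamps.filter(t => now - t < this.opts.windowMs);
      if (this.timestamps.length < this.opts.maxRequests) {
        this.timestamps.push(now);
        return;
      }
      // Wait until the oldest request expires, then re-check
      const waitMs = this.opts.windowMs - (now - this.timestamps[0]);
      await new Promise(resolve => setTimeout(resolve, waitMs));
    }
  }
}
```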
Long-term:
- Request higher rate limits from API providers
- Implement request queuing
- Add caching for API responses
- Use multiple API keys for load distribution
Cause 5: Poison Messages
Symptoms:
- Specific messages causing worker crashes
- High retry rate
- DLQ accumulating messages
- Same messages retrying repeatedly
Resolution:
Immediate (10 minutes):
Identify poison messages:
```bash
# Check the DLQ for common patterns
wrangler queues info agent-dlq
```

Add message validation:

```typescript
// Validate a message before processing
function isValidMessage(message: QueueMessage): boolean {
  if (!message.body) return false;
  if (message.body.length > MAX_SIZE) return false;
  // Add type-specific validation here
  return true;
}

// In the consumer:
if (!isValidMessage(message)) {
  console.error('Invalid message, skipping:', message.id);
  message.ack(); // Remove from queue
  return;
}
```

Drain the DLQ manually:

```bash
# Reprocess DLQ messages with special handling,
# or delete them if they're truly invalid
```
Long-term:
- Improve message validation at producer
- Add message schema validation
- Implement message sanitization
- Add telemetry for message patterns
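For the schema-validation item, a lightweight runtime check is often enough (a library such as zod is another option). The `TaskMessage` shape below is hypothetical:

```typescript
// Hypothetical message shape for illustration
type TaskMessage = { type: string; taskId: string; payload?: unknown };

// Return the parsed message, or null if the body doesn't match the schema
function parseTaskMessage(body: unknown): TaskMessage | null {
  if (typeof body !== 'object' || body === null) return null;
  const b = body as Record<string, unknown>;
  if (typeof b.type !== 'string' || typeof b.taskId !== 'string') return null;
  return { type: b.type, taskId: b.taskId, payload: b.payload };
}
```

Producers and consumers can share this check, so malformed messages are rejected before they are enqueued rather than after they have poisoned the queue.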
Resolution Procedures
Immediate Mitigation (ETA: 10 minutes)
Step 1: Increase Consumer Capacity
```toml
# Option A: Increase batch size
# In wrangler.toml for the affected worker:
[[queues.consumers]]
queue = "agent-queue"
max_batch_size = 20 # Increase capacity
max_batch_timeout = 60
```

Deploy the changes:

```bash
# Run from the worker's project directory
wrangler deploy
```

Step 2: Pause Message Production (if necessary)
```bash
# Temporarily disable message producers to stop queue growth.
# This gives consumers time to catch up.
# Option: use a feature flag to pause non-critical operations,
# or add rate limiting to message producers.
```

Step 3: Monitor Queue Drain
```bash
# Watch the queue depth decrease
while true; do
  wrangler queues info agent-queue # filter for the backlog line if desired
  sleep 10
done
```

Worker Scaling (ETA: 15 minutes)
If consumer capacity is insufficient:
Option 1: Raise Consumer Concurrency

Cloudflare Queues allows a single consumer worker per queue, but automatically scales concurrent consumer invocations up to max_concurrency. Raise that ceiling rather than deploying a second consumer for the same queue:

```toml
# In wrangler.toml for the consumer worker
[[queues.consumers]]
queue = "agent-queue"
max_concurrency = 10 # Allow more concurrent consumer invocations
```

```bash
wrangler deploy
```

Option 2: Optimize Processing Code
```typescript
// Process messages in parallel
async function processMessages(batch: Message<QueueMessage>[]) {
  // Before: sequential processing
  // for (const message of batch) {
  //   await processMessage(message);
  // }

  // After: parallel processing
  // (Promise.all rejects on the first failure; use Promise.allSettled
  // if failed messages should be acked/retried individually)
  await Promise.all(batch.map(message => processMessage(message)));
}
```

DLQ Processing (ETA: 20 minutes)
Handle messages in Dead Letter Queue:
Step 1: Analyze DLQ Messages
```bash
# Sample the DLQ to understand failure patterns
# Categorize by error type
```

Step 2: Fix Root Cause
```typescript
// Add special handling for known failure cases
async function processDLQMessage(message: QueueMessage) {
  try {
    // Attempt reprocessing with fixes
    await processWithRetry(message);
  } catch (error) {
    // Log for manual investigation
    console.error('DLQ message failed again:', {
      messageId: message.id,
      error: error.message,
      metadata: message.metadata,
    });
    // Archive or discard
  }
}
```

Step 3: Replay DLQ
```bash
# After fixing the root cause, replay DLQ messages
# via a manual script or scheduled job
```

Verification Steps
After applying mitigations:
1. Queue Depth Decreasing (ETA: 5 minutes)
```bash
# Monitor queue depth
watch -n 10 'wrangler queues info agent-queue'
# Target: depth decreasing by at least 10 messages/minute
```

2. Processing Rate Improved (ETA: 5 minutes)

```bash
# Check messages processed per minute
# Dashboard > Queues > agent-queue > Metrics
# Target: processing rate > production rate
```

3. Error Rate Normal (ETA: 5 minutes)

```bash
# Monitor consumer worker error rate
wrangler tail monotask-agent-worker --status error
# Target: < 1% error rate
```

4. DLQ Not Growing (ETA: 5 minutes)

```bash
# Check that DLQ depth is stable or decreasing
wrangler queues info agent-dlq
# Target: DLQ depth not increasing
```

5. SLO Compliance (ETA: ongoing)
- P95 processing latency < 5s
- Queue depth returns to normal (< 50)
- No timeout errors
- Retry rate < 5%
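The retry-rate target can be computed directly from the delivery-attempt counts the consumer sees (an attempt count above 1 means a redelivery); a small helper:

```typescript
// Fraction of messages in a batch that are redeliveries (attempts > 1)
function retryRate(attempts: number[]): number {
  if (attempts.length === 0) return 0;
  const retried = attempts.filter(a => a > 1).length;
  return retried / attempts.length;
}
```

A batch with attempt counts [1, 1, 1, 2] has a retry rate of 0.25, well above the 5% target.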
Prevention Measures
Monitoring Improvements
Add Queue Depth Alerts:
```yaml
# In monitoring/slos.yaml
- name: queue_depth_warning
  condition: queue_depth > 100
  window: 5m
  severity: warning
- name: queue_depth_critical
  condition: queue_depth > 500
  window: 2m
  severity: critical
```

Track Processing Rate:

```typescript
// Add metric tracking in the consumer
await analytics.track('messages_processed', {
  queue: 'agent-queue',
  count: batch.length,
  avgProcessingTime: avgTime,
});
```

Monitor DLQ Growth:

```typescript
// Alert if the DLQ is accumulating messages
if (dlqDepth > 10) {
  await alerter.sendAlert({
    severity: 'warning',
    message: 'DLQ accumulating messages',
    context: { dlqDepth, queue: 'agent-queue' },
  });
}
```
Code Improvements
Implement Circuit Breakers:
```typescript
// For external API calls
const circuitBreaker = new CircuitBreaker({
  threshold: 5,
  timeout: 60000,
});
```

Add Message Prioritization:

```typescript
// Process high-priority messages first
const sorted = batch.sort(
  (a, b) => b.metadata.priority - a.metadata.priority
);
```

Optimize Batch Processing:

```typescript
// Process messages in optimal batch sizes
const optimalBatchSize = calculateOptimalBatch();
```
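`calculateOptimalBatch` above is left undefined; one plausible heuristic fits the batch inside a per-invocation time budget based on observed per-message latency (the budget here is illustrative; Cloudflare Queues caps `max_batch_size` at 100):

```typescript
// Size batches so a full batch fits inside the invocation time budget
function calculateOptimalBatch(
  avgMsPerMessage: number,
  budgetMs = 25000, // illustrative budget, below a 30s processing window
  maxBatch = 100    // Cloudflare Queues' max_batch_size upper bound
): number {
  const fit = Math.floor(budgetMs / Math.max(avgMsPerMessage, 1));
  return Math.max(1, Math.min(fit, maxBatch));
}
```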
Escalation Path
When to Escalate
Escalate if:
- Queue depth > 1000 for more than 30 minutes
- Processing completely stalled (0 messages/min)
- Multiple queues backed up simultaneously
- DLQ growing rapidly (> 100 messages in 10 minutes)
- Unable to identify root cause within 20 minutes
Escalation Contacts
Level 1 - On-Call Engineer
- Slack: #monotask-oncall
- Investigate queue consumer issues
Level 2 - Backend Lead
- Slack: @backend-lead
- For worker optimization and scaling decisions
Level 3 - Infrastructure Team
- For Cloudflare service issues
- For capacity planning and scaling
Post-Incident
Required Actions
Analyze Queue Patterns:
- What caused the backup?
- Were there warning signs?
- How effective was the response?
Update Consumer Configuration:
- Adjust batch sizes based on findings
- Optimize retry strategies
- Improve error handling
Improve Monitoring:
- Add alerts for identified gaps
- Track new metrics discovered during incident
- Update dashboard with queue health widgets
Document Lessons Learned:
- Update this runbook
- Share findings with team
- Create post-mortem document
Related Runbooks
- High Error Rate - For consumer worker errors
- Worker Timeout - For processing timeouts
- Database Slow - For DB-related slowness
Last Updated: 2025-10-26
Owner: SRE Team
Reviewers: Backend Team