# MonoTask Monitoring and Alerting System
This directory contains the monitoring and alerting configuration for the MonoTask Cloudflare Workers infrastructure: dashboard configuration, SLO definitions, and operational runbooks.
## 📁 Directory Structure
```
monitoring/
├── cloudflare-dashboard.json   # Dashboard configuration for Cloudflare Analytics
├── slos.yaml                   # Service Level Indicators and Objectives
└── runbooks/                   # Operational runbooks for incident response
    ├── high-error-rate.md
    ├── queue-backup.md
    ├── database-slow.md
    ├── worker-timeout.md
    └── sandbox-stuck.md
```

## 🎯 Overview
### Monitoring Infrastructure
The monitoring system provides:
- Error Tracking & Alerting: Automatic error categorization, alerting, and notification routing
- Performance Monitoring: Request duration, database queries, external API latency tracking
- Resource Monitoring: CPU, memory, D1/R2/KV operation tracking
- Queue Monitoring: Queue depth, processing time, DLQ, retry rates
- Sandbox Monitoring: Active sandboxes, provision time, timeout tracking
### Key Components
#### 1. Analytics Configuration (`wrangler.toml`)
All workers are configured with:
- Analytics Engine: Custom metrics collection
- Logpush: Structured logs sent to R2 for retention
- Sampling Rate: 10% of requests sampled for detailed metrics
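
As a sketch, the relevant `wrangler.toml` fragment might look like this. The dataset name is an assumption; the `ANALYTICS` binding matches the `env.ANALYTICS.writeDataPoint` usage shown later in this document, and `logpush = true` enables Workers Logpush (the R2 destination is configured separately):

```toml
# Illustrative fragment; the dataset name is an assumption.
logpush = true                  # ship trace logs via Logpush (R2 destination configured separately)

[[analytics_engine_datasets]]
binding = "ANALYTICS"           # matches env.ANALYTICS.writeDataPoint(...) usage
dataset = "monotask_metrics"    # hypothetical dataset name

[vars]
METRICS_SAMPLING_RATE = "0.1"   # 10% sampling
```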
#### 2. Error Alerting System (`packages/cloudflare-workers/monitoring/error-alerter.ts`)
Features:
- Automatic error severity classification (CRITICAL, WARNING, INFO)
- Multi-channel alert routing (Email, Slack, PagerDuty)
- Alert deduplication to prevent alert fatigue
- Contextual error tracking with stack traces
- Configurable alert thresholds per worker
Usage:

```typescript
import { createErrorAlerter } from '@monotask/monitoring';

const alerter = createErrorAlerter(env, 'api-gateway');

const error = new Error('Database connection failed');
const monitoringError = alerter.createErrorContext(error, request, {
  userId: 'user_123',
  endpoint: '/api/tasks',
});

await alerter.sendAlert(monitoringError);
```

#### 3. Performance Tracking (`packages/cloudflare-workers/monitoring/performance-tracker.ts`)
Capabilities:
- Request duration tracking with P50/P95/P99 percentiles
- Database query performance monitoring
- External API latency tracking (GitHub, Claude APIs)
- Queue processing time measurement
- Automatic slow request detection
Usage:

```typescript
import { createPerformanceTracker } from '@monotask/monitoring';

const tracker = createPerformanceTracker(env, 'task-worker');
tracker.startRequest();

// Track database query
await tracker.wrapDbQuery(() => db.query('SELECT * FROM tasks'));

// Track external API
await tracker.wrapExternalApi(() => fetch('https://api.github.com'));

await tracker.endRequest(request, response, requestId);
```

#### 4. Middleware
Error Tracking Middleware:

```typescript
import { createErrorTracker } from '@monotask/monitoring/middleware/error-tracker';

const errorTracker = createErrorTracker(env, {
  workerName: 'api-gateway',
  captureStackTraces: true,
});

try {
  // Your handler code
} catch (error) {
  return await errorTracker.onError(error, request, { requestId });
}
```

Performance Middleware:
```typescript
import { createPerformanceMiddleware } from '@monotask/monitoring/middleware/performance-middleware';

const perfMiddleware = createPerformanceMiddleware(env, {
  workerName: 'api-gateway',
});

return await perfMiddleware.trackPerformance(request, async (tracker) => {
  // Handler receives tracker for custom metrics
  const result = await handleRequest(request, tracker);
  return new Response(JSON.stringify(result));
});
```

## 📊 Dashboard Configuration
The `cloudflare-dashboard.json` file defines a comprehensive monitoring dashboard with the following sections:
### 1. Worker Health
- Uptime Percentage: 24-hour rolling uptime with 99.9% SLO line
- Error Rates: Errors/min by worker and severity
- Active Requests: Current request load gauge
- Request Rate: Requests per second with sparkline
### 2. Queue Metrics
- Queue Depth: Messages in queue by queue name
- Processing Rate: Messages processed per second
- Processing Time Distribution: Histogram of processing durations
- DLQ Count: Dead letter queue message accumulation
- Retry Rate: Percentage of messages being retried
### 3. Performance Metrics
- P50/P95/P99 Latency: Response time percentiles by endpoint
- Database Query Time: Average D1 query duration
- External API Latency: P95 latency for GitHub and Claude APIs
- Cache Hit Rate: KV cache effectiveness
### 4. Resource Usage
- CPU Time: CPU milliseconds consumed by worker
- Memory Usage: Average memory consumption
- D1/R2/KV Operation Counts: Storage operation rates
### 5. Sandbox Metrics
- Active Sandboxes: Current count with capacity alerts
- Provision Time: Histogram of sandbox startup duration
- Timeout Frequency: Sandbox timeout rate over time
- Resource Utilization: Sandbox resource usage percentage
## 🎯 Service Level Objectives (SLOs)

Defined in `slos.yaml`, our SLOs establish performance and reliability targets:
### Availability SLOs
- API Availability: 99.9% uptime (30-day window)
- Database Availability: 99.95% query success rate
### Latency SLOs
- API Gateway: P95 < 200ms
- Task Operations: P95 < 500ms
- Agent Execution: P95 < 30s, P99 < 60s
- Queue Processing: P95 < 5s
- D1 Queries: P95 < 50ms, P99 < 100ms
### Error Rate SLOs
- Overall Error Rate: < 1% (1-hour window)
- Queue Success Rate: > 99% (24-hour window)
### Resource SLOs
- Sandbox Provision Time: P95 < 2s
- Sandbox Timeout Rate: < 2% (24-hour window)
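
As an illustration, one of the latency targets above might be encoded in `slos.yaml` roughly like this (the schema shown here is hypothetical; defer to the actual file):

```yaml
# Hypothetical schema - defer to slos.yaml for the real shape.
slos:
  - name: api_gateway_latency
    sli: request_duration_ms
    objective:
      percentile: p95
      threshold_ms: 200
    window: 30d
```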
### Error Budget
- Monthly Budget: 43.2 minutes downtime (0.1%)
- Burn Rate Alerts: Alert if burning > 10x normal rate
- Budget Exhaustion: Alert when < 10% remaining
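
The budget arithmetic above (43.2 minutes is 0.1% of a 30-day window) can be sketched as a few helpers; the function names are illustrative, not part of the monitoring package:

```typescript
// Monthly error budget for a 99.9% availability SLO over a 30-day window.
const SLO_TARGET = 0.999;
const WINDOW_MINUTES = 30 * 24 * 60; // 43,200 minutes

// Total allowed downtime: 0.1% of the window = 43.2 minutes.
function errorBudgetMinutes(): number {
  return WINDOW_MINUTES * (1 - SLO_TARGET);
}

// Burn rate: how fast budget is being consumed relative to the "even
// burn" pace that would exactly exhaust it at the end of the window.
// A rate > 10 corresponds to the fast-burn alert described above.
function burnRate(downtimeMinutes: number, elapsedMinutes: number): number {
  const expected = errorBudgetMinutes() * (elapsedMinutes / WINDOW_MINUTES);
  return expected === 0 ? 0 : downtimeMinutes / expected;
}

// Remaining budget as a fraction; alert when it drops below 10%.
function budgetRemaining(downtimeMinutes: number): number {
  return Math.max(0, 1 - downtimeMinutes / errorBudgetMinutes());
}
```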
## 📖 Operational Runbooks
Comprehensive step-by-step guides for common incident scenarios:
### 1. High Error Rate (`high-error-rate.md`)
When to use: Error rate > 1% sustained for 5+ minutes
Covers:
- Error type classification and investigation
- Database, code, external API, and rate limiting issues
- Rollback procedures
- Circuit breaker implementation
- Escalation paths
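
The circuit breaker that runbook covers can be sketched minimally as follows; the class name, thresholds, and timings here are illustrative, not the runbook's actual values:

```typescript
// Minimal circuit breaker sketch for wrapping a flaky dependency.
// Trips open after consecutive failures, fails fast during cooldown,
// then allows a retry once the cooldown elapses.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private maxFailures = 5,      // trip after 5 consecutive failures
    private cooldownMs = 30_000,  // retry after 30s
    private now: () => number = Date.now,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.maxFailures &&
        this.now() - this.openedAt < this.cooldownMs) {
      throw new Error('circuit open: failing fast');
    }
    try {
      const result = await fn();
      this.failures = 0; // any success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = this.now();
      throw err;
    }
  }
}
```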
### 2. Queue Backup (`queue-backup.md`)
When to use: Queue depth > 100 messages sustained
Covers:
- Queue congestion detection and analysis
- Consumer performance tuning
- Scaling worker capacity
- DLQ processing
- Poison message handling
### 3. Database Slow Queries (`database-slow.md`)
When to use: P95 query latency > 100ms
Covers:
- Query performance analysis
- Missing index detection and creation
- N+1 query problem resolution
- Query optimization techniques
- Write lock contention handling
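
As an illustration of the N+1 resolution that runbook covers, per-row lookups can be collapsed into a single `IN`-clause query. The `projects` table and the helper name below are assumptions:

```typescript
// N+1 antipattern: one D1 query per task to fetch its project.
//   for (const task of tasks) {
//     await db.prepare('SELECT * FROM projects WHERE id = ?')
//             .bind(task.projectId).first();
//   }
// Batched alternative: build one query with an IN clause.
function batchedProjectQuery(projectIds: string[]): { sql: string; binds: string[] } {
  const ids = [...new Set(projectIds)]; // dedupe before binding
  const placeholders = ids.map(() => '?').join(', ');
  return {
    sql: `SELECT * FROM projects WHERE id IN (${placeholders})`,
    binds: ids, // pass to .bind(...binds) on the prepared statement
  };
}
```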
### 4. Worker Timeout (`worker-timeout.md`)
When to use: Timeout rate > 2% or consistent 504 errors
Covers:
- Timeout pattern identification
- CPU time profiling
- External API timeout handling
- Queue offloading strategies
- Code optimization techniques
### 5. Sandbox Stuck (`sandbox-stuck.md`)
When to use: Active sandboxes > 20 or timeout rate > 5%
Covers:
- Sandbox lifecycle debugging
- Infinite loop detection
- Manual and bulk cleanup procedures
- State reconciliation
- Resource exhaustion handling
## 🚨 Alert Channels
Alerts are routed based on severity:
### Critical Alerts

- Channels: Email + Slack + PagerDuty
- Examples:
  - Error rate > 5%
  - API availability < 99%
  - Queue depth > 500
  - Database errors > 5%
### Warning Alerts

- Channels: Slack
- Examples:
  - Error rate > 1%
  - P95 latency exceeds SLO
  - Queue depth > 100
  - Slow database queries
### Info Alerts

- Channels: Logged only
- Examples:
  - Validation errors
  - Rate limit responses
  - Individual request failures
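
A sketch of the severity-to-channel routing described above (the channel identifiers and function name are illustrative, not the alerter's actual API):

```typescript
type Severity = 'CRITICAL' | 'WARNING' | 'INFO';
type Channel = 'email' | 'slack' | 'pagerduty' | 'log';

// Map an alert's severity to its notification channels.
function channelsFor(severity: Severity): Channel[] {
  switch (severity) {
    case 'CRITICAL':
      return ['email', 'slack', 'pagerduty']; // page someone immediately
    case 'WARNING':
      return ['slack'];                       // visible, but no page
    case 'INFO':
      return ['log'];                         // logged only, no notification
  }
}
```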
## 🔧 Configuration
### Environment Variables

Required in all worker `wrangler.toml` files:
```toml
[vars]
LOG_LEVEL = "info"                  # info, warn, error
METRICS_SAMPLING_RATE = "0.1"       # 10% sampling
SLACK_WEBHOOK_URL = "https://..."   # For Slack alerts
ALERT_EMAIL = "alerts@monotask.dev"
```

### Alert Deduplication
Alerts are deduplicated with a 1-hour cooldown period using KV storage:
```typescript
// Deduplication key format
`alert:{ruleName}:{workerName}`

// Cooldown: 3600 seconds (1 hour)
```

## 📈 Metrics Collection
### Analytics Engine Datasets

All workers write to the `ANALYTICS` binding:
```typescript
env.ANALYTICS.writeDataPoint({
  // Blobs must be strings; numeric fields belong in `doubles`.
  blobs: [requestId, workerName, endpoint, method, String(statusCode)],
  doubles: [durationMs, dbQueryTimeMs, externalApiTimeMs, queueProcessingTimeMs],
  // Analytics Engine currently supports a single index per data point.
  indexes: [workerName],
});
```

### Sampling Strategy
- Default: 10% of requests sampled for detailed metrics
- Always Sampled:
  - Errors (status >= 400)
  - Slow requests (above SLO threshold)
  - Critical operations (agent execution, task state transitions)
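
The sampling rules above amount to a single predicate. A sketch, with the names and the injectable `random` parameter as illustrative choices:

```typescript
// Decide whether a request's metrics are recorded in detail.
// Errors, slow requests, and critical operations bypass the 10% sample.
function shouldSample(opts: {
  statusCode: number;
  durationMs: number;
  sloThresholdMs: number;        // per-endpoint SLO threshold
  isCriticalOperation: boolean;  // e.g. agent execution, task state transition
  samplingRate?: number;         // default 0.1 (10%)
  random?: () => number;         // injectable for testing
}): boolean {
  if (opts.statusCode >= 400) return true;            // always sample errors
  if (opts.durationMs > opts.sloThresholdMs) return true; // always sample slow requests
  if (opts.isCriticalOperation) return true;          // always sample critical ops
  const rate = opts.samplingRate ?? 0.1;
  return (opts.random ?? Math.random)() < rate;       // otherwise ~10% sample
}
```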
### Retention
- Live Metrics: 30 days in Analytics Engine
- Logs: Retained in R2 via Logpush (90 days)
- Aggregated Metrics: Permanent retention in dashboard
## 🎓 Best Practices

### 1. Error Handling
```typescript
// Always use try-catch with error tracking
try {
  await operation();
} catch (error) {
  await errorTracker.trackError(error, request, {
    operation: 'operation_name',
    userId,
    context: additionalData,
  });
  throw error;
}
```

### 2. Performance Tracking
```typescript
// Track critical operations
const tracker = createPerformanceTracker(env, workerName);
tracker.startRequest();

await tracker.wrapDbQuery(() => dbOperation());
await tracker.wrapExternalApi(() => apiCall());

await tracker.endRequest(request, response, requestId);
```

### 3. Custom Metrics
```typescript
// Track business metrics
await tracker.trackCustomMetric(
  'tasks_completed',
  1,
  'count',
  { project_id: projectId, state: 'completed' }
);
```

### 4. Structured Logging
```typescript
// Use structured logs for better querying
console.log(JSON.stringify({
  level: 'info',
  timestamp: Date.now(),
  workerName: 'api-gateway',
  requestId,
  message: 'Task completed',
  metadata: { taskId, duration, state },
}));
```

## 📞 Support and Escalation
### On-Call Rotation
- Slack: #monotask-oncall
- PagerDuty: Automatic escalation for critical alerts
### Escalation Levels
**Level 1 - On-Call Engineer** (15 min response time)
- Initial investigation
- Standard runbook procedures
- Most incidents resolved at this level
**Level 2 - Team Lead** (30 min response time)
- Complex issues requiring architectural decisions
- Cross-team coordination
- Capacity planning
**Level 3 - Engineering Management** (1 hour response time)
- Major outages
- Customer-impacting issues
- Business-critical escalations
## 🔄 Maintenance
### Weekly Tasks
- Review SLO compliance reports
- Check error budget consumption
- Update dashboard for new metrics
- Review and acknowledge alerts
### Monthly Tasks
- Run recovery drill (test backup/restore)
- Review and update runbooks
- Audit slow queries and add indexes
- Optimize monitoring costs
### Quarterly Tasks
- SLO review and adjustment
- Dashboard redesign based on usage
- Alert threshold tuning
- Runbook effectiveness review
## 📚 Additional Resources

- Last Updated: 2025-10-26
- Owner: SRE Team
- Reviewers: Engineering Team, Product Team