Worker Timeout
Overview
This runbook provides procedures for diagnosing and resolving Cloudflare Worker timeout issues.
Alert: worker_timeout_high or request_timeout_elevated
Severity: Warning (> 2% timeout rate) or Critical (> 10% timeout rate)
SLO Impact: Affects API latency and availability SLOs
Symptoms and Detection
How to Detect
- Alert: "Worker Timeout Rate Elevated"
- Dashboard: Request duration showing spikes to 30s+ (Worker CPU limit)
- Logs: "TimeoutError" or "CPU time limit exceeded"
- User Impact: 504 Gateway Timeout errors, slow or failed requests
Observable Symptoms
- HTTP 504 status codes
- Requests hitting the Worker CPU time limit (50ms default, 30s max)
- Timeout errors in worker logs
- P99 latency at or near Worker timeout limit
Investigation Steps
1. Identify Timeout Pattern (ETA: 3 minutes)
Determine which workers and endpoints are timing out:
```bash
# Check timeout errors by worker
wrangler tail --status error | grep -i timeout

# Monitor specific worker for timeouts
wrangler tail monotask-agent-worker | grep -i timeout

# Check which endpoints are affected
# Review logs for request paths
```
Questions to Answer:
- Which worker(s) are experiencing timeouts?
- Which endpoints or operations are timing out?
- What is the timeout frequency (% of requests)?
- When did timeouts start occurring?
2. Analyze Request Characteristics (ETA: 5 minutes)
Examine requests that are timing out:
```bash
# Look for common patterns in timeout errors
wrangler tail monotask-agent-worker --format pretty

# Look for:
# - Request size (large payloads)
# - Operation type (complex computations)
# - External API calls
# - Database operations
```
Common Timeout Triggers:
- Large file processing
- Complex AI agent operations
- Multiple external API calls in sequence
- Large database queries or joins
- Infinite loops or recursive calls
3. Check CPU Time Usage (ETA: 3 minutes)
Monitor Worker CPU time consumption:
```bash
# View CPU time metrics in the Cloudflare Dashboard
# Navigate to: Workers > [Worker Name] > Metrics

# Check:
# - Average CPU time per request
# - P95/P99 CPU time
# - CPU time distribution
```
CPU Time Limits:
- Free tier: 10ms CPU time
- Paid tier: 50ms CPU time (default)
- Unbound workers: 30s wall time, unlimited CPU bursts
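On paid plans the per-request CPU limit can be raised for an individual Worker via its `wrangler.toml`; a minimal sketch (the exact maximum depends on your plan, so verify against current Cloudflare documentation before relying on it):

```toml
# wrangler.toml — raise this Worker's CPU time limit (paid plans only)
[limits]
cpu_ms = 300
```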
4. Profile Code Execution (ETA: 10 minutes)
Add performance profiling to identify slow operations:
```typescript
// Add timing markers in code
console.time('operation_name');
await someOperation();
console.timeEnd('operation_name');

// Or use the Performance API
const start = performance.now();
await someOperation();
const duration = performance.now() - start;
console.log(`Operation took ${duration}ms`);
```
Deploy with profiling and monitor logs:
```bash
wrangler deploy monotask-agent-worker --env staging
wrangler tail monotask-agent-worker-staging
```
5. Check External Dependencies (ETA: 5 minutes)
Verify external services aren't causing delays:
```bash
# Test external API response times
curl -w "@curl-format.txt" https://api.github.com/status
curl -w "@curl-format.txt" https://api.anthropic.com/v1/messages

# Check for:
# - API response times > 5s
# - API timeouts or errors
# - Rate limiting responses
```
Common Causes and Resolutions
Cause 1: Synchronous External API Calls
Symptoms:
- Timeout correlates with external API slowness
- Multiple sequential API calls
- No timeout set on API requests
Resolution:
Immediate (10 minutes):
- Add request timeouts:
```typescript
// BEFORE (no timeout):
const response = await fetch(externalAPI);

// AFTER (with timeout):
const TIMEOUT_MS = 5000;
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), TIMEOUT_MS);
try {
  const response = await fetch(externalAPI, {
    signal: controller.signal,
  });
  return response;
} catch (error) {
  if (error.name === 'AbortError') {
    throw new Error('External API timeout');
  }
  throw error;
} finally {
  clearTimeout(timeoutId);
}
```
- Make parallel instead of sequential:
```typescript
// BEFORE (sequential):
const result1 = await fetchAPI1();
const result2 = await fetchAPI2();
const result3 = await fetchAPI3();

// AFTER (parallel):
const [result1, result2, result3] = await Promise.all([
  fetchAPI1(),
  fetchAPI2(),
  fetchAPI3(),
]);
```
- Implement circuit breaker:
```typescript
class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private readonly threshold = 5;
  private readonly resetTimeout = 60000;

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen()) {
      throw new Error('Circuit breaker is open');
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private isOpen(): boolean {
    if (this.failures >= this.threshold) {
      const timeSinceLastFailure = Date.now() - this.lastFailure;
      if (timeSinceLastFailure < this.resetTimeout) {
        return true;
      }
      this.failures = 0; // Reset after timeout
    }
    return false;
  }

  private onSuccess(): void {
    this.failures = 0;
  }

  private onFailure(): void {
    this.failures++;
    this.lastFailure = Date.now();
  }
}
```
Long-term:
- Move long-running API calls to queues
- Implement result caching
- Use webhooks instead of polling
- Add retry with exponential backoff
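The last item can be sketched as a small helper; `maxAttempts` and `baseDelayMs` are illustrative defaults, not values from this codebase:

```typescript
// Retry with exponential backoff and jitter. Delay grows as
// baseDelayMs * 2^attempt, with random jitter to avoid thundering herds.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts - 1) break; // out of attempts
      const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Keep the total retry budget well under the Worker's own timeout, or the retries themselves become the thing that times out.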
Cause 2: CPU-Intensive Computations
Symptoms:
- CPU time approaching 50ms limit
- Complex calculations or transformations
- Large data processing
Resolution:
Immediate (15 minutes):
- Offload to Durable Object:
```typescript
// BEFORE (in Worker):
export default {
  async fetch(request: Request, env: Env) {
    const result = await heavyComputation(); // Times out!
    return new Response(JSON.stringify(result));
  }
};

// AFTER (offload to DO):
export default {
  async fetch(request: Request, env: Env) {
    const id = env.PROCESSOR.idFromName('processor');
    const stub = env.PROCESSOR.get(id);
    return stub.fetch(request);
  }
};

// In Durable Object (has longer timeout):
export class Processor {
  async fetch(request: Request) {
    const result = await this.heavyComputation();
    return new Response(JSON.stringify(result));
  }

  async heavyComputation() {
    // CPU-intensive work here
  }
}
```
- Break into chunks:
```typescript
// Process large datasets in chunks
async function processLargeArray(items: any[]) {
  const CHUNK_SIZE = 100;
  const results = [];
  for (let i = 0; i < items.length; i += CHUNK_SIZE) {
    const chunk = items.slice(i, i + CHUNK_SIZE);
    const chunkResults = await processChunk(chunk);
    results.push(...chunkResults);
    // Optional: yield to the event loop
    await new Promise(resolve => setTimeout(resolve, 0));
  }
  return results;
}
```
- Use streaming for large responses:
```typescript
// Stream large JSON responses
function streamJSON(data: any[]): Response {
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      controller.enqueue(encoder.encode('['));
      for (let i = 0; i < data.length; i++) {
        if (i > 0) controller.enqueue(encoder.encode(','));
        controller.enqueue(encoder.encode(JSON.stringify(data[i])));
        // Yield to avoid blocking
        await new Promise(resolve => setTimeout(resolve, 0));
      }
      controller.enqueue(encoder.encode(']'));
      controller.close();
    },
  });
  return new Response(stream, {
    headers: { 'Content-Type': 'application/json' },
  });
}
```
Long-term:
- Move CPU-intensive work to queues
- Implement background processing
- Use WebAssembly for performance-critical code
- Optimize algorithms
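For the first two items, the consumer side of a queue can look like the sketch below. The message shape mirrors the `AGENT_QUEUE` producer used elsewhere in this runbook; the batch and message types are stubbed here so the example is self-contained (in a real Worker they come from `@cloudflare/workers-types`):

```typescript
// Hypothetical queue consumer for jobs enqueued with AGENT_QUEUE.send().
interface QueueJob {
  jobId: string;
  operation: string;
  data: unknown;
}

interface QueueMessage {
  body: QueueJob;
  ack(): void;   // mark as successfully processed
  retry(): void; // redeliver under the queue's retry policy
}

interface MessageBatch {
  messages: QueueMessage[];
}

// Placeholder for the actual long-running work (hypothetical helper):
// do the work, then persist the result under job.jobId for later pickup.
async function handleJob(job: QueueJob): Promise<void> {
  // ...
}

const consumer = {
  // The runtime invokes queue() with a batch; no client is waiting on a
  // response, so the per-request timeout no longer gates this work.
  async queue(batch: MessageBatch): Promise<void> {
    for (const message of batch.messages) {
      try {
        await handleJob(message.body);
        message.ack();
      } catch {
        message.retry();
      }
    }
  },
};

export default consumer;
```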
Cause 3: Large Database Queries
Symptoms:
- Timeout on endpoints with database access
- Large result sets being fetched
- Complex JOIN queries
Resolution:
Immediate (10 minutes):
- Add pagination:
```typescript
// BEFORE (fetches all):
const tasks = await db.query('SELECT * FROM tasks');

// AFTER (paginated):
const limit = 100;
const offset = (page - 1) * limit;
const tasks = await db.query(
  'SELECT * FROM tasks LIMIT ? OFFSET ?',
  [limit, offset]
);
```
- Add query timeout:
```typescript
async function queryWithTimeout(query: string, params: any[], timeout = 5000) {
  return Promise.race([
    db.query(query, params),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Query timeout')), timeout)
    ),
  ]);
}
```
- Use indexes:
```sql
-- Add missing indexes
CREATE INDEX IF NOT EXISTS idx_tasks_project ON tasks(project_id);
```
Long-term:
- See Database Slow Runbook
- Implement caching layer
- Optimize queries and indexes
- Use database views for complex queries
Cause 4: Memory-Intensive Operations
Symptoms:
- Processing large files
- Building large objects in memory
- String concatenation in loops
Resolution:
Immediate (15 minutes):
- Use streaming for large files:
```typescript
// BEFORE (loads entire file):
const content = await r2Object.text();
const processed = processLargeContent(content);

// AFTER (streaming):
const stream = r2Object.body;
const reader = stream.getReader();
const decoder = new TextDecoder();
const chunks: string[] = [];
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true handles multi-byte characters split across chunks
  const chunk = decoder.decode(value, { stream: true });
  const processed = processChunk(chunk);
  chunks.push(processed);
}
```
- Avoid large object creation:
```typescript
// BEFORE (creates large array):
const results = [];
for (let i = 0; i < 1000000; i++) {
  results.push(process(i));
}

// AFTER (generator):
function* processItems(count: number) {
  for (let i = 0; i < count; i++) {
    yield process(i);
  }
}
// Consume lazily, as a stream or in chunks
```
- Optimize string operations:
```typescript
// BEFORE (slow concatenation):
let result = '';
for (const item of items) {
  result += item.toString();
}

// AFTER (array join):
const result = items.map(item => item.toString()).join('');
```
Long-term:
- Store large files in R2 instead of processing inline
- Use background workers for large operations
- Implement progressive processing
- Add memory usage monitoring
Cause 5: Infinite Loops or Recursion
Symptoms:
- Timeout always at maximum duration
- No response from worker
- High CPU usage
Resolution:
Immediate (5 minutes):
- Add loop guards:
```typescript
// BEFORE (potential infinite loop):
while (condition) {
  // ...
}

// AFTER (with guard):
let iterations = 0;
const MAX_ITERATIONS = 1000;
while (condition && iterations < MAX_ITERATIONS) {
  // ...
  iterations++;
}
if (iterations >= MAX_ITERATIONS) {
  throw new Error('Loop iteration limit exceeded');
}
```
- Add recursion depth limits:
```typescript
function recursiveFunction(data: any, depth = 0) {
  const MAX_DEPTH = 100;
  if (depth > MAX_DEPTH) {
    throw new Error('Max recursion depth exceeded');
  }
  if (baseCase) {
    return result;
  }
  return recursiveFunction(nextData, depth + 1);
}
```
Long-term:
- Add code review checks for loops
- Implement circuit breakers
- Add monitoring for loop iterations
- Use iterative algorithms instead of recursive
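As an example of the last point, a recursive tree walk can be rewritten with an explicit stack, which bounds call-stack depth regardless of tree shape (the `TreeNode` shape is hypothetical):

```typescript
interface TreeNode {
  value: number;
  children: TreeNode[];
}

// Iterative traversal: the explicit stack replaces recursive calls,
// so deep trees cannot exhaust the call stack.
function sumTreeIterative(root: TreeNode): number {
  let total = 0;
  const stack: TreeNode[] = [root];
  while (stack.length > 0) {
    const node = stack.pop()!;
    total += node.value;
    stack.push(...node.children);
  }
  return total;
}
```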
Resolution Procedures
Immediate Mitigation (ETA: 10 minutes)
Step 1: Move to Queue
For operations that can be asynchronous:
```typescript
// BEFORE (synchronous, times out):
export default {
  async fetch(request: Request, env: Env) {
    const result = await longRunningOperation();
    return new Response(JSON.stringify(result));
  }
};

// AFTER (queue for async processing):
export default {
  async fetch(request: Request, env: Env) {
    const jobId = crypto.randomUUID();
    // Queue the operation
    await env.AGENT_QUEUE.send({
      jobId,
      operation: 'long-running',
      data: await request.json(),
    });
    // Return job ID immediately
    return new Response(JSON.stringify({
      jobId,
      status: 'queued',
    }), { status: 202 });
  }
};
```
Step 2: Add Operation Timeout
```typescript
// Wrap operations with timeout
async function withTimeout<T>(
  promise: Promise<T>,
  timeoutMs: number
): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(
        () => reject(new Error(`Operation timeout after ${timeoutMs}ms`)),
        timeoutMs
      )
    ),
  ]);
}

// Usage:
const result = await withTimeout(
  slowOperation(),
  5000 // 5 second timeout
);
```
Step 3: Enable Response Caching
```typescript
// Cache expensive operations
async function getCachedResult(key: string, fn: () => Promise<any>) {
  const cached = await env.CACHE.get(key);
  if (cached) {
    return JSON.parse(cached);
  }
  const result = await fn();
  await env.CACHE.put(key, JSON.stringify(result), {
    expirationTtl: 300, // 5 minutes
  });
  return result;
}
```
Code Optimization (ETA: 30 minutes)
Step 1: Profile and Identify Bottlenecks
```typescript
// Add detailed profiling
const timings: Record<string, number> = {};

function time(label: string) {
  const start = Date.now();
  return {
    end: () => {
      const duration = Date.now() - start;
      timings[label] = duration;
      console.log(`${label}: ${duration}ms`);
    },
  };
}

// Usage:
const t1 = time('database_query');
await db.query(...);
t1.end();

const t2 = time('api_call');
await fetch(...);
t2.end();

console.log('Total timings:', timings);
```
Step 2: Optimize Critical Path
Focus on the slowest operations:
```typescript
// Before optimization
async function processRequest(request: Request) {
  const user = await getUser();          // 100ms
  const data = await fetchData();        // 200ms
  const processed = await process(data); // 500ms ← Bottleneck
  const saved = await save(processed);   // 50ms
  return new Response(JSON.stringify(saved));
}

// After optimization
async function processRequest(request: Request) {
  // Parallelize independent operations
  const [user, data] = await Promise.all([
    getUser(),
    fetchData(),
  ]);
  // Optimize the bottleneck
  const processed = await processOptimized(data); // 50ms
  const saved = await save(processed);
  return new Response(JSON.stringify(saved));
}
```
Step 3: Deploy and Verify
```bash
# Deploy optimized version
wrangler deploy monotask-agent-worker

# Monitor for timeout improvement
wrangler tail monotask-agent-worker | grep -i timeout
```
Verification Steps
1. Timeout Rate Decreased (ETA: 10 minutes)
```bash
# Monitor timeout errors
wrangler tail --status error | grep -i timeout

# Check metrics dashboard
# Target: < 1% timeout rate
```
2. Request Duration Improved (ETA: 5 minutes)
```bash
# Check P95/P99 latency
# Dashboard > Workers > [Worker Name] > Metrics
# Target: P95 < SLO threshold for endpoint type
```
3. CPU Time Reduced (ETA: 5 minutes)
```bash
# Check CPU time metrics
# Dashboard > Workers > [Worker Name] > Metrics
# Target: Average CPU time < 30ms
```
4. Error Rate Normal (ETA: 5 minutes)
```bash
# Check overall error rate
# Should not have increased due to optimization
# Target: < 1% error rate
```
Prevention Measures
1. Add Timeout Monitoring
```typescript
// Track operations approaching timeout
const TIMEOUT_WARNING_THRESHOLD = 25000; // 25s (before 30s limit)

async function trackOperationTime(operation: string, duration: number) {
  if (duration > TIMEOUT_WARNING_THRESHOLD) {
    console.warn('Operation approaching timeout:', {
      operation,
      duration,
      threshold: TIMEOUT_WARNING_THRESHOLD,
    });
    // Send alert
    await alerter.sendAlert({
      severity: 'warning',
      message: `Operation ${operation} took ${duration}ms`,
      context: { operation, duration },
    });
  }
}
```
2. Implement Operation Budgets
```typescript
// Set a time budget for operations
class OperationBudget {
  constructor(private budgetMs: number) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    const start = Date.now();
    const result = await fn();
    const duration = Date.now() - start;
    if (duration > this.budgetMs) {
      console.warn('Budget exceeded:', {
        budget: this.budgetMs,
        actual: duration,
      });
    }
    return result;
  }
}

// Usage:
const budget = new OperationBudget(1000); // 1 second budget
const result = await budget.execute(() => slowOperation());
```
3. Add Performance Tests
```typescript
// test/performance.test.ts
import { describe, it, expect } from 'vitest';

describe('Performance Tests', () => {
  it('should complete within timeout', async () => {
    const start = Date.now();
    await processRequest(mockRequest);
    const duration = Date.now() - start;
    expect(duration).toBeLessThan(5000); // 5 second limit
  });

  it('should handle concurrent requests', async () => {
    const requests = Array(10).fill(mockRequest);
    const start = Date.now();
    await Promise.all(requests.map(r => processRequest(r)));
    const duration = Date.now() - start;
    expect(duration).toBeLessThan(10000); // 10 seconds for 10 requests
  });
});
```
4. Code Review Guidelines
- Require performance benchmarks for new endpoints
- Review all loops for potential infinite conditions
- Check external API calls have timeouts
- Verify database queries are optimized
- Test with production-like data volumes
Escalation Path
When to Escalate
Escalate if:
- Timeout rate > 10% for more than 30 minutes
- Unable to identify cause within 30 minutes
- Optimization attempts unsuccessful
- Timeouts affecting critical user workflows
- Worker consistently hitting CPU limits
Escalation Contacts
Level 1 - Performance Team
- Slack: #performance
- For code optimization assistance
Level 2 - Architecture Team
- For design changes (moving to queues, etc.)
- For Worker configuration changes
Level 3 - Cloudflare Support
- For Worker limit increases
- For platform-level issues
Post-Incident
Required Actions
Performance Audit:
- Profile all timeout-prone endpoints
- Create performance budget
- Implement continuous performance testing
Architecture Review:
- Identify operations better suited for queues
- Plan migration to async processing
- Consider Durable Objects for stateful operations
Monitoring Enhancements:
- Add timeout warnings before hard limit
- Track operation durations
- Alert on performance degradation
Documentation:
- Update performance guidelines
- Document optimization techniques
- Share lessons learned with team
Related Runbooks
- High Error Rate - Timeouts contribute to error rate
- Database Slow - DB timeouts
- Queue Backup - Queue processing timeouts
Last Updated: 2025-10-26
Owner: Performance Team
Reviewers: Backend Team, SRE Team