
Worker Timeout

Overview

This runbook provides procedures for diagnosing and resolving Cloudflare Worker timeout issues.

Alert: worker_timeout_high or request_timeout_elevated
Severity: Warning (> 2% timeout rate) or Critical (> 10% timeout rate)
SLO Impact: Affects API latency and availability SLOs


Symptoms and Detection

How to Detect

  • Alert: "Worker Timeout Rate Elevated"
  • Dashboard: Request duration spiking to 30s+ (the Worker execution limit)
  • Logs: "TimeoutError" or "CPU time limit exceeded"
  • User Impact: 504 Gateway Timeout errors, slow or failed requests

Observable Symptoms

  • HTTP 504 status codes
  • Requests hitting the Worker CPU time limit (50ms by default on the paid plan; higher limits are configurable)
  • Timeout errors in worker logs
  • P99 latency at or near Worker timeout limit

Investigation Steps

1. Identify Timeout Pattern (ETA: 3 minutes)

Determine which workers and endpoints are timing out:

bash
# Check timeout errors by worker
wrangler tail --status error | grep -i timeout

# Monitor specific worker for timeouts
wrangler tail monotask-agent-worker | grep -i timeout

# Check which endpoints are affected
# Review logs for request paths

Questions to Answer:

  • Which worker(s) are experiencing timeouts?
  • Which endpoints or operations are timing out?
  • What is the timeout frequency (% of requests)?
  • When did timeouts start occurring?

2. Analyze Request Characteristics (ETA: 5 minutes)

Examine requests that are timing out:

bash
# Look for common patterns in timeout errors
wrangler tail monotask-agent-worker --format pretty

# Look for:
# - Request size (large payloads)
# - Operation type (complex computations)
# - External API calls
# - Database operations

Common Timeout Triggers:

  • Large file processing
  • Complex AI agent operations
  • Multiple external API calls in sequence
  • Large database queries or joins
  • Infinite loops or recursive calls

3. Check CPU Time Usage (ETA: 3 minutes)

Monitor Worker CPU time consumption:

bash
# View CPU time metrics in Cloudflare Dashboard
# Navigate to: Workers > [Worker Name] > Metrics

# Check:
# - Average CPU time per request
# - P95/P99 CPU time
# - CPU time distribution

CPU Time Limits:

  • Free tier: 10ms CPU time per request
  • Paid tier: 50ms CPU time per request (default; configurable higher)
  • Unbound usage model: up to 30s duration, billed on wall time rather than a fixed CPU cap
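On paid plans, the per-request CPU budget can also be raised per Worker in wrangler.toml; a sketch (verify the key names against the current Wrangler documentation):

```toml
# wrangler.toml
[limits]
cpu_ms = 100  # raise this Worker's CPU budget above the 50ms default (paid plans)
```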

4. Profile Code Execution (ETA: 10 minutes)

Add performance profiling to identify slow operations:

typescript
// Add timing markers in code
console.time('operation_name');
await someOperation();
console.timeEnd('operation_name');

// Or use performance API
const start = performance.now();
await someOperation();
const duration = performance.now() - start;
console.log(`Operation took ${duration}ms`);

Deploy with profiling and monitor logs:

bash
# Deploy to staging (the worker name comes from wrangler.toml) and watch logs
wrangler deploy --env staging
wrangler tail monotask-agent-worker-staging

5. Check External Dependencies (ETA: 5 minutes)

Verify external services aren't causing delays:

bash
# Test external API response times
curl -w "@curl-format.txt" https://api.github.com/status
curl -w "@curl-format.txt" https://api.anthropic.com/v1/messages

# Check for:
# - API response times > 5s
# - API timeouts or errors
# - Rate limiting responses
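The `@curl-format.txt` referenced above is a local file defining curl's `-w` write-out template; a minimal version using curl's standard timing variables might look like:

```
     time_namelookup:  %{time_namelookup}s\n
        time_connect:  %{time_connect}s\n
     time_appconnect:  %{time_appconnect}s\n
  time_starttransfer:  %{time_starttransfer}s\n
          time_total:  %{time_total}s\n
```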

Common Causes and Resolutions

Cause 1: Synchronous External API Calls

Symptoms:

  • Timeout correlates with external API slowness
  • Multiple sequential API calls
  • No timeout set on API requests

Resolution:

Immediate (10 minutes):

  1. Add request timeouts:
typescript
// BEFORE (no timeout):
const response = await fetch(externalAPI);

// AFTER (with timeout):
const TIMEOUT_MS = 5000;

const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), TIMEOUT_MS);

try {
  const response = await fetch(externalAPI, {
    signal: controller.signal,
  });
  return response;
} catch (error) {
  if (error.name === 'AbortError') {
    throw new Error('External API timeout');
  }
  throw error;
} finally {
  clearTimeout(timeoutId);
}
  2. Run API calls in parallel instead of sequentially:
typescript
// BEFORE (sequential):
const result1 = await fetchAPI1();
const result2 = await fetchAPI2();
const result3 = await fetchAPI3();

// AFTER (parallel):
const [result1, result2, result3] = await Promise.all([
  fetchAPI1(),
  fetchAPI2(),
  fetchAPI3(),
]);
  3. Implement a circuit breaker:
typescript
class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private readonly threshold = 5;
  private readonly resetTimeout = 60000;

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen()) {
      throw new Error('Circuit breaker is open');
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private isOpen(): boolean {
    if (this.failures >= this.threshold) {
      const timeSinceLastFailure = Date.now() - this.lastFailure;
      if (timeSinceLastFailure < this.resetTimeout) {
        return true;
      }
      this.failures = 0; // Reset after timeout
    }
    return false;
  }

  private onSuccess(): void {
    this.failures = 0;
  }

  private onFailure(): void {
    this.failures++;
    this.lastFailure = Date.now();
  }
}

Long-term:

  • Move long-running API calls to queues
  • Implement result caching
  • Use webhooks instead of polling
  • Add retry with exponential backoff
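The last item, retry with exponential backoff, can be sketched as follows; the function name and parameters are illustrative, not an existing helper in the codebase:

```typescript
// Sketch: retry a flaky async operation with exponential backoff and jitter.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break;

      // Exponential backoff: 100ms, 200ms, 400ms... plus up to 50% jitter
      const delay = baseDelayMs * 2 ** attempt * (1 + Math.random() * 0.5);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }

  throw lastError;
}
```

Keep the total retry budget well under the Worker's execution limit: three retries at a 100ms base delay already adds up to roughly a second of waiting in the worst case.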

Cause 2: CPU-Intensive Computations

Symptoms:

  • CPU time approaching 50ms limit
  • Complex calculations or transformations
  • Large data processing

Resolution:

Immediate (15 minutes):

  1. Offload to Durable Object:
typescript
// BEFORE (in Worker):
export default {
  async fetch(request: Request, env: Env) {
    const result = await heavyComputation(); // Times out!
    return new Response(JSON.stringify(result));
  }
};

// AFTER (offload to DO):
export default {
  async fetch(request: Request, env: Env) {
    const id = env.PROCESSOR.idFromName('processor');
    const stub = env.PROCESSOR.get(id);
    return stub.fetch(request);
  }
};

// In Durable Object (has longer timeout):
export class Processor {
  async fetch(request: Request) {
    const result = await this.heavyComputation();
    return new Response(JSON.stringify(result));
  }

  async heavyComputation() {
    // CPU-intensive work here
  }
}
  2. Break into chunks:
typescript
// Process large datasets in chunks
async function processLargeArray(items: any[]) {
  const CHUNK_SIZE = 100;
  const results = [];

  for (let i = 0; i < items.length; i += CHUNK_SIZE) {
    const chunk = items.slice(i, i + CHUNK_SIZE);
    const chunkResults = await processChunk(chunk);
    results.push(...chunkResults);

    // Optional: yield to event loop
    await new Promise(resolve => setTimeout(resolve, 0));
  }

  return results;
}
  3. Use streaming for large responses:
typescript
// Stream large JSON responses
function streamJSON(data: any[]): Response {
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      controller.enqueue(encoder.encode('['));

      for (let i = 0; i < data.length; i++) {
        if (i > 0) controller.enqueue(encoder.encode(','));
        controller.enqueue(encoder.encode(JSON.stringify(data[i])));

        // Yield to avoid blocking
        await new Promise(resolve => setTimeout(resolve, 0));
      }

      controller.enqueue(encoder.encode(']'));
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { 'Content-Type': 'application/json' },
  });
}

Long-term:

  • Move CPU-intensive work to queues
  • Implement background processing
  • Use WebAssembly for performance-critical code
  • Optimize algorithms

Cause 3: Large Database Queries

Symptoms:

  • Timeout on endpoints with database access
  • Large result sets being fetched
  • Complex JOIN queries

Resolution:

Immediate (10 minutes):

  1. Add pagination:
typescript
// BEFORE (fetches all):
const tasks = await db.query('SELECT * FROM tasks');

// AFTER (paginated):
const limit = 100;
const offset = (page - 1) * limit;
const tasks = await db.query(
  'SELECT * FROM tasks LIMIT ? OFFSET ?',
  [limit, offset]
);
  2. Add query timeout:
typescript
// Note: Promise.race rejects the caller on timeout but does NOT cancel
// the underlying query, which keeps running until it finishes.
async function queryWithTimeout(query: string, params: any[], timeout = 5000) {
  return Promise.race([
    db.query(query, params),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Query timeout')), timeout)
    ),
  ]);
}
  3. Use indexes:
sql
-- Add missing indexes
CREATE INDEX IF NOT EXISTS idx_tasks_project ON tasks(project_id);

Long-term:

  • See Database Slow Runbook
  • Implement caching layer
  • Optimize queries and indexes
  • Use database views for complex queries

Cause 4: Memory-Intensive Operations

Symptoms:

  • Processing large files
  • Building large objects in memory
  • String concatenation in loops

Resolution:

Immediate (15 minutes):

  1. Use streaming for large files:
typescript
// BEFORE (loads entire file):
const content = await r2Object.text();
const processed = processLargeContent(content);

// AFTER (streaming):
const stream = r2Object.body;
const reader = stream.getReader();
// A single decoder with { stream: true } handles multi-byte characters
// that are split across chunk boundaries.
const decoder = new TextDecoder();
const chunks: string[] = [];

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const chunk = decoder.decode(value, { stream: true });
  chunks.push(processChunk(chunk));
}
  2. Avoid large object creation:
typescript
// BEFORE (creates large array):
const results = [];
for (let i = 0; i < 1000000; i++) {
  results.push(process(i));
}

// AFTER (generator):
function* processItems(count: number) {
  for (let i = 0; i < count; i++) {
    yield process(i);
  }
}

// Use as stream or in chunks
  3. Optimize string operations:
typescript
// BEFORE (slow concatenation):
let result = '';
for (const item of items) {
  result += item.toString();
}

// AFTER (array join):
const result = items.map(item => item.toString()).join('');

Long-term:

  • Store large files in R2 instead of processing inline
  • Use background workers for large operations
  • Implement progressive processing
  • Add memory usage monitoring

Cause 5: Infinite Loops or Recursion

Symptoms:

  • Timeout always at maximum duration
  • No response from worker
  • High CPU usage

Resolution:

Immediate (5 minutes):

  1. Add loop guards:
typescript
// BEFORE (potential infinite loop):
while (condition) {
  // ...
}

// AFTER (with guard):
let iterations = 0;
const MAX_ITERATIONS = 1000;

while (condition && iterations < MAX_ITERATIONS) {
  // ...
  iterations++;
}

if (iterations >= MAX_ITERATIONS) {
  throw new Error('Loop iteration limit exceeded');
}
  2. Add recursion depth limits:
typescript
function recursiveFunction(data: any, depth = 0) {
  const MAX_DEPTH = 100;

  if (depth > MAX_DEPTH) {
    throw new Error('Max recursion depth exceeded');
  }

  if (baseCase) {
    return result;
  }

  return recursiveFunction(nextData, depth + 1);
}

Long-term:

  • Add code review checks for loops
  • Implement circuit breakers
  • Add monitoring for loop iterations
  • Use iterative algorithms instead of recursive
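The last point, replacing recursion with iteration, usually means carrying an explicit stack. A sketch with a hypothetical tree shape (the node type and function name are illustrative):

```typescript
// Iterative tree traversal using an explicit stack instead of recursion.
// There is no recursion depth to guard; memory use is bounded by the
// widest frontier of unvisited nodes.
interface TreeNode {
  value: number;
  children: TreeNode[];
}

function sumTreeIteratively(root: TreeNode): number {
  const stack: TreeNode[] = [root];
  let total = 0;

  while (stack.length > 0) {
    const node = stack.pop()!;
    total += node.value;
    stack.push(...node.children);
  }

  return total;
}
```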

Resolution Procedures

Immediate Mitigation (ETA: 10 minutes)

Step 1: Move to Queue

For operations that can be asynchronous:

typescript
// BEFORE (synchronous, times out):
export default {
  async fetch(request: Request, env: Env) {
    const result = await longRunningOperation();
    return new Response(JSON.stringify(result));
  }
};

// AFTER (queue for async processing):
export default {
  async fetch(request: Request, env: Env) {
    const jobId = crypto.randomUUID();

    // Queue the operation
    await env.AGENT_QUEUE.send({
      jobId,
      operation: 'long-running',
      data: await request.json(),
    });

    // Return job ID immediately
    return new Response(JSON.stringify({
      jobId,
      status: 'queued',
    }), { status: 202 });
  }
};
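The producer above needs a matching consumer. A minimal sketch of the batch-processing logic, assuming the message shape sent above and the ack/retry semantics of the Cloudflare Queues message API:

```typescript
// Sketch: consumer-side batch processing for jobs queued by the handler above.
// The Job shape mirrors what the producer sends; QueueMessage mirrors the
// ack()/retry() interface that Queues exposes on delivered messages.
interface Job {
  jobId: string;
  operation: string;
  data: unknown;
}

interface QueueMessage {
  body: Job;
  ack(): void;
  retry(): void;
}

async function consumeBatch(
  messages: QueueMessage[],
  handle: (job: Job) => Promise<void>,
): Promise<{ succeeded: number; retried: number }> {
  let succeeded = 0;
  let retried = 0;

  for (const message of messages) {
    try {
      await handle(message.body); // the long-running work, now off the request path
      message.ack();
      succeeded++;
    } catch {
      message.retry(); // failed jobs are redelivered instead of failing the request
      retried++;
    }
  }

  return { succeeded, retried };
}
```

In a real Worker this would be called from the exported `queue()` handler; clients poll for the job status using the `jobId` returned by the producer.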

Step 2: Add Operation Timeout

typescript
// Wrap operations with timeout
async function withTimeout<T>(
  promise: Promise<T>,
  timeoutMs: number
): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(
        () => reject(new Error(`Operation timeout after ${timeoutMs}ms`)),
        timeoutMs
      )
    ),
  ]);
}

// Usage:
const result = await withTimeout(
  slowOperation(),
  5000  // 5 second timeout
);

Step 3: Enable Response Caching

typescript
// Cache expensive operations
async function getCachedResult(key: string, fn: () => Promise<any>) {
  const cached = await env.CACHE.get(key);
  if (cached) {
    return JSON.parse(cached);
  }

  const result = await fn();
  await env.CACHE.put(key, JSON.stringify(result), {
    expirationTtl: 300, // 5 minutes
  });

  return result;
}

Code Optimization (ETA: 30 minutes)

Step 1: Profile and Identify Bottlenecks

typescript
// Add detailed profiling
const timings: Record<string, number> = {};

function time(label: string) {
  const start = Date.now();
  return {
    end: () => {
      const duration = Date.now() - start;
      timings[label] = duration;
      console.log(`${label}: ${duration}ms`);
    },
  };
}

// Usage:
const t1 = time('database_query');
await db.query(...);
t1.end();

const t2 = time('api_call');
await fetch(...);
t2.end();

console.log('Total timings:', timings);

Step 2: Optimize Critical Path

Focus on the slowest operations:

typescript
// Before optimization
async function processRequest(request: Request) {
  const user = await getUser();           // 100ms
  const data = await fetchData();          // 200ms
  const processed = await process(data);   // 500ms  ← Bottleneck
  const saved = await save(processed);     // 50ms

  return new Response(JSON.stringify(saved));
}

// After optimization
async function processRequest(request: Request) {
  // Parallelize independent operations
  const [user, data] = await Promise.all([
    getUser(),
    fetchData(),
  ]);

  // Optimize bottleneck
  const processed = await processOptimized(data);  // 50ms

  const saved = await save(processed);

  return new Response(JSON.stringify(saved));
}

Step 3: Deploy and Verify

bash
# Deploy optimized version (worker name comes from wrangler.toml)
wrangler deploy

# Monitor for timeout improvement
wrangler tail monotask-agent-worker | grep -i timeout

Verification Steps

1. Timeout Rate Decreased (ETA: 10 minutes)

bash
# Monitor timeout errors
wrangler tail --status error | grep -i timeout

# Check metrics dashboard
# Target: < 1% timeout rate
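The warning (> 2%) and critical (> 10%) thresholds from the alert definition can be checked against a measured rate with a small helper; a sketch, not an existing function in the codebase:

```typescript
// Classify a measured timeout rate against the runbook's alert thresholds.
type Severity = 'ok' | 'warning' | 'critical';

function classifyTimeoutRate(timeouts: number, totalRequests: number): Severity {
  if (totalRequests === 0) return 'ok'; // no traffic, nothing to alert on

  const rate = timeouts / totalRequests;
  if (rate > 0.10) return 'critical';
  if (rate > 0.02) return 'warning';
  return 'ok';
}
```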

2. Request Duration Improved (ETA: 5 minutes)

bash
# Check P95/P99 latency
# Dashboard > Workers > [Worker Name] > Metrics

# Target: P95 < SLO threshold for endpoint type

3. CPU Time Reduced (ETA: 5 minutes)

bash
# Check CPU time metrics
# Dashboard > Workers > [Worker Name] > Metrics

# Target: Average CPU time < 30ms

4. Error Rate Normal (ETA: 5 minutes)

bash
# Check overall error rate
# Should not have increased due to optimization

# Target: < 1% error rate

Prevention Measures

1. Add Timeout Monitoring

typescript
// Track operations approaching timeout
const TIMEOUT_WARNING_THRESHOLD = 25000; // 25s (before 30s limit)

async function trackOperationTime(operation: string, duration: number) {
  if (duration > TIMEOUT_WARNING_THRESHOLD) {
    console.warn('Operation approaching timeout:', {
      operation,
      duration,
      threshold: TIMEOUT_WARNING_THRESHOLD,
    });

    // Send alert
    await alerter.sendAlert({
      severity: 'warning',
      message: `Operation ${operation} took ${duration}ms`,
      context: { operation, duration },
    });
  }
}

2. Implement Operation Budgets

typescript
// Set time budget for operations
class OperationBudget {
  constructor(private budgetMs: number) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    const start = Date.now();
    const result = await fn();
    const duration = Date.now() - start;

    if (duration > this.budgetMs) {
      console.warn('Budget exceeded:', {
        budget: this.budgetMs,
        actual: duration,
      });
    }

    return result;
  }
}

// Usage:
const budget = new OperationBudget(1000); // 1 second budget
const result = await budget.execute(() => slowOperation());

3. Add Performance Tests

typescript
// test/performance.test.ts
import { describe, it, expect } from 'vitest';

describe('Performance Tests', () => {
  it('should complete within timeout', async () => {
    const start = Date.now();
    await processRequest(mockRequest);
    const duration = Date.now() - start;

    expect(duration).toBeLessThan(5000); // 5 second limit
  });

  it('should handle concurrent requests', async () => {
    const requests = Array(10).fill(mockRequest);
    const start = Date.now();

    await Promise.all(requests.map(r => processRequest(r)));

    const duration = Date.now() - start;
    expect(duration).toBeLessThan(10000); // 10 seconds for 10 requests
  });
});

4. Code Review Guidelines

  • Require performance benchmarks for new endpoints
  • Review all loops for potential infinite conditions
  • Check external API calls have timeouts
  • Verify database queries are optimized
  • Test with production-like data volumes

Escalation Path

When to Escalate

Escalate if:

  • Timeout rate > 10% for more than 30 minutes
  • Unable to identify cause within 30 minutes
  • Optimization attempts unsuccessful
  • Timeouts affecting critical user workflows
  • Worker consistently hitting CPU limits

Escalation Contacts

Level 1 - Performance Team

  • Slack: #performance
  • For code optimization assistance

Level 2 - Architecture Team

  • For design changes (moving to queues, etc.)
  • For Worker configuration changes

Level 3 - Cloudflare Support

  • For Worker limit increases
  • For platform-level issues

Post-Incident

Required Actions

  1. Performance Audit:

    • Profile all timeout-prone endpoints
    • Create performance budget
    • Implement continuous performance testing
  2. Architecture Review:

    • Identify operations better suited for queues
    • Plan migration to async processing
    • Consider Durable Objects for stateful operations
  3. Monitoring Enhancements:

    • Add timeout warnings before hard limit
    • Track operation durations
    • Alert on performance degradation
  4. Documentation:

    • Update performance guidelines
    • Document optimization techniques
    • Share lessons learned with team


Last Updated: 2025-10-26
Owner: Performance Team
Reviewers: Backend Team, SRE Team
