Worker Timeout
Overview
This runbook provides procedures for diagnosing and resolving Cloudflare Worker timeout issues.
Alert: worker_timeout_high or request_timeout_elevated
Severity: Warning (> 2% timeout rate) or Critical (> 10% timeout rate)
SLO Impact: Affects API latency and availability SLOs
Symptoms and Detection
How to Detect
- Alert: "Worker Timeout Rate Elevated"
- Dashboard: Request duration showing spikes to 30s+ (Worker CPU limit)
- Logs: "TimeoutError" or "CPU time limit exceeded"
- User Impact: 504 Gateway Timeout errors, slow or failed requests
Observable Symptoms
- HTTP 504 status codes
- Requests hitting the Worker CPU time limit (50ms default, 30s max)
- Timeout errors in worker logs
- P99 latency at or near Worker timeout limit
Investigation Steps
1. Identify Timeout Pattern (ETA: 3 minutes)
Determine which workers and endpoints are timing out:
```bash
# Check timeout errors by worker
wrangler tail --status error | grep -i timeout

# Monitor specific worker for timeouts
wrangler tail monotask-agent-worker | grep -i timeout

# Check which endpoints are affected
# Review logs for request paths
```
Questions to Answer:
- Which worker(s) are experiencing timeouts?
- Which endpoints or operations are timing out?
- What is the timeout frequency (% of requests)?
- When did timeouts start occurring?
2. Analyze Request Characteristics (ETA: 5 minutes)
Examine requests that are timing out:
```bash
# Look for common patterns in timeout errors
wrangler tail monotask-agent-worker --format pretty

# Look for:
# - Request size (large payloads)
# - Operation type (complex computations)
# - External API calls
# - Database operations
```
Common Timeout Triggers:
- Large file processing
- Complex AI agent operations
- Multiple external API calls in sequence
- Large database queries or joins
- Infinite loops or recursive calls
3. Check CPU Time Usage (ETA: 3 minutes)
Monitor Worker CPU time consumption:
```bash
# View CPU time metrics in the Cloudflare Dashboard
# Navigate to: Workers > [Worker Name] > Metrics

# Check:
# - Average CPU time per request
# - P95/P99 CPU time
# - CPU time distribution
```
CPU Time Limits:
- Free tier: 10ms CPU time
- Paid tier: 50ms CPU time (default)
- Unbound workers: 30s wall time, unlimited CPU bursts
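On paid plans the per-request CPU limit can be raised for an individual Worker via its `wrangler.toml`; a minimal sketch (the exact maximum depends on your plan, so verify against current Cloudflare documentation before relying on it):

```toml
# wrangler.toml — raise this Worker's CPU time limit (paid plans only)
[limits]
cpu_ms = 300
```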
4. Profile Code Execution (ETA: 10 minutes)
Add performance profiling to identify slow operations:
```typescript
// Add timing markers in code
console.time('operation_name');
await someOperation();
console.timeEnd('operation_name');

// Or use the Performance API
const start = performance.now();
await someOperation();
const duration = performance.now() - start;
console.log(`Operation took ${duration}ms`);
```
Deploy with profiling and monitor logs:
```bash
wrangler deploy monotask-agent-worker --env staging
wrangler tail monotask-agent-worker-staging
```
5. Check External Dependencies (ETA: 5 minutes)
Verify external services aren't causing delays:
```bash
# Test external API response times
curl -w "@curl-format.txt" https://api.github.com/status
curl -w "@curl-format.txt" https://api.anthropic.com/v1/messages

# Check for:
# - API response times > 5s
# - API timeouts or errors
# - Rate limiting responses
```
Common Causes and Resolutions
Cause 1: Synchronous External API Calls
Symptoms:
- Timeout correlates with external API slowness
- Multiple sequential API calls
- No timeout set on API requests
Resolution:
Immediate (10 minutes):
- Add request timeouts:
```typescript
// BEFORE (no timeout):
const response = await fetch(externalAPI);

// AFTER (with timeout):
const TIMEOUT_MS = 5000;
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), TIMEOUT_MS);
try {
  const response = await fetch(externalAPI, {
    signal: controller.signal,
  });
  return response;
} catch (error) {
  if (error.name === 'AbortError') {
    throw new Error('External API timeout');
  }
  throw error;
} finally {
  clearTimeout(timeoutId);
}
```
- Make parallel instead of sequential:
```typescript
// BEFORE (sequential):
const result1 = await fetchAPI1();
const result2 = await fetchAPI2();
const result3 = await fetchAPI3();

// AFTER (parallel):
const [result1, result2, result3] = await Promise.all([
  fetchAPI1(),
  fetchAPI2(),
  fetchAPI3(),
]);
```
- Implement circuit breaker:
```typescript
class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private readonly threshold = 5;
  private readonly resetTimeout = 60000;

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen()) {
      throw new Error('Circuit breaker is open');
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private isOpen(): boolean {
    if (this.failures >= this.threshold) {
      const timeSinceLastFailure = Date.now() - this.lastFailure;
      if (timeSinceLastFailure < this.resetTimeout) {
        return true;
      }
      this.failures = 0; // Reset after timeout
    }
    return false;
  }

  private onSuccess(): void {
    this.failures = 0;
  }

  private onFailure(): void {
    this.failures++;
    this.lastFailure = Date.now();
  }
}
```
Long-term:
- Move long-running API calls to queues
- Implement result caching
- Use webhooks instead of polling
- Add retry with exponential backoff
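The last item can be sketched as a small helper; `maxAttempts` and `baseDelayMs` are illustrative defaults, not values from this codebase:

```typescript
// Retry with exponential backoff and jitter. Delay grows as
// baseDelayMs * 2^attempt, with random jitter to avoid thundering herds.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts - 1) break; // out of attempts
      const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Keep the total retry budget well under the Worker's own timeout, or the retries themselves become the thing that times out.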
Cause 2: CPU-Intensive Computations
Symptoms:
- CPU time approaching 50ms limit
- Complex calculations or transformations
- Large data processing
Resolution:
Immediate (15 minutes):
- Offload to Durable Object:
```typescript
// BEFORE (in Worker):
export default {
  async fetch(request: Request, env: Env) {
    const result = await heavyComputation(); // Times out!
    return new Response(JSON.stringify(result));
  }
};

// AFTER (offload to DO):
export default {
  async fetch(request: Request, env: Env) {
    const id = env.PROCESSOR.idFromName('processor');
    const stub = env.PROCESSOR.get(id);
    return stub.fetch(request);
  }
};

// In Durable Object (has longer timeout):
export class Processor {
  async fetch(request: Request) {
    const result = await this.heavyComputation();
    return new Response(JSON.stringify(result));
  }

  async heavyComputation() {
    // CPU-intensive work here
  }
}
```
- Break into chunks:
```typescript
// Process large datasets in chunks
async function processLargeArray(items: any[]) {
  const CHUNK_SIZE = 100;
  const results = [];
  for (let i = 0; i < items.length; i += CHUNK_SIZE) {
    const chunk = items.slice(i, i + CHUNK_SIZE);
    const chunkResults = await processChunk(chunk);
    results.push(...chunkResults);
    // Optional: yield to the event loop
    await new Promise(resolve => setTimeout(resolve, 0));
  }
  return results;
}
```
- Use streaming for large responses:
```typescript
// Stream large JSON responses
function streamJSON(data: any[]): Response {
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      controller.enqueue(encoder.encode('['));
      for (let i = 0; i < data.length; i++) {
        if (i > 0) controller.enqueue(encoder.encode(','));
        controller.enqueue(encoder.encode(JSON.stringify(data[i])));
        // Yield to avoid blocking
        await new Promise(resolve => setTimeout(resolve, 0));
      }
      controller.enqueue(encoder.encode(']'));
      controller.close();
    },
  });
  return new Response(stream, {
    headers: { 'Content-Type': 'application/json' },
  });
}
```
Long-term:
- Move CPU-intensive work to queues
- Implement background processing
- Use WebAssembly for performance-critical code
- Optimize algorithms
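For the first two items, the consumer side of a queue can look like the sketch below. The message shape mirrors the `AGENT_QUEUE` producer used elsewhere in this runbook; the batch and message types are stubbed here so the example is self-contained (in a real Worker they come from `@cloudflare/workers-types`):

```typescript
// Hypothetical queue consumer for jobs enqueued with AGENT_QUEUE.send().
interface QueueJob {
  jobId: string;
  operation: string;
  data: unknown;
}

interface QueueMessage {
  body: QueueJob;
  ack(): void;   // mark as successfully processed
  retry(): void; // redeliver under the queue's retry policy
}

interface MessageBatch {
  messages: QueueMessage[];
}

// Placeholder for the actual long-running work (hypothetical helper):
// do the work, then persist the result under job.jobId for later pickup.
async function handleJob(job: QueueJob): Promise<void> {
  // ...
}

const consumer = {
  // The runtime invokes queue() with a batch; no client is waiting on a
  // response, so the per-request timeout no longer gates this work.
  async queue(batch: MessageBatch): Promise<void> {
    for (const message of batch.messages) {
      try {
        await handleJob(message.body);
        message.ack();
      } catch {
        message.retry();
      }
    }
  },
};

export default consumer;
```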
Cause 3: Large Database Queries
Symptoms:
- Timeout on endpoints with database access
- Large result sets being fetched
- Complex JOIN queries
Resolution:
Immediate (10 minutes):
- Add pagination:
```typescript
// BEFORE (fetches all):
const tasks = await db.query('SELECT * FROM tasks');

// AFTER (paginated):
const limit = 100;
const offset = (page - 1) * limit;
const tasks = await db.query(
  'SELECT * FROM tasks LIMIT ? OFFSET ?',
  [limit, offset]
);
```
- Add query timeout:
```typescript
async function queryWithTimeout(query: string, params: any[], timeout = 5000) {
  return Promise.race([
    db.query(query, params),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Query timeout')), timeout)
    ),
  ]);
}
```
- Use indexes:
```sql
-- Add missing indexes
CREATE INDEX IF NOT EXISTS idx_tasks_project ON tasks(project_id);
```
Long-term:
- See Database Slow Runbook
- Implement caching layer
- Optimize queries and indexes
- Use database views for complex queries
Cause 4: Memory-Intensive Operations
Symptoms:
- Processing large files
- Building large objects in memory
- String concatenation in loops
Resolution:
Immediate (15 minutes):
- Use streaming for large files:
```typescript
// BEFORE (loads entire file):
const content = await r2Object.text();
const processed = processLargeContent(content);

// AFTER (streaming):
const stream = r2Object.body;
const reader = stream.getReader();
const decoder = new TextDecoder();
const chunks: string[] = [];
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true handles multi-byte characters split across chunks
  const chunk = decoder.decode(value, { stream: true });
  const processed = processChunk(chunk);
  chunks.push(processed);
}
```
- Avoid large object creation:
```typescript
// BEFORE (creates large array):
const results = [];
for (let i = 0; i < 1000000; i++) {
  results.push(process(i));
}

// AFTER (generator):
function* processItems(count: number) {
  for (let i = 0; i < count; i++) {
    yield process(i);
  }
}
// Consume lazily, as a stream or in chunks
```
- Optimize string operations:
```typescript
// BEFORE (slow concatenation):
let result = '';
for (const item of items) {
  result += item.toString();
}

// AFTER (array join):
const result = items.map(item => item.toString()).join('');
```
Long-term:
- Store large files in R2 instead of processing inline
- Use background workers for large operations
- Implement progressive processing
- Add memory usage monitoring
Cause 5: Infinite Loops or Recursion
Symptoms:
- Timeout always at maximum duration
- No response from worker
- High CPU usage
Resolution:
Immediate (5 minutes):
- Add loop guards:
```typescript
// BEFORE (potential infinite loop):
while (condition) {
  // ...
}

// AFTER (with guard):
let iterations = 0;
const MAX_ITERATIONS = 1000;
while (condition && iterations < MAX_ITERATIONS) {
  // ...
  iterations++;
}
if (iterations >= MAX_ITERATIONS) {
  throw new Error('Loop iteration limit exceeded');
}
```
- Add recursion depth limits:
```typescript
function recursiveFunction(data: any, depth = 0) {
  const MAX_DEPTH = 100;
  if (depth > MAX_DEPTH) {
    throw new Error('Max recursion depth exceeded');
  }
  if (baseCase) {
    return result;
  }
  return recursiveFunction(nextData, depth + 1);
}
```
Long-term:
- Add code review checks for loops
- Implement circuit breakers
- Add monitoring for loop iterations
- Use iterative algorithms instead of recursive
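As an example of the last point, a recursive tree walk can be rewritten with an explicit stack, which bounds call-stack depth regardless of tree shape (the `TreeNode` shape is hypothetical):

```typescript
interface TreeNode {
  value: number;
  children: TreeNode[];
}

// Iterative traversal: the explicit stack replaces recursive calls,
// so deep trees cannot exhaust the call stack.
function sumTreeIterative(root: TreeNode): number {
  let total = 0;
  const stack: TreeNode[] = [root];
  while (stack.length > 0) {
    const node = stack.pop()!;
    total += node.value;
    stack.push(...node.children);
  }
  return total;
}
```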
Resolution Procedures
Immediate Mitigation (ETA: 10 minutes)
Step 1: Move to Queue
For operations that can be asynchronous:
```typescript
// BEFORE (synchronous, times out):
export default {
  async fetch(request: Request, env: Env) {
    const result = await longRunningOperation();
    return new Response(JSON.stringify(result));
  }
};

// AFTER (queue for async processing):
export default {
  async fetch(request: Request, env: Env) {
    const jobId = crypto.randomUUID();
    // Queue the operation
    await env.AGENT_QUEUE.send({
      jobId,
      operation: 'long-running',
      data: await request.json(),
    });
    // Return job ID immediately
    return new Response(JSON.stringify({
      jobId,
      status: 'queued',
    }), { status: 202 });
  }
};
```
Step 2: Add Operation Timeout
```typescript
// Wrap operations with timeout
async function withTimeout<T>(
  promise: Promise<T>,
  timeoutMs: number
): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(
        () => reject(new Error(`Operation timeout after ${timeoutMs}ms`)),
        timeoutMs
      )
    ),
  ]);
}

// Usage:
const result = await withTimeout(
  slowOperation(),
  5000 // 5 second timeout
);
```
Step 3: Enable Response Caching
```typescript
// Cache expensive operations
async function getCachedResult(key: string, fn: () => Promise<any>) {
  const cached = await env.CACHE.get(key);
  if (cached) {
    return JSON.parse(cached);
  }
  const result = await fn();
  await env.CACHE.put(key, JSON.stringify(result), {
    expirationTtl: 300, // 5 minutes
  });
  return result;
}
```
Code Optimization (ETA: 30 minutes)
Step 1: Profile and Identify Bottlenecks
```typescript
// Add detailed profiling
const timings: Record<string, number> = {};

function time(label: string) {
  const start = Date.now();
  return {
    end: () => {
      const duration = Date.now() - start;
      timings[label] = duration;
      console.log(`${label}: ${duration}ms`);
    },
  };
}

// Usage:
const t1 = time('database_query');
await db.query(...);
t1.end();

const t2 = time('api_call');
await fetch(...);
t2.end();

console.log('Total timings:', timings);
```
Step 2: Optimize Critical Path
Focus on the slowest operations:
```typescript
// Before optimization
async function processRequest(request: Request) {
  const user = await getUser();          // 100ms
  const data = await fetchData();        // 200ms
  const processed = await process(data); // 500ms ← Bottleneck
  const saved = await save(processed);   // 50ms
  return new Response(JSON.stringify(saved));
}

// After optimization
async function processRequest(request: Request) {
  // Parallelize independent operations
  const [user, data] = await Promise.all([
    getUser(),
    fetchData(),
  ]);
  // Optimize the bottleneck
  const processed = await processOptimized(data); // 50ms
  const saved = await save(processed);
  return new Response(JSON.stringify(saved));
}
```
Step 3: Deploy and Verify
```bash
# Deploy optimized version
wrangler deploy monotask-agent-worker

# Monitor for timeout improvement
wrangler tail monotask-agent-worker | grep -i timeout
```
Verification Steps
1. Timeout Rate Decreased (ETA: 10 minutes)
```bash
# Monitor timeout errors
wrangler tail --status error | grep -i timeout

# Check metrics dashboard
# Target: < 1% timeout rate
```
2. Request Duration Improved (ETA: 5 minutes)
```bash
# Check P95/P99 latency
# Dashboard > Workers > [Worker Name] > Metrics
# Target: P95 < SLO threshold for endpoint type
```
3. CPU Time Reduced (ETA: 5 minutes)
```bash
# Check CPU time metrics
# Dashboard > Workers > [Worker Name] > Metrics
# Target: Average CPU time < 30ms
```
4. Error Rate Normal (ETA: 5 minutes)
```bash
# Check overall error rate
# Should not have increased due to optimization
# Target: < 1% error rate
```
Prevention Measures
1. Add Timeout Monitoring
```typescript
// Track operations approaching timeout
const TIMEOUT_WARNING_THRESHOLD = 25000; // 25s (before 30s limit)

async function trackOperationTime(operation: string, duration: number) {
  if (duration > TIMEOUT_WARNING_THRESHOLD) {
    console.warn('Operation approaching timeout:', {
      operation,
      duration,
      threshold: TIMEOUT_WARNING_THRESHOLD,
    });
    // Send alert
    await alerter.sendAlert({
      severity: 'warning',
      message: `Operation ${operation} took ${duration}ms`,
      context: { operation, duration },
    });
  }
}
```
2. Implement Operation Budgets
```typescript
// Set a time budget for operations
class OperationBudget {
  constructor(private budgetMs: number) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    const start = Date.now();
    const result = await fn();
    const duration = Date.now() - start;
    if (duration > this.budgetMs) {
      console.warn('Budget exceeded:', {
        budget: this.budgetMs,
        actual: duration,
      });
    }
    return result;
  }
}

// Usage:
const budget = new OperationBudget(1000); // 1 second budget
const result = await budget.execute(() => slowOperation());
```
3. Add Performance Tests
```typescript
// test/performance.test.ts
import { describe, it, expect } from 'vitest';

describe('Performance Tests', () => {
  it('should complete within timeout', async () => {
    const start = Date.now();
    await processRequest(mockRequest);
    const duration = Date.now() - start;
    expect(duration).toBeLessThan(5000); // 5 second limit
  });

  it('should handle concurrent requests', async () => {
    const requests = Array(10).fill(mockRequest);
    const start = Date.now();
    await Promise.all(requests.map(r => processRequest(r)));
    const duration = Date.now() - start;
    expect(duration).toBeLessThan(10000); // 10 seconds for 10 requests
  });
});
```
4. Code Review Guidelines
- Require performance benchmarks for new endpoints
- Review all loops for potential infinite conditions
- Check external API calls have timeouts
- Verify database queries are optimized
- Test with production-like data volumes
Escalation Path
When to Escalate
Escalate if:
- Timeout rate > 10% for more than 30 minutes
- Unable to identify cause within 30 minutes
- Optimization attempts unsuccessful
- Timeouts affecting critical user workflows
- Worker consistently hitting CPU limits
Escalation Contacts
Level 1 - Performance Team
- Slack: #performance
- For code optimization assistance
Level 2 - Architecture Team
- For design changes (moving to queues, etc.)
- For Worker configuration changes
Level 3 - Cloudflare Support
- For Worker limit increases
- For platform-level issues
Post-Incident
Required Actions
Performance Audit:
- Profile all timeout-prone endpoints
- Create performance budget
- Implement continuous performance testing
Architecture Review:
- Identify operations better suited for queues
- Plan migration to async processing
- Consider Durable Objects for stateful operations
Monitoring Enhancements:
- Add timeout warnings before hard limit
- Track operation durations
- Alert on performance degradation
Documentation:
- Update performance guidelines
- Document optimization techniques
- Share lessons learned with team
Related Runbooks
- High Error Rate - Timeouts contribute to error rate
- Database Slow - DB timeouts
- Queue Backup - Queue processing timeouts
Last Updated: 2025-10-26
Owner: Performance Team
Reviewers: Backend Team, SRE Team