Database Slow

Overview

This runbook provides procedures for diagnosing and resolving slow D1 database query performance issues.

Alert: slow_database_queries or d1_query_latency_high
Severity: Warning (P95 > 100ms) or Critical (P95 > 500ms)
SLO Impact: Affects database query latency SLO (P95 < 50ms)


Symptoms and Detection

How to Detect

  • Alert: "Slow Database Queries Detected"
  • Dashboard: D1 query latency widget shows elevation
  • Logs: Database query warnings in worker logs
  • User Impact: Slow API responses, timeouts

Observable Symptoms

  • P95 database query time > 100ms
  • Database timeout errors
  • Increased API response times
  • Worker timeout errors during database operations

Investigation Steps

1. Identify Slow Queries (ETA: 3 minutes)

Determine which queries are slow:

```bash
# Check D1 database metrics
# Cloudflare Dashboard > D1 > monotask-production > Metrics

# Look for:
# - Query execution time trends
# - Query error rates
# - Connection pool usage
```

Check Worker Logs for Query Timing:

```bash
# Tail worker logs and filter for database operations
wrangler tail monotask-api-gateway --format pretty | grep "db_query"

# Look for log entries showing:
# - Query duration > 100ms
# - Specific SQL statements
# - Table names being queried
```

2. Analyze Query Patterns (ETA: 5 minutes)

Identify problematic query patterns:

```sql
-- Common slow query patterns:

-- 1. Missing indexes (full table scans)
EXPLAIN QUERY PLAN
SELECT * FROM tasks WHERE project_id = ?;

-- 2. N+1 queries (multiple queries in a loop)
-- Look for repeated similar queries in logs

-- 3. Large result sets without pagination
SELECT * FROM tasks;  -- No LIMIT

-- 4. Complex joins on large tables
SELECT * FROM tasks t
JOIN validations v ON t.id = v.task_id
JOIN evidence e ON v.id = e.validation_id;
```

3. Check Database Statistics (ETA: 3 minutes)

```bash
# Check database size and table statistics
wrangler d1 execute monotask-production --command "
SELECT
  name,
  (SELECT COUNT(*) FROM pragma_table_info(name)) as column_count
FROM sqlite_master
WHERE type='table';
"

# Check index usage
wrangler d1 execute monotask-production --command "
SELECT name, tbl_name FROM sqlite_master WHERE type='index';
"
```

4. Monitor Database Load (ETA: 2 minutes)

```bash
# Check concurrent query count
# Cloudflare Dashboard > D1 > Metrics > Active Connections

# Check query rate
# Dashboard > D1 > Metrics > Queries per Second

# Identify whether the database is under heavy load
```

5. Check for Lock Contention (ETA: 3 minutes)

D1 uses SQLite, which has database-level locking:

```bash
# Look for write lock contention in logs
wrangler tail monotask-api-gateway --format pretty | grep "database.*locked"

# Check for long-running write transactions
# These block all readers
```

Common Causes and Resolutions

Cause 1: Missing Database Indexes

Symptoms:

  • Full table scans on large tables
  • Query time increases with table size
  • EXPLAIN shows no index usage

Resolution:

Immediate (10 minutes):

  1. Identify missing indexes:

```sql
-- Check the query plan
EXPLAIN QUERY PLAN
SELECT * FROM tasks WHERE project_id = ? AND state = ?;

-- If the output shows "SCAN tasks" (or "SCAN TABLE tasks" on older SQLite),
-- the query is doing a full table scan and an index is missing
```

  2. Create the necessary indexes:

```sql
-- Add index for frequently queried columns
CREATE INDEX IF NOT EXISTS idx_tasks_project_state
ON tasks(project_id, state);

-- Add index for foreign keys
CREATE INDEX IF NOT EXISTS idx_validations_task_id
ON validations(task_id);

-- Add index for timestamp queries
CREATE INDEX IF NOT EXISTS idx_tasks_created_at
ON tasks(created_at DESC);
```

  3. Deploy the index creation:

```bash
# Create migration file
cat > migrations/add_performance_indexes.sql << EOF
CREATE INDEX IF NOT EXISTS idx_tasks_project_state ON tasks(project_id, state);
CREATE INDEX IF NOT EXISTS idx_validations_task_id ON validations(task_id);
CREATE INDEX IF NOT EXISTS idx_tasks_created_at ON tasks(created_at DESC);
EOF

# Apply migration
wrangler d1 execute monotask-production --file migrations/add_performance_indexes.sql
```

Long-term:

  • Analyze query patterns regularly
  • Add indexes during schema design
  • Monitor index usage statistics
  • Remove unused indexes
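To support the "monitor index usage" bullet, a small helper can flag full table scans in `EXPLAIN QUERY PLAN` output before they reach production. This is an illustrative sketch, not an existing project utility; the `PlanRow` shape and the `detail`-string patterns are assumptions based on SQLite's plan output format.

```typescript
// Sketch: detect full table scans in EXPLAIN QUERY PLAN rows.
// SQLite reports "SCAN <table>" for a full scan and
// "SEARCH <table> USING INDEX ..." (or "USING COVERING INDEX") for index hits.
interface PlanRow {
  detail: string;
}

function usesFullTableScan(plan: PlanRow[]): boolean {
  return plan.some(
    row => /^SCAN\b/i.test(row.detail) && !/USING (COVERING )?INDEX/i.test(row.detail)
  );
}
```

Run the candidate queries from a PR through `EXPLAIN QUERY PLAN` and fail CI when this returns true for any of them.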

Cause 2: N+1 Query Problem

Symptoms:

  • Many small queries in sequence
  • Total query time high but individual queries fast
  • Logs show repeated similar queries

Resolution:

Immediate (15 minutes):

  1. Identify the N+1 pattern in code:

```typescript
// BEFORE (N+1 queries):
const tasks = await db.query('SELECT * FROM tasks');
for (const task of tasks) {
  // Each iteration makes a separate query
  const validations = await db.query(
    'SELECT * FROM validations WHERE task_id = ?',
    [task.id]
  );
}
```

  2. Refactor to use a JOIN or a batch query:

```typescript
// AFTER (single query with JOIN):
const tasksWithValidations = await db.query(`
  SELECT
    t.*,
    v.id as validation_id,
    v.status as validation_status
  FROM tasks t
  LEFT JOIN validations v ON t.id = v.task_id
`);

// Or use an IN clause for a batch query:
const taskIds = tasks.map(t => t.id);
const validations = await db.query(
  `SELECT * FROM validations WHERE task_id IN (${taskIds.map(() => '?').join(',')})`,
  taskIds
);
```

Long-term:

  • Use ORM with eager loading
  • Implement DataLoader pattern
  • Add query batching middleware
  • Monitor for N+1 patterns in code reviews
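The DataLoader pattern mentioned above can be sketched in a few lines: collect every key requested during the current tick, then issue one batch query for all of them. This is a minimal illustration of the idea, not the `dataloader` npm package; `batchFn` is a caller-supplied batch loader and all names are illustrative.

```typescript
// Minimal DataLoader-style batcher: many load(key) calls in one tick
// collapse into a single batchFn(keys) call.
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>;

class SimpleDataLoader<K, V> {
  private queue: { key: K; resolve: (v: V) => void; reject: (e: unknown) => void }[] = [];
  private scheduled = false;

  constructor(private batchFn: BatchFn<K, V>) {}

  load(key: K): Promise<V> {
    return new Promise<V>((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      if (!this.scheduled) {
        this.scheduled = true;
        // Flush on the microtask queue so all loads in the current tick share one batch.
        queueMicrotask(() => this.flush());
      }
    });
  }

  private async flush(): Promise<void> {
    const batch = this.queue;
    this.queue = [];
    this.scheduled = false;
    try {
      // batchFn must return results in the same order as the keys.
      const results = await this.batchFn(batch.map(item => item.key));
      batch.forEach((item, i) => item.resolve(results[i]));
    } catch (err) {
      batch.forEach(item => item.reject(err));
    }
  }
}
```

With this shape, the loop in the BEFORE example above issues one `WHERE task_id IN (...)` query per tick instead of one query per task.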

Cause 3: Large Result Sets

Symptoms:

  • Query returns thousands of rows
  • Memory usage high
  • Timeout errors on queries

Resolution:

Immediate (10 minutes):

  1. Add pagination to queries:

```typescript
// BEFORE:
const tasks = await db.query('SELECT * FROM tasks');

// AFTER:
const limit = 100;
const offset = (page - 1) * limit;
const tasks = await db.query(
  'SELECT * FROM tasks LIMIT ? OFFSET ?',
  [limit, offset]
);

// Return pagination metadata
const [countRow] = await db.query('SELECT COUNT(*) as count FROM tasks');
return {
  data: tasks,
  pagination: {
    page,
    limit,
    total: countRow.count, // extract the scalar, not the raw result set
  },
};
```

  2. Add query result limits:

```typescript
// Enforce a maximum result size
const MAX_RESULTS = 1000;
const query = `SELECT * FROM tasks LIMIT ${MAX_RESULTS}`;
```

Long-term:

  • Implement cursor-based pagination
  • Add result size limits at API level
  • Use streaming for large datasets
  • Add caching for large result sets
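Cursor-based pagination (the first long-term bullet) keys each page off the last row seen instead of an OFFSET, so the database never scans the skipped rows. A sketch under stated assumptions: a `tasks` table ordered by `(created_at, id)`, and SQLite 3.15+ for the row-value comparison.

```typescript
// Keyset pagination: the cursor is the (created_at, id) of the last row
// on the previous page. Names are illustrative.
interface Cursor {
  createdAt: string;
  id: number;
}

function buildCursorQuery(cursor: Cursor | null, limit = 100) {
  if (!cursor) {
    // First page: no cursor yet.
    return {
      sql: 'SELECT * FROM tasks ORDER BY created_at DESC, id DESC LIMIT ?',
      params: [limit] as unknown[],
    };
  }
  return {
    // The row-value comparison keeps ordering stable across equal timestamps.
    sql: 'SELECT * FROM tasks WHERE (created_at, id) < (?, ?) ORDER BY created_at DESC, id DESC LIMIT ?',
    params: [cursor.createdAt, cursor.id, limit] as unknown[],
  };
}
```

Unlike `LIMIT ? OFFSET ?`, the cost of this query does not grow with the page number, because the index seeks directly to the cursor position.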

Cause 4: Inefficient Query Structure

Symptoms:

  • Complex queries with multiple JOINs
  • Subqueries instead of JOINs
  • SELECT * instead of specific columns

Resolution:

Immediate (15 minutes):

  1. Optimize query structure:

```sql
-- BEFORE:
SELECT * FROM tasks WHERE id IN (
  SELECT task_id FROM validations WHERE status = 'passed'
);

-- AFTER (use JOIN instead):
SELECT t.* FROM tasks t
INNER JOIN validations v ON t.id = v.task_id
WHERE v.status = 'passed';

-- Or use EXISTS (often faster):
SELECT t.* FROM tasks t
WHERE EXISTS (
  SELECT 1 FROM validations v
  WHERE v.task_id = t.id AND v.status = 'passed'
);
```

  2. Select only the needed columns:

```sql
-- BEFORE:
SELECT * FROM tasks;  -- Returns all columns

-- AFTER:
SELECT id, title, state, created_at FROM tasks;
```

  3. Avoid functions in the WHERE clause:

```sql
-- BEFORE (can't use an index):
SELECT * FROM tasks WHERE LOWER(title) = 'test';

-- AFTER (can use an index):
-- Store lowercase in a separate column, or use a collation
-- (the index must also be declared COLLATE NOCASE to be used)
SELECT * FROM tasks WHERE title = 'test' COLLATE NOCASE;
```

Long-term:

  • Create database views for complex queries
  • Use query builder with optimization
  • Regular query performance audits
  • Implement query caching

Cause 5: Write Lock Contention

Symptoms:

  • "Database is locked" errors
  • Read queries blocked by writes
  • High concurrency failures

Resolution:

Immediate (10 minutes):

  1. Reduce transaction duration:

```typescript
// BEFORE (long transaction):
await db.transaction(async (tx) => {
  await tx.query('UPDATE tasks SET state = ?', ['completed']);
  await expensiveOperation();  // This holds the lock!
  await tx.query('INSERT INTO logs ...');
});

// AFTER (short transaction):
await expensiveOperation();  // Do this outside the transaction
await db.transaction(async (tx) => {
  await tx.query('UPDATE tasks SET state = ?', ['completed']);
  await tx.query('INSERT INTO logs ...');
});
```

  2. Use WAL mode (self-hosted SQLite only; D1 manages journaling internally and does not expose this pragma):

```sql
-- Enable WAL mode for better read/write concurrency
PRAGMA journal_mode=WAL;
```

  3. Batch write operations:

```typescript
// Batch multiple writes into a single transaction
await db.transaction(async (tx) => {
  for (const update of updates) {
    await tx.query('UPDATE ...', update);
  }
});
```

Long-term:

  • Minimize write transaction scope
  • Use read replicas for read-heavy loads
  • Queue write operations
  • Implement optimistic locking
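Optimistic locking (the last bullet) can be sketched with a `version` column: the UPDATE only succeeds if the version the writer originally read is still current, so conflicting writers fail fast instead of queueing on the write lock. This is a sketch, not existing project code; the D1-style `prepare/bind/run` shape, the `meta.changes` field, and the `version INTEGER` column on `tasks` are all assumptions.

```typescript
// Sketch: compare-and-swap update guarded by a version column.
interface RunResult {
  meta: { changes: number };
}

interface Db {
  prepare(sql: string): {
    bind(...params: unknown[]): { run(): Promise<RunResult> };
  };
}

async function updateTaskState(
  db: Db,
  taskId: number,
  newState: string,
  expectedVersion: number
): Promise<boolean> {
  const result = await db
    .prepare('UPDATE tasks SET state = ?, version = version + 1 WHERE id = ? AND version = ?')
    .bind(newState, taskId, expectedVersion)
    .run();
  // 0 changed rows means another writer bumped the version first:
  // the caller should re-read the row and retry.
  return result.meta.changes === 1;
}
```

The write itself stays a single short statement, which also keeps lock hold time minimal per step 1 above.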

Resolution Procedures

Immediate Mitigation (ETA: 10 minutes)

Step 1: Add Critical Indexes

```bash
# Identify the most impactful indexes from query analysis
# Create and apply immediately

wrangler d1 execute monotask-production --command "
CREATE INDEX IF NOT EXISTS idx_tasks_project_state ON tasks(project_id, state);
"
```

Step 2: Enable Query Caching

```typescript
// Add a caching layer for frequently accessed data
import { CacheService } from './cache';

const cache = new CacheService();

async function getTasks(projectId: string) {
  const cacheKey = `tasks:${projectId}`;

  // Check cache first
  const cached = await cache.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Query database
  const tasks = await db.query(
    'SELECT * FROM tasks WHERE project_id = ?',
    [projectId]
  );

  // Cache result
  await cache.set(cacheKey, JSON.stringify(tasks), { ttl: 300 });

  return tasks;
}
```

Step 3: Add Query Timeout

```typescript
// Prevent indefinite query hangs
const QUERY_TIMEOUT = 5000; // 5 seconds

async function queryWithTimeout(query: string, params: any[]) {
  // Note: Promise.race does not cancel the losing query; it still runs to
  // completion. This only bounds how long the caller waits.
  return Promise.race([
    db.query(query, params),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Query timeout')), QUERY_TIMEOUT)
    ),
  ]);
}
```

Database Optimization (ETA: 30 minutes)

Step 1: Analyze and Optimize Schema

```sql
-- Check table statistics
ANALYZE;

-- Rebuild indexes to optimize
REINDEX;

-- Vacuum database to reclaim space (if needed)
VACUUM;
```

Step 2: Create Optimized Views

```sql
-- Create a view for complex queries
-- (SQLite views are computed at query time, not materialized)
CREATE VIEW IF NOT EXISTS task_summary AS
SELECT
  t.id,
  t.title,
  t.state,
  COUNT(v.id) as validation_count,
  MAX(v.created_at) as last_validation
FROM tasks t
LEFT JOIN validations v ON t.id = v.task_id
GROUP BY t.id;
```

Step 3: Update Application Code

```bash
# Replace slow queries with optimized versions and deploy incrementally.
# Deploy to staging first (wrangler reads the worker name from wrangler.toml):
wrangler deploy --env staging

# Verify the performance improvement, then deploy to production:
wrangler deploy
```

Capacity Scaling (if needed)

If database is at capacity:

  1. Request D1 Capacity Increase:

    • Contact Cloudflare support
    • Provide usage metrics
    • Request higher limits
  2. Implement Read Replicas (when available):

```typescript
// Route read queries to replicas
const result = await readReplica.query('SELECT ...');
```

  3. Database Sharding (last resort):

```typescript
// Partition data across multiple databases
const shard = getShardForProject(projectId);
const result = await shard.query('SELECT ...');
```

Verification Steps

1. Query Performance Improved (ETA: 5 minutes)

```bash
# Run EXPLAIN on previously slow queries
wrangler d1 execute monotask-production --command "
EXPLAIN QUERY PLAN
SELECT * FROM tasks WHERE project_id = ?;
"

# Should show index usage instead of a table scan
```

2. Latency Metrics Improved (ETA: 10 minutes)

```bash
# Monitor P95 query latency in the dashboard
# Target: < 50ms for most queries

# Check worker logs for query timing
wrangler tail monotask-api-gateway | grep "db_query"
```

3. No Timeout Errors (ETA: 5 minutes)

```bash
# Check for database timeout errors
wrangler tail monotask-api-gateway --status error | grep "timeout"

# Should see no database timeouts
```

4. API Response Times Improved (ETA: 5 minutes)

```bash
# Test API endpoints
curl -w "@curl-format.txt" https://monotask-api-gateway.workers.dev/api/tasks

# Verify response times are within the SLO
```

Prevention Measures

1. Query Performance Monitoring

```typescript
// Add query performance tracking
async function trackQuery(query: string, duration: number) {
  if (duration > 100) {
    console.warn('Slow query detected:', {
      query,
      duration,
      threshold: 100,
    });

    await metrics.track('slow_query', {
      query,
      duration,
    });
  }
}
```
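To make this tracking automatic rather than ad hoc, queries can be routed through a timing wrapper. A sketch: the `Db` interface stands in for whatever query helper the workers actually use, and `track` mirrors the `trackQuery` signature above; both are assumptions, not the project's real API.

```typescript
// Sketch: time every query and hand the duration to the tracker,
// even when the query throws.
type TrackFn = (query: string, duration: number) => Promise<void>;

interface Db {
  query<T>(sql: string, params?: unknown[]): Promise<T>;
}

async function timedQuery<T>(
  db: Db,
  track: TrackFn,
  sql: string,
  params: unknown[] = []
): Promise<T> {
  const start = Date.now();
  try {
    return await db.query<T>(sql, params);
  } finally {
    // Runs on both success and failure, so slow failing queries are recorded too.
    await track(sql, Date.now() - start);
  }
}
```

Calling `timedQuery(db, trackQuery, sql, params)` everywhere (or wrapping the db binding once) ensures no slow query escapes the warning threshold.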

2. Automated Index Analysis

```bash
#!/bin/bash
# Weekly missing-index analysis (run via GitHub Actions):
# - Export query logs
# - Analyze for full table scans
# - Suggest index creations
```

3. Query Review Process

  • Review all new database queries in PRs
  • Require EXPLAIN output for complex queries
  • Set query performance benchmarks
  • Add query tests to CI/CD

4. Regular Maintenance

```bash
# Schedule regular database maintenance
# Via a GitHub Actions cron job

# Analyze database
wrangler d1 execute monotask-production --command "ANALYZE;"

# Check for bloat
# Optimize indexes
# Update statistics
```

Escalation Path

When to Escalate

Escalate if:

  • P95 query latency > 500ms for more than 30 minutes
  • Database errors > 5% of queries
  • Unable to identify slow queries within 30 minutes
  • Index optimization doesn't improve performance
  • Database capacity limits reached

Escalation Contacts

Level 1 - Database Administrator

  • Slack: #database-team
  • For query optimization and indexing

Level 2 - Infrastructure Team

  • For D1 capacity issues
  • For Cloudflare support escalation

Level 3 - Cloudflare Support

  • For D1 service issues
  • For capacity increase requests

Post-Incident

Required Actions

  1. Query Performance Audit:

    • Identify all slow queries
    • Create optimization plan
    • Implement monitoring for slow queries
  2. Index Strategy Review:

    • Document current indexes
    • Identify missing indexes
    • Remove unused indexes
    • Create index maintenance plan
  3. Code Review:

    • Update query patterns
    • Implement query builders
    • Add query performance tests
    • Document best practices
  4. Documentation Updates:

    • Update schema documentation
    • Document query patterns
    • Share optimization techniques
    • Update this runbook


Last Updated: 2025-10-26
Owner: Database Team
Reviewers: Backend Team, SRE Team
