Database Slow
Overview
This runbook provides procedures for diagnosing and resolving slow D1 database query performance issues.
Alert: slow_database_queries or d1_query_latency_high
Severity: Warning (P95 > 100ms) or Critical (P95 > 500ms)
SLO Impact: Affects database query latency SLO (P95 < 50ms)
Symptoms and Detection
How to Detect
- Alert: "Slow Database Queries Detected"
- Dashboard: D1 query latency widget shows elevation
- Logs: Database query warnings in worker logs
- User Impact: Slow API responses, timeouts
Observable Symptoms
- P95 database query time > 100ms
- Database timeout errors
- Increased API response times
- Worker timeout errors during database operations
Investigation Steps
1. Identify Slow Queries (ETA: 3 minutes)
Determine which queries are slow:
```bash
# Check D1 database metrics
# Cloudflare Dashboard > D1 > monotask-production > Metrics
# Look for:
# - Query execution time trends
# - Query error rates
# - Connection pool usage
```

Check Worker Logs for Query Timing:

```bash
# Tail worker logs and filter for database operations
wrangler tail monotask-api-gateway --format pretty | grep "db_query"

# Look for log entries showing:
# - Query duration > 100ms
# - Specific SQL statements
# - Table names being queried
```

2. Analyze Query Patterns (ETA: 5 minutes)
Identify problematic query patterns:
```sql
-- Common slow query patterns:

-- 1. Missing indexes (full table scans)
EXPLAIN QUERY PLAN
SELECT * FROM tasks WHERE project_id = ?;

-- 2. N+1 queries (multiple queries in a loop)
-- Look for repeated similar queries in logs

-- 3. Large result sets without pagination
SELECT * FROM tasks; -- No LIMIT

-- 4. Complex joins on large tables
SELECT * FROM tasks t
JOIN validations v ON t.id = v.task_id
JOIN evidence e ON v.id = e.validation_id;
```

3. Check Database Statistics (ETA: 3 minutes)
```bash
# Check database size and table statistics
wrangler d1 execute monotask-production --command "
SELECT
  name,
  (SELECT COUNT(*) FROM pragma_table_info(name)) as column_count
FROM sqlite_master
WHERE type='table';
"

# Check index usage
wrangler d1 execute monotask-production --command "
SELECT name, tbl_name FROM sqlite_master WHERE type='index';
"
```

4. Monitor Database Load (ETA: 2 minutes)
```bash
# Check concurrent query count
# Cloudflare Dashboard > D1 > Metrics > Active Connections

# Check query rate
# Dashboard > D1 > Metrics > Queries per Second

# Identify if the database is under heavy load
```

5. Check for Lock Contention (ETA: 3 minutes)
D1 uses SQLite, which has database-level locking:
```bash
# Look for write lock contention in logs
wrangler tail --format pretty | grep "database.*locked"

# Check for long-running write transactions
# These block all readers
```

Common Causes and Resolutions
Cause 1: Missing Database Indexes
Symptoms:
- Full table scans on large tables
- Query time increases with table size
- EXPLAIN shows no index usage
Resolution:
Immediate (10 minutes):
- Identify missing indexes:
```sql
-- Check the query plan
EXPLAIN QUERY PLAN
SELECT * FROM tasks WHERE project_id = ? AND state = ?;

-- If the output shows "SCAN TABLE tasks", an index is missing
```

- Create the necessary indexes:
```sql
-- Add an index for frequently queried columns
CREATE INDEX IF NOT EXISTS idx_tasks_project_state
ON tasks(project_id, state);

-- Add an index for foreign keys
CREATE INDEX IF NOT EXISTS idx_validations_task_id
ON validations(task_id);

-- Add an index for timestamp queries
CREATE INDEX IF NOT EXISTS idx_tasks_created_at
ON tasks(created_at DESC);
```

- Deploy index creation:
```bash
# Create a migration file
cat > migrations/add_performance_indexes.sql << EOF
CREATE INDEX IF NOT EXISTS idx_tasks_project_state ON tasks(project_id, state);
CREATE INDEX IF NOT EXISTS idx_validations_task_id ON validations(task_id);
CREATE INDEX IF NOT EXISTS idx_tasks_created_at ON tasks(created_at DESC);
EOF

# Apply the migration
wrangler d1 execute monotask-production --file migrations/add_performance_indexes.sql
```

Long-term:
- Analyze query patterns regularly
- Add indexes during schema design
- Monitor index usage statistics
- Remove unused indexes
Cause 2: N+1 Query Problem
Symptoms:
- Many small queries in sequence
- Total query time high but individual queries fast
- Logs show repeated similar queries
Resolution:
Immediate (15 minutes):
- Identify N+1 pattern in code:
```typescript
// BEFORE (N+1 queries):
const tasks = await db.query('SELECT * FROM tasks');
for (const task of tasks) {
  // Each iteration makes a separate query
  const validations = await db.query(
    'SELECT * FROM validations WHERE task_id = ?',
    [task.id]
  );
}
```

- Refactor to use a JOIN or a batch query:
```typescript
// AFTER (single query with JOIN):
const tasksWithValidations = await db.query(`
  SELECT
    t.*,
    v.id as validation_id,
    v.status as validation_status
  FROM tasks t
  LEFT JOIN validations v ON t.id = v.task_id
`);

// Or use an IN clause for a batch query:
const taskIds = tasks.map(t => t.id);
const validations = await db.query(
  `SELECT * FROM validations WHERE task_id IN (${taskIds.map(() => '?').join(',')})`,
  taskIds
);
```

Long-term:
- Use ORM with eager loading
- Implement DataLoader pattern
- Add query batching middleware
- Monitor for N+1 patterns in code reviews
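The DataLoader pattern mentioned above can be sketched in a few lines. This is a minimal illustration, not project code: the `Loader` class and `batchFn` are hypothetical names, and `batchFn` stands in for a single `WHERE task_id IN (...)` query. All keys requested in the same microtask tick are collected and resolved with one batched lookup instead of N separate queries.

```typescript
// Minimal DataLoader-style batcher (illustrative sketch).
// Keys requested in the same tick share one call to batchFn.
class Loader<K, V> {
  private queue: { key: K; resolve: (v: V | undefined) => void }[] = [];
  private scheduled = false;

  constructor(private batchFn: (keys: K[]) => Promise<Map<K, V>>) {}

  load(key: K): Promise<V | undefined> {
    return new Promise((resolve) => {
      this.queue.push({ key, resolve });
      if (!this.scheduled) {
        this.scheduled = true;
        // Flush once the current tick's load() calls have queued up
        queueMicrotask(() => this.flush());
      }
    });
  }

  private async flush() {
    const batch = this.queue;
    this.queue = [];
    this.scheduled = false;
    // One batched lookup, e.g. SELECT * FROM validations WHERE task_id IN (...)
    const results = await this.batchFn(batch.map((item) => item.key));
    for (const item of batch) item.resolve(results.get(item.key));
  }
}
```

Callers keep the simple per-item `load(id)` shape, so a loop over tasks no longer multiplies queries; the batching happens transparently underneath.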
Cause 3: Large Result Sets
Symptoms:
- Query returns thousands of rows
- Memory usage high
- Timeout errors on queries
Resolution:
Immediate (10 minutes):
- Add pagination to queries:
```typescript
// BEFORE:
const tasks = await db.query('SELECT * FROM tasks');

// AFTER:
const limit = 100;
const offset = (page - 1) * limit;
const tasks = await db.query(
  'SELECT * FROM tasks LIMIT ? OFFSET ?',
  [limit, offset]
);

// Return pagination metadata
return {
  data: tasks,
  pagination: {
    page,
    limit,
    total: await db.query('SELECT COUNT(*) as count FROM tasks'),
  },
};
```

- Add query result limits:
```typescript
// Enforce a maximum result size
const MAX_RESULTS = 1000;
const query = `SELECT * FROM tasks LIMIT ${MAX_RESULTS}`;
```

Long-term:
- Implement cursor-based pagination
- Add result size limits at API level
- Use streaming for large datasets
- Add caching for large result sets
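The cursor-based pagination item above can be sketched as follows. This is an illustrative sketch, not project code (`buildTaskPage` and the row shape are assumptions): instead of an ever-growing `OFFSET`, the cursor encodes the last seen `(created_at, id)` pair, and the next page filters past it with an indexed comparison.

```typescript
// Keyset (cursor-based) pagination sketch.
// The corresponding query would be roughly:
//   SELECT id, created_at FROM tasks
//   WHERE (created_at, id) < (?, ?)   -- values decoded from the cursor
//   ORDER BY created_at DESC, id DESC
//   LIMIT ?;                          -- fetch limit + 1 to detect more pages
interface TaskRow { id: number; created_at: string; }
interface Page { data: TaskRow[]; nextCursor: string | null; }

function encodeCursor(row: TaskRow): string {
  // Opaque to clients; decode by splitting on '|'
  return `${row.created_at}|${row.id}`;
}

function buildTaskPage(rows: TaskRow[], limit: number): Page {
  // rows is the (limit + 1)-row result; the extra row only signals "has more"
  const hasMore = rows.length > limit;
  const data = rows.slice(0, limit);
  return {
    data,
    nextCursor: hasMore ? encodeCursor(data[data.length - 1]) : null,
  };
}
```

Unlike `OFFSET`, a keyset cursor stays fast on deep pages because each page is an index seek, and it remains stable when rows are inserted between requests.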
Cause 4: Inefficient Query Structure
Symptoms:
- Complex queries with multiple JOINs
- Subqueries instead of JOINs
- SELECT * instead of specific columns
Resolution:
Immediate (15 minutes):
- Optimize query structure:
```sql
-- BEFORE:
SELECT * FROM tasks WHERE id IN (
  SELECT task_id FROM validations WHERE status = 'passed'
);

-- AFTER (use a JOIN instead):
SELECT t.* FROM tasks t
INNER JOIN validations v ON t.id = v.task_id
WHERE v.status = 'passed';

-- Or use EXISTS (often faster):
SELECT t.* FROM tasks t
WHERE EXISTS (
  SELECT 1 FROM validations v
  WHERE v.task_id = t.id AND v.status = 'passed'
);
```

- Select only the needed columns:
```sql
-- BEFORE:
SELECT * FROM tasks; -- Returns all columns

-- AFTER:
SELECT id, title, state, created_at FROM tasks;
```

- Avoid functions in the WHERE clause:
```sql
-- BEFORE (cannot use an index):
SELECT * FROM tasks WHERE LOWER(title) = 'test';

-- AFTER (can use an index):
-- Store a lowercase copy in a separate column, or use a collation
SELECT * FROM tasks WHERE title = 'test' COLLATE NOCASE;
```

Long-term:
- Create database views for complex queries
- Use query builder with optimization
- Regular query performance audits
- Implement query caching
Cause 5: Write Lock Contention
Symptoms:
- "Database is locked" errors
- Read queries blocked by writes
- High concurrency failures
Resolution:
Immediate (10 minutes):
- Reduce transaction duration:
```typescript
// BEFORE (long transaction):
await db.transaction(async (tx) => {
  await tx.query('UPDATE tasks SET state = ?', ['completed']);
  await expensiveOperation(); // This holds the lock!
  await tx.query('INSERT INTO logs ...');
});

// AFTER (short transaction):
await expensiveOperation(); // Do this outside the transaction
await db.transaction(async (tx) => {
  await tx.query('UPDATE tasks SET state = ?', ['completed']);
  await tx.query('INSERT INTO logs ...');
});
```

- Use WAL mode (if not enabled):
```sql
-- Enable WAL mode for better concurrency
PRAGMA journal_mode=WAL;
```

- Batch write operations:
```typescript
// Batch multiple writes into a single transaction
await db.transaction(async (tx) => {
  for (const update of updates) {
    await tx.query('UPDATE ...', update);
  }
});
```

Long-term:
- Minimize write transaction scope
- Use read replicas for read-heavy loads
- Queue write operations
- Implement optimistic locking
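The optimistic locking item above can be sketched with a `version` column. This is an illustrative sketch under assumed names (`updateTaskState`, the `db` binding shape, and a `version` column on `tasks` are not from the codebase); only the SQL pattern matters: an UPDATE succeeds only if the version is unchanged since the row was read, so no lock is held between read and write.

```typescript
// Optimistic locking sketch: compare-and-swap on a version counter.
async function updateTaskState(
  db: { query: (sql: string, params: unknown[]) => Promise<{ changes: number }> },
  taskId: number,
  expectedVersion: number,
  newState: string
): Promise<boolean> {
  const result = await db.query(
    `UPDATE tasks
     SET state = ?, version = version + 1
     WHERE id = ? AND version = ?`,
    [newState, taskId, expectedVersion]
  );
  // changes === 0 means another writer committed first; the caller
  // should re-read the row and retry, or surface a conflict error
  return result.changes === 1;
}
```

Because the check and the write happen in one statement, contention shows up as a retryable failure rather than a held write lock that blocks readers.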
Resolution Procedures
Immediate Mitigation (ETA: 10 minutes)
Step 1: Add Critical Indexes
```bash
# Identify the most impactful indexes from the query analysis
# Create and apply immediately
wrangler d1 execute monotask-production --command "
CREATE INDEX IF NOT EXISTS idx_tasks_project_state ON tasks(project_id, state);
"
```

Step 2: Enable Query Caching
```typescript
// Add a caching layer for frequently accessed data
import { CacheService } from './cache';

async function getTasks(projectId: string) {
  const cacheKey = `tasks:${projectId}`;

  // Check the cache first
  const cached = await cache.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Query the database
  const tasks = await db.query(
    'SELECT * FROM tasks WHERE project_id = ?',
    [projectId]
  );

  // Cache the result
  await cache.set(cacheKey, JSON.stringify(tasks), { ttl: 300 });
  return tasks;
}
```

Step 3: Add Query Timeout
```typescript
// Prevent indefinite query hangs
const QUERY_TIMEOUT = 5000; // 5 seconds

async function queryWithTimeout(query: string, params: any[]) {
  return Promise.race([
    db.query(query, params),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Query timeout')), QUERY_TIMEOUT)
    ),
  ]);
}
```

Database Optimization (ETA: 30 minutes)
Step 1: Analyze and Optimize Schema
```sql
-- Update table statistics
ANALYZE;

-- Rebuild indexes
REINDEX;

-- Vacuum the database to reclaim space (if needed)
VACUUM;
```

Step 2: Create Optimized Views
```sql
-- Create a view for complex queries
-- (note: SQLite views are not materialized; they simplify queries
-- but do not precompute results)
CREATE VIEW IF NOT EXISTS task_summary AS
SELECT
  t.id,
  t.title,
  t.state,
  COUNT(v.id) as validation_count,
  MAX(v.created_at) as last_validation
FROM tasks t
LEFT JOIN validations v ON t.id = v.task_id
GROUP BY t.id;
```

Step 3: Update Application Code
```bash
# Replace slow queries with optimized versions in the worker code,
# then deploy the changes incrementally
wrangler deploy monotask-api-gateway --env staging

# Verify the performance improvement, then deploy to production
wrangler deploy monotask-api-gateway
```

Capacity Scaling (if needed)
If database is at capacity:
Request D1 Capacity Increase:
- Contact Cloudflare support
- Provide usage metrics
- Request higher limits
Implement Read Replicas (when available):

```typescript
// Route read queries to replicas
const result = await readReplica.query('SELECT ...');
```

Database Sharding (last resort):

```typescript
// Partition data across multiple databases
const shard = getShardForProject(projectId);
const result = await shard.query('SELECT ...');
```
Verification Steps
1. Query Performance Improved (ETA: 5 minutes)
```bash
# Run EXPLAIN on previously slow queries
wrangler d1 execute monotask-production --command "
EXPLAIN QUERY PLAN
SELECT * FROM tasks WHERE project_id = ?;
"

# Should show index usage instead of a table scan
```

2. Latency Metrics Improved (ETA: 10 minutes)
```bash
# Monitor P95 query latency in the dashboard
# Target: < 50ms for most queries

# Check worker logs for query timing
wrangler tail monotask-api-gateway | grep "db_query"
```

3. No Timeout Errors (ETA: 5 minutes)
```bash
# Check for database timeout errors
wrangler tail --status error | grep "timeout"

# Should see no database timeouts
```

4. API Response Times Improved (ETA: 5 minutes)
```bash
# Test API endpoints
curl -w "@curl-format.txt" https://monotask-api-gateway.workers.dev/api/tasks

# Verify response times are within the SLO
```

Prevention Measures
1. Query Performance Monitoring
```typescript
// Add query performance tracking
async function trackQuery(query: string, duration: number) {
  if (duration > 100) {
    console.warn('Slow query detected:', {
      query,
      duration,
      threshold: 100,
    });
    await metrics.track('slow_query', {
      query,
      duration,
    });
  }
}
```

2. Automated Index Analysis
```bash
#!/bin/bash
# Script to analyze missing indexes
# Run weekly via GitHub Actions

# Export query logs
# Analyze for full table scans
# Suggest index creations
```

3. Query Review Process
- Review all new database queries in PRs
- Require EXPLAIN output for complex queries
- Set query performance benchmarks
- Add query tests to CI/CD
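For the "query tests in CI/CD" item, one approach is to assert that `EXPLAIN QUERY PLAN` output never contains an unindexed scan. Below is a minimal, illustrative sketch (`assertUsesIndex` is a hypothetical helper, and feeding it plan rows, e.g. from `wrangler d1 execute ... --command "EXPLAIN QUERY PLAN ..."`, is the caller's job). Note that the exact plan text varies by SQLite version: older builds print `SCAN TABLE tasks`, newer ones `SCAN tasks`.

```typescript
// Fail fast when a query plan contains a full table scan on the given table.
// A scan row mentioning an index (e.g. "USING COVERING INDEX ...") is allowed.
function assertUsesIndex(planRows: string[], table: string): void {
  const scan = planRows.find(
    (row) =>
      row.includes('SCAN') &&
      row.includes(table) &&
      !row.includes('INDEX')
  );
  if (scan) {
    throw new Error(`Full table scan on "${table}": ${scan}`);
  }
}
```

A CI test can then run each critical query through `EXPLAIN QUERY PLAN` against a schema-only copy of the database and call `assertUsesIndex`, so a PR that drops or bypasses an index fails before it reaches production.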
4. Regular Maintenance
```bash
# Schedule regular database maintenance
# via a GitHub Actions cron job

# Analyze the database
wrangler d1 execute monotask-production --command "ANALYZE;"

# Check for bloat
# Optimize indexes
# Update statistics
```

Escalation Path
When to Escalate
Escalate if:
- P95 query latency > 500ms for more than 30 minutes
- Database errors > 5% of queries
- Unable to identify slow queries within 30 minutes
- Index optimization doesn't improve performance
- Database capacity limits reached
Escalation Contacts
Level 1 - Database Administrator
- Slack: #database-team
- For query optimization and indexing
Level 2 - Infrastructure Team
- For D1 capacity issues
- For Cloudflare support escalation
Level 3 - Cloudflare Support
- For D1 service issues
- For capacity increase requests
Post-Incident
Required Actions
Query Performance Audit:
- Identify all slow queries
- Create optimization plan
- Implement monitoring for slow queries
Index Strategy Review:
- Document current indexes
- Identify missing indexes
- Remove unused indexes
- Create index maintenance plan
Code Review:
- Update query patterns
- Implement query builders
- Add query performance tests
- Document best practices
Documentation Updates:
- Update schema documentation
- Document query patterns
- Share optimization techniques
- Update this runbook
Related Runbooks
- High Error Rate - For database error handling
- Worker Timeout - For query timeouts
- Queue Backup - For database-induced queue slowness
Last Updated: 2025-10-26
Owner: Database Team
Reviewers: Backend Team, SRE Team