Runbook: Incident Response

Overview

This runbook covers incident detection, triage, mitigation, and resolution for CloudForge production issues.

Process Overview

Incident Response Flow

Prerequisites

  • PagerDuty access (on-call schedules)
  • Grafana dashboard access
  • kubectl access to production namespace (aegis)
  • Fly.io CLI (flyctl) authenticated against the production org
  • Access to status page for customer-facing updates

Incident Classification

| Severity | Description | Response Time | Examples |
|----------|-------------|---------------|----------|
| SEV1 | Service down, data loss risk | 15 min | API unreachable, DB corruption |
| SEV2 | Major degradation | 30 min | 50% error rate, major feature broken |
| SEV3 | Minor degradation | 2 hours | Single endpoint slow, non-critical bug |
| SEV4 | Low impact | 1 business day | UI cosmetic issue, minor inconvenience |

Detection

Alert Sources

  1. PagerDuty - Critical alerts
  2. Grafana - Metric-based alerts
  3. Datadog/CloudWatch - Log-based alerts
  4. Customer Reports - Support tickets

Key Metrics to Monitor

# Error rate (should be <0.1%)
sum(rate(aegis_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(aegis_http_requests_total[5m]))

# Latency P99 (should be <500ms)
histogram_quantile(0.99, rate(aegis_http_request_duration_seconds_bucket[5m]))

# Active findings (trend)
aegis_findings_active

# AI provider availability
aegis_health_status{component="ai_provider"}
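For a quick sanity check without running PromQL, the 0.1% error-rate target can be evaluated from raw request counts. A minimal sketch (the function name and inputs are illustrative, not part of the service):

```shell
# check_error_rate ERRORS TOTAL: prints ALERT when the 5xx rate
# over the window exceeds the 0.1% target, OK otherwise.
check_error_rate() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    rate = (t > 0) ? e / t : 0
    if (rate > 0.001) print "ALERT"; else print "OK"
  }'
}
```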

Triage Procedure

Step 1: Initial Assessment (5 min)

# Check overall status
kubectl get pods -n aegis
kubectl top pods -n aegis

# Check recent deployments
kubectl rollout history deployment/aegis-api -n aegis

# Check logs for errors
kubectl logs -n aegis -l app=aegis-api --tail=100 | grep -i error

Step 2: Impact Assessment

  • How many users are affected?
  • Which features are impacted?
  • Is data at risk?
  • When did it start?

Step 3: Classification

Based on impact, classify severity and engage appropriate responders.
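As an illustrative sketch, the mapping in the classification table can be captured in a helper (the impact labels here are hypothetical shorthand):

```shell
# classify_severity IMPACT: map a coarse impact label to a severity
# per the Incident Classification table.
classify_severity() {
  case "$1" in
    service-down|data-loss)  echo "SEV1" ;;
    major-degradation)       echo "SEV2" ;;
    minor-degradation)       echo "SEV3" ;;
    *)                       echo "SEV4" ;;
  esac
}
```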

Common Issues and Remediation

Issue: High API Error Rate

Symptoms: 5xx errors, timeouts

Diagnosis:

# Check API logs
kubectl logs -n aegis -l app=aegis-api --tail=500 | grep "ERROR\|FATAL"

# Check resource usage
kubectl top pods -n aegis

# Check database connectivity
kubectl exec -n aegis deployment/aegis-api -- ./aegis health

Remediation:

  1. If OOM: Increase memory limits, then investigate memory leak
  2. If CPU: Scale horizontally, then optimize hot paths
  3. If DB connection: Check connection pool, DB health
  4. If external dependency: Check provider status, enable fallback
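First-line mitigations for the CPU and bad-deploy cases can be wrapped as below. This is a sketch; `KUBECTL` is overridable so the commands can be previewed (`KUBECTL=echo`) before running against production:

```shell
KUBECTL=${KUBECTL:-kubectl}

# Scale horizontally under CPU pressure.
scale_api() {
  "$KUBECTL" scale deployment/aegis-api -n aegis --replicas="$1"
}

# Roll back if errors started right after a deploy.
rollback_api() {
  "$KUBECTL" rollout undo deployment/aegis-api -n aegis
}
```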

Issue: Database Connection Failures

Symptoms: "connection refused", "too many connections"

Diagnosis:

# Check connection count
kubectl exec -n aegis deployment/aegis-api -- \
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"

# Check max connections
kubectl exec -n aegis deployment/aegis-api -- \
psql $DATABASE_URL -c "SHOW max_connections;"

Remediation:

  1. Kill idle connections: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
  2. Increase connection pool size in config
  3. Add PgBouncer if not already present
  4. Scale up DB instance if connection limit reached

Issue: AI Provider Timeouts

Symptoms: Slow analysis, AI-powered features fail

Diagnosis:

# Check AI provider status
curl -s https://status.anthropic.com/api/v2/status.json | jq .
curl -s https://status.openai.com/api/v2/status.json | jq .

# Check rate limit status
kubectl logs -n aegis -l app=aegis-api | grep "rate_limit"

Remediation:

  1. Enable fallback provider in config
  2. Increase timeout if provider slow but working
  3. Enable cached responses for repeat queries
  4. Gracefully degrade to static analysis
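The fallback step can be sketched as below (hypothetical: `call_provider` stands in for whatever provider client the service actually uses):

```shell
# Try the primary AI provider; on failure, fall back to the
# secondary rather than failing the analysis outright.
analyze_with_fallback() {
  call_provider anthropic "$1" || call_provider openai "$1"
}
```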

Issue: High Memory Usage

Symptoms: OOMKilled pods, increasing memory trend

Diagnosis:

# Check memory usage
kubectl top pods -n aegis

# Enable profiling (assumes a pprof listener on :6060; port-forward first)
kubectl port-forward -n aegis deployment/aegis-api 6060:6060 &
curl -s http://localhost:6060/debug/pprof/heap > heap.prof
go tool pprof heap.prof

Remediation:

  1. Restart affected pods (temporary)
  2. Reduce batch sizes for processing
  3. Add memory limits enforcement
  4. Investigate and fix memory leak
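The temporary restart in step 1 is best done as a rolling restart so pods cycle without a hard outage (sketch; preview with `KUBECTL=echo`):

```shell
KUBECTL=${KUBECTL:-kubectl}

# Rolling restart: replaces pods one by one, clearing leaked memory.
# This only buys time until the leak itself is fixed.
restart_api() {
  "$KUBECTL" rollout restart deployment/aegis-api -n aegis
}
```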

Communication

Internal Communication

  1. Create incident channel: #incident-YYYYMMDD-XX
  2. Post initial update with:
    • What's happening
    • Who is investigating
    • Current impact
  3. Update every 15 minutes for SEV1-2
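The channel-naming convention above can be generated consistently. A small sketch, where the argument is the incident sequence number for the day (the `XX` part):

```shell
# incident_channel SEQ: emit a channel name like #incident-20250131-01
incident_channel() {
  printf '#incident-%s-%02d\n' "$(date +%Y%m%d)" "$1"
}
```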

External Communication (if customer-facing)

  1. Update status page
  2. Prepare customer communication
  3. Coordinate with support team

Post-Incident

Immediate (within 24h)

  • Document timeline
  • Confirm service restored
  • Remove any temporary mitigations
  • Update monitoring if gap identified

Post-Mortem (within 5 days)

  1. Schedule blameless post-mortem
  2. Document root cause
  3. Create action items
  4. Share learnings

Escalation Matrix

| Severity | Primary | Escalation (30 min) | Escalation (1h) |
|----------|---------|---------------------|-----------------|
| SEV1 | On-Call | Engineering Manager | VP Engineering |
| SEV2 | On-Call | Tech Lead | Engineering Manager |
| SEV3 | On-Call | Tech Lead | - |
| SEV4 | Assigned Engineer | - | - |

Contact Information

  • On-Call: PagerDuty
  • Engineering Manager: @eng-manager
  • Security: #security-ops
  • Customer Success: #customer-success