Runbook: Incident Response

Overview

This runbook covers incident detection, triage, mitigation, and resolution for CloudForge production issues.

Process Overview

Incident Response Flow

Prerequisites

  • PagerDuty access (on-call schedules)
  • Grafana dashboard access
  • kubectl access to production namespace (aegis)
  • Fly.io CLI (flyctl) authenticated against the production org
  • Access to status page for customer-facing updates

Incident Classification

| Severity | Description | Response Time | Examples |
|----------|-------------|---------------|----------|
| SEV1 | Service down, data loss risk | 15 min | API unreachable, DB corruption |
| SEV2 | Major degradation | 30 min | 50% error rate, major feature broken |
| SEV3 | Minor degradation | 2 hours | Single endpoint slow, non-critical bug |
| SEV4 | Low impact | 1 business day | UI cosmetic issue, minor inconvenience |

Detection

Alert Sources

  1. PagerDuty - Critical alerts
  2. Grafana - Metric-based alerts
  3. Datadog/CloudWatch - Log-based alerts
  4. Customer Reports - Support tickets

Key Metrics to Monitor

# Error rate (should be <0.1%)
sum(rate(aegis_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(aegis_http_requests_total[5m]))

# Latency P99 (should be <500ms)
histogram_quantile(0.99, rate(aegis_http_request_duration_seconds_bucket[5m]))

# Active findings (trend)
aegis_findings_active

# AI provider availability
aegis_health_status{component="ai_provider"}
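For a quick sanity check without running PromQL, the 0.1% error-rate target can be evaluated from raw request counts. A minimal sketch (the function name and inputs are illustrative, not part of the service):

```shell
# check_error_rate ERRORS TOTAL: prints ALERT when the 5xx rate
# over the window exceeds the 0.1% target, OK otherwise.
check_error_rate() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    rate = (t > 0) ? e / t : 0
    if (rate > 0.001) print "ALERT"; else print "OK"
  }'
}
```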

Triage Procedure

Step 1: Initial Assessment (5 min)

# Check overall status
kubectl get pods -n aegis
kubectl top pods -n aegis

# Check recent deployments
kubectl rollout history deployment/aegis-api -n aegis

# Check logs for errors
kubectl logs -n aegis -l app=aegis-api --tail=100 | grep -i error

Step 2: Impact Assessment

  • How many users are affected?
  • Which features are impacted?
  • Is data at risk?
  • When did it start?

Step 3: Classification

Based on impact, classify severity and engage appropriate responders.
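As an illustrative sketch, the mapping in the classification table can be captured in a helper (the impact labels here are hypothetical shorthand):

```shell
# classify_severity IMPACT: map a coarse impact label to a severity
# per the Incident Classification table.
classify_severity() {
  case "$1" in
    service-down|data-loss)  echo "SEV1" ;;
    major-degradation)       echo "SEV2" ;;
    minor-degradation)       echo "SEV3" ;;
    *)                       echo "SEV4" ;;
  esac
}
```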

Common Issues and Remediation

Issue: High API Error Rate

Symptoms: 5xx errors, timeouts

Diagnosis:

# Check API logs
kubectl logs -n aegis -l app=aegis-api --tail=500 | grep "ERROR\|FATAL"

# Check resource usage
kubectl top pods -n aegis

# Check database connectivity
kubectl exec -n aegis deployment/aegis-api -- ./aegis health

Remediation:

  1. If OOM: Increase memory limits, then investigate memory leak
  2. If CPU: Scale horizontally, then optimize hot paths
  3. If DB connection: Check connection pool, DB health
  4. If external dependency: Check provider status, enable fallback
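First-line mitigations for the CPU and bad-deploy cases can be wrapped as below. This is a sketch; `KUBECTL` is overridable so the commands can be previewed (`KUBECTL=echo`) before running against production:

```shell
KUBECTL=${KUBECTL:-kubectl}

# Scale horizontally under CPU pressure.
scale_api() {
  "$KUBECTL" scale deployment/aegis-api -n aegis --replicas="$1"
}

# Roll back if errors started right after a deploy.
rollback_api() {
  "$KUBECTL" rollout undo deployment/aegis-api -n aegis
}
```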

Issue: Database Connection Failures

Symptoms: "connection refused", "too many connections"

Diagnosis:

# Check connection count
kubectl exec -n aegis deployment/aegis-api -- \
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity;"

# Check max connections
kubectl exec -n aegis deployment/aegis-api -- \
psql $DATABASE_URL -c "SHOW max_connections;"

Remediation:

  1. Kill idle connections: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
  2. Increase connection pool size in config
  3. Add PgBouncer if not already present
  4. Scale up DB instance if connection limit reached

Issue: AI Provider Timeouts

Symptoms: Slow analysis, AI-powered features fail

Diagnosis:

# Check AI provider status
curl -s https://status.anthropic.com/api/v2/status.json | jq .
curl -s https://status.openai.com/api/v2/status.json | jq .

# Check rate limit status
kubectl logs -n aegis -l app=aegis-api | grep "rate_limit"

Remediation:

  1. Enable fallback provider in config
  2. Increase timeout if provider slow but working
  3. Enable cached responses for repeat queries
  4. Gracefully degrade to static analysis
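The fallback step can be sketched as below (hypothetical: `call_provider` stands in for whatever provider client the service actually uses):

```shell
# Try the primary AI provider; on failure, fall back to the
# secondary rather than failing the analysis outright.
analyze_with_fallback() {
  call_provider anthropic "$1" || call_provider openai "$1"
}
```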

Issue: High Memory Usage

Symptoms: OOMKilled pods, increasing memory trend

Diagnosis:

# Check memory usage
kubectl top pods -n aegis

# Enable profiling (assumes a pprof listener on :6060; port-forward first)
kubectl port-forward -n aegis deployment/aegis-api 6060:6060 &
curl -s http://localhost:6060/debug/pprof/heap > heap.prof
go tool pprof heap.prof

Remediation:

  1. Restart affected pods (temporary)
  2. Reduce batch sizes for processing
  3. Add memory limits enforcement
  4. Investigate and fix memory leak
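The temporary restart in step 1 is best done as a rolling restart so pods cycle without a hard outage (sketch; preview with `KUBECTL=echo`):

```shell
KUBECTL=${KUBECTL:-kubectl}

# Rolling restart: replaces pods one by one, clearing leaked memory.
# This only buys time until the leak itself is fixed.
restart_api() {
  "$KUBECTL" rollout restart deployment/aegis-api -n aegis
}
```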

Communication

Internal Communication

  1. Create incident channel: #incident-YYYYMMDD-XX
  2. Post initial update with:
    • What's happening
    • Who is investigating
    • Current impact
  3. Update every 15 minutes for SEV1-2
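The channel-naming convention above can be generated consistently. A small sketch, where the argument is the incident sequence number for the day (the `XX` part):

```shell
# incident_channel SEQ: emit a channel name like #incident-20250131-01
incident_channel() {
  printf '#incident-%s-%02d\n' "$(date +%Y%m%d)" "$1"
}
```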

External Communication (if customer-facing)

  1. Update status page
  2. Prepare customer communication
  3. Coordinate with support team

Post-Incident

Immediate (within 24h)

  • Document timeline
  • Confirm service restored
  • Remove any temporary mitigations
  • Update monitoring if gap identified

Post-Mortem (within 5 days)

  1. Schedule blameless post-mortem
  2. Document root cause
  3. Create action items
  4. Share learnings

Escalation Matrix

| Severity | Primary | Escalation (30 min) | Escalation (1h) |
|----------|---------|---------------------|-----------------|
| SEV1 | On-Call | Engineering Manager | VP Engineering |
| SEV2 | On-Call | Tech Lead | Engineering Manager |
| SEV3 | On-Call | Tech Lead | - |
| SEV4 | Assigned Engineer | - | - |

Contact Information

  • On-Call: PagerDuty
  • Engineering Manager: @eng-manager
  • Security: #security-ops
  • Customer Success: #customer-success