# Runbook: CloudForge Deployment
## Overview

This runbook covers the current production-style demo deployment path:

- Backend API on Fly.io (`cloudforge-api`)
- Frontend on Cloudflare Pages (`cloudguard` / `cloudforge-demo`)
- PostgreSQL via `AEGIS_DATABASE_URL` when Postgres-backed findings or GRC are enabled
The earlier ECS/RDS rollout path has been retired. Do not use Kubernetes, ECS, or ALB procedures from older notes for the active demo environment.
## Process Flow

## Prerequisites

- GitHub access to the repo and Actions history
- Fly.io CLI authenticated (`fly auth whoami`)
- Cloudflare Pages access for the frontend projects
- `psql` available locally if Postgres migrations are required
- Change approval / stakeholder notice if this is a live demo environment
## Pre-Deployment Checklist

```shell
# 1. Verify Fly.io app state
fly status -a cloudforge-api
fly releases list -a cloudforge-api | head

# 2. Verify current health
curl -sf https://api.cloudforge.lvonguyen.com/health | jq .

# 3. Review runtime secrets and config
fly secrets list -a cloudforge-api

# 4. If using Postgres-backed findings or GRC, confirm DB connectivity
psql "$AEGIS_DATABASE_URL" -c 'select 1;'

# 5. Check the most recent Cloudflare Pages frontend deployment
wrangler pages deployment list --project-name cloudforge-demo | head -5
```
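The checklist above can be wrapped in a fail-fast gate so the deploy stops at the first failing check. This is a hypothetical helper, not a script in the repo: `run_checks` and the stand-in commands at the bottom only rehearse the loop; substitute the real `fly`/`curl`/`psql` checks when wiring it in.

```shell
#!/usr/bin/env bash
# Hypothetical fail-fast gate: run each pre-deploy check in order and stop
# at the first failure. The stand-in commands below exercise the loop;
# swap in the real fly/curl/psql checks from the checklist above.
run_checks() {
  local i=1 c
  for c in "$@"; do
    echo "[check $i] $c"
    if ! eval "$c" >/dev/null 2>&1; then
      echo "FAILED: $c" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "all checks passed"
}

# Offline rehearsal with trivial stand-in commands:
run_checks "true" "echo ok" || echo "gate would block the deploy"
```

Because the gate returns non-zero on the first failure, it composes cleanly with `set -e` in a larger deploy script.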
## Deployment Procedure

### Option A: Fly.io API Deployment (Primary)

```shell
# 1. Deploy the backend
fly deploy -a cloudforge-api

# 2. Monitor rollout
fly status -a cloudforge-api
fly logs -a cloudforge-api

# 3. Verify the machine is healthy
curl -sf https://api.cloudforge.lvonguyen.com/health | jq .
```
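A single probe in step 3 can pass while the machine is still flapping, so one option is to poll until the endpoint reports healthy several seconds apart. `wait_healthy` and the `FETCH_CMD` override are illustrative, not part of the repo; the rehearsal below runs against a local fixture instead of the live API.

```shell
#!/usr/bin/env bash
# Illustrative poll loop: retry the health endpoint until it reports
# "healthy" or attempts run out. FETCH_CMD is injectable so the loop can
# be rehearsed offline; leave it as the default curl for real use.
FETCH_CMD=${FETCH_CMD:-"curl -sf"}

wait_healthy() {
  local url=$1 attempts=${2:-10} delay=${3:-3} i
  for i in $(seq 1 "$attempts"); do
    if $FETCH_CMD "$url" | grep -q '"status"[[:space:]]*:[[:space:]]*"healthy"'; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}

# Real use: wait_healthy https://api.cloudforge.lvonguyen.com/health 10 3
# Offline rehearsal against a local fixture:
printf '{"status":"healthy"}\n' > /tmp/health.json
FETCH_CMD=cat wait_healthy /tmp/health.json 1 0
```

`grep` is used instead of `jq` here only so the helper has no dependencies; piping into `jq -e '.status == "healthy"'` works equally well.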
### Option B: Frontend Deployment (Cloudflare Pages)

Cloudflare Pages deploys automatically from GitHub on pushes to `main`.

```shell
# Inspect recent frontend deployments
wrangler pages deployment list --project-name cloudforge-demo | head -10

# Validate required build-time env vars in the Pages dashboard:
# - VITE_API_URL=https://api.cloudforge.lvonguyen.com/api/v1
# - VITE_DEMO_MODE=true
# - JWT_SECRET=<secret used to generate the static demo token>
```
Cloudflare refs currently present in the Development vault:

- Scoped Pages deploy token: `op://Development/cf-pages-deploy/credential`
- Global API key fallback: `op://Development/cf-gbl-api-token/credential`
- Global API key email / username: `op://Development/cf-gbl-api-token/username`
- Verified 2026-03-31: the token can list deployments for `cloudguard` and `cloudforge-demo`. `cloudforge.lvonguyen.com` is the custom domain, not the Pages project name.
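Before pasting the build-time variables into the Pages dashboard, they can be sanity-checked in a local dotenv-style file. `check_env_file` and the `/tmp` path are illustrative only; the three variable names come from the list above.

```shell
#!/usr/bin/env bash
# Sketch: confirm the three required build-time variables are present in a
# dotenv-style file. check_env_file is a hypothetical helper, not a repo
# script; it only greps for "NAME=" at the start of a line.
check_env_file() {
  local file=$1 var missing=0
  for var in VITE_API_URL VITE_DEMO_MODE JWT_SECRET; do
    if ! grep -q "^${var}=" "$file"; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "all required vars present"
  fi
  return "$missing"
}

# Rehearsal against a throwaway file:
printf 'VITE_API_URL=https://api.cloudforge.lvonguyen.com/api/v1\nVITE_DEMO_MODE=true\nJWT_SECRET=redacted\n' > /tmp/pages.env
check_env_file /tmp/pages.env
```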
## Runtime Secrets Update

If backend configuration changed, update Fly.io secrets before or during deploy:

```shell
# Dry-run the 1Password-backed sync first.
./scripts/fly-sync-runtime-secrets.sh --include-integrations --include-threat-intel

# Apply the core + integration runtime values.
./scripts/fly-sync-runtime-secrets.sh --include-integrations --include-threat-intel --apply
```
Notes:

- `scripts/fly-sync-runtime-secrets.sh` can bootstrap `FLY_API_TOKEN` from `op://Development/flyio-org-deploy-token/credential`, so `--apply` does not depend on an existing `fly auth` session.
- `scripts/fly-sync-runtime-secrets.sh` defaults to the personal demo context and resolves the JWT, Asana, Jira, GreyNoise, HIBP, OTX, and ThreatFox keys from 1Password.
- Personal-context threat-intel refs: `op://Development/glzdciaetfnrafvhntwe6enymu/credential` (GreyNoise), `op://Development/itrqxidqwvzwviz357fqtpcdi4/credential` (HIBP), `op://Development/dy5ds2uttd35prezcbyb4753ra/credential` (OTX), `op://Development/qxi4xw27nzkw6diikdug4arose/wvvuolayxv6m7b75ldy4c52aiu` (abuse.ch ThreatFox Auth-Key).
- For enterprise tenants or renamed 1Password items, override the `*_REF` env vars instead of editing the script.
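The `*_REF` override convention can be pictured as follows. `resolve_ref` is a sketch of the pattern, not the script's actual code, and `HIBP_API_KEY_REF` is an illustrative variable name; only the `op read` invocation and the `op://` reference format are real 1Password CLI behavior.

```shell
#!/usr/bin/env bash
# Sketch of the *_REF override convention (not the sync script's code):
# prefer an explicit environment override, and only call `op read` for
# values that still look like 1Password references. Plain values pass
# through, which also makes the pattern rehearsable without 1Password.
resolve_ref() {
  var=$1
  default_ref=$2
  eval "ref=\${$var:-\$default_ref}"
  case $ref in
    op://*) op read "$ref" ;;
    *) printf '%s\n' "$ref" ;;
  esac
}

# Tenant-override example: export the *_REF variable before running the
# sync script. Here a plain value stands in so no vault access is needed.
HIBP_API_KEY_REF="plain-value-for-rehearsal" \
  resolve_ref HIBP_API_KEY_REF "op://Development/itrqxidqwvzwviz357fqtpcdi4/credential"
```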
## Database Migration

Run migrations before shipping a backend version that depends on new schema:

```shell
# Option 1: direct psql from the local workstation
for f in migrations/*.sql; do
  echo "=== Running $f ==="
  psql "$AEGIS_DATABASE_URL" -f "$f" || exit 1
done

# Option 2: use the migration container entrypoint
docker build -f deploy/docker/Dockerfile.migrate -t cloudforge-migrate .
docker run --rm -e AEGIS_DATABASE_URL="$AEGIS_DATABASE_URL" cloudforge-migrate
```
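Option 1 can be hardened slightly so a failing file is reported by name and psql aborts on the first SQL error (`-v ON_ERROR_STOP=1`). `run_migrations` is a sketch, not a repo script; `apply_sql` is redefined at the end for an offline rehearsal.

```shell
#!/usr/bin/env bash
# Sketch wrapping Option 1: stop on the first failing file and say which
# one broke. ON_ERROR_STOP makes psql exit non-zero on any SQL error
# instead of continuing past it.
apply_sql() { psql "$AEGIS_DATABASE_URL" -v ON_ERROR_STOP=1 -f "$1"; }

run_migrations() {
  local f
  for f in "$@"; do
    echo "=== Running $f ==="
    apply_sql "$f" || { echo "migration failed: $f" >&2; return 1; }
  done
  echo "migrations complete"
}

# Offline rehearsal with a stand-in apply_sql; drop this override to run
# against the real database.
apply_sql() { echo "(dry-run) $1"; }
run_migrations migrations/001_exception_management.sql migrations/002_findings_and_compliance.sql
```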
### Option C: Full Findings Seed into Fly Postgres (D19)

Use this only during an explicit operator window. This is a destructive load sequence against the target database and should be done against a fresh database or a snapshot-backed restore point.

```shell
export DATABASE_URL='<postgres-dsn>'
export AEGIS_DATABASE_URL="$DATABASE_URL"
export FINDINGS_SOURCE=postgres
export SECGRAPH_AUTO_TICKETS=false

# Required because the schema uses gen_random_uuid().
psql "$DATABASE_URL" -c 'CREATE EXTENSION IF NOT EXISTS pgcrypto;'

# Apply the ordered schema set. Do not rely on older `make migrate` helpers for D19.
for f in \
  migrations/001_exception_management.sql \
  migrations/002_findings_and_compliance.sql \
  migrations/003_operations_and_agents.sql \
  migrations/005_tenant_isolation.sql \
  migrations/006_graph_support.sql \
  migrations/007_security_graph.sql \
  migrations/008_findings_assignment_context.sql \
  migrations/009_finding_tickets.sql
do
  psql "$DATABASE_URL" -f "$f" || exit 1
done

# Full seed requires both --full and the explicit 300K count.
node --max-old-space-size=6144 scripts/aegis-seed.mjs \
  --count 300000 \
  --out testdata/seed \
  --full \
  --seed 42

# Generate SQL, then load it with psql.
node scripts/seed-postgres.mjs --in testdata/seed --out /tmp/seed-findings.sql
psql "$DATABASE_URL" -f /tmp/seed-findings.sql
node scripts/seed-resources.mjs --in testdata/seed --out /tmp/seed-resources.sql
psql "$DATABASE_URL" -f /tmp/seed-resources.sql

# Re-run graph-support/security-graph migrations after findings/resources are loaded.
psql "$DATABASE_URL" -f migrations/006_graph_support.sql
psql "$DATABASE_URL" -f migrations/007_security_graph.sql
```
Verify before cutover:

```shell
psql "$DATABASE_URL" -c 'select count(*) from findings;'
psql "$DATABASE_URL" -c 'select count(*) from resources;'
psql "$DATABASE_URL" -c 'select count(*) from accounts;'
psql "$DATABASE_URL" -c 'select count(*) from graph_edges;'
```
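Those four counts can also be turned into an explicit gate that fails if any table is under a floor. This is a sketch: `count_rows` uses psql's real `-At` (unaligned, tuples-only) output mode but is redefined below for an offline rehearsal, the 300000 floor matches the D19 full seed, and the floor of 1 for the other tables is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: gate the cutover on minimum row counts per table. count_rows is
# overridden at the end so the gate logic runs without a live database.
count_rows() { psql "$DATABASE_URL" -At -c "select count(*) from $1;"; }

verify_counts() {
  local rc=0
  while [ "$#" -ge 2 ]; do
    local table=$1 min=$2 n
    shift 2
    n=$(count_rows "$table")
    if [ "$n" -ge "$min" ]; then
      echo "OK  $table: $n (>= $min)"
    else
      echo "LOW $table: $n (< $min)" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Rehearsal with a stand-in count; drop this override for the real gate.
count_rows() { echo 300000; }
verify_counts findings 300000 resources 1 accounts 1 graph_edges 1
```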
Stage the runtime secrets first, but keep findings on mock until the database is fully loaded:

```shell
AEGIS_DATABASE_URL_REF='op://Development/4uvialfye3icuwak32yblswaam/credential' \
  ./scripts/fly-sync-runtime-secrets.sh --include-integrations --include-threat-intel --include-postgres

AEGIS_DATABASE_URL_REF='op://Development/4uvialfye3icuwak32yblswaam/credential' \
  ./scripts/fly-sync-runtime-secrets.sh --include-integrations --include-threat-intel --include-postgres --apply
```
Cut over the app only after those counts look correct and startup headroom is acceptable:

```shell
FINDINGS_SOURCE=postgres \
  AEGIS_DATABASE_URL_REF='op://Development/4uvialfye3icuwak32yblswaam/credential' \
  ./scripts/fly-sync-runtime-secrets.sh --include-integrations --include-threat-intel --include-postgres --apply

fly deploy -a cloudforge-api
```
Notes:

- `scripts/seed-postgres.mjs` and `scripts/seed-resources.mjs` generate SQL; they do not load Postgres by themselves.
- Personal-context default: `AEGIS_DATABASE_URL_REF` resolves from the dedicated Development-vault item UUID ref `op://Development/4uvialfye3icuwak32yblswaam/credential` (cloudforge neon-db).
- `scripts/fly-sync-runtime-secrets.sh` now defaults `FINDINGS_SOURCE=mock`; the final Postgres cutover must be explicit with `FINDINGS_SOURCE=postgres`.
- Verified March 31, 2026: Neon Launch now fits the full 300K corpus. The seeded `cloudforge` database is 1,078,362,112 bytes (1028 MB, about 1.03 GB).
- Verified March 31, 2026: the live Fly app is now Postgres-backed on the full 300K corpus.
- Leave `LARGE_CORPUS_WARMUP_ENABLED` unset on the current 2 GB shared Fly VM. Enabling large-corpus search/attack-path warmup on startup causes health flaps and eventual OOM on the current machine size.
- Leave `LARGE_CORPUS_SECGRAPH_SYNC_ENABLED` unset on the current 2 GB shared Fly VM. That flag refers to the full large-corpus secgraph graph-artifact path, which still causes repeated OOM restarts after backfill on the current machine size.
- Current stable behavior on Fly: findings load from Postgres; the full large-corpus startup warmup stays disabled; authenticated findings search degrades to in-memory keyword mode when the Bleve index is intentionally absent; attack paths lazily materialize a bounded sampled cache on first request; and the operator-facing secgraph issue surface incrementally syncs in bounded startup batches while heavier graph-edge artifacts remain deferred.
- Verified March 31, 2026: the first live issue-surface batch committed 77116 evaluations/issues/issue_findings at 2026-03-31T22:23:46Z, after which `/api/v1/issues/stats`, `/api/v1/issues`, and `/api/v1/issues/{id}` all responded successfully on the public demo API.
- Verified March 31, 2026: the degraded keyword search path served the live-sized 300K corpus locally in about 748 ms for a hybrid-mode request while warmup stayed disabled.
- Verified March 31, 2026: the deferred attack-path path served the live-sized 300K corpus locally in about 0.4 s on the first cold request after adjacency skipping, returning 5476 cached paths from 5000 candidate findings with stats mode `sampled`.
- Tune deferred attack-path sampling with `ATTACK_PATH_MAX_FINDINGS` and `ATTACK_PATH_MAX_PER_ACCOUNT` only after verifying memory headroom on Fly.
- If cutover fails, revert to `FINDINGS_SOURCE=mock` and restart before doing any deeper DB surgery.
- Verified March 31, 2026: `/api/v1/providers` on the live demo now reports `integrations.default=asana`, `integrations.enabled=[asana,jira,mock]`, `integrations.ticket_store=durable`, and `integrations.asana_webhook=configured`.
- Verified March 31, 2026: live remediation mutation passed against both providers. Asana create/comment/resolve/sync succeeded through `cloudforge-api`, and Jira create/comment/sync succeeded through `cloudforge-api`.
## Verification

### API Health Check

```shell
curl -sf https://api.cloudforge.lvonguyen.com/health | jq .
```
Expected response shape:

```json
{
  "status": "healthy"
}
```
### Functional Verification

```shell
# Authenticated findings request (quote the URL so the shell does not glob the ?)
curl -sf "https://api.cloudforge.lvonguyen.com/api/v1/findings?limit=5" \
  -H "Authorization: Bearer $API_TOKEN" | jq '.data | length'

# Provider readiness / durable ticket store
curl -sf https://api.cloudforge.lvonguyen.com/api/v1/providers | jq '.integrations'

# Frontend smoke
open https://cloudforge.lvonguyen.com
```
Check:

- frontend loads without auth redirect loops
- findings and issues views render
- `/api/v1/providers` reports the expected integration provider set and durable ticket store
- findings search returns results even when `LARGE_CORPUS_WARMUP_ENABLED` remains unset
- attack path and graph pages do not 5xx
- first attack-path request may be materially slower on a cold process while the sampled cache is built
- integrations remain disabled unless the required secrets are set
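The providers check can be made an explicit assertion rather than an eyeball of `jq '.integrations'`. This is a sketch: the sample payload mirrors the integration values noted as verified in the D19 notes; in real use, pipe the live `curl` output into the same `jq -e` filter and branch on its exit code.

```shell
#!/usr/bin/env bash
# Sketch: assert the providers payload shape with jq before trusting a
# live check. The sample below is a local fixture, not a live response.
sample='{"integrations":{"default":"asana","enabled":["asana","jira","mock"],"ticket_store":"durable"}}'

echo "$sample" | jq -e '
  .integrations.default == "asana"
  and (.integrations.enabled | index("jira") != null)
  and .integrations.ticket_store == "durable"
' >/dev/null && echo "providers payload OK"
```

`jq -e` exits non-zero when the filter result is false or null, so the same filter works unchanged as a CI-style gate.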
Optional operator-window remediation smoke:

```shell
# Create an Asana-backed remediation ticket for a known finding.
curl -sf -X POST "https://api.cloudforge.lvonguyen.com/api/v1/findings/<finding-id>/remediate" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"provider":"asana","severity":"CRITICAL"}' | jq .

# Add a comment and then force-refresh status.
curl -sf -X POST "https://api.cloudforge.lvonguyen.com/api/v1/findings/<finding-id>/ticket/comments" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"body":"operator smoke"}' | jq .

curl -sf -X POST "https://api.cloudforge.lvonguyen.com/api/v1/findings/<finding-id>/ticket/sync" \
  -H "Authorization: Bearer $API_TOKEN" | jq .
```
Notes:

- Use `"provider":"jira"` to force the Jira path instead of the default Asana path.
- `integrations.asana_webhook=configured` only proves the handshake token is present. It does not prove that an external Asana webhook registration is currently active.
### Fly.io Release Verification

```shell
fly releases list -a cloudforge-api | head
fly status -a cloudforge-api
```
## Rollback Procedure

### Fly.io Release Rollback

```shell
# Identify the last known-good release
fly releases list -a cloudforge-api

# Roll back to a specific release version
fly releases rollback <version> -a cloudforge-api

# Verify health after rollback
curl -sf https://api.cloudforge.lvonguyen.com/health | jq .
```
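A rollback is only done once the health gate passes, so the two steps can be combined. `rollback_and_verify` is a hypothetical wrapper: `ROLLBACK_CMD` and `HEALTH_CMD` are injectable stand-ins so the sequence can be rehearsed offline; in real use they are `fly releases rollback <version> -a cloudforge-api` and the curl health check above.

```shell
#!/usr/bin/env bash
# Sketch: treat a rollback as complete only after the health check passes.
# The two commands are injected via variables so the control flow can be
# exercised without touching Fly.io.
rollback_and_verify() {
  local version=$1
  $ROLLBACK_CMD "$version" || { echo "rollback command failed" >&2; return 1; }
  if $HEALTH_CMD >/dev/null 2>&1; then
    echo "rolled back to $version and healthy"
  else
    echo "rolled back to $version but still unhealthy" >&2
    return 1
  fi
}

# Offline rehearsal with stand-in commands:
ROLLBACK_CMD="echo rolling-back-to" HEALTH_CMD=true rollback_and_verify v42
```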
### Database Rollback
If a migration introduced an incompatible schema change, restore from backup or manually revert the relevant migration. There is no safe generic down path for every migration in this repo; treat DB rollback as an explicit operator action.
## Post-Deployment

- Verify Fly.io health and recent logs
- Verify the frontend loads from Cloudflare Pages
- Verify authenticated API calls against `/api/v1/findings`
- Verify `/api/v1/providers` reports the expected integration readiness
- Verify optional Postgres-backed features if `AEGIS_DATABASE_URL` is enabled
- During an operator window, run one remediation create/comment/sync smoke on the active ticket provider path
- Notify stakeholders / update the deployment record
## Escalation

| Condition | Action |
|---|---|
| Fly.io deploy fails | Roll back to the last good release, then inspect `fly logs` |
| Health check fails after deploy | Check Fly secrets, DB reachability, and release diff before retrying |
| Frontend 404s or auth loops | Verify `VITE_API_URL`, `JWT_SECRET`, and the Pages build env vars |
| Postgres-backed endpoints 5xx | Confirm `AEGIS_DATABASE_URL`, migration state, and DB connectivity |
## Contact

- On-Call: PagerDuty
- Platform Team: `#platform-support`
- Security Team: `#security-ops`