Rollback Procedures

When a deployment introduces issues, you need to roll back quickly. This guide covers rollback procedures for every layer of the stack.

Decision Matrix

Scenario	Action	Recovery Time
Bad code deploy (no DB change)	`kubectl rollout undo` or ArgoCD rollback	< 2 minutes
Bad code + reversible DB migration	Roll back migration → roll back code	< 10 minutes
Bad code + irreversible DB migration	Fix forward with new code	Varies
Infrastructure misconfiguration	`terraform apply` with previous state	< 15 minutes
Corrupted FalkorDB graph	Restore from backup or re-ingest	30–60 minutes
Total environment failure	Terraform destroy + recreate	1–2 hours

Service Rollback (Application Code)

Roll back a service to its previous image:

Option A: ArgoCD Rollback (Preferred)

# View deployment history
argocd app history gospelib-production

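# Roll back to a previous revision (use an ID from the history output).
# ArgoCD refuses to roll back an application that has automated sync enabled.
# If that applies here, switch to manual sync first:
#   argocd app set gospelib-production --sync-policy none
argocd app rollback gospelib-production <REVISION_NUMBER>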

Option B: kubectl Rollback

# Roll back the most recent deployment
kubectl rollout undo deployment/gospelib-gateway -n gospelib-production

# Verify the rollback
kubectl rollout status deployment/gospelib-gateway -n gospelib-production
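The undo and verify steps can be wrapped so that a stalled rollback fails loudly instead of hanging. The following is a minimal sketch: `rollback_deployment` is a hypothetical helper name, and the 120-second timeout is an assumption chosen to match the "< 2 minutes" target in the decision matrix.

```shell
# Hypothetical helper: roll back one deployment, then wait for it to settle.
# Arguments: deployment name, namespace, optional revision number
# (as listed by `kubectl rollout history`).
rollback_deployment() {
  deployment="$1"
  namespace="$2"
  revision="${3:-}"

  if [ -n "$revision" ]; then
    # Roll back to an explicit revision instead of the immediately previous one.
    kubectl rollout undo "deployment/$deployment" -n "$namespace" \
      --to-revision="$revision" || return 1
  else
    kubectl rollout undo "deployment/$deployment" -n "$namespace" || return 1
  fi

  # --timeout makes the status check exit non-zero instead of waiting forever.
  kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=120s
}

# Usage (not executed here):
#   rollback_deployment gospelib-gateway gospelib-production
#   rollback_deployment gospelib-gateway gospelib-production 7
```

Passing an explicit revision is useful when the most recent rollout is not the one that introduced the problem.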

Option C: Pin to a Known-Good Image

cd infra/k8s/overlays/production
kustomize edit set image gospelib-gateway=$ECR_URL/gospelib-gateway:<KNOWN_GOOD_SHA>
kustomize build . | kubectl apply -f -
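After pinning, it can help to confirm that the deployment's pod template really references the known-good tag. This sketch assumes the tag is the image's only suffix after the colon; `image_is_pinned` is a hypothetical helper, not part of any existing tooling.

```shell
# Hypothetical helper: succeed only if one of the deployment's containers
# uses an image that ends in ":<expected_tag>".
image_is_pinned() {
  deployment="$1"
  namespace="$2"
  expected_tag="$3"

  # jsonpath prints the container images separated by spaces.
  images=$(kubectl get "deployment/$deployment" -n "$namespace" \
    -o jsonpath='{.spec.template.spec.containers[*].image}') || return 1

  case " $images " in
    *":$expected_tag "*) return 0 ;;
    *)
      echo "expected tag $expected_tag, found: $images" >&2
      return 1
      ;;
  esac
}

# Usage (not executed here):
#   image_is_pinned gospelib-gateway gospelib-production <KNOWN_GOOD_SHA>
```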

Full Release Rollback

Roll back ALL services to the previous release:

# Find the previous release tag
git tag -l 'release/v*' --sort=-v:refname | head -5
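To pick the rollback target without reading the list by eye, the sorted tag list can be filtered for the entry that follows the current release. A minimal sketch, assuming tags arrive newest first (as the `--sort=-v:refname` listing produces); `previous_release_tag` and the example tag names are hypothetical.

```shell
# Hypothetical helper: read release tags on stdin (newest first, one per line)
# and print the tag that comes right after the current one, i.e. the previous
# release. Exits non-zero when there is no earlier tag.
previous_release_tag() {
  current="$1"
  awk -v cur="$current" '
    found      { print; ok = 1; exit }   # first line after the current tag
    $0 == cur  { found = 1 }             # mark where the current tag sits
    END        { exit !ok }              # fail if nothing was printed
  '
}

# Usage (not executed here):
#   git tag -l 'release/v*' --sort=-v:refname | previous_release_tag release/v1.4.0
```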

Option A: Revert the Kustomize Commit

git log --oneline infra/k8s/overlays/production/kustomization.yaml | head -5
git revert <COMMIT_SHA>
git push origin main
# ArgoCD syncs automatically

Option B: ArgoCD Full History Rollback

argocd app rollback gospelib-production <PREVIOUS_REVISION>

Database Migration Rollback

PostgreSQL

# Check current migration version
migrate -source "file://services/auth/migrations" \
  -database "$PG_URL" version

# Roll back one migration
migrate -source "file://services/auth/migrations" \
  -database "$PG_URL" down 1

# Roll back to a specific version
migrate -source "file://services/auth/migrations" \
  -database "$PG_URL" goto 5

Warning

Down migrations that drop columns or tables destroy data permanently. Re-applying the up migration recreates the schema but not the rows, so once data has been dropped you must restore from a backup or fix forward.
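One way to reduce that risk is to take a dump immediately before running a down migration. The sketch below assumes `pg_dump` and the `migrate` CLI are on the PATH and that `$PG_URL` is a plain connection URI that both tools accept; `safe_migrate_down` and the default dump path are hypothetical.

```shell
# Hypothetical wrapper: dump the database, then roll back one migration.
# The down migration is skipped if the backup fails.
safe_migrate_down() {
  migrations_dir="$1"                        # e.g. services/auth/migrations
  dump_file="${2:-/tmp/pre-rollback.dump}"   # hypothetical default location

  # Custom-format dump, restorable later with pg_restore.
  pg_dump --dbname="$PG_URL" --format=custom --file="$dump_file" || {
    echo "backup failed; not running the down migration" >&2
    return 1
  }

  migrate -source "file://$migrations_dir" -database "$PG_URL" down 1
}

# Usage (not executed here):
#   safe_migrate_down services/auth/migrations /tmp/auth-pre-rollback.dump
```

On a large production database a dump may take too long for an incident, in which case a manual RDS snapshot is the closer fit.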

FalkorDB

FalkorDB does not have traditional migrations. Rollback options:

Additive changes (new indices, new node types): Leave them in place — they are harmless
Destructive changes (removed nodes, changed schema): Restore from backup
Nuclear option: Re-run full ingest from corpus data with --reset

# Full re-ingest (staging only — destroys and rebuilds)
kubectl apply -f infra/k8s/jobs/ingest-full.yaml -n gospelib-staging
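Because the full re-ingest destroys the graph before rebuilding it, a small guard can keep the job from being applied to the wrong namespace. This is a sketch only; `run_full_ingest` is a hypothetical wrapper around the command above.

```shell
# Hypothetical guard: apply the full-ingest job only in the staging namespace.
run_full_ingest() {
  namespace="$1"

  if [ "$namespace" != "gospelib-staging" ]; then
    echo "refusing full re-ingest in $namespace (staging only)" >&2
    return 1
  fi

  kubectl apply -f infra/k8s/jobs/ingest-full.yaml -n "$namespace"
}

# Usage (not executed here):
#   run_full_ingest gospelib-staging
```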

Infrastructure Rollback (Terraform)

cd infra/terraform/environments/production

# Find the previous state version in S3
aws s3api list-object-versions \
  --bucket gospelib-terraform-state \
  --prefix infrastructure/terraform.tfstate

# Download the previous version
aws s3api get-object \
  --bucket gospelib-terraform-state \
  --key infrastructure/terraform.tfstate \
  --version-id <PREVIOUS_VERSION_ID> \
  /tmp/terraform.tfstate.backup

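# Push the old state back. Terraform rejects a state with a different lineage
# or a lower serial than the remote copy; override with -force only after review.
terraform state push /tmp/terraform.tfstate.backup

# Review the plan, then apply. Terraform converges infrastructure to the
# checked-out configuration, so check out the matching previous configuration first.
terraform plan
terraform apply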
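The version lookup can be narrowed to a single ID with a JMESPath query. The sketch below relies on S3 listing object versions newest first, which is worth confirming against the `LastModified` values before pushing anything; `previous_state_version` is a hypothetical helper.

```shell
# Hypothetical helper: print the version ID that precedes the current version
# of one object. Index 0 is the newest version, so index 1 is the one before it.
previous_state_version() {
  bucket="$1"
  key="$2"

  version_id=$(aws s3api list-object-versions \
    --bucket "$bucket" --prefix "$key" \
    --query "Versions[?Key=='$key'] | [1].VersionId" \
    --output text) || return 1

  # With --output text the CLI prints "None" when the query matches nothing.
  if [ -z "$version_id" ] || [ "$version_id" = "None" ]; then
    echo "no previous version found for $key" >&2
    return 1
  fi

  printf '%s\n' "$version_id"
}

# Usage (not executed here):
#   previous_state_version gospelib-terraform-state infrastructure/terraform.tfstate
```

The `Key==` filter matters because `--prefix` also matches any other object whose key starts with the same string.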

Disaster Recovery

Backup Schedule

Data Store	Method	Frequency	Retention
PostgreSQL	RDS automated snapshots	Daily	14 days (prod)
PostgreSQL	Manual snapshot before releases	Per release	30 days
FalkorDB	Redis BGSAVE → S3	Every 6 hours	7 days
Typesense	Snapshot API → S3	Daily	7 days
Corpus data	Git	Every commit	Permanent
Terraform state	S3 versioning	Every apply	90 days

Restore PostgreSQL from Snapshot

# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier gospelib-production \
  --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' \
  --output table

# Restore from snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier gospelib-production-restored \
  --db-snapshot-identifier <SNAPSHOT_ID> \
  --db-instance-class db.t3.medium

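# After verification, swap instance names.
# --apply-immediately is required; without it RDS defers the rename
# to the next maintenance window.
aws rds modify-db-instance \
  --db-instance-identifier gospelib-production \
  --new-db-instance-identifier gospelib-production-old \
  --apply-immediately

# Run this only after the first rename has completed and the name is free
aws rds modify-db-instance \
  --db-instance-identifier gospelib-production-restored \
  --new-db-instance-identifier gospelib-production \
  --apply-immediately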
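The restore call returns before the new instance is usable, so verification has to wait until RDS reports it as available. A minimal sketch using the AWS CLI's built-in waiter; `restore_snapshot_and_wait` is a hypothetical helper, and the instance class simply repeats the value used above.

```shell
# Hypothetical helper: start a restore and block until the instance is available.
restore_snapshot_and_wait() {
  snapshot_id="$1"
  target="${2:-gospelib-production-restored}"

  aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier "$target" \
    --db-snapshot-identifier "$snapshot_id" \
    --db-instance-class db.t3.medium || return 1

  # Polls the instance status; exits non-zero if the waiter gives up.
  aws rds wait db-instance-available --db-instance-identifier "$target"
}

# Usage (not executed here):
#   restore_snapshot_and_wait <SNAPSHOT_ID>
```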

Verify the Rollback

After any rollback:

# Check all pods are running
kubectl get pods -n gospelib-production

# Hit health endpoints (-f makes curl exit non-zero on HTTP errors)
curl -fsS https://api.gospelib.com/health
curl -fsS https://api.gospelib.com/ready

# Monitor logs for errors
kubectl logs -f -l app=gospelib-gateway -n gospelib-production --since=5m

Monitor Grafana dashboards for 15 minutes to confirm stability.
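Pods often need a short time to pass readiness checks after a rollback, so a single request can give a false alarm. The sketch below polls an endpoint with a bounded number of attempts; `wait_until_healthy` and its default attempt count and delay are hypothetical choices.

```shell
# Hypothetical helper: poll a URL until it answers successfully or the
# attempts run out. Arguments: url, attempts (default 10), delay in seconds
# between attempts (default 5).
wait_until_healthy() {
  url="$1"
  attempts="${2:-10}"
  delay="${3:-5}"
  i=1

  while [ "$i" -le "$attempts" ]; do
    # -f turns HTTP errors into a non-zero exit; -sS hides the progress
    # meter but still prints errors.
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done

  echo "$url still unhealthy after $attempts attempts" >&2
  return 1
}

# Usage (not executed here):
#   wait_until_healthy https://api.gospelib.com/health
#   wait_until_healthy https://api.gospelib.com/ready 20 3
```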
