Rollback Procedures
When a deployment introduces issues, you need to roll back quickly. This guide covers rollback procedures for each layer of the stack: application services, database migrations, infrastructure, and data stores.
Decision Matrix
| Scenario | Action | Recovery Time |
|---|---|---|
| Bad code deploy (no DB change) | kubectl rollout undo or ArgoCD rollback | < 2 minutes |
| Bad code + reversible DB migration | Roll back migration → roll back code | < 10 minutes |
| Bad code + irreversible DB migration | Fix forward with new code | Varies |
| Infrastructure misconfiguration | terraform apply with previous state | < 15 minutes |
| Corrupted FalkorDB graph | Restore from backup or re-ingest | 30–60 minutes |
| Total environment failure | Terraform destroy + recreate | 1–2 hours |
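The first three rows of the matrix reduce to two questions: did the release include a database migration, and can that migration be reversed? A minimal sketch of that decision follows. The function name choose_rollback and its output labels are hypothetical and not part of any existing tooling.

```shell
# Hypothetical helper: map a bad deploy to the matching row of the matrix.
# $1 = "yes" if the release included a DB migration, otherwise "no"
# $2 = "yes" if that migration has a safe down step (default "no")
choose_rollback() {
  db_change="$1"
  reversible="${2:-no}"
  if [ "$db_change" != "yes" ]; then
    echo "rollback-code"
  elif [ "$reversible" = "yes" ]; then
    echo "rollback-migration-then-code"
  else
    echo "fix-forward"
  fi
}
```

For example, choose_rollback yes no prints fix-forward, matching the irreversible-migration row.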
Service Rollback (Application Code)
Roll back a service to its previous image:
Option A: ArgoCD Rollback (Preferred)
# View deployment history
argocd app history gospelib-production
# Roll back to a previous revision
# (ArgoCD rejects a rollback while automated sync is enabled on the app)
argocd app rollback gospelib-production <REVISION_NUMBER>
Option B: kubectl Rollback
# Roll back the most recent deployment
kubectl rollout undo deployment/gospelib-gateway -n gospelib-production
# Verify the rollback
kubectl rollout status deployment/gospelib-gateway -n gospelib-production
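kubectl rollout undo returns to the immediately preceding revision by default. When the last good revision is further back, the --to-revision flag selects it explicitly. The sketch below wraps both cases and adds a timeout so a stuck rollout fails instead of hanging. The function name rollback_deployment and the 120-second timeout are assumptions for illustration.

```shell
# Roll a deployment back and wait for the rollout to finish.
# $1 = deployment name, $2 = namespace, $3 = optional revision number
rollback_deployment() {
  deploy="$1"; ns="$2"; rev="${3:-}"
  if [ -n "$rev" ]; then
    kubectl rollout undo "deployment/$deploy" -n "$ns" --to-revision="$rev" || return 1
  else
    kubectl rollout undo "deployment/$deploy" -n "$ns" || return 1
  fi
  # Fail after the timeout instead of waiting forever on pods that never become ready.
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s
}

# List the available revisions first to choose a target:
#   kubectl rollout history deployment/gospelib-gateway -n gospelib-production
# Then, with <REVISION> taken from that list:
#   rollback_deployment gospelib-gateway gospelib-production <REVISION>
```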
Option C: Pin to a Known-Good Image
cd infra/k8s/overlays/production
kustomize edit set image gospelib-gateway=$ECR_URL/gospelib-gateway:<KNOWN_GOOD_SHA>
kustomize build . | kubectl apply -f -
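After pinning, it is worth confirming that the deployment actually references the known-good tag. A minimal sketch, assuming the deployment has a single container; current_image and image_is_pinned are hypothetical helper names.

```shell
# Print the image(s) a deployment is currently configured to run.
# $1 = deployment name, $2 = namespace
current_image() {
  kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{.spec.template.spec.containers[*].image}'
}

# Succeed only if the deployment's image ends with the expected tag.
# $1 = deployment name, $2 = namespace, $3 = expected image tag
image_is_pinned() {
  deploy="$1"; ns="$2"; expected_tag="$3"
  image="$(current_image "$deploy" "$ns")" || return 1
  case "$image" in
    *":$expected_tag") return 0 ;;
    *) echo "expected tag $expected_tag, found: $image" >&2; return 1 ;;
  esac
}
```

Usage: image_is_pinned gospelib-gateway gospelib-production <KNOWN_GOOD_SHA>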
Full Release Rollback
Roll back ALL services to the previous release:
# Find the previous release tag
git tag -l 'release/v*' --sort=-v:refname | head -5
Option A: Revert the Kustomize Commit
git log --oneline infra/k8s/overlays/production/kustomization.yaml | head -5
git revert <COMMIT_SHA>
git push origin main
# ArgoCD applies the revert on its next sync (assumes automated sync is enabled)
Option B: ArgoCD Full History Rollback
argocd app rollback gospelib-production <PREVIOUS_REVISION>
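ArgoCD does not allow a rollback on an application that has automated sync enabled, so a history rollback usually needs auto-sync switched off first. The sketch below chains the three steps; the function name argocd_rollback and the 300-second timeout are assumptions.

```shell
# Disable auto-sync, roll back to a history revision, and wait for health.
# $1 = ArgoCD application name, $2 = history revision ID
argocd_rollback() {
  app="$1"; revision="$2"
  # Rollback is rejected while automated sync is on, so switch it off first.
  argocd app set "$app" --sync-policy none || return 1
  argocd app rollback "$app" "$revision" || return 1
  # Block until the app reports Healthy, or give up after 5 minutes.
  argocd app wait "$app" --health --timeout 300
}
```

Once the fix has landed in Git, automated sync can be restored with argocd app set <APP> --sync-policy automated.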
Database Migration Rollback
PostgreSQL
# Check current migration version
migrate -source "file://services/auth/migrations" \
  -database "$PG_URL" version
# Roll back one migration
migrate -source "file://services/auth/migrations" \
  -database "$PG_URL" down 1
# Roll back to a specific version
migrate -source "file://services/auth/migrations" \
  -database "$PG_URL" goto 5
Warning: Data-destructive down migrations (dropping columns or tables) cannot be undone. If you've dropped data, you must restore from a backup or fix forward.
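Before running a down migration, it helps to confirm the database is not in a dirty state left by a failed migration. The sketch below assumes the migrate command is the golang-migrate CLI, which the commands above resemble, and that it reports a dirty database by appending "(dirty)" to the version on stderr. The function name migration_state is hypothetical.

```shell
# Print the current migration version; fail if the database is marked dirty.
# $1 = migrations directory, $2 = database URL
migration_state() {
  # Assumption: the CLI logs to stderr, so capture both streams.
  out="$(migrate -source "file://$1" -database "$2" version 2>&1)" || {
    echo "could not read migration version: $out" >&2
    return 1
  }
  case "$out" in
    *dirty*) echo "database is dirty at version: $out" >&2; return 2 ;;
    *) echo "$out" ;;
  esac
}
```

Usage: migration_state services/auth/migrations "$PG_URL". A return code of 2 means the schema needs manual inspection before any down step; with golang-migrate, the force subcommand resets the recorded version once the schema has been checked by hand.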
FalkorDB
FalkorDB does not have traditional migrations. Rollback options:
- Additive changes (new indices, new node types): Leave them in place — they are harmless
- Destructive changes (removed nodes, changed schema): Restore from backup
- Nuclear option: Re-run full ingest from corpus data with --reset
# Full re-ingest (staging only — destroys and rebuilds)
kubectl apply -f infra/k8s/jobs/ingest-full.yaml -n gospelib-staging
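Because the full re-ingest destroys the existing graph, a small guard can stop it from being applied to the wrong namespace by accident. A minimal sketch; the wrapper name run_full_ingest is hypothetical, while the manifest path and namespace come from the command above.

```shell
# Apply the full re-ingest job only when the target namespace is staging.
# $1 = target namespace
run_full_ingest() {
  ns="$1"
  if [ "$ns" != "gospelib-staging" ]; then
    echo "refusing full re-ingest in '$ns': staging only" >&2
    return 1
  fi
  kubectl apply -f infra/k8s/jobs/ingest-full.yaml -n "$ns"
}
```

Calling run_full_ingest gospelib-production exits non-zero without touching the cluster.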
Infrastructure Rollback (Terraform)
cd infra/terraform/environments/production
# Find the previous state version in S3
aws s3api list-object-versions \
  --bucket gospelib-terraform-state \
  --prefix infrastructure/terraform.tfstate
# Download the previous version
aws s3api get-object \
  --bucket gospelib-terraform-state \
  --key infrastructure/terraform.tfstate \
  --version-id <PREVIOUS_VERSION_ID> \
  /tmp/terraform.tfstate.backup
# Push the old state back (-force is needed because its serial is lower than the live state's)
terraform state push -force /tmp/terraform.tfstate.backup
# Review the plan against the configuration you intend to restore
terraform plan
# Apply to revert infrastructure
terraform apply
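Pushing an older state file overwrites the live one, so it is prudent to keep a local copy of the current state first. The sketch below uses terraform state pull for that; the function name backup_current_state and the default output path are assumptions.

```shell
# Save the live state to a file before overwriting it.
# $1 = optional output path (defaults to a timestamped file in /tmp)
backup_current_state() {
  out="${1:-/tmp/terraform.tfstate.current.$(date +%Y%m%d%H%M%S)}"
  terraform state pull > "$out" || return 1
  # An empty file means nothing was pulled; treat that as a failure.
  [ -s "$out" ] || { echo "state backup is empty: $out" >&2; return 1; }
  echo "$out"
}
```

Run it from infra/terraform/environments/production before terraform state push. It prints the path of the saved copy on success.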
Disaster Recovery
Backup Schedule
| Data Store | Method | Frequency | Retention |
|---|---|---|---|
| PostgreSQL | RDS automated snapshots | Daily | 14 days (prod) |
| PostgreSQL | Manual snapshot before releases | Per release | 30 days |
| FalkorDB | Redis BGSAVE → S3 | Every 6 hours | 7 days |
| Typesense | Snapshot API → S3 | Daily | 7 days |
| Corpus data | Git | Every commit | Permanent |
| Terraform state | S3 versioning | Every apply | 90 days |
Restore PostgreSQL from Snapshot
# List available snapshots
aws rds describe-db-snapshots \
  --db-instance-identifier gospelib-production \
  --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' \
  --output table
# Restore from snapshot
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier gospelib-production-restored \
  --db-snapshot-identifier <SNAPSHOT_ID> \
  --db-instance-class db.t3.medium
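# After verification, swap instance names
# (--apply-immediately avoids deferring the rename to the next maintenance window)
aws rds modify-db-instance \
  --db-instance-identifier gospelib-production \
  --new-db-instance-identifier gospelib-production-old \
  --apply-immediately
# Run the second rename only after the first has completed
aws rds modify-db-instance \
  --db-instance-identifier gospelib-production-restored \
  --new-db-instance-identifier gospelib-production \
  --apply-immediately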
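A snapshot restore can take a while, and the restored instance cannot be verified or renamed until it reaches the available state. The sketch below waits for that state and then prints the endpoint to connect to for verification. The function name restored_endpoint is hypothetical.

```shell
# Wait for a restored instance, then print its endpoint address.
# $1 = DB instance identifier
restored_endpoint() {
  id="$1"
  # Polls until the instance status is "available" or the waiter gives up.
  aws rds wait db-instance-available --db-instance-identifier "$id" || return 1
  aws rds describe-db-instances \
    --db-instance-identifier "$id" \
    --query 'DBInstances[0].Endpoint.Address' \
    --output text
}
```

Usage: restored_endpoint gospelib-production-restored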
Verify the Rollback
After any rollback:
# Check all pods are running
kubectl get pods -n gospelib-production
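# Hit health endpoints (-f makes curl exit non-zero on HTTP errors)
curl -fsS https://api.gospelib.com/health
curl -fsS https://api.gospelib.com/ready
# Monitor logs for errors (--tail=-1 lifts the 10-line default that applies with -l)
kubectl logs -f -l app=gospelib-gateway -n gospelib-production --since=5m --tail=-1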
Monitor Grafana dashboards for 15 minutes to confirm stability.
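Pods often need a few seconds after a rollback before they pass health checks, so a single curl can give a false alarm. The sketch below polls an endpoint until it returns a successful status. The function name wait_healthy and the default attempt count and delay are assumptions.

```shell
# Poll a URL until it returns a successful status or attempts run out.
# $1 = URL, $2 = max attempts (default 10), $3 = seconds between attempts (default 5)
wait_healthy() {
  url="$1"; attempts="${2:-10}"; delay="${3:-5}"
  i=1
  while :; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy: $url (attempt $i)"
      return 0
    fi
    [ "$i" -ge "$attempts" ] && break
    i=$((i + 1))
    sleep "$delay"
  done
  echo "unhealthy after $attempts attempts: $url" >&2
  return 1
}
```

Usage: wait_healthy https://api.gospelib.com/health && wait_healthy https://api.gospelib.com/ready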