
Operations Overview

GospeLib runs as containerized microservices on Kubernetes, backed by managed AWS data stores. This section covers everything needed to build, ship, observe, and recover the platform across staging and production environments.

What Operations Covers

Concern                 | Summary                                                                   | Guide
CI/CD                   | GitHub Actions + Nx affected builds, Docker image pipeline, ArgoCD GitOps | CI/CD Pipeline
Deployment architecture | Service inventory, data stores, environment topology, prerequisites       | Deployment Overview
Staging                 | k3s on EC2, Terraform provisioning, ArgoCD auto-sync from stage branch    | Deploy to Staging
Production              | EKS cluster, approval gates, release tags, cost breakdown                 | Deploy to Production
Observability           | Grafana + Loki + Prometheus + Tempo + Pyroscope, frontend RUM via Faro    | Monitoring & Observability
Rollback                | Service rollback, database recovery, infrastructure state restore         | Rollback Procedures
Infrastructure          | Terraform modules, Docker Compose topology, AWS resource layout           | Infrastructure

Environment Overview

GospeLib maintains two deployed environments. Staging mirrors production closely: both run the same Docker images, the same Kubernetes manifests (with different resource limits), the same database engines, and the same monitoring stack. The differences are the cluster type (k3s on EC2 versus managed EKS), instance sizes, and replica counts.

Staging (staging.gospelib.com) | Production (gospelib.com)
EC2 t3.medium running k3s      | EKS managed cluster
1 replica per service          | 2 replicas per service
RDS db.t3.micro                | RDS db.r6g.large
ElastiCache t3.micro           | ElastiCache r6g.large
~$2.50/mo (free tier)          | ~$213/mo

Deployment Flow

A typical change moves through the pipeline in this order:

  1. Push or PR -- CI runs lint, tests (JS/Python/Go in parallel), and build
  2. Merge to stage -- CD builds affected Docker images, pushes to ECR, updates Kustomize image tags
  3. ArgoCD syncs staging -- manifests applied to the k3s cluster automatically
  4. Bake period -- candidate build runs on staging for at least 24 hours
  5. Release tag -- release/v*.*.* triggers production CD with an approval gate
  6. ArgoCD syncs production -- EKS cluster updated, health checks polled
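
The routing and bake rules in the steps above can be sketched in a few lines of Python. This is an illustration only, not the project's actual CD logic: the function names are hypothetical, the refs are assumed to arrive in the full Git form (refs/heads/..., refs/tags/...), and the release pattern is assumed to be strictly numeric (the source gives only the glob release/v*.*.*).

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: release tags are numeric, e.g. release/v1.4.0.
RELEASE_TAG = re.compile(r"^refs/tags/release/v\d+\.\d+\.\d+$")

# Step 4: the candidate build stays on staging for at least 24 hours.
MIN_BAKE = timedelta(hours=24)


def deploy_target(git_ref: str) -> Optional[str]:
    """Return the environment a Git ref deploys to, or None for CI-only refs."""
    if git_ref == "refs/heads/stage":
        return "staging"
    if RELEASE_TAG.match(git_ref):
        return "production"  # still subject to the approval gate
    return None


def bake_complete(staged_at: datetime, now: datetime) -> bool:
    """True once the build has run on staging for the minimum bake period."""
    return now - staged_at >= MIN_BAKE


if __name__ == "__main__":
    staged = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(deploy_target("refs/heads/stage"))
    print(deploy_target("refs/tags/release/v1.4.0"))
    print(bake_complete(staged, staged + timedelta(hours=30)))
```

A check like bake_complete could run before a release tag is created, so a build that has not yet finished its staging bake is caught before it reaches the approval gate.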

For emergency fixes that cannot wait for the full cycle, see the hotfix procedure.

Observability Signals

Every service emits logs, metrics, traces, and profiles through a unified collection agent (Alloy) into the Grafana stack. The same tooling runs locally via pnpm infra:observability and in production at grafana.gospelib.com.

Signal   | Backend             | Local access
Logs     | Loki                | http://localhost:3100
Metrics  | Prometheus          | http://localhost:9090
Traces   | Tempo               | via Grafana Explore
Profiles | Pyroscope           | http://localhost:4040
Frontend | Loki + Tempo (Faro) | http://localhost:12347/collect
Errors   | Sentry              | per-plan dashboard
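
As a rough sketch of how the local endpoints above can be queried, the helpers below build request URLs for the standard Prometheus instant-query endpoint (/api/v1/query) and the Loki range-query endpoint (/loki/api/v1/query_range). The helper names and the example queries are assumptions for illustration; the label names your services actually emit may differ.

```python
from urllib.parse import urlencode

# Local endpoints taken from the table above.
PROMETHEUS = "http://localhost:9090"
LOKI = "http://localhost:3100"


def prometheus_query_url(promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{PROMETHEUS}/api/v1/query?{urlencode({'query': promql})}"


def loki_query_url(logql: str, limit: int = 100) -> str:
    """Build a range-query URL for the Loki HTTP API."""
    params = urlencode({"query": logql, "limit": limit})
    return f"{LOKI}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    # "up" is a built-in Prometheus metric; the Loki label is hypothetical.
    print(prometheus_query_url("up"))
    print(loki_query_url('{service="api"}', limit=10))
```

The resulting URLs can be fetched with curl or any HTTP client once the local stack is running via pnpm infra:observability.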

Recovery Playbook

When something goes wrong, the Rollback Procedures guide provides a decision matrix:

  • Bad code deploy (no DB change) -- kubectl rollout undo or ArgoCD rollback, recovery in under 2 minutes
  • Bad code + reversible migration -- roll back migration, then roll back code, under 10 minutes
  • Corrupted graph data -- restore FalkorDB from backup or re-ingest from corpus, 30-60 minutes
  • Infrastructure misconfiguration -- restore previous Terraform state and re-apply, under 15 minutes
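
The matrix above can also be expressed as a small lookup table, which is handy for runbook tooling or alert annotations. This is a sketch under assumptions: the enum and class names are hypothetical, and the time limits are the upper bounds quoted in the list, not measured values.

```python
from dataclasses import dataclass
from enum import Enum


class Incident(Enum):
    BAD_DEPLOY = "bad code deploy (no DB change)"
    BAD_DEPLOY_WITH_MIGRATION = "bad code + reversible migration"
    CORRUPTED_GRAPH = "corrupted graph data"
    INFRA_MISCONFIG = "infrastructure misconfiguration"


@dataclass(frozen=True)
class Recovery:
    action: str
    max_minutes: int  # upper bound of the recovery time quoted in the playbook


PLAYBOOK = {
    Incident.BAD_DEPLOY: Recovery(
        "kubectl rollout undo or ArgoCD rollback", 2),
    Incident.BAD_DEPLOY_WITH_MIGRATION: Recovery(
        "roll back migration, then roll back code", 10),
    Incident.CORRUPTED_GRAPH: Recovery(
        "restore FalkorDB from backup or re-ingest from corpus", 60),
    Incident.INFRA_MISCONFIG: Recovery(
        "restore previous Terraform state and re-apply", 15),
}


def recovery_for(incident: Incident) -> Recovery:
    """Look up the recommended action and time budget for an incident type."""
    return PLAYBOOK[incident]


if __name__ == "__main__":
    for incident in Incident:
        plan = recovery_for(incident)
        print(f"{incident.value}: {plan.action} (<= {plan.max_minutes} min)")
```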

Infrastructure as Code

All cloud resources are defined in infra/terraform/ with reusable modules for EKS, RDS, ElastiCache, S3, ECR, CloudFront, Route53, and Secrets Manager. Kubernetes manifests use Kustomize with a base + overlay pattern, and the local development stack runs entirely through Docker Compose. See Infrastructure for the full layout.
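
To give a feel for the base + overlay idea, the sketch below models it as a recursive dictionary merge: the overlay restates only the fields that differ, and everything else comes from the base. This is a deliberately simplified model, not how Kustomize is implemented (real strategic merge patches also handle lists, deletions, and more). The service name "api" and the function name are hypothetical; the replica counts follow the environment comparison earlier on this page.

```python
import copy


def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return a new dict where overlay values override base values, recursively."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them field by field.
            merged[key] = apply_overlay(merged[key], value)
        else:
            # Scalars, lists, and new keys replace the base value outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base manifest, shared by both environments.
BASE = {
    "kind": "Deployment",
    "metadata": {"name": "api"},
    "spec": {"replicas": 1},
}

# Production overlay: only the field that differs is restated.
PRODUCTION_OVERLAY = {"spec": {"replicas": 2}}

if __name__ == "__main__":
    print(apply_overlay(BASE, PRODUCTION_OVERLAY))
```

Keeping overlays this small is what lets staging and production share the same manifests while differing only in resource limits and replica counts.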