
Observability

GospeLib uses a full observability stack built on Grafana's open-source ecosystem. The same stack runs locally and in production; the goal is for local development to offer the same signals and tooling as grafana.gospelib.com.

Signals

| Signal | Backend | Collector | Retention (local) | Retention (prod) |
| --- | --- | --- | --- | --- |
| Logs | Loki | Alloy | 7 days | 90 days (S3) |
| Metrics | Prometheus | Alloy | 7 days | 15 days |
| Traces | Tempo | Alloy (OTLP) | local disk | local disk |
| Profiles | Pyroscope | Alloy (Go) / SDK (Python) | local disk | local disk |
| Frontend | Loki + Tempo | Alloy (Faro receiver) | 7 days | 90 days |
| Errors | Sentry | SDK (per-service) | per plan | |

Architecture

  ┌────────────┐
  │  Browser   │
  │ (Faro SDK) │
  └─────┬──────┘
        │ POST /collect :12347
┌───────┼───────────────────────────────────────────────────────────┐
│       │                  Docker Compose                           │
│       ▼                                                           │
│  ┌──────────┐     scrape /metrics       ┌────────────┐            │
│  │  Alloy   │◄──────────────────────────│  Services  │            │
│  │          │     scrape /debug/pprof   │    (Go)    │            │
│  │          │◄──────────────────────────│            │            │
│  │          │     docker log tailing    │            │            │
│  │          │◄──────────────────────────│            │            │
│  │          │     OTLP gRPC :4317       │            │            │
│  │          │◄──────────────────────────│            │            │
│  └────┬─────┘                           └────────────┘            │
│       │                                                           │
│       │  push                           ┌────────────┐            │
│       ├────────────────────────────────►│ Loki       │ :3100      │
│       ├────────────────────────────────►│ Prometheus │ :9090      │
│       ├────────────────────────────────►│ Tempo      │ (internal) │
│       └────────────────────────────────►│ Pyroscope  │ :4040      │
│                                         └─────┬──────┘            │
│                                               │                   │
│  ┌──────────┐    query datasources      ┌─────▼──────┐            │
│  │ Grafana  │◄─────────────────────────►│  Backends  │            │
│  │  :3000   │                           └────────────┘            │
│  └──────────┘                                                     │
│                                         ┌────────────┐            │
│                                         │  Services  │            │
│                                         │  (Python)  │            │
│                                         │  push ─────┼─► Pyroscope│
│                                         └────────────┘            │
└───────────────────────────────────────────────────────────────────┘

Alloy is the single collection agent. It tails Docker logs, scrapes Prometheus metrics and Go pprof endpoints, receives OTLP traces, receives browser telemetry via the built-in Faro receiver, and pushes everything to the local backends. Python services push profiles directly to Pyroscope via the SDK (Python's GIL makes pull-based CPU profiling impractical).

Local Setup

Start the stack

pnpm infra:observability

This starts Loki, Prometheus, Tempo, Pyroscope, Alloy, Grafana, and the FalkorDB Browser — all behind the observability Docker Compose profile. No environment variables are required; all backends default to the local containers.

Access

| Tool | URL | Credentials |
| --- | --- | --- |
| Grafana | http://localhost:3000 | admin / localdev |
| Prometheus | http://localhost:9090 | |
| Loki | http://localhost:3100 | |
| Pyroscope | http://localhost:4040 | |
| Alloy UI | http://localhost:12345 | |
| Faro | http://localhost:12347/collect | none (POST only) |
| FalkorDB | http://localhost:3004 | |
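
After starting the stack it can be handy to confirm that the main backends are answering before opening Grafana. The following is a minimal sketch, not part of the repository: it assumes the default ports from the table above and uses the standard readiness endpoints of Grafana (`/api/health`), Prometheus (`/-/ready`), and Loki (`/ready`). The names `CHECKS`, `is_up`, and `report` are hypothetical.

```python
import urllib.request

# Readiness endpoints on the default local ports (assumed unchanged).
CHECKS = {
    "Grafana": "http://localhost:3000/api/health",
    "Prometheus": "http://localhost:9090/-/ready",
    "Loki": "http://localhost:3100/ready",
}


def is_up(url: str, timeout: float = 3.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False


def report() -> dict:
    """Probe every endpoint in CHECKS and return {tool name: reachable}."""
    return {name: is_up(url) for name, url in CHECKS.items()}
```

Calling `report()` right after `pnpm infra:observability` may show `False` for a few seconds while the containers finish starting.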

Stop the stack

pnpm dev:stack:stop

Reset (wipe data volumes)

docker compose -f infra/docker/compose.yml -f infra/docker/compose.dev.yml \
--profile observability down -v

Grafana Datasources

Four datasources are auto-provisioned via infra/grafana/provisioning/datasources/datasources.yaml:

| Datasource | Type | Default URL |
| --- | --- | --- |
| Loki | loki | http://loki:3100 |
| Prometheus | prometheus | http://prometheus:9090 |
| Tempo | tempo | http://tempo:3200 |
| Pyroscope | grafana-pyroscope-datasource | http://pyroscope:4040 |

Loki → Tempo linking is configured: log lines containing a trace_id field render a clickable link to the corresponding trace in Tempo. Tempo → Loki linking is also configured for trace-to-log correlation.
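
For the log-to-trace link to appear, a service has to write trace_id as a field in each structured log line. Below is a minimal sketch of how a Python service might do that with the standard library only. The `JsonFormatter` and `build_logger` helpers are hypothetical, and the field names `level`, `latency_ms`, and `service` are assumptions chosen to match the LogQL examples in the Querying section rather than the services' actual log schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "service": getattr(record, "service", "unknown"),
        }
        # Optional fields are emitted only when the caller supplied them.
        for key in ("trace_id", "latency_ms"):
            value = getattr(record, key, None)
            if value is not None:
                payload[key] = value
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Log JSON to stdout, where a Docker log tailer can pick it up."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A request handler would then pass the active trace ID through `extra`, for example `build_logger("content").error("lookup failed", extra={"service": "content", "trace_id": trace_id, "latency_ms": 812})`.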

Dashboards

Pre-built dashboards are stored in infra/grafana/dashboards/ and auto-loaded by Grafana:

| Dashboard | Key panels |
| --- | --- |
| Service Overview | Request rate, error rate, latency P50/P95/P99 per service |
| FalkorDB | Query latency, memory usage, key count |
| PostgreSQL | Connection pool, query latency, replication lag |
| Redis / ElastiCache | Hit rate, memory, evictions |
| Typesense | Search latency, index size, request rate |
| Kubernetes | Pod CPU/memory, restart count, node health |
| Ingest Pipeline | Job status, nodes created, processing time |
| AI Service | Token usage, response latency, model breakdown |

Querying

Logs (Loki)

Use the Explore tab with the Loki datasource:

# All errors from the gateway in the last hour
{service="gateway"} |= "error" | json | level="error"

# Slow requests (> 500ms)
{service="content"} | json | latency_ms > 500

# Requests for a specific passage
{service="content"} |= "gen.1.1"

# Logs from a specific environment
{env="development"} | json
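
The same queries can be run outside Grafana through Loki's HTTP API, which is useful in scripts and tests. This is a sketch under the assumption that Loki is reachable on the local default port; `build_query_url` and `fetch_lines` are hypothetical helper names, while `/loki/api/v1/query_range` and its `query`, `start`, `end`, and `limit` parameters are part of Loki's API.

```python
import json
import time
import urllib.parse
import urllib.request

LOKI_URL = "http://localhost:3100"  # local Loki, as listed in the Access table


def build_query_url(logql: str, since_seconds: int = 3600, limit: int = 100, now=None) -> str:
    """Build a query_range URL covering the last `since_seconds` seconds."""
    end_ns = int(time.time() if now is None else now) * 10**9
    start_ns = end_ns - since_seconds * 10**9
    params = {
        "query": logql,
        # Loki accepts Unix timestamps in nanoseconds for start and end.
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": str(limit),
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urllib.parse.urlencode(params)}"


def fetch_lines(logql: str) -> list:
    """Run the query and return the raw log lines from every matching stream."""
    with urllib.request.urlopen(build_query_url(logql), timeout=10) as resp:
        body = json.load(resp)
    # For log queries, data.result is a list of streams whose "values"
    # entries are [timestamp, line] pairs.
    return [line for stream in body["data"]["result"] for _ts, line in stream["values"]]
```

For example, `fetch_lines('{service="gateway"} |= "error" | json | level="error"')` would return the gateway errors from the last hour, assuming the stack is running.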

Traces (Tempo)

Search by service name, trace ID, or duration. Trace spans show the full request lifecycle across services — gateway → content → FalkorDB.

Profiles (Pyroscope)

Use the Explore tab with the Pyroscope datasource. Select a service name and profile type:

  • process_cpu — where CPU time is spent
  • allocs — heap allocation hotspots
  • goroutine — goroutine counts (Go services)
  • mutex / block — lock contention (Go services)

Python services report CPU and wall-time profiles via the Pyroscope SDK.

Frontend Observability (Faro)

The web app (apps/web) is instrumented with Grafana Faro, providing:

  • Error capture — JavaScript exceptions with stack traces, console errors
  • Web Vitals — Core Web Vitals (LCP, FID, CLS) and resource timings
  • Session tracking — persistent session IDs correlated with backend traces
  • Browser tracing — frontend spans connected to backend traces via W3C traceparent
  • Session replay — DOM recording that lets you play back exactly what the user saw when an error occurred
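
Browser-to-backend correlation relies on the W3C `traceparent` header, which carries the trace ID that later shows up in Tempo. As an illustration of what travels in that header, here is a small sketch of a parser for the version `00` format (`version-traceid-parentid-flags`). The `TraceContext` type and `parse_traceparent` function are hypothetical; real services would normally let their tracing SDK handle this.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header value; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

The `trace_id` returned here is the value to search for in Tempo or to filter on in Loki when following a frontend request into the backend.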

How it works

The Faro Web SDK runs in the browser and POSTs telemetry to Alloy's built-in faro.receiver on port 12347. Alloy routes:

  • Frontend logs (errors, console output, web vitals) → Loki
  • Frontend traces (navigation, fetch, user interactions) → Tempo

Source maps are automatically downloaded by Alloy from the origin server, so stack traces in Grafana show original TypeScript source locations.

Configuration

Faro is enabled by setting NEXT_PUBLIC_FARO_COLLECTOR_URL. In local dev, the Docker Compose stack defaults this to http://localhost:12347/collect. In production, point it at the central Alloy/Faro endpoint.

To disable Faro (e.g., for performance testing), leave NEXT_PUBLIC_FARO_COLLECTOR_URL empty.

Viewing session replays

In Grafana, use Explore → Loki and filter for {service_name="gospelib-web"}. Frontend errors include session IDs and metadata that let you correlate with the user's session. The Grafana Frontend Observability plugin (when installed) provides a dedicated session browser with replay playback.

Alert Rules

Configured in infra/k8s/base/monitoring/prometheus-alerts.yaml:

| Alert | Condition | Severity |
| --- | --- | --- |
| Service down | Health check fails for > 2 minutes | Critical |
| Error rate spike | HTTP 5xx rate > 5% for 5 minutes | Critical |
| High latency | P99 > 2s for 10 minutes | Warning |
| Pod restart loop | > 3 restarts in 10 minutes | Critical |
| DB connection pool exhausted | Active connections > 80% max | Warning |
| Disk usage high | > 80% on any PV | Warning |
| FalkorDB memory pressure | Used memory > 80% limit | Warning |
| Certificate expiry | TLS cert expires within 14 days | Warning |
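
Conditions of the form "above X for N minutes" mean the threshold must be breached continuously for the whole window; a single dip below the threshold restarts the clock. The sketch below imitates that behaviour on a list of samples to make the semantics concrete. It is an approximation for illustration, not how Prometheus evaluates rules internally, and the `fires` function and sample data are hypothetical.

```python
from typing import Iterable, Tuple


def fires(samples: Iterable[Tuple[float, float]], threshold: float, hold_seconds: float) -> bool:
    """Return True once the value has stayed above `threshold` for `hold_seconds`.

    `samples` is an iterable of (unix_timestamp, value) pairs in time order.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # any recovery resets the window
    return False


# Error rate sampled once a minute: 6% sustained for five minutes.
sustained = [(t, 0.06) for t in range(0, 301, 60)]
# Same series with one healthy sample at t=180.
interrupted = [(t, 0.01 if t == 180 else 0.06) for t in range(0, 301, 60)]
```

With the "Error rate spike" thresholds (5% for 5 minutes), `fires(sustained, 0.05, 300)` is true while `fires(interrupted, 0.05, 300)` is false.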

Multi-Environment Architecture

The production deployment at grafana.gospelib.com uses a single-backend, multi-env model:

  • One Loki, one Prometheus, one Tempo, one Pyroscope — shared across staging and production
  • Each environment stamps an env label on all telemetry (staging, production)
  • Grafana dashboards use a $env template variable to switch between environments
  • Local dev stays self-contained by default — pnpm infra:observability runs everything locally

To opt in to sending local telemetry to the central stack (e.g., reproducing a bug that needs team visibility), override the backend URLs in .env.local:

GOSPELIB_LOKI_PUSH_URL=https://<central>/loki/api/v1/push
GOSPELIB_MIMIR_PUSH_URL=https://<central>/api/v1/write
GOSPELIB_TEMPO_OTLP_ENDPOINT=<central>:443
GOSPELIB_PYROSCOPE_URL=https://<central>
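
The override model amounts to "use the variable if set, otherwise fall back to the local container". A short sketch of that resolution logic follows, mainly as a way to check which backends a given shell would target. The default values in `LOCAL_DEFAULTS` are assumptions based on the datasource hostnames and standard ports, not values read from the repository, and `resolve_backends` and `overridden` are hypothetical names.

```python
import os
from typing import Dict, List, Mapping

# Assumed local defaults; the real ones live in the Compose and Alloy config.
LOCAL_DEFAULTS = {
    "GOSPELIB_LOKI_PUSH_URL": "http://loki:3100/loki/api/v1/push",
    "GOSPELIB_MIMIR_PUSH_URL": "http://prometheus:9090/api/v1/write",
    "GOSPELIB_TEMPO_OTLP_ENDPOINT": "tempo:4317",
    "GOSPELIB_PYROSCOPE_URL": "http://pyroscope:4040",
}


def resolve_backends(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the effective backend URL for each variable.

    Unset and empty variables both fall back to the local default.
    """
    return {name: env.get(name) or default for name, default in LOCAL_DEFAULTS.items()}


def overridden(env: Mapping[str, str] = os.environ) -> List[str]:
    """List the variables that currently point away from the local stack."""
    return sorted(name for name in LOCAL_DEFAULTS if env.get(name))
```

An empty list from `overridden()` means telemetry stays on the local stack.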

Central Stack Deployment

The central observability stack (grafana.gospelib.com) is deployed separately from application environments. It runs in the production Kubernetes cluster under the monitoring namespace:

| Component | Helm Chart | Notes |
| --- | --- | --- |
| Grafana + Prometheus | kube-prometheus-stack | Includes node-exporter, kube-state-metrics |
| Loki | grafana/loki-stack | S3-backed storage, Promtail DaemonSet |
| Tempo | grafana/tempo | OTLP gRPC receiver |
| Pyroscope | grafana/pyroscope | Pull (Go) + push (Python) ingestion |

Ingress is configured at grafana.gospelib.com with cert-manager TLS via letsencrypt-prod. See infra/k8s/base/monitoring/ for the full Kubernetes manifests.

Configuration Files

| File | Purpose |
| --- | --- |
| infra/docker/compose.dev.yml | Local container definitions |
| infra/alloy/config.alloy | Alloy collection pipeline |
| infra/grafana/provisioning/datasources/ | Grafana datasource auto-provisioning |
| infra/grafana/provisioning/dashboards/ | Grafana dashboard auto-loading config |
| infra/grafana/dashboards/ | Dashboard JSON files |
| infra/loki/loki.yaml | Loki storage and schema config |
| infra/tempo/tempo.yaml | Tempo storage config |
| infra/prometheus/prometheus.yml | Prometheus config (scraping via Alloy) |
| infra/k8s/base/monitoring/ | Kubernetes monitoring manifests |
| apps/web/lib/faro.ts | Faro SDK initialization |
| .env.example | Default observability env vars |