Ingest Service
The Ingest service is a Python CLI tool that reads all GospeLib JSON corpus files, validates them with Pydantic models, and writes a fully connected knowledge graph to FalkorDB. It is the sole authoritative writer to the graph database.
Quick Reference
| Property | Value |
|---|---|
| Language | Python 3.12 |
| Framework | Click CLI |
| Package | gospelib_ingest |
| Entry point | gospelib-ingest CLI command |
| Target DB | FalkorDB (port 6379) |
| Type | CLI tool (no HTTP port) |
| Deployment | K8s Job (full rebuild) / CronJob (incremental) |
Responsibilities
- Validate: Every JSON source file is validated with Pydantic before any database write
- Transform: Convert corpus JSON into FalkorDB graph nodes and edges
- Write: Produce a query-ready FalkorDB graph (idempotent via MERGE)
- Connect: Generate all edges derivable from source data (cross-references, topics, lexicon links)
- Report: Emit a structured run report at completion
Running Locally
cd services/ingest
uv sync
# Full ingest
uv run gospelib-ingest run
# Dry run (validate without writing)
uv run gospelib-ingest run --dry-run
# Show help
uv run gospelib-ingest --help
The service expects FalkorDB to be running on port 6379. Start the local infrastructure first:
pnpm infra:up
14-Stage Pipeline
The ingest pipeline executes stages sequentially in dependency order:
| Stage | Pipeline | Description |
|---|---|---|
| 0 | Schema | Create graph indices and constraints |
| 1 | Validation | Pass-through (reserved for future use) |
| 2 | Lexicon | Hebrew/Greek lexicon entries → Word nodes + DERIVES_FROM/RELATED_TO |
| 3 | Scripture Text | Canonical text + interlinear → Passage, Witness, WordAlignment nodes |
| 3.5 | Cross-References | Standalone cross-reference files → CROSS_REF edges |
| 4 | Reference Data | TG, BD, Index → TGEntry, BDEntry, IndexTopic nodes + CITES edges |
| 4.5 | Proper Names | ProperName nodes and MENTIONS edges |
| 4.6 | Versification | VersificationScheme nodes and MAPS_TO edges |
| 4.7 | Theographic | Event, PeopleGroup nodes and relationship edges |
| 5 | Commentary | Clarke, BYU commentaries → Commentary nodes + ANNOTATES edges |
| 6 | Pending Resolution | Promote PendingPassage nodes to resolved CROSS_REF edges |
| 7 | Density | Materialize xrefCount/commentaryCount/entityMentionCount on verses |
Additional self-registering stages for church content (talks, curriculum, books, periodicals, hymns, proclamations, publications, scholarly works) and typed entity projection (Person, Place, Event, PeopleGroup) are loaded via the stage registry at runtime.
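The numeric stage keys in the table, including fractional ones such as 3.5, suggest that stages can be ordered by a sortable key regardless of when they register. The following is a minimal sketch of that idea, not the service's actual registry: `register_stage`, `RegisteredStage`, `stages_in_order`, and the stage names used here are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RegisteredStage:
    # Hypothetical record kept by the registry for each stage.
    order: float
    name: str
    runner: Callable[[], str]


_REGISTRY: dict[str, RegisteredStage] = {}


def register_stage(order: float, name: str):
    """Decorator that records a stage under a numeric order key."""
    def decorator(func: Callable[[], str]) -> Callable[[], str]:
        if name in _REGISTRY:
            raise ValueError(f"stage {name!r} is already registered")
        _REGISTRY[name] = RegisteredStage(order, name, func)
        return func
    return decorator


def stages_in_order() -> list[RegisteredStage]:
    """Return registered stages sorted by order key (3.5 sorts between 3 and 4)."""
    return sorted(_REGISTRY.values(), key=lambda stage: stage.order)


# Registration order is deliberately shuffled: the order key decides, not import order.
@register_stage(4, "reference_data")
def reference_data() -> str:
    return "reference data written"


@register_stage(3.5, "cross_references")
def cross_references() -> str:
    return "cross-references written"


@register_stage(3, "scripture_text")
def scripture_text() -> str:
    return "scripture text written"


if __name__ == "__main__":
    for stage in stages_in_order():
        print(stage.order, stage.name, stage.runner())
```

Keying on a float keeps room for inserting a stage between two existing ones without renumbering, which matches how 3.5 and 4.5 to 4.7 appear in the table.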
Why Sequential?
FalkorDB is built on Redis, which is single-threaded. Concurrent writes don't improve throughput; they serialize at the server. The pipeline uses:
- ThreadPoolExecutor(max_workers=4) for parallel file I/O and Pydantic validation
- Sequential database writes within each stage for MERGE consistency
- UNWIND batch writes to minimize network round-trips
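The split between parallel reads and sequential writes can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: `load_file`, `chunked`, `ingest`, and `demo` are hypothetical names, plain `json.loads` stands in for Pydantic validation, and `write_batch` is any callable that would send one batch to the database.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterator


def load_file(path: Path) -> dict:
    """Read and parse one JSON file (runs on a worker thread)."""
    return json.loads(path.read_text(encoding="utf-8"))


def chunked(rows: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def ingest(paths: list[Path], write_batch: Callable[[list[dict]], None],
           batch_size: int = 500) -> int:
    """Parse files in parallel, then write batches one at a time."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() preserves input order, so rows line up with `paths`.
        rows = list(pool.map(load_file, paths))
    batches = 0
    for batch in chunked(rows, batch_size):
        write_batch(batch)  # sequential: one call per batch, never concurrent
        batches += 1
    return batches


def demo() -> list[int]:
    """Write five small files, ingest with batch_size=2, return the batch sizes."""
    sizes: list[int] = []
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(5):
            path = Path(tmp) / f"verse-{i}.json"
            path.write_text(json.dumps({"id": f"v{i}"}), encoding="utf-8")
            paths.append(path)
        ingest(paths, lambda batch: sizes.append(len(batch)), batch_size=2)
    return sizes
```

Threads help here because file reads are I/O-bound, while the write loop stays single-threaded so batches reach the server in a predictable order.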
Architecture
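# services/ingest/src/gospelib_ingest/pipelines/base.py
from abc import ABC, abstractmethod
from pathlib import Path

class BasePipeline(ABC):
    @abstractmethod
    def run(self, graph_client, data_dir: Path, dry_run: bool = False) -> "StageReport":
        # StageReport is defined elsewhere in the package (import not shown here).
        ...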
Each stage is a concrete implementation of BasePipeline. The orchestrator runs all stages in order, collecting reports. Additional stages self-register via the stage registry decorator.
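To make the stage contract concrete, here is a small sketch of one stage and an orchestrator loop. Only the `BasePipeline.run` signature comes from this page; the `StageReport` fields, `RecordingClient`, `SchemaPipeline`, `run_all`, and the index statement are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StageReport:
    # Hypothetical fields; the real StageReport is not shown on this page.
    stage: str
    nodes_written: int = 0
    dry_run: bool = False


class BasePipeline(ABC):
    @abstractmethod
    def run(self, graph_client, data_dir: Path, dry_run: bool = False) -> StageReport:
        ...


class RecordingClient:
    """Stand-in for a graph client; it only records the queries it receives."""

    def __init__(self) -> None:
        self.queries: list[str] = []

    def query(self, cypher: str) -> None:
        self.queries.append(cypher)


class SchemaPipeline(BasePipeline):
    """Illustrative stage: issues one index statement unless dry_run is set."""

    def run(self, graph_client, data_dir: Path, dry_run: bool = False) -> StageReport:
        if not dry_run:
            graph_client.query("CREATE INDEX FOR (p:Passage) ON (p.id)")
        return StageReport(stage="schema", dry_run=dry_run)


def run_all(stages: list[BasePipeline], graph_client, data_dir: Path,
            dry_run: bool = False) -> list[StageReport]:
    """Run stages in list order and collect one report per stage."""
    return [stage.run(graph_client, data_dir, dry_run=dry_run) for stage in stages]


client = RecordingClient()
reports = run_all([SchemaPipeline()], client, Path("./data"), dry_run=True)
```

Passing `dry_run` down to every stage keeps the decision to skip writes inside the stage itself, so the orchestrator stays a plain loop.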
Key Design Decisions
- Idempotent via MERGE: All graph writes use Cypher MERGE (keyed on the id property), not CREATE. Re-running ingest never creates duplicates.
- Pydantic validation first: Every source file is fully validated before any write occurs. If validation fails, no partial writes happen.
- UNWIND batch writes: Nodes and edges are batched into UNWIND queries for efficient bulk writing.
- Pending reference resolution: Cross-reference targets that don't exist during early stages are stored as PendingPassage nodes and resolved in Stage 6 (Pending Resolution).
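The MERGE and UNWIND decisions above combine into a single query shape: one parameterized statement that upserts every row in a batch. The sketch below shows one plausible form of that query. `build_merge_query` and `InMemoryGraph` are hypothetical, and the use of `SET n += row` to copy properties is an assumption rather than something this page confirms.

```python
def build_merge_query(label: str) -> str:
    """Build a batched upsert: one MERGE per row in the $rows parameter."""
    if not label.isidentifier():
        # Labels cannot be passed as query parameters, so check before interpolating.
        raise ValueError(f"unsafe label: {label!r}")
    return (
        "UNWIND $rows AS row "
        f"MERGE (n:{label} {{id: row.id}}) "
        "SET n += row"
    )


class InMemoryGraph:
    """Toy stand-in that mimics MERGE-on-id semantics, for illustration only."""

    def __init__(self) -> None:
        self.nodes: dict[tuple[str, str], dict] = {}

    def merge_rows(self, label: str, rows: list[dict]) -> None:
        for row in rows:
            # The same (label, id) key updates the existing node instead of adding one.
            self.nodes.setdefault((label, row["id"]), {}).update(row)


graph = InMemoryGraph()
batch = [{"id": "gen-1-1", "text": "In the beginning"}, {"id": "gen-1-2", "text": "And the earth"}]
graph.merge_rows("Passage", batch)
graph.merge_rows("Passage", batch)  # a re-run leaves the node count unchanged
```

Because the match key is the id property alone, a second run with changed text would update the existing node in place rather than create a sibling.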
Environment Variables
| Variable | Default | Description |
|---|---|---|
| GOSPELIB_INGEST_FALKORDB_URL | redis://localhost:6379 | FalkorDB connection URL |
| GOSPELIB_INGEST_GRAPH_NAME | gospelib | FalkorDB graph name |
| GOSPELIB_INGEST_DATA_DIR | ./data | Path to corpus data directory |
| GOSPELIB_INGEST_BATCH_SIZE | 500 | UNWIND batch size |
| GOSPELIB_INGEST_LOG_LEVEL | INFO | Logging level |
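One way to read these variables with their documented defaults is sketched below using only the standard library. `IngestSettings` and `load_settings` are hypothetical names, and the service itself may load configuration differently (for example through a Pydantic settings class); only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping

PREFIX = "GOSPELIB_INGEST_"


@dataclass(frozen=True)
class IngestSettings:
    # Defaults mirror the table above.
    falkordb_url: str = "redis://localhost:6379"
    graph_name: str = "gospelib"
    data_dir: Path = Path("./data")
    batch_size: int = 500
    log_level: str = "INFO"


def load_settings(env: Mapping[str, str] = os.environ) -> IngestSettings:
    """Build settings from `env`, falling back to the documented defaults."""
    defaults = IngestSettings()
    batch_size = int(env.get(PREFIX + "BATCH_SIZE", defaults.batch_size))
    if batch_size <= 0:
        raise ValueError("GOSPELIB_INGEST_BATCH_SIZE must be a positive integer")
    return IngestSettings(
        falkordb_url=env.get(PREFIX + "FALKORDB_URL", defaults.falkordb_url),
        graph_name=env.get(PREFIX + "GRAPH_NAME", defaults.graph_name),
        data_dir=Path(env.get(PREFIX + "DATA_DIR", str(defaults.data_dir))),
        batch_size=batch_size,
        log_level=env.get(PREFIX + "LOG_LEVEL", defaults.log_level).upper(),
    )
```

Accepting the environment as a mapping makes the loader easy to exercise with a plain dict, for example `load_settings({"GOSPELIB_INGEST_BATCH_SIZE": "50"})`.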
CLI Commands
# Full pipeline run
gospelib-ingest run
# Dry run — validate only, no writes
gospelib-ingest run --dry-run
# Run a specific stage
gospelib-ingest run --stage lexicon
# Custom data directory
gospelib-ingest run --data-dir /path/to/corpus
# Verbose output
gospelib-ingest run --verbose
Docker
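FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install uv
COPY pyproject.toml ./
# Copy sources before syncing so uv can install the project and its console script
COPY src/ ./src/
RUN uv sync --no-dev
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /app /app
# Assumes uv's default project environment at /app/.venv; put its bin directory on PATH
ENV PATH="/app/.venv/bin:$PATH"
ENTRYPOINT ["gospelib-ingest"]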
Deep Dive
For detailed pipeline internals, Cypher patterns, and troubleshooting:
Related Pages
- Services Overview — all services at a glance
- Content Service — consumes the graph produced by ingest
- Architecture > Data > FalkorDB — graph model and node types
- Data Sources — corpus sources consumed by the ingest pipeline