Ingest Service

The Ingest service is a Python CLI tool that reads all GospeLib JSON corpus files, validates them with Pydantic models, and writes a fully connected knowledge graph to FalkorDB. It is the sole authoritative writer to the graph database.

Quick Reference

Property	Value
Language	Python 3.12
Framework	Click CLI
Package	`gospelib_ingest`
Entry point	`gospelib-ingest` CLI command
Target DB	FalkorDB (port 6379)
Type	CLI tool (no HTTP port)
Deployment	K8s Job (full rebuild) / CronJob (incremental)

Responsibilities

Validate — Every JSON source file is validated with Pydantic before any database write
Transform — Convert corpus JSON into FalkorDB graph nodes and edges
Write — Produce a query-ready FalkorDB graph (idempotent via MERGE)
Connect — Generate all edges derivable from source data (cross-references, topics, lexicon links)
Report — Emit a structured run report at completion

Running Locally

cd services/ingest
uv sync

# Full ingest
uv run gospelib-ingest run

# Dry run (validate without writing)
uv run gospelib-ingest run --dry-run

# Show help
uv run gospelib-ingest --help

The service expects FalkorDB to be running on port 6379:

pnpm infra:up

14-Stage Pipeline

The ingest pipeline executes stages sequentially in dependency order:

Stage	Pipeline	Description
0	Schema	Create graph indices and constraints
1	Validation	Pass-through (reserved for future use)
2	Lexicon	Hebrew/Greek lexicon entries → `Word` nodes + `DERIVES_FROM`/`RELATED_TO`
3	Scripture Text	Canonical text + interlinear → `Passage`, `Witness`, `WordAlignment` nodes
3.5	Cross-References	Standalone cross-reference files → `CROSS_REF` edges
4	Reference Data	TG, BD, Index → `TGEntry`, `BDEntry`, `IndexTopic` nodes + `CITES` edges
4.5	Proper Names	`ProperName` nodes and `MENTIONS` edges
4.6	Versification	`VersificationScheme` nodes and `MAPS_TO` edges
4.7	Theographic	`Event`, `PeopleGroup` nodes and relationship edges
5	Commentary	Clarke, BYU commentaries → `Commentary` nodes + `ANNOTATES` edges
6	Pending Resolution	Promote `PendingPassage` nodes to resolved `CROSS_REF` edges
7	Density	Materialize `xrefCount`/`commentaryCount`/`entityMentionCount` on verses

Additional self-registering stages for church content (talks, curriculum, books, periodicals, hymns, proclamations, publications, scholarly works) and typed entity projection (Person, Place, Event, PeopleGroup) are loaded via the stage registry at runtime.

Why Sequential?

FalkorDB is built on Redis, which is single-threaded. Concurrent writes don't improve throughput — they serialize at the server. The pipeline uses:

ThreadPoolExecutor(max_workers=4) for parallel file I/O and Pydantic validation
Sequential database writes within each stage for MERGE consistency
UNWIND batch writes to minimize network round-trips
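
The split between parallel validation and sequential writes can be sketched as follows. This is a minimal illustration using only the standard library: `load_and_validate`, `run_stage`, and the `write` callback are hypothetical names, and the `id` check stands in for the Pydantic models the service actually uses.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def load_and_validate(path: Path) -> dict:
    """Read one JSON file and run a stand-in validation check.

    Hypothetical: the real service validates with Pydantic models.
    """
    record = json.loads(path.read_text(encoding="utf-8"))
    if "id" not in record:
        raise ValueError(f"{path.name}: missing 'id'")
    return record


def run_stage(paths: list[Path], write: Callable[[dict], None]) -> int:
    """Validate files in parallel, then write records one at a time."""
    # Parallel phase: file I/O and validation across 4 worker threads.
    # list() consumes every result, so a validation error is raised here,
    # before the first write happens.
    with ThreadPoolExecutor(max_workers=4) as pool:
        records = list(pool.map(load_and_validate, paths))

    # Sequential phase: a single writer, in input order.
    for record in records:
        write(record)
    return len(records)
```

Because `pool.map` returns results in input order, the write phase sees records in the same order as `paths`, regardless of which worker finished first.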

Architecture

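# services/ingest/src/gospelib_ingest/pipelines/base.py
from abc import ABC, abstractmethod
from pathlib import Path


class BasePipeline(ABC):
    @abstractmethod
    def run(self, graph_client, data_dir: Path, dry_run: bool = False) -> "StageReport":
        # StageReport is defined elsewhere in the package
        ...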

Each stage is a concrete implementation of BasePipeline. The orchestrator runs all stages in order, collecting reports. Additional stages self-register via the stage registry decorator.
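
One way a self-registering stage registry and an ordered orchestrator could fit together is sketched below. Everything here is an assumption for illustration: `STAGE_REGISTRY`, `register_stage`, `run_all`, and the `StageReport` fields are hypothetical, and plain functions stand in for `BasePipeline` subclasses to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StageReport:
    """Hypothetical report shape; the real StageReport fields are not shown in this page."""
    stage: str
    nodes_written: int = 0
    dry_run: bool = False


# Hypothetical registry: stage order number -> (stage name, stage callable).
STAGE_REGISTRY: dict[float, tuple[str, Callable[[bool], StageReport]]] = {}


def register_stage(order: float, name: str):
    """Decorator that adds a stage to the registry when its module is imported."""
    def decorator(func):
        if order in STAGE_REGISTRY:
            raise ValueError(f"stage order {order} already registered")
        STAGE_REGISTRY[order] = (name, func)
        return func
    return decorator


# Registered out of order on purpose; the orchestrator sorts by order number.
@register_stage(2, "lexicon")
def lexicon_stage(dry_run: bool) -> StageReport:
    return StageReport("lexicon", nodes_written=0 if dry_run else 10, dry_run=dry_run)


@register_stage(0, "schema")
def schema_stage(dry_run: bool) -> StageReport:
    return StageReport("schema", dry_run=dry_run)


def run_all(dry_run: bool = False) -> list[StageReport]:
    """Run registered stages in ascending order and collect their reports."""
    ordered = sorted(STAGE_REGISTRY.items(), key=lambda item: item[0])
    return [func(dry_run) for _, (_, func) in ordered]
```

Keying the registry on a numeric order allows fractional stage numbers such as 3.5 or 4.6 to slot between existing stages without renumbering.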

Key Design Decisions

Idempotent via MERGE — All graph writes use Cypher MERGE (keyed on id property), not CREATE. Re-running ingest never creates duplicates.
Pydantic validation first — Every source file is fully validated before any write occurs. If validation fails, no partial writes happen.
UNWIND batch writes — Nodes and edges are batched into UNWIND queries for efficient bulk writing.
Pending reference resolution — Cross-reference targets that don't exist during early stages are stored as PendingPassage nodes and resolved in Stage 6 (Pending Resolution).
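
The MERGE and UNWIND decisions above combine naturally into one query per batch. The sketch below only builds the query and parameter pairs; it does not talk to a database. The `Word` label comes from the stage table, while `chunked`, `build_batches`, and the `SET w += row` property handling are assumptions made for illustration.

```python
from typing import Iterator


def chunked(rows: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# One query per batch: UNWIND expands the $rows parameter into one row per
# item, and MERGE matches on id, so a re-run updates the existing node
# instead of creating a duplicate.
# Assumption: copying all row fields with `SET w += row` is illustrative only.
MERGE_WORDS = (
    "UNWIND $rows AS row "
    "MERGE (w:Word {id: row.id}) "
    "SET w += row"
)


def build_batches(rows: list[dict], batch_size: int = 500) -> list[tuple[str, dict]]:
    """Return (query, params) pairs, one per UNWIND batch."""
    return [(MERGE_WORDS, {"rows": batch}) for batch in chunked(rows, batch_size)]
```

With the default batch size of 500, writing 1,200 rows takes three round-trips (500, 500, and 200 rows) instead of 1,200 single-node queries. Each pair would then be passed to whatever query method the graph client exposes.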

Environment Variables

Variable	Default	Description
`GOSPELIB_INGEST_FALKORDB_URL`	`redis://localhost:6379`	FalkorDB connection URL
`GOSPELIB_INGEST_GRAPH_NAME`	`gospelib`	FalkorDB graph name
`GOSPELIB_INGEST_DATA_DIR`	`./data`	Path to corpus data directory
`GOSPELIB_INGEST_BATCH_SIZE`	`500`	UNWIND batch size
`GOSPELIB_INGEST_LOG_LEVEL`	`INFO`	Logging level
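
As a rough illustration of how these variables and defaults might be resolved, the following standard-library sketch reads each one with its documented fallback. `IngestSettings` and `load_settings` are hypothetical names; the service's actual settings loader is not shown in this page and may work differently.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping

PREFIX = "GOSPELIB_INGEST_"


@dataclass(frozen=True)
class IngestSettings:
    """Hypothetical settings container; fields mirror the table above."""
    falkordb_url: str
    graph_name: str
    data_dir: Path
    batch_size: int
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> IngestSettings:
    """Read each variable from `env`, falling back to the documented default."""
    batch_size = int(env.get(PREFIX + "BATCH_SIZE", "500"))
    if batch_size < 1:
        raise ValueError("GOSPELIB_INGEST_BATCH_SIZE must be a positive integer")
    return IngestSettings(
        falkordb_url=env.get(PREFIX + "FALKORDB_URL", "redis://localhost:6379"),
        graph_name=env.get(PREFIX + "GRAPH_NAME", "gospelib"),
        data_dir=Path(env.get(PREFIX + "DATA_DIR", "./data")),
        batch_size=batch_size,
        log_level=env.get(PREFIX + "LOG_LEVEL", "INFO").upper(),
    )
```

Passing the environment as a mapping keeps the loader easy to exercise in tests, for example `load_settings({"GOSPELIB_INGEST_BATCH_SIZE": "100"})`.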

CLI Commands

# Full pipeline run
gospelib-ingest run

# Dry run — validate only, no writes
gospelib-ingest run --dry-run

# Run a specific stage
gospelib-ingest run --stage lexicon

# Custom data directory
gospelib-ingest run --data-dir /path/to/corpus

# Verbose output
gospelib-ingest run --verbose

Docker

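FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install uv
# Install dependencies first so this layer is cached across source changes
COPY pyproject.toml ./
RUN uv sync --no-dev --no-install-project
# Then install the project itself into /app/.venv
COPY src/ ./src/
RUN uv sync --no-dev

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /app /app
# The console script lives in the project virtualenv, so put it on PATH
ENV PATH="/app/.venv/bin:$PATH"
ENTRYPOINT ["gospelib-ingest"]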

Deep Dive

For detailed pipeline internals, Cypher patterns, and troubleshooting:

Related Pages

Services Overview — all services at a glance
Content Service — consumes the graph produced by ingest
Architecture > Data > FalkorDB — graph model and node types
Data Sources — corpus sources consumed by the ingest pipeline
